Softpanorama

Home Switchboard Unix Administration Red Hat TCP/IP Networks Neoliberalism Toxic Managers
May the source be with you, but remember the KISS principle ;-)
Bigger doesn't imply better. Bigger often is a sign of obesity, of lost control, of overcomplexity, of cancerous cells

Use grep and extended regular expressions to analyze text files

News Red Hat Certification Program Understanding and using essential tools Access a shell prompt and issue commands with correct syntax Finding Help Managing files in RHEL Working with hard and soft links Working with archives and compressed files Using the Midnight Commander as file manager
Text files processing Using redirection and pipes Use grep and extended regular expressions to analyze text files Connecting to the server via ssh, using multiple consoles and screen command Introduction to Unix permissions model Tips Sysadmin Horror Stories Unix History with some Emphasis on Scripting Humor

Extracted from Professor Nikolai Bezroukov unpublished lecture notes.

Copyright 2010-2018, Dr. Nikolai Bezroukov. This is a fragment of the copyrighted unpublished work. All rights reserved.

The Linux grep command searches a file for lines matching a fixed string or regular expressions (also called patterns).  Older versions of grep by default accepts only basic regular expression (DOS-style), although in RHEL 7 it also accepts extended regular expressions.

In other words grep understands three different versions of regular expression syntax: "basic," "extended" and "perl-style"

There are two other grep commands called egrep (extended grep) which accepts extended regular expressions and fgrep (which implements fast search for fixed strings only; no regular expression).

Linux uses GNU implementation of grep which combines these variations into a single utility. The egrep command is equivalent to grep -E  invocation ( or grep --extended-regexp), and the fgrep command is equivalent to grep -F  invocation (or grep --fixed-strings invocation).

The strange name grep originates in the early days of Unix, whereby one of the line-editor commands was g/re/p (globally search for a regular expression and print the matching lines). Because this editor command was used so often, a separate grep command was created to search files without first starting the line editor.

TIPS:

grep sometimes mis-recognize text files as binary. To suppress this use  option -a

-a, --text
Process a binary file as if it were text; this is equivalent to the --binary-files=text option.

One interesting test of Linux sysadmin book including Red hat certification books of whether the coverage of grep includes options -A  n (print n lines AFTER match) and -B n (print n lines BEFORE match)  Lower quality books does not cover those options, but they are extremely useful for sysadmins.

-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines. Places a line containing a group separator (--) between contiguous groups of matches. With the -o or --only-matching option, this has no effect and a warning is given.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines. Places a line containing a group separator (--) between contiguous groups of matches. With the -o or --only-matching option, this has no effect and a warning is given.

Another similar "litmus test of quality" option is -P

-P, --perl-regexp Interpret PATTERN as a Perl regular expression.

Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].

For example, [[:alnum:]] means [0-9A-Za-z], except the latter form depends upon the C locale and the ASCII character encoding, whereas the former is independent of locale and character set. (Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket expression.) Most meta-characters lose their special meaning inside bracket expressions.

To include a literal ] place it first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a literal - place it last.

Extended regular expressions

egrep uses matching patterns called extended regular expressions, which are similar to the pattern matching capabilities Bash extended test command ( [[..]] ).

The extended regular expression uses the following metasymbols:

Notice that the symbols are not exactly the same as the globing symbols used for file matching. For example, on the command line a question mark represents any character, whereas in grep, the period has this effect.

The characters ?, +, {, |, (, and ) must appear escaped with backslashes to prevent Bash from treating them as file-matching characters.

For example if we search /var/log/messages first for message that contain work kernel and then work failure we will get

[root@test01 log]# grep  kernel  messages | grep failed
Sep 24 22:48:02 localhost kernel: tsc: Fast TSC calibration failed
Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
Sep 24 22:48:02 localhost kernel: psmouse serio1: trackpoint: failed to get extended button data
Sep 24 22:48:14 localhost systemd: Dependency failed for ABRT kernel log watcher.

The asterisk (*) is a placeholder representing zero or more characters. Using this metasymbol we can rewrite previous query as:

[root@test01 log]# egrep 'kernel.*failed' messages
Sep 24 22:48:02 localhost kernel: tsc: Fast TSC calibration failed
Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
Sep 24 22:48:02 localhost kernel: psmouse serio1: trackpoint: failed to get extended button data

Invoking grep as fgrep, or using the short option -F or long option --fixed-strings switch to the search of fixed staring and does not interpret any pattern-matching characters. For this mode they are just regular symbols without any special meaning. 

[root@test01 log]# fgrep kernel  messages | fgrep failed

The caret (^) character indicates the beginning of a line. Use the caret to check for a pattern at the start of a line. The --invert-match (or -v) switch shows the lines that do not match. Lines that match are not shown. This often valuable for analyzing config file -- it allow to delete all the comments making "meaningful" line more visible

[root@test01 etc]# grep -v '^#' /etc/sudoers

Defaults !visiblepw

Defaults always_set_home

Defaults env_reset
Defaults env_keep = "COLORS DISPLAY HOSTNAME HISTSIZE KDEDIR LS_COLORS"
Defaults env_keep += "MAIL PS1 PS2 QTDIR USERNAME LANG LC_ADDRESS LC_CTYPE"
Defaults env_keep += "LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES"
Defaults env_keep += "LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE"
Defaults env_keep += "LC_TIME LC_ALL LANGUAGE LINGUAS _XKB_CHARSET XAUTHORITY"

Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin

root ALL=(ALL) ALL

%wheel ALL=(ALL) NOPASSWD: ALL

The --ignore-case (or -i) switch makes the search case insensitive.

grep -i error /var/log/messages

Regular expressions can be joined together with a vertical bar (|). This has the same effect as combining the results of two separate grep commands.

egrep -i 'error|fail|crash' /var/log/messages
[root@test01 etc]# egrep -i 'error|fail|crash' /var/log/messages
Sep 24 22:48:02 localhost kernel: tsc: Fast TSC calibration failed
Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
Sep 24 22:48:02 localhost kernel: crash memory driver: version 1.1
Sep 24 22:48:02 localhost kernel: psmouse serio1: trackpoint: failed to get extended button data
Sep 24 22:48:14 localhost systemd: Dependency failed for ABRT Xorg log watcher.
Sep 24 22:48:14 localhost systemd: Job abrt-xorg.service/start failed with result 'dependency'.
Sep 24 22:48:14 localhost systemd: Dependency failed for Harvest vmcores for ABRT.
Sep 24 22:48:14 localhost systemd: Job abrt-vmcore.service/start failed with result 'dependency'.
Sep 24 22:48:14 localhost systemd: Dependency failed for Install ABRT coredump hook.
Sep 24 22:48:14 localhost systemd: Job abrt-ccpp.service/start failed with result 'dependency'.
Sep 24 22:48:14 localhost systemd: Dependency failed for ABRT kernel log watcher.
Sep 24 22:48:14 localhost systemd: Job abrt-oops.service/start failed with result 'dependency'.
Sep 24 22:48:14 localhost rngd: read error
Sep 24 22:48:14 localhost rngd: read error
Sep 24 22:48:29 localhost python: 2018/09/24 22:48:29.828375 INFO sfdisk with --part-type failed [1], retrying with -c
Sep 24 22:48:29 localhost python: 2018/09/24 22:48:29.926344 INFO sfdisk with --part-type failed [1], retrying with -c
Sep 24 22:50:04 localhost python: 2018/09/24 22:50:04.956978 WARNING Download failed, switching to host plugin

To identify the matching line, the --line-number (or -n) switch displays both the line number and the line. Using cut, head, and tail, the first line number can be saved in a variable. The number of bytes into the file can be shown with --byte-offset (or -b).

$ grep -n "crash" orders.txt

The --count (or -c) switch counts the number of matches and displays the total.

grep recognizes the standard character classes as well.

$ grep "[[:cntrl:]]" orders.txt

A complete list of Linux grep switches can be found in man page

Some useful grep Options

Being able to invert the search logic with the -v flag and show lines that don’t match the given pattern can be quite useful. You can also make it so the grep command outputs only matching filenames (rather than the lines in those files that contain the search pattern) by adding the -l option. For example systemd daemon introduced in RHEL 7 pollutes the log. It often makes sense to exclude those messages when you analyze /var/log/messages

grep -v systems /var/log/messages | less

Matching context

When searching for specific lines in a file, you may actually want to also see a line or two above or below the matching line, rather than just the matching line itself. This can be accomplished in three ways, depending on whether you want lines above, lines below, or both, by using -A, -B, and -C, respectively.

For example, to show one additional line above and below the matching line (and add line numbers too, by using the -n option):

grep -n -C1 error /var/log/message

Notice that the line that has a match has a colon after the line number, while the other context lines are preceded with a dash. Very subtle, but knowing what to look for helps you find your match instantly!

Matches in color

Another useful feature of GNU  grep command is that it  highlights the matching passage in each line if you use the verbose --color=always option. Here’s how it looks:

grep -n -C 1 --color=alwayserror /var/log/message

Counting matches rather than showing matching lines

When you’re going through a large file and have a lot of matches, it’s often quite useful to just get a report of how many lines matched rather than having all the output stream past on your screen. This is accomplished with the -c option:

grep -v systemd | grep -c "kernel" /var/log/messages

Locating Files

The Linux locate command consults a database and returns a list of all pathnames containing a certain group of characters, much like a fixed-string grep.

[root@test01 etc]# locate ssh_config
/etc/ssh/ssh_config
/usr/share/man/man5/ssh_config.5.gz

The locate database is maintained by a command called updatedb. It is usually executed once a day by Linux distributions. For this reason, locate is very fast but useful only in finding files that are at least one day old.

Finding Files

The Linux find command searches for files that meet specific conditions such as files with a certain name or files greater than a certain size. find is similar to the following ls command piped into a loop (here variable $MATCH contain the condition string for which we serach):

ls --recursive | while read FILE ; do
     # test file for a match
    if [[ $MATCH ]] ; then
       printf "%s\n" "$FILE"
    fi
done

This script recursively searches directories under the current directory, looking for a filename that matches some condition called MATCH.

find is much more powerful than this script fragment. Like the built-in test command, find switches create expressions describing the qualities of the files to find. There are also switches to change the overall behavior of find and other switches to indicate actions to perform when a match is made.

The basic matching switch is -name, which indicates the name of the file to find. Name can be a specific filename or it can contain shell path wildcard globbing characters like * and ?. If pattern matching is used, the pattern must be enclosed in quotation marks to prevent the shell from expanding it before the find command examines it.

find /etc -name "*.conf"

The previous find command matches any type of file, including files such as pipes or directories, which is not usually the intention of a user. The -type switch limits the files to a certain type of file. The -type f switch matches only regular files, the most common kind of search. The type can also be b (block device), c (character device), d (directory), p (pipe), l (symbolic link), or s (socket).

find /etc -name "*.conf"  -type f

The switch -name "*.conf"  -type f  is an example of a find expression. These switches match a file that meets both of these conditions (implicitly, a logical “and”). There are other operator switches for combining conditions into logical expressions, as follows:

For example, to count the number of regular files and directories, do this:

[root@test01 etc]# find /etc -name "*.conf"  -type f | wc -l
145

The number of files without suffix .conf  can be counted as well.

find . ! -name "*.conf" -type f | wc -l

Parentheses must be escaped by a backslash or quotes to prevent Bash from interpreting them as a subshell. Using parentheses, the number of files ending in .txt or .sh can be expressed as

$ find . "(" -name "*.conf" -or -name "*.config" ")" -type f | wc -l

Some expression switches refer to measurements of time. Historically, find times were measured in days, but the GNU version adds min switches for minutes. find looks for an exact match.

To search for files older than an amount of time, include a plus or minus sign. If a plus sign (+) precedes the amount of time, find searches for times greater than this amount. If a minus sign (-) precedes the time measurement, find searches for times less than this amount. The plus and minus zero days designations are not the same: +0 in days means “older than no days,” or in other words, files one or more days old. Likewise, -5 in minutes means “younger than 5 minutes” or “zero to four minutes old”.

There are several switches used to test the access time, which is the time a file was last read or written. The -anewer switch checks to see whether one file was accessed more recently than a specified file. -atime tests the number of days ago a file was accessed. -amin checks the access time in minutes.

Likewise, you can check the inode change time with -cnewer, -ctime, and -cmin. The inode time usually, but not always, represents the time the file was created. You can check the modified time, which is the time a file was last written to, by using -newer, -mtime, and -mmin.

To find files that haven't been changed in more than one day:

find /etc -name "*.conf" -type f -mtime +0

To find files that were modified in the hour:

[root@test01 etc]# find /etc -type f -mmin -60
/etc/sudoers

The -size switch tests the size of a file. The default measurement is 512-byte blocks, which is counterintuitive to many users and a common source of errors. Unlike the time-measurement switches, which have different switches for different measurements of time, to change the unit of measurement for size you must follow the amount with a b (bytes), c (characters), k (kilobytes), or w (16-bit words). There is no m (megabyte). Like the time measurements, the amount can have a minus sign (-) to test for files smaller than the specified size, or a plus sign (+) to test for larger files.

For example, use this to find log files greater than 1GBMB:

$ find / -type f  -size +1G

find shows the matching paths on standard output. Historically, the -print switch had to be used. Printing the paths is now the default behavior for most Unix-like operating systems, including Linux. If compatibility is a concern, add -print to the end of the find parameters.

To perform a different action on a successful match, use -exec. The -exec switch runs a program on each matching file. This is often combined with rm to delete matching files, or grep to further test the files to see whether they contain a certain pattern. The name of the file is inserted into the command by a pair of curly braces ({}) and the command ends with an escaped semicolon. (If the semicolon is not escaped, the shell interprets it as the end of the find command instead.)

$ find . -type f -name "*.txt" -exec grep 10.10.10.10 {} \;

More than one action can be specified. To show the filename after a grep match, include -print.

$ find . -type f -name "*.txt" -exec grep Table {} \; -print

find expects {} to appear by itself (that is, surrounded by whitespace). It can't be combined with other characters, such as in an attempt to form a new pathname.

The -exec switch can be slow for a large number of files: The command must be executed for each match. When you have the option of piping the results to a second command, the execution speed is significantly faster than when using -exec. A pipe generates the results with two commands instead of hundreds or thousands of commands.

The -ok switch works the same way as -exec except that it interactively verifies whether the command should run.

$ find . -type f -name "*.txt" -ok rm {} \;
< rm ... ./orders.txt > ? n
< rm ... ./advocacy/linux.txt > ? n
< rm ... ./advocacy/old_orders.txt > ? n
				

The -ls action switch lists the matching files with more detail.

The -printf switch use set of macros which indicate what kind of information about the file to print:

Using find with xargs

Utility  grep does not accept a list of filenames from the standard input -- it treats standard input as line to be analysed.

At the same time find produces the list of files. to convert  this list into grep invocation you can use -exec otpion but it is slightly convoluted. Often a better solution is to use xargs utility.

xargs is a program that turns a stream of filenames into iterative calls to whatever program is specified, each time supplied multiple filenames as the  argument in such way as not to exceed the reading buffer of shell

Let’s say that the output of find is a list of four files, one, two, three, and four. Using xargs, these files could be given to searched by grep  for particular lines much like with -exec option, but you can supply multiple file to grep  making processing more efficient and

find | xargs grep pattern
find /var/log -name "*.gz" -type f -size +0 -print | xargs gzip -xv | grep error

 -type f matches just plain files, and -size +0 matches files that aren’t zero bytes in size.

Tip

If you have spaces in filenames you need to use -print0 in find and then add an option -0 (zero) to xargs:

find $HOME -name "html" -print0 | xargs -0 grep -i 

NEWS CONTENTS

Old News ;-)

Recommended Links

Google matched content

Softpanorama Recommended

https://www.certdepot.net/

Red Hat Certified System Administrator (RHCSA EX200) – Study Guide

...



Etc

Society

Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers :   Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism  : The Iron Law of Oligarchy : Libertarian Philosophy

Quotes

War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda  : SE quotes : Language Design and Programming Quotes : Random IT-related quotesSomerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose BierceBernard Shaw : Mark Twain Quotes

Bulletin:

Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 :  Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method  : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law

History:

Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds  : Larry Wall  : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOSProgramming Languages History : PL/1 : Simula 67 : C : History of GCC developmentScripting Languages : Perl history   : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history

Classic books:

The Peter Principle : Parkinson Law : 1984 : The Mythical Man-MonthHow to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite

Most popular humor pages:

Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor

The Last but not Least


Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

 

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...

You can use PayPal to make a contribution, supporting development of this site and speed up access. In case softpanorama.org is down you can use the at softpanorama.info

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

The site uses AdSense so you need to be aware of Google privacy policy. You you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: October 18, 2018