Softpanorama


Use grep and extended regular expressions to analyze text files


Extracted from Professor Nikolai Bezroukov unpublished lecture notes.

Copyright 2010-2018, Dr. Nikolai Bezroukov. This is a fragment of the copyrighted unpublished work. All rights reserved.


Introduction

The Linux grep command searches files for lines matching a fixed string or a regular expression (often loosely called a pattern). We will assume the GNU version of grep. Alternatives are quite similar and sometimes more powerful, but GNU grep is the de facto standard, and it implements Perl-style regular expressions via the -P option, which is the recommended form of regex to use.

The strange name grep originates in the early days of Unix, from the ed editor command g/re/p (globally search for a regular expression and print the matching lines). Because this editor command was used so often, a separate grep command was created to search files without first starting the line editor. From Wikipedia:

Regular expressions entered popular use from 1968 in two uses: pattern matching in a text editor[5] and lexical analysis in a compiler.[6] Among the first appearances of regular expressions in program form was when Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files.[5][7][8][9] For speed, Thompson implemented regular expression matching by just-in-time compilation (JIT) to IBM 7094 code on the Compatible Time-Sharing System, an important early example of JIT compilation.[10] He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions ("grep" is a word derived from the command for regular expression searching in the ed editor: g/re/p meaning "Global search for Regular Expression and Print matching lines"[11]). Around the same time when Thompson developed QED, a group of researchers including Douglas T. Ross implemented a tool based on regular expressions that is used for lexical analysis in compiler design.[6]

... ... ...

Many variations of these original forms of regular expressions were used in Unix[9] programs at Bell Labs in the 1970s, including vi, lex, sed, AWK, and expr, and in other programs such as Emacs. Regexes were subsequently adopted by a wide range of programs, with these early forms standardized in the POSIX.2 standard in 1992.

... ... ...

Starting in 1997, Philip Hazel developed PCRE (Perl Compatible Regular Expressions), which attempts to closely mimic Perl's regex functionality and is used by many modern tools including PHP and Apache HTTP Server.

The power of grep stems from its ability to use regular expressions, so we need to pay proper attention to the study of regular expressions while studying grep. GNU grep used in Linux accepts three types of regular expressions, which complicates its usage. Historically they emerged in this order: basic regex, extended regex, and Perl-style regex. Now they should be used in reverse order, with Perl-style regex as the preferable notation and engine.

Unfortunately that means that sysadmins need to know at least two of them ("basic" and "extended", or "basic" and "Perl-style"), and preferably all three. This "multiple personalities" (aka schizoid) behavior is very confusing. I hate the fact that nobody has the courage to implement a new standard grep and that the current implementation carries all the warts accumulated over the decades of Unix existence.

I highly recommend using the -P option (Perl regular expressions) by default, by defining grep as an alias for grep -P. It makes grep behavior less insane. Sysadmins who do not know Perl but widely use AWK are encouraged to use AWK instead of grep in all complex cases that require extended regular expressions.

Knowing extended regular expressions is valuable if you also use AWK instead of Perl. Otherwise, I would say, learn Perl-style regular expressions.

Linux uses the GNU implementation of grep, which combines the old separate versions of grep into a single utility. This utility has two aliases (fgrep and egrep), and invoking it via one of them changes its behavior by selecting the corresponding regex engine, as if you had specified option -F or -E on the command line. So the classic names survived; they are just implemented via aliases. Note that recent GNU grep releases print a warning that egrep and fgrep are obsolescent and recommend grep -E and grep -F instead.

  1. fgrep Searches for fixed strings only and is equivalent to grep -F (or grep --fixed-strings). It implements a very fast search for fixed strings; no regular expressions are interpreted.
  2. grep The legacy grep, which implements basic regular expressions (BRE) with some GNU extensions.
  3. egrep (extended grep) Accepts extended regular expressions (ERE) and is equivalent to grep -E (or grep --extended-regexp).

In the POSIX standard certain named classes of characters are predefined: [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z], except that the latter form depends upon the C locale and the ASCII character encoding, whereas the former is independent of locale and character set. (Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket expression.) Most metacharacters lose their special meaning inside bracket expressions.

You can also add the alias grepp for grep -P, which is simpler to type than egrep and makes use of the more powerful and flexible Perl-style regular expressions.

-P, --perl-regexp  Interpret PATTERN as a Perl regular expression.

NOTE: You can't alias it to pgrep: that name was taken before this mode of grep was implemented. The pgrep utility already exists and searches the process table, much like ps | grep.
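As a minimal sketch of the grepp idea, the snippet below defines it as a shell function rather than an alias, on the assumption that you may also want it inside scripts, where aliases are normally not expanded. The sample data is invented for the illustration, and the example assumes a GNU grep built with -P support.

```shell
# Hypothetical helper: run GNU grep with Perl-compatible regular expressions.
# A function is used instead of an alias so that it also works in scripts.
grepp() {
    grep -P "$@"
}

# Sample data (invented for this illustration).
sample=$(mktemp)
printf 'eth0 up\nport 8080 open\nlo up\n' > "$sample"

# \d and \b are Perl-style shorthands: find lines with a standalone 4-digit number.
result=$(grepp '\b\d{4}\b' "$sample")
echo "$result"
rm -f "$sample"
```

For interactive use only, the equivalent alias would be alias grepp='grep -P' in ~/.bashrc.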

A complete list of Linux grep switches can be found in the man page. Below are the nine most useful grep options, which any sysadmin must know and use:

  1. -i Ignore case. Match either upper- or lowercase.
  2. -v Print only lines that do not match the pattern.
  3. -H, --with-filename Print the file name for each match. This is the default when there is more than one file to search, but it must be given explicitly if you search a single file and still want the file name in the output.
  4. -A n Show n lines after the matching line.
  5. -B n Show n lines before the matching line.
  6. -C n Show n lines before and after the matching line.
  7. -n Print the matched line and its line number.
  8. -l Print only the names of files with matching lines (this is the lowercase letter “L”). So output  contains only matching filenames (rather than the lines in those files that contain the search pattern)
  9. -c Print only the count of matching lines.
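The options in the list above can be tried on a small throwaway file. The following is only a sketch: the file names and log lines are invented for the illustration.

```shell
# Sample files (invented for this illustration).
dir=$(mktemp -d)
printf 'Error: disk full\nall good\nerror: retry\nWARNING: slow\n' > "$dir/app.log"
printf 'nothing to see\n' > "$dir/other.log"

# -i -c: case-insensitive count of matching lines
count=$(grep -ic 'error' "$dir/app.log")
# -n: prefix each matching line with its line number
numbered=$(grep -in 'error' "$dir/app.log")
# -v: print only the lines that do NOT match
others=$(grep -iv 'error' "$dir/app.log")
# -l: print only the names of files that contain a match
files=$(cd "$dir" && grep -il 'error' app.log other.log)
# -H: force the file name prefix even when a single file is searched
with_name=$(cd "$dir" && grep -H 'good' app.log)

printf '%s\n' "$count" "$numbered" "$others" "$files" "$with_name"
rm -rf "$dir"
```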

Being able to invert the search logic with the -v flag is a very important and widely used feature of grep for sysadmins. Among other things, it allows you to delete "log noise". Some daemons in RHEL 7 are configured by default in such a way that they relentlessly spam the log, making it unusable. The two worst offenders are systemd and dbus. For your own server you can, of course, reconfigure them to a higher alert level, stopping this nasty spam. But for servers you do not own this is impossible, and the only way to deal with such messages is to filter them out. For example, the systemd daemon introduced in RHEL 7 pollutes the log, so it often makes sense to exclude its messages when you analyze /var/log/messages:

grep -Pv 'systemd\: (Start|Created|Removed|Stopping)|systemd\-login|dbus.*freedesktop\.problems|dbus.*(Activating|bluez)|pulseaudio' messages

Of course, you are better off writing a more sophisticated filter in Perl or Python. But as a "quick and dirty" solution this is OK.
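A self-contained variation of this filter can be tested without access to a real /var/log/messages. This is a sketch under stated assumptions: the log lines are invented samples, the noise pattern is a shortened version of the one above, and -E is used instead of -P because the pattern needs only grouping and alternation.

```shell
# Sample log (invented lines in the style of /var/log/messages).
log=$(mktemp)
cat > "$log" <<'EOF'
Sep 24 22:48:14 localhost systemd: Started Session 12 of user root.
Sep 24 22:48:14 localhost rngd: read error
Sep 24 22:48:15 localhost dbus[811]: [system] Activating service name='org.bluez'
Sep 24 22:48:16 localhost kernel: tsc: Fast TSC calibration failed
Sep 24 22:48:17 localhost systemd-logind: New session 12 of user root.
EOF

# Keeping the noise pattern in a variable makes it easy to extend later.
noise='systemd: (Start|Created|Removed|Stopping)|systemd-login|dbus.*(Activating|bluez)'

# -v inverts the match: print everything that is NOT noise.
grep -Ev "$noise" "$log"

# Number of lines that survive the filter.
kept=$(grep -Evc "$noise" "$log")
echo "kept $kept lines"
rm -f "$log"
```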


Extended regular expressions

egrep uses matching patterns called extended regular expressions, which are the same kind of patterns accepted by the =~ operator of the Bash extended test command ( [[..]] ).

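The extended regular expression uses the following metasymbols:

  1. . (dot) Matches any single character.
  2. * The preceding item is matched zero or more times.
  3. + The preceding item is matched one or more times.
  4. ? The preceding item is optional (matched zero or one time).
  5. {n,m} The preceding item is matched at least n and at most m times.
  6. | Alternation: matches either the expression on the left or the one on the right.
  7. ( ) Grouping of a subexpression.
  8. ^ and $ Anchors for the beginning and the end of a line.
  9. [ ] Bracket expression: matches any single character from the listed set.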

There are big differences in the semantics of metacharacters between extended and basic regular expressions. That creates problems, because you need to know and use both. If you add Perl regex to this, it makes three. And that's one too many for the human brain ;-)

And the differences are such that errors can lead to SNAFU. For example, in shell wildcard (glob) patterns a question mark represents any single character (like the dot in regular expressions), and in a basic regular expression an unescaped question mark is just a literal question mark, whereas in an extended regex it marks the preceding character or subexpression as optional (it should occur in the string zero times or exactly one time for a match).

You need to enclose the regex in quotes, because the metacharacters ?, +, {, |, (, and ) have special meaning in the shell too, and Bash will interpret them itself if the regex is not quoted.
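The different meaning of ? in basic and extended regular expressions is easy to check on a tiny word list. The sketch below uses an invented three-line file; note that the regex is quoted in both calls so the shell does not touch the ?.

```shell
# Sample word list (invented for this illustration).
words=$(mktemp)
printf 'color\ncolour\ncolou?r\n' > "$words"

# Basic regex: ? is an ordinary character, so only the literal text "colou?r" matches.
basic=$(grep 'colou?r' "$words")

# Extended regex: ? makes the preceding "u" optional.
extended=$(grep -E 'colou?r' "$words")

printf 'basic:\n%s\n' "$basic"
printf 'extended:\n%s\n' "$extended"
rm -f "$words"
```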

For example, if we search /var/log/messages first for messages that contain the word kernel and then for the word failed, we get:

[root@test01 log]# fgrep -i kernel  messages | fgrep -i failed
Sep 24 22:48:02 localhost kernel: tsc: Fast TSC calibration failed
Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
Sep 24 22:48:02 localhost kernel: psmouse serio1: trackpoint: failed to get extended button data
Sep 24 22:48:14 localhost systemd: Dependency failed for ABRT kernel log watcher.

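The asterisk (*) means that the preceding item is matched zero or more times, so .* matches any run of characters. Using it, we can rewrite the previous query without a pipe (plain grep accepts this regex too; egrep is not required). Note that the result is not identical: the regex is case-sensitive and matches only lines where kernel precedes failed, so the systemd line from the previous output is not shown: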

[root@test01 log]# egrep 'kernel.*failed' messages
Sep 24 22:48:02 localhost kernel: tsc: Fast TSC calibration failed
Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
Sep 24 22:48:02 localhost kernel: psmouse serio1: trackpoint: failed to get extended button data
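The difference between the two queries above comes from word order. As a sketch, the sample below reuses three lines from the listings in this section and adds one possible order-independent regex (listing both orders as alternatives), which is an illustration and not the only way to do it.

```shell
# Three log lines taken from the listings in this section.
log=$(mktemp)
cat > "$log" <<'EOF'
Sep 24 22:48:02 localhost kernel: tsc: Fast TSC calibration failed
Sep 24 22:48:14 localhost systemd: Dependency failed for ABRT kernel log watcher.
Sep 24 22:48:14 localhost rngd: read error
EOF

# Two filters in a pipe: the order of the words in the line does not matter.
piped=$(grep -F kernel "$log" | grep -Fc failed)

# One regex: matches only when "kernel" comes before "failed".
ordered=$(grep -Ec 'kernel.*failed' "$log")

# One regex covering both orders.
either=$(grep -Ec 'kernel.*failed|failed.*kernel' "$log")

echo "pipe: $piped, kernel.*failed: $ordered, either order: $either"
rm -f "$log"
```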

Invoking grep as fgrep, or using the short option -F or the long option --fixed-strings, switches to searching for fixed strings and does not interpret any pattern-matching characters. In this mode they are just regular symbols without any special meaning.

[root@test01 log]# fgrep kernel  messages | fgrep failed

The caret (^) character indicates the beginning of a line. Use the caret to check for a pattern at the start of a line. The --invert-match (or -v) switch shows the lines that do not match. Lines that match are not shown. This is often valuable for analyzing config files: it lets you drop all the comment lines, making the "meaningful" lines more visible (plain grep handles this simple regex just as well as egrep):

[root@test01 etc]# egrep -v '^#' /etc/sudoers

Defaults !visiblepw

Defaults always_set_home

Defaults env_reset
Defaults env_keep = "COLORS DISPLAY HOSTNAME HISTSIZE KDEDIR LS_COLORS"
Defaults env_keep += "MAIL PS1 PS2 QTDIR USERNAME LANG LC_ADDRESS LC_CTYPE"
Defaults env_keep += "LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES"
Defaults env_keep += "LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE"
Defaults env_keep += "LC_TIME LC_ALL LANGUAGE LINGUAS _XKB_CHARSET XAUTHORITY"

Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin

root ALL=(ALL) ALL

%wheel ALL=(ALL) NOPASSWD: ALL
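The output above still contains the empty lines, because an empty line does not start with #. A possible refinement, sketched here on an invented sample file instead of the real /etc/sudoers, is to treat blank lines and indented comments as noise too.

```shell
# Sample configuration (invented for this illustration).
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Sample configuration

Defaults env_reset
   # indented comment

root ALL=(ALL) ALL
EOF

# Drop lines that, after optional leading whitespace, are either
# a comment (#) or empty ($ right after the whitespace).
active=$(grep -Ev '^[[:space:]]*(#|$)' "$conf")
echo "$active"
rm -f "$conf"
```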

The --ignore-case (or -i) switch makes the search case insensitive.

fgrep -i error /var/log/messages

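Regular expressions can be joined together with a vertical bar (|). A line is printed if it matches any of the alternatives. This is like merging the results of separate grep commands, except that each line is printed only once and the lines stay in file order.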

[root@test01 etc]# egrep -i 'error|fail|crash' /var/log/messages
Sep 24 22:48:02 localhost kernel: tsc: Fast TSC calibration failed
Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
Sep 24 22:48:02 localhost kernel: crash memory driver: version 1.1
Sep 24 22:48:02 localhost kernel: psmouse serio1: trackpoint: failed to get extended button data
Sep 24 22:48:14 localhost systemd: Dependency failed for ABRT Xorg log watcher.
Sep 24 22:48:14 localhost systemd: Job abrt-xorg.service/start failed with result 'dependency'.
Sep 24 22:48:14 localhost systemd: Dependency failed for Harvest vmcores for ABRT.
Sep 24 22:48:14 localhost systemd: Job abrt-vmcore.service/start failed with result 'dependency'.
Sep 24 22:48:14 localhost systemd: Dependency failed for Install ABRT coredump hook.
Sep 24 22:48:14 localhost systemd: Job abrt-ccpp.service/start failed with result 'dependency'.
Sep 24 22:48:14 localhost systemd: Dependency failed for ABRT kernel log watcher.
Sep 24 22:48:14 localhost systemd: Job abrt-oops.service/start failed with result 'dependency'.
Sep 24 22:48:14 localhost rngd: read error
Sep 24 22:48:14 localhost rngd: read error
Sep 24 22:48:29 localhost python: 2018/09/24 22:48:29.828375 INFO sfdisk with --part-type failed [1], retrying with -c
Sep 24 22:48:29 localhost python: 2018/09/24 22:48:29.926344 INFO sfdisk with --part-type failed [1], retrying with -c
Sep 24 22:50:04 localhost python: 2018/09/24 22:50:04.956978 WARNING Download failed, switching to host plugin

To identify the matching line, the --line-number (or -n) switch displays both the line number and the line. Using cut, head, and tail, the first line number can be saved in a variable. The byte offset of each matching line within the file can be shown with --byte-offset (or -b).

$ grep -n "crash" orders.txt
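Saving the line number of the first (or last) match in a variable could look like the sketch below. The file content is invented, and the variable names are arbitrary.

```shell
# Sample file (invented for this illustration).
orders=$(mktemp)
printf 'ok\nok\ncrash in module A\nok\ncrash in module B\n' > "$orders"

# grep -n prints "NUMBER:line"; keep the first match and cut out the number.
first=$(grep -n 'crash' "$orders" | head -n 1 | cut -d: -f1)

# The same with tail gives the line number of the last match.
last=$(grep -n 'crash' "$orders" | tail -n 1 | cut -d: -f1)

# -b prints the byte offset of the matching line instead of its number.
offset=$(grep -b 'crash' "$orders" | head -n 1 | cut -d: -f1)

echo "first match on line $first, last on line $last, first byte offset $offset"
rm -f "$orders"
```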

The --count (or -c) switch counts the number of matching lines and displays the total. This is mainly useful in Bash scripts, as it lets you see how many lines matched and change the logic accordingly.
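A minimal sketch of such script logic follows. The sample log and the wording of the messages are invented; in a real script the branches would trigger whatever action is appropriate.

```shell
# Sample log (invented for this illustration).
log=$(mktemp)
printf 'kernel: ok\nkernel: failed\nrngd: read error\n' > "$log"

# -c prints the number of matching lines (0 if nothing matched).
failures=$(grep -Eic 'error|fail' "$log")

if [ "$failures" -gt 0 ]; then
    status="found $failures suspicious line(s)"
else
    status="log is clean"
fi
echo "$status"
rm -f "$log"
```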

grep recognizes the standard character classes as well.

$ egrep "[[:cntrl:]]" alice_in_wonderland
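The character classes can be tried on a small invented file like this. The sketch assumes the usual C or UTF-8 locale, where a tab counts as a control character.

```shell
# Sample text (invented); the third line contains a tab character.
text=$(mktemp)
printf 'Alice\nroom 42\ntab\there\nend.\n' > "$text"

with_digit=$(grep '[[:digit:]]' "$text")     # lines containing a digit
with_punct=$(grep '[[:punct:]]' "$text")     # lines containing punctuation
ctrl_lines=$(grep -c '[[:cntrl:]]' "$text")  # number of lines with a control character

echo "digit: $with_digit"
echo "punct: $with_punct"
echo "lines with control characters: $ctrl_lines"
rm -f "$text"
```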

Backreferences

Suppose you want to search for a string which contains a certain substring in more than one place. An example is the heading tag in HTML. Suppose I wanted to search for <h1>some string</h1>. This is easy enough to do. But suppose I wanted to do the same but allow H2, H3, H4, H5, or H6 in place of H1. The expression <h[1-6]>.*</h[1-6]> is not good enough, since it also matches <h1>Hello world</h3>, but we want the opening tag to match the closing one. To do this, we use a backreference.

A backreference is the expression \n, where n is a number; it matches the contents of the n'th set of parentheses in the expression.

For example:
grep -Pi '<h([1-6]).*</h\1>' index.shtml
matches only the headings whose closing tag has the same level as the opening tag:
grep -Pi '<h([1-6]).*</h\1>' ../Public*/index.shtml
<h2><a name="Latest">Recent updates</a></h2>
<h2><a href="switchboard.shtml">Softpanorama Switchboard</a></h2>
<h4><a href="switchboard.shtml">Switchboard </a>-- Links, Links, Links...</h4>
<h4><a name="Bookshelf">Bookshelf</a></h4>
<h4><a href="switchboard.shtml#recent_papers">Recent articles</a>:</h4>
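The effect of the backreference can be compared with the loose [1-6] version on a few invented HTML lines. This is a sketch that assumes a GNU grep built with -P support; the stricter pattern here also requires the > right after the heading level.

```shell
# Sample HTML fragment (invented for this illustration).
page=$(mktemp)
cat > "$page" <<'EOF'
<h1>Hello world</h3>
<h2>Recent updates</h2>
<H4>Bookshelf</H4>
<p>plain paragraph</p>
EOF

# Without a backreference the mismatched <h1>...</h3> line is counted too.
loose=$(grep -Pic '<h[1-6]>.*</h[1-6]>' "$page")

# \1 must repeat whatever digit the first group captured.
strict=$(grep -Pi '<h([1-6])>.*</h\1>' "$page")

echo "loose pattern matched $loose lines"
echo "$strict"
rm -f "$page"
```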

Using quotes

Single quotes are the safest to use, because they protect your regular expression from the shell. For example, in an interactive Bash session grep !root file will usually produce an error (since the shell thinks that "!root" refers to the shell command history), while grep '!root' file will not.

When should you use single quotes?

The answer is this: if you want to use shell variables, you need double quotes; otherwise always use single quotes.

For example,

grep "$HOME" file 

searches file for the name of your home directory, while

grep '$HOME' file 

searches for the string $HOME
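The same contrast can be reproduced with any variable. In the sketch below user_dir is a hypothetical variable and the file content is invented.

```shell
# Sample file (invented for this illustration).
file=$(mktemp)
printf 'path=/home/alice\nliteral $user_dir here\n' > "$file"

user_dir=/home/alice   # hypothetical variable used only in this example

# Double quotes: the shell expands $user_dir before grep sees the pattern.
expanded=$(grep "$user_dir" "$file")

# Single quotes: grep receives the text $user_dir itself.
literal=$(grep '$user_dir' "$file")

echo "double quotes: $expanded"
echo "single quotes: $literal"
rm -f "$file"
```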

Matching context

When searching for specific lines in a file, you may also want to see a line or two above or below the matching line, rather than just the matching line itself. This can be accomplished in three ways, depending on whether you want lines above, lines below, or both, by using -B (before), -A (after), and -C (context), respectively.

For example, to show one additional line above and below the matching line (and add line numbers too, by using the -n option):

grep -n -C1 error /var/log/messages

Notice that the matching line has a colon after the line number, while the context lines have a dash there instead. Very subtle, but knowing what to look for helps you find your match instantly!
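The colon and dash markers are easy to see on a short invented log. This sketch prints one line of context on each side of the match.

```shell
# Sample log (invented for this illustration).
log=$(mktemp)
printf 'service starting\nloading config\nerror: bad value\nusing defaults\nservice ready\n' > "$log"

# -n adds line numbers, -C1 adds one context line before and after the match.
context=$(grep -n -C1 'error' "$log")
echo "$context"
# 2-loading config
# 3:error: bad value
# 4-using defaults
rm -f "$log"
```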

Counting matches rather than showing matching lines

When you’re going through a large file and have a lot of matches, it’s often quite useful to just get a report of how many lines matched rather than having all the output stream past on your screen. This is accomplished with the -c option:

grep -c "kernel" /var/log/messages

Grep Recursive Search

GNU grep can search directories recursively for a regex or fixed string via the -r option (or --recursive). With -r, symbolic links are followed only if they are given on the command line. To follow all symbolic links, use the -R option (or --dereference-recursive).

But traditionally, for recursive search grep is combined with find.
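Both approaches are sketched below on a throwaway directory tree whose names and contents are invented. The find variant is useful when the set of files has to be narrowed first, here by the *.log name pattern.

```shell
# Sample directory tree (invented for this illustration).
tree=$(mktemp -d)
mkdir -p "$tree/etc/app" "$tree/var/log"
printf 'timeout=30\n' > "$tree/etc/app/app.conf"
printf 'timeout reached\n' > "$tree/var/log/app.log"
printf 'all good\n' > "$tree/var/log/other.log"

# Recursive grep: -r walks the tree, -l prints only the names of matching files.
recursive=$(cd "$tree" && grep -rl 'timeout' . | sort)

# find selects the files, grep searches them; -H forces the file name prefix
# even if find passes only one file to grep.
via_find=$(cd "$tree" && find . -type f -name '*.log' -exec grep -H 'timeout' {} +)

echo "$recursive"
echo "$via_find"
rm -rf "$tree"
```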

Displaying matches in color

Another useful feature of the GNU grep command is that it highlights the matching passage in each line if you use the --color=always option. For example:

grep -n -C 1 --color=always error /var/log/messages

Gotchas

It may sound funny, but GNU grep accepts constructs from outside the classic basic regular expressions even when it is invoked without the -E or -P options: you just have to escape them with a backslash.

In the example below we search for all occurrences of the words fatal, error, and crash, using an escaped | (that is, \|) as the alternation operator.

grep 'fatal\|error\|crash' /var/log/messages

If you use the extended regular expression option -E (or --extended-regexp), then the operator | should not be escaped, as shown below:

grep -E 'fatal|error|critical' /var/log/nginx/error.log
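The two notations can be checked against each other on an invented sample. The sketch also shows what happens when the | is left unescaped in basic mode: it is then taken literally and nothing matches.

```shell
# Sample log (invented for this illustration).
log=$(mktemp)
printf 'fatal: out of memory\nwarning: low disk\nerror: timeout\ncrash detected\n' > "$log"

# Basic regex (GNU extension): alternation must be written as \|
basic=$(grep 'fatal\|error\|crash' "$log")

# Extended regex: alternation is a plain |
extended=$(grep -E 'fatal|error|crash' "$log")

# Basic regex with an unescaped |: searches for the literal text "fatal|error|crash".
unescaped=$(grep -c 'fatal|error|crash' "$log")

echo "$basic"
echo "lines matched with unescaped | in basic mode: $unescaped"
rm -f "$log"
```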


Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: October 29, 2018