Softpanorama

May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

grep tutorial

by Dr. Nikolai Bezroukov
Version 2.0 (Oct 23, 2018)


Introduction

The Linux grep command searches files for lines matching a fixed string or a regular expression (often loosely called a pattern). We will assume the GNU version of grep. Alternatives are quite similar, and some are more powerful, but GNU grep is the de facto standard, and it implements Perl-style regular expressions via the -P option, which is the recommended form of regex to use.

By default grep outputs the matching lines that it finds in the list of files specified as arguments and then exits with return code zero if one or more lines were matched, or one if no lines were matched. In case of inaccessible input files or syntax errors in the specified regex, grep returns a code larger than one (GNU grep returns 2).
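The return code is what makes grep convenient in shell conditionals. Below is a minimal sketch, assuming GNU grep; the sample file and its contents are made up for the illustration.

```shell
# Hypothetical sample data, created only for this illustration.
log=$(mktemp)
printf 'kernel: boot ok\nkernel: disk failed\n' > "$log"

# -q suppresses normal output; only the return code is of interest.
found=0;   grep -q 'failed' "$log" || found=$?                     # at least one line matched
missing=0; grep -q 'panic' "$log" || missing=$?                    # no line matched
broken=0;  grep -q 'x' "$log.missing" 2>/dev/null || broken=$?     # input file cannot be read

echo "found=$found missing=$missing broken=$broken"

# Typical use in a script: branch on the return code.
if grep -q 'failed' "$log"; then
    echo "failures present"
fi
rm -f "$log"
```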

The strange name grep originates in the early days of Unix: one of the commands of the Unix ed editor was g/re/p (globally search for a regular expression and print the matching lines). Because this editor command was used so often, a separate grep command was created to search files without first starting the line editor. From Wikipedia:

Regular expressions entered popular use from 1968 in two uses: pattern matching in a text editor[5] and lexical analysis in a compiler.[6] Among the first appearances of regular expressions in program form was when Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files.[5][7][8][9] For speed, Thompson implemented regular expression matching by just-in-time compilation (JIT) to IBM 7094 code on the Compatible Time-Sharing System, an important early example of JIT compilation.[10] He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions ("grep" is a word derived from the command for regular expression searching in the ed editor: g/re/p meaning "Global search for Regular Expression and Print matching lines"[11]). Around the same time when Thompson developed QED, a group of researchers including Douglas T. Ross implemented a tool based on regular expressions that is used for lexical analysis in compiler design.[6]

... ... ...

Many variations of these original forms of regular expressions were used in Unix[9] programs at Bell Labs in the 1970s, including vi, lex, sed, AWK, and expr, and in other programs such as Emacs. Regexes were subsequently adopted by a wide range of programs, with these early forms standardized in the POSIX.2 standard in 1992.

... ... ...

Starting in 1997, Philip Hazel developed PCRE (Perl Compatible Regular Expressions), which attempts to closely mimic Perl's regex functionality and is used by many modern tools including PHP and Apache HTTP Server.

The power of grep stems from its ability to use regular expressions, so we need to pay proper attention to the study of regular expressions while studying grep. GNU grep used in Linux accepts three types of regular expressions, which complicates its usage. Historically they emerged in this order: basic regex, extended regex, and Perl-style regex. Now they should be used in reverse order, with Perl-style regex as the preferable notation and engine.

Unfortunately that means that sysadmins need to know at least two of them ("basic" and "extended", or "basic" and "Perl-style"), and preferably all three. This "multiple personalities" (aka schizoid) behavior is very confusing. I hate the fact that nobody has the courage to implement a new standard grep and that the current implementation has all the warts accumulated during the decades of Unix existence.

I highly recommend using the -P option (Perl regular expressions) by default by defining grep as an alias for grep -P. It makes grep behavior less insane. Sysadmins who do not know Perl but widely use AWK are encouraged to use AWK instead of grep in all complex cases which require extended regular expressions.

Knowing extended regular expressions is valuable if you also use awk instead of Perl. Otherwise I would say: learn Perl-style regular expressions.

Linux uses the GNU implementation of grep, which combines the old separate versions of grep into a single utility. The classic names fgrep and egrep survive, but only as thin wrappers: invoking grep under one of these names selects the corresponding regex engine, as if you had specified option -F or -E on the command line. Recent GNU grep releases mark egrep and fgrep as obsolescent and print a warning, so grep -E and grep -F are the safer spelling in scripts.

  1. fgrep searches for fixed strings only and is equivalent to grep -F (or grep --fixed-strings). It implements a very fast search for fixed strings; no regular expressions are interpreted.
  2. grep is the legacy grep, which implements basic regular expressions with some extensions.
  3. egrep (extended grep) accepts extended regular expressions. The egrep command is equivalent to grep -E (or grep --extended-regexp). In the POSIX standard certain named classes of characters are predefined: [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z], except the latter form depends upon the C locale and the ASCII character encoding, whereas the former is independent of locale and character set. (Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket expression.) Most meta-characters lose their special meaning inside bracket expressions.

You can also add the alias grepp for grep -P, which is simpler to type than egrep and makes use of the more powerful and flexible Perl-style regular expressions.
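A minimal sketch of such a shortcut, assuming GNU grep built with Perl regex support. A shell function is used here instead of an alias because aliases are normally not expanded in scripts; the name grepp comes from the text above, and the sample input is made up.

```shell
# Interactive alternative for ~/.bashrc:  alias grepp='grep -P'
# A function with the same name also works inside scripts.
grepp() { grep -P "$@"; }

# \d is a Perl-style escape, so it only works in -P mode.
result=$(printf 'error42\nerror\n' | grepp 'error\d+')
echo "$result"
```

Define either the alias or the function, not both under the same name in one interactive shell: the alias would be expanded while the function definition is being read.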

-P, --perl-regexp  Interpret PATTERN as a Perl regular expression.

A complete list of Linux grep switches can be found in the man page. Below are the nine most useful additional grep options, which any sysadmin must know and use:

  1. -i Ignore case. Match either upper- or lowercase.
  2. -v Print only lines that do not match the pattern.
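  3. -H, --with-filename Print the file name for each match. This is the default when there is more than one file to search, but it is essential if you search a single file and still need the file name (for example, when grep is run from find).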
  4. -A n Show n lines after the matching line.
  5. -B n Show n lines before the matching line.
  6. -C n Show n lines before and after the matching line.
  7. -n Print the matched line and its line number.
  8. -l Print only the names of files with matching lines (this is the lowercase letter “L”). The output contains only the names of matching files (rather than the lines in those files that contain the search pattern). Scanning of each file stops on the first match, which increases efficiency. There is also an opposite option -L ( --files-without-match ), which also suppresses normal output and instead prints the name of each input file that has zero matches of the specified regex.
  9. -c Print only the count of matching lines.

Being able to invert the search logic with the -v flag is a very important and widely used feature of grep for sysadmins. Among other things it allows you to delete "log noise". Some daemons in RHEL 7 are configured by default in such a way that they relentlessly spam the log, making it unusable. The two worst offenders are systemd and dbus. For your own server you can, of course, reconfigure them to a higher alert level, stopping this nasty spam. But for servers you do not own this is impossible, and the only way to deal with such messages is to filter them out. For example, the systemd daemon introduced in RHEL 7 pollutes the log, so it often makes sense to exclude its messages when you analyze /var/log/messages:

grep -Pv 'systemd\: (Start|Created|Removed|Stopping)|systemd\-login|dbus.*freedesktop\.problems|dbus.*(Activating|bluez)|pulseaudio' messages

Of course, you are better off writing a more sophisticated filter in Perl or Python. But as a "quick and dirty" solution this is OK.

NOTE: Please note that you can't use the name pgrep for the grep -P alias: the name was taken before this mode of grep was implemented. The pgrep utility exists and searches the process table, like ps | grep.

TIPS:

Fgrep -- searching for fixed string

Invoking grep as fgrep, or using the short option -F or the long option --fixed-strings, switches grep to searching for a fixed string: it does not interpret any pattern-matching characters. In this case grep returns all lines that contain the given string as a substring. All characters are interpreted literally and are not assigned any special meaning.

[root@test01 log]# fgrep kernel  messages | fgrep failed

You can view a fixed string as the most primitive form of regular expression: a regular expression without any metasymbols. This extreme case allows grep to search files much more efficiently. GNU grep implements a special algorithm for matching such "fixed" strings that is very fast even on very large files. To activate this algorithm you should either use option -F or invoke grep as fgrep. For example,

fgrep foo file # returns all the lines that contain a string "foo" in the file "file".

This option is often used for filtering data which comes from STDIN instead of a file. For example,

locate / | fgrep /sysconfig  # lists all entries in the locate database which contain the string /sysconfig (locate needs a search argument; "/" selects every path)
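The difference between a fixed string and a regex becomes visible as soon as the search string contains a metacharacter. A small sketch, assuming GNU grep; the sample lines are invented, and grep -F is used as the equivalent of fgrep.

```shell
# Hypothetical sample lines, made up for this illustration.
sample=$(printf 'version 1.5\nversion 105\n')

# grep -F (same as fgrep): the dot is an ordinary character.
literal=$(printf '%s\n' "$sample" | grep -F '1.5')

# Plain grep: the dot is a metacharacter that matches any character,
# so "105" is selected as well.
regex=$(printf '%s\n' "$sample" | grep '1.5')

printf 'fixed string:\n%s\nregex:\n%s\n' "$literal" "$regex"
```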

Grep regular expressions

As we mentioned above, grep allows you to use three types of regular expressions: basic, extended and Perl-style.

Perl-style regular expressions

Any regular expression consists of literals (strings that are interpreted "as is") and metacharacters, which specify a particular type of matching on literals or by themselves. Perl-style regex has the following major metacharacters:

  • . -- matches any character except newline (character grouping [^\n]); in Perl the s modifier makes it match newline as well
  • \d -- matches a digit (character grouping [0-9])
  • \D -- matches a non-digit (character grouping [^0-9])
  • \w -- matches a word character (character grouping [a-zA-Z0-9_]); underscore is counted as a word character here
  • \W -- matches a non-word character (character grouping [^a-zA-Z0-9_])
  • \s -- matches a whitespace character (character grouping [\t\n ]: tab, newline and space, plus the other whitespace characters such as carriage return and form feed)
  • \S -- matches a non-whitespace character (character grouping [^\t\n ])
  • $ -- anchor which matches the 'end of line', if placed at the end of a regular expression.
  • ^ -- anchor that matches the 'beginning of line', if placed at the beginning of a regular expression.
  • \b, \B -- anchors that match a word boundary (\b) or the lack of a word boundary (\B).

    It's probably best to build up your use of regular expressions slowly, from the simplest cases to more complex ones. You are always better off starting with simple expressions, making sure that they work, and then adding more complex elements one by one. Unless you have a couple of years of experience with regex, do not even try to construct a complex regex in one giant step.

    Here are a few examples:

    grep -P '404 - - ' /var/log/http* # shows all 404 error messages in http logs
    grep -P '40\d' /var/log/http*     # matches 400, 401, 403, etc.

    Here are more examples of simple regular expression that might be reused in other contexts:

    grep -P 't.t'         # matches t followed by any character followed by t
    grep -P '^131'        # 131 at the beginning of a line
    grep -P '0$'          # matches lines that end with zero
    grep -P 'error\d+'    # matches lines with the word error followed by digits
    grep -Pv '^$'         # removes empty lines from the output
    

    Character classes

    Now let's add complexity by introducing classes of characters.

    They can be sets or ranges and should be put inside square brackets; a - (minus) indicates "between" and a ^ after [ means "not":

    grep -P '[abcde]'         # Either a or b or c or d or e
    grep -P '[a-e]'           # same thing ("-" denotes a range here)
    grep -P '[a-z]'           # Anything from a to z inclusive
    grep -P '[^a-z]'          # Any character that is not a lower case letter
    grep -P '[a-zA-Z]'        # Any letter
    grep -P '\w'              # Any letter, digit or underscore (wider than the class above)
    grep -P '[a-z]+'          # Any non-empty sequence of lower case letters
    grep -P '[01]'            # Either "0" or "1"
    grep -P '[^0-9a-zA-Z]'    # Any character that is not a letter or a digit (close to \W, but also matches underscore)

    If you need to match a word whose length is unknown, you probably should not use * or *?, because a zero-length word makes no sense; use + instead.

    Now let's introduce two so-called anchors, special characters that tell the regex engine that the match should start or end in a certain position of the string. The two most common anchors are ^ and $.

    For example to match the first word on the line we can use the following regex :

    grep -P '^\w+'

    Several additional examples:

    grep -P '0'             # zero: "0"
    grep -P '0*'            # zero or more zeros (matches every line, since zero occurrences also count)
    grep -P '0+'            # one or more zeros
    grep -P '0*0'           # same as above
    grep -P '\d'            # any digit but only one
    grep -P '\d+'           # any integer
    grep -P '\d+\.\d*'      # a subset of real numbers. Please note that 0. is a real number
    grep -P '\d+\.\d+\.\d+\.\d+' # IP addresses (no control of the number of digits, so 1000.1000.1000.1000 would match this regex)
    grep -P '\d+\.\d+\.\d+\.255' # IP addresses ending with 255

    At this point you can probably benefit from doing several exercises on the computer. Let's repeat key Perl regex metacharacters for reference:

    \n		# A newline
    \t		# A tab
    \w		# Any alphanumeric (word) character.
    		# The same as [a-zA-Z0-9_]
    \W		# Any non-word character.
    		# The same as [^a-zA-Z0-9_]
    \d		# Any digit. The same as [0-9]
    \D		# Any non-digit. The same as [^0-9]
    \s		# Any whitespace character: space,
    		# tab, newline, etc
    \S		# Any non-whitespace character
    \b		# A word boundary, outside [] only
    \B		# No word boundary

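    NOTE: Characters such as $, |, [ ], { }, ( ), \, ^, *, + and several others have a special meaning in regular expressions; to match them literally, precede them with a backslash, for example:

    \|		# Vertical bar
    \[		# An open square bracket
    \)		# A closing parenthesis
    \*		# An asterisk
    \^		# A caret symbol
    \/		# A slash (the backslash is needed in Perl, where / delimits the regex; grep does not require it)
    \\		# A backslash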

    Metacharacters in Character Classes

    The character class [0123456789] or, shorter, [0-9] defines the class of decimal digits, and [0-9a-fA-F] defines the class of hexadecimal digits. You should use a dash to define a range of consecutive characters. Character classes let you match any one of a set of characters. You can use shell variable interpolation inside the character class (with double quotes around the regex), but you must be careful when doing so. You can use escapes such as \d and \s inside character classes, but not as endpoints of a range. For example, you can do the following:

    grep -P '[\d\s]'

    Most other meta-characters, such as ., *, +, ?, | and parentheses, are used in their literal sense when they appear inside the square brackets that define a character class; they lose their meta-meaning. Only ], \, ^ (right after the opening bracket) and - (between two characters) remain special there. This may be confusing but that's how it is.
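A short sketch of both behaviors, assuming GNU grep with -P support; the input strings are invented for the illustration. Inside brackets the dot is literal, while \d and \s keep their class meaning.

```shell
# [.] matches only a real dot, unlike a bare . outside brackets.
dot_literal=$(printf 'a.b\naxb\n' | grep -P 'a[.]b')

# [\d\s] matches one character that is either a digit or whitespace.
digit_or_space=$(printf 'abc\na1c\na c\n' | grep -P 'a[\d\s]c')

printf '%s\n---\n%s\n' "$dot_literal" "$digit_or_space"
```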

    How to Create Complex Regex

    Complex patterns are constructed from simple regular expressions using metacharacters.

    Meta-characters are characters that have an additional meaning above and beyond their literal meaning. For example, the period character can have two meanings in a pattern. First, it can be used to match a period character in the searched string; this is its literal meaning. Second, it can be used to match any character in the searched string except for the newline character; this is its meta-meaning. The following two components can be used to construct complex patterns: anchors and alternation.

    Anchors

    Metacharacters differ in their behavior. Some of them can match zero characters of a particular class, but most require at least one such character. Here are examples of metacharacters of the latter kind that we already know: ., \d, \w and \s.

    Substrings matched by those metacharacters always have positive width. Or, to put it differently, the regular expression engine 'eats' characters in the process of matching.

    The second group of metacharacters does not eat any characters; that is, they do not require any character to be present. This subclass is usually called anchors. The most important anchors are ^, $, \b and \B.

    Anchors don't match a character, they match a condition. For example, the regex '^cat\b' will match a line with the word 'cat' at the beginning of the line.
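The '^cat\b' regex can be tried on a few invented lines. This is only a sketch, assuming GNU grep with -P support: just the first line satisfies both conditions, start of line and word boundary after 'cat'.

```shell
# 'catalog' fails the \b condition; 'the cat' fails the ^ condition.
anchored=$(printf 'cat sat\ncatalog\nthe cat\n' | grep -P '^cat\b')
echo "$anchored"
```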

    Alternation

    Alternation is the way to tell Perl regex engine that you wish to match one of two or more patterns. In other words, the regular expression:

    grep -P '^foreach|^for|^while' myscript.pl

    tells the Perl regex engine "look for a line beginning with the string 'foreach' OR the string 'for' OR the string 'while'."

    The ( | ) syntax splits the regular expression into sections, and each section is tried independently. Alternation always tries to match the first item first. If it doesn't match, the second pattern is then tried, and so on.

    If the alternatives were listed as for|foreach, the string foreach would never be matched as a whole, because for would match before it. For plain line selection this makes no difference, but it matters when the matched text itself is used (for example with the -o option). This is so common a mistake that I recommend putting the longest string first in such cases.
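The effect of the order of alternatives becomes visible when only the matched text is printed with the -o option. A minimal sketch, assuming GNU grep with -P support; the Perl-like input line is made up.

```shell
line='foreach my $f (@files)'

# Shorter alternative first: the engine stops at "for".
short_first=$(printf '%s\n' "$line" | grep -oP 'for|foreach')

# Longest alternative first: the whole keyword is matched.
long_first=$(printf '%s\n' "$line" | grep -oP 'foreach|for')

echo "short first: $short_first, long first: $long_first"
```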

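    Parentheses can also group a part of the regex so that a quantifier applies to it. For example, the following regex matches both 'word' and 'words':

    grep -P 'word(s?)'

    A useful option for matching words is -i (ignore case). For example:

    grep -Pi 'word(s?)'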

    Backreferences

    Suppose you want to search for a string which contains a certain substring in more than one place. An example is the heading tag in HTML. Suppose I wanted to search for <h1>some string</h1>  . This is easy enough to do. But suppose I wanted to do the same but allow H2 H3 H4 H5 H6  in place of H1. The expression <h[1-6]>.*</h[1-6]>  is not good enough since it matches <h1>Hello world</h3>  but we want the opening tag to match the closing one. To do this, we use a backreference.

    A backreference is the expression \n, where n is a number; it matches the contents of the n'th set of parentheses in the expression.

    For example:
    grep -Pi '<h([1-6])>.*</h\1>' index.shtml
    matches what we were trying to match before.
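The difference between the two heading regexes can be checked on a few invented lines. This is a sketch, assuming GNU grep with -P support; -c prints only the number of matching lines.

```shell
# Hypothetical HTML fragments; the second one has mismatched tags.
html=$(printf '<h1>Hello world</h1>\n<h1>Hello world</h3>\n<H2>Title</H2>\n')

# Without a backreference the mismatched pair is accepted too.
loose=$(printf '%s\n' "$html" | grep -Pic '<h[1-6]>.*</h[1-6]>')

# \1 must repeat the digit captured by ([1-6]).
strict=$(printf '%s\n' "$html" | grep -Pic '<h([1-6])>.*</h\1>')

echo "loose=$loose strict=$strict"
```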
    grep -Pi '\<h([1-6]).*</h\1>' ../Public*/index.shtml
    <h2><a name="Latest">Recent updates</a></h2>
    <h2><a href="switchboard.shtml">Softpanorama Switchboard</a></h2>
    <h4><a href="switchboard.shtml">Switchboard </a>-- Links, Links, Links...</h4>
    <h4><a name="Bookshelf">Bookshelf</a></h4>
    <h4><a href="switchboard.shtml#recent_papers">Recent articles</a>:</h4>
    

    Extended regular expressions

    egrep uses matching patterns called extended regular expressions; the same syntax is used by the =~ operator of the Bash extended test command ( [[..]] ).

    The extended regular expression uses the following metasymbols, all compatible with Perl regex: . (any character), * (zero or more repetitions), + (one or more), ? (zero or one), {n,m} (interval), | (alternation), ( ) (grouping), ^ and $ (anchors), and [ ] (character classes).

    Notice that the symbols are not exactly the same as the globbing symbols used for file matching. For example, on the command line a question mark represents any character, whereas in grep, the period has this effect.

    A regex that contains the characters ?, +, {, |, (, or ) should be enclosed in single quotes to prevent Bash from interpreting them as file-matching or other special characters. Inside an extended regex, a backslash in front of one of these characters makes it match literally.

    For example, if we search /var/log/messages first for messages that contain the word kernel and then for the word failed, we will get

    [root@test01 log]# grep  kernel  messages | grep failed
    Sep 24 22:48:02 localhost kernel: tsc: Fast TSC calibration failed
    Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
    Sep 24 22:48:02 localhost kernel: psmouse serio1: trackpoint: failed to get extended button data
    Sep 24 22:48:14 localhost systemd: Dependency failed for ABRT kernel log watcher.

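    In a regex the asterisk (*) means zero or more repetitions of the preceding item, so .* represents zero or more arbitrary characters. Using it we can rewrite the previous query as a single command (note that this version requires kernel to appear before failed on the line, so the systemd line above is no longer matched):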

    [root@test01 log]# egrep 'kernel.*failed' messages
    Sep 24 22:48:02 localhost kernel: tsc: Fast TSC calibration failed
    Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
    Sep 24 22:48:02 localhost kernel: psmouse serio1: trackpoint: failed to get extended button data

    The caret (^) character indicates the beginning of a line. Use the caret to check for a pattern at the start of a line. The --invert-match (or -v) switch shows the lines that do not match. Lines that match are not shown. This is often valuable for analyzing a config file: it allows you to delete all the comments, making the "meaningful" lines more visible.

    [root@test01 etc]# grep -v '^#' /etc/sudoers
    
    Defaults !visiblepw
    
    Defaults always_set_home
    
    Defaults env_reset
    Defaults env_keep = "COLORS DISPLAY HOSTNAME HISTSIZE KDEDIR LS_COLORS"
    Defaults env_keep += "MAIL PS1 PS2 QTDIR USERNAME LANG LC_ADDRESS LC_CTYPE"
    Defaults env_keep += "LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES"
    Defaults env_keep += "LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE"
    Defaults env_keep += "LC_TIME LC_ALL LANGUAGE LINGUAS _XKB_CHARSET XAUTHORITY"
    
    Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin
    
    root ALL=(ALL) ALL
    
    %wheel ALL=(ALL) NOPASSWD: ALL
    

    The --ignore-case (or -i) switch makes the search case insensitive.

    grep -i error /var/log/messages

    Regular expressions can be joined together with a vertical bar (|). This has the same effect as combining the results of two separate grep commands.

    [root@test01 etc]# egrep -i 'error|fail|crash' /var/log/messages
    Sep 24 22:48:02 localhost kernel: tsc: Fast TSC calibration failed
    Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
    Sep 24 22:48:02 localhost kernel: acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
    Sep 24 22:48:02 localhost kernel: crash memory driver: version 1.1
    Sep 24 22:48:02 localhost kernel: psmouse serio1: trackpoint: failed to get extended button data
    Sep 24 22:48:14 localhost systemd: Dependency failed for ABRT Xorg log watcher.
    Sep 24 22:48:14 localhost systemd: Job abrt-xorg.service/start failed with result 'dependency'.
    Sep 24 22:48:14 localhost systemd: Dependency failed for Harvest vmcores for ABRT.
    Sep 24 22:48:14 localhost systemd: Job abrt-vmcore.service/start failed with result 'dependency'.
    Sep 24 22:48:14 localhost systemd: Dependency failed for Install ABRT coredump hook.
    Sep 24 22:48:14 localhost systemd: Job abrt-ccpp.service/start failed with result 'dependency'.
    Sep 24 22:48:14 localhost systemd: Dependency failed for ABRT kernel log watcher.
    Sep 24 22:48:14 localhost systemd: Job abrt-oops.service/start failed with result 'dependency'.
    Sep 24 22:48:14 localhost rngd: read error
    Sep 24 22:48:14 localhost rngd: read error
    Sep 24 22:48:29 localhost python: 2018/09/24 22:48:29.828375 INFO sfdisk with --part-type failed [1], retrying with -c
    Sep 24 22:48:29 localhost python: 2018/09/24 22:48:29.926344 INFO sfdisk with --part-type failed [1], retrying with -c
    Sep 24 22:50:04 localhost python: 2018/09/24 22:50:04.956978 WARNING Download failed, switching to host plugin

    To identify the matching line, the --line-number (or -n) switch displays both the line number and the line. Using cut, head, and tail, the first line number can be saved in a variable. The number of bytes into the file can be shown with --byte-offset (or -b).

    $ grep -n "crash" orders.txt
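Here is a sketch of saving the first matching line number in a variable, as mentioned above. The file name orders.txt comes from the example; its content here is invented and written to a temporary file.

```shell
# Hypothetical stand-in for orders.txt.
orders=$(mktemp)
printf 'order 1 ok\norder 2 crash\norder 3 ok\norder 4 crash\n' > "$orders"

# grep -n prints "LINE:text"; head keeps the first match, cut keeps the number.
first_line=$(grep -n 'crash' "$orders" | head -n 1 | cut -d: -f1)
echo "first match is on line $first_line"

# The saved number can be reused, for example to print that line with sed.
sed -n "${first_line}p" "$orders"
rm -f "$orders"
```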

    The --count (or -c) switch counts the number of matches and displays the total.

    grep recognizes the standard character classes as well.

    $ grep "[[:cntrl:]]" orders.txt


    Basic regex

    Basic regular expressions are the default regex type in grep and are also used by ed and sed, so this is the type of regex best known to sysadmins. They should not be confused with DOS-style wildcards (globbing), which the shell uses on the command line with utilities such as ls.

    NOTE: In GNU grep, basic regular expressions allow alternation, but you need to remember to put a backslash before |, as well as before +, ?, {, }, ( and ) when they are used as metacharacters. For example:

    grep 'if|while'     #-- wrong: searches for the literal string if|while

    grep 'if\|while'     #-- will work, please note single quotes

    Please use egrep or grep -P instead. In complex cases always use Perl or the -P option (Perl regular expressions, available only in GNU grep).
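The same alternation written in the three dialects can be compared side by side. This is a sketch, assuming GNU grep with -P support; the input lines are invented, and -c is used to print only the number of matching lines.

```shell
text=$(printf 'if x\nwhile y\nfor z\n')

bre=$(printf '%s\n' "$text" | grep -c 'if\|while')     # basic regex: \| is alternation
ere=$(printf '%s\n' "$text" | grep -Ec 'if|while')     # extended regex
pcre=$(printf '%s\n' "$text" | grep -Pc 'if|while')    # Perl-style regex

# In a basic regex an unescaped | is an ordinary character, so nothing matches.
literal=$(printf '%s\n' "$text" | grep -c 'if|while' || true)

echo "bre=$bre ere=$ere pcre=$pcre literal=$literal"
```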

    Using quotes

    Single quotes are the safest to use, because they protect your regular expression from the shell. For example, grep ! file  will often produce an error (since the shell thinks that "!" is referring to the shell command history) while grep '!' file  will not.

    When should you use single quotes?

    The answer is this: if you want to use shell variables, you need double quotes; otherwise always use single quotes.

    For example,

    grep "$HOME" file 

    searches file for the name of your home directory, while

    grep '$HOME' file 

    searches for the string $HOME
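The same effect can be reproduced with any shell variable. In the sketch below the variable user_dir and the file content are hypothetical, and -F is added so that the dollar sign and the slashes are taken literally.

```shell
user_dir='/home/alice'          # hypothetical value, for illustration only
f=$(mktemp)
printf 'path is /home/alice\nliteral $user_dir here\n' > "$f"

# Double quotes: the shell substitutes the value before grep sees it.
expanded=$(grep -F "$user_dir" "$f")

# Single quotes: grep receives the characters $user_dir unchanged.
verbatim=$(grep -F '$user_dir' "$f")

printf '%s\n%s\n' "$expanded" "$verbatim"
rm -f "$f"
```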

    Major options

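    A complete list of Linux grep switches can be found in the man page. Historically, default options could be specified via the environment variable GREP_OPTIONS, but it is obsolescent: recent versions of GNU grep warn about it or ignore it, so an alias or a wrapper script is the safer choice.

    Below are the most useful grep options, which any sysadmin must know and use: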

    1. -i Ignore case. Match either upper- or lowercase.
    2. -v Print only lines that do not match the pattern.
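    3. -H, --with-filename Print the file name for each match. This is the default when there is more than one file to search, but it is essential if you search a single file and still need the file name (for example, when grep is run from find).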
    4. -A n Show n lines after the matching line.
    5. -B n Show n lines before the matching line.
    6. -C n Show n lines before and after the matching line.
    7. -n Print the matched line and its line number.
    8. -l Print only the names of files with matching lines (this is the lowercase letter "L"). The output contains only the names of matching files (rather than the lines in those files that contain the search pattern). Scanning of each file stops on the first match, which increases efficiency. There is also an opposite option -L ( --files-without-match ), which also suppresses normal output and instead prints the name of each input file that has zero matches of the specified regex.
    9. -c Print only the count of matching lines.
    10. -w (or --word-regexp ) Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
    11. -x (or --line-regexp ) Select only those matches that exactly match the whole line.

    Listing filename of the matching line, options -H and -l

    If grep is invoked with two or more files as arguments, it lists the filename in which a particular match is found. Problems arise if there is a single file: in this case grep by default does not list the name of the file. As grep is often used with find via the option -exec (see below), this is a very important "use case", and GNU grep provides a special option to handle it: option -H. In older versions of grep there was no such possibility, and you needed to imitate it by supplying the dummy file /dev/null as the second file in such cases. For example

    egrep -H 'error|crash' /var/log/messages

    egrep 'error|crash' /var/log/messages /dev/null # same effect as option -H; used for the old versions of grep in Solaris, HP-UX, and AIX.

    Option -l allows you to list only the files that contain the search string. To get the opposite "do not contain" effect, use option -L. These options are mainly useful in scripts; they are seldom used on the command line. For example, if we have daily HTTP logs and want to determine when a particular IP accessed the site, we can use:

    fgrep -l '10.10.5.4' http_logs*    # fgrep, because the dots should be matched literally
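A small sketch of the two listing options on made-up log files, assuming GNU grep. The file names and addresses are hypothetical, and -F is used so that the dots in the address are literal.

```shell
# Two hypothetical daily logs in a scratch directory.
dir=$(mktemp -d)
printf '10.10.5.4 GET /\n' > "$dir/http_log.1"
printf '192.168.0.7 GET /\n' > "$dir/http_log.2"

# -l: files that contain the address; -L: files that do not.
with=$(cd "$dir" && grep -Fl '10.10.5.4' http_log.* || true)
without=$(cd "$dir" && grep -FL '10.10.5.4' http_log.* || true)

echo "with: $with, without: $without"
rm -rf "$dir"
```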

    Printing context of the matching line, options -B (before), and -A (after)

    GNU grep is able to output lines in the vicinity of the matching line. In many cases it is extremely important to see the context of the matching line; this is a typical situation in troubleshooting. GNU grep provides very flexible capabilities for this (you can also print the line number of the matching line with option -n, see below).
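A brief sketch of the context options on an invented five-line log, assuming GNU grep. With -n, context lines are marked with "-" after the line number and the matching line with ":".

```shell
log=$(printf 'step 1\nstep 2\nERROR disk\nstep 4\nstep 5\n')

after=$(printf '%s\n' "$log" | grep -A1 'ERROR')        # match plus one line after
before=$(printf '%s\n' "$log" | grep -B1 'ERROR')       # one line before plus match
around=$(printf '%s\n' "$log" | grep -n -C1 'ERROR')    # one line on each side, numbered

printf '%s\n--\n%s\n--\n%s\n' "$after" "$before" "$around"
```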

    Ignoring case

    -i, --ignore-case
    Ignore case distinctions in both the PATTERN and the input files. (-i is specified by POSIX.)
    This is an important option, often used when grepping the logs for error messages using specific keywords.

    Reversal of matching

    -v, --invert-match Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX .)

    Specifying multiple regular expressions in the file

    The relevant option is:

    -f FILE, --file=FILE Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing. (-f is specified by POSIX .)

    Regexes on separate lines are interpreted as connected by a logical "OR", so if there are three lines containing regexes, any line that matches at least one of them will be printed in the output. See the discussion of a very perverted example at Getting all the matches with 'grep -f' option - Stack Overflow

    This is a convenient option for scanning the logs for error messages, using specific keywords.
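A minimal sketch of the -f option, assuming GNU grep. Both the keyword file and the log are temporary files with invented content; -i is added so that FAILED is matched by the keyword fail.

```shell
patterns=$(mktemp)
log=$(mktemp)
printf 'error\nfail\ncrash\n' > "$patterns"        # one regex per line, joined by OR
printf 'read error\nall fine\nDownload FAILED\n' > "$log"

hits=$(grep -i -f "$patterns" "$log")
echo "$hits"
rm -f "$patterns" "$log"
```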

    Grep Recursive Search

    GNU grep can recursively search for a regex or fixed string via -r option (or --recursive). By default it does not follow symbolic links. To follow all symbolic links, use the -R option (or --dereference-recursive).

    But traditionally for recursive search grep is combined with find (see below)
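Both approaches can be sketched on a scratch directory tree; GNU grep and find are assumed, and all file names and contents are made up. The --include option of GNU grep limits a recursive search to files whose names match a glob.

```shell
top=$(mktemp -d)
mkdir -p "$top/a/b"
printf 'Listen 80\n' > "$top/a/httpd.conf"
printf 'Listen 8080\n' > "$top/a/b/notes.txt"

# Recursive search over the whole tree, names of matching files only.
recursive=$(grep -rl 'Listen' "$top" | sort)

# Same, but only in *.conf files.
only_conf=$(grep -rl --include='*.conf' 'Listen' "$top")

# The traditional combination with find; -H forces the file name to be printed.
via_find=$(find "$top" -type f -name '*.conf' -exec grep -H 'Listen' {} +)

printf '%s\n--\n%s\n--\n%s\n' "$recursive" "$only_conf" "$via_find"
rm -rf "$top"
```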

    Examples

    Fgrep:

    fgrep -l 'hahaha' * # just the names of matching files
    fgrep  'May 16'  /var/logs/https/access # we are searching for a fixed string, so fgrep is better
    fgrep -v 'yahoo.com' /var/logs/https/access  # filtering out yahoo.com using the -v option
    find . -type f -print0 | xargs -0 fgrep -l 'hahaha'   # -print0 and -0 keep file names with spaces intact

    More complex example: remove lines from invoices.txt if they appear in paid_today.txt (note the elegance of the solution: one input file serves as a set of fixed strings for grep to match in the other). The output contains the invoices that are still unpaid:

    fgrep -xvf paid_today.txt invoices.txt > unpaid_invoices.txt

    Grep:

    Suppose you want to match a specific number of repetitions of a pattern. A good example is IP address. You could search for an arbitrary IP address like this:

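    grep -P '[[:digit:]]{1,3}(\.[[:digit:]]{1,3}){3}' file

    In the C locale there is no difference between [0-9] and [[:digit:]]; the latter is the locale-independent way to write the same class. Note that the class name must be enclosed in a second pair of brackets.

    The same can be done for phone numbers written in 999-999-9999 form:

    grep -P '([[:digit:]]{3}[[:punct:]]){2}[[:digit:]]{4}' file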
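Both interval regexes can be tried on invented input. This is a sketch, assuming GNU grep with -P support; -o prints only the matched part, and \d is used as the Perl-style spelling of [[:digit:]].

```shell
data=$(printf 'host 192.168.1.10 up\nversion 1.2\ncall 555-123-4567\n')

# Four groups of 1-3 digits separated by dots; \b keeps the match on word boundaries.
ips=$(printf '%s\n' "$data" | grep -oP '\b\d{1,3}(\.\d{1,3}){3}\b')

# Two groups of "three digits plus punctuation", then four digits.
phones=$(printf '%s\n' "$data" | grep -oE '([[:digit:]]{3}[[:punct:]]){2}[[:digit:]]{4}')

echo "ip: $ips, phone: $phones"
```

Neither regex validates the values: an address such as 999.999.999.999 would still be accepted.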

    To search email that has come from a certain address:

    grep -P '^From:.*somebody\@' /var/spool/mail/root

    To search several variants of the same name:

    grep -P 'Nic?k(olai)? Bezroukov'  # matches "Nick Bezroukov" or "Nikolai Bezroukov" (and also "Nik" and "Nickolai")
    grep -P 'cat|dog' file # matches lines containing the string "cat" or the string "dog"
    grep -P '^(From|To|Subject):'  # matches the corresponding part of the email header (with -P, parentheses and | are not escaped)

    Using -l option

    grep -Pl 'nobody\@nowhere' *

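    Using word anchors and the -w option

    grep '\<abuse' *
    grep 'abuse\>' *
    grep -w 'abuse' *

    The first command searches for lines in which any word begins with the letters 'abuse', and the second for lines in which any word ends with the letters 'abuse'. The third command uses -w, which requires the whole match to be a complete word, so it finds only the word 'abuse' itself. Adding -w to either of the first two commands would have the same effect and make the anchors redundant.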
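
    A short sketch of how the word anchors \< and \> differ from -w, which requires the whole match to be a word. It assumes GNU grep (the anchors are a GNU extension) and uses three made-up words as input.

    ```shell
    words() { printf '%s\n' 'abuse' 'abusers' 'disabuse'; }

    starts=$(words | grep -c '\<abuse')   # abuse, abusers
    ends=$(words | grep -c 'abuse\>')     # abuse, disabuse
    whole=$(words | grep -c -w 'abuse')   # abuse only
    printf 'starts=%s ends=%s whole=%s\n' "$starts" "$ends" "$whole"
    ```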


    Grep with pipes

    The output of grep can also be piped to another program as follows:

    ps -ef | grep httpd | wc  -l
    The example above counts how many 'httpd' processes are running. The count includes the grep process itself; Tip 1 below shows how to avoid that.

    We can find letters from a particular spammer using the following pipe:

    ls | xargs -n1 fgrep -H '[email protected]'
    Note the -H option: it instructs grep to output the name of the file in all cases. By default grep outputs the name of the file only if more than one file is specified as an argument.

    During debugging comments often obscure the logic of the program and interfere with the search of a bug. Here is how to display non comment lines of myscript.pl

    grep -v '^#' ~/myscript.pl | less

    Grep is very useful as a simple yet powerful HTML analyzer. Here is how to find HTML tags that are not closed before the line break.

    egrep '<[^>]*$' *.html
    You can also use grep to search compressed files by decompressing them to standard input, but it is better to use the available wrappers such as zgrep and bzgrep. zgrep is a wrapper that invokes grep on compressed or gzip'ed files. All options specified are passed directly to grep. If no file is specified, then the standard input is decompressed and fed to grep. Otherwise the given files are uncompressed if necessary and fed to grep.
    grep gzip files
    ---------------
    zgrep foo myfile.gz                           # all lines containing the pattern 'foo'
    zgrep 'GET /blog' access_log.gz               # all lines containing 'GET /blog'
    zgrep 'GET /blog' access_log.gz | more        # same thing, viewed through a pager

    See also zgrep(1) - Linux man page

    Tips

    Tip 1: How to block an extra line when grepping ps output for a string or pattern:

    ps -ef | grep '[c]ron'

    If the pattern had been written without the square brackets, it would have matched not only the ps output line for cron, but also the ps output line for grep.
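
    The reason the trick works can be shown without a running cron. In this sketch the two ps output lines are fabricated: the regex [c]ron matches the text "cron", but the command line of grep itself then contains the literal text "[c]ron", which the regex does not match.

    ```shell
    daemon='root   812     1  0 09:00 ?      00:00:00 /usr/sbin/cron -f'

    # What ps would show while "grep cron" is running: grep finds itself too.
    plain=$(printf '%s\n' "$daemon" \
        'user  4242   900  0 09:05 pts/0  00:00:00 grep cron' | grep -c 'cron')

    # What ps would show while "grep [c]ron" is running: only the daemon matches.
    bracket=$(printf '%s\n' "$daemon" \
        'user  4242   900  0 09:05 pts/0  00:00:00 grep [c]ron' | grep -c '[c]ron')

    printf 'plain=%s bracket=%s\n' "$plain" "$bracket"
    ```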

    Tip 2: How do I search directories recursively?

    grep -r --include='*.html' 'hello' ~

    Newer versions of grep have the -r option. The example above searches for 'hello' in all HTML files under the user's home directory; the --include option of GNU grep limits the recursion to file names matching the quoted glob. For more control over which files are searched, use find and xargs. For example,

    find ~ -name '*.html' -type f -print | xargs grep 'hello'   
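
    File names that contain spaces break a plain find | xargs pipe, because xargs splits its input on whitespace. A sketch of the usual remedy, assuming a find and xargs that support NUL separators (GNU and BSD both do); the directory and file names are invented.

    ```shell
    workdir=$(mktemp -d)
    mkdir "$workdir/site docs"
    echo '<p>hello</p>' > "$workdir/site docs/my page.html"
    echo 'hello' > "$workdir/notes.txt"

    # -print0 and -0 use NUL as the separator, so spaces in paths are safe.
    found=$(find "$workdir" -name '*.html' -type f -print0 | xargs -0 grep -l 'hello')
    printf '%s\n' "$found"

    rm -rf "$workdir"
    ```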

    Tip 3: How do I output context around the matching lines?

    grep -C 2 'hello' * # prints two lines of context around each matching line.   
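
    A small sketch of the three context options on made-up input: -C prints lines on both sides of the match, -A only the lines after it, -B only the lines before it.

    ```shell
    numbered() {
        printf '%s\n' 'line 1' 'line 2' 'line 3' 'hello' 'line 5' 'line 6' 'line 7'
    }

    around=$(numbered | grep -C 2 'hello')   # line 2 .. line 6
    after=$(numbered | grep -A 1 'hello')    # hello, line 5
    before=$(numbered | grep -B 1 'hello')   # line 3, hello
    printf '%s\n' "$around"
    ```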

    Using grep with find

    The Linux find command searches for files that meet specific conditions such as files with a certain name or files greater than a certain size. find is similar to the following loop where MATCH is the matching criteria:

    ls --recursive | while read FILE ; do
         # test file for a match
        if [ $MATCH ] ; then
           printf "%s\n" "$FILE"
        fi
    done

    This script recursively searches directories under the current directory, looking for a filename that matches some condition called MATCH.

    find is much more powerful than this script fragment. Like the built-in test command, find switches create expressions describing the qualities of the files to find. There are also switches to change the overall behavior of find and other switches to indicate actions to perform when a match is made.

    The basic matching switch is -name, which indicates the name of the file to find. Name can be a specific filename or it can contain shell path wildcard globbing characters like * and ?. If pattern matching is used, the pattern must be enclosed in quotation marks to prevent the shell from expanding it before the find command examines it.

    find /etc -name "*.conf"

    The previous find command matches any type of file, including files such as pipes or directories, which is not usually the intention of a user. The -type switch limits the files to a certain type of file. The -type f switch matches only regular files, the most common kind of search. The type can also be b (block device), c (character device), d (directory), p (pipe), l (symbolic link), or s (socket).

    find /etc -name "*.conf"  -type f

    The switch -name "*.conf" -type f is an example of a find expression. These switches match a file that meets both of these conditions (implicitly, a logical "and"). There are other operator switches for combining conditions into logical expressions, such as ! (not), -and, -or, and parentheses for grouping.

    For example, to count the number of regular files with the .conf suffix under /etc, do this:

    [root@test01 etc]# find /etc -name "*.conf"  -type f | wc -l
    145

    The number of files without suffix .conf can be counted as well.

    find . ! -name "*.conf" -type f | wc -l

    Parentheses must be escaped by a backslash or quotes to prevent Bash from interpreting them as a subshell. Using parentheses, the number of files ending in .conf or .config can be expressed as

    $ find . "(" -name "*.conf" -or -name "*.config" ")" -type f | wc -l
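
    The operators can be tried on a throwaway directory. This sketch uses the portable spelling -o (GNU find also accepts -or) and backslash-escaped parentheses; the file names are hypothetical.

    ```shell
    workdir=$(mktemp -d)
    mkdir "$workdir/sub.conf"        # a directory, excluded by -type f
    touch "$workdir/a.conf" "$workdir/b.config" "$workdir/c.txt"

    either=$(find "$workdir" \( -name '*.conf' -o -name '*.config' \) -type f \
        | wc -l | tr -d ' ')
    neither=$(find "$workdir" ! -name '*.conf' ! -name '*.config' -type f \
        | wc -l | tr -d ' ')
    printf 'either=%s neither=%s\n' "$either" "$neither"

    rm -rf "$workdir"
    ```

    Without the parentheses the implicit "and" would bind more tightly than -o, and -type f would apply only to the second -name test.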

    Some expression switches refer to measurements of time. Historically, find times were measured in days, but the GNU version adds min switches for minutes. When the amount is given without a sign, find looks for an exact match.

    To search for files older than an amount of time, include a plus or minus sign. If a plus sign (+) precedes the amount of time, find searches for times greater than this amount. If a minus sign (-) precedes the time measurement, find searches for times less than this amount. The plus and minus zero days designations are not the same: +0 in days means "older than no days," or in other words, files one or more days old. Likewise, -5 in minutes means "younger than 5 minutes" or "zero to four minutes old".

    There are several switches used to test the access time, which is the time a file was last read. The -anewer switch checks to see whether one file was accessed more recently than a specified file. -atime tests the number of days ago a file was accessed. -amin checks the access time in minutes.

    Likewise, you can check the inode change time with -cnewer, -ctime, and -cmin. The inode change time is updated whenever the file's contents or its metadata (permissions, ownership, link count) change; it is not the time the file was created. You can check the modified time, which is the time a file was last written to, by using -newer, -mtime, and -mmin.

    To find files that haven't been changed in more than one day:

    find /etc -name "*.conf" -type f -mtime +0

    To find files that were modified in the last hour:

    [root@test01 etc]# find /etc -type f -mmin -60
    /etc/sudoers
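
    The two time tests above can be reproduced on scratch files. In this sketch one file is backdated with touch -t; the file names are hypothetical, and -mmin assumes GNU or BSD find.

    ```shell
    workdir=$(mktemp -d)
    touch "$workdir/new.conf"
    touch -t 200001010000 "$workdir/old.conf"     # backdate to 1 Jan 2000

    stale=$(find "$workdir" -type f -mtime +0)    # modified at least a day ago
    recent=$(find "$workdir" -type f -mmin -60)   # modified within the last hour
    printf 'stale: %s\nrecent: %s\n' "${stale##*/}" "${recent##*/}"

    rm -rf "$workdir"
    ```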

    The -size switch tests the size of a file. The default measurement is 512-byte blocks, which is counterintuitive to many users and a common source of errors. Unlike the time-measurement switches, which have different switches for different measurements of time, to change the unit of measurement for size you must follow the amount with a suffix: b (512-byte blocks), c (bytes), w (two-byte words), k (kibibytes), M (mebibytes), or G (gibibytes) in GNU find. Like the time measurements, the amount can have a minus sign (-) to test for files smaller than the specified size, or a plus sign (+) to test for larger files.

    For example, use this to find files larger than 1 GiB:

    $ find / -type f  -size +1G
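
    A scaled-down sketch of the same test, assuming GNU find. Sizes are rounded up to whole units before the comparison, which is why a 1-byte file is not "more than 1k" while a 2048-byte file is. The file names are invented.

    ```shell
    workdir=$(mktemp -d)
    printf 'x' > "$workdir/small.bin"
    dd if=/dev/zero of="$workdir/big.bin" bs=1024 count=2 2>/dev/null

    big=$(find "$workdir" -type f -size +1k)    # more than one 1 KiB unit
    tiny=$(find "$workdir" -type f -size -2c)   # fewer than 2 bytes
    printf 'big: %s\ntiny: %s\n' "${big##*/}" "${tiny##*/}"

    rm -rf "$workdir"
    ```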

    find shows the matching paths on standard output. Historically, the -print switch had to be used. Printing the paths is now the default behavior for most Unix-like operating systems, including Linux. If compatibility is a concern, add -print to the end of the find parameters.

    To perform a different action on a successful match, use -exec. The -exec switch runs a program on each matching file. This is often combined with rm to delete matching files, or grep to further test the files to see whether they contain a certain pattern. The name of the file is inserted into the command by a pair of curly braces ({}) and the command ends with an escaped semicolon. (If the semicolon is not escaped, the shell interprets it as the end of the find command instead.)

    $ find . -type f -name "*.txt" -exec grep 10.10.10.10 {} \;

    More than one action can be specified. To show the filename after a grep match, include -print.

    $ find . -type f -name "*.txt" -exec grep Table {} \; -print

    find expects {} to appear by itself (that is, surrounded by whitespace). It can't be combined with other characters, such as in an attempt to form a new pathname.

    The -exec switch can be slow for a large number of files: The command must be executed for each match. When you have the option of piping the results to a second command, the execution speed is significantly faster than when using -exec. A pipe generates the results with two commands instead of hundreds or thousands of commands.
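
    Besides piping to xargs, the per-file cost of -exec can be avoided by ending the command with + instead of an escaped semicolon, so that find passes many file names to one invocation. A sketch with made-up files:

    ```shell
    workdir=$(mktemp -d)
    echo 'Table of contents' > "$workdir/a.txt"
    echo 'nothing here'      > "$workdir/b.txt"
    echo 'Table 2'           > "$workdir/c.txt"

    # {} + appends many file names to a single grep invocation,
    # unlike {} \; which starts one grep per file.
    hits=$(find "$workdir" -type f -name '*.txt' -exec grep -l 'Table' {} + | sort)
    printf '%s\n' "$hits"

    rm -rf "$workdir"
    ```

    Because grep receives several files at once, it prints file names on its own (or with -l, as here), so the extra -print is not needed.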

    The -ok switch works the same way as -exec except that it interactively verifies whether the command should run.

    $ find . -type f -name "*.txt" -ok rm {} \;
    < rm ... ./orders.txt > ? n
    < rm ... ./advocacy/linux.txt > ? n
    < rm ... ./advocacy/old_orders.txt > ? n
    				

    The -ls action switch lists the matching files with more detail.

    The -printf switch makes find act like a searching version of the statftime command. The % format codes indicate what kind of information about the file to print. Many of these provide the same functions as statftime, but use a different code.

    Grep alternatives

    There are also several variants of grep that can search directly in archives, for example gzgrep and bzgrep. gzgrep is a wrapper that invokes grep on compressed or gzip'ed files. All options specified are passed directly to grep. If no file is specified, then the standard input is decompressed and fed to grep. Otherwise the given files are uncompressed if necessary and fed to grep.

    Grep has one useful option for grepping files extracted from an archive:

    --label=LABEL
    Display input actually coming from standard input as input coming from file LABEL.
    This is especially useful for tools like zgrep, e.g. gzip -cd foo.gz | grep --label=foo something
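
    A runnable version of that example, assuming GNU grep and gzip are installed. The -H option is added here because grep prints no file name when it has a single input; the archive name foo.gz and its contents are hypothetical.

    ```shell
    workdir=$(mktemp -d)
    printf '%s\n' 'nothing' 'something useful' | gzip -c > "$workdir/foo.gz"

    # -H forces the file-name prefix; --label replaces "(standard input)".
    labelled=$(gzip -cd "$workdir/foo.gz" | grep -H --label=foo 'something')
    printf '%s\n' "$labelled"

    rm -rf "$workdir"
    ```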

    Clearly, as one of the oldest Unix utilities, grep can be improved. There are several alternative implementations, each of which is better than the original grep in several major ways, but not enough to displace grep.

    Dr. Nikolai Bezroukov



    Old News ;-)

    [Nov 15, 2018] Is Glark a Better Grep? Linux.com

    Nov 15, 2018 | www.linux.com

    Is Glark a Better Grep? GNU grep is one of my go-to tools on any Linux box. But grep isn't the only tool in town. If you want to try something a bit different, check out glark, a grep alternative that might be better in some situations.

    What is glark? Basically, it's a utility that's similar to grep, but it has a few features that grep does not. This includes complex expressions, Perl-compatible regular expressions, and excluding binary files. It also makes showing contextual lines a bit easier. Let's take a look.

    I installed glark (yes, annoyingly it's yet another *nix utility that has no initial cap) on Linux Mint 11. Just grab it with apt-get install glark and you should be good to go.

    Simple searches work the same way as with grep: glark string filenames. So it's pretty much a drop-in replacement for those.

    But you're interested in what makes glark special. So let's start with a complex expression, where you're looking for this or that term:

    glark -r -o thing1 thing2 *

    This will search the current directory and subdirectories for "thing1" or "thing2." When the results are returned, glark will colorize the results and each search term will be highlighted in a different color. So if you search for, say "Mozilla" and "Firefox," you'll see the terms in different colors.

    You can also use this to see if something matches within a few lines of another term. Here's an example:

    glark --and=3 -o Mozilla Firefox -o ID LXDE *

    This was a search I was using in my directory of Linux.com stories that I've edited. I used three terms I knew were in one story, and one term I knew wouldn't be. You can also just use the --and option to spot two terms within X number of lines of each other, like so:

    glark --and=3 term1 term2

    That way, both terms must be present.

    You'll note the --and option is a bit simpler than grep's context line options. However, glark tries to stay compatible with grep, so it also supports the -A , -B and -C options from grep.

    Miss the grep output format? You can tell glark to use grep format with the --grep option.

    Most, if not all, GNU grep options should work with glark .

    Before and After

    If you need to search through the beginning or end of a file, glark has the --before and --after options (short versions, -b and -a ). You can use these as percentages or as absolute number of lines. For instance:

    glark -a 20 expression *

    That will find instances of expression after line 20 in a file.

    The glark Configuration File

    Note that you can have a ~/.glarkrc that will set common options for each use of glark (unless overridden at the command line). The man page for glark does include some examples, like so:

    after-context:     1
    before-context:    6
    context:           5
    file-color:        blue on yellow
    highlight:         off
    ignore-case:       false
    quiet:             yes
    text-color:        bold reverse
    line-number-color: bold
    verbose:           false
    grep:              true
    

    Just put that in your ~/.glarkrc and customize it to your heart's content. Note that I've set mine to grep: false and added the binary-files: without-match option. You'll definitely want the quiet option to suppress all the notes about directories, etc. See the man page for more options. It's probably a good idea to spend about 10 minutes on setting up a configuration file.

    Final Thoughts

    One thing that I have noticed is that glark doesn't seem as fast as grep . When I do a recursive search through a bunch of directories containing (mostly) HTML files, I seem to get results a lot faster with grep . This is not terribly important for most of the stuff I do with either utility. However, if you're doing something where performance is a major factor, then you may want to see if grep fits the bill better.

    Is glark "better" than grep? It depends entirely on what you're doing. It has a few features that give it an edge over grep, and I think it's very much worth trying out if you've never given it a shot.

    [Oct 29, 2018] Getting all the matches with 'grep -f' option

    Perverted example, but interesting question.
    Oct 29, 2018 | stackoverflow.com

    Arturo ,Mar 24, 2017 at 8:59

    I would like to find all the matches of the text I have in one file ('file1.txt') that are found in another file ('file2.txt') using the grep option -f, that tells to read the expressions to be found from file.

    'file1.txt'

    a

    a

    'file2.txt'

    a

    When I run the command:

    grep -f file1.txt file2.txt -w

    I get only once the output of the 'a'. instead I would like to get it twice, because it occurs twice in my 'file1.txt' file. Is there a way to let grep (or any other unix/linux) tool to output a match for each line it reads? Thanks in advance. Arturo

    RomanPerekhrest ,Mar 24, 2017 at 9:02

    the matches of the text - some exact text? should it compare line to line?

    Arturo ,Mar 24, 2017 at 9:04

    Yes it contains exact match. I added the -w options, following your input. Yes, it is a comparison line by line.

    Remko ,Mar 24, 2017 at 9:19

    Grep works as designed, giving only one output line. You could use another approach:
    while IFS= read -r pattern; do
        grep -e "$pattern" file2.txt
    done < file1.txt
    

    This would use every line in file1.txt as a pattern for the grep, thus resulting in the output you're looking for.
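
    A self-contained check of the difference between the two approaches, using the two tiny files from the question. The -w option follows the original command, and the pattern variable is quoted so that patterns with spaces or glob characters survive.

    ```shell
    workdir=$(mktemp -d)
    printf '%s\n' 'a' 'a' > "$workdir/file1.txt"
    printf '%s\n' 'a'     > "$workdir/file2.txt"

    # grep -f: each input line is printed at most once.
    single=$(grep -w -f "$workdir/file1.txt" "$workdir/file2.txt")

    # One grep run per pattern line: a repeated pattern repeats the output.
    repeated=$(while IFS= read -r pattern; do
        grep -w -e "$pattern" "$workdir/file2.txt"
    done < "$workdir/file1.txt")

    printf 'grep -f:\n%s\nloop:\n%s\n' "$single" "$repeated"
    rm -rf "$workdir"
    ```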

    Arturo ,Mar 24, 2017 at 9:30

    That did the trick!. Thank you. And it is even much faster than my previous grep command.

    ar7 ,Mar 24, 2017 at 9:12

    When you use
    grep -f pattern.txt file.txt
    

    It means match the pattern found in pattern.txt in the file file.txt .

    It is giving you only one output because that is all is there in the second file.

    Try interchanging the files,

    grep -f file2.txt file1.txt -w
    

    Does this answer your question?

    Arturo ,Mar 24, 2017 at 9:17

    I understand that, but still I would like to find a way to print a match each time a pattern (even a repeated one) from 'pattern.txt' is found in 'file.txt'. Even a tool or a script rather than 'grep -f' would suffice.

    [Nov 09, 2017] Searching files

    Notable quotes:
    "... With all this said, there's a very popular alternative to grep called ack , which excludes this sort of stuff for you by default. It also allows you to use Perl-compatible regular expressions (PCRE), which are a favourite for many programmers. It has a lot of utilities that are generally useful for working with source code, so while there's nothing wrong with good old grep since you know it will always be there, if you can install ack I highly recommend it. There's a Debian package called ack-grep , and being a Perl script it's otherwise very simple to install. ..."
    "... Unix purists might be displeased with my even mentioning a relatively new Perl script alternative to classic grep , but I don't believe that the Unix philosophy or using Unix as an IDE is dependent on sticking to the same classic tools when alternatives with the same spirit that solve new problems are available. ..."
    sanctum.geek.nz

    More often than attributes of a set of files, however, you want to find files based on their contents, and it's no surprise that grep, in particular grep -R, is useful here. This searches the current directory tree recursively for anything matching 'someVar':

    $ grep -FR someVar .
    

    Don't forget the case insensitivity flag either, since by default grep works with fixed case:

    $ grep -iR somevar .
    

    Also, you can print a list of files that match without printing the matches themselves with grep -l:

    $ grep -lR someVar .
    

    If you write scripts or batch jobs using the output of the above, use a while loop with read to handle spaces and other special characters in filenames:

    grep -lR someVar . | while IFS= read -r file; do
        head "$file"
    done
    

    If you're using version control for your project, this often includes metadata in the .svn, .git, or .hg directories. This is dealt with easily enough by excluding (grep -v) anything matching an appropriate fixed (grep -F) string:

    $ grep -R someVar . | grep -vF .svn
    

    Some versions of grep include --exclude and --exclude-dir options, which may be tidier.
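
    A minimal sketch of --exclude-dir, assuming GNU grep. The directory layout is invented: one match lives in version control metadata and one in real source.

    ```shell
    workdir=$(mktemp -d)
    mkdir -p "$workdir/.git" "$workdir/src"
    echo 'someVar = 1'  > "$workdir/.git/config"
    echo 'int someVar;' > "$workdir/src/main.c"

    all=$(grep -R -l 'someVar' "$workdir" | wc -l | tr -d ' ')
    filtered=$(grep -R -l --exclude-dir=.git 'someVar' "$workdir")
    printf 'all=%s filtered=%s\n' "$all" "${filtered##*/}"

    rm -rf "$workdir"
    ```

    Unlike the grep -vF .svn filter above, this skips the directory during the search instead of discarding its results afterwards, and it cannot accidentally drop lines that merely mention the directory name.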

    With all this said, there's a very popular alternative to grep called ack, which excludes this sort of stuff for you by default. It also allows you to use Perl-compatible regular expressions (PCRE), which are a favourite for many programmers. It has a lot of utilities that are generally useful for working with source code, so while there's nothing wrong with good old grep since you know it will always be there, if you can install ack I highly recommend it. There's a Debian package called ack-grep, and being a Perl script it's otherwise very simple to install.

    Unix purists might be displeased with my even mentioning a relatively new Perl script alternative to classic grep, but I don't believe that the Unix philosophy or using Unix as an IDE is dependent on sticking to the same classic tools when alternatives with the same spirit that solve new problems are available.

    
    

    [Nov 01, 2017] Default grep options by Tom Ryder

    May 18, 2012 | sanctum.geek.nz

    When you're searching a set of version-controlled files for a string with grep , particularly if it's a recursive search, it can get very annoying to be presented with swathes of results from the internals of the hidden version control directories like .svn or .git , or include metadata you're unlikely to have wanted in files like .gitmodules .

    GNU grep uses an environment variable named GREP_OPTIONS to define a set of options that are always applied to every call to grep . This comes in handy when exported in your .bashrc file to set a "standard" grep environment for your interactive shell. Here's an example of a definition of GREP_OPTIONS that excludes a lot of patterns which you'd very rarely if ever want to search with grep :

    GREP_OPTIONS=
    for pattern in .cvs .git .hg .svn; do
        GREP_OPTIONS="$GREP_OPTIONS --exclude-dir=$pattern"
    done
    export GREP_OPTIONS
    

    Note that --exclude-dir is a relatively recent addition to the options for GNU grep , but it should only be missing on very legacy GNU/Linux machines by now. If you want to keep your .bashrc file compatible, you could apply a little extra hackery to make sure the option is available before you set it up to be used:

    GREP_OPTIONS=
    if grep --help | grep -- --exclude-dir &>/dev/null; then
        for pattern in .cvs .git .hg .svn; do
            GREP_OPTIONS="$GREP_OPTIONS --exclude-dir=$pattern"
        done
    fi
    export GREP_OPTIONS
    

    Similarly, you can ignore single files with --exclude . There's also --exclude-from=FILE if your list of excluded patterns starts getting too long.

    Other useful options available in GNU grep that you might wish to add to this environment variable include:

    If you don't want to use GREP_OPTIONS , you could instead simply set up an alias :

    alias grep='grep --exclude-dir=.git'
    

    You may actually prefer this method as it's essentially functionally equivalent, but if you do it this way, when you want to call grep without your standard set of options, you only have to prepend a backslash to its call:

    $ \grep pattern file
    

    Commenter Andy Pearce also points out that using this method can avoid some build problems where GREP_OPTIONS would interfere.

    Of course, you could solve a lot of these problems simply by using ack but that's another post.

    [Oct 31, 2017] Counting with grep and uniq by Tom Ryder

    Feb 18, 2012 | sanctum.geek.nz

    A common idiom in Unix is to count the lines of output in a file or pipe with wc -l :

    $ wc -l example.txt
    43
    $ ps -e | wc -l
    97
    

    Sometimes you want to count the number of lines of output from a grep call, however. You might do it this way:

    $ ps -ef | grep apache | wc -l
    6
    

    But grep has built-in counting of its own, with the -c option:

    $ ps -ef | grep -c apache
    6
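
    One caveat worth a sketch: -c counts matching lines, not matches. To count every occurrence, combine -o (print each match on its own line) with wc -l. The sample input is made up.

    ```shell
    sample() { printf '%s\n' 'apache apache' 'nginx' 'apache'; }

    lines=$(sample | grep -c 'apache')                      # matching lines
    hits=$(sample | grep -o 'apache' | wc -l | tr -d ' ')   # individual matches
    printf 'lines=%s hits=%s\n' "$lines" "$hits"
    ```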

    The above is more a matter of good style than efficiency, but another tool with a built-in counting option that could save you time is the oft-used uniq . The below example shows a use of uniq to filter a sorted list into unique rows:

    $ ps -ef | awk '{print $1}' | sort | uniq
    105
    daemon
    lp
    mysql
    nagios
    postfix
    root
    snmp
    tom
    UID
    www-data
    

    If it would be useful to know in this case how many processes were being run by each of these users, you can include the -c option for uniq :

    $ ps -ef | awk '{print $1}' | sort | uniq -c
        1 105
        1 daemon
        1 lp
        1 mysql
        1 nagios
        2 postfix
        78 root
        1 snmp
        7 tom
        1 UID
        5 www-data
    

    You could even sort this output itself to show the users running the most processes first with sort -rn :

    $ ps -ef | awk '{print $1}' | sort | uniq -c | sort -rn
        78 root
        8 tom
        5 www-data
        2 postfix
        1 UID
        1 snmp
        1 nagios
        1 mysql
        1 lp
        1 daemon
        1 105
    

    Incidentally, if you're not counting results and really do just want a list of unique users, you can leave out the uniq and just add the -u flag to sort :

    $ ps -ef | awk '{print $1}' | sort -u
    105
    daemon
    lp
    mysql
    nagios
    postfix
    root
    snmp
    tom
    UID
    www-data
    

    The above means I actually find myself using uniq with no options quite seldom.

    [Jul 30, 2011] pcregrep(1) grep with Perl-compatible regex - Linux man page

    pcregrep searches files for character patterns, in the same way as other grep commands do, but it uses the PCRE regular expression library to support patterns that are compatible with the regular expressions of Perl 5. See pcrepattern for a full description of syntax and semantics of the regular expressions that PCRE supports

    [Aug 4, 2009] Tech Tip View Config Files Without Comments Linux Journal

    I've been using this grep invocation for years to trim comments out of config files. Comments are great but can get in your way if you just want to see the currently running configuration. I've found files hundreds of lines long which had fewer than ten active configuration lines, it's really hard to get an overview of what's going on when you have to wade through hundreds of lines of comments.

    $ grep '^[^#]' /etc/ntp.conf

    The regex ^[^#] matches the first character of any line, as long as that character is not a #. Because blank lines don't have a first character they're not matched either, resulting in a nice compact output of just the active configuration lines. The pattern is quoted so that the shell cannot expand the brackets as a file name glob.
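
    The simple pattern still prints comments that are indented and lines that contain only spaces. A stricter variant, shown here as a sketch on a made-up config, inverts the match with -v and allows leading whitespace.

    ```shell
    sample_conf() {
        printf '%s\n' \
            '# main server' \
            'server 0.pool.ntp.org' \
            '' \
            '   # indented comment' \
            'driftfile /var/lib/ntp/drift'
    }

    simple=$(sample_conf | grep '^[^#]')
    # Drop lines that are empty, whitespace-only, or comments after optional indent.
    strict=$(sample_conf | grep -Ev '^[[:space:]]*(#|$)')
    printf '%s\n' "$strict"
    ```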

    [Mar 18, 2009] UNIX BASH scripting Highlight match with color in grep command

    You can change this color by setting the GREP_COLOR environment variable to different combinations (from the color code list given below).

    I use

    $ export GREP_COLOR='1;30;43'

    which basically highlights the matched pattern with foreground color black and background color yellow.
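
    Current GNU grep releases prefer the GREP_COLORS variable, where the mt= field sets the color of matched text, and treat the older GREP_COLOR as deprecated. A sketch of the same black-on-yellow setting in that form; --color=always is used so that the escape codes are emitted even when the output is captured.

    ```shell
    # mt= takes the same attribute;foreground;background codes as GREP_COLOR.
    coloured=$(printf 'foo bar\n' | GREP_COLORS='mt=1;30;43' grep --color=always 'bar')
    printf '%s\n' "$coloured"
    ```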

    The set display attributes list:

    0 Reset all attributes
    1 Bright
    2 Dim
    4 Underscore
    5 Blink
    7 Reverse
    8 Hidden

    Foreground Colours
    30 Black
    31 Red
    32 Green
    33 Yellow
    34 Blue
    35 Magenta
    36 Cyan
    37 White

    Background Colours
    40 Black
    41 Red
    42 Green
    43 Yellow
    44 Blue
    45 Magenta
    46 Cyan
    47 White

    [Sep 11, 2008] glark by Jeff Pace

    Ruby based

    glark offers grep-like searching of text files, with very powerful, complex regular expressions (e.g., "/foo\w+/ and /bar[^\d]*baz$/ within 4 lines of each other"). It also highlights the matches, displays context (preceding and succeeding lines), does case-insensitive matches, and automatic exclusion of non-text files. It supports most options from the GNU version of grep.

    [May 06, 2008] ack! - Perl-based grep replacement

    There are some tools that look like you will never replace them. One of those (for me) is grep. It does what it does very well (remarks about the shortcomings of regexen in general aside). It works reasonably well with Unicode/UTF-8 (a great opportunity to Fail Miserably for any tool, viz. a2ps).

    Yet, the other day I read about ack, which claims to be "better than grep, a search tool for programmers". Woo. Better than grep? In what way?

    The ack homepage lists the top ten reasons why one should use it instead of grep. Actually, it's thirteen reasons but then some are dupes. So I'd say "about ten reasons". Let's look at them in order.

    1. It's blazingly fast because it only searches the stuff you want searched.

      Wait, how does it know what I want? A DWIM-Interface at last? Not quite. First off, ack is faster than grep for simple searches. Here's an example:

      $ time ack 1Jsztn-000647-SL exim_main.log >/dev/null
      real    0m3.463s
      user    0m3.280s
      sys     0m0.180s
      $ time grep -F 1Jsztn-000647-SL exim_main.log >/dev/null
      real    0m14.957s
      user    0m14.770s
      sys     0m0.160s
      

      Two notes: first, yes, the file was in the page cache before I ran ack; second, I even made it easy for grep by telling it explicitly I was looking for a fixed string (not that it helped much, the same command without -F was faster by about 0.1s). Oh and for completeness, the exim logfile I searched has about two million lines and is 250M. I've run those tests ten times for each, the times shown above are typical.

      So yes, for simple searches, ack is faster than grep. Let's try with a more complicated pattern, then. This time, let's use the pattern (klausman|gentoo) on the same file. Note that we have to use -E for grep to use extended regexen, which ack in turn does not need, since it (almost) always uses them. Here, grep takes its sweet time: 3:56, nearly four minutes. In contrast, ack accomplished the same task in 49 seconds (all times averaged over ten runs, then rounded to integer seconds).

      As for the "being clever" side of speed, see below, points 5 and 6

    2. ack is pure Perl, so it runs on Windows just fine.

      This isn't relevant to me, since I don't use windows for anything where I might need grep. That said, it might be a killer feature for others.

    3. The standalone version uses no non-standard modules, so you can put it in your ~/bin without fear.

      Ok, this is not so much of a feature than a hard criterion. If I needed extra modules for the whole thing to run, that'd be a deal breaker. I already have tons of libraries, I don't need more undergrowth around my dependency tree.

    4. Searches recursively through directories by default, while ignoring .svn, CVS and other VCS directories.

      This is a feature, yet one that wouldn't pry me away from grep: -r is there (though it distinctly feels like an afterthought). Since ack ignores a certain set of files and directories, its recursive capabilities were there from the start, making it feel more seamless.

    5. ack ignores most of the crap you don't want to search

      To be precise:

      • VCS directories
      • blib, the Perl build directory
      • backup files like foo~ and #foo#
      • binary files, core dumps, etc.

      Most of the time, I don't want to search those (and have to exclude them with grep -v from find results). Of course, this ignore-mode can be switched off with ack (-u). All that said, it sure makes command lines shorter (and easier to read and construct). Also, this is the first spot where ack's Perl-centricism shows. I don't mind, even though I prefer that other language with P.

    6. Ignoring .svn directories means that ack is faster than grep for searching through trees.

      Dupe. See Point 5

    7. Lets you specify file types to search, as in --perl or --nohtml.

      While at first glance, this may seem limited, ack comes with a plethora of definitions (45 if I counted correctly), so it's not as perl-centric as it may seem from the example. This feature saves command-line space (if there's such a thing), since it avoids wild find-constructs. The docs mention that --perl also checks the shebang line of files that don't have a suffix, but make no mention of the other "shipped" file type recognizers doing so.

    8. File-filtering capabilities usable without searching with ack -f. This lets you create lists of files of a given type.

      This mostly is a consequence of the feature above. Even if it weren't there, you could simply search for "."

    9. Color highlighting of search results.

      While I've looked upon color in shells as kinda childish for a while, I wouldn't want to miss syntax highlighting in vim, colors for ls (if they're not as sucky as the defaults we had for years) or match highlighting for grep. It's really neat to see that yes, the pattern you grepped for indeed matches what you think it does. Especially during evolutionary construction of command lines and shell scripts.

    10. Uses real Perl regular expressions, not a GNU subset

      Again, this doesn't bother me much. I use egrep/grep -E all the time, anyway. And I'm no Perl programmer, so I don't get withdrawal symptoms every time I use another regex engine.

    11. Allows you to specify output using Perl's special variables

      This sounds neat, yet I don't really have a use case for it. Also, my perl-fu is weak, so I probably won't use it anyway. Still, might be a killer feature for you.

      The docs have an example:

      ack '(Mr|Mr?s)\. (Smith|Jones)' --output='$&'

    12. Many command-line switches are the same as in GNU grep:

      Specifically mentioned are -w, -c and -l. It's always nice if you don't have to look up all the flags every time.

    13. Command name is 25% fewer characters to type! Save days of free-time! Heck, it's 50% shorter compared to grep -r

      Okay, now we have proof that not only the ack webmaster can't count, he's also making up reasons for fun. Works for me.
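
    For comparison, points 4 to 6 can be roughly approximated with GNU grep alone. The sketch below assumes a GNU grep that supports --exclude-dir; the directory tree and file names are invented for illustration, and it covers only a small part of what ack ignores by default.

```shell
# Build a throwaway tree: one real source file, one VCS metadata file,
# and one editor backup file.
tree=$(mktemp -d)
mkdir -p "$tree/.svn" "$tree/src"
printf 'TODO: stale copy\n' > "$tree/.svn/entries"
printf 'TODO: real work\n' > "$tree/src/main.c"
printf 'TODO: editor backup\n' > "$tree/src/main.c~"

# Recurse, but skip VCS directories and backup files such as foo~.
kept=$(grep -rl --exclude-dir=.svn --exclude-dir=CVS --exclude='*~' 'TODO' "$tree")
printf '%s\n' "$kept"

rm -rf "$tree"
```

    Only src/main.c is reported. Wrapping such options in a shell alias gets closer to ack's short command lines, at the cost of maintaining the exclusion list by hand.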

    Bottom line: yes, ack is an exciting new tool which partly replaces grep. That said, a drop-in replacement it ain't. While the standalone version of ack needs nothing but a perl interpreter and its standard modules, for embedded systems that may not work out (vs. the binary with no deps beside a libc). This might also be an issue if you need grep early on during boot and /usr (where your perl resides) isn't mounted yet. Also, default behaviour is divergent enough that it might yield nasty surprises if you just drop in ack instead of grep. Still, I recommend giving ack a try if you ever use grep on the command line. If you're a coder who often needs to search through working copies/checkouts, even more so.

    Update

    I've written a followup on this, including some tips for day-to-day usage (and an explanation of grep's sucky performance).

    Comments

    René "Necoro" Neumann writes (in German, translation by me):

    Stumbled across your blog entry about "ack" today. I tried it and found it to be cool :). So I created two ebuilds for it:

    Just wanted to let you know (there is no comment function on your blog).

    [May 31, 2006] Linux.com GNU grep's new features By: Michael Stutz

    It looks like GNU grep became too overloaded with features (a "Christmas tree"). In many complex cases a custom Perl script can compete with grep.

    If you haven't been paying attention to GNU grep recently, you should be happily surprised by some of the new features and options that have come about with the 2.5 series. They bring it functionality you can't get anywhere else -- including the ability to output only matched patterns (not lines), color output, and new file and directory options.

    Granted, the addition of this feature set caused a number of bugs that made it necessary to rewrite part of the code, but the latest 2.5.1a bugfix release is eminently usable.

    One highlight of the new version is its ability to output only matched patterns. This is one of the most exciting features, because it adds completely new functionality to the tool. Remember, "grep" is an acronym -- it got its name from a function in the old Unix ed utility, global / regular expression / print -- and its purpose was to output lines from its input that match a given regular expression.

    It remains such, but the new -o option (or --only-matching) specifies that only the matched patterns themselves are to be output, and not the entire lines they come on. If more than one match is found on a single line, those matches are output on lines of their own.

    With this new option, suddenly GNU grep is transformed from a utility that outputs lines into a tool for harvesting patterns. You can use it to harvest data from input files, such as pulling out referrers from your server logs, or URLs from a file:

    egrep -o '(((http(s)?|ftp|telnet|news|gopher)://|mailto:)[^\(\)[:space:]]+)' logfile

    Or grab email addresses from a file:

    egrep -o '\<[^\@/:[:space:]]+\>@[a-zA-Z_\.]+?\.[a-zA-Z]{2,3}' somefile

    Use it to pull out all the senders from an email archive and sort into a file of unique addresses:

    grep '^From: ' huge-mail-archive | egrep -o '\<[^\@/:[:space:]]+\>@[a-zA-Z_\.]+?\.[a-zA-Z]{2,3}' | sort | uniq > email.addresses

    New uses for this feature keep popping up. You can use it, for instance, as a tool for testing regular expressions. Say you've whipped up a complicated regexp to do some task. You think it's the world's greatest regexp, it's going to do everything short of solving all the world's problems -- but at runtime, it doesn't seem to go as planned.

    Next time this happens, use the -o option when you're in the design stage, and have grep read from the standard input, where you can feed it test data -- you'll see right away whether or not it matches exactly what you think it does. Since grep will be tossing back to you not the matched lines but the actual matches to the expression, it'll give you a pretty good clue how to fix it.
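
    As a small illustration of that workflow, the sketch below feeds two made-up test lines to grep on standard input and prints only the matched text. The URL pattern is a simplified stand-in chosen for the example, not the article's full expression.

```shell
# Feed sample lines on standard input; -o prints each match on its own line.
matches=$(printf '%s\n' \
    'see http://example.com/a and https://example.org/b for details' \
    'no links on this line' |
    grep -oE '(https?|ftp)://[^[:space:]]+')
printf '%s\n' "$matches"
```

    Each URL comes back on a separate line and the second input line produces nothing, which makes it easy to see exactly where the expression starts and stops matching.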

    Output matches in color

    Use the --color option to display matches in the input in color (red, by default). Color is added via ANSI escape sequences, which don't work in all displays, but grep is smart enough to detect this and won't use color (even if specified) if you're sending the output down a pipeline. Otherwise, if you piped the output to (say) less, the ANSI escape sequences would send garbage to the screen. If, on the other hand, that's really what you want to do, there's a workaround: use --color=always to force it, and call less with the -R flag (which prints all raw control characters). That way, the color codes will be passed through correctly and you'll page through screens of text with your matched patterns in full color:

    grep --color=always "regexp" myfile | less -R
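
    The difference between automatic detection and --color=always can be checked without a pager. This sketch assumes GNU grep and simply tests whether an ESC byte (the start of an ANSI sequence) appears in the captured output; the variable names are illustrative.

```shell
esc=$(printf '\033')   # the ESC byte that starts every ANSI color sequence

# Inside $(...) standard output is a pipe, not a terminal.
auto=$(printf 'hello world\n' | grep --color=auto 'world')
forced=$(printf 'hello world\n' | grep --color=always 'world')

case $auto in *"$esc"*) auto_has_esc=yes ;; *) auto_has_esc=no ;; esac
case $forced in *"$esc"*) forced_has_esc=yes ;; *) forced_has_esc=no ;; esac
printf 'auto=%s always=%s\n' "$auto_has_esc" "$forced_has_esc"
```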

    The GREP_COLOR environment variable controls which color is used. To change the color from red to something else, set GREP_COLOR to a numeric value according to this chart:

    30	black
    31	red
    32	green
    33	yellow
    34	blue
    35	purple
    36	cyan
    37	white
    

    For example, to have matches highlighted in a shade of green:

    GREP_COLOR=32; export GREP_COLOR; grep pattern myfile

    Use Perl regexps

    One of the biggest developments in regular expressions to occur in the last few decades has been the Perl programming language, with its own regular expression dialect. GNU grep now takes Perl-style regexps with the -P option. (It's not always compiled in by default, so if you get an error message of "grep: The -P option is not supported" when you try to use it, you'll have to get the sources and recompile.)

    To search for a bell character (Ctrl-g), you can now use:

    grep -P '\cG' myfile

    This is considered a "major variant" of grep, as with the -E and -F options (which are the egrep and fgrep tools, respectively), but it doesn't yet come with an associated program name -- perhaps new versions will have a prep binary (it sounds much better than pgrep) that will mean the same thing as using -P.

    Dealing with input

    A number of new features have to do with files and input. The new --label option lets you specify a text "label" for standard input. Where it's really useful is when you're grepping a lot of files at once, plus standard input, and you're making use of the labels that grep prefixes its matches with. Normally, standard input would be the only one with a label you couldn't control -- it's always prefixed with "(standard input)" as its label. Now, it can be prefixed with whatever argument you give the --label option.
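
    A minimal sketch of --label in use, assuming GNU grep; the label text live-feed, the temporary file and the sample lines are made up for the example.

```shell
log=$(mktemp)
printf 'error: disk full\n' > "$log"

# "-" stands for standard input; --label replaces "(standard input)" in the prefix.
labeled=$(printf 'error: link down\n' | grep --label=live-feed 'error' - "$log")
printf '%s\n' "$labeled"

rm -f "$log"
```

    Because two inputs are searched, grep prefixes every match with its source, and the line read from the pipe is tagged live-feed rather than "(standard input)".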

    grep changes quick reference

    -Cx prints context lines before and after matches and must have argument x.

    --color outputs matches in color (default red).

    -D action specifies an action to take on device files (the default is "read").

    --exclude=filespec excludes files matching filespec.

    --include=filespec only searches through files matching filespec.

    --label=name makes name the new label for stdin.

    --line-buffered turns on line buffering.

    -m X stops searching input after finding X matched lines.

    -o outputs only matched patterns, not entire lines.

    -P uses Perl-style regular expressions.

    When searching through multiple files, you can control which files to search for with the --include and --exclude options. For example, to search for "linux" only in files with .txt extensions in the /usr/local/src directory tree, use:

    grep -r --include='*.txt' linux /usr/local/src
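
    The effect is easy to verify on a scratch directory. The sketch below assumes GNU grep; the tree and file names are invented, and the glob is quoted so that the shell passes it to grep instead of expanding it against the current directory.

```shell
src=$(mktemp -d)
mkdir -p "$src/sub"
printf 'linux rocks\n' > "$src/a.txt"
printf 'linux again\n' > "$src/sub/b.txt"
printf 'linux in a log\n' > "$src/c.log"

# Only *.txt files are opened; c.log is never searched.
found=$(grep -rl --include='*.txt' linux "$src" | LC_ALL=C sort)
printf '%s\n' "$found"

rm -rf "$src"
```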

    When you're recursively searching directories of files, you'll get errors when grep comes across a device file. With the new --devices option, you can specify what you want it to do on these files, by giving it an optional action. The default action is "read," which means to just read the file as any other file. But you can also specify "skip," which will skip the file entirely. Those are currently the only two methods for handling devices.

    To search for "linux" in all files on the system, excluding special device files, use:

    grep -r --devices=skip linux /

    Finally, the --line-buffered option turns on line buffering, and -m (or --max-count) gives the maximum number of matched lines to show, after which grep will stop searching the given input. For example, this command searches a huge file with line buffering, exiting after at most 10 matched lines occur:

    grep --line-buffered -m 10 pattern huge.file
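
    A scaled-down version of the same idea, with made-up input on standard input and a limit of two matches instead of ten:

```shell
# Stop reading after the second matching line; later matches are never printed.
first_two=$(printf 'match 1\nskip\nmatch 2\nmatch 3\n' |
    grep --line-buffered -m 2 'match')
printf '%s\n' "$first_two"
```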

    POSIX updates

    Some of the other updates were made so that GNU grep conforms to POSIX.2, including subtle changes in exit status.

    One of these changes is that the interpretation of character classes is now locale-dependent. That means that ranges specified in bracketed expressions like [A-Z] don't mean the same thing everywhere. If the system's current locale environment calls for its own characters or sorting, these settings will override any default character range.
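
    One common way to keep ranges predictable is to pin the locale for the single grep invocation, or to use a POSIX character class instead of a range. A short sketch with made-up input:

```shell
# In the C locale, [A-Z] covers exactly the 26 ASCII capitals.
ascii_upper=$(printf 'abc\nABC\nabC\n' | LC_ALL=C grep -c '[A-Z]')
# The class [[:upper:]] states the intent without relying on range order.
class_upper=$(printf 'abc\nABC\nabC\n' | LC_ALL=C grep -c '[[:upper:]]')
printf '%s %s\n' "$ascii_upper" "$class_upper"
```

    Both counts are 2 here, since only the lines ABC and abC contain a capital letter.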

    Another related update is a change to the old -C option, which outputs a specified number of lines of context before and after matched lines. In the past, when you used -C without an argument, grep would output two lines of before-and-after context, but now you have to give an argument; if you don't, grep will report an error and exit. That's something to look out for if you've got any old shell scripts or routines sitting around that call grep.

    [ z a z z y b o b . c o m ] -usr-share-doc-tips

    GNU grep comes with a recursive option (-r,-R) that allows you to recursively grep for a pattern through all files and any subdirectories.

    But what happens if you aren't using GNU grep? You can use find to assist...

    find /path/to/files -exec grep "pattern" {} \;

    You can, of course, provide your usual options to grep, e.g.

    find /path/to/files -exec grep -li "pattern" {} \;
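
    A variant worth considering, sketched here on an invented scratch tree: restrict find to regular files and end -exec with "+" instead of "\;" so that grep is started once per batch of files rather than once per file. This assumes a find that supports the "{} +" form, which very old implementations may lack.

```shell
root=$(mktemp -d)
mkdir -p "$root/deep/er"
printf 'needle\n' > "$root/deep/er/hit.txt"
printf 'hay\n' > "$root/miss.txt"

# -type f keeps directories away from grep; "{} +" passes many files per grep run.
hits=$(find "$root" -type f -exec grep -l 'needle' {} +)
printf '%s\n' "$hits"

rm -rf "$root"
```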

    pcregrep-4.5-1.i386 RPM

    pcregrep searches files for character patterns, in the same way as other grep commands do, but it uses the PCRE regular expression library to support patterns that are compatible with the regular expressions of Perl 5. See pcre(3) for a full description of syntax and semantics.

    If no files are specified, pcregrep reads the standard input. By default, each line that matches the pattern is copied to the standard output, and if there is more than one file, the file name is printed before each line of output. However, there are options that can change how pcregrep behaves.

    Lines are limited to BUFSIZ characters. BUFSIZ is defined in <stdio.h>. The newline character is removed from the end of each line before it is matched against the pattern.

    Re: Replacing GNU grep revisited

    Chris Costello said:
    > On Sunday, June 22, 2003, Sean Farley wrote:
    >> Reasons to consider for switching:
    >> 1. GNU's grep -r option "is broken" according to the following post.
    >>    The only thing I have noticed is that FreeGrep has more options for
    >> controlling how symbolic links are traversed.
    >>       http://groups.google.com/groups?hl=en&lr=lang_en&ie=UTF-8&selm=xzp7kchblor.fsf_flood.ping.uio.no%40ns.sol.net
    >
    >    A workaround for this problem in the meantime would be to use
    >
    >      find <directory> -type f | xargs grep EXPR
    >
    >    Just FYI.
    
    Rumors of my demise are greatly exaggerated.  And to call myself busy any
    more is an understatement.
    
    But yes, I got an email from Ted Unangst telling me about the OpenBSD move
    to FreeGrep and this pleases me greatly.  I have been glancing over their
    CVS tree (via the web) and they have made a number of changes to fix the
    bugs being discussed here.  Aside from a handful of errors (which are
    presumably correctable), the speed is still an issue.
    
    It is horribly slow when compared to the GNU version.  FreeBSD will see
    better times than OpenBSD due to some changes made to the regex code a few
    years ago which I adapted from the 4.4BSD-Lite2 code for grep, but it
    still lags behind GNU in performance.
    
    Jamie
    

    Recommended Links

    Related linux commands:

    egrep - Search file(s) for lines that match an extended expression
    fgrep - Search file(s) for lines that match a fixed string
    pgrep - Find or signal processes by name
    Why GNU grep is fast - comparison with BSD grep
    find - Search for files that meet a desired criteria
    gawk - Find and Replace text within file(s)
    locate - Find files
    sed - Stream Editor - Find and Replace text within file(s)
    tr - Translate, squeeze, and/or delete characters
    whereis - Search the user's $path, man pages and source files for a program
    BeyondGrep : ack - A tool like grep, optimized for programmers
    Equivalent Windows commands: QGREP / FINDSTR - Search for strings in files

    Recommended Articles

    [May 06, 2008] ack! - Perl-based grep replacement

    Searching in Unusual Ways and Places

    Clearly, grep is a command I can't live without. I constantly use it on its own and in pipes with other commands. For example:

    % ps -aux | egrep 'chavez|PID'
    USER      PID  %CPU  %MEM    VSZ  RSS   TTY    STAT  START   TIME  COMMAND
    chavez  14355   0.0   1.6   2556  1792  pts/2  S     10:23   0:00  -tcsh
    chavez  18684  89.5   9.6  27680  5280  ?      R N   Sep25  85:26  /home/j03/l988

    I use this command combination often enough with different usernames that I've defined an alias for it.
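
    The alias itself is not shown in the article. A minimal sketch of the same idea as a POSIX shell function might look like this (the article's prompt suggests tcsh, where an alias would be used instead); the names psfilter and psuser are hypothetical.

```shell
# Keep the header line (it contains "PID") plus any line mentioning the user.
psfilter() {
    grep -E "$1|PID"
}

# Convenience wrapper around a live process listing.
psuser() {
    ps aux | psfilter "$1"
}

# Try the filter on canned input instead of a live process table.
sample=$(printf 'USER PID COMMAND\nroot 1 init\nchavez 14355 -tcsh\n' | psfilter chavez)
printf '%s\n' "$sample"
```

    Splitting the filter from the ps call keeps the pattern testable on saved output; in daily use only psuser chavez would be typed.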

    Less Well-Known Regular Expression Constructs

    Most are familiar with the asterisk, plus sign, and question mark modifiers to regular expression items (match zero or more, one or more, or zero or one of the item, respectively). However, you can specify how many of each item should be matched even more precisely using some extended regular expression constructs (use egrep or grep -E):

    Form     Meaning

    {n}      Match exactly n of the preceding item.
    {n,}     Match n or more of the preceding item.
    {n,m}    Match at least n and no more than m of the preceding item.

    Here are some simple examples:

    % grep -E "t{2}" bio
    She has written eight books, including
    Essential Cultural Studies from Pitt. When
    she's not writing
    
    % grep -E "[0-9]{3,}" bio
    network of Unix and Windows NT/2000/XP
    systems. She
    
    % grep -E "(the ){2,}|(and ){2,}" bio
    and and creating murder mystery games. She
    you'd like to receive the the free newsletter

    The first command searches for double t's; the second command looks for numbers of three or more digits; and the third command searches for two or more consecutive instances of the word "the" or of the word "and" (it's a primitive copy editor). You might be tempted to formulate the final item as:

    (the |and ){2,}

    However, this won't work, as it will match "and the," which is not generally an error.

    Finally, be aware that the construct {,m}, which might mean "match m or fewer of the preceding item," is not defined by POSIX. GNU grep accepts it as an extension, but other implementations may not.
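
    The difference between the two formulations of the doubled-word check can be seen by counting matching lines. The three sample lines below are made up for the sketch.

```shell
text='you would like to receive the the free newsletter
he saw the cat and the dog
cooking and and creating games'

# Separate alternatives: only a word repeated back to back is flagged.
strict=$(printf '%s\n' "$text" | grep -cE '(the ){2,}|(and ){2,}')
# Shared group: any mix of "the " and "and " counts, so "and the " is flagged too.
loose=$(printf '%s\n' "$text" | grep -cE '(the |and ){2,}')
printf 'strict=%s loose=%s\n' "$strict" "$loose"
```

    The strict form flags two lines, while the shared group also flags the perfectly good "and the" in the middle line.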



    Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

    Last modified: February 19, 2020