
Softpanorama Bulletin
Vol 23, No.07 (July, 2012)


Redefining Perl Input record separator

Introduction

The special variable $/ is called the 'input record separator'. Usually it is set to the newline character, "\n", and a 'record' is equivalent to a line. This is very similar to AWK's RS variable, with one important difference: Perl's record separator must be a fixed string, not a pattern. Perl has three tricks up its sleeve regarding the input record separator:

  1. If the input record separator is undefined, the first read of the file will return the whole file.
  2. If the input record separator is the empty string, then the next paragraph (up to a blank line) will be read. For instance, if our data were defined in terms of paragraphs rather than lines, we could read one paragraph at a time by changing $/ .
  3. If the input record separator is a reference to an integer, then up to that number of bytes will be read each time.
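The three cases above can be sketched in one small script. This is a minimal illustration using a hypothetical temporary file; the file name and contents are invented for the demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Create a small sample file: two lines, a blank line, then one more line.
my $file = "/tmp/irs_demo.txt";
open(my $out, '>', $file) or die "can't create $file: $!";
print $out "line one\nline two\n\npara two\n";
close $out;

our ($whole, @paragraphs, $chunk);

# 1. Undefined $/: the first read slurps the entire file.
{
    local $/;                              # same as local $/ = undef;
    open(my $in, '<', $file) or die "can't open $file: $!";
    $whole = <$in>;
    close $in;
}
print length($whole), " bytes slurped\n";  # 28 bytes slurped

# 2. Empty string: each read returns one paragraph.
{
    local $/ = "";
    open(my $in, '<', $file) or die "can't open $file: $!";
    @paragraphs = <$in>;
    close $in;
}
print scalar(@paragraphs), " paragraphs\n";  # 2 paragraphs

# 3. Reference to an integer: each read returns up to that many bytes.
{
    local $/ = \8;
    open(my $in, '<', $file) or die "can't open $file: $!";
    $chunk = <$in>;                        # first 8 bytes: "line one"
    close $in;
}
print "first chunk: '$chunk'\n";
unlink $file;
```

Note the use of local so that the previous value of $/ is restored when each block exits.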

One interesting fact is that the function chomp doesn't actually remove a trailing newline character (as many introductory books incorrectly state). It removes a trailing record separator. If you set the record separator to a value different from its default of "\n", the behavior of the chop and chomp functions will change accordingly.
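A small illustration of both points, using an in-memory filehandle (available in modern Perl) and an invented "%\n" separator: the separator stays attached to the record after reading, and chomp removes the whole separator, not just the trailing newline:

```perl
#!/usr/bin/perl
use strict;
use warnings;

our ($rec, $after);
{
    local $/ = "%\n";                    # custom record separator
    my $data = "first%\nsecond%\n";
    open(my $in, '<', \$data) or die $!; # read from an in-memory filehandle
    $rec = <$in>;                        # "first%\n" -- separator still attached
    close $in;

    $after = $rec;
    chomp $after;                        # removes the whole "%\n", not just "\n"
}
print "chomp removed ", length($rec) - length($after), " characters\n";
```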

Here is a relevant quote from Perl documentation:

The input record separator, newline by default. This influences Perl's idea of what a 'line' is. Works like awk's RS variable, including treating empty lines as a terminator, if set to the null string. (An empty line cannot contain any spaces or tabs.)

You may set it to a multi-character string to match a multi-character terminator, or to undef to read through the end of file. Setting it to "\n\n"  means something slightly different than setting to "", if the file contains consecutive empty lines.

Setting to ""  will treat two or more consecutive empty lines as a single empty line.

Setting to "\n\n"  will blindly assume that the next input character belongs to the next paragraph, even if it's a newline. (Mnemonic: / delimits line boundaries when quoting poetry.)

    undef $/;           # enable "slurp" mode
    $_ = <FH>;          # whole file now here
    s/\n[ \t]+/ /g;

Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)

After a read, the record separator string remains as the last symbol(s) of the line. It is not deleted. Lines are simply split after each occurrence of the record separator.

Reading Entire Files

Often your script will be simpler if you can read the whole input file into a string or array. This is possible in Perl if $/ is set to the undefined value. This is sometimes called slurp mode, because it slurps in the whole file as one big string.  So, for instance, to read the file /etc/quotes.dat into a variable, we do this:
$/ = undef;
open(QUOTES, "/etc/quotes.dat") or die("can't open file. Possible reason: $!");
$text = <QUOTES>;
Instead of the assignment $/ = undef you may also use the undef function:
undef $/;
This is an equivalent way of putting the undef value into any variable, including $/. Here is an example of reading a whole file in Perl:
{
  local $/ = undef;
  open SYSIN, "/etc/passwd" or die "Couldn't open file: $!";
  $passwd_file=<SYSIN>;
  close SYSIN;
}

Processing HTML

One interesting way to process HTML is to use "<" as the input record separator. This way you get a so-called "prefix" notation for each tag: the tag will come first on the line, followed by the text related to that tag, if any.

Please note that in this case each line will end in "<".

You can go further than that and use an opening or closing tag as a record separator, such as "<li" or "<h2".
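The "prefix" notation can be sketched like this, on a hypothetical HTML snippet read through an in-memory filehandle (real-world HTML is better handled by a proper parser module; this only illustrates the record-separator trick):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = "<h2>Heading<p>Some text<ul><li>one<li>two</ul>";
our @pairs;
{
    local $/ = "<";
    open(my $in, '<', \$html) or die $!;   # in-memory filehandle
    while (my $chunk = <$in>) {
        chomp $chunk;                      # strip the trailing "<"
        next unless length $chunk;         # the very first read is just "<"
        # Each chunk now starts with the tag, followed by its text, if any.
        if (my ($tag, $text) = $chunk =~ /^([^>]*)>(.*)\z/s) {
            push @pairs, [$tag, $text];
            print "tag=$tag text='$text'\n";
        }
    }
    close $in;
}
```

On the sample string this prints one line per tag, e.g. `tag=h2 text='Heading'` and `tag=li text='one'`.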

Reading Paragraphs at a Time

If you set the input record separator, $/ , to the empty string, "", Perl reads in one paragraph at a time. Paragraphs must be separated by a completely blank line, with no spaces on it at all. Of course, you can use split or a similar function to extract individual lines from each paragraph; for example, you could create a 'paragraph summary' by printing out the first line of each paragraph in a file.

Let's try to write a simple email summarizer which first splits an email into header and body and then prints the first three lines of the body:

#!/usr/bin/perl
$/ = "";
$header = <>; # read mail header
$body = <>;   # read mail body
@body_text = split(/\n/, $body);
print "Total number of lines in email body: " . scalar(@body_text) . "\n";
$limit = (scalar(@body_text) > 3) ? 3 : scalar(@body_text);
for ($i = 0; $i < $limit; $i++) {
   print "\t$body_text[$i]\n";
}
The key idea here is that after we get the whole body as a single string $body, we can split it using newlines as a delimiter. The diamond operator (<>) reads blocks of text until the record separator is found. Since the separator is set to the empty string, a whole paragraph is slurped into each string.

Reading fixed-length records

If $/ is a reference to an integer, then each read operation will read up to that number of bytes.

$/ = \256; # Set IRS to fixed length records

while ( <> ) {
    ...
}

Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:

    $/ = \32768; # or \"32768", or \$var_containing_32768
    open(SYSIN, $myfile);
    $_ = <SYSIN>;

will read a record of no more than 32768 bytes from SYSIN. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces.

The alternative method is to use read and unpack:

# $RECORDSIZE is the length of a record, in bytes.
# $TEMPLATE is the unpack template for the record
# SYSIN is the file to read from
# @FIELDS is an array, one element per field

until ( eof(SYSIN) ) {
    read(SYSIN, $record, $RECORDSIZE) == $RECORDSIZE 
        or die "short read\n";
    @FIELDS = unpack($TEMPLATE, $record);
}

Simply read a particular number of bytes into a buffer. This buffer then contains one record's data, which you decode using unpack with the right format.

For binary data, the catch is often determining the right format. If you're reading data written by a C program, this can mean peeking at C include files or manpages describing the structure layout, and this requires knowledge of C. On a system supporting gcc, you may be able to use the c2ph tool distributed with Perl to cajole your C compiler into helping you with this.

The tailwtmp program at the end of this chapter uses the format described in utmp (5) under Linux and works on its /var/log/wtmp and /var/run/utmp files. Once you commit to working in a binary format, machine dependencies creep in fast. It probably won't work unaltered on your system, but the procedure is still illustrative. Here is the relevant layout from the C include file on Linux:

#define UT_LINESIZE           12
#define UT_NAMESIZE           8
#define UT_HOSTSIZE           16

struct utmp {                       /* here are the pack template codes */
    short ut_type;                  /* s for short, must be padded      */
    pid_t ut_pid;                   /* i for integer                    */
    char ut_line[UT_LINESIZE];      /* A12 for 12-char string           */
    char ut_id[2];                  /* A2, but need x2 for alignment    */
    time_t ut_time;                 /* l for long                       */
    char ut_user[UT_NAMESIZE];      /* A8 for 8-char string             */
    char ut_host[UT_HOSTSIZE];      /* A16 for 16-char string           */
    long ut_addr;                   /* l for long                       */
};

Once you figure out the binary layout, feed that (in this case, "s x2 i A12 A2 x2 l A8 A16 l") to pack with an empty field list to determine the record's size. Remember to check the return value of read when you read in your record to make sure you got back the number of bytes you asked for.
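The size computation mentioned above looks like this. Note that the "i" code packs a native int, so the result can in principle vary by platform; on common platforms this layout packs to 56 bytes:

```perl
#!/usr/bin/perl
use strict;

# The pack template from the Linux utmp layout shown above.
my $template   = "s x2 i A12 A2 x2 l A8 A16 l";
# pack with an empty field list yields a record of the right size.
my $recordsize = length( pack($template, ()) );   # 2+2+4+12+2+2+4+8+16+4
print "record size: $recordsize bytes\n";
```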

If your records are text strings, use the "a" or "A" unpack templates.

Fixed-length records are useful in that the nth record begins at byte offset SIZE * (n-1) in the file, where SIZE is the size of a single record. See the indexing code in Recipe 8.8 for an example of this.
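The offset arithmetic can be sketched with seek on a toy file of 4-byte records (file name and contents are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a sample file holding four 4-byte records.
my $file = "/tmp/fixrec_demo.dat";
open(my $out, '>', $file) or die "can't create $file: $!";
binmode $out;
print $out "AAAABBBBCCCCDDDD";
close $out;

my $SIZE = 4;                            # size of a single record
my $n    = 3;                            # we want the 3rd record
our $record;
open(my $in, '<', $file) or die "can't open $file: $!";
binmode $in;
seek($in, $SIZE * ($n - 1), 0) or die "seek failed: $!";
read($in, $record, $SIZE) == $SIZE or die "short read\n";
close $in;
print "record $n: $record\n";            # record 3: CCCC
unlink $file;
```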

Reading Records with a Pattern Separator

Read the whole file and use split:

undef $/;
@chunks = split(/pattern/, <FILEHANDLE>);

Perl's record separator must be a fixed string, not a regular expression. To sidestep this limitation, undefine the input record separator entirely so that the next line-read operation gets the rest of the file. This is sometimes called slurp mode, because it slurps in the whole file as one big string. Then split that huge string using the record separating pattern as the first argument.

Here's an example, where the input stream is a text file that includes lines consisting of ".Se", ".Ch", and ".Ss", which are special codes in the troff macro set that this book was developed under. These lines are the separators, and we want to find text that falls between them.

# .Ch, .Se and .Ss divide chunks of STDIN
{
    local $/ = undef;
    @chunks = split(/^\.(Ch|Se|Ss)$/m, <>);
}
print "I read ", scalar(@chunks), " chunks.\n";

We create a localized version of $/ so its previous value gets restored after the block finishes. By using split with parentheses in the pattern, captured separators are also returned. This way the data elements in the return list alternate with elements containing "Se", "Ch", or "Ss".

If you didn't want delimiters returned but still needed parentheses, you could use non-capturing parentheses in the pattern: /^\.(?:Ch|Se|Ss)$/m .

If you just want to split before a pattern but include the pattern in the return, use a look-ahead assertion: /^(?=\.(?:Ch|Se|Ss))/m . That way each chunk starts with the pattern.
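A tiny demonstration of the look-ahead form, on sample text invented for illustration; each chunk keeps its separator line at the front:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = ".Ch one\nbody A\n.Se two\nbody B\n";
# Split before each separator line, keeping it at the start of its chunk.
our @chunks = split(/^(?=\.(?:Ch|Se|Ss))/m, $text);
print scalar(@chunks), " chunks\n";        # 2 chunks
print "chunk starts: ", (split /\n/, $_)[0], "\n" for @chunks;
```

A zero-width match at the very beginning of the string does not produce a leading empty field, so the first chunk starts cleanly at ".Ch".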

Be aware that this uses a lot of memory if the file is large. However, with today's machines and your typical text files, this is less often an issue now than it once was. Just don't try it on a 200-MB logfile unless you have plenty of virtual memory to use to swap out to disk with! Even if you do have enough swap space, you'll likely end up thrashing.

More complex example

The fortune program is a small but important part of the Unix culture. It displays a random entry from the "fortune cookie" databases on login. It first appeared in Version 7 Unix (see the manual page for the original Unix fortune(6)). A fortune database is a text file with quotations, each separated by the character % on a line of its own.

Let's slightly rework the Fortune Cookie Dispenser example from Simon Cozens' book. The fortune cookies file for the UNIX fortune program – as well as some 'tagline' generators for e-mail and news articles – consists of paragraphs separated by a percent sign on a line of its own, like this:

We all agree on the necessity of compromise. We just can't agree on 
when it's necessary to compromise. 
-- Larry Wall 
%
All language designers are arrogant. Goes with the territory... 
-- Larry Wall 
%
Oh, get a hold of yourself. Nobody's proposing that we parse English. 
-- Larry Wall
%
Now I'm being shot at from both sides. That means I *must* be right. 
-- Larry Wall 
%
Assuming that a file in this format was saved as /etc/quotes.dat, we can now write a program to pick a random quote from the file:
#!/usr/bin/perl
$/ = "\n%\n";
open QUOTES, "/etc/quotes.dat" or die "can't open /etc/quotes.dat. Reason: $!";
@file = <QUOTES>;
$random = rand(@file);
$fortune = $file[$random];
chomp $fortune; # remove the record separator from the end 
print "$fortune\n";

This is what you get (or might get – it is random, after all):

perl fortune.pl
Now I'm being shot at from both sides. That means I *must* be right.
-- Larry Wall

When we set the record separator to "\n%\n", a read operation will consume a block of text from one separator to the next, despite the fact that the block consists of several lines. Now a 'line' is everything up to a newline character, then a percent sign on its own, then another newline; when we read the file into an array, it ends up looking something like this:

my @file = (
"We all agree on the necessity of compromise.
We just can't agree on when it's necessary to compromise.
-- Larry Wall",
"All language designers are arrogant. Goes with the territory...
-- Larry Wall",
...
);
We want a random entry from the file. The function for this is rand :
my $random = rand(@file);
my $fortune = $file[$random];
The rand function produces a random number between zero and the number given as an argument. What argument do we give it? As you know, an array in a scalar context gives the number of elements in the array. rand actually generates a fractional number; when used as an array index, this number is truncated to an integer by stripping the fractional part. So the two lines above can be combined:
my $fortune = $file[rand @file];
Now we have our fortune, but it still has the record separator on the end, so we need to chomp to remove it:
chomp $fortune ;
Finally, we can print it back out, remembering that we need to put a new line on the end:
print $fortune, "\n";

Recommended Reading

The $/  variable in perlvar (1) and in the "Special Variables" section of Chapter 2 of Programming Perl; the split  function in perlfunc (1) and Chapter 3 of Programming Perl; we talk more about the special variable $/  in Chapter 8

See Also

The unpack, pack, and read functions in perlfunc (1) and in Chapter 3 of Programming Perl; Recipe 1.1


