Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)

Introduction to Perl for Unix System Administrators

(Perl without excessive complexity)

by Dr Nikolai Bezroukov


Ch 4. File handling

Version 0.72

Filehandles and standard files

To access files, Perl uses so-called filehandles. There are two types of filehandles in Perl -- standard and user-defined. As in C, there are just three standard filehandles in Perl: STDIN (standard input), STDOUT (standard output), and STDERR (standard error).

You can also read from and write to any other file(s). To access a file from your Perl script, you must perform the following steps:

1. Unless you use a standard filehandle (and in a couple of other cases; see the <> operator below), your script must first open the file. This operation binds a filehandle to a particular file. It is usually done with the open statement, which tells the system which file your Perl script wants to access and how it will access it (read, write, or append). The open operation associates your filehandle with an internal data structure for this file, which is used by all subsequent operations.

2. The script can then perform only the operations permitted by the mode in which the file was opened -- either reading from the file or writing to it.

3. After all operations on the file are complete, the script may close the file. This tells the system that your script no longer needs access to the file and disconnects the file from its filehandle. If you don't do this, the system closes the file automatically when your script finishes execution.

Writing to a file is buffered by default. To ask Perl to flush immediately after each write or print command, set the special variable $| to 1. The setting applies to the currently selected output filehandle (STDOUT unless you change it with select). It is very helpful when you are printing to a web browser in a CGI script or writing to a socket.
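
The following is a minimal sketch of how $| might be set for STDOUT and for a user-defined filehandle. The helper name autoflush_on and the log file name are hypothetical, and the sketch relies on select() to switch the current output handle temporarily.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn on autoflush for the handle passed in, then restore the
# previously selected handle. $| always refers to the handle that is
# currently selected for output.
sub autoflush_on {
    my ($handle) = @_;
    my $previous = select($handle);   # select() returns the old handle
    $| = 1;
    select($previous);
    return;
}

$| = 1;                               # unbuffer STDOUT (the default handle)

my $logname = "flush_demo.log";       # hypothetical file name
open(SYSLOG, ">$logname") || die("unable to open $logname: $!\n");
autoflush_on(\*SYSLOG);
print SYSLOG "written immediately\n"; # reaches the file before close
print "log size so far: ", -s $logname, " bytes\n";
close(SYSLOG);
unlink($logname);
```

Without the call to autoflush_on, the size reported before close would normally be zero, because the line would still be sitting in the buffer.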

Opening a File

Opening a file essentially associates a file name in the filesystem with a filehandle. To open a file, call the built-in function open():

open(SYSPASS, "/etc/passwd");
     |        |_____________ the path to the file to be opened
     |______________________ filehandle


The first argument is the filehandle. It is used in all other operations on this file. I recommend naming all your filehandles with the prefix SYS; that makes the code a little more readable.

After the file has been opened, your Perl script accesses the file by referring to this handle. You can think of a filehandle as a pointer to the control block that the operating system allocated for the file.

The second argument is the name of the file you want to open. You can supply either the full pathname, as in /etc/passwd, or a relative pathname. If only the filename is supplied, the file is assumed to be in the current working directory. In Windows you can supply the pathname using Unix conventions (with "/" as the delimiter), but you still need to specify a logical disk.

By default, Perl assumes that the file is to be opened for reading. To open a file for writing, put a > (greater than) character in front of the filename (as in the Unix shell):

open(SYSOUT, ">myoutput.txt");

When you open a file for writing, any existing file with that name is overwritten.

The analogy with Unix shell notation holds for appending too -- to append to an existing file, use ">>" in front of the filename:

open(SYSOUT, ">>myoutput.txt");

The notation in the open statement was borrowed from the I/O redirection notation of Unix shells. If you can do something additional using this notation in a Unix shell, you will most probably be able to do it in the Perl open statement as well.

For example, you can use the "<" sign to open a file for reading. Sometimes you need to explicitly open standard input (usually the keyboard) and standard output (usually the screen):

open(STDIN, '-');	# Open standard input
open(STDOUT, '>-');	# Open standard output

The table below summarizes the three major opening modes in Perl:

Mode         Statement                      Effect
read mode    open(SYSIN, $fname);           Enables the script to read the existing contents of the
             open(SYSIN, "<$fname");        file but does not enable it to write into the file
write mode   open(SYSOUT, ">$fname");       Destroys the current contents of the file and overwrites
                                            them with the output supplied by the script
append mode  open(SYSOUT, ">>$fname");      Appends output supplied by the script to the existing
                                            contents of the file
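
As a rough illustration of the three modes, the sketch below writes a scratch file, appends to it, and then reads it back. The file name modes_demo.txt is hypothetical, and the file is removed at the end.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $fname = "modes_demo.txt";    # hypothetical scratch file

# Write mode: creates the file, or truncates it if it already exists.
open(SYSOUT, ">$fname") || die("unable to open $fname for writing: $!\n");
print SYSOUT "first line\n";
close(SYSOUT);

# Append mode: existing contents are kept, new output goes to the end.
open(SYSOUT, ">>$fname") || die("unable to open $fname for appending: $!\n");
print SYSOUT "second line\n";
close(SYSOUT);

# Read mode: the file must already exist and cannot be written to.
open(SYSIN, "<$fname") || die("unable to open $fname for reading: $!\n");
my @lines = <SYSIN>;
close(SYSIN);

print "the file now holds ", scalar(@lines), " lines\n";
print @lines;

unlink($fname);
```

If the second open used ">" instead of ">>", only "second line" would survive, which is the most common way to lose data by accident.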

Checking Whether the Open Succeeded

You can use the return value of the open() function to test whether the file is actually available, and exit the program or take some other appropriate action if it is not. open() returns true (a non-zero value) if it succeeds. For exiting the script with a message, Perl provides the built-in function die(). For example:

$fname="/etc/passwd";
unless (open(SYSIN, "<$fname")) {
        die("unable to open $fname for reading. Reason: $!\n");
}

Note that this is an example where the second form of the if statement (unless) is really useful, because we need to take action only if the operation fails. Please note that the unless statement must end with two closing parentheses:


unless (open(SYSIN, "<$fname")) {
            |________________|
       |______________________|

If, God forbid, you miss one, the Perl diagnostic is really misleading.

But more often this logic is written using a simpler and more transparent Perl idiom which came from shell:

open(SYSIN, "<$fname") || die("unable to open $fname for reading. Reason: $!\n");

You will often see scripts that use this idiom, which is based on the properties of the short-circuit || (logical OR) operator: it evaluates the first operand, and if that is true, it never evaluates the second operand.

If open returns false, you can find out what went wrong by examining the built-in variable $! or by using the file-test operators, which are discussed below. As documented in perlvar, $! holds the error of the last failed system call: in string context it yields the corresponding error message, and in numeric context the error number.

In this case you should always provide additional diagnostics about why the file cannot be opened.

Always check every open for failure. Never assume that open will succeed. Use the built-in variable $! to make the diagnostic message more informative.
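
One pitfall is worth a sketch here: the || idiom is only reliable when open is written with parentheses, because || binds more tightly than the comma that separates open's arguments. The path below is hypothetical and is assumed not to exist.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $fname = "/no/such/directory/no_such_file";   # hypothetical, should not exist

# Safe: the parentheses end the argument list, so || tests open's result.
my $ok = open(SYSIN, "<$fname") || 0;
print "open returned false: $!\n" unless $ok;

# Also safe: 'or' has lower precedence than the comma in the argument list.
open SYSIN, "<$fname" or print "still failed: $!\n";

# Pitfall: without parentheses || binds to the file name, not to open.
# Perl reads this as open(SYSIN, ("<$fname" || die(...))), and because
# the name is a non-empty string, die is never reached.
open SYSIN, "<$fname" || die("this message is never printed\n");
print "the script is still running after the unchecked open\n";
```

So either keep the parentheses, as in the examples above, or use the low-precedence or operator.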

Closing a File

When you are finished reading from or writing to a file, you can tell the system that you are finished by calling close():

close(SYSOUT);

Note that close() is not strictly required unless you want to reopen the same file later in the program (for example for writing). Perl automatically closes the file when the script terminates or when you open another file using the same filehandle.

Reading from a File

Instead of a function like READ or some kind of pipe notation, Perl uses an unconventional notation. To read one line from a file, you assign the filehandle in angle brackets to a variable. I cannot explain the advantages of this decision, which makes Perl noticeably closer to the syntactic perversions used in shell languages, but that is how it was done. For example:

$line = <SYSIN>;

If you write just the filehandle in angle brackets as the condition of a while loop, as in while (<SYSIN>), the default variable $_ is filled with the current record.

The question arises how to know when the input ends. The answer is that Perl returns the value undef for any record that you try to read after the end of the file. If the file is empty, this happens on the very first read.
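
A small sketch of the end-of-file test follows. It checks the result of each read with defined() rather than relying on its truth value, since a final record consisting of just "0" with no newline is a defined but false string. The function name count_records and the scratch file name are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the records in a file, testing explicitly for undef at end of file.
sub count_records {
    my ($fname) = @_;
    open(SYSIN, "<$fname") || die("unable to open $fname for reading: $!\n");
    my $count = 0;
    my $line  = <SYSIN>;             # undef at once if the file is empty
    while (defined($line)) {
        $count++;
        $line = <SYSIN>;             # undef once the end of file is reached
    }
    close(SYSIN);
    return $count;
}

my $fname = "records_demo.txt";      # hypothetical scratch file
open(SYSOUT, ">$fname") || die("unable to open $fname for writing: $!\n");
print SYSOUT "alpha\n", "0";         # the last record is "0" with no newline
close(SYSOUT);

print count_records($fname), " records\n";
unlink($fname);
```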

Reading all text into an array

Current PCs and servers have several gigabytes of memory. That means that when processing small to medium files it might be more convenient to read the whole file into an array of strings at once. That can be done by assigning a filehandle in angle brackets to an array. Here is the Perl code that does exactly this with the passwd file.

#!/usr/local/bin/perl
# Script to print the password file on the console like cat
$fname='/etc/passwd';
open(SYSIN, $fname) || die("unable to open $fname. Reason: $!\n");  # Open the file
@text = <SYSIN>;                # Read it into an array @text
close(SYSIN);                   # Close the file

# now we can, for example, print them
print @text;                    # Print the array

Please note that if you accept the file name from the user, you need to neutralize all "dangerous" characters in the input. Otherwise your script can be used for execution of arbitrary commands. The statement below replaces each such character with a space:

$fname =~ tr($'"<>/;!|)( );

The file is identified by the SYSIN handle, and using it on the right side of an assignment to an array means that all lines will go into the array. The statement

@text = <SYSIN>;

reads the file denoted by the filehandle SYSIN into the array @text. Note that the <SYSIN> expression reads in the entire file via an implicit loop. This happens because the read takes place in list context (the left side of the assignment is an array variable). If we replace @text by the scalar $text, then only the next line is read in. Please note that each line is stored complete with its newline character at the end.

This means that the test

$answer=<SYSIN>;
if ("OK" eq $answer) { ....}

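will be false whenever the line read ends with a newline. The right way to program such a test in Perl is to use the chomp function that we already discussed. Note that chomp modifies its argument in place and returns the number of characters removed, so it must be applied to the variable itself and its result must not be assigned to $answer:

$answer=<SYSIN>;
chomp($answer);
if ("OK" eq $answer) { ....}

Note that the statement $answer=<SYSIN> reads only one record of the file denoted by the filehandle SYSIN, because we use a scalar on the left side of the assignment statement.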
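
The sketch below puts the read, the end-of-file check, and chomp together in one small function. The name answer_is_ok and the scratch file are hypothetical; the handle is passed to the function as a reference to the bareword filehandle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one answer from the handle and report whether it is exactly "OK".
sub answer_is_ok {
    my ($handle) = @_;
    my $answer = <$handle>;
    return 0 unless defined($answer);   # end of file: nothing was read
    chomp($answer);                     # remove the trailing newline, if any
    return ("OK" eq $answer) ? 1 : 0;
}

my $fname = "answer_demo.txt";          # hypothetical scratch file
open(SYSOUT, ">$fname") || die("unable to open $fname for writing: $!\n");
print SYSOUT "OK\n", "ok\n";
close(SYSOUT);

open(SYSIN, "<$fname") || die("unable to open $fname for reading: $!\n");
print "first answer accepted\n"  if answer_is_ok(\*SYSIN);
print "second answer rejected\n" unless answer_is_ok(\*SYSIN);
close(SYSIN);
unlink($fname);
```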

Here is a very simple imitation of the Unix tail utility based on reading the whole file into memory. It prints the last 12 lines of its input:

#!/usr/bin/perl
my @text = <>;
print @text[-12 .. -1];
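
A slightly more careful variant can be sketched as follows. It computes the starting index explicitly, so input shorter than the requested number of lines is handled without touching elements that do not exist. The function name last_lines is hypothetical, and the script reads input only when file names are given, so it never waits on a terminal.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the last $n elements of a list of lines (all of them if fewer exist).
sub last_lines {
    my ($n, @text) = @_;
    my $start = @text - $n;
    $start = 0 if $start < 0;
    return @text[$start .. $#text];
}

if (@ARGV) {
    my @text = <>;             # all lines of the files named on the command line
    print last_lines(12, @text);
}
```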

Processing file one record at a time

The typical way to process a file one record at a time is to use a while loop:

#!/usr/bin/perl
# use the first command-line argument as the file name, or a default
$fname = (defined($ARGV[0])) ? $ARGV[0] : "example.txt";
open FILE, "$fname" or die "cannot open the file $fname. Reason: $!\n";
my $lineno = 1;
while (<FILE>) {   # each record is placed in $_
   print "$lineno, $_";
   $lineno++;
}

The Perl idiom while (<FILE>) actually means while (defined($_ = <FILE>)). After the file ends, $_ is assigned the value undef, so the loop ends.
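
As an aside that goes beyond the text above, Perl also keeps its own record counter in the special variable $. (the line number of the last filehandle read), so a hand-maintained counter is optional. A minimal sketch, with the hypothetical function name numbered_lines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the lines of a file, each prefixed with its line number taken from $.
sub numbered_lines {
    my ($fname) = @_;
    my @out;
    open(SYSIN, "<$fname") || die("cannot open the file $fname. Reason: $!\n");
    while (my $line = <SYSIN>) {     # Perl adds the defined() test implicitly
        push(@out, "$., $line");
    }
    close(SYSIN);                    # an explicit close also resets $.
    return @out;
}

print numbered_lines($ARGV[0]) if @ARGV;
```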

Writing to a File

Please note that you first need to open the file for writing and check that the open operation succeeded. As with reading, this test is usually written using the simpler shell-style idiom:

open(SYSOUT, ">$fname") || die("unable to open $fname for writing. Reason $!\n");

It is also essential to let the user know if the open operation failed. For example, the user might not have permission to access a certain file, or there may be no space left on the drive. There is never really a good reason to skip diagnostics.

To write to a file, specify the file handle when you call the function print():

print SYSOUT "Test\n";

The filehandle must be the first parameter of the print function, and it is not followed by a comma. It does not matter whether you are writing a new file or appending to an existing one.

Writing to a file is buffered by default. To ask Perl to flush immediately after each write or print command, set the special variable $| to 1. The setting applies to the currently selected output filehandle (STDOUT unless you change it with select). It is very helpful when you are printing to a web browser in a CGI script or writing to a socket.

We can write the whole file at once if the content of the file is in an array:

print SYSOUT  @text;

The copying procedure is simple enough: read a line from the source file, and then write it to the destination:

while (<IN>) {
   print OUT $_ ;
}
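
The loop above assumes that IN and OUT are already open. A complete version might look like the sketch below; the function name copy_file is hypothetical, and the file names are taken from the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy $from to $to line by line and return the number of lines copied.
sub copy_file {
    my ($from, $to) = @_;
    open(IN,  "<$from") || die("unable to open $from for reading. Reason: $!\n");
    open(OUT, ">$to")   || die("unable to open $to for writing. Reason: $!\n");
    my $count = 0;
    local $_;                        # do not clobber the caller's $_
    while (<IN>) {
        print OUT $_;
        $count++;
    }
    close(IN);
    # close can fail when buffered output cannot be written (e.g. a full disk)
    close(OUT) || die("unable to finish writing $to. Reason: $!\n");
    return $count;
}

if (@ARGV == 2) {
    my $copied = copy_file($ARGV[0], $ARGV[1]);
    print "$copied lines copied\n";
}
```

Checking the result of close on the output file is the only way to learn that the last buffered block could not be written.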

Getting filenames from the Command-Line

Perl enables you to use the command-line arguments any way you want by defining a special array variable called @ARGV. When a Perl script starts up, this variable contains a list consisting of the command-line arguments. For example, the command

$ script6_12 myfile1 myfile2

sets @ARGV to the list

("myfile1", "myfile2")

In Unix the shell you are running (sh, csh, or whatever you are using) is responsible for turning a command line such as

myscript *.c

into arguments. In Windows your script is responsible for interpreting such arguments.

  As with all other array variables, you can access individual elements of @ARGV. For example, the statement

$var = $ARGV[0];

assigns the first element of @ARGV to the scalar variable $var. You can even assign to some or all of @ARGV if you like. This is not always a perversion: you can provide default values this way after checking that the user did not supply them. For example:

unless (scalar(@ARGV)) {
	$ARGV[0] = "/home/nnb/"; # set default for the first argument
}

As with any array, to determine the number of command-line arguments use the scalar built-in function. We can also assign the array to a scalar variable:

$args_number = @ARGV;

C programmers expect the first element of @ARGV to contain the name of the script. This is not the case in Perl: $ARGV[0] is the first argument, and the name of the script is available in the special variable $0.
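
The points above can be combined in a short sketch. The helper with_default is hypothetical; it returns the argument list unchanged, or a one-element list holding the default when no arguments were supplied.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the argument list, or a list holding only the default when it is empty.
sub with_default {
    my ($default, @args) = @_;
    @args = ($default) unless @args;
    return @args;
}

my @args = with_default("/home/nnb/", @ARGV);   # default path from the text above

print "script name:             $0\n";          # not part of @ARGV
print "arguments supplied:      ", scalar(@ARGV), "\n";
print "arguments after default: ", scalar(@args), "\n";
print "first argument:          $args[0]\n";
```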

# search.pl -- this program will search all files for a word
# and print total number of lines that contain the word
# format
#     search.pl word file1 file2 ...
print ("Word to search for: $ARGV[0]\n");
$wc=0;   # total for all files, so it is initialized once, before the loop
for ($fc=1; $fc<@ARGV; $fc++) {   # $ARGV[0] is the word; files start at index 1
   unless (open (SYSIN, $ARGV[$fc])) {
     die ("Can't open input file $ARGV[$fc]. Reason: $!\n");
   }
   while ($line = <SYSIN>) {
      if (index($line,$ARGV[0])>-1) {$wc++} # check if the line contains the word
   }
   close (SYSIN); # close the file before opening the next one
}
print ("total number of lines that contain $ARGV[0]: $wc\n");

The <> Operator and reading from the sequence of files

In many programming languages (Pascal, Modula-2) the sequence <> (usually called diamond) is used as "not equal". Unfortunately, here as in some other places Perl redefines that meaning in a new and controversial way -- we can think of it as a victim of Larry Wall's fascination with digrams ;-).

The diamond (<>) operator in Perl is an input operator that reads a sequence of files given as command-line arguments. That means it contains a hidden reference to the array @ARGV:

  1. When the Perl interpreter encounters the <> operator for the first time, the action depends on whether command-line arguments are present (that is, whether @ARGV is non-empty). If they are, it opens the file whose name is stored in $ARGV[0]. If not, it reads from STDIN.
  2. After opening the file it executes shift(@ARGV);
  3. When the <> operator exhausts an input file, the Perl interpreter closes the file, goes back to step 1, and repeats the cycle.
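
The steps above can be sketched with an explicit loop. This is only an approximation of what <> does internally, not its actual implementation; the function name read_all is hypothetical, and the sketch works on a copy of the name list instead of @ARGV itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read every line from the files named in @names, one file after another,
# roughly the way <> walks through @ARGV. Falls back to STDIN ('-') when
# no names are given.
sub read_all {
    my (@names) = @_;
    @names = ('-') unless @names;            # step 1: no arguments, use STDIN
    my @lines;
    while (@names) {
        my $fname = shift(@names);           # step 2: take the next name
        unless (open(SYSIN, $fname)) {
            warn("Can't open $fname: $!\n"); # <> also warns and moves on
            next;
        }
        push(@lines, <SYSIN>);               # read until this file is exhausted
        close(SYSIN);                        # step 3: close, go to the next file
    }
    return @lines;
}

print read_all(@ARGV) if @ARGV;
```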

The diamond operator can be used to imitate the behavior of standard Unix utilities that work with files.

That simplifies writing scripts that behave like UNIX commands that accept any number of files as arguments:

cat file1 file2 file3 ...

The cat command writes to STDOUT all of the files specified on the command line, starting with file1.

We can simulate this behavior in Perl using the <> operator:

# perlcut.pl
while (<>) { print; }

The script operates on all of the files specified on the command line in order, starting with file1. When file1 has been processed, the script then proceeds on to file2, and so on until all of the files have been exhausted.

When it reaches the end of the last file on the command line, the <> operator returns the undef value. However, if you call the <> operator after this, it will try to read from STDIN. (Recall that <> reads from STDIN if there were no arguments on the command line.) This means that you have to be more careful when you use <> than when you are reading using <SYSFILE> (where SYSFILE is a filehandle). If SYSFILE has been exhausted, repeated attempts to read using <SYSFILE> continue to return the undef value because there isn't anything left to read.
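
One way to stay out of trouble is to keep each use of <> inside a function that works on a local copy of @ARGV and returns early when there is nothing to read. The sketch below does this and also uses the special variable $ARGV, which holds the name of the file currently being read by <>. The function name lines_per_file is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the lines of each named file with <>, using $ARGV (the name of the
# file currently being read) as the key.
sub lines_per_file {
    local @ARGV = @_;              # <> consumes @ARGV, so work on a local copy
    return () unless @ARGV;        # with no names <> would read STDIN instead
    my %count;
    local $_;
    while (<>) {
        $count{$ARGV}++;
    }
    # Here @ARGV is empty; another <> at this point would read STDIN.
    return %count;
}

my %count = lines_per_file(@ARGV);
foreach my $name (sort keys %count) {
    print "$name: $count{$name}\n";
}
```

Note that an empty file never enters the loop body, so it does not appear in the result.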

Working with pipes

In the open statement you specify how the file is opened: for reading, writing, appending, etc. More importantly, you can specify a pipe as your input:

open(SYSIN, "gzip -d -c $fname |");	# Read the output of gzip through a pipe

Opening Pipes

On machines running the UNIX operating system, two commands can be linked using a pipe. In this case, the standard output from the first command is linked, or piped, to the standard input to the second command.

Perl enables you to establish a pipe that links a Perl output file to the standard input file of another command. To do this, associate the file with the command by calling open, as follows:

open (SYSPOUT, "| gzip > results.gz"); #  we write to a pipe 
open (SYSPIN, "gzip -dc infile.gz |");  # we read from a pipe

The | character tells the Perl interpreter to establish a pipe. For example, you can use a pipe to send mail from within a Perl script (note that the @ in the address must be escaped inside a double-quoted string):

if (open(SYSMES, "| mail nnb\@devnull.org")) {
	print SYSMES "Hi, Nick!  An example from your book sent this!\n";
	close(SYSMES);
}

Here we need an explicit close. It will close the pipe referenced by the SYSMES handle, which tells the system that the message is complete and can be sent. The call to close actually controls the moment when the message is to be sent. (If you do not call close, SYSMES will be closed when the script terminates and only then the message will be sent).
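
For pipes, close has one more job: it waits for the command to finish and stores its exit status in $?. The sketch below reads from a pipe and reports that status. It uses the list form of a pipe open ("-|" followed by the command and its arguments), which goes beyond the notation used in this chapter, and it runs the perl interpreter itself (its path is in $^X) as a stand-in command. The function name read_from_command is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a command, collect its standard output through a pipe, and return
# the command's exit code followed by the output lines.
sub read_from_command {
    my (@command) = @_;
    # List form of a pipe open: "-|" means "read from this command".
    open(SYSPIN, "-|", @command) || die("unable to start $command[0]: $!\n");
    my @lines = <SYSPIN>;
    close(SYSPIN);                 # waits for the command and sets $?
    my $exit_code = $? >> 8;       # the exit code is in the high byte of $?
    return ($exit_code, @lines);
}

# $^X is the path of the running perl, used here as a stand-in command.
my ($code, @lines) = read_from_command($^X, "-e", 'print "one\ntwo\n"; exit 3');
print "exit code: $code\n";
print @lines;
```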

Filter Scripts in Perl

Most often one needs to write a script that performs some action on each line of a file and writes some output (also to a file). Scripts of this type are called filters. For example:

#print all successful access lines from the HTTP server log
while (<STDIN>) {  # STDIN is the standard input file like in C
   if (index($_,' 200')>-1) {print;}
} 

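In the example above, index() returns the position of the substring ' 200' in the current line ($_), or -1 if the substring is absent, and print without arguments prints $_ to standard output.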

Summary

Perl accesses files by means of filehandles. Filehandles are associated with files by the open statement.

Files can be opened in any of three modes: read mode, write mode, and append mode. A file opened in read mode cannot be written to; a file opened in either of the other modes cannot be read. Opening a file in write mode destroys the existing contents of the file. To read from an opened file, reference it using <SYSFILE>, where SYSFILE is a placeholder for the file handle associated with the file. To write to a file, specify file handle in print.

Perl defines three built-in filehandles: STDIN, STDOUT, and STDERR.

You can redirect STDIN and STDOUT by specifying < and >, respectively, on the command line. Messages sent to STDERR appear on the screen even if STDOUT is redirected to a file.

The close function closes the file associated with a particular file handle. close never needs to be called unless you want to control exactly when a file is to be made inaccessible.

You can use -w and -s tests to ensure that you do not overwrite a non-empty file.

The <> operator enables you to read data from files specified on the command line. This operator uses the built-in array variable @ARGV, whose elements consist of the items specified on the command line.

Perl enables you to open pipes. A pipe links the output of your Perl script to the input of another command, or the output of another command to the input of your script.

Homework

Q: How do you open several files for reading?
Q: Why does adding a closing newline character to the text string affect how die behaves?
Q: Which is better: to use <>, or to use @ARGV and shift when appropriate?
Q: Can I use cascading pipes as input or output?
Q: Can I connect internal functions in a Perl script via a pipe?
Q: Can I count how many command-line arguments were passed to the program?
Q: Can I write to a file and then read from it later?

Exercises

  1. Write a script that takes names of files from standard input and prints all attributes of these files like ls in Unix (or dir in DOS/Windows).
  2. Write a script that takes a list of files from the command line and examines their attributes and date of modification. If a file was created this week, print
     $name is a new file!
    where $name is a placeholder for the name of the file.
  3. Write and debug a script that copies a file named file1 to file2, replacing a selected word with a new one (the old and new words are passed as parameters).
  4. Write a script that counts the total number of bytes, words and lines in the files specified on the command line. After that, send a message to user ID postmaster indicating the total number of bytes, words and lines in each file.
  5. [Unix] Write a script that takes a list of files and indicates, for each file, whether the user has read, write, or execute permission.
  6. What is wrong with the following script?
    #!/usr/local/bin/perl
    open (OUTFILE, "outfile");
    print OUTFILE ("This is my message\n");


Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License (OPL). Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Created: November 7 1998; Last modified: September 06, 2009