Softpanorama

May the source be with you, but remember the KISS principle ;-)

3.1. Perl string operations


Introduction

There are two ways of performing string operations in Perl: procedural and non-procedural. Here we will discuss the procedural capabilities of Perl. Non-procedural (regular expression based) capabilities will be discussed in Chapter 5.

Procedural string handling in Perl is often simpler, more reliable and more easily debugged than other methods (debugging a complex regular expression is difficult -- especially for users without a couple of years of practice in this sport ;-) . I strongly recommend that beginners use procedural string operations as widely as possible, unless non-procedural capabilities are definitely a better fit for a particular task. In such cases you can often borrow a similar regular expression from a book or another Perl script and gradually modify it until it does what you expect.

The use of non-greedy modifiers is highly recommended for simplifying regular expressions: they are essentially a non-procedural notation for a sequential search for a substring in a larger string, and as such are less prone to errors.

 

Notes:

Please note that any string in Perl can be converted into an array and back (for example using the split and join functions), so if a given operation is simpler on arrays it might make sense to perform such a conversion and then convert the resulting array back to a string. This trick is also useful when working with words.
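
As a rough illustration of the conversion trick described above, here is a minimal sketch that reverses the order of words in a string. The helper name reverse_words is hypothetical and exists only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the order of words in a string
# by converting the string to an array and back.
sub reverse_words {
    my ($text) = @_;
    my @words = split ' ', $text;      # string -> array of words
    return join ' ', reverse @words;   # array -> string
}

print reverse_words("may the source be with you"), "\n";
# prints "you with be source the may"
```

The special pattern ' ' makes split break the string on any run of whitespace and ignore leading whitespace, which is usually what is wanted when working with words.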

All in all, Perl's capabilities for working with strings are among the best of any scripting language in existence. If you add to this that regular expressions are integrated into the language and are not a simple library add-on, it is difficult to beat Perl in this game.

What is especially important is that many problems can be viewed as (or converted to) problems of string manipulation.

Explicit conversion to string

As we have already learned, the " (double quote) is not a string delimiter in Perl -- it is an operator that concatenates everything within its scope, including variables, performing variable substitution.

Scalar variables (simple variables and elements of arrays and hashes) are substituted. A whole array is substituted too, with its elements joined by a space by default, but a whole hash is not. If you use a more complex expression, chances are that it will be recognized correctly by the syntax analyzer, but you had better test this to be sure.

The result is always a string, so double quotes can be used to convert an operand to a string -- a surrogate for type casting.

The simplest example of using double quotes to force string conversion would be:

if ("$b" eq "$a") {
   print "String '$a' equals to '$b'";
} else {
   print "String '$a' not equal '$b'";
}

If the case of letters does not matter in a comparison, the best way is to perform a preliminary explicit conversion to either lower case or upper case using the functions uc() (converts to upper case) or lc() (converts to lower case). Both functions convert their operand to a string. For example, the previous example can be generalized to:

if (lc($b) eq lc($a)) {
   print "Case insensitive string '$a' equals to string '$b'";
} else {
   print "string '$a' not equal to '$b'";
}

This is a useful idiom in Perl that helps to ensure correct comparison of strings that can be in either lower or upper case (for example user input). String comparisons are case sensitive, and often you do not know what case will be used: in an answer to a question like "Do you want to continue (yes/no)?", why would anybody want to distinguish "Yes", "yes", or even "yES"?

Mainframe tradition dictates converting everything to upper case (the first mainframe printers did not have lowercase characters), but you can be more modern and use the Unix style, which dictates converting everything to lowercase :-).

As a side note, I would like to mention that from the point of view of a language designer this possibility actually makes the second set of comparison operators in Perl redundant -- it would be simpler and less error prone to use one set of operators and require explicit conversion with quotes whenever string comparison is intended. But it's too late... Also, unless the interpreter is smart, that would involve additional (and mostly redundant) processing.

Still, the solution adopted in Perl is problematic, especially for users who use several other languages in addition to Perl. Using the wrong comparison operator is one of the most common Perl errors: this proved to be a bad design decision, similar to allowing an assignment in the if statement in C (as in if (i=1) { ... } ).
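
To make the danger concrete, the following sketch compares the same pairs of values with both sets of operators. The helper names same_as_strings and same_as_numbers are hypothetical, and the "isn't numeric" warning is deliberately switched off inside the numeric comparison so that the demonstration runs quietly.

```perl
use strict;
use warnings;

# Hypothetical helpers that compare the same pair of values both ways.
sub same_as_strings {
    my ($x, $y) = @_;
    return $x eq $y ? 1 : 0;
}
sub same_as_numbers {
    my ($x, $y) = @_;
    no warnings 'numeric';   # silence "isn't numeric" for the demonstration
    return $x == $y ? 1 : 0;
}

print same_as_strings("yes", "no"), "\n";   # 0: the strings differ
print same_as_numbers("yes", "no"), "\n";   # 1: both strings are 0 as numbers
print same_as_strings("1.0", "1"), "\n";    # 0: different strings
print same_as_numbers("1.0", "1"), "\n";    # 1: the same number
```

With warnings enabled (and not suppressed as above) Perl at least reports the non-numeric operands, which is one more reason to keep warnings on.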

Concatenation operator (dot)

As we have already discussed, the dot symbol denotes the concatenation operator in Perl. The operator takes two scalars and combines them into one scalar. Both the left and the right operand are converted to strings. For example:

$line = $line . "\n"; # add a newline to the string  
$line ="$line\n"; # same thing

Like double quotes the concatenation operator can be used for casting a numeric value into string. The following (rather unrealistic) example demonstrates that when a numeric value that contains non-significant digits (0.00 in this particular case) is converted to string all non-significant digits are lost:

$a=0.00 . '';
if ( 0.00 . '' ) {
   print "$a is True\n";
} else {
   print "$a is False\n"; # Will print "0 is False"
}

A numeric literal in the program text is stored as a double-precision floating point value, and is converted back to a string only when a string is needed. So in the statement $a=0.00 . ''; the literal 0.00 becomes the number 0, the concatenation operator converts that number to the string "0", and appending the null string leaves "0". That is what the print statement shows, and the string "0" is false.

All non-significant digits are lost when a value passes through numeric representation and is then converted back to a string.

Compare with:

$a="0.00"; 
if ($a) {
   print "$a is true\n" # will print "0.00 is true"
}

In the latter example the string 0.00 will never be converted to numeric and as such will be considered true in the if statement.

It is important to know that the concatenation operator enforces a scalar context, and this means that for an array the number of elements in the array will be substituted. For example (note that the array @ARGV represents all arguments passed on the command line to the script):

print "Arguments:".@ARGV."\n"; # a very unpleasant error.

The intent seems to be to print all command line arguments, but the Perl interpretation is quite different. Please run this example and find out what will be printed. One of the possible correct solutions is:

print map { "$_\n" } @ARGV; # map function provides an implicit foreach loop. 
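
The difference between the contexts can be sketched with an ordinary array. Here @args is only a stand-in for @ARGV, and the variable names are chosen for this example.

```perl
use strict;
use warnings;

my @args = ('alpha', 'beta', 'gamma');   # stand-in for @ARGV in this sketch

my $count  = "Arguments: " . @args;              # scalar context: number of elements
my $joined = "Arguments: " . join(', ', @args);  # explicit list-to-string conversion
my $interp = "Arguments: @args";                 # interpolation joins with a space by default

print "$count\n";    # Arguments: 3
print "$joined\n";   # Arguments: alpha, beta, gamma
print "$interp\n";   # Arguments: alpha beta gamma
```

When the intent is to print the elements, join states the separator explicitly and is probably the least surprising of the three forms.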

The x Operator

This operator is called the string repetition operator and is used to repeat a string. All you have to do is put a string on the left side of the x and a number on the right side. Like this:

"----|" x 5 # This is the same as "----|----|----|----|----|"

You can use any string as the source including strings that contain newline, for example:

print ("Hello\n" x 5);
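
A few more uses of the repetition operator, as a small sketch. The variable names are illustrative only. Note that when the left operand is a list in parentheses, x repeats the list rather than a string.

```perl
use strict;
use warnings;

my $title = "Report";
my $rule  = '-' x length($title);   # "------", as long as the title
my @zeros = (0) x 5;                # list repetition: (0, 0, 0, 0, 0)
my $empty = "ab" x 0;               # a count of zero gives an empty string

print "$title\n$rule\n";
print scalar(@zeros), " zeros\n";   # 5 zeros
```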

Functions and operators to manipulate strings

Contrary to popular belief and Perl hype, regular expressions are not a "universal opener", and in many cases a procedural solution is more transparent, more easily debugged and more modifiable in the future. Perl has a full assortment of PL/1-style string operations including substr, length, index and tr (translate). These functions are very powerful and can probably accommodate 80% of the operations that you would ever want to perform on strings. Someone who hates regular expressions for religious reasons can do almost everything without them (a kind of vegetarian diet, I think ;-)

There are also several other string manipulation functions that provide for important special cases (split, chop, chomp, uc, lc, unpack etc.). They are not well designed, and we will discuss them after the major functions.

Perl also has powerful array-related functions that are useful with strings, like grep, sort, etc. We will discuss them in the part of this chapter devoted to arrays (3.2).

Please remember that one interesting idiosyncrasy of Perl is the concept of the so-called default argument, denoted $_. If you do not supply an argument to certain functions, they will operate on $_.
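
A minimal sketch of the default argument at work, with made-up sample data: both chomp and lc are called without an argument and therefore operate on $_.

```perl
use strict;
use warnings;

my @lines = ("first line\n", "SECOND LINE\n");   # sample data for this sketch
my @clean;
for (@lines) {            # each element is aliased to $_ in turn
    chomp;                # same as chomp($_); modifies the element of @lines
    push @clean, lc;      # same as lc($_)
}
print "$_\n" for @clean;
```

Because the loop variable is an alias, chomp here changes the elements of @lines themselves, while lc returns a new string and leaves them alone.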

Getting Substring: substr

Moved to Perl substr function

String searching: index and rindex functions

Moved to Perl index and rindex built-in functions

Assembling string from components, sprintf function

Moved to sprintf in Perl

Truncating last characters in strings: functions chop and chomp

Those two are special and very limited functions that are really a shame of the Perl language designers. They do just one thing and were not generalized to cover important similar situations. Those two functions are chop and chomp.

Chop function

The built-in function chop chops the last character off the end of a scalar (or of each element of an array) and returns it. Why just one, and why I cannot chop, say, ten characters is a mystery to me (for arrays, splice can remove any number of trailing elements, see part 3.2 of this chapter). I suspect that this is more an "optimization trick" than anything else. Of course chop can be imitated with the substr function, but the case is rather frequent and deserves a special "short-cut" function that does not allocate a new string but performs the operation "in place".

The most popular use is for comparing a string with a line read from a file: the line is read with its newline as part of the input, so a direct comparison fails (as we will see, chomp is the safer and better function for this). For example:

echo OK | perl -e '$a = <>;
if ($a eq "OK") {
   print "equal\n";
} else {
   print "non equal\n";   # this branch runs: $a is "OK\n"
}'

So to compensate for this we first need to chop the last character (the newline): chop($a); but it's better to use chomp (see below).

Because scalars are stored in both string and numeric representation, chop can be used for integer division of a number by 10, for example:

$n = 128;
chop($n); # $n is now 12 (128 divided by ten, remainder dropped); chop itself returned "8"

The function returns the character that was dropped, and there is no way to make it return the chopped string, which is somewhat unfortunate and can lead to errors like:

$n = chop($n); # the string truncated by one character was probably expected here

This is an error, because chop returns just the last character of the string (the character that was chopped), not the truncated string. To get the truncated string one should use substr($string, 0, -1).

At the same time, when the order in which you process the characters of a string is not important, chop can be used for processing a string character by character, the same way as substr($string,$i,1) is usually used in the forward direction.
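
A sketch of such character-by-character processing with chop follows. The helper count_vowels is hypothetical. It works on a copy of its argument because chop destroys the string it processes.

```perl
use strict;
use warnings;

# Hypothetical helper: count the vowels in a string by consuming
# a copy of it from the end, one character at a time.
sub count_vowels {
    my ($text) = @_;          # a copy; chop modifies its argument
    my $count = 0;
    while (length $text) {
        my $char = chop $text;   # removes and returns the last character
        $count++ if index('aeiouAEIOU', $char) >= 0;
    }
    return $count;
}

print count_vowels("softpanorama"), "\n";   # 5
```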

Conditionally truncating characters: chomp

Chomp is another "Perl wart". It usually removes the newline if one exists. Like chop, chomp works for both scalars and arrays. This is essentially a very limited version of the trim function known from REXX, which removes repeated leading and/or trailing characters from a string (be it newlines or blanks or whatever). Chomp can work only with a trailing string, defined by $/ (which is a plain string, not a regular expression). As the Perl documentation puts it:

This safer version of chop removes any trailing string that corresponds to the current value of $/ (also known as $INPUT_RECORD_SEPARATOR in the English module). It returns the total number of characters removed from all its arguments. It's often used to remove the newline from the end of an input record when you're worried that the final record may be missing its newline. When in paragraph mode ($/ = ""), it removes all trailing newlines from the string. When in slurp mode ($/ = undef) or fixed-length record mode ($/ is a reference to an integer or the like, see perlvar) chomp() won't remove anything. If VARIABLE is omitted, it chomps $_. Example:

while (<>) {
	chomp;	# avoid \n on last field
	@array = split(/:/);
	# ...
}

You can actually chomp anything that's an lvalue, including an assignment:

chomp($cwd = `pwd`);
chomp($answer = <STDIN>);

If you chomp a list, each element is chomped, and the total number of characters removed is returned.

The function chomp is a conditional chop which is usually used for getting rid of newlines at the ends of Perl input lines. Let's say the 'special character' is "\n" (a newline). Then statements such as:

$example = "This has a line with a newline at the end\n";
chomp($example);

remove the trailing newline from $example. In other words, chomp gets rid of newlines only, not of any last character as chop does. If the string does not contain a newline at the end, it will remain unchanged:

$example = "This doesn't have a newline";
chomp($example);

That makes chomp safer than chop.

Actually it does not need to be a newline -- newline is simply the default value of the special variable $/ (the input record separator), which contains the string that you want to be removed. This can be set to any value you want, as in:

$/ = "/"; $path = "/This/is/a/path/"; chomp($path); $/="\n";
print ($path); # will print '/This/is/a/path'

Please note that you need to restore the value of $/ unless you want to break a lot of scripts. And yes, it's ugly; it should be just chomp("/This/is/a/path/","/"), but Perl is a pretty irregular language.
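
One way to avoid restoring $/ by hand is to change it with local inside a small wrapper, so that the old value comes back automatically. The helper name chomp_with below is hypothetical, not a built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: strip one trailing separator from a string.
sub chomp_with {
    my ($text, $separator) = @_;
    local $/ = $separator;   # restored automatically when the sub returns
    chomp $text;
    return $text;
}

print chomp_with("/This/is/a/path/", "/"), "\n";    # /This/is/a/path
print chomp_with("no separator here", "/"), "\n";   # unchanged
```

Since the change to $/ is confined to the sub, code that reads input elsewhere in the script is not affected.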

Manipulating Case: uc() and lc(), ucfirst(), lcfirst()

Function uc() returns an uppercase version of the string that you give it. For example:

$name = uc("Hello");
print $name; # this will print 'HELLO'

Function ucfirst() returns a capitalized version of the string:

$name = ucfirst("hello");
print $name; # prints "Hello"

Given the absence of head, tail and truncate functions in Perl, the presence of ucfirst looks arbitrary and probably redundant, as it can be implemented using substr. To increase the usefulness of the function it would be wise to generalize it so that it can capitalize not only the first letter but any substring, specified by a second and third argument.

Symmetrically, lc() and lcfirst() return lowercased versions of strings. Function lc() returns the string in all lowercase. Function lcfirst() lowercases only the first character -- sometimes useful for names, but again this is a very limited application and the function probably needs some generalization.

One frequent use of ucfirst and lc is to get a capitalized word:

$word=ucfirst(lc($word));

This combination of ucfirst with lc is useful for other string formatting tasks. For example, let's assume that we need to format a string as a title (with each word starting with a capital letter). Here is a very simple solution for this problem:

@words=split(/\s+/,$title);
foreach $w (@words) {
   $w=ucfirst(lc($w)); # we are using a side effect of the foreach loop: $w is an alias for the array element
}
$title=join(' ',@words);

Usually articles like "a" and "the" are not capitalized in titles so we can modify the code to accomplish this in the following way:

@words=split(/\s+/,lc($title));
foreach $w (@words) {  
   next if ($w eq 'a' || $w eq 'the');    
   $w=ucfirst($w);
} 
$title=join(' ',@words);

The same effect can be achieved in a slightly more compact way using map instead of the foreach loop. This modification we leave as an exercise for the reader.
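
For readers who want to check their answer to the exercise, one possible map-based version might look like the sketch below. The function name title_case and the %small lookup hash are choices made for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper: title-case a string, leaving "a" and "the" in lower case.
sub title_case {
    my ($title) = @_;
    my %small = (a => 1, the => 1);   # words that stay uncapitalized
    return join ' ',
        map { $small{$_} ? $_ : ucfirst($_) }
        split ' ', lc($title);
}

print title_case("the return OF a king"), "\n";   # the Return Of a King
```

As in the foreach version, an article at the very beginning of the title is left in lower case. Handling that case is a natural next refinement.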

Related functions

Function split is discussed in the regular expression chapter because, while it is essentially a string parsing function, understanding of regular expressions is essential for utilizing its full power. See Split function.

The split function breaks up a string based on some delimiter (which can be either a string or a regular expression). In a list context, it returns a list of the things that were found. In a scalar context, it returns the number of things found. The most interesting cases of split usage involve regular expressions. It is discussed in Chapter 5 (5.5 Perl Split function).

Function chr(NUMBER) returns the character represented by NUMBER in the ASCII table. For instance, chr(65) returns the letter A.

Function join(STRING, ARRAY) -- Returns a string that consists of all of the elements of ARRAY joined together using STRING as a delimiter. For instance, join(">>", ("AA", "BB", "cc")) returns "AA>>BB>>cc".
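
A short sketch combining the two functions just described. It also uses the built-in ord, which is not discussed above and is the inverse of chr. The variable names are illustrative.

```perl
use strict;
use warnings;

my $letter = chr(65);                             # "A"
my $code   = ord("A");                            # 65, the inverse of chr
my $alpha  = join '', map { chr($_) } 65 .. 70;   # "ABCDEF"
my $arrows = join ">>", ("AA", "BB", "cc");       # "AA>>BB>>cc"

print "$letter $code $alpha $arrows\n";
```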

Function hex(EXPR) returns the decimal value of an expression interpreted as a hex string. If EXPR is omitted, uses $_. The hex function can handle strings with or without a leading 0x or 0X:

$x = hex("0xa2"); # $x is 162
$x = hex("a2"); # $x is 162
$x = hex(0xa2); # $x is 354 (!): the literal 0xa2 is the number 162, and hex("162") is 354

Function oct(EXPR) returns the decimal value of an expression interpreted as an octal string. If EXPR is omitted, uses $_:

$x = oct("042"); # $x is 34
$x = oct("42"); # $x is 34
$x = oct("0x42"); # $x is 66: a leading 0x makes oct treat the string as hex
$x = oct(042); # $x is 28 (!): the literal 042 is the number 34, and oct("34") is 28

Implementation of  some additional useful functions (trim and scan)

The first frequent operation that is not among the built-in functions of Perl is trimming (trim, ltrim and rtrim): removal of blanks from both ends of the string, or from the left or the right end only, correspondingly. You can think of it as a generalization of chomp. You can implement it with regular expressions, for example:

sub trim {
   for (@_) { s/^\s+//; s/\s+$//; }   # @_ aliases the caller's variables
}
sub ltrim {
   s/^\s+// for @_;
}
sub rtrim {
   s/\s+$// for @_;
}

This implementation accepts one or several variables and applies the same operation to each of them in place; the arguments must be modifiable variables, not string constants.

The second function that is often useful is scan. This function removes the first word from the string passed as an argument and returns that word as the result. If there are no words in the string, the function should return an empty string.

sub scan {
   if ($_[0] =~ s/^\s*(\S+)\s*//) {   # also works when the string holds a single word
      return $1;
   } else {
      return '';
   } # if
}

You can generalize it to multiple arguments the way the previous function was implemented. That is left as an exercise for the reader.

Another useful function, the implementation of which can be a useful exercise for the reader, is subword. It should work like substr but count words rather than characters:

subword(string, n[, length])

Examples:

$this_string = subword("Where is this string",3,2); # returns the string "this string"
$third_word = subword("Where is this string",3);
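
One possible shape of such a function is sketched below. The text above does not say what a call without the length argument should return. This sketch assumes the substr-like behaviour (everything from word n to the end of the string) and assumes that words are numbered from 1, as the first example suggests.

```perl
use strict;
use warnings;

# Sketch of subword(STRING, N [, LENGTH]) under the assumptions stated above:
# words are numbered from 1; without LENGTH the rest of the string is returned.
sub subword {
    my ($text, $n, $length) = @_;
    my @words = split ' ', $text;
    return '' if $n < 1 || $n > @words;
    my $last = defined $length ? $n + $length - 2 : $#words;
    $last = $#words if $last > $#words;   # do not run past the last word
    return join ' ', @words[$n - 1 .. $last];
}

print subword("Where is this string", 3, 2), "\n";   # this string
print subword("Where is this string", 2), "\n";      # is this string
```

The string is converted to an array of words and back, which is the same conversion trick mentioned in the notes at the beginning of this section.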

Summary

Perl has an impressive array of string manipulation functions that can supplement its regular expression based string manipulation capabilities. Novices should probably avoid overusing regular expressions until they become more confident in understanding the associated semantics. When a task maps clearly onto classic string functions like substr and index, using them also leads to clearer programs that are easier to modify and maintain.

Several important points:

Questions

1. What will the following fragment  print ?

$name='softpanorama';
if ( index($name, 's') > -1 ) {
   printf("String '%s' has 's' in it\n", $name);
}

2. What will the following fragment print:

$string='softpanorama'; 
@c = split(//, $string);
print "$c[0]$c[4]$c[2]$s[-3]$s[3]\n";

3. What will the following fragment print?

$str1='remember';
$str2='Perl';
$str3='warts';
$left = $str1 . " " . $str2 . " " . $str3;
$right = "$str1 $str2 $str3";

if ($left == $right) {
  print "strings are equal\n";
} else {
  print "strings are unequal\n"
}

Additional Reading




Etc

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available in our efforts to advance understanding of environmental, political, human rights, economic, democracy, scientific, and social justice issues, etc. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit exclusivly for research and educational purposes.   If you wish to use copyrighted material from this site for purposes of your own that go beyond 'fair use', you must obtain permission from the copyright owner. 


Copyright © 1996-2016 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License.


Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...


Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: December 17, 2016