Softpanorama

May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

Split() Function and option g in matching


Split is one of the few Perl functions that take a regular expression as an argument. Its purpose is to take a string and convert it to an array or list, breaking it at the points where the first argument (the delimiter), specified as a regular expression, matches.

Note:

With a parenthesized list, undef can be used as a dummy placeholder, for example to skip the leading values:

chomp($current_time=`date +"%H:%M:%S"`);
( undef, $min, $sec ) = split(':',$current_time);
say $min,':', $sec;

The usual syntax for the split function is

list = split (regex, string_value);

Here, string_value is the string to be split and regex is the regular expression to be searched for. Split does not require that the regular expression be enclosed in regex delimiters such as /\s+/ or qr/\s+/. It treats any quoted string as a regular expression (the single-space string ' ' is a special case, described below).

Again, it is important to understand that a new element is started every time regex matches; the matched text itself serves as a separator between elements and is not included in any of them. The resulting list of elements is returned in list.

Function split treats empty values differently if they are at the tail of the line or in front or in the middle of the line. Tail empty values are discarded. Front and middle empty values are preserved. 

For example, the following statement breaks the character string stored in $line into elements delimited by "," and stores them in the array @tokens:

@tokens = split (/,/, $line);
  DB<105> $line=',1,,3,4,5,,'

  DB<106> @v=split(/,/,$line)

  DB<107> x @v
0  ''
1  1
2  ''
3  3
4  4
5  5
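The debugger session above shows the trailing empty fields being dropped. A minimal sketch of how to keep them, assuming a negative LIMIT as described in the man page section of this page (the variable names @default and @all are illustrative only):

```perl
use strict;
use warnings;

my $line = ',1,,3,4,5,,';

# Default: trailing empty fields are dropped, leading and middle ones stay.
my @default = split /,/, $line;

# A negative LIMIT keeps the trailing empty fields as well.
my @all = split /,/, $line, -1;

print scalar(@default), " fields by default\n";
print scalar(@all),     " fields with LIMIT -1\n";
```

With seven commas in the string, the second call returns eight fields, the last two of them empty.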

You can specify the maximum number of elements of the list produced by split by specifying the maximum as the third argument. For example:

$line = "This:is:a:string";

@tokens = split (/:/, $line, 3);

As before, this breaks the string stored in $line into elements. After the first two elements have been created, no more splitting is done: the rest of the string is assigned to the third element of the array. In this case, the list assigned to @tokens is ("This", "is", "a:string").

You can also assign to several scalar variables at once; you just need to group them in a list:

$line = "11 12 13 14 15";
($var1, $var2, $line) = split (/\s+/, $line, 3);
This splits $line into the list ("11", "12", "13 14 15"). $var1 is assigned 11, $var2 is assigned 12, and $line is assigned "13 14 15". This enables you to assign the "leftovers" to a single variable, which can then be split again at a later time.

If the list on the left has more elements than the split string produces, the extra tail elements are assigned undef:

   $millisec=555;
   ( undef, $min, $sec, $millisec ) = split(':',$current_time);
   say 'undef' unless defined($millisec);
This will print "undef".

When you split a string into words, split can behave non-intuitively.

If your string starts with blanks (or other delimiters that you want to discard), the first element will be a zero-length string. This is logical when you use, for example, : as a delimiter, as in :aaa:bbb:ccc (or ::aaa:bbb:ccc), but people usually expect different behaviour with blanks. So with blanks this logical behaviour can nevertheless be an unpleasant surprise for some programmers.

$a="   aaa bbb ccc  ";
@F=split(/\s+/,$a);
for ($i=0; $i<@F; $i++) {
     print "$i: '$F[$i]'\n"
}
You will get the following, which is probably not what one wants:
0: ''
1: 'aaa'
2: 'bbb'
3: 'ccc'

And if you hope that extracting words explicitly with split(/(\w+)/,$a) will work, because the captured part of the delimiter gets into the result, you are also wrong:

$a="   aaa bbb ccc  ";

@F=split(/(\w+)/,$a);

for ($i=0; $i<@F; $i++) {
     print "$i: '$F[$i]'\n"
}
Your result will be:
   
0: '   '
1: 'aaa'
2: ' '
3: 'bbb'
4: ' '
5: 'ccc'
6: '  '

The expected behaviour can be achieved by using the g option and a "plain vanilla" regular expression:

@F=m/(\w+)/g;  # $_ is used
or
@F=($a=~m/(\w+)/g);
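As a hedged illustration of the word-extraction problem discussed above, here are two ways to get just the words from a string with leading and trailing blanks. The variable names are made up for the example; the second variant relies on the special ' ' pattern described in the man page section of this page.

```perl
use strict;
use warnings;

my $text = "   aaa bbb ccc  ";

# A global match in list context returns every match: you say what to keep.
my @by_match = $text =~ /\w+/g;

# split with the special pattern ' ' skips leading whitespace before splitting.
my @by_split = split ' ', $text;

print join('|', @by_match), "\n";
print join('|', @by_split), "\n";
```

Both lines print aaa|bbb|ccc, with no empty first element.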

Using regex options in split

It looks as if the regex in split has the g option (g modifier) implicitly set.

Both modifiers m and s also behave in split as expected:

In the example below we split text into blocks each of which starts with <h4> tag at the beginning of the line. Tags that exist somewhere deep within the line should be ignored.

$text=`cat bulletin.html`; 
@f=split(/^\s*<h4/ims,$text);
print $#f;
print $f[5];
>[Nov 18, 2017] <a href="html_error_codes.html/<a>HTML error codes</h4> 

Man page

Splits the string EXPR into a list of strings and returns that list. By default, empty leading fields are preserved, and empty trailing ones are deleted. (If all fields are empty, they are considered to be trailing.)

In scalar context, returns the number of fields found. In scalar and void context it splits into the @_  array. Use of split in scalar and void context is deprecated, however, because it clobbers your subroutine arguments.

If EXPR is omitted, splits the $_  string. If REGEX is also omitted, splits on whitespace (after skipping any leading whitespace). Anything matching REGEX is taken to be a delimiter separating the fields. (Note that the delimiter may be longer than one character.)

If LIMIT is specified and positive, it represents the maximum number of fields the EXPR will be split into, though the actual number of fields returned depends on the number of times REGEX matches within EXPR. If LIMIT is unspecified or zero, trailing null fields are stripped (which potential users of pop  would do well to remember). If LIMIT is negative, it is treated as if an arbitrarily large LIMIT had been specified. Note that splitting an EXPR that evaluates to the empty string always returns the empty list, regardless of the LIMIT specified.

As a special case for split, using the empty regex // splits the string into individual characters:

  say join(':', split(//, 'hi there'));

produces the output 'h:i: :t:h:e:r:e'.

Empty leading fields are produced when there are positive-width matches at the beginning of the string; a zero-width match at the beginning of the string does not produce an empty field. For example:

  print join(':', split(/(?=\w)/, 'hi there!'));

produces the output 'h:i :t:h:e:r:e!'. Empty trailing fields, on the other hand, are produced when there is a match at the end of the string (and when LIMIT is given and is not 0), regardless of the length of the match. For example:

  print join(':', split(//, 'hi there!', -1)), "\n";
  print join(':', split(/\W/, 'hi there!', -1)), "\n";

produce the output 'h:i: :t:h:e:r:e:!:' and 'hi:there:', respectively, both with an empty trailing field.

The LIMIT parameter can be used to split a line partially

  ($login, $passwd, $remainder) = split(/:/, $_, 3);

When assigning to a list, if LIMIT is omitted, or zero, Perl supplies a LIMIT one larger than the number of variables in the list, to avoid unnecessary work. For the list above LIMIT would have been 4 by default. In time critical applications it behooves you not to split into more fields than you really need.

If the REGEX contains parentheses, additional list elements are created from each matching substring in the delimiter.

  split(/([,-])/, "1-10,20", 3);

produces the list value

  (1, '-', 10, ',', 20)

If you had the entire header of a normal Unix email message in $header, you could split it up into fields and their values this way:

  $header =~ s/\n(?=\s)//g; # fix continuation lines
  %hdrs = (UNIX_FROM => split /^(\S*?):\s*/m, $header);

The regex /REGEX/  may be replaced with an expression to specify regexs that vary at runtime. (To do runtime compilation only once, use /$variable/o  .)

As a special case, specifying a REGEX of space (' '  ) will split on white space just as split  with no arguments does. Thus, split(' ')  can be used to emulate awk's default behavior, whereas split(/ /)  will give you as many null initial fields as there are leading spaces. A split  on /\s+/  is like a split(' ')  except that any leading whitespace produces a null first field. A split  with no arguments really does a split(' ', $_)  internally.
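The three whitespace patterns compared in the paragraph above can be tried side by side. This is only a sketch; the sample string and array names are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $s = "  a b";                 # two leading spaces

my @awk   = split ' ',   $s;     # awk-style: leading whitespace is skipped
my @space = split / /,   $s;     # one empty field per leading space
my @ws    = split /\s+/, $s;     # a single empty first field

printf "%-6s %d fields\n", "' '",    scalar @awk;
printf "%-6s %d fields\n", '/ /',    scalar @space;
printf "%-6s %d fields\n", '/\s+/',  scalar @ws;
```

The counts come out as 2, 4 and 3 respectively, which matches the description: only ' ' behaves the way most people expect when extracting words.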

A REGEX of /^/  is treated as if it were /^/m  , since it isn't much use otherwise.

Example:

open(PASSWD, '<', '/etc/passwd') or die "Cannot open /etc/passwd: $!";
while (<PASSWD>) {
   chomp;
   ($login, $passwd, $uid, $gid, $gcos, $home, $shell) = split(/:/);
   #...
}

Using round brackets in regular expressions

If you use round brackets (capturing groups) in the regular expression passed to split, the parts of the string that match the bracketed groups are not discarded but returned as additional elements:

@fields=split /(\d+)/,"Perl-1-is-2-really-3-obscure-4-language";
$i=1;
printf "%d: %s\n", $i++, $_ for @fields;
1: Perl-
2: 1
3: -is-
4: 2
5: -really-
6: 3
7: -obscure-
8: 4
9: -language

NOTE: As with regular regex matching, any capturing parentheses that do not take part in a match in split() produce an undef element in the result, which can be very confusing:

@fields = split /(A)|B/, "1A2B3";
# @fields is (1, 'A', 2, undef, 3)

This example is hard to follow and is a good specimen of "Voodoo Perl". Step by step:

  1. The pattern first matches A. The text before it, 1, becomes the first element.
  2. Because A was matched by the group in round brackets, (A), the captured text A is added as the second element. So far so good; everything is logical.
  3. The pattern next matches B, this time via the second alternative. The text before it, 2, becomes the third element. B itself is consumed as a delimiter.
  4. Split adds one element per capturing group for every delimiter match, whether or not the group took part in that match. Here (A) did not match, so its value is undef, and this mysterious undef becomes the fourth element.
  5. At last, the rest of the string, 3, becomes the final element, so the resulting array has one more element than you would expect. This is why it is called Voodoo Perl.

The practical conclusion: never use such a construct in your programs.
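If the delimiters are not needed, or if all of them should be kept, the stray undef can be avoided. The following is a sketch under those assumptions rather than the only way to do it; the array names are illustrative.

```perl
use strict;
use warnings;

my $str = "1A2B3";

# The construct from the note: group 1 is undef whenever B is the delimiter.
my @with_undef = split /(A)|B/, $str;

# No capturing group: both delimiters are simply discarded.
my @no_capture = split /A|B/, $str;

# One group that matches either delimiter: both are kept, no undef appears.
my @keep_both  = split /([AB])/, $str;

print join(',', map { defined $_ ? $_ : 'undef' } @with_undef), "\n";
print join(',', @no_capture), "\n";
print join(',', @keep_both),  "\n";
```

The first line printed contains the literal word undef in the fourth position; the other two contain only real fields.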

Additional examples

Note:

( undef, $min, $hour ) = split(/:/, $current_time);
$_ = 'AB AB AC';
print m/c$/i;

Old News ;-)

[Aug 17, 2020] Why split function treats single quotes literals as regex, instead of a special case?

Aug 17, 2020 | perlmonks.org

likbez on Aug 14, 2020 at 02:21 UTC (#11120703, perlquestion) has asked for the wisdom of the Perl Monks concerning the following question:

It looks like the Perl split function treats single-quoted literals in a way that is semantically inconsistent with other constructs.

But not always :-). For example

($line)=split(' ',$line,1);
is treated consistently (in AWK way). This is the only way I know to avoid using regex for a very common task of trimming the leading blanks.

In general, split function should behave differently if the first argument is string and not a regex. But right now single quoted literal is treated as regular expression. For example:

$line="head xxx tail";
say split('x+',$line);
will print
head  tail

Am I missing something? BTW this would be similar to Python distinguishing between split and re.split but in a more elegant, Perlish way. And a big help for sysadmins.


jwkrahn on Aug 14, 2020 at 03:33 UTC

Re: Why split function treats single quotes literals as regex, instead of a special case?

The single space character is a special case for split, anything else is treated as a regular expression, be it a string, function call, etc.

Regular expressions are also treated a bit differently than regular expressions in qr//, m// and s///.

AnomalousMonk on Aug 14, 2020 at 04:38 UTC

Re^2: Why split function treats single quotes literals as regex, instead of a special case?



> The single space character is a special case for split ...

I.e., per split:

> As another special case, split emulates the default behavior of the command line tool awk when the PATTERN is either omitted or a string composed of a single space character (such as ' ' or "\x20", but not e.g. / /). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were /\s+/; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator.

You also write:

> Regular expressions are also treated a bit differently than regular expressions in qr//, m// and s///.

I don't understand this statement. Can you elaborate?

jwkrahn on Aug 14, 2020 at 09:16 UTC

Re^3: Why split function treats single quotes literals as regex, instead of a special case?

The regular expression // works differently in split than elsewhere:

$ perl -le'
my $x = "1234 abcd 5678";
print $& if $x =~ /[a-z]+/;
print $& if $x =~ //;
print map qq[ "$_"], split /[a-z]+/, $x;
print map qq[ "$_"], split //, $x;
'
abcd
abcd
 "1234 " " 5678"
 "1" "2" "3" "4" " " "a" "b" "c" "d" " " "5" "6" "7" "8"

Also, the line anchors /^/ and /$/ don't require the /m option to match lines in a string.

AnomalousMonk on Aug 14, 2020 at 18:17 UTC

Re^4: Why split function treats single quotes literals as regex, instead of a special case?

jcb on Aug 14, 2020 at 23:07 UTC

Re^4: Why split function treats single quotes literals as regex, instead of a special case?

Anonymous Monk on Aug 14, 2020 at 10:02 UTC

Re: Why split function treats single quotes literals as regex, instead of a special case?

perldoc -f split

perlfan on Aug 14, 2020 at 16:51 UTC

Re: Why split function treats single quotes literals as regex, instead of a special case?

> Am I missing something?

Yes, this is Perl not Python.

> Why?

I can assert that contextually, splitting on all characters for split //, $string is a lot more meaningful than splitting on nothing and returning just the original $string. The big surprise actually happens for users (like me) who don't realize the first parameter of split is a regular expression. But that surprise quickly turns into joy.

> In general, split function should behave differently if the first argument is string and not a regex.

Should? That's pretty presumptuous. You'll notice that Perl has FAR fewer built-in functions (particularly string functions) than PHP, JavaScript, or Python. This is because they've all been generalized away into regular expressions. You must also understand that the primary design philosophy is more related to spoken linguistics than written code. The implication here is that humans are lazy and don't want to learn more words than they need to communicate - not true of all humans, of course. But true enough for 99% of them. This is also reflected in the Huffmanization of most Perl syntax. This refers to Huffman compression, which necessarily compresses more frequently used things (characters, words, etc) into the symbols of the smallest size. I mean Perl isn't APL, but certainly gets this idea from it.

The balkanization of built-in functions that are truly special cases of a general case is against any philosophical underpinnings that Perl follows. I am not saying it's perfect, but it is highly resistant to becoming a tower of babble. If that's your interest (not accusing you of being malicious), there are more fruitful avenues to attack Perl. Most notably, the areas of object orientation and threading. But you'll have pretty much zero success convincing anyone who has been around Perl for a while that the approach to split is incorrect.

Oh, also a string (as you're calling it) is a regular expression in the purest sense of the term . It's best described as a concatenation of a finite set of symbols in fixed ordering. For some reason a lot of people think this regex magic is only present in patterns that may have no beginning or no end, or neither. In your case it just happens to have both. Doesn't make it any less of a regular expression, though.

you !!! on Aug 14, 2020 at 19:29 UTC

Re^2: Why split function treats single quotes literals as regex, instead of a special case?



> The balkanization of built-in functions that are truly special cases of a general case is against any philosophical underpinnings that Perl follows. I am not saying it's perfect, but it is highly resistant to becoming a tower of babble. If that's your interest (not accusing you of being malicious), there are more fruitful avenues to attack Perl

I respectfully disagree. Perl philosophy states that there should be shortcuts for special cases if they are used often. That's the idea behind suffix conditionals ( return if (index($line,'EOL')>-1) ) and the bash-style if statement ( ($debug) && say $line; )

You also are missing the idea. My suggestion is that we can enhance the power of Perl by treating single quoted string differently from regex in split. And do this without adding to balkanization.

Balkanization of built-ins is generally what Python got having two different functions. Perl can avoid this providing the same functionality with a single function. That's the idea.

And my point is that this particular change requires minimal work in interpreter as it already treats ' ' in a special way (AWK way).

So this is a suggestion for improving the language, not for balkanization, IMHO. And intuitively it is logical as people understand (and expect) the difference in behavior between single quoted literals and regex in split. So, in a way, the current situation can be viewed as a bug, which became a feature.

perlfan on Aug 15, 2020 at 08:04 UTC

Re^3: Why split function treats single quotes literals as regex, instead of a special case?
> So, in a way, the current situation can be viewed as a bug, which became a feature.

To be fair, this is a lot of perl. But I can't rightfully assert that this behavior was unintentional, in fact it appears to be very intentional (e.g., awk emulation).

> You also are missing the idea.

My understanding is that you wish for "strings" (versus "regexes") to invoke the awk behavior of trimming leading white space. Is that right? I'm not here to judge your suggestion, but I can easily think of several reasons why adding another special case to split is not a great idea.

All I can say is you're the same guy who was looking for the trim method in Perl. If that's not a red flag for being okay with balkanization , I don't know what is.

Finally, I must reiterate. A "string" is a regular expression . The single quoted whitespace is most definitely a special exception since it is also a regular expression. You're recommending not only removing one regex from the pool of potential regexes, but an entire class of them available via quoting - i.e., fixed length strings of a fixed ordering. I am not sure how this is really a suggestion of making all quoted things not be regexes, because then how do you decide if it is "regex" or not? (maybe use a regex? xD)


[Aug 16, 2020] Two meanings of undef

Aug 16, 2020 | perlmonks.org

likbez on Aug 15, 2020 at 18:21 UTC (#11120786, perlquestion) has asked for the wisdom of the Perl Monks concerning the following question:

The system function undef can be used on the left side of an assignment from the split function or from an array to skip values that you do not need. For example:

$current_time='14:30:05';
(undef, $min, $sec)=split(/:/, $current_time);

In this role it is similar to /dev/null in Unix.

But if used on the right side, this function deletes the variable from the symbol table. For example:

$line=undef; say "Does not exists" unless(defined($line));

Is this correct understanding?
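The two roles of undef raised in the question can be checked with a small experiment. This is a sketch only, with made-up variable names. It relies on the documented behaviour that assigning undef makes a variable's value undefined; the variable itself stays declared and usable, so nothing is removed.

```perl
use strict;
use warnings;

my $current_time = '14:30:05';

# On the left side of a list assignment, undef is a placeholder:
# the first field (hours) is simply thrown away.
my (undef, $min, $sec) = split /:/, $current_time;
print "$min:$sec\n";

# Assigned as a value, undef makes the variable undefined,
# but the variable still exists and can be assigned again.
my $line = 'some text';
$line = undef;
print "not defined\n" unless defined $line;

$line = 'new text';
print "$line\n";
```

Even under use strict, the last assignment compiles and runs, which shows that $line was never removed; only its value was cleared.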

[Nov 13, 2017] Understanding Split and Join

Notable quotes:
"... What happens if the delimiter is indicated to be a null string (a string of zero characters)? ..."
Dec 28, 2006 | perlmonks.com

Re: Understanding Split and Join

I'd put more emphasis on the fact that the first argument to split is always, always, always a regular expression (except for the one special case where it isn't :-). Too often do I see people write code like this:

@stuff = split "|", $string;
# or worse ...
$delim = "|";
@stuff = split $delim, $string;

And expect it to split on the pipe symbol because they have fooled themselves into thinking that the first argument is somehow interpreted as a string rather than a regular expression.

duff
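One common way to split on a literal string held in a variable is to quote its metacharacters with \Q...\E (or the equivalent quotemeta function). A minimal sketch, with the sample data and array names assumed for the illustration:

```perl
use strict;
use warnings;

my $string = "a|b|c";
my $delim  = "|";

# As a regex, | is an alternation of two empty patterns: it matches the
# empty string everywhere, so the input is split into single characters.
my @wrong = split $delim, $string;

# \Q...\E escapes regex metacharacters, so the pipe is matched literally.
my @right = split /\Q$delim\E/, $string;

print scalar(@wrong), " elements without quoting\n";
print scalar(@right), " elements with \\Q...\\E\n";
```

The unquoted version returns five one-character elements, pipes included; the quoted one returns the three fields.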

jwkrahn (Monsignor) on Dec 28, 2006 at 13:23 UTC

> There are cases where it is equally easy to use a regexp in list context to split a string as it is to use the split function. Consider the following examples:
>
> my @list = split /\s+/, $string;
> my @list = $string =~ /(\S+)/g;
>
> In the first example you're defining what to throw away. In the second, you're defining what to keep. But you're getting the same results. That is a case where it's equally easy to use either syntax.

In your regexp example you don't need the parentheses, it will work the same without them.

If $string contains leading whitespace then you will NOT get the same results. To demonstrate examples that produce the same results:

my @list = split ' ', $string;
my @list = $string =~ /\S+/g;

chromatic (Archbishop) on Dec 29, 2006 at 00:52 UTC

What happens if the delimiter is indicated to be a null string (a string of zero characters)?

perl behaves inconsistently with regard to the "empty" regex:

my $string = 'Monk';
exit unless $string =~ /(o)/;
my @matches = $string =~ //;
warn join('=', @matches), "\n";
exit unless $string =~ /(o)/;
my @letters = split( //, $string );
warn join('-', @letters), "\n";

ysth (Canon) on Dec 29, 2006 at 08:02 UTC

chromatic has pointed out that split treats an empty pattern normally, not as a directive to reuse the last successfully matching pattern, as m// and s/// do.

A pattern that split treats specially but m// and s/// treat normally is /^/. Normally, ^ only matches at the beginning of a string. Given the /m flag, it also matches after newlines in the interior of the string. It's common to want to break a string up into lines without removing the newlines as splitting on /\n/ would do. One way to do this is @lines = /^(.*\n?)/mg . Another, perhaps more straightforward, is @lines = split /^/m . Without the /m, the ^ should match only at the beginning of the string, so the split should return only one element, containing the entire original string. Since this is useless, and splitting on /^/m instead is common, /^/ silently becomes /^/m.

This only applies to a pattern consisting of just ^; even the apparently equivalent /^(?#)/ or /^ /x are treated normally and don't split the string at all.
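The line-splitting idiom described above can be sketched as follows, comparing split /^/ with split /\n/. The sample text and array names are assumptions for the example.

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree\n";

# /^/ is silently treated as /^/m by split: the string is cut at the start
# of every line, and the newlines stay attached to the lines.
my @with_newlines = split /^/, $text;

# Splitting on the newline itself throws the newlines away.
my @without_newlines = split /\n/, $text;

print scalar(@with_newlines), " lines, first ends with newline: ",
      ($with_newlines[0] =~ /\n\z/ ? "yes" : "no"), "\n";
print scalar(@without_newlines), " lines, first is '$without_newlines[0]'\n";
```

Both calls return three elements; only the first keeps the line terminators, which is handy when the lines will be written out again unchanged.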

ferreira (Chaplain) on Dec 30, 2006 at 19:34 UTC

Both exceptions, the special treatment of // and /^/ by split, are documented in split .

Both may deserve to be mentioned in the tutorial quickly for the profit of the unaware.

The last remark by ysth about the non-equivalence of /^(?#)/ and /^ /x with // for split purposes is a subtle thing.

More subtle if you compare to the fact that / /x , / # /x or even / (?#)/x have the same treatment as // when passed to this function.

Looks like a case to be fixed either in the docs or in the code of the Perl interpreter itself (if not barred by compatibility issues).

[Nov 12, 2017] Understanding Split and Join

Notable quotes:
"... Hint, you use capturing parenthesis. ..."
Nov 12, 2017 | perlmonks.com

split and join

Regular expressions are used to match delimiters with the split function, to break up strings into a list of substrings. The join function is in some ways the inverse of split. It takes a list of strings and joins them together again, optionally, with a delimiter. We'll discuss split first, and then move on to join.

A simple example...

Let's first consider a simple use of split: split a string on whitespace.

$line = "Bart  Lisa Maggie Marge Homer";
@simpsons = split ( /\s/, $line );   # Splits line and uses single whitespaces
                                     # as the delimiter.

@simpsons now contains "Bart", "", "Lisa", "Maggie", "Marge", and "Homer".

There is an empty element in the list that split placed in @simpsons . That is because \s matched exactly one whitespace character. But in our string, $line , there were two spaces between Bart and Lisa. Split, using single whitespaces as delimiters, created an empty string at the point where two whitespaces were found next to each other. That also includes preceding whitespace. In fact, empty delimiters found anywhere in the string will result in empty strings being returned as part of the list of strings.

We can specify a more flexible delimiter that eliminates the creation of an empty string in the list.

@simpsons = split ( /\s+/, $line );   # Now splits on one-or-more whitespaces.

@simpsons now contains "Bart", "Lisa", "Maggie", "Marge", and "Homer", because the delimiter match is seen as one or more whitespaces, multiple whitespaces next to each other are consumed as one delimiter.

Where do delimiters go?

"What does split do with the delimiters?" Usually it discards them, returning only what is found to either side of the delimiters (including empty strings if two delimiters are next to each other, as seen in our first example). Let's examine that point in the following example:

$string = "Just humilityanother humilityPerl humilityhacker.";
@japh = split ( /humility/, $string );

The delimiter is something visible: 'humility'. And after this code executes, @japh contains four strings, "Just ", "another ", "Perl ", and "hacker.". 'humility' bit the bit-bucket, and was tossed aside.

Preserving delimiters

If you want to keep the delimiters you can. Here's an example of how. Hint, you use capturing parentheses.

$string = "alpha-bravo-charlie-delta-echo-foxtrot";
@list = split ( /(-)/, $string );

@list now contains "alpha", "-", "bravo", "-", "charlie", and so on. The parentheses caused the delimiters to be captured into the list passed to @list right alongside the stuff between the delimiters.

The null delimiter

What happens if the delimiter is indicated to be a null string (a string of zero characters)? Let's find out.

$string = "Monk";
@letters = split ( //, $string );

Now @letters contains a list of four letters, "M", "o", "n", and "k". If split is given a null string as a delimiter, it splits on each null position in the string, or in other words, every character boundary. The effect is that the split returns a list broken into individual characters of $string .

Split's return value

Earlier I mentioned that split returns a list. That list, of course, can be stored in an array, and often is. But another use of split is to store its return values in a list of scalars. Take the following code:

@mydata = ( "Simpson:Homer:1-800-000-0000:40:M",
            "Simpson:Marge:1-800-111-1111:38:F",
            "Simpson:Bart:1-800-222-2222:11:M",
            "Simpson:Lisa:1-800-333-3333:9:F",
            "Simpson:Maggie:1-800-444-4444:2:F" );
foreach ( @mydata ) {
    ( $last, $first, $phone, $age ) = split ( /:/ );
    print "You may call $age year old $first $last at $phone.\n";
}

What happened to the person's sex? It's just discarded because we're only accepting four of the five fields into our list of scalars. And how does split know what string to split up? When split isn't explicitly given a string to split up, it assumes you want to split the contents of $_ . That's handy, because foreach aliases $_ to each element (one at a time) of @mydata .

Words about Context

Put to its normal use, split is used in list context. It may also be used in scalar context, though its use in scalar context is deprecated. In scalar context, split returns the number of fields found, and splits into the @_ array. It's easy to see why that might not be desirable, and thus, why using split in scalar context is frowned upon.

The limit argument

Split can optionally take a third argument. If you specify a third argument to split, as in @list = split ( /\s+/, $string, 3 ); split returns no more than the number of fields you specify in the third argument. So if you combine that with our previous example.....

( $last, $first, $everything_else) = split ( /:/, $_, 3 );

Now, $everything_else contains Bart's phone number, his age, and his sex, delimited by ":", because we told split to stop early. If you specify a negative limit value, split understands that as being the same as an arbitrarily large limit.

Unspecified split pattern

As mentioned before, limit is an optional parameter. If you leave limit off, you may also, optionally, choose to not specify the split string. Leaving out the split string causes split to attempt to split the string contained in $_. And if you leave off the split string (and limit), you may also choose to not specify a delimiter pattern.

If you leave off the pattern, split assumes you want to split on /\s+/ . Not specifying a pattern also causes split to skip leading whitespace. It then splits on any whitespace field (of one or more whitespaces), and skips past any trailing whitespace. One special case is when you specify the string literal, " " (a quoted space), which does the same thing as specifying no delimiter at all (no argument).

The star quantifier (zero or more)

Finally, consider what happens if we specify a split delimiter of /\s*/ . The quantifier "*" means zero or more of the item it is quantifying. So this split can split on nothing (character boundaries), any amount of whitespace. And remember, delimiters get thrown away. See this in action:

$string = "Hello world!";
@letters = split ( /\s*/, $string );

@letters now contains "H", "e", "l", "l", "o", "w", "o", "r", "l", "d", and "!".
Notice that the whitespace is gone. You just split $string , character by character (because null matches boundaries), and on whitespace (which gets discarded because it's a delimiter).

Using split versus Regular Expressions

There are cases where it is equally easy to use a regexp in list context to split a string as it is to use the split function. Consider the following examples:

my @list = split /\s+/, $string;
my @list = $string =~ /(\S+)/g;

In the first example you're defining what to throw away. In the second, you're defining what to keep. But you're getting the same results. That is a case where it's equally easy to use either syntax.

But what if you need to be more specific as to what you keep, and perhaps are a little less concerned with what comes between what you're keeping? That's a situation where a regexp is probably a better choice. See the following example:

my @bignumbers = $string =~ /(\d{4,})/g;

That type of a match would be difficult to accomplish with split. Try not to fall into the pitfall of using one where the other would be handier. In general, if you know what you want to keep, use a regexp. If you know what you want to get rid of, use split. That's an oversimplification, but start there and if you start tearing your hair out over the code, consider taking another approach. There is always more than one way to do it .
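To make the "keep versus throw away" rule of thumb concrete, here is a hedged sketch that extracts the numbers of four or more digits both ways. The sample string is invented for the illustration, and the split-based variant is just one possible workaround.

```perl
use strict;
use warnings;

my $string = "ids: 12, 4096, 77, 123456 and 9999";

# Say what to keep: a global match returns exactly the long numbers.
my @bignumbers = $string =~ /(\d{4,})/g;

# Say what to throw away: split on non-digits, then filter out the
# leading empty field and the short numbers in a second step.
my @via_split = grep { length($_) >= 4 } split /\D+/, $string;

print join(' ', @bignumbers), "\n";
print join(' ', @via_split),  "\n";
```

Both lines print 4096 123456 9999, but the split version needs an extra grep, which is the kind of friction the tutorial warns about.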

[Nov 12, 2017] Understanding Split and Join

Nov 12, 2017 | perlmonks.com

split and join

Regular expressions are used to match delimiters with the split function, to break up strings into a list of substrings. The join function is in some ways the inverse of split. It takes a list of strings and joins them together again, optionally, with a delimiter. We'll discuss split first, and then move on to join.

A simple example...

Let's first consider a simple use of split: split a string on whitespace.

$line = "Bart  Lisa Maggie Marge Homer";
@simpsons = split ( /\s/, $line );   # Splits line and uses single whitespaces
                                     # as the delimiter.

@simpsons now contains "Bart", "", "Lisa", "Maggie", "Marge", and "Homer".

There is an empty element in the list that split placed in @simpsons. That is because \s matched exactly one whitespace character. But in our string, $line, there were two spaces between Bart and Lisa. Split, using single whitespaces as delimiters, created an empty string at the point where two whitespaces were found next to each other. The same happens with leading whitespace: a delimiter at the very start of the string produces an empty first element. In fact, adjacent delimiters found at the front or in the middle of the string will result in empty strings being returned as part of the list of strings; only trailing empty strings are dropped by default.

We can specify a more flexible delimiter that eliminates the creation of an empty string in the list.

@simpsons = split ( /\s+/, $line );   # Now splits on one-or-more whitespaces.

@simpsons now contains "Bart", "Lisa", "Maggie", "Marge", and "Homer". Because the delimiter now matches one or more whitespace characters, multiple whitespaces next to each other are consumed as a single delimiter.
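To see the difference for yourself, here is a minimal sketch that runs both patterns over the same string and counts the fields. The array names @single and @multi are illustrative only, not part of the original post.

```perl
use strict;
use warnings;

# Two spaces between Bart and Lisa, as in the example above.
my $line = "Bart  Lisa Maggie Marge Homer";

my @single = split /\s/,  $line;   # every whitespace character is a delimiter
my @multi  = split /\s+/, $line;   # a run of whitespace counts as one delimiter

printf "%d fields with /\\s/, %d fields with /\\s+/\n",
    scalar(@single), scalar(@multi);
```

The extra field in the first count is the empty string between the two adjacent spaces.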

Where do delimiters go?

"What does split do with the delimiters?" Usually it discards them, returning only what is found to either side of the delimiters (including empty strings if two delimiters are next to each other, as seen in our first example). Let's examine that point in the following example:

$string = "Just humilityanother humilityPerl humilityhacker.";
@japh = split ( /humility/, $string );

The delimiter is something visible: 'humility'. And after this code executes, @japh contains four strings, "Just ", "another ", "Perl ", and "hacker.". 'humility' bit the bit-bucket, and was tossed aside.

Preserving delimiters

If you want to keep the delimiters you can. Here's an example of how. Hint: you use capturing parentheses.

$string = "alpha-bravo-charlie-delta-echo-foxtrot";
@list = split ( /(-)/, $string );

@list now contains "alpha", "-", "bravo", "-", "charlie", and so on. The parentheses caused the delimiters to be captured into the list passed to @list right alongside the stuff between the delimiters.
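Capturing becomes more useful when the delimiter can vary, because the captured text tells you which delimiter was actually found. A small sketch, assuming a toy arithmetic string (the names $expr and @tokens are made up for illustration):

```perl
use strict;
use warnings;

my $expr = "10+20-5";

# The character class matches either operator; the parentheses make split
# return each matched operator as its own list element.
my @tokens = split /([-+])/, $expr;

print join(' | ', @tokens), "\n";
```

Without the parentheses the same split would return only the three numbers, and the operators would be lost.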

The null delimiter

What happens if the delimiter is indicated to be a null string (a string of zero characters)? Let's find out.

$string = "Monk";
@letters = split ( //, $string );

Now @letters contains a list of four letters, "M", "o", "n", and "k". If split is given a null string as a delimiter, it splits on each null position in the string, or in other words, every character boundary. The effect is that the split returns a list broken into individual characters of $string.
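Since join is in some ways the inverse of split, the two pair up naturally on a character list. A brief sketch (variable names are illustrative only):

```perl
use strict;
use warnings;

my $word    = "Monk";
my @letters = split //, $word;            # one element per character

my $rebuilt  = join '', @letters;          # joining with '' restores the string
my $reversed = join '', reverse @letters;  # same characters, opposite order

print "$rebuilt -> $reversed\n";
```

For plain string reversal the built-in reverse in scalar context is simpler; the point here is only to show split and join undoing each other.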

Split's return value

Earlier I mentioned that split returns a list. That list, of course, can be stored in an array, and often is. But another use of split is to store its return values in a list of scalars. Take the following code:

@mydata = ( "Simpson:Homer:1-800-000-0000:40:M",
            "Simpson:Marge:1-800-111-1111:38:F",
            "Simpson:Bart:1-800-222-2222:11:M",
            "Simpson:Lisa:1-800-333-3333:9:F",
            "Simpson:Maggie:1-800-444-4444:2:F" );
foreach ( @mydata ) {
    ( $last, $first, $phone, $age ) = split ( /:/ );
    print "You may call $age year old $first $last at $phone.\n";
}

What happened to the person's sex? It's just discarded because we're only accepting four of the five fields into our list of scalars. And how does split know what string to split up? When split isn't explicitly given a string to split up, it assumes you want to split the contents of $_. That's handy, because foreach aliases $_ to each element (one at a time) of @mydata.
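If you only care about a couple of the fields, a list slice on split's return value is another option. The following is a sketch under the same record layout as above; the @summary array is a hypothetical name used only to collect results.

```perl
use strict;
use warnings;

my @mydata = (
    "Simpson:Homer:1-800-000-0000:40:M",
    "Simpson:Lisa:1-800-333-3333:9:F",
);

my @summary;
for (@mydata) {
    # split works on $_ here; the slice keeps fields 1 (first name) and 3 (age).
    my ($first, $age) = (split /:/)[1, 3];
    push @summary, "$first is $age";
}

print "$_\n" for @summary;
```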

Words about Context

Put to its normal use, split is used in list context. It may also be used in scalar context, where it returns the number of fields found. In Perl releases before 5.12, split in scalar context also split into the @_ array as a side effect, and that usage was deprecated. It's easy to see why clobbering @_ might not be desirable, and thus, why using split in scalar context is frowned upon in code that may run on an old Perl.
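A short sketch of counting fields follows. The first form goes through an array and behaves the same on any Perl; the second calls split directly in scalar context and assumes Perl 5.12 or later, where it only returns the count.

```perl
use strict;
use warnings;

my $record = "Simpson:Homer:1-800-000-0000:40:M";

# Portable: split in list context, then take the array's size.
my @fields = split /:/, $record;
my $count  = @fields;

# Direct scalar-context call; assumes Perl 5.12+ (no write to @_).
my $direct = split /:/, $record;

print "$count fields, $direct fields\n";
```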

The limit argument

Split can optionally take a third argument. If you specify a third argument to split, as in @list = split ( /\s+/, $string, 3 ); split returns no more than the number of fields you specify in the third argument. So if you combine that with our previous example:

( $last, $first, $everything_else ) = split ( /:/, $_, 3 );

Now, $everything_else contains Bart's phone number, his age, and his sex, still delimited by ":", because we told split to stop early. If you specify a negative limit value, split treats it as an arbitrarily large limit; in addition, trailing empty fields are then kept rather than dropped.
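Both effects of the limit can be seen in a few lines. This is a sketch only; the $sparse string with trailing empty fields is an invented test value.

```perl
use strict;
use warnings;

my $record = "Simpson:Bart:1-800-222-2222:11:M";

# Positive limit: stop after three fields, the last one keeps the remainder.
my ($last, $first, $rest) = split /:/, $record, 3;

# Trailing empty fields: dropped by default, kept with a negative limit.
my $sparse  = "a:b:::";
my @default = split /:/, $sparse;
my @keep    = split /:/, $sparse, -1;

print "rest = $rest\n";
print scalar(@default), " fields by default, ", scalar(@keep), " with limit -1\n";
```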

Unspecified split pattern

As mentioned before, limit is an optional parameter. If you leave limit off, you may also, optionally, choose to not specify the split string. Leaving out the split string causes split to attempt to split the string contained in $_. And if you leave off the split string (and limit), you may also choose to not specify a delimiter pattern.

If you leave off the pattern, split behaves almost as if you had asked it to split on /\s+/, with one difference: it first skips any leading whitespace, so no empty leading field is produced. It then splits on each run of one or more whitespace characters, and ignores any trailing whitespace. The same special case applies when you specify the string literal " " (a quoted space), which does the same thing as specifying no delimiter at all (no argument).
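The difference between the quoted space and an explicit /\s+/ only shows up when the string starts with whitespace. A minimal sketch, with an invented sample string:

```perl
use strict;
use warnings;

my $text = "  alpha beta  gamma ";

my @awk   = split ' ',   $text;   # leading whitespace is skipped
my @regex = split /\s+/, $text;   # leading whitespace yields an empty first field

my @bare;
for ($text) {                     # aliases $_ to $text
    @bare = split;                # no arguments: same as split ' ', $_
}

print scalar(@awk), " ", scalar(@regex), " ", scalar(@bare), "\n";
```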

The star quantifier (zero or more)

Finally, consider what happens if we specify a split delimiter of /\s*/. The quantifier "*" means zero or more of the item it is quantifying. So this split can split on nothing (character boundaries) or on any amount of whitespace. And remember, delimiters get thrown away. See this in action:

$string = "Hello world!";
@letters = split ( /\s*/, $string );

@letters now contains "H", "e", "l", "l", "o", "w", "o", "r", "l", "d", and "!".
Notice that the whitespace is gone. You just split $string character by character (because the null match occurs at every character boundary), and on whitespace (which gets discarded because it's a delimiter).
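Comparing /\s*/ with the plain null pattern makes the point concrete: the only difference is what happens to the space. A small sketch (the @chars name is illustrative):

```perl
use strict;
use warnings;

my $string = "Hello world!";

my @chars   = split //,    $string;   # the space survives as its own element
my @letters = split /\s*/, $string;   # the space is consumed as a delimiter

print scalar(@chars), " elements vs ", scalar(@letters), " elements\n";
```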

Using split versus Regular Expressions

There are cases where it is equally easy to use a regexp in list context to split a string as it is to use the split function. Consider the following examples:

my @list = split /\s+/, $string;
my @list = $string =~ /(\S+)/g;

In the first example you're defining what to throw away. In the second, you're defining what to keep. But you're getting the same results, provided $string does not begin with whitespace (in which case the split version returns an extra empty first element). That is a case where it's equally easy to use either syntax.

But what if you need to be more specific as to what you keep, and perhaps are a little less concerned with what comes between what you're keeping? That's a situation where a regexp is probably a better choice. See the following example:

my @bignumbers = $string =~ /(\d{4,})/g;

That type of a match would be difficult to accomplish with split. Try not to fall into the pitfall of using one where the other would be handier. In general, if you know what you want to keep, use a regexp. If you know what you want to get rid of, use split. That's an oversimplification, but start there and if you start tearing your hair out over the code, consider taking another approach. There is always more than one way to do it.
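Here is a sketch that puts the two approaches side by side on one invented input string. It also shows one place where they differ: leading whitespace.

```perl
use strict;
use warnings;

my $string = "  id 7 port 8080 pid 31337";

my @by_split   = split /\s+/, $string;      # describes what to throw away
my @by_match   = $string =~ /(\S+)/g;       # describes what to keep
my @bignumbers = $string =~ /(\d{4,})/g;    # keep only runs of 4+ digits

print scalar(@by_split), " vs ", scalar(@by_match), " elements\n";
print "big numbers: @bignumbers\n";
```

The split version has one more element because the leading whitespace produces an empty first field.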

[Nov 16, 2015] undef can be used as a dummy variable in split function

Instead of

($id, $not_used, $credentials, $home_dir, $shell ) = split /:/;

You can write

($id, undef, $credentials, $home_dir, $shell ) = split /:/;

Perl 5.22 even added a pretty fancy (and generally useless) shortcut. Instead of

my(undef, $card_num, undef, undef, undef, $count) = split /:/;

You can write

use v5.22; 
my(undef, $card_num, (undef)x3, $count) = split /:/;
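As a concrete illustration, here is a minimal sketch that skips unwanted fields of a passwd-style line with undef placeholders. The entry itself is made up, and the plain placeholder form used here does not require Perl 5.22.

```perl
use strict;
use warnings;

# Hypothetical passwd-style record: login:password:uid:gid:gecos:home:shell
my $entry = "alice:x:1001:1001:Alice Example:/home/alice:/bin/bash";

# undef in the list skips the password, gid and gecos fields.
my ($login, undef, $uid, undef, undef, $home, $shell) = split /:/, $entry;

print "$login (uid $uid) uses $shell in $home\n";
```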

[Oct 31, 2015] Perl's versatile split function by David Farrell

October 24, 2014

I love Perl's split function. Far more powerful than its feeble cousin join, split has some wonderful features that should make it a regular feature of any Perl programmer's toolbox. Let's look at some examples.

Split a sentence into words

To split a sentence into words, you might think about using a whitespace regex like /\s+/ which splits on contiguous whitespace. Split will ignore trailing whitespace, but what if the input string has leading whitespace? A better option is to use a single space string: ' '. This is a special case where Perl emulates awk and will split on all contiguous whitespace, trimming any leading or trailing whitespace as well.

my @words = split ' ', $sentence;

Or loop through each word and do something:

use 5.010;
say for (split ' ', ' 12 Angry Men ');
# 12
# Angry
# Men

The single-space regex is also the default regex for split, which by default operates on $_. This can lead to some seriously minimalist code. For example if I needed to split every name in a list of full names and do something with them:

for (@full_names)
{
    for (split)
    {
        # do something
    }
}

And who says Perl looks like line noise?

Create a char array

To split a word into separate letters, just pass an empty regex // to split:

my @letters = split //, $word;

Parse a URL or filepath

It's tempting to reach for a regex when parsing strings, but for URLs or filepaths split usually works better. For example if you wanted to get the parent directory from a filepath:

my @directories = split '/', '/home/user/documents/business_plan.ods';
my $parent_directory = $directories[-2];

Here I split the filepath on slash and use the negative index -2 to get the parent directory. The challenge with filepaths is that they can have n depth, but the parent directory of a file will always be the last but one element of a filepath, so split works well.
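For comparison, a short sketch that gets the same parent directory two ways: with split, as above, and with the core File::Basename module. Using the module is a suggestion of this sketch, not something the original article proposes.

```perl
use strict;
use warnings;
use File::Basename qw(basename dirname);

my $path = '/home/user/documents/business_plan.ods';

# split approach: the parent directory is the last but one element.
my @directories = split '/', $path;
my $via_split   = $directories[-2];

# Module approach: strip the file name, then take the last path component.
my $via_module = basename(dirname($path));

print "$via_split / $via_module\n";
```

Note that the leading slash makes the first element of @directories an empty string, which does not matter when indexing from the end.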

Extract only the first few columns from a separated file

How many times have you parsed a comma separated file, but didn't want all of the columns in the file? Let's say you wanted the first 3 columns from a file. You might do it like this:

while (<$read_file>)
{
    my @columns = split /,/;
    my $name    = $columns[0];
    my $email   = $columns[1];
    my $account = $columns[2];
    ...
}

This is all well and good, but you can also assign split's return values straight to a list of variables, and any columns beyond the third are simply discarded:

while (<$read_file>)
{
    my ($name, $email, $account) = split /,/;
    ...
}

Or to revisit an earlier example, splitting on whitespace:

for (@full_names)
{
    my ($firstname, $lastname) = split;
    ...
}
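The column example can be tried without a file by looping over a few in-memory lines. This is a sketch with invented sample data; the @lines and @accounts arrays are hypothetical names.

```perl
use strict;
use warnings;

# Invented sample rows standing in for lines read from a file.
my @lines = (
    "Ann,ann\@example.com,A-100,extra,fields\n",
    "Bob,bob\@example.com,B-200,more\n",
);

my @accounts;
for my $line (@lines) {
    chomp(my $row = $line);                       # drop the newline on a copy
    my ($name, $email, $account) = split /,/, $row;
    push @accounts, "$name:$account";             # columns after the third are ignored
}

print "$_\n" for @accounts;
```

A plain split on commas does not cope with quoted fields or embedded commas; for such data a dedicated parser, for example the CPAN module Text::CSV, is the usual choice.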

Conclusion

These are just a few examples of Perl's versatile split function. Check out the official documentation online or via the terminal with $ perldoc -f split.

The Last but not Least Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand ~Archibald Putt. Ph.D


Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

You can use PayPal to buy a cup of coffee for authors of this site

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. If you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: November 22, 2020