(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix |
The Unix sort command sorts ASCII files. This is a very old utility and its options are obscure. Specifying delimiters and sorting keys can be a nightmare. It reflects the state of the art of designing a set of options in the early 1970s :-).
For small data sets a good alternative is the Perl sort function, which is more modern and more flexible. I would say that for complex keys and small files a Perl-based solution is almost always superior. But for large files (for example, logs) the Unix sort command might be the only tool able to handle the job.
Unix sort can sort very large files. I successfully sorted proxy log files with the size over 10G using Solaris 9 sort implementation on a pretty old V210 with 80G 10RPM drives and single 1.34GHz CPU.
Lines need not have the same length or the same number of fields to be sorted successfully. All that is required is the presence of the key.
The input can come from files or from standard input. Unix sort accepts one or several input files; in the latter case all input files are merged. There is always a single output: the sorted sequence of all input records.
The output can be a file or standard output. With the -o option the output file may be the same as the input file, so a single file can be sorted "in place".
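The input and output behaviour described here can be tried out on throwaway data. This is a minimal sketch, assuming a POSIX sort; the scratch directory and file names are invented for the illustration, and LC_ALL=C is set only to make the ordering predictable.

```shell
# Work in a scratch directory so nothing real is touched.
dir=$(mktemp -d)
printf 'pear\napple\n' > "$dir/a.txt"
printf 'fig\nbanana\n' > "$dir/b.txt"

# Several input files are sorted together into one output stream.
merged=$(LC_ALL=C sort "$dir/a.txt" "$dir/b.txt")
printf '%s\n' "$merged"

# -o may name the input file itself: sort reads all input before writing.
LC_ALL=C sort -o "$dir/a.txt" "$dir/a.txt"
inplace=$(cat "$dir/a.txt")
printf '%s\n' "$inplace"

rm -r "$dir"
```

Note that plain redirection (sort file > file) is not a substitute for -o: the shell truncates the file before sort gets a chance to read it.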
Sorting is controlled by several options.
By default fields in each record (by default a record is a line of input, but you can redefine the input record separator) are delimited by blanks, but you can specify a different delimiter using option -t. You can also sort on character columns (see below).
You can select any one of the fields, or several of them, as the key with the option -k:
-k field_start [type] [,field_end [type] ]
where:
- field_start and field_end define a key field restricted to a portion of the line.
- type is a modifier from the list of characters bdfiMnr. The b modifier behaves like the -b option, but applies only to the field_start or field_end to which it is attached and characters within a field are counted from the first non-blank character in the field. (This applies separately to first_character and last_character.) The other modifiers behave like the corresponding options, but apply only to the key field to which they are attached. They have this effect if specified with field_start, field_end or both. If any modifier is attached to a field_start or to a field_end, no option applies to either.
When multiple key fields are defined, later keys are compared only after all earlier keys compare equal.
Except when the -u option is specified, lines that otherwise compare equal are ordered as if none of the options -d, -f, -i, -n or -k were present (but with -r still in effect, if it was specified) and with all bytes in the lines significant to the comparison.
The notation:
-k field_start[type][,field_end[type]]
defines a key field that begins at field_start and ends at field_end inclusive, unless field_start falls beyond the end of the line or after field_end, in which case the key field is empty. A missing field_end means the last character of the line.
There can be multiple -k definitions; each defines one sorting key.
A field comprises a maximal sequence of non-separating characters and, in the absence of option -t, any preceding field separator.
The field_start portion of the keydef option-argument has the form:
field_number[.first_character]
Fields and characters within fields are numbered starting with 1. field_number and first_character, interpreted as positive decimal integers, specify the first character to be used as part of a sort key. If .first_character is omitted, it refers to the first character of the field.
The field_end portion of the keydef option-argument has the form:
field_number[.last_character]
The field_number is as described above for field_start. last_character, interpreted as a non-negative decimal integer, specifies the last character to be used as part of the sort key. If last_character evaluates to zero or .last_character is omitted, it refers to the last character of the field specified by field_number.
If the -b option or b type modifier is in effect, characters within a field are counted from the first non-blank character in the field. (This applies separately to first_character and last_character.)
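To make the field.character notation concrete, here is a small sketch that sorts on characters 3 through 4 of the second field. The sample records are made up; the two commands are meant to show the same key selected once with an explicit -t separator and once with the default separator plus the b modifier.

```shell
records=$(printf '%s\n' 'x ab12' 'y ab03' 'z ab27')

# With -t ' ' the second field is exactly "ab12", so 2.3,2.4 selects "12".
with_t=$(printf '%s\n' "$records" | LC_ALL=C sort -t ' ' -k 2.3,2.4)

# Without -t the field includes its leading blank; the b modifier makes
# character counting start at the first non-blank character instead.
with_b=$(printf '%s\n' "$records" | LC_ALL=C sort -k 2.3b,2.4b)

printf '%s\n' "$with_t"
```

Both commands order the records by the two digits at the end of the second field (03, 12, 27).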
There are also so-called "old style" key definitions; see UNIX sort old style keys definition.
Selection of sorting keys is similar to selection of fields in cut and is extremely obscure. As a rule of thumb, assume that no key specification works the first time, so always make a backup and test your sorting keys on a small sample before sorting a large file.
While Unix sort does not modify its input, it may reorder records that have identical keys. In other words, it is not guaranteed to be stable (for example, if it is implemented via quicksort).
The most common mistake is to forget the -n option when sorting numeric fields. Specifying the delimiter (option -t) with an unquoted character after it can also be a source of problems; it is better to put single quotes around the character that you plan to use as a delimiter, for example -t ':'.
Here is a standard example of usage of the sort utility, sorting /etc/passwd file (user database) by UID (the third colon-separated field in the passwd file structure):
sort -t ':' -k 3,3 /etc/passwd # incorrect result, the field is numeric
sort -n -t ':' -k 3,3 /etc/passwd # order of the numbers is now correct
sort -t ':' -k 3,3n /etc/passwd
Similarly you can sort /etc/group file
sort -n -t ':' -k 3,3 /etc/group
sort -t ':' -k 3,3n /etc/group
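The effect of forgetting the numeric modifier is easy to see on a small sample. The sketch below uses an invented four-line file in passwd format rather than the real /etc/passwd, and prints only the user names so that the two orderings can be compared at a glance.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
user:x:1000:1000::/home/user:/bin/sh
daemon:x:2:2::/sbin:/sbin/nologin
root:x:0:0::/root:/bin/sh
bin:x:10:10::/bin:/sbin/nologin
EOF

# Field 3 compared as text: "10" and "1000" sort before "2".
lexical=$(LC_ALL=C sort -t ':' -k 3,3 "$sample" | cut -d ':' -f 1 | tr '\n' ' ')

# Field 3 compared as a number.
numeric=$(LC_ALL=C sort -t ':' -k 3,3n "$sample" | cut -d ':' -f 1 | tr '\n' ' ')

echo "lexical: $lexical"
echo "numeric: $numeric"
rm -f "$sample"
```

The text comparison yields root, bin, user, daemon, while the numeric comparison yields root, daemon, bin, user.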
See Sorting key definitions and Examples for more details. Generally you will be surprised how often the result is not what you want due to the obscurity of the definitions
By default sort sorts the file in ascending order using the entire line as the sorting key. Please note that many Web resources describe this behavior incorrectly (most often they state that by default sorting is performed on the first key).
The most important options of Unix sort are
For example:
Sort using the entire line as the key: | sort |
Sort in numeric order: | sort -n |
Comparisons are based on one or more sort keys extracted from each line of input. Again, please remember that by default, there is one sort key, the entire input line.
Lines are ordered according to the collating sequence of the current locale. By changing locale you can change the behavior of the sort.
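Because the set of installed locales differs from system to system, the following sketch sticks to the C locale, where comparison is by byte value, and contrasts it with case folding via -f. The word list is invented; in other locales the first result may well come out differently, which is exactly the point made above.

```shell
words=$(printf '%s\n' banana Apple cherry Blueberry)

# C locale: plain byte order, so all uppercase letters precede lowercase.
bytes=$(printf '%s\n' "$words" | LC_ALL=C sort | tr '\n' ' ')

# -f folds lowercase to uppercase before comparing.
folded=$(printf '%s\n' "$words" | LC_ALL=C sort -f | tr '\n' ' ')

echo "byte order: $bytes"
echo "folded:     $folded"
```

Setting LC_ALL=C on the command line is also a common way to get reproducible results in scripts that run on machines with different default locales.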
In Solaris there are two variants of sort: System V version and BSD version. Both have identical options:
The sort command can (and should) be used in pipes or have its output redirected as desired. Here are some practically important examples that illustrate the use of this utility (for more examples please look at our sort examples collection page):
sort file | less
In simple cases cut can be used instead of AWK. For example, the following pipeline counts hits per distinct visitor in HTTP logs (assuming the visitor is the first field in the logs):
cat http.log | cut -d " " -f 1 | sort | uniq -c | sort -nr
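The same pipeline can be tried on a tiny invented log before it is pointed at a real one. In this sketch the log lines and addresses are made up, and a final awk stage is added only to strip the variable padding that uniq -c puts in front of the counts.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - "GET /a"
10.0.0.2 - - "GET /b"
10.0.0.1 - - "GET /c"
10.0.0.3 - - "GET /a"
10.0.0.1 - - "GET /d"
10.0.0.2 - - "GET /e"
EOF

# Hits per visitor, busiest first.
counts=$(cut -d ' ' -f 1 "$log" | sort | uniq -c | sort -nr | awk '{ print $1, $2 }')
printf '%s\n' "$counts"

# Number of distinct visitors.
distinct=$(cut -d ' ' -f 1 "$log" | sort -u | wc -l | tr -d ' ')
echo "distinct visitors: $distinct"
rm -f "$log"
```

The first sort is needed because uniq only collapses adjacent duplicate lines.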
sort -n file | cat -n
This can be useful if you want to count the number of lines in which the first entry is in a given range: simply subtract the line numbers corresponding to the beginning and end of the range.
As mentioned above, by default the sort command uses entire lines as the key. It compares characters starting with the first, until a non-matching character or the end of the shorter line is reached. Leading blanks (spaces and tabs) are considered valid characters to compare. Thus a line beginning with a space precedes a line beginning with the letter A. If you do not want this effect, you need to delete leading spaces beforehand.
Multiple sort keys may be used on the same command line. If two lines have the same value in the same field, sort uses the next set of sort keys to resolve the equal comparison. For example,
sort -k 5,5 -k 2,2 infile
means to sort based on field 5. If two lines have the same value in field 5, sort those two lines based on field 2.
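A short sketch of this two-key sort on made-up five-field records follows; only the first column is printed so that the resulting order is easy to read.

```shell
rows=$(printf '%s\n' 'a 3 x x red' 'b 1 x x blue' 'c 2 x x red' 'd 9 x x blue')

# Primary key: field 5. Ties are broken by field 2.
order=$(printf '%s\n' "$rows" | LC_ALL=C sort -k 5,5 -k 2,2 | cut -d ' ' -f 1 | tr '\n' ' ')
echo "$order"
```

The "blue" records come first (b, then d), followed by the "red" ones (c, then a). Field 2 is compared as text here; with multi-digit values it would need the numeric modifier, as in -k 2,2n.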
Besides sorting, Unix sort is useful for merging files (option -m). It can also check whether a file is already sorted (option -c) and suppress duplicates (option -u).
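The three options can be exercised together on a few scratch files. This is a sketch with invented file names and contents; the messages that sort -c writes on disorder are discarded because their wording varies between implementations.

```shell
dir=$(mktemp -d)
printf '1\n3\n5\n' > "$dir/odd"
printf '2\n4\n4\n' > "$dir/even"
printf '2\n1\n'    > "$dir/unsorted"

# -m merges inputs that are already sorted, without re-sorting them.
merged=$(LC_ALL=C sort -n -m "$dir/odd" "$dir/even" | tr '\n' ' ')

# -c only checks: exit status 0 means sorted, non-zero means disorder.
if LC_ALL=C sort -n -c "$dir/odd" 2>/dev/null; then odd_ok=yes; else odd_ok=no; fi
if LC_ALL=C sort -n -c "$dir/unsorted" 2>/dev/null; then unsorted_ok=yes; else unsorted_ok=no; fi

# -u keeps one line per distinct key.
unique=$(LC_ALL=C sort -n -u "$dir/even" | tr '\n' ' ')

echo "merged: $merged"
echo "odd sorted: $odd_ok, unsorted sorted: $unsorted_ok"
echo "unique: $unique"
rm -r "$dir"
```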
If Unix sort cannot produce the required ordering, you might want to look into Perl's built-in sort function. If sort is too slow, more memory can be specified on invocation (for example, with the -S option in the Solaris and GNU implementations).
The following list describes the options and their arguments that may be used to control how sort functions.
sort: disorder: This line not in sorted order.
+pos1 | Specifies the beginning position of the input line used for field comparison. If pos1 is not specified then comparison begins at the beginning of the line. The pos1 position has the notation of f.c. The f specifies the number of fields to skip. The suffix .c specifies the number of characters to skip. For example, 3.2 is interpreted as skip three fields and two characters before performing comparisons. Omitting the .c portion is equivalent to specifying .0. Field one is referred to as position 0. If f is set to 0 then character positions are used for comparison. |
-pos2 | Specifies the ending position of the input line used for field comparison. If pos2 is not specified then comparison is done through the end of the line. The pos2 position has the notation of f.c. The f specifies to compare through field f. The c specifies the number of characters to compare through after field f. For example, -4.3 is interpreted as compare through three characters after the end of field four. Omitting the .c portion is equivalent to specifying .0. |
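Since many current sort implementations reject the +pos1 -pos2 form by default, it helps to be able to translate it. The function below is a hypothetical helper (old2k is not a standard tool) that applies the commonly documented conversion rule: +a.x -b.y corresponds to -k a+1.x+1,b when y is 0 or absent, and to -k a+1.x+1,b+1.y otherwise. It is only a sketch and does not handle type modifiers such as n or r.

```shell
# old2k +POS1 -POS2 : print the equivalent -k key definition.
old2k() {
    start=${1#+}
    end=${2#-}
    sf=${start%%.*}; sc=0
    case $start in *.*) sc=${start#*.} ;; esac
    ef=${end%%.*}; ec=0
    case $end in *.*) ec=${end#*.} ;; esac
    if [ "$ec" -eq 0 ]; then
        # End position with no character offset: key ends at end of field ef.
        printf '%s %d.%d,%d\n' -k $((sf + 1)) $((sc + 1)) "$ef"
    else
        printf '%s %d.%d,%d.%d\n' -k $((sf + 1)) $((sc + 1)) $((ef + 1)) "$ec"
    fi
}

old2k +1 -2       # second field only
old2k +1.1 -1.2   # second character of the second field
```

The first call prints -k 2.1,2 (the same key as -k 2,2) and the second prints -k 2.2,2.2.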
May 23, 2021 | www.tecmint.com
7. Sort the contents of the file 'lsl.txt' on the basis of the 2nd column (which represents the number of hard links).
$ sort -nk2 lsl.txt
Note: The '-n' option in the above example sorts the contents numerically. Option '-n' must be used when we want to sort a file on the basis of a column that contains numerical values.
8. Sort the contents of the file 'lsl.txt' on the basis of the 9th column (which is the name of the files and folders and is non-numeric).
$ sort -k9 lsl.txt
9. It is not always essential to run the sort command on a file. We can pipe the output of another command directly into it on the terminal.
$ ls -l /home/$USER | sort -nk5
10. Sort and remove duplicates from the text file tecmint.txt. Check whether the duplicates have been removed or not.
$ cat tecmint.txt
$ sort -u tecmint.txt
Rules so far (what we have observed):
- Lines starting with numbers are preferred in the list and lies at the top until otherwise specified ( -r ).
- Lines starting with lowercase letters are preferred in the list and lies at the top until otherwise specified ( -r ).
- Contents are listed on the basis of occurrence of alphabets in dictionary until otherwise specified ( -r ).
- Sort command by default treat each line as string and then sort it depending upon dictionary occurrence of alphabets (Numeric preferred; see rule – 1) until otherwise specified.
11. Create a third file 'lsla.txt' at the current location and populate it with the output of the 'ls -lA' command.
$ ls -lA /home/$USER > /home/$USER/Desktop/tecmint/lsla.txt
$ cat lsla.txt
Those who understand the 'ls' command know that 'ls -lA' = 'ls -l' + hidden files. So most of the contents of these two files will be the same.
12. Sort the contents of two files on standard output in one go.
$ sort lsl.txt lsla.txt
Notice the repetition of files and folders.
13. Now we can see how to sort, merge and remove duplicates from these two files.
$ sort -u lsl.txt lsla.txt
Notice that duplicates have been omitted from the output. Also, you can write the output to a new file by redirecting it.
14. We may also sort the contents of a file or the output based upon more than one column. Sort the output of the 'ls -l' command on the basis of fields 2 and 5 (numeric) and 9 (non-numeric).
$ ls -l /home/$USER | sort -k2,2n -k5,5n -k9
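Because ls -l output is separated by whitespace rather than commas, a multi-column sort of it is usually written with the default separator and one -k per column, each with its own type. The sketch below runs such a sort on four invented lines in ls -l layout (single spaces, made-up names and sizes) instead of a real directory listing.

```shell
listing=$(mktemp)
cat > "$listing" <<'EOF'
-rw-r--r-- 2 u g 300 Jan 1 10:00 beta
-rw-r--r-- 1 u g 500 Jan 1 10:00 alpha
-rw-r--r-- 1 u g 40 Jan 1 10:00 gamma
-rw-r--r-- 2 u g 300 Jan 1 10:00 alpha2
EOF

# Link count (numeric), then size (numeric), then name (text).
names=$(LC_ALL=C sort -k 2,2n -k 5,5n -k 9 "$listing" | awk '{ print $9 }' | tr '\n' ' ')
echo "$names"
rm -f "$listing"
```

The files with one link come first, ordered by size (gamma, alpha); the two files with two links have equal sizes and are ordered by name (alpha2, beta).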
Jul 12, 2019 | linuxhandbook.com
5. Sort by months [option -M]
Sort also has built-in functionality to arrange by month. It recognizes several formats based on locale-specific information. I tried to demonstrate some unique tests to show that it will arrange by date-day, but not year. Month abbreviations display before full names.
Here is the sample text file in this example:
March
Feb
February
April
August
July
June
November
October
December
May
September
1
4
3
6
01/05/19
01/10/19
02/06/18
Let's sort it by months using the -M option:
sort filename.txt -M
Here's the output you'll see:
01/05/19
01/10/19
02/06/18
1
3
4
6
Jan
Feb
February
March
April
May
June
July
August
September
October
November
December
...
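A compact way to check month ordering on your own system is shown below. It is a sketch that assumes a sort supporting -M (a GNU and BSD extension, not required by POSIX) and pins the locale to C so that the English month names are the ones recognized.

```shell
months=$(printf '%s\n' March Feb December Jan | LC_ALL=C sort -M | tr '\n' ' ')
echo "$months"
```

In this mode lines are compared by the month they begin with, not alphabetically, so the result is Jan, Feb, March, December.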
7. Sort Specific Column [option -k]
If you have a table in your file, you can use the -k option to specify which column to sort. I added some arbitrary numbers as a third column and will display the output sorted by each column. I've included several examples to show the variety of output possible. Options are added following the column number.
1. MX Linux 100
2. Manjaro 400
3. Mint 300
4. elementary 500
5. Ubuntu 200
sort filename.txt -k 2
This will sort the text on the second column in alphabetical order:
4. elementary 500
2. Manjaro 400
3. Mint 300
1. MX Linux 100
5. Ubuntu 200
sort filename.txt -k 3n
This will sort the text by the numerals in the third column.
1. MX Linux 100
5. Ubuntu 200
3. Mint 300
2. Manjaro 400
4. elementary 500
sort filename.txt -k 3nr
Same as the above command, except that the sort order has been reversed.
4. elementary 500
2. Manjaro 400
3. Mint 300
5. Ubuntu 200
1. MX Linux 100
8. Sort and remove duplicates [option -u]
If you have a file with potential duplicates, the -u option will make your life much easier. Remember that sort will not make changes to your original data file. I chose to create a new file with just the items that are duplicates. Below you'll see the input and then the contents of each file after the command is run.
1. MX Linux
2. Manjaro
3. Mint
4. elementary
5. Ubuntu
1. MX Linux
2. Manjaro
3. Mint
4. elementary
5. Ubuntu
1. MX Linux
2. Manjaro
3. Mint
4. elementary
5. Ubuntu
sort filename.txt -u > filename_duplicates.txt
Here's the output file, sorted and without duplicates.
1. MX Linux
2. Manjaro
3. Mint
4. elementary
5. Ubuntu
9. Ignore case while sorting [option -f]
Many modern distros running sort will implement ignore case by default. If yours does not, adding the -f option will produce the expected results.
sort filename.txt -f
Here's the output where cases are ignored by the sort command:
alpha
alPHa
Alpha
ALpha
beta
Beta
BEta
BETA
10. Sort by human numeric values [option -h]
This option allows the comparison of alphanumeric values like 1k (i.e. 1000).
sort filename.txt -h
Here's the sorted output:
10.0
100
1000.0
1k
I hope this tutorial helped you get the basic usage of the sort command in Linux.
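The ordering rule behind -h can be seen on a few invented values. This sketch assumes GNU sort, where -h is available; the option orders values first by their unit suffix and only then by the number, rather than converting everything to a common unit.

```shell
sizes=$(printf '%s\n' 1M 10 2K 1K | LC_ALL=C sort -h | tr '\n' ' ')
echo "$sizes"
```

Values without a suffix come first, then the K values in numeric order, then the M value: 10, 1K, 2K, 1M. This is also why 1000.0 appears before 1k in the output shown above.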
Christopher works as a Software Developer in Orlando, FL. He loves open source, Taco Bell, and a Chi-weenie named Max. Visit his website for more information or connect with him on social media.
John: The sort command option "k" specifies a field, not a column. In your example all five lines have the same character in column 2: a ".".
Stephane Chauveau: In GNU sort, the default field separator is 'blank to non-blank transition', which is a good default to separate columns. In his example, the "." is part of the first column so it should work fine. If --debug is used then the range of characters used as keys is dumped. What is probably missing in that article is a short warning about the effect of the current locale. It is a common mistake to assume that the default behavior is to sort ASCII text according to the ASCII codes. For example, the command echo `printf ".\nx\n0\nX\n@\në" | sort` produces ". 0 @ X x ë" with LC_ALL=C but ". @ 0 ë x X" with LC_ALL=en_US.UTF-8.
Sep 25, 2018 | www.amazon.com
Like awk, cut, and join, sort views its input as a stream of records made up of fields of variable width, with records delimited by newline characters and fields delimited by whitespace or a user-specifiable single character.
sort
- Usage
- sort [ options ] [ file(s) ]
- Purpose
- Sort input lines into an order determined by the key field and datatype options, and the locale.
- Major options
- -b
- Ignore leading whitespace.
- -c
- Check that input is correctly sorted. There is no output, but the exit code is nonzero if the input is not sorted.
- -d
- Dictionary order: only alphanumerics and whitespace are significant.
- -g
- General numeric value: compare fields as floating-point numbers. This works like -n , except that numbers may have decimal points and exponents (e.g., 6.022e+23 ). GNU version only.
- -f
- Fold letters implicitly to a common lettercase so that sorting is case-insensitive.
- -i
- Ignore nonprintable characters.
- -k
- Define the sort key field.
- -m
- Merge already-sorted input files into a sorted output stream.
- -n
- Compare fields as integer numbers.
- -o outfile
- Write output to the specified file instead of to standard output. If the file is one of the input files, sort copies it to a temporary file before sorting and writing the output.
- -r
- Reverse the sort order to descending, rather than the default ascending.
- -t char
- Use the single character char as the default field separator, instead of the default of whitespace.
- -u
- Unique records only: discard all but the first record in a group with equal keys. Only the key fields matter: other parts of the discarded records may differ.
- Behavior
- sort reads the specified files, or standard input if no files are given, and writes the sorted data on standard output.
Sorting by Lines
In the simplest case, when no command-line options are supplied, complete records are sorted according to the order defined by the current locale. In the traditional C locale, that means ASCII order, but you can set an alternate locale as we described in Section 2.8. A tiny bilingual dictionary in the ISO 8859-1 encoding translates four French words differing only in accents:
$ cat french-english # Show the tiny dictionary
côte coast
cote dimension
coté dimensioned
côté side
To understand the sorting, use the octal dump tool, od, to display the French words in ASCII and octal:
$ cut -f1 french-english | od -a -b # Display French words in octal bytes
0000000   c   t   t   e  nl   c   o   t   e  nl   c   o   t   i  nl   c
        143 364 164 145 012 143 157 164 145 012 143 157 164 351 012 143
0000020   t   t   i  nl
        364 164 351 012
0000024
Evidently, with the ASCII option -a, od strips the high-order bit of characters, so the accented letters have been mangled, but we can see their octal values: é is octal 351 and ô is octal 364. On GNU/Linux systems, you can confirm the character values like this:
$ man iso_8859_1 # Check the ISO 8859-1 manual page
...
Oct   Dec   Hex   Char   Description
--------------------------------------------------------------------
...
351   233   E9    é      LATIN SMALL LETTER E WITH ACUTE
...
364   244   F4    ô      LATIN SMALL LETTER O WITH CIRCUMFLEX
...
First, sort the file in strict byte order:
$ LC_ALL=C sort french-english # Sort in traditional ASCII order
cote dimension
coté dimensioned
côte coast
côté side
Notice that e (octal 145) sorted before é (octal 351), and o (octal 157) sorted before ô (octal 364), as expected from their numerical values. Now sort the text in Canadian-French order:
$ LC_ALL=fr_CA.iso88591 sort french-english # Sort in Canadian-French locale
côte coast
cote dimension
coté dimensioned
côté side
The output order clearly differs from the traditional ordering by raw byte values. Sorting conventions are strongly dependent on language, country, and culture, and the rules are sometimes astonishingly complex. Even English, which mostly pretends that accents are irrelevant, can have complex sorting rules: examine your local telephone directory to see how lettercase, digits, spaces, punctuation, and name variants like McKay and Mackay are handled.
Sorting by Fields
For more control over sorting, the -k option allows you to specify the field to sort on, and the -t option lets you choose the field delimiter. If -t is not specified, then fields are separated by whitespace and leading and trailing whitespace in the record is ignored. With the -t option, the specified character delimits fields, and whitespace is significant. Thus, a three-character record consisting of space-X-space has one field without -t, but three with -t ' ' (the first and third fields are empty). The -k option is followed by a field number, or number pair, optionally separated by whitespace after -k. Each number may be suffixed by a dotted character position, and/or one of the modifier letters shown in the following table.
Letter | Description
b | Ignore leading whitespace.
d | Dictionary order.
f | Fold letters implicitly to a common lettercase.
g | Compare as general floating-point numbers. GNU version only.
i | Ignore nonprintable characters.
n | Compare as (integer) numbers.
r | Reverse the sort order.
Fields and characters within fields are numbered starting from one.
If only one field number is specified, the sort key begins at the start of that field, and continues to the end of the record ( not the end of the field).
If a comma-separated pair of field numbers is given, the sort key starts at the beginning of the first field, and finishes at the end of the second field.
With a dotted character position, comparison begins (first of a number pair) or ends (second of a number pair) at that character position: -k2.4,5.6 compares starting with the fourth character of the second field and ending with the sixth character of the fifth field.
If the start of a sort key falls beyond the end of the record, then the sort key is empty, and empty sort keys sort before all nonempty ones.
When multiple -k options are given, sorting is by the first key field, and then, when records match in that key, by the second key field, and so on.
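The difference between a key that runs to the end of the record and a key confined to one field is easy to miss, so here is a two-line sketch of it. The data are invented, and the records are joined with ";" only to keep the results on one line.

```shell
data=$(printf '%s\n' 'a 1 z' 'b 1 y')

# -k 2 : the key runs from field 2 to the end of the record ("1 z" vs "1 y").
open_key=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2 | tr '\n' ';')

# -k 2,2 : the key is field 2 only; both keys are "1", so the tie is
# broken by comparing the whole lines.
closed_key=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2 | tr '\n' ';')

echo "open:   $open_key"
echo "closed: $closed_key"
```

With -k 2 the record "b 1 y" comes first, while with -k 2,2 the record "a 1 z" does.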
While the -k option is available on all of the systems that we tested, sort also recognizes an older field specification, now considered obsolete, where fields and character positions are numbered from zero. The key start for character m in field n is defined by +n.m, and the key end by -n.m. For example, sort +2.1 -3.2 is equivalent to sort -k3.2,4.2. If the character position is omitted, it defaults to zero. Thus, +4.0nr and +4nr mean the same thing: a numeric key, beginning at the start of the fifth field, to be sorted in reverse (descending) order.
Let's try out these options on a sample password file, sorting it by the username, which is found in the first colon-separated field:
$ sort -t: -k1,1 /etc/passwd # Sort by username
bin:x:1:1:bin:/bin:/sbin/nologin
chico:x:12501:1000:Chico Marx:/home/chico:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
groucho:x:12503:2000:Groucho Marx:/home/groucho:/bin/sh
gummo:x:12504:3000:Gummo Marx:/home/gummo:/usr/local/bin/ksh93
harpo:x:12502:1000:Harpo Marx:/home/harpo:/bin/ksh
root:x:0:0:root:/root:/bin/bash
zeppo:x:12505:1000:Zeppo Marx:/home/zeppo:/bin/zsh
For more control, add a modifier letter in the field selector to define the type of data in the field and the sorting order. Here's how to sort the password file by descending UID:
$ sort -t: -k3nr /etc/passwd # Sort by descending UID
zeppo:x:12505:1000:Zeppo Marx:/home/zeppo:/bin/zsh
gummo:x:12504:3000:Gummo Marx:/home/gummo:/usr/local/bin/ksh93
groucho:x:12503:2000:Groucho Marx:/home/groucho:/bin/sh
harpo:x:12502:1000:Harpo Marx:/home/harpo:/bin/ksh
chico:x:12501:1000:Chico Marx:/home/chico:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
bin:x:1:1:bin:/bin:/sbin/nologin
root:x:0:0:root:/root:/bin/bash
A more precise field specification would have been -k3nr,3 (that is, from the start of field three, numerically, in reverse order, to the end of field three), or -k3,3nr, or even -k3,3 -n -r, but sort stops collecting a number at the first nondigit, so -k3nr works correctly.
In our password file example, three users have a common GID in field 4, so we could sort first by GID, and then by UID, with:
$ sort -t: -k4n -k3n /etc/passwd # Sort by GID and UID
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
chico:x:12501:1000:Chico Marx:/home/chico:/bin/bash
harpo:x:12502:1000:Harpo Marx:/home/harpo:/bin/ksh
zeppo:x:12505:1000:Zeppo Marx:/home/zeppo:/bin/zsh
groucho:x:12503:2000:Groucho Marx:/home/groucho:/bin/sh
gummo:x:12504:3000:Gummo Marx:/home/gummo:/usr/local/bin/ksh93
The useful -u option asks sort to output only unique records, where unique means that their sort-key fields match, even if there are differences elsewhere. Reusing the password file one last time, we find:
$ sort -t: -k4n -u /etc/passwd # Sort by unique GID
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
chico:x:12501:1000:Chico Marx:/home/chico:/bin/bash
groucho:x:12503:2000:Groucho Marx:/home/groucho:/bin/sh
gummo:x:12504:3000:Gummo Marx:/home/gummo:/usr/local/bin/ksh93
Notice that the output is shorter: three users are in group 1000, but only one of them was output...
Sorting Text Blocks
Sometimes you need to sort data composed of multiline records. A good example is an address list, which is conveniently stored with one or more blank lines between addresses. For data like this, there is no constant sort-key position that could be used in a -k option, so you have to help out by supplying some extra markup. Here's a simple example:
$ cat my-friends # Show address file
# SORTKEY: Schloß, Hans Jürgen
Hans Jürgen Schloß
Unter den Linden 78
D-10117 Berlin
Germany

# SORTKEY: Jones, Adrian
Adrian Jones
371 Montgomery Park Road
Henley-on-Thames RG9 4AJ
UK

# SORTKEY: Brown, Kim
Kim Brown
1841 S Main Street
Westchester, NY 10502
USA
The sorting trick is to use the ability of awk to handle more-general record separators to recognize paragraph breaks, temporarily replace the line breaks inside each address with an otherwise unused character, such as an unprintable control character, and replace the paragraph break with a newline. sort then sees lines that look like this:
# SORTKEY: Schloß, Hans Jürgen^ZHans Jürgen Schloß^ZUnter den Linden 78^Z...
# SORTKEY: Jones, Adrian^ZAdrian Jones^Z371 Montgomery Park Road^Z...
# SORTKEY: Brown, Kim^ZKim Brown^Z1841 S Main Street^Z...
Here, ^Z is a Ctrl-Z character. A filter step downstream from sort restores the line breaks and paragraph breaks, and the sort key lines are easily removed, if desired, with grep. The entire pipeline looks like this:
cat my-friends | # Pipe in address file
awk -v RS="" '{ gsub("\n", "^Z"); print }' | # Convert addresses to single lines
sort -f | # Sort address bundles, ignoring case
awk -v ORS="\n\n" '{ gsub("^Z", "\n"); print }' | # Restore line structure
grep -v '# SORTKEY' # Remove markup lines
The gsub( ) function performs "global substitutions." It is similar to the s/x/y/g construct in sed. The RS variable is the input Record Separator. Normally, input records are separated by newlines, making each line a separate record. Using RS="" is a special case, whereby records are separated by blank lines; i.e., each block or "paragraph" of text forms a separate record. This is exactly the form of our input data. Finally, ORS is the Output Record Separator; each output record printed with print is terminated with its value. Its default is also normally a single newline; setting it here to "\n\n" preserves the input format with blank lines separating records. (More detail on these constructs may be found in Chapter 9.)
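For experimenting with this technique, the pipeline can be written so that it does not depend on typing a literal Ctrl-Z. The sketch below is a variation, not the book's exact text: it uses the awk octal escape \032 for Ctrl-Z, a shortened two-entry address file invented for the illustration, and LC_ALL=C to keep the ordering predictable.

```shell
addr=$(mktemp)
cat > "$addr" <<'EOF'
# SORTKEY: Jones, Adrian
Adrian Jones
UK

# SORTKEY: Brown, Kim
Kim Brown
USA
EOF

# 1. Join each blank-line-separated block into one line (newline -> Ctrl-Z).
# 2. Sort the one-line bundles, ignoring case.
# 3. Restore the line breaks and the blank line between blocks.
# 4. Drop the markup lines.
sorted=$(awk -v RS="" '{ gsub("\n", "\032"); print }' "$addr" |
    LC_ALL=C sort -f |
    awk -v ORS="\n\n" '{ gsub("\032", "\n"); print }' |
    grep -v '# SORTKEY')

printf '%s\n' "$sorted"
rm -f "$addr"
```

The Brown entry is printed before the Jones entry, each with its lines back in their original order.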
The beauty of this approach is that we can easily include additional keys in each address that can be used for both sorting and selection: for example, an extra markup line of the form:
# COUNTRY: UK
in each address, and an additional pipeline stage of grep '# COUNTRY: UK' just before the sort, would let us extract only the UK addresses for further processing.
You could, of course, go overboard and use XML markup to identify the parts of the address in excruciating detail:
<address>
  <personalname>Hans Jürgen</personalname>
  <familyname>Schloß</familyname>
  <streetname>Unter den Linden</streetname>
  <streetnumber>78</streetnumber>
  <postalcode>D-10117</postalcode>
  <city>Berlin</city>
  <country>Germany</country>
</address>
With fancier data-processing filters, you could then please your post office by presorting your mail by country and postal code, but our minimal markup and simple pipeline are often good enough to get the job done.
4.1.4. Sort Efficiency
The obvious way to sort data requires comparing all pairs of items to see which comes first, and leads to algorithms known as bubble sort and insertion sort. These quick-and-dirty algorithms are fine for small amounts of data, but they certainly are not quick for large amounts, because their work to sort n records grows like n^2. This is quite different from almost all of the filters that we discuss in this book: they read a record, process it, and output it, so their execution time is directly proportional to the number of records, n.
Fortunately, the sorting problem has had lots of attention in the computing community, and good sorting algorithms are known whose average complexity goes like n^(3/2) (shellsort), n log n (heapsort, mergesort, and quicksort), and for restricted kinds of data, n (distribution sort). The Unix sort command implementation has received extensive study and optimization: you can be confident that it will do the job efficiently, and almost certainly better than you can do yourself without learning a lot more about sorting algorithms.
4.1.5. Sort Stability
An important question about sorting algorithms is whether or not they are stable: that is, is the input order of equal records preserved in the output? A stable sort may be desirable when records are sorted by multiple keys, or more than once in a pipeline. POSIX does not require that sort be stable, and most implementations are not, as this example shows:
$ sort -t_ -k1,1 -k2,2 << EOF # Sort four lines by first two fields
> one_two
> one_two_three
> one_two_four
> one_two_five
> EOF
one_two
one_two_five
one_two_four
one_two_three
The sort fields are identical in each record, but the output differs from the input, so sort is not stable. Fortunately, the GNU implementation in the coreutils package [1] remedies that deficiency via the --stable option: its output for this example correctly matches the input.
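The stability experiment can be repeated without a here-document. This sketch assumes GNU sort (the -s short form of --stable is an extension, not part of POSIX) and compares the default result with the stable one on the same four lines.

```shell
input=$(printf '%s\n' one_two one_two_three one_two_four one_two_five)

# Default: records with equal keys are ordered by a whole-line comparison.
default_order=$(printf '%s\n' "$input" | LC_ALL=C sort -t_ -k1,1 -k2,2)

# -s (--stable): records with equal keys keep their input order.
stable_order=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t_ -k1,1 -k2,2)

printf 'default:\n%s\n' "$default_order"
printf 'stable:\n%s\n' "$stable_order"
```

With -s the output is identical to the input, because all four records have the same two key fields.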
[1] Available at ftp://ftp.gnu.org/gnu/coreutils/ .
Jul 14, 2007
Here's a convenient way of finding those space hogs in your home directory (can be any directory). For me, those large files are usually a result of mkfile event (testing purposes) and can be promptly deleted. Here's an example of its use.
# cd /export/home/esofthub
# ls -l | sort +4n | awk '{print $5 "\t" $9}'
Find recursively (a little awkward):
# ls -lR | sort +4n | awk '{print $5 "\t" $9}' | more
In the following examples, first the preferred and then the obsolete way of specifying sort keys are given as an aid to understanding the relationship between the two forms.
Example 1 Sorting with the second field as a sort key
Either of the following commands sorts the contents of infile with the second field as the sort key:
example% sort -k 2,2 infile
example% sort +1 -2 infile
Example 2 Sorting in reverse order
Either of the following commands sorts, in reverse order, the contents of infile1 and infile2, placing the output in outfile and using the second character of the second field as the sort key (assuming that the first character of the second field is the field separator):
example% sort -r -o outfile -k 2.2,2.2 infile1 infile2
example% sort -r -o outfile +1.1 -1.2 infile1 infile2
Example 3 Sorting using a specified character in one of the files
Either of the following commands sorts the contents of infile1 and infile2 using the second non-blank character of the second field as the sort key:
example% sort -k 2.2b,2.2b infile1 infile2
example% sort +1.1b -1.2b infile1 infile2
Example 4 Sorting by numeric user ID
Either of the following commands prints the passwd(4) file (user database) sorted by the numeric user ID (the third colon-separated field):
example% sort -t : -k 3,3n /etc/passwd
example% sort -t : +2 -3n /etc/passwd
Example 5 Printing sorted lines excluding lines that duplicate a field
Either of the following commands prints the lines of the already sorted file infile, suppressing all but one occurrence of lines having the same third field:
example% sort -um -k 3.1,3.0 infile
example% sort -um +2.0 -3.0 infile
Example 6 Sorting by host IP address
Either of the following commands prints the hosts(4) file (IPv4 hosts database), sorted by the numeric IP address (the first four numeric fields):
example$ sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n /etc/hosts
example$ sort -t . +0 -1n +1 -2n +2 -3n +3 -4n /etc/hosts
Since '.' is both the field delimiter and, in many locales, the decimal separator, failure to specify both ends of the field will lead to results where the second field is interpreted as a fractional portion of the first, and so forth.
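A small sketch of Example 6 on inline data rather than /etc/hosts. The function name sort_ips and the sample addresses are invented for illustration; the key specification is the one from the example.

```shell
# Sort dotted-quad IPv4 addresses read from stdin.
# Each octet is its own numeric key with both ends given (-k N,Nn),
# so no octet is read as the fractional part of the previous one.
sort_ips() {
  LC_ALL=C sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n
}

printf '%s\n' 10.0.0.12 10.0.0.2 192.168.1.1 9.255.0.1 | sort_ips
```

A plain sort of the same input would place 10.0.0.12 before 10.0.0.2 and 192.168.1.1 before 9.255.0.1, because it compares the addresses character by character.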
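Here are some examples to illustrate various combinations of options.
Sort in descending (reverse) numeric order:
sort -nr
Sort alphabetically, omitting the first and second fields (the key starts at field three and runs to the end of each line):
sort -k 3
Sort numerically on the second field and resolve ties by sorting alphabetically on the third and fourth characters of field five, using `:' as the field delimiter:
sort -t : -k 2,2n -k 5.3,5.4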
Note that if you had written `-k 2' instead of `-k 2,2' sort
would have used all characters beginning in the second field and extending to the end of the line
as the primary numeric key. For the large majority of applications, treating keys spanning
more than one field as numeric will not do what you expect.
Also note that the `n' modifier was applied to the field-end specifier for the first key. It would have been equivalent to specify `-k 2n,2' or `-k 2n,2n'. All modifiers except `b' apply to the associated field, regardless of whether the modifier character is attached to the field-start and/or the field-end part of the key specifier.
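Sort the password file on the fifth field, ignoring any leading blanks, and break ties on the numeric user ID in field three:
sort -t : -k 5b,5 -k 3,3n /etc/passwd
An alternative is to use the global numeric modifier `-n'.
sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
Generate a tags file in case-insensitive sorted order:
find src -type f -print0 | sort -t / -z -f | xargs -0 etags --append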
The use of `-print0', `-z', and `-0' in this case means that pathnames that contain Line Feed characters will not get broken up by the sort operation.
Finally, to ignore both leading and trailing white space, you could have applied the `b' modifier to the field-end specifier for the first key,
sort -t : -n -k 5b,5b -k 3,3 /etc/passwd
or by using the global `-b' modifier instead of `-n' and an explicit `n' with the second key specifier.
sort -t : -b -k 5,5 -k 3,3n /etc/passwd
Let's assume that we want to sort /etc/passwd using the GECOS field. To achieve this, we will use sort, the Unix sorting tool:
$ sort -t: +4 /etc/passwd
murie:x:500:500:Manuel Muriel Cordero:/home/murie:/bin/bash
practica:x:501:501:Usuario de practicas para Ksh:/home/practica:/bin/ksh
wizard:x:502:502:Wizard para nethack:/home/wizard:/bin/bash
root:x:0:0:root:/root:/bin/bash
It is easy to see that the file has been sorted, but in ASCII table order. If we do not want to distinguish between uppercase and lowercase letters, we can use:
$ sort -t: +4f /etc/passwd
murie:x:500:500:Manuel Muriel Cordero:/home/murie:/bin/bash
root:x:0:0:root:/root:/bin/bash
practica:x:501:501:Usuario de practicas para Ksh:/home/practica:/bin/ksh
wizard:x:502:502:Wizard para nethack:/home/wizard:/bin/bash
-t is the option to select the field separator. +4 stands for the number of fields to skip before ordering the lines, and f means to sort regardless of upper and lowercase.
A much more complicated sort can be achieved. For example, we can sort by the shell first and then by the GECOS field:
$ sort -t: +6r +4f /etc/passwd
practica:x:501:501:Usuario de practicas para Ksh:/home/practica:/bin/ksh
murie:x:500:500:Manuel Muriel Cordero:/home/murie:/bin/bash
root:x:0:0:root:/root:/bin/bash
wizard:x:502:502:Wizard para nethack:/home/wizard:/bin/bash
You have a file listing the people you lent money to and the amount you gave each of them. Take 'deudas.txt' as an example:
Son Goku:23450
Son Gohan:4570
Picolo:356700
Ranma 1/2:700
If you want to know the first one to 'visit', you need a sorted list. Just type
$ sort +1 deudas
Ranma 1/2:700
Son Gohan:4570
Son Goku:23450
Picolo:356700
which is not the desired result because the number of fields is not the same across the file. The solution is the 'n' option:
$ sort +1n deudas
Picolo:356700
Son Goku:23450
Son Gohan:4570
Ranma 1/2:700
Basic options for sort are
+n.m jumps over the first n fields and the next m characters before beginning the sort
-n.m stops the sorting when arriving at the m-th character of the n-th field
The following are modification parameters:
-b jumps over leading whitespaces
-d dictionary sort (just using letters, numbers and whitespace)
-f ignores case distinction
-n sort numerically
-r reverse order
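Since the records in deudas.txt separate the name from the amount with a colon, one possible way to make the key independent of the number of blank-separated words is to declare the colon as the delimiter and sort on the second field. This is only a sketch in the newer -k syntax; the function name sort_debts is hypothetical, and the reverse flag is added on the assumption that the largest debtor should come first.

```shell
# Sort "name:amount" records from stdin by amount, largest first.
# -t : makes the colon the field separator, so blanks inside names do not matter.
# -k 2,2nr restricts the key to field 2 and compares it numerically, in reverse.
sort_debts() {
  LC_ALL=C sort -t : -k 2,2nr
}

printf '%s\n' 'Son Goku:23450' 'Son Gohan:4570' 'Picolo:356700' 'Ranma 1/2:700' | sort_debts
```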
The sort utility practicum from New Mexico Institute of Mining and Technology
The term sorting, strictly speaking, really means to separate things into different categories. For example, you might sort clothes for washing into light and dark colors.
In computer jargon, though, when we say we are sorting data, we really mean that we are ordering it, that is, putting records in order according to their contents. For example, we might write a program to sort the entries in an address book into alphabetical order.
The sort utility reads a stream of records and outputs the records in order according to one or more sort keys, that is, according to part or all of the contents of each record.
Input and output streams
If sort is executed without any arguments, it reads a stream of lines from its standard input, sorts them in order by the ASCII codes of all the characters from left to right, and writes the sorted stream to the standard output.
You may also specify one or more input files as arguments to sort. This example would sort three files named moe, larry and curly, and call the output file stooges:
% sort moe larry curly > stooges
You can ask sort to write to a specific file by using the -o option, followed by a space and then the name of the desired output file. This command would work just like the previous example:
% sort moe larry curly -o stooges
Fields and keys
A field is some part of a record. For example, a file containing records describing your grocery list might have two fields, one for the item, and another for the quantity needed:
eggplant 2
chicken 1#
apples 8
A field separator is some character you put between fields in a record. In the above example, spaces are used as field separators. If you don't specify otherwise, the sort utility assumes that space is the field separator.
A different grocery list might use, for example, comma as a field separator. This would allow you to have blanks within a field:
scallion, 3 bunches
ground pork, 1.5 lbs
garlic, 10 heads
A sort key is the field (or fields) used in ordering records. If you want to sort on a certain field, use the +n option to sort, where n is the number of fields to be skipped. Thus, sort +0 means to sort on the first field, sort +1 means to sort on the second field, and so on.
For example, here is a file describing mineral specimens. Each record has three fields--the type of mineral, the price, and the place it was collected.
% cat minerals
quartz 0.30 Georgetown
feldspar 0.50 Riley
shale 0.42 Floydada
To sort this file by place (the third field), we use:
% sort +2 minerals
shale 0.42 Floydada
quartz 0.30 Georgetown
feldspar 0.50 Riley
Sometimes you want to sort a file on more than one key. For example, suppose you want to sort a list of students by grade and name: you want all the A's together, and all the B's, but within each grade you want the students in alphabetical order. The most important key is called the major key. If two records have the same value in their major key field, sort can then use another field (sometimes called the minor key) as a tie-breaker.
You can have any number of keys. For example, if you specify seven sort keys, and two given records have identical values for the first six keys, but different values for the seventh key, those two records will be ordered according to their seventh key.
To specify multiple keys to sort, use +m and -n options in pairs. A pair of arguments of the form +m -n tells sort to use fields (m + 1) through n, inclusive, as keys. If a +m option isn't followed by a -n option, sort uses all the fields through the end of the record as keys. Thus, sort +3 would use all fields from the fourth through the last.
For example, suppose you have a file named x of records with ten keys each, and you want to sort on the third, fourth, fifth, ninth, and first fields, in that order. Here is the correct command:
% sort +2 -5 +8 -9 +0 -1 x
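In the newer -k notation shown in the man page examples earlier on this page, where +1 -2 corresponds to -k 2,2, the command above would be written sort -k 3,5 -k 9,9 -k 1,1 x. The grade-and-name case can be sketched the same way. The function name sort_students and the sample records (name first, grade second) are invented for illustration.

```shell
# Major key: grade (field 2). Minor key: name (field 1), used only to break ties.
sort_students() {
  LC_ALL=C sort -k 2,2 -k 1,1
}

printf '%s\n' 'Zoe B' 'Adam B' 'Mia A' 'Bob A' | sort_students
```

All the A records come out before the B records, and within each grade the names are in alphabetical order.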
Sort options
Here is the full syntax of the sort command, taken from the man page:
% sort [-mcubfdinrt] [+m [-n]]... [-o outfile] [-T directory] [ infile ]...
This command syntax is typical of Unix utilities: there are a group of letters (-mcubfdinrt) that must be preceded by a hyphen. These "dash options" change the way that files are sorted.
The -m option selects merging instead of sorting. Merging produces a single sorted file by putting together two or more files that are already sorted by the same criteria. More than one infile must be specified. If the input files are not already sorted, sort will not produce sorted output, and it won't warn you either.
The -c option causes the input to be checked to see if it is sorted; it won't actually sort anything. If the input file is correctly sorted according to the selected keys, there will be no output. (The man page doesn't say what will be output in case sort errors are found.)
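One way to use -c in a script is to rely on the exit status instead of the output: POSIX specifies that sort -c exits with a nonzero status when the input is out of order. A minimal sketch, with is_sorted as a made-up helper name; any diagnostic the implementation prints is discarded here.

```shell
# Succeeds (exit status 0) if stdin is already sorted, fails otherwise.
# Extra arguments such as key options are passed through to sort.
is_sorted() {
  LC_ALL=C sort -c "$@" 2>/dev/null
}

if printf '%s\n' apple banana cherry | is_sorted; then
  echo "first input is sorted"
fi
if ! printf '%s\n' banana apple | is_sorted; then
  echo "second input is not sorted"
fi
```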
The -u option stands for unique. With this option, whenever two records compare equal in all keys (not necessarily in other fields), sort will throw away one of them. The output of sort -u will thus contain only one of each set of key values.
The -b option instructs sort to ignore leading blanks while sorting. Compare these examples:
% cat leaders
  rat
 bat
cat
% sort leaders
  rat
 bat
cat
% sort -b leaders
 bat
cat
  rat
The -f option stands for "fold," which means that uppercase letters should be treated the same as lowercase. In the ASCII character set, normally all capital letters sort before all lowercases letters.
% cat cases
purple
brown
MacGillivray's
% sort cases
MacGillivray's
brown
purple
% sort -f cases
brown
MacGillivray's
purple
The -d option selects "dictionary"-style comparisons. Punctuation marks (actually, anything but letters, digits and blanks) are ignored:
% cat irish
O'Donahue
O'Dell
Odets
% sort irish
O'Dell
O'Donahue
Odets
% sort -df irish
O'Dell
Odets
O'Donahue
The -i option makes sort ignore nonprinting characters during key comparisons.
The -n option specifies that a sort key is a number, and should be sorted by its numeric value, not its string value. Compare these two examples:
% cat numbers
0.03
159.7
96.3
87334
% sort numbers
0.03
159.7
87334
96.3
% sort -n numbers
0.03
96.3
159.7
87334
The -r option reverses the sort order from ascending to descending:
% sort -nr numbers
87334
159.7
96.3
0.03
% sort -r presidents
Reagan, Ronald
Carter, Jimmy
Bush, George
Finally, the -t option allows you to specify a field separator. The t stands for "tab character," another name for the field separator character, but this is confusing because there is an ASCII character called tab, which may or may not be used as a field separator. The t must be followed immediately by the character to be used as field separator:
% cat grocery
scallion, 3 bunches
ground pork, 1.5 lbs
garlic, 10 heads
% sort -nt, +1 grocery
ground pork, 1.5 lbs
scallion, 3 bunches
garlic, 10 heads
If you use a field separator that has some special meaning to the shell, you should enclose it in apostrophes:
% sort -t'|' infile -o outfile
The -T option may be necessary if you are sorting large files; it tells sort to use a specified directory for its scratch area while sorting. The -T must be followed by one space, then the pathname of a directory.
For example, I was sorting a 5-megabyte file once and sort bombed out due to lack of space. I found out that it uses the root directory (/ ) as its default scratch directory, and at that time the root directory only had 3 megabytes of space left. I found that the /tmp directory had 100 megabytes left (the df command will tell you how much space is left on every disc on the system), and used this command:
% sort -T /tmp <other options>...
Key offsets
It is possible to use part of a field as a sort key. You may specify that the nth character of a field be the beginning or end of a sort key.
The +m.a and -n.b options are used for this key specification. In this syntax, the a and b numbers give the offsets into the fields where the key begins, that is, it specifies the number of characters into the field.
For example, let us suppose that the first field on a line has the form aannnn, where the aa portion is a letter code and the nnnn portion is a string of digits. If you want to sort on the digit portion, ignoring the letters, use:
% sort +0.2
that is, use the first field starting at the third character. Here is an example of a key offset. You are given a file containing people's Social Security Numbers of the form aaabbcccc, and you want to sort on the bb section as the major key, and the aaa and cccc sections as minor keys. Colon (: ) is used as the field separator.
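The command for this exercise is not reproduced here. A sketch of one way to do it in the newer -k notation follows, under the assumption (not stated in the text) that the number is the first colon-separated field. The function name sort_ssn and the sample records are invented.

```shell
# Key 1 (major): characters 4-5 of field 1, the "bb" part.
# Key 2 (minor): characters 1-3, the "aaa" part.
# Key 3 (minor): characters 6-9, the "cccc" part.
# Assumption: the nine-digit number is the first colon-separated field.
sort_ssn() {
  LC_ALL=C sort -t : -k 1.4,1.5 -k 1.1,1.3 -k 1.6,1.9
}

printf '%s\n' '123456789:Ann' '111456789:Bob' '999221111:Cy' '123450001:Di' | sort_ssn
```

The record with bb = 22 comes first; the three records with bb = 45 are then ordered by their aaa part, and the two that share aaa = 123 are ordered by cccc.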
ps -ef | sort
This command pipeline sorts the output of the "ps -ef" command. Because no arguments are supplied to the sort command, whole lines are compared, so the output is effectively ordered alphabetically by the first column of the ps -ef output (i.e., by username).
ls -al | sort +4n
This command performs a numeric sort on the fifth column of the "ls -al" output. This results in a file listing where the files are listed in ascending order, from smallest in size to largest in size.
ls -al | sort +4n | more
The same command as the previous, except the output is piped into the more command. This is useful when the output will not all fit on one screen.
ls -al | sort +4nr
This command reverses the order of the numeric sort, so files are listed in descending order of size, with the largest file listed first, and the smallest file listed last.
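The +4n form used above is the obsolete key syntax, and some newer sort implementations may not accept it. A sketch of the same size sort with -k follows; the function name by_size and the sample listing are made up so that the result does not depend on any real directory.

```shell
# Sort "ls -l"-style lines from stdin by the fifth field (size), ascending.
# -k 5,5n is the -k equivalent of the obsolete "+4n -5".
by_size() {
  LC_ALL=C sort -k 5,5n
}

# Typical use would be: ls -al | by_size
sample='-rw-r--r-- 1 u g 300 Jan 1 00:00 c
-rw-r--r-- 1 u g 20 Jan 1 00:00 a
-rw-r--r-- 1 u g 1000 Jan 1 00:00 b'
printf '%s\n' "$sample" | by_size | awk '{print $5 "\t" $9}'
```

For descending order, as in the last example above, the key would become -k 5,5nr.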
The output of du has been very informative, but it's difficult to scan a listing to ascertain the four or five largest directories, particularly as more and more directories and files are included in the output. The good news is that the Unix sort utility is just the tool we need to sidestep this problem.
# du -s * | sort -nr
13984 Lynx
10464 IBM
3092 Gator
412 bin
196 DEMO
84 etcpasswd
76 CBO_MAIL
48 elance
36 CraigsList
16 Exchange
4 gettermsheet.sh
4 getstocks.sh
4 getmodemdriver.sh
4 buckaroo
4 browse.sh
4 badjoke.rot13
4 badjoke
0 gif.gif
One final concept and we're ready to move along. If you want to see only the five largest files or directories in a specific directory, all you need to do is pipe the command sequence to head:
# du -s * | sort -nr | head -5
13984 Lynx
10464 IBM
3092 Gator
412 bin
196 DEMO
The ! command (pronounced "bang") creates a temporary file to be used with a program that requires a filename in its command line. This is useful with shells that don't support process substitution. For example, to diff two files after sorting them, you might do:
diff `! sort file1` `! sort file2`
commer
commer is a shell script that uses comm to compare two sorted files; it processes comm's output to make it easier to read. (See article 11.9.)
lensort
lensort sorts lines from shortest to longest. (See article 22.7.)
namesort
The namesort program sorts a list of names by the last name. (See article 22.8.) See also namesort.pl.
namesort.pl
The namesort.pl script uses the Perl module Lingua::EN::NameParse to sort a list of names by the last name. (See article 22.8.) See also namesort.
Daniel Malaby dan at malaby.com
Thu Jul 14 08:02:47 GMT 2005
Hi All,

I am trying to sort a tab delimited file with sort. The problem I am having is with the -t option. I do not know how to pass a tab.

Things I have tried:
sort -t \t
sort -t '\t'
sort -t "\t"
sort -t 0x09
sort -t '0x09'
sort -t "0x09"
sort -t ^I
sort -t '^I'
sort -t "^I"

Any suggestions would be much appreciated.

Thanks
Daniel Malaby dan at malaby.com Thu Jul 14 17:48:05 GMT 2005
Nelis Lamprecht wrote:
> On 7/14/05, Nelis Lamprecht <nlamprecht at gmail.com> wrote:
>
>>On 7/14/05, Daniel Malaby <dan at malaby.com> wrote:
>>
>>>Hi All,
>>>
>>>I am trying to sort a tab delimited file with sort. The problem I am
>>>having is with the -t option. I do not know how to pass a tab.
>>
>><snip>
>>
>>>sort -t \t
>>
>></snip>
>>
>>>Any suggestions would be much appreciated.
>>
>>remove the space between -t and \t and it should work
>
>
> actually scratch that, it works either way. can you give a sample of the data ?
>
> Regards,
> Nelis

The sample data has 9 fields, I am trying to sort on the fifth field, here is what I have tried.

sort -t\t +4 -5 -o test.txt sample.txt

I did try removing the space and it did not work, I have also tried removing the -5. I think the spaces in the third field are confusing sort.

BTW this is being done on a PC running FBSD 4.11 prerelease #1

Thanks for your help and suggestions.
-------------- next part --------------
E002 19085 GENERAL DYNAMICS 5031802 E-GL/VX/B/R1.0 SFT CD, GL VXWORKS BOREALIS R1.0 06/30/05 1 $995.00 $995.00
E016 19096 TGA INGENIERIA Y ELECTRONICS S 5881-2 E-AD600729C501 ARGUS PMC,2 DVI 16MB PERCHAN USB A/V 12/01/05 30 $2,312.00 $69,360.00
E016 19096 TGA INGENIERIA Y ELECTRONICS S 5881-2 E-DDX/SO/R4.0 SFT CD, DDX SOL 2.6-9 BOREALIS R4.0 12/01/05 30 $74.00 $2,220.00
E016 19096 TGA INGENIERIA Y ELECTRONICS S 5881-2 E-VIN/SO/R1.0 SFT CD, VID CAP SOL 2.6-9 BOREALIS R1.0 12/01/05 30 $74.00 $2,220.00
E021 19093 GANYMED COMPUTER GMBH 7103879 E-AD90073913011 GARNET PMC RIO8 C2, REAR I/O 16MB 07/19/05 2 $1,848.00 $3,696.00
E024 19080 DRS LAUREL TECHNOLOGIES 94358 E-AC7007121115A ECLIPSE3 PMC, VGA 16MB Q70 08/18/05 1 $846.00 $846.00
E024 19080 DRS LAUREL TECHNOLOGIES 94358 E-AC7007121115A ECLIPSE3 PMC, VGA 16MB Q70 10/19/05 19 $846.00 $16,074.00
E024 19080 DRS LAUREL TECHNOLOGIES 94358 E-AC7007121115A ECLIPSE3 PMC, VGA 16MB Q70 09/20/05 2 $846.00 $1,692.00
E024 19080 DRS LAUREL TECHNOLOGIES 94358 E-AC7007121115A ECLIPSE3 PMC, VGA 16MB Q70 11/17/05 7 $846.00 $5,922.00
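The thread as excerpted here does not include an answer. One approach that should work in any POSIX shell is to let printf produce a real tab character and hand that to -t, since sort does not interpret the two characters \t itself. This is a sketch only: the variable TAB, the function sort_by_field5, and the two-record sample are made up for illustration.

```shell
# Put a literal tab character into a variable; printf interprets \t, sort does not.
TAB=$(printf '\t')

# Sort tab-delimited records from stdin on the fifth field only.
# Blanks inside a field (as in the third field here) no longer split it.
sort_by_field5() {
  LC_ALL=C sort -t "$TAB" -k 5,5
}

printf 'E1\t100\tGENERAL DYNAMICS\t42\tzeta\nE2\t200\tDRS LAUREL\t43\talpha\n' | sort_by_field5
```

With the poster's file names this would read sort -t "$TAB" -k 5,5 -o test.txt sample.txt.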
For example, suppose we want to list the distinct file owners in a directory. To do this, we must perform three discrete tasks:
1. We must list all files in the directory (ls -al)
2. We must parse this output and extract the file owner from the third column of the output. (awk '{ print $3 }')
3. We must then take the list of file owners and remove duplicate entries (sort -u)
Using the pipe command, we can tie these three functions together into a single UNIX command, piping the output from one command and sending it as input to the next UNIX command:
root> ls -al|awk '{ print $3 }'|sort -u
marion
oracle
root
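The three steps above can be sketched as a small function that reads an ls -al style listing on stdin, which makes it easy to try on fixed data. The name distinct_owners and the sample listing are invented, and the NF test is an addition not present in the original pipeline; it is there to skip the "total" line that ls prints first.

```shell
# Print each distinct owner (third column) of an "ls -al"-style listing once.
# NF >= 9 skips the leading "total N" line, which has no owner column.
distinct_owners() {
  awk 'NF >= 9 { print $3 }' | LC_ALL=C sort -u
}

# Typical use would be: ls -al | distinct_owners
listing='total 12
drwxr-xr-x 2 root root 4096 Jan 1 00:00 .
-rw-r--r-- 1 oracle dba 10 Jan 1 00:00 a
-rw-r--r-- 1 marion users 10 Jan 1 00:00 b
-rw-r--r-- 1 oracle dba 10 Jan 1 00:00 c'
printf '%s\n' "$listing" | distinct_owners
```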
% awk '{ print NF " " $0}' < out | sort -n | tail
From: Ed Schmollinger (schmolli@private)
Date: Thu Sep 16 2004 - 09:14:32 PDT
On Thu, Sep 16, 2004 at 12:33:12AM +0200, Mike Blomgren wrote:
> I'm having trouble with 'sort' taking alot of cpu-time on a Solaris machine,
> and I'm wondering if anyone knows of a replacement for the gnu 'sort'
> command, which is faster and will compile on Solaris and preferably Linux
> too?
>
> I'm using sort in the standard 'cat <file> | awk '{"compute..."}' | sort |
> uniq -c | sort -n -r' type analysis.

You can get rid of the multiple sorts/uniq thing by doing it all at once:

--- CUT HERE ---
#!/usr/bin/perl -wT
use strict;
my %msg = ();
while (<>) {
    chomp;
    $msg{$_} = $msg{$_} ? $msg{$_} + 1 : 1;
}
for (sort { $msg{$a} <=> $msg{$b} } keys %msg) {
    print "$msg{$_}\t$_\n";
}
--- CUT HERE ---

I've found that for my datasets, the awk/sed stage is what constitues the bulk of the bottleneck. You may want to look at optimizing that part as well.

--
Ed Schmollinger - schmolli@private
Russell Fulton r.fulton at auckland.ac.nz
Mon Sep 20 13:13:59 MDT 2004
On Fri, 2004-09-17 at 05:59, Mike Blomgren wrote:
> Thanks for the tip - I'll have to try that one with perl doing the sort
> instead of gnu sort. I have been somewhat reluctant to use perl since I find
> it has a severe performance impact in some cases - but that may be related
> to my regexp's and not the sorting. For a fact though, I do know that using
> associative arrays is a good way to consume memory in a hurry. And thus
> causing the os to start swapping memory to disk, which is not very
> beneficial for speed, to say the least...

If you are short of memory sort may be swapping stuff out to disk and
hence your performance problems. It depends on the implementations but
some sorts are smart enough to work out how much memory is really
available and then do sort & merges within this. This is much better
than sorts that simply assume that virtual memory is endless and cause
the OS to thrash madly but is much slower than doing the whole thing in
memory.

This will not show up as OS level swapping though, just as lots of disk
activity during the sort.

--
Russell Fulton, Information Security Officer, The University of Auckland
New Zealand
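With GNU sort, the in-memory buffer and the location of the temporary merge files can be set explicitly using the -S and -T options listed in the GNU man page excerpt further down this page. A minimal sketch, assuming GNU sort; the function name sort_big and the 64M figure are arbitrary choices for illustration, not tuning advice.

```shell
# Sort with an explicit memory buffer and scratch directory.
# Assumption: GNU sort, where -S sets the buffer size and -T the temporary directory.
sort_big() {
  LC_ALL=C sort -S 64M -T "${TMPDIR:-/tmp}" "$@"
}

# Small demonstration; on a real log file this would be: sort_big access.log
printf '%s\n' c a b | sort_big
```

Input that does not fit into the buffer is written out as sorted runs under the -T directory and merged afterwards, which is the disk activity described in the message above.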
docs.sun.com man pages section 1 User Commands Sun man page
Write sorted concatenation of all FILE(s) to standard output.
Mandatory arguments to long options are mandatory for short options
too. Ordering options:
-b, --ignore-leading-blanks
ignore leading blanks
-d, --dictionary-order
consider only blanks and alphanumeric characters
-f, --ignore-case
fold lower case to upper case characters
-g, --general-numeric-sort
compare according to general numerical value
-i, --ignore-nonprinting
consider only printable characters
-M, --month-sort
compare (unknown) < `JAN' < ... < `DEC'
-n, --numeric-sort
compare according to string numerical value
-r, --reverse
reverse the result of comparisons
Other options:
-c, --check
check whether input is sorted; do not sort
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
-m, --merge
merge already sorted files; do not sort
-o, --output=FILE write result to FILE instead of standard output
-s, --stable stabilize sort by disabling last-resort comparison
-S, --buffer-size=SIZE use SIZE for main memory buffer
-t, --field-separator=SEP use SEP instead of non-blank to blank transition
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp; multiple options
specify multiple directories
-u, --unique with -c, check for strict ordering; without -c, output only the
first of an equal run
-z, --zero-terminated end lines with 0 byte, not newline
--help display this help and exit
--version output version information and exit
POS is F[.C][OPTS], where F is the field number and C the character
position in the field. OPTS is one or more single-letter ordering
options, which override global ordering options for that key. If no
key is given, use the entire line as the key.
SIZE may be followed by the following multiplicative suffixes: % 1% of
memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.
With no FILE, or when FILE is -, read standard input.
*** WARNING *** The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.
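The effect of the warning above is easy to see on a mixed-case list. In the C locale all uppercase ASCII letters sort before all lowercase ones, whereas many other locales interleave them. A small sketch, with byte_order as a made-up function name:

```shell
# Force the traditional byte-value ordering regardless of the caller's locale.
byte_order() {
  LC_ALL=C sort
}

printf '%s\n' b A a B | byte_order
```

Setting LC_ALL=C for the sort command alone, as here, leaves the locale of the rest of the script untouched.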
/usr/bin/sort
Sort keys can be specified using the options:
The notation:
A field comprises a maximal sequence of nonseparating characters and, in the absence of option -t, any preceding field separator.
The field_start portion of the keydef option-argument has the form:
field_number[.first_character]
Fields and characters within fields are numbered starting with 1. field_number and first_character, interpreted as positive decimal integers, specify the first character to be used as part of a sort key. If .first_character is omitted, it refers to the first character of the field.
The field_end portion of the keydef option-argument has the form:
field_number[.last_character]
The field_number is as described above for field_start. last_character, interpreted as a non-negative decimal integer, specifies the last character to be used as part of the sort key. If last_character evaluates to zero or .last_character is omitted, it refers to the last character of the field specified by field_number.
If the -b option or b type modifier is in effect, characters within a field are counted from the first non-blank character in the field. (This applies separately to first_character and last_character.)
[+pos1[-pos2]]
(obsolete). Provide functionality equivalent to the -k keydef option.
pos1 and pos2 each have the form m.n optionally followed by one or more of the flags bdfiMnr. A starting position specified by +m.n is interpreted to mean the n+1st character in the m+1st field. A missing .n means .0, indicating the first character of the m+1st field. If the b flag is in effect n is counted from the first non-blank in the m+1st field; +m.0b refers to the first nonblank character in the m+1st field.
A last position specified by -m.n is interpreted to mean the nth character (including separators) after the last character of the mth field. A missing .n means .0, indicating the last character of the mth field. If the b flag is in effect n is counted from the last leading blank in the m+1st field; -m.1b refers to the first non-blank in the m+1st field.
The fully specified +pos1 - pos2 form with type modifiers T and U:
+w.xT -y.zU
is equivalent to:
anagram -- an interesting use of sort
% awk '{ print NF " " $0}' < out | sort -n | tail
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Last modified: May 23, 2021