sorter : Split a file based on a key
(The file doesn't need to be sorted on the key)
Usage : sorter [options] <filename> <file>
Options : -d : delete key
          -a : append file
          -z : compress
          -s : storage size
Version : Tue Jan 9 09:02:34 JST 2024
Edition : 1
Reads <file> and writes records that share the same value in the
key field to separate files named according to <filename>. For
example, to write records with the same value in field 2 to separate
files, specify "data.%2" as <filename>. The output files will be named
data.<value of field 2>. Unlike keycut, sorter does not require
the input file to be sorted on the key field. Records are written
to each output file in the order they appear in the input file, so
the output files are not sorted either. The key field is specified
in <filename> as %<field number>; you can also specify a substring
of the field, as in %5.2 or %5.1.3.
$ cat data
04 Connecticut 13 Hartford 92 56 83 96 75
01 Texas 03 Houston 82 0 23 84 10
03 New_Jersey 10 Newark 52 91 44 9 0
02 New_York 04 Manhattan 30 50 71 36 30
01 Texas 01 Austin 91 59 20 76 54
03 New_Jersey 12 Trenton 95 60 35 93 76
04 Connecticut 16 Bridgetown 45 21 24 39 03
02 New_York 05 Brooklyn 78 13 44 28 51
$ sorter data.%1 data
$ ls -l data.*
-rw-r--r-- 1 usp usp 87 Feb 19 11:14 data.01 ↑
-rw-r--r-- 1 usp usp 82 Feb 19 11:14 data.02 Split into
-rw-r--r-- 1 usp usp 77 Feb 19 11:14 data.03 four files
-rw-r--r-- 1 usp usp 91 Feb 19 11:14 data.04 ↓
$ cat data.01
01 Texas 03 Houston 82 0 23 84 10
01 Texas 01 Austin 91 59 20 76 54
$ cat data.02
02 New_York 04 Manhattan 30 50 71 36 30
02 New_York 05 Brooklyn 78 13 44 28 51
$ cat data.03
03 New_Jersey 10 Newark 52 91 44 9 0
03 New_Jersey 12 Trenton 95 60 35 93 76
$ cat data.04
04 Connecticut 13 Hartford 92 56 83 96 75
04 Connecticut 16 Bridgetown 45 21 24 39 03
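For readers without sorter at hand, the split above can be approximated
with awk. This is only a rough sketch of the observable behavior, not
how sorter is implemented; the temporary directory and the four-record
sample are assumptions made for the example.

```shell
# Sketch: approximate `sorter data.%1 data` with awk (not sorter's own code).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Four of the sample records, deliberately unsorted on field 1.
printf '%s\n' \
    '04 Connecticut 13 Hartford 92 56 83 96 75' \
    '01 Texas 03 Houston 82 0 23 84 10' \
    '02 New_York 04 Manhattan 30 50 71 36 30' \
    '01 Texas 01 Austin 91 59 20 76 54' > data

# Send each record to data.<field 1>. awk truncates a file the first time
# it opens it and then keeps it open, so records arrive in input order
# and the input does not need to be sorted.
awk '{ print > ("data." $1) }' data

cat data.01
```

As with sorter, the two Texas records end up in data.01 in the order
they appear in the input.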
$ sorter data.%1.2.1 data
$ ls -l data.*
-rw-r--r-- 1 usp usp 87 Feb 19 11:15 data.1
-rw-r--r-- 1 usp usp 82 Feb 19 11:15 data.2
-rw-r--r-- 1 usp usp 77 Feb 19 11:15 data.3
-rw-r--r-- 1 usp usp 91 Feb 19 11:15 data.4
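Judging by the listing above, %1.2.1 appears to select one character of
field 1 starting at position 2. Under that assumption, awk's substr()
gives a rough equivalent. The sample records and temporary directory
are placeholders for the example.

```shell
# Sketch: approximate `sorter data.%1.2.1 data`, assuming ".2.1" means
# "start at character 2, length 1" as the listing above suggests.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' \
    '04 Connecticut 13 Hartford 92 56 83 96 75' \
    '01 Texas 03 Houston 82 0 23 84 10' \
    '01 Texas 01 Austin 91 59 20 76 54' > data

# substr(s, 2, 1) returns one character starting at position 2 (1-based).
awk '{ print > ("data." substr($1, 2, 1)) }' data

ls data.*
```

With this input the keys 04 and 01 become 4 and 1, so the files created
are data.1 and data.4.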
If you specify the -a option, records are appended to existing
output files instead of replacing them. If an output file doesn't
already exist, it is created. Without this option, an output file
that already exists is overwritten.
$ sorter data.%1 data
$ sorter -a data.%1 data
$ ls -l data.*
-rw-r--r-- 1 usp usp 174 Feb 19 11:16 data.01
-rw-r--r-- 1 usp usp 164 Feb 19 11:16 data.02
-rw-r--r-- 1 usp usp 154 Feb 19 11:16 data.03
-rw-r--r-- 1 usp usp 182 Feb 19 11:16 data.04
$ cat data.01
01 Texas 03 Houston 82 0 23 84 10
01 Texas 01 Austin 91 59 20 76 54
01 Texas 03 Houston 82 0 23 84 10
01 Texas 01 Austin 91 59 20 76 54
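The difference between overwriting and appending can be sketched with
awk's two redirection operators. This only mirrors the behavior
described above; it is not sorter's implementation, and the sample
records are placeholders.

```shell
# Sketch: overwrite versus append, approximated with awk redirections.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' \
    '01 Texas 03 Houston 82 0 23 84 10' \
    '02 New_York 04 Manhattan 30 50 71 36 30' \
    '01 Texas 01 Austin 91 59 20 76 54' > data

# First run: ">" replaces any existing data.<key> file (like sorter without -a).
awk '{ print > ("data." $1) }' data
# Second run: ">>" adds to the files left by the first run (like sorter -a).
awk '{ print >> ("data." $1) }' data

cat data.01
```

After the second run each output file holds its records twice, which
matches the doubled file sizes in the listing above.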
If you specify the -d option, the records are written to the
output files without the key field. Even if the key field is
specified as a substring (such as %1.2.1) the entire key field
(in this example, field 1) is skipped.
$ sorter -d data.%1 data
$ ls -l data.*
-rw-r--r-- 1 usp usp 81 Feb 19 13:13 data.01
-rw-r--r-- 1 usp usp 76 Feb 19 13:13 data.02
-rw-r--r-- 1 usp usp 71 Feb 19 13:13 data.03
-rw-r--r-- 1 usp usp 85 Feb 19 13:13 data.04
$ cat data.01
Texas 03 Houston 82 0 23 84 10
Texas 01 Austin 91 59 20 76 54
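A rough awk approximation of -d saves the key first and then strips the
whole first field from the record before writing it. This is a sketch
that assumes the key is field 1 and that fields are separated by spaces
or tabs; it is not sorter's own code.

```shell
# Sketch: approximate `sorter -d data.%1 data` with awk.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' \
    '01 Texas 03 Houston 82 0 23 84 10' \
    '02 New_York 04 Manhattan 30 50 71 36 30' \
    '01 Texas 01 Austin 91 59 20 76 54' > data

# Remember the key, drop field 1 and the separator after it, then write
# the shortened record to data.<key>.
awk '{
    key = $1
    sub(/^[ \t]*[^ \t]+[ \t]+/, "")
    print > ("data." key)
}' data

cat data.01
```

Replacing key = $1 with key = substr($1, 2, 1) would name the files by
a substring while still removing the entire field, which is the
behavior described above for %1.2.1.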
If you specify the -z option, the output files are compressed
using gzip.
$ sorter -z data.%1.gz data
$ ls -l data.*
-rw-r--r-- 1 usp usp 98 Feb 19 13:17 data.01.gz
-rw-r--r-- 1 usp usp 94 Feb 19 13:17 data.02.gz
-rw-r--r-- 1 usp usp 82 Feb 19 13:17 data.03.gz
-rw-r--r-- 1 usp usp 100 Feb 19 13:17 data.04.gz
$ gunzip < data.01.gz
01 Texas 03 Houston 82 0 23 84 10
01 Texas 01 Austin 91 59 20 76 54
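Compressed output can be approximated by piping each key's records
through its own gzip process. This is a sketch using the gzip command
rather than zlib, so it differs from what sorter does internally; the
sample records are placeholders.

```shell
# Sketch: approximate `sorter -z data.%1.gz data` with awk and gzip.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' \
    '01 Texas 03 Houston 82 0 23 84 10' \
    '02 New_York 04 Manhattan 30 50 71 36 30' \
    '01 Texas 01 Austin 91 59 20 76 54' > data

# One gzip pipe per key; close() in END waits for each gzip to finish
# so the .gz files are complete when awk exits.
awk '{
    cmd = "gzip -c > data." $1 ".gz"
    print | cmd
    seen[cmd]
}
END { for (c in seen) close(c) }' data

gzip -dc data.01.gz
```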
sorter.c uses zlib, so link against it (-lz) when compiling:
$ cc -static -O3 -o /home/TOOL/sorter sorter.c -lz
If you use the -a option and the -z option together, the newly
compressed data is appended to the existing compressed file. The
resulting file can still be decompressed correctly with gunzip,
because gzip accepts several compressed streams concatenated in
one file.
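The reason appending works can be checked with the gzip command alone:
a file made of several gzip streams written one after another
decompresses to the concatenation of their contents. The file name and
records below are placeholders for the example.

```shell
# Sketch: two separately compressed chunks appended into one .gz file.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# First chunk creates the file, second chunk is appended to it.
printf '01 Texas 03 Houston 82 0 23 84 10\n' | gzip -c >  data.01.gz
printf '01 Texas 01 Austin 91 59 20 76 54\n' | gzip -c >> data.01.gz

# Decompressing yields both records, in the order they were appended.
gzip -dc data.01.gz
```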
This command reads the entire input file into memory. However, when
available memory runs low, or when more than half of physical memory
is in use, it writes the data read so far out to file and then
continues processing.
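One plausible reading of this strategy is "buffer records per key, spill
the buffers to the output files when a limit is reached, then carry on".
The awk sketch below illustrates that pattern only. The record-count
limit is a hypothetical stand-in for sorter's memory check, and nothing
here is taken from sorter.c.

```shell
# Sketch: buffer per key and spill periodically (illustrative only).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' \
    '04 Connecticut 13 Hartford 92 56 83 96 75' \
    '01 Texas 03 Houston 82 0 23 84 10' \
    '02 New_York 04 Manhattan 30 50 71 36 30' \
    '01 Texas 01 Austin 91 59 20 76 54' > data

# "limit" is a hypothetical threshold: spill once that many records are held.
awk -v limit=3 '
function spill(   k) {
    # Write every buffered group to its file; files stay open between
    # spills, so later groups are added after earlier ones.
    for (k in buf) printf "%s", buf[k] > ("data." k)
    split("", buf)      # portable way to empty the array
    held = 0
}
{ buf[$1] = buf[$1] $0 "\n"; if (++held >= limit) spill() }
END { spill() }
' data

cat data.01
```

Because each spill writes groups in arrival order and the files are
never reopened, records within an output file keep their input order
even when a key is split across two spills.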