tagkeycut : Split a file based on a key
(The file doesn't need to be sorted on the key)
Usage : tagkeycut [options] <filename> <file>
Options : -d : delete key
          -a : append file
          -z : compress
Version : Tue Jan 9 09:02:34 JST 2024
Edition : 1
Reads <file> and splits it into multiple files, grouping together the
records whose key field, specified in <filename>, has the same value.
For example, to split the file by the value of the field with tag name
KEY, specify the filename as "data.%KEY". The output files will be
named data.(KEY field value). The key field must be sorted to use
tagkeycut.
(Files are output when the value of the field changes.)
The tag key field is specified in <filename> using %<tag name>. You
can also specify a substring of the key, such as %KEY.2 or %KEY.1.3.
When specifying with %, you can put braces {} around KEY, as in
"%{KEY}", to make the extent of the tag name clear. If the character
after the tag name is not one of "- . / % }", you must use braces.
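The filename template rules above can be illustrated with a minimal
Python sketch. It is not taken from tagkeycut.c; the helper name
expand() is hypothetical. It assumes that character positions are
1-based, that "%KEY.2" means "from the 2nd character to the end" and
that "%KEY.1.3" means "3 characters starting at the 1st", which is
consistent with the "%1.2.1" example later in this page. Combining
braces with a substring is not covered.

```python
import re

# Matches "%{TAG}" (whole field) or "%TAG", "%TAG.start", "%TAG.start.length".
# This is an illustrative assumption, not the parser used by tagkeycut.
_KEY = re.compile(r"%\{(\w+)\}|%(\w+)(?:\.(\d+)(?:\.(\d+))?)?")

def expand(template, record):
    """Replace every %TAG reference in template with the value from record."""
    def substitute(match):
        value = record[match.group(1) or match.group(2)]
        if match.group(3) is None:
            return value                      # whole field
        start = int(match.group(3)) - 1       # 1-based -> 0-based
        if match.group(4) is None:
            return value[start:]              # from start to the end
        return value[start:start + int(match.group(4))]
    return _KEY.sub(substitute, template)

record = {"PNUM": "01", "PREF": "New_Hampshire"}
print(expand("data.%PNUM", record))
print(expand("data.%PNUM.2.1", record))
print(expand("data.%{PNUM}x", record))
```

In this sketch a trailing ".gz" is left alone because only numeric
suffixes are treated as substring specifications, and "%{PNUM}x" needs
the braces because "x" could otherwise be read as part of the tag name.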
$ cat data
PNUM PREF CNUM CITY D1 D2 D3 D4 D5
01 New_Hampshire 03 Nashua 82 0 23 84 10
01 New_Hampshire 01 Manchester 91 59 20 76 54
02 Massachusetts 04 Boston 30 50 71 36 30
02 Massachusetts 05 Worcester 78 13 44 28 51
03 Vermont 10 Burlington 52 91 44 9 0
03 Vermont 12 Rutland 95 60 35 93 76
04 New_York 13 Brooklyn 92 56 83 96 75
04 New_York 16 Manhattan 45 21 24 39 03
$ tagkeycut data.%1 data
$ ls -l data.*
-rw-r--r-- 1 usp usp 122 May 18 16:54 data.01 ↑
-rw-r--r-- 1 usp usp 117 May 18 16:54 data.02 Split into
-rw-r--r-- 1 usp usp 112 May 18 16:54 data.03 four files
-rw-r--r-- 1 usp usp 126 May 18 16:54 data.04 ↓
$ cat data.01
PNUM PREF CNUM CITY D1 D2 D3 D4 D5
01 New_Hampshire 03 Nashua 82 0 23 84 10
01 New_Hampshire 01 Manchester 91 59 20 76 54
$ cat data.02
PNUM PREF CNUM CITY D1 D2 D3 D4 D5
02 Massachusetts 04 Boston 30 50 71 36 30
02 Massachusetts 05 Worcester 78 13 44 28 51
$ cat data.03
PNUM PREF CNUM CITY D1 D2 D3 D4 D5
03 Vermont 10 Burlington 52 91 44 9 0
03 Vermont 12 Rutland 95 60 35 93 76
$ cat data.04
PNUM PREF CNUM CITY D1 D2 D3 D4 D5
04 New_York 13 Brooklyn 92 56 83 96 75
04 New_York 16 Manhattan 45 21 24 39 03
$ tagkeycut data.%1.2.1 data
$ ls -l data.*
-rw-r--r-- 1 usp usp 122 May 18 16:58 data.1
-rw-r--r-- 1 usp usp 117 May 18 16:58 data.2
-rw-r--r-- 1 usp usp 112 May 18 16:58 data.3
-rw-r--r-- 1 usp usp 126 May 18 16:58 data.4
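The basic split shown above could be emulated roughly as follows. This
is a hedged sketch rather than the algorithm used by tagkeycut.c: the
function name split_by_tag() is hypothetical, fields are assumed to be
separated by whitespace, and the first line is assumed to be the tag
(header) line, which is repeated at the top of every output file as in
the listings above.

```python
def split_by_tag(lines, tag, prefix):
    """Write each record to prefix + <value of tag>; return the file names."""
    header = lines[0]
    column = header.split().index(tag)   # position of the key field
    seen = set()
    for line in lines[1:]:
        name = prefix + line.split()[column]
        first = name not in seen
        # The first record for a key starts a fresh file (overwriting any
        # old one) with the header line; later records are added to it.
        with open(name, "w" if first else "a") as out:
            if first:
                out.write(header + "\n")
            out.write(line + "\n")
        seen.add(name)
    return sorted(seen)
```

Calling split_by_tag(open("data").read().splitlines(), "PNUM", "data.")
on the sample file would create data.01 to data.04 under these
assumptions.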
If the -a option is specified, the lines are appended to the output
file.
If the output file doesn't exist, a new one is created.
If you do not specify this option, the file is overwritten.
$ tagkeycut data.%1 data
$ tagkeycut -a data.%1 data
$ ls -l data.*
-rw-r--r-- 1 usp usp 209 May 18 17:00 data.01
-rw-r--r-- 1 usp usp 199 May 18 17:00 data.02
-rw-r--r-- 1 usp usp 189 May 18 17:00 data.03
-rw-r--r-- 1 usp usp 217 May 18 17:00 data.04
$ cat data.01
PNUM PREF CNUM CITY D1 D2 D3 D4 D5
01 New_Hampshire 03 Nashua 82 0 23 84 10
01 New_Hampshire 01 Manchester 91 59 20 76 54
01 New_Hampshire 03 Nashua 82 0 23 84 10
01 New_Hampshire 01 Manchester 91 59 20 76 54
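The -a behaviour in the example above (records appended, tag line not
repeated) can be modelled with a small sketch. The helper name
write_group() and its parameters are hypothetical; the rule that the
header is written only when a file is newly created is inferred from
the sample output, not from the source code.

```python
import os

def write_group(path, header, records, append=False):
    """Write one key's records, optionally appending to an existing file."""
    # Without append, or when the file does not exist yet, start a new file.
    new_file = not (append and os.path.exists(path))
    with open(path, "w" if new_file else "a") as out:
        if new_file:
            out.write(header + "\n")     # tag line only at the top
        for record in records:
            out.write(record + "\n")
```

Running it twice with append=True on the same path leaves one tag line
followed by both sets of records; running it with append=False replaces
the file.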
If you specify the -d option, each record is written to the output
file with the key field removed. Even if the key field is specified
as a substring (such as "%KEY.2.1"), the entire key field is removed.
$ tagkeycut -d data.%1 data
$ ls -l data.*
-rw-r--r-- 1 usp usp 111 May 18 17:03 data.01
-rw-r--r-- 1 usp usp 106 May 18 17:03 data.02
-rw-r--r-- 1 usp usp 101 May 18 17:03 data.03
-rw-r--r-- 1 usp usp 115 May 18 17:03 data.04
$ cat data.01
PREF CNUM CITY D1 D2 D3 D4 D5
New_Hampshire 03 Nashua 82 0 23 84 10
New_Hampshire 01 Manchester 91 59 20 76 54
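The effect of -d on the tag line and the records can be sketched as
below. The function name drop_key() is hypothetical, and fields are
assumed to be separated by single spaces in the output.

```python
def drop_key(header, records, tag):
    """Return the header and records with the whole key field removed."""
    column = header.split().index(tag)

    def strip(line):
        fields = line.split()
        return " ".join(fields[:column] + fields[column + 1:])

    return strip(header), [strip(r) for r in records]

header, records = drop_key(
    "PNUM PREF CNUM CITY D1 D2 D3 D4 D5",
    ["01 New_Hampshire 03 Nashua 82 0 23 84 10"],
    "PNUM",
)
print(header)
print(records[0])
```

The whole field is dropped regardless of any substring used to build
the file name, matching the description of -d above.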
If you specify the -z option, the output files will be compressed
using gzip.
$ tagkeycut -z data.%1.gz data
$ ls -l data.*
-rw-r--r-- 1 usp usp 131 May 18 17:05 data.01.gz
-rw-r--r-- 1 usp usp 126 May 18 17:05 data.02.gz
-rw-r--r-- 1 usp usp 115 May 18 17:05 data.03.gz
-rw-r--r-- 1 usp usp 132 May 18 17:05 data.04.gz
$ gunzip < data.01.gz
PNUM PREF CNUM CITY D1 D2 D3 D4 D5
01 New_Hampshire 03 Nashua 82 0 23 84 10
01 New_Hampshire 01 Manchester 91 59 20 76 54
tagkeycut.c uses zlib, so link against it (-lz) when compiling:
$ cc -static -O3 -o /home/TOOL/tagkeycut tagkeycut.c -lz
If you use the -a option and the -z option together, you can append
to an existing compressed file. The resulting file can be properly
decompressed with gunzip.
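Appending to a compressed file works because the gzip format allows
several compressed members to be concatenated in one file. The sketch
below demonstrates this with Python's standard gzip module; it is an
illustration of the format, not a description of how tagkeycut.c
implements -a together with -z. The helper names are hypothetical.

```python
import gzip

def append_gzip(path, text):
    """Add text to path as a new gzip member (the file is created if missing)."""
    with gzip.open(path, "ab") as out:
        out.write(text.encode())

def read_gzip(path):
    """Decompress every member in path and return the combined text."""
    with gzip.open(path, "rb") as src:
        return src.read().decode()
```

After two calls to append_gzip() on the same path, read_gzip() returns
both pieces of text in order, and gunzip handles the same file in the
same way.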