utf8nude : Removes control characters and invalid UTF-8 characters
Usage : utf8nude <file>
Options : -e
          -i
          -d<string>
Version : Tue Jan 9 09:02:34 JST 2024
Edition : 1
Use this command after converting POST data with cgi-name to delete
control characters and invalid UTF-8 characters. The following are
removed:
* control characters except Tab, NewLine and Space
  (0x00 - 0x08, 0x0b - 0x1b, 0x7f, 0xc2 0x80 - 0xc2 0x9b)
* 0x80 - 0xbf (except trailing bytes of a multi-byte character)
* redundant (overlong) encodings
* 5- and 6-byte UTF-8 sequences from the old standard
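The rules listed above can be sketched in Python as follows. This is only an illustration of the list, not the command's actual implementation: the names sequence_length, is_removed_control and strip_invalid are hypothetical, the byte ranges are copied from the list as written, and dropping 0xfe and 0xff is an assumption. Surrogates and code points above U+10FFFF are not mentioned in the list, so the sketch leaves them untouched.

```python
# Smallest code point that legitimately needs a 2-, 3- or 4-byte sequence.
MIN_CODE_POINT = {2: 0x80, 3: 0x800, 4: 0x10000}


def sequence_length(lead: int) -> int:
    """Expected sequence length for a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5  # old-standard 5-byte form
    if 0xFC <= lead <= 0xFD:
        return 6  # old-standard 6-byte form
    return 0  # stray 0x80 - 0xbf; 0xfe and 0xff are dropped by assumption


def is_removed_control(cp: int) -> bool:
    """Control characters from the list; 0x80 - 0x9b is 0xc2 0x80 - 0xc2 0x9b."""
    return cp <= 0x08 or 0x0B <= cp <= 0x1B or cp == 0x7F or 0x80 <= cp <= 0x9B


def strip_invalid(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 0:
            i += 1  # byte that cannot start a sequence
            continue
        # Consume trailing bytes (0x80 - 0xbf) up to the expected length.
        j = i + 1
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        complete = j == i + n
        seq = data[i:j]
        i = j
        if not complete or n >= 5:
            continue  # broken sequence, or 5/6-byte form
        if n == 1:
            cp = seq[0]
        else:
            cp = seq[0] & (0xFF >> (n + 1))  # payload bits of the lead byte
            for b in seq[1:]:
                cp = (cp << 6) | (b & 0x3F)
            if cp < MIN_CODE_POINT[n]:
                continue  # redundant (overlong) encoding
        if is_removed_control(cp):
            continue
        out += seq
    return bytes(out)
```

Under these assumptions, strip_invalid(bytes.fromhex("61620063647f65660a")) returns b"abcdef\n".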
Control characters such as NUL and DEL are removed.
$ xdump -v data1
61 62 00 63 64 7F 65 66 0A : ab.cd.ef.
$ utf8nude data1 | xdump -v
61 62 63 64 65 66 0A : abcdef.
Multi-byte characters whose trailing bytes fall outside the range
0x80 - 0xbf are also removed.
$ xdump -v data2
EF BF C0 0A : ....
$ utf8nude data2 | xdump -v
0A : .
If the -i option is specified, this command exits with exit status 1
when any characters are removed. If the -e option is specified, this
command exits immediately when a character is removed.
$ utf8nude data1 >out1
$ echo $?
0
$ xdump -v out1
61 62 63 64 65 66 0A : abcdef.
$ utf8nude -i data2 >out2
$ echo $?
1
$ xdump -v out2
0A : .
$ utf8nude -e data3 >out3
$ echo $?
1
$ xdump -v out3
61 62 : ab
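The difference between running without options, with -i and with -e can be modeled with a small Python sketch. It is an assumption-laden illustration, not the command's code: the function names and the report/stop parameters are hypothetical, only the single-byte control characters are handled (bytes of 0x80 and above pass through), -i is assumed to still write the cleaned output, and the exit status 1 under -e is taken from the example above.

```python
def is_removed(byte: int) -> bool:
    """Single-byte control characters only; Tab and NewLine are kept."""
    return byte <= 0x08 or 0x0B <= byte <= 0x1B or byte == 0x7F


def run(data: bytes, report: bool = False, stop: bool = False) -> tuple[bytes, int]:
    """Return (output, exit status); report models -i, stop models -e."""
    out = bytearray()
    removed = False
    for byte in data:
        if is_removed(byte):
            if stop:
                return bytes(out), 1  # -e: stop at the first removal
            removed = True
            continue
        out.append(byte)
    # -i: the whole input is processed, then the status reports any removal.
    return bytes(out), 1 if (report and removed) else 0


data = bytes.fromhex("61620063647f65660a")
print(run(data))               # (b'abcdef\n', 0)
print(run(data, report=True))  # (b'abcdef\n', 1)
print(run(data, stop=True))    # (b'ab', 1)
```

In this model -i is the choice when the cleaned data is still wanted but the caller needs to know that something was dropped, while -e suits callers that want to reject the input outright.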
The -d<string> option specifies a string that is substituted for each
removed character.
$ utf8nude -d__ data1 | xdump -v
61 62 5F 5F 63 64 5F 5F 65 66 0A : ab__cd__ef.
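The substitution behavior shown above can be sketched in Python. The function names are hypothetical and the sketch covers only single-byte control characters; whether an invalid multi-byte sequence yields one substitution per byte or one per sequence is not stated here, so that case is left out.

```python
def is_removed(byte: int) -> bool:
    """Single-byte control characters only; Tab and NewLine are kept."""
    return byte <= 0x08 or 0x0B <= byte <= 0x1B or byte == 0x7F


def substitute(data: bytes, replacement: bytes = b"") -> bytes:
    """Write the replacement in place of each removed byte (models -d<string>)."""
    out = bytearray()
    for byte in data:
        if is_removed(byte):
            out += replacement
        else:
            out.append(byte)
    return bytes(out)


data = bytes.fromhex("61620063647f65660a")
print(substitute(data, b"__"))  # b'ab__cd__ef\n'
```

With an empty replacement the function reduces to plain removal, which matches the behavior without -d.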