Today I was faced with the following task: given a directory of files of various and unknown encodings, convert them all to UTF-8.
This post shows how to solve this problem easily using two useful commands: file and iconv.
The problem consists of two parts: first, determine the current encoding of each of the files, and then convert them from their current encoding to UTF-8.
file is a very useful command for easily determining type information of files.
A simple example shows the typical information file extracts from a text file.
$ file test.txt
test.txt: ISO-8859 English text, with very long lines, with CRLF line terminators
To print only the needed information we add a few more options, as follows.
$ file -b --mime-encoding test.txt
iso-8859-1
--mime-encoding specifies that only the encoding part should be printed, and
-b (brief) omits the name of the file from the output.
Once the current encoding of a file has been determined, the
iconv command can be used to convert its encoding.
The following will print the contents of test.txt to stdout as UTF-8.
$ iconv -f iso-8859-1 -t utf-8 test.txt
Using the -o option, the output can also be redirected back to the file.
$ iconv -f iso-8859-1 -t utf-8 test.txt -o test.txt
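A more cautious variant avoids reading and writing the same file at once: convert into a temporary file, and overwrite the original only if iconv succeeded. The following is a minimal sketch under that assumption; convert_to_utf8 is a hypothetical helper name, not something iconv provides.

```shell
#!/bin/sh
# convert_to_utf8 FROM FILE
# Hypothetical helper: converts FILE from encoding FROM to UTF-8 in place.
convert_to_utf8() {
    from=$1
    src=$2
    tmp=$(mktemp) || return 1
    # Write the converted text to a temporary file first.
    if iconv -f "$from" -t utf-8 "$src" > "$tmp"; then
        # Copy the result over the original only after a successful
        # conversion; using cat keeps the original file's permissions.
        cat "$tmp" > "$src"
        rm -f "$tmp"
    else
        # Leave the original untouched when iconv reports an error.
        rm -f "$tmp"
        return 1
    fi
}
```

It would be called as convert_to_utf8 iso-8859-1 test.txt. If the conversion fails, test.txt keeps its original contents.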
By putting the two steps together, we can easily convert all text files within a folder.
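The following script reads all .txt files within the current folder, determines their current encoding, and tries to convert them to UTF-8.
#!/bin/sh
TO='utf-8'
for i in *.txt
do
    # Detect the current encoding, then convert the file to UTF-8.
    # The variables are quoted so that file names with spaces work.
    FROM=$(file -b --mime-encoding "$i")
    iconv -f "$FROM" -t "$TO" "$i" -o "$i"
done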
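The loop hands every detected encoding to iconv, including files that are already UTF-8 and files for which file cannot name a text encoding (it reports values such as binary or unknown-8bit in those cases). The sketch below shows one way to skip those files. needs_conversion and convert_txt_files are hypothetical helper names, and the list of skipped encodings is an assumption that may need adjusting for your data.

```shell
#!/bin/sh
# needs_conversion ENCODING
# Hypothetical helper: succeeds when the encoding reported by
# `file -b --mime-encoding` should be handed to iconv.
needs_conversion() {
    case $1 in
        utf-8|us-ascii) return 1 ;;       # already valid UTF-8
        binary|unknown-8bit) return 1 ;;  # no usable text encoding detected
        *) return 0 ;;
    esac
}

# convert_txt_files DIR
# Hypothetical wrapper: converts the .txt files in DIR to UTF-8.
convert_txt_files() {
    for f in "$1"/*.txt; do
        [ -f "$f" ] || continue           # the glob matched nothing
        from=$(file -b --mime-encoding "$f")
        if needs_conversion "$from"; then
            tmp=$(mktemp) || return 1
            # Convert into a temporary file; overwrite only on success.
            if iconv -f "$from" -t utf-8 "$f" > "$tmp"; then
                cat "$tmp" > "$f"
            else
                echo "skipped $f: conversion from $from failed" >&2
            fi
            rm -f "$tmp"
        else
            echo "skipped $f: detected as $from" >&2
        fi
    done
}
```

Running convert_txt_files . processes the current folder. Skipped files are reported on stderr, so they can be checked by hand afterwards.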