Today I was faced with the following task: given a directory of files of various and unknown encodings, convert them all to UTF-8.
This post shows how to solve this problem easily using two useful commands: file and iconv.
The problem consists of two parts: first, determine the current encoding of each of the files, and then convert them from their current encoding to UTF-8.
file is a very useful command for easily determining a file's type.
A simple example shows the kind of information file extracts from a text file.
$ file test.txt
test.txt: ISO-8859 English text, with very long lines, with CRLF line terminators
To print only the information we need, we add a few more options, as follows.
$ file -b --mime-encoding test.txt
iso-8859-1
--mime-encoding specifies that only the encoding part should be printed, and -b (brief) omits the file name from the output.
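Since file accepts multiple arguments, the same command can be pointed at every text file in the folder to get a quick overview of the encodings involved (-b is dropped here so the file names are kept; the names and encodings shown below are only illustrative).

$ file --mime-encoding *.txt
notes.txt:  iso-8859-1
readme.txt: us-ascii
test.txt:   iso-8859-1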
Once the current encoding of a file has been determined, the iconv command can be used to convert its encoding.
The following will print the contents of test.txt to stdout as UTF-8.
$ iconv -f iso-8859-1 -t utf-8 test.txt
Using the -o option, the output can also be written back to the same file.
$ iconv -f iso-8859-1 -t utf-8 test.txt -o test.txt
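Note that -o is a GNU extension and may not be available in every iconv implementation. If yours lacks it, one simple workaround is to write to a temporary file and move it over the original (the .tmp name is just an example):

$ iconv -f iso-8859-1 -t utf-8 test.txt > test.txt.tmp && mv test.txt.tmp test.txt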
By putting the two steps together, we can easily convert all text files within a folder.
The following script reads all .txt files within the current folder, determines their current encoding, and tries to convert them to UTF-8.
#!/bin/sh
# Target encoding
TO='utf-8'
for i in *.txt
do
    # Detect the file's current encoding
    FROM=$(file -b --mime-encoding "$i")
    # Convert the file in place
    iconv -f "$FROM" -t "$TO" "$i" -o "$i"
done
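Assuming the script is saved as, say, convert.sh (the name is arbitrary) in the folder containing the text files, it can be run like this:

$ sh convert.sh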