Even if files are not saved in UTF-8, you can can extract information on character encoding from the file names and import the texts correctly.
require(quanteda)
require(readtext)
temp_dir
contains the example files in various character encodings.
path_temp <- tempdir()
unzip(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"), exdir = path_temp)
list.files()
returns names of all the text files (".txt”) in the directory
filename <- list.files(path_temp, "^(Indian|UDHR_).*\\.txt$")
head(filename)
## [1] "IndianTreaty_English_UTF-16LE.txt" "IndianTreaty_English_UTF-8-BOM.txt"
## [3] "UDHR_Arabic_ISO-8859-6.txt" "UDHR_Arabic_UTF-8.txt"
## [5] "UDHR_Arabic_WINDOWS-1256.txt" "UDHR_Chinese_GB2312.txt"
You can extract character encoding information from the file names using R’s basic commands.
filename <- gsub(".txt$", "", filename)
encoding <- sapply(strsplit(filename, "_"), "[", 3)
head(encoding)
## [1] "UTF-16LE" "UTF-8-BOM" "ISO-8859-6" "UTF-8" "WINDOWS-1256"
## [6] "GB2312"
There is a character encoding not supported by R.
setdiff(encoding, iconvlist())
## [1] "UTF-8-BOM"
You then pass encoding
to readtext()
to convert various character encodings into UTF-8.
path_data <- system.file("extdata/", package = "readtext")
dat_txt <- readtext(paste0(path_data, "/data_files_encodedtexts.zip"),
encoding = encoding,
docvarsfrom = "filenames",
docvarnames = c("document", "language", "input_encoding"))
print(dat_txt, n = 50)
## readtext object consisting of 36 documents and 3 docvars.
## # Description: df[,5] [36 × 5]
## doc_id text document language input_encoding
## <chr> <chr> <chr> <chr> <chr>
## 1 IndianTreaty_English_… "\"WHEREAS, t\"..." IndianTre… English UTF-16LE
## 2 IndianTreaty_English_… "\"ARTICLE 1.\"..." IndianTre… English UTF-8-BOM
## 3 UDHR_Arabic_ISO-8859-… "\"الديباجة\nل\"..… UDHR Arabic ISO-8859-6
## 4 UDHR_Arabic_UTF-8.txt "\"الديباجة\nل\"..… UDHR Arabic UTF-8
## 5 UDHR_Arabic_WINDOWS-1… "\"الديباجة\nل\"..… UDHR Arabic WINDOWS-1256
## 6 UDHR_Chinese_GB2312.t… "\"世界人权宣言\n联合国\"..… UDHR Chinese GB2312
## 7 UDHR_Chinese_GBK.txt "\"世界人权宣言\n联合国\"..… UDHR Chinese GBK
## 8 UDHR_Chinese_UTF-8.txt "\"世界人权宣言\n联合国\"..… UDHR Chinese UTF-8
## 9 UDHR_English_UTF-16BE… "\"Universal \"..." UDHR English UTF-16BE
## 10 UDHR_English_UTF-16LE… "\"Universal \"..." UDHR English UTF-16LE
## 11 UDHR_English_UTF-8.txt "\"Universal \"..." UDHR English UTF-8
## 12 UDHR_English_WINDOWS-… "\"Universal \"..." UDHR English WINDOWS-1252
## 13 UDHR_French_ISO-8859-… "\"Déclaratio\"..." UDHR French ISO-8859-1
## 14 UDHR_French_UTF-8.txt "\"Déclaratio\"..." UDHR French UTF-8
## 15 UDHR_French_WINDOWS-1… "\"Déclaratio\"..." UDHR French WINDOWS-1252
## 16 UDHR_German_ISO-8859-… "\"Die Allgem\"..." UDHR German ISO-8859-1
## 17 UDHR_German_UTF-8.txt "\"Die Allgem\"..." UDHR German UTF-8
## 18 UDHR_German_WINDOWS-1… "\"Die Allgem\"..." UDHR German WINDOWS-1252
## 19 UDHR_Greek_CP1253.txt "\"ΟΙΚΟΥΜΕΝΙΚ\"..." UDHR Greek CP1253
## 20 UDHR_Greek_ISO-8859-7… "\"ΟΙΚΟΥΜΕΝΙΚ\"..." UDHR Greek ISO-8859-7
## 21 UDHR_Greek_UTF-8.txt "\"ΟΙΚΟΥΜΕΝΙΚ\"..." UDHR Greek UTF-8
## 22 UDHR_Hindi_UTF-8.txt "\"मानव अधिका\"..." UDHR Hindi UTF-8
## 23 UDHR_Icelandic_ISO-88… "\"Mannréttin\"..." UDHR Iceland… ISO-8859-1
## 24 UDHR_Icelandic_UTF-8.… "\"Mannréttin\"..." UDHR Iceland… UTF-8
## 25 UDHR_Icelandic_WINDOW… "\"Mannréttin\"..." UDHR Iceland… WINDOWS-1252
## 26 UDHR_Japanese_CP932.t… "\"『世界人権宣言』\n \"..… UDHR Japanese CP932
## 27 UDHR_Japanese_ISO-202… "\"『世界人権宣言』\n \"..… UDHR Japanese ISO-2022-JP
## 28 UDHR_Japanese_UTF-8.t… "\"『世界人権宣言』\n \"..… UDHR Japanese UTF-8
## 29 UDHR_Japanese_WINDOWS… "\"『世界人権宣言』\n \"..… UDHR Japanese WINDOWS-936
## 30 UDHR_Korean_ISO-2022-… "\"세 계 인 권 선 \"...… UDHR Korean ISO-2022-KR
## 31 UDHR_Korean_UTF-8.txt "\"세 계 인 권 선 \"...… UDHR Korean UTF-8
## 32 UDHR_Russian_ISO-8859… "\"Всеобщая д\"..." UDHR Russian ISO-8859-5
## 33 UDHR_Russian_KOI8-R.t… "\"Всеобщая д\"..." UDHR Russian KOI8-R
## 34 UDHR_Russian_UTF-8.txt "\"Всеобщая д\"..." UDHR Russian UTF-8
## 35 UDHR_Russian_WINDOWS-… "\"Всеобщая д\"..." UDHR Russian WINDOWS-1251
## 36 UDHR_Thai_UTF-8.txt "\"ปฏิญญาสากล\"..." UDHR Thai UTF-8