I have data that I crawled using Scrapy, which saves it as a CSV file with encoding utf-8-sig. The data contains many different special characters: Korean, Russian, Chinese, Spanish, ..., a star symbol (★), and this 🎵, and this 🎄...
So Scrapy can save the data, and I can view it in Notepad++ or an app like CSVFileView. But when I load it in R using mydata <- read.csv(<path_to_file>, fileEncoding="UTF-8-SIG", header=FALSE), I get this error:
Error in file(file, "rt", encoding = fileEncoding) :
unsupported conversion from 'UTF-8-SIG' to ''
If I don't specify the encoding, I can load the file, but the symbols become characters like ☠ and the first column header gets prefixed with ï..
Which encoding should I choose to include all characters?
As the input is already encoded as UTF-8, you should use the encoding argument to read the file as-is. fileEncoding, by contrast, asks R to re-encode the file, and 'UTF-8-SIG' (a Python codec name) is not an encoding R's conversion routines recognize, hence the error.
mydata <- read.csv(<path_to_file>, encoding="UTF-8", header=FALSE)
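Alternatively, if you do want R to handle the byte-order mark itself, note that R spells this encoding "UTF-8-BOM" rather than "UTF-8-SIG". A minimal sketch (the path placeholder is the question's own):
# R accepts "UTF-8-BOM" (not Python's "UTF-8-SIG") for BOM-prefixed files;
# the BOM is stripped during conversion, so no ï.. artifact appears
mydata <- read.csv(<path_to_file>, fileEncoding = "UTF-8-BOM", header = FALSE)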
I am attempting to make a boatload of Anki flashcards for Thai, so I did some web-scraping with R to extract transliterated elements from a website (dictionary). Everything looks good when printing in the console, but when I try to write the transliteration to a text file, the encoding changes, and I lose tone marks. Using Encoding() revealed that most entries were "UTF-8", which should be fine, but some entries were labeled as "unknown". You can download the HTML file from my GitHub, and my code is below for importing and extracting the text.
# Install appropriate library
install.packages("rvest")
library(rvest)
# Read in page to local variable
page <- read_html("Thai to English dictionary, translation and transliteration.html")
# Filter for specific tags
translit <- page %>% html_nodes(".tlit-line") %>% html_text()
# Write to a text file (this is the step where the tone marks get lost)
write(translit, file = 'translit.txt')
Use stringi, which lets you force the output encoding when writing:
library(stringi)
stri_write_lines(translit, "translit.txt", encoding = "UTF-8")
stri_write_lines (From stringi v1.5.3 by Marek Gagolewski)
Write Text Lines To A Text File. Writes a text file in such a way
that each element of a given character vector becomes a separate text
line.
Usage
stri_write_lines(
str,
con,
encoding = "UTF-8",
sep = ifelse(.Platform$OS.type == "windows", "\r\n", "\n"),
fname = con
)
Arguments
str - character vector with data to write
con - name of the output file or a connection object (opened in the
binary mode)
encoding - output encoding, NULL or '' for the current default one
sep - newline separator
fname - deprecated alias of con
Details
It is a substitute for the R writeLines function, with the ability to
easily re-encode the output.
We suggest using the UTF-8 encoding for all text files: thus, it is
the default one for the output.
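As a quick check that nothing gets re-encoded on the way out, you can read the file back with stringi and compare; a minimal sketch, reusing translit and translit.txt from the question:
library(stringi)
# Write as UTF-8, read back as UTF-8, and compare the contents
stri_write_lines(translit, "translit.txt", encoding = "UTF-8")
check <- stri_read_lines("translit.txt", encoding = "UTF-8")
all(translit == check)  # TRUE means the tone marks survived the round trip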
I am trying to count the number of keywords in multiple PDF files.
library(tm)
library(pdftools)
# Collect all PDFs in the working directory
files <- list.files(pattern = "pdf$")
# Build a pdftools-based reader that keeps the page layout
Rpdf <- readPDF(control = list(text = "-layout"))
corp <- Corpus(URISource(files), readerControl = list(reader = Rpdf))
# Count only the keywords of interest
words <- c("example", "keyword", "test")
dt <- DocumentTermMatrix(corp, control = list(dictionary = words))
When I run the code, I always get these errors:
PDF error: May not be a PDF file (continuing anyway)
PDF error (3): Illegal character <21> in hex string
PDF error (5): Illegal character <4f> in hex string
PDF error (7): Illegal character <54> in hex string
PDF error (8): Illegal character <59> in hex string
PDF error (9): Illegal character <50> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure.
In addition: There were 12 warnings (use warnings() to see them)
If you have any suggestions, please let me know. Thank you!
My guess is that your PDFs are binary files and thus need to be downloaded/read in binary mode. I had a similar issue downloading PDF files with download.file: I couldn't extract any information from them with pdftools afterwards. It turned out my PDFs were binary files that got broken because I didn't download them in the proper format (try opening one in any PDF reader; it should report that the file is damaged). On Windows, adding mode="wb" to download.file made sure they were stored in the right format, and I could then run the pdftools functions without that error message. Hope that helps. I got the idea from this SO question: Problems with Downloading pdf file using R
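A minimal sketch of that fix (the URL and file name are placeholders, not from the original question):
# Download in binary mode so the PDF bytes are not mangled on Windows
url <- "https://example.com/report.pdf"  # placeholder URL
download.file(url, destfile = "report.pdf", mode = "wb")
library(pdftools)
txt <- pdf_text("report.pdf")  # should now parse without the xref/trailer errors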
Same error message as yours:
pdf_toc(example_path)
PDF error (1151926): Illegal character <3a> in hex string
PDF error (1151929): Illegal character <73> in hex string
[...omitted for brevity...]
PDF error (1152006): Illegal character <22> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_toc(loadfile(pdf), opw, upw) : PDF parsing failure.
I want to read a CSV file with 5,000 observations into RStudio. If I set the encoding to UTF-8, only 3,500 observations are imported and I get 2 warning messages:
# Example code
options(encoding = "UTF-8")
df <- read.csv("path/data.csv", dec = ".", sep = ",")
1: invalid input found on input connection
2: EOF within quoted string
Following this thread, I was able to find some encodings that can read the whole CSV file, e.g. windows-1258. However, with this encoding special characters such as ä, ü, or ß are not read properly.
My guess is that UTF-8 would be the right encoding, but that something is wrong with the character/factor variables of the CSV file. For that reason I need a way to read the whole CSV file as UTF-8. Any help is highly appreciated!
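To narrow down where the import breaks, one option is to read the file as raw lines and flag those that are not valid UTF-8; a diagnostic sketch, reusing the path from the example above:
# Read the raw lines without re-encoding, then flag invalid UTF-8
raw_lines <- readLines("path/data.csv", warn = FALSE)
bad <- which(!validUTF8(raw_lines))
raw_lines[bad]  # these rows likely trigger the "invalid input" warning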
I am using RStudio with R 3.3.1 on Windows 7, and I have installed the CITAN package. I am trying to import bibliography entries from a CSV file that I exported from Scopus (as-is, untouched), choosing to export all available information.
This is the error that I get:
example <- Scopus_ReadCSV("scopus.csv")
Error in Scopus_ReadCSV("scopus.csv") : Column not found: `Source'.
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'scopus.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'scopus.csv'
Column `Source' is there when I open the file, so I do not know why it says 'not found'.
Eventually I came to the following conclusions:
The encoding of the CSV file as exported from Scopus was UTF-8 with a byte-order mark (UTF-8-BOM), which does not seem to be recognized by R when using Scopus_ReadCSV("file.csv") or read.table("file.csv", header = TRUE, sep = ",", fileEncoding = "UTF-8") (but see the sketch after the solutions below).
Although the file from Scopus does declare an encoding, it contains some "strange" non-English characters that R's read function cannot handle (I mainly found this problem in names with special characters).
Solutions for those issues:
Open the CSV file in an editor like Notepad++ and save it with UTF-8 encoding so that it becomes readable for R as UTF-8.
When running the read function in R, you will notice that it stops reading early (e.g. at the 40th of 200 records). Note exactly where it stopped, open the CSV in the editor to find the offending special character at that spot, and erase or change it so that the same issue does not occur in R again.
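If re-saving files by hand becomes tedious, readr may be worth a try as well, since its read_csv() detects and drops a UTF-8 BOM automatically; a minimal sketch, assuming the scopus.csv file from the question:
library(readr)
# read_csv() assumes UTF-8 input and silently skips a leading BOM
scopus <- read_csv("scopus.csv")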
Another solution that worked for me:
Open the file in Google Sheets, then download it from there again as a *.csv-file. R opens it correctly afterwards.
I want to import a CSV file from the web. The file is encoded as EUC-KR and contains some Korean characters.
When I tried this:
x <- "www.google.com/abcdefg.csv"
read.csv(x, header=T)
I got this error message:
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at '<b3><af>¥
I know that I can convert the CSV file with a text editor like Notepad++. But I want to import non-UTF-8 encoded data from the web in RStudio without using a text editor.
I also tried this:
x <- "www.google.com/abcdefg.csv"
read.csv(x, header=T,encoding="UTF-8")
but I still got the same error message. Do you know how to do this?
When reading the CSV, you should state the actual encoding of the input. Your file is EUC-KR rather than UTF-8, so replace "UTF-8" with the encoding of your file.
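A minimal sketch, reusing the placeholder URL from the question; note the use of fileEncoding, which converts the input, rather than encoding, which only labels it:
x <- "www.google.com/abcdefg.csv"
# Convert from EUC-KR to the session's native encoding while reading
mydata <- read.csv(x, header = TRUE, fileEncoding = "EUC-KR")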