Reading a CSV containing emojis from Google Sheets fails in R

I made a survey in Google Forms and sent the results to Google Sheets.
Then I tried to download the results into R:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRkkjx8AOgNdfDW9wtaHR8wtMQOrgTB1O1wwMcJLGre3E_MixhEaIGUI7gfHw5gBQX7-gcNkRUkMM3X/pub?output=csv'
google <- read.csv(url, sep = ',', header = T, fileEncoding = "UTF-8")
and ran into this problem:
Warning:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
invalid input found on input connection 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRkkjx8AOgNdfDW9wtaHR8wtMQOrgTB1O1wwMcJLGre3E_MixhEaIGUI7gfHw5gBQX7-gcNkRUkMM3X/pub?output=csv'
Only 96 rows out of the full sheet were imported into my R dataset.
I checked the Google Sheet and saw that the 96th row contains an emoji, which stops any further rows from being read.
What should I do, or which encoding should I choose, to be able to read the emojis in R?
RStudio version: 1.2.5033

Thanks, Allan, you helped me a lot!
I also found another solution.
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRkkjx8AOgNdfDW9wtaHR8wtMQOrgTB1O1wwMcJLGre3E_MixhEaIGUI7gfHw5gBQX7-gcNkRUkMM3X/pub?output=csv'
df <- readLines(url, encoding = "UTF-8")
df <- read.table(text = df,
                 sep = ",",
                 fileEncoding = "UTF-8",
                 stringsAsFactors = FALSE)
Then I used the View() function to check the rows with emojis, and they display correctly.
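One caveat worth noting (an observation, not part of the original thread): read.table() defaults to header = FALSE, so the call above treats the header row as ordinary data. A minimal variant of the same approach that keeps the column names:
df <- readLines(url, encoding = "UTF-8")
df <- read.table(text = df,
                 sep = ",",
                 header = TRUE,   # first line holds the column names
                 stringsAsFactors = FALSE)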

You can try to load the contents of the file as a character vector, then remove the emojis manually before you load the data.
Removing very high Unicode code points is a crude but effective way of stripping out emojis.
remove_emojis <- function(strings)
{
  sapply(strings, function(x) {
    codepoints <- utf8ToInt(x)
    # Keep only code points at or below 100000; emoji code points sit well
    # above this, so they are dropped while ordinary text is preserved
    intToUtf8(codepoints[codepoints <= 100000])
  })
}
google <- remove_emojis(readLines(url, encoding = "UTF-8"))
df <- read.table(text = google,
                 sep = ",",
                 fileEncoding = "UTF-8",
                 stringsAsFactors = FALSE)
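A quick way to see the helper in action (a made-up example string, not from the survey data):
# The emoji (code point 0x1F600 = 128512) is dropped; ordinary text passes
# through unchanged
remove_emojis(c("great survey \U0001F600", "plain text"))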

Related

Text encoding in R with Japanese characters

I am trying to read a CSV file containing text in several different scripts using the function read.csv.
This is a sample of the file content:
device,country_code,keyword,indexed_clicks,indexed_cost
Mobile,JP,お金 借りる,5.913037843442198,103.05985173478956
Desktop,US,email,82.450427682737157,81.871030974598241
Desktop,US,news,414.14755054432345,66.502397615344861
Mobile,JP,ヤフートラベル,450.9622861586314,55.733902871922957
If I use the following call to read the data:
texts <- read.csv("text.csv", sep = ",", header = TRUE)
The data frame is imported into R, but the characters are not read correctly:
device country_code keyword indexed_clicks indexed_cost
1 Mobile JP ã\u0081Šé‡‘ 借りる 5.913038 103.05985
2 Desktop US email 82.450428 81.87103
3 Desktop US news 414.147551 66.50240
4 Mobile JP ヤフートラベル 450.962286 55.73390
If I use the same call as before but with fileEncoding = "UTF-8":
texts <- read.csv("text.csv", sep = ",", header = TRUE, fileEncoding = "utf-8")
I get the following warning messages:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'text.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'text.csv'
Does anyone know how to read this file properly?
I replicated your problem with both:
texts <- read.csv("text.csv", sep = ",", header = TRUE)
and
texts_ <- read.csv("text.csv", sep = ",", header = TRUE, encoding = "utf-8")
and both work perfectly fine (RStudio 1.4.1717, Ubuntu 20.04.3 LTS).
Some possibilities I can think of:
The csv file wasn't saved properly as UTF-8, or is corrupted. Have you checked the file again? A quick check is sketched below.
If you are using Windows, try using encoding instead of fileEncoding. These problems happen with non-standard characters (Windows encoding hell).
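A minimal sketch of that first check, using only base R (validUTF8() has been available since R 3.3.0; 'text.csv' as in the question):
# Read the raw lines and test each one for valid UTF-8; any hits point at a
# mis-saved or corrupted line
lines <- readLines("text.csv", warn = FALSE)
which(!validUTF8(lines))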

Import csv file into R data frame with UTF-8 encoding

I'm trying to import a csv file into a data frame:
pc2020 <- read.table("pc2020.csv", sep = ";", header = TRUE)
This works OK, but the encoding is wrong, so all the accented characters come out garbled.
So, I'm trying with:
pc2020 <- read.table("pc2020.csv", sep = ";", header = TRUE, fileEncoding = "UTF-8")
That returns:
Error in read.table("pc2020.csv", sep = ";", header = TRUE, fileEncoding = "UTF-8") :
no lines available in input
In addition: Warning message:
In read.table("pc2020.csv", sep = ";", header = TRUE, fileEncoding = "UTF-8") :
invalid input found on input connection 'pc2020.csv'
You can use the read.csv() function with the same arguments you used with read.table(), except fileEncoding: in read.csv() you should write just encoding = "UTF-8".
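A minimal sketch of that suggestion, assuming the same ;-separated file as in the question:
# Same arguments as the read.table() call, but with encoding rather than
# fileEncoding
pc2020 <- read.csv("pc2020.csv", sep = ";", header = TRUE, encoding = "UTF-8")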
Duck's answer is suitable, too.

Issues reading data as csv in R

I have a large data set (~20000 x 1). Not all the fields are filled; in other words, the data has missing values. Each feature is a string.
I have run the following code:
Input:
data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)
Output of the second call:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 80 elements
Input:
datar <- read.csv("data.csv", header = TRUE, na.strings = NA)
Output:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
As far as I can tell, I run into essentially four problems. Two of them are the error messages shown above. The third: even when no error is thrown, the global environment window shows that not all my rows were read in; about 14000 samples are missing, although the number of features is right. The fourth: in other runs, samples are missing and the feature count is wrong as well.
How can I solve this?
Try the argument comment.char = "" as well as quote = "". By default, read.table() treats the hash (#) as a comment character, which will cut a line short.
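A minimal sketch applied to the question's read.table() call (file name as in the question):
# read.table() defaults to comment.char = "#", so make both the comment and
# quote characters inert for this messy file
datan <- read.table("data.csv", header = TRUE, fill = TRUE,
                    quote = "", comment.char = "")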
Can you open the CSV in Notepad++? That will let you see 'invisible' and other non-printable characters; the file may not contain what you think it contains! Once the source issue is resolved, you can choose the CSV file with a selector tool:
filename <- file.choose()
data <- read.csv(filename, skip=1)
name <- basename(filename)
Or hard-code the path and read the data into R:
# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")
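If Notepad++ is not at hand, a rough diagnostic in R itself (a sketch that assumes comma separators and no quoted commas) can locate the lines whose field count differs from the header's:
lines <- readLines("data.csv", warn = FALSE)
n_fields <- lengths(strsplit(lines, ",", fixed = TRUE))
# Lines whose field count differs from the header line are the suspects
which(n_fields != n_fields[1])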

Problems reading in table with unclear line-end symbol

I am currently trying to read in a .txt file.
I have researched here and found Error in reading in data set in R - however, it did not solve my problem.
The data are political contributions listed by the Federal Election Commission of the U.S. at ftp://ftp.fec.gov/FEC/2014/webk14.zip
Upon inspection of the .txt, I realized that the data is weirdly structured. In particular, the end of any line is not separated at all from the first cell of the next line (not by a "|", not by a space).
Strangely enough, import via Excel and Access seems to work just fine. However, R import does not work.
To avoid the error "Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 90 did not have 27 elements", I use the following command:
webk14 <- read.table(file = "webk14.txt", header = FALSE, fill = TRUE,
                     colClasses = "character", sep = "|", dec = ".",
                     stringsAsFactors = FALSE,
                     col.names = c("cmte_id", "cmte_nm", "cmte_tp", "cmte_dsgn",
                                   "cmte_filing_freq", "ttl_receipts",
                                   "trans_from_aff", "indv_contrib",
                                   "other_pol_cmte_contrib", "cand_contrib",
                                   "cand_loans", "ttl_loans_received", "ttl_disb",
                                   "tranf_to_aff", "indv_refunds",
                                   "other_pol_cmte_refunds", "cand_loan_repay",
                                   "loan_repay", "coh_bop", "coh_cop",
                                   "debts_owed_by", "nonfed_trans_received",
                                   "contrib_to_other_cmte", "ind_exp",
                                   "pty_coord_exp", "nonfed_share_exp",
                                   "cvg_end_dt"))
This does not produce an error; however, the result a) has a different line count than the Excel import and b) fails to separate the columns correctly (which is probably the reason for a).
I would like to avoid a detour via Excel and import directly into R. Any ideas what I am doing wrong?
It might be related to symbols (such as #) inside the fields, so turn off the interpretation of these using comment.char = "", which gives you:
webk14 <- read.table(file = "webk14.txt", header = FALSE, fill = TRUE,
                     colClasses = "character", comment.char = "", sep = "|",
                     dec = ".", stringsAsFactors = FALSE,
                     col.names = c("cmte_id", "cmte_nm", "cmte_tp", "cmte_dsgn",
                                   "cmte_filing_freq", "ttl_receipts",
                                   "trans_from_aff", "indv_contrib",
                                   "other_pol_cmte_contrib", "cand_contrib",
                                   "cand_loans", "ttl_loans_received", "ttl_disb",
                                   "tranf_to_aff", "indv_refunds",
                                   "other_pol_cmte_refunds", "cand_loan_repay",
                                   "loan_repay", "coh_bop", "coh_cop",
                                   "debts_owed_by", "nonfed_trans_received",
                                   "contrib_to_other_cmte", "ind_exp",
                                   "pty_coord_exp", "nonfed_share_exp",
                                   "cvg_end_dt"))
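A quick sanity check on the result (a sketch, not from the original answer): with fill = TRUE and header = FALSE, the parsed row count should match the raw line count of the file.
# If these two numbers differ, rows are still being merged or split
length(readLines("webk14.txt"))
nrow(webk14)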

R reading a tsv file using specific encoding

I am trying to read a .tsv (tab-separated value) file into R using a specific encoding. It's supposedly windows-1252. And it has a header.
Any suggestions for the code to put it into a data frame?
Something like this perhaps?
mydf <- read.table('thefile.txt', header=TRUE, sep="\t", fileEncoding="windows-1252")
str(mydf)
You can also use:
read.delim('thefile.txt', header= T, fileEncoding= "windows-1252")
Simply entering the command into your R console:
> read.delim
function (file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
    dec = dec, fill = fill, comment.char = comment.char, ...)
reveals that read.delim is a thin wrapper around read.table that already specifies tab as the separator. read.delim might be more convenient if you're working with a lot of tsv files.
The difference between the two commands is discussed in more detail in this Stack Overflow question.
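A small demonstration of the wrapper relationship, based on the definition printed above ('thefile.txt' as in the earlier answer):
# With read.delim's defaults spelled out, the two calls should read the file
# identically
a <- read.delim("thefile.txt", fileEncoding = "windows-1252")
b <- read.table("thefile.txt", header = TRUE, sep = "\t", quote = "\"",
                dec = ".", fill = TRUE, comment.char = "",
                fileEncoding = "windows-1252")
identical(a, b)  # expected TRUE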
df <- read.delim("~/file_directory/file_name.tsv", header = TRUE) will work fine for a single .tsv file, because read.delim already uses tab as the separator, so there is no need for sep = "\t". fileEncoding = "windows-1252" could be used, but it is not always necessary.
