How to read non-English characters with read.delim in R?

I have a text file containing several languages. How do I read it in R with the read.delim function?
Encoding("file.tsv")
#[1] "unknown"
source_data = read.delim(file, header = F, fileEncoding = "windows-1252",
                         sep = "\t", quote = "")
source_data[360]
#[1] "ð¿ð¾ð¸ñðº ð½ð° ññ‚ð¾ð¼ ñð°ð¹ñ‚ðµ"
But the same entry shown in Notepad is 'поиск на этом сайте'

tidyverse approach:
Use the locale option in read_delim.
(readr functions have _ instead of . in their names and are usually faster and smarter at parsing.)
More details here: https://r4ds.had.co.nz/data-import.html#parsing-a-vector
Note that read_delim takes delim and col_names where read.delim takes sep and header. Also, the mojibake above is the classic sign of UTF-8 bytes being decoded as windows-1252, so declare the encoding the file actually uses:
source_data = read_delim(file, delim = "\t", col_names = FALSE,
                         locale = locale(encoding = "UTF-8"))

source_data = read.delim(file, header = F, sep = "\t", quote = "", stringsAsFactors = FALSE)
Encoding(source_data$V1) = "UTF-8"
(Encoding() works on a character vector, not on a whole data frame, so mark the relevant column; with header = F the columns are named V1, V2, ....)
I have tried this: if you run R on Windows, the code above works for me. If you run R on Unix, you can use the following instead:
source_data = read.delim(file, header = F, fileEncoding = "UTF-8", sep = "\t", quote = "", stringsAsFactors = FALSE)
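If the file has several text columns, the same idea extends to all of them; a sketch (V1, V2, ... are the names read.delim assigns when header = F):
# mark every character column of the data frame as UTF-8
chr_cols <- vapply(source_data, is.character, logical(1))
source_data[chr_cols] <- lapply(source_data[chr_cols],
                                function(x) { Encoding(x) <- "UTF-8"; x })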

Related

Why R reads CSV file differently

I am using
myCounts <- read.csv("myCounts.csv", header = TRUE, row.names = 1, sep = ",")
and
Book4 <- read_delim("Book4.csv", delim = ";",
                    escape_double = FALSE, trim_ws = TRUE)
to read two csv files, but read.csv and read_delim are parsing them differently.
Could you please explain how to read the Book4 data into the same structure as the myCounts data?
I tried the following and it works:
df <- read.delim("~/Documents/sample.csv", sep = ";", row.names = 1)
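Since the file is semicolon-delimited (the usual CSV flavour in locales where the comma is the decimal mark), base R's read.csv2 is a convenient shorthand; a sketch with the same file:
# read.csv2 defaults to sep = ";" and dec = ","
df <- read.csv2("~/Documents/sample.csv", row.names = 1)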

How do I copy text files containing code?

I am trying to read .tex files containing LaTeX code and paste their content into different .tex files, depending on the results of calculations in R.
I need to avoid changing any character of the .tex files while processing them with R. I am looking for a way to stop R from interpreting the content of the files and make it just "copy" the files character for character.
Example R file:
cont <- paste(readLines("path/to/file/a.tex"), collapse = "\n")
write.table(cont, file = "Mother.tex", append = FALSE, quote = FALSE, sep = "",
            eol = "\n", na = "NA", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")
cont2 <- paste(readLines("path/to/file/b.tex"), collapse = "\n")
write.table(cont2, file = "Mother.tex", append = TRUE, quote = FALSE, sep = "",
            eol = "\n", na = "NA", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")
cont3 <- paste(readLines("path/to/file/c.tex"), collapse = "\n")
write.table(cont3, file = "Mother.tex", append = TRUE, quote = FALSE, sep = "",
            eol = "\n", na = "NA", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")
cont4 <- paste(readLines("path/to/file/d.tex"), collapse = "\n")
write.table(cont4, file = "Mother.tex", append = TRUE, quote = FALSE, sep = "",
            eol = "\n", na = "NA", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")
Example LaTeX file a:
\documentclass{beamer}
\usepackage{listings}
\lstset{basicstyle=\ttfamily, keywordstyle=\bfseries}
\begin{document}
Example LaTeX file b:
\begin{frame}
Example LaTeX file c:
content based on values in R
\end{frame}
Example LaTeX file d:
\end{document}
I now have two problems:
wrong escape information for readLines
a non-UTF-8 keyword in files b, c, and d
LaTeX cannot compile successfully because there is non-UTF-8 information inside the mother file after processing Mother.tex with R. If I copy and paste the content of each file manually, LaTeX compiles successfully. Because LaTeX reports bad UTF-8 while no wrong characters are shown in the TeXLive IDE, I suspect R adds information to the files that the IDE does not display.
I do not understand why something "invisible" is added to my Mother.tex file that is not shown inside TeXLive.
Assuming you want to store the content of the .tex file in a string:
cont <- paste(readLines("path/to/file/file.tex"), collapse = "\n")
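If the goal is really a character-for-character copy, one way to sidestep interpretation entirely is never to decode the files at all; a sketch (not part of the original answer) using base R's file.append, which concatenates files byte for byte:
# build Mother.tex by raw byte concatenation: no character decoding happens,
# so nothing "invisible" can be introduced or re-encoded along the way
file.create("Mother.tex")
file.append("Mother.tex",
            c("path/to/file/a.tex", "path/to/file/b.tex",
              "path/to/file/c.tex", "path/to/file/d.tex"))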

R STUDIO: I am not able to read special characters like Ü in the .csv

I am new to R and I have a problem reading a .csv online.
This is the .csv: https://dadesobertes.gva.es/dataset/15810be9-d797-4bf3-b37c-4c922bee8ef8/resource/a5140630-325a-4d54-b9e4-66216405164b/download/2020-05-31_casospormunicipio.csv
I am trying to read it, but it breaks at the first Ü character.
How should I do it?
This is my code:
library(tidyverse)
library(data.table)
fread('https://dadesobertes.gva.es/dataset/15810be9-d797-4bf3-b37c-4c922bee8ef8/resource/a5140630-325a-4d54-b9e4-66216405164b/download/2020-05-31_casospormunicipio.csv', header = TRUE, sep = ";", encoding = 'UTF-8')
dff <- read.csv("https://dadesobertes.gva.es/dataset/15810be9-d797-4bf3-b37c-4c922bee8ef8/resource/a5140630-325a-4d54-b9e4-66216405164b/download/2020-05-31_casospormunicipio.csv", fileEncoding = "UTF-8", header = TRUE, sep = ";")
dff %>%
  mutate(Municipio = fct_reorder(Municipio, Casos.PCR.)) %>%
  ggplot(aes(x = Municipio, y = Casos.PCR.)) +
  geom_bar(stat = "identity", width = 0.6) + coord_flip()
In your script, change fileEncoding to encoding, i.e.:
dff <- read.csv("https://dadesobertes.gva.es/dataset/15810be9-d797-4bf3-b37c-4c922bee8ef8/resource/a5140630-325a-4d54-b9e4-66216405164b/download/2020-05-31_casospormunicipio.csv", encoding = "UTF-8", header = TRUE, sep = ";")
The difference: fileEncoding makes R re-encode the file to your native encoding while reading it, which can fail partway through a file on Windows, whereas encoding simply declares that the text is already UTF-8 and marks the strings accordingly.
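Note that the fread() call in the question already handles the encoding correctly; its result just was not assigned to anything. A sketch of the data.table route:
library(data.table)
# fread's encoding argument accepts "UTF-8" (and "Latin-1")
dff <- fread("https://dadesobertes.gva.es/dataset/15810be9-d797-4bf3-b37c-4c922bee8ef8/resource/a5140630-325a-4d54-b9e4-66216405164b/download/2020-05-31_casospormunicipio.csv",
             header = TRUE, sep = ";", encoding = "UTF-8")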

Error in reading a CSV file with read.table()

I am encountering an issue while loading a CSV data set in R. The data set can be taken from
https://data.baltimorecity.gov/City-Government/Baltimore-City-Employee-Salaries-FY2015/nsfe-bg53
I imported the data using read.csv as below and the dataset was imported correctly.
EmpSal <- read.csv('E:/Data/EmpSalaries.csv')
I tried reading the data using read.table, and there were a lot of anomalies in the resulting dataset.
EmpSal1 <- read.table('E:/Data/EmpSalaries.csv', sep = ',', header = TRUE, fill = TRUE)
The above code started reading the data from the 7th row, and although the dataset contains ~14K rows, only 5K were imported. In a few cases, 15-20 rows were combined into a single row whose entire content appeared in one column.
I can work with the dataset using read.csv, but I am curious why it didn't work with read.table.
read.csv is defined as:
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
          fill = TRUE, comment.char = "", ...)
    read.table(file = file, header = header, sep = sep, quote = quote,
               dec = dec, fill = fill, comment.char = comment.char, ...)
You need to add quote = "\"". By default, read.table treats both single and double quotes as quoting characters (quote = "\"'"), so every apostrophe in the data opens a "quoted" region that swallows separators and line breaks until the next apostrophe; read.csv recognizes double quotes only.
EmpSal <- read.csv('Baltimore_City_Employee_Salaries_FY2015.csv')
EmpSal1 <- read.table('Baltimore_City_Employee_Salaries_FY2015.csv', sep = ',', header = TRUE, fill = TRUE, quote = "\"")
identical(EmpSal, EmpSal1)
# TRUE
As you mentioned, your data is imported successfully by read.csv() without specifying the quote argument.
The default value of quote is "\"" for read.csv and "\"'" for read.table. Compare the two signatures:
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)
There are many single quotation marks in this dataset, and that is the reason the read.table defaults are not working for you.
Try the following code and it will work:
r <- read.table('/home/workspace/Downloads/Baltimore_City_Employee_Salaries_FY2015.csv', sep = ",", quote = "\"", header = TRUE, fill = TRUE)
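The effect is easy to reproduce on a tiny hypothetical input (a sketch; the names are made up):
txt <- "name,salary\nO'Connor,50000\nO'Neill,60000"
read.table(text = txt, sep = ",", header = TRUE)
# the two apostrophes pair up as quotes, merging the two records into one
read.table(text = txt, sep = ",", header = TRUE, quote = "\"")
# parses into two rows, as expected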

Simplify R code to import big data as character

I currently use the code below very often to import a big dataset into R, forcing every column to character in order to avoid truncation of rows. The code works well, but I was wondering whether any of you knows how it could be simplified or improved so that it does not get so repetitive each time I need to do it.
library(readr)
library(stringr)
dataset.path <- choose.files(caption = "Select dataset", multi = FALSE)
data.columns <- read_delim(dataset.path, delim = '\t', col_names = TRUE, n_max = 0)
data.coltypes <- c(rep("c", ncol(data.columns)))
data.coltypes <- str_c(data.coltypes, collapse = "")
dataset <- read_delim(dataset.path, delim = '\t', col_names = TRUE, col_types = data.coltypes)
Like #Roland has suggested, you should write a function. Here is one possibility:
foo <- function() {
  require(readr)
  # select the file interactively and read only the header row to count columns
  dataset.path <- choose.files(caption = "Select dataset", multi = FALSE)
  data.columns <- read_delim(dataset.path, delim = '\t', col_names = TRUE, n_max = 0)
  # build a col_types string such as "ccc...c", one "c" per column
  data.coltypes <- paste(rep("c", ncol(data.columns)), collapse = "")
  read_delim(dataset.path, delim = '\t', col_names = TRUE, col_types = data.coltypes)
}
You can then just call foo() whenever you need to read a dataset this way.
Your two-liner:
data.coltypes <- c(rep("c", ncol(data.columns)))
data.coltypes <- str_c(data.coltypes, collapse = "")
can be collapsed into a single line using only base R's paste() instead of str_c() from the stringr package, as in the function above.
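Alternatively, recent versions of readr accept a default column type directly, which removes the column-counting step altogether; a sketch, assuming the same tab-delimited file:
library(readr)
# .default applies col_character() to every column not listed explicitly
dataset <- read_delim(dataset.path, delim = "\t", col_names = TRUE,
                      col_types = cols(.default = col_character()))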
