`fread` with headers with special characters (latin1) and unusual nested quotes - r

I have a latin1 encoded csv-file with nested quotes:
Ort;Stra▒e;Bezeichnung
Vienna;Testgasse 1;"Ministerium ""Pestalozzi"""
Graz;Teststra▒e 3;HS
Salzburg;Beispielstra▒e 9;"NMS ""Die Schlauen"""
Vienna;Wolfgang-Stra▒e 7;"Wirtshaus ""Wien III"""
Using fread from data.table 1.9.6 gives a wrong special character (ß) in the header, while every ß in the rows below is correct. The doubled quotes also stay as "".
dat <- fread("latin1quotedat.csv", encoding = "Latin-1")
dat # wrong header, wrong quotes
Ort Stra\xdfe Bezeichnung
1: Vienna Testgasse 1 Ministerium ""Pestalozzi""
2: Graz Teststraße 3 HS
3: Salzburg Beispielstraße 9 NMS ""Die Schlauen""
4: Vienna Wolfgang-Straße 7 Wirtshaus ""Wien III""
Using read.csv2 from base R everything is as expected:
dat1 <- read.csv2("latin1quotedat.csv", encoding = "latin1")
dat1 # ok
Ort Straße Bezeichnung
1 Vienna Testgasse 1 Ministerium "Pestalozzi"
2 Graz Teststraße 3 HS
3 Salzburg Beispielstraße 9 NMS "Die Schlauen"
4 Vienna Wolfgang-Straße 7 Wirtshaus "Wien III"
Maybe there is an option for the quotes (although I didn't find one).
The misinterpreted special character in the header looks like a bug.
The code and an example csv can be found here: https://github.com/nachti/datatable_test.
Clone the repository and run latin1quotedat.R.
Gerhard

Now fixed with commit f91bba1 in current devel, v1.9.7. From NEWS:
fread() did not respect encoding on header column. Now fixed, #1680. Thanks @nachti.
With this, I get:
names(fread("~/Downloads/latin1quotedat.csv", encoding = "Latin-1"))
# [1] "Ort" "Straße" "Bezeichnung"
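If upgrading is not an option, both symptoms can be patched after reading. The following is only a sketch and not part of the fix above: it assumes the header bytes are Latin-1 without an encoding mark and that the doubled quotes were left in the field values, as in the question's output. The vectors `hdr` and `bez` are hypothetical stand-ins for `names(dat)` and `dat$Bezeichnung`.

```r
# Stand-ins for names(dat) and dat$Bezeichnung as printed in the question
hdr <- c("Ort", "Stra\xdfe", "Bezeichnung")  # \xdf is the Latin-1 byte for ß
bez <- c('Ministerium ""Pestalozzi""', "HS")

# Re-encode the header from Latin-1 to UTF-8
hdr_fixed <- iconv(hdr, from = "latin1", to = "UTF-8")

# Collapse each doubled quote into a single quote
bez_fixed <- gsub('""', '"', bez, fixed = TRUE)
```

On a real data.table the same idea would be applied with `setnames(dat, iconv(names(dat), from = "latin1", to = "UTF-8"))`.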

Related

R software, read.csv, multiple separators

Does anyone know a way to read a csv file in R with multiple separators?
a<-read.csv("C:/Users/User/Desktop/file.csv", sep=",", header=FALSE)
Here, I have the following dataset (txt/csv file) separated by commas and spaces:
5.006,84.698
4.604,87.725 7.250,88.392
6.668,91.556
5.927,95.440
4.953,99.695 7.387,100.489
6.466,104.447
5.599,107.548
4.053,111.411 7.440,112.892
6.096,116.417
4.805,119.031 7.546,120.671
6.149,123.793
4.307,127.201 7.461,129.974
5.493,132.853 7.641,135.393
and I want it to be read as a table with four columns, like this:
72 5.006 84.698 NA NA
73 4.604 87.725 7.250 88.392
74 6.668 91.556 NA NA
75 5.927 95.440 NA NA
76 4.953 99.695 7.387 100.489
77 6.466 104.447 NA NA
78 5.599 107.548 NA NA
79 4.053 111.411 7.440 112.892
80 6.096 116.417 NA NA
81 4.805 119.031 7.546 120.671
82 6.149 123.793 NA NA
83 4.307 127.201 7.461 129.974
84 5.493 132.853 7.641 135.393
Do you know the possible way to read it that way in R?
You could open the file in any text editor (notepad or something similar) and make the separators common across the file. You can either replace ',' with spaces or vice-versa using Find and Replace all and save the file.
Once you do that you can use read.csv with this new separator.
a <- read.csv("C:/Users/User/Desktop/file.csv", sep= " ", header=FALSE, fill = TRUE)
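The same find-and-replace can be done inside R instead of a text editor. This is a minimal sketch using three lines of the sample data inline; with a real file, `lines` would come from `readLines()` on the path above. Note that with `fill = TRUE`, `read.table` decides the number of columns from the first five lines, so a wider line appearing only later in the file would need `col.names` to be set explicitly.

```r
# Three lines from the sample data; normally: lines <- readLines("file.csv")
lines <- c("5.006,84.698",
           "4.604,87.725 7.250,88.392",
           "6.668,91.556")

# Turn every comma into a space so only one separator remains
unified <- gsub(",", " ", lines, fixed = TRUE)

# Whitespace-separated read; short rows are padded with NA
a <- read.table(text = unified, header = FALSE, fill = TRUE)
```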
We can try using readLines() to read each line as a string. Then, we can split on multiple separators and roll up into a data frame.
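library(plyr)  # provides rbind.fill()
file <- "C:/Users/User/Desktop/file.csv"
txt <- readLines(file)
# split each line on runs of commas and/or spaces
y <- strsplit(txt, "[, ]+")
# one single-row data frame per line; rbind.fill() pads short rows with NA
z <- lapply(y, function(x) as.data.frame(t(as.numeric(x))))
df <- do.call(rbind.fill, z)
df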
One option is to use Excel. You can choose multiple separators (delimiters) during the import stage (Wizard step 2). Comma and space are among the default choices, but you can choose other characters too.
Then import the excel file using one of many user-contributed packages, for example, readxl, or save as text and use read.csv / read.table.
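If neither an extra package nor Excel is wanted, the split-and-pad idea can also be written in base R alone. A sketch under the assumption that every field is numeric; the inline `lines` vector stands in for `readLines()` on the real file.

```r
# Three lines from the sample data; normally: lines <- readLines("file.csv")
lines <- c("5.006,84.698",
           "4.604,87.725 7.250,88.392",
           "6.668,91.556")

# Split on runs of commas and/or spaces
parts <- strsplit(lines, "[, ]+")

# Widest row decides the number of columns
width <- max(lengths(parts))

# Indexing past the end of a vector yields NA, which pads the short rows
padded <- lapply(parts, function(p) as.numeric(p)[seq_len(width)])

# Stack the rows into a matrix, then convert to a data frame (columns V1..V4)
df <- as.data.frame(do.call(rbind, padded))
```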

How to use read_table or fread in this particular case?

As you know, read.table in R is a very useful but slow function, particularly when it comes to reading big databases. To address this, there are functions such as read_table and fread from the readr and data.table packages. Unfortunately, their arguments differ from those of read.table, which made it difficult for me to replicate this example:
download.file("https://datasets.imdbws.com/title.basics.tsv.gz", "mov_title")
download.file("https://datasets.imdbws.com/title.ratings.tsv.gz", "mov_rating")
title <- read.table("mov_title", sep="\t", header=TRUE,
fill=TRUE, na.strings="\\N", quote="")
rating <- read.table("mov_rating", sep="\t", header=TRUE,
fill=TRUE, na.strings="\\N", quote="")
Basically I want to use fread or read_table (or both if it's possible) to create my "title" and "rating" databases. Any advice or reference will be much appreciated.
this seems to work just fine... data.table::fread() can handle gz-files.
Set \t (=tab) as separator.
Since some movie-titles contain quotes, set quotes to nothing; quote = "". (or not, and just accept the warnings).
library( data.table )
title <- fread( "https://datasets.imdbws.com/title.basics.tsv.gz",
sep = "\t", quote = "" )
rating <- fread( "https://datasets.imdbws.com/title.ratings.tsv.gz",
sep = "\t", quote = "" )
fread supports .gz files as well as reading from a URL. You can keep the rest of the arguments the same as those in read.table
library(data.table)
title <- fread("https://datasets.imdbws.com/title.basics.tsv.gz", sep = "\t", quote = "", na.strings = "\\N", header = TRUE, fill = TRUE)
> dim(title)
[1] 6518809 9
>
>
> head(title)
tconst titleType primaryTitle originalTitle isAdult startYear endYear
1: tt0000001 short Carmencita Carmencita 0 1894 NA
2: tt0000002 short Le clown et ses chiens Le clown et ses chiens 0 1892 NA
3: tt0000003 short Pauvre Pierrot Pauvre Pierrot 0 1892 NA
4: tt0000004 short Un bon bock Un bon bock 0 1892 NA
5: tt0000005 short Blacksmith Scene Blacksmith Scene 0 1893 NA
6: tt0000006 short Chinese Opium Den Chinese Opium Den 0 1894 NA
runtimeMinutes genres
1: 1 Documentary,Short
2: 5 Animation,Short
3: 4 Animation,Comedy,Romance
4: NA Animation,Short
5: 1 Comedy,Short
6: 1 Short
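To see what `quote = ""` and `na.strings = "\\N"` do without downloading the full file, the same call can be tried on a few lines of text. This is a small sketch with made-up rows in the IMDb layout (the rows themselves are invented for illustration); it assumes a data.table version recent enough to offer the `text` argument of `fread`.

```r
library(data.table)

# Made-up rows: one title containing quotes, one \N marking a missing value
tsv <- c('tconst\tprimaryTitle\tendYear',
         'tt0000001\tThe "Best" Film\t\\N',
         'tt0000002\tPlain Title\t1895')

# quote = "" keeps the quotes as ordinary characters;
# na.strings = "\\N" turns the literal \N into NA
movies <- fread(text = tsv, sep = "\t", quote = "",
                na.strings = "\\N", header = TRUE)
```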

Is there anyway to read .dat file from movielens to R studio

I am trying to use Import Dataset in R Studio to read ratings.dat from movielens.
Basically it has this format:
1::1::5::978824268
1::1022::5::978300055
1::1028::5::978301777
1::1029::5::978302205
1::1035::5::978301753
So I need to replace :: with :, ', white space, etc. I use Notepad++, which loads the file quite fast (compared to Notepad) and can view very big files easily. However, when I do the replacement, it shows some strange characters:
"LF"
From some research here, I learned that it is \n (line feed or line break). But I do not understand why these markers are not shown when the file is loaded and only appear after I do the replacement. When I load the file into R Studio, it is still detected as "LF" rather than a line break, which causes an error when reading the data.
What is the solution for that? Thank you!
PS: I know there is Python code for converting this, but I don't want to use it. Are there any other ways?
Try this:
url <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
## this part is agonizingly slow
tf <- tempfile()
download.file(url,tf, mode="wb") # download archived movielens data
files <- unzip(tf, exdir=tempdir()) # unzips and returns a vector of file names
ratings <- readLines(files[grepl("ratings.dat$",files)]) # read rating.dat file
ratings <- gsub("::", "\t", ratings)
# this part is much faster
library(data.table)
ratings <- fread(paste(ratings, collapse="\n"), sep="\t")
# Read 10000054 rows and 4 (of 4) columns from 0.219 GB file in 00:00:07
head(ratings)
# V1 V2 V3 V4
# 1: 1 122 5 838985046
# 2: 1 185 5 838983525
# 3: 1 231 5 838983392
# 4: 1 292 5 838983421
# 5: 1 316 5 838983392
# 6: 1 329 5 838983392
Alternatively (use the d/l code from jlhoward but he also updated his code to not use built-in functions and switch to data.table while i wrote this, but mine's still faster/more efficient :-)
library(data.table)
# i try not to use variable names that stomp on function names in base
URL <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
# this will be "ml-10m.zip"
fil <- basename(URL)
# this will download to getwd() since you prbly want easy access to
# the files after the machinations. the nice thing about this is
# that it won't re-download the file and waste bandwidth
if (!file.exists(fil)) download.file(URL, fil)
# this will create the "ml-10M100K" dir in getwd(). if using
# R 3.2+ you can do a dir.exists() test to avoid re-doing the unzip
# (which is useful for large archives or archives compressed with a
# more CPU-intensive algorithm)
unzip(fil)
# fast read and slicing of the input
# fread will only split on a single delimiter so the initial fread
# will create a few blank columns. the [] expression filters those
# out. the "with=FALSE" is part of the data.table inanity
mov <- fread("ml-10M100K/ratings.dat", sep=":")[, c(1,3,5,7), with=FALSE]
# saner column names, set efficiently via data.table::setnames
setnames(mov, c("user_id", "movie_id", "tag", "timestamp"))
mov
## user_id movie_id tag timestamp
## 1: 1 122 5 838985046
## 2: 1 185 5 838983525
## 3: 1 231 5 838983392
## 4: 1 292 5 838983421
## 5: 1 316 5 838983392
## ---
## 10000050: 71567 2107 1 912580553
## 10000051: 71567 2126 2 912649143
## 10000052: 71567 2294 5 912577968
## 10000053: 71567 2338 2 912578016
## 10000054: 71567 2384 2 912578173
It's quite a bit faster than built-in functions.
Small improvement to @hrbrmstr's answer:
mov <- fread("ml-10M100K/ratings.dat", sep=":", select=c(1,3,5,7))
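For a quick check on a handful of lines, the "::" replacement can also be tried in base R without any package. A minimal sketch using the sample rows from the question; the column names are an assumption about what the four fields mean, and for the full 10M-row file the fread approaches above will be much faster.

```r
# Sample rows from the question; normally: raw <- readLines("ratings.dat")
raw <- c("1::1::5::978824268",
         "1::1022::5::978300055",
         "1::1028::5::978301777")

# Replace the two-character delimiter with a tab, then parse as tab-separated
ratings <- read.table(text = gsub("::", "\t", raw, fixed = TRUE),
                      sep = "\t",
                      col.names = c("user_id", "movie_id", "rating", "timestamp"))
```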

R correct use of read.csv

I must be misunderstanding how read.csv works in R. I have read the help file, but still do not understand how a csv file containing:
40900,-,-,-,241.75,0
40905,244,245.79,241.25,244,22114
40906,244,246.79,243.6,245.5,18024
40907,246,248.5,246,247,60859
read into R using: euk<-data.matrix(read.csv("path\to\csv.csv"))
produces this as a result (using tail):
Date Open High Low Close Volume
[2713,] 15329 490 404 369 240.75 62763
[2714,] 15330 495 409 378 242.50 127534
[2715,] 15331 1 1 1 241.75 0
[2716,] 15336 504 425 385 244.00 22114
[2717,] 15337 504 432 396 245.50 18024
[2718,] 15338 512 442 405 247.00 60859
It must be something obvious that I do not understand. Please be kind in your responses, I am trying to learn.
Thanks!
The issue is not with read.csv, but with data.matrix. read.csv imports any column with characters in it as a factor (this was the default before R 4.0.0; newer versions import such a column as character, which data.matrix also converts through factor codes). The '-' in the first row of your dataset is a character, so the column is converted to a factor. You then pass the result of read.csv into data.matrix, and as the help states, it replaces the levels of the factor with its internal codes.
Basically, you need to ensure that the columns of your data are numeric before you pass the data.frame into data.matrix.
This should work in your case (assuming the only characters are '-'):
euk <- data.matrix(read.csv("path/to/csv.csv", na.strings = "-", colClasses = 'numeric'))
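The effect is easy to reproduce on a few rows. In this sketch the header and the three-column layout are trimmed down from the question's data for brevity, and `stringsAsFactors = TRUE` is set explicitly so the behaviour is the same regardless of R version.

```r
# A trimmed-down version of the question's data, with a header added
csv <- "Date,Open,Close
40900,-,241.75
40905,244,244
40907,246,247"

# The '-' makes Open a factor, so data.matrix() returns level codes
bad <- data.matrix(read.csv(text = csv, stringsAsFactors = TRUE))

# Declaring '-' as missing keeps Open numeric
good <- data.matrix(read.csv(text = csv, na.strings = "-"))

bad[, "Open"]   # small integers: factor codes, not prices
good[, "Open"]  # NA followed by the real values
```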
I'm no R expert, but you may consider using scan() instead, eg:
> data = scan("foo.csv", what = list(x = numeric(), y = numeric()), sep = ",")
Where foo.csv has two columns, x and y, and is comma delimited. I hope that helps.
I took a cut/paste of your data, put it in a file and I get this using 'R'
> c<-data.matrix(read.csv("c:/DOCUME~1/Philip/LOCALS~1/Temp/x.csv",header=F))
> c
V1 V2 V3 V4 V5 V6
[1,] 40900 1 1 1 241.75 0
[2,] 40905 2 2 2 244.00 22114
[3,] 40906 2 3 3 245.50 18024
[4,] 40907 3 4 4 247.00 60859
>
There must be more in your data file, for one thing, data for the header line. And the output you show seems to start with row 2713. I would check:
The format of the header line, or get rid of it and add it manually later.
That each row has exactly 6 values.
That the filename uses forward slashes and has no embedded spaces
(use the 8.3 representation as shown in my filename).
Also, if you generated your csv file from MS Excel, the internal representation for a date is a number.

read.table and files with excess commas

I am trying to import a CSV file into R using the read.table command. I keep getting the error message "more columns than column names", even though I have set the strip.white to TRUE. The program that makes the csv files adds a large number of comma characters to the end of each line, which I think is the source of the extra columns.
read.table("filename.csv", sep=",", fill=T, header=TRUE, strip.white = T,
as.is=T,row.names = NULL, quote = "")
How can I get R to strip away the extraneous columns of commas from the header line and from the rest of the CSV file as it reads it into the R console?
Also, numerous cells in the csv file do not contain any data. Is it possible to get R to fill in these empty cells with "NA"?
The first two lines of the csv file:
Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
You can use colClasses with "NULL" entries to "blank-out" the trailing commas (fill=TRUE is still needed):
read.table(text="1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,
9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,", sep=",", fill=TRUE, colClasses=c(rep("numeric", 8), rep("NULL", 30)) )
#------------------
V1 V2 V3 V4 V5 V6 V7 V8
1 1 2 3 4 5 6 7 8
2 9 9 9 9 9 9 9 9
Warning message:
In read.table(text = "1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,\n9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,", :
cols = 26 != length(data) = 38
I needed to add back in the missing linefeed at the end of the first line. (Yet another reason why you should edit questions rather than putting data examples in the comments.) There was an octothorpe in the header which required the comment.char be set to "":
read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\nChr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",")
Document_Name Sequence_Name Track_Name Type Name
1 Chr2_FT Chr2 Chr2.bed CDS 10000_ARHGAP15
Sequence Minimum Min_.with_gaps... Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG 55916421 56019336 55916483
Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average.._Quality Coverage modified_by
1 56019399 63 64 1 forward U‌​ser
Polymorphism_Type Strand.Bias Strand.Bias_.50._P.va..lue Strand.Bias_.65._P.value Variant_Frequency
1
Variant_Nucleotide.s. Variant_P.Va..lue_.approximate.
1
If you know what your colClasses will be, then you can get missing values to be NA in the numeric columns automatically. You could also use the na.strings setting to accomplish this. You could also do some editing on the header to take out the illegal characters in the column names. (I didn't think I needed to be the one to do that though.)
read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",", na.strings="")
#------------------------------------------------------
Document_Name Sequence_Name Track_Name Type Name
1 Chr2_FT Chr2 Chr2.bed CDS 10000_ARHGAP15
Sequence Minimum Min_.with_gaps... Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG 55916421 56019336 55916483
Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average.._Quality Coverage modified_by
1 56019399 63 64 1 forward <NA> <NA> U‌​ser
Polymorphism_Type Strand.Bias Strand.Bias_.50._P.va..lue Strand.Bias_.65._P.value Variant_Frequency
1 <NA> <NA> <NA> <NA> <NA>
Variant_Nucleotide.s. Variant_P.Va..lue_.approximate.
1 <NA> <NA>
I have been fiddling with the first two lines of your file, and the problem appears to be the # in one of your column names. read.table treats # as a comment character by default, so it reads in your header, ignores everything after # and returns 13 columns.
You will be able to read in your file with read.table using the argument comment.char="".
Incidentally, this is yet another reason why those who ask questions should include examples of the files/datasets they are working with.
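The effect of the comment character can be checked with `count.fields()` before attempting the full read. A sketch on a shortened, made-up header that keeps the `#_Intervals` column and a few trailing commas; the file is written to a temporary path only so the example is self-contained.

```r
# Shortened stand-in for the real file: a '#' in the header, trailing commas
tf <- tempfile(fileext = ".csv")
writeLines(c("Name,Length,#_Intervals,Direction,,,",
             "Chr2,63,1,forward,,,"), tf)

# Default comment.char = "#": the header is cut off at the '#'
n_default <- count.fields(tf, sep = ",")

# With comments disabled, header and data have the same number of fields
n_fixed <- count.fields(tf, sep = ",", comment.char = "")

# Read the four real columns and drop the three empty trailing ones
dat <- read.table(tf, header = TRUE, sep = ",", comment.char = "",
                  colClasses = c(rep("character", 4), rep("NULL", 3)))
```

If `n_default` shows fewer fields on the header line than on the data lines, the "more columns than column names" error is the expected result.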
