sqldf, csv, and fields containing commas - r

Took me a while to figure this out. So, I am answering my own question.
You have some .csv, you want to load it fast, you want to use the sqldf package. Your usual code is irritated by a few annoying fields. Example:
1001, Amy,9:43:00, 99.2
1002,"Ben,Jr",9:43:00, 99.2
1003,"Ben,Sr",9:44:00, 99.3
This code only works on *nix systems.
library(sqldf)
system("touch temp.csv")
system("echo '1001, Amy,9:43:00, 99.2\n1002,\"Ben,Jr\",9:43:00, 99.2\n1003,\"Ben,Sr\",9:44:00, 99.3' > temp.csv")
If try to read with
x <- read.csv.sql("temp.csv", header=FALSE)
R complains
Error in try({ :
RS-DBI driver: (RS_sqlite_import: ./temp.csv line 2 expected 4 columns of data but found 5)
The sqldf-FAQ.13 solution doesn't work either:
x <- read.csv.sql("temp.csv", filter = "tr -d '\"' ", header=FALSE)
Again, R complains
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 5 elements
In fact, the filter only removes double quotes.
So, how to proceed?

Perl and regexes to the rescue. Digging through SO, and toying with regexes here, it is not too hard come up with the right one:
s/(\"[^\",]+),([^\"]+\")/$1_$2/g
which matches "...,...", here the dots are anything but double quotes and commas, and substitues the comma with an underscore. A perl one-liner is the right filter to pass to sqldf:
x <- read.csv.sql("temp.csv",
filter = "perl -e 's/(\"[^\",]+)_([^\"]+\")/$1_$2/g'",
header=FALSE)
Here is the dataframe x
> x
V1 V2 V3 V4
1 1001 Amy 9:43:00 99.2
2 1002 "Ben_Jr" 9:43:00 99.2
3 1003 "Ben_Sr" 9:44:00 99.3
Now, DYO cosmesis on strings ...
EDIT: The regex above only replaces the first occurrence of a comma in the field. To replace all the occurrencies use this
s{(\"[^\",]+),([^\"]+\")}{$_= $&, s/,/_/g, $_}eg
What's different?
I replaced the delimiters / with {};
The option e at the very end, instructs the parser to interpret the replacement field as perl code;
The replecement is a simple regex replace, that substitutes all "," with "_" within the matched substring $&.
An example:
system("touch temp.csv")
system("echo '1001, Amy,9:43:00, 99.2\n1002,\"Ben,Jr,More,Commas\",9:43:00, 99.2\n1003,\"Ben,Sr\",9:44:00, 99.3' > temp.csv")
The file temp.csv looks like:
1001, Amy,9:43:00, 99.2
1002,"Ben,Jr,More,Commas",9:43:00, 99.2
1003, "Ben,Sr",9:44:00, 99.3
And can be read with
x <- read.csv.sql("temp.csv",
filter = "perl -p -e 's{(\"[^\",]+),([^\"]+\")}{$_= $&, s/,/_/g, $_}eg'",
header=FALSE)
> x
V1 V2 V3 V4
1 1001 Amy 9:43:00 99.2
2 1002 "Ben_Jr_More_Commas" 9:43:00 99.2
3 1003 "Ben_Sr" 9:44:00 99.3

For windows, sqldf now comes with trcomma2dot.vbs which does this by default with read.csv2.sql . Although found it to be slow for very large data.(>1million rows)
It mentions about "tr" for non-windows based system but I could not try it.

Related

R software, read.csv, multiple separators

Does anyone know a way to read a csv file in R with multiple separators?
a<-read.csv("C:/Users/User/Desktop/file.csv", sep=",", header=FALSE)
Here, I have the following dataset (txt/csv file) separated by commas and spaces:
5.006,84.698
4.604,87.725 7.250,88.392
6.668,91.556
5.927,95.440
4.953,99.695 7.387,100.489
6.466,104.447
5.599,107.548
4.053,111.411 7.440,112.892
6.096,116.417
4.805,119.031 7.546,120.671
6.149,123.793
4.307,127.201 7.461,129.974
5.493,132.853 7.641,135.393
and I want it to be read as a table with four columns, like this:
72 5.006 84.698 NA NA
73 4.604 87.725 7.250 88.392
74 6.668 91.556 NA NA
75 5.927 95.440 NA NA
76 4.953 99.695 7.387 100.489
77 6.466 104.447 NA NA
78 5.599 107.548 NA NA
79 4.053 111.411 7.440 112.892
80 6.096 116.417 NA NA
81 4.805 119.031 7.546 120.671
82 6.149 123.793 NA NA
83 4.307 127.201 7.461 129.974
84 5.493 132.853 7.641 135.393
Do you know the possible way to read it that way in R?
You could open the file in any text editor (notepad or something similar) and make the separators common across the file. You can either replace ',' with spaces or vice-versa using Find and Replace all and save the file.
Once you do that you can use read.csv with this new separator.
a <- read.csv("C:/Users/User/Desktop/file.csv", sep= " ", header=FALSE, fill = TRUE)
We can try using readLines() to read each line as a string. Then, we can split on multiple separators and roll up into a data frame.
file <- "C:/Users/User/Desktop/file.csv"
txt <- readLines(file, sep = ""))
y <- strsplit(txt, "[, ]+")
z <- lapply(y,function(x){as.data.frame(t(as.numeric(x)))})
df <- do.call(rbind.fill, z)
df
One option is to use Excel. You can choose multiple separators (delimiters) during the import stage (Wizard step 2). Comma and space are one of the default choices but you can choose other characters too.
Then import the excel file using one of many user-contributed packages, for example, readxl, or save as text and use read.csv / read.table.

Is there anyway to read .dat file from movielens to R studio

I am trying to use Import Dataset in R Studio to read ratings.dat from movielens.
Basically it has this format:
1::1::5::978824268
1::1022::5::978300055
1::1028::5::978301777
1::1029::5::978302205
1::1035::5::978301753
So I need to replace :: by : or ' or white spaces, etc. I use notepad++, it helps to load the file quite fast (compare to note) and can view very big file easily. However, when I do replacement, it shows some strange characters:
"LF"
as I do some research here, it said that it is \n (line feed or line break). But I do not know why when it load the file, it do not show these, only when I do replacement then they appear. And when I load into R Studio, it still detect as "LF", not line break and cause error in data reading.
What is the solution for that ? Thank you !
PS: I know there is python code for converting this but I don't want to use it, is there any other ways ?
Try this:
url <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
## this part is agonizingly slow
tf <- tempfile()
download.file(url,tf, mode="wb") # download archived movielens data
files <- unzip(tf, exdir=tempdir()) # unzips and returns a vector of file names
ratings <- readLines(files[grepl("ratings.dat$",files)]) # read rating.dat file
ratings <- gsub("::", "\t", ratings)
# this part is much faster
library(data.table)
ratings <- fread(paste(ratings, collapse="\n"), sep="\t")
# Read 10000054 rows and 4 (of 4) columns from 0.219 GB file in 00:00:07
head(ratings)
# V1 V2 V3 V4
# 1: 1 122 5 838985046
# 2: 1 185 5 838983525
# 3: 1 231 5 838983392
# 4: 1 292 5 838983421
# 5: 1 316 5 838983392
# 6: 1 329 5 838983392
Alternatively (use the d/l code from jlhoward but he also updated his code to not use built-in functions and switch to data.table while i wrote this, but mine's still faster/more efficient :-)
library(data.table)
# i try not to use variable names that stomp on function names in base
URL <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
# this will be "ml-10m.zip"
fil <- basename(URL)
# this will download to getwd() since you prbly want easy access to
# the files after the machinations. the nice thing about this is
# that it won't re-download the file and waste bandwidth
if (!file.exists(fil)) download.file(URL, fil)
# this will create the "ml-10M100K" dir in getwd(). if using
# R 3.2+ you can do a dir.exists() test to avoid re-doing the unzip
# (which is useful for large archives or archives compressed with a
# more CPU-intensive algorithm)
unzip(fil)
# fast read and slicing of the input
# fread will only spit on a single delimiter so the initial fread
# will create a few blank columns. the [] expression filters those
# out. the "with=FALSE" is part of the data.table inanity
mov <- fread("ml-10M100K/ratings.dat", sep=":")[, c(1,3,5,7), with=FALSE]
# saner column names, set efficiently via data.table::setnames
setnames(mov, c("user_id", "movie_id", "tag", "timestamp"))
mov
## user_id movie_id tag timestamp
## 1: 1 122 5 838985046
## 2: 1 185 5 838983525
## 3: 1 231 5 838983392
## 4: 1 292 5 838983421
## 5: 1 316 5 838983392
## ---
## 10000050: 71567 2107 1 912580553
## 10000051: 71567 2126 2 912649143
## 10000052: 71567 2294 5 912577968
## 10000053: 71567 2338 2 912578016
## 10000054: 71567 2384 2 912578173
It's quite a bit faster than built-in functions.
Small improvement to #hrbrmstr's answer:
mov <- fread("ml-10M100K/ratings.dat", sep=":", select=c(1,3,5,7))

R correct use of read.csv

I must be misunderstanding how read.csv works in R. I have read the help file, but still do not understand how a csv file containing:
40900,-,-,-,241.75,0
40905,244,245.79,241.25,244,22114
40906,244,246.79,243.6,245.5,18024
40907,246,248.5,246,247,60859
read into R using: euk<-data.matrix(read.csv("path\to\csv.csv"))
produces this as a result (using tail):
Date Open High Low Close Volume
[2713,] 15329 490 404 369 240.75 62763
[2714,] 15330 495 409 378 242.50 127534
[2715,] 15331 1 1 1 241.75 0
[2716,] 15336 504 425 385 244.00 22114
[2717,] 15337 504 432 396 245.50 18024
[2718,] 15338 512 442 405 247.00 60859
It must be something obvious that I do not understand. Please be kind in your responses, I am trying to learn.
Thanks!
The issue is not with read.csv, but with data.matrix. read.csv imports any column with characters in it as a factor. The '-' in the first row for your dataset are character, so the column is converted to a factor. Now, you pass the result of the read.csv into data.matrix, and as the help states, it replaces the levels of the factor with it's internal codes.
Basically, you need to insure that the columns of your data are numeric before you pass the data.frame into data.matrix.
This should work in your case (assuming the only characters are '-'):
euk <- data.matrix(read.csv("path/to/csv.csv", na.strings = "-", colClasses = 'numeric'))
I'm no R expert, but you may consider using scan() instead, eg:
> data = scan("foo.csv", what = list(x = numeric(), y = numeric()), sep = ",")
Where foo.csv has two columns, x and y, and is comma delimited. I hope that helps.
I took a cut/paste of your data, put it in a file and I get this using 'R'
> c<-data.matrix(read.csv("c:/DOCUME~1/Philip/LOCALS~1/Temp/x.csv",header=F))
> c
V1 V2 V3 V4 V5 V6
[1,] 40900 1 1 1 241.75 0
[2,] 40905 2 2 2 244.00 22114
[3,] 40906 2 3 3 245.50 18024
[4,] 40907 3 4 4 247.00 60859
>
There must be more in your data file, for one thing, data for the header line. And the output you show seems to start with row 2713. I would check:
The format of the header line, or get rid of it and add it manually later.
That each row has exactly 6 values.
The the filename uses forward slashes and has no embedded spaces
(use the 8.3 representation as shown in my filename).
Also, if you generated your csv file from MS Excel, the internal representation for a date is a number.

read.table and files with excess commas

I am trying to import a CSV file into R using the read.table command. I keep getting the error message "more columns than column names", even though I have set the strip.white to TRUE. The program that makes the csv files adds a large number of comma characters to the end of each line, which I think is the source of the extra columns.
read.table("filename.csv", sep=",", fill=T, header=TRUE, strip.white = T,
as.is=T,row.names = NULL, quote = "")
How can I get R to strip away the extraneous columns of commas from the header line and from the rest of the CSV file as it reads it into the R console?
Also, numerous cells in the csv file do not contain any data. Is it possible to get R to fill in these empty cells with "NA"?
The first two lines of the csv file:
Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
You can use a combination of colClasses with "NULL" entries to "blank-out" the commas (also still needing , fill=TRUE:
read.table(text="1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,
9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,", sep=",", fill=TRUE, colClasses=c(rep("numeric", 8), rep("NULL", 30)) )
#------------------
V1 V2 V3 V4 V5 V6 V7 V8
1 1 2 3 4 5 6 7 8
2 9 9 9 9 9 9 9 9
Warning message:
In read.table(text = "1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,\n9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,", :
cols = 26 != length(data) = 38
I needed to add back in the missing linefeed at the end of the first line. (Yet another reason why you should edit questions rather than putting data examples in the comments.) There was an octothorpe in the header which required the comment.char be set to "":
read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\nChr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",")
Document_Name Sequence_Name Track_Name Type Name
1 Chr2_FT Chr2 Chr2.bed CDS 10000_ARHGAP15
Sequence Minimum Min_.with_gaps... Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG 55916421 56019336 55916483
Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average.._Quality Coverage modified_by
1 56019399 63 64 1 forward U‌​ser
Polymorphism_Type Strand.Bias Strand.Bias_.50._P.va..lue Strand.Bias_.65._P.value Variant_Frequency
1
Variant_Nucleotide.s. Variant_P.Va..lue_.approximate.
1
If you know what your colClasses will be, then you can get missing values to be NA in the numeric columns automatically. You could also use the na.strings setting to accomplish this. You could also do some editing on the header to take out the illegal characters in the column names. (I didn't think I needed to be the one to do that though.)
read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",", na.strings="")
#------------------------------------------------------
Document_Name Sequence_Name Track_Name Type Name
1 Chr2_FT Chr2 Chr2.bed CDS 10000_ARHGAP15
Sequence Minimum Min_.with_gaps... Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG 55916421 56019336 55916483
Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average.._Quality Coverage modified_by
1 56019399 63 64 1 forward <NA> <NA> U‌​ser
Polymorphism_Type Strand.Bias Strand.Bias_.50._P.va..lue Strand.Bias_.65._P.value Variant_Frequency
1 <NA> <NA> <NA> <NA> <NA>
Variant_Nucleotide.s. Variant_P.Va..lue_.approximate.
1 <NA> <NA>
I have been fiddling with the first two lines of your file, and the problem appears to be the # in one of your column names. read.table treats # as a comment character by default, so it reads in your header, ignores everything after # and returns 13 columns.
You will be able to read in your file with read.table using the argument comment.char="".
Incidentally, this is yet another reason why those who ask questions should include examples of the files/datasets they are working with.

read.table() changes column names [duplicate]

Whenever I read in a file using read.csv() with option header=T, the headers change in weird (but predictable) ways. A header name which ought to read "P(A<B)" becomes "P.A.B.", for instance:
> # when header=F:
> myfile1 <- read.csv(fullpath,sep="\t",header=F,nrow=3)
> myfile1
V1 V2 V3
1 ID Name P(A>B)
2 AB001 Alice 0.997
3 AB002 Bob 0.497
>
> # When header=T:
> myfile2 <- read.csv(fullpath,sep="\t",header=T,nrow=3)
> myfile2
ID Name P.A.B.
1 AB001 Alice 0.997
2 AB002 Bob 0.497
3 AB003 Charles 0.732
I tried to fix it like this, but it didn't work:
> names(myfile2) <- myfile1[1,]
> myfile2
3 3 3
1 AB001 Alice 0.997
2 AB002 Bob 0.497
3 AB003 Charles 0.732
So then I tried to use sub() to write a function that would take any vector "arbitrary.lengths.here." and return a vector "arbitrary(lengths>here)", but I didn't really get anywhere, and I started to suspect that I was making this problem more complicated than it had to be.
How would you deal with this problem of headers? Was I on the right track with sub()?
Set check.names=FALSE in read.csv()
read.csv(fullpath,sep="\t", header=FALSE, nrow=3, check.names=FALSE)
From the help for ?read.csv:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names) so that they are, and also to ensure that there are no duplicates.
Not really intended as an answer, but intended to be helpful to Rnewbs: Those headers were read in as factors (and caused the third column to also be a factor. The screwy names() assignments probably used their integer storage mode. #Andrie has already given you the preferred solution, but if you wanted to just reassign the names (which would not undo the damage to the thrid column) you could use:
names(myfile1) <- scan(file=fullpath, what="character" nmax=1 , sep="\t")
myfile1 <- myfile[-1, ] # gets rid of unneeded line

Resources