How to use read_table or fread in this particular case?

As you know, read.table in R is a very useful but slow function, particularly when it comes to reading large files. To work around this, there are faster alternatives such as read_table and fread from the readr and data.table packages. Unfortunately, their arguments differ from read.table's, which made it difficult for me to replicate this example:
download.file("https://datasets.imdbws.com/title.basics.tsv.gz", "mov_title")
download.file("https://datasets.imdbws.com/title.ratings.tsv.gz", "mov_rating")
title <- read.table("mov_title", sep="\t", header=TRUE,
                    fill=TRUE, na.strings="\\N", quote="")
rating <- read.table("mov_rating", sep="\t", header=TRUE,
                     fill=TRUE, na.strings="\\N", quote="")
Basically I want to use fread or read_table (or both, if possible) to create my "title" and "rating" databases. Any advice or reference will be much appreciated.

This seems to work just fine... data.table::fread() can handle gz files.
Set \t (tab) as the separator.
Since some movie titles contain quotes, disable quoting with quote = "" (or don't, and just accept the warnings).
library( data.table )
title <- fread( "https://datasets.imdbws.com/title.basics.tsv.gz",
sep = "\t", quote = "" )
rating <- fread( "https://datasets.imdbws.com/title.ratings.tsv.gz",
sep = "\t", quote = "" )

fread supports .gz files as well as reading from a URL. You can keep the rest of the arguments the same as in read.table.
library(data.table)
title <- fread("https://datasets.imdbws.com/title.basics.tsv.gz",
               sep = "\t", quote = "", na.strings = "\\N",
               header = TRUE, fill = TRUE)
> dim(title)
[1] 6518809 9
> head(title)
tconst titleType primaryTitle originalTitle isAdult startYear endYear
1: tt0000001 short Carmencita Carmencita 0 1894 NA
2: tt0000002 short Le clown et ses chiens Le clown et ses chiens 0 1892 NA
3: tt0000003 short Pauvre Pierrot Pauvre Pierrot 0 1892 NA
4: tt0000004 short Un bon bock Un bon bock 0 1892 NA
5: tt0000005 short Blacksmith Scene Blacksmith Scene 0 1893 NA
6: tt0000006 short Chinese Opium Den Chinese Opium Den 0 1894 NA
runtimeMinutes genres
1: 1 Documentary,Short
2: 5 Animation,Short
3: 4 Animation,Comedy,Romance
4: NA Animation,Short
5: 1 Comedy,Short
6: 1 Short
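Once both tables are loaded, they can be combined on the shared tconst key, for example with data.table's join syntax (a sketch):
movies <- rating[title, on = "tconst"]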

Related

fread is not reading the column names properly

I am trying to use a CSV generated from the Apple mobility reports, which can be found here.
Now everything works relatively fine, and I am able to get the .csv as intended, which looks something like this:
csvtxt <- "geo_type,region,2020-01-14,2020-01-15,2020-01-16
country/region,Albania,50.1,100.2,75.3"
But when I fread it, the first line, which is unsurprisingly a column-name line, is not recognized as such, even with the option check.names = FALSE that I found somewhere here but cannot find again.
library(data.table)
fread(csvtxt, check.names = FALSE)
# V1 V2 V3 V4 V5
#1: geo_type region 2020-01-14 2020-01-15 2020-01-16
#2: country/region Albania 50.1 100.2 75.3
Is there a way to get this data to import so that the column name line is recognized properly?
We need to force the header by setting header = TRUE.
library(data.table) # R version 4.0.2, data.table_1.13.2
fread(csvtxt, header = TRUE)
# geo_type region 2020-01-14 2020-01-15 2020-01-16
# 1: country/region Albania 50.1 100.2 75.3
From the manual:
header: Does the first data line contain column names? Defaults according to whether every non-empty field on the first data line is type character. If so, or TRUE is supplied, any empty column names are given a default name.
The confusion might come from read.csv, where header is TRUE by default:
read.csv(text = csvtxt)
# geo_type region X2020.01.14 X2020.01.15 X2020.01.16
# 1 country/region Albania 50.1 100.2 75.3
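As an aside, if you also want syntactically valid column names like read.csv produces, fread's check.names argument can be combined with header = TRUE (a sketch):
fread(csvtxt, header = TRUE, check.names = TRUE)
# the names should then become geo_type, region, X2020.01.14, X2020.01.15, X2020.01.16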

`fread` with headers with special characters (latin1) and unusual nested quotes

I have a latin1 encoded csv-file with nested quotes:
Ort;Straße;Bezeichnung
Vienna;Testgasse 1;"Ministerium ""Pestalozzi"""
Graz;Teststraße 3;HS
Salzburg;Beispielstraße 9;"NMS ""Die Schlauen"""
Vienna;Wolfgang-Straße 7;"Wirtshaus ""Wien III"""
Using fread from data.table 1.9.6 gives a wrongly encoded special character (ß) in the header, while all the ß below are correct; the escaped quotes stay doubled ("").
dat <- fread("latin1quotedat.csv", encoding = "Latin-1")
dat # wrong header, wrong quotes
Ort Stra\xdfe Bezeichnung
1: Vienna Testgasse 1 Ministerium ""Pestalozzi""
2: Graz Teststraße 3 HS
3: Salzburg Beispielstraße 9 NMS ""Die Schlauen""
4: Vienna Wolfgang-Straße 7 Wirtshaus ""Wien III""
Using read.csv2 from base R everything is as expected:
dat1 <- read.csv2("latin1quotedat.csv", encoding = "latin1")
dat1 # ok
Ort Straße Bezeichnung
1 Vienna Testgasse 1 Ministerium "Pestalozzi"
2 Graz Teststraße 3 HS
3 Salzburg Beispielstraße 9 NMS "Die Schlauen"
4 Vienna Wolfgang-Straße 7 Wirtshaus "Wien III"
Maybe there is an option for the quotes (although I didn't find one).
The misinterpreted special character in the header looks like a bug.
The code and an example csv can be found here: https://github.com/nachti/datatable_test.
Clone the repository and run latin1quotedat.R.
Gerhard
Now fixed with commit f91bba1 in current devel, v1.9.7. From NEWS:
fread() did not respect encoding on header column. Now fixed, #1680. Thanks @nachti.
With this, I get:
names(fread("~/Downloads/latin1quotedat.csv", encoding = "Latin-1"))
# [1] "Ort" "Straße" "Bezeichnung"

Is there any way to read a .dat file from MovieLens into RStudio?

I am trying to use Import Dataset in RStudio to read ratings.dat from MovieLens.
Basically it has this format:
1::1::5::978824268
1::1022::5::978300055
1::1028::5::978301777
1::1029::5::978302205
1::1035::5::978301753
So I need to replace :: with : or ' or whitespace, etc. I use Notepad++; it loads the file quite fast (compared to Notepad) and can view very big files easily. However, when I do the replacement, it shows some strange characters:
"LF"
From some research here, it seems this is \n (a line feed, or line break). But I do not know why these do not show when the file is loaded; they only appear after I do the replacement. And when I load the file into RStudio, it is still detected as "LF", not as a line break, and causes an error when reading the data.
What is the solution for that? Thank you!
PS: I know there is Python code for converting this, but I don't want to use it. Are there any other ways?
Try this:
url <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
## this part is agonizingly slow
tf <- tempfile()
download.file(url,tf, mode="wb") # download archived movielens data
files <- unzip(tf, exdir=tempdir()) # unzips and returns a vector of file names
ratings <- readLines(files[grepl("ratings.dat$", files)]) # read the ratings.dat file
ratings <- gsub("::", "\t", ratings)
# this part is much faster
library(data.table)
ratings <- fread(paste(ratings, collapse="\n"), sep="\t")
# Read 10000054 rows and 4 (of 4) columns from 0.219 GB file in 00:00:07
head(ratings)
# V1 V2 V3 V4
# 1: 1 122 5 838985046
# 2: 1 185 5 838983525
# 3: 1 231 5 838983392
# 4: 1 292 5 838983421
# 5: 1 316 5 838983392
# 6: 1 329 5 838983392
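On *nix systems, newer versions of data.table can also push the delimiter rewrite into fread itself via its cmd argument, avoiding the intermediate readLines/gsub step. A sketch, assuming GNU sed is on the PATH:
library(data.table)
# sed rewrites "::" to a tab on the fly; fread reads the command's output
ratings <- fread(cmd = "sed 's/::/\\t/g' ml-10M100K/ratings.dat", sep = "\t")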
Alternatively (using the download code from jlhoward; he also updated his code to drop the built-in functions and switch to data.table while I wrote this, but mine's still faster/more efficient :-)
library(data.table)
# I try not to use variable names that stomp on function names in base
URL <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
# this will be "ml-10m.zip"
fil <- basename(URL)
# this will download to getwd() since you probably want easy access to
# the files after the machinations. the nice thing about this is
# that it won't re-download the file and waste bandwidth
if (!file.exists(fil)) download.file(URL, fil)
# this will create the "ml-10M100K" dir in getwd(). if using
# R 3.2+ you can do a dir.exists() test to avoid re-doing the unzip
# (which is useful for large archives or archives compressed with a
# more CPU-intensive algorithm)
unzip(fil)
# fast read and slicing of the input
# fread will only split on a single delimiter so the initial fread
# will create a few blank columns. the [] expression filters those
# out. the "with=FALSE" is part of the data.table inanity
mov <- fread("ml-10M100K/ratings.dat", sep=":")[, c(1,3,5,7), with=FALSE]
# saner column names, set efficiently via data.table::setnames
setnames(mov, c("user_id", "movie_id", "tag", "timestamp"))
mov
## user_id movie_id tag timestamp
## 1: 1 122 5 838985046
## 2: 1 185 5 838983525
## 3: 1 231 5 838983392
## 4: 1 292 5 838983421
## 5: 1 316 5 838983392
## ---
## 10000050: 71567 2107 1 912580553
## 10000051: 71567 2126 2 912649143
## 10000052: 71567 2294 5 912577968
## 10000053: 71567 2338 2 912578016
## 10000054: 71567 2384 2 912578173
It's quite a bit faster than built-in functions.
Small improvement to @hrbrmstr's answer:
mov <- fread("ml-10M100K/ratings.dat", sep=":", select=c(1,3,5,7))

read.table and files with excess commas

I am trying to import a CSV file into R using the read.table command. I keep getting the error message "more columns than column names", even though I have set strip.white to TRUE. The program that makes the CSV files adds a large number of commas to the end of each line, which I think is the source of the extra columns.
read.table("filename.csv", sep=",", fill=T, header=TRUE, strip.white = T,
as.is=T,row.names = NULL, quote = "")
How can I get R to strip away the extraneous columns of commas from the header line and from the rest of the CSV file as it reads it into the R console?
Also, numerous cells in the csv file do not contain any data. Is it possible to get R to fill in these empty cells with "NA"?
The first two lines of the csv file:
Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
You can use a combination of colClasses with "NULL" entries to "blank out" the comma-only columns (while still needing fill=TRUE):
read.table(text="1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,
9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,", sep=",", fill=TRUE, colClasses=c(rep("numeric", 8), rep("NULL", 30)) )
#------------------
V1 V2 V3 V4 V5 V6 V7 V8
1 1 2 3 4 5 6 7 8
2 9 9 9 9 9 9 9 9
Warning message:
In read.table(text = "1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,\n9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,", :
cols = 26 != length(data) = 38
I needed to add back the missing linefeed at the end of the first line. (Yet another reason why you should edit questions rather than putting data examples in the comments.) There was an octothorpe (#) in the header, which required comment.char to be set to "":
read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\nChr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",")
Document_Name Sequence_Name Track_Name Type Name
1 Chr2_FT Chr2 Chr2.bed CDS 10000_ARHGAP15
Sequence Minimum Min_.with_gaps... Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG 55916421 56019336 55916483
Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average.._Quality Coverage modified_by
1 56019399 63 64 1 forward U‌​ser
Polymorphism_Type Strand.Bias Strand.Bias_.50._P.va..lue Strand.Bias_.65._P.value Variant_Frequency
1
Variant_Nucleotide.s. Variant_P.Va..lue_.approximate.
1
If you know what your colClasses will be, then you can get missing values to be NA in the numeric columns automatically. You could also use the na.strings setting to accomplish this, and you could do some editing on the header to take out the illegal characters in the column names. (I didn't think I needed to be the one to do that, though.)
read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",", na.strings="")
#------------------------------------------------------
Document_Name Sequence_Name Track_Name Type Name
1 Chr2_FT Chr2 Chr2.bed CDS 10000_ARHGAP15
Sequence Minimum Min_.with_gaps... Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG 55916421 56019336 55916483
Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average.._Quality Coverage modified_by
1 56019399 63 64 1 forward <NA> <NA> U‌​ser
Polymorphism_Type Strand.Bias Strand.Bias_.50._P.va..lue Strand.Bias_.65._P.value Variant_Frequency
1 <NA> <NA> <NA> <NA> <NA>
Variant_Nucleotide.s. Variant_P.Va..lue_.approximate.
1 <NA> <NA>
I have been fiddling with the first two lines of your file, and the problem appears to be the # in one of your column names. read.table treats # as a comment character by default, so it reads in your header, ignores everything after # and returns 13 columns.
You will be able to read in your file with read.table using the argument comment.char="".
Incidentally, this is yet another reason why those who ask questions should include examples of the files/datasets they are working with.
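If you don't want to hard-code the split between real and comma-only columns, the counts can be derived from the file itself. A sketch, assuming 24 real columns as above and a file named filename.csv:
n_total <- max(count.fields("filename.csv", sep = ",", quote = "", comment.char = ""))
read.table("filename.csv", sep = ",", header = TRUE, fill = TRUE,
           comment.char = "", na.strings = "",
           colClasses = c(rep("character", 24), rep("NULL", n_total - 24)))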

sqldf, csv, and fields containing commas

Took me a while to figure this out. So, I am answering my own question.
You have some .csv, you want to load it fast, and you want to use the sqldf package. Your usual code trips over a few annoying fields. Example:
1001, Amy,9:43:00, 99.2
1002,"Ben,Jr",9:43:00, 99.2
1003,"Ben,Sr",9:44:00, 99.3
This code only works on *nix systems.
library(sqldf)
system("touch temp.csv")
system("echo '1001, Amy,9:43:00, 99.2\n1002,\"Ben,Jr\",9:43:00, 99.2\n1003,\"Ben,Sr\",9:44:00, 99.3' > temp.csv")
If you try to read it with
x <- read.csv.sql("temp.csv", header=FALSE)
R complains
Error in try({ :
RS-DBI driver: (RS_sqlite_import: ./temp.csv line 2 expected 4 columns of data but found 5)
The solution from sqldf FAQ 13 doesn't work either:
x <- read.csv.sql("temp.csv", filter = "tr -d '\"' ", header=FALSE)
Again, R complains
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 5 elements
In fact, the filter only removes double quotes.
So, how to proceed?
Perl and regexes to the rescue. Digging through SO and toying with regexes here, it is not too hard to come up with the right one:
s/(\"[^\",]+),([^\"]+\")/$1_$2/g
which matches "...,..." (where the dots are anything but double quotes and commas) and substitutes the comma with an underscore. A Perl one-liner is the right filter to pass to sqldf:
x <- read.csv.sql("temp.csv",
filter = "perl -e 's/(\"[^\",]+)_([^\"]+\")/$1_$2/g'",
header=FALSE)
Here is the data frame x:
> x
V1 V2 V3 V4
1 1001 Amy 9:43:00 99.2
2 1002 "Ben_Jr" 9:43:00 99.2
3 1003 "Ben_Sr" 9:44:00 99.3
Now, DYO cosmesis on strings ...
EDIT: The regex above only replaces the first occurrence of a comma in the field. To replace all occurrences, use this:
s{(\"[^\",]+),([^\"]+\")}{$_= $&, s/,/_/g, $_}eg
What's different?
I replaced the delimiters / with {};
The option e at the very end instructs the parser to interpret the replacement field as Perl code;
The replacement is a simple regex substitution that replaces all "," with "_" within the matched substring $&.
An example:
system("touch temp.csv")
system("echo '1001, Amy,9:43:00, 99.2\n1002,\"Ben,Jr,More,Commas\",9:43:00, 99.2\n1003,\"Ben,Sr\",9:44:00, 99.3' > temp.csv")
The file temp.csv looks like:
1001, Amy,9:43:00, 99.2
1002,"Ben,Jr,More,Commas",9:43:00, 99.2
1003, "Ben,Sr",9:44:00, 99.3
And can be read with
x <- read.csv.sql("temp.csv",
filter = "perl -p -e 's{(\"[^\",]+),([^\"]+\")}{$_= $&, s/,/_/g, $_}eg'",
header=FALSE)
> x
V1 V2 V3 V4
1 1001 Amy 9:43:00 99.2
2 1002 "Ben_Jr_More_Commas" 9:43:00 99.2
3 1003 "Ben_Sr" 9:44:00 99.3
For Windows, sqldf now comes with trcomma2dot.vbs, which does this by default with read.csv2.sql, although I found it to be slow for very large data (>1 million rows).
It mentions "tr" for non-Windows systems, but I could not try it.
