fread skip and autostart issue - R

I have the following code:
raw_test <- fread("avito_test.tsv", nrows = intNrows, skip = intSkip)
Which produces the following error:
Error in fread("avito_test.tsv", nrows = intNrows, skip = intSkip, autostart = (intSkip + :
Expected sep (',') but new line, EOF (or other non printing character) ends field 14 on line 1003 when detecting types: 10066652 Транспорт Автомобили с пробегом Nissan R Nessa, 1998 Тарантас в отличном состоянии. на прошлой неделе возили на тех. Обслуживание. В дорожных неприятностях не был участником. Детали кузова без коцок и терок. Предназначалась для поездок на природу, Отдам только в добрые руки. В салон не поставлю не звоните "{""Марка"":""Nissan"", ""Модель"":""R Nessa"", ""Год выпуска"":""1998"", ""Пробег"":""180 000 - 189 999"", ""Тип кузова"":""Минивэн"", ""Цвет"":""Оранжевый"", ""Объём двигателя"":""2.4"", ""Коробка передач"":""Механическая
I have tried changing it to this:
raw_test <- fread("avito_test.tsv", nrows = intNrows, skip = intSkip, autostart = (intSkip + 2))
This is based on what I read in a similar question, skip and autostart in fread.
However, it produces a similar error to the one above.
How can I skip the first 1000 rows and read the next thousand? My expected output is 1000 rows in total: skipping the first thousand data rows of the file and reading the second thousand.
Note: Reading the file with raw_test <- fread("avito_test.tsv", nrows = 1000, skip = -1) works well for getting me only the first thousand, but I am trying to get only the second thousand.
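For reference, a workaround sketch with a current data.table (untested on this file, and it assumes the parse error above is not hit inside the rows being read): read the header separately, skip the header line plus the first 1000 data rows, and reattach the column names.
library(data.table)

# Read zero rows just to capture the column names from the header
hdr <- names(fread("avito_test.tsv", nrows = 0))

# Skip the header line plus the first 1000 data rows, then read the next 1000
raw_test <- fread("avito_test.tsv", skip = 1001, nrows = 1000, header = FALSE)
setnames(raw_test, hdr)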
Edit: The data is publicly available at http://www.kaggle.com/c/avito-prohibited-content/data
Edit: Environment and package info:
> packageVersion("data.table")
[1] ‘1.9.3’
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

Related

Character string limit reached early using fread from data.table, CSV lines look fine

I'm reading a moderately big CSV with fread, but it errors out with "R character strings are limited to 2^31-1 bytes". readLines works fine, however. Pinning down the faulty line (2956), I am not sure what's going on: it doesn't seem longer than the next one, for example.
Any thoughts about what is going on and how to overcome this, preferably checking the CSV in advance to avoid an fread error?
library(data.table)
options(timeout=10000)
download.file("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-03.csv",
              destfile = "trip_data.csv", mode = "wb")
dt = fread("trip_data.csv")
#> Error in fread("trip_data.csv"): R character strings are limited to 2^31-1 bytes
lines = readLines("trip_data.csv")
dt2955 = fread("trip_data.csv", nrows = 2955)
#> Warning in fread("trip_data.csv", nrows = 2955): Previous fread() session was
#> not cleaned up properly. Cleaned up ok at the beginning of this fread() call.
dt2956 = fread("trip_data.csv", nrows = 2956)
#> Error in fread("trip_data.csv", nrows = 2956): R character strings are limited to 2^31-1 bytes
lines[2955]
#> [1] "CMT,2010-03-07 18:37:05,2010-03-07 18:41:51,1,1,-73.984211000000002,40.743720000000003,1,0,-73.974515999999994,40.748331,Cre,4.9000000000000004,0,0.5,1.0800000000000001,0,6.4800000000000004"
lines[2956]
#> [1] "CMT,2010-03-07 22:59:01,2010-03-07 23:01:04,1,0.59999999999999998,-73.992887999999994,40.703017000000003,1,0,-73.992887999999994,40.703017000000003,Cre,3.7000000000000002,0.5,0.5,2,0,6.7000000000000002"
Created on 2022-02-12 by the reprex package (v2.0.1)
When trying to read part of the file (around 100k rows) I got:
Warning message:
In fread("trip_data.csv", skip = i * 500, nrows = 500) :
Stopped early on line 2958. Expected 18 fields but found 19. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<CMT,2010-03-07 03:46:42,2010-03-07 03:58:31,1,3.6000000000000001,-73.961027000000001,40.796674000000003,1,,,-73.937324000000004,40.839283000000002,Cas,10.9,0.5,0.5,0,0,11.9>>
After removing that line I was able to read at least 100k rows:
da = data.table(check.names = FALSE)
for (i in 0:200) {
  print(i * 500)
  dt = fread("trip_data.csv", skip = i * 500, nrows = 500, fill = TRUE)
  da <- rbind(da, dt, use.names = FALSE)
}
str(da)
Classes ‘data.table’ and 'data.frame': 101000 obs. of 18 variables:
$ vendor_id : chr "" "CMT" "CMT" "CMT" ...
$ pickup_datetime : POSIXct, format: NA "2010-03-22 17:05:03" "2010-03-22 19:24:29" ...
$ dropoff_datetime : POSIXct, format: NA "2010-03-22 17:22:51" "2010-03-22 19:40:13" ...
$ passenger_count : int NA 1 1 1 3 1 1 1 1 1 ...
[...]
To check the CSV in advance, you can read it line by line, check the number of fields in each line, and bind only the well-formed rows into a data.table.
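A minimal sketch of that pre-check, assuming a plain comma split is good enough (i.e. no quoted fields containing commas):
library(data.table)

lines  <- readLines("trip_data.csv")
fields <- strsplit(lines, ",", fixed = TRUE)

# Flag lines whose field count differs from the header's
n_expected <- length(fields[[1]])
bad <- which(lengths(fields) != n_expected)

# Drop the malformed lines (if any) and parse the rest
if (length(bad) > 0) lines <- lines[-bad]
dt <- fread(text = lines)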
Regards,
Grzegorz

R data.table problem when reading a file with inconsistent columns

When I use R data.table (fread) to read a .dat file (3 GB), a problem occurs:
Stopped early on line 3169933. Expected 136 fields but found 138. Consider fill=TRUE and comment.char=. First discarded non-empty line:
My code:
library(data.table)
file_path = 'data.dat' # 3GB
fread(file_path, fill = TRUE)
The problem is that my file has ~ 5 million rows. In detail:
From row 1 to row 3169933 it has 136 columns
From row 3169933 to row 5000000 it has 138 columns
fread() only reads my file up to row 3169933 because of this error. fill = TRUE did not help in this case. Could anyone help me?
R version: 3.6.3
data.table version: 1.13.2
Note about fill=TRUE in this case (a small illustration of both cases follows the note):
[Case 1, not my case] if part 1 of my file (50% of rows) had 138 columns and part 2 had 136 columns, then fill=TRUE would help (it would fill the two missing columns in part 2 with NA).
[Case 2, my case] part 1 of my file (50% of rows) has 136 columns and part 2 has 138 columns, and fill=TRUE does not help here.
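If it helps to see the two cases in miniature, they can be reproduced with inline text (a sketch; the exact behaviour of Case 2 depends on the data.table version):
library(data.table)

# Case 1: wide rows first, narrow rows later -- fill = TRUE pads the
# missing trailing fields of the short row with NA
fread(text = "1,2,3\n4,5,6\n7,8\n", fill = TRUE)

# Case 2: narrow rows first, wide rows later -- this mirrors the file
# described above; compare the output (and any warning) with the real file
fread(text = "1,2\n3,4\n5,6,7\n", fill = TRUE)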
Not sure why you still have the problem even with fill=T... But if nothing helps, you can try playing with something like this:
tryCatch(
  expr = { dt1 <<- fread(file_path) },
  warning = function(w) {
    cat('Warning: ', w$message, '\n\n')
    n_line <- as.numeric(gsub('Stopped early on line (\\d+)\\..*', '\\1', w$message))
    if (!is.na(n_line)) {
      cat('Found ', n_line, '\n')
      dt1_part1 <- fread(file_path, nrows = n_line)
      dt1_part2 <- fread(file_path, skip = n_line)
      dt1 <<- rbind(dt1_part1, dt1_part2, fill = T)
    }
  },
  finally = cat("\nFinished. \n")
)
The tryCatch() construct catches the warning message so you can extract the line number and process the two parts accordingly.
Try reading the two parts separately, then combine them after creating two extra columns for the first part:
library(data.table)
library(dplyr)  # for %>%, mutate() and bind_rows()

first_part = fread('data.dat', nrows = 3169933) %>%
  mutate(extra_1 = NA, extra_2 = NA)
second_part = fread('data.dat', skip = 3169933)
df = bind_rows(first_part, second_part)

Error in Running NLRX (NetLogo) in Manjaro (Arch) Linux

I am attempting to run an NLRX simulation on Manjaro Linux (RNetLogo wouldn't work for some reason either), and I am running into the following error when attempting to set up a dummy experiment:
cp: cannot stat '~/.netlogo/NetLogo 6.1.1/netlogo-headless.sh': No such file or directory
sed: can't read /tmp/Rtmpj15Yf7/netlogo-headless365385fb4bdc0.sh: No such file or directory
sed: can't read /tmp/Rtmpj15Yf7/netlogo-headless365385fb4bdc0.sh: No such file or directory
sh: /tmp/Rtmpj15Yf7/netlogo-headless365385fb4bdc0.sh: No such file or directory
Error in util_gather_results(nl, outfile, seed, siminputrow) :
Temporary output file /tmp/Rtmpj15Yf7/nlrx5493_1365385ab03157.csvnot found. On unix systems this can happen if the default system temp folder is used.
Try reassigning the default temp folder for this R session (unixtools package).
In addition: Warning message:
In system(NLcall, wait = TRUE) : error in running command
Given that I am running R 4.0.0, the Unixtools package doesn't work, so that's out of the question. How would I go about fixing this?
Code for those curious:
library(nlrx)
netlogopath <- file.path("~/.netlogo/NetLogo 6.1.1")
modelpath <- file.path(netlogopath, "app/models/Sample Models/Biology/Wolf Sheep Predation.nlogo")
outpath <- file.path("/home/out")

nl <- nl(nlversion = "6.0.3",
         nlpath = netlogopath,
         modelpath = modelpath,
         jvmmem = 1024)

nl@experiment <- experiment(expname = "wolf-sheep",
                            outpath = outpath,
                            repetition = 1,
                            tickmetrics = "true",
                            idsetup = "setup",
                            idgo = "go",
                            runtime = 50,
                            evalticks = seq(40, 50),
                            metrics = c("count sheep", "count wolves", "count patches with [pcolor = green]"),
                            variables = list('initial-number-sheep' = list(min = 50, max = 150, qfun = "qunif"),
                                             'initial-number-wolves' = list(min = 50, max = 150, qfun = "qunif")),
                            constants = list("model-version" = "\"sheep-wolves-grass\"",
                                             "grass-regrowth-time" = 30,
                                             "sheep-gain-from-food" = 4,
                                             "wolf-gain-from-food" = 20,
                                             "sheep-reproduce" = 4,
                                             "wolf-reproduce" = 5,
                                             "show-energy?" = "false"))

nl@simdesign <- simdesign_lhs(nl = nl,
                              samples = 100,
                              nseeds = 3,
                              precision = 3)

results <- run_nl_all(nl = nl)
R Version for those who may want it:
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 4
minor 0.0
year 2020
month 04
day 24
svn rev 78286
language R
version.string R version 4.0.0 (2020-04-24)
nickname Arbor Day
In case others find this helpful: I have encountered similar errors as a result of file path misspecification. For instance, double-check the model path; you may need to drop app/.
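A quick sanity check along those lines (a sketch; the exact layout depends on how NetLogo was unpacked):
netlogopath <- file.path("~/.netlogo/NetLogo 6.1.1")

# The model path from the question, with and without the "app/" prefix
file.exists(file.path(netlogopath, "app/models/Sample Models/Biology/Wolf Sheep Predation.nlogo"))
file.exists(file.path(netlogopath, "models/Sample Models/Biology/Wolf Sheep Predation.nlogo"))

# The headless launcher the cp error complains about should also exist
file.exists(file.path(netlogopath, "netlogo-headless.sh"))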

Missing results from foreach and Parallel

Due to memory constraints in a previous script, I modified it following the advice in a similar issue to mine (do not give the workers more data than they need -
reading global variables using foreach in R). Unfortunately, now I'm struggling with missing results.
The script iterates over a 1.9M-column matrix, processes each column, and returns a one-row data frame (the rbind combine function in foreach stacks the rows). However, when it prints the results, there are fewer rows (results) than the number of columns, and this quantity changes every run. There seems to be no error in the function inside the foreach loop, as it ran smoothly in the previous script, and no error or warning message pops up.
New Script:
if (!require(R.utils))    { install.packages("R.utils");    require(R.utils) }
if (!require(foreach))    { install.packages("foreach");    require(foreach) }
if (!require(doParallel)) { install.packages("doParallel"); require(doParallel) }
if (!require(data.table)) { install.packages("data.table"); require(data.table) }

registerDoParallel(cores = 6)

out.file  = "data.result.167_6_inside.out"   # rows written inside the foreach loop
out.file2 = "data.result.167_6_outside.out"  # combined result written after the loop

data1 = fread("data.txt", sep = "auto", header = FALSE, stringsAsFactors = FALSE, na.strings = "NA")
data2 = transpose(data1)
rm(data1)
data3   = data2[, 3:dim(data2)[2]]
levels2 = data2[-1, 1:(3 - 1)]
rm(data2)

colClasses = c(ID = "character", Col1 = "character", Col2 = "character", Col3 = "character",
               Col4 = "character", Col5 = "character", Col6 = "character")
res_table = dataFrame(colClasses, nrow = 0)   # empty template data frame with the declared column classes (R.utils::dataFrame)
write.table(res_table, file = out.file,  append = T, col.names = TRUE, row.names = FALSE, quote = FALSE)
write.table(res_table, file = out.file2, append = T, col.names = TRUE, row.names = FALSE, quote = FALSE)

tableRes = foreach(col1 = data3, .combine = "rbind") %dopar% {
  id1 = col1[1]
  df2function = data.frame(levels2[, 1, drop = F], levels2[, 2, drop = F], as.numeric(col1[-1]))
  mode(df2function[, 1]) = "numeric"
  mode(df2function[, 2]) = "numeric"
  values1 <- try(genericFuntion(df2function), TRUE)
  if (is.numeric(try(values1$F, TRUE))) {
    res_table[1, 1] = id1
    res_table[1, 2] = values1$F[1, 1]
    res_table[1, 3] = values1$F[1, 2]
    res_table[1, 4] = values1$F[1, 3]
    res_table[1, 5] = values1$F[2, 2]
    res_table[1, 6] = values1$F[2, 3]
    res_table[1, 7] = values1$F[3, 3]
  } else {
    res_table[1, 1] = id1
    res_table[1, 2:7] = NA
  }
  # each worker appends its own row to the "inside" file
  write.table(res_table, file = out.file, append = T, col.names = FALSE, row.names = FALSE, quote = FALSE)
  return(res_table[1, ])
}
# the combined result returned by foreach goes to the "outside" file
write.table(tableRes, file = out.file2, append = T, col.names = FALSE, row.names = FALSE, quote = FALSE)
In the previous script, the foreach call was:
tableRes = foreach(i=1:length(data3), iter=icount(), .combine="rbind") %dopar% { (same code as above) }
I would like to know the possible causes of this behaviour.
I'm running this script on a cluster, requesting 80 GB of memory (and 6 cores in this example). This is the largest amount of RAM I can request on a single node, so the script should not fail for lack of memory. (Each node has a pair of 14-core Intel Xeon Skylake CPUs at 2.6 GHz and 128 GB of RAM; OS: RHEL 7.)
PS 1: Although the new script no longer pages (even with more than 8 cores), each child process still seems to load a large amount of data into memory (~6 GB), as tracked with the top command.
PS 2: The new script prints the results both inside and outside the foreach loop to track whether the loss of data occurs during the loop or after it finishes; as noted, every run gives a different number of printed results inside and outside the loop.
PS 3: The fastest run used 20 cores (6 s for 1000 iterations) and the slowest was 56 s on a single core (tests performed with microbenchmark, 10 replications). However, more cores lead to fewer results being returned for the full matrix (1.9M columns).
I really appreciate any help you can provide,

Loading ffdf data takes a lot of memory

I am facing a strange problem:
I save ffdf data using
save.ffdf()
from the ffbase package, and when I load it in a new R session with
load.ffdf("data.f")
it gets loaded into RAM using approximately 90% of the memory that the same data takes as a data.frame object in R.
Given this issue, it does not make a lot of sense to use ffdf, does it?
I can't use ffsave because I am working on a server that does not have the zip application on it.
packageVersion("ff") # 2.2.10
packageVersion("ffbase") # 0.6.3
Any ideas?
[edit] Some example code to help clarify:
data <- read.csv.ffdf(file = fn, header = T, colClasses = classes)
# file fn is a csv database with 5 columns and 2.6 million rows,
# with some factor cols and some integer cols.
data.1 <- data
save.ffdf(data.1, dir = my.dir) # my.dir is a string pointing to the directory, "C:/data/R/test.f" for example.
closing the R session... opening again:
load.ffdf(file.name) # file.name is a string pointing to the file.
#that gives me object data, with class(data) = ffdf.
Then I have a data object of class ffdf with 5 columns, and its memory size is almost as big as:
data.R <- data[,] # which is a data.frame.
[end of edit]
[SECOND EDIT: FULL REPRODUCIBLE CODE]
As my question has not been answered yet and I still see the problem, here is a reproducible example:
dir1 <- 'P:/Projects/RLargeData';
setwd(dir1);
library(ff)
library(ffbase)
memory.limit(size=4000)
N = 1e7;
df <- data.frame(
  x = c(1:N),
  y = sample(letters, N, replace = T),
  z = sample(as.Date(sample(c(1:2000), N, replace = T), origin = "1970-01-01")),
  w = factor(sample(c(1:N/10), N, replace = T)))
df[1:10,]
dff <- as.ffdf(df)
head(dff)
#str(dff)
save.ffdf(dff, dir = "dframeffdf")
dim(dff)
# on disk, the directory "dframeffdf" is : 205 MB (215.706.264 bytes)
### resetting R :: fresh RStudio Session
dir1 <- 'P:/Projects/RLargeData';
setwd(dir1);
library(ff)
library(ffbase)
memory.size() # 15.63
load.ffdf(dir = "dframeffdf")
memory.size() # 384.42
gc()
memory.size() # 287
So we have 384 MB in memory, and after gc() there are 287 MB, which is around the size of the data on disk (also checked with the Process Explorer application for Windows).
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C LC_TIME=Danish_Denmark.1252
attached base packages:
[1] tools stats graphics grDevices utils datasets methods base
other attached packages:
[1] ffbase_0.7-1 ff_2.2-10 bit_1.1-9
[END SECOND EDIT ]
In ff, when you have factor columns, the factor levels are always in RAM. ff character columns currently don't exist and character columns are converted to factors in an ffdf.
Regarding your example: your 'w' column in 'dff' contains more than 6 million levels. These levels are all in RAM. If you did not have columns with a lot of levels, you wouldn't see the RAM increase, as shown below using your example.
N = 1e7
df <- data.frame(
  x = c(1:N),
  y = sample(letters, N, replace = T),
  z = sample(as.Date(sample(c(1:2000), N, replace = T), origin = "1970-01-01")),
  w = sample(c(1:N/10), N, replace = T))
dff <- as.ffdf(df)
save.ffdf(dff, dir = "dframeffdf")
### resetting R :: fresh RStudio Session
library(ff)
library(ffbase)
memory.size() # 14.67
load.ffdf(dir = "dframeffdf")
memory.size() # 14.78
The ff package(s) have mechanisms for segregating objects into 'physical' and 'virtual' storage. I suspect you are implicitly constructing items in physical memory, but since you offer no code showing how this workspace was created, there's only so much guessing that is possible.
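For what it's worth, a minimal way to inspect that split on the dff object from the example above (a sketch; physical() and virtual() come from the ff package):
library(ff)
library(ffbase)

# physical(): the on-disk ff atomic vectors backing each column
# virtual():  the R-side (in-RAM) attributes of the object
physical(dff)
virtual(dff)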
