Error in fread {data.table} with "|" delimited file containing apostrophe

I am attempting to read a "|"-delimited file using fread() from the data.table package and am receiving an error. The error is " ' ends field 14 on line 104970 ".
I have seen older questions asking whether fread() has a quote-handling feature yet, but have not found anything recent. I also noticed that the "sep2" feature is forthcoming; is this the feature that will solve this longstanding problem?
I can read the same data fine with read.table():
df.readtable<-read.table(myfile,header=F,sep="|",quote="\"",fill=T,stringsAsFactors=F)
But I am unable to reproduce the result with fread():
> require(data.table)
> df.fread<-fread(myfile,verbose=T)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.215B
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep='|'
Found 17 columns
First row with 17 fields occurs on line 1 (either column names or first row of data)
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 2000001
Subtracted 1 for last eol and any trailing empty lines, leaving 2000000 data rows
Type codes: 44414114444444424 (first 5 rows)
Type codes: 44444114444444424 (+middle 5 rows)
Type codes: 44444114444444424 (+last 5 rows)
Type codes: 44444114444444424 (after applying colClasses and integer64)
Type codes: 44444114444444424 (after applying drop or select (if supplied)
Allocating 17 column slots (17 - 0 NULL)
Error in fread(myfile, verbose = T) :
' ends field 14 on line 104970 when reading data: LW61026|CITY|STATE|000111|L|00|1800|||N|N|N|CHANEL|"CHARLIE" BOARD|2011 CITY|19911114000000|
Session and package info:
> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.2
loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 tools_3.0.3
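For comparison, here is a minimal base R sketch of the read.table() route on a single pipe-delimited line modelled on the failing row. It is an illustration under assumptions, not a fix for fread(): the sample line is shortened and made up, and quote="" (every quote character treated as ordinary text) is a choice made for this sketch, not the quote="\"" setting used in the question. Later data.table releases also gained a quote argument for fread(), which I believe can be set the same way, but check ?fread for the installed version.

```r
# One shortened, made-up line in the style of the failing row.
sample_line <- 'LW61026|CITY|000111|CHANEL|"CHARLIE" BOARD|2011 CITY'

df <- read.table(
  text = sample_line,
  sep = "|",
  quote = "",               # treat " and ' as ordinary characters
  header = FALSE,
  fill = TRUE,
  colClasses = "character", # keep leading zeros such as 000111
  stringsAsFactors = FALSE
)

print(df)
```

With quoting disabled, the field is split only on "|", so the embedded double quotes stay inside the fifth column.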

`seq` takes a very long time with `by=1`

I noticed something strange today: in some cases, adding by=1 to the seq function introduces a large inefficiency.
> system.time(seq(from=936144000, to=1135987200))
user system elapsed
0 0 0
> system.time(seq(from=936144000, to=1135987200, by=1))
user system elapsed
4.42 8.39 18.20
At first glance the results are equivalent:
> all.equal(seq(from=936144000, to=1135987200),
+ seq(from=936144000, to=1135987200, by=1))
[1] TRUE
The difference seems to be that supplying by causes the result to be numeric, even if by is explicitly an integer, whereas omitting it gives an integer result.
> identical(seq(from=936144000, to=1135987200),
+ seq(from=936144000, to=1135987200, by=1))
[1] FALSE
> class(seq(from=936144000, to=1135987200))
[1] "integer"
> class(seq(from=936144000, to=1135987200, by=1L))
[1] "numeric"
Also, calling seq.int directly (assuming that is what happens behind the scenes in seq) takes much longer than seq without the by argument:
> system.time(seq.int(from=936144000, to=1135987200, by=1L))
user system elapsed
0.25 1.68 2.81
How do I properly specify by to avoid the inefficiency or to get the efficiency of omitting by?
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.1.0 RUnit_0.4.32 tools_4.1.0 geneorama_1.7.3 data.table_1.14.0
You don't need to assume what happens behind the scenes; you can run debug(seq) and see what the difference is. It's a generic function, and it calls seq.default.
In seq.default it turns out that if the by argument is missing (and some other conditions which hold in your example), seq(from, to) does from:to. This is extremely fast, because it doesn't even allocate the full vector: in recent versions of R, it is stored in a special format with just the limits of the range.
The other thing you can see if you look at seq.default is that the only way to get this output is to have missing(by) be TRUE. So the answer to your question is that you can't specify by to get the same speed.
@Baraliuh's advice is good: if you want seq(from, to, by=1), use from:to instead.
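The difference between the two code paths can be seen on a small range. This is a minimal sketch with made-up endpoints (1 and 10) so that it runs instantly; as in the question, from and to are written as doubles.

```r
# 'by' missing: seq.default() evaluates from:to
a <- seq(from = 1, to = 10)

# 'by' supplied: seq.default() takes the general code path
b <- seq(from = 1, to = 10, by = 1)

class(a)                 # "integer"
class(b)                 # "numeric"
identical(a, 1:10)       # TRUE
isTRUE(all.equal(a, b))  # TRUE: equal values, different storage types
```

So when the step is 1, writing from:to (or seq(from, to) without by) is the way to get the fast path.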

initializing parallel chains in rjags

I'm doing some ad hoc parallelization in JAGS through rjags.
I've been using the function parallel.seeds to obtain RNG states to initialize the RNGs (example below). However, I don't understand why multiple integers are returned for each RNG. The documentation says that when you initialize, .RNG.state is supposed to be a numeric vector of length one.
Furthermore, sometimes when I try to do this R crashes with no error generated. When I give up and just let it generate the seed for the chain on its own, the model runs fine. Does this mean I am using the wrong .RNG.state? Any insight would be appreciated, as I am planning to scale up this model in the future.
> parallel.seeds("base::BaseRNG", 3)
[[1]]
[[1]]$.RNG.name
[1] "base::Wichmann-Hill"
[[1]]$.RNG.state
[1] 3891 16261 19841
[[2]]
[[2]]$.RNG.name
[1] "base::Marsaglia-Multicarry"
[[2]]$.RNG.state
[1] 408065014 1176110892
[[3]]
[[3]]$.RNG.name
[1] "base::Super-Duper"
[[3]]$.RNG.state
[1] -848274653 175424331
There is a difference between .RNG.seed (which is a vector of length one, and the thing you can specify to jags.model to e.g. ensure MCMC samples are repeatable) and .RNG.state (which is a vector of length depending on the pRNG algorithm). It is possible that these got mixed up in the docs somewhere - can you tell me where you read this so I can make sure it is fixed for JAGS/rjags 4?
Regarding the crashing - some more details would be needed to help you with that I'm afraid. I assume that it is the JAGS model that crashes, and not your R session that terminates, and after the model has been running for a while? A reproducible example would help a lot.
By the way - when you say 'scale up' - if you are planning to make use of > 4 chains I would strongly recommend you load the lecuyer module (see ?parallel.seeds examples at the bottom).
Matt
The documentation is a bit confusing; under ?jags.model we see that .RNG.seed should be a vector of length 1, but parallel.seeds() returns .RNG.state, whose length is usually greater than 1. The state space for the Mersenne Twister algorithm has 624 integers, and that is the length of the vector when you do
parallel.seeds("base::BaseRNG",4)
to make sure you see all 4 types of RNG. Similarly the state space of the Wichmann-Hill generator has 3 integers, and I'm sure similar research would reveal the state spaces for the other two are longer than 1.
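As a rough cross-check that does not need JAGS, base R ships generators with the same names, and the length of their stored state can be inspected through .Random.seed. This is only an analogy: it looks at R's own RNGs, not at the JAGS "base" module, and state_length() is a hypothetical helper written for this sketch.

```r
# Hypothetical helper: number of integers R stores for a given generator,
# excluding the first element of .Random.seed (which encodes the RNG kind).
state_length <- function(kind) {
  old_kind <- RNGkind()[1]
  on.exit(RNGkind(old_kind))      # restore the previous generator
  set.seed(1, kind = kind)
  seed <- get(".Random.seed", envir = globalenv())
  length(seed) - 1L
}

kinds <- c("Wichmann-Hill", "Marsaglia-Multicarry",
           "Super-Duper", "Mersenne-Twister")
lengths_by_kind <- sapply(kinds, state_length)
print(lengths_by_kind)
```

For Mersenne-Twister the count reported by R is 625, because R stores a position index alongside the 624 state integers.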
For my own edification I mocked up an example using the LINE data in rjags:
data(LINE)
LINE$model() ## edit and save to line.r
data = LINE$data()
line = jags.model("line.r",data=data)
line.samples <- jags.samples(LINE, c("alpha","beta","sigma"),n.iter=1000)
line.samples
inits = parallel.seeds("base::BaseRNG", 3) # a list of lists
inits[[1]]$tau = 1
inits[[1]]$alpha = 3
inits[[1]]$beta = 1
inits[[2]]$tau = .1
inits[[2]]$alpha = .3
inits[[2]]$beta = .1
inits[[3]]$tau = 10
inits[[3]]$alpha = 10
inits[[3]]$beta = 5
line = jags.model("line.r",data=data,inits=inits,n.chains=3)
line.samples <- jags.samples(line, c("alpha","beta","sigma"),n.iter=1000)
line2 = jags.model("line.r",data=data,inits=inits,n.chains=3)
line.samples2 <- jags.samples(line2, c("alpha","beta","sigma"),n.iter=1000)
all(abs(line.samples$alpha-line.samples2$alpha) < 0.00000001) ## TRUE
So the results are entirely repeatable, which is cool.
To understand the conditions under which R is crashing, I'd need to know the results of sessionInfo() on your computer, plus more details of the circumstances (e.g. what JAGS model are you running?). I just did:
for (i in 1:100){parallel.seeds("base::BaseRNG",4)}
and my computer didn't crash. For reference:
sessionInfo()
# R version 3.1.3 (2015-03-09)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
#
# locale:
# [1] LC_COLLATE=English_United States.1252
# [2] LC_CTYPE=English_United States.1252
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United States.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets
# [6] methods base
#
# other attached packages:
# [1] rjags_3-14 coda_0.17-1 mlogit_0.2-4
# [4] maxLik_1.2-4 miscTools_0.6-16 Formula_1.2-1
#
# loaded via a namespace (and not attached):
# [1] grid_3.1.3 lattice_0.20-30 lmtest_0.9-33
# [4] MASS_7.3-39 sandwich_2.3-3 statmod_1.4.21
# [7] tools_3.1.3 zoo_1.7-12
That shows the version of R and rjags that I'm using.

Extend HITs until agreement MTurkR

I am using MTurkR to post HITs to mTurk and I am having trouble extending the HITs until there is consensus among workers (or until a total of 5 HITs have been posted).
When two different responses are entered, the HIT is not extended. My code, which follows the MTurkR documentation (pg. 45), is below:
# ##############################################
# SET PARAMETERS FOR HITS
# ##############################################
layout="XXXXXXXXXXX"
#format for sandbox question. Get this from MTURK site
annotation.v="Question1"
assignments.v="2"
title.v="TITLE"
description.v="DESCRIPTION."
reward.v=".00"
duration.v=seconds(hour=1)
expiration.v=seconds(days=4)
keywords.v="survey"
auto.approval.delay.v=seconds(days=1)
# ##############################################
# EXTEND HIT UNTIL AGREEMENT
# ##############################################
TurkAgreement=list(QuestionIds=c("Question1"),
QuestionAgreementThreshold=49, #at least 50% agree
ExtendIfHITAgreementScoreIsLessThan=50,
ExtendMinimumTimeInSeconds=3600,
ExtendMaximumAssignments=5,
DisregardAssignmentIfRejected=TRUE)
policya=do.call(GenerateHITReviewPolicy,TurkAgreement)
# ##############################################
# CREATE HITS
# ##############################################
hits=NULL
for(i in 1:length(DF)){
hits.i=CreateHIT(
hitlayoutid=layout,
hitlayoutparameters=GenerateHITLayoutParameter(c("XX","XX","XX"), c(DF[i,1],DF[i,2],DF[i,3])),
annotation=annotation.v[i],
assignments=assignments.v,
title=title.v,
description=description.v,
reward=reward.v,
duration=duration.v,
expiration=expiration.v,
keywords=keywords.v,
auto.approval.delay=auto.approval.delay.v,
qual.req=qualReqs,
hit.review.policy=policya,
sandbox=sandbox.v)
hits=rbind(hits,hits.i)}
The code generates 2 HITs (as specified by assignments.v) but the HIT doesn't extend.
My session info is below:
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MTurkR_0.6
loaded via a namespace (and not attached):
[1] digest_0.6.4 RCurl_1.95-4.3 tcltk_3.1.1 XML_3.98-1.1
First a couple of key points about the code you posted:
Because you are adding the hit.review.policy you don't need the assignments argument. Two HITs will immediately populate with just this statement.
Before you define the terms for the hit.review.policy, take a look at Amazon's documentation for this. Pay particular attention to the values for QuestionAgreementThreshold. Because values greater than this threshold are considered agreed answers, your original code indicated that values greater than 49% were agreed answers. Also examine ExtendIfHITAgreementScoreIsLessThan: this value should probably be 1 more than the value of QuestionAgreementThreshold in order to capture all possible agreement values.
Finally, be sure the answer values for HIT responses correspond to the QuestionIds object. When I view the answers to my HITs, the answers are referred to as "question", so I've replaced this value here.
With all of that said, here is the revised code that now properly extends the HITs:
# ##############################################
# SET PARAMETERS FOR HITS
# ##############################################
layout="XXXXXXXXXXX"
#format for sandbox question. Get this from MTURK site
annotation.v="question" #NOTE CHANGE HERE
#assignments.v="2" #NOTE CHANGE HERE (COMMENTED THIS OUT)
title.v="TITLE"
description.v="DESCRIPTION."
reward.v=".00"
duration.v=seconds(hour=1)
expiration.v=seconds(days=4)
keywords.v="survey"
auto.approval.delay.v=seconds(days=1)
# ##############################################
# EXTEND HIT UNTIL AGREEMENT
# ##############################################
TurkAgreement=list(QuestionIds=c("question"), #NOTE CHANGE HERE
QuestionAgreementThreshold=50, #at least 50% agree #NOTE CHANGE HERE
ExtendIfHITAgreementScoreIsLessThan=51, #NOTE CHANGE HERE
ExtendMinimumTimeInSeconds=3600,
ExtendMaximumAssignments=5,
DisregardAssignmentIfRejected=TRUE)
policya=do.call(GenerateHITReviewPolicy,TurkAgreement)
# ##############################################
# CREATE HITS
# ##############################################
hits=NULL
for(i in 1:length(DF)){
hits.i=CreateHIT(
hitlayoutid=layout,
hitlayoutparameters=GenerateHITLayoutParameter(c("XX","XX","XX"), c(DF[i,1],DF[i,2],DF[i,3])),
annotation=annotation.v[i],
#assignments=assignments.v, #NOTE CHANGE HERE (assignments.v IS COMMENTED OUT ABOVE)
title=title.v,
description=description.v,
reward=reward.v,
duration=duration.v,
expiration=expiration.v,
keywords=keywords.v,
auto.approval.delay=auto.approval.delay.v,
qual.req=qualReqs,
hit.review.policy=policya,
sandbox=sandbox.v)
hits=rbind(hits,hits.i)}
Also note that the documentation on the MTurkR Package pg. 44 (end) is confusing. The document provides the following example:
lista<-list(QuestionIds = c("Question1","Question2","Question5"),
QuestionAgreementThreshold = 49, # at least 50 percent agreement
ExtendIfHITAgreementScoreIsLessThan = 50,
...
But the argument QuestionAgreementThreshold actually specifies at least 49% agreement (i.e., the HIT will not extend if 2 Turks split the answers). Unless this is the intention, it might be better to use the following code:
lista<-list(QuestionIds = c("Question1","Question2","Question5"),
QuestionAgreementThreshold = 50, #NOTE CHANGE HERE # at least 50 percent agreement
ExtendIfHITAgreementScoreIsLessThan = 51, #NOTE CHANGE HERE
...
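The arithmetic behind the threshold can be sketched in a few lines. This is a simplified model under assumptions, not MTurkR or Mechanical Turk code: it takes the agreement score to be the percentage of assignments giving the most common answer, and treats an answer as agreed only when that score is strictly greater than the threshold, as described above. The function names agreement_pct() and is_agreed() are hypothetical.

```r
# Percentage of assignments that gave the most common answer (assumed definition).
agreement_pct <- function(answers) {
  100 * max(table(answers)) / length(answers)
}

# "Agreed" means strictly greater than the threshold, per the description above.
is_agreed <- function(answers, threshold) {
  agreement_pct(answers) > threshold
}

split_answers <- c("yes", "no")   # two workers who disagree: 50% each

is_agreed(split_answers, 49)  # TRUE:  50 > 49, so a split counts as agreement
is_agreed(split_answers, 50)  # FALSE: 50 > 50 fails, so the HIT can be extended
```

Under this model a threshold of 49 lets a 50/50 split pass as agreement, which is why raising it to 50 changes the behavior.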

Format to find day of week not working in Windows

I'm new to R. I have a Windows PC at work and Ubuntu Linux at home. I'm trying to figure out why my code doesn't work in R on Windows.
I'm trying to find the (numeric) day of the week for a given date.
Using format, format(Sys.time(), "%u")
It works on Linux but not on Windows. Am I missing something? I have added simple code and session info from both PCs.
Output from my Windows 7 PC with R 3.0.1
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
format(Sys.time(), "%u")
[1] ""
Sys.time()
[1] "2013-09-22 10:34:00 CDT"
Output from my Linux PC with R 3.0.1
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached)
[1] tools_3.0.1
format(Sys.time(),"%u")
[1] "7"
Sys.time()
[1] "2013-09-22 10:34:15 CDT"
This is R (on Windows)'s documented behavior.
For details, see ?strptime (I know, probably not the first place you might think to look ;-) which documents the date-time conversion specifications available in R. Under Details, an initial list of specifications found on all OSes is followed by a section that reads:
Also defined in the current standards but less widely implemented
(e.g. not for output on Windows) are
‘%C’ Century (00-99): the integer part of the year divided by 100.
[ . . . many snipped lines . . .]
‘%u’ Weekday as a decimal number (1-7, Monday is 1).
[ . . . more snipped lines . . .]
The closest substitutes in format are:
%w Weekday as decimal number (0–6, Sunday is 0).
%a Abbreviated weekday name in the current locale. (Also matches full
name on input.)
%A Full weekday name in the current locale. (Also matches abbreviated
name on input.)
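Since %w differs from %u only in how Sunday is numbered, a portable stand-in can be built on top of it. This is a minimal sketch; iso_weekday() is a hypothetical helper name, and the dates are examples taken from this thread.

```r
# Hypothetical helper: weekday as 1-7 with Monday = 1, built on the portable %w.
iso_weekday <- function(x) {
  wd <- as.integer(format(as.Date(x), "%w"))  # 0-6, Sunday is 0
  ifelse(wd == 0L, 7L, wd)                    # renumber Sunday from 0 to 7
}

iso_weekday("2013-09-22")  # 7 (a Sunday)
iso_weekday("2013-09-24")  # 2 (a Tuesday)
```

The helper is vectorized, so it can be applied to a whole column of dates at once.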
You may check the function ISOweekday() in package ISOweek:
"This function returns the weekday of a given date according to ISO 8601. It is an substitute for the "%u" format which is not implemented on Windows."
ISOweekday("2013-09-24 01:42:23")
# [1] 2

R - raster function NAs values lower than -9999 in ASCII file

I have been having problems importing an ASCII raster that has values that go from Min. :-69826220 to Max. :167780500. The problem I am encountering is that when I use the raster function to import the ASCII file, every value smaller than -9999 is reported as NA and the minimum value is -9458.
Is this a bug of the function and is there a workaround? When I import the same ASCII file as a data frame everything is fine and I get the whole range of values.
Also I am using the same procedure to import other ASCII rasters and don't have any problem.
Here is the link to the ASCII file: https://dl.dropboxusercontent.com/u/24234831/ps0011yme.asc
Here is the session info, I opened a new session just in case.
sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] raster_2.1-16 sp_1.0-8
loaded via a namespace (and not attached):
[1] grid_3.0.0 lattice_0.20-15
Any help is appreciated
You can try to use setMinMax() on your raster file to try and work out the min and max values and store them in the returned Raster* object. Try it like so:
r <- setMinMax( raster("path/to/myraster.asc") )
I am not sure what is happening, because if I download your data and do:
r1 <- raster( "~/Downloads/test.asc")
summary(values(r1))
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-69830000 -4789000 737300 16950000 13880000 167800000 71468
Please add the output of sessionInfo() to your question, i.e. not as a comment.
The errors in this case were caused by not having rgdal installed. rgdal provides bindings to the Geospatial Data Abstraction Library and is very important for importing/exporting raster and shapefile data.
I'm unable to reproduce your error. Here's a hand-built .asc file:
NCOLS 3
NROWS 3
XLLCORNER 0
YLLCORNER 0
CELLSIZE 0.5
NODATA_value -9999
1e-6 0.3 -34567891234
0.2 -1e6 25
3 68492758321934 20
That loaded correctly into a raster object. You'll notice the NODATA_value item there, which explains where your -9999 comes from. My bet is that there's something corrupted in your source .asc file. Can you post the header and a small sample of the data?
The internal ascii file driver in 'raster' assumes that there are no valid values lower than the NA flag value if the flag value is < 0 (and I would not recommend using an NA flag in the middle of the values). Clearly, this approach can cause problems like in this case; and I will change that. You can see the difference between the internal driver and the gdal driver if you do
library(raster)
library(rgdal)
a1 <- raster(filename, native=TRUE)
a2 <- raster(filename, native=FALSE)
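The effect of the two rules can be mimicked on a plain vector without any raster package. This is only a sketch of the behavior described above, not the actual code of either driver; the values are taken from the question, and the variable names are made up.

```r
nodata <- -9999
v <- c(-69826220, -9458, -9999, 25, 167780500)

# Rule described for the internal driver when the flag is negative:
# everything at or below the flag becomes NA.
below_flag <- ifelse(v <= nodata, NA, v)

# Exact-match rule: only the flag value itself becomes NA.
exact_flag <- ifelse(v == nodata, NA, v)

min(below_flag, na.rm = TRUE)  # -9458
min(exact_flag, na.rm = TRUE)  # -69826220
```

The first rule reproduces the symptom in the question, where the reported minimum was -9458 and the large negative values disappeared.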
