I have a large (8GB+) csv file (comma-separated) that I want to read into R. The file contains three columns
date #in 2017-12-27 format
text #a string
type #a label per string (either NA, typeA, or typeB)
The problem I encounter is that the text column contains various string indicators: ' (single quot. marks), " (double quot. marks), no quot. marks, as well as multiple separated strings.
E.g.
date text type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
To read these large data, I tried:
Base read.csv and readr's read_csv function: work fine for a portion but fail (probably due to memory) or take ages to read
chunking the data via Mac terminal into batches of 1m lines: fails because lines seem to break arbitrarily
Using fread (preferred as I hope this will solve the two other issues): fails with Error: Expecting 3 cols, but line 1103 contains text after processing all cols.
My idea is to work around these issues by using specifics of the data that I know, i.e. that each line starts with a date and ends with either NA, typeA, or typeB.
How could I implement this (either using pure readLines or into fread)?
Edit:
Sample data (anonymized) as opened with Mac TextWrangler:
"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success #myid",typeA
Sample data 2:
"date","text","type"
"2018-05-02","i try this, but it doesnt work",NA
"2018-05-02","Thank you very much. Cheers !!",NA
"2018-05-02","#myid. I'll change this.",NA
Sample data for reproducible fread error "Expecting 3 cols, but line 3 contains text after processing all cols.":
"date","text","type"
"2015-03-02","Some text, some text, some question? Please, some question?",NA
"2015-03-02","Here you have the error ""Can’t access {file \""Macintosh HD:abc:def:filename\"", \""/abc.txt\""} from directory."" something -1100 from {file ""Macintosh HD:abc:def:filename"", ""/abc.txt""} to file",NA
"2015-03-02","good idea",NA
"2015-03-02","Worked perfectly :)",NA
SessionInfo:
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4-3 readr_1.1.1
loaded via a namespace (and not attached):
[1] compiler_3.5.0 assertthat_0.2.0 R6_2.2.2 cli_1.0.0
[5] hms_0.4.2 tools_3.5.0 pillar_1.2.2 rstudioapi_0.7
[9] tibble_1.4.2 yaml_2.1.19 crayon_1.3.4 Rcpp_0.12.16
[13] utf8_1.1.3 pkgconfig_2.0.1 rlang_0.2.0
readLines approach could be
infile <- file("test.txt", "r")
txt <- readLines(infile, n = 1)
df <- NULL
#change this value as per your requirement
chunksize <- 1
while(length(txt)){
txt <- readLines(infile, warn=F, n = chunksize)
df <- rbind(df, data.frame(date = gsub("\\s.*", "", txt),
text = trimws(gsub("\\S+(.*)\\s+\\S+$", "\\1", txt)),
type = gsub(".*\\s", "", txt),
stringsAsFactors = F))
}
which gives
> df
date text type
1 2016-01-01 great job! NA
2 2016-01-02 please, type "submit" typeA
3 2016-01-02 "can't see the "error" now" typeA
4 2016-01-03 "add \\\\"/filename.txt\\\\"" NA
Sample data: test.txt contains
date text type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
Update:
You can modify above code with below regex parser to parse another set of sample data
df <- rbind(df, data.frame(date = gsub("\"(\\S{10}).*", "\\1", txt),
text = gsub(".*\"\\,\"(.*)\"\\,(\"|NA).*", "\\1", txt),
type = gsub(".*\\,|\"", "", txt),
stringsAsFactors = F))
Another set of sample data:
"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success #myid","typeA"
Related
I'm creating a Leaflet widget in R using the following code
m <- leaflet(map_data_wgs84) %>% addTiles() %>% addCircles(popup = (paste(sep="<br/>", as.character(map_data_wgs84$MEMBER_REF), map_data_wgs84$Name)))
saveWidget(m, file="c://software//members.html")
I want a popup that palaces an ID number and name separated by a line break. However when I run the saveWidget command I get the following error
Error in gsub("</", "\\u003c/", payload, fixed = TRUE) :
input string 1 is invalid UTF-8
which is because of the <br/> separator.
What am I doing wrong here?
thanks
UPDATE:
it would appear it is not the <br/> separator but rather character(s) in the map_data_wgs84$Name column. These 12000 records are pulled from a contact database before mapping.
I suspect I need some way to make the characters clean for use in Leaflet with something like htmlEscaoe however I cant figure out how to use this within paste. This doesnt work because htmlEscape is parsed as a string:
addCircles(popup = paste(as.character(map_data_wgs84$MEMBER_REF), ~htmlEscape(map_data_wgs84$Name), sep=","))
For for a MEMBER_REF of 56202 the popup becomes:
56202,htmlEscape(map_data_wgs84$Name)
Overview
To resolve the UTF-8 error, I followed three steps:
After reading How to identify/delete non-UTF-8 characters in R, I used base::Encoding() to manually encode the values with the Name column to UTF-8. I then used base::iconv() to replace all non UTF-8 characters with an empty space;
I manually placed the line break element - <br> rather than <br/> - inside of the sep argument within the paste() function when creating the popup within the markers of the leaflet object; and
To be safe, I used htmltools::htmlEscape() inside of the text vectors used inside of paste().
All together, I was able to export that object as an HMTL file. All packages versions are copied down below in the Session Info section.
Reproducible Example
# load necessary packages
library( htmltools )
library( htmlwidgets )
library( leaflet )
# create data
map_data_wgs84 <-
data.frame( MEMBER_REF = "Popup"
, Name = "Th\x86e birthplace of R."
, Long = 174.768
, Lat = -36.852
, stringsAsFactors = FALSE )
# pre-processing
# ensure that all characters in the `Name` column
# are valid UTF-8 encoded
# Thank you to SO for this gem
# https://stackoverflow.com/questions/17291287/how-to-identify-delete-non-utf-8-characters-in-r
Encoding( x = map_data_wgs84$Name ) <- "UTF-8"
# replace all non UTF-8 character strings with an empty space
map_data_wgs84$Name <-
iconv( x = map_data_wgs84$Name
, from = "UTF-8"
, to = "UTF-8"
, sub = "" )
# check work
map_data_wgs84$Name # [1] "The birthplace of R."
# create leaflet object
my.map <-
leaflet( data = map_data_wgs84 ) %>%
addTiles() %>%
addCircles( lng = ~Long
, lat = ~Lat
, popup = paste( as.character( map_data_wgs84$MEMBER_REF )
, htmlEscape( map_data_wgs84$Name )
, sep = "<br>" )
, radius = 50 )
# export leaflet object as HMTL file
saveWidget( widget = my.map
, file = "mywidget.html" )
# end of script #
Session Info
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] leaflet_2.0.1 htmlwidgets_1.2 htmltools_0.3.6
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 digest_0.6.15 later_0.7.3 mime_0.5
[5] R6_2.2.2 xtable_1.8-2 jsonlite_1.5 magrittr_1.5
[9] promises_1.0.1 tools_3.5.1 crosstalk_1.0.0 shiny_1.1.0
[13] httpuv_1.4.5 yaml_2.1.19 compiler_3.5.1
I saw this solution in another post (sorry, lost the link!) which use sprintf to combine text and variables and then using htmltools within lapply to make the content HTML-happy! In this particular I created the following list:
labels <- sprintf(
"<strong>%s</strong><br/>%o<br/>%s",
map_data$Name, map_data$Events, map_data$member_ref
) %>% lapply(htmltools::HTML)
and then called this from leaflet:
my_map <- leaflet(map_data_wgs84) %>% addTiles() %>% addCircles(popup = labels)
Hope this helps someone else!
I have an annoying csv > 10gb which opens on Mac OSX but not on Windows 10.
The code I use
data_in <- fread("my_data.csv")
SessionInfo Windows
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4-3 forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4 purrr_0.2.4 readr_1.1.1 tidyr_0.8.0 tibble_1.4.2
[9] ggplot2_2.2.1 tidyverse_1.2.1 RMySQL_0.10.14 DBI_0.8
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 cellranger_1.1.0 pillar_1.2.1 compiler_3.4.4 plyr_1.8.4 bindr_0.1.1 tools_3.4.4 lubridate_1.7.2 jsonlite_1.5
[10] nlme_3.1-131.1 gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.2.0 psych_1.8.3.3 cli_1.0.0 rstudioapi_0.7 yaml_2.1.18
[19] parallel_3.4.4 haven_1.1.1 bindrcpp_0.2.2 xml2_1.2.0 httr_1.3.1 hms_0.4.2 grid_3.4.4 glue_1.2.0 R6_2.2.2
[28] readxl_1.0.0 foreign_0.8-69 modelr_0.1.1 reshape2_1.4.3 magrittr_1.5 scales_0.5.0 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
[37] colorspace_1.3-2 stringi_1.1.7 lazyeval_0.2.1 munsell_0.4.3 broom_0.4.4 crayon_1.3.4
SessionInfo OSX
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.11.2
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0 yaml_2.1.19
The error I get on Windows, I have tried all suggested solutions found without any luck.
Expecting 10 cols, but line 1346596 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
In addition: Warning message:
Additional info when using verbose = TRUE (Windows) (Tried a smaller file, same problem)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.004474 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 10 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: ,asin,sale
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 3657 (including 0 at the end)
Count of sep: 138915
nrow = MIN( nsep [138915] / (ncol [10] -1), neol [3657] - endblanks [0] ) = 3657
Type codes (point 0): 1444444340
Type codes (point 1): 1444444340
Type codes (point 2): 1444444340
Type codes (point 3): 1444444340
Type codes (point 4): 1444444344
Type codes (point 5): 1444444344
Type codes (point 6): 1444444344
Type codes (point 7): 1444444344
Type codes (point 8): 1444444344
Type codes (point 9): 1444444344
Type codes (point 10): 1444444444
Type codes: 1444444444 (after applying colClasses and integer64)
Type codes: 1444444444 (after applying drop or select (if supplied)
Allocating 10 column slots (10 - 0 dropped)
Error in fread("md2.csv", verbose = T) :
Expecting 10 cols, but line 3312 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
verbose = T OSX
nput contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 4 threads (omp_get_max_threads()=4, nth=4)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file md2.csv
File opened, size = 4.581MB (4803885 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
File ends abruptly with ','. Final end-of-line is missing. Using cow page to write 0 to the last byte.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<,asin,salesRank,imUrl,categori>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 10 fields using quote rule 0
Detected 10 columns on line 1. This line is either column names or first data row. Line starts as: <<,asin,salesRank,imUrl,categori>>
Quote rule picked = 0
fill=false and the most number of columns found is 10
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 10 because (4803885 bytes from row 1 to eof) / (2 * 127664 jump0size) == 18
Type codes (jump 000) : 5AAAAAA7A2 Quote rule 0
Type codes (jump 004) : 5AAAAAA7AA Quote rule 0
Type codes (jump 010) : 5AAAAAA7AA Quote rule 0
'header' determined to be true due to column 8 containing a string on row 1 and a lower type (float64) in the rest of the 1041 sample rows
=====
Sampled 1041 rows (handled \n inside quoted fields) at 11 jump points
Bytes from first data row on line 2 to the end of last row: 4803813
Line length: mean=2028.07 sd=3025.66 min=28 max=29901
Estimated number of rows: 4803813 / 2028.07 = 2369
Initial alloc = 4738 rows (2369 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5AAAAAA7AA
[10] Allocate memory for the datatable
Allocating 10 column slots (10 - 0 dropped) with 4738 rows
[11] Read the data
jumps=[0..2), chunk_size=2401906, total_size=4803813
Read 3311 rows x 10 columns from 4.581MB (4803885 bytes) file in 00:00.025 wall clock time
[12] Finalizing the datatable
Type counts:
1 : int32 '5'
1 : float64 '7'
8 : string 'A'
=============================
0.001s ( 2%) Memory map 0.004GB file
0.005s ( 19%) sep=',' ncol=10 and header detection
0.000s ( 0%) Column type detection using 1041 sample rows
0.000s ( 0%) Allocation of 4738 rows x 10 cols (0.000GB) of which 3311 ( 70%) rows used
0.019s ( 78%) Reading 2 chunks (0 swept) of 2.291MB (each chunk 1655 rows) using 2 threads
+ 0.004s ( 15%) Parse to row-major thread buffers (grown 0 times)
+ 0.012s ( 48%) Transpose
+ 0.004s ( 15%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.025s Total
Works fine in newest version of data.table
I tried to import a text file in R (3.4.0) which actually contains 4 columns but the 4th column is mostly empty until 200,000+th row. I use the fread() in package data.table (ver 1.10.4)
fread("test.txt",fill = TRUE, sep = "\t", quote = "", header = FALSE)
I got this error message:
Error in fread("test.txt", fill = TRUE, sep = "\t", quote = "", header = FALSE) :
Expecting 3 cols, but line 258088 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
I checked the file and there's additional text in 258088th row in the 4th column ("8-4").
Nevertheless, fill = TRUE did not solve this as I expected. I thought it might be fread() determining column numbers inappropriately because the additional column occurs very late in the file. So I tried this:
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 250000)
The error persisted. On the other hand,
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 258080)
This gives no error.
I thought I found the reason, but the weird thing happened when I tested with a dummy file generated by:
write.table(matrix(c(1:990000), nrow = 330000), "test2.txt", sep = "\t", row.names = FALSE)
with the addition of a "8-4" in the 4th column of the 250000th row by Excel. When read by fread():
fread("test2.txt", fill = TRUE, header = FALSE, sep = "\t")
It worked fine with no error message, and this should indicate some late additional column not necessarily trigger error.
I also tried changing encoding ("Latin-1" and "UTF-8") or quote, but neither helped.
Now I feel clueless, and hopefully I did my homework enough with a reproducible information. Thank you for helping.
For additional environmental info, my sessionInfo() is:
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.5.0 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3
[5] tibble_1.3.3 ggplot2_2.2.1 tidyverse_1.1.1 stringr_1.2.0
[9] microbenchmark_1.4-2.1 data.table_1.10.4
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 cellranger_1.1.0 compiler_3.4.0 plyr_1.8.4 forcats_0.2.0
[6] tools_3.4.0 jsonlite_1.5 lubridate_1.6.0 nlme_3.1-131 gtable_0.2.0
[11] lattice_0.20-35 rlang_0.1.1 psych_1.7.5 DBI_0.6-1 parallel_3.4.0
[16] haven_1.0.0 xml2_1.1.1 httr_1.2.1 hms_0.3 grid_3.4.0
[21] R6_2.2.1 readxl_1.0.0 foreign_0.8-68 reshape2_1.4.2 modelr_0.1.0
[26] magrittr_1.5 scales_0.4.1 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
[31] colorspace_1.3-2 stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
Actually there is a difference between the two files that you provide, and I think this is the cause of the different outputs of the fread.
The first file has an end of the line after the 3rd column, except line 258088, where there is a tab a 4th column and then the end of the line. (You can use the option 'show all characters to confirm that').
On the other hand the second file has in all rows an extra tab, i.e. a new empty column.
So in the first case fread expects 3 columns and then finds out a 4th column. On the contrary in the second file, fread expects 4 columns.
I checked read.table with fill=TRUE and it worked with both files. So I think that something is done differently with the fill option of the fread.
I would expect since fill=TRUE, all the lines to be used so as to infer the number of columns (with cost on computational time).
In the comments there are some nice workarounds you can use.
The file has issue: if the table has four columns, at the end of each row with the fourth column missing a \t should have been present.
In this case you may have better luck with a low-level approach: read the file line by line, add a \t to each row which doesn't have the fourth column, split each line with \t and collect all together in a data.frame. Most of the above work is done by the data.table::tstrsplit function. Try something like:
f<-readLines("test.txt")
require(stringr)
require(data.table)
a<-data.frame(tstrsplit(f,"\t",type.convert=TRUE,names=TRUE,keep=1:4),stringsAsFactors=FALSE)
str(a)
#'data.frame': 273070 obs. of 4 variables:
# $ V1: num 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 ...
# $ V2: num -18.7113 -1.2685 0.0768 0.1507 0.1609 ...
# $ V3: num 0 0 0 0 0 0 0 0 0 0 ...
# $ V4: chr NA NA NA NA ...
I was struggling with this as well. I found another solution (for csv and read.table) here How can you read a CSV file in R with different number of columns. This answer you can use the handy function count.fields to count the delimiters of a file by line and then take the max field count to pass the max number of column names to fread. A reproducible example is below.
Generate text with uneven number of fields
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n16520, California, ocean, summer, golden gate, beach, San Francisco\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n"
Write to file
cat(text, file = "foo")
Scan file for delimeters
max.fields<-max(count.fields("foo", sep = ','))
Now use fread to read file, but expect a max number of columns from the col.names argument
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
However, I was basing this data on the example data from ?count.fields and found if the max number of fields is in the last line of the file, fread will still fail with the following error.
Error in fread("foo", header = FALSE, fill = TRUE, sep = ",", col.names = paste("V", :
Expecting 3 cols, but line 9 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
example
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n16520, California, ocean, summer, golden gate, beach, San Francisco\n"
cat(text, file = "foo")
max.fields<-max(count.fields("foo", sep = ','))
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
I'll report this as an issue to the data.table Github. Update: issue logged here https://github.com/Rdatatable/data.table/issues/2691
At work we recently observed an unusual behaviour in our code, and I have nailed it down to the following statement having value FALSE, where it should be TRUE:
! strptime(strftime((strptime(d, "%Y-%m-%d")+3600*24),"%Y-%m-%d"),"%Y-%m-%d")%in%strptime(VsiB[,1],"%Y-%m-%d")
VsiB is a dataframe of some dates and values:
DATEFROM VALUE
1 2016-12-15 59.43073
2 2016-12-16 62.78324
3 2016-12-19 54.06673
4 2016-12-26 50.50698
5 2017-01-02 51.20186
6 2017-01-09 52.52000
7 2016-12-17 52.72573
and d is a date from this dataframe (problem in this case occured at d = 2016-12-17).
This statement is basically checking, if the next date from d is contained in the dataframe VsiB. Well, in this instance, it is cartainly not.
The really curious part is, that for all the other countries (we are looping every the whole code for the total of 24 countries, 5 of which use exactly the same data as the country, where the error ocurred), that statement was TRUE.
It also happened only two times to this day, even though we run it every day twice.
The only explanation we could think of, was that
strptime(strftime((strptime(d, "%Y-%m-%d")+3600*24),"%Y-%m-%d")
did not add a day (note +3600*24) to d. For example,
strptime(strftime((strptime(d, "%Y-%m-%d")+3600*24-epsilon),"%Y-%m-%d")
for an epsilon > 0, would gain d, not d+1. So my question is, if there a possibility that some external conditions, as system time, etc., could distort the calculations enough, for that to happen?
P.S: I know that as.Date(d)+1 would solve it in an elegant manner, but I am nonetheless curious, if it is possible that strptime did not work correctly (plus any change could cause problems down the line).
Sys.timezone()
[1] "Europe/Prague"
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Slovenian_Slovenia.1250 LC_CTYPE=Slovenian_Slovenia.1250
[3] LC_MONETARY=Slovenian_Slovenia.1250 LC_NUMERIC=C
[5] LC_TIME=Slovenian_Slovenia.1250
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] rsconnect_0.4.3 tools_3.1.2
Data:
VsiB <- structure(list(DATEFROM = structure(c(17150, 17151, 17154, 17161,
17168, 17175, 17152), class = "Date"), VALUE = c(59.43073, 62.78324,
54.06673, 50.50698, 51.20186, 52.52, 52.72573)), .Names = c("DATEFROM",
"VALUE"), class = "data.frame", row.names = c(NA, -7L))
I am trying to write a simple function that (should) return true if the parameter date(s) is an Op-Ex Friday.
require(timeDate)
require(quantmod)
getSymbols("^GSPC", adjust=TRUE, from="1960-01-01")
assign("SPX", GSPC, envir=.GlobalEnv)
names(SPX) <- c("SPX.Open", "SPX.High", "SPX.Low", "SPX.Close",
"SPX.Volume", "SPX.Adjusted")
dates <- last(index(SPX), n=10)
from <- as.numeric(format(as.Date(min(dates)), "%Y"))
to <- as.numeric(format(as.Date(max(dates)), "%Y"))
isOpExFriday <- ifelse(
isBizday(
timeDate(as.Date(dates)),
holidayNYSE(from:to)) & (as.Date(dates) == as.Date(
format(timeNthNdayInMonth(timeFirstDayInMonth(dates), nday=5, nth=3)))
), TRUE, FALSE)
Now, the result should be [1] "2011-09-16". But instead I get [1] "2011-09-15":
dates[isOpExFriday]
[1] "2011-09-15"
Am I doing something wrong, expecting something that timeDate package is not doing by design or is there a bug in timeDate?
I am guessing it's a timezone problem. What happens if you use this:
format(dates[isOpExFriday], tz="UTC")
On second look, you probably need to put the 'tz=' argument inside the format call inside the as.Date(format(...)) call. The format function "freezes" that dates value as text.
EDIT: On testing however I think you are right about it being a bug. (And I sent a bug report to the maintainer with this response.) Even after trying to insert various timezone specs and setting myFinCenter in RmetricsOptions, I still get the which stems from this error deep inside your choice of functions:
timeNthNdayInMonth(as.Date("2011-09-01"), nday=5, nth=3)
America/New_York
[1] [2011-09-15]
I suspect it is because of this code since as I understand it Julian dates are not adjusted for timezones or daylight savings times:
ct = 24 * 3600 * (as.integer(julian.POSIXt(lt)) +
(nth - 1) * 7 + (nday - lt1$wday)%%7)
class(ct) = "POSIXct"
The ct value in seconds is then coverted to POSIXct from second since "origin" simply by coercion of class. If I change the code to:
ct=as.POSIXct(ct, origin="1970-01-01") # correct results come back
My quantmod and timeDate versions are both current per CRAN. Running Mac with R 2.13.1 in 64 bit mode with a US locale. I have not yet tried to reproduce with a minimal session so there could still be some collision or hijacking with other packages:
> sessionInfo()
R version 2.13.1 RC (2011-07-03 r56263)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid splines stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] quantmod_0.3-17 TTR_0.20-3 xts_0.8-2
[4] Defaults_1.1-1 timeDate_2130.93 zoo_1.7-4
[7] gplots_2.10.1 KernSmooth_2.23-6 caTools_1.12
[10] bitops_1.0-4.1 gdata_2.8.1 gtools_2.6.2
[13] wordnet_0.1-8 ggplot2_0.8.9 proto_0.3-9.2
[16] reshape_0.8.4 plyr_1.6 rattle_2.6.10
[19] RGtk2_2.20.17 rms_3.3-1 Hmisc_3.8-3
[22] survival_2.36-9 sos_1.3-0 brew_1.0-6
[25] lattice_0.19-30