Incorrect interpretation of #N/A when using `fread` in R
I am using data.table's fread() function to read data that contain missing values. The file was generated in Excel, so the missing-value string is "#N/A". However, even when I pass the na.strings argument, the columns are still read as character. To replicate this, here are the code and data.
Data:
Date,a,b,c,d,e,f,g
1/1/03,#N/A,0.384650146,0.992190069,0.203057232,0.636296656,0.271766148,0.347567706
1/2/03,#N/A,0.461486974,0.500702057,0.234400718,0.072789936,0.060900352,0.876749487
1/3/03,#N/A,0.573541006,0.478062582,0.840918789,0.061495666,0.64301024,0.939575302
1/4/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/5/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/6/03,#N/A,0.66678429,0.897482818,0.569609033,0.524295691,0.132941158,0.194114347
1/7/03,#N/A,0.576835985,0.982816576,0.605408973,0.093177815,0.902145012,0.291035649
1/8/03,#N/A,0.100952961,0.205491093,0.376410642,0.775917986,0.882827749,0.560508499
1/9/03,#N/A,0.350174456,0.290225065,0.428637309,0.022947911,0.7422805,0.354776101
1/10/03,#N/A,0.834345466,0.935128099,0.163158666,0.301310627,0.273928596,0.537167776
1/11/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/12/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/13/03,#N/A,0.325914633,0.68192633,0.320222677,0.249631582,0.605508964,0.739263677
1/14/03,#N/A,0.715104989,0.639040211,0.004186366,0.351412982,0.243570606,0.098312443
1/15/03,#N/A,0.750380716,0.264929325,0.782035411,0.963814327,0.93646428,0.453694758
1/16/03,#N/A,0.282389354,0.762102103,0.515151803,0.194083842,0.102386764,0.569730516
1/17/03,#N/A,0.367802161,0.906878948,0.848538256,0.538705673,0.707436236,0.186222899
1/18/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/19/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/20/03,#N/A,0.79933188,0.214688799,0.37011313,0.189503843,0.294051763,0.503147404
1/21/03,#N/A,0.620066341,0.329949446,0.123685075,0.69027192,0.060178071,0.599825005
(data saved in temp.csv)
Code:
library(data.table)
a <- fread("temp.csv", na.strings="#N/A")
which gives (I have a larger dataset, so ignore the number of observations):
Classes ‘data.table’ and 'data.frame': 144 obs. of 8 variables:
$ Date: chr "1/1/03" "1/2/03" "1/3/03" "1/4/03" ...
$ a : chr NA NA NA NA ...
$ b : chr "0.384650146" "0.461486974" "0.573541006" NA ...
$ c : chr "0.992190069" "0.500702057" "0.478062582" NA ...
$ d : chr "0.203057232" "0.234400718" "0.840918789" NA ...
$ e : chr "0.636296656" "0.072789936" "0.061495666" NA ...
$ f : chr "0.271766148" "0.060900352" "0.64301024" NA ...
$ g : chr "0.347567706" "0.876749487" "0.939575302" NA ...
- attr(*, ".internal.selfref")=<externalptr>
This code works fine:
a <- read.csv("temp.csv", header=TRUE, na.strings="#N/A")
Is this a bug? Is there a smart workaround?
The documentation from ?fread for na.strings reads:
na.strings A character vector of strings to convert to NA_character_. By default for columns read as type character ",," is read as a blank string ("") and ",NA," is read as NA_character_. Typical alternatives might be na.strings=NULL or perhaps na.strings = c("NA","N/A","").
You should convert them to numeric yourself afterwards, I suppose; at least that is what I understand from the documentation.
Something like this?
cbind(a[, 1], a[, lapply(.SD[, -1], as.numeric)])
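A sketch of that conversion done in place with data.table's `:=` and `.SDcols`, which avoids the cbind copy (the small table below just simulates what fread() returns here: "#N/A" mapped to NA but the columns left as character; column names follow the sample data):

```r
library(data.table)

# Simulated result of fread("temp.csv", na.strings = "#N/A"):
# NA values present, but columns still character
a <- data.table(Date = c("1/1/03", "1/2/03"),
                b    = c("0.384650146", NA),
                c    = c("0.992190069", "0.500702057"))

# Convert every column except Date to numeric, in place
num_cols <- setdiff(names(a), "Date")
a[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]

str(a)
```

With real data you would replace the constructed table with the actual fread() call; the `:=` line is the only part that matters.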
Related
How to read file with irregularly nested quotations?
I have a file with irregular quotes like the following:

"INDICATOR,""CTY_CODE"",""MGN_CODE"",""EVENT_NR"",""EVENT_NR_CR"",""START_DATE"",""PEAK_DATE"",""END_DATE"",""MAX_EXT_ON"",""DURATION"",""SEVERITY"",""INTENSITY"",""AVERAGE_AREA"",""WIDEST_AREA_PERC"",""SCORE"",""GRP_ID"""
"Spi-3,""AFG"","""",1,1,""1952-10-01"",""1952-11-01"",""1953-06-01"",""1952-11-01"",9,6.98,0.78,19.75,44.09,5,1"

It seems irregular because the first column is only wrapped in single quotes, whereas every subsequent column is wrapped in double quotes. I'd like to read it so that every column is imported without quotes (neither in the header nor in the data). What I've tried is the following:

# All sorts of tidyverse imports
tib <- readr::read_csv("file.csv")

And I also tried the suggestions offered here:

# Base R import
DF0 <- read.table("file.csv", as.is = TRUE)
DF <- read.csv(text = DF0[[1]])

# Data table import
DT0 <- fread("file.csv", header = F)
DT <- fread(paste(DT0[[1]], collapse = "\n"))

But even when the file imports in the latter two cases, the variable names and some of the elements are wrapped in quotation marks.
I used data.table::fread with the quote = "" option (which means "read as is"). Then I cleaned the names and data by eliminating all the quotes. The dates could be converted too, but I didn't do that.

library(data.table)
library(magrittr)

DT0 <- fread('file.csv', quote = "")
DT0 %>% setnames(names(.), gsub('"', '', names(.)))
string_cols <- which(sapply(DT0, class) == 'character')
DT0[, (string_cols) := lapply(.SD, function(x) gsub('\\"', '', x)),
    .SDcols = string_cols]

str(DT0)
Classes ‘data.table’ and 'data.frame': 1 obs. of 16 variables:
 $ INDICATOR       : chr "Spi-3"
 $ CTY_CODE        : chr "AFG"
 $ MGN_CODE        : chr ""
 $ EVENT_NR        : int 1
 $ EVENT_NR_CR     : int 1
 $ START_DATE      : chr "1952-10-01"
 $ PEAK_DATE       : chr "1952-11-01"
 $ END_DATE        : chr "1953-06-01"
 $ MAX_EXT_ON      : chr "1952-11-01"
 $ DURATION        : int 9
 $ SEVERITY        : num 6.98
 $ INTENSITY       : num 0.78
 $ AVERAGE_AREA    : num 19.8
 $ WIDEST_AREA_PERC: num 44.1
 $ SCORE           : int 5
 $ GRP_ID          : chr "1"
 - attr(*, ".internal.selfref")=<externalptr>
How to specify end of header line with read.table
I have ASCII files with data separated by $ signs. There are 23 columns in the data; the first row holds the column names, but there is an inconsistency in the line endings that causes R to import the data improperly, shifting the data left with respect to their columns.

Header line:

ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY

which does not end with a $ sign.

First data line:

7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$

which does end with a $ sign.

My import command:

read.table(filename, header=TRUE, sep="$", comment.char="", quote="")

My guess is that the inconsistency between the line endings makes R think the records have one column more than the header, so it treats the first column as a row.names column, which is not correct. Adding row.names=NULL does not fix the issue. If I manually add a $ sign at the end of the header in the file, the problem is solved, but this is infeasible as the issue occurs in hundreds of files. Is there a way to specify how to read the header line? Do I have any alternative?

Additional info: the headers change across different files, so I cannot set my own vector of column names.
Create a dummy test file:

cat("ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY\n7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$", file="deleteme.txt", "\n")

Solution using gsub: first read the file as text, then edit its content.

file_path <- "deleteme.txt"
fh <- file(file_path)
file_content <- readLines(fh)
close(fh)

Either add a $ at the end of the header row:

file_content[1] <- paste0(file_content[1], "$")

Or remove the $ from the end of all rows:

file_content <- gsub("\\$$", "", file_content)

Then write the fixed file back to disk:

cat(paste0(file_content, collapse="\n"), "\n", file=paste0("fixed_", file_path))

Now we can read the file:

df <- read.table(paste0("fixed_", file_path), header=TRUE, sep="$", comment.char="", quote="", stringsAsFactors=FALSE)

And get the desired structure:

str(df)
'data.frame': 1 obs. of 23 variables:
 $ ISR             : int 7215577
 $ CASE            : int 8135839
 $ I_F_COD         : chr "I"
 $ FOLL_SEQ        : logi NA
 $ IMAGE           : chr "7215577-0"
 $ EVENT_DT        : int 20101011
 $ MFR_DT          : logi NA
 $ FDA_DT          : int 20110104
 $ REPT_COD        : chr "DIR"
 $ MFR_NUM         : logi NA
 $ MFR_SNDR        : logi NA
 $ AGE             : int 67
 $ AGE_COD         : chr "YR"
 $ GNDR_COD        : logi FALSE
 $ E_SUB           : chr "N"
 $ WT              : int 220
 $ WT_COD          : chr "LBS"
 $ REPT_DT         : int 20110102
 $ OCCP_COD        : chr "CN"
 $ DEATH_DT        : logi NA
 $ TO_MFR          : chr "N"
 $ CONFID          : chr "Y"
 $ REPORTER_COUNTRY: chr "UNITED STATES "
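A possibly simpler variant of the same idea, sketched here with a made-up three-column header rather than the real 23-column one: skip the round trip to disk and feed the corrected lines straight to read.table() through its text= argument.

```r
# Hypothetical miniature of the problem: header without a trailing $,
# data row with one
file_content <- c("A$B$C",
                  "1$x$2$")

# Strip the trailing $ from every line, as in the gsub fix above
file_content <- gsub("\\$$", "", file_content)

# Parse the fixed text directly, no "fixed_" file needed
df <- read.table(text = paste(file_content, collapse = "\n"),
                 header = TRUE, sep = "$", quote = "",
                 comment.char = "", stringsAsFactors = FALSE)
str(df)
```

This relies only on base R's documented text= argument to read.table(); it should behave the same as writing the fixed file and reading it back.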
fread importing empty as NA
I'm trying to import a csv with blanks read as "". Unfortunately they're all being read as "NA" now. To better demonstrate the problem, I'm also showing how NA, "NA", and "" all map to the same thing (except in the very bottom example), which prevents the easy workaround dt[is.na(dt)] <- "".

> write.csv(matrix(c("0","",NA,"NA"),ncol = 2),"MRE.csv")

Opening this in Notepad, it looks like this:

"","V1","V2"
"1","0",NA
"2","","NA"

So reading that back...

> fread("MRE.csv")
   V1 V1 V2
1:  1  0 NA
2:  2 NA NA

The documentation seems to suggest this, but it does not work as described:

> fread("MRE.csv", na.strings = NULL)
   V1 V1 V2
1:  1  0 NA
2:  2 NA NA

Also tried this, which reads the NA as an actual NA, but the problem remains for the empty string, which is read as "NA":

> fread("MRE.csv", colClasses=c(V1="character",V2="character"))
   V1 V1   V2
1:  1  0 <NA>
2:  2 NA   NA
> fread("MRE.csv", colClasses=c(V1="character",V2="character"))[,V2]
[1] NA   "NA"

data.table version 1.11.4
R version 3.5.1
A few possible things going on here:

- Regardless of you writing "0" here, the reading function (fread) infers column types by looking at a portion of the file. This is not uncommon (readr does it, too), and it is controllable (with colClasses=).
- This might be unique to your question here (and not your real data), but your call to write.csv is implicitly putting the literal NA letters in the file (not to be confused with "NA", where you have the literal string). This might be confusing things, even when you override with colClasses=.
- You might already know this, but since fread infers that those columns are really integer classes, they cannot contain empty strings: once a column is determined to be numeric, anything non-number-like becomes NA.

Let's redo your csv-generating side to make sure we don't confound the situation:

write.csv(matrix(c("0","",NA,"NA"),ncol = 2), "MRE.csv", na="")

(Below, I'm using magrittr's pipe operator %>% merely for presentation; it is not required.)

The first example demonstrates fread's inference. The second shows our overriding that behavior, and now we have blank strings in each NA spot that is not the literal string "NA".

fread("MRE.csv") %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
#  $ V1: int 1 2
#  $ V1: int 0 NA
#  $ V2: logi NA NA
#  - attr(*, ".internal.selfref")=<externalptr>

fread("MRE.csv", colClasses="character") %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
#  $ V1: chr "1" "2"
#  $ V1: chr "0" ""
#  $ V2: chr "" "NA"
#  - attr(*, ".internal.selfref")=<externalptr>

This can also be controlled on a per-column basis. One issue with this example is that fread is, for some reason, naming the column of row names V1, the same as the next column. This looks like a bug to me; perhaps you can look at Rdatatable's issues and potentially post a new one. (I might be wrong; perhaps this is intentional/known behavior.)
Because of this, per-column overriding seems to stop at the first occurrence of a column name:

fread("MRE.csv", colClasses=c(V1="character", V2="character")) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
#  $ V1: chr "1" "2"
#  $ V1: int 0 NA
#  $ V2: chr "" "NA"
#  - attr(*, ".internal.selfref")=<externalptr>

One way around this is an unnamed vector, requiring the same number of classes as the number of columns:

fread("MRE.csv", colClasses=c("character","character","character")) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
#  $ V1: chr "1" "2"
#  $ V1: chr "0" ""
#  $ V2: chr "" "NA"
#  - attr(*, ".internal.selfref")=<externalptr>

Another way (thanks @thelatemail) is with a list:

fread("MRE.csv", colClasses=list(character=2:3)) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
#  $ V1: int 1 2
#  $ V1: chr "0" ""
#  $ V2: chr "" "NA"
#  - attr(*, ".internal.selfref")=<externalptr>

Side note: if you need to preserve them as ints/nums, then:

- If your concern is how the NAs affect follow-on calculations, you can: fix the source of the data so that nulls are not provided; filter out the incomplete observations (rows); or fix the calculations to deal intelligently with missing data.
- If your concern is how they look in a report, whatever tool you use to render the report should have a mechanism for displaying NA values; for example, setting options(knitr.kable.NA="") before knitr::kable(...) will present them as empty strings.
- If your concern is how they look on your console, you have two options: interfere with the data by iterating over each (intended) column and changing NA values to "" (this only works on character columns, and is irreversible); or write your own subclass of data.frame that changes how it is displayed on the console. The benefit of the subclass is that it is non-destructive; the problem is that you have to re-class each object where you want this behavior, and most (if not all) functions that output frames will likely strip or omit that class from your input. (You'll need to write an S3 method of print for your subclass to do this.)
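The subclass idea above could be sketched like this; the class name "blankna" and the method body are invented for illustration, not an established API:

```r
# Hypothetical S3 print method for a data.frame subclass "blankna":
# display NA as "" without modifying the underlying data
print.blankna <- function(x, ...) {
  shown <- as.data.frame(lapply(x, function(col) {
    col <- as.character(col)
    col[is.na(col)] <- ""
    col
  }), stringsAsFactors = FALSE)
  print(shown, ...)
  invisible(x)   # return the original object unchanged
}

d <- data.frame(a = c(1, NA), b = c("x", NA))
class(d) <- c("blankna", class(d))
print(d)   # blanks shown on screen; d itself still contains NA
</imports stripped>
```

As the answer warns, most functions that return a new frame will drop the "blankna" class, so the object has to be re-classed after such operations.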
Error in .jcall(cell, "V", "setCellValue", value) : method setCellValue with signature ([D)V not found when attempting write.xlsx
library(dtplyr)
library(xlsx)
library(lubridate)

'data.frame': 612 obs. of 7 variables:
 $ Company           : Factor w/ 10 levels "Harbor","HCG",..: 6 10 10 3 6 8 6 8 6 6 ...
 $ Title             : chr NA NA NA NA ...
 $ Send.Offer.Letter :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 612 obs. of 1 variable:
  ..$ Send Offer Letter: Date, format: NA NA NA NA ...
  ..- attr(*, "spec")=List of 2
  .. ..$ cols   :List of 1
  .. .. ..$ Send Offer Letter: list()
  .. .. .. ..- attr(*, "class")= chr "collector_character" "collector"
  .. ..$ default: list()
  .. .. ..- attr(*, "class")= chr "collector_guess" "collector"
  .. ..- attr(*, "class")= chr "col_spec"
 $ Accepted.Position : chr NA NA NA NA ...
 $ Application.Date  : chr NA NA NA NA ...
 $ Hire.Date..Start. :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 612 obs. of 1 variable:
  ..$ Hire Date (Start): POSIXct, format: "2008-05-20" NA NA "2008-05-13" ...
 $ Rehire..Yes.or.No.: Factor w/ 23 levels "??","36500","continuing intern",..: NA NA NA NA NA NA NA NA NA NA ...

I have an extremely messy dataset (it was entered entirely freehand in Excel spreadsheets) regarding new hires. Variables associated with dates are, of course, making things difficult. There was no consistency in entry format; sometimes random character strings were part of a date (think "5/17, day tbd"), etc. I finally got the dates consistently formatted as POSIXct, but it led to the odd situation you see above, where there appear to be nested variables in my columns. I have already coerced two date variables to character with as.character ($Accepted.Position and $Application.Date), as I have seen examples of POSIXct date formatting causing issues with write.xlsx.
When I attempt to write to xlsx, I get the following:

write.xlsx(forstack, file = "forstackover.xlsx", col.names = TRUE)

Error in .jcall(cell, "V", "setCellValue", value) :
  method setCellValue with signature ([D)V not found
In addition: There were 50 or more warnings (use warnings() to see the first 50)

My dput is too long to post here, so here is the pastebin for it: Dput forstack

Attempting to coerce $Hire.Date..Start. with as.character produces the odd result I have partially pasted here: as.character result

I am not sure what action to take. I found a similar discussion here: stack question similar to this one, but that user was trying to call a specific portion of a column for ggplot2 graphing. Any help is appreciated.
I had this issue when trying to write a tibble (tbl_df) to xlsx using the xlsx package. The error was thrown when I added the row.names = FALSE option, but there was no error without it. I converted the tbl_df to a data.frame and it worked.
I agree with @greg dubrow's solution. I have a simpler suggestion as code:

write.xlsx(as.data.frame(forstack), file = "forstackover.xlsx", col.names = TRUE)

You can be more flexible with file.choose():

write.xlsx(as.data.frame(forstack), file = file.choose(), col.names = TRUE)

By the way, in my code, similar to @Lee's, it gave an error for row.names = FALSE. The error is now resolved. If we expand it a little more:

write.xlsx(as.data.frame(forstack), file = file.choose(), col.names = TRUE, row.names = FALSE)
For me the issue was that the data.frame was grouped, so I added ungroup(forstack) prior to write.xlsx and that fixed it.
It's a trivial issue, but this trick works:

write.xlsx(as.data.frame(forstack), file = "forstackover.xlsx", col.names = TRUE)

This happens because when col.names or row.names is specified, the input must be a data.frame and not a tibble.
Can't write data frame to database
I can't really create a code example because I'm not quite sure what the problem is, and my actual problem is rather involved. That said, it seems like a generic enough problem that maybe somebody's seen it before. Basically I'm constructing 3 different dataframes and rbinding them together, which all goes smoothly as expected, but when I try to write the merged frame back to the DB I get this error:

Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
  unimplemented type 'list' in 'EncodeElement'

I've tried manually coercing with as.data.frame() before and after the rbinds, and the returned object (the same one that fails to write with the above error message) exists in the environment as class data.frame, so why does dbWriteTable not seem to have got the memo?

Sorry, I should add: I'm connecting to a MySQL DB using RMySQL. The problem, I think, as I look a little closer and try to explain myself, is that the columns of my data frame are themselves lists (of the same length), which sort of makes sense of the error. I'd think (or like to think, anyway) that a call to as.data.frame() would take care of that, but I guess not. A portion of my str() (since it's long) looks like:

.. [list output truncated]
 $ stcong :List of 29809
  ..$ : int 3
  ..$ : int 8
  ..$ : int 4
  ..$ : int 2

I guess I'm wondering if there's an easy way to force this coercion?
Hard to say for sure, since you provided so little concrete information, but this would be one way to convert a list column to an atomic vector column:

> d <- data.frame(x = 1:5)
> d$y <- as.list(letters[1:5])
> str(d)
'data.frame': 5 obs. of 2 variables:
 $ x: int 1 2 3 4 5
 $ y:List of 5
  ..$ : chr "a"
  ..$ : chr "b"
  ..$ : chr "c"
  ..$ : chr "d"
  ..$ : chr "e"
> d$y <- unlist(d$y)
> str(d)
'data.frame': 5 obs. of 2 variables:
 $ x: int 1 2 3 4 5
 $ y: chr "a" "b" "c" "d" ...

This assumes that each element of your list column is a length-one vector. If any aren't, things will be more complicated, and you'd likely need to rethink your data structure anyway.
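For the harder case where elements are not all length one (so unlist() would change the row count), one hedged option is to collapse each element into a single delimited string before writing; the ";" separator here is an arbitrary choice:

```r
d <- data.frame(x = 1:3)
# Ragged list column: elements of length 2, 1, and 0
d$y <- list(c("a", "b"), "c", character(0))

# Collapse each list element into one string so the column becomes atomic
d$y <- vapply(d$y, function(v) paste(v, collapse = ";"), character(1))

str(d)
```

This loses the list structure, of course, so it only makes sense if a flattened text representation is acceptable in the database.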