I have a problem when importing .csv file into R. With my code:
t <- read.csv("C:\\N0_07312014.CSV", na.strings=c("","null","NaN","X"),
              header=TRUE, stringsAsFactors=FALSE, check.names=FALSE)
R reports an error and does not do what I want:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I guess the problem is because my data is not well formatted. I only need data from [,1:32]. All others should be deleted.
Data can be downloaded from:
https://drive.google.com/file/d/0B86_a8ltyoL3VXJYM3NVdmNPMUU/edit?usp=sharing
Thanks so much!
Open the .csv as a text file (for example, use TextEdit on a Mac) and check to see if columns are being separated with commas.
CSV stands for "comma-separated values". For some reason, when Excel saves my CSVs it uses semicolons instead.
When opening your csv use:
read.csv("file_name.csv",sep=";")
Semicolon is just an example, but as someone else previously suggested, don't assume that because your CSV looks good in Excel it is actually comma-separated.
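One quick way to avoid guessing the separator is to peek at the first raw line of the file before importing. This is a hedged sketch using a made-up demo file; substitute your own path:

```r
# Write a small semicolon-separated demo file (stands in for your CSV)
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b", "1,5;2,5"), tmp)

# Peek at the first line to see which delimiter is in use
first_line <- readLines(tmp, n = 1)
if (grepl(";", first_line)) {
  dat <- read.csv2(tmp)  # read.csv2() assumes sep = ";" and dec = ","
} else {
  dat <- read.csv(tmp)
}
str(dat)
```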
That's one wonky CSV file: multiple headers tossed about (try pasting it into CSV Fingerprint to see what I mean).
Since I don't know the data, it's impossible to be sure the following produces accurate results for you, but it involves using readLines and other R functions to pre-process the text:
# use readLines to get the data
dat <- readLines("N0_07312014.CSV")
# I had to do this to fix grep errors (invalid multibyte strings)
Sys.setlocale('LC_ALL','C')
# filter out the repeating, and wonky headers
dat_2 <- grep("Node Name,RTC_date", dat, invert=TRUE, value=TRUE)
# turn that vector into a text connection for read.csv
dat_3 <- read.csv(textConnection(paste0(dat_2, collapse="\n")),
header=FALSE, stringsAsFactors=FALSE)
str(dat_3)
## 'data.frame': 308 obs. of 37 variables:
## $ V1 : chr "Node 0" "Node 0" "Node 0" "Node 0" ...
## $ V2 : chr "07/31/2014" "07/31/2014" "07/31/2014" "07/31/2014" ...
## $ V3 : chr "08:58:18" "08:59:22" "08:59:37" "09:00:06" ...
## $ V4 : chr "" "" "" "" ...
## .. more
## $ V36: chr "" "" "" "" ...
## $ V37: chr "0" "0" "0" "0" ...
# grab the headers
headers <- strsplit(dat[1], ",")[[1]]
# how many of them are there?
length(headers)
## [1] 32
# limit it to the 32 columns you want (which matches the header count)
dat_4 <- dat_3[,1:32]
# and add the headers
colnames(dat_4) <- headers
str(dat_4)
## 'data.frame': 308 obs. of 32 variables:
## $ Node Name : chr "Node 0" "Node 0" "Node 0" "Node 0" ...
## $ RTC_date : chr "07/31/2014" "07/31/2014" "07/31/2014" "07/31/2014" ...
## $ RTC_time : chr "08:58:18" "08:59:22" "08:59:37" "09:00:06" ...
## $ N1 Bat (VDC) : chr "" "" "" "" ...
## $ N1 Shinyei (ug/m3): chr "" "" "0.23" "null" ...
## $ N1 CC (ppb) : chr "" "" "null" "null" ...
## $ N1 Aeroq (ppm) : chr "" "" "null" "null" ...
## ... continues
If you only need the first 32 columns, and you know how many columns there are, you can set the remaining columns' classes to "NULL".
read.csv("C:\\N0_07312014.CSV", na.strings=c("","null","NaN","X"),
         header=TRUE, stringsAsFactors=FALSE,
         colClasses=c(rep("character",32), rep("NULL",10)))
If you do not want to code up each colClass, and you are happy with the classes read.csv guesses, just save that trimmed csv and open it again.
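That round trip might look like the following sketch, where temporary file names stand in for the real paths:

```r
# Demo file with 42 columns, standing in for the original CSV
tmp_in  <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
write.csv(data.frame(matrix(1, nrow = 2, ncol = 42)), tmp_in, row.names = FALSE)

t <- read.csv(tmp_in, stringsAsFactors = FALSE)   # let read.csv guess classes
write.csv(t[, 1:32], tmp_out, row.names = FALSE)  # keep only the 32 wanted columns
t2 <- read.csv(tmp_out, stringsAsFactors = FALSE) # reopen the cleaned file
```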
Alternatively, you can skip the header and name the columns yourself and remove the misbehaved rows.
A <- read.csv("N0_07312014.CSV",
              header=FALSE, stringsAsFactors=FALSE,
              colClasses=c(rep("character",32), rep("NULL",5)),
              na.strings=c("","null","NaN","X"))
Yournames<-as.character(A[1,])
names(A)<-Yournames
yourdata<-unique(A)[-1,]
The code above assumes you do not want any duplicate rows. You can alternatively remove rows that have the first entry equal to the first column name, but I'll leave that to you.
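The row-removal step left to the reader could look like this (a hedged sketch on toy data mimicking the embedded header rows):

```r
# Toy frame standing in for A: some rows repeat the column names
# inside the data itself.
A <- data.frame(`Node Name` = c("Node Name", "Node 0", "Node Name", "Node 1"),
                RTC_date    = c("RTC_date", "07/31/2014", "RTC_date", "07/31/2014"),
                check.names = FALSE, stringsAsFactors = FALSE)

# Drop any row whose first field repeats the first column name
cleaned <- A[A[[1]] != names(A)[1], ]
```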
Try read.table() instead of read.csv().
I was also facing the same issue; it is solved now.
Just use header = FALSE:
mydata <- read.csv("data.csv", header = FALSE)
I had the same problem. I opened my data in a text editor and found that the double (decimal) values were written with semicolons; you should replace them with a period.
I was having this error, which was caused by multiple rows of metadata at the top of the file. I was able to use read.csv by passing skip= to jump over those rows.
data <- read.csv('/blah.csv',skip=3)
For me, the solution was using read.csv2 instead of read.csv.
read.csv("file_name.csv", header = FALSE)
Setting header to FALSE will do the job for you.
I have a very simple XML file that I want to import as a data frame in R.
<root>
<source>
<sourceId value="8556"/>
</source>
<content>
<DESCRIPTION value="0"/>
<SORTED value="290"/>
<ANNULATION value="34"/>
<RECORDING value="5665"/>
<TOLOCK value=""/>
<FUTURE value="categorical"/>
</content>
</root>
I retrieve the node I need this way:
library(XML)
xmlDoc <- xmlParse("path-to-file", useInternalNodes=TRUE)
df <- xmlToDataFrame(getNodeSet(xmlDoc,"//content"))
but the data frame has only columns with no values at all, so I guess I went wrong in some step.
> df
DESCRIPTION SORTED ANNULATION RECORDING TOLOCK FUTURE
1
> str(df)
'data.frame': 1 obs. of 6 variables:
$ DESCRIPTION: chr ""
$ SORTED : chr ""
$ ANNULATION : chr ""
$ RECORDING : chr ""
$ TOLOCK : chr ""
$ FUTURE : chr ""
Usually, XML processing is very dependent on the file, so you have to struggle with it; there is no silver bullet.
In your case, just iterate through the tag names and their value attributes this way, assuming you want everything in one row (not very pretty, I must say):
library(xml2)
library(magrittr)  # for the %>% pipe

doc <- read_xml("my.xml")
content <- xml_find_first(doc, ".//content")
values <- xml_children(content) %>% xml_attr("value")
names <- xml_name(xml_children(content))
df <- data.frame(matrix(ncol = length(names), nrow = 0))
df <- rbind(df, values, stringsAsFactors = FALSE)
colnames(df) <- names
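For what it's worth, the same one-row frame can be built in a single step instead of rbind-ing onto an empty one. This sketch inlines a trimmed copy of the XML so it runs standalone (assuming the xml2 package is available):

```r
library(xml2)

# Trimmed copy of the question's XML, inlined for the demo
doc <- read_xml('<root><content><DESCRIPTION value="0"/><SORTED value="290"/></content></root>')
kids <- xml_children(xml_find_first(doc, ".//content"))

# Build the one-row data frame directly from attribute values and tag names
df <- as.data.frame(setNames(as.list(xml_attr(kids, "value")),
                             xml_name(kids)),
                    stringsAsFactors = FALSE)
```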
I am reading several SAS files from a server and loading them all into a list in R. I removed one of the datasets because I didn't need it in the final analysis (dataset #31):
mylist <- list.files("path", pattern = ".sas7bdat")
mylist <- mylist[-31]
Then I used lapply to read all the datasets in the list ( mylist) at the same time
read.all <- lapply(mylist, read_sas)
The code works well. However, when I run View(read.all) to see the datasets, I can only see a number (e.g. 1, 2, etc.) instead of the names of the original datasets.
Does anyone know how I can keep the name of datasets in the final list?
Also, can anyone tell me how I can work with this list in R?
Is it an object? Can I read one of the datasets of the list? And how can I join some of the datasets of the list?
Use basename and tools::file_path_sans_ext:
filenames <- head(list.files("~/StackOverflow", pattern = "^[^#].*\\.R", recursive = TRUE, full.names = TRUE))
filenames
# [1] "C:\\Users\\r2/StackOverflow/1000343/61469332.R" "C:\\Users\\r2/StackOverflow/10087004/61857346.R"
# [3] "C:\\Users\\r2/StackOverflow/10097832/60589834.R" "C:\\Users\\r2/StackOverflow/10214507/60837843.R"
# [5] "C:\\Users\\r2/StackOverflow/10215127/61720149.R" "C:\\Users\\r2/StackOverflow/10226369/60778116.R"
basename(filenames)
# [1] "61469332.R" "61857346.R" "60589834.R" "60837843.R" "61720149.R" "60778116.R"
tools::file_path_sans_ext(basename(filenames))
# [1] "61469332" "61857346" "60589834" "60837843" "61720149" "60778116"
somedat <- setNames(lapply(filenames, readLines, n=2),
tools::file_path_sans_ext(basename(filenames)))
names(somedat)
# [1] "61469332" "61857346" "60589834" "60837843" "61720149" "60778116"
str(somedat)
# List of 6
# $ 61469332: chr [1:2] "# https://stackoverflow.com/questions/61469332/determine-function-name-within-that-function/61469380" ""
# $ 61857346: chr [1:2] "# https://stackoverflow.com/questions/61857346/how-to-use-apply-family-instead-of-nested-for-loop-for-my-problem?noredirect=1" ""
# $ 60589834: chr [1:2] "# https://stackoverflow.com/questions/60589834/add-columns-to-data-frame-based-on-function-argument" ""
# $ 60837843: chr [1:2] "# https://stackoverflow.com/questions/60837843/how-to-remove-all-parentheses-from-a-vector-of-string-except-whe"| __truncated__ ""
# $ 61720149: chr [1:2] "# https://stackoverflow.com/questions/61720149/extracting-the-original-data-based-on-filtering-criteria" ""
# $ 60778116: chr [1:2] "# https://stackoverflow.com/questions/60778116/how-to-shift-data-by-a-factor-of-two-months-in-r" ""
Each "name" is the character representation of (in this case) the stackoverflow question number, with the ".R" removed. (And since I typically include the normal URL as the first line then an empty line in the files I use to test/play and answer SO questions, all of these files look similar at the top two lines.)
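To the follow-up questions: the result is an ordinary named list, so single datasets come out with `[[`, and compatible ones can be joined by row. A sketch with made-up dataset names standing in for the SAS files:

```r
# Toy stand-ins for the SAS datasets; the names mimic setNames() output
read.all <- list(visits = data.frame(id = 1:2, x = c(10, 20)),
                 labs   = data.frame(id = 3:4, x = c(30, 40)))

one <- read.all[["visits"]]          # pull out a single dataset by name
stacked <- do.call(rbind, read.all)  # row-bind datasets with matching columns
```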
I have ASCII files with data separated by $ signs.
There are 23 columns in the data and the first row holds the column names, but there is an inconsistency between the line endings, which causes R to import the data improperly, shifting the data left with respect to their columns.
Header line:
ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY
which does not end with a $ sign.
First row line:
7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$
Which does end with a $ sign.
My import command:
read.table(filename, header=TRUE, sep="$", comment.char="", quote="")
My guess is that the inconsistency between the line endings makes R think the records have one more column than the header, so it treats the first column as a row.names column, which is not correct. Adding the specification row.names=NULL does not fix the issue.
If I manually add a $ sign in the file the problem is solved, but this is infeasible as the issue occurs in hundreds of files. Is there a way to specify how to read the header line? Do I have any alternative?
Additional info: the headers change across different files, so I cannot set my own vector of column names
Create a dummy test file:
cat("ISR$CASE$I_F_COD$FOLL_SEQ$IMAGE$EVENT_DT$MFR_DT$FDA_DT$REPT_COD$MFR_NUM$MFR_SNDR$AGE$AGE_COD$GNDR_COD$E_SUB$WT$WT_COD$REPT_DT$OCCP_COD$DEATH_DT$TO_MFR$CONFID$REPORTER_COUNTRY\n7215577$8135839$I$$7215577-0$20101011$$20110104$DIR$$$67$YR$F$N$220$LBS$20110102$CN$$N$Y$UNITED STATES$\n",
    file="deleteme.txt")
Solution using gsub:
First read the file as text and then edit its content:
file_path <- "deleteme.txt"
fh <- file(file_path)
file_content <- readLines(fh)
close(fh)
Either add a $ at the end of header row:
file_content[1] <- paste0(file_content[1], "$")
Or remove $ from the end of all rows:
file_content <- gsub("\\$$", "", file_content)
Then we write the fixed file back to disk:
cat(paste0(file_content, collapse="\n"), "\n", file=paste0("fixed_", file_path))
Now we can read the file:
df <- read.table(paste0("fixed_", file_path), header=TRUE, sep="$", comment.char="", quote="", stringsAsFactors=FALSE)
And get the desired structure:
str(df)
'data.frame': 1 obs. of 23 variables:
$ ISR : int 7215577
$ CASE : int 8135839
$ I_F_COD : chr "I"
$ FOLL_SEQ : logi NA
$ IMAGE : chr "7215577-0"
$ EVENT_DT : int 20101011
$ MFR_DT : logi NA
$ FDA_DT : int 20110104
$ REPT_COD : chr "DIR"
$ MFR_NUM : logi NA
$ MFR_SNDR : logi NA
$ AGE : int 67
$ AGE_COD : chr "YR"
$ GNDR_COD : logi FALSE
$ E_SUB : chr "N"
$ WT : int 220
$ WT_COD : chr "LBS"
$ REPT_DT : int 20110102
$ OCCP_COD : chr "CN"
$ DEATH_DT : logi NA
$ TO_MFR : chr "N"
$ CONFID : chr "Y"
$ REPORTER_COUNTRY: chr "UNITED STATES "
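An alternative worth noting: read.table() can take the repaired vector directly through its text= argument, skipping the round trip to disk. A sketch on toy lines that mimic the mismatched endings:

```r
# Toy lines: header without a trailing $, data row with one
file_content <- c("A$B$C", "1$2$3$")
file_content <- gsub("\\$$", "", file_content)  # drop any trailing $

# Feed the repaired lines straight to read.table() via text=
df <- read.table(text = paste(file_content, collapse = "\n"),
                 header = TRUE, sep = "$", quote = "", comment.char = "",
                 stringsAsFactors = FALSE)
```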
I'm trying to import a csv with blanks read as "". Unfortunately they're all reading as "NA" now.
To better demonstrate the problem I'm also showing how NA, "NA", and "" are all mapping to the same thing (except in the very bottom example), which would prevent the easy workaround dt[is.na(dt)] <- ""
> write.csv(matrix(c("0","",NA,"NA"),ncol = 2),"MRE.csv")
Opening this in notepad, it looks like this
"","V1","V2"
"1","0",NA
"2","","NA"
So reading that back...
> fread("MRE.csv")
V1 V1 V2
1: 1 0 NA
2: 2 NA NA
The documentation seems to suggest this should work, but it does not behave as described:
> fread("MRE.csv",na.strings = NULL)
V1 V1 V2
1: 1 0 NA
2: 2 NA NA
Also tried this which reads the NA as an actual NA, but the problem remains for the empty string which is read as "NA"
> fread("MRE.csv",colClasses=c(V1="character",V2="character"))
V1 V1 V2
1: 1 0 <NA>
2: 2 NA NA
> fread("MRE.csv",colClasses=c(V1="character",V2="character"))[,V2]
[1] NA "NA"
data.table version 1.11.4
R version 3.5.1
A few possible things going on here:
Regardless of your writing "0" here, the reading function (fread) infers column types by looking at a portion of the file. This is not uncommon (readr does it, too), and it is controllable (with colClasses=).
This might be unique to your question here (and not your real data), but your call to write.csv is implicitly putting the literal NA letters in the file (not to be confused with "NA" where you have the literal string). This might be confusing things, even when you override with colClasses=.
You might already know this, but since fread is inferring that those columns are really integer classes, then they cannot contain empty strings: once determined to be a number column, anything non-number-like will be NA.
Let's redo your first csv-generating side to make sure we don't confound the situation.
write.csv(matrix(c("0","",NA,"NA"),ncol = 2), "MRE.csv", na="")
(Below, I'm using magrittr's pipe operator %>% merely for presentation, it is not required.)
The first example demonstrates fread's inference. The second shows our overriding that behavior, and now we have blank strings in each NA spot that is not the literal string "NA".
fread("MRE.csv") %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: int 1 2
# $ V1: int 0 NA
# $ V2: logi NA NA
# - attr(*, ".internal.selfref")=<externalptr>
fread("MRE.csv", colClasses="character") %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
This can also be controlled on a per-column basis. One issue with this example is that fread is for some reason forcing the column of row-names to be named V1, the same as the next column. This looks like a bug to me, perhaps you can look at Rdatatable's issues and potentially post a new one. (I might be wrong, perhaps this is intentional/known behavior.)
Because of this, per-column overriding seems to stop at the first occurrence of a column name.
fread("MRE.csv", colClasses=c(V1="character", V2="character")) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: int 0 NA
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
One way around this is to go with an unnamed vector, requiring the same number of classes as the number of columns:
fread("MRE.csv", colClasses=c("character","character","character")) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
Another way (thanks #thelatemail) is with a list:
fread("MRE.csv", colClasses=list(character=2:3)) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: int 1 2
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
Side note: if you need to preserve them as ints/nums, then:
if your concern is about how it affects follow-on calculations, then you can:
fix the source of the data so that nulls are not provided;
filter out the incomplete observations (rows); or
fix the calculations to deal intelligently with missing data.
if your concern is about how it looks in a report, then whatever tool you are using to render in your report should have a mechanism for how to display NA values; for example, setting options(knitr.kable.NA="") before knitr::kable(...) will present them as empty strings.
if your concern is about how it looks on your console, you have two options:
interfere with the data by iterating over each (intended) column and changing NA values to ""; this only works on character columns, and is irreversible; or
write your own subclass of data.frame that changes how it is displayed on the console; the benefit to this is that it is non-destructive; the problem is that you have to re-class each object where you want this behavior, and most (if not all) functions that output frames will likely inadvertently strip or omit that class from your input. (You'll need to write an S3 method of print for your subclass to do this.)
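A minimal sketch of that subclass idea follows; the class name "blankNA" is invented for the example. The print method swaps NA for "" only for display, so the underlying data stays intact:

```r
# S3 print method for the invented "blankNA" subclass of data.frame:
# show NAs as empty strings without modifying the data
print.blankNA <- function(x, ...) {
  shown <- as.data.frame(lapply(x, function(col) {
    col <- as.character(col)
    col[is.na(col)] <- ""
    col
  }), stringsAsFactors = FALSE)
  print.data.frame(shown, ...)
}

df <- data.frame(a = c("0", NA), b = c(NA, "NA"), stringsAsFactors = FALSE)
class(df) <- c("blankNA", class(df))
print(df)  # NAs display as empty strings; df itself is unchanged
```

As the text warns, most functions that return frames will drop the extra class, so the subclass must be re-applied after such operations.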
I am struggling to parse a JSON in R which contains newlines both within character strings and between key/value pairs (and whole objects).
Here's the sort of format I mean:
{
"id": 123456,
"name": "Try to parse this",
"description": "Thought reading a JSON was easy? \r\n Try parsing a newline within a string."
}
{
"id": 987654,
"name": "Have another go",
"description": "Another two line description... \r\n With 2 lines."
}
Say that I have this JSON saved as example.json. I have tried various techniques to overcome parsing problems, suggested elsewhere on SO. None of the following works:
library(jsonlite)
foo <- readLines("example.json")
foo <- paste(readLines("example.json"), collapse = "")
bar <- fromJSON(foo)
bar <- jsonlite::stream_in(textConnection(foo))
bar <- purrr::map(foo, jsonlite::fromJSON)
bar <- ndjson::stream_in(textConnection(foo))
bar <- read_json(textConnection(foo), format = "jsonl")
I gather that this is really NDJSON format, but none of the specialised packages cope with it. Some answers suggest streaming in the data with either jsonlite or ndjson; others suggest mapping the parsing function across lines (or similarly in base R).
Everything raises one of the following errors:
Error: parse error: trailing garbage, or Error: parse error: premature EOF, or problems opening the text connection.
Does anyone have a solution?
Edit
Knowing that the json is wrongly formatted, we lose some ndjson efficiency, but I think we can fix it on the fly, assuming that we clearly have a close brace (}) followed by nothing or some whitespace (including newlines) followed by an open brace ({):
fn <- "~/StackOverflow/TomWagstaff.json"
wrongjson <- paste(readLines(fn), collapse = "")
if (grepl("\\}\\s*\\{", wrongjson))
wrongjson <- paste0("[", gsub("\\}\\s*\\{", "},{", wrongjson), "]")
json <- jsonlite::fromJSON(wrongjson, simplifyDataFrame = FALSE)
str(json)
# List of 2
# $ :List of 3
# ..$ id : int 123456
# ..$ name : chr "Try to parse this"
# ..$ description: chr "Thought reading a JSON was easy? \r\n Try parsing a newline within a string."
# $ :List of 3
# ..$ id : int 987654
# ..$ name : chr "Have another go"
# ..$ description: chr "Another two line description... \r\n With 2 lines."
From here, you can continue with
txtjson <- paste(sapply(json, jsonlite::toJSON, pretty = TRUE), collapse = "\n")
(Below is the original answer, hoping/assuming that the format was somehow legitimate.)
Assuming your data is actually like this:
{"id":123456,"name":"Try to parse this","description":"Thought reading a JSON was easy? \r\n Try parsing a newline within a string."}
{"id": 987654,"name":"Have another go","description":"Another two line description... \r\n With 2 lines."}
then it is as you suspect ndjson. From that you can do this:
fn <- "~/StackOverflow/TomWagstaff.json"
json <- jsonlite::stream_in(file(fn), simplifyDataFrame = FALSE)
# opening file input connection.
# Imported 2 records. Simplifying...
# closing file input connection.
str(json)
# List of 2
# $ :List of 3
# ..$ id : int 123456
# ..$ name : chr "Try to parse this"
# ..$ description: chr "Thought reading a JSON was easy? \r\n Try parsing a newline within a string."
# $ :List of 3
# ..$ id : int 987654
# ..$ name : chr "Have another go"
# ..$ description: chr "Another two line description... \r\n With 2 lines."
Notice I've not simplified to a frame. To get your literal block on the console, do
cat(sapply(json, jsonlite::toJSON, pretty = TRUE), sep = "\n")
# {
# "id": [123456],
# "name": ["Try to parse this"],
# "description": ["Thought reading a JSON was easy? \r\n Try parsing a newline within a string."]
# }
# {
# "id": [987654],
# "name": ["Have another go"],
# "description": ["Another two line description... \r\n With 2 lines."]
# }
If you want to dump it to a file in that way (though nothing in jsonlite or similar will be able to read it, since it is no longer legal ndjson nor legal json as a whole file), then you can
txtjson <- paste(sapply(json, jsonlite::toJSON, pretty = TRUE), collapse = "\n")
and then save that with writeLines or similar.