I have been trying to process a fairly large file (2 million-plus observations) using the readr package. However, the read_table2() function (and, for that matter, read_table()) generates the following warning:
Warning: 2441218 parsing failures.
row col expected actual file
1 -- 56 columns 28 columns '//filsrc/Research/DataFile_1.txt'
With some additional research, I was able to calculate the maximum number of fields for each file:
max_fields <- max(count.fields("DataFile_1.txt", sep = "", quote = "\"'", skip = 0,
blank.lines.skip = TRUE, comment.char = "#"))
and then set up the columns using max_fields in read_table2() as follows:
file_one <- read_table2("DataFile_1.txt", col_names = paste0("V", seq_len(max_fields)),
                        col_types = NULL, na = "NA", n_max = Inf,
                        guess_max = min(Inf, 3000000), progress = show_progress(), comment = "")
The resulting output shows Warning as I mentioned earlier.
My question is:
Have we compromised the data integrity? In other words, do we still have the same data, just spread across more columns during parsing because no appropriate col_types specification was supplied for each column, or have we actually lost some information in the process?
I have checked the dataset with another method, read.table(), and it produced the same dimensions (rows and columns) as read_table2(). So what exactly does "parsing failures" mean in this context?
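In readr, a parsing failure of the kind shown (expected 56 columns, got 28) means the row had fewer fields than the declared column count; the missing trailing columns are filled with NA and the values that were present are kept, and you can inspect exactly which rows and columns were affected with problems() on the result. The same fill-with-NA behaviour can be sketched in base R on a tiny hypothetical file:

```r
# Hypothetical ragged file: the second row has only 2 of 4 fields
tmp <- tempfile(fileext = ".txt")
writeLines(c("a b c d", "e f"), tmp)

# fill = TRUE pads short rows with NA instead of dropping data
x <- read.table(tmp, fill = TRUE, col.names = paste0("V", 1:4))
print(x)
stopifnot(x$V1[2] == "e", x$V2[2] == "f", is.na(x$V3[2]))
```

If problems(file_one) shows only this kind of short-row padding, the fields that exist in the file are intact; information would only be lost if a row genuinely had more fields than the columns you declared.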
Related
We are interested in analyzing our pupil data (only interested in size, not position) recorded with an SR EyeLink system at 1000 Hz.
We exported the files using the SR data viewer as sample reports.
After running ppl_prep_data, the TIMESTAMP variable is converted from character to numeric; however, it returns all NA and the real timestamp values are lost. The rest of the pipeline therefore does not work.
Does anyone have an idea why this happens and, if so, how we might work around it?
Below you can find the code that we are using:
#step 1 Load library
library(PupilPre)
#step 2:load data
# change the folder where the data is in the line below
Pupildat <- read.table("DATAXX.txt", header = T, sep = "\t", na.strings = c(".", "NA"))
# after reading, the first column gets a strange name (something with ?..), so we rename it for the next line of code
names(Pupildat)[1] <- 'RECORDING_SESSION_LABEL'
## Step 3:PupilPre Pipeline ###
# Check classes of columns and reassigns => creates event variable
data_pre <- ppl_prep_data(data = Pupildat, Subject = "RECORDING_SESSION_LABEL", EventColumns = c("Subject", "TRIAL_INDEX"))
align_msg(data_pre, Msg = "Hashtag_1")
#Using the function check_msg_time you can see that the TIMESTAMP values associated with the message are not the same for each event.
#This indicates that alignment is required. Note that a regular expression (regex) can be used here as the message string.
#example below, though think we want different timings for the events
check_msg_time(data = data_pre, Msg = "Hashtag_1")
### returns NA
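One way to diagnose this (a sketch with made-up values, not your actual export) is to look at what the character TIMESTAMP values contain before conversion: as.numeric() returns NA for any string containing non-numeric characters, e.g. thousands separators or locale decimal commas introduced by the exporter:

```r
# Hypothetical malformed timestamp strings as they might come out of an export
ts <- c("1 234", "5,678", "910")
as.numeric(ts)                     # first two become NA ("NAs introduced by coercion")

# stripping the offending characters first recovers the numbers
clean <- gsub("[ ,]", "", ts)
as.numeric(clean)
stopifnot(is.na(as.numeric(ts[1])), all(as.numeric(clean) == c(1234, 5678, 910)))
```

Inspecting head(unique(Pupildat$TIMESTAMP)) before running ppl_prep_data should show whether a cleanup like this is needed.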
I am trying to import multiple CSV files in a for loop. Iteratively trying to solve the errors the code produced, I arrived at the code below:
for (E in EDCODES) {
Filename <- paste("$. Data/2. Liabilities/",
E,
sep="")
Framename <- gsub("\\..*",
"",
E)
assign(Framename,
read.csv(Filename,
header = TRUE,
sep = ",",
stringsAsFactors = FALSE,
na.strings = c("\"ND",
"ND,5",
"5\""),
colClasses = c("BAA35" = "double"),
encoding = "UTF-8",
quote = ""))}
First I realized that the code does not always recognize the most important column "BAA35" as numeric, so I added the colClasses argument. Then I realized that the data has multiple versions of "NA", so I added the na.strings argument. The most common NA value is "ND, 5", which contains the separator ",". So if I add the na.strings argument as defined above I get a lot of EOF within quoted string warnings. The others are also versions of "ND, [NUMBER]" or "ND, 4, [YYYY-MM]".
If I then try to treat that issue with the most common recommendation I could find, adding quote = "", I just end up with a "more columns than column names" error.
The data has 78 columns, so I don't believe posting it here will display in a usable way.
Can somebody recommend a solution so that I can reliably import this column as numeric and have R recognize the NAs in the data correctly?
I think the issue might be that the na.strings contain commas: in some cases "ND,5" is read as one column containing "ND" and another containing "5", and in other cases it is recognized as the na.string. Is there any way to tell R not to split "ND,5" into two columns?
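One workaround (a sketch on made-up data, untested on your actual files) is to read each file as raw text first, collapse the comma-containing ND codes into a single token, and only then parse, so the separator inside "ND,5" never reaches read.csv:

```r
# Hypothetical CSV where the NA code "ND,5" contains the separator
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,BAA35", "1,2.5", "2,\"ND,5\"", "3,ND,4"), tmp)

raw <- readLines(tmp)
raw <- gsub("\"ND,[^\"]*\"", "ND", raw)   # quoted codes:   "ND,5" -> ND
raw <- gsub(",ND,[0-9].*$", ",ND", raw)   # unquoted codes: ND,4   -> ND

d <- read.csv(text = raw, na.strings = "ND", colClasses = c(BAA35 = "double"))
stopifnot(d$BAA35[1] == 2.5, is.na(d$BAA35[2]), is.na(d$BAA35[3]))
```

The exact regexes depend on which ND variants occur in your data; the point is to normalise them before the field splitting happens.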
I am trying to export a dataframe with library(openxlsx) and
openxlsx::write.xlsx(as.data.frame(df), file = "df.xlsx", colNames = TRUE, rowNames = FALSE, append = FALSE)
but I get the following error:
Error in x[is.na(x)] <- na.string : replacement has length zero
I was having a similar problem in a data frame with numeric columns (at least, intended to be numeric), but in my case I was using the related openxlsx::writeData.
The data frame was generated using sapply, with functions that could throw errors because of the data, so I coded it to fill with NA whenever an error was generated. I ended up with NaN and NA values in the same column.
What worked for me is conducting the following treatment before writeData:
df[is.na(df)]<-''
so, for your problem, the following may work:
df[is.na(df)]<-''
openxlsx::write.xlsx(as.data.frame(df), file = "df.xlsx", colNames = TRUE, rowNames = FALSE, append = FALSE)
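This works because is.na() is TRUE for NaN as well as NA, so a single replacement clears both; a minimal check on a made-up data frame:

```r
d <- data.frame(x = c(1, NA, NaN), y = c("a", NA, "b"), stringsAsFactors = FALSE)
stopifnot(all(is.na(d$x[2:3])))   # is.na() catches NaN too

d[is.na(d)] <- ""                 # the treatment from the answer
print(d)
stopifnot(d$x[2] == "", d$x[3] == "", d$y[2] == "")
```

Be aware that the assignment coerces numeric columns containing NA to character, which is usually fine for export but does change the column types.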
So I'm using the web-scraping function bold_seqspec, which returns data on specimens of several species based on a vector of taxonomic groups given in the "taxon" argument, like this:
df<-bold_seqspec(taxon=c("group1","group2","group3"), format = "tsv")
But recently for some cases I'm getting the following message and subsequently losing information when I use it:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
I've gotten this before in read.delim but I solved it with this:
df<-read.delim("file.txt",quote = "",comment.char = "")
Reproducible example:
install.packages("bold")
library(bold)
df<-bold_seqspec(taxon=c("Cnidaria","Hippocampus"), format = "tsv", marker="COI-5P")
The problem is that the function I'm using for data mining (bold_seqspec) doesn't have quote or comment.char arguments.
It seems that the tsv output contains both double quotes and single quotes, and this is why the parsing breaks (for more information, see EOF-within-quoted-string / difference between read.delim() and read.table() — that question also concerns a biological data set).
Hence, if you can't set the quoting behaviour, I'd suggest a workaround: set format = "xml" and convert the XML to a data frame in a further step with the XML (or xml2?) and dplyr libraries.
library(XML)
library(dplyr)
xml = bold_seqspec(taxon=c("Cnidaria","Hippocampus"), format = "xml", marker="COI-5P")
df <- xmlToDataFrame(xml, stringsAsFactors = FALSE) %>%
  mutate_all(~ type.convert(., as.is = TRUE))
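To see why the tsv parsing fails in the first place, here is a minimal reproduction on an invented file (not real BOLD output): a field that opens with a double quote that is never closed makes scan() read to the end of the file, while quote = "" treats the quote as an ordinary character:

```r
# Made-up file: one field starts with an unclosed double quote
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tnote", "1\t\"5 inch tube", "2\tplain"), tmp)

# default quoting: warning "EOF within quoted string", rows get merged
w <- tryCatch(read.delim(tmp), warning = function(w) conditionMessage(w))

# quote = "" disables quote handling, so both rows survive
d <- read.delim(tmp, quote = "")
stopifnot(grepl("EOF", w, fixed = TRUE), nrow(d) == 2, d$note[2] == "plain")
```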
Hope this helps.
I am trying to create a zoo object in R from the following csv file:
http://www.cboe.com/publish/scheduledtask/mktdata/datahouse/Skewdailyprices.csv
The problem seems to be that there are a few minor inconsistencies in the period from 2/27/2006 to 3/20/2006 (some extra commas and an "x") that lead to problems.
I am looking for a method that reads the complete csv file into R automatically. There is a new data point every business day, and with manual preprocessing you would have to re-edit the file by hand every day.
I am not sure if these are the only problems with this file but I am running out of ideas how to create a zoo object out of this time series. I think that with some more knowledge of R it should be possible.
Use colClasses to tell it that there are 4 fields and use fill so it knows to fill them if they are missing on any row. Ignore the warning:
library(zoo)
URL <- "http://www.cboe.com/publish/scheduledtask/mktdata/datahouse/Skewdailyprices.csv"
z <- read.zoo(URL, sep = ",", header = TRUE, format = "%m/%d/%Y", skip = 1,
fill = TRUE, colClasses = rep(NA, 4))
It is a good idea to separate the cleaning and analysis steps. Since you mention that your dataset changes often, this cleaning must be automatic. Here is a solution for autocleaning.
#Read in the data without parsing it
lines <- readLines("Skewdailyprices.csv")
#The bad lines have more than two fields
n_fields <- count.fields(
"Skewdailyprices.csv",
sep = ",",
skip = 1
)
#View the dubious lines (lines still contains the title line that count.fields skipped)
lines[-1][n_fields != 2]
#Fix them
library(stringr) #can use gsub from base R if you prefer
lines <- str_replace(lines, ",,x?$", "")
#Write back out to file
writeLines(lines[-1], "Skewdailyprices_cleaned.csv")
#Read in the clean version
sdp <- read.zoo(
"Skewdailyprices_cleaned.csv",
format = "%m/%d/%Y",
header = TRUE,
sep = ","
)
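The same cleanup works without stringr; a base-R sketch of the regex on invented lines shaped like the good and bad rows:

```r
# Invented lines mimicking the file: bad rows end in extra commas and an "x"
lines <- c("2/24/2006,117.24",
           "2/27/2006,117.60,,x",
           "3/1/2006,117.06,,")

cleaned <- gsub(",,x?$", "", lines)
print(cleaned)
stopifnot(identical(cleaned,
  c("2/24/2006,117.24", "2/27/2006,117.60", "3/1/2006,117.06")))
```

The pattern ",,x?$" anchors at the end of the line, so ordinary rows containing commas elsewhere are left untouched.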