I am trying to read a CSV into R, skipping the first 2 rows. The CSV is a mixture of blanks, text data, and numeric data where the thousands separator is ",".
The file reads into R fine (gives a 31 x 27 data frame), but when I add the argument skip = 2 it returns a single column with 282 observations.
I have tried it using the readr package's read_csv function and it works fine.
testdf <- read.csv("test.csv")
works fine - gives a dataframe of 31 obs of 27 variables
I get the following warning message when trying to use the skip argument:
testdf <- read.csv("test.csv", skip = 2)
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
which results in a single variable with 282 observations
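For what it's worth, this warning usually means scan() hit an unbalanced quote character (often a stray apostrophe) in the region it starts reading from once skip = 2 is applied. A minimal sketch of the usual workaround, turning off quote interpretation entirely (note this would also break any quoted fields that contain the "," thousands separator, so check the result):

```r
# Treat quote characters literally; avoids "EOF within quoted string".
testdf <- read.csv("test.csv", skip = 2, quote = "")

# Or lean on readr, which (as noted above) parses this file fine:
# testdf <- readr::read_csv("test.csv", skip = 2)
```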
I have been trying to process a fairly large file (2 million-plus observations) using the readr package. However, using the read_table2() function (and, for that matter, read_table()) also generates the following error:
Warning: 2441218 parsing failures.
row col expected actual file
1 -- 56 columns 28 columns '//filsrc/Research/DataFile_1.txt'
With some additional research, I was able to calculate the maximum number of fields for each file:
max_fields <- max(count.fields("DataFile_1.txt", sep = "", quote = "\"'", skip = 0,
                               blank.lines.skip = TRUE, comment.char = "#"))
and then set up the columns using max_fields for read_table2() as follows:
file_one <- read_table2("DataFile_1.txt", col_names = paste0("V", seq_len(max_fields)),
                        col_types = NULL, na = "NA", n_max = Inf,
                        guess_max = min(Inf, 3000000), progress = show_progress(), comment = "")
The resulting output shows Warning as I mentioned earlier.
My question is:
Have we compromised data integrity? In other words, do we have the same data, just spread across more columns during parsing because no appropriate col_types were assigned to each column, or did we actually lose some information in the process?
I have checked the dataset with another method, read.table(), and it produced the same dimensions (rows and columns) as read_table2(). So what exactly do "parsing failures" mean in this context?
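One way to answer this empirically is readr's problems() helper, which records every parsing failure with its row, column, and the expected vs. actual value, so you can see whether failing rows were truncated or merely padded with NA. A sketch, assuming the same DataFile_1.txt and max_fields as above:

```r
library(readr)

# Re-read with explicit column names so short rows are padded, not guessed.
file_one <- read_table2("DataFile_1.txt",
                        col_names = paste0("V", seq_len(max_fields)))

# One row per parsing failure: row, col, expected, actual, file.
parse_issues <- problems(file_one)
head(parse_issues)

# Cross-check: if nrow(file_one) matches the physical line count, the
# short rows were kept (padded with NA) rather than dropped.
nrow(file_one) == length(count.fields("DataFile_1.txt", sep = ""))
```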
I have a .DAT file that contains several thousand rows of data. Each row is a case with a fixed number of variables, but not every case has a value for every variable; where a value is missing, that space is left blank, so the entire data set looks like a sparse matrix. A sample of the data looks like this:
10101010 100 10000FM
001 100 100 1000000 F
I want to read this data in r as a data frame. I've tried read.table but failed.
My code is
m <- read.table("C:/Users/Desktop/testdata.dat", header = FALSE)
R gives me an error message like:
"Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 6 elements"
How do I fix this?
Generally a .dat file has some lines of extra information before the actual data.
Skip them with the skip argument as follows:
df <- read.table("C:/Users/Desktop/testdata.dat", header = FALSE, skip = 3)
You can also inspect the file first using the readLines function, which reads the specified number of lines from your file (pass the n parameter as below):
readLines("C:/Users/Desktop/testdata.dat", n = 5)
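Since the sample rows here are aligned by position rather than delimited, another option is read.fwf, or read.table with fill = TRUE if the file is actually delimited but ragged. A sketch; the widths vector below is hypothetical and would need to be read off the real file:

```r
# Fixed-width read: blanks become NA and column alignment is preserved.
# The widths are an assumption for illustration only.
m <- read.fwf("C:/Users/Desktop/testdata.dat",
              widths = c(8, 1, 3, 1, 7, 2),
              header = FALSE)

# If the rows are whitespace-delimited but have missing trailing fields,
# fill = TRUE pads short rows with NA instead of raising
# "line 1 did not have 6 elements".
m2 <- read.table("C:/Users/Desktop/testdata.dat", header = FALSE, fill = TRUE)
```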
I am trying to read.table for a tab-delimited file using the following command:
df <- read.table("input.txt", header=FALSE, sep="\t", quote="", comment.char="",
encoding="utf-8")
There should be 30 million rows. However, after read.table() returns, df contains only ~6 million rows. And there is a warning message like this:
Warning message:
In read.table("input.txt", header = FALSE, sep = "\t", quote = "", :
incomplete final line found by readTableHeader on 'input.txt'
I believe read.table quits after encountering a special symbol (ASCII code 1A, Substitute) in one of the string columns. In the input file, the only special character should be the tab, because it is used to separate columns. Is there any way to ask read.table to treat all other characters as not special?
If you have 30 million rows, I would use fread rather than read.table; it is faster. Learn more about it here: http://www.inside-r.org/packages/cran/data.table/docs/fread
fread("input.txt", sep = "auto", encoding = "UTF-8")
Regarding your issue with read.table, I think the solutions here should solve it:
'Incomplete final line' warning when trying to read a .csv file into R
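If fread is not an option, one workaround (a sketch, assuming the offending byte really is the 0x1A SUB character) is to strip it before parsing and feed the cleaned lines back to read.table via its text argument:

```r
# Read raw lines, drop the SUB control character (0x1A),
# then parse the cleaned text with the original settings.
raw_lines <- readLines("input.txt", warn = FALSE)
clean <- gsub("\x1a", "", raw_lines, fixed = TRUE)
df <- read.table(text = clean, header = FALSE, sep = "\t",
                 quote = "", comment.char = "")
```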
I run the following:
data <- read.csv("C:/Users/user/Desktop/directory/file.csv")
These are my warning messages:
Warning messages:
1: In read.table("C:/Users/user/Desktop/directory/file.csv", :
line 1 appears to contain embedded nulls
2: In read.table("C:/Users/user/Desktop/directory/file.csv", :
line 2 appears to contain embedded nulls
....
6: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
The resulting data frame is just a column filled with NAs.
A problem might be that all my data is in one column, separated by commas, like this (the first row is the header, the second row is example data):
stat_breed,tran_disp,train_key,t_type,trainer_Id,horsedata_Id
QH,"Tavares Joe","214801","E",0,0
What can I do to accurately read my data?
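Embedded nuls on every line are the classic signature of a UTF-16 encoded file (many Windows exports produce these); in that case, telling R the encoding fixes both the warnings and the all-NA column at once. Failing that, read.csv's skipNul argument drops the nul bytes. A sketch, with "UTF-16LE" as an assumption to verify against the actual file:

```r
# If the file turns out to be UTF-16 (check with a hex viewer), declare it:
data <- read.csv("C:/Users/user/Desktop/directory/file.csv",
                 fileEncoding = "UTF-16LE")

# Otherwise, drop the embedded nuls during scanning:
data <- read.csv("C:/Users/user/Desktop/directory/file.csv", skipNul = TRUE)
```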
I would like to read in a .txt file into R, and have done so numerous times.
At the moment however, I am not getting the desired output.
I have a .txt file which contains data X that I want, and other data that I do not, which is in front and after this data X.
Here is a screenshot of the .txt file.
I am able to read in the txt file as followed:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header = TRUE, skip = 88, nrows = 266)
This gives me a dataframe with 266 obs of 1 variable.
But I want these 266 observations in 4 columns (ID, Species, Endpoint, BLM NOEC).
So I tried the following script:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header = TRUE, skip = 88, nrows = 266, sep = " ")
But then I get the error
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
Using sep = "\t" also gives the same error.
And I am not sure how I can fix this.
Any help is much appreciated!
Try read.fwf and specify the widths of each column. Start reading at the Aelososoma sp. row and add the column names afterwards with something like:
df <- read.fwf("C:/Users/toxicologie/Cobalt/WB1", widths = c(2, 35, 15, 15),
               header = FALSE, skip = 88, n = 266)
colnames(df) <- c("ID", "Species", "Endpoint", "BLM NOEC")
Provide the txt file for a more complete answer.