Read a .DAT file that looks like a sparse matrix in R

I have a .DAT file that contains several thousand rows of data. Each row is a case with a fixed number of variables, but not every case has a value for every variable; where a value is missing, that space is blank, so the entire data set looks like a sparse matrix. A sample of the data looks like this:
10101010 100 10000FM
001 100 100 1000000 F
I want to read this data into R as a data frame. I've tried read.table but it failed.
My code is
m <- read.table("C:/Users/Desktop/testdata.dat", header = FALSE)
R gives me an error message like:
"Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 6 elements"
How do I fix this?

Often a .dat file has some lines of extra information before the actual data. Skip them with the skip argument, as follows:
df <- read.table("C:/Users/Desktop/testdata.dat", header = FALSE, skip = 3)
Otherwise, you can inspect the file with the readLines function, which reads the specified number of lines from the file (pass the n parameter as below):
readLines("C:/Users/Desktop/testdata.dat", n = 5)

Related

Why is read.csv getting wrong classes?

I have to read a big .csv file and read.csv is taking a while. I read that I should use read.csv to read a few rows, get the column classes, and then read the whole file. I tried to do that:
read.csv(full_path_astro_data,
         header = TRUE,
         sep = ",",
         comment.char = "",
         nrows = 100,
         stringsAsFactors = FALSE) %>%
  sapply(class) -> col.classes
df_astro_data <- read.csv(full_path_astro_data,
                          header = TRUE,
                          sep = ",",
                          colClasses = col.classes,
                          comment.char = "",
                          nrows = 47000,
                          stringsAsFactors = FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double?) data was incorrectly classified as integer. This could be because some numeric columns have many zeros at the beginning. So I tried to increase the number of rows in the first read.csv command, but that did not work. One solution I found was to do
col.classes %>%
  sapply(function(x) ifelse(x == "integer", "numeric", x)) -> col.classes
With this the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks
I suspect you are correct: in your row sample some columns contain only integers, but outside your row sample they contain non-integers. This is a common problem with large files. You need to either increase your row sample size or explicitly specify the column type for the columns where you see this happening, as in the sketch below.
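With base R, one way is to override just the problem columns in the sampled classes before the full read. A minimal sketch building on the question's own approach; the column name here is a placeholder:
# classes inferred from a small row sample, as in the question
col.classes <- sapply(read.csv(full_path_astro_data, nrows = 100,
                               stringsAsFactors = FALSE), class)
# force the columns you know should be doubles (placeholder column name)
col.classes[["your_problem_column"]] <- "numeric"
# full read with the corrected classes
df_astro_data <- read.csv(full_path_astro_data,
                          colClasses = col.classes,
                          stringsAsFactors = FALSE)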
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
read_csv(YourPathName,
         col_types = cols(YourProblemColumn1 = col_double(),
                          YourProblemColumn2 = col_double()))
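If you would rather keep the automatic guessing but base it on more rows, readr also exposes a guess_max argument; a small sketch with the same placeholder path:
library(readr)
df <- read_csv(YourPathName, guess_max = 10000)  # guess column types from the first 10,000 rows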

Set column types for csv with read.csv.ffdf

I am using a payments dataset from the Austin, Texas open data portal. I am trying to load the data with the following code:
library(ff)
asd <- read.table.ffdf(file = "~/Downloads/Fiscal_Year_2010_eCheckbook_Payments.csv",
                       first.rows = 100, next.rows = 50,
                       FUN = "read.csv", VERBOSE = TRUE)
This shows me the following error:
read.table.ffdf 301..Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '7AHM'
This happens at the 339th line of the CSV file, in the 5th column of the dataset. I think the reason is that all the values of the 5th column up to that point are integers, whereas this one is a string; the actual type of the column should be string.
So I wanted to know if there is a way I can set the types of the columns.
Below I am providing the types for all the columns in a vector:
c("character","integer","integer","character","character", "character","character","character","character","character","integer","character","character","character","character","character","character","character","integer","character","character","character","character","character","integer","integer","integer","character","character","character","character","double","character","integer")
You can also find the type of each column from the description of the dataset.
Please also keep in mind that I am very new to this library. Practically just found out about it today.
Maybe you need to transform your data types. The following is just an example that may help you:
data <- transform(
  data,
  age = as.integer(age),
  sex = as.factor(sex),
  cp = as.factor(cp),
  trestbps = as.integer(trestbps),
  choi = as.integer(choi),
  fbs = as.factor(fbs),
  restecg = as.factor(restecg),
  thalach = as.integer(thalach),
  exang = as.factor(exang),
  oldpeak = as.numeric(oldpeak),
  slope = as.factor(slope),
  ca = as.factor(ca),
  thai = as.factor(thai),
  num = as.factor(num)
)
sapply(data, class)
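Alternatively, it may be possible to fix the types at read time instead of afterwards: read.table.ffdf forwards extra arguments to FUN (here read.csv), so a colClasses vector should be passed through. A sketch under that assumption, using the types listed in the question; note that ff stores strings as factors rather than character vectors, so "character" is replaced by "factor", and "double" is written as "numeric" (the class name colClasses expects):
library(ff)
# column types adapted from the question's vector for use with ff
col_types <- c("factor", "integer", "integer", "factor", "factor",
               "factor", "factor", "factor", "factor", "factor",
               "integer", "factor", "factor", "factor", "factor",
               "factor", "factor", "factor", "integer", "factor",
               "factor", "factor", "factor", "factor", "integer",
               "integer", "integer", "factor", "factor", "factor",
               "factor", "numeric", "factor", "integer")
asd <- read.table.ffdf(file = "~/Downloads/Fiscal_Year_2010_eCheckbook_Payments.csv",
                       FUN = "read.csv",
                       colClasses = col_types,
                       first.rows = 100,
                       next.rows = 50,
                       VERBOSE = TRUE)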

Large file processing - error using chunked::read_csv_chunked with dplyr::filter

When using the function chunked::read_csv_chunked and dplyr::filter in a pipe, I get an error every time the filter returns an empty dataset on any of the chunks. In other words, this occurs when all the rows from a given chunk of the dataset are filtered out.
Here is a modified example, drawn from the package chunked help file:
library(chunked); library(dplyr)
# create csv file for demo purpose
in_file <- file.path(tempdir(), "in.csv")
write.csv(women, in_file, row.names = FALSE, quote = FALSE)
# reading chunkwise and filtering
women_chunked <-
  read_chunkwise(in_file, chunk_size = 3) %>%  # read only a few lines for the purpose of this example
  filter(height > 150)  # this filters out most lines of the dataset, so for instance
                        # the first chunk (first 3 rows) should return an empty table

# Trying to read the output returns an error message
women_chunked
# > Error in UseMethod("groups") :
# >   no applicable method for 'groups' applied to an object of class "NULL"

# As does, of course, trying to write the output to a file
out_file <- file.path(tempdir(), "processed.csv")
women_chunked %>%
  write_chunkwise(file = out_file)
# > Error in read.table(con, nrows = nrows, sep = sep, dec = dec, header = header, :
# >   first five rows are empty: giving up
I am working on many CSV files, each with 50 million rows, and will thus often end up in a similar situation where the filtering returns an empty table for at least some chunks.
I couldn't find a solution or any post related to this problem. Any suggestions?
I do not think the sessionInfo output is useful in this case, but please let me know if I should post it anyway. Thanks a lot for any help!

unexpected result from skip in read.csv

I am trying to read a csv into R skipping the first 2 rows. The csv is a mixture of blanks, text data and numeric data where the thousand separator is ",".
The file reads into R fine (it gives a 31 x 27 data frame), but when I change the call to include skip = 2, it returns a single column with 282 observations.
I have tried it using the readr package's read_csv function and it works fine.
testdf <- read.csv("test.csv")
This works fine and gives a data frame of 31 obs. of 27 variables.
I get the following warning when trying to use the skip argument:
testdf <- read.csv("test.csv", skip = 2)
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
which results in a single variable with 282 observations

Error when reading in a .txt file and splitting it into columns in R

I would like to read a .txt file into R, and have done so numerous times.
At the moment, however, I am not getting the desired output.
I have a .txt file that contains the data X that I want, plus other data that I do not want, before and after this data X.
Here is a screenshot of the .txt file.
I am able to read in the .txt file as follows:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header=TRUE,skip=88, nrows=266)
This gives me a dataframe with 266 obs of 1 variable.
But I want these 266 observations in 4 columns (ID, Species, Endpoint, BLM NOEC).
So I tried the following script:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header=TRUE,skip=88, nrows=266, sep = " ")
But then I get the error
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
Using sep = "\t" also gives the same error.
And I am not sure how I can fix this.
Any help is much appreciated!
Try read.fwf and specify the widths of each column. Start reading at the Aelososoma sp. row and add the column names afterwards, with something like:
df <- read.fwf("C:/Users/toxicologie/Cobalt/WB1", header = FALSE, skip = 88, n = 266,
               widths = c(2, 35, 15, 15))
colnames(df) <- c("ID", "Species", "Endpoint", "BLM NOEC")
Provide the txt file for a more complete answer.
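If the exact widths are hard to tell from the screenshot, a quick check is to look at a few raw lines first and count characters. A small sketch, assuming the same path as above:
# Inspect a handful of data lines to work out sensible column widths
raw_lines <- readLines("C:/Users/toxicologie/Cobalt/WB1")[90:95]
print(raw_lines)
nchar(raw_lines)  # total line lengths help sanity-check the widths vector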
