Reading gzipped files using read.table() - r

I'm getting an error when trying to reload the cross-reference and hprd tables from the UCSC Table Browser:
kgxref = read.table("kgXref.txt.gz", sep = "\t", as.is = T, comment = "", header = T, quote = "")
hprd = read.table("humanHprdP2P.txt.gz", as.is = T, header = T, comment = "")
I keep getting the error:
Error in read.table("kgXref.txt.gz", sep = "\t", as.is = T, comment = "", : invalid numeric 'as.is' expression
I've checked that my filenames are typed correctly, the working directory is set to the correct folder, and I've tried to load the files both with and without the ".gz" extensions (I have both the zipped and unzipped versions in the wd).
I should probably add that I had this exact code working a few weeks ago. I recently updated my OS (macOS Mojave), R (3.6.0), and RStudio (1.2.1335) last week in order to install a few packages that were not compatible with my older versions. I feel like this may have something to do with it.
Any help would be appreciated! Thanks in advance!

It seems that the as.is parameter is expected to be a vector, according to the read.table documentation:
the default behavior of read.table is to convert character variables (which are not converted to logical, numeric or complex) to factors. The variable as.is controls the conversion of columns not otherwise specified by colClasses. Its value is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors.
Note: to suppress all conversions including those of numeric columns, set colClasses = "character".
Note that as.is is specified per column (not per variable) and so includes the column of row names (if any) and any columns to be skipped.
So, in order to check whether this is the problem, I would remove the parameter entirely:
hprd = read.table("humanHprdP2P.txt.gz", header = T, comment = "")
Then, if it works, specify a vector of the column indices that should be kept "as is" (e.g. c(2, 3)) or a vector of logicals, as sketched below.
I am sorry I cannot be more precise without a minimal working example of this, but I hope it helps.
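For concreteness, a hedged sketch of both forms; the column positions here are made up, not the real layout of humanHprdP2P.txt.gz:
# column indices: keep columns 2 and 3 as character
hprd <- read.table("humanHprdP2P.txt.gz", header = TRUE, comment.char = "", as.is = c(2, 3))
# logical vector, recycled across columns if necessary
hprd <- read.table("humanHprdP2P.txt.gz", header = TRUE, comment.char = "", as.is = c(FALSE, TRUE, TRUE))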

I am guessing that by using as.is = TRUE we are trying to stop character columns from being converted to factors and keep them as class character. Also, those files have no headers.
Here are options:
# keep strings as character (avoid factors)
kgxref <- read.table("kgXref.txt.gz", stringsAsFactors = FALSE, sep = "\t")
Using data.table::fread:
# use fread instead; with default settings it reads the file as expected
library(data.table)
kgxref <- fread("kgXref.txt.gz")
Or even better, using fread, we can directly get the table from the link:
# fread with a link to zipped data file from UCSC
kgxref <- fread("http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/kgXref.txt.gz")
humanHprdP2P <- fread("http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/humanHprdP2P.txt.gz")

Related

Problem with loading huge excel file(100mb) when using read_xlsx. Returns wrongly "TRUE" in some cells

I am working with a huge dataframe and had some problems loading it from the Excel file. I could only load it using read_xlsx from the readxl package. However, I have now realized that some of the cells contain "TRUE" instead of the real value from the Excel file. How can it load the file wrongly, and is there any solution to avoid this?
Following this advice solved the problem.
JasonAizkalns: Hard to tell, but this may be caused by allowing read_xlsx to "guess" the column types. If you know the column types beforehand, it's always best to specify them with the col_types parameter. In this case, it may have guessed that a column type was logical when really it's supposed to be something else (say, text or numeric)
I cleaned the dataset of columns with non-numeric values and then used x <- read_xlsx(filename, skip = 1, col_types = "numeric"). After that I used y <- read_xlsx(filename, skip = 1, col_types = "date") on the column containing dates, and cbind(y, x) to complete the dataset with the non-numeric column (a sketch of this is below). It seems that read_xlsx misinterprets columns with numeric values if a lot of values are missing.
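For illustration, a hedged sketch along those lines; it uses readxl's per-column "skip" type instead of cleaning the file by hand, and the file name "data.xlsx" and the layout (one date column followed by nine numeric columns) are made up:
library(readxl)
# numeric columns only (skip the date column)
x <- read_xlsx("data.xlsx", skip = 1, col_types = c("skip", rep("numeric", 9)))
# just the date column
y <- read_xlsx("data.xlsx", skip = 1, col_types = c("date", rep("skip", 9)))
# reassemble the full dataset
full <- cbind(y, x)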

In R, can I get only the column names of a csv (txt) file?

I have many big files, but I would like to get only the names of the columns without loading them.
Using the data.table packages, I can do
df1 <- fread("file.txt")
names1 <- names(df1)
But getting all the names of all the files this way is very expensive. Is there some other option?
Many functions to read in data have optional arguments that allow you to specify how many lines you'd like to read in. For example, the read.table function would allow you to do:
df1 <- read.table("file.txt", nrows=1, header=TRUE)
colnames(df1)
I'd bet that fread() has this option too.
(Note that you may even be able to get away with nrows=0, but I haven't checked to see if that works)
EDIT
As a commenter kindly points out, fread() and read.table() work a little differently.
For fread(), you'll want to supply the argument nrows=0:
df1 <- fread("file.txt", nrows=0) ##works
As per the documentation,
nrows=0 is a special case that just returns the column names and types; e.g., a dry run for a large file or to quickly check format consistency of a set of files before starting to read any.
But nrows=0 is one of the ignored cases when supplied to read.table():
df1 <- read.table("file.txt") ##reads entire file
df1 <- read.table("file.txt", nrows=-1) ##reads entire file
df1 <- read.table("file.txt", nrows=0) ##reads entire file
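To apply this across many files, a small sketch (it assumes the files sit in the working directory and end in .txt):
library(data.table)
files <- list.files(pattern = "\\.txt$")
# column names only, no data rows are read
all_names <- lapply(files, function(f) names(fread(f, nrows = 0)))
names(all_names) <- files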

Reading large csv file with missing data using bigmemory package in R

I am using large datasets for my research (4.72 GB) and I discovered the "bigmemory" package in R, which supposedly handles large datasets (up to the range of 10 GB). However, when I use read.big.matrix to read a csv file, I get the following error:
> x <- read.big.matrix("x.csv", type = "integer", header=TRUE, backingfile="file.bin", descriptorfile="file.desc")
Error in read.big.matrix("x.csv", type = "integer", header = TRUE,
: Dimension mismatch between header row and first data row.
I think the issue is that the csv file is not full, i.e., it is missing values in several cells. I tried removing header = TRUE but then R aborts and restarts the session.
Does anyone have experience with reading large csv files with missing data using read.big.matrix?
It may not solve your problem directly, but you might find a package of mine, filematrix, useful. The relevant function is fm.create.from.text.file.
Please let me know if it works for your data file.
Did you check the bigmemory PDF at https://cran.r-project.org/web/packages/bigmemory/bigmemory.pdf?
It is clearly described right there.
write.big.matrix(x, 'IrisData.txt', col.names=TRUE, row.names=TRUE)
y <- read.big.matrix("IrisData.txt", header=TRUE, has.row.names=TRUE)
# The following would fail with a dimension mismatch:
if (FALSE) y <- read.big.matrix("IrisData.txt", header=TRUE)
Basically, the error means there is a column with row names in the CSV file. If you don't pass has.row.names=TRUE, bigmemory treats the row names as a separate data column, so the header row has one fewer field than the data rows and you get the dimension mismatch.
I personally found the data.table package more useful for dealing with large data sets (see the sketch below), YMMV
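For what it's worth, a hedged sketch of the data.table route for the x.csv from the question; fread simply fills missing cells with NA:
library(data.table)
# empty cells become NA rather than breaking the read
x <- fread("x.csv", header = TRUE, na.strings = c("", "NA"))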

Is there a sed-type package in R for removing embedded NULs?

I am processing the US Weather service Storm Data, which has one large CSV data file for each year from 1950 onwards. The 1999 year file contains several rows with very large freeform text fields which contain embedded NUL characters, in an otherwise vanilla ascii database. (The offending file is at ftp://ftp.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1999_c20140915.csv.gz).
R cannot handle corrupted string data without errors, and this includes R data.frame, data.table, stringr, and stringi package functions (all tried).
I can clean the files of NULs with sed, but I would prefer not to use external programs, as this is for an R markdown type report with embedded code.
Suggestions?
Maybe this could be of help:
in.file <- file(description = "StormEvents_details-ftp_v1.0_d1999_c20140915.csv",
                open = "r")
writeLines(iconv(readLines(in.file), to = "ASCII"),
           con = "StormEvents_ascii.csv")
I was able to read the csv file without errors with this call to read.table:
options(stringsAsFactors = FALSE)
StormEvents <- read.table("StormEvents_ascii.csv", header = TRUE,
                          sep = ",", fill = TRUE, quote = '"')
Obviously you'd need to change the class of several columns, since all are considered character as it is.
Just for posterity - you can use binary reads (readBin()) and replace the NULs with anything else - see
Removing "NUL" characters (within R)
An update for May 2020: the tidyverse and data.table both still choke on null characters within files; however, the base read.*() family and readLines() will gracefully skip them with the skipNul=TRUE option. You can read a file in, skipping over the null characters, and then write it back out again.
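A minimal sketch of that round trip, using the 1999 file from the question (the output name StormEvents_clean.csv is just an example):
# read past the embedded NULs, write a clean copy, then read it normally
lines <- readLines("StormEvents_details-ftp_v1.0_d1999_c20140915.csv", skipNul = TRUE)
writeLines(lines, "StormEvents_clean.csv")
StormEvents <- read.csv("StormEvents_clean.csv", stringsAsFactors = FALSE)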

Inconsistency between 'read.csv' and 'write.csv' in R

The R function read.csv works as the following as stated in the manual: "If there is a header and the first row contains one fewer field than the number of columns, the first column in the input is used for the row names." That's good. However, when it comes to the function write.csv, I cannot find a way to write the csv file in a similar way. So, if I have a file.txt as below:
Column_1,Column_2
Row_1,2,3
Row_2,4,5
Then when I read it using a = read.csv('file.txt'), the row and column names are Row_x and Column_x as expected. However, when I write the matrix a to a csv file again with write.csv(a, 'file2.txt', quote=F), what I get as a result is as below:
,Column_1,Column_2
Row_1,2,3
Row_2,4,5
So, there is a comma at the beginning of this file. And if I were to read this file again using a2 = read.csv('file2.txt'), the resulting a2 would not be the same as the previous matrix a: the row names of a2 would not be Row_x. That is, I do not want a comma at the beginning of the file. How can I get rid of this comma while using write.csv?
The two functions that you have mentioned, read.csv and write.csv, are just specific forms of the more generic functions read.table and write.table.
When I copy your example data into a .csv and try to read it with read.csv, R throws a warning saying that the header line was incomplete, so it resorts to special behaviour to fix it: because the header row has one fewer field, R fills in an empty element at the top left and treats the first column as row names, so the data appears fine in R. But when we write it back to a csv, write.csv does not distinguish header from data, so the empty element that was created only for the header row is written out like any other field. That is what you would expect: R keeps the table rectangular (3x3) rather than writing rows with different numbers of fields.
You actually want that extra comma there, because it lets programs line up the column names with the right columns. To read the file back in with the comma correctly in place, just tell read.csv which column holds the row names (row.names = your_column_number), as in the example below (assuming foo.csv is your data).
y <- read.csv(file = "foo.csv") #this throws a warning because your input is incorrect
write.csv(y, "foo_out.csv")
x <- read.csv(file = "foo.csv", header = T, row.names = 1) #this will read the first column as the row names.
Play around with read/write.csv, but it might be worthwhile to move to the more generic functions read.table and write.table. They offer expanded functionality.
To read a csv with the generic function:
y <- read.table(file = "foo.csv", sep = ",", header = TRUE)
This way you can specify the delimiter, so you can just as easily read in tab-delimited files exported from Excel ("\t") or space-delimited files (" ").
Hope that helps.
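A quick round-trip sketch of the row-name handling described above, using the file.txt from the question:
a  <- read.csv("file.txt")                  # first column becomes the row names
write.csv(a, "file2.txt", quote = FALSE)    # header now starts with a comma, by design
a2 <- read.csv("file2.txt", row.names = 1)  # read that first column back as row names
identical(rownames(a), rownames(a2))        # should be TRUE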
