How to set a readable xlsx range in read.xlsx() in openxlsx - r

I am using the read.xlsx() function to read an xlsx file with the arguments colNames = FALSE, rowNames = TRUE. Everything was fine, but after adding a row of data it pops up an error saying
Error in ".rowNamesDF<-"(x, value = value) :
missing values in 'row.names' are not allowed
When I checked the problem with View() and rowNames = FALSE, I found that the last row consisted of NA values. However, the manual for read.xlsx() doesn't say how to define a range, and I can't do something like read.xlsx()[1:ncol(), ] either, so I don't know what to do.
My trials:
I tried deleting the last row in the xlsx file, but R keeps saying missing values are introduced.
I know I could use the rowNames = FALSE argument first, remove the last row, and then set the first column as row names with row.names(), but I don't want to do so because I think there is a better solution.

Can you provide an example of the data contained in your Excel file?
Then I can try something based on your data. If I understood correctly, you want to add a line at the end of it, right?
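One thing worth trying: newer versions of openxlsx let read.xlsx() restrict the range directly through its rows and cols arguments. A minimal sketch, assuming your version supports them (the file name and the range below are placeholders):
library(openxlsx)
# read only spreadsheet rows 1-10 and columns 1-5, so the trailing NA row is never read
dat <- read.xlsx("data.xlsx", colNames = FALSE, rowNames = TRUE, rows = 1:10, cols = 1:5)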

Related

SASxport to R: Errors while reading XPT SAS file

Does anyone know how to ignore/skip errors while reading a SAS export format file into R?
require(SASxport)
asc = SASxport::read.xport("..\\LLCP2018.XPT_", keep = cols)
Checking if the specified file has the appropriate header
Extracting data file information...
Reading the data file...
Error in `[.data.frame`(ds, whichds) : undefined columns selected
I have plenty of columns and don't want to check one by one whether each really exists.
I would like to ignore missing ones, but there's no option for that within the function.
EDIT
Found an easy solution:
lu = SASxport::lookup.xport(xfile)
Now I can choose from lu$names and intersect with cols. Still, not every variable can be read, but it's better.
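For example, something along these lines (a sketch; lu$names is taken from the edit above, and the exact structure returned by lookup.xport may vary by version):
lu = SASxport::lookup.xport(xfile)
keep = intersect(cols, lu$names)  # keep only the variables that actually exist
asc = SASxport::read.xport(xfile, keep = keep)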
But when I choose a few columns (which I checked do exist), I get another error that I am unable to skip:
Error in if (any(tooLong)) { : missing value where TRUE/FALSE needed
Why does this stop the reading process and return NULL?
EDIT 2
Found a workaround: the same function exists in a different package:
asc <- foreign::read.xport(xfile)
It works; unfortunately, it loads the whole dataset. If there's some size limitation, there's probably nothing I can do about it.
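If the full read fits in memory, one could at least drop the unwanted columns right afterwards. A small sketch, assuming cols is the vector of wanted variable names from above:
asc <- foreign::read.xport(xfile)
asc <- asc[, intersect(cols, names(asc)), drop = FALSE]  # keep only the columns that exist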

Reading gzipped files using read.table()

I'm getting an error when trying to reload the cross-reference and hprd tables from the UCSC Table Browser:
kgxref=read.table("kgXref.txt.gz",sep="\t",as.is=T,comment="",header=T,quote="")
hprd=read.table("humanHprdP2P.txt.gz",as.is=T,header=T,comment="")
I keep getting the error:
Error in read.table("kgXref.txt.gz", sep = "\t", as.is = T, comment = "", : invalid numeric 'as.is' expression
I've checked that my filenames are typed correctly, the working directory is set to the correct folder, and I've tried to load the files both with and without the ".gz" extensions (I have both the zipped and unzipped versions in the wd).
I should probably add that I had this exact code working a few weeks ago. Last week I updated my OS (macOS Mojave), R (3.6.0), and RStudio (1.2.1335) in order to install a few packages that were not compatible with my older versions. I feel like this may have something to do with it.
Any help would be appreciated! Thanks in advance!
According to the read.table documentation, the as.is parameter is a vector:
the default behavior of read.table is to convert character variables (which are not converted to logical, numeric or complex) to factors. The variable as.is controls the conversion of columns not otherwise specified by colClasses. Its value is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors.
Note: to suppress all conversions including those of numeric columns, set colClasses = "character".
Note that as.is is specified per column (not per variable) and so includes the column of row names (if any) and any columns to be skipped.
So, to check whether this is the problem, I would remove the parameter entirely:
hprd=read.table("humanHprdP2P.txt.gz",header=T,comment="")
Then, if it works, specify a vector with the column indices that should be kept "as is", or a vector of booleans, e.g. c(2, 3).
I am sorry I cannot be more precise with a minimal working example of this, but hope it helps.
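For concreteness, the vector form would look something like this (the column indices here are purely illustrative):
hprd = read.table("humanHprdP2P.txt.gz", header = T, comment.char = "", as.is = c(2, 3))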
I am guessing that by using as.is = TRUE you are trying to keep columns from being converted to factors, i.e. keep them as character. Also, those files have no headers.
Here are some options:
# keep strings as character (avoid factors)
kgxref <- read.table("kgXref.txt.gz", stringsAsFactors = FALSE, sep = "\t")
Using data.table::fread:
# use fread instead, with default settings it reads the file as expected
kgxref <- fread("kgXref.txt.gz")
Or even better, using fread, we can directly get the table from the link:
# fread with a link to zipped data file from UCSC
kgxref <- fread("http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/kgXref.txt.gz")
humanHprdP2P <- fread("http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/humanHprdP2P.txt.gz")

Error encountered when using rbindlist: column 25 of result is determined to be integer64 but maxType == 'Character' != REALSXP

I used the following function to merge all .csv files in my directory into one dataframe:
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, fread), fill = TRUE)
}
dataframe = multmerge(path)
This code produces this error:
Error in rbindlist(lapply(filenames, fread), fill = TRUE) : Internal error: column 25 of result is determined to be integer64 but maxType=='character' != REALSXP
The code has worked on the same csv files before... I'm not sure what has changed or what the error message means.
Looking at the documentation of fread, I just noticed there is an integer64 option. Are you dealing with integers greater than 2^31?
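If that is the case, fread can be told how to read such columns; a sketch (the file name is a placeholder, and "double" would work as well):
dt <- fread("somefile.csv", integer64 = "character")  # read 64-bit integers as character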
EDIT: I added a tryCatch that prints a formatted message to the console indicating which files cause the error, along with the actual error message. However, for rbindlist to then execute over the normal files, you need to return a dummy list that produces an extra column called ERROR, which will have NAs in all rows except the bottom one(s), which will contain the name(s) of the problem file(s).
I suggest that after you run this code through once, you delete the ERROR column and extra row(s) from the data.table and save the combined file as a .csv. I would then move all the files that combined properly into a different folder, keeping only the current combined file and the ones that didn't load properly in the path, and rerun the function with the colClasses argument specified. I combined everything into one script so it's hopefully less confusing:
#First initial run without colClasses
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, function(i) tryCatch(fread(i),
    error = function(e) {
      cat("\nError reading in file:", i, "\t") #Identifies problem files by name
      message(e)                               #Prints error message without stopping loop
      list(ERROR = i)                          #Adds a placeholder column so rbindlist will execute
    })),         #End of tryCatch and lapply
    fill = TRUE) #rbindlist arguments
}                #End of function
#You should get the original error message and identify the filename.
dataframe = multmerge(path)
#Delete placeholder column and extra rows
#You will get as many extra rows as you have problem files -
#most likely just the one with column 25 or any others that had that same issue with column 25.
#Note: the out-of-bounds error message will probably go away with the colClasses argument pulled out.
#Save this cleaned file to something like: fwrite(dataframe,"CurrentCombinedData.csv")
#Move all files but problem file into new folder
#Now you should only have the big one and only one in your path.
#Rerun the function but add the colClasses argument this time
#Second run to accommodate the problem file(s). We know it's column 25 this time, but in the future you may have to adapt this by adding the appropriate column.
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, function(i) tryCatch(fread(i, colClasses = list(character = c(25))),
    error = function(e) {
      cat("\nError reading in file:", i, "\t") #Identifies problem files by name
      message(e)                               #Prints error message without stopping loop
      list(ERROR = i)                          #Adds a placeholder column so rbindlist will execute
    })),         #End of tryCatch and lapply
    fill = TRUE) #rbindlist arguments
}                #End of function
dataframe2 = multmerge(path)
Now we know the source of the error is column 25, which we can specify in colClasses. If you run the code and get the same error message for a different column, simply add that column's number after the 25. Once you have the data read in, I would check what is going on in that column (and any others you had to add). Maybe there was a data-entry error in one of the files, or a different encoding of an NA value. That's why I say to convert that column to character first: you lose less information than by converting to numeric first.
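For example, if column 32 later throws the same error, the fread call inside the function would become (column numbers are illustrative):
fread(i, colClasses = list(character = c(25, 32)))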
Once you have no errors, always write the cleaned, combined data.table to a csv contained in your folder, and always move the individual files that have been combined into the other folder. That way, when you add new files, you will only be combining the big file and a few others, so in the future it is easier to see what is going on. Just keep notes as to which files gave you trouble and which columns. Does that make sense?
Because files are often idiosyncratic, you will have to be flexible, but this workflow should make it easy to identify problem files and add whatever you need to the fread call to make it work. Basically, archive the files that have been processed, keep track of exceptions like the column-25 one, and keep the most current combined file together with the not-yet-processed ones in the active path. Hope that helps and good luck!

Problem with loading a huge Excel file (100 MB) when using read_xlsx. Wrongly returns "TRUE" in some cells

I am working with a huge dataframe and had some problems loading it from the Excel file. I could only load it using read_xlsx from the readxl package. However, I have now realized that some of the cells contain "TRUE" instead of the real value from the Excel file. How can it load the file wrongly, and is there any solution to avoid this?
Following this advice solved the problem.
JasonAizkalns: Hard to tell, but this may be caused by allowing read_xlsx to "guess" the column types. If you know the column types beforehand, it's always best to specify them with the col_types parameter. In this case, it may have guessed the column type was logical when really it's supposed to be something else (say, text or numeric).
I cleaned the dataset of columns with non-numeric values and then used x <- read_xlsx(filename, skip = 1, col_types = "numeric"). After that, I used y <- read_xlsx(filename, skip = 1, col_types = "date") on the column containing dates, and cbind(y, x) to complete the dataset with the non-numeric column. It seems that read_xlsx misinterprets columns with numeric values if a lot of values are missing.
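For reference, readxl also accepts one col_types entry per column, which avoids the cbind step. A sketch, assuming a sheet whose first column holds dates followed by ten numeric columns (the file name and column count are placeholders):
library(readxl)
z <- read_xlsx("my_data.xlsx", skip = 1, col_types = c("date", rep("numeric", 10)))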

Inconsistency between 'read.csv' and 'write.csv' in R

The R function read.csv works as follows, as stated in the manual: "If there is a header and the first row contains one fewer field than the number of columns, the first column in the input is used for the row names." That's good. However, when it comes to the function write.csv, I cannot find a way to write the csv file in a similar way. So, if I have a file.txt as below:
Column_1,Column_2
Row_1,2,3
Row_2,4,5
Then when I read it using a = read.csv('file.txt'), the row and column names are Row_x and Column_x as expected. However, when I write the matrix a to a csv file again, what I get as a result from write.csv(a, 'file2.txt', quote = F) is as below:
,Column_1,Column_2
Row_1,2,3
Row_2,4,5
So there is a comma at the beginning of this file. And if I read this file again using a2 = read.csv('file2.txt'), the resulting a2 will not be the same as the previous matrix a: the row names of a2 will not be Row_x. That is, I do not want a comma at the beginning of the file. How can I get rid of this comma while using write.csv?
The two functions you have mentioned, read.csv and write.csv, are just specific forms of the more generic functions read.table and write.table.
When I copy your example data into a .csv and try to read it with read.csv, R throws a warning saying the header line was incomplete, and resorts to special behaviour to fix the error: it completes the file by adding an empty element at the top left. R understands that this is a header row, so the data appears fine in R. But when we write back to a csv, write.csv doesn't distinguish header from data, so the empty element that only existed in the header row created by R shows up as a regular element. Which you would expect: basically R made the table a full 3x3, because rows can't have differing numbers of elements.
You actually want the extra comma there, because it allows programs to read the column names in the right place. To read the file in again, assuming foo_out.csv is your written data, you can fix the wonky row names by adding an option specifying which column holds the row names (row.names = your_column_number) when you read it back in with the comma correctly in place.
y <- read.csv(file = "foo.csv") #this throws a warning because your input is incorrect
write.csv(y, "foo_out.csv")
x <- read.csv(file = "foo_out.csv", header = T, row.names = 1) #this will read the first column as the row names.
Play around with read/write.csv, but it might be worth while to move into the more generic functions read.table and write.table. They offer expanded functionality.
To read a csv with the generic function:
y <- read.table(file = "foo.csv", sep = ",", header = TRUE)
Thus you can specify the delimiter and easily read in tab-delimited files (sep = "\t"), such as spreadsheets exported from Excel, or space-delimited files (sep = " ").
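In particular, write.table with its default settings writes the row names but, unlike write.csv, does not emit the empty leading header field, which is exactly the format read.csv interprets as row names. A sketch of the round trip, using the matrix a from the question:
write.table(a, "file2.txt", sep = ",", quote = FALSE)  # header has one fewer field: no leading comma
a2 <- read.csv("file2.txt")  # Row_x names are recovered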
Hope that helps.
