write.table in R writing more rows than in the dataframe

I am working with a very large dataframe (34236 rows, 530 columns), FINAL.Merged. When I try to write it out, it creates a .txt file with 18 more lines (i.e., 34254 rows) than there are in the dataframe.
I tried to replicate this error with other dataframes, but did not get such an error. I have never encountered this issue before and find it very unusual. Does anyone have any clue as to why?
This is the code I am using:
write.table(FINAL.Merged, "Pediatric_cancer_survivors.txt", quote = FALSE, row.names = FALSE, sep = "\t")
UPDATE: In case it is helpful to others, this is how I fixed it:
library("dplyr")
FINAL.Merged <- FINAL.Merged %>%
  mutate_if(is.character, trimws)
My dataframes were read in from SAS data and contained stray whitespace. After trimming the whitespace, I no longer have this problem.
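For anyone who wants to see the mechanism, here is a minimal sketch with made-up data and file names: a trailing newline inside a character value is written out literally when quote = FALSE, which is one way extra lines can appear.
library(dplyr)
df <- data.frame(id = 1:2, note = c("ok", "bad\n"), stringsAsFactors = FALSE)
write.table(df, "demo.txt", quote = FALSE, row.names = FALSE, sep = "\t")
length(readLines("demo.txt"))   # 4 lines instead of the expected 3
df <- df %>% mutate_if(is.character, trimws)   # trimws() also strips \r and \n
write.table(df, "demo.txt", quote = FALSE, row.names = FALSE, sep = "\t")
length(readLines("demo.txt"))   # 3 lines: header + 2 data rows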

Related

Error in x[is.na(x)] <- na.string : replacement has length zero when exporting data frame to openxlsx in R

I have an issue when I try to export a data frame to Excel with the openxlsx library. When I try, this error happens:
openxlsx::write.xlsx(usertl_lp, file = "Mi_Exportación.xlsx")
Error in x[is.na(x)] <- na.string : replacement has length zero
This error may be caused by cells containing vectors (list-columns), so use across to convert them to character:
usertl_lp_clean <- usertl_lp %>% mutate(across(where(is.list), as.character))
openxlsx::write.xlsx(usertl_lp_clean, file = "Mi_Exportación.xlsx")
I posted this here for others in need.
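For context, a minimal sketch of the kind of input that can cause this (object and file names are made up):
library(dplyr)
library(openxlsx)
df <- data.frame(id = 1:2)
df$tags <- list(c("a", "b"), "c")        # a list-column: each cell holds a vector
# write.xlsx(df, "out.xlsx")             # can fail, since list-column cells are not plain values
df_clean <- df %>% mutate(across(where(is.list), as.character))
write.xlsx(df_clean, "out.xlsx")         # each cell is now a single character string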
I think you are looking for the writeData function from the same package.
Check out writeFormula from the same package as well or even write_xlsx from the writexl package.
I was having a similar problem with a data frame, but in my case I was using the related openxlsx::writeData.
The data frame was generated using sapply, with functions that could throw errors depending on the data, so I coded it to fill with NA whenever an error was generated. I ended up with NaNs and NAs in the same column.
What worked for me was applying the following treatment before writeData:
df[is.na(df)] <- ''
So, for your problem, the following may work:
df[is.na(df)] <- ''
openxlsx::write.xlsx(as.data.frame(df), file = "df.xlsx", colNames = TRUE, rowNames = FALSE, append = FALSE)
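As a side note, is.na() is also TRUE for NaN, so that single replacement clears both; a small sketch with made-up values:
df <- data.frame(a = c(1, NaN, NA), b = c("x", NA, "y"), stringsAsFactors = FALSE)
df[is.na(df)] <- ''              # blanks out both the NA and the NaN cells
openxlsx::write.xlsx(df, file = "df.xlsx", colNames = TRUE, rowNames = FALSE)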

Why is read.csv getting wrong classes?

I have to read a big .csv file and read.csv is taking a while. I read that I should use read.csv to read a few rows, get the column classes, and then read the whole file. I tried to do that:
read.csv(full_path_astro_data,
         header = TRUE,
         sep = ",",
         comment.char = "",
         nrows = 100,
         stringsAsFactors = FALSE) %>%
  sapply(class) -> col.classes
df_astro_data <- read.csv(full_path_astro_data,
                          header = TRUE,
                          sep = ",",
                          colClasses = col.classes,
                          comment.char = "",
                          nrows = 47000,
                          stringsAsFactors = FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double) data was incorrectly classified as integer. This could be because some numeric columns contain only integer-looking values (e.g., zeros) in their first rows. So I tried to increase the number of rows in the first read.csv command, but that did not work. One solution I found was to do
col.classes %>%
  sapply(function(x) ifelse(x == "integer", "numeric", x)) -> col.classes
With this the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks
I suspect you are correct that in your row sample some columns contain only integers, but outside your row sample they contain non-integers. This is a common problem with large files. You need to either increase your row sample size or explicitly specify column type for certain columns where you see this happening.
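For example, here is a minimal sketch of the second option in base R (the column name "flux" is made up; substitute whichever column triggers the scan() error):
col.classes <- sapply(read.csv(full_path_astro_data, nrows = 1000,
                               stringsAsFactors = FALSE), class)
col.classes["flux"] <- "numeric"         # hypothetical problem column forced to numeric
# or, more bluntly: col.classes[col.classes == "integer"] <- "numeric"
df_astro_data <- read.csv(full_path_astro_data, colClasses = col.classes,
                          stringsAsFactors = FALSE)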
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
read_csv(YourPathName,
         col_types = cols(YourProblemColumn1 = col_double(),
                          YourProblemColumn2 = col_double()))

write and read.csv different number of columns

A strange problem with write and read.csv. I have ways to work around this, but it would be great if someone could identify what is going on.
I have code from someone else which dynamically creates a series of CSVs by appending new rows. The problem is that read.csv appears to read the newly created csv inconsistently.
Dummy code example:
datfile <- "E:/temp.csv"
write(paste("Name","tempname",sep=","),datfile,1)
write(paste("VShort",50,sep=","),datfile,1,append=T)
write(paste("Short1",1,1,sep=","),datfile,1,append=T)
write(paste("Short2",0,2,sep=","),datfile,1,append=T)
write(paste("Short3",0,2,sep=","),datfile,1,append=T)
write(paste("Long",0,0.3,0.6,1,sep=","),datfile,1,append=T)
write(paste("Short4",2,0,sep=","),datfile,1,append=T)
read.csv(datfile,header=F,colClasses="character")
Seven rows of data are written to the CSV, but read.csv reads in eight rows ("Long" is split over two rows), giving eight rows and three columns.
The problem is fixed by opening temp.csv in Excel and saving; read.csv then reads in the seven lines correctly.
The problem only appears under certain conditions. For example, remove Short3 and there is no problem:
datfile2 <- "E:/temp2.csv"
write(paste("Name","tempname",sep=","),datfile2,1)
write(paste("VShort",50,sep=","),datfile2,1,append=T)
write(paste("Short1",1,1,sep=","),datfile2,1,append=T)
write(paste("Short2",0,2,sep=","),datfile2,1,append=T)
write(paste("Long",0,0.3,0.6,1,sep=","),datfile2,1,append=T)
write(paste("Short4",2,0,sep=","),datfile2,1,append=T)
read.csv(datfile2,header=F,colClasses="character")
Six rows and five columns are read in.
Any ideas what is going on here?
R version 3.2.4 Revised
Windows 10
This is probably related to the following in ?read.csv:
The number of data columns is determined by looking at the first five
lines of input (or the whole file if it has less than five lines), or
from the length of col.names if it is specified and is longer. This
could conceivably be wrong if fill or blank.lines.skip are true, so
specify col.names if necessary (as in the ‘Examples’).
It just happens that the row with the most columns is the sixth row in your first example.
I suggest using col.names to get around this, e.g.:
read.csv(..., col.names = paste0('V', 1:6))
As the OP notes in a comment to this answer, you can find out the required number of columns using readLines:
Ncol <- max(unlist(lapply(strsplit(readLines(datfile), ","), length)))
and then modify the above to give:
read.csv(datfile, header = FALSE, colClasses = "character", col.names = paste0("V", 1:Ncol))
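As a side note, base R's count.fields reports the field count per line directly, so the same maximum can be obtained with:
Ncol <- max(count.fields(datfile, sep = ","))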

R - write.table() creates table that isn't read properly by fread()

I have a data.table in R that I'm trying to write out to a .txt file, and then input back into R.
It's a sizeable table of 6.5M observations and 20 variables, so I want to use fread().
When I use
write.table(data, file = "data.txt")
a table of about 2.2 GB is written to data.txt. On manually inspecting it, I can see that there are column names, that it is separated by " ", and that there are quotes on character variables. So everything should be fine.
However,
data <- fread("data.txt")
returns a data.table of 6.5M observations and 1 variable. OK, maybe for some reason fread() isn't automatically understanding the separator string:
data <- fread("data.txt", sep = " ")
All the data is in the proper variables now, but:
- R has added an unnecessary row-number column
- in one (and only one) of my columns, all NAs have been replaced by 9218868437227407266
- all variable names are missing
Maybe fread() isn't recognizing the header, somehow.
data <- fread("data.txt", sep = " ", header = T)
Now my first row of observations has become my column names. Not very useful.
I'm completely baffled. Does anyone understand what's happening here?
EDIT:
row.names = F solved the names problem, thanks Ananda Mahto.
Ran
datasub <- data[runif(1000,1,6497651), ]
write.table(datasub, file = "datasub.txt", row.names = F)
fread("datasub.txt")
fread() seems to work fine for the smaller dataset.
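For reference, here is a minimal sketch of the fix: by default write.table writes row names as an extra, unnamed first column, so suppressing them keeps the header and the data rows the same width, which is what fread() expects.
library(data.table)
write.table(data, file = "data.txt", row.names = FALSE)
data <- fread("data.txt")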
EDIT:
Here is the subset of data I created above:
https://github.com/cbcoursera1/ExploratoryDataAnalysisProject2/blob/master/datasub.txt
This data comes from the National Emissions Inventory (NEI) and is made available by the EPA. More information is available here:
http://www.epa.gov/ttn/chief/eiinformation.html
EDIT:
I can no longer reproduce this issue. It may be that row.names = F solved the issue, or possibly restarting R/clearing my environment/something random fixed the problem.

Error when reading in a .txt file and splitting it into columns in R

I would like to read a .txt file into R, and have done so numerous times.
At the moment, however, I am not getting the desired output.
I have a .txt file that contains the data X that I want, plus other data that I do not want, both before and after data X.
Here is a screenshot of the .txt file.
I am able to read in the txt file as follows:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header=TRUE,skip=88, nrows=266)
This gives me a dataframe with 266 obs of 1 variable.
But I want these 266 observations in 4 columns (ID, Species, Endpoint, BLM NOEC).
So I tried the following script:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header=TRUE,skip=88, nrows=266, sep = " ")
But then I get the error
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
Using sep = "\t" also gives the same error.
And I am not sure how I can fix this.
Any help is much appreciated!
Try read.fwf and specify the widths of each column. Start reading at the Aelososoma sp. row and add the column names afterwards, with something like:
df <- read.fwf("C:/Users/toxicologie/Cobalt/WB1", header = FALSE, skip = 88, n = 266, widths = c(2, 35, 15, 15))
colnames(df) <- c("ID", "Species", "Endpoint", "BLM NOEC")
Provide the txt file for a more complete answer.
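If counting the fixed widths by hand proves fiddly, readr's read_fwf is an alternative sketch, using the same (assumed) widths and skip as above:
library(readr)
df <- read_fwf("C:/Users/toxicologie/Cobalt/WB1",
               col_positions = fwf_widths(c(2, 35, 15, 15),
                                          col_names = c("ID", "Species", "Endpoint", "BLM NOEC")),
               skip = 88, n_max = 266)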
