write.xlsx for genomic data - big data

I have several big files containing more than 200,000 rows each. I have edited them and tried to save them as .xls files using write.xlsx, but I am getting the following error. Any suggestions on how to save them as .xls files?
Error in .jcall(sheet, "Lorg/apache/poi/ss/usermodel/Row;", "createRow", : java.lang.IllegalArgumentException: Invalid row number (65536) outside allowable range (0..65535)

The XLS format cannot hold more than 65,536 rows (indices 0..65535, as the error indicates). Break your 200k rows into four chunks, each of which will then be under 65k rows.
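A minimal sketch of the chunking approach, assuming the data frame is called mydata (a placeholder) and that write.xlsx comes from the xlsx package, whose append = TRUE argument adds further sheets to the same workbook:

library(xlsx)

chunk_size <- 65000  # stay safely under the 65,536-row XLS limit
chunks <- split(mydata, ceiling(seq_len(nrow(mydata)) / chunk_size))

for (i in seq_along(chunks)) {
  write.xlsx(chunks[[i]],
             file = "mydata.xls",
             sheetName = paste0("chunk_", i),
             append = (i > 1))  # first call creates the file, later calls append sheets
}

Alternatively, saving to a .xlsx file sidesteps the problem entirely, since that format allows up to 1,048,576 rows per sheet.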

Related

Deleting excess rows from csv files in R and then combining csv files

I have over 200 csv files containing temperature data from iButton data loggers. The csv files created in OneWireViewer have 14 rows of metadata that I need to get rid of in all of the files (see the image below) so that I can then merge the csv files based on the column headings.
[Image: OneWireViewer output csv file]
I'd love to be able to automate this in some way, as I have around 70 folders (basically one folder per location) with 2-3 csv files from OneWireViewer in each.
I've tried messing around with bits of code I've found online, but I couldn't get anything to work and I'm now incredibly frustrated. Any and all help is greatly appreciated!
If it helps, I did try running the tidyverse code found here: Remove certain rows and columns in multiple csv files under the same folder in R, but I get this error:
Column specification -----------------------------------------------------
Delimiter: ","
chr (1): 1-Wire/iButton Part Number: DS1921G-F5
i Use spec() to retrieve the full column specification for this data.
i Specify the column types or set show_col_types = FALSE to quiet this message.
Error: Can't subset columns that don't exist.
x Locations 2, 3, 4, 5, 6, etc. don't exist.
i There are only 1 column.
Run rlang::last_error() to see where the error occurred.
In addition: Warning messages:
1: One or more parsing issues, see problems() for details
2: One or more parsing issues, see problems() for details
Try something like:
library(data.table)
rbindlist(lapply(list.files(...), fread, skip = 14, ...), ...)
where the ...s stand for each function's arguments. Check out ?list.files, ?data.table::fread and ?data.table::rbindlist to find out more about them. A fuller sketch follows below.
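A minimal end-to-end sketch under those assumptions; the root folder path and the output filename are placeholders, and skip = 14 assumes the real header row comes immediately after the 14 metadata rows:

library(data.table)

# find every csv under the per-location folders
files <- list.files("path/to/iButton_folders", pattern = "\\.csv$",
                    recursive = TRUE, full.names = TRUE)

# skip the 14 metadata rows OneWireViewer writes, then stack by column name
all_data <- rbindlist(lapply(files, fread, skip = 14),
                      use.names = TRUE, fill = TRUE)

fwrite(all_data, "combined_temperatures.csv")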

Ignoring aggregate error for files without row data in R

I am running an analysis on 300+ csv files and getting the desired output; however, my R code terminates when it finds a blank csv file in the list without any row data, generating the following error:
"Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate".
Basically it generates the output until it finds a blank csv in the list, then terminates with this error. I want something in the code that ignores this error, because manually checking for blank files would be very inefficient.
The code is quite large; however, I am initially reading the files using:
filenames <- list.files(pattern = '\\.csv$', recursive = TRUE)
and then reading several columns from multiple csv files and performing aggregation based on particular columns. Interestingly, in one column I am calculating the number of rows from the csv files using
row <- aggregate(formula = . ~ day, data = rowdata, FUN = NROW)
I suspect the error is generated by this function when it reads the blank csv files.
Help needed!
Can you include an if condition to check the number of rows in the data frame?
Something like this:
row <- if (nrow(rowdata) > 0) aggregate(. ~ day, data = rowdata, FUN = NROW) else NULL
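Extending that idea to the whole file list, a hedged sketch that simply skips blank files; read.csv and the day column are assumptions based on the question:

filenames <- list.files(pattern = '\\.csv$', recursive = TRUE)

results <- lapply(filenames, function(f) {
  rowdata <- read.csv(f)
  if (nrow(rowdata) == 0) return(NULL)  # skip blank csv files instead of erroring
  aggregate(. ~ day, data = rowdata, FUN = NROW)
})

# drop the NULL entries left by the blank files
results <- Filter(Negate(is.null), results)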

R: read xlsx not reading excel file when rows surpass 100k

Looks like when I try to read more than 100k rows from an Excel file, I get this error message:
Error: Cell references aren't uniformly A1 or R1C1 format:
df <- read_xlsx("Test.xlsx",
                col_names = TRUE, sheet = "Data",
                range = "G1:AL170000")
If I try to read fewer than 100k rows, it works fine. What am I doing wrong? Any ideas?
Hi, I came across the exact same problem and found a solution in the RStudio Community.
For your problem I would go with:
library(readxl)
df <- read_xlsx("Test.xlsx",
                col_names = TRUE, sheet = "Data",
                range = cell_limits(c(1, 7), c(NA, 38)))
Here c(1, 7) is the upper-left cell (row 1, column G, the 7th column), c(NA, 38) is the lower-right (column AL, the 38th), and the NA leaves the row bound open so readxl reads down to the last populated row.
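For the record, the same open-ended selection can be written with column letters; cell_cols() is a cellranger helper re-exported by readxl, just like cell_limits():

library(readxl)
# columns G through AL, all populated rows
df <- read_xlsx("Test.xlsx", sheet = "Data", col_names = TRUE,
                range = cell_cols("G:AL"))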

.txt files to data frame with read_data (RTextTools); error w/ CSV reference

I'd like to use the RTextTools package (documented on CRAN) to text-mine several documents I've resolved into .txt files. I'm having trouble with read_data().
To read text files, read_data takes a folder pathname and a CSV labeling filenames and training values.
In my directory of text files I run this command:
df_text <- read_data(filepath = getwd(), type = "folder",
                     index = paste0(getwd(), "/dir-3.csv"))
Error in data.frame(Text.Data = frame, Labels = labels_fixed) :
arguments imply differing number of rows: 3, 292
In addition: Warning messages:
1: In readLines(filename) : incomplete final line found on 'C:/contracts/pdfs/text
My CSV file is just two columns that list the filenames I want to read, and a made-up training value I plan to change later:
filename.txt | #
x.txt | 2
y.txt | 2
z.txt | 2
How can I correct these error messages?
The function conditions on the relationship between nrow(labels), the number of rows in your CSV, and length(files), the number of files in the directory.
The easiest way to fix this problem would be to ensure a 1:1 correspondence between files in the directory and files in the index. Barring that, it's something to do with the files in your directory that aren't in your index; it's hard to tell without seeing your directory.
I'm also suspicious that the second column of your index file may be causing problems. Maybe try getting rid of it or giving it string instead of numeric values?
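A hedged sketch of the 1:1 fix: regenerate the index from the directory contents so nrow(labels) matches length(files). The column names and the placeholder label value are assumptions, and pattern = '\\.txt$' assumes read_data only considers the .txt files; if it counts every file in the folder, list.files() without a pattern may be needed instead:

# list exactly the .txt files in the directory
files <- list.files(getwd(), pattern = "\\.txt$")

# one row per file, with a placeholder training value to revise later
index <- data.frame(filename = files, label = 2)
write.csv(index, file.path(getwd(), "dir-3.csv"), row.names = FALSE)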

Save big matrix as csv file - header over multiple rows in excel

I have a matrix with 168 rows and about 6000 columns. The column names are stock identifiers and the row names are dates. I would like to export this matrix as a .csv file. I tried the following:
write.csv(OAS_data, "Some Path")
The export works, but the header of the matrix (the stock identifiers) is spread over the first 3 rows of the .csv file when opened in Excel. The first two rows each contain 2184 column names, and the rest of the names are on the third line. How can I avoid those breaks? The rest of the .csv file looks fine - no line breaks there.
Thank you.
Your best bet is probably to transpose your data and analyze it that way, due to Excel's limits:
write.csv(t(OAS_data), "Some Path")
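If the transposed layout is only for storage, a sketch of undoing the transpose when reading the file back; the filename is a placeholder:

# write dates as columns, stocks as rows
write.csv(t(OAS_data), "OAS_data.csv")

# read it back and transpose again to recover the original orientation
OAS_check <- t(as.matrix(read.csv("OAS_data.csv", row.names = 1,
                                  check.names = FALSE)))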
