Exporting an R data.table in .txt or .csv format

Recently I have run into a problem exporting data from R to a more "common" format such as .csv or .txt.
My dataset is a data.table with 149,000 rows by 124 columns. I used the following lines of code to try to export it:
write.table(data_reduced,"directory/data_reduced.txt",sep="\t",row.names=FALSE)
write.csv2(data_reduced,"directory/data_reduced.csv")
In both cases, the resulting .txt or .csv file has fewer rows than it should, and the number changes between attempts (roughly 900 to 1,800 rows, more or less). Usually I get the first rows and then the very last one.
I have tried converting the data.table to a matrix or data.frame, but the result is more or less the same. I have also tried the write.xlsx function, but I have problems with Java (a common issue, judging from the SO forum and other web sources).
I have also read about a function called fwrite for exporting very large datasets, but it looks like R cannot find it, even though I installed the data.table package.
Can anyone give me an explanation or solution for this problem? I have been reading different sources to sort it out, but with no success so far.
I use RStudio Version 0.99.473.
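A minimal sketch of the fwrite() route, on the assumption that the installed data.table simply predates fwrite() (the function only exists in sufficiently recent versions of the package); the paths mirror the ones above:
install.packages("data.table")   # update the package so that fwrite() is available (restart R first if data.table is loaded)
library(data.table)
fwrite(data_reduced, "directory/data_reduced.txt", sep = "\t")
fwrite(data_reduced, "directory/data_reduced.csv")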

Related

Importing multiple excel files into R with different format variables

I use lapply to import multiple (hundreds of) Excel files at once (using the read_excel function to import a specific cell range) into R, followed by rbind.fill to build a single R data frame, and this has always worked. However, this time a variable (with the same name in every file), which is a date, has two different formats in different Excel files: in some files it is a double (POSIXct) and in others it is a character. I think I need to get them into the same format before importing, but I don't know how to do it. Hope someone can help. Much appreciated.
I tried wrapping read_excel and as.character in a single function and then calling rbind.fill, and got the message that "all inputs to rbind must be data frames".
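A minimal sketch of one way to do the coercion before binding, assuming the shared column is named "date" and the files sit in a folder "data" (both names, and the cell range, are illustrative):
library(readxl)
library(plyr)
files <- list.files("data", pattern = "\\.xlsx$", full.names = TRUE)
read_one <- function(path) {
  df <- read_excel(path, range = "A1:G100")   # illustrative cell range
  df$date <- as.character(df$date)            # force one common type across all files
  as.data.frame(df)                           # rbind.fill expects data frames
}
combined <- rbind.fill(lapply(files, read_one))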

R: Exporting massive files within a loop

I have code that generates several very large data frames in a loop. Each one has around 300 million rows, so I run out of memory before the loop is over. I am trying to export each data frame once it is constructed within the loop and then remove it to free up space in my R environment before I start constructing the next.
The issue is how to export these very large datasets. I tried using fwrite from the data.table package, but when I open the csv file I get an empty file called Book1 instead. I also tried saving it as a .dta file using write.dta from the foreign package, but Stata tells me it is corrupted when I try to open it.
Saving it as a .csv with fwrite and then opening it in Stata worked perfectly!
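A minimal sketch of that write-then-free pattern, with a hypothetical build_chunk() standing in for whatever constructs each data frame:
library(data.table)
for (i in 1:10) {
  df <- build_chunk(i)                               # hypothetical constructor for one large data frame
  fwrite(df, sprintf("output/chunk_%02d.csv", i))    # write it to disk immediately
  rm(df)                                             # drop it from the environment
  gc()                                               # release the memory before the next iteration
}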

When I upload my excel file to R, the column titles are in the rows and the data seems all jumbled. How do I fix this?

Hi, literally day-one new coder here.
On the Excel sheet my data looks organized, but when I load the file into R it is not able to read the Excel file properly: the column headers end up in the rows and the data seems randomized.
So far I have tried:
library(readxl)
dataset <-read_excel("pathname")
View(dataset)
Also tried:
dataset <-read_excel("pathname", sheet=1, colNames=TRUE)
Also tried to use the package openxlsx
but nothing is giving me the correct, organized data set.
I tried converting my Excel file to a CSV file, and the CSV looks exactly like the data that shows up in R (both are messed up).
How should I approach this problem?
I deal with importing .xlsx into R frequently. It can be challenging due to the flexibility of the Excel platform. I generally use readxl::read_xlsx() to fetch data from .xlsx files. My suggestions:
First, specify exactly the data you want to import with the range argument.
A cell range to read from, as described in cell-specification. Includes typical Excel ranges like "B3:D87", possibly including the sheet name like "Budget!B2:G14".
Second, if there are merged cells or other formatting challenges in the column headers, I resort to setting col_names = FALSE and supplying clean names after import with names(df) <- c("first_col", "second_col"), as in the sketch below.
Third, if there are merged cells elsewhere in the spreadsheet, I generally resort to "fixing" them in Excel (not ideal, but easier for my use case); others may have suggestions on a programmatic fix.
It may be helpful to provide a screenshot of your spreadsheet.
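For example, a minimal sketch combining the first two suggestions (the file name, range, and column names are all illustrative):
library(readxl)
# Read only the data block, skipping messy headers, then name the columns by hand.
df <- read_xlsx("report.xlsx", range = "Budget!B3:D87", col_names = FALSE)
names(df) <- c("first_col", "second_col", "third_col")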

Read a sample from sas7bdat file in R

I have a sas7bdat file around 80 GB in size. Since my PC has only 4 GB of memory, the only option I can see is reading some of its rows. I tried the sas7bdat package in R, which gives the error "big endian files are not supported".
The read_sas() function in haven seems to work, but it only supports selecting specific columns, while I need to read some subset of rows with all columns. For example, it would be fine if I could read 1% of the data just to understand it.
Is there any way to do this? Any package which can work?
Later on I plan to read parts of the file and divide it into 100 or so sections
If you have Windows you can use the SAS Universal Viewer, which is free, and export the dataset to CSV. Then you can import the CSV into R in more readable chunks using this method.
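A minimal sketch of chunked reading with base R, assuming the exported file is called "dataset.csv" (the file name and chunk size are illustrative):
chunk_size <- 100000
col_names <- names(read.csv("dataset.csv", nrows = 1))            # grab the header only
first_chunk <- read.csv("dataset.csv", skip = 1, nrows = chunk_size,
                        header = FALSE, col.names = col_names)    # read the first block of rows
Increasing skip by chunk_size on each pass walks through the file one section at a time.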

Reading Excel with R

I am trying to decide whether to read Excel files directly into R or convert them to CSV first. I have researched the various options for reading Excel, and I also found that reading Excel has its downsides, such as conversion of date and numeric column data types.
XLConnect - dependent on Java
read.xlsx - slow for large data sets
read.xlsx2 - fast, but you need to use the colClasses argument to specify the desired column classes
ODBC - may have conversion issues
gdata - dependent on Perl
I am looking for a solution that will be fast enough for at least a million rows with minimal data conversion issues. Any suggestions?
EDIT
So I have finally decided to convert to CSV and then read the CSV file, but now I have to figure out the best way to read a large CSV file (with at least 1 million rows).
I found out about the read.csv.ffdf function (in the ff package), but it does not let me use my own column class. Specifically this:
setAs("character","myDate", function(from){ classFun(from) } )
colClasses =c("numeric", "character", "myDate", "numeric", "numeric", "myDate")
z<-read.csv.ffdf(file=pathCsv, colClasses=colClassesffdf)
This does not work, and I get the following error:
Error in ff(initdata = initdata, length = length, levels = levels,
ordered = ordered, : vmode 'list' not implemented
I am also aware of the RSQLite and ODBC functionality but do not wish to use them. Is there a solution to the above error, or any other way around this?
Since this question was asked, Hadley Wickham has released the R package readxl, which wraps C and C++ libraries to read .xls and .xlsx files, respectively. It is a big improvement on the previous possibilities, but not without problems. It is fast and simple, but if you have messy data you will have to do some work whichever method you choose. Going down the .csv route isn't a terrible idea, but it does introduce a manual step in your analysis and relies on whichever version of Excel you happen to use giving consistent CSV output.
All the solutions you mentioned will work - but if manually converting to .csv and reading with read.csv is an option, I'd recommend that. In my experience it is faster and easier to get right.
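A minimal sketch of that route, assuming the converted file is called "data.csv" and using placeholder column types (the file name, types, and date format are all illustrative):
df <- read.csv("data.csv",
               colClasses = c("numeric", "character", "character",
                              "numeric", "numeric", "character"),
               stringsAsFactors = FALSE)
df[[3]] <- as.Date(df[[3]], format = "%Y-%m-%d")   # convert date columns after reading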
If you want speed and large data, then you might consider converting your Excel file(s) to a database format, then connecting R to the database.
A quick Google search turned up several links for converting Excel files to SQLite databases; then you could use the RSQLite or sqldf package to read them into R.
Or use the ODBC package if you convert to one of the databases that works with ODBC. Field-conversion problems should be fewer if you do the conversion to the database correctly.
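For example, a minimal sketch of the SQLite route, assuming the spreadsheet has already been loaded into a file "data.sqlite" containing a table "mytable" (both names are illustrative):
library(RSQLite)
con <- dbConnect(SQLite(), "data.sqlite")        # open the database file
df <- dbGetQuery(con, "SELECT * FROM mytable")   # pull the table into a data frame
dbDisconnect(con)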
