Small files in R: use of RAM

I'm trying to open a 163MB .xlsx file in R.
library(openxlsx)
df <- read.xlsx(xlsxFile = "df.xlsx", sheet = 1, colNames = T)
Reading this (relatively small) file uses all 8GB of RAM on my laptop.
I have a CSV version of this file, but because one of its columns contains both , and ;, using a CSV is not an option. How does this happen, given that I recently loaded a Kaggle file (a 0.5GB CSV) into R and could still use my laptop for browsing the internet?
Edit: the RAM usage plus the output of pryr::object_size(df)

Did you try the readxl package? https://blog.rstudio.org/2017/04/19/readxl-1-0-0/ Its read_xlsx() has the following signature:
read_xlsx(path, sheet = NULL, range = NULL, col_names = TRUE,
          col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf,
          guess_max = min(1000, n_max))
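For the file in the question, a minimal call might look like this (a sketch; "df.xlsx" is the file name from the question):
library(readxl)
df <- read_xlsx("df.xlsx", sheet = 1, col_names = TRUE)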

You can also read it as tab-delimited (read.csv(..., sep = "\t")), or save it as a .txt file and read that as tab-delimited; a tab separator doesn't clash with the commas and semicolons in the data.
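A minimal sketch of that workaround, assuming the data already sits in a data frame df and "df.txt" is a name chosen for illustration:
# write tab-delimited; embedded , and ; no longer conflict with the separator
write.table(df, "df.txt", sep = "\t", row.names = FALSE)
df2 <- read.csv("df.txt", sep = "\t", stringsAsFactors = FALSE)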

It looks like this is (or at least was) a problem with openxlsx. This GitHub issue describes the problem of inflated file sizes and suggests a solution (use the development version): https://github.com/awalker89/openxlsx/issues/161
So, potential solutions:
Try the development version of openxlsx (devtools::install_github("awalker89/openxlsx")).
As suggested by @Ajay Ohri, try the readxl package instead.
Load the file once and save it as a binary R file with save() or saveRDS().
Try again with the .csv file using readr::read_csv() or data.table::fread(); both are faster than base R's read.csv(). A sketch of these last two options follows below.
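A sketch of those options, assuming the CSV version of the file is named "df.csv" and "df.rds" is a name chosen for illustration (quoted fields let commas and semicolons live inside a column):
# fast CSV readers; both handle quoted fields containing , or ;
library(readr)
df <- read_csv("df.csv")

library(data.table)
df <- fread("df.csv")

# cache as a binary R file for quick reloads later
saveRDS(df, "df.rds")
df <- readRDS("df.rds")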

Related

Write file xlsx in R without using Java

Right now I'm having an issue with R: I can't use the openxlsx library on my PC. When I try to load it, I get this error:
Error: package or namespace load failed for ‘openxlsx’:
object ‘zipr’ is not exported by 'namespace:zip'
Can anyone suggest a different method for saving a file in xlsx format? I tried the writexl package, but I don't know how to save data to separate worksheets; the description of write_xlsx in that package only shows this:
write_xlsx(
  x,
  path = tempfile(fileext = ".xlsx"),
  col_names = TRUE,
  format_headers = TRUE
)
Can you use the WriteXLS package?
WriteXLS(x, ExcelFileName = tempfile(fileext = ".xlsx"), SheetNames = NULL, row.names = FALSE, col.names = TRUE)
Edit:
If you're trying to save data to different sheets, pass the names of your data frames as a character vector; WriteXLS writes one worksheet per data frame:
mySheets <- c("data1", "data2")
WriteXLS(mySheets, ExcelFileName = "output.xlsx") # file name for illustration
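If you'd rather stay with the writexl package from the question, a named list of data frames also produces one worksheet per element; a minimal sketch (data1 and data2 are hypothetical data frames):
library(writexl)
# names of the list elements become the worksheet names
write_xlsx(list(data1 = data1, data2 = data2), path = "output.xlsx")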
If you'd like to keep using the openxlsx library to save xlsx files with multiple worksheets, you can fix the namespace error by upgrading the zip package to the latest version (v2.1.0); it's the same issue as posted here. The zip package is a dependency of openxlsx.
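A minimal sketch of that fix (assuming a plain CRAN install is enough to get zip >= 2.1.0):
install.packages("zip") # upgrade the zip dependency
library(openxlsx)       # should now load without the 'zipr' namespace error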

Does readr read_csv allow one to specify a specific file in a zip?

The readr package in the tidyverse can automatically unpack a zip file and convert its contents to a tibble. But I have a zip file that holds multiple csv files. In the line of code below, SSPdataZip has three files in it. When I run it I get a warning "Multiple files in zip ..." along with the name of the file it chose. I know the name of the one I want but can't figure out how to tell read_csv what it is. Is there an option I'm missing?
temp <- readr::read_csv(SSPdataZip, col_names = TRUE, guess_max = 2000)
I believe you can use unz to achieve this:
readr::read_csv(unz(description = "SSPdataZip", filename = "FileName.csv"), col_names = TRUE, guess_max = 2000)
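If the exact member name isn't known in advance, the archive contents can be listed first (a sketch, reusing the "SSPdataZip" path from the question):
# list the files inside the archive to find the exact csv name
unzip("SSPdataZip", list = TRUE)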

I can't import all the rows of my csv file with read.csv in R

I have a very large dataset in a csv file. When I open it in a data visualization tool (Spotfire) I can see that it has more than 7 million rows.
However, when I try to open the file in RStudio with read.csv2, being careful with the quotes and other options that could affect my dataset, I end up with about 4 million rows.
Here is the code I use to import the file:
my_data <- read.csv2(
  file,
  sep = ";",
  header = TRUE,
  na.strings = c("", " ", "NA"),
  quote = "",
  check.names = FALSE,
  stringsAsFactors = FALSE
)
Moreover, when I look at the data in RStudio with View(my_data), the rows that were read in look perfectly correct.
Is it related to a size limit on files in RStudio, or something like that?
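One way to narrow this down (a sketch, assuming file holds the path from the code above): compare the file's raw line count with the number of rows read, and try a reader that warns about dropped rows:
# physical line count in the file, to compare with nrow(my_data)
length(readLines(file))

# data.table::fread warns when it has to discard or repair rows
dt <- data.table::fread(file, sep = ";", quote = "")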

read an Excel file embedded in a website

I would like to read automatically in R the file which is located at
https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017
This link generates the automatic download of a zipfile. This zipfile contains the Excel file I want to read in R.
Does any of you have any suggestions on this? Thanks.
Panagiotis' comment to use download.file() is generally good advice, but I couldn't make it work here (and would be curious to know why). Instead I used httr.
(Edit: got it, I had reversed the arguments of download.file()... Repeat after me: always use named args...)
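For reference, the download.file() call with named arguments (mode = "wb" keeps the zip binary-safe on Windows; the destination path is chosen for illustration):
download.file(
  url = "https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017",
  destfile = "./data/rte_data.zip",
  mode = "wb"
)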
Another problem with this data: it appears not to be a regular xls file; I couldn't open it with the otherwise excellent readxl package.
It looks like a tab-separated flat file, but I had no success with read.table() either. readr::read_delim() made it work.
library(httr)
library(readr)
r <- GET("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# Write the archive on disk
writeBin(r$content, "./data/rte_data")
rte_data <- read_delim(
  unzip("./data/rte_data", exdir = "./data/"),
  delim = "\t",
  locale = locale(encoding = "ISO-8859-1"),
  col_names = TRUE
)
There are still parsing problems, but I'm not sure they should be dealt with in this SO question.
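As a side note (not part of the original answer), readr records parse failures, so they can at least be inspected:
# one row per parsing failure: row, column, expected vs. actual value
readr::problems(rte_data)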

Read excel file with formulas in cells into R

I was trying to read an Excel spreadsheet into an R data frame. However, some of the columns contain formulas or are linked to other external spreadsheets. Whenever I read the spreadsheet into R, many cells become NA. Is there a good way to fix this problem so that I can get the original values of those cells?
The R script I used to do the import is like the following:
options(java.parameters = "-Xmx8g")
library(XLConnect)
# Step 1 import the "raw" tab
path_cost = "..."
wb = loadWorkbook(...)
raw = readWorksheet(wb, sheet = '...', header = TRUE, useCachedValues = FALSE)
UPDATE: read_excel from the readxl package looks like a better solution. It's very fast (0.14 seconds on the 1400 x 6 file I mentioned in the comments) and it evaluates formulas before import. It doesn't use Java, so there's no need to set any Java options.
library(readxl)
# sheet can be a string (name of sheet) or an integer (position of sheet)
raw = read_excel(file, sheet = sheet)
For more information and examples, see the short vignette.
ORIGINAL ANSWER: Try read.xlsx from the xlsx package. The help file implies that by default it evaluates formulas before importing (see the keepFormulas parameter). I checked this on a small test file and it worked for me. Formula results were imported correctly, including formulas that depend on other sheets in the same workbook and formulas that depend on other workbooks in the same directory.
One caveat: If an externally linked sheet has changed since the last time you updated the links on the file you're reading into R, then any values read into R that depend on external links will be the old values, not the latest ones.
The code in your case would be:
options(java.parameters = "-Xmx8g") # must be set before loading xlsx, which uses Java via rJava
library(xlsx)
# Replace file and sheetName with appropriate values for your file
# keepFormulas=FALSE and header=TRUE are the defaults. I added them only for illustration.
raw = read.xlsx(file, sheetName=sheetName, header=TRUE, keepFormulas=FALSE)
