Does anyone know which method of saving data is faster: fwrite() from data.table or saveWorkbook() from openxlsx?
Not quite an answer, but too long for a comment.
The easy comment is: just benchmark your code with bench::mark, for example:
library(bench)

# example setup (adjust to your real data): saveWorkbook() needs a Workbook object, not a plain data frame
data <- data.frame(x = rnorm(1e5), y = rnorm(1e5))
wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "Sheet1")
openxlsx::writeData(wb, "Sheet1", data)
mark(
  data.table::fwrite(data, tempfile(fileext = ".csv")),
  openxlsx::saveWorkbook(wb, tempfile(fileext = ".xlsx")),
  check = FALSE
)
The slightly longer comment is: do you just want the fastest read/write? Then you might want to look into fst and/or qs.
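For instance, a minimal sketch of writing and reading with both packages (this assumes fst and qs are installed and uses a throwaway data frame):

library(fst)
library(qs)

df <- data.frame(x = rnorm(1e5), y = rnorm(1e5))
fst_path <- tempfile(fileext = ".fst")
qs_path <- tempfile(fileext = ".qs")

write_fst(df, fst_path)  # fst: fast columnar format for data frames
qsave(df, qs_path)       # qs: fast serialization of arbitrary R objects

df_fst <- read_fst(fst_path)
df_qs <- qread(qs_path)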
I presented a lightning talk at our last R User Group meeting where I benchmarked different read/write speeds, memory usage, file sizes, etc. You can find the slides here.
Hope that helps.
So what I'm trying to do is add a column of 0's to this data frame, but if any of the rows has the code "h353" within any of the columns in that row, then I want that row to have a 1 instead of a 0 in the new column. I'm not even sure if the code works as is, but I just know it's going to take forever to run in its current state since the file is pretty large. Any suggestions on how to fix this/make it more efficient?
current code
This should do the job:
# toy data: three columns of zeros with "h353" planted in a few cells
dat <- data.frame(x = rep(0, 30), y = rep(0, 30), z = rep(0, 30))
dat[2, 2] <- "h353"
dat[15, 3] <- "h353"
dat[20, 1] <- "h353"
dat$md <- 0  # new indicator column, default 0
# OR together, column by column, a logical vector marking rows that contain "h353"
for (i in seq_along(dat)) {
  found <- as.character(dat[, i]) == "h353"
  mdrows <- if (i == 1) found else mdrows | found
}
dat$md[mdrows] <- 1
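On a large file, a vectorized version of the same idea tends to be much faster than the loop. A minimal sketch, assuming the columns to scan are x, y and z (substitute your real column names):

# logical matrix of matches, then flag rows with at least one TRUE
hits <- dat[, c("x", "y", "z")] == "h353"
dat$md <- as.integer(rowSums(hits) > 0)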
data <- read_delim("imported_data.csv", delim = ",")  # readr
data <- read_csv("imported_data.csv")                 # readr
data <- read.csv("imported_data.csv")                 # base R
data <- fread("imported_data.csv")                    # data.table
All these functions produce the same output, so which one should I use?
And when it comes to more sophisticated functions, what should I do then?
Thanks.
Use the one that's most appropriate for the situation.
If you are using dplyr and related tidyverse libraries, use read_csv or read_delim (both come from the readr package). The former is a convenience wrapper for the latter, so use whichever one seems more logical to you.
If you are using data.table, use fread. data.table has better performance on very large datasets compared to dplyr.
If you are not using either of those libraries, use read.csv or read.table, because they are included in base R.
I'm writing R code that calls C++, and the C++ functions use a lot of parallel computing based on OpenMP. This is my first time writing code with OpenMP, and what I see is that even when I set the same C++ random seed, the code never gives the same results.
I have read a lot of posts here where it seems that this is an issue with OpenMP, but they are all old (between 12 and 5 years ago).
I want to know if there are solutions now, and whether there are published articles that explain this problem and/or possible solutions.
Thanks
You need to read up on parallel random number generation. This is not an OpenMP problem, but one that will afflict any use of random numbers in a parallel code.
Start with
Parallel Random Numbers: As Easy as 1, 2, 3 by Salmon et al. (SC 2011)
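To see the underlying idea from the R side (just a sketch using base R's parallel package, not the OP's C++/OpenMP code): give every parallel worker its own reproducibly seeded stream, and the parallel results become repeatable.

library(parallel)

run_once <- function() {
  cl <- makeCluster(2)
  clusterSetRNGStream(cl, iseed = 123)  # independent, reproducible L'Ecuyer streams per worker
  res <- parLapply(cl, 1:4, function(i) rnorm(3))
  stopCluster(cl)
  res
}

identical(run_once(), run_once())  # TRUE: same seed plus per-worker streams gives reproducible output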
I'm currently using the R package data.table to process big datasets.
I'm wondering if there is a difference between the syntax
DT[,v]
and the syntax:
DT$v
where DT is my data.table object and v is the variable I want to select.
I know that the dollar sign is usually used for data frames and that [, v] is what data.table examples always use. However, they both work and seem to take (in my experience with 5 million rows) similar times to execute.
Do you know if they are processed differently, and whether one is more efficient when processing even larger datasets?
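One way to check is a quick benchmark. A minimal sketch, assuming a throwaway data.table DT with a column v (your real object will differ):

library(data.table)
library(bench)

DT <- data.table(v = rnorm(5e6), w = rnorm(5e6))
mark(
  bracket = DT[, v],
  dollar  = DT$v
)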
I have a very big .csv file; it's around a few GB.
I want to read first few thousand lines of it.
Is there any method to do this efficiently?
Use the nrows argument in read.csv(...):
df <- read.csv(file = "my.large.file.csv", nrows = 2000)
There is also a skip= parameter that tells read.csv(...) how many lines to skip before you start reading.
If your file is that large, you might be better off using fread(...) from the data.table package. It takes the same nrows and skip arguments.
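For example, a quick sketch (the file name is just a placeholder):

library(data.table)

# read only the first 2000 data rows; skip= works the same way as in read.csv()
df <- fread("my.large.file.csv", nrows = 2000)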
If you're on UNIX or OS X, you can use the command line:
head -n 1000 myfile.csv > myfile.head.csv
Then just read it into R as normal.
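For example, assuming the head file created above:

df <- read.csv("myfile.head.csv")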