Closed. This question is opinion-based. It is not currently accepting answers.
Closed 1 year ago.
This post was edited and submitted for review 1 year ago and failed to reopen the post:
Original close reason(s) were not resolved
data <- read_delim("imported_data.csv", delim = ",")
data <- read_csv("imported_data.csv")
data <- read.csv("imported_data.csv")
data <- fread("imported_data.csv")
All these functions produce the same output. Which one should I use?
When it comes to more sophisticated functions, again what should I do?
Thanks.
Use the one that's most appropriate for the situation.
If you are using dplyr and related tidyverse packages, use read_csv or read_delim (both from readr). The former is a convenience wrapper for the latter, so use whichever one seems most logical to you.
If you are using data.table, use fread. data.table generally performs better than dplyr on very large datasets.
If you are not using either of those libraries, use read.csv or read.table because they are included in base R.
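As a rough sketch of the three options side by side (the file contents here are made up for illustration, and the readr and data.table calls only run if those packages happen to be installed):

```r
# Hypothetical sample file written to a temporary location for illustration.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,a,3.5", "2,b,4.0"), path)

# Base R: no extra packages needed; returns a data.frame.
df_base <- read.csv(path)

# Optional readers, run only if the packages are available.
if (requireNamespace("readr", quietly = TRUE)) {
  df_readr <- readr::read_csv(path)   # returns a tibble
}
if (requireNamespace("data.table", quietly = TRUE)) {
  df_dt <- data.table::fread(path)    # returns a data.table
}

str(df_base)
```

The main practical difference is the class of the returned object, so the choice usually follows from whichever package the rest of the script uses.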
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I am learning to use R for data cleaning. I just ran into a problem that I can solve in Python but not in R.
The dataset looks like this: dataset
I want to concatenate the first two columns and use the result as the index. First I need to forward-fill the first column (fillna with 'ffill' in pandas), then concatenate the two columns.
Could you tell me how to do this in R (tidyverse is better)?
The result should look like this:
result
Thanks in advance!
Try these. Be sure to read the help pages since many of them have arguments which may need to be set depending on what you want.
zoo::na.locf (last observation carried forward)
zoo::na.locf0
tidyr::fill
data.table::nafill
zoo also has na.aggregate, na.approx, na.contiguous, na.fill, na.spline, na.StructTS and na.trim for other forms of NA filling and tidyr also has replace_na.
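A minimal sketch of the tidyr route, assuming tidyr is installed. The column names and values below are hypothetical, since the original dataset is not shown; the idea is to forward-fill the first column and then paste the two columns together:

```r
library(tidyr)

# Hypothetical data: the group label appears only on the first row of each group.
df <- data.frame(
  group = c("A", NA, NA, "B", NA),
  item  = c("x", "y", "z", "x", "y"),
  value = c(1, 2, 3, 4, 5)
)

# Carry the last non-missing group label downward (similar to pandas ffill).
filled <- fill(df, group, .direction = "down")

# Combine the two columns into a single key column named "key".
result <- unite(filled, "key", group, item, sep = "_")

result$key
```

The tidyverse generally prefers a key column over row names, so the combined column is left as an ordinary column here rather than being set as an index.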
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
Does anyone know which method of saving data is faster: fwrite from data.table or saveWorkbook from openxlsx?
Not quite an answer, but too long for a comment.
The easy comment is: Just try to benchmark your code with bench::mark
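library(bench)
...
mark(
  data.table::fwrite(data, tempfile(fileext = ".csv")),
  # saveWorkbook() expects a Workbook object, so write.xlsx() is used here
  # to write the data frame directly
  openxlsx::write.xlsx(data, tempfile(fileext = ".xlsx")),
  check = FALSE
)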
The slightly longer comment is: Do you just want the fastest read/write? Then you might want to look into fst and/or qs.
I presented a lightning talk at our last R User Group where I benchmarked different read/write speeds, memory usage, file sizes, etc. You can find the slides here.
Hope that helps
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I need to write a function in R, because my implementation in other languages such as C++ runs very slowly. The function fills a 2D table with data and then summarizes the values of each row for further processing.
I am not sure if it answers your question, but if you work with data you can put them into data frames to take a look at the statistical parameters and for further processing. For example:
df <- data.frame(var1 = c(5, 10, 15), var2 = c(20, 40, 60))
# The 'summary' command gives you some statistical parameters for each column.
summary(df)
# With the 'apply' command you can address the rows.
# In this example you get the mean of each row:
apply(df, 1, mean)
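For the "fill a 2D table, then summarize each row" part of the question, a base R sketch could look like the following. The fill rule (row index times column index) is only a placeholder, since the question does not say how the table is populated:

```r
# Hypothetical 3 x 4 table filled by a simple rule (row index times column index).
tab <- outer(1:3, 1:4)

# Vectorised row summaries; these are usually faster than apply(tab, 1, sum).
row_totals <- rowSums(tab)
row_means  <- rowMeans(tab)

row_totals
row_means
```

rowSums and rowMeans work on matrices and on data frames with numeric columns, so they can replace the apply call above when only sums or means are needed.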
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I would like to understand how this line really works:
y <- y[keep, , keep.lib.sizes=FALSE]
in :
keep <- rowSums(cpm(y)>1) >= 3
y <- y[keep, , keep.lib.sizes=FALSE]
I know d.f[a, b], but I cannot find the R documentation for d.f[a, , b].
I tried searching for "brackets", "hooks", "commas"... :-(
(Sometimes I wish people would not abbreviate their R scripts so much!)
Thanks in advance.
Subscripting a data.frame takes two indices: df[rows, columns]. Leaving an index empty, as in df[rows, ], selects everything along that dimension. Anything after the second comma is an optional named argument that controls how the subsetting is done.
The most common of those is drop = FALSE, as in df[1:18, 3, drop = FALSE]. It is used because selecting a single column of a data.frame returns a plain vector by default, so the result loses the data.frame class. In your specific case, it seems you are using another object that looks like a data.frame but has added functionality from a Bioconductor package (cpm() and keep.lib.sizes suggest a DGEList from edgeR). Looking at the subsetting method for that class will tell you what the extra argument does.
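A small base R illustration of the empty index and the drop argument, using a made-up data frame:

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

one_col_vec <- df[, "a"]                # simplified to an integer vector
one_col_df  <- df[, "a", drop = FALSE]  # stays a data.frame

class(one_col_vec)
class(one_col_df)

# The empty slot between the commas means "all columns".
some_rows <- df[c(TRUE, FALSE, TRUE), , drop = FALSE]
```

The y[keep, , keep.lib.sizes=FALSE] call in the question has the same shape: a logical row filter, an empty column index, and one extra named argument handled by the class's own subsetting method.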
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I'm currently using the R package data.table to process big datasets.
I'm wondering if there is a difference between the syntax
DT[,v]
and the syntax :
DT$v
if DT is my data.table object and v the variable I want to select.
I know that the dollar sign is usually used for data frames and that [,v] is always used in data.table examples. However, they both work and seem to give similar execution times (in my experience with 5 million rows).
Do you know if they are processed differently, and if one is more efficient when processing even larger datasets?