[Question closed 1 year ago: needs more focus.]
I am learning to use R for data cleaning work. I've just encountered a problem that I can solve in Python but not in R.
The dataset looks like this: [dataset screenshot]
I want to concatenate the first two columns and assign the result as the index. The first thing I need to do is forward-fill the first column (fillna(method='ffill') in pandas); then I need to concatenate the two columns.
Could you tell me how to do this in R (preferably with the tidyverse)?
The result should look like this: [result screenshot]
Thanks in advance!
Try these. Be sure to read the help pages since many of them have arguments which may need to be set depending on what you want.
zoo::na.locf (last observation carried forward)
zoo::na.locf0
tidyr::fill
data.table::nafill
zoo also has na.aggregate, na.approx, na.contiguous, na.fill, na.spline, na.StructTS and na.trim for other forms of NA filling, and tidyr also has replace_na.
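For the question as asked (forward-fill the first column, then concatenate the first two columns), here is a minimal tidyverse sketch; the column names group and item are assumptions, since the real data is only shown in a screenshot:

library(dplyr)
library(tidyr)

# Hypothetical data standing in for the screenshot: the first column has
# NAs that should be carried forward before the two columns are combined
df <- data.frame(
  group = c("A", NA, NA, "B", NA),
  item  = c("x", "y", "z", "x", "y"),
  value = 1:5
)

df %>%
  fill(group, .direction = "down") %>%   # like pandas fillna(method = "ffill")
  unite("id", group, item, sep = "_")    # concatenate the two columns into one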
[Question closed 3 years ago: needs more focus.]
I'm working with a very large dataset with lots of columns (400+), and every time I create or add a new variable I have to reorder the columns. I want related variables to stay together, so I've been using dplyr::select() to reorder things. Yet there are times when I have to go back very early in my script and add a new variable. When I rerun the whole script after that, there tend to be one or two variables I forgot to put into the later select() calls, so they go missing.
I use select() because selecting all the columns between two variables and referencing them by name is super easy (e.g., Vfour:Vthreefifty). Do you have any tips for reordering datasets with lots of columns?
Given no reproducible example, but using your two column names:
df %>%
  select(starts_with("V"))
You can then chain starts_with as needed.
Other options include:
ends_with, contains, matches
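A sketch combining several of these helpers (the column names are hypothetical); putting everything() last keeps any columns the earlier helpers did not match, which avoids the problem of a variable silently going missing:

library(dplyr)

df %>%
  select(
    starts_with("id"),      # all ID columns first
    contains("date"),       # then anything date-related
    matches("^V[0-9]+$"),   # V1, V2, ... selected by regex
    everything()            # all remaining columns, so nothing is dropped
  )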
[Question closed 6 years ago: needs details or clarity.]
I would like to understand how this line really works:
y <- y[keep, , keep.lib.sizes=FALSE]
in:
keep <- rowSums(cpm(y)>1) >= 3
y <- y[keep, , keep.lib.sizes=FALSE]
I know d.f[a, b], but I cannot find R documentation for d.f[a, , b].
I tried "brackets", "hooks", "commas"... :-(
(Sometimes I would prefer that one does not simplifie his R script !)
Thanks in advance.
Subscripting a data.frame takes two indices: df[rows, columns]. Any further values are optional arguments to the subscripting method.
The most common of those is drop = FALSE, as in df[1:18, 3, drop = FALSE]. This is done because when you subset just one column of a data.frame, the result loses the data.frame class by default. In your specific case, it seems you are using another object that looks like a data.frame but has added functionality from a Bioconductor package (given cpm() and keep.lib.sizes, probably edgeR's DGEList). A look at the methods for that class will tell you how these arguments work.
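A quick illustration of the drop argument on a plain data.frame:

df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

str(df[1:2, 2])                 # a bare integer vector: the data.frame class is dropped
str(df[1:2, 2, drop = FALSE])   # still a one-column data.frame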
[Question closed 6 years ago: needs more focus.]
So I went over this tutorial, and have a few questions:
What is the exact meaning of "columns within the frame of a data.table are seen as if they are variables"?
Is there a particular meaning to the "L" after the 6 in month == 6L? (In the data.table itself it is just 6, not 6L.)
I understand how to calculate the mean of every column grouped by something, but what if I simply want the mean of each column? (Assume I have many columns, so I don't want to write out all the names.)
Thanks!
Expanding the quote: "you don't have to use DT$ repetitively since columns within the frame of a data.table are seen as if they are variables". Referring to columns by bare name within a data.table is like using the with() function: it minimizes typing and can make lines more readable.
"L" is an R marker that says treat the preceding number as an integer (not a numeric (double)).
Use .SD; for example, to get the sum of all columns grouped by the variable byVariable in the data.table dt:
myDT <- dt[, lapply(.SD, sum), by="byVariable"]
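And for question 3 (the mean of every column with no grouping), .SDcols restricts .SD to the columns of interest; the data here is hypothetical:

library(data.table)

dt <- data.table(byVariable = c("a", "a", "b"), x = 1:3, y = 4:6)

dt[, lapply(.SD, sum), by = "byVariable"]        # grouped sums, as above
dt[, lapply(.SD, mean), .SDcols = c("x", "y")]   # mean of each column, no grouping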
[Question closed 7 years ago: needs more focus.]
In RStudio I have a data set of 10 columns, and I am required to add a further column (variable) that is the average of two columns already there.
We have been told to use the formula Tav = (Tmax + Tmin) / 2 to create an extra column for the average of tmax and tmin, but it does not work for me.
I have attached an image showing my situation.
I have tried to search for a solution on this site and others but cannot seem to find anything that helps my specific situation.
Thanks for any help in advance.
Next time, please read these first before posting: https://stackoverflow.com/help/how-to-ask and "How to make a great R reproducible example?"
Based on your screenshot, I think this is what you're after:
abp$tav <- (abp$tmax + abp$tmin) / 2
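The tidyverse equivalent, assuming (per the screenshot) the data frame is called abp with columns tmax and tmin:

library(dplyr)

abp <- abp %>%
  mutate(tav = (tmax + tmin) / 2)   # new column: row-wise average of tmax and tmin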
[Question closed 7 years ago: needs details or clarity.]
I'm currently using the R package data.table to process big datasets.
I'm wondering if there is a difference between the syntax
DT[,v]
and the syntax:
DT$v
where DT is my data.table object and v is the variable I want to select.
I know that the dollar sign is usually used for data frames and that [, v] is what data.table examples always use. However, both work and seem to take similar times to execute (in my experience with 5 million rows).
Do you know if they are processed differently, and whether one is more efficient when processing even larger datasets?
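If you want to measure this yourself rather than eyeball it, here is a minimal benchmark sketch (the data is made up):

library(data.table)
library(microbenchmark)

DT <- data.table(v = rnorm(5e6))

microbenchmark(
  bracket = DT[, v],   # data.table's [ method
  dollar  = DT$v,      # plain list-style extraction
  times   = 100
)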