R reset cumsum when it finds 0 [duplicate]

This question already exists:
R ffdfdply reset cumsum using data.table
Closed 9 years ago.
I am using the ff package to load an Excel file.
i=as.ffdf(data.frame(a=c(1,1,1,1,1,1), b=c(1,4,6,2,5,3), c=c(1,1,1,1,1,1), d=c(1,0,1,1,0,1)))
I am trying to get the cumulative sum of column d and reset it whenever a 0 is encountered. I am trying to get the output below.
a b c d Result
1 1 1 1 1
1 4 1 0 0
1 6 1 1 1
1 2 1 1 2
1 5 1 0 0
1 3 1 1 1
I know I can easily achieve this with ddply, but I have a large data set, i.e. > 5000000 rows.
Thanks

This works, but is a little bit slow with 24385601 rows. I created a unique combination of columns a and c and used Arun's solution. The key column (key_a_c) is used to split the data set, i.e. to reset the cumsum.
Create a unique key on columns a and c
library(ffbase)  # ikey() and ffdfdply() come from ffbase, not ff itself
i$key_a_c <- ikey(i[c("a", "c")])
Generate the cumulative series by splitting on key_a_c
p1 <- ffdfdply(i, split = as.character(i$key_a_c), FUN = function(x) {
  # the rle trick restarts the running count after every 0 within each key group
  x$Result <- as.ff(x[, "d"] * sequence(rle(x[, "d"])$lengths))
  as.data.frame(x)
}, trace = TRUE)
Please share your views and code if you have a more optimized solution.
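For anyone whose data fits in memory, the same rle trick can be written directly in data.table. This is only a sketch of the idea behind the linked duplicate (the dt object below just mirrors the toy data above), not a replacement for the ff-backed version:
library(data.table)
dt <- data.table(a = c(1,1,1,1,1,1), b = c(1,4,6,2,5,3),
                 c = c(1,1,1,1,1,1), d = c(1,0,1,1,0,1))
# d * sequence(rle(d)$lengths) restarts the running count after every 0,
# because each run of zeros multiplies its counter away
dt[, Result := d * sequence(rle(d)$lengths), by = .(a, c)]
dt
#    a b c d Result
# 1: 1 1 1 1      1
# 2: 1 4 1 0      0
# 3: 1 6 1 1      1
# 4: 1 2 1 1      2
# 5: 1 5 1 0      0
# 6: 1 3 1 1      1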


R set column to maximum of current entry and specified value in an elegant way [duplicate]

This question already has answers here:
R: Get the min/max of each item of a vector compared to single value
(1 answer)
Replace negative values by zero
(5 answers)
Closed 1 year ago.
NOTE: I technically know how to do this, but I feel like there has to be a "nicer" way. If such questions are not allowed here, just delete it, but I would really like to improve my R style, so any suggestions are welcome.
I have a dataframe data <- data.frame(foo=rep(c(-1,2),5))
foo
1 -1
2 2
3 -1
4 2
5 -1
6 2
7 -1
8 2
9 -1
10 2
Now I would like to be able to set the entries of foo to a certain value (for this example, let's say 1) if the current entry is smaller than that value.
So my desired output would be
foo
1 1
2 2
3 1
4 2
5 1
6 2
7 1
8 2
9 1
10 2
I feel like there should be something like data$foo <- max(data$foo, 1) that does the job (but of course, it "maxes" over the whole column).
Is there an elegant way to do this?
data$foo <- ifelse(data$foo < 1,1,data$foo) and data$foo <- lapply(data$foo,function(x) max(1,x)) just feel somewhat "ugly".
max gives you the maximum of the whole column, but in your case you need pmax (parallel maximum), which gives you the element-wise maximum of 1 and each number in the vector.
data$foo <- pmax(data$foo, 1)
data
# foo
#1 1
#2 2
#3 1
#4 2
#5 1
#6 2
#7 1
#8 2
#9 1
#10 2
This works:
data <- data.frame(foo=rep(c(-1,2),5))
val <- 1
data[data$foo < val, ] <- val
Let's break this down. data$foo takes the column and makes it into a vector. data$foo < val checks which elements of this vector are smaller than val, creating a new vector of the same length filled with TRUE and FALSE at the correct positions.
Finally, the entire line data[data$foo < val, ] <- val uses that vector of TRUE and FALSE to select the rows (using [, ]) of data to which val is then assigned.
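To make the logical-indexing step concrete, here is a small sketch of the intermediate vector (same toy data as above). Note that the row assignment is harmless here only because data has a single column; with a wider data frame it would overwrite every column in the selected rows, so you would index the foo column explicitly:
data <- data.frame(foo = rep(c(-1, 2), 5))
val <- 1
data$foo < val
# [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
data$foo[data$foo < val] <- val  # column-targeted variant of the same idea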

Threshold values row-wise in a dataframe

Consider an example data frame:
A B C v
5 4 2 3
7 1 3 5
1 2 1 1
I want to set all elements of a row to 1 if the element is bigger or equal than v, and 0 otherwise. The example data frame would result in the following:
A B C v
1 1 0 3
1 0 0 5
1 1 1 1
How can I do this efficiently? The number of columns will be much higher, and I would like a solution that does not require me to specify the column names individually, but applies to all of them (except v).
My solution with a for loop is way too slow.
We can create a logical matrix and coerce it to binary:
df1[-4] <- +(df1[-4] >= df1$v)
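A small sketch with the example data (assuming the data frame is called df1, as in the answer, with v as the fourth column):
df1 <- data.frame(A = c(5, 7, 1), B = c(4, 1, 2), C = c(2, 3, 1), v = c(3, 5, 1))
# the comparison recycles v down each column, so every value in A:C is compared
# against the v of its own row; the unary + then turns TRUE/FALSE into 1/0
df1[-4] <- +(df1[-4] >= df1$v)
df1
#   A B C v
# 1 1 1 0 3
# 2 1 0 0 5
# 3 1 1 1 1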

In R, select row of a matrix that match a vector [duplicate]

This question already has answers here:
Find all rows of matrix equal to vector
(3 answers)
Closed 6 years ago.
R beginner here, I need your help. Let's say we have a matrix like this one:
1 2 3
1 1 0 0
2 0 1 0
3 0 0 1
4 1 1 0
5 1 0 1
6 0 1 1
7 1 1 1
Next we have a certain vector, e.g. (1, 0, 1), which would match row 5.
What's the best way to get the index 5 from the matrix given that vector?
I have already read the questions
R - fastest way to select the rows of a matrix that satisfy multiple conditions
and
In R, select rows of a matrix that meet a condition
but I think the situation differs in this case. Thanks for your input!
I can propose a combination of the which, apply, and all functions.
m <- matrix(c(1,0,0,0,1,0,0,0,1,1,1,0,1,0,1,0,1,1,1,1,1), 7, byrow=TRUE)
which(apply(m, 1, function(x) return(all(x == c(1,0,1)))))
[1] 5
We can use rowSums
which(rowSums(m1 == rep(c(1,0,1), each = nrow(m1)))==3)
#5
#5
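To see how the rowSums comparison works, here is a sketch using the matrix m built in the first answer (the answer above calls its matrix m1):
target <- c(1, 0, 1)
# rep(target, each = nrow(m)) lines the target up column by column with m, so
# m == rep(...) is TRUE exactly where an element matches its target value;
# a row matching in all three columns therefore sums to length(target)
which(rowSums(m == rep(target, each = nrow(m))) == length(target))
# [1] 5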

R: Aggregate and create columns based on counts [duplicate]

This question already has answers here:
Frequency counts in R [duplicate]
(2 answers)
Closed 7 years ago.
I'm sure this question has been asked before, but I can't seem to find an answer anywhere, so I apologize if this is a duplicate.
I'm looking for R code that allows me to aggregate a variable in R, but while doing so creates new columns that count instances of levels of a factor.
For example, let's say I have the data below:
Week Var1
1 a
1 b
1 a
1 b
1 b
2 c
2 c
2 a
2 b
2 c
3 b
3 a
3 b
3 a
First, I want to aggregate by week. I'm sure this can be done with group_by in dplyr. I then need to be able to cycle through the code and create a new column each time a new level appears in Var1. Finally, I need counts of each level of Var1 within each week. Note that I can probably figure out a way to do this manually, but I'm looking for an automated solution as I will have thousands of unique values in Var1. The result would be something like this:
Week a b c
1 2 3 0
2 1 1 3
3 2 2 0
I think from the way you worded your question, you've been looking for the wrong thing/something too complicated. It's a simple data-reshaping problem, and as such can be solved with reshape2:
library(reshape2)
#create wide dataframe (from long)
res <- dcast(Week ~ Var1, value.var = "Var1",
             fun.aggregate = length, data = data)
> res
Week a b c
1 1 2 3 0
2 2 1 1 3
3 3 2 2 0
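For completeness, a base-R sketch of the same counting (an alternative, assuming the example data frame is called data with the Week and Var1 columns shown above): table() builds the contingency counts directly.
data <- data.frame(Week = rep(1:3, times = c(5, 5, 4)),
                   Var1 = c("a", "b", "a", "b", "b",
                            "c", "c", "a", "b", "c",
                            "b", "a", "b", "a"))
# cross-tabulate Week against Var1, then keep the wide matrix shape
as.data.frame.matrix(table(data$Week, data$Var1))
#   a b c
# 1 2 3 0
# 2 1 1 3
# 3 2 2 0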

increase in one variable nested within another column in R + setting 0 as starting value

I'm trying to use the diff function to calculate the increase in a variable ("damage") in this dataset (df). I want to fill the column "damage_new" with this new variable. The values that you see now are the values I would like to have.
df = data.frame(id=c(1,1,1,2,2), trial=c(1,3,4,1,2), damage=c(1,NA,3,1,5))
df
ID TRIAL DAMAGE DAMAGE_NEW
1 1 1 0
1 3 NA NA
1 4 3 NA
2 1 1 0
2 2 5 4
If I run diff(df$damage), it will calculate the differences over the whole dataset.
Two things that I haven't managed are:
- How to nest the difference within the values of another column? Specifically, I want to calculate the damage increase (for the whole dataset), but within a single individual (ID), of which I have repeated measurements.
- I would also like the damage_new column to be the same length as the rest of the dataset (so I can attach it), and for each individual, have the first value of damage_new set to 0, since obviously the first measurement has no reference.
- To further describe the dataset: I have NAs in the "damage" column, which I suspect will lead to more NAs in the damage_new column, but I would like to keep them (and I wonder how the function deals with them?). I also don't have the same number of measurements per individual (they have a different number of trials, with some missing in between).
Thanks a lot for the always fast and efficient answers!
The dplyr package is great for this kind of thing:
library(dplyr)
df %>% group_by(id) %>% mutate(damage_new=c(0,diff(damage)))
Source: local data frame [5 x 4]
Groups: id
id trial damage damage_new
1 1 1 1 0
2 1 3 NA NA
3 1 4 3 NA
4 2 1 1 0
5 2 2 5 4
You can read more about dplyr usage here
Update
If you'd like to go with the base R, you could do:
df$damage_new <- ave(df$damage,df$id,FUN=function(v) c(0,diff(v)))
which will produce the same df.
The data.table library is your friend here:
> library(data.table)
> setDT(df)
> setkey(df, id, trial)
> df[,new_damage:=c(0,diff(damage)),by=id]
> df
id trial damage new_damage
1: 1 1 1 0
2: 1 3 NA NA
3: 1 4 3 NA
4: 2 1 1 0
5: 2 2 5 4
Regarding how diff handles NA: anything you subtract from NA gives NA:
> diff(c(1,3,4,NA,5,7))
[1] 2 1 NA NA 2
