Split data.frame by variable and apply function referring to concrete row - r

I need to split a data.frame by some variable and calculate the difference between each row's value and the value from some other specified row.
In the example below, I split df by v1. Then, within each group, for each row I calculate the difference between its v3 value and v3[v2 == "C"].
v1 <- rep(1:4,each = 3)
v2 <- rep(c("A","B","C"),4)
v3 <- rep(1:5,3)[1:12]
res <- c(-2,-1,0,3,4,0,-2,-1,0,3,-1,0)
df <- data.frame(v1,v2,v3,res)
df
v1 v2 v3 res
1 1 A 1 -2
2 1 B 2 -1
3 1 C 3 0
4 2 A 4 3
5 2 B 5 4
6 2 C 1 0
7 3 A 2 -2
8 3 B 3 -1
9 3 C 4 0
10 4 A 5 3
11 4 B 1 -1
12 4 C 2 0
I prefer plyr or data.table, if possible.

Here is a data.table solution:
library(data.table)
setDT(df)
df[, new := v3 - v3[v2=="C"], by = "v1"]
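Since the question also mentions plyr, here is the same grouped difference as a ddply sketch (assuming plyr is installed; the column name new simply mirrors the data.table answer):

```r
library(plyr)

# Rebuild the example data from the question
df <- data.frame(v1 = rep(1:4, each = 3),
                 v2 = rep(c("A", "B", "C"), 4),
                 v3 = rep(1:5, 3)[1:12])

# Within each v1 group, subtract the v3 value of the "C" row
res <- ddply(df, "v1", transform, new = v3 - v3[v2 == "C"])
```

transform is applied once per v1 group, so v3[v2 == "C"] is the single C value of that group, recycled across its rows.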

Related

R: select rows by group after resampling

I want to do bootstrapping manually for a panel dataset. I need to cluster at the individual level to keep later manipulations consistent; that is, all observations for the same individual must enter the bootstrap sample together. What I do is resample with replacement from the vector of unique individual IDs, which is then used as the index.
df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"), v1 = c(3,1,2,4,2,2,5,6,9), v2 = c(1,0,0,0,1,1,0,1,0))
boot.index <- sample(unique(df$ID), replace = TRUE)
Then I select rows according to the index. Supposing boot.index = (B, B, C), I want a data frame like this:
ID v1 v2
B 4 0
B 2 1
B 2 1
B 4 0
B 2 1
B 2 1
C 5 0
C 6 1
C 9 0
Apparently df1 <- df[df$ID == boot.index,] does not give what I want. I tried subset and filter in dplyr; nothing works. Basically this is an issue of selecting whole groups by a group index. Any suggestions? Thanks!
set.seed(42)
boot.index <- sample(unique(df$ID), replace = TRUE)
boot.index
#[1] C C A
#Levels: A B C
do.call(rbind, lapply(boot.index, function(x) df[df$ID == x,]))
# ID v1 v2
#7 C 5 0
#8 C 6 1
#9 C 9 0
#71 C 5 0
#81 C 6 1
#91 C 9 0
#1 A 3 1
#2 A 1 0
#3 A 2 0
Using %in% to select the relevant rows gets the rows for the sampled IDs, but note that it returns each matching row only once, even if an ID was sampled more than once:
> df
ID v1 v2
1 A 3 1
2 A 1 0
3 A 2 0
4 B 4 0
5 B 2 1
6 B 2 1
7 C 5 0
8 C 6 1
9 C 9 0
> boot.index
[1] A B A
Levels: A B C
> df[df$ID %in% boot.index,]
ID v1 v2
1 A 3 1
2 A 1 0
3 A 2 0
4 B 4 0
5 B 2 1
6 B 2 1
dplyr::filter based solution:
> df %>% filter(ID %in% boot.index)
ID v1 v2
1 A 3 1
2 A 1 0
3 A 2 0
4 B 4 0
5 B 2 1
6 B 2 1
You can also do this with a join:
boot.index = c("B", "B", "C")
merge(data.frame("ID" = boot.index), df, by = "ID", all.x = TRUE, all.y = FALSE)
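A dplyr join behaves like the merge above and, unlike %in%, repeats a group's rows once per occurrence of its ID in boot.index (a sketch, assuming dplyr is installed):

```r
library(dplyr)

df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"),
                 v1 = c(3,1,2,4,2,2,5,6,9),
                 v2 = c(1,0,0,0,1,1,0,1,0))
boot.index <- c("B", "B", "C")

# Each occurrence of "B" in boot.index pulls in all three B rows again
boot.df <- inner_join(data.frame(ID = boot.index), df, by = "ID")
```

This duplication is exactly what a cluster bootstrap needs, since an individual drawn twice should contribute its observations twice.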

Aggregate and count in new column

I have a large data frame with the columns V1 and V2, representing an edge list. I want to create a third column, COUNT, which counts how many times that exact edge appears. For example, if V1 == 1 and V2 == 2, I want to count how many other times V1 == 1 and V2 == 2, combine them into one row, and put the count in a third column.
Data <- data.frame(
V1 = c(1,1),
V2 = c(2,2)
)
I've tried something like new = aggregate(V1 ~ V2,data=df,FUN=length) but it's not working for me.
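For reference, the counting can be done in base R by summing a helper column within each (v1, v2) pair; this sketch uses the edge list from the answers below, and the column name count is my choice:

```r
df <- data.frame(v1 = c(1,2,3,4,5,1,2,3),
                 v2 = c(2,3,4,5,6,2,3,4))

# Sum a constant 1 within each unique (v1, v2) pair to get its frequency
counts <- aggregate(count ~ v1 + v2, data = cbind(df, count = 1), FUN = sum)
```

Unlike the attempted aggregate(V1 ~ V2, ...), this groups on both columns at once, so each distinct edge becomes one row.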
Or maybe use data.table:
library(data.table)
df<-data.table(v1=c(1,2,3,4,5,1,2,3,1),v2=c(2,3,4,5,6,2,3,4,3))
df[ , count := .N, by=.(v1,v2)] ; df
v1 v2 count
1: 1 2 2
2: 2 3 2
3: 3 4 2
4: 4 5 1
5: 5 6 1
6: 1 2 2
7: 2 3 2
8: 3 4 2
9: 1 3 1
Assuming the data are structured as:
df<-data.frame(v1=c(1,2,3,4,5,1,2,3),v2=c(2,3,4,5,6,2,3,4),stringsAsFactors = FALSE)
> df
v1 v2
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 1 2
7 2 3
8 3 4
Using the ddply function from the plyr package to get the count of all edge pairs:
library(plyr)
df2 <- ddply(df, .(v1,v2), function(df) c(count = nrow(df)))
> df2
v1 v2 count
1 1 2 2
2 2 3 2
3 3 4 2
4 4 5 1
5 5 6 1

Finding cumulative product if a certain condition is met in R

I have a dataset that looks like the following (called Data):
v1 v2
1 1
1 3
1 5
2 3
2 4
3 1
3 2
I want to return a vector v3 that:
is equal to v2[i] if v1[i] is not equal to v1[i-1]
is equal to v3[i-1]*v2[i] if v1[i] is equal to v1[i-1]
So, in this example, v3 should return
v3
1
3
15
3
12
1
2
I've lagged the column v1 using lag.v1 <- c(NA, Data[1:(nrow(Data)-1), 1]) in order to compare each row to the previous one. I think something similar to the following should work, but with the value of v3 from the previous row instead of the current row.
Data$v3<-ifelse(1*(Data$v1==lag.v1)==1, Data$v3*Data$v2, Data$v2)
In other words, I need to somehow access the previous row of v3 (lag v3) as I'm forming v3 in the above equation.
Help is greatly appreciated, thank you!
You can use ave with cumprod, this calculates the cumulative product of column v2 grouped by v1:
df$v3 <- with(df, ave(v2, v1, FUN=cumprod))
df
# v1 v2 v3
#1 1 1 1
#2 1 3 3
#3 1 5 15
#4 2 3 3
#5 2 4 12
#6 3 1 1
#7 3 2 2
With plyr package, you can use ddply with transform:
plyr::ddply(df, "v1", transform, v3 = cumprod(v2))
# v1 v2 v3
#1 1 1 1
#2 1 3 3
#3 1 5 15
#4 2 3 3
#5 2 4 12
#6 3 1 1
#7 3 2 2
For completeness, a dplyr approach:
library(dplyr)
df %>% group_by(v1) %>% mutate(v3 = cumprod(v2))
#Source: local data frame [7 x 3]
#Groups: v1 [3]
# v1 v2 v3
# <int> <int> <dbl>
#1 1 1 1
#2 1 3 3
#3 1 5 15
#4 2 3 3
#5 2 4 12
#6 3 1 1
#7 3 2 2
We can use data.table
library(data.table)
setDT(df)[, v3 := cumprod(v2), by = v1]
df
# v1 v2 v3
#1: 1 1 1
#2: 1 3 3
#3: 1 5 15
#4: 2 3 3
#5: 2 4 12
#6: 3 1 1
#7: 3 2 2

Creating new dataframe with missing value

I have a data frame structured like this:
time <- c(1,1,1,1,2,2)
group <- c('a','b','c','d','c','d')
number <- c(2,3,4,1,2,12)
df <- data.frame(time,group,number)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 c 2
6 2 d 12
In order to plot the data I need it to contain a value for each group (a through d) at each time interval, even if that value is zero. So, a data frame looking like this:
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a 0
6 2 b 0
7 2 c 2
8 2 d 12
Any help?
You can use expand.grid and merge, like this:
> merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all = TRUE)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a NA
6 2 b NA
7 2 c 2
8 2 d 12
From there, it's just a simple matter of replacing NA with 0.
new <- merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all.y = TRUE)
new[is.na(new$number),"number"] <- 0
new
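tidyr's complete() expresses the merge-then-fill idea in one step (a sketch, assuming the tidyr package is available):

```r
library(tidyr)

df <- data.frame(time = c(1,1,1,1,2,2),
                 group = c('a','b','c','d','c','d'),
                 number = c(2,3,4,1,2,12))

# Expand to every time/group combination, filling missing number with 0
full <- complete(df, time, group, fill = list(number = 0))
```

The fill argument replaces the NA that expansion would otherwise leave, so no separate replacement step is needed.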

Cumulative Sum Starting at Center of Data Frame - R

I have this data.frame called dum
dummy <- data.frame(label = "a", x = c(1,1,1,1,0,1,1,1,1,1,1,1,1))
dummy1 <- data.frame(label = "b", x = c(1,1,1,1,1,1,1,1,0,1,1,1,1))
dum <- rbind(dummy,dummy1)
What I am trying to do is take the cumulative sum starting at 0 in the x column of dum. The summing would be grouped by the label column, which can be implemented in dplyr or plyr. The part that I am struggling with is how to start the cumulative sum from the 0 position in x and go outward.
The resulting data.frame should look like this :
>dum
label x output
1 a 1 4
2 a 1 3
3 a 1 2
4 a 1 1
5 a 0 0
6 a 1 1
7 a 1 2
8 a 1 3
9 a 1 4
10 a 1 5
11 a 1 6
12 a 1 7
13 a 1 8
14 b 1 8
15 b 1 7
16 b 1 6
17 b 1 5
18 b 1 4
19 b 1 3
20 b 1 2
21 b 1 1
22 b 0 0
23 b 1 1
24 b 1 2
25 b 1 3
26 b 1 4
This would need to be iterated thousands of times over millions of rows of data.
As usual, thanks for any and all help
It seems more like you just want to find the distance to a zero, rather than any sort of cumulative sum. If that's the case, then
#find zeros for each group
zeros <- tapply(seq.int(nrow(dum)) * as.numeric(dum$x==0), dum$label, max)
#calculate distance from zero for each point
dist <- abs(zeros[dum$label]-seq.int(nrow(dum)))
And that gives
cbind(dum, dist)
# label x dist
# 1 a 1 4
# 2 a 1 3
# 3 a 1 2
# 4 a 1 1
# 5 a 0 0
# 6 a 1 1
# 7 a 1 2
# 8 a 1 3
# 9 a 1 4
# 10 a 1 5
# 11 a 1 6
# 12 a 1 7
# 13 a 1 8
# 14 b 1 8
# 15 b 1 7
# 16 b 1 6
# 17 b 1 5
# 18 b 1 4
# 19 b 1 3
# 20 b 1 2
# 21 b 1 1
# 22 b 0 0
# 23 b 1 1
# 24 b 1 2
# 25 b 1 3
# 26 b 1 4
Or even ave will let you do it in one step
dist <- with(dum, ave(x,label,FUN=function(x) abs(seq_along(x)-which.min(x))))
cbind(dum, dist)
You can do this with by but also with plyr, data.table, etc. The function that is used on each subset is
f <- function(d) {
  x <- d$x
  i <- match(0, x)
  # cumulative sum leftward from the zero, then rightward after it
  v1 <- rev(cumsum(rev(x[1:i])))
  v2 <- cumsum(x[(i+1):length(x)])
  transform(d, output = c(v1, v2))
}
To call it on each subset e.g. with by
res <- by(dum, list(dum$label), f)
do.call(rbind, res)
If you want to use ddply
library(plyr)
ddply(dum, .(label), f)
May be faster with data.table
library(data.table)
dumdt <- as.data.table(dum)
setkey(dumdt, label)
dumdt[, f(.SD), by = key(dumdt)]
Using dplyr
library(dplyr)
dum %>%
  group_by(label) %>%
  mutate(dist = abs(row_number() - which.min(x)))
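Since the question mentions millions of rows, the same distance-to-zero idea in data.table may scale better (a sketch; which.min(x) locates the single 0 in each group):

```r
library(data.table)

dum <- rbind(data.frame(label = "a", x = c(1,1,1,1,0,1,1,1,1,1,1,1,1)),
             data.frame(label = "b", x = c(1,1,1,1,1,1,1,1,0,1,1,1,1)))
setDT(dum)

# Per group: absolute distance of each row index from the index of the 0
dum[, output := abs(seq_len(.N) - which.min(x)), by = label]
```

Like the other distance-based answers, this assumes exactly one zero per group; with several zeros, which.min returns only the first.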
