Correlations by grouping twice in R, using dplyr or aggregate?

My (toy) data looks like:
   Item_Id Location_Id date price
1        A        5372    1   0.5
2        A        5372    2    NA
3        A        5372    3   1.0
4        A        6065    1   1.0
5        A        6065    2   1.0
6        A        6065    3   3.0
7        A        7000    1    NA
8        A        7000    2    NA
9        A        7000    3    NA
10       B        5372    1   3.0
11       B        5372    2    NA
12       B        5372    3   1.0
13       B        6065    1   2.0
14       B        6065    2   1.0
15       B        6065    3   3.0
16       B        7000    1   8.0
17       B        7000    2    NA
18       B        7000    3   9.0
In reality there are hundreds of unique Item_Ids and Location_Ids.
Data
Item_Id = c(rep('A', 9), rep('B', 9))
Location_Id = rep(c(rep(5372, 3), rep(6065, 3), rep(7000, 3)), 2)
date = rep(1:3, 6)
price = c(0.5, NA, 1, 1, 1, 3, NA, NA, NA, 3, NA, 1, 2, 1, 3, 8, NA, 9)
df = data.frame(Item_Id, Location_Id, date, price)
I ultimately want to get the median correlation (over locations) of the price series for every item with every other item. I tried writing a loop in the hope that it would be quick (not finished):
for (item in items) {
  remainingitems = items[items != item]
  for (item2 in remainingitems) {
    cortemp = numeric(0)
    for (locat in locations) {
      print(locat)
      a = pricepanel[pricepanel$Item_Id == item &
                       pricepanel$Location_Id == locat, ]$price
      b = pricepanel[pricepanel$Item_Id == item2 &
                       pricepanel$Location_Id == locat, ]$price
      cortemp = c(cortemp, cor(cbind(a, b), use = "pairwise.complete.obs")[2])
    }
  }
}
But I stopped because it was much too slow. The innermost loop alone took several minutes, and there are hundreds of stores and items. Basically I want to get the correlation matrix (every product with every other product) for every location, and then take the element-wise median across those matrices.
I expect there is an efficient way to do this, but I am new to this kind of thing in R. I tried reading up on dplyr, since I suspect the solution lies there, but I got stuck.
The interim output would be something like:
$`5372`
   A  B
A  1 -1
B -1  1

$`6065`
          A         B
A 1.0000000 0.8660254
B 0.8660254 1.0000000

$`7000`
   A  B
A  1 NA
B NA  1
Then the final output would be the elementwise median of all those location matrices.
Final:
           A          B
A  1.0000000 -0.0669873
B -0.0669873  1.0000000

You could get the "interim" output using dplyr and tidyr:
library(dplyr)
library(tidyr)

cors <- df %>%
  spread(Item_Id, price) %>%
  group_by(Location_Id) %>%
  do(correlation = cor(.[, -(1:2)], use = "pairwise.complete.obs"))
The way this works is that the spread function (from tidyr) spreads the A's, B's, C's, etc. into their own columns:
df %>% spread(Item_Id, price)
#   Location_Id date   A  B
# 1        5372    1 0.5  3
# 2        5372    2  NA NA
# 3        5372    3 1.0  1
# 4        6065    1 1.0  2
# 5        6065    2 1.0  1
# 6        6065    3 3.0  3
# 7        7000    1  NA  8
# 8        7000    2  NA NA
# 9        7000    3  NA  9
(This should work with any number of "Items": A, B, C, D, ...) The group_by(Location_Id) call then tells the code to operate within each location. Finally, the do command finds the correlation of the columns within each group (. is a placeholder for "the data within each group"), while ignoring the first two columns, Location_Id and date.
The above code produces a result that looks like:
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# Location_Id correlation
# 1 5372 <dbl[2,2]>
# 2 6065 <dbl[2,2]>
# 3 7000 <dbl[2,2]>
The correlation column is a list of your three within-location matrices. At that point you can use the solution in this question to take the elementwise median:
apply(simplify2array(cors$correlation), c(1,2), median, na.rm = TRUE)
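As an aside, spread and do have since been superseded; here is a minimal sketch of the same computation with the newer tidyr/dplyr verbs (an assumption on my part: tidyr >= 1.0 for pivot_wider and dplyr >= 1.0 for group_map, operating on the same df as above):
library(dplyr)
library(tidyr)

# pivot_wider replaces spread; group_map replaces do and returns a plain list
cor_list <- df %>%
  pivot_wider(names_from = Item_Id, values_from = price) %>%
  group_by(Location_Id) %>%
  group_map(~ cor(select(.x, -date), use = "pairwise.complete.obs"))

# elementwise median across the per-location matrices, as above
apply(simplify2array(cor_list), c(1, 2), median, na.rm = TRUE)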

Here's a possible split/apply solution using base R:
lapply(split(df[, c("Item_Id", "price")], df$Location_Id),
       function(x) {
         cor(matrix(x$price, nrow = nrow(x) / length(unique(x$Item_Id))),
             use = "pairwise.complete.obs")
       })
# $`5372`
# [,1] [,2]
# [1,] 1 -1
# [2,] -1 1
#
# $`6065`
# [,1] [,2]
# [1,] 1.0000000 0.8660254
# [2,] 0.8660254 1.0000000
#
# $`7000`
# [,1] [,2]
# [1,] NA NA
# [2,] NA 1
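One caveat: the matrix() trick above relies on the rows being ordered by Item_Id and date within each location. A variant that does not depend on row order (a sketch, assuming one price per item/location/date combination) builds the date-by-item matrix explicitly with tapply:
lapply(split(df, df$Location_Id), function(x) {
  # rows = dates, columns = items; combinations with no observation become NA
  m <- with(x, tapply(price, list(date, Item_Id), c))
  cor(m, use = "pairwise.complete.obs")
})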
And here's a similar solution to @David's using the data.table package:
library(data.table)

DT <- dcast.data.table(as.data.table(df),
                       Location_Id + date ~ Item_Id,
                       value.var = "price")[, -2, with = FALSE]
Res <- DT[, .(Res = list(cor(.SD, use = "pairwise.complete.obs"))), Location_Id]
You can then view the cor matrices using:
Res$Res
# [[1]]
# A B
# A 1 -1
# B -1 1
#
# [[2]]
# A B
# A 1.0000000 0.8660254
# B 0.8660254 1.0000000
#
# [[3]]
# A B
# A NA NA
# B NA 1
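The elementwise median across locations then works the same way as in the dplyr answer (a sketch; na.rm = TRUE matches the question's expected final output):
apply(simplify2array(Res$Res), c(1, 2), median, na.rm = TRUE)
#            A          B
# A  1.0000000 -0.0669873
# B -0.0669873  1.0000000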

Related

Combining elements of one column into two columns by group in R

Given a two-column data.frame, with one column containing group labels and the other containing integer values ordered from smallest to largest, how can the data be expanded to create pairs of combinations of the integer column?
I'm not sure of the best way to state this. I'm not interested in all possible combinations, but rather all unique combinations starting from the lowest value.
In R, the combn function gives the desired output ignoring groups, for example:
t(combn(seq(1:4),2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since the first value is 1, we get the unique combination (1,2) and not the additional combination (2,1), which I don't need. How would one then apply a similar method by group?
For example, given a data.frame:
test <- data.frame(Group = rep(c("A", "B"), each = 4),
                   Val = c(1, 3, 6, 8, 2, 4, 5, 7))
test
  Group Val
1     A   1
2     A   3
3     A   6
4     A   8
5     B   2
6     B   4
7     B   5
8     B   7
I was able to come up with this solution that gives the desired output (note it uses filter from dplyr):
library(dplyr)

test <- data.frame(Group = rep(c("A", "B"), each = 4),
                   Val = c(1, 3, 6, 8, 2, 4, 5, 7))
j = 1
for (i in unique(test$Group)) {
  if (j == 1) {
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test1 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    j = j + 1
  } else {
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test2 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    test1 <- rbind(test1, test2)
  }
}
test1
   Group Val1 Val2
1      A    1    3
2      A    1    6
3      A    1    8
4      A    3    6
5      A    3    8
6      A    6    8
7      B    2    4
8      B    2    5
9      B    2    7
10     B    4    5
11     B    4    7
12     B    5    7
However, this is not elegant and is really slow as the number of groups and length of each group become large. It seems like there should be a more elegant and efficient solution but so far I have not come across anything on SO.
I would appreciate any ideas!
Here is a data.table approach:
library(data.table)

# make test a data.table
setDT(test)

# split by group
L <- split(test, by = "Group")

# get unique combinations of 2 Vals
L2 <- lapply(L, function(x) {
  as.data.table(t(combn(x$Val, m = 2, simplify = TRUE)))
})

# merge them back together
data.table::rbindlist(L2, idcol = "Group")
# Group V1 V2
# 1: A 1 3
# 2: A 1 6
# 3: A 1 8
# 4: A 3 6
# 5: A 3 8
# 6: A 6 8
# 7: B 2 4
# 8: B 2 5
# 9: B 2 7
#10: B 4 5
#11: B 4 7
#12: B 5 7
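The split/lapply/rbindlist round trip can also be collapsed into a single grouped call; a compact sketch (assuming, as in the example, that Val is already sorted within each Group):
library(data.table)

setDT(test)
test[, as.data.table(t(combn(Val, 2))), by = Group]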
You can set simplify = FALSE in combn() and then use unnest_wider() from tidyr:
library(dplyr)
library(tidyr)
test %>%
  group_by(Group) %>%
  summarise(Val = combn(Val, 2, simplify = FALSE)) %>%
  unnest_wider(Val, names_sep = "_")
# Group Val_1 Val_2
# <chr> <dbl> <dbl>
# 1 A 1 3
# 2 A 1 6
# 3 A 1 8
# 4 A 3 6
# 5 A 3 8
# 6 A 6 8
# 7 B 2 4
# 8 B 2 5
# 9 B 2 7
# 10 B 4 5
# 11 B 4 7
# 12 B 5 7
Another option uses gtools::combinations via purrr (n = length(.x) keeps it robust to unequal group sizes):
library(tidyverse)

df2 <- split(test$Val, test$Group) %>%
  map(~ gtools::combinations(n = length(.x), r = 2, v = .x)) %>%
  map(~ as_tibble(.x, .name_repair = "unique")) %>%
  bind_rows(.id = "Group")

Averaging row and column cells from multiple data frames

I have multiple data frames, like:
DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5))
head(DG)
  y v x
1 1 3 4
2 3 4 6
3 1 5 4
4 3 6 6
5 1 7 4
6 3 8 6
head(DT)
  y v x
1 1 3 4
2 3 4 5
3 1 5 4
4 3 6 5
5 1 7 4
6 3 8 5
head(DF)
  y v  x
1 1 3 12
2 3 4 14
3 1 5 12
4 3 6 14
5 1 7 12
6 3 8 14
I want to calculate the mean of each 'row', but across the corresponding column of each data frame, i.e. the resulting data frame I need looks like:
  y                              v                              x
1 'mean(DG(y1), DT(y1), DF(y1))' 'mean(DG(v1), DT(v1), DF(v1))' 'mean(DG(x1), DT(x1), DF(x1))'
2 'mean(DG(y2), DT(y2), DF(y2))' 'mean(DG(v2), DT(v2), DF(v2))' 'mean(DG(x2), DT(x2), DF(x2))'
3 'mean(DG(y3), DT(y3), DF(y3))' 'mean(DG(v3), DT(v3), DF(v3))' 'mean(DG(x3), DT(x3), DF(x3))'
...
In reality, y, v and x are different locations and 1 - 6 time steps. I want to average my data for each time step and location. Eventually, I need one data set, that looks like one of the example data sets, but with averaged values in each cell.
I have a working example with loops, but for large datasets it is very slow, so I tried various combinations with apply and rowSums, but neither worked out.
If I understand correctly, there are many data frames which all have the same structure (number, name and type of columns) as well as the same number of rows (time steps). Some data points may contain NA.
The code below creates a large data.table from the single data frames and computes the mean values for each time step and location across the different data frames:
library(data.table)
rbindlist(list(DG, DF, DT), idcol = TRUE)[
, lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]
time_step y v x
1: 1 1 3 6.666667
2: 2 3 4 8.333333
3: 3 1 5 6.666667
4: 4 3 6 8.333333
5: 5 1 7 6.666667
6: 6 3 8 8.333333
This also works with NAs, e.g.:
DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5,NA))
Note that column x of DT has been modified.
rbindlist(list(DG, DF, DT), idcol = TRUE)[
, lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]
time_step y v x
1: 1 1 3 6.666667
2: 2 3 4 8.333333
3: 3 1 5 8.000000
4: 4 3 6 8.000000
5: 5 1 7 7.000000
6: 6 3 8 10.000000
Note that x in rows 3 and 6, where DT now has NAs, is averaged over only two values (and rows 4 and 5 also shift because of the recycling).
If you only have the three data frames, I would recommend
result = (DG + DT + DF) / 3
result
# y v x
# 1 1 3 6.666667
# 2 3 4 8.333333
# 3 1 5 6.666667
# 4 3 6 8.333333
# 5 1 7 6.666667
# 6 3 8 8.333333
This assumes that your rows and columns are already in the correct order.
If you have more data frames, put them in a list (see here for help with that) and then you can do this:
result = Reduce("+", list_of_data) / length(list_of_data)
If you need advanced features of mean, like ignoring NAs or trimming, this won't work. Instead, I would recommend converting your data frames to matrices, stacking them into a 3-d array, and applying mean:
library(abind)
stack = abind(DG, DF, DT, along = 3)
# if you have data frames in a list, do this instead:
# stack = do.call(abind, c(list_of_data, along = 3))
apply(stack, MARGIN = 1:2, FUN = mean, na.rm = TRUE)
# y v x
# [1,] 1 3 6.666667
# [2,] 3 4 8.333333
# [3,] 1 5 6.666667
# [4,] 3 6 8.333333
# [5,] 1 7 6.666667
# [6,] 3 8 8.333333
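If you would rather avoid the abind dependency, the same NA-aware average can be sketched in base R by summing the non-NA values and dividing by the per-cell counts of non-NA values:
list_of_data <- list(DG, DF, DT)
# treat NAs as 0 in the sum, then divide by how many non-NA values each cell had
sums   <- Reduce(`+`, lapply(list_of_data, function(d) replace(d, is.na(d), 0)))
counts <- Reduce(`+`, lapply(list_of_data, function(d) !is.na(d)))
sums / counts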
The final method I'll recommend is a "tidy" method - combine your data into one data frame and use grouped operations to produce the result. This can be done easily with data.table or dplyr. See Uwe's answer for a nice data.table implementation.
library(dplyr)
bind_rows(list(DG, DF, DT), .id = ".id") %>%
  group_by(.id) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  select(-.id) %>%
  group_by(rn) %>%
  summarize_all(mean, na.rm = TRUE) %>%
  select(-rn)
# # A tibble: 6 x 3
# y v x
# <dbl> <dbl> <dbl>
# 1 1 3 6.67
# 2 3 4 8.33
# 3 1 5 6.67
# 4 3 6 8.33
# 5 1 7 6.67
# 6 3 8 8.33
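Note that summarize_all is superseded in dplyr 1.0+; a sketch of the same pipeline with across() (assuming the example's column names y, v, x):
library(dplyr)

bind_rows(list(DG, DF, DT), .id = ".id") %>%
  group_by(.id) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  group_by(rn) %>%
  summarize(across(c(y, v, x), ~ mean(.x, na.rm = TRUE)), .groups = "drop") %>%
  select(-rn)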

Repeat vector to fill down column in data frame

Seems like this very simple maneuver used to work for me, and now it simply doesn't. A dummy version of the problem:
df <- data.frame(x = 1:5) # create simple dataframe
df
x
1 1
2 2
3 3
4 4
5 5
df$y <- c(1:5) # adding a new column with a vector of the exact same length. Works out like it should
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
df$z <- c(1:4) # trying to add a new column, this time with a vector that has fewer elements than there are rows in the data frame
Error in `$<-.data.frame`(`*tmp*`, "z", value = 1:4) :
replacement has 4 rows, data has 5
I was expecting this to work with the following result:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 1
I.e. the shorter vector should just start repeating itself automatically. I'm pretty certain this used to work for me (it's in a script that I've been running a hundred times before without problems). Now I can't even get the above dummy example to work like I want to. What am I missing?
If the vector can be recycled evenly into the data.frame, you do not get an error or a warning:
df <- data.frame(x = 1:10)
df$z <- 1:5
This may be what you were experiencing before.
You can get your vector to fit as you mention with rep_len:
df$y <- rep_len(1:3, length.out=10)
This results in
df
x z y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 1
5 5 5 2
6 6 1 3
7 7 2 1
8 8 3 2
9 9 4 3
10 10 5 1
Note that in place of rep_len, you could use the more common rep function:
df$y <- rep(1:3,len=10)
From the help file for rep:
rep.int and rep_len are faster simplified versions for two common cases. They are not generic.
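Applied to the question's five-row example, this gives exactly the expected result (a minimal sketch, rebuilding the df from the question):
df <- data.frame(x = 1:5, y = 1:5)
df$z <- rep_len(1:4, length.out = nrow(df))
df
#   x y z
# 1 1 1 1
# 2 2 2 2
# 3 3 3 3
# 4 4 4 4
# 5 5 5 1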
If the total number of rows is a multiple of the length of your new vector, it works fine. When it is not, it does not work everywhere. In particular, you have probably used this type of recycling with matrices:
data.frame(1:6, 1:3, 1:4) # not a multiple
# Error in data.frame(1:6, 1:3, 1:4) :
# arguments imply differing number of rows: 6, 3, 4
data.frame(1:6, 1:3) # a multiple
# X1.6 X1.3
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 1
# 5 5 2
# 6 6 3
cbind(1:6, 1:3, 1:4) # works even with not a multiple
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 1 4
# [5,] 5 2 1
# [6,] 6 3 2
# Warning message:
# In cbind(1:6, 1:3, 1:4) :
# number of rows of result is not a multiple of vector length (arg 3)

Combining common IDs in 2 Lists of data tables

I have two lists, each containing a few thousand data tables. The data tables contain id's and each id will only appear once within each list. Additionally, each data table will have different columns, though they will share column names with some other data tables. For example, in my lists created below, id 1 appears in the 1st data table in list1 and the 2nd data table in list2. In the first list id 1 has data for columns 'a' and 'd' and in the second list it has columns for 'a' and 'b'.
library(data.table)
# Create 2 lists of data tables
list1 <- list(data.table(id = c(1, 3), a = c(0, 0), d = c(1, 1)),
              data.table(id = c(2, 4), b = c(1, 0), c = c(2, 1), f = c(3, 1)),
              data.table(id = c(5, 6), a = c(4, 0), b = c(2, 1)))
list2 <- list(data.table(id = c(2, 3, 6), c = c(0, 0, 1), d = c(1, 1, 0), e = c(0, 1, 2)),
              data.table(id = c(1, 4, 5), a = c(1, 0, 3), b = c(2, 1, 2)))
What I need to do is find each id in both lists and average its results. For id 1:
 list  id  a  b  d
list1   1  0 NA  1
list2   1  1  2 NA
NA values are treated as 0, so the result for id 1 should be:
id   a b   d
 1 0.5 1 0.5
Next, the top 3 column names are selected and ordered based on their values so that the result is:
id top3
1 b d a
This needs to be repeated for all id's. I have code that can achieve this (below), but for a large list with thousands of data tables and over a million ids it is very slow.
top3 <- NULL  # accumulator for the results
for (i in 1:6) {  # i is the id to be searched for
  for (j in 1:length(list1)) {
    if (i %in% list1[[j]]$id) {
      listnum1 <- j
      rownum1 <- which(list1[[j]]$id == i)
      break
    }
  }
  for (j in 1:length(list2)) {
    if (i %in% list2[[j]]$id) {
      listnum2 <- j
      rownum2 <- which(list2[[j]]$id == i)
      break
    }
  }
  # Converting to data.frame using setDF and extracting the row is faster
  # than using data.table
  v1 <- data.table(setDF(list1[[listnum1]])[rownum1, ])
  v2 <- data.table(setDF(list2[[listnum2]])[rownum2, ])
  # Combine the two rows, filling in columns they don't have in common
  bind <- rbind(v1, v2, fill = TRUE)
  # Convert NAs to 0
  for (j in 1:ncol(bind)) {
    set(bind, which(is.na(bind[[j]])), j, 0)
  }
  # Average the two rows
  means <- colMeans(bind[, 2:ncol(bind), with = FALSE])
  # Select and order the top 3 ids and bind to a data frame
  col_ids <- as.data.table(t(names(sort(means)[length(means):(length(means) - 2)])))
  top3 <- rbind(top3,
                cbind(id = i,
                      top3 = data.table(do.call("paste",
                        c(col_ids[, 1:min(length(col_ids), 3), with = FALSE],
                          sep = " ")))))
}
id top3.V1
1: 1 b d a
2: 2 f c d
3: 3 d e c
4: 4 f c b
5: 5 a b
6: 6 e c b
When I run this code on my full data set (which has a few million IDs) it only makes it through about 400 ids after about 60 seconds. It would take days to go through the entire data set. Converting each list into 1 much larger data table is not an option; there are 100,000 possible columns so it becomes too large. Is there a faster way to achieve the desired result?
Melt down the individual data.tables and you won't run into the issue of wasted memory:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id', variable.factor = FALSE))[
  # find number of "rows" per id
  , nvals := max(rle(sort(variable))$lengths), by = id][
  # compute the means, assuming that missing values are equal to 0
  , sum(value) / nvals[1], by = .(id, variable)][
  # extract top 3 values
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
# id V1
#1: 1 b a d
#2: 2 f c b
#3: 3 d e a
#4: 4 b c f
#5: 5 a b
#6: 6 e b c
Or instead of rle you can do:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , .(vals = sum(value), nvals = .N), by = .(id, variable)][
  , vals := vals / max(nvals), by = id][
  order(-vals), paste(head(variable, 3), collapse = " "), keyby = id]
Or better yet, as Frank points out, don't even bother with the mean:
rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , sum(value), by = .(id, variable)][
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
Not sure about the performance, but this should avoid the for-loop:
library(plyr)
library(dplyr)
a <- ldply(list1, data.frame)
b <- ldply(list2, data.frame)
dat <- full_join(a,b)
This will give you a single data frame:
id a d b c f e
1 1 0 1 NA NA NA NA
2 3 0 1 NA NA NA NA
3 2 NA NA 1 2 3 NA
4 4 NA NA 0 1 1 NA
5 5 4 NA 2 NA NA NA
6 6 0 NA 1 NA NA NA
7 2 NA 1 NA 0 NA 0
8 3 NA 1 NA 0 NA 1
9 6 NA 0 NA 1 NA 2
10 1 1 NA 2 NA NA NA
11 4 0 NA 1 NA NA NA
12 5 3 NA 2 NA NA NA
By summarising based on id:
means <- function(x) mean(x, na.rm=T)
output <- dat %>% group_by(id) %>% summarise_each(funs(means))
id a d b c f e
1 1 0.5 1 2.0 NA NA NA
2 2 NaN 1 1.0 1 3 0
3 3 0.0 1 NaN 0 NaN 1
4 4 0.0 NaN 0.5 1 1 NaN
5 5 3.5 NaN 2.0 NaN NaN NaN
6 6 0.0 0 1.0 1 NaN 2
Listing the top 3 through sapply will give you the same resulting table (but as a matrix, each column corresponding to id)
sapply(1:nrow(output), function(x) sort(output[x,-1], decreasing=T)[1:3] %>% names)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "b" "f" "d" "c" "a" "e"
[2,] "d" "d" "e" "f" "b" "b"
[3,] "a" "b" "a" "b" NA "c"
Update:
Since the data is going to be large, it's prudent to create some functions that choose and combine the appropriate data frames for each id.
(i) Find the ids present in each list:
id_list1 <- lapply(list1, "[[", "id")
id_list2 <- lapply(list2, "[[", "id")
(ii) Find which table within each list holds ids 1 to 6:
id_l1 <- lapply(1:6, function(x) sapply(id_list1, function(y) any(y == x) %>% unlist))
id_l2 <- lapply(1:6, function(x) sapply(id_list2, function(y) any(y == x) %>% unlist))
(iii) create a function to combine appropriate dataframe for specific id
id_who<-function(x){
a <- data.frame(list1[id_l1[[x]]])
a <- a[a$id==x, ]
b <- data.frame(list2[id_l2[[x]]])
b <- b[b$id==x, ]
full_join(a,b)
}
new <- lapply(1:6, id_who)
new
[[1]]
id a d b
1 1 0 1 NA
2 1 1 NA 2
[[2]]
id b c f d e
1 2 1 2 3 NA NA
2 2 NA 0 NA 1 0
[[3]]
id a d c e
1 3 0 1 0 1
[[4]]
id b c f a
1 4 0 1 1 NA
2 4 1 NA NA 0
[[5]]
id a b
1 5 4 2
2 5 3 2
[[6]]
id a b c d e
1 6 0 1 1 0 2
output <- ldply(new, summarise_each, funs(means))
Output will be the same as the above.
The advantage of this approach is that you can easily insert logical breaks, either in (ii) or (iii).

R colSums for every two rows

I am struggling with the following (easy) problem but cannot find a good solution to it. Consider a df as follows:
test<-c("A","B","C","D","E","F")
test2<-sample(1:6)
test3<-data.frame(test,test2)
I would like to have a third column that, in the second row, shows the ratio of rows 1:2 of column 2; in the fourth row, the ratio of rows 3:4 of column 2; and in the sixth row, the ratio of rows 5:6 of column 2. My df is far larger, otherwise I would have done it by hand :)
Any suggestions on how to do that? I know that you can get differences with the diff command, but the ratio? And how do I bind two rows together? split() does not seem to do that.
This should be pretty fast:
test3$ratio <- NA
test3$ratio[c(FALSE, TRUE)] <- test3$test2[c(FALSE, TRUE)] /
                               test3$test2[c(TRUE, FALSE)]
Using a loop (instead of 6 below, put the number of the last row of your large data frame):
for (i in seq(2, 6, by = 2)) {
  test3$ratio[i] <- with(test3, test2[i - 1] / test2[i])
}
> test3
test test2 ratio
1 A 3 NA
2 B 5 0.6000000
3 C 4 NA
4 D 6 0.6666667
5 E 1 NA
6 F 2 0.5000000
You can use gl to generate your groups:
temp <- within(test3, {
  Sums <- ave(test2, gl(nrow(test3) / 2, 2), FUN = function(x) x[2] / x[1])
  Sums[c(TRUE, FALSE)] <- NA
})
temp
# test test2 Sums
# 1 A 2 NA
# 2 B 6 3.000000
# 3 C 3 NA
# 4 D 4 1.333333
# 5 E 1 NA
# 6 F 5 5.000000
Alternatively (and similar to flodel's answer), you can use head and tail:
test3$Sums <- NA
test3$Sums[c(FALSE, TRUE)] <- (tail(c(0, test3$test2), -1) /
                               head(c(0, test3$test2), -1))[c(FALSE, TRUE)]
test3
# test test2 Sums
# 1 A 2 NA
# 2 B 6 3.000000
# 3 C 3 NA
# 4 D 4 1.333333
# 5 E 1 NA
# 6 F 5 5.000000
For the above, the sample data was:
set.seed(1)
test <- c("A", "B", "C", "D", "E", "F")
test2 <- sample(1:6)
test3 <- data.frame(test, test2)
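For completeness, a dplyr sketch of the same pairwise grouping (following the x[2]/x[1] convention of the gl and head/tail answers above; it assumes an even number of rows):
library(dplyr)

test3 %>%
  mutate(pair = (row_number() + 1) %/% 2) %>%   # pair labels: 1,1,2,2,3,3
  group_by(pair) %>%
  mutate(ratio = if_else(row_number() == 2, test2 / lag(test2), NA_real_)) %>%
  ungroup() %>%
  select(-pair)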
