Find data frame column differences per multiple groupings - r

In R, I would like to subtract the sum of a value column (grouped by a letter in column 't1') from the sum of the same value column (grouped by the same letter in column 't2'). Repeat the process for every letter and for every year group.
Consider:
set.seed(3)
df <- data.frame(age = rep(1:3,each=25),
t1 = rep(expand.grid(LETTERS[1:5],LETTERS[1:5])[,1],3),
t2 = rep(expand.grid(LETTERS[1:5],LETTERS[1:5])[,2],3),
value = sample(1:10,75,replace=T))
This data frame shows 3 values in the 'age' column, 2 columns with categories (t1 and t2) and an associated value (value).
As an example, here is how it might work for 'A':
library(plyr);
# extract rows with A
df2 <- df[df$t1=="A" | df$t2=="A",]
# remove where t1 and t2 are the same (not needed)
df2 <- df2[df2$t1 != df2$t2,]
# use ddply to subtract sum of 'value' for A in t1 from t2
df2 <- ddply(df2, .(age), transform, change = sum(value[t2=="A"])-sum(value[t1=="A"]))
# create a name
df2$cat <- "A"
# remove all the duplicate rows, just need one summary value per age
# (deduplicate on age: two ages could share the same change value)
df2 <- df2[ !duplicated(df2$age), ]
# keep summary data
df2 <- df2[,c(1,6,5)]
Now I need to do this for all the values that occur in t1 and t2 (in this case A to E), creating a 15-line summary (5 letters by 3 ages).
I tried a loop with;
for (c in as.character(unique(df$t1)))
but got nowhere.
thanks a lot

Here is one base R solution that involves aggregation and merging:
# aggregate by age and t1 or t2
t1Agg <- aggregate(value ~ t1 + age, data=df, FUN=sum)
t2Agg <- aggregate(value ~ t2 + age, data=df, FUN=sum)
# merge aggregated data
aggData <- merge(t1Agg, t2Agg, by.x=c("age","t1"), by.y=c("age","t2"))
names(aggData) <- c("age", "t", "value.t1", "value.t2")
aggData$diff <- aggData$value.t1 - aggData$value.t2
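As a cross-check on the aggregate-and-merge idea, the same per-letter, per-age sums can be sketched with tapply(). This is only an illustration under the question's setup; the names sum_t1, sum_t2, change and out are made up here. It computes t2 minus t1, the sign used in the question's ddply call, whereas the merge above computes t1 minus t2. Rows where t1 equals t2 are left in, since they add the same amount to both sums and cancel in the difference.

```r
set.seed(3)
df <- data.frame(age = rep(1:3, each = 25),
                 t1 = rep(expand.grid(LETTERS[1:5], LETTERS[1:5])[, 1], 3),
                 t2 = rep(expand.grid(LETTERS[1:5], LETTERS[1:5])[, 2], 3),
                 value = sample(1:10, 75, replace = TRUE))

# Sum of value per (letter, age), keyed once on t1 and once on t2.
# Both results are 5 x 3 matrices with rows A..E and columns 1..3.
sum_t1 <- tapply(df$value, list(df$t1, df$age), sum)
sum_t2 <- tapply(df$value, list(df$t2, df$age), sum)

# t2 minus t1, matching the sign in the question's ddply call.
change <- sum_t2 - sum_t1

# Long format: one row per letter and age (15 rows here).
out <- as.data.frame(as.table(change), responseName = "change")
names(out)[1:2] <- c("cat", "age")
out
```

Because every value is counted once under t1 and once under t2, the changes within each age should sum to zero, which is a quick sanity check on the result.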

I would recommend tidying your data first and then you can spread post-summarise and add a new column:
# Make reproducible
set.seed(4)
df <- data.frame(age = rep(1:3,each=25),
t1 = rep(expand.grid(LETTERS[1:5],LETTERS[1:5])[,1],3),
t2 = rep(expand.grid(LETTERS[1:5],LETTERS[1:5])[,2],3),
value = sample(1:10,75,replace=T))
library(tidyr)
library(dplyr)
df_tidy <- gather(df, t_var, t_val, -age, -value)
sample_n(df_tidy, 3)
# age value t_var t_val
# 104 2 6 t2 A
# 48 2 9 t1 C
# 66 3 7 t1 A
df_tidy %>%
group_by(age, t_var, t_val) %>%
summarise(val_sum = sum(value)) %>%
spread(t_var, val_sum) %>%
mutate(diff = t1 - t2)
# age t_val t1 t2 diff
# (int) (chr) (int) (int) (int)
# 1 1 A 30 22 8
# 2 1 B 32 32 0
# 3 1 C 27 28 -1
# 4 1 D 38 39 -1
# 5 1 E 30 36 -6
# 6 2 A 36 35 1
# 7 2 B 26 30 -4
# 8 2 C 40 27 13
# 9 2 D 27 31 -4
# 10 2 E 28 34 -6
# 11 3 A 26 39 -13
# 12 3 B 19 26 -7
# 13 3 C 31 29 2
# 14 3 D 41 33 8
# 15 3 E 39 29 10
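gather() and spread() still work, but tidyr documents them as superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same pipeline with the newer verbs, assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument); the name res is made up for illustration. No output is shown because the sampled values depend on the R version.

```r
library(dplyr)
library(tidyr)

set.seed(4)
df <- data.frame(age = rep(1:3, each = 25),
                 t1 = rep(expand.grid(LETTERS[1:5], LETTERS[1:5])[, 1], 3),
                 t2 = rep(expand.grid(LETTERS[1:5], LETTERS[1:5])[, 2], 3),
                 value = sample(1:10, 75, replace = TRUE))

res <- df %>%
  # Stack t1 and t2 into one column of letters, remembering the source column.
  pivot_longer(c(t1, t2), names_to = "t_var", values_to = "t_val") %>%
  group_by(age, t_val, t_var) %>%
  summarise(val_sum = sum(value), .groups = "drop") %>%
  # One column per source (t1, t2), then the difference as in the answer above.
  pivot_wider(names_from = t_var, values_from = val_sum) %>%
  mutate(diff = t1 - t2)

res
```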

Related

Find time-lag between groups in a data.frame

Let's suppose I want to estimate the time lag between two groups within a data.frame.
Here an example of my data:
df_1 = data.frame(time = c(1,3,5,6,8,11,15,16,18,20), group = 'a') # create group 'a' data
df_2 = data.frame(time = c(2,7,10,13,19,25), group = 'b') # create group 'b' data
df = rbind(df_1, df_2) # merge groups
df = df[with(df, order(time)), ] # order by time
rownames(df) = NULL #remove row names
> df
time group
1 1 a
2 2 b
3 3 a
4 5 a
5 6 a
6 7 b
7 8 a
8 10 b
9 11 a
10 13 b
11 15 a
12 16 a
13 18 a
14 19 b
15 20 a
16 25 b
Now I need to subtract, from each group b time, the group a time that immediately precedes it,
i.e. 2-1, 7-6, 10-8, 13-11, 19-18 and 25-20.
# Expected output
> out
[1] 1 1 2 2 1 5
How can I achieve this?
We can find the indices of the b rows and subtract from each b time the time at the preceding index.
inds <- which(df$group == "b")
df$time[inds] - df$time[inds - 1]
#[1] 1 1 2 2 1 5
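The index approach relies on every b row being directly preceded by an a row in the sorted data. If two b rows could be adjacent, one possible variation is to look up the latest a time at or before each b time with findInterval(). This is a sketch, not part of the original answer; a_times, b_times, idx and lags are illustrative names.

```r
df_1 <- data.frame(time = c(1, 3, 5, 6, 8, 11, 15, 16, 18, 20), group = "a")
df_2 <- data.frame(time = c(2, 7, 10, 13, 19, 25), group = "b")
df <- rbind(df_1, df_2)

a_times <- sort(df$time[df$group == "a"])
b_times <- sort(df$time[df$group == "b"])

# For each b time, count how many a times are <= it; that count is the
# position of the most recent a time in the sorted a_times vector.
idx <- findInterval(b_times, a_times)

# A b time with no earlier a time gets NA instead of a wrong lag.
idx[idx == 0] <- NA

lags <- b_times - a_times[idx]
lags
# [1] 1 1 2 2 1 5
```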
Here's a tidyverse solution. First use transmute to add a column that, for each group b row, holds its time minus the time in the preceding row (via lag), and 0 otherwise. Then filter to just the nonzero results and convert to a vector with deframe. Note that the filter would also drop a genuine lag of 0.
library(tidyverse)
df %>%
transmute(result = if_else(group == "b", time - lag(time), 0)) %>%
filter(result != 0) %>%
deframe()
result:
[1] 1 1 2 2 1 5

Aggregate dataframe in rolling blocks of 3 rows

I have the following data frame as an example
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
> df
score total1 total2
1 a 1 16
2 b 2 17
3 c 3 18
4 d 4 19
5 e 5 20
6 f 6 21
7 g 7 22
8 h 8 23
9 i 9 24
10 j 10 25
11 k 11 26
12 l 12 27
13 m 13 28
14 n 14 29
15 o 15 30
I would like to aggregate my data frame by summing groups of rows that have different names, i.e.
groups sum1 sum2
'a-b-c' 6 51
'c-d-e' 21 60
etc
All the given answers to this kind of question assume that the strings repeat in the row.
The usual aggregate function that I use to obtain the summary delivers a different result:
aggregate(df$total1, by=list(sum1=df$score %in% c('a','b','c'), sum2=df$score %in% c('d','e','f')), FUN=sum)
sum1 sum2 x
1 FALSE FALSE 99
2 TRUE FALSE 6
3 FALSE TRUE 15
If you want a tidyverse solution, here is one possibility:
library(dplyr)
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
df %>%
mutate(groups = case_when(
score %in% c("a","b","c") ~ "a-b-c",
score %in% c("d","e","f") ~ "d-e-f"
)) %>%
group_by(groups) %>%
summarise_if(is.numeric, sum)
returns
# A tibble: 3 x 3
groups total1 total2
<chr> <int> <int>
1 a-b-c 6 51
2 d-e-f 15 60
3 <NA> 99 234
Add a "groups" column with the category value.
df$groups = NA
and then define each group like this:
df$groups[df$score=="a" | df$score=="b" | df$score=="c" ] = "a-b-c"
Finally aggregate by that column.
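The last step is not spelled out above; one way it might look in base R is sketched below, using aggregate() with a formula. The second group label and the name agg are assumptions added for illustration. With the formula interface, rows whose groups value is NA are dropped by default.

```r
df <- data.frame(score = letters[1:15], total1 = 1:15, total2 = 16:30)

# Label the rows that belong to each named group; everything else stays NA.
df$groups <- NA
df$groups[df$score %in% c("a", "b", "c")] <- "a-b-c"
df$groups[df$score %in% c("d", "e", "f")] <- "d-e-f"

# Sum both numeric columns within each label.
agg <- aggregate(cbind(total1, total2) ~ groups, data = df, FUN = sum)
agg
#   groups total1 total2
# 1  a-b-c      6     51
# 2  d-e-f     15     60
```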
Here's a solution that works for any sized data frame.
df <- data.frame(score=letters[1:15], total1=1:15, total2=16:30)
# I'm adding a row to demonstrate that the grouping pattern works when the
# number of rows is not equally divisible by 3.
df <- rbind(df, data.frame(score = letters[16], total1 = 16, total2 = 31))
# A vector that represents the correct groupings for the data frame.
groups <- c(rep(1:floor(nrow(df) / 3), each = 3),
rep(floor(nrow(df) / 3) + 1, nrow(df) - length(1:(nrow(df) / 3)) * 3))
# Your method of aggregation by `groups`. I'm going to use `data.table`.
require(data.table)
dt <- as.data.table(df)
dt[, group := groups]
aggDT <- dt[, list(score = paste0(score, collapse = "-"),
total1 = sum(total1), total2 = sum(total2)), by = group][
, group := NULL]
aggDT
score total1 total2
1: a-b-c 6 51
2: d-e-f 15 60
3: g-h-i 24 69
4: j-k-l 33 78
5: m-n-o 42 87
6: p 16 31
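For readers without data.table, a base R sketch of the same block-of-three idea follows. It builds the group index with ceiling(seq_len(n) / 3), which also handles a row count that is not a multiple of 3; grp, parts and agg are illustrative names, not from the answer above.

```r
df <- data.frame(score = letters[1:16], total1 = 1:16, total2 = 16:31)

# Rows 1-3 get index 1, rows 4-6 get index 2, and so on;
# a trailing partial block simply gets the next index.
grp <- ceiling(seq_len(nrow(df)) / 3)

# Split into blocks, then build one summary row per block.
parts <- split(df, grp)
agg <- data.frame(
  score  = vapply(parts, function(p) paste(p$score, collapse = "-"), character(1)),
  total1 = vapply(parts, function(p) sum(p$total1), numeric(1)),
  total2 = vapply(parts, function(p) sum(p$total2), numeric(1)),
  row.names = NULL
)
agg
```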

How to add a variable based on order of observation in a dataframe - R

I want to add a variable value to a dataframe based on the order of the observation in the data frame.
… Subject Latency(s)
1 A 25
2 A 24
3 A 25
4 B 22
5 B 24
6 B 23
I want to add a third column called Trial and I want the values to be either T1, T2, or T3 based on the order of the observation and by Subject. So for example, Subject A would get T1 in row 1, T2 in row 2, and T3 in row 3. Then the same for subject B, and so on.
Right now my approach is to use group_by in dplyr to group by Subject. But I'm not sure then how to specify the new variable using mutate.
Use mutate() with row_number() after group_by(Subject):
library(dplyr)
txt <- "ID Subject Latency(s)
1 A 25
2 A 24
3 A 25
4 B 22
5 B 24
6 B 23"
dat <- read.table(text = txt, header = TRUE)
dat <- dat %>%
group_by(Subject) %>%
mutate(Trial = paste0("T", row_number()))
dat
#> # A tibble: 6 x 4
#> # Groups: Subject [2]
#> ID Subject Latency.s. Trial
#> <int> <fct> <int> <chr>
#> 1 1 A 25 T1
#> 2 2 A 24 T2
#> 3 3 A 25 T3
#> 4 4 B 22 T1
#> 5 5 B 24 T2
#> 6 6 B 23 T3
Created on 2018-03-17 by the reprex package (v0.2.0).
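This solution should work for any number of subjects, provided the rows are already sorted by subject (count() returns its counts in sorted subject order). To illustrate, copy and paste this code into your console.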
library(dplyr)
d <- data.frame(subject = c("A","A","A","B","B","B","C","D","D"),
latency = c(25,24,25,22,24,23,34,54,34))
# get counts of unique subjects
n <- d %>% dplyr::count(subject)
# create a list of sequences
my_list <- lapply(n$n, seq)
# paste a "T" to each of these sequences
t_list <- lapply(my_list, function(x){paste0("T", x)})
# bind the collapsed list back onto your df
d$trial <- do.call(c, t_list)
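If the rows are not sorted by subject, binding the sequences back on by position would mislabel them. A base R sketch that numbers rows within each subject regardless of row order, using ave(); the example data here is made up to show interleaved subjects.

```r
# Subjects deliberately interleaved rather than sorted.
d <- data.frame(subject = c("B", "A", "B", "A", "C"),
                latency = c(22, 25, 24, 24, 34))

# ave() applies seq_along within each subject and returns the results
# in the original row order.
d$trial <- paste0("T", ave(seq_along(d$subject), d$subject, FUN = seq_along))
d$trial
# [1] "T1" "T1" "T2" "T2" "T1"
```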

Adding sequence numbers based on time R [duplicate]

How can we generate unique id numbers within each group of a dataframe? Here's some data grouped by "personid":
personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23
I wish to add an id column with a unique value for each row within each subset defined by "personid", always starting with 1. This is my desired output:
personid date measurement id
1 x 23 1
1 x 32 2
2 y 21 1
3 x 23 1
3 z 23 2
3 y 23 3
I appreciate any help.
Some dplyr alternatives, using convenience functions row_number and n.
library(dplyr)
df %>% group_by(personid) %>% mutate(id = row_number())
df %>% group_by(personid) %>% mutate(id = 1:n())
df %>% group_by(personid) %>% mutate(id = seq_len(n()))
df %>% group_by(personid) %>% mutate(id = seq_along(personid))
You may also use getanID from package splitstackshape. Note that the input dataset is returned as a data.table.
library(splitstackshape)
getanID(data = df, id.vars = "personid")
# personid date measurement .id
# 1: 1 x 23 1
# 2: 1 x 32 2
# 3: 2 y 21 1
# 4: 3 x 23 1
# 5: 3 z 23 2
# 6: 3 y 23 3
The misleadingly named ave() function, with argument FUN=seq_along, will accomplish this nicely -- even if your personid column is not strictly ordered.
df <- read.table(text = "personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23", header=TRUE)
## First with your data.frame
ave(df$personid, df$personid, FUN=seq_along)
# [1] 1 2 1 1 2 3
## Then with another, in which personid is *not* in order
df2 <- df[c(2:6, 1),]
ave(df2$personid, df2$personid, FUN=seq_along)
# [1] 1 1 1 2 3 2
Using data.table, and assuming you wish to order by date within the personid subset
library(data.table)
DT <- data.table(Data)
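# rank(), not order(): order() returns the sorting permutation,
# which is not each row's position within the group in general
DT[, id := rank(date, ties.method = "first"), by = personid]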
## personid date measurement id
## 1: 1 x 23 1
## 2: 1 x 32 2
## 3: 2 y 21 1
## 4: 3 x 23 1
## 5: 3 z 23 3
## 6: 3 y 23 2
If you do not wish to order by date
DT[, id := 1:.N, by = personid]
## personid date measurement id
## 1: 1 x 23 1
## 2: 1 x 32 2
## 3: 2 y 21 1
## 4: 3 x 23 1
## 5: 3 z 23 2
## 6: 3 y 23 3
Any of the following would also work
DT[, id := seq_along(measurement), by = personid]
DT[, id := seq_along(date), by = personid]
The equivalent commands using plyr
library(plyr)
# ordering by date
ddply(Data, .(personid), mutate, id = rank(date, ties.method = "first"))
# in original order
ddply(Data, .(personid), mutate, id = seq_along(date))
ddply(Data, .(personid), mutate, id = seq_along(measurement))
I think there's a canned command for this, but I can't remember it. So here's one way:
> test <- sample(letters[1:3],10,replace=TRUE)
> cumsum(duplicated(test))
[1] 0 0 1 1 2 3 4 5 6 7
> cumsum(duplicated(test))+1
[1] 1 1 2 2 3 4 5 6 7 8
This works because duplicated returns a logical vector. cumsum operates on numeric vectors, so the logical gets coerced to numeric.
You can store the result to your data.frame as a new column if you want:
dat$id <- cumsum(duplicated(test))+1
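One caveat worth checking before using this for the question's data: cumsum(duplicated(x)) keeps a single running count over the whole vector and does not restart within each group. A small comparison on the question's personid values, with ave() shown as a per-group alternative; running and within are illustrative names.

```r
personid <- c(1, 1, 2, 3, 3, 3)

# Running count of repeats across the whole vector.
running <- cumsum(duplicated(personid)) + 1
running
# [1] 1 2 2 2 3 4

# Counter that restarts at 1 for every personid.
within <- ave(personid, personid, FUN = seq_along)
within
# [1] 1 2 1 1 2 3
```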
Assuming your data are in a data.frame named Data, this will do the trick:
# ensure Data is in the correct order
Data <- Data[order(Data$personid),]
# tabulate() calculates the number of each personid
# sequence() creates the sequence 1..n for each element n in the input,
# and concatenates the results
Data$id <- sequence(tabulate(Data$personid))
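tabulate() counts occurrences of positive integers, so the line above assumes personid holds positive integer ids (or a factor). For character ids, a possible variation is to take the run lengths of the sorted column with rle() instead. This is a sketch with made-up example data, not the original author's code.

```r
Data <- data.frame(personid = c("p3", "p1", "p3", "p2", "p3", "p1"),
                   measurement = c(23, 23, 23, 21, 23, 32))

# Sort so that each personid forms one contiguous run
# (order() keeps tied rows in their original relative order).
Data <- Data[order(Data$personid), ]

# rle() gives the length of each run; sequence() expands each
# length n into 1..n and concatenates the pieces.
Data$id <- sequence(rle(as.character(Data$personid))$lengths)
Data$id
# [1] 1 2 1 1 2 3
```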
You can use sqldf
df<-read.table(header=T,text="personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23")
library(sqldf)
sqldf("SELECT a.*, COUNT(*) count
FROM df a, df b
WHERE a.personid = b.personid AND b.ROWID <= a.ROWID
GROUP BY a.ROWID"
)
# personid date measurement count
#1 1 x 23 1
#2 1 x 32 2
#3 2 y 21 1
#4 3 x 23 1
#5 3 z 23 2
#6 3 y 23 3
