Vector of SumIfs() in R

I am looking to mimic Excel's SumIfs() function in R by creating a mean-if vector of conditional averages for each observation. I've seen a lot of examples that use aggregate() or setDT() to summarize a data frame based on fixed quantities. However, I'd like to create a vector of these summaries based on the variable inputs of each row in my data frame.
Here is an example of my data:
> a <- c('c', 'a', 'b', 'a', 'c', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'b', 'a')
> b <- c(6, 1, 1, 2, 1, 2, 2, 4, 3, 3, 5, 5, 4, 6, 6)
> c <- c(69.9, 21.2, 37, 25, 65.9, 33.1, 67, 28.4, 36, 67, 22, 37.9, 62.3, 30, 25)
> df <- data.frame(cbind(a, b, c))
> df$b <- as.numeric(as.character(df$b))
> df$c <- as.numeric(as.character(df$c))
> df
a b c
1 c 6 69.9
2 a 1 21.2
3 b 1 37.0
4 a 2 25.0
5 c 1 65.9
6 b 2 33.1
7 c 2 67.0
8 a 4 28.4
9 b 3 36.0
10 c 3 67.0
11 a 5 22.0
12 b 5 37.9
13 c 4 62.3
14 b 6 30.0
15 a 6 25.0
I would like to add a fourth column, df$d, that takes the average of df$c for those observations where df$a == x & y - 2 <= df$b < y where x and y are df$a and df$b, respectively, for the observation being calculated.
Doing this by hand, df$d looks like:
> df$d <- c(62.3, NA, NA, 21.2, NA, 37, 65.9, 25, 35.05, 66.45, 28.4, 36, 67, 37.9, 25.2)
> df
a b c d
1 c 6 69.9 62.30
2 a 1 21.2 NA
3 b 1 37.0 NA
4 a 2 25.0 21.20
5 c 1 65.9 NA
6 b 2 33.1 37.00
7 c 2 67.0 65.90
8 a 4 28.4 25.00
9 b 3 36.0 35.05
10 c 3 67.0 66.45
11 a 5 22.0 28.40
12 b 5 37.9 36.00
13 c 4 62.3 67.00
14 b 6 30.0 37.90
15 a 6 25.0 25.20
Is there a function I can use to do this automatically? Thanks for your help!

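This can be done in a straightforward way with a left self-join in SQL. Each row of the u instance of df is joined to those rows of the v instance of df that satisfy the on condition, and the c values of the matched rows are then averaged. Note that between u.b-2 and u.b-1 is equivalent to y - 2 <= b < y only because b is integer-valued here.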
library(sqldf)
sqldf("select u.*, avg(v.c) as d
from df u left join df v
on u.a = v.a and v.b between u.b-2 and u.b-1
group by u.rowid")
giving:
a b c d
1 c 6 69.9 62.30
2 a 1 21.2 NA
3 b 1 37.0 NA
4 a 2 25.0 21.20
5 c 1 65.9 NA
6 b 2 33.1 37.00
7 c 2 67.0 65.90
8 a 4 28.4 25.00
9 b 3 36.0 35.05
10 c 3 67.0 66.45
11 a 5 22.0 28.40
12 b 5 37.9 36.00
13 c 4 62.3 67.00
14 b 6 30.0 37.90
15 a 6 25.0 25.20
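The same self-join idea can be sketched in base R without sqldf. This is only an illustration, not the answer's method: the id column and the .u/.v suffixes are names introduced here to mirror the u and v aliases of the SQL query.

```r
df <- data.frame(
  a = c("c", "a", "b", "a", "c", "b", "c", "a", "b", "c", "a", "b", "c", "b", "a"),
  b = c(6, 1, 1, 2, 1, 2, 2, 4, 3, 3, 5, 5, 4, 6, 6),
  c = c(69.9, 21.2, 37, 25, 65.9, 33.1, 67, 28.4, 36, 67, 22, 37.9, 62.3, 30, 25),
  stringsAsFactors = FALSE
)
df$id <- seq_len(nrow(df))  # row identifier (assumed helper column)

# Self-join on a: every row paired with every row that shares its a value
pairs <- merge(df, df, by = "a", suffixes = c(".u", ".v"))

# Keep pairs where the v row's b lies in [b.u - 2, b.u)
keep <- pairs$b.v >= pairs$b.u - 2 & pairs$b.v < pairs$b.u

# Average c over the matched rows, per original row
means <- tapply(pairs$c.v[keep], pairs$id.u[keep], mean)

# Rows without any match are absent from means and therefore become NA
df$d <- as.vector(means)[match(df$id, as.integer(names(means)))]
df
```

Because the join first builds every within-group pair, this is best suited to moderately sized groups.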

You could use a loop that writes out the condition exactly as you described it:
n <- nrow(df)
d <- numeric(n)
for (i in seq_len(n)) {
x <- df$a[i]
y <- df$b[i]
d[i] <- with(df, mean(c[a == x & y - 2 <= b & b < y]))
}
all.equal(d, df$d)
#> [1] TRUE
I don't love this solution, but I couldn't think of a simple way to do this otherwise, because the required grouping isn't disjoint due to the condition for b. I'm very curious to see if somebody comes up with a neater way to do this.
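For what it's worth, the loop can also be wrapped in a small function and applied row by row with mapply(). This is just a sketch of the same logic, not a faster algorithm; mean_prev is a hypothetical helper name, and it returns NA explicitly when no row qualifies (the loop above yields NaN there).

```r
df <- data.frame(
  a = c("c", "a", "b", "a", "c", "b", "c", "a", "b", "c", "a", "b", "c", "b", "a"),
  b = c(6, 1, 1, 2, 1, 2, 2, 4, 3, 3, 5, 5, 4, 6, 6),
  c = c(69.9, 21.2, 37, 25, 65.9, 33.1, 67, 28.4, 36, 67, 22, 37.9, 62.3, 30, 25),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean of c over rows with the same a
# and with b in [y - 2, y); NA when no row qualifies.
mean_prev <- function(x, y, data) {
  vals <- data$c[data$a == x & data$b >= y - 2 & data$b < y]
  if (length(vals) == 0) NA_real_ else mean(vals)
}

# Apply the helper to each (a, b) pair of the data frame
df$d <- mapply(mean_prev, df$a, df$b,
               MoreArgs = list(data = df), USE.NAMES = FALSE)
df
```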

Related

Merge overlapping datasets by column identifier?

I am trying to merge/join two datasets which have different data about the same samples with no rows in common. I would like to be able to merge them by the column names and have that add the rows from the smaller dataset to the larger, filling in NA for all columns that do not have information from the smaller dataset. I feel like this is something super easy that I'm just somehow not able to figure out.
2 tiny sample datasets:
df1 <- data.frame(team=c('A', 'B', 'C', 'D'),
points=c(88, 98, 104, 100),
league=c('Alpha', 'Beta', 'Gamma', 'Delta'))
team points league
1 A 88 Alpha
2 B 98 Beta
3 C 104 Gamma
4 D 100 Delta
df2 <- data.frame(team=c('L', 'M','N', 'O', 'P', 'Q'),
points=c(43, 66, 77, 83, 12, 12),
league=c('Epsilon', 'Zeta', 'Eta', 'Theta', 'Iota', 'Kappa'),
rebounds=c(22, 31, 29, 20, 33, 44),
fouls=c(1, 3, 2, 4, 5, 1))
team points league rebounds fouls
1 L 43 Epsilon 22 1
2 M 66 Zeta 31 3
3 N 77 Eta 29 2
4 O 83 Theta 20 4
5 P 12 Iota 33 5
6 Q 12 Kappa 44 1
The output I would like to get is:
df3 <- data.frame(team=c('A', 'B', 'C', 'D', 'L', 'M', 'N', 'O', 'P', 'Q'),
points=c(88, 98, 104, 100, 43, 66, 77, 83, 12, 12),
league=c('Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon', 'Zeta', 'Eta', 'Theta', 'Iota', 'Kappa'),
rebounds=c(NA, NA, NA, NA, 22, 31, 29, 20, 33, 44),
fouls=c(NA, NA, NA, NA, 1, 3, 2, 4, 5, 1))
team points league rebounds fouls
1 A 88 Alpha NA NA
2 B 98 Beta NA NA
3 C 104 Gamma NA NA
4 D 100 Delta NA NA
5 L 43 Epsilon 22 1
6 M 66 Zeta 31 3
7 N 77 Eta 29 2
8 O 83 Theta 20 4
9 P 12 Iota 33 5
10 Q 12 Kappa 44 1
I tried transposing the dfs, but because they have no rows in common that does not work either. I thought about making an index, but I'm just learning about those and I'm not sure how I would do it or if that's the correct move.
Use full_join and arrange
library(dplyr)
full_join(df2, df1) %>%
arrange(team)
-output
team points league rebounds fouls
1 A 88 Alpha NA NA
2 B 98 Beta NA NA
3 C 104 Gamma NA NA
4 D 100 Delta NA NA
5 L 43 Epsilon 22 1
6 M 66 Zeta 31 3
7 N 77 Eta 29 2
8 O 83 Theta 20 4
9 P 12 Iota 33 5
10 Q 12 Kappa 44 1
Or with rows_upsert
rows_upsert(df2, df1, by = c("team", "points", "league"))
We could use bind_rows()
When row-binding, columns are matched by name, and any missing columns will be filled with NA:
library(dplyr)
bind_rows(df1, df2)
team points league rebounds fouls
1 A 88 Alpha NA NA
2 B 98 Beta NA NA
3 C 104 Gamma NA NA
4 D 100 Delta NA NA
5 L 43 Epsilon 22 1
6 M 66 Zeta 31 3
7 N 77 Eta 29 2
8 O 83 Theta 20 4
9 P 12 Iota 33 5
10 Q 12 Kappa 44 1
Using base R, you could add the missing columns in df1 using setdiff() and then rbind them together:
df1[, setdiff(names(df2), names(df1))] <- NA
rbind(df1, df2)
Output:
# team points league rebounds fouls
# 1 A 88 Alpha NA NA
# 2 B 98 Beta NA NA
# 3 C 104 Gamma NA NA
# 4 D 100 Delta NA NA
# 5 L 43 Epsilon 22 1
# 6 M 66 Zeta 31 3
# 7 N 77 Eta 29 2
# 8 O 83 Theta 20 4
# 9 P 12 Iota 33 5
# 10 Q 12 Kappa 44 1
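If either data frame may be missing columns that the other has, the base R idea can be wrapped in a small function. This is a sketch under that assumption; rbind_fill is a hypothetical name introduced here, not an existing base R function.

```r
df1 <- data.frame(team = c("A", "B", "C", "D"),
                  points = c(88, 98, 104, 100),
                  league = c("Alpha", "Beta", "Gamma", "Delta"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(team = c("L", "M", "N", "O", "P", "Q"),
                  points = c(43, 66, 77, 83, 12, 12),
                  league = c("Epsilon", "Zeta", "Eta", "Theta", "Iota", "Kappa"),
                  rebounds = c(22, 31, 29, 20, 33, 44),
                  fouls = c(1, 3, 2, 4, 5, 1),
                  stringsAsFactors = FALSE)

# Hypothetical helper: add whichever columns are missing on either side
# as NA, then row-bind with the columns in a common order.
rbind_fill <- function(x, y) {
  all_cols <- union(names(x), names(y))
  for (nm in setdiff(all_cols, names(x))) x[[nm]] <- NA
  for (nm in setdiff(all_cols, names(y))) y[[nm]] <- NA
  rbind(x[all_cols], y[all_cols])
}

out <- rbind_fill(df1, df2)
out
```

Unlike the two-line version, this leaves df1 itself unmodified.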

find first occurrence in two variables in df

I need to find the first two times my df meets a certain condition, grouped by two variables. I am trying to use the ddply function, but I am doing something wrong with the ".variables" argument.
So in this example, I'm trying to find the first two times x > 30 and y > 30 in each group / trial.
The way I'm using ddply is giving me the first two times in the dataset, then repeating that for every group.
set.seed(1)
df <- data.frame((matrix(nrow=200,ncol=5)))
colnames(df) <- c("group","trial","x","y","hour")
df$group <- rep(c("A","B","C","D"),each=50)
df$trial <- rep(c(rep(1,times=25),rep(2,times=25)),times=4)
df[,3:4] <- runif(400,0,50)
df$hour <- rep(1:25,times=8)
library(plyr)
ddply(.data=df, .variables=c("group","trial"), .fun=function(x) {
i <- which(df$x > 30 & df$y >30 )[1:2]
if (!is.na(i)) x[i, ]
})
Expected results:
group trial x y hour
13 A 1 34.3511423 38.161134 13
15 A 1 38.4920710 40.931734 15
36 A 2 33.4233369 34.481392 11
37 A 2 39.7119930 34.470671 12
52 B 1 43.0604738 46.645491 2
65 B 1 32.5435234 35.123126 15
But instead, my code is finding c(1,4) from the first group*trial and repeating that for every group*trial:
group trial x y hour
1 A 1 34.351142 38.161134 13
2 A 1 38.492071 40.931734 15
3 A 2 5.397181 27.745031 13
4 A 2 20.563721 22.636003 15
5 B 1 22.953286 13.898301 13
6 B 1 32.543523 35.123126 15
I would also like for there to be rows of NA if a second occurrence isn't present in a group*trial.
Thanks,
I think this is what you want:
library(tidyverse)
df %>% group_by(group, trial) %>% filter(x > 30 & y > 30) %>% slice(1:2)
Result:
# A tibble: 16 x 5
# Groups: group, trial [8]
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 33.5 46.3 4
2 A 1 32.6 42.7 11
3 A 2 35.9 43.6 4
4 A 2 30.5 42.7 14
5 B 1 33.0 38.1 2
6 B 1 40.5 30.4 7
7 B 2 48.6 33.2 2
8 B 2 34.1 30.9 4
9 C 1 33.0 45.1 1
10 C 1 30.3 36.7 17
11 C 2 44.8 33.9 1
12 C 2 41.5 35.6 6
13 D 1 44.2 34.3 12
14 D 1 39.1 40.0 23
15 D 2 39.4 47.5 4
16 D 2 42.1 40.1 10
(slightly different from your results, probably a different R version)
I recommend using dplyr or data.table rather than plyr. From the plyr GitHub page:
plyr is retired: this means only changes necessary to keep it on CRAN
will be made. We recommend using dplyr (for data frames) or purrr (for
lists) instead.
Since someone has already provided a solution with dplyr, here is one option with data.table.
In the selection df[i, j, by], I select the rows that match your criteria in i, group by the given variables in by, and take the first two rows (head) of each group-specific subset of the data, .SD. Everything inside the brackets is data.table-specific and only works because I first converted df to a data.table with setDT.
library(data.table)
setDT(df)
df[x > 30 & y > 30, head(.SD, 2), by = .(group, trial)]
# group trial x y hour
# 1: A 1 34.35114 38.16113 13
# 2: A 1 38.49207 40.93173 15
# 3: A 2 33.42334 34.48139 11
# 4: A 2 39.71199 34.47067 12
# 5: B 1 43.06047 46.64549 2
# 6: B 1 32.54352 35.12313 15
# 7: B 2 48.03090 38.53685 5
# 8: B 2 32.11441 49.07817 18
# 9: C 1 32.73620 33.68561 1
# 10: C 1 32.00505 31.23571 20
# 11: C 2 32.13977 40.60658 9
# 12: C 2 34.13940 49.47499 16
# 13: D 1 36.18630 34.94123 19
# 14: D 1 42.80658 46.42416 23
# 15: D 2 37.05393 43.24038 3
# 16: D 2 44.32255 32.80812 8
To try a solution that is closer to what you've tried so far we can do the following
ddply(.data=df, .variables=c("group","trial"), .fun=function(df_temp) {
i <- which(df_temp$x > 30 & df_temp$y >30 )[1:2]
df_temp[i, ]
})
Some explanation
One problem with the code you provided is that you used df inside ddply. You defined .fun = function(x), but then looked for rows with x > 30 & y > 30 in df rather than in the per-group data frame x. Your code then uses i to index x, even though i was computed from df. Finally, as far as I can tell, there is no need for if (!is.na(i)) x[i, ]. If only one row meets your condition, you will get a row of NAs anyway, because which(df_temp$x > 30 & df_temp$y > 30)[1:2] pads the index with NA.
Using dplyr, you can also do:
df %>%
group_by(group, trial) %>%
slice(which(x > 30 & y > 30)[1:2])
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 34.4 38.2 13
2 A 1 38.5 40.9 15
3 A 2 33.4 34.5 11
4 A 2 39.7 34.5 12
5 B 1 43.1 46.6 2
6 B 1 32.5 35.1 15
7 B 2 48.0 38.5 5
8 B 2 32.1 49.1 18
Since everything else is covered, here is a base R version using split:
output <- do.call(rbind, lapply(split(df, list(df$group, df$trial)),
function(new_df) new_df[with(new_df, head(which(x > 30 & y > 30), 2)), ]
))
rownames(output) <- NULL
output
# group trial x y hour
#1 A 1 34.351 38.161 13
#2 A 1 38.492 40.932 15
#3 B 1 43.060 46.645 2
#4 B 1 32.544 35.123 15
#5 C 1 32.736 33.686 1
#6 C 1 32.005 31.236 20
#7 D 1 36.186 34.941 19
#8 D 1 42.807 46.424 23
#9 A 2 33.423 34.481 11
#10 A 2 39.712 34.471 12
#11 B 2 48.031 38.537 5
#12 B 2 32.114 49.078 18
#13 C 2 32.140 40.607 9
#14 C 2 34.139 49.475 16
#15 D 2 37.054 43.240 3
#16 D 2 44.323 32.808 8

Enumerate quantiles in reverse order

I'm trying to get the quantile number of a column in a data frame, but in reverse order. I want the highest number to be in quantile number 1.
Here is what I have so far:
> x<-c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)
> x <- data.frame(x)
> within(x, Q <- as.integer(cut(x, quantile(x, probs=0:5/5, na.rm=TRUE),
include.lowest=TRUE)))
x Q
1 10.0 1
2 12.0 1
3 75.0 3
4 89.0 4
5 25.0 2
6 100.0 4
7 67.0 2
8 89.0 4
9 4.0 1
10 67.0 2
11 120.2 5
12 140.5 5
13 170.5 5
14 78.1 3
And what I want to get is:
x Q
1 10.0 5
2 12.0 5
3 75.0 3
4 89.0 2
5 25.0 4
6 100.0 2
7 67.0 4
8 89.0 2
9 4.0 5
10 67.0 4
11 120.2 1
12 140.5 1
13 170.5 1
14 78.1 3
One way to do this is to specify the reversed labels in the cut() function. If you want Q to be an integer then you need to first coerce the factor labels into a character and then into an integer.
result <- within(x, Q <- as.integer(as.character((cut(x,
quantile(x, probs = 0:5/5, na.rm = TRUE),
labels = c(5, 4, 3, 2, 1),
include.lowest = TRUE)))))
head(result)
x Q
1 10 5
2 12 5
3 75 3
4 89 2
5 25 4
6 100 2
Your data:
x <- c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)
x <- data.frame(x)
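An alternative that avoids the factor-to-character round trip is to compute the usual bin number and flip it arithmetically. This is a minimal sketch of that idea; n_bins and q_low_first are names introduced here for illustration.

```r
x <- data.frame(x = c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1))

n_bins <- 5L
breaks <- quantile(x$x, probs = 0:n_bins / n_bins, na.rm = TRUE)

# Usual numbering: 1 = lowest quantile bin
q_low_first <- as.integer(cut(x$x, breaks, include.lowest = TRUE))

# Flip it so that 1 = highest quantile bin
x$Q <- n_bins + 1L - q_low_first
x
```

The flip works for any number of bins because bin k of n maps to n + 1 - k.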

Replicate row value following a factor

Given the following data frame:
df <- data.frame(patientID = rep(c(1:4), 3),
condition = c(rep("A", 4), rep("B",4), rep("C",4)),
weight = round(rnorm(12, 70, 7), 1),
height = round(c(rnorm(4, 170, 10), rep(0, 8)), 1))
> head(df)
patientID condition weight height
1 1 A 71.43 168.5
2 2 A 59.89 177.3
3 3 A 72.15 163.4
4 4 A 70.14 166.1
5 1 B 66.21 0.0
6 2 B 66.62 0.0
How can I copy the height for each patient from condition A into the other two conditions? I tried using for loops, data.table and dplyr without success.
How can I achieve this using any of these methods?
If your data is as it looks - sorted by condition, patientID, and the patients per condition are identical, then you can just make use of recycling as follows:
require(data.table)
setDT(df)[, height := height[condition == "A"]]
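But I understand that's a lot of ifs there. Note also that newer versions of data.table no longer recycle a right-hand side of length greater than one in :=, so there this line needs an explicit rep() or the grouped form below.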
So, without assuming anything about the data, with one exception that condition,patientID pairs are unique, you can do:
require(data.table)
setDT(df)[, height := height[condition == "A"], by=patientID]
Once again, this makes use of recycling, but within each group - as it doesn't assume the data is ordered.
Both of the above methods on the sample data give:
# patientID condition weight height
# 1: 1 A 73.3 169.5
# 2: 2 A 76.3 173.4
# 3: 3 A 63.6 145.5
# 4: 4 A 56.2 164.7
# 5: 1 B 67.7 169.5
# 6: 2 B 77.3 173.4
# 7: 3 B 76.8 145.5
# 8: 4 B 70.9 164.7
# 9: 1 C 76.6 169.5
# 10: 2 C 73.0 173.4
# 11: 3 C 66.7 145.5
# 12: 4 C 71.6 164.7
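The same idea can be translated to dplyr as well, which I'll leave to you to try. Hint: it just requires group_by and mutate.
No need for the fancy stuff here. Just use the $ operator and [ subsetting. This works when the patient IDs are 1..n and the first n rows hold condition A in ID order, as in the sample data.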
> df$height <- df$height[df$patientID]
> df
patientID condition weight height
1 1 A 67.4 175.1
2 2 A 66.8 179.0
3 3 A 49.7 159.7
4 4 A 64.5 165.3
5 1 B 66.0 175.1
6 2 B 70.8 179.0
7 3 B 58.7 159.7
8 4 B 74.3 165.3
9 1 C 70.9 175.1
10 2 C 75.6 179.0
11 3 C 61.3 159.7
12 4 C 74.5 165.3
This should do the trick. It assumes that the first level of the condition factor is always the one with the true data.
idx <- tapply(rownames(df), list(df$patientID, df$condition), identity)
idx<-na.omit(cbind(as.vector(idx[,-1]),as.vector(idx[,1])))
df[as.vector(idx[,1]),"height"] <- df[as.vector(idx[,2]), "height"]
And from @Arun's suggestion
df$height<-with(df, ave(ifelse(condition=="A",height,-1),
factor(patientID), FUN=max))
where you can be explicit about the condition level to pull values from
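Another base R option is to look the reference heights up with match(), which does not depend on row order or on the IDs being 1..n. This is a sketch on toy values (the weights are omitted and the row order is reversed on purpose); ref is a name introduced here.

```r
# Toy data: heights are only known for condition A
df <- data.frame(
  patientID = rep(1:4, 3),
  condition = rep(c("A", "B", "C"), each = 4),
  height    = c(168.5, 177.3, 163.4, 166.1, rep(0, 8)),
  stringsAsFactors = FALSE
)
df <- df[nrow(df):1, ]  # reverse the rows to show that order does not matter

# Reference rows: one height per patient, taken from condition A
ref <- df[df$condition == "A", ]

# Look up each row's patient in the reference rows
df$height <- ref$height[match(df$patientID, ref$patientID)]
df
```

This still assumes that each patient has exactly one condition A row.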

Ranking variables with conditions

Say I have the following data frame:
df <- data.frame(store = LETTERS[1:8],
sales = c( 9, 128, 54, 66, 23, 132, 89, 70),
successRate = c(.80, .25, .54, .92, .85, .35, .54, .46))
I want to rank the stores according to successRate, with ties going to the store with more sales, so first I do this (just to make visualization easier):
df <- df[order(-df$successRate, -df$sales), ]
In order to actually create a ranking variable, I do the following:
df$rank <- ave(df$successRate, FUN = function(x) rank(-x, ties.method='first'))
So df looks like this:
store sales successRate rank
4 D 66 0.92 1
5 E 23 0.85 2
1 A 9 0.80 3
7 G 89 0.54 4
3 C 54 0.54 5
8 H 70 0.46 6
6 F 132 0.35 7
2 B 128 0.25 8
The problem is I don't want small stores to be part of the ranking. Specifically, I want stores with fewer than 50 sales not to be ranked. So this is how I define df$rank instead:
df$rank <- ifelse(df$sales < 50, NA,
ave(df$successRate, FUN = function(x) rank(-x, ties.method='first')))
The problem is that even though this correctly removes stores E and A, it doesn't reassign the rankings they were occupying. df looks like this now:
store sales successRate rank
4 D 66 0.92 1
5 E 23 0.85 NA
1 A 9 0.80 NA
7 G 89 0.54 4
3 C 54 0.54 5
8 H 70 0.46 6
6 F 132 0.35 7
2 B 128 0.25 8
I've experimented with conditions inside and outside ave(), but I can't get R to do what I want! How can I get it to rank the stores like this?
store sales successRate rank
4 D 66 0.92 1
5 E 23 0.85 NA
1 A 9 0.80 NA
7 G 89 0.54 2
3 C 54 0.54 3
8 H 70 0.46 4
6 F 132 0.35 5
2 B 128 0.25 6
Super easy to do with data.table:
library(data.table)
dt = data.table(df)
# do the ordering you like (note, could also use setkey to do this faster)
dt = dt[order(-successRate, -sales)]
dt[sales >= 50, rank := .I]
dt
# store sales successRate rank
#1: D 66 0.92 1
#2: E 23 0.85 NA
#3: A 9 0.80 NA
#4: G 89 0.54 2
#5: C 54 0.54 3
#6: H 70 0.46 4
#7: F 132 0.35 5
#8: B 128 0.25 6
If you must do it in data.frame, then after your preferred order, run:
df$rank <- NA
df$rank[df$sales >= 50] <- seq_len(sum(df$sales >= 50))
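If you would rather not depend on the rows being sorted first, the rank can be computed directly on the eligible stores. This is a sketch of one way to do it: order(order(...)) turns a sort order into ranks, with ties on successRate broken by higher sales; ok is a name introduced here.

```r
df <- data.frame(store = LETTERS[1:8],
                 sales = c(9, 128, 54, 66, 23, 132, 89, 70),
                 successRate = c(.80, .25, .54, .92, .85, .35, .54, .46),
                 stringsAsFactors = FALSE)

ok <- df$sales >= 50  # stores that take part in the ranking

df$rank <- NA_integer_
# Rank only the eligible stores: highest successRate first, then highest sales
df$rank[ok] <- order(order(-df$successRate[ok], -df$sales[ok]))
df
```

The data frame keeps its original row order (stores A to H); sort it afterwards with order() if you want the display from the question.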

Resources