mutate based on conditional sum in a group - r

Say I have a dataframe like this:
set.seed(1)
n <- 20
df <- data.frame(ID = sample(1:5, n, replace = TRUE),
Fac1 = sample(letters[1:5], n, replace = TRUE),
Fac2 = sample(LETTERS[10:15], n, replace = TRUE),
Val1 = sample(1:10, n, replace = TRUE)) %>%
arrange(ID) %>% group_by(ID,Fac1) %>%
summarise(Val1 = sum(Val1),Fac2 = first(Fac2)) %>%
group_by(ID,Fac2) %>%
mutate(Val2 = sum(Val1))
df
ID Fac1 Val1 Fac2 Val2
1 1 b 9 N 9
2 1 c 9 O 9
3 2 a 4 K 4
4 2 b 10 M 18
5 2 c 4 L 4
6 2 d 8 M 18
7 2 e 10 N 10
8 3 d 14 N 14
9 4 b 8 L 22
10 4 c 14 L 22
11 4 d 9 K 9
12 4 e 6 N 6
13 5 a 13 M 13
14 5 b 3 N 3
ID is a grouping variable. Rows with an Fac1 value of e should have the Fac2 value changed to be that same as the other row in the group where Fac1 is either b or c and the sum of Val 2 for the two rows if greater than 20. (I've simplified this to the point where you probably don't get why but just work with me).
This is what I have tried so far:
result <- df %>% group_by(ID) %>%
mutate(Fac2 = case_when(
Fac1 == "e" &
sum(Val2,ifelse(Fac1 %in% c("b","c"), Val2, 0)) > 20 ~
ifelse(sum(Val2,ifelse(Fac1 %in% c("b","c"),Val2,0)) > 20,
as.character(Fac2),
NA_character_),
TRUE ~ as.character(Fac2)
))
It doesn't work properly because it is summing the first value of Val2 in the group rather than only doing so when Fac1 is b or c.
Any ideas?
Adding desired outcome:
ID Fac1 Val1 Fac2 Val2
1 1 b 9 N 9
2 1 c 9 O 9
3 2 a 4 K 4
4 2 b 10 M 18
5 2 c 4 L 4
6 2 d 8 M 18
7 2 e 10 M 10 **Changed to M b/c row 4 is M and 10 + 18 > 20
8 3 d 14 N 14
9 4 b 8 L 22
10 4 c 14 L 22
11 4 d 9 K 9
12 4 e 6 L 6 **Changed to L b/c row 10 is L and 6 + 22 > 20
13 5 a 13 M 13
14 5 b 3 N 3

I'm having a hard time following what you are wanting the values to be changed to.
But when I have multiple conditions or decisions that need to be made in a sequence, I use a loop and a series of if statements to go through the data frame. I prefer while loops, so that's what I'll use in the example.
counter <- 1
stopper <- nrow(df)
while (counter <= stopper) {
fac1 <- df$Fac1[counter1]
if (fac1 == 'e') {
if ([INSERT NEXT CONDITION]) #Change whichever value your trying to change using the counter to reference the correct row.
else #Change whichever value your trying to change using the counter to reference the correct row.
}
counter <- counter + 1
}
For me, simplifying the code makes it a lot easier for me to keep track of what decisions are being made. It also allows for complex decisions that are difficult to get functions to work with.

I was able to get the desired result with this code. I made a new column containing the result of the test for what value to replace Fac2 with, which wasn't entirely necessary but makes it more readable and debugable.
The key thing was to use first(na.omit()) to get the value from a different row in the same group which met the condition.
result <- df %>% group_by(ID) %>%
mutate(Max_bc_Val = ifelse(Val2 == max(ifelse(Fac1 %in% c("b","c"),
Val2,0)),
ifelse(Fac1 %in% c("b","c"),
as.character(Fac2),NA),NA)) %>%
mutate(Fac2 = case_when(
Fac1 == "e" ~ ifelse(is.na(first(na.omit(Max_bc_Val))),
NA_character_,
first(na.omit(Max_bc_Val))),
TRUE ~ as.character(Fac2)))
This works but doesn't seem like the best solution. Any other ideas?

Related

Replace values in one dataframe with another thats not NA

I have two dataframes A and B, that share have the same column names and the same first column (Location)
A <- data.frame("Location" = 1:3, "X" = c(21,15, 7), "Y" = c(41,5, 5), "Z" = c(12,103, 88))
B <- data.frame("Location" = 1:3, "X" = c(NA,NA, 14), "Y" = c(50,8, NA), "Z" = c(NA,14, 12))
How do i replace the values in dataframe A with the values from B if the value in B is not NA?
Thanks.
We can use coalesce
library(dplyr)
A %>%
mutate(across(-Location, ~ coalesce(B[[cur_column()]], .)))
-output
# Location X Y Z
#1 1 21 50 12
#2 2 15 8 14
#3 3 14 5 12
Here's an answer in base R:
i <- which(!is.na(B),arr.ind = T)
A[i] <- B[i]
A
Location X Y Z
1 1 21 50 12
2 2 15 8 14
3 3 14 5 12
One option with fcoalesce from data.table pakcage
list2DF(Map(data.table::fcoalesce,B,A))
gives
Location X Y Z
1 1 21 50 12
2 2 15 8 14
3 3 14 5 12

Randomly sample values from a pool so that the sum is less than a threshold in R

Let's say we have a pool of values and I want to sample random number of values from this pool, so that the sum of these values is between two thresholds. I want to design a function in R to implemented that.
pool = data.frame(ID = letters, value = sample(1:5, size = 26, replace = T))
> print(pool)
ID value
1 a 1
2 b 4
3 c 4
4 d 2
5 e 2
6 f 4
7 g 5
8 h 5
9 i 4
10 j 3
11 k 3
12 l 5
13 m 3
14 n 2
15 o 3
16 p 4
17 q 1
18 r 1
19 s 5
20 t 1
21 u 2
22 v 4
23 w 5
24 x 2
25 y 4
26 z 1
I want to randomly sample what ever number of IDs so that the sum of values for these IDs are between two thresholds, let's say between 8 and 10 (including the two boundaries). The expected outcome should be like these:
c("a", "b", "c")
c("f", "g")
c("a", "d", "e", "j", "k")
I think this question has not been asked previously. Does anyone have clues?
Here's an approach where I shuffle the input and check the cumulative sum of the shuffled output to look for an acceptable sum.
If a subset of that initial sequence happens to work, it outputs that sequence (in this manifestation, the longest sequence under the max threshold). If it doesn't work, it reshuffles and looks again, up to the max number of iterations.
set.seed(42)
library(dplyr)
sample_in_range <- function(src_tbl, min_sum = 8, max_sum = 10, max_iter = 100) {
for(i in 1:max_iter) {
output <- src_tbl %>%
sample_n(nrow(src_tbl)) %>%
mutate(ID = as.character(ID),
cuml = cumsum(value)) %>%
filter(cuml <= max_sum)
if(max(output$cuml) >= min_sum) return(output)
}
}
output <- sample_in_range(pool)
output
ID value cuml
1 k 3 3
2 w 2 5
3 z 4 9
4 t 1 10
output %>% pull(ID)
[1] "k" "w" "z" "t"

remove cases following certain other cases

I have a dataframe, say
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
I want to remove only those rows in which one or multiple ts are directly in between a d and a c, in all other cases I want to retain the cases. So for this example, I would like to remove the ts on row 8, 18 and 19, but keep the others. I have over thousands of cases so doing this manually would be a true horror. Any help is very much appreciated.
One option would be to use rle to get runs of the same string and then you can use an sapply to check forward/backward and return all the positions you want to drop:
rle_vals <- rle(as.character(df$x))
drop <- unlist(sapply(2:length(rle_vals$values), #loop over values
function(i, vals, lengths) {
if(vals[i] == "t" & vals[i-1] == "d" & vals[i+1] == "c"){#Check if value is "t", previous is "d" and next is "c"
(sum(lengths[1:i-1]) + 1):sum(lengths[1:i]) #Get row #s
}
},vals = rle_vals$values, lengths = rle_vals$lengths))
drop
#[1] 8 18 19
df[-drop,]
# x y
#1 a 2
#2 a 4
#3 b 5
#4 b 2
#5 b 6
#6 c 2
#7 d 4
#9 c 2
#10 b 6
#11 t 2
#12 c 4
#13 t 5
#14 a 2
#15 a 6
#16 b 2
#17 d 4
#20 c 6
This also works, by collapsing to a string, identifying groups of t's between d and c (or c and d - not sure whether you wanted this option as well), then working out where they are and removing the rows as appropriate.
df = data.frame(x=c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y=c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6),stringsAsFactors = FALSE)
dfs <- paste0(df$x,collapse="") #collapse to a string
dfs2 <- do.call(rbind,lapply(list(gregexpr("dt+c",dfs),gregexpr("ct+d",dfs)),
function(L) data.frame(x=L[[1]],y=attr(L[[1]],"match.length"))))
dfs2 <- dfs2[dfs2$x>0,] #remove any -1 values (if string not found)
drop <- unlist(mapply(function(a,b) (a+1):(a+b-2),dfs2$x,dfs2$y))
df2 <- df[-drop,]
Here is another solution with base R:
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
#
s <- paste0(df$x, collapse="")
L <- c(NA, NA)
while (TRUE) {
r <- regexec("dt+c", s)[[1]]
if (r[1]==-1) break
L <- rbind(L, c(pos=r[1]+1, length=attr(r, "match.length")-2))
s <- sub("d(t+)c", "x\\1x", s)
}
L <- L[-1,]
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
# > drop
# 8 18 19
# > df[-drop, ]
# x y
# 1 a 2
# 2 a 4
# 3 b 5
# 4 b 2
# 5 b 6
# 6 c 2
# 7 d 4
# 9 c 2
# 10 b 6
# 11 t 2
# 12 c 4
# 13 t 5
# 14 a 2
# 15 a 6
# 16 b 2
# 17 d 4
# 20 c 6
With gregexpr() it is shorter:
s <- paste0(df$x, collapse="")
g <- gregexpr("dt+c", s)[[1]]
L <- data.frame(pos=g+1, length=attr(g, "match.length")-2)
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]

Repeating blocks of rows in a data frame based on another value in the data frame

There are a number of questions here about repeating rows a prespecified number of times in R, but I can't find one to address the specific question I'm asking.
I have a dataframe of responses from a survey in which each respondent answers somewhere between 5 and 10 questions. As a toy example:
df <- data.frame(ID = rep(1:2, each = 5),
Response = sample(LETTERS[1:4], 10, replace = TRUE),
Weight = rep(c(2,3), each = 5))
> df
ID Response Weight
1 1 D 2
2 1 C 2
3 1 D 2
4 1 D 2
5 1 B 2
6 2 D 3
7 2 C 3
8 2 B 3
9 2 D 3
10 2 B 3
I would like to repeat respondent 1's answers twice, as a block, and then respondent 2's answers 3 times, as a block, and I want each block of responses to have a unique ID. In other words, I want the end result to look like this:
ID Response Weight
1 11 D 2
2 11 C 2
3 11 D 2
4 11 D 2
5 11 B 2
6 12 D 2
7 12 C 2
8 12 D 2
9 12 D 2
10 12 B 2
11 21 D 3
12 21 C 3
13 21 B 3
14 21 D 3
15 21 B 3
16 22 D 3
17 22 C 3
18 22 B 3
19 22 D 3
20 22 B 3
21 23 D 3
22 23 C 3
23 23 B 3
24 23 D 3
25 23 B 3
The way I'm doing this is currently really clunky, and, given that I have >3000 respondents in my dataset, is unbearably slow.
Here's my code:
df.expanded <- NULL
for(i in unique(df$ID)) {
x <- df[df$ID == i,]
y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
y$order <- rep(1:max(x$Weight), nrow(x))
y <- y[with(y, order(order)),]
y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
df.expanded <- rbind(df.expanded, y)
}
Is there a faster way to do this?
There is an easier solution. I suppose you want to duplicate rows based on Weight as shown in your code.
df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
# sort the rows
df2 <- df2[order(df2$ID), ]
Is this method faster? Let's see:
library(microbenchmark)
microbenchmark(
m1 = {
df.expanded <- NULL
for(i in unique(df$ID)) {
x <- df[df$ID == i,]
y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
y$order <- rep(1:max(x$Weight), nrow(x))
y <- y[with(y, order(order)),]
y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
df.expanded <- rbind(df.expanded, y)
}
},
m2 = {
df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
# sort the rows
df2 <- df2[order(df2$ID), ]
}
)
# Unit: microseconds
# expr min lq mean median uq max neval
# m1 806.295 862.460 1101.6672 921.0690 1283.387 2588.730 100
# m2 171.731 194.199 245.7246 214.3725 283.145 506.184 100
There might be other more efficient ways.
Another approach would be to use data.table.
Assuming you're starting with "DT" as your data.table, try:
library(data.table)
DT[, list(.id = rep(seq(Weight[1]), each = .N), Weight, Response), .(ID)]
I haven't pasted the ID columns together, but instead, created a secondary column. That seems a little bit more flexible to me.
Data for testing. Change n to create a larger dataset to play with.
set.seed(1)
n <- 5
weights <- sample(3:15, n, TRUE)
df <- data.frame(ID = rep(seq_along(weights), weights),
Response = sample(LETTERS[1:5], sum(weights), TRUE),
Weight = rep(weights, weights))
DT <- as.data.table(df)

Reshape data based on column in dataframe

I need to take a data.frame in the format of:
id1 id2 mean start end
1 A D 4 12 15
2 B E 5 14 15
3 C F 6 8 10
and generate duplicate rows based on the difference in start - end. For example, I need 3 rows for the first row, 1 for the second, and 2 for the third. The start and end fields should be in sequential order in the final data.frame. The end result for this data.frame should be:
id1 id2 mean start end
1 A D 4 12 13
2 A D 4 13 14
3 A D 4 14 15
21 B E 5 14 15
31 C F 6 8 9
32 C F 6 9 10
I have written this function which works, but isn't written in very R'esque code:
dupData <- function(df){
diff <- abs(df$start - df$end)
ret <- {}
#Expand our dataframe into the appropriate number of rows.
for (i in 1:nrow(df)){
for (j in 1:diff[i]){
ret <- rbind(ret, df[i,])
}
}
#If matching ID1 and ID2, generate a sequential ordering of start & end dates
for (k in 2:nrow(ret) - 1) {
if ( ret[k,1] == ret[k + 1, 1] & ret[k, 2] == ret[k, 2] ){
ret[k, 5] <- ret[k, 4] + 1
ret[k + 1, 4] <- ret[k, 5]
}
}
return(ret)
}
Does anyone have suggestions on how to optimize this code? Is there a function in plyr which may be applicable?
#sample daters
df <- data.frame(id1 = c("A", "B", "C")
, id2 = c("D", "E", "F")
, mean = c(4,5,6)
, start = c(12,14,8)
, end = c(15, 15, 10)
)
There's probably a more general way to do this, but below uses rbind.fill.
cbind(df[rep(1:nrow(df), times = apply(df[,4:5], 1, diff)), 1:3],
rbind.fill(apply(df[,4:5], 1, function(x)
data.frame(start = x[1]:(x[2]-1), end = (x[1]+1):x[2]))))
## id1 id2 mean start end
## 1 A D 4 12 13
## 1.1 A D 4 13 14
## 1.2 A D 4 14 15
## 2 B E 5 14 15
## 3 C F 6 8 9
## 3.1 C F 6 9 10
The survSplit function of the survival package does something along these lines, though it has a bit more options (eg specifying the cut times). You might be able to use it, or look at its code to see if you can implement your simplified version better.
No doubt this isn't one of those times where late is better than never, but i had a similar issue and came up with this...
library(plyr)
ddply(df, c("id1", "id2", "mean", "start", "end"), summarise,
sq=seq(1:(end-start)))
Two alternatives, many years later, offering alternatives using today's popular data.table and tidyverse packages:
Option 1:
library(data.table)
setDT(mydf)[, list(mean, start = start:(end-1)), .(id1, id2)][, end := start + 1][]
id1 id2 mean start end
1: A D 4 12 13
2: A D 4 13 14
3: A D 4 14 15
4: B E 5 14 15
5: C F 6 8 9
6: C F 6 9 10
Option 2:
library(tidyverse)
mydf %>%
group_by(id1, id2, mean) %>%
summarise(start = list(start:(end-1))) %>%
unnest(start) %>%
mutate(end = start+1)

Resources