Reshape data based on column in dataframe - r

I need to take a data.frame in the format of:
id1 id2 mean start end
1 A D 4 12 15
2 B E 5 14 15
3 C F 6 8 10
and generate duplicate rows based on the difference between start and end. For example, I need 3 rows for the first row, 1 for the second, and 2 for the third. The start and end fields should be in sequential order in the final data.frame. The end result for this data.frame should be:
id1 id2 mean start end
1 A D 4 12 13
2 A D 4 13 14
3 A D 4 14 15
21 B E 5 14 15
31 C F 6 8 9
32 C F 6 9 10
I have written this function which works, but isn't written in very R'esque code:
dupData <- function(df){
  diff <- abs(df$start - df$end)
  ret <- NULL
  # Expand our dataframe into the appropriate number of rows.
  for (i in 1:nrow(df)){
    for (j in 1:diff[i]){
      ret <- rbind(ret, df[i,])
    }
  }
  # If ID1 and ID2 match on consecutive rows, generate a sequential ordering of start & end dates
  for (k in 1:(nrow(ret) - 1)) {
    if ( ret[k, 1] == ret[k + 1, 1] & ret[k, 2] == ret[k + 1, 2] ){
      ret[k, 5] <- ret[k, 4] + 1
      ret[k + 1, 4] <- ret[k, 5]
    }
  }
  return(ret)
}
Does anyone have suggestions on how to optimize this code? Is there a function in plyr which may be applicable?
# sample data
df <- data.frame(id1 = c("A", "B", "C")
, id2 = c("D", "E", "F")
, mean = c(4,5,6)
, start = c(12,14,8)
, end = c(15, 15, 10)
)

There's probably a more general way to do this, but the code below uses rbind.fill from the plyr package.
library(plyr)
cbind(df[rep(1:nrow(df), times = apply(df[,4:5], 1, diff)), 1:3],
rbind.fill(apply(df[,4:5], 1, function(x)
data.frame(start = x[1]:(x[2]-1), end = (x[1]+1):x[2]))))
## id1 id2 mean start end
## 1 A D 4 12 13
## 1.1 A D 4 13 14
## 1.2 A D 4 14 15
## 2 B E 5 14 15
## 3 C F 6 8 9
## 3.1 C F 6 9 10

The survSplit function of the survival package does something along these lines, though it has a few more options (e.g. specifying the cut times). You might be able to use it, or look at its code to see if you can implement your simplified version better.

No doubt this isn't one of those times where late is better than never, but I had a similar issue and came up with this:
library(plyr)
ddply(df, c("id1", "id2", "mean", "start", "end"), summarise,
sq=seq(1:(end-start)))

Two alternatives, many years later, using today's popular data.table and tidyverse packages (mydf here is the sample data.frame from the question):
Option 1:
library(data.table)
setDT(mydf)[, list(mean, start = start:(end-1)), .(id1, id2)][, end := start + 1][]
id1 id2 mean start end
1: A D 4 12 13
2: A D 4 13 14
3: A D 4 14 15
4: B E 5 14 15
5: C F 6 8 9
6: C F 6 9 10
Option 2:
library(tidyverse)
mydf %>%
group_by(id1, id2, mean) %>%
summarise(start = list(start:(end-1))) %>%
unnest(start) %>%
mutate(end = start+1)
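For comparison, the same expansion can be sketched in base R without any packages. This is only a sketch, assuming end is always greater than start and that both are whole numbers; the names n and out are made up for illustration.

```r
# Sample data from the question
df <- data.frame(id1 = c("A", "B", "C"),
                 id2 = c("D", "E", "F"),
                 mean = c(4, 5, 6),
                 start = c(12, 14, 8),
                 end = c(15, 15, 10))

# Number of unit-length intervals each row should be split into
# (assumption: end > start and the difference is a whole number)
n <- df$end - df$start

# Repeat each row n times, then shift start by 0, 1, 2, ... within each row's copies
out <- df[rep(seq_len(nrow(df)), times = n), ]
out$start <- out$start + sequence(n) - 1
out$end <- out$start + 1
rownames(out) <- NULL
out
```

Because rep() and sequence() are vectorised, no row-by-row rbind() is needed, which is usually where the loop version loses time.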

Related

Find overlap of multiple ranges in data.table

I would like to find the overlapping part of multiple ranges which are given row-wise in a data.table object.
An example would be:
t <- data.table(a = c(3,4,5), b = c(13,12,19))
So we have the ranges:
3 - 13,
4 - 12,
5 - 19
Hence the overlapping range would be:
5 - 12
In case of an additional range 19 - 22 the overlap should return NA - NA or 0 - 0 since there is no overlap.
I found solutions for similar problems, like spatstat.utils::intersect.ranges(). However, this works only on two vectors and is hard to implement in a data.table
DT[, .(o.l = function()[1], o.r = function()[2]), by = .()]
manner, which I would really like to do if possible.
As output for this example I would like to have:
t <- data.table(a = c(3,4,5), b = c(13,12,19), o.l = c(5,5,5), o.r = c(12,12,12))
Here's a one-line example:
library(data.table)
dt = data.table(a = c(3,4,5), b = c(13,12,19))
dt[, c("o.l", "o.r") := as.list(range(Reduce(intersect, mapply(seq, a, b, 1))))]
dt
# a b o.l o.r
# 1: 3 13 5 12
# 2: 4 12 5 12
# 3: 5 19 5 12
Where the core of the problem is
dt = data.table(a = c(3,4,5), b = c(13,12,19))
dt[, Reduce(intersect, mapply(seq, a, b, 1))]
# [1] 5 6 7 8 9 10 11 12
Borrowing the idea from David Arenburg's answer in How to flatten / merge overlapping time periods, here is another possible approach:
DT[, g := c(0L, cumsum(shift(a, -1L) >= cummax(b))[-.N])][,
c("ol", "or") := .(max(a), min(b)), g]
data:
DT <- data.table(a = c(3,4,5,19,20,24), b = c(13,12,19,22,23,25))
output:
a b g ol or
1: 3 13 0 5 12
2: 4 12 0 5 12
3: 5 19 0 5 12
4: 19 22 1 20 22
5: 20 23 1 20 22
6: 24 25 2 24 25
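When all rows are meant to form a single group, the common overlap of closed ranges is simply the largest left end and the smallest right end, provided the former does not exceed the latter. The sketch below assumes a <= b in every row; overlap is a hypothetical helper name, not part of data.table. Unlike the seq/intersect one-liner, it does not need integer endpoints and returns NA when there is no overlap.

```r
library(data.table)

# Hypothetical helper: common overlap of closed ranges [lo, hi]
overlap <- function(lo, hi) {
  l <- max(lo)
  r <- min(hi)
  # No common overlap if the largest start lies beyond the smallest end
  if (l <= r) list(l, r) else list(NA_real_, NA_real_)
}

dt <- data.table(a = c(3, 4, 5), b = c(13, 12, 19))
# The two length-1 results are recycled to every row
dt[, c("o.l", "o.r") := overlap(a, b)]
dt
```

The same call can be combined with a by argument if the overlap is wanted per group.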

mutate based on conditional sum in a group

Say I have a dataframe like this:
library(dplyr)
set.seed(1)
n <- 20
df <- data.frame(ID = sample(1:5, n, replace = TRUE),
Fac1 = sample(letters[1:5], n, replace = TRUE),
Fac2 = sample(LETTERS[10:15], n, replace = TRUE),
Val1 = sample(1:10, n, replace = TRUE)) %>%
arrange(ID) %>% group_by(ID,Fac1) %>%
summarise(Val1 = sum(Val1),Fac2 = first(Fac2)) %>%
group_by(ID,Fac2) %>%
mutate(Val2 = sum(Val1))
df
ID Fac1 Val1 Fac2 Val2
1 1 b 9 N 9
2 1 c 9 O 9
3 2 a 4 K 4
4 2 b 10 M 18
5 2 c 4 L 4
6 2 d 8 M 18
7 2 e 10 N 10
8 3 d 14 N 14
9 4 b 8 L 22
10 4 c 14 L 22
11 4 d 9 K 9
12 4 e 6 N 6
13 5 a 13 M 13
14 5 b 3 N 3
ID is a grouping variable. Rows with a Fac1 value of e should have their Fac2 value changed to be the same as that of the other row in the group where Fac1 is either b or c, if the sum of Val2 for the two rows is greater than 20. (I've simplified this to the point where you probably don't get why, but just work with me.)
This is what I have tried so far:
result <- df %>% group_by(ID) %>%
mutate(Fac2 = case_when(
Fac1 == "e" &
sum(Val2,ifelse(Fac1 %in% c("b","c"), Val2, 0)) > 20 ~
ifelse(sum(Val2,ifelse(Fac1 %in% c("b","c"),Val2,0)) > 20,
as.character(Fac2),
NA_character_),
TRUE ~ as.character(Fac2)
))
It doesn't work properly because it is summing the first value of Val2 in the group rather than only doing so when Fac1 is b or c.
Any ideas?
Adding desired outcome:
ID Fac1 Val1 Fac2 Val2
1 1 b 9 N 9
2 1 c 9 O 9
3 2 a 4 K 4
4 2 b 10 M 18
5 2 c 4 L 4
6 2 d 8 M 18
7 2 e 10 M 10 **Changed to M b/c row 4 is M and 10 + 18 > 20
8 3 d 14 N 14
9 4 b 8 L 22
10 4 c 14 L 22
11 4 d 9 K 9
12 4 e 6 L 6 **Changed to L b/c row 10 is L and 6 + 22 > 20
13 5 a 13 M 13
14 5 b 3 N 3
I'm having a hard time following what you are wanting the values to be changed to.
But when I have multiple conditions or decisions that need to be made in a sequence, I use a loop and a series of if statements to go through the data frame. I prefer while loops, so that's what I'll use in the example.
counter <- 1
stopper <- nrow(df)
while (counter <= stopper) {
  fac1 <- df$Fac1[counter]
  if (fac1 == 'e') {
    if ([INSERT NEXT CONDITION]) {
      # Change whichever value you're trying to change, using the counter to reference the correct row.
    } else {
      # Change whichever value you're trying to change, using the counter to reference the correct row.
    }
  }
  counter <- counter + 1
}
Simplifying the code makes it a lot easier for me to keep track of what decisions are being made. It also allows for complex decisions that are difficult to get functions to work with.
I was able to get the desired result with this code. I made a new column containing the result of the test for what value to replace Fac2 with, which wasn't entirely necessary but makes it more readable and debuggable.
The key thing was to use first(na.omit()) to get the value from a different row in the same group which met the condition.
result <- df %>% group_by(ID) %>%
mutate(Max_bc_Val = ifelse(Val2 == max(ifelse(Fac1 %in% c("b","c"),
Val2,0)),
ifelse(Fac1 %in% c("b","c"),
as.character(Fac2),NA),NA)) %>%
mutate(Fac2 = case_when(
Fac1 == "e" ~ ifelse(is.na(first(na.omit(Max_bc_Val))),
NA_character_,
first(na.omit(Max_bc_Val))),
TRUE ~ as.character(Fac2)))
This works but doesn't seem like the best solution. Any other ideas?
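One possible alternative, sketched under assumptions: within each ID, pick the b/c row with the largest Val2, and give the e row that row's Fac2 only when the two Val2 values sum to more than 20. The data frame is typed in from the table printed in the question rather than regenerated with set.seed, and the helper columns is_bc, bc_val and bc_fac are made-up names.

```r
library(dplyr)

# Data copied from the table shown in the question
df <- data.frame(
  ID   = c(1, 1, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4, 5, 5),
  Fac1 = c("b", "c", "a", "b", "c", "d", "e", "d", "b", "c", "d", "e", "a", "b"),
  Val1 = c(9, 9, 4, 10, 4, 8, 10, 14, 8, 14, 9, 6, 13, 3),
  Fac2 = c("N", "O", "K", "M", "L", "M", "N", "N", "L", "L", "K", "N", "M", "N"),
  Val2 = c(9, 9, 4, 18, 4, 18, 10, 14, 22, 22, 9, 6, 13, 3),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(ID) %>%
  mutate(
    is_bc  = Fac1 %in% c("b", "c"),
    # Largest Val2 among the b/c rows of this ID (NA if there are none)
    bc_val = if (any(is_bc)) max(Val2[is_bc]) else NA_real_,
    # Fac2 of that same b/c row (first one on ties)
    bc_fac = if (any(is_bc)) Fac2[is_bc][which.max(Val2[is_bc])] else NA_character_,
    # Only e rows whose Val2 plus the b/c Val2 exceeds 20 are changed
    Fac2   = ifelse(Fac1 == "e" & !is.na(bc_val) & Val2 + bc_val > 20, bc_fac, Fac2)
  ) %>%
  ungroup() %>%
  select(-is_bc, -bc_val, -bc_fac)
result
```

Whether "the largest b/c row" is the right rule when b and c disagree is an assumption taken from the answer above; the question itself does not say.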

remove cases following certain other cases

I have a dataframe, say
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
I want to remove only those rows in which one or more ts sit directly between a d and a c; in all other cases I want to retain the rows. So for this example, I would like to remove the ts on rows 8, 18 and 19, but keep the others. I have thousands of cases, so doing this manually would be a true horror. Any help is very much appreciated.
One option would be to use rle to get runs of the same string and then you can use an sapply to check forward/backward and return all the positions you want to drop:
rle_vals <- rle(as.character(df$x))
drop <- unlist(sapply(2:length(rle_vals$values), # loop over runs
  function(i, vals, lengths) {
    # Check if the run is "t", the previous run is "d" and the next run is "c"
    # (isTRUE() guards against the NA produced when i is the last run)
    if (isTRUE(vals[i] == "t" & vals[i - 1] == "d" & vals[i + 1] == "c")) {
      (sum(lengths[seq_len(i - 1)]) + 1):sum(lengths[1:i]) # Get row numbers
    }
  }, vals = rle_vals$values, lengths = rle_vals$lengths))
drop
#[1] 8 18 19
df[-drop,]
# x y
#1 a 2
#2 a 4
#3 b 5
#4 b 2
#5 b 6
#6 c 2
#7 d 4
#9 c 2
#10 b 6
#11 t 2
#12 c 4
#13 t 5
#14 a 2
#15 a 6
#16 b 2
#17 d 4
#20 c 6
This also works, by collapsing to a string, identifying groups of t's between d and c (or c and d - not sure whether you wanted this option as well), then working out where they are and removing the rows as appropriate.
df = data.frame(x=c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y=c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6),stringsAsFactors = FALSE)
dfs <- paste0(df$x,collapse="") #collapse to a string
dfs2 <- do.call(rbind,lapply(list(gregexpr("dt+c",dfs),gregexpr("ct+d",dfs)),
function(L) data.frame(x=L[[1]],y=attr(L[[1]],"match.length"))))
dfs2 <- dfs2[dfs2$x>0,] #remove any -1 values (if string not found)
drop <- unlist(mapply(function(a,b) (a+1):(a+b-2),dfs2$x,dfs2$y))
df2 <- df[-drop,]
Here is another solution with base R:
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
#
s <- paste0(df$x, collapse="")
L <- c(NA, NA)
while (TRUE) {
r <- regexec("dt+c", s)[[1]]
if (r[1]==-1) break
L <- rbind(L, c(pos=r[1]+1, length=attr(r, "match.length")-2))
s <- sub("d(t+)c", "x\\1x", s)
}
L <- L[-1,]
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
# > drop
# 8 18 19
# > df[-drop, ]
# x y
# 1 a 2
# 2 a 4
# 3 b 5
# 4 b 2
# 5 b 6
# 6 c 2
# 7 d 4
# 9 c 2
# 10 b 6
# 11 t 2
# 12 c 4
# 13 t 5
# 14 a 2
# 15 a 6
# 16 b 2
# 17 d 4
# 20 c 6
With gregexpr() it is shorter:
s <- paste0(df$x, collapse="")
g <- gregexpr("dt+c", s)[[1]]
L <- data.frame(pos=g+1, length=attr(g, "match.length")-2)
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
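One caveat with all of the df[-drop, ] variants: if nothing matches, drop is empty (or built from -1), and negative indexing with an empty vector returns zero rows instead of the full data frame. A minimal sketch of the gregexpr() idea wrapped in a function with that guard is shown here; drop_t_between is a hypothetical name, and the approach assumes every value in the column is a single character.

```r
# Hypothetical helper: drop runs of "t" that sit directly between a "d" and a "c"
drop_t_between <- function(df, col = "x") {
  # Collapse the column to one string (assumes single-character values)
  s <- paste0(df[[col]], collapse = "")
  g <- gregexpr("dt+c", s)[[1]]
  # No match at all: return the data unchanged rather than indexing with -1
  if (g[1] == -1) return(df)
  len <- attr(g, "match.length")
  # For each match, the t's start one position after the d and span length - 2
  drop <- unlist(Map(function(p, l) seq(from = p + 1, length.out = l - 2),
                     as.integer(g), len))
  df[-drop, , drop = FALSE]
}

df <- data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
                 y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6),
                 stringsAsFactors = FALSE)
res <- drop_t_between(df)
res
```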

Repeating blocks of rows in a data frame based on another value in the data frame

There are a number of questions here about repeating rows a prespecified number of times in R, but I can't find one to address the specific question I'm asking.
I have a dataframe of responses from a survey in which each respondent answers somewhere between 5 and 10 questions. As a toy example:
df <- data.frame(ID = rep(1:2, each = 5),
Response = sample(LETTERS[1:4], 10, replace = TRUE),
Weight = rep(c(2,3), each = 5))
> df
ID Response Weight
1 1 D 2
2 1 C 2
3 1 D 2
4 1 D 2
5 1 B 2
6 2 D 3
7 2 C 3
8 2 B 3
9 2 D 3
10 2 B 3
I would like to repeat respondent 1's answers twice, as a block, and then respondent 2's answers 3 times, as a block, and I want each block of responses to have a unique ID. In other words, I want the end result to look like this:
ID Response Weight
1 11 D 2
2 11 C 2
3 11 D 2
4 11 D 2
5 11 B 2
6 12 D 2
7 12 C 2
8 12 D 2
9 12 D 2
10 12 B 2
11 21 D 3
12 21 C 3
13 21 B 3
14 21 D 3
15 21 B 3
16 22 D 3
17 22 C 3
18 22 B 3
19 22 D 3
20 22 B 3
21 23 D 3
22 23 C 3
23 23 B 3
24 23 D 3
25 23 B 3
The way I'm doing this is currently really clunky, and, given that I have >3000 respondents in my dataset, is unbearably slow.
Here's my code:
df.expanded <- NULL
for(i in unique(df$ID)) {
x <- df[df$ID == i,]
y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
y$order <- rep(1:max(x$Weight), nrow(x))
y <- y[with(y, order(order)),]
y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
df.expanded <- rbind(df.expanded, y)
}
Is there a faster way to do this?
There is an easier solution. I suppose you want to duplicate rows based on Weight as shown in your code.
df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
# sort the rows
df2 <- df2[order(df2$ID), ]
Is this method faster? Let's see:
library(microbenchmark)
microbenchmark(
m1 = {
df.expanded <- NULL
for(i in unique(df$ID)) {
x <- df[df$ID == i,]
y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
y$order <- rep(1:max(x$Weight), nrow(x))
y <- y[with(y, order(order)),]
y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
df.expanded <- rbind(df.expanded, y)
}
},
m2 = {
df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
# sort the rows
df2 <- df2[order(df2$ID), ]
}
)
# Unit: microseconds
# expr min lq mean median uq max neval
# m1 806.295 862.460 1101.6672 921.0690 1283.387 2588.730 100
# m2 171.731 194.199 245.7246 214.3725 283.145 506.184 100
There might be other more efficient ways.
Another approach would be to use data.table.
Assuming you're starting with "DT" as your data.table, try:
library(data.table)
DT[, list(.id = rep(seq(Weight[1]), each = .N), Weight, Response), .(ID)]
I haven't pasted the ID columns together, but instead, created a secondary column. That seems a little bit more flexible to me.
Data for testing. Change n to create a larger dataset to play with.
set.seed(1)
n <- 5
weights <- sample(3:15, n, TRUE)
df <- data.frame(ID = rep(seq_along(weights), weights),
Response = sample(LETTERS[1:5], sum(weights), TRUE),
Weight = rep(weights, weights))
DT <- as.data.table(df)
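The secondary-column idea can also be sketched in base R. This version repeats each respondent's block as a whole and numbers the copies in a separate Copy column, which avoids the ambiguity of pasted IDs (for example, ID 1 copy 11 versus ID 11 copy 1). It assumes Weight is constant within each ID; toy, expanded and Copy are made-up names.

```r
# Toy data in the shape used in the question (responses fixed for reproducibility)
toy <- data.frame(ID = rep(1:2, each = 5),
                  Response = c("D", "C", "D", "D", "B", "D", "C", "B", "D", "B"),
                  Weight = rep(c(2, 3), each = 5))

# Build every respondent's expanded block once, then bind them in a single rbind call
expanded <- do.call(rbind, lapply(split(toy, toy$ID), function(x) {
  w <- x$Weight[1]                                    # assumption: one weight per ID
  out <- x[rep(seq_len(nrow(x)), times = w), ]        # whole block repeated w times
  out$Copy <- rep(seq_len(w), each = nrow(x))         # 1,1,...,2,2,... per block
  out
}))
rownames(expanded) <- NULL
expanded
```

A unique block identifier can then be derived from ID and Copy together if one is needed.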

In R: dcast in function, pass column names (again!)

Given a df in semi-long format with id variables a and b and measured data in columns m1 and m2. The type of data is specified by the variable v (values var1 and var2).
set.seed(8)
df_l <-
data.frame(
a = rep(sample(LETTERS,5),2),
b = rep(sample(letters,5),2),
v = c(rep("var1",5),rep("var2",5)),
m1 = sample(1:10,10,F),
m2 = sample(20:40,10,F))
Looks as:
a b v m1 m2
1 W r var1 3 40
2 N l var1 6 32
3 R a var1 9 28
4 F g var1 5 21
5 E u var1 4 38
6 W r var2 1 35
7 N l var2 8 33
8 R a var2 10 29
9 F g var2 7 30
10 E u var2 2 23
If I want to make a wide format of the values in m1, using id a as rows and the values in v as columns, I do:
> reshape2::dcast(df_l, a~v, value.var="m1")
a var1 var2
1 E 4 2
2 F 5 7
3 N 6 8
4 R 9 10
5 W 3 1
How do I write a function that does this, where the arguments to dcast (row, column and value.var) are supplied as function arguments, something like:
fun <- function(df,row,col,val){
require(reshape2)
res <-
dcast(df, row~col, value.var=val)
return(res)
}
I checked SO here and here to try variations of match.call and eval(substitute()) in order to "get" the arguments inside the function, and also tried with the lazyeval package. No success.
What am I doing wrong here? How do I get dcast to recognize variable names?
The formula argument also accepts character input.
foo <- function(df, id, measure, val) {
dcast(df, paste(paste(id, collapse = " + "), "~",
paste(measure, collapse = " + ")),
value.var = val)
}
require(reshape2)
foo(df_l, "a", "v", "m1")
Note that data.table's dcast (current development) can also cast multiple value.var columns directly. So, you can also do:
require(data.table) # v1.9.5
foo(setDT(df_l), "a", "v", c("m1", "m2"))
# a m1_var1 m1_var2 m2_var1 m2_var2
# 1: F 1 6 28 21
# 2: H 9 2 38 29
# 3: M 5 10 24 35
# 4: O 8 3 23 26
# 5: T 4 7 31 39
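An equivalent sketch builds an actual formula object from the column names with as.formula() before handing it to reshape2's dcast. The data frame is typed in from the table printed in the question, and cast_by_name is a hypothetical function name.

```r
library(reshape2)

# Data copied from the table shown in the question
df_l <- data.frame(
  a  = rep(c("W", "N", "R", "F", "E"), 2),
  b  = rep(c("r", "l", "a", "g", "u"), 2),
  v  = rep(c("var1", "var2"), each = 5),
  m1 = c(3, 6, 9, 5, 4, 1, 8, 10, 7, 2),
  m2 = c(40, 32, 28, 21, 38, 35, 33, 29, 30, 23),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper: row, col and val are character vectors of column names
cast_by_name <- function(df, row, col, val) {
  # Build "row1 + row2 ~ col1 + col2" and convert it to a formula
  f <- as.formula(paste(paste(row, collapse = " + "), "~",
                        paste(col, collapse = " + ")))
  dcast(df, f, value.var = val)
}

res <- cast_by_name(df_l, "a", "v", "m1")
res
```

Passing names as strings keeps the wrapper free of non-standard evaluation, which is what made the match.call and substitute attempts awkward.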

Resources