Given the following data.frame
d <- rep(c("a", "b"), each=5)
l <- rep(1:5, 2)
v <- 1:10
df <- data.frame(d=d, l=l, v=v*v)
df
d l v
1 a 1 1
2 a 2 4
3 a 3 9
4 a 4 16
5 a 5 25
6 b 1 36
7 b 2 49
8 b 3 64
9 b 4 81
10 b 5 100
Now I want to add another column after grouping by l. The extra column should contain the value of v_b - v_a
d l v e
1 a 1 1 35 (36-1)
2 a 2 4 45 (49-4)
3 a 3 9 55 (64-9)
4 a 4 16 65 (81-16)
5 a 5 25 75 (100-25)
6 b 1 36 35 (36-1)
7 b 2 49 45 (49-4)
8 b 3 64 55 (64-9)
9 b 4 81 65 (81-16)
10 b 5 100 75 (100-25)
In paranthesis the way how to calculate the value.
I'm looking for a way using dplyr. So I started with something like this
df %.%
group_by(l) %.%
mutate(e=myCustomFunction)
But how should I define myCustomFunction? I thought grouping of the data.frame produces another (sub-)data.frame which is a parameter to this function. But it isn't...
I guess this is the dplyr equivalent to #jlhoward's data.table solution:
df %>%
group_by(l) %>%
mutate(e = v[d == "b"] - v[d == "a"])
Edit after comment by OP:
If you want to use a custom function, here's a possible way:
myfunc <- function(x) {
with(x, v[d == "b"] - v[d == "a"])
}
test %>%
group_by(l) %>%
do(data.frame(. , e = myfunc(.))) %>%
arrange(d, l) # <- just to get it back in the original order
Edit after comment by #hadley:
As hadley commented below, it would be better in this case to define the function as
f <- function(v, d) v[d == "b"] - v[d == "a"]
and then use the custom function f inside a mutate:
df %>%
group_by(l) %>%
mutate(e = f(v, d))
Thanks #hadley for the comment.
Using dplyr:
df %.%
group_by(l) %.%
mutate(e=diff(v))
# d l v e
# 1 a 1 1 35
# 2 a 2 4 45
# 3 a 3 9 55
# 4 a 4 16 65
# 5 a 5 25 75
# 6 b 1 36 35
# 7 b 2 49 45
# 8 b 3 64 55
# 9 b 4 81 65
# 10 b 5 100 75
Here's an approach using data tables.
library(data.table)
DT <- as.data.table(df)
DT[,e := diff(v), by=l]
These approaches using diff(...) assume your data frame is sorted as in your example. If not, this is a more reliable way to do the same thing.
DT[, e := .SD[d == "b", v] - .SD[d == "a", v], by=l]
(or) even more directly
DT[, e := v[d == "b"] - v[d == "a"], by=l]
But if you want to access the entire subset of data and pass it to your custom function, then you can use .SD. Also make sure you read about ?.SDcols from ?data.table.
If you want to consider a non-dplyr option
df$e <- with(df, ave(v, l, FUN=function(x) diff(x)))
will do the trick. The ave function is useful for calculating values for groups of observations.
Related
I have created a data frame which has string and integers. The integers which are positive and negative.
I have to change all the ints to be positive without using for/if loops but by only using vectorization and indexing. I have created one with a for loop but I am a bit stuck on the next part.
df <- data.frame(x = letters[1:5],
y = seq(-4,4,2),
z = c(3,4,-5,6,-8))
This is my loop to convert to positive.
loop_df_fn <- function(data){
for(i in names(data)){
if(is.numeric(data[[i]])){
data[[i]][data[[i]]<0] <- abs(data[[i]][data[[i]]< 0])*10
}
}
return(data)
}
print((loop_df_fn(df)))
You can use
df[] <- lapply(df , \(x) if(is.numeric(x)) abs(x)*10 else x)
Output
x y z
1 a 40 30
2 b 20 40
3 c 0 50
4 d 20 60
5 e 40 80
A tidy solution:
library(dplyr)
df1 <- df %>%
mutate(across(where(is.numeric), ~if_else(.<0, .*-10, .)))
rapply(df, \(x) (x*-10)^(x<0)*x^(x>0), 'numeric', how='replace')
x y z
1 a 40 3
2 b 20 4
3 c 1 50
4 d 2 6
5 e 4 80
rapply(df, \(x) replace(x, x<0, x[x<0]*-10), 'numeric', how='replace')
x y z
1 a 40 3
2 b 20 4
3 c 0 50
4 d 2 6
5 e 4 80
lastly:
ind <- sapply(df, is.numeric)
df[ind][df[ind]<0] <- df[ind][df[ind]<0] * -10
df
x y z
1 a 40 3
2 b 20 4
3 c 0 50
4 d 2 6
5 e 4 80
Say I have a dataframe like this:
set.seed(1)
n <- 20
df <- data.frame(ID = sample(1:5, n, replace = TRUE),
Fac1 = sample(letters[1:5], n, replace = TRUE),
Fac2 = sample(LETTERS[10:15], n, replace = TRUE),
Val1 = sample(1:10, n, replace = TRUE)) %>%
arrange(ID) %>% group_by(ID,Fac1) %>%
summarise(Val1 = sum(Val1),Fac2 = first(Fac2)) %>%
group_by(ID,Fac2) %>%
mutate(Val2 = sum(Val1))
df
ID Fac1 Val1 Fac2 Val2
1 1 b 9 N 9
2 1 c 9 O 9
3 2 a 4 K 4
4 2 b 10 M 18
5 2 c 4 L 4
6 2 d 8 M 18
7 2 e 10 N 10
8 3 d 14 N 14
9 4 b 8 L 22
10 4 c 14 L 22
11 4 d 9 K 9
12 4 e 6 N 6
13 5 a 13 M 13
14 5 b 3 N 3
ID is a grouping variable. Rows with an Fac1 value of e should have the Fac2 value changed to be that same as the other row in the group where Fac1 is either b or c and the sum of Val 2 for the two rows if greater than 20. (I've simplified this to the point where you probably don't get why but just work with me).
This is what I have tried so far:
result <- df %>% group_by(ID) %>%
mutate(Fac2 = case_when(
Fac1 == "e" &
sum(Val2,ifelse(Fac1 %in% c("b","c"), Val2, 0)) > 20 ~
ifelse(sum(Val2,ifelse(Fac1 %in% c("b","c"),Val2,0)) > 20,
as.character(Fac2),
NA_character_),
TRUE ~ as.character(Fac2)
))
It doesn't work properly because it is summing the first value of Val2 in the group rather than only doing so when Fac1 is b or c.
Any ideas?
Adding desired outcome:
ID Fac1 Val1 Fac2 Val2
1 1 b 9 N 9
2 1 c 9 O 9
3 2 a 4 K 4
4 2 b 10 M 18
5 2 c 4 L 4
6 2 d 8 M 18
7 2 e 10 M 10 **Changed to M b/c row 4 is M and 10 + 18 > 20
8 3 d 14 N 14
9 4 b 8 L 22
10 4 c 14 L 22
11 4 d 9 K 9
12 4 e 6 L 6 **Changed to L b/c row 10 is L and 6 + 22 > 20
13 5 a 13 M 13
14 5 b 3 N 3
I'm having a hard time following what you are wanting the values to be changed to.
But when I have multiple conditions or decisions that need to be made in a sequence, I use a loop and a series of if statements to go through the data frame. I prefer while loops, so that's what I'll use in the example.
counter <- 1
stopper <- nrow(df)
while (counter <= stopper) {
fac1 <- df$Fac1[counter1]
if (fac1 == 'e') {
if ([INSERT NEXT CONDITION]) #Change whichever value your trying to change using the counter to reference the correct row.
else #Change whichever value your trying to change using the counter to reference the correct row.
}
counter <- counter + 1
}
For me, simplifying the code makes it a lot easier for me to keep track of what decisions are being made. It also allows for complex decisions that are difficult to get functions to work with.
I was able to get the desired result with this code. I made a new column containing the result of the test for what value to replace Fac2 with, which wasn't entirely necessary but makes it more readable and debugable.
The key thing was to use first(na.omit()) to get the value from a different row in the same group which met the condition.
result <- df %>% group_by(ID) %>%
mutate(Max_bc_Val = ifelse(Val2 == max(ifelse(Fac1 %in% c("b","c"),
Val2,0)),
ifelse(Fac1 %in% c("b","c"),
as.character(Fac2),NA),NA)) %>%
mutate(Fac2 = case_when(
Fac1 == "e" ~ ifelse(is.na(first(na.omit(Max_bc_Val))),
NA_character_,
first(na.omit(Max_bc_Val))),
TRUE ~ as.character(Fac2)))
This works but doesn't seem like the best solution. Any other ideas?
Given a df in semi-long format with id variables a and b and measured data in columns m1and m2. The type of data is specified by the variable v (values var1 and var2).
set.seed(8)
df_l <-
data.frame(
a = rep(sample(LETTERS,5),2),
b = rep(sample(letters,5),2),
v = c(rep("var1",5),rep("var2",5)),
m1 = sample(1:10,10,F),
m2 = sample(20:40,10,F))
Looks as:
a b v m1 m2
1 W r var1 3 40
2 N l var1 6 32
3 R a var1 9 28
4 F g var1 5 21
5 E u var1 4 38
6 W r var2 1 35
7 N l var2 8 33
8 R a var2 10 29
9 F g var2 7 30
10 E u var2 2 23
If I want to make a wide format of values in m1 using id a as rows and values in v1as columns I do:
> reshape2::dcast(df_l, a~v, value.var="m1")
a var1 var2
1 E 4 2
2 F 5 7
3 N 6 8
4 R 9 10
5 W 3 1
How do I write a function that does this were arguments to dcast (row, column and value.var) are supplied as arguments, something like:
fun <- function(df,row,col,val){
require(reshape2)
res <-
dcast(df, row~col, value.var=val)
return(res)
}
I checked SO here and here to try variations of match.call and eval(substitute()) in order to "get" the arguments inside the function, and also tried with the lazyeval package. No succes.
What am I doing wrong here ? How to get dcast to recognize variable names?
Formula argument also accepts character input.
foo <- function(df, id, measure, val) {
dcast(df, paste(paste(id, collapse = " + "), "~",
paste(measure, collapse = " + ")),
value.var = val)
}
require(reshape2)
foo(df_l, "a", "v", "m1")
Note that data.table's dcast (current development) can also cast multiple value.var columns directly. So, you can also do:
require(data.table) # v1.9.5
foo(setDT(df_l), "a", "v", c("m1", "m2"))
# a m1_var1 m1_var2 m2_var1 m2_var2
# 1: F 1 6 28 21
# 2: H 9 2 38 29
# 3: M 5 10 24 35
# 4: O 8 3 23 26
# 5: T 4 7 31 39
I have two dataframes with different dimensions,
df1 <- data.frame(names= sample(LETTERS[1:10]), duration=sample(0:100, 10))
>df1
names duration
1 J 97
2 G 57
3 H 53
4 A 23
5 E 100
6 D 90
7 C 73
8 F 60
9 B 37
10 I 67
df2 <- data.frame(names= LETTERS[1:5], names_new=letters[1:5])
> df2
names names_new
1 A a
2 B b
3 C c
4 D d
5 E e
I want to replace in df1 the values that match df1$names and df2$names but using the df2$names_new. My desired output would be:
> df1
names duration
1 J 97
2 G 57
3 H 53
4 a 23
5 e 100
6 d 90
7 c 73
8 F 60
9 b 37
10 I 67
This is the code I'm using but I wonder if there is a cleaner way to do it with no so many steps,
df2[,1] <- as.character(df2[,1])
df2[,2] <- as.character(df2[,2])
df1[,1] <- as.character(df1[,1])
match(df1[,1], df2[,1]) -> id
which(!is.na(id)==TRUE) -> idx
id[!is.na(id)] -> id
df1[idx,1] <- df2[id,2]
Many thanks
Here's an approach from qdapTools:
library(qdapTools)
df1$names <- df1$names %lc+% df2
The %l+% is a binary operator version of lookup. The left are the terms and the right side is the lookup table. The + means that any noncomparables will revert back to the original. This is a wrapper for the data.table package and is pretty speedy.
Here is the output including set.seed(1) for reproducibility:
set.seed(1)
df1 <- data.frame(names= sample(LETTERS[1:10]), duration=sample(0:100, 10),stringsAsFactors=F)
df2 <- data.frame(names= LETTERS[1:5], names_new=letters[1:5],stringsAsFactors=F)
library(qdapTools)
df1$names <- df1$names %lc+% df2
df1
## names duration
## 1 c 20
## 2 d 17
## 3 e 68
## 4 G 37
## 5 b 74
## 6 H 47
## 7 I 98
## 8 F 93
## 9 J 35
## 10 a 71
Are all names in df2 also in df1? And do you intent to keep them as a factor? If so, you might find this solution helpful.
idx <- match(levels(df2$names), levels(df1$names))
levels(df1$names)[idx] <- levels(df2$names_new)
This works but requires that names and names_new are character and not factor.
set.seed(1)
df1 <- data.frame(names= sample(LETTERS[1:10]), duration=sample(0:100, 10),stringsAsFactors=F)
df2 <- data.frame(names= LETTERS[1:5], names_new=letters[1:5],stringsAsFactors=F)
rownames(df1) <- df1$names
df1[df2$name,]$names <- df2$names_new
Another option using merge:
transform(merge(df1,df2,all.x=TRUE),
names=ifelse(is.na(names_new),as.character(names),
as.character(names_new)))
Another way using match would be (if df1$names and df1$names are characters of course)
df1[match(df2$names, df1$names), "names"] <- df2$names_new
I need to take a data.frame in the format of:
id1 id2 mean start end
1 A D 4 12 15
2 B E 5 14 15
3 C F 6 8 10
and generate duplicate rows based on the difference in start - end. For example, I need 3 rows for the first row, 1 for the second, and 2 for the third. The start and end fields should be in sequential order in the final data.frame. The end result for this data.frame should be:
id1 id2 mean start end
1 A D 4 12 13
2 A D 4 13 14
3 A D 4 14 15
21 B E 5 14 15
31 C F 6 8 9
32 C F 6 9 10
I have written this function which works, but isn't written in very R'esque code:
dupData <- function(df){
diff <- abs(df$start - df$end)
ret <- {}
#Expand our dataframe into the appropriate number of rows.
for (i in 1:nrow(df)){
for (j in 1:diff[i]){
ret <- rbind(ret, df[i,])
}
}
#If matching ID1 and ID2, generate a sequential ordering of start & end dates
for (k in 2:nrow(ret) - 1) {
if ( ret[k,1] == ret[k + 1, 1] & ret[k, 2] == ret[k, 2] ){
ret[k, 5] <- ret[k, 4] + 1
ret[k + 1, 4] <- ret[k, 5]
}
}
return(ret)
}
Does anyone have suggestions on how to optimize this code? Is there a function in plyr which may be applicable?
#sample daters
df <- data.frame(id1 = c("A", "B", "C")
, id2 = c("D", "E", "F")
, mean = c(4,5,6)
, start = c(12,14,8)
, end = c(15, 15, 10)
)
There's probably a more general way to do this, but below uses rbind.fill.
cbind(df[rep(1:nrow(df), times = apply(df[,4:5], 1, diff)), 1:3],
rbind.fill(apply(df[,4:5], 1, function(x)
data.frame(start = x[1]:(x[2]-1), end = (x[1]+1):x[2]))))
## id1 id2 mean start end
## 1 A D 4 12 13
## 1.1 A D 4 13 14
## 1.2 A D 4 14 15
## 2 B E 5 14 15
## 3 C F 6 8 9
## 3.1 C F 6 9 10
The survSplit function of the survival package does something along these lines, though it has a bit more options (eg specifying the cut times). You might be able to use it, or look at its code to see if you can implement your simplified version better.
No doubt this isn't one of those times where late is better than never, but i had a similar issue and came up with this...
library(plyr)
ddply(df, c("id1", "id2", "mean", "start", "end"), summarise,
sq=seq(1:(end-start)))
Two alternatives, many years later, offering alternatives using today's popular data.table and tidyverse packages:
Option 1:
library(data.table)
setDT(mydf)[, list(mean, start = start:(end-1)), .(id1, id2)][, end := start + 1][]
id1 id2 mean start end
1: A D 4 12 13
2: A D 4 13 14
3: A D 4 14 15
4: B E 5 14 15
5: C F 6 8 9
6: C F 6 9 10
Option 2:
library(tidyverse)
mydf %>%
group_by(id1, id2, mean) %>%
summarise(start = list(start:(end-1))) %>%
unnest(start) %>%
mutate(end = start+1)