I want to apply a function to the rows of a data frame. The function is conditional on the value in one column being greater than the value in another column. If the condition is met, I multiply the elements from two other columns and store the result in a new column. If the condition is not met, there is no multiplication and an original value is copied to the new column.
Create some data:
var0 <- c("A", "B", "C", "D", "E")
var1 <- rep(c(105,200), each = 5)
var2 <- c(110:114, 25:29)
var3 <- rep(c(560,135), each = 5)
var4 <- rep(c(0.5,0.2), each = 5)
my_df <- as.data.frame(cbind(var0, var1, var2, var3, var4))
Have a look at the data:
var0 var1 var2 var3 var4
1 A 105 110 560 0.5
2 B 105 111 560 0.5
3 C 105 112 560 0.5
4 D 105 113 560 0.5
5 E 105 114 560 0.5
6 A 200 25 135 0.2
7 B 200 26 135 0.2
8 C 200 27 135 0.2
9 D 200 28 135 0.2
10 E 200 29 135 0.2
My attempt at writing the code:
apply(my_df, 1, function(x) {
if(x$var3 > x$var1) {
x$output <- x$var2 * x$var4
} else {
x$output <- x$var2
}
return(x)
})
What the result should look like:
var0 var1 var2 var3 var4 output
1 A 105 110 560 0.5 55.0
2 B 105 111 560 0.5 55.5
3 C 105 112 560 0.5 56.0
4 D 105 113 560 0.5 56.5
5 E 105 114 560 0.5 57.0
6 A 200 25 135 0.2 25.0
7 B 200 26 135 0.2 26.0
8 C 200 27 135 0.2 27.0
9 D 200 28 135 0.2 28.0
10 E 200 29 135 0.2 29.0
Because var3 is greater than var1 in the first 5 rows, var2 * var4 is computed. In the last 5 rows the condition is not met, so var2 is simply copied to the output column.
You don't need to use an apply() function here; you can just use ifelse(), which is vectorised over the rows:
my_df$output <- ifelse(my_df$var3 > my_df$var1, my_df$var2*my_df$var4, my_df$var2)
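A minimal sketch of the vectorised approach, assuming the columns are numeric. It builds the data with data.frame() instead of as.data.frame(cbind(...)), because cbind() on mixed vectors produces a character matrix. The apply() attempt in the question also fails for a related reason: each row reaches the function as an atomic vector, and `$` is not valid on atomic vectors.

```r
# Build the example data with data.frame() so numeric columns stay numeric.
my_df <- data.frame(
  var0 = rep(c("A", "B", "C", "D", "E"), times = 2),
  var1 = rep(c(105, 200), each = 5),
  var2 = c(110:114, 25:29),
  var3 = rep(c(560, 135), each = 5),
  var4 = rep(c(0.5, 0.2), each = 5)
)

# ifelse() evaluates the condition for all rows at once.
my_df$output <- ifelse(my_df$var3 > my_df$var1,
                       my_df$var2 * my_df$var4,
                       my_df$var2)
my_df$output
```

The first five values are var2 * var4 and the last five are var2 unchanged, matching the expected table in the question.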
var0 <- c("A", "B", "C", "D", "E")
var1 <- rep(c(105,200), each = 5)
var2 <- c(110:114, 25:29)
var3 <- rep(560,135, 5)
var4 <- rep(c(0.5,0.2), each = 5)
To avoid the numbers being converted to factors, I am using cbind.data.frame() instead of as.data.frame() on the result of cbind():
my_df <-cbind.data.frame(var0, var1, var2, var3, var4)
> str(my_df)
'data.frame': 10 obs. of 5 variables:
$ var0: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5 1 2 3 4 5
$ var1: num 105 105 105 105 105 200 200 200 200 200
$ var2: int 110 111 112 113 114 25 26 27 28 29
$ var3: num 560 560 560 560 560 560 560 560 560 560
$ var4: num 0.5 0.5 0.5 0.5 0.5 0.2 0.2 0.2 0.2 0.2
I then use an ifelse condition to get the new column
>my_df$output=ifelse(my_df$var3>my_df$var1,my_df$var2*my_df$var4,my_df$var2)
> my_df
var0 var1 var2 var3 var4 output
1 A 105 110 560 0.5 55.0
2 B 105 111 560 0.5 55.5
3 C 105 112 560 0.5 56.0
4 D 105 113 560 0.5 56.5
5 E 105 114 560 0.5 57.0
6 A 200 25 560 0.2 5.0
7 B 200 26 560 0.2 5.2
8 C 200 27 560 0.2 5.4
9 D 200 28 560 0.2 5.6
10 E 200 29 560 0.2 5.8
Note that I was not getting the same values in var3 as you, so I changed var3 to match the ones given in your table:
> var3 <- c(rep(560,5),rep(135,5))
> var3
[1] 560 560 560 560 560 135 135 135 135 135
> my_df <-cbind.data.frame(var0, var1, var2, var3, var4)
> my_df$output=ifelse(my_df$var3>my_df$var1,my_df$var2*my_df$var4,my_df$var2)
> my_df
var0 var1 var2 var3 var4 output
1 A 105 110 560 0.5 55.0
2 B 105 111 560 0.5 55.5
3 C 105 112 560 0.5 56.0
4 D 105 113 560 0.5 56.5
5 E 105 114 560 0.5 57.0
6 A 200 25 135 0.2 25.0
7 B 200 26 135 0.2 26.0
8 C 200 27 135 0.2 27.0
9 D 200 28 135 0.2 28.0
10 E 200 29 135 0.2 29.0
I have a data frame df:
library(tidyverse)
t <- c(103,104,108,120,127,129,140,142,150,151,160,177,178,183,186,187,191,194,198,199)
w <- c(1,1,1,-1,-1,-1,-1,-1,1,1,-1,-1,1,1,1,-1,1,1,-1,-1)
df <- tibble(t, w)  # data_frame() is deprecated in favour of tibble()
> dput(df)
structure(list(t = c(103, 104, 108, 120, 127, 129, 140, 142,
150, 151, 160, 177, 178, 183, 186, 187, 191, 194, 198, 199),
w = c(1, 1, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, 1,
-1, 1, 1, -1, -1)), .Names = c("t", "w"), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
> df
# A tibble: 20 x 2
t w
<dbl> <dbl>
1 103 1.00
2 104 1.00
3 108 1.00
4 120 -1.00
5 127 -1.00
6 129 -1.00
7 140 -1.00
8 142 -1.00
9 150 1.00
10 151 1.00
11 160 -1.00
12 177 -1.00
13 178 1.00
14 183 1.00
15 186 1.00
16 187 -1.00
17 191 1.00
18 194 1.00
19 198 -1.00
20 199 -1.00
Now, if the value in w is larger than zero, find the nearest previous negative w, and assign the difference between the corresponding t values to a new column d. Otherwise, d is equal to zero. I.e. the desired output should look like this:
t w d
103 1.00 NA (there is no previous w < 0)
104 1.00 NA (there is no previous w < 0)
108 1.00 NA (there is no previous w < 0)
120 -1.00 0
127 -1.00 0
129 -1.00 0
140 -1.00 0
142 -1.00 0
150 1.00 8 = 150 - 142
151 1.00 9 = 151 - 142
160 -1.00 0
177 -1.00 0
178 1.00 1 = 178 - 177
183 1.00 6 = 183 - 177
186 1.00 9 = 186 - 177
187 -1.00 0
191 1.00 4 = 191 - 187
194 1.00 7 = 194 - 187
198 -1.00 0
199 -1.00 0
(The NAs above might be zero as well.)
I have been trying since yesterday to attack this problem using findInterval(), which(), etc., but without success. Another idea was to somehow introduce a variable shift in the lag() function...
Ideally, I would like to have a tidyverse-like solution.
Any help would be very much appreciated.
Thank you in advance!
Using data.table (when this was written the tidyverse had no non-equi joins; dplyr 1.1.0 has since added them through join_by()):
library(data.table)
DT = data.table(df)
DT[, v := 0]
DT[w > 0, v :=
DT[w < 0][.SD, on=.(t < t), mult="last", i.t - x.t]
]
t w v
1: 103 1 NA
2: 104 1 NA
3: 108 1 NA
4: 120 -1 0
5: 127 -1 0
6: 129 -1 0
7: 140 -1 0
8: 142 -1 0
9: 150 1 8
10: 151 1 9
11: 160 -1 0
12: 177 -1 0
13: 178 1 1
14: 183 1 6
15: 186 1 9
16: 187 -1 0
17: 191 1 4
18: 194 1 7
19: 198 -1 0
20: 199 -1 0
It initializes the new column to 0, then replaces it on the subset of rows where w > 0. The replacement uses a join of the subset of data, .SD, where w > 0 to the part of the table where w < 0, DT[w < 0]. The join syntax is x[i, on=, j] where in this case...
x = DT[w < 0]
i = .SD = DT[w > 0]
The join uses each row of i to look up rows in x based on the rules in on=. When multiple matches are found, we take only the last (mult = "last").
j is what we use the join to do, here calculate the difference between two columns. To disambiguate columns from each table, we use prefixes x.* and i.*.
Using cummax. I'm not sure if this generalizes, but it works for the example:
DT[, v := t - cummax(t*(w < 0))]
DT[cumsum(w < 0) == 0, v := NA]
I guess this requires that the t column is sorted in increasing order.
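The same lookup ("last earlier row with w < 0") can also be sketched in base R with findInterval(), which the question mentions. This is an illustrative alternative rather than part of the answers here. Like the cummax version, it assumes t is sorted in increasing order without ties. The names neg_t, idx and prev_neg are introduced only for this sketch.

```r
t <- c(103,104,108,120,127,129,140,142,150,151,160,177,178,183,186,187,191,194,198,199)
w <- c(1,1,1,-1,-1,-1,-1,-1,1,1,-1,-1,1,1,1,-1,1,1,-1,-1)

# Times at which w is negative (sorted, because t is sorted).
neg_t <- t[w < 0]

# For each t, the number of negative-w times at or before it; 0 means none.
idx <- findInterval(t, neg_t)

# Prepend NA so that idx == 0 maps to NA instead of dropping the element.
prev_neg <- c(NA, neg_t)[idx + 1]

# Difference to the previous negative time for positive w, otherwise 0.
d <- ifelse(w > 0, t - prev_neg, 0)
d
```

The first three elements come out as NA because no negative w precedes them.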
A tidyverse way:
First, make an intermediate column (t2) that is NA if w is positive and t if w is negative
df <- mutate(df, t2 = case_when(w > 0 ~ as.numeric(NA), TRUE ~ t))
#fill NA in t2 so that for each row, t2 is value of t when w was last neg
df <- fill(df, t2)
#> df
# A tibble: 20 x 3
# t w t2
# <dbl> <dbl> <dbl>
# 1 103 1 NA
# 2 104 1 NA
# 3 108 1 NA
# 4 120 -1 120
# 5 127 -1 127
# 6 129 -1 129
# 7 140 -1 140
# 8 142 -1 142
# 9 150 1 142
#10 151 1 142
#11 160 -1 160
#12 177 -1 177
#13 178 1 177
#14 183 1 177
#15 186 1 177
#16 187 -1 187
#17 191 1 187
#18 194 1 187
#19 198 -1 198
#20 199 -1 199
Then subtract t2 from t
df$d <- with(df, t - t2)
#> df
# A tibble: 20 x 4
# t w t2 d
# <dbl> <dbl> <dbl> <dbl>
# 1 103 1 NA NA
# 2 104 1 NA NA
# 3 108 1 NA NA
# 4 120 -1 120 0
# 5 127 -1 127 0
# 6 129 -1 129 0
# 7 140 -1 140 0
# 8 142 -1 142 0
# 9 150 1 142 8
#10 151 1 142 9
#11 160 -1 160 0
#12 177 -1 177 0
#13 178 1 177 1
#14 183 1 177 6
#15 186 1 177 9
#16 187 -1 187 0
#17 191 1 187 4
#18 194 1 187 7
#19 198 -1 198 0
#20 199 -1 199 0
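For reference, the steps above can be chained into one pipeline. This is a sketch under the same assumptions (dplyr and tidyr loaded, rows already ordered by t). It uses if_else() with NA_real_ in place of case_when(), and drops the helper column at the end.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  t = c(103,104,108,120,127,129,140,142,150,151,160,177,178,183,186,187,191,194,198,199),
  w = c(1,1,1,-1,-1,-1,-1,-1,1,1,-1,-1,1,1,1,-1,1,1,-1,-1)
)

res <- df %>%
  mutate(t2 = if_else(w < 0, t, NA_real_)) %>%  # t where w is negative, else NA
  fill(t2) %>%                                  # carry the last negative t downwards
  mutate(d = t - t2) %>%                        # 0 on negative rows, gap on positive rows
  select(-t2)                                   # drop the helper column
res
```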
I have a data frame like this:
test <- data.frame(gr=rep(letters[1:2],each=6),No=c(100:105,200:205))
gr No
1 a 100
2 a 101
3 a 102
4 a 103
5 a 104
6 a 105
7 b 200
8 b 201
9 b 202
10 b 203
11 b 204
12 b 205
In the No column the numbers increase within each gr. I need to add 100 to No for group a and 50 for group b, and then have the values decrease consecutively after this operation.
I would like a new column that decreases consecutively after this increase. So I tried:
decrese_func <- function(No,gr){
if(any(gr=="a")){
No+100
}
else
No+50
}
test%>%
group_by(gr)%>%
mutate(new_column=decrese_func(No,gr))
# A tibble: 12 x 3
# Groups: gr [2]
gr No new_column
<fct> <int> <dbl>
1 a 100 200
2 a 101 201
3 a 102 202
4 a 103 203
5 a 104 204
6 a 105 205
7 b 200 250
8 b 201 251
9 b 202 252
10 b 203 253
11 b 204 254
12 b 205 255
but what I need is like this
gr No new_column
<fct> <int> <dbl>
1 a 100 200
2 a 101 199
3 a 102 198
4 a 103 197
5 a 104 196
6 a 105 195
7 b 200 250
8 b 201 249
9 b 202 248
10 b 203 247
11 b 204 246
12 b 205 245
I cannot figure out how to get the consecutive decrease.
Thanks.
Not the most elegant answer, but in the meantime this may work:
library(dplyr)
test %>%
mutate(A = case_when(gr == "a" ~ 100,
gr == "b" ~ 50,
TRUE ~ NA_real_)) %>%
group_by(gr) %>%
mutate(B = (1:NROW(gr) - 1) * 2,
New_Column = No + A - B)
# A tibble: 12 x 5
# Groups: gr [2]
gr No A B New_Column
<fct> <int> <dbl> <dbl> <dbl>
1 a 100 100 0 200
2 a 101 100 2 199
3 a 102 100 4 198
4 a 103 100 6 197
5 a 104 100 8 196
6 a 105 100 10 195
7 b 200 50 0 250
8 b 201 50 2 249
9 b 202 50 4 248
10 b 203 50 6 247
11 b 204 50 8 246
12 b 205 50 10 245
Add select(gr, No, New_Column) at the end of the chain to get gr, No and New_Column only. I left the other columns just to show what's going on.
And if you want to wrap it into a function you could do something like:
desc_func <- function(group_var, condition, if_true_add, if_false_add, to_number) {
ifelse(
group_var == condition,
to_number + if_true_add - (1:NROW(group_var) - 1) * 2,
to_number + if_false_add - (1:NROW(group_var) - 1) * 2)
}
test %>%
group_by(gr) %>%
mutate(test_var = desc_func(gr, "a", 100, 50, No))
# A tibble: 12 x 3
# Groups: gr [2]
gr No test_var
<fct> <int> <dbl>
1 a 100 200
2 a 101 199
3 a 102 198
4 a 103 197
5 a 104 196
6 a 105 195
7 b 200 250
8 b 201 249
9 b 202 248
10 b 203 247
11 b 204 246
12 b 205 245
Here is a way to do this in base R
test$New <- with(test, No + c(100, 50)[cumsum(!duplicated(gr))] - 2*(No %% 100))
test$New
#[1] 200 199 198 197 196 195 250 249 248 247 246 245
Or a slight variation with match
with(test, No + c(100, 50)[match(gr, unique(gr))] - 2*(No %% 100))
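Both base R one-liners rely on No %% 100 happening to equal the row's zero-based position within its group in this example. A sketch that computes the position explicitly with ave() is shown here, assuming the per-group amounts are supplied as a named vector. The names offset and pos are introduced only for this illustration.

```r
test <- data.frame(gr = rep(letters[1:2], each = 6), No = c(100:105, 200:205))

# Amount to add per group, looked up by group name.
offset <- c(a = 100, b = 50)

# 1-based position of each row within its group.
pos <- ave(seq_along(test$gr), test$gr, FUN = seq_along)

# Add the group offset, then subtract 2 per step so the result falls by 1 per row.
test$New <- test$No + unname(offset[as.character(test$gr)]) - 2 * (pos - 1)
test$New
```

Because the position comes from ave(), this version should not depend on the actual values in No.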
I have a data frame like this:
df <- data.frame(x=c(7,5,4),y=c(100,100,100),w=c(170,170,170),z=c(132,720,1256))
I create a new column using mapply:
set.seed(123)
library(truncnorm)
df$res <- mapply(rtruncnorm,df$x,df$y,df$w,df$z,25)
So, I got:
> df
#x y w z res
#1 7 100 170 132 117.9881, 126.2456, 133.7627, 135.2322, 143.5229, 100.3735, 114.8287
#2 5 100 170 720 168.8581, 169.4955, 169.6461, 169.8998, 169.0343
#3 4 100 170 1256 169.7245, 167.6744, 169.7025, 169.4441
#dput(df)
df <- structure(list(x = c(7, 5, 4), y = c(100, 100, 100), w = c(170,
170, 170), z = c(132, 720, 1256), res = list(c(117.988108836195,
126.245562762918, 133.762709785614, 135.232193379024, 143.52290514973,
100.373469134837, 114.828678702662), c(168.858147661715, 169.495493758985,
169.646123183828, 169.899849943838, 169.034333943479), c(169.724470294466,
167.674371713068, 169.70250974042, 169.444134892323))), .Names = c("x",
"y", "w", "z", "res"), row.names = c(NA, -3L), class = "data.frame")
But what I really need is to repeat each row of the df data frame once for every value in its df$res result, as follows:
> df2
# x y w z res
#1 7 100 170 132 117.9881
#2 7 100 170 132 126.2456
#3 7 100 170 132 133.7627
#4 7 100 170 132 135.2322
#5 7 100 170 132 143.5229
#6 7 100 170 132 100.3735
#7 7 100 170 132 114.8287
#8 5 100 170 720 168.8581
#9 5 100 170 720 169.4955
#10 5 100 170 720 169.6461
#11 5 100 170 720 169.8998
#12 5 100 170 720 169.0343
#13 4 100 170 1256 169.7245
#14 4 100 170 1256 167.6744
#15 4 100 170 1256 169.7025
#16 4 100 170 1256 169.4441
How do I achieve this efficiently? I need to apply this to a big data frame.
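library(truncnorm)
df <- data.frame(x=c(7,5,4),y=c(100,100,100),w=c(170,170,170),z=c(132,720,1256))
set.seed(123)
# one vector of draws per row of df
l <- mapply(rtruncnorm,df$x,df$y,df$w,df$z,25)
# repeat each row once per draw, then attach the flattened draws
cbind.data.frame(df[rep(seq_along(l), lengths(l)),],
res = unlist(l))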
# x y w z res
# 1 7 100 170 132 117.9881
# 1.1 7 100 170 132 126.2456
# 1.2 7 100 170 132 133.7627
# 1.3 7 100 170 132 135.2322
# 1.4 7 100 170 132 143.5229
# 1.5 7 100 170 132 100.3735
# 1.6 7 100 170 132 114.8287
# 2 5 100 170 720 168.8581
# 2.1 5 100 170 720 169.4955
# 2.2 5 100 170 720 169.6461
# 2.3 5 100 170 720 169.8998
# 2.4 5 100 170 720 169.0343
# 3 4 100 170 1256 169.7245
# 3.1 4 100 170 1256 167.6744
# 3.2 4 100 170 1256 169.7025
# 3.3 4 100 170 1256 169.4441
Try this based on your given df:
df$res <- sapply(df$res, paste0, collapse=",")
do.call(rbind, apply(df, 1, function(x) do.call(expand.grid, strsplit(x, ","))))
# x y w z res
# 1 7 100 170 132 117.988108836195
# 2 7 100 170 132 126.245562762918
# 3 7 100 170 132 133.762709785614
# 4 7 100 170 132 135.232193379024
# 5 7 100 170 132 143.52290514973
# 6 7 100 170 132 100.373469134837
# 7 7 100 170 132 114.828678702662
# 8 5 100 170 720 168.858147661715
# 9 5 100 170 720 169.495493758985
# 10 5 100 170 720 169.646123183828
# 11 5 100 170 720 169.899849943838
# 12 5 100 170 720 169.034333943479
# 13 4 100 170 1256 169.724470294466
# 14 4 100 170 1256 167.674371713068
# 15 4 100 170 1256 169.70250974042
# 16 4 100 170 1256 169.444134892323
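If tidyr is available, a list column can also be expanded with unnest(). A minimal sketch follows. It uses small made-up values in res (not the rtruncnorm draws above) so that it runs without truncnorm. The name df2 follows the question.

```r
library(tidyr)

df <- data.frame(x = c(7, 5, 4), y = c(100, 100, 100),
                 w = c(170, 170, 170), z = c(132, 720, 1256))

# Illustrative list column: one numeric vector per row (values are made up).
df$res <- list(c(1.5, 2.5), 3.5, c(4.5, 5.5, 6.5))

# One output row per element of res; the other columns are repeated.
df2 <- unnest(df, cols = res)
df2
```

Unlike the strsplit() route, this keeps res numeric instead of passing it through character strings.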
I have this "d" data frame that has 2 groups. In real life I have 20 groups.
d= data.frame(group = c(rep("A",10),rep("B",10),"A"), value = c(seq(1,10,1),seq(101,110,1),10000))
d
group value
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 101
12 B 102
13 B 103
14 B 104
15 B 105
16 B 106
17 B 107
18 B 108
19 B 109
20 B 110
21 A 10000
I'd like to add 2 columns, "Upper" and "Lower", that are calculated at the group level, as below. Since there are only 2 groups, I can add the columns manually like this:
d= data.frame(group = c(rep("A",10),rep("B",10),"A"), value = c(seq(1,10,1),seq(101,110,1),10000))
d
d$upper = ifelse(d$group=="A", quantile(d$value[d$group=="A"])[4]+ 2.5*IQR(d$value[d$group=="A"]), quantile(d$value[d$group=="B"])[4]+ 2.5*IQR(d$value[d$group=="B"]) )
d$lower = ifelse(d$group=="A", quantile(d$value[d$group=="A"])[4]- 2.5*IQR(d$value[d$group=="A"]), quantile(d$value[d$group=="B"])[4]- 2.5*IQR(d$value[d$group=="B"]) )
group value upper lower
1 A 1 21 -4.0
2 A 2 21 -4.0
3 A 3 21 -4.0
4 A 4 21 -4.0
5 A 5 21 -4.0
6 A 6 21 -4.0
7 A 7 21 -4.0
8 A 8 21 -4.0
9 A 9 21 -4.0
10 A 10 21 -4.0
11 B 101 119 96.5
12 B 102 119 96.5
13 B 103 119 96.5
14 B 104 119 96.5
15 B 105 119 96.5
16 B 106 119 96.5
17 B 107 119 96.5
18 B 108 119 96.5
19 B 109 119 96.5
20 B 110 119 96.5
21 A 10000 21 -4.0
But when I have 20 or 30 groups, what's the best way to add these columns without writing a loop?
Groupwise operations can easily be done using dplyr's group_by function:
library(dplyr)
d <- data.frame(group = c(rep("A",10),rep("B",10),"A"), value = c(seq(1,10,1),seq(101,110,1),10000))
d %>%
group_by(group) %>%
mutate(upper=quantile(value, 0.75) + 2.5*IQR(value),
lower=quantile(value, 0.75) - 2.5*IQR(value))
This splits the data frame by the "group" variable and then computes the "upper" and "lower" columns separately for each group.
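The same group-wise calculation can be sketched in base R with ave(), in case dplyr is not an option. This is an illustrative alternative. It keeps the question's definition, in which both bounds are measured from the 75% quantile.

```r
d <- data.frame(group = c(rep("A", 10), rep("B", 10), "A"),
                value = c(1:10, 101:110, 10000))

# ave() applies FUN within each group and recycles the single result
# back onto every row of that group.
d$upper <- ave(d$value, d$group,
               FUN = function(v) quantile(v, 0.75) + 2.5 * IQR(v))
d$lower <- ave(d$value, d$group,
               FUN = function(v) quantile(v, 0.75) - 2.5 * IQR(v))
d[c(1, 11, 21), ]
```

Rows 1 and 21 (group A) should show 21 and -4, and row 11 (group B) 119 and 96.5, as in the manually computed table in the question.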
I need to change the format of my current data by creating a new first column, "tnum", to indicate the trait/variable number, and a last column, "tval", to hold each trait value.
My current data file (9,000 animals) is similar to this format:
anim <- c(201,202,203,204,205)
bwt <- c(1.2,1.0,0.9,1.1,1.5)
leng <- c(14,21,18,16,19)
temp <- c(33,34,39,38,37)
mydf <- data.frame(anim,bwt,leng,temp)
anim bwt leng temp
1 201 1.2 14 33
2 202 1.0 21 34
3 203 0.9 18 39
4 204 1.1 16 38
5 205 1.5 19 37
Trait 1 = bwt, trait 2 = leng, and trait 3 = temp. This is what I am looking for:
tnum anim tval
1 201 1.2
2 201 14
3 201 33
1 202 1.0
2 202 21
3 202 34
1 203 0.9
2 203 18
3 203 39
1 204 1.1
2 204 16
3 204 38
1 205 1.5
2 205 19
3 205 37
Any help would be appreciated.
Baz
library("reshape")
m <- melt(mydf, id.vars="anim")
m
anim variable value
1 201 bwt 1.2
2 202 bwt 1.0
3 203 bwt 0.9
4 204 bwt 1.1
5 205 bwt 1.5
6 201 leng 14.0
7 202 leng 21.0
8 203 leng 18.0
9 204 leng 16.0
10 205 leng 19.0
11 201 temp 33.0
12 202 temp 34.0
13 203 temp 39.0
14 204 temp 38.0
15 205 temp 37.0
Also, please format your code better next time; it's simple.
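The melt() output is ordered by trait and labels traits by name, while the question asks for a numeric tnum with rows ordered by animal. A base R sketch of that exact layout is given here, assuming the trait order bwt, leng, temp from the question. The names traits and long are introduced only for this illustration.

```r
mydf <- data.frame(anim = c(201, 202, 203, 204, 205),
                   bwt  = c(1.2, 1.0, 0.9, 1.1, 1.5),
                   leng = c(14, 21, 18, 16, 19),
                   temp = c(33, 34, 39, 38, 37))

# Trait columns in the order that defines tnum (1 = bwt, 2 = leng, 3 = temp).
traits <- c("bwt", "leng", "temp")

long <- data.frame(
  tnum = rep(seq_along(traits), times = nrow(mydf)),  # 1, 2, 3 for every animal
  anim = rep(mydf$anim, each = length(traits)),       # each animal repeated per trait
  # Transposing puts one animal per column, so as.vector() reads
  # bwt, leng, temp for animal 1, then animal 2, and so on.
  tval = as.vector(t(as.matrix(mydf[traits])))
)
head(long, 6)
```

With the reshape result m above, a similar tnum could probably be obtained from as.integer(m$variable) followed by ordering on anim, since variable is a factor in column order.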