I am using mutate to create a column that depends on the first value of a group:
library(tidyverse)
test = data.frame(grp = c(1,1,1,2,2,2), x = c(1,2,3,1,2,3), y = c(1,2,3,1,2,3))
test
grp x y
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 1
5 2 2 2
6 2 3 3
test %>% group_by(grp) %>%
mutate(y = ifelse(grp[[1]] == x[[1]], y-1, y))
grp x y
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 2 1 1
5 2 2 1
6 2 3 1
However, the output is not what I expected.
The expected output is:
grp x y
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 1
3 1 3 2
4 2 1 1
5 2 2 2
6 2 3 3
Can you please explain what is happening and how best to get my expected solution?
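You need to remove the index [[1]] from grp. With grp[[1]] == x[[1]] the test has length one, so ifelse() returns a single value (computed from the first row of the group), and mutate() recycles that value over every row of the group. Since grp is the grouping variable, there is no need to index it. Just use it as is, i.e.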
library(dplyr)
test %>%
group_by(grp) %>%
mutate(new_y = ifelse(grp == first(x), y-1, y))
# A tibble: 6 × 4
# Groups: grp [2]
grp x y new_y
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 0
2 1 2 2 1
3 1 3 3 2
4 2 1 1 1
5 2 2 2 2
6 2 3 3 3
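The recycling behaviour behind the original result can be seen with plain vectors. This is a minimal sketch in base R; the vector y is illustrative data, not part of the question.

```r
# ifelse() returns a result with the same length as its test argument
y <- c(1, 2, 3)

# length-one test: only the first element of `y - 1` is returned
one_value <- ifelse(TRUE, y - 1, y)
one_value
# [1] 0

# length-three test: the choice is made element by element
per_row <- ifelse(c(TRUE, TRUE, TRUE), y - 1, y)
per_row
# [1] 0 1 2
```

Inside a grouped mutate(), the single value from the first form is recycled to every row of the group, which is why each group ended up with one repeated value.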
Because of the x[[1]], you are always comparing the group value of each row with the x value of the first row of the group. I think you want grp == x within ifelse().
I have this dataframe.
Sub <- c(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
trial <-c(1,1,1,1,2,2,2,2,2,2,1,1,1,1,2,2,2,2,2,2)
One <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
Two <- c(1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,0,0,1)
Three <- c(2,0,0,1,3,0,0,0,0,1,7,8,0,0,0,1,1,1,1,0)
Four <- c(3,4,5,4,3,4,5,6,7,8,6,5,4,5,6,7,6,5,6,5)
Five <- c(3,4,5,4,6,7,5,4,3,2,3,4,5,4,3,5,7,4,3,5)
Six <- c(3,4,5,4,6,7,5,4,3,2,3,4,5,4,3,5,7,4,3,5)
Seven <- c(3,4,5,4,9,7,5,4,3,2,3,4,5,4,3,5,7,4,3,5)
dat <- data.frame(Sub, trial, One, Two, Three, Four, Five, Six, Seven)
I created this function to calculate the correlation among my variables.
fun <- function(a,b,c,d,e,f,g) {
v = cor(a,b)
v1 = cor(a,c)
v2 = cor(a,d)
v3 = cor(a,e)
v4 = cor(a,f)
v5 = cor(a,g)
return(c(v,v1,v2,v3,v4,v5))
}
I need to apply this function to each group of my dataset (Sub,trial).
dat %>%
group_by(Sub,trial) %>%
summarize(as.data.frame(matrix(fun(One, Two, Three, Four, Five, Six, Seven), nr = 1)))
However I got this result:
Sub trial V1 V2 V3 V4 V5 V6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 NA NA NA NA NA NA
2 1 2 NA NA NA NA NA NA
3 2 1 NA NA NA NA NA NA
4 2 2 NA NA NA NA NA NA
Sub/trial are well grouped. But I got NA results for the other variables.
Do you have any advice?
Thank you.
The solution by user #user438383 is the correct one.
The reason you get NA has nothing to do with applying the function.
Since you get the warning that the standard deviation is zero (column One is constant, so its standard deviation is zero in every group), you may want to look at this question:
R - Warning message: "In cor(...): the standard deviation is zero"
Here is an example:
# generate a list of dataframes with your groups:
my_list <- dat %>%
group_by(Sub, trial) %>%
group_split()
[[1]]
# A tibble: 5 x 9
Sub trial One Two Three Four Five Six Seven
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 2 3 3 3 3
2 1 1 1 0 0 4 4 4 4
3 1 1 1 0 0 5 5 5 5
4 1 1 1 0 1 4 4 4 4
5 1 1 1 1 7 6 3 3 3
[[2]]
# A tibble: 6 x 9
Sub trial One Two Three Four Five Six Seven
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 1 3 3 6 6 9
2 1 2 1 0 0 4 7 7 7
3 1 2 1 0 0 5 5 5 5
4 1 2 1 0 0 6 4 4 4
5 1 2 1 0 0 7 3 3 3
6 1 2 1 0 1 8 2 2 2
[[3]]
# A tibble: 3 x 9
Sub trial One Two Three Four Five Six Seven
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 1 0 8 5 4 4 4
2 2 1 1 0 0 4 5 5 5
3 2 1 1 0 0 5 4 4 4
[[4]]
# A tibble: 6 x 9
Sub trial One Two Three Four Five Six Seven
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 2 1 1 0 6 3 3 3
2 2 2 1 1 1 7 5 5 5
3 2 2 1 1 1 6 7 7 7
4 2 2 1 0 1 5 4 4 4
5 2 2 1 0 1 6 3 3 3
6 2 2 1 1 0 5 5 5 5
Now apply cor to the first group
my_list[[1]] %>%
summarise(across(Two:Seven, ~cor(One, .)))
# gives:
# A tibble: 1 x 6
Two Three Four Five Six Seven
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA
Warning messages:
1: In cor(One, Two) : the standard deviation is zero
2: In cor(One, Three) : the standard deviation is zero
3: In cor(One, Four) : the standard deviation is zero
4: In cor(One, Five) : the standard deviation is zero
5: In cor(One, Six) : the standard deviation is zero
6: In cor(One, Seven) : the standard deviation is zero
# or the correlation of just two columns, One and Two, in group one
cor(my_list[[1]]$One, my_list[[1]]$Two)
# gives:
[1] NA
Warning message:
In cor(my_list[[1]]$One, my_list[[1]]$Two) : the standard deviation is zero
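The NA can be reproduced without any grouping, because column One is constant and the correlation divides by its standard deviation. The sketch below also shows one possible way to return NA quietly instead of raising the warning; safe_cor is a hypothetical helper name, not a function from base R or dplyr.

```r
one <- c(1, 1, 1, 1, 1)   # constant, like column One
two <- c(1, 0, 0, 0, 1)

sd(one)
# [1] 0

# Hypothetical helper: skip cor() when either input has no usable variation
safe_cor <- function(a, b) {
  has_variation <- isTRUE(sd(a) > 0) && isTRUE(sd(b) > 0)
  if (!has_variation) return(NA_real_)
  cor(a, b)
}

safe_cor(one, two)                # NA, without the warning
safe_cor(c(1, 2, 3), c(2, 4, 6))  # perfectly correlated inputs
```

Under these assumptions it could be used inside the across() call as ~safe_cor(One, .), although the result for a constant column stays NA either way.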
A further example with the mtcars dataset:
mtcars %>%
relocate(cyl, vs, everything()) %>%
group_by(cyl, vs) %>%
summarise(across(hp:carb, ~cor(., mpg)))
cyl vs hp drat wt qsec am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 0 NA NA NA NA NA NA NA
2 4 1 -0.522 0.466 -0.721 -0.296 0.557 0.442 -0.189
3 6 0 -1 1 -0.101 0.931 NA -1 -1
4 6 1 -0.248 -0.249 -0.936 -0.0424 NA -0.442 -0.442
5 8 0 -0.284 0.0479 -0.650 -0.104 0.0496 0.0496 -0.394
Warning messages:
1: In cor(am, mpg) : the standard deviation is zero
2: In cor(am, mpg) : the standard deviation is zero
Assume the following table:
library(dplyr)
library(tibble)
library(purrr)
df = tibble(
client = c(1,1,1,1,2,2,2,2),
prod_type = c(1,1,2,2,1,1,2,2),
max_prod_type = c(2,2,2,2,2,2,2,2),
value_1 = c(10,20,30,30,100,200,300,300),
value_2 = c(1,2,3,3,1,2,3,3),
)
# A tibble: 8 x 5
client prod_type max_prod_type value_1 value_2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1
2 1 1 2 20 2
3 1 2 2 30 3
4 1 2 2 30 3
5 2 1 2 100 1
6 2 1 2 200 2
7 2 2 2 300 3
8 2 2 2 300 3
Column 'max_prod_type' here denotes the maximum value of the 'prod_type' column for each 'client' value. I need to compute a new column 'sum' that contains the sum of 'value_1' and 'value_2', but only over those rows where 'prod_type' == 'max_prod_type' within each 'client' value.
I have tried the following code:
df %>%
mutate(
sum =
map2_dbl(
client, max_prod_type,
~case_when(
prod_type == .y~
filter(df, client == .x, prod_type == .y) %>%
mutate(sum = value_1 + value_2) %>%
select(sum) %>%
sum(),
T~NA_real_
)
)
)
The desired output is the following:
# A tibble: 8 x 6
client prod_type max_prod_type value_1 value_2 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 66
4 1 2 2 30 3 66
5 2 1 2 100 1 NA
6 2 1 2 200 2 NA
7 2 2 2 300 3 606
8 2 2 2 300 3 606
But it throws an error:
Error: Problem with `mutate()` input `sum`.
x Result 1 must be a single double, not a double vector of length 6
i Input `sum` is `map2_dbl(...)`.
Moreover, this kind of implementation seems rather slow to me. I'm wondering if there is a correct and more efficient solution to this problem.
Appreciate your help!
One option could be:
df %>%
group_by(client) %>%
mutate(res = row_number() %in% which(value_1 == max(value_1)), # %in% avoids relying on recycling in ==
res = if_else(res, sum(value_1[res]) + sum(value_2[res]), NA_real_))
client prod_type max_prod_type value_1 value_2 res
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 66
4 1 2 2 30 3 66
5 2 1 2 100 1 NA
6 2 1 2 200 2 NA
7 2 2 2 300 3 606
8 2 2 2 300 3 606
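I think this is closer to what you want. Note that it adds value_1 and value_2 row by row, so it yields 33 and 303 rather than the group totals 66 and 606: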
df %>%
mutate(sum = case_when(prod_type == max_prod_type ~ value_1 + value_2,
TRUE ~ NA_real_))
# A tibble: 8 x 6
client prod_type max_prod_type value_1 value_2 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 33
4 1 2 2 30 3 33
5 2 1 2 100 1 NA
6 2 1 2 200 2 NA
7 2 2 2 300 3 303
8 2 2 2 300 3 303
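To get the group totals from the desired output (66 and 606) while using the condition stated in the question, the filter can be applied inside a grouped sum. This is a sketch assuming dplyr is installed; the result column is named total and the helper column is_max, both illustrative choices, so that sum() is not shadowed.

```r
library(dplyr)

df <- tibble(
  client        = c(1, 1, 1, 1, 2, 2, 2, 2),
  prod_type     = c(1, 1, 2, 2, 1, 1, 2, 2),
  max_prod_type = c(2, 2, 2, 2, 2, 2, 2, 2),
  value_1       = c(10, 20, 30, 30, 100, 200, 300, 300),
  value_2       = c(1, 2, 3, 3, 1, 2, 3, 3)
)

res <- df %>%
  group_by(client) %>%
  mutate(
    # rows of this client whose prod_type equals the stated maximum
    is_max = prod_type == max_prod_type,
    # one total per client, written only onto the matching rows
    total  = if_else(is_max, sum((value_1 + value_2)[is_max]), NA_real_)
  ) %>%
  ungroup() %>%
  select(-is_max)

res$total
# [1]  NA  NA  66  66  NA  NA 606 606
```

Because everything is vectorised within each group, this avoids the per-row filter() calls of the map2_dbl() attempt.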
I have a dataset that looks like this:
sample <- tibble(x = c (1,2,3,NA), y = c (5, NA,2, NA))
sample
# A tibble: 4 x 2
x y
<dbl> <dbl>
1 1 5
2 2 NA
3 3 2
4 NA NA
Now I want to create a new variable z, which counts how many non-missing observations there are in each row. For the sample dataset above, the first value of z should be 2 because both x and y have values. Similarly, for the 2nd row the value of z is 1 as there is one missing value, and for the 4th row the value is 0 as there are no observations in the row.
The expected dataset looks like this:
x y z
<dbl> <dbl> <dbl>
1 1 5 2
2 2 NA 1
3 3 2 2
4 NA NA 0
I want to do this on a small number of variables, not the whole dataset.
Using base R. The first line checks all columns, the second checks columns selected by name, and the third needs one term per column, so it becomes unwieldy when the number of columns is substantial.
sample$z1 <- rowSums(!is.na(sample))
sample$z2 <- rowSums(!is.na(sample[c("x", "y")]))
sample$z3 <- is.finite(sample$x) + is.finite(sample$y)
> sample
# A tibble: 4 x 5
x y z1 z2 z3
<dbl> <dbl> <dbl> <dbl> <int>
1 1 5 2 2 2
2 2 NA 1 1 1
3 3 2 2 2 2
4 NA NA 0 0 0
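The !is.na() and is.finite() variants agree on this data but are not interchangeable in general, since is.finite() also rejects infinite values. A small base R illustration, using a made-up vector v:

```r
v <- c(1, NA, NaN, Inf)

# counts Inf as an observation, but not NA or NaN
!is.na(v)
# [1]  TRUE FALSE FALSE  TRUE

# additionally excludes Inf (and -Inf)
is.finite(v)
# [1]  TRUE FALSE FALSE FALSE
```

If the columns can contain Inf, the rowSums(!is.na(...)) forms are the ones that match "count of non-missing values".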
We can use
library(dplyr)
sample %>%
rowwise %>%
mutate(z = sum(!is.na(cur_data()))) %>%
ungroup
-output
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <int>
#1 1 5 2
#2 2 NA 1
#3 3 2 2
#4 NA NA 0
If only selected columns should be counted:
sample %>%
rowwise %>%
mutate(z = sum(!is.na(select(cur_data(), x:y))))
Or with rowSums on a logical matrix
sample %>%
mutate(z = rowSums(!is.na(cur_data())))
-output
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 5 2
#2 2 NA 1
#3 3 2 2
#4 NA NA 0
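In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick(). Below is a sketch of the same counts with the newer helpers, assuming dplyr >= 1.1.0 is installed; the tibble is named sample_df here (an illustrative rename) so that it does not mask base::sample().

```r
library(dplyr)

sample_df <- tibble(x = c(1, 2, 3, NA), y = c(5, NA, 2, NA))

# vectorised: rowSums() over a logical matrix of the picked columns
by_rowsums <- sample_df %>%
  mutate(z = rowSums(!is.na(pick(x, y))))

# row by row: c_across() collects the selected columns of the current row
by_row <- sample_df %>%
  rowwise() %>%
  mutate(z = sum(!is.na(c_across(x:y)))) %>%
  ungroup()

by_rowsums$z
# [1] 2 1 2 0
```

The rowSums() form is usually the faster of the two on larger data, since it does not evaluate the expression once per row.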
Example using apply() with selected columns:
set.seed(7)
vals <- sample(c(1:20, NA, NA), 20)
sample <- matrix(vals, ncol = 5)
# Select columns 1, 3, 4
cols <- c(1, 3, 4)
rowcnts <- apply(sample[ , cols], 1, function(x) length(x[!is.na(x)]))
sample <- cbind(sample, rowcnts)
> sample
rowcnts
[1,] 10 15 16 NA 12 2
[2,] 19 8 14 18 9 3
[3,] 7 17 6 4 1 3
[4,] 2 3 13 NA 5 2
I have a df with entries in 10 columns, grouped by unit and year. I want to calculate a) how often the values per column increased and b) how often the values per column decreased from one year to the next (e.g. from 2010 to 2011, 2011 to 2012 and so on) per group.
This is my df
df <- data.frame(unit=rep(1:250, 4),
year=rep(c(2012, 2013, 2014, 2015), each=250),
replicate(10,sample(0:50000,1000,rep=TRUE)))
So a solution should show how often unit 1 had increases and decreases in X1 from one year to the next, how often unit 1 had increases/decreases in X2, and so forth.
A tidyverse solution would be preferable ;)
One solution that produces a wide format. Each one of the Xs will get 2 new columns of counts: X_incr and X_decr:
# example data
df <- data.frame(unit=rep(1:250, 4),
year=rep(c(2012, 2013, 2014, 2015), each=250),
replicate(10,sample(0:50000,1000,rep=TRUE)))
library(dplyr)
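# functions to count increases and decreases
# (lead() compares each value with the next row, so rows must be ordered by year within each unit)
f_incr = function(x) sum(lead(x) > x, na.rm = TRUE)
f_decr = function(x) sum(lead(x) < x, na.rm = TRUE)
df %>%
group_by(unit) %>% # for each unit
summarise_at(vars(matches("X")), list(incr = f_incr, # apply functions (list() replaces the deprecated funs())
decr = f_decr))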
# # A tibble: 250 x 21
# unit X1_incr X2_incr X3_incr X4_incr X5_incr X6_incr X7_incr X8_incr X9_incr X10_incr X1_decr X2_decr
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 1 0 2 1 1 1 1 1 2 2 2 3
# 2 2 1 2 1 2 0 1 1 3 2 2 2 1
# 3 3 3 1 1 1 2 1 1 2 2 2 0 2
# 4 4 1 1 2 1 1 1 1 1 2 1 2 2
# 5 5 3 2 2 1 2 2 1 2 2 2 0 1
# 6 6 1 2 1 2 2 2 1 2 2 1 2 1
# 7 7 1 2 1 1 2 0 2 3 1 1 2 1
# 8 8 2 1 1 2 2 1 1 2 1 1 1 2
# 9 9 1 2 3 1 2 2 1 1 2 2 2 1
#10 10 2 1 2 2 2 2 0 1 2 1 1 2
# # ... with 240 more rows, and 8 more variables: X3_decr <int>, X4_decr <int>, X5_decr <int>, X6_decr <int>,
# # X7_decr <int>, X8_decr <int>, X9_decr <int>, X10_decr <int>
Or if you prefer a format where each X has 2 rows of counts (X_incr and X_decr):
library(tidyr)
df %>%
group_by(unit) %>%
summarise_at(vars(matches("X")), list(incr = f_incr, # list() replaces the deprecated funs()
decr = f_decr)) %>%
gather(type, counts, -unit)
# # A tibble: 5,000 x 3
# unit type counts
# <int> <chr> <int>
# 1 1 X1_incr 1
# 2 2 X1_incr 1
# 3 3 X1_incr 3
# 4 4 X1_incr 1
# 5 5 X1_incr 3
# 6 6 X1_incr 1
# 7 7 X1_incr 1
# 8 8 X1_incr 2
# 9 9 X1_incr 1
#10 10 X1_incr 2
# # ... with 4,990 more rows
Or this:
df %>%
gather(type,value,-unit,-year) %>% # reshape data
group_by(unit, type) %>% # for each combination
summarise(incr = f_incr(value), # get increasing counts
decr = f_decr(value)) %>% # get decreasing counts
arrange(type, unit) %>% # order (just for visualisation purposes)
ungroup() # forget the grouping
# # A tibble: 2,500 x 4
# unit type incr decr
# <int> <chr> <int> <int>
# 1 1 X1 1 2
# 2 2 X1 1 2
# 3 3 X1 3 0
# 4 4 X1 1 2
# 5 5 X1 3 0
# 6 6 X1 1 2
# 7 7 X1 1 2
# 8 8 X1 2 1
# 9 9 X1 1 2
#10 10 X1 2 1
# # ... with 2,490 more rows
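In current dplyr, summarise_at() is superseded by across(). The wide result can be sketched as below on a small hand-made data set (toy and its values are illustrative, not the random data above). Note that across() orders the output columns by variable (X1_incr, X1_decr, X2_incr, ...) rather than by function.

```r
library(dplyr)

# two units observed over four years, two value columns
toy <- data.frame(
  unit = rep(1:2, times = 4),
  year = rep(2012:2015, each = 2),
  X1   = c(1, 4, 2, 3, 3, 5, 4, 5),
  X2   = c(9, 1, 7, 1, 8, 2, 6, 0)
)

f_incr <- function(x) sum(lead(x) > x, na.rm = TRUE)
f_decr <- function(x) sum(lead(x) < x, na.rm = TRUE)

counts <- toy %>%
  arrange(unit, year) %>%   # make sure years are in order within a unit
  group_by(unit) %>%
  summarise(across(starts_with("X"), list(incr = f_incr, decr = f_decr)))

counts
```

For unit 1, X1 runs 1, 2, 3, 4 across the years, so X1_incr is 3 and X1_decr is 0.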
I hope I understand question (a) correctly: you are trying to see, for each row, how many times the value increases (first from X1 to X2, then from X2 to X3, and so on).
I am using apply to iterate over each row. It overlays the second through last values on the first through second-to-last values and checks whether each value is bigger or smaller than the one before it. Summing the resulting boolean values gives the number of increases or decreases in that row. Note the switch from '>' to '<'.
increases <- apply(df[,3:12], 1, function(x) {sum(x[2:length(x)] > x[1:(length(x)-1)])})
decreases <- apply(df[,3:12], 1, function(x) {sum(x[2:length(x)] < x[1:(length(x)-1)])})
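The shifted comparison inside the anonymous function can be checked on a single vector. A short base R illustration with made-up values:

```r
x <- c(3, 5, 4, 4, 9)

later   <- x[2:length(x)]        # 5 4 4 9
earlier <- x[1:(length(x) - 1)]  # 3 5 4 4

n_incr <- sum(later > earlier)   # counts 3 -> 5 and 4 -> 9
n_decr <- sum(later < earlier)   # counts 5 -> 4
n_incr
# [1] 2
n_decr
# [1] 1

# diff() expresses the same comparison more compactly
sum(diff(x) > 0)
# [1] 2
```

Ties (here 4 followed by 4) count as neither an increase nor a decrease.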
For question (b) you can subtract the subset where year equals 2012 from the subset where year equals 2013, and test whether the differences are bigger than 0 for increases and smaller than 0 for decreases. Then use colSums to see for how many units the increase or decrease holds. This relies on both subsets listing the units in the same order.
Increase:
colSums((subset(df, year==2013) - subset(df, year==2012))>0)[3:12]
Decrease:
colSums((subset(df, year==2013) - subset(df, year==2012))<0)[3:12]