dplyr: mutating multiple columns using two columns as arguments to a custom function

I have a data frame:
df <- data.frame(A = c(10, 20, 30, 20),
                 B = c(0, 10, 20, 10),
                 C = c(11, 22, 21, 12),
                 D = c(13, 11, 33, 15))
 A  B  C  D
10  0 11 13
20 10 22 11
30 20 21 33
20 10 12 15
and a function to get the index of the number in a pair closest to a number of interest:
comp <- function(x, y) {
  # x = number of interest, y = vector of 2 numbers for comparison
  ind <- which(abs(y - x) == min(abs(y - x)))
  if (length(ind) == 2) {
    ind <- 3
  }
  return(ind)
}
(The if statement handles the case where the number is exactly midway between the two numbers, e.g. 15 compared to 10 and 20.)
I would like to change columns C and D to the index of the column whose value is closest, using my function (1 for A, 2 for B):
 A  B C D
10  0 1 1
20 10 1 2
30 20 2 1
20 10 2 3
I'm not sure how to call in columns A and B as arguments for the function. I've tried mutate_at(df, 3:4, funs(comp), c(df$A, df$B)), but that returns:
 A  B C D
10  0 3 6
20 10 3 6
30 20 3 6
20 10 3 6
It doesn't have to be a tidyverse solution; whatever works!
Thanks

I changed your function a little so that it vectorizes. It also accepted only two arguments, while you want to compare one value against two others, so three arguments are needed:
comp <- function(val, x, y) {
  case_when(
    abs(val - x) < abs(val - y) ~ 1,
    abs(val - x) > abs(val - y) ~ 2,
    TRUE ~ 3
  )
}
df %>%
  mutate_at(vars(C, D), comp, .$A, .$B)
   A  B C D
1 10  0 1 1
2 20 10 1 2
3 30 20 2 1
4 20 10 2 3
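Note that mutate_at() and funs() are superseded in recent dplyr; assuming dplyr 1.0 or later, the same idea can be written with across(). A minimal sketch, using the vectorized comp() above:
library(dplyr)
# inside mutate(), A and B refer to the data frame's columns directly
df %>%
  mutate(across(c(C, D), ~ comp(.x, A, B)))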

Related

Using a For Loop to multiply variables by numbers in a matrix

This should be relatively simple but I am new to R and cannot quite figure this out.
I will illustrate what I am trying to do.
I have the following:
names <- c("A","B","C")
values <- c(3,6,9)
values2 <- c(5,10,15)
y <- c("2019")
r <- c("1")
t <- c("Team A", "Team B", "Team C")
mgn <- c(33, 56, 63)
df1 <- data.frame(names,y,r,t,values,values2,mgn)
I also have a matrix:
numbers <- matrix(1:6, nrow = 3, ncol = 2)
I am trying to loop through each of the values and values2 in my df1 and multiply these by the values in my numbers matrix like so:
3 x 1 = 3
5 x 4 = 20
6 x 2 = 12
10 x 5 = 50
9 x 3 = 27
15 x 6 = 90
I would then like to print each of these values like:
    values values2
[1]      3      20
[2]     12      50
[3]     27      90
I tried the following (just for the first values col):
for (col in 1:ncol(numbers)) {
  df1$values %*% numbers[col]
  print(df1$values)
}
But this is the output I get:
[1] 3 6 9
[1] 6 12 18
[1] 6 12 18
[1] 12 24 36
[1] 12 24 36
[1] 24 48 72
I then would like to repeat the process, so that the first row of values and values2 is multiplied by the next row of numbers (2 and 5):
3 x 2 = 6
5 x 5 = 25
and so on, until all the combinations are calculated.
The full set of combinations would then look like this. First, every row of values and values2 multiplied by the first row of numbers (1 and 4):
3 x 1 = 3
5 x 4 = 20
6 x 1 = 6
10 x 4 = 40
9 x 1 = 9
15 x 4 = 60
Then it should move to the next row of numbers (2 and 5) and repeat:
3 x 2 = 6
5 x 5 = 25
6 x 2 = 12
10 x 5 = 50
9 x 2 = 18
15 x 5 = 75
And finally the last row of numbers (3 and 6):
3 x 3 = 9
5 x 6 = 30
6 x 3 = 18
10 x 6 = 60
9 x 3 = 27
15 x 6 = 90
Finally, I would like to loop through each of these, add them together like:
sumvalues = values + values2
create a total column like:
df1 %>% group_by(y, r, t) %>% dplyr::mutate(total = sum(sumvalues))
then obtain the Pearson correlation for each by:
cor(mgn, sumvalues, method = "pearson")
So I can have the output like so:
sumvalues total mgn pearson
[1]
[2]
[3]
Here's how I did it:
#### make the two objects have the same dimensions:
df2 <- df1[, c("values", "values2")]
#### multiply and create a new object:
new <- df2 * numbers
#### if you want to return the first column to df1:
df3 <- cbind(df1[1], new)
print(df3)
Your first output can be reached by:
df1[c("values", "values2")] * numbers
# values values2
#1 3 20
#2 12 50
#3 27 90
To get all possible combinations you can use apply with sweep:
apply(numbers, 1, function(x) sweep(df1[c("values", "values2")], 2, x, `*`))
#[[1]]
# values values2
#1 3 20
#2 6 40
#3 9 60
#[[2]]
# values values2
#1 6 25
#2 12 50
#3 18 75
#[[3]]
# values values2
#1 9 30
#2 18 60
#3 27 90
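The final step the question describes (sumvalues, a grouped total, and the Pearson correlation) isn't covered above. A minimal sketch of those steps, applied here to df1's original columns for simplicity (substitute any of the multiplied results in the same way); note that each (y, r, t) group has a single row here, so a per-group correlation is undefined and cor() is computed across all rows instead:
library(dplyr)
res <- df1 %>%
  mutate(sumvalues = values + values2) %>%
  group_by(y, r, t) %>%
  mutate(total = sum(sumvalues)) %>%
  ungroup()
# Pearson correlation between mgn and sumvalues across all rows
cor(res$mgn, res$sumvalues, method = "pearson")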

R - replace all values smaller than a specific value in a column with the nearest bigger value

I have a data frame like this one:
df <- data.frame(c(1,2,3,4,5,6,7), c(0,23,55,0,1,40,21))
names(df) <- c("a", "b")
a  b
1  0
2 23
3 55
4  0
5  1
6 40
7 21
Now I want to replace all values smaller than 22 in column b with the nearest bigger value. Of course it is possible to use loops, but since I have quite big datasets this is way too slow.
The solution should look somewhat like this:
a  b
1 23
2 23
3 55
4 55
5 40
6 40
7 40
Here is a tidyverse possibility (but note @phiver's comment on replacement ambiguities):
library(tidyverse)
df %>%
  mutate(b = ifelse(b < 22, NA, b)) %>%
  fill(b) %>%
  fill(b, .direction = "up")
# a b
#1 1 23
#2 2 23
#3 3 55
#4 4 55
#5 5 55
#6 6 40
#7 7 40
Explanation: Replace values b < 22 with NA and then use fill to fill NAs with previous/following non-NA entries.
Sample data
df <- data.frame(a = c(1,2,3,4,5,6,7), b = c(0,23,55,0,1,40,21))
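If you prefer to avoid tidyr, the same replace-then-fill idea works with zoo::na.locf; a sketch that reproduces the fill()-based output above (fill forward first, then backward for any leading NAs):
library(zoo)
b <- ifelse(df$b < 22, NA, df$b)     # mark values below 22 as NA
b <- na.locf(b, na.rm = FALSE)       # fill forward, keeping leading NAs
df$b <- na.locf(b, fromLast = TRUE)  # fill remaining leading NAs backward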
You can use zoo::rollapply:
library(zoo)
df$b <- rollapply(df$b, 3, function(x)
  if (x[2] < 22) min(x[x > 22]) else x[2],
  partial = TRUE)
# df
# a b
# 1 1 23
# 2 2 23
# 3 3 55
# 4 4 55
# 5 5 40
# 6 6 40
# 7 7 40
In base R you could do this for the same output:
transform(df, b = sapply(seq_along(b), function(i)
  if (b[i] < 22) {
    # window of neighbours around b[i]; c(b, Inf) pads the right edge,
    # and index 0 simply drops out at the left edge
    bi <- c(b, Inf)[seq(i - 1, i + 1)]
    min(bi[bi >= 22])
  } else b[i]))

Categorize data sequences continuously

I am new to R and have a question regarding adding a new variable to a table. I have data sequences starting with 10 and ending with 20, which appear several times.
Is there a way to group these sequences continuously?
Example:
The data in the column looks like that
10 3 15 15 19 19 20 20 10 10 11 17 20 ...
I would like to have an output like that
10 group 1
3 group 1
15 group 1
15 group 1
19 group 1
19 group 1
20 group 1
20 group 1
10 group 2
10 group 2
11 group 2
17 group 2
20 group 2
...
Is it possible to program something like that?
Thanks a lot for your help!!
Using base R you can detect the sequences and create a grouping variable with cumsum and head:
df$grp <- cumsum(df$x == 10 & c(20, head(df$x, -1)) == 20)
gives:
> df
x grp
1: 10 1
2: 3 1
3: 15 1
4: 15 1
5: 19 1
6: 19 1
7: 20 1
8: 20 1
9: 10 2
10: 10 2
11: 11 2
12: 17 2
13: 20 2
What this does:
df$x == 10 detects the 10's
c(20, head(df$x, -1)) == 20 detects whether the previous value equals 20; the first comparison value is set to 20 because there is no preceding value for the first element of df$x
Combining these two with & gives a logical vector indicating which values in df$x are equal to 10 and are preceded by a 20.
Wrapping that in cumsum gives a grouping value.
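A short illustration of the intermediate steps described above (is_ten and prev_was_20 are hypothetical names):
x <- df$x
is_ten      <- x == 10                   # TRUE where the value is 10
prev_was_20 <- c(20, head(x, -1)) == 20  # TRUE where the previous value was 20
cumsum(is_ten & prev_was_20)
# [1] 1 1 1 1 1 1 1 1 2 2 2 2 2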
Or with data.table:
library(data.table)
setDT(df)[, grp := cumsum(x == 10 & c(20, head(x, -1)) == 20)][]
Or with dplyr:
library(dplyr)
df %>%
  mutate(grp = cumsum(x == 10 & lag(x, default = 20) == 20))
You can use paste/paste0 to add text to the group-label:
paste0('group_', cumsum(df$x == 10 & c(20, head(df$x, -1)) == 20))
Used data:
df <- data.frame(x = c(10, 3, 15, 15, 19, 19, 20, 20, 10, 10, 11, 17, 20))
Try this. x is your numerics and y will be your groups.
x <- 0:20
y <- NA
df1 <- data.frame(x, y)
group1 <- (x > 10)
group2 <- (x <= 10)
df1$y[group1] <- "Group1"
df1$y[group2] <- "Group2"
df1

Conditional cumsum with reset

I have a data frame that is already sorted as needed, but now I would like to "slice" it into groups.
These groups should have a maximum cumulative value of 10: whenever adding the next value would push the cumulative sum above 10, the sum should reset and start over.
library(dplyr)
id <- sample(1:15)
order <- 1:15
value <- c(4, 5, 7, 3, 8, 1, 2, 5, 3, 6, 2, 6, 3, 1, 4)
df <- data.frame(id, order, value)
df
This is the output I'm looking for (I did it "manually"):
cumsum_10 <- c(4, 9, 7, 10, 8, 9, 2, 7, 10, 6, 8, 6, 9, 10, 4)
group_10 <- c(1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7)
df1 <- data.frame(df, cumsum_10, group_10)
df1
So I'm having two problems:
How to create a cumulative variable that resets every time it passes an upper limit (10 in this case)
How to count/label every group
For the first part I was trying some combinations of group_by and cumsum, with no luck:
df1 <- df %>% group_by(cumsum(c(False, value < 10)))
I would prefer a pipe (%>%) solution instead of a for loop
Thanks
I think this is not easily vectorizable... at least I do not know how.
You can do it by hand via:
my_cumsum <- function(x) {
  grp <- integer(length(x))
  grp[1] <- 1
  for (i in 2:length(x)) {
    if (x[i-1] + x[i] <= 10) {
      grp[i] <- grp[i-1]
      x[i] <- x[i-1] + x[i]
    } else {
      grp[i] <- grp[i-1] + 1
    }
  }
  data.frame(grp, x)
}
For your data this gives:
> my_cumsum(df$value)
grp x
1 1 4
2 1 9
3 2 7
4 2 10
5 3 8
6 3 9
7 4 2
8 4 7
9 4 10
10 5 6
11 5 8
12 6 6
13 6 9
14 6 10
15 7 4
Also for my "counter-example" this gives:
> my_cumsum(c(10,6,4))
grp x
1 1 10
2 2 6
3 2 10
As @Khashaa pointed out, this can be implemented more efficiently via Rcpp. He linked to the answer "How to speed up or vectorize a for loop?", which I find very useful.
You could define your own functions and then use them inside dplyr's mutate() statement; the definitions and the mutate() call follow below.
The cumsum_with_reset() function takes a column and a threshold value which resets the sum. cumsum_with_reset_group() is similar but identifies rows that have been grouped together. Definitions are as follows:
# group rows based on cumsum with reset
cumsum_with_reset_group <- function(x, threshold) {
  cumsum <- 0
  group <- 1
  result <- numeric()
  for (i in seq_along(x)) {
    cumsum <- cumsum + x[i]
    if (cumsum > threshold) {
      group <- group + 1
      cumsum <- x[i]
    }
    result <- c(result, group)
  }
  return(result)
}
# cumsum with reset
cumsum_with_reset <- function(x, threshold) {
  cumsum <- 0
  result <- numeric()
  for (i in seq_along(x)) {
    cumsum <- cumsum + x[i]
    if (cumsum > threshold) {
      cumsum <- x[i]
    }
    result <- c(result, cumsum)
  }
  return(result)
}
# use functions above as window functions inside mutate statement
df %>%
  group_by() %>%
  mutate(
    cumsum_10 = cumsum_with_reset(value, 10),
    group_10 = cumsum_with_reset_group(value, 10)
  ) %>%
  ungroup()
This can be done easily with purrr::accumulate
library(dplyr)
library(purrr)
df %>%
  mutate(cumsum_10 = accumulate(value, ~ ifelse(.x + .y <= 10, .x + .y, .y)),
         group_10 = cumsum(value == cumsum_10))
id order value cumsum_10 group_10
1 8 1 4 4 1
2 13 2 5 9 1
3 7 3 7 7 2
4 1 4 3 10 2
5 4 5 8 8 3
6 10 6 1 9 3
7 12 7 2 2 4
8 2 8 5 7 4
9 15 9 3 10 4
10 11 10 6 6 5
11 14 11 2 8 5
12 3 12 6 6 6
13 5 13 3 9 6
14 9 14 1 10 6
15 6 15 4 4 7
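To see what accumulate() is doing here: the two-argument formula receives the running total as .x and the next value as .y, returning either the new total or, when a reset is needed, just .y. A minimal illustration on the first four values:
library(purrr)
accumulate(c(4, 5, 7, 3), ~ ifelse(.x + .y <= 10, .x + .y, .y))
# [1]  4  9  7 10    (resets at 7 because 9 + 7 > 10)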
We can take advantage of the function cumsumbinning, from the package MESS, that performs this task:
library(MESS)
df %>%
  group_by(group_10 = cumsumbinning(value, 10)) %>%
  mutate(cumsum_10 = cumsum(value))
Output
# A tibble: 15 x 5
# Groups: group_10 [7]
id order value group_10 cumsum_10
<int> <int> <dbl> <int> <dbl>
1 6 1 4 1 4
2 10 2 5 1 9
3 1 3 7 2 7
4 5 4 3 2 10
5 3 5 8 3 8
6 9 6 1 3 9
7 14 7 2 4 2
8 11 8 5 4 7
9 15 9 3 4 10
10 8 10 6 5 6
11 12 11 2 5 8
12 2 12 6 6 6
13 4 13 3 6 9
14 7 14 1 6 10
15 13 15 4 7 4
The function below uses recursion to construct a vector with the lengths of each group. It is faster than a loop for small data vectors (length less than about a hundred values), but slower for longer ones. It takes three arguments:
1) vec: A vector of values that we want to group.
2) i: The index of the starting position in vec.
3) glv: A vector of group lengths. This is the return value, but we need to initialize it and pass it along through each recursion.
# Group a vector based on consecutive values with a cumulative sum <= 10
gf <- function(vec, i, glv) {
  ## Break out of the recursion when we get to the last group
  if (sum(vec[i:length(vec)]) <= 10) {
    glv <- c(glv, length(i:length(vec)))
    return(glv)
  }
  ## Keep the recursion going if there are at least two groups left
  # Calculate the length of the current group
  gl <- sum(cumsum(vec[i:length(vec)]) <= 10)
  # Append it to the previous group lengths
  glv.append <- c(glv, gl)
  # Call the function recursively
  gf(vec, i + gl, glv.append)
}
Run the function to return a vector of group lengths:
group_vec <- gf(df$value, 1, numeric(0))
[1] 2 2 2 3 2 3 1
To add a column to df with the group lengths, use rep:
df$group10 <- rep(1:length(group_vec), group_vec)
In its current form the function will only work on vectors that don't have any values greater than 10, and the grouping by sums <= 10 is hard-coded. The function can of course be generalized to deal with these limitations.
The function can be sped up somewhat by doing cumulative sums that look ahead only a certain number of values, rather than the remaining length of the vector. For example, if the values are always positive integers, you only need to look ten values ahead, since you'll never need to sum more than ten numbers to reach a value of 10. This too can be generalized for any target value; a sketch of the modification follows below. Even with this modification, the function is still slower than a loop for a vector with more than about a hundred values.
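A sketch of that look-ahead modification, assuming positive integer values and a target of 10 (so a group can never need more than ten values; gf_ahead is a hypothetical name):
gf_ahead <- function(vec, i, glv, ahead = 10) {
  end <- min(i + ahead - 1, length(vec))
  # If the window reaches the end of the vector and everything left fits, finish
  if (end == length(vec) && sum(vec[i:end]) <= 10) {
    return(c(glv, end - i + 1))
  }
  # Length of the current group within the look-ahead window
  gl <- sum(cumsum(vec[i:end]) <= 10)
  gf_ahead(vec, i + gl, c(glv, gl), ahead)
}
gf_ahead(df$value, 1, numeric(0))
# [1] 2 2 2 3 2 3 1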
I haven't worked with recursive functions in R before and would be interested in any comments and suggestions on whether recursion makes sense for this type of problem and whether it can be improved, especially execution speed.

Remove rows based on subtraction results

I have a large data set like this:
df <- data.frame(group = c(rep(1, 6), rep(5, 6)), score = c(30, 10, 22, 44, 6, 5, 20, 35, 2, 60, 14, 5))
group score
1 1 30
2 1 10
3 1 22
4 1 44
5 1 6
6 1 5
7 5 20
8 5 35
9 5 2
10 5 60
11 5 14
12 5 5
...
I want to subtract neighboring scores within each group; if the difference is greater than 30, remove the smaller score. For example, within group 1: 30-10=20<30, 10-22=-12<30, 22-44=-22<30, 44-6=38>30 (remove 6), 44-5=39>30 (remove 5, since 44 is now its neighbor)... The expected output should look like this:
group score
1 1 30
2 1 10
3 1 22
4 1 44
5 5 20
6 5 35
7 5 60
...
Does anyone have an idea how to do this?
Like this?
repeat {
  df$diff <- unlist(by(df$score, df$group, function(x) c(0, -diff(x))))
  if (all(df$diff <= 30)) break
  df <- df[df$diff <= 30, ]
}
df$diff <- NULL
df
# group score
# 1 1 30
# 2 1 10
# 3 1 22
# 4 1 44
# 7 5 20
# 8 5 35
# 10 5 60
This seems to require an iterative approach, because the "neighboring score" changes after a row is removed: before you remove 6, the difference 6 - 5 < 30, but after you remove 6, the neighbor of 5 becomes 44 and 44 - 5 > 30.
So this calculates the difference between successive rows within each group (using by(...) and diff(...)), removes the appropriate rows, and repeats the process until no difference exceeds 30.
It's not elegant but it should work:
out <- data.frame(group = numeric(), score = numeric())
# cycle through the groups
for (g in levels(as.factor(df$group))) {
  temp <- subset(df, df$group == g)
  # now go through the scores
  left <- temp$score[1]
  for (s in seq(2, length(temp$score))) {
    if (left - temp$score[s] > 30) {  # test the condition
      temp$score[s] <- NA
    } else {
      left <- temp$score[s]  # condition not met: update the reference score
    }
  }
  # add only the rows without NAs to out
  out <- rbind(out, temp[which(!is.na(temp$score)), ])
}
There should be a way to do this using ave, but carrying the last kept value forward when the next one is removed (diff > 30) is tricky! I'd appreciate a more elegant solution if there is one; one possibility is sketched below.
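For what it's worth, the carrying can be done in one pass per group with Reduce(..., accumulate = TRUE) inside ave(); a sketch (a row is kept exactly when the carried reference value equals its own score):
keep <- as.logical(ave(df$score, df$group, FUN = function(s) {
  # carry the last kept score forward; a score is dropped when it is
  # more than 30 below the carried reference
  ref <- Reduce(function(prev, cur) if (prev - cur > 30) prev else cur,
                s, accumulate = TRUE)
  ref == s
}))
df[keep, ]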
You can try
df
## group score
## 1 1 30
## 2 1 10
## 3 1 22
## 4 1 44
## 5 1 6
## 6 1 5
## 7 5 20
## 8 5 35
## 9 5 2
## 10 5 60
## 11 5 14
## 12 5 5
tmp <- df[!unlist(tapply(df$score, df$group, FUN = function(x) c(FALSE, -diff(x) > 30), simplify = TRUE)), ]
while (!identical(df, tmp)) {
  df <- tmp
  tmp <- df[!unlist(tapply(df$score, df$group, FUN = function(x) c(FALSE, -diff(x) > 30), simplify = TRUE)), ]
}
tmp
tmp
## group score
## 1 1 30
## 2 1 10
## 3 1 22
## 4 1 44
## 7 5 20
## 8 5 35
## 10 5 60
