Function containing ifelse not working in dplyr, works fine outside - r

I'd like to write a function with two inputs (x and y) to create some mutated variables in a very large dataframe. Specifically, if x=y then return x, and if x!=y then draw 1 sample from a sequence of x to y.
The function works fine when I test it outside of my data frame, but throws an error when I try to use it within mutate. I've tried both ifelse and if_else versions.
library(dplyr)
smx <- function(x,y){ #Function to allow sampling if length>1
if_else(x==y,x,sample(seq(x,y,1),1))}
#ifelse(x==y,x,sample(seq(x,y,1),1))} #Have also tried this with ifelse, doesn't work
smx(0,0) #This works
smx(0,5) #This works
#Create dummy data frame
df <- as.data.frame(cbind(c(rep(0,5)),c(seq(0,4,1))))
colnames(df) <- c("varA","varB")
df
#This doesn't work
df1 <- df %>% mutate(
VarC = smx(varA,varB)
)
Ideally, my output should include a third column (VarC) in which the first row is equal to 0 (because varA=varB) and the remaining rows are a random sample between a sequence from varA to varB.
I have set up my data frame so that varA should always be smaller than varB, but I'm not certain that it is. I'd appreciate any help on a clean solution to this problem!

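The function is not working because it is not vectorized: mutate passes the whole varA and varB columns at once, but seq expects single values for its from and to arguments. First, you'll need to vectorize your function in order to make it work inside mutate.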
You can do that as follows:
vectorized_fun <- Vectorize(your_fun)
Your code will look like this:
smx_v <- Vectorize(smx)
#This works
df1 <- df %>%
mutate(VarC = smx_v(varA,varB)
)

The issue here comes from seq: when using this function inside a dplyr verb, you need to make sure each input has length 1, which isn't the case here because mutate passes whole columns.
Using rowwise() solves the problem:
smx <- function(x,y){
ifelse(x==y,x,sample(seq(x,y,1),1))
}
df <- as.data.frame(cbind(c(rep(0,5)),c(seq(0,4,1))))
colnames(df) <- c("varA","varB")
df %>%
rowwise() %>%
mutate(VarC = smx(varA, varB))
Output:
# A tibble: 5 x 3
# Rowwise:
varA varB VarC
<dbl> <dbl> <dbl>
1 0 0 0
2 0 1 1
3 0 2 1
4 0 3 2
5 0 4 0
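For a very large data frame, both Vectorize() and rowwise() still call the function once per row. A minimal sketch of a fully vectorized alternative follows. It assumes that varA and varB hold whole numbers and that varA <= varB in every row, neither of which is guaranteed by the question. The random draw is built from runif() and floor() instead of sample().

```r
library(dplyr)

# Same dummy data as in the question
df <- data.frame(varA = rep(0, 5), varB = seq(0, 4, 1))

set.seed(1)  # only to make the draw reproducible

# Assumption: integer-valued columns with varA <= varB.
# floor(runif * k) gives a whole number in 0..(k - 1), so adding varA
# yields a uniform draw from varA..varB; when varA == varB it returns varA.
df1 <- df %>%
  mutate(VarC = varA + floor(runif(n()) * (varB - varA + 1)))

df1
```

Because runif(n()) generates one random number per row in a single call, no row-by-row evaluation is needed.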

Related

Can you use rename() in R to rename a variable by referencing a variable that stores a string?

I am trying to write a function that takes a data frame and a string to rename a variable in the data frame as the string. If I have the following data frame df and variable x:
var1 var2
A 1 1
B 1 1
x = "new_var"
I want to create a function that will take the data frame and string as arguments and return the following data frame:
var1 new_var
A 1 1
B 1 1
My initial thought was to create something like the following:
dplyr_f <- function(df, x){
new_df <- df %>%
rename(x = var2)
return(new_df)
}
dplyr_f(df, x)
But this returns:
var1 x
A 1 1
B 1 1
instead of the second column being named new_var.
Is there any way I can make rename() use the value of a variable like x instead of it thinking I literally want to rename the variable "x"? Or maybe there's a better solution?
I would be piping the output into other dplyr functions, so I am looking for a dplyr-based solution. Thanks!
What you need to do is use the bang-bang operator, i.e. !!, together with := so that the name on the left-hand side can be unquoted:
dplyr_f <- function(df, x){
new_df <- df %>%
rename(!!x := var2)
return(new_df)
}
dplyr_f(df, x)
var1 new_var
A 1 1
B 1 1
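As a sketch of two alternatives that avoid !! altogether, the new name can also be supplied through rename_with() or through a named character vector wrapped in all_of(). Both assume a reasonably recent dplyr (1.0 or later); the function names rename_a and rename_b are made up for illustration.

```r
library(dplyr)

df <- data.frame(var1 = c(1, 1), var2 = c(1, 1), row.names = c("A", "B"))
x <- "new_var"

# Option 1: rename_with() applies a function to the selected column names;
# here the function ignores the old name and returns the string stored in x.
rename_a <- function(df, x) {
  df %>% rename_with(~ x, .cols = var2)
}

# Option 2: a named character vector (new name = old name) inside all_of().
rename_b <- function(df, x) {
  lookup <- setNames("var2", x)
  df %>% rename(all_of(lookup))
}

out_a <- rename_a(df, x)
out_b <- rename_b(df, x)
names(out_a)
names(out_b)
```

Either version can be piped into further dplyr verbs, just like the !! version.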

adding noise to a column in dplyr

Related to a previous question, I want to add some random noise to every value in a column in dplyr. However, when I tried the code below I got identical values back. I understand why this is happening (dplyr generates a single random number and then adds that same number to every value). Is there any way to prevent this?
data <- data.frame(value=c(1,1,1,1,1)) %>% mutate(value = value + 1e-3*runif(1)) %>% print
# print(data)
# value
# 1 1.000236
# 2 1.000236
# 3 1.000236
# 4 1.000236
# 5 1.000236
Here is a solution with jitter:
library(dplyr)
set.seed(2020) # Make the results reproducible
data <- data.frame(value=c(1,1,1,1,1)) %>% mutate(value = jitter(value))
data
# value
#1 1.0058761
#2 0.9957690
#3 1.0047401
#4 0.9990756
#5 0.9854439
You could generate your random vector externally and then add it to data$value:
nrows <- nrow(data)
rands <- 1e-3 * runif(nrows)
data$value <- data$value + rands
Stepwise clarity works better for me.
Found my own answer. Adding rowwise() evaluates each row individually and thus gives a new random number.
data <- data.frame(value=c(1,1,1,1,1)) %>% rowwise() %>% mutate(value = value + 1e-3*runif(1)) %>% print
# print(data)
# value
# 1 1.000625
# 2 1.000764
# 3 1.000588
# 4 1.000536
# 5 1.000079
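Another option, sketched here, keeps everything inside a single mutate call without rowwise(): ask runif() for one number per row by passing n(), the dplyr helper that returns the current group size. The seed is only there to make the example reproducible.

```r
library(dplyr)

set.seed(2020)  # reproducibility only

# runif(n()) returns a vector with one draw per row,
# so each value gets its own noise term.
data <- data.frame(value = c(1, 1, 1, 1, 1)) %>%
  mutate(value = value + 1e-3 * runif(n()))

data
```

This stays vectorized, which tends to matter on large data frames where rowwise() evaluates the expression once per row.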

Purrr Implementation of For-Loop

I am trying to develop a logistic regression model in R. I am trying to loop over rows of a data frame (or tibble) so that I can multiply a subset of the columns in that row by another vector as a dot product.
I initially tried to accomplish some preparatory work using purrr's vector functions, but was having difficulty and decided to implement it in a for-loop.
This is the working design I have with a For-Loop:
library(tidyverse)
# Define necessary functions
lambdaFunc <- function(factors,theta){
return((1+exp(sum(factors*theta)))^(-1))
}
# y is 0 or 1
# x and theta are numeric vectors
indiv_likhd <- function(y,x,theta){
return(lambdaFunc(x,theta)^y*(1-lambdaFunc(x,theta))^(1-y))
}
# Assuming df is dataframe of the form
# Col1 Col2 ... ColN
# isDefault(0 or 1) factor1 ... factorN
likhds <- function(df,theta){
df <- as.data.frame(df)
likhds <- vector("numeric",nrow(df))
for (i in 1:nrow(df)) {
likhds[i] <- indiv_likhd(df[i,1],df[i,2:ncol(df)],theta)
}
return(likhds)
}
So
testdf <- tibble(y=c(1,0),x_1=c(1,1),x_2=c(1,1),x_3=c(1,1))
testTheta <- c(1,1,1)
likhds(testdf,testTheta)
yields
[1] 0.04742587 0.95257413
Is there a way to implement this with vector functions, specifically the purrr package? This is my first real question on stackoverflow so I apologize if there is something missing or unclear, in which case, please let me know.
Thank you.
Without changing your lambdaFunc and indiv_likhd, we could rewrite your for loop with pmap:
library(dplyr)
library(purrr)
testdf %>%
mutate(new_col = pmap_dbl(., ~indiv_likhd(c(...)[1], c(...)[-1], testTheta)))
# y x_1 x_2 x_3 new_col
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 0.0474
#2 0 1 1 1 0.953
c(...) is used to capture all the values passed to pmap (here the entire row), so c(...)[1] means the first value in the row, c(...)[-1] means everything other than the first values in the row.
Here is an option
f <- function(df, theta) {
df %>%
group_by(y) %>%
nest() %>%
mutate(likhds = map2_dbl(y, data, function(y, x) indiv_likhd(y, x, theta))) %>%
pull(likhds)
}
f(testdf, testTheta)
#[1] 0.04742587 0.95257413
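Explanation: We nest data by y, then use map2_dbl to loop through the pairs of y and data (which are your x values), and return the output of indiv_likhd as a double vector. Note that this yields one value per row only when each value of y occurs in exactly one row, as in testdf; rows that share a y value are nested together into a single group.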
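Since the per-row computation is a dot product, the loop can also be replaced by a matrix product. The following is a minimal base R sketch, not part of the answers above; the function name lik_vectorized is made up, and it assumes the first column is y and all remaining columns are numeric factors.

```r
# Hypothetical vectorized version of likhds(): one matrix product
# replaces the row-by-row calls to lambdaFunc().
lik_vectorized <- function(df, theta) {
  y <- df[[1]]                                      # 0/1 outcome in column 1
  X <- as.matrix(df[-1])                            # factor columns as a matrix
  lambda <- as.vector(1 / (1 + exp(X %*% theta)))   # same formula as lambdaFunc
  lambda^y * (1 - lambda)^(1 - y)                   # same formula as indiv_likhd
}

testdf <- data.frame(y = c(1, 0), x_1 = c(1, 1), x_2 = c(1, 1), x_3 = c(1, 1))
testTheta <- c(1, 1, 1)

res <- lik_vectorized(testdf, testTheta)
res
```

On the test data this should reproduce the values returned by the original for loop.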

Looping dplyr and creating multiple dataframe

I have a large dataset in which I would like to use dplyr and filter and select the data to create 12 separate dataframes.
Essentially, I am using only two columns of data from a larger dataset. The first column is "plot", where I filter by "plot" number and another condition in another 3rd column ("pos_ID"). I want to create a loop that filters by plot number (I tried plot==[i]) and the 3rd condition, and then creates a new dataframe. The loop would repeat 12 times (because plot spans from 1-12).
Here is the code that I used without a loop (based on sample data)
p1_Germ <- data %>% #p1 stands for plot 1
filter(plot==1, pos_ID<21) %>%
select(germ_bin)
Here is the code that I tried to incorporate a loop (based on sample data)
for(i in seq_along(plot)) {
data %>%
group_by(plot[[i]], pos_ID<21) %>%
select(germ_bin)
}
Here is some sample data
plot <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12)
germ_bin <- c(0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,1,0,1,0)
pos_ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
dataset <- data.frame(plot, germ_bin, pos_ID)
dataset
My guess is to use a list, but I'm not familiar with loops and lists and could not find a solution online. I need to create 12 dataframes because I'm trying to convert each of them into a matrix afterwards for another function. Any help would be much appreciated!
We can use group_split and map to filter based on the criteria and get a list of dataframes.
library(dplyr)
library(purrr)
dataset %>%
group_split(plot) %>%
map(. %>% filter(pos_ID < 21) %>% select(germ_bin))
#[[1]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 0
#2 0
#[[2]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 1
#2 0
#[[3]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 1
#2 0
#....
For the shared example, if you want to drop empty groups you can filter first
dataset %>%
filter(pos_ID < 21) %>%
group_split(plot) %>%
map(. %>% select(germ_bin))
As far as your attempt with the for loop is concerned, you can correct it as follows:
unique_plot <- unique(dataset$plot)
plot_list <- vector("list", length(unique_plot)) # pre-allocate one slot per plot
for(i in seq_along(unique_plot)) {
plot_list[[i]] <- dataset %>%
filter(plot == unique_plot[i], pos_ID<21) %>%
select(germ_bin)
}
Or keeping it completely in base R
lapply(split(dataset, dataset$plot), function(x)
subset(x, pos_ID < 21, select = germ_bin, drop = FALSE))
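Since the goal is to turn each piece into a matrix, the conversion can be done in the same pass. This is a minimal base R sketch using the sample data from the question; the element names (p1_Germ and so on) follow the naming in the question and are otherwise arbitrary.

```r
plot <- rep(1:12, each = 2)
germ_bin <- c(0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,1,0,1,0)
pos_ID <- 1:24
dataset <- data.frame(plot, germ_bin, pos_ID)

# Filter first so that plots with no remaining rows are dropped,
# then split by plot and convert each piece to a one-column matrix.
kept <- subset(dataset, pos_ID < 21)
germ_list <- lapply(split(kept, kept$plot),
                    function(d) as.matrix(d["germ_bin"]))

# Name the elements so each matrix can be retrieved by plot
names(germ_list) <- paste0("p", names(germ_list), "_Germ")

germ_list$p1_Germ
```

Keeping the matrices in one named list avoids creating twelve separate objects and makes it easy to apply the next function with lapply.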

Error when combining dplyr inside a function

I'm trying to figure out what I'm doing wrong here. Using the following training data I compute some frequencies using dplyr:
group.count <- c(101,99,4)
data <- data.frame(
by = rep(3:1,group.count),
y = rep(letters[1:3],group.count))
data %>%
group_by(by) %>%
summarise(non.miss = sum(!is.na(y)))
Which gives me the outcome I'm looking for. However, when I try to do it as a function:
res0 <- function(x1,x2) {
output = data %>%
group_by(x2) %>%
summarise(non.miss = sum(!is.na(x1)))
}
res0(y,by)
I get an error (index out of bounds).
Can anybody tell me what I'm missing?
Thanks in advance.
You can't do it like that in dplyr.
The problem is that the arguments are not treated as column names inside the function. What dplyr is doing here is trying to group_by a variable called x2, which is not a part of your data.frame (and y does not exist as an object outside the data frame at all). Your first thought might be to pass "by" as a string, but this won't work with dplyr either. To show this, make your data.frame as such:
data <- data.frame(
x2 = rep(3:1,group.count),
x1 = rep(letters[1:3],group.count)
)
Then call your function again and it will return the expected output.
I suggest changing the name of your dataframe to df.
This is basically what you have done:
df %>%
group_by(by) %>%
summarise(non.miss = sum(!is.na(y)))
which produces this:
# by non.miss
#1 1 4
#2 2 99
#3 3 101
but to count the number of observations per group, you could use length, which gives the same answer:
df %>%
group_by(by) %>%
summarise(non.miss = length(y))
# by non.miss
#1 1 4
#2 2 99
#3 3 101
or, use tally, which gives this:
df %>%
group_by(by) %>%
tally
# by n
#1 1 4
#2 2 99
#3 3 101
If you really wanted, you could put that into a function, with the dataframe as the input. Like this:
res0 <- function(df) {
df %>%
group_by(by) %>%
tally
}
res0(df)
# by n
#1 1 4
#2 2 99
#3 3 101
This of course assumes that your dataframe will always have the grouping column named 'by'. I realize that these data are just fictional, but avoiding naming columns 'by' might be a good idea because that is its own function in R - it may get a bit confusing reading the code with it in.
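If the grouping and counting columns do need to be arguments, more recent versions of dplyr support the embrace operator {{ }} for forwarding bare column names into verbs. The sketch below assumes such a version is installed (the operator postdates the answers above), and it adds the data frame as an explicit first argument, which the original res0 did not have.

```r
library(dplyr)

group.count <- c(101, 99, 4)
data <- data.frame(
  by = rep(3:1, group.count),
  y = rep(letters[1:3], group.count))

# {{ }} forwards the bare column names supplied by the caller,
# so x1 and x2 are looked up as columns of df.
res0 <- function(df, x1, x2) {
  df %>%
    group_by({{ x2 }}) %>%
    summarise(non.miss = sum(!is.na({{ x1 }})))
}

out <- res0(data, y, by)
out
```

The call res0(data, y, by) should then give the same counts as the version written outside a function.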
