I am trying to develop a logistic regression model in R. I am trying to loop over rows of a data frame (or tibble) so that I can multiply a subset of the columns in that row by another vector as a dot product.
I initially tried to accomplish some preparatory work using purrr's vector functions, but was having difficulty and decided to implement it in a for-loop.
This is the working version I have with a for loop:
library(tidyverse)
# Define necessary functions
lambdaFunc <- function(factors,theta){
return((1+exp(sum(factors*theta)))^(-1))
}
# y is 0 or 1
# x and theta are numeric vectors
indiv_likhd <- function(y,x,theta){
return(lambdaFunc(x,theta)^y*(1-lambdaFunc(x,theta))^(1-y))
}
# Assuming df is dataframe of the form
# Col1 Col2 ... ColN
# isDefault(0 or 1) factor1 ... factorN
likhds <- function(df,theta){
df <- as.data.frame(df)
likhds <- vector("numeric",nrow(df))
for (i in 1:nrow(df)) {
likhds[i] <- indiv_likhd(df[i,1],df[i,2:ncol(df)],theta)
}
return(likhds)
}
So
testdf <- tibble(y=c(1,0),x_1=c(1,1),x_2=c(1,1),x_3=c(1,1))
testTheta <- c(1,1,1)
likhds(testdf,testTheta)
yields
[1] 0.04742587 0.95257413
Is there a way to implement this with vector functions, specifically the purrr package? This is my first real question on Stack Overflow, so I apologize if something is missing or unclear; if so, please let me know.
Thank you.
Without changing your lambdaFunc and indiv_likhd, we could rewrite your for loop with pmap:
library(dplyr)
library(purrr)
testdf %>%
mutate(new_col = pmap_dbl(., ~indiv_likhd(c(...)[1], c(...)[-1], testTheta)))
# y x_1 x_2 x_3 new_col
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 0.0474
#2 0 1 1 1 0.953
c(...) is used to capture all the values passed to pmap (here the entire row), so c(...)[1] means the first value in the row and c(...)[-1] means everything other than the first value in the row.
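To see what c(...) captures, here is a minimal sketch on a made-up data frame. The names demo, first_vals and rest_sums are hypothetical and only serve the illustration; it assumes purrr is installed.

```r
library(purrr)

# Hypothetical two-row data frame, used only for illustration
demo <- data.frame(y = c(1, 0), x_1 = c(2, 3), x_2 = c(4, 5))

# pmap() passes one row at a time; c(...) collects that row into a vector
rows <- pmap(demo, ~ c(...))

# c(...)[1] is the first value of each row (the y column here)
first_vals <- pmap_dbl(demo, ~ c(...)[1])

# c(...)[-1] is everything except the first value; here we sum it per row
rest_sums <- pmap_dbl(demo, ~ sum(c(...)[-1]))
```

Here first_vals holds the y values and rest_sums holds the row sums of the remaining columns.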
Here is an option
f <- function(df, theta) {
df %>%
group_by(y) %>%
nest() %>%
mutate(likhds = map2_dbl(y, data, function(y, x) indiv_likhd(y, x, theta))) %>%
pull(likhds)
}
f(testdf, testTheta)
#[1] 0.04742587 0.95257413
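Explanation: We nest the data by y, then use map2_dbl to loop over the pairs of y and data (which holds your x values) and return the output of indiv_likhd as a double vector. Note that group_by(y) creates one group per distinct value of y, so this yields one likelihood per row only when every y value is unique, as in testdf; with repeated y values, several rows end up in the same data element.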
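As an alternative that avoids looping over rows altogether, the same quantity can be sketched with a single matrix product. This is not from the answers above: likhds_vec is a hypothetical name, and the sketch assumes the first column is y and all remaining columns are numeric factors, as in the question.

```r
# Vectorized sketch: one matrix product replaces the per-row dot products
likhds_vec <- function(df, theta) {
  y <- df[[1]]                                 # first column: 0/1 outcome
  X <- as.matrix(df[-1])                       # remaining columns: factors
  lambda <- 1 / (1 + exp(drop(X %*% theta)))   # same formula as lambdaFunc
  lambda^y * (1 - lambda)^(1 - y)              # same formula as indiv_likhd
}

testdf <- data.frame(y = c(1, 0), x_1 = c(1, 1), x_2 = c(1, 1), x_3 = c(1, 1))
res <- likhds_vec(testdf, c(1, 1, 1))
res
```

For the test data this should reproduce the values from the question, 1/(1+exp(3)) for the first row and its complement for the second.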
Related
I am attempting to use the lsa::cosine function to derive cosine values between vectors distributed across successive rows of a dataframe. My raw dataframe is structured with 15 numeric columns, with each row denoting a unique vector (each row is a 15-item vector).
My challenge is to create a new variable (e.g., cosineraw) that reflects cosine(vec1, vec2). Vec1 is the vector for Row1 and Vec2 is the vector for the next row (lead). I need this function to loop over rows for very large dataframes and am attempting to avoid a for loop. Essentially I need to compute a cosine value for each row contrasted to the next row stopping at the second to last row of the dataframe (since there is no cosine value for the last observation).
I've tried selecting observations rowwise:
dat <- mydat %>% rowwise %>% mutate(cosraw = cosine(as.vector(t(select_all))), as.vector(t(lead(select_all))))
but am getting an 'argument is not a matrix' error
In isolation, this code snippet works:
maybe <- lsa::cosine(as.vector(t(dat[2,])), as.vector(t(dat[1,])))
The problem is that the row index must be relative. This only works successfully for row1 vs. row2 not as the basis for a function rolling across all rows.
Is there a way to do this avoiding a 'for' loop?
Here's a base R solution:
# Load {lsa}
library(lsa)
# Generate data with 250k rows and 300 columns
gen_list <- lapply(1:250000, function(i){
rnorm(300)
})
# Convert to matrix
mat <- t(simplify2array(gen_list))
# Obtain desired values
vals <- unlist(
lapply(
2:nrow(mat), function(i){
cosine(mat[i-1,], mat[i,])
}
)
)
You can ignore the gen_list code as this was to generate example data.
You will want to convert your data frame to a matrix to make it compatible with the {lsa} package.
Runs quickly -- 3.39 seconds on my computer
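If the per-row function call is still too slow, the cosine between each row and the next can also be computed with whole-matrix arithmetic. This is a sketch under the assumption that the cosine is the usual dot product divided by the product of the norms; the small matrix and the names a, b and cos_next are made up for the example, and lsa is not needed.

```r
set.seed(1)
# Hypothetical data: 10 rows, each a 15-item vector
mat <- matrix(rnorm(10 * 15), nrow = 10)

a <- mat[-nrow(mat), , drop = FALSE]  # rows 1 .. n-1
b <- mat[-1, , drop = FALSE]          # rows 2 .. n (the "lead" rows)

# Row-wise dot products divided by the product of the row norms
cos_next <- rowSums(a * b) / sqrt(rowSums(a^2) * rowSums(b^2))
cos_next
```

The result has one value per row except the last, which has no following row to compare against.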
My answer is similar to Kat's, but I first packed the 15 row values into a list and then created a new column holding the lead of that list column.
Here is some reproducible data:
library(dplyr)
library(tidyr)
library(lsa)
set.seed(1)
df <- data.frame(replicate(15,runif(10)))
The actual workflow:
df %>%
rowwise %>%
summarise(row_v = list(c_across())) %>%
mutate(nextrow_v = lead(row_v)) %>%
replace_na(list(nextrow_v=list(rep(NA, 15)))) %>% # replace NA with a list of NAs
rowwise %>%
summarise(cosr = cosine(unlist(row_v), unlist(nextrow_v)))
# A tibble: 10 x 1
# Rowwise:
cosr[,1]
<dbl>
1 0.820
2 0.791
3 0.780
4 0.785
5 0.838
6 0.808
7 0.718
8 0.743
9 0.773
10 NA
I'm assuming that you aren't looking for vectorization, as well (i.e., lapply or map).
This works, but it's a bit cumbersome. I didn't have any actual data from you so I made my own.
library(lsa)
library(tidyverse)
set.seed(1)
df1 <- matrix(sample(rnorm(15 * 11, 1, .1), 15 * 10), byrow = T, ncol = 15)
Then I created a copy of the data to use as the lead, because for the mutate to work, you need to lead columnwise, but aggregate rowwise. (That doesn't sound quite right, but hopefully, you can make heads or tails of it.)
df2 <- df1
df3 <- df2[-1, ] # all but the first row
df3 <- rbind(df3, rep(NA, 15)) # fill the missing row with NA
df2 <- cbind(df2, df3) %>% as.data.frame()
So now I've got a data frame that is 30 columns wide. The first 15 columns are my vector; the second 15 are the lead.
df2 %>%
rowwise %>%
mutate(cosr = cosine(c_across(V1:V15), c_across(V16:V30))) %>%
select(cosr) %>% unlist()
# cosr1 cosr2 cosr3 cosr4 cosr5 cosr6 cosr7 cosr8
# 0.9869402 0.9881976 0.9932426 0.9921418 0.9946119 0.9917792 0.9908216 0.9918681
# cosr9 cosr10
# 0.9972666 NA
If in doubt, you can always use a loop or vectorization to validate the numbers.
for(i in 1:(nrow(df1) - 1)) {
v1 <- df1[i, ] %>% unlist()
v2 <- df1[i + 1, ] %>% unlist()
message(cosine(v1, v2))
}
invisible(
lapply(1:(nrow(df1) - 1),
function(i) {message(cosine(unlist(df1[i, ]),
unlist(df1[i + 1, ])))}))
I'd like to write a function with two inputs (x and y) to create some mutated variables in a very large dataframe. Specifically, if x=y then return x, and if x!=y then draw 1 sample from a sequence of x to y.
The function works fine when I test it outside of my data frame, but throws an error when I try to use it within mutate. I've tried both ifelse and if_else versions.
library(dplyr)
smx <- function(x,y){ #Function to allow sampling if length>1
if_else(x==y,x,sample(seq(x,y,1),1))}
#ifelse(x==y,x,sample(seq(x,y,1),1))} #Have also tried this with ifelse, doesn't work
smx(0,0) #This works
smx(0,5) #This works
#Create dummy data frame
df <- as.data.frame(cbind(c(rep(0,5)),c(seq(0,4,1))))
colnames(df) <- c("varA","varB")
df
#This doesn't work
df1 <- df %>% mutate(
VarC = smx(varA,varB)
)
Ideally, my output should include a third column (VarC) in which the first row is equal to 0 (because varA=varB) and the remaining rows are a random sample between a sequence from varA to varB.
I have set up my data frame so that varA should always be smaller than varB, but I'm not certain. I'd appreciate any help on a clean solution to this problem!
The function is not working because it is not vectorized. First, you'll need to vectorize your function in order to make it work inside mutate.
You can do that as follows:
vectorized_fun <- Vectorize(your_fun)
Your code will look like this:
smx_v <- Vectorize(smx)
#This works
df1 <- df %>%
mutate(VarC = smx_v(varA,varB)
)
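To make the mechanism concrete: Vectorize() wraps the function in mapply(), so it is called once per pair of elements and seq() only ever sees scalars. The sketch below uses a plain if/else scalar function, named smx_scalar here purely for illustration, rather than the original if_else version.

```r
# Hypothetical scalar version: expects x and y of length 1
smx_scalar <- function(x, y) {
  if (x == y) x else sample(seq(x, y, 1), 1)
}

# Vectorize() returns a wrapper that applies smx_scalar element by element
smx_vec <- Vectorize(smx_scalar)

set.seed(42)
res <- smx_vec(c(0, 0, 0), c(0, 2, 4))
res
```

The first element is always 0 because x equals y there; the others are random draws from 0:2 and 0:4.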
The issue here comes from seq: its from and to arguments must each have length 1, but inside a dplyr verb such as mutate the function receives whole columns, so x and y have length 5 here.
Using rowwise() solves the problem:
smx <- function(x,y){
ifelse(x==y,x,sample(seq(x,y,1),1))
}
df <- as.data.frame(cbind(c(rep(0,5)),c(seq(0,4,1))))
colnames(df) <- c("varA","varB")
df %>%
rowwise() %>%
mutate(VarC = smx(varA, varB))
Output:
# A tibble: 5 x 3
# Rowwise:
varA varB VarC
<dbl> <dbl> <dbl>
1 0 0 0
2 0 1 1
3 0 2 1
4 0 3 2
5 0 4 0
I have a dataset, and I would like to randomize the order of this dataset 100 times and calculate the cumulative mean each time.
# example data
library(dplyr)
ID <- seq.int(1,100)
val <- rnorm(100)
df <- data.frame(ID, val)
I already know how to calculate the cumulative mean using the function "cummean()" in dplyr.
df2 <- df %>%
mutate(cm = cummean(val))
However, I don't know how to randomize the dataset 100 times and apply the cummean() function to each iteration of the dataframe. Any advice on how to do this would be greatly appreciated.
I realize this could probably be solved via either a loop, or in tidyverse, and I'm open to either solution.
Additionally, if possible, I'd like to include a column that indicates which iteration the data was produced from (i.e., randomization #1, #2, ..., #100), as well as include the "ID" value, which indicates how many data values were included in the cumulative mean. Thanks in advance!
Here is an approach using the purrr package. I'm also not sure what cummean is calculating (maybe someone can share that in the comments), so I included an alternative, the column cm2, as a comparison.
library(tidyverse)
set.seed(2000)
num_iterations <- 100
num_sample <- 100
1:num_iterations %>%
map_dfr(
function(i) {
tibble(
iteration = i,
id = 1:num_sample,
val = rnorm(num_sample),
cm = cummean(val),
cm2 = cumsum(val) / seq_along(val)
)
}
)
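A variant that reshuffles an existing val column, rather than drawing new values in each iteration, could look like the following base R sketch. The names n_iter, shuffled and out are assumptions made for the example, and the cumulative mean is computed as cumsum divided by the running count.

```r
set.seed(1)
# Example data in the same shape as the question's df
df <- data.frame(ID = 1:100, val = rnorm(100))

n_iter <- 100
out <- do.call(rbind, lapply(seq_len(n_iter), function(i) {
  shuffled <- sample(df$val)                       # random reordering of val
  data.frame(
    iteration = i,                                 # which randomization this is
    n = seq_along(shuffled),                       # values included so far
    val = shuffled,
    cm = cumsum(shuffled) / seq_along(shuffled)    # cumulative mean
  )
}))

head(out)
```

Because every iteration uses the same 100 values in a different order, the final cumulative mean of each iteration equals mean(df$val).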
You can mutate to create 100 samples then call cummean:
library(dplyr)
library(purrr)
df %>% mutate(map_dfc(1:100, ~cummean(sample(val))))
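We may use rerun from purrr. (Note that rerun() is deprecated as of purrr 1.0.0; map(seq_len(n), ~ f1(df, val)) produces the same kind of list.)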
library(dplyr)
library(purrr)
f1 <- function(dat, valcol) {
dat %>%
sample_n(size = n()) %>%
mutate(cm = cummean({{valcol}}))
}
n <- 100
out <- rerun(n, f1(df, val))
The output of rerun is a list. If we need a new column that identifies each element, we can bind the elements together with bind_rows and its .id argument:
out1 <- bind_rows(out, .id = 'ID')
> head(out1)
ID val cm
1 1 0.3376980 0.33769804
2 1 -1.5699384 -0.61612019
3 1 1.3387892 0.03551628
4 1 0.2409634 0.08687807
5 1 0.7373232 0.21696708
6 1 -0.8012491 0.04726439
I am trying to write a function that takes a data frame and a string to rename a variable in the data frame as the string. If I have the following data frame df and variable x:
var1 var2
A 1 1
B 1 1
x = "new_var"
I want to create a function that will take the data frame and string as arguments and return the following data frame:
var1 new_var
A 1 1
B 1 1
My initial thought was to create something like the following:
dplyr_f <- function(df, x){
new_df <- df %>%
rename(x = var2)
return(new_df)
}
dplyr_f(df, x)
But this returns:
var1 x
A 1 1
B 1 1
instead of the second column being named new_var.
Is there any way I can make rename() use the value of a variable like x instead of it thinking I literally want to rename the variable "x"? Or maybe there's a better solution?
I would be piping the output into other dplyr functions, so I am looking for a dplyr-based solution. Thanks!
What you need to do is use the bang-bang operator, i.e. !!, together with := so that the left-hand side can be unquoted:
dplyr_f <- function(df, x){
new_df <- df %>%
rename(!!x := var2)
return(new_df)
}
dplyr_f(df, x)
var1 new_var
A 1 1
B 1 1
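Two other ways to rename from a string are sketched below, as alternatives rather than as the method of the answer above: passing a named character vector through all_of(), and assigning to names() in base R. The objects lookup, renamed_dplyr and renamed_base are hypothetical names for this example.

```r
library(dplyr)

df <- data.frame(var1 = c(1, 1), var2 = c(1, 1), row.names = c("A", "B"))
x <- "new_var"

# dplyr: a named character vector of the form c(new_name = "old_name")
lookup <- setNames("var2", x)
renamed_dplyr <- df %>% rename(all_of(lookup))

# base R: overwrite the matching entry of names()
renamed_base <- df
names(renamed_base)[names(renamed_base) == "var2"] <- x

names(renamed_dplyr)
names(renamed_base)
```

The dplyr version stays pipeable, which matters if the result feeds into further dplyr verbs.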
I have a large dataset in which I would like to use dplyr and filter and select the data to create 12 separate dataframes.
Essentially, I am using only two columns of data from a larger dataset. The first column is "plot", where I filter by "plot" number and another condition in another 3rd column ("pos_ID"). I want to create a loop that filters by plot number (I tried plot==[i]) and the 3rd condition, and then creates a new dataframe. The loop would repeat 12 times (because plot spans from 1-12).
Here is the code that I used without a loop (based on sample data)
p1_Germ <- data %>% #p1 stands for plot 1
filter(plot==1, pos_ID<21) %>%
select(germ_bin)
Here is the code that I tried to incorporate a loop (based on sample data)
for(i in seq_along(plot)) {
data %>%
group_by(plot[[i]], pos_ID<21) %>%
select(germ_bin)
}
Here is some sample data
plot <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12)
germ_bin <- c(0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,1,0,1,0)
pos_ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
dataset <- data.frame(plot, germ_bin, pos_ID)
dataset
My guess is to use a list, but I'm not familiar with loops and lists and could not find a solution online. I need to create 12 dataframes because I'm trying to convert each of them into a matrix afterwards for another function. Any help would be much appreciated!
We can use group_split and map to filter based on the criteria and get a list of data frames.
library(dplyr)
library(purrr)
dataset %>%
group_split(plot) %>%
map(. %>% filter(pos_ID < 21) %>% select(germ_bin))
#[[1]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 0
#2 0
#[[2]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 1
#2 0
#[[3]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 1
#2 0
#....
For the shared example, if you want to drop empty groups you can filter first
dataset %>%
filter(pos_ID < 21) %>%
group_split(plot) %>%
map(. %>% select(germ_bin))
As far as your attempt with a for loop is concerned, you can correct it by doing
unique_plot <- unique(dataset$plot)
plot_list <- vector("list", length(unique_plot))
for(i in seq_along(unique_plot)) {
plot_list[[i]] <- dataset %>%
filter(plot == unique_plot[i], pos_ID<21) %>%
select(germ_bin)
}
Or keeping it completely in base R
lapply(split(dataset, dataset$plot), function(x)
subset(x, pos_ID < 21, select = germ_bin, drop = FALSE))
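Since the stated goal is one matrix per plot, the split result can be converted in the same pass. The following is a minimal base R sketch on the sample data; kept and mats are names made up for the example, and it filters before splitting, so plots with no rows left are dropped.

```r
# Sample data from the question
plot <- rep(1:12, each = 2)
germ_bin <- c(0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,1,0,1,0)
pos_ID <- 1:24
dataset <- data.frame(plot, germ_bin, pos_ID)

# Keep only the rows of interest, then split the germ_bin column by plot
kept <- dataset[dataset$pos_ID < 21, ]
mats <- lapply(split(kept["germ_bin"], kept$plot), as.matrix)

names(mats)   # one entry per remaining plot
mats[["3"]]   # a 2 x 1 matrix for plot 3
```

Each list element is named after its plot number, so a given plot's matrix can be looked up by name.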