Nested for loop to functions and lapply - r

I am trying to rewrite a nested for loop as functions plus the code that calls them. The code below works as it is, with the for loops inside my function. But I am trying to avoid for loops within my function and use lapply instead. How can I create the functions and their respective calling code using lapply?
Code with for loop:
df <- data.frame(actual=c("reaok_oc giade_len","reaok_oc giade_len reaok_oc giade_len"),
Predicted = c("giade_len","reaok_oc giade_len reaok_oc giade_len"))
df[] <- lapply(df, as.character)
str(df)
all_acc<-NULL
for(s in 1:nrow(df)){
sub_df1<-df[s,]
actual_words<-unlist(strsplit(sub_df1$actual," "))
all_count<-0
for(g in 1:length(actual_words)){
count_len<-ifelse(grep(actual_words[g],sub_df1$Predicted),1,0)
all_count<-sum(all_count,count_len)
}
sub_acc<-all_count/length(actual_words)
all_acc<-c(all_acc,sub_acc)
}
df$trans_acc<-all_acc
sensitivity=sum(df$trans_acc)/nrow(df)
sensitivity
Here is the non-working code using lapply call code to functions:
a1 <- function(df){
sub_df1<-df[s,]
actual_words<-unlist(strsplit(sub_df1$actual," "))
all_count<-0
}
a2 <- function(df){
count_len<-ifelse(grep(actual_words[g],sub_df1$Predicted),1,0)
all_count<-sum(all_count,count_len)
sub_acc<-all_count/length(actual_words)
all_acc<-c(all_acc,sub_acc)
df$trans_acc<-all_acc
sensitivity=sum(df$trans_acc)/nrow(df)
sensitivity
}
lapply(1:nrow(df) FUN = a1, lapply(1:length(actual_words) FUN = a2, actual_words,sub_aa1))

In base R it is usually best to find solutions that are 'vectorized' (only one R function call) rather than 'iterated' (one call for each element). So for instance
for(s in 1:nrow(df)){
sub_df1<-df[s,]
actual_words<-unlist(strsplit(sub_df1$actual," "))
...
involves nrow(df) calls to strsplit(), but
actual <- strsplit(df$actual, " ")
involves just one but performs the same transformation.
I think also that when you say
for(g in 1:length(actual_words)){
count_len<-ifelse(grep(actual_words[g],sub_df1$Predicted),1,0)
all_count<-sum(all_count,count_len)
}
really you are just looking for exact matches between actual words and predicted words. So you could split the predicted words
predicted <- strsplit(df$Predicted, " ")
and calculate sum(actual[[1]] %in% predicted[[1]]), and so on. Write this as a function
actual_in_predicted <- function(actual, predicted) {
sum(actual %in% predicted)
}
A 'for' loop might iterate over each element of actual and predicted
all_count <- integer()
for (i in 1:nrow(df))
all_count[[i]] <- actual_in_predicted(actual[[i]], predicted[[i]])
but it's better to use mapply() to iterate over each element of actual and predicted
all_count <- mapply(actual_in_predicted, actual, predicted)
Your variable all_acc is this vector of numbers divided by the number of actual words in each comparison
all_acc <- all_count / lengths(actual)
The complete revised code uses a function to compare actual and predicted words in each row, and uses mapply() to iterate over the rows.
actual_in_predicted <- function(actual, predicted) {
sum(actual %in% predicted)
}
actual <- strsplit(df$actual, " ")
predicted <- strsplit(df$Predicted, " ")
all_count <- mapply(actual_in_predicted, actual, predicted)
all_acc <- all_count / lengths(actual)
df$trans_acc <- all_acc
sensitivity <- sum(df$trans_acc) / nrow(df)
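If the goal is a single function per row plus one line of calling code, the steps above can be folded together roughly as follows. This is only a sketch: word_accuracy is a hypothetical helper name, and it assumes exact word matching with %in% is what is wanted (the original grep() call does pattern matching instead, which can differ when one word is a substring of another).

```r
df <- data.frame(
  actual = c("reaok_oc giade_len", "reaok_oc giade_len reaok_oc giade_len"),
  Predicted = c("giade_len", "reaok_oc giade_len reaok_oc giade_len"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: share of the words in `actual` that also occur in `predicted`
word_accuracy <- function(actual, predicted) {
  actual_words <- strsplit(actual, " ", fixed = TRUE)[[1]]
  predicted_words <- strsplit(predicted, " ", fixed = TRUE)[[1]]
  mean(actual_words %in% predicted_words)
}

# One call per row, replacing the outer for loop
df$trans_acc <- mapply(word_accuracy, df$actual, df$Predicted, USE.NAMES = FALSE)
sensitivity <- mean(df$trans_acc)
sensitivity
```

The inner loop disappears because %in% already compares every actual word against the predicted words in one call.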

Perhaps, we can use separate_rows
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(actual, sep="_") %>%
summarise(perc = mean(str_detect(Predicted, actual)))
# perc
#1 0.75
It can be wrapped into a function
f1 <- function(data, act, pred) {
data %>%
separate_rows({{act}}, sep="_") %>%
summarise(perc = mean(str_detect({{pred}}, {{act}})))
}
f1(df, actual, Predicted)
# perc
#1 0.75

Related

How to write function to return plot but also an object (data.frame)?

Description and goal: In RStudio, I would like to define a function that drops columns of a given data.frame if they contain too high a share of missing values, defined by a cutoff value in percent. This function should return information about the subsetted data.frame (number of remaining columns and remaining share of missing cases) together with the subsetted data.frame itself for further analyses. Additionally, there should be an option to visualize remaining missing cases using the function vis_miss() from the visdat package.
Packages used:
library(tidyverse)
library(visdat) # provides vis_miss()
Data:
my.data <- tibble(col_1 = c(1:5),
col_2 = c(1,2,NA,NA,NA))
My function:
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
cat(paste0("Remaining cols: ", ncol(df)),
paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
if (vis_miss==TRUE) {return(vis_miss(df[1:nrow(df),c(1:ncol(df))], warn_large_data=F))}
df
}
Test:
cut_cols(my.data, 0.5, vis_miss = F) # without visualization
cut_cols(my.data, 0.5, vis_miss = T) # with visualization
Problem:
As you might have already seen in the example above, only the first line, where vis_miss = F, actually returns the data.frame, but not the second line, where vis_miss = T. I assume that this is because of the extra if () {} clause, which returns a plot and then ends the function without printing df. Is there a way to prevent this from happening so that the second line also returns the new data.frame?
You were correct in your suspicion that the if(){} clause was stopping the df from being printed. return() exits the function immediately, so nothing after it runs; that is why it usually belongs at the end of a function.
Further, use print(df) to make sure your function outputs your data frame. Note that the function then prints the data frame but does not return it. Here are a few changes to your code
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
cat(paste0("Remaining cols: ", ncol(df)),
paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
print(df)
if (vis_miss==TRUE) {return(vis_miss(df[1:nrow(df),c(1:ncol(df))], warn_large_data=F))}
}
cut_cols(my.data, 0.5, vis_miss = T)
Here's another option if it interests you. You can assign both the df and the plot to a list then call the list.
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
cat(paste0("Remaining cols: ", ncol(df)),
paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
# empty list
list_ <- list()
# assign df to first index of list
list_[[1]] <- df
if (vis_miss==TRUE){
plot <- vis_miss(df[1:nrow(df),c(1:ncol(df))], warn_large_data=F)
# assign plot to second index in list
list_[[2]] <- plot
}
return(list_)
}
output <- cut_cols(my.data, 0.5, vis_miss = T)
Calling output will print both the df and plot. output[[1]] will print just the df. output[[2]] will print just the plot.
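A named list can make the returned pieces easier to pick out than positional indices. The following is a minimal sketch under some assumptions: cut_cols_list and the element names data, remaining_cols and remaining_miss_pct are made up for illustration, and the plot is left out so the example runs without any extra package (a plot object could be stored as one more list element in the same way).

```r
# Hypothetical variant of cut_cols() that returns a named list instead of printing
cut_cols_list <- function(df, na.perc.cutoff) {
  # share of missing values per column
  na_share <- vapply(df, function(x) mean(is.na(x)), numeric(1))
  kept <- df[, na_share < na.perc.cutoff, drop = FALSE]
  list(
    data = kept,
    remaining_cols = ncol(kept),
    remaining_miss_pct = round(mean(is.na(kept)) * 100, 2)
  )
}

my.data <- data.frame(col_1 = 1:5, col_2 = c(1, 2, NA, NA, NA))
res <- cut_cols_list(my.data, 0.5)
res$remaining_cols  # number of columns that survived the cutoff
res$data            # the subsetted data frame for further analysis
```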

replacing nested for loop with lapply()

I have seen this example, which explains how to replace a nested for loop with the lapply() function. However, I do not fully understand what is happening in the nested for loop.
According to my understanding, for every country and every year the for loop creates two new variables called tempX and tempY, but what happens in the last line of the body of the for loop?
What is the purpose of variable1 and variable2?
# Generate random data:
allCountries <- LETTERS[1:10]
allYears <- 1990:2012
myData <- expand.grid(allCountries, allYears) # create a dataframe with all possible combinations
colnames(myData) <- c("Country", "Year")
myData$variable1 <- rnorm(nrow(myData))
myData$variable2 <- rnorm(nrow(myData))
# Silly function to perform
myFunction <- function(x, y){
x * y - x / y
}
### Doubly-nested loop ###
myData$computedFigure <- NA # Make an "empty" variable in my data.frame
for(ii in allCountries){
for(jj in allYears){
tempX <- myData[myData$Country == ii & myData$Year == jj, c("variable1")]
tempY <- myData[myData$Country == ii & myData$Year == jj, c("variable2")]
# "Save" results into appropriate location in my data.frame
myData[myData$Country == ii & myData$Year == jj, c("computedFigure")] <- myFunction(tempX, tempY)
}
}
### Simple lapply() approach ###
computedFigureList <- lapply(1:nrow(myData), function(x){
tempX <- myData[x, c("variable1")]
tempY <- myData[x, c("variable2")]
# "Save" results into appropriate location in my data.frame
myFunction(tempX, tempY)
})
myData$computedFigure2 <- unlist(computedFigureList)
with(myData, plot(computedFigure, computedFigure2))
In the last line of the loop,
myData[myData$Country == ii & myData$Year == jj, c("computedFigure")] <- myFunction(tempX, tempY)
myFunction is applied to tempX and tempY and the result is stored in the computedFigure column of the matching row.
variable1 and variable2 are just random numbers that serve as the x and y inputs of myFunction.
The for loops walk through every combination of country and year.
The two approaches (for loops and lapply) do not return the same kind of object. lapply returns a list holding only the results of myFunction, which is why it is unlisted before being stored in computedFigure2. The for loops write the results directly into the data frame.
You don't actually need nested *apply functions. You can use outer + diag to compute computedFigure, which gives the same results as the nested for loop. Note that outer() builds the full nrow(myData) x nrow(myData) matrix just to keep its diagonal.
myData$computedFigure <- diag(with(myData,outer(variable1,variable2,myFunction)))
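Since myFunction only uses arithmetic operators, it is already vectorized, so it can also be called once on the two whole columns. The sketch below regenerates the example data with an arbitrary seed (an assumption, the original sets none) and checks the direct call against a row-by-row lapply().

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
myData <- expand.grid(Country = LETTERS[1:10], Year = 1990:2012)
myData$variable1 <- rnorm(nrow(myData))
myData$variable2 <- rnorm(nrow(myData))

myFunction <- function(x, y) {
  x * y - x / y
}

# One vectorized call over both columns, no loop and no outer()
vectorised <- with(myData, myFunction(variable1, variable2))

# Row-by-row version for comparison
via_lapply <- unlist(lapply(seq_len(nrow(myData)), function(i) {
  myFunction(myData$variable1[i], myData$variable2[i])
}))

all.equal(vectorised, via_lapply)
```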

Loop through df and create new df in R

I have a df (10 rows, 15 columns)
df<-data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
I want to loop over each column, do something to each row and create a new df with the answer.
I actually want to do a linear regression on each column. I get back a list for each column. For example I have a second df with what I want to put into the lm:
df2<-data.frame(replicate(2,sample(0:1,10,rep=TRUE)))
I then want to do something like:
new_df <- data.frame()
for (i in 1:ncol(df)){
j<-lm(df[,i] ~ df2$X1 + df2$X2)
temp_df<-j$residuals
new_df[,i]<-cbind(new_df,temp_df)
}
I get the error:
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 0, 8
I have checked other similar posts but they always seem to involve a function or something similarly complex for a newbie like me. Please help
This can be done without loops but for your understanding, using loops we can do
new_df <- df
for (i in names(df)) {
j<-lm(df[,i] ~ df2$X1 + df2$X2) # predictors come from df2, as in the question
new_df[i] <- j$residuals
}
new_df starts out as an empty data frame with 0 rows and 0 columns, so assigning a column of residuals to it fails. Instead, initialise new_df as a copy of df, since both share the same structure, and then overwrite each column in the loop as above.
Update
Based on the new example
lst1 <- lapply(names(df), function(nm) {
# df2 also has columns named X1 and X2, so take the predictors from df2 only
lm(df[[nm]] ~ X1 + X2, data = df2)$residuals})
out <- setNames(data.frame(lst1), names(df))
Also, this doesn't need any loop
out2 <- lm(as.matrix(df) ~ X1 + X2, data = df2)$residuals
Old
We can do this easily without any loop
new_df <- df + 10
---
If we need a loop, it can be done with `lapply`
new_df <- df
new_df[] <- lapply(df, function(x) x + 10)
---
Or with a `for` loop
lst1 <- vector('list', ncol(df))
for(i in seq_along(df)) lst1[[i]] <- df[, i] + 10
new_df <- as.data.frame(lst1)
data
set.seed(24)
df <- data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
df2 <- data.frame(replicate(2,sample(0:1,10,rep=TRUE)))
I would do as suggested by akrun. But if you do need (or want) to loop for some reasons you can use:
df<-data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
new_df <- data.frame(replicate(15, rep(NA, 10)))
for (i in 1:ncol(df)){
new_df[ ,i] <- df[ , i] + 10
}
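For the regression case from the question, a self-contained sketch with lapply() might look like this. The predictor columns of df2 are renamed to P1 and P2 here, which is an assumption made purely to avoid the name clash with the X1 and X2 columns of df.

```r
set.seed(24)
df <- data.frame(replicate(15, sample(0:1, 10, replace = TRUE)))
df2 <- data.frame(replicate(2, sample(0:1, 10, replace = TRUE)))
names(df2) <- c("P1", "P2")  # hypothetical names, chosen to avoid clashing with df

# Fit one model per column of df and keep only the residuals
res_list <- lapply(df, function(y) residuals(lm(y ~ P1 + P2, data = df2)))

# lapply() over a data frame keeps the column names, so this restores the shape of df
new_df <- as.data.frame(res_list)
dim(new_df)
```

Because each model includes an intercept, the residuals in every column sum to zero up to rounding error, which gives a quick sanity check.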

Compute p-values across all columns of (possibly large) matrices in R

Is there a more efficient/faster way to compare two matrices (column by column) and to compute p-values using a t-test for no difference in means (eventually switching to chisq.test when necessary)?
Here is my solution:
## generate fake data (e.g., from treatment and control data)
z0 <- matrix(rnorm(100),10,10)
z1 <- matrix(rnorm(100, mean=1.1, sd=2),10,10)
## function to compare columns (bloody for loop)
compare.matrix <- function(z0, z1){
pval <- numeric(ncol(z0)) ## initialize
for(i in 1:ncol(z0)){ ## compare columns
pval[i] <- t.test(z1[, i], z0[, i])$p.value
## if var is categorical, switch test type
if ( length(unique(z1[,i]))==2){
index <- c(rep(0, nrow(z0)), rep(1, nrow(z1)))
xx <- c(z0[,i], z1[,i])
pval[i] <- chisq.test(table(xx, index), simulate.p.value=TRUE)$p.value
}
}
return(pval)
}
compare.matrix(z0, z1)
Here's one way using reshape2 and dplyr. It would probably be better to combine the melt and merge steps into a single step if you've got large matrices, but I separated them for clarity. I think the chi-squared case would be a fairly simple extension.
library(reshape2) # provides melt()
# keep Var1 (row index) as well, so that merge() pairs each cell once instead of crossing all rows within a column
z0_melt = melt(z0, value.name='z0')
z1_melt = melt(z1, value.name='z1')
all_df = merge(z0_melt, z1_melt) # joins on Var1 and Var2
library(dplyr)
all_df %>%
group_by(Var2) %>%
summarize(p = t.test(z0, z1)$p.value)
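In base R the explicit loop can also be replaced with vapply() over the column indices. This is a sketch of the t-test part only (the chi-squared switch is left out), with an arbitrary seed added for reproducibility.

```r
set.seed(42)  # arbitrary seed, not part of the original example
z0 <- matrix(rnorm(100), 10, 10)
z1 <- matrix(rnorm(100, mean = 1.1, sd = 2), 10, 10)

# One p-value per column; vapply() guarantees a numeric vector of the right length
pvals <- vapply(seq_len(ncol(z0)), function(i) {
  t.test(z1[, i], z0[, i])$p.value
}, numeric(1))

# Reference: the same computation with an explicit for loop
loop_pvals <- numeric(ncol(z0))
for (i in seq_len(ncol(z0))) {
  loop_pvals[i] <- t.test(z1[, i], z0[, i])$p.value
}

all.equal(pvals, loop_pvals)
```

This mostly tidies the code rather than speeding it up, since t.test() is still called once per column.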

Clip outliers in columns in df2,3,4... based on quantiles from columns in df.tr

I am trying to replace the "outliers" in each column of a dataframe with Nth percentile.
n <- 1000
set.seed(1234)
df <- data.frame(a=runif(n), b=rnorm(n), c=rpois(n,1))
df.t1 <- as.data.frame(lapply(df, function(x) { q <- quantile(x,.9,names=F); x[x>q] <- q; x }))
I need the computed quantiles to truncate other dataframes. For example, I compute these quantiles on a training dataset and apply it; I want to use those same thresholds in several test datasets. Here's an alternative approach which allows that.
q.df <- sapply(df, function(x) quantile(x,.9,names=F))
df.tmp <- rbind(q.df, df.t1)
df.t2 <- as.data.frame(lapply(df.tmp, function(x) { x[x>x[1]] <- x[1]; x }))
df.t2 <- df.t2[-1,]
rownames(df.t2) <- NULL
identical(df.t1, df.t2)
The dataframes are very large and hence I would prefer not to use rbind and then delete the row later. Is it possible to truncate the columns in the dataframes using q.df but without having to rbind? Thx.
So just write a function that computes the quantiles once, then applies the clipping directly to each column. You don't need the conditional assignment inside your lapply call; ifelse already returns a vectorized result for the entire column. ifelse is your friend, for vectorization.
# Make up some dummy df2 output (it's supposed to have 1000 cols really)
df2 <- data.frame(d=runif(1000), e=rnorm(1000), f=runif(1000))
require(plyr)
print(colwise(summary)(df2)) # show the summary before we clamp...
# Compute quantiles on df1...
df1 <- df
df1.quantiles <- apply(df1, 2, function(x, prob=0.9) { quantile(x, prob, names=F) })
# ...now clamp by sweeping col-index across both quantile vector, and df2 cols
clamp <- function(x, xmax) { ifelse(x<=xmax, x, xmax) }
for (j in 1:ncol(df2)) {
df2[,j] <- clamp(df2[,j], df1.quantiles[j]) # don't know how to use apply(...,2,)
}
print(colwise(summary)(df2)) # show the summary after we clamp...
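The for loop over columns can also be written with Map() and pmin(), which pairs each column with its own threshold. This is a sketch using made-up train and test data frames with matching columns (an assumption, since the original df2 has differently named columns); caps is a hypothetical name for the vector of 90th percentiles.

```r
set.seed(1234)
n <- 1000
train <- data.frame(a = runif(n), b = rnorm(n), c = rpois(n, 1))
test  <- data.frame(a = runif(n), b = rnorm(n), c = rpois(n, 1))

# 90th percentile of each training column, computed once
caps <- vapply(train, quantile, numeric(1), probs = 0.9, names = FALSE)

# Clip each test column at the matching training threshold, no rbind needed
clipped <- test
clipped[] <- Map(pmin, test, caps)

sapply(clipped, max)  # no column maximum exceeds its cap
```

The same caps vector can be reused for any number of further data frames, as long as their columns are in the same order as in train.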
Reference:
[1] "Clip values between a minimum and maximum allowed value in R"
