I am deploying a Shiny App on Heroku.
However, the buildpack does not support rowr, as its version of R is not compatible with this package.
How can I replace rowr::cbind.fill with base R or dplyr functions?
I tried to see what was in the function, but it is not clear to me:
function (..., fill = NULL)
{
    inputs <- list(...)
    inputs <- lapply(inputs, vert)
    maxlength <- max(unlist(lapply(inputs, len)))
    bufferedInputs <- lapply(inputs, buffer, length.out = maxlength,
        fill, preserveClass = FALSE)
    return(Reduce(cbind.data.frame, bufferedInputs))
}
Another way of asking this question: is there a dplyr solution to the following question, where I need fill = NA?
cbind a vector of different length to a dataframe
Here is a solution that detects the number of missing elements in every column, pads each column with NAs, and binds them together. I assume that the columns are stored in a list:
# test data
test_data <- list(c1 = rep(1, 5),
                  c2 = rep(2, 7),
                  c3 = rep(3, 4))

# find the length of the longest column
longest_column <- max(unlist(lapply(test_data, length)))

# calculate how many NAs need to be added to each column
number_append <- lapply(test_data, function(x) longest_column - length(x))

# pad the columns with NAs
data_appended <- lapply(seq_len(length(test_data)), function(i) {
  c(test_data[[i]], rep(NA, number_append[[i]]))
})

# combine the columns
data_combined <- do.call("cbind", data_appended)
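Wrapping those steps into a small helper gives a base-R stand-in for rowr::cbind.fill(..., fill = NA). This is only a sketch for plain vectors, as in the list above; cbind_fill is a made-up name, and a data.frame input would first need to be split into its columns:

# pad every vector to the length of the longest one, then bind into a data.frame
cbind_fill <- function(..., fill = NA) {
  inputs <- list(...)
  n <- max(lengths(inputs))
  padded <- lapply(inputs, function(x) c(x, rep(fill, n - length(x))))
  do.call(cbind.data.frame, padded)
}

cbind_fill(c1 = rep(1, 5), c2 = rep(2, 7), c3 = rep(3, 4))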
Description and goal: In RStudio, I would like to define a function that drops columns of a given data.frame if they contain too high a share of missing values, defined by a cutoff value in percent. This function should return information about the subsetted data.frame (number of remaining columns and remaining share of missing cases) together with the subsetted data.frame itself for further analyses. Additionally, there should be an option to visualize the remaining missing cases using the function vis_miss() from the visdat package.
Packages used:
library(tidyverse)
library(visdat) # vis_miss() comes from the visdat package
Data:
my.data <- tibble(col_1 = c(1:5),
                  col_2 = c(1, 2, NA, NA, NA))
My function:
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
  df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
  cat(paste0("Remaining cols: ", ncol(df)),
      paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
  if (vis_miss==TRUE) {return(vis_miss(df[1:nrow(df), c(1:ncol(df))], warn_large_data=F))}
  df
}
Test:
cut_cols(my.data, 0.5, vis_miss = F) # without visualization
cut_cols(my.data, 0.5, vis_miss = T) # with visualization
Problem:
As you might have already seen in the example above, only the first call, where vis_miss = F, actually returns the data.frame; the second call, where vis_miss = T, does not. I assume that this is because of the extra if () {} clause, which returns a plot and then ends the function without printing df. Is there a way to prevent this from happening, so that the second call also returns the new data.frame?
You were correct in your suspicion that the if () {} clause was stopping the df from printing: return() exits a function immediately, so nothing after it runs, which is why it is best placed at the end of a function.
Further, use print(df) to make sure your function outputs your data frame. Here are a few changes to your code:
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
  df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
  cat(paste0("Remaining cols: ", ncol(df)),
      paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
  print(df)
  if (vis_miss==TRUE) {return(vis_miss(df[1:nrow(df), c(1:ncol(df))], warn_large_data=F))}
}
cut_cols(my.data, 0.5, vis_miss = T)
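A small variation on the above (a sketch, not part of the original answer; cut_cols_v2 is a hypothetical name): draw the plot inside the function as a side effect and return the data frame invisibly, so the result can still be assigned. Here sapply and mean(is.na(x)) are just shorthand for the lapply/sum construction above:

cut_cols_v2 <- function(df, na.perc.cutoff, vis_miss = FALSE) {
  df <- df[sapply(df, function(x) mean(is.na(x))) < na.perc.cutoff]
  if (vis_miss) print(vis_miss(df, warn_large_data = FALSE))  # draw the plot as a side effect
  invisible(df)                                               # still returned, just not auto-printed
}

new_df <- cut_cols_v2(my.data, 0.5, vis_miss = TRUE)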
Here's another option if it interests you. You can assign both the df and the plot to a list and then return the list.
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
  df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
  cat(paste0("Remaining cols: ", ncol(df)),
      paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
  # empty list
  list_ <- list()
  # assign df to the first index of the list
  list_[[1]] <- df
  if (vis_miss==TRUE) {
    plot <- vis_miss(df[1:nrow(df), c(1:ncol(df))], warn_large_data=F)
    # assign the plot to the second index of the list
    list_[[2]] <- plot
  }
  return(list_)
}
output <- cut_cols(my.data, 0.5, vis_miss = T)
Calling output will print both the df and plot. output[[1]] will print just the df. output[[2]] will print just the plot.
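As a small readability tweak (my suggestion, not part of the answer above), the list can be built with named elements so the pieces are pulled out by name instead of by position:

# inside the function, build the list as:
#   list_ <- list(data = df)
#   if (vis_miss) list_$plot <- vis_miss(df, warn_large_data = F)
output <- cut_cols(my.data, 0.5, vis_miss = T)
output$data   # the subsetted data frame
output$plot   # the vis_miss() plot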
Below, I'm wondering how to use the base R function quantile() separately across the elements in L that are named EFL and ESL.
Note: this is a toy example; L could contain any number of similarly named elements.
foo <- function(X) {
  X <- as.matrix(X)
  tab <- table(row(X), factor(X, levels = sort(unique(as.vector(X)))))
  w <- diag(ncol(tab))
  rosum <- rowSums(tab)
  obs_oc <- tab * (t(w %*% t(tab)) - 1)
  obs_c <- colSums(obs_oc)
  max_oc <- tab * (rosum - 1)
  max_c <- colSums(max_oc)
  SA <- obs_c / max_c
  h <- names(SA)
  h[is.na(h)] <- "NA"
  setNames(SA, h)
}
DAT <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/X.csv", row.names = 1)
L <- replicate(50, foo(DAT[sample(1:nrow(DAT), replace = TRUE),]), simplify = FALSE)
# How to use `quantile()` separately across all similarly named elements (e.g., EFL, ESL) in `L[[i]]`, i = 1, ..., 50?
# quantile(all EFL elements across `L`)
# quantile(all ESL elements across `L`)
In a previous solution I used do.call to rbind the list elements into a matrix and then calculated the quantile over each column of the resulting data.frame (one column per named element):
sapply(as.data.frame(do.call(rbind, L)), quantile)
However, when an element is missing from some replicates, that approach does not take it into account. To line the values up correctly, the missing entries need to be filled. I used data.table's rbindlist (you could also use plyr::rbind.fill) with fill = TRUE to fill the missing values. It requires each element to be a data.frame/table/list, so I converted each one to a data.frame; before doing so, you need to transpose (t()) the data so that the values line up by element name. It could be written in a single line, but it is easier to read what is happening in multiple lines.
L2 = lapply(L, function(x){as.data.frame(t(x))})
df = data.table::rbindlist(L2, fill=TRUE) # or plyr::rbind.fill(L2)
sapply(df, quantile, na.rm = TRUE)
You can also use purrr::transpose:
Lt <- purrr::transpose(L)
quantile(unlist(Lt$EFL),.8)
quantile(unlist(Lt$ESL),.8)
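Building on the transpose idea, the full default quantile summary for every named element can be computed in one go (a sketch; it assumes each L[[i]] carries the same element names, such as EFL and ESL):

Lt <- purrr::transpose(L)
# one column of quantiles per named element
sapply(Lt, function(x) quantile(unlist(x), na.rm = TRUE))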
I would like to clean up my code a bit and start to use more functions for my everyday computations (where I would normally use for loops). I have an example of a for loop that I would like to turn into a function. The problem I am having is how to step through the constraint vectors without a loop. Here's what I mean:
## represents spectral data
set.seed(11)
df <- data.frame(Sample = 1:100, replicate(1000, sample(0:1000, 100, rep = TRUE)))

## feature ranges by column number
frm <- c(438, 563, 953, 963)
to <- c(548, 803, 1000, 993)
nm <- c("WL890", "WL1080", "WL1400", "WL1375")

WL.ps <- list()
for (i in 1:length(frm)) {
  ## finds the minimum value within the range constraints and returns the corresponding column name
  WL <- colnames(df[frm[i]:to[i]])[apply(df[frm[i]:to[i]], 1, which.min)]
  WL.ps[[i]] <- WL
}

new.df <- data.frame(WL.ps)
colnames(new.df) <- nm
The part where I iterate through the frm and to vector values is what I'm having trouble with. How does one go from frm[1] to frm[2], and so on, in a function (apply or otherwise)?
Any advice would be greatly appreciated.
Thank you.
You could write a function which returns the column name of the minimum value in each row for a particular range of columns. I have used max.col instead of apply(df, 1, which.min) to get the per-row minimum, since max.col is more efficient than apply.
apply_fun <- function(data, x, y) {
  cols <- x:y
  names(data[cols])[max.col(-data[cols])]
}
Apply this function using Map:
WL.ps <- Map(apply_fun, frm, to, MoreArgs = list(data = df))
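To finish the way the original loop did, the list returned by Map can be assembled into a data frame with the names from nm (this simply repeats the last two lines of the loop version):

new.df <- data.frame(WL.ps)
colnames(new.df) <- nm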
Reproducible example:
df1 <- data.frame(X = c(0:9), Y = c(10:19))
df2 <- data.frame(X = c(0:9), Y = c(10:19))
df3 <- data.frame(X = c(0:9), Y = c(10:19))
list_of_df <- list(A = df1, B = df2, C = df3)
list_of_df
I'm trying to apply the rollmean function from zoo to every 'Y' column in this list of dataframes.
I've tried lapply with no success. It seems that no matter which way I spin it, there is no way to get around specifying the dataframe you want to apply it to at some point.
This works for one of the dataframes:
roll_mean <- rollmean(list_of_df$A, 2)
roll_mean
Obviously, this doesn't work:
roll_mean1 <- rollmean(list_of_df, 2)
roll_mean1
I also tried subsetting first (which may not be necessary):
Sub1 <- lapply(list_of_df, "[", 2)
roll_mean1 <- rollmean(Sub1, 2)
roll_mean1
There doesn't seem to be a way to do it without having to specify the particular dataframe in the rollmean function:
lapply(list_of_df, function(x) rollmean(list_of_df, 2))
A for loop? Also no success:
for (i in list_of_df) {
  roll_mean1 <- rollmean(Sub1, 2)
}
Stating the obvious, but I'm very new to coding in general and would appreciate some pointers.
It has also occurred to me that even if it did work, the averaged column would be one value shorter than the rest of the dataframe; how would I get around that?
The question at one point says that it wants to perform the rollmean only on Y, but at another point says that roll_mean <- rollmean(list_of_df$A, 2) works, and that does all columns.
1) Assuming that you want to apply rollmean to all columns:
Use lapply like this:
lapply(list_of_df, rollmean, 2)
This also works:
for(i in seq_along(list_of_df)) list_of_df[[i]] <- rollmean(list_of_df[[i]], 2)
2) If you only want to apply it to the Y column:
lapply(list_of_df, transform, Y = rollmean(Y, 2, fill = NA))
or
for(i in seq_along(list_of_df)) {
  list_of_df[[i]]$Y <- rollmean(list_of_df[[i]]$Y, 2, fill = NA)
}
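The length concern raised in the question is exactly what fill = NA handles: a width-2 rolling mean has one value fewer than its input, and the fill pads the result back to the original length so it fits into the data frame. A quick check (assuming zoo is loaded):

library(zoo)
length(rollmean(df1$Y, 2))             # 9 values for a 10-row column
length(rollmean(df1$Y, 2, fill = NA))  # 10 values, padded with NA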
I am struggling with creating multiple columns in a DRY way.
I have searched Google and Stack Exchange and I am still struggling with the problem below.
df <- data.frame(red = 1:10, blue=seq(1,30,3))
myfunction <- function(x) {
  log(x) + 10
}
df$green <- myfunction(df$red)
df$yellow <- myfunction(df$blue)
My questions are:
How can I create the green and yellow columns using a for loop?
How can I create the green and yellow columns using an apply function?
I've spent a bit of time working on these kinds of things. Most of the time you're going to want to either know all the names of the new variables or have them follow an orderly pattern, so you can just paste together the names with an indexing variable.
df <- data.frame(red = 1:10, blue=seq(1,30,3))
myfunction <- function(x) {
  log(x) + 10
}
newcols = apply(X = df, MARGIN = 2, FUN = myfunction)
colnames(newcols) = c("green", "yellow")
df = cbind(df, newcols)
# Alternative
df <- data.frame(red = 1:10, blue=seq(1,30,3))
colors = c("green", "yellow")
for(i in 1:length(colors)){
  df[,colors[i]] = myfunction(df[,i])
}
As pointed out by Sotos, apply is slower than lapply. So I believe the optimal solution is:
df[,c("green","yellow")] <- lapply(df, myfunction)