Generalizable function to select and filter a data frame in R using Shiny input

I am building a shiny app. The user will need to be able to reduce the data by selecting variables and filtering on specific values for those variables. I am stuck trying to get a generalizable function that can work based on all possible selections.
Here is an example - I skip the shiny code because I think the problem is with the function:
#sample dataframe
df <- data.frame('date' = c(1, 2, 3, 2, 2, 3, 1),
'time' = c('a', 'b', 'c', 'e', 'b', 'a', 'e'),
'place' = c('A', 'A', 'A', 'H', 'A', 'H', 'H'),
'result' = c('W', 'W', 'L', 'W', 'W', 'L', 'L'))
If the user selected date and result for the date values 1, 2; and the result values W, I would do the following:
out <- df %>%
select(date, result) %>%
filter(date %in% c(1,2)) %>%
filter(result %in% c('W'))
The challenge I am having is that the user can select any combination of variables and values. Using the input$ values from my shiny app, I can get the selected variables into a vector and the selected values into a list, positionally matching the selected variables. For example:
selected_variables <- c('date', 'result')
selected_values <- list(c(1,2), c('W'))
What I think I then need is a generalizable function that will match up the filter calls with the correct variables. Something like:
#function that takes data frame, vector of selected variables, list of vectors of chosen values for each variable
#Returns a reduced table of selected variables, filtered values
table_reducer <- function(df, select_var, filter_values) {
#select the variables
out <- df %>%
select(all_of(select_var))
#now filter each variable by the values contained in the list
out <- [for loop that iterates over select_var and filter_values, filtering accordingly]
out #return out
}
My thinking was to use an equivalent of Python's zip, but all my searching on that just points me to mapply, and I can't see how to use that within the for loop (which I also know is not always approved of in R, but I am talking about a relatively small number of iterations). If there is a better solution to this, I would welcome it.

Here's a 1-liner table_reducer function in base R -
table_reducer <- function(df, select_var, filter_values) {
subset(df, Reduce(`&`, Map(`%in%`, df[select_var], filter_values)))
}
selected_variables <- c('date', 'result')
selected_values <- list(c(1,2), c('W'))
table_reducer(df, selected_variables, selected_values)
# date time place result
#1 1 a A W
#2 2 b A W
#4 2 e H W
#5 2 b A W
Map is a wrapper over mapply so you were right in thinking that you should use mapply for this task. This answer is also free of dreaded for loops.
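Note that the output above still contains every column, since subset() is only given a row condition. If the selected columns should be the only ones returned, as in the dplyr pipeline from the question, a variant along these lines might do it. This is only a sketch: the name table_reducer2, the length check, and the all-TRUE starting value for Reduce are my own additions, not part of the answer above.

```r
# Sample data from the question.
df <- data.frame(date = c(1, 2, 3, 2, 2, 3, 1),
                 time = c('a', 'b', 'c', 'e', 'b', 'a', 'e'),
                 place = c('A', 'A', 'A', 'H', 'A', 'H', 'H'),
                 result = c('W', 'W', 'L', 'W', 'W', 'L', 'L'))

# Hypothetical variant of table_reducer that also drops unselected columns.
table_reducer2 <- function(df, select_var, filter_values) {
  # Variables and value sets are matched by position, so lengths must agree.
  stopifnot(length(select_var) == length(filter_values))
  # One logical vector per selected column: TRUE where the value is allowed.
  keep_list <- Map(`%in%`, df[select_var], filter_values)
  # AND them together; starting from all TRUE keeps every row when nothing is selected.
  keep <- Reduce(`&`, keep_list, rep(TRUE, nrow(df)))
  # Keep matching rows and only the selected columns (always as a data frame).
  df[keep, select_var, drop = FALSE]
}

out <- table_reducer2(df, c('date', 'result'), list(c(1, 2), c('W')))
```

Here out should hold rows 1, 2, 4 and 5 with just the date and result columns.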

Related

How to retain class of variable in `tapply`?

Suppose my data frame is set up like so:
X <- data.frame(
id = c('A', 'A', 'B', 'B'),
dt = as.Date(c('2020-01-01', '2020-01-02', '2021-01-01', '2021-01-02'))
)
and I want to populate a variable with the id-specific minimum value of the date dt.
Doing X$dtmin <- with(X, tapply(dt, id, min)[id]) gives a numeric, because simplify=T in tapply has cast the value to numeric. Why has it done this? Setting simplify=F returns a list in which each element has the desired class, but populating the variable in my data frame X casts these back to numeric. Calling as.Date(<output>, origin='1970-01-01') seems needlessly verbose. How can I retain the class of dt?
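tapply simplifies its per-group results by unlisting them, which strips the Date class. Keeping the list and combining its elements with c() preserves the class, so we may use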
X$dtmin <- with(X, do.call("c", tapply(dt, id, min, simplify = FALSE)[id]))
Or use dplyr (the .by argument needs dplyr 1.1.0 or later)
library(dplyr)
X %>%
mutate(dtmin = min(dt), .by = "id")
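Another base R option, not mentioned in the answer above, might be ave(), which writes the per-group result back into a copy of the original vector and so should keep its Date class. A minimal sketch using the data from the question:

```r
X <- data.frame(
  id = c('A', 'A', 'B', 'B'),
  dt = as.Date(c('2020-01-01', '2020-01-02', '2021-01-01', '2021-01-02'))
)

# ave() applies FUN within each id group and returns a vector as long as dt.
X$dtmin <- ave(X$dt, X$id, FUN = min)

# Check the class rather than trusting the printed output.
class(X$dtmin)
```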

R; How to select() columns that contains() strings where the string is any element of a list

I want to subset a dataframe whereby I select columns based on the fact that the colname contains a certain string or not. These strings that it must contain are stored in a separate list.
This is what I have now:
colstrings <- c('A', 'B', 'C')
for (i in colstrings){
df <- df %>% select(-contains(i))
}
However, it feels like this shouldn't be done with a for loop. Any suggestions on how to make this code shorter?
Here's an answer adapted from a previous SO post:
library(dplyr)
df <-
tibble(
ash = c(1, 2),
bet = c(2, 3),
can = c(3, 4)
)
df
substr_list <- c("sh", "an")
df %>%
select(matches(paste(substr_list, collapse="|")))
See more here: select columns based on multiple strings with dplyr contains()
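The example above keeps the matching columns, whereas the loop in the question drops them. Negating the same pattern, i.e. select(-matches(paste(colstrings, collapse = "|"))), should give the dropping behaviour. For comparison, here is a base R sketch of the same idea; the column names are made up for illustration, and unlike contains() this version is case-sensitive.

```r
# Hypothetical data frame: three columns contain one of the strings, one does not.
df <- data.frame(A_x = 1:2, B_y = 3:4, keep_me = 5:6, C_z = 7:8)
colstrings <- c('A', 'B', 'C')

# For each string, flag the column names that contain it (literal match, no regex),
# then OR the flags together: TRUE means "contains at least one of the strings".
hit <- Reduce(`|`, lapply(colstrings, grepl, x = names(df), fixed = TRUE))

# Drop the flagged columns; drop = FALSE keeps the result a data frame.
reduced <- df[, !hit, drop = FALSE]
```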

'Macro Variables' in R

I am trying to build a process that accepts user input parameter and then produces things accordingly.
I need to be able to:
1. Input a variable
2. Pull max date for that variable
3. Pull all data less than or equal to that date
dates <- c('2001-01-08', '2015-01-07', '2013-03-03', '2001-01-01', '2013-07-25', '2000-09-20', '2017-02-20')
groups <- c('A', 'A', 'A', 'B', 'B', 'C', 'D')
dat <- data.frame(groups, dates)
dat$dates <- as.Date(dat$dates)
The following piece works for what I want to do....
querydate <- sqldf(
"SELECT max(dates) as x
FROM dat
WHERE groups == 'A'")
But I want to edit this to do something like this....where I specify a value and query references...
group_i_want <- 'A'
querydate <- sqldf(
"SELECT max(dates) as x
FROM dat
WHERE groups == group_i_want")
How can I get R to recognize this value?
You can look into using sprintf to do string formatting on values you collect at runtime. For example:
g <- "A"
if (invalid.input(g)) stop("Error") # invalid.input() is a placeholder for your own validation of the input
query <- sprintf("SELECT max(dates) as x FROM dat WHERE groups == '%s'", g)
querydate <- sqldf(query)
Here the %s will be substituted by the string contained in g. You can also substitute numbers with specific formatting; see ?sprintf for more information.
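Since the value is pasted straight into the SQL text, the validation step matters. One way it might look is sketched below: build_query is a hypothetical helper (not part of sqldf) that only accepts a value already present in dat$groups. The last two lines show a plain base R equivalent of steps 2 and 3, assuming step 3 means all rows regardless of group.

```r
dat <- data.frame(
  groups = c('A', 'A', 'A', 'B', 'B', 'C', 'D'),
  dates = as.Date(c('2001-01-08', '2015-01-07', '2013-03-03', '2001-01-01',
                    '2013-07-25', '2000-09-20', '2017-02-20'))
)

# Hypothetical helper: accept a single value that actually occurs in the data,
# then substitute it into the query text with sprintf().
build_query <- function(g, allowed = unique(dat$groups)) {
  if (length(g) != 1 || !(g %in% allowed)) stop("Unknown group: ", g)
  sprintf("SELECT max(dates) AS x FROM dat WHERE groups = '%s'", g)
}

group_i_want <- "A"
query <- build_query(group_i_want)  # pass this string to sqldf()

# Base R equivalent without SQL: max date for the group, then all rows up to it.
max_date <- max(dat$dates[dat$groups == group_i_want])
up_to_max <- dat[dat$dates <= max_date, ]
```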

Calculate length of each object in R

I would like to calculate the length of many objects in R and return those objects with the name-prefix 'length_'. However, when I type this code:
A <- c('A', 'B', '3')
B <- c('A', '2')
files <- ls()
for (i in 1:length(files)) assign(paste("length_",files[i], sep = ""), length(unlist(files[i])))
This returns the vectors length_A and length_B, but each with the value 1 and not 3 and 2.
Thank you for any help,
Paul
p.s. I actually would like to apply this to a different function instead of length (GC.content from package ape to calculate GC content of DNA-sequences), but with that function I have the same problem as with the abovementioned example.
In R 3.2.0, the lengths function was introduced, which calculates the length of each item of a list. Using this function, as @docendo-discimus notes in the comments above, a super compact (and R-like) solution is
lengths(mget(ls()))
which returns a named vector
A B
3 2
mget returns a list of objects in the environment and is sort of like "multipleget."
A <- c('A', 'B', '3')
B <- c('A', '2')
files <- ls()
for (i in 1:length(files)) assign(paste("length_",files[i], sep = ""), length(get(files[i])))
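This creates length_A with value 3 and length_B with value 2. The original code returned 1 because files[i] is only the object's name (a single string); get() fetches the object that the name refers to.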
A <- c('A', 'B', '3')
B <- c('A', '2')
files <- list(A,B)
sapply(files,length)
This will give you the answer, but I don't know if it's what you want.
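For the p.s. in the question (applying some other function instead of length), the mget() idea can be combined with sapply(). A rough sketch follows. Keeping the objects in a separate environment is my own assumption, made so that ls() returns only the objects of interest; swap length for whatever function is needed.

```r
# Keep the objects of interest in their own environment.
seqs <- new.env()
assign("A", c('A', 'B', '3'), envir = seqs)
assign("B", c('A', '2'), envir = seqs)

# mget() returns a named list of the objects behind the names from ls().
objs <- mget(ls(seqs), envir = seqs)

# Apply any function to each object; length is used here as in the question.
res <- sapply(objs, length)

# Add the requested prefix to the names instead of creating separate variables.
names(res) <- paste0("length_", names(res))
```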

mapply within ddply

note: this is a direct follow up to this previous question
I have a very long data frame consisting of two columns that I am using as arguments for a function that will find the value of a third column using mapply, like so:
df$3rd <- mapply(myfunction, A=df$1st, B=df$2nd)
where myfunction has arguments A and B. While this works great for small datasets, it stalls for large datasets so I was thinking a good way to approach the problem would be to apply this function using ddply. I don't know if ddply is the best approach for this problem but I am also having some trouble with syntax. So suggestions for either would be appreciated.
This is what I am trying:
df$3rd <- ddply(df, .(1st), function(x) x$3rd <-
mapply(myfunction, A=x$1st, B=df$second))
and this is the error I am getting:
Error in `$<-.data.frame`(`*tmp*`, "n", value = c(1L, 1L, 1L, 1L, 1L, :
replacement has 112 rows, data has 16
EDIT:
In light of the answer and comments, I am posting a small reproducible example below. It is one of the answers from the previous question. However, as the commenters below note, ddply is probably not the way to go. I am trying Ramnath's solution right now.
library(reshape2)
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- data.frame(x = c('c', 'c', 'c', 'd', 'd', 'd'),
y = c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'))
nShared <- function(A, B) {
length(intersect(with(foo, y[x==A]), with(bar, y[x==B])))
}
# Enumerate all combinations of groups in foo and bar
(combos <- expand.grid(foo.x=unique(foo$x), bar.x=unique(bar$x)))
# Find number of elements in common among all pairs of groups
combos$n <- mapply(nShared, A=combos$foo.x, B=combos$bar.x)
# Reshape results into matrix form
dcast(combos, foo.x ~ bar.x)
# foo.x c d
# 1 a 1 0
# 2 b 0 1
ddply isn't what you're after here, ddply(df,.(1st), FUNCTION) is more like:
for each val in unique(df$1st)
outdf[nrow(outdf)+1,] = FUNCTION( df[df$1st==val] )
That is, it makes outdf consisting of FUNCTION applied to subsets of df determined by column 1st.
In any case, I think your error might be because you have df instead of x in function(x) x$3rd<-mapply(myfunction,A=x$1st, B=df$second) (the B argument)? Although it is hard to tell without a working example.
What exactly does myfunction do? I think your best bet is to vectorise myfunction so that you can just do df$third <- myfunction( A=df$first, B=df$second ).
For example, if myfunction <- function(A,B) { A+B }, instead of doing mapply(myfunction,df$first,df$second) you could equivalently do myfunction(df$first,df$second) and not even need mapply at all.
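A small check of that claim, using made-up numbers and the A + B example from above. The column names first and second follow the ones used in this answer.

```r
# Made-up data for illustration.
df <- data.frame(first = c(1, 2, 3), second = c(10, 20, 30))

myfunction <- function(A, B) { A + B }

# Element-by-element call through mapply ...
via_mapply <- mapply(myfunction, A = df$first, B = df$second)

# ... versus a single vectorised call on whole columns.
vectorised <- myfunction(A = df$first, B = df$second)

# Both give the same values; the vectorised call avoids one R function call per row.
identical(via_mapply, vectorised)
```

This only works when the body of myfunction is itself built from vectorised operations; a function that uses if () on a single value, for instance, would need rewriting first.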
