How to retain class of variable in `tapply`? - r

Suppose my data frame is set up like so:
X <- data.frame(
id = c('A', 'A', 'B', 'B'),
dt = as.Date(c('2020-01-01', '2020-01-02', '2021-01-01', '2021-01-02'))
)
and I want to add a variable holding the id-specific minimum of the date dt.
Doing X$dtmin <- with(X, tapply(dt, id, min)[id]) gives a numeric, because simplify=T in tapply has cast the values to numeric. Why has it done this? Setting simplify=F returns a list in which each element has the desired class, but assigning that result to a column of my data frame X casts the values back to numeric. Calling as.Date(<output>, origin='1970-01-01') seems needlessly verbose. How can I retain the class of dt?

We may use
X$dtmin <- with(X, do.call("c", tapply(dt, id, min, simplify = FALSE)[id]))
Or use dplyr
library(dplyr)
X %>%
mutate(dtmin = min(dt), .by = "id")
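
As a further base R option, here is a minimal sketch using ave(). It is not part of the answer above; the assumption is that ave() writes each group's result back into a copy of the input vector, so the Date class of dt should carry over without any conversion.

```r
# Same example data as in the question
X <- data.frame(
  id = c('A', 'A', 'B', 'B'),
  dt = as.Date(c('2020-01-01', '2020-01-02', '2021-01-01', '2021-01-02'))
)

# ave() applies FUN within each level of id and returns a vector
# of the same length (and, here, the same class) as X$dt
X$dtmin <- ave(X$dt, X$id, FUN = min)

class(X$dtmin)
```

Because the result already has one value per row, no indexing by id is needed afterwards.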

Related

Generalizable function to select and filter dataframe r - using shiny input

I am building a shiny app. The user will need to be able to reduce the data by selecting variables and filtering on specific values for those variables. I am stuck trying to get a generalizable function that can work based on all possible selections.
Here is an example - I skip the shiny code because I think the problem is with the function:
#sample dataframe
df <- data.frame('date' = c(1, 2, 3, 2, 2, 3, 1),
'time' = c('a', 'b', 'c', 'e', 'b', 'a', 'e'),
'place' = c('A', 'A', 'A', 'H', 'A', 'H', 'H'),
'result' = c('W', 'W', 'L', 'W', 'W', 'L', 'L'))
If the user selected date and result for the date values 1, 2; and the result values W, I would do the following:
out <- df %>%
select(date, result) %>%
filter(date %in% c(1,2)) %>%
filter(result %in% c('W'))
The challenge I am having is that the user can select any unique combination of variables and values. Using the input$ values from my shiny app, I can get the selected variables into a vector and the selected values into a list of vectors, positionally matching the selected variables. For example:
selected_variables <- c('date', 'result')
selected_values <- list(c(1,2), c('W'))
What I think I then need is a generalizable function that will match up the filter calls with the correct variables. Something like:
#function that takes data frame, vector of selected variables, list of vectors of chosen values for each variable
#Returns a reduced table of selected variables, filtered values
table_reducer <- function(df, select_var, filter_values) {
#select the variables
out <- df %>%
#now filter each variable by the values contained in the list
select(vect_of_var)
out <- [for loop that iterates over vect_of_var, list_of_vec, filtering accordingly]
out #return out
}
My thinking would be to use an equivalent of Python's zip, but all my searching on that just points me to mapply, and I can't see how to use that within the for loop (which I also know is not always approved of in R, but I am talking about a relatively small number of iterations). If there is a better solution to this, I would welcome it.
Here's a 1-liner table_reducer function in base R -
table_reducer <- function(df, select_var, filter_values) {
subset(df, Reduce(`&`, Map(`%in%`, df[select_var], filter_values)))
}
selected_variables <- c('date', 'result')
selected_values <- list(c(1,2), c('W'))
table_reducer(df, selected_variables, selected_values)
# date time place result
#1 1 a A W
#2 2 b A W
#4 2 e H W
#5 2 b A W
Map is a wrapper over mapply so you were right in thinking that you should use mapply for this task. This answer is also free of dreaded for loops.
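
Note that the output above keeps all columns, whereas the question also asked to keep only the selected variables. A possible variant, as a sketch only (the name table_reducer_select is made up for illustration), builds the same logical index and then subsets both rows and columns with `[`:

```r
# Sample data frame from the question
df <- data.frame(date = c(1, 2, 3, 2, 2, 3, 1),
                 time = c('a', 'b', 'c', 'e', 'b', 'a', 'e'),
                 place = c('A', 'A', 'A', 'H', 'A', 'H', 'H'),
                 result = c('W', 'W', 'L', 'W', 'W', 'L', 'L'))

# Hypothetical variant of table_reducer that also drops unselected columns
table_reducer_select <- function(df, select_var, filter_values) {
  # One logical vector per selected variable, combined with AND
  keep <- Reduce(`&`, Map(`%in%`, df[select_var], filter_values))
  # drop = FALSE keeps a data frame even when one variable is selected
  df[keep, select_var, drop = FALSE]
}

out <- table_reducer_select(df, c('date', 'result'), list(c(1, 2), c('W')))
out
```

Using `[` with a character vector avoids the non-standard evaluation of subset(), which can be convenient inside a function.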

R; How to select() columns that contains() strings where the string is any element of a list

I want to subset a dataframe whereby I select columns based on the fact that the colname contains a certain string or not. These strings that it must contain are stored in a separate list.
This is what I have now:
colstrings <- c('A', 'B', 'C')
for (i in colstrings){
df <- df %>% select(-contains(i))
}
However, it feels like this shouldn't be done with a for loop. Any suggestions on how to make this code shorter?
Here's an answer adapted from a previous SO post:
library(dplyr)
df <-
tibble(
ash = c(1, 2),
bet = c(2, 3),
can = c(3, 4)
)
df
substr_list <- c("sh", "an")
df %>%
select(matches(paste(substr_list, collapse="|")))
See more here: select columns based on multiple strings with dplyr contains()
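
The answer above keeps the matching columns, while the loop in the question drops them; with dplyr the selection can be negated. Also note that matches() treats the pasted string as a regular expression, so strings with special characters would need escaping. For comparison, here is a base R sketch that drops columns using literal (fixed) matching. The column names are invented for illustration.

```r
# Illustrative data; column names are assumptions, not from the question
df <- data.frame(A_1 = 1:2, B_1 = 3:4, keep = 5:6, C_x = 7:8)
colstrings <- c('A', 'B', 'C')

# For each string, test which column names contain it literally,
# then combine the tests with OR
hit <- Reduce(`|`, lapply(colstrings, grepl, x = names(df), fixed = TRUE))

# Drop every column whose name contains any of the strings
out <- df[, !hit, drop = FALSE]
names(out)
```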

Why isn't a data.table sorted properly after calling setDT?

When a data table is turned into a data frame and then back into a data table it may keep the sorted attribute even though it is not sorted (see the example below). This leads to incorrect results when merging data.tables, and possible undetected bugs.
Is this the expected behavior? What is the best way to turn a data.frame into a sorted data.table and verify that it is indeed sorted?
library(data.table)
library(dplyr)
a <- data.table(id = c('a', 'B', 'c'), value = c(1,2,3))
b <- data.table(id = c('a', 'B', 'c'))
setkey(a,id)
a_sum <- a %>%
group_by(id) %>%
summarize_at(vars(value), sum)
setDT(a_sum, key = "id")
a_sum_nokey = setkey(copy(a_sum), NULL)
merged_key_fails = merge(a_sum, b, by="id")
merged_no_key_works = merge(a_sum_nokey, b, by="id")

Mapply to Add Column to Each Dataframe in a List

Implemented some code from previous question:
Lapply to Add Columns to Each Dataframe in a List
Using the method above, I receive corrupt data. While I cannot provide actual data, I am wondering if additional arguments need to be implemented to prevent shuffling.
Basically, this:
library(data.table)
df1 <- data.frame(x = runif(3), y = runif(3))
df2 <- data.frame(x = runif(3), y = runif(3))
dfs <- list(df1, df2)
years <- list(2013, 2014)
a<-Map(cbind, dfs, year = years)
final<-rbindlist(a)
But applied to a list of thousands of data frame lists has incorrect results. Assume that some data frames, say df 1.5 somewhere between two above data frames, are empty. Would that affect the order in which the Map binds the years to the dfs? Essentially, I have an output with some data belonging to different years than the Map attached it to. I tested the length and order of years list, and compared it to the output year in final. They are identical. Any thoughts?
We create a logical index based on the length of each element in 'dfs', use that to subset both the 'dfs' and the 'years' and then do the cbind with Map
i1 <- sapply(dfs, length)>1
Or to make it more stringent
i1 <- sapply(dfs, function(x) is.data.frame(x) && !is.null(x) && length(x) > 0)
a <- Map(cbind, dfs[i1], year = years[i1])
and then do the rbindlist with fill = TRUE in case the number of columns is not the same in all the data.frames in the list.
rbindlist(a, fill = TRUE)
data
dfs[[3]] <- list(NULL)
dfs[[4]] <- data.frame()
years <- 2013:2016
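
The filtering idea above can be sketched end to end in base R. This is an illustration under assumptions: the data are made up, do.call(rbind, ...) stands in for rbindlist so that no package is needed, and the emptiness check uses nrow() rather than length(). The key point is that the same logical index is applied to both lists, so each remaining data frame stays paired with its own year.

```r
# Made-up input: the second element is an empty data frame
dfs <- list(
  data.frame(x = 1:2, y = 3:4),
  data.frame(),
  data.frame(x = 5:6, y = 7:8)
)
years <- c(2013, 2014, 2015)

# TRUE only for non-empty data frames
keep <- vapply(dfs, function(d) is.data.frame(d) && nrow(d) > 0, logical(1))

# Subset both inputs with the same index, then pair them positionally
a <- Map(cbind, dfs[keep], year = years[keep])
final <- do.call(rbind, a)
final
```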
Use the idcol argument to rbindlist and add the year column afterwards:
res = rbindlist(dfs, idcol=TRUE)
res[.(.id = 1:2, year = 2013:2014), on=".id", year := i.year]
X[i, on=cols, z := i.z] merges X with i on cols and then copies z from i to X.

Case usage in R:Count number of events from Table 2 when case in Table 1 satisfy specific restrictions

The DF for Table 1 is like this:
df1 <- data.frame(ID = c('001','001','002','003', '003', '003'),
date = c('2015-05-23', '2015-07-29', '2015-08-08', '2015-06-10', '2015-10-12', '2015-11-15'),
date_last = c('2015-01-20', '2015-05-23', '2015-05-15', '2015-01-20', '2015-06-10', '2015-10-12'))
And the DF for Table 2 is like this:
df2 <- data.frame(Event = c('A', 'B', 'C', 'D', 'E'),
Event_date = c('2015-01-21', '2015-01-21', '2015-03-29', '2015-08-12', '2015-10-12'))
What I want is to count, for each row of df1, the events in df2 that satisfy df1$date_last < df2$Event_date < df1$date, counting each such event as 1 and summing over the time period. The ideal result would look like the following:
df3 <- data.frame(ID = c('001','001','002','003', '003', '003'),
date = c('2015-05-23', '2015-07-29', '2015-02-08', '2015-06-10', '2015-10-12', '2015-11-15'),
date_last = c('2015-01-20', '2015-05-23', '2015-05-15', '2015-01-20', '2015-06-10', '2015-10-12'),
number_of_events = c(3,1,0,3,1,0))
Anyone know the R code for this? Thank you so much!
Make sure that all your dates are of class Date. You can do this by wrapping the date columns in as.Date() when creating the data frames.
First define a function where x is a vector holding the end date and the start date, in that order, and y is a vector of dates to be checked.
nr_events_in_between <- function(x, y) sum(x[2] < y & x[1] > y)
Apply this to all rows in df1 and you get the number_of_events column.
apply(df1[ ,c('date', 'date_last')], 1, nr_events_in_between, df2[,'Event_date'])
(Note that for the second row the value is 0 not 1 as you state in the example for df3)
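
A variant of the same idea, sketched with the columns stored as Date and a row-wise vapply() instead of apply(). Unlike apply(), this does not convert the data frame to a character matrix first. The data are the ones from the question.

```r
df1 <- data.frame(
  ID = c('001', '001', '002', '003', '003', '003'),
  date = as.Date(c('2015-05-23', '2015-07-29', '2015-08-08',
                   '2015-06-10', '2015-10-12', '2015-11-15')),
  date_last = as.Date(c('2015-01-20', '2015-05-23', '2015-05-15',
                        '2015-01-20', '2015-06-10', '2015-10-12'))
)
df2 <- data.frame(
  Event = c('A', 'B', 'C', 'D', 'E'),
  Event_date = as.Date(c('2015-01-21', '2015-01-21', '2015-03-29',
                         '2015-08-12', '2015-10-12'))
)

# For each row of df1, count events strictly between date_last and date
df1$number_of_events <- vapply(
  seq_len(nrow(df1)),
  function(i) sum(df2$Event_date > df1$date_last[i] &
                  df2$Event_date < df1$date[i]),
  integer(1)
)
df1$number_of_events
```

Both comparisons are strict, so an event falling exactly on date_last or date is not counted.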
