Applying functions on columns in nested data frame - r

I have data that I'm nesting into list columns, then I'd like to use purrr::map() to apply a plotting function separately to each column within the nested data frames. Minimal reproducible example:
library(dplyr)
library(tidyr)
library(purrr)
data=data.frame(Type=c(rep('Type1',20),
rep('Type2',20),
rep('Type3',20)),
Result1=rnorm(60),
Result2=rnorm(60),
Result3=rnorm(60)
)
dataNested=data%>%group_by(Type)%>%nest()
Say, I wanted to generate a histogram for Result1:Result3 for each element of dataNested$data:
dataNested%>%map(data,hist)
No variation of my code iterates separately over the columns within each nested data frame.

Why complicate things this way when you're already in the tidyverse? List columns are more of a last-resort solution.
library(tidyverse)
data %>%
gather(result, value, -Type) %>%
ggplot(aes(value)) +
geom_histogram() +
facet_grid(Type ~ result)
gather reshapes the wide dataset into a long one with a Type column, a result column, and a value column that holds all the numbers.
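In newer tidyr versions gather() is superseded by pivot_longer(), so the same reshaping step might look like the sketch below. The seed is an assumption added only for reproducibility, and the names_to/values_to column names are chosen to mirror the gather call above.

```r
library(tidyr)

set.seed(1)  # assumed seed, only for reproducibility
data <- data.frame(Type = rep(c("Type1", "Type2", "Type3"), each = 20),
                   Result1 = rnorm(60),
                   Result2 = rnorm(60),
                   Result3 = rnorm(60))

# Stack every column except Type into name/value pairs
long <- pivot_longer(data, cols = -Type,
                     names_to = "result", values_to = "value")
```

The resulting long data frame can be passed to ggplot in the same way as the gather output.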

Perhaps do not create a nested data frame. We can split the data frame by the Type column and plot the histogram.
library(tidyverse)
dt %>%
split(.$Type) %>%
map(~walk(.[-1], ~hist(.)))
DATA
library(tidyverse)
set.seed(1)
dt <- data.frame(Type = c(rep('Type1', 20),
rep('Type2', 20),
rep('Type3', 20)),
Result1 = rnorm(60),
Result2 = rnorm(60),
Result3 = rnorm(60),
stringsAsFactors = FALSE)

So I think you are thinking about this the right way. Running this code:
dataNested$data[[1]]
You can see that you have data that you can iterate over. You can loop through it like:
for(i in dataNested) {
print(i)
}
This clearly demonstrates that the structure is nothing too complicated to work with. Okay so how to create the histograms? We can create a helper function:
helper_hist <- function(df) {
lapply(df, hist)
}
And run using:
map(dataNested$data, helper_hist)
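If you would rather stay with purrr all the way down, one possible approach is a nested map: the outer call goes over the data frames in dataNested$data, the inner one over their columns. This is a minimal sketch that assumes the example data from the question; the seed and plot = FALSE are additions here, the latter only so that the histogram objects are returned rather than drawn.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)  # assumed seed, only for reproducibility
data <- data.frame(Type = rep(c("Type1", "Type2", "Type3"), each = 20),
                   Result1 = rnorm(60),
                   Result2 = rnorm(60),
                   Result3 = rnorm(60))
dataNested <- data %>% group_by(Type) %>% nest()

# Outer map: one nested data frame per Type; inner map: one column at a time
hists <- map(dataNested$data, ~ map(.x, hist, plot = FALSE))
names(hists) <- as.character(dataNested$Type)
```

Dropping plot = FALSE draws the histograms; the inner map() can be swapped for walk() when only the plotting side effect is wanted.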
Hope this helps.

Related

Apply R glm predict function on dataframe by group

At the moment I am trying to apply GLM predict on a dataframe. The dataframe is quite large, so I want to apply predict in chunks.
I have found a solution, but it is quite unwieldy: I first create an empty dataframe and then use rbind. Is there a more efficient way of doing this?
df=data[c(),]
for (x in split(data, factor(sort(rank(row.names(data))%%10)))) {
x["prediction"]=predict(model, x, type="response")
df=rbind(df,x)
}
As the comments mention, an example of what you want your output dataframe to look like would be very helpful.
But I think you can achieve what you want by making a grouping variable first then using 'group_by', something like this:
df <- data %>%
mutate(group = rep(1:10, length.out = n())) %>% # make an arbitrary grouping factor for this example
group_by(group) %>% # group by whatever your grouping factor is
mutate(prediction = predict(model, cur_data(), type = 'response')) %>% # predict on each group's own rows; use pick(everything()) in dplyr >= 1.1.0
ungroup()
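The chunk-wise idea from the question can also be sketched in base R without growing a data frame via rbind inside a loop. The model, the data (mtcars), and the chunk count of 4 below are stand-ins chosen for illustration, not part of the original question.

```r
# Stand-in model and data (assumptions for illustration only)
model <- glm(vs ~ mpg, data = mtcars, family = binomial)

# Assign each row to one of 4 chunks (hypothetical chunk count)
chunk_id <- rep(seq_len(4), length.out = nrow(mtcars))

# Predict chunk by chunk and keep the results in a list
chunks <- lapply(split(mtcars, chunk_id), function(chunk) {
  chunk$prediction <- predict(model, chunk, type = "response")
  chunk
})

# Reassemble the chunks in the original row order
result <- unsplit(chunks, chunk_id)
```

unsplit() reverses split(), so the rows come back in their original order and the list is combined once rather than on every iteration.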

Extending an sapply to apply list of variables and saving output as list of data frames in R

I have a complex-sample data set similar to the example below. Thanks to SO user IRTFM, I was able to adapt the code and save the results (I'm only interested in the total proportions, not the confidence intervals) as a reshaped object for further processing. What I would like to do is extend this sapply to generate results for 20 other variables. Ideally, I would like to save the results as data frames in a list, since I think this is the most efficient way. My struggle is how to extend the sapply so that I can process multiple variables at once. I thought about a for loop over a list that holds the names of the variables and started to make this list, var_list below, but this does not seem to be the way forward. I'd rather take advantage of the apply family, since I would like the results to be stored in a list.
library(survey) # using the `dclus1` object that is standard in the examples.
library(reshape)
library(tidyverse)
data(api)
stype_t <- sapply( levels(dclus1$variables$stype),
function(x){
form <- as.formula( substitute( ~I(stype %in% x), list(x=x)))
z <- svyciprop(form, dclus1, method="me", df=degf(dclus1))
c( z, c(attr(z,"ci")) )} ) %>%
as.data.frame() %>% slice(1) %>% reshape::melt() %>% dplyr::mutate(value = round(value, digits = 4)*100)
Let's say you then wanted to repeat the above using the variable awards. You could copy the lines and do it that way, but a more efficient approach would be better. So I started by making a list of the names of the two variables in this example data, but I am stumped as to how to apply this list to the code above and retain the results in a list of data frames. I tried wrapping the sapply in an lapply, but this did not work, presumably because my approach was wrong. Any advice or thoughts would be appreciated.
var_list <- list("stype", "awards")
Instead of using $ to reference named elements, consider the [[ extractor, which references names by string. Also, extend substitute so that the variable name is dynamic:
# DEFINED METHOD
df_build <- function(var) {
sapply(levels(dclus1$variables[[var]]), function(x) {
form <- as.formula(substitute(~I(var %in% x),
list(var=as.name(var), x=x)))
z <- svyciprop(form, dclus1, method="me", df=degf(dclus1))
c(z, c(attr(z,"ci")))
}) %>%
as.data.frame() %>%
slice(1) %>%
reshape::melt() %>%
dplyr::mutate(value = round(value, digits = 4)*100)
}
# ITERATE THROUGH CHARACTER VECTOR AND CALL METHOD
var_list <- list("stype", "awards")
df_list <- lapply(var_list, df_build)
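To see what the substitute call with as.name does in isolation, here is a small sketch that needs no survey design object. The level "E" is just an example value picked for illustration.

```r
var <- "stype"  # variable name held as a string
x <- "E"        # one level of that variable (example value)

# as.name() turns the string into a symbol, so the formula refers to the
# column stype rather than to the literal text "stype"
form <- as.formula(substitute(~I(var %in% x),
                              list(var = as.name(var), x = x)))
print(form)
```

Since lapply keeps the names of its input, passing a named var_list should also give a named list of data frames, which makes the results easier to look up later.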

lapply strings to subset via brackets[]

There is almost certainly an easier way to go about this, but perhaps I've just been awake too long. I want to use the following vector of strings:
lap_list <- paste0(seq(1,length(mpg[[1]]),10), ":", seq(10,length(mpg[[1]]),10))
and use the vector to subset such as mpg[lap_list[1], ]. Alternatively, I could use dplyr for something with slice:
mpg %>%
slice(lap_list[1])
Both methods give the same error, and beyond eval(parse()) or as.numeric() I'm having a hard time wording my question for Google.
The ultimate goal is to have a function such that I could lapply the graph outputs. Say:
barchart <- function(data_slice) {
mpg %>%
slice(data_slice) %>%
ggplot(aes(x=model)) + geom_bar()
}
lapply(lap_list, barchart)
If you build the row sequences as strings with paste0, you don't have much option other than to use eval(parse()) in some way or another.
An alternative is to create a sequence of rows that you want to subset and store it in vectors. Pass them in Map to slice from the data and then plot.
library(dplyr)
library(ggplot2)
n <- nrow(mpg)
start <- seq(1,n,10)
#Added an extra `n` here to make the length of start and end equal
end <- c(seq(10,n,10), n)
barchart <- function(data, start, end) {
data %>%
slice(start:end) %>%
ggplot(aes(x=model)) + geom_bar()
}
list_of_plots <- Map(barchart, start, end, MoreArgs = list(data = mpg))
You can access each individual plot using list_of_plots[[1]], list_of_plots[[2]], etc.
You can also create groups of 10 rows and store the plots in a data frame:
mpg %>%
group_by(grp = ceiling(row_number()/10)) %>%
summarise(plot = list(ggplot(cur_data(), aes(x=model)) + geom_bar()))
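Another way to avoid strings such as "1:10" is to keep the row numbers themselves in a list. The sketch below uses base R only and mtcars as a stand-in data set (an assumption made to keep it self-contained); mpg from ggplot2 should behave the same way.

```r
df <- mtcars  # stand-in data set (assumption)

# One integer vector of row numbers per block of 10 rows
row_chunks <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / 10))

# Subset with real indices instead of strings such as "1:10"
slices <- lapply(row_chunks, function(rows) df[rows, ])
```

Each element of slices can then be handed to a plotting function with lapply, and the last chunk simply holds the leftover rows.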

Generate dplyr arguments using values in another dataframe

I have data where the factor labels have been provided in separate files. As a result, when I read things in I have data that looks like this:
id <- seq(1,10,1)
factor_x <- as.factor(sample(x = 1:7, size = 10, replace = T))
data <- data.frame(id, factor_x)
And a separate data frame containing the labels for factor_x that looks like this:
code <- seq(1,7,1)
label <- letters[1:7]
factor_x_labels <- data.frame(code, label)
factor_x_labels$label <- as.character(factor_x_labels$label)
I am looking for an efficient way to update factor_x in data frame 'data' with the labels in data frame 'factor_x_labels'.
I have been trying to work with fct_recode from the forcats package or recode from dplyr but am running into trouble because (for example) the existing and updated labels need to be pasted as strings but need to be separated by = as a symbol.
@Ronak's comment clearly works (and should perhaps be an answer), but since this post was tagged dplyr, I'm also posting a dplyr solution:
factor_x_labels$code <- as.character(factor_x_labels$code) #this won't work if one of "code" and "factor_x" is numeric but not the other
data %>%
left_join(factor_x_labels, by=c("factor_x"="code")) %>%
rename(factor_x_label = label)
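A base R alternative to the join, sketched here under the assumption that every code appears in the lookup table, is to rebuild the factor with levels and labels taken from factor_x_labels. The seed is an addition for reproducibility only.

```r
set.seed(42)  # assumed seed, only for reproducibility
data <- data.frame(id = 1:10,
                   factor_x = as.factor(sample(1:7, size = 10, replace = TRUE)))
factor_x_labels <- data.frame(code = 1:7, label = letters[1:7],
                              stringsAsFactors = FALSE)

# Rebuild the factor: the codes define the levels, the labels rename them
data$factor_x_label <- factor(data$factor_x,
                              levels = factor_x_labels$code,
                              labels = factor_x_labels$label)
```

This keeps the result as a factor with all seven labels as levels, even if some codes do not occur in the data.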

R aggregate function with two values

Let's say I have a function that takes two vectors:
someFunction <- function(x,y){
return(mean(x+y));
}
And say I have some data
toy <- data.frame(a=c(1,1,1,1,1,2,2,2,2,2), b=rnorm(10), c=rnorm(10))
What I want to do is return the result of the function someFunction for each value of toy$a, i.e. I want to achieve the same result as the code
toy$d <- toy$b + toy$c
result <- aggregate(toy$d, list(toy$a), mean)
However, in real life, the function someFunction is way more complicated and it needs two inputs, so the workaround in this toy example is not possible. So, what I want to do is:
Group the data set according to one column.
For each value in the column (in the toy example, that's 1 and 2), take two vectors v1, v2, and return someFunction(v1,v2)
Check out the dplyr package, specifically the group_by and summarize functions.
Assuming that you want to compute someFunction(b, c) for each value of a, the syntax would look like
library(dplyr)
toy %>% group_by(a) %>% summarize(someFunction(b, c))
Alternatively, with data.table:
library(data.table)
toy <- data.table(toy)
toy[, list(New_col = someFunction(b, c)), by = 'a']
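For completeness, the same grouping can be sketched in base R with split and sapply, with no extra packages. The seed is an assumption added for reproducibility; someFunction is the toy function from the question.

```r
someFunction <- function(x, y) mean(x + y)

set.seed(1)  # assumed seed, only for reproducibility
toy <- data.frame(a = rep(1:2, each = 5), b = rnorm(10), c = rnorm(10))

# One data frame per value of a, then call the two-argument function on each
result <- sapply(split(toy, toy$a), function(g) someFunction(g$b, g$c))
```

result is a named vector with one entry per value of a, and the function body can be arbitrarily complex as long as it takes the two column vectors.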
