lapply strings to subset via brackets[] - r

There is almost certainly an easier way to go about this, but perhaps I've just been awake too long. I want to use the following vector of strings:
lap_list <- paste0(seq(1,length(mpg[[1]]),10), ":", seq(10,length(mpg[[1]]),10))
and use the vector to subset such as mpg[lap_list[1], ]. Alternatively, I could use dplyr for something with slice:
mpg %>%
  slice(lap_list[1])
Both methods give the same error, and beyond eval(parse()) or as.numeric() I'm having a hard time wording my question for Google.
The ultimate goal is to have a function that I could lapply over lap_list to produce the graph outputs. Say:
barchart <- function(data_slice) {
  mpg %>%
    slice(data_slice) %>%
    ggplot(aes(x = model)) + geom_bar()
}
lapply(lap_list, barchart)

If you paste together the sequence of rows you want to subset using paste0, you don't have much option other than to use eval(parse()) in some way or the other.
An alternative is to create sequences of the rows that you want to subset and store them in vectors. Pass them to Map, slice the data, and then plot.
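For example, a minimal sketch of the eval(parse()) route, using the lap_list strings and packages from the question:
# parse() turns the string "1:10" into an expression, eval() evaluates it to the vector 1:10
mpg %>%
  slice(eval(parse(text = lap_list[1])))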
library(dplyr)
library(ggplot2)
n <- nrow(mpg)
start <- seq(1, n, 10)
# Added an extra `n` here to make the lengths of start and end equal
end <- c(seq(10, n, 10), n)
barchart <- function(data, start, end) {
  data %>%
    slice(start:end) %>%
    ggplot(aes(x = model)) + geom_bar()
}
list_of_plots <- Map(barchart, start, end, MoreArgs = list(data = mpg))
You can access each individual plot using list_of_plots[[1]], list_of_plots[[2]], etc.
Alternatively, you can also create groups of 10 rows and store the plots in a dataframe:
mpg %>%
  group_by(grp = ceiling(row_number() / 10)) %>%
  summarise(plot = list(ggplot(cur_data(), aes(x = model)) + geom_bar()))
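If you assign that result to a name, say plot_df (a name used here purely for illustration), each plot can then be pulled back out of the list column:
plot_df$plot[[1]]  # plot for the first group of 10 rows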

Related

Add to the values in a column in a loop to generate new columns

I'm trying to create some sort of loop to generate a % of age over the next few years, in months. I have two columns, age and term. Dividing them gets me the % I'm looking for, but I need an easy way to add 1 to age, and keep term consistent, and use that to create a new column. Something like:
for i = n
col_n<-data_set$term/(data_set$age + n)
n=30
library(tidyverse)
# create example data frame
df <- tribble(~age, ~term,
              10, 5,
              12, 6)
# create function to add new column
agePlusN <- function(df, n) {
  mutate(df, "col.{n}" := term / (age + n))
}
# iterate through 1:30 applying agePlusN()
walk(1:30, \(n) df <<- agePlusN(df, n))
This works, but the last step is a bit ugly. It should really use map instead of walk, but I couldn't quite figure out how to get it not to add new rows.
Attempt 2
# create function to add new column
agePlusN <- function(df, n) {
  mutate(df, "col.{n}" := term / (age + n)) %>%
    select(-term, -age)
}
# iterate through 1:30 applying agePlusN()
df2 <-
  map_dfc(1:30, \(n) agePlusN(df, n)) %>%
  bind_cols(df, .)
Notes:
The := in mutate() allows you to use glue() syntax in the name on the left-hand side (e.g. "col.{n}").
map_dfc() means map and then combine all of the outputs with bind_cols().
\(n) is equivalent to function(n).
The . in the call to bind_cols() isn't strictly necessary, but it makes sure the 'age' and 'term' columns are put at the beginning of the resulting dataframe.
I still think this could be done better without having to call bind_cols, but I'm not smart enough to figure it out.
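For what it's worth, a sketch of one possible way to avoid both the <<- and the bind_cols() call is purrr::reduce(), using the first agePlusN() definition (the one without the select step):
# fold the values 1:30 into df, adding one new column per step
df3 <- reduce(1:30, agePlusN, .init = df)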

A better way to split, apply, and combine in R using sp::merge() as a function

I have a dataframe with 500 observations for each of 3106 US counties. I would like to merge that dataframe with a SpatialPolygonsDataFrame.
I have tried a few approaches. I have found that if I filter the data by a variable iter_id, I can use sp::merge() to merge the datasets, and I presume that I can then rbind them back together. sp::merge() does not allow a right or full join, and the spatial data needs to be in the left position, so a many-to-one merge will not work. The really nasty way I have tried is:
(I am not sure how to represent the dataframe with the variables of interest here)
library(choroplethr)
data(continental_us_states)
us <- tigris::counties(continental_us_states)
gm_y_corr <- tribble(~GEOID, ~iter_id, ~neat_variable,
                     01001, 1, "value_1",
                     01003, 1, "value_2",
                     ...
                     01001, 2, "value_3",
                     01003, 2, "value_4",
                     ...
                     01001, 500, "value_5",
                     01003, 500, "value_6")
filtered <- gm_y_corr %>%
  filter(iter_id == 1)
us.gm <- sp::merge(us, filtered, by = 'GEOID')
for (j in 2:500) {
  tmp2 <- gm_y_corr %>%
    filter(iter_id == j)
  tmp3 <- sp::merge(us, tmp2, by = 'GEOID')
  us.gm <- rbind(us.gm, tmp3)
}
I know there must be a better way. I have tried group_by, but multiple matches are found, so I must not be understanding group_by.
> geo_dat <- gm_y_corr %>%
+ group_by(iter_id)%>%
+ sp::merge(us, .,by='GEOID')
Error in .local(x, y, ...) : non-unique matches detected
I would like to merge the spatial data with the interesting data.
Here you can use the splitting functionality of base R's split or the more recent dplyr::group_split. This will separate your data frame according to your splitting variable, and you can lapply or purrr::map a function such as merge over the pieces and then use dplyr::bind_rows to collapse the returned list back into a dataframe. Since I can't manage to get the us data, I have just written what I suspect would work.
gm_y_corr %>%
  group_by(iter_id) %>%   # group
  group_split() %>%       # split
  lapply(., function(x) { # apply merge(us, x, by = "GEOID") to each list element
    merge(us, x, by = "GEOID")
  }) %>%
  bind_rows()             # collapse to data frame
Equivalently, this is the same as using base R functionality; the newer group_by %>% group_split is a little more intuitive in my opinion.
gm_y_corr %>%
  split(.$iter_id) %>%
  lapply(., function(x) {
    merge(us, x, by = "GEOID")
  }) %>%
  bind_rows()
If you wanted to just use group_by, you would have to follow up with the dplyr::do function, which I believe does a similar thing to what I have done above, but without you having to split the data yourself.

scatter plot against all groups for a long data frame

I am pretty sure something like this has already been asked, but I don't know how to search for it.
I often get data in a wide format, like in my little example with 3 experiments (a-c). I normally convert to long format and transform the values with some function (here log2 as an example).
What I often want to do is plot all experiments against each other, and here I am looking for a handy solution. How can I convert my data frame to get facets, for example with a~b, a~c and b~c?
So far I tidyr::spread the data again and run a ggplot command three times with the individual column names as x and y. Later I merge the individual graphs together.
Is there a more convenient way?
library(dplyr)
library(tidyr)
library(ggplot2)
df <- data.frame(
  names = letters,
  a = 1:26,
  b = 1:13,
  c = 11:36
)
df %>%
  tidyr::gather(experiment, value, -names) %>%
  mutate(log2.value = log2(value))
EDIT
Since I got a very useful answer from @hdkrgr, I adapted my code a bit. The inner_join was a great trick which I can use to automate my idea; what I still miss is a clever filter to get rid of the redundant data, since I don't want to plot c~c, or b~a if I already plot a~b.
I solved this for now by providing the pairings I want, but can anyone think of a straightforward solution? I couldn't come up with something that gives me the unique pairings.
my_pairs <- c('a vs. b', 'a vs. c', 'b vs. c')
df %>%
  as_tibble() %>%
  tidyr::gather(experiment, value, -names) %>%
  mutate(log2.value = log2(value)) %>%
  inner_join(., ., by = c("names")) %>%
  mutate(pairing = sprintf('%s vs. %s', experiment.x, experiment.y)) %>%
  filter(pairing %in% my_pairs) %>%
  ggplot(aes(log2.value.x, log2.value.y)) +
  geom_point() +
  facet_wrap(~ pairing, labeller = label_both)
One way starting from long format would be to do a self-join on the long-data in order to get all combinations of two experiments in each row:
df %>%
  tidyr::gather(experiment, value, -names) %>%
  mutate(log2.value = log2(value)) %>%
  inner_join(., ., by = c("names")) %>%
  ggplot(aes(log2.value.x, log2.value.y)) + geom_point() + facet_grid(experiment.y ~ experiment.x)
Edit: To avoid plotting redundant experiment-pairs, you can do:
df %>%
  tidyr::gather(experiment, value, -names) %>%
  mutate(log2.value = log2(value)) %>%
  inner_join(., ., by = c("names")) %>%
  filter(experiment.x < experiment.y) %>%
  ggplot(aes(log2.value.x, log2.value.y)) + geom_point() + facet_wrap(~experiment.y + experiment.x)
This is really interesting because it's actually more complex than it first seems. One thing that sticks out is getting unique pairs of experiments—it seems like you'd want a vs b but not necessarily b vs a as well. To do that, you need the unique set of experiment pairs.
Initially, I tried to work from your gathered data, but realized it might be simpler to start from the wide version. Take the names of the experiments from the column names—you can do this multiple ways, but I just took the strings that aren't "names"—and get the combinations of them. I pasted them together to make them a little easier to work with.
library(dplyr)
library(tidyr)
library(ggplot2)
df <- data.frame(
  names = letters,
  a = 1:26,
  b = 1:13,
  c = 11:36
) %>%
  as_tibble()
exp <- stringr::str_subset(names(df), "names", negate = T)
pairs <- combn(exp, 2, paste, simplify = F, collapse = ",") %>%
  unlist()
pairs
#> [1] "a,b" "a,c" "b,c"
Then, for each pair, extract the associated column names, do a little tidyeval to select those columns, do the log2 transform that you had. I had to detour here to rename the columns with something I could refer back to—I think this isn't necessary, but I couldn't get my tidyeval working inside the ggplot aes. Someone else might have an idea on that. Then make your plot, and label the axes and title accordingly. That leaves you with a list of 3 plots.
plots <- purrr::map(pairs, function(pair) {
  cols <- strsplit(pair, split = ",", fixed = T)[[1]]
  df %>%
    select(names, !!cols[1], !!cols[2]) %>%
    mutate_at(vars(-names), log2) %>%
    rename(exp1 = !!cols[1], exp2 = !!cols[2]) %>%
    ggplot(aes(x = exp1, y = exp2)) +
    geom_point() +
    labs(x = cols[1], y = cols[2], title = pair)
})
Use your method of choice to put the plots together however you want. I went with cowplot, but I also like the patchwork package.
cowplot::plot_grid(plotlist = plots, nrow = 1)
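If you'd rather use patchwork, something like this should be equivalent (wrap_plots() also accepts a list of plots):
patchwork::wrap_plots(plots, nrow = 1)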
This is probably not what you want, but if the purpose is to explore the correlation pattern between the variables, you may want to consider ggpairs from the GGally package. It provides not only scatter plots, but also correlation scores and distributions.
library(GGally)
ggpairs(df[, c("a", "b", "c")])
You could start by creating all combinations via combn and then work your way through:
library(purrr)
t(combn(names(df)[-1], 2)) %>%  ## get all combinations
  as.data.frame(stringsAsFactors = FALSE) %>%
  mutate(l = paste(V1, V2, sep = " vs. ")) %>%
  pmap_dfr(function(V1, V2, l)
    df %>%
      select(one_of(c(V1, V2))) %>%  ## select the elements given by the combination
      mutate_all(log2) %>%
      setNames(c("x", "y")) %>%
      mutate(experiment = l)) %>%
  ggplot(aes(x, y)) + geom_point() + facet_wrap(~experiment)

Applying functions on columns in nested data frame

I have data that I'm nesting into list columns, then I'd like to use purrr::map() to apply a plotting function separately to each column within the nested data frames. Minimal reproducible example:
library(dplyr)
library(tidyr)
library(purrr)
data <- data.frame(Type = c(rep('Type1', 20),
                            rep('Type2', 20),
                            rep('Type3', 20)),
                   Result1 = rnorm(60),
                   Result2 = rnorm(60),
                   Result3 = rnorm(60))
dataNested <- data %>% group_by(Type) %>% nest()
Say, I wanted to generate a histogram for Result1:Result3 for each element of dataNested$data:
dataNested%>%map(data,hist)
No variation of my code that I've tried will iterate separately over the columns within each nested dataframe.
Why would you need to complicate things in such a way when you're already in the tidyverse? List columns are rather a last-resort solution.
library(tidyverse)
data %>%
  gather(result, value, -Type) %>%
  ggplot(aes(value)) +
  geom_histogram() +
  facet_grid(Type ~ result)
gather reshapes the wide dataset into a long one, with a Type column, a result column, and a value column that holds all the numbers.
Perhaps do not create a nested data frame. We can split the data frame by the Type column and plot the histogram.
library(tidyverse)
dt %>%
  split(.$Type) %>%
  map(~ walk(.[-1], ~ hist(.)))
DATA
library(tidyverse)
set.seed(1)
dt <- data.frame(Type = c(rep('Type1', 20),
                          rep('Type2', 20),
                          rep('Type3', 20)),
                 Result1 = rnorm(60),
                 Result2 = rnorm(60),
                 Result3 = rnorm(60),
                 stringsAsFactors = FALSE)
So I think you are thinking about this the right way. Running this code:
dataNested$data[[1]]
You can see that you have data that you can iterate over. You can loop through it like:
for (i in dataNested) {
  print(i)
}
This demonstrates that the structure is nothing too complicated to work with. Okay, so how do we create the histograms? We can create a helper function:
helper_hist <- function(df) {
  lapply(df, hist)
}
And run using:
map(dataNested$data, helper_hist)
Hope this helps.

find grouping variable(s) from within a function called by do()

I am trying to call a plotting function on subgroups of a data.frame with dplyr::do(), producing one figure (ggplot object) per subgroup. I want the title of each figure based on the grouping variable. To do this, my function needs to know what the grouping variable is.
Currently, what gets passed to do() as . is an object of class tbl_df and data.frame. Without explicitly passing it as a separate variable, is there a way to inspect the data.frame directly to learn what the grouping variable(s) is/are?
The solutions posted here call for explicitly passing (each of) the grouping variables as an additional argument to the function. I'm wondering if there is a more elegant and general solution that is scalable to varying numbers of grouping variables. While in this specific instance I'm interested in plotting, there are other use cases where I want to know how the subgroups are defined from within the function called on each subgroup.
I don't want to guess by looking for columns where length(unique(col)) == 1, because that is going to lead to lots of false positives with my data.
Is there an elegant way to do this?
Here is some sample code to get started.
library(ggplot2)
my_plot <- function(df) {
  subgroup_name <- ""  # ??
  ggplot(df, aes(cty, hwy)) + geom_point() +
    ggtitle(subgroup_name)
}
mpg %>%
  group_by(manufacturer) %>%
  do(my_plots = my_plot(.))
I don't think it's possible to do this without passing the names of the grouping variable(s) into the function (I think the grouping variable "vars" attribute is lost after splitting the grouped_df data.frame, before executing the "do"). Here's an alternative solution that requires defining the grouping variable(s) in a vector before applying the dplyr group_by %>% do chain:
library(ggplot2)
library(dplyr)
my_plot <- function(df, group_vars) {
  # get plot name from value(s) in grouping variable(s)
  subgroup_name <- paste(df[1, group_vars], collapse = " ")
  ggplot(data = df, aes(cty, hwy)) + geom_point() + ggtitle(subgroup_name)
}
group1 <- "manufacturer"
plots1 <-
mpg %>%
group_by_(.dots = group1) %>%
do(my_plots = my_plot(., group1))
plots1$my_plots[1]
group2 <- c("manufacturer", "year")
plots2 <-
mpg %>%
group_by_(.dots = group2) %>%
do(my_plots = my_plot(., group2))
plots2$my_plots[2]
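For what it's worth, a hedged alternative under the assumption that a newer dplyr (0.8 or later) is available: group_map() passes the group keys to the mapped function as a second argument, so the grouping values are available inside the function without passing the variable names explicitly.
library(dplyr)
library(ggplot2)
plots3 <- mpg %>%
  group_by(manufacturer, year) %>%
  group_map(~ ggplot(.x, aes(cty, hwy)) + geom_point() +
              # .y is a one-row tibble holding this subgroup's key values
              ggtitle(paste(.y, collapse = " ")))
plots3[[1]]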
