scatter plot against all groups for a long data frame - r

I am pretty sure something like this is already asked but I don't know how to search for it.
I often get data in a wide format like in my little example with 3 experiments (a-c). I normally convert to long format and convert the values by some function (here log2 as an example).
What I often want to do is to plot all experiments against each other and here I am looking for a handy solution. How can I convert my data frame to get facets for example with a~b, a~c and b~c...
So far I tidyr::spread the data again and run a ggplot command three times, with the individual column names as x and y. Later I merge the individual graphs together.
Is there a more convenient way?
library(dplyr)
library(tidyr)
library(ggplot2)
df <- data.frame(
names=letters,
a=1:26,
b=1:13,
c=11:36
)
df %>%
tidyr::gather(experiment, value, -names) %>%
mutate(log2.value=log2(value))
EDIT
Since I got a very useful answer from @hdkrgr, I adapted my code a bit. The inner_join was a great trick that lets me automate my idea. What I still miss is a clever filter to get rid of the redundant data, since I don't want to plot c~c, or b~a if I already plot a~b.
I solved this for now by providing the pairings I want, but can anyone think of a straightforward solution? I couldn't think of something that gives me the unique pairings.
my_pairs <- c('a vs. b', 'a vs. c', 'b vs. c')
df %>%
as_tibble() %>%
tidyr::gather(experiment, value, -names) %>%
mutate(log2.value=log2(value)) %>%
inner_join(., ., by=c("names")) %>%
mutate(pairing=sprintf('%s vs. %s', experiment.x, experiment.y)) %>%
filter(pairing %in% my_pairs) %>%
ggplot(aes(log2.value.x, log2.value.y)) +
geom_point() +
facet_wrap( ~ pairing, labeller=label_both)

One way starting from long format would be to do a self-join on the long-data in order to get all combinations of two experiments in each row:
df %>%
tidyr::gather(experiment, value, -names) %>%
mutate(log2.value=log2(value)) %>%
inner_join(., ., by=c("names")) %>%
ggplot(aes(log2.value.x, log2.value.y)) + geom_point() + facet_grid(experiment.y ~ experiment.x)
Edit: To avoid plotting redundant experiment-pairs, you can do:
df %>%
tidyr::gather(experiment, value, -names) %>%
mutate(log2.value=log2(value)) %>%
inner_join(., ., by=c("names")) %>%
filter(experiment.x < experiment.y) %>%
ggplot(aes(log2.value.x, log2.value.y)) +
geom_point() +
facet_wrap(~experiment.y + experiment.x)
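To check that the string comparison really keeps each pair only once, here is a minimal sketch of the same self-join without the plotting step. It assumes pivot_longer (the current replacement for gather) and dplyr >= 1.1.0 for the relationship argument; the names long, pairs_long and distinct_pairs are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same toy data as in the question (b written out with rep() for clarity)
df <- data.frame(names = letters, a = 1:26, b = rep(1:13, 2), c = 11:36)

long <- df %>%
  pivot_longer(-names, names_to = "experiment", values_to = "value") %>%
  mutate(log2.value = log2(value))

pairs_long <- long %>%
  # self-join: every experiment is paired with every experiment per name
  # (relationship = "many-to-many" assumes dplyr >= 1.1.0)
  inner_join(long, by = "names", suffix = c(".x", ".y"),
             relationship = "many-to-many") %>%
  # keep a~b but drop b~a and a~a
  filter(experiment.x < experiment.y) %>%
  mutate(pairing = paste(experiment.x, "vs.", experiment.y))

distinct_pairs <- sort(unique(pairs_long$pairing))
distinct_pairs
```

The pairing column can then be passed to facet_wrap(~ pairing) in place of the two experiment columns.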

This is really interesting because it's actually more complex than it first seems. One thing that sticks out is getting unique pairs of experiments—it seems like you'd want a vs b but not necessarily b vs a as well. To do that, you need the unique set of experiment pairs.
Initially, I tried to work from your gathered data, but realized it might be simpler to start from the wide version. Take the names of the experiments from the column names—you can do this multiple ways, but I just took the strings that aren't "names"—and get the combinations of them. I pasted them together to make them a little easier to work with.
library(dplyr)
library(tidyr)
library(ggplot2)
df <- data.frame(
names=letters,
a=1:26,
b=1:13,
c=11:36
) %>%
as_tibble()
exp <- stringr::str_subset(names(df), "names", negate = T)
pairs <- combn(exp, 2, paste, simplify = F, collapse = ",") %>%
unlist()
pairs
#> [1] "a,b" "a,c" "b,c"
Then, for each pair, extract the associated column names, do a little tidyeval to select those columns, do the log2 transform that you had. I had to detour here to rename the columns with something I could refer back to—I think this isn't necessary, but I couldn't get my tidyeval working inside the ggplot aes. Someone else might have an idea on that. Then make your plot, and label the axes and title accordingly. That leaves you with a list of 3 plots.
plots <- purrr::map(pairs, function(pair) {
cols <- strsplit(pair, split = ",", fixed = T)[[1]]
df %>%
select(names, !!cols[1], !!cols[2]) %>%
mutate_at(vars(-names), log2) %>%
rename(exp1 = !!cols[1], exp2 = !!cols[2]) %>%
ggplot(aes(x = exp1, y = exp2)) +
geom_point() +
labs(x = cols[1], y = cols[2], title = pair)
})
Use your method of choice to put the plots together however you want. I went with cowplot, but I also like the patchwork package.
cowplot::plot_grid(plotlist = plots, nrow = 1)
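On the open point about referring to the columns inside aes: one option that may work is the .data pronoun, which ggplot2 supports for looking up a column by a string. The sketch below is an assumption about how the rename detour could be skipped, not a tested drop-in for the code above; exp_cols, pair_list and plots are illustrative names.

```r
library(ggplot2)

df <- data.frame(names = letters, a = 1:26, b = rep(1:13, 2), c = 11:36)

# all unordered pairs of experiment columns, as a list of length-2 vectors
exp_cols <- setdiff(names(df), "names")
pair_list <- combn(exp_cols, 2, simplify = FALSE)

plots <- lapply(pair_list, function(cols) {
  # .data[[...]] looks the column up by its name, so no renaming is needed
  ggplot(df, aes(x = log2(.data[[cols[1]]]), y = log2(.data[[cols[2]]]))) +
    geom_point() +
    labs(x = cols[1], y = cols[2], title = paste(cols, collapse = " vs. "))
})
```

The resulting list can be handed to cowplot::plot_grid(plotlist = plots, nrow = 1) as before.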

This is probably not what you want, but if the purpose is to explore the correlation pattern between each variable, you may want to consider ggpairs from the GGally package. It provides not only scatter plots, but also correlation score and distribution.
library(GGally)
ggpairs(df[, c("a", "b", "c")])

You could start by creating all combinations via combn and then work your way through:
library(purrr)
t(combn(names(df)[-1], 2)) %>% ## get all combinations
as.data.frame(stringsAsFactors = FALSE) %>%
mutate(l = paste(V1, V2, sep = " vs. ")) %>%
pmap_dfr(function(V1, V2, l)
df %>%
select(one_of(c(V1, V2))) %>% ## select the elements given by the combination
mutate_all(log2) %>%
setNames(c("x", "y")) %>%
mutate(experiment = l)) %>%
ggplot(aes(x, y)) + geom_point() + facet_wrap(~experiment)

Related

How to aggregate count by grouped rows of multiple columns inside pipes?

I want to get the head of the count of grouped rows by multiple columns in ascending order for a plot.
I found some answers on the internet but nothing seems to work when I try to merge it with arrange and pipes.
df_Cleaned %>%
head(arrange(aggregate(df_Cleaned$Distance,
by = list(df_Cleaned$start_station_id, df_Cleaned$end_station_id),
FUN = nrow)))) %>%
ggplot(mapping = aes(x = ride_id, color = member_casual)) +
geom_bar()
it seems to have problems with df_Cleaned$ since it's required in front of each column.
I hope I understood your meaning correctly. If you want to group your data by the columns Distance, start_station_id, and end_station_id and then count how many values there are under each group and then take only the head of those values, then maybe the following code will help using tidyverse:
df_Cleaned %>%
group_by(Distance, start_station_id, end_station_id) %>%
count() %>%
head()
In addition, it seems like you are later trying to plot using a variable you did not group by, so either add it to your group_by or choose a different variable to plot by.
We may use add_count to create a count column by 'start_station_id' and 'end_station_id' and sort by it, then filter on the first 6 unique values (head) or last 6 (tail) of 'n' and plot that subset of the data:
library(dplyr)
library(ggplot2)
df_Cleaned %>%
add_count(start_station_id, end_station_id, sort = TRUE) %>%
filter(n %in% head(unique(n), 6)) %>%
ggplot(mapping = aes(x = ride_id, color = member_casual)) +
geom_bar()
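Since the question did not include data, here is a small sketch on a made-up df_Cleaned (the values are invented for illustration; only the column names come from the question). It counts rides per start/end pair with count, keeps the most frequent routes, and uses semi_join to keep the rides on those routes so the remaining columns are still available for plotting.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for df_Cleaned
df_Cleaned <- data.frame(
  ride_id = 1:8,
  start_station_id = c("A", "A", "A", "A", "B", "B", "C", "A"),
  end_station_id   = c("B", "B", "B", "B", "C", "C", "A", "C"),
  member_casual    = c("member", "casual", "member", "member",
                       "casual", "member", "casual", "member")
)

# number of rides per route, most frequent first, top 2 only
top_routes <- df_Cleaned %>%
  count(start_station_id, end_station_id, sort = TRUE) %>%
  head(2)

# keep only the rides that belong to those routes
top_rides <- df_Cleaned %>%
  semi_join(top_routes, by = c("start_station_id", "end_station_id")) %>%
  mutate(route = paste(start_station_id, end_station_id, sep = " -> "))

p <- ggplot(top_rides, aes(x = route, fill = member_casual)) +
  geom_bar()
```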

R Function to Create Custom Data Frames from Larger Data Frame

Ok, So I found somewhat similar questions asked of this already, but I'm not quite getting it. So, here is my example. I have a very large table of data that has a basic setup like the small example data below. I will try to explain very clearly what I am wanting to do. I'm guessing maybe it's easier to do than I think, but I'm not really good at creating functions or for-loops at this point, and I'm guessing that's what I need. So here is the basic setup for my data.
test_year <- c(2019,2019,2019,2020,2020,2020,2021,2021,2021)
SN <- c(1001,1002,1003,1004,1005,1006,1007,1008,1009)
Owner <- c("Adam","Bob","Bob","Carl","Adam","Bob","Adam","Carl","Adam")
ObsA <- c(0,0,1,1,0,1,1,NA,1)
ObsB <- c(1,1,1,0,0,0,0,0,1)
ObsC <- c(0,0,0,0,1,1,0,0,0)
df <- data.frame(test_year, SN, Owner, ObsA, ObsB, ObsC)
From this, I need to be able to create smaller data frames by selecting individual observation columns. So if this were a small data set:
df_A <- df %>% select(test_year, SN, Owner, ObsA)
and then have a data frame for each of the other observations. And yes, it is easier to select the columns that I want versus the columns I don't want as most of the columns selected will be standard, and I just need to change which observation is picked out of over 40 in my real data.
From these smaller data frames, I will be doing numerous other operations including making multiple tables and graphs. As examples, the following are similar to the types of graphs I will make (with some additional formatting that is simple enough). Notice too in these graphs a title that is based on (though not identical to), the column selected.
df_A[is.na(df_A)] = 0
df_A
df_A %>% group_by(test_year) %>%
summarize(n = n(), obs = sum(ObsA)) %>%
ggplot(aes(x = test_year, y = 100*obs/n)) +
ggtitle("Observation A") +
geom_point()
df_A %>% group_by(Owner) %>%
summarize(n = n(), obs = sum(ObsA)) %>%
ggplot(aes(x = Owner, y = 100*obs/n)) +
ggtitle("Observation A") +
geom_bar(stat = "identity") +
coord_flip() +
scale_x_discrete()
As I said, additional analysis will also need to be done. So, I'm needing help figuring out how I can structure a function to do what it is I'm wanting to do. Thanks!
Here is a way to return a list of plots.
Split all the 'Obs' columns in a list of dataframes, use imap to pass dataframe along with the column name (to use it as title).
library(tidyverse)
common_cols <- 1:3
df[is.na(df)] = 0
list_plots <- df %>%
select(starts_with('Obs')) %>%
split.default(names(.)) %>%
imap(~{
tmp <- df[common_cols] %>% bind_cols(.x)
tmp %>% group_by(test_year) %>%
summarize(n = n(), obs = sum(.data[[.y]])) %>%
ggplot(aes(x = factor(test_year), y = 100*obs/n)) +
geom_point() +
labs(x = 'Year', y = 'ratio', title = .y)
})
Individual plots can be accessed by list_plots[[1]],list_plots[[2]] etc.
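If the smaller data frames or summary tables are needed in addition to the plots, a plain function that takes the observation column as a string may be enough. This is only a sketch: summarise_obs and its arguments are made-up names, and it assumes the three standard columns are always test_year, SN and Owner.

```r
library(dplyr)

df <- data.frame(
  test_year = c(2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021),
  SN = 1001:1009,
  Owner = c("Adam", "Bob", "Bob", "Carl", "Adam", "Bob", "Adam", "Carl", "Adam"),
  ObsA = c(0, 0, 1, 1, 0, 1, 1, NA, 1),
  ObsB = c(1, 1, 1, 0, 0, 0, 0, 0, 1),
  ObsC = c(0, 0, 0, 0, 1, 1, 0, 0, 0)
)

# hypothetical helper: select one observation column and summarise it by a grouping column
summarise_obs <- function(data, obs_col, by = "test_year") {
  data %>%
    select(all_of(c("test_year", "SN", "Owner", obs_col))) %>%
    # treat missing observations as 0, as in the question
    mutate(across(all_of(obs_col), ~ replace(.x, is.na(.x), 0))) %>%
    group_by(across(all_of(by))) %>%
    summarise(n = n(), obs = sum(.data[[obs_col]]), .groups = "drop") %>%
    mutate(pct = 100 * obs / n)
}

res <- summarise_obs(df, "ObsA")
res_owner <- summarise_obs(df, "ObsB", by = "Owner")
```

Running lapply(c("ObsA", "ObsB", "ObsC"), summarise_obs, data = df) would give one summary table per observation column.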

How to make dot plot with multiple data points for single variable?

I would like to create a dot plot for my data set. I know how to create a normal dot plot for treatment comparisons or similar data sets using ggplot. I have the following data and would like to create a dot plot with three different colors. Please suggest how to prepare the data for this plot. If I had a single data point in NP and P it would be easy, as I have already worked with similar data, but I have no idea how to handle this kind of data. I can use ggplot in R.
The variable W always has a single data point, while NP and P have a varying number of data points, i.e. sometimes one in NP and sometimes three, and the same for P, as shown in the table.
Here is the screen shot for my data.
Sorry for my language
I agree my data is a mess. I googled and did some coding to get the plot. I used the tidyverse and dplyr packages to produce the plot, but there is a problem with the y-axis: it is very cluttered. I used the following code:
d <- read.table("Data1.txt", header = TRUE, sep = "\t", stringsAsFactors = NA)
df <- data.frame(d)
df <- df %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(!ID, names_to="colid", values_to="val") %>%
separate_rows(val, sep="\t", convert=TRUE) %>%
mutate(ID=as_factor(ID))
Then I plot the graph with ggplot
ggplot(df, aes(x=ID, y=val, color=colid))+geom_point(size=1.5) +theme(axis.text.x = element_text(angle = 90))
The output is this. I tried to adjust Y-axis with ylim and scale_y_discrete() but nothing worked. Please suggest a way to rectify it.
This contains many necessary steps for data cleaning, as suggested by user Dan Adams in the comment. This was kind of fun, and it helped me procrastinate my own thesis.
I am using a function from a very famous thread which offers a way to split columns when the number of resulting columns is unknown.
P.S. The way you shared the data was less than ideal.
#your data is unreadable without this awesome package
# devtools::install_github("alistaire47/read.so")
library(tidyverse)
df <- read.so::read_md("|ID| |W| |NP| |P|
|:-:| |:-:| |:-:| |:-:|
|1| |4.161| |1.3,1.5| |1.5,2.8|
|2| |0.891| |1.33,1.8,1.79| |1.6|
|3| |7.91| |4.3| |0.899,1.43,0.128|
|40| |2.1| |1.4,0.99,7.9,0.32| |0.6,0.5,1.57|") %>%select(-starts_with("x"))
#> Warning: Missing column names filled in: 'X2' [2], 'X4' [4], 'X6' [6]
# from this thread https://stackoverflow.com/a/47060452/7941188
split_into_multiple <- function(column, pattern = ", ", into_prefix){
cols <- str_split_fixed(column, pattern, n = Inf)
cols[which(cols == "")] <- NA
cols <- as_tibble(cols)
m <- dim(cols)[2]
names(cols) <- paste(into_prefix, 1:m, sep = "_")
cols
}
# apply this over the columns of interest
# (df[[x]] picks the column named by x, so NP and P are each split from their own data)
ls_cols <- lapply(c("NP", "P"), function(x) split_into_multiple(df[[x]], pattern = ",", x))
# bind it to the single columns of the old data frame
# convert character columns to numeric
# apply pivot longer twice (there might be more direct options, but I won't be
# bothered to do too much here)
df_new <-
bind_cols(df[c("ID", "W")], ls_cols) %>%
pivot_longer(cols = c(-ID,-W), names_sep = "_", names_to = c(".value", "value")) %>%
mutate(across(c(P, NP), as.numeric)) %>%
select(-value) %>%
pivot_longer(W:P, names_to = c("var"), values_to = "value")
# The new tidy data can easily be plotted
ggplot(df_new, aes(ID, value, color = var)) +
geom_point()
#> Warning: Removed 13 rows containing missing values (geom_point).
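A shorter route that may also work is tidyr::separate_rows, which puts each comma-separated value on its own row, so no padding with NA is needed. The sketch below types the example table in by hand (an assumption, since the original file is not available) and converts the values to numeric, which is what keeps the y-axis continuous instead of one tick per distinct string.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Example table from the question, typed in by hand
d <- data.frame(
  ID = c(1, 2, 3, 40),
  W  = c(4.161, 0.891, 7.91, 2.1),
  NP = c("1.3,1.5", "1.33,1.8,1.79", "4.3", "1.4,0.99,7.9,0.32"),
  P  = c("1.5,2.8", "1.6", "0.899,1.43,0.128", "0.6,0.5,1.57")
)

tidy <- d %>%
  mutate(across(c(W, NP, P), as.character)) %>%          # common type for pivoting
  pivot_longer(-ID, names_to = "var", values_to = "value") %>%
  separate_rows(value, sep = ",") %>%                     # one value per row
  mutate(value = as.numeric(value), ID = factor(ID))     # numeric y, discrete x

p <- ggplot(tidy, aes(ID, value, color = var)) +
  geom_point()
```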

lapply strings to subset via brackets[]

There is almost certainly an easier way to go about this, but perhaps I've just been awake too long. I want to use the following vector of strings:
lap_list <- paste0(seq(1,length(mpg[[1]]),10), ":", seq(10,length(mpg[[1]]),10))
and use the vector to subset such as mpg[lap_list[1], ]. Alternatively, I could use dplyr for something with slice:
mpg %>%
slice(lap_list[1])
Both methods give the same error, and beyond eval(parse()) or as.numeric() I'm having a hard time wording my question for Google.
The ultimate goal is to have a function such that I could lapply the graph outputs. Say:
barchart <- function(data_slice) {
mpg %>%
slice(data_slice) %>%
ggplot(aes(x=model)) + geom_bar()
}
lapply(lap_list, barchart)
If you build the row sequences as strings using paste0, you don't have much option other than to use eval(parse()) in one way or another.
An alternative is to create a sequence of rows that you want to subset and store it in vectors. Pass them in Map to slice from the data and then plot.
library(dplyr)
library(ggplot2)
n <- nrow(mpg)
start <- seq(1,n,10)
#Added an extra `n` here to make the length of start and end equal
end <- c(seq(10,n,10), n)
barchart <- function(data, start, end) {
data %>%
slice(start:end) %>%
ggplot(aes(x=model)) + geom_bar()
}
list_of_plots <- Map(barchart, start, end, MoreArgs = list(data = mpg))
You can access each individual plot using list_of_plots[[1]], list_of_plots[[2]] etc.
Perhaps, you can also create groups of 10 rows and store the plots in the dataframe :
mpg %>%
group_by(grp = ceiling(row_number()/10)) %>%
summarise(plot = list(ggplot(cur_data(), aes(x=model)) + geom_bar()))
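The same chunking can also be sketched with base split, which avoids building start and end vectors by hand. This assumes the mpg data set shipped with ggplot2 (234 rows); chunk_id, chunks and plots are illustrative names.

```r
library(ggplot2)

# chunk index: rows 1-10 get 1, rows 11-20 get 2, and so on
chunk_id <- ceiling(seq_len(nrow(mpg)) / 10)

# list of data frames, one per block of (up to) 10 rows
chunks <- split(mpg, chunk_id)

# one bar chart per chunk
plots <- lapply(chunks, function(chunk) {
  ggplot(chunk, aes(x = model)) + geom_bar()
})
```

The last chunk simply holds the leftover rows, so no extra end value has to be appended.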

Generate histograms out of dplyr pipe

I have a dataset that I want to group_by() and generate a histogram for each group. My current code is as follows:
df %>%
group_by(x2) %>%
with(hist(x3,breaks = 50))
This however generates a single histogram of the entirety of x3 rather than several chunks of x3 here is some example data
df = data.frame(x1 = rep(c(1998,1999,2000),9),
x2 = rep(c(1,1,1,2,2,2,3,3,3),3),
x3 = rnorm(27,.5))
desired output:
actual output:
My comment about do is dated, I guess. ?do points us to the current ?group_walk:
df %>%
group_by(x2) %>%
group_walk(~ hist(.x$x3))
In versions of dplyr < 0.8.0, there is no group_walk, so you can use do:
df %>%
group_by(x2) %>%
do(h = hist(.$x3))
Assuming you only want the side-effects of hist (printed histogram), not the returned values, you can add a %>% invisible() to the end of the chain to not print the resulting tibble.
I think it's time to advance to ggplot, for instance:
library(tidyverse)
df %>%
ggplot(aes(x = x3)) +
geom_histogram(bins = 50) +
facet_wrap(~x2) # optional: use argument "ncol = 1"
You can use split.data.frame to split the data by category and then run hist on each data frame in the resulting list:
list_df <- split.data.frame(df, f= df$x2)
par(mfrow = c(round(length(list_df), 0), 1))
for( lnam in names(list_df)){
hist(list_df[[lnam]][, "x3"])
}
I really like @Gregor's answer with group_walk, but it's still listed as experimental in dplyr v0.8.0.1. If you want to avoid working with functions that may break later, I'd use base split, then purrr::walk. I'm using walk and plot to avoid all the text printout that hist gives.
library(dplyr)
library(purrr)
df %>%
split(.$x2) %>%
walk(~hist(.$x3) %>% plot())
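For completeness, a base-R-only sketch of the same idea: split the x3 values by x2, compute one histogram object per group with plot = FALSE, and draw them afterwards. The seed and the name hists are additions for illustration.

```r
set.seed(1)  # only so the example is reproducible
df <- data.frame(x1 = rep(c(1998, 1999, 2000), 9),
                 x2 = rep(c(1, 1, 1, 2, 2, 2, 3, 3, 3), 3),
                 x3 = rnorm(27, .5))

# one histogram object per level of x2, computed without drawing
hists <- lapply(split(df$x3, df$x2), hist, plot = FALSE, breaks = 50)

# draw them stacked in one column, labelled by group
old_par <- par(mfrow = c(length(hists), 1))
for (g in names(hists)) {
  plot(hists[[g]], main = paste("x2 =", g), xlab = "x3")
}
par(old_par)
```

Keeping the histogram objects in a list also gives access to the bin counts and breaks of each group if they are needed later.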

Resources