Here is sample data where ID is a categorical variable.
ID <- c(12, 34, 560, 45, 235)
W1 <- c(0, 5, 7, 6, 0)
W2 <- c(7, 8, 9, 5, 2)
W3 <- c(0, 0, 3, 5, 9)
df <- data.frame(ID, W1, W2, W3)
df$ID <- as.factor(df$ID)
I want to draw one bar plot for each of these five IDs, using the frequency data for the three weeks W1:W3. In the actual dataset I have 30+ weeks and around 150 IDs, so the intention here is to do this efficiently. Nothing fancy, but ggplot would be ideal as I would need to manipulate some aesthetics.
How can I do this using a loop and save the images in one file (PDF)?
Thanks for your help!
This sort of problem is usually a data reformatting problem. See "Reshaping data.frame from wide to long format". After reshaping the data to long format, the plot can be faceted by ID, which avoids loops.
library(ggplot2)
ID <- c(12, 34, 560, 45, 235)
W1 <- c(0, 5, 7, 6, 0)
W2 <- c(7, 8, 9, 5, 2)
W3 <- c(0, 0, 3, 5, 9)
df <- data.frame(ID, W1, W2, W3)
df$ID <- as.factor(df$ID)
df[-1] <- lapply(df[-1], as.integer)
df |>
tidyr::pivot_longer(-ID, names_to = "Week", values_to = "Frequency") |>
ggplot(aes(Week, Frequency, fill = Week)) +
geom_col() +
scale_y_continuous(breaks = scales::pretty_breaks()) +
facet_wrap(~ ID) +
theme_bw(base_size = 16)
Created on 2022-09-30 with reprex v2.0.2
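With around 150 IDs a single faceted page gets crowded, and the question also asks for one PDF file. A minimal sketch of one way to do that, assuming the same long-format data: split the IDs into pages and print one faceted plot per page to a single pdf() device. The names per_page and out_file and the temporary output location are illustrative choices, not part of the original answer.

```r
library(ggplot2)
library(tidyr)

df <- data.frame(
  ID = factor(c(12, 34, 560, 45, 235)),
  W1 = c(0, 5, 7, 6, 0),
  W2 = c(7, 8, 9, 5, 2),
  W3 = c(0, 0, 3, 5, 9)
)
long <- pivot_longer(df, -ID, names_to = "Week", values_to = "Frequency")

# Hypothetical setting: number of IDs (facets) drawn on each PDF page.
per_page <- 2
ids <- levels(long$ID)
pages <- split(ids, ceiling(seq_along(ids) / per_page))

# Hypothetical output path; replace with the file you actually want.
out_file <- file.path(tempdir(), "barplots_by_id.pdf")

pdf(out_file, width = 8, height = 6)
for (page_ids in pages) {
  p <- ggplot(long[long$ID %in% page_ids, ],
              aes(Week, Frequency, fill = Week)) +
    geom_col() +
    facet_wrap(~ ID) +
    theme_bw(base_size = 16)
  print(p)  # inside a loop, ggplot objects must be printed explicitly
}
dev.off()
```

Each call to print() starts a new page, so the file ends up with one page per group of IDs. Setting per_page to 1 gives one plot per ID.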
Edit
If the week numbers are a mix of 1-digit and 2-digit values, lexicographic order does not match numeric order. For instance, W11 comes after W1 and before W2. The function str_sort from package stringr sorts by the embedded numbers when called with numeric = TRUE.
In the example below I reuse the data, changing W2 to W11. The correct bar order should therefore be W1, W3, W11.
library(ggplot2)
library(stringr)
ID <- c(12, 34, 560, 45, 235)
W1 <- c(0, 5, 7, 6, 0)
W11 <- c(7, 8, 9, 5, 2)
W3 <- c(0, 0, 3, 5, 9)
df <- data.frame(ID, W1, W11, W3)
df$ID <- as.factor(df$ID)
df[-1] <- lapply(df[-1], as.integer)
df |>
tidyr::pivot_longer(-ID, names_to = "Week", values_to = "Frequency") |>
dplyr::mutate(Week = factor(Week, levels = str_sort(unique(Week), numeric = TRUE))) |>
ggplot(aes(Week, Frequency, fill = Week)) +
geom_col() +
scale_y_continuous(breaks = scales::pretty_breaks()) +
facet_wrap(~ ID) +
theme_bw(base_size = 16)
Created on 2022-10-01 with reprex v2.0.2
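If adding stringr is not an option, the same ordering can be sketched in base R. This assumes every week label is the letter W followed only by digits. The variable names here are illustrative.

```r
weeks <- c("W1", "W11", "W3")

# Plain sort() is lexicographic, so "W11" lands before "W3".
lexicographic <- sort(weeks)

# Strip the leading "W", convert to integer, and order by that number.
natural <- weeks[order(as.integer(sub("^W", "", weeks)))]

# Use the natural order as factor levels so ggplot2 draws bars in week order.
week_factor <- factor(weeks, levels = natural)
levels(week_factor)  # "W1" "W3" "W11"
```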
Related
I have two dataframes that have mostly the same variables and I want to compare the cases of the two dataframes. I want to create a new dataframe with all the cases that are the same in df1 and df2.
Cases are assumed to be the same if all values of the variables that are present in both dataframes are the same. There are two exceptions: for the variable "Age", cases count as the same if the values differ by at most 1 year, and for the variable "Time" a difference of up to 1 hour is acceptable.
ID1 <- c(100, 101, 102, 103)
V1 <- c(1, 1, 2, 1)
V2 <- c(1, 2, 3, 4)
Age <- c(25, 16, 74, 46)
Time <- c("9:30", "13:25", "17:20", "7:45")
X <- c (1, 3, 4, 1)
df1 <- data.frame(ID1, V1, V2, Age, Time, X)
ID2 <- c(250, 251, 252, 253)
V1 <- c(1, 2, 1, 2)
V2 <- c(1, 2, 2, 4)
Age <- c(26, 55, 16, 80)
Time <- c("9:30", "12:00", "12:55", "18:00")
Y <- c (3, 2, 1, 1)
df2 <- data.frame(ID2, V1, V2, Age, Time, Y)
In this example ID1=100 and ID2=250 are the same and also ID1=101 and ID2=252.
I'd like to get a new dataframe that contains these matched cases as output.
Note that it is not important whether the values for "Age" and "Time" are taken from df1 or df2. The important variables are X and Y.
I hope someone can help me out with this problem. Thanks a lot in advance :)
Kind regards
Philip
In base R:
df3 <- merge(df1, subset(df2, select = -c(Age, Time)), by = c("V1", "V2"))
df3[,c("ID1", "ID2", "V1", "V2", "Age", "Time", "X", "Y")]
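The merge above matches on V1 and V2 only and takes Age and Time from df1, so it does not check the tolerances from the question. A hedged sketch of one way to add them: keep both copies of Age and Time in the merge, then filter on the differences. The helper to_minutes is hypothetical and assumes Time is always a 24-hour "H:MM" string.

```r
df1 <- data.frame(ID1 = c(100, 101, 102, 103), V1 = c(1, 1, 2, 1),
                  V2 = c(1, 2, 3, 4), Age = c(25, 16, 74, 46),
                  Time = c("9:30", "13:25", "17:20", "7:45"),
                  X = c(1, 3, 4, 1))
df2 <- data.frame(ID2 = c(250, 251, 252, 253), V1 = c(1, 2, 1, 2),
                  V2 = c(1, 2, 2, 4), Age = c(26, 55, 16, 80),
                  Time = c("9:30", "12:00", "12:55", "18:00"),
                  Y = c(3, 2, 1, 1))

# Hypothetical helper: convert "H:MM" strings to minutes since midnight.
to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
         numeric(1))
}

# Exact match on the shared variables; Age and Time get suffixes .1 and .2.
m <- merge(df1, df2, by = c("V1", "V2"), suffixes = c(".1", ".2"))

# Keep pairs whose Age differs by at most 1 year and Time by at most 60 minutes.
keep <- abs(m$Age.1 - m$Age.2) <= 1 &
  abs(to_minutes(m$Time.1) - to_minutes(m$Time.2)) <= 60

matched <- m[keep, c("ID1", "ID2", "V1", "V2", "Age.1", "Time.1", "X", "Y")]
matched
```

With the sample data this keeps the two pairs named in the question (ID1 = 100 with ID2 = 250, and ID1 = 101 with ID2 = 252). If more shared variables exist, they would need to be added to the by argument.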
My data is currently in the format of df1:
outcome <- c("success", "failure", "success", "failure", "success", "failure")
basketball <- c(10, 7, 7, 8, 9, 10)
soccer <- c(8, 21, 30, 21, 6, 10)
football <- c(9, 2, 1, 3, 1, 5)
df1 <- data.frame(outcome, basketball, soccer, football)
And I would like it to be in the format of df2, so I can more easily create a bar graph with ggplot2.
symptom <- c("basketball", "basketball", "soccer", "soccer", "football", "football")
mean <- c(10, 6, 9, 7, 3, 1)
sd <- c(1, 2, 1, 3, 0.5, 0.2)
df2 <- data.frame(outcome, symptom, mean, sd)
Currently I have a lot of code that can get me there in a roundabout way, but I feel like there must be a streamlined way to do this in a few lines of code. Is there a way to do this using dplyr or tidyr verbs?
Thanks!
We can reshape to 'long' format with pivot_longer and then summarise by group:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = basketball:football, names_to = 'symptom') %>%
group_by(outcome, symptom) %>%
summarise(mean = mean(value), sd = sd(value), .groups = 'drop')
If we also need to plot, the summary can be piped straight into ggplot:
library(ggplot2)
df1 %>%
pivot_longer(cols = basketball:football, names_to = 'symptom') %>%
group_by(outcome, symptom) %>%
summarise(mean = mean(value), sd = sd(value), .groups = 'drop') %>%
ggplot(aes(x = outcome, y = mean, fill = symptom)) +
geom_bar(position = position_dodge(), stat = 'identity') +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd),
width = .2, position = position_dodge(.9))
I am trying to set keys on a data.table and keep the original column names in the second row. Everything I have tried so far changes the column names to the keys and erases the original names. I have ten data.tables to merge and all the variables have different names, as in the example. So I made keys, but I would like to keep the originals as well before harmonisation, just to be sure.
library(tidyverse)
library(lubridate)
library(forcats)
library(stringr)
library(data.table)
library(rio)
library(dplyr)
1. Keys
keys1 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
keys2 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
2. data.table example with variable names.
TD3 = data.table(q128 = c(1, 2, 1, 2), q129 = c(1, 5, 2, 4), q130 = c(0.8, 3.0, 10.0, NA), q131 = c(55, 56, 80, 79))
TD3
TD4 = data.table(q128 = c(1, 1, 1, 2), q129 = c(1, 3, 2, 999), q130 = c(0.9, 3.1, NA, 9.0), q131 = c(58, 60, 45, NA))
TD4
I'm not sure this is really the data structure you want, since it mixes variable types within each column, as r2evans said. However, this solution works: just put all your little data.tables into a list and apply the same function to each.
I noticed that keys1 and keys2 are identical, so I just used one of them. If each table needs different keys, those can be put in a list as well.
keys1 <- c("SDC_GENDER","SDC_CHILD_NB","LAB_CRP","PM_HIP")
TD <- list()
TD[[1]] = data.table(q128 = c(1, 2, 1, 2), q129 = c(1, 5, 2, 4), q130 = c(0.8, 3.0, 10.0, NA), q131 = c(55, 56, 80, 79))
TD[[2]] = data.table(q128 = c(1, 1, 1, 2), q129 = c(1, 3, 2, 999), q130 = c(0.9, 3.1, NA, 9.0), q131 = c(58, 60, 45, NA))
TD <- lapply(TD, FUN = function(x){
oldcolumns <- colnames(x)
td <- data.table(
'V1' = oldcolumns[1],
'V2' = oldcolumns[2],
'V3' = oldcolumns[3],
'V4' = oldcolumns[4]
)
colnames(td) <- keys1
colnames(x) <- keys1
x <- rbind(td, x)
return(x)
})
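Because the original names are stored as a data row, rbind has to coerce every column to character. A hedged alternative sketch, assuming the goal is only to remember which original name belongs to which key: keep the mapping in a separate lookup table and rename in place with setnames(). The object name_map and its column names are illustrative.

```r
library(data.table)

keys1 <- c("SDC_GENDER", "SDC_CHILD_NB", "LAB_CRP", "PM_HIP")
TD3 <- data.table(q128 = c(1, 2, 1, 2), q129 = c(1, 5, 2, 4),
                  q130 = c(0.8, 3.0, 10.0, NA), q131 = c(55, 56, 80, 79))

# Hypothetical lookup table: one row per variable, key next to original name.
name_map <- data.table(harmonised = keys1, original = names(TD3))

# Rename by reference; the data columns keep their numeric type.
setnames(TD3, old = name_map$original, new = name_map$harmonised)

name_map
TD3
```

With ten tables, one such lookup table per source (or a single one with an extra source column) keeps the original names available without mixing them into the data.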
I have a dataframe with data on the number of TVs and radios owned by survey respondents in three different countries (Canada, Mexico, US) at two different points in time (now and before):
DF <- data.frame(TV_now = as.numeric(c(4, 9, 1, 0, 4, NA)),
TV_before = as.numeric(c(4, 1, 2, 4, 5, 2)),
Radio_now = as.numeric(c(4, 5, 1, 5, 6, 9)),
Radio_before = as.numeric(c(6, 5, 3, 6, 7, 10)),
Country = as.factor(c("Mexico", "Canada", "US", "US", "Canada", "US")))
I want to sum the total value of each variable and then create a barplot that shows the number of TVs and radios owned by survey respondents now and before per country.
Now, if my dataframe didn't contain the Country factor, I could generate the plot in this way:
library(tidyverse)
library(ggplot2)
DF %>% mutate_all(funs(sum), na.rm = TRUE) %>%
gather(key=Device, value=Number) %>%
ggplot(aes(x=Device,fill=Device)) +
geom_bar(aes(x = Device, y = Number), position = "dodge", stat = "identity")
However, the variation
DF %>% mutate_all(funs(sum), na.rm = TRUE) %>%
gather(key=Device, value=Number, -Country) %>%
ggplot(aes(x=Device,fill=Device)) +
geom_bar(aes(x = Device, y = Number), position = "dodge", stat = "identity") +
facet_wrap(~Country)
results in the error:
Error in mutate_impl(.data, dots) :
Evaluation error: ‘sum’ not meaningful for factors.
Is there a way to exclude the factor from sum, or another way to generate the intended plot?
You can use the summarise function to sum up the different columns. Below I have summed up the numeric columns using dplyr's summarise_if() function.
DF <- data.frame(TV_now = as.numeric(c(4, 9, 1, 0, 4, NA)),
TV_before = as.numeric(c(4, 1, 2, 4, 5, 2)),
Radio_now = as.numeric(c(4, 5, 1, 5, 6, 9)),
Radio_before = as.numeric(c(6, 5, 3, 6, 7, 10)),
Country = as.factor(c("Mexico", "Canada", "US", "US", "Canada", "US")))
DF %>%
group_by(Country) %>%
summarise_if(is.numeric,sum,na.rm=TRUE) %>%
gather(key=Device, value=Number, -Country) %>%
ggplot(aes(x=Device,fill=Device)) +
geom_bar(aes(x = Device, y = Number),position = "dodge", stat = "identity") +
facet_wrap(~Country)
The result is a bar chart of the summed counts per device, faceted by country.
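In dplyr 1.0.0 and later, summarise_if() and gather() are superseded. A sketch of the same summary step with across() and tidyr's pivot_longer(), assuming those versions are installed. The object name totals is illustrative.

```r
library(dplyr)
library(tidyr)

DF <- data.frame(TV_now = c(4, 9, 1, 0, 4, NA),
                 TV_before = c(4, 1, 2, 4, 5, 2),
                 Radio_now = c(4, 5, 1, 5, 6, 9),
                 Radio_before = c(6, 5, 3, 6, 7, 10),
                 Country = factor(c("Mexico", "Canada", "US",
                                    "US", "Canada", "US")))

totals <- DF %>%
  group_by(Country) %>%
  # Sum every numeric column within each country, ignoring missing values.
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
            .groups = "drop") %>%
  # One row per Country/Device pair, ready for ggplot2.
  pivot_longer(-Country, names_to = "Device", values_to = "Number")

totals
```

The resulting totals can be passed to the same ggplot() call with facet_wrap(~Country).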
I need to display two datasets on the same faceted plots with ggplot2. The first dataset (dat) is to be shown as crosses.
The second dataset (dat2) is to be shown as a colored line. For context, the second dataset is actually the Pareto frontier of the first set.
Both datasets (dat and dat2) look like this:
       modu mnc       eff
1 0.3080473   0 0.4420544
2 0.3110355   4 0.4633741
3 0.3334024   9 0.4653061
Here's my code so far:
library(ggplot2)
dat <- structure(list(modu = c(0.30947265625, 0.3094921875, 0.32958984375,
0.33974609375, 0.33767578125, 0.3243359375, 0.33513671875, 0.3076171875,
0.3203125, 0.3205078125, 0.3220703125, 0.28994140625, 0.31181640625,
0.352421875, 0.31978515625, 0.29642578125, 0.34982421875, 0.3289453125,
0.30802734375, 0.31185546875, 0.3472265625, 0.303828125, 0.32279296875,
0.3165234375, 0.311328125, 0.33640625, 0.3140234375, 0.33515625,
0.34314453125, 0.33869140625), mnc = c(15, 9, 6, 0, 10, 12, 14,
9, 5, 11, 0, 15, 0, 2, 14, 13, 14, 17, 11, 12, 13, 6, 4, 0, 13,
7, 10, 12, 7, 13), eff = c(0.492448979591836, 0.49687074829932,
0.49421768707483, 0.478571428571428, 0.493537414965986, 0.493809523809524,
0.49891156462585, 0.499319727891156, 0.495102040816327, 0.492285714285714,
0.482312925170068, 0.498911564625851, 0.479931972789116, 0.492857142857143,
0.495238095238095, 0.49891156462585, 0.49530612244898, 0.495850340136055,
0.50156462585034, 0.496, 0.492897959183673, 0.487959183673469,
0.495605442176871, 0.47795918367347, 0.501360544217687, 0.497850340136054,
0.493496598639456, 0.493741496598639, 0.496734693877551, 0.499659863945578
)), .Names = c("modu", "mnc", "eff"), row.names = c(NA, 30L), class = "data.frame")
dat2 <- structure(list(modu = c(0.26541015625, 0.282734375, 0.28541015625,
0.29216796875, 0.293671875), mnc = c(0.16, 0.28, 0.28, 0.28,
0.28), eff = c(0.503877551020408, 0.504149659863946, 0.504625850340136,
0.505714285714286, 0.508503401360544)), .Names = c("modu", "mnc",
"eff"), row.names = c(NA, 5L), class = "data.frame")
dat$modu = dat$modu
dat$mnc = dat$mnc*50
dat$eff = dat$eff
dat2$modu = dat2$modu
dat2$mnc = dat2$mnc*50
dat2$eff = dat2$eff
res <- do.call(rbind, combn(1:3, 2, function(ii)
cbind(setNames(dat[,c(ii, setdiff(1:3, ii))], c("x", "y")),
var=paste(names(dat)[ii], collapse="/")), simplify=F))
ggplot(res, aes(x=x, y=y))+ geom_point(shape=4) +
facet_wrap(~ var, scales="free")
How should I go about doing this? Do I need to add a layer? If so, how do I do this in a faceted plot?
Thanks!
Here's one way:
pts <- do.call(rbind, combn(1:3, 2, function(ii)
cbind(setNames(dat[,c(ii, setdiff(1:3, ii))], c("x", "y")),
var=paste(names(dat)[ii], collapse="/")), simplify=F))
lns <- do.call(rbind, combn(1:3, 2, function(ii)
cbind(setNames(dat2[,c(ii, setdiff(1:3, ii))], c("x", "y")),
var=paste(names(dat2)[ii], collapse="/")), simplify=F))
gg.df <- rbind(cbind(geom="pt",pts),cbind(geom="ln",lns))
ggplot(gg.df,aes(x,y)) +
geom_point(data=gg.df[gg.df$geom=="pt",], shape=4)+
geom_path(data=gg.df[gg.df$geom=="ln",], color="red")+
facet_wrap(~var, scales="free")
The basic idea is to create separate data.frames for the points and the lines, then bind them together row-wise with an extra column (geom) indicating which geometry each row belongs to. Then we plot the points from the subset of gg.df with geom=="pt", and the lines from the subset with geom=="ln".
The result isn't very interesting with your limited example, but this seems to be what you want. Notice the use of geom_path(...) rather than geom_line(...). geom_line() connects the observations in order of their x-values, whereas geom_path() connects them in the order in which they appear in the data.
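Binding the two sets into one data.frame is not strictly required, because each ggplot2 layer accepts its own data argument. A minimal sketch with small made-up data frames (pts_demo and lns_demo are hypothetical stand-ins for the reshaped points and lines); the only requirement is that both carry the faceting column var.

```r
library(ggplot2)

# Hypothetical stand-in for the reshaped points (two facets, "a" and "b").
pts_demo <- data.frame(x = c(1, 3, 2, 4),
                       y = c(2, 1, 4, 3),
                       var = c("a", "a", "b", "b"))

# Hypothetical stand-in for the frontier; rows are deliberately not sorted by x.
lns_demo <- data.frame(x = c(3, 1, 2),
                       y = c(1, 2, 3),
                       var = c("a", "a", "a"))

p <- ggplot(pts_demo, aes(x, y)) +
  geom_point(shape = 4) +
  # A layer-specific data set; geom_path() keeps the row order of lns_demo.
  geom_path(data = lns_demo, colour = "red") +
  facet_wrap(~ var, scales = "free")

print(p)
```

Facet "b" shows only crosses because lns_demo has no rows with var equal to "b".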