Variable observations disappear after merging - r

I'm pretty new to R and have a professor who is only providing examples and not much explanation, so I'll do my best to explain my problem.
I used randomizr to create an experiment.
library(randomizr)
set.seed(20016)
A <- complete_ra(N = 9954, m_each = c(3318, 3318, 3318), conditions=c("Treatment 1", "Treatment 2", "Control"))
table(A)
Then I was asked to show the distribution of the treatments in the existing data, separated by gender. So I followed example work to make a new data set with the treatment, calling the new variable "gender" so I could then merge it with the existing data...
gender <- seq(1, 9954)
new_gender <- data.frame(gender, A)
final_data <- merge(colombia, new_gender, by.x = "gender", all.x = TRUE)
This all seemed well and good, but now when I view final_data, variable A is entirely NA. When I viewed it in the new_gender data set, all the observations were filled in with assignments of "Treatment 1", "Treatment 2" or "Control", but now that's gone. Where did my observations go? My goal is to graph it like this:
ggplot(final_data, aes(x = A)) + geom_bar(aes(fill = gender)) + labs(title = "Treatment Separated by Gender", x = "Treatment", y = "Count")
But I only get one bar right now instead of three because variable A is all NA.
Any help is greatly appreciated as I am at a total loss...
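One likely cause, sketched under assumptions: merge() matches rows on the values of the key column. In the code above the key called gender in new_gender is a running number, while the gender column in colombia presumably holds labels such as "Male" and "Female". If so, no keys match, and all.x = TRUE keeps every row of colombia with A filled in as NA. In the toy example below, colombia_toy, its id column and the gender labels are hypothetical stand-ins, and sample() replaces complete_ra() only to keep the sketch in base R.

```r
# Hypothetical stand-in for the existing data; the real `colombia` may differ.
colombia_toy <- data.frame(
  id = 1:6,
  gender = c("Female", "Male", "Female", "Male", "Female", "Male")
)

# Base-R stand-in for complete_ra(): two units per condition, shuffled.
set.seed(20016)
A <- sample(rep(c("Treatment 1", "Treatment 2", "Control"), each = 2))

# Mirrors the question: a running number stored under the name "gender".
new_gender <- data.frame(gender = seq_len(6), A = A)
merged_bad <- merge(colombia_toy, new_gender, by = "gender", all.x = TRUE)
all(is.na(merged_bad$A))  # no key value matches, so every A is NA

# Give the assignment a key that really identifies rows, then merge on it.
assignment <- data.frame(id = seq_len(6), A = A)
merged_ok <- merge(colombia_toy, assignment, by = "id", all.x = TRUE)
table(merged_ok$gender, merged_ok$A)
```

If the real data has no identifier column and A has exactly one entry per row, assigning it by position (colombia$A <- A) may be enough, since the random assignment does not depend on any key.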

Related

Breaking up a large ggplot by category; subset shows no errors but plots no data

I have a very large dataset derived from a spreadsheet of the format below:
df = data.frame(name = c('Ger1', 'Ger2', 'Ger3', 'Ger4', 'Ger5', 'Ger6'),
issued = c('UKS', 'USD', 'UKS', 'UKS', 'USD', 'USD'),
mat = as.Date(c('2024-01-31', '2023-01-31', '2026-10-22', '2022-07-22', '2029-01-31', '2025-06-07')),
volume = c(0.476, 0.922, 0.580, 1.259, 0.932, 0.417))
Currently, I plot all the data on one very long ggplot with the following code:
chart1<-ggplot(df)+geom_bar(stat="identity",aes(x=volume,y=name),fill="#1170aa")+
theme(title=element_text(size=12),panel.background = element_rect(fill='white',color='black'),legend.position='right')+
labs(title = "Total carriage by Volume on the day", x = "Volume", y = "Name")
Now while that worked for a while, given the size the dataset has grown to it is no longer feasible to use that way. Therefore I'd like to plot the data based on the contents of the "issued" column.
I first thought about a condition statement of the type:
if (df$issued == "UKS"){
chart1<-ggplot(df)+geom_bar(stat="identity",aes(x=volume,y=name),fill="#1170aa")+
theme(title=element_text(size=12),panel.background = element_rect(fill='white',color='black'),legend.position='right')+
labs(title = "Total carriage by Volume on the day", x = "Volume", y = "Name")
}
Unfortunately, it didn't work (although on closer inspection my logic wasn't particularly well thought out).
I then tried the subset() function, hoping it would let me plot only the data meeting my requirements, like so:
chart1<-ggplot(subset(df, 'issued' == "UKS"))+geom_bar(stat="identity",aes(x=volume,y=name),fill="#1170aa")+
theme(title=element_text(size=12),panel.background = element_rect(fill='white',color='black'),legend.position='right')+
labs(title = "Total carriage by Volume on the day", x = "Volume", y = "Name")
This particular code didn't throw any errors, but the chart that was produced had no data on it at all. Does anyone have any ideas on how I can filter and plot this data?
Don't quote column names in subset(). With quotes, 'issued' == "UKS" compares two string literals, which is always FALSE, so every row is dropped.
ggplot(subset(df, issued == "UKS")) +
geom_bar(stat="identity", aes(x=volume,y=name),fill="#1170aa")+
theme(title=element_text(size=12),
panel.background = element_rect(fill='white',color='black'),
legend.position='right')+
labs(title = "Total carriage by Volume on the day", x = "Volume", y = "Name")
Or use a tidyverse way of plotting:
library(tidyverse)
df %>% filter(issued == "UKS") %>%
ggplot() +
geom_bar(stat="identity", aes(x=volume,y=name),fill="#1170aa")+
theme(title=element_text(size=12),
panel.background = element_rect(fill='white',color='black'),
legend.position='right')+
labs(title = "Total carriage by Volume on the day", x = "Volume", y = "Name")
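A small base-R check of why the quoted version plots nothing, using a cut-down copy of the question's data. The split() step at the end is only a suggestion for producing one data frame per category; by_issuer is a name made up for this sketch.

```r
df <- data.frame(
  name = c("Ger1", "Ger2", "Ger3", "Ger4", "Ger5", "Ger6"),
  issued = c("UKS", "USD", "UKS", "UKS", "USD", "USD"),
  volume = c(0.476, 0.922, 0.580, 1.259, 0.932, 0.417)
)

# Two string literals are compared, not the column: a single FALSE.
"issued" == "UKS"

rows_quoted <- nrow(subset(df, "issued" == "UKS"))  # condition is FALSE for all rows
rows_bare <- nrow(subset(df, issued == "UKS"))      # compares the column values

# One data frame per value of `issued`, e.g. to draw one chart per category.
by_issuer <- split(df, df$issued)
names(by_issuer)
```

Each element of by_issuer can then be passed to the ggplot() call above in place of the subset() result; adding facet_wrap(~ issued) to a single plot is another option.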

Plot multiple variables on same Barplot in R

I would really appreciate some help with this plot. I'm very new to R and struggling (after looking at many tutorials!) to understand how to plot the following:
This is my table: [image: table]
The x-axis is meant to show PatientID and the y-axis the cell counts for each patient.
I've managed to do a basic plot for each variable individually, e.g.:
[image: this is for 2 of the variables]
And this gives me 2 separate graphs:
[image: Total cell counts]
[image: Cell counts for zone 1]
I would like all the data represented on 1 graph...That means for each patient, there will be 4 bars (tot cell counts, and cell counts for each zone (1 - 3).
I don't understand whether I should be doing this as a combined plot or make the 4 different plots and then combine them together? I'm also very confused with how to actually code this. I've tried ggplot and I've done the regular Barplot in R (worked for 1 variable at a time but not sure how to do many variables). Some very step-by-step help would be so much appreciated here. TIA
Here's a way of doing it using the ggplot2 and tidyr packages from the tidyverse. The key step is pivoting your data from "wide" to "long" format so that ggplot2 can map the variable name to a fill colour. After that, the ggplot call is pretty simple. The linked tutorial on stacked and grouped bar plots in ggplot2 gives a bit more explanation, with an example that's pretty much identical to yours.
library(ggplot2)
library(tidyr)
# Reproducing your data
dat <- tibble(
patientID = c("a", "b", "c"),
tot_cells = c(2773, 3348, 4023),
tot_cells_zone1 = c(994, 1075, 1446),
tot_cells_zone2 = c(1141, 1254, 1349),
tot_cells_zone3 = c(961, 1075, 1426)
)
to_plot <- pivot_longer(dat, cols = starts_with("tot"), names_to = "Zone", values_to = "Count")
ggplot(to_plot, aes(x = patientID, y = Count, fill = Zone)) +
geom_bar(position="dodge", stat="identity")
Output:
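To make the "wide" to "long" step concrete, here is a base-R sketch of the same reshaping using stack(). It is not the code from the answer, just an equivalent for illustration; count_cols and stacked are names introduced here.

```r
dat <- data.frame(
  patientID = c("a", "b", "c"),
  tot_cells = c(2773, 3348, 4023),
  tot_cells_zone1 = c(994, 1075, 1446),
  tot_cells_zone2 = c(1141, 1254, 1349),
  tot_cells_zone3 = c(961, 1075, 1426)
)

# Columns that hold counts; everything except the patient identifier.
count_cols <- setdiff(names(dat), "patientID")

# stack() concatenates the columns one after another and records
# which column each value came from in `ind`.
stacked <- stack(dat[count_cols])

# Long format: one row per patient and variable.
to_plot <- data.frame(
  patientID = rep(dat$patientID, times = length(count_cols)),
  Zone = as.character(stacked$ind),
  Count = stacked$values
)
head(to_plot)
```

The wide table has 3 rows and 4 count columns, so the long table has 12 rows, one per bar in the final plot.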
Thanks everyone for your help. I was able to make the plot as follows:
First, I made a new table from data I imported into R:
#Make new table of patientID and tot cell count
patientID <- c("a", "b", "c")
tot_cells <- c(tot_cells_a, tot_cells_b, tot_cells_c)
tot_cells_zone1 <- c(tot_cells_a_zone1, tot_cells_b_zone1, tot_cells_c_zone1)
tot_cells_zone2 <- c(tot_cells_a_zone2, tot_cells_b_zone2, tot_cells_c_zone2)
tot_cells_zone3 <- c(tot_cells_a_zone3, tot_cells_b_zone3, tot_cells_c_zone3)
tot_cells_table <- data.frame(tot_cells,
tot_cells_zone1,
tot_cells_zone2,
tot_cells_zone3)
rownames(tot_cells_table) <- c(patientID)
Then I plotted it as follows, first converting the data.frame to a matrix:
#Plot "Total Microglia Counts per Patient"
tot_cells_matrix <- data.matrix(tot_cells_table, rownames.force = TRUE)
par(mar = c(5, 4, 4, 10),
xpd = TRUE)
barplot(t(tot_cells_matrix),
col = c("red", "blue", "green", "magenta"),
main = "Total Microglia Counts per Patient",
xlab = "Patient ID", ylab = "Cell #",
beside = TRUE)
legend("topright", inset = c(- 0.4, 0),
legend = c("tot_cells", "tot_cells_zone1",
"tot_cells_zone2", "tot_cells_zone3"),
fill = c("red", "blue", "green", "magenta"))
And the graph looks like this: [image: Barplot of multiple variables]
Thanks again for pointing me in the right direction!

r - plotting gantt chart where multiple periods exist within one category

I borrowed example data from another post and modified to suit my situation.
(Add shading to a gantt chart to delineate weekends)
My issue with plotting a Gantt chart is that my data includes multiple periods within the category.
In the below sample, I added two more periods for "Write introduction" and one more for "Write results".
Also, I want to colour specific periods that meet the criteria. Here, if any portion of a period falls in either May or August, I flagged it.
It seems having up to two periods in one category works fine.
But when there are three periods, they become merged into a single period?!
My real dataset is much more complicated with one category sometimes having more than 10 periods, with very defined criteria for flagging.
I'm not sure how I could address this issue.
require(reshape2)
require(ggplot2)
# Create a list of tasks name strings.
tasks <- c("Write introduction", "Parse citation data",
"Construct data timeline",
"Write methods", "Model formulation",
"Model selection", "Write results", "Write discussion",
"Write abstract and editing",
"Write introduction", "Write introduction", "Write results")
# Compile dataframe of task names, and respective start and end dates.
dfr <- data.frame(
name = factor(tasks, levels = tasks[1:9]),
start.date = as.Date(c("2018-04-09", "2018-04-09", "2018-04-16",
"2018-04-30", "2018-04-16", "2018-05-21",
"2018-06-04", "2018-07-02", "2018-07-30",
"2018-05-15", "2018-06-03", "2018-07-25"
)),
end.date = as.Date(c("2018-04-30", "2018-04-20", "2018-05-18",
"2018-06-01", "2018-05-18", "2018-06-01",
"2018-06-29", "2018-07-27", "2018-08-31",
"2018-05-29", "2018-06-20", "2018-08-15")),
flag = c(0, 0, 1,
1, 1, 1,
0, 0, 1,
1, 0, 1)
)
# Merge start and end dates into durations.
mdfr <- melt(dfr, measure.vars = c("start.date", "end.date"))
# Gantt chart
ggplot(mdfr) +
geom_line(aes(value, name, colour = as.factor(flag)), size = 4) +
labs(title = "Project gantt chart",
x = NULL,
y = NULL) +
theme_minimal()
The geom_line approach should work if you retain an identifier for each pair of start / end dates, and specify that as the grouping variable so that ggplot knows which coordinates belong to the same line:
dfr$row.id <- seq(1, nrow(dfr)) # use row id to identify each original row uniquely
mdfr <- melt(dfr, measure.vars = c("start.date", "end.date"))
ggplot(mdfr,
aes(x = value, y = name, colour = factor(flag), group = row.id)) + # add group aesthetic
geom_line(size = 4) +
labs(title = "Project gantt chart",
x = NULL,
y = NULL) +
theme_minimal()
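As for why the periods merge without an explicit group: by default ggplot2 groups a layer by the combination of all discrete variables mapped in it, here name and the flag factor, and geom_line() draws one line through all points of a group. The base-R sketch below counts the points per group for the three "Write introduction" periods from the question; long and default_group are names made up for this illustration, and the rbind() call stands in for melt().

```r
# The three "Write introduction" periods from the question.
dfr <- data.frame(
  name = "Write introduction",
  start.date = as.Date(c("2018-04-09", "2018-05-15", "2018-06-03")),
  end.date = as.Date(c("2018-04-30", "2018-05-29", "2018-06-20")),
  flag = c(0, 1, 0)
)
dfr$row.id <- seq_len(nrow(dfr))

# Long format, one row per date (what melt() produces).
id_cols <- dfr[c("name", "flag", "row.id")]
long <- rbind(
  data.frame(id_cols, value = dfr$start.date),
  data.frame(id_cols, value = dfr$end.date)
)

# Default grouping: every combination of the discrete variables name and flag.
default_group <- interaction(long$name, long$flag, drop = TRUE)
points_per_default_group <- table(default_group)
points_per_default_group

# Span of the line drawn for the flag-0 group: it bridges both periods.
merged_span <- range(long$value[long$flag == 0])
merged_span

# Grouping by row.id keeps exactly one start and one end per line.
points_per_row_id <- table(long$row.id)
points_per_row_id
```

The two flag-0 periods share a group, so they are drawn as a single line from the first start to the last end. Two periods in one category only look right when their flags differ.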
I solved it by skipping melt and using geom_linerange instead of geom_line
ggplot(dfr) +
geom_linerange(aes(y = name,
xmin = start.date,
xmax = end.date,
colour = as.factor(flag)),
size = I(5)) +
theme_minimal()
Still, it would be good to know why it didn't work with geom_line.
If anyone could help me with that, I'd appreciate it very much!

Specify location of bar on bar chart R

I have example data below:
eg_data <- data.frame(period = c("1&2", "1&2","1", "1", "2","2"), size = c("big", "small", "big", "small","big", "small"), trip=c(1000, 250, 600, 100, 400, 150))
I want to make a stacked bar chart, where I have both periods as the first bar, period one as second, and period two as third. This is specified in the data as they are entered, but when I run the ggplot bar command, R decides that period one is a better candidate for first position.
ggplot() +
geom_bar(data = eg_data, aes(y = trip, x = period, fill = size),
stat = "identity",position = 'stack')
First, why does R display the data in an order other than the one I fed it in, and second, how do I correct this, i.e. specify which groupings I want and in what order?
All help is appreciated, thank you.
A character column is converted to a factor for plotting, and factor levels are sorted alphabetically by default, which is why "1" comes first. We can instead create the column as a factor whose levels are the unique values of that column. The levels then follow the order in which each value first appears in 'period' rather than being sorted.
library(tidyverse)
eg_data %>%
mutate(period = factor(period, levels = unique(period))) %>%
ggplot() +
geom_bar(aes(y = trip, x = period, fill = size),
stat="identity",position='stack')
EDIT - a solution with base R would be as follows (the levels must match the values in the data exactly, so "1&2" without spaces) -
eg_data$period <- factor(eg_data$period, levels = c("1&2", "1", "2"))
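A quick base-R check of how the level order is decided, using the period values from the question. The variable names are made up for this sketch.

```r
period <- c("1&2", "1&2", "1", "1", "2", "2")

# Default: levels are the sorted unique values, so "1" comes first.
default_levels <- levels(factor(period))
default_levels

# Levels in order of first appearance, as in the answer.
appearance_levels <- levels(factor(period, levels = unique(period)))
appearance_levels

# Caution: a level that does not match the data exactly turns
# the corresponding values into NA.
mismatched <- factor(period, levels = c("1 & 2", "1", "2"))
n_lost <- sum(is.na(mismatched))
n_lost
```

ggplot2 places discrete x values in level order, so whichever levels the factor carries determine the order of the bars.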

Specifying different x-tick labels for two facet groups in ggplot2

I have boxplots representing results of two methods, each with two validation approaches and three scenarios, to be plotted using ggplot2. Everything works fine, but I want to change the x-axis tick label to differentiate between the type of technique used in each group.
I used the following code:
data <- read.csv("results.csv", header = TRUE, sep=',')
ggplot() +
geom_boxplot(data = data, aes(x = Validation, y = Accuracy, fill = Scenario)) +
facet_wrap(~ Method) +
labs(fill = "")
where the structure of my data is as follows:
Method Validation Scenario Accuracy
-------------------------------------------------------
Method 1 Iterations Scenario 1 0.90
Method 1 Iterations Scenario 2 0.80
Method 1 Iterations Scenario 3 0.86
Method 1 Recursive Scenario 2 0.82
Method 2 Iterations Scenario 1 0.69
Method 2 Recursive Scenario 3 0.75
and got the following plot:
I just want to change the first x-tick label (Iterations) in Method 1 and Method 2 into 100-iterations and 10-iterations, respectively.
I tried to add this code but that changes the labels for both groups.
+ scale_x_discrete(name = "Validation",
labels = c("100-iterations", "Recursive",
"10-iterations", "Recursive")) +
Thanks in advance.
The ggplot2 package's facet options were not designed for varying axis labels or scales across facets (see the linked discussion for a detailed explanation), but one workaround in this instance is to vary the values of the underlying x-axis variable between facets and set scales = "free_x" in facet_wrap(), so that only the relevant values are shown on each facet's x-axis:
library(ggplot2)
library(dplyr)
ggplot(data %>%
mutate(Validation = case_when(Validation == "Recursive" ~ "Recursive",
Method == "Method 1" ~ "100-iterations",
TRUE ~ "10-iterations")),
aes(x = Validation, y = Accuracy, fill = Scenario)) +
geom_boxplot() +
facet_wrap(~ Method, scales = "free_x")
Data:
set.seed(1)
data <- data.frame(
Method = rep(c("Method 1", "Method 2"), each = 100),
Validation = rep(c("Iterations", "Recursive"), times = 100),
Scenario = sample(c("Scenario 1", "Scenario 2", "Scenario 3"), 200, replace = TRUE),
Accuracy = runif(200)
)
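To check the relabelling on its own, here is a base-R sketch of the same mapping that case_when() performs above, written with nested ifelse(). The toy data frame and the Validation_label column are introduced only for this illustration.

```r
# One row per Method and Validation combination.
toy <- data.frame(
  Method = c("Method 1", "Method 1", "Method 2", "Method 2"),
  Validation = c("Iterations", "Recursive", "Iterations", "Recursive")
)

# "Recursive" is kept as is; "Iterations" is renamed depending on the method.
toy$Validation_label <- ifelse(
  toy$Validation == "Recursive",
  "Recursive",
  ifelse(toy$Method == "Method 1", "100-iterations", "10-iterations")
)
toy
```

Because each method now has its own label for the iterations approach, scales = "free_x" lets every facet show only the two labels that occur in it.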

Resources