Timeseries graphs of mean values of group in R - r

I am learning R and working with a data set that has multiple repeated columns: the given columns are repeated 200 times.
I want to take the mean of each column and then group the means by variable, so there will be 200 mean values for each variable. I want to make a line chart of the mean values of each variable.
I am trying this code:
library(data.table)
library(tidyverse)
library(ggplot2)
library(viridisLite)
df <- read.table("H-W.csv", sep = ",")
df
dat %>% filter(Scenario != 'NULL') %>%
mutate("Scenario" = ifelse(Scenario == 'NULL2', "BASELINE", Scenario)) %>%
group_by(.dots = c("X.step.", "Scenario")) %>%
summarise('height.people' = mean(height),
'weight.people' = mean(weight),
"wealth.people" = mean(wealth)) %>%
pivot_longer(c('height.people', 'weight.people', 'wealth.people')) %>%
ggplot(aes(x = X.step., y = value, colour = Scenario)) +
geom_line(size = 1) + facet_grid(name~., scales = "free_y") + theme_classic() +
scale_colour_viridis_d() + scale_y_log10()
I found this error
Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "NULL"

I think you might have the same problem as this...
Is your data in a data.frame or tibble?
Otherwise, if that doesn't work, try this:
filter is a function in both stats and dplyr,
so you could try changing
dat %>% filter(Scenario != 'NULL') %>%
to
dat %>% dplyr::filter(Scenario != "NULL") %>%
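A minimal sketch of both checks, assuming a small made-up data frame in place of H-W.csv (all values here are hypothetical). Note that the question reads the file into df but starts the pipeline from dat, so it is worth confirming that the piped object exists and is a data frame before calling filter:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the contents of "H-W.csv"
df <- data.frame(
  X.step.  = rep(1:2, each = 4),
  Scenario = rep(c("NULL2", "NULL2", "A", "A"), times = 2),
  height   = c(150, 170, 160, 180, 152, 172, 162, 182),
  weight   = c(50, 70, 60, 80, 52, 72, 62, 82),
  wealth   = c(10, 30, 20, 40, 12, 32, 22, 42),
  stringsAsFactors = FALSE
)

# The object being piped must be a data.frame or tibble, not NULL
stopifnot(is.data.frame(df))

means_long <- df %>%
  dplyr::filter(Scenario != "NULL") %>%   # namespaced to rule out stats::filter
  mutate(Scenario = ifelse(Scenario == "NULL2", "BASELINE", Scenario)) %>%
  group_by(X.step., Scenario) %>%
  summarise(height.people = mean(height),
            weight.people = mean(weight),
            wealth.people = mean(wealth),
            .groups = "drop") %>%
  pivot_longer(c(height.people, weight.people, wealth.people))
```

The resulting means_long has one row per step, scenario and variable, and the ggplot part of the question can be piped onto it unchanged.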

Related

Plot percentages in R as blocks

I have the table on the left
table <- cbind(c("x1","x2", "x3"), c("0.4173","0.9211","0.0109"))
and am trying to make the plot on the right.
Are there any packages in R that can do what I'm trying to achieve?
A base R option would be to use barplot on a named vector:
barplot(v1)
Or convert to a two-column data.frame with stack and use the formula method:
barplot(values ~ ind, stack(v1))
Or we can use tidyverse with ggplot:
library(dplyr)
library(ggplot2)
library(tidyr)
library(tibble)
enframe(v1, name = "id", value = 'block') %>%
mutate(non_block = 1 - block) %>%
pivot_longer(cols = -id) %>%
ggplot(aes(x = id, y = value, fill = name)) +
geom_col() +
coord_flip() +
theme_bw()
data
v1 <- setNames(c(0.4173, 0.9211, 0.0109), paste0("x", 1:3))
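The question stores the values as a character matrix rather than a named numeric vector. A minimal sketch of getting from one to the other, assuming the first column holds the names and the second the values (the name tbl is only used here to avoid masking base::table):

```r
# Character matrix as given in the question
tbl <- cbind(c("x1", "x2", "x3"), c("0.4173", "0.9211", "0.0109"))

# Convert the second column to numeric and name it by the first column
v1 <- setNames(as.numeric(tbl[, 2]), tbl[, 1])
```

After this, v1 can be passed to any of the plotting options above.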

Put dplyr & ggplot in Loop/Apply

I'm newish to R programming and am trying to standardise, or generalise, a piece of code so that I can apply it to different data exports of the same structure. The code is trivial, but I am having trouble getting it to loop:
Here is my code:
plot <- data %>%
group_by(Age, ID) %>%
summarise(Rev = sum(TotalRevenue)) %>%
ggplot(aes(
x = AgeGroup,
y = Rev,
fill = AgeGroup
)) +
geom_col(alpha = 0.9) +
theme_minimal()
I want to generalise the code so that I can switch out 'Age' with variables I put into a list. Here is my amateur code:
cols <- c(data$Col1, data$Col2) #Im pretty sure this is wrong
for (i in cols) {
plot <- data %>%
group_by(i, ID) %>%
summarise(Rev = sum(TotalRevenue)) %>%
ggplot(aes(
x = AgeGroup,
y = Rev,
fill = AgeGroup
)) +
geom_col(alpha = 0.9) +
theme_minimal()
}
And this doesn't work. The datasets I will be receiving will have the same variables, just different observations and so standardising this process will be a lifesaver.
Thanks in advance.
You were probably trying to do:
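library(dplyr)
library(ggplot2)
library(rlang)
cols <- c('col1', 'col2')
# build one plot per grouping column named in cols
plot_list <- lapply(cols, function(i)
data %>%
group_by(!!sym(i), ID) %>%
summarise(Rev = sum(TotalRevenue)) %>%
# AgeGroup no longer exists after summarise(), so map the grouping column instead
ggplot(aes(x = !!sym(i), y = Rev, fill = !!sym(i))) +
geom_col(alpha = 0.9) + theme_minimal())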
This returns a list of plots, which can be accessed as plot_list[[1]], plot_list[[2]], etc. Also look into facets to combine multiple plots.
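A rough sketch of the facet idea, assuming made-up data with two hypothetical grouping columns col1 and col2. The grouping columns are stacked into long form first so that each one gets its own panel. Unlike the code above, this sums over ID rather than keeping it as a group, which gives the same bar heights:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data with the same kind of structure as the question
data <- data.frame(
  ID           = c(1, 1, 2, 2, 3, 3),
  col1         = c("a", "b", "a", "b", "a", "b"),
  col2         = c("x", "x", "y", "y", "x", "y"),
  TotalRevenue = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)
cols <- c("col1", "col2")

# One row per (grouping variable, level) with the summed revenue
rev_long <- data %>%
  pivot_longer(all_of(cols), names_to = "variable", values_to = "level") %>%
  group_by(variable, level) %>%
  summarise(Rev = sum(TotalRevenue), .groups = "drop")

# One panel per grouping variable
p <- ggplot(rev_long, aes(x = level, y = Rev, fill = level)) +
  geom_col(alpha = 0.9) +
  facet_wrap(~ variable, scales = "free_x") +
  theme_minimal()
```

This only works cleanly when the grouping columns share a type, since pivot_longer puts their values into one column.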

ifelse condition: is in top n

Usually when I need a subset on geom_label() I use ifelse() and specify a number as below:
library(tidyverse)
data = starwars %>% filter(mass < 500)
data %>%
ggplot(aes(x = mass, y = height, label = ifelse(birth_year > 100, name, NA))) +
geom_point() +
geom_label()
#> Warning: Removed 54 rows containing missing values (geom_label).
Created on 2020-05-31 by the reprex package (v0.3.0)
But with the dataset I'm working on, I need a dynamic solution, something like ifelse("birth_year is in top n", name, NA).
Thoughts?
For your method, I think using rank should work fine, e.g.,
ifelse(rank(birth_year) < 10, name, NA)
You can use rank(-birth_year) if you want it sorted the other way (or, if you're using dplyr, rank(desc(birth_year)), which will work on non-numeric columns too). You may want to read up on tie methods at ?rank.
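As a small illustration of the rank approach outside of ggplot, here is a sketch on made-up vectors (the names and values are hypothetical stand-ins for the starwars columns) that keeps labels only for the n largest values:

```r
# Hypothetical toy vectors standing in for starwars$name and starwars$birth_year
name <- c("a", "b", "c", "d", "e", "f", "g", "h")
birth_year <- c(19, 112, NA, 41.9, 896, 52, 600, 200)

n <- 3  # how many of the largest values to label

# rank(-x) gives rank 1 to the largest value; na.last = "keep" leaves NA as NA
label <- ifelse(rank(-birth_year, na.last = "keep") <= n, name, NA)
```

The same expression can be dropped into aes(label = ...), with n set however you like.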
I'd also propose a more general solution: filtering data for the geom_label layer. For more complex conditions (e.g., where a group_by would come in handy) it will be more straightforward:
data %>%
ggplot(aes(x = mass, y = height, label = name)) +
geom_point() +
geom_label(
data = data %>%
group_by(species) %>%
top_n(n = 1, wt = desc(birth_year)) # youngest of each species
)
Something like this? To get top 4 values.
library(ggplot2)
data %>%
ggplot(aes(x = mass, y = height, label = ifelse(birth_year >= sort(birth_year, decreasing = TRUE)[4], name, NA))) +
geom_point() +
geom_label()
This is a more explicit approach. I assume you want to count the number of characters per birth year, per your example. In this case, we handle the ranking first, then add a column to the original dataset, then plot. The new 'label' field is either blank/NA or has members of the top set. I suppress the pesky missing data warning in the geom_label arguments.
data = starwars %>% filter(mass < 500)
# counts names per birthyear, returns vector of top 4
top4 <- data %>%
drop_na(birth_year) %>%
count(birth_year, sort = TRUE) %>%
top_n(4) %>%
pull(birth_year)
# adds column to data with the names from the top 4 birth years
data <- data %>%
mutate(label = ifelse(birth_year %in% top4, name, NA))
# plots data with label, dropping NAs
data %>%
ggplot(aes(x = mass, y = height, label = label)) +
geom_point() +
geom_label(na.rm = TRUE)

Having trouble converting variable to factor for ggplot

I am trying to plot data from the nycflights13 data set. I want the month and dep_delay variables to be factors rather than continuous. I am getting an error with no explanation and am stuck. Here's my code:
library(ggplot2)
library(dplyr)
library(nycflights13)
f <- group_by(flights, month) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(mutate(month = as.factor(unlist(month))) +
geom_bar(aes(month, delay, fill=month),stat = "identity")
You can't do the mutate inside the ggplot call like that. The pipe passes the summarised data to ggplot() as its first argument, so the mutate() call lands in the next argument and is evaluated without a data frame to operate on.
Do it in an outside call:
f <- group_by(flights, month) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
mutate(month = as.factor(month)) %>%
ggplot() +
geom_bar(aes(month, delay, fill=month),stat = "identity")
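An alternative sketch, if you would rather not add a mutate step, is to convert inside aes() itself. This uses a small made-up data frame instead of nycflights13 (the values are hypothetical) and geom_col(), which is shorthand for geom_bar(stat = "identity"):

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for nycflights13::flights
flights_toy <- data.frame(
  month     = c(1, 1, 2, 2, 3),
  dep_delay = c(10, NA, 5, 15, 0)
)

p <- flights_toy %>%
  group_by(month) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  # factor() is applied when the aesthetics are evaluated, so the data stay numeric
  ggplot(aes(factor(month), delay, fill = factor(month))) +
  geom_col()
```

The trade-off is that the axis and legend titles become factor(month) unless you set them with labs().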

r dplyr non standard evaluation - ordering bar plot in a function

I have read http://dplyr.tidyverse.org/articles/programming.html about non standard evaluation in dplyr but still can't get things to work.
plot_column <- "columnA"
raw_data %>%
group_by(.dots = plot_column) %>%
summarise (percentage = mean(columnB)) %>%
filter(percentage > 0) %>%
arrange(percentage) %>%
# mutate(!!plot_column := factor(!!plot_column, !!plot_column))%>%
ggplot() + aes_string(x=plot_column, y="percentage") +
geom_bar(stat="identity", width = 0.5) +
coord_flip()
This works fine when the mutate statement is disabled. However, when it is enabled in order to order the bars by height, only a single bar is returned.
How can I convert the statement above into a function, or make it use a variable, but still plot multiple bars ordered by their size?
An example Dataset could be:
columnA,columnB
a, 1
a, 0.4
a, 0.3
b, 0.5
edit
a sample:
mtcars %>%
group_by(mpg) %>%
summarise (mean_col = mean(cyl)) %>%
filter(mean_col > 0) %>%
arrange(mean_col) %>%
mutate(mpg := factor(mpg, mpg))%>%
ggplot() + aes(x=mpg, y=mean_col) +
geom_bar(stat="identity") +
coord_flip()
will output an ordered bar chart.
How can I wrap this into a function where the column can be replaced and I get multiple bars?
This works with dplyr 0.7.0 and ggplot 2.2.1:
rm(list = ls())
library(ggplot2)
library(dplyr)
raw_data <- tibble(columnA = c("a", "a", "b", "b"), columnB = c(1, 0.4, 0.3, 0.5))
plot_col <- function(df, plot_column, val_column){
pc <- enquo(plot_column)
vc <- enquo(val_column)
pc_name <- quo_name(pc) # generate a name from the enquoted statement!
df <- df %>%
group_by(!!pc) %>%
summarise (percentage = mean(!!vc)) %>%
filter(percentage > 0) %>%
arrange(percentage) %>%
mutate(!!pc_name := factor(!!pc, !!pc)) # insert pc_name here!
ggplot(df) + aes_(y = ~percentage, x = substitute(plot_column)) +
geom_bar(stat="identity", width = 0.5) +
coord_flip()
}
plot_col(raw_data, columnA, columnB)
plot_col(mtcars, mpg, cyl)
The problem I ran into was that ggplot2 and dplyr use different kinds of non-standard evaluation. I got the answer from this question: Creating a function using ggplot2.
EDIT: parameterized the value column (e.g. columnB/cyl) and added mtcars example.
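For comparison, a sketch of the same function with the embrace operator {{ }}, assuming a newer setup than the versions named above (roughly rlang 0.4+, dplyr 1.0+ and ggplot2 3.0+). The name plot_col2 is made up here, and reorder() replaces the arrange/factor step for ordering the bars:

```r
library(dplyr)
library(ggplot2)

raw_data <- tibble(columnA = c("a", "a", "b", "b"), columnB = c(1, 0.4, 0.3, 0.5))

# Hypothetical variant of plot_col(); {{ }} forwards the bare column names
plot_col2 <- function(df, plot_column, val_column) {
  df %>%
    group_by({{ plot_column }}) %>%
    summarise(percentage = mean({{ val_column }}), .groups = "drop") %>%
    filter(percentage > 0) %>%
    # reorder() sorts the bars by percentage inside the aesthetic mapping
    ggplot(aes(x = reorder({{ plot_column }}, percentage), y = percentage)) +
    geom_col(width = 0.5) +
    coord_flip()
}

p <- plot_col2(raw_data, columnA, columnB)
p_cars <- plot_col2(mtcars, mpg, cyl)
```

Because {{ }} works in both dplyr verbs and aes(), the separate enquo/quo_name/substitute handling is not needed in this version.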
