R: Multiple bar plots of mean value vs. month vs. genre [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 6 years ago.
I have the following data frame, where variable is one of 10 movie genre categories, e.g. drama, comedy, etc.
> head(grossGenreMonthLong)
Gross ReleasedMonth variable value
5 33508485 2 drama 1
6 67192859 2 drama 1
8 37865 4 drama 1
9 76665507 1 drama 1
10 221594911 2 drama 1
12 446438 2 drama 1
Reproducible dataframe:
dput(head(grossGenreMonthLong))
structure(list(Gross = c(33508485, 67192859, 37865, 76665507,
221594911, 446438), ReleasedMonth = c(2, 2, 4, 1, 2, 2), variable = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("drama", "comedy", "short", "romance",
"action", "crime", "thriller", "documentary", "adventure", "animation"
), class = "factor"), value = c(1, 1, 1, 1, 1, 1)), .Names = c("Gross",
"ReleasedMonth", "variable", "value"), row.names = c(5L, 6L,
8L, 9L, 10L, 12L), class = "data.frame")
I would like to calculate the mean gross vs. month for each of the 10 genres and plot them in separate bar charts using facets (varying by genre).
In other words, what's a quick way to plot 10 bar charts of mean gross vs. month for each of the 10 genres?

You should provide a reproducible example to make it easier for us to help you. dput(my.dataframe) is one way to do it, or you can generate an example dataframe like below. Since you haven't given us a reproducible example, I'm going to put on my telepathy hat and assume the "variable" column in your screenshot is the genre.
n = 100
movies <- data.frame(
genre=sample(letters[1:10], n, replace=TRUE),
gross=runif(n, min=1, max=1e7),
month=sample(12, n, replace=TRUE)
)
head(movies)
# genre gross month
# 1 e 5545765.4 1
# 2 f 3240897.3 3
# 3 f 1438741.9 5
# 4 h 9101261.0 6
# 5 h 926170.8 7
# 6 f 2750921.9 1
(My genres are 'a', 'b', etc).
To plot average gross per month, you first need to calculate the average gross per month and genre. One way to do so is with the plyr package (data.table, dplyr, etc. would also work).
library(plyr)
monthly.avg.gross <- ddply(movies, # the input dataframe
.(genre, month), # group by these
summarize, avgGross=mean(gross)) # do this.
The dataframe monthly.avg.gross now has one row per (month, genre) with a column avgGross that has the average gross in that (month, genre).
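If plyr is not available, the same per-(genre, month) mean can probably be computed with base R's aggregate(). This is only a sketch on simulated data: the set.seed() call and the renaming of the result column to avgGross are additions for illustration, not part of the answer above.

```r
# Simulated data in the same shape as the example above
# (set.seed is an addition, so the numbers are reproducible)
set.seed(1)
n <- 100
movies <- data.frame(
  genre = sample(letters[1:10], n, replace = TRUE),
  gross = runif(n, min = 1, max = 1e7),
  month = sample(12, n, replace = TRUE)
)

# One row per (genre, month) combination that occurs in the data
monthly.avg.gross <- aggregate(gross ~ genre + month, data = movies, FUN = mean)

# aggregate() keeps the name "gross"; rename to match the plyr version
names(monthly.avg.gross)[3] <- "avgGross"

head(monthly.avg.gross)
```

The resulting data frame has the same columns as the ddply output, so it should work with the same plotting code.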
Now it's a matter of plotting. You have hinted at "facet" so I assume you're using ggplot.
library(ggplot2)
ggplot(monthly.avg.gross, aes(x=month, y=avgGross)) +
geom_point() +
facet_wrap(~ genre)
You can also add month labels or treat month as a factor instead of a number, but that's peripheral to your question.

Thank you very much @mathematical.coffee. I was able to adapt your answer to produce the appropriate bar charts.
meanGrossGenreMonth = ddply(grossGenreMonthLong,
.(ReleasedMonth, variable),
summarise,
mean.Gross = mean(Gross, na.rm = TRUE))
# plot bar plots with facets
ggplot(meanGrossGenreMonth, aes(x = factor(ReleasedMonth), y = mean.Gross)) +
geom_bar(stat = "identity") + facet_wrap(~ variable) + ylab("mean Gross ($)") +
xlab("Month") + ggtitle("Mean gross revenue vs. month released by Genre")


Plotting Arbitrary Functions by Group in R

I have a dataset (test_df) that looks like:
Species
TreatmentA
TreatmentB
X0
L
K
Apple
Hot
Cloudy
1
2
3
Apple
Cold
Cloudy
4
5
6
Orange
Hot
Sunny
7
8
9
Orange
Cold
Sunny
10
11
12
I would like to display the effect of the treatments by using the X0, L, and K values as coefficients in a standard logistic function and plotting the same species across various treatments on the same plot. I would like a grid of plots with the logistic curves for each species on its own plot, with each treatment then being grouped by color within every plot. In the above example, Plot1.Grid1 would have 2 logistic curves corresponding to Apple Hot and Apple Cold, and Plot1.Grid2 would have 2 logistic curves corresponding to Orange Hot and Orange Cold.
The below code will create a single logistic function curve which can then be layered, but manually adding the layers for multiple treatments is tedious.
testx0 <- 1
testL <- 2
testk <- 3
days <- seq(from = -5, to = 5, by = 1)
functionmultitest <- function(x,testL,testK,testX0) {
(testL)/(1+exp((-1)*(testK) *(x - testX0)))
}
ggplot()+aes(x = days, y = functionmultitest(days,testL,testk,testx0))+geom_line()
The method described in (https://statisticsglobe.com/draw-multiple-function-curves-to-same-plot-in-r) works for data frames with few species or treatments, but it becomes very tedious to define the curves individually if you have many treatments/species. Is there a way to programmatically pass the list of coefficients and have ggplot handle the grouping?
Thank you!
Your current code shows how to compute the curve for a single row in your data frame. What you can do is pre-compute the curve for each row and then feed to ggplot.
Setup:
# Packages
library(ggplot2)
# Your days vector
days <- seq(from = -5, to = 5, by = 1)
# Your sample data frame above
df = structure(list(Species = c("Apple", "Apple", "Orange", "Orange"
), TreatmentA = c("Hot", "Cold", "Hot", "Cold"), TreatmentB = c("Cloudy",
"Cloudy", "Sunny", "Sunny"), X0 = c(1L, 4L, 7L, 10L), L = c(2L,
5L, 8L, 11L), K = c(3L, 6L, 9L, 12L)), class = "data.frame", row.names = c(NA,
-4L))
# Your function
functionmultitest <- function(x,testL,testK,testX0) {
(testL)/(1+exp((-1)*(testK) *(x - testX0)))
}
We'll "expand" each row of your data frame with the days vector:
# Define first a data frame of days:
days_df = data.frame(days = days)
# Perform a cross join (by = NULL makes the Cartesian product explicit)
df_all = merge(days_df, df, by = NULL)
At this point, you will have a data frame where each original row is repeated once for every value in days.
Now, just as you did for one row, we'll compute the value of the function for each row and store in the df_all as result:
df_all$result = mapply(functionmultitest, df_all$days, df_all$L, df_all$K, df_all$X0)
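Since functionmultitest only uses vectorised arithmetic, mapply is probably not strictly needed here: whole columns can be passed in one call. The following is a minimal sketch under that assumption. It rebuilds the objects so that it runs on its own (with a trimmed-down copy of the sample data), and via_mapply exists only to compare the two approaches.

```r
# Logistic function, same formula as above
functionmultitest <- function(x, testL, testK, testX0) {
  testL / (1 + exp(-testK * (x - testX0)))
}

days <- seq(from = -5, to = 5, by = 1)

# Trimmed-down copy of the sample data (TreatmentB omitted for brevity)
df <- data.frame(
  Species    = c("Apple", "Apple", "Orange", "Orange"),
  TreatmentA = c("Hot", "Cold", "Hot", "Cold"),
  X0 = c(1, 4, 7, 10),
  L  = c(2, 5, 8, 11),
  K  = c(3, 6, 9, 12)
)

# Cross join: every coefficient row paired with every day
df_all <- merge(data.frame(days = days), df, by = NULL)

# Vectorised call: each argument is a whole column of df_all
df_all$result <- with(df_all, functionmultitest(days, L, K, X0))

# Row-by-row version, for comparison only
via_mapply <- mapply(functionmultitest, df_all$days, df_all$L, df_all$K, df_all$X0)
all.equal(df_all$result, via_mapply)
```

Both versions should give the same numbers; the vectorised call simply avoids one function call per row.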
I'm not sure how you intended to handle treatmentA and treatmentB, so I'll just combine for illustration purposes:
df_all$combined_treatment = paste0(df_all$TreatmentA, "-", df_all$TreatmentB)
We can now feed this data frame to ggplot, set the color to be combined_treatment, and use the facet_grid function to split by species
ggplot(data = df_all, aes(x = days, y = result, color = combined_treatment))+
geom_line() +
facet_grid(Species ~ ., scales = "free")
The result is as follows:

R GGplot2 Stacked Columns Chart

I am trying to make a stacked column chart in R. Sorry, I am still learning, which is why I need help.
This is how I have the data:
structure(list(Category = structure(c(2L, 3L, 4L, 1L), .Label = c("MLC1000",
"MLC1051", "MLC1648", "MLC5726"), class = "factor"), Minutes = c(2751698L,
2478850L, 556802L, 2892097L), Items = c(684L, 607L, 135L, 711L
), Visits = c(130293L, 65282L, 25484L, 81216L), Sold = c(2625L,
1093L, 681L, 1802L)), .Names = c("Category", "Minutes", "Items",
"Visits", "Sold"), class = "data.frame", row.names = c(NA, -4L)
)
And I want to create this graphic:
I think there are two pretty basic principles that you should apply to make this problem easier to handle. First, you should make your data tidy. Second, you shouldn't leave ggplot to do your calculations for you.
library(tidyverse)
a <- tibble( # data_frame() is deprecated in favour of tibble()
category = letters[1:4],
minutes = c(2751698, 2478850, 556802, 2892097),
visits = c(130293, 65282, 25484, 81216),
sold = c(2625, 1093, 681, 1802)
) %>%
gather(variable, value, -category) %>% # make tidy
group_by(variable) %>%
mutate(weight = value / sum(value)) # calculate weight variable
## Source: local data frame [12 x 4]
## Groups: variable [3]
## category variable value weight
## <chr> <chr> <dbl> <dbl>
## 1 a minutes 2751698 0.31703610
## 2 b minutes 2478850 0.28559999
## 3 c minutes 556802 0.06415178
## 4 d minutes 2892097 0.33321213
## 5 a visits 130293 0.43104127
## 6 b visits 65282 0.21596890
## 7 c visits 25484 0.08430734
## 8 d visits 81216 0.26868249
## 9 a sold 2625 0.42331882
## 10 b sold 1093 0.17626189
## 11 c sold 681 0.10982100
## 12 d sold 1802 0.29059829
I don't know what was up with your structure(), but I couldn't build a data frame from it without crashing my R session.
Once we get the data into this format, the ggplot2 call is actually really easy:
ggplot(a, aes(x = variable, weight = weight * 100, fill = category)) +
geom_bar()
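For readers who prefer not to load the tidyverse, the reshaping and the per-variable share can be sketched in base R as well. This is an illustration, not the answerer's method: it uses the asker's Category codes rather than the letters above, and the column names values and ind are simply what stack() produces.

```r
# The asker's data, with Category as plain character and Items left out
sales <- data.frame(
  Category = c("MLC1051", "MLC1648", "MLC5726", "MLC1000"),
  Minutes  = c(2751698, 2478850, 556802, 2892097),
  Visits   = c(130293, 65282, 25484, 81216),
  Sold     = c(2625, 1093, 681, 1802)
)

# Wide to long: stack() returns columns 'values' and 'ind' (the source column)
long <- stack(sales[c("Minutes", "Visits", "Sold")])
long$Category <- rep(sales$Category, times = 3)

# Share of each category within its variable (sums to 1 per variable)
long$weight <- ave(long$values, long$ind, FUN = function(v) v / sum(v))

long
```

The long data frame can then be passed to a ggplot call like the one above, with x = ind and fill = Category.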

R reshape wide to long data [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I have a dataframe like so:
[1] "drugevent" "prr" "prr_lowerCI" "prr_upperCI" "EBGM"
[6] "EBG_lowerCI" "EBGM_upperCI" "strata.coded" "strata" "Reference"
And I want to make a plot for each drugevent, using ggplot. In order to do so I need to format my DF like so:
[1] "drug", "event", "measurement" (prr or EBGM), "lowerCI" (for the corresponding measurement), "upperCI", "strata"
But despite the many posts on SO and R tutorials, I was not able to correctly reshape the data. In my last try I added an ID like so:
mutate(DF, count=1:n())
melted the data
melt(DF, id.vars="count")
then I made several DFs subsetting the values of interest
subset(melted, variable %in% c("prr","EBGM"))
then the upper and lower confidence intervals, strata and drug event,
but when I merged them like so:
merge(measurement, lowerCI, by="count")
in the end I had duplicated values with 4 rows for each count.
The code is messy and the result is wrong. Could you please help me with this?
Edit: examples:
initial data:
drugevent prr prr_lowerCI prr_upperCI
1 CLARITHROMYCIN-Erythema Multiforme 1.3539930 0.1903270 2.517659
2 CLARITHROMYCIN-Erythema Multiforme 1.7741342 0.6647390 2.883529
EBGM EBG_lowerCI EBGM_upperCI strata count
1 0.9003325 0.2128934 2.772558 Infants 1
2 1.4471096 0.5997188 3.053965 Children 2
the desired result:
measurement value upperCI strata drug
1 prr 1.353992979 2.51765895 Infants CLARITHROMYCIN
2 EBGM 0.9009 2.77 Infants CLARITHROMYCIN
reaction lowerCI
1 Erythema Multiforme 2.51765895
2 Erythema Multiforme 1.447
From what I understand, you want a long format of the original data frame, split by measurement (prr or EBGM):
dfPRR <- cbind(df[, !grepl("EBG", colnames(df))], measurement="prr")
colnames(dfPRR)[2:4] <- c("value", "lowerCI", "upperCI")
dfEBGM <- cbind(df[, !grepl("prr", colnames(df))], measurement="EBGM")
colnames(dfEBGM)[2:4] <- c("value", "lowerCI", "upperCI")
rbind(dfPRR, dfEBGM)
Data used
structure(list(drugevent = structure(c(1L, 1L), .Label = "CLARITHROMYCIN-Erythema Multiforme", class = "factor"),
prr = c(1.353993, 1.7741342), prr_lowerCI = c(0.190327, 0.664739
), prr_upperCI = c(2.517659, 2.883529), EBGM = c(0.9003325,
1.4471096), EBG_lowerCI = c(0.2128934, 0.5997188), EBGM_upperCI = c(2.772558,
3.053965), strata = structure(1:2, .Label = c(" Infants",
" Children"), class = "factor"), count = 1:2), .Names = c("drugevent",
"prr", "prr_lowerCI", "prr_upperCI", "EBGM", "EBG_lowerCI", "EBGM_upperCI",
"strata", "count"), class = "data.frame", row.names = c(NA, -2L
))
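The desired result in the question also separates drugevent into a drug and a reaction column, which the code above does not do. One possible sketch in base R, assuming the drug name never contains a hyphen so that the first "-" is always the separator (that assumption and the helper name split_drugevent are mine, not from the question):

```r
# Split "DRUG-Reaction" strings at the first hyphen.
# Assumption: the drug name itself contains no hyphen.
split_drugevent <- function(x) {
  x <- as.character(x)  # works for factors too
  data.frame(
    drug     = sub("-.*$", "", x),     # everything before the first "-"
    reaction = sub("^[^-]*-", "", x),  # everything after the first "-"
    stringsAsFactors = FALSE
  )
}

split_drugevent(c("CLARITHROMYCIN-Erythema Multiforme",
                  "CLARITHROMYCIN-Erythema Multiforme"))
```

The two new columns could then be attached with cbind() to the long data frame produced by rbind(dfPRR, dfEBGM).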

Results from aggregate with multiple functions not usable in further calculations. WHY?

I have a problem regarding results from an aggregate function in R. My aim is to select certain bird species from a data set and calculate the density
of observed individuals over the surveyed area. To that end, I took a subset of the main data file, then aggregated over area, calculating the
mean, and the number of individuals (represented by length of vector). Then I wanted to use the calculated mean area and number of individuals to
calculate density. That didn't work. The code I used is given below:
> head(data)
positionmonth positionyear quadrant Species Code sum_areainkm2
1 5 2014 1 Bar-tailed Godwit 5340 155.6562
2 5 2014 1 Bar-tailed Godwit 5340 155.6562
3 5 2014 1 Bar-tailed Godwit 5340 155.6562
4 5 2014 1 Bar-tailed Godwit 5340 155.6562
5 5 2014 1 Gannet 710 155.6562
6 5 2014 1 Bar-tailed Godwit 5340 155.6562
sub.gannet<-subset(data, Species == "Gannet")
sub.gannet<-data.frame(sub.gannet)
x<-sub.gannet
aggr.gannet<-aggregate(sub.gannet$sum_areainkm2, by=list(sub.gannet$positionyear, sub.gannet$positionmonth, sub.gannet$quadrant, sub.gannet$Species, sub.gannet$Code), FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
names(aggr.gannet)<-c("positionyear", "positionmonth", "quadrant", "species", "code", "x")
aggr.gannet<-data.frame(aggr.gannet)
> aggr.gannet
positionyear positionmonth quadrant species code x.observed_area x.NoInd
1 2014 5 4 Gannet 710 79.8257 10.0000
density <- c(aggr.gannet$x.NoInd/aggr.gannet$x.observed_area)
aggr.gannet <- cbind(aggr.gannet, density)
Error in data.frame(..., check.names = FALSE) :
Arguments imply differing number of rows: 1, 0
> density
numeric(0)
> aggr.gannet$x.observed_area
NULL
> aggr.gannet$x.NoInd
NULL
R doesn't seem to view the results from the function (observed_area and NoInd) as numeric values in their own right. That was already apparent, when I couldn't give them a name each, but had to call them "x".
How can I calculate density under these circumstances? Or is there another way to aggregate with multiple functions over the same variable that will result in a usable output?
It's a quirk of aggregate that, when FUN returns more than one value, the results are stored as a matrix within the single column for the aggregated variable, rather than as separate columns.
The easiest way to get rid of this is to go through as.list before as.data.frame, which flattens the data structure.
aggr.gannet <- as.data.frame(as.list(aggr.gannet))
It will still use x as the name. The way I discovered to fix this is to use the formula interface to aggregate, so your aggregate would look more like
aggr.gannet<-aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=sub.gannet,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
Walking it through (here I haven't taken the subset to illustrate the aggregation by species)
df <- structure(list(positionmonth = c(5L, 5L, 5L, 5L, 5L, 5L), positionyear = c(2014L, 2014L, 2014L, 2014L, 2014L, 2014L), quadrant = c(1L, 1L, 1L, 1L, 1L, 1L), Species = structure(c(1L, 1L, 1L, 1L, 2L, 1L), .Label = c("Bar-tailed Godwit", "Gannet"), class = "factor"), Code = c(5340L, 5340L, 5340L, 5340L, 710L, 5340L), sum_areainkm2 = c(155.6562, 155.6562, 155.6562, 155.6562, 155.6562, 155.6562)), .Names = c("positionmonth", "positionyear", "quadrant", "Species", "Code", "sum_areainkm2"), class = "data.frame", row.names = c(NA, -6L))
df.agg <- as.data.frame(as.list(aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=df,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))))
Which results in what you want:
> df.agg
positionyear positionmonth quadrant Species Code
1 2014 5 1 Gannet 710
2 2014 5 1 Bar-tailed Godwit 5340
sum_areainkm2.observed_area sum_areainkm2.NoInd
1 155.6562 1
2 155.6562 5
> names(df.agg)
[1] "positionyear" "positionmonth"
[3] "quadrant" "Species"
[5] "Code" "sum_areainkm2.observed_area"
[7] "sum_areainkm2.NoInd"
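To see why the original `aggr.gannet$x.observed_area` came back NULL, it may help to inspect the structure directly. The sketch below uses a tiny made-up data frame (obs, with shortened column names) and shows that aggregate returns a single matrix column, how to index into it, and how to flatten it.

```r
# Small made-up example: two species observed over the same area
obs <- data.frame(
  Species = c("Godwit", "Godwit", "Gannet"),
  area    = c(155.6562, 155.6562, 155.6562)
)

agg <- aggregate(area ~ Species, data = obs,
                 FUN = function(x) c(observed_area = mean(x), NoInd = length(x)))

ncol(agg)            # 2: 'Species' plus ONE column called 'area'
is.matrix(agg$area)  # TRUE: the two statistics live inside that column

# Density can be computed by indexing the matrix column directly ...
density <- agg$area[, "NoInd"] / agg$area[, "observed_area"]

# ... or after flattening the matrix column into ordinary columns
flat <- do.call(data.frame, agg)
names(flat)
```

Printing agg shows names like area.observed_area, but those are only display labels for the matrix columns, which is why `$` cannot find them until the data frame is flattened.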
Obligatory note here, that dplyr and data.table are powerful libraries that allow doing this sort of aggregation very simply and efficiently.
dplyr
Dplyr has some strange syntax (the %>% operator), but ends up being quite readable, and allows chaining more complex operations
> require(dplyr)
> df %>%
group_by(positionyear, positionmonth, quadrant, Species, Code) %>%
summarise(observed_area=mean(sum_areainkm2), NoInd = n())
data.table
data.table has a more compact syntax and may be faster with large datasets.
library(data.table)
dt <- as.data.table(df) # convert the data frame to a data.table first
dt[,
.(observed_area=mean(sum_areainkm2), NoInd=.N),
by=.(positionyear, positionmonth, quadrant, Species, Code)]

ggplot2 line chart gives "geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?"

With this data frame ("df"):
year pollution
1 1999 346.82000
2 2002 134.30882
3 2005 130.43038
4 2008 88.27546
I try to create a line chart like this:
plot5 <- ggplot(df, aes(year, pollution)) +
geom_point() +
geom_line() +
labs(x = "Year", y = "Particulate matter emissions (tons)", title = "Motor vehicle emissions in Baltimore")
The error I get is:
geom_path: Each group consist of only one observation. Do you need to
adjust the group aesthetic?
The chart appears as a scatter plot even though I want a line chart. I tried to replace geom_line() with geom_line(aes(group = year)) but that didn't work.
In an answer I was told to convert year to a factor variable. I did and the problem persists. This is the output of str(df) and dput(df):
'data.frame': 4 obs. of 2 variables:
$ year : num 1 2 3 4
$ pollution: num [1:4(1d)] 346.8 134.3 130.4 88.3
..- attr(*, "dimnames")=List of 1
.. ..$ : chr "1999" "2002" "2005" "2008"
structure(list(year = c(1, 2, 3, 4), pollution = structure(c(346.82,
134.308821199349, 130.430379885892, 88.275457392443), .Dim = 4L, .Dimnames = list(
c("1999", "2002", "2005", "2008")))), .Names = c("year",
"pollution"), row.names = c(NA, -4L), class = "data.frame")
You only have to add group = 1 into the ggplot or geom_line aes().
For line graphs, the data points must be grouped so that it knows which points to connect. In this case, it is simple -- all points should be connected, so group=1. When more variables are used and multiple lines are drawn, the grouping for lines is usually done by variable.
Reference: Cookbook for R, Chapter: Graphs Bar_and_line_graphs_(ggplot2), Line graphs.
Try this:
plot5 <- ggplot(df, aes(year, pollution, group = 1)) +
geom_point() +
geom_line() +
labs(x = "Year", y = "Particulate matter emissions (tons)",
title = "Motor vehicle emissions in Baltimore")
You get this error because one of your variables is actually a factor variable. Execute
str(df)
to check this.
Then convert the factor to character first and then to numeric, so that the year values are kept instead of being replaced by the level codes 1, 2, 3, 4:
df$year <- as.numeric(as.character(df$year))
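The difference between the two conversions is easy to check on a small factor. The vector below is made up for illustration and only mirrors the years in the question:

```r
# A factor holding years, as it would look after factor() or a careless import
year <- factor(c("1999", "2002", "2005", "2008"))

# Direct conversion returns the internal level codes, not the years
codes <- as.numeric(year)

# Going through character first recovers the actual values
values <- as.numeric(as.character(year))

codes
values
```

Here codes holds 1 to 4 while values holds the original years, which is why the intermediate as.character() step matters.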
EDIT: it appears that your data.frame has a variable of class "array", which might cause the problem.
Try this instead:
df <- data.frame(apply(df, 2, unclass))
and plot again?
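Judging from the dput output in the question, the pollution column is a one-dimensional array (it carries .Dim and .Dimnames), and the real years only survive as its dimnames. A possible way to rebuild a plain data frame from it is sketched below; whether the array column is really what upsets ggplot2 is an assumption here, and the name clean is hypothetical.

```r
# The pollution column as given in the question's dput output
pollution <- structure(
  c(346.82, 134.308821199349, 130.430379885892, 88.275457392443),
  .Dim = 4L,
  .Dimnames = list(c("1999", "2002", "2005", "2008"))
)

is.array(pollution)  # TRUE: a 1-d array, not a plain numeric vector

# as.numeric() drops the dim and dimnames attributes;
# the years are recovered from the dimnames
clean <- data.frame(
  year      = as.numeric(dimnames(pollution)[[1]]),
  pollution = as.numeric(pollution)
)

str(clean)
```

After this, both columns are ordinary numeric vectors and year holds the actual years rather than 1 to 4.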
I had a similar problem with this data frame:
group time weight.loss
1 Control wl1 4.500000
2 Diet wl1 5.333333
3 DietEx wl1 6.200000
4 Control wl2 3.333333
5 Diet wl2 3.916667
6 DietEx wl2 6.100000
7 Control wl3 2.083333
8 Diet wl3 2.250000
9 DietEx wl3 2.200000
I think the variable for the x axis should be numeric, so that geom_line knows how to connect the points to draw the line.
After I changed the 2nd column to numeric:
group time weight.loss
1 Control 1 4.500000
2 Diet 1 5.333333
3 DietEx 1 6.200000
4 Control 2 3.333333
5 Diet 2 3.916667
6 DietEx 2 6.100000
7 Control 3 2.083333
8 Diet 3 2.250000
9 DietEx 3 2.200000
it works.
Start up R in a fresh session and paste this in:
library(ggplot2)
df <- structure(list(year = c(1, 2, 3, 4), pollution = structure(c(346.82,
134.308821199349, 130.430379885892, 88.275457392443), .Dim = 4L, .Dimnames = list(
c("1999", "2002", "2005", "2008")))), .Names = c("year",
"pollution"), row.names = c(NA, -4L), class = "data.frame")
df[] <- lapply(df, as.numeric) # make all columns numeric
ggplot(df, aes(year, pollution)) +
geom_point() +
geom_line() +
labs(x = "Year",
y = "Particulate matter emissions (tons)",
title = "Motor vehicle emissions in Baltimore")
I got a similar message. It was because I had specified the x-axis values as percentage labels (for example: 10%A, 20%B, ...).
So an alternative approach could be to convert these values to plain numbers.
I found this can also occur if most of the plotted data lies outside the axis limits. In that case, adjust the axis scales accordingly.
