ggplot colours are not what I set them to be (R)

I'm trying to give each line in my ggplot a specific colour, by year and count type.
All_Flights_Combined_DayOfWeek %>%
pivot_longer(cols = Delay_Count:Total_Count) %>%
mutate(Year2 = paste0(Year, " ", gsub("_", " ", name)),
DayOfWeek = factor(DayOfWeek, levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))) %>%
ggplot(aes(x = DayOfWeek, y = value, color = Year2, group = Year2)) +
geom_line() +
labs(color = "Year", x = "DayOfWeek", y = "Number of Flights") +
scale_color_manual(values = c("2003 Delay count" = "red",
"2004 Delay count" = "green",
"2005 Delay count" = "blue",
"2003 Total count" = "orange",
"2004 Total count" = "yellow",
"2005 Total count" = "purple"),
breaks = c("2003 Delay count",
"2004 Delay count",
"2005 Delay count",
"2003 Total count",
"2004 Total count",
"2005 Total count")) +
ggtitle("Flight Count by DayOfWeek")
However, every line in the plot that R gives me is grey.
This is my dataframe:
All_Flights_Combined_DayOfWeek
| Year| DayOfWeek| Delay_Count| Total_Count|
| --- | -------- | ---------- | ---------- |
|2003 | Monday | 274690 | 959114 |
|2003 | Tuesday | 237921 | 947126 |
|2003 |Wednesday | 256557 | 962100 |
|2003 | Thursday | 287079 | 952542 |
|2003 | Friday | 303926 | 954701 |
|2003 | Saturday | 222247 | 811260 |
|2003 | Sunday | 276785 | 901697 |
|2004 | Monday | 367992 | 1044508 |
|2004 | Tuesday | 314613 | 1033863 |
|2004 |Wednesday | 327136 | 1036521 |
|2004 | Thursday | 374378 | 1060245 |
|2004 | Friday | 394566 | 1061447 |
|2004 | Saturday | 285131 | 903807 |
|2004 | Sunday | 348867 | 988879 |
|2005 | Monday | 376263 | 1048968 |
|2005 | Tuesday | 327428 | 1037289 |
|2005 |Wednesday | 349048 | 1043208 |
|2005 | Thursday | 401044 | 1047749 |
|2005 | Friday | 420650 | 1050985 |
|2005 | Saturday | 299717 | 919442 |
|2005 | Sunday | 363711 | 992955 |
I have tried adding theme_classic() to the ggplot but to no avail.
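One likely cause, judging only from the code above: the labels built in mutate() come from the column names Delay_Count and Total_Count, so they read "2003 Delay Count" (capital C), while the names in scale_color_manual() say "Delay count". Values without a matching name get the NA colour, which is grey. A minimal base R sketch of that check (the names data_labels, scale_names and colours are made up for illustration):

```r
# Labels as the mutate() step in the question would build them
years <- c(2003, 2004, 2005)
measures <- c("Delay_Count", "Total_Count")
data_labels <- as.vector(outer(years, gsub("_", " ", measures), paste))

# Names used in scale_color_manual() in the question (lowercase "count")
scale_names <- c("2003 Delay count", "2004 Delay count", "2005 Delay count",
                 "2003 Total count", "2004 Total count", "2005 Total count")

# Labels in the data that have no matching colour name
unmatched <- setdiff(data_labels, scale_names)
print(unmatched)

# One possible fix: name the colours after the labels that are really in the data
colours <- setNames(c("red", "green", "blue", "orange", "yellow", "purple"),
                    data_labels)
print(colours)
```

Passing values = colours and breaks = names(colours) to scale_color_manual() should then colour the lines as intended.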

Related

How to drop unused value labels in crosstabulations table outputs using cro function from expss package?

I'm using haven-labelled data frames (the variables already have value labels when the datasets are imported). I need to run many crosstabulations of two variables. I'm using the cro function from the expss package because by default it displays value labels and computes weighted crosstabs.
However, the output tables display unused value labels. How can I drop them without manually dropping unused value labels for each variable? (By the way, the fre function from the expss package has the argument drop_unused_labels = TRUE by default, but the cro function doesn't.)
Here is a reproducible example:
# Dataframe
df <- data.frame(sex = c(1, 2, 99, 2, 1, 2, 2, 2, 1, 2),
agegroup= c(1, 2, 99, 2, 3, 3, 2, 2, 2, 1),
weight = c(100, 20, 400, 300, 50, 50, 80, 250, 100, 100))
library(expss)
# Variable labels
var_lab(df$sex) <-"Sex"
var_lab(df$agegroup) <-"Age group"
# Value labels
val_lab(df$sex) <- make_labels("1 Male
2 Female
97 Didn't know
98 Didn't respond
99 Abandoned survey")
val_lab(df$agegroup) <- make_labels("1 1-29
2 30-49
3 50 and more
97 Didn't know
98 Didn't respond
99 Abandoned survey")
cro(df$sex, df$agegroup, weight = df$weight)
| | | Age group | | | | | |
| | | 1-29 | 30-49 | 50 and more | Didn't know | Didn't respond | Abandoned survey |
| --- | ---------------- | --------- | ----- | ----------- | ----------- | -------------- | ---------------- |
| Sex | Male | 100 | 100 | 50 | | | |
| | Female | 100 | 650 | 50 | | | |
| | Didn't know | | | | | | |
| | Didn't respond | | | | | | |
| | Abandoned survey | | | | | | 400 |
| | #Total cases | 2 | 5 | 2 | | | 1 |
I want to get rid of the columns and rows called ‘Didn't know’ and ‘Didn't respond’.
You can use the drop_unused_labels function to remove the labels that are not used.
library(expss)
df1 <- drop_unused_labels(df)
cro(df1$sex, df1$agegroup, weight = df1$weight)
| | | Age group | | | |
| | | 1-29 | 30-49 | 50 and more | Abandoned survey |
| --- | ---------------- | --------- | ----- | ----------- | ---------------- |
| Sex | Male | 100 | 100 | 50 | |
| | Female | 100 | 650 | 50 | |
| | Abandoned survey | | | | 400 |
| | #Total cases | 2 | 5 | 2 | 1 |

Using Stat_Summary when the data isn't normal

I'm trying to plot confidence intervals for data that isn't normally distributed. I was advised to use stat_summary, but I can't find any details on how to specify fun.args so that the 5th and 95th percentiles are plotted as a ribbon. Taking the Excel percentiles for 0.05 and 0.95 for year 2 of S4, I get 31.015 and 31.104, which is not what the plot shows. I assume the issue is with fun.data = mean_cl_normal, but there is very little information on what the options are.
Here is the data I'm using:
+-----------+-----------+-----------+-----------+------+
| S4 | S5 | S6 | S7 | year |
+-----------+-----------+-----------+-----------+------+
| 31.052168 | 30.612594 | 30.328008 | 30.162733 | 2 |
| 31.0111 | 30.664017 | 30.277935 | 30.118793 | 2 |
| 31.049706 | 30.70231 | 30.341677 | 30.202466 | 2 |
| 31.077554 | 30.701983 | 30.355643 | 30.161663 | 2 |
| 31.056968 | 30.696955 | 30.323812 | 30.186214 | 2 |
| 31.096337 | 30.679318 | 30.261566 | 30.080544 | 2 |
| 31.073879 | 30.618196 | 30.281664 | 30.187808 | 2 |
| 31.115269 | 30.700809 | 30.301731 | 30.211642 | 2 |
| 31.085665 | 30.716211 | 30.362345 | 30.16574 | 2 |
| 31.076053 | 30.720127 | 30.319381 | 30.14898 | 2 |
| 31.017615 | 30.73175 | 30.326711 | 30.142657 | 2 |
| 31.020176 | 30.660135 | 30.274531 | 30.144741 | 2 |
| 31.04606 | 30.635148 | 30.362041 | 30.061961 | 2 |
| 31.06509 | 30.65724 | 30.305546 | 30.062432 | 3 |
| 30.974952 | 30.690091 | 30.305273 | 30.186476 | 3 |
| 30.99952 | 30.658606 | 30.29415 | 30.203725 | 3 |
| 31.013494 | 30.621646 | 30.2701 | 30.169807 | 3 |
| 31.081632 | 30.702792 | 30.326554 | 30.063521 | 3 |
| 31.033945 | 30.650637 | 30.334073 | 30.158865 | 3 |
| 31.075722 | 30.627908 | 30.331883 | 30.125196 | 3 |
| 31.036684 | 30.694549 | 30.322353 | 30.125278 | 3 |
| 31.054786 | 30.60339 | 30.356116 | 30.125177 | 3 |
| 31.089391 | 30.652875 | 30.268113 | 30.173289 | 3 |
| 31.063207 | 30.65264 | 30.346941 | 30.174659 | 3 |
| 31.050838 | 30.7144 | 30.28113 | 30.104956 | 3 |
| 31.002156 | 30.727084 | 30.28905 | 30.15026 | 3 |
| 31.052874 | 30.672237 | 30.325414 | 30.055 | 3 |
| 31.116682 | 30.737313 | 30.309537 | 30.13867 | 3 |
| 31.051456 | 30.662466 | 30.264082 | 30.125838 | 3 |
| 31.082019 | 30.646523 | 30.300457 | 30.119709 | 3 |
+-----------+-----------+-----------+-----------+------+
and the code:
library(tidyverse)
dat <- read.table("C:/temp.txt",sep="\t", header=TRUE)
df <- dat %>%
pivot_longer(cols = c(S4), names_to = "variable", values_to = "value")
ggplot(df, aes(x = year, y = value, color = variable)) +
stat_summary(geom = "line", fun = mean, linetype = "solid") +
stat_summary(geom = "ribbon", fun.data= mean_cl_normal, fun.args = list(conf.int=0.95), alpha=.1)
Adjusted code that draws the ribbon from quantiles instead of mean_cl_normal:
ggplot(df, aes(x = year, y = value, color = variable)) +
stat_summary(geom = "line", fun = mean, linetype = "solid") +
stat_summary(geom = "ribbon", fun.min = function(z) { quantile(z,0.05) },
fun.max = function(z) { quantile(z,0.95) }, alpha=.1)
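An alternative sketch, assuming you would rather keep a single summary function: stat_summary() also accepts a fun.data function that returns a data frame with the columns y, ymin and ymax. The helper name quantile_band and the toy vector below are made up for illustration.

```r
# Hypothetical helper: the mean plus a lower and an upper percentile,
# in the y / ymin / ymax layout that fun.data expects
quantile_band <- function(z, lower = 0.05, upper = 0.95) {
  data.frame(
    y    = mean(z),
    ymin = unname(quantile(z, lower)),
    ymax = unname(quantile(z, upper))
  )
}

# Quick check on a toy vector
toy <- 1:101
band <- quantile_band(toy)
print(band)
```

It could then be used as stat_summary(geom = "ribbon", fun.data = quantile_band, alpha = .1), with different limits supplied through fun.args = list(lower = ..., upper = ...).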

expss table with row percentage within nested variables in R

When using the expss package in R for creating tables, how does one get the row_percentages to be calculated within a nested variable? In the example below, I would like the row percentage to be calculated within each time period. Thus, I would like the row percentage to sum to 100% within each time period (2015-2016 and 2017-2018). Now however, the percentage is calculated over the entire row.
library(expss)
data(mtcars)
mtcars$period <- "2015-2016"
mtcars <- rbind(mtcars, mtcars)
mtcars$period[33:64] <- "2017-2018"
mtcars = apply_labels(mtcars,
cyl = "Number of cylinders",
am = "Transmission",
am = c("Automatic" = 0,
"Manual"=1),
period = "Measurement period"
)
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period %nest% am) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot()
Created on 2019-09-28 by the reprex package (v0.3.0)
| | | | Measurement period | | | |
| | | | 2015-2016 | | 2017-2018 | |
| | | | Transmission | | Transmission | |
| | | | Automatic | Manual | Automatic | Manual |
| ------------------- | ------------ | -------- | ------------------ | ------ | ------------ | ------ |
| Number of cylinders | 4 | row_perc | 13.6 | 36.4 | 13.6 | 36.4 |
| | 6 | row_perc | 28.6 | 21.4 | 28.6 | 21.4 |
| | 8 | row_perc | 42.9 | 7.1 | 42.9 | 7.1 |
| | #Total cases | row_perc | 19.0 | 13.0 | 19.0 | 13.0 |
I believe this is what you are after:
library(expss)
data(mtcars)
mtcars$period <- "2015-2016"
mtcars <- rbind(mtcars, mtcars)
mtcars$period[33:64] <- "2017-2018"
mtcars = apply_labels(mtcars,
cyl = "Number of cylinders",
am = "Transmission",
am = c("Automatic" = 0,
"Manual"=1),
period = "Measurement period"
)
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period %nest% am ) %>%
tab_subgroup(period =="2015-2016") %>%
tab_stat_rpct(label = "row_perc") %>%
tab_subgroup(period =="2017-2018") %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot(stat_position = "inside_rows")
Note the use of tab_subgroup(), which selects the period subgroup within which the percentages are calculated, and of stat_position = "inside_rows", which determines where the calculated statistics are placed in the final table.
Output:
| | | | Measurement period | | | |
| | | | 2015-2016 | | 2017-2018 | |
| | | | Transmission | | Transmission | |
| | | | Automatic | Manual | Automatic | Manual |
| ------------------- | ------------ | -------- | ------------------ | ------ | ------------ | ------ |
| Number of cylinders | 4 | row_perc | 27.3 | 72.7 | | |
| | | | | | 27.3 | 72.7 |
| | 6 | row_perc | 57.1 | 42.9 | | |
| | | | | | 57.1 | 42.9 |
| | 8 | row_perc | 85.7 | 14.3 | | |
| | | | | | 85.7 | 14.3 |
| | #Total cases | row_perc | 19.0 | 13.0 | | |
| | | | | | 19.0 | 13.0 |
EDIT:
We do not need %nest% if we do not want nested rows (i.e. twice as many rows). In that case, the final part of the code should be modified as follows:
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period,am) %>%
tab_subgroup(period ==c("2015-2016")) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_subgroup(period ==c("2017-2018")) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot(stat_position = "outside_columns")
Output:
| | | Measurement period | Transmission | | | | Measurement period |
| | | 2015-2016 | Automatic | Manual | Automatic | Manual | 2017-2018 |
| | | row_perc | row_perc | row_perc | row_perc | row_perc | row_perc |
| ------------------- | ------------ | ------------------ | ------------ | -------- | --------- | -------- | ------------------ |
| Number of cylinders | 4 | 100 | 27.3 | 72.7 | 27.3 | 72.7 | 100 |
| | 6 | 100 | 57.1 | 42.9 | 57.1 | 42.9 | 100 |
| | 8 | 100 | 85.7 | 14.3 | 85.7 | 14.3 | 100 |
| | #Total cases | 32 | 19.0 | 13.0 | 19.0 | 13.0 | 32 |

Calculate sum and frequency of two different columns with multiple variables and plot using area graph

I have a data which looks like this:
| Employee | Employee_id | Transaction_date | Expense_Type | Attendees | Vendor | Purpose | Amount |
|----------|:-----------:|-----------------:|-----------------|-----------|--------------|-----------------------------|--------|
| Nancy | 1 | 12/27/2018 | Individual_Meal | NA | Chiles | Dinner in NYC | 128 |
| David | 2 | 9/9/2017 | Group_Meal | Jess | Renaissance | External Business Meeting | 600 |
| David | 2 | 9/9/2017 | Group_Meal | Peter | Renaissance | External Business Meeting | 600 |
| David | 2 | 9/9/2017 | Group_Meal | David | Renaissance | External Business Meeting | 600 |
| John | 3 | 10/4/2017 | Group_Meal | Mike | Subway | Lunch with Mike and Maximus | 130 |
| Mary | 4 | 1/16/2019 | Group_Meal | Carol | Olive_Garden | summit with Intel | 235 |
| Mary | 4 | 1/16/2019 | Group_Meal | Sonia | Olive_Garden | summit with Intel | 235 |
| Mary | 4 | 1/16/2019 | Group_Meal | James | Olive_Garden | summit with Intel | 235 |
| Mary | 4 | 1/16/2019 | Group_Meal | Mary | Olive_Garden | summit with Intel | 235 |
| John | 3 | 10/4/2017 | Group_Meal | Maximus | Subway | Lunch with Mike and Maximus | 130 |
| John | 3 | 10/4/2017 | Group_Meal | John | Subway | Lunch with Mike and Maximus | 130 |
| Richard | 5 | 4/11/2018 | Individual_Meal | NA | Dominos | Dinner in Ohio | 50 |
I want to aggregate the table so that I can see the number of attendees for each employee and the total expense incurred for them. The final table should look something like this:
| Employee | Employee_id | Transaction_date | Expense_Type | Vendor | Purpose | No_of_Attendee | Total_Amount |
|----------|:-----------:|-----------------:|-----------------|--------------|-----------------------------|----------------|--------------|
| Nancy | 1 | 12/27/2018 | Individual_Meal | Chiles | Dinner in NYC | 1 | 128 |
| David | 2 | 9/9/2017 | Group_Meal | Renaissance | External Business Meeting | 3 | 1800 |
| John | 3 | 10/4/2017 | Group_Meal | Subway | Lunch with Mike and Maximus | 3 | 390 |
| Mary | 4 | 1/16/2019 | Group_Meal | Olive_Garden | summit with Intel | 4 | 940 |
| Richard | 5 | 4/11/2018 | Individual_Meal | Dominos | Dinner in Ohio | 1 | 50 |
Next, I want to generate an area plot with 'Transaction_date' on the x axis and 'Amount' on the y axis, and with other variables such as vendor and purpose shown in the tooltip. I have tried some code, but I'm not sure how to calculate the frequency and the sum of two different columns while retaining the other columns, as shown in the desired output table. Also, when I use text within ggplot2, the area graph looks fine as long as only the employee is mentioned. As soon as I include vendor and/or purpose, the area graph changes, and I'm not sure why this is happening. Can someone please have a look at my code and let me know what is wrong and how to fix it?
library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
df4=read_excel("C:/Users/xyz/Desktop/eg1.xlsx")
df4_freq=df4 %>% group_by(Employee,Employee_id,Transaction_date,Vendor,Purpose,Expense_Type,
Amount) %>% summarise(count=n())
colnames(df4_freq)[8]= "No_of_Attendee"
plot=ggplot(df4_freq, aes(x = Transaction_date, y = Amount,
text=paste('Employee:',Employee,
'<br>No of Attendees:', No_of_Attendee,
'<br>Amount Per Attendee:', Amount,
'<br>Purpose:', Purpose,
'<br>Vendor:', Vendor
))) +
geom_area(aes(color = Expense_Type, fill = Expense_Type),
alpha = 0.5, position = position_dodge(0.8))+
geom_point(colour="black")+
scale_color_manual(values = c("#CC6600", "#606060")) +
scale_fill_manual(values = c("#CC6600", "#606060"))
plot=ggplotly(plot, tooltip = c("x","y","text"))
plot
PART 2:
The other problem that I'm facing is with area graph. If I enter only "employee" as the variable in the "text", my plot is perfect. But when I enter other variables such as "No_of_Attendee","Vendor" etc, my plot changes to straight lines. Is there any issue with ggplotly or text? For reference, I'm posting the code again, since I have added some more data to it.
library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
df4=data.frame("Employee"=c("Nancy","David","David","David","John","Mary","Mary","Mary","Mary",
"John","John","Richard","David","David","Mary","Mary","Mary"),
"Employee_id"=c(1,2,2,2,3,4,4,4,4,3,3,5,2,2,4,4,4),
"Transaction_date"=c("12/27/2018","9/9/2017","9/9/2017","9/9/2017","10/4/2017","1/16/2019",
"1/16/2019","1/16/2019","1/16/2019","10/4/2017","10/4/2017","4/11/2018","1/1/2018","1/1/2018",
"4/5/2018","4/5/2018","4/5/2018"),
"Expense_Type"=c("Individual_Meal","Group_Meal","Group_Meal","Group_Meal","Group_Meal",
"Group_Meal","Group_Meal","Group_Meal","Group_Meal","Group_Meal", "Group_Meal",
"Individual_Meal","Group_Meal","Group_Meal","Group_Meal" ,"Group_Meal","Group_Meal"),
"Attendees"=c("NA","Jess","Peter","David","Mike","Carol","Sonia","James","Mary","Maximus",
"John","NA","Arya","David","Jon","Elizabeth","Marco"),
"Vendor"=c("Chiles","Renaissance","Renaissance","Renaissance","Subway","Olive_Garden","Olive_Garden",
"Olive_Garden","Olive_Garden","Subway","Subway","Dominos","BJ","BJ","Little_Italy","Little_Italy","Little_Italy"),
"Purpose"=c("Dinner in NYC","External Business Meeting","External Business Meeting","External Business Meeting",
"Lunch with Mike and Maximus","summit with Intel","summit with Intel","summit with Intel","summit with Intel",
"Lunch with Mike and Maximus","Lunch with Mike and Maximus","Dinner in Ohio","Lunch with Arya","Lunch with Arya",
"Business_Meeting","Business_Meeting","Business_Meeting"),
"Amount"= c(128,600,600,600,130,235,235,235,235,130,130,50,95,95,310,310,310))
str(df4)
df4$Transaction_date<- as.Date(df4$Transaction_date, "%m/%d/%Y")
df4_freq=df4 %>% group_by(Employee,Employee_id,Transaction_date,Vendor,Purpose,Expense_Type)%>% summarise(No_of_Attendee=n(), Total_Amount=sum(Amount))
plot=ggplot(df4_freq, aes(x = Transaction_date, y = Total_Amount,
text=paste('Employee:',Employee))) +
geom_area(aes(color = Expense_Type, fill = Expense_Type),
alpha = 0.5, position = position_dodge(0.8))+
geom_point(colour="black")+
scale_color_manual(values = c("#CC6600", "#606060")) +
scale_fill_manual(values = c("#CC6600", "#606060"))
plot=ggplotly(plot, tooltip = c("x","y","text"))
plot
With only the 'Employee' variable in the text, the plot looks perfect.
However, when I include other variables such as 'No_of_Attendee', 'Vendor' etc., my plot is drawn as single lines. Below is the code.
plot=ggplot(df4_freq, aes(x = Transaction_date, y = Total_Amount,
text=paste('Employee:',Employee,
'<br>No of Attendees:', No_of_Attendee,
'<br>Total_Amount:', Total_Amount,
'<br>Purpose:', Purpose,
'<br>Vendor:', Vendor
))) +
geom_area(aes(color = Expense_Type, fill = Expense_Type),
alpha = 0.5, position = position_dodge(0.8))+
geom_point(colour="black")+
scale_color_manual(values = c("#CC6600", "#606060")) +
scale_fill_manual(values = c("#CC6600", "#606060"))
plot=ggplotly(plot, tooltip = c("x","y","text"))
plot
It would be really helpful if someone could tell me what is wrong with my code.
It seems like by grouping by Amount you are preventing Total_Amount from being calculated. For example David's Meal on 9/9/2017 will create a group that represents those three rows, but then you can only summarize with count = n() which will count the number of rows in that group. But because you grouped on Amount you won't be able to produce a row that summarizes the Total_Amount. I would suggest the following to create the dataset you're looking for:
data %>%
group_by(Employee, Employee_id, Transaction_date, Expense_Type, Vendor, Purpose) %>%
summarize(No_of_Attendee = n(),
Total_Amount = sum(Amount))
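On Part 2, a likely explanation (an assumption based on how ggplot2 forms groups, not something checked against your exact plot): by default ggplot2 groups rows by the interaction of all discrete variables mapped in the plot, and that includes the text aesthetic. Once the tooltip text is different for every row, each row becomes its own group and geom_area() has only a single point per group to draw. Setting group explicitly in geom_area() is one way around it. A minimal sketch with a hand-typed summary of your data (toy_freq is a hypothetical name, and position = "identity" is used here instead of position_dodge() to keep the example simple):

```r
library(ggplot2)

# Hand-typed summary rows, for illustration only
toy_freq <- data.frame(
  Employee         = c("David", "John", "David", "Richard",
                       "Mary", "Nancy", "Mary"),
  Transaction_date = as.Date(c("2017-09-09", "2017-10-04", "2018-01-01",
                               "2018-04-11", "2018-04-05", "2018-12-27",
                               "2019-01-16")),
  Expense_Type     = c("Group_Meal", "Group_Meal", "Group_Meal",
                       "Individual_Meal", "Group_Meal", "Individual_Meal",
                       "Group_Meal"),
  Vendor           = c("Renaissance", "Subway", "BJ", "Dominos",
                       "Little_Italy", "Chiles", "Olive_Garden"),
  Total_Amount     = c(1800, 390, 190, 50, 930, 128, 940)
)

p <- ggplot(toy_freq, aes(x = Transaction_date, y = Total_Amount,
                          text = paste("Employee:", Employee,
                                       "<br>Vendor:", Vendor))) +
  # group = Expense_Type replaces the default grouping, which would
  # otherwise be split further by the per-row `text` values
  geom_area(aes(color = Expense_Type, fill = Expense_Type,
                group = Expense_Type),
            alpha = 0.5, position = "identity") +
  geom_point(colour = "black")

# Number of groups the area layer is drawn with
n_groups <- length(unique(layer_data(p, 1)$group))
print(n_groups)
```

The plot object can then be passed to ggplotly(p, tooltip = c("x", "y", "text")) as in your code.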

How to make multiple corpora in R

This is car review data with more than 40,000 rows, and each review has more than 500 characters. Here is some sample data: https://drive.google.com/open?id=1ZRwzYH5McZIP2NLKxncmFaQ0mX1Pe0GShTMu57Tac_E
| brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| brand2 | 500 characters3 | 100 Characters3 | | | | | |
| brand2 | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| brand3 | 500 characters6 | 100 characters6 | | | | | |
I'd like to merge the review column by brand, like this:
| Brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| | 500 characters3 | 100 Characters3 | | | | | |
| | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| | 500 characters6 | 100 characters6 | | | | | |
So I tried to use aggregate().
temp <- aggregate(data$review ~ data$brand , data, as.list )
But it takes a very long time.
Is there any simple way to merge that?
Thank you in advance!
Try splitting them on each factor and then pasting them together. aggregate() is a horribly slow function and should be avoided for all but the smallest datasets.
This should do the trick: (note I downloaded your Google file as sampleDF.csv here)
sampleDF <- read.csv("~/Downloads/sampleDF.csv", stringsAsFactors = FALSE)
# aggregate text by brand
brand.split <- split(sampleDF$text, as.factor(sampleDF$Brand))
brand.grouped <- sapply(brand.split, paste, collapse = " ")
# aggregate favorite by brand
favorite.split <- split(sampleDF$favorite, as.factor(sampleDF$Brand))
favorite.grouped <- sapply(favorite.split, paste, collapse = " ")
newDf <- data.frame(brand = names(brand.split),
text = brand.grouped,
favorite = favorite.grouped,
stringsAsFactors = FALSE)
If you want to bring in other variables they will need to vary at the brand level only.
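The same split-and-paste idea on a small self-contained example, as a sketch only: the data frame reviews and its contents are invented for illustration, with column names taken from the table in the question rather than from the downloaded file.

```r
# Toy stand-in for the review data (invented values)
reviews <- data.frame(
  brand  = c("brand1", "brand2", "brand2", "brand3", "brand3"),
  review = c("good car", "too loud", "nice seats", "fast", "reliable"),
  stringsAsFactors = FALSE
)

# Split the reviews by brand, then collapse each brand's reviews into one string
review_by_brand <- sapply(split(reviews$review, reviews$brand),
                          paste, collapse = " ")

# One row per brand, with all of its reviews merged into a single document
merged <- data.frame(brand  = names(review_by_brand),
                     review = unname(review_by_brand),
                     stringsAsFactors = FALSE)
print(merged)
```

Each row of merged can then be treated as one document per brand when building a corpus.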
