I'm trying to plot my dataset, i.e., the number of internship ads per month, using ggplot, but for some reason the highest value shows at the bottom of the y-axis.
Here's the dataset:
month internship_count_2012
January 68
February 48
March 43
April 49
May 52
June 83
July 104
August 91
September 72
October 58
November 70
December 77
The image of the plot:
I've run the following code to come up with the above dataset, which was extracted from a larger, filtered dataset found here: https://www.dropbox.com/home?preview=rwjAll_internship_2012.csv
Filter the dataset by month:
rwjAll_internship_jan2012 <- filter(rwjAll_internship_2012, month == 1)
rwjAll_internship_feb2012 <- filter(rwjAll_internship_2012, month == 2)
rwjAll_internship_mar2012 <- filter(rwjAll_internship_2012, month == 3)
rwjAll_internship_apr2012 <- filter(rwjAll_internship_2012, month == 4)
rwjAll_internship_may2012 <- filter(rwjAll_internship_2012, month == 5)
rwjAll_internship_jun2012 <- filter(rwjAll_internship_2012, month == 6)
rwjAll_internship_jul2012 <- filter(rwjAll_internship_2012, month == 7)
rwjAll_internship_aug2012 <- filter(rwjAll_internship_2012, month == 8)
rwjAll_internship_sep2012 <- filter(rwjAll_internship_2012, month == 9)
rwjAll_internship_oct2012 <- filter(rwjAll_internship_2012, month == 10)
rwjAll_internship_nov2012 <- filter(rwjAll_internship_2012, month == 11)
rwjAll_internship_dec2012 <- filter(rwjAll_internship_2012, month == 12)
Get the number of internship ads by month
jan_2012_internship <- sum(rwjAll_internship_jan2012$internship)
feb_2012_internship <- sum(rwjAll_internship_feb2012$internship)
mar_2012_internship <- sum(rwjAll_internship_mar2012$internship)
apr_2012_internship <- sum(rwjAll_internship_apr2012$internship)
may_2012_internship <- sum(rwjAll_internship_may2012$internship)
jun_2012_internship <- sum(rwjAll_internship_jun2012$internship)
jul_2012_internship <- sum(rwjAll_internship_jul2012$internship)
aug_2012_internship <- sum(rwjAll_internship_aug2012$internship)
sep_2012_internship <- sum(rwjAll_internship_sep2012$internship)
oct_2012_internship <- sum(rwjAll_internship_oct2012$internship)
nov_2012_internship <- sum(rwjAll_internship_nov2012$internship)
dec_2012_internship <- sum(rwjAll_internship_dec2012$internship)
Create dataset to plot 2012 internship ads
month <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
internship_count_2012<- c(jan_2012_internship, feb_2012_internship, mar_2012_internship, apr_2012_internship, may_2012_internship, jun_2012_internship, jul_2012_internship, aug_2012_internship, sep_2012_internship, oct_2012_internship, nov_2012_internship, dec_2012_internship)
all_internship_2012 <- cbind(month, internship_count_2012)
all_internship_2012 <- as.data.frame(all_internship_2012)
plot2 <- ggplot(all_internship_2012, aes(x = month, y = internship_count_2012, group = 1)) +
  geom_line(colour = "purple1") +
  geom_point(size = 0.5, colour = "purple1") +
  scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
How can I order the y values (internship_count_2012) so that the highest value is at the top? Thanks!
The following code greatly simplifies the creation of your new dataset using the dplyr library and produces the plot you want in just a few lines.
library(dplyr)
library(ggplot2)
# Sum the internship ads per month (month is numeric, 1 to 12, in the source data)
all_internship_2012 <- rwjAll_internship_2012 %>%
  group_by(month) %>%
  summarise(internship_count_2012 = sum(internship))
# Axis labels for the 12 numeric months
month_labels <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
ggplot(all_internship_2012, aes(x = month, y = internship_count_2012)) +
  geom_line(colour = "purple1") +
  geom_point(size = 0.5, colour = "purple1") +
  scale_x_continuous(breaks = 1:12, labels = month_labels) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
For info about dplyr check out this link on CRAN.
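As background on why the y-axis in the question looked scrambled: cbind() on a character vector and a numeric vector builds a character matrix, so after as.data.frame() the counts are text (a factor in R versions before 4.0), and ggplot places them as discrete categories sorted as strings rather than as numbers. The sketch below uses the twelve counts listed in the question and builds the table with data.frame() so the counts stay numeric. Using month.name (R's built-in month names) and turning month into an ordered factor are choices made here for illustration, not part of the original code.

```r
library(ggplot2)

month <- month.name  # built-in: "January", ..., "December"
internship_count_2012 <- c(68, 48, 43, 49, 52, 83, 104, 91, 72, 58, 70, 77)

# What the question did: cbind() coerces everything to character
bad <- as.data.frame(cbind(month, internship_count_2012))
is.numeric(bad$internship_count_2012)

# data.frame() keeps each column's own type; the factor levels fix the x order
all_internship_2012 <- data.frame(
  month = factor(month, levels = month),
  internship_count_2012 = internship_count_2012
)
is.numeric(all_internship_2012$internship_count_2012)

plot2 <- ggplot(all_internship_2012,
                aes(x = month, y = internship_count_2012, group = 1)) +
  geom_line(colour = "purple1") +
  geom_point(size = 0.5, colour = "purple1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

With a numeric y column, the default continuous y scale puts the largest count at the top.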
I am trying to create a function that converts month values to quarters using the case_when() function.
Then I want to use mutate() to create a new variable Qtr and determine how many observations fall in each quarter.
convert_to_qtr <- function(Month) {
  case_when(
    Month == "Jan" ~ "Q1",
    Month == "Feb" ~ "Q1",
    Month == "Mar" ~ "Q1",
    Month == "Apr" ~ "Q2",
    Month == "May" ~ "Q2",
    Month == "Jun" ~ "Q2",
    Month == "Jul" ~ "Q3",
    Month == "Aug" ~ "Q3",
    Month == "Sep" ~ "Q3",
    Month == "Oct" ~ "Q4",
    Month == "Nov" ~ "Q4",
    Month == "Dec" ~ "Q4"
  )
}
example_months <- c("Jan", "Mar", "May", "May", "Aug", "Nov", "Nov", "Dec")
convert_to_qtr(example_months)
df %>%
  mutate(Qtr = convert_to_qtr(Month)) %>%
  group_by(Qtr) %>%
  count(Qtr)
However, I am not getting the same answer as my professor shows in his drop-down, so I am not sure whether I am doing something wrong in my R code.
He sees the numbers 161,071 85,588 100,227 142,651.
I see 152174 165778 205615 174592 instead.
You could write the function as below:
convert_to_qtr <- function(Month) {
  setNames(paste0("Q", rep(1:4, each = 3)), month.abb)[Month]
}
df %>%
  mutate(Qtr = convert_to_qtr(Month)) %>%
  count(Qtr)
Qtr n
1 Q1 29
2 Q2 24
3 Q3 26
4 Q4 21
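To make the lookup approach above easier to follow, here is a small self-contained sketch applied to the example_months vector from the question. The as.character() and unname() calls are additions made here, on the assumption that Month could arrive as a factor (indexing by a factor would use its integer codes instead of its labels); they are not part of the answer's code.

```r
# Named lookup vector: names are "Jan".."Dec", values are "Q1".."Q4"
convert_to_qtr <- function(Month) {
  lookup <- setNames(paste0("Q", rep(1:4, each = 3)), month.abb)
  # as.character() guards against factor input; unname() drops the month names
  unname(lookup[as.character(Month)])
}

example_months <- c("Jan", "Mar", "May", "May", "Aug", "Nov", "Nov", "Dec")
qtrs <- convert_to_qtr(example_months)
qtrs
# "Q1" "Q1" "Q2" "Q2" "Q3" "Q4" "Q4" "Q4"

# Anything that is not a three-letter English abbreviation maps to NA
convert_to_qtr("January")
```

If the counts still differ from an expected result, checking for NA values in Qtr is a quick way to see whether some month labels did not match month.abb.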
Full disclosure: I inherited this code and tried to Frankenstein it enough to make it work. It isn't perfect.
I have a series of Excel workbooks I'm iterating through to extract financial data for a group of medical practices. The workbooks have a tab for each month. I used lapply() to iterate over the sheets to pull only the months in each quarter. One of the practices only has data from January and February of 2022 so I wouldn't expect that to show up for the 4th quarter update we just ran. However, that data is there.
library(tidyverse)
library(readxl)
library(openxlsx)
df1 <- data.frame("Medication" = seq(1:50),
"Total WAC" = seq(51:100))
df2 <- data.frame("Medication" = seq(1:50),
"Total WAC" = seq(51:100))
list_of_datasets <- list("January" = df1, "February" = df2)
write.xlsx(list_of_datasets, file = "C:/MC_report.xlsx")
current_month <- lubridate::month(as.Date(Sys.Date(), format = "%Y/%m/%d"))
current_year <- lubridate::year(as.Date(Sys.Date(), format = "%Y/%m/%d"))
Q1 <- c("January", "February", "March")
Q2 <- c("April", "May", "June")
Q3 <- c("July", "August", "September")
Q4 <- c("October", "November", "December")
quarter <- switch(current_month,
"1" = Q4, "2" = Q4, "3" = Q4,
"4" = Q1, "5" = Q1, "6" = Q1,
"7" = Q2, "8" = Q2, "9" = Q2,
"10" = Q3, "11" = Q3, "12" = Q3)
year <- ifelse(current_month %in% c(1, 2, 3), current_year - 1, current_year)
names = c("Medication", "Total WAC")
MCPath22 = "C:/MC_report.xlsx"
MClist22 = lapply(quarter, function(x){ # this function is repeated for each practice. I won't paste it over and over
  dat = read_excel(MCPath22, sheet = x, skip = 1)[c(1,2)] # 1 is 'Medication', 2 is 'Total WAC'
  names(dat) = names
  dat$Month = x
  dat$Year = year
  dat$Location = "Medical Center"
  return(dat)
})
MC_newdata = do.call(rbind, MClist22) %>%
  select(Medication, `Total WAC`, Month, Year, Location) %>%
  mutate(Date.Added = Sys.time())
data = rbind(MC_newdata, DHP_newdata, Lex_newdata, Derm_newdata, Onc_newdata, oldvalues) %>%
  filter(!is.na(Medication)) # includes all the practices
write_csv(data, "PAP Data.csv")
I just ran this again, and all facilities except the one with only January and February tabs run correctly. For that one it throws an error that 'October' is not found, which is expected. I can stop that piece in RStudio and the script completes, but then January and February are in the output. Any idea why it's outputting the wrong data?
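One way to avoid the 'October' not found error is to read only the sheets that actually exist in a workbook. The sketch below is a hedged illustration, not a diagnosis: sheets_in_quarter() and read_quarter() are hypothetical helper names introduced here, and the code assumes the same two-column layout as the question. It does not by itself explain the January and February rows; whether they come in through oldvalues or through an object left in the session by an earlier run is something to check, not something the post confirms.

```r
# Hypothetical helper: keep only the requested sheet names present in the workbook
sheets_in_quarter <- function(available, quarter) {
  intersect(quarter, available)
}

# Hypothetical helper: read the existing sheets for one practice.
# Returns NULL when the workbook has none of the requested sheets.
read_quarter <- function(path, quarter, year, location) {
  sheets <- sheets_in_quarter(readxl::excel_sheets(path), quarter)
  parts <- lapply(sheets, function(x) {
    dat <- readxl::read_excel(path, sheet = x, skip = 1)[c(1, 2)]
    names(dat) <- c("Medication", "Total WAC")
    dat$Month <- x
    dat$Year <- year
    dat$Location <- location
    dat
  })
  do.call(rbind, parts)
}

Q4 <- c("October", "November", "December")
# A workbook with only January and February tabs contributes no Q4 sheets
sheets_in_quarter(c("January", "February"), Q4)
# character(0)
```

Because each call builds its result from scratch, a practice with no matching tabs yields nothing for that run instead of stopping partway through.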
I have a data frame with a column of months and corresponding values. When I create a graph, the months are ordered alphabetically. I want to order the months using the factor() function, but now my graph only shows the month of May and NAs.
xnames<-c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
Data$Month<-factor(Data$Month, levels = xnames)
ggplot(DAtaTidy_MergeRWPeaks2, (aes(x=factor(Month, xnames), y=Volume)), na.rm=TRUE) +
geom_bar()
I tried embedding the factor in the ggplot function but it produced the same result. When I delete 'May' from 'xnames', the graph just shows NAs.
We can't see your data, but this behavior suggests that Data$Month contains values that are not included in your levels vector xnames. Is anything misspelled? I would suggest comparing levels(as.factor(Data$Month)) with xnames; that should show you the issue.
Example dataset that shows the same problem you have:
yums <- c('soup', 'salad', 'bread')
nums <- c(10, 14, 5)
df1 <- data.frame(yums, nums)
yum.levels <- c('soup', 'salad', 'bread', 'pasta')
ggplot(df1, aes(x=factor(yums, yum.levels), y=nums)) + geom_col()
That gives you this:
...but if we misspell one of them (like capitalizing "Soup" in yums), you get this:
yums1 <- c('Soup', 'salad', 'bread')
nums <- c(10, 14, 5)
df2 <- data.frame(yums1, nums)
yum.levels <- c('soup', 'salad', 'bread', 'pasta')
ggplot(df2, aes(x=factor(yums1, yum.levels), y=nums)) + geom_col()
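The comparison suggested above can also be done directly with setdiff(), which lists the values in the data that have no matching level. A minimal sketch using the misspelled example; for the original question, the same call would be setdiff(unique(Data$Month), xnames).

```r
yum.levels <- c('soup', 'salad', 'bread', 'pasta')
yums1 <- c('Soup', 'salad', 'bread')

# Values present in the data but missing from the levels
unmatched <- setdiff(unique(yums1), yum.levels)
unmatched
# "Soup"

# These are exactly the values that factor() turns into NA
f <- factor(yums1, levels = yum.levels)
is.na(f)
# TRUE FALSE FALSE
```

Differences in case or stray whitespace are typical causes; trimws() on the column before calling factor() is worth trying if the unmatched values look identical to the levels.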
I've got the following dataset:
tab <- tibble(year = c(2017, 2017, 2017, 2018, 2018, 2018),
              mth = c("Apr", "Apr", "Jun", "Jul", "Jul", "Sep"),
              var1 = 1:6,
              var2 = 10:15)
Is it possible to use kableExtra to generate a table of this data where there are two grouping variables, year and month? This would give:
            var1  var2
2017
    Apr
            1     10
            2     11
    Jun
            3     12
2018
    Jul
            4     13
            5     14
    Sep
            6     15
I've tried:
kable(tab[,3:4]) %>% pack_rows(index = table(year$Month, tab$mth))
It works fine with one grouping variable, but it doesn't work for two grouping variables.
This tutorial has great examples and explains how to do this.
library(dplyr)
library(kableExtra)
kable(tab, align = "c", col.names = c("", "", names(tab)[3:4])) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  collapse_rows(columns = 1:2, valign = "top")
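If you prefer pack_rows() over collapse_rows(), its index argument expects a named vector of group sizes in row order. Note that table() sorts its names alphabetically, which would put "Jul" before "Jun" here. The sketch below only builds the index vectors with rle(); run_index() is a hypothetical helper name, a plain data.frame stands in for the tibble, and how two pack_rows() calls combine for a second grouping level is not shown or verified here.

```r
tab <- data.frame(
  year = c(2017, 2017, 2017, 2018, 2018, 2018),
  mth  = c("Apr", "Apr", "Jun", "Jul", "Jul", "Sep"),
  var1 = 1:6,
  var2 = 10:15
)

# Hypothetical helper: group sizes in order of appearance, named by group label.
# Assumes rows are already sorted so that each group is one contiguous run.
run_index <- function(x) {
  runs <- rle(as.character(x))
  setNames(runs$lengths, runs$values)
}

year_index <- run_index(tab$year)  # 2017 = 3, 2018 = 3
mth_index  <- run_index(tab$mth)   # Apr = 2, Jun = 1, Jul = 2, Sep = 1
```

Either vector can then be passed as pack_rows(index = ...). If the same month ends one year and starts the next, the two runs would merge, so in that case build the runs on paste(tab$year, tab$mth) instead.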
I used the dcast function to show the spendings per month of different companies. Of course I want January first, then February etc. and not the alphabetical order.
Spendings <- data %>%
  filter(Familie == "Riegel" & Jahr == "2017") %>%
  group_by(Firma, Produktmarke, `Name Kurz`) %>%
  summarise(Spendingsges = sum(EUR, na.rm = TRUE))
Spendings <- dcast(data = Spendings, Firma + Produktmarke ~ `Name Kurz`, value.var = "Spendingsges")
Spendings
Firma Produktmarke Apr Aug Dez Feb Jan Jul Jun Mai Mrz Nov Okt Sep
Company1 Product1 228582 1902138 725781 NA 709970 NA 265313 228177 NA NA 1463258 4031267
Is there a way to reorder the columns dynamically? For 2018, for example, the data frame has fewer month columns, so I cannot use:
Spendings <- Spendings[,c("Firma", "Produktmarke", "Jan", "Feb", "Mrz", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")]
# Empty template with all columns in the desired order
Spendings_raw <- data.frame(matrix(ncol = 14, nrow = 0))
colnames(Spendings_raw) <- c("Firma", "Produktmarke", "Jan", "Feb", "Mrz", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")
Spendings_raw
Spendings <- data %>%
  filter(Familie == "Riegel" & Jahr == "2017") %>%
  group_by(Firma, Produktmarke, `Name Kurz`) %>%
  summarise(Spendingsges = sum(EUR, na.rm = TRUE))
Spendings <- dcast(data = Spendings, Firma + Produktmarke ~ `Name Kurz`, value.var = "Spendingsges")
# rbind.fill() comes from the plyr package; missing months are filled with NA
Spendings <- plyr::rbind.fill(Spendings_raw, Spendings)
This works perfectly ;-).
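If you would rather keep only the months that actually occur instead of padding the missing ones with NA, intersect() gives a dynamic reorder, because it returns the common names in the order of its first argument. A minimal sketch with a made-up four-month subset of the row shown above; month_order is a name introduced here for illustration.

```r
month_order <- c("Jan", "Feb", "Mrz", "Apr", "Mai", "Jun",
                 "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")

# Illustrative stand-in for the dcast() result, columns in alphabetical order
Spendings <- data.frame(
  Firma = "Company1", Produktmarke = "Product1",
  Apr = 228582, Aug = 1902138, Dez = 725781, Jan = 709970
)

# Month columns that exist, in calendar order
present <- intersect(month_order, names(Spendings))
Spendings <- Spendings[, c("Firma", "Produktmarke", present)]
names(Spendings)
# "Firma" "Produktmarke" "Jan" "Apr" "Aug" "Dez"
```

This works for any year without knowing in advance which months are present.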