Factor function in R returning NAs - r

I have a data frame with a column of 'months' and corresponding values. When I create a graph, the months are ordered alphabetically. I want to order the months using the factor function, but now my graph only shows the month of May and 'NAs'.
xnames<-c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
Data$Month<-factor(Data$Month, levels = xnames)
ggplot(DAtaTidy_MergeRWPeaks2, (aes(x=factor(Month, xnames), y=Volume)), na.rm=TRUE) +
geom_bar()
I tried embedding the factor in the ggplot function but it produced the same result. When I delete 'May' from 'xnames', the graph just shows NAs.

We can't see your data, but this behavior suggests that Data$Month contains values that are not included in your levels vector xnames; any value that is missing from the levels becomes NA. Is anything misspelled? I would suggest you compare levels(as.factor(Data$Month)) and xnames; that should reveal the mismatch.
Here is an example dataset that reproduces the problem. First, a version where every value matches a level:
library(ggplot2)
yums <- c('soup', 'salad', 'bread')
nums <- c(10, 14, 5)
df1 <- data.frame(yums, nums)
yum.levels <- c('soup', 'salad', 'bread', 'pasta')
ggplot(df1, aes(x=factor(yums, yum.levels), y=nums)) + geom_col()
That gives you one bar each for soup, salad, and bread.
...but if we misspell one of them (like capitalizing "Soup" in yums), that value no longer matches any level and is plotted as NA:
yums1 <- c('Soup', 'salad', 'bread')
nums <- c(10, 14, 5)
df2 <- data.frame(yums1, nums)
yum.levels <- c('soup', 'salad', 'bread', 'pasta')
ggplot(df2, aes(x=factor(yums1, yum.levels), y=nums)) + geom_col()
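The comparison suggested above can be sketched in base R. This is a minimal example under assumptions: the vector months_raw is made-up data standing in for Data$Month, and the cleanup step (trimming whitespace and fixing capitalization) covers only two common causes of a mismatch, not necessarily the one in the question.

```r
# Hypothetical month column: one value has a trailing space, one is lower case
months_raw <- c("Jan", "Feb ", "mar", "May")
xnames <- month.abb  # built-in constant: "Jan", "Feb", ..., "Dec"

# Values that match no level; these are the ones factor() turns into NA
unmatched <- setdiff(unique(months_raw), xnames)
print(unmatched)

# One possible cleanup: trim whitespace, then capitalize only the first letter
cleaned <- trimws(months_raw)
cleaned <- paste0(toupper(substr(cleaned, 1, 1)), tolower(substring(cleaned, 2)))

month_factor <- factor(cleaned, levels = xnames)
print(anyNA(month_factor))
```

If unmatched comes back empty but the plot still shows NAs, the column probably held missing values before factor() was applied.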

Related

Why are undesired sheets being read into my list with lapply()

Full disclosure: I inherited this code and tried to Frankenstein it enough to make it work. It isn't perfect.
I have a series of Excel workbooks I'm iterating through to extract financial data for a group of medical practices. The workbooks have a tab for each month. I used lapply() to iterate over the sheets to pull only the months in each quarter. One of the practices only has data from January and February of 2022 so I wouldn't expect that to show up for the 4th quarter update we just ran. However, that data is there.
library(tidyverse)
library(readxl)
library(openxlsx)
df1 <- data.frame("Medication" = seq(1:50),
"Total WAC" = seq(51:100))
df2 <- data.frame("Medication" = seq(1:50),
"Total WAC" = seq(51:100))
list_of_datasets <- list("January" = df1, "February" = df2)
write.xlsx(list_of_datasets, file = "C:/MC_report.xlsx")
current_month <- lubridate::month(as.Date(Sys.Date(), format = "%Y/%m/%d"))
current_year <- lubridate::year(as.Date(Sys.Date(), format = "%Y/%m/%d"))
Q1 <- c("January", "February", "March")
Q2 <- c("April", "May", "June")
Q3 <- c("July", "August", "September")
Q4 <- c("October", "November", "December")
quarter <- switch(current_month,
"1" = Q4, "2" = Q4, "3" = Q4,
"4" = Q1, "5" = Q1, "6" = Q1,
"7" = Q2, "8" = Q2, "9" = Q2,
"10" = Q3, "11" = Q3, "12" = Q3)
year <- ifelse(current_month %in% c(1, 2, 3), current_year - 1, current_year)
names = c("Medication", "Total WAC")
MCPath22 = "C:/MC_report.xlsx"
MClist22 = lapply(quarter, function(x){ # this function is repeated for each practice. I won't paste it over and over
  dat = read_excel(MCPath22, sheet = x, skip = 1)[c(1,2)] # column 1 is 'Medication', column 2 is 'Total WAC'
  names(dat) = names
  dat$Month = x
  dat$Year = year
  dat$Location = "Medical Center"
  return(dat)
})
MC_newdata = do.call(rbind, MClist22) %>%
select( Medication, `Total WAC`, Month, Year, Location) %>%
mutate(Date.Added = Sys.time())
data = rbind(MC_newdata, DHP_newdata, Lex_newdata, Derm_newdata, Onc_newdata, oldvalues) %>%
filter(!is.na(Medication)) #includes all the practices
write_csv(data,"PAP Data.csv")
I just ran this again, and all facilities except the one with only January and February tabs run correctly. For that one, it throws an error that 'October' is not found, which is expected. I can stop that piece in RStudio and the script completes, but then January and February are in the output. Any idea why it's outputting the wrong data?
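One way to avoid the 'October' not found error is to read only the sheets that actually exist in a workbook. The sketch below is a guard, not a diagnosis of why January and February appear in the output. It assumes the sheet names are available as a character vector: for a real file readxl::excel_sheets() returns one, while the hard-coded vector and the read_sheet helper here are hypothetical stand-ins so the example runs without a workbook.

```r
# Hypothetical stand-in for readxl::excel_sheets("C:/MC_report.xlsx")
available_sheets <- c("January", "February")
quarter <- c("October", "November", "December")

# Keep only the quarter's months that exist as tabs in this workbook
sheets_to_read <- intersect(quarter, available_sheets)

# Hypothetical reader; the real code would call read_excel(path, sheet = x)
read_sheet <- function(x) data.frame(Month = x, stringsAsFactors = FALSE)

# With no matching tabs, lapply() returns an empty list instead of failing
practice_list <- lapply(sheets_to_read, read_sheet)
print(length(practice_list))
```

A practice with no tabs for the quarter then contributes an empty list, which the calling code can check for before binding rows.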

Multiple grouping variables using kableExtra

I've got the following dataset:
tab <- tibble(year = c(2017,2017,2017,2018,2018,2018),
              mth = c("Apr", "Apr", "Jun", "Jul", "Jul", "Sep"),
              var1 = 1:6,
              var2 = 10:15)
Is it possible to use kableExtra to generate a table of this data where there are two grouping variables, year and month? This would give:
        var1 var2
2017
  Apr
           1   10
           2   11
  Jun
           3   12
2018
  Jul
           4   13
           5   14
  Sep
           6   15
I've tried:
kable(tab[,3:4]) %>% pack_rows(index = table(year$Month, tab$mth))
It works fine with one grouping variable, but it doesn't work for two grouping variables.
This tutorial has great examples and explains how to do this.
library(dplyr)
library(kableExtra)
kable(tab, align = "c", col.names = c("", "", names(tab)[3:4])) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  collapse_rows(columns = 1:2, valign = "top")
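If you would rather stay with pack_rows(), its index argument takes a named vector whose names are the group labels and whose values are the number of rows in each group. Note that table() sorts its labels alphabetically, which would put Jul before Jun here and no longer match the row order. A minimal base R sketch of building the counts in row order with rle() follows; the names year_index and month_index are illustrative, and how well two nested pack_rows() calls render is an assumption to check against the kableExtra documentation.

```r
# Same data as in the question, as a plain data frame
tab <- data.frame(year = c(2017, 2017, 2017, 2018, 2018, 2018),
                  mth  = c("Apr", "Apr", "Jun", "Jul", "Jul", "Sep"),
                  var1 = 1:6,
                  var2 = 10:15)

# Run lengths of consecutive identical years, in row order
year_runs  <- rle(as.character(tab$year))
year_index <- setNames(year_runs$lengths, year_runs$values)

# Same for year-month combinations, then strip the year from the label
month_runs  <- rle(paste(tab$year, tab$mth))
month_index <- setNames(month_runs$lengths, sub("^[0-9]+ ", "", month_runs$values))

print(year_index)
print(month_index)
```

Each vector could then be passed as index to its own pack_rows() call.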

Read values in dataframe and store it in a vector

I have a dataframe like this
X    2001  2002  2003
JAN  NA    1     2
JUN  NA    2     3
DEC  1     2     NA
I want to store the values in a vector and generate a time series from it.
The intended output is arranged by year and month, with NAs omitted:
output=c(1,1,2,2,2,3)
How can I do this?
You could go in this direction:
library(tidyverse)
dta <- tribble(
~X, ~"2001", ~"2002", ~"2003",
"JAN", NA, 1, 2,
"JUN", NA, 2, 3,
"DEC", 1, 2, NA)
dta %>%
pivot_longer(cols = '2001':'2003',
names_to = "year",
values_to = "val") %>%
arrange(year) %>%
filter(!is.na(val))
However, you need to ensure that the months are sorted correctly within each year: arrange(year) only sorts by year and otherwise keeps the original row order.
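For comparison, here is a minimal base R sketch (a different technique from the tidyverse pipeline above) that sorts the months explicitly and returns a plain vector. It assumes the month labels are upper-case English abbreviations, as in the question; the names month_levels, vals, and output are illustrative.

```r
# Data from the question; check.names = FALSE keeps the year column names as-is
dta <- data.frame(X = c("JAN", "JUN", "DEC"),
                  `2001` = c(NA, NA, 1),
                  `2002` = c(1, 2, 2),
                  `2003` = c(2, 3, NA),
                  check.names = FALSE)

# Put the rows in calendar order, whatever order they arrived in
month_levels <- toupper(month.abb)  # "JAN", "FEB", ..., "DEC"
dta <- dta[order(match(dta$X, month_levels)), ]

# as.matrix() + as.vector() stacks the values column by column, i.e. year by year
vals <- as.vector(as.matrix(dta[, -1]))

# Drop the missing values
output <- vals[!is.na(vals)]
print(output)
```

The result could be handed to ts() if a time series object is needed, although dropping the NAs also drops the information about which periods were missing.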

Dcast Function -> Reorder Columns

I used the dcast function to show the monthly spending of different companies. Of course I want January first, then February, and so on, not alphabetical order.
Spendings <- data %>%
filter(Familie == "Riegel" & Jahr == "2017") %>%
group_by(Firma, Produktmarke, `Name Kurz`) %>%
summarise(Spendingsges = sum(EUR, na.rm = TRUE))
Spendings <- dcast(data = Spendings, Firma + Produktmarke ~ `Name Kurz`, value.var="Spendingsges")
Spendings
Firma    Produktmarke    Apr     Aug    Dez Feb    Jan Jul    Jun    Mai Mrz Nov     Okt     Sep
Company1 Product1     228582 1902138 725781  NA 709970  NA 265313 228177  NA  NA 1463258 4031267
Is there a way to reorder the columns dynamically? For 2018, for example, the data frame has fewer month columns, so I cannot use:
Spendings <- Spendings[,c("Firma", "Produktmarke", "Jan", "Feb", "Mrz", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")]
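One approach is to create an empty data frame with the columns in the desired order and then bind the dcast result onto it with rbind.fill() from plyr:
Spendings_raw <- data.frame(matrix(ncol = 14, nrow = 0))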
colnames(Spendings_raw) <- c("Firma", "Produktmarke", "Jan", "Feb", "Mrz", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")
Spendings_raw
Spendings <- data %>%
filter(Familie == "Riegel" & Jahr == "2017") %>%
group_by(Firma, Produktmarke, `Name Kurz`) %>%
summarise(Spendingsges = sum(EUR, na.rm = TRUE))
Spendings <- dcast(data = Spendings, Firma + Produktmarke ~ `Name Kurz`, value.var="Spendingsges")
Spendings <- rbind.fill(Spendings_raw, Spendings)
This works perfectly ;-).
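An alternative that keeps only the months actually present is to intersect the desired order with the existing column names. This is a minimal base R sketch; the small Spendings data frame is a made-up stand-in for a dcast result, and the names month_order, id_cols, and present_months are illustrative.

```r
# Desired calendar order, using the German abbreviations from the question
month_order <- c("Jan", "Feb", "Mrz", "Apr", "Mai", "Jun",
                 "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")

# Hypothetical dcast result with only some months, in alphabetical order
Spendings <- data.frame(Firma = "Company1", Produktmarke = "Product1",
                        Apr = 228582, Aug = 1902138, Jan = 709970, Mai = 228177)

id_cols <- c("Firma", "Produktmarke")

# intersect() keeps the order of its first argument, so this yields calendar order
present_months <- intersect(month_order, names(Spendings))

Spendings <- Spendings[, c(id_cols, present_months)]
print(names(Spendings))
```

Unlike the empty-template approach, this does not add NA columns for months that have no data.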

Plotting issue: Highest value shown at the bottom of y-axis

I'm trying to plot my dataset, i.e., the number of internship ads, using ggplot, but for some reason, the highest value shows at the bottom of y axis.
Here's the dataset:
month internship_count_2012
January 68
February 48
March 43
April 49
May 52
June 83
July 104
August 91
September 72
October 58
November 70
December 77
The image of the plot:
I've run the following code to come up with the above dataset, which was extracted from a larger, filtered dataset found here: https://www.dropbox.com/home?preview=rwjAll_internship_2012.csv
Filter the dataset by month:
rwjAll_internship_jan2012 <- filter(rwjAll_internship_2012, month == 1)
rwjAll_internship_feb2012 <- filter(rwjAll_internship_2012, month == 2)
rwjAll_internship_mar2012 <- filter(rwjAll_internship_2012, month == 3)
rwjAll_internship_apr2012 <- filter(rwjAll_internship_2012, month == 4)
rwjAll_internship_may2012 <- filter(rwjAll_internship_2012, month == 5)
rwjAll_internship_jun2012 <- filter(rwjAll_internship_2012, month == 6)
rwjAll_internship_jul2012 <- filter(rwjAll_internship_2012, month == 7)
rwjAll_internship_aug2012 <- filter(rwjAll_internship_2012, month == 8)
rwjAll_internship_sep2012 <- filter(rwjAll_internship_2012, month == 9)
rwjAll_internship_oct2012 <- filter(rwjAll_internship_2012, month == 10)
rwjAll_internship_nov2012 <- filter(rwjAll_internship_2012, month == 11)
rwjAll_internship_dec2012 <- filter(rwjAll_internship_2012, month == 12)
Get the number of internship ads by month
jan_2012_internship <- sum(rwjAll_internship_jan2012$internship)
feb_2012_internship <- sum(rwjAll_internship_feb2012$internship)
mar_2012_internship <- sum(rwjAll_internship_mar2012$internship)
apr_2012_internship <- sum(rwjAll_internship_apr2012$internship)
may_2012_internship <- sum(rwjAll_internship_may2012$internship)
jun_2012_internship <- sum(rwjAll_internship_jun2012$internship)
jul_2012_internship <- sum(rwjAll_internship_jul2012$internship)
aug_2012_internship <- sum(rwjAll_internship_aug2012$internship)
sep_2012_internship <- sum(rwjAll_internship_sep2012$internship)
oct_2012_internship <- sum(rwjAll_internship_oct2012$internship)
nov_2012_internship <- sum(rwjAll_internship_nov2012$internship)
dec_2012_internship <- sum(rwjAll_internship_dec2012$internship)
Create dataset to plot 2012 internship ads
month <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
internship_count_2012<- c(jan_2012_internship, feb_2012_internship, mar_2012_internship, apr_2012_internship, may_2012_internship, jun_2012_internship, jul_2012_internship, aug_2012_internship, sep_2012_internship, oct_2012_internship, nov_2012_internship, dec_2012_internship)
all_internship_2012 <- cbind(month, internship_count_2012)
all_internship_2012 <- as.data.frame(all_internship_2012)
plot2 <- ggplot(all_internship_2012, aes(x = month, y = internship_count_2012, group = 1)) +
geom_line(colour="purple1") +
geom_point(size=0.5, colour="purple1") +
scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
How can I order the y values (internship_count_2012) so that the highest value is at the top? Thanks!
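The y values appear out of order because cbind() combines a character vector and a numeric vector into a character matrix, so internship_count_2012 is no longer numeric and ggplot treats it as a discrete variable. The following code greatly simplifies the creation of your new dataset using the dplyr package, keeps the counts numeric, and produces the plot you want in just a few lines.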
library(dplyr)
library(ggplot2)
# Sum the internship ads per month in one step; month stays numeric (1 to 12)
all_internship_2012 <- rwjAll_internship_2012 %>%
  group_by(month) %>%
  summarise(internship_count_2012 = sum(internship))
# Axis labels, named separately so they do not clash with the month column
month_labels <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
ggplot(all_internship_2012, aes(x = month, y = internship_count_2012)) +
  geom_line(colour = "purple1") +
  geom_point(size = 0.5, colour = "purple1") +
  scale_x_continuous(breaks = 1:12, labels = month_labels) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
For info about dplyr check out this link on CRAN.
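A quick way to see the difference between the two ways of building the plotting data is to check the column type each one produces. This is a minimal base R sketch with three made-up rows; the names via_cbind and via_df are illustrative.

```r
month <- c("January", "February", "March")
count <- c(68, 48, 104)

# cbind() on a character and a numeric vector gives a character matrix,
# so the counts are no longer numeric after as.data.frame()
via_cbind <- as.data.frame(cbind(month, count))
print(is.numeric(via_cbind$count))

# data.frame() keeps each column's own type
via_df <- data.frame(month, count)
print(is.numeric(via_df$count))
```

With a non-numeric y column, ggplot places the values on a discrete axis in alphabetical order, which is why "104" ends up at the bottom.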
