Dcast Function -> Reorder Columns - r

I used the dcast function to show the monthly spending of different companies. Of course, I want January first, then February, and so on, rather than alphabetical order.
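library(dplyr)     # filter(), group_by(), summarise(), %>%
library(reshape2)  # dcast()
# Total spending per company, brand and month
Spendings <- data %>%
  filter(Familie == "Riegel" & Jahr == "2017") %>%
  group_by(Firma, Produktmarke, `Name Kurz`) %>%
  summarise(Spendingsges = sum(EUR, na.rm = TRUE))
# One column per month; the columns come out in alphabetical order
Spendings <- dcast(data = Spendings, Firma + Produktmarke ~ `Name Kurz`, value.var = "Spendingsges")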
Spendings
Firma Produktmarke Apr Aug Dez Feb Jan Jul Jun Mai Mrz Nov Okt Sep
Company1 Product1 228582 1902138 725781 NA 709970 NA 265313 228177 NA NA 1463258 4031267
Is there a way to reorder the columns dynamically? For 2018, for example, the data frame is shorter, so I cannot use:
Spendings <- Spendings[,c("Firma", "Produktmarke", "Jan", "Feb", "Mrz", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")]

# Empty template with all 14 columns in the desired order
Spendings_raw <- data.frame(matrix(ncol = 14, nrow = 0))
colnames(Spendings_raw) <- c("Firma", "Produktmarke", "Jan", "Feb", "Mrz", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")
Spendings_raw
Spendings <- data %>%
  filter(Familie == "Riegel" & Jahr == "2017") %>%
  group_by(Firma, Produktmarke, `Name Kurz`) %>%
  summarise(Spendingsges = sum(EUR, na.rm = TRUE))
Spendings <- dcast(data = Spendings, Firma + Produktmarke ~ `Name Kurz`, value.var = "Spendingsges")
# rbind.fill() is from plyr; it keeps the template's column order and fills missing months with NA
Spendings <- plyr::rbind.fill(Spendings_raw, Spendings)
This works perfectly ;-).
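An alternative sketch, not taken from the answer above: if the missing months should simply be left out rather than added as NA columns, intersect() can pick the month columns that exist, in calendar order. The small Spendings data frame here is made-up illustration data; only the column names follow the question.

```r
# Calendar order of the German month abbreviations used in the question
month_order <- c("Jan", "Feb", "Mrz", "Apr", "Mai", "Jun",
                 "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")

# Made-up stand-in for a dcast() result in which only some months are present
Spendings <- data.frame(
  Firma = "Company1",
  Produktmarke = "Product1",
  Apr = 228582, Feb = NA, Jan = 709970, Mrz = NA,
  stringsAsFactors = FALSE
)

# intersect() keeps the order of its first argument and drops absent months
id_cols <- c("Firma", "Produktmarke")
present <- intersect(month_order, names(Spendings))
Spendings <- Spendings[, c(id_cols, present)]

names(Spendings)
```

Because the month list is matched against whatever columns are present, the same lines should work for a year with fewer months.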

Related

Quarter Conversion using casewhen

I am trying to create a function that converts month values to quarters using the case_when() function.
Then I want to leverage mutate() to create a new variable Qtr and determine how many observations I see in each quarter.
convert_to_qtr <- function(Month) {
  case_when(
    Month == "Jan" ~ "Q1",
    Month == "Feb" ~ "Q1",
    Month == "Mar" ~ "Q1",
    Month == "Apr" ~ "Q2",
    Month == "May" ~ "Q2",
    Month == "Jun" ~ "Q2",
    Month == "Jul" ~ "Q3",
    Month == "Aug" ~ "Q3",
    Month == "Sep" ~ "Q3",
    Month == "Oct" ~ "Q4",
    Month == "Nov" ~ "Q4",
    Month == "Dec" ~ "Q4"
  )
}
example_months <- c("Jan", "Mar", "May", "May", "Aug", "Nov", "Nov", "Dec")
convert_to_qtr(example_months)
df %>%
  mutate(Qtr = convert_to_qtr(Month)) %>%
  group_by(Qtr) %>%
  count(Qtr)
However, I am not getting the same answer as my professor shows in his drop-down, so I am not sure whether I am doing something wrong in my R code.
He sees the numbers 161,071 85,588 100,227 142,651.
I am not getting that; I see 152174 165778 205615 174592.
You could write the function as below:
convert_to_qtr <- function(Month) {
  setNames(paste0("Q", rep(1:4, each = 3)), month.abb)[Month]
}
df %>%
  mutate(Qtr = convert_to_qtr(Month)) %>%
  count(Qtr)
Qtr n
1 Q1 29
2 Q2 24
3 Q3 26
4 Q4 21
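To see how the lookup-vector version works, here is a minimal base R sketch applied to the example_months vector from the question. The names quarter_lookup and quarters are introduced here for illustration only.

```r
# month.abb is a built-in constant: "Jan", "Feb", ..., "Dec"
quarter_lookup <- setNames(paste0("Q", rep(1:4, each = 3)), month.abb)

example_months <- c("Jan", "Mar", "May", "May", "Aug", "Nov", "Nov", "Dec")

# Index the named vector by month name; unname() drops the month labels
quarters <- unname(quarter_lookup[example_months])
quarters

# Count observations per quarter
table(quarters)
```

A value that is not in month.abb (a different spelling, for instance) comes back as NA, so checking anyNA() on the result is a quick way to spot months that were not matched.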

Factor function in R returning NAs

I have a data frame with a column of 'months' and corresponding values. When I create a graph, the months are ordered alphabetically. I want to order the months using the factor function, but now my graph is only showing the month of May and 'NAs'.
xnames<-c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
Data$Month<-factor(Data$Month, levels = xnames)
ggplot(DAtaTidy_MergeRWPeaks2, (aes(x=factor(Month, xnames), y=Volume)), na.rm=TRUE) +
geom_bar()
I tried embedding the factor in the ggplot function but it produced the same result. When I delete 'May' from 'xnames', the graph just shows NAs.
We can't see your data, but the behavior suggests that Data$Month contains values that are not included in your levels vector xnames. Is anything misspelled? I would suggest you compare levels(as.factor(Data$Month)) and xnames; that should reveal the issue.
Example dataset that shows the same problem you have:
yums <- c('soup', 'salad', 'bread')
nums <- c(10, 14, 5)
df1 <- data.frame(yums, nums)
yum.levels <- c('soup', 'salad', 'bread', 'pasta')
ggplot(df1, aes(x=factor(yums, yum.levels), y=nums)) + geom_col()
That gives you this:
...but if we misspell one of them (like capitalizing "Soup" in yums), you get this:
yums1 <- c('Soup', 'salad', 'bread')
nums <- c(10, 14, 5)
df2 <- data.frame(yums1, nums)
yum.levels <- c('soup', 'salad', 'bread', 'pasta')
ggplot(df2, aes(x=factor(yums1, yum.levels), y=nums)) + geom_col()
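The comparison suggested above can be sketched with setdiff(), which lists the values that have no matching level. The Month vector below is made up, with two deliberate problems (a trailing space and a full month name); the real data may differ.

```r
xnames <- month.abb  # "Jan", "Feb", ..., "Dec"

# Made-up data: "Feb " has a trailing space, "June" is not an abbreviation
Month <- c("Jan", "Feb ", "May", "June")

# Values without a matching level; factor() would turn these into NA
unmatched <- setdiff(unique(Month), xnames)
unmatched

# Number of NAs produced by the conversion
n_na <- sum(is.na(factor(Month, levels = xnames)))
n_na

# One possible cleanup when stray whitespace is the cause
cleaned <- trimws(Month)
still_unmatched <- setdiff(unique(cleaned), xnames)
still_unmatched
```

If setdiff() returns almost every month, as the question's "only May survives" symptom hints, the data probably uses a different spelling than xnames, and the levels rather than the data may be what needs changing.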

Multiple grouping variables using kableExtra

I've got the following dataset:
tab <- tibble(year = c(2017, 2017, 2017, 2018, 2018, 2018),
              mth = c("Apr", "Apr", "Jun", "Jul", "Jul", "Sep"),
              var1 = 1:6,
              var2 = 10:15)
Is it possible to use kableExtra to generate a table of this data where there are two grouping variables, year and month? This would give:
var1 var2
2017
Apr
1 10
2 11
Jun
3 12
2018
Jul
4 13
5 14
Sep
6 15
I've tried:
kable(tab[,3:4]) %>% pack_rows(index = table(year$Month, tab$mth))
It works fine with one grouping variable, but it doesn't work for two grouping variables.
This tutorial has great examples and explains how to do this.
library(dplyr)
library(kableExtra)
kable(tab, align = "c", col.names = c("", "", names(tab)[3:4])) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  collapse_rows(columns = 1:2, valign = "top")

Plotting issue: Highest value shown at the bottom of y-axis

I'm trying to plot my dataset, i.e., the number of internship ads, using ggplot, but for some reason, the highest value shows at the bottom of y axis.
Here's the dataset:
month internship_count_2012
January 68
February 48
March 43
April 49
May 52
June 83
July 104
August 91
September 72
October 58
November 70
December 77
The image of the plot:
I've run the following code to come up with the above dataset, which was extracted from a larger, filtered dataset found here: https://www.dropbox.com/home?preview=rwjAll_internship_2012.csv
Filter the dataset by month:
rwjAll_internship_jan2012 <- filter(rwjAll_internship_2012, month == 1)
rwjAll_internship_feb2012 <- filter(rwjAll_internship_2012, month == 2)
rwjAll_internship_mar2012 <- filter(rwjAll_internship_2012, month == 3)
rwjAll_internship_apr2012 <- filter(rwjAll_internship_2012, month == 4)
rwjAll_internship_may2012 <- filter(rwjAll_internship_2012, month == 5)
rwjAll_internship_jun2012 <- filter(rwjAll_internship_2012, month == 6)
rwjAll_internship_jul2012 <- filter(rwjAll_internship_2012, month == 7)
rwjAll_internship_aug2012 <- filter(rwjAll_internship_2012, month == 8)
rwjAll_internship_sep2012 <- filter(rwjAll_internship_2012, month == 9)
rwjAll_internship_oct2012 <- filter(rwjAll_internship_2012, month == 10)
rwjAll_internship_nov2012 <- filter(rwjAll_internship_2012, month == 11)
rwjAll_internship_dec2012 <- filter(rwjAll_internship_2012, month == 12)
Get the number of internship ads by month
jan_2012_internship <- sum(rwjAll_internship_jan2012$internship)
feb_2012_internship <- sum(rwjAll_internship_feb2012$internship)
mar_2012_internship <- sum(rwjAll_internship_mar2012$internship)
apr_2012_internship <- sum(rwjAll_internship_apr2012$internship)
may_2012_internship <- sum(rwjAll_internship_may2012$internship)
jun_2012_internship <- sum(rwjAll_internship_jun2012$internship)
jul_2012_internship <- sum(rwjAll_internship_jul2012$internship)
aug_2012_internship <- sum(rwjAll_internship_aug2012$internship)
sep_2012_internship <- sum(rwjAll_internship_sep2012$internship)
oct_2012_internship <- sum(rwjAll_internship_oct2012$internship)
nov_2012_internship <- sum(rwjAll_internship_nov2012$internship)
dec_2012_internship <- sum(rwjAll_internship_dec2012$internship)
Create dataset to plot 2012 internship ads
month <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
internship_count_2012<- c(jan_2012_internship, feb_2012_internship, mar_2012_internship, apr_2012_internship, may_2012_internship, jun_2012_internship, jul_2012_internship, aug_2012_internship, sep_2012_internship, oct_2012_internship, nov_2012_internship, dec_2012_internship)
all_internship_2012 <- cbind(month, internship_count_2012)
all_internship_2012 <- as.data.frame(all_internship_2012)
plot2 <- ggplot(all_internship_2012, aes(x = month, y = internship_count_2012, group = 1)) +
  geom_line(colour = "purple1") +
  geom_point(size = 0.5, colour = "purple1") +
  scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
How can I order the y values (internship_count_2012) so that the highest value is at the top? Thanks!
The following code greatly simplifies the creation of your new dataset using the dplyr library and produces the plot you desire in just a few lines.
library(dplyr)
library(ggplot2)
# One row per month (1 to 12) with the total number of internship ads
all_internship_2012 <- rwjAll_internship_2012 %>%
  group_by(month) %>%
  summarise(internship_count_2012 = sum(internship))
# Axis labels only; inside aes(), month still refers to the numeric data column
month <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
ggplot(all_internship_2012, aes(x = month, y = internship_count_2012)) +
  geom_line(colour = "purple1") +
  geom_point(size = 0.5, colour = "purple1") +
  scale_x_continuous(breaks = 1:12, labels = month) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
For info about dplyr check out this link on CRAN.
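As for why the y-axis was out of order in the question's code, a likely cause is the cbind() step: cbind() builds a matrix, a matrix holds a single type, and so the counts are stored as text and treated as discrete values on the axis. A minimal base R sketch with three of the question's values (the names bad and good are illustrative):

```r
month <- c("January", "February", "July")
internship_count_2012 <- c(68, 48, 104)

# cbind() returns a character matrix, so the counts become text
bad <- as.data.frame(cbind(month, internship_count_2012))
is.numeric(bad$internship_count_2012)   # FALSE

# data.frame() keeps each column's own type
good <- data.frame(month, internship_count_2012)
is.numeric(good$internship_count_2012)  # TRUE
```

Building the data frame with data.frame(), or summarising with dplyr as in the answer, keeps the counts numeric, so ggplot2 draws a continuous y-axis with the largest value at the top.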

Reshape data in R?

I have mean daily data for different sites organized as shown in figure 1 in this folder.
However, I want to organize this data to look like figure 2 in the same folder.
Using this code, the data was reshaped but the final values (reshpae_stage_R.csv) didn't match the original values.
By running the code for the second time, I got this error:
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L], :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘NA.January’
Could you please help me understand why the final values don't match the original values?
Thanks in advance
Update:
Thanks to @aelwan for catching a bug; the updated code is below:
library(ggplot2)
library(reshape2)
# read in the data
dfStage = read.csv("reshapeR/Data/stage.csv", header = FALSE, stringsAsFactors = FALSE)
# remove the rows which are min, max, mean & redundant columns
condMMM = stringr::str_trim(dfStage[, 1]) %in% c("Min", "Max", "Mean", "Day")
dfStage = dfStage[!condMMM, 1:13]
dateVars = c("Day", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
colnames(dfStage) = dateVars
# get indices & names of year site combinations
condlSiteYear = grepl("^Daily means", stringr::str_trim(dfStage[, 1]))
condiSiteYear = grep("^Daily means", stringr::str_trim(dfStage[, 1]))
dfSiteYear = dfStage[condlSiteYear, 1, drop = FALSE]
# remove site-year rows from data
dfStage = dfStage[!condlSiteYear, ]
# get the list of sites and years
dfSiteYear$Year = regmatches(dfSiteYear[, 1], regexpr("(?<=Year\\s)([0-9]+)", dfSiteYear[, 1], perl = TRUE))
dfSiteYear$Site = regmatches(dfSiteYear[, 1],
regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)", dfSiteYear[, 1], perl = TRUE))
# add the site and years
dfSiteYearLong = dfSiteYear[rep(1:dim(dfSiteYear)[1], each = 31), c("Site", "Year")]
dfStageFinal = cbind(dfStage, dfSiteYearLong)
# reshape
dfStageFinalLong = reshape2::melt(dfStageFinal, id.vars = c("Day", "Site", "Year"),
measure.vars = dateVars[-1],
variable.name = "Month")
dfStageFinalWide = reshape2::dcast(dfStageFinalLong, Day + Month + Year ~ Site,
value.var = "value")
# cleanup
dfStageFinalWide[, -c(1:3)] = lapply(dfStageFinalWide[, -c(1:3)], as.numeric)
# create a date variable
dfStageFinalWide$Date = with(dfStageFinalWide,
as.Date(paste(Day, Month, Year, sep = "-"),
format = "%d-%b-%Y"))
# remove the infeasible dates
dfStageFinalWide = dfStageFinalWide[!is.na(dfStageFinalWide$Date), ]
dfStageFinalWide = dfStageFinalWide[order(dfStageFinalWide$Date), ]
# plot the values over time
dfStageFinalLong =
reshape2::melt(dfStageFinalWide, id.vars = "Date", measure.vars = unique(dfSiteYear$Site),
variable.name = "Site")
ggplot(dfStageFinalLong, aes(x = Date, y = value, color = Site))+
geom_line() + theme_bw() + facet_wrap(~ Site, scale = "free_y")
This leads to the picture below:
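Comparing the two versions of the code, the only change in the "add the site and years" step is rep(..., each = 31) in place of rep.int(..., 31), which appears to be the bug in question. A small sketch of the difference, using two hypothetical site-year tables and 3 rows per table instead of 31 to keep the output short:

```r
site_year <- 1:2      # row indices of two hypothetical site-year tables
rows_per_table <- 3   # 31 in the answer's code

# each = repeats every index in a block, matching tables stacked one after another
by_block <- rep(site_year, each = rows_per_table)
by_block   # 1 1 1 2 2 2

# rep.int() cycles through the whole vector, interleaving the tables
cycled <- rep.int(site_year, rows_per_table)
cycled     # 1 2 1 2 1 2
```

With the cycled version, the site and year labels are attached to the wrong rows, which would explain reshaped values that do not match the original data.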
Original answer:
This example requires a fair amount of data munging skills. You basically have to note the repeating patterns in the data: the data are site-year measurements organized as day x month tables.
Recipe:
Here is a recipe for creating the desired dataset:
1. Remove the rows & columns in the data that are redundant.
2. Extract the rows that identify the year and the site of the table using pattern matching (grep).
3. From the longer string, extract the year and site name using regular expressions (regexpr and regmatches).
4. Find the starting row indices of the tables for each site-year combination and assign the site-year names just extracted to all rows that correspond to that site & year.
5. Now you can go ahead and reshape it into any shape you want. In the code below, the row identifiers are year, month and day, and the columns are the sites.
6. Some cleanup, and you are good to go.
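Step 3 can be tried in isolation. The sketch below applies the same two regular expressions to a single header string; the exact wording of the header is an assumption made up for illustration, since the real file is not shown here.

```r
# Hypothetical header line; the wording in the real file may differ
header <- "Daily means, Year 1990, Stage (mm) at Kumeti at Te Rehunga"

# Digits that directly follow "Year " (perl = TRUE enables the lookbehind)
year <- regmatches(header, regexpr("(?<=Year\\s)([0-9]+)", header, perl = TRUE))

# Letters, digits, spaces and dots that follow "Stage (mm) at "
site <- regmatches(header,
                   regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)",
                           header, perl = TRUE))

year
site
```

Note that regmatches() silently drops elements without a match, so when it is applied to a whole column the result can be shorter than the input if some header lines deviate from the pattern.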
Code:
Here is code for the recipe above:
library(ggplot2)
library(reshape2)
# read in the data
dfStage = read.csv("reshapeR/Data/stage.csv", header = FALSE, stringsAsFactors = FALSE)
# remove the rows which are min, max, mean & redundant columns
condMMM = stringr::str_trim(dfStage[, 1]) %in% c("Min", "Max", "Mean", "Day")
dfStage = dfStage[!condMMM, 1:13]
dateVars = c("Day", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
colnames(dfStage) = dateVars
# get indices & names of year site combinations
condlSiteYear = grepl("^Daily means", stringr::str_trim(dfStage[, 1]))
condiSiteYear = grep("^Daily means", stringr::str_trim(dfStage[, 1]))
dfSiteYear = dfStage[condlSiteYear, 1, drop = FALSE]
# remove site-year rows from data
dfStage = dfStage[!condlSiteYear, ]
# get the list of sites and years
dfSiteYear$Year = regmatches(dfSiteYear[, 1], regexpr("(?<=Year\\s)([0-9]+)", dfSiteYear[, 1], perl = TRUE))
dfSiteYear$Site = regmatches(dfSiteYear[, 1],
regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)", dfSiteYear[, 1], perl = TRUE))
# add the site and years
dfSiteYearLong = dfSiteYear[rep.int(1:dim(dfSiteYear)[1], 31), c("Site", "Year")]
dfStageFinal = cbind(dfStage, dfSiteYearLong)
# reshape
dfStageFinalLong = reshape2::melt(dfStageFinal, id.vars = c("Day", "Site", "Year"), measure.vars = dateVars[-1],
variable.name = "Month")
dfStageFinalWide = dcast(dfStageFinalLong, Day + Month + Year ~ Site, value.var = "value")
# cleanup
dfStageFinalWide[, -c(1:3)] = lapply(dfStageFinalWide[, -c(1:3)], as.numeric)
# create a date variable
dfStageFinalWide$Date = with(dfStageFinalWide,
as.Date(paste(Day, Month, Year, sep = "-"),
format = "%d-%b-%Y"))
# remove the infeasible dates
dfStageFinalWide = dfStageFinalWide[!is.na(dfStageFinalWide$Date), ]
dfStageFinalWide = dfStageFinalWide[order(dfStageFinalWide$Date), ]
# plot the values over time
dfStageFinalLong =
melt(dfStageFinalWide, id.vars = "Date", measure.vars = unique(dfSiteYear$Site),
variable.name = "Site")
ggplot(dfStageFinalLong, aes(x = Date, y = value, color = Site))+
geom_line() + theme_bw() + facet_wrap(~ Site, scale = "free_y")
Output:
Here is what the output looks like:
> head(dfStageFinalWide)
Day Month Year Kumeti at Te Rehunga Makakahi at Hamua Makuri at Tuscan Hills Manawatu at Hopelands Manawatu at Upper Gorge Manawatu at Weber Road Mangahao at Ballance
1 1 Jan 1990 454 NA 700 5133 NA NA NA
2 1 Jan 1991 1002 3643 1416 50 3597 1836 18160
3 1 Jan 1992 3490 34239 8922 3049 1221 417 NA
4 1 Jan 1993 404 NA 396 3408 NA 272 NA
5 1 Jan 1994 NA NA 3189 795 NA 2321 1889
6 1 Jan 1995 16548 1923 69862 4808 NA 6169 94
Mangapapa at Troup Rd Mangatainoka at Larsons Road Mangatainoka at Pahiatua Town Bridge Mangatainoka at Tararua Park Mangatoro at Mangahei Road Oruakeretaki at S.H.2 Napier
1 9406 2767 NA NA 6838 2831
2 4985 2479 823 1078 76 105
3 478 3665 1415 210 394 8247
4 6394 1298 NA 2668 3837 1878
5 14051 3561 NA 2645 807 NA
6 NA 1057 7029 4497 NA NA
Raparapawai at Jackson Rd Tamaki at Stephensons Tiraumea at Ngaturi
1 5189 50444 17951
2 345 416 3025
3 1364 5713 1710
4 3457 28078 8670
5 199 NA 292
6 NA NA 22774
And a picture to bring it all together.
