Displaying multiple boxplots in R within a specific range - r

If I have a data frame df with columns yearID and payroll
boxplot(df$payroll ~ df$yearID, ylab="Payroll", xlab="Year")
displays a boxplot for every year. Is there a way to specify the range of years that are displayed? Thanks

It's helpful to have the data that powers your code. You can read more about how to create an example here.
As rawr pointed out in comments, you can use the subset argument to boxplot to narrow the range of years that's rendered.
boxplot(payroll ~ yearID, data = df, ylab="Payroll", xlab="Year", subset = yearID > 2013)
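Since the question asks for a range of years, the same subset argument can take both a lower and an upper bound. Here is a minimal sketch, assuming made-up payroll data (the values and the 2005 to 2010 window are illustrative and not from the original post):

```r
# Hypothetical data: 10 payroll values for each year from 1995 to 2016
set.seed(1)
df <- data.frame(
  yearID  = rep(1995:2016, each = 10),
  payroll = round(rnorm(220, mean = 50000, sd = 20000))
)

# Keep only 2005-2010; both bounds are inclusive
bp <- boxplot(payroll ~ yearID, data = df,
              subset = yearID >= 2005 & yearID <= 2010,
              ylab = "Payroll", xlab = "Year")

# boxplot() invisibly returns its statistics; names holds the groups drawn
bp$names
```

Passing data = df lets subset refer to the column by its bare name instead of relying on a global yearID vector.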
Personally, I prefer using the data management tools from dplyr in order to keep my code consistent regardless of the function I'm using. In this case you can use filter to select only the years that you want. dplyr becomes even more useful when you use pipes, but I'll keep this example simple.
library(tidyverse) # Includes dplyr and other useful packages
# Generate dummy data
yearID <- sample(1995:2016, size = 1000, replace = TRUE)
payroll <- round(rnorm(1000, mean = 50000, sd = 20000))
df <- tibble(yearID, payroll)
# Filter the data to include only the years you want
df_plot <- filter(df, yearID > 2013)
# Generate your boxplot
boxplot(df_plot$payroll ~ df_plot$yearID, ylab="Payroll", xlab="Year")

Related

Aggregating two rows based on condition of different ID in R

I am dealing with a dataset of player statistics for a sport. There is an error in the data: in one week, a player who doesn't exist has been attributed the data that belongs to a real player. I need to aggregate the two players' data and delete the false player's row.
I need to adjust my preprocessing code to accommodate this, so that when I scrape future weeks' data I don't need to make manual adjustments.
df <- data.frame(Name = c("Bob","Ben","Bill"),
Team = c("Dogs","Cats","Birds"),
Runs = c(6, 4, 2))
I'd like to do something along the lines of aggregating the two rows based on their df$Name, e.g. when df$Name == "Bob" | df$Name == "Bill", aggregate columns [3:40]. These are my columns with numeric statistics; [1:2] hold df$Name and df$Team.
It depends on the type of aggregation you are trying to do. This looks like a good fit for group_by from the dplyr package. Consider the CO2 data set.
library(dplyr)
CO2 %>%
group_by(Plant) %>%
summarise(
n = n(), #Calculate number of rows in each group
meanUptake = mean(uptake) # Aggregate data and take mean for each group
) %>%
ungroup()
Here each group is summarised separately; in your case the grouping variable would be Name. If you wish to carry extra information (like Team) through, include it within the summarise call.
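Applied to the question's data, the idea could look roughly like the sketch below. It assumes that "Bill" is the non-existent player and "Bob" the real one (the question does not say which is which), and it adds a hypothetical Wickets column to stand in for the other numeric statistics:

```r
library(dplyr)

df <- data.frame(
  Name    = c("Bob", "Ben", "Bill"),
  Team    = c("Dogs", "Cats", "Birds"),
  Runs    = c(6, 4, 2),
  Wickets = c(1, 0, 3),          # hypothetical second statistic
  stringsAsFactors = FALSE
)

real    <- "Bob"    # assumed real player
phantom <- "Bill"   # assumed non-existent player

merged <- df %>%
  mutate(
    # Give the phantom row the real player's team, then the real player's name
    Team = if_else(Name == phantom, Team[match(real, Name)], Team),
    Name = if_else(Name == phantom, real, Name)
  ) %>%
  group_by(Name, Team) %>%
  # Sum every numeric column within each player
  summarise(across(where(is.numeric), sum), .groups = "drop")

merged
```

Because the recoding happens before the grouping, the same pipeline can be rerun on future weeks without manual edits, as long as the phantom name stays the same.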

Summing across rows conditional on groups with dplyr using select, group_by, and mutate

Problem: I'm making an aggregate market share variable in a car market with 286 distinct models sold and a total of 501 cars sold. This group share is based only on two car characteristics, cat = "compact", "midsize", "large" and yr = 77, 78, 79, 80, 81, and on the share s, a small double variable, which gives a total of 15 groups in the market.
Closest answer I've found: by mishabalyasin on community.rstudio: "Calculating rowwise totals and proportions using tidyeval?" link to post on community.rstudio.
Applying the select-split-combine principle, the closest I've come to the correct answer is a summary of the 15 groups (15 x 3: cat, yr, s):
df<- blp %>%
select(cat,yr,s) %>%
group_by(cat,yr) %>%
summarise(group_share = sum(s))
#in my actual data, this is what fills in the group share to get what I want, but this isn't the desired pipeline-based answer
blp$group_share=0 #initializing the group_share, the 50th col
for(i in 1:501){
for(j in 1:15){
if((blp[i,31]==df[j,1])&&(blp[i,3]==df[j,2])){ #if(sameCat & sameYr){blpGS=dfGS}
blp[i,50]=df[j,3]
}
}
}
This is great, but I know this can be done in one fell swoop... Hopefully, the idea is clear from what I've described above. A simple fix may be a loop and set by conditions on cat and yr, and that'd help, but I really am trying to get better at data wrangling with dplyr, so, any insight along that line to get the pipelining answer would be wonderful.
Example for the site: This example below doesn't work with the code I provided, but this is the "look" of my data. There is a problem with the share being a factor.
#45 obs, 3 cats, 5 yrs
cat=c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr=c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s=c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
blp=as.data.frame(cbind(unlist(lapply(cat,as.character,stringsAsFactors=FALSE)),as.numeric(yr),unlist(as.numeric(s))))
names(blp)<-c("cat","yr","s")
head(blp)
#note: one example of a group share would be summing the share from
(group_share.blp.large.81.s=(blp[cat== "large" &yr==81,]))
#works thanks to akrun: applying the code I provided for what leads to the 15 groups
df <- blp %>%
select(cat,yr,s) %>%
group_by(cat,yr) %>%
summarise(group_share = sum(as.numeric(as.character(s))))
#manually filling doesn't work, but this is what I'd want if I didn't want pipelining
blp$group_share=0
for(i in 1:45){
if( ((blp[i,1])==(df[j,1])) && (as.numeric(blp[i,2])==as.numeric(df[j,2]))){ #if(sameCat & sameYr){blpGS=dfGS}
blp[i,4]=df[j,3];
}
}
If I understood your problem correctly, this should help!
The only difference is that instead of summarise, which returns only the grouping columns and the summarised one, you can use mutate to keep the original columns and add the aggregate as a new column.
# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr <- c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s <- c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
# Calculation
library(dplyr) # provides %>%, group_by(), mutate() and ungroup()
blp <-
data.frame(cat, yr, s, stringsAsFactors = FALSE) %>% # To create dataframe
group_by(cat, yr) %>% # Grouping by category and year
mutate(group_share = sum(s, na.rm = TRUE)) %>% # Calculating sum share per category/year
ungroup()
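As a cross-check that does not need dplyr, base R's ave() also returns one value per original row. The sketch below uses a small hypothetical subset of the data (five rows instead of 45) so the group sums are easy to verify by hand:

```r
# Hypothetical miniature of blp: two compact/77 rows, one large/77, two large/81
blp <- data.frame(
  cat = c("compact", "compact", "large", "large", "large"),
  yr  = c(77, 77, 77, 81, 81),
  s   = c(0.001, 0.002, 0.0005, 0.0001, 0.0002),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each cat/yr combination and keeps the row count
blp$group_share <- ave(blp$s, blp$cat, blp$yr, FUN = sum)

blp
```

Like mutate after group_by, this keeps every row and repeats the group total on each of them, which is what the nested loops in the question were filling in by hand.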

Moving averages

I have daily data for over 100 years that looks like
01.01.1856 12
02.01.1956 9
03.01.1956 -12
04.01.1956 7
etc.
I wish to calculate the 30-year running average for this huge data set. I tried converting the data into a time series but still can't figure out how to go about it. I would prefer a simple method that works directly on a data.frame.
I guess the difficulty lies in the preparation, given the leap years.
So I'll show one way to prepare the data before applying the already mentioned function runmean from the caTools package.
First we create example data (not necessary for you, but it helps understanding).
Second, I split the data frame into a list of data frames, one per year, and take the mean of each year. These two steps could be done at once, but I think the separate way is easier to understand and to adapt.
#example data
Days <- seq(as.Date("1958-01-01"), as.Date("2015-12-31"), by="days")
Values <- runif(length(Days))
DF <- data.frame(Days = Days, Values = Values)
#start of script
Years <- format(DF$Days, "%Y")
UniqueYears <- unique(format(DF$Days, "%Y"))
#Create subset of years
#look for every unique year which element of days is in this year.
YearlySubset <- lapply(UniqueYears, function(x){
DF[which(Years == x), ]
})
YearlyMeanValues <- sapply(YearlySubset, function(x){
mean(x$Values)
})
Now the running mean is applied:
#install.packages("caTools")
require(caTools)
RM <- data.frame(Years = UniqueYears, RunningMean30y = runmean(YearlyMeanValues, 30))
In case I misunderstood you and you want a running mean for every day over a window of about 30 years, you could of course simply do:
RM <- cbind(DF, runmean(DF$Values, 365 * 30))
And regarding your problems creating a time series, note that your dates are in day.month.year order:
DF[ , 1] <- as.Date(DF[ , 1], format = "%d.%m.%Y")
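For the data.frame-only approach the question asks about, a running mean can also be sketched in base R with stats::filter, without caTools. The example below assumes dd.mm.yyyy date strings as in the question and uses a hypothetical 3-observation trailing window so the result is easy to check; for real data the window would be the number of observations in 30 years:

```r
# Hypothetical sample in the question's layout
raw <- data.frame(
  Date  = c("01.01.1956", "02.01.1956", "03.01.1956", "04.01.1956"),
  Value = c(12, 9, -12, 7),
  stringsAsFactors = FALSE
)

# Parse day.month.year strings into Date objects
raw$Date <- as.Date(raw$Date, format = "%d.%m.%Y")

# Trailing moving average: sides = 1 uses the current and previous values only.
# stats:: is spelled out because dplyr also exports a function called filter.
window <- 3
raw$RunningMean <- as.numeric(
  stats::filter(raw$Value, rep(1 / window, window), sides = 1)
)

raw
```

The first window - 1 rows are NA because there are not yet enough observations to fill the window.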
I would also suggest exploring RcppRoll in combination with dplyr which provides a fairly convenient solution to calculate rolling averages, sums, etc.
Code
# Libs
library(RcppRoll) # 'roll'-ing functions for R vectors and matrices.
library(dplyr) # data grammar (convenience)
library(zoo) # time series (convenience)
library(magrittr) # compound assignment pipe-operator (convenience)
# Data
data("UKgas")
## Convert to data frame to make example better
UKgas <- data.frame(Y = as.matrix(UKgas), date = time(UKgas))
# Calculations
UKgas %<>%
# To make example more illustrative I converted the data to a quarterly format
mutate(date = as.yearqtr(date)) %>%
arrange(date) %>%
# The window size can be changed to reflect any period
mutate(roll_mean = roll_mean(Y, n = 4, align = "right", fill = NA))
Notes
As the data provided in the example was fairly modest I used quarterly UK gas consumption data available via the data function in the utils package.

R fill in variable for a specific observation in a data frame

I have some data (download link: http://spreadsheets.google.com/pub?key=0AkBd6lyS3EmpdFp2OENYMUVKWnY1dkJLRXAtYnI3UVE&output=xls) that I'm trying to filter. I had reconfigured the data so that instead of one row per country, and one column per year, each row of the data frame is a country-year combination (i.e. Afghanistan, 1960, NA).
Now that I've done that, I want to create a subset of the initial data that excludes any country that has 10+ years of missing contraceptive use data.
I had thought to create a list of the unique country names in a second data frame, and then add a variable to that frame that holds the # of rows for each country that have an NA for contraceptive use (i.e. for Afghanistan it would have 46). My first thought (being most fluent in VB.net) was to use a for loop to iterate through the countries, get the NA count for that country, and then update the second data frame with that value.
In that vein I tried the following:
for(x in cl){
x$rc = nrow(subset(BCU, BCU$Country == x$Country))
}
After that failed, a little more Googling brought me to a question on here (forgot to grab the link) that suggested using by(). Based on that I tried:
by(cl, 1:nrow(cl), cl$rc <- nrow(subset(BCU, BCU$Country == cl$Country
& BCU$Contraceptive_Use == "NA")))
(cl is the second data frame listing the country names, and BCU is the initial contraceptive use data frame)
I'm fairly new to R (the problem I'm working is for an R course on Udacity), so I'll freely admit this may not be the best approach, but I'm still curious how to do this sort of aggregation.
They all seem to have >= 10 years of missing data (unless I miscalculated somewhere):
library(tidyr)
library(dplyr)
dat <- read.csv("contraceptive use.csv", stringsAsFactors=FALSE, check.names=FALSE)
dat <- rename(gather(dat, year, value, -1),
country=`Contraceptive prevalence (% of women ages 15-49)`)
dat %>%
group_by(country) %>%
summarise(missing_count=sum(is.na(value))) %>%
arrange(desc(missing_count)) -> missing
sum(missing$missing_count >= 10)
## [1] 213
length(unique(dat$country))
## [1] 213
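Once the per-country counts exist, the subsetting step the question asks for is a lookup against them. Here is a minimal base R sketch, assuming a long-format data frame with the question's column names (Country, Contraceptive_Use) and hypothetical values for two made-up countries:

```r
# Hypothetical long-format data: country "A" has 11 missing years, "B" has none
BCU <- data.frame(
  Country = rep(c("A", "B"), each = 12),
  Year    = rep(1960:1971, times = 2),
  Contraceptive_Use = c(rep(NA, 11), 5, 1:12),
  stringsAsFactors = FALSE
)

# Count missing values per country; is.na() is needed, == "NA" would not match NA
na_count <- tapply(is.na(BCU$Contraceptive_Use), BCU$Country, sum)

# Keep only countries with fewer than 10 missing years
keep    <- names(na_count)[na_count < 10]
BCU_sub <- BCU[BCU$Country %in% keep, ]

unique(BCU_sub$Country)
```

This replaces the for loop over countries with a single grouped count followed by one logical subset.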

Subset boxplots by date, order x-axis by month [duplicate]

This question already has answers here:
What is the most elegant way to split data and produce seasonal boxplots?
I have a year's worth of data spanning two calendar years. I want to plot boxplots for those data subset by month.
The plots will always be ordered alphabetically (if I use month names) or numerically (if I use month numbers). Neither suits my purpose.
In the example below, I want the months on the x-axis to start at June (2013) and end in May (2014).
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
df <- data.frame(date, x)
boxplot(df$x ~ months(df$date), outline = FALSE)
I could probably generate a vector of the months in the order I need (e.g. months <- months(seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "month")))
Is there a more elegant way to do this? What am I missing?
Are you looking for something like this :
boxplot(df$x ~ reorder(format(df$date,'%b %y'),df$date), outline = FALSE)
I am using reorder to order the month labels by date. I am also formatting the dates to drop the day part, since the boxplots are aggregated by month.
Edit :
If you want to skip the year part (but why? Personally I find this a little confusing):
boxplot(df$x ~ reorder(format(df$date,'%B'),df$date), outline = FALSE)
EDIT2 a ggplot2 solution:
Since you are in marketing field and you are learning ggplot2 :)
library(ggplot2)
ggplot(df) +
geom_boxplot(aes(y=x,
x=reorder(format(date,'%B'),date),
fill=format(date,'%Y'))) +
xlab('Month') + guides(fill=guide_legend(title="Year")) +
theme_bw()
I had a similar problem where I wanted to order the plot January to December. This seems to be a common cause of vexation, so here is my solution:
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
months <- month.name
boxplot(x~as.POSIXlt(date)$mon,names=months, outline = FALSE)
Found an answer here - use a factor, not a date:
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
df <- data.frame(date, x)
# create an ordered factor
m <- months(seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "month"))
df$months <- factor(months(df$date), levels = m)
# plot x axis as ordered
boxplot(df$x ~ df$months, outline = FALSE)
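The factor levels can also be derived from the data itself, so the start and end dates do not have to be typed twice. This is a sketch under the assumption that the data span no more than twelve months (otherwise the same month from different years would share one level):

```r
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")
set.seed(100)
x <- as.integer(abs(rnorm(365)) * 1000)
df <- data.frame(date, x)

# Sort by date, then take month names in order of first appearance
df <- df[order(df$date), ]
df$months <- factor(months(df$date), levels = unique(months(df$date)))

# The x-axis now follows the level order: June 2013 first, May 2014 last
boxplot(x ~ months, data = df, outline = FALSE)
```

unique() keeps values in order of first occurrence, which is why sorting by date beforehand is enough to get a chronological axis.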
