Grouping & plotting by textual column value - R

I've got a (very) basic level of competency with R when working with numbers, but when it comes to manipulating data based on text values in columns I'm stuck. For example, if I want to plot meal frequency vs. day of week (is Tuesday really for tacos?) using the following data frame, how would I do that? I've seen suggestions of tapply, aggregate, colSums, and others, but those have all been for slightly different scenarios and nothing gives me what I'm looking for. Should I be looking at something other than R for this problem? My end goal is a graph with day of week on the X-axis, count on the Y-axis, and a line plot for each meal.
df <- data.frame(meal= c("tacos","spaghetti","burgers","tacos","spaghetti",
"spaghetti"), day = c("monday","tuesday","wednesday","monday","tuesday","wednesday"))
This is as close as I've gotten, and, to be honest, I don't fully understand what it's doing:
tapply(df$day, df$meal, FUN = function(x) length(x))
It will summarize the meal counts, but a) it doesn't have column names (my understanding is that's due to tapply returning a vector), and b) it doesn't keep an association with the day of the week.
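A plain two-way table gets closer to the shape I'm after, since it keeps both dimensions labelled (same df as above):
table(df$meal, df$day)
#             monday tuesday wednesday
#   burgers        0       0         1
#   spaghetti      0       2         1
#   tacos          2       0         0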
Edit: The melt() suggestion below works for this dataset, but it won't scale to the size I need. I was, however, able to get a working graph from the data frame produced by the melt. If anybody runs across this in the future, try:
ggplot(new, aes(day, value, group = meal, col = meal)) +
  geom_line() + geom_point() +
  scale_y_continuous(breaks = function(x)
    unique(floor(pretty(seq(0, (max(x) + 1) * 1.1)))))
(The part after geom_point() is to force the Y-axis to only be integers, which is what makes sense in this case.)
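For reference, the melt() step referred to above was along these lines (my reconstruction, since the original suggestion isn't reproduced here; it assumes the reshape2 package and builds the new data frame the plot code expects):
library(reshape2)
new <- melt(table(df)) # table(df) counts every meal/day pair; melt() flattens it to columns meal, day, value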

I tried to cut this into smaller pieces so you can understand what's going on:
library(tidyverse)
# Define the data frame
df <- data.frame(meal = c("tacos", "spaghetti", "burgers", "tacos", "spaghetti", "spaghetti"),
                 day = c("monday", "tuesday", "wednesday", "monday", "tuesday", "wednesday"))
# Define a vector of days of the week (useful to display the x axis in the correct order)
ordered_days <- c("sunday", "monday", "tuesday", "wednesday",
                  "thursday", "friday", "saturday")
# Count the number of meals per day of the week
df_count <- df %>% group_by(meal, day) %>% count() %>% ungroup()
# A lot of combinations are missing (for example, no burgers on monday),
# so I am creating all combinations with count 0
fill_0 <- expand.grid(
  meal = factor(unique(df$meal)),
  day = factor(ordered_days),
  n = 0)
# Append fill_0 to df_count; as some combinations already exist,
# group by again and sum n, so there is only one row per (meal, day) combination
df_count <- rbind(df_count, fill_0) %>%
  group_by(meal, day) %>%
  summarise(n = sum(n)) %>%
  mutate(day = factor(day, ordered = TRUE, levels = ordered_days))
# Plot this, grouping by meal
ggplot(df_count, aes(x = day, y = n, group = meal, col = meal)) + geom_line()

The magic is here, courtesy of @fmarm:
df_count <- df %>% group_by(meal,day) %>% count() %>% ungroup()
The fill_0 and rbind bits, also in the sample provided by @fmarm, are necessary to keep from bombing out on unspecified combinations, but it's the line above that handles summing meals by day.
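For what it's worth, tidyr::complete() can replace the fill_0/rbind step in one call; a sketch using the same df and ordered_days as above:
library(tidyverse)
df_count <- df %>%
  count(meal, day) %>%
  complete(meal, day = ordered_days, fill = list(n = 0)) %>%
  mutate(day = factor(day, ordered = TRUE, levels = ordered_days))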

Related

Summing across rows conditional on groups with dplyr using select, group_by, and mutate

Problem: I'm making an aggregate market share variable in a car market with 286 distinct models sold and a total of 501 cars sold. This group share is based on only one car characteristic, cat ("compact", "midsize", "large"), crossed with yr (77, 78, 79, 80, 81), and on the share s, a small double variable; a total of 15 groups in the market.
Closest answer I've found: by mishabalyasin on community.rstudio: "Calculating rowwise totals and proportions using tidyeval?" link to post on community.rstudio.
Applying the select-split-combine principle is the closest I've come to getting the correct answer: the 15 groups (15 rows x 3 columns: cat, yr, group_share):
df<- blp %>%
select(cat,yr,s) %>%
group_by(cat,yr) %>%
summarise(group_share = sum(s))
#in my actual data, this is what fills the group_share to get what I want, but it isn't the desired pipeline-based answer
blp$group_share = 0 #initializing the group_share, the 50th column
for (i in 1:501) {
  for (j in 1:15) {
    if ((blp[i, 31] == df[j, 1]) && (blp[i, 3] == df[j, 2])) { #if(sameCat & sameYr){blpGS=dfGS}
      blp[i, 50] = df[j, 3]
    }
  }
}
This is great, but I know this can be done in one fell swoop... Hopefully the idea is clear from what I've described above. A simple fix might be a loop that sets values conditional on cat and yr, and that'd help, but I really am trying to get better at data wrangling with dplyr, so any insight along that line towards the pipelined answer would be wonderful.
Example for the site: This example below doesn't work with the code I provided, but this is the "look" of my data. There is a problem with the share being a factor.
#45 obs, 3 cats, 5 yrs
cat=c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr=c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s=c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
#Note: cbind() coerces everything to character here, so all three columns,
#including s, end up as factors in the data frame; that is the factor problem mentioned above
blp = as.data.frame(cbind(unlist(lapply(cat, as.character, stringsAsFactors = FALSE)),
                          as.numeric(yr), unlist(as.numeric(s))))
names(blp) <- c("cat", "yr", "s")
head(blp)
#Note: one example of a group share would be summing the share from
(group_share.blp.large.81.s = blp[cat == "large" & yr == 81, ])
#works thanks to akrun: applying the code I provided, this leads to the 15 groups
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(as.numeric(as.character(s))))
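(The as.numeric(as.character(s)) step is needed because as.numeric() on a factor returns the underlying level codes, not the printed values; a one-liner to see it:)
f <- factor(c(".0005", ".001"))
as.numeric(f)               # 1 2, the level codes
as.numeric(as.character(f)) # 0.0005 0.0010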
#Manually filling needs the nested loop over the 15 groups (as in the code
#further up); this is what I'd want if I didn't want pipelining
blp$group_share = 0
for (i in 1:45) {
  for (j in 1:15) {
    if ((blp[i, 1] == df[j, 1]) && (as.numeric(blp[i, 2]) == as.numeric(df[j, 2]))) { #if(sameCat & sameYr){blpGS=dfGS}
      blp[i, 4] = df[j, 3]
    }
  }
}
If I understood your problem correctly, this should help!
The only difference here is that instead of summarise(), which returns only the grouping columns and the summarised one, you use mutate(), which keeps the original columns and adds the aggregate as a new one.
# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr <- c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s <- c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
# Calculation
library(dplyr)
blp <-
  data.frame(cat, yr, s, stringsAsFactors = FALSE) %>% # Create the data frame
  group_by(cat, yr) %>%                                # Group by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>%       # Sum of share per category/year
  ungroup()
Expected output: (shown as a screenshot in the original; the same 45 rows plus a group_share column)
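To see the summarise()/mutate() difference concretely, here is the same pipeline with only the verb swapped (a minimal sketch):
blp %>% group_by(cat, yr) %>% summarise(group_share = sum(s)) # 15 rows, one per group
blp %>% group_by(cat, yr) %>% mutate(group_share = sum(s))    # 45 rows, group total repeated within each group
An equivalent route is to compute the 15-row summary and left_join() it back onto blp by cat and yr.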

Plotting only 1 hourly data point (1 per day) alongside hourly points (24 per day) in RStudio

I am a bit stuck with some code. Of course I would appreciate a piece of code that sorts out my dilemma, but I am also grateful for hints on how to sort it out myself.
Here goes:
First of all, I installed the packages (ggplot2, lubridate, and openxlsx)
The relevant part:
I extract a file from an Italian gas TSO's website:
Storico_G1 <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G+1", startRow = 1, colNames = TRUE)
Then I created a data frame with the variables I want to keep:
Storico_G1_df <- data.frame(Storico_G1$pubblicazione, Storico_G1$IMMESSO, Storico_G1$`SBILANCIAMENTO.ATTESO.DEL.SISTEMA.(SAS)`)
Then change the time format:
Storico_G1_df$pubblicazione <- ymd_h(Storico_G1_df$Storico_G1.pubblicazione)
Now the struggle begins. In this example I would like to chart the two time series with two different Y axes, because the ranges are very different. This is not really a problem as such, because with the melt function and ggplot I can achieve that. However, since there are NAs in one column, I don't know how to work around that. In the incomplete (SAS) column I mainly care about the data point at 16:00, so I would ideally have hourly points on one chart and only one data point per day (at said 16:00) on the second chart. I attached an unrelated example pic of the chart style I mean; however, in the attached chart I have equally many data points on both charts, and hence it works fine.
Grateful for any hints.
Take care
library(lubridate)
library(ggplot2)
library(openxlsx)
library(dplyr)
#Use na.strings; it looks like NAs can take several different values in this dataset
storico.xl <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",
sheet = "Storico_G+1", startRow = 1,
colNames = TRUE,
na.strings = c("NA","N.D.","N.D"))
#Select and rename the crazy column names
storico.g1 <- data.frame(storico.xl) %>%
select(pubblicazione, IMMESSO, SBILANCIAMENTO.ATTESO.DEL.SISTEMA..SAS.)
names(storico.g1) <- c("date_hour","immesso","sads")
# the date column is in the format ymd_h
storico.g1 <- storico.g1 %>% mutate(date_hour = ymd_h(date_hour))
#Not sure exactly what you want to plot, but here is each point by hour
ggplot(storico.g1, aes(x= date_hour, y = immesso)) + geom_line()
#For daily aggregates, group by the calendar date; date_hour is already
#POSIXct, so as.Date() simply drops the time part
#You can check there are 24 points per day
#Feed the new columns into ggplot
storico.g1 %>%
  group_by(date = as.Date(date_hour)) %>%
  summarise(count = n(),
            daily.immesso = sum(immesso)) %>%
  ggplot(aes(x = date, y = daily.immesso)) + geom_line()
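To get the one-point-per-day series the question asks for (the 16:00 reading of sads), one hedged sketch on top of the storico.g1 frame built above is to filter on the hour and overlay it on the hourly immesso line:
library(lubridate)
sads_16 <- storico.g1 %>%
  filter(hour(date_hour) == 16, !is.na(sads))
ggplot(storico.g1, aes(x = date_hour, y = immesso)) +
  geom_line() +
  geom_point(data = sads_16, aes(y = sads), colour = "red")
Because the two series have very different ranges, a true second Y axis would need scale_y_continuous(sec.axis = sec_axis(...)) with one series rescaled; faceting the melted data with free y scales is often the simpler route.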

Finding Avg/Sum of a Column Value

I have a nice jitter plot of my data, but I'd like to dig further into the data by finding the mean/sum/median, etc.
I don't know the syntax to summarise the data by column value.
My data frame consists of 2 variables: Year (2010-2017) and Followers (numeric).
Code I used:
ggplot(MyData, aes(factor(Date), Followers)) +
geom_jitter(aes(color = factor(Date)))
This separated each Numeric data point into categorized groups of each year.
I was able to use sum(MyData$Followers) to get the total followers for all years, as well as count(MyData, 'Date') to get the frequency for each year.
But I'm not sure how to combine them to get the total/average followers for each individual year.
You can use dplyr:
library(dplyr)
df <- MyData %>%
  group_by(Year) %>%
  summarize(Mean = mean(Followers), Count = n()) # n() takes no arguments
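For comparison, base R's aggregate() gives the same per-year summaries (assuming the year column is the Date used in the plotting code):
aggregate(Followers ~ Date, data = MyData, FUN = mean) # average followers per year
aggregate(Followers ~ Date, data = MyData, FUN = sum)  # total followers per year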

Subset data frame based on top N most frequent values in variable

My objective is to create a simple density or barplot of a long dataframe which shows the relative frequency of nationalities in a course (MOOC). I just don't want all of the nationalities in there, just the top 10. I created this example df below + the ggplot2 code I use for plotting.
d <- data.frame(course = sample(LETTERS[1:5], 500, replace = TRUE),
                nationality = as.factor(sample(1:172, 500, replace = TRUE)))
mm <- ggplot(d, aes(x = nationality, colour = factor(course)))
mm + geom_bar() + theme_classic()
...but as said: I want a subset of the entire dataset based on frequency. The above shows all data.
PS. I added the ggplot2 code for context but also because maybe there is something within ggplot2 itself that would make this possible (I doubt it however).
EDIT 2014-12-11:
The current answers use dplyr or table methods to arrive at the desired subset, but I wonder if there is not a more direct way to achieve the same. I will let the question stay for now and see if there are other ways.
Using the dplyr functions count and top_n to get the top-10 nationalities. Because top_n accounts for ties, the number of nationalities included in this example is more than 10. arrange the counts, then use factor and levels to set the nationalities in descending order.
# top-10 nationalities
d2 <- d %>%
  count(nationality) %>%
  top_n(10) %>%
  arrange(n, nationality) %>%
  mutate(nationality = factor(nationality, levels = unique(nationality)))
d %>%
  filter(nationality %in% d2$nationality) %>%
  mutate(nationality = factor(nationality, levels = levels(d2$nationality))) %>%
  ggplot(aes(x = nationality, fill = course)) +
  geom_bar()
Here's an approach to select the top 10 nationalities. Note that multiple nationalities share the same frequency, so selecting exactly 10 means omitting some nationalities that tie with the 10th.
# calculate frequencies
tab <- table(d$nationality)
# sort
tab_s <- sort(tab)
# extract 10 most frequent nationalities
top10 <- tail(names(tab_s), 10)
# subset of data frame
d_s <- subset(d, nationality %in% top10)
# order factor levels
d_s$nationality <- factor(d_s$nationality, levels = rev(top10))
# plot
ggplot(d_s, aes(x = nationality, fill = as.factor(course))) +
geom_bar() +
theme_classic()
Note that I changed colour to fill since colour affects the colour of the border.
Although the question was raised some time ago, I propose two other solutions for the sake of completeness:
d_raw <- data.frame(
course = sample(LETTERS[1:5], 500, replace = T),
nationality = as.factor(sample(1:172, 500, replace=T))
)
One uses fct_lump_n() from the forcats package together with filter():
library(dplyr)
library(forcats)
d1 <- d_raw %>%
  mutate(nationality = fct_lump_n(
    f = nationality,
    n = 10,
    ties.method = "first"
  )) %>%
  filter(nationality != "Other")
d1 %>% count(nationality, sort = TRUE)
ggplot(d1, aes(x = nationality, fill = course)) + # factor() is not needed here.
geom_bar() +
theme_classic()
fct_lump_n() lumps all nationalities except the 10 most frequent ones into the category "Other". Note that the argument ties.method = "first" is needed to really get only the first ten nationalities, not 11 or 12; all other nationalities are labelled "Other" even though they may appear just as often as the first ten.
Levels of nationality are only ordered alphabetically.
Another solution uses fct_infreq() from the forcats package, cur_group_id(), and filter().
d2 <- d_raw %>%
  group_by(nationality = fct_infreq(nationality)) %>%
  filter(cur_group_id() <= 10) %>%
  ungroup()
d2 %>% count(nationality, sort = TRUE)
ggplot(d2, aes(x = nationality, fill = course)) + # factor() is not needed here.
geom_bar() +
theme_classic()
cur_group_id() assigns a group ID to every nationality. To get started with the most frequent nationality we first need to order column nationality by its frequencies. Then we filter for the first ten group IDs aka the ten most frequent nationalities.
Levels of nationality are first ordered by n, then ordered alphabetically.
I used count() to verify the two data frames d1 and d2 look the same.
Both solutions have the advantage, that we don't need a second (temporary) data frame or temporary vectors.
I hope this helps someone in the future.

Subsetting based on observations in a month

I'm trying to subset some data and am stuck on the last part of the cleaning.
What I need to do is calculate the number of observations for each individual (indivID) in the months June, July, and August, turn that into a percentage of days without missing data, and then keep those individuals whose observations are over 75%.
I was able to create a nested for loop, but it took probably 6 hours to process today. I would like to take advantage of parallel computing by using ddply, or another function, but am very lost.
Here's the data (Note this is a very small subset that only includes individuals from 1:5):
https://www.dropbox.com/s/fmk8900622klsgt/data.csv?dl=0
And here's the for loop:
epa.d <- read.csv("/.../data.csv")
#Helper for the number of days in each month
days <- function(month) {
  if (month == 6) return(30)
  if (month == 7) return(31)
  if (month == 8) return(31)
}
#Subset data for 75% coverage in June, July, and August
for (i in unique(epa.d$indivID)) {
  for (j in unique(epa.d$year)) {
    for (k in unique(epa.d$month)) {
      monthsum <- sum(epa.d$indivID == i & epa.d$year == j & epa.d$month == k)
      monthperc <- (monthsum / days(k)) * 100
      if (monthperc < 75) {
        epa.d <- epa.d[!(epa.d$indivID == i & epa.d$year == j), ]
      }
    }
  }
}
If I understand you correctly, you want to keep daily observations for each combination of indivID-month-year in which at least 75% of days have ozone measurements. Here's a way to do it that should be pretty fast:
library(dplyr)
# For each indivID, calculate percent of days in each month with
# ozone observations, and keep those with pctCoverage >= 0.75
epa.d_75 = epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count = n()) %>%
  mutate(pctCoverage = ifelse(month == 6, count/30, count/31)) %>%
  filter(pctCoverage >= 0.75)
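(As an aside, the hard-coded 30/31 only works because these months are June, July, and August; lubridate::days_in_month() generalizes it. A sketch, assuming month and a four-digit year are numeric columns:)
library(lubridate)
epa.d_75 = epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count = n()) %>%
  mutate(pctCoverage = count / days_in_month(ymd(paste(year, month, 1, sep = "-")))) %>%
  filter(pctCoverage >= 0.75)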
We now have a data frame epa.d_75 that has one row for each indivID-month-year with at least 75% coverage. Next, we'll merge the daily data into this data frame, resulting in one row for each daily observation for each unique indivID-month-year.
# Merge in daily data for each combination of indivID-month-year that meets
# the 75% coverage criterion
epa.d_75 = merge(epa.d_75, epa.d, by=c("indivID","month","year"),
all.x=TRUE)
Update: To answer the questions in the comments:
Can you explain what the %>% is doing, and if possible give a breakdown of how you logically thought about this?
The %>% is a "chaining" operator that allows you to chain functions one after the other without having to store the result of the previous function before running the next one. Take a look at the dplyr Vignette to learn more about how to use it. Here's how the logic works in this case:
group_by splits the data set by the grouping variables, then runs the next functions separately on each group. In this case, summarise counts the number of rows in the data frame for each unique combination of indivID, month, and year, then mutate adds a column with the fractional coverage for that indivID for that month and year. filter then gets rid of any combination of indivID, month, and year with less than 75% coverage. You can stop the chain at any point to see what it's doing. For example, run the following code to see what epa.d_75 looks like before the filtering operation:
epa.d_75 = epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count = n()) %>%
  mutate(pctCoverage = ifelse(month == 6, count/30, count/31))
Why the hell is this so much faster than running for loops? I don't know the answer in detail, but dplyr does most of its magic in compiled C++ code under the hood, which is faster than native R. Hopefully someone else can give a more precise and detailed answer.
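(A quick way to convince yourself is to time the grouped summary on its own:)
system.time({
  epa.d %>%
    group_by(indivID, year, month) %>%
    summarise(count = n())
})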
Another option would be to use data.table (similar to @eipi10's dplyr method), which would be very fast.
library(data.table)
epa.d_75 <- setDT(epa.d)[, list(pctCoverage = ifelse(month == 6, .N/30, .N/31)),
                         by = list(indivID, year, month)][pctCoverage >= 0.75]
epa.d_75New = merge(epa.d_75, epa.d, by=c("indivID","month","year"),
all.x=TRUE)
data
epa.d <- read.csv('data.csv', row.names=1)
