Problem with `summarise()` input `Illinois` in R - r

I'm doing an assignment for school in which we use a built-in data frame (midwest, which ships with ggplot2) to manipulate data with dplyr and display visualizations through shiny.
I'm getting the error "Problem with `summarise()` input `Illinois`" because "object 'IL' not found", even though IL is a value in a column that I thought I had grouped by.
Here's some of my code at the moment.
bar_chart <- function(midwest) {
data_summary <- midwest %>%
dplyr::group_by(state) %>%
summarize("Illinois" = mean(IL, na.rm = TRUE),
"Minnesota" = mean(MN, na.rm = TRUE),
"Indiana" = mean(IN, na.rm = TRUE),
"Ohio" = mean(OH, na.rm = TRUE),
"Wisconsin" = mean(WN, na.rm = TRUE))

A couple things to understand here. Groups specify a level of aggregation, in this case state. That means when we summarize, we summarize to that specified level of aggregation. We have a data set with multiple states, so when we group by state, that means we'll end up with one row for each state. The result is that you don't have to write a line of code for each state like you did in your provided example.
When we summarize, we need to specify a function which we'll use to summarize (i.e. roll-up) the data, as well as a column to apply it to. In this case you're using mean, so I'll use that as well, and we'll find the mean of poptotal for each state.
Finally, while you can use recode to replace the state abbreviations, my little example below uses a left_join and R's built-in vectors of state names and abbreviations (state.name and state.abb) to add the full names, a nice little trick if you had all 50 states.
library(tidyverse)
data(midwest)
stateTable <- data.frame(state.abb, state.name)
midwest %>% group_by(state) %>%
summarize(poptotal = mean(poptotal)) %>%
left_join(. , stateTable, by = c( "state" = "state.abb"))
# A tibble: 5 x 3
state poptotal state.name
<chr> <dbl> <fct>
1 IL 112065. Illinois
2 IN 60263. Indiana
3 MI 111992. Michigan
4 OH 123263. Ohio
5 WI 67941. Wisconsin
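The recode alternative mentioned above could look roughly like the sketch below. It assumes the midwest data set from ggplot2 and spells out the five abbreviations by hand; the `state_name` column and `state_means` object are names chosen here for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the midwest data set

# One row per state, then map each abbreviation to its full name.
state_means <- midwest %>%
  group_by(state) %>%
  summarise(poptotal = mean(poptotal, na.rm = TRUE)) %>%
  mutate(state_name = recode(state,
                             IL = "Illinois",
                             IN = "Indiana",
                             MI = "Michigan",
                             OH = "Ohio",
                             WI = "Wisconsin"))

state_means
```

With only five states the hand-written mapping is manageable; for a longer list the join against state.abb and state.name scales better.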

Related

Conditional (row-wise) formatting of currency, number, and percentage in R DT (datatable)

I have a column in my DT output (in Shiny) that has a numeric value whose units depend on another column. Some values are percentages, some are currency, and some are plain numbers.
For example, I would like to turn this input...
DefaultFormat  Value
PCT            12345.67
DOLLAR         12345.67
NUMBER         12345.67
...into this DT output:
DefaultFormat  Value
PCT            123.45%
DOLLAR         $12,345
NUMBER         12,345.67
The formatCurrency(), formatPercentage() and formatRound() functions do what I need for each of these respective formats, but they affect the entire column instead of specific cells. On the other hand, formatStyle() can target specific cells in a column based on another column, but I can't figure out a way to have it change the contents rather than the styles.
Furthermore, I tried setting the class using formatStyle() in the hopes that in the .css file I could then target, e.g. .pctclass:after and .currencyclass:before but it ignores the class attribute.
What is a good way to get the conditional behavior of formatStyle() but for numbers, percentages, and currencies?
EDIT: here's a solution borrowing from the approach here: https://stackoverflow.com/a/35657820/6851825
You are seeking to sort a formatted column based on the underlying data instead of its varied formatted appearance. You can do this by using an unformatted helper column to handle the sorting:
library(dplyr)
data.frame(
stringsAsFactors = FALSE,
DefaultFormat = c("PCT", "DOLLAR", "NUMBER"),
Value = c(54.54, 12345.67, 12345.67)
) %>%
mutate(Value_fmt = case_when(DefaultFormat == "PCT" ~ scales::percent(Value),
DefaultFormat == "DOLLAR" ~ scales::dollar(Value),
DefaultFormat == "NUMBER" ~ scales::comma(Value),
TRUE ~ as.character(Value)) %>%
forcats::fct_reorder(Value), .after = 1) %>%
DT::datatable(rownames = FALSE, options = list(columnDefs = list(
list(orderData = 2, targets = 1),
list(visible=FALSE, targets = 2))))
For example, note how 5 454% appears before the other entries even though it is alphabetically later.
(This is not DT-specific, it wasn't clear if that was a requirement.)
You can group or split and assign:
library(dplyr)
set.seed(2)
dat <- data.frame(fmt = sample(c("PCT","DOLLAR","NUMBER"), 10, replace = TRUE), value = round(runif(10, 10, 9999), 2))
dat %>%
group_by(fmt) %>%
mutate(value2 = switch(fmt[1],
PCT=scales::percent(value),
DOLLAR=scales::dollar(value),
NUMBER=scales::percent(value),
as.character(value))
)
# # A tibble: 10 x 3
# # Groups: fmt [3]
# fmt value value2
# <chr> <dbl> <chr>
# 1 PCT 1816. 181 621%
# 2 NUMBER 4058. 405 836%
# 3 DOLLAR 8536. $8,536.10
# 4 DOLLAR 9763. $9,763.24
# 5 PCT 2266. 226 577%
# 6 PCT 4453. 445 320%
# 7 PCT 759. 75 897%
# 8 PCT 6622. 662 171%
# 9 PCT 3881. 388 123%
# 10 DOLLAR 8370. $8,369.69
An alternative would be to use case_when, which gives very similar results. case_when evaluates every formatter over the whole column and then picks a result per row, whereas this method calls only the matching format function, once per group, which is perhaps a bit more efficient. (Over to you if that's necessary.)

Match two dataframes with different column names and create new column with mean from the other

I have two dataframes. The first one only lists each School/Team once, something like this:
classA <- data.frame(School=c("Omaha South", "Millard North", "Elkhorn"))
The other dataframe is a table of basketball scores throughout a season, and a School/Team can be listed more than once in the same column:
scores <- data.frame('Away Score'=c(60,84,48,72),
'Away Team'=c("Omaha South", "Millard North", "Elkhorn","Elkhorn"),
'Home Score'=c(88,40,38,62),
'Home Team'=c("Elkhorn", "Omaha South", "Millard North","Omaha South"))
My goal is to create a new column called classA$'Away PPG' that averages all of the 'Away Score' values for each School in the first data frame. So for Elkhorn, the new classA column would be 60, i.e. (48+72)/2.
One of the places I'm getting stuck is that the two dfs have different column names to match and I haven't found out how to deal with that aspect.
I got help previously on a somewhat related problem where I was looking for a count instead of an average, but I couldn't figure out how to modify it to work for this one. The solution for the count issue looked like this:
df2 %>%
right_join(df1, by = c('Winner' = 'School')) %>%
na.omit() %>%
count(Winner, name = "wins") %>%
right_join(df1, c('Winner' = 'School')) %>%
mutate(wins = replace(wins, is.na(wins), 0))
We can join classA with scores and then take mean of Away.Score for each School.
library(dplyr)
classA %>%
left_join(scores, by = c('School' = 'Away.Team')) %>%
group_by(School) %>%
summarise(AwayScore = mean(Away.Score, na.rm = TRUE))
# A tibble: 3 x 2
# School AwayScore
# <fct> <dbl>
#1 Elkhorn 60
#2 Millard North 84
#3 Omaha South 60
Similarly, in base R:
aggregate(Away.Score~School,
merge(classA, scores, by.x = 'School', by.y = 'Away.Team'),
mean, na.rm = TRUE)

Filtering a Data Frame with Very specific Requirements

First, I am not a developer and have little experience with R, so please forgive me. I have tried to get this done on my own, but have run out of ideas for filtering a data frame using the 'filter' command.
The data frame has about a dozen or so columns, with one being Grp (meaning Group). This is a FIFA soccer dataset, so the Group in this context means the general position the player is in (Defense, Midfield, Goalkeeper, Forward).
I need to filter this data frame to provide me this exact information:
the Top 4 Defense Players
the Top 4 Midfield Players
the Top 2 Forwards
the Top 1 Goalkeeper
What do I mean by "Top"? Within each Grp, players are arranged by the Growth column, which is just a number. So, Top 4 would be like 22,21,21,20 (or something similar, because that number could in fact be repeated for different players). The Growth column is the difference between the Potential column and the Overall column, so again just a simple subtraction.
#Create a subset of the data frame
library(dplyr)
fifa2 <- fifa %>% select(Club,Name,Position,Overall,Potential,Contract.Valid.Until2,Wage2,Value2,Release.Clause2,Grp) %>% arrange(Club)
#Add columns for determining potential
fifa2$Growth <- fifa2$Potential - fifa2$Overall
head(fifa2)
#Find Southampton Players
ClubName <- filter(fifa2, Club == "Southampton") %>%
group_by(Grp) %>% arrange(desc(Growth), .by_group=TRUE) %>%
top_n(4)
ClubName
ClubName2 <- ggplot(ClubName, aes(x=forcats::fct_reorder(Name, Grp),
y=Growth, fill = Grp)) +
geom_bar(stat = "identity", colour = "black") +
coord_flip() + xlab("Player Names") + ylab("Unfilled Growth Potential") +
ggtitle("Southampton Players, Grouped by Position")
ClubName2
That chart produces a list of players that ends up having the Top 4 players in each position (top_n(4)), but I need it further filtered per the logic I described above. How can I achieve this? I tried fooling around with dplyr and that is fairly easy to get rows by Grp name, but don't see how to filter it to the 4-4-2-1 that I need. Any help appreciated.
Sample output from fifa2 and ClubName (which shows the data sorted by top_n(4)): [image: fifa2_Dataset]
This might not be the most elegant solution, but hopefully it works :)
# create dummy data
data_test = data.frame(grp = sample(c("def", "mid", "goal", "front"), 30, replace = T), growth = rnorm(30, 100,10), stringsAsFactors = F)
# create referencetable to give the number of players needed per grp
desired_n = data.frame(grp = c("def", "mid", "goal", "front"), top_n_desired = c(4,4,1,2), stringsAsFactors = F)
# > desired_n
# grp top_n_desired
# 1 def 4
# 2 mid 4
# 3 goal 1
# 4 front 2
# group and arrange, then look up the desired number of players in the reference table and select them.
data_test %>% group_by(grp) %>% arrange(desc(growth)) %>%
slice(1:desired_n$top_n_desired[which(first(grp) == desired_n$grp)]) %>%
arrange(grp)
# A bit more readable, but you have to create an additional column in your dataframe
# create additional column with desired amount for the position written in grp of each player
data_test = merge(data_test, desired_n, by = "grp", all.x = TRUE)
data_test %>% group_by(grp) %>% arrange(desc(growth)) %>%
slice(1:first(top_n_desired)) %>%
arrange(grp)
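The same lookup idea can also be written as a join followed by a per-group filter on the row position. This is only a sketch on made-up data: the `players`, `quota`, and `top_players` names are hypothetical, and the seed is arbitrary.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, dummy data only
players <- data.frame(
  grp = sample(c("def", "mid", "goal", "front"), 30, replace = TRUE),
  growth = rnorm(30, 100, 10),
  stringsAsFactors = FALSE
)

# Number of players wanted per position group (4-4-2-1).
quota <- data.frame(
  grp = c("def", "mid", "goal", "front"),
  top_n_desired = c(4, 4, 1, 2),
  stringsAsFactors = FALSE
)

top_players <- players %>%
  inner_join(quota, by = "grp") %>%              # attach the quota to every player
  group_by(grp) %>%
  arrange(desc(growth), .by_group = TRUE) %>%    # best growth first within each group
  filter(row_number() <= top_n_desired) %>%      # keep only as many rows as the quota allows
  ungroup()

top_players
```

For the real data, the growth column would be Growth and the group labels would need to match the values stored in Grp.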

R dplyr summarise date gaps

I have data on a set of students and the semesters they were enrolled in courses.
ID = c(1,1,1,
2,2,
3,3,3,3,3,
4)
The semester variable "Date" is coded as the year followed by 20 for spring, 30 for summer, and 40 for fall. So the Date value 201430 is the summer semester of 2014.
Date = c(201220,201240,201330,
201340,201420,
201120,201340,201420,201440,201540,
201640)
Enrolled<-data.frame(ID,Date)
I'm using dplyr to group the data by ID and to summarise various aspects about a given student's enrollment history
Enrollment.History<-dplyr::select(Enrolled,ID,Date)%>%group_by(ID)%>%summarise(Total.Semesters = n_distinct(Date),
First.Semester = min(Date))
I'm trying to get a measure for the number of enrollment gaps that each student has, as well as the size of the largest enrollment gap. The data frame should end up looking like this:
Enrollment.History$Gaps<-c(2,0,3,0)
Enrollment.History$Biggest.Gap<-c(1,0,7,0)
print(Enrollment.History)
I'm just trying to figure out the best way to code those gap variables. Is it better to turn that Date variable into an ordered factor? I hope this has a simple solution.
Since you are not dealing with real dates in a standard format, you can instead make use of factors to compute the gaps.
First you need to define a vector of all possible year/semester combinations ("Dates") in the correct order (this is important!).
all_semesters <- c(sapply(2011:2016, paste0, c(20,30,40)))
Then, you can create a new factor variable, arrange the data by ID and Date, and finally compute the maximum difference between two semesters:
Enrolled %>%
mutate(semester = factor(Date, levels = all_semesters)) %>%
group_by(ID) %>%
arrange(Date) %>%
summarise(max_gap = max(c(0, diff(as.integer(semester)) -1), na.rm = TRUE))
## A tibble: 4 × 2
# ID max_gap
# <dbl> <dbl>
#1 1 1
#2 2 0
#3 3 7
#4 4 0
I used max(c(0, ...)) in the summarise, because otherwise you would end up with -Inf for IDs with a single entry.
Similarly, you could also achieve this by using match instead of a factor:
Enrolled %>%
mutate(semester = match(Date, all_semesters)) %>%
group_by(ID) %>%
arrange(Date) %>%
summarise(max_gap = max(c(0, diff(semester) -1), na.rm = TRUE))
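The question also asks for the number of gaps per student, which the snippets above do not compute. A minimal sketch, reusing the match approach and counting every pair of consecutive enrollments that skips at least one semester (the `gap_summary` name is chosen here for illustration):

```r
library(dplyr)

ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4)
Date <- c(201220, 201240, 201330, 201340, 201420,
          201120, 201340, 201420, 201440, 201540, 201640)
Enrolled <- data.frame(ID, Date)

# All possible semesters in chronological order.
all_semesters <- c(sapply(2011:2016, paste0, c(20, 30, 40)))

gap_summary <- Enrolled %>%
  mutate(semester = match(Date, all_semesters)) %>%  # position of each semester in the sequence
  arrange(ID, Date) %>%
  group_by(ID) %>%
  summarise(
    Gaps = sum(diff(semester) - 1 > 0),           # consecutive enrollments with skipped semesters
    Biggest.Gap = max(c(0, diff(semester) - 1))   # 0 guards against students with one semester
  )

gap_summary
```

On the sample data this reproduces the Gaps and Biggest.Gap values given in the question.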

Iteratively create columns based on grouped variables

I've got some data (below) where I want to iteratively add columns based on sums of current columns by some grouping variable, and I want to name the columns a pasted value of the current name + "_tot". I'm thinking a combination of dplyr and lapply is the way to go about it but I can't get the structure correct.
set.seed(1234)
data <- data.frame(
biz = sample(c("telco","shipping","tech"), 50, replace = TRUE),
region = sample(c("mideast","americas"), 50, replace = TRUE),
june = sample(1:50, 50, replace=TRUE),
july = sample(100:150, 50, replace=TRUE)
)
So, what I want to do is group this data by "region", then add a new column for each month that is the sum of that month's values within the region (in the real dataframe, there are many more periods).
Basically, I want to apply this function
library(dplyr)
data %>% group_by(region) %>% mutate(june_tot = sum(june))
across every month, without having to specify "june" or "july". My initial take:
testfun <- function(df, col) {
name <- paste(col, "_tot", sep="")
data2 <- df %>% group_by(region) %>% summarise(name=sum(col))
return(data2)
}
but lapplying this doesn't work, because I have to specify the columns to call into the initial function. Just removing the "col" argument from the initial function doesn't work either, of course.
Any ideas how to lapply this sort of argument?
Here are possible solutions to your problems using dplyr (first, since that is what you tried), and followed by data.table as well as base R solutions:
dplyr:
cols <- lapply(names(data)[-(1:2)], as.name)
names(cols) <- paste0(names(data)[-(1:2)], "_tot")
data %>% group_by(region) %>% mutate_each_q(funs(sum), cols)
Assumes every column but the first two are monthly data. An explanation by line:
1. we use as.name and lapply to generate a list of the column names we want to mutate, as symbols
2. we give the new names we want (i.e. month_tot) to the list of symbols from 1.
3. we use mutate_each_q (known as mutate_each_ in dplyr 0.3.0.2) to apply sum to the list of expressions we created in 1. and 2.
This is the (sample) result:
Source: local data frame [50 x 6]
Groups: region
biz region june july june_tot july_tot
1 shipping mideast 17 124 780 3339
2 telco americas 11 101 465 2901
3 telco mideast 27 131 780 3339
4 tech americas 24 135 465 2901
... rows omitted
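In current dplyr releases the same idea is usually written with across(). The sketch below assumes dplyr 1.0.0 or later and that every numeric column is a month to total; `result` is a name chosen here for illustration.

```r
library(dplyr)

set.seed(1234)
data <- data.frame(
  biz = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
  region = sample(c("mideast", "americas"), 50, replace = TRUE),
  june = sample(1:50, 50, replace = TRUE),
  july = sample(100:150, 50, replace = TRUE)
)

result <- data %>%
  group_by(region) %>%
  # sum every numeric column within its region and append "_tot" to the name
  mutate(across(where(is.numeric), sum, .names = "{.col}_tot")) %>%
  ungroup()

head(result)
```

The `.names` glue pattern builds the new column names, so no month has to be spelled out.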
data.table:
new.names <- paste0(tail(names(data), 2L), "_tot") # Make new names
data.table(data)[,
(new.names):=lapply(.SD, sum), # `lapply` `sum` to the selected columns (those in .SD), and assign to `new.names` columns
by=region, .SDcols=-1 # group by `region`, and exclude first column from `.SD` (note `region` is excluded as well by reason of being in `by`)
][] # extra `[]` just to force printing
Here, similar logic, except we use the special .SD object that represents every column in the data.table that we are not grouping by.
base:
do.call(
cbind,
list(
data,
setNames(
lapply(data[-(1:2)], function(x) ave(x, data$region, FUN=sum)),
paste0(names(data[-(1:2)]), "_tot")
) ) )
Here we use ave to compute the per region sums, use lapply to apply ave to each column, and use do.call(cbind, ...) to reconstruct the final data frame.
Try:
> for(i in 3:4) print(tapply(data[[i]], data$region, sum))
americas mideast
563 768
americas mideast
2538 3802
You can get all outputs in a list if you want.
Restructuring the data works well for this.
require(tidyr)
# wide to long
d2 <- gather(data = data,key = month,value = monthval,-c(biz,region))
# get totals and rename month
month_tots <- aggregate(x = list(total = d2$monthval),by = list(region = d2$region,month = d2$month),sum)
month_tots$month <- paste0(month_tots$month,'_tot')
# long to wide
month_tots <- spread(data = month_tots,key = month,value = total)
# recombine
merge(data,month_tots,by = 'region',all.x = T)
