Paired bar chart with conditional labeling based on multiple factors - r

I am trying to create a graphical output like the picture below for the following sample of data but the code I have included gives an error:
Error in mutate_impl(.data, dots) : Evaluation error: Column n must be length 43 (the number of rows) or one, not 42.
My goal is to plot all providers from the same location on the same chart and then include only one name on the axis, so that each provider can see how they compare to others in their area without revealing the identity of the other providers. I have tried specifying n = 43 (the length of the full dataset) but have not had any success. Additionally, I would like to make a paired bar chart to show how each provider's rate compares to their previous month's rate.
Provider Month Payment Location
Andrew 2 32.62 OH
Dillard 2 40 OH
Henry 2 32.28 OH
Lewis 2 47.79 IL
Marcus 2 73.04 IL
Matthews 2 45.22 NY
Paul 2 65.73 NY
Reed 2 27.67 NY
Andrew 1 33.23 OH
Dillard 1 36.63 OH
Henry 1 42.68 OH
Lewis 1 71.45 IL
Marcus 1 39.51 IL
Matthews 1 59.11 NY
Paul 1 27.67 NY
Reed 1 28.78 NY
library(tidyverse)
library(purrr)
df <- 1:nrow(PaymentsFeb) %>%
purrr::map( ~PaymentsFeb) %>%
set_names(PaymentsFeb$Provider) %>%
bind_rows(.id = "ID") %>%
nest(-ID) %>%
mutate(Location=map2(data,ID, ~.x %>% filter(Provider == .y) %>% select(Location))) %>%
mutate(data=
map2(data, ID, ~.x %>%
mutate(n=paste0("#", sample(seq_len(n()), size = n())),
Provider=ifelse(Provider == .y, as.character(Provider), n),
Provider=factor(Provider, levels = c(.y, sample(n, n())))))) %>%
mutate(plots=map2(data,Location, ~ggplot(data=.x,aes(x = Provider, y = scores, fill = scores))+
geom_col() +geom_text(aes(label=Per.Visit.Bill.Rate), vjust=-.3)+
ggtitle("test scores by Location- February 2018", subtitle = .y$Location)
))
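One possible way to approach this, offered as a rough sketch rather than a fix for the exact pipeline above: mask the other providers' names with a small helper, then draw dodged bars with one bar per month. The helper names `mask_providers()` and `plot_provider()` are hypothetical, and the sketch plots the `Payment` column from the sample data (the `scores` and `Per.Visit.Bill.Rate` columns referenced in the question's code are not in the sample).

```r
library(ggplot2)

payments <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Provider Month Payment Location
Andrew   2 32.62 OH
Dillard  2 40    OH
Henry    2 32.28 OH
Lewis    2 47.79 IL
Marcus   2 73.04 IL
Matthews 2 45.22 NY
Paul     2 65.73 NY
Reed     2 27.67 NY
Andrew   1 33.23 OH
Dillard  1 36.63 OH
Henry    1 42.68 OH
Lewis    1 71.45 IL
Marcus   1 39.51 IL
Matthews 1 59.11 NY
Paul     1 27.67 NY
Reed     1 28.78 NY
")

# Hypothetical helper: keep only the focal provider's location, keep the
# focal provider's name, and replace every other provider with "#1", "#2", ...
mask_providers <- function(df, focal) {
  loc    <- unique(df$Location[df$Provider == focal])
  peers  <- df[df$Location == loc, ]
  prov   <- as.character(peers$Provider)
  others <- setdiff(unique(prov), focal)
  # Shuffle so the masked numbers do not follow the original row order
  masks  <- setNames(paste0("#", seq_along(others)), sample(others))
  label  <- ifelse(prov == focal, prov, unname(masks[prov]))
  peers$Label <- factor(label, levels = c(focal, paste0("#", seq_along(others))))
  peers$Month <- factor(peers$Month)
  peers
}

# Hypothetical helper: paired (dodged) bars, one bar per month per provider
plot_provider <- function(df, focal) {
  masked <- mask_providers(df, focal)
  ggplot(masked, aes(x = Label, y = Payment, fill = Month)) +
    geom_col(position = "dodge") +
    geom_text(aes(label = Payment),
              position = position_dodge(width = 0.9), vjust = -0.3) +
    labs(x = "Provider", title = "Payment by provider",
         subtitle = unique(masked$Location))
}

p <- plot_provider(payments, "Andrew")
```

Because each provider gets the same masked label in both months, the two bars for a provider stay paired. Calling `lapply(unique(payments$Provider), plot_provider, df = payments)` would give one chart per provider.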


Question about my computation (Using R with dplyr and nyflights13 to figure out number of seat miles by carrier)

I'm working through the ModernDive data science book (https://moderndive.com/3-wrangling.html#joins) and got stuck on (LC3.20) at the end of chapter 3. Using the nycflights13 package in R and dplyr, I'm supposed to generate a tibble that has only two columns, airline name and seat miles. Seat miles is just seats * miles. I understand the problem and I thought my code was going to output the correct result; however, my seat miles are different for each airline carrier than in the solution. Can someone please help me figure out where my code went wrong? I do understand the book's solution, I just don't know why my solution is wrong. I posted all my work.
#seat miles = miles*seats
View(flights) #distance and identifiers year and tail num and carrier
View(airlines) # names and indentifiers carrier
View(planes) #seats and identifiers year and tail num
#join names to flights
named_flights <- flights %>%
inner_join(airlines, by = 'carrier')
named_flights #same number of rows, all good
flights
#join seats to named_flights
named_seat_flights <- named_flights %>%
inner_join(planes, by = c('tailnum'))
named_seat_flights #noticed 52,596 rows are missing
#when joining tailnum to named_flights
table(is.na(select(named_flights, 'tailnum')))
#2512 rows that has NA values for tailnum in named_flights
table(is.na(select(planes, 'tailnum')))
#no tailnum data is missing from planes dataset
#and since a given plane (with a given tailnum)
#can take off multiple times per year
#we can conclude that the 52,596 missing rows
#are a result of the missing tailnum data in flights (also named_flights)
named_seat_miles_by_airline_name <- named_seat_flights %>%
group_by(name) %>%
summarise(seat_miles = sum(seats, na.rm = T)*sum(distance,na.rm = T)) %>%
rename(airline_name = name) %>%
arrange(desc(seat_miles))
named_seat_miles_by_airline_name #not correct
View(named_seat_miles_by_airline_name)
flights %>% # book solution
inner_join(planes, by = "tailnum") %>%
select(carrier, seats, distance) %>%
mutate(ASM = seats * distance) %>%
group_by(carrier) %>%
summarize(ASM = sum(ASM, na.rm = TRUE)) %>%
arrange(desc(ASM))
The output of my code is
# A tibble: 16 x 2
airline_name seat_miles
<chr> <dbl>
1 United Air Lines Inc. 8.73e14
2 Delta Air Lines Inc. 4.82e14
3 JetBlue Airways 4.13e14
4 ExpressJet Airlines I~ 9.82e13
5 US Airways Inc. 3.83e13
6 American Airlines Inc. 3.38e13
7 Southwest Airlines Co. 2.10e13
8 Endeavor Air Inc. 1.28e13
9 Virgin America 1.19e13
10 AirTran Airways Corpo~ 6.68e11
11 Alaska Airlines Inc. 2.24e11
12 Hawaiian Airlines Inc. 2.20e11
13 Frontier Airlines Inc. 1.17e11
14 Mesa Airlines Inc. 1.17e10
15 Envoy Air 7.10e 9
16 SkyWest Airlines Inc. 4.08e 7
The output of the book's code is
# A tibble: 16 x 2
carrier ASM
<chr> <dbl>
1 UA 15516377526
2 DL 10532885801
3 B6 9618222135
4 AA 3677292231
5 US 2533505829
6 VX 2296680778
7 EV 1817236275
8 WN 1718116857
9 9E 776970310
10 HA 642478122
11 AS 314104736
12 FL 219628520
13 F9 184832280
14 YV 20163632
15 MQ 7162420
16 OO 1299835
Also, I know I have airline names instead of carrier, but that's actually what was asked.
The code replaces the sum of the products with the product of the sums.
Compare these:
...
filter(!is.na(seats)) %>%
summarise(seat_miles_sums = sum(seats, na.rm = T)*sum(distance,na.rm = T),
seat_miles = sum(seats*distance))
...
Graphically, the question is asking for something like the areas below left, but your code calculates the area below right.
XXX YY XXXYY
XXX + YY < XXXYY
YY XXXYY
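A tiny made-up example (three invented flights, not taken from nycflights13) shows how far apart the two quantities can be:

```r
# Three hypothetical flights
seats    <- c(100, 200, 50)
distance <- c(500, 1000, 200)

# What the exercise asks for: multiply per flight, then add up
sum_of_products <- sum(seats * distance)
sum_of_products
#> [1] 260000

# What the question's summarise() computes: add up each column, then multiply
product_of_sums <- sum(seats) * sum(distance)
product_of_sums
#> [1] 595000
```

The product of the sums also counts every cross term (seats of one flight times distance of another), which is why the question's totals come out orders of magnitude larger than the book's.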

How to use get() to call on a variable in a function in R?

I am trying to use get() in a function. Sometimes it works and sometimes it doesn't. The data is:
ID<-c("001","002","003","004","005","006","007","008","009","010","NA","012","013")
Name<-c("Damon Bell","Royce Sellers",NA,"Cali Wall","Alan Marshall","Amari Santos","Evelyn Frye","Kierra Osborne","Mohammed Jenkins","Kara Beltran","Davon Harmon","Kaitlin Hammond","Jovany Newman")
Sex<-c("Male","Male","Male",NA,"Male","Male",NA,"Female","Male","Female","Male","Female","Male")
Age<-c(33,27,29,26,27,35,29,32,NA,25,34,29,26)
data<-data.frame(ID,Name,Sex,Age)
It works in some code, like this:
FctPieChart <- function(data,Var){
dVar <- data %>%
filter(!is.na(get(Var))) %>%
group_by(get(Var)) %>% #changes the name of the column to "`get(Var)`"
summarise(Count = n()) %>%
mutate(Total = sum(Count), Percentage = round((Count/Total),3))
}
FctPieChart(data,"Sex")
And it doesn't work in other code, like this:
FctPieChart <- function(data,var){
x <- data %>%
select(get(var))
x
}
FctPieChart(data,"Sex")
It says:
<error/simpleError>
object 'Sex' not found
Does anyone know why?
Thank you very much in advance!
Best regards,
Stephanie
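With dplyr::select() you don't need to use get(). Instead, just use select(var). (Recent versions of dplyr recommend select(all_of(var)) when var is a character vector holding column names, since the bare form is ambiguous if a column happens to be called var.)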
> FctPieChart <- function(data,var){
x <- data %>%
select(var)
x
}
> FctPieChart(data,"Sex")
Sex
1 Male
2 Male
3 Male
4 <NA>
5 Male
6 Male
7 <NA>
8 Female
9 Male
10 Female
11 Male
12 Female
13 Male
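As a hedged sketch of both functions from the question without get(): this assumes dplyr 1.0.0 or later (for across()) and uses all_of() from tidyselect, which dplyr re-exports. The function names `select_var()` and `count_var()` and the shortened data are made up for illustration. The likely reason get() fails inside select() is that select() uses tidyselect rules rather than data masking, so get(var) is not looked up among the columns the way it is inside filter() or group_by().

```r
library(dplyr)

# Shortened stand-in for the data in the question
data <- data.frame(
  Sex = c("Male", "Male", NA, "Female"),
  Age = c(33, 27, 29, 26),
  stringsAsFactors = FALSE
)

# Select a column whose name is stored in a string
select_var <- function(data, var) {
  data %>% select(all_of(var))
}

# Count non-missing values of a column whose name is stored in a string.
# across(all_of(var)) keeps the original column name, unlike group_by(get(var)).
count_var <- function(data, var) {
  data %>%
    filter(!is.na(.data[[var]])) %>%
    group_by(across(all_of(var))) %>%
    summarise(Count = n()) %>%
    mutate(Total = sum(Count), Percentage = round(Count / Total, 3))
}

sex_only   <- select_var(data, "Sex")
sex_counts <- count_var(data, "Sex")
```

Here `sex_counts` has a column named `Sex` rather than `get(Var)`, so no renaming is needed afterwards.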

Is there a way to perform a summary function separate from the grouping?

I have a data frame in the following format:
Person Answer Value
John Yes 3
Pete No 6
Joan Yes 5
Joan Yes 4
Pete No 7
I want to conduct an analysis (and create a stacked bar plot), where I'm able to group by the Person (repeating) and Answer variables and then summarize by value.
I've tried using dplyr to perform this, but I'm running into issues. The values on which I'm trying to perform the function are hindered if I use a group_by clause in my dplyr piping.
e.g.,
df2 <- df %>%
select(Person, Answer, Value) %>%
group_by(Person, Answer) %>%
summarise(sum(value = 3)/length(original dataframe ungrouped) + sum(value = 6)/length(original dataframe ungrouped)
The problem I'm running into is performing this calculation properly. The calculation doesn't make sense AFTER the data has been grouped, as I end up returning a very limited data frame after grouping.
Expected output:
person answer value
Joan Yes. calculated value (summary stat)
Joan No. calculate value
John Yes. calculated value....
John No
Pete Yes
Pete No
Ultimately, I'd like to make a stacked bar chart, where the summarization is shown across the People and the bars are divided into percentages by "yes" and "no" answers. For example, there are 3 bars: one for John, one for Pete, and one for Joan, and each of these bars is divided into two parts (values based on yes/no response)
Thanks!
I don't understand what your desired outcome is; do either of these suit?
library(tidyverse)
df <- read.table(text = "Person Answer Value
John Yes 3
Pete No 6
Joan Yes 5
John No 6
Pete No 1", header = TRUE)
df2 <- df %>%
group_by(Person) %>%
mutate(proportion = Value / sum(Value))
df2
#> # A tibble: 5 x 4
#> # Groups: Person [3]
#> Person Answer Value proportion
#> <chr> <chr> <int> <dbl>
#> 1 John Yes 3 0.333
#> 2 Pete No 6 0.857
#> 3 Joan Yes 5 1
#> 4 John No 6 0.667
#> 5 Pete No 1 0.143
ggplot(df2, aes(x = Person, y = Value, fill = Answer)) +
geom_col(color = "black", position = "stack") +
geom_text(aes(label = Answer),
position = position_stack(vjust = 0.5))
ggplot(df2, aes(x = Person, y = proportion, fill = Answer)) +
geom_col(color = "black", position = "stack") +
geom_text(aes(label = round(proportion, 2)),
position = position_stack(vjust = 0.5))
Created on 2021-08-12 by the reprex package (v2.0.0)

How to correctly slice after arrange? (R)

I'm not able to slice according to the code specified. See a reproducible example below:
library(alr4)
library(tidyverse)
modelUN <- lm(fertility ~ ppgdp, data = UN11)
I want to label the two highest and lowest residuals.
library(broom)
UN11 <- UN11 %>% mutate(Residuals = augment(modelUN) %>% pull(.resid))
UN11 %>% arrange(Residuals) %>% slice_head(n = 2)
This does not give me the lowest residuals. I tried saving the dataset (thinking that it's fetching from the original df) but the result is the same. How should I go ahead?
slice_head and slice_tail return the first and last rows, respectively, based on the n given. To get both ends at once, we can use slice with an index: 1:2 for the head and (n()-1):n() for the tail.
library(dplyr)
UN11 %>%
dplyr::arrange(Residuals) %>%
dplyr::slice(c(1:2, (n()-1):n()))
Or make use of row_number with head/tail
UN11 %>%
dplyr::arrange(Residuals) %>%
dplyr::slice(c(head(row_number(), 2), tail(row_number(), 2)))
# region group fertility ppgdp lifeExpF pctUrban Residuals
#1 Europe other 1.134 4477.7 78.40 49 -1.900575
#2 Europe other 1.450 1625.8 73.48 48 -1.675868
#3 Africa africa 6.300 1237.8 50.04 36 3.161712
#4 Africa africa 6.925 357.7 55.77 17 3.758539
and using head
UN11 %>%
arrange(Residuals) %>%
head(2)
# region group fertility ppgdp lifeExpF pctUrban Residuals
#1 Europe other 1.134 4477.7 78.40 49 -1.900575
#2 Europe other 1.450 1625.8 73.48 48 -1.675868
Or another option is slice_min/slice_max and bind them together with bind_rows (but it is less efficient and less direct than the index option in slice)
UN11 %>%
slice_min(Residuals, n = 2) %>%
bind_rows(UN11 %>%
slice_max(Residuals, n = 2))

R dplyr, using mutate with na.omit causes error incompatible size (%d)

I'm doing data cleaning. I use mutate in dplyr a lot, since it generates new columns step by step and I can easily see how it goes.
Here are two examples where I have this error
Error: incompatible size (%d), expecting %d (the group size) or 1
Example 1: Get town name from zipcode. Data is simply like this:
Zip
1 02345
2 02201
And I notice when the data has NA in it, it doesn't work.
Without NA it works:
library(dplyr)
library(zipcode)
data(zipcode)
test = data.frame(Zip=c('02345','02201'),stringsAsFactors=FALSE)
test %>%
rowwise() %>%
mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )
resulting in
Source: local data frame [2 x 2]
Groups: <by row>
Zip Town1
1 02345 Manomet
2 02201 Boston
With NA it doesn't work:
library(dplyr)
library(zipcode)
data(zipcode)
test = data.frame(Zip=c('02345','02201',NA),stringsAsFactors=FALSE)
test %>%
rowwise() %>%
mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )
resulting in
Error: incompatible size (%d), expecting %d (the group size) or 1
Example 2: I want to get rid of the redundant state name that occurs in the Town column in the following data.
Town State
1 BOSTON MA MA
2 NORTH AMAMS MA
3 CHICAGO IL IL
This is how I do it:
(1) split the string in Town into words, e.g. 'BOSTON' and 'MA' for row 1.
(2) see if any of these words match the State of that line
(3) delete the matched words
library(dplyr)
test = data.frame(Town=c('BOSTON MA','NORTH AMAMS','CHICAGO IL'), State=c('MA','MA','IL'), stringsAsFactors=FALSE)
test %>%
mutate(Town.word = strsplit(Town, split=' ')) %>%
rowwise() %>% # rowwise ensures every calculation only considers the current row
mutate(is.state = match(State,Town.word ) ) %>%
mutate(Town1 = Town.word[-is.state])
This results in:
Town State Town.word is.state Town1
1 BOSTON MA MA <chr[2]> 2 BOSTON
2 NORTH AMAMS MA <chr[2]> NA NA
3 CHICAGO IL IL <chr[2]> 2 CHICAGO
Meaning: e.g., row 1 shows is.state==2, meaning the 2nd word in Town is the state name. After getting rid of that word, Town1 is the correct town name.
Now I want to fix the NA in row 2, but adding na.omit causes an error:
test %>%
mutate(Town.word = strsplit(Town, split=' ')) %>%
rowwise() %>% # rowwise ensures every calculation only considers the current row
mutate(is.state = match(State,Town.word ) ) %>%
mutate(Town1 = Town.word[-na.omit(is.state)])
results in:
Error: incompatible size (%d), expecting %d (the group size) or 1
I checked the data type and size:
test %>%
mutate(Town.word = strsplit(Town, split=' ')) %>%
rowwise() %>% # rowwise ensures every calculation only considers the current row
mutate(is.state = match(State,Town.word ) ) %>%
mutate(length(is.state) ) %>%
mutate(class(na.omit(is.state)))
results in:
Town State Town.word is.state length(is.state) class(na.omit(is.state))
1 BOSTON MA MA <chr[2]> 2 1 integer
2 NORTH AMAMS MA <chr[2]> NA 1 integer
3 CHICAGO IL IL <chr[2]> 2 1 integer
So it is %d of length==1. Can somebody tell me what's wrong? Thanks
Can you just sub it out?
test %>%
rowwise() %>%
mutate(Town=sub(sprintf('[, ]*%s$', State), '', Town))
## Source: local data frame [3 x 2]
## Groups: <by row>
##
## Town State
## 1 BOSTON MA
## 2 NORTH AMAMS MA
## 3 CHICAGO IL
(This way also catches commas after the town, if that happens.)
NB: if you use ungroup() here with a rowwise_df (as this is), it will wipe the tbl_df class as well and output a straight data.frame, which is fine for your data but will clobber your screen if you aren't careful and are looking at large amounts of data (as I've done countless times). (Github references #936 and #553.)
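As for why na.omit() triggers the size error, a likely explanation (a base R sketch, not a trace through dplyr internals) is that na.omit() on a lone NA returns a zero-length vector, so the expression inside mutate() yields zero values for that row instead of one. The helper `drop_state()` below is a hypothetical alternative that only removes a word when a match was actually found.

```r
words    <- c("NORTH", "AMAMS")
is_state <- match("MA", words)   # NA: no word equals the state
idx      <- na.omit(is_state)    # the NA is dropped, leaving nothing

length(idx)
#> [1] 0
length(words[-idx])              # indexing with a zero-length index returns nothing
#> [1] 0

# Hypothetical helper: drop the state word only if it was found
drop_state <- function(town, state) {
  parts <- strsplit(town, " ", fixed = TRUE)[[1]]
  hit   <- match(state, parts)
  if (!is.na(hit)) parts <- parts[-hit]
  paste(parts, collapse = " ")
}

towns  <- c("BOSTON MA", "NORTH AMAMS", "CHICAGO IL")
states <- c("MA", "MA", "IL")
cleaned <- mapply(drop_state, towns, states, USE.NAMES = FALSE)
cleaned
#> [1] "BOSTON"      "NORTH AMAMS" "CHICAGO"
```

The same thing would explain Example 1: comparing against na.omit(NA) gives a zero-length logical vector, so the lookup in zipcode returns no rows for that Zip.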
