Summarizing a specific column with dplyr - r

For my assignment I need to create an object which contains, for each combination of Sex and Season, the number of different sports in the olympics data set. The columns of this object should be called Competitor_Sex, Olympic_Season, and Num_Sports, respectively.
This is what I have at the moment:
object <- olympics %>%
  group_by(Sex, Season) %>%
  summarise(Num_Sports = ???)
I'm having trouble with defining the third column, which is the number of sports. My data looks like this:
structure(list(Name = c("A Lamusi", "Juhamatti Tapio Aaltonen",
"Andreea Aanei", "Jamale (Djamel-) Aarrass (Ahrass-)", "Nstor Abad Sanjun"
), Sex = c("M", "M", "F", "M", "M"), Age = c(23L, 28L, 22L, 30L,
23L), Height = c(170L, 184L, 170L, 187L, 167L), Weight = c(60,
85, 125, 76, 64), Team = c("China", "Finland", "Romania", "France",
"Spain"), NOC = c("CHN", "FIN", "ROU", "FRA", "ESP"), Games = c("2012 Summer",
"2014 Winter", "2016 Summer", "2012 Summer", "2016 Summer"),
Year = c(2012L, 2014L, 2016L, 2012L, 2016L), Season = c("Summer",
"Winter", "Summer", "Summer", "Summer"), City = c("London",
"Sochi", "Rio de Janeiro", "London", "Rio de Janeiro"), Sport = c("Judo",
"Ice Hockey", "Weightlifting", "Athletics", "Gymnastics"),
Event = c("Judo Men's Extra-Lightweight", "Ice Hockey Men's Ice Hockey",
"Weightlifting Women's Super-Heavyweight", "Athletics Men's 1,500 metres",
"Gymnastics Men's Individual All-Around"), Medal = c(NA,
"Bronze", NA, NA, NA)), row.names = c("1", "2", "3", "4",
"5"), class = "data.frame")
This is probably easy to solve. Could someone help me? It would be much appreciated!

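Grouping twice should work: first collapse to one row per Sex/Season/Sport, then count those rows per Sex/Season:
olympics %>%
  group_by(Sex, Season, Sport) %>%
  summarise(n = n()) %>%        # one row per Sex/Season/Sport combination
  group_by(Sex, Season) %>%
  summarise(Num_Sports = n())   # rows left per Sex/Season = number of distinct sports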

You can use n_distinct(), the dplyr equivalent of length(unique(x)):
olympics %>%
  group_by(Sex, Season) %>%
  summarise(Num_Sports = n_distinct(Sport)) %>%
  rename(Competitor_Sex = Sex, Olympic_Season = Season) # rename the grouping columns
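As a cross-check on the n_distinct approach, the same count can be computed in base R with length(unique()). This is a minimal sketch, assuming a small data frame called olympics_small (a hypothetical name) that holds only the Sex, Season and Sport values of the five sample rows from the question:

```r
# Hypothetical subset: only the three columns needed, copied from the question's sample rows
olympics_small <- data.frame(
  Sex    = c("M", "M", "F", "M", "M"),
  Season = c("Summer", "Winter", "Summer", "Summer", "Summer"),
  Sport  = c("Judo", "Ice Hockey", "Weightlifting", "Athletics", "Gymnastics"),
  stringsAsFactors = FALSE
)

# Count the distinct sports within each Sex/Season combination
num_sports <- aggregate(
  Sport ~ Sex + Season,
  data = olympics_small,
  FUN  = function(x) length(unique(x))
)

# Apply the column names the assignment asks for
names(num_sports) <- c("Competitor_Sex", "Olympic_Season", "Num_Sports")
num_sports
```

On the full data set the numbers will of course differ; the sample rows only show the shape of the result.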

Related

How to make stacked bar chart using ggplot2? [duplicate]

So I am having trouble making a stacked bar chart showing proportion of cases vs deaths.
This is the data:
df <- structure(list(Date = structure(c(19108, 19108, 19108, 19108,
19108, 19108, 19108, 19108, 19108, 19108), class = "Date"), Country = c("US",
"India", "Brazil", "France", "Germany", "United Kingdom", "Russia",
"Korea, South", "Italy", "Turkey"), Confirmed = c(81100599L,
43065496L, 30378061L, 28605614L, 24337394L, 22168390L, 17887152L,
17086626L, 16191323L, 15023662L), Recovered = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L), Deaths = c(991940L, 523654L, 663108L,
146464L, 134489L, 174778L, 367692L, 22466L, 162927L, 98720L),
Active = c(80108659L, 42541842L, 29714953L, 28459150L, 24202905L,
21993612L, 17519460L, 17064160L, 16028396L, 14924942L)), row.names = c(163539L,
163431L, 163375L, 163414L, 163418L, 163537L, 163496L, 163444L,
163437L, 163533L), class = "data.frame")
I want to generate something that looks like this, except showing the proportion of deaths vs. cases.
This is a modification of @Allan Cameron's answer that adds percent labels and takes a few different approaches:
library(tidyverse)
library(scales)
df %>%
  rename_with(., ~str_replace_all(., 'top10.', '')) %>%
  pivot_longer(
    cols = -Country,
    names_to = "Status",
    values_to = "value",
    values_transform = list(value = as.integer)
  ) %>%
  mutate(Status = fct_rev(fct_infreq(Status))) %>%
  group_by(Country) %>%
  mutate(pct = prop.table(value) * 100) %>%
  ggplot(aes(x = Country, y = pct, fill = Status)) +
  geom_col(position = position_fill()) +
  scale_fill_manual(values = c("#ff34b3", "#4976ff")) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percentage") +
  geom_text(aes(label = paste0(sprintf("%1.1f", pct), "%")),
            position = position_fill(vjust = 0.1)) +
  ggtitle("Your Title")
I had to use OCR to convert the image of your data into actual data I could use. It's far better to include your data as text for this reason.
The plot is not particularly informative because the percentages are low, and difficult to read, but in any case, you can do it like this:
library(tidyverse)
p <- df %>%
  mutate(top10.Confirmed = top10.Confirmed - top10.Deaths,
         top10.Country = factor(top10.Country, top10.Country)) %>%
  rename(Country = top10.Country,
         Survived = top10.Confirmed,
         Died = top10.Deaths) %>%
  pivot_longer(-Country, names_to = "Outcome", values_to = "Count") %>%
  mutate(Outcome = factor(Outcome, c("Survived", "Died"))) %>%
  ggplot(aes(Country, Count, fill = Outcome)) +
  geom_col(position = "fill") +
  scale_fill_manual(values = c("#4976ff", "#ff34b3")) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Covid outcomes by country", y = "Percent")
p
To make it easier to read, you could zoom into the bottom:
p + coord_cartesian(ylim = c(0, 0.05))
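To see why a 0 to 0.05 window is enough, it can help to compute the "Died" share directly. The sketch below uses three rows copied from the question's data; the names covid and died_share are only illustrative:

```r
# Three rows copied from the question's data (US, India, Brazil)
covid <- data.frame(
  Country   = c("US", "India", "Brazil"),
  Confirmed = c(81100599, 43065496, 30378061),
  Deaths    = c(991940, 523654, 663108)
)

# Share of confirmed cases that ended in death; with Survived = Confirmed - Deaths,
# this is the height of the "Died" segment in a position = "fill" bar
covid$died_share <- covid$Deaths / covid$Confirmed

# Check that every share fits inside the zoomed window used above
all(covid$died_share < 0.05)

# Shares in percent, rounded for reading
round(100 * covid$died_share, 2)
```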
Data in reproducible format
df <- structure(list(top10.Country = c("US", "India", "Brazil", "France",
"Germany", "United Kingdom", "Russia", "Korea, South", "Italy",
"Turkey"), top10.Confirmed = c(81100599L, 43065496L, 30378061L,
28605614L, 24337394L, 22168390L, 17887152L, 17086626L, 16191323L,
15023662L), top10.Deaths = c(991940L, 523654L, 663108L, 146464L,
134489L, 174778L, 367692L, 22466L, 162927L, 98720L)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
df
#>     top10.Country top10.Confirmed top10.Deaths
#> 1              US        81100599       991940
#> 2           India        43065496       523654
#> 3          Brazil        30378061       663108
#> 4          France        28605614       146464
#> 5         Germany        24337394       134489
#> 6  United Kingdom        22168390       174778
#> 7          Russia        17887152       367692
#> 8    Korea, South        17086626        22466
#> 9           Italy        16191323       162927
#> 10         Turkey        15023662        98720
Created on 2022-05-01 by the reprex package (v2.0.1)

Add_annotations plotly first and last datapoints

My Main Goal:
Trying to add annotations to both the first datapoint of my
scatterplot and the last datapoint of my scatterplot (the entries for
years 2006 and 2021 respectively).
My Secondary Goals:
If possible, it would also be helpful to find out how to select out
specific datapoints to add annotations, as I only know the
which.max/which.min functions so far.
It would also be nice to know how to list the jobs on each point.
My Dput:
structure(list(Year = 2006:2021, Month_USD = c(1160L, 1240L,
1360L, 1480L, 1320L, 1320L, 375L, 1600L, 2000L, 2000L, 1600L,
2240L, 1900L, 2300L, 2900L, 2300L), Degree = c("High School",
"High School", "High School", "High School", "High School", "High School",
"High School", "High School", "High School", "BA", "BA", "BA",
"BA", "BA", "M.Ed", "M.Ed"), Country = c("USA", "USA", "USA",
"USA", "USA", "USA", "DE", "USA", "USA", "USA", "USA", "USA",
"PRC", "PRC", "PRC", "HK"), Job = c("Disher", "Prep", "Prep",
"Prep", "Prep", "Prep", "Au Pair", "CSA", "Valet", "Valet", "Intake",
"CM", "Teacher", "Teacher", "Teacher", "Student"), Median_Household_Income_US = c(4833L,
4961L, 4784L, 4750L, 4626L, 4556L, 4547L, 4706L, 4634L, 4873L,
5025L, 5218L, 5360L, 5725L, NA, NA), US_Home_Price_Index = c(183.24,
173.36, 152.56, 146.69, 140.64, 135.16, 143.88, 159.3, 166.5,
175.17, 184.51, 195.99, 204.9, 212.59, 236.31, NA)), class = "data.frame", row.names = c(NA,
-16L))
Current Scatterplot:
pal <- c("Red", "Blue", "Green")
plot_ly(data = Earnings_Year,
x=~Year,
y=~Month_USD,
type='scatter',
mode='markers',
symbol = ~as.factor(Degree),
symbols=c("star-open-dot","hexagon-open-dot","diamond-open-dot"),
color = ~as.factor(Degree),
colors = pal,
hoverinfo="text",
text= paste("Year: ",
Earnings_Year$Year,
"<br>", #this is a line break
"Monthly USD: ",
Earnings_Year$Month_USD),
size=10) %>%
add_annotations(
x=Earnings_Year$Year[which.min(Earnings_Year$Month_USD)],
y=Earnings_Year$Month_USD[which.min(Earnings_Year$Month_USD)],
text = "Au Pair Job in Germany") %>%
add_annotations(
x=Earnings_Year$Year[which.max(Earnings_Year$Month_USD)],
y=Earnings_Year$Month_USD[which.max(Earnings_Year$Month_USD)],
text = "Last Teaching Job in China") %>%
layout(legend= list(x=1,y=0.5),
title="Earnings by Degree",
xaxis=list(title="Year"),
yaxis=list(title="Monthly USD"))
Image of Current Scatter:
Scatter That I Want:
Figured it out. Just needed to pipe additional add_annotations as well as just select specific values for x and y:
pal <- c("Red", "Blue", "Green")
plot_ly(data = Earnings_Year,
x=~Year,
y=~Month_USD,
type='scatter',
mode='markers',
symbol = ~as.factor(Degree),
symbols=c("star-open-dot","hexagon-open-dot","diamond-open-dot"),
color = ~as.factor(Degree),
colors = pal,
hoverinfo="text",
text= paste("Year: ",
Earnings_Year$Year,
"<br>", #this is a line break
"Monthly USD: ",
Earnings_Year$Month_USD),
size=10) %>%
add_annotations(
x=Earnings_Year$Year[which.min(Earnings_Year$Month_USD)],
y=Earnings_Year$Month_USD[which.min(Earnings_Year$Month_USD)],
text = "Au Pair Job in Germany") %>%
add_annotations(
x=Earnings_Year$Year[which.max(Earnings_Year$Month_USD)],
y=Earnings_Year$Month_USD[which.max(Earnings_Year$Month_USD)],
text = "Last Teaching Job in China") %>%
add_annotations(
x=Earnings_Year$Year[Earnings_Year$Year==2006],
y=Earnings_Year$Month_USD[Earnings_Year$Year==2006], #index y by Year too: 2000 and 2300 each occur in two rows
text="First Job"
) %>%
add_annotations(
x=Earnings_Year$Year[Earnings_Year$Year==2021],
y=Earnings_Year$Month_USD[Earnings_Year$Year==2021],
text="Began Ph.D.") %>%
add_annotations(
x=Earnings_Year$Year[Earnings_Year$Year==2008],
y=Earnings_Year$Month_USD[Earnings_Year$Year==2008],
text="Finished H.S.") %>%
add_annotations(
x=Earnings_Year$Year[Earnings_Year$Year==2015],
y=Earnings_Year$Month_USD[Earnings_Year$Year==2015],
text="Finished BA") %>%
layout(legend= list(x=1,y=0.5),
title="Earnings by Degree",
xaxis=list(title="Year"),
yaxis=list(title="Monthly USD"))
Finished Product:
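For the secondary goals (picking specific points and labelling them with the job), one possible approach is to select the rows first and then pass whole vectors to add_annotations. This is only a sketch: the names first_last, picked and ann are made up for illustration, the data frame holds just three columns copied from the dput above, and the plot is deliberately simpler than the one in the question.

```r
# Only the columns needed here, copied from the question's dput
Earnings_Year <- data.frame(
  Year      = 2006:2021,
  Month_USD = c(1160, 1240, 1360, 1480, 1320, 1320, 375, 1600, 2000, 2000,
                1600, 2240, 1900, 2300, 2900, 2300),
  Job       = c("Disher", "Prep", "Prep", "Prep", "Prep", "Prep", "Au Pair", "CSA",
                "Valet", "Valet", "Intake", "CM", "Teacher", "Teacher", "Teacher", "Student"),
  stringsAsFactors = FALSE
)

# Row positions of the earliest and latest years
first_last <- c(which.min(Earnings_Year$Year), which.max(Earnings_Year$Year))

# Any logical condition or %in% test can pick other rows, e.g. specific years
picked <- which(Earnings_Year$Year %in% c(2008, 2015))

# One row per annotation, so x, y and text always come from the same row
ann <- Earnings_Year[c(first_last, picked), ]

# Draw only if plotly is installed; x, y and text are passed as vectors
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(Earnings_Year, x = ~Year, y = ~Month_USD,
                       type = "scatter", mode = "markers")
  p <- plotly::add_annotations(p, x = ann$Year, y = ann$Month_USD, text = ann$Job)
}
```

Using ann$Job as the text lists the job on each annotated point; passing the full Earnings_Year columns instead would label every point.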

Had a couple problems with One Way ANOVA in R

My dput is this:
structure(list(Year = 2006:2021, Month_USD = c(1160L, 1240L, 1360L, 1480L, 1320L, 1320L, 375L, 1600L, 2000L, 2000L, 1600L, 2240L, 1900L, 2300L, 2900L, 2300L), Degree = c("High School", "High School", "High School", "High School", "High School", "High School", "High School", "High School", "High School", "BA", "BA", "BA", "BA", "BA", "M.Ed", "M.Ed"), Country = c("USA", "USA", "USA", "USA", "USA", "USA", "DE", "USA", "USA", "USA", "USA", "USA", "PRC", "PRC", "PRC", "HK"), Job = c("Disher", "Prep", "Prep", "Prep", "Prep", "Prep", "Au Pair", "CSA", "Valet", "Valet", "Intake", "CM", "Teacher", "Teacher", "Teacher", "Student"), Median_Household_Income_US = c(4833L, 4961L, 4784L, 4750L, 4626L, 4556L, 4547L, 4706L, 4634L, 4873L, 5025L, 5218L, 5360L, 5725L, NA, NA), US_Home_Price_Index = c(183.24, 173.36, 152.56, 146.69, 140.64, 135.16, 143.88, 159.3, 166.5, 175.17, 184.51, 195.99, 204.9, 212.59, 236.31, NA)), class = "data.frame", row.names = c(NA, -16L))
So I ran a one-way ANOVA on this data and had a couple problems. First, when I ran the level function here:
data(Earnings_Year)
View(Earnings_Year)
set.seed(1234)
Earnings_Year %>%
sample_n_by(Degree,
size=1)
levels(Earnings_Year$Degree)
For whatever reason, the code above won't show the levels and just returns NULL. As far as I know, the levels should be "BA", "High School", and "M.Ed".
Another issue came up later. A plain Shapiro test ran without this problem, but the grouped version did not:
Earnings_Year %>%
group_by(Degree) %>%
shapiro_test(Month_USD)
When I run it, it comes up with the following problem:
Error: Problem with `mutate()` column `data`.
i `data = map(.data$data, .f, ...)`.
x Problem with `mutate()` column `data`.
i `data = map(.data$data, .f, ...)`.
x sample size must be between 3 and 5000
Run `rlang::last_error()` to see where the error occurred.
Any insight on what went wrong would be appreciated. Overall, I ended up with a nice ANOVA boxplot at the end that seemed to indicate what I was looking for:
As the error message suggests, some groups in your data have fewer than 3 rows or more than 5000 rows.
We can check the number of rows in each group using count().
library(dplyr)
library(rstatix)
df %>% count(Degree)
#       Degree n
#1          BA 5
#2 High School 9
#3        M.Ed 2
You can remove such groups and the code should work fine.
df %>%
  group_by(Degree) %>%
  filter(n() > 2) %>%
  shapiro_test(Month_USD)
#  Degree      variable  statistic     p
#  <chr>       <chr>         <dbl> <dbl>
#1 BA          Month_USD     0.944 0.695
#2 High School Month_USD     0.887 0.185
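On the first problem (levels() returning NULL): in the dput shown in the question, Degree is a character vector, and levels() returns NULL for anything that has no levels attribute. A short sketch of the difference, using a stand-alone vector named degree (an illustrative name) with the same values:

```r
# Degree values as they appear in the question's dput (a character vector)
degree <- c(rep("High School", 9), rep("BA", 5), rep("M.Ed", 2))

# Character vectors carry no levels attribute, so this returns NULL
levels(degree)

# Converting to a factor creates the levels
degree_f <- factor(degree)
levels(degree_f)
```

In the data frame itself, the same conversion would be Earnings_Year$Degree <- factor(Earnings_Year$Degree), after which levels(Earnings_Year$Degree) should list the three degrees.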

Apply a function to particular rows

I want to apply the following code to only the first 3 rows (if it's applied to the second 3, it fails to parse):
netflix_and_disney$release_year <- year(dmy(netflix_and_disney$release_year))
Is there a way of doing this with this df?
structure(list(show_id = c("00147800", "07019028", "00115433", "70234439", "80058654", "80125979"), title = c("10 Things I Hate About You", "101 Dalmatian Street", "101 Dalmatians", "Transformers Prime", "Transformers: Robots in Disguise", "#realityhigh"), type = c("Movie", "Tv Show", "Movie", "Tv Show", "Tv Show", "Movie"), rating = c("PG-13", "N/A", "G", "TV-Y7-FV", "TV-Y7", "TV-14"), release_year = c("31 Mar 1999", "25 Mar 2019", "27 Nov 1996", "2013", "2016", "2017"), date_added = structure(c(18212, 18320, 18212, 17782, 17782, 17417), class = "Date"), duration = c("97 min", "N/A", "103 min", "1 Season", "1 Season", "99 min"), genre = c("Comedy, Drama, Romance", "Animation, Comedy, Family", "Adventure, Comedy, Crime, Family", "Kids' TV", "Kids' TV", "Comedies"), director = c("Gil Junger", "N/A", "Stephen Herek", NA, NA, "Fernando Lebrija"), country = c("USA", "UK, USA, Canada", "USA, UK", "United States", "United States", "United States"), imdb_rating = c("7.3", "6.2", "5.7", NA, NA, NA), platform = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Disney", "Netflix"), class = "factor")), row.names = c(1L, 2L, 3L, 995L, 996L, 997L), class = "data.frame")
I have tried applying it to a subset of the df, as well as using the which() function, but neither worked.
It really all depends on your data and what function you want to apply. But, in principle, you can do this by subsetting your dataframe:
Data:
set.seed(123)
df <- data.frame(
v1 = rnorm(20),
v2 = runif(20),
v3 = sample(20)
)
Here we apply the function mean to the first ten rows of df:
apply(df[1:10,], 1, mean)
1 2 3 4 5 6 7 8 9 10
4.5274415 0.7281229 2.9908109 1.8131179 6.7605775 6.6179570 4.2313168 3.4003004 3.1930399 5.8040552
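Applied to your data, that means assigning to just the first three rows (note the assignment arrow <-, not the comparison <):
netflix_and_disney$release_year[1:3] <- year(dmy(netflix_and_disney$release_year[1:3]))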
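Since only the year is needed, another option is to skip date parsing and pull out the trailing four digits with a regular expression, which handles both formats in one pass. This is a different technique from the subsetting shown above, and it assumes every value ends in a four-digit year, as in the sample data:

```r
# release_year values copied from the question's sample data
release_year <- c("31 Mar 1999", "25 Mar 2019", "27 Nov 1996", "2013", "2016", "2017")

# Keep only the trailing four digits, then convert to integer
year_only <- as.integer(sub("^.*([0-9]{4})$", "\\1", release_year))
year_only
```

Values that do not end in four digits would come back as NA (with a coercion warning), so it is worth checking for NAs on the full data.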

Using mutate and a lookup/calc function

I wrote a function that takes a company name, looks up a set of records in a second table, calculates a complicated result, and returns it.
I want to process all companies and add that result to each record as a new value.
I am using the following code:
`aa <- mutate(companies,newcol=sum_rounds(companies$company_name))`
But I get the following warning:
Warning message:
In c("Bwom", "Symple", "TravelTriangle", "Ark Biosciences", "Artizan Biosciences", :
longer object length is not a multiple of shorter object length
(each of these is a company name)
The company dataframe gets a new column, but all values are "false" where actually there should be both true and false.
Any advice would be welcome to a newbie.
Function follows:
sum_rounds <- function(co_name) {
  # get records from rounds for the company name passed to the function
  # remove NAs from column roundtype too
  outval <- rounds %>%
    filter(company_name.x == co_name & !is.na(roundtype)) %>%
    # sort by date round is announced
    arrange(announced_on) %>%
    select(roundtype) %>%
    # create a string of all round types in order
    apply(2, paste, collapse = "")
  # the values from mixed to "M", venture to "V" and pureangel to "A"
  # now see if it is of the form aaaaa (and #) followed by m or v
  # in grep: ^ is start of a line and + is for at least one copy
  # [mv] is either m or v
  # nice summary is here: http://www.endmemo.com/program/R/gsub.php
  # is angel2vc?
  angel2vc <- grepl("^a+[mv]+", outval)
  #return(list("roundcodes"=outval,"angel2vc"=angel2vc))
  return(angel2vc)
}
DPUT from Companies table Follows:
structure(list(company_name = c("Bwom", "Symple", "TravelTriangle",
"Ark Biosciences", "Artizan Biosciences", "Audiense"), domain = c("b-wom.com",
"getsymple.com", "traveltriangle.com", "arkbiosciences.com",
NA, "audiense.com"), country_code = c("ESP", "USA", "USA", "CHN",
"USA", "GBR"), state_code = c(NA, "CA", "VA", NA, "NC", NA),
region = c("Barcelona", "SF Bay Area", "Washington, D.C.",
"Shanghai", "Raleigh", "London"), city = c("Barcelona", "San Francisco",
"Charlottesville", "Shanghai", "Durham", "London"), status = c("operating",
"operating", "operating", "operating", "operating", "operating"
), short_description = c("Bwom is a tool that offers a test and personalized exercises for women's intimate health.",
"Symple is the cloud platform for all your business payments. Pay, get paid, connect.",
"TravelTriangle enables travel enthusiasts to reserve a personalized holiday plan with a local travel agent.",
"Ark Biosciences is a biopharmaceutical company that is dedicated to the discovery and development",
"Artizan Biosciences", "SaaS developer delivering unique consumer insight and engagement capabilities to many of the world’s biggest brands and agencies."
), category_list = c("health care", "cloud computing|machine learning|mobile apps|mobile payments|retail technology",
"e-commerce|personalization|tourism|travel", "health care",
"biopharma", "analytics|apps|marketing|market research|social crm|social media|social media marketing"
), category_group_list = c("health care", "apps|commerce and shopping|data and analytics|financial services|hardware|internet services|mobile|payments|software",
"commerce and shopping|travel and tourism", "health care",
"biotechnology|health care|science and engineering", "apps|data and analytics|design|information technology|internet services|media and entertainment|sales and marketing|software"
), employee_count = c("1 to 10", "11 to 50", "101 to 250",
NA, "1 to 10", "51 to 100"), funding_rounds = c(2L, 1L, 4L,
2L, 2L, 5L), funding_total_usd = c(1075791, 120000, 19900000,
NA, 3e+06, 8013391), founded_on = structure(c(16555, 16770,
15156, 16071, NA, 14975), class = "Date"), first_funding_on = structure(c(16526,
17204, 15492, 16532, 17091, 15294), class = "Date"), last_funding_on = structure(c(17204,
17204, 17204, 17203, 17203, 17203), class = "Date"), closed_on = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), email = c("hello#b-wom.com", "info#getsymple.com",
"admin#traveltriangle.com", "info#arkbiosciences.com", NA,
"moreinfo#audiense.com"), phone = c(NA, NA, "'+91 98 99 120408",
"###############################################################################################################################################################################################################################################################",
NA, "###############################################################################################################################################################################################################################################################"
), cb_url = c("https://www.crunchbase.com/organization/bwom",
"https://www.crunchbase.com/organization/symple-2", "https://www.crunchbase.com/organization/traveltriangle-com",
"https://www.crunchbase.com/organization/ark-biosciences",
"https://www.crunchbase.com/organization/artizan-biosciences",
"https://www.crunchbase.com/organization/socialbro"), twitter_url = c("https://www.twitter.com/hellobwom",
NA, "https://www.twitter.com/traveltriangle", NA, NA, "https://www.twitter.com/socialbro"
), facebook_url = c("https://www.facebook.com/hellobwom/?fref=ts",
NA, "http://www.facebook.com/traveltriangle", NA, NA, "http://www.facebook.com/socialbro"
), uuid = c("e6096d58-3454-d982-0dbe-7de9b06cd493", "fd0ab78f-0dc4-1f18-21d1-7ce9ff7a173b",
"742043c1-c17a-4526-4ed0-e911e6e9555b", "8e27eb22-ce03-a2af-58ba-53f0f458f49c",
"ed07ac9e-1071-fca0-46d9-42035c2da505", "fed333e5-2754-7413-1e3d-5939d70541d2"
), isbio = c("other", "other", "other", "other", "bio", "other"
), co_type = c("m", "m", "m", "v", "v", "m")), .Names = c("company_name",
"domain", "country_code", "state_code", "region", "city", "status",
"short_description", "category_list", "category_group_list",
"employee_count", "funding_rounds", "funding_total_usd", "founded_on",
"first_funding_on", "last_funding_on", "closed_on", "email",
"phone", "cb_url", "twitter_url", "facebook_url", "uuid", "isbio",
"co_type"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
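A likely cause of the warning is that sum_rounds() is written for a single company name but receives the whole company_name column. The == comparison then recycles that vector against the rows of rounds (hence "longer object length is not a multiple of shorter object length"), and the function returns one logical value that mutate() repeats on every row. One way around this is to call the function once per name. The sketch below is base R only and runs on made-up data: the rounds table and the simplified sum_rounds_one() are hypothetical stand-ins, not the poster's actual table or function.

```r
# Hypothetical stand-in for the `rounds` table (not the poster's data)
rounds <- data.frame(
  company_name.x = c("Bwom", "Bwom", "Symple", "Symple"),
  roundtype      = c("a", "v", "m", NA),
  announced_on   = as.Date(c("2015-01-01", "2016-01-01", "2016-03-01", "2017-01-01")),
  stringsAsFactors = FALSE
)

# Simplified scalar version of the lookup: expects ONE company name
sum_rounds_one <- function(co_name) {
  r <- rounds[rounds$company_name.x == co_name & !is.na(rounds$roundtype), ]
  # round types in order of announcement, pasted into one string
  codes <- paste(r$roundtype[order(r$announced_on)], collapse = "")
  grepl("^a+[mv]+", codes)
}

companies <- data.frame(
  company_name = c("Bwom", "Symple", "Audiense"),
  stringsAsFactors = FALSE
)

# Call the function once per company name instead of once with the whole column
companies$angel2vc <- vapply(companies$company_name, sum_rounds_one,
                             logical(1), USE.NAMES = FALSE)
companies
```

With dplyr, the same idea would be mutate(companies, newcol = sapply(company_name, sum_rounds)), assuming sum_rounds() returns a single value per name.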
