Had a couple problems with One Way ANOVA in R - r

My dput is this:
structure(list(Year = 2006:2021, Month_USD = c(1160L, 1240L, 1360L, 1480L, 1320L, 1320L, 375L, 1600L, 2000L, 2000L, 1600L, 2240L, 1900L, 2300L, 2900L, 2300L), Degree = c("High School", "High School", "High School", "High School", "High School", "High School", "High School", "High School", "High School", "BA", "BA", "BA", "BA", "BA", "M.Ed", "M.Ed"), Country = c("USA", "USA", "USA", "USA", "USA", "USA", "DE", "USA", "USA", "USA", "USA", "USA", "PRC", "PRC", "PRC", "HK"), Job = c("Disher", "Prep", "Prep", "Prep", "Prep", "Prep", "Au Pair", "CSA", "Valet", "Valet", "Intake", "CM", "Teacher", "Teacher", "Teacher", "Student"), Median_Household_Income_US = c(4833L, 4961L, 4784L, 4750L, 4626L, 4556L, 4547L, 4706L, 4634L, 4873L, 5025L, 5218L, 5360L, 5725L, NA, NA), US_Home_Price_Index = c(183.24, 173.36, 152.56, 146.69, 140.64, 135.16, 143.88, 159.3, 166.5, 175.17, 184.51, 195.99, 204.9, 212.59, 236.31, NA)), class = "data.frame", row.names = c(NA, -16L))
I ran a one-way ANOVA on this data and hit a couple of problems. First, when I ran the levels() function here:
data(Earnings_Year)
View(Earnings_Year)
set.seed(1234)
Earnings_Year %>%
sample_n_by(Degree,
size=1)
levels(Earnings_Year$Degree)
For whatever reason the code above won't show the levels and just returns NULL. As far as I know, the levels should be "BA", "High School", and "M.Ed".
The second issue came up later. A plain Shapiro test ran without any problem, but the grouped version below did not:
Earnings_Year %>%
group_by(Degree) %>%
shapiro_test(Month_USD)
When I run it, it comes up with the following problem:
Error: Problem with `mutate()` column `data`.
i `data = map(.data$data, .f, ...)`.
x Problem with `mutate()` column `data`.
i `data = map(.data$data, .f, ...)`.
x sample size must be between 3 and 5000
Run `rlang::last_error()` to see where the error occurred.
Any insight on what went wrong would be appreciated. Overall, I ended up with a nice ANOVA boxplot at the end that seemed to indicate what I was looking for:

As the error message suggests, at least one group in your data has fewer than 3 rows (or more than 5000 rows).
We can check the number of rows in each group using count().
library(dplyr)
library(rstatix)
df %>% count(Degree)
#        Degree n
# 1          BA 5
# 2 High School 9
# 3        M.Ed 2
Here M.Ed has only 2 rows, which is below the minimum of 3. You can remove such groups and the code should work:
df %>%
group_by(Degree) %>%
filter(n() > 2) %>%
shapiro_test(Month_USD)
#   Degree      variable  statistic     p
#   <chr>       <chr>         <dbl> <dbl>
# 1 BA          Month_USD     0.944 0.695
# 2 High School Month_USD     0.887 0.185
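The first problem in the question, levels() returning NULL, is most likely because Degree is stored as a character column (the dput shows plain strings), and character vectors carry no levels. A minimal sketch, assuming a cut-down copy of the data with only two columns:

```r
# Cut-down copy of the question's data (assumption: only the columns needed here).
Earnings_Year <- data.frame(
  Month_USD = c(1160L, 1240L, 2000L, 1600L, 2900L, 2300L),
  Degree    = c("High School", "High School", "BA", "BA", "M.Ed", "M.Ed"),
  stringsAsFactors = FALSE
)

# A character vector has no levels attribute, so this is NULL.
before <- levels(Earnings_Year$Degree)

# unique() works on character columns if you only want the distinct values.
distinct_values <- unique(Earnings_Year$Degree)

# Converting to a factor creates the levels (sorted alphabetically by default).
Earnings_Year$Degree <- factor(Earnings_Year$Degree)
after <- levels(Earnings_Year$Degree)
```

Note also that the piped sample_n_by() call in the question is never assigned to anything, so it does not change Earnings_Year.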

Related

Add_annotations plotly first and last datapoints

My Main Goal:
Trying to add annotations to both the first datapoint of my
scatterplot and the last datapoint of my scatterplot (the entries for
years 2006 and 2021 respectively).
My Secondary Goals:
If possible, it would also be helpful to find out how to select out
specific datapoints to add annotations, as I only know the
which.max/which.min functions so far.
It would also be nice to know how to list the jobs on each point.
My Dput:
structure(list(Year = 2006:2021, Month_USD = c(1160L, 1240L,
1360L, 1480L, 1320L, 1320L, 375L, 1600L, 2000L, 2000L, 1600L,
2240L, 1900L, 2300L, 2900L, 2300L), Degree = c("High School",
"High School", "High School", "High School", "High School", "High School",
"High School", "High School", "High School", "BA", "BA", "BA",
"BA", "BA", "M.Ed", "M.Ed"), Country = c("USA", "USA", "USA",
"USA", "USA", "USA", "DE", "USA", "USA", "USA", "USA", "USA",
"PRC", "PRC", "PRC", "HK"), Job = c("Disher", "Prep", "Prep",
"Prep", "Prep", "Prep", "Au Pair", "CSA", "Valet", "Valet", "Intake",
"CM", "Teacher", "Teacher", "Teacher", "Student"), Median_Household_Income_US = c(4833L,
4961L, 4784L, 4750L, 4626L, 4556L, 4547L, 4706L, 4634L, 4873L,
5025L, 5218L, 5360L, 5725L, NA, NA), US_Home_Price_Index = c(183.24,
173.36, 152.56, 146.69, 140.64, 135.16, 143.88, 159.3, 166.5,
175.17, 184.51, 195.99, 204.9, 212.59, 236.31, NA)), class = "data.frame", row.names = c(NA,
-16L))
Current Scatterplot:
pal <- c("Red", "Blue", "Green")
plot_ly(data = Earnings_Year,
x=~Year,
y=~Month_USD,
type='scatter',
mode='markers',
symbol = ~as.factor(Degree),
symbols=c("star-open-dot","hexagon-open-dot","diamond-open-dot"),
color = ~as.factor(Degree),
colors = pal,
hoverinfo="text",
text= paste("Year: ",
Earnings_Year$Year,
"<br>", #this is a line break
"Monthly USD: ",
Earnings_Year$Month_USD),
size=10) %>%
add_annotations(
x=Earnings_Year$Year[which.min(Earnings_Year$Month_USD)],
y=Earnings_Year$Month_USD[which.min(Earnings_Year$Month_USD)],
text = "Au Pair Job in Germany") %>%
add_annotations(
x=Earnings_Year$Year[which.max(Earnings_Year$Month_USD)],
y=Earnings_Year$Month_USD[which.max(Earnings_Year$Month_USD)],
text = "Last Teaching Job in China") %>%
layout(legend= list(x=1,y=0.5),
title="Earnings by Degree",
xaxis=list(title="Year"),
yaxis=list(title="Monthly USD"))
Image of Current Scatter:
Scatter That I Want:
Figured it out. I just needed to pipe additional add_annotations() calls and select specific values for x and y:
pal <- c("Red", "Blue", "Green")
plot_ly(data = Earnings_Year,
x=~Year,
y=~Month_USD,
type='scatter',
mode='markers',
symbol = ~as.factor(Degree),
symbols=c("star-open-dot","hexagon-open-dot","diamond-open-dot"),
color = ~as.factor(Degree),
colors = pal,
hoverinfo="text",
text= paste("Year: ",
Earnings_Year$Year,
"<br>", #this is a line break
"Monthly USD: ",
Earnings_Year$Month_USD),
size=10) %>%
add_annotations(
x=Earnings_Year$Year[which.min(Earnings_Year$Month_USD)],
y=Earnings_Year$Month_USD[which.min(Earnings_Year$Month_USD)],
text = "Au Pair Job in Germany") %>%
add_annotations(
x=Earnings_Year$Year[which.max(Earnings_Year$Month_USD)],
y=Earnings_Year$Month_USD[which.max(Earnings_Year$Month_USD)],
text = "Last Teaching Job in China") %>%
add_annotations(
x=Earnings_Year$Year[Earnings_Year$Year==2006],
y=Earnings_Year$Month_USD[Earnings_Year$Year==2006], # index y by Year so x and y come from the same row
text="First Job"
) %>%
add_annotations(
x=Earnings_Year$Year[Earnings_Year$Year==2021],
y=Earnings_Year$Month_USD[Earnings_Year$Year==2021], # Month_USD==2300 would match both 2019 and 2021
text="Began Ph.D.") %>%
add_annotations(
x=Earnings_Year$Year[Earnings_Year$Year==2008],
y=Earnings_Year$Month_USD[Earnings_Year$Year==2008],
text="Finished H.S.") %>%
add_annotations(
x=Earnings_Year$Year[Earnings_Year$Year==2015],
y=Earnings_Year$Month_USD[Earnings_Year$Year==2015], # Month_USD==2000 would match both 2014 and 2015
text="Finished BA") %>%
layout(legend= list(x=1,y=0.5),
title="Earnings by Degree",
xaxis=list(title="Year"),
yaxis=list(title="Monthly USD"))
Finished Product:
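For the secondary goals (picking specific points and listing the job on each point), one possible approach is to select rows by position and pass whole vectors to add_annotations(). This is only a sketch: it uses a three-column copy of the data, the names idx and notes are made up for illustration, and the plotting part runs only if plotly is installed.

```r
# Three-column copy of the question's data (assumption: other columns are not needed here).
Earnings_Year <- data.frame(
  Year      = 2006:2021,
  Month_USD = c(1160L, 1240L, 1360L, 1480L, 1320L, 1320L, 375L, 1600L,
                2000L, 2000L, 1600L, 2240L, 1900L, 2300L, 2900L, 2300L),
  Job       = c("Disher", "Prep", "Prep", "Prep", "Prep", "Prep", "Au Pair",
                "CSA", "Valet", "Valet", "Intake", "CM", "Teacher", "Teacher",
                "Teacher", "Student"),
  stringsAsFactors = FALSE
)

# Select the first and last year by row position instead of matching on values.
idx   <- c(which.min(Earnings_Year$Year), which.max(Earnings_Year$Year))
notes <- Earnings_Year[idx, c("Year", "Month_USD", "Job")]

if (requireNamespace("plotly", quietly = TRUE)) {
  # mode = "markers+text" draws the Job label next to every point.
  p <- plotly::plot_ly(Earnings_Year, x = ~Year, y = ~Month_USD,
                       type = "scatter", mode = "markers+text",
                       text = ~Job, textposition = "top center")
  # One call annotates both selected rows, since x, y and text are vectors.
  p <- plotly::add_annotations(p, x = notes$Year, y = notes$Month_USD,
                               text = c("First Job", "Began Ph.D."))
}
```

Any other point can be selected the same way, for example with which(Earnings_Year$Year %in% c(2008, 2015)).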

How to add a column from a normal data frame to a spatial polygons data frame in R?

I am coding in R. I have a regular data frame named A, which has a column named Province. I also have a spatial polygons data frame named B. I want to add the Province column to the spatial polygons data frame B.
A consists of the number of COVID-19 confirmed cases in a country.
Province is a column of the name of provinces of that country.
B includes the shapefile of administrative boundaries in that country.
I have tried to solve my problem but could not. What I have coded is as follows:
B<-A$Province, but this is not what I want. I need the new B (after adding Province) as a new spatial polygons data frame.
My second try is this one:
B<-as.data.frame(B)
B<-cbind(B,Province=A$Province)
Like the first attempt, this code is not what I want, because I need the new B to be a spatial polygons data frame (not a regular data frame).
To show more about A and B, here are the results of dput(head(A)) and dput(head(B)):
dput(head(A)):
structure(list(Province = c("EC", "FS", "GT", "KZ", "LM", "MP"
), Cases = c(2748L, 208L, 2993L, 1882L, 132L, 103L), Population = c(11.56,
2.75, 12.27, 10.27, 5.4, 4.04)), row.names = c(NA, 6L), class = "data.frame")
dput(head(B)):
structure(list(ID_0 = c("211", "211", "211", "211", "211", "211"
), ISO = c("ZAF", "ZAF", "ZAF", "ZAF", "ZAF", "ZAF"), NAME_0 = c("South Africa",
"South Africa", "South Africa", "South Africa", "South Africa",
"South Africa"), ID_1 = c("1", "2", "3", "4", "5", "6"), NAME_1 = c("Eastern Cape",
"Free State", "Gauteng", "KwaZulu-Natal", "Limpopo", "Mpumalanga"
), TYPE_1 = c("Provinsie", "Provinsie", "Provinsie", "Provinsie",
"Provinsie", "Provinsie"), ENGTYPE_1 = c("Province", "Province",
"Province", "Province", "Province", "Province"), NL_NAME_1 = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), VARNAME_1 = c("Oos-Kaap", "Orange Free State|Vrystaat", "Pretoria/Witwatersrand/Vaal",
"Natal and Zululand", "Noordelike Provinsie|Northern Transvaal|Northern Province",
"Eastern Transvaal")), row.names = 0:5, class = "data.frame")
Indeed, I want to add the column of Province of the normal data frame (A) to the spatial polygons data frame (B), then I want to merge A and B.
Could you please help me to solve the problem mentioned above?
Thank you in advance for your help.
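One possible approach, sketched here on the attribute table only: build a lookup from the province names in B to the codes in A, then use match() so the rows of B keep their order (the polygons are tied to row order, so reordering the attribute table would misalign them). The code-to-name mapping below is an assumption inferred from the row order of head(A) and head(B), and B_data is a hypothetical stand-in for the attribute table of B.

```r
A <- data.frame(
  Province = c("EC", "FS", "GT", "KZ", "LM", "MP"),
  Cases    = c(2748L, 208L, 2993L, 1882L, 132L, 103L),
  stringsAsFactors = FALSE
)

# Stand-in for the attribute table of B (only two of its columns).
B_data <- data.frame(
  ID_1   = as.character(1:6),
  NAME_1 = c("Eastern Cape", "Free State", "Gauteng",
             "KwaZulu-Natal", "Limpopo", "Mpumalanga"),
  stringsAsFactors = FALSE
)

# Assumed mapping from province name to the codes used in A.
lookup <- c("Eastern Cape" = "EC", "Free State" = "FS", "Gauteng" = "GT",
            "KwaZulu-Natal" = "KZ", "Limpopo" = "LM", "Mpumalanga" = "MP")

# Add the code column, then pull Cases across without reordering rows.
B_data$Province <- unname(lookup[B_data$NAME_1])
B_data$Cases    <- A$Cases[match(B_data$Province, A$Province)]
```

With the sp package, the same column assignments can usually be made directly on the spatial object (B$Province <- ..., B$Cases <- ...), which keeps B a SpatialPolygonsDataFrame; sp also provides a merge() method for this case. Treat both as things to verify against your sp version.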

Apply a function to particular rows

I want to apply the following code to only the first 3 rows (if it's applied to the second 3, it fails to parse):
netflix_and_disney$release_year <- year(dmy(netflix_and_disney$release_year))
Is there a way of doing this with this df?
structure(list(show_id = c("00147800", "07019028", "00115433", "70234439", "80058654", "80125979"), title = c("10 Things I Hate About You", "101 Dalmatian Street", "101 Dalmatians", "Transformers Prime", "Transformers: Robots in Disguise", "#realityhigh"), type = c("Movie", "Tv Show", "Movie", "Tv Show", "Tv Show", "Movie"), rating = c("PG-13", "N/A", "G", "TV-Y7-FV", "TV-Y7", "TV-14"), release_year = c("31 Mar 1999", "25 Mar 2019", "27 Nov 1996", "2013", "2016", "2017"), date_added = structure(c(18212, 18320, 18212, 17782, 17782, 17417), class = "Date"), duration = c("97 min", "N/A", "103 min", "1 Season", "1 Season", "99 min"), genre = c("Comedy, Drama, Romance", "Animation, Comedy, Family", "Adventure, Comedy, Crime, Family", "Kids' TV", "Kids' TV", "Comedies"), director = c("Gil Junger", "N/A", "Stephen Herek", NA, NA, "Fernando Lebrija"), country = c("USA", "UK, USA, Canada", "USA, UK", "United States", "United States", "United States"), imdb_rating = c("7.3", "6.2", "5.7", NA, NA, NA), platform = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Disney", "Netflix"), class = "factor")), row.names = c(1L, 2L, 3L, 995L, 996L, 997L), class = "data.frame")
I have tried applying it to a subset of the df, as well as using the which() function, but neither worked.
It really all depends on your data and what function you want to apply. But, in principle, you can do this by subsetting your dataframe:
Data:
set.seed(123)
df <- data.frame(
v1 = rnorm(20),
v2 = runif(20),
v3 = sample(20)
)
Here we apply the function mean to each of the first ten rows of df:
apply(df[1:10,], 1, mean)
1 2 3 4 5 6 7 8 9 10
4.5274415 0.7281229 2.9908109 1.8131179 6.7605775 6.6179570 4.2313168 3.4003004 3.1930399 5.8040552
For your data, assign the result back into the same three rows:
netflix_and_disney$release_year[1:3] <- year(dmy(netflix_and_disney$release_year[1:3]))
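If the row positions are not known in advance, an alternative is to pick the rows by pattern, or to skip the subsetting and extract the four-digit year from every entry. A minimal base R sketch, assuming every value ends in a four-digit year as in the sample data (the variable names are made up for illustration):

```r
release_year <- c("31 Mar 1999", "25 Mar 2019", "27 Nov 1996",
                  "2013", "2016", "2017")

# Rows that hold a full "day month year" date rather than a bare year.
full_date <- grepl("^[0-9]{1,2} [A-Za-z]{3} [0-9]{4}$", release_year)

# Keep only the trailing four digits of each entry, then convert to integer.
year_only <- as.integer(sub(".*([0-9]{4})$", "\\1", release_year))
```

Here which(full_date) gives the rows that dmy() can parse, and year_only can replace the whole column in one step.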

Summarizing a specific column with dplyr

For my assignment I need to create an object which contains, for each
combination of Sex and Season, the number of different sports in the olympics data set. The columns of this object should be called Competitor_Sex, Olympic_Season, and Num_Sports,
respectively.
This is what I have at the moment:
object <- olympics %>%
group_by(Sex, Season) %>%
summarise(Num_Sports = ???)
I'm having trouble with defining the third column, which is the number of sports. My data looks like this:
structure(list(Name = c("A Lamusi", "Juhamatti Tapio Aaltonen",
"Andreea Aanei", "Jamale (Djamel-) Aarrass (Ahrass-)", "Nstor Abad Sanjun"
), Sex = c("M", "M", "F", "M", "M"), Age = c(23L, 28L, 22L, 30L,
23L), Height = c(170L, 184L, 170L, 187L, 167L), Weight = c(60,
85, 125, 76, 64), Team = c("China", "Finland", "Romania", "France",
"Spain"), NOC = c("CHN", "FIN", "ROU", "FRA", "ESP"), Games = c("2012 Summer",
"2014 Winter", "2016 Summer", "2012 Summer", "2016 Summer"),
Year = c(2012L, 2014L, 2016L, 2012L, 2016L), Season = c("Summer",
"Winter", "Summer", "Summer", "Summer"), City = c("London",
"Sochi", "Rio de Janeiro", "London", "Rio de Janeiro"), Sport = c("Judo",
"Ice Hockey", "Weightlifting", "Athletics", "Gymnastics"),
Event = c("Judo Men's Extra-Lightweight", "Ice Hockey Men's Ice Hockey",
"Weightlifting Women's Super-Heavyweight", "Athletics Men's 1,500 metres",
"Gymnastics Men's Individual All-Around"), Medal = c(NA,
"Bronze", NA, NA, NA)), row.names = c("1", "2", "3", "4",
"5"), class = "data.frame")
This is probably solved in an easy way. Could someone help me? Would be appreciated a lot!
Best Regards,
Grouping twice should work:
olympics %>%
group_by(Sex, Season, Sport) %>%
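summarise(n = n()) %>% # one row per Sex/Season/Sport combination
group_by(Sex, Season) %>%
summarise(Num_Sports = n()) # count those rows within each Sex/Season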
You can use n_distinct(), the dplyr equivalent of length(unique()):
olympics %>%
group_by(Sex, Season) %>%
summarise(Num_Sports = n_distinct(Sport)) %>%
rename(Competitor_Sex = Sex, Olympic_Season = Season) # To rename the columns
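To sanity-check the result without dplyr, the same count can be sketched in base R with aggregate(). This uses only the five sample rows from the question and three of its columns, so the numbers are for the sample, not the full olympics data set.

```r
# Five sample rows from the question (only the columns needed for the count).
olympics <- data.frame(
  Sex    = c("M", "M", "F", "M", "M"),
  Season = c("Summer", "Winter", "Summer", "Summer", "Summer"),
  Sport  = c("Judo", "Ice Hockey", "Weightlifting", "Athletics", "Gymnastics"),
  stringsAsFactors = FALSE
)

# Count distinct sports within each Sex/Season combination.
result <- aggregate(Sport ~ Sex + Season, data = olympics,
                    FUN = function(s) length(unique(s)))

# Rename to the column names the assignment asks for.
names(result) <- c("Competitor_Sex", "Olympic_Season", "Num_Sports")
```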

Using mutate and a lookup/calc function

I wrote a function where I pass a company name to lookup in a 2nd table a set of records, calculate a complicated result, and return the result.
I want to process all companies and add a value to each record with that result.
I am using the following code:
aa <- mutate(companies,newcol=sum_rounds(companies$company_name))
But I get the following warning:
Warning message:
In c("Bwom", "Symple", "TravelTriangle", "Ark Biosciences", "Artizan Biosciences", :
longer object length is not a multiple of shorter object length
(each of these is a company name)
The company dataframe gets a new column, but all values are "false" where actually there should be both true and false.
Any advice would be welcome to a newbie.
Function follows:
sum_rounds<-function(co_name) {
#get records from rounds for the company name passed to the function
#remove NAs from column roundtype too
outval<- rounds %>%
filter(company_name.x==co_name & !is.na(roundtype)) %>%
#sort by date round is announced
arrange(announced_on) %>%
select(roundtype) %>%
#create a string of all round types in order
apply(2,paste,collapse="")
#the values from mixed to "M", venture to "V" and pureangel to "A"
# now see if it is of the form aaaaa (and #) followed by m or v
# in grep: ^ is start of a line and + is for at least one copy
# [mv] is either m or v
# nice summary is here: http://www.endmemo.com/program/R/gsub.php
#is angel2vc?
angel2vc<-grepl("^a+[mv]+",outval)
#return(list("roundcodes"=outval,"angel2vc"=angel2vc))
return(angel2vc)
}
DPUT from Companies table Follows:
structure(list(company_name = c("Bwom", "Symple", "TravelTriangle",
"Ark Biosciences", "Artizan Biosciences", "Audiense"), domain = c("b-wom.com",
"getsymple.com", "traveltriangle.com", "arkbiosciences.com",
NA, "audiense.com"), country_code = c("ESP", "USA", "USA", "CHN",
"USA", "GBR"), state_code = c(NA, "CA", "VA", NA, "NC", NA),
region = c("Barcelona", "SF Bay Area", "Washington, D.C.",
"Shanghai", "Raleigh", "London"), city = c("Barcelona", "San Francisco",
"Charlottesville", "Shanghai", "Durham", "London"), status = c("operating",
"operating", "operating", "operating", "operating", "operating"
), short_description = c("Bwom is a tool that offers a test and personalized exercises for women's intimate health.",
"Symple is the cloud platform for all your business payments. Pay, get paid, connect.",
"TravelTriangle enables travel enthusiasts to reserve a personalized holiday plan with a local travel agent.",
"Ark Biosciences is a biopharmaceutical company that is dedicated to the discovery and development",
"Artizan Biosciences", "SaaS developer delivering unique consumer insight and engagement capabilities to many of the world’s biggest brands and agencies."
), category_list = c("health care", "cloud computing|machine learning|mobile apps|mobile payments|retail technology",
"e-commerce|personalization|tourism|travel", "health care",
"biopharma", "analytics|apps|marketing|market research|social crm|social media|social media marketing"
), category_group_list = c("health care", "apps|commerce and shopping|data and analytics|financial services|hardware|internet services|mobile|payments|software",
"commerce and shopping|travel and tourism", "health care",
"biotechnology|health care|science and engineering", "apps|data and analytics|design|information technology|internet services|media and entertainment|sales and marketing|software"
), employee_count = c("1 to 10", "11 to 50", "101 to 250",
NA, "1 to 10", "51 to 100"), funding_rounds = c(2L, 1L, 4L,
2L, 2L, 5L), funding_total_usd = c(1075791, 120000, 19900000,
NA, 3e+06, 8013391), founded_on = structure(c(16555, 16770,
15156, 16071, NA, 14975), class = "Date"), first_funding_on = structure(c(16526,
17204, 15492, 16532, 17091, 15294), class = "Date"), last_funding_on = structure(c(17204,
17204, 17204, 17203, 17203, 17203), class = "Date"), closed_on = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), email = c("hello#b-wom.com", "info#getsymple.com",
"admin#traveltriangle.com", "info#arkbiosciences.com", NA,
"moreinfo#audiense.com"), phone = c(NA, NA, "'+91 98 99 120408",
"###############################################################################################################################################################################################################################################################",
NA, "###############################################################################################################################################################################################################################################################"
), cb_url = c("https://www.crunchbase.com/organization/bwom",
"https://www.crunchbase.com/organization/symple-2", "https://www.crunchbase.com/organization/traveltriangle-com",
"https://www.crunchbase.com/organization/ark-biosciences",
"https://www.crunchbase.com/organization/artizan-biosciences",
"https://www.crunchbase.com/organization/socialbro"), twitter_url = c("https://www.twitter.com/hellobwom",
NA, "https://www.twitter.com/traveltriangle", NA, NA, "https://www.twitter.com/socialbro"
), facebook_url = c("https://www.facebook.com/hellobwom/?fref=ts",
NA, "http://www.facebook.com/traveltriangle", NA, NA, "http://www.facebook.com/socialbro"
), uuid = c("e6096d58-3454-d982-0dbe-7de9b06cd493", "fd0ab78f-0dc4-1f18-21d1-7ce9ff7a173b",
"742043c1-c17a-4526-4ed0-e911e6e9555b", "8e27eb22-ce03-a2af-58ba-53f0f458f49c",
"ed07ac9e-1071-fca0-46d9-42035c2da505", "fed333e5-2754-7413-1e3d-5939d70541d2"
), isbio = c("other", "other", "other", "other", "bio", "other"
), co_type = c("m", "m", "m", "v", "v", "m")), .Names = c("company_name",
"domain", "country_code", "state_code", "region", "city", "status",
"short_description", "category_list", "category_group_list",
"employee_count", "funding_rounds", "funding_total_usd", "founded_on",
"first_funding_on", "last_funding_on", "closed_on", "email",
"phone", "cb_url", "twitter_url", "facebook_url", "uuid", "isbio",
"co_type"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
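A likely cause, offered as a reading of the code since the rounds table is not shown: mutate() passes the whole company_name column to sum_rounds() in a single call. The filter then compares rounds$company_name.x against a vector of names (which recycles and triggers the warning), and grepl() returns one value that is repeated down the new column. A minimal sketch of calling the lookup once per name with vapply() follows. The rounds rows and the helper name sum_rounds_one are invented for illustration, and the body is rewritten in base R so the block runs on its own.

```r
# Toy stand-in for the rounds table; these rows are invented for illustration.
rounds <- data.frame(
  company_name.x = c("Bwom", "Bwom", "Symple", "Symple"),
  roundtype      = c("a", "v", "v", "a"),
  announced_on   = as.Date(c("2015-01-01", "2016-01-01",
                             "2015-06-01", "2016-06-01")),
  stringsAsFactors = FALSE
)

# Scalar version of the lookup: expects exactly one company name.
sum_rounds_one <- function(co_name) {
  hits  <- rounds[rounds$company_name.x == co_name & !is.na(rounds$roundtype), ]
  hits  <- hits[order(hits$announced_on), ]      # sort by announcement date
  codes <- paste(hits$roundtype, collapse = "")  # e.g. "av"
  grepl("^a+[mv]+", codes)                       # angel rounds followed by m or v
}

companies <- data.frame(company_name = c("Bwom", "Symple", "Audiense"),
                        stringsAsFactors = FALSE)

# vapply() calls the function once per name and checks each result is one logical.
companies$newcol <- vapply(companies$company_name, sum_rounds_one,
                           logical(1), USE.NAMES = FALSE)
```

Inside a dplyr pipeline, the same idea can be written as mutate(companies, newcol = sapply(company_name, sum_rounds)), keeping the original function unchanged.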

Resources