Apply a function to particular rows - r

I want to apply the following code to only the first 3 rows (if it's applied to the last 3, it fails to parse):
netflix_and_disney$release_year <- year(dmy(netflix_and_disney$release_year))
Is there a way of doing this with this df?
structure(list(show_id = c("00147800", "07019028", "00115433", "70234439", "80058654", "80125979"), title = c("10 Things I Hate About You", "101 Dalmatian Street", "101 Dalmatians", "Transformers Prime", "Transformers: Robots in Disguise", "#realityhigh"), type = c("Movie", "Tv Show", "Movie", "Tv Show", "Tv Show", "Movie"), rating = c("PG-13", "N/A", "G", "TV-Y7-FV", "TV-Y7", "TV-14"), release_year = c("31 Mar 1999", "25 Mar 2019", "27 Nov 1996", "2013", "2016", "2017"), date_added = structure(c(18212, 18320, 18212, 17782, 17782, 17417), class = "Date"), duration = c("97 min", "N/A", "103 min", "1 Season", "1 Season", "99 min"), genre = c("Comedy, Drama, Romance", "Animation, Comedy, Family", "Adventure, Comedy, Crime, Family", "Kids' TV", "Kids' TV", "Comedies"), director = c("Gil Junger", "N/A", "Stephen Herek", NA, NA, "Fernando Lebrija"), country = c("USA", "UK, USA, Canada", "USA, UK", "United States", "United States", "United States"), imdb_rating = c("7.3", "6.2", "5.7", NA, NA, NA), platform = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Disney", "Netflix"), class = "factor")), row.names = c(1L, 2L, 3L, 995L, 996L, 997L), class = "data.frame")
I have tried applying it to a subset of the df, but that failed to work, as did using the which() function.

It really all depends on your data and what function you want to apply. But, in principle, you can do this by subsetting your dataframe:
Data:
set.seed(123)
df <- data.frame(
v1 = rnorm(20),
v2 = runif(20),
v3 = sample(20)
)
Here we apply the function mean to the first ten rows of df:
apply(df[1:10,], 1, mean)
1 2 3 4 5 6 7 8 9 10
4.5274415 0.7281229 2.9908109 1.8131179 6.7605775 6.6179570 4.2313168 3.4003004 3.1930399 5.8040552

For your data, assign the parsed years back to just the first three rows:
netflix_and_disney$release_year[1:3] <- year(dmy(netflix_and_disney$release_year[1:3]))
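Since the column mixes full dates ("31 Mar 1999") with bare years ("2013"), another option is to pick the rows by pattern rather than by position. Below is a minimal base R sketch, assuming those two formats are the only ones present. The vector copies the release_year values from the question; the regular expressions are my own assumption about the format, and the year is extracted as text instead of going through dmy().

```r
# release_year values copied from the question's data frame
release_year <- c("31 Mar 1999", "25 Mar 2019", "27 Nov 1996",
                  "2013", "2016", "2017")

# TRUE for rows that look like "day month-abbreviation year" (assumed format)
is_full_date <- grepl("^[0-9]{1,2} [A-Za-z]{3} [0-9]{4}$", release_year)

# Modify only the matching rows; year-only rows are left untouched
year_out <- release_year
year_out[is_full_date] <- sub("^.* ([0-9]{4})$", "\\1",
                              release_year[is_full_date])

# Every entry is now a four-digit year, so the conversion is safe
year_out <- as.integer(year_out)
print(year_out)
```

The same logical mask should also work with the lubridate call from the question, for example x[is_full_date] <- year(dmy(x[is_full_date])), which avoids hard-coding 1:3.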

Related

Summarizing a specific column with dplyr

For my assignment I need to create an object which contains, for each combination of Sex and Season, the number of different sports in the olympics data set. The columns of this object should be called Competitor_Sex, Olympic_Season, and Num_Sports, respectively.
This is what I have at the moment:
object <- olympics %>%
group_by(Sex, Season) %>%
summarise(Num_Sports = ???)
I'm having trouble with defining the third column, which is the number of sports. My data looks like this:
structure(list(Name = c("A Lamusi", "Juhamatti Tapio Aaltonen",
"Andreea Aanei", "Jamale (Djamel-) Aarrass (Ahrass-)", "Nstor Abad Sanjun"
), Sex = c("M", "M", "F", "M", "M"), Age = c(23L, 28L, 22L, 30L,
23L), Height = c(170L, 184L, 170L, 187L, 167L), Weight = c(60,
85, 125, 76, 64), Team = c("China", "Finland", "Romania", "France",
"Spain"), NOC = c("CHN", "FIN", "ROU", "FRA", "ESP"), Games = c("2012 Summer",
"2014 Winter", "2016 Summer", "2012 Summer", "2016 Summer"),
Year = c(2012L, 2014L, 2016L, 2012L, 2016L), Season = c("Summer",
"Winter", "Summer", "Summer", "Summer"), City = c("London",
"Sochi", "Rio de Janeiro", "London", "Rio de Janeiro"), Sport = c("Judo",
"Ice Hockey", "Weightlifting", "Athletics", "Gymnastics"),
Event = c("Judo Men's Extra-Lightweight", "Ice Hockey Men's Ice Hockey",
"Weightlifting Women's Super-Heavyweight", "Athletics Men's 1,500 metres",
"Gymnastics Men's Individual All-Around"), Medal = c(NA,
"Bronze", NA, NA, NA)), row.names = c("1", "2", "3", "4",
"5"), class = "data.frame")
This is probably solved in an easy way. Could someone help me? Would be appreciated a lot!
Grouping twice should work:
olympics %>%
group_by(Sex, Season, Sport) %>%
summarise(n()) %>%
group_by(Sex, Season) %>%
summarise(n())
You can use dplyr's equivalent of length(unique()), which is n_distinct():
olympics %>%
group_by(Sex, Season) %>%
summarise(Num_Sports = n_distinct(Sport)) %>%
rename(Competitor_Sex = Sex, Olympic_Season = Season) # To rename the grouping columns
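As a cross-check of the n_distinct() idea, the same count can be sketched in base R with length(unique()) inside aggregate(). The olympics_toy data frame below is hypothetical (a few made-up rows using the column names from the question), so this is an illustration rather than the assignment's expected output.

```r
# Hypothetical toy rows; only the columns needed for the count
olympics_toy <- data.frame(
  Sex    = c("M", "M", "M", "F", "F", "M"),
  Season = c("Summer", "Summer", "Winter", "Summer", "Summer", "Summer"),
  Sport  = c("Judo", "Athletics", "Ice Hockey",
             "Weightlifting", "Weightlifting", "Judo"),
  stringsAsFactors = FALSE
)

# Count distinct sports per Sex/Season combination
result <- aggregate(Sport ~ Sex + Season, data = olympics_toy,
                    FUN = function(s) length(unique(s)))

# Rename to the column names the assignment asks for
names(result) <- c("Competitor_Sex", "Olympic_Season", "Num_Sports")
print(result)
```

Repeated sports within a group (Judo for M/Summer, Weightlifting for F/Summer) are counted once, which is the behaviour n_distinct() gives in the dplyr version.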

How to subset variables which have NA in the values?

I have an imdb dataset where I would like to replace the missing values for budget and box_office_gross, for which I think using multiple imputation would be a way to replace the missing values.
In order to separate the numeric columns from the entire dataset and perform imputation, I tried to subset the variables
> NBCU_Limited <- subset(NBCU_dataLaurel_Modified, select = c(NBCU_dataLaurel_Modified$imdb_votes, NBCU_dataLaurel_Modified$runtime_min, NBCU_dataLaurel_Modified$Budget, NBCU_dataLaurel_Modified$Box_Office_Gross))
Error: NA column indexes not supported
I get this error because there are NA values in the variables. I cannot instead exclude the character columns by negating them, because they contain NAs as well and I get the same error.
How do I get only these four variables out into a new dataframe so that I can perform multiple imputation on them?
Sample Dataset
Below is the data,
> dput(Sample)
structure(list(imdbid = c("tt6256056", "tt0085450", "tt5050772",
"tt5069876", "tt0083791", "tt0083929"), title = c("Una Famiglia",
"Doctor Detroit", "Honeytrap", "Maniac 8.2.8", "The Dark Crystal",
"Fast Times at Ridgemont High"), plot = c("N/A", "A timid college professor, conned into posing as a flamboyant pimp, finds himself enjoying his new occupation on the streets.",
"Simeon's evening goes horribly wrong when a young woman tries to pick him up.",
"Maniac: a person afflicted with mania. Mania: A manifestation of bipolar disorder, characterized by profuse and rapidly changing ideas, exaggerated sexuality, gaiety, or irritability, decreased sleep and violent abnormal behavior.",
"On another planet in the distant past, a Gelfling embarks on a quest to find the missing shard of a magical crystal, and so restore order to his world.",
"A group of Southern California high school students are enjoying their most important subjects: sex, drugs and rock n' roll."
), rating = c("N/A", "R", "N/A", "N/A", "PG", "R"), imdb_rating = c(NA,
5.1, NA, NA, 7.2, 7.2), metacritic = c(NA, NA, NA, NA, NA, 67
), dvd_release = structure(c(NA, 1126569600, NA, NA, 939081600,
1099353600), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
production = c("N/A", "Universal", "Array Releasing", "N/A",
"Sony Pictures Home Entertainment", "Universal Pictures"),
actors = c("Patrick Bruel, Fortunato Cerlino, Matilda De Angelis, Ennio Fantastichini",
"Dan Aykroyd, Howard Hesseman, Donna Dixon, Lydia Lei", "Jennifer Nelson, Daemian Greaves, Polina Vasileva, Becki Lloyd",
"Dimitra Aggelou, Giorgos Efthimiou, Stavroula Kontopoulou, Maria-Antouanetta Tatsi",
"Jim Henson, Kathryn Mullen, Frank Oz, Dave Goelz", "Sean Penn, Jennifer Jason Leigh, Judge Reinhold, Robert Romanus"
), imdb_votes = c(NA, 4492, NA, NA, 44862, 76980), poster = c("N/A",
"https://images-na.ssl-images-amazon.com/images/M/MV5BMjhjY2Q4NWEtYTUzZC00YjE2LTk0ZjktNzUyZjIwNmQ0YTkyXkEyXkFqcGdeQXVyMTQxNzMzNDI#._V1_SX300.jpg",
"N/A", "https://images-na.ssl-images-amazon.com/images/M/MV5BZjdmZTRhYzgtOGY4MS00OGM5LWJlNmItYzJiYjZiNmVmYjhkXkEyXkFqcGdeQXVyNDA2NjM2ODk#._V1_SX300.jpg",
"https://images-na.ssl-images-amazon.com/images/M/MV5BMWZlZjk1MGEtYWMzOC00N2EyLWFkOTUtZDM4NGNlY2M0YjVmXkEyXkFqcGdeQXVyNTAyODkwOQ##._V1_SX300.jpg",
"https://images-na.ssl-images-amazon.com/images/M/MV5BYzBlZjE1MDctYjZmZC00ZTJmLWFkOWEtYjdmZDZkODBkZmI2XkEyXkFqcGdeQXVyNjQ2MjQ5NzM#._V1_SX300.jpg"
), director = c("Sebastiano Riso", "Michael Pressman", "Nick Archer",
"Giorgos Efthimiou", "Jim Henson, Frank Oz", "Amy Heckerling"
), release_date = structure(c(1493596800, 421027200, 1448928000,
1431734400, 408931200, 398044800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Year = c(2017, 1983, 2015, 2015, 1982,
1982), Year_Groups = c("2010-2020", "1980-1989", "2010-2020",
"2010-2020", "1980-1989", "1980-1989"), Month = c("May",
"May", "December", "May", "December", "August"), runtime_min = c(97,
89, NA, 15, 93, 90), genre = c("Drama", "Comedy", "Short, Thriller",
"Short, Horror", "Adventure, Family, Fantasy", "Comedy, Drama"
), awards = c("N/A", "N/A", "N/A", "1 win.", "Nominated for 1 BAFTA Film Award. Another 2 wins & 4 nominations.",
"1 win & 1 nomination."), keywords = c(NA, "pimp|college-professor|voyeurism|voyeur|blue-panties|panties|red-dress|blonde|female-frontal-nudity|female-nudity|nude-girl|nude|bare-breasts|breasts|topless-female-nudity|scantily-clad-female|cleavage|two-word-title|reference-to-joe-frazier|reference-to-yul-brynner|mother-son-relationship|f-word|place-name-in-title|city-name-in-title|dual-identity|prostitution|independent-film|title-spoken-by-character|character-name-in-title",
NA, NA, "mystic|magical-crystal|crystal-shard|sword-and-sorcery|puppetry|crystal|shard|quest|evil|monster|feeding-on-energy|hidden-entrance|giant-crystal|actor-voicing-multiple-characters|planetary-alignment|reunification|three-word-title|dark-fantasy|slow-motion-scene|vampire|surrealism|christ-allegory|cult-film|sorceress|relic|race-against-time|muppet|mission|magic|kingdom|creature|good-versus-evil|directed-by-star|epic|multiple-monsters|invented-language|slavery|orrery|puppet|mutation|darkness|destiny",
"high-school|title-directed-by-female|females-talking-about-sex|unwanted-pregnancy|fired-from-the-job|teacher-student-relationship|irreverence|sexual-awakening|innocence-lost|ensemble-film|coming-of-age|teen-movie|high-school-teacher|advice|ticket-scalping|shopping-mall|loss-of-virginity|female-nudity|brother-sister-relationship|caught-masturbating|california|surfer|teacher|break-up|rock-'n'-roll|virgin|teenager|friendship|drugs|date|surfer-dude|blond-boy|redheaded-boy|generation-x|f-rated|vomiting|sex-scene|cult-film|breasts|jeans|hawaiian-shirt|payphone|teenage-girl|teen-sex-comedy|scantily-clad-female|reference-to-led-zeppelin|dream-girl|underage-girl|jailbait|trophy-wife|voyeur|sexual-promiscuity|sexual-desire|sexual-attraction|lust|sex-on-couch|female-rear-nudity|female-frontal-nudity|panties|cheerleader-uniform|female-removes-her-clothes|cleavage|marijuana|drug-use|teen-angst|surfing|school-life|pregnancy|masturbation|football-player|first-love|employment|bikini|stoner|rock-m... <truncated>
), Budget = c(NA, 10375893, NA, NA, 1.5e+07, 4500000), Box_Office_Gross = c(2.48,
70, 70, 124, 140, 140)), .Names = c("imdbid", "title", "plot",
"rating", "imdb_rating", "metacritic", "dvd_release", "production",
"actors", "imdb_votes", "poster", "director", "release_date",
"Year", "Year_Groups", "Month", "runtime_min", "genre", "awards",
"keywords", "Budget", "Box_Office_Gross"), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Update: the error occurs because I prefix each column with the data frame name inside subset(). If I give just the bare column names, I do not get the error. I am not sure why, so this is probably down to my improper code. Thanks @Tung for pointing this out.
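A short sketch of why the bare names work: the select argument of subset() is evaluated to column positions, so writing df$col there passes the column's values (NAs included) as if they were column indexes. The sample_toy data frame below is a cut-down copy of the numeric columns from the dput above, built as a plain data.frame for illustration.

```r
# Cut-down copy of the sample data (values taken from the dput output)
sample_toy <- data.frame(
  title = c("Una Famiglia", "Doctor Detroit", "Honeytrap",
            "Maniac 8.2.8", "The Dark Crystal",
            "Fast Times at Ridgemont High"),
  imdb_votes       = c(NA, 4492, NA, NA, 44862, 76980),
  runtime_min      = c(97, 89, NA, 15, 93, 90),
  Budget           = c(NA, 10375893, NA, NA, 1.5e+07, 4500000),
  Box_Office_Gross = c(2.48, 70, 70, 124, 140, 140),
  stringsAsFactors = FALSE
)

# Bare column names inside select: NAs in the data are irrelevant here
numeric_only <- subset(sample_toy,
                       select = c(imdb_votes, runtime_min,
                                  Budget, Box_Office_Gross))

# Equivalent selection by column name with plain indexing
same_result <- sample_toy[, c("imdb_votes", "runtime_min",
                              "Budget", "Box_Office_Gross")]

print(numeric_only)
```

Either form keeps the NA values in place, so the resulting data frame can be handed to an imputation routine afterwards.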

(still) Having trouble with ordering/sorting bar graph using plotly and shiny widget

I'm trying to numerically order my bar graph using plotly, together with a shiny select box widget that displays the bar graph for each type of organization (e.g. medical, web, gaming, military, etc.). The y-axis of the bar graph shows the name of the organization and the x-axis shows the number of records. Here is my code for numerically ordering all of my bar graphs:
df <- original.df %>% filter(type_org == input$choice)
df$abbrv <- ifelse(nchar(df$name) > 20, abbreviate(df$name), df$name)
df$abbrv <- factor(df$abbrv, levels = unique(df$abbrv)[order(df$number, decreasing = FALSE)])
plot_ly(
x = df$number,
y = df$abbrv,
type = 'bar',
text = ifelse(nchar(df$name) > 20, df$name, "")) %>%
layout(margin = list(l = 150))
A bit of explanation about my dataframe and code: I abbreviated organization names that are longer than 20 characters, which is why I have an abbrv column in my dataset. So on the y-axis, an organization name longer than 20 characters is shown as its abbreviation, which I create with the abbreviate() function. More details are in my previous post.
Anyway, the problem I am having right now is that the factor() call orders 98% of the bar graphs, but it doesn't sort two of them. I have no idea why it sorts everything EXCEPT the bar graphs for two types of organization, military and telecom. Here is the dataframe for the telecom type:
structure(list(entity_name = c("KDDI", "T-Mobile, Deutsche Telecom",
"AT&T", "AT&T", "KT Corp.", "TerraCom & YourTel", "Vodafone",
"Three", "Bell"), year = c(2006, 2006, 2008, 2010, 2012, 2013, 2013, 2017, 2017
), type_org = c("telecoms", "telecoms", "telecoms", "telecoms",
"telecoms", "telecoms", "telecoms", "telecoms", "telecoms"),
records_lost = c(4000000L, 17000000L, 100000L, 100000L, 8700000L, 180000L,
2000000L, 200000L, 1900000L),
abbreviation = structure(c(6L, 8L, 1L, 1L, 2L, 7L,
3L, 5L, 4L), .Label = c("AT&T", "KT Corp.", "Vodafone", "Bell",
"Three", "KDDI", "TerraCom & YourTel", "T-DT"), class = "factor")), .Names = c("entity_name",
"alt_name", "description", "year", "type_org", "leak_method",
"interesting", "records_lost", "data_sensitivity", "source_1",
"source_2", "source_3", "source_name", "abbreviation"), row.names = c(NA,
9L), class = "data.frame")
Here is the dataframe for military:
structure(list(entity_name = c("US Dept of Defense", "US National Guard",
"US Military", "US Military", "US Army", "Stratfor", "Tricare"
), year = c(2009, 2009, 2009, 2010, 2011, 2011, 2011), type_org = c("military",
"military", "military", "military", "military", "military", "military"
), records_lost = c(72000L, 130000L, 76000000L, 300000L, 50000L, 900000L, 4900000L), abbreviation = structure(c(2L,
3L, 6L, 6L, 4L, 1L, 5L), .Label = c("Stratfor", "US Dept of Defense",
"US National Guard", "US Army", "Tricare", "US Military"), class = "factor")), .Names = c("entity_name",
"alt_name", "description", "year", "type_org", "leak_method",
"interesting", "records_lost", "data_sensitivity", "source_1",
"source_2", "source_3", "source_name", "abbreviation"), row.names = c(NA,
7L), class = "data.frame")
For some reason, these two are not sorted numerically like the other types of organization, and I'm not sure what to do. I tried using the dplyr arrange() function as well, but that doesn't change anything. I don't understand why all the other bar graphs are sorted, though. Would anyone happen to know how to fix this?
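One plausible cause, given that both data frames contain a repeated name (AT&T and US Military): unique(df$abbrv) is shorter than df$number, so indexing it with order(df$number) misaligns the levels and produces an NA. The sketch below reproduces that with the telecom values from the question and then builds the levels from per-name totals instead. Summing duplicates with tapply() is my assumption about how repeated names should be ranked.

```r
# Telecom names and record counts taken from the question's dput output
telecom <- data.frame(
  abbrv  = c("KDDI", "T-DT", "AT&T", "AT&T", "KT Corp.",
             "TerraCom & YourTel", "Vodafone", "Three", "Bell"),
  number = c(4000000, 17000000, 100000, 100000, 8700000,
             180000, 2000000, 200000, 1900000),
  stringsAsFactors = FALSE
)

# Original approach: 8 unique names indexed by 9 positions
bad_levels <- unique(telecom$abbrv)[order(telecom$number)]
print(anyNA(bad_levels))  # an out-of-range index shows up as NA

# Assumed fix: rank each name by its total, then use that as level order
totals      <- tapply(telecom$number, telecom$abbrv, sum)
good_levels <- names(sort(totals))
telecom$abbrv <- factor(telecom$abbrv, levels = good_levels)

print(levels(telecom$abbrv))
```

The resulting factor can be passed to plot_ly() as before. Organization types without repeated names are unaffected, which would match the other graphs already sorting correctly.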

Using mutate and a lookup/calc function

I wrote a function that takes a company name, looks up a set of records for it in a second table, calculates a complicated result, and returns that result.
I want to process all companies and add the result to each company's record as a new value.
I am using the following code:
`aa <- mutate(companies,newcol=sum_rounds(companies$company_name))`
But I get the following warning:
Warning message:
In c("Bwom", "Symple", "TravelTriangle", "Ark Biosciences", "Artizan Biosciences", :
longer object length is not a multiple of shorter object length
(each of these is a company name)
The company dataframe gets a new column, but all values are "false" where actually there should be both true and false.
Any advice would be welcome to a newbie.
Function follows:
sum_rounds<-function(co_name) {
#get records from rounds for the company name passed to the function
#remove NAs from column roundtype too
outval<- rounds %>%
filter(company_name.x==co_name & !is.na(roundtype)) %>%
#sort by date round is announced
arrange(announced_on) %>%
select(roundtype) %>%
#create a string of all round types in order
apply(2,paste,collapse="")
#the values from mixed to "M", venture to "V" and pureangel to "A"
# now see if it is of the form aaaaa (and #) followed by m or v
# in grep: ^ is start of a line and + is for at least one copy
# [mv] is either m or v
# nice summary is here: http://www.endmemo.com/program/R/gsub.php
#is angel2vc?
angel2vc<-grepl("^a+[mv]+",outval)
#return(list("roundcodes"=outval,"angel2vc"=angel2vc))
return(angel2vc)
}
dput() output from the companies table follows:
structure(list(company_name = c("Bwom", "Symple", "TravelTriangle",
"Ark Biosciences", "Artizan Biosciences", "Audiense"), domain = c("b-wom.com",
"getsymple.com", "traveltriangle.com", "arkbiosciences.com",
NA, "audiense.com"), country_code = c("ESP", "USA", "USA", "CHN",
"USA", "GBR"), state_code = c(NA, "CA", "VA", NA, "NC", NA),
region = c("Barcelona", "SF Bay Area", "Washington, D.C.",
"Shanghai", "Raleigh", "London"), city = c("Barcelona", "San Francisco",
"Charlottesville", "Shanghai", "Durham", "London"), status = c("operating",
"operating", "operating", "operating", "operating", "operating"
), short_description = c("Bwom is a tool that offers a test and personalized exercises for women's intimate health.",
"Symple is the cloud platform for all your business payments. Pay, get paid, connect.",
"TravelTriangle enables travel enthusiasts to reserve a personalized holiday plan with a local travel agent.",
"Ark Biosciences is a biopharmaceutical company that is dedicated to the discovery and development",
"Artizan Biosciences", "SaaS developer delivering unique consumer insight and engagement capabilities to many of the world’s biggest brands and agencies."
), category_list = c("health care", "cloud computing|machine learning|mobile apps|mobile payments|retail technology",
"e-commerce|personalization|tourism|travel", "health care",
"biopharma", "analytics|apps|marketing|market research|social crm|social media|social media marketing"
), category_group_list = c("health care", "apps|commerce and shopping|data and analytics|financial services|hardware|internet services|mobile|payments|software",
"commerce and shopping|travel and tourism", "health care",
"biotechnology|health care|science and engineering", "apps|data and analytics|design|information technology|internet services|media and entertainment|sales and marketing|software"
), employee_count = c("1 to 10", "11 to 50", "101 to 250",
NA, "1 to 10", "51 to 100"), funding_rounds = c(2L, 1L, 4L,
2L, 2L, 5L), funding_total_usd = c(1075791, 120000, 19900000,
NA, 3e+06, 8013391), founded_on = structure(c(16555, 16770,
15156, 16071, NA, 14975), class = "Date"), first_funding_on = structure(c(16526,
17204, 15492, 16532, 17091, 15294), class = "Date"), last_funding_on = structure(c(17204,
17204, 17204, 17203, 17203, 17203), class = "Date"), closed_on = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), email = c("hello@b-wom.com", "info@getsymple.com",
"admin@traveltriangle.com", "info@arkbiosciences.com", NA,
"moreinfo@audiense.com"), phone = c(NA, NA, "'+91 98 99 120408",
"###############################################################################################################################################################################################################################################################",
NA, "###############################################################################################################################################################################################################################################################"
), cb_url = c("https://www.crunchbase.com/organization/bwom",
"https://www.crunchbase.com/organization/symple-2", "https://www.crunchbase.com/organization/traveltriangle-com",
"https://www.crunchbase.com/organization/ark-biosciences",
"https://www.crunchbase.com/organization/artizan-biosciences",
"https://www.crunchbase.com/organization/socialbro"), twitter_url = c("https://www.twitter.com/hellobwom",
NA, "https://www.twitter.com/traveltriangle", NA, NA, "https://www.twitter.com/socialbro"
), facebook_url = c("https://www.facebook.com/hellobwom/?fref=ts",
NA, "http://www.facebook.com/traveltriangle", NA, NA, "http://www.facebook.com/socialbro"
), uuid = c("e6096d58-3454-d982-0dbe-7de9b06cd493", "fd0ab78f-0dc4-1f18-21d1-7ce9ff7a173b",
"742043c1-c17a-4526-4ed0-e911e6e9555b", "8e27eb22-ce03-a2af-58ba-53f0f458f49c",
"ed07ac9e-1071-fca0-46d9-42035c2da505", "fed333e5-2754-7413-1e3d-5939d70541d2"
), isbio = c("other", "other", "other", "other", "bio", "other"
), co_type = c("m", "m", "m", "v", "v", "m")), .Names = c("company_name",
"domain", "country_code", "state_code", "region", "city", "status",
"short_description", "category_list", "category_group_list",
"employee_count", "funding_rounds", "funding_total_usd", "founded_on",
"first_funding_on", "last_funding_on", "closed_on", "email",
"phone", "cb_url", "twitter_url", "facebook_url", "uuid", "isbio",
"co_type"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
>
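The warning suggests that the function receives the whole company_name column at once, so the == comparison recycles the vector instead of looking up one company at a time. A minimal base R sketch of calling a per-company function once per name with vapply() follows. The rounds_toy table, the is_angel2vc() helper and the round codes are hypothetical stand-ins for the real rounds data and sum_rounds().

```r
# Hypothetical stand-in for the rounds table
rounds_toy <- data.frame(
  company_name = c("Bwom", "Bwom", "Symple",
                   "TravelTriangle", "TravelTriangle"),
  roundtype    = c("a", "v", "m", "a", "a"),
  announced_on = as.Date(c("2015-01-01", "2016-01-01", "2017-01-01",
                           "2014-01-01", "2015-06-01")),
  stringsAsFactors = FALSE
)

# Works on ONE company name: angel round(s) followed by m or v?
is_angel2vc <- function(co_name, rounds) {
  rows  <- rounds[rounds$company_name == co_name &
                    !is.na(rounds$roundtype), ]
  rows  <- rows[order(rows$announced_on), ]     # oldest round first
  codes <- paste(rows$roundtype, collapse = "") # e.g. "av"
  grepl("^a+[mv]+", codes)
}

companies_toy <- data.frame(
  company_name = c("Bwom", "Symple", "TravelTriangle"),
  stringsAsFactors = FALSE
)

# Call the function once per company and collect one logical each
companies_toy$angel2vc <- vapply(companies_toy$company_name, is_angel2vc,
                                 logical(1), rounds = rounds_toy,
                                 USE.NAMES = FALSE)
print(companies_toy)
```

With the original function, something along the lines of mutate(companies, newcol = sapply(company_name, sum_rounds)) should behave the same way, since mutate() expects one value per row.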

Stacked Bar ordered by Sum of Fill with ggplot2

With the following data (an already melted data frame):
df1<-structure(list(Speciality = structure(27:32, .Label = c("Addiction Medicine",
"Anesthesiology", "Cardiac Electrophysiology", "Cardiology",
"Dermatology", "Emergency Medicine", "Family Medicine", "Gastroenterology",
"General Surgery", "Hematology & Oncology", "Hospitalist", "Internal Medicine",
"Nephrology", "Neurological Surgery", "Neurology", "Obstetrics & Gynecology",
"Otolaryngology", "Pain Medicine", "Pathology", "Pediatric Critical Care Medicine",
"Pediatric Hematology-Oncology", "Pediatric Pulmonology", "Pediatric Radiology",
"Pediatric Surgery", "Pediatrics", "Psychiatry", "Pulmonology",
"Radiation Oncology", "Radiology", "Surgical Oncology", "Urology",
"Vascular Surgery"), class = "factor"), PhysAge = structure(c(5L,
5L, 1L, 3L, 5L, 5L), .Label = c("25-34", "35-44", "45-54", "55-64",
"65+"), class = "factor"), value = c(0.0035, 0.0058, 0.0089, 0, 0.00512820512820513,
0.00512820512820513)), .Names = c("Speciality", "PhysAge", "value"
), row.names = 155:160, class = "data.frame")
How can I reorder the bars in ggplot based on the sum of values for each Speciality in a stacked bar chart? I've found some options where the values are spread over multiple columns, but in this case there is a single value column.
Currently plotting by:
ggplot(df1,aes(x=Speciality,y=value,fill=PhysAge))+
geom_bar(stat="identity")
You could try reorder(), which sorts the levels of Speciality by the sum of value. The extra rows below are only there to give each Speciality more than one PhysAge group:
set.seed(1)
df <- rbind(
df1,
transform(df1, PhysAge="1", value=runif(6, 0, 0.01)),
transform(df1, PhysAge="10", value=runif(6, 0, 0.01))
)
ggplot(df,aes(x=reorder(Speciality, value, sum), y=value,fill=PhysAge))+
geom_bar(stat="identity")
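To see what reorder() does to the factor without drawing anything, here is a small base R sketch. The spec_toy rows are hypothetical (Speciality names borrowed from the question, values made up) so that each Speciality has a different total.

```r
# Hypothetical melted data: several value rows per Speciality
spec_toy <- data.frame(
  Speciality = factor(c("Radiology", "Radiology", "Urology",
                        "Urology", "Pulmonology")),
  value      = c(0.0089, 0.0010, 0.0050, 0.0020, 0.0035)
)

# Default factor levels are alphabetical
print(levels(spec_toy$Speciality))

# reorder() recomputes the level order from sum(value) per Speciality
ordered_spec <- reorder(spec_toy$Speciality, spec_toy$value, sum)
print(levels(ordered_spec))

# Per-level totals that drove the ordering
print(tapply(spec_toy$value, spec_toy$Speciality, sum))
```

ggplot2 draws a discrete axis in level order, which is why wrapping the x aesthetic in reorder(Speciality, value, sum) sorts the stacked bars by their total height. For descending order, the summary function can be negated, e.g. function(v) -sum(v).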
