How to create a subset of data for most common [duplicate] - r

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I need some help creating a subset of data. I'm sure this is a simple problem but I can't figure it out.
For example, in the table below, I need to create a subset of the data that includes the presidential winner from each state. For Alabama, for example, I would need the line for Donald J Trump, since he got the highest proportion of votes (candidate votes / total votes). I would need to isolate the winners from every state.
State    Candidate     candidatevotes  totalvotes
Alabama  D J Trump            1318255     2123372
Alabama  Clinton               729547     2123372
Alabama  Gary Johnson           44467     2123372
Alabama  Other                  21712     2123372
However, I don't know how to isolate the winner from each state. I have tried
data_sub <- filename[filename$candidatevotes / filename$totalvotes > .5, ]
but I know that, since there are third-party candidates, not every state winner will win with a majority of the votes. I have attached a picture for reference. Thank you in advance!

I just manipulated your data a little bit to demonstrate how the problem can be solved:
# Just changed the last two states to Texas so that you get a two-line result (not just one)
election <- data.frame(State = c("Alabama", "Alabama", "Texas", "Texas"),
                       Candidate = c("D J Trump", "Clinton", "Gary Johnson", "Other"),
                       candidatevotes = c(1318255, 729547, 44467, 21712),
                       totalvotes = c(2123372, 2123372, 2123372, 2123372))

# needed library
library(dplyr)

election %>%
  # group by the variable you want the max value for (State)
  dplyr::group_by(State) %>%
  # get the lines with the maximum candidatevotes for each State
  dplyr::filter(candidatevotes == max(candidatevotes))

We can group by 'State' and filter the row with the maximum vote proportion for each 'State'. Note that an additional prop > .5 condition is not needed here (and, as you pointed out, would drop winners who only have a plurality), so the filter uses the maximum alone:
library(dplyr)
df1 %>%
  mutate(prop = candidatevotes / totalvotes) %>%
  group_by(State) %>%
  filter(prop == max(prop))
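Since dplyr 1.0.0, the same per-group maximum can be taken with slice_max, which also makes tie handling explicit (a sketch on a one-state sample of the question's data):

```r
library(dplyr)

# one-state sample of the question's table
df1 <- data.frame(State = c("Alabama", "Alabama", "Alabama", "Alabama"),
                  Candidate = c("D J Trump", "Clinton", "Gary Johnson", "Other"),
                  candidatevotes = c(1318255, 729547, 44467, 21712),
                  totalvotes = 2123372)

winners <- df1 %>%
  mutate(prop = candidatevotes / totalvotes) %>%
  group_by(State) %>%
  slice_max(prop, n = 1, with_ties = FALSE) %>%  # exactly one winner per State
  ungroup()
```

With with_ties = TRUE (the default), an exact tie would return both tied rows instead of one.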


Selective choice of tuples with partially matching characteristics in R

I have a dataset with data about political careers.
Every politician has a unique identifier number (ui) and can occur in multiple electoral terms (electoral_term). Every electoral term equals a period of 4 years in which the politician is in office.
Now I would like to find out which academic titles (academic_title) occur in the dataset and how often they occur.
The problem is that every politician is potentially mentioned multiple times, and I'm only interested in the last state of their academic title.
E.g. the correct answer would be:
1x Prof. Dr.
1x Dr. Med
Thanks in advance!
I tried this command:
Stammdaten_academic <- Stammdaten |> arrange(ui, academic_title) |> distinct(ui, .keep_all = TRUE)
Stammdaten_academic is the dataframe where every politician is only mentioned once (similar to what a group-by would do).
Stammdaten is the original dataframe with multiple occurrences of each politician.
Result:
I got the academic title that was mentioned in the first occurring row for each politician.
Problem:
I would like to receive the last state of everyone's academic title!
library(dplyr)
Stammdaten_academic <- Stammdaten |>
  group_by(ui) |>
  arrange(electoral_term) |>
  slice(n())
This should give you the n-th row of each group (ui), where n is the number of rows in that group, i.e. the row with the latest electoral term. (Note it has to be slice(n()), not slice(n).)
Academic titles are progressive, and a person does not stop being a doctor once they have earned the title.
I believe this solves your problem:
# create your data frame
df <- data.frame(ui = c(1, 1, 1, 2, 2, 3),
                 electoral_term = c(1, 2, 3, 3, 4, 4),
                 academic_title = c(NA, "Dr.", "Prof. Dr.", "Dr. Med.", "Dr. Med.", NA))

# get the latest titles
titles <- df |>
  dplyr::group_by(ui) |>
  dplyr::summarise_at(dplyr::vars(electoral_term), max) |>
  dplyr::left_join(df, by = c("ui", "electoral_term")) |>
  tidyr::drop_na()  # in case you don't want the people without a title

# count occurrences
table(titles$academic_title)
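The "latest electoral term per politician" step can also be written with slice_max instead of the summarise/join round trip (a sketch, assuming dplyr >= 1.0.0):

```r
library(dplyr)

df <- data.frame(ui = c(1, 1, 1, 2, 2, 3),
                 electoral_term = c(1, 2, 3, 3, 4, 4),
                 academic_title = c(NA, "Dr.", "Prof. Dr.", "Dr. Med.", "Dr. Med.", NA))

latest <- df %>%
  group_by(ui) %>%
  slice_max(electoral_term, n = 1, with_ties = FALSE) %>%  # last term per politician
  ungroup()

# count the latest titles; table() drops the NA entries by default
table(latest$academic_title)
```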

Calculate number of years worked with different end dates

Consider the following two datasets. The first dataset describes an id variable that identifies a person and the date when his or her unemployment benefits starts.
The second dataset shows the number of service years, which makes it possible to calculate the maximum entitlement period. More precisely, each year denotes a dummy variable, which is equal to one if someone built up unemployment benefit rights in that year (i.e. if someone worked), and zero otherwise.
df1 <- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221))
colnames(df1) <- c("id", "start_UI")
df1$start_UI <- as.character(df1$start_UI)
df1$start_UI <- as.Date(df1$start_UI, "%Y%m%d")

df2 <- data.frame(c("R005", "R006", "R007"), c(1,1,1), c(1,1,1), c(0,1,1), c(1,0,1), c(1,0,1))
colnames(df2) <- c("id", "worked2010", "worked2011", "worked2012", "worked2013", "worked2014")
Just to summarize the information from the above two datasets. We see that person R005 worked in the years 2010 and 2011. In 2012 this person filed for Unemployment insurance. Thereafter person R005 works again in 2013 and 2014 (we see this information in dataset df2). When his unemployment spell started in 2012, his entitlement was based on the work history before he got unemployed. Hence, the work history is equal to 2. In a similar vein, the employment history for R006 and R007 is equal to 3 and 5, respectively (for R007 we assume he worked in 2014 as he only filed for unemployment benefits in December of that year. Therefore the number is 5 instead of 4).
Now my question is how I can merge these two datasets effectively such that I can get the following table
df_final<- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221), c(2,3,5))
colnames(df_final)<-c("id", "start_UI", "employment_history")
id start_UI employment_history
1 R005 20120610 2
2 R006 20130115 3
3 R007 20141221 5
I tried using "aggregate", but in that case I also include work history after the year someone filed for unemployment benefits and that is something I do not want. Does anyone have an efficient way how to combine the information from the two above datasets and calculate the unemployment history?
I appreciate any help.
base R
You can use Reduce with accumulate = TRUE to flag every year from the first non-working year onward, then count the years before that flag:
df2$employment_history <- apply(df2[, -1], 1, function(x) sum(!Reduce(any, x == 0, accumulate = TRUE)))
merge(df1, df2[c("id", "employment_history")])
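To see what the Reduce call is doing, here is a trace on R005's work-history row:

```r
x <- c(1, 1, 0, 1, 1)  # R005: worked 2010-2011, unemployed 2012, worked 2013-2014

# cumulative "have we seen a zero yet?" flag
hit_zero <- Reduce(any, x == 0, accumulate = TRUE)
hit_zero        # FALSE FALSE TRUE TRUE TRUE

# years worked before the first zero
sum(!hit_zero)  # 2
```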
dplyr
Or use the built-in dplyr::cumany function (pivot_longer comes from tidyr):
library(tidyr)
df2 %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  summarise(employment_history = sum(value[!cumany(value == 0)])) %>%
  left_join(df1, .)
Output
id start_UI employment_history
1 R005 2012-06-10 2
2 R006 2013-01-15 3
3 R007 2014-12-21 5

Separating geographical data strings in R

I'm working with QECW data from BLS and would like to make the geographical data included more useful. I want to split the column "area_title" into different columns - one with the area's name, one with the level of the area, and one with the state.
I got a good start using separate:
qecw <- qecw %>% separate(area_title, c("county", "geography level", "state"))
The problem is that the geographical data are arranged into strings in a variety of ways, so they are not uniform enough to separate cleanly. The area_title column includes names in formats that separate pretty cleanly, like:
area_title
Alabama -- Statewide
Autauga County, Alabama
which splits pretty well into
county geography level state
Alabama Statewide NA
Autauga County Alabama
but this breaks down for cases like:
area_title
Aleutians West Census Area, Alaska
Chattanooga-Cleveland-Dalton TN-GA-AL CSA
U.S. Combined statistical Areas, combined
as well as any states, counties or other place names that have more than one word.
I can go case-by-case to fix these, but I would appreciate a more efficient solution.
The exact data I'm using is "2019.q1-q3 10 10 Total, all industries," available at the link under "Current year quarterly data grouped by industry".
Thanks!
So far I came up with this:
I can get a place name by selecting a substring of area_title with everything to the left of the first comma:
qecw <- qecw %>% mutate(location = sub(",.*","", qecw$area_title))
Then I have a series of nested if_else statements to create a location type:
mutate(`Location Type` =
         if_else(str_detect(area_title, "Statewide"), "State",
         if_else(str_detect(area_title, "County"), "County",
         if_else(str_detect(area_title, "CSA"), "CSA",
         if_else(str_detect(area_title, "MSA"), "MSA",
         if_else(str_detect(area_title, "MicroSA"), "MicroSA",
         if_else(str_detect(area_title, "Undefined"), "Undefined",
                 "other")))))))
This isn't a complete answer; I think I'm still missing some location types, and I haven't come up with a good way to extract state names yet.
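For the state piece, here is a rough sketch (extract_state is a hypothetical helper; it only covers the comma and "-- Statewide" patterns, so the multi-state CSA/MSA abbreviation strings still need their own rule):

```r
titles <- c("Alabama -- Statewide",
            "Autauga County, Alabama",
            "Aleutians West Census Area, Alaska",
            "Chattanooga-Cleveland-Dalton TN-GA-AL CSA")

extract_state <- function(x) {
  ifelse(grepl(",", x),
         trimws(sub(".*,", "", x)),           # text after the last comma
         ifelse(grepl("-- Statewide", x),
                trimws(sub(" --.*", "", x)),  # name before the "--"
                NA_character_))               # CSAs etc. left unresolved
}

extract_state(titles)  # "Alabama" "Alabama" "Alaska" NA
```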

filter based on numerous variables

I'm trying to filter a large data set by a few different variables. Here's a dummy dataset to show what I mean:
df <- data.frame(game_id = c(1,1,2,2,3,3,4,4,5,5,6,6),
                 team = c("a","a","a","a","a","a","b","b","b","b","b","b"),
                 play_id = c(1,2,1,2,1,2,1,2,1,2,1,2),
                 value = c(.2,.6,.9,.7,.5,.5,.4,.6,.5,.9,.2,.8),
                 play_type = c("run","pass","pass","pass","run","pass",
                               "run","run","pass","pass","run","run"),
                 qtr = c(1,1,1,1,1,1,1,1,1,1,1,1))
Where:
game_id = unique identifier of a matchup between two teams
team = designates which team is on offense. Two teams are assigned to each game_id, and there are over 30 teams total in the real dataset
play_id = sequential number of individual plays in a game (each game has about 100 plays total, split between the teams)
value = at any point in the game, this value is the % chance the team on offense has of winning the game
play_type = strategy used by the offense of that play
qtr = 4 quarters in a complete game
My goal is find all games in which either team in a matchup had a value of at least .8 at any point in qtr 1, the trick being I want to mark all the plays leading up to that team's advantage and compare what percentage of them used the "run" strategy vs. "pass" strategy.
I was able to isolate the teams with such an advantage here:
types = c("run","pass")
df <- df %>%
  filter(play_type %in% types, qtr == 1, value > .79) %>%
  distinct(game_id, team)
but I'm racking my brain over how to use that result; a for loop doesn't work because the data frames aren't the same size.
Ideally, I'd create a new data frame with only games in which this .8 value occurs at any point in qtr 1 for either team and then has a variable that assigns which team had that advantage for all play_ids leading up to this advantage.
Hopefully this made sense. thank you all!
Could you inner join from your 'summary' df?
df2 <- df %>%
  filter(play_type %in% types, qtr == 1, value > .79) %>%
  distinct(game_id, team)

inner_join(df, df2)
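Building on that join, here is a sketch of the full workflow the question describes, using the dummy data: keep the matchups where a team reaches a value of at least .8 in qtr 1, cut each group at the first such play, and compute the run/pass shares of the plays leading up to it (assumes play_id orders the plays; the names lead_up and strategy_share are made up here):

```r
library(dplyr)

df <- data.frame(game_id = c(1,1,2,2,3,3,4,4,5,5,6,6),
                 team = c("a","a","a","a","a","a","b","b","b","b","b","b"),
                 play_id = c(1,2,1,2,1,2,1,2,1,2,1,2),
                 value = c(.2,.6,.9,.7,.5,.5,.4,.6,.5,.9,.2,.8),
                 play_type = c("run","pass","pass","pass","run","pass",
                               "run","run","pass","pass","run","run"),
                 qtr = rep(1, 12))

lead_up <- df %>%
  group_by(game_id, team) %>%
  arrange(play_id, .by_group = TRUE) %>%
  filter(any(value >= .8 & qtr == 1)) %>%                       # teams with an .8 moment in qtr 1
  filter(row_number() <= which(value >= .8 & qtr == 1)[1]) %>%  # plays up to and including it
  ungroup()

strategy_share <- lead_up %>%
  count(game_id, team, play_type) %>%
  group_by(game_id, team) %>%
  mutate(share = n / sum(n)) %>%  # fraction of lead-up plays per strategy
  ungroup()
```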

Filter factor variable based on counts

I have a dataframe containing house price data, with price and lots of variables. One of these variables is a "sub-area" for the property, and I am trying to incorporate this into various regressions. However, it is a factor variable with almost 3000 levels.
For example:
table(df$sub_area)
 La Jolla  Carlsbad Escondido
        2         5         1
..etc
I want to filter out those places that have only 1 count, since they don't offer much predictive power but add lots of computation time. However, I want to replace the sub_area entry for that property with blank or NA, since I still want to use the rest of the information for that property, such as bedrooms, bathrooms, etc.
For reference, an individual property entry might look like:
ID Beds Baths City Sub_area sqm... etc
1 4 2 San Diego La Jolla 100....
Then I can do
lm(price ~ beds + baths + city + sub_area)
under the new, smaller sub_area variable with fewer levels.
I want to do this because most of the predictive price power is contained in sub_area for the locations I'm working on.
One way:
# keep sub-areas that occur more than once (raise the threshold to drop rarer levels too)
areas <- names(which(table(df$Sub_area) > 1))
df$Sub_area[!df$Sub_area %in% areas] <- NA
Create a new dataframe with the number of occurrences of each sub-area and keep the sub-areas that occur at least twice.
Then set sub_area to NA in the original dataframe wherever it does not appear in the filtered sub_area_count.
library(dplyr)
sub_area_count <- df %>%
  count(sub_area) %>%
  filter(n > 1)

boo <- !df$sub_area %in% sub_area_count$sub_area
df$sub_area[boo] <- NA
You didn't give a reproducible example, but I think this will work for identifying the places where the count is 1:
count_1 <- as.data.frame(table(df$sub_area))
count_1 <- count_1$Var1[which(count_1$Freq == 1)]
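To finish that approach, the levels collected in count_1 can then be blanked out in the original dataframe (a sketch on made-up data, assuming the column is called sub_area):

```r
df <- data.frame(sub_area = c("La Jolla", "La Jolla", "Carlsbad", "Escondido"),
                 price = c(900, 950, 700, 400))

count_1 <- as.data.frame(table(df$sub_area))
count_1 <- count_1$Var1[which(count_1$Freq == 1)]

# keep the rows, but drop the singleton sub-area labels
df$sub_area[df$sub_area %in% count_1] <- NA
df$sub_area  # "La Jolla" "La Jolla" NA NA
```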
