Filter based on numerous variables in R

I'm trying to filter a large data set by a few different variables. Here's a dummy dataset to show what I mean:
df <- data.frame(game_id = c(1,1,2,2,3,3,4,4,5,5,6,6),
                 team = c("a","a","a","a","a","a","b","b","b","b","b","b"),
                 play_id = c(1,2,1,2,1,2,1,2,1,2,1,2),
                 value = c(.2,.6,.9,.7,.5,.5,.4,.6,.5,.9,.2,.8),
                 play_type = c("run","pass","pass","pass","run","pass","run","run","pass","pass","run","run"),
                 qtr = c(1,1,1,1,1,1,1,1,1,1,1,1))
Where:
game_id = unique identifier of a matchup between two teams
team = designates which team is on offense; two teams are assigned to each game_id, and there are over 30 teams total in the real dataset
play_id = sequential number of individual plays in a game (each game has about 100 plays total, split between the teams)
value = the offense's % chance of winning the game at that point in the game
play_type = strategy used by the offense on that play
qtr = quarter of the game (a complete game has 4 quarters)
My goal is to find all games in which either team in a matchup had a value of at least .8 at any point in qtr 1. The trick is that I want to mark all the plays leading up to that team's advantage and compare what percentage of them used the "run" strategy vs. the "pass" strategy.
I was able to isolate the teams with such an advantage here:
types <- c("run","pass")
df <- df %>%
  filter(play_type %in% types, qtr == 1, value > .79) %>% # the dummy data calls this column value, not wp
  distinct(game_id, team)
but I'm racking my brain over how to use that to serve my needs. A for loop doesn't work because the datasets aren't the same size.
Ideally, I'd create a new data frame with only the games in which this .8 value occurs at any point in qtr 1 for either team, plus a variable that marks which team had that advantage for all play_ids leading up to it.
Hopefully this made sense. Thank you all!

Could you inner join from your 'summary' df?
df2 <- df %>%
  filter(play_type %in% types, qtr == 1, value > .79) %>%
  distinct(game_id, team)
# joins on game_id and team, keeping every play by a team that hit the threshold in qtr 1
inner_join(df, df2)
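Building on that join, here is a hedged sketch of the fuller goal, assuming df still holds the original, unfiltered data and that play_id orders plays within a game, as described above: flag the advantaged team, keep only the plays up to and including its first value above .8 in qtr 1, then tally run vs. pass among those lead-up plays.
library(dplyr)

# Which team(s) hit the threshold in qtr 1, per game
advantage <- df %>%
  filter(qtr == 1, value > .79) %>%
  distinct(game_id, team) %>%
  rename(adv_team = team)

# Keep plays up to (and including) the advantaged team's first .8+ play;
# if both teams in a game qualify, each gets its own group of rows
leadup <- df %>%
  inner_join(advantage, by = "game_id") %>%
  group_by(game_id, adv_team) %>%
  filter(play_id <= min(play_id[team == adv_team & qtr == 1 & value > .79])) %>%
  ungroup()

# Share of run vs. pass among the lead-up plays
leadup %>%
  count(game_id, adv_team, play_type) %>%
  group_by(game_id, adv_team) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()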


Selective choice of tuples with partially matching characteristics in R

I have a dataset with data about political careers.
Every politician has a unique identifier number (ui) and can appear in multiple electoral terms (electoral_terms). Every electoral term equals a period of 4 years in which the politician is in office.
Now I would like to find out which academic titles (academic_title) occur in the dataset and how often they occur.
The problem is that every politician potentially appears multiple times, and I'm only interested in the last state of their academic title.
E.g. the correct answer would be:
1x Prof. Dr.
1x Dr. Med
Thanks in advance!
I tried this command:
Stammdaten_academic <- Stammdaten |> arrange(ui, academic_title) |> distinct(ui, .keep_all = TRUE)
Stammdaten_academic is the dataframe where every politician appears only once (similar to what a group-by would do).
Stammdaten is the original dataframe with multiple occurrences of each politician.
Result:
I got the academic title that was mentioned in the first occurring row for each politician.
Problem:
I would like to receive the last state of everyone's academic title!
library(dplyr)
Stammdaten_academic <- Stammdaten |>
  group_by(ui) |>
  arrange(electoral_term) |>
  slice(n())
This gives you the n()'th (i.e., last) row of each group (ui), where n() is the number of rows in that group.
Academic titles are cumulative, and a person does not stop being a doctor or the like. I believe this solves your problem:
# create your data frame
df <- data.frame(ui = c(1, 1, 1, 2, 2, 3),
                 electoral_term = c(1, 2, 3, 3, 4, 4),
                 academic_title = c(NA, "Dr.", "Prof. Dr.", "Dr. Med.", "Dr. Med.", NA))
# get latest titles
titles <- df |>
  dplyr::group_by(ui) |>
  dplyr::summarise_at(dplyr::vars(electoral_term), max) |>
  dplyr::left_join(df, by = c("ui", "electoral_term")) |>
  tidyr::drop_na() # in case you don't want the people without a title
# count occurrences
table(titles$academic_title)
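For what it's worth, dplyr (>= 1.0) has a one-step verb for "latest row per group"; a minimal equivalent sketch:
# take the row with the highest electoral_term per politician directly
titles <- df |>
  dplyr::group_by(ui) |>
  dplyr::slice_max(electoral_term, n = 1, with_ties = FALSE) |>
  dplyr::ungroup() |>
  tidyr::drop_na()
table(titles$academic_title)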

Calculate number of years worked with different end dates

Consider the following two datasets. The first dataset contains an id variable that identifies a person and the date when his or her unemployment benefits start.
The second dataset shows the number of service years, which makes it possible to calculate the maximum entitlement period. More precisely, each year is a dummy variable equal to one if someone built up unemployment benefit rights in that year (i.e., if the person worked), and zero otherwise.
df1 <- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221))
colnames(df1) <- c("id", "start_UI")
df1$start_UI <- as.character(df1$start_UI)
df1$start_UI <- as.Date(df1$start_UI, "%Y%m%d")
df2 <- data.frame(c("R005", "R006", "R007"), c(1,1,1), c(1,1,1), c(0,1,1), c(1,0,1), c(1,0,1))
colnames(df2) <- c("id", "worked2010", "worked2011", "worked2012", "worked2013", "worked2014")
Just to summarize the information in the two datasets above: person R005 worked in the years 2010 and 2011. In 2012 this person filed for unemployment insurance. Thereafter, person R005 worked again in 2013 and 2014 (we see this information in dataset df2). When the unemployment spell started in 2012, the entitlement was based on the work history before becoming unemployed, so the work history equals 2. In a similar vein, the employment history for R006 and R007 equals 3 and 5, respectively (for R007 we assume he worked in 2014, since he only filed for unemployment benefits in December of that year; therefore the number is 5 instead of 4).
Now my question is how I can merge these two datasets effectively such that I can get the following table
df_final<- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221), c(2,3,5))
colnames(df_final)<-c("id", "start_UI", "employment_history")
id start_UI employment_history
1 R005 20120610 2
2 R006 20130115 3
3 R007 20141221 5
I tried using aggregate, but then I also include work history after the year someone filed for unemployment benefits, which is something I do not want. Does anyone have an efficient way to combine the information from the two datasets above and calculate the employment history?
I appreciate any help.
base R
You should use Reduce with accumulate = TRUE.
# for each row, count the worked years before the first 0
df2$employment_history <- apply(df2[,-1], 1, function(x) sum(!Reduce(any, x == 0, accumulate = TRUE)))
merge(df1, df2[c("id", "employment_history")])
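To see what the Reduce call is doing, trace it on R005's row:
x <- c(1, 1, 0, 1, 1)                         # R005: worked2010..worked2014
Reduce(any, x == 0, accumulate = TRUE)        # FALSE FALSE TRUE TRUE TRUE
sum(!Reduce(any, x == 0, accumulate = TRUE))  # 2 worked years before the first 0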
dplyr
Or use the built-in dplyr::cumany function (pivot_longer is from tidyr):
library(dplyr)
library(tidyr)
df2 %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  summarise(employment_history = sum(value[!cumany(value == 0)])) %>%
  left_join(df1, .)
Output
id start_UI employment_history
1 R005 2012-06-10 2
2 R006 2013-01-15 3
3 R007 2014-12-21 5

Removing duplicates where the relationship is only clear by comparing lines (relative reference in R)

The situation: I have some data about contracts, and how many acres are covered by a contract in a given year. The contracts I am dealing with follow an obnoxious naming convention whereby renewals keep the same name with 'a', 'b', 'c', etc. appended after the number.
Because contracts can be renewed at any time, calculating the acreage in a given year means that there is double-counting when the renewal begins. Some example data might help to explain:
example <- data.frame(contract = c('c300a', 'c300b'),
                      true_contract = c('c300', 'c300'),
                      acres_2007 = c(100, 0),
                      acres_2008 = c(100, 100),
                      acres_2009 = c(0, 100))
print(example)
contract true_contract acres_2007 acres_2008 acres_2009
1 c300a c300 100 100 0
2 c300b c300 0 100 100
As you can see, if the transition from 300a to 300b happened on (for example) May 20, 2008, then there is double-counting in 2008. Those 100 acres are the same piece of land. I would like a way to remove one of the 100s - it doesn't matter which, since both contracts are functionally "the same".
I can tell the problem by looking at it, but I am completely puzzled about how to address it in R. In fact, I have always been at a loss about how to deal with data issues where the relationship is only clear from comparing adjacent lines. This is a very Excel-style mindset (relative references), but I am not good at Excel/VBA. I run into problems like this often enough that learning how to map them to R solutions would help me a lot.
Here's a general solution that applies a rule to all contracts in all years. The rule I used was "For each contract-year with more than one contract, keep the largest one, and if more than one at that size, keep the later one."
library(dplyr); library(tidyr)
example %>%
  # Split contract name into two, putting last letter/digit into new column
  separate(contract, c("contract", "renewal_ltr"), sep = -1) %>%
  # Gather into long form to make counting easier
  gather(year_col, acres, -c(contract:true_contract)) %>%
  # Optional: extract year from year_col; dropped below but might be of use.
  mutate(year = readr::parse_number(year_col)) %>%
  # For contracts with more than one value in a year, keep the larger one,
  # or if tied, keep the later one
  group_by(contract, year_col) %>%
  arrange(year, desc(acres), desc(renewal_ltr)) %>%
  slice(1) %>% # Keep top row per group
  ungroup() %>%
  # Optional: spread back
  select(-year) %>%
  spread(year_col, acres, fill = 0)
Output
# A tibble: 2 x 6
contract renewal_ltr true_contract acres_2007 acres_2008 acres_2009
<chr> <chr> <fct> <dbl> <dbl> <dbl>
1 c300 a c300 100 0 0
2 c300 b c300 0 100 100
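gather() and spread() still work but are superseded; for reference, here is a sketch of the same pipeline with current tidyr verbs (assuming tidyr >= 1.0):
library(dplyr); library(tidyr)
example %>%
  separate(contract, c("contract", "renewal_ltr"), sep = -1) %>%
  pivot_longer(starts_with("acres_"), names_to = "year_col", values_to = "acres") %>%
  group_by(contract, year_col) %>%
  # larger acreage first; on ties, the later renewal letter first
  arrange(desc(acres), desc(renewal_ltr), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup() %>%
  pivot_wider(names_from = year_col, values_from = acres, values_fill = 0)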
If I understood correctly, you want to remove one of the duplicated 100s in the acres_2008 column. This keeps the first value within each true_contract and replaces the others with 0:
example$acres_2008 <- ave(
example$acres_2008,
example$true_contract,
FUN = function(a) replace(a, duplicated(a), 0)
)
The result with your example is:
  contract true_contract acres_2007 acres_2008 acres_2009
1    c300a          c300        100        100          0
2    c300b          c300          0          0        100

Filter factor variable based on counts

I have a dataframe containing house price data, with price and lots of variables. One of these variables is a "sub-area" for the property, and I am trying to incorporate this into various regressions. However, it is a factor variable with almost 3000 levels.
For example:
table(df$sub_area)
 La Jolla  Carlsbad Esconsido
        2         5         1
...etc
I want to filter out those places that have only 1 count, since they don't offer much predictive power but add lots of computation time. However, I want to replace the sub_area entry for that property with blank or NA, since I still want to use the rest of the information for that property, such as bedrooms, bathrooms, etc.
For reference, an individual property entry might look like:
ID Beds Baths City Sub_area sqm... etc
1 4 2 San Diego La Jolla 100....
Then I can do
lm(price ~ beds + baths + city + sub_area)
under the new, smaller sub_area variable with fewer levels.
I want to do this because most of the predictive price power is contained in sub_area for the locations I'm working on.
One way:
areas <- names(which(table(df$Sub_area) > 1)) # keep sub-areas that occur more than once; raise the threshold as needed
df$Sub_area[!df$Sub_area %in% areas] <- NA
Create a new dataframe with the number of occurrences for each subarea and keep the subareas that occur at least twice.
Then add NAs to the original dataframe if the subarea does not appear in the filtered sub_area_count.
library(dplyr)
sub_area_count <- df %>%
  count(sub_area) %>%
  filter(n > 1)
boo <- !df$sub_area %in% sub_area_count$sub_area
df$sub_area[boo] <- NA
You didn't give a reproducible example, but I think this will work for identifying the places where the count == 1:
count_1 <- as.data.frame(table(df$sub_area))
count_1 <- count_1$Var1[which(count_1$Freq==1)]
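Whichever approach you use, note that assigning NA does not remove the now-empty factor levels; here is a short sketch of the cleanup before refitting (column names as in the question, where sub_area is a factor):
# drop the unused levels so lm() doesn't carry thousands of empty sub_area levels
df$sub_area <- droplevels(df$sub_area)
fit <- lm(price ~ beds + baths + city + sub_area, data = df)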

dplyr merge two datasets for Finished Goods Sales and Bill of Materials

I am working on an analysis of the need for raw materials in my company. The approach I am taking is to combine the sales records of finished goods with the bill of materials for each finished good. The problem I am having is that each finished product consists of multiple components, and many finished products share common components. I am trying to keep all individual sales records for each finished good and multiply UnitsSold by the unit Qty of each component to get the demand for raw materials. Here is code for sample datasets:
library(dplyr)
fg_Sales <- data_frame(FG_PartNumber = rep(c("A","B","C"), 2),
                       Order_Date = seq.Date(as.Date("2011-1-1"), as.Date("2012-1-10"), length.out = 6),
                       FG_UnitsSold = c(100,200,300,400,500,600))
bill_materials <- data_frame(FG_PartNumber = rep(c("A","B","C"), 4),
                             Components = c("C1","C2","C3","C4","C5","C6","C7","C7","C7","C8","C8","C9"),
                             Qty = rnorm(n = 12, mean = 3, sd = 1)) %>%
  arrange(FG_PartNumber)
I am familiar with left_join in dplyr, but it doesn't seem to work here: it always gives me only the first component for each finished product.
Can anyone kindly help with this?
Thanks.
Perhaps I am not understanding the question, but if you group your two data frames by FG_PartNumber and make pivot tables of the quantities you are interested in, you can get the totals you are looking for:
# Create data
library(dplyr)
set.seed(1)
fg_Sales <- data_frame(FG_PartNumber = rep(c("A","B","C"), 2),
                       Order_Date = seq.Date(as.Date("2011-1-1"), as.Date("2012-1-10"), length.out = 6),
                       FG_UnitsSold = c(100,200,300,400,500,600))
bill_materials <- data_frame(FG_PartNumber = rep(c("A","B","C"), 4),
                             Components = c("C1","C2","C3","C4","C5","C6","C7","C7","C7","C8","C8","C9"),
                             Qty = rnorm(n = 12, mean = 3, sd = 1)) %>%
  arrange(FG_PartNumber)
# make pivot tables for sales and quantity
tot_sales <- fg_Sales %>%
  group_by(FG_PartNumber) %>%
  summarise(tot_sales = sum(FG_UnitsSold))
tot_materials <- bill_materials %>%
  group_by(FG_PartNumber) %>%
  summarise(tot_qty = sum(Qty))
# join the pivot tables together
df <- left_join(tot_sales, tot_materials)
> df
# A tibble: 3 × 3
FG_PartNumber tot_sales tot_qty
<chr> <dbl> <dbl>
1 A 500 13.15087
2 B 700 14.76326
3 C 900 11.30953
I think inner_join from dplyr is the best choice here:
library(dplyr)
fg_Sales_ext <- inner_join(x = fg_Sales,
                           y = bill_materials,
                           by = "FG_PartNumber")
From the inner_join documentation: "If there are multiple matches between x and y, all combinations of the matches are returned."
With fg_Sales_ext you can perform any kind of analysis now with group_by and summarise.
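From there, here is a sketch of the raw-material calculation the question describes (column names as in the sample data): multiply each sale's FG_UnitsSold by the component's Qty per unit, then sum per component.
# demand per component = units sold x qty per unit, summed over all sales
raw_demand <- fg_Sales_ext %>%
  mutate(component_demand = FG_UnitsSold * Qty) %>%
  group_by(Components) %>%
  summarise(total_demand = sum(component_demand))
raw_demand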
