Calculate number of years worked with different end dates - r

Consider the following two datasets. The first dataset contains an id variable that identifies a person and the date when his or her unemployment benefits start.
The second dataset shows the number of service years, which makes it possible to calculate the maximum entitlement period. More precisely, each year is a dummy variable that equals one if someone built up unemployment benefit rights in that year (i.e. if someone worked) and zero otherwise.
df1<-data.frame( c("R005", "R006", "R007"), c(20120610, 20130115, 20141221))
colnames(df1)<-c("id", "start_UI")
df1$start_UI<-as.character(df1$start_UI)
df1$start_UI<-as.Date(df1$start_UI, "%Y%m%d")
df2<-data.frame( c("R005", "R006", "R007"), c(1,1,1), c(1,1,1), c(0,1,1), c(1,0,1), c(1,0,1) )
colnames(df2)<-c("id", "worked2010", "worked2011", "worked2012", "worked2013", "worked2014")
To summarize the information in the two datasets: person R005 worked in 2010 and 2011. In 2012 this person filed for unemployment insurance. Thereafter person R005 worked again in 2013 and 2014 (this information is in dataset df2). When his unemployment spell started in 2012, his entitlement was based on his work history before he became unemployed. Hence, his work history is equal to 2. In a similar vein, the employment history for R006 and R007 is equal to 3 and 5, respectively (for R007 we assume he worked in 2014, as he only filed for unemployment benefits in December of that year; therefore the number is 5 instead of 4).
Now my question is how I can merge these two datasets effectively such that I can get the following table
df_final<- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221), c(2,3,5))
colnames(df_final)<-c("id", "start_UI", "employment_history")
    id start_UI employment_history
1 R005 20120610                  2
2 R006 20130115                  3
3 R007 20141221                  5
I tried using "aggregate", but in that case I also include work history after the year someone filed for unemployment benefits, which I do not want. Does anyone have an efficient way to combine the information from the two datasets above and calculate the employment history?
I appreciate any help.

base R
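You can use Reduce with accumulate = TRUE. It flags every year from the first zero onward, so counting the unflagged years gives the number of years worked before the first gap.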
df2$employment_history <- apply(df2[,-1], 1, function(x) sum(!Reduce(any, x==0, accumulate = TRUE)))
merge(df1, df2[c("id","employment_history")])
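To see what the Reduce call does, here is a minimal sketch on a single row (the values of person R005). The names x and seen_zero are only illustrative, and the cumprod line is an equivalent shortcut that assumes the columns contain only 0 and 1.

```r
# One row of df2 (person R005): worked 2010-2011, gap in 2012, worked 2013-2014
x <- c(worked2010 = 1, worked2011 = 1, worked2012 = 0,
       worked2013 = 1, worked2014 = 1)

# Running "any": TRUE from the first zero onward
seen_zero <- Reduce(any, x == 0, accumulate = TRUE)

# Years before the first zero
history_reduce <- sum(!seen_zero)

# Equivalent for 0/1 data: the running product drops to 0 at the first zero
history_cumprod <- sum(cumprod(x))
```

Both history_reduce and history_cumprod count only the unbroken run of ones at the start of the row.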
dplyr
Or use the built-in dplyr::cumany function:
library(dplyr)
library(tidyr)

df2 %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  summarise(employment_history = sum(value[!cumany(value == 0)])) %>%
  left_join(df1, .)
Output
    id   start_UI employment_history
1 R005 2012-06-10                  2
2 R006 2013-01-15                  3
3 R007 2014-12-21                  5
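Both approaches stop counting at the first zero, which happens to match the expected table for this sample. If you would rather cut off at the filing year explicitly, a base R sketch might look like the following. It assumes that the filing year itself counts (as the question assumes for R007); the names merged, year_cols, col_years and mask are illustrative.

```r
df1 <- data.frame(
  id = c("R005", "R006", "R007"),
  start_UI = as.Date(c("20120610", "20130115", "20141221"), "%Y%m%d")
)
df2 <- data.frame(
  id = c("R005", "R006", "R007"),
  worked2010 = c(1, 1, 1), worked2011 = c(1, 1, 1), worked2012 = c(0, 1, 1),
  worked2013 = c(1, 0, 1), worked2014 = c(1, 0, 1)
)

merged <- merge(df1, df2, by = "id")

# Recover the calendar year from each "workedYYYY" column name
year_cols <- grep("^worked", names(merged), value = TRUE)
col_years <- as.integer(sub("^worked", "", year_cols))
start_year <- as.integer(format(merged$start_UI, "%Y"))

# mask[i, j] is TRUE when column j's year is not after person i's filing year
mask <- outer(start_year, col_years, ">=")

# Sum only the years up to and including the filing year
merged$employment_history <- rowSums(as.matrix(merged[year_cols]) * mask)
result <- merged[c("id", "start_UI", "employment_history")]
```

Unlike the first-zero rule, this counts every worked year up to the filing year, so the two approaches differ for someone with an earlier gap in employment.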

Related

Selective choice of tuples with partially matching characteristics in R

I have a dataset with data about political careers.
Every politician has a unique identifier number (ui) and can occur in multiple electoral terms (electoral_terms). Every electoral term equals a period of 4 years in which the politician is in office.
Now I would like to find out which academic titles (academic_title) occur in the dataset and how often they occur.
The problem is that every politician is potentially mentioned multiple times and I'm only interested in the last state of their academic title.
E.g. the correct answer would be:
1x Prof. Dr.
1x Dr. Med
Thanks in advance!
I tried this Command:
Stammdaten_academic <- Stammdaten |> arrange(ui, academic_title) |> distinct(ui, .keep_all = TRUE)
Stammdaten_academic is the dataframe where every politician is only mentioned once (similar to what a group-by command would do).
Stammdaten is the original dataframe with multiple occurrences of each politician.
Result:
I got the academic title that was mentioned in the first occurring row of each politician.
Problem:
I would like to receive the last state of everyone's academic title!
library(dplyr)
Stammdaten_academic <- Stammdaten |>
  group_by(ui) |>
  arrange(electoral_term) |>
  slice(n())
This should give you the last row of each group (ui), because n() is the number of rows in that group.
Academic titles are progressive and a person does not stop being a doctor or such.
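For comparison, the same "last row per politician" idea can be sketched in base R without dplyr. The sample data frame below is made up for illustration, and the names sorted, last_rows and title_counts are hypothetical.

```r
# Illustrative data: three politicians, several electoral terms each
Stammdaten <- data.frame(
  ui = c(1, 1, 1, 2, 2, 3),
  electoral_term = c(1, 2, 3, 3, 4, 4),
  academic_title = c(NA, "Dr.", "Prof. Dr.", "Dr. Med.", "Dr. Med.", NA)
)

# Sort so that the latest term of each politician comes last within its group
sorted <- Stammdaten[order(Stammdaten$ui, Stammdaten$electoral_term), ]

# Keep only the last row per ui
last_rows <- sorted[!duplicated(sorted$ui, fromLast = TRUE), ]

# Count titles; table() ignores NA by default
title_counts <- table(last_rows$academic_title)
```

Politicians whose latest row has no title drop out of the count because table() skips NA unless told otherwise.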
I believe this solves your problem
# create your data frame
df <- data.frame(ui = c(1,1,1,2,2,3),
                 electoral_term = c(1,2,3,3,4,4),
                 academic_title = c(NA, "Dr.","Prof. Dr.","Dr. Med.","Dr. Med.", NA))
# get latest titles
titles <- df |>
  dplyr::group_by(ui) |>
  dplyr::summarise_at(dplyr::vars(electoral_term), max) |>
  dplyr::left_join(df, by = c("ui", "electoral_term")) |>
  tidyr::drop_na() # in case you don't want the people without title
# count occurrences
table(titles$academic_title)

TABLEAU: How can I measure similarity of sets of dimensions across dates?

this is a bit of a complicated one - but I'll do my best to explain. I have a dataset comprised of data that I scrape from a particular video on demand interface every day. Each day there are around 120 titles on display (a grid of 12 x 10) - the data includes a range of variables: date of scrape, title of programme, vertical/horizontal position of programme, genre, synopsis, etc.
One of the things I want to do is analyse the similarity of what's on offer on a day-to-day basis. What I mean by this is that I want to compare how many of the titles on a given day appeared on the previous date (ideally expressed as a percentage). So if 40 (out of 120) titles were the same as the previous day, the similarity would be about 33%.
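As a rough illustration of that measure, here is a minimal base R sketch on made-up data. The list titles_by_day and its contents are hypothetical; each element holds the titles shown on one day.

```r
# Hypothetical titles on display for three consecutive days
titles_by_day <- list(
  "2018-08-13" = c("A", "B", "C", "D"),
  "2018-08-14" = c("A", "B", "E", "F"),
  "2018-08-15" = c("A", "E", "F", "G")
)
days <- names(titles_by_day)

# For each day after the first: share of today's titles also shown yesterday
similarity <- sapply(seq_along(days)[-1], function(i) {
  today <- unique(titles_by_day[[i]])
  yesterday <- unique(titles_by_day[[i - 1]])
  100 * length(intersect(today, yesterday)) / length(today)
})
names(similarity) <- days[-1]
```

Restricting the comparison to particular grid positions would amount to filtering each day's titles before this step.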
Here's the thing - I know how to do this (thanks to some kindly stranger on this very site who helped me write a script using R). You can see the post here which gives some more detail: Calculate similarity within a dataframe across specific rows (R)
However, this method creates a similarity score based on the total number of titles on a day-to-day basis whereas I also want to be able to explore the similarity after applying other filters. Specifically, I want to narrow the focus to titles that appear within the first four rows and columns. In other words: how many of these titles are the same as the previous day in those positions? I could do this by modifying the R script, but it seems that the better way would be to do this within Tableau so that I can change these parameters in "real-time", so to speak. I.e. if I want to focus on the top 6 rows and columns I don't want to have to run the R script all over again and update the underlying data!
It feels as though I'm missing something very obvious here - maybe it's a simple table calculation? Or I need to somehow tell Tableau how to subset the data?
Hopefully this all makes sense, but I'm happy to clarify if not. Also, I can't provide you the underlying data (for research reasons!) but I can provide a sample if it would help.
Thanks in advance :)
You can have the best of both worlds. Use Tableau to connect to your data, filter as desired, then have Tableau call an R script to calculate similarity and return the results to Tableau for display.
If this fits your use case, you need to learn the mechanics to put this into play. On the Tableau side, you’ll be using the functions that start with the word SCRIPT to call your R code, for example SCRIPT_REAL(), or SCRIPT_INT() etc. Those are table calculations, so you’ll need to learn how table calculations work, in particular with regard to partitioning and addressing. This is described in the Tableau help. You’ll also have to point Tableau at the host for your R code, by managing external services under the Help->Settings and Performance menu.
On the R side, you’ll have to write your function of course, and then use the function Rserve() from the Rserve package to make it accessible to Tableau. Tableau sends vectors of arguments to R and expects a vector in response. The partitioning and addressing mentioned above control the size and ordering of those vectors.
It can be a bit tricky to get the mechanics working, but they do work. Practice on something simple first.
See Tableau’s web site resources for more information. The official name for this functionality is Tableau “Analytics Extensions”.
I am sharing a strategy to solve this in R.
Step-1 Load the libraries and data
library(tidyverse)
library(lubridate)
movies <- tibble(read.csv("movies.csv"))
movies$date <- as.Date(movies$date, format = "%d-%m-%Y")
Set the rows and columns you want to restrict your similarity search to in two variables. Say you are restricting the search to 5 columns and 4 rows only:
filter_for_row <- 4
filter_for_col <- 5
Getting final result
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>% # restrict search to designated rows and columns
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>% # remove duplicate titles screened on any given day
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>% # check whether it was screened the previous day
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = TRUE),
            similarity_percent = similar_movies / total_movies_displayed)
# A tibble: 3 x 4
  date       total_movies_displayed similar_movies similarity_percent
  <date>                      <int>          <dbl>              <dbl>
1 2018-08-13                     17              0              0
2 2018-08-14                     17             10              0.588
3 2018-08-15                     17              9              0.529
If you change the filters to 12, 12 respectively, then
filter_for_row <- 12
filter_for_col <- 12
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>%
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>%
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>%
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = TRUE),
            similarity_percent = similar_movies / total_movies_displayed)
# A tibble: 3 x 4
  date       total_movies_displayed similar_movies similarity_percent
  <date>                      <int>          <dbl>              <dbl>
1 2018-08-13                     68              0              0
2 2018-08-14                     75             61              0.813
3 2018-08-15                     72             54              0.75
Good Luck
As Alex has suggested, you can have the best of both worlds. But to the best of my knowledge, Tableau Desktop interfaces with R (or Python etc.) only through calculated fields, i.e. SCRIPT_INT, SCRIPT_REAL and so on. At present these functions create calculated fields as table calculations, which in Tableau work only in the context of the view. We cannot hard-code these values (fields/columns), so we are not at liberty to use them independently of context. Moreover, table calculations in Tableau can neither be aggregated further nor mixed with LOD expressions. Thus, in your use case (again, to the best of my knowledge), you can build a parameter-dependent view in Tableau after hard-coding the values with a programming language of your choice. I therefore suggest creating a new column in your dataset before importing it into Tableau, by running the following (or an equivalent in your preferred language):
movies_edited <- movies %>% group_by(Title) %>%
mutate(similarity = ifelse(lag(date)== date - lubridate::days(1), 1, 0)) %>%
ungroup()
write.csv(movies_edited, "movies_edited.csv")
This creates a new column named similarity in the dataset, where 1 denotes that the title was available on the previous day, 0 denotes that it was not screened on the immediately preceding day, and NA means it is the first day of its screening.
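One detail worth making explicit: a lag-based flag only works if the rows of each title are in date order. The base R sketch below, on made-up data, sorts first and then builds the same 1 / 0 / NA coding. The column names Title and date follow the code above, while prev_date and the sample rows are illustrative.

```r
# Hypothetical screenings: title A skips 2018-08-15
movies <- data.frame(
  Title = c("A", "A", "A", "B", "B"),
  date = as.Date(c("2018-08-13", "2018-08-14", "2018-08-16",
                   "2018-08-14", "2018-08-15"))
)

# Sort by title and date so "previous row" means "previous screening"
movies <- movies[order(movies$Title, movies$date), ]

# Previous screening date per title (NA for a title's first screening)
prev_date <- ave(as.numeric(movies$date), movies$Title,
                 FUN = function(d) c(NA, head(d, -1)))

# 1 = also shown the day before, 0 = not shown the day before, NA = first screening
movies$similarity <- as.integer(as.numeric(movies$date) - prev_date == 1)
```

Dates are converted to day counts so that the comparison is a plain numeric difference of one day.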
I have imported this dataset in tableau and created a parameter dependent view, as you desired.

Removing duplicates where the relationship is only clear by comparing lines (relative reference in R)

The situation: I have some data about contracts and how many acres are covered by each contract in a given year. The contracts I am dealing with have an obnoxious naming convention: contract renewals keep the same name, with 'a', 'b', 'c', etc. appended after the number.
Because contracts can be renewed at any time, calculating the acreage in a given year means that there is double-counting when the renewal begins. Some example data might help to explain:
example <- data.frame(contract = c('c300a', 'c300b'),
                      true_contract = c('c300', 'c300'),
                      acres_2007 = c(100, 0),
                      acres_2008 = c(100, 100),
                      acres_2009 = c(0, 100)
)
print(example)
  contract true_contract acres_2007 acres_2008 acres_2009
1    c300a          c300        100        100          0
2    c300b          c300          0        100        100
As you can see, if the transition from 300a to 300b happened on (for example) May 20, 2008, then there is double-counting in 2008. Those 100 acres are the same piece of land. I would like a way to remove one of the 100s - it doesn't matter which, since both contracts are functionally "the same".
I can tell the problem by looking at it, but am completely puzzled about how I would address the issue using R. In fact, I have always been at a loss about how to deal with data issues where the relationship is only clear from looking at lines that are next to each other. This is a very Excel-style mindset (relative reference) but I am not good at Excel/VBA. In addition, I come up against problems like this often enough that figuring out how to map this problem to R solutions would help me a lot.
Here's a general solution that applies a rule to all contracts in all years. The rule I used was "For each contract-year with more than one contract, keep the largest one, and if more than one at that size, keep the later one."
library(dplyr); library(tidyr)
example %>%
  # Split contract name into two, putting last letter/digit into new column
  separate(contract, c("contract", "renewal_ltr"), sep = -1) %>%
  # Gather into long form to make counting easier
  gather(year_col, acres, -c(contract:true_contract)) %>%
  # Optional: extract year from year_col; dropped below but might be of use.
  mutate(year = readr::parse_number(year_col)) %>%
  # For contracts with more than one value in a year, keep the larger one,
  # or if tied, keep the later one
  group_by(contract, year_col) %>%
  arrange(year, desc(acres), desc(renewal_ltr)) %>%
  slice(1) %>% # Keep top row per group
  ungroup() %>%
  # Optional: spread back
  select(-year) %>%
  spread(year_col, acres, fill = 0)
Output
# A tibble: 2 x 6
  contract renewal_ltr true_contract acres_2007 acres_2008 acres_2009
  <chr>    <chr>       <fct>              <dbl>      <dbl>      <dbl>
1 c300     a           c300                 100          0          0
2 c300     b           c300                   0        100        100
If I understood correctly, you want to remove one of the duplicated 100s from the acres_2008 column. This keeps the first value in the acres_2008 column and replaces the other with 0:
example$acres_2008 <- ave(
example$acres_2008,
example$true_contract,
FUN = function(a) replace(a, duplicated(a), 0)
)
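The result with your example is:
  contract true_contract acres_2007 acres_2008 acres_2009
1    c300a          c300        100        100          0
2    c300b          c300          0          0        100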
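The same idea can be extended to every acreage column at once. This is only a sketch: it assumes that equal acreage values within one true_contract in the same year always refer to the same land, which may not hold in real data. The name acre_cols is illustrative.

```r
example <- data.frame(
  contract = c("c300a", "c300b"),
  true_contract = c("c300", "c300"),
  acres_2007 = c(100, 0),
  acres_2008 = c(100, 100),
  acres_2009 = c(0, 100)
)

# All columns holding yearly acreage
acre_cols <- grep("^acres_", names(example), value = TRUE)

# Within each true_contract, zero out repeated acreage values in every year
example[acre_cols] <- lapply(example[acre_cols], function(col) {
  ave(col, example$true_contract,
      FUN = function(a) replace(a, duplicated(a), 0))
})

# Acreage per year after removing the double count
yearly_totals <- colSums(example[acre_cols])
```

After this step each year sums to 100 acres for contract c300 in the example.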

Table of average score of peer per percentile

I'm quite a newbie in R so I was interested in the optimality of my solution. Even if it works it could be (a bit) long and I wanted your advice to see if the "way I solved it" is "the best" and it could help me to learn new techniques and functions in R.
I have a dataset on students identified by their id and I have the school where they are matched and the score they obtained at a specific test (so for short: 3 variables id,match and score).
I need to construct the following table: for students between two percentiles of score, I need to calculate the average (across students) of the average score of the school they are matched to. So for each school I take the average score of the students matched to it, and then I calculate the average of this average within each percentile class (yes, the average of a school can appear more than once in this calculation). In plain English, it allows me to answer: "A student belonging to the x-th percentile in terms of score will on average be matched to a school of this average quality".
Here is an example in the picture:
So in that case, if I take the median (15) for the split (rather than percentiles) I would like to obtain:
[0,15] : 9.5
(15,24] : 20.25
So for students with a score between 0 and 15, I take the average of the average score of the schools they are matched to (note that b's average will appear twice, but that's ok).
Here is how I did it:
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
scoreQuant <- cut(score,quantile(score,probs=seq(0,1,0.1),na.rm=TRUE))
AvgeSchScore <- tapply(score,match,mean,na.rm=TRUE)
AvgScore <- 0
for(i in 1:length(score)) {
AvgScore[i] <- AvgeSchScore[match[i]]
}
results <- tapply(AvgScore,scoreQuant,mean,na.rm = TRUE)
Is there a more direct way of doing it? I think the weak point is the loop; maybe apply() is better? But I'm not sure how to use it here (I tried to code my own function but it crashed, so I brute-forced it).
Thanks :)
The main fix is to eliminate the for loop with:
AvgScore <- AvgeSchScore[match]
R allows you to subset in ways that many other languages do not. The tapply function returns a result named by the levels of the factor you grouped by. We use the values of match as names to subset AvgeSchScore.
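Put together, a minimal sketch of the loop-free version looks like this. It uses the median split from the question (breaks at 0, 15 and 24) rather than deciles, and include.lowest = TRUE is added here so that the lowest score is not dropped from the first interval.

```r
match <- c("a", "b", "a", "b", "c")
score <- c(18, 4, 15, 8, 24)

# Average score per school, named by school
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)

# Look up each student's school average by name (no loop needed);
# as.vector() drops the array attributes left over from tapply
AvgScore <- as.vector(AvgeSchScore[match])

# Average school quality per score class
scoreClass <- cut(score, breaks = c(0, 15, 24), include.lowest = TRUE)
results <- tapply(AvgScore, scoreClass, mean, na.rm = TRUE)
```

With these inputs results reproduces the figures from the question: 9.5 for [0,15] and 20.25 for (15,24].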
data.table
If you would like to try data.table you may see speed improvements.
library(data.table)
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
dt <- data.table(id=1:5, match, score)
scoreQuant <- cut(dt$score,quantile(dt$score,probs=seq(0,1,0.1),na.rm=TRUE))
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
# scoreQuant V1
#1: (17.4,19.2] 16.5
#2: NA 6.0
#3: (12.2,15] 16.5
#4: (7.2,9.4] 6.0
#5: (21.6,24] 24.0
It may be faster than base R. If the value in the NA row bothers you, you can delete it after.

getting the max() of a data frame under certain conditions

I have a rather large dataframe with 13 variables. Here is the first line just to give an idea:
prov_code nuts1 nuts1name nuts2 nuts2name prov_geoorder prov_name NUTS_ID EDAD year ORDER graphs value prov_geo
1. 15 1 NW 11 Galicia 1 La Corunna ES111 11 1975 1 1 0.000000000 La Corunna
I would like to obtain the maximum for a certain set of variables according to a combination of variables year ORDER and prov_code (ie, f_all being my data.frame: f_all[(f_all$year==1975)&(f_all$ORDER==1)&(f_all$prov_code=="1"),] ). The goal is to repeat the operation in order to obtain a new data frame containing all the maximum values for each year, ORDER, prov_code.
Is there a simple and quick way to do this?
Thanks for any suggestion on the matter,
There are several ways of doing this, for example the one @James mentions. I want to suggest using plyr:
library(plyr)
ddply(f_all, .(year, ORDER, prov_code), summarise, mx_value = max(value))
Alternatively, if you have a lot of data, data.table provides similar functionality and is typically much faster in that case.
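If you prefer not to load an extra package, base R's aggregate() can produce the same kind of grouped maximum. The small f_all below is made-up data with only the relevant columns, and max_by_group is an illustrative name.

```r
# Hypothetical subset of f_all with only the grouping columns and the value
f_all <- data.frame(
  year = c(1975, 1975, 1975, 1976, 1976),
  ORDER = c(1, 1, 2, 1, 1),
  prov_code = c(15, 15, 15, 15, 15),
  value = c(0.0, 0.3, 0.7, 0.2, 0.5)
)

# One row per (year, ORDER, prov_code) combination with the maximum of value
max_by_group <- aggregate(value ~ year + ORDER + prov_code,
                          data = f_all, FUN = max)
```

Note that the formula interface of aggregate() drops rows with missing values by default, which may or may not be what you want.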
