Use R to count values in a number of different columns - r

I have a dataset of patents, where I have recorded 1) the month and year associated with a patent renewal and 2) whether the patent holder chose to pay the patent fee or let the patent lapse. So
patentid fee1date fee1paid fee2date fee2paid
1 May 2010 True May 2013 False
2 May 2010 True April 2014 True
What I want to do is develop a count of the number of renewals by month and by year, as well as the number of abandoned patents, as follows:
date renewed lapsed
May 2010 2 0
How might I count the data that I have now? Thank you!
EDIT: The key point is to aggregate these across different columns. The issue that I am running into now is that when I try using the count library, it treats 2 renewals in May 2010 as two separate values.

Using dplyr
require(tidyr)
require(dplyr)
data %>% gather(year,value, -Patent.ID) %>%
separate('year',c('Fee','N','Act')) %>%
spread(Act,value) %>%
unite(Fee, Fee,N, sep = '.') %>%
group_by(Date) %>%
summarise(R=sum(Paid=='True'), NotR=sum(Paid=='False'))
# A tibble: 3 x 3
Date R NotR
<chr> <int> <int>
1 April 2014 1 0
2 May 2010 2 0
3 May 2013 0 1
Data
data <- read.table(text="
'Patent ID' 'Fee 1 Date' 'Fee 1 Paid' 'Fee 2 Date' 'Fee 2 Paid'
1 'May 2010' True 'May 2013' False
2 'May 2010' True 'April 2014' True
",header=T, stringsAsFactors = F)
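For comparison, here is a minimal base R sketch of the same idea: stack the date/paid column pairs into one long table, then count per date. It assumes the unquoted column names from the question (fee1date, fee1paid, ...) and that the paid columns are logical; the names long and counts are made up for illustration.

```r
# Example data, using the column names from the question (assumed)
data <- data.frame(
  patentid = c(1, 2),
  fee1date = c("May 2010", "May 2010"),
  fee1paid = c(TRUE, TRUE),
  fee2date = c("May 2013", "April 2014"),
  fee2paid = c(FALSE, TRUE)
)

# Stack the (date, paid) pairs so every fee event is one row
long <- data.frame(
  date = c(data$fee1date, data$fee2date),
  paid = c(data$fee1paid, data$fee2paid)
)

# Count renewals and lapses per date
renewed <- tapply(long$paid, long$date, sum)
lapsed <- tapply(!long$paid, long$date, sum)
counts <- data.frame(
  date = names(renewed),
  renewed = as.vector(renewed),
  lapsed = as.vector(lapsed)
)
print(counts)
```

Because the two fee columns are stacked before counting, two renewals in May 2010 end up in the same group instead of being treated as separate values.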

Related

Start and end of events assign to month of start based on condition

I have data for 2000 events, with the start date, end date and length of each event.
What I am trying to do is find the frequency of events by month and year. Several of the events are split between two consecutive months (say May and June), and I want each of these to be counted in the month in which it lasts longer. If an event is split equally between two months, it should be counted in the month in which it starts.
Eg:
> date01[1:5,9:11]
# A tibble: 5 x 3
StrD EndD EvLength
<date> <date> <drtn>
1 1993-12-30 1994-01-01 3 days # this would be reported Dec frequency
2 2000-07-23 2000-08-02 11 days # this would be reported July frequency
3 2001-02-28 2001-03-01 2 days # this would be reported Feb frequency (as it started in Feb)
4 2006-05-29 2006-06-01 4 days # this would be reported May frequency (as it started in May)
5 2010-07-30 2010-08-04 6 days
I tried to use group_by (from dplyr), but still not able to figure it out.
Convert the dates to Date format with ymd() from the lubridate package.
Compute the days spent in the start month and in the end month with the days_in_month() function and basic arithmetic. Note that the start day is counted, hence the +1 for the start month.
Pick the month that has more days with ifelse(); ties go to the start month.
Get the abbreviation of the month with month.abb[Month].
Take the year from the same date as the month, so that an event starting in December but assigned to January gets the following year.
Group and summarise.
library(dplyr)
library(lubridate)
df %>%
  mutate(across(c(StrD, EndD), ymd)) %>%
  mutate(prev_month_days = days_in_month(StrD) - day(StrD) + 1,
         next_month_days = day(EndD)) %>%
  # TRUE when the event is assigned to its start month (ties included)
  mutate(use_start = prev_month_days >= next_month_days,
         Month = month.abb[ifelse(use_start, month(StrD), month(EndD))],
         Year = ifelse(use_start, year(StrD), year(EndD))) %>%
  group_by(Year, Month) %>%
  summarise(n = n())
Output:
Year Month n
<int> <chr> <int>
1 1993 Dec 1
2 2000 Jul 1
3 2001 Feb 1
4 2006 May 1
5 2010 Aug 1
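The same assignment rule can also be sketched in base R without lubridate. This is only an illustration under the assumption that an event spans at most two consecutive months; the data frame events and the helper variables are hypothetical names, and the rows are the five example events from the question.

```r
events <- data.frame(
  StrD = as.Date(c("1993-12-30", "2000-07-23", "2001-02-28",
                   "2006-05-29", "2010-07-30")),
  EndD = as.Date(c("1994-01-01", "2000-08-02", "2001-03-01",
                   "2006-06-01", "2010-08-04"))
)

# First day of the month in which each event ends
first_of_end <- as.Date(format(events$EndD, "%Y-%m-01"))
same_month <- format(events$StrD, "%Y-%m") == format(events$EndD, "%Y-%m")

# Days spent in the start month and in the end month (start day counted)
days_start <- ifelse(same_month,
                     as.numeric(events$EndD - events$StrD) + 1,
                     as.numeric(first_of_end - events$StrD))
days_end <- ifelse(same_month, 0,
                   as.numeric(events$EndD - first_of_end) + 1)

# Assign to the end month only when it is strictly longer (ties -> start)
assigned <- events$StrD
use_end <- days_end > days_start
assigned[use_end] <- events$EndD[use_end]

# Year and month both come from the assigned date
ym <- format(assigned, "%Y-%m")
print(table(ym))
```

Using a single "%Y-%m" key keeps the year and month consistent and avoids locale-dependent month names.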

How to sample from smaller data frame with multiple conditions to a larger data frame?

I have a main dataset df.main with 3 sites, each with 3 sub-sites, that were sampled over three months. I have a separate dataset, df.sample, with some abiotic variables for a single month ONLY, but for each site I have three values from each of the sub-sites. I need to add the abiotic column to my original dataset. For every month and each sub-site, I want to SAMPLE (with replacement) one of the three values available for that sub-site.
set.seed(111)
##Main Data Set
month <- rep(c("Jan","Feb","Mar"), each =9 )
site <- rep(c("1","2","3","1","2","3","1","2","3"), each = 3)
sub.site <- rep(c(1,2,3,1,2,3,1,2,3), time = 3 )
df.main <- data.frame(month, site, sub.site)
month site sub.site
Jan 1 1
Jan 1 2
Jan 1 3
Jan 2 1
Jan 2 2
... .. ..
Mar 3 3
##Sampler Data Set
site <- rep(c(1,2,3), time = 9)
sub.site <- rep(c(1,1,1,2,2,2,3,3,3), each = 3)
abiotic <- rnorm(27,7,1)
df.sample <- data.frame(site, sub.site, abiotic)
site sub.site abiotic
1 1 7.175096
1 1 8.805868
1 1 6.783571
1 2 7.910917
1 2 7.202307
1 2 8.404883
...
2 1 7.122915
2 1 6.152732
...
3 1 7.978232
3 1 6.870228
##Desired Output in df.main
month site sub.site abiotic
Jan 1 1 8.805868
Jan 1 2 7.910917
You can do a full join of the two tables using site and sub.site, and then just sample one row from each month, site and sub.site combination.
If you are unsure about table joins (full join, left join, etc.), you may want to look them up online. They are simple and standard, for example when querying databases.
In your case, df.main has 27 rows and each site and sub.site combination has 3 rows in df.sample, so after the full join you will have 27 x 3 = 81 rows:
joining.output <- df.main %>%
full_join(df.sample, by = c("site", "sub.site"))
> joining.output
month site sub.site abiotic
1 Jan 1 1 7.235221
2 Jan 1 1 4.697654
3 Jan 1 1 5.502573
...
28 Feb 1 1 7.235221
29 Feb 1 1 4.697654
30 Feb 1 1 5.502573
...
55 Mar 1 1 7.235221
56 Mar 1 1 4.697654
57 Mar 1 1 5.502573
Then to sample 1 row for each site and sub.site combination for each month, just group by the 3 variables and sample.
Here is the code that puts everything together:
output <- df.main %>%
full_join(df.sample, by = c("site", "sub.site")) %>%
group_by(month, site, sub.site) %>%
slice_sample(n=1)
P.S. In your example, df.main$site is a character vector but df.sample$site is numeric, so the join columns have incompatible types. You may want to convert the character vector with as.numeric() before joining.
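If you prefer to avoid building the 81-row intermediate table, a base R sketch along the following lines should also work. It assumes site and sub.site have the same type in both data frames (integers here) and rebuilds the example data in sorted order; key() and pools are hypothetical helper names.

```r
set.seed(111)

df.main <- data.frame(
  month = rep(c("Jan", "Feb", "Mar"), each = 9),
  site = rep(rep(1:3, each = 3), times = 3),
  sub.site = rep(1:3, times = 9)
)
df.sample <- data.frame(
  site = rep(1:3, each = 9),
  sub.site = rep(rep(1:3, each = 3), times = 3),
  abiotic = rnorm(27, 7, 1)
)

# One lookup key per site/sub.site combination
key <- function(d) paste(d$site, d$sub.site, sep = "_")

# Pool of the three abiotic values available for each combination
pools <- split(df.sample$abiotic, key(df.sample))

# For every row of df.main, draw one value from its own pool;
# draws are independent, so the same value can recur across months
df.main$abiotic <- vapply(key(df.main), function(k) {
  pool <- pools[[k]]
  pool[sample.int(length(pool), 1)]
}, numeric(1), USE.NAMES = FALSE)

head(df.main)
```

Indexing with sample.int() rather than calling sample() on the pool directly avoids the surprise that sample(x, 1) treats a length-one numeric x as a range.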

How to do a frequency table where column values are variables?

I have a DF named JOB. In that DF i have 4 columns. Person_ID; JOB; FT (full time or part time with values of 1 for full time and 2 for part time) and YEAR. Every person can have only 1 full time job per year in this DF. This is the full time job they got most of their income during the year.
DF
PERSON_ID JOB FT YEAR
1 Analyst 1 2018
1 Analyst 1 2019
1 Analyst 1 2020
2 Coach 1 2018
2 Coach 1 2019
2 Analyst 1 2020
3 Gardener 1 2020
4 Coach 1 2018
4 Coach 1 2019
4 Analyst 1 2020
4 Coach 2 2019
4 Gardener 2 2019
I want to get different frequency in the lines of the following question:
What full time job changes occurred from 2019 and 2020?
I want to look only at changes where FT=1.
I want my end table to look like this
2019 2020 frequency
Analyst Analyst 1
Coach Analyst 2
NA Gardener 1
I want to look at the data so that I can say that 2 people moved from their coaching job to an analyst job, 1 analyst did not change jobs, and 1 person entered the labour market as a gardener.
I tried to fiddle around with the table function but did not even get close to what i wanted. I could not get the YEAR's to go to separate variables.
10 Bonus points if i can do it in base R :)
Thank you for your help
Not pretty but worked:
# split df by year
df_2019 <- df[df$YEAR %in% c(2019) & df$FT == 1, ]
df_2020 <- df[df$YEAR %in% c(2020) & df$FT == 1, ]
# rename Job columns
df_2019$JOB_2019 <- df_2019$JOB
df_2020$JOB_2020 <- df_2020$JOB
# select needed columns
df_2019 <- df_2019[, c("PERSON_ID", "JOB_2019")]
df_2020 <- df_2020[, c("PERSON_ID", "JOB_2020")]
# merge dfs
df2 <- merge(df_2019, df_2020, by = "PERSON_ID", all = TRUE)
df2$frequency <- 1
df2$JOB_2019 <- addNA(df2$JOB_2019)
df2$JOB_2020 <- addNA(df2$JOB_2020)
# aggregate frequency
aggregate(frequency ~ JOB_2019 + JOB_2020, data = df2, FUN = sum, na.action=na.pass)
JOB_2019 JOB_2020 frequency
1 Analyst Analyst 1
2 Coach Analyst 2
3 <NA> Gardener 1
Not base R, but it worked:
library(dplyr)
library(tidyr)
data %>%
filter(FT==1, YEAR %in% c(2019, 2020)) %>%
group_by(YEAR, JOB, PERSON_ID) %>%
tally() %>%
pivot_wider(names_from = YEAR, values_from = JOB) %>%
select(-PERSON_ID) %>%
group_by(`2019`, `2020`) %>%
summarise(n = n())
`2019` `2020` n
<chr> <chr> <int>
1 Analyst Analyst 1
2 Coach Analyst 2
3 NA Gardener 1
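Since the question mentions table(), here is a hedged base R sketch that gets the years into separate variables with match() and then cross-tabulates them. The data frame is rebuilt from the question; y19, y20, ids and changes are names chosen for illustration.

```r
df <- data.frame(
  PERSON_ID = c(1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 4),
  JOB = c("Analyst", "Analyst", "Analyst", "Coach", "Coach", "Analyst",
          "Gardener", "Coach", "Coach", "Analyst", "Coach", "Gardener"),
  FT = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2),
  YEAR = c(2018, 2019, 2020, 2018, 2019, 2020, 2020, 2018, 2019, 2020,
           2019, 2019)
)

# Full-time rows only, split by year
ft <- df[df$FT == 1, ]
y19 <- ft[ft$YEAR == 2019, ]
y20 <- ft[ft$YEAR == 2020, ]

# Everyone with a full-time job in either year
ids <- union(y19$PERSON_ID, y20$PERSON_ID)

# Look up each person's job per year; NA when there is none
job_2019 <- y19$JOB[match(ids, y19$PERSON_ID)]
job_2020 <- y20$JOB[match(ids, y20$PERSON_ID)]

# Cross-tabulate, keep NA as its own level, drop empty combinations
tab <- as.data.frame(table(job_2019, job_2020, useNA = "ifany"),
                     stringsAsFactors = FALSE)
changes <- tab[tab$Freq > 0, ]
print(changes)
```

This relies on each person having at most one full-time job per year, as stated in the question.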

R Expand time series data based on start and end point

I think I have a pretty simple request. I have the following dataframe, where "place" is a unique identifier, while start_date and end_date may overlap. The values are unique for each ID "place".
place start_date end_date value
1 2007-09-01 2010-10-12 0.5
2 2013-09-27 2015-10-11 0.7
...
What I need is to create a year-based variable, where I expand the time series by year: each calendar year (starting from the first of January, e.g. 2011-01-01) starts a new row for that particular "place" and "value". I mean something like this:
place year value
1 2007 0.5
1 2008 0.5
1 2009 0.5
1 2010 0.5
2 2013 0.7
2 2014 0.7
2 2015 0.7
...
There are some cases with overlap (ie. "place"=1 & "year"=2007) for two separate cases, where one observations starts with one year and the other observation continues from that year. In that case I would prefer the "value" that ends on that specific year. So if one observation for place=1 ends with 2007 in March and another place=1 starts with 2007 in April, year=2007 value for place=1 would be marked with the previous "ending" value if that makes sense.
I've only gotten this far:
library(data.table)
data <- data.table(dat)
data[, `:=`(start_date = as.Date(start_date), end_date = as.Date(end_date))]
data[,num_mons:= length(seq(from=start_date, to=end_date, by='year')),by=1:nrow(data)]
I guess writing a loop makes the most sense?
Thank you for your help and advice.
Using a tidyverse solution could look like:
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
data <- tibble(place = c(1, 2),
start_date = c('2007-09-01',
'2013-09-27'),
end_date = c('2010-10-12',
'2015-10-11'),
value = c(0.5, 0.7))
data %>%
  # build the sequence of years between the start year and the end year
  mutate(year = map2(start_date,
                     end_date,
                     ~ as.character(str_extract(.x, '\\d{4}'):
                                    str_extract(.y, '\\d{4}')))) %>%
  # one row per year in the list column
  unnest(year) %>%
  select(place, year, value)
# place year value
# <dbl> <chr> <dbl>
# 1 1 2007 0.5
# 2 1 2008 0.5
# 3 1 2009 0.5
# 4 1 2010 0.5
# 5 2 2013 0.7
# 6 2 2014 0.7
# 7 2 2015 0.7
I'm having problems understanding the third paragraph of your question ("There are ..."). It seems to me to be a separate question. If that is the case, please consider moving the question to a separate post here on SO. If it is not a separate question, please reformulate the paragraph.
You could do the following:
library(lubridate)
library(tidyverse)
df %>%
group_by(place) %>%
mutate(year = list(seq(year(ymd(start_date)), year(ymd(end_date)))))%>%
unnest(year)%>%
select(place,year,value)
# A tibble: 7 x 3
# Groups: place [2]
place year value
<int> <int> <dbl>
1 1 2007 0.5
2 1 2008 0.5
3 1 2009 0.5
4 1 2010 0.5
5 2 2013 0.7
6 2 2014 0.7
7 2 2015 0.7
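Regarding the overlap paragraph of the question, one possible reading is: when two observations for the same place cover the same year, keep the one that started earlier (the one ending in that year). A base R sketch under that assumption follows; the first row of dat is a made-up observation added only to create an overlap in 2007.

```r
dat <- data.frame(
  place = c(1, 1, 2),
  start_date = as.Date(c("2005-02-01", "2007-04-01", "2013-09-27")),
  end_date = as.Date(c("2007-03-15", "2010-10-12", "2015-10-11")),
  value = c(0.3, 0.5, 0.7)  # 0.3 is hypothetical, for the overlap case
)

start_year <- as.integer(format(dat$start_date, "%Y"))
end_year <- as.integer(format(dat$end_date, "%Y"))
n_years <- end_year - start_year + 1L

# Repeat each row once per covered year
idx <- rep(seq_len(nrow(dat)), times = n_years)
expanded <- data.frame(
  place = dat$place[idx],
  year = start_year[idx] + sequence(n_years) - 1L,
  value = dat$value[idx],
  start_date = dat$start_date[idx]
)

# For duplicated place/year pairs keep the earliest-starting observation
expanded <- expanded[order(expanded$place, expanded$year,
                           expanded$start_date), ]
expanded <- expanded[!duplicated(expanded[c("place", "year")]),
                     c("place", "year", "value")]
print(expanded)
```

Here place 1 keeps the value 0.3 for 2007 because the observation ending in March 2007 started first; no loop is needed.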

Grouping the Data in a data frame based on conditions from more than 1 columns

Problem Description :
I am trying to calculate recency: for each salesman, the most recent value in the Year column where the target-achieved indicator equals 1. If the indicator is 0 for every year available for a salesman, the minimum year should be chosen instead.
Data:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
1 AA-5468 2012 1
2 AA-5468 2013 0
3 AA-5468 2014 0
4 AA-5468 2015 0
5 AA-5468 2016 1
6 AL-3791 2012 1
7 AL-3791 2013 1
8 AL-3791 2014 0
9 AL-3893 2015 0
10 AL-3893 2016 0
Expected Output:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
<chr> <dbl> <dbl>
1 AA-5468 2016 1
2 AL-3791 2013 1
9 AL-3893 2015 0
Using the package tidyverse I suggest you the following code:
library(tidyverse)
Prashant_df <- data.frame(
c("AA-5468","AA-5468","AA-5468","AA-5468","AA-5468","AL-3791","AL-3791","AL-3791","AL-3893","AL-3893"),
c(2012,2013,2014,2015,2016,2012,2013,2014,2015,2016),
c(1,0,0,0,1,1,1,0,0,0)
)
names(Prashant_df) <- c("Salesman_ID","Year","Yearly_Targets_Achieved_Indicator")
Prashant_df_collapsed <- Prashant_df %>%
  group_by(Salesman_ID) %>%
  # latest year among the achieved years; earliest year if none was achieved
  summarise(Year = if (any(Yearly_Targets_Achieved_Indicator == 1))
                     max(Year[Yearly_Targets_Achieved_Indicator == 1])
                   else min(Year),
            Yearly_Targets_Achieved_Indicator = max(Yearly_Targets_Achieved_Indicator))
You can store both maximum and minimum year for each salesman, and the maximum of your binary variable.
newdf = df %>% group_by(Salesman_ID) %>% summarise(
maximum = max(Year),
minimum = min(Year),
maxInd = max(Yearly_Targets_Achieved_Indicator))
From this you can pretty much construct your resulting variable.
Using Base R:
c(by(dat,dat[1],function(x)if(all(x[,3]==0)) x[1,2] else max(x[which(x[,3]==1),2])))
AA-5468 AL-3791 AL-3893
2016 2013 2015
This code is somewhat messy but produces the desired output. Here is the explanation:
First group by Salesman_ID. Then, for each group, check whether all the indicators are zero. If so, return the first year; otherwise, return the latest (maximum) year among the rows where the indicator is 1.
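The rule described above can be spelled out more explicitly in base R. This is a sketch, not the one-liner's exact implementation: it uses min(year) rather than the first row, so it does not depend on the rows being sorted. recency_year, sales and result are hypothetical names.

```r
sales <- data.frame(
  Salesman_ID = c(rep("AA-5468", 5), rep("AL-3791", 3), rep("AL-3893", 2)),
  Year = c(2012, 2013, 2014, 2015, 2016, 2012, 2013, 2014, 2015, 2016),
  Yearly_Targets_Achieved_Indicator = c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0)
)

# Latest achieved year, or the earliest year when nothing was achieved
recency_year <- function(year, achieved) {
  if (any(achieved == 1)) max(year[achieved == 1]) else min(year)
}

groups <- split(sales, sales$Salesman_ID)
result <- data.frame(
  Salesman_ID = names(groups),
  Year = vapply(groups, function(g) {
    recency_year(g$Year, g$Yearly_Targets_Achieved_Indicator)
  }, numeric(1), USE.NAMES = FALSE),
  Yearly_Targets_Achieved_Indicator = vapply(groups, function(g) {
    max(g$Yearly_Targets_Achieved_Indicator)
  }, numeric(1), USE.NAMES = FALSE)
)
print(result)
```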
