Set age ranges in a cohort dataset - R

I'm unsure how to define age ranges for calculating an incidence ratio in a cohort dataset. More specifically, my data comprise individuals who entered a specific cohort between 2008 and 2018, and this dataset was used as a reference to merge, by id number, with hospitalization information from another data source.
My data look like this:
id = c(1:5)
year_of_entry = c(2008, 2009, 2011, 2015, 2016)
age_of_entry = c(8, 10, 40, 20, 30)
year_birth = c(2000, 1999, 1971, 1995, 1986)
hospitalization_year = c(2009, NA, 2015, 2017, NA)
age_hospitalization = c(9, NA, 44, 22, NA)

data = data.frame(
  id = id,
  'Year of Entry' = year_of_entry,
  'Age of Entry' = age_of_entry,
  'Year of Birth' = year_birth,
  'Hospitalization Year' = hospitalization_year,
  'Age of Hospitalization' = age_hospitalization
)
> data
  id Year.of.Entry Age.of.Entry Year.of.Birth Hospitalization.Year Age.of.Hospitalization
1  1          2008            8          2000                 2009                      9
2  2          2009           10          1999                   NA                     NA
3  3          2011           40          1971                 2015                     44
4  4          2015           20          1995                 2017                     22
5  5          2016           30          1986                   NA                     NA
My next step is to run a linear regression to study determinants of admission by age group (i.e. 0-10; 11-20; 21-30; 31-40; 41-50), but I'm not sure what criteria to use to create these age groups, given that people entered the cohort in different periods, at different ages, and were admitted at different times. Additionally, as you can see in the example above, my dataset also includes individuals who have never been admitted.
Can anyone help me solve this?
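A hedged starting point (an addition, not from the original thread): base R's cut() can bucket an age column into the bands named above. The sketch below assumes the groups should reflect age at cohort entry; if follow-up can cross band boundaries, person-time would need to be split across bands instead (e.g. with survival::survSplit), which is beyond this snippet.

# Minimal sketch, assuming age groups are based on age at entry;
# swap in Age.of.Hospitalization if groups should reflect age at the event.
breaks <- c(0, 10, 20, 30, 40, 50)
labels <- c("0-10", "11-20", "21-30", "31-40", "41-50")
data$age_group <- cut(data$Age.of.Entry,
                      breaks = breaks, labels = labels,
                      include.lowest = TRUE, right = TRUE)

With right = TRUE and integer ages, the intervals are [0,10], (10,20], ..., which matches the 0-10 / 11-20 labeling.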

Related

Creating a new variable from the ranges of another column in which the ranges change - R

I am a beginner in R, so sorry if this is a very simple question; I searched but could not find the same problem.
I want to create a new variable from the ranges of another column in R, but the ranges are not the same for each row.
To be more specific, my data cover the years 1960-2000 and I have value ranges for occupations. For 1960 to 1980, a teacher is 1 and a lawyer is 2, etc. For 1980-1990, a teacher is in the value range 1-29 and a lawyer is 50-89, etc. Finally, for 1990-2000, the value range for the teacher is 40-65 and for the lawyer it is 1-39.
I don't even know how to begin (teacher and lawyer are not the only occupations; there are 10 different occupations, with overlapping value ranges for different years, which makes it very confusing for me).
I would appreciate your help. Thank you very much.
Here are a couple of approaches to get you started.
First, say you have a data frame with year and occupation_code:
df1 <- data.frame(
  year = c(1965, 1985, 1995),
  occupation_code = c(1, 2, 3)
)
year occupation_code
1 1965 1
2 1985 2
3 1995 3
Then, create a second data frame that clearly indicates the year ranges and occupation-code ranges associated with each occupation. You can include all of your occupations here.
df2 <- data.frame(
  year_start = c(1960, 1960, 1980, 1980, 1990, 1990),
  year_end = c(1980, 1980, 1990, 1990, 2000, 2000),
  occupation_code_start = c(1, 2, 1, 50, 40, 1),
  occupation_code_end = c(1, 2, 29, 89, 65, 39),
  occupation = c("teacher", "lawyer", "teacher", "lawyer", "teacher", "lawyer")
)
year_start year_end occupation_code_start occupation_code_end occupation
1 1960 1980 1 1 teacher
2 1960 1980 2 2 lawyer
3 1980 1990 1 29 teacher
4 1980 1990 50 89 lawyer
5 1990 2000 40 65 teacher
6 1990 2000 1 39 lawyer
Then, you can merge the two together.
One approach is with the data.table package.
library(data.table)
setDT(df1)
setDT(df2)

df2[df1,
    on = .(year_start <= year,
           year_end >= year,
           occupation_code_start <= occupation_code,
           occupation_code_end >= occupation_code),
    .(year, occupation = occupation)]
This will give you:
year occupation
1: 1965 teacher
2: 1985 teacher
3: 1995 lawyer
Another approach is with fuzzyjoin and the tidyverse:
library(tidyverse)
library(fuzzyjoin)

fuzzy_left_join(df1, df2,
                by = c("year" = "year_start",
                       "year" = "year_end",
                       "occupation_code" = "occupation_code_start",
                       "occupation_code" = "occupation_code_end"),
                match_fun = list(`>=`, `<=`, `>=`, `<=`)) %>%
  select(year, occupation)
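As a side note (an addition, not part of the original answer): recent dplyr versions (1.1.0 and later) support non-equi joins directly through join_by(), so the same match can be written without fuzzyjoin:

library(dplyr)  # non-equi join_by() needs dplyr >= 1.1.0

left_join(df1, df2,
          by = join_by(year >= year_start,
                       year <= year_end,
                       occupation_code >= occupation_code_start,
                       occupation_code <= occupation_code_end)) %>%
  select(year, occupation)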

extract specific digits from column of numbers in R

Apologies if this is a repeat question; I searched and could not find the specific answer I am looking for.
I have a data frame where one column is a 16-digit code, and there are a number of other columns. Here is a simplified example:
code = c("1109619910224003", "1157919910102001", "1539820070315001", "1563120190907002")
year = c(1991, 1991, 2007, 2019)
month = c(02, 01, 03, 09)
dat = as.data.frame(cbind(code, year, month))
> dat
code year month
1 1109619910224003 1991 2
2 1157919910102001 1991 1
3 1539820070315001 2007 3
4 1563120190907002 2019 9
As you can see, the code contains year, month, and day information. I already have columns for year and month in my data frame, but I also need to create a day column, which would be 24, 02, 15, and 07 in this example. The date is always in the format yyyymmdd and begins at the 6th digit of the code, so I essentially need to extract the 12th and 13th digits from each code to create my day column.
I then need to create another column for day of year from the date information, so I end up with the following:
day = c(24, 02, 15, 07)
dayofyear = c(55, 2, 74, 250)
dat2 = as.data.frame(cbind(code, year, month, day, dayofyear))
> dat2
code year month day dayofyear
1 1109619910224003 1991 2 24 55
2 1157919910102001 1991 1 2 2
3 1539820070315001 2007 3 15 74
4 1563120190907002 2019 9 7 250
Any suggestions? Thanks!
You can leverage the Date data type in R to accomplish all of these tasks. First we parse out the date portion of the code (characters 6 to 13) and convert it to Date format using readr::parse_date(). Once the date is converted, we can simply read off all of the values you want rather than calculating them ourselves.
library(tidyverse)

out <- dat %>%
  mutate(
    date = readr::parse_date(substr(code, 6, 13), format = "%Y%m%d"),
    day = format(date, "%d"),
    month = format(date, "%m"),
    year = format(date, "%Y"),
    day.of.year = format(date, "%j")
  )
(I'm using tidyverse syntax here because I find it quicker for these types of problems)
Once we create these columns, we can look at the updated data.frame out:
code year month date day day.of.year
1 1109619910224003 1991 02 1991-02-24 24 055
2 1157919910102001 1991 01 1991-01-02 02 002
3 1539820070315001 2007 03 2007-03-15 15 074
4 1563120190907002 2019 09 2019-09-07 07 250
Edit: note that all of the new columns are character. We can tell this without using str() because of the leading zeros in the new columns. To get numbers, convert just those columns, e.g. out <- out %>% mutate(across(c(year, month, day, day.of.year), as.integer)); a blanket mutate_all(as.integer) would also clobber code (too large for integer) and date.
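For reference (an addition, not from the original answer), base R can do the same extraction with as.Date(), assuming the same dat as above:

# Base R sketch of the same idea: parse characters 6-13 as a Date,
# then format out the pieces (converted to integer, no leading zeros).
dat$date <- as.Date(substr(dat$code, 6, 13), format = "%Y%m%d")
dat$day <- as.integer(format(dat$date, "%d"))
dat$dayofyear <- as.integer(format(dat$date, "%j"))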

ID variable changes over time, how to calculate pct change by ID

The problem starts with the difficulty of explaining it.
I have a data set with a time dimension, and my ID variable changes name over time, making it difficult to calculate, e.g., percentage changes in Value over time by ID.
ID YR Value
01 2004 100
02 2005 50
03 2005 50
04 2005 10
I need to calculate the percentage change in Value over time by ID. The problem is that in 2005 the ID 01 is split into three IDs (02, 03, 04), so one has to aggregate the values of those three IDs in 2005 to get the value corresponding to ID 01 in 2005. The percentage change for ID 01 is NOT 50/100, but rather sum(50, 50, 10)/100.
I have a data.frame matching the ID changes over time; it looks like this:
x2004 x2005
   01    02
   01    03
   01    04
I used group_by from dplyr to create the matching between IDs in the two years (id_map stands for the matching data.frame above):
id_map %>%
  group_by(x2004) %>%
  summarize(onetomany = paste(sort(unique(x2005)), collapse = ", "))
This gave me a data.frame of the form
  x2004 onetomany
1    01 02, 03, 04
where I can see which IDs belong to the same group; that is where I got stuck with the percentage calculation.
I totally understand that the problem itself is not easy to follow. This is a common problem in trade statistics: commodity codes change name over time but not content, and one has to keep track of the changes to get a picture of developments in trade by commodity over time. Any suggestion is appreciated.
df <- data.frame("ID" = c("01", "02", "03", "04"),
"YR" = c(2004, 2005, 2005, 2005),
"Value" = c(100, 50, 50, 10))
df %>% group_by(YR) %>% summarise(sum = sum(Value))
# A tibble: 2 x 2
YR sum
<dbl> <dbl>
1 2004 100
2 2005 110
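That stops at the yearly totals. A hedged sketch of the remaining step (an addition, not from the original answer; id_map is an assumed name for the question's matching table):

library(dplyr)

# The question's one-to-many ID map (name assumed for illustration)
id_map <- data.frame(x2004 = c("01", "01", "01"),
                     x2005 = c("02", "03", "04"))

# Map each 2005 ID back to its 2004 parent, aggregate, then compute
# the percentage change per parent ID; 2004 rows keep their own ID.
totals <- df %>%
  left_join(id_map, by = c("ID" = "x2005")) %>%
  mutate(parent = ifelse(YR == 2004, ID, x2004)) %>%
  group_by(parent, YR) %>%
  summarise(Value = sum(Value), .groups = "drop")

totals %>%
  group_by(parent) %>%
  arrange(YR, .by_group = TRUE) %>%
  mutate(pct_change = Value / lag(Value) - 1)
# parent "01": Value 100 in 2004, 110 in 2005, pct_change 0.10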

Change column values based on factors of other columns

For example, if I have a data frame like this:
df <- data.frame(profit=c(10,10,10), year=c(2010,2011,2012))
profit year
10 2010
10 2011
10 2012
I want to change the value of profit according to the year: for 2010 I multiply the profit by 3, for 2011 by 4, and for 2012 by 5, which should result in this:
profit year
30 2010
40 2011
50 2012
How should I approach this? I tried:
inflationtransform <- function(k, v) {
  switch(k,
         2010, v <- v * 3,
         2011, v <- v * 4,
         2012, v <- v * 5,
  )
}
df$profit <- sapply(df$year, df$profit, inflationtransform)
But it doesn't work. Can someone tell me what to do?
For this particular example, since your multipliers and years are both ordered and incremented by 1, you could just subtract 2007 from the year column and multiply it by profit.
transform(df, profit = profit * (year - 2007))
# profit year
# 1 30 2010
# 2 40 2011
# 3 50 2012
Otherwise, you could use a lookup vector. This will cover all cases.
lookup <- c("2010" = 3, "2011" = 4, "2012" = 5)
transform(df, profit = profit * lookup[as.character(year)])
# profit year
# 1 30 2010
# 2 40 2011
# 3 50 2012
I wouldn't use switch() unless you really need to: it isn't vectorized, and vectorized operations are where R is most efficient. However, since you ask for it in the comments, here's one way. I find it easier to use a for() loop with switch().
for (i in seq_len(nrow(df))) {
  df$profit[i] <- with(df, switch(as.character(year[i]),
                                  "2010" = 3 * profit[i],
                                  "2011" = 4 * profit[i],
                                  "2012" = 5 * profit[i]))
}
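For completeness (an addition, not from the original answer), the same lookup logic can also be written with dplyr's case_when(), which stays vectorized:

library(dplyr)

# Each condition is checked vectorized over the column;
# years not listed would get an NA multiplier.
df %>%
  mutate(profit = profit * case_when(year == 2010 ~ 3,
                                     year == 2011 ~ 4,
                                     year == 2012 ~ 5))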

Grouping and conditions without loop (big data)

I have several observations of the same groups, and for each observation I have a year.
dat <- data.frame(group = rep(c("a", "b", "c"), each = 3),
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995))
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
For each observation, I would like to know whether another observation of the same group can be found satisfying given conditions relative to the focal observation, e.g.: "Is there any other observation (than the focal one) made during the last 6 years (counting back from the focal year) in the same group?"
Ideally the data frame should look like this:
group year six_years
1 a 2000 1 # another observation in group a has year = 1996 (2000 - 6 = 1994, so 1996 falls inside the window)
2 a 1996 0
3 a 1975 0
4 b 2002 0
5 b 2010 0
6 b 1980 0
7 c 1990 1
8 c 1986 0
9 c 1995 1
Basically, for each row we should look into the subset of its group and check whether any(dat$year == conditions) holds. It is very easy to do with a for loop, but a loop is of no use here: the data frame is massive (several million rows) and would take forever to process that way.
I am searching for an efficient way to do this with vectorized functions or a fast package.
Thanks!
EDITED
Actually, thinking about it, you will probably have a lot of recurring year/group combinations, in which case it is much quicker to pre-calculate the frequencies using count() - which is also a plyr function.
9M rows took ~4 sec:
require(plyr)

dat <- data.frame(group = sample(c("a", "b", "c"), size = 9000000, replace = TRUE),
                  year = sample(c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995),
                                size = 9000000, replace = TRUE))

test <- function(y, g, df) {
  d <- df[df$year >= y - 6 &
            df$year < y &
            df$group == g, ]
  return(nrow(d))
}

rollup <- function() {
  summ <- count(dat)  # add a frequency to each group/year combination
  return(ddply(summ, .(group, year), transform,
               t = test(as.numeric(year), group, summ) * freq))
}
system.time(rollup())
user system elapsed
3.44 0.42 3.90
My dataset had too many different groups, and the plyr option proposed by Troy was too slow.
I found a hack (experts would probably call it an ugly one) with the data.table package: the idea is to merge the data.table with itself quickly using the fast merge() function. This gives every possible combination between a given year of a group and all other years of the same group.
Then proceed with an ifelse() on every row to test the condition you're looking for.
Finally, aggregate everything with sum() to count how many times each given year can be found within the given timespan relative to another year.
On my computer it took a few milliseconds, instead of the hours that plyr would probably have taken.
library(data.table)

dat <- data.table(group = rep(c("a", "b", "c"), each = 3),
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995),
                  key = "group")
This produces:
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
Then:
z <- merge(dat, dat, by = "group", all = TRUE, allow.cartesian = TRUE)  # super fast
z$sixyears <- ifelse(z$year.y >= z$year.x - 6 & z$year.y < z$year.x, 1, 0)  # 0/1 column for our condition
z$sixyears <- as.numeric(z$sixyears)  # we want to sum this up afterwards
z$year.y <- NULL  # useless column now
z2 <- z[, list(sixyears = sum(sixyears)), by = list(group, year.x)]
Years with another year of the same group in the previous six years are given a 1:
   group year.x sixyears
1:     a   1975        0
2:     b   1980        0
3:     c   1986        0
4:     c   1990        1  # e.g. another "c" (1986) falls in the window [1990 - 6, 1990)
5:     c   1995        1  # likewise: 1990, one row above, falls in [1995 - 6, 1995)
6:     a   1996        0
7:     a   2000        1
8:     b   2002        0
9:     b   2010        0
Icing on the cake: it deals with NAs seamlessly.
Here's another possibility, also using data.table but including diff():
dat <- data.table(group = rep(c("a", "b", "c"), each = 3),
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995),
                  key = "group")

valid_case <- subset(dat[, list(valid_case = diff(year)), by = key(dat)],
                     abs(valid_case) < 6)
dat$valid_case <- ifelse(dat$group %in% valid_case$group, 1, 0)
I am not sure how this compares in terms of speed or NA handling (I think it should be fine with NAs, since they propagate through diff() and abs()), but I certainly find it more readable. Joins are really fast in data.table, but I'd like to think that avoiding them altogether helps. There's probably a more idiomatic way to express the condition in the ifelse() statement using data.table joins; that could potentially speed things up, although in my experience %in% has never been the limiting factor.
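One more note (an addition, not from the original thread): if years are distinct within each group, sorting by group and year reduces the check to the gap from the nearest earlier observation, which avoids the cartesian self-join entirely. A hedged sketch:

library(data.table)

# Sketch assuming distinct years within each group: after sorting,
# the nearest earlier same-group observation is simply the previous row.
setDT(dat)
setorder(dat, group, year)
dat[, prev_year := shift(year), by = group]
dat[, six_years := as.integer(!is.na(prev_year) &
                                year - prev_year <= 6 &
                                year - prev_year > 0)]
dat[, prev_year := NULL]

On the example data this reproduces the six_years flags from the question.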
