I have a time-series dataset with yearly values for 30 years for >200,000 study units that all start off as the same value of 'healthy==1' and can transition to 3 classes - 'exposed==2', 'infected==3' and 'recover==4'; some units also remain as 'healthy' throughout the time series. The dataset is in long format.
I would like to manipulate the dataset that keeps all 30 years for each unit but collapsed to only 'heathy==1' and 'infected==3' i.e. I would classify 'exposed==2' as 'healthy==1' and the first time a 'healthy' unit gets 'infected==3', it remains as infected for the remaining of the time-series even though it might 'recover==4'/change state again (gets infected and recover again).
Healthy units that never transition to another class will remain classified as healthy throughout the time series.
I am kinda stumped on how to code this out in r; any ideas would be greatly appreciated
example of dataset for two units; one remains health throughout the time series and another has multiple transitions.
UID annual_change_val year
1 control1 1 1990
4 control1 1 1991
5 control1 1 1992
7 control1 1 1993
9 control1 1 1994
12 control1 1 1995
13 control1 1 1996
16 control1 1 1997
18 control1 1 1998
20 control1 1 1999
22 control1 1 2000
24 control1 1 2001
26 control1 1 2002
28 control1 1 2003
30 control1 1 2004
31 control1 1 2005
33 control1 1 2006
35 control1 1 2007
38 control1 1 2008
40 control1 1 2009
42 control1 1 2010
44 control1 1 2011
46 control1 1 2012
48 control1 1 2013
50 control1 1 2014
52 control1 1 2015
53 control1 1 2016
55 control1 1 2017
57 control1 1 2018
59 control1 1 2019
61 control1 1 2020
2 control64167 1 1990
3 control64167 1 1991
6 control64167 1 1992
8 control64167 2 1993
10 control64167 2 1994
11 control64167 2 1995
14 control64167 2 1996
15 control64167 2 1997
17 control64167 3 1998
19 control64167 3 1999
21 control64167 4 2000
23 control64167 4 2001
25 control64167 4 2002
27 control64167 4 2003
29 control64167 3 2004
32 control64167 4 2005
34 control64167 4 2006
36 control64167 4 2007
37 control64167 4 2008
39 control64167 4 2009
41 control64167 4 2010
43 control64167 4 2011
45 control64167 4 2012
47 control64167 4 2013
49 control64167 4 2014
51 control64167 4 2015
54 control64167 4 2016
56 control64167 4 2017
58 control64167 4 2018
60 control64167 4 2019
62 control64167 4 2020
If for some reason you only want to use base R,
df$annual_change_val[df$annual_change_val == 2] <- 1
df$annual_change_val[df$annual_change_val == 4] <- 3
The first line means: take the annual_change_val column from ($) dataframe df, subset it ([) so that you're only left with values equal to 2, and re-assign (<-) to those a value of 1 instead. Similarly for the second line.
Update, based on comment/clarification.
Here, I replace the values as before, and then I create a temp variable called max_inf which holds the maximum year that the UID was "infected" (status=3). I then replace the status to 3 for any year that is beyond that year (within UID).
d %>%
mutate(status = if_else(annual_change_val %in% c(1,2),1,3)) %>%
group_by(UID) %>%
mutate(max_inf = max(year[which(status==3)],na.rm=T),
status = if_else(!is.na(max_inf) & year>max_inf & status==1,3,status)) %>%
select(!max_inf)
You can simply change the values from 2 to 1, and from 4 to 3, as Andrea mentioned in the comments. If d is your data, then
library(dplyr)
d %>% mutate(status = if_else(annual_change_val %in% c(1,2),1,3))
library(data.table)
setDT(d)[, status:=fifelse(annual_change_val %in% c(1,2),1,3)]
Related
I have a data similar like this. I would like to make 3 columns (date1, date2, date3) by using looping and rbind. It is because I am requied to do it by only that method.
(all I was told is making a loop, subset the data, sort it make a new data frame then rbind it to make a new column.)
year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103
The result I expect is:
date1: number of days from 2011, January 1st, start again from 1 in a new year.
date2: number of days of an id working in a year, start again from 1 in a new year.
date3: number of days open within a year, start again from 1 in a new year.
(all of the dates are in ascending order)
year month day id date1 date2 date3
2011 1 5 3101 5 1 1
2011 1 14 3101 14 2 2
2011 2 3 3101 34 3 3
2011 2 4 3101 35 4 4
2012 1 27 3153 27 1 1
2012 2 20 3153 51 2 2
2012 2 22 3153 53 3 3
2012 3 1 3153 60 4 4
2013 1 31 3103 31 1 1
2013 2 1 3103 32 2 2
2013 2 4 3103 35 3 3
2013 3 4 3103 94 4 4
2013 3 6 3103 96 5 5
Please help! Thank you.
You can do it without using unnecessary for loop and subset, here is the answer below
df <- read.table(text =" year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103",header = T)
library(lubridate)
df$date1 <- yday(mdy(paste0(df$month,"-",df$day,"-",df$year)))
df$date2 <- ave(df$year, df$id, FUN = seq_along)
df$date3 <- ave(df$year, df$year, FUN = seq_along)
I have data similar to this. I would like to lump sum the day (I'm not sure the word "lump sum" is correct or not) and create a new column "date" so that new column lump sum the number of 3 years data in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I did this code but result was wrong and it's too long also. It doesn't count the February correctly since February has only 28 days. are there any shorter ways?
cday <- function(data,syear=2011,smonth=1,sday=1){
year <- data[1]
month <- data[2]
day <- data[3]
cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
date <- (year-syear)*365+sum(cmonth[1:month])+day
for(yr in c(syear:year)){
if(yr==year){
if(yr%%4==0&&month>2){date<-date+1}
}else{
if(yr%%4==0){date<-date+1}
}
}
return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, look for tools to do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look a the lubridate package.
I have two data frames:
d1:
Id group occu D Year
12 1 1 12 2007
13 4 2 67 2007
14 6 3 34 2007
15 7 1 88 2007
16 2 2 72 2007
17 1 1 43 2007
18 4 1 66 2007
and d2:
Id group occu D Year
12 1 1 34 2010
13 4 2 100 2010
14 6 3 76 2010
15 7 1 99 2010
16 2 2 102 2010
17 1 1 55 2010
18 4 1 32 2010
The variables "group" and "occu" are factors I want to make a panel data for the year 2007 and 2010 in the long form in R.
How can I do this?
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 8 years ago.
Improve this question
I'm trying to create a fiscal year variable called 'period', which will run from September through August for six years. My data frame 'dat' is structured as follows:
'data.frame': 52966 obs. of 4 variables:
$ userid : int 96 96 96 101 101 101 101 101 101 101 ...
$ comment.year : int 2008 2009 2009 2008 2008 2008 2008 2008 2008 2009 ...
$ comment.month: int 7 3 8 7 8 9 10 11 12 1 ...
$ num.comments : int 1 1 1 33 51 16 27 29 40 39 ...
I get this error message: Error: unexpected '=' in "dat$period[comment.year=2008 & comment.month="
when I run the following code. I've experimented with double equal signs and putting the month and year integers in quotes, but no success. I'm also wondering if there's a simpler way to do the recode. Since I'm dealing with 6 years, my approach takes 72 lines.
dat$period[comment.year=2008 & comment.month=9]<-"1"
dat$period[comment.year=2008 & comment.month=10]<-"1"
dat$period[comment.year=2008 & comment.month=11]<-"1"
dat$period[comment.year=2008 & comment.month=12]<-"1"
dat$period[comment.year=2009 & comment.month=1]<-"1"
dat$period[comment.year=2009 & comment.month=2]<-"1"
dat$period[comment.year=2009 & comment.month=3]<-"1"
dat$period[comment.year=2009 & comment.month=4]<-"1"
dat$period[comment.year=2009 & comment.month=5]<-"1"
dat$period[comment.year=2009 & comment.month=6]<-"1"
dat$period[comment.year=2009 & comment.month=7]<-"1"
dat$period[comment.year=2009 & comment.month=8]<-"1"
dat$period[comment.year=2009 & comment.month=9]<-"2"
dat$period[comment.year=2009 & comment.month=10]<-"2"
dat$period[comment.year=2009 & comment.month=11]<-"2"
dat$period[comment.year=2009 & comment.month=12]<-"2"
Rather than doing a bunch of partial assignments, why not just calculate the different in years with a bonus bump for months >=9?
#sample data
dat<-data.frame(
comment.year=rep(2009:2011, each=12),
comment.month=rep(1:12, 3)
)[-(1:8), ]
#assign new period
dat$period<- dat$comment.year-min(dat$comment.year) + ifelse(dat$comment.month>=9,1,0)
which gives you
comment.year comment.month period
9 2009 9 1
10 2009 10 1
11 2009 11 1
12 2009 12 1
13 2010 1 1
14 2010 2 1
15 2010 3 1
16 2010 4 1
17 2010 5 1
18 2010 6 1
19 2010 7 1
20 2010 8 1
21 2010 9 2
22 2010 10 2
23 2010 11 2
24 2010 12 2
25 2011 1 2
26 2011 2 2
27 2011 3 2
28 2011 4 2
29 2011 5 2
30 2011 6 2
31 2011 7 2
32 2011 8 2
33 2011 9 3
34 2011 10 3
35 2011 11 3
36 2011 12 3
If you want to make sure to start at a certain user, you can use 2009 rather than min(dat$comment.year).
Using MrFlick's sample data:
dat$period = rep(1:3, each=12)[1:28]
dat
comment.year comment.month period
9 2009 9 1
10 2009 10 1
11 2009 11 1
12 2009 12 1
13 2010 1 1
14 2010 2 1
15 2010 3 1
16 2010 4 1
17 2010 5 1
18 2010 6 1
19 2010 7 1
20 2010 8 1
21 2010 9 2
22 2010 10 2
23 2010 11 2
24 2010 12 2
25 2011 1 2
26 2011 2 2
27 2011 3 2
28 2011 4 2
29 2011 5 2
30 2011 6 2
31 2011 7 2
32 2011 8 2
33 2011 9 3
34 2011 10 3
35 2011 11 3
36 2011 12 3
>
Can easily be extended to your data.
I guess you could also try (Using #MrFlick's data)
set.seed(42)
dat1 <- dat[sample(1:nrow(dat)),]
dat<- within(dat, {period<- as.numeric(factor(comment.year))
period[comment.month <9] <- period[comment.month <9] -1})
dat
# comment.year comment.month period
#9 2009 9 1
#10 2009 10 1
#11 2009 11 1
#12 2009 12 1
#13 2010 1 1
#14 2010 2 1
#15 2010 3 1
#16 2010 4 1
#17 2010 5 1
#18 2010 6 1
#19 2010 7 1
#20 2010 8 1
#21 2010 9 2
#22 2010 10 2
#23 2010 11 2
#24 2010 12 2
#25 2011 1 2
#26 2011 2 2
#27 2011 3 2
#28 2011 4 2
#29 2011 5 2
#30 2011 6 2
#31 2011 7 2
#32 2011 8 2
#33 2011 9 3
#34 2011 10 3
#35 2011 11 3
#36 2011 12 3
Using the unordered dat1
within(dat1, {period<- as.numeric(factor(comment.year)); period[comment.month <9] <- period[comment.month <9] -1})[,3]
#[1] 3 3 1 2 2 1 2 1 2 2 1 2 2 1 1 2 2 1 1 1 3 1 2 1 2 1 2 3
Crosschecking the results with #MrFlick's method
dat1$comment.year-min(dat1$comment.year) + ifelse(dat1$comment.month>=9,1,0)
# [1] 3 3 1 2 2 1 2 1 2 2 1 2 2 1 1 2 2 1 1 1 3 1 2 1 2 1 2 3
The following is what I have:
ID Year Score
1 1999 10
1 2000 11
1 2001 14
1 2002 22
2 2000 19
2 2001 17
2 2002 22
3 1998 10
3 1999 12
The following is what I would like to do:
ID Year Score Total
1 1999 10 10
1 2000 11 21
1 2001 14 35
1 2002 22 57
2 2000 19 19
2 2001 17 36
2 2002 22 48
3 1998 10 10
3 1999 12 22
The amount of years and the specific years vary for each Id.
I have a feeling that it's some advanced options in ddply but I have not been able to find the answer. I've also tried working with for/while loops but since these are dreadfully slow in R and my data-set is large, it's not working all that well.
Thanks in advance!
You can use the sumsum function and apply it with ave to all subgroups.
transform(dat, Total = ave(Score, ID, FUN = cumsum))
ID Year Score Total
1 1 1999 10 10
2 1 2000 11 21
3 1 2001 14 35
4 1 2002 22 57
5 2 2000 19 19
6 2 2001 17 36
7 2 2002 22 58
8 3 1998 10 10
9 3 1999 12 22
If your data is large, then ddply will be slow.
data.table is the way to go.
library(data.table)
DT <- data.table(dat)
# create your desired column in `DT`
DT[, agg.Score := cumsum(Score), by = ID]