I have two data frames:
d1:
Id group occu D Year
12 1 1 12 2007
13 4 2 67 2007
14 6 3 34 2007
15 7 1 88 2007
16 2 2 72 2007
17 1 1 43 2007
18 4 1 66 2007
and d2:
Id group occu D Year
12 1 1 34 2010
13 4 2 100 2010
14 6 3 76 2010
15 7 1 99 2010
16 2 2 102 2010
17 1 1 55 2010
18 4 1 32 2010
The variables "group" and "occu" are factors. I want to build a panel data set for the years 2007 and 2010, in long form, in R.
How can I do this?
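One possible approach, sketched here in base R: since both data frames already share the same columns and each carries its own Year value, stacking them with rbind() gives the long form directly. The data frames below are rebuilt from the tables in the question; the name panel is only an illustrative choice.

```r
# Rebuild the two example data frames from the question
d1 <- data.frame(
  Id    = 12:18,
  group = factor(c(1, 4, 6, 7, 2, 1, 4)),
  occu  = factor(c(1, 2, 3, 1, 2, 1, 1)),
  D     = c(12, 67, 34, 88, 72, 43, 66),
  Year  = 2007
)
d2 <- data.frame(
  Id    = 12:18,
  group = factor(c(1, 4, 6, 7, 2, 1, 4)),
  occu  = factor(c(1, 2, 3, 1, 2, 1, 1)),
  D     = c(34, 100, 76, 99, 102, 55, 32),
  Year  = 2010
)

# Stack the two years, then sort so each Id's observations sit together
panel <- rbind(d1, d2)
panel <- panel[order(panel$Id, panel$Year), ]
rownames(panel) <- NULL
head(panel, 4)
```

Each Id now has one row per year. If a panel-specific class is needed afterwards, packages such as plm offer one (for example pdata.frame with index = c("Id", "Year")), but that step is outside this sketch.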
I have a time-series dataset with yearly values for 30 years for >200,000 study units that all start off as the same value of 'healthy==1' and can transition to 3 classes - 'exposed==2', 'infected==3' and 'recover==4'; some units also remain as 'healthy' throughout the time series. The dataset is in long format.
I would like to transform the dataset so that it keeps all 30 years for each unit but is collapsed to only 'healthy==1' and 'infected==3'. That is, I would classify 'exposed==2' as 'healthy==1', and the first time a 'healthy' unit becomes 'infected==3', it stays infected for the remainder of the time series, even though it might 'recover==4' or change state again (get infected and recover again).
Healthy units that never transition to another class will remain classified as healthy throughout the time series.
I am stumped on how to code this in R; any ideas would be greatly appreciated.
Example of the dataset for two units: one remains healthy throughout the time series and the other has multiple transitions.
UID annual_change_val year
1 control1 1 1990
4 control1 1 1991
5 control1 1 1992
7 control1 1 1993
9 control1 1 1994
12 control1 1 1995
13 control1 1 1996
16 control1 1 1997
18 control1 1 1998
20 control1 1 1999
22 control1 1 2000
24 control1 1 2001
26 control1 1 2002
28 control1 1 2003
30 control1 1 2004
31 control1 1 2005
33 control1 1 2006
35 control1 1 2007
38 control1 1 2008
40 control1 1 2009
42 control1 1 2010
44 control1 1 2011
46 control1 1 2012
48 control1 1 2013
50 control1 1 2014
52 control1 1 2015
53 control1 1 2016
55 control1 1 2017
57 control1 1 2018
59 control1 1 2019
61 control1 1 2020
2 control64167 1 1990
3 control64167 1 1991
6 control64167 1 1992
8 control64167 2 1993
10 control64167 2 1994
11 control64167 2 1995
14 control64167 2 1996
15 control64167 2 1997
17 control64167 3 1998
19 control64167 3 1999
21 control64167 4 2000
23 control64167 4 2001
25 control64167 4 2002
27 control64167 4 2003
29 control64167 3 2004
32 control64167 4 2005
34 control64167 4 2006
36 control64167 4 2007
37 control64167 4 2008
39 control64167 4 2009
41 control64167 4 2010
43 control64167 4 2011
45 control64167 4 2012
47 control64167 4 2013
49 control64167 4 2014
51 control64167 4 2015
54 control64167 4 2016
56 control64167 4 2017
58 control64167 4 2018
60 control64167 4 2019
62 control64167 4 2020
If for some reason you only want to use base R,
df$annual_change_val[df$annual_change_val == 2] <- 1
df$annual_change_val[df$annual_change_val == 4] <- 3
The first line means: take the annual_change_val column from ($) dataframe df, subset it ([) so that you're only left with values equal to 2, and re-assign (<-) to those a value of 1 instead. Similarly for the second line.
Update, based on comment/clarification.
Here, I recode the values as before, and then I create a temp variable called first_inf which holds the first year in which the UID has status 3 ("infected"), or Inf if it never does. I then set the status to 3 for every later year within that UID, so a unit stays infected even if it drops back to class 1 or 2.
library(dplyr)
d %>%
  mutate(status = if_else(annual_change_val %in% c(1,2),1,3)) %>%
  group_by(UID) %>%
  # first infected year per unit; Inf when the unit is never infected
  mutate(first_inf = min(c(year[status==3], Inf)),
         status = if_else(year>first_inf,3,status)) %>%
  ungroup() %>%
  select(!first_inf)
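The "once infected, always infected" rule can also be sketched in base R with a running maximum. This is only an illustration on hypothetical toy data (the UIDs "a" and "b" and their values are made up), and it assumes the rows are sorted by year within each UID.

```r
# Toy data (hypothetical): one always-healthy unit, one with several transitions
df <- data.frame(
  UID  = rep(c("a", "b"), each = 6),
  year = rep(1990:1995, times = 2),
  annual_change_val = c(1, 1, 1, 1, 1, 1,
                        1, 2, 3, 4, 1, 3)
)

# Sort so that the running maximum follows time within each unit
df <- df[order(df$UID, df$year), ]

# 1 in the years where the unit is infected, 0 otherwise
infected_now <- as.integer(df$annual_change_val == 3)

# cummax() switches to 1 at the first infection and never goes back
ever_infected <- ave(infected_now, df$UID, FUN = cummax)

df$status <- ifelse(ever_infected == 1, 3, 1)
df
```

Unit "a" keeps status 1 throughout, while unit "b" switches to 3 in 1992 and stays there, including the year in which it returns to class 1.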
You can simply change the values from 2 to 1, and from 4 to 3, as Andrea mentioned in the comments. If d is your data, then
library(dplyr)
d %>% mutate(status = if_else(annual_change_val %in% c(1,2),1,3))
library(data.table)
setDT(d)[, status:=fifelse(annual_change_val %in% c(1,2),1,3)]
This is a representation of my dataset
ID<-c(rep(1,10),rep(2,8))
year<-c(2007,2007,2007,2008,2008,2009,2010,2009,2010,2011,
2008,2008,2009,2010,2009,2010,2011,2011)
month<-c(2,7,12,4,11,6,11,1,9,4,3,6,7,4,9,11,2,8)
mydata<-data.frame(ID,year,month)
I want to calculate for each individual the number of months from the initial date. I am using two variables: year and month.
First, I order the rows by year and month:
library(dplyr)
mydata2<-mydata%>%group_by(ID,year)%>%arrange(year,month,.by_group=T)
Then I create a date variable, assuming that each date falls on day 01 of the month:
mydata2$date<-paste("01",mydata2$month,mydata2$year,sep = "-")
Then I use lubridate to convert this variable to Date format:
library(lubridate)
mydata2$date<-dmy(mydata2$date)
After this, I don't know how to proceed in order to get the dataset below (preferably using dplyr code):
ID year month date dif_from_init
1 1 2007 2 01-2-2007 0
2 1 2007 7 01-7-2007 5
3 1 2007 12 01-12-2007 10
4 1 2008 4 01-4-2008 14
5 1 2008 11 01-11-2008 21
6 1 2009 1 01-1-2009 23
7 1 2009 6 01-6-2009 28
8 1 2010 9 01-9-2010 43
9 1 2010 11 01-11-2010 45
10 1 2011 4 01-4-2011 50
11 2 2008 3 01-3-2008 0
12 2 2008 6 01-6-2008 3
13 2 2009 7 01-7-2009 16
14 2 2009 9 01-9-2009 18
15 2 2010 4 01-4-2010 25
16 2 2010 11 01-11-2010 32
17 2 2011 2 01-2-2011 35
18 2 2011 8 01-8-2011 41
One way could be:
mydata %>%
group_by(ID) %>%
mutate(date = as.Date(sprintf('%d-%d-01',year, month)),
diff = as.numeric(round((date - date[1])/365*12)))
# A tibble: 18 x 5
# Groups: ID [2]
ID year month date diff
<dbl> <dbl> <dbl> <date> <dbl>
1 1 2007 2 2007-02-01 0
2 1 2007 7 2007-07-01 5
3 1 2007 12 2007-12-01 10
4 1 2008 4 2008-04-01 14
5 1 2008 11 2008-11-01 21
6 1 2009 6 2009-06-01 28
7 1 2010 11 2010-11-01 45
8 1 2009 1 2009-01-01 23
9 1 2010 9 2010-09-01 43
10 1 2011 4 2011-04-01 50
11 2 2008 3 2008-03-01 0
12 2 2008 6 2008-06-01 3
13 2 2009 7 2009-07-01 16
14 2 2010 4 2010-04-01 25
15 2 2009 9 2009-09-01 18
16 2 2010 11 2010-11-01 32
17 2 2011 2 2011-02-01 35
18 2 2011 8 2011-08-01 41
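Because every date falls on the first of a month, the difference can also be computed exactly from the year and month columns, without dividing a day count by 365. The following is a base R sketch of that idea (the helper name month_index is only illustrative); it also sorts the rows first, as the question asks.

```r
# Data from the question
ID <- c(rep(1, 10), rep(2, 8))
year <- c(2007, 2007, 2007, 2008, 2008, 2009, 2010, 2009, 2010, 2011,
          2008, 2008, 2009, 2010, 2009, 2010, 2011, 2011)
month <- c(2, 7, 12, 4, 11, 6, 11, 1, 9, 4, 3, 6, 7, 4, 9, 11, 2, 8)
mydata <- data.frame(ID, year, month)

# Sort chronologically within each ID
mydata <- mydata[order(mydata$ID, mydata$year, mydata$month), ]

# Number each month on a single running scale
month_index <- mydata$year * 12 + mydata$month

# Months elapsed since each ID's earliest month
mydata$dif_from_init <- month_index - ave(month_index, mydata$ID, FUN = min)
mydata
```

The same expression should also work inside a grouped dplyr mutate(), with min(month_index) taking the place of the ave() call.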
I have data similar to this. I would like to create 3 columns (date1, date2, date3) using a loop and rbind, because I am required to do it by that method only.
(All I was told is to make a loop, subset the data, sort it, make a new data frame, then rbind it to build the new column.)
year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103
The result I expect is:
date1: number of days from 2011, January 1st, start again from 1 in a new year.
date2: number of days of an id working in a year, start again from 1 in a new year.
date3: number of days open within a year, start again from 1 in a new year.
(all of the dates are in ascending order)
year month day id date1 date2 date3
2011 1 5 3101 5 1 1
2011 1 14 3101 14 2 2
2011 2 3 3101 34 3 3
2011 2 4 3101 35 4 4
2012 1 27 3153 27 1 1
2012 2 20 3153 51 2 2
2012 2 22 3153 53 3 3
2012 3 1 3153 60 4 4
2013 1 31 3103 31 1 1
2013 2 1 3103 32 2 2
2013 2 4 3103 35 3 3
2013 3 4 3103 94 4 4
2013 3 6 3103 96 5 5
Please help! Thank you.
You can do this without a for loop or subset; here is one way:
df <- read.table(text =" year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103",header = T)
library(lubridate)
df$date1 <- yday(mdy(paste0(df$month,"-",df$day,"-",df$year)))
df$date2 <- ave(df$year, df$id, FUN = seq_along)
df$date3 <- ave(df$year, df$year, FUN = seq_along)
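If the loop, subset, sort, and rbind steps really are required, a base R sketch along those lines could look like the following. It assumes one pass per year and collects the pieces in a list (the names pieces, sub, and result are illustrative) before binding them together.

```r
df <- read.table(text = "year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103", header = TRUE)

pieces <- list()
for (yr in unique(df$year)) {
  sub <- subset(df, year == yr)              # one year at a time
  sub <- sub[order(sub$month, sub$day), ]    # ascending dates within the year
  d <- as.Date(paste(sub$year, sub$month, sub$day, sep = "-"))
  sub$date1 <- as.integer(format(d, "%j"))   # day of the year
  sub$date2 <- ave(seq_along(d), sub$id, FUN = seq_along)  # running count per id
  sub$date3 <- seq_len(nrow(sub))            # running count within the year
  pieces[[as.character(yr)]] <- sub
}

# Bind the yearly pieces back into one data frame
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

With calendar arithmetic, date1 comes out as 61 for 2012-03-01 (2012 is a leap year) and as 63 and 65 for the two March 2013 rows.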
I have data similar to this. I would like to compute a running day count (I'm not sure whether "lump sum" is the right term) and store it in a new column "date", so that the column counts the days cumulatively across the 3 years of data, in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result is wrong and the code is too long. It doesn't count February correctly, since February has only 28 days. Is there a shorter way?
cday <- function(data,syear=2011,smonth=1,sday=1){
year <- data[1]
month <- data[2]
day <- data[3]
cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
date <- (year-syear)*365+sum(cmonth[1:month])+day
for(yr in c(syear:year)){
if(yr==year){
if(yr%%4==0&&month>2){date<-date+1}
}else{
if(yr%%4==0){date<-date+1}
}
}
return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, look for tools to do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
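As a sketch of how the hand-written cday function could be replaced, the helper below (day_number is a hypothetical name, and the default start date of 2011-01-01 mirrors the syear, smonth, and sday defaults in the question) lets the Date class do the leap-year bookkeeping.

```r
# Day count starting at 1 on the start date; vectorised over its inputs
day_number <- function(year, month, day, start = as.Date("2011-01-01")) {
  date <- as.Date(paste(year, month, day, sep = "-"))
  as.integer(date - start) + 1L
}

day_number(2011, 1, 5)     # 5
day_number(2012, 2, 24)    # 420, matching days_since_2010 above

# The Date class knows which years have a February 29
as.integer(as.Date("2012-03-01") - as.Date("2012-02-28"))  # 2 (leap year)
as.integer(as.Date("1900-03-01") - as.Date("1900-02-28"))  # 1 (not a leap year)
```

The last line shows a case that a plain yr %% 4 == 0 test gets wrong: 1900 is divisible by 4 but is not a leap year.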
I'm trying to create a fiscal year variable called 'period', which will run from September through August for six years. My data frame 'dat' is structured as follows:
'data.frame': 52966 obs. of 4 variables:
$ userid : int 96 96 96 101 101 101 101 101 101 101 ...
$ comment.year : int 2008 2009 2009 2008 2008 2008 2008 2008 2008 2009 ...
$ comment.month: int 7 3 8 7 8 9 10 11 12 1 ...
$ num.comments : int 1 1 1 33 51 16 27 29 40 39 ...
I get this error message: Error: unexpected '=' in "dat$period[comment.year=2008 & comment.month="
when I run the following code. I've experimented with double equal signs and putting the month and year integers in quotes, but no success. I'm also wondering if there's a simpler way to do the recode. Since I'm dealing with 6 years, my approach takes 72 lines.
dat$period[comment.year=2008 & comment.month=9]<-"1"
dat$period[comment.year=2008 & comment.month=10]<-"1"
dat$period[comment.year=2008 & comment.month=11]<-"1"
dat$period[comment.year=2008 & comment.month=12]<-"1"
dat$period[comment.year=2009 & comment.month=1]<-"1"
dat$period[comment.year=2009 & comment.month=2]<-"1"
dat$period[comment.year=2009 & comment.month=3]<-"1"
dat$period[comment.year=2009 & comment.month=4]<-"1"
dat$period[comment.year=2009 & comment.month=5]<-"1"
dat$period[comment.year=2009 & comment.month=6]<-"1"
dat$period[comment.year=2009 & comment.month=7]<-"1"
dat$period[comment.year=2009 & comment.month=8]<-"1"
dat$period[comment.year=2009 & comment.month=9]<-"2"
dat$period[comment.year=2009 & comment.month=10]<-"2"
dat$period[comment.year=2009 & comment.month=11]<-"2"
dat$period[comment.year=2009 & comment.month=12]<-"2"
Rather than doing a bunch of partial assignments, why not just calculate the difference in years, with a bonus bump for months >= 9?
#sample data
dat<-data.frame(
comment.year=rep(2009:2011, each=12),
comment.month=rep(1:12, 3)
)[-(1:8), ]
#assign new period
dat$period<- dat$comment.year-min(dat$comment.year) + ifelse(dat$comment.month>=9,1,0)
which gives you
comment.year comment.month period
9 2009 9 1
10 2009 10 1
11 2009 11 1
12 2009 12 1
13 2010 1 1
14 2010 2 1
15 2010 3 1
16 2010 4 1
17 2010 5 1
18 2010 6 1
19 2010 7 1
20 2010 8 1
21 2010 9 2
22 2010 10 2
23 2010 11 2
24 2010 12 2
25 2011 1 2
26 2011 2 2
27 2011 3 2
28 2011 4 2
29 2011 5 2
30 2011 6 2
31 2011 7 2
32 2011 8 2
33 2011 9 3
34 2011 10 3
35 2011 11 3
36 2011 12 3
If you want to make sure to start at a certain year, you can use 2009 rather than min(dat$comment.year).
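As for the error message in the question: inside [ ], = is read as an argument assignment rather than a comparison, so the test needs ==, and the columns have to be referenced through the data frame (dat$comment.year rather than a bare comment.year). The sketch below shows the corrected comparison next to the arithmetic approach with a fixed start year. The toy rows and the choice of 2008 as the first fiscal year are assumptions for illustration, as is marking months before the first September as NA.

```r
# Toy data (hypothetical rows in the shape of the question's data frame)
dat <- data.frame(
  comment.year  = c(2008, 2008, 2008, 2009, 2009, 2009, 2010),
  comment.month = c(7, 9, 12, 1, 8, 9, 8)
)

# Assumed: the first fiscal year starts in September 2008
start_year <- 2008

# One vectorised assignment instead of one line per month
dat$period <- dat$comment.year - start_year +
  ifelse(dat$comment.month >= 9, 1, 0)

# Months before the first September would land in period 0; mark them NA
dat$period[dat$period < 1] <- NA

# The original style of test, with == and explicit dat$ references
in_period_1 <- (dat$comment.year == 2008 & dat$comment.month >= 9) |
  (dat$comment.year == 2009 & dat$comment.month <= 8)

dat
```

Both routes select the same rows for period 1, which gives a quick way to check the arithmetic against the explicit conditions.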
Using MrFlick's sample data:
dat$period = rep(1:3, each=12)[1:28]
dat
comment.year comment.month period
9 2009 9 1
10 2009 10 1
11 2009 11 1
12 2009 12 1
13 2010 1 1
14 2010 2 1
15 2010 3 1
16 2010 4 1
17 2010 5 1
18 2010 6 1
19 2010 7 1
20 2010 8 1
21 2010 9 2
22 2010 10 2
23 2010 11 2
24 2010 12 2
25 2011 1 2
26 2011 2 2
27 2011 3 2
28 2011 4 2
29 2011 5 2
30 2011 6 2
31 2011 7 2
32 2011 8 2
33 2011 9 3
34 2011 10 3
35 2011 11 3
36 2011 12 3
Can easily be extended to your data.
I guess you could also try (using @MrFlick's data):
set.seed(42)
dat1 <- dat[sample(1:nrow(dat)),]
dat<- within(dat, {period<- as.numeric(factor(comment.year))
period[comment.month <9] <- period[comment.month <9] -1})
dat
# comment.year comment.month period
#9 2009 9 1
#10 2009 10 1
#11 2009 11 1
#12 2009 12 1
#13 2010 1 1
#14 2010 2 1
#15 2010 3 1
#16 2010 4 1
#17 2010 5 1
#18 2010 6 1
#19 2010 7 1
#20 2010 8 1
#21 2010 9 2
#22 2010 10 2
#23 2010 11 2
#24 2010 12 2
#25 2011 1 2
#26 2011 2 2
#27 2011 3 2
#28 2011 4 2
#29 2011 5 2
#30 2011 6 2
#31 2011 7 2
#32 2011 8 2
#33 2011 9 3
#34 2011 10 3
#35 2011 11 3
#36 2011 12 3
Using the unordered dat1
within(dat1, {period<- as.numeric(factor(comment.year)); period[comment.month <9] <- period[comment.month <9] -1})[,3]
#[1] 3 3 1 2 2 1 2 1 2 2 1 2 2 1 1 2 2 1 1 1 3 1 2 1 2 1 2 3
Crosschecking the results with @MrFlick's method
dat1$comment.year-min(dat1$comment.year) + ifelse(dat1$comment.month>=9,1,0)
# [1] 3 3 1 2 2 1 2 1 2 2 1 2 2 1 1 2 2 1 1 1 3 1 2 1 2 1 2 3