Inserting rows into a table - r

I have this table (visit_ts) -
Year Month Number_of_visits
2011 4 1
2011 6 3
2011 7 23
2011 12 32
2012 1 123
2012 11 3200
The aim is to insert rows with Number_of_visits as 0, for months which are missing in the table.
Do not insert rows for 2011 where month is 1,2,3 or 2012 where month is 12.
The following code works correctly -
vec_month=c(1,2,3,4,5,6,7,8,9,10,11,12)
vec_year=c(2011,2012,2013,2014,2015,2016)
i=1
startyear=head(visit_ts$Year,n=1)
endyear=tail(visit_ts$Year,n=1)
x=head(visit_ts$Month,n=1)
y=tail(visit_ts$Month,n=1)
for (year in vec_year)
{
if(year %in% visit_ts$Year)
{
a=subset(visit_ts,visit_ts$Year==year)
index= which(!vec_month %in% a$Month)
for (j in index)
{
if((year==startyear & j>x )|(year==endyear & j<y))
visit_ts=rbind(visit_ts,c(year,j,0))
else
{
if(year!=startyear & year!=endyear)
visit_ts=rbind(visit_ts,c(year,j,0))
}
}}
else
{
i=i+1
}}
As I am new to R I am looking for an alternative/better solution to the problem which would not involve hard-coding the year and month vectors. Also please feel free to point out best programming practices.

We can use expand.grid with merge or left_join (here df1 is the question's visit_ts table):
library(dplyr)
expand.grid(Year = min(df1$Year):max(df1$Year), Month = 1:12) %>%
filter(!(Year == min(df1$Year) & Month %in% 1:3|
Year == max(df1$Year) & Month == 12)) %>%
left_join(., df1) %>%
mutate(Number_of_visits=replace(Number_of_visits, is.na(Number_of_visits), 0))
# Year Month Number_of_visits
#1 2012 1 123
#2 2012 2 0
#3 2012 3 0
#4 2011 4 1
#5 2012 4 0
#6 2011 5 0
#7 2012 5 0
#8 2011 6 3
#9 2012 6 0
#10 2011 7 23
#11 2012 7 0
#12 2011 8 0
#13 2012 8 0
#14 2011 9 0
#15 2012 9 0
#16 2011 10 0
#17 2012 10 0
#18 2011 11 0
#19 2012 11 3200
#20 2011 12 32
We can make it more dynamic: group by 'Year', build the sequence of 'Month' from its minimum to its maximum as a list column, unnest that column, join with the original dataset (left_join) and replace the NA values with 0.
library(tidyr)
df1 %>%
group_by(Year) %>%
summarise(Month = list(min(Month):max(Month))) %>%
unnest(Month) %>%
left_join(., df1) %>%
mutate(Number_of_visits=replace(Number_of_visits, is.na(Number_of_visits), 0))
# Year Month Number_of_visits
# <int> <int> <dbl>
#1 2011 4 1
#2 2011 5 0
#3 2011 6 3
#4 2011 7 23
#5 2011 8 0
#6 2011 9 0
#7 2011 10 0
#8 2011 11 0
#9 2011 12 32
#10 2012 1 123
#11 2012 2 0
#12 2012 3 0
#13 2012 4 0
#14 2012 5 0
#15 2012 6 0
#16 2012 7 0
#17 2012 8 0
#18 2012 9 0
#19 2012 10 0
#20 2012 11 3200
Another option is data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by 'Year', build the sequence from the minimum to the maximum 'Month', join with the original dataset on 'Year' and 'Month', and replace the NA values with 0.
library(data.table)
setDT(df1)
df1[df1[, .(Month=min(Month):max(Month)), Year],
on = c("Year", "Month")][is.na(Number_of_visits), Number_of_visits := 0][]
# Year Month Number_of_visits
# 1: 2011 4 1
# 2: 2011 5 0
# 3: 2011 6 3
# 4: 2011 7 23
# 5: 2011 8 0
# 6: 2011 9 0
# 7: 2011 10 0
# 8: 2011 11 0
# 9: 2011 12 32
#10: 2012 1 123
#11: 2012 2 0
#12: 2012 3 0
#13: 2012 4 0
#14: 2012 5 0
#15: 2012 6 0
#16: 2012 7 0
#17: 2012 8 0
#18: 2012 9 0
#19: 2012 10 0
#20: 2012 11 3200
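The grouped min-to-max approaches above fill gaps only between the first and last observed month of each year; if, say, 2011 had no row for December, that month would not be added. A minimal base R sketch that avoids this by converting each (Year, Month) pair into a running month index is shown below. The names idx, full and filled are made up for the illustration, and the data frame is re-created from the question so that the block runs on its own.

```r
# Data from the question
visit_ts <- data.frame(
  Year = c(2011, 2011, 2011, 2011, 2012, 2012),
  Month = c(4, 6, 7, 12, 1, 11),
  Number_of_visits = c(1, 3, 23, 32, 123, 3200)
)

# Running month index: consecutive months differ by exactly 1
idx <- visit_ts$Year * 12 + (visit_ts$Month - 1)

# Every month between the first and the last observation
full_idx <- seq(min(idx), max(idx))
full <- data.frame(Year = full_idx %/% 12, Month = full_idx %% 12 + 1)

# Left join onto the full grid, then turn the missing counts into 0
filled <- merge(full, visit_ts, by = c("Year", "Month"), all.x = TRUE)
filled$Number_of_visits[is.na(filled$Number_of_visits)] <- 0
filled
```

Nothing is hard-coded here: the first and last rows of the result are simply the earliest and latest months present in the data.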

Related

How to calculate the number of months from the initial date for each individual

This is a representation of my dataset
ID<-c(rep(1,10),rep(2,8))
year<-c(2007,2007,2007,2008,2008,2009,2010,2009,2010,2011,
2008,2008,2009,2010,2009,2010,2011,2011)
month<-c(2,7,12,4,11,6,11,1,9,4,3,6,7,4,9,11,2,8)
mydata<-data.frame(ID,year,month)
I want to calculate for each individual the number of months from the initial date. I am using two variables: year and month.
First I order by year and month:
library(dplyr)
library(lubridate)
mydata2 <- mydata %>% group_by(ID, year) %>% arrange(year, month, .by_group = TRUE)
Then I created the variable date, taking the day to be the 1st of the month:
mydata2$date <- paste("01", mydata2$month, mydata2$year, sep = "-")
Then I used lubridate to convert this variable to Date format:
mydata2$date <- dmy(mydata2$date)
But after this I don't know how to get to a dataset like the one below (preferably using dplyr):
ID year month date dif_from_init
1 1 2007 2 01-2-2007 0
2 1 2007 7 01-7-2007 5
3 1 2007 12 01-12-2007 10
4 1 2008 4 01-4-2008 14
5 1 2008 11 01-11-2008 21
6 1 2009 1 01-1-2009 23
7 1 2009 6 01-6-2009 28
8 1 2010 9 01-9-2010 43
9 1 2010 11 01-11-2010 45
10 1 2011 4 01-4-2011 50
11 2 2008 3 01-3-2008 0
12 2 2008 6 01-6-2008 3
13 2 2009 7 01-7-2009 16
14 2 2009 9 01-9-2009 18
15 2 2010 4 01-4-2010 25
16 2 2010 11 01-11-2010 32
17 2 2011 2 01-2-2011 35
18 2 2011 8 01-8-2011 41
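One way could be the following, which approximates a month as 365/12 days and uses the first row of each ID as the initial date: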
mydata %>%
group_by(ID) %>%
mutate(date = as.Date(sprintf('%d-%d-01',year, month)),
diff = as.numeric(round((date - date[1])/365*12)))
# A tibble: 18 x 5
# Groups: ID [2]
ID year month date diff
<dbl> <dbl> <dbl> <date> <dbl>
1 1 2007 2 2007-02-01 0
2 1 2007 7 2007-07-01 5
3 1 2007 12 2007-12-01 10
4 1 2008 4 2008-04-01 14
5 1 2008 11 2008-11-01 21
6 1 2009 6 2009-06-01 28
7 1 2010 11 2010-11-01 45
8 1 2009 1 2009-01-01 23
9 1 2010 9 2010-09-01 43
10 1 2011 4 2011-04-01 50
11 2 2008 3 2008-03-01 0
12 2 2008 6 2008-06-01 3
13 2 2009 7 2009-07-01 16
14 2 2010 4 2010-04-01 25
15 2 2009 9 2009-09-01 18
16 2 2010 11 2010-11-01 32
17 2 2011 2 2011-02-01 35
18 2 2011 8 2011-08-01 41
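Because every date falls on the first of a month, the difference can also be computed exactly from year and month, without going through days. The base R sketch below assumes that the initial date of each ID is its earliest month; month_index is a name invented for the illustration.

```r
# Data from the question
ID <- c(rep(1, 10), rep(2, 8))
year <- c(2007, 2007, 2007, 2008, 2008, 2009, 2010, 2009, 2010, 2011,
          2008, 2008, 2009, 2010, 2009, 2010, 2011, 2011)
month <- c(2, 7, 12, 4, 11, 6, 11, 1, 9, 4, 3, 6, 7, 4, 9, 11, 2, 8)
mydata <- data.frame(ID, year, month)

# Running month index: consecutive months differ by exactly 1
month_index <- mydata$year * 12 + mydata$month

# Months elapsed since the earliest month of each ID
mydata$dif_from_init <- month_index - ave(month_index, mydata$ID, FUN = min)

# Sort within each ID so the rows read chronologically
mydata <- mydata[order(mydata$ID, mydata$year, mydata$month), ]
mydata
```

Using min() per ID rather than the first row means the result does not depend on the rows being sorted beforehand.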

How to create a new column using looping and rbind in r?

I have data similar to this. I would like to create 3 columns (date1, date2, date3) using a loop and rbind, because I am required to use only that method.
(All I was told is to write a loop, subset the data, sort it, build a new data frame, and then rbind the pieces to make the new column.)
year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103
The result I expect is:
date1: number of days from 2011, January 1st, start again from 1 in a new year.
date2: number of days of an id working in a year, start again from 1 in a new year.
date3: number of days open within a year, start again from 1 in a new year.
(all of the dates are in ascending order)
year month day id date1 date2 date3
2011 1 5 3101 5 1 1
2011 1 14 3101 14 2 2
2011 2 3 3101 34 3 3
2011 2 4 3101 35 4 4
2012 1 27 3153 27 1 1
2012 2 20 3153 51 2 2
2012 2 22 3153 53 3 3
2012 3 1 3153 60 4 4
2013 1 31 3103 31 1 1
2013 2 1 3103 32 2 2
2013 2 4 3103 35 3 3
2013 3 4 3103 94 4 4
2013 3 6 3103 96 5 5
Please help! Thank you.
You can do this without the for loop and subset, which are not needed here. The answer is below:
df <- read.table(text =" year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103",header = T)
library(lubridate)
df$date1 <- yday(mdy(paste0(df$month,"-",df$day,"-",df$year)))
df$date2 <- ave(df$year, df$id, FUN = seq_along)
df$date3 <- ave(df$year, df$year, FUN = seq_along)
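If lubridate is not available, the day of the year can also be obtained in base R through the "%j" format code. The sketch below re-creates the question's data so that it runs on its own, and it groups date2 by both id and year, which is an assumption about what "start again from 1 in a new year" is meant to imply. Note that the calendar day of year for 1 March 2012 is 61, and for 4 and 6 March 2013 it is 63 and 65, so a few figures in the question's expected table may have been mistyped.

```r
# Data from the question
df <- data.frame(
  year  = c(2011, 2011, 2011, 2011, 2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2013),
  month = c(1, 1, 2, 2, 1, 2, 2, 3, 1, 2, 2, 3, 3),
  day   = c(5, 14, 3, 4, 27, 20, 22, 1, 31, 1, 4, 4, 6),
  id    = c(rep(3101, 4), rep(3153, 4), rep(3103, 5))
)

# Build Date objects, then extract the day of the year ("%j" gives 001-366)
dates <- as.Date(sprintf("%d-%02d-%02d", df$year, df$month, df$day))
df$date1 <- as.integer(format(dates, "%j"))

# Running count within each id and year
df$date2 <- ave(seq_along(df$id), df$id, df$year, FUN = seq_along)

# Running count within each year
df$date3 <- ave(seq_along(df$year), df$year, FUN = seq_along)
df
```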

Counting duplicates in R [duplicate]

I would like to create a new variable that counts, for each row, how many times that ID has already appeared in earlier years. For example, for a row in 2014 it should count how often the ID appeared in 2013, and for a row in 2015 it should count its appearances in both 2013 and 2014.
ID Term Year Repeats
122 L 2013 N/A
112 L 2013 N/A
002 L 2013 N/A
152 L 2013 N/A
124 L 2013 N/A
122 L 2014 1
102 L 2014 N/A
142 L 2014 N/A
152 L 2014 N/A
120 L 2014 N/A
198 L 2014 N/A
122 L 2015 2
012 L 2015 N/A
101 L 2015 N/A
092 L 2015 N/A
031 L 2015 N/A
If Year is in ascending order:
df$Repeats <- 0L
i <- which(duplicated(df$ID))
df$Repeats[i] <- with(df[i, ], unsplit(lapply(split(ID, ID), seq_along), ID))
df
# ID Term Year Repeats
#1 122 L 2013 0
#2 112 L 2013 0
#3 2 L 2013 0
#4 152 L 2013 0
#5 124 L 2013 0
#6 122 L 2014 1
#7 102 L 2014 0
#8 142 L 2014 0
#9 152 L 2014 1
#10 120 L 2014 0
#11 198 L 2014 0
#12 122 L 2015 2
#13 12 L 2015 0
#14 101 L 2015 0
#15 92 L 2015 0
#16 31 L 2015 0
Another base R solution:
d$Repeats <- ave(d$ID, d$ID, FUN = function(x) seq_along(x)-1)
# or a bit cleaner (thx to @DavidArenburg):
d$Repeats <- with(d, ave(ID, ID, FUN = seq_along)) - 1
which gives:
> d
ID Term Year Repeats
1 122 L 2013 0
2 112 L 2013 0
3 2 L 2013 0
4 152 L 2013 0
5 124 L 2013 0
6 122 L 2014 1
7 102 L 2014 0
8 142 L 2014 0
9 152 L 2014 1
10 120 L 2014 0
11 198 L 2014 0
12 122 L 2015 2
13 12 L 2015 0
14 101 L 2015 0
15 92 L 2015 0
16 31 L 2015 0
A solution using data.table:
library(data.table)
setDT(d, key = c('ID','Year'))
d[, Repeats := 0:(.N-1), by = ID]
which gives:
> d
ID Term Year Repeats
1: 2 L 2013 0
2: 12 L 2015 0
3: 31 L 2015 0
4: 92 L 2015 0
5: 101 L 2015 0
6: 102 L 2014 0
7: 112 L 2013 0
8: 120 L 2014 0
9: 122 L 2013 0
10: 122 L 2014 1
11: 122 L 2015 2
12: 124 L 2013 0
13: 142 L 2014 0
14: 152 L 2013 0
15: 152 L 2014 1
16: 198 L 2014 0
Alternatively, you can use the rowid function (added in data.table 1.9.8):
d[, Repeats := rowid(ID)-1]
With dplyr:
library(dplyr)
d %>% group_by(ID) %>% mutate(Repeats = row_number()-1)
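If you want NAs instead of zeros, you could use:
d[, Repeats := c(NA_integer_, seq_len(.N - 1L)), by = ID]  # seq_len() also handles IDs that occur only once, where 1:(.N-1) would yield c(1, 0)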
which will give:
ID Term Year Repeats
1: 2 L 2013 NA
2: 12 L 2015 NA
3: 31 L 2015 NA
4: 92 L 2015 NA
5: 101 L 2015 NA
6: 102 L 2014 NA
7: 112 L 2013 NA
8: 120 L 2014 NA
9: 122 L 2013 NA
10: 122 L 2014 1
11: 122 L 2015 2
12: 124 L 2013 NA
13: 142 L 2014 NA
14: 152 L 2013 NA
15: 152 L 2014 1
16: 198 L 2014 NA
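The NA variant can be written in base R as well. This is a minimal sketch that assumes only the ID and Year columns matter and that sorting by Year is acceptable; the data frame is re-typed from the question, with the IDs read as numbers.

```r
# IDs and years from the question
d <- data.frame(
  ID = c(122, 112, 2, 152, 124, 122, 102, 142, 152, 120, 198, 122, 12, 101, 92, 31),
  Year = rep(c(2013, 2014, 2015), times = c(5, 6, 5))
)

# Make sure earlier years come first
d <- d[order(d$Year), ]

# Number of earlier rows with the same ID
d$Repeats <- ave(seq_along(d$ID), d$ID, FUN = seq_along) - 1L

# First occurrences become NA instead of 0
d$Repeats[d$Repeats == 0L] <- NA
d
```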

Calculating mean date by row

I wish to obtain the mean date by row, where each row contains two dates. Eventually I found a way, posted below. However, the approach I used seems rather cumbersome. Is there a better way?
my.data = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE
1 3 6 2012 3 10 2012 1
2 3 10 2012 3 20 2012 1
3 3 16 2012 3 30 2012 1
4 3 20 2012 4 8 2012 1
5 3 20 2012 4 9 2012 1
6 3 20 2012 4 10 2012 1
7 3 20 2012 4 11 2012 1
8 4 4 2012 4 5 2012 1
9 4 6 2012 4 6 2012 1
10 4 6 2012 4 7 2012 1
", header = TRUE, stringsAsFactors = FALSE)
my.data
my.data$MY.DATE1 <- do.call(paste, list(my.data$MONTH1, my.data$DAY1, my.data$YEAR1))
my.data$MY.DATE2 <- do.call(paste, list(my.data$MONTH2, my.data$DAY2, my.data$YEAR2))
my.data$MY.DATE1 <- as.Date(my.data$MY.DATE1, format=c("%m %d %Y"))
my.data$MY.DATE2 <- as.Date(my.data$MY.DATE2, format=c("%m %d %Y"))
my.data
desired.result = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2 mean.date
1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10 2012-03-08
2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20 2012-03-15
3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30 2012-03-23
4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08 2012-03-29
5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09 2012-03-30
6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10 2012-03-30
7 3 20 2012 4 11 2012 1 2012-03-20 2012-04-11 2012-03-31
8 4 4 2012 4 5 2012 1 2012-04-04 2012-04-05 2012-04-04
9 4 6 2012 4 6 2012 1 2012-04-06 2012-04-06 2012-04-06
10 4 6 2012 4 7 2012 1 2012-04-06 2012-04-07 2012-04-06
", header = TRUE, stringsAsFactors = FALSE)
Here is the approach that worked for me:
my.data$mean.date <- (my.data$MY.DATE1 + ((my.data$MY.DATE2 - my.data$MY.DATE1) / 2))
my.data
These approaches did not work:
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2)
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2, trim = 0)
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2, trim = 1)
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2, trim = 0.5)
my.data$mean.data <- apply(my.data, 1, function(x) {(x[9] + x[10]) / 2})
I think I am supposed to use the Ops.Date command, but have not found an example.
Thank you for any suggestions.
Keep things simple and use mean.Date in base R.
mean.Date(as.Date(c("01-01-2014", "01-07-2014"), format=c("%m-%d-%Y")))
[1] "2014-01-04"
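The mean(MY.DATE1, MY.DATE2) attempts in the question fail because mean() averages the elements of its first argument only; as far as base R's signature goes, a second positional argument is passed on as trim rather than treated as more data. To average two date columns row by row without a loop, either take the midpoint directly or average the underlying day counts. A small sketch with three of the question's date pairs (d1, d2, mean_a and mean_b are illustrative names):

```r
d1 <- as.Date(c("2012-03-06", "2012-03-20", "2012-04-06"))
d2 <- as.Date(c("2012-03-10", "2012-04-08", "2012-04-07"))

# Midpoint of each pair, computed on whole vectors at once
mean_a <- d1 + (d2 - d1) / 2

# The same result via the numeric day counts behind the Date class
mean_b <- as.Date(rowMeans(cbind(as.numeric(d1), as.numeric(d2))),
                  origin = "1970-01-01")

format(mean_a)
```

When the two dates are an odd number of days apart, the result carries a half day internally, and printing shows the earlier of the two candidate dates.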
Using the good advice of @jaysunice3401, I came up with this. If you want to keep the original columns, you can add remove = FALSE to the two unite calls:
library(dplyr)
library(tidyr)
my.data %>%
unite(whatever1, matches("1"), sep = "-") %>%
unite(whatever2, matches("2"), sep = "-") %>%
mutate(across(contains("whatever"), ~ as.Date(.x, "%m-%d-%Y"))) %>%  # across() needs dplyr >= 1.0.0; mutate_each()/funs() are deprecated
rowwise() %>%
mutate(mean.date = mean.Date(c(whatever1, whatever2)))
# OBS whatever1 whatever2 STATE mean.date
#1 1 2012-03-06 2012-03-10 1 2012-03-08
#2 2 2012-03-10 2012-03-20 1 2012-03-15
#3 3 2012-03-16 2012-03-30 1 2012-03-23
#4 4 2012-03-20 2012-04-08 1 2012-03-29
#5 5 2012-03-20 2012-04-09 1 2012-03-30
#6 6 2012-03-20 2012-04-10 1 2012-03-30
#7 7 2012-03-20 2012-04-11 1 2012-03-31
#8 8 2012-04-04 2012-04-05 1 2012-04-04
#9 9 2012-04-06 2012-04-06 1 2012-04-06
#10 10 2012-04-06 2012-04-07 1 2012-04-06
Maybe something like this?
library(data.table)
setDT(my.data)[, `:=`(MY.DATE1 = as.Date(paste(DAY1 ,MONTH1, YEAR1), format = "%d %m %Y"),
MY.DATE2 = as.Date(paste(DAY2 ,MONTH2, YEAR2), format = "%d %m %Y"))][,
mean.date := MY.DATE2 - ceiling((MY.DATE2 - MY.DATE1)/2)]
my.data
# OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2 mean.date
# 1: 1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10 2012-03-08
# 2: 2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20 2012-03-15
# 3: 3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30 2012-03-23
# 4: 4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08 2012-03-29
# 5: 5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09 2012-03-30
# 6: 6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10 2012-03-30
# 7: 7 3 20 2012 4 11 2012 1 2012-03-20 2012-04-11 2012-03-31
# 8: 8 4 4 2012 4 5 2012 1 2012-04-04 2012-04-05 2012-04-04
# 9: 9 4 6 2012 4 6 2012 1 2012-04-06 2012-04-06 2012-04-06
# 10: 10 4 6 2012 4 7 2012 1 2012-04-06 2012-04-07 2012-04-06
Or, if you insist on using mean.Date, here's an alternative solution:
library(data.table)
setDT(my.data)[, `:=`(MY.DATE1 = as.Date(paste(DAY1 ,MONTH1, YEAR1), format = "%d %m %Y"),
MY.DATE2 = as.Date(paste(DAY2 ,MONTH2, YEAR2), format = "%d %m %Y"))][,
mean.date := mean.Date(c(MY.DATE1, MY.DATE2)), by = OBS]
One-liner (split for readability), uses lubridate and dplyr and (of course) pipes:
> require(lubridate)
> require(dplyr)
> my.data = my.data %>%
mutate(
MY.DATE1=as.Date(mdy(paste(MONTH1,DAY1,YEAR1))),
MY.DATE2=as.Date(mdy(paste(MONTH2,DAY2,YEAR2)))) %>%
rowwise %>%
mutate(mean.data=mean.Date(c(MY.DATE1,MY.DATE2))) %>% data.frame()
> head(my.data)
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2
1 1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10
2 2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20
3 3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30
4 4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08
5 5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09
6 6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10
mean.data
1 2012-03-08
2 2012-03-15
3 2012-03-23
4 2012-03-29
5 2012-03-30
6 2012-03-30
As an afterthought, if you like pipes, you can put a pipe in your pipe so you can pipe while you pipe - rewriting the first mutate step thus:
my.data %>% mutate(
MY.DATE1 = paste(MONTH1,DAY1,YEAR1) %>% mdy %>% as.Date,
MY.DATE2 = paste(MONTH2,DAY2,YEAR2) %>% mdy %>% as.Date)
1) Create Date class columns and then it's easy. No external packages are used:
asDate <- function(x) as.Date(x, "1970-01-01")
my.data2 <- transform(my.data,
date1 = as.Date(ISOdate(YEAR1, MONTH1, DAY1)),
date2 = as.Date(ISOdate(YEAR2, MONTH2, DAY2))
)
transform(my.data2, mean.date = asDate(rowMeans(cbind(date1, date2))))
If we did add a library(zoo) call then we could omit the asDate definition using as.Date in the last line instead of asDate since zoo adds a default origin to as.Date.
1a) A dplyr version would look like this (using asDate from above):
library(dplyr)
my.data %>%
mutate(
date1 = ISOdate(YEAR1, MONTH1, DAY1) %>% as.Date,
date2 = ISOdate(YEAR2, MONTH2, DAY2) %>% as.Date,
mean.date = cbind(date1, date2) %>% rowMeans %>% asDate)
2) Another way uses julian in the chron package. julian converts a month/day/year to the number of days since the Epoch. We can average the two julians and convert back to Date class:
library(zoo)
library(chron)
transform(my.data,
mean.date = as.Date( ( julian(MONTH1,DAY1,YEAR1) + julian(MONTH2,DAY2,YEAR2) )/2 )
)
We could omit library(zoo) if we used asDate from (1) in place of as.Date.
Update Discussed use of zoo to shorten the solutions and made further reductions in solution (1).
What about:
apply(my.data[,c("MY.DATE1","MY.DATE2")],1,function(date){substr(strptime(mean(c(strptime(date[1],"%y%y-%m-%d"),strptime(date[2],"%y%y-%m-%d"))),format="%y%y-%m-%d"),1,10)})
(I just had to use substr because of CET and CEST, which turned my output into a list...)
This is a vectorized version of the answer posted by jaysunice3401. It seems fairly straight-forward, except that I had to use trial-and-error to identify the correct origin. I do not know how general origin = "1970-01-01" is or whether a different origin would have to be specified with each data set.
According to this website: http://www.ats.ucla.edu/stat/r/faq/dates.htm
When R looks at dates as integers, its origin is January 1, 1970.
Which seems to suggest that origin = "1970-01-01" is fairly general. Although, if I had dates prior to "1970-01-01" in my data set I would definitely test the code before using it.
my.data = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE
1 3 6 2012 3 10 2012 1
2 3 10 2012 3 20 2012 1
3 3 16 2012 3 30 2012 1
4 3 20 2012 4 8 2012 1
5 3 20 2012 4 9 2012 1
6 3 20 2012 4 10 2012 1
7 3 20 2012 4 11 2012 1
8 4 4 2012 4 5 2012 1
9 4 6 2012 4 6 2012 1
10 4 6 2012 4 7 2012 1
", header = TRUE, stringsAsFactors = FALSE)
desired.result = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2 mean.date
1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10 2012-03-08
2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20 2012-03-15
3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30 2012-03-23
4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08 2012-03-29
5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09 2012-03-30
6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10 2012-03-30
7 3 20 2012 4 11 2012 1 2012-03-20 2012-04-11 2012-03-31
8 4 4 2012 4 5 2012 1 2012-04-04 2012-04-05 2012-04-04
9 4 6 2012 4 6 2012 1 2012-04-06 2012-04-06 2012-04-06
10 4 6 2012 4 7 2012 1 2012-04-06 2012-04-07 2012-04-06
", header = TRUE, stringsAsFactors = FALSE)
my.data$MY.DATE1 <- do.call(paste, list(my.data$MONTH1,my.data$DAY1,my.data$YEAR1))
my.data$MY.DATE2 <- do.call(paste, list(my.data$MONTH2,my.data$DAY2,my.data$YEAR2))
my.data$MY.DATE1 <- as.Date(my.data$MY.DATE1, format=c("%m %d %Y"))
my.data$MY.DATE2 <- as.Date(my.data$MY.DATE2, format=c("%m %d %Y"))
my.data$mean.date2 <- as.Date( apply(my.data, 1, function(x) {
mean.Date(c(as.Date(x['MY.DATE1']), as.Date(x['MY.DATE2'])))
}) , origin = "1970-01-01")
my.data
desired.result
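On the question of how general origin = "1970-01-01" is: in base R a Date is stored as the number of days since 1970-01-01, whatever the data set, and dates before that day simply have negative counts. A short check (the variable names are illustrative):

```r
# The origin itself is day 0
as.numeric(as.Date("1970-01-01"))

# A date before the origin is stored as a negative day count
before <- as.Date("1969-12-25")
n <- as.numeric(before)

# Converting the count back with the same origin recovers the date
back <- as.Date(n, origin = "1970-01-01")
back
```

So the same origin applies to dates before 1970 too, as long as the numbers being converted came from R's own Date class.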

Recoding two variables into a new variable [closed]

I'm trying to create a fiscal year variable called 'period', which will run from September through August for six years. My data frame 'dat' is structured as follows:
'data.frame': 52966 obs. of 4 variables:
$ userid : int 96 96 96 101 101 101 101 101 101 101 ...
$ comment.year : int 2008 2009 2009 2008 2008 2008 2008 2008 2008 2009 ...
$ comment.month: int 7 3 8 7 8 9 10 11 12 1 ...
$ num.comments : int 1 1 1 33 51 16 27 29 40 39 ...
When I run the code below, I get this error message:
Error: unexpected '=' in "dat$period[comment.year=2008 & comment.month="
I've experimented with double equal signs and with putting the month and year integers in quotes, but without success. I'm also wondering if there's a simpler way to do the recode: since I'm dealing with 6 years, my approach takes 72 lines.
dat$period[comment.year=2008 & comment.month=9]<-"1"
dat$period[comment.year=2008 & comment.month=10]<-"1"
dat$period[comment.year=2008 & comment.month=11]<-"1"
dat$period[comment.year=2008 & comment.month=12]<-"1"
dat$period[comment.year=2009 & comment.month=1]<-"1"
dat$period[comment.year=2009 & comment.month=2]<-"1"
dat$period[comment.year=2009 & comment.month=3]<-"1"
dat$period[comment.year=2009 & comment.month=4]<-"1"
dat$period[comment.year=2009 & comment.month=5]<-"1"
dat$period[comment.year=2009 & comment.month=6]<-"1"
dat$period[comment.year=2009 & comment.month=7]<-"1"
dat$period[comment.year=2009 & comment.month=8]<-"1"
dat$period[comment.year=2009 & comment.month=9]<-"2"
dat$period[comment.year=2009 & comment.month=10]<-"2"
dat$period[comment.year=2009 & comment.month=11]<-"2"
dat$period[comment.year=2009 & comment.month=12]<-"2"
Rather than doing a bunch of partial assignments, why not just calculate the difference in years, with a bonus bump for months >= 9?
#sample data
dat<-data.frame(
comment.year=rep(2009:2011, each=12),
comment.month=rep(1:12, 3)
)[-(1:8), ]
#assign new period
dat$period<- dat$comment.year-min(dat$comment.year) + ifelse(dat$comment.month>=9,1,0)
which gives you
comment.year comment.month period
9 2009 9 1
10 2009 10 1
11 2009 11 1
12 2009 12 1
13 2010 1 1
14 2010 2 1
15 2010 3 1
16 2010 4 1
17 2010 5 1
18 2010 6 1
19 2010 7 1
20 2010 8 1
21 2010 9 2
22 2010 10 2
23 2010 11 2
24 2010 12 2
25 2011 1 2
26 2011 2 2
27 2011 3 2
28 2011 4 2
29 2011 5 2
30 2011 6 2
31 2011 7 2
32 2011 8 2
33 2011 9 3
34 2011 10 3
35 2011 11 3
36 2011 12 3
If you want to make sure to start at a certain year, you can use that year (e.g. 2009) rather than min(dat$comment.year).
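For reference, the error in the question comes from writing = (assignment) where == (comparison) is needed; the columns also have to be referenced through the data frame, since comment.year on its own is not a variable in the workspace. The sketch below shows the corrected indexing on a few made-up rows, next to the arithmetic approach with a fixed first year. start_year and period2 are hypothetical names, and months before the first September come out as period 0 in the arithmetic version.

```r
# A few illustrative rows (not the asker's real data)
dat <- data.frame(
  comment.year  = c(2008, 2008, 2008, 2009, 2009, 2009, 2010),
  comment.month = c(7, 9, 12, 1, 8, 9, 8)
)

# Corrected version of the partial assignments: `==` compares,
# and each column is reached via dat$
dat$period <- NA_character_
dat$period[dat$comment.year == 2008 & dat$comment.month >= 9] <- "1"
dat$period[dat$comment.year == 2009 & dat$comment.month <= 8] <- "1"

# Arithmetic version with a fixed first fiscal year (assumed to begin Sep 2008)
start_year <- 2008
dat$period2 <- dat$comment.year - start_year +
  ifelse(dat$comment.month >= 9, 1, 0)
dat
```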
Using MrFlick's sample data:
dat$period = rep(1:3, each=12)[1:28]
dat
comment.year comment.month period
9 2009 9 1
10 2009 10 1
11 2009 11 1
12 2009 12 1
13 2010 1 1
14 2010 2 1
15 2010 3 1
16 2010 4 1
17 2010 5 1
18 2010 6 1
19 2010 7 1
20 2010 8 1
21 2010 9 2
22 2010 10 2
23 2010 11 2
24 2010 12 2
25 2011 1 2
26 2011 2 2
27 2011 3 2
28 2011 4 2
29 2011 5 2
30 2011 6 2
31 2011 7 2
32 2011 8 2
33 2011 9 3
34 2011 10 3
35 2011 11 3
36 2011 12 3
Can easily be extended to your data.
I guess you could also try (using @MrFlick's data):
set.seed(42)
dat1 <- dat[sample(1:nrow(dat)),]
dat<- within(dat, {period<- as.numeric(factor(comment.year))
period[comment.month <9] <- period[comment.month <9] -1})
dat
# comment.year comment.month period
#9 2009 9 1
#10 2009 10 1
#11 2009 11 1
#12 2009 12 1
#13 2010 1 1
#14 2010 2 1
#15 2010 3 1
#16 2010 4 1
#17 2010 5 1
#18 2010 6 1
#19 2010 7 1
#20 2010 8 1
#21 2010 9 2
#22 2010 10 2
#23 2010 11 2
#24 2010 12 2
#25 2011 1 2
#26 2011 2 2
#27 2011 3 2
#28 2011 4 2
#29 2011 5 2
#30 2011 6 2
#31 2011 7 2
#32 2011 8 2
#33 2011 9 3
#34 2011 10 3
#35 2011 11 3
#36 2011 12 3
Using the unordered dat1
within(dat1, {period<- as.numeric(factor(comment.year)); period[comment.month <9] <- period[comment.month <9] -1})[,3]
#[1] 3 3 1 2 2 1 2 1 2 2 1 2 2 1 1 2 2 1 1 1 3 1 2 1 2 1 2 3
Cross-checking the results with @MrFlick's method:
dat1$comment.year-min(dat1$comment.year) + ifelse(dat1$comment.month>=9,1,0)
# [1] 3 3 1 2 2 1 2 1 2 2 1 2 2 1 1 2 2 1 1 1 3 1 2 1 2 1 2 3

Resources