Repeating annual values multiple times to form a monthly dataframe - r

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
The values in columns xxx and yyy are the values for that year.
I would like to expand this dataframe (or create a new one) so that it keeps the same column names but repeats each row 12 times, once for each month of the year, including the year value in the first column.
The code below mocks up the desired result:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.

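Convert df to a matrix, take the Kronecker product with a vector of 12 ones, and then convert back to a data.frame. The as.data.frame call can be omitted if a matrix result is acceptable. Note that %x% does not carry over the column names, so they are reassigned with setNames.
setNames(as.data.frame(as.matrix(df) %x% rep(1, 12)), names(df))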
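If you would rather not go through a matrix, another option is to repeat the row indices. This is only a sketch of that idea in base R; the month column at the end is my own addition for illustration and was not part of the question.

```r
# Rebuild the annual data from the question.
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx  = c(1, 2, 3),
                 yyy  = c(4, 5, 6))

# Repeat each row index 12 times and subset; column names and types are kept.
df2 <- df[rep(seq_len(nrow(df)), each = 12), ]
rownames(df2) <- NULL  # replace row names like "1.1", "1.2" with 1..36

# Optional (an assumption, not requested): label the month within each year.
df2$month <- rep(1:12, times = nrow(df))
```

Because this works on rows rather than on a numeric matrix, it should also behave sensibly if the dataframe contains character or factor columns.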

Related

How to calculate the number of months from the initial date for each individual

This is a representation of my dataset
ID<-c(rep(1,10),rep(2,8))
year<-c(2007,2007,2007,2008,2008,2009,2010,2009,2010,2011,
2008,2008,2009,2010,2009,2010,2011,2011)
month<-c(2,7,12,4,11,6,11,1,9,4,3,6,7,4,9,11,2,8)
mydata<-data.frame(ID,year,month)
I want to calculate, for each individual, the number of months elapsed since that individual's first date, using the two variables year and month.
First I sort by year and month:
mydata2<-mydata%>%group_by(ID,year)%>%arrange(year,month,.by_group=T)
Then I create a date variable, assuming each date falls on the first day of the month:
mydata2$date<-paste("01",mydata2$month,mydata2$year,sep = "-")
Then I use lubridate to convert this variable to Date format:
mydata2$date<-dmy(mydata2$date)
After this, I don't know how to proceed to get a dataset like the one below (preferably using dplyr):
ID year month date dif_from_init
1 1 2007 2 01-2-2007 0
2 1 2007 7 01-7-2007 5
3 1 2007 12 01-12-2007 10
4 1 2008 4 01-4-2008 14
5 1 2008 11 01-11-2008 21
6 1 2009 1 01-1-2009 23
7 1 2009 6 01-6-2009 28
8 1 2010 9 01-9-2010 43
9 1 2010 11 01-11-2010 45
10 1 2011 4 01-4-2011 50
11 2 2008 3 01-3-2008 0
12 2 2008 6 01-6-2008 3
13 2 2009 7 01-7-2009 16
14 2 2009 9 01-9-2009 18
15 2 2010 4 01-4-2010 25
16 2 2010 11 01-11-2010 32
17 2 2011 2 01-2-2011 35
18 2 2011 8 01-8-2011 41
One way could be:
mydata %>%
group_by(ID) %>%
mutate(date = as.Date(sprintf('%d-%d-01',year, month)),
diff = as.numeric(round((date - date[1])/365*12)))
# A tibble: 18 x 5
# Groups: ID [2]
ID year month date diff
<dbl> <dbl> <dbl> <date> <dbl>
1 1 2007 2 2007-02-01 0
2 1 2007 7 2007-07-01 5
3 1 2007 12 2007-12-01 10
4 1 2008 4 2008-04-01 14
5 1 2008 11 2008-11-01 21
6 1 2009 6 2009-06-01 28
7 1 2010 11 2010-11-01 45
8 1 2009 1 2009-01-01 23
9 1 2010 9 2010-09-01 43
10 1 2011 4 2011-04-01 50
11 2 2008 3 2008-03-01 0
12 2 2008 6 2008-06-01 3
13 2 2009 7 2009-07-01 16
14 2 2010 4 2010-04-01 25
15 2 2009 9 2009-09-01 18
16 2 2010 11 2010-11-01 32
17 2 2011 2 2011-02-01 35
18 2 2011 8 2011-08-01 41
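The approach above estimates months from a day count (days / 365 * 12, rounded) and measures from the first row of each group rather than the earliest date. As a hedged alternative, the month difference can be computed directly from year and month with integer arithmetic. The helper column month_index is a name I made up for this sketch.

```r
library(dplyr)

ID <- c(rep(1, 10), rep(2, 8))
year <- c(2007, 2007, 2007, 2008, 2008, 2009, 2010, 2009, 2010, 2011,
          2008, 2008, 2009, 2010, 2009, 2010, 2011, 2011)
month <- c(2, 7, 12, 4, 11, 6, 11, 1, 9, 4, 3, 6, 7, 4, 9, 11, 2, 8)
mydata <- data.frame(ID, year, month)

result <- mydata %>%
  arrange(ID, year, month) %>%                # chronological order within each ID
  group_by(ID) %>%
  mutate(month_index   = year * 12 + month,   # months on a single running scale
         dif_from_init = month_index - min(month_index)) %>%
  ungroup() %>%
  select(-month_index)
```

Since no dates are involved, there is no rounding and leap years do not matter; a date column can still be added afterwards if it is needed for display.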

Merging two dataframes creates new missing observations

I have two dataframes with the following matching keys: year, region and province. They each have a set of variables (in this illustrative example I use x1 for df1 and x2 for df2) and both variables have several missing values on their own.
df1                                  df2
year  region  province  x2 ... xn    year  region  province  x2 ... xn
2019  1       5         NA           2019  1       5         NA
2019  2       4         NA           2019  2       4         NA
2019  2       4         NA           2019  2       4         NA
2018  3       7         13           2018  3       7         13
2018  3       7         15           2018  3       7         15
2018  3       7         17           2018  3       7         17
I want to merge both dataframes such that they end up like this:
year region province x1 x2
2019 1 5 3 NA
2019 2 4 27 NA
2019 2 4 15 NA
2018 3 7 12 13
2018 3 7 NA 15
2018 3 7 NA 17
2017 4 9 NA 12
2017 4 9 19 30
2017 4 9 20 10
However, when doing so using merged_df <- merge(df1, df2, by=c("year","region","province"), all.x=TRUE), R seems to create a lot of additional missing values on each of the variable columns (x1 and x2), which were not there before. What is happening here? I have tried sorting both using df1 %>% arrange(province,-year) and df2 %>% arrange(province,-year), which is enough to have matching order in both dataframes, only to find the same issue when running the merge command. I've tried a bunch of other stuff too, but nothing seems to work. R's output sort of looks like this:
year region province x1 x2
2019 1 5 NA NA
2019 2 4 NA NA
2019 2 4 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2017 4 9 15 NA
2017 4 9 19 30
2017 4 9 20 10
I have done this before; in fact, one of the dataframes is an already merged dataframe in which I did not encounter this issue.
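Perhaps the way merge() works is not clear. With all.x = TRUE, every row of the first dataframe is kept, and the columns coming from the second dataframe are filled with NA wherever the key combination has no match. With all = TRUE, unmatched rows from both dataframes are kept. Here are two examples with example data; I hope they help.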
#Data
set.seed(123)
DF1 <- data.frame(year=rep(c(2017,2018,2019),3),
region=rep(c(1,2,3),3),
province=round(runif(9,1,5),0),
x1=rnorm(9,3,1.5))
DF2 <- data.frame(year=rep(c(2016,2018,2019),3),
region=rep(c(1,2,3),3),
province=round(runif(9,1,5),0),
x2=rnorm(9,3,1.5))
# Left join: keep every row of DF1, with NA in x2 where DF2 has no matching key
Merged1 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all.x=TRUE)
Merged1
year region province x1 x2
1 2017 1 2 2.8365510 NA
2 2017 1 3 3.7557187 NA
3 2017 1 5 4.9208323 NA
4 2018 2 4 2.8241371 NA
5 2018 2 5 6.7925048 1.460993
6 2018 2 5 0.4090941 1.460993
7 2019 3 1 5.5352765 NA
8 2019 3 3 3.8236451 4.256681
9 2019 3 3 3.2746239 4.256681
# Full join: keep all rows from both dataframes, even where the keys have no match
Merged2 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all = TRUE)
Merged2
year region province x1 x2
1 2016 1 3 NA 4.052034
2 2016 1 4 NA 2.062441
3 2016 1 5 NA 2.673038
4 2017 1 2 2.8365510 NA
5 2017 1 3 3.7557187 NA
6 2017 1 5 4.9208323 NA
7 2018 2 1 NA 0.469960
8 2018 2 2 NA 2.290813
9 2018 2 4 2.8241371 NA
10 2018 2 5 6.7925048 1.460993
11 2018 2 5 0.4090941 1.460993
12 2019 3 1 5.5352765 NA
13 2019 3 2 NA 1.398264
14 2019 3 3 3.8236451 4.256681
15 2019 3 3 3.2746239 4.256681
16 2019 3 4 NA 1.906663
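Before merging, it can help to check two things that commonly cause surprises: repeated key combinations and key columns stored with different classes. The sketch below uses small made-up dataframes (not the asker's real data) in which the key (2018, 3, 7) occurs twice on each side.

```r
keys <- c("year", "region", "province")

# Hypothetical toy data for illustration only.
df1 <- data.frame(year = c(2018, 2018, 2017), region = c(3, 3, 4),
                  province = c(7, 7, 9), x1 = c(12, NA, 19))
df2 <- data.frame(year = c(2018, 2018, 2017), region = c(3, 3, 4),
                  province = c(7, 7, 9), x2 = c(13, 15, 30))

# A non-zero result means at least one key combination is repeated.
dup1 <- anyDuplicated(df1[keys])
dup2 <- anyDuplicated(df2[keys])

# The key columns should have the same class in both dataframes
# (e.g. not numeric in one and character or factor in the other).
same_class <- identical(sapply(df1[keys], class), sapply(df2[keys], class))

merged <- merge(df1, df2, by = keys, all.x = TRUE)
n_merged <- nrow(merged)  # 2 x 2 rows for the repeated key, plus 1 for the unique key
```

When a key is repeated on both sides, merge() pairs every matching row of one dataframe with every matching row of the other, so the result has more rows than either input. If the rows are meant to line up one-to-one, adding a row identifier to the keys is one possible way to make them unique.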

How to create a new column using looping and rbind in r?

I have data similar to this. I would like to create 3 columns (date1, date2, date3) using a loop and rbind, because I am required to use that method.
(All I was told is to write a loop, subset the data, sort it, build a new data frame, and then rbind the pieces to make the new columns.)
year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103
The result I expect is:
date1: number of days from 2011, January 1st, start again from 1 in a new year.
date2: number of days of an id working in a year, start again from 1 in a new year.
date3: number of days open within a year, start again from 1 in a new year.
(all of the dates are in ascending order)
year month day id date1 date2 date3
2011 1 5 3101 5 1 1
2011 1 14 3101 14 2 2
2011 2 3 3101 34 3 3
2011 2 4 3101 35 4 4
2012 1 27 3153 27 1 1
2012 2 20 3153 51 2 2
2012 2 22 3153 53 3 3
2012 3 1 3153 60 4 4
2013 1 31 3103 31 1 1
2013 2 1 3103 32 2 2
2013 2 4 3103 35 3 3
2013 3 4 3103 94 4 4
2013 3 6 3103 96 5 5
Please help! Thank you.
You can do it without a for loop or subsetting. Here is one way:
df <- read.table(text =" year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103",header = T)
library(lubridate)
df$date1 <- yday(mdy(paste0(df$month,"-",df$day,"-",df$year)))
df$date2 <- ave(df$year, df$id, FUN = seq_along)
df$date3 <- ave(df$year, df$year, FUN = seq_along)
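If the loop, subset, sort and rbind route really is required, here is a rough base R sketch of what that could look like. The names pieces, sub and dates are placeholders of mine, and I am assuming date2 counts rows per id within a year and date3 counts rows within a year, as described in the question.

```r
df <- data.frame(
  year  = c(2011, 2011, 2011, 2011, 2012, 2012, 2012, 2012,
            2013, 2013, 2013, 2013, 2013),
  month = c(1, 1, 2, 2, 1, 2, 2, 3, 1, 2, 2, 3, 3),
  day   = c(5, 14, 3, 4, 27, 20, 22, 1, 31, 1, 4, 4, 6),
  id    = c(3101, 3101, 3101, 3101, 3153, 3153, 3153, 3153,
            3103, 3103, 3103, 3103, 3103)
)

pieces <- list()
for (yr in sort(unique(df$year))) {
  sub <- df[df$year == yr, ]               # subset one year
  sub <- sub[order(sub$month, sub$day), ]  # sort it chronologically
  dates <- as.Date(paste(sub$year, sub$month, sub$day, sep = "-"))
  sub$date1 <- as.integer(format(dates, "%j"))                   # day of the year
  sub$date2 <- ave(seq_len(nrow(sub)), sub$id, FUN = seq_along)  # running count per id
  sub$date3 <- seq_len(nrow(sub))                                # running count per year
  pieces[[length(pieces) + 1]] <- sub
}

# Stack the yearly pieces back into one data frame.
result <- do.call(rbind, pieces)
```

Note that the day of the year for 2012-03-01 comes out as 61 (2012 is a leap year), and the two March 2013 rows come out as 63 and 65, which differs from the hand-written expected table in the question.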

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to compute a running count of days (I'm not sure whether "lump sum" is the right term) and store it in a new column "date", so that the column counts the days cumulatively across the 3 years of data, in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result was wrong and the code is too long. It doesn't handle February correctly, since February has only 28 days. Is there a shorter way?
cday <- function(data,syear=2011,smonth=1,sday=1){
year <- data[1]
month <- data[2]
day <- data[3]
cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
date <- (year-syear)*365+sum(cmonth[1:month])+day
for(yr in c(syear:year)){
if(yr==year){
if(yr%%4==0&&month>2){date<-date+1}
}else{
if(yr%%4==0){date<-date+1}
}
}
return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated; look for tools that do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
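For comparison, a lubridate version of the same columns might look roughly like this. It is a sketch that assumes lubridate is installed; make_date() and yday() are lubridate functions, and the column names simply mirror the base R version.

```r
library(lubridate)

df <- data.frame(
  year  = c(2011, 2011, 2011, 2012, 2012, 2012, 2012, 2013, 2013, 2013),
  month = c(1, 2, 8, 2, 3, 4, 5, 2, 5, 6),
  day   = c(5, 14, 21, 24, 3, 4, 6, 14, 17, 24)
)

# Build a Date from its numeric parts, with no string pasting needed.
df$date <- make_date(df$year, df$month, df$day)

# Day of the year (restarts at 1 each January).
df$julian_day <- yday(df$date)

# Running day count across years, with 2011-01-01 counted as day 1.
df$days_since_2010 <- as.integer(df$date - as.Date("2010-12-31"))
```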

replace NA with previous 2 years values

I have 2 dataframes. In df1 there are NA values in p1 that need to be replaced with the mean of Average_f1 from df2 over the previous 2 years.
E.g. in df1, row 5 has year 2015 and bin 5, so its NA should be replaced with the mean of Average_f1 for the same bin in df2 over the previous 2 years (2013 and 2014). For row 7, only 1 year of values is available.
df1                 df2
year  p1  bin       year  bin_p1  Average_f1
2013  20  1         2013  5       29.5
2013  24  1         2014  5       16.5
2014  10  2         2015  NA      30
2014  11  2         2016  7       12
2015  NA  5
2016  10  3
2017  NA  7
output
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 **23** 5
2016 10 3
2017 **12** 7
Thanks in advance
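One possible way to do this in base R is sketched below. It assumes that "previous 2 years" means the two calendar years before the row's year, and that bin in df1 is matched against bin_p1 in df2. The helper name fill_from_previous_years is made up for this example.

```r
df1 <- data.frame(year = c(2013, 2013, 2014, 2014, 2015, 2016, 2017),
                  p1   = c(20, 24, 10, 11, NA, 10, NA),
                  bin  = c(1, 1, 2, 2, 5, 3, 7))
df2 <- data.frame(year       = c(2013, 2014, 2015, 2016),
                  bin_p1     = c(5, 5, NA, 7),
                  Average_f1 = c(29.5, 16.5, 30, 12))

# Mean of Average_f1 in df2 for the same bin over the two preceding years.
# Returns NA if df2 has no matching rows.
fill_from_previous_years <- function(yr, b) {
  hit <- !is.na(df2$bin_p1) & df2$bin_p1 == b &
    df2$year %in% c(yr - 2, yr - 1)
  if (any(hit)) mean(df2$Average_f1[hit]) else NA_real_
}

# Only touch the rows where p1 is missing.
missing <- is.na(df1$p1)
df1$p1[missing] <- mapply(fill_from_previous_years,
                          df1$year[missing], df1$bin[missing])
```

With the example data, row 5 becomes 23 (the mean of 29.5 and 16.5) and row 7 becomes 12, matching the expected output in the question.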
