Replace NA with the mean of the previous 2 years' values in R

I have two data frames. df1 has NA values in p1 that need to be replaced with the mean of Average_f1 from df2 over the previous 2 years for the same bin.
For example, row 5 of df1 has year 2015 and bin 5, so its NA should be replaced with the mean of the 2013 and 2014 values for bin 5 in df2. For row 7 (year 2017, bin 7), only one of the two previous years has a value (2016), so that single value is used.
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 NA 5
2016 10 3
2017 NA 7
df2
year bin_p1 Average_f1
2013 5 29.5
2014 5 16.5
2015 NA 30
2016 7 12
output
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 **23** 5
2016 10 3
2017 **12** 7
Thanks in advance
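One possible approach, as a minimal sketch in base R: for each NA row in df1, look up the df2 rows with the same bin in the two preceding years and average whatever is found. The helper name fill_value is hypothetical, and the sketch assumes that a preceding year with no matching df2 row is simply skipped.

```r
df1 <- data.frame(year = c(2013, 2013, 2014, 2014, 2015, 2016, 2017),
                  p1   = c(20, 24, 10, 11, NA, 10, NA),
                  bin  = c(1, 1, 2, 2, 5, 3, 7))
df2 <- data.frame(year       = c(2013, 2014, 2015, 2016),
                  bin_p1     = c(5, 5, NA, 7),
                  Average_f1 = c(29.5, 16.5, 30, 12))

# Hypothetical helper: mean of Average_f1 for bin b in years yr-1 and yr-2.
fill_value <- function(yr, b) {
  hits <- df2$Average_f1[!is.na(df2$bin_p1) & df2$bin_p1 == b &
                           df2$year %in% c(yr - 1, yr - 2)]
  if (length(hits) == 0) NA_real_ else mean(hits)
}

# Only the rows where p1 is missing are touched.
na_rows <- which(is.na(df1$p1))
df1$p1[na_rows] <- mapply(fill_value, df1$year[na_rows], df1$bin[na_rows])
df1
```

With the sample data, row 5 becomes the mean of 29.5 and 16.5, and row 7 takes the single 2016 value for bin 7.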

Related

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
The values in columns xxx and yyy correspond to that year.
I would like to expand this data frame (or create a new one) so that it keeps the same column names but repeats each row 12 times, once per month of that year. The year in the first column is therefore also repeated 12 times.
The code below mocks up the desired result:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
Convert df to a matrix, take the Kronecker product with a vector of 12 ones, and convert back to a data.frame. The Kronecker product drops the column names, so restore them with setNames. If a matrix result is acceptable, as.data.frame can be omitted.
setNames(as.data.frame(as.matrix(df) %x% rep(1, 12)), names(df))
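As an alternative sketch that avoids the matrix round trip, the rows can be repeated by indexing the data frame with a repeated row index. This keeps the column names and column types as they are; resetting the row names is optional and only done here for tidy output.

```r
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx  = c(1, 2, 3),
                 yyy  = c(4, 5, 6))

# Repeat each row index 12 times, then use it to select rows.
df2 <- df[rep(seq_len(nrow(df)), each = 12), ]
rownames(df2) <- NULL  # replace 1, 1.1, 1.2, ... with 1..36
head(df2, 13)
```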

Merging two dataframes creates new missing observations

I have two data frames that share the keys year, region and province. Each has its own set of variables (in this illustrative example, x1 for df1 and x2 for df2), and both variables already contain several missing values.
df1
year region province x1 ... xn
2019 1 5 NA
2019 2 4 NA
2019 2 4 NA
2018 3 7 13
2018 3 7 15
2018 3 7 17
df2
year region province x2 ... xn
2019 1 5 NA
2019 2 4 NA
2019 2 4 NA
2018 3 7 13
2018 3 7 15
2018 3 7 17
I want to merge both dataframes such that they end up like this:
year region province x1 x2
2019 1 5 3 NA
2019 2 4 27 NA
2019 2 4 15 NA
2018 3 7 12 13
2018 3 7 NA 15
2018 3 7 NA 17
2017 4 9 NA 12
2017 4 9 19 30
2017 4 9 20 10
However, when I run merged_df <- merge(df1, df2, by=c("year","region","province"), all.x=TRUE), R seems to create many additional missing values in both variable columns (x1 and x2) that were not there before. What is happening here? I tried sorting both data frames first with df1 %>% arrange(province,-year) and df2 %>% arrange(province,-year), which puts them in matching order, but the merge gives the same result. I have tried several other things as well, and nothing works. R's output looks roughly like this:
year region province x1 x2
2019 1 5 NA NA
2019 2 4 NA NA
2019 2 4 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2017 4 9 15 NA
2017 4 9 19 30
2017 4 9 20 10
I have done this before; in fact, one of the dataframes is an already merged dataframe in which I did not encounter this issue.
Perhaps it is not clear how merge() works: it matches rows by the values of the key columns, not by row order. Here are two examples with sample data that I hope will help.
# Data
set.seed(123)
DF1 <- data.frame(year=rep(c(2017,2018,2019),3),
                  region=rep(c(1,2,3),3),
                  province=round(runif(9,1,5),0),
                  x1=rnorm(9,3,1.5))
DF2 <- data.frame(year=rep(c(2016,2018,2019),3),
                  region=rep(c(1,2,3),3),
                  province=round(runif(9,1,5),0),
                  x2=rnorm(9,3,1.5))
# Merge keeping every row of DF1 (left join); rows without a match get NA in x2
Merged1 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all.x=TRUE)
Merged1
year region province x1 x2
1 2017 1 2 2.8365510 NA
2 2017 1 3 3.7557187 NA
3 2017 1 5 4.9208323 NA
4 2018 2 4 2.8241371 NA
5 2018 2 5 6.7925048 1.460993
6 2018 2 5 0.4090941 1.460993
7 2019 3 1 5.5352765 NA
8 2019 3 3 3.8236451 4.256681
9 2019 3 3 3.2746239 4.256681
# Merge keeping every row of both data frames (full join), even without a match
Merged2 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all=TRUE)
Merged2
year region province x1 x2
1 2016 1 3 NA 4.052034
2 2016 1 4 NA 2.062441
3 2016 1 5 NA 2.673038
4 2017 1 2 2.8365510 NA
5 2017 1 3 3.7557187 NA
6 2017 1 5 4.9208323 NA
7 2018 2 1 NA 0.469960
8 2018 2 2 NA 2.290813
9 2018 2 4 2.8241371 NA
10 2018 2 5 6.7925048 1.460993
11 2018 2 5 0.4090941 1.460993
12 2019 3 1 5.5352765 NA
13 2019 3 2 NA 1.398264
14 2019 3 3 3.8236451 4.256681
15 2019 3 3 3.2746239 4.256681
16 2019 3 4 NA 1.906663
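To see why sorting does not help, it may be useful to check the keys themselves. The sketch below uses made-up toy data (not the asker's) and counts two things: key combinations that occur more than once, which merge() pairs with every match, and key combinations in df1 that have no partner in df2, which produce NA under all.x=TRUE. The variable names n_dup1, n_dup2 and n_unmatched are illustrative only.

```r
keys <- c("year", "region", "province")

# Toy data: the 2018 key is duplicated in both frames, 2017 exists only in df1.
df1 <- data.frame(year = c(2019, 2018, 2018, 2017),
                  region = c(1, 3, 3, 4),
                  province = c(5, 7, 7, 9),
                  x1 = c(3, 12, NA, 19))
df2 <- data.frame(year = c(2019, 2018, 2018),
                  region = c(1, 3, 3),
                  province = c(5, 7, 7),
                  x2 = c(NA, 13, 15))

merged <- merge(df1, df2, by = keys, all.x = TRUE)

# Duplicated key combinations: each df1 row is paired with every matching df2 row.
n_dup1 <- sum(duplicated(df1[keys]))
n_dup2 <- sum(duplicated(df2[keys]))

# Key combinations of df1 that never appear in df2: these rows get NA in x2.
key1 <- do.call(paste, c(df1[keys], sep = "_"))
key2 <- do.call(paste, c(df2[keys], sep = "_"))
n_unmatched <- sum(!key1 %in% key2)

nrow(merged)  # more rows than df1 because the 2018 rows are cross-paired
```

If the keys do not uniquely identify rows, an extra key column (for example a within-group row counter) is usually needed before merging.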

How to create a new column using looping and rbind in r?

I have data similar to this. I would like to create 3 columns (date1, date2, date3) using a loop and rbind, because I am required to use only that method.
(All I was told is to make a loop, subset the data, sort it, build a new data frame, and then rbind the pieces to create the new columns.)
year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103
The result I expect is:
date1: day of the year, counted from January 1st and restarting at 1 each new year.
date2: running count of the days an id worked within a year, restarting at 1 each new year.
date3: running count of the days open within a year, restarting at 1 each new year.
(All dates are in ascending order.)
year month day id date1 date2 date3
2011 1 5 3101 5 1 1
2011 1 14 3101 14 2 2
2011 2 3 3101 34 3 3
2011 2 4 3101 35 4 4
2012 1 27 3153 27 1 1
2012 2 20 3153 51 2 2
2012 2 22 3153 53 3 3
2012 3 1 3153 61 4 4
2013 1 31 3103 31 1 1
2013 2 1 3103 32 2 2
2013 2 4 3103 35 3 3
2013 3 4 3103 63 4 4
2013 3 6 3103 65 5 5
Please help! Thank you.
You can do this without an explicit for loop or subset:
df <- read.table(text =" year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
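2013 3 6 3103",header = TRUE)
library(lubridate)
# date1: day of the year for each month-day-year date
df$date1 <- yday(mdy(paste0(df$month,"-",df$day,"-",df$year)))
# date2: running row count within each id
df$date2 <- ave(df$year, df$id, FUN = seq_along)
# date3: running row count within each year
df$date3 <- ave(df$year, df$year, FUN = seq_along)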
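If the loop, subset, sort and rbind steps are a hard requirement, the same columns can be built that way too. The following is only a sketch in base R on a few rows in the style of the question's data (deliberately unsorted within 2011). It assumes each row is a distinct open day, so date3 is simply the row counter within the year; the names parts and part are illustrative.

```r
df <- data.frame(
  year  = c(2011, 2011, 2012, 2012, 2012),
  month = c(2, 1, 1, 2, 3),
  day   = c(3, 5, 27, 20, 1),
  id    = c(3101, 3101, 3153, 3153, 3153)
)

parts <- list()
for (yr in sort(unique(df$year))) {
  part <- subset(df, year == yr)               # one year at a time
  part <- part[order(part$month, part$day), ]  # sort dates ascending
  dates <- as.Date(paste(part$year, part$month, part$day, sep = "-"))
  part$date1 <- as.integer(format(dates, "%j"))                     # day of year
  part$date2 <- ave(seq_len(nrow(part)), part$id, FUN = seq_along)  # count per id
  part$date3 <- seq_len(nrow(part))                                 # count per year
  parts[[length(parts) + 1]] <- part
}

# Stack the yearly pieces back into one data frame.
result <- do.call(rbind, parts)
rownames(result) <- NULL
result
```

Collecting the pieces in a list and calling rbind once at the end avoids growing the data frame inside the loop.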

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to compute a cumulative day count (I'm not sure whether "lump sum" is the right term) and store it in a new column "date", so that the count keeps increasing across the 3 years of data.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote the code below, but the result is wrong and the code is too long. It does not handle February correctly, since February has only 28 days. Is there a shorter way?
cday <- function(data,syear=2011,smonth=1,sday=1){
  year <- data[1]
  month <- data[2]
  day <- data[3]
  cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
  date <- (year-syear)*365+sum(cmonth[1:month])+day
  for(yr in c(syear:year)){
    if(yr==year){
      if(yr%%4==0&&month>2){date<-date+1}
    }else{
      if(yr%%4==0){date<-date+1}
    }
  }
  return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect a result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use the Date class. Dates and times are complicated, so look for tools that handle them for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
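For comparison with the question's cday function, here is a sketch of a vectorized replacement built on Date arithmetic. The function name days_since and its start argument are hypothetical, and the three rows of op10 are made up for illustration. Counting starts at 1 on the start date, and leap years are handled by the Date class rather than by a month-length table.

```r
# Hypothetical helper: day number counted from `start` (the start date is day 1).
days_since <- function(year, month, day, start = as.Date("2011-01-01")) {
  dates <- as.Date(paste(year, month, day, sep = "-"))
  as.integer(dates - start) + 1L
}

# Illustrative rows; op10 is the data frame name used in the question.
op10 <- data.frame(year  = c(2011, 2012, 2013),
                   month = c(1, 3, 2),
                   day   = c(5, 1, 14))
op10$day.no <- with(op10, days_since(year, month, day))
op10
```

Because the function works on whole columns, no apply() call is needed.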

Search in a column based on the value of a different column

I have a simple table with three columns ("Year", "Target", "Value") and I would like to create a new column (Resp) containing, for each row, the first later "Year" in which "Value" is higher than that row's "Target".
This is part of the table:
db <- data.frame(Year=2010:2017, Target=c(3,5,2,7,5,8,3,6), Value=c(4,5,2,3,4,9,5,8))
print(db)
Year Target Value
1 2010 3 4
2 2011 5 5
3 2012 2 2
4 2013 7 3
5 2014 5 4
6 2015 8 9
7 2016 3 5
8 2017 6 8
The intended result is:
Year Target Value Resp
1 2010 3 4 2011
2 2011 5 5 2015
3 2012 2 2 2013
4 2013 7 3 2015
5 2014 5 4 2015
6 2015 8 9 NA
7 2016 3 5 2017
8 2017 6 8 NA
Any suggestions on how I can solve this problem?
In addition to the 'Resp' column, I want to create another column (Black.Y) containing the "Year" corresponding to the minimum of "Value" until "Value" is higher than "Target".
The intended result is:
Year Target Value Resp Black.Y
1 2010 3 4 2011 NA
2 2011 5 5 2015 2012
3 2012 2 2 2013 NA
4 2013 7 3 2015 2014
5 2014 5 4 2015 NA
6 2015 8 9 NA 2016
7 2016 3 5 2017 NA
8 2017 6 8 NA NA
Any suggestions on how I can solve this as well?
Here's an approach in base R:
o <- outer(db$Target, db$Value, `<`) # compute a logical matrix
o[lower.tri(o, diag = TRUE)] <- FALSE # replace lower.tri and diag with FALSE
idx <- max.col(o, ties.method = "first") # get the index of the first maximum
idx <- replace(idx, rowSums(o) == 0, NA) # take care of cases without greater Value
db$Resp <- db$Year[idx] # add new column
The resulting table is:
# Year Target Value Resp
# 1 2010 3 4 2011
# 2 2011 5 5 2015
# 3 2012 2 2 2013
# 4 2013 7 3 2015
# 5 2014 5 4 2015
# 6 2015 8 9 NA
# 7 2016 3 5 2017
# 8 2017 6 8 NA
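The same Resp column can also be computed row by row, which may be easier to follow than the matrix version. This is a sketch only: the helper name first_exceed is hypothetical, and the data are taken from the table printed in the question (Value 3 for 2013). For each row it looks only at later rows and returns the first Year whose Value exceeds that row's Target.

```r
db <- data.frame(Year   = 2010:2017,
                 Target = c(3, 5, 2, 7, 5, 8, 3, 6),
                 Value  = c(4, 5, 2, 3, 4, 9, 5, 8))

# Hypothetical helper: first later Year with Value above row i's Target, else NA.
first_exceed <- function(i) {
  later <- seq_len(nrow(db)) > i
  hit <- which(later & db$Value > db$Target[i])
  if (length(hit) == 0) NA_integer_ else db$Year[hit[1]]
}

db$Resp <- vapply(seq_len(nrow(db)), first_exceed, integer(1))
db
```

The outer() approach does the same comparisons in one step and scales better, while this version trades speed for readability.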

Resources