How to calculate difference between data in different rows? [closed] - r

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I have monthly data in this format:
PrecipMM Date
122.7 2004-01-01
54.2 2005-01-01
31.9 2006-01-01
100.5 2007-01-01
144.9 2008-01-01
96.4 2009-01-01
75.3 2010-01-01
94.8 2011-01-01
67.6 2012-01-01
93.0 2013-01-01
184.6 2014-01-01
101.0 2015-01-01
149.3 2016-01-01
50.2 2004-02-01
46.2 2005-02-01
57.7 2006-02-01
I want to calculate the difference in PrecipMM between the same month of successive years.
My desired output is like this:
PrecipMM Date PrecipMM_diff
122.7 2004-01-01 NA
54.2 2005-01-01 -68.5
31.9 2006-01-01 -22.3
100.5 2007-01-01 68.6
144.9 2008-01-01 44.4
96.4 2009-01-01 -48.5
75.3 2010-01-01 -21.2
94.8 2011-01-01 19.5
67.6 2012-01-01 -27.2
93.0 2013-01-01 25.4
184.6 2014-01-01 91.6
101.0 2015-01-01 -83.6
149.3 2016-01-01 48.3
50.2 2004-02-01 NA
46.2 2005-02-01 -4.0
57.7 2006-02-01 11.5
I think diff() can do this but I have no idea how.

I think you can do this with lag combined with group_by from dplyr. Here's how:
library(dplyr)
library(lubridate) # makes dealing with dates easier
# Load your example data
df <- structure(list(PrecipMM = c(4.4, 66.7, 48.2, 60.9, 108.1, 109.2,
101.7, 38.1, 53.8, 71.9, 75.4, 67.1, 92.7, 115.3, 68.9, 38.9),
Date = structure(5:20, .Label = c("101.7", "108.1", "109.2",
"115.3", "1766-01-01", "1766-02-01", "1766-03-01", "1766-04-01",
"1766-05-01", "1766-06-01", "1766-07-01", "1766-08-01", "1766-09-01",
"1766-10-01", "1766-11-01", "1766-12-01", "1767-01-01", "1767-02-01",
"1767-03-01", "1767-04-01", "38.1", "38.9", "4.4", "48.2",
"53.8", "60.9", "66.7", "67.1", "68.9", "71.9", "75.4", "92.7"
), class = "factor")), class = "data.frame", row.names = c(NA,
-16L), .Names = c("PrecipMM", "Date"))
results <- df %>%
mutate(years = year(Date), months = month(Date)) %>%
group_by(months) %>%
arrange(years) %>%
mutate(lagged.rain = lag(PrecipMM), rain.diff = PrecipMM - lagged.rain)
results
# Source: local data frame [16 x 6]
# Groups: months [12]
#
# PrecipMM Date years months lagged.rain rain.diff
# (dbl) (fctr) (dbl) (dbl) (dbl) (dbl)
# 1 4.4 1766-01-01 1766 1 NA NA
# 2 92.7 1767-01-01 1767 1 4.4 88.3
# 3 66.7 1766-02-01 1766 2 NA NA
# 4 115.3 1767-02-01 1767 2 66.7 48.6
# 5 48.2 1766-03-01 1766 3 NA NA
# 6 68.9 1767-03-01 1767 3 48.2 20.7
# 7 60.9 1766-04-01 1766 4 NA NA
# 8 38.9 1767-04-01 1767 4 60.9 -22.0
# 9 108.1 1766-05-01 1766 5 NA NA
# 10 109.2 1766-06-01 1766 6 NA NA
# 11 101.7 1766-07-01 1766 7 NA NA
# 12 38.1 1766-08-01 1766 8 NA NA
# 13 53.8 1766-09-01 1766 9 NA NA
# 14 71.9 1766-10-01 1766 10 NA NA
# 15 75.4 1766-11-01 1766 11 NA NA
# 16 67.1 1766-12-01 1766 12 NA NA
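If you prefer to stay in base R, the same grouped difference can be sketched with ave() and diff(). This is only an illustrative sketch on a toy subset of the question's data, not the answer's method:

```r
# Base R sketch: per-month lagged differences via ave() + diff()
df <- data.frame(
  PrecipMM = c(122.7, 54.2, 31.9, 50.2, 46.2, 57.7),
  Date = as.Date(c("2004-01-01", "2005-01-01", "2006-01-01",
                   "2004-02-01", "2005-02-01", "2006-02-01"))
)
# order by month, then by date, so diff() runs across years within a month
df <- df[order(format(df$Date, "%m"), df$Date), ]
# within each month group, the first year gets NA, the rest get diffs
df$PrecipMM_diff <- ave(df$PrecipMM, format(df$Date, "%m"),
                        FUN = function(x) c(NA, diff(x)))
df
```

This reproduces the PrecipMM_diff column from the desired output (-68.5, -22.3 for January; -4.0, 11.5 for February).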

Related

How to count the days after rain events in r

I have a data frame 'test' like below,
day Rain SWC_11 SWC_12 SWC_13 SWC_14 SWC_21
01/01/2019 00:00:00 0.2 51 60 63 60 64
02/01/2019 00:00:00 0.2 51.5 60.3 63.4 60.8 64.4
03/01/2019 00:00:00 0.0 51.3 60.3 63.3 60.6 64.1
04/01/2019 00:00:00 0.4 51.5 60.3 63.4 60.8 64.4
15/01/2019 00:00:00 0.0 NA NA NA NA NA
16/01/2019 00:00:00 0.0 NA NA NA NA NA
17/01/2019 00:00:00 0.0 51.5 60.3 63.4 60.8 64.4
Now I want to count the days after each rain event; once the next rain event arrives, the count restarts. My ideal output would look like below.
day Rain SWC_11 SWC_12 SWC_13 SWC_14 SWC_21 events
01/01/2019 00:00:00 0.2 51 60 63 60 64 1
02/01/2019 00:00:00 0.2 51.5 60.3 63.4 60.8 64.4 1
03/01/2019 00:00:00 0.0 51.3 60.3 63.3 60.6 64.1 2
04/01/2019 00:00:00 0.4 51.5 60.3 63.4 60.8 64.4 1
15/01/2019 00:00:00 0.0 NA NA NA NA NA 12
16/01/2019 00:00:00 0.0 NA NA NA NA NA 13
17/01/2019 00:00:00 0.0 51.5 60.3 63.4 60.8 64.4 14
My code is:
test$day<- as.numeric(as.Date(test$day))
for(i in 1:(nrow(test)-1))
if (test$Rain[[i]] != 0){
test$event[i] <- 1
test$event[i+nrow(test)] <-test$day[i+nrow(test)]- test$day[i] +1
}else{
test$event <-0
}
but the results look weird, and the error message is as below:
Error in `$<-.data.frame`(`*tmp*`, "event", value = c(0, 1, 0, 0, 0, 0, :
replacement has 12 rows, data has 10
Hope someone can help.
Instead of the rle-derivative I suggested earlier, I think cumulative-sum logic can be used here.
dat %>%
mutate(
day = as.Date(day, format="%d/%m/%Y"),
daylag = dplyr::lag(day, default = first(day) - 1)
) %>%
group_by(grp = cumsum(Rain > 0)) %>%
mutate(event = day - daylag[1]) %>%
ungroup()
# # A tibble: 7 x 11
# day Rain SWC_11 SWC_12 SWC_13 SWC_14 SWC_21 events daylag grp event
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <date> <int> <drtn>
# 1 2019-01-01 0.2 51 60 63 60 64 1 2018-12-31 1 1 days
# 2 2019-01-02 0.2 51.5 60.3 63.4 60.8 64.4 1 2019-01-01 2 1 days
# 3 2019-01-03 0 51.3 60.3 63.3 60.6 64.1 2 2019-01-02 2 2 days
# 4 2019-01-04 0.4 51.5 60.3 63.4 60.8 64.4 1 2019-01-03 3 1 days
# 5 2019-01-15 0 NA NA NA NA NA 12 2019-01-04 3 12 days
# 6 2019-01-16 0 NA NA NA NA NA 13 2019-01-15 3 13 days
# 7 2019-01-17 0 51.5 60.3 63.4 60.8 64.4 14 2019-01-16 3 14 days
Data:
dat <- read.table(header = TRUE, text = "
day Rain SWC_11 SWC_12 SWC_13 SWC_14 SWC_21 events
01/01/2019 0.2 51 60 63 60 64 1
02/01/2019 0.2 51.5 60.3 63.4 60.8 64.4 1
03/01/2019 0.0 51.3 60.3 63.3 60.6 64.1 2
04/01/2019 0.4 51.5 60.3 63.4 60.8 64.4 1
15/01/2019 0.0 NA NA NA NA NA 12
16/01/2019 0.0 NA NA NA NA NA 13
17/01/2019 0.0 51.5 60.3 63.4 60.8 64.4 14")
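The cumulative-sum idea also works without dplyr or zoo. A base R sketch, assuming a toy frame with just the day and Rain columns:

```r
# Base R sketch of the cumulative-sum grouping:
# every rainy day starts a new group; within a group, 'event'
# counts days elapsed since the day before the group began.
dat <- data.frame(
  day  = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04",
                   "2019-01-15", "2019-01-16", "2019-01-17")),
  Rain = c(0.2, 0.2, 0, 0.4, 0, 0, 0)
)
daynum <- as.numeric(dat$day)
grp    <- cumsum(dat$Rain > 0)                 # group id: new group per rain day
daylag <- c(daynum[1] - 1, head(daynum, -1))   # previous row's day number
# subtract each group's first lagged day from the current day
dat$event <- daynum - ave(daylag, grp, FUN = function(x) x[1])
dat$event   # 1 1 2 1 12 13 14
```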
I have a solution which works just with base R, but it is not as short as the one above.
# imagine if you have to do this manually, how would you achieve it step-by-step?
# then just use base R to realize every single step.
# create the "events" column and "day2" column
test$events <- NA
test$day2 <- as.Date(test$day,format="%d/%m/%Y")
# test whether it rains on the first day, and assign the event value
if (test$Rain[[1]] != 0){
  test$events[[1]] <- 1
} else {
  test$events[[1]] <- 0
}
# then starting from the 2nd row, go down one by one
# assign the value for "events" column based on your criteria
for (i in 2:nrow(test)){
  if (test$Rain[[i]] != 0){
    # any rain day restarts the count
    test$events[[i]] <- 1
  } else {
    # a dry day continues the count, adding the gap in days;
    # day2 is already a Date, so plain subtraction gives the gap
    test$events[[i]] <- test$events[[i-1]] +
      as.numeric(test$day2[[i]] - test$day2[[i-1]])
  }
}
Now you will have the desired results. My solution is not very elegant, but this is how I think through it when the code looks difficult.

reshape untidy data frame, spreading rows to columns names [duplicate]

This question already has answers here:
Transpose a data frame
(6 answers)
Closed 2 years ago.
I have searched the threads but can't find a solution that solves the problem with the data frame that I have.
My current data frame (df):
# A tibble: 8 x 29
`Athlete` Monday...2 Tuesday...3 Wednesday...4 Thursday...5 Friday...6 Saturday...7 Sunday...8
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Date 29/06/2020 30/06/2020 43837.0 43868.0 43897.0 43928.0 43958.0
2 HR 47.0 54.0 51.0 56.0 59.0 NA NA
3 HRV 171.0 91.0 127.0 99.0 77.0 NA NA
4 Sleep Duration 9.11 7.12 8.59 7.15 8.32 NA NA
5 Sleep Efficien~ 92.0 94.0 89.0 90.0 90.0 NA NA
6 Recovery Score 98.0 66.0 96.0 72.0 46.0 NA NA
7 Life Stress NO NO NO NO NO NA NA
8 Sick NO NO NO NO NO NA NA
I have tried to use spread and pivot_wider, but I know additional functions would be required to get the desired output, which is beyond my level of understanding in R.
Do I need to u
Desired output:
Date HR HRV Sleep Duration Sleep Efficiency Recovery Score Life Stress Sick
29/06/2020 47.0 171.0 9.11
30/06/2020 54.0 91.0 7.12
43837.0 51.0 127.0 8.59
43868.0 56.0 99.0 7.15
43897.0 59.0 77.0 8.32
43928.0 NA NA NA
43958.0 NA NA NA
etc.
Thank you
In base R you can do:
type.convert(setNames(data.frame(t(df[-1]), row.names = NULL), df[,1]))
Date HR HRV Sleep Duration Sleep Efficien~ Recovery Score Life Stress Sick
1 29/06/2020 47 171 9.11 92 98 NO NO
2 30/06/2020 54 91 7.12 94 66 NO NO
3 43837.0 51 127 8.59 89 96 NO NO
4 43868.0 56 99 7.15 90 72 NO NO
5 43897.0 59 77 8.32 90 46 NO NO
6 43928 NA NA NA NA NA <NA> <NA>
7 43958 NA NA NA NA NA <NA> <NA>
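To make the one-liner above reproducible, here is a toy two-day version of the frame (values taken from the question); as.is = TRUE is added so character columns stay character rather than becoming factors:

```r
# Toy frame: first column holds the variable names, the rest one day each
df <- data.frame(Athlete = c("Date", "HR", "HRV"),
                 Monday   = c("29/06/2020", "47.0", "171.0"),
                 Tuesday  = c("30/06/2020", "54.0", "91.0"),
                 stringsAsFactors = FALSE)
# transpose the data part, re-attach the variable names, re-type the columns
out <- type.convert(setNames(data.frame(t(df[-1]), row.names = NULL), df[, 1]),
                    as.is = TRUE)
out   # columns Date, HR, HRV; one row per day
```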

Time series forecasting by lm() using lapply

I was trying to forecast a time series problem using lm() and my data looks like below
Customer_key date sales
A35 2018-05-13 31
A35 2018-05-20 20
A35 2018-05-27 43
A35 2018-06-03 31
BH22 2018-05-13 60
BH22 2018-05-20 67
BH22 2018-05-27 78
BH22 2018-06-03 55
Converted my df to a list format by
df <- dcast(df, date ~ customer_key,value.var = c("sales"))
df <- subset(df, select = -c(dt))
demandWithKey <- as.list(df)
I am trying to write a function that can be applied across all customers:
my_fun <- function(x) {
fit <- lm(ds_load ~ date, data=df) ## After changing to list ds_load and date column names
## are no longer available for formula
fit_b <- forecast(fit$fitted.values, h=20) ## forecast using lm()
return(data.frame(c(fit$fitted.values, fit_b[["mean"]])))
}
fcast <- lapply(df, my_fun)
I know the above function doesn't work, but basically I'm looking to get both the fitted values and the forecasted values for grouped data.
I've also tried other methods using tslm() (converting to time series data) and so on, but with no luck; I can get lm() to work on just one customer, though. Also, many questions/posts cover only fitting the model, but I would like to forecast at the same time.
lm() fits a regression model,
but here you have a time series, so to forecast the series you have to use one of the time series models (ARMA, ARCH, GARCH, ...);
you can use the auto.arima() function from the "forecast" package in R.
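auto.arima() automates the order selection; the underlying fit-then-forecast step can be sketched with base R's stats::arima and predict. The sales vector and the AR(1) order below are assumptions purely for illustration:

```r
# Base R sketch: fit a fixed-order ARIMA and forecast ahead.
# auto.arima() from the 'forecast' package would pick the order automatically.
sales <- c(31, 20, 43, 31, 60, 67, 78, 55)   # toy weekly sales
fit <- arima(ts(sales), order = c(1, 0, 0))  # AR(1) with mean
fc  <- predict(fit, n.ahead = 20)            # 20-step-ahead forecast
length(fc$pred)                              # 20 forecast values
```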
I don't know what you're up to exactly, but you could make this less complicated.
Using by avoids the need to reshape your data, it splits your data e.g. by customer ID as in your case and applies a function on the subsets (i.e. it's a combination of split and lapply; see ?by).
Since you want to compare fitted and forecasted values in your result, you probably need predict rather than $fitted.values; otherwise the values won't be of the same length. Because your independent variable is a date in weekly intervals, you can use seq.Date with the first date as the starting value; the sequence has length equal to the number of actual values (nrow per customer) plus the h= argument of the forecast.
For demonstration purposes I add the fitted values as first column in the following.
res <- by(dat, dat$cus_key, function(x) {
H <- 20 ## globally define 'h'
fit <- lm(sales ~ date, x)
fitted <- fit$fitted.values
pred <- predict(fit, newdata=data.frame(
date=seq(x$date[1], length.out= nrow(x) + H, by="week")))
fcst <- c(fitted, forecast(fitted, h=H)$mean)
fit.na <- `length<-`(unname(fitted), length(pred)) ## for demonstration
return(cbind(fit.na, pred, fcst))
})
Result
res
# dat$cus_key: A28
# fit.na pred fcst
# 1 41.4 41.4 41.4
# 2 47.4 47.4 47.4
# 3 53.4 53.4 53.4
# 4 59.4 59.4 59.4
# 5 65.4 65.4 65.4
# 6 NA 71.4 71.4
# 7 NA 77.4 77.4
# 8 NA 83.4 83.4
# 9 NA 89.4 89.4
# 10 NA 95.4 95.4
# 11 NA 101.4 101.4
# 12 NA 107.4 107.4
# 13 NA 113.4 113.4
# 14 NA 119.4 119.4
# 15 NA 125.4 125.4
# 16 NA 131.4 131.4
# 17 NA 137.4 137.4
# 18 NA 143.4 143.4
# 19 NA 149.4 149.4
# 20 NA 155.4 155.4
# 21 NA 161.4 161.4
# 22 NA 167.4 167.4
# 23 NA 173.4 173.4
# 24 NA 179.4 179.4
# 25 NA 185.4 185.4
# ----------------------------------------------------------------
# dat$cus_key: B16
# fit.na pred fcst
# 1 49.0 49.0 49.0
# 2 47.7 47.7 47.7
# 3 46.4 46.4 46.4
# 4 45.1 45.1 45.1
# 5 43.8 43.8 43.8
# 6 NA 42.5 42.5
# 7 NA 41.2 41.2
# 8 NA 39.9 39.9
# 9 NA 38.6 38.6
# 10 NA 37.3 37.3
# 11 NA 36.0 36.0
# 12 NA 34.7 34.7
# 13 NA 33.4 33.4
# 14 NA 32.1 32.1
# 15 NA 30.8 30.8
# 16 NA 29.5 29.5
# 17 NA 28.2 28.2
# 18 NA 26.9 26.9
# 19 NA 25.6 25.6
# 20 NA 24.3 24.3
# 21 NA 23.0 23.0
# 22 NA 21.7 21.7
# 23 NA 20.4 20.4
# 24 NA 19.1 19.1
# 25 NA 17.8 17.8
# ----------------------------------------------------------------
# dat$cus_key: C12
# fit.na pred fcst
# 1 56.4 56.4 56.4
# 2 53.2 53.2 53.2
# 3 50.0 50.0 50.0
# 4 46.8 46.8 46.8
# 5 43.6 43.6 43.6
# 6 NA 40.4 40.4
# 7 NA 37.2 37.2
# 8 NA 34.0 34.0
# 9 NA 30.8 30.8
# 10 NA 27.6 27.6
# 11 NA 24.4 24.4
# 12 NA 21.2 21.2
# 13 NA 18.0 18.0
# 14 NA 14.8 14.8
# 15 NA 11.6 11.6
# 16 NA 8.4 8.4
# 17 NA 5.2 5.2
# 18 NA 2.0 2.0
# 19 NA -1.2 -1.2
# 20 NA -4.4 -4.4
# 21 NA -7.6 -7.6
# 22 NA -10.8 -10.8
# 23 NA -14.0 -14.0
# 24 NA -17.2 -17.2
# 25 NA -20.4 -20.4
As you can see, prediction and forecast yield the same values, since both methods are based on the same single explanatory variable date in this case.
Toy data:
set.seed(42)
dat <- transform(expand.grid(cus_key=paste0(LETTERS[1:3], sample(12:43, 3)),
date=seq.Date(as.Date("2018-05-13"), length.out=5, by="week")),
sales=sample(20:80, 15, replace=TRUE))

R: Update observation ID by comparing its value against a range of values given by two other columns

I am comparing road data between two years. The information gives a unique code for each road (id.agg), a unique code for each segment of the road (id), the initial and final kilometers for each segment (ini, fin), and the year.
When I merged the two tables using their unique segment id, I realized that there are mismatches due to road construction. However, based on the kilometers covered by the segments, the roads' extents are the same. Therefore I would like to find a way to replace the older ids with the updated ids.
A subset of my combined data looks like this:
>trial
id.agg id year.x ini.x fin.x year.y ini.y fin.y
010BTO 010BTO0318 1 606.1 611.7 2 606.1 611.7
010BTO 010BTO0320 1 611.7 631.4 2 611.7 631.4
010BTO 010BTO0325 1 631.4 670.2 2 631.4 670.2
010BTO 010BTO0330 1 670.2 718.4 2 670.2 718.4
010BTO 010BTO0335 1 718.4 734.0 2 718.4 786.8
010BTO 010BTO0340 1 734.0 772.9 NA NA NA
010BTO 010BTO0345 1 772.9 786.8 NA NA NA
010BTO 010BTO0350 1 786.8 794.9 2 786.8 794.9
010BTO 010BTO0355 1 794.9 828.2 2 794.9 827.2
010BTO 010BTO0357 NA NA NA 2 827.2 828.2
020BPI 020BPI0370 1 0.0 40.3 2 0.0 54.3
020BPI 020BPI0375 1 40.3 54.3 NA NA NA
020BPI 020BPI0380 1 54.3 85.3 2 54.3 85.3
020BPI 020BPI0390 1 85.3 148.3 2 85.3 148.3
>
For those NAs in year 2, I would like to update the id so that I can compare which segments of road in year 1 are reflected in year 2. I thought to compare the initial kilometers in year 1 (ini.x) with the range of kilometers in year 2. I tried the following:
> trial[is.na(trial$year.y) & trial$ini.x %between% list(trial$ini.y,trial$fin.y) %in% trial$ini.y,]
id.agg id year.x ini.x fin.x year.y ini.y fin.y
6 010BTO 010BTO0340 1 734.0 772.9 NA NA NA
7 010BTO 010BTO0345 1 772.9 786.8 NA NA NA
12 020BPI 020BPI0375 1 40.3 54.3 NA NA NA
>
This shows me the observations whose ids I would like to correct, but I still haven't found solutions to the two major issues of my problem:
1) How do I know to which specific id is the query finding a match?
2) How can this be applied comparing per group (i.e. id.agg) and not through the whole vector?
Question 1) is important to be able to assign the updated id to the old id. Question 2) is important as my real dataframe has over 6000 observations and around 500 groups, so I am pretty sure that without accounting for groups I would find false matches (e.g. the ini.x would fall in the range [ini.y, fin.y] of more than one group).
I would like to have a dataframe like this:
>trial
id.agg id year.x ini.x fin.x year.y ini.y fin.y corrected.id
010BTO 010BTO0318 1 606.1 611.7 2 606.1 611.7 010BTO0318
010BTO 010BTO0320 1 611.7 631.4 2 611.7 631.4 010BTO0320
010BTO 010BTO0325 1 631.4 670.2 2 631.4 670.2 010BTO0325
010BTO 010BTO0330 1 670.2 718.4 2 670.2 718.4 010BTO0330
010BTO 010BTO0335 1 718.4 734.0 2 718.4 786.8 010BTO0335
010BTO 010BTO0340 1 734.0 772.9 NA NA NA 010BTO0335
010BTO 010BTO0345 1 772.9 786.8 NA NA NA 010BTO0335
010BTO 010BTO0350 1 786.8 794.9 2 786.8 794.9 010BTO0350
010BTO 010BTO0355 1 794.9 828.2 2 794.9 827.2 010BTO0355
010BTO 010BTO0357 NA NA NA 2 827.2 828.2 010BTO0357
020BPI 020BPI0370 1 0.0 40.3 2 0.0 54.3 020BPI0370
020BPI 020BPI0375 1 40.3 54.3 NA NA NA 020BPI0370
020BPI 020BPI0380 1 54.3 85.3 2 54.3 85.3 020BPI0380
020BPI 020BPI0390 1 85.3 148.3 2 85.3 148.3 020BPI0390
>
I have been looking for solutions, but I cannot find a function or code that addresses point 1). For point 2) I found that group_by combined with %>% might help, but I do not manage to apply it even to my query shown above. For instance:
> trial %>%
+ group_by(id.agg) %>%
+ which( is.na(trial$year.y) & trial$ini.x %between% list(trial$ini.y,trial$fin.y) %in% trial$ini.y,
+ arr.ind=TRUE)
Error in which(., is.na(trial$year.y) & trial$ini.x %between% list(trial$ini.y, :
argument to 'which' is not logical
>
Any thoughts or leads on how to solve this would be truly appreciated.
Many thanks in advance!
-----------
As a follow-up, I used the proposed solution. It does well in identifying which observations fall in the previous non-NA observation's range of (ini.y, fin.y). Here is a subset of the output I got:
id.agg id year.x ini.x fin.x year.y ini.y fin.y rownum corrected.id
1 424BAL 424BAL0175 2016 39.5 42.7 2017 39.5 42.7 8 424BAL0175
2 424BAL 424BAL0180 2016 42.7 44.3 2017 42.7 50.8 9 424BAL0180
3 424BAL 424BAL0185 2016 44.3 47.1 NA NA NA 9 424BAL0180
4 424BAL 424BAL0190 2016 47.1 52.3 NA NA NA 9 424BAL0180
5 424BAL 424BAL0195 2016 52.3 55.0 NA NA NA 12 424BAL0195
6 424BAL 424BAL0200 2016 55.0 64.4 NA NA NA 13 424BAL0200
7 424BAL 424BAL0205 2016 64.4 68.1 NA NA NA 14 424BAL0205
8 424BAL 424BAL0210 2016 68.1 70.4 2017 50.8 73.8 15 424BAL0210
9 424BAL 424BAL0230 2016 70.4 77.2 2017 73.8 80.6 16 424BAL0230
Notice that observations in rows 4-6 are not in the range of (ini.y,fin.y) of observation in row 2 but on that of observation in row 8. The expected output should look like this:
id.agg id year.x ini.x fin.x year.y ini.y fin.y rownum corrected.id
1 424BAL 424BAL0175 2016 39.5 42.7 2017 39.5 42.7 8 424BAL0175
2 424BAL 424BAL0180 2016 42.7 44.3 2017 42.7 50.8 9 424BAL0180
3 424BAL 424BAL0185 2016 44.3 47.1 NA NA NA 9 424BAL0180
4 424BAL 424BAL0190 2016 47.1 52.3 NA NA NA 9 424BAL0180
5 424BAL 424BAL0195 2016 52.3 55.0 NA NA NA 12 424BAL0210
6 424BAL 424BAL0200 2016 55.0 64.4 NA NA NA 13 424BAL0210
7 424BAL 424BAL0205 2016 64.4 68.1 NA NA NA 14 424BAL0210
8 424BAL 424BAL0210 2016 68.1 70.4 2017 50.8 73.8 15 424BAL0210
9 424BAL 424BAL0230 2016 70.4 77.2 2017 73.8 80.6 16 424BAL0230
Sincerely grateful in advance!
If I understood the problem correctly, then this should help:
library(dplyr)
library(zoo)
df %>%
group_by(id.agg) %>%
mutate(rownum=ifelse(is.na(year.y) & is.na(ini.y) & is.na(fin.y), NA, row_number())) %>%
mutate(rownum=ifelse(is.na(rownum) & ini.x >=na.locf(ini.y) & ini.x <= na.locf(fin.y),
na.locf(rownum),
na.locf(rownum, fromLast=T))) %>%
mutate(corrected.id = id[rownum]) %>%
select(-rownum)
Output is:
id.agg id year.x ini.x fin.x year.y ini.y fin.y corrected.id
<chr> <chr> <int> <dbl> <dbl> <int> <dbl> <dbl> <chr>
1 424BAL 424BAL0175 2016 39.5 42.7 2017 39.5 42.7 424BAL0175
2 424BAL 424BAL0180 2016 42.7 44.3 2017 42.7 50.8 424BAL0180
3 424BAL 424BAL0185 2016 44.3 47.1 NA NA NA 424BAL0180
4 424BAL 424BAL0190 2016 47.1 52.3 NA NA NA 424BAL0180
5 424BAL 424BAL0195 2016 52.3 55.0 NA NA NA 424BAL0210
6 424BAL 424BAL0200 2016 55.0 64.4 NA NA NA 424BAL0210
7 424BAL 424BAL0205 2016 64.4 68.1 NA NA NA 424BAL0210
8 424BAL 424BAL0210 2016 68.1 70.4 2017 50.8 73.8 424BAL0210
9 424BAL 424BAL0230 2016 70.4 77.2 2017 73.8 80.6 424BAL0230
Sample data:
df <-structure(list(id.agg = c("424BAL", "424BAL", "424BAL", "424BAL",
"424BAL", "424BAL", "424BAL", "424BAL", "424BAL"), id = c("424BAL0175",
"424BAL0180", "424BAL0185", "424BAL0190", "424BAL0195", "424BAL0200",
"424BAL0205", "424BAL0210", "424BAL0230"), year.x = c(2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L), ini.x = c(39.5,
42.7, 44.3, 47.1, 52.3, 55, 64.4, 68.1, 70.4), fin.x = c(42.7,
44.3, 47.1, 52.3, 55, 64.4, 68.1, 70.4, 77.2), year.y = c(2017L,
2017L, NA, NA, NA, NA, NA, 2017L, 2017L), ini.y = c(39.5, 42.7,
NA, NA, NA, NA, NA, 50.8, 73.8), fin.y = c(42.7, 50.8, NA, NA,
NA, NA, NA, 73.8, 80.6)), .Names = c("id.agg", "id", "year.x",
"ini.x", "fin.x", "year.y", "ini.y", "fin.y"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
Edit: Updated the code after getting better clarity on the requirement.

Combine some columns of two matrices but with common information transposed

I have the following two matrices:
matrix1 (first 10 rows and only some relevant columns):
Prod_Y2010 Prod_Y2011 Prod_Y2012 Prod_Y2013 Prod_Y2014 Place
1 6101 5733 5655 5803 5155 3
2 4614 4513 4322 5211 4397 1
3 5370 5295 4951 5145 4491 3
4 5689 5855 5600 5787 4848 1
5 3598 3491 3462 3765 3094 2
6 6367 6244 5838 6404 5466 7
7 2720 2635 2465 2917 2623 2
8 5077 5113 4456 5503 4749 8
9 5260 5055 4512 5691 4876 2
10 4771 4583 4202 5266 4422 2
where each column is grassland productivity from years 2010 to 2014, and the last column is the place where productivity was measured.
and matrix2:
Year Rain_Place1 Rain_Place2 Rain_Place3 Rain_Place7 Rain_Place8
11 2010 123.0 361.0 60.5 469.7 492.3
12 2011 45.5 404.4 224.8 395.4 417.3
13 2012 318.7 369.4 115.7 322.6 385.8
14 2013 93.2 378.4 155.5 398.2 413.1
15 2014 216.8 330.0 31.0 344.0 387.5
where for each of the same 5 years of matrix1 (which are the rows in matrix 2) I have data on the rainfall for each place.
I do not see how to proceed in R to join the information of the two matrices in such a way that my matrix1 has a series of additional columns intercalated (or interspersed) with the corresponding rain values matching the corresponding years and places. That is, what I need is a new matrix1 such as:
Prod_Y2010 Rain_Y2010 Prod_Y2011 Rain_Y2011 Prod_Y2012 Rain_Y2012 ... Place
1 6101 60.5 5733 224.8 5655 115.7 3
2 4614 123.0 4513 45.5 4322 318.7 1
3 5370 60.5 5295 224.8 4951 115.7 3
4 5689 123.0 5855 45.5 5600 318.7 1
5 3598 361.0 3491 404.4 3462 369.4 2
... ... ... ... ... ... ... ...
Of course the order is not important to me: if all the Rainfall columns are added as new columns at the right end of matrix1, that would be fine anyway.
Needless to say, my real matrices are several thousand rows long, and the number of years is 15.
I would second #jazzurro's comment: reformatting your data to long format would likely make it easier to work with for analysis etc. However, if you want to keep the wide format, here is a way that might work; it uses the reshape2 and plyr libraries.
Given these data frames (dput() output of your data frames above, only included for reproducibility):
m1<-structure(list(Prod_Y2010 = c(6101L, 4614L, 5370L, 5689L, 3598L,
6367L, 2720L, 5077L, 5260L, 4771L), Prod_Y2011 = c(5733L, 4513L,
5295L, 5855L, 3491L, 6244L, 2635L, 5113L, 5055L, 4583L), Prod_Y2012 = c(5655L,
4322L, 4951L, 5600L, 3462L, 5838L, 2465L, 4456L, 4512L, 4202L
), Prod_Y2013 = c(5803L, 5211L, 5145L, 5787L, 3765L, 6404L, 2917L,
5503L, 5691L, 5266L), Prod_Y2014 = c(5155L, 4397L, 4491L, 4848L,
3094L, 5466L, 2623L, 4749L, 4876L, 4422L), Place = c(3L, 1L,
3L, 1L, 2L, 7L, 2L, 8L, 2L, 2L)), .Names = c("Prod_Y2010", "Prod_Y2011",
"Prod_Y2012", "Prod_Y2013", "Prod_Y2014", "Place"), class = "data.frame", row.names = c(NA,
-10L))
m2<-structure(list(Year = 2010:2014, Rain_Place1 = c(123, 45.5, 318.7,
93.2, 216.8), Rain_Place2 = c(361, 404.4, 369.4, 378.4, 330),
Rain_Place3 = c(60.5, 224.8, 115.7, 155.5, 31), Rain_Place7 = c(469.7,
395.4, 322.6, 398.2, 344), Rain_Place8 = c(492.3, 417.3,
385.8, 413.1, 387.5)), .Names = c("Year", "Rain_Place1",
"Rain_Place2", "Rain_Place3", "Rain_Place7", "Rain_Place8"), class = "data.frame", row.names = c("11",
"12", "13", "14", "15"))
To get the place number from the column names in your rain data frame to use in a later join:
rename <- function(x) {
y <- substr(x, nchar(x), nchar(x))
return(y)
}
Edit: Here is a better rename function, that should work with more than 9 places (modified from an answer here):
rename <- function(x) {
y <- unlist(regmatches(x, gregexpr('\\(?[0-9,.]+', x)))
return(y)
}
sapply(names(m2[2:ncol(m2)]), FUN = rename)
names(m2) <- c(names(m2)[1], sapply(names(m2[2:ncol(m2)]), FUN = rename))
> m2
Year 1 2 3 7 8
1 2010 123.0 361.0 60.5 469.7 492.3
2 2011 45.5 404.4 224.8 395.4 417.3
3 2012 318.7 369.4 115.7 322.6 385.8
4 2013 93.2 378.4 155.5 398.2 413.1
5 2014 216.8 330.0 31.0 344.0 387.5
Melt the rain data frame:
m3<-melt(m2, id.vars = "Year", variable.name = "Place", value.name = "Rain")
> head(m3)
Year Place Rain
1 2010 1 123.0
2 2011 1 45.5
3 2012 1 318.7
4 2013 1 93.2
5 2014 1 216.8
6 2010 2 361.0
Reshape the melted data frame to allow for a join by "Place", and treat "Place" as a character rather than a factor:
m4<-reshape(m3, idvar = "Place", timevar = "Year", direction = "wide")
m4$Place <- as.character(m4$Place)
> m4
Place Rain.2010 Rain.2011 Rain.2012 Rain.2013 Rain.2014
1 1 123.0 45.5 318.7 93.2 216.8
6 2 361.0 404.4 369.4 378.4 330.0
11 3 60.5 224.8 115.7 155.5 31.0
16 7 469.7 395.4 322.6 398.2 344.0
21 8 492.3 417.3 385.8 413.1 387.5
Finally, join this melted/reshaped data frame to your "Prod" data frame.
m5<-join(m1, m4, by = "Place")
> m5
Prod_Y2010 Prod_Y2011 Prod_Y2012 Prod_Y2013 Prod_Y2014 Place Rain.2010 Rain.2011 Rain.2012 Rain.2013 Rain.2014
1 6101 5733 5655 5803 5155 3 60.5 224.8 115.7 155.5 31.0
2 4614 4513 4322 5211 4397 1 123.0 45.5 318.7 93.2 216.8
3 5370 5295 4951 5145 4491 3 60.5 224.8 115.7 155.5 31.0
4 5689 5855 5600 5787 4848 1 123.0 45.5 318.7 93.2 216.8
5 3598 3491 3462 3765 3094 2 361.0 404.4 369.4 378.4 330.0
6 6367 6244 5838 6404 5466 7 469.7 395.4 322.6 398.2 344.0
7 2720 2635 2465 2917 2623 2 361.0 404.4 369.4 378.4 330.0
8 5077 5113 4456 5503 4749 8 492.3 417.3 385.8 413.1 387.5
9 5260 5055 4512 5691 4876 2 361.0 404.4 369.4 378.4 330.0
10 4771 4583 4202 5266 4422 2 361.0 404.4 369.4 378.4 330.0
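The per-place, per-year lookup can also be sketched in base R with column-name indexing, avoiding the melt/reshape round trip. A toy subset of the data is assumed below:

```r
# Toy subset: rain per year (rows) and place (columns)
m2 <- data.frame(Year = 2010:2011,
                 Rain_Place1 = c(123, 45.5),
                 Rain_Place2 = c(361, 404.4),
                 Rain_Place3 = c(60.5, 224.8))
m1 <- data.frame(Prod_Y2010 = c(6101, 4614),
                 Prod_Y2011 = c(5733, 4513),
                 Place = c(3, 1))
# for each year, look up the rain column named after each row's place
for (yr in m2$Year) {
  rain_row <- m2[m2$Year == yr, ]
  m1[[paste0("Rain_Y", yr)]] <-
    as.numeric(unlist(rain_row[paste0("Rain_Place", m1$Place)]))
}
m1   # Prod_* columns plus one Rain_Y* column per year
```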
