How to add a moving-sum indicator column to a data frame - r

I need to add a new column, computed by a specific function, to a data frame.
Basically I need to calculate an indicator which is the sum of the past 5 observations in column "value1", multiplied by 100 and divided by column "value2" (the latter not as a sum, just the single observation), for my sample data below.
Somewhat like this (it's not formal notation):
indicator = [sum of the last 5 value1 / value2] * 100
The indicator must be calculated by country.
If countries or dates appear "mixed" (out of order) in the data frame, the formula needs to recognize this and sum only the correct values, in the correct order.
If there is an NA in value1, the formula should skip that row in the computation. E.g. given 31/12, 1/01, 2/01, 3/01, 4/01 (= NA), 5/01, the indicator for 6/01 then takes the past 5 valid observations: 31/12, 1/01, 2/01, 3/01, 5/01.
Important -> only use base R
Example data frame (my actual data frame is more complex):
set.seed(1)
Country <- c(rep("USA", 10),rep("UK", 10), rep("China", 10))
Value1 <- sample(x = c(120, 340, 423), size = 30, replace = TRUE)
Value2 <- sample(x = c(1,3,5,6,9), size = 30, replace = TRUE)
date <- seq(as.POSIXct('2020/01/01'), as.POSIXct('2020/01/30'), by = "1 day")
df = data.frame(Country, Value1, Value2, date)
Thank you all very much in advance; this one has been very hard to crack :D

Since it has to be done group-wise but in base R, you could use the split-apply-bind method:
df2 <- do.call(rbind, lapply(split(df, df$Country), function(d) {
  d <- d[order(d$date),]   # make sure each country's rows are in date order
  d$computed <- 100 * d$Value1 / d$Value2
  d$Result <- NA
  # sum of the last 5 non-NA values of `computed` up to and including row i
  for(i in 5:nrow(d)) d$Result[i] <- sum(tail(na.omit(d$computed[seq(i)]), 5))
  d[!names(d) %in% "computed"]
}))
# restore the original row order and drop the "Country.n" row names
rn <- sapply(strsplit(rownames(df2), "\\."), function(x) as.numeric(x[2]))
`rownames<-`(df2[rn,], NULL)
#> Country Value1 Value2 date Result
#> 1 USA 423 9 2020-01-01 NA
#> 2 USA 120 3 2020-01-02 NA
#> 3 USA 120 3 2020-01-03 NA
#> 4 USA 423 5 2020-01-04 NA
#> 5 USA 120 1 2020-01-05 33160.00
#> 6 USA 120 1 2020-01-06 40460.00
#> 7 USA 120 3 2020-01-07 40460.00
#> 8 USA 340 1 2020-01-08 70460.00
#> 9 USA 423 6 2020-01-09 69050.00
#> 10 USA 340 9 2020-01-10 60827.78
#> 11 UK 340 5 2020-01-11 NA
#> 12 UK 423 6 2020-01-12 NA
#> 13 UK 423 3 2020-01-13 NA
#> 14 UK 340 1 2020-01-14 NA
#> 15 UK 120 3 2020-01-15 65950.00
#> 16 UK 120 9 2020-01-16 60483.33
#> 17 UK 423 1 2020-01-17 95733.33
#> 18 UK 423 9 2020-01-18 86333.33
#> 19 UK 340 1 2020-01-19 86333.33
#> 20 UK 340 3 2020-01-20 93666.67
#> 21 China 340 1 2020-01-21 NA
#> 22 China 340 9 2020-01-22 NA
#> 23 China 423 3 2020-01-23 NA
#> 24 China 120 1 2020-01-24 NA
#> 25 China 340 9 2020-01-25 67655.56
#> 26 China 340 5 2020-01-26 40455.56
#> 27 China 120 5 2020-01-27 39077.78
#> 28 China 340 9 2020-01-28 28755.56
#> 29 China 340 9 2020-01-29 20533.33
#> 30 China 423 5 2020-01-30 25215.56
Created on 2022-06-08 by the reprex package (v2.0.1)
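The tail(na.omit(...), 5) idiom in the loop above is what implements the asker's NA rule: the NA is dropped, so the window reaches one valid observation further back. A quick stand-alone check:

```r
# 5th value is NA, so the "last 5 valid" window reaches back to the 1st value
v <- c(1, 2, 3, 4, NA, 6)
sum(tail(na.omit(v), 5))   # 1 + 2 + 3 + 4 + 6 = 16
```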

Here's an option - not sure if the calculation is as you intend:
split_df <- split(df, Country)
split_df <- lapply(split_df, function(x) {
  x <- x[order(x$date),]
  x$index <- nrow(x):1
  x$indicator <- ifelse(x$index <= 5, sum(x$Value2[x$index <= 5]) * 100 / x$Value2, NA)
  x$index <- NULL
  return(x)
})
final_df <- do.call(rbind, split_df)
Country Value1 Value2 date indicator
China.21 China 120 3 2020-01-21 NA
China.22 China 423 5 2020-01-22 NA
China.23 China 340 6 2020-01-23 NA
China.24 China 120 3 2020-01-24 NA
China.25 China 340 9 2020-01-25 NA
China.26 China 423 6 2020-01-26 366.6667
China.27 China 120 3 2020-01-27 733.3333
China.28 China 340 3 2020-01-28 733.3333
China.29 China 120 5 2020-01-29 440.0000
China.30 China 340 5 2020-01-30 440.0000
UK.11 UK 423 1 2020-01-11 NA
UK.12 UK 340 6 2020-01-12 NA
UK.13 UK 423 1 2020-01-13 NA
UK.14 UK 423 5 2020-01-14 NA
UK.15 UK 340 6 2020-01-15 NA
UK.16 UK 340 1 2020-01-16 2400.0000
UK.17 UK 120 5 2020-01-17 480.0000
UK.18 UK 423 9 2020-01-18 266.6667
UK.19 UK 120 6 2020-01-19 400.0000
UK.20 UK 423 3 2020-01-20 800.0000
USA.1 USA 423 1 2020-01-01 NA
USA.2 USA 423 5 2020-01-02 NA
USA.3 USA 423 5 2020-01-03 NA
USA.4 USA 423 6 2020-01-04 NA
USA.5 USA 423 1 2020-01-05 NA
USA.6 USA 340 5 2020-01-06 600.0000
USA.7 USA 340 5 2020-01-07 600.0000
USA.8 USA 423 6 2020-01-08 500.0000
USA.9 USA 423 5 2020-01-09 600.0000
USA.10 USA 423 9 2020-01-10 333.3333

In base R you could do:
transform(df, Results = ave(Value1, Country, FUN = function(x)
  replace(x, !is.na(x), stats::filter(na.omit(x), rep(1, 5), sides = 1))) / Value2)
Country Value1 Value2 date Results
1 USA 120 1 2020-01-01 NA
2 USA 423 6 2020-01-02 NA
3 USA 120 1 2020-01-03 NA
4 USA 340 6 2020-01-04 NA
5 USA 120 5 2020-01-05 224.6000
6 USA 423 3 2020-01-06 475.3333
7 USA 423 3 2020-01-07 475.3333
8 USA 340 6 2020-01-08 274.3333
9 USA 340 6 2020-01-09 274.3333
10 USA 423 6 2020-01-10 324.8333
11 UK 423 3 2020-01-11 NA
12 UK 120 6 2020-01-12 NA
13 UK 120 1 2020-01-13 NA
14 UK 120 1 2020-01-14 NA
15 UK 340 6 2020-01-15 187.1667
16 UK 340 1 2020-01-16 1040.0000
17 UK 340 3 2020-01-17 420.0000
18 UK 340 5 2020-01-18 296.0000
19 UK 423 3 2020-01-19 594.3333
20 UK 120 3 2020-01-20 521.0000
21 China 423 9 2020-01-21 NA
22 China 120 3 2020-01-22 NA
23 China 120 1 2020-01-23 NA
24 China 120 5 2020-01-24 NA
25 China 120 5 2020-01-25 180.6000
26 China 340 6 2020-01-26 136.6667
27 China 120 5 2020-01-27 164.0000
28 China 120 1 2020-01-28 820.0000
29 China 340 6 2020-01-29 173.3333
30 China 340 9 2020-01-30 140.0000
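The workhorse in the one-liner above is stats::filter (worth writing with the stats:: prefix, since dplyr masks it), which with coefficients rep(1, 5) and sides = 1 returns a trailing 5-term moving sum, NA until the window fills. Note the snippet divides by Value2 but does not multiply by 100, which matches the output shown. A minimal illustration of the filter call itself:

```r
# trailing 5-term moving sum: NA for the first 4 positions, then 1+2+3+4+5, etc.
stats::filter(1:7, rep(1, 5), sides = 1)
# NA NA NA NA 15 20 25
```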

Related

Performing a rolling average with criteria in R

I've been trying to learn the most basic items first and then expanding the complexity. So for this one, how would I modify the last line so that it creates a rolling 12-month average for each seriescode? In this case, it would produce an average of 8 for seriescode 100 and 27 for seriescode 101.
First, here is the sample data:
Monthx <- c(201911, 201912, 20201, 20202, 20203, 20204, 20205, 20206, 20207,
            20208, 20209, 202010, 202011,
            201911, 201912, 20201, 20202, 20203, 20204, 20205, 20206, 20207,
            20208, 20209, 202010, 202011)
empx <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
          21, 22, 23, 24, 25, 26, 27, 28, 29, 20, 31, 32, 33)
seriescode <- c(rep(100, 13), rep(110, 13))
ces12x <- data.frame(Monthx, empx, seriescode)
Manipulations
library(dplyr)
ces12x <- ces12x %>% mutate(year = substr(as.numeric(Monthx), 1, 4),
                            month = substr(as.numeric(Monthx), 5, 7),
                            date = as.Date(paste(year, month, "1", sep = "-")))
Month_ord <- order(Monthx)
ces12x<-ces12x %>% mutate(ravg = zoo::rollmeanr(empx, 12, fill = NA))
You would just need to add a group_by(seriescode), which will then perform the mutate functions per seriescode:
ces12x <- ces12x %>% group_by(seriescode) %>% mutate(ravg = zoo::rollmeanr(empx, 12, fill = NA))  # note the added group_by(seriescode)
This produces the output:
# A tibble: 26 x 7
# Groups: seriescode [2]
Monthx empx seriescode year month date ravg
<dbl> <dbl> <dbl> <chr> <chr> <date> <dbl>
1 201911 1 100 2019 11 2019-11-01 NA
2 201912 2 100 2019 12 2019-12-01 NA
3 20201 3 100 2020 1 2020-01-01 NA
4 20202 4 100 2020 2 2020-02-01 NA
5 20203 5 100 2020 3 2020-03-01 NA
6 20204 6 100 2020 4 2020-04-01 NA
7 20205 7 100 2020 5 2020-05-01 NA
8 20206 8 100 2020 6 2020-06-01 NA
9 20207 9 100 2020 7 2020-07-01 NA
10 20208 10 100 2020 8 2020-08-01 NA
11 20209 11 100 2020 9 2020-09-01 NA
12 202010 12 100 2020 10 2020-10-01 6.5
13 202011 13 100 2020 11 2020-11-01 7.5
14 201911 21 110 2019 11 2019-11-01 NA
15 201912 22 110 2019 12 2019-12-01 NA
16 20201 23 110 2020 1 2020-01-01 NA
17 20202 24 110 2020 2 2020-02-01 NA
18 20203 25 110 2020 3 2020-03-01 NA
19 20204 26 110 2020 4 2020-04-01 NA
20 20205 27 110 2020 5 2020-05-01 NA
21 20206 28 110 2020 6 2020-06-01 NA
22 20207 29 110 2020 7 2020-07-01 NA
23 20208 20 110 2020 8 2020-08-01 NA
24 20209 31 110 2020 9 2020-09-01 NA
25 202010 32 110 2020 10 2020-10-01 25.7
26 202011 33 110 2020 11 2020-11-01 26.7
If you want to continue using the tidyverse for this, the following should do the trick:
library(dplyr)
ces12x %>%
  group_by(seriescode) %>%
  arrange(date) %>%
  slice(tail(row_number(), 12)) %>%
  summarize(ravg = mean(empx))
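For reference, the same trailing 12-month mean can be computed without zoo, using cumsum (a sketch; ravg is a hypothetical helper name, not from the answers above):

```r
ravg <- function(x, k = 12) {
  cs <- cumsum(x)
  # difference of cumulative sums gives each trailing k-window total
  c(rep(NA, k - 1), (cs[k:length(x)] - c(0, cs[seq_len(length(x) - k)])) / k)
}
ravg(1:13)[12:13]   # 6.5 7.5, matching the ravg column for seriescode 100
```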

How to filter timestamps of one data frame based on timestamps from another?

I am attempting to filter one data frame, 'Blond_GSE' (bird tracking data that contains many variables, including a timestamp), by the timestamps from a separate data frame, 'Blond_Prey' (variables including a timestamp of when a bird brought food to a nest).
I would like to filter so that I have a new data frame with all tracking data (Blond_GSE) from the 30 minutes prior to the timestamps in 'Blond_Prey'.
Here is a look at each data frame.
head(Blond_GSE)
tag_id sensor_type_id acceleration_raw_x acceleration_raw_y
1 977476871 653 30 -942
2 977476871 653 32 -949
3 977476871 653 34 -949
4 977476871 653 40 -944
5 977476871 653 36 -943
6 977476871 653 36 -944
acceleration_raw_z barometric_height battery_charge_percent
1 454 0 100
2 445 0 100
3 450 0 100
4 446 0 100
5 451 0 100
6 455 0 100
battery_charging_current external_temperature flt_switch gps_hdop
1 0 33 NA 0.9
2 0 33 NA 1.0
3 0 33 NA 1.0
4 0 34 NA 0.9
5 0 33 NA 1.0
6 0 33 NA 0.8
gps_maximum_signal_strength gps_satellite_count gps_time_to_fix
1 NA 7 21.46
2 NA 6 12.48
3 NA 7 14.48
4 NA 8 26.41
5 NA 7 7.95
6 NA 9 8.98
ground_speed gsm_mcc_mnc heading height_above_ellipsoid
1 0 NA 86 NA
2 0 NA 296 NA
3 0 NA 331 NA
4 0 NA 44 NA
5 0 NA 213 NA
6 0 NA 225 NA
height_above_msl import_marked_outlier light_level
1 152 false 0
2 152 false 0
3 152 false 0
4 152 false 0
5 152 false 0
6 152 false 0
location_error_numerical location_lat location_long
1 NA 51.86663 27.59045
2 NA 51.86654 27.59053
3 NA 51.86645 27.59056
4 NA 51.86644 27.59071
5 NA 51.86636 27.59047
6 NA 51.86646 27.59067
magnetic_field_raw_x magnetic_field_raw_y magnetic_field_raw_z
1 0.067 -0.354 -0.024
2 0.065 -0.360 -0.013
3 0.067 -0.352 -0.019
4 0.061 -0.360 -0.012
5 0.061 -0.356 -0.014
6 0.073 -0.350 -0.019
ornitela_transmission_protocol tag_voltage timestamp
1 GPRS 4155 2019-04-26 01:42:00
2 GPRS 4150 2019-04-26 01:46:51
3 GPRS 4150 2019-04-26 01:51:53
4 GPRS 4150 2019-04-26 01:57:05
5 GPRS 4147 2019-04-26 02:01:46
6 GPRS 4147 2019-04-26 02:06:47
transmission_timestamp update_ts
1 2019-10-07 09:46:52.104
2 2019-10-07 09:46:52.104
3 2019-10-07 09:46:52.104
4 2019-10-07 09:46:52.104
5 2019-10-07 09:46:52.104
6 2019-10-07 09:46:52.104
vertical_error_numerical visible deployment_id event_id
1 NA true 1003456347 12506913411
2 NA true 1003456347 12506913412
3 NA true 1003456347 12506913413
4 NA true 1003456347 12506913414
5 NA true 1003456347 12506913415
6 NA true 1003456347 12506913416
sensor_type tag_local_identifier location_long.1 location_lat.1
1 GPS 171035 27.59045 51.86663
2 GPS 171035 27.59053 51.86654
3 GPS 171035 27.59056 51.86645
4 GPS 171035 27.59071 51.86644
5 GPS 171035 27.59047 51.86636
6 GPS 171035 27.59067 51.86646
optional sensor timestamps trackId comments
1 TRUE GPS 2019-04-26 01:42:00 Blond NA
2 TRUE GPS 2019-04-26 01:46:51 Blond NA
3 TRUE GPS 2019-04-26 01:51:53 Blond NA
4 TRUE GPS 2019-04-26 01:57:05 Blond NA
5 TRUE GPS 2019-04-26 02:01:46 Blond NA
6 TRUE GPS 2019-04-26 02:06:47 Blond NA
death_comments earliest_date_born exact_date_of_birth
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
individual_id latest_date_born local_identifier nick_name ring_id
1 1003455374 NA Blond Blond
2 1003455374 NA Blond Blond
3 1003455374 NA Blond Blond
4 1003455374 NA Blond Blond
5 1003455374 NA Blond Blond
6 1003455374 NA Blond Blond
sex taxon_canonical_name timestamp_start
1 Aquila clanga 2018-08-31 00:01:23.000
2 Aquila clanga 2018-08-31 00:01:23.000
3 Aquila clanga 2018-08-31 00:01:23.000
4 Aquila clanga 2018-08-31 00:01:23.000
5 Aquila clanga 2018-08-31 00:01:23.000
6 Aquila clanga 2018-08-31 00:01:23.000
timestamp_end number_of_events number_of_deployments
1 2020-07-16 09:54:12.000 85156 1
2 2020-07-16 09:54:12.000 85156 1
3 2020-07-16 09:54:12.000 85156 1
4 2020-07-16 09:54:12.000 85156 1
5 2020-07-16 09:54:12.000 85156 1
6 2020-07-16 09:54:12.000 85156 1
sensor_type_ids taxon_detail
1 GPS Clanga clanga
2 GPS Clanga clanga
3 GPS Clanga clanga
4 GPS Clanga clanga
5 GPS Clanga clanga
6 GPS Clanga clanga
head(Blond_prey)
Location ID Species Habitat Year Date Activity Gender
1 ?????? Blond BP Fen Mire 2019 2019-04-25 Arrival M
2 ?????? Blond BP Fen Mire 2019 2019-04-27 Arrival M
3 ?????? Blond BP Fen Mire 2019 2019-04-27 Arrival M
4 ?????? Blond BP Fen Mire 2019 2019-05-03 Arrival M
5 ?????? Blond BP Fen Mire 2019 2019-05-12 Arrival M
6 ?????? Blond BP Fen Mire 2019 2019-05-13 Arrival M
Activity_1 Category Prey
1 Prey Delivery ? medium-sized bird or large vole
2 Prey Delivery ? Something Small
3 Prey Delivery Crane-like Spotted Crake
4 Prey Delivery Geese Large Duck
5 Prey Delivery ? medium-sized bird or large vole
6 Prey Delivery Snake Grass Snake
Class Age Condition Weight..g. Notes
1 ? <NA> <NA> 100 Imperfectly Seen
2 ? <NA> <NA> NA <NA>
3 Aves ad <NA> NA <NA>
4 Aves ad duck spine with head NA <NA>
5 ? <NA> <NA> 100 Imperfectly Seen
6 Reptilia <NA> <NA> NA <NA>
New_Time
1 2019-04-25 17:03:00 UTC
2 2019-04-27 04:39:00 UTC
3 2019-04-27 07:33:00 UTC
4 2019-05-03 07:26:00 UTC
5 2019-05-12 06:40:00 UTC
6 2019-05-13 13:19:00 UTC
The columns with the timestamps are called "timestamp" in Blond_GSE and "New_Time" in Blond_prey.
Here is a look at the two timestamp columns:
head(Blond_GSE$timestamp)
[1] "2019-04-26 01:42:00 UTC" "2019-04-26 01:46:51 UTC"
[3] "2019-04-26 01:51:53 UTC" "2019-04-26 01:57:05 UTC"
[5] "2019-04-26 02:01:46 UTC" "2019-04-26 02:06:47 UTC"
head(Blond_prey$New_Time)
[1] "2019-04-25 17:03:00 UTC" "2019-04-27 04:39:00 UTC"
[3] "2019-04-27 07:33:00 UTC" "2019-05-03 07:26:00 UTC"
[5] "2019-05-12 06:40:00 UTC" "2019-05-13 13:19:00 UTC"
I would like to filter the Blond_GSE data by the timestamps of Blond_prey, so I get all data from the 30 minutes prior to the Blond_prey timestamps.
Is this possible?
I have tried this code:
Blond.GSE <- Blond_GSE %>% filter_time(timestamp => Blond_prey$New_Time <=(Blond_prey&New_Time - 30))
However that returns an error message:
Error: unexpected '>' in "Blond.GSE <- Blond_GSE %>% filter_time(timestamp =>"
Please can someone help?
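For what it's worth, the intended filter can be sketched in base R; the vectors below are toy stand-ins for Blond_GSE$timestamp and Blond_prey$New_Time (both assumed to be POSIXct in UTC):

```r
# Toy stand-ins for the two timestamp columns
gse_times  <- as.POSIXct(c("2019-04-26 11:31:00", "2019-04-26 11:20:00",
                           "2019-04-26 12:05:00"), tz = "UTC")
prey_times <- as.POSIXct("2019-04-26 12:00:00", tz = "UTC")

# TRUE for each tracking fix that falls within the 30 minutes *before*
# at least one prey-delivery timestamp
keep <- sapply(gse_times, function(t) {
  mins_ahead <- as.numeric(difftime(prey_times, t, units = "mins"))
  any(mins_ahead > 0 & mins_ahead <= 30)
})
gse_times[keep]   # only the 11:31 fix is in the half hour before 12:00
```

On the real data this would be Blond_GSE[keep, ] with keep built from Blond_GSE$timestamp and Blond_prey$New_Time.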

store each variable into a list element [duplicate]

This question already has answers here:
How to split a data frame?
(8 answers)
Closed 2 years ago.
I have this df.
Date <- c("2020-10-01", "2020-10-02", "2020-10-03", "2020-10-04",
"2020-10-01", "2020-10-02", "2020-10-03", "2020-10-04",
"2020-10-01", "2020-10-02", "2020-10-03", "2020-10-04")
Country <- c("USA", "USA", "USA", "USA",
"Mexico", "Mexico", "Mexico", "Mexico",
"Japan", "Japan", "Japan","Japan")
Value_A <- 1:12
Value_B <- 10:21
Date Country Value_A Value_B
<date> <chr> <int> <int>
1 2020-10-01 USA 1 10
2 2020-10-02 USA 2 11
3 2020-10-03 USA 3 12
4 2020-10-04 USA 4 13
5 2020-10-01 Mexico 5 14
6 2020-10-02 Mexico 6 15
7 2020-10-03 Mexico 7 16
8 2020-10-04 Mexico 8 17
9 2020-10-01 Japan 9 18
10 2020-10-02 Japan 10 19
11 2020-10-03 Japan 11 20
12 2020-10-04 Japan 12 21
What I want is to break this df into n parts, each part being a country. Something like:
list <- list(df[1:4,], df[5:8,], df[9:12,])
ps: for my purposes, the data frames should stay "long".
ps2: I simplified the df; my real data has about 50 countries.
I'd appreciate it if someone can help :)
You can use split.data.frame from base R.
df <- data.frame(Date = Date, Country = Country, Value_a = Value_A, Value_b= Value_B)
split.data.frame(df, df$Country)
$Japan
Date Country Value_a Value_b
9 2020-10-01 Japan 9 18
10 2020-10-02 Japan 10 19
11 2020-10-03 Japan 11 20
12 2020-10-04 Japan 12 21
$Mexico
Date Country Value_a Value_b
5 2020-10-01 Mexico 5 14
6 2020-10-02 Mexico 6 15
7 2020-10-03 Mexico 7 16
8 2020-10-04 Mexico 8 17
$USA
Date Country Value_a Value_b
1 2020-10-01 USA 1 10
2 2020-10-02 USA 2 11
3 2020-10-03 USA 3 12
4 2020-10-04 USA 4 13
Does this work?
df
Date Country Value_A Value_B
1 2020-10-01 USA 1 10
2 2020-10-02 USA 2 11
3 2020-10-03 USA 3 12
4 2020-10-04 USA 4 13
5 2020-10-01 Mexico 5 14
6 2020-10-02 Mexico 6 15
7 2020-10-03 Mexico 7 16
8 2020-10-04 Mexico 8 17
9 2020-10-01 Japan 9 18
10 2020-10-02 Japan 10 19
11 2020-10-03 Japan 11 20
12 2020-10-04 Japan 12 21
library(dplyr)  # for the %>% pipe
list <- df %>% split(df$Country)
list
$Japan
Date Country Value_A Value_B
9 2020-10-01 Japan 9 18
10 2020-10-02 Japan 10 19
11 2020-10-03 Japan 11 20
12 2020-10-04 Japan 12 21
$Mexico
Date Country Value_A Value_B
5 2020-10-01 Mexico 5 14
6 2020-10-02 Mexico 6 15
7 2020-10-03 Mexico 7 16
8 2020-10-04 Mexico 8 17
$USA
Date Country Value_A Value_B
1 2020-10-01 USA 1 10
2 2020-10-02 USA 2 11
3 2020-10-03 USA 3 12
4 2020-10-04 USA 4 13
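Whichever variant you use, the split is reversible; a small sketch with toy data (the names here are illustrative only):

```r
df <- data.frame(Country = rep(c("USA", "Mexico"), each = 2), Value = 1:4)
parts <- split(df, df$Country)        # named list: parts$Mexico, parts$USA
back  <- unsplit(parts, df$Country)   # reassembles in the original row order
all(back$Value == df$Value)           # TRUE
```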

Equivalent of index - match in Excel to return greater than the lookup value

In R I need to perform a function similar to INDEX-MATCH in Excel, which returns the value just greater than the lookup value.
Data Set A
Country GNI2009
Ukraine 6604
Egypt 5937
Morocco 5307
Philippines 4707
Indonesia 4148
India 3677
Viet Nam 3180
Pakistan 2760
Nigeria 2699
Data Set B
GNI2004 s1 s2 s3 s4
6649 295 33 59 3
6021 260 30 50 3
5418 226 27 42 2
4846 193 23 35 2
4311 162 20 29 2
3813 134 16 23 1
3356 109 13 19 1
2976 89 10 15 1
2578 68 7 11 0
2248 51 5 8 0
2199 48 5 8 0
For each country's 2009 GNI level (data set A), I would like to find out which GNI2004 is just greater than or equal to GNI2009, and then return the corresponding sales values (s1, s2, ...) from that row (data set B). I would like to repeat this for every Country-GNI row for 2009 in table A.
For example: Nigeria, with a GNI2009 of 2699 in data set A, would return:
GNI2004 s1 s2 s3 s4
2976 89 10 15 1
In Excel I guess this would be something like INDEX and MATCH, where the match condition would be match(lookup value, lookup array, -1).
You could try data.table's rolling join, which is designed to achieve just that:
library(data.table) # V1.9.6+
indx <- setDT(DataB)[setDT(DataA), roll = -Inf, on = c(GNI2004 = "GNI2009"), which = TRUE]
DataA[, names(DataB) := DataB[indx]]
DataA
# Country GNI2009 GNI2004 s1 s2 s3 s4
# 1: Ukraine 6604 6649 295 33 59 3
# 2: Egypt 5937 6021 260 30 50 3
# 3: Morocco 5307 5418 226 27 42 2
# 4: Philippines 4707 4846 193 23 35 2
# 5: Indonesia 4148 4311 162 20 29 2
# 6: India 3677 3813 134 16 23 1
# 7: Viet Nam 3180 3356 109 13 19 1
# 8: Pakistan 2760 2976 89 10 15 1
# 9: Nigeria 2699 2976 89 10 15 1
The idea here is, for each row in GNI2009, to find the closest equal-or-bigger value in GNI2004, get the row index, and subset. Then we update DataA with the result.
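If data.table is not an option, the same "smallest value greater than or equal to the lookup" match can be sketched in base R (toy vectors mirroring the tables above; the names are illustrative):

```r
GNI2009 <- c(6604, 2760, 2699)         # lookup values (subset of data set A)
GNI2004 <- c(6649, 2976, 2578, 2248)   # candidate values (subset of data set B)

# for each lookup value, the index of the smallest candidate that is >= it
idx <- sapply(GNI2009, function(g) {
  ok <- which(GNI2004 >= g)
  ok[which.min(GNI2004[ok])]
})
GNI2004[idx]   # 6649 2976 2976
```

The resulting idx can then subset the sales columns of data set B, row by row.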

creating index conditioned on value in other column; differences over time

I am struggling with the following problem:
The data frame below contains the development of a value over time for various ids. What I try to get is the increase/decrease of these values relative to the value in the year when an event occurred. Several events can occur within one id, so a new event becomes the new baseline year for that id.
To make things clearer, I also add the outcome I want below.
What I have
id value year event
a 100 1950 NA
a 101 1951 NA
a 102 1952 NA
a 103 1953 NA
a 104 1954 NA
a 105 1955 X
a 106 1956 NA
a 107 1957 NA
a 108 1958 NA
a 107 1959 Y
a 106 1960 NA
a 105 1961 NA
a 104.8 1962 NA
a 104.2 1963 NA
b 70 1970 NA
b 75 1971 NA
b 80 1972 NA
b 85 1973 NA
b 90 1974 NA
b 60 1975 Z
b 59 1976 NA
b 58 1977 NA
b 57 1978 NA
b 56 1979 NA
b 55 1980 W
b 54 1981 NA
b 53 1982 NA
b 52 1983 NA
b 51 1984 NA
What I am looking for
id value year event index growth
a 100 1950 NA 0
a 101 1951 NA 0
a 102 1952 NA 0
a 103 1953 NA 0
a 104 1954 NA 0
a 105 1955 X 1 1
a 106 1956 NA 2 1.00952381
a 107 1957 NA 3 1.019047619
a 108 1958 NA 4 1.028571429
a 107 1959 Y 1 1 #new baseline year
a 106 1960 NA 2 0.990654206
a 105 1961 NA 3 0.981308411
a 104.8 1962 NA 4 0.979439252
a 104.2 1963 NA 5 0.973831776
b 70 1970 NA 6
b 75 1971 NA 7
b 80 1972 NA 8
b 85 1973 NA 9
b 90 1974 NA 10
b 60 1975 Z 1 1
b 59 1976 NA 2 0.983333333
b 58 1977 NA 3 0.966666667
b 57 1978 NA 4 0.95
b 56 1979 NA 5 0.933333333
b 55 1980 W 1 1 #new baseline year
b 54 1981 NA 2 0.981818182
b 53 1982 NA 3 0.963636364
b 52 1983 NA 4 0.945454545
b 51 1984 NA 5 0.927272727
What I tried
This and this post were quite helpful, and I managed to create differences between the years; however, I fail to reset the base year (index) when there is a new event. Furthermore, I am doubtful whether my approach is the most efficient/elegant one. It seems a bit clumsy to me...
x <- ddply(x, .(id), transform, year.min = min(year[!is.na(event)]))  # identifies the first event year
x1 <- ddply(x[x$year >= x$year.min,], .(id), transform, index = seq_along(id))  # counter for the years from the first event on; prior years are removed
x1 <- x1[order(x1$id, x1$year),]  # sort
x1 <- ddply(x1, .(id), transform, growth = 100*(value/value[1]))  # calculate the difference; however, this is always based on the first event year, which is wrong
library(Interact)  # I then merge the df with the pre-event years that were removed at the beginning
x$id.year <- interaction(x$id, x$year)
x1$id.year <- interaction(x1$id, x1$year)
x$index <- x$growth <- NA
y <- rbind(x[x$year < x$year.min,], x1)
y <- y[order(y$id, y$year),]
Many thanks for any advice.
# Create a tag to indicate the start of each new event by id or
# when id changes
dat$tag <- with(dat, ave(as.character(event), as.character(id),
FUN=function(i) cumsum(!is.na(i))))
# Calculate the growth by id and tag
# this will also produce results for each id before an event has happened
dat$growth <- with(dat, ave(value, tag, id, FUN=function(i) i/i[1] ))
# remove growth prior to an event (this will be when tag equals zero as no
# event have occurred)
dat$growth[dat$tag==0] <- NA
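The heart of this approach is cumsum(!is.na(event)), which turns the sparse event markers into segment labels that reset the baseline at each event:

```r
event <- c(NA, NA, "X", NA, NA, "Y", NA)
cumsum(!is.na(event))
# 0 0 1 1 1 2 2  -- rows before any event get tag 0, then one tag per event
```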
Here is a solution with dplyr (na.locf() comes from the zoo package):
library(dplyr)
library(zoo)
ana <- group_by(mydf, id) %>%
  do(na.locf(., na.rm = FALSE)) %>%
  mutate(value = as.numeric(value)) %>%
  group_by(id, event) %>%
  mutate(growth = value/value[1]) %>%
  mutate(index = row_number(event))
ana$growth[is.na(ana$event)] <- 0
id value year event growth index
1 a 100.0 1950 NA 0.0000000 1
2 a 101.0 1951 NA 0.0000000 2
3 a 102.0 1952 NA 0.0000000 3
4 a 103.0 1953 NA 0.0000000 4
5 a 104.0 1954 NA 0.0000000 5
6 a 105.0 1955 X 1.0000000 1
7 a 106.0 1956 X 1.0095238 2
8 a 107.0 1957 X 1.0190476 3
9 a 108.0 1958 X 1.0285714 4
10 a 107.0 1959 Y 1.0000000 1
11 a 106.0 1960 Y 0.9906542 2
12 a 105.0 1961 Y 0.9813084 3
13 a 104.8 1962 Y 0.9794393 4
14 a 104.2 1963 Y 0.9738318 5
15 b 70.0 1970 NA 0.0000000 1
16 b 75.0 1971 NA 0.0000000 2
17 b 80.0 1972 NA 0.0000000 3
18 b 85.0 1973 NA 0.0000000 4
19 b 90.0 1974 NA 0.0000000 5
20 b 60.0 1975 Z 1.0000000 1
21 b 59.0 1976 Z 0.9833333 2
22 b 58.0 1977 Z 0.9666667 3
23 b 57.0 1978 Z 0.9500000 4
24 b 56.0 1979 Z 0.9333333 5
25 b 55.0 1980 W 1.0000000 1
26 b 54.0 1981 W 0.9818182 2
27 b 53.0 1982 W 0.9636364 3
28 b 52.0 1983 W 0.9454545 4
Try:
ddf$index <- 0
ddf$growth <- 0
baseline <- 0
start <- FALSE
for (r in 1:nrow(ddf)) {
  if (is.na(ddf$event[r])) {
    if (start) {
      ddf$index[r] <- ddf$index[r-1] + 1
      ddf$growth[r] <- ddf$value[r] / baseline
    } else {
      ddf$index[r] <- 0
    }
  } else {
    start <- TRUE
    ddf$index[r] <- 1
    ddf$growth[r] <- 1
    baseline <- ddf$value[r]
  }
}
ddf
id value year event index growth
1 a 100.0 1950 <NA> 0 0.0000000
2 a 101.0 1951 <NA> 0 0.0000000
3 a 102.0 1952 <NA> 0 0.0000000
4 a 103.0 1953 <NA> 0 0.0000000
5 a 104.0 1954 <NA> 0 0.0000000
6 a 105.0 1955 X 1 1.0000000
7 a 106.0 1956 <NA> 2 1.0095238
8 a 107.0 1957 <NA> 3 1.0190476
9 a 108.0 1958 <NA> 4 1.0285714
10 a 107.0 1959 Y 1 1.0000000
11 a 106.0 1960 <NA> 2 0.9906542
12 a 105.0 1961 <NA> 3 0.9813084
13 a 104.8 1962 <NA> 4 0.9794393
14 a 104.2 1963 <NA> 5 0.9738318
15 b 70.0 1970 <NA> 6 0.6542056
16 b 75.0 1971 <NA> 7 0.7009346
17 b 80.0 1972 <NA> 8 0.7476636
18 b 85.0 1973 <NA> 9 0.7943925
19 b 90.0 1974 <NA> 10 0.8411215
20 b 60.0 1975 Z 1 1.0000000
21 b 59.0 1976 <NA> 2 0.9833333
22 b 58.0 1977 <NA> 3 0.9666667
23 b 57.0 1978 <NA> 4 0.9500000
24 b 56.0 1979 <NA> 5 0.9333333
25 b 55.0 1980 W 1 1.0000000
26 b 54.0 1981 <NA> 2 0.9818182
27 b 53.0 1982 <NA> 3 0.9636364
28 b 52.0 1983 <NA> 4 0.9454545
29 b 51.0 1984 <NA> 5 0.9272727
