Need to 'reshape' dataframe - R

dataset:
zip acs.pop napps pperct cgrp zgrp perc
1: 12007 97 2 2.0618557 2 1 25.000000
2: 12007 97 2 2.0618557 NA 2 50.000000
3: 12007 97 2 2.0618557 1 1 25.000000
4: 12008 485 2 0.4123711 2 1 33.333333
5: 12008 485 2 0.4123711 4 1 33.333333
6: 12008 485 2 0.4123711 NA 1 33.333333
7: 12009 7327 187 2.5522042 4 76 26.206897
8: 12009 7327 187 2.5522042 1 41 14.137931
9: 12009 7327 187 2.5522042 2 23 7.931034
10: 12009 7327 187 2.5522042 NA 103 35.517241
11: 12009 7327 187 2.5522042 3 47 16.206897
12: 12010 28802 580 2.0137490 NA 275 32.163743
13: 12010 28802 580 2.0137490 4 122 14.269006
14: 12010 28802 580 2.0137490 1 269 31.461988
15: 12010 28802 580 2.0137490 2 96 11.228070
16: 12010 28802 580 2.0137490 3 93 10.877193
17: 12018 7608 126 1.6561514 3 30 16.129032
18: 12018 7608 126 1.6561514 NA 60 32.258065
19: 12018 7608 126 1.6561514 2 14 7.526882
20: 12018 7608 126 1.6561514 4 57 30.645161
21: 12018 7608 126 1.6561514 1 25 13.440860
22: 12019 14841 144 0.9702850 NA 62 30.097087
23: 12019 14841 144 0.9702850 4 73 35.436893
24: 12019 14841 144 0.9702850 3 30 14.563107
25: 12019 14841 144 0.9702850 1 23 11.165049
26: 12019 14841 144 0.9702850 2 18 8.737864
27: 12020 31403 343 1.0922523 3 76 14.960630
28: 12020 31403 343 1.0922523 1 88 17.322835
29: 12020 31403 343 1.0922523 2 38 7.480315
30: 12020 31403 343 1.0922523 4 141 27.755906
31: 12020 31403 343 1.0922523 NA 165 32.480315
32: 12022 1002 5 0.4990020 NA 4 44.444444
33: 12022 1002 5 0.4990020 4 2 22.222222
34: 12022 1002 5 0.4990020 3 1 11.111111
35: 12022 1002 5 0.4990020 1 1 11.111111
I know the reshape2 or reshape package can handle this, but I'm not sure how. I need the final output to look like this:
zip acs.pop napps pperct zgrp4 zgrp3 zgrp2 zgrp1 perc4 perc3 perc2 perc1
12009 7327 187 2.5522042 76 47 23 41 26.206897 16.206897 7.931034 14.137931
zip is the id
acs.pop, napps, pperct will be the same for each zip group
zgrp4…zgrp1 are the values of zgrp for each value of cgrp
perc4…perc1 are the values of perc for each value of cgrp

We can try dcast from the devel version of data.table, which can take multiple value.var columns. In this case, the value columns are 'zgrp' and 'perc'. Using the grouping variables, we create a sequence variable ('ind') and then use dcast to convert from 'long' to 'wide' format.
Instructions to install the devel version are here
library(data.table) # v1.9.5
setDT(df1)[, ind := 1:.N, by = .(zip, acs.pop, napps, pperct)]
dcast(df1, zip + acs.pop + napps + pperct ~ ind, value.var = c('zgrp', 'perc'))
# zip acs.pop napps pperct 1_zgrp 2_zgrp 3_zgrp 4_zgrp 5_zgrp 1_perc
#1: 12007 97 2 2.0618557 1 2 1 NA NA 25.00000
#2: 12008 485 2 0.4123711 1 1 1 NA NA 33.33333
#3: 12009 7327 187 2.5522042 76 41 23 103 47 26.20690
#4: 12010 28802 580 2.0137490 275 122 269 96 93 32.16374
#5: 12018 7608 126 1.6561514 30 60 14 57 25 16.12903
#6: 12019 14841 144 0.9702850 62 73 30 23 18 30.09709
#7: 12020 31403 343 1.0922523 76 88 38 141 165 14.96063
#8: 12022 1002 5 0.4990020 4 2 1 1 NA 44.44444
# 2_perc 3_perc 4_perc 5_perc
#1: 50.00000 25.000000 NA NA
#2: 33.33333 33.333333 NA NA
#3: 14.13793 7.931034 35.51724 16.206897
#4: 14.26901 31.461988 11.22807 10.877193
#5: 32.25807 7.526882 30.64516 13.440860
#6: 35.43689 14.563107 11.16505 8.737864
#7: 17.32284 7.480315 27.75591 32.480315
#8: 22.22222 11.111111 11.11111 NA
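As an aside, multi-value dcast has since landed on CRAN (data.table >= 1.9.6), so the devel install is no longer required. A hedged variation: casting on cgrp itself keys the columns by group value (closer to the zgrp1…zgrp4 naming the OP asked for) instead of by row order, with the NA level of cgrp getting its own zgrp_NA/perc_NA columns.
dcast(df1, zip + acs.pop + napps + pperct ~ cgrp, value.var = c('zgrp', 'perc'))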
Or we can use ave/reshape from base R
df2 <- transform(df1, ind = ave(seq_along(zip), zip, acs.pop, napps, pperct,
                                FUN = seq_along))
reshape(df2, idvar = c('zip', 'acs.pop', 'napps', 'pperct'),
        timevar = 'ind', direction = 'wide')
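One caveat with the base R route (a hedged note): reshape() spreads every non-id, time-varying column, and 'cgrp' also varies within each zip, so it would be widened into cgrp.1…cgrp.5 as well. Dropping it first keeps only the zgrp/perc columns:
df2$cgrp <- NULL
reshape(df2, idvar = c('zip', 'acs.pop', 'napps', 'pperct'),
        timevar = 'ind', direction = 'wide')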

This is a good use for spread() in tidyr.
df %>% filter(!is.na(cgrp)) %>%             # if cgrp is missing I don't know where to put the obs
  gather(Var, Val, 6:7) %>%                 # one row per measure (zgrp OR perc) observed
  group_by(zip, acs.pop, napps, pperct) %>% # unique combos of these will define rows in output
  unite(Var1, Var, cgrp) %>%                # identify which obs belongs to which measure
  spread(Var1, Val)                         # make columns for zgrp_1, zgrp_2, etc., perc_1, perc_2, etc.
Example output:
> df2[df2$zip==12009,]
Source: local data frame [1 x 12]
zip acs.pop napps pperct perc_1 perc_2 perc_3 perc_4 zgrp_1 zgrp_2 zgrp_3 zgrp_4
1 12009 7327 187 2.552204 14.13793 7.931034 16.2069 26.2069 41 23 47 76
Thanks to @akrun for the assist.
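gather()/spread() have since been superseded in tidyr; as a sketch of the same reshape in one step (assuming tidyr >= 1.0), pivot_wider() replaces the gather/unite/spread chain:
library(dplyr)
library(tidyr)
df %>%
  filter(!is.na(cgrp)) %>%
  pivot_wider(id_cols = c(zip, acs.pop, napps, pperct),
              names_from = cgrp,
              values_from = c(zgrp, perc))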

Related

Removing and adding observations specific to an id variable within a dataframe of multiple ids in R

I have a dataframe containing location data of different animals. Each animal has a unique id and each observation has a time stamp and some further metrics of the location observation. See a subset of the data below. The subset contains the first two observations of each id.
> sub
id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
2 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
3 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
4 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
5 333 B -80.8211 24.8441 11625 6980 37 2018-12-17 20:45:05
6 333 3 -80.8137 24.8263 155 100 69 2018-12-17 21:00:43
7 444 3 -80.4535 25.0848 501 33 104 2019-10-20 19:44:16
8 444 1 -80.8086 24.8364 6356 126 87 2020-01-18 20:32:28
9 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17
10 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35
11 666 2 -77.7221 24.4902 1129 75 66 2020-07-12 21:09:02
12 666 2 -77.7097 24.4905 314 248 164 2020-07-12 21:11:37
13 777 3 -77.7133 24.4820 406 58 110 2020-06-20 11:18:18
14 777 3 -77.7218 24.4844 170 93 107 2020-06-20 11:51:06
15 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
16 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
However, I need to do some data housekeeping: I need to include the day/time and location at which each animal was released, and after that I need to filter out, for each animal, the observations that occurred before its release.
I have a an additional dataframe that contains the necessary release metadata:
> stack
id release lat lon
1 888 2017-11-27 14:53 25.69201 -79.31534
2 333 2019-01-31 16:09 25.68896 -79.31326
3 222 2019-02-02 15:55 25.70051 -79.31393
4 111 2019-04-02 10:43 25.68534 -79.31341
5 444 2020-03-13 15:04 24.42892 -77.69518
6 666 2020-10-27 09:40 24.58290 -77.69561
7 555 2020-01-21 14:38 24.43333 -77.69637
8 777 2020-06-25 08:54 24.42712 -77.76427
So my question is: how can I add the release information (time and lat/lon) to the dataframe for each id (while the columns a, b, and c can be NA)? And how can I then filter out the observations that occurred before each animal's release time? I have been looking into possibilities using dplyr but was not yet able to resolve my issue.
You've not provided an easy way of obtaining your data (dput() is by far the best), and you have issues with your date-time values (release uses Y-M-D H:M whereas date uses Y-M-D H:M:S), so for clarity I've included code to obtain the data frames I use at the end of this post.
First, the solution:
library(tidyverse)
library(lubridate)
sub %>%
  left_join(stack, by = "id") %>%
  mutate(
    release = ymd_hms(paste0(release, ":00")),
    date = ymd_hms(date)
  ) %>%
  filter(date >= release)
id lc lon.x lat.x a b c date release lat.y lon.y
1 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17 2020-01-21 14:38:00 24.43333 -77.69637
2 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35 2020-01-21 14:38:00 24.43333 -77.69637
As I indicated in comments.
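A small variation (hedged; same result on this data): lubridate's ymd_hm() parses the minute-precision release stamps directly, so the paste0(..., ":00") step can be dropped:
sub %>%
  left_join(stack, by = "id") %>%
  mutate(release = ymd_hm(release), date = ymd_hms(date)) %>%
  filter(date >= release)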
To obtain the data
sub <- read.table(textConnection("id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
2 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
3 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
4 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
5 333 B -80.8211 24.8441 11625 6980 37 '2018-12-17 20:45:05'
6 333 3 -80.8137 24.8263 155 100 69 '2018-12-17 21:00:43'
7 444 3 -80.4535 25.0848 501 33 104 '2019-10-20 19:44:16'
8 444 1 -80.8086 24.8364 6356 126 87 '2020-01-18 20:32:28'
9 555 3 -77.7211 24.4887 665 45 68 '2020-07-12 21:09:17'
10 555 3 -77.7163 24.4897 285 129 130 '2020-07-12 21:10:35'
11 666 2 -77.7221 24.4902 1129 75 66 '2020-07-12 21:09:02'
12 666 2 -77.7097 24.4905 314 248 164 '2020-07-12 21:11:37'
13 777 3 -77.7133 24.4820 406 58 110 '2020-06-20 11:18:18'
14 777 3 -77.7218 24.4844 170 93 107 '2020-06-20 11:51:06'
15 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'
16 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'"), header=TRUE)
stack <- read.table(textConnection("id release lat lon
1 888 '2017-11-27 14:53' 25.69201 -79.31534
2 333 '2019-01-31 16:09' 25.68896 -79.31326
3 222 '2019-02-02 15:55' 25.70051 -79.31393
4 111 '2019-04-02 10:43' 25.68534 -79.31341
5 444 '2020-03-13 15:04' 24.42892 -77.69518
6 666 '2020-10-27 09:40' 24.58290 -77.69561
7 555 '2020-01-21 14:38' 24.43333 -77.69637
8 777 '2020-06-25 08:54' 24.42712 -77.76427"), header=TRUE)

How to use mutate_at() with two sets of variables, in R

Using dplyr, I want to divide a column by another one, where the two columns have a similar pattern.
I have the following data frame:
My_data = data.frame(
var_a = 101:110,
var_b = 201:210,
number_a = 1:10,
number_b = 21:30)
I would like to create a new variable: var_a_new = var_a/number_a, var_b_new = var_b/number_b and so on if I have c, d etc.
My_data %>%
  mutate_at(
    .vars = c('var_a', 'var_b'),
    .funs = list(new = function(x) x / (.[, paste0('number_a', names(x))])))
I did not get an error, but a wrong result. I think that the problem is that I don't understand what the 'x' is. Is it one of the string in .vars? Is it a column in My_data? Something else?
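As a quick hedged aside on what x is: the function in .funs receives each selected column's values, not its name, so names(x) is NULL for a plain numeric vector and paste0('number_a', names(x)) is always just "number_a"; every selected column therefore gets divided by number_a, which explains the wrong result.
f <- function(x) names(x)
f(My_data$var_a)
#> NULL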
One option could be:
bind_cols(My_data,
My_data %>%
transmute(across(starts_with("var"))/across(starts_with("number"))) %>%
rename_all(~ paste0(., "_new")))
var_a var_b number_a number_b var_a_new var_b_new
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
You can do this directly, provided the columns are ordered consistently, meaning "var_a" is the first column in the "var" group, "number_a" is the first column in the "number" group, and so on for the other pairs.
var_cols <- grep('var', names(My_data), value = TRUE)
number_cols <- grep('number', names(My_data), value = TRUE)
My_data[paste0(var_cols, '_new')] <- My_data[var_cols]/My_data[number_cols]
My_data
# var_a var_b number_a number_b var_a_new var_b_new
#1 101 201 1 21 101.00000 9.571429
#2 102 202 2 22 51.00000 9.181818
#3 103 203 3 23 34.33333 8.826087
#4 104 204 4 24 26.00000 8.500000
#5 105 205 5 25 21.00000 8.200000
#6 106 206 6 26 17.66667 7.923077
#7 107 207 7 27 15.28571 7.666667
#8 108 208 8 28 13.50000 7.428571
#9 109 209 9 29 12.11111 7.206897
#10 110 210 10 30 11.00000 7.000000
The function across() has replaced the scoped variants such as mutate_at(), summarize_at(), and others. For more details, see vignette("colwise") or https://cran.r-project.org/web/packages/dplyr/vignettes/colwise.html. Based on tmfmnk's answer, the following works well:
My_data %>%
mutate(
new = across(starts_with("var"))/across(starts_with("number")))
The prefix "new." will be added to the names of the new variables.
var_a var_b number_a number_b new.var_a new.var_b
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
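For completeness, a hedged sketch using pick() (requires dplyr >= 1.1): an unnamed data-frame result in mutate() is spliced into columns, so renaming inside the call yields the var_a_new/var_b_new names directly.
library(dplyr)
My_data %>%
  mutate(rename_with(pick(starts_with("var")) / pick(starts_with("number")),
                     ~ paste0(.x, "_new")))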

Adding a column according to norm data in R

I have a longitudinal dataset in the long form with the length of around 2800, with around 400 participants in total. Here's a sample of my data.
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 30 2 65 16
#8 1003 2 30 2 66 16
#9 1003 3 29 2 67 16
#10 1003 4 28 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
I want to create a new column "cutoff" with values "Normal" or "Impaired", because my outcome variable, "score", has a cutoff score indicating impairment according to a norm. The norm consists of different -1.5SD measures (the cutoff points) according to Sex, Edu (years of education), and Age.
Below is what I'm currently doing, checking an excel file myself and putting in the corresponding cutoff score according to the three conditions. First of all, I am not sure if I am creating the right column.
data$cutoff <- ifelse(data$sex == 1 & data$age < 70 & data$edu < 3 &
                        data$score < 19.91, "Impaired", "Normal")
data$cutoff <- ifelse(data$sex == 2 & data$age < 70 & data$edu < 3 &
                        data$score < 18.39, "Impaired", "Normal")
Additionally, I am wondering if I can import an Excel file containing the norm and create a column according to the values in it.
The excel file has a structure as shown below.
# Sex Male Female
#60-69 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
#Age Number 22 51 119 72 130 138 106 51
# Mean 24.45 26.6 27.06 27.83 23.31 25.86 27.26 28.09
# SD 3.03 1.89 1.8 1.53 3.28 2.55 1.85 1.44
# -1.5SD' 19.92 23.27 23.76 24.8 18.53 21.81 23.91 25.15
#70-79 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
....
I have created new columns "agecat" and "educat," allocating each ID into a group of age and education used in the norm. Now I want to make use of these columns, matching it with rows and columns of the excel file above. One of the motivations is to create a code that can be used for further research using the test scores of my data.
I think your ifelse statements should work fine, but I would definitely import the Excel file rather than hardcoding it, though you may need to structure it a bit differently. I would structure it just like a dataset, with columns for Sex, Edu, Age, Mean, SD, -1.5SD, etc., import it into R, then do a left outer join on Sex+Edu+Age:
merge(x = long_df, y = norm_df, by = c("Sex", "Edu(yr)", "Age"), all.x = TRUE)
Then you can compare the columns directly.
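A hedged sketch of that import step with readxl (the file name, sheet, and the "-1.5SD" column name are placeholders for whatever the reshaped norm sheet actually contains):
library(readxl)
norm_df <- read_excel("norms.xlsx", sheet = 1)  # hypothetical file
merged <- merge(x = long_df, y = norm_df,
                by = c("Sex", "Edu(yr)", "Age"), all.x = TRUE)
merged$cutoff <- ifelse(merged$score < merged$`-1.5SD`, "Impaired", "Normal")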
If I understand correctly, the OP wants to mark a certain type of outlier in their dataset. So there are two tasks here:
1. Compute the statistics mean(score), sd(score), and the cutoff value mean(score) - 1.5 * sd(score) for each group of sex, age category agecat, and edu category educat.
2. Find all rows where score is lower than the cutoff value for the particular group.
As already mentioned by hannes101, the second step can be implemented by a non-equi join.
library(data.table)
# categorize age and edu (left-closed intervals)
mydata[, c("agecat", "educat") := .(cut(age, c(seq(0, 90, 10), Inf), right = FALSE),
                                    cut(edu, c(0, 4, 7, 13, Inf), right = FALSE))][]
# compute statistics
cutoffs <- mydata[, .(.N, Mean = mean(score), SD = sd(score),
                      m1.5SD = mean(score) - 1.5 * sd(score)),
                  by = .(sex, agecat, educat)]
# non-equi update join
mydata[, cutoff := "Normal"]
mydata[cutoffs, on = .(sex, agecat, educat, score < m1.5SD), cutoff := "Impaired"][]
mydata
ID wave score sex age edu agecat educat cutoff
1: 1001 1 28 1 69 12 [60,70) [7,13) Normal
2: 1001 2 27 1 70 12 [70,80) [7,13) Normal
3: 1001 3 28 1 71 12 [70,80) [7,13) Normal
4: 1001 4 26 1 72 12 [70,80) [7,13) Normal
5: 1002 1 30 2 78 9 [70,80) [7,13) Normal
6: 1002 3 30 2 80 9 [80,90) [7,13) Normal
7: 1003 1 33 2 65 16 [60,70) [13,Inf) Normal
8: 1003 2 32 2 66 16 [60,70) [13,Inf) Normal
9: 1003 3 31 2 67 16 [60,70) [13,Inf) Normal
10: 1003 4 24 2 68 16 [60,70) [13,Inf) Impaired
11: 1004 1 22 2 85 4 [80,90) [4,7) Normal
12: 1005 1 20 2 60 9 [60,70) [7,13) Normal
13: 1005 2 18 1 61 9 [60,70) [7,13) Normal
14: 1006 1 22 1 74 9 [70,80) [7,13) Normal
15: 1006 2 23 1 75 9 [70,80) [7,13) Normal
16: 1006 3 25 1 76 9 [70,80) [7,13) Normal
17: 1006 4 19 1 77 9 [70,80) [7,13) Normal
18: 1007 1 33 2 65 16 [60,70) [13,Inf) Normal
19: 1007 2 32 2 66 16 [60,70) [13,Inf) Normal
20: 1007 3 31 2 67 16 [60,70) [13,Inf) Normal
21: 1007 4 31 2 68 16 [60,70) [13,Inf) Normal
ID wave score sex age edu agecat educat cutoff
In this made-up example there is only one row which meets the "Impaired" conditions.
Likewise, the statistics table is rather sparsely populated:
cutoffs
sex agecat educat N Mean SD m1.5SD
1: 1 [60,70) [7,13) 2 23.00000 7.071068 12.39340
2: 1 [70,80) [7,13) 7 24.28571 3.147183 19.56494
3: 2 [70,80) [7,13) 1 30.00000 NA NA
4: 2 [80,90) [7,13) 1 30.00000 NA NA
5: 2 [60,70) [13,Inf) 8 30.87500 2.900123 26.52482
6: 2 [80,90) [4,7) 1 22.00000 NA NA
7: 2 [60,70) [7,13) 1 20.00000 NA NA
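For comparison, a hedged dplyr sketch of the same two steps, reusing the agecat/educat columns created above; the is.na() guard mirrors the data.table default of "Normal" for single-member groups, where sd() is NA:
library(dplyr)
mydata %>%
  group_by(sex, agecat, educat) %>%
  mutate(cutoff = if_else(!is.na(sd(score)) & score < mean(score) - 1.5 * sd(score),
                          "Impaired", "Normal")) %>%
  ungroup()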
Data
OP's sample dataset has been modified in one group for demonstration.
library(data.table)
mydata <- fread("
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 33 2 65 16
#8 1003 2 32 2 66 16
#9 1003 3 31 2 67 16
#10 1003 4 24 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
#18 1007 1 33 2 65 16
#19 1007 2 32 2 66 16
#20 1007 3 31 2 67 16
#21 1007 4 31 2 68 16
", drop = 1L)

integer64 not assigned NA values during data.table::dcast(), but defaults to 9218868437227407266

I have a data.table named test where the 'freq' column is of class integer64, as follows
>test
event_type eventdt_qtr freq
1: IF 200601 23
2: BDSF 200601 47
3: EDPM 200601 258
4: CPBP 200601 17
5: EF 200601 5132
6: EPWS 200601 96
7: EF 200602 15929
8: IF 200602 24
9: BDSF 200602 41
10: CPBP 200602 16
11: EDPM 200602 231
12: EPWS 200602 109
13: CPBP 200603 15
14: IF 200603 42
15: EDPM 200603 358
16: BDSF 200603 72
17: EPWS 200603 93
18: DPA 200603 2
19: EF 200603 48185
20: BDSF 200604 47
>str(test)
Classes ‘data.table’ and 'data.frame': 297 obs. of 3 variables:
$ event_type : chr "IF" "BDSF" "EDPM" "CPBP" ...
$ eventdt_qtr: num 200601 200601 200601 200601 200601 ...
$ freq :integer64 23 47 258 17 5132 96 15929 24 ...
When I do a data.table::dcast() operation on test, instead of getting NA or 0 for non-existing event_type and eventdt_qtr combinations, I get the value 9218868437227407266. I believe this issue might be related to "Merge (outer join) fails to set NAs for integer64", but I don't know how to fix it.
> test_tx <- data.table::dcast(test,eventdt_qtr ~ event_type,value.var = "freq")
> test_tx
eventdt_qtr BDSF CPBP DPA EDPM EF
1: 200601 47 17 9218868437227407266 258 5132
2: 200602 41 16 9218868437227407266 231 15929
3: 200603 72 15 2 358 48185
4: 200604 47 9218868437227407266 9218868437227407266 9218868437227407266 9218868437227407266
EPWS IF
1: 96 23
2: 109 24
3: 93 42
4: 9218868437227407266 9218868437227407266
Note: I know there is a fix (see below) if I do an explicit conversion of the column 'freq' to integer instead of integer64, but I don't want to do any type conversions, as I am reading the data from Redshift and have to preserve the structure of the data.table 'test' downstream in the code.
> test$freq <- as.integer(test$freq)
> test_tx <- data.table::dcast(test,eventdt_qtr ~ event_type,value.var = "freq")
> test_tx
eventdt_qtr BDSF CPBP DPA EDPM EF EPWS IF
1: 200601 47 17 NA 258 5132 96 23
2: 200602 41 16 NA 231 15929 109 24
3: 200603 72 15 2 358 48185 93 42
4: 200604 47 NA NA NA NA NA NA
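One possible workaround that keeps the integer64 class (an untested, hedged sketch; behavior may depend on the data.table version) is to pass an integer64 NA explicitly as dcast's fill value instead of relying on the default:
library(bit64)
test_tx <- data.table::dcast(test, eventdt_qtr ~ event_type,
                             value.var = "freq", fill = NA_integer64_)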

Equivalent of index - match in Excel to return greater than the lookup value

In R I need to perform a function similar to INDEX-MATCH in Excel, returning the value just greater than the lookup value.
Data Set A
Country GNI2009
Ukraine 6604
Egypt 5937
Morocco 5307
Philippines 4707
Indonesia 4148
India 3677
Viet Nam 3180
Pakistan 2760
Nigeria 2699
Data Set B
GNI2004 s1 s2 s3 s4
6649 295 33 59 3
6021 260 30 50 3
5418 226 27 42 2
4846 193 23 35 2
4311 162 20 29 2
3813 134 16 23 1
3356 109 13 19 1
2976 89 10 15 1
2578 68 7 11 0
2248 51 5 8 0
2199 48 5 8 0
For each country's GNI2009 (data set A), I would like to find the GNI2004 value that is just greater than or equal to it and return the corresponding sales values (s1, s2, ...) from that row (data set B). I would like to repeat this for each and every country's GNI2009 row in table A.
For example: Nigeria, with a GNI2009 of 2699 in data set A, would return:
GNI2004 s1 s2 s3 s4
2976 89 10 15 1
In Excel I guess this would be something like INDEX and MATCH, where the match condition would be match(lookup value, lookup array, -1).
You could try data.table's rolling join, which is designed to achieve just that:
library(data.table) # V1.9.6+
indx <- setDT(DataB)[setDT(DataA), roll = -Inf, on = c(GNI2004 = "GNI2009"), which = TRUE]
DataA[, names(DataB) := DataB[indx]]
DataA
# Country GNI2009 GNI2004 s1 s2 s3 s4
# 1: Ukraine 6604 6649 295 33 59 3
# 2: Egypt 5937 6021 260 30 50 3
# 3: Morocco 5307 5418 226 27 42 2
# 4: Philippines 4707 4846 193 23 35 2
# 5: Indonesia 4148 4311 162 20 29 2
# 6: India 3677 3813 134 16 23 1
# 7: Viet Nam 3180 3356 109 13 19 1
# 8: Pakistan 2760 2976 89 10 15 1
# 9: Nigeria 2699 2976 89 10 15 1
The idea here is, for each row in GNI2009, to find the closest equal or bigger value in GNI2004, get the row index, and subset. Then we update DataA with the result.
See here for more information.
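For reference, a rough base-R sketch of the same ceiling lookup using findInterval(), assuming no GNI2009 exceeds max(GNI2004); DataB is first sorted ascending by GNI2004, since findInterval() requires a non-decreasing vector:
sorted <- DataB[order(DataB$GNI2004), ]
idx <- findInterval(DataA$GNI2009, sorted$GNI2004, left.open = TRUE) + 1
cbind(DataA, sorted[idx, ])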
