I have a dataframe containing location data for different animals. Each animal has a unique id, and each observation has a timestamp plus some further metrics describing the location fix. See a subset of the data below; it contains the first two observations per id.
> sub
id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
2 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
3 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
4 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
5 333 B -80.8211 24.8441 11625 6980 37 2018-12-17 20:45:05
6 333 3 -80.8137 24.8263 155 100 69 2018-12-17 21:00:43
7 444 3 -80.4535 25.0848 501 33 104 2019-10-20 19:44:16
8 444 1 -80.8086 24.8364 6356 126 87 2020-01-18 20:32:28
9 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17
10 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35
11 666 2 -77.7221 24.4902 1129 75 66 2020-07-12 21:09:02
12 666 2 -77.7097 24.4905 314 248 164 2020-07-12 21:11:37
13 777 3 -77.7133 24.4820 406 58 110 2020-06-20 11:18:18
14 777 3 -77.7218 24.4844 170 93 107 2020-06-20 11:51:06
15 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
16 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
However, I need to do some data housekeeping: first, I need to include the day/time and location at which each animal was released, and after that I need to filter out, for each animal, the observations that occurred before its release.
I have an additional dataframe that contains the necessary release metadata:
> stack
id release lat lon
1 888 2017-11-27 14:53 25.69201 -79.31534
2 333 2019-01-31 16:09 25.68896 -79.31326
3 222 2019-02-02 15:55 25.70051 -79.31393
4 111 2019-04-02 10:43 25.68534 -79.31341
5 444 2020-03-13 15:04 24.42892 -77.69518
6 666 2020-10-27 09:40 24.58290 -77.69561
7 555 2020-01-21 14:38 24.43333 -77.69637
8 777 2020-06-25 08:54 24.42712 -77.76427
So my question is: how can I add the release information (time and lat/lon) to the dataframe for each id (while the columns a, b, and c can be NA)? And how can I then filter out the observations that occurred before each animal's release time? I have been looking into possibilities using dplyr but have not yet been able to resolve my issue.
You've not provided an easy way of obtaining your data (dput() is by far the best), and your two date-time columns use different formats (release uses Y-M-D H:M, whereas date uses Y-M-D H:M:S), so for clarity I've included code to obtain the data frames I use at the end of this post.
First, the solution:
library(tidyverse)
library(lubridate)
sub %>%
  left_join(stack, by = "id") %>%                # attach release metadata to every observation
  mutate(
    release = ymd_hms(paste0(release, ":00")),   # append seconds so both columns parse as Y-M-D H:M:S
    date = ymd_hms(date)
  ) %>%
  filter(date >= release)                        # keep only observations at or after release
id lc lon.x lat.x a b c date release lat.y lon.y
1 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17 2020-01-21 14:38:00 24.43333 -77.69637
2 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35 2020-01-21 14:38:00 24.43333 -77.69637
As I indicated in comments.
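The question also asks for the release itself to appear in the data (with a, b, and c as NA). A minimal sketch of that, building on the same two data frames (defined below); bind_rows() fills the columns missing from the release rows with NA:
library(tidyverse)
library(lubridate)
# one row per animal for the release event itself; the release
# moment doubles as the observation timestamp
release_rows <- stack %>%
  transmute(id, lon, lat,
            release = ymd_hm(release),
            date = ymd_hm(release))
sub %>%
  mutate(date = ymd_hms(date)) %>%
  left_join(stack %>% transmute(id, release = ymd_hm(release)), by = "id") %>%
  filter(date >= release) %>%   # drop pre-release observations
  bind_rows(release_rows) %>%   # lc, a, b, and c become NA for the release rows
  arrange(id, date)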
To obtain the data
sub <- read.table(textConnection("id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
2 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
3 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
4 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
5 333 B -80.8211 24.8441 11625 6980 37 '2018-12-17 20:45:05'
6 333 3 -80.8137 24.8263 155 100 69 '2018-12-17 21:00:43'
7 444 3 -80.4535 25.0848 501 33 104 '2019-10-20 19:44:16'
8 444 1 -80.8086 24.8364 6356 126 87 '2020-01-18 20:32:28'
9 555 3 -77.7211 24.4887 665 45 68 '2020-07-12 21:09:17'
10 555 3 -77.7163 24.4897 285 129 130 '2020-07-12 21:10:35'
11 666 2 -77.7221 24.4902 1129 75 66 '2020-07-12 21:09:02'
12 666 2 -77.7097 24.4905 314 248 164 '2020-07-12 21:11:37'
13 777 3 -77.7133 24.4820 406 58 110 '2020-06-20 11:18:18'
14 777 3 -77.7218 24.4844 170 93 107 '2020-06-20 11:51:06'
15 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'
16 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'"), header=TRUE)
stack <- read.table(textConnection("id release lat lon
1 888 '2017-11-27 14:53' 25.69201 -79.31534
2 333 '2019-01-31 16:09' 25.68896 -79.31326
3 222 '2019-02-02 15:55' 25.70051 -79.31393
4 111 '2019-04-02 10:43' 25.68534 -79.31341
5 444 '2020-03-13 15:04' 24.42892 -77.69518
6 666 '2020-10-27 09:40' 24.58290 -77.69561
7 555 '2020-01-21 14:38' 24.43333 -77.69637
8 777 '2020-06-25 08:54' 24.42712 -77.76427"), header=TRUE)
Using dplyr, I want to divide each of several columns by a matching column, where the paired columns share a naming pattern.
I have the following data frame:
My_data = data.frame(
var_a = 101:110,
var_b = 201:210,
number_a = 1:10,
number_b = 21:30)
I would like to create new variables: var_a_new = var_a/number_a, var_b_new = var_b/number_b, and so on if I also have pairs c, d, etc. Here is my attempt:
My_data %>%
  mutate_at(
    .vars = c('var_a', 'var_b'),
    .funs = list(new = function(x) x/(.[, paste0('number_a', names(x))])))
I did not get an error, but I got a wrong result. I think the problem is that I don't understand what 'x' is. Is it one of the strings in .vars? Is it a column of My_data? Something else?
One option could be:
bind_cols(My_data,
          My_data %>%
            transmute(across(starts_with("var"))/across(starts_with("number"))) %>%
            rename_all(~ paste0(., "_new")))
var_a var_b number_a number_b var_a_new var_b_new
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
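This works because across() returns a data frame, and R divides two data frames of equal dimensions elementwise, keeping the column names of the left-hand side. A minimal illustration:
# dividing two equally-sized data frames works column by column,
# and the result keeps the names of the left-hand data frame
data.frame(a = c(10, 20)) / data.frame(b = c(2, 4))
#   a
# 1 5
# 2 5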
You can do this directly, provided the columns are ordered consistently: "var_a" is the first column in the "var" group and "number_a" is the first column in the "number" group, and so on for the other pairs.
var_cols <- grep('var', names(My_data), value = TRUE)
number_cols <- grep('number', names(My_data), value = TRUE)
My_data[paste0(var_cols, '_new')] <- My_data[var_cols]/My_data[number_cols]
My_data
# var_a var_b number_a number_b var_a_new var_b_new
#1 101 201 1 21 101.00000 9.571429
#2 102 202 2 22 51.00000 9.181818
#3 103 203 3 23 34.33333 8.826087
#4 104 204 4 24 26.00000 8.500000
#5 105 205 5 25 21.00000 8.200000
#6 106 206 6 26 17.66667 7.923077
#7 107 207 7 27 15.28571 7.666667
#8 108 208 8 28 13.50000 7.428571
#9 109 209 9 29 12.11111 7.206897
#10 110 210 10 30 11.00000 7.000000
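If that ordering is not guaranteed, a small variant (a sketch assuming the var_<suffix>/number_<suffix> naming convention shown above) matches the suffixes explicitly instead of relying on position:
# derive each pair's suffix ("a", "b", ...) from the var_* names and
# build the matching number_* names, so column order no longer matters
var_cols <- grep('^var_', names(My_data), value = TRUE)
suffixes <- sub('^var_', '', var_cols)
My_data[paste0(var_cols, '_new')] <- My_data[var_cols] / My_data[paste0('number_', suffixes)]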
The function across() has replaced the scoped variants such as mutate_at(), summarize_at(), and others. For more details, see vignette("colwise") or https://cran.r-project.org/web/packages/dplyr/vignettes/colwise.html. Based on tmfmnk's answer, the following works well:
My_data %>%
  mutate(
    new = across(starts_with("var"))/across(starts_with("number")))
The prefix "new." will be added to the names of the new variables.
var_a var_b number_a number_b new.var_a new.var_b
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
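If you would rather get the "_new" suffix used in the earlier answers directly, a variant worth trying is the .names argument of across() (a sketch assuming dplyr 1.0.x, where across() without a function simply returns the selected columns, and an unnamed data-frame result inside mutate() is spliced into separate columns):
My_data %>%
  mutate(across(starts_with("var"), .names = "{.col}_new") /
           across(starts_with("number")))
# the unnamed data-frame result is auto-spliced, adding
# var_a_new and var_b_new as ordinary columns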
I have a longitudinal dataset in long form with around 2,800 rows from around 400 participants in total. Here's a sample of my data.
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 30 2 65 16
#8 1003 2 30 2 66 16
#9 1003 3 29 2 67 16
#10 1003 4 28 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
I want to create a new column "cutoff" with the values "Normal" or "Impaired", because my outcome variable, "score", has a cutoff indicating impairment relative to a norm. The norm consists of different -1.5SD measures (the cutoff points) according to Sex, Edu (years of education), and Age.
Below is what I'm currently doing: checking an Excel file myself and hardcoding the corresponding cutoff score for each combination of the three conditions. First of all, I am not sure whether I am creating the column correctly.
data$cutoff <- ifelse(data$sex == 1 & data$age < 70
                      & data$edu < 3
                      & data$score < 19.91, "Impaired", "Normal")
data$cutoff <- ifelse(data$sex == 2 & data$age < 70
                      & data$edu < 3
                      & data$score < 18.39, "Impaired", "Normal")
Additionally, I am wondering if I can import an Excel file containing the norm and create the column according to the values in it.
The Excel file is structured as shown below.
# Sex Male Female
#60-69 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
#Age Number 22 51 119 72 130 138 106 51
# Mean 24.45 26.6 27.06 27.83 23.31 25.86 27.26 28.09
# SD 3.03 1.89 1.8 1.53 3.28 2.55 1.85 1.44
# -1.5SD' 19.92 23.27 23.76 24.8 18.53 21.81 23.91 25.15
#70-79 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
....
I have created new columns "agecat" and "educat", allocating each ID to the age and education groups used in the norm. Now I want to make use of these columns by matching them against the rows and columns of the Excel file above. One of my motivations is to create code that can be reused in further research with these test scores.
I think your ifelse statements should work fine, but I would definitely import the Excel file rather than hardcoding it, though you may need to structure it a bit differently. I would structure it just like a dataset, with columns for Sex, Edu, Age, Mean, SD, -1.5SD, etc., import it into R, then do a left outer join on Sex+Edu+Age:
merge(x = long_df, y = norm_df, by = c("Sex", "Edu(yr)", "Age"), all.x = TRUE)
Then you can compare the columns directly.
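A minimal sketch of that workflow, assuming the sheet has been reshaped to one row per Sex/Edu/Age combination and saved as the hypothetical file norms.xlsx, with the -1.5SD column renamed to cutoff_score:
library(readxl)
# hypothetical file: one row per Sex x Edu(yr) x Age group,
# with columns Mean, SD, and cutoff_score (the -1.5SD value)
norm_df <- read_excel("norms.xlsx")
merged <- merge(x = long_df, y = norm_df,
                by = c("Sex", "Edu(yr)", "Age"), all.x = TRUE)
merged$cutoff <- ifelse(merged$score < merged$cutoff_score, "Impaired", "Normal")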
If I understand correctly, the OP wants to mark a certain type of outlier in the dataset. So, there are two tasks here:
Compute the statistics mean(score), sd(score), and the cutoff value mean(score) - 1.5 * sd(score) for each group of sex, age category agecat, and edu category educat.
Find all rows where score is lower than the cutoff value for the particular group.
As already mentioned by hannes101, the second step can be implemented by a non-equi join.
library(data.table)
# categorize age and edu (left closed intervals)
mydata[, c("agecat", "educat") := .(cut(age, c(seq(0, 90, 10), Inf), right = FALSE),
cut(edu, c(0, 4, 7, 13, Inf), right = FALSE))][]
# compute statistics
cutoffs <- mydata[, .(.N, Mean = mean(score), SD = sd(score),
m1.5SD = mean(score) - 1.5 * sd(score)),
by = .(sex, agecat, educat)]
# non-equi update join
mydata[, cutoff := "Normal"]
mydata[cutoffs, on = .(sex, agecat, educat, score < m1.5SD), cutoff := "Impaired"][]
mydata
ID wave score sex age edu agecat educat cutoff
1: 1001 1 28 1 69 12 [60,70) [7,13) Normal
2: 1001 2 27 1 70 12 [70,80) [7,13) Normal
3: 1001 3 28 1 71 12 [70,80) [7,13) Normal
4: 1001 4 26 1 72 12 [70,80) [7,13) Normal
5: 1002 1 30 2 78 9 [70,80) [7,13) Normal
6: 1002 3 30 2 80 9 [80,90) [7,13) Normal
7: 1003 1 33 2 65 16 [60,70) [13,Inf) Normal
8: 1003 2 32 2 66 16 [60,70) [13,Inf) Normal
9: 1003 3 31 2 67 16 [60,70) [13,Inf) Normal
10: 1003 4 24 2 68 16 [60,70) [13,Inf) Impaired
11: 1004 1 22 2 85 4 [80,90) [4,7) Normal
12: 1005 1 20 2 60 9 [60,70) [7,13) Normal
13: 1005 2 18 1 61 9 [60,70) [7,13) Normal
14: 1006 1 22 1 74 9 [70,80) [7,13) Normal
15: 1006 2 23 1 75 9 [70,80) [7,13) Normal
16: 1006 3 25 1 76 9 [70,80) [7,13) Normal
17: 1006 4 19 1 77 9 [70,80) [7,13) Normal
18: 1007 1 33 2 65 16 [60,70) [13,Inf) Normal
19: 1007 2 32 2 66 16 [60,70) [13,Inf) Normal
20: 1007 3 31 2 67 16 [60,70) [13,Inf) Normal
21: 1007 4 31 2 68 16 [60,70) [13,Inf) Normal
ID wave score sex age edu agecat educat cutoff
In this made-up example there is only one row which meets the "Impaired" conditions.
Likewise, the statistics are rather sparsely populated:
cutoffs
sex agecat educat N Mean SD m1.5SD
1: 1 [60,70) [7,13) 2 23.00000 7.071068 12.39340
2: 1 [70,80) [7,13) 7 24.28571 3.147183 19.56494
3: 2 [70,80) [7,13) 1 30.00000 NA NA
4: 2 [80,90) [7,13) 1 30.00000 NA NA
5: 2 [60,70) [13,Inf) 8 30.87500 2.900123 26.52482
6: 2 [80,90) [4,7) 1 22.00000 NA NA
7: 2 [60,70) [7,13) 1 20.00000 NA NA
Data
OP's sample dataset has been modified in one group for demonstration.
library(data.table)
mydata <- fread("
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 33 2 65 16
#8 1003 2 32 2 66 16
#9 1003 3 31 2 67 16
#10 1003 4 24 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
#18 1007 1 33 2 65 16
#19 1007 2 32 2 66 16
#20 1007 3 31 2 67 16
#21 1007 4 31 2 68 16
", drop = 1L)
I have a data.table named test where the 'freq' column is of class integer64, as follows:
>test
event_type eventdt_qtr freq
1: IF 200601 23
2: BDSF 200601 47
3: EDPM 200601 258
4: CPBP 200601 17
5: EF 200601 5132
6: EPWS 200601 96
7: EF 200602 15929
8: IF 200602 24
9: BDSF 200602 41
10: CPBP 200602 16
11: EDPM 200602 231
12: EPWS 200602 109
13: CPBP 200603 15
14: IF 200603 42
15: EDPM 200603 358
16: BDSF 200603 72
17: EPWS 200603 93
18: DPA 200603 2
19: EF 200603 48185
20: BDSF 200604 47
> str(test)
Classes ‘data.table’ and 'data.frame': 297 obs. of 3 variables:
 $ event_type : chr "IF" "BDSF" "EDPM" "CPBP" ...
 $ eventdt_qtr: num 200601 200601 200601 200601 200601 ...
 $ freq       : integer64 23 47 258 17 5132 96 15929 24 ...
When I do a data.table::dcast() operation on test, instead of getting NA or 0 for non-existing event_type and eventdt_qtr combinations, I get the value 9218868437227407266. I believe this issue might be related to Merge (outer join) fails to set NAs for integer64, but I don't know how to fix it.
> test_tx <- data.table::dcast(test,eventdt_qtr ~ event_type,value.var = "freq")
> test_tx
eventdt_qtr BDSF CPBP DPA EDPM EF
1: 200601 47 17 9218868437227407266 258 5132
2: 200602 41 16 9218868437227407266 231 15929
3: 200603 72 15 2 358 48185
4: 200604 47 9218868437227407266 9218868437227407266 9218868437227407266 9218868437227407266
EPWS IF
1: 96 23
2: 109 24
3: 93 42
4: 9218868437227407266 9218868437227407266
Note: I know there is a fix (see solution below) if I do explicit conversion of column 'freq' to integer instead of integer64, but I don't want to do any type conversions as I am reading the data from Redshift and have to preserve the structure of the data.table 'test' downstream in the code.
> test$freq <- as.integer(test$freq)
> test_tx <- data.table::dcast(test,eventdt_qtr ~ event_type,value.var = "freq")
> test_tx
eventdt_qtr BDSF CPBP DPA EDPM EF EPWS IF
1: 200601 47 17 NA 258 5132 96 23
2: 200602 41 16 NA 231 15929 109 24
3: 200603 72 15 2 358 48185 93 42
4: 200604 47 NA NA NA NA NA NA
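If the integer64 class must be preserved, a possible workaround (a sketch, untested against this exact data, assuming the bit64 package is attached) is to hand dcast() an integer64 NA as the fill value, so that missing cells are not filled with a logical NA whose bit pattern later gets reinterpreted:
library(bit64)  # provides NA_integer64_
# fill missing event_type/eventdt_qtr combinations with a true integer64 NA
test_tx <- data.table::dcast(test, eventdt_qtr ~ event_type,
                             value.var = "freq", fill = NA_integer64_)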
In R I need to perform something similar to INDEX/MATCH in Excel, returning the row whose value is just greater than the lookup value.
Data Set A
Country GNI2009
Ukraine 6604
Egypt 5937
Morocco 5307
Philippines 4707
Indonesia 4148
India 3677
Viet Nam 3180
Pakistan 2760
Nigeria 2699
Data Set B
GNI2004 s1 s2 s3 s4
6649 295 33 59 3
6021 260 30 50 3
5418 226 27 42 2
4846 193 23 35 2
4311 162 20 29 2
3813 134 16 23 1
3356 109 13 19 1
2976 89 10 15 1
2578 68 7 11 0
2248 51 5 8 0
2199 48 5 8 0
For each country's GNI2009 (data set A), I would like to find the GNI2004 value (data set B) that is just greater than or equal to it, and then return the corresponding sales values (s1, s2, ...) from that row. I would like to repeat this for every Country/GNI2009 row in table A.
For example: Nigeria, with a GNI2009 of 2699 in data set A, would return:
GNI2004 s1 s2 s3 s4
2976 89 10 15 1
In Excel I guess this would be something like INDEX and MATCH, where the match condition would be MATCH(lookup value, lookup array, -1).
You could try data.table's rolling join, which is designed to achieve just that:
library(data.table) # V1.9.6+
indx <- setDT(DataB)[setDT(DataA), roll = -Inf, on = c(GNI2004 = "GNI2009"), which = TRUE]
DataA[, names(DataB) := DataB[indx]]
DataA
# Country GNI2009 GNI2004 s1 s2 s3 s4
# 1: Ukraine 6604 6649 295 33 59 3
# 2: Egypt 5937 6021 260 30 50 3
# 3: Morocco 5307 5418 226 27 42 2
# 4: Philippines 4707 4846 193 23 35 2
# 5: Indonesia 4148 4311 162 20 29 2
# 6: India 3677 3813 134 16 23 1
# 7: Viet Nam 3180 3356 109 13 19 1
# 8: Pakistan 2760 2976 89 10 15 1
# 9: Nigeria 2699 2976 89 10 15 1
The idea here is, for each row's GNI2009 value, to find the closest equal-or-greater value in GNI2004 (roll = -Inf), get the row index (which = TRUE), and subset. Then we update DataA with the result by reference.
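To obtain the data (following the same textConnection() approach used earlier in this post):
DataA <- read.table(textConnection("Country GNI2009
Ukraine 6604
Egypt 5937
Morocco 5307
Philippines 4707
Indonesia 4148
India 3677
'Viet Nam' 3180
Pakistan 2760
Nigeria 2699"), header = TRUE)
DataB <- read.table(textConnection("GNI2004 s1 s2 s3 s4
6649 295 33 59 3
6021 260 30 50 3
5418 226 27 42 2
4846 193 23 35 2
4311 162 20 29 2
3813 134 16 23 1
3356 109 13 19 1
2976 89 10 15 1
2578 68 7 11 0
2248 51 5 8 0
2199 48 5 8 0"), header = TRUE)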