add rows to data frame for non-observations - r

I have a dataframe that summarizes the number of times birds were observed at their breeding site on each day and each hour during daytime (i.e., when the sun was above the horizon). Example:
head(df)
ID site day hr nObs
1 19 A 202 11 60
2 19 A 202 13 18
3 19 A 202 15 27
4 8 B 188 8 6
5 8 B 188 9 6
6 8 B 188 11 7
However, this dataframe does not include hours when the bird was not observed. E.g., there is no row for bird 19 on day 202 at hour 14 with an nObs value of 0.
I'd like to find a way, preferably with dplyr (tidyverse), to add rows for the hours when individuals were not observed.

You can use complete from tidyr, i.e.
library(tidyverse)
df %>%
group_by(ID, site) %>%
complete(hr = seq(min(hr), max(hr)))
which gives,
# A tibble: 9 x 5
# Groups: ID, site [2]
ID site hr day nObs
<int> <fct> <int> <int> <int>
1 8 B 8 188 6
2 8 B 9 188 6
3 8 B 10 NA NA
4 8 B 11 188 7
5 19 A 11 202 60
6 19 A 12 NA NA
7 19 A 13 202 18
8 19 A 14 NA NA
9 19 A 15 202 27
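If the goal is an explicit nObs of 0 rather than NA, one possible variation is to group by day as well and pass fill to complete. This is a sketch rather than the answer's exact code: the data frame is re-typed from the question, and grouping by day is an assumption about the desired result.

```r
library(dplyr)
library(tidyr)

# Example data re-typed from the question
df <- data.frame(
  ID   = c(19, 19, 19, 8, 8, 8),
  site = c("A", "A", "A", "B", "B", "B"),
  day  = c(202, 202, 202, 188, 188, 188),
  hr   = c(11, 13, 15, 8, 9, 11),
  nObs = c(60, 18, 27, 6, 6, 7)
)

filled <- df %>%
  # grouping by day too keeps day filled in on the added rows (assumption)
  group_by(ID, site, day) %>%
  # full_seq() spans min(hr)..max(hr) per group; missing hours get nObs = 0
  complete(hr = full_seq(hr, 1), fill = list(nObs = 0)) %>%
  ungroup()

filled
```

Only hours between each bird's first and last observed hour are added; hours outside that range would need an explicit hour vector instead of full_seq().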

One way to do this would be to first build a "template" of all possible combinations where birds can be observed and then merge ("left join") the actual observations onto that template:
a = read.table(text = " ID site day hr nObs
1 19 A 202 11 60
2 19 A 202 13 18
3 19 A 202 15 27
4 8 B 188 8 6
5 8 B 188 9 6
6 8 B 188 11 7")
tpl <- expand.grid(c(unique(a[, 1:3]), list(hr = 1:24)))
merge(tpl, a, all.x = TRUE)
Edit based on comment by #user3220999: in case we want to do the process per ID, we can just use split to get a list of data.frames per ID, get a list of templates and mapply merge on the two lists:
a <- split(a, a$ID)
tpl <- lapply(a, function(ai) {
expand.grid(c(unique(ai[, 1:3]), list(hr = 1:24)))
})
res <- mapply(merge, tpl, a, SIMPLIFY = FALSE, MoreArgs = list(all.x = TRUE))
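As a possible follow-up to the template approach, the NA counts left by the left join can be turned into zeros. The sketch below is base R only and builds the template from the observed ID/site/day combinations, so it never pairs a bird with a site it was not seen at. The hour range 1:24 is taken from the answer above; the real set of daylight hours is not given in the question.

```r
a <- read.table(header = TRUE, text = "
ID site day hr nObs
19 A 202 11 60
19 A 202 13 18
19 A 202 15 27
8  B 188  8  6
8  B 188  9  6
8  B 188 11  7")

# One row per observed ID/site/day combination
keys <- unique(a[, c("ID", "site", "day")])

# merge() with no shared columns returns the Cartesian product,
# i.e. every key combined with every hour
tpl <- merge(keys, data.frame(hr = 1:24))

# Left join the observations onto the template, then set missing counts to 0
res <- merge(tpl, a, all.x = TRUE)
res$nObs[is.na(res$nObs)] <- 0

head(res[order(res$ID, res$hr), ])
```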


How to keep grouped variables together in training and test data

I'm making and testing the accuracy of age extrapolations from growth measurements and to do this I have to split my data into my training and test data.
The issue is that individuals in my data set were measured multiple times: sometimes twice, sometimes three times. In the dataset, Bird is the individual chick, Age is the age at measurement, and Wing is the measurement value.
I've tried using the group_by function to keep their measurements together, but this doesn't seem to work. I also tried nesting the data but that puts the data in a new table and my code doesn't like that. Is there another way I could keep the groups together while still randomly assigning them to training and test data?
library('tidyverse')
library("ggplot2")
library("readxl")
library("writexl")
library('dplyr')
library('Rmisc')
library('cowplot')
library('purrr')
library('caTools')
library('MLmetrics')
Bird<-c(1,1,1,2,2,3,3,3,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10)
Age<-c(10,17,27,17,28,10,17,27,10,17,10,17,28,10,17,28,10,17,28,10,17,28,10,17,28,11,18)
Wing<-c(39,63,98,61,99,34,48,80,30,37,35,51,71,40,55,79,34,47,77,36,55,84,35,55,88,36,59)
Set14<-data.frame(Bird, Age, Wing) %>%
group_by(Bird)
Set14$Bird<-as.factor((Set14$Bird))
Set14
sample_size = floor(0.7*nrow(Set14))
picked = sample(seq_len(nrow(Set14)),size = sample_size)
Training =Set14[picked,]
Training
Test =Set14[-picked,]
Test
trm<-lm(Age~Wing, data=Training)
predval<-predict(object=trm,
newdata=Test)
predval
error<-data.frame(actual=Test$Age, calculated=predval)
error
MAPE(error$actual, error$calculated)
In base R you could do:
a <- as.integer(Set14$Bird)
# sample 70% of the bird IDs, then flag every row that belongs to a sampled bird
train_index <- a %in% sample(n <- length(unique(a)), 0.7*n)
train <- Set14[train_index, ]
test <- Set14[!train_index, ]
In tidyverse:
ungroup(Set14) %>%
nest_by(Bird) %>%
ungroup() %>%
mutate(tt = floor(.7*n()),
tt = sample(rep(c('train', 'test'), c(tt[1], n()-tt[1])))) %>%
unnest(data) %>%
group_split(tt, .keep = FALSE)
[[1]]
# A tibble: 9 x 3
Bird Age Wing
<fct> <dbl> <dbl>
1 1 10 39
2 1 17 63
3 1 27 98
4 3 10 34
5 3 17 48
6 3 27 80
7 7 10 34
8 7 17 47
9 7 28 77
[[2]]
# A tibble: 18 x 3
Bird Age Wing
<fct> <dbl> <dbl>
1 2 17 61
2 2 28 99
3 4 10 30
4 4 17 37
5 5 10 35
6 5 17 51
7 5 28 71
8 6 10 40
9 6 17 55
10 6 28 79
11 8 10 36
12 8 17 55
13 8 28 84
14 9 10 35
15 9 17 55
16 9 28 88
17 10 11 36
18 10 18 59
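Whichever approach is used, it is worth checking that no bird ends up in both sets. A minimal base R sketch of a grouped split with that check follows. The seed value and the object names train_birds, train, and test are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

Set14 <- data.frame(
  Bird = factor(c(1,1,1,2,2,3,3,3,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10)),
  Age  = c(10,17,27,17,28,10,17,27,10,17,10,17,28,10,17,28,10,17,28,10,17,28,10,17,28,11,18),
  Wing = c(39,63,98,61,99,34,48,80,30,37,35,51,71,40,55,79,34,47,77,36,55,84,35,55,88,36,59)
)

# Sample birds, not rows, so all measurements of a bird stay together
birds <- as.character(unique(Set14$Bird))
train_birds <- sample(birds, size = floor(0.7 * length(birds)))

train <- Set14[Set14$Bird %in% train_birds, ]
test  <- Set14[!Set14$Bird %in% train_birds, ]

# Sanity check: the two sets share no bird
stopifnot(length(intersect(as.character(train$Bird),
                           as.character(test$Bird))) == 0)
```

Because the split is 70% of birds rather than 70% of rows, the row proportions can drift from 70/30 when birds have different numbers of measurements.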

Taking rolling differences of columns in R tibble for arbitrary number of columns

I want to take differences for each pair of consecutive columns but for an arbitrary number of columns. For example...
df <- as.tibble(data.frame(group = rep(c("a", "b", "c"), each = 4),
subgroup = rep(c("adam", "boy", "charles", "david"), times = 3),
iter1 = 1:12,
iter2 = c(13:22, NA, 24),
iter3 = c(25:35, NA)))
I want to calculate the differences by column. I would normally use...
df %>%
mutate(diff_iter2 = iter2 - iter1,
diff_iter3 = iter3 - iter2)
But I'd like to:
1. accommodate an arbitrary number of columns, and
2. treat NAs such that:
   - if the number we're subtracting from is NA, then the result should be NA, e.g. NA - 11 = NA;
   - if the number we're subtracting is NA, then that NA is effectively treated as a 0, e.g. 35 - NA = 35.
The result should look like this...
group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
<chr> <chr> <int> <dbl> <int> <dbl> <dbl>
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
Originally, this df was in long format, but I believe the lag() function operates on position within groups, and the groups aren't all the same because some have missing records (hence the NAs in the wider table shown above).
Starting from long format would be fine too, but then please assume that the records shown above with NA values do not exist in that longer dataframe.
Any help is appreciated.
An option in tidyverse: loop across the 'iter' columns other than iter1. For each one, build the name of the previous column by subtracting 1 from the number in the current column name (cur_column()) with str_replace, and fetch that column with get(). Replace its NA elements with 0 (replace_na), following the OP's logic, and subtract it from the current column. The new columns get a prefix via .names ("diff_{.col}", where {.col} is the original column name).
library(dplyr)
library(stringr)
library(tidyr)
df <- df %>%
mutate(across(iter2:iter3, ~
. - replace_na(get(str_replace(cur_column(), '\\d+',
function(x) as.numeric(x) - 1)), 0), .names = 'diff_{.col}'))
-output
df
# A tibble: 12 × 7
group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
<chr> <chr> <int> <dbl> <int> <dbl> <dbl>
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
Find the indices of the columns whose names start with iter (ix). Take all but the first of those columns as df1 and all but the last as df2, and replace the NAs in df2 with 0. Then subtract df2 from df1 and cbind the result to df. No packages are used.
ix <- grep("^iter", names(df))
df1 <- df[tail(ix, -1)]
df2 <- df[head(ix, -1)]
df2[is.na(df2)] <- 0
cbind(df, diff = df1 - df2)
giving:
group subgroup iter1 iter2 iter3 diff.iter2 diff.iter3
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
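The NA rule from the question can also be written as a small helper and applied to each pair of consecutive columns. The sketch below is base R; the helper name diff_prev is made up for illustration, and only the iter columns of the example data are reproduced.

```r
df <- data.frame(
  iter1 = 1:12,
  iter2 = c(13:22, NA, 24),
  iter3 = c(25:35, NA)
)

# Hypothetical helper: an NA in the previous column counts as 0,
# while an NA in the current column propagates to the result
diff_prev <- function(curr, prev) curr - ifelse(is.na(prev), 0, prev)

iter_cols <- grep("^iter", names(df), value = TRUE)
curr_cols <- iter_cols[-1]                  # iter2, iter3, ...
prev_cols <- iter_cols[-length(iter_cols)]  # iter1, iter2, ...

# Apply the helper pairwise to (current, previous) columns
diffs <- Map(diff_prev, df[curr_cols], df[prev_cols])
names(diffs) <- paste0("diff_", curr_cols)

out <- cbind(df, as.data.frame(diffs))
out
```

This relies on the iter columns already being in the right order in the data frame; if they are not, iter_cols would need to be sorted by its numeric suffix first.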

How to use the summarise function to create a summary in R using dplyr package?

I have the following table which represents a child, his siblings and the case they are assigned under. The resource ids represent the house where they were placed together.
child_id|sibling_id|case_id|resource_id|placed_together
1 8 123 12856 Y
1 9 123 12856 Y
3 11 321 12555 N
4 12 323 10987 N
4 13 323 10956 N
6 14 156 10554 N
6 15 156 10554 N
10 16 156 10553 N
10 17 145 18986 Y
10 18 145 18986 Y
I want to create a summary which shows the total count for those children that were placed together based on their case_ids and those who were not. So my result should look like this
Total Groups|sibling placed together|siblings not placed together
5 2 3
Any help would be appreciated. I have tried to use the summarise function, but that gives me a total for each case id separately.
I'm inferring that your logic is 'any "Y" in placed_together', since child_id 10 has one "N" and two "Y" for sibling placement.
library(dplyr)
dat %>%
group_by(child_id) %>%
summarize(tog = "Y" %in% unique(placed_together)) %>%
ungroup() %>%
summarize(TotalGroups = n(), Together = sum(tog), NotTogether = sum(!tog))
# # A tibble: 1 x 3
# TotalGroups Together NotTogether
# <int> <int> <int>
# 1 5 2 3
Data
dat <- read.table(header=T, text="
child_id sibling_id case_id resource_id placed_together
1 8 123 12856 Y
1 9 123 12856 Y
3 11 321 12555 N
4 12 323 10987 N
4 13 323 10956 N
6 14 156 10554 N
6 15 156 10554 N
10 16 156 10553 N
10 17 145 18986 Y
10 18 145 18986 Y")
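For comparison, the same "any Y per child" logic can be sketched in base R with tapply. This assumes, as the answer above does, that the groups are defined by child_id; the data are re-typed from the question.

```r
dat <- data.frame(
  child_id        = c(1, 1, 3, 4, 4, 6, 6, 10, 10, 10),
  sibling_id      = c(8, 9, 11, 12, 13, 14, 15, 16, 17, 18),
  case_id         = c(123, 123, 321, 323, 323, 156, 156, 156, 145, 145),
  resource_id     = c(12856, 12856, 12555, 10987, 10956, 10554, 10554,
                      10553, 18986, 18986),
  placed_together = c("Y", "Y", "N", "N", "N", "N", "N", "N", "Y", "Y")
)

# One logical per child: TRUE if any of the child's rows is "Y"
tog <- tapply(dat$placed_together == "Y", dat$child_id, any)

summary_tbl <- data.frame(
  TotalGroups = length(tog),
  Together    = sum(tog),
  NotTogether = sum(!tog)
)
summary_tbl
```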

5 nearest neighbors based on given distance in r

I have the following dataset:
id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22
I want to find the 5 closest neighbors for each id, based on the distance calculated from the given (x, y) coordinates. The final output should look like the following:
id n_id dist age age_n_id
1 2 2 30 23
1 5 1.5 30 2
1 3 5 30 10
1 7 3 30 7
1 8 3 30 55
2 1 6 23 30
2 10 1 23 22
2 6 2 23 66
2 7 6 23 7
2 8 9 23 55
3 2 1 10 23
3 1 2 10 30
3 4 1.2 10 5
3 6 1.6 10 66
3 9 2.3 10 26
................................
................................
10 2 1.9 22 23
10 6 2.3 22 66
10 9 2.1 22 26
10 1 2.5 22 30
10 5 1.6 22 2
where n_id is the id of the neighbor, dist is the straight-line distance between id and n_id, age is the age of the id, and age_n_id is the age of the n_id. Also, the maximum distance would be 10 km. If there are fewer than 5 neighbors within 10 km, say 3 neighbors, the corresponding id would be repeated only three times.
I am relatively new to R programming, and any help would be much appreciated.
data.table solution:
library(data.table)
data<-fread("id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22")
data[,all_x:=list(list(x))]
data[,all_y:=list(list(y))]
data[,all_age:=list(list(age))]
data[,seq_nr:=seq_len(.N)]
#Distance formula:
formula_distance<-function(x_1,x_2,y_1,y_2,z){
x_2<-x_2[[1]][-z]
y_2<-y_2[[1]][-z]
sqrt((x_1-x_2)^2+(y_1-y_2)^2)
}
data<-data[,{list(dist = formula_distance(x,all_x,y,all_y,seq_nr),
id =seq(1:nrow(data))[-id],
age_id=all_age[[1]][-id],
age=rep(age,nrow(data)-1))},by=1:nrow(data)]
data<-data[order(nrow,dist)]
#Filter data within threshold:
threshold<-1000
#How many nearest neighbors to take:
k<-5
filtered<-data[dist<=threshold]
filtered<-filtered[,{list(dist=dist[1:k],n_id=id[1:k],n_age=age_id[1:k])},by=c("nrow","age")]
filtered<-filtered[!is.na(dist)]
setnames(filtered,"nrow","id")
filtered
id age dist n_id n_age
1: 1 30 25.37893 4 5
2: 1 30 52.27055 5 2
3: 1 30 69.37211 2 23
4: 1 30 213.41050 3 10
5: 2 23 26.31045 5 2
6: 2 23 48.08326 4 5
7: 2 23 69.37211 1 30
8: 2 23 147.12665 3 10
9: 3 10 147.12665 2 23
10: 3 10 172.11243 5 2
11: 3 10 194.93653 4 5
12: 3 10 213.41050 1 30
13: 4 5 25.37893 1 30
14: 4 5 27.32471 5 2
15: 4 5 48.08326 2 23
16: 4 5 194.93653 3 10
17: 5 2 26.31045 2 23
18: 5 2 27.32471 4 5
19: 5 2 52.27055 1 30
20: 5 2 172.11243 3 10
21: 6 66 67.84106 9 26
22: 6 66 209.88273 7 7
23: 7 7 180.54432 9 26
24: 7 7 209.88273 6 66
25: 8 55 805.91482 10 22
26: 9 26 67.84106 6 66
27: 9 26 180.54432 7 7
28: 10 22 805.91482 8 55
This assumes that the coordinates are in meters.
# Load packages
library(FNN)
library(tidyverse)
library(data.table)
# Create example data frame
dataset <- fread("id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22")
# Calculate the nearest ID and distance
near_data <- get.knn(dataset[, 2:3], k = 5)
# Extract the nearest ID
nn_index <- as.data.frame(near_data$nn.index)
# Extract the nearest Distance
nn_dist <- as.data.frame(near_data$nn.dist)
# Re organize the data
nn_index2 <- nn_index %>%
# Add ID column
mutate(ID = 1:10) %>%
# Transform the data frame
gather(Rank, n_id, -ID)
nn_dist2 <- nn_dist %>%
# Add ID column
mutate(ID = 1:10) %>%
# Transform the data frame
gather(Rank, dist, -ID)
# Remove coordinates in dataset
dataset2 <- dataset %>% select(-x, -y)
# Create the final output
nn_final <- nn_index2 %>%
# Merge nn_index2 and nn_dist2
left_join(nn_dist2, by = c("ID", "Rank")) %>%
# Merge with dataset2 by ID and id
left_join(dataset2, by = c("ID" = "id")) %>%
# Merge with dataset2 by n_id and id
left_join(dataset2, by = c("n_id" = "id")) %>%
# Remove Rank
select(-Rank) %>%
# Rename column names
rename(id = ID, age = age.x, age_n_id = age.y) %>%
# Sort the data frame
arrange(id, dist) %>%
# Filter the dist < 10000 meters
filter(dist < 10000)
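For small data sets, the same result can be sketched in base R from a full distance matrix. Like the answer above, this assumes the coordinates are in meters, so 10 km is 10000. The names k, max_dist, and res are illustrative. Since dist() builds an n x n matrix, this is only practical for modest n.

```r
pts <- data.frame(
  id  = 1:10,
  x   = c(1745353, 1745317, 1745201, 1745351, 1745342,
          1746619, 1746465, 1746731, 1746629, 1745938),
  y   = c(930284.1, 930343.4, 930433.9, 930309.4, 930335.2,
          929969.7, 929827.1, 928779.5, 929902.6, 928923.2),
  age = c(30, 23, 10, 5, 2, 66, 7, 55, 26, 22)
)

k <- 5             # number of neighbors to keep
max_dist <- 10000  # 10 km, assuming coordinates in meters

# Full Euclidean distance matrix; NA on the diagonal excludes self-matches
d <- as.matrix(dist(pts[, c("x", "y")]))
diag(d) <- NA

res <- do.call(rbind, lapply(seq_len(nrow(pts)), function(i) {
  ord <- order(d[i, ], na.last = NA)           # nearest first, self dropped
  ord <- head(ord[d[i, ord] <= max_dist], k)   # within threshold, at most k
  data.frame(
    id       = rep(pts$id[i], length(ord)),
    n_id     = pts$id[ord],
    dist     = unname(d[i, ord]),
    age      = rep(pts$age[i], length(ord)),
    age_n_id = pts$age[ord]
  )
}))
rownames(res) <- NULL

head(res)
```

An id with fewer than k neighbors inside max_dist simply contributes fewer rows, which matches the behavior requested in the question.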

R: subset a data frame based on conditions from another data frame

Here is a problem I am trying to solve. Say, I have two data frames like the following:
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
id = rep(1:4,5),
time1 = rep(c(2,7,12,17,22), each=4),
time2 = rep(c(4,9,14,19,24), each=4))
They both contain a column named id, which links the data frames. I want the measurements from observations for which time is between time1 and time2 from the sampletimes data frame. Additionally, I'd like to attach the appropriate location to each measurement.
I have successfully done this by converting my sampletimes to a wide format (i.e. all the time1 and time2 information in one row per entry for id), merging the two data frames by the id variable, and using conditional statements to take only instances when the time falls between at least one of the time intervals in the row, and then assigning location to the appropriate measurement.
However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.
For our example, the data frame I would hope to get back would be as follows:
id time measurement letters[1:20]
1 3 10.5163892 a
2 3 5.5774119 b
3 3 10.5057060 c
4 3 14.1563179 d
1 8 2.2653761 e
2 8 -1.0905546 f
3 8 12.7434161 g
4 8 17.6129261 h
1 13 10.9234673 i
2 13 1.6974481 j
3 13 -0.3664951 k
4 13 13.8792198 l
1 18 6.5038847 m
2 18 1.2032935 n
3 18 15.0889469 o
4 18 0.8934357 p
1 23 3.6864527 q
2 23 0.2404074 r
3 23 11.6028766 s
4 23 20.7466908 t
Here's a proposal with merge:
# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
# extract valid rows
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]
The result:
id time measurement location
11 1 3 7.086246 a
141 2 3 6.893162 b
251 3 3 16.052627 c
376 4 3 -6.559494 d
47 1 8 11.506810 e
137 2 8 10.959782 f
267 3 8 11.079759 g
402 4 8 11.082015 h
83 1 13 5.584257 i
218 2 13 -1.714845 j
283 3 13 -11.196792 k
418 4 13 8.887907 l
99 1 18 1.656558 m
234 2 18 16.573179 n
364 3 18 6.522298 o
454 4 18 1.005123 p
125 1 23 -1.995719 q
250 2 23 -6.676464 r
360 3 23 10.514282 s
490 4 23 3.863357 t
Not efficient, but it does the job:
subset(merge(observations,sampletimes), time > time1 & time < time2)
id time measurement location time1 time2
11 1 3 3.180321 a 2 4
47 1 8 6.040612 e 7 9
83 1 13 -5.999317 i 12 14
99 1 18 2.689414 m 17 19
125 1 23 12.514722 q 22 24
137 2 8 4.420679 f 7 9
141 2 3 11.492446 b 2 4
218 2 13 6.672506 j 12 14
234 2 18 12.290339 n 17 19
250 2 23 12.610828 r 22 24
251 3 3 8.570984 c 2 4
267 3 8 -7.112291 g 7 9
283 3 13 6.287598 k 12 14
360 3 23 11.941846 s 22 24
364 3 18 -4.199001 o 17 19
376 4 3 7.133370 d 2 4
402 4 8 13.477790 h 7 9
418 4 13 3.967293 l 12 14
454 4 18 12.845535 p 17 19
490 4 23 -1.016839 t 22 24
EDIT
Since you have more than 5 million rows, you should try a data.table solution:
library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]
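The merge above still materializes every id match before filtering. A possible alternative is a data.table non-equi join, which applies the interval condition during the join itself. The sketch below rebuilds the example data with integer columns. The seed is arbitrary, and nomatch = NULL assumes a reasonably recent data.table version.

```r
library(data.table)

set.seed(1)  # arbitrary seed so the random measurements are reproducible
OBS <- data.table(
  id = rep(rep(1:4, each = 5), 5),
  time = c(rep(1:5, 4), rep(6:10, 4), rep(11:15, 4),
           rep(16:20, 4), rep(21:25, 4)),
  measurement = rnorm(100, 5, 7)
)
SAM <- data.table(
  location = letters[1:20],
  id = rep(1:4, 5),
  time1 = rep(c(2L, 7L, 12L, 17L, 22L), each = 4),
  time2 = rep(c(4L, 9L, 14L, 19L, 24L), each = 4)
)

# Non-equi join: match on id and keep observations strictly inside
# (time1, time2). x.time is the observation's own time; without the
# x. prefix the join column would show the interval bound instead.
res <- OBS[SAM,
           .(id, time = x.time, measurement, location),
           on = .(id, time > time1, time < time2),
           nomatch = NULL]

res[order(time, id)]
```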
