Conditional selection of repeated measures from data frame - r

I have data with repeated measurements on each subject (id) at a variable number of timepoints. I would like to retain two rows for each subject: the row with timepoint == 0 and the row with the timepoint closest to 4. In the case of two candidate timepoints equally distant from 4, e.g. (3, 5), I want to choose the lower one (3).
As shown in the 'choice' column of the image below, rows with "x" would not be retained.
dat <- structure(list(id = c(172507L, 172507L, 172507L, 172525L, 172525L,
172525L, 172526L, 172526L, 172526L, 172527L, 172527L, 172527L,
172527L, 172527L), timepoint = c(0L, 2L, 6L, 0L, 4L, 5L, 0L,
5L, 2L, 2L, 3L, 5L, 6L, 0L)), class = "data.frame", row.names = c(NA,
-14L))

We could arrange by id and timepoint and, for every group, select the first occurrence of timepoint == 0 and the row with the minimum absolute value of 4 - timepoint. Since we have arranged by timepoint, which.min will select the first, i.e. lower, timepoint in case of a tie.
library(dplyr)
dat %>%
  arrange(id, timepoint) %>%
  group_by(id) %>%
  slice(c(which.max(timepoint == 0), which.min(abs(4 - timepoint))))
#       id timepoint
#    <int>     <int>
# 1 172507         0
# 2 172507         2
# 3 172525         0
# 4 172525         4
# 5 172526         0
# 6 172526         5
# 7 172527         0
# 8 172527         3

You can do something like this: arranging by the distance and then by the timepoint puts the smallest of the closest timepoints first within each group. Then you can use the first() function to grab that value, while also keeping the rows where the timepoint is zero.
library(tidyverse)
dat %>%
  mutate(dist = abs(4 - timepoint)) %>%
  arrange(id, dist, timepoint) %>%
  group_by(id) %>%
  filter(timepoint %in% c(0, first(timepoint))) %>%
  ungroup() %>%
  arrange(id, timepoint)

Here's the data.table solution. which.max(timepoint == 0) relies on the assumption that each ID has a timepoint of 0; if that may not hold, use match(TRUE, timepoint == 0) instead, as done below. Credit to Ronak Shah for the which.min approach.
Edit: Changed to match(TRUE, timepoint == 0) and fixed an issue in base R approach.
library(data.table)
dt <- as.data.table(dat)
dt[order(timepoint),
   .SD[c(match(TRUE, timepoint == 0), which.min(abs(4 - timepoint)))],
   by = id]
For kicks, here's base R:
# group the timepoint-ordered data by id, then pick the timepoint == 0 row
# and the row closest to 4 within each id
do.call(rbind,
        by(dat[order(dat$timepoint), ],
           dat$id[order(dat$timepoint)],
           function(x) x[c(match(TRUE, x$timepoint == 0),
                           which.min(abs(4 - x$timepoint))), ]))

Something like this should work:
zeros <- dat %>%
  filter(timepoint == 0) %>%
  transmute(id, timepoint)

nonzeros <- dat %>%
  filter(timepoint != 0) %>%
  mutate(diff = abs(timepoint - 4)) %>%
  group_by(id) %>%
  filter(diff == min(diff)) %>%
  arrange(timepoint) %>%
  slice(1) %>%
  ungroup() %>%
  transmute(id, timepoint)

df <- bind_rows(zeros, nonzeros) %>%
  arrange(id, timepoint)
There is probably a way to do this in one pipe but I had an easier time visualizing what's going on this way.
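For reference, here is a hedged single-pipe sketch of the same idea (essentially the filter/first() approach shown earlier; like the answers above, it assumes every id has a timepoint == 0 row):
# arrange so the closest-to-4 row (ties broken toward the lower timepoint)
# comes first in each group, then keep that row plus the timepoint == 0 row
dat %>%
  arrange(id, abs(4 - timepoint), timepoint) %>%
  group_by(id) %>%
  filter(timepoint == 0 | row_number() == 1) %>%
  ungroup() %>%
  arrange(id, timepoint)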


Finding the first row after which x rows meet some criterion in R

A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 10),
                  datetime = seq(as.POSIXct("2020-12-26 00:00:00"),
                                 as.POSIXct("2020-12-30 3:00:00"), by = "hour"),
                  land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once for 3 straight days. I tried doing something like this:
dat$ymd <- as.Date(dat$datetime) # make column for year-month-day
# add land points within each id group
land.pts <- dat %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  drop_na(land) %>%
  mutate(all.land = cumsum(land))

# flag days that have any land points
flag <- land.pts %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  slice(n()) %>%
  mutate(flag = if_else(all.land == 0, 0, 1))

# Combine flagged dataframe with full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x <- comb %>%
  group_by(id) %>%
  arrange(id, datetime) %>%
  mutate(time.land = ifelse(land == 0 | is.na(lag(land)) | lag(land) == 0 | flag == 0,
                            0,
                            difftime(datetime, lag(datetime), units = "days")))
But I still can't quite work out how to determine when the animal has used land at least once per day for three days straight, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp, then summarise the data so that only one row is kept per id and date, indicating whether the animal was on land at least once during that day.
Use zoo's rollapply function to mark a day as TRUE if the animal was on land on that day and on each of the following two days (i.e. the start of a three-day run).
library(dplyr)
library(zoo)

dat <- dat %>% mutate(date = as.Date(datetime))

dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1)) %>%
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  ungroup %>%
  # If you want all the rows of the data
  left_join(dat, by = c('id', 'date'))
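If the question's "flag the first row" is also needed, here is a hedged follow-up sketch (not part of the original answer). It assumes the result of the pipe above has been stored in a variable, here called res (a name not in the original):
# flag the first hourly row of the earliest date that starts a three-day
# all-land run, separately for each id
res %>%
  group_by(id) %>%
  arrange(datetime, .by_group = TRUE) %>%
  mutate(run_start = !is.na(consec_three) & consec_three,
         flag = run_start & cumsum(run_start) == 1) %>%
  ungroup()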

How to find the clusters that produce the maximum colMeans in R?

I have a data frame like
  V1 V2 V3
1  1  1  2
2  0  1  0
3  3  0  3
....
and I have a vector of the same length as the number of rows in the data frame (it's the cluster from kmeans, if that matters)
[1] 2 2 1...
From those I can get the colMeans for each cluster, like
cm1 <- colMeans(df[fit$cluster==1,])
cm2 <- colMeans(df[fit$cluster==2,])
(I don't think I should do that part explicitly, but that's how I'm thinking about the problem.)
What I want is to get, for each column of the data frame, the value from the vector for which the colMeans is the maximum. Also I'd like to do (separately is fine) the second-highest, third, etc. So in the example I would want the output to be a vector with one element for each column of the data frame:
1 2 1...
because for the first column of the data frame, the column mean for the first cluster is 3, while the column mean for the second cluster is 0.5.
If the cluster vector is of the same length as the number of rows of 'df', split the data by the cluster vector into a list, stack the column means of each subset, bind them together, and use aggregate with which.max to find, for each column, which cluster has the largest mean:
lst1 <- lapply(split(df, fit$cluster), function(x) stack(colMeans(x)))
dat <- do.call(rbind, Map(cbind, cluster = names(lst1), lst1))
aggregate(values ~ ind, dat, FUN = which.max)
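Note that which.max here returns a position within each column's stacked means, in the order of names(lst1). If the cluster labels were not simply 1, 2, ..., a hedged way to map those positions back to the actual labels could be:
# map the which.max positions back to the cluster labels
# (positions follow the order of names(lst1) built above)
res <- aggregate(values ~ ind, dat, FUN = which.max)
res$cluster <- names(lst1)[res$values]
res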
If we need to extract multiple elements based on the column means, create the 'cluster' column in the data, reshape to 'long' format (or use summarise/across), group by 'cluster' and 'name', get the mean of 'value', arrange by 'name' and descending 'value', then return the top n rows per column with slice_head:
library(dplyr)
library(tidyr)
df %>%
  mutate(cluster = fit$cluster) %>%
  pivot_longer(cols = -cluster) %>%
  group_by(cluster, name) %>%
  summarise(value = mean(value), .groups = 'drop') %>%
  arrange(name, desc(value)) %>%
  group_by(name) %>%
  slice_head(n = 2)
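If instead the single vector from the question (one winning cluster per column) is wanted, a hedged variant keeping only the top mean per column could look like this:
# keep only the cluster with the largest mean for each column,
# giving one row per column (name) with its winning cluster
df %>%
  mutate(cluster = fit$cluster) %>%
  pivot_longer(cols = -cluster) %>%
  group_by(cluster, name) %>%
  summarise(value = mean(value), .groups = 'drop') %>%
  group_by(name) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(name, cluster)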
data
df <- structure(list(V1 = c(1L, 0L, 3L), V2 = c(1L, 1L, 0L),
                     V3 = c(2L, 0L, 3L)), class = "data.frame",
                row.names = c("1", "2", "3"))
fit <- structure(list(cluster = c(2, 2, 1)), class = "data.frame",
                 row.names = c(NA, -3L))

Is correct summarizing 2 times in a row in R?

I have data with the state, month and year in which people died. I need to calculate the median number of deaths in each month (across years).
So, the first step is to calculate the number of people who died by state, month and year:
data %>% group_by(state, month, year) %>% summarise(n = n())
data.frame(
  stringsAsFactors = FALSE,
  State = c("X", "X", "Y", "Y"),
  Month = c(1L, 1L, 1L, 1L),
  Year = c(2019L, 2020L, 2019L, 2020L),
  n = c(20L, 15L, 45L, 54L)
)
At this point, I have a dataframe like this (these numbers are just an example):
State  Month  Year   n
X      1      2019  20
X      1      2020  15
Y      1      2019  45
Y      1      2020  54
But I want to calculate the median, so I write:
data %>% group_by(state, month, year) %>% summarise(n = n()) %>% summarise(median = median(n))
State  Month  median
X      1        17.5
Y      1        49.5
I obtain my desired result, but I don't know whether R is doing something behind the scenes that I don't see.
My question is: is it bad practice to summarise() twice in a row?
After the first summarise, the last grouping level is dropped by default, i.e. year. So the second summarise works on 'State' and 'Month' (which seems to be the OP's desired outcome). In this case, two summarise calls make sense. It may be better to specify the .groups option to make explicit what we need, i.e. drop_last to drop only the last grouping level in the first summarise, and drop to remove the grouping entirely in the second:
library(dplyr)
data %>%
  group_by(state, month, year) %>%
  summarise(n = n(), .groups = 'drop_last') %>%
  summarise(median = median(n), .groups = 'drop')
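A quick way to see what grouping remains after the first summarise is group_vars(); for example (an illustrative check, not part of the original answer):
# after the first summarise the result is still grouped by state and month,
# which is exactly what the second summarise then collapses
data %>%
  group_by(state, month, year) %>%
  summarise(n = n(), .groups = 'drop_last') %>%
  group_vars()
# [1] "state" "month"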

R: dplyr arrange by row number

I am trying to order a dataset according to the values in its columns, in ascending order.
I have a dataset with 1 row and 3000+ columns. I guess I could just change it to a list and use .[[n]], but I was wondering if there is another way.
The data looks something like this, only with more columns and values:
structure(list(a = -0.00106163456888295, b = -4.11357273721094e-05,
c = -0.000181424293930435), row.names = 1L, class = "data.frame")
I expect something like this:
b c a
1 -4.1135727372109401e-05 -0.00018142429393043499 -0.00106163456888295
I understand you can arrange by column number by doing the following:
.[[column number]]
for example:
mtcars %>% arrange(.[[2]])
What is the row number equivalent?
If I understand you correctly, you want to order the columns based on the values in the single row.
z <- structure(list(a = -0.00106163456888295, b = -4.11357273721094e-05,
c = -0.000181424293930435), row.names = 1L, class = "data.frame")
Base R:
z[,order(z[1,])]
# a c b
# 1 -0.00106163457 -0.000181424294 -0.0000411357274
Tidyverse:
library(dplyr)
z %>%
  select_at(order(.))
Note: I think your expected output might not be correct, as the values are not ordered. Your intended output:
c(-0.000181424293930435, -0.00106163456888295, -4.11357273721094e-05)
# [1] -0.0001814242939 -0.0010616345689 -0.0000411357274
diff(c(-0.000181424293930435, -0.00106163456888295, -4.11357273721094e-05))
# [1] -0.000880210275 0.001020498842
shows the first value is greater than the second, but the second is less than the third. If they were ordered, I would expect the diff to be always-nonnegative; if reverse-ordered, diff should be always-nonpositive.
We can unlist the first row, order it, and use that in select:
library(dplyr)
df1 %>%
  select(order(-unlist(.[1, ])))
#              b             c            a
#1 -4.113573e-05 -0.0001814243 -0.001061635
It can also be used as a general solution, i.e. if we want to do this based on a particular row:
n <- 3
mtcars %>%
  select(order(-unlist(.[n, ])))
Or reshape to 'long' format, arrange by descending value, get the column names, and then use them in select:
library(tidyr)
df1 %>%
  pivot_longer(everything()) %>%
  arrange(desc(value)) %>%
  pull(name) %>%
  select(df1, .)
#              b             c            a
#1 -4.113573e-05 -0.0001814243 -0.001061635
Or enframe, then arrange, pull the 'name' column, and use that in select:
library(tibble)
as.list(df1) %>%
  enframe %>%
  unnest(c(value)) %>%
  arrange(desc(value)) %>%
  pull(name) %>%
  select(df1, .)
Or if we just want to put column 'c' first:
df1 %>%
  select(c, everything())
#               c            a             b
#1 -0.0001814243 -0.001061635 -4.113573e-05
In base R, we can do
df1[order(-unlist(df1[1,]))]
data
df1 <- structure(list(a = -0.00106163456888295, b = -4.11357273721094e-05,
c = -0.000181424293930435), row.names = 1L, class = "data.frame")

find duplicates with grouped variables

I have a df that looks like this (I guess this can be done with dplyr and some duplicate handling, but I don't know how to address multiple columns while distinguishing between groups):
from  to  group
1     2   metro
2     4   metro
3     4   metro
4     5   train
6     1   train
8     7   train
I want to find the ids that occur in more than one group. The expected result for the sample df is 1 and 4, because they occur in both the metro and the train group.
Thank you in advance!
Using base R, we can split the stacked values of the first two columns by group and find the values common to the groups using intersect:
Reduce(intersect, split(unlist(df[1:2]), df$group))
#[1] 1 4
We gather the 'from' and 'to' columns into 'long' format, group by 'val', keep the values that occur in more than one distinct group, then pull the unique 'val' elements:
library(dplyr)
library(tidyr)
df1 %>%
  gather(key, val, from:to) %>%
  group_by(val) %>%
  filter(n_distinct(group) > 1) %>%
  distinct(val) %>%
  pull(val)
#[1] 1 4
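gather() is superseded in recent tidyr; a hedged equivalent of the pipe above using pivot_longer would be:
# same idea with pivot_longer instead of gather
df1 %>%
  pivot_longer(c(from, to), values_to = "val") %>%
  group_by(val) %>%
  filter(n_distinct(group) > 1) %>%
  distinct(val) %>%
  pull(val)
#[1] 1 4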
Or, using base R, we can tabulate the values against the groups with table and pick the ids that appear in more than one group:
out <- with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"
data
df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L,
4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train",
"train", "train")), class = "data.frame", row.names = c(NA, -6L
))
Convert the data to long format and count unique values, using data.table. melt is used to convert to long format, and data.table allows filtering in the i part of df1[i, j, by], grouping in the by part, and computing/extracting in the j part.
library(data.table)
library(magrittr)
setDT(df1)
melt(df1, 'group') %>%
  .[, .(n = uniqueN(group)), value] %>%
  .[n > 1, unique(value)]
# [1] 1 4
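The same logic can also be written without magrittr by chaining the brackets directly, which makes the i / j / by roles described above explicit (a hedged restatement of the pipe above):
# melt to long, count distinct groups per value (j with by),
# then filter in i and extract the values in j
long <- melt(df1, id.vars = "group")
long[, .(n = uniqueN(group)), by = value][n > 1, unique(value)]
# [1] 1 4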
