I have the following data frame that contains points that originate from different samples. Each point has a type.
I need to calculate, for each point belonging to a given sample of a given type (for instance for "Sample_1" type "A") how many points of another type are around it in a given cutoff.
My current implementation uses "future.apply" and I was wondering if there is a more efficient way to solve this problem. The example here is small and should run quickly; the real data has several thousand rows and is much slower.
In the end I store the results in a list.
This list has, for each element with "type" in "starting_point", the number of elements of type "target_point" in a threshold of 40.
library(future)
library(future.apply)
a_test=data.frame(ID=sample(c("Sample_1", "Sample_2", "Sample_3"), 100, replace=TRUE), type=sample(c("A", "B", "C", "D"), 100, replace = TRUE), xpos=sample(1:200, 100, replace=TRUE), ypos=sample(1:200, 100, replace=TRUE))
starting_point=c("A", "B")
target_point=c("C", "D")
threshold=40
result_per_pair=list()
for(sp in starting_point){
## Here I select a data frame of "Starting points" without looking
## from which ID they came from
sp_tdf=a_test[a_test$type==sp, ]
for(tp in target_point){
## Here I select a data frame of "Target points" without looking
## from which ID they came from
tp_tdf=a_test[a_test$type==tp, ]
## I use future_sapply here, parallelizing on each line of "sp_tdf"
plan(multisession)
elements_around=future_sapply(1:nrow(sp_tdf), function(x, sp_tdf, tp_tdf, threshold2){
xc=sp_tdf$xpos[x]
yc=sp_tdf$ypos[x]
### NOTE HERE: At this point I select the points that are in the same
### ID as the current line of sp_tdf
tp_tdf2=tp_tdf[tp_tdf$ID == sp_tdf$ID[x],]
ares=tp_tdf2[ (tp_tdf2$xpos-xc)^2 + (tp_tdf2$ypos-yc)^2 <threshold2, ]
return(nrow(ares))
},sp_tdf=sp_tdf, tp_tdf=tp_tdf, threshold2=threshold*threshold)
a_newcol=paste0(tp, "_around_", sp)
## we need to create a copy of sp_tdf otherwise we add columns to the
## initial sp_tdf and we memorize them in the wrong place in the list
sp_tdf_temp=sp_tdf
sp_tdf_temp[, a_newcol]=elements_around
result_per_pair[[ paste0(tp, "_around_", sp ) ]]=rbind(result_per_pair[[ paste0(tp, "_around_", sp ) ]], sp_tdf_temp)
}
}
You can see the type of table I get here:
head(result_per_pair[[1]])
$C_around_A
ID type xpos ypos C_around_A
1 Sample_2 A 26 74 1
2 Sample_3 A 64 8 1
3 Sample_3 A 121 2 1
5 Sample_2 A 62 94 0
You can try using the RANN::nn2 function:
library(dplyr)
library(purrr)
id_list <- split(a_test, a_test$ID)
res <- id_list %>%
map(~select(.x, xpos, ypos)) %>%
map(~RANN::nn2(.x, .x, k = nrow(.), searchtype = "radius", radius = threshold)) %>%
map(1) %>%
map2(
id_list,
function(x, y){
seq_len(nrow(x)) %>%
map(~x[.x,] %>% .[. > 0]) %>%
map(~y[.x,]) %>%
map("type") %>%
map_dfr(table) %>%
mutate(across(everything(), as.integer))
}
) %>%
map2_dfr(id_list, ~bind_cols(.y, .x))
Replacing the tidyverse functions might speed this up further (it is hard to say how fast it is on your example). Result:
res %>% head()
ID type xpos ypos A B C D
Sample_1 C 48 157 0 0 3 1
Sample_1 D 177 97 1 1 1 3
Sample_1 C 10 71 0 0 3 0
Sample_1 C 71 168 1 1 2 0
Sample_1 D 82 48 1 0 1 2
Sample_1 C 165 71 3 3 1 1
where columns A-D give the count of each type within the same ID. I used seed 123 to generate a_test. You can adapt the algorithm to work with starting_point and target_point by splitting each element of id_list into two parts, the rows matching starting_point and the rows matching target_point, and passing them as the query and data arguments of RANN::nn2, respectively.
Edit
A function based on the idea in the paragraph above:
f <- function(df, threshold, start = unique(as.character(df$type)), target = unique(as.character(df$type))){
my_lists <- df %>%
filter(type %in% c(start, target)) %>%
split(.$ID) %>%
map(
function(x){
map(
list(start, target),
~filter(x, type %in% .x) %>% mutate(type = droplevels(as.factor(type)))
)
}
) %>%
discard(~any(map_int(.x, nrow) == 0))
indices <- my_lists %>%
map(
~RANN::nn2(
data = select(.x[[2]], xpos, ypos),
query = select(.x[[1]], xpos, ypos),
k = nrow(.x[[2]]),
searchtype = "radius",
radius = threshold
)
) %>%
map(1) %>%
map(function(x) seq_len(nrow(x)) %>% map(~x[.x,] %>% .[. > 0]))
my_lists %>%
map(2) %>%
map2(indices, function(x, y) map_dfr(y, ~summary(x[.x,]$type))) %>%
{map2_dfr(map(my_lists, 1), ., bind_cols)}
}
To count C around A with radius 40:
f(a_test, 40, "A", "C")
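For modest data sizes, a dependency-free alternative is to compute all pairwise squared distances within each sample using outer(). This is only a sketch and not the RANN approach above: count_around is a hypothetical helper name, it assumes the columns ID, type, xpos and ypos from the question, and it uses the same strict < comparison as the question's loop. Memory use grows with (starting points x target points) inside each sample, so it may not suit very large samples.

```r
# Hypothetical helper: count `target` points within `threshold` of each
# `start` point, separately for every ID. Base R only.
count_around <- function(df, start, target, threshold) {
  by_id <- split(df, df$ID)
  out <- lapply(by_id, function(d) {
    s <- d[d$type == start, , drop = FALSE]   # starting points in this ID
    t <- d[d$type == target, , drop = FALSE]  # target points in this ID
    # Matrix of squared distances: one row per start, one column per target
    d2 <- outer(s$xpos, t$xpos, "-")^2 + outer(s$ypos, t$ypos, "-")^2
    s[[paste0(target, "_around_", start)]] <- rowSums(d2 < threshold^2)
    s
  })
  do.call(rbind, out)
}

# Tiny illustrative data set (made up for this sketch)
pts <- data.frame(
  ID   = c("S1", "S1", "S1", "S1", "S2", "S2"),
  type = c("A",  "C",  "C",  "A",  "A",  "C"),
  xpos = c(0, 10, 100, 200, 0, 0),
  ypos = c(0,  0,   0, 200, 0, 5),
  stringsAsFactors = FALSE
)
res_base <- count_around(pts, "A", "C", 40)
```

On the question's data the call would be count_around(a_test, "A", "C", 40), which returns one row per "A" point with a C_around_A column.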
Related
I need to extract a sample that has equal distribution in each experience-level group. For your info, there are total 4 groups (1, 2, 3, 4 years of exp), and total 8 people (A, B, C, D, E, F, G, H) in this example scenario. I was trying to come up with a function with loops, but don't know how to. Please help me out! Thank you! :)
library(tidyverse)
data <- tibble(id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4), pre_year_exp = year_exp - 1)
data_0 <- data %>% filter(year_exp == max(year_exp) - 0) %>% sample_n(2)
data_1 <- data %>% filter(year_exp == max(year_exp) - 1) %>% anti_join(data_0, by = 'id') %>% sample_n(2)
data_2 <- data %>% filter(year_exp == max(year_exp) - 2) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% sample_n(2)
data_3 <- data %>% filter(year_exp == max(year_exp) - 3) %>% anti_join(data_0, by = 'id') %>% anti_join(data_1, by = 'id') %>% anti_join(data_2, by = 'id')
#Result Table
result <- data_0 %>% bind_rows(data_1, data_2, data_3)
result
The below produces the same output as your code and extends the idea to allow for an arbitrary number of values of year_exp using a for loop.
Please note that because this simply extends your code, it must share the following (possibly-undesirable) features with your code:
The code moves sequentially through groups, sampling from the members of later groups who were not sampled for earlier groups. Accordingly, there is a risk that the code throws an error because it tries to sample from a group whose members were already sampled for previous groups.
The probabilities of selection are not uniformly distributed across members of a group. Accordingly, the samples drawn from each group are not representative of that group.
In the event that the data were instead a balanced panel, there are much more efficient and simpler ways to accomplish this.
library(tibble)
library(dplyr)
set.seed(123)
# Create original data
data <- tibble(id = c("A","A","A","B","B","C","C","D","D","D","D","E","E","E","E","F","F","G","G","G","H","H","H","H"),
year_exp = c(1,2,3,1,2,1,2,1,2,3,4,1,2,3,4,1,2,1,2,3,1,2,3,4),
pre_year_exp = year_exp - 1)
# Assign values to parameters used by/in the loop.
J <- data$id %>% unique %>% length # unique units/persons (8)
K <- data$year_exp %>% unique %>% length # unique groups/years (4)
N <- 2 # sample size per group (2)
# Initialize objects loop will modify
samples_list <- vector(mode = "list", length = K) # stores each sample
used_ids <- rep(NA_character_, J) # stores used ids
index <- 1:N # initial indices for used ids
# For-loop solution
for (k in 1:K) {
# Identifier for current group
cur_group <- 1 + K - k
# Sample from persons in current group who were not previously sampled
one_sample <- data %>%
filter(year_exp == cur_group, !(id %in% used_ids)) %>%
slice_sample(n = N)
# Save sample and the id values for those sampled
samples_list[[k]] <- one_sample
used_ids[index] <- one_sample$id
index <- index + N
}
# Bind into a single data.frame
bind_rows(samples_list)
#> # A tibble: 8 x 3
#> id year_exp pre_year_exp
#> <chr> <dbl> <dbl>
#> 1 H 4 3
#> 2 D 4 3
#> 3 G 3 2
#> 4 E 3 2
#> 5 C 2 1
#> 6 B 2 1
#> 7 F 1 0
#> 8 A 1 0
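The balanced-panel shortcut mentioned above is not spelled out in the answer, so the following is only one possible reading of it. It assumes every id appears in every year_exp group (which is not true of the example data), and the names panel, picked and result are made up for this sketch. Under that assumption no loop is needed: draw N * K distinct ids at random, then deal them out N per group.

```r
library(dplyr)

set.seed(123)

# Assumed balanced panel: every id is observed in every year_exp group
panel <- expand.grid(id = LETTERS[1:8], year_exp = 1:4,
                     stringsAsFactors = FALSE)

N <- 2                                    # sample size per group
years <- sort(unique(panel$year_exp))     # the K groups

# Draw N * K distinct ids in random order and assign N of them to each group
picked <- data.frame(
  id = sample(unique(panel$id), N * length(years)),
  year_exp = rep(years, each = N),
  stringsAsFactors = FALSE
)

# Keep only the (id, year_exp) rows that were picked
result <- inner_join(panel, picked, by = c("id", "year_exp"))
```

Because the ids are drawn in a single step, no id can be used twice and every id has the same chance of landing in any given group.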
I have a data frame that looks like this:
# Set RNG
set.seed(33550336)
# Create toy data frame
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df1 <- df %>% mutate(region = "Here")
df2 <- df %>% mutate(region = "There")
df3 <- df %>% mutate(region = "Everywhere")
df_ref <- do.call(rbind, list(df1, df2, df3))
df_ref$value <- runif(nrow(df_ref))
# > head(df_ref)
# day dist region value
# 1 1 0 Here 0.39413117
# 2 2 0 Here 0.44224203
# 3 3 0 Here 0.44207487
# 4 4 0 Here 0.08007335
# 5 5 0 Here 0.02836093
# 6 6 0 Here 0.94475814
This represents a reference data frame and I'd like to compare observations against it. My observations are taken on a specific day that is found in this reference data frame (i.e., day is an integer from 1 to 10) in a region that is also found in this data frame (i.e., Here, There, or Everywhere), but the distance (dist) is not necessarily an integer between 0 and 100. For example, my observation data frame (df_obs) might look like this:
# Observations
df_obs <- data.frame(day = sample(1:10, 3, replace = TRUE),
region = sample(c("Here", "There", "Everywhere")),
dist = runif(3, 0, 100))
# day region dist
# 1 6 Everywhere 68.77991
# 2 7 There 57.78280
# 3 10 Here 85.71628
Since dist is not an integer, I can't just look up the value corresponding to my observations in df_ref like this:
df_ref %>% filter(day == 6, region == "Everywhere", dist == 68.77991)
So, I created a lookup function that uses the linear interpolation function approx:
lookup <- function(re, di, da){
# Filter to day and region
df_tmp <- df_ref %>% filter(region == re, day == da)
# Approximate answer from distance
approx(unlist(df_tmp$dist), unlist(df_tmp$value), xout = di)$y
}
Applying this to my first observation gives,
lookup("Everywhere", 68.77991, 6)
#[1] 0.8037013
Nevertheless, when I apply the function using mutate I get a different answer.
df_obs %>% mutate(ref = lookup(region, dist, day))
# day region dist ref
# 1 6 Everywhere 68.77991 0.1881132
# 2 7 There 57.78280 0.1755198
# 3 10 Here 85.71628 0.1730285
I suspect that this is because lookup is not vectorised correctly. Why am I getting different answers and how do I fix my lookup function to avoid this?
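A likely explanation, offered as a sketch rather than a definitive answer: inside mutate() the function receives whole columns, so region == re and day == da compare the 330 rows of df_ref against length-3 vectors, and R recycles the shorter vector instead of filtering once per observation. approx() then interpolates over a mixed subset of rows. One way around this is to call lookup once per row, for example with mapply(). The observation values below are copied from the printed df_obs above; nothing else about the data is assumed.

```r
library(dplyr)

# Reference data, built as in the question
set.seed(33550336)
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df_ref <- do.call(rbind, list(
  mutate(df, region = "Here"),
  mutate(df, region = "There"),
  mutate(df, region = "Everywhere")
))
df_ref$value <- runif(nrow(df_ref))

# Scalar lookup: expects ONE region, ONE distance and ONE day
lookup <- function(re, di, da) {
  df_tmp <- df_ref %>% filter(region == re, day == da)
  approx(df_tmp$dist, df_tmp$value, xout = di)$y
}

# Observations, hard-coded from the question's printed df_obs
df_obs <- data.frame(
  day = c(6, 7, 10),
  region = c("Everywhere", "There", "Here"),
  dist = c(68.77991, 57.78280, 85.71628),
  stringsAsFactors = FALSE
)

# mapply() calls lookup() once per row, so each call sees scalars
res_lookup <- df_obs %>%
  mutate(ref = mapply(lookup, region, dist, day, USE.NAMES = FALSE))
```

Each element of ref then equals the value returned by calling lookup() directly with that row's region, dist and day. Wrapping the function once with Vectorize(lookup) or using rowwise() before mutate() should have the same effect.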
I'm trying to apply a function over the rows of a data frame and return a value based on the value of each element in a column. I'd prefer to pass the whole dataframe instead of naming each variable as the actual code has many variables - this is a simple example.
I've tried purrr map_dbl and rowwise but can't get either to work. Any suggestions please?
#sample df
df <- data.frame(Y=c("A","B","B","A","B"),
X=c(1,5,8,23,31))
#required result
Res <- data.frame(Y=c("A","B","B","A","B"),
X=c(1,5,8,23,31),
NewVal=c(10,500,800,230,3100)
)
#use mutate and map or rowwise etc
Res <- df %>%
mutate(NewVal=map_dbl(.x=.,.f=FnAdd(.)))
Res <- df %>%
rowwise() %>%
mutate(NewVal=FnAdd(.))
#sample fn
FnAdd <- function(Data){
if(Data$Y=="A"){
X=Data$X*10
}
if(Data$Y=="B"){
X=Data$X*100
}
return(X)
}
If there are multiple values, it is better to build a key/value lookup table, join it, and then do the multiplication:
keyVal <- data.frame(Y = c("A", "B"), NewVal = c(10, 100))
df %>%
left_join(keyVal, by = "Y") %>%
mutate(NewVal = X*NewVal)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
It is not clear how many unique values there are in the 'Y' column of the actual dataset. If there are only a few, then case_when can be used:
FnAdd <- function(Data){
Data %>%
mutate(NewVal = case_when(Y == "A" ~ X * 10,
Y == "B" ~ X *100,
TRUE ~ X))
}
FnAdd(df)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
You were originally looking for a solution using dplyr's rowwise() function, so here is that solution. The nice thing about this approach is that you don't need to create a separate function.
Here's the version using ifelse():
df %>%
rowwise() %>%
mutate(NewVal = ifelse(Y == "A", X * 10,
ifelse(Y == "B", X * 100, NA_real_)))
and here's the version using case_when:
df %>%
rowwise() %>%
mutate(NewVal = case_when(Y == "A" ~ X * 10,
Y == "B" ~ X * 100))
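If the function really has to receive a whole row (because it uses many columns), one option is to loop over row indices and pass a one-row data frame each time. This is a minimal base-R sketch, not one of the approaches above. FnAdd is rewritten here with else branches so that it returns NA for an unexpected Y instead of failing, which is an assumption about the desired behaviour.

```r
df <- data.frame(Y = c("A", "B", "B", "A", "B"),
                 X = c(1, 5, 8, 23, 31),
                 stringsAsFactors = FALSE)

# Takes a ONE-row data frame and returns a single number
FnAdd <- function(Data) {
  if (Data$Y == "A") {
    Data$X * 10
  } else if (Data$Y == "B") {
    Data$X * 100
  } else {
    NA_real_   # assumption: unknown categories yield NA
  }
}

# Call FnAdd once per row; vapply checks each result is one numeric value
df$NewVal <- vapply(seq_len(nrow(df)),
                    function(i) FnAdd(df[i, ]),
                    numeric(1))
```

This is slower than the vectorised case_when or join versions, so it is mainly useful when the per-row logic cannot easily be vectorised.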
Starting with data containing multiple observations for each group, like this:
set.seed(1)
my.df <- data.frame(
timepoint = rep(c(0, 1, 2), each= 3),
counts = round(rnorm(9, 50, 10), 0)
)
> my.df
timepoint counts
1 0 44
2 0 52
3 0 42
4 1 66
5 1 53
6 1 42
7 2 55
8 2 57
9 2 56
To perform a summary calculation at each timepoint relative to timepoint == 0, for each group I need to pass a vector of counts for timepoint == 0 and a vector of counts for the group (e.g. timepoint == 1) to an arbitrary function, e.g.
NonsenseFunction <- function(x, y){
(mean(x) - mean(y)) / (1 - mean(y))
}
I can get the required output from this table, either with dplyr:
library(dplyr)
my.df %>%
group_by(timepoint) %>%
mutate(rep = paste0("r", 1:n())) %>%
left_join(x = ., y = filter(., timepoint == 0), by = "rep") %>%
group_by(timepoint.x) %>%
summarise(result = NonsenseFunction(counts.x, counts.y))
or data.table:
library(data.table)
my.dt <- data.table(my.df)
my.dt[, rep := paste0("r", 1:length(counts)), by = timepoint]
merge(my.dt, my.dt[timepoint == 0], by = "rep", all = TRUE)[
, NonsenseFunction(counts.x, counts.y), by = timepoint.x]
This only works if every group has the same number of observations. Besides, the observations aren't matched across groups, so using the temporary rep variable seems hacky.
For a more general case, where I need to pass vectors of the baseline values and the group's values to an arbitrary (more complicated) function, is there an idiomatic data.table or dplyr way of doing so with a grouped operation for all groups?
Here's the straightforward data.table approach:
my.dt[, f(counts, my.dt[timepoint==0, counts]), by=timepoint]
This probably grabs my.dt[timepoint==0, counts] again and again, for each group. You could instead save that value ahead of time:
v = my.dt[timepoint==0, counts]
my.dt[, f(counts, v), by=timepoint]
... or if you don't want to add v to the environment, maybe
with(list(v = my.dt[timepoint==0, counts]),
my.dt[, f(counts, v), by=timepoint]
)
You could pass the vector from your group of interest as a constant second argument:
my.df %>%
group_by(timepoint) %>%
mutate(response = NonsenseFunction(counts, my.df$counts[my.df$timepoint == 0]))
Or if you want to make it beforehand:
constant = my.df$counts[my.df$timepoint == 0]
my.df %>%
group_by(timepoint) %>%
mutate(response = NonsenseFunction(counts, constant))
You can try,
library(dplyr)
my.df %>%
mutate(new = mean(counts[timepoint == 0])) %>%
group_by(timepoint) %>%
summarise(result = NonsenseFunction(counts, new))
# A tibble: 3 × 2
# timepoint result
# <dbl> <dbl>
#1 0 0.0000000
#2 1 0.1398601
#3 2 0.2097902
I would like to create a dataframe with confidence intervals for proportions as a final result. I have introduced a variable (tp in my example) as a cut off value to calculate the proportions for. I would like to use the dplyr package to produce the final dataframe.
Below is a simplified example:
library(dplyr)
my_names <- c("A","B")
dt <- data.frame(
Z = sample(my_names,100,replace = TRUE),
X = sample(1:10, replace = TRUE),
Y = sample(c(0,1), 100, replace = TRUE)
)
my.df <- dt%>%
mutate(tp = (X >8)* 1) %>% #multiply by one to convert into numeric
group_by(Z, tp) %>%
summarise(n = n()) %>%
mutate(prop.tp= n/sum(n)) %>%
mutate(SE.tp = sqrt((prop.tp*(1-prop.tp))/n))%>%
mutate(Lower_limit = prop.tp-1.96 * SE.tp)%>%
mutate(Upper_limit = prop.tp+1.96 * SE.tp)
output:
Source: local data frame [4 x 7]
Groups: Z
Z tp n prop.tp SE.tp Lower_limit Upper_limit
1 A 0 33 0.6346154 0.08382498 0.4703184 0.7989123
2 A 1 19 0.3653846 0.11047236 0.1488588 0.5819104
3 B 0 27 0.5625000 0.09547033 0.3753782 0.7496218
4 B 1 21 0.4375000 0.10825318 0.2253238 0.6496762
However, I would like to calculate the standard error and the CIs using the total sample for the groups in column Z, not the sample split by the categorical variable tp.
So the total sample for A in my example should be n = 33 +19.
Any ideas?
Not quite sure I get which group you want to compare with which here, but at any rate you have two grouping variables tp = X > 8 and Z.
If you want to compare the rows with X > 8 and Z == "A" to all rows with X > 8 you can do it like this
merge(
dt %>%
group_by(X > 8) %>%
summarize(n.X = n()),
dt %>%
group_by(X > 8, Z) %>%
summarise(n.XZ = n()),
by = "X > 8"
) %>%
mutate(prop.XZ = n.XZ/n.X) %>%
mutate(SE = sqrt((prop.XZ*(1-prop.XZ))/n.X))%>%
mutate(Lower_limit = prop.XZ-1.96 * SE) %>%
mutate(Upper_limit = prop.XZ+1.96 * SE)
X > 8 n.X Z n.XZ prop.XZ SE Lower_limit Upper_limit
1 FALSE 70 A 37 0.5285714 0.05966378 0.4116304 0.6455124
2 FALSE 70 B 33 0.4714286 0.05966378 0.3544876 0.5883696
3 TRUE 30 A 16 0.5333333 0.09108401 0.3548087 0.7118580
4 TRUE 30 B 14 0.4666667 0.09108401 0.2881420 0.6451913
If you want to turn the problem around and compare X > 8 and Z == "A" to all rows with Z == "A" you can do it like this
merge(
dt %>%
group_by(Z) %>%
summarize(n.Z = n()),
dt %>%
group_by(X > 8, Z) %>%
summarise(n.XZ = n()),
by = "Z"
) %>%
mutate(prop.XZ = n.XZ/n.Z) %>%
mutate(SE = sqrt((prop.XZ*(1-prop.XZ))/n.Z))%>%
mutate(Lower_limit = prop.XZ-1.96 * SE) %>%
mutate(Upper_limit = prop.XZ+1.96 * SE)
Z n.Z X > 8 n.XZ prop.XZ SE Lower_limit Upper_limit
1 A 53 FALSE 37 0.6981132 0.06305900 0.5745176 0.8217088
2 A 53 TRUE 16 0.3018868 0.06305900 0.1782912 0.4254824
3 B 47 FALSE 33 0.7021277 0.06670743 0.5713811 0.8328742
4 B 47 TRUE 14 0.2978723 0.06670743 0.1671258 0.4286189
It is a bit messy to merge two separate groupings, but I don't know whether it is possible to ungroup and re-group in the same statement. I am surprised, though, how difficult it seems to be to use groupings on two different levels (if you can call it that), and I hope someone else can come up with a better solution.
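One way to avoid the merge may be to count the Z x tp cells first and then regroup by Z only, so that sum() gives the total sample size per Z inside the same pipeline. The sketch below assumes that the intended denominator for both the proportion and the standard error is the total n for each Z, as the question describes. The seed and the explicit size of 100 for X are added here only to make the example reproducible.

```r
library(dplyr)

set.seed(1)  # assumption: seed added for reproducibility only
dt <- data.frame(
  Z = sample(c("A", "B"), 100, replace = TRUE),
  X = sample(1:10, 100, replace = TRUE),
  stringsAsFactors = FALSE
)

res_ci <- dt %>%
  mutate(tp = as.integer(X > 8)) %>%
  count(Z, tp, name = "n.XZ") %>%   # one row per Z x tp cell
  group_by(Z) %>%                   # regroup on Z alone
  mutate(
    n.Z = sum(n.XZ),                # total sample size for this Z
    prop.XZ = n.XZ / n.Z,
    SE = sqrt(prop.XZ * (1 - prop.XZ) / n.Z),
    Lower_limit = prop.XZ - 1.96 * SE,
    Upper_limit = prop.XZ + 1.96 * SE
  ) %>%
  ungroup()
```

The result has the same columns as the second merge-based table, with n.Z equal to the sum of the cell counts for each Z.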