How to conditionally summarize on other entries in the group (R)

In my dataset I have Cartesian coordinates of different items over time, identified by an EventID, event_type, ID number, x position, y position, identity type, broad category, and frame id number. For each EventID, event_type, and frame id number combination, I need to go through each ID number and calculate which other ID number with a different broad category has the minimum distance from the current row. I would like to avoid using for loops for this because the dataset is several million lines long.
I tried formulating this as a group_by and summarize call using dplyr, but couldn't quite wrap my head around how to compare the current row's x and y against all the other x and y values and then take the conditional minimum.
two_dim_euclid <- function(x1, x2, y1, y2) {
  sqrt((x1 - x2)^2 + (y1 - y2)^2)
}
# Example Data
df <- data.frame(stringsAsFactors = FALSE,
                 EventID = c(1003, 1003, 1003, 1003),
                 event_type = c(893, 893, 893, 893),
                 ID_number = c(80427, 2346, 24954, 27765),
                 x = c(86.07, 72.4, 43.08, 80.13),
                 y = c(35.58, 26.43, 34.8, 34.79),
                 identity_type = c("A", "C", "B", "B"),
                 broad_category = c("set1", "set1", "set2", "set2"),
                 frame_id = c(1, 1, 1, 1))
df
# EventID event_type ID_number x y identity_type broad_category frame_id
#1 1003 893 80427 86.07 35.58 A set1 1
#2 1003 893 2346 72.40 26.43 C set1 1
#3 1003 893 24954 43.08 34.80 B set2 1
#4 1003 893 27765 80.13 34.79 B set2 1
The expected result would return 5.992303 for row 1: it looks for all the entries not belonging to set1 with the same EventID, event_type, and frame_id, and then returns the minimum Euclidean distance given those parameters.
Also, I want to do this for every entry with identity_type A. But identity_type and broad_category are not always tied together: A can belong to either set1 or set2.

Here's a base R way that relies on dist():
res <- as.matrix(dist(cbind(df$x, df$y)))  # full pairwise distance matrix
res[res == 0] <- Inf                       # ignore self-distances
apply(res, 1, min)
1 2 3 4
5.992303 11.386066 30.491299 5.992303
# or potentially more performant
res[cbind(seq_len(nrow(res)), max.col(-res))]
[1] 5.992303 11.386066 30.491299 5.992303
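Note that dist() here compares every row to every other row and ignores the different-broad_category requirement; it matches the expected output only because each row's nearest neighbour happens to be in the other set. A hedged sketch that enforces the category condition per EventID/event_type/frame_id group (cross_min_dist is my own name, applied to the example df from the question):

```r
# example data from the question
df <- data.frame(stringsAsFactors = FALSE,
                 EventID = rep(1003, 4), event_type = rep(893, 4),
                 ID_number = c(80427, 2346, 24954, 27765),
                 x = c(86.07, 72.4, 43.08, 80.13),
                 y = c(35.58, 26.43, 34.8, 34.79),
                 identity_type = c("A", "C", "B", "B"),
                 broad_category = c("set1", "set1", "set2", "set2"),
                 frame_id = rep(1, 4))

# minimum cross-category distance within one group of rows
cross_min_dist <- function(g) {
  d <- as.matrix(dist(cbind(g$x, g$y)))                # pairwise distances
  same <- outer(g$broad_category, g$broad_category, `==`)
  d[same] <- Inf                                       # mask same-category pairs (and self)
  g$min_dist <- apply(d, 1, min)
  g
}

# apply per EventID/event_type/frame_id group, then recombine
groups <- split(df, interaction(df$EventID, df$event_type, df$frame_id, drop = TRUE))
res <- do.call(rbind, lapply(groups, cross_min_dist))
res$min_dist
# 5.992303 11.386066 30.491299 5.992303
```

split() keeps each group small, so the quadratic dist() cost applies per group rather than to the whole multi-million-row table.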
A potential way with data.table would be to do a cartesian join, but it would need a lot of memory and would likely be slower:
library(data.table)
dt <- as.data.table(df)
dt[, ID := .I]
CJ.dt <- function(X, Y) {
  stopifnot(is.data.table(X), is.data.table(Y))
  k <- NULL
  X <- X[, c(k = 1, .SD)]
  setkey(X, k)
  Y <- Y[, c(k = 1, .SD)]
  setkey(Y, NULL)
  X[Y, allow.cartesian = TRUE][, k := NULL][]
}
CJ.dt(dt, dt)[ID != i.ID, min(sqrt((x - i.x)^2 + (y - i.y)^2)), by = i.ID]
i.ID V1
1: 1 5.992303
2: 2 11.386066
3: 3 30.491299
4: 4 5.992303
For data.table cartesian join, see here:
R: data.table cross-join not working

While I'm not sure about your criteria, it seems that you must iterate in some way. I'm sure others can provide Rcpp solutions that are very quick. In the meantime, here is one possible way with base R.
# In the future, please provide the code to create your example data
dat <- structure(list(EventID = c(1003L, 1003L, 1003L, 1003L),
event_type = c(893L, 893L, 893L, 893L),
ID_number = c(80427L, 2346L, 24954L, 27765L),
x = c(86.07, 72.4, 43.08, 80.13),
y = c(35.58, 26.43, 34.8, 34.79),
identity_type = structure(c(1L, 3L, 2L, 2L),
.Label = c("A", "B", "C"),
class = "factor"),
broad_category = structure(c(1L, 1L, 2L, 2L),
.Label = c("set1", "set2"),
class = "factor"),
frame_id = c(1L, 1L, 1L, 1L)),
.Names = c("EventID", "event_type", "ID_number","x", "y",
"identity_type", "broad_category", "frame_id"),
class = "data.frame", row.names = c("1", "2", "3", "4"))
# Define your criteria here
dat$uniqueID <- paste0(dat$EventID, dat$event_type, dat$frame_id, dat$broad_category)
# made your function take two 2-dim vectors instead since that's simpler for passing in
two_dim_euclid <- function(a, b) {
  a <- as.numeric(a); b <- as.numeric(b)  # drop the data.frame class so apply() gets a scalar back
  sqrt((a[1] - b[1])^2 + (a[2] - b[2])^2)
}
n <- nrow(dat)
vec <- numeric(n)
for (i in 1:n) {
  # min (not sum): the question asks for the minimum distance
  vec[i] <- min(apply(dat[dat$uniqueID != dat$uniqueID[i], c("x", "y")], 1,
                      function(r) two_dim_euclid(dat[i, c("x", "y")], r)),
                na.rm = TRUE)
  if (i %% 10000 == 0) cat(i, "completed...\n") # progress check since >1mil rows
}
dat$result <- vec

Related

Add a column to dataframe in R based on greater/less than condition on the values in existing columns

I have a dataframe and want to create a new column z populated with a value of "tw" or "ok" for each record: if x > y, z = "ok"; if x < y, z = "tw".
x y
a 1 2
b 2 3
c 5 1
result
x y z
a 1 2 tw
b 2 3 tw
c 5 1 ok
Maybe you can try ifelse() like below:
df <- within(df, z <- ifelse(x > y, "ok", "tw"))
If you do not define the output for the case x == y, maybe you should add the following line:
df$z[df$x == df$y] <- NA
Alternatively you can do it on the dataframe directly:
# creation of dataframe
df <- data.frame(x = c(1, 2, 5), y = c(2, 3, 1))
# column creation of z
df$z[df$x > df$y] <- "ok"
df$z[df$x < df$y] <- "tw"
We can do this directly :
df$z <- c("tw", "ok")[(df$x > df$y) + 1]
df
# x y z
#a 1 2 tw
#b 2 3 tw
#c 5 1 ok
Not exactly clear what you want to do when x == y (above assigns "ok").
We can also use case_when from dplyr to assign values based on various conditions.
library(dplyr)
df %>%
  mutate(z = case_when(x > y ~ "ok",
                       x < y ~ "tw",
                       TRUE ~ NA_character_))
data
df <- structure(list(x = c(1L, 2L, 5L), y = c(2L, 3L, 1L), z = c("tw",
"tw", "ok")), row.names = c("a", "b", "c"), class = "data.frame")
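If an explicit label for the x == y case is wanted, the indexing idea above extends naturally via sign(); a small base R sketch (the "eq" label is just an illustrative choice):

```r
df <- data.frame(x = c(1, 2, 5, 3), y = c(2, 3, 1, 3))
# sign(x - y) is -1, 0, or 1; adding 2 indexes into a three-label vector
df$z <- c("tw", "eq", "ok")[sign(df$x - df$y) + 2]
df$z
# "tw" "tw" "ok" "eq"
```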

How to filter my data.table by condition and by group?

Problem
I work on a data.table where each row is a medical observation. The problem is there are some errors in my data, and I need to correct them before pursuing my analysis. For example, a male patient can have an observation where he is coded as a female.
Solution
My solution is to select the mode (most frequent value) of a variable by the patient. If a patient has 10 observations as a male, and one as female, it is safe to assume that he is a male.
I have found that clever way to do it with data.table.
DATA[j = .N,
     by = .(ID, SEX)][i = base::order(-N),
                      j = .(SEX = SEX[1L]),
                      keyby = ID]
The problem is that when a patient has multiple modes, it just keeps one. So a patient who is 50% male and 50% female will be counted as male, which will lead to bias in the end. I would like to code them as NAs.
The only way I have found to correct this is by using dplyr:
DATA[j = .N,
     by = .(ID, SEX)] %>%
  group_by(ID) %>%
  filter(N == max(N))
and then replacing the SEX value with NA if duplicated. But it takes way longer than data.table, it is not very optimized, and I have a big dataset with a lot of variables that would need to be corrected as well.
Summary
How do I take the mode of a variable by patient and replace it with NA when it is not unique?
Example
ID <- c(rep(x = "1", 6), rep(x = "2", 6))
SEX <- c("M","M","M","M","F","M","M","F","M","F","F","M")
require(data.table)
DATA <- data.table(ID, SEX)
# First method (doesn't work)
DATA[j = .N,
     by = .(ID, SEX)][i = base::order(-N),
                      j = .(SEX = SEX[1L]),
                      keyby = ID]
# Second method (works with dplyr)
require(dplyr)
DATA[j = .N,
     by = .(ID, SEX)] %>%
  group_by(ID) %>%
  filter(N == max(N)) %>%
  mutate(SEX = if_else(condition = duplicated(ID),
                       true = "NA",
                       false = SEX)) %>%
  filter(row_number() == n())
# Applied to my data it took 84.288 seconds
Update
Solution proposed by #Cole based on an idea of #Sindri_baldur:
DATA <- data.table(
  ID = c(rep(x = "1", 6), rep(x = "2", 6)),
  SEX = c("M", "M", "M", "M", "F", "M", "M", "F", "M", "F", "F", NA),
  V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)
our_mode_fac <- function(x) {
  freq <- tabulate(x)
  if (length(freq) == 0 || sum(freq == max(freq)) > 1) {
    NA
  } else {
    levels(x)[which.max(freq)]
  }
}
vars <- c("SEX", "V1")
DATA[j = paste0(vars) := lapply(.SD, as.factor),
     .SDcols = vars][j = vars := lapply(.SD, our_mode_fac),
                     .SDcols = vars,
                     by = ID]
It works perfectly fine. It takes the mode, even when there are more NAs than other values, and replaces the value with NA when there is more than one mode.
It is now also very fast: 11 seconds for 3M+ observations and 1M+ patients (117 seconds with #Sindri_baldur's answer). Thanks a lot to both of you, I'm very grateful!
our_mode <- function(x) {
freq <- table(x)
if (length(freq) == 0 || sum(freq == max(freq)) > 1 ) {
NA
} else {
names(freq)[which.max(freq)]
}
}
vars <- c("SEX", "V1")
DATA[, paste0(vars, "_corrected") := lapply(.SD, our_mode), .SDcols = vars, by = ID]
ID SEX V1 SEX_corrected V1_corrected
1: 1 M a M a
2: 1 M <NA> M a
3: 1 M a M a
4: 1 M a M a
5: 1 F b M a
6: 1 M a M a
7: 2 M b F b
8: 2 F b F b
9: 2 M b F b
10: 2 F c F b
11: 2 F b F b
12: 2 <NA> c F b
Reproducible data
DATA <- data.table(
  ID = c(rep(x = "1", 6), rep(x = "2", 6)),
  SEX = c("M", "M", "M", "M", "F", "M", "M", "F", "M", "F", "F", NA),
  V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)
Note that our_mode() is not optimised for speed. See suggestions by Cole for speed improvements in comments.
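One of those suggested speed-ups can be sketched with match() and tabulate(), which avoids both table() and the factor conversion (the function name and NA handling here are my own, assuming character columns):

```r
our_mode_fast <- function(x) {
  ux <- unique(x[!is.na(x)])      # candidate values with NAs dropped
  if (length(ux) == 0L) return(NA_character_)
  freq <- tabulate(match(x, ux))  # integer counts, much faster than table()
  if (sum(freq == max(freq)) > 1L) NA_character_  # tie between modes -> NA
  else ux[which.max(freq)]
}

our_mode_fast(c("M", "M", "F"))  # "M"
our_mode_fast(c("M", "F"))       # NA (tie)
```

It drops into the same `DATA[, lapply(.SD, ...), by = ID]` call as our_mode() above.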

How to find the given value from the range of values?

I have the data below:
Y z
100-800 a
150-600 b
200-300 c
400-600 d
4000-12000 e
Any help would be really appreciated.
Based on a given value of x (e.g. x = 100), it should find which ranges of Y contain it and give the corresponding values of Y and z. If the given value of x is not in any of the ranges of Y, it should find the nearest range and give the corresponding values of Y and z.
DT[, list(OK = 1 %in% seq(Y, Y)), by = Z]
For given value of X=110
output should be
Y Z
100-800 a
For x=200
Y z
100-800 a
150-600 b
200-300 c
For x=12500
Y z
4000-12000 e
We can write a helper function using tidyr::separate to split the range column. If no indices fall within the range, we compare the value with the lowest and highest values in the dataframe and return the nearest row accordingly.
subset_fun <- function(df, val) {
  df1 <- tidyr::separate(df, Y, c("low", "high"), sep = "-", convert = TRUE)
  inds <- with(df1, val >= low & val <= high)
  if (any(inds)) df[inds, ]
  else if (min(df1$low) > val) df[which.min(df1$low), ]
  else df[which.max(df1$high), ]
}
subset_fun(df, 100)
# Y z
#1 100-800 a
subset_fun(df, 200)
# Y z
#1 100-800 a
#2 150-600 b
#3 200-300 c
subset_fun(df, 12500)
# Y z
#5 4000-12000 e
subset_fun(df, 0)
# Y z
#1 100-800 a
data
df <- structure(list(Y = structure(1:5, .Label = c("100-800", "150-600",
"200-300", "400-600", "4000-12000"), class = "factor"),
z = structure(1:5, .Label = c("a", "b", "c", "d", "e"), class = "factor")),
class = "data.frame", row.names = c(NA, -5L))
We can create a helper and use this to subset:
library(dplyr)
library(tidyr)
find_number <- function(x, high, low) {
  x >= low & x < high # might be able to use between
}
df %>%
  separate(Y, c("Low", "High"), convert = TRUE) -> new_df  # convert = TRUE so the columns are numeric, not character
new_df[new_df %>%
         mutate(Logi = find_number(200, high = High, low = Low)) %>%
         pull("Logi"), ]
Low High z
1 100 800 a
2 150 600 b
3 200 300 c
EDIT: An attempt to automate this process. Using NSE might be a much better option since that would eliminate the need to have the exact same names as in this answer. In other words, redefine the function with a data and column name arguments. For now:
find_number <- function(x) {
  new_df[new_df %>%
           mutate(Logi = x >= Low & x < High,
                  isMax = ifelse(High == max(High) & x > High,
                                 TRUE, Logi)) %>%
           pull("isMax"), ]
}
find_number(12500)
Low High z
5 4000 12000 e
Data:
new_df<-structure(list(Low = c(100, 150, 200, 400, 4000), High = c(800,
600, 300, 600, 12000), z = c("a", "b", "c", "d", "e")), class = "data.frame", row.names = c(NA,
-5L))
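If there are many ranges or many query values, a data.table interval join may scale better than a row-wise logical comparison. A hedged sketch with foverlaps(), reusing the split Low/High idea (find_ranges, qlow, and qhigh are my own names; the nearest-range fallback is still manual):

```r
library(data.table)

ranges <- data.table(low  = c(100, 150, 200, 400, 4000),
                     high = c(800, 600, 300, 600, 12000),
                     z    = c("a", "b", "c", "d", "e"))
setkey(ranges, low, high)  # foverlaps() needs the interval columns as the key

find_ranges <- function(x) {
  q <- data.table(qlow = x, qhigh = x)  # a point is a zero-width interval
  hit <- foverlaps(q, ranges, by.x = c("qlow", "qhigh"), nomatch = 0L)
  if (nrow(hit) > 0) return(hit[, .(low, high, z)])
  # no containing range: fall back to the nearest one
  if (x < ranges[, min(low)]) ranges[which.min(low)] else ranges[which.max(high)]
}

find_ranges(110)    # 100-800 / a
find_ranges(12500)  # 4000-12000 / e
```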

How to create a dynamic number and name of mutate calls in dplyr?

I am making a dynamic permutation function to create order independent parameters. Outside of a function, I have been able to hard code this approach with dplyr. However, I want to generalize it so that I could use the same function to permute 3 factors or 6 factors without typing all of the repeating calls. However, I have not figured out how to make it work.
Here's a simple data frame df of all the permutations of 3 variables:
#> dput(df)
structure(list(var1 = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), var2 = structure(c(2L, 3L, 1L,
3L, 1L, 2L), .Label = c("a", "b", "c"), class = "factor"), var3 = structure(c(3L,
2L, 3L, 1L, 2L, 1L), .Label = c("a", "b", "c"), class = "factor"),
X1 = c(0.5, 0.5, 0.8, 0.8, 0.3, 0.3), X2 = c(0.8, 0.3, 0.5,
0.3, 0.5, 0.8), X3 = c(0.3, 0.8, 0.3, 0.5, 0.8, 0.5)), .Names = c("var1",
"var2", "var3", "X1", "X2", "X3"), row.names = c(NA, -6L), class = "data.frame")
My goal is to get to the average order independent value of each variable. To get there, I need to create two intermediate variables: one a multiplication m1, m2, m3, m4 and one a subtraction s1, s2, s3, s4. The variables m1 and s1 are special, m1 = X1, and s1 = X1-1. However, the others need to refer to the one before: m2 = X2*X1 and s2 = m2-m1.
I tried to combine the ideas from this SO question: R - dplyr - mutate - use dynamic variable names with a lazyeval interp, so that I could dynamically refer to the other variables and also dynamically name the mutated columns. However, it only kept the last column created, and the rename did not work, so I got a single additional column named, for example, X2*X3, which is fine on this example with 3. When I had 5, it gave a single additional column X4*X5.
for (n in 2:n_params) {
  varname <- paste("m", n, sep = ".")
  df <- mutate_(df, .dots = setNames(interp(~one * two,
                                            one = as.name(paste0("X", n - 1)),
                                            two = as.name(paste0("X", n))),
                                     varname))
  df
}
Since I cannot figure out why this does not work, I have set up a series of if statements that calculate the ms and ss.
xx <- data.frame(df) %>%
  mutate(m1 = X1,
         s1 = X1 - 1)
if (n_params >= 2) {
  xx <- data.frame(xx) %>%
    mutate(m2 = m1 * X2,
           s2 = m2 - m1)
}
if (n_params >= 3) {
  xx <- data.frame(xx) %>%
    mutate(m3 = m2 * X3,
           s3 = m3 - m2)
}
if (n_params >= 4) {
  xx <- data.frame(xx) %>%
    mutate(m4 = m3 * X4,
           s4 = m4 - m3)
}
if (n_params >= 5) {
  xx <- data.frame(xx) %>%
    mutate(m5 = m4 * X5,
           s5 = m5 - m4)
}
if (n_params >= 6) {
  xx <- data.frame(xx) %>%
    mutate(m6 = m5 * X6,
           s6 = m6 - m5)
}
It seems like I should be able to write a function that creates this. In pseudocode:
function(n_params) {
  function(x) {
    new_df <- df %>%
      mutate(m1 = X1,
             s1 = X1 - 1)
    for (i in 2:n_params) {
      new_df <- append(call to new_df,
                       mutate(mi = Xi * Xi-1,
                              si = mi - mi-1))
    }
  }
}
However, I cannot figure out how to combine the lazyeval interp and the setNames to allow for referring to the previous mutated value.
I could just leave it in if functions, but I'd love to make this more compact if possible.
The final final output of interest is the average s value over all permutations for each initial variable. I do that in a separate function.
Not the prettiest thing, but it works:
n_params <- 3
xx1 <- df %>%
  mutate(m1 = X1,
         s1 = X1 - 1)
for (i in 2:n_params) {
  xx1 <- xx1 %>%
    mutate_(.dots = setNames(list(varval = paste0("m", i - 1, " * X", i)),
                             paste0("m", i))) %>%
    mutate_(.dots = setNames(list(varval = paste0("m", i, " - m", i - 1)),
                             paste0("s", i)))
}
There are probably much better ways to use lazyeval. Hopefully someone else will show a nicer answer, but this does match the xx produced in your question (for n_params = 3):
identical(xx, xx1)
# [1] TRUE
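mutate_() and lazyeval were later deprecated (dplyr 0.7 moved to tidy evaluation); the same loop can be sketched with rlang's sym() and the := operator. add_perm_cols is my own name, and only the X1..Xn columns from the question are assumed:

```r
library(dplyr)
library(rlang)

# minimal stand-in for the question's df: only the X columns matter here
df <- data.frame(X1 = c(0.5, 0.5, 0.8, 0.8, 0.3, 0.3),
                 X2 = c(0.8, 0.3, 0.5, 0.3, 0.5, 0.8),
                 X3 = c(0.3, 0.8, 0.3, 0.5, 0.8, 0.5))

add_perm_cols <- function(df, n_params) {
  df <- mutate(df, m1 = X1, s1 = X1 - 1)
  for (i in 2:n_params) {
    m_prev <- sym(paste0("m", i - 1))   # symbol for the previous m column
    x_cur  <- sym(paste0("X", i))
    df <- df %>%
      mutate(!!paste0("m", i) := !!m_prev * !!x_cur) %>%
      mutate(!!paste0("s", i) := !!sym(paste0("m", i)) - !!m_prev)
  }
  df
}

xx2 <- add_perm_cols(df, n_params = 3)
```

The string on the left of := names the new column, and !!sym() turns the constructed name back into a column reference, so each iteration can refer to the m column created in the previous one.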

Creating new dataframe using weighted averages from dataframes within list

I have many dataframes stored in a list, and I want to create weighted averages from these and store the results in a new dataframe. For example, with the list:
dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"),
df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")),
.Names = c("df1", "df2"))
In this example, I want to use columns A, B, and Weight for the weighted averages. I also want to move over related data such as Site, and want to sum the number of TRUE and FALSE. My desired result would look something like:
result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"),
A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L,
1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))
Site A.Weight B.Weight Sum.Weight
1 X 4.5 6 2
2 Y 8.0 4 1
The above is just a very simple example, but my real data have many dataframes in the list, and many more columns than just A and B for which I want to calculate weighted averages. I also have several columns similar to Site that are constant in each dataframe and that I want to move to the result.
I'm able to manually calculate weighted averages using something like
weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)
but I'm not sure how I can do this in a shorter, less "manual" way. Does anyone have any recommendations? I've recently learned how to lapply across dataframes in a list, but my attempts have not been so great so far.
The trick is to create a function that works for a single data.frame, then use lapply to iterate across your list. Since lapply returns a list, we'll then use do.call to rbind the resulting objects together:
foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
  means <- t(sapply(data[, meanCols], weighted.mean, w = data[, weightCol]))
  sumWeight <- sum(data[, weightCol])
  # You said all the other data was constant, so we can just grab the first row
  others <- data[1, otherCols, drop = FALSE]
  out <- data.frame(others, means, sumWeight)
  return(out)
}
In action:
do.call(rbind, lapply(dfs, foo))
Site A B sumWeight
df1 X 4.5 6 2
df2 Y 8.0 4 1
Since you said this was a minimal example, here's one approach to expanding this to other columns. We'll use grepl() and use regular expressions to identify the right columns. Alternatively, you could write them all out in a vector. Something like this:
do.call(rbind, lapply(dfs, foo,
                      meanCols = grepl("A|B", names(dfs[[1]])),
                      otherCols = grepl("Site", names(dfs[[1]]))))
Using dplyr:
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
unnest(dfs) %>%
  group_by(Site) %>%
  filter(Weight) %>%
  mutate(Sum = n()) %>%
  select(-Weight) %>%
  summarise_each(funs(mean = mean(., na.rm = TRUE)))
gives the result
# Site A B Sum
#1 X 4.5 6 2
#2 Y 8.0 4 1
Or using data.table
library(data.table)
DT <- rbindlist(dfs)
DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE),
Sum=.N), by = Site, .SDcols = c("A", "B")]
# Site A B Sum
#1: X 4.5 6 2
#2: Y 8.0 4 1
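The filter-then-mean trick above works because Weight is logical; with general numeric weights it would no longer equal a weighted mean. A hedged data.table variant that calls weighted.mean() directly, using the dfs from the question:

```r
library(data.table)

# example list from the question
dfs <- list(df1 = data.frame(A = 4:5, B = c(8L, 4L),
                             Weight = c(TRUE, TRUE), Site = c("X", "X")),
            df2 = data.frame(A = c(6L, 8L), B = c(9L, 4L),
                             Weight = c(FALSE, TRUE), Site = c("Y", "Y")))

DT <- rbindlist(dfs)
# weighted mean of each .SD column, weighted by the group's Weight vector
DT[, c(lapply(.SD, weighted.mean, w = Weight), Sum = sum(Weight)),
   by = Site, .SDcols = c("A", "B")]
#    Site   A B Sum
# 1:    X 4.5 6   2
# 2:    Y 8.0 4   1
```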
Update
In response to #jazzuro's comment, using dplyr 0.3 I am getting:
unnest(dfs) %>%
  group_by(Site) %>%
  summarise_each(funs(weighted.mean = stats::weighted.mean(., Weight),
                      Sum.Weight = sum(Weight)), -starts_with("Weight")) %>%
  select(Site:B_weighted.mean, Sum.Weight = A_Sum.Weight)
# Site A_weighted.mean B_weighted.mean Sum.Weight
#1 X 4.5 6 2
#2 Y 8.0 4 1
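summarise_each() and funs() have since been deprecated; with dplyr >= 1.0 the same result can be sketched with bind_rows() and across() (column names here follow the question's layout rather than the A.Weight/B.Weight names, which could be restored with rename()):

```r
library(dplyr)

# example list from the question
dfs <- list(df1 = data.frame(A = 4:5, B = c(8L, 4L),
                             Weight = c(TRUE, TRUE), Site = c("X", "X")),
            df2 = data.frame(A = c(6L, 8L), B = c(9L, 4L),
                             Weight = c(FALSE, TRUE), Site = c("Y", "Y")))

bind_rows(dfs) %>%
  group_by(Site) %>%
  summarise(across(c(A, B), ~ weighted.mean(.x, Weight)),
            Sum.Weight = sum(Weight))
#   Site     A     B Sum.Weight
#   X      4.5     6          2
#   Y      8.0     4          1
```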
