Retrieve matched observation based on distance algorithm

What I am trying to do is close to propensity score matching (or causal matching, MatchIt) but not quite the same.
I am simply interested in finding and gathering together the closest (pairwise) observations from a dataset with mixed variables (categorical and numerical).
The dataset looks like this:
id child age edu y
1 11011209 0 69 some college 495
2 11011212 0 44 secondary/primary 260
3 11011213 1 40 some college 175
4 11020208 1 47 secondary/primary 0
5 11020212 1 50 secondary/primary 25
6 11020310 0 65 secondary/primary 525
7 11020315 1 43 college 0
8 11020316 1 41 secondary/primary 5
9 11031111 0 49 secondary/primary 275
10 11031116 1 42 secondary/primary 0
11 11031119 0 32 college 425
12 11040801 1 38 secondary/primary 0
13 11040814 0 52 some college 260
14 11050109 0 59 some college 405
15 11050111 1 35 secondary/primary 20
16 11050113 0 51 secondary/primary 40
17 11051001 1 38 college 165
18 11051004 1 36 college 10
19 11051011 0 63 secondary/primary 455
20 11051018 0 44 college 40
What I want is to match the variables {child, age, edu} but not y (nor id).
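Because the dataset has mixed variables, I can use the Gower distance:
library(cluster)   # daisy()
library(dplyr)     # %>%, group_by(), mutate(), ...
library(reshape2)  # melt() on a matrix gives Var1, Var2, value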
# test on first ten observations
dt = dt[1:10, ]
# gower distance
ddmen = daisy(dt[,-c(1,5)], metric = 'gower')
Now, I want to retrieve the closest observations
mg = as.matrix(ddmen)
mgg = mg %>% melt() %>% group_by(Var2) %>% filter(value != 0) %>% mutate(m =
min(value)) %>% mutate(closest = Var1[m == value]) %>% as.data.frame()
close = mgg %>% dplyr::select(Var2, closest, dis = m) %>% distinct()
close gives me
Var2 closest dis
1 1 6 0.37931034
2 2 9 0.05747126
3 3 8 0.34482759
4 4 5 0.03448276
5 5 4 0.03448276
6 6 9 0.18390805
7 7 10 0.34482759
8 8 10 0.01149425
9 9 2 0.05747126
10 10 8 0.01149425
I can merge close to my original data
dt$id = 1:10
dt2 = merge(dt, close, by.x = 'id', by.y = 'Var2', all = T)
Then, bind it
vlist = vector('list', 10)
for(i in 1:10){
vlist[[i]] = dt2[ c( which(dt2$id == i), dt2$closest[dt2$id == i] ), ] %>%
mutate(p = i)
}
bind_rows(vlist)
and get
id child age edu y closest dis p
1 1 0 69 some college 495 6 0.37931034 1
2 6 0 65 secondary/primary 525 9 0.18390805 1
3 2 0 44 secondary/primary 260 9 0.05747126 2
4 9 0 49 secondary/primary 275 2 0.05747126 2
...
p is then the identifier of the matched pair, based on id. Note that an individual can appear in several pairs, because nearest-neighbour matching is not symmetric: the closest match of 1 is 6, while the closest match of 6 is 9 rather than 1.
Questions
First, there is a little bug in the code here:
mgg = mg %>% melt() %>% group_by(Var2) %>% filter(value != 0) %>% mutate(m =
min(value)) %>% mutate(closest = Var1[m == value]) %>% as.data.frame()
I get this error message: "Column closest must be length 19 (the group size) or one, not 2".
The code works for the first 10 observations but not for all 20 (the complete dataset is provided below).
Why?
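A likely cause, offered as a sketch rather than a definitive answer: `Var1[m == value]` returns every row that attains the minimum, so a tie yields two values where `mutate()` expects one per row. In the 20-row data, row 12 (age 38) appears to be equally far from rows 8 (age 41) and 15 (age 35), which share its child and edu values. The example below uses a made-up four-row data frame (`toy`, not the original data) to reproduce such a tie, and keeps a single match per row with `which.min()`, which returns only the first minimum.

```r
library(cluster)  # daisy()

# Hypothetical toy data: rows 2 and 3 are both 3 years away from row 1
toy <- data.frame(
  child = factor(c("1", "1", "1", "0")),
  age   = c(38, 41, 35, 60),
  edu   = factor(c("college", "college", "college", "some college"))
)

d <- as.matrix(daisy(toy, metric = "gower"))
diag(d) <- Inf  # an observation must not be matched to itself

# Equality with the minimum returns every tied row (two here) ...
ties_for_1 <- unname(which(d[1, ] == min(d[1, ])))

# ... whereas which.min() returns exactly one index per row
closest <- unname(apply(d, 1, which.min))
nearest <- data.frame(
  id      = seq_len(nrow(d)),
  closest = closest,
  dis     = d[cbind(seq_len(nrow(d)), closest)]
)
print(ties_for_1)
print(nearest)
```

In the dplyr pipeline, the analogous change would be `Var1[which.min(value)]` in place of `Var1[m == value]`. Whether silently dropping the other tied match is acceptable depends on the analysis.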
Second, is there a package available to do this automatically?
dt = structure(list(id = c(11011209L, 11011212L, 11011213L, 11020208L,
11020212L, 11020310L, 11020315L, 11020316L, 11031111L, 11031116L,
11031119L, 11040801L, 11040814L, 11050109L, 11050111L, 11050113L,
11051001L, 11051004L, 11051011L, 11051018L), child = structure(c(1L,
1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L,
2L, 1L, 1L), .Label = c("0", "1"), class = "factor"), age = c(69L,
44L, 40L, 47L, 50L, 65L, 43L, 41L, 49L, 42L, 32L, 38L, 52L, 59L,
35L, 51L, 38L, 36L, 63L, 44L), edu = structure(c(3L, 2L, 3L,
2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 1L, 1L, 2L,
1L), .Label = c("college", "secondary/primary", "some college"
), class = "factor"), y = c(495, 260, 175, 0, 25, 525, 0, 5,
275, 0, 425, 0, 260, 405, 20, 40, 165, 10, 455, 40)), class = "data.frame",
.Names = c("id",
"child", "age", "edu", "y"), row.names = c(NA, -20L))

Related

Find the "top N" in a group and find the average of the "top N" in R

Rank Laps Average Time
1 1 1 30
2 2 1 34
3 3 1 35
4 1 2 32
5 2 2 33
6 3 2 56
7 4 1 43
8 5 1 23
9 6 1 31
10 4 2 23
11 5 2 88
12 6 2 54
I would like to know how I can group ranks 1-3 and ranks 4-6 and get an average of the "average time" for each lap. Also, I would like this to extend if I have groups 7-9, 10-13, etc.

One option is to use cut to put the different ranks into groups, and add Laps as a grouping variable. Then, you can summarize the data to get the mean.
library(tidyverse)
df %>%
group_by(gr = cut(Rank, breaks = seq(0, 6, by = 3)), Laps) %>%
summarize(avg = mean(Average_Time))
Output
gr Laps avg
<fct> <int> <dbl>
1 (0,3] 1 33
2 (0,3] 2 40.3
3 (3,6] 1 32.3
4 (3,6] 2 55
Or another option if you want the range of ranks displayed for the group:
df %>%
group_by(gr = cut(Rank, breaks = seq(0, 6, by = 3))) %>%
mutate(Rank_gr = paste0(min(Rank), "-", max(Rank))) %>%
group_by(Rank_gr, Laps) %>%
summarize(avg = mean(Average_Time))
Output
Rank_gr Laps avg
<chr> <int> <dbl>
1 1-3 1 33
2 1-3 2 40.3
3 4-6 1 32.3
4 4-6 2 55
Since your groups will be uneven (for example 7-9 and 10-13), you may prefer to define them explicitly with case_when:
df %>%
group_by(gr=case_when(Rank %in% 1:3 ~ "1-3",
Rank %in% 4:6 ~ "4-6",
Rank %in% 7:9 ~ "7-9",
Rank %in% 10:13 ~ "10-13"),
Laps) %>%
summarize(avg = mean(Average_Time))
Data
df <- structure(list(Rank = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L, 4L,
5L, 6L), Laps = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L,
2L), Average_Time = c(30L, 34L, 35L, 32L, 33L, 56L, 43L, 23L,
31L, 23L, 88L, 54L)), class = "data.frame", row.names = c(NA,
-12L))
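As a possible alternative to case_when for uneven groups, cut also accepts an explicit vector of break points with matching labels. The sketch below uses base R only; the break points and labels are assumptions taken from the ranges mentioned in the question (7-9, 10-13), and df is rebuilt from the Data block so the example runs on its own.

```r
df <- data.frame(
  Rank = c(1:3, 1:3, 4:6, 4:6),
  Laps = rep(c(1, 2, 1, 2), each = 3),
  Average_Time = c(30, 34, 35, 32, 33, 56, 43, 23, 31, 23, 88, 54)
)

# Assumed group boundaries: (0,3], (3,6], (6,9], (9,13]
breaks <- c(0, 3, 6, 9, 13)
labels <- c("1-3", "4-6", "7-9", "10-13")

df$gr <- cut(df$Rank, breaks = breaks, labels = labels)

# Mean of Average_Time for each rank group and lap
res <- aggregate(Average_Time ~ gr + Laps, data = df, FUN = mean)
print(res)
```

Groups with no rows (7-9 and 10-13 in this sample) simply do not appear in the result.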

Merging two datasets by an ID without adding new columns that say ".x" or ".y"

Suppose I have two datasets. One main dataset, with many columns of metadata, and one new dataset which will be used to fill in some of the gaps in concentrations in the main dataset:
Main dataset:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 NA NA
1 4 22 0 NA NA
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 NA NA
2 4 37 3 NA NA
New data set to merge:
study_id timepoint concentration1 concentration2
1 3 11 20
1 4 21 35
2 3 7 17
2 4 14 25
Whenever I merge by "study_id" and "timepoint", I get two new columns that are "concentration1.y" and "concentration2.y" while the original columns get renamed as "concentration1.x" and "concentration2.x". I don't want this.
This is what I want:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 11 20
1 4 22 0 21 35
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 7 17
2 4 37 3 14 25
In other words, I want to merge by "study_id" and "timepoint" AND merge the two concentration columns so the data are within the same columns. Please note that both datasets do not have identical columns (dataset 1 has 1000 columns with metadata while dataset2 just has study id, timepoint, and concentration columns that match the concentration columns in dataset1).
Thanks so much in advance.

One option is coalesce from the dplyr package. The join still adds the .x and .y versions of concentration1 and concentration2; coalesce fills each NA from the second data frame, and the suffixed columns are then dropped.
library(tidyverse)
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
mutate(concentration1 = coalesce(concentration1.x, concentration1.y),
concentration2 = coalesce(concentration2.x, concentration2.y)) %>%
select(-concentration1.x, -concentration1.y, -concentration2.x, -concentration2.y)
Or to generalize with multiple concentration columns:
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.[xy]$")) %>%
map_df(reduce, coalesce)
Edit: To prevent the resultant column names from being alphabetized from split.default, you can add an intermediate step of sorting the list based on the first data frame's column name order.
df3 <- df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.[xy]$"))
df3[names(df1)] %>%
map_df(reduce, coalesce)
Output
study_id timepoint age occupation concentration1 concentration2
1 1 1 21 0 3 7
2 1 2 21 0 4 6
3 1 3 22 0 11 20
4 1 4 22 0 21 35
5 2 1 36 3 0 4
6 2 2 36 3 2 11
7 2 3 37 3 7 17
8 2 4 37 3 14 25
Data
df1 <- structure(list(study_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
timepoint = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), age = c(21L,
21L, 22L, 22L, 36L, 36L, 37L, 37L), occupation = c(0L, 0L,
0L, 0L, 3L, 3L, 3L, 3L), concentration1 = c(3L, 4L, NA, NA,
0L, 2L, NA, NA), concentration2 = c(7L, 6L, NA, NA, 4L, 11L,
NA, NA)), class = "data.frame", row.names = c(NA, -8L))
df2 <- structure(list(study_id = c(1L, 1L, 2L, 2L), timepoint = c(3L,
4L, 3L, 4L), concentration1 = c(11L, 21L, 7L, 14L), concentration2 = c(20L,
35L, 17L, 25L)), class = "data.frame", row.names = c(NA, -4L))
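For comparison, the same gap-filling can be sketched in base R without a join, so no .x or .y columns are ever created. This is only an illustration under assumptions: the variable names (keys, fill_cols, idx) are made up for the example, df1 is cut down to a single metadata column, and each (study_id, timepoint) pair is assumed to occur at most once in df2.

```r
df1 <- data.frame(
  study_id = rep(1:2, each = 4),
  timepoint = rep(1:4, times = 2),
  age = c(21, 21, 22, 22, 36, 36, 37, 37),
  concentration1 = c(3, 4, NA, NA, 0, 2, NA, NA),
  concentration2 = c(7, 6, NA, NA, 4, 11, NA, NA)
)
df2 <- data.frame(
  study_id = c(1, 1, 2, 2),
  timepoint = c(3, 4, 3, 4),
  concentration1 = c(11, 21, 7, 14),
  concentration2 = c(20, 35, 17, 25)
)

keys <- c("study_id", "timepoint")
fill_cols <- setdiff(names(df2), keys)  # columns that df2 can fill in

# For each row of df1, the matching row of df2 (NA when there is none)
idx <- match(do.call(paste, df1[keys]), do.call(paste, df2[keys]))

for (col in fill_cols) {
  gap <- is.na(df1[[col]]) & !is.na(idx)  # missing in df1, available in df2
  df1[[col]][gap] <- df2[[col]][idx[gap]]
}
print(df1)
```

Values already present in df1 are left untouched, which mirrors the behaviour of coalesce with the .x column first.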

Valid observations based on conditions [duplicate]

I am trying to work out how to calculate the weighted score for each class each month.
Each class has multiple students and the weight (contribution) of a student's score varies through time.
To be included in the calculation a student must have both score and weight.
I am a bit lost and none of the approaches I have used have worked.
Student Class Jan_18_score Feb_18_score Jan_18_Weight Feb_18_Weight
Adam 1 3 2 150 153
Char 1 5 7 30 60
Fred 1 -7 8 NA 80
Greg 1 2 NA 80 40
Ed 2 1 2 60 80
Mick 2 NA 6 80 30
Dave 3 5 NA 40 25
Nick 3 8 8 12 45
Tim 3 -2 7 23 40
George 3 5 3 65 NA
Tom 3 NA 8 78 50
The overall goal is to calculate the weighted score for each class each month.
Taking Class 1 (first 4 rows) as an example and looking at Jan_18.
- The observations of Adam, Char and Greg are valid since they have both scores and weights. Their scores and weights should be included.
- Fred does not have a Jan_18_Weight, therefore both his Jan_18_score and Jan_18_Weight are excluded from the calculation.
The following calculation should then occur:
= [(3*150)+(5*30)+(2*80)]/ [150+30+80]
= 2.92307
This calculation would be repeated for each class and each month.
A new dataframe something like the following should be the output
Class Jan_18_Weight_Score Feb_18_Weight_Score
1 2.92307 etc
2 etc etc
3 etc etc
There are many columns and many rows.
Any help is appreciated.

Here's a way with tidyverse. The main trick is to replace NA with 0 in "weights" columns and then use weighted.mean() with na.rm = T to ignore NA scores. To do so, you can gather the scores and weights into a single column and then group by Class and month_abb (a calculated field for grouping) and then use weighted.mean().
df %>%
mutate_at(vars(ends_with("Weight")), ~replace_na(., 0)) %>%
gather(month, value, -Student, -Class) %>%
group_by(Class, month_abb = paste0(substr(month, 1, 3), "_Weight_Score")) %>%
summarize(
weight_score = weighted.mean(value[grepl("score", month)], value[grepl("Weight", month)], na.rm = T)
) %>%
ungroup() %>%
spread(month_abb, weight_score)
# A tibble: 3 x 3
Class Feb_Weight_Score Jan_Weight_Score
<int> <dbl> <dbl>
1 1 4.66 2.92
2 2 3.09 1
3 3 7.70 4.11
Data -
df <- structure(list(Student = c("Adam", "Char", "Fred", "Greg", "Ed",
"Mick", "Dave", "Nick", "Tim", "George", "Tom"), Class = c(1L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), Jan_18_score = c(3L,
5L, -7L, 2L, 1L, NA, 5L, 8L, -2L, 5L, NA), Feb_18_score = c(2L,
7L, 8L, NA, 2L, 6L, NA, 8L, 7L, 3L, 8L), Jan_18_Weight = c(150L,
30L, NA, 80L, 60L, 80L, 40L, 12L, 23L, 65L, 78L), Feb_18_Weight = c(153L,
60L, 80L, 40L, 80L, 30L, 25L, 45L, 40L, NA, 50L)), class = "data.frame", row.names = c(NA,
-11L))

Maybe this could be solved in a much better way but here is one Base R option where we perform aggregation twice and then combine the results.
#Separate score and weight columns
score_cols <- grep("score$", names(df))
weight_cols <- grep("Weight$", names(df))
#Replace NA's in corresponding score and weight columns to 0
inds <- is.na(df[score_cols]) | is.na(df[weight_cols])
df[score_cols][inds] <- 0
df[weight_cols][inds] <- 0
#Find sum of weight columns for each class
df1 <- aggregate(.~Class, cbind(df["Class"], df[weight_cols]), sum)
#find sum of multiplication of score and weight columns for each class
df2 <- aggregate(.~Class, cbind(df["Class"], df[score_cols] * df[weight_cols]), sum)
#Get the ratio between two dataframes.
cbind(df1[1], df2[-1]/df1[-1])
# Class Jan_18_score Feb_18_score
#1 1 2.92 4.66
#2 2 1.00 3.09
#3 3 4.11 7.70
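As a check on the Class 1, Jan_18 figure worked out by hand in the question, the calculation can be reproduced for that single group. This is a minimal sketch with the four Class 1 rows typed in directly; the vector names are chosen for the example only.

```r
# Class 1, Jan_18 columns copied from the question's table
score  <- c(Adam = 3, Char = 5, Fred = -7, Greg = 2)
weight <- c(Adam = 150, Char = 30, Fred = NA, Greg = 80)

# A student counts only when both score and weight are present
valid <- !is.na(score) & !is.na(weight)

# Sum of score * weight over the sum of weights, for valid students
by_hand  <- sum(score[valid] * weight[valid]) / sum(weight[valid])
built_in <- weighted.mean(score[valid], weight[valid])

print(c(by_hand = by_hand, built_in = built_in))
```

Both values equal 760/260, roughly 2.923, in line with the 2.92 reported for Class 1 in the outputs above.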

Shapiro.test & plyr: all 'x' values are identical

I'm trying to run a Shapiro-Wilk test on the variable 'Size', using a dataset that I'm subsetting with ddply (by the variables 'Site' and 'Category'), but I keep getting an error message.
Here's a sample of my dataset (d). I have 4237 observations with 9 categories and 13 sites:
Site Genus Size Category
Arn01 ACR 4 ACR
Arn01 ACR 7 ACR
Arn02 ACR 3 ACR
I created a function for the Shapiro-Wilk test:
shap.w <- function(input){ #shapiro wilk test function
if(sum(!is.na(input$Size)) > 3 & sum(!is.na(input$Size)) < 5000){
p <- shapiro.test(input$Size)$p.value
return(p)}else{return(NA)} }
Then, I try to apply the function to subsets of my data using ddply:
sw_test <- ddply(d, .(Site, Category), .fun = shap.w)
But when I do, I get an error message that says:
Error in shapiro.test(input$Size) : all 'x' values are identical
Even though they're clearly not. Any help/advice would be much appreciated.
ETA: output of dput(d[1:20,]):
structure(list(Site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Arn01n",
"Arn02n", "Arn03n", "Arn04n", "Arn05n", "Arn06n", "Arn07n", "Arn08n",
"Arn09n", "Arn10n", "Arn11n", "Arn12n", "Arn13n"), class = "factor"),
Genus = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 30L, 30L, 30L, 30L), .Label = c("ACA",
"ACR", "AST", "COS", "CYP", "ECH", "FUN", "FVA", "FVT", "GAR",
"GON", "HEL", "HYD", "ISO", "LEA", "LEO", "LEP", "LOB", "MER",
"MNT", "MST", "MYC", "PAV", "PBR", "PLA", "PLAT", "POC",
"POD", "PRE", "PRM", "PRS", "PSA", "SAR", "STY"), class = "factor"),
Size = c(4, 2, 4, 4, 3, 5, 5, 4, 4, 4, 4, 3, 6, 3, 4, 5,
2, 3, 3, 6), Category = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 8L, 8L, 8L, 8L), .Label = c("ACR",
"FAV", "FUN", "HEL", "ISO", "MNT", "POC", "PRM", "PRS"), class = "factor")),
.Names = c("Site",
"Genus", "Size", "Category"), row.names = c(NA, 20L), class = "data.frame")
ETA: output of table(d$Size):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25 26 27 28 29 30 31 33 35 36 37 38 39
14 271 525 548 521 424 201 206 50 357 23 95 36 7 171 11 14 30 4 145 11 21 5 46 4 1 5 1 95 1 2 31 3 1 2 1
40 41 42 43 44 45 46 48 50 51 53 55 56 57 60 62 63 65 66 70 72 75 76 80 82 83 85 88 90 94 95 100 105 110 120 125
80 1 9 3 4 22 1 4 42 1 1 4 1 3 64 3 5 9 4 13 1 2 1 20 2 2 2 1 5 1 2 17 1 2 6 2
128 130 143 150 155 160 180 200 230 300 890 920
1 1 1 1 1 1 1 2 1 1 1 1

Note that if you return NA, then is.numeric will give FALSE: Try is.numeric(NA) to see this.
You could return NA_real_ instead
is.numeric(NA)
[1] FALSE
is.numeric(NA_real_)
[1] TRUE
It's still an NA though:
is.na(NA_real_)
[1] TRUE
However, as.numeric should also fix that problem (perhaps double check what's being returned to ddply by your function given the inputs)

Okay, thanks to the help I received in the comments, I was able to solve this problem by updating the code for the function to read:
shap.w <- function(input){ # Shapiro-Wilk test function
size <- input$Size[!is.na(input$Size)] # non-missing sizes only
if(length(unique(size)) > 3 && length(size) <= 5000){
p <- shapiro.test(size)$p.value
return(p)}else{return(NA_real_)} }
This skips the Site/Category combinations with 3 or fewer distinct Size values, which includes the groups where every value is identical, and any with more than 5,000 observations (although I won't have any that large in this dataset). Once I updated this, the next line ran without any problems. Thank you all for your help!
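To illustrate why the original check was not enough, here is a small sketch in base R. It assumes a made-up data frame with two sites, one of which has five identical sizes; the helper name safe_shapiro is hypothetical. A group can pass a "more than 3 observations" test and still make shapiro.test stop because all its values are the same.

```r
# Hypothetical helper: returns NA_real_ instead of raising an error
safe_shapiro <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  # shapiro.test needs 3 to 5000 values that are not all identical
  if (n < 3 || n > 5000 || length(unique(x)) < 2) return(NA_real_)
  shapiro.test(x)$p.value
}

d <- data.frame(
  Site = rep(c("A", "B"), each = 5),
  Size = c(4, 2, 5, 3, 6,   # site A: varied sizes
           4, 4, 4, 4, 4)   # site B: all identical
)

# The unguarded call fails on site B
unguarded <- try(shapiro.test(d$Size[d$Site == "B"]), silent = TRUE)

# The guarded version returns one p-value (or NA) per site
p <- sapply(split(d$Size, d$Site), safe_shapiro)
print(p)
```

The same function could be passed to ddply after wrapping it to take the Size column of each piece.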

Subset of data with criteria of two columns

I would like to create a subset of data that consists of Units that have a higher score in QTR 4 than QTR 1 (upward trend). Doesn't matter if QTR 2 or 3 are present.
Unit QTR Score
5 4 34
1 1 22
5 3 67
2 4 78
3 2 39
5 2 34
1 2 34
5 1 67
1 3 70
1 4 89
3 4 19
Subset would be:
Unit QTR Score
1 1 22
1 2 34
1 3 70
1 4 89
I've tried variants of something like this:
upward_subset <- subset(mydata,Unit if QTR=4~Score > QTR=1~Score)
Thank you for your time

If the dataframe is named "d", then this succeeds on your test set:
# TRUE only for units that have both QTR 1 and QTR 4 and a higher score in QTR 4
keep <- sapply(split(d, d$Unit),
function(dd) isTRUE(dd$Score[dd$QTR == 4] > dd$Score[dd$QTR == 1]))
d[d$Unit %in% names(keep)[keep], ]
#-------------
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89

An alternative in two steps:
result <- unlist(
by(
test,
test$Unit,
function(x) x$Score[x$QTR==4] > x$Score[x$QTR==1])
)
test[test$Unit %in% names(result[result==TRUE]),]
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89

A solution using data.table (Probably there are better versions than what I have at the moment).
Note: Assuming a QTR value for a given Unit is unique
Data:
df <- structure(list(Unit = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L,
3L), QTR = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L), Score = c(34L,
22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)), .Names = c("Unit",
"QTR", "Score"), class = "data.frame", row.names = c(NA, -11L
))
Solution:
library(data.table)
dt <- data.table(df, key=c("Unit", "QTR"))
dt[, Score[Score[QTR == 4] > Score[QTR == 1]], by=Unit]
Unit V1
1: 1 22
2: 1 34
3: 1 70
4: 1 89
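Another way to express the same condition, sketched in base R: put the QTR 1 and QTR 4 scores side by side with merge, so that units missing either quarter drop out automatically, then keep the units whose QTR 4 score is higher. The suffixes and intermediate names are chosen for the example; as above, each Unit is assumed to have at most one row per QTR.

```r
d <- data.frame(
  Unit  = c(5, 1, 5, 2, 3, 5, 1, 5, 1, 1, 3),
  QTR   = c(4, 1, 3, 4, 2, 2, 2, 1, 3, 4, 4),
  Score = c(34, 22, 67, 78, 39, 34, 34, 67, 70, 89, 19)
)

q1 <- d[d$QTR == 1, c("Unit", "Score")]
q4 <- d[d$QTR == 4, c("Unit", "Score")]

# Inner join: only units with both a QTR 1 and a QTR 4 row survive
both <- merge(q1, q4, by = "Unit", suffixes = c("_q1", "_q4"))
up_units <- both$Unit[both$Score_q4 > both$Score_q1]

# All quarters for the units with an upward trend, in a readable order
upward <- d[d$Unit %in% up_units, ]
upward <- upward[order(upward$Unit, upward$QTR), ]
print(upward)
```

Unlike the data.table result, this keeps the QTR column alongside Unit and Score.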
