Identifying Duplicate/Unique Teams (and Restructuring Data) in R - r

I have a data set that looks like this:
Person Team
1 30
2 30
3 30
4 30
11 40
22 40
1 50
2 50
3 50
4 50
15 60
16 60
17 60
1 70
2 70
3 70
4 70
11 80
22 80
My overall goal is to organize that team identification codes so that it is easy to see which teams are duplicates of one another and which teams are unique. I want to summarize the data so that it looks like this:
Team Duplicate1 Duplicate2
30 50 70
40 80
60
As you can see, teams 30, 50, and 70 have identical members, so they share a row. Similarly, teams 40 and 80 have identical members, so they share a row. Only team 60 (in this example) is unique.
In situations where teams are duplicated, I don't care which team id goes in which column. Also, there may be more than 2 duplicates of a team. Teams range in size from 2 members to 8 members.

This answer gives the output data format you asked for. I left the duplicate teams in a single variable because I think it's a better way to handle an arbitrary number of duplicates.
require(dplyr)
df %>%
arrange(Team, Person) %>% # this line is necessary in case the rest of your data isn't sorted
group_by(Team) %>%
summarize(players = paste0(Person, collapse = ",")) %>%
group_by(players) %>%
summarize(teams = paste0(Team, collapse = ",")) %>%
mutate(
original_team = ifelse(grepl(",", teams), substr(teams, 1, gregexpr(",", teams)[[1]][1]-1), teams),
dup_teams = ifelse(grepl(",", teams), substr(teams, gregexpr(",", teams)[[1]][1]+1, nchar(teams)), NA)
)
The result:
Source: local data frame [3 x 4]
players teams original_team dup_teams
1 1,2,3,4 30,50,70 30 50,70
2 11,22 40,80 40 80
3 15,16,17 60 60 NA

Not exactly the format you're wanting, but pretty useful:
# using MrFlick's data
library(dplyr)
dd %>% group_by(Team) %>%
arrange(Person) %>%
summarize(team.char = paste(Person, collapse = "_")) %>%
group_by(team.char) %>%
arrange(team.char, Team) %>%
mutate(duplicate = 1:n())
Source: local data frame [6 x 3]
Groups: team.char
Team team.char duplicate
1 40 11_22 1
2 80 11_22 2
3 60 15_16_17 1
4 30 1_2_3_4 1
5 50 1_2_3_4 2
6 70 1_2_3_4 3
(Edited in the arrange(Person) line in case the data isn't already sorted, got the idea from #Reed's answer.)

Using this for your sample data
dd<-structure(list(Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L,
4L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L), Team = c(30L, 30L,
30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L, 60L, 60L, 60L, 70L, 70L,
70L, 70L, 80L, 80L)), .Names = c("Person", "Team"),
class = "data.frame", row.names = c(NA, -19L))
You could try a table()/interaction() to find duplicate groups. For example
tt <- with(dd, table(Team, Person))
grp <- do.call("interaction", c(data.frame(unclass(tt)), drop=TRUE))
split(rownames(tt), grp)
this returns
$`1.1.1.1.0.0.0.0.0`
[1] "30" "50" "70"
$`0.0.0.0.0.1.1.1.0`
[1] "60"
$`0.0.0.0.1.0.0.0.1`
[1] "40" "80"
so the group "names" are really just indicators for membership for each person. You could easily rename them if you like with setNames(). But here it collapse the appropriate teams.

Two more base R options (though not exactly the desired output):
DF2 <- aggregate(Person ~ Team, DF, toString)
> split(DF2$Team, DF2$Person)
$`1, 2, 3, 4`
[1] 30 50 70
$`11, 22`
[1] 40 80
$`15, 16, 17`
[1] 60
Or
( DF2$DupeGroup <- as.integer(factor(DF2$Person)) )
Team Person DupeGroup
1 30 1, 2, 3, 4 1
2 40 11, 22 2
3 50 1, 2, 3, 4 1
4 60 15, 16, 17 3
5 70 1, 2, 3, 4 1
6 80 11, 22 2
Note that the expected output as shown in the question would either require to add NAs or empty strings in some of the columns entries because in a data.frame, all columns must have the same number of rows. That is different for lists in, as you can see in some of the answers.
The second option, but using data.table, since aggregate tends to be slow for large data:
library(data.table)
setDT(DF)[, toString(Person), by=Team][,DupeGroup := .GRP, by=V1][]
Team V1 DupeGroup
1: 30 1, 2, 3, 4 1
2: 40 11, 22 2
3: 50 1, 2, 3, 4 1
4: 60 15, 16, 17 3
5: 70 1, 2, 3, 4 1
6: 80 11, 22 2

Using uniquecombs from the mgcv package:
library(mgcv)
library(magrittr) # for the pipe %>%
# Using MrFlick's data
team_names <- sort(unique(dd$Team))
unique_teams <- with(dd, table(Team, Person)) %>% uniquecombs %>% attr("index")
printout <- unstack(data.frame(team_names, unique_teams))
> printout
$`1`
[1] 60
$`2`
[1] 40 80
$`3`
[1] 30 50 70
Now you could use something like this answer to print it in tabular form (note that the groups are column-wise, not row-wise as in your question):
attributes(printout) <- list(names = names(printout)
, row.names = 1:max(sapply(printout, length))
, class = "data.frame")
> printout
1 2 3
1 60 40 30
2 <NA> 80 50
3 <NA> <NA> 70
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs

Related

Calculate a new dataframe via applying calculations to rows in R

I want to re-work the data that I have in a dataframe based on adding certain values up, and to do that to all of the (numerical) columns in the same way.
In code, I have created a dataframe that is structured a bit like this:
library(tibble)
df_in <- tribble(~names, ~a, ~a_pc, ~b, ~b_pc,
"Three star", 1L, 1, 2L, 1,
"Two star", 5L, 5, 12L, 6,
"One star", 6L, 6, 100L, 50,
"No star", 88L, 88, 86L, 43,
"Empty", 0L, 0, 0L, 0,
"Also empty", 0L, 0, 0L, 0)
In my output I want to have one row that contains the sums of three rows in the input dataframe, another row that contains the sum of two of them, and one that contains the contents of a row from the original (but renamed).
I also want to keep other rows if they have numbers but to drop them if they are empty. I would prefer to do that programmatically, but can do it manually with indexing if need be, so that's a bit less important.
My desired output would be a bit like this:
df_out <- tribble(~names, ~a, ~a_pc, ~b, ~b_pc,
"Any stars", 12L, 12, 114L, 57,
"... of which at least 2 stars", 6L, 6, 14L, 7,
"... of which 3 stars", 1L, 1, 2L, 1,
"No star", 88L, 88, 86L, 43)
For example, that 12L in the top left (meaning column a, "Any stars") is the sum of the 1L, 5L and 6L entries in the a column of the input.
I want to do this merging of rows at this stage in my processing because it's important to do it after I've already calculated the percentage columns (..._pc in the example). You'll see that in the output the percentage columns add to more than 100, which is correct because there is deliberately some 'double counting' - things can correctly show up in multiple rows if they meet the conditions.
Edit to add: Note that the labels I am using in the $names column of the test dataset df_in are not the real labels I have in my real situation. I imagine that a workable solution to this will be able to somehow take a collection of vectors that specify sets of rows and another collection of the same number of strings to label the new rows, and process through them. I might be able to define the sets of rows and the associated names like this:
set_1 <- c("Three star", "Two star", "One star")
set_2 <- c("Three star", "Two star")
set_3 <- "Three star"
set_4 <- "No star"
new_name_1 <- "Any stars"
new_name_2 <- "... of which at least 2 stars"
new_name_3 <- "... of which 3 stars"
new_name_4 <- "No star"
We may use imap to loop over patterns (as some cases are overlapping) and do the group by sum across those columns (after filtering the rows)
library(purrr)
library(stringr)
imap_dfr(setNames(c('(?<!No) star', 'Two|Three', 'Three', 'Empty|No'),
c("Any stars", "... of which at least 2 stars",
"... of which 3 stars", "No star" )), ~ df_in %>%
filter(str_detect(names, regex(.x, ignore_case = TRUE))) %>%
group_by(names = .y) %>%
summarise(across(everything(), sum)))
-ouptut
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
OP's expected
> df_out
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
Update
If the OP is passing a custom set of names
map2_dfr(mget(ls(pattern = '^set_\\d+')),
mget(ls(pattern = '^new_name_\\d+')),
~ df_in %>%
filter(names %in% .x) %>%
group_by(names = .y) %>%
summarise(across(everything(), sum)))
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43

How to join and bind similar dataframes in R, iterative solutions welcome

In R markdown through R Studio (R v. 4.0.3), I'm looking for a better solution to combining similarly structured dataframes while keeping all rows and matching entries on a key. Piping full_join() into a filter into a bind_rows() directly wasn't working, possibly because of the error message:
Error: Can't combine ..1$term_code 'character> and ..2$term_code '<integer.
I have 23 dataframes (let's call these "semester data") of data I'm looking to combine into a single dataframe (intended to be a single dataset of individuals outcomes from semester-to-semester).
Each semester dataframe is roughly 3000-4000 observations (individuals) with 45-47 variables of relevant data. A simplified example of a semester (or term) dataframe is shown below.
Simplified example of a "semester" dataframe:
id
ACT_math
course_code
section_code
term_code
grade
term_GPA
0001
23
101
001
FA12
3.45
3.8
0002
28
201
003
FA12
3.2
3.4
Individuals will show up in multiple semester dataframes as they progress through the program (taking course 101 in the fall and course 102 in the spring).
I want to use the dplyr full_join() to match these individuals on an ID key.
Using the suffix argument, I hope to keep track of which semester and course a set of data (grade, term_GPA, etc) for an individual comes from.
There's some data (ACT score, gender, state residency, etc) that is the stable for an individual across semester dataframes. Ideally I could take the first input and drop the rest, but if I had to clean this afterwards, that's fine.
I started by defining an object programatic_database using the first semester of data SP11. To cut down on the duplication of stable data for an individual, I selected the relevant columns that I wanted to join.
programmatic_database <- programmatic_database %>%
full_join(select(fa12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".sp11", ".fa12"), keep = FALSE, name = "id")
However, every semester new students join the program. I would like to add these entries to the bottom of the growing programmatic_database.
I'm also looking to use rbind() or bind_rows() to add these individuals to the bottom of the programmatic_database, along with their relevant data.
After full_join(), I'm filtering out the entries that have already been added horizontally to the dataframe, then piping the remaining entries into bind_rows()
programmatic_database <- fa12[!which(fa12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, fa12)
Concatenated example of what my code is producing after several iterations:
id
ACT_math
course_code
section_code
section_code.db
section_code.db.db
term_code
grade.sp11
grade.fa12
grade.sp13
grade.sp15
term_GPA.sp11
term_GPA.fa12
term_GPA.sp15
0001
23
102
001
001
001
FA12
3.45
3.8
3.0
-
3.8
3.7
-
0002
28
201
003
003
003
FA12
3.2
3.4
3.0
-
3.8
3.7
-
1020
28
201
003
003
003
FA12
3.2
3.4
-
-
3.8
3.7
-
6783
30
101
-
-
-
SP15
-
-
-
3.8
-
-
4.0
where I have successfully added horizontally for students 0001 and 0002 for outcomes in subsequent courses in subsequent semesters. I have also managed to add vertically, like with student 6783, leaving blanks for previous semesters before they enrolled but still adding the relevant columns.
Questions:
Is there a way to pipe full_join() into a filter() into a bind_rows() without running into these errors?
rbind number of columns do not match
OR
Error: Can't combine ..1$term_code 'character> and ..2$term_code '<integer.
Is there a easy way to keep certain columns and only add the suffix ".fa12" to certain columns? As you can see, the .db is piling up.
Is there any way to automate this? Loops aren't my strong suit, but I'm sure there's a better-looking code than doing each of the 23 joins/binds by hand.
Thank you for assistance!
Current code for simplicity:
#reproducible example
fa11 <- structure(list(id = c("1001", "1002", "1003",
"1013"), act6_05_composite = c(33L, 26L, 27L, 25L), course_code = c("101",
"101", "101", "101"), term_code = c("FA11", "FA11", "FA11", "FA11"
), section_code = c(1L, 1L, 1L, 1L), grade = c(4, 0, 0, 2.5
), repeat_status_flag = c(NA, "PR", NA, NA), class_code = c(1L,
1L, 1L, 1L), cum_atmpt_credits_prior = c(16, 0, 0, 0), cum_completed_credits_prior = c(0L,
0L, 0L, 0L), cum_passed_credits_prior = c(16, 0, 0, 0), cum_gpa_prior = c(0,
0, 0, 0), cum_atmpt_credits_during = c(29, 15, 18, 15), cum_completed_credits_during = c(13L,
1L, 10L, 15L), cum_passed_credits_during = c(29, 1, 14, 15),
term_gpa = c(3.9615, 0.2333, 2.3214, 2.9666), row.names = c(NA, 4L
), class = "data.frame")
sp12 <- structure(list(id = c("1007", "1013", "1355",
"2779", "2302"), act6_05_composite = c(24L, 26L, 25L, 24L,
24L), course_code = c(101L, 102L, 101L, 101L, 101L
), term_code = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_), section_code = c(1L, 1L, 1L, 1L, 1L), grade = c(2,
2.5, 2, 1.5, 3.5), repeat_status_flag = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), class_code = c(2L, 2L, 1L, 2L, 2L), cum_atmpt_credits_prior = c(44,
43, 12, 43, 30), cum_completed_credits_prior = c(41L, 43L,
12L, 43L, 12L), cum_passed_credits_prior = c(41, 43, 12,
43, 30), cum_gpa_prior = c(3.3125, 3.186, 3.5416, 3.1785,
3.8636), cum_atmpt_credits_during = c(56, 59, 25, 64, 43),
cum_completed_credits_during = c(53L, 56L, 25L, 56L, 25L),
cum_passed_credits_during = c(53, 59, 25, 64, 43), term_gpa = c(2.8333,
3.423, 3.1153, 2.1923, 3.6153), row.names = c(NA,
5L), class = "data.frame")
# make object from fall 2011 semester dataframe
programmatic_database <- fa11
# join the spring 2012 semester dataframe by id using select variables and attaching relevant suffix
programmatic_database <- programmatic_database %>%
full_join(select(sp12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".fa11", ".sp12"), keep = FALSE, name = "id")
#view results of join, force integer type on certain variables as needed (see error above)
#filter the joined entries from fall 2012 database, then bind the remaining entries to the bottom of the growing dataset
programmatic_database <- sp12[!which(sp12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, sp12)
It would be possible to use bind_rows here if you make the column types consistent between tables. For instance, you could make a function to re-type any particular columns that aren't consistent in your original data. (That might also be something you could fix upstream as you read it in.)
library(dplyr)
set_column_types <- function(df) {
df %>%
mutate(term_code = as.character(term_code),
course_code = as.character(course_code))
}
bind_rows(
fa11 %>% set_column_types(),
sp12 %>% set_column_types() %>% mutate(term_code = "SP12")
)
This will stack your data into a relatively "long" format, like below. You may want to then reshape it depending on what kind of subsequent calculations you want to do.
id act6_05_composite course_code term_code section_code grade repeat_status_flag class_code cum_atmpt_credits_prior cum_completed_credits_prior cum_passed_credits_prior cum_gpa_prior cum_atmpt_credits_during cum_completed_credits_during cum_passed_credits_during term_gpa
1 1001 33 101 FA11 1 4.0 <NA> 1 16 0 16 0.0000 29 13 29 3.9615
2 1002 26 101 FA11 1 0.0 PR 1 0 0 0 0.0000 15 1 1 0.2333
3 1003 27 101 FA11 1 0.0 <NA> 1 0 0 0 0.0000 18 10 14 2.3214
4 1013 25 101 FA11 1 2.5 <NA> 1 0 0 0 0.0000 15 15 15 2.9666
5 1007 24 101 SP12 1 2.0 <NA> 2 44 41 41 3.3125 56 53 53 2.8333
6 1013 26 102 SP12 1 2.5 <NA> 2 43 43 43 3.1860 59 56 59 3.4230
7 1355 25 101 SP12 1 2.0 <NA> 1 12 12 12 3.5416 25 25 25 3.1153
8 2779 24 101 SP12 1 1.5 <NA> 2 43 43 43 3.1785 64 56 64 2.1923
9 2302 24 101 SP12 1 3.5 <NA> 2 30 12 30 3.8636 43 25 43 3.6153

Percent change for grouped subjects at multiple timepoints R

id timepoint dv.a
1 baseline 100
1 1min 105
1 2min 90
2 baseline 70
2 1min 100
2 2min 80
3 baseline 80
3 1min 80
3 2min 90
I have repeated measures data for a given subject in long format as above. I'm looking to calculate percent change relative to baseline for each subject.
id timepoint dv pct.chg
1 baseline 100 100
1 1min 105 105
1 2min 90 90
2 baseline 70 100
2 1min 100 143
2 2min 80 114
3 baseline 80 100
3 1min 80 100
3 2min 90 113
df <- expand.grid( time=c("baseline","1","2"), id=1:4)
df$dv <- sample(100,12)
df %>% group_by(id) %>%
mutate(perc=dv*100/dv[time=="baseline"]) %>%
ungroup()
You're wanting to do something for each 'id' group, so that's the group_by, then you need to create a new column, so there's a mutate. That new variable is the old dv, scaled by the value that dv takes at the baseline - hence the inner part of the mutate. And finally it's to remove the grouping you'd applied.
Try creating a helper column, group and arrange on that. Then use the window function first in your mutate function:
df %>% mutate(clean_timepoint = str_remove(timepoint,"min") %>% if_else(. == "baseline", "0", .) %>% as.numeric()) %>%
group_by(id) %>%
arrange(id,clean_timepoint) %>%
mutate(pct.chg = (dv / first(dv)) * 100) %>%
select(-clean_timepoint)
in Base Ryou can do this
for(i in 1:(NROW(df)/3)){
df[1+3*(i-1),4] <- 100
df[2+3*(i-1),4] <- df[2+3*(i-1),3]/df[1+3*(i-1),3]*100
df[3+3*(i-1),4] <- df[3+3*(i-1),3]/df[1+3*(i-1),3]*100
}
colnames(df)[4] <- "pct.chg"
output:
> df
id timepoint dv.a pct.chg
1 1 baseline 100 100.0000
2 1 1min 105 105.0000
3 1 2min 90 90.0000
4 2 baseline 70 100.0000
5 2 1min 100 142.8571
6 2 2min 80 114.2857
7 3 baseline 80 100.0000
8 3 1min 80 100.0000
9 3 2min 90 112.5000
Base R solution: (assuming "baseline" always appears as first record per group)
data.frame(do.call("rbind", lapply(split(df, df$id),
function(x){x$pct.change <- x$dv/x$dv[1]; return(x)})), row.names = NULL)
Data:
df <- structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
timepoint = c(
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min"
),
dv = c(100L, 105L, 90L, 70L, 100L, 80L, 80L, 80L, 90L)
),
class = "data.frame",
row.names = c(NA,-9L)
)

Finding closest points between multiple datasets using Ellipsoidal/Vincenty

Note: This question is a follow up to a previous question: r - Finding closest coordinates between two large data sets.
I am aiming to identify the nearest entry in dataset 2 to each entry in dataset 1 based on the coordinates in both datasets. Dataset 1 contains 180,000 rows (only 1,800 unique coordinates) and dataset 2 contains contains 4,500 rows (full 4,500 unique coordinates).
The previously referenced post contains a solution the problem, however it uses RANN::nn2 which uses Euclidean distance as opposed to the aim of using Ellipsoidal/Vincenty.
Current code:
df1[ , c(4,5)] <- as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))
df1[,4] <- df2[df1[, 4], 1]
# id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
# 1 1 52.88144 -2.873778 44 0.7990743
# 2 2 57.80945 -2.234544 5688 2.1676868
# 3 4 34.02335 -3.098445 61114 1.4758202
# 4 5 63.80879 -2.439163 23 4.2415854
# 5 6 53.68881 -7.396112 54 3.6445416
# 6 7 63.44628 -5.162345 23 2.3577811
# 7 8 21.60755 -8.633113 440 8.2123762
# 8 9 78.32444 3.813290 76 11.4936496
# 9 10 66.85533 -3.994326 55 1.9296370
# 10 3 51.62354 -8.906553 54 3.2180026
I suspect that the solution would involve geosphere::distVincentyEllipsoid but I am unsure as to how to integrate it into the existing code.
Data:
r details
platform x86_64-w64-mingw32
version.string R version 3.5.3 (2019-03-11)
data set 1 input (not narrowed down to unique coordinates)
df1 <- structure(list(id = c(1L, 2L, 4L, 5L,
6L, 7L, 8L, 9, 10L, 3L),
HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529,
63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207,
78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822,
-2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454,
-5.162345043546359, -8.63311254098095, 3.813289888829932,
-3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")
data set 2 input
df2 <- structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L,
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432,
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA,
10L), class = "data.frame")
Using distVincentyEllipsoid function:
library(geosphere)
t(
apply(
apply(df1[,c(3,2)], 1, function(mrow){distVincentyEllipsoid(mrow, df2[,c(3,2)])}),
2, function(x){ c(SRC_ID=df2[which.min(x),1],distance=min(x))}
)
)
SRC_ID distance
1 44 74680.48
2 5688 238553.51
3 61114 137385.18
4 23 340642.70
5 44 308458.73
6 23 256176.88
7 440 908292.28
8 76 1064419.47
9 55 185119.29
10 54 251580.45
Just use df1[,c(4,5)] <- t(apply(... to assign the values to the column of df1
Using rgeos::gDistance. This is Cartesian distance but starting from the solution below, I managed to post the updated answer above;
library(sp);library(rgeos)
#convert to spatial datasets
df1rgsp <- SpatialPointsDataFrame(df1[,c(3,2)], df1[,-c(3,2)])
df2rgsp <- SpatialPointsDataFrame(df2[,c(3,2)], data.frame(SRC_ID=df2[,1]))
#apply it on each rows
#find the minimum value and the corresponding row number
#transform it to become to columns and assign it to the columns of `df1`
df1[,c(4,5)] <- t( apply(gDistance(df1rgsp, df2rgsp, byid=TRUE), 1, function(x){
c(SRC_ID=which.min(x),distance=min(x))}))
#replace row numbers with `SRC_ID
df1[,4] <- df2[as.integer(df1[, 4]), 1] #same as what you have in the Q
# id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
# 1 1 52.88144 -2.873778 440 1.9296370
# 2 2 57.80945 -2.234544 61114 3.2180026
# 3 4 34.02335 -3.098445 21 2.3577811
# 4 5 63.80879 -2.439163 23 8.8794997
# 5 6 53.68881 -7.396112 55 0.7990743
# 6 7 63.44628 -5.162345 440 3.4316239
# 7 8 21.60755 -8.633113 5688 11.4936496
# 8 9 78.32444 3.813290 54 2.1676868
# 9 10 66.85533 -3.994326 23 6.1545391
# 10 3 51.62354 -8.906553 23 1.4758202

Matching dataframes with data.table

I need to fill a matrix (MA) with information from a long data frame (DF) using another matrix as identifier (ID.MA).
An idea of my three matrices:
MA.ID creates an identifier to look in the big DF the needed variables:
a b c
a ID.aa ID.ab ID.ac
b ID.ba ID.bb ID.bc
c ID.ca ID.cb ID.cc
The original big data frame has useless information but has also the rows that are useful for me to fill the target MA matrix:
ID 1990 1991 1992
ID.aa 10 11 12
ID.ab 13 14 15
ID.ac 16 17 18
ID.ba 19 20 21
ID.bb 22 23 24
ID.bc 25 26 27
ID.ca 28 29 30
ID.cb 31 32 33
ID.cc 34 35 36
ID.xx 40 40 55
ID.xy 50 51 45
....
MA should be filled with cross-information. In my example it should look like that for a chosen column of DF (let's say, 1990):
a b c
a 10 13 16
b 19 22 25
c 28 31 34
I've tried to use match but honestly it didn't work out:
MA$a = DF[match(MA.ID$a, DF$ID),2]
I was recommended to use the data.table package but I couldn't see how that would help me.
Anyone has any good way to approach this problem?
Supposing that your input are dataframes, then you could do the following:
library(data.table)
setDT(ma)[, lapply(.SD, function(x) x = unlist(df[match(x,df$ID), "1990"]))
, .SDcols = colnames(ma)]
which returns:
a b c
1: 10 13 16
2: 19 22 25
3: 28 31 34
Explanation:
With setDT(ma) you transform the dataframe into a datatable (which is an enhanced dataframe).
With .SDcols=colnames(ma) you specify on which columns the transformation has to be applied.
lapply(.SD, function(x) x = unlist(df[match(x,df$ID),"1990"])) performs the matching operation on each column specified with .SDcols.
An alternative approach with data.table is first transforming ma to a long data.table:
ma2 <- melt(setDT(ma), measure.vars = c("a","b","c"))
setkey(ma2, value) # set key by which 'ma' has to be indexed
setDT(df, key="ID") # transform to a datatable & set key by which 'df' has to be indexed
# joining the values of the 1990 column of df into
# the right place in the value column of 'ma'
ma2[df, value := `1990`]
which gives:
> ma2
variable value
1: a 10
2: b 13
3: c 16
4: a 19
5: b 22
6: c 25
7: a 28
8: b 31
9: c 34
The only drawback of this method is that the numeric values in the 'value' column get stored as character values. You can correct this by extending it as follows:
ma2[df, value := `1990`][, value := as.numeric(value)]
If you want to change it back to wide format you could use the rowid function within dcast:
ma3 <- dcast(ma2, rowid(variable) ~ variable, value.var = "value")[, variable := NULL]
which gives:
> ma3
a b c
1: 10 13 16
2: 19 22 25
3: 28 31 34
Used data:
ma <- structure(list(a = structure(1:3, .Label = c("ID.aa", "ID.ba", "ID.ca"), class = "factor"),
b = structure(1:3, .Label = c("ID.ab", "ID.bb", "ID.cb"), class = "factor"),
c = structure(1:3, .Label = c("ID.ac", "ID.bc", "ID.cc"), class = "factor")),
.Names = c("a", "b", "c"), class = "data.frame", row.names = c(NA, -3L))
df <- structure(list(ID = structure(1:9, .Label = c("ID.aa", "ID.ab", "ID.ac", "ID.ba", "ID.bb", "ID.bc", "ID.ca", "ID.cb", "ID.cc"), class = "factor"),
`1990` = c(10L, 13L, 16L, 19L, 22L, 25L, 28L, 31L, 34L),
`1991` = c(11L, 14L, 17L, 20L, 23L, 26L, 29L, 32L, 35L),
`1992` = c(12L, 15L, 18L, 21L, 24L, 27L, 30L, 33L, 36L)),
.Names = c("ID", "1990", "1991", "1992"), class = "data.frame", row.names = c(NA, -9L))
In base R, it can be seen as a job for outer:
> outer(1:nrow(MA.ID), 1:ncol(MA.ID), Vectorize(function(x,y) {DF[which(DF$ID==MA.ID[x,y]),'1990']}))
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 19 22 25
[3,] 28 31 34
Explanations:
outer creates a matrix as the outer product of the first argument X (here a b c) and the second argument Y (here the same, a b c)
for every combination of X and Y, it applies a function that looks up the value in DF, at the row where the ID is MA.ID[X,Y], and at the column 1990
the important trick here is to wrap the function with Vectorize, because outer expects a vectorized function
the result is finally returned as matrix
Alternatively, another way to do it (still in base R) is:
to convert your data frame MA.ID into a vector
sapply a quick function that looks up correspondance with DF$ID
and convert back to a matrix.
This works:
> structure(
sapply(unlist(MA.ID),
function(id){DF[which(DF$ID==id),'1990']}),
dim=dim(MA.ID), names=NULL)
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 19 22 25
[3,] 28 31 34
(here the call to structure(..., dim=dim(MA.ID), names=NULL) converts back the vector to a matrix)

Resources