Join tables via multiple partial matching in R

I have a table describing a train track, with each row being a segment of the track with a from and to station as well as a trackID and a segment ID. The station names are completely random, not as structured as they appear here.
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  segment = letters[1:8],
  from = paste0("station_1", 1:8),
  to = paste0("station_2", 1:8)
)
tracks
  trackID segment       from         to
1       A       a station_11 station_21
2       A       b station_12 station_22
3       A       c station_13 station_23
4       A       d station_14 station_24
5       B       e station_15 station_25
6       B       f station_16 station_26
7       B       g station_17 station_27
8       B       h station_18 station_28
I have another table with sightings made on these tracks, and I would like to know the corresponding trackID for each sighting. The table looks like this:
sightings <- data.frame(from = c("station_24", "station_28", "station_14"),
                        to = c("station_14", "station_16", "station_25"))
sightings
        from         to
1 station_24 station_14
2 station_28 station_16
3 station_14 station_25
I could gather the trackID from the from and to information provided in the sightings table. BUT: from and to in the sightings table do not correspond to from and to in the tracks table; the two stations can lie in different segments, and they can be interchanged (to and from swapped). In some problematic cases, from and to lie on different trackIDs, which should then return no match. The desired output for this example would be:
        from         to trackID
1 station_24 station_14       A
2 station_28 station_16       B
3 station_14 station_25    <NA> # no match since station_14 and station_25 are on two different trackIDs
In my mind, the solution involves collapsing the tracks table by trackID and then doing a double partial matching of strings (using grepl()?). The next lines take care of the collapsing, but I have no clue where to go from here. Can someone point me in the right direction?
Solutions with R / dplyr very much preferred, but I would take anything!
library(dplyr)
tracks %>%
  group_by(trackID) %>%
  summarise(
    from_to = paste(paste(from, collapse = ","), paste(to, collapse = ","), sep = ",")
  )
# A tibble: 2 x 2
  trackID from_to
  <fct>   <chr>
1 A       station_11,station_12,station_13,station_14,station_21,station_22,station_23,station_24
2 B       station_15,station_16,station_17,station_18,station_25,station_26,station_27,station_28
EDIT: It seems that I've oversimplified my problem in my minimal example. The main issue is that stations (from and to) are not unique in the table, and not even unique to a trackID. Only a combination of to and from is unique to a trackID. I've accepted the answer as it solves the problem as stated, but I will also provide my own solution that I've come up with in the meantime.

A double-join can work.
Notes: you don't appear to be using segment, so I'm discarding it here, but this can be adapted if needed. Also, I added stringsAsFactors=FALSE to your data, since otherwise combining vectors of factors can be problematic.
library(dplyr)
tracksmod <- bind_rows(
  select(tracks, trackID, sta = from),
  select(tracks, trackID, sta = to)
)
head(tracksmod)
#   trackID        sta
# 1       A station_11
# 2       A station_12
# 3       A station_13
# 4       A station_14
# 5       B station_15
# 6       B station_16
sightings %>%
  left_join(select(tracksmod, trackID, from = sta), by = "from") %>%
  left_join(select(tracksmod, trackID2 = trackID, to = sta), by = "to") %>%
  mutate(trackID = if_else(trackID == trackID2, trackID, NA_character_)) %>%
  select(-trackID2)
#         from         to trackID
# 1 station_24 station_14       A
# 2 station_28 station_16       B
# 3 station_14 station_25    <NA>
I did not assume that directionality is important; that is, I'm not assuming that a station listed in from must always appear in the from column. This is why I converted tracks to tracksmod: it identifies a station with a trackID regardless of direction.
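As a side note beyond the original scope: if stations are not unique to a trackID (the situation described in the EDIT below), the same double-join idea can still work by collecting all candidate IDs per endpoint and keeping only their intersection. A minimal sketch, assuming tracksmod as built above and a sightings table with only from and to columns:

library(dplyr)

sightings %>%
  mutate(sighting = row_number()) %>% # one id per sighting, so duplicated join matches can be regrouped
  left_join(select(tracksmod, from_id = trackID, from = sta), by = "from") %>%
  left_join(select(tracksmod, to_id = trackID, to = sta), by = "to") %>%
  group_by(sighting, from, to) %>%
  summarise(trackID = paste(intersect(na.omit(from_id), na.omit(to_id)), collapse = ","),
            .groups = "drop") %>% # .groups requires dplyr >= 1.0
  select(-sighting) %>%
  mutate(trackID = na_if(trackID, "")) # an empty intersection means no common trackID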

As I've stated in the EDIT of my question, I oversimplified my problem in the minimal example. Here is an updated version of the data that resembles my real data more accurately. I've also added stringsAsFactors = FALSE, as suggested by @r2evans.
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  segment = letters[1:8],
  from = paste0("station_1", c(1:4, 1, 2, 5, 6)),
  to = paste0("station_2", 1:8),
  stringsAsFactors = FALSE
)
sightings <- data.frame(
  from = c("station_24", "station_28", "station_14"),
  to = c("station_14", "station_11", "station_25"),
  trackID = c("A", "B", NA),
  stringsAsFactors = FALSE
)
I've solved the problem by collapsing the tracks table by trackID and then using the purrr package to run the loop functions in a nested manner.
library(dplyr)
# Collapsing the tracks data frame
tracks_collapse <- tracks %>%
  group_by(trackID) %>%
  summarise(
    from_to = paste(paste(from, collapse = ","), paste(to, collapse = ","), sep = ",")
    # from = list(from),
    # to = list(to),
    # stas = list(c(from, to))
  )

# A helper function to remove NAs when looking for matches
remove_na <- function(x) { x[!is.na(x)] }
library(purrr)
pmap_dfr(sightings, function(from, to, trackID) { # pmap_dfr loops over a data.frame and returns a data.frame
  data.frame(
    from = from, # recreates the sightings data.frame
    to = to,     # ditto
    trackID = na_if( # an empty string (no match at all) becomes NA
      paste( # collapses the resulting vector
        remove_na( # removes the NA values
          pmap_chr( # matches every row of the sightings data.frame against the tracks data.frame
            tracks_collapse,
            function(trackID, from_to) {
              # partial string matching; returns the trackID if both strings match
              ifelse(grepl(from, from_to) & grepl(to, from_to), trackID, NA)
            }
          )
        ),
        collapse = ","
      ),
      ""
    )
  )
})
Output:
        from         to trackID
1 station_24 station_14       A
2 station_28 station_11       B
3 station_14 station_25    <NA>
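A hedged caveat on the grepl() approach: plain substring matching can misfire when one station name is a prefix of another (a hypothetical station_1 would also match inside station_14). Splitting the collapsed string and testing exact membership avoids this; a minimal sketch against tracks_collapse as built above:

library(purrr)

# Returns the matching trackID(s), or NA when no track contains both stations
match_track <- function(from, to, collapsed) {
  hits <- pmap_chr(collapsed, function(trackID, from_to) {
    stas <- strsplit(from_to, ",", fixed = TRUE)[[1]] # exact station names
    if (from %in% stas && to %in% stas) trackID else NA_character_
  })
  hits <- remove_na(hits)
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ",")
}

match_track("station_14", "station_24", tracks_collapse) # "A" with the data above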

Rounded percentages that add up to 100% in group_by statement

I'm having a hard time making rounded percentages that add up to 100% within groups.
Consider the following example:
# Loading main library used
library(dplyr)
# Creating the basic data frame
df = data.frame(group = c('A','A','A','A','B','B','B','B'),
                categories = c('Cat1','Cat2','Cat3','Cat4','Cat1','Cat2','Cat3','Cat4'),
                values = c(2200,4700,3000,2000,2900,4400,2200,1000))
print(df)
#   group categories values
# 1     A       Cat1   2200
# 2     A       Cat2   4700
# 3     A       Cat3   3000
# 4     A       Cat4   2000
# 5     B       Cat1   2900
# 6     B       Cat2   4400
# 7     B       Cat3   2200
# 8     B       Cat4   1000
df_with_shares = df %>%
  # Calculating group totals and adding them back to the main df
  left_join(df %>%
              group_by(group) %>%
              summarize(group_total = sum(values)),
            by = 'group') %>%
  # Calculating each category's share within the groups
  mutate(group_share = values / group_total,
         group_share_rounded = round(group_share, 2))

# Summing the rounded shares within groups
rounded_totals = df_with_shares %>%
  group_by(group) %>%
  summarize(total_share = sum(group_share_rounded))
print(rounded_totals)
# # A tibble: 2 x 2
#   group total_share
#   <chr>       <dbl>
# 1 A            0.99
# 2 B            1.01
# Note how the totals do not add up to 100% as expected
I am aware of a few generic solutions to the "rounding percentages to add up to 100%" problem, as explained in this SO post. I was even able to make a little R implementation of one of those approaches, as seen here. This is what it would look like if I just applied that R approach to this problem:
library(tidyr) # for replace_na()

df_with_rounded_shares = df %>%
  mutate(
    percs = values / sum(values),
    percs_cumsum = cumsum(percs),
    percs_cumsum_round = round(percs_cumsum, 2),
    percs_cumsum_round_offset = replace_na(lag(percs_cumsum_round, 1), 0),
    percs_rounded_final = percs_cumsum_round - percs_cumsum_round_offset)
However, the method I devised in the thread above does not work as I would like: it just calculates the shares of the values column across the whole dataset. In other words, it does not take into consideration the grouping variable representing the multiple groups in the data, each of which needs its rounded values to add up to 100% independently of every other group.
What can I do to generate a column of rounded percentages that add up to 100% by group?
PS: While writing this question I actually found something that worked, so I'll answer my own question below. I know it's super simple, but I think it's still worth having a direct answer here on SO addressing this issue.
The method devised in your implementation (from here) just needs a few small tweaks to make it work.
First, include a group_by statement before calculating the new columns. Also, you need to use a summarize statement instead of the mutate statement you have now.
In essence, this is what it'll look like:
# Modified version of your implementation of the rounding procedure.
# The new procedure below accommodates grouping variables.
library(tidyr) # for replace_na()

df_with_rounded_shares_by_group = df %>%
  group_by(group) %>%
  summarize(
    group_share = values / sum(values),
    group_share_cumsum = cumsum(group_share),
    group_share_cumsum_round = round(group_share_cumsum, 2),
    group_share_cumsum_round_offset = replace_na(lag(group_share_cumsum_round, 1), 0),
    group_share_rounded_final = group_share_cumsum_round - group_share_cumsum_round_offset) %>%
  # Removing unnecessary temporary columns
  select(-group_share_cumsum, -group_share_cumsum_round, -group_share_cumsum_round_offset)
# Verifying if the results add up to 100% within each group
rounded_totals = df_with_rounded_shares_by_group %>%
  group_by(group) %>%
  summarize(total_share = sum(group_share_rounded_final))
print(rounded_totals)
# # A tibble: 2 x 2
#   group total_share
#   <chr>       <dbl>
# 1 A               1
# 2 B               1
# Yep, they all add up to 100% as expected!
Btw, apologies for the ridiculously long column names. I made them that verbose to make it clear what each step is really doing.
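One extra caveat (my addition, not part of the original answer): current dplyr versions warn when summarize() returns more than one row per group, and since dplyr 1.1.0 the dedicated verb for that is reframe(). A sketch of the same step under that assumption:

library(dplyr) # assumes dplyr >= 1.1.0 for reframe()
library(tidyr) # for replace_na()

df_with_rounded_shares_by_group = df %>%
  group_by(group) %>%
  reframe(
    categories = categories, # keep the category labels alongside the shares
    group_share = values / sum(values),
    group_share_cumsum_round = round(cumsum(group_share), 2),
    group_share_rounded_final = group_share_cumsum_round -
      replace_na(lag(group_share_cumsum_round, 1), 0)
  ) %>%
  select(-group_share_cumsum_round)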

Creating new columns with mutate

I can figure out a solution to my problem, but only in a very suboptimal way, and thus the solution I have is not suited to a large df. Let me explain.
I have a big dataframe and I need to create new columns by subtracting pairs of existing ones. Let me show you using a simple df.
library(dplyr)

A <- rnorm(10)
B <- rnorm(10)
C <- rnorm(10)
D <- rnorm(10)
E <- rnorm(10)
F <- rnorm(10)
df1 <- tibble(A, B, C, D, E, F) # tibble() instead of the deprecated data_frame()
# A tibble: 10 x 6
            A          B          C          D          E           F
        <dbl>      <dbl>      <dbl>      <dbl>      <dbl>       <dbl>
 1 -2.8750025  0.4685855  2.4435767  1.6999761 -1.3848386 -0.58992249
 2  0.2551404  1.8555876  0.8365116 -1.6151186 -1.7754623  0.04423463
 3  0.7740396 -1.0756147  0.6830024 -2.3879337 -1.3165875 -1.36646493
 4  0.2059932  0.9322016  1.2483196 -0.1787840  0.3546773 -0.12874831
 5 -0.4561725 -0.1464692 -0.7112905  0.2791592  0.5835127  0.16493237
 6  1.2401795 -1.1422917 -0.6189480 -1.4975416  0.5653565 -1.32575021
 7 -1.6173618  0.2283430  0.6154920  0.6082847  0.0273447  0.16771783
 8  0.3340799 -0.5096500 -0.5270123 -0.2814217 -2.3732234  0.27972188
 9 -0.4841361  0.1651265  0.0296500  0.4324903 -0.3895971 -2.90426195
10 -2.7106357  0.5496335  0.3081533 -0.3083264 -0.1341055 -0.17927807
I need (i) to subtract columns at a fixed distance (D-A, E-B, F-C) while (ii) giving each new column a name based on the names of the two initial variables. I did it this way, and it works:
df2 <- df1 %>%
  transmute(!!paste0("diff", "D", "A") := D - A,
            !!paste0("diff", "E", "B") := E - B,
            !!paste0("diff", "F", "C") := F - C)
# A tibble: 10 x 3
       diffDA     diffEB     diffFC
        <dbl>      <dbl>      <dbl>
 1  4.5749785 -1.8534241 -3.0334991
 2 -1.8702591 -3.6310500 -0.7922769
 3 -3.1619734 -0.2409728 -2.0494674
 4 -0.3847772 -0.5775242 -1.3770679
 5  0.7353317  0.7299819  0.8762229
 6 -2.7377211  1.7076482 -0.7068022
 7  2.2256465 -0.2009983 -0.4477741
 8 -0.6155016 -1.8635734  0.8067342
 9  0.9166264 -0.5547236 -2.9339120
10  2.4023093 -0.6837390 -0.4874314
However, I have many columns and I would like to find a way to make the code simpler. I tried many things (like mutate_all, mutate_at, or add_column) but nothing works...
OK, here's a method that will work for the full width of your data set.
df1 <- tibble(A = rnorm(10),
              B = rnorm(10),
              C = rnorm(10),
              D = rnorm(10),
              E = rnorm(10),
              F = rnorm(10),
              G = rnorm(10),
              H = rnorm(10),
              I = rnorm(10))

ct <- 1:(ncol(df1) - 3) # stop three columns from the end, so df1[[i + 3]] always exists
diff_tbl <- tibble(testcol = rnorm(10))

for (i in ct) {
  new_tbl <- tibble(col = df1[[i + 3]] - df1[[i]])
  names(new_tbl)[1] <- paste('diff', colnames(df1[i + 3]), colnames(df1[i]), sep = '')
  diff_tbl <- bind_cols(diff_tbl, new_tbl)
}

diff_tbl <- diff_tbl %>%
  select(-testcol)

df1 <- bind_cols(df1, diff_tbl)
Basically, you create a second dummy tibble to hold the differences, iterate over the possible differences (i.e., gaps of three columns), assemble them into a single tibble, and then bind those columns to the original tibble. As you can see, I extended df1 by three extra columns and the whole thing worked like a charm.
It's probable that there's a more elegant way to do this, but this method definitely works. There's one slightly awkward thing in that I had to create diff_tbl with a dummy column and then remove it before the final bind_cols() call, but it's not a major thing, I think.
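For one such more elegant route, here is a compact sketch with purrr, under the assumption that the columns always pair up at a fixed offset of half the column count (as in the original six-column example), starting again from the original df1:

library(dplyr)
library(purrr)

half  <- ncol(df1) / 2             # assumes an even number of columns
left  <- df1[seq_len(half)]        # A, B, C, ...
right <- df1[half + seq_len(half)] # D, E, F, ...

diffs <- map2_dfc(right, left, `-`) %>% # subtract each pair of columns
  setNames(paste0("diff", names(right), names(left)))

df1 <- bind_cols(df1, diffs)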
You could divide the data frame in two parts and do
inds <- ncol(df1) / 2
df1[paste0("diff", names(df1[(inds + 1):ncol(df1)]), names(df1[1:inds]))] <-
  df1[(inds + 1):ncol(df1)] - df1[1:inds]
Note that column names with dashes in them are non-syntactic and not recommended:
result = df1[4:6] - df1[1:3]
names(result) = paste(names(df1)[4:6], names(df1)[1:3], sep = "-")
result
#            D-A         E-B        F-C
# 1   0.12459065  0.05855622  0.6134559
# 2  -2.65583389  0.26425762  0.8344115
# 3  -1.48761765 -3.13999402  1.3008065
# 4  -4.37469763  1.37551178  1.3405191
# 5   1.01657135 -0.90690359  1.5848562
# 6  -0.34050959 -0.57687686 -0.3794937
# 7   0.85233808  0.57911293 -0.8896393
# 8   0.01931559  0.91385740  3.2685647
# 9  -0.62012982 -2.34166712 -0.4001903
# 10 -2.21764146  0.05927664  0.3965072

Carrying out a simple dataframe subset with dplyr

Consider the following dataframe slice:
df = data.frame(locations = c("argentina", "brazil", "argentina", "denmark"),
                score = 1:4,
                row.names = c("a091", "b231", "a234", "d154"))
df
     locations score
a091 argentina     1
b231    brazil     2
a234 argentina     3
d154   denmark     4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted; sorry that was not clear before. So the output, importantly, is:
     locations score
a234 argentina     3
d154   denmark     4
a091 argentina     1
And not as you would get from a simple subset operation:
     locations score
a091 argentina     1
a234 argentina     3
d154   denmark     4
I'd like to do exactly the same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
# Warning message:
# Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, with the rownames as an index column in the tibble:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
  index locations score
  <chr> <fct>     <int>
1 a234  argentina     3
2 d154  denmark       4
3 a091  argentina     1
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hi,
you can simply use mutate to get the row names of your data frame into an index column, filter that column to the vector sorted, and then order the data frame according to sorted:

library(dplyr)
df2 <- df %>% mutate(index = row.names(.)) %>% filter(index %in% sorted)
df2 <- df2[order(match(df2$index, sorted)), ] # note the comma: we reorder rows, not columns
I think I've figured it out:

dt_proper2 = dt_proper[match(sorted, dt_proper$index), ]

Seems to be the shortest implementation of what df[sorted,] does.
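A pipe-friendly way to write the same thing, sketched against the dt_proper tibble with its index column from above:

library(dplyr)

dt_proper2 = dt_proper %>%
  filter(index %in% sorted) %>% # keep only the requested rows
  arrange(match(index, sorted)) # order them as given in `sorted`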
Functions in the tidyverse (dplyr, tibble, etc.) are built around the concept (as far as I know) that rows only contain attributes (columns) and no row names / labels / indexes. So in order to sort the rows, you have to introduce a new column containing the rank of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)

# note that I've changed the third column's name to avoid confusion
df = tibble(
  locations = c("argentina", "brazil", "argentina", "denmark"),
  score = 1:4,
  custom_id = c("a091", "b231", "a234", "d154")
)

sorted_ids = c("a234", "d154", "a091")

sorting_info = tibble(
  custom_id = sorted_ids,
  rank = 1:length(sorted_ids)
)

ordered_ids = df %>%
  inner_join(sorting_info) %>%
  arrange(rank) %>%
  select(-rank)
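Since inner_join() will otherwise guess the key and print a message, naming it explicitly may be worth it; same result, just a sketch:

ordered_ids = df %>%
  inner_join(sorting_info, by = "custom_id") %>%
  arrange(rank) %>%
  select(-rank)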

Calling a data.frame from global.env and adding a column with the data.frame name

I have a dataset consisting of pairs of data.frames (which are almost exact pairs, but not close enough to merge directly) which I need to munge together. Luckily, each df has an identifier for the date it was created, which can be used to match up the pairs. E.g.
df_0101 <- data.frame(a = rnorm(1:10),
                      b = runif(1:10))
df_0102 <- data.frame(a = rnorm(5:20),
                      b = runif(5:20))
df2_0101 <- data.frame(a2 = rnorm(1:10),
                       b2 = runif(1:10))
df2_0102 <- data.frame(a2 = rnorm(5:20),
                       b2 = runif(5:20))
Therefore, the first thing I need to do is mutate a new column onto each data.frame containing this date (01_01 / 01_02 / etc.), i.e.

df_0101 <- df_0101 %>%
  mutate(df_name = "df_0101")

but obviously in a programmatic manner.
I can call every data.frame in the global environment using
l_df <- Filter(function(x) is(x, "data.frame"), mget(ls()))
head(l_df)
$df_0101
           a          b
1  0.7588803 0.17837296
2 -0.2592187 0.45445752
3  1.2221744 0.01553190
4  1.1534353 0.72097071
5  0.7279514 0.96770448

$df_0102
            a          b
1 -0.33415584 0.53597308
2  0.31730849 0.32995013
3 -0.18936533 0.41024220
4  0.49441962 0.22123885
5 -0.28985964 0.62388478

$df2_0101
          a2        b2
1 -0.5600229 0.6283224
2  0.5944657 0.7384586
3  1.1284180 0.4656239
4 -0.4737340 0.1555984
5 -0.3838161 0.3373913

$df2_0102
           a2         b2
1 -0.67987149 0.65352466
2  1.46878953 0.47135011
3  0.10902751 0.04460594
4 -1.82677732 0.38636357
5  1.06021443 0.92935144
but I have no idea how to then pull the name of each df down into a new column on each. Any ideas?
Thanks for reading,
We can use Map in base R:
Map(cbind, names = names(l_df), l_df)
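If the modified data frames then need to go back into the global environment under their original names, base R's list2env can push the Map result back; a minimal sketch:

l_df2 <- Map(cbind, names = names(l_df), l_df)
list2env(l_df2, envir = .GlobalEnv) # overwrites df_0101, df_0102, ... with the augmented versions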
If we are going the tidyverse way, then:
library(tidyverse)
map2(names(l_df), l_df, ~(cbind(names = .x, .y)))
Also, this can be turned into a single dataset with bind_rows:
bind_rows(l_df, .id = "names")
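If you prefer to keep each element as its own data frame rather than one stacked dataset, a purrr sketch with imap also works (the column name df_name is my choice, not from the question):

library(purrr)
library(dplyr)

l_df2 <- imap(l_df, ~ mutate(.x, df_name = .y)) # .y is the element's name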

Finding repeated sentences/words/phrases by group over time

I have a dataset in which each column is a variable and each row is an observation (like time series data). It looks like this (I apologize for the format, but I can't show the data):
I'd like to know if a person or group is saying the same thing(s) over time. I'm familiar with n-grams, but it's not quite what I need. Any help would be appreciated.
This is the output I'd like:
Sorry for all the edits and poor comments; I'm still getting used to the website.
If you want to see the frequency of each comment for each person together with the new column Ready, you can do it with the following code:
library(dplyr) # loaded first, since %>% is already used to build the data

set.seed(123456)
### I use the same data as the previous example, thank you for providing it!
data <- data.frame(date = Sys.Date() - sample(100),
                   Group = c("Cars", "Trucks") %>% sample(100, replace = T),
                   Reporting_person = c("A", "B", "C") %>% sample(100, replace = T),
                   Comments = c("Awesome", "Meh", "NC") %>% sample(100, replace = T),
                   Ready = as.character(c("Yes", "No") %>% sample(100, replace = T)))

data %>%
  group_by(Reporting_person, Ready) %>%
  count(Comments) %>%
  mutate(prop = prop.table(n)) # proportions are computed within each (Reporting_person, Ready) group
If what you are asking is whether a change occurs in the comments over time, and whether that change is correlated with an event (like Ready), you can use the following code:
library(dplyr)

### Creating a column with the previous comment per group (lag)
new <- data %>%
  arrange(Reporting_person, Group, date) %>%
  group_by(Group, Reporting_person) %>%
  mutate(comments_plusone = lag(Comments))
new <- na.omit(new)

### Creating the change column: 1 is a change, 0 no change
new$Change <- as.numeric(new$Comments != new$comments_plusone)

### Get the association between Change and the event...
### Chi-squared test for association between the event and the change.
### Note that using a Pearson correlation is not pertinent here:
tbl <- table(new$Ready, new$Change)
chi2 <- chisq.test(tbl, correct = F)
c(chi2$statistic, chi2$p.value)
sqrt(chi2$statistic / sum(tbl)) # effect size: the phi coefficient (Cramer's V for a 2x2 table)
You should get no significant association with this example, as you can clearly see when you plot the table.

plot(tbl)

Note that using the cor function is not appropriate when working with two binary variables.
Here is a post on this topic: Correlation between two binary
Frequency of change by change of state
Following your comments, I am adding this code:
newR <- data %>%
  arrange(Reporting_person, Group, date) %>%
  group_by(Group, Reporting_person) %>%
  mutate(Ready_plusone = lag(Ready))
newR <- na.omit(newR)

### ------------------------ Add the column to the new data frame
### Creating the change-of-state column; I paste the two values because you seem to have more than 2 levels
new$State_change <- paste(newR$Ready, newR$Ready_plusone, sep = "_")

### Getting the frequency of Change by change of state (Ready yes-no, no-yes, ...)
result <- new %>%
  group_by(Reporting_person, State_change) %>%
  count(Change) %>%
  mutate(Frequence = prop.table(n)) %>%
  filter(Change == 1)

### tidyr is a great library for reshaping data; you want the wide format of the previous long
### data frame. However, doing this will generate a lot of NAs, so if I were you I would keep
### the result format instead, but this could be helpful for future needs, so here you go:
library(tidyr)
final <- as.data.frame(spread(result, key = State_change, value = Frequence))[, c(1, 4:7)]
Hope this helps :)
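As a hedged aside from a later tidyr perspective: spread() is retired in current tidyr, and the equivalent reshape with pivot_wider() would look roughly like this (a sketch, assuming tidyr >= 1.0):

library(tidyr)

final <- result %>%
  pivot_wider(names_from = State_change, values_from = Frequence)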
Something like this?
library(dplyr) # for %>%

df <- data.frame(date = Sys.Date() - sample(10),
                 Group = c("Cars", "Trucks") %>% sample(10, replace = T),
                 Reporting_person = c("A", "B", "C") %>% sample(10, replace = T),
                 Comments = c("Awesome", "Meh", "NC") %>% sample(10, replace = T))
#          date  Group Reporting_person Comments
# 1  2017-06-08 Trucks                B  Awesome
# 2  2017-06-05 Trucks                A  Awesome
# 3  2017-06-14   Cars                B      Meh
# 4  2017-06-06   Cars                B  Awesome
# 5  2017-06-11   Cars                A      Meh
# 6  2017-06-07   Cars                B       NC
# 7  2017-06-09   Cars                A       NC
# 8  2017-06-10   Cars                A       NC
# 9  2017-06-13 Trucks                C  Awesome
# 10 2017-06-12 Trucks                B       NC
aggregate(date ~ ., df, length)
#    Group Reporting_person Comments date
# 1 Trucks                A  Awesome    1
# 2   Cars                B  Awesome    1
# 3 Trucks                B  Awesome    1
# 4 Trucks                C  Awesome    1
# 5   Cars                A      Meh    1
# 6   Cars                B      Meh    1
# 7   Cars                A       NC    2
# 8   Cars                B       NC    1
# 9 Trucks                B       NC    1
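For reference, a dplyr sketch equivalent to that aggregate() call, counting rows per combination:

library(dplyr)

df %>%
  count(Group, Reporting_person, Comments) # n = number of rows per combination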
