Mapping 2 unrelated data frames in R

I need to use data from dataframe A to fill a column in dataframe B.
Here is a subset of dataframe A:
> dfA <- data.frame(Family=c('A','A','A','B','B'), Count=c(1,2,3,1,2), Start=c(0,10,35,0,5), End=c(10,35,50,5,25))
> dfA
  Family Count Start End
1      A     1     0  10
2      A     2    10  35
3      A     3    35  50
4      B     1     0   5
5      B     2     5  25
and a subset of dataframe B:
> dfB <- data.frame(Family=c('A','A','A','B','B'), Start=c(1,4,36,2,10), End=c(3,6,40,4,24), BelongToCount=c(NA,NA,NA,NA,NA))
> dfB
  Family Start End BelongToCount
1      A     1   3            NA
2      A     4   6            NA
3      A    36  40            NA
4      B     2   4            NA
5      B    10  24            NA
What I want to do is fill in the BelongToCount column in B according to the data in dataframe A, so that dataframe B ends up as:
Family Start End BelongToCount
     A     1   3             1
     A     4   6             1
     A    36  40             3
     B     2   4             1
     B    10  24             2
I need to do this for each family (so grouping by family), and the condition for filling the BelongToCount column is B$Start >= A$Start && B$End <= A$End.
I can't seem to find a clean (and fast) way to do this in R.
Right now, I am doing as follows:
split_A <- split(dfA, dfA$Family)
split_A_FamilyA <- split_A[["A"]]
split_B <- split(dfB, dfB$Family)
split_B_FamilyA <- split_B[["A"]]

for (i in 1:nrow(split_B_FamilyA)) {
  row <- split_B_FamilyA[i, ]
  start <- row$Start
  end <- row$End
  for (j in 1:nrow(split_A_FamilyA)) {
    row_base <- split_A_FamilyA[j, ]
    start_base <- row_base$Start
    end_base <- row_base$End
    if ((start >= start_base) && (end <= end_base)) {
      split_B_FamilyA$BelongToCount[i] <- row_base$Count
      break
    }
  }
}
I admit this is a very bad way of handling the problem (and it is awfully slow). I usually use dplyr when applying operations to specific groups, but I can't find a way to do such a thing with it. Joining the tables does not make much sense either, because the numbers of rows don't match.
Can someone point me any relevant R function / an efficient way of solving this problem in R?

You can do this with a non-equi join in data.table:
library(data.table)
setDT(dfB)
setDT(dfA)
set(dfB, j='BelongToCount', value = as.numeric(dfB$BelongToCount))
dfB[dfA, BelongToCount := Count, on = .(Family, Start >= Start, End <= End)]
#    Family Start End BelongToCount
# 1:      A     1   3             1
# 2:      A     4   6             1
# 3:      A    36  40             3
# 4:      B     2   4             1
# 5:      B    10  24             2
In case a row in dfB is contained in multiple rows of dfA:
dfA2 <- rbind(dfA, dfA)
dfA2[dfB, .(BelongToCount = sum(Count)),
     on = .(Family, Start <= Start, End >= End), by = .EACHI]
#    Family Start End BelongToCount
# 1:      A     1   3             2
# 2:      A     4   6             2
# 3:      A    36  40             6
# 4:      B     2   4             2
# 5:      B    10  24             4
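For comparison, dplyr (version 1.1.0 or later) can express the same containment join through join_by(); a minimal sketch, assuming fresh dfA/dfB as defined in the question:
library(dplyr)
dfB %>%
  select(-BelongToCount) %>%
  left_join(dfA, by = join_by(Family, Start >= Start, End <= End),
            suffix = c("", ".A")) %>%
  select(Family, Start, End, BelongToCount = Count)
Here suffix = c("", ".A") keeps dfB's Start/End names intact, since non-equi joins retain the columns from both sides.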

Related

Unique ID for interconnected cases

I have the following data frame, that shows which cases are interconnected:
   DebtorId DupDebtorId
1:        1           2
2:        1           3
3:        1           4
4:        5           1
5:        5           2
6:        5           3
7:        6           7
8:        7           6
My goal is to assign a unique group ID to each group of cases. The desired output is:
   DebtorId group
1:        1     1
2:        2     1
3:        3     1
4:        4     1
5:        5     1
6:        6     2
7:        7     2
My train of thought:
library(data.table)
example <- data.table(
  DebtorId = c(1,1,1,5,5,5,6,7),
  DupDebtorId = c(2,3,4,1,2,3,7,6)
)

unique_pairs <- example[!duplicated(t(apply(example, 1, sort))), ] # get unique pairs of DebtorId and DupDebtorId
unique_pairs[, group := .GRP, by = .(DebtorId)] # assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId') # format data to wide for each group ID

# create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)

# loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
  a <- newdt[i]$DebtorId
  b <- min(which(groups[, -1] == a, arr.ind = TRUE)[, 1])
  newdt[i]$group <- b
}
Output:
   DebtorId group
1:        1     1
2:        2     1
3:        3     1
4:        4     1
5:        5     2
6:        6     3
7:        7     3
There are 2 problems in my approach:
From the output, you can see that it fails to recognize that case 5 belongs to group 1.
The final loop is agonizingly slow, which renders it useless for my real use case of 1M rows, and the idiomatic := approach does not work with which().
I'm not sure whether my approach can be optimized, or whether there is a better way of doing this altogether.
This functionality already exists in igraph, so if you don't need to do it yourself, we can build a graph from your data frame and then extract the cluster membership. stack() is just an easy way to convert a named vector to a data frame.
library(igraph)
g <- graph.data.frame(example)
df_membership <- clusters(g)$membership
stack(df_membership)
#>   values ind
#> 1      1   1
#> 2      1   5
#> 3      2   6
#> 4      2   7
#> 5      1   2
#> 6      1   3
#> 7      1   4
Above, values corresponds to group and ind to DebtorId.
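If you need the exact DebtorId/group layout from the question, a small follow-up sketch (the component numbering is igraph's and happens to line up with the desired output here):
result <- data.frame(DebtorId = as.integer(names(df_membership)),
                     group = as.integer(df_membership))
result[order(result$DebtorId), ]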

Finding group with distinct (non-overlapping) elements

I have a simple dataframe with group IDs and elements of each group, like this:
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3), "Values" = c(3,5,7,2,4,5,2,4,6))
Each ID may have a different number of elements. Now I want to find all IDs whose elements are entirely distinct from those of another ID. In this example, ID 1 and ID 3 will be selected because they have distinct elements (3,5,7 vs 2,4,6). I also want to copy these unique IDs and their elements into a new dataframe, similar to the original.
How would I do that in R? My skills with R are quite limited.
Thank you very much!
Bests,
This seems like a good question for igraph cliques with one edge to another clique, but I can't seem to wrap my head around how to use it.
Anyway, here is an option that applies a join to identify IDs with the same Values and then an anti-join to remove those IDs, using data.table:
library(data.table)
DT <- as.data.table(x)
for (i in DT[, unique(ID)]) {
  dupeID <- DT[DT[ID==i], on=.(Values), .(ID=unique(x.ID[x.ID!=i.ID]))]
  DT <- DT[!dupeID, on=.(ID)]
}
output:
   ID Values
1:  1      3
2:  1      5
3:  1      7
4:  3      2
5:  3      4
6:  3      6
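For what it's worth, the igraph idea can be made concrete: build a "conflict" graph with an edge whenever two IDs share a value, and the edges of its complement are exactly the disjoint ID pairs. A rough, untested sketch:
library(igraph)
vals <- split(x$Values, x$ID)
ids <- names(vals)
# TRUE where two IDs share at least one value (diagonal is TRUE, so no self loops below)
conflict <- outer(ids, ids, Vectorize(function(a, b)
  length(intersect(vals[[a]], vals[[b]])) > 0))
dimnames(conflict) <- list(ids, ids)
# edges of the complement graph are the disjoint ID pairs
g <- graph_from_adjacency_matrix(1 * !conflict, mode = "undirected")
as_edgelist(g) # should list the pair 1-3 for the example data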
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3), "Values" = c(3,5,7,2,4,5,2,4,6))
gps = split(x, x$ID)
nGroups = length(gps)
k = 1
results = data.frame(ID = NULL, Values = NULL)
for(i in 1:(nGroups - 1)){
j = i + 1
while(j <= nGroups){
if(length(intersect(gps[[i]]$Values, gps[[j]]$Values)) == 0){
print(c(i,j))
results = rbind(results, gps[[i]], gps[[j]])
}
j = j + 1
}
}
results
> results
  ID Values
1  1      3
2  1      5
3  1      7
7  3      2
8  3      4
9  3      6
You can try the following code, where y is the resulting list of data frames (collecting all data frames that have exclusive Values):
xs <- split(x, x$ID)
y <- list()
ids <- seq_along(xs)
repeat {
  if (length(ids) == 0) break
  y[[length(y) + 1]] <- xs[[ids[1]]]
  p <- ids[[1]]
  qs <- p
  for (q in ids[-1]) {
    if (length(intersect(xs[[p]]$Values, xs[[q]]$Values)) == 0) {
      y[[length(y)]] <- rbind(y[[length(y)]], xs[[q]])
      qs <- c(qs, q)
    }
  }
  ids <- setdiff(ids, qs)
}
Example
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3,4,4),
"Values" = c(3,5,7,2,4,5,2,4,6,1,3))
> x
ID Values
1 1 3
2 1 5
3 1 7
4 2 2
5 2 4
6 2 5
7 3 2
8 3 4
9 3 6
10 4 1
11 4 3
then you will get
> y
[[1]]
  ID Values
1  1      3
2  1      5
3  1      7
7  3      2
8  3      4
9  3      6

[[2]]
   ID Values
4   2      2
5   2      4
6   2      5
10  4      1
11  4      3

Summing the number of times a value appears in either of 2 columns

I have a large data set - around 32 million rows. I have information on the telephone number, the origin of the call, and the destination.
For each telephone number, I want to count the number of times it appeared either as Origin or as Destination.
An example data table is as follows:
library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))
   Tel Origin Destination
1:   1      1           3
2:   2      2           4
3:   3      3           5
4:   4      4           6
5:   5      5           7
I have working code, but it takes too long for my data since it involves a for loop. How can I optimize it?
Here it is:
for (i in unique(dt$Tel)) {
  index <- (dt$Origin == i | dt$Destination == i)
  dt[dt$Tel == i, "N"] <- sum(index)
}
Result:
   Tel Origin Destination N
1:   1      1           3 1
2:   2      2           4 1
3:   3      3           5 2
4:   4      4           6 2
5:   5      5           7 2
Where N shows that Tel=1 appears once, Tel=2 appears once, and Tel=3, 4 and 5 each appear twice.
We can do a melt and match: melt() stacks Origin and Destination into a single value column, match() maps each stacked value to its position in Tel, and tabulate() counts how often each position occurs:
dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
Another option is to loop through columns 2 and 3, use %in% to check whether each value of 'Tel' is present, then use Reduce with + to add up the logical vectors for each 'Tel', and assign (:=) the result to 'N':
dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
#    Tel Origin Destination N
# 1:   1      1           3 1
# 2:   2      2           4 1
# 3:   3      3           5 2
# 4:   4      4           6 2
# 5:   5      5           7 2
A second method constructs a temporary data.table which is then joined to the original. This is longer and likely less efficient than #akrun's, but can be useful to see.
# get a temporary data.table with the summed origin and destination frequencies
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names=FALSE))),
                 c("Tel", "N"))
# turn the variables into integers (Tel holds the names from the table above, and is thus character)
temp <- temp[, lapply(.SD, as.integer)]
Now, join the original table on Tel:
dt <- temp[dt, on="Tel"]
dt
   Tel N Origin Destination
1:   1 1      1           3
2:   2 1      2           4
3:   3 2      3           5
4:   4 2      4           6
5:   5 2      5           7
You can get the desired column order using setcolorder:
setcolorder(dt, c("Tel", "Origin", "Destination", "N"))
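One caveat: the %in% approach counts each Tel at most once per column. If a number can repeat within Origin or Destination and every occurrence should count, a melt-and-join variation (a sketch along the same lines) would be:
long <- melt(dt, id.vars = "Tel", measure.vars = c("Origin", "Destination"))
dt[, N := long[, .N, by = value][dt, on = .(value = Tel), x.N]]
# Tel values that never occur get NA; replace with 0 if preferred
dt[is.na(N), N := 0L]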

Conditional calculation of means of different columns in data.table with R

In the earlier question "Multiple aggregation in R with 4 parameters", the calculation of the means and medians of vector t, for each value of vector y (from 1 to 4) where x=1 and z=1, was discussed using the aggregate function in R.
x  y  z  t
1  1  1 10
1  0  1 15
2 NA  1 14
2  3  0 15
2  2  1 17
2  1 NA 19
3  4  2 18
3  0  2 NA
3  2  2 45
4  3  2 NA
4  1  3 59
5  0  3  0
5  4  3 45
5  4  4 74
5  1  4 86
But how can I calculate, for each value (from 1 to 5) of vector x, (mean(y)+mean(z))/(mean(z)-mean(t))? The calculation should not use values that are 0 or NA in any vector. For example, if a value of vector y is 0, then the corresponding number in every vector (y, z, t) should not be used, and if no rows remain for some x, the result for that x should be NA.
Here is the code for calculating the means of y, z and t; the formula (mean(y)+mean(z))/(mean(z)-mean(t)) still needs to be added:
data <- data.table(dataframe)
bar <- data[, .N, by = x]
foo <- data[, list(mean.y = mean(y, na.rm = TRUE),
                   mean.z = mean(z, na.rm = TRUE),
                   mean.t = mean(t, na.rm = TRUE)),
            by = x]
In this code all rows are used for calculating the means, but for calculating (mean(y)+mean(z))/(mean(z)-mean(t)), any row where y, z or t equals zero or NA should not be used.
Update:
Oh, this can be further simplified, as data.table doesn't subset NA by default (especially with such cases in mind, similar to base::subset). So, you just have to do:
dt[y != 0 & z != 0 & t != 0,
   list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
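(In these snippets, dt is the question's table as a data.table; one minimal way to build it, reusing the df constructed in the read.table answer below, is dt <- as.data.table(df).)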
FWIW, here's how I'd do it in data.table:
dt[(y | NA) & (z | NA) & (t | NA),
   list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
#    x         ans
# 1: 1 -0.22222222
# 2: 2 -0.18750000
# 3: 3 -0.16949153
# 4: 4 -0.07142857
# 5: 5 -0.10309278
Let's break it down with the general syntax: dt[i, j, by]:
In i, we filter for your conditions using a nice little hack: TRUE | NA = TRUE, FALSE | NA = NA, and NA | NA = NA (you can test these out in your R session).
Since you need only the non-zero, non-NA values, it's just a matter of |ing each column with NA, which returns TRUE only when your condition holds. That settles the subset-by-condition part.
Then for each group in by, we aggregate according to your function, in j, to get the result.
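For example, the hack in action on a single vector:
c(TRUE, FALSE, NA) | NA
# [1] TRUE   NA   NA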
HTH
Here's one solution:
# create your sample data frame
df <- read.table(text = " x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)
library('dplyr')
dfmeans <- df %>%
  filter(!is.na(y) & !is.na(z) & !is.na(t)) %>% # remove rows with NAs
  filter(y != 0 & z != 0 & t != 0) %>%          # remove rows with zeroes
  group_by(x) %>%
  summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))
I'm sure there is a simpler way to remove the rows with NAs and zeroes, but it's not coming to me. Anyway, dfmeans looks like this:
#   x      xmeans
# 1 1 -0.22222222
# 2 2 -0.18750000
# 3 3 -0.16949153
# 4 4 -0.07142857
# 5 5 -0.10309278
And if you just want the values from xmeans, use dfmeans$xmeans.
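For what it's worth, with dplyr 1.0.4 or later the two filter() calls can be collapsed with if_all(); a minimal sketch on the same df:
dfmeans <- df %>%
  filter(if_all(c(y, z, t), ~ !is.na(.x) & .x != 0)) %>%
  group_by(x) %>%
  summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))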

R - compare rows consecutively in two data frames and return a value

I have the following two data frames:
df1 <- data.frame(month=c("1","1","1","1","2","2","2","3","3","3","3","3"),
                  temp=c("10","15","16","25","13","17","20","5","16","25","30","37"))
df2 <- data.frame(period=c("1","1","1","1","1","1","1","1","2","2","2","2","2","2","3","3","3","3","3","3","3","3","3","3","3","3"),
                  max_temp=c("9","13","16","18","30","37","38","39","10","15","16","25","30","32","8","10","12","14","16","18","19","25","28","30","35","40"),
                  group=c("1","1","1","2","2","2","3","3","3","3","4","4","5","5","5","5","5","6","6","6","7","7","7","7","8","8"))
I would like to:
1. Consecutively for each row, check if the value in the month column of df1 matches that in the period column of df2, i.e. df1$month == df2$period.
2. If step 1 is not TRUE, i.e. df1$month != df2$period, repeat step 1, comparing the value in df1 with the value in the next row of df2, and so forth until df1$month == df2$period.
3. If df1$month == df2$period, check if the value in the temp column of df1 is less than or equal to that in the max_temp column of df2, i.e. df1$temp <= df2$max_temp.
4. If df1$temp <= df2$max_temp, return the value in that row of the group column of df2 and add it to df1, in a new column called "new_group".
5. If step 3 is not TRUE, i.e. df1$temp > df2$max_temp, go back to step 1 and compare the same row in df1 with the next row in df2.
An example of the output data frame I'd like is:
df3 <- data.frame(month=c("1","1","1","1","2","2","2","3","3","3","3","3"),
                  temp=c("10","15","16","25","13","17","20","5","16","25","30","37"),
                  new_group=c("1","1","1","2","3","4","4","5","6","7","7","8"))
I've been playing around with the ifelse function and need some help or re-direction. Thanks!
I found the procedure for computing new_group hard to follow as stated. As I understand it, you're trying to create a variable called new_group in df1. For row i of df1, the new_group value is the group value of the first row in df2 that:
Is indexed i or higher
Has a period value matching df1$month[i]
Has a max_temp value no less than df1$temp[i]
I approached this by using sapply called on the row indices of df1:
fxn = function(idx) {
  # Potentially matching indices in df2
  pm = idx:nrow(df2)
  # Matching indices in df2
  m = pm[df2$period[pm] == df1$month[idx] &
           as.numeric(as.character(df1$temp[idx])) <=
             as.numeric(as.character(df2$max_temp[pm]))]
  # Return the group associated with the first matching index
  return(df2$group[m[1]])
}
df1$new_group = sapply(seq(nrow(df1)), fxn)
df1
#    month temp new_group
# 1      1   10         1
# 2      1   15         1
# 3      1   16         1
# 4      1   25         2
# 5      2   13         3
# 6      2   17         4
# 7      2   20         4
# 8      3    5         5
# 9      3   16         6
# 10     3   25         7
# 11     3   30         7
# 12     3   37         8
library(data.table)
dt1 <- data.table(df1, key="month")
dt2 <- data.table(df2, key="period")
## temp and max_temp were created as character, so convert them for numeric comparison
dt1[, temp := as.numeric(as.character(temp))]
dt2[, max_temp := as.numeric(as.character(max_temp))]
## add a row index
dt1[, rn1 := seq(nrow(dt1))]
## join on month == period, then take the group of the first row with temp <= max_temp
dt3 <-
  unique(dt1[dt2, allow.cartesian=TRUE][, new_group := group[min(which(temp <= max_temp))], by="rn1"], by="rn1")
## Keep only the columns you want
dt3[, c("month", "temp", "max_temp", "new_group"), with=FALSE]
    month temp max_temp new_group
 1:     1   10        9         1
 2:     1   15        9         1
 3:     1   16        9         1
 4:     1   25        9         2
 5:     2   13       10         3
 6:     2   17       10         4
 7:     2   20       10         4
 8:     3    5        8         5
 9:     3   16        8         6
10:     3   25        8         7
11:     3   30        8         7
12:     3   37        8         8
