Path along linked values taking the lowest value each time - r

I have a data.table with two columns "From" and "To" as follows:
data.table(From = c(1,1,1,1,2,2,2,2,3,3,3,4,4,5),
To = c(3,4,5,6,3,4,5,6,4,5,6,5,6,6))
The data.table will always be sorted as shown in the example above, with "From" and "To" values increasing from smallest to largest.
I need to find a 'path' starting from the first 'From' (which will always be '1'), through to the last 'To' value, subject to always choosing the lowest 'To' value.
In the above example, I would have 1 --> 3, then 3 --> 4, then 4 --> 5, then finally 5 --> 6.
I then want to return in a vector 1, 3, 4, 5, and 6, representing the linked values.
The only way that I can think of doing it is using a while or for loop and looping through each group of 'From' values and iteratively choosing the smallest. That seems inefficient though and will probably be very slow on my actual data set which is over 100,000 rows long.
Are there any data.table-like solutions?
I also thought that maybe igraph would have a method for this, but I must admit that I currently have pretty much zero knowledge of that package.
Any help would be greatly appreciated.
Thanks,
Phil
EDIT:
Thanks for all the responses so far.
Sorry, my example/explanation wasn't a great one, as I didn't explain that the 'From'/'To' pairs don't need to go all the way through to the final value of the 'To' column.
Using the example from the comments below:
dt <- data.table(From = c(1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 5),
To = c(3, 4, 5, 6, 3, 4, 5, 6, 5, 6, 6))
The output would simply be a vector of c(1, 3), as it will start at 1, choose the lowest value which is 3, and then because there are no 'From' values of '3', it wouldn't continue any further.
Another example:
dt <- data.table(From = c(1,1,1,2,2,3,3,4,4),
To = c(2,3,4,5,6,4,7,8,9))
The intended output here is a vector c(1,2,5); following the path 1 --> 2, then 2 --> 5, at which point it stops as there is no '5' value in the "From" column.
Hopefully, that makes sense, and apologies for the lack of clarity in the original question.
Thanks,
Phil
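For reference, the loop approach described above (always jump to the smallest 'To' for the current node, and stop as soon as that node no longer appears in 'From') could be sketched roughly as follows; walk_min is just an illustrative name, not something from the question:
library(data.table)
walk_min <- function(dt) {
  current <- dt$From[1]                 # starting node, always 1 here
  path <- current
  while (current %in% dt$From) {        # stop when the node never appears in From
    current <- dt[From == current, min(To)]
    path <- c(path, current)
  }
  path
}
walk_min(data.table(From = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
                    To = c(2, 3, 4, 5, 6, 4, 7, 8, 9)))
[1] 1 2 5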

You can try the code below (it uses dplyr, igraph and purrr)
dt %>%
group_by(From) %>%
slice_min(To) %>%
graph_from_data_frame() %>%
ego(
order = sum((m <- membership(components(.))) == m[names(m) == "1"]),
nodes = "1",
mode = "out"
) %>%
pluck(1) %>%
names() %>%
as.numeric()
or, more simply, with subcomponent (as @clp did)
dt %>%
group_by(From) %>%
slice_min(To) %>%
graph_from_data_frame() %>%
subcomponent(v = "1", mode = "out") %>%
names() %>%
as.integer()
which gives
For the first updated example
[1] 1 3
For the second updated example
[1] 1 2 5

Assuming From and To are ordered as shown, this may work.
It first takes the lowest To for each From group, then uses shift to drop entries that don't chain on from the previous To value.
If a link is missing (e.g. a To of 3 but no From of 3), it returns NULL.
dt[, .(frst = first(To)), From][
  , if (all((frst %in% From)[1:(.N - 1)])) {
      c(1, unique(frst[From == shift(frst, type = "lag", fill = TRUE)]))
    }]
[1] 1 3 4 5 6
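For instance, on the first example from the edit (where there is no From value of 3) the chain is broken, so this expression should return NULL rather than the expected c(1, 3):
dt <- data.table(From = c(1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 5),
                 To = c(3, 4, 5, 6, 3, 4, 5, 6, 5, 6, 6))
dt[, .(frst = first(To)), From][
  , if (all((frst %in% From)[1:(.N - 1)])) {
      c(1, unique(frst[From == shift(frst, type = "lag", fill = TRUE)]))
    }]
NULL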

Using subcomponent() from igraph.
After ThomasisCoding's comment, I realized that graph_from_data_frame() creates a graph by vertex name.
This wastes memory (and time) if the graph is large (e.g. 1e6 edges).
Note also that graph_from_edgelist(as.matrix(...)) is much faster.
dt2 <- setNames(aggregate(dt$To, list(dt$From), "min"), c("From", "To") )
g <- graph_from_edgelist(as.matrix(dt2), directed=TRUE)
as.numeric(as_ids(subcomponent(g, 1, mode="out")))
First attempt.
dt2 <- setNames(aggregate(dt$To, list(dt$From), "min"), c("From", "To") )
g <- graph_from_data_frame(dt2, directed=TRUE)
as.numeric(as_ids(subcomponent(g, 1, mode="out")))
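A rough way to see the difference is to time both constructors on a synthetic edge list; the size below is only illustrative and timings will vary by machine:
library(igraph)
n <- 1e6
el <- cbind(seq_len(n), seq_len(n) + 1L)   # a simple chain 1 -> 2 -> ... -> n + 1
system.time(g1 <- graph_from_edgelist(el, directed = TRUE))
system.time(g2 <- graph_from_data_frame(as.data.frame(el), directed = TRUE))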

I can't seem to get the other answers to work with certain tables. E.g.,
library(data.table)
library(igraph)
library(purrr)
dt <- data.table(
From = c(1, 1, 1, 1, 2, 2, 4, 5),
To = c(3, 4, 5, 6, 4, 6, 6, 6)
)
fPath1 <- function(dt) {
setorder(dt, From, To)[, wt := fifelse(rleid(To)==1,1,Inf), From] %>%
graph_from_data_frame() %>%
set_edge_attr(name = "weight", value = dt[, wt]) %>%
shortest_paths(min(dt[, From]), max(dt[, To])) %>%
pluck(1) %>%
unlist(use.names = FALSE)
}
fPath2 <- function(dt) {
dt[, .SD[which.min(To)], From] %>%
graph_from_data_frame() %>%
shortest_paths(min(dt[, From]), max(dt[, To])) %>%
pluck(1) %>%
unlist(use.names = FALSE)
}
fPath3 <- function(dt) {
dt[, .(frst = first(To)), From][
, if(all((frst %in% From)[1:(.N - 1)])){
c(1, unique(frst[From == shift(frst, type = "lag", fill = T)]))}]
}
fPath1(dt)
#> [1] 1 6
fPath2(dt)
#> Warning in shortest_paths(., min(dt[, From]), max(dt[, To])): At core/paths/
#> unweighted.c:368 : Couldn't reach some vertices.
#> integer(0)
fPath3(dt)
#> NULL
This igraph solution seems to work based on a little more extensive testing:
fPath4 <- function(dt) {
g <- graph_from_data_frame(dt)
E(g)$weight <- (dt$To - dt$From)^2
as.integer(V(g)[shortest_paths(g, V(g)[1], V(g)[name == dt$To[nrow(dt)]])$vpath[[1]]]$name)
}
fPath4(dt)
#> [1] 1 4 6
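For reference, on the data from the original question (before the edit) fPath4 also returns the expected path; dt0 below is just that first example rebuilt:
dt0 <- data.table(
  From = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5),
  To = c(3, 4, 5, 6, 3, 4, 5, 6, 4, 5, 6, 5, 6, 6)
)
fPath4(dt0)
#> [1] 1 3 4 5 6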

A sequential solution is feasible.
Looping over one million data frame rows took 8 seconds on my system.
n <- 1E6
df1 <- data.frame(from=sample(n), to=sample(n))
path <- c()
system.time(
for (i in seq(nrow(df1)) ){
path[length(path) + 1] <- df1[i, "to"] # avoid copying.
}
)
mean(path)
length(path)
Output.
[1] 500000.5
[1] 1000000
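If the loop itself becomes a bottleneck, one common tweak is to preallocate path instead of growing it inside the loop; a sketch under the same setup (most of the time here is spent in data frame indexing, so the gain may be modest):
path <- numeric(nrow(df1))   # preallocate the result vector
system.time(
  for (i in seq_len(nrow(df1))) {
    path[i] <- df1[i, "to"]
  }
)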
Updated after Phil's last edit.
The first step is to simplify the input (df).
## Select min(To) by From.
if (nrow(df) > 0) { df2 <- setNames(aggregate(df$To, list(df$From), "min"), c("From", "To") )
} else df2 <- df
Set path to first start node and
subsequently append end nodes.
## tt is the furthest node reached so far.
path <- df2[1, 1]
tt <- df2[1, 1]
for (i in seq_len(nrow(df2))) {
  if (df2[i, 1] < tt) {
    next
  } else if (df2[i, 1] == tt) {
    tt <- df2[i, 2]
    path[length(path) + 1] <- df2[i, 2]
  } else {
    break
  }
}
head(path)
Output:
[1] 1 3 4 5 6   (df as in the first example)
[1] 1 2 5       (df as in the 'Another example' data)

Related

Get column index from data frame that matches numeric vector?

Very similar questions have been asked here, here, and here. However, they all seem to rely on knowing the column names of the data.
I am trying to get the column index of a data frame that matches a numeric vector. For example, if I have some data and a vector like so:
dat <- data.frame(
x = c(1,2,3,4,5),
y = c(10,9,8,7,6),
z = c(2,4,6,8,10)
)
testVec <- c(2,4,6,8,10)
I would just like to return the column index of dat that matches testVec. We can see that dat$z matches testVec... so in this situation I would just like to return 3.
Any suggestions as to how I could do this?
Here's a base R approach, which compares every column in dat with testVec to see if they are identical. Use which to output the column index if they're identical.
which(sapply(1:ncol(dat), function(x) identical(dat[,x], testVec)))
[1] 3
UPDATE
@nicola has provided a better syntax for my original code (see the comment under this answer):
which(sapply(dat, identical, y = testVec))
z
3
Perhaps you can try this
> which(colSums(dat == testVec) == nrow(dat))
z
3
An option with select from dplyr
library(dplyr)
dat %>%
select(where(~ all(testVec == .x))) %>%
names %>%
match(names(dat))
[1] 3
Subtract the testVec.
which(colSums(dat - testVec) == 0)
# z
# 3
Without name:
unname(which(colSums(dat - testVec) == 0))
# [1] 3
Data:
dat <- structure(list(x = c(1, 2, 3, 4, 5), y = c(10, 9, 8, 7, 6), z = c(2,
4, 6, 8, 10)), class = "data.frame", row.names = c(NA, -5L))
testVec <- c(2, 4, 6, 8, 10)

Checking sums with condition in R

I need to check the sums in a data frame, but its elements do not add up because some of them contain others. In this example 051040 is part of 051043. Although most elements end with "0", those whose last digit is 3 are always greater than or equal to their counterparts with last digit 0. In other words, I need to find all elements ending with "3" and skip their counterparts ending with "0". In this example 051040 and 051050 must be skipped because 051043 and 051053 are present. Note that element "050000" is a control sum; obviously it must also be skipped. So I need to find the difference:
050000 - SUM("051010", "051043", "051053", "052020", "052100", "052220", "052310") = 0
Something like that.
Here is an artificial example (actual dataframe is really big).
Area <- c("050000", "051010", "051040", "051043", "051050", "051053", "052020", "052100", "052220", "052310")
Total <- c(100, 28, 16, 22, 10, 10, 10, 10, 10, 10)
sodf <- data.frame(Area, Total)
Thanks very much in advance!
You could use the substrings.
sodf[with(sodf, ave(substring(Area, 6) == 3, substr(Area, 1, 5), FUN=\(x) {
if (any(x)) x else TRUE
})), ] |>
(\(x) x[1, 2] - sum(x[-1, 2]))()
# [1] 0
In an lapply over multiple areas, I expanded your sodf slightly to demonstrate (I am not exactly sure what your area codes look like, but the first two digits seem to identify the area).
lapply(split(sodf, substr(sodf$Area, 1, 2)), \(x) {
  x <- x[with(x, ave(substring(Area, 6) == 3, substr(Area, 1, 5), FUN=\(x) {
    if (any(x)) x else TRUE
  })), ]
  x[1, 2] - sum(x[-1, 2])
})
# $`05`
# [1] 0
#
# $`06`
# [1] 111
Data:
sodf <- structure(list(Area = c("050000", "051010", "051040", "051043",
"051050", "051053", "052020", "052100", "052220", "052310", "060000",
"061010", "061040", "061043", "061050", "061053", "062020", "062100",
"062220", "062310"), Total = c(100, 28, 16, 22, 10, 10, 10, 10,
10, 10, 211, 28, 16, 22, 10, 10, 10, 10, 10, 10)), row.names = c(NA,
-20L), class = "data.frame")
A one-line solution can be this:
sum(sodf[!sodf$Area %in% stringr::str_replace_all(sodf[!grepl('0$',sodf$Area),"Area"],'[1-9]$','0') & !sodf$Area %in% sodf[grepl('^[0]+[1-9]{1}[0]+$',sodf$Area),"Area"] ,"Total"])
How does it work?
1. Find the rows whose Area does not end with 0:
sodf[!grepl('0$', sodf$Area), "Area"] # "051043" "051053"
2. Find how those rows would end if their last digit were a 0, using a function from the stringr package:
stringr::str_replace_all(sodf[!grepl('0$', sodf$Area), "Area"], '[1-9]$', '0') # "051040" "051050"
3. Keep the rows that are not in the set from step 2 and do not have the '050000' control-sum structure (^[0]+[1-9]{1}[0]+$), then sum their Total:
sum(sodf[!sodf$Area %in% stringr::str_replace_all(sodf[!grepl('0$',sodf$Area),"Area"],'[1-9]$','0') & !sodf$Area %in% sodf[grepl('^[0]+[1-9]{1}[0]+$',sodf$Area),"Area"], "Total"]) # 100
The same result can be achieved by splitting the code into multiple lines with the dplyr package.
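One possible way to split it up with dplyr and stringr could look like this; it is only a sketch of the same logic, assuming the original 10-row sodf:
library(dplyr)
library(stringr)
sodf %>%
  # drop the control-sum row(s), e.g. "050000"
  filter(!str_detect(Area, "^[0]+[1-9]{1}[0]+$")) %>%
  # drop the "...0" counterparts of areas that also exist ending in "3"
  filter(!Area %in% str_replace(Area[str_detect(Area, "3$")], "3$", "0")) %>%
  summarise(total = sum(Total))   # 100 for the original sodf, as in the one-liner above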
My colleague suggested the following variant.
library(tidyverse)
check_diff <- function(df) {
# Find the control Total value, for instance the row whose Area ends with 0000
sum_control <- df |>
filter(str_detect(Area, "0{4}$")) |>
select(Total)
# For rows whose Area ends in 3, take the second-to-last digit and collapse them into one string
before_last_3 <- df |>
filter(str_detect(Area, "3$")) |>
pull(Area) |>
str_sub(-2, -2) |>
str_flatten()
# Create string for filter pattern
filter3 <- str_c("[", before_last_3, "]0$")
# filter data on new conditions and find sum
sum <- df |>
filter(!str_detect(Area, "0{4}$"),
!str_detect(Area, filter3)) |>
summarise(sum(Total))
sum_control - sum
}
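Called on the original 10-row sodf from the question (not the expanded 20-row version used earlier), this should return the difference as a one-row data frame:
sodf <- data.frame(
  Area = c("050000", "051010", "051040", "051043", "051050",
           "051053", "052020", "052100", "052220", "052310"),
  Total = c(100, 28, 16, 22, 10, 10, 10, 10, 10, 10)
)
check_diff(sodf)   # a 1 x 1 data frame containing 0 for this example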

Piping over a list, subsetting and calculate a function of my own

I have a dataset with these three columns and other additional columns
structure(list(from = c(1, 8, 3, 3, 8, 1, 4, 5, 8, 3, 1, 8, 4,
1), to = c(8, 3, 8, 54, 3, 4, 1, 6, 7, 1, 4, 3, 8, 8), time = c(1521823032,
1521827196, 1521827196, 1522678358, 1522701516, 1522701993, 1522702123,
1522769399, 1522780956, 1522794468, 1522794468, 1522794468, 1522794468,
1522859524)), class = "data.frame", row.names = c(NA, -14L))
I need the code to take all indices less than a number (e.g. 5) and, for each of them, do the following: subset the data set to rows where the index appears in either column "from" or column "to", and calculate a function (e.g. the difference between the min and max of time). As a result I expect a data frame with the indices and the results of the calculation.
This is what I have, but it does not work.
dur<-function(x)max(x)-min(x) #The function to calculate the difference. In other cases I need to use other functions of my own
filternumber <- function(number, x){ # A function to filter data x by the number in the two columns
x <- x%>% subset(from == number | to == number)
return(x)
}
lista <- unique(c(data$from, data$to)) # Create a vector of all the indices in the data. I do this to avoid non-existing indices
lista <-lista[lista <= 5] #Limit the list to 5. In my code this number would be an argument to a function
result <- lista %>% filternumber(., data) %>% select(time) %>% dur() # I use select because I have many other columns in the data
The result in this case should be a dataframe with 1036492 for 1, 967272 for 3 and 92475 for 4
I've also tried putting filternumber(., data) %>% select(time) %>% dur() inside mutate, but that does not work either.
Perhaps you are looking for something like this:
library(purrr)
library(dplyr)
index <- c(1, 3, 4)
names(index) <- index
index %>%
map_dfr(~ df %>%
filter(from == .x | to == .x) %>%
summarize(result = dur(time)),
.id = "index")
This returns
index result
1 1 1036492
2 3 967272
3 4 92475
The function was created with ==, which is elementwise. Here, we may need to loop
library(dplyr)
library(purrr)
map_dbl(lista, ~ filternumber(.x, data) %>%
select(time) %>%
dur)
[1] 1036492 967272 92475 0

How to check each element value of a vector testing for two specific values?

library(tidyverse)
I have a vector, called Drink_Choice:
Drink_Choice <- c(NA, 3, 4, 2, 1, 2, 5, NA)
where 1-5 represents a certain drink type.
Once NAs have been omitted, I want to test whether each element of the vector has a value of 1 or 5. Initially I tried this:
Drink_Choice %>%
na.omit()%>%
if (Drink_Choice == 1|5) {
cat("Even\n")
} else {
cat("Odd\n")
}
As output I need TRUE (where the element is 1 or 5) or FALSE (where the element is 2 to 4). Many thanks for any help!
If you do not mind skipping the pipes (%>%), try:
na.omit(Drink_Choice) %in% c(1,5)
But here it is with pipes as well:
Drink_Choice %>% na.omit() %>% (function(x) x %in% c(1,5))

Adding a column based on a list dplyr

I am trying to summarise a list of dataframes. Here is some test data
noms <- list('A', 'B')
A_data <- data.frame('Dis' = c(1, 1, 2, 2),
'adj' = c(3, 2, 6, 7))
B_data <- data.frame('Dis' = c(1, 1, 2, 2),
'adj' = c(2, 6, 3, 6))
frames <- list(A_data, B_data)
I want to produce a list of data frames where 'adj' is summed for each 'Dis' group, and then add a column with the relevant name from 'noms', so that I can later combine the data frames into a single data frame.
So far I have this:
totals <- setNames(lapply(frames, function (x)
x %>%
dplyr::group_by(Dis) %>%
dplyr::summarise(total = sum(adj)))
,paste0(unlist(noms)))
But I can't figure out how to add a column with the relevant name. I know I need to use the mutate function, something like this:
totals <- setNames(lapply(frames, function (x)
x %>%
dplyr::group_by(Dis) %>%
dplyr::summarise(total = sum(adj)) %>%
dplyr::mutate(nom = )
,paste0(unlist(noms)))
but I can't figure out how to add the correct name.
The expected output would be a list of two data frames, one for 'A' and one for 'B'. Here is the expected output for 'A':
Dis total Nom
1 1 5 A
2 2 13 A
How do I do this?
A base R option where we use Map instead of lapply
out <- Map(function(x, y) {
transform(aggregate(adj ~ Dis, data = x, sum), Nom = y)
}, x = frames, y = noms)
out
#[[1]]
# Dis adj Nom
#1 1 5 A
#2 2 13 A
#[[2]]
# Dis adj Nom
#1 1 8 B
#2 2 9 B
The same idea with tidyverse functions
library(purrr); library(dplyr)
map2(.x = frames, .y = noms, ~ .x %>%
group_by(Dis) %>%
summarise(adj = sum(adj)) %>%
mutate(Nom = .y))
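Since the stated goal is to later combine the per-name summaries into a single data frame, either result list (say out from the Map version above) can then be collapsed directly:
library(dplyr)
combined <- bind_rows(out)   # or: do.call(rbind, out)
combined
#  Dis adj Nom
#1   1   5   A
#2   2  13   A
#3   1   8   B
#4   2   9   B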
