I am trying to come up with a way to loop through a data frame, recognise points shared between two columns, and iteratively assign a unique factor to them. Specifically, I have a data frame indicating points along a river and which points are immediately upstream of each point.
Here is some example data:
df <- data.frame(
  RiverID = rep(c(1, 2), each = 15),
  SiteID = rep(c(2,3,4,5,6,7,8,9,10,11,12,13,13,13,13), 2),
  Upstream_SiteID = rep(c(1,1,1,2,2,3,4,5,6,7,8,9,10,11,12), 2),
  Dist2Mouth = rep(c(2000,2000,2000,1500,1500,1500,1500,1000,1000,1000,1000,500,500,500,500), 2)
)
And the basic paint image below might help explain the kind of data I have.
What I would like to do is identify all possible 'routes' through the system (orange letters in the image). So in the example, I would start from point 13, go 'upstream' in the data frame (i.e. lower Dist2Mouth values) and recognise 4 different routes (A-D). I then need to iteratively keep working up the data frame and assign routes to each point.
There are some instances where the stretch of river between the two points could belong to two routes. For example in the image below, the section between 2-1 could be part of routes C or D. In these instances, I would like to create multiple rows with the same SiteID and Upstream_SiteID that list the different potential routes.
Lastly, I have these instances across various rivers, so I would like to loop through the dataframe and apply the above code for each RiverID.
Desired output (the correct routes apply to RiverID '1'):
output <- data.frame(
  RiverID = rep(c(1, 2), each = 16),
  SiteID = rep(c(2,2,3,4,5,6,7,8,9,10,11,12,13,13,13,13), 2),
  Upstream_SiteID = rep(c(1,1,1,1,2,2,3,4,5,6,7,8,9,10,11,12), 2),
  Dist2Mouth = rep(c(2000,2000,2000,2000,1500,1500,1500,1500,1000,1000,1000,1000,500,500,500,500), 2),
  Route = as.factor(c(rep(c("D","C","B","A"), times = 4), rep(c("H","G","F","E"), times = 4)))
)
igraph is definitely the right tool for problems like this. However, it is also possible to do what you need with a few lines of code without the package. The routes are identified by the go_up() function, which uses a depth-first search.
## Builds a nested list with all possible paths on its deepest level
#
# path  path so far
# ups   list of upstream points for each point
#
go_up <- function(path, ups) {
  # last point of the path
  last <- tail(path, 1)
  if (last %in% names(ups)) {
    # continue with all possible upstream points
    lapply(ups[[as.character(last)]],
           function(up) go_up(c(path, up), ups))
  } else {
    # finish if no upstream point exists; the path is collapsed into a
    # string so that the resulting list can be easily flattened
    paste(path, collapse = '---')
  }
}
This can then be applied to each river separately like this:
river.routes <- lapply(split(df, df$RiverID), function(river) {
  # list with upstream points for each point
  ups <- tapply(river$Upstream_SiteID, river$SiteID, c)
  # we will start from the highest ID
  last <- max(river$SiteID)
  # find the routes
  routes <- go_up(last, ups)
  # flatten the list and split the routes into points
  routes <- strsplit(unlist(routes), '---')
  # add a logical column for each route
  for (i in seq_along(routes)) {
    river[[paste0('route', i)]] <- river$SiteID %in% routes[[i]]
  }
  river
})
Output:
river.routes
# $`1`
# RiverID SiteID Upstream_SiteID Dist2Mouth route1 route2 route3 route4
# 1 1 2 1 2000 TRUE TRUE FALSE FALSE
# 2 1 3 1 2000 FALSE FALSE TRUE FALSE
# 3 1 4 1 2000 FALSE FALSE FALSE TRUE
# 4 1 5 2 1500 TRUE FALSE FALSE FALSE
# 5 1 6 2 1500 FALSE TRUE FALSE FALSE
# 6 1 7 3 1500 FALSE FALSE TRUE FALSE
# 7 1 8 4 1500 FALSE FALSE FALSE TRUE
# 8 1 9 5 1000 TRUE FALSE FALSE FALSE
# 9 1 10 6 1000 FALSE TRUE FALSE FALSE
# 10 1 11 7 1000 FALSE FALSE TRUE FALSE
# 11 1 12 8 1000 FALSE FALSE FALSE TRUE
# 12 1 13 9 500 TRUE TRUE TRUE TRUE
# 13 1 13 10 500 TRUE TRUE TRUE TRUE
# 14 1 13 11 500 TRUE TRUE TRUE TRUE
# 15 1 13 12 500 TRUE TRUE TRUE TRUE
#
# $`2`
# RiverID SiteID Upstream_SiteID Dist2Mouth route1 route2 route3 route4
# 16 2 2 1 2000 TRUE TRUE FALSE FALSE
# 17 2 3 1 2000 FALSE FALSE TRUE FALSE
# 18 2 4 1 2000 FALSE FALSE FALSE TRUE
# 19 2 5 2 1500 TRUE FALSE FALSE FALSE
# 20 2 6 2 1500 FALSE TRUE FALSE FALSE
# 21 2 7 3 1500 FALSE FALSE TRUE FALSE
# 22 2 8 4 1500 FALSE FALSE FALSE TRUE
# 23 2 9 5 1000 TRUE FALSE FALSE FALSE
# 24 2 10 6 1000 FALSE TRUE FALSE FALSE
# 25 2 11 7 1000 FALSE FALSE TRUE FALSE
# 26 2 12 8 1000 FALSE FALSE FALSE TRUE
# 27 2 13 9 500 TRUE TRUE TRUE TRUE
# 28 2 13 10 500 TRUE TRUE TRUE TRUE
# 29 2 13 11 500 TRUE TRUE TRUE TRUE
# 30 2 13 12 500 TRUE TRUE TRUE TRUE
I didn't much like the idea of duplicating the rows for points located on more than one route, so I added a logical column for each route instead. You can change that if you like; the routes are available inside the lapply in the routes variable, which looks like this:
# [[1]]
# [1] "13" "9" "5" "2" "1"
#
# [[2]]
# [1] "13" "10" "6" "2" "1"
#
# [[3]]
# [1] "13" "11" "7" "3" "1"
#
# [[4]]
# [1] "13" "12" "8" "4" "1"
The core function you want from igraph is all_simple_paths, which will give you a list of paths/routes. To create a graph you need unique IDs, so I'm just ignoring the second river piece.
library(igraph)
library(purrr)
library(dplyr)
df_w1 <- data.frame(
  SiteID = c(2,3,4,5,6,7,8,9,10,11,12,13,13,13,13),
  Upstream_SiteID = c(1,1,1,2,2,3,4,5,6,7,8,9,10,11,12),
  Dist2Mouth = c(2000,2000,2000,1500,1500,1500,1500,1000,1000,1000,1000,500,500,500,500)
)
paths <- graph_from_data_frame(df_w1) |>      # create graph object
  all_simple_paths("13", "1") |>              # enumerate paths
  map(~attr(.x, "names") |> as.integer()) |>  # get the labels, not internal ids
  print()
#> [[1]]
#> [1] 13 9 5 2 1
#>
#> [[2]]
#> [1] 13 10 6 2 1
#>
#> [[3]]
#> [1] 13 11 7 3 1
#>
#> [[4]]
#> [1] 13 12 8 4 1
For your purposes you can then enumerate the routes each segment belongs to and join back to the original data (so as not to lose the metadata): iterate through the segments and check whether both sites are in each of the paths.
# assume fewer than 26 routes
codingList <- map2_dfr(
  df_w1$SiteID, df_w1$Upstream_SiteID,
  \(from, to) tibble(
    SiteID = from,
    Upstream_SiteID = to,
    Route = LETTERS[map_lgl(paths, \(p) all(c(from, to) %in% p)) |> which()]
  )
) |>
  print()
#> # A tibble: 16 × 3
#> SiteID Upstream_SiteID Route
#> <dbl> <dbl> <chr>
#> 1 2 1 A
#> 2 2 1 B
#> 3 3 1 C
#> 4 4 1 D
#> 5 5 2 A
#> 6 6 2 B
#> 7 7 3 C
#> 8 8 4 D
#> 9 9 5 A
#> 10 10 6 B
#> 11 11 7 C
#> 12 12 8 D
#> 13 13 9 A
#> 14 13 10 B
#> 15 13 11 C
#> 16 13 12 D
df_w1 |>
  left_join(codingList, by = c("SiteID", "Upstream_SiteID"))
#> SiteID Upstream_SiteID Dist2Mouth Route
#> 1 2 1 2000 A
#> 2 2 1 2000 B
#> 3 3 1 2000 C
#> 4 4 1 2000 D
#> 5 5 2 1500 A
#> 6 6 2 1500 B
#> 7 7 3 1500 C
#> 8 8 4 1500 D
#> 9 9 5 1000 A
#> 10 10 6 1000 B
#> 11 11 7 1000 C
#> 12 12 8 1000 D
#> 13 13 9 500 A
#> 14 13 10 500 B
#> 15 13 11 500 C
#> 16 13 12 500 D
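The question also asks to loop over RiverID. Since igraph needs unique vertex IDs, one way (a sketch of my own, assuming the original two-river df from the question; routes_per_river is a made-up name) is to build one graph per river:
library(igraph)

# one graph per river: split by RiverID, then enumerate simple paths
# from site 13 (the starting point above) to site 1
routes_per_river <- lapply(split(df, df$RiverID), function(river) {
  g <- graph_from_data_frame(river[c("SiteID", "Upstream_SiteID")])
  all_simple_paths(g,
                   from = as.character(max(river$SiteID)),
                   to = as.character(min(river$Upstream_SiteID)))
})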
This is a question that came to mind when seeing that SO question:
x <- list(c(1:6,32,24), c(1:4,8,10,12,13,17,24), c(1:5,9:15,17,18,19,20,32))
IND <- !duplicated(unlist(x))
INPUT
> x
[[1]]
[1] 1 2 3 4 5 6 32 24
[[2]]
[1] 1 2 3 4 8 10 12 13 17 24
[[3]]
[1] 1 2 3 4 5 9 10 11 12 13 14 15 17 18 19 20 32
> IND
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[23] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
Is it possible to access the list elements with IND so I get:
DESIRED OUTPUT
[[1]]
[1] 1 2 3 4 5 6 32 24
[[2]]
[1] 8 10 12 13 17
[[3]]
[1] 9 11 14 15 18 19 20
Normally I would access the elements like x[[1]][1] etc ...
You can try:
IND2 <- split(IND, rep(1:3, sapply(x, length)))
Map(function(x, y) x[y], x, IND2)
[[1]]
[1] 1 2 3 4 5 6 32 24
[[2]]
[1] 8 10 12 13 17
[[3]]
[1] 9 11 14 15 18 19 20
The idea is to transform the IND vector back into a list using the corresponding lengths, then use Map to loop through each list element for subsetting.
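An alternative sketch (my addition): base R's relist() can rebuild IND into the shape of x directly, so the lengths don't have to be spelled out.
# relist() splits IND according to the structure of x,
# so Map can subset each element directly
Map(`[`, x, relist(IND, skeleton = x))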
Or try a reshape2 & tidyverse solution:
library(tidyverse)
library(reshape2)
melt(x) %>%
filter(!duplicated(value)) %>%
with(., split(value, L1))
I have a data frame p1. I would like to transpose it by column a, find the minimum of each row, and return the column name that has the minimum value.
a=c(0,1,2,3,4,0,1,2,3,4)
b=c(10,20,30,40,50,9,8,7,6,5)
p1=data.frame(a,b)
p1
> p1
a b
1 0 10
2 1 20
3 2 30
4 3 40
5 4 50
6 0 9
7 1 8
8 2 7
9 3 6
10 4 5
The final required answer
0 1 2 3 4 row_minimum column_index_of_minimum
10 20 30 40 50 10 0
9 8 7 6 5 5 4
I used many things, but the main one was ave(p1$a, p1$a, FUN = seq_along), which allowed me to separate the b values into groups based on the number of times they were associated with a:
myans = setNames(
  data.frame(do.call(rbind, lapply(split(p1, ave(p1$a, p1$a, FUN = seq_along)),
                                   function(x) x[, 2]))),
  nm = rbind(p1$a[ave(p1$a, p1$a, FUN = seq_along) == 1]))
minimum = apply(myans, 1, min)
index = colnames(myans)[apply(myans, 1, which.min)]
myans$min = minimum
myans$index = index
myans
# 0 1 2 3 4 min index
#1 10 20 30 40 50 10 0
#2 9 8 7 6 5 5 4
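For intuition (an illustration I've added), the grouping vector that ave() produces here is a running occurrence count within each value of a, so the first block of rows goes to group 1 and the second to group 2:
ave(p1$a, p1$a, FUN = seq_along)
# [1] 1 1 1 1 1 2 2 2 2 2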
Consider using a running group count followed by an aggregate and reshape:
# RUNNING GROUP COUNT
p1$grpcnt <- sapply(seq(nrow(p1)), function(i) sum(p1[1:i, c("a")]==p1$a[[i]]))
# MINIMUM OF B BY GROUP COUNT MERGING TO RETRIEVE A VALUE
aggdf <- setNames(
  merge(aggregate(b ~ grpcnt, p1, FUN = min), p1, by = "b")[c("grpcnt.x", "b", "a")],
  c("grpcnt", "row_minimum", "column_index_of_minimum"))
# RESHAPE/TRANSPOSE LONG TO WIDE
reshapedf <- setNames(
  reshape(p1, timevar = c("a"), idvar = c("grpcnt"), direction = "wide"),
  c("grpcnt", paste(unique(p1$a))))
# FINAL MERGE
finaldf <- merge(reshapedf, aggdf, by="grpcnt")[-1]
finaldf
# 0 1 2 3 4 row_minimum column_index_of_minimum
# 1 10 20 30 40 50 10 0
# 2 9 8 7 6 5 5 4
I have a time series matrix called trendtable, which has data from 1960 to 2010 (57 columns) for 175 countries. The column names are years and their corresponding values are given. I need to find the difference between the columns to get the trend for each country, i.e. whether the trend is going up or down. The resulting trend values should go in a new table. The code I have written is below, but I guess it's wrong.
for (i in 1:175) {
  Trend <- 0
  for (j in 5:56) {
    Dif <- TimeSeriesCO2[i, j] - TimeSeriesCO2[i, j + 1]
    if (Dif < 0) {
      Trend <- Trend - 1   # R has no "--" operator
    } else {
      Trend <- Trend + 1   # R has no "++" operator
    }
  }
  TrendTable <- rbind(TrendTable, Trend)
}
Here's a stab, with fake data:
set.seed(42)
nc <- 9 ; nr <- 5
mtx <- t(replicate(nr, cumsum(sample(-3:3, size = nc, replace = TRUE))))
dimnames(mtx) <- list(LETTERS[seq_len(nr)], 2000 + seq_len(nc))
mtx
# 2001 2002 2003 2004 2005 2006 2007 2008 2009
# A 3 6 5 7 8 8 10 7 8
# B 1 1 3 6 4 4 7 10 7
# C 0 0 3 0 3 6 3 3 2
# D 3 3 5 7 9 8 9 6 8
# E -3 -5 -2 -1 -2 -2 -5 -2 -2
So I'm simulating your 175 countries with five letters, and 57 years with 9.
A first cut could be to determine if a year is greater than or equal to the previous year.
t(apply(mtx, 1, diff) >= 0)
# 2002 2003 2004 2005 2006 2007 2008 2009
# A TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
# B TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
# C TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
# D TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
# E FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
(The need to transpose the result from apply is perhaps unintuitive, but it is a fact of dealing with it. There may be ways to do it without the transpose, but none that I know of in base R that are as direct.)
Note that there is one fewer column than in the original dataset. If you want to keep track, for each year, of whether there is a cumulative increase or not, one might do:
t(apply(mtx, 1, function(a) cumsum(diff(a) >= 0)))
# 2002 2003 2004 2005 2006 2007 2008 2009
# A 1 1 2 3 4 5 5 6
# B 1 2 3 3 4 5 6 6
# C 1 2 2 3 4 4 5 5
# D 1 2 3 4 4 5 5 6
# E 0 1 2 2 3 3 4 5
But I don't necessarily find that more informative.
From your code, it looks like you might be expecting a single number for each country (letter), in which case you can simplify this further to:
apply(mtx, 1, function(a) sum(diff(a) >= 0))
# A B C D E
# 6 6 5 6 5
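If you instead want the increment/decrement behaviour your loop was aiming for (a net count of up-steps minus down-steps), a small variation of the same idea might be (my sketch; the counts follow from the TRUE/FALSE table above):
# net trend per country: +1 per non-decreasing step, -1 per decreasing step
apply(mtx, 1, function(a) { d <- diff(a); sum(d >= 0) - sum(d < 0) })
# A B C D E
# 4 4 2 4 2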
As far as adding this somehow to your current data, I am neither certain what you want (from your example) nor do I recommend appending summary data to the original matrix. The sole exception is for immediate visualization, not storage.
I work with data frames for flight movements (~ 1 million rows * 108 variables) and want to group phases during which a certain criterion is met (i.e. the value of a certain variable). In order to identify these groups, I want to number them.
Being an R newbie, I made it work for my case. Now I am looking for a more elegant way. In particular, I would like to get rid of the "useless" gaps in the numbering of the groups.
I provide a simplified example of my dplyr data frame with the value THR for the threshold criterion. The rows are sorted by timestamp (and thus I can truncate this here).
THR <- c(13,17,19,22,21,19,17,12,12,17,20,20,20,17,17,13, 20,20,17,13)
df <- as.data.frame(THR)
df <- tbl_df(df)
To flag all rows where the criterion is (not) met
df <- mutate(df, CRIT = THR < 19)
With the following, I managed to conditionally "cumsum" to get a unique group identification:
df <- mutate(df, GRP = ifelse(CRIT == 1, 0, cumsum(CRIT)))
df
THR  CRIT GRP
1 13 TRUE 0
2 17 TRUE 0
3 19 FALSE 2
4 22 FALSE 2
5 21 FALSE 2
6 19 FALSE 2
7 17 TRUE 0
8 12 TRUE 0
9 12 TRUE 0
10 17 TRUE 0
11 20 FALSE 6
12 20 FALSE 6
While this does the trick and I can operate on the groups with group_by (e.g. summarise, filter), the numbering is not ideal, as can be seen in the example output: the 1st group is numbered 2 and the 2nd group is numbered 6, which is in line with the cumsum() result.
I would appreciate, if anybody could shed some light on me. I was not able to find an appropriate solution in other posts.
I don't think you can really avoid that preliminary step of creating CRIT, though I'd suggest adding the cumsum when creating it and then just running a simple cumsum/diff wrap-up on it. Also, if you don't need the groups that aren't meeting the criteria, it is better to assign NA instead of just some arbitrary number such as zero. Here's a possible data.table wrap-up (also, you don't need the df <- tbl_df(df) step at all):
library(data.table)
setDT(df)[, CRIT := cumsum(THR < 19)]
df[THR >= 19, GRP := cumsum(c(0L, diff(CRIT)) != 0L) + 1L]
# THR CRIT GRP
# 1: 13 1 NA
# 2: 17 2 NA
# 3: 19 2 1
# 4: 22 2 1
# 5: 21 2 1
# 6: 19 2 1
# 7: 17 3 NA
# 8: 12 4 NA
# 9: 12 5 NA
# 10: 17 6 NA
# 11: 20 6 2
# 12: 20 6 2
# 13: 20 6 2
# 14: 17 7 NA
# 15: 17 8 NA
# 16: 13 9 NA
# 17: 20 9 3
# 18: 20 9 3
# 19: 17 10 NA
# 20: 13 11 NA
You can do:
x = rle(df$CRIT)
mask = x$values
x$values[mask] = 0
x$values[!mask] = cumsum(!x$values[!mask])
mutate(df, GRP=inverse.rle(x))
# THR CRIT GRP
#1 13 TRUE 0
#2 17 TRUE 0
#3 19 FALSE 1
#4 22 FALSE 1
#5 21 FALSE 1
#6 19 FALSE 1
#7 17 TRUE 0
#8 12 TRUE 0
#9 12 TRUE 0
#10 17 TRUE 0
#11 20 FALSE 2
#12 20 FALSE 2
#13 20 FALSE 2
#14 17 TRUE 0
#15 17 TRUE 0
#16 13 TRUE 0
#17 20 FALSE 3
#18 20 FALSE 3
#19 17 TRUE 0
#20 13 TRUE 0
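Since the question already uses dplyr, here is a minimal dplyr-only sketch of the same idea (my addition, matching the rle() output above): count the starts of the FALSE runs with cumsum() and lag(), and zero out the rows where the criterion holds. Assuming the df and CRIT column from the question:
library(dplyr)

df %>%
  mutate(GRP = if_else(CRIT,
                       0L,
                       # a new group starts wherever CRIT flips to FALSE
                       cumsum(!CRIT & lag(CRIT, default = TRUE))))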