Loop through dataframe to create new variable - r

I am trying to come up with a way to loop through a data frame, recognise points shared between two columns, and work iteratively to assign a unique factor to them. Specifically, I have a data frame listing points along a river and, for each point, which point is immediately upstream of it.
Here is some example data:
df <- data.frame(RiverID = rep(c(1,2), each = 15), SiteID = rep((c(2,3,4,5,6,7,8,9,10,11,12,13,13,13,13)),2),
Upstream_SiteID = rep((c(1,1,1,2,2,3,4,5,6,7,8,9,10,11,12)),2),
Dist2Mouth = rep((c(2000,2000,2000,1500,1500,1500,1500,1000,1000,1000,1000,500,500,500,500)),2))
And the basic paint image below might help explain the kind of data I have.
What I would like to do is identify all possible 'routes' through the system (orange letters in the image). So in the example, I would start from point 13 (the lowest Dist2Mouth value), go 'upstream' in the data frame (i.e. towards higher Dist2Mouth values) and recognise 4 different routes (A-D). I then need to keep working up the data frame iteratively and assign routes to each point.
There are some instances where the stretch of river between the two points could belong to two routes. For example in the image below, the section between 2-1 could be part of routes C or D. In these instances, I would like to create multiple rows with the same SiteID and Upstream_SiteID that list the different potential routes.
Lastly, I have these instances across various rivers, so I would like to loop through the dataframe and apply the above code for each RiverID.
Desired output (the correct routes apply to RiverID '1'):
output <- data.frame(RiverID = rep(c(1,2), each = 16),
SiteID = rep((c(2,2,3,4,5,6,7,8,9,10,11,12,13,13,13,13)),2),
Upstream_SiteID = rep((c(1,1,1,1,2,2,3,4,5,6,7,8,9,10,11,12)),2),
Dist2Mouth = rep((c(2000,2000,2000,2000,1500,1500,1500,1500,1000,1000,1000,1000,500,500,500,500)),2),
Route = as.factor(c(rep(c("D","C","B","A"),times = 4),rep(c("H","G","F","E"),times = 4))))

igraph is definitely the right tool for problems like this. However, it is also possible to do what you need with a few lines of code without the package. The routes are identified by the go_up() function which uses a depth-first search algorithm.
## Builds a nested list with all possible paths on its deepest level
#
# path: the path so far
# ups:  list of upstream points for each point
#
go_up <- function(path, ups) {
# last point of the path
last <- tail(path, 1)
if (last %in% names(ups)) {
# continue with all possible upstream points
lapply(ups[[as.character(last)]],
function(up) go_up(c(path, up), ups))
# finish if no upstream point exists
} else paste(path, collapse='---')
# path is collapsed into a string so that the resulting list can be
# easily flattened
}
This can then be applied to each river separately like this:
river.routes <- lapply(split(df, df$RiverID), function(river) {
# list with upstream points for each point
ups <- tapply(river$Upstream_SiteID, river$SiteID, c)
# we will start from the highest ID
last <- max(river$SiteID)
# find the routes
routes <- go_up(last, ups)
# flatten the list and split the routes into points
routes <- strsplit(unlist(routes), '---')
# add a logical column for each route
for (i in seq_along(routes)) {
river[[paste0('route', i)]] <- river$SiteID %in% routes[[i]]
}
river
})
Output:
river.routes
# $`1`
# RiverID SiteID Upstream_SiteID Dist2Mouth route1 route2 route3 route4
# 1 1 2 1 2000 TRUE TRUE FALSE FALSE
# 2 1 3 1 2000 FALSE FALSE TRUE FALSE
# 3 1 4 1 2000 FALSE FALSE FALSE TRUE
# 4 1 5 2 1500 TRUE FALSE FALSE FALSE
# 5 1 6 2 1500 FALSE TRUE FALSE FALSE
# 6 1 7 3 1500 FALSE FALSE TRUE FALSE
# 7 1 8 4 1500 FALSE FALSE FALSE TRUE
# 8 1 9 5 1000 TRUE FALSE FALSE FALSE
# 9 1 10 6 1000 FALSE TRUE FALSE FALSE
# 10 1 11 7 1000 FALSE FALSE TRUE FALSE
# 11 1 12 8 1000 FALSE FALSE FALSE TRUE
# 12 1 13 9 500 TRUE TRUE TRUE TRUE
# 13 1 13 10 500 TRUE TRUE TRUE TRUE
# 14 1 13 11 500 TRUE TRUE TRUE TRUE
# 15 1 13 12 500 TRUE TRUE TRUE TRUE
#
# $`2`
# RiverID SiteID Upstream_SiteID Dist2Mouth route1 route2 route3 route4
# 16 2 2 1 2000 TRUE TRUE FALSE FALSE
# 17 2 3 1 2000 FALSE FALSE TRUE FALSE
# 18 2 4 1 2000 FALSE FALSE FALSE TRUE
# 19 2 5 2 1500 TRUE FALSE FALSE FALSE
# 20 2 6 2 1500 FALSE TRUE FALSE FALSE
# 21 2 7 3 1500 FALSE FALSE TRUE FALSE
# 22 2 8 4 1500 FALSE FALSE FALSE TRUE
# 23 2 9 5 1000 TRUE FALSE FALSE FALSE
# 24 2 10 6 1000 FALSE TRUE FALSE FALSE
# 25 2 11 7 1000 FALSE FALSE TRUE FALSE
# 26 2 12 8 1000 FALSE FALSE FALSE TRUE
# 27 2 13 9 500 TRUE TRUE TRUE TRUE
# 28 2 13 10 500 TRUE TRUE TRUE TRUE
# 29 2 13 11 500 TRUE TRUE TRUE TRUE
# 30 2 13 12 500 TRUE TRUE TRUE TRUE
I didn't much like the idea of duplicating the rows for points located on more than one route, so I added a logical column for each route instead. You can change that if you like: the routes are available inside the lapply in the routes variable, which looks like this:
# [[1]]
# [1] "13" "9" "5" "2" "1"
#
# [[2]]
# [1] "13" "10" "6" "2" "1"
#
# [[3]]
# [1] "13" "11" "7" "3" "1"
#
# [[4]]
# [1] "13" "12" "8" "4" "1"
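If the duplicated-row layout from the question is preferred, the routes list can be expanded into a long table. The following is a minimal sketch for river 1 only. It assumes that a segment belongs to a route whenever both of its end points lie on that route, which holds for the tree-shaped example but not necessarily for arbitrary networks. The names long and on_route are illustrative, and the letters are assigned in the order the routes were found, so they will not match the A-D labels in the image.

```r
# River 1 from the question (only the two ID columns are needed here)
river <- data.frame(
  SiteID          = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13, 13),
  Upstream_SiteID = c(1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
)

# Routes as produced by go_up() followed by strsplit() above
routes <- list(c("13", "9",  "5", "2", "1"),
               c("13", "10", "6", "2", "1"),
               c("13", "11", "7", "3", "1"),
               c("13", "12", "8", "4", "1"))

# One block of rows per route: keep the segments whose two ends are on the route
long <- do.call(rbind, lapply(seq_along(routes), function(i) {
  on_route <- river$SiteID %in% routes[[i]] &
    river$Upstream_SiteID %in% routes[[i]]
  cbind(river[on_route, ], Route = LETTERS[i])
}))

# Sort so that shared segments (such as 2 -> 1) appear next to each other
long <- long[order(long$SiteID, long$Upstream_SiteID), ]
rownames(long) <- NULL
long
```

The segment between sites 2 and 1 appears twice in long, once for each route that passes through it, which is the layout of the desired output in the question.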

The core function you want with igraph is all_simple_paths which will give you a list of paths/routes. To create a graph, you need unique ids, so I'm just ignoring the second river piece.
library(igraph)
library(purrr)
library(dplyr)
df_w1 <- data.frame(
SiteID = c(2,3,4,5,6,7,8,9,10,11,12,13,13,13,13),
Upstream_SiteID = c(1,1,1,2,2,3,4,5,6,7,8,9,10,11,12),
Dist2Mouth = c(2000,2000,2000,1500,1500,1500,1500,1000,1000,1000,1000,500,500,500,500)
)
paths <- graph_from_data_frame(df_w1) |> # create graph object
all_simple_paths("13", "1") |> # enumerate paths
map(~attr(.x, "names") |> as.integer()) |> # get the labels and not internal ids
print()
#> [[1]]
#> [1] 13 9 5 2 1
#>
#> [[2]]
#> [1] 13 10 6 2 1
#>
#> [[3]]
#> [1] 13 11 7 3 1
#>
#> [[4]]
#> [1] 13 12 8 4 1
For your purposes, you can then enumerate the routes each segment belongs to and join the result back to the original data (so as not to lose the metadata). Iterate through the segments and check whether both sites are in each of the paths.
# assume less than 26 routes
codingList <- map2_dfr(
df_w1$SiteID, df_w1$Upstream_SiteID,
\(from, to) tibble(
SiteID = from,
Upstream_SiteID = to,
Route = LETTERS[map_lgl(paths, \(p) all(c(from, to) %in% p)) |> which()]
)
) |>
print()
#> # A tibble: 16 × 3
#> SiteID Upstream_SiteID Route
#> <dbl> <dbl> <chr>
#> 1 2 1 A
#> 2 2 1 B
#> 3 3 1 C
#> 4 4 1 D
#> 5 5 2 A
#> 6 6 2 B
#> 7 7 3 C
#> 8 8 4 D
#> 9 9 5 A
#> 10 10 6 B
#> 11 11 7 C
#> 12 12 8 D
#> 13 13 9 A
#> 14 13 10 B
#> 15 13 11 C
#> 16 13 12 D
df_w1 |>
left_join(codingList, by = c("SiteID", "Upstream_SiteID"))
#> SiteID Upstream_SiteID Dist2Mouth Route
#> 1 2 1 2000 A
#> 2 2 1 2000 B
#> 3 3 1 2000 C
#> 4 4 1 2000 D
#> 5 5 2 1500 A
#> 6 6 2 1500 B
#> 7 7 3 1500 C
#> 8 8 4 1500 D
#> 9 9 5 1000 A
#> 10 10 6 1000 B
#> 11 11 7 1000 C
#> 12 12 8 1000 D
#> 13 13 9 500 A
#> 14 13 10 500 B
#> 15 13 11 500 C
#> 16 13 12 500 D

Related

Use index vector to select / remove values from a list

This question came to mind when I saw another SO question.
x <- list(c(1:6,32,24), c(1:4,8,10,12,13,17,24), c(1:5,9:15,17,18,19,20,32))
IND <- !duplicated(unlist(x))
INPUT
> x
[[1]]
[1] 1 2 3 4 5 6 32 24
[[2]]
[1] 1 2 3 4 8 10 12 13 17 24
[[3]]
[1] 1 2 3 4 5 9 10 11 12 13 14 15 17 18 19 20 32
> IND
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[23] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
Is it possible to access the list elements with IND so I get:
DESIRED OUTPUT
[[1]]
[1] 1 2 3 4 5 6 32 24
[[2]]
[1] 8 10 12 13 17
[[3]]
[1] 9 11 14 15 18 19 20
Normally I would access the elements like x[[1]][1] etc ...
You can try:
IND2 <- split(IND, rep(seq_along(x), lengths(x)))
Map(function(x, y) x[y], x, IND2)
[[1]]
[1] 1 2 3 4 5 6 32 24
[[2]]
[1] 8 10 12 13 17
[[3]]
[1] 9 11 14 15 18 19 20
The idea is to transform the IND vector back into a list using the lengths of the corresponding list elements, and then to use Map to loop over the elements of both lists and subset each one.
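As a sketch of the same idea in self-contained form, the snippet below rebuilds the inputs from the question and shows two ways of giving IND the shape of x: the split() approach, and base R's relist(), which reshapes a flat vector to the structure of a skeleton list. The names res_split and res_relist are only illustrative.

```r
x <- list(c(1:6, 32, 24),
          c(1:4, 8, 10, 12, 13, 17, 24),
          c(1:5, 9:15, 17, 18, 19, 20, 32))
IND <- !duplicated(unlist(x))

# Variant 1: split the flat index by a group id derived from the list itself
IND2 <- split(IND, rep(seq_along(x), lengths(x)))
res_split <- Map(function(v, keep) v[keep], x, IND2)

# Variant 2: relist() gives IND the same nested structure as x
res_relist <- Map(`[`, x, relist(IND, skeleton = x))

res_split
```

Both variants keep only the first occurrence of each value across the whole list, so the first element is returned unchanged and later elements lose anything already seen.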
Or try a reshape2 & tidyverse solution:
library(tidyverse)
library(reshape2)
melt(x) %>%
filter(!duplicated(value)) %>%
with(., split(value, L1))

Sum of column by condition

Trying to summarize column 3 if column 1 is >.25
if(df$V1>.25){sum(df$V3)} ## This returns an error:
In if (df$V1 > 0.25) { :
the condition has length > 1 and only the first element will be used
Is there any code to summarize column 3 when column 1 is > .25?
0.1287953 3 12 1
1.094262 13 14 3
0.5962845 8 17 4
0.6511204 7 19 5
0.2533915 4 6 2
0.8222555 6 18 6
0.08695875 3 7 1
0.6096232 6 6 2
1.583204 24 7 1
0.08337463 4 7 1
0.06398186 1 11 2
0.2713974 4 11 2
0.6205648 13 4 1
1.276595 15 14 3
If you only want to sum over the entries in column 3, where column 1 entries are > 0.25:
inds <- (df$V1 > 0.25)
inds
# [1] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
Just use that to subset the third column:
sum( df$V3[ inds ] )
# 116
Or short: sum( df$V3[ df$V1 > 0.25 ] )
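The reason if() fails here is that it expects a single TRUE or FALSE, whereas df$V1 > 0.25 produces one value per row. Logical indexing is the vectorised counterpart. The sketch below rebuilds the two relevant columns from the question, assuming the default column names V1 and V3, and adds a variant using which() that also copes with missing values.

```r
# Columns 1 and 3 of the data shown in the question
df <- data.frame(
  V1 = c(0.1287953, 1.094262, 0.5962845, 0.6511204, 0.2533915, 0.8222555,
         0.08695875, 0.6096232, 1.583204, 0.08337463, 0.06398186,
         0.2713974, 0.6205648, 1.276595),
  V3 = c(12, 14, 17, 19, 6, 18, 7, 6, 7, 7, 11, 11, 4, 14)
)

# Logical indexing: keep V3 only where the condition holds, then sum
total <- sum(df$V3[df$V1 > 0.25])

# Same computation without repeating df$
total_with <- with(df, sum(V3[V1 > 0.25]))

# which() drops rows where the comparison is NA; na.rm guards V3 itself
total_na_safe <- sum(df$V3[which(df$V1 > 0.25)], na.rm = TRUE)

total
# 116
```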

How to find increasing or decreasing trend in a matrix in r?

I have a time series matrix called trendtable, which has data from 1960 to 2010 (57 columns) for 175 countries. The column names are years and the corresponding values are given. I need to find the difference between consecutive columns to determine whether the trend for each country is going up or down. The resulting trend value should be in a new table. The code I have written is shown below, but I guess it's wrong.
for (i in 1:175) {
Trend=0
for (j in 5:56) {
Dif=TimeSeriesCO2[i,j]-TimeSeriesCO2[i,j+1]
if(Dif<0){
Trend--}
else{
Trend++}
}
TrendTable<-rbind(TrendTable,Trend)
}
Here's a stab, with fake data:
set.seed(42)
nc <- 9 ; nr <- 5
mtx <- t(replicate(nr, cumsum(sample(-3:3, size = nc, replace = TRUE))))
dimnames(mtx) <- list(LETTERS[seq_len(nr)], 2000 + seq_len(nc))
mtx
# 2001 2002 2003 2004 2005 2006 2007 2008 2009
# A 3 6 5 7 8 8 10 7 8
# B 1 1 3 6 4 4 7 10 7
# C 0 0 3 0 3 6 3 3 2
# D 3 3 5 7 9 8 9 6 8
# E -3 -5 -2 -1 -2 -2 -5 -2 -2
So I'm simulating your 175 countries with five letters, and 57 years with 9.
A first cut could be to determine if a year is greater than or equal to the previous year.
t(apply(mtx, 1, diff) >= 0)
# 2002 2003 2004 2005 2006 2007 2008 2009
# A TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
# B TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
# C TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
# D TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
# E FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
(The need to transpose the result from apply is perhaps unintuitive but a fact of dealing with it. There may be ways to do it without the transpose, but none that I know of in base R that are as direct.)
Note that there is one fewer column than in the original dataset. If you want to keep track, for each year, of the cumulative number of increases so far, you might do:
t(apply(mtx, 1, function(a) cumsum(diff(a) >= 0)))
# 2002 2003 2004 2005 2006 2007 2008 2009
# A 1 1 2 3 4 5 5 6
# B 1 2 3 3 4 5 6 6
# C 1 2 2 3 4 4 5 5
# D 1 2 3 4 4 5 5 6
# E 0 1 2 2 3 3 4 5
But I don't necessarily find that more informative.
From your code, it looks like you might be expecting a single number for each country (letter), in which case you can simplify this further to:
apply(mtx, 1, function(a) sum(diff(a) >= 0))
# A B C D E
# 6 6 5 6 5
As far as adding this somehow to your current data, I am neither certain what you want (from your example) nor do I recommend appending summary data to the original matrix. The sole exception is for immediate visualization, not storage.
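If a separate table with one trend value per country is what is wanted, one possible sketch is to score each year-on-year change as +1, 0 or -1 with sign() and add the scores up. This is an assumed reading of the counter in the question's loop, not something the question confirms, and the names net and trend_table are illustrative. The matrix is typed in from the printed output above so that the example does not depend on the random seed.

```r
# The fake data printed above, entered by hand
mtx <- matrix(
  c( 3,  6,  5,  7,  8,  8, 10,  7,  8,
     1,  1,  3,  6,  4,  4,  7, 10,  7,
     0,  0,  3,  0,  3,  6,  3,  3,  2,
     3,  3,  5,  7,  9,  8,  9,  6,  8,
    -3, -5, -2, -1, -2, -2, -5, -2, -2),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5], as.character(2001:2009))
)

# +1 for every increase, -1 for every decrease, 0 for no change
net <- apply(mtx, 1, function(a) sum(sign(diff(a))))

# Separate summary table; the original matrix is left untouched
trend_table <- data.frame(
  country   = rownames(mtx),
  net       = net,
  direction = ifelse(net > 0, "up", ifelse(net < 0, "down", "flat")),
  row.names = NULL
)
trend_table
```

A positive net value means more increases than decreases over the period. It says nothing about the size of the changes, so a regression slope per row may be a better measure if magnitude matters.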

R - add column checking occurrence of something in last n rows of column

I want to create a new column where at each row TRUE is returned if a certain value is found within the last n rows of another column, and FALSE is returned otherwise.
Here is an example dataframe (suppose this is a sample from a much larger dataframe):
A
2
23
1
5
6
15
14
3
7
9
55
3
77
2
And here is what I want (where conditional value=1 and n=10)
A B
2 FALSE
23 FALSE
1 FALSE
5 TRUE
6 TRUE
15 TRUE
14 TRUE
3 TRUE
7 TRUE
9 TRUE
55 TRUE
3 TRUE
77 TRUE
2 FALSE
I can do this with many "OR" conditions in an ifelse statement in dplyr:
df<-df %>% mutate(B=ifelse(lag(A)==1|lag(A,2)==1 ... |lag(A,10)==1,T,F))
But this is far too tedious, especially when n is large. Also, lag in dplyr only takes integers so lag(A,1:10) doesn't work.
Is there an easy way to do this (preferably without a for loop)?
As you've noticed, lag from dplyr does not accept a vector of shift amounts. The shift function from data.table does, and it otherwise offers the same functionality as lag and lead in dplyr. You can therefore combine shift with Reduce:
library(data.table)
setDT(df)
df[, B := Reduce("|", shift(A == 1, n = 1:10, fill = FALSE))]
df
A B
# 1: 2 FALSE
# 2: 23 FALSE
# 3: 1 FALSE
# 4: 5 TRUE
# 5: 6 TRUE
# 6: 15 TRUE
# 7: 14 TRUE
# 8: 3 TRUE
# 9: 7 TRUE
#10: 9 TRUE
#11: 55 TRUE
#12: 3 TRUE
#13: 77 TRUE
#14: 2 FALSE
We can also do this in dplyr with do, still using shift from data.table:
library(dplyr)
library(data.table) # provides shift()
df %>%
do(data.frame(., B = Reduce(`|`, shift(.$A == 1, n = 1:10, fill = FALSE))))
# A B
#1 2 FALSE
#2 23 FALSE
#3 1 FALSE
#4 5 TRUE
#5 6 TRUE
#6 15 TRUE
#7 14 TRUE
#8 3 TRUE
#9 7 TRUE
#10 9 TRUE
#11 55 TRUE
#12 3 TRUE
#13 77 TRUE
#14 2 FALSE
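Both answers rely on the same mechanism: build one lagged copy of the test A == 1 for each of the last n offsets, then OR the copies together with Reduce. A package-free sketch of that mechanism follows. The helper lag_by is hypothetical and simply mimics what shift does with fill = FALSE.

```r
df <- data.frame(A = c(2, 23, 1, 5, 6, 15, 14, 3, 7, 9, 55, 3, 77, 2))
n <- 10
hit <- df$A == 1

# lag_by(v, k): v moved down by k rows, padded with FALSE at the top
lag_by <- function(v, k) head(c(rep(FALSE, k), v), length(v))

# A row is TRUE if any of the n lagged copies is TRUE at that position
df$B <- Reduce(`|`, lapply(seq_len(n), function(k) lag_by(hit, k)))
df
```

Because n is an ordinary variable here, changing the window size only requires changing that one value.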

R - identify row if next x rows have equal or smaller values compared to each previous row

I'm trying to work out whether the next x (6 is the current plan but this could be subject to change) balances remain the same or decrease each month.
I did this in Excel such that it would start with the current month's value and compare next month's against it to see if it stayed the same or decreased and so on.
=IF(AND(H3<=H2,H4<=H3,H5<=H4,H6<=H5,H7<=H6,H8<=H7),1,0)
This isn't the most flexible or elegant formula as it was part of an initial exploration. To make everything cleaner and more reproducible, I'd like to put my calculations into R instead.
Here is a basic dataset that is like my data for multiple IDs over many months.
rbind(data.frame(ID=1,Month=1:11,Bal=seq(from=500, to=300, by=-20)),
data.frame(ID=2,Month=1:10,Bal=rep(200,10)),
data.frame(ID=3,Month=1:11,Bal=seq(from=300, to=500, by=20)))
Ideally, the solution would either calculate against the raw data at row level or work inside a ddply call.
I'm still pretty new to R and I'm sure there's an elegant solution for this, but I really can't see it. Anyone have a neat solution or could point me in the direction of the sorts of keyterms I should be researching to try and reach a solution?
I am not sure if I understood correctly:
checkfun <- function(x,n) {
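# stats::filter is named explicitly because dplyr::filter masks it when dplyr is attached
rev(stats::filter(rev(c(diff(x) <= 0, NA)), rep(1, n), sides = 1)) == n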
}
This function calculates the differences between consecutive values and checks if they are <= 0. The filter sums the number of following n differences that fulfill the condition. This number is finally compared with n, to see if all of them fulfill the condition. (rev is only used, so that a one-sided filter can be used.)
DF$Bal[6] <- 505 # DF is the data from the question; one value is changed so that not all differences are equal
library(plyr)
#example with 3 next values
ddply(DF,.(ID),transform,check=checkfun(Bal,3))
# ID Month Bal check
# 1 1 1 500 TRUE
# 2 1 2 480 TRUE
# 3 1 3 460 FALSE
# 4 1 4 440 FALSE
# 5 1 5 420 FALSE
# 6 1 6 505 TRUE
# 7 1 7 380 TRUE
# 8 1 8 360 TRUE
# 9 1 9 340 NA
# 10 1 10 320 NA
# 11 1 11 300 NA
# 12 2 1 200 TRUE
# 13 2 2 200 TRUE
# 14 2 3 200 TRUE
# 15 2 4 200 TRUE
# 16 2 5 200 TRUE
# 17 2 6 200 TRUE
# 18 2 7 200 TRUE
# 19 2 8 200 NA
# 20 2 9 200 NA
# 21 2 10 200 NA
# 22 3 1 300 FALSE
# 23 3 2 320 FALSE
# 24 3 3 340 FALSE
# 25 3 4 360 FALSE
# 26 3 5 380 FALSE
# 27 3 6 400 FALSE
# 28 3 7 420 FALSE
# 29 3 8 440 FALSE
# 30 3 9 460 NA
# 31 3 10 480 NA
# 32 3 11 500 NA
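The filter-based check above is compact but not easy to read. The same windowed test can be written more explicitly, at the cost of some speed on long vectors. The function name check_next is illustrative, and the sketch assumes the same convention as above: a row is NA when fewer than n following differences exist.

```r
check_next <- function(x, n) {
  ok <- diff(x) <= 0            # ok[i] is TRUE when x[i + 1] <= x[i]
  vapply(seq_along(x), function(i) {
    last <- i + n - 1           # index of the last difference in the window
    if (last > length(ok)) NA else all(ok[i:last])
  }, logical(1))
}

# Balances for ID 1 after the sixth value was changed to 505
bal <- c(500, 480, 460, 440, 420, 505, 380, 360, 340, 320, 300)
res <- check_next(bal, 3)
res
# [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE    NA    NA    NA
```

It can be used in place of checkfun in the ddply call above, for example with check = check_next(Bal, 3).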
If df is your data.frame, you can find consecutive differences using:
df$diff <- do.call("c",lapply(unique(df$ID), function(x) c(0,diff(df$Bal[df$ID==x]))))
This assumes that you want to keep those calculations separate for different IDs.
> head(df)
ID Month Bal diff
1 1 1 500 0
2 1 2 480 -20
3 1 3 460 -20
4 1 4 440 -20
5 1 5 420 -20
6 1 6 400 -20
Now, for a given k = 6 (say), check:
sapply(unique(df$ID), function(x) ifelse(sum(df$diff[df$ID==x][1:k] < 0)!=0,1,0))
[1] 1 0 0
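It returns 1 for an ID if at least one of its first k differences is negative, and 0 otherwise (including when the balance never changes, as for ID 2). Note that this is not the same as requiring every difference to be zero or negative; for that, test all(df$diff[df$ID==x][1:k] <= 0) instead.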
