Checking sums with a condition in R

I need to check the sums in a dataframe, but its elements do not add up because some of them contain others. In this example 051040 is a part of 051043. Although most elements end with "0", those whose last digit is 3 are always greater than or equal to their counterparts with last digit 0. In other words, I need to find all elements ending with "3" and skip their counterparts ending with "0". In this example 051040 and 051050 must be skipped because 051043 and 051053 are present. Note that element "050000" is a control sum, so obviously it must also be skipped. So I need to find the difference:
050000 - SUM("051010", "051043", "051053", "052020", "052100", "052220", "052310") = 0
Something like that.
Here is an artificial example (actual dataframe is really big).
Area <- c("050000", "051010", "051040", "051043", "051050", "051053", "052020", "052100", "052220", "052310")
Total <- c(100, 28, 16, 22, 10, 10, 10, 10, 10, 10)
sodf <- data.frame(Area, Total)
Thanks very much in advance!

You could use the substrings.
sodf[with(sodf, ave(substring(Area, 6) == 3, substr(Area, 1, 5), FUN = \(x) {
  if (any(x)) x else TRUE
})), ] |>
  (\(x) x[1, 2] - sum(x[-1, 2]))()
# [1] 0
For multiple areas, wrap it in an lapply. I expanded your sodf slightly to demonstrate (I am not exactly sure what your area codes look like, but the first two digits seem to be the relevant grouping).
lapply(split(sodf, substr(sodf$Area, 1, 2)), \(x) {
  x <- x[ave(substring(x$Area, 6) == 3, substr(x$Area, 1, 5), FUN = \(x) {
    if (any(x)) x else TRUE
  }), ]
  x[1, 2] - sum(x[-1, 2])
})
# $`05`
# [1] 0
#
# $`06`
# [1] 111
Data:
sodf <- structure(list(Area = c("050000", "051010", "051040", "051043",
"051050", "051053", "052020", "052100", "052220", "052310", "060000",
"061010", "061040", "061043", "061050", "061053", "062020", "062100",
"062220", "062310"), Total = c(100, 28, 16, 22, 10, 10, 10, 10,
10, 10, 211, 28, 16, 22, 10, 10, 10, 10, 10, 10)), row.names = c(NA,
-20L), class = "data.frame")

A one-line solution can be this:
sum(sodf[!sodf$Area %in% stringr::str_replace_all(sodf[!grepl('0$',sodf$Area),"Area"],'[1-9]$','0') & !sodf$Area %in% sodf[grepl('^[0]+[1-9]{1}[0]+$',sodf$Area),"Area"] ,"Total"])
How does it work?
Find the rows whose Area does not end with 0.
sodf[!grepl('0$', sodf$Area), "Area"] # "051043" "051053"
Find how those Areas would end if their last digit were 0. I used a function from the stringr package.
stringr::str_replace_all(sodf[!grepl('0$', sodf$Area), "Area"], '[1-9]$', '0') # "051040" "051050"
Find the rows whose Area is neither in that set nor of the '050000' control-sum form (^[0]+[1-9]{1}[0]+$), then sum their Total.
sum(sodf[!sodf$Area %in% stringr::str_replace_all(sodf[!grepl('0$', sodf$Area), "Area"], '[1-9]$', '0') & !sodf$Area %in% sodf[grepl('^[0]+[1-9]{1}[0]+$', sodf$Area), "Area"], "Total"]) # 100
The same result can be achieved by splitting the code over multiple lines with the dplyr package, for example as sketched below.
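For instance, a dplyr/stringr version along these lines (my sketch of the one-liner above, not the answerer's exact code) returns the same 100:
library(dplyr)
library(stringr)

sodf |>
  filter(!str_detect(Area, '^[0]+[1-9]{1}[0]+$')) |>     # drop the control-sum row(s)
  filter(!Area %in% str_replace_all(Area[!str_detect(Area, '0$')],
                                    '[1-9]$', '0')) |>   # drop "...0" rows that have a "...3" counterpart
  summarise(total = sum(Total))
# total: 100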

My colleague suggested the following variant.
library(tidyverse)
check_diff <- function(df) {
  # Find the control Total, i.e. the Area ending with 0000
  sum_control <- df |>
    filter(str_detect(Area, "0{4}$")) |>
    select(Total)
  # Take the second-to-last digit of every Area ending with 3 and collapse them into one string
  before_last_3 <- df |>
    filter(str_detect(Area, "3$")) |>
    pull(Area) |>
    str_sub(-2, -2) |>
    str_flatten()
  # Build the filter pattern from that string
  filter3 <- str_c("[", before_last_3, "]0$")
  # Filter the data on the new conditions and sum the remaining Totals
  sum <- df |>
    filter(!str_detect(Area, "0{4}$"),
           !str_detect(Area, filter3)) |>
    summarise(sum(Total))
  sum_control - sum
}
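For reference, applying it to the example data from the question (assuming sodf as defined there) should give 0:
check_diff(sodf)
# should return 0 (as a one-cell data frame, since both pieces of the calculation are data frames)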

Related

Path along linked values taking the lowest value each time

I have a data.table with two columns "From" and "To" as follows:
data.table(From = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5),
           To   = c(3, 4, 5, 6, 3, 4, 5, 6, 4, 5, 6, 5, 6, 6))
The data.table will always be sorted as shown in the example above, with "From" and "To" values increasing from smallest to largest.
I need to find a 'path' starting from the first 'From' (which will always be '1'), through to the last 'To' value, subject to always choosing the lowest 'To' value.
In the above example, I would have 1 --> 3, then 3 --> 4, then 4 --> 5, then finally 5 --> 6.
I then want to return in a vector 1, 3, 4, 5, and 6, representing the linked values.
The only way that I can think of doing it is using a while or for loop and looping through each group of 'From' values and iteratively choosing the smallest. That seems inefficient though and will probably be very slow on my actual data set which is over 100,000 rows long.
Are there any data.table-like solutions?
I also thought that maybe igraph would have a method for this, but I must admit that I currently have pretty much zero knowledge of that package.
Any help would be greatly appreciated.
Thanks,
Phil
EDIT:
Thanks for all the responses so far.
Sorry, my example/explanation wasn't a great one, as I didn't explain that the 'From'/'To' pairs don't need to go all the way through to the final value of the 'To' column.
Using the example from the comments below:
dt <- data.table(From = c(1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 5),
                 To   = c(3, 4, 5, 6, 3, 4, 5, 6, 5, 6, 6))
The output would simply be a vector of c(1, 3), as it will start at 1, choose the lowest value which is 3, and then because there are no 'From' values of '3', it wouldn't continue any further.
Another example:
dt <- data.table(From = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
                 To   = c(2, 3, 4, 5, 6, 4, 7, 8, 9))
The intended output here is a vector c(1,2,5); following the path 1 --> 2, then 2 --> 5, at which point it stops as there is no '5' value in the "From" column.
Hopefully, that makes sense, and apologies for the lack of clarity in the original question.
Thanks,
Phil
You can try the code below
library(dplyr)
library(purrr)
library(igraph)

dt %>%
  group_by(From) %>%
  slice_min(To) %>%
  graph_from_data_frame() %>%
  ego(
    order = sum((m <- membership(components(.))) == m[names(m) == "1"]),
    nodes = "1",
    mode = "out"
  ) %>%
  pluck(1) %>%
  names() %>%
  as.numeric()
or, more simply, with subcomponent (as @clp did)
dt %>%
  group_by(From) %>%
  slice_min(To) %>%
  graph_from_data_frame() %>%
  subcomponent(v = "1", mode = "out") %>%
  names() %>%
  as.integer()
which gives
For the first updated data
[1] 1 3
For the second updated data
[1] 1 2 5
Assuming an ordered From and To list this may work.
It first groups by From, takes the first To per group, and then excludes non-matching From-To values using shift.
If jumps are missing (e.g. To 3 is present but From 3 is missing), it prints NULL.
dt[, .(frst = first(To)), From][
  , if (all((frst %in% From)[1:(.N - 1)])) {
      c(1, unique(frst[From == shift(frst, type = "lag", fill = TRUE)]))
    }]
[1] 1 3 4 5 6
Using subcomponent() from igraph.
After ThomasisCoding's comment, I realized that graph_from_data_frame creates a graph with named vertices.
This is a waste of memory (and time) if the graph is large (on the order of 1E6).
Note also that graph_from_edgelist(as.matrix(...)) is much faster.
dt2 <- setNames(aggregate(dt$To, list(dt$From), "min"), c("From", "To") )
g <- graph_from_edgelist(as.matrix(dt2), directed=TRUE)
as.numeric(as_ids(subcomponent(g, 1, mode="out")))
First attempt.
dt2 <- setNames(aggregate(dt$To, list(dt$From), "min"), c("From", "To") )
g <- graph_from_data_frame(dt2, directed=TRUE)
as.numeric(as_ids(subcomponent(g, 1, mode="out")))
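A rough way to check that speed claim, with an arbitrary random edge list (timings will of course vary by machine):
library(igraph)

n  <- 1e6
el <- cbind(sample(n, n, replace = TRUE), sample(n, n, replace = TRUE))       # random integer edge list
system.time(g1 <- graph_from_edgelist(el, directed = TRUE))                   # vertices addressed by integer id
system.time(g2 <- graph_from_data_frame(as.data.frame(el), directed = TRUE))  # vertices addressed by name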
I can't seem to get the other answers to work with certain tables. E.g.,
library(data.table)
library(igraph)
library(purrr)
dt <- data.table(
  From = c(1, 1, 1, 1, 2, 2, 4, 5),
  To   = c(3, 4, 5, 6, 4, 6, 6, 6)
)
fPath1 <- function(dt) {
  setorder(dt, From, To)[, wt := fifelse(rleid(To) == 1, 1, Inf), From] %>%
    graph_from_data_frame() %>%
    set_edge_attr(name = "weight", value = dt[, wt]) %>%
    shortest_paths(min(dt[, From]), max(dt[, To])) %>%
    pluck(1) %>%
    unlist(use.names = FALSE)
}
fPath2 <- function(dt) {
  dt[, .SD[which.min(To)], From] %>%
    graph_from_data_frame() %>%
    shortest_paths(min(dt[, From]), max(dt[, To])) %>%
    pluck(1) %>%
    unlist(use.names = FALSE)
}
fPath3 <- function(dt) {
  dt[, .(frst = first(To)), From][
    , if (all((frst %in% From)[1:(.N - 1)])) {
        c(1, unique(frst[From == shift(frst, type = "lag", fill = TRUE)]))
      }]
}
fPath1(dt)
#> [1] 1 6
fPath2(dt)
#> Warning in shortest_paths(., min(dt[, From]), max(dt[, To])): At core/paths/
#> unweighted.c:368 : Couldn't reach some vertices.
#> integer(0)
fPath3(dt)
#> NULL
This igraph solution seems to work based on a little more extensive testing:
fPath4 <- function(dt) {
  g <- graph_from_data_frame(dt)
  E(g)$weight <- (dt$To - dt$From)^2
  as.integer(V(g)[shortest_paths(g, V(g)[1], V(g)[name == dt$To[nrow(dt)]])$vpath[[1]]]$name)
}
fPath4(dt)
#> [1] 1 4 6
A sequential solution is feasible.
Copying one million data frame values one at a time took 8 seconds on my system.
n <- 1E6
df1 <- data.frame(from=sample(n), to=sample(n))
path <- c()
system.time(
  for (i in seq(nrow(df1))) {
    path[length(path) + 1] <- df1[i, "to"]  # avoid copying
  }
)
mean(path)
length(path)
Output.
[1] 500000.5
[1] 1000000
Updated after last edit of Phil.
The first step is to simplify the input (df).
## Select min(To) by From.
if (nrow(df) > 0) {
  df2 <- setNames(aggregate(df$To, list(df$From), "min"), c("From", "To"))
} else {
  df2 <- df
}
Set path to the first start node and subsequently append end nodes.
## Let tt be the maximal outgoing node up to now.
path <- df2[1, 1]
tt <- df2[1, 1]
for (i in seq_len(nrow(df2))) {
  if (df2[i, 1] < tt) {
    next
  } else if (df2[i, 1] == tt) {
    tt <- df2[i, 2]
    path[length(path) + 1] <- df2[i, 2]
  } else {
    break
  }
}
head(path)
Output:
[1] 1 3 4 5 6   (df as in the first example)
[1] 1 2 5       (df as in the second example)

Get column index from data frame that matches numeric vector?

Very similar questions have been asked here, here, and here. However, they all seem to rely on knowing the column names of the data.
I am trying to get the column index of a data frame that matches a numeric vector. For example, if I have some data and a vector like so:
dat <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(10, 9, 8, 7, 6),
  z = c(2, 4, 6, 8, 10)
)
testVec <- c(2,4,6,8,10)
I would just like to return the column index of dat that matches testVec. We can see that dat$z matches testVec, so in this situation I would just like to return 3.
Any suggestions as to how I could do this?
Here's a base R approach, which compares every column in dat with testVec to see if they are identical. Use which to output the column index if they're identical.
which(sapply(1:ncol(dat), function(x) identical(dat[,x], testVec)))
[1] 3
UPDATE
@nicola has provided a better syntax than my original code (you can see it in the comment under this answer):
which(sapply(dat, identical, y = testVec))
z
3
Perhaps you can try this
> which(colSums(dat == testVec) == nrow(dat))
z
3
An option with select from dplyr
library(dplyr)
dat %>%
  select(where(~ all(testVec == .x))) %>%
  names() %>%
  match(names(dat))
[1] 3
Subtract the testVec.
which(colSums(dat - testVec) == 0)
# z
# 3
Without name:
unname(which(colSums(dat - testVec) == 0))
# [1] 3
Data:
dat <- structure(list(x = c(1, 2, 3, 4, 5), y = c(10, 9, 8, 7, 6), z = c(2,
4, 6, 8, 10)), class = "data.frame", row.names = c(NA, -5L))
testVec <- c(2, 4, 6, 8, 10)

Piping over a list, subsetting and calculate a function of my own

I have a dataset with these three columns (plus other additional columns):
structure(list(from = c(1, 8, 3, 3, 8, 1, 4, 5, 8, 3, 1, 8, 4,
1), to = c(8, 3, 8, 54, 3, 4, 1, 6, 7, 1, 4, 3, 8, 8), time = c(1521823032,
1521827196, 1521827196, 1522678358, 1522701516, 1522701993, 1522702123,
1522769399, 1522780956, 1522794468, 1522794468, 1522794468, 1522794468,
1522859524)), class = "data.frame", row.names = c(NA, -14L))
I need the code to take all indices less than a number (e.g. 5) and, for each of them, do the following: subset the data to rows where the index appears in either column "from" or column "to", and calculate a function on that subset (e.g. the difference between the min and max of time). As a result I expect a dataframe with the indices and the results of the calculation.
This is what I have, but it does not work.
dur <- function(x) max(x) - min(x)  # The function to calculate the difference. In other cases I need to use other functions of my own
filternumber <- function(number, x) {  # A function to filter data x by the number in the two columns
  x <- x %>% subset(from == number | to == number)
  return(x)
}
lista <- unique(c(data$from, data$to))  # Creates a list with all the indexes in the data. I do this to avoid having non-existing indexes
lista <- lista[lista <= 5]  # Limit the list to 5. In my code this number would be an argument to a function
result <- lista %>% filteremployee(., data) %>% select(time) %>% dur()  # I use select because I have many other columns in the data
The result in this case should be a dataframe with 1036492 for 1, 967272 for 3, and 92475 for 4.
I've also tried putting filteremployee(., data) %>% select(time) %>% dur() inside mutate, but that does not work either.
Perhaps you are looking for something like this:
library(purrr)
library(dplyr)
index <- c(1, 3, 4)
names(index) <- index
index %>%
  map_dfr(~ df %>%
            filter(from == .x | to == .x) %>%
            summarize(result = dur(time)),
          .id = "index")
This returns
index result
1 1 1036492
2 3 967272
3 4 92475
The function was created with ==, which is elementwise. Here, we may need to loop:
library(dplyr)
library(purrr)
map_dbl(lista, ~ filternumber(.x, data) %>%
          select(time) %>%
          dur)
[1] 1036492 967272 92475 0

Subtracting values from two lists and adding the differences in R

Let's say I have one numeric vector (num [1:25]) with values 3, 4, 5, 6, ...
and another (num [1:25]) with values 5, 7, 4, ...
I want to subtract values and add these differences together such that:
(3-5) + (4-7) + (5-4)...
so the first number in the first list with the first number in the other list and so on.
If your vectors are saved in the same list lst, you can try the code below
> sum(do.call("-", lst))
[1] -24
or
> diff(sapply(rev(lst), sum))
[1] -24
Dummy Data
set.seed(1)
lst <- list(
  sample(10, 25, replace = TRUE),
  sample(10, 25, replace = TRUE)
)
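If the two sequences are kept as two separate vectors rather than in one list, the same idea is just the sum of the element-wise differences; a tiny sketch with made-up numbers:
v1 <- c(3, 4, 5)  # hypothetical first vector
v2 <- c(5, 7, 4)  # hypothetical second vector
sum(v1 - v2)      # (3-5) + (4-7) + (5-4) = -4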

fuzzy comparing names in R - how to find highest possible sum in a matrix (with boundary conditions)

I have a matrix with similarity scores that looks like this:
I need to find the highest possible sum of scores in this matrix. The sums have to fulfill a condition though:
If a number has been used for the sum, no numbers of its row or column or any previous rows or columns can be used for the sum anymore. This is because the order of the names matters.
I can start at any number, but all the values to the left and above that value will then be disqualified for the rest of that sum, as well as the values in the same row and column.
The highest possible sum with this system is 130 (10 + 100 + 10 + 10). That's the number I want in the end.
My strategy at the moment is to calculate all possible sums and then simply select the highest. But how do I code the condition that I described above? Does that make sense?
Here are more examples of allowed (green) and not allowed (red) sums:
Another example of a valid sum:
I start at the upper left corner. I choose 10.
I can not add the 12 or the 11 that are in the same column anymore.
I choose one of the remaining numbers. 12.
I can no longer choose 100, 11, 10, or 25 from that column and row (and the previous columns and rows). I can now only choose 22 or 10 for the last number to add.
If I pick 22, my total sum is 44. If I pick 10, my total sum is 32.
The reason I'm using this system is that I'm trying to create an algorithm that compares people's full names and assigns a probability that they are the same person, purely based on name information.
My current code looks like this:
library(tidyverse)
library(stringdist)
string.compare <- function(Var1, Var2) {
  string1 <- Var1 %>% tolower() %>% trimws() %>%
    str_replace_all(pattern = "[[:punct:]]", replacement = "") %>%
    strsplit(" ") %>% unlist()
  string2 <- Var2 %>% tolower() %>% trimws() %>%
    str_replace_all(pattern = "[[:punct:]]", replacement = "") %>%
    strsplit(" ") %>% unlist()
  compare <- array(NA, dim = c(length(string1), length(string2)),
                   dimnames = list(string1, string2))
  compare[] <- do.call(mapply,
                       c(list(FUN = string.score),
                         expand.grid(dimnames(compare), stringsAsFactors = FALSE)))
  sums <- func_calc_sums(compare) # This is where I need help. How do I write this function?
  output(max(sums))
}
string.score <- function(Var1, Var2) {
  phonetic.weight <- 50 # This is an important variable: it determines the weight of the phonetic comparison. 100 = no weight, 0 = phonetic is all that matters.
  if (is.null(Var1) | is.null(Var2) | is.na(Var1) | is.na(Var2) | Var1 == "" | Var2 == "") { # if one of the entries is empty, score 0
    return(0)
  } else if (Var1 == substr(Var2, 1, 1)) { # if Var1 is an abbreviation of Var2, score 10
    return(10)
  } else if (nchar(Var1) == 1) { # if Var1 is an abbreviation but not of Var2, score 0
    return(0)
  } else if (Var2 == substr(Var1, 1, 1)) { # if Var2 is an abbreviation of Var1, score 10
    return(10)
  } else if (phonetic(Var1) == phonetic(Var2)) { # if Var1 and Var2 are phonetically similar, give a score based on stringdist
    return(round(100 - (phonetic.weight * stringdist(Var1, Var2, method = "osa") / nchar(Var1)), 0))
  } else { # if Var1 and Var2 are not phonetically similar, give a score based on stringdist but lower
    return(round(100 - (100 * stringdist(Var1, Var2, method = "osa") / nchar(Var1)), 0))
  }
}
If you enter, for example, Var1 <- " a. michelle hernandes s. " and Var2 <- " Alexa michelle h. sanchez" and run them through string.compare (the function is not finished, so you'll have to execute the code line by line), it will first clean up the strings and then split them into separate words.
Those words become the rownames and colnames of a matrix, over which the scoring function string.score is run. You then end up with the matrix at the beginning of this post.
I have two suggestions that might be helpful:
There is an R implementation of the Needleman-Wunsch algorithm, which was mentioned by Roman Cheplyaka, on GitHub: https://gist.github.com/juliuskittler/ed53696ac1e590b413aac2dddf0457f6
You could try to solve the problem recursively, for example using the maximum path sum function described here: https://lucidmanager.org/data-science/project-euler-18/
You'd have to explicitly insert the constraints you mentioned to block these paths from being eligible, I think. Here's a run of the default function on your dataset:
testmat <- matrix(data = c(10,   0,  0,  0,
                           12, 100, 12, 25,
                           11,  11, 10, 22,
                            0,   0,  0, 10),
                  ncol = 4,
                  nrow = 4,
                  byrow = TRUE)
path.sum <- function(triangle) {
  for (rij in nrow(triangle):2) {
    for (kol in 1:(ncol(triangle) - 1)) {
      triangle[rij - 1, kol] <- max(triangle[rij, kol:(kol + 1)]) + triangle[rij - 1, kol]
    }
    triangle[rij, ] <- NA
  }
  return(max(triangle, na.rm = TRUE))
}
> path.sum(testmat)
[1] 130
It sounds like you are looking for the Needleman–Wunsch dynamic programming algorithm. Just set the match score to the value of your similarity function, and the mismatch/insertion/deletion scores to 0.
The algorithm is not too hard to implement, and you can find a lot of code samples online.
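For what it's worth, here is a minimal sketch of that dynamic programme (my illustration, not the original answerer's code): it assumes the pairwise scores sit in a matrix scores with the words of one name as rows and the other name's words as columns, gaps cost 0, and each row and column is used at most once while keeping the word order.
max_align_sum <- function(scores) {
  n <- nrow(scores)
  m <- ncol(scores)
  S <- matrix(0, n + 1, m + 1)  # S[i + 1, j + 1]: best sum using the first i rows and first j columns
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      S[i + 1, j + 1] <- max(S[i, j + 1],             # skip word i of the first name
                             S[i + 1, j],             # skip word j of the second name
                             S[i, j] + scores[i, j])  # pair word i with word j
    }
  }
  S[n + 1, m + 1]
}
Something like max_align_sum(compare) could then stand in for the func_calc_sums() placeholder in string.compare() above (max_align_sum is a name I made up for this sketch).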
