I have the following data set:
PATH = c("5-8-10-8-17-20",
"56-85-89-89-0-15-88-10",
"58-85-89-65-49-51")
INDX = c(18, 89, 50)
df <- data.frame(PATH, INDX)
df
                    PATH INDX
1         5-8-10-8-17-20   18
2 56-85-89-89-0-15-88-10   89
3      58-85-89-65-49-51   50
The column PATH has strings that represent a numerical series, and I want to pick the largest number from the string that satisfies PATH <= INDX; that is, select a number from PATH that is equal to INDX or, failing that, the largest number from PATH that is still less than INDX.
my desired output would look like this:
                    PATH INDX PICK
1         5-8-10-8-17-20   18   17
2 56-85-89-89-0-15-88-10   89   88
3      58-85-89-65-49-51   50   49
Some of my thought process behind an answer:
I know that with a function such as strsplit I could separate each string by "-", sort the numbers, and then subtract INDX, selecting the value with the smallest negative difference or zero. However, the original dataset is quite large, and I wonder if there is a faster or more efficient way to perform this task.
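For reference, the approach described above can be sketched directly (a minimal illustration; `pick_via_diff` is a made-up helper name):

```r
PATH <- c("5-8-10-8-17-20", "56-85-89-89-0-15-88-10", "58-85-89-65-49-51")
INDX <- c(18, 89, 50)

# split the string, sort the numbers, subtract INDX, and keep the
# largest value whose difference is zero or negative
pick_via_diff <- function(path, indx) {
  nums <- sort(as.numeric(strsplit(path, "-")[[1]]))
  max(nums[nums - indx <= 0])
}

mapply(pick_via_diff, PATH, INDX, USE.NAMES = FALSE)
# [1] 17 89 49
```

Note that for the second row this picks 89 itself, since 89 equals INDX and the stated rule is PATH <= INDX.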
Another option:
mapply(
  \(x, y) max(x[x <= y]),
  strsplit(PATH, "-") |> lapply(as.integer),
  INDX
)
# [1] 17 89 49
Using purrr::map2_dbl():
library(purrr)
PICK <- map2_dbl(
  strsplit(PATH, "-"),
  INDX,
  ~ max(
    as.numeric(.x)[as.numeric(.x) <= .y]
  )
)
# 17 89 49
There is nothing wrong with your approach; the below should be reasonably efficient.
numpath <- sapply(strsplit(PATH, "-"), as.numeric)
maxindexes <- lapply(1:length(numpath), function(x) which(numpath[[x]] <= INDX[x]))
result <- sapply(1:length(numpath), function(x) max(numpath[[x]][maxindexes[[x]]]))
> result
[1] 17 89 49
Using dplyr
library(dplyr)
df |>
  rowwise() |>
  mutate(across(PATH, ~ {
    a = unlist(strsplit(.x, split = "-"))
    max(as.numeric(a)[which(as.numeric(a) <= INDX)])
  }, .names = "PICK"))
PATH INDX PICK
<chr> <dbl> <dbl>
1 5-8-10-8-17-20 18 17
2 56-85-89-89-0-15-88-10 89 89
3 58-85-89-65-49-51 50 49
You can create a custom function like below:
my_func <- function(vec1, vec2) {
  x <- sort(as.numeric(unlist(strsplit(vec1, split = "-"))))
  x[max(cumsum(x <= vec2))]
}
df$PICK <- sapply(seq_len(nrow(df)), function(i) my_func(df$PATH[i], df$INDX[i]))
which will yield the following output:
# PATH INDX PICK
# 1 5-8-10-8-17-20 18 17
# 2 56-85-89-89-0-15-88-10 89 89
# 3 58-85-89-65-49-51 50 49
Related
I am currently trying to program a function that shows me possible ways to finish a certain number of points in 1 to 3 dart throws. Only the range from 170 to 2 points is necessary.
Example: I have 120 points left. I want a function that gives out possible ways to finish 120 points. For example by throwing 40 - 40 - 40; 60 - 60 or 57 - 57 - 6 and so on...
I am new to programming in R and to programming in general and have absolutely no idea how and where to start. Does anyone have a recommendation for me? Is there a specific function that can be applied in this case?
Thank you very much in advance!
Here is another approach. One thing you would have to fix (if needed): this solution produces separate 'solutions' for, say, a single 2 and a double 1, since both score 2. Writing each dart as the score it contributes, 4 can be finished with:
single 1, single 1, double scoring 2
single 2, double scoring 2
double scoring 2, double scoring 2
and this gets worse with the three-dart finishes...
library(data.table)

singles <- c(1:20, 25)
doubles <- 2 * singles
triples <- 3 * (1:20)

# all possible endings (the last dart must be a double)
one.dart   <- tidyr::crossing(first = doubles)
two.dart   <- tidyr::crossing(first  = unique(c(singles, doubles, triples)),
                              second = doubles)
three.dart <- tidyr::crossing(first  = unique(c(singles, doubles, triples)),
                              second = unique(c(singles, doubles, triples)),
                              third  = doubles)

# bind together
DT <- rbindlist(list(one.dart, two.dart, three.dart), use.names = TRUE, fill = TRUE)

# calculate finish total and number of darts used
DT[, finish := rowSums(.SD, na.rm = TRUE), .SDcols = 1:3]
DT[, darts := rowSums(!is.na(.SD)), .SDcols = 1:3]
calculate_finish <- function(x) {
  DT[finish == x, 1:3]
}
calculate_finish( 120 )
# first second third
# 1: 10 60 50
# 2: 13 57 50
# 3: 16 54 50
# 4: 19 51 50
# 5: 20 50 50
# ---
# 130: 60 40 20
# 131: 60 42 18
# 132: 60 48 12
# 133: 60 50 10
# 134: 60 54 6
The question you have is rather general: more like outsourcing your programming than asking for help with a specific part of an implementation. Anyhow, here is a simplified solution that is far from perfect programming-wise but does the job. I hope it helps you understand programming and functions, plus a bit of R, given that you have a specific problem.
# we need this library to be able to use the %>% operator below
library(dplyr)
# vector of all the possible field numbers (I have no idea, but you can look them up and fill in the field)
dboard <- c(1, 2, 3, 4, 5)
# expand this to a grid of all possible combinations
all_combinations <- expand.grid(dboard, dboard, dboard, stringsAsFactors = FALSE)
# your function with two inputs
get_throw_combinations <- function(remaining_points, throws) {
  # use the possible combinations defined outside the function
  res <- all_combinations %>%
    # select from the first up to the number-of-throws column
    dplyr::select(1:throws) %>%
    # remove redundancy (happens when throws < 3)
    dplyr::distinct() %>%
    # keep rows whose sum equals the target
    dplyr::filter(rowSums(.) == remaining_points)
  # return the result
  return(res)
}
# run your function for 5 points with two throws
get_throw_combinations(5, 2)
# results R will display
Var1 Var2
1 4 1
2 3 2
3 2 3
4 1 4
Here is a brute-force approach using expand.grid + subset:
v <- 2:170
res <- lapply(
  1:3,
  function(k) subset(u <- expand.grid(rep(list(v), k)), rowSums(u) == 120)
)
Given the following named vector:
x <- c(54, 36, 67, 25, 76)
names(x) <- c('a', 'b', 'c', 'd', 'e')
How can one extract the elements between 'b' and 'd'? I can do that for data tables with dplyr::select(dt, b:d), but for some reason I cannot find a solution for named vectors (all the examples I find extract element(s) by giving all the names, not a range of names)...
You could do
x[which(names(x) == "b"):which(names(x) == "d")]
#> b c d
#> 36 67 25
The problem is that there is no guarantee in a named vector that the names are unique, and if there are duplicate names the entire concept becomes meaningless.
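A small illustration of that ambiguity (hypothetical vector `y` with a repeated name):

```r
y <- c(a = 1, b = 2, c = 3, b = 4, d = 5)
which(names(y) == "b")
# [1] 2 4
# "the elements between 'b' and 'd'" could now mean positions 2:5 or 4:5
```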
If you wanted a complete solution that allows for tidyverse-style non-standard evaluation and sensible error messages you could have
subset_named <- function(data, exp)
{
  if(missing(exp)) return(data)
  exp <- as.list(match.call())$exp
  if(is.numeric(exp)) return(data[exp])
  if(is.character(exp)) return(data[exp])
  tryCatch({
    ss <- suppressWarnings(eval(exp))
    return(data[ss])},
    error = function(e)
    {
      if(as.character(exp[[1]]) != ":")
        stop("`exp` must be a sequence created by ':'")
      n <- names(data)
      first <- as.character(exp[[2]])
      second <- as.character(exp[[3]])
      first_match <- which(n == first)
      second_match <- which(n == second)
      if(length(first_match) == 0)
        stop("\"", first, "\" not found in names(",
             deparse(substitute(data)), ")")
      if(length(second_match) == 0)
        stop("\"", second, "\" not found in names(",
             deparse(substitute(data)), ")")
      if(length(first_match) > 1) {
        warning("\"", first,
                "\" found more than once. Using first occurrence only")
        first_match <- first_match[1]
      }
      if(length(second_match) > 1) {
        warning("\"", second,
                "\" found more than once. Using first occurrence only")
        second_match <- second_match[1]
      }
      return(data[first_match:second_match])
    })
}
That allows the following behaviour:
subset_named(x, "b":"d")
#> b c d
#> 36 67 25
subset_named(x, b:d)
#> b c d
#> 36 67 25
subset_named(x, 1:3)
#> a b c
#> 54 36 67
subset_named(x, "e")
#> e
#> 76
subset_named(x)
#> a b c d e
#> 54 36 67 25 76
One option could be:
x[Reduce(`:`, which(names(x) %in% c("b", "d")))]
b c d
36 67 25
You can use match in base R:
x[match('b', names(x)):match('d', names(x))]
# b c d
#36 67 25
Or, if you want to use something like b:d, convert the vector into a data frame so the names become columns:
library(dplyr)
t(x) %>%
  as.data.frame() %>%
  select(b:d)
1) subset In base R this can be done using the select argument of subset. The only catch is that only the data.frame method of subset supports the select argument, but we can convert x to a data.frame and then convert back. It also allows more complex specifications such as c(b:d, d).
unlist(subset(data.frame(as.list(x)), select = b:d))
## b c d
## 36 67 25
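To illustrate the more complex specification mentioned above (a small sketch; the duplicated d column comes back with a uniquified name):

```r
x <- c(a = 54, b = 36, c = 67, d = 25, e = 76)
# the range b:d plus d once more
unlist(subset(data.frame(as.list(x)), select = c(b:d, d)))
#>   b   c   d d.1
#>  36  67  25  25
```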
2) evalq Another base R possibility is to create a list with the values 1, 2, 3, ... and the same names as x and then evaluate b:d with respect to it giving the desired indexes which can then be indexed into x. This also allows complex specifications as in (1).
x[ evalq(b:d, setNames(as.list(seq_along(x)), names(x))) ]
## b c d
## 36 67 25
We could turn this into a function like this:
sel <- function(x, select, envir = parent.frame()) {
  ix <- setNames(as.list(seq_along(x)), names(x))
  x[eval(substitute(select), ix, envir)]
}
sel(x, b:d)
sel(x, c(b:c, d))
sel(x, d:b) # reverse order
3) logical condition Again with only base R, if the names are in sorted order, as in the question, then we can check for names between the endpoints:
x[names(x) >= "b" & names(x) <= "d"]
## b c d
## 36 67 25
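Note that this relies on the names being in sorted order. With an unsorted vector (hypothetical `x2`), the comparison selects by alphabetical order rather than by position:

```r
x2 <- c(a = 1, b = 2, e = 3, c = 4, d = 5)
x2[names(x2) >= "b" & names(x2) <= "d"]
#> b c d
#> 2 4 5
# "e" sits positionally between "b" and "d" but is skipped
```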
4) zoo If the names are in ascending order, as in the question, we could create a zoo series with those names as the times and then use window.zoo to pick out the subseries and finally convert back.
library(zoo)
coredata(window(zoo(x, names(x)), start = "b", end = "d"))
## b c d
## 36 67 25
I am trying to improve speed in a case where I need to use data from dataframes and sample from other dataframes.
First I need to draw the number of samples I want from df_obs.
Then I need to determine a subset from where to sample based on which month I am in (the subset command).
Then I want to sample from the corresponding sample dataframe.
And finally put it all together in a new dataframe.
The code below works, but it is far too slow when I have to repeat it 1000 times. Is there an alternative method which makes better use of apply functions? Or perhaps some data.table function?
#Sample function to sample correct in case of only one value to sample from
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Creating dummy data
no_pr_month <- sort(sample(67:120, 20))
df_obs <- data.frame(replicate(20,sample(0:5,1000,rep=TRUE)))
colnames(df_obs) <- no_pr_month
amount <- sample(50:50000,200)
month <- sample(no_pr_month,200, rep=TRUE)
df <- data.frame(month,amount)
df_sum <- data.frame(matrix(NA, ncol = 20, nrow=1000))
#The far too slow loop
for (k in 1:length(no_pr_month)){
  a <- df_obs[,k]
  df_sample <- subset(df, df$month == names(df_obs[k]))
  df_out <- sapply(a, function(x) sum(resample(df_sample$amount, x, replace = TRUE)))
  df_sum[,k] <- df_out
}
Note: before creating your data, I inserted set.seed(000) for consistent results
Even when using the data.table package, it's best to keep data organized in a "tidy" way: row-based.
So we'll start by changing your df_obs dataset to a long-form data.table.
library(data.table)
df_obs_long <- data.table(
  month = as.integer(rep(names(df_obs), each = nrow(df_obs))),
  obs_count = unlist(df_obs)
)
df_obs_long
# month obs_count
# 1: 69 4
# 2: 69 5
# 3: 69 1
# 4: 69 3
# 5: 69 0
# ---
# 19996: 116 4
# 19997: 116 1
# 19998: 116 2
# 19999: 116 3
# 20000: 116 5
Next we'll define a function that takes a vector of sample sizes and the number of the month to draw samples from. The function will return a vector of sample sums for each of the sizes given.
Making df a data.table doesn't save much as far as written code, but can cut down runtime by a good amount.
setDT(df)
sample_and_sum_month <- function(sizes, month_number) {
  choices <- df[month == month_number, amount]
  vapply(
    sizes,
    FUN.VALUE = numeric(1),
    FUN = function(s) {
      sum(resample(choices, size = s, replace = TRUE))
    }
  )
}
sample_and_sum_month(1:3, 69)
# [1] 12729 55068 28605
Finally, we can just add the sums as a new column in df_obs_long.
df_obs_long[
  ,
  sample_sum := sample_and_sum_month(obs_count, .BY[["month"]]),
  by = "month"
]
df_obs_long
# month obs_count sample_sum
# 1: 69 4 82662
# 2: 69 5 160761
# 3: 69 1 5743
# 4: 69 3 108783
# 5: 69 0 0
# ---
# 19996: 116 4 56792
# 19997: 116 1 22570
# 19998: 116 2 35337
# 19999: 116 3 64734
# 20000: 116 5 69075
I have a list like this:
$`20`
[1] 500

$`30`
[1] 600
I want to convert this into a dataframe like this
id values
20 500
30 600
You can do:
L <- list(`20`=500,`30`=600)
df <- data.frame(id=names(L), values=sapply(L, function(x) x[1]))
# > df
# id values
# 20 20 500
# 30 30 600
or a bit more tricky:
df <- data.frame(id=names(L), values=sapply(L, '[', 1))
Until now I was thinking about longer vectors (and taking only the first element). But in your case (if each element of the list is only a 1-element vector), a shorter solution (thanks to Abdou for the comment) is:
df <- data.frame(id = names(L), values = unlist(L))
You can use do.call to solve your problem:
li <- list(`20`=500,`30`=600)
df <- data.frame(Values = do.call("rbind",li))
df$Id <- rownames(df)
rownames(df) <- NULL
df <- df[,c(2,1)]
df
Output:
> df
Id Values
1 20 500
2 30 600
purrr's *_df functions iterate a function over a list and simplify to a data.frame. With the development version, you can use the new imap variant that uses the names or indices as a second variable .y:
library(purrr)
l <- list(`25` = 900, `26` = 500)
l %>% imap_dfr(~ data.frame(id = as.integer(.y),
                            value = .x))
#> id value
#> 1 25 900
#> 2 26 500
or with CRAN purrr, you can pass the names as the second variable to map2:
l %>% map2_df(names(.),
              ~ data.frame(id = as.integer(.y),
                           value = .x))
#> id value
#> 1 25 900
#> 2 26 500
Use the unlist function.
L = list(`20`=500,`30`=600)
df = unlist(L)
It returns a vector. If you want a data.frame:
df = as.data.frame(t(unlist(L)))
Output:
> df
20 30
1 500 600
Here's a solution with Map
l <- list(`20`=500,`30`=600)
do.call(rbind,Map(data.frame,id=names(l),values=l))
id values
20 20 500
30 30 600
An easy way to achieve the same is the melt function from the reshape2 package.
library(reshape2)
l = list('20'=500, '30'=600)
melt(as.data.frame(l, check.names = F))
Output:
variable value
1 20 500
2 30 600
Alternate approach without using any package
ls = list('20' = 500, '30' = 600, '40' = 400)
d = data.frame('id' = row.names(as.array(unlist(ls))),
               'value' = unlist(ls), row.names = 1:length(ls))
Output
  id value
1 20   500
2 30   600
3 40   400
I have a vector as follows:
v <- c(1,3,4,5,6,7,8,9,NA,NA,NA,NA,27,25,30,41,NA,NA)
How can I extract the values 1, 9, 27 and 41 (i.e. the first and last position of each subset without NAs)?
I thought about using head(v, 1) and tail(v, 1) in combination. However, I have no idea how to 'stop' at the NAs and restart again after them.
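One way to make the head()/tail() idea 'stop' at the NAs is to split the vector into its non-NA runs first (a base-R sketch; the grouping trick uses the running count of NAs as a run label):

```r
v <- c(1,3,4,5,6,7,8,9,NA,NA,NA,NA,27,25,30,41,NA,NA)

# elements of the same non-NA run have seen the same number of NAs,
# so cumsum(is.na(v)) labels the runs once the NAs themselves are dropped
keep <- !is.na(v)
runs <- split(v[keep], cumsum(is.na(v))[keep])

# first and last element of every run
unlist(lapply(runs, function(x) c(head(x, 1), tail(x, 1))), use.names = FALSE)
# [1]  1  9 27 41
```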
We create a grouping variable with rleid based on the logical vector is.na(v), use that in tapply to select the first and last values of each group, unlist the list output, remove the NA elements with na.omit, and remove the attributes with c.
library(data.table)
c(na.omit(unlist(tapply(v, rleid(is.na(v)),
                        function(x) c(x[1], x[length(x)])), use.names = FALSE)))
#[1] 1 9 27 41
Or another option is rle from base R
v[with(rle(!is.na(v)), {
  i1 <- cumsum(lengths)
  i2 <- lengths[values]
  c(rbind(i1[values] - i2 + 1, i1[values]))
})]
#[1] 1 9 27 41
Another possible base R solution is to split based on the NA entries in the vector, lapply the head and tail functions, and remove the NAs, i.e.
ind <- unname(unlist(lapply(split(v, cumsum(c(1, diff(is.na(v)) != 0))),
                            function(i) c(head(i, 1), tail(i, 1)))))
ind[!is.na(ind)]
#[1] 1 9 27 41
A base R solution:
x = na.omit( v[is.na(c(NA,diff(v))) | is.na(c(diff(v),NA))] )
> as.numeric(x)
# [1] 1 9 27 41