How can I group a large number of elements quickly in R?

I've created some 'data' here as an example. It contains 100 elements, each starting with a letter followed by 3 random digits.
I want to know the best/quickest way to convert them into groups like I've started to do in the for loop.
Assume I need to create 50 groups, and that instead of 100 elements in 'data' there are a million.
The groupings themselves will be fairly random. I've used A000-A599 and A600-A999 as the first 2 groupings in my example, but the groupings are not neatly spaced out, e.g. B000-B599 and B600-B999 are not necessarily the next groupings. The next groupings could, for example, be B000-C299, C300-C799, C800-D499, etc. I would need to enter these groupings manually.
I'm guessing a for loop isn't the best way to do this, as it would take a long time to complete.
library(stringr)
library(magicfor)

data <- paste(sample(LETTERS, 100, replace = TRUE),
              sample(str_pad(0:999, width = 3, side = "left", pad = "0"), 100, replace = TRUE),
              sep = "")

magic_for()
for (x in seq_along(data)) {
  if (grepl("^A[0-5]", data[x])) {
    range <- "A000-A599"
  } else if (grepl("^A[6-9]", data[x])) {
    range <- "A600-A999"
  } else {
    range <- NA_character_  # groups beyond the first two not yet entered
  }
  put(range)
}

You can try something like this:
# Round up to the next multiple of `to`
roundUp <- function(x, to = 10) {
  to * (x %/% to + as.logical(x %% to))
}
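For instance, roundUp leaves exact multiples where they are and pushes everything else up to the next multiple of to:
roundUp(c(1, 499, 500, 501), to = 500)
#[1]  500  500  500 1000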
# Create a data frame for easy storage
library(dplyr)  # needed for the pipe, mutate and select
df <- data.frame(data = data, stringsAsFactors = FALSE)
df %>%
  mutate(C = substr(data, 1, 1),
         N = as.integer(substr(data, 2, 4))) %>%
  mutate(N = roundUp(N, to = 500)) %>%
  mutate(data2 = paste0(C, N)) %>%
  select(data, data2)
  data data2
1 U493  U500
2 A429  A500
3 N564 N1000
4 W656 W1000
5 J978 J1000
6 B232  B500
7 D240  D500
8 I796 I1000
9 E831 E1000
...(truncated)
The data2 field contains the new groups. Note that roundUp breaks at multiples of 500 (001-500, 501-1000) rather than at the question's 000-599/600-999 boundaries, so the to argument would need adjusting to match the real groupings.

Using the tidyverse packages, I would convert your data vector into a data.frame (or tibble) format.
library(tidyverse)
df <- tibble(my_variable = data) %>%
  mutate(
    first_char = substr(my_variable, 1, 1),
    random_numbers = substr(my_variable, 2, 4)
  )
Once you've gotten that far, it's easy to group the data however you want. Your loop could be achieved like so:
df %>%
  mutate(
    group = ifelse(as.numeric(random_numbers) < 600, "000-599", "600-999"),
    desired_result = paste0(first_char, group)
  )
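With 50 manually entered groupings, the same pattern scales better with case_when than with nested ifelse calls. Because the IDs are fixed-width, plain lexicographic comparison works; here is a sketch where every boundary after A999 is made up for illustration:
df %>%
  mutate(
    group = case_when(
      my_variable <= "A599" ~ "A000-A599",
      my_variable <= "A999" ~ "A600-A999",
      my_variable <= "C299" ~ "B000-C299",  # hypothetical boundary
      TRUE ~ "other"                        # remaining groups entered the same way
    )
  )
case_when returns the label of the first condition that matches, so listing the upper bounds in increasing order is all that is needed.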
I recommend reading the following (free) ebook cover to cover; it will equip you with a bunch of useful tools for everyday R tasks like the one you've outlined:
https://r4ds.had.co.nz/index.html

The groupings themselves will be fairly random. I've used A000-A599 and A600-A999 as the first 2 groupings in my example but the groupings are not neatly spaced out e.g B000-B599 and B600-B999 are not necessarily the next groupings. The next groupings could for example be B000-C299,C300-C799,C800-D499 etc.
Since your groups are lexicographic intervals, you could use rolling joins. In this case you only need to specify the lower bound for each group:
library(data.table)

# decrement an ID, e.g. "A600" -> "A599", "B000" -> "A999"
dec = function(x) {
  ltr = substr(x, 1, 1)
  num = as.integer(substr(x, 2, 4))
  w0  = num == 0L
  ltr = replace(ltr, w0, LETTERS[match(ltr[w0], LETTERS) - 1L])
  num = replace(num - 1L, w0, 999L)
  sprintf("%s%03d", ltr, num)
}

# enumerate lower bounds and derive ranges
rangeDT = data.table(lb = c("A000", "A600", "B000", "C300", "C800"))
rangeDT[, ub := dec(shift(lb, type = "lead", fill = "Z999"))]
rangeDT[, range := sprintf("%s-%s", lb, ub)]
# lb ub range
# 1: A000 A599 A000-A599
# 2: A600 A999 A600-A999
# 3: B000 C299 B000-C299
# 4: C300 C799 C300-C799
# 5: C800 Z998 C800-Z998
Then the rolling update join is...
DT = data.table(x = data)
DT[, range := rangeDT[.SD, on=.(lb = x), roll=TRUE, x.range]]
The result looks like
> head(DT)
      x     range
1: C965 C800-Z998
2: Q973 C800-Z998
3: V916 C800-Z998
4: C701 C300-C799
5: A363 A000-A599
6: F144 C800-Z998
If your data were numeric, cut or findInterval from base R would work, but for whatever reason those do not support strings.
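As an aside (my own sketch, not part of the original answer): you can get back to findInterval territory by encoding each ID as a number, letter index times 1000 plus the three digits. The helper enc below is made up for illustration and reuses rangeDT from above:
# encode "A123" as 0*1000 + 123, "B000" as 1*1000 + 0, and so on
enc <- function(x) {
  (match(substr(x, 1, 1), LETTERS) - 1L) * 1000L + as.integer(substr(x, 2, 4))
}
grp <- findInterval(enc(data), enc(rangeDT$lb))  # index of each element's group
head(rangeDT$range[grp])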

How about this?
library(data.table)
library(stringr)  # for str_sub

ranges <- c(paste0(LETTERS, "[0-5]"), paste0(LETTERS, "[6-9]"))
final <- lapply(ranges, function(y) {
  matches <- grepl(y, data)
  if (sum(matches) > 0) {
    data.table(element = data[matches],
               range = paste0(str_sub(y, 1, 1), str_sub(y, 3, 3), "00-",
                              str_sub(y, 1, 1), str_sub(y, 5, 5), "99"))
  } else {
    NULL
  }
})
final_2 <- rbindlist(final)
# element range
# A374 A000-A599
# B498 B000-B599
# B064 B000-B599
# C131 C000-C599
# C460 C000-C599
# C099 C000-C599
structure(list(element = c("A374", "B498", "B064", "C131", "C460", "C099", "C193", "E428", "E108", "E527", "E138", "E375", "E312", "F046", "F417", "F094", "G142", "G461", "G068", "H372", "H523", "H027", "H506", "I470", "I169", "I050", "I495", "I405", "J298", "K165", "K169", "K131", "L510", "L210", "L277", "N257", "N554", "N452", "N484", "N247", "N373", "N492", "O347", "O221", "O176", "P578", "P477", "Q062", "Q257", "Q083", "R306", "S415", "S154", "S226", "S400", "T132", "T181", "T321", "V109", "V118", "V267", "W381", "W047", "X317", "X192", "Y390", "Y132", "Y327", "Y141", "Y353", "Z429", "C981", "D813", "F934", "G910", "G673", "G664", "I754", "I624", "L603", "N991", "N996", "O689", "O932", "P854", "P689", "P761", "P681", "Q631", "S620", "T923", "T841", "U787", "U929", "W942", "W702", "X770", "X880", "Y719", "Y969"), range = c("A000-A599", "B000-B599", "B000-B599", "C000-C599", "C000-C599", "C000-C599", "C000-C599", "E000-E599", "E000-E599", "E000-E599", "E000-E599", "E000-E599", "E000-E599", "F000-F599", "F000-F599", "F000-F599", "G000-G599", "G000-G599", "G000-G599", "H000-H599", "H000-H599", "H000-H599", "H000-H599", "I000-I599", "I000-I599", "I000-I599", "I000-I599", "I000-I599", "J000-J599", "K000-K599", "K000-K599", "K000-K599", "L000-L599", "L000-L599", "L000-L599", "N000-N599", "N000-N599", "N000-N599", "N000-N599", "N000-N599", "N000-N599", "N000-N599", "O000-O599", "O000-O599", "O000-O599", "P000-P599", "P000-P599", "Q000-Q599", "Q000-Q599", "Q000-Q599", "R000-R599", "S000-S599", "S000-S599", "S000-S599", "S000-S599", "T000-T599", "T000-T599", "T000-T599", "V000-V599", "V000-V599", "V000-V599", "W000-W599", "W000-W599", "X000-X599", "X000-X599", "Y000-Y599", "Y000-Y599", "Y000-Y599", "Y000-Y599", "Y000-Y599", "Z000-Z599", "C600-C999", "D600-D999", "F600-F999", "G600-G999", "G600-G999", "G600-G999", "I600-I999", "I600-I999", "L600-L999", "N600-N999", "N600-N999", "O600-O999", "O600-O999", "P600-P999", "P600-P999", "P600-P999", "P600-P999", "Q600-Q999", "S600-S999", "T600-T999", "T600-T999", "U600-U999", "U600-U999", "W600-W999", "W600-W999", "X600-X999", "X600-X999", "Y600-Y999", "Y600-Y999")), row.names = c(NA,
-100L), class = c("data.table", "data.frame"))

Related

How to efficiently collapse a vector of integers into a data.table of sequences, using R?

Given a large vector, for example:
set.seed(1)
in_vec <- sample(1:10000, 5000, replace = F)
How can I efficiently collapse this into a data.table that provides the start and end coordinates of all runs of sequential integers? I am currently using the following code:
library(data.table)
library(plyr)

in_vec <- sort(in_vec)  # sort by sequence
# add unique IDs for runs of sequential integers
interval_id <- findInterval(in_vec, in_vec[which(c(1, diff(in_vec)) > 1)])
dt <- data.table(vec = in_vec, int_id = interval_id)
# custom function: first and last value of each run
long_to_short <- function(sub) data.table(start = sub$vec[1], end = sub$vec[nrow(sub)])
output <- ddply(dt, "int_id", long_to_short)
output$int_id <- NULL
However, the vector I am applying this to is very large, and I therefore need to maximise performance. Is there a data.table method? Any help will be greatly appreciated!
Using rleid() from data.table is helpful:
library(data.table)
set.seed(1)
dt <- data.table(in_vec = sample(1:10000, 5000, replace = F))
dt[order(in_vec),
   .(start = min(in_vec),
     end = max(in_vec)),
   by = .(grp = rleid(c(0, cumsum(diff(in_vec) > 1))))]
grp start end
1: 1 4 4
2: 2 6 7
3: 3 14 16
4: 4 19 19
5: 5 26 27
---
2483: 2483 9980 9980
2484: 2484 9988 9988
2485: 2485 9991 9992
2486: 2486 9994 9994
2487: 2487 9997 9998
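A side note on the design: rleid() is not strictly necessary here, since the cumsum already produces increasing group IDs, so the following should give the same grouping:
dt[order(in_vec),
   .(start = min(in_vec), end = max(in_vec)),
   by = .(grp = cumsum(c(0, diff(in_vec) > 1)))]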
For a completely base solution, this should be the most performant as it is not a grouping operation:
set.seed(1)
in_vec <- sample(1:10000, 5000, replace = F)
in_vec <- sort(in_vec)
grp <- c(0, cumsum(diff(in_vec) > 1))
data.frame(grp = unique(grp),
           start = in_vec[!duplicated(grp)],
           end = in_vec[!duplicated(grp, fromLast = TRUE)])
Something like this?
dt[, .(start = first(vec), end = last(vec)), int_id]
Edit: I think the following will do what you need within data.table; adjust the fill = -1 depending on the actual range of your values.
set.seed(1)
in_vec <- sample(1:10000, 5000, replace = F)
dt <- data.table(vec = in_vec, key = 'vec')
dt[, int_id := cumsum(!shift(vec, 1, fill = -1) == vec - 1)]
dt[,.(start = first(vec), end = last(vec)), int_id]
You are almost there; you just need to use the difference between the sorted values to create a group, then apply range to each group.
set.seed(1)
in_vec <- sample(1:10000, 5000, replace = F)
in_vec <- sort(in_vec)
grps <- cumsum(c(1, diff(in_vec) > 1))
output <- data.frame(do.call(rbind, tapply(in_vec, grps, range)))
names(output) <- c("start", "end")
And a dplyr solution
library(dplyr)
set.seed(1)
in_vec <- sample(1:10000, 5000, replace = F)
data.frame(x = in_vec) %>%
  arrange(x) %>%
  mutate(grps = cumsum(c(1, diff(x) > 1))) %>%
  group_by(grps) %>%
  summarise(start = min(x), end = max(x)) %>%
  select(start, end)

Efficient way to add numbers to alphanumeric strings in R

I have a data.frame with ids composed of sequences of alphanumeric characters (e.g., id = c("A001", "A002", "B013")). I was looking for an easy function under stringr or stringi that would do math with these strings (id + 1 should return c("A002", "A003", "B014")).
I made a custom function that does the trick; however, I have a feeling that there must be a better/more efficient/within-package way to achieve this.
library(dplyr)
library(tidyr)

str_add_n <- function(df, string, n, width = 3) {
  string <- enquo(string)
  df <- df %>%
    ## split the string into its text and numeric parts
    separate(!!string,
             into = c("text", "num"),
             sep = "(?<=[A-Za-z])(?=[0-9])",
             remove = FALSE) %>%
    mutate(num = as.numeric(num),
           num = num + n,
           num = stringr::str_pad(as.character(num),
                                  width = width,
                                  side = "left",
                                  pad = "0")) %>%
    unite(next_string, text:num, sep = "")
  return(df)
}
Let's make a toy df
df <- data.frame(id = c("A001", "A002", "B013"))
str_add_n(df, id, 1)
id next_string
1 A001 A002
2 A002 A003
3 B013 B014
Again, this works, I'm wondering if there's a better way to do this, all tweaks welcome!
UPDATE
Based on the suggested answers, I ran some benchmarking, and it appears that both come very close. I would be inclined toward str_add_n_2 (I changed the name to be able to run both, and took the suggestion of x <- as.character(x)).
microbenchmark::microbenchmark(
  question = str_add_n(df, id, 1),
  answer = df %>% mutate_at(vars(id), funs(str_add_n_2(., 1))),
  string_add = df %>% mutate_at(vars(id), funs(string_add(as.character(.))))
)
Which yields
Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval cld
   question 4.312094 4.448391 4.695276 4.570860 4.755748 10.29253   100   c
     answer 2.932146 3.017874 3.191262 3.117627 3.240688  8.24967   100   a
 string_add 3.388442 3.466466 3.699363 3.534416 3.682762  9.05441   100   b
More tweaks are welcome!
Here is a way with gsubfn
id <- c("A001", "A002", "B013")
library(gsubfn)
gsubfn("([0-9]+)", function(x) sprintf("%03.0f", as.numeric(x) + 1), id)
#[1] "A002" "A003" "B014"
You could make it a function
string_add <- function(string, add = 1, width = 3) {
  gsubfn::gsubfn("([0-9]+)",
                 function(x) sprintf(paste0("%0", width, ".0f"), as.numeric(x) + add),
                 string)
}
string_add(id, add = 10, width = 5)
#"A00011" "A00012" "B00023"
I'd suggest it's easier to define the function based on a vector of strings and not hard-code it to look for columns in the frame; for the latter, you can always use something like mutate_at(vars(id,...), funs(str_add_n)).
str_add_n <- function(x, n = 1L) {
  gr <- gregexpr("\\d+", x)
  reg <- regmatches(x, gr)
  widths <- nchar(reg)
  regmatches(x, gr) <- sprintf(paste0("%0", widths, "d"), as.integer(reg) + n)
  x
}
vec <- c("A001", "A002", "B013")
str_add_n(vec)
# [1] "A002" "A003" "B014"
If in a frame:
df <- data.frame(id = c("A001", "A002", "B013"), x = 1:3,
                 stringsAsFactors = FALSE)
library(dplyr)
df %>%
  mutate_at(vars(id), funs(str_add_n(., 3)))
# id x
# 1 A004 1
# 2 A005 2
# 3 B016 3
Caveat: this silently requires true character, not factor ... a possible defensive tactic might be to add x <- as.character(x) in the function definition.
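A minimal sketch of that defensive tactic, wrapping the function above (the wrapper name str_add_n_chr is just illustrative):
str_add_n_chr <- function(x, n = 1L) {
  x <- as.character(x)  # coerce factors (or anything else) to character first
  str_add_n(x, n)
}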

How to enclose combined cells as a vector

I have got a data table like
library(data.table)
library(lifecontingencies)
dt <- data.table(cash = c(100, 120), Flows = c(110, 130), time = c(1, 1),
                 Ids = c(2, 2), int = c(0.02, 0.04), Rates = c(0.02, 0.04),
                 proba = c(0.9, 0.8), bilities = c(0.7, 0.6))
dt
# cash Flows time Ids int Rates proba bilities
#1: 100 110 1 2 0.02 0.02 0.9 0.7
#2: 120 130 1 2 0.04 0.04 0.8 0.6
and want to calculate
#presentValue(cashFlows, timeIds, interestRates, probabilities)
row-wise. How can I do it automatically instead of manually like so:
pV1 <- presentValue(cashFlows = c(100, 110),
                    timeIds = c(1, 2),
                    interestRates = c(0.02, 0.02),
                    probabilities = c(0.9, 0.7))
pV2 <- presentValue(cashFlows = c(120, 130),
                    timeIds = c(1, 2),
                    interestRates = c(0.04, 0.04),
                    probabilities = c(0.8, 0.6))
result <- c(pV1,pV2)
result
#162.2453 164.4231
As we are using data.table, one approach is to group by sequence of rows and apply the function
dt[, .(presValue = presentValue(cashFlows = unlist(c(cash, Flows)),
                                timeIds = unlist(c(time, Ids)),
                                interestRates = unlist(c(int, Rates)),
                                probabilities = unlist(c(proba, bilities)))),
   by = .(Row = 1:nrow(dt))]
# Row presValue
#1: 1 162.2453
#2: 2 164.4231
Another approach is to combine the multiple columns into one by melting and then apply the presentValue
dM <- melt(dt, measure = patterns('cash|Flow', 'time|Ids', 'int|Rates', 'proba|bilities'),
           value.name = c('cashFlows', 'timeIds', 'interestRates', 'probabilities'))[,
  rn := rowid(variable)][]
dM[, .(presValue = do.call(presentValue, .SD)),
   by = .(Row = rn), .SDcols = cashFlows:probabilities]
# Row presValue
#1: 1 162.2453
#2: 2 164.4231
I would vectorize your calculation via apply:
apply(dt, 1, function(x) presentValue(cashFlows = x[1:2],
                                      timeIds = x[3:4],
                                      interestRates = x[5:6],
                                      probabilities = x[7:8]))
# [1] 162.2453 164.4231
Note that you can manipulate the row indices any way you want; for example, x[1:2] here stands for the first and second cells of a row. You can choose the first and fourth cells with x[c(1,4)], or just the second cell via x[2].
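Since apply passes each row in as a named vector here, you could equally index by column name, which is a little more robust to column reordering (same result, assuming the column names above):
apply(dt, 1, function(x) presentValue(cashFlows = x[c("cash", "Flows")],
                                      timeIds = x[c("time", "Ids")],
                                      interestRates = x[c("int", "Rates")],
                                      probabilities = x[c("proba", "bilities")]))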
It took me a minute to understand presentValue and what it needed, but I think this should do what you want.
apply(dt, 1, function(row) {
  cashFlows <- c(row[1], row[2])
  tIds <- c(row[3], row[4])
  interestRates <- c(row[5], row[6])
  probabilities <- c(row[7], row[8])
  presentValue(cashFlows = cashFlows,
               timeIds = tIds,
               interestRates = interestRates,
               probabilities = probabilities)
})

by-group calculation, limited to first N rows of each group

I asked a question before and received a good answer, but I needed to apply it to a more specific problem. The DT needs to be divided into 16 sectors based on X and Y values. The X and Y variables represent the coordinates to loop through when dividing the data table. I have successfully divided this data table into 16 different 'sectors', and I need to apply the sCalc function to each sector and output a number. I'm looking for a faster way to do this.
Refer to this link for clarification if needed: Faster way to subset data table instead of a for loop R.
library(data.table)
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000),
                 Norm = rnorm(1600 * 2000), Unif = runif(1600 * 2000))
sCalc <- function(DT) {
  setkey(DT, Norm)
  cells <- DT[1:(nrow(DT) * 0.02)]
  nCells <- nrow(DT)
  sumCell <- sum(cells[, Norm / sqrt(Unif)])
  return(sumCell / nCells)
}
startstop <- function(width, y = FALSE) {
  startend <- width - (width / 4 - 1)
  start <- round(seq(0, startend, length.out = 4))
  stop <- round(seq(width / 4, width, length.out = 4))
  if (length(c(start, stop)[anyDuplicated(c(start, stop))]) != 0) {
    dup <- anyDuplicated(c(start, stop))
    stop[which(stop == c(start, stop)[dup])] <- stop[which(stop == c(start, stop)[dup])] - 1
  }
  if (y == TRUE) {
    coord <- list(rep(start, each = 4), rep(stop, each = 4))
  } else if (y == FALSE) {
    coord <- list(rep(start, times = 4), rep(stop, times = 4))
  }
  return(coord)
}
sectorCalc <- function(x, y, DT) {
  sector <- numeric(length = 16)
  for (i in 1:length(sector)) {
    sect <- DT[X %between% c(x[[1]][i], x[[2]][i]) & Y %between% c(y[[1]][i], y[[2]][i])]
    sector[i] <- sCalc(sect)
  }
  return(sector)
}
x <- startstop(2000)
y <- startstop(1600, y = TRUE)
sectorLoop <- sectorCalc(x,y,DT)
sectorLoop returns:
-4.729271 -4.769156 -4.974996 -4.931120 -4.777013 -4.644919 -4.958968 -4.663221
-4.771545 -4.909868 -4.821098 -4.795526 -4.846709 -4.931514 -4.875148 -4.847105
One solution was using the cut function.
DT[, x.sect := cut(DT[, X], seq(0, 2000, by = 500), dig.lab = 10)]
DT[, y.sect := cut(DT[, Y], seq(0, 1600, by = 400), dig.lab = 10)]
sectorRef <- DT[order(Norm),
                .(sCalc = sum(Norm[1:(0.02 * .N)] / sqrt(Unif[1:(0.02 * .N)])) / (0.02 * .N)),
                by = .(x.sect, y.sect)]
sectorRef <- sectorRef[[3]]
The above solution returns a data table with the values:
-4.919447 -4.778576 -4.757455 -4.779086 -4.739814 -4.836497 -4.776635 -4.656748
-4.939441 -4.707901 -4.751791 -4.864481 -4.839134 -4.973294 -4.663360 -5.055344
cor(sectorRef, sectorLoop)
The above returns: 0.0726904
As far as I can understand the question, the first thing I would explain is that you can use .N to tell you how many rows there are in each by = .(...) group. I think that is analogous to your nCells.
And whereas your cells takes the top 2% of rows in each group, this can be accomplished at the vector level by indexing [1:(0.02*.N)]. Assuming you want the top 2% in order of increasing Norm (which is the order you would get from setkey(DT, Norm), although setting a key does more than just sorting), you could call setkey(DT, Norm) before the calculation, as in the example, or, to make it clearer what you are doing, you could use order(Norm) inside your calculation.
The sum() part doesn't change, so the equivalent third line is:
DT[order(Norm),
   .(sCalc = sum(Norm[1:(0.02 * .N)] / sqrt(Unif[1:(0.02 * .N)])) / .N),
   by = .(x.sect, y.sect)]
Which returns the operation for the 16 groups:
x.sect y.sect sCalc
1: (1500,2000] (800,1200] -0.09380209
2: (499,1000] (399,800] -0.09833151
3: (499,1000] (1200,1600] -0.09606350
4: (0,499] (399,800] -0.09623751
5: (0,499] (800,1200] -0.09598717
6: (1500,2000] (0,399] -0.09306580
7: (1000,1500] (399,800] -0.09669593
8: (1500,2000] (399,800] -0.09606388
9: (1500,2000] (1200,1600] -0.09368166
10: (499,1000] (0,399] -0.09611643
11: (1000,1500] (0,399] -0.09404482
12: (0,499] (1200,1600] -0.09387951
13: (1000,1500] (1200,1600] -0.10069461
14: (1000,1500] (800,1200] -0.09825285
15: (0,499] (0,399] -0.09890184
16: (499,1000] (800,1200] -0.09756506

Combining frequencies and summary statistics in one table?

I just discovered the power of plyr frequency tables with several variables in R, and I am still struggling to understand how it works; I hope someone here can help me.
I would like to create a table (data frame) in which I can combine frequencies and summary stats, but without hard-coding the values.
Here is an example dataset:
require(datasets)
d1 <- sleep
# I classify the variable extra to calculate the frequencies
extraClassified <- cut(d1$extra, breaks = 3, labels = c('low', 'medium', 'high') )
d1 <- data.frame(d1, extraClassified)
The results I am looking for should look like this:
require(plyr)
ddply(d1, "group", summarise,
      All = length(ID),
      nLow = sum(extraClassified == "low"),
      nMedium = sum(extraClassified == "medium"),
      nHigh = sum(extraClassified == "high"),
      PctLow = round(sum(extraClassified == "low") / length(ID), digits = 1),
      PctMedium = round(sum(extraClassified == "medium") / length(ID), digits = 1),
      PctHigh = round(sum(extraClassified == "high") / length(ID), digits = 1),
      xmean = round(mean(extra), digits = 1),
      xsd = round(sd(extra), digits = 1))
My question: how can I do this without hard-coding the values?
For the record, I tried this code, but it does not work:
ddply(d1, "group",
      function(i) c(table(i$extraClassified),
                    prop.table(as.character(i$extraClassified))))
Thanks in advance
Here's an example to get you started:
foo <- function(x, colfac, colval) {
  tbl <- table(x[, colfac])
  res <- cbind(n = nrow(x), t(tbl), t(prop.table(tbl)))
  colnames(res)[5:7] <- paste(colnames(res)[5:7], "Pct", sep = "")
  res <- as.data.frame(res)
  res$mn <- mean(x[, colval])
  res$sd <- sd(x[, colval])
  res
}
ddply(d1, .(group), foo, colfac = "extraClassified", colval = "extra")
Don't take anything in that function foo as gospel. I just wrote that off the top of my head. Surely improvements/modifications are possible, but at least it's something to start with.
Thanks to Joran.
I slightly modified your function to make it more generic (no reference to the positions of the variables).
require(plyr)
foo <- function(x, colfac, colval) {
  # table with frequencies
  tbl <- table(x[, colfac])
  # table with percentages
  tblpct <- t(prop.table(tbl))
  colnames(tblpct) <- paste(colnames(t(tbl)), 'Pct', sep = '')
  # put the first part together
  res <- cbind(n = nrow(x), t(tbl), tblpct)
  res <- as.data.frame(res)
  # add summary statistics
  res$mn <- mean(x[, colval])
  res$sd <- sd(x[, colval])
  res
}
ddply(d1, .(group), foo, colfac = "extraClassified", colval = "extra")
and it works!
P.S.: I still do not understand what .(group) stands for.
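For what it's worth (a note beyond the original thread): .() is plyr's quoting function, so .(group) simply names the grouping variable without evaluating it, and these two calls are equivalent:
ddply(d1, .(group), foo, colfac = "extraClassified", colval = "extra")
ddply(d1, "group", foo, colfac = "extraClassified", colval = "extra")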
