Ideas to improve this loop?

I've been reading about how to improve code in R, taking a look at some of the answers here and also reading a bit of The R Inferno. Now I have this problem, and the loop I created seems to be taking forever (15 hours and counting).
k <- NROW(unique(df$EndStation.Id))
l <- NROW(unique(df$StartStation.Id))
m1 <- as.matrix(df[,c("Duration","StartStation.Id","EndStation.Id")])
g <- function(m){
  for (i in 1:l){
    for (j in 1:k){
      duration <- m[(m[,2]==i & m[,3]==j), 1]
      if (NROW(duration) <= 1) {
        m[(m[,2]==i & m[,3]==j), 1] <- NA
        next
      }
      duration <- duration/median(duration)
      m[(m[,2]==i & m[,3]==j), 1] <- duration
    }
  }
  return(m)
}
answer <- g(m1)
There are 750 stations (both Start and End), and the duration vector size can vary a lot, from 1 or 2 up to 80. Is this loop improvable, or should I give up and try to get access to a faster computer?
Best regards,
Fernando

The code is a bit hard to read, but I think this is what you want to do:
library(data.table)
## generate a data table
dt <- setDT(df[,c("Duration","StartStation.Id","EndStation.Id")])
## calculate the duration
dt[, Duration := Duration / median(Duration), by = .(StartStation.Id, EndStation.Id)]
## replace the result with NA when the vector length == 1
dt[, N := .N, by = .(StartStation.Id, EndStation.Id)][
  N == 1, Duration := NA
][, N := NULL]
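A single grouped pass should also work, and matches the question's length <= 1 check rather than == 1 (an untested sketch along the same lines as above):
## NA when a station pair has at most one row, otherwise scale by the group median
dt[, Duration := if (.N <= 1) NA_real_ else Duration / median(Duration),
   by = .(StartStation.Id, EndStation.Id)]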

If I understand your function correctly, you want to divide the duration between two stations by its median duration and, if there is only one entry for the pair of stations, set it to NA.
Here is a base solution (it's a bit clunky, I haven't finished my first cup of coffee):
##Some sample data
df <- data.frame(StartStation.Id = sample(LETTERS[1:10], 100, replace = TRUE),
                 EndStation.Id = sample(LETTERS[11:20], 100, replace = TRUE),
                 Duration = runif(100, 0.1, 100))
res <- tapply(df$Duration, paste0(df$StartStation.Id, df$EndStation.Id), function(x) x/median(x))
res <- data.frame(StartStation.Id = sapply(strsplit(rep(names(res), sapply(res, length)), ""), "[", 1),
                  EndStation.Id = sapply(strsplit(rep(names(res), sapply(res, length)), ""), "[", 2),
                  durn = unlist(res))
res[res$durn==1,] <- NA

Related

How to quantify the frequency of all possible row combinations of a binary matrix in R in a more efficient way?

Let's assume I have a binary matrix with 24 columns and 5000 rows.
The columns are Parameters (P1 - P24) of 5000 subjects. The parameters are binary (0 or 1).
(Note: my real data can contain as much as 40,000 subjects)
m <- matrix(, nrow = 5000, ncol = 24)
m <- apply(m, c(1,2), function(x) sample(c(0,1),1))
colnames(m) <- paste("P", c(1:24), sep = "")
Now I would like to determine what are all possible combinations of the 24 measured parameters:
comb <- expand.grid(rep(list(0:1), 24))
colnames(comb) <- paste("P", c(1:24), sep = "")
The final question is: How often does each of the possible row combinations from comb appear in matrix m?
I managed to write code for this and to create a new column in comb to hold the counts, but my code appears to be really slow and would take an estimated 328 days to run. Therefore the code below only considers the first 20 combinations:
comb$count <- 0
for (k in 1:20){ # considers only the first 20 combinations of comb
  for (i in 1:nrow(m)){
    if (all(m[i,] == comb[k, 1:24])){
      comb$count[k] <- comb$count[k] + 1
    }
  }
}
Is there a computationally more efficient way to do this, so that I can count all combinations in a short time?
Thank you very much for your help in advance.
data.table is fast at this type of operation:
m <- matrix(, nrow = 5000, ncol = 24)
m <- apply(m, c(1,2), function(x) sample(c(0,1),1))
colnames(m) <- paste("P", c(1:24), sep = "")
comb <- expand.grid(rep(list(0:1), 24))
colnames(comb) <- paste("P", c(1:24), sep = "")
library(data.table)
data_t = data.table(m)
ans = data_t[, .N, by = P1:P24]
dim(ans)
head(ans)
The core of this is by = P1:P24, which means group by all of those columns, and .N, which gives the number of records in each group.
I used this as inspiration - How does one aggregate and summarize data quickly? - and the data.table manual: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
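If you really need a count for every row of comb, including the combinations that never occur, one option is to join the grouped counts back onto comb. A rough sketch, reusing the objects above (note comb has 2^24 = 16.8 million rows, so this needs a fair amount of memory):
comb_dt = as.data.table(comb)
## left join the observed counts onto all possible combinations
ans_all = ans[comb_dt, on = paste0("P", 1:24)]
ans_all[is.na(N), N := 0L]   ## combinations never seen in m get a count of 0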
If all you need is the combinations that occur in the data and how many times, this will do it:
m2 <- apply(m, 1, paste0, collapse="")
m2.tbl <- xtabs(~m2)
head(m2.tbl)
# m2
# 000000000001000101010010 000000000010001000100100 000000000010001110001100 000000000100001000010111 000000000100010110101010 000000000100101000101100
# 1 1 1 1 1 1
Alternatively, you can use apply to paste together the values in each row and table to count the frequencies.
table(apply(m, 1, paste0, collapse = '-'))

Replacing values in high-frequency data

I've got high-frequency data about durations. I've found out that I have some faulty entries, which I cannot discard, that have 1800 times a random number added to them. Now I was stupid enough to try:
for (i in 1:21863924) { while (rr[i] >= 1800) { rr[i] <- rr[i] - 1800 } }
This obviously didn't finish, even though I left it running overnight. I was wondering if there is a more elegant way to do this, since subsetting the dataset to exclude the faulty entries works in a matter of seconds?
It can be done in a vectorized way. Create a logical vector
i1 <- rr >= 1800
Use that vector to replace the values while assigning the values to the original vector
rr[i1] <- rr[i1] - 1800
A recursive function would be
f1 <- function(x, val) {
  i1 <- x >= val
  x[i1] <- x[i1] - val
  if (sum(x >= val) > 0) f1(x, val) else x  # return x once no values exceed val
}
out <- f1(rr, val = 1800)
sum(out >= 1800)
#[1] 0
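If a faulty entry can have 1800 added more than once, and genuine durations are always below 1800, a single modulo should do the same job as the recursion in one step:
rr[i1] <- rr[i1] %% 1800  # assumes every valid duration is < 1800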
data
set.seed(24)
rr <- sample(20000, 100)

Filter data using a loop to compare calculated metrics for different days of week in R

I have a data set of traffic by day and by hour. I have written a function that I would like to apply to different conditions of this data set.
For instance, I want to compare average traffic for different days of the week and different hours of the day.
How do I use a loop to filter through each possible day of the week and return, for each of them, the metric I have a function for?
Would really appreciate some help here.
Thanks,
Zach
I'm not sure a loop is the best thing for what you're trying to do, but here's one way to do it.
# generate example data
set.seed(1234)
df <- data.frame(hour = sample(1:24, 100, TRUE),
                 dow = sample(1:7, 100, TRUE),
                 traffic = round(runif(100, 1, 50)))
# prep storage matrix for results
H <- sort(unique(df$hour))
D <- sort(unique(df$dow))
res_mat <- matrix(NA, nrow=length(H), ncol=length(D))
colnames(res_mat) <- D
rownames(res_mat) <- H
# function I want to apply to subsets of values
my_fun <- function(x) { mean(x) + 2 }
# loop
for (h in seq_along(H)) {
  for (d in seq_along(D)) {
    # get vector of traffic for a particular hour and day-of-week combo
    subset_of_traffic <- df[df$hour == H[h] & df$dow == D[d], "traffic"]
    # skip if no traffic data for this hour and day-of-week combo
    if (length(subset_of_traffic) == 0) next
    # run function on that subset and store result
    res_mat[h, d] <- my_fun(subset_of_traffic)
  }
}
A faster way to get the same results with data.table:
library(data.table)
dt <- data.table(df)
res_dt <- dt[ , .(results = my_fun(traffic)), by=.(hour, dow)]
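If you want the data.table results in the same hour-by-day layout as res_mat, you could reshape them afterwards; a small sketch using dcast from data.table:
## one row per hour, one column per day of week; empty combinations become NA
res_wide <- dcast(res_dt, hour ~ dow, value.var = "results")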

Efficiently counting numbers falling within each range of numbers

I'm looking for a faster solution to the problem below. I'll illustrate the problem with a small example and then provide the code to simulate large data, as that's the point of this question. My actual problem has lists of length 1 million.
Say, I've two lists as shown below:
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1,2,3,55,90), c(37,38,95))
Properties of x and y:
Each element of the list x always sums up to 100.
Each element of y will always be sorted and will be always between 1 and 100.
The problem:
Now, what I'd like is this. Taking x[[1]] and y[[1]], I'd like to find the count of numbers in y[[1]] that are 1) <= 82 and 2) > 82 and <= 100. That would be c(4, 1), because the numbers <= 82 are c(1,2,3,55) and the number between 83 and 100 is c(90). Similarly, for x[[2]] and y[[2]] the result is c(0, 2, 1). That is, the answer should be:
[[1]]
[1] 4 1
[[2]]
[1] 0 2 1
Let me know if this is still unclear.
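To make the counting rule concrete, here is a small illustration on the toy data using findInterval (left.open needs R >= 3.3.0); this is only meant to show the intended output, not as a fast solution:
## count how many y values fall in (0, 82], (82, 100], etc.
mapply(function(p, v) {
  breaks <- c(0, cumsum(p))
  tabulate(findInterval(v, breaks, left.open = TRUE), nbins = length(p))
}, x, y, SIMPLIFY = FALSE)
## [[1]] 4 1    [[2]] 0 2 1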
Simulated data with 1 million entries
set.seed(1)
N <- 100
n <- 1e6
len <- sample(2:3, n, TRUE)
x <- lapply(seq_len(n), function(ix) {
  probs <- sample(100:1000, len[ix])
  probs <- probs/sum(probs)
  oo <- round(N * probs)
  if (sum(oo) != 100) {
    oo[1] <- oo[1] + (100 - sum(oo))
  }
  oo
})
require(data.table)
ss <- sample(1:10, n, TRUE)
dt <- data.table(val=sample(1:N, sum(ss), TRUE), grp=rep(seq_len(n), ss))
setkey(dt, grp, val)
y <- dt[, list(list(val)),by=grp]$V1
What I've done so far:
Using mapply (slow):
I first thought of using rank with ties.method="first" and mapply (the obvious choice with two lists) and tried this:
tt1 <- mapply(y, x, FUN = function(a, b) {
  tt <- rank(c(a, cumsum(b)), ties = "first")[-(1:length(a))]
  c(tt[1] - 1, diff(tt) - 1)
})
Although this works just fine, it takes a lot of time on 1M entries. I think the overhead of computing rank and diff that many times adds to it. This takes 241 seconds!
Therefore, I decided to try and overcome the usage of rank and diff by using data.table and sorting with a "group" column. I came up with a longer but much faster solution shown below:
Using data.table (faster):
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl), type = "x")
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl), type = "y")
tt2 <- rbindlist(list(ydt, xdt[, list(cumval, grp, type)]))
setkey(tt2, grp, val)
xdt.pos <- which(tt2$type == "x")
tt2[, type.x := 0L][xdt.pos, type.x := xdt.pos]
tt2 <- tt2[xdt.pos][tt2[, .N, by = grp][, N := cumsum(c(0, head(N, -1)))]][, sub := type.x - N]
tt2[, val := xdt$val]
# time consuming step
tt2 <- tt2[, c(sub[1]-1, sub[2:.N] - sub[1:(.N-1)] - 1), by = grp]
tt2 <- tt2[, list(list(V1)),by=grp]$V1
This takes 26 seconds, so it's about 9 times faster. I'm wondering if it's possible to get much more speedup, as I'll have to compute this repeatedly on 5-10 such 1-million-element lists. Thank you.
Here's another data.table approach. Edit: I added a (dirty?) hack that speeds this up and makes it ~2x faster than the OP's data.table solution.
# compile the data.table's, set appropriate keys
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl))
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl))
# hack #0, set key but prevent sorting, since we know data is already sorted
setattr(ydt, 'sorted', c('grp', 'val'))
# by setting the key in y to val and in x to cumval we can
# leverage the rolling joins
setattr(xdt, 'sorted', c('grp', 'cumval')) # hack #1 set key, but prevent sorting
vals = xdt[, cumval.copy := cumval][ydt, roll = -Inf]
# hack #2, same deal as above
# we know that the order of cumval and cumval.copy is the same
# so let's convince data.table in that
setattr(vals, 'sorted', c('grp', 'cumval.copy'))
# compute the counts and fill in the missing 0's
# for when there is no y in the appropriate x interval
tt2 = vals[, .N, keyby = list(grp, cumval.copy)][xdt][is.na(N), N := 0L]
# convert to list
tt2 = tt2[order(grp, cumval.copy), list(list(N)), by = grp]$V1
This is about 25% faster, but outputs a matrix rather than a list. You may be able to use apply/sapply to make it work with a list (saving as a list was slowing it down).
c <- matrix(0, length(x), 100)
for (j in 1:length(x)) {
  a <- -1
  b <- 0
  for (i in 1:length(x[[j]])) {
    a <- b
    b <- b + x[[j]][i]
    # strict lower bound, so a value equal to an interval boundary is counted only once
    c[j, i] <- sum(a < y[[j]] & y[[j]] <= b)
  }
}
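If you do need the list form afterwards, one way to trim the matrix (named c above) back down might be:
## keep only the first length(x[[j]]) counts of each row
res_list <- lapply(seq_along(x), function(j) c[j, seq_along(x[[j]])])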

Pulling information from the first approximate match of a text string in R (and summing the total number of matches)

I'm having trouble summing approximate matches of text strings, as well as pulling information from the string that was matched
first in time.
I have data that look like this:
text<-c("THEN it goes West","AT it falls East","it goes West", "it falls East", "AT it goes West")
date<-c(2008,2009,2003,2006,2011)
ID<-c(1,2,3,4,5)
data<-cbind(text,date,ID)
data<-as.data.frame(data)
Notice that the latest text strings have all-caps "THEN" and "AT" added to the earlier text strings.
I would like a table that looks like this:
ID Sum Originaltext Originaldate
[1,] "4" "3" "it goes West" "2003"
[2,] "2" "2" "it falls East" "2006"
This includes:
The ID number corresponding with the text with the earliest date (the "original" text that the others were derived from).
Sums of all approximate matches for each.
The text corresponding with the earliest date.
And the date of the text corresponding with the earliest date.
I have tens of millions of cases, so I'm having trouble automating the process.
I run Windows 7, and have access to fast-computing servers.
IDEAS
# order them backwards in time
data <- data[order(data$date, decreasing = TRUE), ]
# find the strings with the latest date
pattern <- "AT|THEN"
k <- vector("list", length(data$text))
for (j in 1:length(data$text)){
  k[[j]] <- grep(pattern, data$text[[j]], ignore.case = FALSE)
}
k <- subset(data$text, k == 1)
k <- unique(k)
# this is a problem, because case nos. 1 and 5 are still in the dataset, but they derive from the same tweet.
From here, I can use "agrep", but I'm not sure in what context. Any help would be greatly appreciated!
NOTE: While the three answers below do answer my question the way I originally asked it, I have not mentioned that my text cases do vary even without the words "AT" and "THEN". In fact, most of them do not match exactly. I should have put this in the original question. However, I would still love an answer.
Thanks!
A data.table solution avoiding stringr. I am sure this could be improved.
Dealing with text data
# make the factor columns character
.data <- lapply(data, function(x) if(is.factor(x)) {as.character(x)} else { x})
library(data.table)
DT <- as.data.table(.data)
DT[, original_text := text]
# using %like%, which is an easy data.table wrapper for grepl
DT[text %like% "^THEN", text := substr(text, 6, nchar(text))]
DT[text %like% "^AT", text := substr(text, 4, nchar(text))]
# or avoiding the two vector scans and replacing in one fell swoop
DT[,text := gsub('(^THEN )|(^AT )', '', text)]
DT[, c(sum=.N, .SD[which.min(date)]) ,by=text]
using factor levels (could be faster)
# assuming that text is a factor
DTF <- as.data.table(data)
DTF[, original_text := text]
levels_text <- DTF[, levels(text)]
new_levels <- gsub('(^THEN )|(^AT )', '', levels_text)
# reset the levels
setattr(DTF[['text']], 'levels', new_levels)
# coerce to character and do the same count / min date
DTF[, c(sum=.N, .SD[which.min(date)]) ,by=list(text = as.character(text))]
I'm going to give you a base solution, but I really think this is a big problem for base and the data.table package is what is needed (I just don't know how to use data.table very well):
dat <- data[order(data$date), ]
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
dat$text2 <- Trim(gsub("AT|THEN", "", dat$text))
dat2 <- split(dat, dat$text2)
FUN <- function(x) {
  c(ID = x[1, 3], Sum = nrow(x), Original.Text = as.character(x[1, 1]),
    Original.Date = as.character(x[1, 2]))
}
data.frame(do.call(rbind, lapply(dat2, FUN)), row.names = NULL)
I don't really know how close each text string actually is, so maybe my exact matching is not appropriate; if that's the case, use agrep to develop a new variable. Sorry for the lack of annotations, but I am pressed for time and I think data.table is more appropriate anyway.
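For the fuzzy case, here is a rough, untested sketch of how agrep might be used to build that new grouping variable, processing the earliest texts first so later variants attach to their originals; the max.distance value is a guess you would need to tune:
dat$group <- NA_character_
for (i in seq_len(nrow(dat))) {
  if (is.na(dat$group[i])) {
    ## rows whose cleaned text approximately matches this (earliest unassigned) text
    hits <- agrep(dat$text2[i], dat$text2, max.distance = 0.2)
    hits <- hits[is.na(dat$group[hits])]
    dat$group[hits] <- dat$text2[i]
  }
}
dat2 <- split(dat, dat$group)  # then proceed with FUN as above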
EDIT: I still think that data.table is the better way to go, but maybe running in parallel is smart. You're on a Windows machine, so this would work to use multiple cores of the computer:
dat <- data[order(data$date), ]
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
dat$text2 <- Trim(gsub("AT|THEN", "", dat$text))
dat2 <- split(dat, dat$text2)
FUN <- function(x) {
  c(ID = x[1, 3], Sum = nrow(x), Original.Text = as.character(x[1, 1]),
    Original.Date = as.character(x[1, 2]))
}
library(parallel)
detectCores() #make sure you have > 1 core
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
clusterExport(cl=cl, varlist=c("FUN", "dat2"), envir=environment())
x <- parLapply(cl, dat2, FUN)
stopCluster(cl) #stop the cluster
data.frame(do.call(rbind, x), row.names = NULL)
plyr might be too slow given the number of records you mention, but here is a solution for you:
library(plyr)
library(stringr)
data$original_text <- data$text
data$text[grep("^THEN", data$text)] <- str_trim(str_sub(data$text[grep("^THEN", data$text)],6))
data$text[grep("^AT", data$text)] <- str_trim(str_sub(data$text[grep("^AT", data$text)],4))
result <- ddply(data, .(text), function(x) {
  sum <- nrow(x)
  x <- x[which(x$date == min(x$date)), ]
  return(data.frame(id = unique(x$ID), Sum = sum, Originaltext = unique(x$original_text),
                    Originaldate = unique(x$date)))
})
> result[, -1]
id Sum Originaltext Originaldate
1 4 2 it falls East 2006
2 3 3 it goes West 2003
If you have access to a multicore machine (4 or more cores), then here is an HPC solution:
library(multicore)
library(stringr)
data$original_text <- data$text
data$text[grep("^THEN", data$text)] <- str_trim(str_sub(data$text[grep("^THEN", data$text)],6))
data$text[grep("^AT", data$text)] <- str_trim(str_sub(data$text[grep("^AT", data$text)],4))
fux <- function(x) {
  sum <- nrow(x)
  x <- x[which(x$date == min(x$date)), ]
  return(data.frame(id = unique(x$ID), Sum = sum, Originaltext = unique(x$original_text),
                    Originaldate = unique(x$date)))
}
x <- split(data, data$text)
result <- mclapply(x, fux, mc.cores = 4, mc.preschedule = TRUE)
