R data.frame flow data pre-processing for aggregated time statistics

What is the most efficient way of processing a flow data.frame like
> df <- data.frame(amount=c(4,3,1,1,4,5,9,13,1,1), size=c(164,124,131,315,1128,331,1135,13589,164,68), tot=1, first=c(1,1,3,3,2,2,2,2,4,4), secs=c(2,2,0,0,1,1,1,1,0,0))
> df
   amount  size tot first secs
1       4   164   1     1    2
2       3   124   1     1    2
3       1   131   1     3    0
4       1   315   1     3    0
5       4  1128   1     2    1
6       5   331   1     2    1
7       9  1135   1     2    1
8      13 13589   1     2    1
9       1   164   1     4    0
10      1    68   1     4    0
into per-time aggregated totals
> df2
  time tot amount  size
1    1   2    3.5   144
2    2   6   34.5 16327
3    3   8   36.5 16773
4    4   2    2.0   232
... using R, when the actual data set can have more than 100 000 000 rows or span tens of gigabytes?
Column first denotes the start of a flow with duration secs, with metrics amount, size, and tot. In the aggregated totals, size and amount are divided evenly over the time range as double-precision values, whereas tot is summed into every time-slot as an integer. Duration secs denotes how many time-slots the flow lasts in addition to the value of first: if secs is 1 and first is 5, the flow covers time-slots 5 and 6. My current implementation uses ugly and dead-slow for-loops, which is not an option:
df2 <- data.frame()
for (i in 1:nrow(df)) {
  items <- df[i, 'secs']
  idd <- df[i, 'first']
  for (ss in 0:items) { # run once for secs=0
    if (items == 0) { items <- 1 }
    df2[idd+ss, 'time'] <- idd+ss
    if (is.null(df2[idd+ss, 'tot']) || is.na(df2[idd+ss, 'tot'])) {
      df2[idd+ss, 'tot'] <- df[i, 'tot']
    } else {
      df2[idd+ss, 'tot'] <- df2[idd+ss, 'tot'] + df[i, 'tot']
    }
    if (is.null(df2[idd+ss, 'amount']) || is.na(df2[idd+ss, 'amount'])) {
      df2[idd+ss, 'amount'] <- df[i, 'amount']/items
    } else {
      df2[idd+ss, 'amount'] <- df2[idd+ss, 'amount'] + df[i, 'amount']/items
    }
    if (is.null(df2[idd+ss, 'size']) || is.na(df2[idd+ss, 'size'])) {
      df2[idd+ss, 'size'] <- df[i, 'size']/items
    } else {
      df2[idd+ss, 'size'] <- df2[idd+ss, 'size'] + df[i, 'size']/items
    }
  }
}
You can probably optimize this a lot and achieve good performance using only loops, but I bet that better algorithms exist. Maybe you could somehow expand/duplicate the rows with secs > 0, increasing the first (timestamp) values of the expanded rows and adjusting the amount, size, and tot metrics on the fly:
now the original data ...
  amount size tot first secs
1      4  164   1     1    0
2      4  164   1     1    1
3      3  124   1     1    2
magically becomes
  amount  size tot first
1      4   164   1     1
2      2    82   1     1
3      2    82   1     2
4      1 41.33   1     1
5      1 41.33   1     2
6      1 41.33   1     3
After this pre-processing step, aggregation would be trivial using plyr's ddply, ideally in its parallel mode.
All the ddply, apply, etc. examples I was able to find operate on a per-row or per-column basis, which makes it hard to modify other rows. Hopefully I don't have to rely on awk magic.
Update: The algorithm above can easily exhaust your memory when the expansion is done "as is". Some kind of "on the fly" calculation is therefore preferred, where we don't map everything into memory at once. Mattrition's answer is nevertheless correct and helped a lot, so I am marking it as the accepted answer.

The following is an implementation using data.table. I chose data.table for its aggregation abilities, but it's a nifty and efficient class to work with too.
library(data.table)
dt <- as.data.table(df)
# Using the "expand" solution linked in the Q.
# +1 to secs to allow room for 0-values
dtr <- dt[rep(seq.int(1, nrow(dt)), secs+1)]
# Create a new seci column that enumerates sec for each row of dt
dtr[,seci := dt[,seq(0,secs),by=1:nrow(dt)][,V1]]
# All secs that equal 0 are changed to 1 for later division
dtr[secs==0, secs := 1]
# Create time (first+seci) and adjusted amount and size columns
dtr[,c("time", "amount2", "size2") := list(first+seci, amount/secs, size/secs)]
# Aggregate selected columns (tot, amount2, and size2) by time
dtr.a <- dtr[,list(tot=sum(tot), amount=sum(amount2), size=sum(size2)), by=time]
dtr.a
   time tot amount  size
1:    1   2    3.5   144
2:    2   6   34.5 16327
3:    3   8   36.5 16773
4:    4   2    2.0   232
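Regarding the memory concern raised in the question's update: one hedged option (a sketch, not part of the original answer; aggregate_chunk and chunk_size are illustrative names/values) is to run the same expand-and-aggregate step chunk by chunk and then re-aggregate the partial per-time totals, so the full expansion never has to sit in memory at once:
aggregate_chunk <- function(dt) {
  # same expand-and-aggregate logic as above, applied to one chunk of rows
  dtr <- dt[rep(seq.int(1, nrow(dt)), secs + 1)]
  dtr[, seci := dt[, seq(0, secs), by = 1:nrow(dt)][, V1]]
  dtr[secs == 0, secs := 1]
  dtr[, c("time", "amount2", "size2") := list(first + seci, amount / secs, size / secs)]
  dtr[, list(tot = sum(tot), amount = sum(amount2), size = sum(size2)), by = time]
}
chunk_size <- 1e6   # illustrative value; tune to available memory
starts <- seq(1, nrow(dt), by = chunk_size)
partials <- lapply(starts, function(s) aggregate_chunk(dt[s:min(s + chunk_size - 1, nrow(dt))]))
# partial per-time totals from different chunks may overlap in time, so sum them again
dtr.a <- rbindlist(partials)[, list(tot = sum(tot), amount = sum(amount),
                                    size = sum(size)), by = time][order(time)]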

Related

how to find which rows are related by mathematical difference of x in R

I have a data frame with about 20k IDs of chemical compounds and the corresponding molecular weights, something like this:
ID <- c(1,2,3,4,5)
MASS <- c(324,162,508,675,670)
d <- data.frame(ID, MASS)
  ID MASS
1  1  324
2  2  162
3  3  508
4  4  675
5  5  670
I would like to find a way to loop over the rows of the MASS column to find which masses are related by a difference (positive or negative) of 162 ± 0.5. Then I would like a new column (d$DIFF) where the IDs that are linked by a MASS difference of 162 ± 0.5 are reported, and 0 for the IDs where the condition is not met. In this example it would be something like this:
  ID MASS DIFF
1  1  324  1&2
2  2  162  1&2
3  3  508  3&5
4  4  675    0
5  5  670  3&5
Thanks in advance for any help
Here's a base R solution using outer:
d$DIFF <- unlist(lapply(
  apply(outer(d$MASS, d$MASS,
              function(x, y) abs(abs(x - y) - 162) < 0.5),
        1, which),
  function(x) if (length(x) == 0) return("0")
              else return(paste(x, collapse = " & "))))
This gives the result:
d
#>   ID MASS DIFF
#> 1  1  324    2
#> 2  2  162    1
#> 3  3  508    5
#> 4  4  675    0
#> 5  5  670    3
Note that in your example data, there is at most a single match to other rows, but if you apply this technique to your real data you should get multiple hits for some rows separated by "&" as requested.
You should also note that whatever way you do this in your real data, you will have to make approximately 20K * 20K (400 million) comparisons, so it may take some time to complete, and may result in memory issues depending on your set-up.
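If memory does become a problem, a hedged workaround (a sketch, not part of the original answer; block_size is an illustrative value) is to build the comparison matrix in row blocks instead of all at once:
block_size <- 1000   # illustrative block size
d$DIFF <- "0"
for (start in seq(1, nrow(d), by = block_size)) {
  idx <- start:min(start + block_size - 1, nrow(d))
  # compare this block of masses against all masses, same condition as above
  hits <- abs(abs(outer(d$MASS[idx], d$MASS, "-")) - 162) < 0.5
  d$DIFF[idx] <- vapply(seq_along(idx), function(i) {
    m <- which(hits[i, ])
    if (length(m) == 0) "0" else paste(m, collapse = " & ")
  }, character(1))
}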

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100           1             0
100           2             0
100           3             0
496           2             6
511           2            10
846           1             4
846           2             6
846           3             1
338           1             6
338           2             0
Now I require the output as:
KId sales_month quantity_sold result
100           1             0      1
100           2             0      1
100           3             0      1
496           2             6      1
511           2            10      1
846           1             4      1
846           2             6      1
846           3             1      0
338           1             6      1
338           2             0      1
Here the calculation works as follows: if the quantity sold in March (month 3) is less than 60% of the quantity sold in January (1) and February (2) combined, then the result should be 1, otherwise 0. I need a solution that performs this.
Thanks in advance.
If I understand correctly, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I can suggest the dplyr package, which offers a nice way of grouping rows and mutating columns in your data frame.
library(dplyr)
resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  mutate(monthMinus1Qty = lag(quantity_sold, 1),
         monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  mutate(result = ifelse(quantity_sold / previous2MonthsQty >= 0.6, 0, 1)) %>%
  select(KId, sales_month, quantity_sold, result)
The result is as below:
Adding
select(KId,sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate steps).
I believe this should satisfy your requirement. NAs in the result column are due to 0/0 division or no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add a year column and adjust the group_by() arguments accordingly.
For more information on the dplyr package, follow this link.
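For completeness, a minimal sketch that rebuilds the example input from the question (values copied from the table above), so the pipeline can be run as-is:
library(dplyr)
# example table from the question, as a data.frame
data <- data.frame(
  KId           = c(100, 100, 100, 496, 511, 846, 846, 846, 338, 338),
  sales_month   = c(  1,   2,   3,   2,   2,   1,   2,   3,   1,   2),
  quantity_sold = c(  0,   0,   0,   6,  10,   4,   6,   1,   6,   0)
)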

How to generalize this algorithm (sign pattern match counter)?

I have this code in R:
corr = function(x, y) {
  sx = sign(x)
  sy = sign(y)
  cond_a = sx == sy && sx > 0 && sy > 0
  cond_b = sx < sy  && sx < 0 && sy > 0
  cond_c = sx > sy  && sx > 0 && sy < 0
  cond_d = sx == sy && sx < 0 && sy < 0
  cond_e = sx == 0  || sy == 0
  if (cond_a) return('a')
  else if (cond_b) return('b')
  else if (cond_c) return('c')
  else if (cond_d) return('d')
  else if (cond_e) return('e')
}
Its role is to be used in conjunction with the mapply function in R in order to count all the possible sign patterns present in a time series. In this case the pattern has a length of 2 and all the possible tuples are: (+,+), (+,-), (-,+), (-,-)
I use the corr function this way:
> with(dt['AAPL'], table(mapply(corr, Return[-1], Return[-length(Return)])) /length(Return)*100)
         a          b          c          d          e
24.6129416 25.4466058 25.4863041 24.0174672  0.3969829
> dt["AAPL",list(date, Return)]
   symbol       date     Return
1:   AAPL 2014-08-29 -0.3499903
2:   AAPL 2014-08-28  0.6496702
3:   AAPL 2014-08-27  1.0987923
4:   AAPL 2014-08-26 -0.5235654
5:   AAPL 2014-08-25 -0.2456037
I would like to generalize the corr function to n arguments. This means that for every n I would have to write down all the conditions corresponding to all the possible n-tuples. Currently the best thing I can think of is to write a Python script that generates the code string in a loop, but there must be a way to do this properly. Do you have an idea of how to generalize the tedious condition writing? Maybe I could use expand.grid, but how would I do the matching then?
I think you're better off using rollapply(...) in the zoo package for this. Since you seem to be using quantmod anyway (which loads xts and zoo), here is a solution that does not use all those nested if(...) statements.
library(quantmod)
AAPL <- getSymbols("AAPL", auto.assign = FALSE)
AAPL <- AAPL["2007-08::2009-03"]   # AAPL during the crash...
Returns <- dailyReturn(AAPL)
get.patterns <- function(ret, n) {
  f <- function(x) {   # identifies which row of `patterns` matches sign(x)
    which(apply(patterns, 1, function(row) all(row == sign(x))))
  }
  returns  <- na.omit(ret)
  patterns <- expand.grid(rep(list(c(-1, 1)), n))
  labels   <- apply(patterns, 1, function(row) paste0("(", paste(row, collapse = ","), ")"))
  result   <- rollapply(returns, width = n, f, align = "left")
  data.frame(100 * table(labels[result]) / (length(returns) - (n - 1)))
}
get.patterns(Returns,n=2)
# Var1 Freq
# 1 (-1,-1) 22.67303
# 2 (-1,1) 26.49165
# 3 (1,-1) 26.73031
# 4 (1,1) 23.15036
get.patterns(Returns,n=3)
# Var1 Freq
# 1 (-1,-1,-1) 9.090909
# 2 (-1,-1,1) 13.397129
# 3 (-1,1,-1) 14.593301
# 4 (-1,1,1) 11.722488
# 5 (1,-1,-1) 13.636364
# 6 (1,-1,1) 13.157895
# 7 (1,1,-1) 12.200957
# 8 (1,1,1) 10.765550
The basic idea is to create a patterns matrix with 2^n rows and n columns, where each row represents one of the possible patterns (e.g. (1,1), (-1,1), etc.). Then pass the daily returns to this function n-wise using rollapply(...) and identify which row in patterns matches sign(x) exactly. Then use this vector of row numbers as an index into labels, which contains a character representation of the patterns, and finally use table(...) as you did.
This is general for an n-day pattern, but it ignores situations where any return is exactly zero, so the Freq column does not add up to exactly 100. As you can see, this doesn't happen very often.
It's interesting that even during the crash it was (very slightly) more likely to have two up days in succession, than two down days. If you look at plot(Cl(AAPL)) during this period, you can see that it was a pretty wild ride.
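To make the patterns/labels lookup inside get.patterns concrete, here is a hedged sketch of what it builds for n = 2 (expand.grid varies the first column fastest):
patterns <- expand.grid(rep(list(c(-1, 1)), 2))
labels <- apply(patterns, 1, function(row) paste0("(", paste(row, collapse = ","), ")"))
patterns
#   Var1 Var2
# 1   -1   -1
# 2    1   -1
# 3   -1    1
# 4    1    1
labels
# [1] "(-1,-1)" "(1,-1)"  "(-1,1)"  "(1,1)"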
This is a slightly different approach, but it may give you what you're looking for and allows you to use any size of n-tuple. The basic approach is to find the signs of the adjacent changes for each sequential set of n returns and convert the n-length sign changes into n-tuples of 1's and 0's, where 0 = negative return and 1 = positive return. Then calculate the decimal value of each n-tuple taken as a binary number. These numbers will clearly be different for each distinct n-tuple. Working with a zoo time series for these calculations provides several useful functions, including get.hist.quote() (from tseries) to retrieve stock prices, diff() to calculate returns, and rollapply() for calculating the n-tuples and their sums. The code below does these calculations, converts the sum of the sign changes back to an n-tuple of binary digits, and collects the results in a data frame.
library(zoo)
library(tseries)
n <- 3 # set size of n-tuple
#
# get stock prices and compute % returns
#
dtz <- get.hist.quote("AAPL", "2014-01-01", "2014-10-01", quote = "Close")
dtz <- merge(dtz, (diff(dtz, arithmetic = FALSE) - 1) * 100)
names(dtz) <- c("prices", "returns")
#
# calculate the sum of the sign changes
#
dtz <- merge(dtz, rollapply(data = (sign(dtz$returns) + 1) / 2, width = n,
                            FUN = function(x, y) sum(x * y), y = 2^(0:(n-1)), align = "right"))
dtz <- fortify.zoo(dtz)
names(dtz) <- c("date", "prices", "returns", "sum_sgn_chg")
#
# convert the sum of the sign changes back to an n-tuple of binary digits
# (2^(0:(n-1)) keeps the conversion consistent with the n set above)
#
for (i in 1:nrow(dtz))
  dtz$sign_chg[i] <- paste((as.numeric(dtz$sum_sgn_chg[i]) %/% 2^(0:(n-1))) %% 2, collapse = "")
#
# report first part of result
#
head(dtz, 10)
#
# report count of changes by month and type
#
table(format(dtz$date, "%Y %m"), dtz$sign_chg)
An example of possible output is a table showing the count of changes by type for each month.
         000 001 010 011 100 101 110 111 NANANA
 2014 01   1   3   3   2   3   2   2   2      3
 2014 02   1   2   4   2   2   3   2   3      0
 2014 03   2   3   0   4   4   1   4   3      0
 2014 04   2   3   2   3   3   2   3   3      0
 2014 05   2   2   1   3   1   2   3   7      0
 2014 06   3   4   3   2   4   1   1   3      0
 2014 07   2   1   2   4   2   5   5   1      0
 2014 08   2   2   1   3   1   2   2   8      0
 2014 09   0   4   2   3   4   2   4   2      0
 2014 10   0   0   1   0   0   0   0   0      0
This shows that in month 1, January 2014, there was one set of three days with pattern 000, indicating three down returns; 3 sets with pattern 001, indicating two down returns followed by one positive return; and so forth. Most months seem to have a fairly even distribution, but May and August show 7 and 8 sets of three consecutive positive returns, reflecting the fact that these were strong months for AAPL.
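As a quick, hedged sanity check of the digit decoding used above (with n = 3 as in the code): a sign-change sum of 5 decodes to the pattern 101, i.e. up, down, up, because the earliest day is the least significant bit:
# decode a sign-change sum back into its 0/1 digits (earliest day first), n = 3 as above
paste((5 %/% 2^(0:2)) %% 2, collapse = "")
# [1] "101"   i.e. up, down, up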

select records according to the difference between records R

I hope someone can suggest something for this "problem", because I really don't know how to proceed...
Well, my data are like this
data<-data.frame(site=c(rep("A",3),rep("B",3),rep("C",3)),time=c(100,180,245,5,55,130,70,120,160))
where time is in minutes.
I want to select only the records, for each site, for which the difference is more than 60, so the output should be like this:
out<-data[c(1:4,6,7,9),]
What I have tried so far: to get the differences I use this:
difference<-stack(tapply(data$time,data$site,diff))
but then I have no idea how to pick out the records that satisfy my condition...
If there is already a similar question, although I've searched for a while, I apologize for this.
To make things clear, as the definition of the difference was probably ambiguous: I need to select all the records (for each site) that are separated by at least 60 minutes, not only those that are strictly consecutive in time.
Specifically,
> out
  site time
1    A  100   # included because the difference between records 2 and 1 is > 60
2    A  180   # included because the difference between records 3 and 2 is > 60
3    A  245   # included because it is separated by 60 minutes from record 2
4    B    5   # included because the difference between records 6 and 4 is > 60
6    B  130   # included because it is separated by 60 minutes from record 4
7    C   70   # included because the difference between records 9 and 7 is > 60
9    C  160   # included because it is separated by 60 minutes from record 7
Maybe, to solve the "problem", it could be useful to consider the results of the difference, something like this:
> difference
  values ind
1     80   A   # include records 1 and 2
2     65   A   # include records 2 and 3
3     50   B   # include only record 4
4     75   B   # include record 6 because it is (50+75) > 60 min after record 4
5     50   C   # include only record 7
6     40   C   # include record 9 because it is (50+40) > 60 min after record 7
Thanks for the help.
data[ave(data$time, data$site, FUN = function(x){c(61, diff(x)) > 60}) == 1, ]
#   site time
# 1    A  100
# 2    A  180
# 3    A  245
# 4    B    5
# 6    B  130
# 7    C   70
Edit following updated question:
keep <- as.logical(ave(data$time, data$site, FUN = function(x){
c(TRUE, cumsum(diff(x)) > 60)
}))
data[keep, ]
#   site time
# 1    A  100
# 2    A  180
# 3    A  245
# 4    B    5
# 6    B  130
# 7    C   70
# 9    C  160
# Calculate the differences
data$diff <- unlist(by(data$time, data$site, function(x) c(NA, diff(x))))
# Subset the data
data[is.na(data$diff) | data$diff > 60, ]
Using plyr:
library(plyr)
ddply(data, .(site), function(x) x[c(TRUE, diff(x$time) > 60), ])
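For comparison, a hedged dplyr equivalent of the accepted cumsum-based filter (a sketch, not from the original answers):
library(dplyr)
data %>%
  group_by(site) %>%
  filter(c(TRUE, cumsum(diff(time)) > 60)) %>%  # same keep rule as the accepted answer
  ungroup()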

R counting the occurrences of similar rows of data frame

I have data in the following format called DF (this is just a made up simplified sample):
eval.num eval.count fitness fitness.mean green.h.0 green.v.0 offset.0    random
       1          1    1500         1500       100       120       40    232342
       2          2    1000         1250       100       120       40     11843
       3          3    1250         1250       100       120       40 981340234
       4          4    1000       1187.5       100       120       40   4363453
       5          1    2000         2000       200       100       40    345902
       6          1    3000         3000       150        90       10       943
       7          1    2000         2000        90        90      100   9304358
       8          2    1800         1900        90        90      100    284333
However, the eval.count column is incorrect and I need to fix it. It should report the number of rows with the same values for (green.h.0, green.v.0, and offset.0) by only looking at the previous rows.
The example above uses the expected values, but assume they are incorrect.
How can I add a new column (say "count") which will count all previous rows which have the same values of the specified variables?
I have gotten help on a similar problem of just selecting all rows with the same values for specified columns, so I supposed I could just write a loop around that, but it seems inefficient to me.
Ok, let's first do it in the easy case where you just have one column.
> data <- rep(sample(1000, 5), sample(5, 5))
> head(data)
[1] 435 435 435 278 278 278
Then you can just use rle to figure out the contiguous sequences:
> sequence(rle(data)$lengths)
[1] 1 2 3 1 2 3 4 5 1 2 3 4 1 2 1
Or altogether:
> head(cbind(data, sequence(rle(data)$lengths)))
[1,] 435 1
[2,] 435 2
[3,] 435 3
[4,] 278 1
[5,] 278 2
[6,] 278 3
For your case with multiple columns, there are probably a bunch of ways of applying this solution. Easiest might be to just paste the columns you care about together to form a single vector.
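A hedged sketch of that paste idea for the three columns the question cares about (assuming DF as above). Unlike sequence(rle(...)), which counts only contiguous runs, ave gives a running within-group count even when matching rows are not contiguous; it includes the current row, matching the eval.count column in the example (subtract 1 to count strictly previous rows only):
key <- paste(DF$green.h.0, DF$green.v.0, DF$offset.0, sep = "_")
# running count of rows seen so far with the same key, including the current row
DF$count <- ave(seq_along(key), key, FUN = seq_along)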
Okay, I used the answer I had from another question and worked out a loop that I think will work. This is what I'm going to use:
cmpfun2 <- function(r) {
  count <- 0
  if (r[1] > 1) {
    for (row in 1:(r[1]-1)) {
      if (all(r[27:51] == DF[row, 27:51, drop=FALSE])) { # compare to an earlier row
        count <- count + 1
      }
    }
  }
  return(count)
}
brows <- apply(DF[], 1, cmpfun2)
print(brows)
Please comment if I made a mistake and this won't work, but I think I've figured it out. Thanks!
I have a solution I figured out over time (sorry I haven't checked this in a while)
checkIt <- function(bind) {
  print(bind)
  cmpfun <- function(r) { all(r == heeds.data[bind, 23:47, drop=FALSE]) }
  brows <- apply(heeds.data[, 23:47], 1, cmpfun)
  # print(heeds.data[brows, c("eval.num","fitness","green.h.1","green.h.2","green.v.5")])
  print(nrow(heeds.data[brows, c("eval.num","fitness","green.h.1","green.h.2","green.v.5")]))
}
Note that heeds.data is my actual data frame, and I originally printed a few columns just to make sure that it was working (now commented out). Also, 23:47 covers the columns that need to be checked for duplicates.
Also, I really haven't learned as much R as I should so I'm open to suggestions.
Hope this helps!
