Row aggregation when values are close enough in a column - r

I have a dataframe with 2 columns
time x
1306247226 5
1306247236 10
1306248127 20
1306248187 36
1306249248 28
1306249258 24
1306249259 20
...
I'd like to aggregate the rows whose values in the 'time' column are close enough (e.g. their difference is less than 60) and sum their 'x' values in the aggregated row. The 'time' value of the aggregated row should be that of the first row in the group. ('time' is a Unix timestamp.)
The goal is to have as output of this example:
time x
1306247226 15
1306248127 20
1306248187 36
1306249248 72
...
The dataset is quite big, a 'for' loop will take a long time... but if it is the only option I can deal with it and wait.
Any idea?
Thanks a lot!

You can use something like this.
First, I create a new column to group by:
dat$gg <- cumsum(c(0,diff(dat$time)) > 60)
Then I use the plyr package to aggregate within each group:
library(plyr)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
gg time res
1 0 1306247226 15
2 1 1306248127 56
3 2 1306249248 72
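Edit after comment
The OP wants rows merged only when their difference is less than 60, so a gap of exactly 60 (as between 1306248127 and 1306248187) must start a new group. So I need to change the > to >=: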
dat$gg <- cumsum(c(0,diff(dat$time)) >= 60)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
gg time res
1 0 1306247226 15
2 1 1306248127 20
3 2 1306248187 36
4 3 1306249248 72
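For reference, the same grouping can also be done without plyr. The following is a minimal sketch in base R, assuming the data frame is called dat as in the answer above and holds the seven rows shown in the question; the result name res is only illustrative.

```r
# Example data copied from the question
dat <- data.frame(
  time = c(1306247226, 1306247236, 1306248127, 1306248187,
           1306249248, 1306249258, 1306249259),
  x    = c(5, 10, 20, 36, 28, 24, 20)
)

# A new group starts whenever the gap to the previous row is 60 or more
dat$gg <- cumsum(c(0, diff(dat$time)) >= 60)

# First timestamp and summed x per group, using base R only
res <- data.frame(
  time = as.vector(tapply(dat$time, dat$gg, function(v) v[1])),
  x    = as.vector(tapply(dat$x, dat$gg, sum))
)
res
```

Note that this compares each row with the previous row only, so a long run of closely spaced rows ends up in one group even if its first and last rows are more than 60 apart.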

Related

In R how do I combine two numeric columns to create minute:second?

I have a data frame that contains a minute column and a second column, both numeric.
I have been able to create a new column by combining the two values using:
preshot_time <- transform(preshot,time=interaction(minute,second,sep=':'))
However, I want to transform them into some sort of minute:second time signature, with the end goal of calculating the time difference in seconds between one row and the next.
I am relatively new to data manipulation in R so any help would be very welcome.
Thanks!
You can create a column representing total seconds, then use dplyr's lag function to calculate the difference from one row to the next.
set.seed(4669)
df <- data.frame(minutes = sample(0:5, size = 5),
seconds = sample(1:59, size = 5))
df$elapsedSeconds <- df$minutes * 60 + df$seconds
df$diff <- df$elapsedSeconds - dplyr::lag(df$elapsedSeconds)
df
minutes seconds elapsedSeconds diff
1 0 27 27 NA
2 4 2 242 215
3 5 12 312 70
4 1 45 105 -207
5 3 30 210 105
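If you would rather not depend on dplyr, the row-to-row difference can probably be written with base R's diff as well. This is only a sketch: it reuses the minutes and seconds values from the printed output above, and the label column is a hypothetical extra for display.

```r
# Values taken from the printed output above
df <- data.frame(minutes = c(0, 4, 5, 1, 3),
                 seconds = c(27, 2, 12, 45, 30))

df$elapsedSeconds <- df$minutes * 60 + df$seconds

# diff() returns one fewer element than its input, so pad the front with NA
df$diff <- c(NA, diff(df$elapsedSeconds))

# Optional minute:second label for display, e.g. "4:02"
df$label <- sprintf("%d:%02d", as.integer(df$minutes), as.integer(df$seconds))
df
```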

r - lapply divides a column by an integer value from different dataset, unexpected result

I have two data.frames: one with genotype counts and one with the numbers I need to normalize the counts in the first data set.
countsdata=data.frame(genotype1=rep(c(10,20,30,40),each=1),
genotype2=rep(c(100,200,300,400),each=1),
genotype3=rep(c(40,50,60,70),each=1),
genotype4=rep(c(40,50,60,70),each=1)
)
coldata = data.frame(Group =c('genotype1', 'genotype2', 'genotype3', 'genotype4'),
Treatment = rep(c("control","treated"),each = 2),
Norm=rep(c(1,2,5,5)))
I made sure my variables don't have factors
factorsCharacter <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
as.character))
coldata=factorsCharacter(coldata)
Then I checked that lapply loops through my counts one column at a time, and through my coldata that contains the normalization value (Norm). Everything looked good until I combined the two actions in the same step:
> lapply(coldata['Group'],function(group_i){group_i})
$Group
[1] "genotype1" "genotype2" "genotype3" "genotype4"
> lapply(coldata['Group'],function(group_i){countsdata[,group_i]})
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 20 200 50 50
3 30 300 60 60
4 40 400 70 70
> lapply(coldata['Group'],function(group_i){as.integer(coldata[coldata$Group==group_i,'Norm'])})
$Group
[1] 1 2 5 5
> lapply(coldata['Group'],function(group_i){
+ countsdata[,group_i]/as.integer(coldata[coldata$Group==group_i,'Norm'])
+ })
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 10 100 25 25
3 6 60 12 12
4 8 80 14 14
Here the result is not what I was expecting (dividing each column by its normalization number). After further inspection I noticed it's normalizing by rows, in other words it's normalizing across different columns, which shouldn't be the case as I am looping through one column at a time. I am probably missing a basic concept, but looking through other SO posts I didn't find anything I could use. My goal is to fix the code to make the right calculation, but I would also like to understand why the code above is not working. Thanks so much.
The problem is in using [ and not [[. So, instead of looping through each of the elements in 'Group' column, we have a list of length 1 with all the elements. So, either use coldata[, 'Group'] or coldata[['Group']] or coldata$Group for looping.
countsdataNew <- countsdata
countsdataNew[] <- lapply(coldata[['Group']],function(group_i)
countsdata[,group_i]/coldata$Norm[coldata$Group==group_i])
countsdataNew
# genotype1 genotype2 genotype3 genotype4
#1 10 50 8 8
#2 20 100 10 10
#3 30 150 12 12
#4 40 200 14 14
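To see the difference between [ and [[ directly, it may help to count how many elements lapply actually iterates over. A small sketch, assuming a cut-down coldata with just the Group and Norm columns; calls_single and calls_double are illustrative names.

```r
coldata <- data.frame(Group = c("genotype1", "genotype2", "genotype3", "genotype4"),
                      Norm  = c(1, 2, 5, 5),
                      stringsAsFactors = FALSE)

# Single brackets keep the data frame wrapper: a list with ONE element
length(coldata["Group"])
# Double brackets extract the column itself: a vector with FOUR elements
length(coldata[["Group"]])

# lapply() calls the function once per element of what it is given
calls_single <- length(lapply(coldata["Group"], identity))
calls_double <- length(lapply(coldata[["Group"]], identity))
```

With single brackets the function receives all four group names at once, which is why the division in the question recycled the four Norm values down the rows.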
If the column names in 'countsdata' and the 'Group' column from 'coldata' are in the same order, we can do this easily with Map
Map(`/`, countsdata, coldata$Norm)
Or just replicate the 'Norm' and do a simple division
countsdata/coldata$Norm[col(countsdata)]
Or with sweep
sweep(countsdata, 2, coldata$Norm, "/")
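The Map, replication, and sweep variants all rely on the rows of coldata lining up with the columns of countsdata. If that is not guaranteed, one option is to look up each divisor by name first. A sketch under that assumption; the shuffled coldata below is hypothetical, and norm_by_col and normalized are illustrative names.

```r
countsdata <- data.frame(genotype1 = c(10, 20, 30, 40),
                         genotype2 = c(100, 200, 300, 400),
                         genotype3 = c(40, 50, 60, 70),
                         genotype4 = c(40, 50, 60, 70))

# Hypothetical coldata whose rows are NOT in column order
coldata <- data.frame(Group = c("genotype3", "genotype1", "genotype4", "genotype2"),
                      Norm  = c(5, 1, 5, 2),
                      stringsAsFactors = FALSE)

# Look up each column's divisor by name, then divide column-wise
norm_by_col <- coldata$Norm[match(names(countsdata), coldata$Group)]
normalized  <- sweep(countsdata, 2, norm_by_col, "/")
normalized
```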

Subsetting an integer vector based on a vector of corresponding dates

Elementary question:
I'm trying to subset a vector of a data frame based on a vector of dates that correspond with the vector that I wish to subset. Consider the following data frame as an example:
Date Time Axis1 Day Sum.A1.Daily
1 6/12/10 5:00:00 20 1 NA
2 6/12/10 5:01:00 40 1 NA
3 6/12/10 5:02:00 50 1 NA
4 6/13/10 5:03:00 10 2 NA
5 6/13/10 5:04:00 20 2 NA
6 6/13/10 5:05:00 30 2 NA
I want to fill the column to the right with the sum of values for each day. Basically, (1:3,5) should = 110, and (4:6,5) should = 60.
I know there are many ways to do this that are smarter/faster/better than what I'm attempting to do (e.g., my date variable is a factor split into "levels" that I don't know how to access), but I'm trying to build my skills from the ground up, and want to figure out how to:
Take a subset of data$Axis1 that will only grab the values for the 1st day
Take a subset of the values of data$Axis1 that will only grab the values for the 2nd day
Sum the values for each day, and place them in column 5, overwriting the "NA"
I successfully performed a function similar to this to auto-fill-in the "Day" vector, which was originally full of "NA" values (below). But I'm getting stuck as I think about how to a) subset with dates, and b) sum while subsetting.
Thanks in advance for your help - also, let me know if my question could be clearer/I'm violating cardinal stackoverflow rules. I'm very new to R and the coding community in general; I appreciate your help!
dates <- c("6/12/10", "6/13/10")
counts <- 1:2
x <- nrow(data)
for (i in 1:x) {
  # compare the row's date against each known date
  for (j in seq_along(dates)) {
    if (data[i, 1] == dates[j]) {
      data[i, 4] <- counts[j]
    }
  }
}
Using ave :
transform(dat,Sum.A1.Daily=ave(dat$Axis1,dat$Date,FUN=sum))
Date Time Axis1 Day Sum.A1.Daily
1 6/12/10 5:00:00 20 1 110
2 6/12/10 5:01:00 40 1 110
3 6/12/10 5:02:00 50 1 110
4 6/13/10 5:03:00 10 2 60
5 6/13/10 5:04:00 20 2 60
6 6/13/10 5:05:00 30 2 60
Another way would be using data.table
#Let's say df is your dataset
library(data.table)
dt = as.data.table(df)
dt = dt[, Sum.A1.Daily := sum(Axis1), by = Date]
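Since the question asks how to do the subsetting and summing by hand, here is a step-by-step sketch in base R. It assumes the data frame is called data, as in the question's loop, and that Date is stored as character; only the columns needed for the sum are reproduced.

```r
# Example data copied from the question (Date kept as character)
data <- data.frame(
  Date  = c("6/12/10", "6/12/10", "6/12/10", "6/13/10", "6/13/10", "6/13/10"),
  Axis1 = c(20, 40, 50, 10, 20, 30),
  Sum.A1.Daily = NA,
  stringsAsFactors = FALSE
)

for (d in unique(data$Date)) {
  on_day <- data$Date == d                               # TRUE for rows on this day
  data$Sum.A1.Daily[on_day] <- sum(data$Axis1[on_day])   # sum the subset, fill those rows
}
data
```

The logical vector on_day does both jobs asked about: it selects the Axis1 values for one day and marks the rows where the sum should be written.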

Avoid using a loop to get sum of rows in R, where I want to start and stop the sum on different columns for each row

I am relatively new to R from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, stop value, and 100+ columns of numerical values. The goal is to get the sum of each row from the column that corresponds to the start value to the column that corresponds to the stop value. This is direct enough to do in a loop, that looks like this (data.frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
df$out[i] <- rowSums(df[i,df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
If you are dealing with values that are all of the same type, you typically want to do things with matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=TRUE)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=TRUE))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row; the function
# drops the first two values (start and end) and sums the
# remaining values from position `start` to position `end`
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run the apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that you are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column) in one go, which you can then cbind back to your data frame (and that means you modify your data frame just once).
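The "modify the data frame just once" idea can also be sketched with vapply, which builds the whole result as a plain numeric vector before the data frame is touched. This is an illustration, not a benchmark: the sizes, the seed, and the names vals, start, end, and out are assumptions, and start/end here index the value columns directly.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
rows <- 1000
cols <- 100

vals  <- matrix(runif(rows * cols), nrow = rows)   # numeric data only
start <- sample(seq_len(cols), rows, replace = TRUE)
end   <- pmin(cols, start + sample(seq_len(cols / 2), rows, replace = TRUE))

# Build the whole result as a numeric vector first ...
out <- vapply(seq_len(rows),
              function(i) sum(vals[i, start[i]:end[i]]),
              numeric(1))

# ... then assign it to the data frame in a single step
df <- data.frame(start = start, end = end, vals)
df$out <- out
```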

How to extract certain rows

As you can see, I have Price and Day columns below
Price Day
2 1
5 2
8 3
11 4
14 5
17 6
20 7
23 8
26 9
29 10
32 11
35 12
38 13
41 14
44 15
47 16
50 17
53 18
56 19
59 20
I then want the output below
Difference Day
12 5
15 10
15 15
15 20
So now I have the difference in prices every 5 days. It basically subtracts the 1st day's price from the 5th day's, then the 5th day's from the 10th day's, etc.
I already wrote code that separates my data into 5-day intervals, but I want code that subtracts the 1st day from the 5th day, the 5th day from the 10th day, etc.
So the code should look something like this
difference<-tapply(Price[,1],Day, ____________)
So Price[,1] will be my Price data, while "Day" is the variable I created to separate my Day data into 5-day intervals. I'm thinking that in the blank I could put a function or another variable that subtracts the 1st day's price from the 5th day's, then the 5th day's from the 10th day's, etc. You don't have to help me separate my days into intervals, just with the "difference" part. Thanks guys
Here's one option, assuming your data.frame is called "SODF":
within(SODF[c(1, seq(5, nrow(SODF), 5)), ], {
Price <- diff(c(0, Price))
})[-1, ]
# Price Day
# 5 12 5
# 10 15 10
# 15 15 15
# 20 15 20
The first step is basic subsetting. According to your description and expected answer, you want the first row, and then every fifth row starting from row 5:
> SODF[c(1, seq(5, nrow(SODF), 5)), ]
Price Day
1 2 1
5 14 5
10 29 10
15 44 15
20 59 20
From there, you can use diff on the "Price" column, but since diff returns a vector one element shorter than its input, you need to "pad" the input vector, which I did with diff(c(0, Price)).
# Correct values, but the number of rows needs to be 5
> diff(SODF[c(1, seq(5, nrow(SODF), 5)), "Price"])
[1] 12 15 15 15
Then, the [-1, ] at the end just deletes the extraneous row.
Update
As @geektrader points out in the comments below (thanks!), as an alternative to using:
SODF[c(1, seq(5, nrow(SODF), 5)), ]
as your input data.frame, you may consider using the following instead:
rbind(SODF[1, ], SODF[SODF$Day %% 5 == 0, ])
The difference in the two approaches is that the first approach simply subsets by row number, while the second approach subsets according to the value in the "Day" column, extracting rows where "Day" is a multiple of 5. This second approach might be useful, for instance, when there are missing rows in the dataset.
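A small sketch of how the two subsetting approaches diverge when a row is missing. It assumes the data from the question is stored as SODF and deliberately drops the row for Day 10; gappy, by_position, and by_value are illustrative names.

```r
# Data from the question, with the row for Day 10 deliberately removed
SODF  <- data.frame(Price = seq(2, 59, by = 3), Day = 1:20)
gappy <- SODF[SODF$Day != 10, ]

# Subsetting by row position now drifts off the intended days
by_position <- gappy[c(1, seq(5, nrow(gappy), 5)), "Day"]

# Subsetting by the Day value keeps the multiples of 5 that are present
by_value <- rbind(gappy[1, ], gappy[gappy$Day %% 5 == 0, ])$Day

by_position
by_value
```

In this case the position-based subset picks days 1, 5, 11 and 16, while the value-based subset picks days 1, 5, 15 and 20 and simply skips the missing day 10.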
Ananda's is a nice approach (always forget about within myself). Here's another approach:
dat2 <- dat[seq(0, nrow(dat), by=5), ]
data.frame(Difference=diff(c(dat[1,1], dat2[, 1])), Day=dat2[, 2])
Here is a solution if you have a matrix as input.
The function below takes a matrix m, a column col_id, and a numeric interval interv. For every interv-th row, it subtracts from the value in column col_id the value at the previously selected row of the same column (the first row for the first interval, then interv rows back).
The results are returned in a new column called diff, appended as the last column of the selected rows of m.
In short, the approach is very similar to the one used by @Ananda Mahto.
So, this is the function:
subtract_column <- function(m, col_id, interv) {
select <- c(1, seq(interv, nrow(m), interv))
cbind(m[select[-1], ], diff = diff(m[select, col_id]))
}
Example:
# this emulates your data as a matrix
price_vect <- c(2,5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50,53,56,59)
day_vect <- 1:20
matr <- do.call(cbind, list(price = price_vect, day = day_vect))
# and this calls the function above and does the job:
# for every 5th row, subtracts the previously selected value (row 1, then 5 rows back) from the current value in column `price` of matrix `matr`
subtract_column(matr, 'price', 5)
Output:
price day diff
[1,] 14 5 12
[2,] 29 10 15
[3,] 44 15 15
[4,] 59 20 15
