I have a data frame in which I want to find the reuse length of (x,y) pairs. Can someone suggest the quickest way to analyze it? For example:
df <- data.frame(
time=c(0,1,2,3,4,5,6),
x=c(1,4,2,1,6,1,4),
y=c(2,5,3,2,7,2,5)
)
I want the average or median of the re-occurrence interval of the same (x,y).
Here, (1,2) repeats at times 0, 3, 5, so its average interval = ((3-0) + (5-3))/2 = 2.5.
And the average for (4,5) is 5.
So the overall average is 3.75.
Can someone suggest how to do this?
Thanks.
Perhaps you're looking for something like this:
out <- aggregate(time ~ x + y, df, function(blah) {
  mean(diff(blah))
})
out
#   x y time
# 1 1 2  2.5
# 2 2 3  NaN
# 3 4 5  5.0
# 4 6 7  NaN
sum(out$time, na.rm=TRUE)
# [1] 7.5
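If the overall average the question asks for (3.75 in the example) is wanted rather than the total, take the mean of the per-group values instead:
mean(out$time, na.rm=TRUE)
# [1] 3.75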
A data.table approach:
library(data.table)
DT <- data.table(df, key = "x,y")
DT[, mean(diff(time)), by = key(DT)][, sum(V1, na.rm=TRUE)]
# [1] 7.5
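The same swap of sum for mean applies to the data.table one-liner if the overall average is wanted:
DT[, mean(diff(time)), by = key(DT)][, mean(V1, na.rm=TRUE)]
# [1] 3.75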
I have this data and would like to categorize each number into bins spanning every n values (n = 2 in the example below). Using cut or cut_interval is not "cutting" it. Any suggestions are very much welcome. Thanks!
haves <- data.frame(
some_vector = c(1,2,2,3,4,5,6,7,8)
)
haves$category <- cut_interval(haves$some_vector, n = 2)
haves
wants <- data.frame(
some_vector = c(1,2,2,3,4,5,6,7,8)
,category = c(1,1,1,2,2,3,3,4,4)
)
wants
This should do it (for positive numbers):
cut_interval <- function(x, n) ceiling(x / n)
cut_interval(haves$some_vector, n=2)
# [1] 1 1 1 2 2 3 3 4 4
Anyway, cut() should be able to cut it (with improvements from @Henrik so it generalises):
cut(haves$some_vector, c(-Inf, seq(2, max(haves$some_vector), by = 2), Inf), labels = FALSE)
# [1] 1 1 1 2 2 3 3 4 4
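To get the wants layout from the question, the helper above can be assigned straight back into haves (note that this custom cut_interval() masks ggplot2::cut_interval if that package is attached):
haves$category <- cut_interval(haves$some_vector, n = 2)
haves$category
# [1] 1 1 1 2 2 3 3 4 4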
I'd like to sum rows two by two, in order to study the lag of a certain variable.
Suppose that I have the following data frame:
> df
  SE eggs
1  4  2.0
2  6  4.0
3  7 10.0
4  8  0.5
5  5  1.0
6  1  3.0
7  2  6.0
8  3  9.0
So I expect to obtain the following, where eggs is the sum of the eggs values for each pair of "SE" indexes:
> df
    SE2  eggs
  "4+5"   3.0
  "6+7"  14.0
  "8+1"   3.5
  "2+3"  15.0
Where
df = data.frame(SE=c(4,6,7,8,5,1,2,3),eggs = c(2,4,10,0.5,1,3,6,9))
Note: the order of the data frame doesn't matter, but I need to start from a certain number (in this case 4), then take the next number (in this case 5), and keep that logic: after SE 4+5 comes SE 6+7, then SE 8+1, then SE 2+3...
Any hint on how I can do that?
I think I get the logic. You want ascending numbers starting from 4. When these numbers reach 8 (or whatever the maximum value of SE is), they wrap around back to one and continue to ascend until all the numbers are used up.
You then group these numbers into sequential pairs.
For each pair of numbers, you find the rows of your data frame with the matching values of SE. These rows contain the two values of eggs you wish to sum.
df = data.frame(SE=c(4,6,7,8,5,1,2,3),eggs = c(2,4,10,0.5,1,3,6,9))
first <- 4
i <- match(df$SE, c(first:nrow(df), seq(first - 1)))
groups <- ((seq_along(i) + 1) %/% 2)[i]
do.call(rbind, lapply(split(df, groups), function(x) {
  data.frame(SE = paste(x$SE, collapse = "+"), eggs = sum(x$eggs))
}))
#>    SE eggs
#> 1 4+5  3.0
#> 2 6+7 14.0
#> 3 8+1  3.5
#> 4 2+3 15.0
Created on 2020-02-17 by the reprex package (v0.3.0)
Match c(4:8, 1:3) to SE, use the match indexes to index into eggs, reshape into a 2x4 matrix, and sum each column.
k <- 4 # starting index
nr <- nrow(df) # no of rows in df
with(df, colSums(matrix(eggs[match(c(k:nr, seq_len(k-1)), SE)], 2)))
## [1] 3.0 14.0 3.5 15.0
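If the "4+5"-style labels from the expected output are also wanted, the same match() ordering can be reused to build them (a small sketch alongside the one-liner above):
idx <- with(df, match(c(k:nr, seq_len(k - 1)), SE))
data.frame(SE2  = apply(matrix(df$SE[idx], 2), 2, paste, collapse = "+"),
           eggs = colSums(matrix(df$eggs[idx], 2)))
##   SE2 eggs
## 1 4+5  3.0
## 2 6+7 14.0
## 3 8+1  3.5
## 4 2+3 15.0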
Another option, just a slight variation on my comment where we re-arrange the rows according to the specified logic and then aggregate every two rows:
aggregate(
eggs ~ ceiling(seq_along(SE)/2),
FUN = sum,
data = df[with(df, order(factor(SE, levels = c(seq(SE[1], max(SE)), SE[!SE %in% seq(SE[1], max(SE))])))),]
)[, -1]
[1] 3.0 14.0 3.5 15.0
Or, if you'd like to keep the SE in the specified format:
df <- aggregate(
. ~ ceiling(seq_along(SE)/2),
FUN = paste, collapse = '+',
data = df[with(df, order(factor(SE, levels = c(seq(SE[1], max(SE)), SE[!SE %in% seq(SE[1], max(SE))])))),]
)[, -1]
df$eggs <- sapply(df$eggs, function(x) eval(parse(text = x)))
Output:
df
   SE eggs
1 4+5  3.0
2 6+7 14.0
3 8+1  3.5
4 2+3 15.0
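If eval(parse()) feels heavier than needed, here is a small sketch of the same idea that sums eggs numerically before formatting (it assumes the original eight-row df from the question, not the aggregated df just created above):
ord <- with(df, order(factor(SE, levels = c(seq(SE[1], max(SE)),
                                            SE[!SE %in% seq(SE[1], max(SE))]))))
df2 <- df[ord, ]                        # rows re-arranged to SE = 4, 5, 6, 7, 8, 1, 2, 3
grp <- ceiling(seq_along(df2$SE) / 2)   # pair up consecutive rows
data.frame(SE = tapply(df2$SE, grp, paste, collapse = "+"),
           eggs = tapply(df2$eggs, grp, sum))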
(Edited because I'm a doofus: with replacement, not without.)
I have a large-ish (>500k rows) dataset with 421 groups, defined by two grouping variables. Sample data as follows:
df<-data.frame(group_one=rep((0:9),26), group_two=rep((letters),10))
head(df)
  group_one group_two
1         0         a
2         1         b
3         2         c
4         3         d
5         4         e
6         5         f
...and so on.
What I want is some number (k = 12 at the moment, but that number may vary) of stratified samples, by membership in (group_one x group_two). Membership in each group should be indicated by a new column, sample_membership, which has a value of 1 through k (again, 12 at the moment). I should be able to subset by sample_membership and get up to 12 distinct samples, each of which is representative when considering group_one and group_two.
Final data set would thus look something like this:
  group_one group_two sample_membership
1         0         a                 1
2         0         a                12
3         0         a                 5
4         1         a                 5
5         1         a                 7
6         1         a                 9
Thoughts? Thanks very much in advance!
Maybe something like this?:
library(dplyr)
df %>%
  group_by(group_one, group_two) %>%
  mutate(sample_membership = sample(1:12, n(), replace = FALSE))
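Given the edit to the question (with replacement, not without), the same call with replace = TRUE works regardless of group size; replace = FALSE only works while no group has more than 12 rows:
df %>%
  group_by(group_one, group_two) %>%
  mutate(sample_membership = sample(1:12, n(), replace = TRUE))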
Here's a one-line data.table approach, which you should definitely consider if you have a long data.frame.
library(data.table)
setDT(df)
df[, sample_membership := sample.int(12, .N, replace=TRUE), keyby = .(group_one, group_two)]
df
#      group_one group_two sample_membership
#   1:         0         a                 9
#   2:         0         a                 8
#   3:         0         c                10
#   4:         0         c                 4
#   5:         0         e                 9
#  ---
# 256:         9         v                 4
# 257:         9         x                 7
# 258:         9         x                11
# 259:         9         z                 3
# 260:         9         z                 8
For sampling without replacement, use replace=FALSE, but as noted elsewhere, make sure no group has more than k members. Or:
If you want "sampling without unnecessary replacement" (making this term up -- not sure what the right terminology is here) because you have more than k members in some groups but still want the k sample memberships within each group to be as evenly sized as possible, you could do something like:
# example with bigger groups
k <- 12L
big_df <- data.frame(group_one=rep((0:9),260), group_two=rep((letters),100))
setDT(big_df)
big_df[, sample_round := rep(1:.N, each=k, length.out=.N), keyby = .(group_one, group_two)]
big_df[, sample_membership := sample.int(k, .N, replace=FALSE), keyby = .(group_one, group_two, sample_round)]
head(big_df, 15) # you can see first repeat does not occur until row k+1
Within each "sampling round" (first k observations in the group, second k observations in the group, etc.) there is sampling without replacement. Then, if necessary, the next sampling round makes all k assignments available again.
This approach stratifies the sample as evenly as possible (perfectly even is only possible if each group has a multiple of k members).
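A quick way to eyeball that evenness (a sketch, assuming big_df from above): within any one group, the membership counts should differ by at most one.
big_df[group_one == 0 & group_two == "a", table(sample_membership)]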
Here is a base R method that assumes your data.frame is sorted by the grouping variables:
# get number of observations for each group
groupCnt <- with(df, aggregate(group_one, list(group_one, group_two), FUN=length))$x
# for reproducibility, set the seed
set.seed(1234)
# get sample by group
df$sample <- c(sapply(groupCnt, function(i) sample(12, i, replace=TRUE)))
Untested example using dplyr; if it doesn't work, it might at least point you in the right direction.
library( dplyr )
set.seed(123)
df <- data.frame(
  group_one = as.integer(runif(1000, 1, 6)),
  group_two = sample(LETTERS[1:6], 1000, TRUE)
) %>%
  group_by(group_one, group_two) %>%
  mutate(
    sample_membership = sample(seq(1, length(group_one)), length(group_one), FALSE)
  )
Good luck!
I am trying to group a column of my data.frame/data.table into three groups, all with equal sums.
The data is first ordered from smallest to largest, such that group one would be made up of a large number of rows with small values, and group three would have a small number of rows with large values. This is accomplished in spirit with:
test <- data.frame(x = as.numeric(1:100000))
store <- 0
total <- sum(test$x)
for (i in 1:100000) {
  store <- store + test$x[i]
  if (store < total/3) {
    test$y[i] <- 1
  } else {
    if (store < 2*total/3) {
      test$y[i] <- 2
    } else {
      test$y[i] <- 3
    }
  }
}
While successful, I feel like there must be a better way (and maybe a very obvious solution that I am missing).
My concerns with this approach:
1. I never like resorting to loops, especially with nested ifs, when a vectorized approach is available; even with 100,000+ records this code becomes quite slow.
2. This method would become impossibly complex to code for a larger number of groups (not necessarily the looping, but the ifs).
3. It requires pre-ordering of the column. I might not be able to get around this one.
4. As a nuance (not that it makes a difference), the data to be summed would not always (or ever) be consecutive integers.
Maybe with cumsum:
test$z <- cumsum(test$x) %/% (ceiling(sum(test$x) / 3)) + 1
This is more or less a bin-packing problem.
Use the binPack function from the BBmisc package:
library(BBmisc)
test$bins <- binPack(test$x, sum(test$x)/3+1)
The sums of the 3 bins are nearly identical:
tapply(test$x, test$bins, sum)
         1          2          3 
1666683334 1666683334 1666683332
I thought the cumsum/modulo-division approach was very elegant, but it does return a somewhat irregular allocation:
> tapply(test$x, test$z, sum)
         1          2          3 
1666636245 1666684180 1666729575
> sum(test$x)/3
[1] 1666683333
So I thought I would first create a random permutation and offer something similar:
test$x <- sample(test$x)
test$z2 <- cumsum(test$x)[ findInterval(cumsum(test$x),
c(0, 1666683333*(1:2), sum(test$x)+1))]
> tapply(test$x, test$z2, sum)
     91099     116379     129539 
1666676164 1666686837 1666686999
This also achieves a more even distribution of counts:
> table(test$z2)
 91099 116379 129539 
 33245  33235  33520 
> table(test$z)
    1     2     3 
57734 23915 18351
I must admit to puzzlement regarding the naming of the entries in z2.
Or you can just cut() the cumsum:
test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3)
or use ggplot2::cut_interval instead of cut:
test$z <- cut_interval(cumsum(test$x), n = 3, labels = 1:3)
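Either way, a quick check of how even the three sums come out (assuming the test$z column assigned just above):
tapply(test$x, test$z, sum)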
You can use fold() from groupdata2 and get an almost equal number of elements per group:
# Create data frame
test <- data.frame(x = as.numeric(1:100000))
# Use fold() to create 3 numerically balanced groups
test <- groupdata2::fold(test, k = 3, num_col = "x")
# Look at the first 10 rows
head(test, 10)
## # A tibble: 10 x 2
## # Groups: .folds [3]
##        x .folds
##    <dbl> <fct>
##  1     1 1
##  2     2 3
##  3     3 2
##  4     4 1
##  5     5 2
##  6     6 2
##  7     7 1
##  8     8 3
##  9     9 2
## 10    10 3
# Check the sum and number of elements per group
library(dplyr)  # for the pipe
test %>%
  dplyr::group_by(.folds) %>%
  dplyr::summarize(sum_ = sum(x),
                   n_members = dplyr::n())
## # A tibble: 3 x 3
##   .folds       sum_ n_members
##   <fct>       <dbl>     <int>
## 1 1      1666690952     33333
## 2 2      1666716667     33334
## 3 3      1666642381     33333
I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
  ID eventcounter numberofevents
1  A            1              3
2  A            2              3
3  A            3              3
4  B            1              2
5  B            2              2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If the numberofevents == 3, it is supposed to calculate sum(1:3), equally to 3 + 2 + 1 = 6.
If the numberofevents == 2, it is supposed to calculate sum(1:2) equally to 2 + 1 = 3.
Working with a large set of data, I thought it might be convenient to create this additional vector
by using the sum function in R, d$z <- sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
Numerical expression has x elements: only the first is used.
You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]
Try using the apply, sapply, or lapply functions in R.
sapply(numberofevents, function(x) sum(1:x))
It works for me.
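Since sum(1:n) is just the n-th triangular number n*(n+1)/2, a fully vectorised alternative (a sketch, not one of the answers above) avoids building the intermediate sequences entirely:
d$z <- d$numberofevents * (d$numberofevents + 1) / 2
d$z
# [1] 6 6 6 3 3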