Aggregate odd/even pairs - r

I am trying to simplify a large dataset (52k+ rows) by finding the maximum value for every two-week interval. I have already assigned a week number to every row and used the aggregate() function to find the maximum value for each week.
Simplified sample data:
week <- c(1:5, 5, 7:10)
conc <- rnorm(mean=50, sd=20, n=10)
df <- data.frame(week,conc)
aggregate(df, by=list(week), FUN=max)
However, I am stuck on how to further aggregate based on two-week intervals (ex: weeks 1&2, weeks 3&4...). It's not as simple as combining every other row since every week was sampled.
I'm assuming there's a simple solution, I just haven't found it yet.
Thanks!

week <- c(1:5, 5, 7:10)
bi_week <- (week+1)%/%2
conc <- rnorm(mean=50, sd=20, n=10)
df <- data.frame(week,bi_week,conc)
aggregate(df, by=list(bi_week), FUN=max)
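The integer division maps weeks 1 and 2 to group 1, weeks 3 and 4 to group 2, and so on:
(week + 1) %/% 2
## [1] 1 1 2 2 3 3 4 4 5 5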

Use pracma::ceil to grab each bi-weekly pair
library(pracma)
aggregate(df, by=list(ceil(df$week/2)), FUN=max)
Output
  Group.1 week     conc
1       1    2 76.09191
2       2    4 50.20154
3       3    5 54.93041
4       4    8 69.17820
5       5   10 74.67518
ceil(df$week/2)
# 1 1 2 2 3 3 4 4 5 5
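Base R's ceiling() gives the same grouping, so the pracma dependency is optional here:
aggregate(df, by = list(ceiling(df$week / 2)), FUN = max)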

library(purrr)
library(dplyr)
Odds <- seq(1, max(week), by = 2)
Evens <- seq(2, max(week), by = 2)
map2(.x = Odds, .y = Evens, .f = function(x, y) {
  df %>% filter(week == x | week == y) %>% select(conc) %>% max
})
I first made vectors of odd and even numbers. Then, using the purrr package, I fed these pairwise (1 & 2, then 3 & 4, etc.) into a function that uses the dplyr package to keep just the relevant weeks, select the conc values, and take the max.
Here is the output:
> map2(.x=Odds,.y=Evens, .f=function(x,y) {df %>% filter(week==x | week==y) %>% select(conc) %>% max})
[[1]]
[1] 68.38759
[[2]]
[1] 56.9231
[[3]]
[1] 77.23965
[[4]]
[1] 49.39443
[[5]]
[1] 49.38465
Note: you could use map2_dbl in place of map2 and get a numeric vector instead.
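For example, a minimal sketch of that variant (using pull() instead of select() so the function returns a bare number):
map2_dbl(.x = Odds, .y = Evens, .f = function(x, y) {
  df %>% filter(week == x | week == y) %>% pull(conc) %>% max()
})
# returns a plain numeric vector of five maxima, one per two-week pair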

Related

R- Summarize every column using a dataframe of weights in dplyr?

Say I have a data frame of numeric values and a second data frame of numeric values that serve as weights, built like this:
Monday <- c(1, 1, 10)
Tuesday <- c(1, 2, 3)
df <- data.frame(Monday, Tuesday)
Monday <- c(10, 10, 1)
Tuesday <- c(1, 1, 1)
df_weights <- data.frame(Monday, Tuesday)
How can I summarize each column of the first data frame using weighted mean with the corresponding column in the second data frame as a source of the values for the weights?
In addition, I would like both the mean and the weighted mean in a single data frame; how could I summarize_all with two functions to do that?
Is it something like this?
library(dplyr)
library(Hmisc)
bind_cols(df, rename_all(df_weights, function(x) paste0(x, ".wt"))) %>%
  summarise(Monday = wtd.mean(Monday, w = Monday.wt),
            Tuesday = wtd.mean(Tuesday, w = Tuesday.wt))
## Monday Tuesday
##1 1.428571 2
Or possibly something more general without dplyr:
Map(function(x) wtd.mean(df[[x]],w=df_weights[[x]]),colnames(df))
## $Monday
## [1] 1.428571
##
## $Tuesday
## [1] 2
Getting the mean and the weighted mean together is a little more tricky, but purrr can help to generalize the previous answer. I don't know if the structure of the result matches your needs:
purrr::map_dfr(colnames(df),
               function(x) list(column = x,
                                mean = mean(df[[x]]),
                                wmean = wtd.mean(df[[x]], w = df_weights[[x]])))
## # A tibble: 2 x 3
## column mean wmean
## <chr> <dbl> <dbl>
##1 Monday 4 1.43
##2 Tuesday 2 2
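On the summarize_all side of the question: with newer dplyr (1.0+) you can get both statistics per column in a single summarise() call via across() and cur_column(). A sketch, using base weighted.mean() in place of Hmisc::wtd.mean():
library(dplyr)
df %>%
  summarise(across(everything(),
                   list(mean = mean,
                        wmean = ~ weighted.mean(.x, w = df_weights[[cur_column()]]))))
##   Monday_mean Monday_wmean Tuesday_mean Tuesday_wmean
## 1           4     1.428571            2             2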

R Removing duplicate columns based on variability

I have a very large dataset with multiple duplicated column names (the values within the columns are different). I would like to remove the columns with duplicate names and lower variability. My problem is I have too many of these duplicate variables to do this manually. One path I am trying is to use read.csv(), which automatically adds '.1' to the duplicate column name, then make a vector of the variability of all the columns and try to work with that.
df<-data.frame("A"=c(1,5,10), "A.1"=c(2,2,2), "C"=c(1,5,10), "C.1"=c(2,2,2), "C.2"=c(2,5,10))
v<-lapply(df, function(x) var(x))
Is there a way to filter out duplicates based on variability when I am importing the dataset? Again, the biggest problem is that I have too many duplicates to do this manually. Thanks in advance!
Combining techniques from base R and tidyverse:
# calculate variance for each column
dvar <- apply(df, 2, var)
library(tidyverse)
# create data frame with column names
# "grouped" column names and variance
# find column with highest variance
keep_names <- data.frame(names = names(dvar),
                         grouping = gsub("[[:punct:]][0-9]", "", names(dvar)),
                         vals = dvar) %>%
  group_by(grouping) %>%
  slice_max(vals) %>%
  pull(names)
# pull data
df[keep_names]
# A C
# 1 1 1
# 2 5 5
# 3 10 10
df <- data.frame("A"=c(1,5,10), "A.1"=c(2,2,2), "C"=c(1,5,10), "C.1"=c(2,2,2), "C.2"=c(2,5,10))
df["var",] <-apply(df, 2, var)
nrow(df)
df <- df[,which(df["var",] < 20)]
df
Imagine "20" being your threshold. I used apply here an appended it to the dataframe.
A.1 C.1 C.2
1 2 2 2.00000
2 2 2 5.00000
3 2 2 10.00000
var 0 0 16.33333
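A variant of the same thresholding idea that avoids writing the variances into the data frame itself (a sketch; this one keeps the columns at or above the threshold, i.e. the higher-variability ones the question asks for):
df <- data.frame("A"=c(1,5,10), "A.1"=c(2,2,2), "C"=c(1,5,10), "C.1"=c(2,2,2), "C.2"=c(2,5,10))
df[, apply(df, 2, var) >= 20]  # keep only columns whose variance reaches the threshold
##    A  C
## 1  1  1
## 2  5  5
## 3 10 10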
I would like to propose a new version using exponential search and data.table power. This is a function I implemented in the dataPreparation package.
The function dataPreparation::which_are_in_double (the package also provides which_are_bijection):
which_are_in_double(df)
This returns 3 and 4, the columns that are duplicated in your example.
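If you then want to drop the flagged columns, a sketch (assuming, as the output of 3 and 4 above suggests, that the function returns column indices):
library(dataPreparation)
dup_cols <- which_are_in_double(df, verbose = FALSE)   # indices of duplicated columns
df_clean <- df[, setdiff(seq_along(df), dup_cols)]     # keep every column that is not flagged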
Build a data set with wanted dimensions for performance tests
df<-data.frame("A"=c(1,5,10), "A.1"=c(2,2,2), "C"=c(1,5,10), "C.1"=c(2,2,2), "C.2"=c(2,5,10))
for (i in 1:20) {
  df <- rbind(df, df)
}
This results in a data.frame of size (3,145,728, 5).
The benchmark
To perform the benchmark, I use the rbenchmark library, which repeats each computation 100 times:
benchmark(
which_are_in_double(df, verbose=FALSE),
apply(df, 2, var) == 0
)
test replications elapsed relative
2 apply(df, 2, var) == 0 100 38.298 3.966
1 which_are_in_double(df, verbose = FALSE) 100 9.656 1.000
So which_are_in_double is about 4 times faster than the other proposed solution. The nice thing is that the bigger the data.frame, the bigger the performance gain.

How to sum every nth (200) observation in a data frame using R [duplicate]

This question already has answers here:
calculating mean for every n values from a vector
(3 answers)
Closed 4 years ago.
I am new to R so any help is greatly appreciated!
I have a data frame of 278,800 observations for each of my 10 variables, and I am trying to create an 11th variable that sums every 200 observations (or rows) of a specific variable/column (rows 1:200, then 201:400, then 401:600, etc.), similar to the OFFSET function in Excel.
I have tried subsetting my data to just the variable of interest with the aim of adding a new variable that continuously sums every 200 rows, however I cannot figure it out. I understand my new "variable" will produce 1,394 data points (278,800/200). I have tried to use the rollapply function, however the output does not sum in blocks of 200; it sums rows 1:200, then 2:201, then 3:202, etc.
Thanks,
E
rollapply has a by= argument for that. Here is a smaller example using n = 3 instead of n = 200. Note that 1+2+3=6, 4+5+6=15, 7+8+9=24 and 10+11+12=33.
# test data
DF <- data.frame(x = 1:12)
library(zoo)
n <- 3
rollapply(DF$x, n, sum, by = n)
## [1] 6 15 24 33
First let's generate some data and get a label for each group:
library(tidyverse)
df <-
rnorm(1000) %>%
as_tibble() %>%
mutate(grp = floor(1 + (row_number() - 1) / 200))
> df
# A tibble: 1,000 x 2
value grp
<dbl> <dbl>
1 -1.06 1
2 0.668 1
3 -2.02 1
4 1.21 1
...
1000 0.78 5
This creates 1000 random N(0,1) variables, turns it into a data frame, and then adds an incrementing numeric label for each group of 200.
df %>%
group_by(grp) %>%
summarize(grp_sum = sum(value))
# A tibble: 5 x 2
grp grp_sum
<dbl> <dbl>
1 1 9.63
2 2 -12.8
3 3 -18.8
4 4 -8.93
5 5 -25.9
Then we just need to do a group-by operation on the second column and sum the values. You can use the pull() operation to get a vector of the results:
df %>%
group_by(grp) %>%
summarize(grp_sum = sum(value)) %>%
pull(grp_sum)
[1] 9.62529 -12.75193 -18.81967 -8.93466 -25.90523
I created a vector a with 278,800 observations:
a <- rnorm(278800)
b <- NULL  # initialize the column of interest
j <- 1
for (i in seq(1, length(a), by = 200)) {
  b[j] <- sum(a[i:(i + 199)])  # b is your column of interest
  j <- j + 1
}
View(b)
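A vectorized base-R alternative (not from the original answer) that produces the same 1,394 block sums: label each row with its block number and sum within blocks.
a <- rnorm(278800)
block <- ceiling(seq_along(a) / 200)    # 1,1,...,1, 2,2,... (200 rows per block)
b <- as.numeric(tapply(a, block, sum))  # one sum per block
length(b)
## [1] 1394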

Retrieving unique combinations [duplicate]

So I currently face a problem in R that I know exactly how to deal with in Stata, but have wasted over two hours trying to accomplish in R.
Using the data.frame below, I want to obtain exactly the first observation per group, where groups are formed by multiple variables and rows are sorted by another variable, i.e. the data.frame mydata obtained by:
id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)
Should be transformed to:
id day value
1 1 10
1 2 15
1 3 20
2 1 40
2 2 30
3 2 22
3 3 24
4 1 11
4 2 11
4 3 12
That is, keep only one of the rows that share a duplicate group identifier (here that is only row[1]: (id, day) = (1, 1)), sorting by value first so that the row with the lowest value is kept.
In Stata, this would simply be:
bys id day (value): keep if _n == 1
I found a piece of code on the web which properly does that if I first produce a single group identifier:
mydata$id1 <- paste(mydata$id,"000",mydata$day, sep="") ### the single group identifier
myid.uni <- unique(mydata$id1)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
  temp <- subset(mydata, id1 == myid.uni[i])
  if (dim(temp)[1] > 1) {
    last.temp <- temp[dim(temp)[1], ]
  } else {
    last.temp <- temp
  }
  last <- rbind(last, last.temp)
}
last
However, there are a few problems with this approach:
1. A single identifier needs to be created (which is quickly done).
2. It seems like a cumbersome piece of code compared to the single line of code in Stata.
3. On a medium-sized dataset (below 100,000 observations grouped in lots of about 6), this approach would take about 1.5 hours.
Is there any efficient equivalent to Stata's bys var1 var2: keep if _n == 1 ?
The dplyr package makes this kind of thing easier.
library(dplyr)
mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)
Note that this command requires more memory in R than in Stata: in R, a new copy of the dataset is created while in Stata, rows are deleted in place.
I would order the data.frame first, at which point you can look into using by:
mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]
do.call(rbind, by(mydata, list(mydata$id, mydata$day),
FUN=function(x) head(x, 1)))
Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:
library(data.table)
DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
# id day value
# 1: 1 1 10
# 2: 1 2 15
# 3: 1 3 20
# 4: 2 1 40
# 5: 2 2 30
# 6: 3 2 22
# 7: 3 3 24
# 8: 4 1 11
# 9: 4 2 11
# 10: 4 3 12
Or, starting from scratch, you can use data.table in the following way:
DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]
And, by extension, in base R:
Ranks <- with(mydata, ave(value, id, day, FUN = function(x)
rank(x, ties.method="first")))
mydata[Ranks == 1, ]
Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:
library(data.table)
mydata <- data.table(mydata)
mydata <- mydata[, .SD[1], by = .(id, day)]
Using dplyr with magrittr pipes:
library(dplyr)
mydata <- mydata %>%
group_by(id, day) %>%
slice(1) %>%
ungroup()
If you don't add ungroup() at the end, dplyr's grouping structure will still be present and might mess up some of your subsequent operations.
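For example, with the grouping still in place a follow-up mutate() is evaluated within each (id, day) group, so a "mean" ends up being each row's own value rather than the overall mean (a small illustration, not part of the original answer):
mydata %>%
  group_by(id, day) %>%
  slice(1) %>%
  mutate(m = mean(value))    # per-group mean: equals value on every row

mydata %>%
  group_by(id, day) %>%
  slice(1) %>%
  ungroup() %>%
  mutate(m = mean(value))    # overall mean of the kept rows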

How to find the largest range from a series of numbers using R?

I have a data set where length and age correspond to individual items (ID #); there are 4 different items, as you can see in the data set below.
range(dataset$length)
gives me the overall range of the length for all items. But I need to compare ranges to determine which item (ID #) has the largest range in length relative to the other 3.
length age ID #
   3.5   5    1
     7  10    1
    10  15    1
     4   5    2
     8  10    2
    13  15    2
     3   5    3
     7  10    3
     9  15    3
     4   5    4
     5  10    4
     7  15    4
This gives you the width of the range (max minus min) for each ID:
lapply(with(dataset, tapply(length, ID, range)), diff)
And you can wrap which.max around that list to give you the ID associated with the largest value:
which.max(lapply(with(dataset, tapply(length, ID, range)), diff))
2
2
In base R:
mins <- tapply(dataset$length, dataset$ID, min)
maxs <- tapply(dataset$length, dataset$ID, max)
unique(dataset$ID)[which.max(maxs - mins)]
group_by in dplyr may be helpful:
library(dplyr)
dataset %>%
  group_by(ID) %>%
  summarize(ID_range = max(length) - min(length))
The above code is equivalent to the following (it's just written with %>%):
library(dplyr)
dataset <- group_by(dataset, ID)
summarize(dataset, ID_range = max(length) - min(length))
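To then pull out just the ID with the largest range, one option (a sketch, assuming dplyr >= 1.0) is slice_max():
dataset %>%
  group_by(ID) %>%
  summarize(ID_range = max(length) - min(length)) %>%
  slice_max(ID_range, n = 1)
## ID 2, whose lengths span 4 to 13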
An easy approach which doesn't use dplyr, though perhaps less elegant, is the which function.
range(dataset$length[which(dataset$ID == 1)])
range(dataset$length[which(dataset$ID == 2)])
range(dataset$length[which(dataset$ID == 3)])
range(dataset$length[which(dataset$ID == 4)])
You could also make a function that gives you the actual range (the difference between the max and the min) and use lapply to show you the IDs paired with their ranges.
largest_range <- function(id){
  rbind(id,
        max(dataset$length[which(dataset$ID == id)]) -
          min(dataset$length[which(dataset$ID == id)]))
}
lapply(X = unique(dataset$ID), FUN = largest_range)
