For example, I have a dataset called data with the columns date, min, max, and avg, and 366 rows in total.
I want to sum every seven rows to get the total value of min, e.g. rows 1-7, 8-14, and so on. How can I do this?
If you create a grouping column that increments every 7 rows, you can apply any of the answers from How to sum a variable by group.
Here's how you can do it in base R.
set.seed(123)
df <- data.frame(Date = Sys.Date() - 365:0, min = rnorm(366), max = runif(366))
df$group <- ceiling(seq(nrow(df))/7)
aggregate(min~group, df, sum)
# group min
#1 1 3.1438325
#2 2 -0.3022263
#3 3 -1.0769539
#4 4 -3.2934430
#5 5 2.8419110
#...
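If you only need the per-group totals as a named vector, tapply() (one of the approaches covered in the linked question) should give the same result:
# same grouping column as above; returns a named vector of weekly sums
tapply(df$min, df$group, sum)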
This is a solution based on {tidyverse}, in particular using {dplyr} for the main operations and {lubridate} for formatting your dates.
First, simulate some data, since you have not provided a reproducible dataset. I take the year 2020, which has 366 days; obviously adapt this to your problem.
For the min, max, and avg columns, let's generate some random numbers. Again, adapt this to your needs.
library(dplyr) # for general data frame crunching
library(lubridate) # to coerce date-time
data <- data.frame(
    date = seq(from = lubridate::ymd("2020-01-01")
               , to = lubridate::ymd("2020-12-31"), by = 1)
    , min = sample(x = 1:10, size = 366, replace = TRUE)
    , max = sample(x = 10:15, size = 366, replace = TRUE)) %>%
    dplyr::mutate(avg = mean(min + max)) # note: a single overall mean, recycled to every row
To group your data, inject a binning / grouping variable.
The following is a generic "every 7th row" bin based on integer division (%/%).
If you want to group by calendar weeks, etc., check out the {lubridate} documentation; you can extract some useful bits from dates for this (see the short sketch after the output below). Or insert any other binning you need.
data <- data %>%
    mutate(bin = c(0, (1:(nrow(data) - 1)) %/% 7))
This yields:
> data
# A tibble: 366 x 5
# Groups: bin [53]
date min max avg bin
<date> <int> <int> <dbl> <dbl>
1 2020-01-01 2 11 18.2 0
2 2020-01-02 6 14 18.2 0
3 2020-01-03 7 13 18.2 0
4 2020-01-04 6 15 18.2 0
5 2020-01-05 3 10 18.2 0
6 2020-01-06 5 12 18.2 0
7 2020-01-07 5 12 18.2 0
8 2020-01-08 7 13 18.2 1
9 2020-01-09 8 11 18.2 1
10 2020-01-10 5 10 18.2 1
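As an aside, if you would rather bin by ISO calendar week than by fixed 7-row blocks, a minimal sketch (assuming {lubridate} is loaded as above) could be:
data_wk <- data %>%
    mutate(bin = lubridate::isoweek(date)) # ISO week number of each date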
We can now summarise our grouped data.
For this you use the bin variable to group your data, and then summarise() to perform aggregations on these groups. Based on your question, the following sums the min values; substitute whatever summary function you need:
data %>%
group_by(bin) %>%
summarise(tot_min = sum(min))
# A tibble: 53 x 2
bin tot_min
<dbl> <int>
1 0 34
2 1 31
3 2 35
4 3 44
5 4 40
6 5 50
7 6 46
8 7 38
9 8 33
10 9 21
# ... with 43 more rows
Assign the result as you see fit, or convert it to whatever type of output you need.
If you want to combine these weekly totals with your original data dataframe, look at left_join() (joining on bin) rather than bind_rows().
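For instance, a short sketch of attaching the weekly totals back onto the daily rows (using the column names from above):
weekly <- data %>%
    group_by(bin) %>%
    summarise(tot_min = sum(min))

data_with_totals <- data %>%
    left_join(weekly, by = "bin") # adds tot_min to every daily row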
I'm new here, so maybe my question is difficult to understand. I have some data with date information, and I need to group the mean of the data into year ranges. But these year ranges are overlapping (non-excluding): for example, my first range is 2013-2015, then 2014-2016, then 2015-2017, etc. I think this could be done with a loop and dplyr, but I don't know how to do it. I'll be very thankful if someone can help me.
Thank you,
Alejandro
What I tried was like:
for (i in Year){
Year_3=c(i, i+1, i+2)
db %>% group_by(Year_3)
#....etc
}
As you note, each observation would be used in multiple groups, so one approach could be to make copies of your data accordingly:
df <- data.frame(year = 2013:2020, value = 1:8)
library(dplyr)
df %>%
  tidyr::uncount(3, .id = "grp") %>%
  mutate(group_start = year - grp + 1,
         group_name = paste0(group_start, "-", group_start + 2)) %>%
  group_by(group_name) %>%
  summarise(value = mean(value),
            n = n())
# A tibble: 10 × 3
group_name value n
<chr> <dbl> <int>
1 2011-2013 1 1
2 2012-2014 1.5 2
3 2013-2015 2 3
4 2014-2016 3 3
5 2015-2017 4 3
6 2016-2018 5 3
7 2017-2019 6 3
8 2018-2020 7 3
9 2019-2021 7.5 2
10 2020-2022 8 1
Or we might take a more algebraic approach, noting that the sum of a three-year period is the cumulative amount two years ahead minus the cumulative amount from the prior year. This approach excludes the partial ranges.
df %>%
  mutate(cuml = cumsum(value),
         value_3yr = (lead(cuml, n = 2) - lag(cuml, default = 0)) / 3)
year value cuml value_3yr
1 2013 1 1 2
2 2014 2 3 3
3 2015 3 6 4
4 2016 4 10 5
5 2017 5 15 6
6 2018 6 21 7
7 2019 7 28 NA
8 2020 8 36 NA
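As an aside, for the complete three-year windows the same rolling mean can be computed directly, e.g. with zoo::rollmean() (this assumes the zoo package is available; align = "left" makes each value the mean of that year and the two following years):
library(zoo)

df %>%
  mutate(value_3yr = zoo::rollmean(value, k = 3, fill = NA, align = "left"))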
I have a dataset containing data on 6k assets and their market prices.
I want to compute the daily returns, hence apply the formula
(marketprice[i] - marketprice[i-1]) / marketprice[i-1]
The problem is that I have multiple observations for the same datetime; for example, for asset x I have 3 observations on day t because it was traded by investors 1, 2 and 3. The same holds for every asset in the dataset.
So my dataset can look like:
investor asset datetime marketprice
1 x t 10
2 x t 10
3 x t 10
My idea was to use something like
res <- res %>%
arrange(datetime) %>%
group_by(asset) %>%
mutate(ret = (marketprice - dplyr::lag(marketprice))/dplyr::lag(marketprice, default = NA)) %>%
ungroup()
but it doesn't work since, in the example above, row 2 would use marketprice[i-1] from the same day, while I want the previous day's price [t-1] to be used (not included in the example dataset).
Furthermore, R should check that the [i-1] marketprice does not belong to a date more than 4 days away: if the date of row i is the 10th of July, the computation should apply only if the date of [i-1] is the 6th of July or later.
Any idea?
This is based on the following assumptions, as I understand them:
When a day is repeated for the same asset, the marketprice is the same, no matter the investor.
You don't mind which investor it was (so we can remove rows).
When day (t) is 5 or more days after the previous one (t-1), an NA output is OK.
Libraries and some example data:
library(lubridate)
library(tidyverse)
# Data example
set.seed(132) # reproducibility
example = data.frame(
  investor = c(rep(1, 3), 2, 3, rep(2, 2), 1,
               rep(2, 4), rep(3, 4)),
  asset = c(rep('A', 8),
            rep('B', 8)),
  datetime = c(today() + c(1, 2, 3, 3, 3, 4, 5, 6),
               today() + c(1, seq(6, 9), seq(16, 18))),
  marketprice = c(10, 20, 30, 30, 30, sample(c(10, 20, 30), 11, replace = TRUE))
)
The example dataset has 2 assets. The first (A) shows how the code deals with several rows for the same day. The second (B) shows how it deals with a jump in dates greater than 4 days.
> example
investor asset datetime marketprice
1 1 A 2022-05-26 10
2 1 A 2022-05-27 20
3 1 A 2022-05-28 30
4 2 A 2022-05-28 30
5 3 A 2022-05-28 30
6 2 A 2022-05-29 30
7 2 A 2022-05-30 30
8 1 A 2022-05-31 30
9 2 B 2022-05-26 20
10 2 B 2022-05-31 10
11 2 B 2022-06-01 20
12 2 B 2022-06-02 10
13 3 B 2022-06-03 10
14 3 B 2022-06-10 30
15 3 B 2022-06-11 20
16 3 B 2022-06-12 10
Dplyr code:
# The formula is [price(t)-price(t-1)]/price(t-1) -> dif(price)/lag(price)
ret = example %>%
  group_by(asset, datetime) %>%
  slice(1) %>% # remove repeated dates
  group_by(asset) %>%
  arrange(datetime) %>%
  mutate(ret = ifelse(datetime - lag(datetime) > 4,
                      NA,
                      (marketprice - lag(marketprice)) / lag(marketprice))
  ) %>% # ifelse checks the difference in days
  arrange(asset, datetime) # show by assets and dates
Output:
# A tibble: 14 x 5
# Groups: asset [2]
investor asset datetime marketprice ret
<dbl> <chr> <date> <dbl> <dbl>
1 1 A 2022-05-26 10 NA
2 1 A 2022-05-27 20 1
3 1 A 2022-05-28 30 0.5
4 2 A 2022-05-29 30 0
5 2 A 2022-05-30 30 0
6 1 A 2022-05-31 30 0
7 2 B 2022-05-26 20 NA
8 2 B 2022-05-31 10 NA
9 2 B 2022-06-01 20 1
10 2 B 2022-06-02 10 -0.5
11 3 B 2022-06-03 10 0
12 3 B 2022-06-10 30 NA
13 3 B 2022-06-11 20 -0.333
14 3 B 2022-06-12 10 -0.5
Two rows were dropped because one day had 3 entries of data.
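A hedged alternative to the group_by(asset, datetime) %>% slice(1) step is distinct(), which keeps the first row per asset-day directly (same result under the assumptions above):
ret = example %>%
  distinct(asset, datetime, .keep_all = TRUE) %>% # one row per asset-day
  group_by(asset) %>%
  arrange(datetime) %>%
  mutate(ret = ifelse(datetime - lag(datetime) > 4,
                      NA,
                      (marketprice - lag(marketprice)) / lag(marketprice)))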
Consider the following data frame obtained after a cbind operation on two lists
> fl
x meanlist
1 1 48.5
2 2 32.5
3 3 28.0
4 4 27.0
5 5 25.5
6 6 20.5
7 7 27.0
8 8 24.0
class_median <- list(0, 15, 25, 35, 45)
class_list <- list(0:10, 10:20, 20:30, 30:40, 40:50)
The values in class_median represent the classes -10 to +10, 10 to 20, 20 to 30, etc.
Firstly, I am trying to group the values in fl$meanlist according to the classes in class_list. Secondly, I am trying to return, per class, the one value closest to the class median, as follows:
> fl_subset
x meanlist cm
1 1 48.5 45
2 2 32.5 35
3 5 25.5 25
I have tried using loops to compare, but the code becomes long and unmanageable, and the result is not correct.
Here's an approach with dplyr:
library(dplyr)
# do a little prep--name classes, extract breaks, put medians in a data frame
names(class_list) = letters[seq_along(class_list)]
breaks = c(min(class_list[[1]]), sapply(class_list, max))
med_data = data.frame(median = unlist(class_median), class = names(class_list))
fl %>%
# assign classes
mutate(class = cut(meanlist, breaks = breaks, labels = names(class_list))) %>%
# get medians
left_join(med_data) %>%
# within each class...
group_by(class) %>%
# keep the row with the smallest absolute difference to the median
slice(which.min(abs(meanlist - median))) %>%
# sort in original order
arrange(x)
# Joining, by = "class"
# # A tibble: 3 x 4
# # Groups: class [3]
# x meanlist class median
# <int> <dbl> <fct> <dbl>
# 1 1 48.5 e 45
# 2 2 32.5 d 35
# 3 5 25.5 c 25
One approach utilizing purrr and dplyr could be:
map2(.x = class_list,
     .y = class_median,
     ~ fl %>%
       mutate(cm = between(meanlist, min(.x), max(.x))) %>%
       filter(any(cm)) %>%
       mutate(cm = cm * .y)) %>%
  bind_rows(.id = "ID") %>%
  group_by(ID) %>%
  slice(which.min(abs(meanlist - cm)))
ID x meanlist cm
<chr> <int> <dbl> <dbl>
1 3 5 25.5 25
2 4 2 32.5 35
3 5 1 48.5 45
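A base R sketch of the same idea (assuming fl is a data frame with the columns x and meanlist shown above) could use findInterval() for the binning; note that findInterval() puts boundary values in the upper class, unlike cut():
breaks <- c(0, sapply(class_list, max))              # 0 10 20 30 40 50
medians <- unlist(class_median)                      # one median per class
fl$cm <- medians[findInterval(fl$meanlist, breaks)]  # class median for each row
# keep, per class, the row whose meanlist is closest to that median
fl_subset <- do.call(rbind,
  lapply(split(fl, fl$cm), function(d) d[which.min(abs(d$meanlist - d$cm)), ]))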
I am trying to rearrange a dataset with a few thousand observations (to eventually use the drm function in package DRC), and I am tired of doing it in Excel. Within a dataframe I am looking to add "start" and "end" times (up to Inf) based on the intervals found in a vector within the df. This means I end up adding an observation (row) where the last "end" time is Inf. For that last row (the one with Inf) I ALSO need to subtract the total of "value" from an arbitrary number (in my example below this would be 50). All this grouped by two variables ("Name" and "Rep" in my example). I am hoping there is a solution using group_by, but honestly I'll be overjoyed at any solution!
I have a data set that looks like this:
# data
names<-c(rep("Luke",30), rep("Han", 30), rep("Leia", 30), rep("OB1", 30))
reps<-c(rep("A", 10), rep("B", 10), rep("C", 10))
time<-rep(seq(1:10), 4)
value<-rep(sample(0:5,10,replace=T), 4)
df<-data.frame(names, reps, time, value)
but need it to look like this:
[Image in original post: example of the data structure I need.]
I'm at a loss. Please help!
If I have understood you correctly, we can do
library(dplyr)
df1 <- df %>%
  group_by(names, reps) %>%
  mutate(start = lag(time, default = 0),
         end = time)

bind_rows(df1, df1 %>%
            group_by(names, reps) %>%
            summarise(start = last(time),
                      end = Inf,
                      value = sum(value))) %>%
  select(-time) %>%
  arrange(names, reps)
# names reps value start end
# <fct> <fct> <int> <dbl> <dbl>
# 1 Han A 2 0 1
# 2 Han A 2 1 2
# 3 Han A 1 2 3
# 4 Han A 1 3 4
# 5 Han A 3 4 5
# 6 Han A 2 5 6
# 7 Han A 0 6 7
# 8 Han A 2 7 8
# 9 Han A 2 8 9
#10 Han A 5 9 10
#11 Han A 20 10 Inf
#.....
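If, as in the question, the value in the final (Inf) row should be 50 minus the group total rather than the total itself, only the summarise() line needs to change; a sketch:
bind_rows(df1, df1 %>%
            group_by(names, reps) %>%
            summarise(start = last(time),
                      end = Inf,
                      value = 50 - sum(value))) %>% # arbitrary 50 minus the group sum
  select(-time) %>%
  arrange(names, reps)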
We can do this in data.table by shifting 'time' (appending Inf at the end of 'time') to create 'end', and taking 50 minus the sum of 'value' for the last 'value', after grouping by 'names' and 'reps'.
library(data.table)
setDT(df)[, {stL <- last(time)
             enL <- Inf
             vL <- 50 - sum(value)
             .(start = c(shift(time, fill = 0), stL),
               end = c(time, enL),
               value = c(value, vL))}, .(names, reps)]
# names reps start end value
# 1: Luke A 0 1 0
# 2: Luke A 1 2 3
# 3: Luke A 2 3 3
# 4: Luke A 3 4 4
# 5: Luke A 4 5 0
# ---
#128: OB1 C 6 7 3
#129: OB1 C 7 8 0
#130: OB1 C 8 9 2
#131: OB1 C 9 10 5
#132: OB1 C 10 Inf 27
I am trying to manually calculate the variance (and mean) from categorical rating count data.
Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)
Data <- data.frame(Item=Item, Never=cat1,Rarely=cat2,Occasionally=cat3, Sometimes=cat4,Frequently=cat5,Usually=cat6,Always=cat7,stringsAsFactors=FALSE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always
1 A 4 NA 17 10 3 2 7
2 B 12 10 5 12 21 14 NA
3 C 17 20 12 17 NA 12 18
4 D NA 15 6 NA 16 20 23
Each categorical rating has an equivalent numeric value (1:7). I have calculated the average numerical rating for each Item as follows:
Rating_wt <- 1:7 # Vector of weights for each frequency rating
Rating.wt.mat <- rep(Rating_wt,each=dim(Data[,2:8])[1])
Data$Avg_rating <- rowSums(Data[,2:8]*Rating.wt.mat,na.rm=TRUE)/rowSums(Data[,2:8],na.rm=TRUE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg_rating
1 A 4 NA 17 10 3 2 7 3.976744
2 B 12 10 5 12 21 14 NA 3.837838
3 C 17 20 12 17 NA 12 18 3.739583
4 D NA 15 6 NA 16 20 23 5.112500
I would like to also calculate the variance for each Average and store that as a new variable in Data.
I believe I need to subtract the Average for each item from each numeric rating and multiply that value by the count in each respective cell, then sum those results across rows, then divide by the total counts in each row.
But, I can't figure out how to set up the element-wise calculations to accomplish that.
Conceptually, I think it should be something like this:
Data$Rating_var <- rowSums((Numeric_Rating - Avg_rating)*Value, na.rm=TRUE)/rowSums(Data[,2:8], na.rm=TRUE)
Where Numeric_Rating corresponds to Rating_wt:
Never = 1
Rarely = 2
Occasionally = 3
Sometimes = 4
Frequently = 5
Usually = 6
Always = 7
and Value is the corresponding cell for each Numeric_Rating by Item intersection.
I'd suggest you try to reshape your dataset before you apply your calculations, as it will be easier.
library(dplyr)
library(tidyr)
Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)
Data <- data.frame(Item=Item, Never=cat1,Rarely=cat2,Occasionally=cat3, Sometimes=cat4,Frequently=cat5,Usually=cat6,Always=cat7,stringsAsFactors=FALSE)
Data %>%
  gather(category, value, -Item) %>% # reshape dataset
  mutate(Rating = recode(category, "Never" = 1, "Rarely" = 2, "Occasionally" = 3,
                         "Sometimes" = 4, "Frequently" = 5,
                         "Usually" = 6, "Always" = 7)) %>% # assign rating
  group_by(Item) %>% # for each item
  mutate(Avg = sum(Rating*value, na.rm = TRUE) / sum(value, na.rm = TRUE), # calculate Avg
         variance = sum(abs(Rating - Avg)*value, na.rm = TRUE) / sum(value, na.rm = TRUE)) %>% # calculate Variance using the Avg
  ungroup() %>% # forget the grouping
  select(-Rating) %>% # no need for the rating any more
  spread(category, value) %>% # reshape back to original form
  select_(.dots = c(names(Data), "Avg", "variance")) # get columns in the desired order
# # A tibble: 4 x 10
# Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg variance
# * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 4 NA 17 10 3 2 7 3.976744 1.326122
# 2 B 12 10 5 12 21 14 NA 3.837838 1.530314
# 3 C 17 20 12 17 NA 12 18 3.739583 1.879991
# 4 D NA 15 6 NA 16 20 23 5.112500 1.529062
Try to run the piped process step by step to see how it works, especially if you're not familiar with the dplyr and tidyr syntax.
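One caveat, hedged: sum(abs(Rating - Avg)*value) / sum(value) is the count-weighted mean absolute deviation rather than the variance in the usual sense. If you want the textbook (population, count-weighted) variance, square the deviations instead; for example, a summarised version of the same pipeline:
Data %>%
  gather(category, value, -Item) %>%
  mutate(Rating = recode(category, "Never" = 1, "Rarely" = 2, "Occasionally" = 3,
                         "Sometimes" = 4, "Frequently" = 5,
                         "Usually" = 6, "Always" = 7)) %>%
  group_by(Item) %>%
  summarise(Avg = sum(Rating*value, na.rm = TRUE) / sum(value, na.rm = TRUE),
            variance = sum((Rating - Avg)^2*value, na.rm = TRUE) / sum(value, na.rm = TRUE))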