Group by a variable in dataframe R - r

I have a dataframe like below,
Date      cat  cam  reg  per
22-01-05  A     60  120   50
22-01-05  B     20  100   20
22-01-08  A     30  150   20
22-01-08  B     30  100   30
But I want something like below,
Date      cam  reg  per
22-01-05   80  220  14.5
22-01-08   60  250  24
How to get this using R?

I am not sure why your expected per values are like that, but maybe you want the following:
df <- data.frame(Date = c("22-01-05", "22-01-05", "22-01-08", "22-01-08"),
                 cat = c("A", "B", "A", "B"),
                 cam = c(60, 20, 30, 30),
                 reg = c(120, 100, 150, 100),
                 per = c(50, 20, 20, 30))
library(dplyr)
df %>%
  group_by(Date) %>%
  summarise(cam = sum(cam),
            reg = sum(reg),
            per = cam / reg)
#> # A tibble: 2 × 4
#> Date cam reg per
#> <chr> <dbl> <dbl> <dbl>
#> 1 22-01-05 80 220 0.364
#> 2 22-01-08 60 250 0.24
Created on 2022-07-07 by the reprex package (v2.0.1)

Using only the package dplyr (which is part of package tidyverse) just do:
df %>% group_by(Date) %>% summarise(cam = sum(cam),
reg = sum(reg),
per = 100*(cam/reg))
Date cam reg per
<chr> <int> <int> <dbl>
1 22-01-05 80 220 36.4
2 22-01-08 60 250 24
The nice thing with this syntax is, you can modify and add additional variables like sum, but also like mean, median, etc. in a very clean and structured way.
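As a minimal sketch of that flexibility, the same grouping can compute several statistics at once with across(). The result name stats and the column suffixes (_sum, _mean, _median) come from the list names chosen here for illustration; they are not part of the original answer.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(Date = c("22-01-05", "22-01-05", "22-01-08", "22-01-08"),
                 cat = c("A", "B", "A", "B"),
                 cam = c(60, 20, 30, 30),
                 reg = c(120, 100, 150, 100),
                 per = c(50, 20, 20, 30))

# One row per Date; across() applies every function in the list to cam and reg,
# producing columns such as cam_sum, cam_mean, reg_median
stats <- df %>%
  group_by(Date) %>%
  summarise(across(c(cam, reg),
                   list(sum = sum, mean = mean, median = median)))

stats
```

Adding another statistic is then a matter of adding one more entry to the list.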

You can try this, but I don't know how to get the per values of 14.5 and 24.
library(dplyr)
aggregate(cbind(cam, reg) ~ Date,df,sum) %>% mutate(per = 100*(cam/reg))
A data.frame: 2 × 4
Date cam reg per
<chr> <dbl> <dbl> <dbl>
22-01-05 80 220 36.36364
22-01-08 60 250 24.00000

Related

How to find duplicate dates within a row in R, and then replace associated values with the mean?

There are some similar questions, however I haven't been able to find the solution for my data:
ID <- c(27,46,72)
Gest1 <- c(27,28,29)
Sys1 <- c(120,123,124)
Dia1 <- c(90,89,92)
Gest2 <- c(29,28,30)
Sys2 <- c(122,130,114)
Dia2 <- c(89,78,80)
Gest3 <- c(32,29,30)
Sys3 <- c(123,122,124)
Dia3 <- c(90,88,89)
Gest4 <- c(33,30,32)
Sys4 <- c(124,123,128)
Dia4 <- c(94,89,80)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3,
Dia3,Gest4,Sys4,Dia4)
df.1
What I need to do is identify where there are any cases of gestational age duplicates (variables beginning with Gest), and then find the mean of the associated Sys and Dia variables.
Once the mean has been calculated, I need to replace the duplicates with just 1 Gest variable, and the mean of the Sys variable and the mean of the Dia variable. Everything after those duplicates should then be moved up the dataframe.
Here is what it should look like:
df.2
My real data has 25 Gest variables with 25 associated Sys variables and 25 associated Dia variables.
Sorry if this is confusing! I've tried to write an ok question but it is my first time using stack overflow.
Thank you!!
This is easier to manage in long (and tidy) format.
Using tidyverse, you can use pivot_longer to put the data into long form. After grouping by ID and Gest, you can replace the Sys and Dia values with their group means, so if a Gest value occurs more than once for a given ID, the average is used.
Then, you can keep that row of data with slice. After grouping by ID, you can renumber after combining those with common Gest values.
library(tidyverse)
df.1 %>%
pivot_longer(cols = -ID, names_to = c(".value", "number"), names_pattern = "(\\D+)(\\d+)") %>% # \\D+ keeps two-digit suffixes such as Gest25 intact
group_by(ID, Gest) %>%
mutate(across(c(Sys, Dia), mean)) %>%
slice(1) %>%
group_by(ID) %>%
mutate(number = row_number())
Output
ID number Gest Sys Dia
<dbl> <int> <dbl> <dbl> <dbl>
1 27 1 27 120 90
2 27 2 29 122 89
3 27 3 32 123 90
4 27 4 33 124 94
5 46 1 28 126. 83.5
6 46 2 29 122 88
7 46 3 30 123 89
8 72 1 29 124 92
9 72 2 30 119 84.5
10 72 3 32 128 80
Note - I would keep in long form - but if you wanted wide again, you can add:
pivot_wider(id_cols = ID, names_from = number, values_from = c(Gest, Sys, Dia))
This involves changing the structure of the table into long format, averaging the duplicates, and then reshaping it back into the desired table:
library(tidyr)
library(dplyr)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3, Dia3,Gest4,Sys4,Dia4)
#convert data to long format
longdf <- df.1 %>% pivot_longer(!ID, names_to = c(".value", "time"), names_pattern = "(\\D+)(\\d+)") # \\d+ so that suffixes 10 to 25 are read in full
#average duplicate rows
temp<-longdf %>% group_by(ID, Gest) %>% summarize(Sys=mean(Sys), Dia=mean(Dia)) %>% mutate(time = row_number())
#convert back to wide format
answer<-temp %>% pivot_wider(id_cols = ID, names_from = time, values_from = c("Gest", "Sys", "Dia"), names_glue = "{.value}{time}")
#resort the columns
answer <-answer[ , names(df.1)]
answer
# A tibble: 3 × 13
# Groups: ID [3]
ID Gest1 Sys1 Dia1 Gest2 Sys2 Dia2 Gest3 Sys3 Dia3 Gest4 Sys4 Dia4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 27 27 120 90 29 122 89 32 123 90 33 124 94
2 46 28 126. 83.5 29 122 88 30 123 89 NA NA NA
3 72 29 124 92 30 119 84.5 32 128 80 NA NA NA
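One detail worth checking for the real data, which has suffixes up to 25: how the names_pattern regular expression splits a two-digit column name. The sketch below uses only base R and a hypothetical helper, split_name(), to show what each capture group would receive for "Gest25"; it illustrates the regular expressions only and is not a call into tidyr.

```r
# Hypothetical helper: return the two capture groups of `pattern` for one name
split_name <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
  c(name = m[2], number = m[3])
}

greedy_word  <- split_name("(\\w+)(\\d+)", "Gest25")  # \\w also matches digits
single_digit <- split_name("(\\D+)(\\d)",  "Gest25")  # stops after one digit
non_digit    <- split_name("(\\D+)(\\d+)", "Gest25")  # non-digits, then all digits

greedy_word   # name "Gest2", number "5"
single_digit  # name "Gest",  number "2"
non_digit     # name "Gest",  number "25"
```

Only the last pattern keeps Gest25 distinct from Gest2 and Gest5, so it is the safer choice once the suffixes go past 9.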

Create new columns in R

I am carrying out an analysis on some Italian regions. I have a dataset similar to the following:
mydata <- data.frame(date= c(2020,2021,2020,2021,2020,2021),
Region= c('Sicilia','Sicilia','Sardegna','Sardegna','Campania','Campania'),
Number=c(20,30,50,70,90,69) )
Now I have to create two new columns. The first (called 'Total population') containing a fixed number for each region (for example each row with Sicily will have a "Total Population" = 250). The second column instead contains the % ratio between the value of 'Number' column and the corresponding value of 'Total Population' (for example for Sicily the value will be 20/250 and so on).
I hope I explained myself well, Thank you very much
Like this perhaps:
library(dplyr)
library(magrittr) # provides the %<>% assignment pipe
mydata %<>% group_by(Region) %>%
  mutate(
    `Total Population` = sum(Number),
    `Ratio of Total` = sprintf("%.1f%%", 100 * Number / sum(Number)))
mydata is now:
> mydata
# A tibble: 6 x 5
# Groups: Region [3]
date Region Number `Total Population` `Ratio of Total`
<dbl> <chr> <dbl> <dbl> <chr>
1 2020 Sicilia 20 50 40.0%
2 2021 Sicilia 30 50 60.0%
3 2020 Sardegna 50 120 41.7%
4 2021 Sardegna 70 120 58.3%
5 2020 Campania 90 159 56.6%
6 2021 Campania 69 159 43.4%
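If the total population is meant to be a fixed, externally known number per region (as the question describes) rather than the sum of Number, one possible approach is a small lookup table joined by Region. This is only a sketch: the population table, the result name, and the column name `Ratio (%)` are made up for illustration, and only the value 250 for Sicilia comes from the question.

```r
library(dplyr)

mydata <- data.frame(
  date   = c(2020, 2021, 2020, 2021, 2020, 2021),
  Region = c("Sicilia", "Sicilia", "Sardegna", "Sardegna", "Campania", "Campania"),
  Number = c(20, 30, 50, 70, 90, 69)
)

# Hypothetical lookup table: 250 for Sicilia is from the question,
# the other two values are placeholders
population <- data.frame(
  Region = c("Sicilia", "Sardegna", "Campania"),
  `Total Population` = c(250, 400, 500),
  check.names = FALSE
)

# Attach the fixed population to every row of its region, then compute the ratio
result <- mydata %>%
  left_join(population, by = "Region") %>%
  mutate(`Ratio (%)` = 100 * Number / `Total Population`)

result
```

For the first Sicilia row this gives 100 * 20 / 250 = 8.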

Is there any way to join two data frames by date ranges?

I have two data frames, the first dataset is the record for forecasted demand in the following 27 days for each item of the company, shown as below:
library(tidyverse)
library(lubridate)
daily_forecast <- data.frame(
item=c("A","B","A","B"),
date_fcsted=c("2020-8-1","2020-8-1","2020-8-15","2020-8-15"),
fcsted_qty=c(100,200,200,100)
) %>%
mutate(date_fcsted=ymd(date_fcsted)) %>%
mutate(extended_date=date_fcsted+days(27))
and the other dataset is the actual daily demand for each item:
actual_orders <- data.frame(
order_date=rep(seq(ymd("2020-8-3"),ymd("2020-9-15"),by = "1 week"),2),
item=rep(c("A","B"),7),
order_qty=round(rnorm(n=14,mean=50,sd=10),0)
)
What I am trying to accomplish is to get the actual total demand for each item between date_fcsted and extended_date in the first dataset, and then join the two so that I can calculate the forecast accuracy.
Solutions with tidyverse would be highly appreciated.
You can try the following :
library(dplyr)
daily_forecast %>%
left_join(actual_orders, by = 'item') %>%
filter(order_date >= date_fcsted & order_date <= extended_date) %>%
group_by(item, date_fcsted, extended_date, fcsted_qty) %>%
summarise(value = sum(order_qty))
# item date_fcsted extended_date fcsted_qty value
# <chr> <date> <date> <dbl> <dbl>
#1 A 2020-08-01 2020-08-28 100 179
#2 A 2020-08-15 2020-09-11 200 148
#3 B 2020-08-01 2020-08-28 200 190
#4 B 2020-08-15 2020-09-11 100 197
You could also try fuzzy_join as suggested by @Gregor Thomas. I added a row number column to make sure you have unique rows independent of item and date ranges (but this may not be needed).
library(fuzzyjoin)
library(dplyr)
daily_forecast %>%
mutate(rn = row_number()) %>%
fuzzy_left_join(actual_orders,
by = c("item" = "item",
"date_fcsted" = "order_date",
"extended_date" = "order_date"),
match_fun = list(`==`, `<=`, `>=`)) %>%
group_by(rn, item.x, date_fcsted, extended_date, fcsted_qty) %>%
summarise(actual_total_demand = sum(order_qty))
Output
rn item.x date_fcsted extended_date fcsted_qty actual_total_demand
<int> <chr> <date> <date> <dbl> <dbl>
1 1 A 2020-08-01 2020-08-28 100 221
2 2 B 2020-08-01 2020-08-28 200 219
3 3 A 2020-08-15 2020-09-11 200 212
4 4 B 2020-08-15 2020-09-11 100 216
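A third option, assuming dplyr 1.1.0 or later is installed, is a non-equi join with join_by(), which avoids both the intermediate cross product and the extra package. The sketch below replaces the random order quantities with a constant 50 (a made-up value) so that the totals are reproducible, and builds the dates with base R instead of lubridate.

```r
library(dplyr)

daily_forecast <- data.frame(
  item = c("A", "B", "A", "B"),
  date_fcsted = as.Date(c("2020-08-01", "2020-08-01", "2020-08-15", "2020-08-15")),
  fcsted_qty = c(100, 200, 200, 100)
)
daily_forecast$extended_date <- daily_forecast$date_fcsted + 27

# Weekly orders for both items; the constant quantity is an assumption
actual_orders <- data.frame(
  order_date = rep(seq(as.Date("2020-08-03"), as.Date("2020-09-15"), by = "1 week"), 2),
  item = rep(c("A", "B"), 7),
  order_qty = rep(50, 14)
)

# Match on item, and keep only orders that fall inside the forecast window
result <- daily_forecast %>%
  left_join(actual_orders,
            by = join_by(item,
                         date_fcsted <= order_date,
                         extended_date >= order_date)) %>%
  group_by(item, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_total_demand = sum(order_qty), .groups = "drop")

result
```

With these inputs each forecast window contains four weekly orders per item, so every actual_total_demand is 200.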

Creating a new Data.Frame from variable values

I am currently working on a task that requires me to query a list of stocks from an sql db.
The problem is that it is a list where there are 1:n stocks traded per date. I want to calculate the share of each stock in the portfolio on a given day (see example) and pass it to a new data frame. In other words, date x occurs two times (once for stock A and once for stock B), and I want to pull these together so that date x occurs only once, with the new values.
'data.frame': 1010 obs. of 5 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Date : Date, format: "2019-11-22" "2019-11-21" "2019-11-20" "2019-11-19" ...
$ Close: num 52 51 50.1 50.2 50.2 ...
$ Volume : num 5415 6196 3800 4784 6189 ...
$ Stock_ID : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
RawInput<-data.frame(Date=c("2017-22-11","2017-22-12","2017-22-13","2017-22-11","2017-22-12","2017-22-13","2017-22-11"), Close=c(50,55,56,10,11,12,200),Volume=c(100,110,150,60,70,80,30),Stock_ID=c(1,1,1,2,2,2,3))
RawInput$Stock_ID<-as.factor(RawInput$Stock_ID)
*cannot transfer the date to a date variable in this example
I would like to have a new dataframe that generates the Value traded per day, the weight of each stock, and the daily returns per day, while keeping the number of stocks variable.
I hope I translated the issue properly so that I can receive help.
Thank you!
I think the easiest way to do this would be to use the dplyr package. You may need to read some documentation, but the mutate and group_by functions may be able to do what you want. Together they let you modify the current data frame by either adding a new column or changing the existing data.
Let's start with a reproducible dataset
RawInput<-data.frame(Date=c("2017-22-11","2017-22-12","2017-22-13","2017-22-11","2017-22-12","2017-22-13","2017-22-11"),
Close=c(50,55,56,10,11,12,200),
Volume=c(100,110,150,60,70,80,30),
Stock_ID=c(1,1,1,2,2,2,3))
RawInput$Stock_ID<-as.factor(RawInput$Stock_ID)
library(magrittr)
library(dplyr)
dat2 <- RawInput %>%
group_by(Date, Stock_ID) %>% #this example only has one stock type but i imagine you want to group by stock
mutate(CloseMean=mean(Close),
CloseSum=sum(Close),
VolumeMean=mean(Volume),
VolumeSum=sum(Volume)) #what ever computation you need to do with
#multiple stock values for a given date goes here
dat2 %>% select(Stock_ID, Date, CloseMean, CloseSum, VolumeMean,VolumeSum) %>% distinct() #dat2 will still be the same size as dat, thus use the distinct() function to reduce it to unique values
# A tibble: 7 x 6
# Groups: Date, Stock_ID [7]
Stock_ID Date CloseMean CloseSum VolumeMean VolumeSum
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 2017-22-11 50 50 100 100
2 1 2017-22-12 55 55 110 110
3 1 2017-22-13 56 56 150 150
4 2 2017-22-11 10 10 60 60
5 2 2017-22-12 11 11 70 70
6 2 2017-22-13 12 12 80 80
7 3 2017-22-11 200 200 30 30
The data set you provided has only one row per Stock_ID and Date combination, so the grouped summaries simply repeat the original values. However, if you remove Stock_ID from the grouping, you can see how this function works:
dat2 <- RawInput %>%
group_by(Date) %>%
mutate(CloseMean=mean(Close),
CloseSum=sum(Close),
VolumeMean=mean(Volume),
VolumeSum=sum(Volume))
dat2 %>% select(Date, CloseMean, CloseSum, VolumeMean,VolumeSum) %>% distinct()
# A tibble: 3 x 5
# Groups: Date [3]
Date CloseMean CloseSum VolumeMean VolumeSum
<fct> <dbl> <dbl> <dbl> <dbl>
1 2017-22-11 86.7 260 63.3 190
2 2017-22-12 33 66 90 180
3 2017-22-13 34 68 115 230
After reading your first reply: you will have to be specific about how you are trying to calculate the weight, and also define your end result.
I'm going to assume that the weight is just each stock's share of the total Close on a given date, and that the end result should show the weight per stock for each date; in other words, a matrix of dates and stock IDs.
library(tidyr)
RawInput %>%
group_by(Date) %>%
mutate(weight=Close/sum(Close)) %>%
select(Date, weight, Stock_ID) %>%
spread(key = "Stock_ID", value = "weight", fill = 0)
# A tibble: 3 x 4
# Groups: Date [3]
Date `1` `2` `3`
<fct> <dbl> <dbl> <dbl>
1 2017-22-11 0.192 0.0385 0.769
2 2017-22-12 0.833 0.167 0
3 2017-22-13 0.824 0.176 0
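The question also asks for the value traded and the daily returns. The sketch below rests on two assumptions that the question does not confirm: value traded is taken as Close * Volume, and the daily return as the simple return Close / previous Close - 1 within each stock. The column names Value and Return are made up for illustration.

```r
library(dplyr)

RawInput <- data.frame(
  Date = c("2017-22-11", "2017-22-12", "2017-22-13",
           "2017-22-11", "2017-22-12", "2017-22-13", "2017-22-11"),
  Close = c(50, 55, 56, 10, 11, 12, 200),
  Volume = c(100, 110, 150, 60, 70, 80, 30),
  Stock_ID = factor(c(1, 1, 1, 2, 2, 2, 3))
)

returns <- RawInput %>%
  group_by(Stock_ID) %>%
  arrange(Date, .by_group = TRUE) %>%          # order each stock's rows by date
  mutate(Value  = Close * Volume,              # assumed definition of value traded
         Return = Close / lag(Close) - 1) %>%  # NA on each stock's first date
  ungroup()

returns
```

Because the work is done per Stock_ID, the number of stocks can vary freely; a stock with a single observation, such as stock 3 here, just gets NA as its return.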

Subsetting data set to only retain the mean

Please see attached image of dataset.
What are the different ways to only retain a single value for each 'Month'? I've got a bunch of data points and would only need to retain, say, the mean value.
Many thanks
A different way of using the aggregate() function.
> aggregate(Temp ~ Month, data=airquality, FUN = mean)
Month Temp
1 5 65.54839
2 6 79.10000
3 7 83.90323
4 8 83.96774
5 9 76.90000
library(tidyverse)
library(lubridate)
#example data from airquality:
aq<-as_tibble(airquality) # as_data_frame() is deprecated in favour of as_tibble()
aq$mydate<-lubridate::ymd(paste0(2018, "-", aq$Month, "-", aq$Day))
> aq
# A tibble: 153 x 7
Ozone Solar.R Wind Temp Month Day mydate
<int> <int> <dbl> <int> <int> <int> <date>
1 41 190 7.40 67 5 1 2018-05-01
2 36 118 8.00 72 5 2 2018-05-02
3 12 149 12.6 74 5 3 2018-05-03
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE))
Summarize can return multiple summary functions:
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE),
"Num" = n(),
"SD" = sd(Temp, na.rm=TRUE))
# A tibble: 5 x 4
Month Mean_Temp Num SD
<dbl> <dbl> <int> <dbl>
1 5.00 65.5 31 6.85
2 6.00 79.1 30 6.60
3 7.00 83.9 31 4.32
4 8.00 84.0 31 6.59
5 9.00 76.9 30 8.36
Lubridate Cheatsheet
A data.table answer:
# load libraries
library(data.table)
library(lubridate)
setDT(dt)
dt[, .(meanValue = mean(value, na.rm =TRUE)), by = .(monthDate = floor_date(dates, "month"))]
Where dt has at least columns value and dates.
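Since the snippet above relies on a table dt that is not shown, here is a self-contained sketch of the same data.table idea using the built-in airquality data, which already has a Month column, so no date handling is needed. The names dt, res and meanTemp are chosen here for illustration.

```r
library(data.table)

# Copy the built-in data set into a data.table
dt <- as.data.table(airquality)

# One row per Month with the mean temperature
res <- dt[, .(meanTemp = mean(Temp, na.rm = TRUE)), by = Month]

res
```

The by argument plays the role of group_by() in the dplyr answers, and the .() list plays the role of summarize().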
We can group by the index of dataset, use that in aggregate (from base R) to get the mean
aggregate(dat, index(dat), FUN = mean)
NB: Here, we assumed that the dataset is in xts or zoo format. If the dataset has a month column, then use
aggregate(dat, list(dat$Month), FUN = mean)
