How to divide each year's data by the sum for that year in base R [duplicate] - r

sales <- read.table(header = TRUE, text="Year Name Sales
1980 Atari 4.00
1980 Activision 1.07
1981 Activision 4.21
1981 ParkerBros. 2.06
1981 Imagic 1.99
1981 Atari 1.84
1981 Coleco 1.36
1981 Mystique 0.76
1981 Fox 0.74
1981 Men 0.72")
I was able to get the yearly sums using aggregate().
I want to divide each row's sales by that year's total sales to get the percentage, but I don't know how to divide each row by its own year's total.
DF <- aggregate(Sales ~ Year + Name, data = sales, FUN = sum)
DFC48 <- aggregate(Sales ~ Year, data = DF, FUN = sum)

Base R:
We could use ave(). Here we can apply function x/sum(x) to each group Year, where x is defined by sales$Sales:
sales$su <- ave(sales$Sales, sales$Year, FUN = function(x) x/sum(x))
Year Name Sales su
1 1980 Atari 4.00 0.78895464
2 1980 Activision 1.07 0.21104536
3 1981 Activision 4.21 0.30774854
4 1981 ParkerBros. 2.06 0.15058480
5 1981 Imagic 1.99 0.14546784
6 1981 Atari 1.84 0.13450292
7 1981 Coleco 1.36 0.09941520
8 1981 Mystique 0.76 0.05555556
9 1981 Fox 0.74 0.05409357
10 1981 Men 0.72 0.05263158
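Since the question started from aggregate(), here is a minimal base R sketch of that route: compute each year's total, merge it back onto the rows, and divide. It uses only the first four rows of the question's data, and the column names Total and pct are illustrative choices rather than anything from the original post.

```r
sales <- read.table(header = TRUE, text = "Year Name Sales
1980 Atari 4.00
1980 Activision 1.07
1981 Activision 4.21
1981 ParkerBros. 2.06")

# Yearly totals, one row per Year
totals <- aggregate(Sales ~ Year, data = sales, FUN = sum)
names(totals)[2] <- "Total"

# Attach each year's total to its rows, then compute the percentage
out <- merge(sales, totals, by = "Year")
out$pct <- 100 * out$Sales / out$Total
out
```

Note that merge() sorts by the key column, so the row order may differ from the input; ave() avoids that because it works in place.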

You could try the code below:
library(dplyr)
sales %>% group_by(Year) %>% mutate(su = Sales / sum(Sales))
Created on 2023-02-17 with reprex v2.0.2
# A tibble: 10 × 4
# Groups: Year [2]
Year Name Sales su
<dbl> <chr> <dbl> <dbl>
1 1980 Atari 4 0.789
2 1980 Activision 1.07 0.211
3 1981 Activision 4.21 0.308
4 1981 ParkerBros. 2.06 0.151
5 1981 Imagic 1.99 0.145
6 1981 Atari 1.84 0.135
7 1981 Coleco 1.36 0.0994
8 1981 Mystique 0.76 0.0556
9 1981 Fox 0.74 0.0541
10 1981 Men 0.72 0.0526

Another option using prop.table like this:
library(dplyr)
sales %>%
group_by(Year) %>%
mutate(su = prop.table(Sales))
#> # A tibble: 10 × 4
#> # Groups: Year [2]
#> Year Name Sales su
#> <int> <chr> <dbl> <dbl>
#> 1 1980 Atari 4 0.789
#> 2 1980 Activision 1.07 0.211
#> 3 1981 Activision 4.21 0.308
#> 4 1981 ParkerBros. 2.06 0.151
#> 5 1981 Imagic 1.99 0.145
#> 6 1981 Atari 1.84 0.135
#> 7 1981 Coleco 1.36 0.0994
#> 8 1981 Mystique 0.76 0.0556
#> 9 1981 Fox 0.74 0.0541
#> 10 1981 Men 0.72 0.0526
Created on 2023-02-18 with reprex v2.0.2

Related

Creating averages across time periods

I'm a beginner in R. I have the data frame below (with more observations), in which each 'id' is observed in at most three years: 1991, 1999, and 2007.
I want to create a variable avg_ln_rd by 'id' that averages each 'ln_rd' with the 'ln_rd' of the preceding period: with the 1991 value if the observation is from 1999, and with the 1999 value if the observation is from 2007.
id year ln_rd
<dbl> <dbl> <dbl>
1 1013 1991 3.51
2 1013 1999 5.64
3 1013 2007 4.26
4 1021 1991 0.899
5 1021 1999 0.791
6 1021 2007 0.704
7 1034 1991 2.58
8 1034 1999 3.72
9 1034 2007 4.95
10 1037 1991 0.262
I also already dropped any observations of 'id' that only exist for one of the three years.
My first thought was to create a standalone variable of ln_rd for each year, but then I would still need to filter by id, which I do not know how to do.
Then I tried using these standalone variables to form an if clause.
df$lagln_rd_99 <- ifelse(df$year == 1999, df$ln_rd_91, NA)
But again I do not know how to keep 'id' fixed.
Any help would be greatly appreciated.
EDIT:
I grouped by id using dplyr. Can I then just sort my df by id and create a new variable that is ln_rd but shifted by one row?
Still a bit unclear what to do if all years are present in a group but this might help.
-- edited -- to show the desired output.
library(dplyr)
df %>%
group_by(id) %>%
arrange(id, year) %>%
mutate(avg91 = mean(c(ln_rd[year == 1991], ln_rd[year == 1999])),
avg99 = mean(c(ln_rd[year == 1999], ln_rd[year == 2007])),
avg91 = ifelse(year == 1991, avg91, NA),
avg99 = ifelse(year == 2007, avg99, NA)) %>%
ungroup()
# A tibble: 15 × 5
year id ln_rd avg91 avg99
<int> <int> <dbl> <dbl> <dbl>
1 1991 3505 3.38 3.09 NA
2 1999 3505 2.80 NA NA
3 1991 4584 1.45 1.34 NA
4 1999 4584 1.22 NA NA
5 1991 5709 1.90 2.13 NA
6 1999 5709 2.36 NA NA
7 2007 5709 3.11 NA 2.74
8 2007 9777 2.36 NA 2.36
9 1991 18729 4.82 5.07 NA
10 1999 18729 5.32 NA NA
11 2007 18729 5.53 NA 5.42
12 1991 20054 0.588 0.307 NA
13 1999 20054 0.0266 NA NA
14 1999 62169 1.91 NA NA
15 2007 62169 1.45 NA 1.68
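The "shift by one row within id" idea raised in the question's edit can be sketched with dplyr::lag(). This is only a sketch under the assumption that every id has consecutive periods; if 1999 is missing for an id, it would average 2007 with 1991, which may not be what is wanted. The two ids and their values are copied from the question's sample; the name res is an illustrative choice.

```r
library(dplyr)

# Two complete ids taken from the question's sample data
df <- data.frame(
  id    = c(1013, 1013, 1013, 1021, 1021, 1021),
  year  = c(1991, 1999, 2007, 1991, 1999, 2007),
  ln_rd = c(3.51, 5.64, 4.26, 0.899, 0.791, 0.704)
)

res <- df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  # Average each value with the previous period's value of the same id;
  # the first row of every id has no predecessor and becomes NA
  mutate(avg_ln_rd = (ln_rd + dplyr::lag(ln_rd)) / 2) %>%
  ungroup()
res
```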

Processing data.frame that needs order and cumulative days

With the small reproducible example below, I'd like to identify the dplyr approach to arrive at the data.frame shown at the end of this note. The dplyr output should ensure that the data.frame is sorted by date (note that the dates 1999-04-13 and 1999-03-12 are out of order) and should then "accumulate" the number of days within each wy grouping (wy = "water year"; Oct 1-Sep 30) on which Q is above a threshold of 3.0.
dat <- read.table(text="
Date wy Q
1997-01-01 1997 9.82
1997-02-01 1997 3.51
1997-02-02 1997 9.35
1997-10-04 1998 0.93
1997-11-01 1998 1.66
1997-12-02 1998 0.81
1998-04-03 1998 5.65
1998-05-05 1998 7.82
1998-07-05 1998 6.33
1998-09-06 1998 0.55
1998-09-07 1998 4.54
1998-10-09 1999 6.50
1998-12-31 1999 2.17
1999-01-01 1999 5.67
1999-04-13 1999 5.66
1999-03-12 1999 4.67
1999-06-05 1999 3.34
1999-09-30 1999 1.99
1999-11-06 2000 5.75
2000-03-04 2000 6.28
2000-06-07 2000 0.81
2000-07-06 2000 9.66
2000-09-09 2000 9.08
2000-09-21 2000 6.72", header=TRUE)
dat$Date <- as.Date(dat$Date)
mdat <- dat %>%
group_by(wy) %>%
filter(Q > 3) %>%
?
Desired results:
Date wy Q abvThreshCum
1997-01-01 1997 9.82 1
1997-02-01 1997 3.51 2
1997-02-02 1997 9.35 3
1997-10-04 1998 0.93 0
1997-11-01 1998 1.66 0
1997-12-02 1998 0.81 0
1998-04-03 1998 5.65 1
1998-05-05 1998 7.82 2
1998-07-05 1998 6.33 3
1998-09-06 1998 0.55 3
1998-09-07 1998 4.54 4
1998-10-09 1999 6.50 1
1998-12-31 1999 2.17 1
1999-01-01 1999 5.67 2
1999-03-12 1999 4.67 3
1999-04-13 1999 5.66 4
1999-06-05 1999 3.34 5
1999-09-30 1999 1.99 5
1999-11-06 2000 5.75 1
2000-03-04 2000 6.28 2
2000-06-07 2000 0.81 2
2000-07-06 2000 9.66 3
2000-09-09 2000 9.08 4
2000-09-21 2000 6.72 5
library(dplyr)
dat %>%
arrange(Date) %>%
group_by(wy) %>%
mutate(abv = cumsum(Q > 3)) %>%
ungroup()
# # A tibble: 24 x 4
# Date wy Q abv
# <date> <int> <dbl> <int>
# 1 1997-01-01 1997 9.82 1
# 2 1997-02-01 1997 3.51 2
# 3 1997-02-02 1997 9.35 3
# 4 1997-10-04 1998 0.93 0
# 5 1997-11-01 1998 1.66 0
# 6 1997-12-02 1998 0.81 0
# 7 1998-04-03 1998 5.65 1
# 8 1998-05-05 1998 7.82 2
# 9 1998-07-05 1998 6.33 3
# 10 1998-09-06 1998 0.55 3
# # ... with 14 more rows
data.table approach
library(data.table)
setDT(dat, key = "Date")[, abvThreshCum := cumsum(Q > 3), by = .(wy)]
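For completeness, the same sort-then-count idea can be sketched in base R with order() and ave(). The six rows below are a subset of the question's data (including the two out-of-order dates), chosen only to keep the example short.

```r
dat <- read.table(header = TRUE, text = "Date wy Q
1998-10-09 1999 6.50
1998-12-31 1999 2.17
1999-04-13 1999 5.66
1999-03-12 1999 4.67
1999-11-06 2000 5.75
2000-06-07 2000 0.81")
dat$Date <- as.Date(dat$Date)

# Sort by date first so the running count follows calendar order
dat <- dat[order(dat$Date), ]

# Running count of rows with Q > 3, restarted for each water year
dat$abvThreshCum <- ave(as.numeric(dat$Q > 3), dat$wy, FUN = cumsum)
dat
```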

Computing lags but grouping by two categories with dplyr

I want to create var3 using a lag (dplyr package), but the lag should be consistent with the year and the ID. That is, the lagged value should belong to the same ID. The dataset is an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
var3 = var1 - dplyr::lag(var2))
)
Attempt #2:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
gr = sprintf(YEAR,ID)
var3 = var1 - dplyr::lag(var2, order_by = gr))
)
Minimum example:
MyData <-
data.frame(YEAR = rep(seq(2010,2014),5),
ID = rep(1:5, each=5),
var1 = rnorm(n=25,mean=10,sd=3),
var2 = rnorm(n=25,mean=1,sd=1)
)
MyData %>%
group_by(YEAR,ID) %>%
summarise(var3 = var1 - dplyr::lag(var2)
)
Thanks in advance.
Do you mean group_by(ID) and effectively "order by YEAR"?
MyData %>%
group_by(ID) %>%
mutate(var3 = var1 - dplyr::lag(var2)) %>%
print(n=99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(I've replaced your summarise with a mutate for now.)
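If the rows are not already sorted by YEAR within each ID, one option is the order_by argument of dplyr::lag(). The sketch below uses a small made-up panel (not the question's random data) with rows deliberately out of year order; the name res is illustrative.

```r
library(dplyr)

# Made-up panel with rows deliberately out of year order
MyData <- data.frame(
  YEAR = c(2012, 2010, 2011, 2011, 2010),
  ID   = c(1, 1, 1, 2, 2),
  var1 = c(10, 11, 12, 20, 21),
  var2 = c(1, 2, 3, 4, 5)
)

res <- MyData %>%
  group_by(ID) %>%
  # order_by makes lag() follow YEAR even though the rows are unsorted
  mutate(var3 = var1 - dplyr::lag(var2, order_by = YEAR)) %>%
  ungroup()
res
```

The first year of each ID has no previous value, so var3 is NA there. An alternative with the same effect is to call arrange(ID, YEAR) before the mutate.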

R - How to use cumulative sum by year and restart cumulative sum when condition is met

I have the following data frame in R:
YEAR DOY PRECTOT cumsum Lws prec0
<int> <chr> <dbl> <dbl> <chr> <chr>
1 1982 121 6.05 6.05 no no
2 1982 122 1.10 7.15 no no
3 1982 123 0.490 7.64 no no
4 1982 124 4.53 12.2 no no
5 1982 125 3.94 16.1 no no
6 1982 126 2.78 18.9 no no
7 1982 127 0.420 19.3 no no
8 1982 128 0. 19.3 no yes
9 1982 129 0.0700 19.4 no no
10 1982 130 8.94 28.3 no no
I want another column that calculates the cumulative sum like the cumsum column, but restarts counting when PRECTOT is 0, as in row 8. Basically, it should restart the cumulative sum at row 8 and then continue the cumulative sum from there, like this:
YEAR DOY PRECTOT cumsum Lws prec0
<int> <chr> <dbl> <dbl> <chr> <chr>
1 1982 121 6.05 6.05 no no
2 1982 122 1.10 7.15 no no
3 1982 123 0.490 7.64 no no
4 1982 124 4.53 12.2 no no
5 1982 125 3.94 16.1 no no
6 1982 126 2.78 18.9 no no
7 1982 127 0.420 19.3 no no
8 1982 128 0. 0 no yes
9 1982 129 0.0700 0.0700 no no
Is there a nice and efficient way to do this in R? Thank you.
The "restart when condition is met" part is done with a group_by(cumsum(<condition>)):
library(dplyr)
dat %>%
group_by(grp = cumsum(PRECTOT == 0)) %>%
mutate(cumsum = cumsum(PRECTOT))
# # A tibble: 10 x 7
# # Groups: grp [2]
# YEAR DOY PRECTOT cumsum Lws prec0 grp
# <int> <chr> <dbl> <dbl> <chr> <chr> <int>
# 1 1982 121 6.05 6.05 no no 0
# 2 1982 122 1.1 7.15 no no 0
# 3 1982 123 0.49 7.64 no no 0
# 4 1982 124 4.53 12.2 no no 0
# 5 1982 125 3.94 16.1 no no 0
# 6 1982 126 2.78 18.9 no no 0
# 7 1982 127 0.42 19.3 no no 0
# 8 1982 128 0 0 no yes 1
# 9 1982 129 0.07 0.07 no no 1
# 10 1982 130 8.94 9.01 no no 1
Data:
dat <- readr::read_table2(
"YEAR DOY PRECTOT cumsum Lws prec0
1982 121 6.05 6.05 no no
1982 122 1.10 7.15 no no
1982 123 0.490 7.64 no no
1982 124 4.53 12.2 no no
1982 125 3.94 16.1 no no
1982 126 2.78 18.9 no no
1982 127 0.420 19.3 no no
1982 128 0. 19.3 no yes
1982 129 0.0700 19.4 no no
1982 130 8.94 28.3 no no
", col_types = "icddcc")
Here is one way to restart a cumulative sum when a condition is met, using data.table:
dat <- read.table(header = TRUE, text = "YEAR DOY PRECTOT cumsum Lws prec0
1982 121 6.05 6.05 no no
1982 122 1.10 7.15 no no
1982 123 0.490 7.64 no no
1982 124 4.53 12.2 no no
1982 125 3.94 16.1 no no
1982 126 2.78 18.9 no no
1982 127 0.420 19.3 no no
1982 128 0. 19.3 no yes
1982 129 0.0700 19.4 no no
1982 130 8.94 28.3 no no")
library(data.table)
dat <- data.table(dat)
dat[, NEWCOL:=cumsum(PRECTOT), by=cumsum(PRECTOT==0)]
The cumulative sum is restarted using the data.table group by (by=cumsum(<condition>)).
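The title also asks for the sum to restart by year, which the sample (a single year) does not exercise. A minimal base R sketch, assuming the sum should restart both at each new YEAR and at each zero-precipitation row, is to pass both grouping vectors to ave(). The data below are made up to span two years, and reset_id and cumsum_reset are illustrative names.

```r
# Made-up rows spanning two years (not from the question's data)
dat <- data.frame(
  YEAR    = c(1982, 1982, 1982, 1982, 1983, 1983, 1983),
  PRECTOT = c(2, 3, 0, 1, 4, 0, 5)
)

# Counter that increases by one at every zero-precipitation row
reset_id <- cumsum(dat$PRECTOT == 0)

# Cumulative sum within each (YEAR, reset_id) block
dat$cumsum_reset <- ave(dat$PRECTOT, dat$YEAR, reset_id, FUN = cumsum)
dat
```

In dplyr the equivalent grouping would be group_by(YEAR, grp = cumsum(PRECTOT == 0)), and in data.table by = .(YEAR, cumsum(PRECTOT == 0)).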

Subset winter months between two years

I am currently working with the data frame below.
head(pdo,24)
Date Year Month Value Season
1 198001 1980 1 0.06 Winter
2 198002 1980 2 0.60 Spring
3 198003 1980 3 0.60 Spring
4 198004 1980 4 0.72 Spring
5 198005 1980 5 0.57 Summer
6 198006 1980 6 -0.78 Summer
7 198007 1980 7 -0.32 Summer
8 198008 1980 8 -0.12 Fall
9 198009 1980 9 -0.29 Fall
10 198010 1980 10 0.92 Fall
11 198011 1980 11 0.70 Winter
12 198012 1980 12 0.36 Winter
13 198101 1981 1 1.18 Winter
14 198102 1981 2 1.25 Spring
15 198103 1981 3 1.16 Spring
16 198104 1981 4 1.01 Spring
17 198105 1981 5 1.22 Summer
18 198106 1981 6 1.77 Summer
19 198107 1981 7 0.71 Summer
20 198108 1981 8 -0.11 Fall
21 198109 1981 9 0.34 Fall
22 198110 1981 10 -0.15 Fall
23 198111 1981 11 0.45 Winter
24 198112 1981 12 0.60 Winter
This is a subset of 2 years (1980-1981) of a larger data frame. I need a way to subset the entire data frame (1980-2014) to select the winter months in order.
What I would need is:
Date Year Month Value Season
11 198011 1980 11 0.70 Winter
12 198012 1980 12 0.36 Winter
13 198101 1981 1 1.18 Winter
Any idea how to do this? The reason I need this is so I can take an average of the "Value" column for the winter months.
Thanks for the help!
You can augment your data to reflect the year that a particular season starts in your data.
pdo$SeasonYear <- with(pdo, Year - (Season == "Winter" & Month < 6))
pdo[pdo$Season == "Winter",]
# Date Year Month Value Season SeasonYear
# 1 198001 1980 1 0.06 Winter 1979
# 11 198011 1980 11 0.70 Winter 1980
# 12 198012 1980 12 0.36 Winter 1980
# 13 198101 1981 1 1.18 Winter 1980
# 23 198111 1981 11 0.45 Winter 1981
# 24 198112 1981 12 0.60 Winter 1981
From here,
aggregate(pdo$Value, list(Season = pdo$Season, SeasonYear = pdo$SeasonYear), mean)
# Season SeasonYear x
# 1 Winter 1979 0.06000000
# 2 Spring 1980 0.64000000
# 3 Summer 1980 -0.17666667
# 4 Fall 1980 0.17000000
# 5 Winter 1980 0.74666667
# 6 Spring 1981 1.14000000
# 7 Summer 1981 1.23333333
# 8 Fall 1981 0.02666667
# 9 Winter 1981 0.52500000
Consumable data:
pdo <- read.table(text=' Date Year Month Value Season
198001 1980 1 0.06 Winter
198002 1980 2 0.60 Spring
198003 1980 3 0.60 Spring
198004 1980 4 0.72 Spring
198005 1980 5 0.57 Summer
198006 1980 6 -0.78 Summer
198007 1980 7 -0.32 Summer
198008 1980 8 -0.12 Fall
198009 1980 9 -0.29 Fall
198010 1980 10 0.92 Fall
198011 1980 11 0.70 Winter
198012 1980 12 0.36 Winter
198101 1981 1 1.18 Winter
198102 1981 2 1.25 Spring
198103 1981 3 1.16 Spring
198104 1981 4 1.01 Spring
198105 1981 5 1.22 Summer
198106 1981 6 1.77 Summer
198107 1981 7 0.71 Summer
198108 1981 8 -0.11 Fall
198109 1981 9 0.34 Fall
198110 1981 10 -0.15 Fall
198111 1981 11 0.45 Winter
198112 1981 12 0.60 Winter', header=TRUE)
pdo$Season <- factor(pdo$Season, levels = c("Spring", "Summer", "Fall", "Winter"))
I took the liberty of forcing the factor levels so that they were ordered correctly.
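If only the winter averages are needed, the same season-year idea can be sketched on the winter rows alone. The values below are the six winter rows from the question's data; the names winter and winter_means are illustrative, and the Month < 6 cutoff is the same assumption used above (January counts toward the winter that began the previous year).

```r
# Winter rows taken from the question's data
winter <- data.frame(
  Year  = c(1980, 1980, 1980, 1981, 1981, 1981),
  Month = c(1, 11, 12, 1, 11, 12),
  Value = c(0.06, 0.70, 0.36, 1.18, 0.45, 0.60)
)

# January belongs to the winter that started in the previous year
winter$SeasonYear <- winter$Year - (winter$Month < 6)

# One mean per winter, keyed by the year in which that winter started
winter_means <- tapply(winter$Value, winter$SeasonYear, mean)
winter_means
```

The first and last winters in a series are incomplete (only January of 1980 here, and no January after December 1981), so their means are based on fewer months.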
Do you just want to extract the mean of "Value" for the winter months? With base R you could do:
set.seed(123)
df <- data.frame(year = c(1980,1980,1980,1981,1981,1981,1982,1982,1982),
month = c(6,11,12,6,11,12,6,11,12),
season = c("Summer", "Winter", "Winter", "Summer", "Winter", "Winter", "Summer", "Winter", "Winter"),
value = sample(1:20, 9))
df
year month season value
1 1980 6 Summer 6
2 1980 11 Winter 15
3 1980 12 Winter 8
4 1981 6 Summer 16
5 1981 11 Winter 17
6 1981 12 Winter 1
7 1982 6 Summer 18
8 1982 11 Winter 12
9 1982 12 Winter 7
> mean(df[df$season == "Winter",]$value, na.rm = TRUE)
[1] 10
