I am working with the R programming language.
I have a dataset that looks something like this:
id = c(1,1,1,1,2,2,2)
year = c(2010,2011,2012,2013, 2012, 2013, 2014)
var = rnorm(7,7,7)
my_data = data.frame(id, year,var)
id year var
1 1 2010 12.186300
2 1 2011 19.069836
3 1 2012 7.456078
4 1 2013 14.875019
5 2 2012 20.827933
6 2 2013 5.029625
7 2 2014 -2.260658
For each "group" within the ID column - at each row, I want to take the CUMULATIVE MEAN of the "var" column but EXCLUDE the value of "var" within that row (i.e. most recent).
As an example:
row 1: NA
row 2: 12.186300/1
row 3: (12.186300 + 19.069836)/2
row 4: (12.186300 + 19.069836 + 7.456078)/3
row 5: NA
row 6: 20.827933
row 7: (20.827933 + 5.029625)/2
I found this post here (Cumsum excluding current value) which (I think) shows how to do this for the "cumulative sum" - I tried to apply the logic here to my question:
transform(my_data, cmean = ave(var, id, FUN = cummean) - var)
id year var cmean
1 1 2010 12.186300 0.000000
2 1 2011 19.069836 -3.441768
3 1 2012 7.456078 5.447994
4 1 2013 14.875019 -1.478211
5 2 2012 20.827933 0.000000
6 2 2013 5.029625 7.899154
7 2 2014 -2.260658 10.126291
The code appears to have run - but I don't think I have done this correctly (i.e. the numbers produced don't match up with the numbers I had anticipated).
I then tried an answer provided here (Compute mean excluding current value):
my_data %>%
group_by(id) %>%
mutate(avg = (sum(var) - var)/(n() - 1))
# A tibble: 7 x 4
# Groups: id [2]
id year var avg
<dbl> <dbl> <dbl> <dbl>
1 1 2010 12.2 13.8
2 1 2011 19.1 11.5
3 1 2012 7.46 15.4
4 1 2013 14.9 12.9
5 2 2012 20.8 1.38
6 2 2013 5.03 9.28
But it is still not working.
Can someone please show me what I am doing wrong and what I can do to fix this problem?
Thanks!
df %>%
group_by(id)%>%
mutate(avg = lag(cummean(var)))
# A tibble: 7 × 4
# Groups: id [2]
id year var avg
<int> <int> <dbl> <dbl>
1 1 2010 12.2 NA
2 1 2011 19.1 12.2
3 1 2012 7.46 15.6
4 1 2013 14.9 12.9
5 2 2012 20.8 NA
6 2 2013 5.03 20.8
7 2 2014 -2.26 12.9
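For comparison, here is a minimal base R sketch of the same idea without dplyr. It assumes the rows are already ordered by year within each id. The name lagged_cummean is a hypothetical helper for illustration, not a function from any package, and the var values are copied from the question's printout.

```r
my_data <- data.frame(
  id   = c(1, 1, 1, 1, 2, 2, 2),
  year = c(2010, 2011, 2012, 2013, 2012, 2013, 2014),
  var  = c(12.186300, 19.069836, 7.456078, 14.875019,
           20.827933, 5.029625, -2.260658)
)

# Hypothetical helper: mean of all values strictly before each position.
# The running mean is shifted down by one, so the first element is NA.
lagged_cummean <- function(x) {
  n <- length(x)
  if (n == 0) return(numeric(0))
  c(NA_real_, (cumsum(x) / seq_len(n))[-n])
}

# ave() applies the helper separately within each id
my_data$avg <- ave(my_data$var, my_data$id, FUN = lagged_cummean)
round(my_data$avg, 2)
# NA 12.19 15.63 12.90 NA 20.83 12.93
```

The shift is what separates this from the attempts in the question: subtracting var from the cumulative mean, or dividing the group total by n() - 1, does not restrict the average to earlier rows.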
With the help of some intermediate variables you can do it like so:
library(dplyr)
df <- read.table(text = "
id year var
1 1 2010 12.186300
2 1 2011 19.069836
3 1 2012 7.456078
4 1 2013 14.875019
5 2 2012 20.827933
6 2 2013 5.029625
7 2 2014 -2.260658", header=T)
df |>
group_by(id) |>
mutate(id_g = row_number()) |> # position of the row within its id
mutate(ms = cumsum(var)) |> # running sum, current row included
mutate(cm = lag(ms / id_g)) |> # shift by one row so the current value is excluded
select(-id_g, -ms)
#> # A tibble: 7 × 4
#> # Groups: id [2]
#> id year var cm
#> <int> <int> <dbl> <dbl>
#> 1 1 2010 12.2 NA
#> 2 1 2011 19.1 12.2
#> 3 1 2012 7.46 15.6
#> 4 1 2013 14.9 12.9
#> 5 2 2012 20.8 NA
#> 6 2 2013 5.03 20.8
#> 7 2 2014 -2.26 12.9
So I have a data table of 5000 firms, each firm is assigned a numerical value ("id") which is 1 for the first firm, 2 for the second ...
Here is my table with only the profit variable :
| id | year | profit |
|:----|:----|:----|
| 1 | 2001 | -0.4 |
| 1 | 2002 | -0.89 |
| 2 | 2001 | 1.89 |
| 2 | 2002 | 2.79 |
Each firm is expressed twice, one line specifies the data in 2001 and the second in 2002 (the "id" value being the same on both lines because it is the same firm one year apart).
How to calculate the annual rate of change of each firm ("id") between 2001 and 2002 ?
I'm really new to R and I don't see where to start? Separate the 2001 and 2002 data?
I did this :
years <- sort(unique(group$year))
years
And I also found this on the internet but with no success :
library(dplyr)
res <-
group %>%
arrange(id,year) %>%
group_by(id) %>%
mutate(evol_rate = ("group$year$2002" / lag("group$year$2001") - 1) * 100) %>%
ungroup()
Thank you very much
From what you've written, I take it that you want to calculate the formula for ROC for the profit values of 2001 and 2002:
ROC=(current_value/previous_value − 1) ∗ 100
To accomplish this, I suggest tidyr::pivot_wider() which reshapes your dataframe from long to wide format (see: https://r4ds.had.co.nz/tidy-data.html#pivoting).
Code:
require(tidyr)
require(dplyr)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
head(df, 10)
#> id year value
#> 1 1 2001 856
#> 2 1 2002 1850
#> 3 2 2001 1687
#> 4 2 2002 1902
#> 5 3 2001 1728
#> 6 3 2002 1773
#> 7 4 2001 691
#> 8 4 2002 1691
#> 9 5 2001 1368
#> 10 5 2002 893
df_wide <- df %>%
pivot_wider(names_from = year,
names_prefix = "profit_",
values_from = value,
values_fn = mean)
res <- df_wide %>%
mutate(evol_rate = (profit_2002/profit_2001-1)*100) %>%
round(2)
head(res, 10)
#> # A tibble: 10 x 4
#> id profit_2001 profit_2002 evol_rate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 856 1850 116.
#> 2 2 1687 1902 12.7
#> 3 3 1728 1773 2.6
#> 4 4 691 1691 145.
#> 5 5 1368 893 -34.7
#> 6 6 883 516 -41.6
#> 7 7 1280 1649 28.8
#> 8 8 1579 1383 -12.4
#> 9 9 1907 1626 -14.7
#> 10 10 1227 1134 -7.58
If you want to do it without reshaping your data into a wide format, you can use lag() within each group:
library(tidyverse)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
df %>% head(n = 10)
#> id year value
#> 1 1 2001 1173
#> 2 1 2002 1648
#> 3 2 2001 1560
#> 4 2 2002 1091
#> 5 3 2001 1736
#> 6 3 2002 667
#> 7 4 2001 1840
#> 8 4 2002 1202
#> 9 5 2001 1597
#> 10 5 2002 1797
new_df <- df %>%
group_by(id) %>%
mutate(ROC = ((value / lag(value) - 1) * 100))
new_df %>% head(n = 10)
#> # A tibble: 10 × 4
#> # Groups: id [5]
#> id year value ROC
#> <dbl> <dbl> <int> <dbl>
#> 1 1 2001 1173 NA
#> 2 1 2002 1648 40.5
#> 3 2 2001 1560 NA
#> 4 2 2002 1091 -30.1
#> 5 3 2001 1736 NA
#> 6 3 2002 667 -61.6
#> 7 4 2001 1840 NA
#> 8 4 2002 1202 -34.7
#> 9 5 2001 1597 NA
#> 10 5 2002 1797 12.5
This groups the data by id and then uses lag to compare the current year to the year prior
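As a quick check of the ROC formula on the four rows shown in the question, here is a minimal base R sketch. It assumes the data frame is called group (as in the question's own code) and that each firm has exactly one row per year. The column name evol_rate is taken from the question, and prev is a name introduced here for illustration.

```r
group <- data.frame(
  id     = c(1, 1, 2, 2),
  year   = c(2001, 2002, 2001, 2002),
  profit = c(-0.4, -0.89, 1.89, 2.79)
)

# Sort so that 2001 comes before 2002 within each firm
group <- group[order(group$id, group$year), ]

# Previous year's profit within each firm (NA for the first year)
prev <- ave(group$profit, group$id, FUN = function(x) c(NA, head(x, -1)))

# ROC = (current_value / previous_value - 1) * 100
group$evol_rate <- (group$profit / prev - 1) * 100
group$evol_rate
# NA 122.5 NA 47.61905
```

Note that firm 1 moves from -0.4 to -0.89, a larger loss, yet the formula returns +122.5 because the base value is negative. Rates computed from negative base values need to be read with that in mind.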
I'm having some trouble figuring out how to create a new column with the sum of 2 subsequent cells.
I have :
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
Now, I want a new column where the first line is the sum 1+2, the second line is the sum 1+2+3, the third line is the sum 1+2+3+4, and so on.
Since 1, 2, 3, 4... are hypothetical values, I need to measure the absolute growth from one decade to the next so that I can later create a new variable measuring the percentage change from one decade to the next.
library(tibble)
df1<- tibble(Years=c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
Values=c(1,2,3,4,5,6,7,8,9 ))
library(slider)
library(dplyr, warn.conflicts = F)
df1 %>%
mutate(xx = slide_sum(Values, after = 1, before = Inf))
#> # A tibble: 9 x 3
#> Years Values xx
#> <dbl> <dbl> <dbl>
#> 1 1990 1 3
#> 2 2000 2 6
#> 3 2010 3 10
#> 4 2020 4 15
#> 5 2030 5 21
#> 6 2050 6 28
#> 7 2060 7 36
#> 8 2070 8 45
#> 9 2080 9 45
Created on 2021-08-12 by the reprex package (v2.0.0)
Assuming the last row is to be repeated. Otherwise the fill part can be skipped.
library(dplyr)
library(tidyr)
df1 %>%
mutate(x = lead(cumsum(Values))) %>%
fill(x)
# Years Values x
# <dbl> <dbl> <dbl>
# 1 1990 1 3
# 2 2000 2 6
# 3 2010 3 10
# 4 2020 4 15
# 5 2030 5 21
# 6 2050 6 28
# 7 2060 7 36
# 8 2070 8 45
# 9 2080 9 45
Using base R
v1 <- cumsum(df1$Values)[-1]
df1$new <- c(v1, v1[length(v1)])
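To make the shift explicit, here is a small base R sketch of what the two lines above do, step by step. It uses a plain data.frame instead of a tibble so that it runs without extra packages. The names cs, shifted and filled are introduced here for illustration only.

```r
df1 <- data.frame(
  Years  = c(1990, 2000, 2010, 2020, 2030, 2050, 2060, 2070, 2080),
  Values = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

cs <- cumsum(df1$Values)             # 1 3 6 10 15 21 28 36 45
shifted <- c(cs[-1], NA)             # each row sees the next running total
filled  <- c(cs[-1], cs[length(cs)]) # same, with the last total repeated

df1$new <- filled
df1$new
# 3 6 10 15 21 28 36 45 45
```

Whether the last row should be NA (shifted) or repeat the final total (filled) depends on how the column is used afterwards. The question does not say, so both variants are shown.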
You want the cumsum() function. Here are two ways to do it.
### Base R
df1$cumsum <- cumsum(df1$Values)
### Using dplyr
library(dplyr)
df1 <- df1 %>%
mutate(cumsum = cumsum(Values))
Here is the output in either case.
df1
# A tibble: 9 x 3
Years Values cumsum
<dbl> <dbl> <dbl>
1 1990 1 1
2 2000 2 3
3 2010 3 6
4 2020 4 10
5 2030 5 15
6 2050 6 21
7 2060 7 28
8 2070 8 36
9 2080 9 45
A data.table option
library(data.table)
setDT(df1)[, newCol := shift(cumsum(Values), -1, fill = sum(Values))][]
Years Values newCol
1: 1990 1 3
2: 2000 2 6
3: 2010 3 10
4: 2020 4 15
5: 2030 5 21
6: 2050 6 28
7: 2060 7 36
8: 2070 8 45
9: 2080 9 45
or a base R option following a similar idea
transform(
df1,
newCol = c(cumsum(Values)[-1], sum(Values))
)
I have a dataframe like this:
X ID X1 X2 X3 X4 X5
BIL 1 1 2 7 1 5
Date 1 12.2 13.5 1.1 26.9 7.9
Year 1 2012 2013 2020 1999 2017
BIL 2 7 9 2 1 5
Date 2 12.2 13.5 1.1 26.9 7.9
Year 2 2022 2063 2000 1989 2015
BIL 3 1 2 7 1 5
Date 3 12.2 13.5 1.1 26.9 7.9
Year 3 2012 2013 2020 1999 2017
I would like to transform it so that I get a new df with BIL Date Year as column names and the values listed in the rows below for example
ID BIL Date Year
1 1 1 12.2 2012
2 1 2 13.5 2013
3 1 7
4 1 1
5 1 5
6 2 7 12.2 2022
7 2 9 13.5 2063
Any help would really be appreciated!
Edit: Is there any way to also add a grouping variable (ID), like the one I added above?
This strategy will work.
Create an ID column that increases by one every time the X column repeats its first value (here "BIL").
Transform into long format, dropping the unwanted column (your original variable names) with the names_to = NULL argument.
Transform back into wide format, this time using the correct variable names.
Collect multiple instances into list columns with the values_fn = list argument of pivot_wider.
Unnest all columns except ID.
df <- read.table(text = 'X X1 X2 X3 X4 X5
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017
BIL 7 9 2 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2022 2063 2000 1989 2015
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017', header = T)
library(tidyverse)
df %>% mutate(ID = cumsum(X == df[1,1])) %>%
pivot_longer(!c(X,ID), names_to = NULL) %>%
pivot_wider(id_cols = c(ID), names_from = X, values_from = value, values_fn = list) %>%
unnest(!ID)
#> # A tibble: 15 x 4
#> ID BIL Date Year
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 12.2 2012
#> 2 1 2 13.5 2013
#> 3 1 7 1.1 2020
#> 4 1 1 26.9 1999
#> 5 1 5 7.9 2017
#> 6 2 7 12.2 2022
#> 7 2 9 13.5 2063
#> 8 2 2 1.1 2000
#> 9 2 1 26.9 1989
#> 10 2 5 7.9 2015
#> 11 3 1 12.2 2012
#> 12 3 2 13.5 2013
#> 13 3 7 1.1 2020
#> 14 3 1 26.9 1999
#> 15 3 5 7.9 2017
Created on 2021-05-17 by the reprex package (v2.0.0)
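For readers who want to see the ID step in isolation, here is a minimal base R sketch of the same strategy. It assumes every block starts with the same label in X (here "BIL") and that the value columns are X1 to X5. The names blocks and result are introduced here for illustration, and the data are copied from the question.

```r
df <- data.frame(
  X  = rep(c("BIL", "Date", "Year"), 3),
  X1 = c(1, 12.2, 2012, 7, 12.2, 2022, 1, 12.2, 2012),
  X2 = c(2, 13.5, 2013, 9, 13.5, 2063, 2, 13.5, 2013),
  X3 = c(7, 1.1, 2020, 2, 1.1, 2000, 7, 1.1, 2020),
  X4 = c(1, 26.9, 1999, 1, 26.9, 1989, 1, 26.9, 1999),
  X5 = c(5, 7.9, 2017, 5, 7.9, 2015, 5, 7.9, 2017)
)

# A new block starts whenever X equals its first value, so the
# running count of those matches is the block ID: 1 1 1 2 2 2 3 3 3
df$ID <- cumsum(df$X == df$X[1])

# Transpose each block so that BIL/Date/Year become columns
blocks <- lapply(split(df, df$ID), function(b) {
  m <- t(as.matrix(b[, paste0("X", 1:5)]))
  colnames(m) <- b$X
  data.frame(ID = b$ID[1], m, row.names = NULL)
})

result <- do.call(rbind, blocks)
rownames(result) <- NULL
head(result, 6)
```

The result has 15 rows and the columns ID, BIL, Date and Year, in the same row order as the tidyr output above.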
This will also give you the same results:
df %>% mutate(ID = cumsum(X == df[1,1])) %>%
pivot_longer(!c(X,ID)) %>%
pivot_wider(id_cols = c(ID, name), names_from = X, values_from = value) %>%
select(-name)
Get the data in long format, create a unique row number for each value in X column and get it back in wide format.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -X) %>%
group_by(X) %>%
mutate(row = row_number()) %>%
ungroup %>%
pivot_wider(names_from = X, values_from = value) %>%
select(-row, -name)
# BIL Date Year
# <dbl> <dbl> <dbl>
# 1 1 12.2 2012
# 2 2 13.5 2013
# 3 7 1.1 2020
# 4 1 26.9 1999
# 5 5 7.9 2017
# 6 7 12.2 2022
# 7 9 13.5 2063
# 8 2 1.1 2000
# 9 1 26.9 1989
#10 5 7.9 2015
#11 1 12.2 2012
#12 2 13.5 2013
#13 7 1.1 2020
#14 1 26.9 1999
#15 5 7.9 2017
In data.table with melt + dcast
library(data.table)
dcast(melt(setDT(df), id.vars = 'X'), rowid(X)~X, value.var = 'value')
data
df <- structure(list(X = c("BIL", "Date", "Year", "BIL", "Date", "Year",
"BIL", "Date", "Year"), X1 = c(1, 12.2, 2012, 7, 12.2, 2022,
1, 12.2, 2012), X2 = c(2, 13.5, 2013, 9, 13.5, 2063, 2, 13.5,
2013), X3 = c(7, 1.1, 2020, 2, 1.1, 2000, 7, 1.1, 2020), X4 = c(1,
26.9, 1999, 1, 26.9, 1989, 1, 26.9, 1999), X5 = c(5, 7.9, 2017,
5, 7.9, 2015, 5, 7.9, 2017)), class = "data.frame", row.names = c(NA, -9L))
A base R option using reshape (first wide and then long)
p <- reshape(
transform(
df,
id = ave(X, X, FUN = seq_along)
),
direction = "wide",
idvar = "id",
timevar = "X"
)
q <- reshape(
setNames(p, gsub("(.*)\\.(.*)", "\\2.\\1", names(p))),
direction = "long",
idvar = "id",
varying = -1
)
and you will see
id time BIL Date Year
1.X1 1 X1 1 12.2 2012
2.X1 2 X1 7 12.2 2022
3.X1 3 X1 1 12.2 2012
1.X2 1 X2 2 13.5 2013
2.X2 2 X2 9 13.5 2063
3.X2 3 X2 2 13.5 2013
1.X3 1 X3 7 1.1 2020
2.X3 2 X3 2 1.1 2000
3.X3 3 X3 7 1.1 2020
1.X4 1 X4 1 26.9 1999
2.X4 2 X4 1 26.9 1989
3.X4 3 X4 1 26.9 1999
1.X5 1 X5 5 7.9 2017
2.X5 2 X5 5 7.9 2015
3.X5 3 X5 5 7.9 2017
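You can unlist the value columns, reshape them into a matrix, transpose it, coerce the result with as.data.frame, and take the column names from the first cells of X using setNames.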
setNames(as.data.frame(t(matrix(unlist(dat[-1]), 3, 15))), dat[1:3, 1])
# BIL Date Year
# 1 1 12.2 2012
# 2 7 12.2 2022
# 3 1 12.2 2012
# 4 2 13.5 2013
# 5 9 13.5 2063
# 6 2 13.5 2013
# 7 7 1.1 2020
# 8 2 1.1 2000
# 9 7 1.1 2020
# 10 1 26.9 1999
# 11 1 26.9 1989
# 12 1 26.9 1999
# 13 5 7.9 2017
# 14 5 7.9 2015
# 15 5 7.9 2017
For a less hard-coded version, use:
m <- length(unique(dat$X))
n <- ncol(dat[-1]) * nrow(dat) / m
setNames(as.data.frame(t(matrix(unlist(dat[-1]), m, n))), dat[1:m, 1])
Data:
dat <- read.table(header=T, text='X X1 X2 X3 X4 X5
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017
BIL 7 9 2 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2022 2063 2000 1989 2015
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017
')
UPD: Here is what I need.
Examples of some of the datasets are here (I have 8 of them):
https://drive.google.com/drive/folders/1gBV2ZkywW6JqDjRICafCwtYhh2DHWaUq?usp=sharing
What I need is:
For example, in those datasets there is lev variable. Let's say this is a snapshot of the data in these datasets:
ID Year lev
1 2011 0.19
1 2012 0.19
1 2013 0.21
1 2014 0.18
2 2013 0.39
2 2014 0.15
2 2015 0.47
2 2016 0.35
3 2013 0.30
3 2015 0.1
3 2017 0.13
3 2018 0.78
4 2011 0.13
4 2012 0.35
Now, in each of my datasets (EE_AB, EE_C, EE_H, etc.), I need to create variables ff1 and ff2, which are computed for each ID in each year relative to the median across all IDs in that particular year.
Let's take an example of the year 2011. The median of the variable lev in this dataset in 2011 is (0.19+0.13)/2 = 0.16, so ff1 for ID 1 in 2011 should be 0.19/0.16 = 1.1875, and for ID 4 in 2011 ff1 = 0.13/0.16 = 0.8125.
Now let's take the example of 2013. The median lev is 0.3. so ff1 for ID 1, 2, 3 will be 0.7, 1.3, 1 respectively.
The desired output should be the ff1 variable in each dataset (e.g., EE_AB, EE_C, EE_H) as:
ID Year lev ff1
1 2011 0.19 1.1875
1 2012 0.19 0.7037
1 2013 0.21 0.7
1 2014 0.18 1.0909
2 2013 0.39 1.3
2 2014 0.15 0.9091
2 2015 0.47 1.6491
2 2016 0.35 1
3 2013 0.30 1
3 2015 0.1 0.3509
3 2017 0.13 1
3 2018 0.78 1
4 2011 0.13 0.8125
4 2012 0.35 1.2963
And this should be in the same way for other dataframes.
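The worked example above (ff1 as lev divided by the median of lev across all IDs in the same year) can be sketched in base R as follows. This is only an illustration on the snapshot from the question: the data frame name EE is a placeholder, and ff2 is left out because the question does not define it.

```r
EE <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4),
  Year = c(2011, 2012, 2013, 2014, 2013, 2014, 2015, 2016,
           2013, 2015, 2017, 2018, 2011, 2012),
  lev  = c(0.19, 0.19, 0.21, 0.18, 0.39, 0.15, 0.47, 0.35,
           0.30, 0.1, 0.13, 0.78, 0.13, 0.35)
)

# ave() returns, for every row, the median of lev within that row's Year
year_median <- ave(EE$lev, EE$Year, FUN = median)

# ff1 is each firm's lev relative to that yearly median
EE$ff1 <- EE$lev / year_median
round(EE$ff1, 4)
```

Years with a single ID (2016, 2017 and 2018 in the snapshot) give ff1 = 1, because the median is then the value itself.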
Here's a tidyverse method:
library(dplyr)
# library(purrr)
data_frameAB %>%
group_by(Year) %>%
mutate(ff1 = (c+d) / purrr::map2_dbl(c, d, median)) %>%
ungroup()
# # A tibble: 14 x 5
# ID Year c d ff1
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2011 10 12 2.2
# 2 1 2012 11 13 2.18
# 3 1 2013 12 14 2.17
# 4 1 2014 13 15 2.15
# 5 1 2015 14 16 2.14
# 6 1 2016 15 34 3.27
# 7 1 2017 16 25 2.56
# 8 1 2018 17 26 2.53
# 9 1 2019 18 56 4.11
# 10 15 2015 23 38 2.65
# 11 15 2016 26 25 1.96
# 12 15 2017 30 38 2.27
# 13 45 2011 100 250 3.5
# 14 45 2012 200 111 1.56
Without purrr, that inner expression would be
mutate(ff1 = (c+d) / mapply(median, c, d))
albeit without the type safety of map2_dbl.
Since you have multiple frames in your data management, I have two suggestions:
Combine them into a list. This recommendation stems from the assumption that whatever you're doing to one frame you are likely to do to all three. In that case, you can use lapply or purrr::map on the list of frames, doing all frames in one step. See https://stackoverflow.com/a/24376207/3358227.
list_of_frames <- list(AB=data_frameAB, C=data_frameC, F=data_frameF)
list_of_frames2 <- purrr::map(
list_of_frames,
~ .x %>%
group_by(Year) %>%
mutate(ff1 = (c+d) / purrr::map2_dbl(c, d, median)) %>% ungroup()
)
Again, without purrr, that would be
list_of_frames2 <- lapply(
list_of_frames,
function(.x) group_by(.x, Year) %>%
mutate(ff1 = (c+d) / mapply(median, c, d)) %>%
ungroup()
)
Combine them into one frame, preserving the original data. Starting with list_of_frames,
bind_rows(list_of_frames, .id = "Frame") %>%
group_by(Frame, Year) %>%
mutate(ff1 = (c+d) / purrr::map2_dbl(c, d, median)) %>%
ungroup()
# # A tibble: 42 x 6
# Frame ID Year c d ff1
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 AB 1 2011 10 12 2.2
# 2 AB 1 2012 11 13 2.18
# 3 AB 1 2013 12 14 2.17
# 4 AB 1 2014 13 15 2.15
# 5 AB 1 2015 14 16 2.14
# 6 AB 1 2016 15 34 3.27
# 7 AB 1 2017 16 25 2.56
# 8 AB 1 2018 17 26 2.53
# 9 AB 1 2019 18 56 4.11
# 10 AB 15 2015 23 38 2.65
# # ... with 32 more rows
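The list-of-frames idea carries over to the lev variable from the updated question. Below is a minimal base R sketch under that assumption. EE_AB and EE_C here are tiny made-up stand-ins built from the question's snapshot rather than the real datasets, and add_ff1 is a hypothetical helper name.

```r
# Stand-in frames (illustrative only, not the real data)
EE_AB <- data.frame(ID = c(1, 4, 1, 4),
                    Year = c(2011, 2011, 2012, 2012),
                    lev = c(0.19, 0.13, 0.19, 0.35))
EE_C <- data.frame(ID = c(1, 2, 3),
                   Year = c(2013, 2013, 2013),
                   lev = c(0.21, 0.39, 0.30))

# Hypothetical helper: add lev relative to the yearly median of lev
add_ff1 <- function(d) {
  d$ff1 <- d$lev / ave(d$lev, d$Year, FUN = median)
  d
}

# Apply the same step to every frame in one go
frames <- lapply(list(EE_AB = EE_AB, EE_C = EE_C), add_ff1)
round(frames$EE_AB$ff1, 4)
# 1.1875 0.8125 0.7037 1.2963
```

Keeping the frames in a named list means the step is written once and the results stay addressable by name, for example frames$EE_C.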
I have a data frame as below
Bin_p1 and Bin_f1 were calculated with the cut function:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
Now, how do I calculate the mean of the previous two years within a bin and use it to replace the NA values?
For example, in row 5 p1 is NA and f1 is 30, with corresponding bin 7. The NA should be replaced with the mean of p1 over the previous two years for the same bin (7), i.e.:
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
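An explicit loop makes the matching rule easier to see. The sketch below is one possible reading of the question, not a definitive implementation: it assumes that a missing p1 is filled from earlier rows sharing the same Bin_f1 (and a missing f1 from rows sharing the same Bin_p1), and that "previous two years" means the last two earlier non-missing rows. fill_from_bin is a hypothetical helper name.

```r
df <- data.frame(
  ID     = 1:6,
  year   = c(2013, 2013, 2014, 2014, 2015, 2016),
  p1     = c(20, 24, 10, 11, NA, 10),
  f1     = c(30, 29, 16, 17, 30, NA),
  Bin_p1 = c(5, 5, 2, 2, NA, 2),
  Bin_f1 = c(7, 7, 3, 3, 7, NA)
)

# Hypothetical helper: replace each NA in `value` with the mean of the last
# two earlier, non-missing values whose `bin` matches the bin of that row
fill_from_bin <- function(value, bin) {
  out <- value
  for (i in which(is.na(value))) {
    if (is.na(bin[i])) next
    earlier <- which(seq_along(value) < i & bin == bin[i] & !is.na(value))
    if (length(earlier) > 0) out[i] <- mean(value[tail(earlier, 2)])
  }
  out
}

df <- df[order(df$year), ]                  # earlier years first
df$p1 <- fill_from_bin(df$p1, df$Bin_f1)    # row 5: mean(20, 24) = 22
df$f1 <- fill_from_bin(df$f1, df$Bin_p1)    # row 6: mean(16, 17) = 16.5
df
```

Rows whose bin is itself NA, or that have no earlier match, are left as NA.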