I have a data frame df as follows:
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3),
year=c(2011,2012,2013,2010,2011,2012,2013,2012,2013),
points=c(45,69,79,53,13,12,11,89,91),
result = c(2,3,5,4,6,1,2,3,4))
But I want to make df as below:
df <- data.frame(id = c(1,1,2,2,2,3),
year=c(2011,2012,2010,2011,2012,2012),
points=c(45,69,53,13,12,89),
result = c(3,5,6,1,2,4))
Here, I want to run a regression with result as the response variable. Since I want to predict result, I have to shift result forward in time and leave the independent variable points as it is. So, in my regression setting, result is the response variable and points is the independent variable. In summary, I want to apply a time lag to result: each row should carry the result of the following year. Within each id, the last row should be removed because it has no next result.
I simplified my problem for demonstration purposes. Is there any way to achieve this using R?
Tidyverse solution:
library(tidyverse)
df %>% group_by(id) %>% mutate(lead_result = lead(result)) %>% na.exclude
# A tibble: 6 x 5
# Groups: id [3]
id year points result lead_result
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2011 45 2 3
2 1 2012 69 3 5
3 2 2010 53 4 6
4 2 2011 13 6 1
5 2 2012 12 1 2
6 3 2012 89 3 4
A data.table solution:
library(data.table)
na.omit(setDT(df)[, result := shift(result, type = "lead"), by = "id"], "result")
Output
id year points result
<num> <num> <num> <num>
1: 1 2011 45 3
2: 1 2012 69 5
3: 2 2010 53 6
4: 2 2011 13 1
5: 2 2012 12 2
6: 3 2012 89 4
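For comparison, the same shift can be sketched in base R without extra packages. This is only a minimal sketch: it assumes the rows are already sorted by year within each id, and the name lagged is illustrative rather than taken from the question.

```r
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3),
                 year = c(2011,2012,2013,2010,2011,2012,2013,2012,2013),
                 points = c(45,69,79,53,13,12,11,89,91),
                 result = c(2,3,5,4,6,1,2,3,4))

# Within each id, move result up by one row; the last row of each id gets NA
df$result <- ave(df$result, df$id, FUN = function(x) c(x[-1], NA))

# Drop the rows that have no "next" result and renumber the rows
lagged <- df[!is.na(df$result), ]
rownames(lagged) <- NULL
lagged
```

If the data are not sorted, ordering with df[order(df$id, df$year), ] first would be needed for the shift to mean "next year".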
I have an issue regarding a certain kind of mean() calculation. I use a panel data set with two identifiers, "ID" and "year" (using the plm package).
I want to calculate the groupwise mean of a variable "y" but omit the first year's entry from the calculation, and then fill in the calculated mean only in the years that were used to calculate it. In other words, I want to have NA in every ID's first entry of this variable.
The panel data is unbalanced, so people come and go at different points in time. Some stay from beginning to end; for others I have data for just three years.
library(tidyverse)
library(plm)
ID <- c("a","a","a","a","a","b","b","b","b","c","c","c")
y <- c(9,2,5,3,3,9,1,2,3,9,2,5)
year<- c(2001,2002,2003,2004,2005,2001,2002,2003,2004,2002,2003,2004)
dt <- data.frame(ID,y,year)
dt <- pdata.frame(dt, index = c("ID","year"))
I first tried a filter over periods like so:
dt <- dt %>% group_by(ID) %>%
filter(year %in% first(year)+1:last(year)) %>%
mutate(mean.y = mean(y))
But that doesn't work, which does not surprise me, to be honest, but I hope it shows what I want to achieve. The final result should look like this:
See how the first entry of variable y = 9 for "a-2001" is left out so that it doesn't affect the mean of individual a's other y entries: (2+5+3+3)/4 = 3.25.
I hope you can understand it. I would massively appreciate any help.
Bye
We could work with an ifelse inside mutate. It's more code, but I think it's quite readable and easy to understand what's going on.
library(tidyverse)
library(plm)
dt %>%
group_by(ID) %>%
mutate(mean.y = ifelse(year == first(year),
NA,
mean(y[year != first(year)], na.rm = TRUE)))
#> # A tibble: 12 x 4
#> # Groups: ID [3]
#> ID y year mean.y
#> <fct> <dbl> <fct> <dbl>
#> 1 a 9 2001 NA
#> 2 a 2 2002 3.25
#> 3 a 5 2003 3.25
#> 4 a 3 2004 3.25
#> 5 a 3 2005 3.25
#> 6 b 9 2001 NA
#> 7 b 1 2002 2
#> 8 b 2 2003 2
#> 9 b 3 2004 2
#> 10 c 9 2002 NA
#> 11 c 2 2003 3.5
#> 12 c 5 2004 3.5
Created on 2022-01-23 by the reprex package (v0.3.0)
Here is a dplyr solution. You can calculate the mean of all values except the first one and then use the is.na<- replacement function to set the first element of mean.y to NA.
library(dplyr)
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L]), mean.y = `is.na<-`(mean.y, 1L))
Output
# A tibble: 12 x 4
# Groups: ID [3]
ID y year mean.y
<chr> <dbl> <dbl> <dbl>
1 a 9 2001 NA
2 a 2 2002 3.25
3 a 5 2003 3.25
4 a 3 2004 3.25
5 a 3 2005 3.25
6 b 9 2001 NA
7 b 1 2002 2
8 b 2 2003 2
9 b 3 2004 2
10 c 9 2002 NA
11 c 2 2003 3.5
12 c 5 2004 3.5
More compactly,
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L])[n():1 %/% n() + 1L])
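To see why the index trick in the compact version works, here is a minimal sketch for a single group, using the five y values of ID "a". The names n, idx and m are only illustrative; inside mutate, n() plays the role of n.

```r
y <- c(9, 2, 5, 3, 3)
n <- length(y)

# ":" binds tighter than %/%, so this is (n:1) %/% n + 1.
# Only the first element (n %/% n) is 1; all others are 0.
idx <- n:1 %/% n + 1L        # 2 1 1 1 1

# m has length one, so m[1] is the mean and m[2] is out of range, i.e. NA
m <- mean(y[-1L])
mean.y <- m[idx]
mean.y                       # NA 3.25 3.25 3.25 3.25
```

The first row therefore picks the out-of-range element (NA) and every other row picks the group mean.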
In the past, when I've needed to create a new variable in an R data frame that is partly based on a 'group_by' summary statistic, I've always used the following sequence:
(1) calculate 'group stats' from data in the base (ungrouped) data frame using group_by() and summarize()
(2) join the base data frame with the result of the previous step, then calculate the new variable value using mutate.
However, (after years of using dplyr!) I accidentally did the 'summarizing' in a mutate step and everything seemed to work. This is illustrated in Option #2 in the code snippet below. I'm assuming Option #2 is okay because I'm getting identical results using both options, and because I found similar examples searching the web today. However, I wasn't sure.
Is Option #2 acceptable practice, or is Option #1 preferred (and if so why)?
set.seed(123)
df <- tibble(year_ = c(rep(c(2019), 4), rep(c(2020), 4)),
qtr_ = c(rep(c(1,2,3,4), 2)),
foo = sample(seq(1:8)))
# Option 1: calc statistics then rejoin with input data
df_stats <- df %>%
group_by(year_) %>%
summarize(mean_foo = mean(foo))
df_with_stats <- left_join(df, df_stats) %>%
mutate(dfoo = foo - mean_foo)
# Option 2: everything in one go
df_with_stats2 <- df %>%
group_by(year_) %>%
mutate(mean_foo = mean(foo),
dfoo = foo - mean_foo)
df_with_stats
# A tibble: 8 x 5
year_ qtr_ foo mean_foo dfoo
<dbl> <dbl> <int> <dbl> <dbl>
1 2019 1 7 6 1
2 2019 2 8 6 2
3 2019 3 3 6 -3
4 2019 4 6 6 0
5 2020 1 2 3 -1
6 2020 2 4 3 1
7 2020 3 5 3 2
8 2020 4 1 3 -2
df_with_stats2
# A tibble: 8 x 5
# Groups: year_ [2]
year_ qtr_ foo mean_foo dfoo
<dbl> <dbl> <int> <dbl> <dbl>
1 2019 1 7 6 1
2 2019 2 8 6 2
3 2019 3 3 6 -3
4 2019 4 6 6 0
5 2020 1 2 3 -1
6 2020 2 4 3 1
7 2020 3 5 3 2
8 2020 4 1 3 -2
Option 2 is fine, if you don't need the intermediate object anyway, and you don't even need to create mean_foo in your mutate statement:
df %>% group_by(year_) %>% mutate(dfoo=foo-mean(foo))
Also, with data.table:
library(data.table)
setDT(df)[, dfoo := foo - mean(foo), by = year_]
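One practical difference between the two options is that the result of Option 2 stays grouped (note the "# Groups: year_ [2]" line in its printout). The sketch below, which rebuilds the example data and uses the illustrative names opt1 and opt2, checks that both options agree once the grouping is dropped with ungroup().

```r
library(dplyr)

set.seed(123)
df <- tibble(year_ = rep(c(2019, 2020), each = 4),
             qtr_  = rep(c(1, 2, 3, 4), 2),
             foo   = sample(1:8))

# Option 1: summarise per group, then join back (join key stated explicitly)
opt1 <- df %>%
  left_join(df %>% group_by(year_) %>% summarize(mean_foo = mean(foo)),
            by = "year_") %>%
  mutate(dfoo = foo - mean_foo)

# Option 2: grouped mutate, then drop the grouping so later steps are not grouped
opt2 <- df %>%
  group_by(year_) %>%
  mutate(mean_foo = mean(foo), dfoo = foo - mean_foo) %>%
  ungroup()

# Compare as plain data frames
same <- isTRUE(all.equal(as.data.frame(opt1), as.data.frame(opt2)))
same
```

Forgetting ungroup() does not change these numbers, but any later mutate or summarise on the result would still be computed per year_.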
I have a dataframe that looks like this:
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"), `2019` = c(1,2,3,4,5,6),`2020` = c(2,3,4,5,6,7), `2021` = c(3,4,5,6,7,8))
ID Type X2019 X2020 X2021
1 1 A 1 2 3
2 2 A 2 3 4
3 3 B 3 4 5
4 4 B 4 5 6
5 5 C 5 6 7
6 6 C 6 7 8
Now, I'm looking for some code that does the following:
1. Create a new data.frame for every row in df
2. Name each new data frame with a combination of "Type" and "ID" (A_1, A_2, ... , C_6)
The resulting new dataframes should look like this (example for A_1, A_2 and C_6):
Year Values
1 2019 1
2 2020 2
3 2021 3
Year Values
1 2019 2
2 2020 3
3 2021 4
Year Values
1 2019 6
2 2020 7
3 2021 8
I have some things that somehow complicate the code:
1. The code should work in the next few years without any changes, meaning next year the data.frame df will no longer contain the years 2019-2021, but rather 2020-2022.
2. As the data.frame df is only a minimal reproducible example, I need some kind of loop. In the "real" data, I have a lot more rows and therefore a lot more dataframes to be created.
Unfortunately, I can't give you any code, as I have absolutely no idea how I could manage that.
While researching, I found the following code that may help address the first problem with the changing years:
year <- as.numeric(format(Sys.Date(), "%Y"))
Further, I read about lists, and that it may help to work with a list in a for loop and then transform the list back into a data frame. Sorry for my limited approach; I hope someone can give me a hint or even the solution to my problem. If you need any further information, please let me know. Thanks in advance!
A kind of similar question to mine:
Populating a data frame in R in a loop
Try this:
library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)
df %>%
gather(Year, Values, 3:5) %>%
mutate(Year = str_sub(Year, 2)) %>%
select(ID, Year, Values) %>%
group_split(ID) # split(.$ID)
# [[1]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 1 2019 1
# 2 1 2020 2
# 3 1 2021 3
#
# [[2]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 2 2019 2
# 2 2 2020 3
# 3 2 2021 4
#
# [[3]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 3 2019 3
# 2 3 2020 4
# 3 3 2021 5
#
# [[4]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 4 2019 4
# 2 4 2020 5
# 3 4 2021 6
#
# [[5]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 5 2019 5
# 2 5 2020 6
# 3 5 2021 7
#
# [[6]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 6 2019 6
# 2 6 2020 7
# 3 6 2021 8
Data
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"), `2019` = c(1,2,3,4,5,6),`2020` = c(2,3,4,5,6,7), `2021` = c(3,4,5,6,7,8))
library(magrittr)
library(tidyr)
library(dplyr)
library(stringr)
names(df) <- str_replace_all(names(df), "X", "") #remove X's from year names
df %>%
gather(Year, Values, 3:5) %>%
select(ID, Year, Values) %>%
group_split(ID)
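The answers above return an unnamed list and refer to the year columns by position (3:5). As a possible complement, here is a base R sketch that names each element Type_ID and treats every column other than ID and Type as a year column, so it should keep working when the years change. The names year_cols and frames are illustrative, not from the question.

```r
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"),
                 `2019` = c(1,2,3,4,5,6), `2020` = c(2,3,4,5,6,7),
                 `2021` = c(3,4,5,6,7,8))

# Assumption: every column except ID and Type holds one year
year_cols <- setdiff(names(df), c("ID", "Type"))

# One small data frame per row of df; the "X" that data.frame() adds is stripped
frames <- lapply(seq_len(nrow(df)), function(i) {
  data.frame(Year   = as.integer(sub("^X", "", year_cols)),
             Values = unlist(df[i, year_cols], use.names = FALSE))
})
names(frames) <- paste(df$Type, df$ID, sep = "_")

frames$C_6
```

Keeping the data frames in one named list (frames$A_1, frames[["C_6"]]) is usually easier to work with than creating many separate objects in the workspace.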
A simplification of the dataframe which I'm working is:
> df1
Any nomMun
1 2010 CADAQUES
2 2011 CADAQUES
3 2012 CADAQUES
4 2010 BEGUR
5 2011 BEGUR
6 2012 BEGUR
I've been reading some posts and found that count from the plyr library returns a data frame with the strings and their frequencies. But I want the frequency by year. The final result I want to obtain is a data frame like:
> df2
nomMun freq_2010 freq_2011 freq_2012
1 CADAQUES 1 1 1
2 BEGUR 1 1 1
Could anyone help me?
Sorry if my explanation is bad... I'm a non-native speaker and it's my first time asking here...
In data.table, simply use .N:
setDT(df1)
df1[, .N, .(nomMun, Any)]
This will give you the data in long format. In other words, it will look like:
nomMun Any N
CADAQUES 2010 1
CADAQUES 2011 1
CADAQUES 2012 1
BEGUR 2010 1
BEGUR 2011 1
BEGUR 2012 1
But then you can dcast it if you'd like:
dcast(df1[, .N, .(nomMun, Any)], nomMun ~ Any, value.var = "N")
Seems silly to load a package when base R includes the table function.
> table(df1)
nomMun
Any BEGUR CADAQUES
2010 1 1
2011 1 1
2012 1 1
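The table above has the years in rows. To get closer to the layout asked for (one row per nomMun, columns freq_2010 and so on), the table can be converted to a data frame. This is a minimal base R sketch; it rebuilds df1 from the question, and the names tab and wide are illustrative.

```r
df1 <- data.frame(Any = rep(2010:2012, 2),
                  nomMun = rep(c("CADAQUES", "BEGUR"), each = 3))

# Cross-tabulate with nomMun in rows and year in columns
tab <- table(df1$nomMun, df1$Any)

# Turn the table into a data frame and prefix the year columns
wide <- as.data.frame.matrix(tab)
names(wide) <- paste0("freq_", names(wide))

# Move the row names into a regular column
df2 <- cbind(nomMun = rownames(wide), wide)
rownames(df2) <- NULL
df2
```

The rows come out in alphabetical order of nomMun (BEGUR before CADAQUES), not in order of first appearance.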
tidyr::spread can be used to get the desired output:
library(tidyverse)
df1 %>%
group_by(nomMun, Any) %>%
# summarise (rather than mutate) leaves one row per nomMun/Any pair, which spread requires
summarise(freq = n()) %>%
spread(Any, freq)
# # A tibble: 2 x 4
# # Groups: nomMun [2]
# nomMun `2010` `2011` `2012`
# * <chr> <int> <int> <int>
# 1 BEGUR 1 1 1
# 2 CADAQUES 1 1 1
Sample data:
df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
new.employee=c(4,6,2,6,23,2,5,34))
id year month new.employee
1 A 2014 1 4
2 A 2014 2 6
3 A 2015 1 2
4 A 2015 2 6
5 B 2014 1 23
6 B 2014 2 2
7 B 2015 1 5
8 B 2015 2 34
Desired outcome:
desired_df <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
new.employee=c(4,6,2,6,23,2,5,34),
new.employee.rank=c(1,1,2,2,2,2,1,1))
id year month new.employee new.employee.rank
1 A 2014 1 4 1
2 A 2014 2 6 1
3 A 2015 1 2 2
4 A 2015 2 6 2
5 B 2014 1 23 2
6 B 2014 2 2 2
7 B 2015 1 5 1
8 B 2015 2 34 1
The ranking rule is: I choose month 2 in each year to rank number of new employees between A and B. Then I need to give those ranks to month 1. i.e., month 1 of each year rankings must be equal to month 2 ranking in the same year.
I tried this code to get rankings for each month and each year:
library(data.table)
df1 <- data.table(df1)
df1[,rank:=rank(new.employee), by=c("year","month")]
If anyone can roll the rank value within the column, replacing the rank of month 1 with the rank of month 2, that might be a solution.
You've tried a data.table solution, so here's how I would do this using data.table:
library(data.table) # V1.9.6+
temp <- setDT(df1)[month == 2L, .(id, frank(-new.employee)), by = year]
df1[temp, new.employee.rank := i.V2, on = c("year", "id")]
df1
# id year month new.employee new.employee.rank
# 1: A 2014 1 4 1
# 2: A 2014 2 6 1
# 3: A 2015 1 2 2
# 4: A 2015 2 6 2
# 5: B 2014 1 23 2
# 6: B 2014 2 2 2
# 7: B 2015 1 5 1
# 8: B 2015 2 34 1
It is somewhat similar to the dplyr solution: it basically ranks the ids per year (using month 2 only) and joins the ranks back to the original data set. I'm using data.table V1.9.6+ here.
Here's a dplyr-based solution. The idea is to reduce the data to the parts you want to compare, make the comparison, then join the results back into the original data set, expanding it to fill all of the relevant slots. Note the edits to your code for creating the sample data.
df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=rep(c(2014,2014,2015,2015), 2),
month=rep(c(1,2), 4),
new.employee=c(4,6,2,6,23,2,5,34))
library(dplyr)
df1 %>%
# Reduce the data to the slices (months) you want to compare
filter(month==2) %>%
# Group the data by year, so the comparisons are within and not across years
group_by(year) %>%
# Create a variable that indicates the rankings within years in descending order
mutate(rank = rank(-new.employee)) %>%
# To prepare for merging, reduce the new data to just that ranking var plus id and year
select(id, year, rank) %>%
# Use left_join to merge the new data (.) with the original df, expanding the
# new data to fill all rows with id-year matches
left_join(df1, .) %>%
# Order the data by id, year, and month to make it easier to review
arrange(id, year, month)
Output:
Joining by: c("id", "year")
id year month new.employee rank
1 A 2014 1 4 1
2 A 2014 2 6 1
3 A 2015 1 2 2
4 A 2015 2 6 2
5 B 2014 1 23 2
6 B 2014 2 2 2
7 B 2015 1 5 1
8 B 2015 2 34 1
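The same reduce, rank, and join-back idea can also be sketched in base R with ave() and merge(). This is only an illustration using the corrected sample data; the names m2 and out are not from the original posts.

```r
df1 <- data.frame(id = rep(c("A", "B"), each = 4),
                  year = rep(c(2014, 2014, 2015, 2015), 2),
                  month = rep(c(1, 2), 4),
                  new.employee = c(4, 6, 2, 6, 23, 2, 5, 34))

# Keep only month 2, the month used for ranking
m2 <- df1[df1$month == 2, ]

# Rank within each year, largest new.employee first
m2$new.employee.rank <- ave(-m2$new.employee, m2$year, FUN = rank)

# Join the ranks back so both months of an id-year share the month-2 rank
out <- merge(df1, m2[c("id", "year", "new.employee.rank")], by = c("id", "year"))
out <- out[order(out$id, out$year, out$month), ]
rownames(out) <- NULL
out
```

With ties, rank() returns averaged ranks by default; a ties.method argument could be passed through ave() if a different rule is wanted.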