How do I duplicate and add rows between the values of two different columns? - r

I'm trying to duplicate each observation for all of the years that fall between "styear" and "endyear." So, for example, there should end up being 119 USA rows, one for each year from 1898 to 2016.
Here's the data:
# A tibble: 14 x 9
stateabb ccode styear stmonth stday endyear endmonth endday version
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 USA 2 1898 8 13 2016 12 31 2016
2 UKG 200 1816 1 1 2016 12 31 2016
3 FRN 220 1816 1 1 1940 6 22 2016
4 FRN 220 1945 8 15 2016 12 31 2016
5 GMY 255 1816 1 1 1918 11 11 2016
6 GMY 255 1925 1 1 1945 5 7 2016
7 GMY 255 1991 12 11 2016 12 31 2016
8 AUH 300 1816 1 1 1918 11 3 2016
9 ITA 325 1860 1 1 1943 9 2 2016
10 RUS 365 1816 1 1 1917 12 5 2016
11 RUS 365 1922 1 1 2016 12 31 2016
12 CHN 710 1950 1 1 2016 12 31 2016
13 JPN 740 1895 4 1 1945 8 14 2016
14 JPN 740 1991 12 11 2016 12 31 2016
I have tried various combinations of slice(), seq(), and complete(), but I think I'm just too new at coding to a) know what to do and b) really understand other answers to similar questions.
Ultimately, I am merging this data with other data and creating a 0/1 dummy variable to indicate whether a country was a "great power" in a given year. The easiest way I could think of was to create individual rows for each year a country was a great power (the data in this question), because the data I am merging it with is also in country-year format. I am open to other options, though, if something else is easier!
Thank you!

I think tidyr::expand() and full_seq() can achieve what you want, with grouping on stateabb and styear since you have multiple start years for some states.
Assuming your data frame is named mydata, something like this should work. I have retained the column of expanded years and named it filled_year, but you may want to remove it.
library(dplyr)
library(tidyr)
new_data <- mydata %>%
  group_by(stateabb, styear) %>%
  tidyr::expand(stateabb, full_seq(c(styear, endyear), 1)) %>%
  inner_join(mydata) %>%
  rename(filled_year = `full_seq(c(styear, endyear), 1)`) %>%
  ungroup()
The top and bottom of the USA rows:
new_data %>%
  filter(stateabb == "USA") %>%
  head()
# A tibble: 6 x 10
styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
<int> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 1898 USA 1898 2 8 13 2016 12 31 2016
2 1898 USA 1899 2 8 13 2016 12 31 2016
3 1898 USA 1900 2 8 13 2016 12 31 2016
4 1898 USA 1901 2 8 13 2016 12 31 2016
5 1898 USA 1902 2 8 13 2016 12 31 2016
6 1898 USA 1903 2 8 13 2016 12 31 2016
new_data %>%
  filter(stateabb == "USA") %>%
  tail()
# A tibble: 6 x 10
styear stateabb filled_year ccode stmonth stday endyear endmonth endday version
<int> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 1898 USA 2011 2 8 13 2016 12 31 2016
2 1898 USA 2012 2 8 13 2016 12 31 2016
3 1898 USA 2013 2 8 13 2016 12 31 2016
4 1898 USA 2014 2 8 13 2016 12 31 2016
5 1898 USA 2015 2 8 13 2016 12 31 2016
6 1898 USA 2016 2 8 13 2016 12 31 2016
Your example data:
mydata <- structure(list(stateabb = c("USA", "UKG", "FRN", "FRN", "GMY",
"GMY", "GMY", "AUH", "ITA", "RUS", "RUS", "CHN", "JPN", "JPN"
), ccode = c(2L, 200L, 220L, 220L, 255L, 255L, 255L, 300L, 325L,
365L, 365L, 710L, 740L, 740L), styear = c(1898L, 1816L, 1816L,
1945L, 1816L, 1925L, 1991L, 1816L, 1860L, 1816L, 1922L, 1950L,
1895L, 1991L), stmonth = c(8L, 1L, 1L, 8L, 1L, 1L, 12L, 1L, 1L,
1L, 1L, 1L, 4L, 12L), stday = c(13L, 1L, 1L, 15L, 1L, 1L, 11L,
1L, 1L, 1L, 1L, 1L, 1L, 11L), endyear = c(2016L, 2016L, 1940L,
2016L, 1918L, 1945L, 2016L, 1918L, 1943L, 1917L, 2016L, 2016L,
1945L, 2016L), endmonth = c(12L, 12L, 6L, 12L, 11L, 5L, 12L,
11L, 9L, 12L, 12L, 12L, 8L, 12L), endday = c(31L, 31L, 22L, 31L,
11L, 7L, 31L, 3L, 2L, 5L, 31L, 31L, 14L, 31L), version = c(2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14"))
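If you'd rather avoid tidyr, the same expansion can be sketched in base R with Map() and seq(): build one year sequence per row, repeat each row by the length of its sequence, and attach the years. Shown here on a hypothetical two-row subset named mini; the full mydata works the same way.

```r
# Base-R sketch of the expansion, on a two-row subset of the data above
mini <- data.frame(stateabb = c("USA", "UKG"),
                   styear   = c(1898L, 1816L),
                   endyear  = c(2016L, 2016L))

years <- Map(seq, mini$styear, mini$endyear)           # one year sequence per row
expanded <- mini[rep(seq_len(nrow(mini)), lengths(years)), ]
expanded$filled_year <- unlist(years)                  # 119 USA rows, 201 UKG rows
```

rep(..., lengths(years)) repeats each original row once per year in its range, so the other columns come along for free.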

My guess is that there is a better way to do this, but here is one way with a small reduced data set. The steps are:
Make a minimal dataset.
Make a function that creates the sequence of years that you want.
Split the dataframe into a list of small dataframes by country with dplyr's group_split().
Apply a function using purrr's map() that turns each small dataframe into a new expanded-date dataframe.
Bind the list of dataframes back into one dataframe.
library(dplyr)
library(purrr)
df <- data.frame(
  stringsAsFactors = FALSE,
  stateabb = c("USA", "UKG"),
  styear = c(1898L, 1816L),
  endyear = c(2016L, 2016L)
)
expand_dates <- function(df) {
  stateabb <- df %>% pluck("stateabb")
  styear <- df %>% pluck("styear")
  endyear <- df %>% pluck("endyear")
  years <- seq(styear, endyear)
  data.frame(years) %>%
    mutate(stateabb = stateabb, styear = styear, endyear = endyear)
}
df_new <- df %>%
  group_split(stateabb) %>%
  map(expand_dates) %>%
  bind_rows()
head(df_new)
#> years stateabb styear endyear
#> 1 1816 UKG 1816 2016
#> 2 1817 UKG 1816 2016
#> 3 1818 UKG 1816 2016
#> 4 1819 UKG 1816 2016
#> 5 1820 UKG 1816 2016
#> 6 1821 UKG 1816 2016
Created on 2022-01-13 by the reprex package (v2.0.1)
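The split → map → bind pattern above also has a direct base-R translation with split(), lapply(), and do.call(rbind, ...), in case you want to drop the purrr dependency (same toy df as above):

```r
# Same pattern without purrr: split(), lapply(), then rbind the pieces
df <- data.frame(stateabb = c("USA", "UKG"),
                 styear   = c(1898L, 1816L),
                 endyear  = c(2016L, 2016L))

expand_dates <- function(d) {
  data.frame(years = seq(d$styear, d$endyear),
             stateabb = d$stateabb, styear = d$styear, endyear = d$endyear)
}

# split() orders groups alphabetically, so UKG comes first here, as in the reprex
df_new <- do.call(rbind, lapply(split(df, df$stateabb), expand_dates))
```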

Related

How to calculate year to date and 12 months variations in r

I have data for multiple years (in this example only 3) and need to calculate year-to-date and 12-month variations in R. The year-to-date variations are defined as follows:
Year-to-date variation January 2021 = Value January 2021 / Value January 2020
Year-to-date variation February 2021 = Sum(Value January 2021:February 2021) / Sum(Value January 2020:February 2020)
Year-to-date variation March 2021 = Sum(Value January 2021:March 2021) / Sum(Value January 2020:March 2020)
The 12-month variations are defined as follows:
12-month variation January 2021 = Sum(Value February 2020:Value January 2021) / Sum(Value February 2019:Value January 2020)
12-month variation February 2021 = Sum(Value March 2020:Value February 2021) / Sum(Value March 2019:Value February 2020)
Year  Month  Value  Year-to-date variation  12-month variation
2019      1    182
2019      2    160
2019      3    170
2019      4    123
2019      5    165
2019      6    153
2019      7    152
2019      8    182
2019      9    156
2019     10    141
2019     11    161
2019     12    193
2020      1    143  0,785714285714286
2020      2    138  0,821637426900585
2020      3    113  0,76953125
2020      4    127  0,820472440944882
2020      5    107  0,785
2020      6    185  0,853095487932844
2020      7    125  0,848868778280543
2020      8    109  0,813519813519814
2020      9    167  0,841302841302841
2020     10    128  0,847222222222222
2020     11    139  0,848710601719198
2020     12    173  0,853457172342621
2021      1    111  0,776223776223776       0,854133754607688
2021      2    169  0,99644128113879        0,88066062866276
2021      3    125  1,02791878172589        0,914835164835165
2021      4    134  1,03454894433781        0,916666666666667
2021      5    112  1,03662420382166        0,94960362400906
2021      6    114  0,940959409594096       0,89321468298109
2021      7    171  0,997867803837953       0,932806324110672
2021      8    161  1,04775549188157        1,00353356890459
2021      9    195  1,06425041186161        1,01345816266823
2021     10    150  1,07451564828614        1,03419811320755
2021     11    181  1,09588116137745        1,07287933094385
2021     12    163  1,07980652962515        1,07980652962515
You can do this by leveraging data.table::shift().
library(data.table)
setDT(df)
df <- df[order(Month, Year), p := shift(Value), Month] %>%
  .[, ytd_var := cumsum(Value) / cumsum(p), Year] %>%
  .[, `:=`(mon_var = Reduce(`+`, shift(Value, 0:11)) / Reduce(`+`, shift(Value, 12:23)), p = NULL)]
Explanation:
First, get the 12-month prior value (p) for each month, by ordering by Month and Year, and using shift(Value), by Month.
Then, create the year-to-date value (ytd_var) by dividing the cumulative sum of Value by the cumulative sum of p, by Year.
Finally, create the 12-month variation value (mon_var) by dividing the sum of Value in the current and prior 11 months by the sum of Value in months 12 through 23 prior to the current month. Notice that I use Reduce() to sum over all these prior values.
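As a sanity check on those definitions, the first few values in the output can be reproduced with plain arithmetic (numbers taken from the question's table):

```r
# Hand-checking the formulas, no data.table needed
v2019 <- c(182, 160, 170, 123, 165, 153, 152, 182, 156, 141, 161, 193)
v2020 <- c(143, 138, 113, 127, 107, 185, 125, 109, 167, 128, 139, 173)

ytd_jan_2020 <- 143 / 182                    # Value Jan 2020 / Value Jan 2019
ytd_feb_2020 <- (143 + 138) / (182 + 160)    # cumulative sums through February
mon_var_dec_2020 <- sum(v2020) / sum(v2019)  # trailing 12 months vs the 12 before
```

These match 0.7857143, 0.8216374, and 0.8534572 in rows 13, 14, and 24 of the output.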
Output:
Year Month Value ytd_var mon_var
1: 2019 1 182 NA NA
2: 2019 2 160 NA NA
3: 2019 3 170 NA NA
4: 2019 4 123 NA NA
5: 2019 5 165 NA NA
6: 2019 6 153 NA NA
7: 2019 7 152 NA NA
8: 2019 8 182 NA NA
9: 2019 9 156 NA NA
10: 2019 10 141 NA NA
11: 2019 11 161 NA NA
12: 2019 12 193 NA NA
13: 2020 1 143 0.7857143 NA
14: 2020 2 138 0.8216374 NA
15: 2020 3 113 0.7695312 NA
16: 2020 4 127 0.8204724 NA
17: 2020 5 107 0.7850000 NA
18: 2020 6 185 0.8530955 NA
19: 2020 7 125 0.8488688 NA
20: 2020 8 109 0.8135198 NA
21: 2020 9 167 0.8413028 NA
22: 2020 10 128 0.8472222 NA
23: 2020 11 139 0.8487106 NA
24: 2020 12 173 0.8534572 0.8534572
25: 2021 1 111 0.7762238 0.8541338
26: 2021 2 169 0.9964413 0.8806606
27: 2021 3 125 1.0279188 0.9148352
28: 2021 4 134 1.0345489 0.9166667
29: 2021 5 112 1.0366242 0.9496036
30: 2021 6 114 0.9409594 0.8932147
31: 2021 7 171 0.9978678 0.9328063
32: 2021 8 161 1.0477555 1.0035336
33: 2021 9 195 1.0642504 1.0134582
34: 2021 10 150 1.0745156 1.0341981
35: 2021 11 181 1.0958812 1.0728793
36: 2021 12 163 1.0798065 1.0798065
Input:
df = structure(list(Year = c(2019L, 2019L, 2019L, 2019L, 2019L, 2019L,
2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 2020L, 2020L, 2020L,
2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L), Month = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), Value = c(182L,
160L, 170L, 123L, 165L, 153L, 152L, 182L, 156L, 141L, 161L, 193L,
143L, 138L, 113L, 127L, 107L, 185L, 125L, 109L, 167L, 128L, 139L,
173L, 111L, 169L, 125L, 134L, 112L, 114L, 171L, 161L, 195L, 150L,
181L, 163L)), row.names = c(NA, -36L), class = "data.frame")

How to combine dataframes in R based-on similar timelines for multiple attributes and then transforming the data to make these columns as row headers?

I am trying to merge my sales data and patients data in R (and some other attributes), which are rolled up at the country level for the same time frame. After merging, I want to consolidate the result into a long format instead of wide, keeping it unique at the country-month level.
This is what my input data looks like:
1) Sales Data
Coutry_ID Country_Name 1/28/2018 2/28/2018 3/28/2018 4/28/2018 5/28/2018
A0001 USA 44 72 85 25 72
A0002 Germany 98 70 69 48 41
A0003 Russia 82 42 32 29 43
A0004 UK 79 83 51 48 47
A0005 France 45 75 10 13 23
A0006 India 92 85 28 13 18
2) Patients Data
Coutry_ID Country_Name 1/28/2018 2/28/2018 3/28/2018 4/28/2018 5/28/2018
A0001 USA 7 13 22 23 13
A0002 Germany 9 10 17 25 25
A0003 Russia 24 19 6 8 5
A0004 UK 6 8 20 1 11
A0005 France 4 9 8 10 25
A0006 India 18 21 2 13 17
And this is how I intend the output to look:
Coutry_ID Country_Name Month Sales Patients
A0001 USA 1/28/2018 44 7
A0001 USA 2/28/2018 72 13
A0001 USA 3/28/2018 85 22
A0001 USA 4/28/2018 25 23
A0001 USA 5/28/2018 72 13
A0002 Germany 1/28/2018 98 9
A0002 Germany 2/28/2018 70 10
A0002 Germany 3/28/2018 69 17
A0002 Germany 4/28/2018 48 25
A0002 Germany 5/28/2018 41 25
A0003 Russia 1/28/2018 82 24
A0003 Russia 2/28/2018 42 19
A0003 Russia 3/28/2018 32 6
A0003 Russia 4/28/2018 29 8
A0003 Russia 5/28/2018 43 5
A0004 UK 1/28/2018 79 6
A0004 UK 2/28/2018 83 8
A0004 UK 3/28/2018 51 20
A0004 UK 4/28/2018 48 1
A0004 UK 5/28/2018 47 11
A0005 France 1/28/2018 45 4
A0005 France 2/28/2018 75 9
A0005 France 3/28/2018 10 8
A0005 France 4/28/2018 13 10
A0005 France 5/28/2018 23 25
A0006 India 1/28/2018 92 18
A0006 India 2/28/2018 85 21
A0006 India 3/28/2018 28 2
A0006 India 4/28/2018 13 13
A0006 India 5/28/2018 18 17
I need a little guidance on these 2 things -
1 - How do I convert the data from wide to long?
2 - For merging the data, I am thinking about using dplyr's left_join on all these data sets with my master list of countries (ID and Name). My doubt is whether I should convert the data sets from wide to long format before merging, or after.
You can get both the dataframes in long format and then join:
library(dplyr)
library(tidyr)
inner_join(
  sales %>% pivot_longer(cols = -c(Coutry_ID, Country_Name), values_to = 'Sales'),
  patients %>% pivot_longer(cols = -c(Coutry_ID, Country_Name), values_to = 'Patients'),
  by = c("Coutry_ID", "Country_Name", "name"))
# A tibble: 30 x 5
# Coutry_ID Country_Name name Sales Patients
# <fct> <fct> <chr> <int> <int>
# 1 A0001 USA 1/28/2018 44 7
# 2 A0001 USA 2/28/2018 72 13
# 3 A0001 USA 3/28/2018 85 22
# 4 A0001 USA 4/28/2018 25 23
# 5 A0001 USA 5/28/2018 72 13
# 6 A0002 Germany 1/28/2018 98 9
# 7 A0002 Germany 2/28/2018 70 10
# 8 A0002 Germany 3/28/2018 69 17
# 9 A0002 Germany 4/28/2018 48 25
#10 A0002 Germany 5/28/2018 41 25
# … with 20 more rows
data
sales <- structure(list(Coutry_ID = structure(1:6, .Label = c("A0001",
"A0002", "A0003", "A0004", "A0005", "A0006"), class = "factor"),
Country_Name = structure(c(6L, 2L, 4L, 5L, 1L, 3L), .Label = c("France",
"Germany", "India", "Russia", "UK", "USA"), class = "factor"),
`1/28/2018` = c(44L, 98L, 82L, 79L, 45L, 92L), `2/28/2018` = c(72L,
70L, 42L, 83L, 75L, 85L), `3/28/2018` = c(85L, 69L, 32L,
51L, 10L, 28L), `4/28/2018` = c(25L, 48L, 29L, 48L, 13L,
13L), `5/28/2018` = c(72L, 41L, 43L, 47L, 23L, 18L)), class =
"data.frame", row.names = c(NA, -6L))
patients <- structure(list(Coutry_ID = structure(1:6, .Label = c("A0001",
"A0002", "A0003", "A0004", "A0005", "A0006"), class = "factor"),
Country_Name = structure(c(6L, 2L, 4L, 5L, 1L, 3L), .Label = c("France",
"Germany", "India", "Russia", "UK", "USA"), class = "factor"),
`1/28/2018` = c(7L, 9L, 24L, 6L, 4L, 18L), `2/28/2018` = c(13L,
10L, 19L, 8L, 9L, 21L), `3/28/2018` = c(22L, 17L, 6L, 20L,
8L, 2L), `4/28/2018` = c(23L, 25L, 8L, 1L, 10L, 13L), `5/28/2018` = c(13L,
25L, 5L, 11L, 25L, 17L)), class = "data.frame", row.names = c(NA, -6L))
Base R (not as elegant as above):
# Create a named list of dataframes:
df_list <- list(patients = patients, sales = sales)
# Create a vector in each with the name of the dataframe:
df_list <- mapply(cbind, df_list, "desc" = as.character(names(df_list)),
SIMPLIFY = FALSE)
# Define a function to reshape the data:
reshape_ps <- function(x){
  tmp <- setNames(reshape(x,
                          direction = "long",
                          varying = which(names(x) %in% names(x[, sapply(x, is.numeric)])),
                          idvar = c(!(names(x) %in% names(x[, sapply(x, is.numeric)]))),
                          v.names = "month",
                          times = as.Date(names(x[, sapply(x, is.numeric)]), "%m/%d/%Y"),
                          new.row.names = 1:(nrow(x) * length(which(names(x) %in% names(x[, sapply(x, is.numeric)]))))),
                  c(names(x[!(names(x) %in% names(x[, sapply(x, is.numeric)]))]), "month", as.character(unique(x$desc))))
  # Drop the dataframe name vector:
  clean <- tmp[, names(tmp) != "desc"]
  # Specify the return object:
  return(clean)
}
# Merge the result of the function applied on both dataframes:
Reduce(function(y, z){merge(y, z, by = intersect(colnames(y), colnames(z)), all = TRUE)},
       Map(function(x){reshape_ps(x)}, df_list))
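A lighter base-R variant of the same idea (a sketch only, using toy two-month versions of the question's frames rather than the full data): write one small wide-to-long helper and merge() the two results on their shared columns.

```r
# Minimal wide-to-long helper: keeps the two id columns, stacks the date columns
to_long <- function(wide, value_name) {
  n <- nrow(wide)
  long <- data.frame(wide[rep(seq_len(n), ncol(wide) - 2), 1:2],
                     Month = rep(names(wide)[-(1:2)], each = n),
                     unlist(wide[-(1:2)], use.names = FALSE),
                     check.names = FALSE)
  names(long)[4] <- value_name
  long
}

# Two-country, two-month toy versions of the question's frames
sales <- data.frame(Coutry_ID = c("A0001", "A0002"),
                    Country_Name = c("USA", "Germany"),
                    `1/28/2018` = c(44L, 98L), `2/28/2018` = c(72L, 70L),
                    check.names = FALSE)
patients <- data.frame(Coutry_ID = c("A0001", "A0002"),
                       Country_Name = c("USA", "Germany"),
                       `1/28/2018` = c(7L, 9L), `2/28/2018` = c(13L, 10L),
                       check.names = FALSE)

# merge() joins on all shared columns: Coutry_ID, Country_Name, Month
out <- merge(to_long(sales, "Sales"), to_long(patients, "Patients"))
```

This also answers the question's doubt directly: reshape first, merge second, because the long frames share the Month key.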

Filtering dataframe on multiple columns in R with at least 6 matches

I have the following data and I would like to keep only the cases that have exactly 6 instances of the same individual (same last name and first name) in the dataset. For example, Quincy Acy appears 6 times in the df and I would like to retain each of these cases but get rid of Alex Abrines because there are only 3 instances (< 6) of that individual.
last first start_year end_year Team GP MIN PTS W L
<chr> <chr> <int> <int> <chr> <int> <dbl> <dbl> <int> <int>
1 Abri… Alex 2016 2017 OKC 68 15.5 6 37 31
2 Abri… Alex 2017 2018 OKC 75 15.1 4.8 42 33
3 Abri… Alex 2018 2019 OKC 31 19 5.3 21 10
4 Acy Quin… 2013 2014 SAC 63 13.5 2.7 22 41
5 Acy Quin… 2014 2015 NYK 68 18.9 5.9 12 56
6 Acy Quin… 2015 2016 SAC 59 14.8 5.3 21 38
7 Acy Quin… 2016 2017 BKN 38 14.7 5.8 11 27
8 Acy Quin… 2017 2018 BKN 70 19.4 5.9 26 44
9 Acy Quin… 2018 2019 PHX 10 12.3 1.7 2 8
I have tried x <- df %>% count(last, first) %>% filter(n == 6) followed by df %>% filter(last %in% x$last & first %in% x$first), but that matches any last name and any first name separately rather than requiring both to match for the same individual. I am sure there is also an easier solution with filter without having to use group_by first.
I would like the solution to look like:
<chr> <chr> <int> <int> <chr> <int> <dbl> <dbl> <int> <int>
1 Acy Quin… 2013 2014 SAC 63 13.5 2.7 22 41
2 Acy Quin… 2014 2015 NYK 68 18.9 5.9 12 56
3 Acy Quin… 2015 2016 SAC 59 14.8 5.3 21 38
4 Acy Quin… 2016 2017 BKN 38 14.7 5.8 11 27
5 Acy Quin… 2017 2018 BKN 70 19.4 5.9 26 44
6 Acy Quin… 2018 2019 PHX 10 12.3 1.7 2 8
7 Adams Stev… 2013 2014 OKC 81 14.8 3.3 59 22
8 Adams Stev… 2014 2015 OKC 70 25.3 7.7 37 33
9 Adams Stev… 2015 2016 OKC 80 25.2 8 54 26
10 Adams Stev… 2016 2017 OKC 80 29.9 11.3 47 33
11 Adams Stev… 2017 2018 OKC 76 32.7 13.9 43 33
12 Adams Stev… 2018 2019 OKC 80 33.4 13.8 47 33
Instead of counting to summarise the data, creating a new object, and then filtering, we can group_by the 'last' and 'first' columns and directly filter the groups based on the condition:
library(dplyr)
df1 <- df %>%
  group_by(last, first) %>%
  filter(n() == 6)
If it is at least 6, change the == to >=.
Or another option is table
subset(df, paste(last, first) %in% names(which(table(paste(last, first)) == 6)))
In base R, we can use ave to count number of rows in each group of first and last values and select groups where number of rows is 6.
subset(df, ave(start_year, first, last, FUN = length) == 6)
# last first start_year end_year Team GP MIN PTS W L
#4 Acy Quin… 2013 2014 SAC 63 13.5 2.7 22 41
#5 Acy Quin… 2014 2015 NYK 68 18.9 5.9 12 56
#6 Acy Quin… 2015 2016 SAC 59 14.8 5.3 21 38
#7 Acy Quin… 2016 2017 BKN 38 14.7 5.8 11 27
#8 Acy Quin… 2017 2018 BKN 70 19.4 5.9 26 44
#9 Acy Quin… 2018 2019 PHX 10 12.3 1.7 2 8
We can do the same with data.table
library(data.table)
setDT(df)[, .SD[.N == 6], .(first, last)]
data
df <- structure(list(last = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("Abri…", "Acy"), class = "factor"), first = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Alex", "Quin…"
), class = "factor"), start_year = c(2016L, 2017L, 2018L, 2013L,
2014L, 2015L, 2016L, 2017L, 2018L), end_year = c(2017L, 2018L,
2019L, 2014L, 2015L, 2016L, 2017L, 2018L, 2019L), Team = structure(c(3L,
3L, 3L, 5L, 2L, 5L, 1L, 1L, 4L), .Label = c("BKN", "NYK", "OKC",
"PHX", "SAC"), class = "factor"), GP = c(68L, 75L, 31L, 63L,
68L, 59L, 38L, 70L, 10L), MIN = c(15.5, 15.1, 19, 13.5, 18.9,
14.8, 14.7, 19.4, 12.3), PTS = c(6, 4.8, 5.3, 2.7, 5.9, 5.3,
5.8, 5.9, 1.7), W = c(37L, 42L, 21L, 22L, 12L, 21L, 11L, 26L,
2L), L = c(31L, 33L, 10L, 41L, 56L, 38L, 27L, 44L, 8L)), class = "data.frame",
row.names = c(NA, -9L))
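To see the ave() trick in action on a self-contained toy frame (made-up rows mimicking the question's shape, not the real data):

```r
# Six seasons for one player, three for another
df <- data.frame(last = c(rep("Acy", 6), rep("Abrines", 3)),
                 first = c(rep("Quincy", 6), rep("Alex", 3)),
                 start_year = c(2013:2018, 2016:2018))

# ave() returns the group size alongside every row, so == 6 keeps complete groups
keep <- subset(df, ave(start_year, first, last, FUN = length) == 6)
```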

filling in missing data using fitted value in R

I have a dataframe like this:
ID year age wage
1 2 1981 22 10000
2 2 1982 23 11000
3 2 1983 24 11500
4 2 1984 25 11000
5 2 1985 26 14000
6 2 1986 27 16000
7 2 1987 28 20000
8 2 1988 29 19000
9 2 1989 30 20000
10 2 1990 31 20000
11 2 1991 32 22000
12 2 1992 33 25000
13 2 1993 34 0
14 2 1994 35 NA
15 2 1995 36 0
16 2 1996 37 NA
17 2 1997 38 0
18 2 1998 39 NA
19 2 1999 40 0
20 2 2000 41 NA
21 2 2001 42 0
22 2 2002 43 NA
23 2 2003 44 0
24 2 2004 45 NA
25 2 2005 46 5500
26 2 2006 47 NA
27 2 2007 48 5000
28 2 2008 49 NA
29 2 2009 50 6000
30 2 2010 51 NA
31 2 2011 52 19000
32 2 2012 53 NA
33 2 2013 54 21000
34 2 2014 55 NA
35 2 2015 56 23000
36 3 1984 22 1300
37 3 1985 23 0
38 3 1986 24 1500
39 3 1987 25 1000
40 3 1988 26 0
I want to use an individual-specific regression of wage on age and age-squared to impute missing wage observations. I want to only impute when at least 5 non-missing observations are available.
As suggested by jay.sf, I tried the following but with fitted values:
df_imp <- do.call(rbind,
  by(df, df$ID, function(x) {
    IDs <- which(is.na(x$wage))
    if (length(x$wage[- IDs]) >= 5) {
      b <- lm(wage ~ poly(age, 2, raw = TRUE), x)$fitted.values
      x$wage[IDs] <- with(x, b)[IDs]
    }
    return(x)
  }))
I got the following results:
ID year age wage
36 2 1981 22 10000.000
37 2 1982 23 11000.000
38 2 1983 24 11500.000
39 2 1984 25 11000.000
40 2 1985 26 14000.000
41 2 1986 27 16000.000
42 2 1987 28 20000.000
43 2 1988 29 19000.000
44 2 1989 30 20000.000
45 2 1990 31 20000.000
46 2 1991 32 22000.000
47 2 1992 33 25000.000
48 2 1993 34 0.000
49 2 1994 35 7291.777
50 2 1995 36 0.000
51 2 1996 37 6779.133
52 2 1997 38 0.000
53 2 1998 39 7591.597
54 2 1999 40 0.000
55 2 2000 41 9729.168
56 2 2001 42 0.000
57 2 2002 43 13191.847
58 2 2003 44 0.000
59 2 2004 45 17979.633
60 2 2005 46 5500.000
61 2 2006 47 NA
62 2 2007 48 5000.000
63 2 2008 49 NA
64 2 2009 50 6000.000
65 2 2010 51 NA
66 2 2011 52 19000.000
67 2 2012 53 NA
68 2 2013 54 21000.000
69 2 2014 55 NA
70 2 2015 56 23000.000
You could use a simple if statement, without an else. Define an index vector IDs that identifies the missings; use it both to count them and to subset your Y column wage.
For this you can use by(), which splits your data similar to split() but you may apply a function; just rbind the result.
It's probably wiser to use the coefficients rather than the fitted values: lm() drops rows where Y is NA, so fitted.values is shorter than your data and no longer lines up with the rows you want to fill, which is why the fitted-values attempt above fills some rows with the wrong values and leaves others NA. And you need to use raw=TRUE in the poly.
DF.imp <- do.call(rbind,
  by(DF, DF$ID, function(x) {
    IDs <- which(is.na(x$wage))
    if (length(x$wage[- IDs]) >= 5) {
      b <- lm(wage ~ poly(age, 2, raw = TRUE), x)$coefficients
      x$wage[IDs] <- with(x, (b[1] + b[2]*age + b[3]*age^2))[IDs]
    }
    return(x)
  }))
Note that I've slightly changed your example data, so that ID 3 also has missings, but less than 5 non-missings.
Result
DF.imp
# ID year age wage
# 2.1 2 1981 22 10000.000
# 2.2 2 1982 23 11000.000
# 2.3 2 1983 24 11500.000
# 2.4 2 1984 25 11000.000
# 2.5 2 1985 26 14000.000
# 2.6 2 1986 27 16000.000
# 2.7 2 1987 28 20000.000
# 2.8 2 1988 29 19000.000
# 2.9 2 1989 30 20000.000
# 2.10 2 1990 31 20000.000
# 2.11 2 1991 32 22000.000
# 2.12 2 1992 33 25000.000
# 2.13 2 1993 34 0.000
# 2.14 2 1994 35 7626.986
# 2.15 2 1995 36 0.000
# 2.16 2 1996 37 7039.387
# 2.17 2 1997 38 0.000
# 2.18 2 1998 39 6783.065
# 2.19 2 1999 40 0.000
# 2.20 2 2000 41 6858.020
# 2.21 2 2001 42 0.000
# 2.22 2 2002 43 7264.252
# 2.23 2 2003 44 0.000
# 2.24 2 2004 45 8001.761
# 2.25 2 2005 46 5500.000
# 2.26 2 2006 47 9070.546
# 2.27 2 2007 48 5000.000
# 2.28 2 2008 49 10470.609
# 2.29 2 2009 50 6000.000
# 2.30 2 2010 51 12201.948
# 2.31 2 2011 52 19000.000
# 2.32 2 2012 53 14264.565
# 2.33 2 2013 54 21000.000
# 2.34 2 2014 55 16658.458
# 2.35 2 2015 56 23000.000
# 3.36 3 1984 22 1300.000
# 3.37 3 1985 23 NA
# 3.38 3 1986 24 1500.000
# 3.39 3 1987 25 1000.000
# 3.40 3 1988 26 NA
Data
DF <- structure(list(ID = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(1981L,
1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L,
1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 1984L, 1985L,
1986L, 1987L, 1988L), age = c(22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L,
55L, 56L, 22L, 23L, 24L, 25L, 26L), wage = c(10000L, 11000L,
11500L, 11000L, 14000L, 16000L, 20000L, 19000L, 20000L, 20000L,
22000L, 25000L, 0L, NA, 0L, NA, 0L, NA, 0L, NA, 0L, NA, 0L, NA,
5500L, NA, 5000L, NA, 6000L, NA, 19000L, NA, 21000L, NA, 23000L,
1300L, NA, 1500L, 1000L, NA)), row.names = c(NA, -40L), class = "data.frame")
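The manual b[1] + b[2]*age + b[3]*age^2 line is equivalent to calling predict() on the fitted model; a quick check on simulated data (toy numbers, not the wage data):

```r
# Simulate a quadratic wage-age relationship with noise
set.seed(1)
toy <- data.frame(age = 20:40)
toy$wage <- 100 + 5 * toy$age + 0.3 * toy$age^2 + rnorm(nrow(toy))

fit <- lm(wage ~ poly(age, 2, raw = TRUE), toy)
b <- coef(fit)

manual <- b[1] + b[2] * toy$age + b[3] * toy$age^2  # hand-rolled prediction
```

Writing the polynomial out by hand, as in the answer, sidesteps the fitted-values alignment problem entirely, since it evaluates the model at every row, missing Y or not.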

How to find quartiles grouped by 2 different columns

I have a dataset that looks like this
year month age
2007 1 17
2007 1 18
2007 1 19
2007 1 30
2007 1 31
2007 2 18
2007 2 19
2007 2 30
2008 2 41
2008 2 52
2008 2 49
2008 3 23
2008 3 19
2008 3 39
And I'm stuck trying to find quartile group by each year and month.
The results should be like:
2007 1 Q1 Q2 Q3 Q4
2007 2 Q1 Q2 Q3 Q4
etc..
Thanks
Your question is a bit confusing. It only takes three cutpoints to separate data into quartiles, so what do you really want in those Q1, Q2, Q3, Q4 columns? If you want counts, that would seem to be a bit boring. I'm going to assume you want the min, 25th percentile, median, 75th percentile, and max:
do.call(rbind, with(dfrm, tapply(age, interaction(year = year, month = month), quantile,
                                 probs = c(0, .25, .5, 0.75, 1))))
#---------------------
0% 25% 50% 75% 100%
2007.1 17 18.0 19 30.0 31
2007.2 18 18.5 19 24.5 30
2008.2 41 45.0 49 50.5 52
2008.3 19 21.0 23 31.0 39
Aggregate does this.
> aggregate(.~year + month, data=age, FUN=fivenum)
year month age.1 age.2 age.3 age.4 age.5
1 2007 1 17.0 18.0 19.0 30.0 31.0
2 2007 2 18.0 18.5 19.0 24.5 30.0
3 2008 2 41.0 45.0 49.0 50.5 52.0
4 2008 3 19.0 21.0 23.0 31.0 39.0
> dput(age)
structure(list(year = c(2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L), month = c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), age = c(17L,
18L, 19L, 30L, 31L, 18L, 19L, 30L, 41L, 52L, 49L, 23L, 19L, 39L
)), .Names = c("year", "month", "age"), class = "data.frame", row.names = c(NA,
-14L))
