Filling in missing data using fitted values in R

I have a dataframe like this:
ID year age wage
1 2 1981 22 10000
2 2 1982 23 11000
3 2 1983 24 11500
4 2 1984 25 11000
5 2 1985 26 14000
6 2 1986 27 16000
7 2 1987 28 20000
8 2 1988 29 19000
9 2 1989 30 20000
10 2 1990 31 20000
11 2 1991 32 22000
12 2 1992 33 25000
13 2 1993 34 0
14 2 1994 35 NA
15 2 1995 36 0
16 2 1996 37 NA
17 2 1997 38 0
18 2 1998 39 NA
19 2 1999 40 0
20 2 2000 41 NA
21 2 2001 42 0
22 2 2002 43 NA
23 2 2003 44 0
24 2 2004 45 NA
25 2 2005 46 5500
26 2 2006 47 NA
27 2 2007 48 5000
28 2 2008 49 NA
29 2 2009 50 6000
30 2 2010 51 NA
31 2 2011 52 19000
32 2 2012 53 NA
33 2 2013 54 21000
34 2 2014 55 NA
35 2 2015 56 23000
36 3 1984 22 1300
37 3 1985 23 0
38 3 1986 24 1500
39 3 1987 25 1000
40 3 1988 26 0
I want to use an individual-specific regression of wage on age and age-squared to impute missing wage observations. I want to only impute when at least 5 non-missing observations are available.
As suggested by jay.sf, I tried the following but with fitted values:
df_imp <- do.call(rbind,
  by(df, df$ID, function(x) {
    IDs <- which(is.na(x$wage))
    if (length(x$wage[-IDs]) >= 5) {
      b <- lm(wage ~ poly(age, 2, raw = TRUE), x)$fitted.values
      x$wage[IDs] <- with(x, b)[IDs]
    }
    return(x)
  }))
I got the following results:
ID year age wage
36 2 1981 22 10000.000
37 2 1982 23 11000.000
38 2 1983 24 11500.000
39 2 1984 25 11000.000
40 2 1985 26 14000.000
41 2 1986 27 16000.000
42 2 1987 28 20000.000
43 2 1988 29 19000.000
44 2 1989 30 20000.000
45 2 1990 31 20000.000
46 2 1991 32 22000.000
47 2 1992 33 25000.000
48 2 1993 34 0.000
49 2 1994 35 7291.777
50 2 1995 36 0.000
51 2 1996 37 6779.133
52 2 1997 38 0.000
53 2 1998 39 7591.597
54 2 1999 40 0.000
55 2 2000 41 9729.168
56 2 2001 42 0.000
57 2 2002 43 13191.847
58 2 2003 44 0.000
59 2 2004 45 17979.633
60 2 2005 46 5500.000
61 2 2006 47 NA
62 2 2007 48 5000.000
63 2 2008 49 NA
64 2 2009 50 6000.000
65 2 2010 51 NA
66 2 2011 52 19000.000
67 2 2012 53 NA
68 2 2013 54 21000.000
69 2 2014 55 NA
70 2 2015 56 23000.000

You could use a simple if statement, without an else. Define an index vector IDs that identifies the missings, which you use both to count them and to subset your Y column wage.
For this you can use by(), which splits your data similar to split() but lets you apply a function to each piece; just rbind the result.
It's probably wiser to use the coefficients rather than the fitted values, because lm() drops rows where Y is NA, so the fitted values don't line up with the positions of the missing observations. And you need to use raw=TRUE in poly() so the coefficients can be applied to plain age and age^2.
DF.imp <- do.call(rbind,
  by(DF, DF$ID, function(x) {
    IDs <- which(is.na(x$wage))
    if (length(x$wage[-IDs]) >= 5) {
      b <- lm(wage ~ poly(age, 2, raw = TRUE), x)$coefficients
      x$wage[IDs] <- with(x, (b[1] + b[2]*age + b[3]*age^2))[IDs]
    }
    return(x)
  }))
Note that I've slightly changed your example data so that ID 3 also has missings, but fewer than 5 non-missing observations.
Result
DF.imp
# ID year age wage
# 2.1 2 1981 22 10000.000
# 2.2 2 1982 23 11000.000
# 2.3 2 1983 24 11500.000
# 2.4 2 1984 25 11000.000
# 2.5 2 1985 26 14000.000
# 2.6 2 1986 27 16000.000
# 2.7 2 1987 28 20000.000
# 2.8 2 1988 29 19000.000
# 2.9 2 1989 30 20000.000
# 2.10 2 1990 31 20000.000
# 2.11 2 1991 32 22000.000
# 2.12 2 1992 33 25000.000
# 2.13 2 1993 34 0.000
# 2.14 2 1994 35 7626.986
# 2.15 2 1995 36 0.000
# 2.16 2 1996 37 7039.387
# 2.17 2 1997 38 0.000
# 2.18 2 1998 39 6783.065
# 2.19 2 1999 40 0.000
# 2.20 2 2000 41 6858.020
# 2.21 2 2001 42 0.000
# 2.22 2 2002 43 7264.252
# 2.23 2 2003 44 0.000
# 2.24 2 2004 45 8001.761
# 2.25 2 2005 46 5500.000
# 2.26 2 2006 47 9070.546
# 2.27 2 2007 48 5000.000
# 2.28 2 2008 49 10470.609
# 2.29 2 2009 50 6000.000
# 2.30 2 2010 51 12201.948
# 2.31 2 2011 52 19000.000
# 2.32 2 2012 53 14264.565
# 2.33 2 2013 54 21000.000
# 2.34 2 2014 55 16658.458
# 2.35 2 2015 56 23000.000
# 3.36 3 1984 22 1300.000
# 3.37 3 1985 23 NA
# 3.38 3 1986 24 1500.000
# 3.39 3 1987 25 1000.000
# 3.40 3 1988 26 NA
Data
DF <- structure(list(ID = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(1981L,
1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L,
1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 1984L, 1985L,
1986L, 1987L, 1988L), age = c(22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L,
55L, 56L, 22L, 23L, 24L, 25L, 26L), wage = c(10000L, 11000L,
11500L, 11000L, 14000L, 16000L, 20000L, 19000L, 20000L, 20000L,
22000L, 25000L, 0L, NA, 0L, NA, 0L, NA, 0L, NA, 0L, NA, 0L, NA,
5500L, NA, 5000L, NA, 6000L, NA, 19000L, NA, 21000L, NA, 23000L,
1300L, NA, 1500L, 1000L, NA)), row.names = c(NA, -40L), class = "data.frame")
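A variant of the same idea uses predict() on the fitted model, which computes predictions from newdata and so sidesteps the alignment problem with fitted values. A minimal sketch against the DF above (not part of the original answer):
# Sketch: same imputation rule, but predict() handles the polynomial
# terms and returns values for exactly the rows passed as newdata.
DF.imp2 <- do.call(rbind,
  by(DF, DF$ID, function(x) {
    miss <- which(is.na(x$wage))
    if (sum(!is.na(x$wage)) >= 5) {
      fit <- lm(wage ~ poly(age, 2, raw = TRUE), data = x)
      x$wage[miss] <- predict(fit, newdata = x[miss, ])
    }
    x
  }))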

Related

data.table efficiently finding common pairs between 2 columns

say I have a dataframe
subject stim1 stim2 feedback
1 1003 50 51 1
2 1003 48 50 1
3 1003 49 51 1
4 1003 47 49 1
5 1003 47 46 1
6 1003 46 48 1
10 1003 50 48 1
428 1003 48 51 0
433 1003 46 50 0
434 1003 50 49 0
435 1003 54 59 0
I want to create a new column "transitive_pair" by grouping by subject (column 1) and, for every row in which feedback==0 (starting at index 428; otherwise transitive_pair=NaN), returning a boolean that tells me whether there is any chain of pairings (using only rows in which feedback==1) that would transitively link the stim1 and stim2 values.
Working through a few examples:
Row 428: stim1=48 and stim2=51. 48 and 51 are not paired directly, but 51 was paired with 50 (row 1) and 50 was paired with 48 (row 10), so transitive_pair[428]=True.
Row 433: stim1=46 and stim2=50. 46 and 48 were paired (row 6) and 48 was paired with 50 (row 2), so transitive_pair[433]=True.
Row 435: stim1=54 and stim2=59. There is no chain of pairs that could link them (59 is not paired with anything while feedback==1), so transitive_pair[435]=False.
desired output
subject stim1 stim2 feedback transitive_pair
1 1003 50 51 1 NaN
2 1003 48 50 1 NaN
3 1003 49 51 1 NaN
4 1003 47 49 1 NaN
5 1003 47 46 1 NaN
6 1003 46 48 1 NaN
10 1003 50 48 1 NaN
428 1003 48 51 0 1
433 1003 46 50 0 1
434 1003 50 49 0 1
435 1003 54 59 0 0
Any help would be greatly appreciated!!
Here is a reproducible df:
structure(list(subject = c(1003L, 1003L, 1003L, 1003L, 1003L,
1003L, 1003L, 1003L, 1003L, 1003L, 1003L), stim1 = c(50L, 48L,
49L, 47L, 47L, 46L, 50L, 48L, 46L, 50L, 54L), stim2 = c(51L,
50L, 51L, 49L, 46L, 48L, 48L, 51L, 50L, 49L, 59L), feedback = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), transitive_pair = c(NaN,
NaN, NaN, NaN, NaN, NaN, NaN, 1, 1, 1, 0)), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 10L, 428L, 433L, 434L, 435L), class = "data.frame")
The columns "stim1" and "stim2" define an undirected graph. Create the graph for feedback == 1, get its connected components and for each row of the data.frame, check if the values of "stim1" and "stim2" belong to the same component. In the end assign NaN to the rows where feedback is 1.
suppressPackageStartupMessages(library(igraph))

inx <- df1$feedback == 1
g <- graph_from_data_frame(df1[inx, c("stim1", "stim2")], directed = FALSE)
plot(g)

g_comp <- components(g)$membership

df1$transitive_pair_2 <- apply(df1[c("stim1", "stim2")], 1, \(x) {
  i <- names(g_comp) == x[1]
  j <- names(g_comp) == x[2]
  if (any(i) & any(j))
    g_comp[i] == g_comp[j]
  else 0L
})
df1$transitive_pair_2[inx] <- NaN
df1
#> subject stim1 stim2 feedback transitive_pair transitive_pair_2
#> 1 1003 50 51 1 NaN NaN
#> 2 1003 48 50 1 NaN NaN
#> 3 1003 49 51 1 NaN NaN
#> 4 1003 47 49 1 NaN NaN
#> 5 1003 47 46 1 NaN NaN
#> 6 1003 46 48 1 NaN NaN
#> 10 1003 50 48 1 NaN NaN
#> 428 1003 48 51 0 1 1
#> 433 1003 46 50 0 1 1
#> 434 1003 50 49 0 1 1
#> 435 1003 54 59 0 0 0
Created on 2022-07-31 by the reprex package (v2.0.1)
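The question also asks for the check per subject; one hedged way to extend the same component idea to data with several subjects is to split the frame first. A sketch under that assumption (df1 as above; not part of the original answer):
library(igraph)

# Sketch: run the component-membership check within each subject.
res <- lapply(split(df1, df1$subject), function(d) {
  inx <- d$feedback == 1
  g <- graph_from_data_frame(d[inx, c("stim1", "stim2")], directed = FALSE)
  memb <- components(g)$membership
  d$transitive_pair_2 <- apply(d[c("stim1", "stim2")], 1, function(s) {
    i <- match(as.character(s[1]), names(memb))
    j <- match(as.character(s[2]), names(memb))
    if (!is.na(i) && !is.na(j)) as.integer(memb[i] == memb[j]) else 0L
  })
  d$transitive_pair_2[inx] <- NaN
  d
})
do.call(rbind, res)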

How to combine dataframes in R based on similar timelines for multiple attributes and then transform the data to make these columns row headers?

I am trying to merge my sales data and patients data (and some other attributes) in R, which are rolled up at the country level for the same time frame. After merging, I want to consolidate it into long format instead of wide and keep it unique at the Country-Month level.
This is what my input data looks like:
1) Sales Data
Coutry_ID Country_Name 1/28/2018 2/28/2018 3/28/2018 4/28/2018 5/28/2018
A0001 USA 44 72 85 25 72
A0002 Germany 98 70 69 48 41
A0003 Russia 82 42 32 29 43
A0004 UK 79 83 51 48 47
A0005 France 45 75 10 13 23
A0006 India 92 85 28 13 18
2) Patients Data
Coutry_ID Country_Name 1/28/2018 2/28/2018 3/28/2018 4/28/2018 5/28/2018
A0001 USA 7 13 22 23 13
A0002 Germany 9 10 17 25 25
A0003 Russia 24 19 6 8 5
A0004 UK 6 8 20 1 11
A0005 France 4 9 8 10 25
A0006 India 18 21 2 13 17
And this is how I intend the output to look:
Coutry_ID Country_Name Month Sales Patients
A0001 USA 1/28/2018 44 7
A0001 USA 2/28/2018 72 13
A0001 USA 3/28/2018 85 22
A0001 USA 4/28/2018 25 23
A0001 USA 5/28/2018 72 13
A0002 Germany 1/28/2018 98 9
A0002 Germany 2/28/2018 70 10
A0002 Germany 3/28/2018 69 17
A0002 Germany 4/28/2018 48 25
A0002 Germany 5/28/2018 41 25
A0003 Russia 1/28/2018 82 24
A0003 Russia 2/28/2018 42 19
A0003 Russia 3/28/2018 32 6
A0003 Russia 4/28/2018 29 8
A0003 Russia 5/28/2018 43 5
A0004 UK 1/28/2018 79 6
A0004 UK 2/28/2018 83 8
A0004 UK 3/28/2018 51 20
A0004 UK 4/28/2018 48 1
A0004 UK 5/28/2018 47 11
A0005 France 1/28/2018 45 4
A0005 France 2/28/2018 75 9
A0005 France 3/28/2018 10 8
A0005 France 4/28/2018 13 10
A0005 France 5/28/2018 23 25
A0006 India 1/28/2018 92 18
A0006 India 2/28/2018 85 21
A0006 India 3/28/2018 28 2
A0006 India 4/28/2018 13 13
A0006 India 5/28/2018 18 17
I need a little guidance on these 2 things:
1 - How do I convert the data from wide to long?
2 - For merging, I am thinking about using dplyr left_join on all these data sets with my master list of countries (ID and Name). Should I convert the data sets from wide to long format first, or do that after merging?
You can get both dataframes into long format and then join:
library(dplyr)
library(tidyr)

inner_join(
  sales %>% pivot_longer(cols = -c(Coutry_ID, Country_Name), values_to = 'Sales'),
  patients %>% pivot_longer(cols = -c(Coutry_ID, Country_Name), values_to = 'Patients'),
  by = c("Coutry_ID", "Country_Name", "name"))
# A tibble: 30 x 5
# Coutry_ID Country_Name name Sales Patients
# <fct> <fct> <chr> <int> <int>
# 1 A0001 USA 1/28/2018 44 7
# 2 A0001 USA 2/28/2018 72 13
# 3 A0001 USA 3/28/2018 85 22
# 4 A0001 USA 4/28/2018 25 23
# 5 A0001 USA 5/28/2018 72 13
# 6 A0002 Germany 1/28/2018 98 9
# 7 A0002 Germany 2/28/2018 70 10
# 8 A0002 Germany 3/28/2018 69 17
# 9 A0002 Germany 4/28/2018 48 25
#10 A0002 Germany 5/28/2018 41 25
# … with 20 more rows
data
sales <- structure(list(Coutry_ID = structure(1:6, .Label = c("A0001",
"A0002", "A0003", "A0004", "A0005", "A0006"), class = "factor"),
Country_Name = structure(c(6L, 2L, 4L, 5L, 1L, 3L), .Label = c("France",
"Germany", "India", "Russia", "UK", "USA"), class = "factor"),
`1/28/2018` = c(44L, 98L, 82L, 79L, 45L, 92L), `2/28/2018` = c(72L,
70L, 42L, 83L, 75L, 85L), `3/28/2018` = c(85L, 69L, 32L,
51L, 10L, 28L), `4/28/2018` = c(25L, 48L, 29L, 48L, 13L,
13L), `5/28/2018` = c(72L, 41L, 43L, 47L, 23L, 18L)), class =
"data.frame", row.names = c(NA, -6L))
patients <- structure(list(Coutry_ID = structure(1:6, .Label = c("A0001",
"A0002", "A0003", "A0004", "A0005", "A0006"), class = "factor"),
Country_Name = structure(c(6L, 2L, 4L, 5L, 1L, 3L), .Label = c("France",
"Germany", "India", "Russia", "UK", "USA"), class = "factor"),
`1/28/2018` = c(7L, 9L, 24L, 6L, 4L, 18L), `2/28/2018` = c(13L,
10L, 19L, 8L, 9L, 21L), `3/28/2018` = c(22L, 17L, 6L, 20L,
8L, 2L), `4/28/2018` = c(23L, 25L, 8L, 1L, 10L, 13L), `5/28/2018` = c(13L,
25L, 5L, 11L, 25L, 17L)), class = "data.frame", row.names = c(NA, -6L))
Base R (not as eloquent as above):
# Create a named list of dataframes:
df_list <- list(patients = patients, sales = sales)

# Create a vector in each with the name of the dataframe:
df_list <- mapply(cbind, df_list, "desc" = as.character(names(df_list)),
                  SIMPLIFY = FALSE)

# Define a function to reshape the data:
reshape_ps <- function(x){
  # Names of the (numeric) date columns:
  num_cols <- names(x[, sapply(x, is.numeric)])
  tmp <- setNames(reshape(x,
                          direction = "long",
                          varying = which(names(x) %in% num_cols),
                          idvar = c(!(names(x) %in% num_cols)),
                          v.names = "month",
                          times = as.Date(num_cols, "%m/%d/%Y"),
                          new.row.names = 1:(nrow(x) * length(num_cols))),
                  c(names(x[!(names(x) %in% num_cols)]), "month",
                    as.character(unique(x$desc))))
  # Drop the dataframe name vector:
  clean <- tmp[, names(tmp) != "desc"]
  # Specify the return object:
  return(clean)
}

# Merge the result of the function applied on both dataframes:
Reduce(function(y, z) merge(y, z, by = intersect(colnames(y), colnames(z)), all = TRUE),
       Map(reshape_ps, df_list))
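For more than two wide tables, the same pivot-then-join pattern can be generalised over a named list; the sketch below also names the pivoted key column Month, as in the desired output. It assumes the sales and patients frames from the dput above and is not part of the original answers:
library(dplyr)
library(tidyr)
library(purrr)

# Sketch: pivot each table to long, naming its value column after the
# list element, then join everything on the shared key columns.
list(Sales = sales, Patients = patients) %>%
  imap(~ pivot_longer(.x, cols = -c(Coutry_ID, Country_Name),
                      names_to = "Month", values_to = .y)) %>%
  reduce(inner_join, by = c("Coutry_ID", "Country_Name", "Month"))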

Filtering dataframe on multiple columns in R with at least 6 matches

I have the following data and I would like to keep only the cases that have exactly 6 instances of the same individual (same last name and first name) in the dataset. For example, Quincy Acy appears 6 times in the df and I would like to retain each of these cases but get rid of Alex Abrines because there are only 3 instances (< 6) of that individual.
last first start_year end_year Team GP MIN PTS W L
<chr> <chr> <int> <int> <chr> <int> <dbl> <dbl> <int> <int>
1 Abri… Alex 2016 2017 OKC 68 15.5 6 37 31
2 Abri… Alex 2017 2018 OKC 75 15.1 4.8 42 33
3 Abri… Alex 2018 2019 OKC 31 19 5.3 21 10
4 Acy Quin… 2013 2014 SAC 63 13.5 2.7 22 41
5 Acy Quin… 2014 2015 NYK 68 18.9 5.9 12 56
6 Acy Quin… 2015 2016 SAC 59 14.8 5.3 21 38
7 Acy Quin… 2016 2017 BKN 38 14.7 5.8 11 27
8 Acy Quin… 2017 2018 BKN 70 19.4 5.9 26 44
9 Acy Quin… 2018 2019 PHX 10 12.3 1.7 2 8
I have tried x <- df %>% count(last, first) %>% filter(n == 6) followed by df %>% filter(last %in% x$last & first %in% x$first) but that matches any last name and any first name separately rather than matching both last and first name. I am sure there is also an easier solution with filter without having to use group_by first.
I would like the solution to look like:
<chr> <chr> <int> <int> <chr> <int> <dbl> <dbl> <int> <int>
1 Acy Quin… 2013 2014 SAC 63 13.5 2.7 22 41
2 Acy Quin… 2014 2015 NYK 68 18.9 5.9 12 56
3 Acy Quin… 2015 2016 SAC 59 14.8 5.3 21 38
4 Acy Quin… 2016 2017 BKN 38 14.7 5.8 11 27
5 Acy Quin… 2017 2018 BKN 70 19.4 5.9 26 44
6 Acy Quin… 2018 2019 PHX 10 12.3 1.7 2 8
7 Adams Stev… 2013 2014 OKC 81 14.8 3.3 59 22
8 Adams Stev… 2014 2015 OKC 70 25.3 7.7 37 33
9 Adams Stev… 2015 2016 OKC 80 25.2 8 54 26
10 Adams Stev… 2016 2017 OKC 80 29.9 11.3 47 33
11 Adams Stev… 2017 2018 OKC 76 32.7 13.9 43 33
12 Adams Stev… 2018 2019 OKC 80 33.4 13.8 47 33
Instead of counting to summarise the data, creating a new object and then filtering, we can group_by 'last' and 'first' and directly filter the groups based on the condition:
library(dplyr)
df1 <- df %>%
  group_by(last, first) %>%
  filter(n() == 6)
If it is at least 6, change the == to >=.
Or another option is table():
subset(df, paste(last, first) %in% names(which(table(paste(last, first)) == 6)))
In base R, we can use ave to count the number of rows in each group of first and last values and select the groups where that count is 6.
subset(df, ave(start_year, first, last, FUN = length) == 6)
# last first start_year end_year Team GP MIN PTS W L
#4 Acy Quin… 2013 2014 SAC 63 13.5 2.7 22 41
#5 Acy Quin… 2014 2015 NYK 68 18.9 5.9 12 56
#6 Acy Quin… 2015 2016 SAC 59 14.8 5.3 21 38
#7 Acy Quin… 2016 2017 BKN 38 14.7 5.8 11 27
#8 Acy Quin… 2017 2018 BKN 70 19.4 5.9 26 44
#9 Acy Quin… 2018 2019 PHX 10 12.3 1.7 2 8
We can do the same with data.table
library(data.table)
setDT(df)[,.SD[.N == 6], .(first, last)]
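One more dplyr spelling of the same idea: add_count() appends the per-group count as a regular column, which avoids building the separate count object the question attempted. A sketch, not from the original answers:
library(dplyr)
# Sketch: n counts rows per (last, first) pair; keep complete groups,
# then drop the helper column.
df %>%
  add_count(last, first) %>%
  filter(n == 6) %>%
  select(-n)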
data
df <- structure(list(last = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("Abri…", "Acy"), class = "factor"), first = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Alex", "Quin…"
), class = "factor"), start_year = c(2016L, 2017L, 2018L, 2013L,
2014L, 2015L, 2016L, 2017L, 2018L), end_year = c(2017L, 2018L,
2019L, 2014L, 2015L, 2016L, 2017L, 2018L, 2019L), Team = structure(c(3L,
3L, 3L, 5L, 2L, 5L, 1L, 1L, 4L), .Label = c("BKN", "NYK", "OKC",
"PHX", "SAC"), class = "factor"), GP = c(68L, 75L, 31L, 63L,
68L, 59L, 38L, 70L, 10L), MIN = c(15.5, 15.1, 19, 13.5, 18.9,
14.8, 14.7, 19.4, 12.3), PTS = c(6, 4.8, 5.3, 2.7, 5.9, 5.3,
5.8, 5.9, 1.7), W = c(37L, 42L, 21L, 22L, 12L, 21L, 11L, 26L,
2L), L = c(31L, 33L, 10L, 41L, 56L, 38L, 27L, 44L, 8L)), class = "data.frame",
row.names = c(NA, -9L))

Creating a new variable using dplyr and ifelse where more than one condition is used

I have the following data set. I want to create a variable called "specialized". To create it, I need to group the data using group_by(sic, year). Then the dummy variable "specialized" is created as follows:
if, in a given "year" and "sic", the "percentage" variable is the highest AND the difference between the highest percentage and the second highest percentage is greater than 10, it will be coded "1", and "0" otherwise.
However, note that if in a given "year" and "sic" there is no second highest percentage - meaning there is only one percentage, which is the highest - then it will be coded 1. An example of this case is sic == 0100 in year == 2000 in my data set.
I tried the following code
df <- df %>%
  group_by(sic, year) %>%
  mutate(SPECIALIZED = ifelse(max(percentage) && (max(percentage) - nth(sort(percentage), -2)) > 10), 1, 0) %>%
  ungroup()
But it does not work.
Here is the data -
gvkey auditor_fkey year sic percentage
1 001266 4 2001 0100 26.9605909
2 003107 2 2000 1000 37.0939127
3 003107 2 2000 1000 37.0939127
4 003107 2 2001 1000 9.8899690
5 003107 2 2001 1000 9.8899690
6 005560 1 2000 1040 100.0000000
7 005560 7 2001 1040 8.2959428
8 007881 5 2001 1040 71.1026743
9 009728 597 2001 1040 1.0906007
10 009728 597 2001 1040 1.0906007
11 010390 2 2000 0100 100.0000000
12 010390 2 2000 0100 100.0000000
13 010390 2 2001 0100 73.0394091
14 010390 2 2001 0100 73.0394091
15 012321 1 2001 1040 18.1873703
16 012321 1 2001 1040 18.1873703
17 014590 5 2000 1000 60.6862904
18 014590 5 2000 1000 60.6862904
19 014590 5 2001 1000 18.8287898
20 014590 5 2001 1000 18.8287898
21 014793 2 2000 1220 34.7515455
22 014793 2 2000 1220 34.7515455
23 014793 2 2001 1220 58.0859392
24 014793 2 2001 1220 58.0859392
25 015274 1 2000 1220 65.2484545
26 015274 1 2000 1220 65.2484545
27 015274 1 2001 1220 41.9140608
28 015274 1 2001 1220 41.9140608
29 019565 1 2001 1000 71.1457384
30 019565 1 2001 1000 71.1457384
31 020488 1 2000 1040 100.0000000
32 020488 1 2001 1040 18.1873703
33 025776 1 2000 1000 2.2197969
34 025776 1 2001 1000 71.1457384
35 031626 2 2000 1000 37.0939127
36 031626 2 2001 1000 9.8899690
37 061811 5 2000 1000 60.6862904
38 061811 5 2001 1000 18.8287898
39 061811 5 2001 1000 18.8287898
40 064134 580 2001 1000 0.1355028
41 064134 580 2001 1000 0.1355028
42 065921 1 2000 1040 100.0000000
43 065921 1 2000 1040 100.0000000
44 065921 1 2001 1040 18.1873703
45 065921 1 2001 1040 18.1873703
46 102341 2 2001 1040 1.3234119
47 142460 2 2001 1220 58.0859392
48 142460 2 2001 1220 58.0859392
49 142460 2 2001 1220 58.0859392
The final data set should look like this:
gvkey auditor_fkey year sic percentage specialized
1 10390 2 2000 0100 100.0000000 1
2 10390 2 2000 0100 100.0000000 1
3 3107 2 2000 1000 37.0939127 0
4 3107 2 2000 1000 37.0939127 0
5 14590 5 2000 1000 60.6862904 1
6 14590 5 2000 1000 60.6862904 1
7 25776 1 2000 1000 2.2197969 0
8 31626 2 2000 1000 37.0939127 0
9 61811 5 2000 1000 60.6862904 1
10 5560 1 2000 1040 100.0000000 1
11 20488 1 2000 1040 100.0000000 1
12 65921 1 2000 1040 100.0000000 1
13 65921 1 2000 1040 100.0000000 1
14 14793 2 2000 1220 34.7515456 0
15 14793 2 2000 1220 34.7515456 0
16 15274 1 2000 1220 65.2484544 1
17 15274 1 2000 1220 65.2484544 1
18 1266 4 2001 0100 26.9605909 0
19 10390 2 2001 0100 73.0394091 1
20 10390 2 2001 0100 73.0394091 1
21 3107 2 2001 1000 9.8899690 0
22 3107 2 2001 1000 9.8899690 0
23 14590 5 2001 1000 18.8287898 0
24 14590 5 2001 1000 18.8287898 0
25 19565 1 2001 1000 71.1457384 1
26 19565 1 2001 1000 71.1457384 1
27 25776 1 2001 1000 71.1457384 1
28 31626 2 2001 1000 9.8899690 0
29 61811 5 2001 1000 18.8287898 0
30 61811 5 2001 1000 18.8287898 0
31 64134 580 2001 1000 0.1355028 0
32 64134 580 2001 1000 0.1355028 0
33 5560 7 2001 1040 8.2959428 0
34 7881 5 2001 1040 71.1026743 1
35 9728 597 2001 1040 1.0906007 0
36 9728 597 2001 1040 1.0906007 0
37 12321 1 2001 1040 18.1873703 0
38 12321 1 2001 1040 18.1873703 0
39 20488 1 2001 1040 18.1873703 0
40 65921 1 2001 1040 18.1873703 0
41 65921 1 2001 1040 18.1873703 0
42 102341 2 2001 1040 1.3234119 0
43 14793 2 2001 1220 58.0859392 1
44 14793 2 2001 1220 58.0859392 1
45 15274 1 2001 1220 41.9140608 0
46 15274 1 2001 1220 41.9140608 0
47 142460 2 2001 1220 58.0859392 1
48 142460 2 2001 1220 58.0859392 1
49 142460 2 2001 1220 58.0859392 1
I appreciate your help.
The order changed between your data and the expected result, so I took the data from the result instead. Here is a breakdown of the logic into separate columns before creating the dummy with dummy() from hablar.
library(hablar)
library(dplyr)

df %>%
  group_by(sic, year) %>%
  mutate(second_highest = nth(sort(unique(percentage), decreasing = T), 2),
         max_value = max(percentage),
         is_max = percentage == max_value,
         is_ab_10 = (max_value - second_highest) > 10,
         specialized = dummy(is_max & is_ab_10, missing = 1)) %>%
  ungroup() %>%
  select(-c(second_highest, max_value, is_max, is_ab_10))
Result
# A tibble: 49 x 6
gvkey auditor_fkey year sic percentage specialized
<int> <int> <int> <int> <dbl> <int>
1 10390 2 2000 100 100 1
2 10390 2 2000 100 100 1
3 3107 2 2000 1000 37.1 0
4 3107 2 2000 1000 37.1 0
5 14590 5 2000 1000 60.7 1
6 14590 5 2000 1000 60.7 1
7 25776 1 2000 1000 2.22 0
8 31626 2 2000 1000 37.1 0
9 61811 5 2000 1000 60.7 1
10 5560 1 2000 1040 100 1
# … with 39 more rows
Data
df <- structure(list(gvkey = c(10390L, 10390L, 3107L, 3107L, 14590L,
14590L, 25776L, 31626L, 61811L, 5560L, 20488L, 65921L, 65921L,
14793L, 14793L, 15274L, 15274L, 1266L, 10390L, 10390L, 3107L,
3107L, 14590L, 14590L, 19565L, 19565L, 25776L, 31626L, 61811L,
61811L, 64134L, 64134L, 5560L, 7881L, 9728L, 9728L, 12321L, 12321L,
20488L, 65921L, 65921L, 102341L, 14793L, 14793L, 15274L, 15274L,
142460L, 142460L, 142460L), auditor_fkey = c(2L, 2L, 2L, 2L,
5L, 5L, 1L, 2L, 5L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 4L, 2L, 2L,
2L, 2L, 5L, 5L, 1L, 1L, 1L, 2L, 5L, 5L, 580L, 580L, 7L, 5L, 597L,
597L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L), year = c(2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2001L, 2001L,
2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L,
2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L,
2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L,
2001L, 2001L, 2001L), sic = c(100L, 100L, 1000L, 1000L, 1000L,
1000L, 1000L, 1000L, 1000L, 1040L, 1040L, 1040L, 1040L, 1220L,
1220L, 1220L, 1220L, 100L, 100L, 100L, 1000L, 1000L, 1000L, 1000L,
1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1040L,
1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L,
1220L, 1220L, 1220L, 1220L, 1220L, 1220L, 1220L), percentage = c(100,
100, 37.0939127, 37.0939127, 60.6862904, 60.6862904, 2.2197969,
37.0939127, 60.6862904, 100, 100, 100, 100, 34.7515456, 34.7515456,
65.2484544, 65.2484544, 26.9605909, 73.0394091, 73.0394091, 9.889969,
9.889969, 18.8287898, 18.8287898, 71.1457384, 71.1457384, 71.1457384,
9.889969, 18.8287898, 18.8287898, 0.1355028, 0.1355028, 8.2959428,
71.1026743, 1.0906007, 1.0906007, 18.1873703, 18.1873703, 18.1873703,
18.1873703, 18.1873703, 1.3234119, 58.0859392, 58.0859392, 41.9140608,
41.9140608, 58.0859392, 58.0859392, 58.0859392)), row.names = c(NA,
-49L), class = c("tbl_df",
"tbl", "data.frame"))

How to find quartiles grouped by 2 different columns

I have a dataset that looks like this
year month age
2007 1 17
2007 1 18
2007 1 19
2007 1 30
2007 1 31
2007 2 18
2007 2 19
2007 2 30
2008 2 41
2008 2 52
2008 2 49
2008 3 23
2008 3 19
2008 3 39
And I'm stuck trying to find quartiles grouped by each year and month.
The results should be like:
2007 1 Q1 Q2 Q3 Q4
2007 2 Q1 Q2 Q3 Q4
etc..
Thanks
Your question is a bit confusing. It only takes three cutpoints to separate data into quartiles, so what do you really want in those Q1, Q2, Q3, Q4 columns? If you want counts, that would seem a bit boring. I'm going to assume you want the min, 25th percentile, median, 75th percentile, and max:
do.call(rbind, with(dfrm, tapply(age, interaction(year = year, month = month), quantile,
                                 probs = c(0, .25, .5, .75, 1))))
#---------------------
0% 25% 50% 75% 100%
2007.1 17 18.0 19 30.0 31
2007.2 18 18.5 19 24.5 30
2008.2 41 45.0 49 50.5 52
2008.3 19 21.0 23 31.0 39
Aggregate does this.
> aggregate(.~year + month, data=age, FUN=fivenum)
year month age.1 age.2 age.3 age.4 age.5
1 2007 1 17.0 18.0 19.0 30.0 31.0
2 2007 2 18.0 18.5 19.0 24.5 30.0
3 2008 2 41.0 45.0 49.0 50.5 52.0
4 2008 3 19.0 21.0 23.0 31.0 39.0
> dput(age)
structure(list(year = c(2007L, 2007L, 2007L, 2007L, 2007L, 2007L,
2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L), month = c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), age = c(17L,
18L, 19L, 30L, 31L, 18L, 19L, 30L, 41L, 52L, 49L, 23L, 19L, 39L
)), .Names = c("year", "month", "age"), class = "data.frame", row.names = c(NA,
-14L))
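For reference, a tidyverse sketch of the same per-group five-number summary (assuming the age data frame from the dput above; not part of either answer):
library(dplyr)
library(tidyr)

# Sketch: store each group's quantiles in a list column, then spread
# them into the 0%/25%/50%/75%/100% columns.
age %>%
  group_by(year, month) %>%
  summarise(q = list(quantile(age, probs = c(0, .25, .5, .75, 1))),
            .groups = "drop") %>%
  unnest_wider(q)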
