How to dummify with one row per ID and multiple dummies == 1? - r

I've been working on something related to dummifying a variable, but I have not been able to get the result I'm looking for. I'm sure there must be an easy solution, but I was not able to find it.
My data set looks like this:
TM_ID APPLICATION_YEAR EXPIRATION DURATION     NICE
    5             1870       1902       32 CLASE 34
    6             1870       1891       21 CLASE 32
   19             1902       1943       41 CLASE 34
   19             1902       1943       41 CLASE 16
   20             1876       1881        5 CLASE 34
   21             1876       1877       12 CLASE 34
   70             1877       1902       25 CLASE 16
   70             1877       1902       25 CLASE 34
and I would like to dummify the column NICE to obtain one row per TM_ID, with ones in the columns representing each of the levels in column NICE.
I tried dummyVars(); it created one dummy for each level, but there was still more than one observation per TM_ID. I also tried dcast() and pivot_wider(), but I have not been able to achieve what I need. It should look like this:
TM_ID APPLICATION_YEAR EXPIRATION DURATION CLASE 16 CLASE 32 CLASE 34
    5             1870       1902       32        0        0        1
    6             1870       1891       21        0        1        0
   19             1902       1943       41        1        0        1
   20             1876       1881        5        0        0        1
   21             1876       1877       12        0        0        1
   70             1877       1902       25        1        0        1
where there is only one observation per ID and 1s in each of the corresponding columns of NICE levels.
The only way I found to do this is one-hot encoding first with dummyVars(), and then group_by() and mutate(), like:
group_by(TM_ID) %>%
  mutate(NICE_1 = sum(NICE.1), NICE_10 = sum(NICE.10), NICE_11 = sum(NICE.11), ...
But the problem with this solution is that I had to type each argument in mutate() by hand, one per dummy (level of the original variable). What if there were hundreds of levels?
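That last step can be generalized, though. A minimal sketch, assuming the one-hot output of dummyVars() is stored in a data frame df_dummies (a hypothetical name) whose dummy columns all share the NICE. prefix:
library(dplyr)

# Sum every dummy column at once instead of typing one mutate() argument
# per level; summarize() collapses the data to one row per TM_ID.
df_dummies %>%
  group_by(TM_ID, APPLICATION_YEAR, EXPIRATION, DURATION) %>%
  summarize(across(starts_with("NICE."), sum), .groups = "drop")
This scales to any number of levels, since across() selects the dummy columns by prefix rather than by name.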
Thanks for your help.

You may use pivot_wider to get the data in wide format and use values_fn = length to dummify the NICE column.
library(dplyr)
library(tidyr)
res <- df %>%
  arrange(NICE) %>%
  pivot_wider(names_from = NICE, values_from = NICE,
              values_fn = length, values_fill = 0)
res
# TM_ID APPLICATION_YEAR EXPIRATION DURATION `CLASE 16` `CLASE 32` `CLASE 34`
# <int> <int> <int> <int> <int> <int> <int>
#1 19 1902 1943 41 1 0 1
#2 70 1877 1902 25 1 0 1
#3 6 1870 1891 21 0 1 0
#4 5 1870 1902 32 0 0 1
#5 20 1876 1881 5 0 0 1
#6 21 1876 1877 12 0 0 1
data
df <- structure(list(TM_ID = c(5L, 6L, 19L, 19L, 20L, 21L, 70L, 70L
), APPLICATION_YEAR = c(1870L, 1870L, 1902L, 1902L, 1876L, 1876L,
1877L, 1877L), EXPIRATION = c(1902L, 1891L, 1943L, 1943L, 1881L,
1877L, 1902L, 1902L), DURATION = c(32L, 21L, 41L, 41L, 5L, 12L,
25L, 25L), NICE = c("CLASE 34", "CLASE 32", "CLASE 34", "CLASE 16",
"CLASE 34", "CLASE 34", "CLASE 16", "CLASE 34")), row.names = c(NA,
-8L), class = "data.frame")

Using data.table:
library(data.table)
dcast(setDT(df)[order(NICE)], ... ~ NICE, value.var = 'NICE', length)
TM_ID APPLICATION_YEAR EXPIRATION DURATION CLASE 16 CLASE 32 CLASE 34
1: 5 1870 1902 32 0 0 1
2: 6 1870 1891 21 0 1 0
3: 19 1902 1943 41 1 0 1
4: 20 1876 1881 5 0 0 1
5: 21 1876 1877 12 0 0 1
6: 70 1877 1902 25 1 0 1
data
df <- structure(list(TM_ID = c(5L, 6L, 19L, 19L, 20L, 21L, 70L, 70L
), APPLICATION_YEAR = c(1870L, 1870L, 1902L, 1902L, 1876L, 1876L,
1877L, 1877L), EXPIRATION = c(1902L, 1891L, 1943L, 1943L, 1881L,
1877L, 1902L, 1902L), DURATION = c(32L, 21L, 41L, 41L, 5L, 12L,
25L, 25L), NICE = c("CLASE 34", "CLASE 32", "CLASE 34", "CLASE 16",
"CLASE 34", "CLASE 34", "CLASE 16", "CLASE 34")), row.names = c(NA,
-8L), class = "data.frame")

Great, that's been of much help.
What I wonder is: what if we want to do the same for several variables? Actually, my data has this structure:
TM_ID APPLICATION_YEAR EXPIRATION DURATION NICE INDUSTRY
1: 5 1870 1902 32 CLASE 34 food
2: 6 1870 1891 21 CLASE 32 tobacco
3: 19 1902 1943 41 CLASE 34 tobacco
4: 19 1902 1943 41 CLASE 16 paper
5: 20 1876 1881 5 CLASE 34 chemical
6: 21 1876 1877 12 CLASE 34 arms
7: 70 1877 1902 25 CLASE 16 chemical
8: 70 1877 1902 25 CLASE 34 machinery
And I'd need:
TM_ID APPLICATION_YEAR EXPIRATION DURATION CLASE 16 CLASE 32 CLASE 34 food tobacco paper chemical arms machinery
5 1870 1902 32 0 0 1 1 0 0 0 0 0
6 1870 1891 21 0 1 0 0 1 0 0 0 0
19 1902 1943 41 1 0 1 0 1 1 0 0 0
20 1876 1881 5 0 0 1 0 0 0 1 0 0
21 1876 1877 12 0 0 1 0 0 0 0 1 0
70 1877 1902 25 1 0 1 1 0 0 1 0 1
I tried with different combinations in dcast(), but what I get is a column for each combination of NICE and INDUSTRY. Concretely, I tried to follow the example in the data.table vignettes:
## new 'cast' functionality - multiple value.vars
DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender"))
DT.c2
# family_id age_mother dob_1 dob_2 dob_3 gender_1 gender_2 gender_3
# 1: 1 30 1998-11-26 2000-01-29 <NA> 1 2 NA
# 2: 2 27 1996-06-22 <NA> <NA> 2 NA NA
# 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
# 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
# 5: 5 29 2000-12-05 2005-02-28 <NA> 2 1 NA
But I couldn't make it work for my case. I can imagine a workaround doing each variable separately and then cbind(), but I'm sure there is a cleaner way.
Thanks again for your help.
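One possible way to handle several variables at once (a sketch, not from the original thread): stack NICE and INDUSTRY into a single column of levels, then widen once, so a single pivot_wider() call creates a dummy for every level of both variables.
library(dplyr)
library(tidyr)

# Sketch, assuming df now also contains the INDUSTRY column shown above.
res2 <- df %>%
  pivot_longer(c(NICE, INDUSTRY), names_to = "var", values_to = "level") %>%
  select(-var) %>%  # the source column name is not needed for the dummies
  pivot_wider(names_from = level, values_from = level,
              values_fn = length, values_fill = 0)
This scales to any number of categorical columns, provided their levels do not share names.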

Related

Merge 2 data frames using common date, plus 2 rows before and n-1 rows after

So I need to merge 2 data frames:
The first data frame contains dates in YYYY-mm-dd format and event lengths:
datetime length
2003-06-03 1
2003-06-07 1
2003-06-13 1
2003-06-17 3
2003-06-28 5
2003-07-10 1
2003-07-23 1
...
The second data frame contains dates in the same format and discharge data:
datetime q
2003-05-29 36.2
2003-05-30 34.6
2003-05-31 33.1
2003-06-01 30.7
2003-06-02 30.0
2003-06-03 153.0
2003-06-04 69.0
...
The second data frame is much larger.
I want to merge/join only the following rows of the second data frame to the first:
all rows that have the same date as the first frame (I know this can be done with left_join(df1, df2, by = c("datetime")))
the two rows before that row
the n-1 rows after that row, where n = the "length" value of the row in the first data frame.
I would like to identify the rows belonging to the same event as well.
Ideally I would have the following output (notice the event from 2003-06-17):
EventDates length q event#
2003-06-03 1 153.0 1
2003-06-07 1 120.0 2
2003-06-13 1 45.3 3
2003-06-15 na 110.0 4
2003-06-16 na 53.1 4
2003-06-17 3 78.0 4
2003-06-18 na 167.0 4
2003-06-19 na 145.0 4
...
I hope this makes clear what I am trying to do.
This might be one approach using tidyverse and fuzzyjoin.
First, indicate event numbers in your first data.frame, and add two columns for the start and end dates (the start date is 2 days before the date, and the end date is length - 1 days after it).
Then, you can use fuzzy_inner_join to get the selected rows from the second data.frame. Here, you want the rows where the datetime in the second data.frame falls on or after the start date and on or before the end date from the first data.frame.
library(tidyverse)
library(fuzzyjoin)
# number each event, and compute the date window around each one
df1$event <- seq_len(nrow(df1))
df1$start_date <- df1$datetime - 2               # 2 days before the event date
df1$end_date <- df1$datetime + df1$length - 1    # length - 1 days after it

# keep df2 rows whose datetime falls within [start_date, end_date]
fuzzy_inner_join(
  df1,
  df2,
  by = c("start_date" = "datetime", "end_date" = "datetime"),
  match_fun = c(`<=`, `>=`)
) %>%
  select(datetime.y, length, q, event)
I tried this out with some made up data:
R> df1
datetime length
1 2003-06-03 1
2 2003-06-12 1
3 2003-06-21 1
4 2003-06-30 3
5 2003-07-09 5
6 2003-07-18 1
7 2003-07-27 1
8 2003-08-05 2
9 2003-08-14 1
10 2003-08-23 1
11 2003-09-01 3
R> df2
datetime q
1 2003-06-03 44
2 2003-06-04 52
3 2003-06-05 34
4 2003-06-06 20
5 2003-06-07 57
6 2003-06-08 67
7 2003-06-09 63
8 2003-06-10 51
9 2003-06-11 56
10 2003-06-12 37
11 2003-06-13 16
12 2003-06-14 54
13 2003-06-15 46
14 2003-06-16 6
15 2003-06-17 32
16 2003-06-18 91
17 2003-06-19 61
18 2003-06-20 42
19 2003-06-21 28
20 2003-06-22 98
21 2003-06-23 77
22 2003-06-24 81
23 2003-06-25 13
24 2003-06-26 15
25 2003-06-27 73
26 2003-06-28 38
27 2003-06-29 27
28 2003-06-30 49
29 2003-07-01 10
30 2003-07-02 89
31 2003-07-03 9
32 2003-07-04 80
33 2003-07-05 68
34 2003-07-06 26
35 2003-07-07 31
36 2003-07-08 29
37 2003-07-09 84
38 2003-07-10 60
39 2003-07-11 19
40 2003-07-12 97
41 2003-07-13 35
42 2003-07-14 47
43 2003-07-15 70
This will give the following output:
datetime.y length q event
1 2003-06-03 1 44 1
2 2003-06-10 1 51 2
3 2003-06-11 1 56 2
4 2003-06-12 1 37 2
5 2003-06-19 1 61 3
6 2003-06-20 1 42 3
7 2003-06-21 1 28 3
8 2003-06-28 3 38 4
9 2003-06-29 3 27 4
10 2003-06-30 3 49 4
11 2003-07-01 3 10 4
12 2003-07-02 3 89 4
13 2003-07-07 5 31 5
14 2003-07-08 5 29 5
15 2003-07-09 5 84 5
16 2003-07-10 5 60 5
17 2003-07-11 5 19 5
18 2003-07-12 5 97 5
19 2003-07-13 5 35 5
If the output desired is different than above, please let me know what should be different so that I can correct it.
Data
df1 <- structure(list(datetime = structure(c(12206, 12215, 12224, 12233,
12242, 12251, 12260, 12269, 12278, 12287, 12296), class = "Date"),
length = c(1, 1, 1, 3, 5, 1, 1, 2, 1, 1, 3), event = 1:11,
start_date = structure(c(12204, 12213, 12222, 12231, 12240,
12249, 12258, 12267, 12276, 12285, 12294), class = "Date"),
end_date = structure(c(12206, 12215, 12224, 12235, 12246,
12251, 12260, 12270, 12278, 12287, 12298), class = "Date")), row.names = c(NA,
-11L), class = "data.frame")
df2 <- structure(list(datetime = structure(c(12206, 12207, 12208, 12209,
12210, 12211, 12212, 12213, 12214, 12215, 12216, 12217, 12218,
12219, 12220, 12221, 12222, 12223, 12224, 12225, 12226, 12227,
12228, 12229, 12230, 12231, 12232, 12233, 12234, 12235, 12236,
12237, 12238, 12239, 12240, 12241, 12242, 12243, 12244, 12245,
12246, 12247, 12248), class = "Date"), q = c(44L, 52L, 34L, 20L,
57L, 67L, 63L, 51L, 56L, 37L, 16L, 54L, 46L, 6L, 32L, 91L, 61L,
42L, 28L, 98L, 77L, 81L, 13L, 15L, 73L, 38L, 27L, 49L, 10L, 89L,
9L, 80L, 68L, 26L, 31L, 29L, 84L, 60L, 19L, 97L, 35L, 47L, 70L
)), class = "data.frame", row.names = c(NA, -43L))

How to calculate the sum of periods over each column for each row in R

I would like to calculate the cumulative sum for each flower over the years in R. Below is an example of how the table looks (Table 1) and what I want the outcome to be (Table 2). I know how to do the calculation in a long table format, but I am not sure how to do it in a wide table format. Note: I am using the dplyr package.
(Table 1)
flower 1902 1950 2010 2012 2021
lily     23    0    0    8    5
rose     50   60    5   16    0
daisy    30    7   10    2    0
I need to calculate the cumulative sum for each flower across the years. The end result should give me:
(Table 2)
flower 1902 1950 2010 2012 2021
lily     23   23   23   31   36
rose     50  110  115  131  131
daisy    30   37   47   49   49
One option involving dplyr and purrr might be:
library(dplyr)
library(purrr)

dat %>%
  mutate(pmap_dfr(across(-1), ~ cumsum(c(...))))
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
Using rowCumsums from matrixStats
library(matrixStats)
df1[-1] <- rowCumsums(as.matrix(df1[-1]))
Output:
df1
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
Here is one way of getting your expected result:
Your data frame:
dat <- structure(list(flower = c("lily", "rose", "daisy"), X1902 = c(23L,
50L, 30L), X1950 = c(0L, 60L, 7L), X2010 = c(0L, 5L, 10L), X2012 = c(8L,
16L, 2L), X2021 = c(5L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
Apply a function that calculates the cumulative sums to each row of the data at columns 2 to 6:
dat[1:nrow(dat), 2:6] <- t(apply(dat[1:nrow(dat), 2:6], 1, function(x) cumsum(c(x))))
# The result
dat
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
@benson23 has kindly suggested the following simpler code to get the same result:
dat[, 2:6] <- t(apply(dat[,2:6], 1, cumsum))
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
You can use apply with cumsum, plus a little bit of re-formatting.
setNames(as.data.frame(cbind(df[, 1], t(apply(df[, -1], 1, cumsum)))), colnames(df))
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
Data
df <- structure(list(flower = c("lily", "rose", "daisy"), X1902 = c(23L,
50L, 30L), X1950 = c(0L, 60L, 7L), X2010 = c(0L, 5L, 10L), X2012 = c(8L,
16L, 2L), X2021 = c(5L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
Here is an alternative using pivoting:
library(dplyr)
library(tidyr)
dat %>%
  pivot_longer(-flower) %>%
  group_by(flower) %>%
  mutate(value = cumsum(value)) %>%
  pivot_wider() %>%
  ungroup()
flower X1902 X1950 X2010 X2012 X2021
<chr> <int> <int> <int> <int> <int>
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49

How to find departure from normal in temperature - R 4.0.0

I have been working on a dataset that will be universal across multiple climate stations for analyzing temperature and precipitation. I have run into a brick wall designing the 'climatic norms': I have successfully calculated the daily temperature average TAVG and the monthly temperature average AVG_TAVG, and summed up PRCP and SNOW for monthly totals.
Where I am at a standstill is calculating the departure from normal; currently, data from 1981-2010 is considered the climate normal period.
Here is what my dataset looks like currently:
mso_light
year month day date PRCP SNOW SNWD TMAX TMIN TAVG
1 1948 1 1 1948-01-01 0 0 102 44 -122 -39.0
2 1948 1 2 1948-01-02 3 0 51 44 6 25.0
3 1948 1 3 1948-01-03 0 0 25 44 -39 2.5
4 1948 1 4 1948-01-04 38 64 76 33 -56 -11.5
5 1948 1 5 1948-01-05 0 0 76 -6 -83 -44.5
6 1948 1 6 1948-01-06 107 0 51 22 -61 -19.5
7 1948 1 7 1948-01-07 147 0 25 28 -17 5.5
8 1948 1 8 1948-01-08 8 13 25 39 -83 -22.0
9 1948 1 9 1948-01-09 0 0 25 -6 -117 -61.5
10 1948 1 10 1948-01-10 8 10 25 -11 -156 -83.5
So I originally felt I needed date for sorting purposes; I will remove it if it's not needed in the future.
Next, I would like to add a column DepNormT: for every calendar day from Jan 1 to Dec 31, average TAVG over 1981-2010 to find the normal average temperature; DepNormT is then the difference between TAVG and that normal for the entire dataset.
I have tried multiple ways to accomplish this; here are two versions:
mso_DeptT <- mso_light %>%
  group_by(month, day) %>%
  mean(mso_light$TAVG[1981:2010], na.rm = T) %>%
  ungroup()
This gives me the following error:
no applicable method for 'ungroup' applied to an object of class "c('double', 'numeric')"
In addition: Warning message:
In mean.default(., mso_light$TAVG[1981:2010], na.rm = T) :
argument is not numeric or logical: returning NA
This is another version:
## mso_DeptT <- filter(mso_light, year >= "1981", year <= "2010") %>%
##   group_by(day, month) %>%
##   mutate(daily_DeptT = mean(TAVG, na.rm = T)) %>%
##   ungroup()
mso_sum <- mso_light %>%
  group_by(month, year) %>%
  summarize(AVG_TAVG = mean(TAVG, na.rm = TRUE),
            T_PRCP = sum(PRCP, na.rm = TRUE),
            T_SNOW = sum(SNOW, na.rm = TRUE)) %>%
  ungroup()
## To find monthly normal precipitation and snowfall - using dataset mso_sum
cli_Avg <- filter(mso_sum, year >= "1981", year <= "2010") %>%
  group_by(month) %>%
  summarize(Mon_Precip = mean(T_PRCP, na.rm = T),
            Mon_Snow = mean(T_SNOW, na.rm = T))
This gave me a 30-year average that was simply equal to each individual day's TAVG. For example:
year month day date PRCP SNOW SNWD TMAX TMIN TAVG DepNormT
1 1948 1 1 1948-01-01 0 0 102 44 -122 -39.0 -39.0
2 1948 1 2 1948-01-02 3 0 51 44 6 25.0 25.0
3 1948 1 3 1948-01-03 0 0 25 44 -39 2.5 2.5
4 1948 1 4 1948-01-04 38 64 76 33 -56 -11.5 etc.
5 1948 1 5 1948-01-05 0 0 76 -6 -83 -44.5 .
6 1948 1 6 1948-01-06 107 0 51 22 -61 -19.5 .
7 1948 1 7 1948-01-07 147 0 25 28 -17 5.5 .
8 1948 1 8 1948-01-08 8 13 25 39 -83 -22.0
9 1948 1 9 1948-01-09 0 0 25 -6 -117 -61.5
10 1948 1 10 1948-01-10 8 10 25 -11 -156 -83.5
Thanks for suggestions.
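A minimal sketch of the DepNormT calculation described above, assuming mso_light has the year, month, day, and TAVG columns shown: first average TAVG per calendar day over 1981-2010, then join the normals back and subtract.
library(dplyr)

# 30-year (1981-2010) mean TAVG for each calendar day
normals <- mso_light %>%
  filter(year >= 1981, year <= 2010) %>%
  group_by(month, day) %>%
  summarize(norm_TAVG = mean(TAVG, na.rm = TRUE), .groups = "drop")

# join the normals back by calendar day and take the difference
mso_DeptT <- mso_light %>%
  left_join(normals, by = c("month", "day")) %>%
  mutate(DepNormT = TAVG - norm_TAVG)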

Cross join two dataframes by key column using condition in R

I have two dataframes.
mydata1=structure(list(ID_WORKES = c(58005854L, 58005854L, 58002666L,
58002666L), ID_SP_NAR = c(463L, 1951L, 21L, 465L), KOD_DEPO = c(3786L,
3786L, 1439L, 1439L), KOD_DOR = c(58L, 58L, 92L, 92L), COLUMN_MASH = c(6L,
6L, 5L, 5L), prop_violations = structure(c(1L, 2L, 2L, 2L), .Label = c("0.2",
"1"), class = "factor"), mash_score = c(0L, 2L, 2L, 2L)), .Names = c("ID_WORKES",
"ID_SP_NAR", "KOD_DEPO", "KOD_DOR", "COLUMN_MASH", "prop_violations",
"mash_score"), class = "data.frame", row.names = c(NA, -4L))
mydata2=structure(list(ID_SP_NAR = c(463L, 1951L, 21L, 465L, 500L, 600L
)), .Names = "ID_SP_NAR", class = "data.frame", row.names = c(NA,
-6L))
I need to cross-join these data frames by ID_SP_NAR. mydata2 contains only the key variable ID_SP_NAR.
I need to join them in such a way that if an ID_WORKES does not have some of the ID_SP_NAR codes from mydata2, those codes are inserted into the dataset with zero values in the variables prop_violations and mash_score.
I.e., ID_SP_NAR in mydata2 has these values:
ID_SP_NAR
463
1951
21
465
500
600
ID_WORKES = 58005854 has 463 and 1951 but not the others, and ID_WORKES = 58002666 has 21 and 465 and no others!
So the desired output after the cross join is:
ID_WORKES ID_SP_NAR KOD_DEPO KOD_DOR COLUMN_MASH prop_violations mash_score
1 58005854 463 3786 58 6 0.2 0
2 58005854 1951 3786 58 6 1 2
3 58005854 21 3786 58 6 0 0
4 58005854 465 3786 58 6 0 0
5 58005854 500 3786 58 6 0 0
6 58005854 600 3786 58 6 0 0
7 58002666 21 1439 92 5 1 2
8 58002666 465 1439 92 5 1 2
9 58002666 500 1439 92 5 0 0
10 58002666 600 1439 92 5 0 0
11 58002666 463 1439 92 5 0 0
12 58002666 1951 1439 92 5 0 0
KOD_DEPO, KOD_DOR, and COLUMN_MASH have fixed values per worker; they must be preserved too.
How to do that?
merge(mydata1, mydata2, by = "ID_SP_NAR") is not working (I tried it via a left join too; it doesn't work), as it doesn't insert the zeros I want.
We could use complete from tidyr to expand the dataset based on 'ID_WORKES' and the values of 'ID_SP_NAR' in the second dataset:
library(tidyverse)
mydata1 %>%
  mutate_if(is.factor, as.character) %>%
  complete(ID_WORKES, ID_SP_NAR = mydata2$ID_SP_NAR,
           fill = list(prop_violations = '0', mash_score = 0)) %>%
  fill(3:5)
# A tibble: 12 x 7
# ID_WORKES ID_SP_NAR KOD_DEPO KOD_DOR COLUMN_MASH prop_violations mash_score
# <int> <int> <int> <int> <int> <chr> <dbl>
# 1 58002666 21 1439 92 5 1 2
# 2 58002666 463 1439 92 5 0 0
# 3 58002666 465 1439 92 5 1 2
# 4 58002666 500 1439 92 5 0 0
# 5 58002666 600 1439 92 5 0 0
# 6 58002666 1951 1439 92 5 0 0
# 7 58005854 21 1439 92 5 0 0
# 8 58005854 463 3786 58 6 0.2 0
# 9 58005854 465 3786 58 6 0 0
#10 58005854 500 3786 58 6 0 0
#11 58005854 600 3786 58 6 0 0
#12 58005854 1951 3786 58 6 1 2
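For reference, a base R sketch of the same expansion (an alternative not from the thread, using only merge() and ave()): build the full worker-by-code grid, merge the known rows onto it, then fill the gaps.
# build every ID_WORKES x ID_SP_NAR combination
grid <- expand.grid(ID_WORKES = unique(mydata1$ID_WORKES),
                    ID_SP_NAR = mydata2$ID_SP_NAR)
out <- merge(grid, mydata1, by = c("ID_WORKES", "ID_SP_NAR"), all.x = TRUE)

# zero-fill the per-combination measures
out$prop_violations <- as.character(out$prop_violations)
out$prop_violations[is.na(out$prop_violations)] <- "0"
out$mash_score[is.na(out$mash_score)] <- 0

# carry each worker's fixed values into the newly created rows
for (col in c("KOD_DEPO", "KOD_DOR", "COLUMN_MASH")) {
  out[[col]] <- ave(out[[col]], out$ID_WORKES,
                    FUN = function(x) x[!is.na(x)][1])
}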

Running Total with subtraction

I have a data set with closing and opening dates of public schools in California (available here, or as dput() at the bottom of the question). The data also lists what type of school it is and where it is. I am trying to create a running-total column which takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails encoding a lot of different 1's and 0's based on the conditions using ifelse() (dplyr and lubridate are loaded for filter() and year()):
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
# count number open during each year
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
                              by=list(year(lausd$OpenDate)), sum)
Then I subtract the columns from each other to get totals:
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way. Like an apply statement to all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and learned what you were doing, except for pen_rate. It seems that pen_rate is calculated by dividing cum_chart by total. I downloaded the original data set and did the following (I called the data set foo).
To create the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub), I combined Charter and ClosedDate: I checked whether ClosedDate is NA or not and converted the logical output to numbers (1 = open, 0 = closed). I guess this will require less typing on your side. Since the dates are in character, I extracted the year using substr(); if you have a date object, you need to do something else. Once you have year, you group the data with it and calculate how many schools exist for each type of school using count(). This part is the equivalent of your aggregate() code. Then I converted the output to a wide-format data frame with spread() and did the rest of the calculation as you demonstrated in your code. The final output looks different from the one in your question, but my outcome was identical to what I obtained by running your code. I hope this will help you.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
  select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
  filter(NCESDist == "0622710" & (!Charter %in% NA))

mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
       year = substr(OpenDate, start = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
  count(year, group) %>%
  spread(key = group, value = n, fill = 0) %>%
  mutate(net_chart = Y_1 - Y_0,
         net_pub = N_1 - N_0,
         cum_chart = cumsum(net_chart),
         cum_pub = cumsum(net_pub),
         total = cum_chart + cum_pub,
         pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
