I have a data set with closing and opening dates of public schools in California. Available here or dput() at the bottom of the question. The data also lists what type of school it is and where it is. I am trying to create a running total column which also takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails me encoding a lot of different 1's and 0's based on the conditions using ifelse:
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
# count number open during each year
Then I subtract the columns from each other to get totals.
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way. Like an apply statement to all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and learned what you were doing except pen_rate. It seems that pen_rate is calculated dividing cum_chart by total. I download the original data set and did the following. I called the data set foo. Whenclosed_pub), I combined Charter and ClosedDate. I checked if ClosedDate is NA or not, and converted the logical output to numbers (1 = open, 0 = closed). This is how I created the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub). I guess this would ask you to do less typing. Since the dates are in character, I extracted year using substr(). If you have a date object, you need to do something else. Once you have year, you group the data with it and calculate how many schools exist for each type of school using count(). This part is the equivalent of your aggregate() code. Then, Convert the output to a wide-format data with spread() and did the rest of the calculation as you demonstrated in your codes. The final output seems different from what you have in your question, but my outcome was identical to one that I obtained by running your codes. I hope this will help you.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
filter(NCESDist == "0622710" & (!Charter %in% NA))
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
year = substr(OpenDate, star = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
count(year, group) %>%
spread(key = group, value = n, fill = 0) %>%
mutate(net_chart = Y_1 - Y_0,
net_pub = N_1 - N_0,
cum_chart = cumsum(net_chart),
cum_pub = cumsum(net_pub),
total = cum_chart + cum_pub,
pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
Related
I have 4 data frames that all look like this:
Product 2018
Number
Minimum
Maximum
1
56
1
5
2
42
12
16
3
6523
23
56
4
123
23
102
5
56
23
64
6
245623
56
87
7
546
25
540
8
54566
253
560
Product 2019
Number
Minimum
Maximum
1
56
32
53
2
642
423
620
3
56423
432
560
4
3
431
802
5
2
2
6
6
4523
43
68
7
555
23
54
8
55646
3
6
Product 2020
Number
Minimum
Maximum
1
23
2
5
2
342
4
16
3
223
3
5
4
13
4
12
5
2
4
7
6
223
7
8
7
5
34
50
8
46
3
6
Product 2021
Number
Minimum
Maximum
1
234
3
5
2
3242
4
16
3
2423
43
56
4
123
43
102
5
24
4
6
6
2423
4
18
7
565
234
540
8
5646
23
56
I want to join all the tables so I get a table that looks like this:
Products
Number 2021
Min-Max 2021
Number 2020
Min-Max 2020
Number 2019
Min-Max 2019
Number 2018
Min-Max 2018
1
234
3 to 5
23
2 to 5
...
...
...
...
2
3242
4 to 16
342
4 to 16
...
...
...
...
3
2423
43 to 56
223
3 to 5
...
...
...
...
4
123
43 to 102
13
4 to 12
...
...
...
...
5
24
4 to 6
2
4 to 7
...
...
...
...
6
2423
4 to 18
223
7 to 8
...
...
...
...
7
565
234 to 540
5
34 to 50
...
...
...
...
8
5646
23 to 56
46
3 to 6
...
...
...
...
The Product for all years are the same so I would like to have a data frame that contains the number for each year as a column and joins the column for minimum and maximum as one.
Any help is welcome!
How about something like this. You are trying to join several dataframes by a single column, which is relatively straight forward using full_join. The difficulty is that you are trying to extract information from the column names and combine several columns at the same time. I would map out everying you want to do and then reduce the list of dataframes at the end. Here is an example with two dataframes, but you could add as many as you want to the list at the begining.
library(tidyverse)
#test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
list(df1, df2) |>
map(\(x){
year <- str_extract(colnames(x)[1], "\\d+?$")
mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum))|>
rename(!!quo_name(paste0("Number ", year)) := Number)|>
rename_with(~gsub("\\s\\d+?$", "", .), 1) |>
select(-c(Minimum, Maximum))
}) |>
reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but includes bind_rows to combine the data.frames, then pivot_wider to end in a wide format.
The first steps strip the year from the Product XXXX column name, as this carries relevant information on year for that data.frame. If that column is renamed as Product they are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.
library(tidyverse)
list(df1, df2, df3, df4) %>%
map(~.x %>%
mutate(Year = gsub("Product", "", names(.x)[1])) %>%
rename(Product = !!names(.[1]))) %>%
bind_rows() %>%
mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
pivot_wider(id_cols = Product, names_from = Year, values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56
This question already has answers here:
Reshape horizontal to to long format using pivot_longer
(3 answers)
Closed 2 years ago.
Thank you all for your answers, I thought I was smarter than I am and hoped I would've understood any of it. I think I messed up my visualisation of my data aswell. I have edited my post to better show my sample data. Sorry for the inconvenience, and I truly hope that someone can help me.
I have a question about reshaping my data. The data collected looks as such:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurment4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now i would like it to look something like this:
PID Time Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would i tend to get there? I have looked up some things about wide to longformat, but it doesn't seem to do the trick. Am reletively new to Rstudio and Stackoverflow (if you couldn't tell that already).
Kind regards, and thank you in advance.
Here is a slightly different pivot_longer() version.
library(tidyr)
library(dplyr)
dw %>%
pivot_longer(cols = -PID, names_to =".value", names_pattern = "(.+)[0-9]")
# A tibble: 9 x 3
PID T measurement
<dbl> <dbl> <dbl>
1 1 1 100
2 1 4 200
3 1 7 50
4 2 2 150
5 2 5 300
6 2 8 60
7 3 3 120
8 3 6 210
9 3 9 70
The names_to = ".value" argument creates new columns from column names based on the names_pattern argument. The names_pattern argument takes a special regex input. In this case, here is the breakdown:
(.+) # match everything - anything noted like this becomes the ".values"
[0-9] # numeric characters - tells the pattern that the numbers
# at the end are excluded from ".values". If you have multiple digit
# numbers, use [0-9*]
In the last edit you asked for a solution that is easy to understand. A very simple approach would be to stack the measurement columns on top of each other and the Tdays columns on top of each other. Although specialty packages make things very concise and elegant, for simplicity we can solve this without additional packages. Standard R has a convenient function aptly named stack, which works like this:
> exp <- data.frame(value1 = 1:5, value2 = 6:10)
> stack(exp)
values ind
1 1 value1
2 2 value1
3 3 value1
4 4 value1
5 5 value1
6 6 value2
7 7 value2
8 8 value2
9 9 value2
10 10 value2
We can stack measurements and Tdays seperately and then combine them via cbind:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
cbind(stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
Which keeps measurements and Tdays neatly together but leaves us without pid which we can add using rep to replicate the original pid 4 times:
result <- cbind(pid = rep(data$pid, 4),
stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
The head of which looks like
> head(result)
pid values ind values ind
1 1 1356 measurement1 1435 Tdays1
2 2 943 measurement1 1848 Tdays1
3 3 1590 measurement1 185 Tdays1
4 4 130 measurement1 72 Tdays1
5 4 140 measurement1 82 Tdays1
6 4 220 measurement1 126 Tdays1
As I said above, this is not the order you expected and you can try to sort this data.frame, if that is of any concern:
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")
leading to the final result
> head(result)
pid Time Value
1 1 1435 1356
13 1 1405 1483
25 1 1374 1563
37 1 NA NA
2 2 1848 943
14 2 1818 1173
tidyverse solution
library(tidyverse)
dw %>%
pivot_longer(-PID) %>%
mutate(name = gsub('^([A-Za-z]+)(\\d+)$', '\\1_\\2', name )) %>%
separate(name, into = c('A', 'B'), sep = '_', convert = T) %>%
pivot_wider(names_from = A, values_from = value)
Gives the following output
# A tibble: 9 x 4
PID B T measurement
<int> <int> <int> <int>
1 1 1 1 100
2 1 2 4 200
3 1 3 7 50
4 2 1 2 150
5 2 2 5 300
6 2 3 8 60
7 3 1 3 120
8 3 2 6 210
9 3 3 9 70
Considering a dataframe, df like the following:
PID T1 measurement1 T2 measurement2 T3 measurement3
1 1 100 4 200 7 50
2 2 150 5 300 8 60
3 3 120 6 210 9 70
You can use this solution to get your required dataframe:
iters = seq(from = 4, to = length(colnames(df))-1, by = 2)
finalDf = df[, c(1,2,3)]
for(j in iters){
tobind = df[, c(1,j,j+1)]
finalDf = rbind(finalDf, tobind)
}
finalDf = finalDf[order(finalDf[,1]),]
print(finalDf)
The output of the print statement is this:
PID T1 measurement1
1 1 1 100
4 1 4 200
7 1 7 50
2 2 2 150
5 2 5 300
8 2 8 60
3 3 3 120
6 3 6 210
9 3 9 70
Maybe you can try reshape like below
reshape(
setNames(data, gsub("(\\d+)$", "\\.\\1", names(data))),
direction = "long",
varying = 2:ncol(data)
)
I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return a maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also 'NaN' in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
filter(AB >= 100) %>%
filter(avg == max(avg, na.rm = TRUE))
Where the first filter is to only keep rows where AB is greater than or equal to 100 and the second filter is to select the entire row where it is max. If you want to to only get the maximum value, you can do something like this:
data %>%
filter(AB >= 100) %>%
summarise(max = max(avg, na.rm = TRUE))
I'm new to R so this question might be quite basic.
There is a column in my data which goes like 4 4 4 4 7 7 7 13 13 13 13 13 13 13 4 4 7 7 7 13 13 13 13 13 13 13 13 4 4.....
One cycle of 4...7...13... is considered as one complete run, to which I will assign a Run Number (1, 2, 3...) to each run.
The number of times that each value (4, 7, 13) repeats is not fixed, and the total number of rows in a run is not fixed either. The total number of runs is unknown (but typically ranging from 60-90). The order of (4, 7, 13) is fixed.
I have attached my current code here. It works fine, but it does take a minute or two when there's about a few million rows of data. I'm aware that growing vectors in a for loop is really not recommended in R, so I would like to ask if anyone has a more elegant solution to this.
Sample data can be generated with the code below, and the desired output can also be generated with the sample code below.
#Generates sample data
df <- data.frame(Temp = c(sample(50:250, 30)), Pres = c(sample(500:1000, 30)),
Message = c(rep(4, 3), rep(7, 2), rep(13, 6), rep(4, 4), rep(7, 1), rep(13, 7), rep(4, 3), rep(7, 4)))
Current Solution
prev_val = 0
Rcount = 1
Run_Count = c()
for (val in df$Message)
{
delta = prev_val - val
if((delta == 9))
Rcount = Rcount + 1
prev_val = val
Run_Count = append(Run_Count, Rcount)
}
df$Run = Run_Count
The desired output:
226 704 4 1
138 709 4 1
136 684 4 1
57 817 7 1
187 927 7 1
190 780 13 1
152 825 13 1
126 766 13 1
202 855 13 1
214 757 13 1
172 922 13 1
50 975 4 2
159 712 4 2
212 802 4 2
181 777 4 2
102 933 7 2
165 753 13 2
67 962 13 2
119 631 13 2
The data frame will later be split by the Run Number, but after being categorized according to the value, i.e.
... 4 1
... 4 1
... 4 1
... 4 1
... 4 2
... 4 2
... 4 2
... 4 3
.....
I am not sure if this is an improvement, but it uses the rle run length encoding function to determine the length of each repeat in each run.
df <- data.frame(Temp = c(sample(50:250, 30)), Pres = c(sample(500:1000, 30)),
Message = c(rep(4, 3), rep(7, 2), rep(13, 6), rep(4, 4), rep(7, 1), rep(13, 7), rep(4, 3), rep(7, 4)))
rleout<-rle(df$Message)
#find the length of the runs and create the numbering
runcounts<-ceiling(length(rleout$lengths)/3)
runs<-rep(1:runcounts, each=3)
#need to trim the length of run numbers for cases where there is not a
# full sequence, as in the test case.
rleout$values<-runs[1:length(rleout$lengths)]
#create the new column
df$out<-inverse.rle(rleout)
I'm sure someone can come along and demonstrate and a better and faster method using data tables.
easily use:
df$runID <- cumsum(c(-1,diff(df$Message)) < 0)
# Temp Pres Message runID
# 1 174 910 4 1
# 2 181 612 4 1
# 3 208 645 4 1
# 4 89 601 7 1
# 5 172 812 7 1
# 6 213 672 13 1
# 7 137 848 13 1
# 8 153 833 13 1
# 9 127 591 13 1
# 10 243 907 13 1
# 11 146 599 13 1
# 12 151 567 4 2
# 13 139 855 4 2
# 14 147 793 4 2
# 15 227 533 4 2
# 16 241 959 7 2
# 17 206 948 13 2
# 18 236 875 13 2
# 19 133 537 13 2
# 20 70 688 13 2
# 21 218 528 13 2
# 22 244 927 13 2
# 23 161 697 13 2
# 24 177 572 4 3
# 25 179 911 4 3
# 26 192 559 4 3
# 27 60 771 7 3
# 28 245 682 7 3
# 29 196 614 7 3
# 30 171 536 7 3
I have a dataframe that looks more or less like follows (the original one has 12 years of data):
Year Quarter Age_1 Age_2 Age_3 Age_4
2005 1 158 120 665 32
2005 2 257 145 121 14
2005 3 68 69 336 65
2005 4 112 458 370 101
2006 1 75 457 741 26
2006 2 365 134 223 45
2006 3 257 121 654 341
2006 4 175 124 454 12
2007 1 697 554 217 47
2007 2 954 987 118 54
2007 4 498 235 112 65
Where the numbers in the age columns represents the amount of individuals in each age class for a specific quarter within a specific year. It is noteworthy that sometimes not all quarters in a specific year have data (e.g., third quarter is not represented in 2007). Also, each row represents a sampling event. Although not shown in this example, in the original dataset I always have more than one sampling event for a specific quarter within a specific year. For example, for the first quarter in 2005 I have 47 sampling events, leading therefore to 47 rows.
What I´d like to have now is a dataframe structured in a way like:
Year Quarter Age_1 Age_2 Age_3 Age_4 Cohort
2005 1 158 120 665 32 158
2005 2 257 145 121 14 257
2005 3 68 69 336 65 68
2005 4 112 458 370 101 112
2006 1 75 457 741 26 457
2006 2 365 134 223 45 134
2006 3 257 121 654 341 121
2006 4 175 124 454 12 124
2007 1 697 554 217 47 47
2007 2 954 987 118 54 54
2007 4 498 235 112 65 65
In this case, I want to create a new column (Cohort) in my original dataset which basically follows my cohorts along my dataset. In other words, when I´m in my first year of data (2005 with all quarters), I take the row values of Age_1 and paste it into the new column. When I move to the next year (2006), then I take all my row values related to my Age_2 and paste it to the new column, and so on and so forth.
I have tried to use the following function, but somehow it only works for the first couple of years:
extract_cohort_quarter <- function(d, yearclass=2005, quarterclass=1) {
ny <- 1:nlevels(d$Year) #no. of Year levels in the dataset
nq <- 1:nlevels(d$Quarter)
age0 <- (paste("age", ny, sep="_"))
year0 <- as.character(yearclass + ny - 1)
quarter <- as.character(rep(1:4, length(age0)))
age <- rep(age0,each=4)
year <- rep(year0,each=4)
df <- data.frame(year,age,quarter,stringsAsFactors=FALSE)
n <- nrow(df)
dnew <- NULL
for(i in 1:n) {
tmp <- subset(d, Year==df$year[i] & Quarter==df$quarter[i])
tmp$Cohort <- tmp[[age[i]]]
dnew <- rbind(dnew, tmp)
}
levels(dnew$Year) <- paste("Yearclass_", yearclass, ":",
year,":",quarter,":", age, sep="")
dnew
}
I have plenty of data from age_1 to age_12 for all the years and quarters, so I don´t think that it´s something related to the data structure itself.
Is there an easier solution to solve this problem? Or is there a way to improve my extract_cohort_quarter() function? Any help will be much appreciated.
-M
I have a simple solution but that demands bit of knowledge of the data.table library. I think you can easily adapt it to your further needs.
Here is the data:
DT <- as.data.table(list(Year = c(2005, 2005, 2005, 2005, 2006, 2006 ,2006 ,2006, 2007, 2007, 2007),
Quarter= c(1, 2, 3, 4 ,1 ,2 ,3 ,4 ,1 ,2 ,4),
Age_1 = c(158, 257, 68, 112 ,75, 365, 257, 175, 697 ,954, 498),
Age_2= c(120 ,145 ,69 ,458 ,457, 134 ,121 ,124 ,554 ,987, 235),
Age_3= c(665 ,121 ,336 ,370 ,741 ,223 ,654 ,454,217,118,112),
Age_4= c(32,14,65,101,26,45,341,12,47,54,65)
))
Here is th code :
DT[,index := .GRP, by = Year]
DT[,cohort := get(paste0("Age_",index)),by = Year]
and the output:
> DT
Year Quarter Age_1 Age_2 Age_3 Age_4 index cohort
1: 2005 1 158 120 665 32 1 158
2: 2005 2 257 145 121 14 1 257
3: 2005 3 68 69 336 65 1 68
4: 2005 4 112 458 370 101 1 112
5: 2006 1 75 457 741 26 2 457
6: 2006 2 365 134 223 45 2 134
7: 2006 3 257 121 654 341 2 121
8: 2006 4 175 124 454 12 2 124
9: 2007 1 697 554 217 47 3 217
10: 2007 2 954 987 118 54 3 118
11: 2007 4 498 235 112 65 3 112
What it does:
DT[,index := .GRP, by = Year]
creates an index for all different year in your table (by = Year makes an operation for group of year, .GRP create an index following the grouping sequence).
I use it to call the column that you named Age_ with the number created
DT[,cohort := get(paste0("Age_",index)),by = Year]
You can even do everything in the single line
DT[,cohort := get(paste0("Age_",.GRP)),by = Year]
I hope it helps
Here is an option using tidyverse
library(dplyr)
library(tidyr)
df1 %>%
gather(key, Cohort, -Year, -Quarter) %>%
separate(key, into = c('key1', 'key2')) %>%
mutate(ind = match(Year, unique(Year))) %>%
group_by(Year) %>%
filter(key2 == Quarter[ind]) %>%
mutate(newcol = paste(Year, Quarter, paste(key1, ind, sep="_"), sep=":")) %>%
ungroup %>%
select(Cohort, newcol) %>%
bind_cols(df1, .)
# Year Quarter Age_1 Age_2 Age_3 Age_4 Cohort newcol
#1 2005 1 158 120 665 32 158 2005:1:Age_1
#2 2005 2 257 145 121 14 257 2005:2:Age_1
#3 2005 3 68 69 336 65 68 2005:3:Age_1
#4 2005 4 112 458 370 101 112 2005:4:Age_1
#5 2006 1 75 457 741 26 457 2006:1:Age_2
#6 2006 2 365 134 223 45 134 2006:2:Age_2
#7 2006 3 257 121 654 341 121 2006:3:Age_2
#8 2006 4 175 124 454 12 124 2006:4:Age_2
#9 2007 1 697 554 217 47 47 2007:1:Age_3
#10 2007 2 954 987 118 54 54 2007:2:Age_3
#11 2007 4 498 235 112 65 65 2007:4:Age_3