Creating new column based on row values of multiple data subsetting conditions - r

I have a dataframe that looks more or less like follows (the original one has 12 years of data):
Year Quarter Age_1 Age_2 Age_3 Age_4
2005 1 158 120 665 32
2005 2 257 145 121 14
2005 3 68 69 336 65
2005 4 112 458 370 101
2006 1 75 457 741 26
2006 2 365 134 223 45
2006 3 257 121 654 341
2006 4 175 124 454 12
2007 1 697 554 217 47
2007 2 954 987 118 54
2007 4 498 235 112 65
Where the numbers in the age columns represents the amount of individuals in each age class for a specific quarter within a specific year. It is noteworthy that sometimes not all quarters in a specific year have data (e.g., third quarter is not represented in 2007). Also, each row represents a sampling event. Although not shown in this example, in the original dataset I always have more than one sampling event for a specific quarter within a specific year. For example, for the first quarter in 2005 I have 47 sampling events, leading therefore to 47 rows.
What I´d like to have now is a dataframe structured in a way like:
Year Quarter Age_1 Age_2 Age_3 Age_4 Cohort
2005 1 158 120 665 32 158
2005 2 257 145 121 14 257
2005 3 68 69 336 65 68
2005 4 112 458 370 101 112
2006 1 75 457 741 26 457
2006 2 365 134 223 45 134
2006 3 257 121 654 341 121
2006 4 175 124 454 12 124
2007 1 697 554 217 47 47
2007 2 954 987 118 54 54
2007 4 498 235 112 65 65
In this case, I want to create a new column (Cohort) in my original dataset which basically follows my cohorts along my dataset. In other words, when I´m in my first year of data (2005 with all quarters), I take the row values of Age_1 and paste it into the new column. When I move to the next year (2006), then I take all my row values related to my Age_2 and paste it to the new column, and so on and so forth.
I have tried to use the following function, but somehow it only works for the first couple of years:
extract_cohort_quarter <- function(d, yearclass=2005, quarterclass=1) {
ny <- 1:nlevels(d$Year) #no. of Year levels in the dataset
nq <- 1:nlevels(d$Quarter)
age0 <- (paste("age", ny, sep="_"))
year0 <- as.character(yearclass + ny - 1)
quarter <- as.character(rep(1:4, length(age0)))
age <- rep(age0,each=4)
year <- rep(year0,each=4)
df <- data.frame(year,age,quarter,stringsAsFactors=FALSE)
n <- nrow(df)
dnew <- NULL
for(i in 1:n) {
tmp <- subset(d, Year==df$year[i] & Quarter==df$quarter[i])
tmp$Cohort <- tmp[[age[i]]]
dnew <- rbind(dnew, tmp)
}
levels(dnew$Year) <- paste("Yearclass_", yearclass, ":",
year,":",quarter,":", age, sep="")
dnew
}
I have plenty of data from age_1 to age_12 for all the years and quarters, so I don´t think that it´s something related to the data structure itself.
Is there an easier solution to solve this problem? Or is there a way to improve my extract_cohort_quarter() function? Any help will be much appreciated.
-M

I have a simple solution but that demands bit of knowledge of the data.table library. I think you can easily adapt it to your further needs.
Here is the data:
DT <- as.data.table(list(Year = c(2005, 2005, 2005, 2005, 2006, 2006 ,2006 ,2006, 2007, 2007, 2007),
Quarter= c(1, 2, 3, 4 ,1 ,2 ,3 ,4 ,1 ,2 ,4),
Age_1 = c(158, 257, 68, 112 ,75, 365, 257, 175, 697 ,954, 498),
Age_2= c(120 ,145 ,69 ,458 ,457, 134 ,121 ,124 ,554 ,987, 235),
Age_3= c(665 ,121 ,336 ,370 ,741 ,223 ,654 ,454,217,118,112),
Age_4= c(32,14,65,101,26,45,341,12,47,54,65)
))
Here is th code :
DT[,index := .GRP, by = Year]
DT[,cohort := get(paste0("Age_",index)),by = Year]
and the output:
> DT
Year Quarter Age_1 Age_2 Age_3 Age_4 index cohort
1: 2005 1 158 120 665 32 1 158
2: 2005 2 257 145 121 14 1 257
3: 2005 3 68 69 336 65 1 68
4: 2005 4 112 458 370 101 1 112
5: 2006 1 75 457 741 26 2 457
6: 2006 2 365 134 223 45 2 134
7: 2006 3 257 121 654 341 2 121
8: 2006 4 175 124 454 12 2 124
9: 2007 1 697 554 217 47 3 217
10: 2007 2 954 987 118 54 3 118
11: 2007 4 498 235 112 65 3 112
What it does:
DT[,index := .GRP, by = Year]
creates an index for all different year in your table (by = Year makes an operation for group of year, .GRP create an index following the grouping sequence).
I use it to call the column that you named Age_ with the number created
DT[,cohort := get(paste0("Age_",index)),by = Year]
You can even do everything in the single line
DT[,cohort := get(paste0("Age_",.GRP)),by = Year]
I hope it helps

Here is an option using tidyverse
library(dplyr)
library(tidyr)
df1 %>%
gather(key, Cohort, -Year, -Quarter) %>%
separate(key, into = c('key1', 'key2')) %>%
mutate(ind = match(Year, unique(Year))) %>%
group_by(Year) %>%
filter(key2 == Quarter[ind]) %>%
mutate(newcol = paste(Year, Quarter, paste(key1, ind, sep="_"), sep=":")) %>%
ungroup %>%
select(Cohort, newcol) %>%
bind_cols(df1, .)
# Year Quarter Age_1 Age_2 Age_3 Age_4 Cohort newcol
#1 2005 1 158 120 665 32 158 2005:1:Age_1
#2 2005 2 257 145 121 14 257 2005:2:Age_1
#3 2005 3 68 69 336 65 68 2005:3:Age_1
#4 2005 4 112 458 370 101 112 2005:4:Age_1
#5 2006 1 75 457 741 26 457 2006:1:Age_2
#6 2006 2 365 134 223 45 134 2006:2:Age_2
#7 2006 3 257 121 654 341 121 2006:3:Age_2
#8 2006 4 175 124 454 12 124 2006:4:Age_2
#9 2007 1 697 554 217 47 47 2007:1:Age_3
#10 2007 2 954 987 118 54 54 2007:2:Age_3
#11 2007 4 498 235 112 65 65 2007:4:Age_3

Related

Scraping .txt files for many files that are formatted similarly

I converted some .pdf's into .txt files using R and am having trouble finding a way to scrape them to ultimately construct a data frame. I am new to text scraping, so please have mercy on my ignorance.
This is the format of the .txt file and I am mainly interested in the numbers and headers. Any recommendations are much appreciated.
Township of Buena Vista
General Election Results - November 2, 2010
Prepared by the Office of Edward P. McGettigan, Atlantic County Clerk
Township Committee Public Count
Mary Ann
Peter C. Richard Henry L. Total Total Total Total Total
Micheletti-
Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public
Levari
Ward Democratic Democratic Republican Count Mail Count Count Count
Republican
District
D-1 205 195 230 223 436 113 16 565
D-2 202 160 275 261 459 459
D-3 331 346 99 87 457 457
D-4 215 205 164 152 377 377
D-5 104 95 169 166 271 271
D-6 77 70 109 108 188 188
I would like the output to be something in tabular form like
Mary Ann
Peter C. Richard Henry L. Total Total Total Total Total
Micheletti-
Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public
Levari
Democratic Democratic Republican Count Mail Count Count Count
Republican
District
D-1 205 195 230 223 436 113 16 565
D-2 202 160 275 261 459 459
D-3 331 346 99 87 457 457
D-4 215 205 164 152 377 377
D-5 104 95 169 166 271 271
D-6 77 70 109 108 188 188
except with the names and party affiliation as one character string. The goal is to merge this with other files like it to create a dataset.
It's always going to be ugly, but this should be somewhat automated:
# read it in as individual lines
rl <- readLines(textConnection(txt))
# drop all the extra info at top
rl <- rl[-(1:9)]
# just keep header
dist <- which(rl == "District")
hd <- head(rl, dist - 1)
# make everything same length and split characters
hd <- lapply(strsplit(hd, ""), `length<-`, max(nchar(hd)))
hd <- lapply(hd, function(x) replace(x, is.na(x), " "))
# find where spaces are in common in all rows
wdths <- rle(Reduce(`&`, lapply(hd, `==`, " ")))$lengths
# read it all in, ignoring district row
out <- read.fwf(textConnection(rl[-dist]), widths=wdths )
# keep those columns that aren't all NA
out <- out[!sapply(out, function(x) all(is.na(x)) )]
# collapse the header
hdr <- sapply(head(out, dist - 1),
function(x) trimws(gsub("\\s+", " ", paste(na.omit(x), collapse=" "))))
# finalise by joining
setNames(
data.frame(lapply(tail(out, -(dist-1)), type.convert, as.is=TRUE)),
hdr
)
Result:
# Ward Peter C. Bylone, Sr. Democratic Richard Harlan Democratic
#1 D-1 205 195
#2 D-2 202 160
#3 D-3 331 346
#4 D-4 215 205
#5 D-5 104 95
#6 D-6 77 70
# Mary Ann Micheletti- Levari Republican Henry L. Coia, Jr. Republican
#1 230 223
#2 275 261
#3 99 87
#4 164 152
#5 169 166
#6 109 108
# Total Machine Count Total Vote By Mail Total Provisional Count
#1 436 113 16
#2 459 NA NA
#3 457 NA NA
#4 377 NA NA
#5 271 NA NA
#6 188 NA NA
# Total Emergency Count Total Public Count
#1 NA 565
#2 NA 459
#3 NA 457
#4 NA 377
#5 NA 271
#6 NA 188
The example txt used was:
" Township of Buena Vista\n General Election Results - November 2, 2010\n Prepared by the Office of Edward P. McGettigan, Atlantic County Clerk\n\n\n\n\n Township Committee Public Count\n\n Mary Ann\n Peter C. Richard Henry L. Total Total Total Total Total\n Micheletti-\n Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public\n Levari\nWard Democratic Democratic Republican Count Mail Count Count Count\n Republican\nDistrict\n D-1 205 195 230 223 436 113 16 565\n D-2 202 160 275 261 459 459\n D-3 331 346 99 87 457 457\n D-4 215 205 164 152 377 377\n D-5 104 95 169 166 271 271\n D-6 77 70 109 108 188 188"
Perhaps you can generalize this approach, but I don't think, it is very stable when used with other data than the example data.
I put your example into a file named example.txt.
library(tidyverse)
input <- read_lines("example.txt")
input[as.logical(cumsum(input == "District"))] %>%
tibble() %>%
slice(-1) %>%
mutate(count = str_replace_all(string = ., "\\s{9,12}", ";")) %>%
select(-.) %>%
separate(col = count, into = c("District", as.character(1:9)), sep = ";") %>%
mutate(across(everything(), str_trim),
across(as.character(1:9), as.integer))
returns
# A tibble: 6 x 10
District `1` `2` `3` `4` `5` `6` `7` `8` `9`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 D-1 205 195 230 223 436 113 16 NA 565
2 D-2 202 160 275 261 459 NA NA NA 459
3 D-3 331 346 99 87 457 NA NA NA 457
4 D-4 215 205 164 152 377 NA NA NA 377
5 D-5 104 95 169 166 271 NA NA NA 271
6 D-6 77 70 109 108 188 NA NA NA 188
Creating the column names (the candidate names) is a tricky task. Depending on the counts, perhaps it is necessary to adjust the spaces replaced with ";": \\s{9,12} means replace at least 9 up to 12 space characters.

find max column value in r conditional on another column

I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return a maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also 'NaN' in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
filter(AB >= 100) %>%
filter(avg == max(avg, na.rm = TRUE))
Where the first filter is to only keep rows where AB is greater than or equal to 100 and the second filter is to select the entire row where it is max. If you want to to only get the maximum value, you can do something like this:
data %>%
filter(AB >= 100) %>%
summarise(max = max(avg, na.rm = TRUE))

Re-code and spread data in columns, based on values in another column

I have a table that looks like this:
Year Tax1 Tax2 Tax3 Tax4
2004 12 123 145 104
2004 145 99 90 56
2005 212 300 240 123
etc...
The Tax# columns give info about the tax paid in years subsequent to the value in the Year column. I would like to re-arrange the table, and rename the columns, so it looked like this:
Year Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
2004 12 123 145 104 NA
2004 145 99 90 56 NA
2005 NA 212 300 240 123
I was thinking of splitting the table into separate tables, based on the year column, then renaming the Tax# columns, and joining back together. But its a bit convoluted, and I was wondering if there was a simpler way to do this?
Any help much appreciated.
library(dplyr)
library(tidyr)
df <- read.table(text = "
Year Tax1 Tax2 Tax3 Tax4
2004 12 123 145 104
2004 145 99 90 56
2005 212 300 240 123
", header = TRUE)
df %>%
mutate(id = row_number()) %>%
gather(rel_year, amount, contains("Tax")) %>%
mutate(rel_year = as.integer(gsub("Tax", "", rel_year)),
pay_year = Year + rel_year - 1,
pay_year = paste0("Tax", pay_year)) %>%
select(-rel_year) %>%
spread(pay_year, amount)
Result:
Year id Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
1 2004 1 12 123 145 104 NA
2 2004 2 145 99 90 56 NA
3 2005 3 NA 212 300 240 123
dat1%>%
gather(key,value,-Year)%>%
group_by(key)%>%
mutate(col=1:n())%>%
ungroup()%>%
mutate(key=paste0("Tax",2004:2008)[(Year==2005)+
as.numeric(sub("\\D+","",key))])%>%
spread(key,value)
# A tibble: 3 x 7
Year col Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
<int> <int> <int> <int> <int> <int> <int>
1 2004 1 12 123 145 104 NA
2 2004 2 145 99 90 56 NA
3 2005 3 NA 212 300 240 123
>
Here is an option using data.table
library(data.table)
library(readr)
dcast(melt(setDT(df, keep.rownames = TRUE), id.var = c("rn", "Year"))[,
newYear := paste0("Tax", Year + parse_number(variable) - 1)],
rn + Year~ newYear, value.var = 'value')[, rn := NULL][]
# Year Tax2004 Tax2005 Tax2006 Tax2007 Tax2008
#1: 2004 12 123 145 104 NA
#2: 2004 145 99 90 56 NA
#3: 2005 NA 212 300 240 123

Running Total with subtraction

I have a data set with closing and opening dates of public schools in California. Available here or dput() at the bottom of the question. The data also lists what type of school it is and where it is. I am trying to create a running total column which also takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails me encoding a lot of different 1's and 0's based on the conditions using ifelse:
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
# count number open during each year
Then I subtract the columns from each other to get totals.
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way. Like an apply statement to all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and learned what you were doing except pen_rate. It seems that pen_rate is calculated dividing cum_chart by total. I download the original data set and did the following. I called the data set foo. Whenclosed_pub), I combined Charter and ClosedDate. I checked if ClosedDate is NA or not, and converted the logical output to numbers (1 = open, 0 = closed). This is how I created the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub). I guess this would ask you to do less typing. Since the dates are in character, I extracted year using substr(). If you have a date object, you need to do something else. Once you have year, you group the data with it and calculate how many schools exist for each type of school using count(). This part is the equivalent of your aggregate() code. Then, Convert the output to a wide-format data with spread() and did the rest of the calculation as you demonstrated in your codes. The final output seems different from what you have in your question, but my outcome was identical to one that I obtained by running your codes. I hope this will help you.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
filter(NCESDist == "0622710" & (!Charter %in% NA))
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
year = substr(OpenDate, star = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
count(year, group) %>%
spread(key = group, value = n, fill = 0) %>%
mutate(net_chart = Y_1 - Y_0,
net_pub = N_1 - N_0,
cum_chart = cumsum(net_chart),
cum_pub = cumsum(net_pub),
total = cum_chart + cum_pub,
pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679

diff operation within a group, after a dplyr::group_by()

Let's say I have this data.frame (with 3 variables)
ID Period Score
123 2013 146
123 2014 133
23 2013 150
456 2013 205
456 2014 219
456 2015 140
78 2012 192
78 2013 199
78 2014 133
78 2015 170
Using dplyr I can group them by ID and filter these ID that appear more than once
data <- data %>% group_by(ID) %>% filter(n() > 1)
Now, what I like to achieve is to add a column that is:
Difference = Score of Period P - Score of Period P-1
to get something like this:
ID Period Score Difference
123 2013 146
123 2014 133 -13
456 2013 205
456 2014 219 14
456 2015 140 -79
78 2012 192
78 2013 199 7
78 2014 133 -66
78 2015 170 37
It is rather trivial to do this in a spreadsheet, but I have no idea on how I can achieve this in R.
Thanks for any help or guidance.
Here is another solution using lag. Depending on the use case it might be more convenient than diff because the NAs clearly show that a particular value did not have predecessor whereas a 0 using diff might be the result of a) a missing predecessor or of b) the subtraction between two periods.
data %>% group_by(ID) %>% filter(n() > 1) %>%
mutate(
Difference = Score - lag(Score)
)
# ID Period Score Difference
# 1 123 2013 146 NA
# 2 123 2014 133 -13
# 3 456 2013 205 NA
# 4 456 2014 219 14
# 5 456 2015 140 -79
# 6 78 2012 192 NA
# 7 78 2013 199 7
# 8 78 2014 133 -66
# 9 78 2015 170 37

Resources