Merging three factors so their dependent variable sums in R

Not sure if someone has answered this already. I have searched, but so far nothing has worked for me. I have a very large dataset that I am trying to narrow down. I need to combine three levels of my "PROG" factor ("Grad.2","Grad.3","Grad.H") so that they become a single level ("Grad"), with the dependent variable ("NUMBER") summed across each comparable set of values.
i.e.
YEAR = "92/93" AGE = "20-24" PROG = "Grad.2" NUMBER = "50"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.3" NUMBER = "25"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.H" NUMBER = "2"
turns into
YEAR = "92/93" AGE = "20-24" PROG = "Grad" NUMBER = "77"
I then want to drop all other levels of PROG so that I can compare the enrollment rates for Grad without worrying about the other levels (which I deal with separately). So my active independent variables are YEAR and AGE, while the dependent variable is NUMBER.
I hope this shows my data adequately:
structure(list
(YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("92/93", "93/94", "94/95", "95/96", "96/97",
"97/98", "98/99", "99/00", "00/01", "01/02", "02/03", "03/04",
"04/05", "05/06", "06/07", "07/08", "08/09", "09/10", "10/11",
"11/12", "12/13", "13/14", "14/15", "15/16"), class = "factor"),
AGE = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L), .Label = c("1-19",
"20-24", "25-30", "31-34", "35-39", "40+", "NR", "T.Age"), class = c("ordered",
"factor")),
PROG = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
19L, 19L, 19L), .Label = c("T.Prog", "Basic", "Career", "Grad.H",
"Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res",
"NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred",
"Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual",
"Und.Grad", "Und.Grad.Qual"), class = "factor"),
NUMBER = c(104997L,
347235L, 112644L, 38838L, 35949L, 50598L, 5484L, 104991L,
333807L, 76692L)), row.names = c(7936L, 7948L, 7960L, 7972L,
7984L, 7996L, 8008L, 10459L, 10471L, 10483L), class = "data.frame")
In terms of why I am using factors, I don't know how else I should enter the data. Factors made sense, and they were how R interpreted the raw data when I uploaded it.
I am working on the suggestions below. Not had success yet, but I am still learning how to get R to do what I want, and frequently mess up. Will respond to each of you as soon as I have a reasonable answer to give. (And once I stop banging my poor head on my desk... sigh)

If I understand your question correctly, this should do it.
I am assuming your data frame is named df:
library(tidyverse)
df %>%
mutate(PROG = ifelse(PROG %in% c("Grad2", "Grad3","Grad.H"),
"Grad",
NA)) %>% ##combines the 3 Grad variables into one
filter(!is.na(PROG)) %>% ##drops the other variables
group_by(YEAR, AGE) %>%
summarise(NUMBER = sum(NUMBER))

Slightly different approach: keep only the factor levels you want, drop the PROG column (because you want to treat those levels as one group), and sum all NUMBER values while grouping by all other variables. df is your data.
# the formula is passed positionally, so the call does not depend on the name of aggregate()'s first argument
aggregate(NUMBER ~ .,
data = subset(df, PROG %in% c("Grad2", "Grad3", "Grad.H"), select = -PROG),
FUN = sum)
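To check the idea against the three example rows from the question, here is a minimal base R sketch. The data frame is rebuilt by hand, the level names follow the dput() output ("Grad2", "Grad3"), and the extra "Basic" row with its value is made up only to show that other levels are dropped.

```r
# Example rows from the question, plus one made-up "Basic" row
df <- data.frame(
  YEAR   = c("92/93", "92/93", "92/93", "92/93"),
  AGE    = c("20-24", "20-24", "20-24", "20-24"),
  PROG   = c("Grad2", "Grad3", "Grad.H", "Basic"),
  NUMBER = c(50L, 25L, 2L, 999L)
)

# Keep only the three Grad levels and drop PROG, since they form one group
grad <- subset(df, PROG %in% c("Grad2", "Grad3", "Grad.H"), select = -PROG)

# Sum NUMBER within each YEAR/AGE combination
out <- aggregate(NUMBER ~ YEAR + AGE, data = grad, FUN = sum)

# Label the combined group
out$PROG <- "Grad"
out
```

The single remaining row carries NUMBER = 77, which matches the target row in the question.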

There are multiple ways to do this, but I agree with FScott that you are likely looking for the levels() function to rename the factor levels. Here is how I would do the second step of summing.
library(magrittr)
library(dplyr)
#do the renaming of the PROG variables here
#sum by PROG
df <- df %>%
group_by(PROG) %>% # you could add more variable names here to group by i.e. group_by(PROG, AGE, YEAR)
mutate(group.sum= sum(NUMBER))
This chunk will make a new column in df named group.sum, holding the sum of NUMBER within each group defined by the group_by() function.
If you want to condense the data.frame further, so that the individual values in NUMBER are replaced with group.sum, there are again many ways to do this, but here is a simple one.
#condense df down
df$NUMBER <- df$group.sum # overwrite NUMBER with the group totals
df$group.sum <- NULL # drop the helper column
df <- unique(df)
A side note: I wouldn't recommend doing the above chunk, because you lose information in your data, and your data is tidier with just the extra column group.sum.

I think the levels() function is what you are looking for. From the manual:
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
I named your data temp. The same idea can be applied directly to PROG: assign a new vector of level names, one per existing level and in the same order, and levels that receive the same name are merged.
# levels(temp$PROG) starts with "T.Prog", "Basic", "Career", followed by the five Grad levels
# ("Grad.H", "Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual") and then 12 more levels
temp$PROG2 <- temp$PROG
levels(temp$PROG2) <- c(rep("Other",3),rep("Grad",5),rep("Other",12))
levels(temp$PROG2)
temp
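A position-based replacement depends on the exact order of the levels, so here is a small sketch that merges levels by name instead. The PROG values below are made up for illustration, and grad_levels is just a helper name chosen here.

```r
# Made-up PROG values for illustration
prog <- factor(c("Grad2", "Grad3", "Grad.H", "Basic", "Grad2"))

# Levels that should collapse into one
grad_levels <- c("Grad2", "Grad3", "Grad.H")

# Giving several levels the same name merges them into a single level
levels(prog)[levels(prog) %in% grad_levels] <- "Grad"

levels(prog)
table(prog)
```

After this, prog has only the levels "Basic" and "Grad", and the usual grouping and summing can be done on the merged factor.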

Related

cannot reference grouped data in summarize(across(...))

When I try to create several columns within summarize(), I can reference a newly created column name in the same summarize statement.
Example:
Goal: Try to calculate the standard error ("se") based on the standard deviation ("sd").
Step 1 (start to assign sd for se):
data %>%
group_by(style) %>%
summarise(across(score,list(mean = mean, sd = sd, se = sd)))
returns
style score_mean score_sd score_se
* <fct> <dbl> <dbl> <dbl>
1 S1 3.5 0.707 0.707
Step 2: calculate se based on sd
data %>%
group_by(style) %>%
summarise(across(score,list(mean = mean, sd = sd, se = sd/sqrt(nrow(score)))))
returns
Error: Problem with `summarise()` input `..1`.
x non-numeric argument to binary operator
ℹ Input `..1` is `across(score, list(mean = mean, sd = sd, se = sd/sqrt(nrow(data))))`.
ℹ The error occured in group 1: style = "S1".
Step 3 debugging assignment term
3a) grouped data reference
I replaced the grouped data in nrow(score) with other column names, or even nrow(data), but they all led to the same error message.
3b) assignment operation
I replaced the assignment for se, sd/sqrt(nrow(score)), with different variations, all leading to the same error. The simplest was sd/2, so even dividing by a constant doesn't work.
3c) assignment reference
I replaced sd by score_sd to reference the new column created, as seen in the output (Step 1). Still the same error message.
Question: Why does Step 1 work but not Step 2?
The error message just refers to the whole across() statement, so doesn't help to narrow down the root cause.
My hunch is that I have to reference the grouped data somehow, but I tried
se = sd(.)/sqrt(nrow(data) with no success.
Would be grateful for any hints...
Minimal reproducible example:
data <- structure(list(style = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L), .Label = c("S1", "S2", "S3", "S4", "S5"), class = "factor"),
param = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L
), .Label = c("A", "B", "C"), class = "factor"), score = c(4,
1, 1, 3, 3, 3, 5, 1, 1, 1)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
After many trial & error attempts, I found the solution myself. This is for everyone who is not yet familiar with the across function, as dplyr 1.0.0 is not yet released.
So the answer to my question is:
Each entry in the list passed to across() has to be a function or a purrr-style formula (~). In Step 2, sd/sqrt(...) tries to divide the function sd itself, which is why even sd/2 fails with "non-numeric argument to binary operator".
Inside a ~ formula you reference the grouped data with the . operator: it stands for the values of the current column in the current group, so the standard deviation becomes sd(.).
However, you must NOT pass . to n(): it takes no arguments and always returns the size of the current group.
The second point took endless trials to find out, and is the reason why I wanted to share this solution.
This is what this double trick looks like:
data %>%
group_by(style) %>%
summarise(across(
score,
list(mean = mean, sd = sd, se = ~sd(.)/sqrt(n()))
))
If you know it, it's easy :-)
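For anyone who wants to see the root cause without dplyr, here is a small base R sketch. It uses the style and score columns from the example data, and std_error is a helper name invented here, not part of any package. An ordinary named function like this can also be used as an entry in the list given to across().

```r
# style and score columns from the reproducible example
data <- data.frame(
  style = factor(c("S1", "S2", "S3", "S4", "S5", "S1", "S2", "S3", "S4", "S5")),
  score = c(4, 1, 1, 3, 3, 3, 5, 1, 1, 1)
)

# sd on its own is a function, so arithmetic on it fails;
# the error message is captured here instead of stopping the script
msg <- tryCatch(sd / 2, error = function(e) conditionMessage(e))
msg

# Wrapping the calculation in a function gives something that can be applied per group
std_error <- function(x) sd(x) / sqrt(length(x))

# Standard error of score within each style
se_by_style <- tapply(data$score, data$style, std_error)
se_by_style
```

For style S1 the scores are 4 and 3, so the standard deviation is about 0.707 (as in Step 1) and the standard error is 0.5.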

Producing two new variables using recode in R?

I'm gradually getting used to recoding variables in R but I'm having a bit of trouble creating two new variables. For example, I have tried the following:
income2018$income2 <- dplyr::recode(income2018$income, '51' = 1L, '52' = 1L, '53' = 2L)
income2018$income3 <- dplyr::recode(income2018$income, '57' = 1L, '58' = 1L, '50' = 2L)
It doesn't look like the values are being correctly applied to the new variables.
Here is the SPSS syntax that I am attempting to recreate:
RECODE income (51,52=1)(53=2) into income2
RECODE income (57,58=1)(50=2) into income3
I'd be very grateful for any assistance.
Many thanks.
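It looks like you might need to rearrange your code a little bit, but it's hard to tell without a reprex.
recode() works on a vector rather than on a whole data frame, so you might want to try calling it inside mutate():
library(dplyr)
income2018 <- income2018 %>%
# .default = NA_integer_ turns every unlisted code into NA; drop it to keep recode()'s default handling
mutate(income2 = recode(income, `51` = 1L, `52` = 1L, `53` = 2L, .default = NA_integer_),
income3 = recode(income, `57` = 1L, `58` = 1L, `50` = 2L, .default = NA_integer_))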
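Without a reprex it is hard to test against the real data, so here is a base R sketch of the same two recodes on made-up income codes. It assumes that codes not listed in the SPSS syntax should end up as NA in the new variables.

```r
# Made-up income codes for illustration (99 is not covered by either recode)
income <- c(51, 52, 53, 57, 58, 50, 99)

# RECODE income (51,52=1)(53=2) into income2
income2 <- ifelse(income %in% c(51, 52), 1L,
                  ifelse(income == 53, 2L, NA_integer_))

# RECODE income (57,58=1)(50=2) into income3
income3 <- ifelse(income %in% c(57, 58), 1L,
                  ifelse(income == 50, 2L, NA_integer_))

data.frame(income, income2, income3)
```

Comparing a table like this with the SPSS output for a few known codes is a quick way to confirm the recode does what you expect.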

Conditional Seasonal Averaging Time-Series Data

Introduction
Summary:
Trying to average data by season (when necessary) when certain conditions are met.
Hello everyone.
I am currently working with numerous large data sets (>200 sets with >5000 rows each) of long-term time series data collection for multiple variables across different locations. So far, I've extracted data into separate CSV files per site and per station.
For the most part, the data reported per parameter is one instance per season.
Season here is defined ecologically as DJF, MAM, JJA, SON for months corresponding to Winter, Spring, Summer, and Fall respectively.
However, there are some cases where there were multiple readings during a seasonal event. Here, the parameter values and dates have to be averaged; this is before further analysis can take place on these data sets.
To complicate things even further, some of the data is marked by a Greater Than or Less Than (GTLT) symbol. In these cases, values and dates are not averaged unless the recorded value is the same.
Data Example
Summary:
Code and Tables show requested changes in data-set
So, for a data-driven example...
Here's a few rows from a data set.
Data.Example<-structure(list(
Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402),
End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"),
Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"),
Value = c(10, 10, 20, 30, 15, 10)),
.Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")
This is ideally what I would like as output
Data.Example.New<-structure(list(
Station.ID.new = c(13402, 13402, 13402, 13402),
End.Date.new = structure(c(2L, 3L, 2L, 1L), .Label = c("10/13/2016", "7/28/2016", "8/15/2016"), class = "factor"),
Parameter.Name.new = structure(c(2L, 2L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
GTLT.new = structure(c(2L, 2L, 1L, 1L), .Label = c("", "<"), class = "factor"),
Value.new = c(10, 20, 22.5, 10)),
.Names = c("Station.ID.new", "End.Date.new", "Parameter.Name.new", "GTLT.new", "Value.new"), row.names = c(NA, -4L), class = "data.frame")
Here, the following things are occurring:
For Enterococci measured in July and Aug 13, there is a GTLT symbol, but Value for both == 10. So average dates. New row is 7/28/2016 and Value 10.
While Enterococci on Aug 15 is within same season as other values, since GTLT value is different, it would only be averaged in same season of same year with other values of 20. In this case, since it is only one where Value==20, that row does not change and is repeated in final data frame.
Alkalinity in July and August are same season, so average dates (7/28/16) and Value (22.5) in new row.
Alkalinity in October is different season, so keep row.
All other data (such as Station.ID and Parameter.Name) should just be copied since they shouldn't differ here.
If for some reason you have a GTLT and non-GTLT for same parameter:
End.Date   GTLT  Value  Parameter
7/13/2015  <     10     Alk
7/13/2016  <     10     Alk
8/13/2016        10     Alk
8/15/2016        20     Alk
Then final result would be
End.Date   GTLT  Value  Parameter
7/13/2015  <     10     Alk
7/13/2016  <     10     Alk
8/14/2016        15     Alk
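The date and value averaging in the examples above can be reproduced in base R, since mean() works on Date vectors. This is only a sketch of the arithmetic for two of the cases, not a full solution. The variable names are chosen here for illustration.

```r
# Alkalinity readings from July and August 2016 (same season)
alk_dates  <- as.Date(c("7/13/2016", "8/13/2016"), format = "%m/%d/%Y")
alk_values <- c(30, 15)

alk_avg_date  <- mean(alk_dates)   # midpoint of the two dates, falls on 28 July
alk_avg_value <- mean(alk_values)  # 22.5

# Non-GTLT readings from the second example
alk2_dates  <- as.Date(c("8/13/2016", "8/15/2016"), format = "%m/%d/%Y")
alk2_values <- c(10, 20)

alk2_avg_date  <- mean(alk2_dates)   # 14 August
alk2_avg_value <- mean(alk2_values)  # 15

format(c(alk_avg_date, alk2_avg_date), "%m/%d/%Y")
```

Note that the first midpoint is actually half a day past 28 July, so the averaged date is stored with a fractional part and only shown as a whole day when formatted.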
Approach
Summary:
Define seasons and then aggregate using package like dplyr?
Create loop function to read row by row (after sort by Parameter.Name then Date?)
As one might expect, this is where I'm stuck.
I know seasons can be defined in R from prior Stack Q's:
New vector of seasons based on dates
And I know that average/aggregation packages such as dplyr (and possibly zoo?) can do chaining commands.
My issue is putting this thought process into code that can be repeated for each data set.
I'm not sure if that's the best approach (define seasons and then set conditions for averaging data), or if some sort of loop function would work here by going through row by row of the data set post-sort by Parameter.Name then End.Date.
I quickly sketched my thoughts on what some sort of loop function would have to include:
Rough idea of flow diagram
Note, you can't just average starting row [i] and [i+1] because [i+2], etc. might need averaged as well. Hence finding row [i+n] that breaks loop before last step, averaging all prior rows [i+n-1], and moving on to next new row [i+n].
Further, as clarification, the season would have to be within the season of that annual cycle. So 7/13/2016 == 8/13/2016 for same season. 12/12/2015 == 01/01/2016 for same season. But 4/13/2016 != 4/13/2015 in regards to averaging.
Conclusion and Summary
In short, I need help designing code to average individual parameter time-series values by annual season with specific exceptions for multiple large data sets.
I'm not sure of the best approach in designing code to do this, whether it's a large loop function or a combination of code and specialized chaining-enabled packages.
Thank you for your time in advance.
Cheers,
soccernamlak
Using dplyr and lubridate I was able to come up with a solution. My output matches your example output, except I did not keep the exact dates, which I felt were misleading in the final result.
Data.Example<-structure(list(
Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402),
End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"),
Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"),
Value = c(10, 10, 20, 30, 15, 10)),
.Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")
library(dplyr)
# Create season key
seasons <- data.frame(month = 1:12, season = c(rep("DJF",2), rep("MAM", 3), rep("JJA", 3), rep("SON",3), "DJF"))
# Isolate Month and Year, create Season column
Data.Example$Month <- lubridate::month(as.Date((Data.Example$End.Date), "%m/%d/%Y"))
Data.Example$Year <- lubridate::year(as.Date((Data.Example$End.Date), "%m/%d/%Y"))
Data.Example$Season <- seasons$season[Data.Example$Month]
# Update 'year' where month = December so that it is grouped with Jan and Feb of following year
Data.Example$Year[Data.Example$Month == 12] <- Data.Example$Year[Data.Example$Month == 12]+1
# Find out which station/year/season/parameters have at least one record with a GTLT
GTLT.Test<- Data.Example %>%
group_by(Station.ID, Year, Season, Parameter.Name) %>%
summarize(has_GTLT = max(nchar(as.character(GTLT))))
# First only calculate averages for groups without any GTLT
Data.Example.New1 <- Data.Example %>%
anti_join(GTLT.Test[GTLT.Test$has_GTLT == 1,],
by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
group_by(Station.ID, Year, Season, Parameter.Name, GTLT) %>%
summarize(Value.new = mean(Value))
# Now do the same for groups with GTLT, only combining when values and GTLT symbols match.
Data.Example.New2 <- Data.Example %>%
anti_join(GTLT.Test[GTLT.Test$has_GTLT == 0,],
by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
group_by(Station.ID, Year, Season, Parameter.Name, GTLT, Value) %>%
summarize(Value.new = mean(Value)) %>%
select(-Value)
# Combine both
Data.Example.New <- dplyr::bind_rows(Data.Example.New1, Data.Example.New2)
EDIT: I just noticed you linked to another SO question for converting dates to seasons. Mine simply converts by month, not date, and does not use actual seasons. I did this because in your example, Dec. 12 matches with Jan. 1. December 12 is technically fall, so I assumed you weren't using actual seasons, but were instead using four three-month groupings.
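To make the month-based grouping easier to check, here is a base R sketch of just the season and season-year step, applied to the dates mentioned in the question. It follows the same rule as the code above (December is counted with January and February of the following year). The names season_key and season_year are chosen here for illustration.

```r
# Dates taken from the examples in the question
dates <- as.Date(c("7/13/2016", "8/13/2016", "12/12/2015",
                   "1/1/2016", "4/13/2015", "4/13/2016"),
                 format = "%m/%d/%Y")

month <- as.integer(format(dates, "%m"))
year  <- as.integer(format(dates, "%Y"))

# One season label per calendar month, January to December
season_key <- c("DJF", "DJF", "MAM", "MAM", "MAM", "JJA",
                "JJA", "JJA", "SON", "SON", "SON", "DJF")
season <- season_key[month]

# December belongs to the winter of the following year
season_year <- ifelse(month == 12, year + 1L, year)

data.frame(date = dates, season, season_year)
```

With this rule 12/12/2015 and 1/1/2016 share the group DJF 2016, while 4/13/2015 and 4/13/2016 fall into MAM 2015 and MAM 2016 and are kept apart.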

Calculating percent of categorical responses (with grouping) in R

I have the following dataframe:
IV Device1 Device2 Device3
Color Same Same Missing
Color Different Same Missing
Color Same Unique Missing
Shape Same Missing Same
Shape Different Same Different
Explanation: each IV (Independent Variable) is composed of several measurements (the ‘Color’ section is composed of 3 different measurements, while 'Shape' is composed of 2).
Each data point has one of 4 possible categorical values: Same/Different/Unique/Missing. 'Missing' means that there is no value for that measurement in the case of that device, while the other 3 values represent the existing result for that measurement.
Question: I want to calculate for each device the percent of times that it has a Same/Different/Unique value (thus generating 3 different percentages), out of the total number of values for that IV (not including cases where there is a ‘Missing’ value).
For example, device 2 would have the following percentages:
Color- 67% same, 0% different, 33% unique.
Shape- 100% same, 0% different, 0% unique.
Thank you!
This is not a tidy solution, but you can use this until someone else posts a better one:
# Replace all "Missing" with NAs
df[df == "Missing"] <- NA
# Create factor levels
df[,-1] <- lapply(df[,-1], function(x) {
factor(x, levels = c('Same', 'Different', 'Unique'))
})
# Custom function to calculate percent of categorical responses
custom <- function(x) {
y <- length(na.omit(x))
if(y > 0)
return(round((table(x)/y)*100))
else
return(rep(0, 3))
}
library(purrr)
# Split the dataframe on IV, remove the IV column and apply the custom function
Final <- df %>% split(df$IV) %>%
map(., function(x) {
x <- x[, -1]
t(sapply(x, custom))
})
Output
Final is a list of two matrices:
$Color
Same Different Unique
Device1 67 33 0
Device2 67 0 33
Device3 0 0 0
$Shape
Same Different Unique
Device1 50 50 0
Device2 100 0 0
Device3 50 50 0
Data
structure(list(IV = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("Color",
"Shape"), class = "factor"), Device1 = structure(c(1L, 2L, 1L,
1L, 2L), .Label = c("Same", "Different", "Unique"), class = "factor"),
Device2 = structure(c(1L, 1L, 3L, NA, 1L), .Label = c("Same",
"Different", "Unique"), class = "factor"), Device3 = structure(c(NA,
NA, NA, 1L, 2L), .Label = c("Same", "Different", "Unique"
), class = "factor")), .Names = c("IV", "Device1", "Device2",
"Device3"), row.names = c(NA, -5L), class = "data.frame")
Quick and dirty: First, replace your 'Missing' by 'NA' using your preferred method (sed, excel, etc), then you can use table on each of the columns to get the summary statistics:
myStats <- function(x){
table(factor(x, levels = c('Same', 'Different', 'Unique')))/sum(table(x))
}
apply(yourData, 2, myStats)
This will return the summary of what you want.
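Since the percentages are wanted per IV, here is a base R sketch that adds the grouping for one device, using the table from the question typed in by hand. The helper name pct is made up here. "Missing" is dropped because it is not one of the factor levels.

```r
# Data from the question, entered as plain character columns
df <- data.frame(
  IV      = c("Color", "Color", "Color", "Shape", "Shape"),
  Device1 = c("Same", "Different", "Same", "Same", "Different"),
  Device2 = c("Same", "Same", "Unique", "Missing", "Same"),
  Device3 = c("Missing", "Missing", "Missing", "Same", "Different"),
  stringsAsFactors = FALSE
)

lv <- c("Same", "Different", "Unique")

# Percent of each response among the non-missing values
pct <- function(x) {
  counts <- table(factor(x, levels = lv))  # values outside lv become NA and are not counted
  if (sum(counts) == 0) return(setNames(rep(NA_real_, length(lv)), lv))
  round(100 * counts / sum(counts))
}

# One column per IV, one row per response type, for Device2
res <- sapply(split(df$Device2, df$IV), pct)
res
```

For Device2 this gives 67/0/33 for Color and 100/0/0 for Shape, the figures quoted in the question. The same call can be repeated for the other device columns.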

Use of x`apply` to speed up loops

This is a more focussed question based on another question I have open at Vectorize/Speed up Code with Nested For Loops
Basically, I want to speed up the execution of this code. I was thinking of using one of the apply family of functions. The apply function would have to use/perform the following:
Input: loop over regions 1 to 10; vectors sed and borewidth with preallocated dimensions filled with NAs
Process: fill data in each of sed and borewidth in the manner implemented in the inner for loop
Output: sed and borewidth vectors
Assumptions (h/t Simon Urbanek): the begin, finish points of each row are contiguous, sequential and for each region, begin at 0.
Code is as below:
for (region in 1:10) {
# subset standRef and sample by region code
standRef.region <- standRef[which(standRef$region == region),]
sample.region <- sample[which(sample$region == region),]
for (i in 1:nrow(sample.region))
{
# create a dataframe - locations - that includes:
# 1) those indices of standRef.region in which the value of the location column is greater than the value of the ith row of the begin column of sample.region
# 2) those indices of standRef.region in which the value of the location column is less than the value of the ith row of the finish column of sample.region
locations <- standRef.region[which((standRef.region$location > sample.region$begin[i]) & (standRef.region$location < sample.region$finish[i])),]
sed[end_tracker:(end_tracker + nrow(locations))] <- sample.region$sed[i]
borewidth[end_tracker:(end_tracker + nrow(locations))] <- sample.region$borewidth[i]
# update end_tracker to the number of locations rows for this iteration
end_tracker <- end_tracker + nrow(locations)
}
cat("Finished region", region,"\n")
}
Sample Data for borewidth and sed. Edit: corrected formatting error in dput
structure(list(region = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
begin = c(0L, 2253252L, 7091077L, 9120205L, 0L, 135094L,
941813L, 5901391L, 6061324L), finish = c(2253252L, 7091077L,
9120205L, 17463033L, 135094L, 941813L, 5901391L, 6061324L,
7092402L), sed = c(3.31830840984048, 1.38014704208403, 6.13049140975458,
2.10349875097134, 0.48170587509345, 0.13058713509175, 9.13509713513509,
6.13047153058701, 3.81734081501503), borewidth = c(3L, 5L,
2L, 1L, 1L, 1L, 2L, 4L, 4L)), .Names = c("region", "begin",
"finish", "sed", "borewidth"), class = "data.frame", row.names = c(NA,
-9L))
TIA.
With some extra assumptions based on the data you posted (incl. the other question), this is one way you could do it:
index <- unlist(lapply (unique(standRef$region), function(reg) {
reg.filter <- which(standRef$region == reg)
samp.filter <- which(sample$region == reg)
samp.filter[cut(standRef$location[reg.filter],c(0L,sample$finish[samp.filter]),labels=F)]
}))
sed <- sample$sed[index]
borewidth <- sample$borewidth[index]
The extra assumption is that your samples are contiguous, sequential (all your examples were) and start at 0. This allows us to use cut() on the $finish values instead of treating each interval separately. One difference is that your code left gaps at the breaks, but I'm assuming that was not intentional.
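To show what cut() does in this approach, here is a small sketch for region 1 only. The finish, sed and borewidth values come from the sample data in the question (sed rounded for brevity), while the location values are made up for illustration.

```r
# Region 1 intervals from the sample data (begin of the first interval is 0)
finish    <- c(2253252L, 7091077L, 9120205L, 17463033L)
sed       <- c(3.318, 1.380, 6.130, 2.103)
borewidth <- c(3L, 5L, 2L, 1L)

# Made-up locations for illustration
location <- c(100L, 2253252L, 3000000L, 9120206L)

# Index of the interval each location falls into;
# intervals are open on the left and closed on the right, e.g. (0, 2253252]
idx <- cut(location, c(0L, finish), labels = FALSE)
idx

# Look up the per-interval values in one vectorised step
data.frame(location, sed = sed[idx], borewidth = borewidth[idx])
```

Because the intervals are closed on the right, a location exactly on a break (2253252 here) is assigned to the earlier interval, and a location of exactly 0 would come back as NA.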
