How to calculate correlation by group

I am trying to write a for loop that calculates a correlation for each level of a factor variable. My data set has 16 rows for each of 32 teams, and I want to correlate year with points for each team individually. I can do this one team at a time, but I want to get better at looping.
correlate <- data %>%
select(Team, Year, Points_Game) %>%
filter(Team == "ARI") %>%
select(Year, Points_Game)
cor(correlate)
I made an object "teams" by:
teams <- levels(data$Team)
Any help using [i] to iterate over all 32 teams and get each team's correlation of year and points would be much appreciated!

library(dplyr)
# dummy data: 32 teams with 10 seasons each
# (rep(..., each = 10) keeps every Team/Year pair unique)
data <- data.frame(
Team = rep(paste0("T", 1:32), each = 10),
Year = rep(2000:2009, times = 32),
Points_Game = rnorm(320, 100, 10)
)
# find the correlation of Year and Points_Game for each team
# r - correlation coefficient
correlate <- data %>%
group_by(Team) %>%
summarise(r = cor(Year, Points_Game))

The data.table way:
library(data.table)
# dummy data (same as @Aleksandr's)
dat <- data.table(
Team = rep(paste0("T", 1:32), each = 10),
Year = rep(2000:2009, times = 32),
Points_Game = rnorm(320, 100, 10)
)
# find correlation of Year and Points_Game for each Team
result <- dat[ , .(r = cor(Year, Points_Game)), by = Team]
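If you specifically want the explicit loop over teams[i] that the question asks about, a minimal base R sketch could look like the following. The dummy data (16 seasons per team) and the result vector r are assumptions for illustration, not the asker's real data.

```r
# Dummy data (assumed): 32 teams with 16 seasons each
set.seed(1)
data <- data.frame(
  Team = factor(rep(paste0("T", 1:32), each = 16)),
  Year = rep(2000:2015, times = 32),
  Points_Game = rnorm(32 * 16, 100, 10)
)

teams <- levels(data$Team)

# Pre-allocate a named result vector, one slot per team
r <- setNames(numeric(length(teams)), teams)

for (i in seq_along(teams)) {
  # Keep only the rows of the i-th team, then correlate the two columns
  team_rows <- data[data$Team == teams[i], ]
  r[i] <- cor(team_rows$Year, team_rows$Points_Game)
}

head(r)
```

Pre-allocating r and filling it by index avoids growing the vector inside the loop. The grouped versions above should give the same numbers with less code.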

Related

Using a loop to create columns based on two data frames

I have a situation where I think a loop would be appropriate to avoid repeating chunks of code.
I have two data frames which look like the following:
patid <- seq(1,10)
date_of_session <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
date_of_referral <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
df1 <- data.frame(patid, date_of_session, date_of_referral)
patid1 <- sample(seq(1,10), 50, replace = TRUE)
eventdate <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 50)
comorbidity <- sample(c("hypertension", "stroke", "AF"), 50, replace = TRUE)
df2 <- data.frame(patid1, eventdate, comorbidity)
I need to repeat the following code for each comorbidity in df2. It generates a binary (1/0) column per comorbidity, based on whether the earliest "eventdate" (diagnosis) came before "date_of_session", or before "date_of_referral" if "date_of_session" is NA, for each patient.
df2_comorb <- df2 %>%
filter(comorbidity == "hypertension") %>%
group_by(patid1) %>%
filter(eventdate == min(eventdate)) %>%
ungroup() %>%
select(patid1, eventdate)
df1 <- left_join(df1, df2_comorb, by = c("patid" = "patid1"))
df1 <- df1 %>%
mutate(hypertension_baseline = ifelse(eventdate < date_of_session | eventdate < date_of_referral, 1, 0)) %>%
replace_na(list(hypertension_baseline = 0)) %>%
select(-eventdate)
I'd like to avoid repeating the code for each of the 27 comorbid conditions in the full dataset. I figured a loop would be the best way to repeat this for each comorbidity but I don't know how to approach writing one for this problem.
Any help would be appreciated.
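One possible approach is a plain for loop over the distinct comorbidity values. This is a sketch only: it follows the written description (compare against date_of_session, falling back to date_of_referral when the session date is NA) rather than the `|` condition in the code above, and the names ref_date, first_dx and first_event are made up for illustration.

```r
library(dplyr)

set.seed(1)
days <- seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day")
df1 <- data.frame(patid = 1:10,
                  date_of_session = sample(days, 10),
                  date_of_referral = sample(days, 10))
df2 <- data.frame(patid1 = sample(1:10, 50, replace = TRUE),
                  eventdate = sample(days, 50),
                  comorbidity = sample(c("hypertension", "stroke", "AF"),
                                       50, replace = TRUE))

# Reference date per patient: session date, or referral date if session is NA
ref_date <- coalesce(df1$date_of_session, df1$date_of_referral)

for (cond in unique(df2$comorbidity)) {
  # Earliest diagnosis date of this comorbidity for each patient
  first_dx <- df2 %>%
    filter(comorbidity == cond) %>%
    group_by(patid1) %>%
    summarise(first_event = min(eventdate))

  # Align to df1 rows; patients without the condition get NA
  first_event <- first_dx$first_event[match(df1$patid, first_dx$patid1)]

  # 1 if the first diagnosis precedes the reference date, otherwise 0
  df1[[paste0(cond, "_baseline")]] <-
    as.integer(!is.na(first_event) & first_event < ref_date)
}

head(df1)
```

Because the column name is built with paste0(), the same loop should cover all 27 conditions in the full data without further changes.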

Apply a function within list-column to another column (compare to reference ecdf by group)

I have a dataset that is organized by groups (site) and has baseline observations (trt == 0) and observations collected from a modified environment (trt == 1, although it's not experimental data which is why I'm doing this). For the trt == 1 observations, I would like to calculate the quantile of each observation within the baseline ecdf for that group (i.e. site). My instinct was to use map2_dbl() but the ecdf to compare to is within the list-column itself, not external to the data. I'm struggling to get the correct syntax (in the R tidyverse).
df <- tibble(site = rep(letters[1:4], length.out = 2000),
trt = rep(c(0, 1), each = 1000),
value = c(rnorm(n = 1000), rnorm(n = 1000, mean = 0.1)))
# calculate ecdf for baseline:
baseline <- df %>%
filter(trt == 0) %>%
group_by(site) %>%
summarize(ecdf0 = list(ecdf(value)))
# compare each trt = 1 observation to ecdf for that site:
trtQuantile <- df %>%
filter(trt == 1) %>%
inner_join(baseline, by = "site")
# the next line is where I'm struggling to get the correct map syntax
head(trtQuantile)
# for the first row I am aiming for the result given by:
trtQuantile$ecdf0[[1]](trtQuantile$value[[1]])
Any advice from the purrr masters is appreciated! Thanks.

You can use map2_dbl():
library(dplyr)
library(purrr)
trtQuantile %>% mutate(out = map2_dbl(ecdf0, value, ~.x(.y)))
Or mapply() in base R:
trtQuantile$out <- mapply(function(x, y) x(y), trtQuantile$ecdf0, trtQuantile$value)
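As an alternative sketch that skips the list-column entirely, the baseline ecdf can be built and applied inside a grouped mutate(). The object name trt_q and the seed are assumptions for illustration; the data mirrors the question's example.

```r
library(dplyr)

set.seed(42)
df <- tibble(site  = rep(letters[1:4], length.out = 2000),
             trt   = rep(c(0, 1), each = 1000),
             value = c(rnorm(1000), rnorm(1000, mean = 0.1)))

trt_q <- df %>%
  group_by(site) %>%
  # Build the ecdf from this site's baseline rows, then evaluate it
  # at every value of the site in one vectorised call
  mutate(out = ecdf(value[trt == 0])(value)) %>%
  ungroup() %>%
  filter(trt == 1)

head(trt_q)
```

Since an ecdf object is itself a vectorised function, this calls it once per site instead of once per row.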

How to sample without replacement within groups in R

I have a data frame which contains a 'year' variable with values between 1 and 100000, each repeating multiple times. I have another data frame with 1000 'loss amounts' and an associated probability for each loss. I'd like to merge loss amounts onto the year data frame by sampling from the loss amounts table. I want to sample without replacement within each level of the year variable, i.e. within each year the loss amounts should be unique.
A reproducible example is below. I can only get it to sample without replacement across the full 'year' dataset, not within the separate levels of the year variable as required. Is there a way of doing this (ideally without loops, as I need the code to run quickly)?
#mean frequency
freq <- 100
years <- 100000
#create data frame with number of losses in each year
num_losses <- rpois(years, freq)
year <- tibble(index=1:length(num_losses), num=num_losses)
year <- map2(year$index, year$num, function(x, y) rep(x, y)) %>% unlist() %>% tibble(year = .)
#lookup table with loss amounts
lookup <- tibble(prob = runif(1000, 0, 1), amount = rgamma(1000, shape = 1.688, scale = 700000)) %>%
mutate(total_prob = cumsum(prob)/sum(prob),
pdf = total_prob - lag(total_prob),
pdf = ifelse(is.na(pdf), total_prob, pdf))
#add on amounts to year table by sampling from lookup table
sample_from_lookup <- function(number){
amount <- sample(lookup$amount, number, replace = FALSE, prob = lookup$pdf)
}
amounts <- sample_from_lookup(nrow(year))
year <- tibble(year = year$year, amount = amounts)

According to your description, maybe you can try replicate within your sample_from_lookup, i.e.,
sample_from_lookup <- function(number){
amount <- replicate(number,
sample(lookup$amount,
1,
replace = FALSE,
prob = lookup$pdf))
}
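In this case, you need to set the size argument of sample() to 1. Note that each draw is then independent of the others, so the same amount can come up more than once within a year.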

I ended up using split() to break the 'year' data into a list of groups, then running the (slightly amended) sample_from_lookup function on each element of the list using map(). Amended code below.
#mean frequency
freq <- 5
years <- 100
#create data frame with number of losses in each year
num_losses <- rpois(years, freq)
year <- tibble(index=1:length(num_losses), num=num_losses)
year <- map2(year$index, year$num, function(x, y) rep(x, y)) %>% unlist() %>% tibble(year = .)
year_split = split(year, year$year)
#lookup table
lookup <- tibble(prob = runif(1000, 0, 1), amount = rgamma(1000, shape = 1.688, scale = 700000)) %>%
mutate(total_prob = cumsum(prob)/sum(prob),
pdf = total_prob - lag(total_prob),
pdf = ifelse(is.na(pdf), total_prob, pdf))
#add on amounts to year table by sampling from lookup table
sample_from_lookup <- function(x){
number = NROW(x)
amount <- sample(lookup$amount, number, replace = FALSE, prob = lookup$pdf)
}
amounts <- map(year_split, sample_from_lookup) %>% unlist() %>% tibble(amount = .)
year <- tibble(year = year$year, amount = amounts$amount)
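The same idea can probably be written without an explicit split, by sampling inside a grouped mutate() so that each year gets its own draw without replacement. This is a sketch on the reduced sizes used above; the result name res is an assumption, and pdf is computed directly as prob / sum(prob), which is the same quantity as the cumulative difference in the original code.

```r
library(dplyr)

set.seed(1)
freq  <- 5
years <- 100

# One row per loss, labelled with the year it falls in
num_losses <- rpois(years, freq)
year <- tibble(year = rep(seq_along(num_losses), times = num_losses))

# Lookup table of loss amounts with normalised sampling weights
lookup <- tibble(prob = runif(1000, 0, 1),
                 amount = rgamma(1000, shape = 1.688, scale = 700000)) %>%
  mutate(pdf = prob / sum(prob))

res <- year %>%
  group_by(year) %>%
  # n() is the number of losses in the current year
  mutate(amount = sample(lookup$amount, n(), replace = FALSE,
                         prob = lookup$pdf)) %>%
  ungroup()

head(res)
```

This only works while no year has more losses than there are rows in lookup, since sample() cannot draw more items than exist when replace = FALSE.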

How to aggregate Likert-Type Scales across subgroups in R?

I am trying to aggregate the cumulative proportion of specific response options (in this case choices 4 and 5) for subgroups on a Likert-type scale questionnaire.
This way I would have the average favorability (in this case options 4 and 5 correspond to "agree" and "strongly agree" on the scale) for each subgroup across questions.
I have figured out a way to do this for each separate item by following this post, How to calculate cumulative proportion of Likert-type responses in r?, but I want to see if I can simplify it by creating a function that automatically does the same thing for all items. The dataset I am actually using has 99 items, so repeating the same code for each of them would be painful.
In the reproducible example, my dataset has 2 questions named "Q1" and "Q2" (each on a 5-point scale) and subgroup codes named "subgroup". The "some_num_col" variable is just an anchor used to generate counts for the aggregate function. The "A_rollup" variable recodes the subgroups into the rollups they fall under.
# Creating the dataset and rollups variable
set.seed(8302019)
dataset <- data.frame(
subgroup = sample(c(1000,1005,807,886,779,458,557,628), 500, replace=TRUE),
Q1 = sample(1:5, 500, replace=TRUE), Q2 = sample(1:5, 500, replace =TRUE),
some_num_col = 1
)
str(dataset)
dataset$A_rollup <- with(dataset,ifelse(subgroup %in% c(1005,1000),1,ifelse(subgroup %in% c(807),2,ifelse(subgroup %in% c(886,779,458),3,ifelse(subgroup %in% c(557,628),4,"N/A")))))
# Aggregate Q1
agg_Q1 <- aggregate(cbind(count=some_num_col) ~ Q1 + A_rollup, dataset, FUN=length)
agg_Q1$prop <- with(agg_Q1, count / ave(count, A_rollup, FUN=sum))
filtered <- agg_Q1[agg_Q1$Q1 %in% c(4,5),]
Final_Q1 <- aggregate(filtered$prop, by=list(filtered$A_rollup), FUN=sum, na.rm=T)
names(Final_Q1) <- c("A_rollup", "Q1.Fav")
remove(filtered,agg_Q1)
# Aggregate Q2
agg_Q2 <- aggregate(cbind(count=some_num_col) ~ Q2 + A_rollup, dataset, FUN=length)
agg_Q2$prop <- with(agg_Q2, count / ave(count, A_rollup, FUN=sum))
filtered <- agg_Q2[agg_Q2$Q2 %in% c(4,5),]
Final_Q2 <- aggregate(filtered$prop, by=list(filtered$A_rollup), FUN=sum, na.rm=T)
names(Final_Q2) <- c("A_rollup", "Q2.Fav")
remove(filtered,agg_Q2)
# Binding the aggregates
Final <- cbind(Final_Q1, Q2.Fav = Final_Q2$Q2.Fav)

You can simply use case_when() from dplyr:
library(dplyr)
dataset %>%
mutate_at(.vars = c("Q1","Q2",..), .funs = funs(. = case_when(
. < 5 ~ 0,
. >= 5 ~ 1
)))
The .vars selection can be given as a character vector of column names, as in the example, or by position (e.g. .vars = 2:10).
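For the favorability figures the question describes (share of responses equal to 4 or 5 per rollup, for every item at once), a sketch with across() might look like this. It assumes dplyr 1.0 or later, that all item columns start with "Q", and that there are no missing responses; the result name fav is made up for illustration.

```r
library(dplyr)

set.seed(8302019)
dataset <- data.frame(
  subgroup = sample(c(1000, 1005, 807, 886, 779, 458, 557, 628),
                    500, replace = TRUE),
  Q1 = sample(1:5, 500, replace = TRUE),
  Q2 = sample(1:5, 500, replace = TRUE)
)

# Same rollup mapping as in the question
dataset$A_rollup <- with(dataset, case_when(
  subgroup %in% c(1005, 1000)    ~ "1",
  subgroup == 807                ~ "2",
  subgroup %in% c(886, 779, 458) ~ "3",
  subgroup %in% c(557, 628)      ~ "4"
))

fav <- dataset %>%
  group_by(A_rollup) %>%
  # Proportion of responses that are 4 or 5, for every item column
  summarise(across(starts_with("Q"),
                   ~ mean(.x %in% c(4, 5)),
                   .names = "{.col}.Fav"))

fav
```

Taking the mean of a logical vector gives the proportion of TRUE values, which matches the count / sum(count) calculation in the question when no responses are missing.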

Multiply a grouped data frame by a matrix dplyr

My problem:
I have two data frames, one for industries and one for occupations. They are nested by state, and show employment.
I also have a concordance matrix, which shows the weights of each of the occupations in each industry.
I would like to create a new employment number in the Occupation data frame, using the Industry employments and the concordance matrix.
I have made a dummy version of my problem, which I think is clear:
Update
I have solved the issue, but I would like to know if there is a more elegant solution. In reality my dimensions are 7 states * 200 industries * 350 occupations, so it becomes rather memory hungry.
# create industry data frame
set.seed(12345)
ind_df <- data.frame(State = c(rep("a", len =6),rep("b", len =6),rep("c", len =6)),
industry = rep(c("Ind1","Ind2","Ind3","Ind4","Ind5","Ind6"), len = 18),
emp = rnorm(18,20,2))
# create occupation data frame
Occ_df <- data.frame(State = c(rep("a", len = 5), rep("b", len = 5), rep("c", len =5)),
occupation = rep(c("Occ1","Occ2","Occ3","Occ4","Occ5"), len = 15),
emp = rnorm(15,10,1))
# create concordance matrix
Ind_Occ_Conc <- matrix(rnorm(6*5,1,0.5),6,5) %>% as.data.frame()
# name cols in the concordance matrix
colnames(Ind_Occ_Conc) <- unique(Occ_df$occupation)
rownames(Ind_Occ_Conc) <- unique(ind_df$industry)
# solution
Ind_combined <- cbind(Ind_Occ_Conc, ind_df)
Ind_combined <- Ind_combined %>%
group_by(State) %>%
mutate(Occ1 = emp*Occ1,
Occ2 = emp*Occ2,
Occ3 = emp*Occ3,
Occ4 = emp*Occ4,
Occ5 = emp*Occ5
)
Ind_combined <- Ind_combined %>%
gather(key = "occupation",
value = "emp2",
-State,
-industry,
-emp
)
Ind_combined <- Ind_combined %>%
group_by(State, occupation) %>%
summarise(emp2 = sum(emp2))
Occ_df <- left_join(Occ_df,Ind_combined)
My solution seems pretty inefficient. Is there a better or faster way to do this?
Also, I am not quite sure how to get there, but the expected outcome is another column in Occ_df called emp2, derived from the emp column of ind_df and from Ind_Occ_Conc. I have tried to step this out for Occupation 1: essentially Ind_Occ_Conc contains weights, and the result is a weighted average.

I'm not sure what you want to do with the sum(Ind$emp*Occ1_coeff) line, but maybe this is what you're looking for:
# Instead of doing the computation only for state a, get expected outcomes for all states (with dplyr):
Ind <- ind_df %>% group_by(State) %>%
summarize(rez = sum(emp))
# Then do some computations on Ind, which is a N element vector (one for each state)
# ...
# And finally, join Ind and Occ_df using merge
Occ_df <- merge(x = Occ_df, y = Ind, by = "State", all = TRUE)
Final output would then have Ind values in a new column: one value for all a, one value for b and one value for c.
Hope it will help ;)
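Another way to look at the asker's own solution: multiplying each industry's emp by its weights and summing over industries is a matrix product, so a states x industries matrix times the industries x occupations concordance gives all emp2 values in one step. The sketch below uses base R only; the names conc, emp_mat, emp2_mat and emp2_long are assumptions for illustration, and the dummy data is rebuilt rather than reused.

```r
set.seed(12345)
states <- c("a", "b", "c")
inds   <- paste0("Ind", 1:6)
occs   <- paste0("Occ", 1:5)

ind_df <- data.frame(State = rep(states, each = 6),
                     industry = rep(inds, times = 3),
                     emp = rnorm(18, 20, 2))
Occ_df <- data.frame(State = rep(states, each = 5),
                     occupation = rep(occs, times = 3),
                     emp = rnorm(15, 10, 1))

# Concordance weights: industries in rows, occupations in columns
conc <- matrix(rnorm(6 * 5, 1, 0.5), 6, 5, dimnames = list(inds, occs))

# States x industries matrix of employment
emp_mat <- tapply(ind_df$emp, list(ind_df$State, ind_df$industry), sum)

# (states x industries) %*% (industries x occupations)
emp2_mat <- emp_mat[, rownames(conc)] %*% conc

# Back to long format and join onto the occupation table
emp2_long <- as.data.frame(as.table(emp2_mat), stringsAsFactors = FALSE)
names(emp2_long) <- c("State", "occupation", "emp2")
Occ_df <- merge(Occ_df, emp2_long, by = c("State", "occupation"))

head(Occ_df)
```

Indexing emp_mat by rownames(conc) keeps the industry order aligned between the two matrices before multiplying.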
