Creating Iterated Variables in R - r

I've looked around and seen some questions similar to mine, but none directly on point. I have a series of presidential election results for various states from 1940 to 2012. They are labeled, in sequence, r1940, d1940, r1944, d1944, r1948, d1948, and so forth.
I want to create a series of two-party vote variables, each calculated by dividing the number of Democratic votes by the sum of Republican and Democratic votes. So in a df called votes:
d2pv1940 <- (votes$d1940/(votes$d1940+votes$r1940))
I could do this 18 more times by hand, e.g., d2pv1944<-(votes$d1944/(votes$d1944+votes$r1944)), but that is time consuming and invites errors. I've seen some solutions to similar problems using lapply or for loops, but I'm not sure how I'd iterate over the variable names in the commands above.

Try something like this:
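# Locate the vote columns by name: r<year> is Republican, d<year> is Democratic
namest <- colnames(votes)
rep_cols <- grep("^r[0-9]{4}$", namest)
dem_cols <- grep("^d[0-9]{4}$", namest)
# Assumes the r and d columns appear in the same year order
res <- votes[, dem_cols] / (votes[, dem_cols] + votes[, rep_cols])
colnames(res) <- paste0("d2pv", substring(namest[dem_cols], 2))
res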
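If matching columns by position feels fragile, the same calculation can be sketched as a loop over the years, pairing each d column with its r column by name. The small votes data frame below is made up for illustration; only the r<year>/d<year> naming scheme comes from the question.

```r
# Hypothetical example data using the question's naming scheme
votes <- data.frame(
  r1940 = c(100, 200), d1940 = c(300, 200),
  r1944 = c(50, 75),   d1944 = c(50, 25)
)

# Extract the years from the Democratic column names
years <- sub("^d", "", grep("^d[0-9]{4}$", names(votes), value = TRUE))

# For each year, look up both columns by name and add the two-party share
for (y in years) {
  d <- votes[[paste0("d", y)]]
  r <- votes[[paste0("r", y)]]
  votes[[paste0("d2pv", y)]] <- d / (d + r)
}

votes
```

Because each pair is looked up by name, the result does not depend on the order of the columns in the data frame.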

Here's a tidy way to do it:
library(dplyr)
library(tidyr) # gather(), extract() and spread()
library(rex)
# Toy one-row data set with the same column naming scheme
data =
  c(1, 2, 2, 1) %>%
  setNames(
    c("r1940", "d1940", "r1944", "d1944") ) %>%
  as.list %>%
  as.data.frame
# Splits a name such as "d1940" into the party letter and the year
regex_1 =
  rex(capture(letter),
      capture(digits) )
abbreviations = tibble(
  abbreviation = c("d", "r"),
  party = c("democrat", "republican") )
data %>%
  gather(variable, value) %>%
  extract(variable,
          c("abbreviation", "year"),
          regex_1) %>%
  left_join(abbreviations, by = "abbreviation") %>%
  group_by(year) %>%
  mutate(total = sum(value),
         proportion = value / total ) %>%
  select(-abbreviation, -value) %>%
  spread(party, proportion)

Related

Recurring investment using R and PerformanceAnalytics

I am using R and PerformanceAnalytics to calculate the portfolio returns of a strategy.
Specifically, I want to start with $3000, invested equally across the available assets, and add a recurring $1000 in January and June, also split equally across the available assets. Previous investments should not be reallocated; only the new $1000 is split equally at each rebalance.
The code below calculates the growth in returns when investing $3000 initially and rebalancing every six months, but it does not allow recurring investments split across assets. It does not achieve what I want, but it gives a starting point for anyone able to assist:
library(tidyverse)
library(PerformanceAnalytics)
library(tbl2xts)
data(managers)
df_series <- managers[, 1:3] %>% xts_tbl()
w_xts <-
  df_series %>%
  filter(format(date, "%b") %in% c("Jan", "Jun")) %>%
  gather(fund, value, -date) %>%
  mutate(value = coalesce(value, 0)) %>%
  mutate(value = ifelse(abs(value) == 0, 0, 1)) %>%
  arrange(date) %>%
  group_by(date) %>%
  mutate(value = value / sum(value)) %>%
  ungroup() %>%
  tbl_xts(cols_to_xts = value, spread_by = fund)
r_xts <- df_series %>% tbl_xts()
r_xts[is.na(r_xts)] <- 0
portfolio_return <- PerformanceAnalytics::Return.portfolio(R = r_xts, weights = w_xts, value = 3000, verbose = TRUE)
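One possible reading of the requirement, sketched in base R rather than with PerformanceAnalytics: track the dollar holding of each asset, let it grow with that asset's return, and add new cash only to the assets available in a contribution month. The function simulate_contributions() and its arguments are hypothetical names introduced for this sketch. It assumes cash is added at the start of the month, before that month's return is applied, that an NA return means the asset is not yet available, and that the first period receives only the initial investment.

```r
# Hypothetical helper: grows per-asset holdings and adds new cash without
# touching what is already invested.
simulate_contributions <- function(returns, dates, initial = 3000,
                                   contribution = 1000,
                                   contrib_months = c("01", "06")) {
  returns <- as.matrix(returns)
  holdings <- rep(0, ncol(returns))
  total <- numeric(nrow(returns))
  for (t in seq_len(nrow(returns))) {
    available <- !is.na(returns[t, ])
    # Decide how much new cash arrives this period
    cash <- 0
    if (t == 1) {
      cash <- initial
    } else if (format(dates[t], "%m") %in% contrib_months) {
      cash <- contribution
    }
    # Split new cash equally across available assets only
    if (cash > 0 && any(available)) {
      holdings[available] <- holdings[available] + cash / sum(available)
    }
    # Apply this period's returns; unavailable assets earn nothing
    r <- ifelse(available, returns[t, ], 0)
    holdings <- holdings * (1 + r)
    total[t] <- sum(holdings)
  }
  data.frame(date = dates, value = total)
}

# Sanity check with zero returns: the value should equal the cash paid in
dates <- seq(as.Date("2020-01-01"), by = "month", length.out = 13)
flat <- matrix(0, nrow = 13, ncol = 3)
out <- simulate_contributions(flat, dates)
tail(out, 1)
```

With zero returns over these 13 months, the final value is the initial $3000 plus the June 2020 and January 2021 contributions. Using the month number ("%m") instead of the abbreviated month name avoids depending on the locale.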

Compute a custom mean for each row over multiple columns, based on a set of conditions

I have a complex problem and I will be grateful if someone can help me out. I have a dataframe made up of appended survey data for different countries in different years. In the said dataframe, I also have air quality measures for the neighbourhoods where respondents were selected. The air quality data is from 1998 to 2016.
My problem is that I want to compute the row mean (or cumulative mean exposure) for each person based on the respondent's age and the air quality data years. My data frame looks like this:
dat <- data.frame(ID=c(1:2000), dob = sample(1990:2020, size=2000, replace=TRUE),
survey_year=rep(c(1998, 2006, 2008, 2014, 2019), times=80, each=5),
CNT = rep(c('AO', 'GH', 'NG', 'SL', 'UG'), times=80, each=5),
Ozone_1998=runif(2000), Ozone_1999=runif(2000), Ozone_2000=runif(2000),
Ozone_2001=runif(2000), Ozone_2002=runif(2000), Ozone_2003=runif(2000),
Ozone_2004=runif(2000), Ozone_2005=runif(2000), Ozone_2006=runif(2000),
Ozone_2007=runif(2000), Ozone_2008=runif(2000), Ozone_2009=runif(2000),
Ozone_2010=runif(2000), Ozone_2011=runif(2000), Ozone_2012=runif(2000),
Ozone_2013=runif(2000), Ozone_2014=runif(2000), Ozone_2015=runif(2000),
Ozone_2016=runif(2000))
In the example data frame above, all respondents in country AO will have their cumulative mean air quality exposure restricted to Ozone_1998, while respondents in country SL will have their mean calculated from Ozone_1998 to Ozone_2014.
Next, for a person in country SL aged 15 years, I want their cumulative exposure to run from Ozone_2000 to Ozone_2014 (the 15-year period of their life, including their birth year). A person aged 16 will have their mean taken from Ozone_1999 to Ozone_2014, and so on.
Is there a way to do this complex task in R?
NB: Although my question is similar to another I posted (see link below), this task is much more complex. I tried adapting the solution to my previous question, but my attempts did not work. For instance, I tried
dat$mean_exposure = dat %>% pivot_longer(starts_with("Ozone"), names_pattern = "(.*)_(.*)", names_to = c("type", "year")) %>%
mutate(year = as.integer(year)) %>% group_by(ID) %>%
summarize(mean_under5_ozone = mean(value[ between(year, survey_year,survey_year + 0) ]), .groups = "drop")
but got an error
*Error: Problem with `summarise()` input `mean_under5_ozone`.
x `left` must be length 1
i Input `mean_under5_ozone` is `mean(value[between(year, survey_year, survey_year + 0)])`.
i The error occurred in group 1: ID = 1.*
Link to the previous question
How to compute a custom mean for each row over multiple columns, based on a row-specific criterion?
Thank you
The tidying step from your last question works well:
tidy_data = dat %>%
pivot_longer(
starts_with("Ozone"),
names_pattern = "(.*)_(.*)",
names_to = c(NA, "year"),
values_to = "ozone"
) %>%
mutate(year = as.integer(year))
Now you can filter out the years you want to get mean exposure by country / age:
mean_lifetime_exposure = tidy_data %>%
group_by(CNT, dob) %>%
filter(year >= dob) %>%
summarise(mean(ozone))
PS I'm sorry I don't quite understand your first question about country AO.
Edit:
Does this do what you wanted? The logic is a bit convoluted but the code is straightforward.
tidy_data_filtered = tidy_data %>%
filter(
!(CNT == "AO" & year != 1998),
!(CNT == "SL" & !year %in% 1998:2014)
)
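As a rough sketch of a per-person version, assuming the goal is the mean over the years from the birth year (or the first available ozone year, whichever is later) up to the survey year: filter the long data with plain vectorised comparisons and summarise by ID. The error message in the question shows that between() required length-1 bounds in that dplyr version, which the comparisons below avoid. The small data frame dat_small is made up so the result can be checked by hand.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the question's data
dat_small <- data.frame(
  ID = 1:3,
  dob = c(1997, 1999, 2001),
  survey_year = c(1998, 2000, 2000),
  Ozone_1998 = c(1, 2, 3),
  Ozone_1999 = c(2, 4, 6),
  Ozone_2000 = c(3, 6, 9)
)

exposure <- dat_small %>%
  pivot_longer(starts_with("Ozone"), names_prefix = "Ozone_",
               names_to = "year", values_to = "ozone") %>%
  mutate(year = as.integer(year)) %>%
  # Keep only the years lived up to and including the survey year
  filter(year >= dob, year <= survey_year) %>%
  group_by(ID) %>%
  summarise(mean_exposure = mean(ozone), .groups = "drop")

# Join back so that people with no qualifying year get NA instead of vanishing
result <- dat_small %>% left_join(exposure, by = "ID")
result[, c("ID", "dob", "survey_year", "mean_exposure")]
```

Joining the summary back by ID, rather than assigning the summarised tibble to a column, keeps one value per respondent aligned with the original rows.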

Reduce duplicated entries considering more than one column

I have a long dataset in which there are duplicated entries whose data I need to merge, e.g. paste values together.
In my case, I have a database of scientific articles: the strongest unique identifiers are the DOI and the article title, but the first may be missing in one of the copies, and the second may have slight phonetic/graphic differences that are easy to spot for humans but not programmatically (e.g. one copy uses β and the other plain beta).
A "match" is a pair of articles that share at least one of the two columns. That is, I need a way to dplyr::group_by by the DOI OR the article title (the usual group_by uses AND logic).
The only solution that comes to my mind is to repeat the aggregation twice, for each column. Not very efficient given the large number of records.
Example:
imagine an input like:
df <- data.frame(
ID = c(1, NA, 2, 2),
Title = c('A', 'A', 'beta', 'β'),
to.join = 1:4
)
After (OR)grouping and summarising:
df %>%
group_by_OR(ID, Title) %>% # dummy function
summarise(
ID = na.omit(ID)[1],
Title = Title[1],
joined = paste(to.join, collapse = ', '))
I should get something like this:
ID Title joined
1 1 A 1, 2
2 2 beta 3, 4
That is, the data was grouped by the title for the first group and by the id for the second.
I don't think you can avoid having to group the data twice, but we can do it sequentially, that way we can be as efficient as possible.
library(dplyr)
df_aggregated <- df %>%
group_by(ID) %>%
arrange(Title) %>%
summarise(Title = first(Title),
to.join = paste0(to.join, collapse=", ")) %>%
group_by(Title) %>%
arrange(ID) %>%
summarise(ID = first(ID),
to.join = paste0(to.join, collapse=", ")) %>%
select(ID, Title, joined=to.join) %>%
as.data.frame()
Now,
df_aggregated
is:
ID Title joined
1 1 A 1, 2
2 2 beta 3, 4
Eventually I found a solution, thanks also to @dario.
First I group by Title and impute the missing DOI wherever at least one of the copies has one. Then I ungroup and create a new unique ID, using the DOI if present and the Title for entries where no copy has a DOI.
Finally I group and summarise by this ID.
This way the computationally heavy summarising step is done only once.
records %>%
mutate(
uID = str_to_lower(Title) %>% str_remove_all('[^\\w\\d]+') # Improve matching between slightly different copies
) %>%
group_by(uID) %>%
mutate(DOI = na.omit(DOI)[1]) %>%
ungroup() %>%
mutate(
uID = ifelse(is.na(DOI), uID, DOI)
) %>%
group_by(uID) %>%
summarise(...) # various stuff here.
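To make the steps concrete, here is a small runnable sketch of the same idea. The records data frame and the contents of summarise() are made up for illustration, since the original summarising step is not shown.

```r
library(dplyr)
library(stringr)

# Hypothetical records: one copy lacks a DOI, two copies differ in title
records <- data.frame(
  DOI = c("10.1/a", NA, "10.1/b", "10.1/b"),
  Title = c("A", "A", "beta", "β"),
  to.join = 1:4
)

merged <- records %>%
  # Normalised title used as a provisional key
  mutate(uID = str_to_lower(Title) %>% str_remove_all("[^\\w\\d]+")) %>%
  group_by(uID) %>%
  # Copy a DOI to the copies that lack one
  mutate(DOI = na.omit(DOI)[1]) %>%
  ungroup() %>%
  # Prefer the DOI as the final key, fall back to the normalised title
  mutate(uID = ifelse(is.na(DOI), uID, DOI)) %>%
  group_by(uID) %>%
  summarise(DOI = DOI[1],
            Title = Title[1],
            joined = paste(to.join, collapse = ", "),
            .groups = "drop")

merged
```

Here the two "A" rows are linked through the title and the "beta"/"β" rows through the DOI, so the summary has two rows.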

Is there a way to use `pivot_wider()` to summarize survey data?

I have some survey data, let's say it's about how often respondents tackle various daily routines:
survey <- tribble(
~Q1_toothbrush, ~Q1_bathe, ~Q1_brush_hair, ~Q1_make_bed,
"Always","Sometimes","Often","Never",
"Never","Never","Always","Sometimes",
"Often","Sometimes","Sometimes","Often",
"Sometimes","Always","Often","Never"
)
I want to arrange it into a table that shows how many people selected "Often" or "Always".
I can create a new tibble and update it, taking each question one at a time, e.g.
habits <- tribble(
  ~Habit, ~Description, ~Count,
  "Q1_toothbrush", "Brushes teeth for two minutes twice each day.", 0,
  "Q1_bathe", "Bathes with soap and water every morning or evening", 0,
  "Q1_brush_hair", "Attends to daily hair health", 0,
  "Q1_make_bed", "Tidies bed covers daily", 0
)
top_two <- c("Always", "Often")
# count() returns a one-row tibble; the number itself is in column n
tmp <- survey %>%
  filter(Q1_toothbrush %in% top_two) %>%
  count()
habits <- habits %>%
  mutate(Count = ifelse(Habit == "Q1_toothbrush", tmp$n, Count))
knitr::kable(habits)
But I'm struggling to consolidate this into a single function.
If the count is needed for each respondent (row), one option is c_across() after rowwise():
library(dplyr) # >= 1.0.0
survey %>%
rowwise %>%
mutate(count = sum(c_across(everything()) %in% top_two)) %>%
ungroup
Or we can reshape to 'long' format and then do the count
library(dplyr)
library(tidyr)
pivot_longer(survey, everything()) %>%
filter(value %in% top_two) %>%
dplyr::count(name)
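To end up with the table the question describes, the per-question counts can be joined onto the descriptions. This is a sketch under the assumption that the Habit values match the survey column names exactly (so "Q1_brush_hair" rather than "Q1_hair"); the Count column is created by the join instead of being initialised to 0.

```r
library(dplyr)
library(tidyr)

survey <- tribble(
  ~Q1_toothbrush, ~Q1_bathe, ~Q1_brush_hair, ~Q1_make_bed,
  "Always", "Sometimes", "Often", "Never",
  "Never", "Never", "Always", "Sometimes",
  "Often", "Sometimes", "Sometimes", "Often",
  "Sometimes", "Always", "Often", "Never"
)

habits <- tribble(
  ~Habit, ~Description,
  "Q1_toothbrush", "Brushes teeth for two minutes twice each day.",
  "Q1_bathe", "Bathes with soap and water every morning or evening",
  "Q1_brush_hair", "Attends to daily hair health",
  "Q1_make_bed", "Tidies bed covers daily"
)

top_two <- c("Always", "Often")

# One row per question, counting answers in the top two categories
counts <- survey %>%
  pivot_longer(everything(), names_to = "Habit", values_to = "answer") %>%
  group_by(Habit) %>%
  summarise(Count = sum(answer %in% top_two), .groups = "drop")

habits_counted <- habits %>% left_join(counts, by = "Habit")
habits_counted
```

Summing a logical vector inside summarise(), instead of filtering and then counting, keeps questions for which nobody answered "Always" or "Often" in the table with a count of 0.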

Applying group_by and summarise(sum) but keep a large number of additional columns

I would like to group my data frame by a variable, summarize another variable, but keep all other associated columns.
In Applying group_by and summarise on data while keeping all the columns' info the accepted answer is to use filter() or slice(), which works fine if the answer exists in the data already (e.g. min, max) but doesn't work if you would like to use a function that generates a new value (e.g. sum, mean).
In Applying group_by and summarise(sum) but keep columns with non-relevant conflicting data? the accepted answer is to use all the columns you would like to keep as part of the grouping variable. But this seems like an inefficient solution if you have many columns you would like to keep. For example, the data I'm working with has 26 additional columns.
The best solution I've come up with is to split-apply-combine. But this seems clunky - surely there must be a solution that can be done in a single pipeline.
Example:
location <- c("A", "A", "B", "B", "C", "C")
date <- c("1", "2", "1", "2", "1", "2")
count <- c(3, 6, 4, 2, 7, 5)
important_1 <- c(1,1,2,2,3,3)
important_30 <- c(4,4,5,5,6,6)
df <- data.frame(location = location, date = date, count = count, important_1 = important_1, important_30 = important_30)
I want to sum the counts recorded on different dates at the same location, and keep all the important columns (imagine there are 30 instead of 2).
My solution so far:
check <- df %>%
group_by(location) %>%
summarise(count = sum(count))
add2 <- df %>%
select(-count, -date) %>%
distinct()
results <- merge(check, add2)
Is there a way I could accomplish this in a single pipeline? I'd rather keep it organized and avoid creating new objects if possible.
We can create a column with mutate and then apply distinct
library(dplyr)
df %>%
group_by(location) %>%
mutate(count = sum(count)) %>% select(-date) %>%
distinct(location, important_1, important_30, .keep_all = TRUE)
If there are multiple column names, we can also use syms to convert to symbol and evaluate (!!!)
df %>%
group_by(location) %>%
mutate(count = sum(count)) %>% select(-date) %>%
distinct(location, !!! rlang::syms(names(.)[startsWith(names(.), 'important')]), .keep_all = TRUE)
You can group_by all the variables that you want to keep and sum count.
library(dplyr)
df %>%
group_by(location, important_1, important_30) %>%
summarise(count = sum(count))
# location important_1 important_30 count
# <chr> <dbl> <dbl> <dbl>
#1 A 1 4 9
#2 B 2 5 6
#3 C 3 6 12
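With many columns to keep, typing each one into group_by() gets tedious. A possible shortcut, assuming dplyr 1.0.0 or later and that the columns to keep share a name prefix as in the example, is to select the grouping columns with across() and a tidyselect helper:

```r
library(dplyr)

df <- data.frame(
  location = c("A", "A", "B", "B", "C", "C"),
  date = c("1", "2", "1", "2", "1", "2"),
  count = c(3, 6, 4, 2, 7, 5),
  important_1 = c(1, 1, 2, 2, 3, 3),
  important_30 = c(4, 4, 5, 5, 6, 6)
)

result <- df %>%
  # Group by location plus every column whose name starts with "important"
  group_by(across(c(location, starts_with("important")))) %>%
  summarise(count = sum(count), .groups = "drop")

result
```

As with the explicit group_by() version, this only collapses to one row per location when the kept columns are constant within each location.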
