R: Creating a Function with a "Dynamic" Structure - r

I am working with the R programming language.
Suppose there is a classroom of students - each student flips the same coin many times (the students don't flip the coin the same number of times). Here is a simulate dataset to represent this example:
library(tidyverse)
library(dplyr)
set.seed(123)
ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)
my_data = my_data[order(my_data$student_id),]
I want to count the number of "3 Flip Sequences" recorded by each student (e.g. Student 1 got HHHTH : HHH 1 time, HHT 1 time, HTH 1 time)
And the probability of the 3rd Flip based on the previous 2 flips (e.g. in general, over all students, the probability of a H following HH was 0.54)
Here is some R code that performs these tasks:
results = my_data %>%
group_by(student_id) %>%
summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)), .groups = 'drop') %>%
filter(!is.na(Sequence)) %>%
count(Sequence)
final = results %>%
mutate(two_seq = substr(Sequence, 1, 2)) %>%
group_by(two_seq) %>%
mutate(third = substr(Sequence, 3, 3)) %>%
group_by(two_seq, third) %>%
summarize(sums = sum(n)) %>%
mutate(prob = sums / sum(sums))
My Question: Suppose I want to now extend this problem to "4 Flip Sequences" (e.g. probability of H given HHH) - I can manually extend this code:
results = my_data %>%
group_by(student_id) %>%
summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2), lead(coin_result, 3)), .groups = 'drop') %>%
filter(!is.na(Sequence)) %>%
count(Sequence)
final = results %>%
mutate(three_seq = substr(Sequence, 1, 3)) %>%
group_by(three_seq) %>%
mutate(fourth = substr(Sequence, 4, 4)) %>%
group_by(three_seq, fourth) %>%
summarize(sums = sum(n)) %>%
mutate(prob = sums / sum(sums))
Is it possible to convert the above code into a function such that I can repeat this for arbitrary combinations? For example:
results <- function(i) {return(my_data %>%
group_by(student_id) %>%
summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, i+1), lead(coin_result, i+2) .....### insert code here ####), .groups = 'drop') %>%
filter(!is.na(Sequence)) %>%
count(Sequence))}
final <- function(i)
return(results %>%
mutate(three_seq = substr(Sequence, 1, i)) %>%
group_by(three_seq) %>%
mutate(fourth = substr(Sequence, i+1, i+1)) %>%
group_by(three_seq, fourth) %>%
summarize(sums = sum(n)) %>%
mutate(prob = sums / sum(sums)))
}
I am not sure how exactly I would do this, seeing as the first function would require to be "dynamically changed" depending on the value of "i".
Can someone please show me how to do this?
Thanks!

Here's a way you can do it in base R:
# Returns a vector of 0's and 1's, bit more efficient than sample
tosses <- floor(runif(1e3, 0, 2))
count_seqs <- function(x, seq_length) {
vec_length <- length(x)
rolling_window_indices <- rep(1:seq_length, vec_length - seq_length + 1) +
rep(0:(vec_length - seq_length), each = seq_length)
mat <- matrix(x[rolling_window_indices], nrow = seq_length)
sequences <- apply(mat, 2, paste0, collapse = "")
table(sequences)
}
count_seqs(tosses, 3)
Notice I didn't include any ids in the code above. The reason is that, if all students have the same probability of tossing heads or tails, we can treat them as independent (or, more precisely, treat the design as ignorable). However, it's easy to expand the code for situations where the tosses are not independent, e.g. where each participant has a different probability of tossing heads:
ids <- floor(runif(1e3, 1, 101))
probs <- runif(1e2, 0, 1)
tosses_by_id <- lapply(ids, function(i) rbinom(10, 1, probs[i]))
lapply(tosses_by_id, function(x) count_seqs(x, 3))

Related

Vectorization to extract and bind very nested data

I have some very nested data. Within my list-column-dataframes, there are some pieces I need to put together and I've done so in a single instance to get my desired dataframe:
a <- df[[2]][["result"]]#data
b <- df[[2]][["result"]]#coords
desired_df <- cbind(a, b)
My original Large list has 171 elements, meaning I have 1:171 (3.3 GB) to go inside those square brackets and would ideally end up with 171 desired dataframes (which I would then bind all together).
I haven't needed to write a loop in 10 years, but I don't see a tidyverse way to deal with this. I also no longer know how to write loops. There are definitely some elements in there that are junk and will fail.
You haven't provided any sort of minimal example of the data.
I've condensed it to mean something like this
base_data <- data.frame(group = c("a", "b", "c"), var1 = c(3, 1, 2),
var2 = c( 2, 4, 8))
base_data2 = matrix(
c(1, 2, 3, 4, 5, 6, 7, 8, 9),
nrow = 3,
ncol = 3,
byrow = TRUE
)
rownames(base_data2) = c("d", "e", "f")
methods::setClass(
"weird_object",
slots = c(data = "data.frame", coords = "matrix"),
prototype = list(data = base_data, coords = base_data2)
)
df <- list(
list(
result = new("weird_object")
),list(
result = new("weird_object")
),list(
result = new("weird_object")
),list(
result = new("weird_object")
)
)
And if I had such a list with these objects, then I could do
df %>%
map(. %>% {
list(data = .$result#data,
cooords = .$result#coords)
}) %>%
enframe() %>%
unnest_wider(value)
But the selecting / hoisting function might fail, thus
one can wrap it in a purrr::possibly, and
choose a reasonable default:
df %>%
map(possibly(. %>% {
list(data = .$result#data,
cooords = .$result#coords)
},
otherwise = list(data = NA, coords = NA))) %>%
enframe() %>%
unnest_wider(value)
Hopefully, this could be a step forward.
Next step is probably something resembling this:
df %>%
map(. %>% {
list(data = .$result#data,
coords = .$result#coords)
}) %>%
enframe() %>%
unnest_wider(value) %>%
mutate(coords = coords %>% map(. %>% as_tibble(rownames = "rowid"))) %>%
unnest(cols = c(data, coords)) %>%
#' rotating the thing now
pivot_longer(cols = c(group, rowid),
names_to = "var_name",
values_to = "var") %>%
select(-var_name) %>%
pivot_longer(cols = c(var1, var2, V1, V2, V3),
names_to = "var_name") %>%
pivot_wider(names_from = var, values_from = value) %>%
identity()
If I understand your data structure, which I probably don't, you could do:
library(tidyverse)
# Create dummy data
df <- mtcars
df$mpg <- list(result = I(list('test')))
df$mpg$result <- list("#data" = I(list('your data')))
df <- df %>% select(mpg, cyl)
df1 <- df
df2 <- df
# Pull data you're interested in.
# The index is 1 here, instead of 2, because it's fake data and not your data.
# Assuming the # is not unique, and is just parsed from JSON or some other format.
dont_at_me <- function(x){
a <- x[[1]][["result"]][["#data"]]
a
}
# Get a list of all of your data.frames
all_dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))
# Vectorize
purrr::map(all_dfs, ~dont_at_me(.))

Repeat or while loop with map function in a simulation in r

Below is my data where columns group, score and prob represent treatment groups, scores outcome and probability of occurrence of the score outcome
data1 <- tibble(group = rep(c('A', 'B'), each = 5),
score = c(0, 1, 2, 3, 4, 0, 1, 2, 3, 4),
prob = c(.08, .8, .1, .02, 0, 0, 0, .4, .4, .2))
data1
I want to use this data and simulate the scores R times so I expand as follows
R <- 1:10 # number of simulations
data2 <- data1 %>%
nest(data = c(score, prob)) %>%
unnest_wider(col = data) %>%
mutate(size = 15)
data3 <- data2 %>%
group_by(group, score, prob, size) %>%
tidyr::expand(iter = R) %>%
ungroup
data3
My function to simulate the data is as follows where I use multinomial distribution
sim_data <- function(dt, n, size, prob){
d1 <- dt %>%
mutate(sim_score = pmap(list(n = 1,
size = size,
prob = prob),
rmultinom)) %>%
select(-prob) %>%
unnest(cols = c(sim_score, score))
# create real scores from counts above
d2 <- d1 %>%
rowwise %>%
mutate(outcome = map(score, rep, times = sim_score)) %>%
unnest(cols = outcome) %>%
select(size, contains('iter'), group, outcome) %>%
nest(data = c(group, outcome))
return(d2)
}
data4 <- sim_data(data3, n, size, prob)
data4
This works fine however I want to omit those instances where each group has less than or equal to 2 unique scores. In this case, I modify my function sim_data to do the check as follows using repeat so as to ignore cases that do not meet the above mentioned requirement
sim_check <- function(my_data, n, size, prob){
repeat{
# simulate data
dt = sim_data(my_data, n, size, prob)
# check requirement
check <- dt %>%
select(data) %>%
unnest(cols = c(data)) %>%
group_by(group) %>%
distinct(outcome) %>%
mutate(id = 1:n()) %>%
summarise(n = n()) %>%
distinct(n) %>%
pull(n)
if(all(check > 2)) break
}
return(dt)
}
data5 <- sim_check(data3, n, size, prob)
This however does not do the checking as I see observations with 2 unique scores.
I can achieve my objective with a for loop as follows but I want to avoid the for loop. Can someone point out where I'm doing wrong
out <- list()
for(i in 1:10){
data6 <- sim_check(data2, n, size, prob)
out[[i]] <- data6 %>%
pull(data)
}
#Ronak thanks for the observation. I just corrected it

How to count the number of negative values for each observation of a certain variable

I want to calculate total number of negative values for each observation, using previous 10 observations. I used the following code, but it does not work -
funda_addit <- funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(NEG_EARN = rollapply(ni, 10, sum (ni<0), partial=TRUE)) %>%
ungroup()
Actually I want to create the new variable "NEG_EARN", which is the number of negative values of previous 10 observations(10 years in my data) for the variable "ni". I also use the following code, but it does not work -
funda_addit <- funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(NEG_EARN = rollapply(ni, 10, length(which(ni<0)), partial=TRUE)) %>%
ungroup()
You could create a vector cumsum(ni < 0) and then subtract a lagged version of that vector from it
funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(neg_earn = {cs <- cumsum(ni < 0)
cs - lag(cs, 10, default = 0)})
This is equivalent to akrun's answer if you change rollapply to rollapplyr (tested using akrun's example data)
use_cumsum <-
funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(neg_earn = {cs <- cumsum(ni < 0)
cs - lag(cs, 10, default = 0)})
use_rollapply <-
funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(neg_earn = rollapplyr(ni, 10, FUN = f1, partial=TRUE))
all(use_cumsum == use_rollapply)
# [1] TRUE
We can use anonymous function call (or create a new function) instead of the whole column 'ni'
library(dplyr)
library(zoo)
f1 <- function(x) sum(x < 0)
funda_addit %>%
group_by(TICKER) %>%
arrange(year) %>%
mutate(NEG_EARN = rollapplyr(ni, 10, FUN = f1, partial=TRUE)) %>%
ungroup()
EDIT: changed rollapply to rollapplyr (based on comments from #IceCreamToucan)
data
set.seed(24)
funda_addit <- data.frame(TICKER = rep(LETTERS[1:3], each = 20),
year = 1921:1940, ni = rnorm(60))

working with paired data across groups in the tidyverse

I have multiple observations from each of a few groups and I'd like to make a matrix of QQ plots (or another type of plot), comparing each group to every other group.
Here's an example of what I'm talking about:
library(tidyverse)
set.seed(27599)
n <- 30
d <- data_frame(person = c(rep('Alice', n),
rep('Bob', n),
rep('Charlie', n),
rep('Danielle', n)),
score = c(rnorm(n = n),
rnorm(n = n, mean = 0.1),
rnorm(n = n, sd = 2),
rnorm(n = n, mean = 0.3, sd = 1.4)))
by_hand <- data_frame(a = sort(d$score[d$person == 'Alice']),
b = sort(d$score[d$person == 'Bob']),
c = sort(d$score[d$person == 'Charlie']),
d = sort(d$score[d$person == 'Danielle']))
pairs(x = by_hand,
lower.panel = function(x, y) { points(x, y); abline(0, 1);})
Here, I've manipulated the data by hand and used graphics::pairs() to make the plot. Can the same be done inside the tidyverse?
Here's what I've tried.
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
glimpse()
This seems promising.
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
spread(key = person, value = score)
This gives the 'duplicate identifiers' error.
Maybe reshape2 would be better to use here?
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
dcast(formula = score ~ person)
This creates a data.frame with 120 rows, and most of the values (90 per person) are NA. How can I create a wide data.frame without introducing so many NA?
You need a variable that links the row position for each person. Try
by_tidyverse <- d %>%
group_by(person) %>%
mutate(rowID=1:n(),
score=sort(score)
) %>%
spread(key = person, value = score) %>%
select(-rowID)
pairs(x = by_tidyverse, lower.panel = function(x, y) { points(x, y); abline(0, 1);})

Moving mean as a function in dplyr

I'd like to create a function that can calculate the moving mean for a variable number of last observations and different variables. Take this as mock data:
df = expand.grid(site = factor(seq(10)),
year = 2000:2004,
day = 1:50)
df$temp = rpois(dim(df)[1], 5)
Calculating for 1 variable and a fixed number of last observations works. E.g. this calculates the average of the temperature of the last 5 days:
library(dplyr)
library(zoo)
df <- df %>%
group_by(site, year) %>%
arrange(site, year, day) %>%
mutate(almost_avg = rollmean(x = temp, 5, align = "right", fill = NA)) %>%
mutate(avg = lag(almost_avg, 1))
So far so good. Now trying to functionalize fails.
avg_last_x <- function(dataframe, column, last_x) {
dataframe <- dataframe %>%
group_by(site, year) %>%
arrange(site, year, day) %>%
mutate(almost_avg = rollmean(x = column, k = last_x, align = "right", fill = NA)) %>%
mutate(avg = lag(almost_avg, 1))
return(dataframe) }
avg_last_x(dataframe = df, column = "temp", last_x = 10)
I get this error:
Error in mutate_impl(.data, dots) : k <= n is not TRUE
I understand this is probably related to the evaluation mechanism in dplyr, but I don't get it fixed.
Thanks in advance for your help.
This should fix it.
library(lazyeval)
avg_last_x <- function(dataframe, column, last_x) {
dataframe %>%
group_by(site, year) %>%
arrange(site, year, day) %>%
mutate_(almost_avg = interp(~rollmean(x = c, k = last_x, align = "right",
fill = NA), c = as.name(column)),
avg = ~lag(almost_avg, 1))
}

Resources