I have multiple observations from each of a few groups and I'd like to make a matrix of QQ plots (or another type of plot), comparing each group to every other group.
Here's an example of what I'm talking about:
library(tidyverse)
set.seed(27599)
n <- 30
d <- data_frame(person = c(rep('Alice', n),
                           rep('Bob', n),
                           rep('Charlie', n),
                           rep('Danielle', n)),
                score = c(rnorm(n = n),
                          rnorm(n = n, mean = 0.1),
                          rnorm(n = n, sd = 2),
                          rnorm(n = n, mean = 0.3, sd = 1.4)))

by_hand <- data_frame(a = sort(d$score[d$person == 'Alice']),
                      b = sort(d$score[d$person == 'Bob']),
                      c = sort(d$score[d$person == 'Charlie']),
                      d = sort(d$score[d$person == 'Danielle']))

pairs(x = by_hand,
      lower.panel = function(x, y) { points(x, y); abline(0, 1) })
Here, I've manipulated the data by hand and used graphics::pairs() to make the plot. Can the same be done inside the tidyverse?
Here's what I've tried.
d %>%
  group_by(person) %>%
  mutate(score = sort(score)) %>%
  glimpse()
This seems promising.
d %>%
  group_by(person) %>%
  mutate(score = sort(score)) %>%
  spread(key = person, value = score)
This gives the 'duplicate identifiers' error.
Maybe reshape2 would be better to use here?
library(reshape2)
d %>%
  group_by(person) %>%
  mutate(score = sort(score)) %>%
  dcast(formula = score ~ person)
This creates a data.frame with 120 rows, and most of the values (90 per person) are NA. How can I create a wide data.frame without introducing so many NAs?
You need a variable that links the row position for each person. Try
by_tidyverse <- d %>%
  group_by(person) %>%
  mutate(rowID = 1:n(),
         score = sort(score)) %>%
  spread(key = person, value = score) %>%
  select(-rowID)

pairs(x = by_tidyverse,
      lower.panel = function(x, y) { points(x, y); abline(0, 1) })
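Note that spread() has since been superseded by pivot_wider(); if you have tidyr 1.0.0 or later (an assumption about your setup), the same reshaping can be sketched like this, with row_number() playing the role of rowID:

# Sketch of the same idea with tidyr::pivot_wider() (tidyr >= 1.0.0);
# row_number() provides the linking row position, as with rowID above
by_tidyverse2 <- d %>%
  group_by(person) %>%
  mutate(rowID = row_number(),
         score = sort(score)) %>%
  ungroup() %>%
  pivot_wider(names_from = person, values_from = score) %>%
  select(-rowID)

pairs(x = by_tidyverse2,
      lower.panel = function(x, y) { points(x, y); abline(0, 1) })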
I am working with the R programming language.
Suppose there is a classroom of students - each student flips the same coin many times (the students don't all flip the coin the same number of times). Here is a simulated dataset to represent this example:
library(tidyverse)
library(dplyr)
set.seed(123)
ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)
my_data = my_data[order(my_data$student_id),]
I want to count the number of "3 Flip Sequences" recorded by each student (e.g. if Student 1 got HHHTH, that is HHH 1 time, HHT 1 time, HTH 1 time), and then the probability of the 3rd flip based on the previous 2 flips (e.g. in general, over all students, the probability of an H following HH was 0.54).
Here is some R code that performs these tasks:
results = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)),
            .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence)
final = results %>%
  mutate(two_seq = substr(Sequence, 1, 2)) %>%
  group_by(two_seq) %>%
  mutate(third = substr(Sequence, 3, 3)) %>%
  group_by(two_seq, third) %>%
  summarize(sums = sum(n)) %>%
  mutate(prob = sums / sum(sums))
My Question: Suppose I want to now extend this problem to "4 Flip Sequences" (e.g. probability of H given HHH) - I can manually extend this code:
results = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2),
                             lead(coin_result, 3)),
            .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence)
final = results %>%
  mutate(three_seq = substr(Sequence, 1, 3)) %>%
  group_by(three_seq) %>%
  mutate(fourth = substr(Sequence, 4, 4)) %>%
  group_by(three_seq, fourth) %>%
  summarize(sums = sum(n)) %>%
  mutate(prob = sums / sum(sums))
Is it possible to convert the above code into a function such that I can repeat this for arbitrary combinations? For example:
results <- function(i) {
  return(my_data %>%
    group_by(student_id) %>%
    summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, i+1),
                               lead(coin_result, i+2) .....### insert code here ####),
              .groups = 'drop') %>%
    filter(!is.na(Sequence)) %>%
    count(Sequence))
}
final <- function(i) {
  return(results %>%
    mutate(three_seq = substr(Sequence, 1, i)) %>%
    group_by(three_seq) %>%
    mutate(fourth = substr(Sequence, i+1, i+1)) %>%
    group_by(three_seq, fourth) %>%
    summarize(sums = sum(n)) %>%
    mutate(prob = sums / sum(sums)))
}
I am not sure exactly how I would do this, seeing as the first function would need to be "dynamically changed" depending on the value of i.
Can someone please show me how to do this?
Thanks!
Here's a way you can do it in base R:
# Returns a vector of 0s and 1s; a bit more efficient than sample()
tosses <- floor(runif(1e3, 0, 2))

count_seqs <- function(x, seq_length) {
  vec_length <- length(x)
  rolling_window_indices <- rep(1:seq_length, vec_length - seq_length + 1) +
    rep(0:(vec_length - seq_length), each = seq_length)
  mat <- matrix(x[rolling_window_indices], nrow = seq_length)
  sequences <- apply(mat, 2, paste0, collapse = "")
  table(sequences)
}
count_seqs(tosses, 3)
Notice I didn't include any ids in the code above. The reason is that, if all students have the same probability of tossing heads or tails, we can treat them as independent (or, more precisely, treat the design as ignorable). However, it's easy to expand the code for situations where the tosses are not independent, e.g. where each participant has a different probability of tossing heads:
ids <- floor(runif(1e3, 1, 101))
probs <- runif(1e2, 0, 1)
tosses_by_id <- lapply(ids, function(i) rbinom(10, 1, probs[i]))
lapply(tosses_by_id, function(x) count_seqs(x, 3))
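If you'd rather keep the original tidyverse pipeline, one way to parameterise it is to build the str_c() call from a list of leads. This is only a rough sketch; the helper names flip_sequences, count_results and cond_probs, and the argument k (the sequence length), are made up for illustration:

library(dplyr)
library(stringr)

# Paste each flip with its next (k - 1) leads; rows whose leads run past the
# end of the vector become NA and are filtered out later
flip_sequences <- function(x, k) {
  parts <- c(list(x), lapply(seq_len(k - 1), function(j) lead(x, j)))
  do.call(str_c, parts)
}

count_results <- function(data, k) {
  data %>%
    group_by(student_id) %>%
    summarize(Sequence = flip_sequences(coin_result, k), .groups = 'drop') %>%
    filter(!is.na(Sequence)) %>%
    count(Sequence)
}

# Probability of the k-th flip given the first (k - 1) flips
cond_probs <- function(results, k) {
  results %>%
    mutate(prefix = substr(Sequence, 1, k - 1),
           last = substr(Sequence, k, k)) %>%
    group_by(prefix, last) %>%
    summarize(sums = sum(n), .groups = 'drop_last') %>%
    mutate(prob = sums / sum(sums))
}

results <- count_results(my_data, 4)
final <- cond_probs(results, 4)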
I want to use dplyr to calculate a field using its previous value. A for loop would do the job, but I want to calculate it by different groups of st. I understand mutate (with lag or ave) can't use an uninitialized field.
df <- data.frame(st = rep(c('a', 'b', 'c'), each = 5),
                 z = rep(c(10, 14, 12), each = 5),
                 day = rep(1:5, 3),
                 GAI = rep(0:4, 3),
                 surfT = sample(1:15))
df %>%
  group_by(st) %>%
  mutate(soilT = lag(soilT, order_by = day) +
           (surfT - lag(soilT, order_by = day)) * 0.24 * exp(-z * 0.017) * exp(-0.15 * GAI))
or
df %>%
  group_by(st) %>%
  mutate(soilT = ave(soilT, c(st), FUN = function(x) c(0, soilT + (surfT - soilT))) *
           0.24 * exp(-z * 0.017) * exp(-0.15 * GAI))
How can a simple for loop like this be calculated in dplyr by group of st:
df$soilT <- 0
for (i in 2:dim(df)[1]) {
  df$soilT[i] <- df$soilT[i - 1] + (df$surfT[i] - df$soilT[i - 1]) *
    0.24 * exp(-df$z[i] * 0.017) * exp(-0.15 * df$GAI[i])
}
We can use accumulate from purrr to feed the output of the previous row in as the input to the current row.
library(dplyr)

result <- df %>%
  group_by(st) %>%
  mutate(soilT = purrr::accumulate(2:n(),
                                   ~ .x + (surfT[.y] - .x) * 0.24 * exp(-z[.y] * 0.017) * exp(-0.15 * GAI[.y]),
                                   .init = 0))
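For comparison, here is a rough base R sketch of the same recursion using Reduce(..., accumulate = TRUE), applied per group with split(); it assumes df is ordered by st and day as in the example above:

# Base R sketch: accumulate the recursion within each st group
df$soilT <- unlist(lapply(split(df, df$st), function(g) {
  Reduce(function(prev, i) {
    prev + (g$surfT[i] - prev) * 0.24 * exp(-g$z[i] * 0.017) * exp(-0.15 * g$GAI[i])
  }, x = 2:nrow(g), init = 0, accumulate = TRUE)
}))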
Below is my data, where the columns group, score and prob represent the treatment group, the score outcome, and the probability of occurrence of that score outcome.
library(tidyverse)

data1 <- tibble(group = rep(c('A', 'B'), each = 5),
                score = c(0, 1, 2, 3, 4, 0, 1, 2, 3, 4),
                prob = c(.08, .8, .1, .02, 0, 0, 0, .4, .4, .2))
data1
I want to use this data and simulate the scores R times, so I expand it as follows:
R <- 1:10 # number of simulations
data2 <- data1 %>%
  nest(data = c(score, prob)) %>%
  unnest_wider(col = data) %>%
  mutate(size = 15)

data3 <- data2 %>%
  group_by(group, score, prob, size) %>%
  tidyr::expand(iter = R) %>%
  ungroup()

data3
My function to simulate the data is as follows, where I use the multinomial distribution:
sim_data <- function(dt, n, size, prob) {
  d1 <- dt %>%
    mutate(sim_score = pmap(list(n = 1,
                                 size = size,
                                 prob = prob),
                            rmultinom)) %>%
    select(-prob) %>%
    unnest(cols = c(sim_score, score))
  # create real scores from counts above
  d2 <- d1 %>%
    rowwise() %>%
    mutate(outcome = map(score, rep, times = sim_score)) %>%
    unnest(cols = outcome) %>%
    select(size, contains('iter'), group, outcome) %>%
    nest(data = c(group, outcome))
  return(d2)
}
data4 <- sim_data(data3, n, size, prob)
data4
This works fine; however, I want to omit those instances where a group has 2 or fewer unique scores. To do this, I wrap sim_data in a new function, sim_check, which uses repeat to re-simulate until the requirement is met:
sim_check <- function(my_data, n, size, prob) {
  repeat {
    # simulate data
    dt <- sim_data(my_data, n, size, prob)
    # check requirement
    check <- dt %>%
      select(data) %>%
      unnest(cols = c(data)) %>%
      group_by(group) %>%
      distinct(outcome) %>%
      mutate(id = 1:n()) %>%
      summarise(n = n()) %>%
      distinct(n) %>%
      pull(n)
    if (all(check > 2)) break
  }
  return(dt)
}
data5 <- sim_check(data3, n, size, prob)
This, however, does not do the checking, as I still see observations with 2 unique scores.
I can achieve my objective with a for loop as follows, but I want to avoid the for loop. Can someone point out where I'm going wrong?
out <- list()
for (i in 1:10) {
  data6 <- sim_check(data2, n, size, prob)
  out[[i]] <- data6 %>%
    pull(data)
}
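One possible culprit: inside sim_check, select(data) drops the iter column, so distinct(outcome) pools every iteration together for each group, and all(check > 2) can be satisfied even when a single iteration has only 2 unique scores. A rough sketch of a per-iteration check (assuming the requirement should hold within each iteration; sim_check2 is a made-up name) could be:

# Sketch: count unique outcomes per (iter, group) instead of pooling iterations
sim_check2 <- function(my_data, n, size, prob) {
  repeat {
    dt <- sim_data(my_data, n, size, prob)
    check <- dt %>%
      unnest(cols = c(data)) %>%
      group_by(iter, group) %>%
      summarise(n_unique = n_distinct(outcome), .groups = 'drop') %>%
      pull(n_unique)
    if (all(check > 2)) break
  }
  return(dt)
}

data5 <- sim_check2(data3, n, size, prob)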
Here is a minimal example:
df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)
I want to summarize the data by treatment for all the variables I have and put the results together, so I define a function to do this for each variable first, and then rbind them later on.
SummarizeFn <- function(x, y, z) {
  x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(y), SD = sd(y)) %>%
    cbind("Var" = rep(y, 3)) # add a column to show which variable those statistics belong to
}
SumPrice <- SummarizeFn(df, df$Price, "Price")
However, R tells me that object "Price" is not found. How can I solve this problem?
Also, how can I make y a character indicating that the mean and SD are of Price?
Price isn't a variable in your workspace; you need SummarizeFn(df, df$Price) because Price is only defined inside your data frame df.
SummarizeFn <- function(x, y, z) {
  df1 <- x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(y), SD = sd(y))
  df1 <- df1 %>% mutate(Var = z)
  return(df1)
}
SumPrice <- SummarizeFn(df, df$Price,"Price")
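As an aside, with rlang 0.4.0 or later (an assumption about your setup) you can avoid passing df$Price at all by embracing the column with {{ }} and recovering its name with as_label(enquo()); a rough sketch (SummarizeFn2 is a made-up name):

library(dplyr)
library(rlang)

# Sketch: pass the bare column name; {{ col }} injects it into the data mask
SummarizeFn2 <- function(x, col) {
  var_name <- as_label(enquo(col))  # e.g. "Price"
  x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean({{ col }}), SD = sd({{ col }})) %>%
    mutate(Var = var_name)
}

SumPrice <- SummarizeFn2(df, Price)
SumCost <- SummarizeFn2(df, Cost)
rbind(SumPrice, SumCost)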
I'd like to create a function that can calculate the moving mean for a variable number of last observations and different variables. Take this as mock data:
df = expand.grid(site = factor(seq(10)),
                 year = 2000:2004,
                 day = 1:50)
df$temp = rpois(dim(df)[1], 5)
Calculating for 1 variable and a fixed number of last observations works. E.g. this calculates the average of the temperature of the last 5 days:
library(dplyr)
library(zoo)

df <- df %>%
  group_by(site, year) %>%
  arrange(site, year, day) %>%
  mutate(almost_avg = rollmean(x = temp, 5, align = "right", fill = NA)) %>%
  mutate(avg = lag(almost_avg, 1))
So far so good. But trying to turn this into a function fails.
avg_last_x <- function(dataframe, column, last_x) {
  dataframe <- dataframe %>%
    group_by(site, year) %>%
    arrange(site, year, day) %>%
    mutate(almost_avg = rollmean(x = column, k = last_x, align = "right", fill = NA)) %>%
    mutate(avg = lag(almost_avg, 1))
  return(dataframe)
}
avg_last_x(dataframe = df, column = "temp", last_x = 10)
I get this error:
Error in mutate_impl(.data, dots) : k <= n is not TRUE
I understand this is probably related to the evaluation mechanism in dplyr, but I can't get it fixed.
Thanks in advance for your help.
This should fix it.
library(lazyeval)

avg_last_x <- function(dataframe, column, last_x) {
  dataframe %>%
    group_by(site, year) %>%
    arrange(site, year, day) %>%
    mutate_(almost_avg = interp(~rollmean(x = c, k = last_x, align = "right",
                                          fill = NA), c = as.name(column)),
            avg = ~lag(almost_avg, 1))
}
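As a footnote, mutate_() and lazyeval have since been deprecated; if you are on dplyr 0.7.0 or later (an assumption about your version), the same function can be sketched with the .data pronoun, keeping column as a string:

library(dplyr)
library(zoo)

# Sketch using tidy evaluation: .data[[column]] selects the column named by the string
avg_last_x <- function(dataframe, column, last_x) {
  dataframe %>%
    group_by(site, year) %>%
    arrange(site, year, day) %>%
    mutate(almost_avg = rollmean(x = .data[[column]], k = last_x,
                                 align = "right", fill = NA),
           avg = lag(almost_avg, 1))
}

avg_last_x(dataframe = df, column = "temp", last_x = 10)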