R simulation of correlated data

data = data.frame(GROUP = sample(1:4, size = 1000, replace = TRUE),
SCORE = runif(1000),
HELP = runif(1000, -.2, .8))
I have a sample data frame 'data' with three variables: GROUP, the group an individual belongs to; SCORE, the score for that individual; and HELP, another measure.
Suppose you know that individuals in GROUP = 1 have a higher SCORE on average when their HELP value is greater than 0.3, while individuals in GROUP = 2 have a lower SCORE on average when their HELP value is less than 0.4. How can you simulate a data set that reflects this information?

Here's an unsophisticated solution using dplyr: shift SCORE up or down by one standard deviation for the rows that meet each condition.
library(dplyr)
sd(data$SCORE)
#> [1] 0.2868018
data <-
data %>%
mutate(helpgt3 = as.integer(HELP > .3),
helplt4 = as.integer(HELP < .4))
original_means <-
data %>%
group_by(GROUP, helpgt3) %>%
summarise(mean.score = mean(SCORE))
data <-
data %>%
mutate(SCORE = case_when(
helpgt3 == 1 & GROUP == 1 ~ SCORE + 0.2868018,
helplt4 == 1 & GROUP == 2 ~ SCORE - 0.2868018,
TRUE ~ SCORE
))
new_means <-
data %>%
group_by(GROUP, helpgt3) %>%
summarise(mean.score = mean(SCORE))
#> `summarise()` regrouping output by 'GROUP' (override with `.groups` argument)
original_means
#> # A tibble: 8 x 3
#> # Groups: GROUP [4]
#> GROUP helpgt3 mean.score
#> <int> <int> <dbl>
#> 1 1 0 0.486
#> 2 1 1 0.474
#> 3 2 0 0.473
#> 4 2 1 0.525
#> 5 3 0 0.482
#> 6 3 1 0.486
#> 7 4 0 0.545
#> 8 4 1 0.521
new_means
#> # A tibble: 8 x 3
#> # Groups: GROUP [4]
#> GROUP helpgt3 mean.score
#> <int> <int> <dbl>
#> 1 1 0 0.486
#> 2 1 1 0.761
#> 3 2 0 0.186
#> 4 2 1 0.478
#> 5 3 0 0.482
#> 6 3 1 0.486
#> 7 4 0 0.545
#> 8 4 1 0.521
Your data
set.seed(2020)
data = data.frame(GROUP = sample(1:4, size = 1000, replace = TRUE),
SCORE = runif(1000),
HELP = runif(1000, -.2, .8))
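As an alternative to adjusting SCORE after the fact, the same information can be built into the simulation itself. This is only a minimal base R sketch: the effect size of 0.2 and the names effect, base_score, shift and sim are assumptions made for illustration, not values from the question.

```r
set.seed(2020)
n <- 1000
effect <- 0.2  # hypothetical size of the group-specific shift

GROUP <- sample(1:4, size = n, replace = TRUE)
HELP <- runif(n, -0.2, 0.8)
base_score <- runif(n)  # score before any group-specific effect

# Shift is +effect for GROUP 1 with HELP > 0.3, -effect for GROUP 2 with
# HELP < 0.4, and zero for everyone else.
shift <- ifelse(GROUP == 1 & HELP > 0.3, effect,
                ifelse(GROUP == 2 & HELP < 0.4, -effect, 0))

sim <- data.frame(GROUP, HELP, SCORE = base_score + shift)

# Compare cell means to see the built-in effects.
sim$help_gt_03 <- sim$HELP > 0.3
aggregate(SCORE ~ GROUP + help_gt_03, data = sim, FUN = mean)
```

Note that the shifted scores can fall outside the original 0 to 1 range. If SCORE has to stay bounded, the shift would need to be applied on a different scale.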

Related

Filter by value counts within groups

I want to filter my grouped dataframe based on the number of occurrences of a specific value within a group.
Some exemplary data:
data <- data.frame(ID = sample(c("A","B","C","D"),100,replace = T),
rt = runif(100,0.2,1),
lapse = sample(1:2,100,replace = T))
The “lapse” column is my filter variable in this case.
I want to exclude every “ID” group that contains more than 15 rows with “lapse” == 2.
data %>% group_by(ID) %>% count(lapse == 2)
So if, for example, group “A” contains 17 rows with “lapse” == 2, it should be removed entirely from the data frame.
First I created some reproducible data using set.seed and checked the number of values per group. In this case only group D has more than 15 rows with lapse == 2. You can use filter and sum the rows with lapse == 2 per group like this:
set.seed(7)
data <- data.frame(ID = sample(c("A","B","C","D"),100,replace = T),
rt = runif(100,0.2,1),
lapse = sample(1:2,100,replace = T))
library(dplyr)
# Check n values per group
data %>%
group_by(ID, lapse) %>%
summarise(n = n())
#> # A tibble: 8 × 3
#> # Groups: ID [4]
#> ID lapse n
#> <chr> <int> <int>
#> 1 A 1 8
#> 2 A 2 7
#> 3 B 1 13
#> 4 B 2 15
#> 5 C 1 18
#> 6 C 2 6
#> 7 D 1 17
#> 8 D 2 16
data %>%
group_by(ID) %>%
filter(!(sum(lapse == 2) > 15))
#> # A tibble: 67 × 3
#> # Groups: ID [3]
#> ID rt lapse
#> <chr> <dbl> <int>
#> 1 B 0.517 2
#> 2 C 0.589 1
#> 3 C 0.598 2
#> 4 C 0.715 1
#> 5 B 0.475 2
#> 6 C 0.965 1
#> 7 B 0.234 1
#> 8 B 0.812 2
#> 9 C 0.517 1
#> 10 B 0.700 1
#> # … with 57 more rows
Created on 2023-01-08 with reprex v2.0.2
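The filter works because sum(lapse == 2) counts the TRUE values within each group, and a single logical result per group keeps or drops the whole group. A minimal base R sketch of the same idea follows; the names max_lapses, n_lapse2 and kept are assumptions made for illustration.

```r
set.seed(7)
data <- data.frame(ID = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
                   rt = runif(100, 0.2, 1),
                   lapse = sample(1:2, 100, replace = TRUE))

max_lapses <- 15  # hypothetical name for the threshold from the question

# For every row, the number of lapse == 2 rows in that row's ID group.
n_lapse2 <- ave(as.integer(data$lapse == 2), data$ID, FUN = sum)

# Keep only rows whose group does not exceed the threshold.
kept <- data[n_lapse2 <= max_lapses, ]
table(kept$ID, kept$lapse)
```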

Replace NA in multiple columns by group in R

df <- data.frame(A = c(NA,5,4,NA,1),
B = c(1,NA,1,1,NA),
C = c(3,3,NA,NA,6),
D = c(0,0,1,1,1))
I have a dataset like the one above and am trying to replace the NA values with the mean of the subgroup defined by the target variable D.
I tried the following code to replace them one column at a time.
df <- df %>%
group_by(D) %>%
mutate(
A = ifelse(is.na(A),
mean(A, na.rm=TRUE),A)
) %>%
mutate(
B = ifelse(is.na(B),
mean(B, na.rm=TRUE),B)
) %>%
mutate(
C = ifelse(is.na(C),
mean(C, na.rm=TRUE),C)
)
Is there a more efficient way to impute the mean values?
Perhaps this 'tidyverse' approach will suit:
library(tidyverse)
df <- data.frame(A = c(NA,5,4,NA,1),
B = c(1,NA,1,1,NA),
C = c(3,3,NA,NA,6),
D = c(0,0,1,1,1))
df_output <- df %>%
group_by(D) %>%
mutate(
A = ifelse(is.na(A),
mean(A, na.rm=TRUE),A)
) %>%
mutate(
B = ifelse(is.na(B),
mean(B, na.rm=TRUE),B)
) %>%
mutate(
C = ifelse(is.na(C),
mean(C, na.rm=TRUE),C)
)
df_output
#> # A tibble: 5 × 4
#> # Groups: D [2]
#> A B C D
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 3 0
#> 2 5 1 3 0
#> 3 4 1 6 1
#> 4 2.5 1 6 1
#> 5 1 1 6 1
df_output_2 <- df %>%
group_by(D) %>%
mutate(across(A:C, ~replace_na(.x, mean(.x, na.rm = TRUE))))
df_output_2
#> # A tibble: 5 × 4
#> # Groups: D [2]
#> A B C D
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 3 0
#> 2 5 1 3 0
#> 3 4 1 6 1
#> 4 2.5 1 6 1
#> 5 1 1 6 1
all_equal(df_output, df_output_2)
#> [1] TRUE
Created on 2022-10-04 by the reprex package (v2.0.1)
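I encountered the same problem before, but my dataset was bigger. In these cases, I use mutate_all (note that funs() is deprecated in recent dplyr releases; across(), as in the answer above, is the current equivalent):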
df %>% group_by(D) %>% mutate_all(funs(replace(., is.na(.), mean(., na.rm = TRUE))))
A B C D
<dbl> <dbl> <dbl> <dbl>
1 5 1 3 0
2 5 1 3 0
3 4 1 6 1
4 2.5 1 6 1
5 1 1 6 1
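For comparison, the same group-wise mean imputation can be sketched in base R without any packages. The helper name impute_mean and the vector cols are assumptions made for illustration.

```r
df <- data.frame(A = c(NA, 5, 4, NA, 1),
                 B = c(1, NA, 1, 1, NA),
                 C = c(3, 3, NA, NA, 6),
                 D = c(0, 0, 1, 1, 1))

# Replace missing values in a vector with the mean of its observed values.
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Apply the helper within each level of D, one column at a time.
cols <- c("A", "B", "C")
df[cols] <- lapply(df[cols], function(col) ave(col, df$D, FUN = impute_mean))
df
```

ave() splits each column by D, applies the helper to every piece, and puts the results back in the original row order, so no join or reshape is needed.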

How to Create Iterative Formula to calculate Z Score in R?

I have a number of large data frames that have the following basic format, where the final two rows are a mean (d) and standard deviation (e) - although these are calculated elsewhere.
a b c
a 4 3 4
b 3 2 6
c 2 1 8
d 3 2 6
e 1 1 2
I would like to create an iterative function that converts each raw data point into a z-score via the mean and sd value in d and e per column. The formula I would like to apply is ((x-mean)/SD).
The result would be the following:
a b c
a 1 1 -1
b 0 0 0
c -1 -1 1
I don't mind if this is added to the end, created as a new dataframe or the data is converted.
Thanks!
Here is one approach. Note that I do not use the mean/sd provided in the data but recalculate them on the fly.
Also note that the data should usually be in a tidy representation, which in your case would mean that a, b, and c are columns and the mean/sd are either calculated on the fly or stored in separate columns (this would require reshaping the data, which is not shown here).
# your input data
raw_data <- data.frame(
a = c(4, 3, 2, 3, 1),
b = c(3, 2, 1, 2, 1),
c = c(4, 6, 8, 6, 2),
row.names = c("a", "b", "c", "d", "e")
)
raw_data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
#> d 3 2 6
#> e 1 1 2
# remove the mean/sd values
data <- raw_data[!rownames(raw_data) %in% c("d", "e"), ]
data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
# quick way to recalculate the values
means <- apply(data, 2, mean)
means
#> a b c
#> 3 2 6
sds <- apply(data, 2, sd)
sds
#> a b c
#> 1 1 2
z_scores <- apply(data, 2, function(x) (x - mean(x)) / sd(x))
z_scores
#> a b c
#> a 1 1 -1
#> b 0 0 0
#> c -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)
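If the mean and sd stored in rows d and e should be used as given rather than recalculated, one possible base R sketch uses sweep(). The variable names values, means, sds and z are assumptions made for illustration.

```r
raw_data <- data.frame(
  a = c(4, 3, 2, 3, 1),
  b = c(3, 2, 1, 2, 1),
  c = c(4, 6, 8, 6, 2),
  row.names = c("a", "b", "c", "d", "e")
)

# Raw observations as a matrix, leaving out the mean (d) and sd (e) rows.
values <- as.matrix(raw_data[c("a", "b", "c"), ])

# Per-column mean and sd taken from the data itself.
means <- unlist(raw_data["d", ])
sds <- unlist(raw_data["e", ])

# Subtract each column's mean, then divide by each column's sd.
z <- sweep(sweep(values, 2, means, "-"), 2, sds, "/")
z
```

This applies (x - mean) / sd column by column, so it stays correct even when the stored mean and sd were computed elsewhere and differ from what mean() and sd() would return for these rows.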
Edit / Full Code
The following code is a bit longer but most of it is spent on getting the data into the right (long/tidy) format.
If you have any questions, feel free to use the comments.
Note that the tidyverse is really helpful, but might need some time to get used to. The code used here is mostly dplyr (included in the tidyverse).
If you understand the functions: %>% (pipe), group_by(), mutate(), summarise(), and pivot_longer/wider() you got everything.
library(tidyverse)
# use your original dataset again
raw_data <- data.frame(
a = c(4, 3, 2, 3, 1),
b = c(3, 2, 1, 2, 1),
c = c(4, 6, 8, 6, 2),
row.names = c("a", "b", "c", "d", "e")
)
### 1) Turn the data into a nicer format
# match-table how to rename the variables
var_match <- c(d = "mean", e = "sd")
# convert the raw data into a nicer format, first we do some minor changes
# (variable names, etc)
data_mixed <- raw_data %>%
# have the rownames as explicit variable
rownames_to_column("metric") %>%
# nicer printing etc
as_tibble() %>%
# replace variable names with mean/sd
mutate(metric = ifelse(metric %in% c("d", "e"),
var_match[metric], metric))
data_mixed
#> # A tibble: 5 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 4
#> 2 b 3 2 6
#> 3 c 2 1 8
#> 4 mean 3 2 6
#> 5 sd 1 1 2
# separate the dataset into two:
# data holds the values
# data_vars holds the metrics mean and sd
data <- data_mixed %>% filter(!metric %in% var_match) %>% select(-metric)
data_vars <- data_mixed %>% filter(metric %in% var_match)
data
#> # A tibble: 3 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 4 3 4
#> 2 3 2 6
#> 3 2 1 8
data_vars
#> # A tibble: 2 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 mean 3 2 6
#> 2 sd 1 1 2
# turn the value dataset into its longer form, makes it easier to work with it later
data_long <- data %>%
pivot_longer(everything(), names_to = "var", values_to = "val")
data_long
#> # A tibble: 9 x 2
#> var val
#> <chr> <dbl>
#> 1 a 4
#> 2 b 3
#> 3 c 4
#> 4 a 3
#> 5 b 2
#> 6 c 6
#> 7 a 2
#> 8 b 1
#> 9 c 8
# turn the metric dataset into another long form, allowing easy combination in the next step
data_vars2 <- data_vars %>%
pivot_longer(-metric, names_to = "var", values_to = "val") %>%
pivot_wider(id_cols = var, names_from = metric, values_from = val)
data_vars2
#> # A tibble: 3 x 3
#> var mean sd
#> <chr> <dbl> <dbl>
#> 1 a 3 1
#> 2 b 2 1
#> 3 c 6 2
# combine the datasets
data_all <- left_join(data_long, data_vars2, by = "var")
data_all
#> # A tibble: 9 x 4
#> var val mean sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 1
#> 2 b 3 2 1
#> 3 c 4 6 2
#> 4 a 3 3 1
#> 5 b 2 2 1
#> 6 c 6 6 2
#> 7 a 2 3 1
#> 8 b 1 2 1
#> 9 c 8 6 2
## 2) calculate the z-score
# now comes the actual number crunching!
# per variable var (a, b, c) compute the variable val_z as the z-score
data_res <- data_all %>%
group_by(var) %>%
mutate(val_z = (val - mean) / sd)
data_res
#> # A tibble: 9 x 5
#> # Groups: var [3]
#> var val mean sd val_z
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 4 3 1 1
#> 2 b 3 2 1 1
#> 3 c 4 6 2 -1
#> 4 a 3 3 1 0
#> 5 b 2 2 1 0
#> 6 c 6 6 2 0
#> 7 a 2 3 1 -1
#> 8 b 1 2 1 -1
#> 9 c 8 6 2 1
## 3) make the results more readable
# lastly pivot the results to its original form
data_res_wide <- data_res %>%
select(var, val_z) %>%
group_by(var) %>%
mutate(id = 1:n()) %>% # needed for easier identification of values
pivot_wider(id_cols = id, names_from = var, values_from = val_z)
data_res_wide
#> # A tibble: 3 x 4
#> id a b c
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 1 -1
#> 2 2 0 0 0
#> 3 3 -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)

Get median of group means without breaking piped workflow (or joining back temporary tibble)

My data have a grouping variable group, and I would like to find the median of the group means of x so that I can flag groups that have group means of x higher than the median group mean of x.
This calculation is easy if I save the group means to a tibble temp, compare x_mean to median(x_mean), and merge back temp.
library(tidyverse)
set.seed(2001)
tb <- tibble(group = c(1, 2, rep(3, 3))) %>%
mutate(x = runif(n()) + ifelse(group %in% 1:2, 1, 0))
tb
#> # A tibble: 5 x 2
#> group x
#> <dbl> <dbl>
#> 1 1 1.76
#> 2 2 1.61
#> 3 3 0.218
#> 4 3 0.229
#> 5 3 0.153
temp <- tb %>%
group_by(group) %>%
summarize(x_mean = mean(x)) %>%
ungroup() %>%
mutate(x_hi = (x_mean > median(x_mean)))
temp
#> # A tibble: 3 x 3
#> group x_mean x_hi
#> <dbl> <dbl> <lgl>
#> 1 1 1.76 TRUE
#> 2 2 1.61 FALSE
#> 3 3 0.200 FALSE
tb <- inner_join(tb, temp)
#> Joining, by = "group"
Here is the desired output. It may seem odd that 4/5 observations are below the median, but this is possible since my group counts are not equal.
tb
#> # A tibble: 5 x 4
#> group x x_mean x_hi
#> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1.76 1.76 TRUE
#> 2 2 1.61 1.61 FALSE
#> 3 3 0.218 0.200 FALSE
#> 4 3 0.229 0.200 FALSE
#> 5 3 0.153 0.200 FALSE
I would like to do this without breaking my piped workflow. The following attempt fails because my groups have different counts.
tb <- tb %>%
group_by(group) %>%
mutate(x_mean2 = mean(x)) %>%
ungroup() %>%
mutate(x_hi2 = (x_mean2 > median(x_mean2)))
tb
#> # A tibble: 5 x 6
#> group x x_mean x_hi x_mean2 x_hi2
#> <dbl> <dbl> <dbl> <lgl> <dbl> <lgl>
#> 1 1 1.76 1.76 TRUE 1.76 TRUE
#> 2 2 1.61 1.61 FALSE 1.61 TRUE
#> 3 3 0.218 0.200 FALSE 0.200 FALSE
#> 4 3 0.229 0.200 FALSE 0.200 FALSE
#> 5 3 0.153 0.200 FALSE 0.200 FALSE
Is there a way to grab the median of the group means of x without breaking my piped workflow?
Created on 2019-07-29 by the reprex package (v0.3.0)
Just use unique:
library(dplyr)
tb %>%
group_by(group) %>%
mutate(x_mean = mean(x)) %>%
ungroup %>%
mutate(x_hi = x_mean > median(unique(x_mean)))
#> # A tibble: 5 x 4
#> group x x_mean x_hi
#> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1.76 1.76 TRUE
#> 2 2 1.61 1.61 FALSE
#> 3 3 0.218 0.200 FALSE
#> 4 3 0.229 0.200 FALSE
#> 5 3 0.153 0.200 FALSE
M-M's answer works for the specific case, but I don't think it would be accurate if more than one group had the same mean scores.
tb %>%
group_by(group) %>%
mutate(x_mean = mean(x)) %>%
ungroup() %>%
# one row per group, with the remaining columns nested
nest(data = -c(group, x_mean)) %>%
mutate(x_median = median(x_mean)) %>%
unnest(data) %>%
mutate(x_hi = x_mean > x_median)
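The concern about tied group means can also be avoided without nesting by taking one mean per group, selected by group rather than by value. Below is a minimal base R sketch of that idea; the x values are copied from the rounded printout in the question (an assumption, since the exact simulated values are not shown), and one_per_group is a name made up for illustration.

```r
tb <- data.frame(group = c(1, 2, 3, 3, 3),
                 x = c(1.76, 1.61, 0.218, 0.229, 0.153))

# Group mean repeated on every row of the group.
tb$x_mean <- ave(tb$x, tb$group, FUN = mean)

# Pick exactly one row per group, so two groups that happen to share
# the same mean are both counted in the median.
one_per_group <- !duplicated(tb$group)
tb$x_hi <- tb$x_mean > median(tb$x_mean[one_per_group])
tb
```

Inside a dplyr pipe the same expression should carry over as median(x_mean[!duplicated(group)]) after ungroup().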

Winners within pairs; or vector-valued group_by mutate?

I'm trying to assess which unit in a pair is the "winner". group_by() %>% mutate() is close to the right thing, but it's not quite there. in particular
dat %>% group_by(pair) %>% mutate(winner = ifelse(score[1] > score[2], c(1, 0), c(0, 1))) doesn't work.
The below does, but is clunky with an intermediate summary data frame. Can we improve this?
library(tidyverse)
set.seed(343)
# units within pairs get scores
dat <-
tibble(pair = rep(1:3, each = 2),
unit = rep(1:2, 3),
score = rnorm(6))
# figure out who won in each pair
summary_df <-
dat %>%
group_by(pair) %>%
summarize(winner = which.max(score))
# merge back and determine whether each unit won
dat <-
left_join(dat, summary_df, "pair") %>%
mutate(won = as.numeric(winner == unit))
dat
#> # A tibble: 6 x 5
#> pair unit score winner won
#> <int> <int> <dbl> <int> <dbl>
#> 1 1 1 -1.40 2 0
#> 2 1 2 0.523 2 1
#> 3 2 1 0.142 1 1
#> 4 2 2 -0.847 1 0
#> 5 3 1 -0.412 1 1
#> 6 3 2 -1.47 1 0
Created on 2018-09-26 by the reprex package (v0.2.0).
Maybe related to: Weird group_by + mutate + which.max behavior
You could do:
dat %>%
group_by(pair) %>%
mutate(won = score == max(score),
winner = unit[won])
# A tibble: 6 x 5
# Groups: pair [3]
pair unit score won winner
<int> <int> <dbl> <lgl> <int>
1 1 1 -1.40 FALSE 2
2 1 2 0.523 TRUE 2
3 2 1 0.142 TRUE 1
4 2 2 -0.847 FALSE 1
5 3 1 -0.412 TRUE 1
6 3 2 -1.47 FALSE 1
Using rank:
dat %>% group_by(pair) %>% mutate(won = rank(score) - 1)
More for fun (and slightly faster), using the outcome of the comparison (score[1] > score[2]) to index a vector with 'won alternatives' :
dat %>% group_by(pair) %>%
mutate(won = c(0, 1, 0)[1:2 + (score[1] > score[2])])
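As for why the original ifelse() attempt does not work: ifelse() returns a result with the same length as its test, and score[1] > score[2] has length one, so only a single value comes back per pair. A plain if/else can return the whole length-two vector. The base R sketch below illustrates this; the scores are copied from the rounded printout in the question, and the names short and full are assumptions made for illustration.

```r
dat <- data.frame(pair = rep(1:3, each = 2),
                  unit = rep(1:2, 3),
                  score = c(-1.40, 0.523, 0.142, -0.847, -0.412, -1.47))

s <- dat$score[dat$pair == 1]

# ifelse() is vectorised over the test, which has length 1 here.
short <- ifelse(s[1] > s[2], c(1, 0), c(0, 1))
length(short)

# if/else returns whichever full vector is selected.
full <- if (s[1] > s[2]) c(1, 0) else c(0, 1)
length(full)

# Per-pair winner flag without an intermediate summary table.
dat$won <- ave(dat$score, dat$pair, FUN = function(x) as.numeric(x == max(x)))
dat
```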
