How to make raw data from descriptives in R?

I have a table of descriptives, but I'd like to generate the raw data so that I can run stats like t.test and whatnot.
# Table: Heart rate between groups
#           Mean heart rate   sd     n
# Group1    125               11   218
# Group2    133               12   156
I'd like to fill in the remainder of group 2 with NAs so the vectors are the same length. Then, I'd like to run a t test to see if the two groups are different. I've been hacking away at it, but I can't seem to get everything working properly.
I was using rnorm; I'm just not sure how to then get the result into a suitable format for the stats:
group1 <- rnorm(n=218, mean = 125, sd = 11)
group2 <- rnorm(n=156, mean = 133, sd = 12)
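For completeness, here is a minimal sketch of the NA-padding and the t-test itself (the object names below are just illustrative, not from the original post):
# Pad group2 with NAs so both vectors have the same length
group2_padded <- c(group2, rep(NA, length(group1) - length(group2)))
# Combine into a single data frame (wide format)
hr <- data.frame(group1 = group1, group2 = group2_padded)
# t.test() drops NAs on its own, so the padding does not affect the test
t.test(hr$group1, hr$group2)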

Related

Repeated random sampling and kurtosis on unbalanced sample

I have an unbalanced dataset with people from liberal and conservative backgrounds giving ratings on an issue (1-7). I would like to see how polarized the issue is.
The sample is heavily skewed towards liberals (70% of the sample). How do I do repeated sampling in R to create a balanced sample (50-50) and calculate kurtosis?
For example, I have 50 conservatives in total. How do I repeatedly sample 50 liberals at random out of the 150?
A sample dataframe below:
political_ort  rating
liberal        1
liberal        6
conservative   5
conservative   3
liberal        7
liberal        3
liberal        1
What you're describing is termed 'undersampling'. Here is one method using tidyverse functions:
# Load library
library(tidyverse)
# Create some 'test' (fake) data
sample_df <- tibble(id_number = 1:100,
                    political_ort = c(rep("liberal", 70),
                                      rep("conservative", 30)),
                    ratings = sample(1:7, size = 100, replace = TRUE))
# Take the fake data
undersampled_df <- sample_df %>%
  # Group the data by category (liberal / conservative) to treat them separately
  group_by(political_ort) %>%
  # And randomly sample 30 rows from each category; because there are only
  # 30 conservatives in total, they are all included
  sample_n(size = 30, replace = FALSE) %>%
  # Finally, ungroup the data so it goes back to a 'vanilla' dataframe/tibble
  ungroup()
# You can see the id_numbers aren't in order anymore indicating the sampling was random
There is also the ROSE package that has a function ("ovun.sample") that can do this for you: https://www.rdocumentation.org/packages/ROSE/versions/0.0-3/topics/ovun.sample
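The question also asks about repeating the resampling and computing kurtosis each time; the answer above stops at a single undersampled set, so here is only a rough sketch of that part, assuming the e1071 package for its kurtosis() function (moments::kurtosis() would work the same way):
library(e1071)  # assumed here only for kurtosis(); not part of the original answer
# Repeat the 30/30 undersampling 1000 times and record the kurtosis of the ratings
kurtosis_reps <- replicate(1000, {
  balanced <- sample_df %>%
    group_by(political_ort) %>%
    sample_n(size = 30, replace = FALSE) %>%
    ungroup()
  kurtosis(balanced$ratings)
})
# Distribution of kurtosis estimates across the balanced resamples
summary(kurtosis_reps)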

Calculate the SD in a different dataset with different data value limits

I am a beginner with R and want to calculate the SD of values in one dataframe several times, within limits given by the values of another dataframe.
Imagine I have a dataframe looking like this.
peak <- c("max", "max", "max")
value <- c(42, 105, 170)
minbefore<- c(20, 50, 115)
minafter <- c(50, 115, 180)
extrema <- data.frame(peak, value, minbefore, minafter)
I now want to calculate the SD of the values in another dataframe, em$Position, within the limits of extrema$minbefore and extrema$minafter, for each row of the dataframe extrema.
My idea was something like this
extrema$SD <- sd(em$Position[em$Position>extrema$minbefore & em$Position<extrema$minafter])
Then I get the following error message: longer object length is not a multiple of shorter object length
Which absolutely makes sense to me, because I assume R tries to use the whole vectors extrema$minbefore and extrema$minafter at once, which obviously doesn't work.
What would be the right way to do it?
Thanks in advance.
Dominik.
You can use the apply function to do this:
# Dummy data
em <- data.frame(Position = as.integer(runif(n = 30, min = 20, max = 190)))
# Calculate the SD of em$Position within each row's limits
extrema$SD <- apply(extrema[, c('minbefore', 'minafter')], 1, function(x) {
  sd(em[(em$Position > x[1]) & (em$Position < x[2]), 'Position'])
})
print(extrema)
  peak value minbefore minafter        SD
1  max    42        20       50  5.966574
2  max   105        50      115 19.07878
3  max   170       115      180 18.407426
Explanation:
We traverse each row of extrema and get the lower and upper limits.
Using those limits, we subset em$Position and calculate the SD.
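As an aside, an equivalent way to write this (just an alternative sketch, not part of the original answer) is to loop over the two limit columns in parallel with mapply():
# Alternative: iterate over the two limit columns in parallel
extrema$SD <- mapply(function(lo, hi) {
  sd(em$Position[em$Position > lo & em$Position < hi])
}, extrema$minbefore, extrema$minafter)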

R: Assign treatment based on groups with different levels of correlations

I want to assign a (non-random) treatment variable based on different correlation levels with a group variable.
E.g. for the following data table:
library(data.table)
set.seed(123)
dt <- data.table(id=1:1000, group=sample(c(1:4), 1000, replace=T))
I want to assign a treatment (0/1) to the groups, where individuals in e.g. group 4 have a higher probability of getting the treatment. I also want to be able to change the strength of the correlation for the assignment.
I know that the trtObserve() function from the simstudy package assigns treatment depending on other variables.
E.g. one can set a formula to assign probabilities for the assignment:
library(simstudy)
formula1 <- c("0.1 + 0.1*group")
dtExp <- trtObserve(dt, formulas = formula1, logit.link = TRUE, grpName = "treatment")
table(dtExp$treatment, dtExp$group)
       1   2   3   4
  0  119 120  82  85
  1  128 140 163 163
But I do not get how the assignment takes place based on the formula.
Can someone explain how the formula is incorporated in the assignment or give an alternative solution to my problem?
Thanks a lot!
You can make group 4 more likely by using the prob argument of the sample() function. You can do this like so:
set.seed(123)
dt <- data.table(id=1:1000, group=sample(1:4, size=1000, prob = c(0.1,.2,.2,.5), replace = T))
dt$treatment <- sample(c(0,1), 1000, replace = TRUE)
table(dt$group, dt$treatment)
#        0    1
#   1   46   43
#   2   93  107
#   3   93  105
#   4  269  244
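On the part of the question about how the formula is used: as I understand trtObserve(), with logit.link = TRUE the formula gives the log-odds of treatment, so each individual's probability is plogis(0.1 + 0.1*group). A minimal hand-rolled sketch of the same idea in base R (my suggestion, not part of the original answer), where the coefficient on group controls how strongly treatment is tied to group, could look like this:
library(data.table)
set.seed(123)
dt <- data.table(id = 1:1000, group = sample(1:4, 1000, replace = TRUE))
beta_group <- 0.1                                # raise this for a stronger group/treatment association
p_treat <- plogis(0.1 + beta_group * dt$group)   # logit link: log-odds -> probability
dt$treatment <- rbinom(nrow(dt), size = 1, prob = p_treat)
table(dt$treatment, dt$group)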

How to simulate MAR missing data in R?

I would like to simulate some missing data in R but am having trouble. I have created two variables ('pre' and 'post') that represent a measurement for the same individual pre- and post-treatment (i.e. paired data). I have been able to do it for data that is Missing Completely at Random (MCAR) - see below, but am unable to figure out how to code it for Missing at Random (MAR). For the MAR missing data, I would like to create 3 categories based on the pre-treatment observations that will determine how many of the post-treatment observations are missing. i.e.
For pre > 25, 40% post missing
For pre > 21 and ≤ 25, 30% post missing
For pre ≤ 21, 20% post missing
Can anyone help out? (I'd be really grateful!)
Thanks
library(mvtnorm)  # for rmvnorm()
set.seed(80122)
n <- 1000
# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7.
data <- rmvnorm(n, mean = c(28, 18),
                sigma = matrix(c(9, 0.7*sqrt(81), 0.7*sqrt(81), 9), 2, 2)) # Covariance matrix
# Split into pre and post treatment and check correlation is what was specified
pre <- data[, 1]
post <- data[, 2]
cor.test(pre,post)
# Simulate MCAR
mcar <- 1 - rbinom(n, 1, 0.2) # Will create ~20% zeros, which we'll convert to NAs
post_mcar <- post
post_mcar[mcar == 0] <- mcar[mcar == 0] # Replace post data with the zeros from the mcar vector
post_mcar[mcar == 0] <- NA # Change zeros to NAs
This is an old question, but I thought I'd take a crack at it.
Simulate fake data as in the OP:
library(tidyverse)
library(mvtnorm)
# Number of data values
n <- 1000
# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7.
set.seed(80122)
data <- rmvnorm(n, mean=c(28,18),
sigma=matrix(c(9,0.7*sqrt(81),0.7*sqrt(81),9),2,2)) # Covariance matrix
Convert to data frame:
data = as.data.frame(data)
names(data) = c("pre", "post")
Simulate missing completely at random (MCAR) data:
data$post_mcar <- data$post
set.seed(2)
data$post_mcar[sample(1:nrow(data), 0.2*nrow(data))] = NA
Simulate missing at random (MAR) data: First, we'll create a grouping variable, frac, whose value is the fraction of the group that we want to set to missing. We'll use the cut function to create these groups and set the label values, then we'll convert the labels to numeric for later use:
data = data %>%
  mutate(post_mar = post,
         frac = as.numeric(as.character(cut(pre, breaks = c(-Inf, 21, 25, Inf),
                                            labels = c(0.2, 0.3, 0.4)))))
Now, group by frac and set a randomly selected fraction of the values to NA, using frac to determine the fraction of values set to NA.
set.seed(3)
data = data %>%
  group_by(frac) %>%
  mutate(post_mar = replace(post_mar,
                            row_number(post_mar) %in% sample(1:n(), round(unique(frac) * n())),
                            NA)) %>%
  ungroup()
Here are the last 6 rows of the resulting data frame:
          pre     post post_mcar post_mar frac
995  28.63476 19.35081  19.35081 19.35081  0.4
996  32.86278 24.16119        NA       NA  0.4
997  28.25965 16.64538  16.64538 16.64538  0.4
998  24.35255 17.80365  17.80365 17.80365  0.3
999  28.12426 18.25222  18.25222       NA  0.4
1000 27.55075 14.47757  14.47757 14.47757  0.4
Here's a check on the fraction of values missing in each group. Note that the actual percentage of values set to missing can differ from frac if the requested percentage doesn't result in an integer number of rows. Here, for example, there's no way to select 20% of 8 values. It can be 12.5% (1 value) or 25% (2 values).
data %>% group_by(frac) %>%
summarise(N=n(),
N_missing=sum(is.na(post_mar)),
Frac_missing=N_missing/N)
frac N N_missing Frac_missing
1 0.2 8 2 0.2500000
2 0.3 138 41 0.2971014
3 0.4 854 342 0.4004684
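If an exact fraction per group is not essential, a shorter variant (sketched here as an aside, not part of the original answer) is to set each post value to missing independently with probability frac, which sidesteps the rounding issue described above:
# Bernoulli-missingness variant: each post value becomes NA with probability frac
set.seed(4)
data <- data %>%
  mutate(post_mar2 = replace(post, rbinom(n(), 1, frac) == 1, NA))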

Grouping in R changes mean substantially

I have a file containing the predictions of two models (A and B) on a binary classification problem. Now I'd like to understand how good they are at predicting the observations they are most confident about. To do that, I want to group their predictions into 10 groups based on how confident they are. Each of these groups should have an identical number of observations. However, when I do that, the accuracy of the models changes substantially! How can that be?
I've also tested with n_groups=100, but it only makes a minor difference. The CSV file is here and the code is below:
# Grouping observations
conf <- read.table(file="conf.csv", sep=',', header=T)
n_groups <- 10
conf$model_a_conf <- pmax(conf$model_a_pred_0, conf$model_a_pred_1)
conf$model_b_conf <- pmax(conf$model_b_pred_0, conf$model_b_pred_1)
conf$conf_group_model_a <- cut(conf$model_a_conf, n_groups, labels=FALSE, ordered_result=TRUE)
conf$conf_group_model_b <- cut(conf$model_b_conf, n_groups, labels=FALSE, ordered_result=TRUE)
# Test of original mean.
mean(conf$model_a_acc) # 0.78
mean(conf$model_b_acc) # 0.777
# Test for mean in aggregated data. They should be similar.
(acc_model_a <- mean(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN=mean))) # 0.8491
(acc_model_b <- mean(tapply(conf$model_b_acc, conf$conf_group_model_b, FUN=mean))) # 0.7526
Edited to clarify slightly.
table(conf$conf_group_model_a)
   1    2    3    4    5    6    7    8    9   10
2515 2628 2471 2128 1792 1321  980  627  398  140
The groups you are using are unbalanced. Taking the mean of each of those groups with tapply is fine, but simply averaging those group means afterwards is not the way to go.
You need to weight the group means by group size if you want to follow the process you have.
Something like this is quick and dirty:
sum(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN=mean) * table(conf$conf_group_model_a) / nrow(conf))
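Note also that cut() with a single number produces equal-width intervals, not equal-count groups, which is why the counts above vary so much. If the goal really is ten groups of (nearly) identical size, one sketch using dplyr::ntile() (my suggestion, not part of the original answer) would be:
library(dplyr)
# Equal-count confidence bins: ntile() splits into groups of (almost) equal size
conf$conf_group_model_a <- ntile(conf$model_a_conf, n_groups)
conf$conf_group_model_b <- ntile(conf$model_b_conf, n_groups)
# With balanced groups, the unweighted mean of the per-group means
# is close to the overall mean again
mean(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN = mean))
mean(tapply(conf$model_b_acc, conf$conf_group_model_b, FUN = mean))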
