Simulate unbalanced clustered data in R

I want to simulate some unbalanced clustered data. The number of clusters is 20 and the average number of observations per cluster is 30. To create the imbalance, I would like to first generate each cluster with 10% more observations than specified (i.e., 33 rather than 30), and then randomly exclude the appropriate number of observations (i.e., 60 in total) to arrive back at the specified average of 30 observations per cluster. The probability of excluding an observation should not be uniform across clusters (i.e., some clusters have no cases removed and others have several excluded), so in the end I still have 600 observations in total. Does anyone know how to do this in R? Here is a smaller example dataset. The number of observations per cluster doesn't follow the conditions specified above; I just used it to convey the idea.
> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> df <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
> df
id cluster x y
1 1 1 0.30003855 0.65325768
2 2 1 -1.00563626 -0.12270866
3 3 1 0.01925927 -0.41367651
4 4 1 -1.07742065 -2.64314895
5 5 1 0.71270333 -0.09294102
6 1 2 1.08477509 0.43028470
7 2 2 -2.22498770 0.53539884
8 3 2 1.23569346 -0.55527835
9 4 2 -1.24104450 1.77950291
10 5 2 0.45476927 0.28642442
11 1 3 0.65990264 0.12631586
12 2 3 -0.19988983 1.27226678
13 3 3 -0.64511396 -0.71846622
14 4 3 0.16532102 -0.45033862
15 5 3 0.43881870 2.39745248
16 1 4 0.88330282 0.01112919
17 2 4 -2.05233698 1.63356842
18 3 4 -1.63637927 -1.43850664
19 4 4 1.43040234 -0.19051680
20 5 4 1.04662885 0.37842390
After randomly adding and deleting some observations, the unbalanced dataset looks like this:
id cluster x y
1 1 1 0.895 -0.659
2 2 1 -0.160 -0.366
3 1 2 -0.528 -0.294
4 2 2 -0.919 0.362
5 3 2 -0.901 -0.467
6 1 3 0.275 0.134
7 2 3 0.423 0.534
8 3 3 0.929 -0.953
9 4 3 1.67 0.668
10 5 3 0.286 0.0872
11 1 4 -0.373 -0.109
12 2 4 0.289 0.299
13 3 4 -1.43 -0.677
14 4 4 -0.884 1.70
15 5 4 1.12 0.386
16 1 5 -0.723 0.247
17 2 5 0.463 -2.59
18 3 5 0.234 0.893
19 4 5 -0.313 -1.96
20 5 5 0.848 -0.0613
EDIT
This part of the problem is solved (credit goes to jay.sf). Next, I want to repeat this process 1000 times and run a regression on each generated dataset. However, I don't want to run the regression on the whole dataset but rather on some randomly selected clusters (e.g., using df[unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = TRUE), ]). In the end, I would like to get confidence intervals from those 1000 regressions. How should I proceed?
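As a minimal sketch of such a cluster bootstrap, assuming a data frame df with columns y, x, and cluster (e.g., as produced by the FUN() generator in the answer below) and a simple linear model y ~ x; the percentile interval at the end is one common way to turn the 1000 slopes into a confidence interval:
set.seed(42)
df <- FUN()  ## generator defined in the answer below
## index rows by cluster so whole clusters can be resampled
cluster <- split(seq_len(nrow(df)), df$cluster)
k <- length(cluster)
slopes <- replicate(1000, {
  ## draw k clusters with replacement and stack their row indices
  idx <- unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = FALSE)
  coef(lm(y ~ x, data = df[idx, ]))[["x"]]
})
quantile(slopes, c(0.025, 0.975))  ## 95% percentile confidence interval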

As per Ben Bolker's request, I am posting my solution, but see jay.sf's answer for a more generalizable approach.
# First create an oversampled dataset:
y <- rnorm(24)
x <- rnorm(24)
z <- rep(1:6, 4)
w <- rep(1:4, each=6)
df <- data.frame(id=z, cluster=w, x=x, y=y)
# Then use slice_sample() to downsample to the desired size
library(dplyr)
df %>% slice_sample(n = 20) %>%
  arrange(cluster)
# Or use base R
a <- df[sample(nrow(df), 20), ]
df2 <- a[order(a$cluster), ]

Let ncl be the desired number of clusters. We may generate a sampling space S, which is a sequence spanning a tolerance tol around the mean number of observations per cluster mnobs. From that we repeatedly draw a random sample of size 1 for each cluster to obtain a list of clusters CL. If the sum of the cluster lengths meets ncl*mnobs we break the loop, add random data to the clusters, and rbind the result.
FUN <- function(ncl=20, mnobs=30, tol=.1) {
  ## sampling space: cluster sizes within +/- tol around mnobs
  S <- do.call(seq.int, as.list(mnobs*(1 + tol*c(-1, 1))))
  repeat({
    ## draw a random size for each cluster until the total matches exactly
    CL <- lapply(1:ncl, function(x) rep(x, sample(S, 1, replace=TRUE)))
    if (sum(lengths(CL)) == ncl*mnobs) break
  })
  ## attach ids and random x, y data to each cluster
  L <- lapply(seq.int(CL), function(i) {
    id <- seq.int(CL[[i]])
    cbind(id, cluster=i,
          matrix(rnorm(max(id)*2), , 2, dimnames=list(NULL, c("x", "y"))))
  })
  do.call(rbind.data.frame, L)
}
Usage
set.seed(42)
res <- FUN()  ## using the default arguments
dim(res)
# [1] 600 4
(res.tab <- table(res$cluster))
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 29 29 31 31 30 32 31 30 32 28 28 27 28 31 32 33 31 30 27 30
table(res.tab)
# 27 28 29 30 31 32 33
# 2 3 2 4 5 3 1
sapply(c("mean", "sd"), function(x) do.call(x, list(res.tab)))
# mean sd
# 30.000000 1.747178
Displayable example
set.seed(42)
FUN(4, 5, tol=.3) ## tol needs to be adjusted for smaller samples
# id cluster x y
# 1 1 1 1.51152200 -0.0627141
# 2 2 1 -0.09465904 1.3048697
# 3 3 1 2.01842371 2.2866454
# 4 1 2 -1.38886070 -2.4404669
# 5 2 2 -0.27878877 1.3201133
# 6 3 2 -0.13332134 -0.3066386
# 7 4 2 0.63595040 -1.7813084
# 8 5 2 -0.28425292 -0.1719174
# 9 6 2 -2.65645542 1.2146747
# 10 1 3 1.89519346 -0.6399949
# 11 2 3 -0.43046913 0.4554501
# 12 3 3 -0.25726938 0.7048373
# 13 4 3 -1.76316309 1.0351035
# 14 5 3 0.46009735 -0.6089264
# 15 1 4 0.50495512 0.2059986
# 16 2 4 -1.71700868 -0.3610573
# 17 3 4 -0.78445901 0.7581632
# 18 4 4 -0.85090759 -0.7267048
# 19 5 4 -2.41420765 -1.3682810
# 20 6 4 0.03612261 0.4328180

Related

Average of distance by Id in R

I have a dataset with two groups of subjects, Group A, Group B like this.
Id Group Age
1 A 17
2 A 14
3 A 10
4 A 17
5 A 12
6 A 6
7 A 18
8 A 7
9 B 18
9 B 13
10 B 6
10 B 12
11 B 16
11 B 17
12 B 11
12 B 18
The subjects in Group A are unique, with one row per subject. The subjects in Group B are not unique; there are two (or in some cases three) rows of observations per subject in Group B, for example Id 9, Id 10, etc.
What I am trying to do is
a) estimate the average distance of subjects in GroupB to everyone in Group A. Using Age to estimate the distance.
b) estimate the distance of subjects in GroupB to the mode of subjects in Group A. Using Age to estimate the mode in Group A and Age in Group B to estimate the distance from the mode.
Expecting a dataset like this.
ID Group Age AvDistance DistanceToMedian
1 A 17 NA NA
2 A 14 NA NA
3 A 10 NA NA
4 A 17 NA NA
5 A 12 NA NA
6 A 6 NA NA
7 A 18 NA NA
8 A 7 NA NA
9 B 18 5.375 2.11
9 B 13 3.875 2.88
10 B 6 ... ...
10 B 12 ... ...
11 B 16 ... ...
11 B 17 ... ...
12 B 11 ... ...
12 B 18 ... ...
I can do this manually; any suggestions on how to make this more efficient are much appreciated. Thanks.
# Estimate the average distance of an Id in Group B to all subjects in Group A
(sqrt((17-18)^2) + sqrt((14-18)^2) + sqrt((10-18)^2) + sqrt((17-18)^2) + sqrt((12-18)^2) + sqrt((6-18)^2) + sqrt((18-18)^2) + sqrt((7-18)^2))/8 = 5.375
(sqrt((17-13)^2) + sqrt((14-13)^2) + sqrt((10-13)^2) + sqrt((17-13)^2) + sqrt((12-13)^2) + sqrt((6-13)^2) + sqrt((18-13)^2) + sqrt((7-13)^2))/8 = 3.875
estimate_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}
# Estimate Mode for Age in Group A
x <- c(17, 14, 10, 17, 12, 6, 18, 7)
estimate_mode(x)
m1 <- estimate_mode(x)
# Estimate the distance from the mode
sqrt((18 - m1)^2) = 2.11
sqrt((13 - m1)^2) = 2.88
This will be easier with a unique row ID, so I'll create one:
library(dplyr)
library(tibble)
df = df %>%
  mutate(rownum = paste0("row", row_number()))
ages = setNames(df$Age, df$rownum)
## make a distance matrix
dist = outer(ages[df$Group == "B"], ages[df$Group == "A"], FUN = \(x, y) abs(x - y))
## calculate average distances
av_dist = data.frame(AvDist = rowMeans(dist)) %>% rownames_to_column("rownum")
## calculate median age for A
med_a = median(ages[df$Group == "A"])
## add back to original data
df %>%
  left_join(av_dist, by = "rownum") %>%
  mutate(DistanceToMedian = ifelse(Group == "B", abs(Age - med_a), NA))
# Id Group Age rownum AvDist DistanceToMedian
# 1 1 A 17 row1 NA NA
# 2 2 A 14 row2 NA NA
# 3 3 A 10 row3 NA NA
# 4 4 A 17 row4 NA NA
# 5 5 A 12 row5 NA NA
# 6 6 A 6 row6 NA NA
# 7 7 A 18 row7 NA NA
# 8 8 A 7 row8 NA NA
# 9 9 B 18 row9 5.375 5
# 10 9 B 13 row10 3.875 0
# 11 10 B 6 row11 6.625 7
# 12 10 B 12 row12 3.875 1
# 13 11 B 16 row13 4.375 3
# 14 11 B 17 row14 4.625 4
# 15 12 B 11 row15 4.125 2
# 16 12 B 18 row16 5.375 5
I used median, not mode, because I was looking at your column names, but you can easily swap in your mode instead.
Using this sample data:
df = read.table(text = 'Id Group Age
1 A 17
2 A 14
3 A 10
4 A 17
5 A 12
6 A 6
7 A 18
8 A 7
9 B 18
9 B 13
10 B 6
10 B 12
11 B 16
11 B 17
12 B 11
12 B 18', header = T)
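If the mode is wanted instead of the median, a minimal sketch of that swap, reusing the estimate_mode() helper from the question (the mode_a name is just for illustration):
## swap the median for the question's density-based mode
mode_a = estimate_mode(df$Age[df$Group == "A"])
df %>%
  left_join(av_dist, by = "rownum") %>%
  mutate(DistanceToMode = ifelse(Group == "B", abs(Age - mode_a), NA))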

How to insert a row which calculates the average of the rows above it?

I am looking to separate rows of data by Cue and to add a row that calculates the averages for each Cue group. Here is an example:
Before:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379
After:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 0.67978 0.51071 0.31723
4 4 22 0.26855 0.17487 0.22461
5 4 20 0.15106 0.48767 0.49072
6 0.209 0.331 0.357
7 7 18 0.11627 0.12604 0.2832
8 7 24 0.50201 0.14252 0.21454
9 0.309 0.134 0.248
10 12 16 0.27649 0.96008 0.42114
11 12 18 0.60852 0.21637 0.18799
12 0.442 0.588 0.304
13 22 20 0.32867 0.65308 0.29388
14 22 24 0.25726 0.37048 0.32379
15 0.292 0.511 0.308
So in the "after" example, line 3 is the average of lines 1 and 2 (line 6 is the average of lines 4 and 5, etc...).
Any help/information would be greatly appreciated!
Thank you!
You can use base R to do something like this (where data is your original data frame):
Reduce(rbind, by(data, data[1], function(x) rbind(x, c(NA, NA, colMeans(x[-(1:2)])))))
Cue ITI a b c
1 0 16 0.820620 0.521850 0.276790
2 0 24 0.538940 0.499570 0.357670
3 NA NA 0.679780 0.510710 0.317230
32 4 22 0.268550 0.174870 0.224610
4 4 20 0.151060 0.487670 0.490720
31 NA NA 0.209805 0.331270 0.357665
5 7 18 0.116270 0.126040 0.283200
6 7 24 0.502010 0.142520 0.214540
33 NA NA 0.309140 0.134280 0.248870
7 12 16 0.276490 0.960080 0.421140
8 12 18 0.608520 0.216370 0.187990
34 NA NA 0.442505 0.588225 0.304565
9 22 20 0.328670 0.653080 0.293880
10 22 24 0.257260 0.370480 0.323790
35 NA NA 0.292965 0.511780 0.308835
Here is one idea. Split the data frame, perform the analysis, and then combine them together.
DF_list <- split(DF, f = DF$Cue)
DF_list2 <- lapply(DF_list, function(x){
df_temp <- as.data.frame(t(colMeans(x[, -c(1, 2)])))
df_temp[, c("Cue", "ITI")] <- NA
df <- rbind(x, df_temp)
return(df)
})
DF2 <- do.call(rbind, DF_list2)
rownames(DF2) <- 1:nrow(DF2)
DF2
# Cue ITI a b c
# 1 0 16 0.820620 0.521850 0.276790
# 2 0 24 0.538940 0.499570 0.357670
# 3 NA NA 0.679780 0.510710 0.317230
# 4 4 22 0.268550 0.174870 0.224610
# 5 4 20 0.151060 0.487670 0.490720
# 6 NA NA 0.209805 0.331270 0.357665
# 7 7 18 0.116270 0.126040 0.283200
# 8 7 24 0.502010 0.142520 0.214540
# 9 NA NA 0.309140 0.134280 0.248870
# 10 12 16 0.276490 0.960080 0.421140
# 11 12 18 0.608520 0.216370 0.187990
# 12 NA NA 0.442505 0.588225 0.304565
# 13 22 20 0.328670 0.653080 0.293880
# 14 22 24 0.257260 0.370480 0.323790
# 15 NA NA 0.292965 0.511780 0.308835
DATA
DF <- read.table(text = " Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379", header = TRUE)
A data.table approach, but if someone can offer some improvements I'd be keen to hear them (see the sketch after this answer).
library(data.table)
dt <- data.table(df)
dt2 <- dt[, lapply(.SD, mean), by = Cue][,ITI := NA][]
data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
Cue ITI a b c
1: 0 16 0.820620 0.521850 0.276790
2: 0 24 0.538940 0.499570 0.357670
3: NA NA 0.679780 0.510710 0.317230
4: 4 22 0.268550 0.174870 0.224610
5: 4 20 0.151060 0.487670 0.490720
6: NA NA 0.209805 0.331270 0.357665
If you want to leave the Cue values as-is to confirm group, just drop the [is.na(ITI), Cue := NA] from the last line.
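Since improvements were invited, here is one possible refinement (a sketch, assuming a reasonably recent data.table version): rbindlist() matches columns by name and avoids wrapping rbind() in an extra data.table() call.
dt2 <- dt[, lapply(.SD, mean), by = Cue][, ITI := NA][]
## stack the original rows and the group means, then sort and blank out Cue
res <- rbindlist(list(dt, dt2), use.names = TRUE)[order(Cue)]
res[is.na(ITI), Cue := NA][]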
I would use group_by and summarise from the dplyr package to get a data frame with the average values, then rbind the new data frame with the old one and sort by Cue:
df_averages <- df_orig %>%
  group_by(Cue) %>%
  summarise(ITI = NA, a = mean(a), b = mean(b), c = mean(c)) %>%
  ungroup()
df_all <- rbind(df_orig, df_averages) %>%
  arrange(Cue)

How to extract a sample of pairs in a grouping variable

My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went. Because of this, I want to extract a sample of n pairs of cases that are grouped together by variable y and n pairs of cases that are not grouped together by variable y, in order to calculate the number of false positives and false negatives (falsely grouped or falsely separated). How do I extract a sample of grouped pairs and a sample of not-grouped pairs?
I would like the samples to look like this (for n = 6):
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you'd like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...).
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
function(x) if (nrow(x) > 1) x[sample(min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
df.ungrouped <- df[sample(nrow(df), nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
Explanation: Split df based on y, then draw min(n, nrow(x)) samples from each subset x containing more than one row; rbinding the results gives the grouped df.grouped. We then draw nrow(df.grouped) rows at random from df to produce the ungrouped df.ungrouped.
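If the "not-grouped sample" is meant to be pairs of rows that do not share a y value, here is a small sketch of drawing such pairs explicitly (the helper not_grouped_pair() is hypothetical, not part of the answer above):
## draw one pair of rows whose y values differ
not_grouped_pair <- function(df) {
  repeat {
    idx <- sample(nrow(df), 2)
    if (df$y[idx[1]] != df$y[idx[2]]) return(df[idx, ])
  }
}
set.seed(2017)
do.call(rbind, replicate(3, not_grouped_pair(df), simplify = FALSE))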
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)

Counting rows based on column values in R

I have a dataframe df
Reads Counts
aaaa 10
bbbb 20
cccc 25
and so on.
I want to calculate the number of reads with Counts greater than or equal to a certain value and plot that. For example, I want a data frame that looks like this:
Counts>= #reads with Counts>=
1 3
2 3
3 3
11 2
20 2
21 1
and so on. Can you suggest how I can get such a data frame and plot it?
Given the levels you want to plot at...
cutoffs <- 1:30
... you could do something like:
data.frame(cutoff=cutoffs, num.above=Reduce("+", lapply(df$Counts, ">=", cutoffs)))
# cutoff num.above
# 1 1 3
# 2 2 3
# 3 3 3
# 4 4 3
# 5 5 3
# 6 6 3
# 7 7 3
# 8 8 3
# 9 9 3
# 10 10 3
# 11 11 2
# 12 12 2
# 13 13 2
# 14 14 2
# 15 15 2
# 16 16 2
# 17 17 2
# 18 18 2
# 19 19 2
# 20 20 2
# 21 21 1
# 22 22 1
# 23 23 1
# 24 24 1
# 25 25 1
# 26 26 0
# 27 27 0
# 28 28 0
# 29 29 0
# 30 30 0
Basically for each value in the original data frame you compute a vector of whether it's greater than or equal to each cutoff (using lapply with >=). Then you add them up (using Reduce with +), getting the total number greater than or equal to each cutoff.
Another option would be using outer/colSums:
cutoffs <- 1:30
data.frame(cutoff=cutoffs, num.above=colSums(outer(df$Counts, cutoffs, ">=")))
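Since the question also asks for a plot, a minimal base-R sketch, assuming the result of either approach is stored in a data frame named res:
res <- data.frame(cutoff = cutoffs,
                  num.above = colSums(outer(df$Counts, cutoffs, ">=")))
## step plot of the number of reads at or above each cutoff
plot(res$cutoff, res$num.above, type = "s",
     xlab = "Counts >=", ylab = "# reads with Counts >=")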

t-test in R by specific factors

I have a data set with a few variables:
X is a numeric variable; Y and Z are factor variables with two levels each (Y = 1, 2; Z = 3, 4).
x y z
1 -0.59131983 1 3
2 1.51800178 1 3
3 0.03079412 1 3
4 -0.43881764 1 3
5 -1.44914000 1 3
6 -1.33483914 1 4
7 0.25612595 1 4
8 0.12606742 1 4
9 0.44735965 1 4
10 1.83294817 1 4
11 -0.59131983 2 3
12 1.51800178 2 3
13 0.03079412 2 3
14 -0.43881764 2 3
15 -1.44914000 2 3
16 -1.33483914 2 4
17 0.25612595 2 4
18 0.12606742 2 4
19 0.44735965 2 4
20 1.83294817 2 4
A t-test is easy to perform if my factor variable is Y (t.test(X~Y)), but I am not sure how to do a t-test which would compare, for example, only the X values for Y==2 between the levels of Z (3 and 4).
I am not sure if I expressed myself correctly, so it might be easier to see it in the table. So, I would like to do a t-test for X, where the factor variable is Z and Y==2. How could I do this?
In STATA it is easy:
ttest var1 if var3==3, by(var2)
but I don't get it in R :(
x y z
11 -0.59131983 2 3
12 1.51800178 2 3
13 0.03079412 2 3
14 -0.43881764 2 3
15 -1.44914000 2 3
16 -1.33483914 2 4
17 0.25612595 2 4
18 0.12606742 2 4
19 0.44735965 2 4
20 1.83294817 2 4
If you read the t.test documentation in R (type ?t.test), you will see that for one-sample t-tests you shouldn't use the formula interface of the function:
The formula interface is only applicable for the 2-sample tests.
So, in your case you need to create a subset of your data.frame according to the conditions you specified like this:
df2 <- df[df$y==2 & df$z %in% c(3,4), ]
> df2
x y z
11 -0.59131983 2 3
12 1.51800178 2 3
13 0.03079412 2 3
14 -0.43881764 2 3
15 -1.44914000 2 3
16 -1.33483914 2 4
17 0.25612595 2 4
18 0.12606742 2 4
19 0.44735965 2 4
20 1.83294817 2 4
And then run the one-sample t-test using the following syntax:
> t.test(x=df2$x)
One Sample t-test
data: df2$x
t = 0.1171, df = 9, p-value = 0.9094
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.7275964 0.8070325
sample estimates:
mean of x
0.03971805
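That said, if the goal is the STATA-style comparison (ttest var1 if ..., by(var2)), i.e., comparing x between z==3 and z==4 within y==2, the two-sample formula interface can be applied directly to the subset; a sketch:
## two-sample t-test of x between the levels of z, restricted to y == 2
t.test(x ~ z, data = df[df$y == 2, ])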
