Calculating Confidence Intervals for two datasets - r

This is quite the number of questions I've asked today.
I'd like to calculate the confidence interval (99% level, not 95%) for the mean of the variable age in two data frames, infert_control and infert_patient, where:
infert_control = subset(infert$age, infert$case == 0)
infert_patient = subset(infert$age, infert$case == 1)
infert is a built-in R dataset; for those not familiar with it, here it is. case 0 denotes the control-group patients, case 1 the actual patients.
> infert
    education age parity induced case spontaneous stratum pooled.stratum
1      0-5yrs  26      6       1    1           2       1              3
2      0-5yrs  42      1       1    1           0       2              1
3      0-5yrs  39      6       2    1           0       3              4
4      0-5yrs  34      4       2    1           0       4              2
5     6-11yrs  35      3       1    1           1       5             32
6     6-11yrs  36      4       2    1           1       6             36
7     6-11yrs  23      1       0    1           0       7              6
8     6-11yrs  32      2       0    1           0       8             22
9     6-11yrs  21      1       0    1           1       9              5
10    6-11yrs  28      2       0    1           0      10             19
11    6-11yrs  29      2       1    1           0      11             20
...
239   12+ yrs  38      6       0    0           2      74             63
240   12+ yrs  26      2       1    0           1      75             49
241   12+ yrs  31      1       1    0           0      76             45
242   12+ yrs  31      2       0    0           1      77             53
243   12+ yrs  25      1       0    0           1      78             41
244   12+ yrs  31      1       0    0           1      79             45
245   12+ yrs  34      1       0    0           0      80             47
246   12+ yrs  35      2       2    0           0      81             54
247   12+ yrs  29      1       0    0           1      82             43
248   12+ yrs  23      1       0    0           1      83             40
What would be the correct way to solve this?
I've already calculated the mean of the age column for both infert_control and infert_patient, plus the standard deviation of each subset.
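(For reference, those values come straight from the subsets defined above, e.g.:
mean(infert_control); sd(infert_control)  # mean and standard deviation of age, control group
mean(infert_patient); sd(infert_patient)  # mean and standard deviation of age, patient group
)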

You could use bootstrap for this:
library(boot)
set.seed(42)
boot_mean <- boot(infert_control, function(x, i) mean(x[i]), R = 1e4)  # bootstrap the sample mean
quantile(boot_mean$t, probs = c(0.005, 0.995))  # central 99% of the resampled means
# 0.5% 99.5%
# 30.47273 32.58182
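The boot package can also compute the interval directly from the same object via boot.ci(); the percentile version should land close to the quantiles above:
boot.ci(boot_mean, conf = 0.99, type = "perc")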
Or if you don't want to use a library:
set.seed(42)
R <- 1e4
boot_mean <- colMeans(
  matrix(
    sample(infert_control, R * length(infert_control), TRUE),
    ncol = R))
quantile(boot_mean, probs=c(0.005, 0.995))
#     0.5%    99.5%
# 30.42424 32.55152
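The same approach works for the patient group (a sketch reusing R from above; the numbers will differ slightly from run to run unless the seed is reset):
boot_mean_patient <- colMeans(
  matrix(
    sample(infert_patient, R * length(infert_patient), TRUE),
    ncol = R))
quantile(boot_mean_patient, probs = c(0.005, 0.995))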

So many answers...
The standardized mean of a random sample follows a t-distribution, not a normal one, although t -> N as df -> Inf.
cl <- function(data, p) {
  n <- length(data)
  cl <- qt(p/2, n - 1, lower.tail = FALSE) * sd(data)/sqrt(n)
  m <- mean(data)
  return(c(lower = m - cl, upper = m + cl))
}
cl.control <- cl(infert_control,0.01)
cl.control
# lower upper
# 30.42493 32.55689
cl.patient <- cl(infert_patient,0.01)
cl.patient
# lower upper
# 30.00221 33.05803
aggregate(age~case,data=infert,cl,p=0.01) # much better way...
# case age.lower age.upper
# 1 0 30.42493 32.55689
# 2 1 30.00221 33.05803
Also, the quantile functions (e.g. qt(...) and qnorm(...)) return the lower tail by default, so your limits would be reversed unless you set lower.tail=FALSE.
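A quick illustration of that point (the degrees of freedom here are arbitrary):
qt(0.005, df = 100)                      # negative: lower-tail quantile
qt(0.005, df = 100, lower.tail = FALSE)  # same magnitude, positive: upper-tail quantile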

You could easily calculate the confidence interval manually:
infert_control <- subset(infert$age, infert$case == 0)
# calculate needed values
m <- mean(infert_control)
s <- sd(infert_control)
n <- length(infert_control)
# calculate error for the normal distribution (choose your distribution here, e.g. qt for the t-distribution)
a <- 0.995 # 99% CI => 0.5% on both sides
error <- qnorm(a)*s/sqrt(n)
# calculate CI
ci_lower <- m-error
ci_upper <- m+error
See also http://en.wikipedia.org/wiki/Confidence_interval (sorry for a wikipedia link, but it has a good explanation and shows you the formula)

... or as a small function:
cifun <- function(data, ALPHA){
  c(mean(data) - qnorm(1 - ALPHA/2) * sd(data)/sqrt(length(data)),
    mean(data) + qnorm(1 - ALPHA/2) * sd(data)/sqrt(length(data)))
}
cifun(infert_control, 0.01)
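If you prefer the t-distribution mentioned in the answer above, only qnorm needs swapping for qt; a small variant, not part of the original answer:
cifun_t <- function(data, ALPHA){
  mean(data) + c(-1, 1) * qt(1 - ALPHA/2, df = length(data) - 1) * sd(data)/sqrt(length(data))
}
cifun_t(infert_control, 0.01)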

Related

After imputation, how to round to nearest level of a factor

I have imputed missing values in column q1 of the following data frame. q1 is a factor with the levels 0, 20, 40, 60, 80, 100. I need to round column q1 to the nearest factor level; the values 49 and 91 need to be rounded. Does anyone have a solution for this problem? Thanks in advance!
> id <- rep(c(300,450), each=6)
> visit <- rep(1:6,2)
> trt <- rep(c(0,"A",0,"B",0,"C"),2)
> q1 <- c(0,100,0,89,0,60,0,85,0,40,0,20)
> df <- data.frame(id,visit,trt,q1)
> df
    id visit trt  q1
1  300     1   0   0
2  300     2   A 100
3  300     3   0   0
4  300     4   B  49
5  300     5   0   0
6  300     6   C  60
7  450     1   0   0
8  450     2   A  91
9  450     3   0   0
10 450     4   B  40
11 450     5   0   0
12 450     6   C  20
Maybe you could use plyr's round_any function here, which rounds to the nearest multiple of 20 in this case.
plyr::round_any(df$q1, 20)
#[1] 0 100 0 80 0 60 0 80 0 40 0 20
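If you'd rather not depend on plyr, the same rounding is plain arithmetic (assuming q1 is numeric, as it is in the df constructed above):
round(df$q1 / 20) * 20  # round to the nearest multiple of 20; should match the round_any output above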

dplyr: average over a range based on first occurrence in a different column

I would like to use dplyr and mutate to create a new variable that is either 0 or the average of the values in column y, conditional on a range taken from column z.
For the column z range, I would like to use the first time z >= 90 as the maximum of the range and the occurrence of z == 31 immediately before that as the minimum of the range.
Note: I will be grouping by column x.
For example:
x   y  z
1 100  0
1  90  0
1  90 31
1  90 60
1  80 31
1  75 60
1  60 90
1  60 60
2  60  0
2  60 30
I would like to average y over this range:
x  y  z
1 80 31
1 75 60
1 60 90
so I would end up with the value 71.7 (I don't care about rounding).
x   y  z  ave
1 100  0    0
1  90  0    0
1  90 31    0
1  90 60    0
1  80 31 71.7
1  75 60 71.7
1  60 90 71.7
1  60 60    0
2  60  0    0
2  60 30    0
We may do
library(dplyr)

df %>%
  group_by(x) %>%
  mutate(ave = {
    if (any(z >= 90)) {
      idxU <- which.max(z >= 90)                     # first row where z >= 90
      idxL <- max(which(z[1:idxU] == 31))            # last z == 31 before that row
      replace(z * 0, idxL:idxU, mean(y[idxL:idxU]))  # average y over the range
    } else {
      0
    }
  })
#     x   y  z      ave
# 1   1 100  0  0.00000
# 2   1  90  0  0.00000
# 3   1  90 31  0.00000
# 4   1  90 60  0.00000
# 5   1  80 31 71.66667
# 6   1  75 60 71.66667
# 7   1  60 90 71.66667
# 8   1  60 60  0.00000
# 9   2  60  0  0.00000
# 10  2  60 30  0.00000
So idxU is the upper limit of the range and idxL is the lower limit; in the last line we replace elements idxL:idxU of the zero vector z * 0 with the mean of y over that range.
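For a self-contained run, the example data from the question can be rebuilt like this (values copied from the table above) and then piped through the code:
df <- data.frame(
  x = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2),
  y = c(100, 90, 90, 90, 80, 75, 60, 60, 60, 60),
  z = c(0, 0, 31, 60, 31, 60, 90, 60, 0, 30)
)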

Count next n rows that meets a condition in R

Let's say I have a df that looks like this
ID X_Value
 1      40
 2      13
 3      75
 4      83
 5      64
 6      43
 7      74
 8      45
 9      54
10      84
What I would like to do is apply a rolling function that returns 1 if, among the actual and last 4 rows, there are 2 or more values higher than X (let's say 70 for this example), and 0 otherwise.
So the output would be something like the following:
ID X_Value Next_4_2
 1      40        0
 2      13        0
 3      75        0
 4      83        1
 5      64        1
 6      43        1
 7      74        1
 8      45        0
 9      54        0
10      84        1
I think this should be possible with a rolling function, but I have tried and am not sure how to do it. Thank you in advance.
Given your expected output, I suppose you meant "in the actual and previous 3 rows". Then using some rolling function indeed does the job:
library(zoo)
thr1 <- 70
thr2 <- 2
last <- 3 + 1
df$Next_4_2 <- 1 * (rollsum(df$X_Value > thr1, last, align = "right", fill = 0) >= thr2)
df
#    ID X_Value Next_4_2
# 1   1      40        0
# 2   2      13        0
# 3   3      75        0
# 4   4      83        1
# 5   5      64        1
# 6   6      43        1
# 7   7      74        1
# 8   8      45        0
# 9   9      54        0
# 10 10      84        1
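For completeness, the data frame used above can be rebuilt from the question's values:
df <- data.frame(ID = 1:10,
                 X_Value = c(40, 13, 75, 83, 64, 43, 74, 45, 54, 84))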
The indexing using max(1, i-3) is perhaps the only part of the code worth remembering. It might help in later constructions where a for-loop really is needed (here dat is the same data frame as df above).
dat$X_Next_4_2 <- integer(length(dat$X_Value))
dat$X_Next_4_2[1] <- 0
for (i in 2:length(dat$X_Value)) {
  # count values above 70 in the current row and the previous (up to) 3 rows
  dat$X_Next_4_2[i] <- as.integer(sum(dat$X_Value[max(1, i - 3):i] > 70) >= 2)
}
(Not very pretty and clearly inferior to the rollsum answer already posted.)

transform values in data frame, generate new values as 100 minus current value

I'm currently working on a script which will eventually plot the accumulation of losses from cell divisions. First I generate a matrix of values and then I count the number of times 0 occurs in each column; a 0 represents a loss.
However, I am now thinking that a nice plot would be a degradation curve. So, given the following example:
> losses_plot_data <- melt(full_losses_data, id=c("Divisions", "Accuracy"), value.name = "Losses", variable.name = "Size")
> full_losses_data
Divisions Accuracy 20 15 10  5  2
        1        0  0  0  0  3 25
        2        0  0  0  1 10 39
        3        0  0  1  3 17 48
        4        0  0  1  5 23 55
        5        0  1  3  8 29 60
        6        0  1  4 11 34 64
        7        0  2  5 13 38 67
        8        0  3  7 16 42 70
        9        0  4  9 19 45 72
       10        0  5 11 22 48 74
Is there a way I can easily turn this table into 100 minus the numbers shown? If I could plot that data instead of my current data, I would have a lovely degradation curve from 100% down to however many cells have been lost.
Assuming you do not want to do that for the first column:
fld <- full_losses_data
fld[, 2:ncol(fld)] <- 100 - fld[, -1]
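If Accuracy should stay untouched as well (it is all zeros in the example, so 100 minus it would become 100), the same idea can be restricted to the size columns; a sketch assuming the column names shown above:
fld <- full_losses_data
size_cols <- setdiff(names(fld), c("Divisions", "Accuracy"))
fld[size_cols] <- 100 - fld[size_cols]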

Combining 2 columns into 1 column many times in a very large dataset in R

The clumsy solutions I am working on are not going to be very fast if I can get them to work, and the true dataset is ~1500 x 45000, so they need to be fast. I am definitely at a loss for 1) at this point, although I have some code for 2) and 3).
Here is a toy example of the data structure:
n <- 10  # sample size implied by the length-10 vectors below
pop = data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
                 age = round(rnorm(n, mean=40, 10)), disType = rbinom(n, 1, .2),
                 rs123 = c(1,3,1,3,3,1,1,1,3,1), rs123.1 = rep(1, n),
                 rs157 = c(2,4,2,2,2,4,4,4,2,2), rs157.1 = c(4,4,4,2,4,4,4,4,2,2),
                 rs132 = c(4,4,4,4,4,4,4,4,2,2), rs132.1 = c(4,4,4,4,4,4,4,4,4,4))
Thus, there are a few columns of basic demographic info and then the rest of the columns are biallelic SNP info. For example, the column rs123 holds the first allele of SNP rs123 and rs123.1 holds its second allele.
1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example: rs123 and rs123.1 into one column (but within the dataset):
11
31
11
31
31
11
11
11
31
11
2) I need to identify the least frequent SNP value (in the above example it is 31).
3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.
Do you mean 'merge' or 'rearrange' or simply concatenate? If it is the latter then
R> pop2 <- data.frame(pop[,1:4], rs123=paste(pop[,5],pop[,6],sep=""),
+                     rs157=paste(pop[,7],pop[,8],sep=""),
+                     rs132=paste(pop[,9],pop[,10], sep=""))
R> pop2
   status sex age disType rs123 rs157 rs132
1       0   0  42       0    11    24    44
2       1   1  37       0    31    44    44
3       1   0  38       0    11    24    44
4       0   1  45       0    31    22    44
5       1   1  25       0    31    24    44
6       0   1  31       0    11    44    44
7       1   0  43       0    11    44    44
8       0   0  41       0    11    44    44
9       1   1  57       0    31    22    24
10      1   1  40       0    11    22    24
and now you can do counts and whatnot on pop2:
R> sapply(pop2[,5:7], table)
$rs123
11 31
 6  4

$rs157
22 24 44
 3  3  4

$rs132
24 44
 2  8

R>
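Parts 2) and 3) can then be handled on pop2 as well; here is one sketch, assuming "least frequent" is judged per SNP column (ties go to the first value returned by table):
recode_rare <- function(x) {
  tab  <- table(x)
  rare <- names(tab)[which.min(tab)]  # least frequent genotype in this column
  as.integer(x == rare)               # 1 for the rare genotype, 0 for everything else
}
pop2[, 5:7] <- lapply(pop2[, 5:7], recode_rare)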
