R: calculating ICC across multiple columns in two dataframes - r

I am calculating the ICC's for 301 variables of 2 readers. Results are saved in two files with 301 columns each. The first column of file1 (reader1$Var1) corresponds to the first column of file2 (reader2$Var302). I can perform the ICC manually (see below), but I need help to automate this process using apply or a loop. Thank you.
library(irr)
irr::icc()
a= data.frame(reader1$Var1)
b= data.frame(reader2$Var302)
X= data.frame (a,b)
function.ICC <- function (X) {irr::icc(X, model =c("oneway"), type = c("consistency"), unit =("single"), r0 = 0, conf.level = 0.95)}
Results <- function.ICC(X)
Results[7]

A combination of lapply and do.call could do for your case (although there's quite a few options). You don't provide a sample of your data, so I assume you first do a cbind of your 2 dataframes one after the other, so that in this toy example
> X = data.frame(cbind(1:10, 11:20, 21:30, 21:30))
> X
X1 X2 X3 X4
1 1 11 21 21
2 2 12 22 22
3 3 13 23 23
4 4 14 24 24
5 5 15 25 25
6 6 16 26 26
7 7 17 27 27
8 8 18 28 28
9 9 19 29 29
10 10 20 30 30
you would like to run icc of X1 vs X3 and X2 versus X4. It would be something like the following, relying on function.ICC as you've defined it:
> do.call(cbind, lapply(1:2, function(i) function.ICC(X[,c(i, i+2)])))
[,1] [,2]
subjects 10 10
raters 2 2
model "oneway" "oneway"
type "consistency" "consistency"
unit "single" "single"
icc.name "ICC(1)" "ICC(1)"
value -0.8320611 -0.4634146
r0 0 0
Fvalue 0.09166667 0.3666667
df1 9 9
df2 10 10
p.value 0.9993158 0.926668
conf.level 0.95 0.95
lbound -0.9526347 -0.8231069
ubound -0.4669701 0.1848105
So, for your cbind'ed dataframes with 301 columns, omething similar to this should work:
do.call(cbind, lapply(1:301, function(i) function.ICC(X[,c(i, i+301)])))

Related

Adding extreme value distributed noise (with µ=0,σ=10) to a vector of numbers in R

I have the following matrix
Measurement Treatment
38 A
14 A
54 A
69 A
20 B
36 B
35 B
10 B
11 C
98 C
88 C
14 C
I want to add extreme value distributed noise (with mean=0 and sd=10) to the Measurement values. How can I achieve that in R?
I found revd in extRemes package, but it does not work as expected. Does devd from the same package do what I want to do? (but it does not allow for mean and sd to be defined)
If you want to use your measure as the mean for the noise, then you can do this:
measure = round(runif(10,0,30),0)
data = data.frame(measure)
for(i in 1:nrow(data)){
data$measure1[i] = rnorm(1,data$measure[i],10)
}
data
measure measure1
1 6 6.281557
2 12 -5.780177
3 18 13.529773
4 26 33.665584
5 14 12.666614
6 24 41.146132
7 5 -1.850390
8 14 16.728703
9 13 26.082601
10 13 14.066475
EDIT: You can avoid the for loop with this instead:
data$measure1 = data$measure + rnorm(1,0,10)

How to compute regressions and extract parameters on all columns of a dataset [duplicate]

This question already has an answer here:
Fitting a linear model with multiple LHS
(1 answer)
Closed 6 years ago.
I want to compute a linear regression on all column (or to a selected column) of a specific dataset. The first column respresent a X axis of the regression, the other each subject response. The second step is to extract for each specific subject the coefficients parameters of regression (linear or logistic).
Actually I do it manually for each column using lm (or glm) and extracting the coefficients to a specific variable and dataset.
Example using lm:
dataset <- as.data.frame(matrix(c(1,1,
3,7,2,1,4,5,3,2,4,6,4,2,5,8,5,5,9,9,6,4,
12,10,7,6,15,11,8,6,15,15,9,8,16,10,10,9,18,9,11,12,
20,12,12,15,21,16,13,18,22,15,14,22,21,10,15,29,24,12)
,nrow=15, ncol=4,byrow=TRUE))
colnames(dataset) <- c("X","Sj1","Sj2","Sj3")
Output:
dataset
X Sj1 Sj2 Sj3
1 1 1 3 7
2 2 1 4 5
3 3 2 4 6
4 4 2 5 8
5 5 5 9 9
6 6 4 12 10
7 7 6 15 11
8 8 6 15 15
9 9 8 16 10
10 10 9 18 9
11 11 12 20 12
12 12 15 21 16
13 13 18 22 15
14 14 22 21 10
15 15 29 24 12
Regressions:
attach (dataset)
mod1 <- lm(Sj1~X)
mod2 <- lm(Sj2~X)
mod3 <- lm(Sj3~X)
Intercept <- 0
Intercept[1] <- mod1$coefficients[[1]]
Intercept[2] <- mod2$coefficients[[1]]
Intercept[3] <- mod3$coefficients[[1]]
Slope <- 0
Slope[1] <- mod1$coefficients[[2]]
Slope[2] <- mod2$coefficients[[2]]
Slope[3] <- mod3$coefficients[[2]]
data.frame(Intercept,Slope,row.names=colnames(dataset)[-1])
and the final output is
Intercept Slope
Sj1 -4.580952 1.7392857
Sj2 1.104762 1.6035714
Sj3 6.104762 0.5285714
There is a code to perform it automatically, indipendently from the number of columns? I tried apply and function without results.
What is the best way to do this?
lm accepts a matrix on the LHS. See the documentation.
f <- as.formula(paste0("cbind(", paste(names(dataset)[-1], collapse = ","), ") ~ X"))
mods <- lm(f, data = dataset)
coef(mods)
# Sj1 Sj2 Sj3
#(Intercept) -4.580952 1.104762 6.1047619
#X 1.739286 1.603571 0.5285714
PS: You should get out of the habit of using attach.

Calculate mean of each n-rows in a dataframe in r when the first row is varying

First make some example data:
df = data.frame(matrix(rnorm(200), nrow=100))
df1=data.frame(t(c(25,34)))
The starting row is different in each column. For example, in X1 I would like to start from 25 th row while in X2 from row 34. Then, I want to calculate the mean for each 5 values for the next 50 rows for all the columns in df.
I am new to R so this is probably very obvious. Can anyone provide some suggestions that how I can do this?
You could try Map.
lst <- Map(function(x,y) {x1 <- x[y:length(x)]
tapply(x1,as.numeric(gl(length(x1), 5,
length(x1))), FUN=mean)},
df, df1)
lst
# $X1
# 1 2 3 4 5 6
#-0.16500158 0.11339623 -0.86961872 -0.54985564 0.19958461 0.35234983
# 7 8 9 10 11 12
#0.32792769 0.65989801 -0.30409184 -0.53264725 -0.45792792 -0.59139844
# 13 14 15 16
# 0.03934133 -0.38068187 0.10100007 1.21017392
#$X2
# 1 2 3 4 5 6
# 0.24525622 0.07367300 0.18733973 -0.43784202 -0.45756095 -0.45740178
# 7 8 9 10 11 12
#-0.54086152 0.10439072 0.65660937 0.70623380 -0.51640088 0.46506135
# 13 14
#-0.09428336 -0.86295101
Because of the length difference, it might be better to keep it as a list. But, if you need it in a matrix/data.frame, you can make the lengths equal by padding with NAs.
do.call(cbind,lapply(lst, `length<-`,(max(sapply(lst, length)))))
Update
If you need only 50 rows, then change y:(length(x) to y:(y+49) in the Map code
data
set.seed(24)
df <- data.frame(matrix(rnorm(200), nrow=100))
df1 <- data.frame(t(c(25,34)))
Not entirely clear, especially, the second line of your code, but I think this might be close to what you want to do:
every_fifth_row <- df[seq(1, nrow(df), 5), ]
every_fifth_row
# X1 X2
# 1 -0.09490455 -0.28417104
# 6 -0.14949662 0.12857284
# 11 0.15297366 -0.84428186
# 16 -1.03397309 0.04775516
# 21 -1.95735213 -1.03750794
# 26 1.61135194 1.10189370
# 31 0.12447365 1.80792719
# 36 -0.92344017 0.66639710
# 41 -0.88764143 0.10858376
# 46 0.27761464 0.98382526
# 51 -0.14503359 -0.66868956
# 56 -1.70208187 0.05993688
# 61 0.33828525 1.00208639
# 66 -0.41427863 1.07969341
# 71 0.35027994 -1.46920059
# 76 1.38943839 0.01844205
# 81 -0.81560917 -0.32133221
# 86 1.38188423 -0.77755471
# 91 1.53247872 -0.98660308
# 96 0.45721909 -0.22855622
rowMeans(every_fifth_row)
colMeans(every_fifth_row)
# Alternative
# apply(every_fifth_row, 1, mean) # Row-wise mean
# apply(every_fifth_row, 2, mean) # Column-wise mean

Apply LR models to another dataframe

I searched SO, but I could not seem to find the right code that is applicable to my question. It is similar to this question: Linear Regression calculation several times in one dataframe
I got a dataframe of LR coefficients following Andrie's code:
Cddply <- ddply(test, .(sumtest), function(test)coef(lm(Area~Conc, data=test)))
sumtest (Intercept) Conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617
My question is how to apply each of these LR models (1-10) to specific row intervals in another dataframe in order to get x, the independent variable, into a 3rd column. For example, I would like to apply sumtest1 to Samples 6:29, sumtest2 to samples 35:50, sumtest3 to samples 56:79, etc.. in intervals of 24 and 16 samples. The sample numbers repeats after 200, so sumtest9 will be for Samples 6:29 again.
Sample Area
6 236211
7 724919
8 1259814
9 1574722
10 268836
11 863818
12 1261768
13 1591845
14 220322
15 608396
16 980182
17 1415859
18 276276
19 724532
20 1130024
21 1147840
22 252051
23 544870
24 832512
25 899457
26 285093
27 4291007
28 825922
29 865491
35 246707
36 538092
37 767269
38 852410
39 269152
40 971471
41 1573989
42 1897208
43 261321
44 481486
45 598617
46 769240
47 229695
48 782691
49 1380597
50 1725419
The resulting dataframe would look like this:
Sample Area Calc
6 236211 407.5312917
7 724919 985.1525288
8 1259814 1617.363812
9 1574722 1989.564693
10 268836 446.0919309
...
35 246707 365.2452551
36 538092 724.3591324
37 767269 1006.805521
38 852410 1111.736505
39 269152 392.9073207
Thank you for your assistance.
Is this what you want? I made up a slightly larger dummy data set of 'area' to make it easier to see how the code worked when I tried it out.
# create 400 rows of area data
set.seed(123)
df <- data.frame(area = round(rnorm(400, mean = 1000000, sd = 100000)))
# "sample numbers repeats after 200" -> add a sample nr 1-200, 1-200
df$sample_nr <- 1:200
# create a factor which cuts the vector of sample_nr into pieces of length 16, 24, 16, 24...
# repeat to a total length of the pieces is 200
# i.e. 5 repeats of (16, 24)
grp <- cut(df$sample_nr, breaks = c(-Inf, cumsum(rep(c(16, 24), 5))))
# add a numeric version of the chunks to data frame
# this number indicates the model from which coefficients will be used
# row 1-16 (16 rows): model 1; row 17-40 (24 rows): model 2;
# row 41-56 (16 rows): model 3; and so on.
df$mod <- as.numeric(grp)
# read coefficients
coefs <- read.table(text = "intercept beta_conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617", header = TRUE)
# add model number
coefs$mod <- rownames(coefs)
head(df)
head(coefs)
# join area data and coefficients by model number
# (use 'join' instead of merge to avoid sorting)
library(plyr)
df2 <- join(df, coefs)
# calculate conc from area and model coefficients
# area = intercept + beta_conc * conc
# conc = (area - intercept) / beta_conc
df2$conc <- (df2$area - df2$intercept) / df2$beta_conc
head(df2, 41)

Maximum Intermediate Volatility

I have two vectors, a and b. See attached.
a is the signal and is a probability.
b is the absolute percentage change the next period.
Signalt <- seq(0, 1, 0.05)
I would like to find the maximum absolute return occuring within each intermediate 5%-tile (Signalt) of the a vector. So if it is
0.01, 0.02, 0.03, 0.06 0.07
then it should calculate the maximum return between
0.01 and 0.02,
0.01 and 0.03,
0.02 and 0.03.
Then move on to
0.06 and 0.07 do it over etc.
Output would then be combined in a matrix or table when the entire sequence has run.
It should follow the index from vector a and b.
i is an index that is updated by one every time that a crosses into a new percentile. t(i) is the bucket associated with the ith cross.
a is the probability vector which has length tao. This vector should be analyzed in its 5% tiles, with the maximum intermediate absolute return being the output. The price change of next period is the vector b. This would be represented by P in the equation below.
l and m are indexes.
Every time Signal moves from one 5% tile to another, we compute the
largest absolute return that occurs between any two intermediate
buckets, until Signal moves to another 5% tile. For example, suppose
that Signal moves into the 85th percentile and 4 volume buckets later
moves into the 90th percentile. We would then calculate absolute
returns between buckets 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3
and 4. We are interested in the maximum absolute return. We would then
calculate the max return in the following percentile bucket, move on
to the next, which could be an 85th percentile and so on. So we let i
be an index that is updated by 1 every time that Signal moves from one
percentile into another, and τ(i) the bucket associated with the ith
cross.
This is the equation I am using. The notation might vary slightly.
Now my question is how to go about this. Perhaps someone has an intuitive solution to this.
I hope my question is clear.
"a","b"
0,0.013013698630137
0,0.0013522650439487
0,0.00135409614082593
0,0.00203389830508471
0.27804813511593,0.00135317997293627
0.300237801284318,0
0.495965075167796,0.00405405405405412
0.523741892051237,0.000672947510094168
0.558753750296458,0.00202020202020203
0.665762829019002,0.000672043010752743
0.493106479913899,0.000671591672263272
0.344592579573497,0.000672043010752854
0.336263897823707,0.00201748486886366
0.35884763774257,0.00536912751677865
0.23662807979007,0.00133511348464632
0.212636893966841,0.00267379679144386
0.362212830513403,0.000666666666666593
0.319216408413927,0.00333555703802535
0.277670854167344,0
0.310143323100971,0
0.374104373036218,0.00267737617135211
0.190943075221511,0.00268456375838921
0.165770070508112,0.00200803212851386
0.240310208616952,0.00133600534402145
0.212418038918236,0.00200133422281523
0.204282022136019,0.00200534759358306
0.363725074298064,0.000667111407605114
0.451807761954326,0.000666666666666593
0.369296011692801,0.000666222518321047
0.37503495989363,0.0026666666666666
0.323386355686901,0.00132978723404265
0.189216171830472,0.00266311584553924
0.185252052821193,0.00199203187250996
0.174882909380997,0.000662690523525522
0.149291525540782,0.00132625994694946
0.196824215268048,0.00264900662251666
0.164611993131396,0.000660501981505912
0.125470998266484,0.00132187706543285
0.179999532586703,0.00264026402640272
0.368749638521621,0.000658327847267826
0.427799340926225,0
My interpretation of the question
I hope I understand your question correctly. Here is what I understood:
For each row you compute which 5% percentile it belongs to
Whenever that percentile changes, you start a new bucket
All rows from the same bucket result in a single resulting value
If there is only a single row in a bucket, the b value from that row is the resulting value
Otherwise, you compute all abs(b[l]/b[m]-1) where m<l and both belong to the same bucket
Basic answer
Code
This code here does what I describe above:
# read the data (shortened, full data in OP)
d <- read.table(textConnection("a,b
0,0.013013698630137
[…]
0.427799340926225,0
"), sep=",", header=TRUE)
# compute percentile number for each line
d$percentile <- floor(d$a/0.05)*5 + 5
# start a new bucket whenever the percentile changes
d$bucket <- cumsum(c(1, diff(d$percentile) != 0))
# compute a single number for all rows of the same bucket
aggregate(b ~ percentile + bucket, d, function(b) {
if(length(b) == 1) return(b); # special case of only a single row
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1)) # compare all pairs
return(max(m[upper.tri(m)])) # only return pairs with m < l
})
Output
The result will look like this:
percentile bucket b
1 5 1 0.8960891071
2 30 2 0.0013531800
3 35 3 0.0000000000
4 50 4 0.0040540541
5 55 5 0.0006729475
6 60 6 0.0020202020
7 70 7 0.0006720430
8 50 8 0.0006715917
9 35 9 2.0020174849
10 40 10 0.0053691275
11 25 11 1.0026737968
12 40 12 0.0006666667
13 35 13 0.0033355570
14 30 14 0.0000000000
15 35 15 0.0000000000
16 40 16 0.0026773762
17 20 17 0.2520080321
18 25 18 0.5010026738
19 40 19 0.0006671114
20 50 20 0.0006666667
21 40 21 3.0026666667
22 35 22 0.0013297872
23 20 23 0.7511597084
24 15 24 0.0013262599
25 20 25 0.7506605020
26 15 26 0.0013218771
27 20 27 0.0026402640
28 40 28 0.0006583278
29 45 29 0.0000000000
Additional columns
Code
If you also want to know the number of items in each group, then I suggest you use the plyr library:
library(plyr)
aggB <- function(b) {
if(length(b) == 1) return(b)
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1))
return(max(m[upper.tri(m)]))
}
ddply(d, .(bucket), summarise,
percentile = percentile[1], n = length(b), maxr = aggB(b))
Output
This will give you the following result:
bucket percentile n maxr
1 1 5 4 0.8960891071
2 2 30 1 0.0013531800
3 3 35 1 0.0000000000
4 4 50 1 0.0040540541
5 5 55 1 0.0006729475
6 6 60 1 0.0020202020
7 7 70 1 0.0006720430
8 8 50 1 0.0006715917
9 9 35 2 2.0020174849
10 10 40 1 0.0053691275
11 11 25 2 1.0026737968
12 12 40 1 0.0006666667
13 13 35 1 0.0033355570
14 14 30 1 0.0000000000
15 15 35 1 0.0000000000
16 16 40 1 0.0026773762
17 17 20 2 0.2520080321
18 18 25 3 0.5010026738
19 19 40 1 0.0006671114
20 20 50 1 0.0006666667
21 21 40 2 3.0026666667
22 22 35 1 0.0013297872
23 23 20 3 0.7511597084
24 24 15 1 0.0013262599
25 25 20 2 0.7506605020
26 26 15 1 0.0013218771
27 27 20 1 0.0026402640
28 28 40 1 0.0006583278
29 29 45 1 0.0000000000
I am not sure to understand but here an attempt. My idea is to group data by centiles than do calculation on each group using by
To group data I create a new variable split
##dat$split <- cut(dat$a,seq(0, 1, 0.05),include.lowest=T)
dat$split <- c(0,cumsum(diff(dat$a) > 0.05))
Using by, I can performs my function en each group. I remove the singular cases of NULL prob values or one values.
by(dat,dat$split,FUN =function(x){
P <- x$b
if( is.null(P)||length(P) ==1) return(0)
nn <- length(P)
ind <- expand.grid(1:nn,1:nn) ## I generate indexes here
ret <- abs(P[ind[,1]]/P[ind[,2]]-1) ## perfom P_l/P_m-1 (vectorized)
list(P=P,
ret.max = max(ret),
ret.ind = ind[which.max(ret),])
})
Here the result list. For each interval I show ,
P ( Prob values),
The maximum return
The indexes from which this maximum is computed.
For example:
dat$split: 0
$P
[1] 0.0130 0.0014 0.0014 0.0020
$ret.max
[1] 8.6236
$ret.ind
Var1 Var2
5 1 2
---------------------------------------------------------------------------------------------------------------
dat$split: 1
$P
[1] 0.0014 0.0000
$ret.max
[1] 1
$ret.ind
Var1 Var2
2 2 1

Resources