Paired t-test in R

I am trying to run a paired t-test in R on data grouped by factors:
> head(i.o.diff,n=20)
# Difference Tree.ID Tree.Name Ins Outs
#1 0.20 AK-1 Akun 1.20 1.0
#2 -1.60 AK-2 Akun 0.40 2.0
#3 -0.60 AK-3 Akun 1.40 2.0
#4 0.40 AK-4 Akun 0.40 0.0
#5 1.30 AK-5 Akun 1.80 0.5
#6 2.70 J-1 Jaror 10.20 7.5
#7 6.60 J-2 Jaror 10.60 4.0
#8 2.50 J-3 Jaror 6.00 3.5
#9 7.50 J-4 Jaror 22.00 14.5
#10 -4.50 J-5 Jaror 5.00 9.5
#11 3.50 Ce-1 Ku'ch 4.00 0.5
#12 -0.70 Ce-2 Ku'ch 4.80 5.5
#13 1.60 Ce-3 Ku'ch 2.60 1.0
#14 -2.40 Ce-4 Ku'ch 2.60 5.0
#15 -1.75 Ce-5 Ku'ch 2.25 4.0
I first tried using:
pairwise.t.test(i.o.diff$In,i.o.diff$Out,g=i.o.diff$Tree.Name,paired=TRUE,pool=FALSE,p.adj="none",alternative=c("less"),mu=0)
but I get the error
Error in complete.cases(x, y) : not all arguments have the same length
which doesn't make a whole lot of sense to me.
I considered using ddply(), apply(), and summaryBy(), but couldn't get them to work: the paired t-test requires two input vectors, and the functions I mention seem to work best when only one column is being operated on.
In order to get around this, I tried to use a for loop to achieve the same end:
for (i in unique(i.o.diff$Tree.Name)) {
  pair_sub <- subset(i.o.diff, Tree.Name == i)
  t.pair <- t.test(pair_sub$Ins, pair_sub$Outs, paired = "True")
  print(t.pair)
}
However, when I do this, I get the error
Error in paired || !is.null(y) : invalid 'x' type in x||y
So I checked typeof(pair_sub$Ins). Turns out that type is double, which is numeric, so I am not sure why the paired t-test is not working. Any ideas as to how to fix either of these methods?

Removed the quotes around TRUE in the for loop. Works great now.
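For reference, a minimal sketch of the corrected loop (the only change is passing the logical TRUE rather than the string "True"):
for (i in unique(i.o.diff$Tree.Name)) {
  pair_sub <- subset(i.o.diff, Tree.Name == i)
  t.pair <- t.test(pair_sub$Ins, pair_sub$Outs, paired = TRUE)  # 'paired' must be a logical
  print(t.pair)
}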

From the R documentation:
t.test {stats}    R Documentation
Student's t-Test
Description:
Performs one and two sample t-tests on vectors of data.
Usage:
t.test(x, ...)
## Default S3 method:
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)

Related

How to sum the current value of a first variable and the previous value of a newly created variable

I am quite new to RStudio. I am trying to create a new variable that is the sum of the current value of the first variable and the previous value of the newly created variable, as shown below.
mean std z score
230.00 16.02 0.50 0.50
226.86 16.12 -0.56 0.0
227.57 16.99 0.85 0.85
231.14 19.74 -0.55 0.30
236.57 12.29 1.96 2.26
241.14 13.97 3.28 5.55
241.57 13.87 1.05 6.60
246.29 18.18 0.85 7.45
The calculations here were made in Excel, but I would like to do the same in RStudio.
Any help is much appreciated.
Thank you, Ismail
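No answer is attached to this question in this excerpt, but what is described (the current value of the first variable plus the previous value of the new variable) is a cumulative sum, which R computes with cumsum(). A minimal sketch, assuming the values are in a data frame df with a column z (names assumed):
df$score <- cumsum(df$z)  # each row adds the current z to the previous score
head(df)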

Conditional density distribution, two discrete variables

I have plotted the conditional density distribution of my variables using cdplot (R). My independent variable and my dependent variable are not independent. The independent variable is discrete (it takes only certain values between 0 and 3) and the dependent variable is also discrete (11 levels from 0 to 1 in steps of 0.1).
Some data:
dat <- read.table( text="y x
3.00 0.0
2.75 0.0
2.75 0.1
2.75 0.1
2.75 0.2
2.25 0.2
3 0.3
2 0.3
2.25 0.4
1.75 0.4
1.75 0.5
2 0.5
1.75 0.6
1.75 0.6
1.75 0.7
1 0.7
0.54 0.8
0 0.8
0.54 0.9
0 0.9
0 1.0
0 1.0", header=TRUE, colClasses="factor")
I wonder if my variables are appropriate for this kind of analysis.
Also, I'd like to know how to report these results in an elegant way, with academic and statistical sense.
This is a run using the rms package's `lrm` function, which is typically used for binary outcomes but also handles ordered categorical variables:
library(rms) # also loads Hmisc
# first get data in the form you described
dat[] <- lapply(dat, ordered) # makes both columns ordered factor variables
?lrm
#read help page ... Also look at the supporting book and citations on that page
lrm( y ~ x, data=dat)
# --- output------
Logistic Regression Model
lrm(formula = y ~ x, data = dat)
Frequencies of Responses
0 0.54 1 1.75 2 2.25 2.75 3 3.00
4 2 1 5 2 2 4 1 1
                       Model Likelihood     Discrimination      Rank Discrim.
                          Ratio Test            Indexes            Indexes
Obs              22    LR chi2      51.66   R2         0.920    C        0.869
max |deriv|  0.0004    d.f.            10   g         20.742    Dxy      0.738
                       Pr(> chi2) <0.0001   gr 1019053402.761   gamma    0.916
                                            gp         0.500    tau-a    0.658
                                            Brier      0.048
Coef S.E. Wald Z Pr(>|Z|)
y>=0.54 41.6140 108.3624 0.38 0.7010
y>=1 31.9345 88.0084 0.36 0.7167
y>=1.75 23.5277 74.2031 0.32 0.7512
y>=2 6.3002 2.2886 2.75 0.0059
y>=2.25 4.6790 2.0494 2.28 0.0224
y>=2.75 3.2223 1.8577 1.73 0.0828
y>=3 0.5919 1.4855 0.40 0.6903
y>=3.00 -0.4283 1.5004 -0.29 0.7753
x -19.0710 19.8718 -0.96 0.3372
x=0.2 0.7630 3.1058 0.25 0.8059
x=0.3 3.0129 5.2589 0.57 0.5667
x=0.4 1.9526 6.9051 0.28 0.7773
x=0.5 2.9703 8.8464 0.34 0.7370
x=0.6 -3.4705 53.5272 -0.06 0.9483
x=0.7 -10.1780 75.2585 -0.14 0.8924
x=0.8 -26.3573 109.3298 -0.24 0.8095
x=0.9 -24.4502 109.6118 -0.22 0.8235
x=1 -35.5679 488.7155 -0.07 0.9420
There is also the MASS::polr function, but I find Harrell's version more approachable. This could also be approached with rank regression. The quantreg package is pretty standard if that were the route you chose. Looking at your other question, I wondered if you had tried a logistic transform as a method of linearizing that relationship. Of course, the illustrated use of lrm with an ordered variable is a logistic transformation "under the hood".
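For comparison, a minimal sketch of the MASS::polr alternative mentioned above (assuming the same dat with both columns already converted to ordered factors, as in the lrm example):
library(MASS)
fit <- polr(y ~ x, data = dat, Hess = TRUE)  # proportional-odds logistic regression
summary(fit)                                 # coefficients, intercepts, t values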

avoid nested for-loops for tricky operation R

I have to do an operation that involves two matrices: matrix #1 with data and matrix #2 with coefficients by which to multiply the columns of matrix #1.
matrix #1 is:
dim(dat)
[1] 612 2068
dat[1:6,1:8]
X0005 X0010 X0011 X0013 X0015 X0016 X0017 X0018
1 1.96 1.82 8.80 1.75 2.95 1.10 0.46 0.96
2 1.17 0.94 2.74 0.59 0.86 0.63 0.16 0.31
3 2.17 2.53 10.40 4.19 4.79 2.22 0.31 3.32
4 3.62 1.93 6.25 2.38 2.25 0.69 0.16 1.01
5 2.32 1.93 3.74 1.97 1.31 0.44 0.28 0.98
6 1.30 2.04 1.47 1.80 0.43 0.33 0.18 0.46
and matrix #2 is:
dim(lo)
[1] 2068 8
head(lo)
i1 i2 i3 i4 i5 i6
X0005 -0.11858852 0.10336788 0.62618771 0.08706041 -0.02733101 0.006287923
X0010 0.06405406 0.13692216 0.64813610 0.15750302 -0.13503956 0.139280709
X0011 -0.06789727 0.30473549 0.07727417 0.24907723 -0.05345123 0.141591330
X0013 0.20909664 0.01275553 0.21067894 0.12666704 -0.02836527 0.464548147
X0015 -0.07690560 0.18788859 -0.03551084 0.19120773 -0.10196578 0.234037820
X0016 -0.06442454 0.34993481 -0.04057001 0.20258195 -0.09318325 0.130669546
i7 i8
X0005 0.08571777 0.031531478
X0010 0.31170850 -0.003127279
X0011 0.52527759 -0.065002026
X0013 0.27858049 -0.032178156
X0015 0.50693977 -0.058003429
X0016 0.53162596 -0.052091767
I want to multiply each column of matrix #1 by the corresponding coefficient in the first column of matrix #2, and sum up all the resulting columns. Then repeat the operation with the coefficients of matrix #2's second column, then its third column, and so on...
The result is then a matrix with 8 columns, each a linear combination of the data in matrix #1.
My attempt uses nested for-loops. It works, but takes about 30 minutes to execute. Is there any way to avoid these loops and reduce the computational effort?
Here is my attempt:
r <- nrow(dat)
n <- ncol(dat)
m <- ncol(lo)
eme <- matrix(NA, r, m)
for (i in 1:m) {
  SC <- matrix(NA, r, n)
  for (j in 1:n) {
    nom <- rownames(lo)
    x <- dat[, colnames(dat) == nom[j]]
    SC[, j] <- x * lo[j, i]
    SC1 <- rowSums(SC)
  }
  eme[, i] <- SC1
}
Thanks for your help
It looks like you are just doing matrix-vector multiplication. In R, use the %*% operator, so all the looping is delegated to a Fortran routine. I think it equates to the following:
apply(lo, 2, function(x) dat %*% x)
Your code could be improved by moving the nom <- assignment outside the loops since it recalculates the same thing every iteration. Also, what is the point of SC1 being computed during each iteration?
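For what it's worth, the whole thing can also be written as a single matrix product; a sketch, assuming dat can be coerced to a numeric matrix, with its columns reordered to match rownames(lo) just as the original loop does by name:
dat_m <- as.matrix(dat[, rownames(lo)])  # columns of dat in the row order of lo
eme   <- dat_m %*% as.matrix(lo)         # 612 x 8: one linear combination per column of lo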

how do I split a dataframe by row into chunks of n, apply a function and combine?

I have a data.frame of 130,209 rows.
> head(dt)
mLow1 mHigh1 mLow2 mHigh2 meanLow meanHigh fc mean
A_00001 37.00 12.75 99.25 78.50 68.125 45.625 1.4931507 56.8750
A_00002 31.00 21.50 84.75 53.00 57.875 37.250 1.5536913 47.5625
A_00003 72.50 26.50 81.75 74.75 77.125 50.625 1.5234568 63.8750
I want to split the data.frame into 12 parts, apply the scale function to the column fc, and then combine them. There is no grouping variable here, else I'd have used ddply. Also, because 130,209 is not perfectly divisible by 12, the resulting data.frames will be unbalanced, i.e., 11 data.frames will have 10,851 rows and the last one will have 10,848 rows, but that's fine.
So how do I split a data.frame by row into chunks of n (in this case 12), apply a function and then combine them together? Any help'd be much appreciated.
Update:
Using the two top solutions, I get different results:
Using @Ben Bolker's solution:
mLow1 mHigh1 mLow2 mHigh2 UID gene_id meanLow meanHigh mean fc
1.5 3.25 1 1.25 MGLibB_00021 0610010K14Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00034 0610037L13Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibB_00058 1100001G20Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00061 1110001A16Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00104 1110034G24Rik 1.25 2.25 1.75 -0.5231249
1.5 3.25 1 1.25 MGLibA_00110 1110038F14Rik 1.25 2.25 1.75 -0.5231249
Using @MichaelChirico's answer:
mLow1 mHigh1 mLow2 mHigh2 UID gene_id meanLow meanHigh mean fc fc_scaled
1.5 3.25 1 1.25 MGLibB_00021 0610010K14Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00034 0610037L13Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibB_00058 1100001G20Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00061 1110001A16Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00104 1110034G24Rik 1.25 2.25 1.75 0.5555556 -0.5089608
1.5 3.25 1 1.25 MGLibA_00110 1110038F14Rik 1.25 2.25 1.75 0.5555556 -0.5089608
I'm not sure the structure of dt matters that much (if you are not using any of its internal values to do the splitting). Does this help?
spl.dt <- split( dt , cut(1:nrow(dt), 12) )
lapply( spl.dt, my_fun)
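To recombine the pieces afterwards, something like this should work (my_fun is the OP's placeholder; the sketch assumes it returns a data frame with the same columns for every chunk):
res <- do.call(rbind, lapply(spl.dt, my_fun))  # stack the per-chunk results back together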
ggplot2 has a cut_number() convenience function that will do this for you. If you don't want the overhead of loading that package, you can look at ggplot2:::breaks for the necessary logic.
Reproducible example stolen from @MichaelChirico:
set.seed(100)
KK <- 130209L; nn <- 12L
library("dplyr")
dt <- data.frame(mLow1 = rnorm(KK), mHigh1 = rnorm(KK),
                 mLow2 = rnorm(KK), mHigh2 = rnorm(KK),
                 meanLow = rnorm(KK), meanHigh = rnorm(KK),
                 fc = rnorm(KK), mean = rnorm(KK)) %>% arrange(mean)
With apologies to those who don't like pipes:
library("ggplot2") ## for cut_number()
dt %>% mutate(grp=cut_number(mean,12)) %>%
group_by(grp) %>%
mutate(fc=c(scale(fc))) %>%
ungroup() %>%
select(-grp) %>% ## drop grouping variable
as.data.frame -> dt2 ## convert back to data frame, assign result
It turns out that the c() around scale() is necessary -- otherwise the fc variable ends up with some attributes that confuse tail() ...
The same logic should apply to using plyr, or base R split-apply-combine, as well (the key is using cut_number() to define the grouping variable).
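The same idea in base R, as an illustrative sketch (not from the answer above; it borrows cut_number() from ggplot2 for the grouping variable and uses ave() for the within-group scaling):
dt2 <- dt
dt2$grp <- ggplot2::cut_number(dt2$mean, 12)                    # grouping variable
dt2$fc  <- ave(dt2$fc, dt2$grp, FUN = function(z) c(scale(z)))  # scale fc within each group
dt2$grp <- NULL                                                 # drop the helper column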
With data.table, you can do:
library(data.table)
setDT(dt)[,scale(fc),by=rep(1:nn,each=ceiling(KK/nn),length.out=KK)]
Here, KK is 130,209 and nn is 12. Reproducible data:
set.seed(100)
KK<-130209L; nn<-12L
dt<-data.frame(mLow1=rnorm(KK),mHigh1=rnorm(KK),
mLow2=rnorm(KK),mHigh2=rnorm(KK),
meanLow=rnorm(KK),meanHigh=rnorm(KK),
fc=rnorm(KK),mean=rnorm(KK))
So no need to split the data and recombine.
If you'd like to add this to the data frame instead of just extract it, you can use the := operator to assign by reference:
setDT(dt)[,fc_scaled:=scale(fc)...]
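Spelling that out with the same by expression as above (a sketch; as.vector() is only there to drop the matrix attributes that scale() returns):
setDT(dt)[, fc_scaled := as.vector(scale(fc)),
          by = rep(1:nn, each = ceiling(KK/nn), length.out = KK)]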

Using the ddply command on a subset of data

I am having some issues using the 'ddply' command from the 'plyr' package. I created a data frame which looks like this one:
u v intensity season
24986 -1.97 -0.35 2.0 1
24987 -1.29 -1.53 2.0 1
24988 -0.94 -0.34 1.0 1
24989 -1.03 2.82 3.0 1
24990 1.37 3.76 4.0 1
24991 1.93 2.30 3.0 2
24992 3.83 -3.21 5.0 2
24993 0.52 -2.95 3.0 2
24994 3.06 -2.57 4.0 2
24995 2.57 -3.06 4.0 2
24996 0.34 -0.94 1.0 2
24997 0.87 4.92 5.0 3
24998 0.69 3.94 4.0 3
24999 4.60 3.86 6.0 3
I tried to use the function cumsum on the u and v values, but I don't get what I want. When I select a subset of my data corresponding to a season, for example:
x <- cumsum(mydata$u[56297:56704]*10.8)
y <- cumsum(mydata$v[56297:56704]*10.8)
...this works perfectly. The thing is that I have a huge dataset (67,208 rows) with 92 seasons, and I'd like to make this function work on each subset of the data. So I tried this:
new <- ddply(mydata, .(mydata$seasons), summarize, x=c(0,cumsum(mydata$u*10.8)))
...and the result looks like this:
24986 1 NA
24987 1 NA
24988 1 NA
I found some questions related to this one on Stack Overflow and other websites, but none of them helped me deal with my problem. If someone has an idea, you're welcome ;)
Don't use your data.frame's name inside the plyr call; just reference the column name as though it were already defined:
ddply(mydata, .(seasons), summarise, x=c(0, cumsum(u*10.8)))
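Note that the sample data prints a column named season (singular), so the grouping variable should match whatever the column is actually called. Extending the same pattern to both trajectories (a sketch, not part of the original answer):
library(plyr)
new <- ddply(mydata, .(season), summarise,
             x = c(0, cumsum(u * 10.8)),
             y = c(0, cumsum(v * 10.8)))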
