If I have wind direction readings from a collection of wind vanes, is there something like a t.test (or other significance test) that I can perform on the circular data? I am assuming a normal distribution (which the data below were drawn from). I found the CircStats package, but figured I would check here for some additional guidance.
Some sample data:
df1 <- data.frame(unit=letters, wind.direction=c(99,88,93,99,86,90,101,109,109,91,86,94,106,92,99,103,110,98,107,109,93,102,92,99,109,85))
That one works fine using just a standard t.test since it doesn't wrap around zero. But,
df2 <- data.frame(unit=letters, wind.direction=c(1,350,355,1,348,352,3,11,11,353,348,356,8,3,1,5,12,0,9,11,355,4,354,1,11,347))
doesn't, since its circular mean is ~0 but its linear mean is ~139...
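To illustrate, here is a quick sketch with the circular package confirming the two means for df2:
library(circular)
wd <- circular(df2$wind.direction, units = "degrees")
mean(wd)                  # circular mean: close to 0 degrees
mean(df2$wind.direction)  # linear (arithmetic) mean: ~139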
You can use aov.circular, in the circular package.
# Sample data (with two groups, to compare the means)
library(circular)
x <- as.circular(
  c(1,350,355,1,348,352,3,11,11,353,348,356,
    8,3,1,5,12,0,9,11,355,4,354,1,11,347),
  units = "degrees"
)
g <- sample(LETTERS[1:2], 26, replace=TRUE)
# Test
aov.circular(x, g)
This is what I meant to say:
> df2$wd.scaled <- ifelse(df2$wind.direction > 180, df2$wind.direction - 360, df2$wind.direction)
> df2
unit wind.direction wd2 wd.scaled
1 a 1 1 1
2 b 350 -10 -10
3 c 355 -5 -5
4 d 1 1 1
5 e 348 -12 -12
6 f 352 -8 -8
> mean(df2$wd.scaled)
[1] 0.3846154
This would work if you don't have many observations near 180.
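Once the directions are rescaled like this, ordinary tests apply directly; for example, a hypothetical one-sample test that the mean direction is 0:
# With directions rescaled to (-180, 180], a standard t-test works:
t.test(df2$wd.scaled, mu = 0)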
For graphing purposes, I want to create a new data frame with two columns.
The first column is the dose of the treatment received (i; 10 grams up to 200 grams). The dose is extracted from a much larger dataset (data_fcpa) of more than 1,000 rows (patients).
The second column must be filled with the result of a calculation for each dose, i.e. the percentage of patients developing the disease at the corresponding dose, which is given by the formula below:
percent_i <- round(prop.table(table(data_fcpa$n_chir_act[data_fcpa$cyproterone_dose > i] > 1))[2] * 100, 1)
I know how to create a new data frame (df) with the doses I want to explore:
df <- data.frame(cpa_dose = seq(10, 200, by = 10))
> df
cpa_dose
1 10
2 20
3 30
4 40
5 50
6 60
7 70
8 80
9 90
10 100
11 110
12 120
13 130
14 140
15 150
16 160
17 170
18 180
19 190
20 200
For example, for a dose of 10 grams the result is:
> round(prop.table(table(data_fcpa$n_chir_act[data_fcpa$cyproterone_dose > 10] > 1))[2] * 100, 1)
TRUE
11.7
I suspect that a loop is needed to produce output like the small example below, but I have no idea how to do it.
cpa_dose percentage
1 10 11.7
2 20
3 30
4 40
Any suggestions are welcome.
Thank you in advance for your help.
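One way to avoid an explicit loop is to wrap the single-dose calculation in sapply over the doses; a minimal sketch, assuming data_fcpa really has the cyproterone_dose and n_chir_act columns used above:
# Apply the one-dose formula to every dose in df (sketch; note that [2]
# returns NA for doses where no patient satisfies the condition):
df$percentage <- sapply(df$cpa_dose, function(i) {
  round(prop.table(table(data_fcpa$n_chir_act[data_fcpa$cyproterone_dose > i] > 1))[2] * 100, 1)
})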
It seems that you are describing a situation where you want to show predicted effects from a statistical model. In that case, ggeffects is your best friend.
library(tidyverse)
library(ggeffects)
lm(mpg ~ hp, data = mtcars) %>%
  ggpredict() %>%
  as_tibble()
By the way, in order to answer your question properly, you should provide some data and show what you have tried.
This is my dummy data:
income <- sample(1000:10000, 1000, replace = TRUE)
individuals <- sample(1:50, 1000, replace = TRUE)
datatest <- data.frame(income, individuals)
I know I can sample by individual rows with this code:
sample <- datatest[sample(nrow(datatest), replace=TRUE),]
Now, I want to extract random samples with replacement and equal probabilities of the dataset but sampling complete blocks of observations with the same individual code.
Note that there are 50 individuals, but 1000 observations. Some observations belong to the same individual, so I want to sample by individuals (clusters, in this case), not observations. I don't mind if the extracted samples differ slightly in the number of observations. How can I do that?
I have tried:
library(sampling)
samplecluster <- cluster(datatest, clustername = "individuals", size = 50,
                         method = "srswr")
But the outcome is not the sampled data. Am I missing something?
Well, it seems I was indeed missing something. After the cluster command you need to apply the getdata command (both from the sampling package). This way I do get the sample I wanted, plus some additional columns.
samplecluster <- cluster(datatest, clustername = "individuals", size = 50, method = "srswr")
Gives you:
head(samplecluster)
individuals ID_unit Replicates Prob
1 1 259 2 0.63583
2 1 178 2 0.63583
3 1 110 2 0.63583
4 1 153 2 0.63583
5 1 941 2 0.63583
6 1 667 2 0.63583
Then using getdata, I also get the original data on income sampled by whole clusters:
datasample <- getdata (datatest, samplecluster)
head(datasample)
income individuals ID_unit Replicates Prob
1 8567 1 259 2 0.63583
2 2701 1 178 2 0.63583
3 4998 1 110 2 0.63583
4 3556 1 153 2 0.63583
5 2893 1 941 2 0.63583
6 7581 1 667 2 0.63583
I am not sure if I am missing something. If you just want some of your individuals, you can create a smaller sample of them:
ind.sample <- sample(1:50, size = 10)
print(ind.sample)
# [1] 17 43 38 39 28 23 35 47 9 13
my.sample <- datatest[datatest$individuals %in% ind.sample, ]
head(my.sample)
# income individuals
#21 9072 17
#97 5928 35
#122 9130 43
#252 4388 43
#285 8083 28
#287 1065 35
I guess a more generic approach would be to generate random indexes:
ind.unique <- unique(individuals)
ind.sample.index <- sample(1:length(ind.unique), size = 10)
ind.sample <- ind.unique[ind.sample.index]
print(ind.sample[order(ind.sample)])
my.sample <- datatest[datatest$individuals %in% ind.sample, ]
ind.counts <- aggregate(income ~ individuals, my.sample, FUN = length)
print(ind.counts)
I think it's important to note that the dataset still needs to be expanded to include all the replicates.
sw <- data.frame(
  datasample[rep(seq_len(nrow(datasample)), datasample$Replicates), , drop = FALSE],
  row.names = NULL
)
Might be helpful to someone
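For comparison, a base-R sketch of the same idea: draw individuals with replacement and stack all of their rows, so replicated clusters appear multiple times (names as in datatest above):
# Hypothetical base-R cluster bootstrap, no extra packages needed.
ids <- sample(unique(datatest$individuals), size = 50, replace = TRUE)
boot.sample <- do.call(rbind, lapply(ids, function(id) {
  datatest[datatest$individuals == id, ]   # all rows for this individual
}))
nrow(boot.sample)  # close to 1000, varies from draw to draw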
I have a dataframe where I have values, and for each value I have the counts associated with that value. So, plotting counts against values gives me the histogram. I have three types, a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts up to that value:
get_cumulative <- function(x) {
  result <- numeric(nrow(x))
  for (i in seq_along(result)) {
    result[i] <- sum(x$counts[x$value <= x$value[i]])
  }
  x$cumulative <- result
  x
}
Then I wrap this in a ddply that splits by the type. This is obviously not the best way, and I'd love any suggestions on how to proceed.
You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
If your data is in a data.frame DF, then the following should do it (taking the cumulative sum of the counts within each type):
do.call(rbind, lapply(split(DF, DF$type), function(d) transform(d, counts = cumsum(counts))))
The HistogramTools package on CRAN has several functions for converting between histograms and CDFs, calculating information loss or error margins, and plotting functions to help with this.
If you have a histogram h then calculating the Empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R Histogram object, then see the PreBinnedHistogram function first.
I'm a PhD student in genetics and I am trying to do association analysis of some genetic data using linear regression. In the table below I'm regressing each 'trait' against each 'SNP'. There is also an interaction term, included as 'var'.
I've only used R for two weeks and I don't have any programming background, so please explain any help provided, as I want to understand it.
This is a sample of my data:
Sample ID var trait 1 trait 2 trait 3 SNP1 SNP2 SNP3
77856517 2 188 3 2 1 0 0
375689755 8 17 -1 -1 1 -1 -1
392513415 8 28 14 4 1 1 1
393612038 8 85 14 6 1 1 0
401623551 8 152 11 -1 1 0 0
348466144 7 -74 11 6 1 0 0
77852806 4 81 16 6 1 1 0
440614343 8 -93 8 0 0 1 0
77853193 5 3 6 5 1 1 1
and this is the code I've been using for a single regression:
result1 <-lm(trait1~SNP1+var+SNP1*var, na.action=na.exclude)
I want to run a loop where every trait is tested against each SNP.
I've been trying to modify codes I've found online but I always run into some error that I don't understand how to solve.
Thank you for any and all help.
Personally I don't find the problem so easy, especially for an R novice.
Here is a solution based on creating the regression formula dynamically.
The idea is to use the paste function to build the formula terms, y ~ x + var + x*var, then coerce the resulting string to a formula using as.formula. Here y and x are the dynamic terms: y in c(trait1, trait2, ...) and x in c(SNP1, SNP2, ...). I use lapply to loop.
lapply(1:3, function(i){
  y <- paste0('trait', i)
  x <- paste0('SNP', i)
  factor1 <- x
  factor2 <- 'var'
  factor3 <- paste(x, 'var', sep = '*')
  listfactor <- c(factor1, factor2, factor3)
  form <- as.formula(paste(y, "~", paste(listfactor, collapse = "+")))
  lm(formula = form, data = dat)
})
I hope someone comes up with an easier, or more R-ish, solution. :)
EDIT
Thanks to @DWin's comment, we can simplify the formula to just y ~ x*var, since x*var expands to x + var + x:var.
So the code above simplifies to:
lapply(1:3, function(i){
  y <- paste0('trait', i)
  x <- paste0('SNP', i)
  RHS <- paste(x, 'var', sep = '*')
  form <- as.formula(paste(y, "~", RHS))
  lm(formula = form, data = dat)
})
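And if the goal is really to test every trait against every SNP (all nine combinations, rather than just the matching pairs), here is a sketch assuming the same trait1..trait3 / SNP1..SNP3 naming in dat:
# Hypothetical extension: fit all trait-by-SNP combinations.
combos <- expand.grid(trait = paste0("trait", 1:3),
                      snp = paste0("SNP", 1:3),
                      stringsAsFactors = FALSE)
models <- Map(function(y, x) lm(as.formula(paste(y, "~", x, "* var")), data = dat),
              combos$trait, combos$snp)
names(models) <- paste(combos$trait, combos$snp, sep = "_")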
I use the R package gbm as probably my first choice for predictive modeling. There are so many great things about this algorithm, but the one "bad" is that I can't easily use the model code to score new data outside of R. I want to write code that can be used in SAS or another system (I will start with SAS; no access to IML).
Let's say I have the following data set (from the gbm manual) and model code:
library(gbm)
set.seed(1234)
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
# introduce some missing values
#X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
X3[sample(1:N,size=30)] <- NA
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
# fit initial model
gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6, # formula
data=data, # dataset
var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease, +1: monotone increase, 0: no monotone restrictions
distribution="gaussian",
n.trees=2, # number of trees
shrinkage=0.005, # shrinkage or learning rate,
# 0.001 to 0.1 usually work
interaction.depth=5, # 1: additive model, 2: two-way interactions, etc.
bag.fraction = 1, # subsampling fraction, 0.5 is probably best
train.fraction = 1, # fraction of data for training,
# first train.fraction*N used for training
n.minobsinnode = 10, # minimum total weight needed in each node
cv.folds = 5, # do 5-fold cross-validation
keep.data=TRUE, # keep a copy of the dataset with the object
verbose=TRUE) # print out progress
Now I can see the individual trees using pretty.gbm.tree as in
pretty.gbm.tree(gbm1,i.tree = 1)[1:7]
which yields
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight
0 2 1.5000000000 1 8 15 983.34315 1000
1 1 1.0309565491 2 6 7 190.62220 501
2 2 0.5000000000 3 4 5 75.85130 277
3 -1 -0.0102671518 -1 -1 -1 0.00000 139
4 -1 -0.0050342273 -1 -1 -1 0.00000 138
5 -1 -0.0076601353 -1 -1 -1 0.00000 277
6 -1 -0.0014569934 -1 -1 -1 0.00000 224
7 -1 -0.0048866747 -1 -1 -1 0.00000 501
8 1 0.6015416372 9 10 14 160.97007 469
9 -1 0.0007403551 -1 -1 -1 0.00000 142
10 2 2.5000000000 11 12 13 85.54573 327
11 -1 0.0046278704 -1 -1 -1 0.00000 168
12 -1 0.0097445692 -1 -1 -1 0.00000 159
13 -1 0.0071158065 -1 -1 -1 0.00000 327
14 -1 0.0051854993 -1 -1 -1 0.00000 469
15 -1 0.0005408284 -1 -1 -1 0.00000 30
Page 18 of the gbm manual explains how to read this output.
Based on the manual, the first split occurs on the 3rd variable (SplitVar is zero-based in this output), which is gbm1$var.names[3], 'X3'. That variable is an ordered factor:
types <- sapply(data[, gbm1$var.names], function(col) class(col)[1])
types[3]
So the split at 1.5 means the first two levels, levels(data$X3)[1:2] ('d' and 'c'), go to the left node and the remaining levels, levels(data$X3)[3:4] ('b' and 'a'), go to the right.
Next, the rule continues with a split at gbm1$var.names[2] as denoted by SplitVar=1 in the row indexed 1.
Has anyone written anything to move through this data structure (for each tree), constructing rules such as:
"If X3 in ('d','c') and X2<1.0309565491 and X3 in ('d') then scoreTreeOne= -0.0102671518"
which is how I think the first rule from this tree reads.
Or have any advice how to best do this?
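One possible starting point is a rough sketch (not production code) that walks the pretty.gbm.tree output recursively and prints a rule per terminal node. It treats every split as a numeric "var < value" comparison; ordered factors are split on their internal level index, and unordered factors (whose SplitCodePred indexes into gbm1$c.splits) as well as the MissingNode branch would need extra handling:
tree_rules <- function(gbm.obj, i.tree = 1) {
  tree <- pretty.gbm.tree(gbm.obj, i.tree = i.tree)
  walk <- function(node, conds) {
    row <- tree[as.character(node), ]            # row names are "0", "1", ...
    if (row$SplitVar == -1) {                    # terminal node: print the rule
      if (length(conds) == 0) conds <- "TRUE"
      cat("If", paste(conds, collapse = " and "),
          "then score =", row$SplitCodePred, "\n")
      return(invisible(NULL))
    }
    var <- gbm.obj$var.names[row$SplitVar + 1]   # SplitVar is zero-based
    walk(row$LeftNode,  c(conds, paste(var, "<",  row$SplitCodePred)))
    walk(row$RightNode, c(conds, paste(var, ">=", row$SplitCodePred)))
  }
  walk(0, character(0))
}
tree_rules(gbm1, i.tree = 1)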
The mlmeta package has a function gbm2sas that exports a GBM model from R to SAS.
Here is a very generic answer describing how this might be done.
Add some R code to write the output to a file. https://stat.ethz.ch/R-manual/R-devel/library/base/html/sink.html
Then through SAS, access the ability to execute R with: http://support.sas.com/documentation/cdl/en/hostunx/61879/HTML/default/viewer.htm#a000303551.htm
(You'll need to know where your R executable is so you can point SAS at it to run the script.)
From there you should be able to manipulate the output within SAS to do any scoring you may need.
If it is simply a one-time scoring exercise and not a recurring process, omit the SAS execution of R and simply develop SAS code to parse the R output file.
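To make the file-writing step concrete, here is a minimal sketch that dumps every tree of the gbm1 object above to a text file with sink(), which SAS code could then parse:
sink("gbm_trees.txt")                        # redirect console output to a file
for (i in seq_len(gbm1$n.trees)) {
  print(pretty.gbm.tree(gbm1, i.tree = i))   # one block per tree
}
sink()                                       # restore normal output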