I sometimes use the psych::corr.test function with two data frames, like this:
df1 <- tibble(a=c(1,2,4,5,67,21,21,65,1,5), b=c(21,5,2,6,8,4,2,6,2,2))
df2 <- tibble(a=c(1,2,3,4,5,6,7,8,9,8), b=c(1,6,54,8,3,8,9,5,2,1), c=c(1,4,6,8,5,3,9,7,5,4))
corr <- corr.test(df1,df2, adjust = "BH")
I take the p-values from corr$p.adj, but sometimes they look strangely repetitive:
          a         b         c
a 0.5727443 0.5964993 0.5727443
b 0.2566757 0.5727443 0.2566757
Does anyone know how adequate these p-values are? Can we do this with the corr.test? If not, how can I make an asymmetric correlation?
I should stress that if I run a symmetric correlation instead, like
df <- bind_cols(df1,df2[-3])
corr <- corr.test(df, adjust = "BH")
its p-values are not so repetitive:
Probability values (Entries above the diagonal are adjusted for multiple tests.)
      a...1 b...2 a...3 b...4
a...1  0.00  0.97  0.62  0.72
b...2  0.97  0.00  0.38  0.62
a...3  0.39  0.06  0.00  0.62
b...4  0.60  0.40  0.41  0.00
UPD: Okay, I realised that it's as repetitive as the first and I'm a bit stupid.
The BH correction is based on computing the cumulative minimum of n/i * p, where p holds your n = 6 unadjusted p-values in decreasing order, and i is 6:1. (You can see the calculation in stats::p.adjust, which is a base R function rather than part of psych.)
Because it's a cumulative minimum (i.e. the first value, then the min of the first and second, then the min of the first to third, etc.) there are likely to be repetitions.
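To make the cumulative-minimum step concrete, here is a minimal base R sketch. It assumes the six unadjusted p-values are the rounded entries below the diagonal of the symmetric output shown in the question; the variable names are only illustrative.

```r
# Unadjusted p-values read off the lower triangle of the printed matrix (rounded)
p <- c(0.97, 0.39, 0.06, 0.60, 0.40, 0.41)
n <- length(p)

o <- order(p, decreasing = TRUE)   # positions of the p-values, largest first
i <- n:1                           # ranks, from n down to 1

# BH: scale each sorted p-value by n/i, then take the running minimum
adj_sorted <- pmin(1, cummin(n / i * p[o]))

# Put the adjusted values back into the original order
manual <- adj_sorted[order(o)]

print(round(manual, 3))
print(round(p.adjust(p, method = "BH"), 3))
```

The running minimum copies 0.615 (that is, 6/4 * 0.41) onto the entries for 0.40 and 0.39 as well, which is exactly the kind of repetition seen in the adjusted matrix.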
I want to simulate 100 observations with 5 columns, with a correlation of 0.5 between each pair of columns. To do this, I set up the following correlation matrix:
F1 <- matrix( c(1, .5, .5, .5,.5,
.5, 1, .5, .5,.5,
.5, .5, 1, .5,.5,
.5, .5, .5, 1,.5,
.5, .5, .5, .5,1
), 5,5)
To simulate the intended data frame, I have done this, but it does not work properly.
df2 <- as.data.frame (rbinom(100, 1,.5),ncol(5), F1)
I'm surprised this isn't a duplicate (this question refers specifically to non-binary responses, i.e. binomial with N>1). The bindata package does what you want.
library(bindata)
## set up correlation matrix (compound-symmetric with rho=0.5)
m <- matrix(0.5,5,5)
diag(m) <- 1
Simulate with a mean of 0.5 (as in your example):
set.seed(101)
## this simulates 10 rather than 100 realizations
## (I didn't read your question carefully enough)
## but it's easy to change
r <- rmvbin(n=10, margprob=rep(0.5,5), bincorr=m)
round(cor(r),2)
Results
1.00 0.22 0.80 0.05 0.22
0.22 1.00 0.00 0.65 1.00
0.80 0.00 1.00 -0.09 0.00
0.05 0.65 -0.09 1.00 0.65
0.22 1.00 0.00 0.65 1.00
This looks wrong, since the correlations aren't exactly 0.5, but on average they will be (when I sampled 10,000 vectors rather than 10, the values ranged from about 0.48 to 0.51). Equivalently, if you simulated many samples of 10 and computed the correlation matrix for each, you should find that the expected (average) correlation matrix is correct.
Simulating values with correlation exactly equal to the specified value is much harder (and not necessarily what you want to do anyway, depending on the application).
Note that there are limits on which mean vectors and correlation matrices are feasible. For example, the off-diagonal elements of an n-by-n compound-symmetric (equal-correlation) matrix can't be less than -1/(n-1). Similarly, there may be limits on what correlations are possible for a given set of means (this may be discussed in the technical reference, I haven't checked).
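The lower bound of -1/(n-1) can be checked numerically: a correlation matrix must be positive semi-definite, so its smallest eigenvalue cannot be negative. The following is a small base R sketch; the helper names cs_matrix and min_eigen are made up for this illustration and are not part of bindata.

```r
# Build an n-by-n compound-symmetric correlation matrix with off-diagonal rho
cs_matrix <- function(n, rho) {
  m <- matrix(rho, n, n)
  diag(m) <- 1
  m
}

# Smallest eigenvalue; a negative value means the matrix is not a valid correlation matrix
min_eigen <- function(n, rho) {
  min(eigen(cs_matrix(n, rho), symmetric = TRUE, only.values = TRUE)$values)
}

n <- 5
print(min_eigen(n, 0.5))           # clearly positive: feasible
print(min_eigen(n, -1 / (n - 1)))  # essentially zero: on the boundary
print(min_eigen(n, -0.3))          # negative: not feasible
```

For n = 5 the bound is -0.25, so rho = 0.5 from the question is comfortably inside the feasible range. This only checks the matrix itself, not the additional constraints that binary margins impose.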
The reference for this method is
Leisch, Friedrich and Weingessel, Andreas and Hornik, Kurt (1998) On the generation of correlated artificial binary data. Working Papers SFB "Adaptive Information Systems and Modelling in Economics and Management Science", 13. SFB Adaptive Information Systems and Modelling in Economics and Management Science, WU Vienna University of Economics and Business, Vienna. https://epub.wu.ac.at/286/
I'm trying to train a classifier for the classes "Hit" and "Miss" based on the variables User, Planning Horizon, Material, and some more.
Most of them are categorical variables, except for Planning Horizon (integer).
I have unbalanced data, so I'm trying to use thresholding to select the final output of the model (rather than just using the default 0.5 probability).
The variable User has the most impact on the class outcome, so I'm trying to use a different threshold for every user. I'm thinking about using the naive Bayes posterior probability P(Class|User).
The question is: how can I apply those different rules to the output matrix of the model?
The "Thresholds matrix", a different threshold for every user:
User P("Hit"|User)
A 0.80
B 0.40
C 0.61
These are the outputs of the classifier (P(Hit) and P(Miss)); the last column (Final Prediction) is what I need to construct.
User P("Miss") P("Hit") Final Prediction
B 0.79 0.21 Miss
B 0.20 0.80 Hit
A 0.15 0.85 Hit
C 0.22 0.78 Hit
A 0.90 0.10 Miss
B 0.80 0.20 Miss
Notice that the first row gets a Miss because P(Hit) = 0.21 is lower than the threshold P(Hit|User=B) = 0.40.
I would merge my threshold matrix and then create the Final Prediction column by hand like this.
df <- read.table(text='User P("Miss") P("Hit") "Final Prediction"
B 0.79 0.21 Miss
B 0.20 0.80 Hit
A 0.15 0.85 Hit
C 0.22 0.78 Hit
A 0.90 0.10 Miss
B 0.80 0.20 Miss',
header=TRUE, sep=' ', check.names=FALSE)
thm <- read.table(text='User P("Hit"|User)
A 0.80
B 0.40
C 0.61',
header=TRUE, sep=' ', check.names=FALSE)
thmdf <- merge(thm, df)
thmdf['My Final Prediction'] <-
ifelse(thmdf$`P(Hit)` < thmdf$`P(Hit|User)`,
'Miss',
'Hit')
thmdf
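As a variation on the merge above, a named vector can serve as the lookup table, which keeps the rows in their original order (merge sorts by the key). This is only a sketch: the column names p_hit and final and the object names are assumptions for the example, while the numbers are the ones from the question.

```r
# Per-user thresholds, stored as a named vector for lookup by user name
thresholds <- c(A = 0.80, B = 0.40, C = 0.61)

# Classifier output from the question (only P(Hit) is needed)
preds <- data.frame(
  User  = c("B", "B", "A", "C", "A", "B"),
  p_hit = c(0.21, 0.80, 0.85, 0.78, 0.10, 0.20),
  stringsAsFactors = FALSE
)

# Look up each row's threshold by user name, then compare
row_threshold <- unname(thresholds[as.character(preds$User)])
preds$final <- ifelse(preds$p_hit >= row_threshold, "Hit", "Miss")

print(preds)
```

The as.character() call guards against User being a factor, in which case indexing would silently use the integer codes instead of the names.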
Say there are $n$ independent events. Event $i$ has a probability $p_i$ and an associated loss $l_i$. My goal is to produce a list of all possible loss amounts and their associated probabilities.
Eventually I would like to extend this to sets of 10-20 events with variable probabilities and loss amounts. This will all be done in R.
The various outcomes are given by the power set, e.g. for three events: (null), (A), (B), (C), (A and B), (A and C), (B and C), (A and B and C). The probability of each of these outcomes can be found by taking the product of the probabilities in each subset, and the total loss by taking the sum of losses in each subset.
My problem is how to aggregate by the loss amounts, i.e. to find all unique loss amounts in the power set and produce their probabilities.
I feel like I'm halfway there with the inclusion/exclusion principle, but I can't quite get my head around how to apply it to my particular problem, especially as the number of events goes above 3, or in the case of the sets of intermediate size, e.g. how to group all the 2 element sets above.
For a problem this small--there are at most 2^20 (around a million) possibilities--brute force works fine.
To illustrate, let's generate some data of moderate size:
n <- 15
set.seed(17)
p <- runif(n)
loss <- ceiling(rgamma(n, 3, 1/2))
signif(rbind(Probability=p, Loss=loss), 2)
Here are the input values for this example:
Probability 0.16 0.97 0.47 0.78 0.41 0.54 0.21 0.19 0.78 0.19 0.43 0.0023 0.83 0.83 0.96
Loss 12.00 4.00 10.00 8.00 10.00 6.00 12.00 5.00 4.00 8.00 8.00 8.0000 4.00 4.00 4.00
Generate a binary indicator of the power set with expand.grid and then use array operations for relatively fast calculation of the losses and the probabilities of all the possible outcomes:
powerset <- t(expand.grid(lapply(p, function(x) 0:1)))
probability <- apply(powerset * (2*p - 1) + (1-p), 2, prod)
losses <- colSums(powerset * loss)
(On this aging Xeon workstation, this takes up to 5 seconds when n is 20.)
Summarize by loss using tapply:
x <- tapply(probability, losses, sum)
(This takes another 1 to 2 seconds when n is 20.)
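To see what the array expression is doing, here is the same calculation on a tiny, hypothetical input of three events (the probabilities and losses below are made up for illustration). For each outcome, the term powerset * (2*p - 1) + (1-p) evaluates to p when the event occurs (indicator 1) and to 1-p when it does not (indicator 0), so the column product is the probability of that exact outcome.

```r
# Hypothetical inputs: three independent events A, B, C
p    <- c(0.1, 0.2, 0.3)
loss <- c(5, 5, 10)

# Each column is one outcome; rows are events (1 = occurs, 0 = does not)
powerset <- t(expand.grid(lapply(p, function(x) 0:1)))

# p where the indicator is 1, (1 - p) where it is 0, multiplied down each column
probability <- apply(powerset * (2 * p - 1) + (1 - p), 2, prod)
losses <- colSums(powerset * loss)

# Aggregate outcomes that share the same total loss
x <- tapply(probability, losses, sum)
print(x)
```

A loss of 5 arises from "A only" (0.1 * 0.8 * 0.7 = 0.056) and "B only" (0.9 * 0.2 * 0.7 = 0.126), so its aggregated probability is 0.182.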
We can check for consistency by (a) verifying the probabilities sum to unity and (b) checking that the expected loss is the sum of the expected losses of the individual events:
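# Compare with a tolerance: exact equality rarely holds in floating point
if(!isTRUE(all.equal(sum(probability), 1))) warning("Unnormalized probability.")
if(!isTRUE(all.equal(sum(probability * losses), sum(p*loss)))) warning("Inconsistent result.")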
Let's plot the resulting loss distribution.
library(ggplot2)
ggplot(data.frame(Loss=as.numeric(names(x)), Probability=x),
aes(Loss, Probability)) +
geom_col(color="White")
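If the number of events grows well beyond 20, enumerating the power set stops being practical. One alternative, which is not the brute-force method used above, is to add the events one at a time and update the loss distribution by convolution. The sketch below assumes the losses are non-negative integers (as in the example data); loss_distribution is a hypothetical helper name.

```r
# Build the total-loss distribution one event at a time.
# Assumes losses are non-negative integers; dist[k + 1] holds P(total loss == k).
loss_distribution <- function(p, loss) {
  dist <- 1  # before any event is considered, the total loss is 0 with probability 1
  for (j in seq_along(p)) {
    stay  <- c(dist, numeric(loss[j])) * (1 - p[j])  # event j does not occur
    shift <- c(numeric(loss[j]), dist) * p[j]        # event j occurs: shift by its loss
    dist  <- stay + shift
  }
  names(dist) <- seq_along(dist) - 1
  dist
}

# Hypothetical inputs: three independent events
d <- loss_distribution(p = c(0.1, 0.2, 0.3), loss = c(5, 5, 10))
print(d[d > 0])
```

The work grows with the number of events times the total possible loss rather than with 2^n, at the cost of requiring the losses to sit on an integer grid.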
I'm working with the Rssa package to decompose time series, which works fine except that I can't get the percentage of variance explained by each eigenvector (if these are the right words for it). However, these percentages are shown at the top of one of the graphs I can plot with this package.
Let me give an example:
d=rnorm(200,10,3)
plot(d,type="l")
ssa=ssa(d, L = 100,digits=0)
plot(ssa,type="vector") #the percentage I want is in the title of each individual graph
# to reconstruct the trend and the residuals
res <- reconstruct(ssa, groups = list(1))
trend <- res$F1
How do I get these percentages in a vector? Especially since I want to loop over multiple series.
Thank you!
It seems that the code for the weighted norm of the series by component is hidden in the package.
I extracted the code from Rssa:::.plot.ssa.vectors.1d.ssa and wrapped it in a small function:
component_wnorm <-
function(x) {
# x is the ssa object passed in; only the first 10 components are reported
idx <- seq_len(min(nsigma(x), 10))
total <- wnorm(x)^2
round(100*x$sigma[idx]^2 / total, digits = 2)
}
component_wnorm(ssa)
[1] 92.02 0.35 0.34 0.27 0.27 0.25 0.22 0.20 0.20 0.18
Recent versions of Rssa provide the function contributions.
Therefore, you can use
> s <- ssa(d, L=100)
> c <- contributions(s)*100
> print(c[1:10], digits = 2)
[1] 92.41 0.28 0.26 0.26 0.26 0.23 0.23 0.21 0.20 0.20
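For intuition about what these percentages mean, here is a base R sketch that does not use Rssa at all. It assumes basic one-dimensional SSA, where the series is embedded in a trajectory (Hankel) matrix and each component's share is its squared singular value divided by the sum of all squared singular values. The seed and variable names are arbitrary, so the numbers will not match the outputs above exactly.

```r
set.seed(1)
d <- rnorm(200, 10, 3)   # same kind of series as in the question
L <- 100                 # window length
K <- length(d) - L + 1

# Trajectory (Hankel) matrix: column k holds d[k], ..., d[k + L - 1]
X <- sapply(seq_len(K), function(k) d[k:(k + L - 1)])

# Singular values only; their squares sum to the squared Frobenius norm of X
sv <- svd(X, nu = 0, nv = 0)$d

# Percentage contribution of each component
share <- 100 * sv^2 / sum(sv^2)
print(round(share[1:10], 2))
```

Because the series has a mean of about 10 and a standard deviation of about 3, the first component (essentially the mean level) dominates, which is the same pattern as in the Rssa output.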
When I use the principal function, like in the following code, I get a nice table which gives all the standardized loadings, as well as a table with the eigenvalues and the proportion and cumulative proportion explained.
rotatedpca <- principal(PCFdataset, nfactors = 8, rotate = "varimax", scores = T)
I would like to export this output to an excel file (using WriteXLS), but I can only do that for dataframes, and rotatedpca is not a dataframe and cannot be coerced into one it seems. I am able to extract the standardized loadings by using the following code:
loadings<-as.data.frame(unclass(rotatedpca$loadings))
But I cannot figure out how to access the other information that normally displays when I simply call the principal function, in particular the eigenvalues and the proportion and cumulative variance explained. I tried rotatedpca$values, but that returns what looks like the unrotated eigenvalues for all 12 original variables, which I don't understand. And I haven't been able to figure out any way to even try to extract the variance explained values. How can I simply create a data frame that looks like the R output I get below from the principal function, for example?
                       RC2  RC3  RC8  RC1  RC4  RC5  RC6  RC7
SS loadings           1.52 1.50 1.45 1.44 1.01 1.00 0.99 0.98
Proportion Var        0.13 0.12 0.12 0.12 0.08 0.08 0.08 0.08
Cumulative Var        0.13 0.25 0.37 0.49 0.58 0.66 0.74 0.82
Proportion Explained  0.15 0.15 0.15 0.15 0.10 0.10 0.10 0.10
Cumulative Proportion 0.15 0.31 0.45 0.60 0.70 0.80 0.90 1.00
Thanks for reading my post!
I have just added this feature to the latest (as of today) release of psych 1.3.10.11.
If you run either
f3 <- fa(Thurstone,3)
#or
p3 <- principal(Thurstone,3)
#then
p <- print(f3)
p # will give you
p
$Vaccounted
                            MR1       MR2       MR3
SS loadings           2.6411150 1.8621522 1.4951831
Proportion Var        0.2934572 0.2069058 0.1661315
Cumulative Var        0.2934572 0.5003630 0.6664945
Proportion Explained  0.4402995 0.3104389 0.2492616
Cumulative Proportion 0.4402995 0.7507384 1.0000000
In general, if you have suggestions or questions re the psych package, you will get a faster answer if you contact me directly.
Bill
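If an older version of psych is in use, a similar table can be rebuilt in base R from the loadings matrix alone. This is a sketch under stated assumptions, not psych's own code: it assumes an orthogonal rotation such as varimax and one row per variable, and variance_table and the toy loadings are hypothetical.

```r
# Rebuild the variance-accounted-for table from a loadings matrix.
# Assumes an orthogonal rotation (e.g. varimax); rows = variables, columns = components.
variance_table <- function(loadings) {
  L <- unclass(loadings)
  ss <- colSums(L^2)        # SS loadings
  prop <- ss / nrow(L)      # share of the total variance (one unit per variable)
  expl <- ss / sum(ss)      # share of the variance captured by the retained components
  out <- rbind(
    "SS loadings"           = ss,
    "Proportion Var"        = prop,
    "Cumulative Var"        = cumsum(prop),
    "Proportion Explained"  = expl,
    "Cumulative Proportion" = cumsum(expl)
  )
  as.data.frame(out)
}

# Toy loadings for four variables on two components (made-up numbers)
toy <- matrix(c(0.8, 0.6, 0, 0,
                0, 0, 0.6, 0.8),
              nrow = 4,
              dimnames = list(paste0("V", 1:4), c("RC1", "RC2")))
tab <- variance_table(toy)
print(tab)
```

With the question's object this would be called as variance_table(rotatedpca$loadings), and the resulting data frame can then be passed to a spreadsheet writer in the same way as the loadings.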
Why not this:
capture.output( print(rotatedpca), file="pc.txt")
You can read desired portions into Excel using its Text to Columns... function off the /Data menu. Or you can just paste it into an open blank Excel document and select the rows you want to convert. Use the "fixed" option that will probably be offered automagically.