Reproducing output from one statistical program to R - r

A previous employee from my organization performed all of their analyses on a different statistical program than R (with no documentation), and no one currently employed knows which program was used. Looking at the model output table and comparing it to Google search results, I think they used Statistica. In an effort to be transparent with other organizations who work with us, I'm trying to replicate their work and potentially reevaluate it.
Model: They modeled the relationship between three variables which I will call A, B, C. Variables were chosen based on exploratory analyses (i.e., correlation matrices and GLM modeling). Parameter estimates are used for prediction purposes. From what I can tell, they used a GLM with a log-link function to model C as a function of A and B.
Data:
A <- c(0.937918714, 1.277501774, 34.46428571, 3.843879361, 5.135520685, 0.324675325, 1.038421599, 0.333333333, 0.058139535, 0.09009009, 0.080515298, 5.174234424, 10.625, 21.9047619, 0.162074554, 2.372881356, 1.084430674, 18.53658537, 6.438631791, 0.172413793, 0.291120815, 9.090909091, 5.882352941)
B <- c(0.416666667, 0.555555556, 0.833333333, 0.4, 0.833333333, 0.4, 0.625, 0.625, 0.294117647, 0.37037037, 0.285714286, 1.111111111, 0.588235294, 0.476190476, 0.555555556, 0.833333333, 0.666666667, 0.476190476, 0.208333333, 0.163934426, 0.163934426, 0.3125, 0.454545455)
C <- c(0.009533367, 0.020812183, 0.056208054, 0.015002587, 0.042735043, 0.013661202, 0.004377736, 0.00635324, 0.001345895, 0.001940492, 0.00446144, 0.043768997, 0.021134594, 0.004471772, 0.023488256, 0.029441118, 0.052287582, 0.003526093, 0.030984508, 0.010891089, 0.020812686, 0.016032064, 0.018145161)
My Approach:
I combined each vector into a data frame (dat) and modeled using the following:
glm(formula = C ~ A + B, family = binomial(link = logit), data = dat)
The Question:
I notice we have different parameter estimates; in fact, their analysis includes 'Scale' as a factor, with an associated parameter estimate and standard error (see below). I haven't figured out how to include a separate 'Scale' factor. My parameter estimates are close to theirs, but are obviously different with the inclusion of a new variable.
Is anyone familiar with this [Statistica] output and how I could replicate it in R? Primarily, how would I incorporate the Scale factor into my analyses?
Side-note
I've also posted this to Reddit (r/rstats - Replicating an analysis performed in different software).
Much appreciated!
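One possible reading, offered as a sketch rather than a confirmed answer: if the unknown program reports 'Scale' as an estimated dispersion-type parameter (an assumption, since the original output table is not available here), the closest R analogue is the dispersion estimate returned by summary() for a GLM. Note also that the call above uses a logit link, whereas the description mentions a log link. The Gamma family and the simulated data frame dat_sim below are hypothetical choices made only so the example runs on its own; swap in the real dat and whichever family matches the original analysis.

```r
# Hypothetical stand-in data: a positive, right-skewed response similar in
# spirit to C above. Replace dat_sim with the real data frame `dat`.
set.seed(123)
n <- 200
dat_sim <- data.frame(A = rexp(n, rate = 0.2), B = runif(n, 0.2, 1.1))
mu <- exp(-5 + 0.03 * dat_sim$A + 1.5 * dat_sim$B)
dat_sim$C <- rgamma(n, shape = 2, rate = 2 / mu)

# GLM with a log link; the Gamma family is an assumption, not a known fact
# about the original analysis.
fit <- glm(C ~ A + B, family = Gamma(link = "log"), data = dat_sim)

# Parameter estimates and standard errors for the intercept, A and B
coef(summary(fit))

# Estimated dispersion; a candidate counterpart of a reported 'Scale' row
scale_hat <- summary(fit)$dispersion
scale_hat
```

If the estimates for A and B match the old table under some family and link, comparing scale_hat (or its square root or reciprocal, since programs parameterize this differently) against the reported 'Scale' value is a reasonable next check.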

Related

Summary function for principal coordinates analysis

R function prcomp() for performing principal components analysis provides a handy summary() method which (as is expected) offers a summary of the results at a glance:
data(cars)
pca <- prcomp(cars)
summary(pca)
Importance of components:
PC1 PC2
Standard deviation 26.1252 3.08084
Proportion of Variance 0.9863 0.01372
Cumulative Proportion 0.9863 1.00000
I would like a similar summary for displaying the results of a principal coordinates analysis, usually performed using the cmdscale() function:
pco <- cmdscale(dist(cars), eig=TRUE)
But this function does not provide a summary() method and therefore there is no direct way of displaying the results as percent variances, etc.
Has anyone out there already developed some simple method for summarizing such results from PCO?
Thanks in advance for any assistance you can provide!
With best regards,
Instead of the basic cmdscale, one can use capscale or dbrda from the vegan package. These functions generalize PCO, as they "also perform unconstrained principal coordinates analysis, optionally using extended dissimilarities" (quoted from the ?capscale help page). This is much more than a workaround.
The following gives an example:
library(vegan)
A <- data.frame(
x = c(-2.48, -4.03, 1.15, 0.94, 5.33, 4.72),
y = c(-3.92, -3.4, 0.92, 0.78, 3.44, 0.5),
z = c(-1.11, -2.18, 0.21, 0.34, 1.74, 1.12)
)
pco <- capscale(dist(A) ~ 1) # uses euclidean distance
# or
pco <- capscale(A ~ 1) # with same result
pco # shows some main results
summary(pco) # shows more results, including eigenvalues
The summary function has several arguments, documented in ?summary.cca. Parts of the result can be extracted and formatted by the user, for example the eigenvalues with pco$CA$eig.
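If adding a package is not an option, a summary table can also be built by hand from the eigenvalues that cmdscale() returns when eig = TRUE. The helper pco_summary() below is a hypothetical name, not part of base R, and dropping non-positive eigenvalues before computing proportions is one design choice among several (it matters only for non-Euclidean dissimilarities).

```r
# Minimal sketch: a prcomp-like summary for a cmdscale() result.
# pco_summary() is a made-up helper, not an existing function.
pco_summary <- function(fit, tol = sqrt(.Machine$double.eps)) {
  eig <- fit$eig
  # keep only eigenvalues that are clearly positive
  pos <- eig[eig > tol * max(eig)]
  prop <- pos / sum(pos)
  out <- rbind("Eigenvalue" = pos,
               "Proportion of Variance" = prop,
               "Cumulative Proportion" = cumsum(prop))
  colnames(out) <- paste0("PCo", seq_along(pos))
  out
}

pco <- cmdscale(dist(cars), eig = TRUE)
res <- pco_summary(pco)
round(res, 4)
```

With Euclidean distances, as here, the proportions should agree with those reported by summary(prcomp(cars)).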

Clustering with non independent variables and very large data set

I have a very large data set (~400,000 instances) that looks like the data below.
data <- as.data.frame(matrix(0, 10, 5))
samp <- function(){
x <-sample( c(0:9), 5, replace =TRUE, prob = c(0.5, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05))
return(x)
}
data <- lapply(split(data, c(1:10)), function(x) samp() )
data <- do.call(rbind.data.frame, data)
colnames(data) <- c("fail","below_a", "aver", "above_a", "exceed")
data$class_size <- apply(data[1:5] , 1, FUN = sum)
class_prof <- sample(letters[1:6], nrow(data), replace = T)
data$class_prof <- class_prof
I am trying to cluster this set, but there are the following problems:
Class size is the sum of the first five columns. I think this may cause a collinearity issue, but it is an important variable.
The first five variables are not independent: they are the results of measuring the same quality, and everyone in the class must fall into one of the categories.
The set is really big; the only algorithm that did not have convergence issues was kmeans (without using the class profile variable).
I can drop the categorical variable, as it can be included in the models at a later stage, but I am keen to try some methods that use it as well and compare the results.
For the convergence problems, I tried downsampling, but for many methods I need to downsample to 5000-7000 instances to avoid memory issues, which is less than 2% of the original data.
What method could be applied here using R packages?
Try doing a principal components analysis on the data, then kmeans or knn on the number of dimensions you decide to keep.
There are a couple of different packages that are fairly straightforward to use for this. You'll have to mean-center and scale your data beforehand. You'll also have to convert any factors to numeric columns using one-hot encoding (one column for every level of the original factor column).
Look into 'prcomp' or 'princomp'.
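The steps above can be sketched roughly as follows, using only base R. The simulated data frame, the 80% variance cut-off, and the choice of four clusters are all illustrative assumptions, not recommendations for the real data.

```r
# Hypothetical data shaped like the question's example
set.seed(1)
n <- 1000
counts <- matrix(sample(0:9, n * 5, replace = TRUE), ncol = 5)
colnames(counts) <- c("fail", "below_a", "aver", "above_a", "exceed")
dat <- as.data.frame(counts)
dat$class_size <- rowSums(counts)
dat$class_prof <- factor(sample(letters[1:6], n, replace = TRUE))

# One-hot encode the factor: with the intercept removed, model.matrix()
# creates one indicator column per level of class_prof
X <- model.matrix(~ . - 1, data = dat)

# PCA on centered and scaled columns
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Keep enough components to explain 80% of the variance (arbitrary cut-off)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_explained >= 0.8)[1]

# k-means on the retained component scores; centers = 4 is arbitrary
km <- kmeans(pca$x[, seq_len(k), drop = FALSE],
             centers = 4, nstart = 10, iter.max = 50)
table(km$cluster)
```

Because class_size is an exact linear combination of the five count columns, the PCA simply assigns that redundancy to a component with near-zero variance, which is then dropped by the cut-off.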

Correcting for multiple comparisons in permutation procedure using R and multtest

I have carried out a permutation test comprising a null distribution of distances and 5 observed distances as statistics. Now I would like to correct for multiple comparisons using the max-T method, using the multtest package and the ss.maxT, ss.minP and/or sd.maxT functions.
But I have problems implementing the functions and making sense of the results: the first function only gives 1s as a result, the second only gives back the unadjusted p-values, and the third throws an error.
Please see example data below:
## Example data
# Observed distances
obs <- matrix(c(0.001, 0.2, 0.50, 0.9, .9999))
null_values <- runif(20)
# Null distribution of distances
null <- matrix(null_values, nrow = length(obs), ncol = length(c(1:20)), byrow=TRUE)
null
# Hypotheses
alternative <- "more"
# The unadjusted raw p-value
praw <- c(0, 0.1, 0.45, 0.85, 1)
# Only getting 1s as results
adjusted_p_values_max <- multtest::ss.maxT(null, obs, alternative, get.cutoff=FALSE,
get.cr = FALSE, get.adjp = TRUE, alpha = 0.05)
adjusted_p_values_max
# Should probably use this one: but getting praw back, which is supposedly correct (but perhaps odd)
# this is because of the null distribution being identical for all 5 variables.
# Hence, should each word be tested against its own unique null distribution?
adjusted_p_values_min <- multtest::ss.minP(null, obs, praw, alternative, get.cutoff=FALSE,
get.cr = FALSE, get.adjp = TRUE, alpha=0.05)
adjusted_p_values_min
# Throwing an error
adjusted_p_values_sdmax <- sd.maxT(null, obs, alternative, get.cutoff=TRUE,
get.cr = TRUE, get.adjp = TRUE, alpha = 0.05)
adjusted_p_values_sdmax
Considering the very different conclusions from the first two methods, I’m wondering if my plan to implement these methods is incorrect in the first place. Basically, I want to examine several hundred distances against a null distribution of several thousand.
obs = The observed distances between different observed points in space to the same “original” point A. (Hence, distances are not independent since they all relate to the same point)
null = The null distribution comprises distances between points that have been randomly selected (replacement = TRUE) from the different observed points and the same original point A.
Using ss.maxT seems way too conservative to me, whereas it seems unnecessary to use ss.minP if it “just” returns the raw p-values. Or what am I missing?
Can I perhaps solve this situation by constructing individual null distributions for every observed distance?
Thank you in advance!
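To make the mechanics concrete, here is a hand-rolled sketch of a single-step max-T adjustment in base R. It is not the multtest implementation and makes no claim about how ss.maxT behaves internally; it assumes a one-sided "larger is more extreme" test and a null matrix with one row per hypothesis and one column per permutation, where each row holds that hypothesis's own null draws. Note that in the example data above, the 20 values from runif(20) are recycled so that every row of null is identical.

```r
set.seed(42)
m <- 5      # number of hypotheses (observed distances)
B <- 1000   # number of permutations

obs <- c(0.001, 0.2, 0.5, 0.9, 0.9999)

# Hypothetical null: each row is the permutation distribution for one
# hypothesis, each column is one permutation
null <- matrix(runif(m * B), nrow = m, ncol = B)

# Raw p-values: share of a hypothesis's own null values >= its observed value
raw_p <- sapply(seq_len(m), function(j) mean(null[j, ] >= obs[j]))

# Single-step max-T: compare each observed value against the distribution
# of the per-permutation maximum across all hypotheses
max_null <- apply(null, 2, max)
adj_p <- sapply(obs, function(t) mean(max_null >= t))

cbind(obs, raw_p, adj_p)
```

By construction the adjusted p-values can never be smaller than the raw ones, and they grow with the number of hypotheses, which is why the method looks conservative when several hundred distances are tested at once.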

Am I following the correct procedures with the dunn.test function?

I tested for differences in abundance values among sampling sites using kruskal.test. However, I also want to determine which pairs of sites differ.
The dunn.test function has the option to use a data vector with a categorical grouping vector, or to use a formula expression as in lm.
I wrote the call so that it takes a data frame with many columns, but I have not found an example that confirms my procedure.
library(dunn.test)
df<-data.frame(a=runif(5,1,20),b=runif(5,1,20), c=runif(5,1,20))
kruskal.test(df)
dunn.test(df)
My results were:
Kruskal-Wallis chi-squared = 6.02, df = 2, p-value = 0.04929
Kruskal-Wallis chi-squared = 6.02, df = 2, p-value = 0.05
Comparison of df by group
Between 1 and 2 2.050609, 0.0202
Between 1 and 3 -0.141421, 0.4438
Between 2 and 3 -2.192031, 0.0142
I took a look at your code, and you are close. One issue is that you should be specifying a method to correct for multiple comparisons, using the method argument.
Correcting for Multiple Comparisons
For your example data, I'll use the Benjamini-Yekutieli variant of the False Discovery Rate (FDR). The reasons why I think this is a good performer for your data are beyond the scope of StackOverflow, but you can read more about it and other correction methods here. I also suggest you read the associated papers; most of them are open-access.
library(dunn.test)
set.seed(711) # set pseudorandom seed
df <- data.frame(a = runif(5,1,20),
b = runif(5,1,20),
c = runif(5,1,20))
dunn.test(df, method = "by") # correct for multiple comparisons using "B-Y" procedure
# Output
data: df and group
Kruskal-Wallis chi-squared = 3.62, df = 2, p-value = 0.16
Comparison of df by group
(Benjamini-Yekuteili)
Col Mean-|
Row Mean | 1 2
---------+----------------------
2 | 0.494974
| 0.5689
|
3 | -1.343502 -1.838477
| 0.2463 0.1815
alpha = 0.05
Reject Ho if p <= alpha/2
Interpreting the Results
The first row in each cell provides Dunn's pairwise z test statistic for each comparison, and the second row provides your corrected p-values.
Notice that, once corrected for multiple comparisons, none of your pairwise tests are significant at an alpha of 0.05, which is not surprising given that each of your example "sites" was generated by exactly the same distribution. I hope this has been helpful. Happy analyzing!
P.S. In the future, you should use set.seed() if you're going to construct example dataframes using runif (or any other kind of pseudorandom number generation). Also, if you have other questions about statistical analysis, it's better to ask at: https://stats.stackexchange.com/
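As a small supplementary sketch in base R only: the wide data frame can be reshaped into the "data vector plus grouping vector" form mentioned in the question, and the effect of the Benjamini-Yekutieli correction can be seen by applying p.adjust() to the three unadjusted p-values from the question's output. The names long, p_raw and p_by are illustrative, and dunn.test itself is only mentioned in a comment so that the block runs without the package.

```r
set.seed(711)
df <- data.frame(a = runif(5, 1, 20),
                 b = runif(5, 1, 20),
                 c = runif(5, 1, 20))

# Long format: one column of values, one grouping factor (ind)
long <- stack(df)

# Kruskal-Wallis test on the long format
kw <- kruskal.test(values ~ ind, data = long)
kw

# With the package loaded, the corresponding call would be
# dunn.test(long$values, long$ind, method = "by")   (not run here)

# Unadjusted p-values copied from the question's output
p_raw <- c("1 vs 2" = 0.0202, "1 vs 3" = 0.4438, "2 vs 3" = 0.0142)

# Benjamini-Yekutieli adjustment
p_by <- p.adjust(p_raw, method = "BY")
round(p_by, 4)
```

The adjusted values are never smaller than the raw ones, which is the reason results that look significant before correction may no longer be so afterwards.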

Using the apcluster package in R, is it possible to "score" unclustered data points?

I am new to R and I have a request that I am not sure is possible. We have a number of retail locations that my boss would like to use affinity propagation to group into clusters. We will not be clustering based on geographic location. Once he has found a configuration he likes, he wants to be able to input other locations to determine which of those set clusters they should fall into.
The only solution I have been able to come up with is to use the same options and re-cluster with the original points and the new ones added in, however I believe that this might change the outcome.
Am I understanding this right, or are there other options?
Sorry for the late answer, I just incidentally stumbled over your question.
I agree with Anony-Mousse's answer that clustering is the first step and classification is the second. However, I'm not sure whether this is the best option here. Elena601b is obviously talking about a task with truly spatial data, so my impression is that the best approach is to cluster first and then to "classify" new points/samples/locations by looking for the closest cluster exemplar. Here is some code for synthetic data:
## if not available, run the following first:
## install.packages("apcluster")
library(apcluster)
## create four synthetic 2D clusters
cl1 <- cbind(rnorm(30, 0.3, 0.05), rnorm(30, 0.7, 0.04))
cl2 <- cbind(rnorm(30, 0.7, 0.04), rnorm(30, 0.4, .05))
cl3 <- cbind(rnorm(20, 0.50, 0.03), rnorm(20, 0.72, 0.03))
cl4 <- cbind(rnorm(25, 0.50, 0.03), rnorm(25, 0.42, 0.04))
x <- rbind(cl1, cl2, cl3, cl4)
## run apcluster() (you may replace the Euclidean distance by a different
## distance, e.g. driving distance, driving time)
apres <- apcluster(negDistMat(r=2), x, q=0)
## create new samples
xNew <- cbind(rnorm(10, 0.3, 0.05), rnorm(10, 0.7, 0.04))
## auxiliary predict() function
predict.apcluster <- function(s, exemplars, newdata)
{
simMat <- s(rbind(exemplars, newdata),
sel=(1:nrow(newdata)) + nrow(exemplars))[1:nrow(exemplars), ]
unname(apply(simMat, 2, which.max))
}
## assign new data samples to exemplars
predict.apcluster(negDistMat(r=2), x[apres@exemplars, ], xNew)
## ... the result is a vector of indices to which exemplar/cluster each
## data sample is assigned
I will probably add such a predict() method in a future release of the package (I am the maintainer of the package). I hope that helps.
Clustering is not a drop-in replacement for classification.
Few clustering algorithms can meaningfully integrate new information.
The usual approach for your problem, however, is simple:
Do the clustering.
Use the cluster labels as class labels.
Train a classifier.
Apply the classifier to the new data.
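The four steps can be sketched as follows. The choice of kmeans for the clustering step and linear discriminant analysis (lda from the MASS package, which ships with standard R installations) for the classifier is purely illustrative; any clustering method, including affinity propagation, and any classifier could be substituted. The synthetic data and the column names v1 and v2 are hypothetical.

```r
library(MASS)  # for lda()

set.seed(1)
# Two well-separated synthetic groups standing in for the retail locations
x <- rbind(cbind(rnorm(30, 0.3, 0.05), rnorm(30, 0.7, 0.04)),
           cbind(rnorm(30, 0.7, 0.04), rnorm(30, 0.4, 0.05)))
colnames(x) <- c("v1", "v2")

# 1. Do the clustering
km <- kmeans(x, centers = 2, nstart = 10)

# 2. Use the cluster labels as class labels
train <- data.frame(x, cluster = factor(km$cluster))

# 3. Train a classifier on the labelled data
fit <- lda(cluster ~ v1 + v2, data = train)

# 4. Apply the classifier to new locations; the original clusters are
#    left untouched
x_new <- data.frame(v1 = rnorm(10, 0.3, 0.05), v2 = rnorm(10, 0.7, 0.04))
pred <- predict(fit, newdata = x_new)$class
pred
```

The key property is that adding new locations never changes the existing cluster assignments, which addresses the concern about re-clustering raised in the question.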
