In R, how can I compute the summary function in parallel?

I have a huge dataset. I computed a multinomial regression with multinom from the nnet package:
mylogit <- multinom(to ~ RealAge, mydata)
That takes 10 minutes, but when I use the summary function to compute the coefficients it takes more than a day!
This is the code I used:
output <- summary(mylogit)
Coef <- t(as.matrix(output$coefficients))
I was wondering if anybody knows how I can run this part of the code with parallel processing in R?
This is a small sample of the data:
mydata:
to RealAge
513 59.608
513 84.18
0 85.23
119 74.764
116 65.356
0 89.03
513 92.117
69 70.243
253 88.482
88 64.23
513 64
4 84.03
65 65.246
69 81.235
513 87.663
513 81.21
17 75.235
117 49.112
69 59.019
20 90.03

If you just want the coefficients, use the coef() method only; it does far less computation.
Example:
mydata <- readr::read_table("to RealAge
513 59.608
513 84.18
0 85.23
119 74.764
116 65.356
0 89.03
513 92.117
69 70.243
253 88.482
88 64.23
513 64
4 84.03
65 65.246
69 81.235
513 87.663
513 81.21
17 75.235
117 49.112
69 59.019
20 90.03")[rep(1:20, 3000), ]
mylogit <- nnet::multinom(to ~ RealAge, mydata)
system.time(output <- summary(mylogit)) # 6 sec
all.equal(output$coefficients, coef(mylogit)) # TRUE & super fast
If you profile the summary() function, you'll see that most of the time is taken by the crossprod() function.
So, if you really want the output of the summary() function, you could use an optimized math library, such as the MKL provided by Microsoft R Open.
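For reference, a minimal sketch of how to confirm this with base R's profiler (the exact breakdown will vary with your machine and BLAS library):
Rprof(tmp <- tempfile())           # start profiling to a temporary file
output <- summary(mylogit)
Rprof(NULL)                        # stop profiling
head(summaryRprof(tmp)$by.total)   # crossprod() typically dominates
unlink(tmp)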

Related

Variable lengths differ - fligner.test

I have a dataset like the following:
attack defense sp_attack sp_defense speed is_legendary
60 62 63 80 60 0
80 100 123 122 120 0
39 52 43 60 65 0
58 64 58 80 80 0
90 90 85 125 90 1
100 90 125 85 90 1
106 150 70 194 120 1
100 100 100 100 100 1
90 85 75 115 100 1
From this dataset, I want to check whether there is heteroscedasticity between two groups: legendary vs. non-legendary Pokémon. To do that, I first checked the normality of the data for the legendary and non-legendary Pokémon as follows:
# Shapiro test for legendary and non-legendary Pokémon, hp comparison
shapiro.test(df_net$hp[df_net$is_legendary==0])
shapiro.test(df_net$hp[df_net$is_legendary==1])
I've seen that in both cases the data are not normally distributed. Now I've decided to carry out a Fligner test as follows:
fligner.test(hp[df_net$is_legendary==0] ~ hp[df_net$is_legendary==1], data = df_net)
However, I obtain the following error:
Error in model.frame.default(formula = hp[df_net$is_legendary == 0] ~ : variable lengths differ (found for 'hp[df_net$is_legendary == 1]')
I guess this is because the number of legendary observations differs from the number of non-legendary ones, but then how can I check the heteroscedasticity between these two groups?
The correct syntax for fligner.test is
fligner.test(x ~ group, data)
In your case the correct syntax would be (e.g. for the variable sp_defense):
fligner.test(sp_defense ~ is_legendary, data=df_net)
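A minimal reproducible sketch using the sample rows from the question (here df_net stands in for your full dataset):
df_net <- read.table(header = TRUE, text = "
attack defense sp_attack sp_defense speed is_legendary
60 62 63 80 60 0
80 100 123 122 120 0
39 52 43 60 65 0
58 64 58 80 80 0
90 90 85 125 90 1
100 90 125 85 90 1
106 150 70 194 120 1
100 100 100 100 100 1
90 85 75 115 100 1")
fligner.test(sp_defense ~ is_legendary, data = df_net)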

multidimensional data clustering

Problem: I have two groups of multidimensional heterogeneous data. I have concocted a simple illustrative example below. Notice that some columns are discrete (age) while some are binary (gender) and another is even an ordered pair (pant size).
Person Age gender height weight pant_size
Control_1 55 M 167.6 155 32,34
Control_2 68 F 154.1 137 28,28
Control_3 53 F 148.9 128 27,28
Control_4 57 M 167.6 165 38,34
Control_5 62 M 147.4 172 36,32
Control_6 44 M 157.6 159 32,32
Control_7 76 F 172.1 114 30,32
Control_8 49 M 161.8 146 34,34
Control_9 53 M 164.4 181 32,36
Person Age gender height weight pant_size
experiment_1 39 F 139.6 112 26,28
experiment_2 52 M 154.1 159 32,32
experiment_3 43 F 148.9 123 27,28
experiment_4 55 M 167.6 188 36,38
experiment_5 61 M 161.4 171 36,32
experiment_6 48 F 149.1 144 28,28
The question is: does the entire experimental group differ significantly from the entire control group?
Or, roughly speaking, do they form two distinct clusters in the space of [age, gender, height, weight, pant_size]?
The general idea of what I've tried so far is a metric that compares corresponding columns of the experimental group to those of the control; the metric then takes the sum of the column scores (see below). A somewhat arbitrary threshold is picked to decide whether the two groups are different. This arbitrariness is compounded by the weighting of the columns, which is also somewhat arbitrary. Remarkably, this approach is performing well on the actual problem I have, but it needs to be formalized. I'm wondering whether this approach is similar to any existing approach, or whether there are other, more widely accepted and well-established approaches?
Person Age gender height weight pant_size metric
experiment_1 39 F 139.6 112 26,28
experiment_2 52 M 154.1 159 32,32
experiment_3 43 F 148.9 123 27,28
experiment_4 55 M 167.6 188 36,38
experiment_5 61 M 161.4 171 36,32
experiment_6 48 F 149.1 144 28,28
column score 2 1 5 1 7 16
Treat this as a classification problem rather than a clustering problem, if you assume the results "cluster".
You don't need to find these clusters; they are predefined classes.
The reformulated approach is as follows:
Train different classifiers to predict whether a point is from data A or data B. If you can get an accuracy much better than 50% (assuming balanced data), then the groups do differ. If all your classifiers are only as good as random guessing (and you didn't make mistakes), then the two sets are probably just too similar.
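A minimal sketch of this idea, assuming the two groups live in data frames named control and experiment with the columns shown above (hypothetical names; pant_size is left out of the formula for simplicity):
combined <- rbind(
  transform(control, group = 0),     # control observations labelled 0
  transform(experiment, group = 1)   # experimental observations labelled 1
)
combined$gender <- factor(combined$gender)
# Leave-one-out cross-validated accuracy of a simple logistic classifier
hits <- sapply(seq_len(nrow(combined)), function(i) {
  fit <- glm(group ~ Age + gender + height + weight,
             data = combined[-i, ], family = binomial)
  pred <- as.numeric(predict(fit, newdata = combined[i, ], type = "response") > 0.5)
  pred == combined$group[i]
})
mean(hits)   # accuracy well above 0.5 suggests the two groups differ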

Creating data continuously using rnorm until an outlier occurs in R

Sorry for the confusing title, but I wasn't sure how to describe what I am trying to do. My objective is to create a dataset of 1000 observations, each of which would be the length of a run. I have created a phase1 dataset, from which a set of control limits is produced.
What I am trying to do now is create a phase2 dataset, most likely using rnorm, with a repeat loop that continuously generates values until one of those values falls outside the control limits produced from the phase1 dataset. For example, if I had 3.0 and -3.0 as control limits, the phase2 dataset would keep creating observations until, say, observation 398, where the value happens to be 3.45, thus stopping the creation of data. My objective is then to record the number 398.
Furthermore, I am then trying to loop back to the phase1 dataset / control limits portion, create a new set of control limits, and run another phase2, until I have 1000 run lengths recorded. The code I have for the phase1 / control limits part works fine and looks like this:
nphase1=50
nphase2=1000
varcount=1
meanshift= 0
sigmashift= 1
##### phase1 dataset/ control limits #####
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- apply(phase1, 2, mean)
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
I previously implemented this in SAS; the code below might be a better reference for what I am trying to achieve than my explanation:
%macro phase2_dataset (n=,varcount=, meanshift=, sigmashift=, nphase1=,simID=,);
%do z=1 %to &n;
%phase1_dataset (n=&nphase1, varcount=&varcount);
data phase2; set control_limits n=lastobs;
call streaminit(0);
do until (phase2_var1<Lower_SPC_limit_method1_var1 or
phase2_var1>Upper_SPC_limit_method1_var1);
phase2_var1 = rand("normal", &meanshift, &sigmashift);
output;
end;
run;
ods exclude all;
proc means data=phase2;
var phase2_var1;
ods output summary=x;
run;
ods select all;
data run_length; set x;
keep Phase2_var1_n;
run;
proc append base= QA.Phase2_dataset&simID data=Run_length force; run;
%end;
%mend;
I have also been looking into using a while loop in place of the repeat loop.
I'm new to R, so any ideas you can throw my way are greatly appreciated. Thanks!
Using a while loop indeed seems to be the way to go. Here's what I think you're looking for:
set.seed(10) #Making results reproducible
replicate(100, { # 100 is easier to display here
  phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol = varcount)
  mean_var <- colMeans(phase1) # Slightly better than apply
  std_var <- apply(phase1, 2, sd)
  df_var <- data.frame(mean_var, std_var)
  Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
  Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
  df_control_limits <- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
  # Phase 2
  x <- 0
  count <- 0
  while (x > Lower_SPC_Limit_Method1 && x < Upper_SPC_Limit_Method1) {
    x <- rnorm(1)
    count <- count + 1
  }
  count
})
The result is:
[1] 225 91 97 118 304 275 550 58 115 6 218 63 176 100 308 844 90 2758
[19] 161 311 1462 717 2446 74 175 91 331 210 118 1517 420 32 39 201 350 89
[37] 64 385 212 4 72 730 151 7 1159 65 36 333 97 306 531 1502 26 18
[55] 67 329 75 532 64 427 39 352 283 483 19 9 2 1018 137 160 223 98
[73] 15 182 98 41 25 1136 405 474 1025 1331 159 70 84 129 233 2 41 66
[91] 1 23 8 325 10 455 363 351 108 3
If performance becomes a problem, it could be worth exploring some improvements, like drawing more numbers with rnorm() at a time, counting how many are needed before one exceeds the limits, and repeating only if none does.
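A minimal sketch of that batching idea, assuming the control limits have already been computed as above:
run_length <- function(lower, upper, batch = 1000) {
  total <- 0
  repeat {
    x <- rnorm(batch)
    hit <- which(x < lower | x > upper)   # positions of out-of-control values
    if (length(hit) > 0) return(total + hit[1])
    total <- total + batch
  }
}
run_length(Lower_SPC_Limit_Method1, Upper_SPC_Limit_Method1)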

Fitting logistic growth curves to data

I've been attempting to fit logistic growth equations to data sets I have, with mixed results. I typically use a setup like this:
# Post PT
time <- 1:48
Diversity <- new8
plot(time, Diversity,log="y",las=1, pch=16, type="l")
logisticModel <- nls(Diversity~K/(1+exp(Po+r*time)), start=list(Po=25, r=-1.6, K=200),control=list(maxiter=1000,minFactor=.00000000001))
The goal here is to model Diversity over time logistically; this is a species diversity curve that asymptotes. However, for particular datasets I cannot get the model to work, and I can't for the life of me figure out why. As an example, in one iteration, the Diversity (and therefore new8) values being used are:
[1] 25 22 68 72 126 141 82 61 97 126 101 110 173 164 160 137 122 113 104 104 109 102 107 122 149 127 137 146 185 188 114 91 102 132 147
[36] 148 151 154 165 215 216 206 205 207 207 220 200 204
# plot via this, and it is a nice species diversity curve beginning to level off
plot(Diversity,type="l")
This data is beginning to reach its limit, yet I cannot fit a logistic curve to it. If I try, I get an "exceeded max iterations" error, no matter how high I crank up the iterations. I've played with the starting parameters over and over with no luck. Currently, for example, the code looks like this:
# Post PT
time <- 1:48
Diversity <- new8
plot(time, Diversity,log="y",las=1, pch=16, type="l")
logisticModel <- nls(Diversity~K/(1+exp(Po+r*time)), start=list(Po=25, r=-1.6, K=200),control=list(maxiter=1000,minFactor=.00000000001))
Any help is more than appreciated; I've spent all day sitting on my couch stuck on this. If someone has a better way to coerce a logistic growth curve out of data, I'd love to hear it! As a side note, I've used SSlogis for these datasets with no luck either.
Numerical instability is often a problem with models involving exponential terms. Try evaluating your model at your starting parameters:
> 200/(1+exp(25-1.6*df$norm_time))
[1] 2.871735e-09 2.969073e-09 3.069710e-09 3.173759e-09 3.281333e-09 3.392555e-09 3.507546e-09 3.626434e-09 3.749353e-09
[10] 3.876437e-09 4.007830e-09 4.143676e-09 4.284126e-09 4.429337e-09 4.579470e-09 4.734691e-09 4.895174e-09 5.061097e-09
[19] 5.232643e-09 5.410004e-09 5.593377e-09 5.782965e-09 5.978979e-09 6.181637e-09 6.391165e-09 6.607794e-09 6.831766e-09
[28] 7.063329e-09 7.302742e-09 7.550269e-09 7.806186e-09 8.070778e-09 8.344338e-09 8.627170e-09 8.919589e-09 9.221919e-09
[37] 9.534497e-09 9.857670e-09 1.019180e-08 1.053725e-08 1.089441e-08 1.126368e-08 1.164546e-08 1.204019e-08 1.244829e-08
[46] 1.287023e-08 1.330646e-08 1.375749e-08
With predicted values this small, it's likely that any moderate change in the parameters, as required by nls() to estimate gradients, will produce changes in the predictions that are very small, barely above or even below the minFactor control setting.
It's better to normalize your data so that its numerical range is within a nice friendly range, like 0 to 1.
require(stringr)
require(ggplot2)
new8 <- '25 22 68 72 126 141 82 61 97 126 101 110 173 164 160 137 122 113 104 104 109 102 107 122 149 127 137 146 185 188 114 91 102 132 147 148 151 154 165 215 216 206 205 207 207 220 200 204'
Diversity = as.numeric(str_split(new8, '[ ]+')[[1]])
time <- 1:48
df = data.frame(time=time, div=Diversity)
# normalize time
df$norm_time <- df$time / max(df$time)
# normalize diversity
df$norm_div <- (df$div - min(df$div)) / max(df$div)
With this way of normalizing diversity, your Po parameter can always be assumed to be 0. That means we can eliminate it from the model. The model now only has two degrees of freedom instead of three, which also makes fitting easier.
That leads us to the following model:
logisticModel <- nls(norm_div ~ K/(1 + exp(r*norm_time)), data = df,
                     start = list(K = 1, r = -1.6),
                     control = list(maxiter = 1000, minFactor = .00000000001))
Your data doesn't look like that great a fit to the model to me, but I'm not an expert in your field:
ggplot(data=df, aes(x=norm_time, y=norm_div)) +
  geom_point() +
  geom_line(aes(x=norm_time, y=predict(logisticModel)), color='red') +
  theme_bw()
quartz.save('~/Desktop/SO_31236153.png', type='png')
summary(logisticModel)
Formula: norm_div ~ K/(1 + exp(r * norm_time))
Parameters:
Estimate Std. Error t value Pr(>|t|)
K 0.6940 0.1454 4.772 1.88e-05 ***
r -2.6742 2.4222 -1.104 0.275
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1693 on 46 degrees of freedom
Number of iterations to convergence: 20
Achieved convergence tolerance: 5.895e-06
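If you want the fitted curve back on the original Diversity scale, a possible follow-up sketch (simply inverting the normalization used above):
df$fit_div <- predict(logisticModel) * max(df$div) + min(df$div)   # undo the scaling
plot(df$time, df$div, las = 1, pch = 16)
lines(df$time, df$fit_div, col = 'red')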

Clustering Large Data Matrix using R

I have a large data matrix (33183x1681), each row corresponding to one observation and each column corresponding to the variables.
I applied k-medoids clustering using the pam() function in R, and I tried to visualize the clustering results using the built-in plots available for pam objects. I got this error:
Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
cannot use cor=TRUE with a constant variable
I think this problem is because of the high dimensionality of the data matrix I'm trying to cluster.
Any thoughts/ideas how to tackle this issue?
Check out the clara() function in the cluster package, which ships with all versions of R.
library("cluster")
## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
cbind(rnorm(300,50,8), rnorm(300,50,8)))
clarax <- clara(x, 2, samples=50)
clarax
> clarax
Call: clara(x = x, k = 2, samples = 50)
Medoids:
[,1] [,2]
[1,] -1.15913 0.5760027
[2,] 50.11584 50.3360426
Objective function: 10.23341
Clustering vector: int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Cluster sizes: 200 300
Best sample:
[1] 10 17 45 46 68 90 99 150 151 160 184 192 232 238 243 250 266 275 277
[20] 298 303 304 313 316 327 333 339 353 358 398 405 410 411 421 426 429 444 447
[39] 456 477 481 494 499 500
Available components:
[1] "sample" "medoids" "i.med" "clustering" "objective"
[6] "clusinfo" "diss" "call" "silinfo" "data"
Note that you should study the help for clara() (?clara) in some detail, as well as the references cited there, in order to make the clustering performed by clara() as close as possible to (or identical to) that of pam().
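A hedged sketch of the arguments that matter most for a matrix of your size (mydatamatrix is a stand-in for your 33183 x 1681 matrix; larger samples and sampsize bring clara() closer to pam(), at a higher computational cost):
library("cluster")
clarax <- clara(mydatamatrix, k = 2,     # k = the number of clusters you want
                metric = "euclidean",    # "manhattan" is also available
                samples = 100,           # many more subsamples than the default 5
                sampsize = 500)          # much larger subsample than the default
clarax$clusinfo   # per-cluster size and dissimilarity summaries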
