I have a dataset like the following:
attack defense sp_attack sp_defense speed is_legendary
60 62 63 80 60 0
80 100 123 122 120 0
39 52 43 60 65 0
58 64 58 80 80 0
90 90 85 125 90 1
100 90 125 85 90 1
106 150 70 194 120 1
100 100 100 100 100 1
90 85 75 115 100 1
From this dataset, I want to check if there is heteroscedasticity between two groups: legendary vs. non-legendary Pokémon. To do that, I first checked the normality of the data for the legendary and non-legendary Pokémon as follows:
# Shapiro test for legendary and non-legendary Pokémon, hp comparison.
shapiro.test(df_net$hp[df_net$is_legendary==0])
shapiro.test(df_net$hp[df_net$is_legendary==1])
I've seen that in both cases the data are not normally distributed. Now I've decided to carry out a Fligner test as follows:
fligner.test(hp[df_net$is_legendary==0] ~ hp[df_net$is_legendary==1], data = df_net)
However, I obtain the following error:
Error in model.frame.default(formula = hp[df_net$is_legendary == 0] ~ : variable lengths differ (found for 'hp[df_net$is_legendary == 1]')
I guess that this is because the number of legendary Pokémon differs from the number of non-legendary ones, but then how can I check the heteroscedasticity between these two groups?
The correct syntax for fligner.test is
fligner.test(x ~ group, data)
In your case the correct syntax would be (e.g. for the variable sp_defense)
fligner.test(sp_defense ~ is_legendary, data = df_net)
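The same pattern applies to the hp comparison from your question; a minimal sketch, assuming df_net also contains the hp column used in your shapiro.test calls:
# Fligner-Killeen test: does the spread of hp differ between
# non-legendary (is_legendary == 0) and legendary (is_legendary == 1) Pokémon?
fligner.test(hp ~ is_legendary, data = df_net)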
I'm struggling to get a well-performing script for this problem: I have a table with a score, x, y. I want to sort the table by score and then build groups based on the x value. Each group should have an equal sum (not count) of x. x is a numeric value in the dataset and represents the historic turnover of a customer.
score x y
0.436024136 3 435
0.282303336 46 56
0.532358015 24 34
0.644236597 0 2
0.99623626 0 4
0.557673456 56 46
0.08898779 0 7
0.702941303 453 2
0.415717835 23 1
0.017497461 234 3
0.426239166 23 59
0.638896238 234 86
0.629610596 26 68
0.073107526 0 35
0.85741877 0 977
0.468612039 0 324
0.740704267 23 56
0.720147257 0 68
0.965212467 23 0
A good way to do this is to add a group variable to the data.frame with cumsum. You can then easily sum the groups with, e.g., subset.
df$group <- cumsum(as.numeric(df$x)) %/% ceiling(sum(df$x) / 3) + 1
Remarks:
cumsum(as.numeric()) works reliably in big data.frames (plain cumsum on integers can overflow)
%/% is integer division, so you get a whole number back
the '+ 1' just makes your groups start at 1 instead of 0
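Applied to the posted table, with the sort by score the question asks for, a minimal sketch (assuming the data sit in a data frame called df with columns score, x and y):
# sort by score, then split the cumulative sum of x into 3 groups
# with (roughly) equal total x
df <- df[order(df$score, decreasing = TRUE), ]
df$group <- cumsum(as.numeric(df$x)) %/% ceiling(sum(df$x) / 3) + 1
# check the split: total x per group
aggregate(x ~ group, data = df, FUN = sum)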
I created two variables called "low.income" and "mid.income" from a survey; they were obtained from the participants' income. Here is what the variables look like:
low.income <- c(75, 95, 85, 100, 85, 100, 85, 90, 75, 90, 65, 80, 85, 90, 85, 70, 95, 85, 100, 95, 85, 95, 90, 95, 95)
mid.income <- c(95, 100, 90, 90, 85, 95, 100, 95, 80)
But when I try to call aov(low.income ~ mid.income) it gives me:
Error in model.frame.default(formula = low.income ~ mid.income, drop.unused.levels = TRUE) :
variable lengths differ (found for 'mid.income')
So, what should I do?
That formula is not correct; I think you are looking for t.test, i.e.
t.test(low.income, mid.income, var.equal = TRUE)
To use the formula method, you have to create a data frame with the level and the income. It should look like the one below:
data <- data.frame(level = rep(paste0(c("low","mid"),".income"),c(25,9)), income = c(low.income,mid.income))
level income
1 low.income 75
2 low.income 95
3 low.income 85
4 low.income 100
5 low.income 85
6 low.income 100
: : :
29 mid.income 90
30 mid.income 85
31 mid.income 95
32 mid.income 100
33 mid.income 95
34 mid.income 80
Now you could do:
t.test(income~level,data,var.equal = TRUE)
Since you are using aov, here is an example of how to do that:
aov(income~level,data)
These two will lead to exactly the same result; you can run TukeyHSD on the aov fit to confirm it.
NOTE: You only need ANOVA when you have more than 2 groups. If you only have 2 groups, run a t.test. Recall that ANOVA is a generalization of the t.test.
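If you want to convince yourself that the two calls agree, here is a minimal sketch comparing their p-values, reusing the data frame built above:
t_res <- t.test(income ~ level, data, var.equal = TRUE)
aov_res <- aov(income ~ level, data)
t_res$p.value                        # p-value of the pooled-variance t-test
summary(aov_res)[[1]][["Pr(>F)"]][1] # p-value of the one-way ANOVA; identical, since F = t^2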
I have a problem using the nor.test function for a one-way test in R.
My data contain yield values (Rdt_pied) that are grouped by treatment (Traitement). In each treatment I have between 60 and 90 values.
> describe(Rdt_pied ~ Traitement, data = dataMax)
n Mean Std.Dev Median Min Max 25th 75th Skewness Kurtosis NA
G(1) 90 565.0222 282.1874 535.0 91 1440 379.00 751.25 0.7364071 3.727566 0
G(2) 90 703.1444 366.1114 632.5 126 1628 431.50 1007.75 0.4606251 2.392356 0
G(3) 90 723.9667 523.5872 650.5 64 2882 293.50 1028.50 1.2606231 5.365014 0
G(4) 90 954.1000 537.0138 834.5 83 2792 565.25 1143.75 1.1695460 4.672321 0
G(A) 60 368.0667 218.1940 326.0 99 1240 243.00 420.00 2.2207612 9.234473 0
G(H) 60 265.4667 148.0383 223.5 107 866 148.00 357.25 1.3759925 5.685456 0
G(S) 60 498.8000 280.1277 401.0 170 1700 292.75 617.50 1.6792061 7.125804 0
G(T) 60 521.7167 374.7822 448.5 74 1560 214.00 733.25 1.1367209 3.737134 0
Why does nor.test return this error?
> nor.test(Rdt_pied ~ Traitement, data = dataMax)
Error in shapiro.test(y[which(group == (levels(group)[i]))]) :
sample size must be between 3 and 5000
Thank you for your help!
I haven't used that package, but per the documentation (and your error), nor.test performs a Shapiro-Wilk normality test by default, which needs a numeric vector of at least 3 values as input. My guess is that there is a group, based on Traitement, which has fewer than 3 values or more than 5000.
Try to check it with something like
table(dataMax$Traitement)
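If that shows a level with fewer than 3 observations (an empty, unused factor level shows up as 0), one way around the error is to drop those levels before calling nor.test; a minimal sketch, assuming the column names from your question:
# group sizes per treatment
tab <- table(dataMax$Traitement)
# keep only treatments with between 3 and 5000 observations
# and drop the now-unused factor levels
keep <- names(tab)[tab >= 3 & tab <= 5000]
dataOK <- droplevels(dataMax[dataMax$Traitement %in% keep, ])
nor.test(Rdt_pied ~ Traitement, data = dataOK)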
I have a huge dataset. I computed a multinomial regression with multinom from the nnet package.
mylogit<- multinom(to ~ RealAge, mydata)
It takes 10 minutes. But when I use the summary function to compute the coefficients, it takes more than a day!
This is the code I used:
output <- summary(mylogit)
Coef<-t(as.matrix(output$coefficients))
I was wondering if anybody knows how I can compute this part of the code with parallel processing in R?
This is a small sample of the data:
mydata:
to RealAge
513 59.608
513 84.18
0 85.23
119 74.764
116 65.356
0 89.03
513 92.117
69 70.243
253 88.482
88 64.23
513 64
4 84.03
65 65.246
69 81.235
513 87.663
513 81.21
17 75.235
117 49.112
69 59.019
20 90.03
If you just want the coefficients, use only the coef() method, which does less computation.
Example:
mydata <- readr::read_table("to RealAge
513 59.608
513 84.18
0 85.23
119 74.764
116 65.356
0 89.03
513 92.117
69 70.243
253 88.482
88 64.23
513 64
4 84.03
65 65.246
69 81.235
513 87.663
513 81.21
17 75.235
117 49.112
69 59.019
20 90.03")[rep(1:20, 3000), ]
mylogit <- nnet::multinom(to ~ RealAge, mydata)
system.time(output <- summary(mylogit)) # 6 sec
all.equal(output$coefficients, coef(mylogit)) # TRUE & super fast
If you profile the summary() function, you'll see that most of the time is taken by the crossprod() function.
So, if you really want the output of the summary() function, you could use an optimized math library, such as the MKL provided by Microsoft R Open.
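For reference, base R's Rprof is enough to see where summary() spends its time; a minimal sketch, reusing mylogit from the example above:
# profile the summary() call and list functions by self time
Rprof(prof_file <- tempfile())
output <- summary(mylogit)
Rprof(NULL)
head(summaryRprof(prof_file)$by.self)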
Problem: I have two groups of multidimensional heterogeneous data. I have concocted a simple illustrative example below. Notice that some columns are discrete (age) while some are binary (gender) and another is even an ordered pair (pant size).
Person Age gender height weight pant_size
Control_1 55 M 167.6 155 32,34
Control_2 68 F 154.1 137 28,28
Control_3 53 F 148.9 128 27,28
Control_4 57 M 167.6 165 38,34
Control_5 62 M 147.4 172 36,32
Control_6 44 M 157.6 159 32,32
Control_7 76 F 172.1 114 30,32
Control_8 49 M 161.8 146 34,34
Control_9 53 M 164.4 181 32,36
Person Age gender height weight pant_size
experiment_1 39 F 139.6 112 26,28
experiment_2 52 M 154.1 159 32,32
experiment_3 43 F 148.9 123 27,28
experiment_4 55 M 167.6 188 36,38
experiment_5 61 M 161.4 171 36,32
experiment_6 48 F 149.1 144 28,28
The question is does the entire experimental group differ significantly from the entire control group?
Or roughly speaking do they form two distinct clusters in the space of [age,gender,height,weight,pant_size]?
The general idea of what I've tried so far is a metric that compares corresponding columns of the experimental group to those of the control group; the metric then takes the sum of the column scores (see below). A somewhat arbitrary threshold is picked to decide if the two groups are different. This arbitrariness is compounded by the weighting of the columns, which is also somewhat arbitrary. Remarkably, this approach is performing well on the actual problem I have, but it needs to be formalized. I'm wondering whether this approach is similar to any existing approach, or whether there are other, more widely accepted approaches?
Person Age gender height weight pant_size metric
experiment_1 39 F 139.6 112 26,28
experiment_2 52 M 154.1 159 32,32
experiment_3 43 F 148.9 123 27,28
experiment_4 55 M 167.6 188 36,38
experiment_5 61 M 161.4 171 36,32
experiment_6 48 F 149.1 144 28,28
column score 2 1 5 1 7 16
Treat this as a classification problem rather than a clustering problem, assuming the two groups really do "cluster":
you don't need to find the clusters, because they are predefined classes.
Rewritten as a classification task, the approach is as follows:
Train different classifiers to predict whether a point comes from data A or data B. If you can get much better accuracy than 50% (assuming balanced data), then the groups do differ. If all your classifiers are only as good as random guessing (and you didn't make mistakes), then the two sets are probably just too similar.
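A minimal sketch of that idea in R, assuming the two tables above live in data frames called control and experiment with the same columns, and that pant_size has already been split into two numeric columns pant_waist and pant_length (all of these names are placeholders for your own data):
library(randomForest)
# stack the two groups and label each row with its origin
combined <- rbind(control, experiment)
combined$label <- factor(rep(c("control", "experiment"),
                             c(nrow(control), nrow(experiment))))
combined$gender <- factor(combined$gender)
# k-fold cross-validated accuracy of a classifier that tries to tell
# the two groups apart; accuracy near 0.5 means the groups look alike
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(combined)))
acc <- sapply(1:k, function(i) {
  fit  <- randomForest(label ~ Age + gender + height + weight + pant_waist + pant_length,
                       data = combined[folds != i, ])
  pred <- predict(fit, combined[folds == i, ])
  mean(pred == combined$label[folds == i])
})
mean(acc)
With only 15 rows, as in the toy example, this estimate will be very noisy; the approach becomes meaningful once each group has a reasonable number of observations.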