I have a distribution, for example:
d
#[1] 4 22 15 5 9 5 11 15 21 14 14 23 6 9 17 2 7 10 4
Or, the vector d in dput format.
d <- c(4, 22, 15, 5, 9, 5, 11, 15, 21, 14, 14, 23, 6, 9, 17, 2, 7, 10, 4)
When I apply ks.test:
gamma <- ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563)
This gives the following warning:
Warning message:
In ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563) :
ties should not be present for the Kolmogorov-Smirnov test
I tried using unique(d), but that obviously drops values from my data, which I would like to avoid.
In other examples online the same warning appears, but there the test still shows its results along with the warning. In my case I only see the warning, without any ks.test values.
Any help?
Your result is stored in gamma; the warning does not prevent the test from running:
d <- c(4, 22, 15, 5, 9, 5, 11, 15, 21, 14, 14, 23, 6, 9, 17, 2, 7, 10, 4)
gamma <- ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563)
Warning message:
In ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563) :
ties should not be present for the Kolmogorov-Smirnov test
gamma
One-sample Kolmogorov-Smirnov test
data: d
D = 0.14549, p-value = 0.816
alternative hypothesis: two-sided
You can find an explanation of the warning in the help page, ?ks.test:
The presence of ties always generates a warning, since continuous
distributions do not generate them. If the ties arose from rounding
the tests may be approximately valid, but even modest amounts of
rounding can have a significant effect on the calculated statistic.
Your data are whole numbers, so they have presumably been rounded, which is where the ties come from; the test is therefore only "approximately" valid.
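A minimal sketch of how the stored result can be inspected, using the vector and parameters from the question. Wrapping the call in suppressWarnings() is an assumption about what you want: it only hides the ties warning and does not change the test. The name res is illustrative.

```r
d <- c(4, 22, 15, 5, 9, 5, 11, 15, 21, 14, 14, 23, 6, 9, 17, 2, 7, 10, 4)

# suppressWarnings() hides the ties warning; the test result is unchanged
res <- suppressWarnings(
  ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563)
)

res$statistic   # the D statistic
res$p.value     # the p-value

# Which values occur more than once (the ties the warning refers to)?
tied <- sort(unique(d[duplicated(d)]))
tied
```

The components statistic and p.value are part of the "htest" object that ks.test returns, so they can be used in later code without printing the whole result.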
I am trying to figure out how to calculate the average, median and standard deviation for each value of each variable. Here is some of the data (thanks to @Barranka for providing the data in an easy-to-copy format):
df <- data.frame(
gama=c(10, 1, 1, 1, 1, 1, 10, 0.1, 10),
theta=c(1, 1, 1, 1, 0.65, 1, 0.65, 1, 1),
detectl=c(3, 5, 1, 1, 5, 3, 5, 5, 1),
NSMOOTH=c(10, 5, 20, 20, 5, 20, 10, 10, 40),
NREF=c(50, 80, 80, 50, 80, 50, 10, 100, 30),
NOBS=c(10, 40, 40, 20, 20, 20, 10, 40, 10),
sma=c(15, 15, 15, 15, 15, 15, 15, 15, 15),
lma=c(33, 33, 33, 33, 33, 33, 33, 33, 33),
PosTrades=c(11, 7, 6, 3, 9, 3, 6, 6, 5),
NegTrades=c(2, 2, 1, 0, 1, 0, 1, 5, 1),
Acc=c(0.846154, 0.777778, 0.857143, 1, 0.9, 1, 0.857143, 0.545455, 0.833333),
AvgWin=c(0.0451529, 0.0676022, 0.0673241, 0.13204, 0.0412913, 0.126522, 0.0630061, 0.0689745, 0.0748437),
AvgLoss=c(-0.0194498, -0.0083954, -0.0174653, NaN, -0.00264179, NaN, -0.0161558, -0.013903, -0.0278908),
Return=c(1.54942, 1.54916, 1.44823, 1.44716, 1.42789, 1.42581, 1.40993, 1.38605, 1.38401)
)
To save the results to CSV later, I need to turn them into data frames that look like this:
Table for gama
Value Average Median Standard Deviation
10 (Avg of 10) (median of 10) (Stdev of 10)
1 (Avg of 1) (median of 1) (Stdev of 1)
0.1 (Avg of 0.1) (median of 0.1) (Stdev of 0.1)
Table for theta
Value Average Median Standard Deviation
1 (Avg of 1) (median of 1) (Stdev of 1)
0.65 (Avg of 0.65) (median of 0.65) (Stdev of 0.65)
Table for detectl
Value Average Median Standard Deviation
3 (Avg of 3) (median of 3) (Stdev of 3)
5 (Avg of 5) (median of 5) (Stdev of 5)
...
The columns to be used as IDs are:
ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")
Summary statistics should be computed over the following columns:
vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")
I have tried using the data.table package, but I cannot figure out how to do this without renaming values one by one; when I pursue this approach, my code also gets very complicated.
Clever use of melt() and tapply() can help you. I made the following assumptions:
You have to get the mean, median and standard deviation of the last four columns (Acc, AvgWin, AvgLoss, Return)
You need to group the data by each of the first ten columns (gama, theta, ..., NegTrades)
For reproducibility, here's the input:
# Your example data
df <- data.frame(
gama=c(10, 1, 1, 1, 1, 1, 10, 0.1, 10),
theta=c(1, 1, 1, 1, 0.65, 1, 0.65, 1, 1),
detectl=c(3, 5, 1, 1, 5, 3, 5, 5, 1),
NSMOOTH=c(10, 5, 20, 20, 5, 20, 10, 10, 40),
NREF=c(50, 80, 80, 50, 80, 50, 10, 100, 30),
NOBS=c(10, 40, 40, 20, 20, 20, 10, 40, 10),
sma=c(15, 15, 15, 15, 15, 15, 15, 15, 15),
lma=c(33, 33, 33, 33, 33, 33, 33, 33, 33),
PosTrades=c(11, 7, 6, 3, 9, 3, 6, 6, 5),
NegTrades=c(2, 2, 1, 0, 1, 0, 1, 5, 1),
Acc=c(0.846154, 0.777778, 0.857143, 1, 0.9, 1, 0.857143, 0.545455, 0.833333),
AvgWin=c(0.0451529, 0.0676022, 0.0673241, 0.13204, 0.0412913, 0.126522, 0.0630061, 0.0689745, 0.0748437),
AvgLoss=c(-0.0194498, -0.0083954, -0.0174653, NaN, -0.00264179, NaN, -0.0161558, -0.013903, -0.0278908),
Return=c(1.54942, 1.54916, 1.44823, 1.44716, 1.42789, 1.42581, 1.40993, 1.38605, 1.38401)
)
And here's my proposed solution:
library(reshape)
md <- melt(df, id=colnames(df)[1:10]) # This will create one row for each
# 'id' combination, and will store
# the rest of the column headers
# in the `variable` column, and
# each value corresponding to the
# variable. Like this:
head(md)
##   gama theta detectl NSMOOTH NREF NOBS sma lma PosTrades NegTrades variable    value
## 1   10  1.00       3      10   50   10  15  33        11         2      Acc 0.846154
## 2    1  1.00       5       5   80   40  15  33         7         2      Acc 0.777778
## 3    1  1.00       1      20   80   40  15  33         6         1      Acc 0.857143
## 4    1  1.00       1      20   50   20  15  33         3         0      Acc 1.000000
## 5    1  0.65       5       5   80   20  15  33         9         1      Acc 0.900000
## 6    1  1.00       3      20   50   20  15  33         3         0      Acc 1.000000
results <- list() # Prepare the results list
for(i in unique(md$variable)) { # For each variable you have...
  results[[i]] <- list() # ... create a new list to hold the 'summary'
  tmp_data <- subset(md, variable==i) # Filter the data you'll use
  for(j in colnames(tmp_data)[1:10]) { # For each id column, use tapply()
                                       # to get what you need, and
                                       # store it into a data frame
                                       # inside the results
    results[[i]][[j]] <- as.data.frame(
      t(
        rbind(
          tapply(tmp_data$value, tmp_data[,j], mean),
          tapply(tmp_data$value, tmp_data[,j], median),
          tapply(tmp_data$value, tmp_data[,j], sd))
      )
    )
    colnames(results[[i]][[j]]) <- c('average', 'median', 'sd')
  }
  rm(tmp_data) # You'll no longer need this
}
Now what? Check out the summary for results:
summary(results)
## Length Class Mode
## Acc 10 -none- list
## AvgWin 10 -none- list
## AvgLoss 10 -none- list
## Return 10 -none- list
You have a list for each variable. Now, if you check out the summary for any results "sublist", you'll see this:
summary(results$Acc)
## Length Class Mode
## gama 3 data.frame list
## theta 3 data.frame list
## detectl 3 data.frame list
## NSMOOTH 3 data.frame list
## NREF 3 data.frame list
## NOBS 3 data.frame list
## sma 3 data.frame list
## lma 3 data.frame list
## PosTrades 3 data.frame list
## NegTrades 3 data.frame list
See what happens when you peek into the results$Acc$gama data frame:
results$Acc$gama
## average median sd
## 0.1 0.5454550 0.545455 NA
## 1 0.9069842 0.900000 0.09556548
## 10 0.8455433 0.846154 0.01191674
So, for each variable and each "id" column, you have the data summary you want.
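As a rough sketch of the same idea for a single id/value pair, the grouping can also be done with split() in base R. The helper summarise_by() and its column names (Value, Average, Median, StdDev) are hypothetical, chosen to mirror the table layout in the question, and only three columns of the data are reproduced to keep the example short.

```r
# Three columns of the question's data, enough to illustrate the idea
df <- data.frame(
  gama  = c(10, 1, 1, 1, 1, 1, 10, 0.1, 10),
  theta = c(1, 1, 1, 1, 0.65, 1, 0.65, 1, 1),
  Acc   = c(0.846154, 0.777778, 0.857143, 1, 0.9, 1, 0.857143, 0.545455, 0.833333)
)

# summarise_by() is a hypothetical helper, not part of any package
summarise_by <- function(data, id, val) {
  groups <- split(data[[val]], data[[id]])      # one numeric vector per id value
  data.frame(
    Value   = names(groups),
    Average = sapply(groups, mean, na.rm = TRUE),
    Median  = sapply(groups, median, na.rm = TRUE),
    StdDev  = sapply(groups, sd, na.rm = TRUE), # NA when a group has a single row
    row.names = NULL
  )
}

gama_acc <- summarise_by(df, "gama", "Acc")
gama_acc

# One table per id column; each element could then be passed to write.csv()
tables <- lapply(c(gama = "gama", theta = "theta"),
                 summarise_by, data = df, val = "Acc")
```

The averages for gama should agree with the results$Acc$gama table above, and the standard deviation for 0.1 is NA because that group contains a single row.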
Hope this helps.
I have an approach involving data.table.
EDIT: I tried to submit an edit to the question, but I took some liberties so it'll probably get rejected. I made assumptions about which columns were to be used as "id" columns (columns whose values subset data), and which should be "measure" columns (columns whose values are used to calculate the summary statistics). See here for these designations:
ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")
vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")
Setup
library(data.table)
# Convert to data.table
df <- data.table(df)
# Helper function to convert a string to a call
# useful in a data.table j
s2c <- function (x, type = "list"){
as.call(lapply(c(type, x), as.symbol))
}
# Function to compute the desired summary stats
smry <- function(x) list(Average=mean(x, na.rm=TRUE), Median=median(x, na.rm=TRUE), StandardDeviation=sd(x, na.rm=TRUE))
# Define some names to use later
ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")
vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")
# sapply(t.vals, smry) returns the statistics grouped by variable
# (Average, Median, StdDev of PosTrades, then of NegTrades, ...),
# so the column names have to follow that same order
usenames <- paste(c("Average","Median","StdDev"), rep(vals, each=3), sep="_")
Calculations in data.table
# Compute the summary statistics
df2 <- df[,j={
  for(i in seq_along(ids)){ # loop through each id
    t.id <- ids[i]
    t.out <- .SD[,j={
      t.vals <- .SD[,eval(s2c(vals))] # this line returns a data.table with each vals as a column
      sapply(t.vals, smry) # apply summary statistics
    },by=t.id] # this by= loops through each value of the current id (t.id)
    setnames(t.out, c("id.val", usenames)) # fix the names of the data.table to be returned for this i
    t.out <- cbind(id=t.id, t.out) # add a column indicating the variable name (t.id)
    if(i==1){big.out <- t.out}else{big.out <- rbind(big.out, t.out)} # accumulate the output data.table
  }
  big.out
}]
Formatting
df2 <- data.table:::melt.data.table(df2, id.vars=c("id","id.val")) # melt into "long" format
df2[,c("val","metric"):=list(gsub(".*_","",variable),gsub("_.*","",variable))] # split names to create id's
df2[,variable:=NULL] # delete old column that had the names we just split up
df2 <- data.table:::dcast.data.table(df2, id+id.val+val~metric) # go a bit wider, so stats in diff columns
# reshape2:::acast(df2, id+id.val~metric~val) # maybe replace the above line with this
Result (selected rows)
   id id.val       val      Average       Median      StdDev
 NOBS     10       Acc  0.84554333   0.84615400 0.011916740
 NOBS     10   AvgLoss -0.021165467 -0.019449800 0.006052701
 NOBS     10 PosTrades  7.333333     6.000000    3.214550
theta      1    AvgWin  0.08320849   0.06897450  0.03285638
theta      1 NegTrades  1.571429     1.000000    1.718249
theta      1    Return  1.455691429  1.447160000 0.068957635
I have a variable, e.g. c(0, 8, 7, 15, 85, 12, 46, 12, 10, 15, 15).
How can I calculate a mean value from random maximal values in R?
For example, I would like to calculate the mean of three maximal values.
First step: draw a sample of 3 from your data and store it in x.
Second step: calculate the mean of the sample.
Try:
dat <- c(0,8,7,15, 85, 12, 46, 12, 10, 15,15)
x <- sample(dat,3)
x
mean(x)
Possible output:
> x <- sample(dat,3)
> x
[1] 85 15 0
> mean(x)
[1] 33.33333
If you mean the three highest values, just sort your vector and subset:
> mean(sort(c(0,8,7,15, 85, 12, 46, 12, 10, 15,15), decreasing=TRUE)[1:3])
[1] 48.66667
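The sort-and-subset idea can be wrapped in a small function so the number of values is a parameter. This is only a sketch; mean_of_top() is a hypothetical helper name, not an existing R function.

```r
dat <- c(0, 8, 7, 15, 85, 12, 46, 12, 10, 15, 15)

# mean_of_top() is a hypothetical helper: mean of the n largest values of x
mean_of_top <- function(x, n) {
  stopifnot(n >= 1, n <= length(x))
  mean(head(sort(x, decreasing = TRUE), n))
}

top3 <- mean_of_top(dat, 3)   # mean of 85, 46 and 15
top3
```

Using head() instead of [1:n] avoids picking up NA values if n were larger than the vector, although the stopifnot() check already guards against that here.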
For instance, if the number is 100 and the number of groups is 4, it should give a random list of 4 numbers that add up to 100:
input number = 100
number of groups = 4
Possible outputs:
25, 25, 25, 25
10, 20, 30, 40
15, 35, 2, 48
Only one list should be generated as output. A more application-oriented example: how would I split a probability of 1 into multiple groups, given the number of groups, using R?
rmultinom might be handy here:
x <- rmultinom(n = 1, size = 100, prob = rep(1/4, 4))
x
colSums(x)
Here I draw one vector with a total size of 100, which is split into 4 groups.
You can try the following:
total <- 100
n <- 4
as.vector(table(sample(1:n, size = total, replace = T)))
## [1] 23 27 24 26
as.vector(table(sample(1:n, size = total, replace = T)))
## [1] 25 26 28 21
as.vector(table(sample(1:n, size = total, replace = T)))
## [1] 24 20 28 28
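One caveat with table(): a group that happens to receive no draws is dropped from the result, so the output can be shorter than n when total is small. A possible variation, sketched here under that assumption, uses tabulate() with nbins so the result always has n entries.

```r
total <- 100
n <- 4

# tabulate() counts how often each of 1..nbins occurs, including zero counts
counts <- tabulate(sample.int(n, size = total, replace = TRUE), nbins = n)
counts
sum(counts)   # always equal to total

# With few draws and many groups, some groups are empty but still reported
sparse <- tabulate(sample.int(10, size = 3, replace = TRUE), nbins = 10)
sparse
```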
When it comes to probabilities, I think this is a good idea:
generate.probabilities <- function(n){
bordersR <- c(sort(runif(n-1)), 1)
bordersL <- c(0, bordersR[1:(n-1)])
bordersR - bordersL
}
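It gives you n random non-negative numbers that sum to 1: the n-1 sorted uniform draws act as cut points on the interval [0, 1], and the returned values are the lengths of the resulting pieces.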
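The same cut-point idea can be written with diff(), which also behaves sensibly for n = 1 (the version above returns two values in that case, because 1:(n-1) becomes 1:0). This is a sketch; random_probabilities() is a hypothetical name.

```r
# random_probabilities() is a hypothetical helper: n random values summing to 1
random_probabilities <- function(n) {
  stopifnot(n >= 1)
  cuts <- sort(runif(n - 1))   # n - 1 cut points inside [0, 1]
  diff(c(0, cuts, 1))          # lengths of the n pieces
}

p <- random_probabilities(4)
p
sum(p)   # 1, up to floating-point error
```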
Define the parameters for generality
inN <- 100 # input number
nG <- 4 # number of groups
Following storaged's idea that we only need 3 random numbers to split the space into 4 regions, but requiring integers, the inner borders can be found as:
sort(sample(inN,nG-1, replace = TRUE))
The OP wanted the count in each group which we can find by
diff(c(0,sort(sample(inN,nG-1, replace = TRUE)), inN))
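Wrapped into a function, the border approach might look like the sketch below. Because the borders are sampled with replacement (and may equal inN), some groups can come out as zero; the allow_zero switch and the name split_total() are assumptions added for illustration, not part of the answer above.

```r
# split_total() is a hypothetical helper: random integer parts summing to total
split_total <- function(total, groups, allow_zero = TRUE) {
  if (allow_zero) {
    # borders may repeat or equal total, so parts of size 0 are possible
    borders <- sample(total, groups - 1, replace = TRUE)
  } else {
    # distinct borders strictly below total give parts of size >= 1
    stopifnot(groups <= total)
    borders <- sample(total - 1, groups - 1)
  }
  diff(c(0, sort(borders), total))
}

parts <- split_total(100, 4)
parts
sum(parts)   # 100

positive_parts <- split_total(100, 4, allow_zero = FALSE)
positive_parts
```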
I am new to R. When I estimate a logistic model using glm(), it does not seem to predict the response: calling the predict function gives output that looks wrong, such as 1 for every input.
Code:
ex2data1R <- read.csv("/media/ex2data1R.txt")
x <-ex2data1R$x
y <-ex2data1R$y
z <-ex2data1R$z
logisticmodel <- glm(z~x+y,family=binomial(link = "logit"),data=ex2data1R)
newdata = data.frame(x=c(10),y=(10))
predict(logisticmodel, newdata, type="response")
Output:
> predict(logisticmodel, newdata, type="response")
1
1.181875e-11
Data(ex2data1R.txt) :
"x","y","z"
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,1
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1
Am I doing something wrong?
I'm not seeing any problem. Here are predictions for x,y = 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100:
newdata = data.frame(x=seq(30, 100, 5) ,y=seq(30, 100, 5))
predict(logisticmodel, newdata, type="response")
1 2 3 4 5 6
2.423648e-06 1.861140e-05 1.429031e-04 1.096336e-03 8.357794e-03 6.078786e-02
7 8 9 10 11 12
3.320041e-01 7.923883e-01 9.670066e-01 9.955766e-01 9.994218e-01 9.999247e-01
13 14 15
9.999902e-01 9.999987e-01 9.999998e-01
You were predicting at x=10, y=10, which is well outside the range of your x and y values (30 - 100); the predicted probability, 1.18e-11, is effectively zero, which is consistent with these results. (The 1 printed above it is the row label, not the prediction.) When x and y are low (30 - 55), the prediction for z is close to zero; when x and y are high (75 - 100), the prediction is one (or nearly one). It may be easier to interpret the results if you round them to a few decimals:
round(predict(logisticmodel, newdata, type="response") , 5)
1 2 3 4 5 6 7 8 9 10
0.00000 0.00002 0.00014 0.00110 0.00836 0.06079 0.33200 0.79239 0.96701 0.99558
11 12 13 14 15
0.99942 0.99992 0.99999 1.00000 1.00000
Here is a simple way to predict a category and compare the results with your data:
predict <- ifelse(predict(logisticmodel, type="response")>.5, 1, 0)
xtabs(~predict+ex2data1R$z)
ex2data1R$z
predict 0 1
0 34 5
1 6 55
We used predict() on your original data and then created a rule that picks 1 if the probability is greater than .5 and 0 if it is not. Then we used xtabs() to compare the predictions to the data. When z is 0, we correctly predict zero 34 times and incorrectly predict one 6 times. When z is 1, we correctly predict one 55 times and incorrectly predict zero 5 times. We are correct 89% of the time: (34+55)/100*100 = 89. You could explore the accuracy of prediction if you use .45 or .55 as the cutoff instead of .5.
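To explore other cutoffs, the accuracy calculation can be put in a small function and applied to several values. The sketch below uses simulated data as a stand-in for ex2data1R (the file itself is not read here), so the coefficients used to simulate z and the names sim, fit and accuracy_at() are assumptions for illustration only.

```r
set.seed(1)

# Simulated stand-in for ex2data1R: x and y in 30..100, binary z
n <- 200
sim <- data.frame(x = runif(n, 30, 100), y = runif(n, 30, 100))
sim$z <- rbinom(n, 1, plogis(-25 + 0.2 * sim$x + 0.2 * sim$y))

fit <- glm(z ~ x + y, family = binomial(link = "logit"), data = sim)
p <- predict(fit, type = "response")   # fitted probabilities for the training rows

# Share of rows where the rule "predict 1 if p > cutoff" matches z
accuracy_at <- function(cutoff) mean((p > cutoff) == (sim$z == 1))

cutoffs <- c(0.45, 0.50, 0.55)
acc <- sapply(cutoffs, accuracy_at)
names(acc) <- cutoffs
acc
```

With the real data, replacing sim by ex2data1R and fit by logisticmodel should give the accuracy at each cutoff in the same way.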
In my opinion everything is correct, as you can read in the R manual:
newdata - optionally, a data frame in which to look for variables with
which to predict. If omitted, the fitted linear predictors are used.
If your data frame has only one record, predict() produces a prediction only for that one record.
For more details see the R manual entries for glm and predict,
or type the following in the R console (glm is part of the stats package, which is loaded by default):
?predict.glm
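A short sketch of what the single-record case returns, again on simulated data standing in for ex2data1R (the simulation settings and the names sim, fit and one_row are assumptions). The point is that predict() returns a named vector, and the name of the only element is the row label "1".

```r
set.seed(42)

# Simulated stand-in for ex2data1R
n <- 200
sim <- data.frame(x = runif(n, 30, 100), y = runif(n, 30, 100))
sim$z <- rbinom(n, 1, plogis(-25 + 0.2 * sim$x + 0.2 * sim$y))
fit <- glm(z ~ x + y, family = binomial(link = "logit"), data = sim)

one_row <- data.frame(x = 10, y = 10)
pred <- predict(fit, one_row, type = "response")

names(pred)    # "1" is the row label of one_row, not a predicted class
unname(pred)   # the predicted probability itself, very close to 0 here
```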
You can also use the following command to make the confusion matrix:
predict <- ifelse(predict(logisticmodel, type="response")>.5, 1, 0)
table(predict,ex2data1R$z)