R: calculate p-value of t-test using qt-function - r

We are testing if a computer's performance increases after its OS update by comparing the performance of 10 different Programs before and after the update which results in:
Program: #1 #2 #3 #4 #5
before: 34 29 32 27 28
after: 32 34 36 27 28
now we should do a t-test by calculating t on our own and only the p-value using the qt-function in R. But how do I have to use the qt-function to get my p-value?
I calculated t and it is -0,6486 (t.test in R says -0.64854 so close enough) and df is 8

Related

evaluating neural network performance

I trained my neural network with a sigmoid activation function so that the predicted values lie in the range [0,1). However, the range of real data in which the z-score transformation has been performed goes beyond [0,1). In this case what would be the appropriate way to evaluate my model. Should I rescale as well the original test data to the same range and then evaluate with criteria like mean square forecast error?
> real_predicted_neural
predicted real
1 1.909219e-07 -3.57877473
2 4.161819e-08 -2.28704595
3 1.754706e-11 -1.08509429
4 1.149891e-13 -0.46573114
5 7.777560e-02 0.42381300
6 4.173448e-07 -0.44060297
7 1.119703e-01 0.21075550
8 8.682557e-01 -0.01292402
9 4.736056e-08 -0.29830701
10 7.506821e-08 -1.20302227
11 7.341235e-01 -0.03986571
12 7.501776e-05 -0.94315815
13 1.145697e-04 0.49730175
14 2.214929e-13 0.04252241
15 4.597199e-01 -0.38539901
16 2.324931e-03 -0.74468628
17 4.366025e-06 -0.77037244
18 1.394450e-06 0.16679048
19 5.869884e-11 -0.75876486
20 1.817941e-04 0.04303387
21 7.060773e-04 0.06099372
22 8.267170e-06 -1.21687318
23 9.388680e-02 0.61135319
24 1.099290e-01 0.55715201
25 9.757236e-01 -0.33480226
26 9.544055e-01 0.09061006
27 7.322074e-07 0.09290822
28 1.014327e-06 -0.61658893
29 7.848382e-08 -0.78739456
30 1.791908e-04 -0.44073540
31 1.357918e-03 -0.22099008
32 5.192233e-06 -0.32744703
33 2.624779e-06 -0.37644068
34 6.414216e-02 -0.36947939
35 1.388143e-06 -0.00994845
36 3.010872e-05 -0.05984833
37 9.873201e-03 -0.21815268
38 3.896163e-04 -0.24009094
39 2.718760e-02 0.33383333
40 1.025650e-02 0.09779867

How can I look at a specific generated train and test sets made from for loop?

My program divides my dataset into train and test set, builds a decision tree based on the train and test set and calculates the accuracy, sensitivity and the specifity of the confusion matrix.
I added a for loop to rerun my program 100 times. This means I get 100 train and test sets. The output of the for loop is a result_df with columns of accuracy, specifity and sensitivity.
This is the for loop:
result_df<-matrix(ncol=3,nrow=100)
colnames(result_df)<-c("Acc","Sens","Spec")
for (g in 1:100 )
{
# Divide into Train and test set
smp_size <- floor(0.8 * nrow(mydata1))
train_ind <- sample(seq_len(nrow(mydata1)), size = smp_size)
train <- mydata1[train_ind, ]
test <- mydata1[-train_ind, ]
REST OF MY CODE
}
My result_df (first 20 rows) looks like this:
> result_df[1:20,]
Acc Sens Spec id
1 26 22 29 1
2 10 49 11 2
3 37 43 36 3
4 4 79 4 4
5 21 21 20 5
6 31 17 34 6
7 57 4 63 7
8 33 3 39 8
9 56 42 59 9
10 65 88 63 10
11 6 31 7 11
12 57 44 62 12
13 25 10 27 13
14 32 24 32 14
15 19 8 19 15
16 27 27 29 16
17 38 89 33 17
18 54 32 56 18
19 35 62 33 19
20 37 6 40 20
I use ggplot() to plot the specifity and the sensitivity as a scatterplot:
What I want to do :
I want to see e.g. the train and test set of datapoint 17.
I think I can do this by using the set.seed function, but I am very unfamiliar with this function.
First, clearly, if in your code you store your estimate models, e.g., in a list, then you could recover your data from those models. However, it doesn't look like that's the case.
With your current code all you can do is to see that last train and test sets (number 100). That is because you keep redefining test, train, train_ind variables. The cheapest (in terms of memory) way to achieve what you want would be to somehow store train_ind from each iteration. For instance, you could use
train_inds <- list()[rep(1, 100)]
for (g in 1:100 )
{
smp_size <- floor(0.8 * nrow(mydata1))
train_inds[[g]] <- sample(seq_len(nrow(mydata1)), size = smp_size)
train <- mydata1[train_inds[[g]], ]
test <- mydata1[-train_ind[[g]], ]
# The rest
}
and in this way you would always know which observations were in which set. If you somehow are interested only in one specific iteration, you could save only that one.
Lastly, set.seed isn't really going to help here. If all you were doing was running rnorm(1) hundred times, then yes, by using set.seed you could quickly recover the n-th generated value later. In your case, however, you are not only using sample for train_ind; the model estimation functions are also very likely generating random values.

glm giving different results in R 3.2.0 and R 3.2.2. Was there any substantial change in usage?

I use glm monthly to calculate a binomial model on the payment behaviour of a credit database, using a call like:
modelx = glm(paid ~ ., data = credit_db, family = binomial())
For the last month, I use R version 3.2.2 (just recently upgraded) and the results were very different than the previous month (done with R version 3.2.0). In order to check the code, I repeated the previous month calculations with version 3.2.2 and got different results from the previous calculation done in R 3.2.0.
Coefficients are also very different, in a wild form. I use at the beginning an exploratory model, with a variable that is the average number of delinquency days during the month, which should yield low coefficients for low average. In version 3.2.0, an extract of summary(modelx) was:
## Coefficients: Estimate Std. Error z value
## delinquency_avg_days1 -0.59329 0.18581 -3.193
## delinquency_avg_days2 -1.32286 0.19830 -6.671
## delinquency_avg_days3 -1.47359 0.21986 -6.702
## delinquency_avg_days4 -1.64158 0.21653 -7.581
## delinquency_avg_days5 -2.56311 0.25234 -10.158
## delinquency_avg_days6 -2.59042 0.25886 -10.007
and for version 3.2.2
## Coefficients Estimate Std. Error z value
## delinquency_avg_days.L -1.320e+01 1.083e+03 -0.012
## delinquency_avg_days.Q -1.140e+00 1.169e+03 -0.001
## delinquency_avg_days.C 3.439e+00 1.118e+03 0.003
## delinquency_avg_days^4 8.454e+00 1.020e+03 0.008
## delinquency_avg_days^5 3.733e+00 9.362e+02 0.004
## delinquency_avg_days^6 -4.988e+00 9.348e+02 -0.005
The summary output is a little different, since the Pr(>|z|) is shown. Notice also that the coefficient names changed too.
In the dataset this delinquency_avg_days variable have the following distribution (0 is "not paid", 1 is "paid", and as you can see, coefficients might be large for average days larger than 20 or so. Number of paid was sampled to match closely the number of "not paid".
0 1
0 140 663
1 59 209
2 62 118
3 56 87
4 66 50
5 69 41
6 64 40
7 78 30
8 75 31
9 70 29
10 77 23
11 69 18
12 79 17
13 61 13
14 53 5
15 67 18
16 50 10
17 40 9
18 39 8
19 23 9
20 24 2
21 36 9
22 35 1
23 17 0
24 11 0
25 11 0
26 7 1
27 3 0
28 0 0
29 0 1
30 1 0
In previous months, I used this exploratory model to create a second binomial model using ranges af average delinquency days. But this other model gives similar results with a few levels.
Now, I'd like to know whether there are substantial changes that require specifying other parameters or there is an issue with glm in version 3.2.2.

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents certain field-measured parameters (i.e. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and appending the results back to the proper station in a dataframe.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
your exact problem is still a mistery to me...
but it looks like you want a double for loop
for(i in 1:nrow(thalweg)){
residual=thalweg[i,"Xdep_Vdep"]
for(j in 2:11){
residual=min(residual,thalweg[i,j])
}
}

Barchart help in R

I am trying to set up a bar chart to compare control and experimental samples taken of specific compounds. The data set is known as 'hydrocarbon3' and contains the following information:
Exp. Contr.
c12 89 49
c17 79 30
c26 78 35
c42 63 3
pris 0.5 0.8
phy 0.5 0.9
nap 87 48
nap1 83 44
nap2 78 44
nap3 73 20
acen1 81 50
acen2 86 46
fluor 83 11
fluor1 68 13
fluor2 79 17
dibe 65 7
dibe1 67 6
dibe2 56 10
phen 82 13
phen1 70 12
phen2 65 15
phen3 53 14
fluro 62 9
pyren 48 11
pyren1 34 10
pyren2 19 8
chrys 22 3
chrys1 21 3
chrys2 21 3
When I create a bar chart with the formula:
barplot(as.matrix(hydrocarbon3),
main=c("Fig 1. Change in concentrations of different hydrocarbon compounds\nin sediments with and without the presence of bacteria after 21 days"),
beside=TRUE,
xlab="Oiled sediment samples collected at 21 days",
space=c(0,2),
ylab="% loss in concentration relative to day 0")
I receive this diagram, however I need the control and experimental samples of each chemical be next to each other allow a more accurate comparison, rather than the experimental samples bunched on the left and control samples bunched on the right: Is there a way to correct this on R?
Try transposing your matrix:
barplot(t(as.matrix(hydrocarbon3)), beside=T)
Basically, barplot will plot things in the order they show up in the matrix, which, since a matrix is just a vector wrapped colwise, means barplot will plot all the values of the first column, then all those of the second column, etc.
Check this question out: Barplot with 2 variables side by side
It uses ggplot2, so you'll have to use the following code before running it:
intall.packages("ggplot2")
library(ggplot2)
Hopefully this works for you. Plus it looks a little nicer with ggplot2!
> df
row exp con
1 a 1 2
2 b 2 3
3 c 3 4
> barplot(rbind(df$exp,df$con),
+ beside = TRUE,names.arg=df$row)
produces:

Resources