Why does the function return a value t = 13.214, but print(..., digits = 3) returns t = 10?
vals <- data.frame(a = c(4, 2, 4, 7, 3, 4, 8, 8, 3, 0, 1, 5, 4, 6, 4, 8, 7, 9, 6, 6, 3, 6, 7, 4),
b = c(5, 7, 6, 13, 12, 6, 14, 16, 4, 2, 7, 7, 4, 8, 9, 9, 11, 13, 12, 8, 3, 8, 7, 7))
stats::t.test(x = vals)
# One Sample t-test
# data: vals
# t = 13.214, df = 47, p-value < 2.2e-16
# alternative hypothesis: true mean is not equal to 0
# 95 percent confidence interval:
# 5.598761 7.609572
# sample estimates:
# mean of x
# 6.604167
print(stats::t.test(x = vals), digits = 3)
From ?print:
digits: minimal number of significant digits, see print.default.
But that should not change 13.214 to 10, should it?
package ‘stats’ version 3.5.1
R.version
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 5.1
year 2018
month 07
day 02
svn rev 74947
language R
version.string R version 3.5.1 (2018-07-02)
nickname Feather Spray
The first step in answering these questions is always to figure out which print method we're dealing with. The generic help in ?print won't necessarily be terribly relevant. t.test objects have class htest, so we want to look at print.htest.
Note that ?print.htest sends you to a slightly more specific documentation page. The documentation for digits doesn't say anything specific, but then in the Details section we see:
Both print methods traditionally have not obeyed the digits argument
properly. They now do, the htest method mostly in expressions like
max(1, digits - 2).
(This is in R 3.5.2)
For example, in the function code we see things like:
out <- c(out, paste(names(x$statistic), "=",
                    format(signif(x$statistic, max(1L, digits - 2L)))))
The default value of digits is typically 7 (it comes from getOption("digits")). The method uses the full digits for printing the sample estimates and the confidence interval, but fewer digits for other quantities such as the test statistic.
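To see concretely why t = 13.214 becomes t = 10 when digits = 3, you can reproduce the rounding step quoted above on its own (a minimal sketch of that one step, not of the whole print method):
# the statistic is rounded to max(1, digits - 2) significant digits
signif(13.214, max(1L, 3L - 2L))   # digits = 3 -> 1 significant digit
# [1] 10
signif(13.214, max(1L, 7L - 2L))   # default digits = 7 -> 5 significant digits
# [1] 13.214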
I have the following problem. I have n data points (typically n = 1000), which are integers from {1, 2, 3}, so there are many repeated values, and a real number d. I have to choose k < n points (k is given) that minimize the distance between the mean of those k points and the number d. This can be expressed as a MILP problem (please see here).
I tried to solve that in R using the lpSolve and Rglpk packages, but it takes a long time (I tried it for n = 100 points and the code has been running for 40 minutes already). I guess the issue is that there are many binary variables (n) and also many repeated values.
library(Rglpk)
set.seed(123)
sampsize <- sample(c(1,2,3), size=100, replace=TRUE)
k <- 50
d <- 86/47
lngth <- length(sampsize)
# decision variables: t (continuous, the deviation) followed by one binary per data point
f.obj <- c(1, rep(0, lngth))                  # minimise t
f.con <- matrix(c(0, rep(1, lngth),           # sum of binaries == k
                  -1, sampsize/k,             # mean - t <= d
                  1, sampsize/k),             # mean + t >= d
                nrow=3, byrow=TRUE)
f.dir <- c("==", "<=", ">=")
f.rhs <- c(k, d, d)
f.types <- c("C", rep("B", lngth))
res <- Rglpk_solve_LP(obj=f.obj, mat=f.con, dir=f.dir, rhs=f.rhs, max=FALSE, types=f.types)
I will be satisfied with a sub-optimal solution. Is there a way to solve it quickly or re-express the problem in a certain way to speed up the algorithm?
I would appreciate any input on this.
CVXR is a much better tool for this:
#
# generate random data
#
set.seed(123)
N <- 100 # sample size
v <- c(1,2,3) # sample from these unique values
M <- length(v) # number of unique values
data <- sample(v, size=N, replace=TRUE)
tab <- table(data) # tabulate
K <- 50 # number of points to choose
target <- 86/47 # target for mean
#
# CVXR model
# see https://cvxr.rbind.io/
#
library(CVXR)
# select a number of values from each bin
select <- Variable(M, integer=TRUE)
# obj: sum of absolute deviations
objective <- Minimize(abs(sum(v*select)/K - target))
# include nonnegativity constraints
constraints <- list(sum(select) == K, select >= 0, select <= vec(tab))
problem <- Problem(objective, constraints)
sol <- solve(problem, verbose=TRUE)
cat("\n")
cat("Status:",sol$status,"\n")
cat("Objective:",sol$value,"\n")
cat("Solution:",sol$getValue(select),"\n")
Output:
GLPK Simplex Optimizer, v4.65
9 rows, 4 columns, 17 non-zeros
0: obj = 0.000000000e+00 inf = 5.183e+01 (2)
3: obj = 5.702127660e-01 inf = 0.000e+00 (0)
* 4: obj = 1.065814104e-16 inf = 0.000e+00 (0)
OPTIMAL LP SOLUTION FOUND
GLPK Integer Optimizer, v4.65
9 rows, 4 columns, 17 non-zeros
3 integer variables, none of which are binary
Integer optimization begins...
Long-step dual simplex will be used
+ 4: mip = not found yet >= -inf (1; 0)
+ 55: >>>>> 1.021276596e-02 >= 9.787234043e-03 4.2% (52; 0)
+ 56: >>>>> 9.787234043e-03 >= 9.787234043e-03 < 0.1% (16; 36)
+ 56: mip = 9.787234043e-03 >= tree is empty 0.0% (0; 103)
INTEGER OPTIMAL SOLUTION FOUND
Status: optimal
Objective: 0.009787234
Solution: 26 7 17
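If you also want the selection expressed as concrete data points rather than per-value counts, a small post-processing step could expand the counts (my addition, not part of the CVXR model above):
counts <- as.integer(round(sol$getValue(select)))  # per-value counts, e.g. 26 7 17
picked <- rep(v, times = counts)                   # the K = 50 chosen points
mean(picked)                                       # should be close to target = 86/47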
The code below is written in Python, but I think the concept translates very easily and can be reformulated in R if desired.
Basically: Reformulate your problem. Instead of optimizing a long vector of binary "selection" variables, all you need is 3 variables to formulate this, specifically the (integer) number of 1's, 2's, and 3's to pick.
This solves almost instantaneously as an IP.
import pyomo.environ as pyo
from random import randint
n = 1000
k = 500
sample = [randint(1, 3) for t in range(n)]
avail = {t : len([val for val in sample if val==t]) for t in range(1, 4)}
target = 86/47
m = pyo.ConcreteModel()
m.vals = pyo.Set(initialize=[1,2,3])
m.pick = pyo.Var(m.vals, domain=pyo.NonNegativeIntegers)
m.delta = pyo.Var()
m.obj = pyo.Objective(expr=m.delta)
# constrain the delta as an absolute value of |sum(picks) - target|
m.C1 = pyo.Constraint(expr=m.delta >= sum(m.pick[v]*v for v in m.vals)-target*k)
m.C2 = pyo.Constraint(expr=m.delta >= -sum(m.pick[v]*v for v in m.vals)+target*k)
# don't use more than available for each value
def limit(m, v):
    return m.pick[v] <= avail[v]
m.C3 = pyo.Constraint(m.vals, rule=limit)
soln = pyo.SolverFactory('glpk').solve(m)
print(soln)
m.pick.display()
Yields:
Solver:
- Status: ok
Termination condition: optimal
Statistics:
Branch and bound:
Number of bounded subproblems: 885
Number of created subproblems: 885
Error rc: 0
Time: 0.3580749034881592
Solution:
- number of solutions: 0
number of solutions displayed: 0
pick : Size=3, Index=vals
Key : Lower : Value : Upper : Fixed : Stale : Domain
1 : 0 : 3.0 : None : False : False : NonNegativeIntegers
2 : 0 : 0.0 : None : False : False : NonNegativeIntegers
3 : 0 : 304.0 : None : False : False : NonNegativeIntegers
Realize you can also attack this algorithmically quite efficiently and get a (pretty easy) near-optimal answer, or, with some sweat equity, the optimal answer as well. Below is a framework that I tinkered with. The key observation is that you can keep adding 3's to the solution up to the point where the amount still to go (to reach k * target) can be filled entirely with 1's. That is very close to as good as it gets, except for cases where you would be better off substituting a couple of 2's near the end, I think, or backing up if you run out of 1's.
The code below runs (in Python) and gets most of the way to a good approximation.
Code:
# average hitting
from random import randint

n = 1000
k = 50
sample = [randint(1, 3) for t in range(n)]
available = {t: len([val for val in sample if val == t]) for t in range(1, 4)}
target = 86/47
print(f'available at start: {available}')
sum_target = target * k
soln = []
selections_remaining = k
togo = sum_target - sum(soln)
for pick in range(k):
    if togo > k - pick and available[3] > 0:
        soln.append(3)
        available[3] -= 1
    elif togo > k - pick and available[2] > 0:
        soln.append(2)
        available[2] -= 1
    elif available[1] > 0:
        soln.append(1)
        available[1] -= 1
    else:  # ran out of ones in the home stretch... do a swap
        pass
        # some more logic...
    togo = sum_target - sum(soln)
print(f'solution: {soln}')
print(f'generated: {sum(soln)/k} for target of {target}')
print(f'leftover: {available}')
Yields:
available at start: {1: 349, 2: 335, 3: 316}
solution: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
generated: 1.84 for target of 1.8297872340425532
leftover: {1: 291, 2: 335, 3: 274}
[Finished in 117ms]
I'm reading the book "Spatial Point Patterns: Methodology and Applications with R", Chapter 6, trying to replicate all the examples following the code at the companion website. I cannot replicate Figure 6.15 (a) since this is the output I get and it's way different from the Figure in the book.
library(spatstat)
#> Loading required package: spatstat.data
#> Loading required package: nlme
#> Loading required package: rpart
#>
#> spatstat 1.60-1 (nickname: 'Swinging Sixties')
#> For an introduction to spatstat, type 'beginner'
swp <- rescale(swedishpines)
aden <- adaptive.density(swp, f=0.1, nrep=30)
#> Computing 30 intensity estimates...
#>
#> PLEASE NOTE: The components "delsgs" and "summary" of the
#> object returned by deldir() are now DATA FRAMES rather than
#> matrices (as they were prior to release 0.0-18).
#> See help("deldir").
#>
#> PLEASE NOTE: The process that deldir() uses for determining
#> duplicated points has changed from that used in version
#> 0.0-9 of this package (and previously). See help("deldir").
#> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30.
#> Done.
rainsat <- function(n) {
  grade <- sqrt(seq(0.1, 1, length=n))
  rainbow(n=n, start=1/2, s=grade)
}
par(mar = c(1, 0, 0, 2))
plot(aden, main="", ribscale=1000, col=rainsat)
plot(swp, add=TRUE, pch=3)
Created on 2019-09-06 by the reprex package (v0.3.0)
What's the problem here? What am I doing wrong? Even if I run all the code in the startup.R and figurelayout.R files (which should just change the colours of the plots, making them black and white), I still cannot get the same plot.
adaptive.density involves randomisation, so you will not get the same result if you repeat the same command twice (unless you reset the random number seed, e.g. with set.seed).
A larger value of nrep will reduce the random variation.
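A minimal sketch (my addition, not from the original answer) of how you could make the computation reproducible and damp the random variation, assuming the same swedishpines setup as above:
library(spatstat)
swp <- rescale(swedishpines)
set.seed(42)                                     # fix the seed so repeated runs agree
aden <- adaptive.density(swp, f=0.1, nrep=200)   # larger nrep -> less random variation
plot(aden, main="")
plot(swp, add=TRUE, pch=3)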
I have a distribution, for example:
d
#[1] 4 22 15 5 9 5 11 15 21 14 14 23 6 9 17 2 7 10 4
Or, the vector d in dput format.
d <- c(4, 22, 15, 5, 9, 5, 11, 15, 21, 14, 14, 23, 6, 9, 17, 2, 7, 10, 4)
And when I apply ks.test:
gamma <- ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563)
This gives the following warning:
Warning message:
In ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563) :
ties should not be present for the Kolmogorov-Smirnov test
I tried using unique(d), but that obviously reduces my data, and I would not like that to happen.
In other examples online the same warning appears, but there the test still shows its results along with the warning, rather than only the message without the ks.test values.
Can anyone help?
You can find your result in gamma; the warning message is not blocking:
d <- c(4, 22, 15, 5, 9, 5, 11, 15, 21, 14, 14, 23, 6, 9, 17, 2, 7, 10, 4)
gamma <- ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563)
Warning message:
In ks.test(d, "pgamma", shape = 3.178882, scale = 3.526563) :
  ties should not be present for the Kolmogorov-Smirnov test
gamma
One-sample Kolmogorov-Smirnov test
data: d
D = 0.14549, p-value = 0.816
alternative hypothesis: two-sided
You can find an explanation of the warning in the help page ?ks.test:
The presence of ties always generates a warning, since continuous
distributions do not generate them. If the ties arose from rounding
the tests may be approximately valid, but even modest amounts of
rounding can have a significant effect on the calculated statistic.
As you can see, your data involve rounding (they are integers), so the test is only "approximately" valid.
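If you want to avoid the warning altogether, one common workaround (my suggestion, not part of the answer above) is to break the ties with a tiny amount of random jitter before running the test; the statistic should be essentially unchanged:
set.seed(1)
d_jit <- d + runif(length(d), -1e-6, 1e-6)   # tiny perturbation just to break ties
ks.test(d_jit, "pgamma", shape = 3.178882, scale = 3.526563)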
Due to the nature of my specification, my regression coefficients give the slope (change in yield) between two points; therefore, I would like to plot these coefficients as the slopes of line segments between consecutive points, with the first point (0, -0.7620) as the intercept. Please note this is a programming question, not a statistics question.
I'm not entirely sure how to implement this in base graphics or ggplot and would appreciate any help. Here is some sample data.
Sample Data:
df <- data.frame(x = c(0, 5, 8, 10, 12, 15, 20, 25, 29), y = c(-0.762,-0.000434, 0.00158, 0.0000822, -0.00294, 0.00246, -0.000521, -0.00009287, -0.01035) )
Output:
x y
1 0 -7.620e-01
2 5 -4.340e-04
3 8 1.580e-03
4 10 8.220e-05
5 12 -2.940e-03
6 15 2.460e-03
7 20 -5.210e-04
8 25 -9.287e-05
9 29 -1.035e-02
Example:
You can use cumsum, the cumulative sum, to calculate the intermediate values:
df <- data.frame(x = c(0, 5, 8, 10, 12, 15, 20, 25, 29),
                 y = cumsum(c(-0.762, -0.000434, 0.00158, 0.0000822, -0.00294,
                              0.00246, -0.000521, -0.00009287, -0.01035)))
plot(df$x, df$y)
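Since the question also mentions ggplot, here is a minimal ggplot2 sketch (my addition) that draws the same cumulative values as connected segments, reusing the df built above:
library(ggplot2)
ggplot(df, aes(x, y)) +
  geom_line() +    # straight segments whose slopes are the original coefficients
  geom_point()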
I need to calculate the product of the edge attributes along the shortest path between two vertices in my graph.
For example:
library(igraph)
data <- as.data.frame(cbind(c(1,2,3,4,5,1), c(4,3,4,5,6,5), c(0.2,0.1,0.5,0.7,0.8,0.2)))
G <- graph.data.frame(data, directed=FALSE)
G <- set.edge.attribute(G, "V3", index=E(G), data$V3)   # igraph functions return a new graph
If I calculate the shortest path according to the attribute, I have two possibilities. The first tells me the steps:
get.shortest.paths (G, 2, 6, weights=E(G)$V3)
2 3 4 1 5 6
The second tells me the sum of the attribute along the path:
shortest.paths (G, 2, 6, weights=E(G)$V3)
1.8
Since I need to compute a product, I need a vector of the edge attributes along my path. In this example I should get 0.8 0.2 0.2 0.5 0.1, whose product is 0.0016.
Can anyone suggest how to do it?
Use the output argument of get.shortest.paths:
library(igraph)
data <- data.frame(from =c(1, 2, 3, 4, 5, 1),
to =c(4, 3, 4, 5, 6, 5),
weight=c(0.2,0.1,0.5,0.7,0.8,0.2))
G <- graph.data.frame(data, directed=FALSE)
esp26 <- get.shortest.paths(G, 2, 6, output="epath")[[1]]
esp26
# [1] 2 3 1 6 5
prod(E(G)$weight[esp26])
# [1] 0.0016
plot(G, edge.label=paste("Id:", 1:ecount(G), "\n", "W:",
E(G)$weight, sep=""))
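One extra note (my addition, based on my understanding of newer igraph releases): on igraph >= 1.0, get.shortest.paths returns a named list, so the edge path is typically pulled out of the $epath component rather than with [[1]]:
# assuming igraph >= 1.0 and the same graph G as above
esp26 <- get.shortest.paths(G, 2, 6, output="epath")$epath[[1]]
prod(E(G)$weight[as.integer(esp26)])   # same product as above, 0.0016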