It is my understanding that when plotting a histogram, it's not that every unique data point gets its own bin; an algorithm decides how many bins to use. How do I find out how the data were partitioned into bins, e.g. 0-5, 6-10, ...? How do I get R to show me where the breaks are via text output?
I've found various methods for calculating the number of bins, but that's just the theory; I want the actual break points that R used.
I think you need to use $breaks:
set.seed(10)
hist(rnorm(200,0,1),20)$breaks
[1] -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4
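If you want the breaks without drawing the plot, hist() can be called with plot = FALSE; the returned object also carries the counts and midpoints. A small sketch:
h <- hist(rnorm(200, 0, 1), breaks = 20, plot = FALSE)  # compute only, no plot
h$breaks  # bin edges
h$counts  # number of observations in each bin
h$mids    # bin midpoints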
How can I make sure that by adding 0.2 at every iteration I get the correct result?
some = 0.0
for i in 1:10
    some += 0.2
    println(some)
end
The code above gives me:
0.2
0.4
0.6000000000000001
0.8
1.0
1.2
1.4
1.5999999999999999
1.7999999999999998
1.9999999999999998
Floats are only approximately correct, and the rounding error accumulates the longer you keep adding, but you can still calculate with them fairly precisely. If you need to check whether a result is correct, you can use isapprox(a, b) or a ≈ b. For example:
some = 0.0
for i in 1:1000000
    some += 0.2
end
isapprox(some, 1000000 * 0.2)
# true
Alternatively, you can accumulate whole numbers in the loop and divide by 10 when printing:
some = 0.0
for i in 1:10
    some += 2.0
    println(some / 10.0)
end
#0.2
#0.4
#0.6
#0.8
#1.0
#1.2
#1.4
#1.6
#1.8
#2.0
More info about floating-point arithmetic:
https://en.wikipedia.org/wiki/Floating-point_arithmetic
You can also iterate over a range, since ranges use some clever tricks to return more "natural" values:
julia> collect(0:0.2:2)
11-element Vector{Float64}:
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.8
2.0
julia> collect(range(0.0, step=0.2, length=11))
11-element Vector{Float64}:
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.8
2.0
I am using Bayesian logistic regression (probit link) from the rstanarm package to train a model on default events. As inputs, the model takes some financial ratios and some qualitative data. Is there a way to constrain the coefficients of the qualitative variables only so that they are always positive?
For example, when I use a single prior for everything I get these results (I calibrate the model using MCMC, with set.seed(12345)):
prior <- rstanarm::normal(location = 0, scale = NULL, autoscale = TRUE)

model.formula <- formula(paste0('default_events ~ fin_ratio_1 + ',
                                'fin_ratio_2 + fin_ratio_3 + ',
                                'fin_ratio_4 + fin_ratio_5 + ',
                                'fin_ratio_6 + fin_ratio_7 + ',
                                'fin_ratio_8 + Qual_1 + Qual_2 + ',
                                'Qual_3 + Qual_4'))

bayesian.model <- rstanarm::stan_glm(model.formula,
                                     family = binomial(link = "probit"),
                                     data = as.data.frame(ds), prior = prior,
                                     prior_intercept = NULL,
                                     init_r = .1, iter = 600, warmup = 200)
The coefficients are the following:
summary(bayesian.model)
Estimates:
mean sd 2.5% 25% 50% 75% 97.5%
(Intercept) -2.0 0.4 -2.7 -2.3 -2.0 -1.7 -1.3
fin_ratio_1 -0.7 0.1 -0.9 -0.8 -0.7 -0.6 -0.4
fin_ratio_2 -0.3 0.1 -0.5 -0.4 -0.3 -0.2 -0.1
fin_ratio_3 0.4 0.1 0.2 0.4 0.4 0.5 0.6
fin_ratio_4 0.3 0.1 0.1 0.2 0.3 0.3 0.4
fin_ratio_5 0.2 0.1 0.1 0.2 0.2 0.3 0.4
fin_ratio_6 -0.2 0.1 -0.4 -0.2 -0.2 -0.1 0.0
fin_ratio_7 -0.3 0.1 -0.5 -0.3 -0.3 -0.2 -0.1
fin_ratio_8 -0.2 0.1 -0.5 -0.3 -0.2 -0.1 0.0
Qual_1 -0.2 0.1 -0.3 -0.2 -0.2 -0.1 -0.1
Qual_2 0.0 0.1 -0.1 -0.1 0.0 0.0 0.1
Qual_3 0.2 0.0 0.1 0.1 0.2 0.2 0.3
Qual_4 0.0 0.2 -0.3 -0.1 0.0 0.1 0.3
The question is: can I use two different prior distributions, e.g. normal for the fin_ratio_x variables and exponential or Dirichlet for the Qual_x variables?
Neither per-coefficient prior families nor inequality restrictions on coefficients are possible with the models supplied by the rstanarm package. Either or both is fairly easy to do with the brms package or by writing your own Stan program.
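For the write-your-own-Stan-program route, here is a minimal sketch fitted through rstan. The data names (X_fin, X_qual, y, stan_data) and the particular priors are illustrative assumptions, not something prescribed by rstanarm or by the question; the essential part is the <lower=0> declaration on the qualitative coefficients.
library(rstan)

stan_code <- "
data {
  int<lower=0> N;                     // number of observations
  int<lower=0> K_fin;                 // number of financial ratios
  int<lower=0> K_qual;                // number of qualitative variables
  matrix[N, K_fin] X_fin;
  matrix[N, K_qual] X_qual;
  int<lower=0, upper=1> y[N];         // default indicator
}
parameters {
  real alpha;
  vector[K_fin] beta_fin;             // unconstrained coefficients
  vector<lower=0>[K_qual] beta_qual;  // constrained to be positive
}
model {
  alpha ~ normal(0, 2.5);             // illustrative priors
  beta_fin ~ normal(0, 2.5);
  beta_qual ~ exponential(1);         // prior with positive support
  // probit link: Phi() is the standard normal CDF
  y ~ bernoulli(Phi(alpha + X_fin * beta_fin + X_qual * beta_qual));
}
"

## stan_data is assumed to be a named list holding N, K_fin, K_qual,
## X_fin, X_qual and y, prepared from your data frame beforehand.
fit <- rstan::stan(model_code = stan_code, data = stan_data, seed = 12345)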
I want to calculate a correlation score between two sets of numbers, but these numbers are within each row.
The background is that I'm building a recommender system, using PCA to give me scores for each user and each item against each derived feature (1, 2, 3 in this case):
user item user_score_1 user_score_2 user_score_3 item_score_1 item_score_2 item_score_3
A 1 0.5 0.6 -0.2 0.2 0.8 -0.3
A 2 0.5 0.6 -0.2 0.4 0.1 -0.8
A 3 0.5 0.6 -0.2 -0.2 -0.4 -0.1
B 1 -0.6 -0.1 0.9 0.2 0.8 -0.3
B 2 -0.6 -0.1 0.9 0.4 0.1 -0.8
B 3 -0.6 -0.1 0.9 -0.2 -0.4 -0.1
I've combined the outputs for each user and item into this all-by-all table. For each row in this table, I need to calculate the correlation between user scores 1-3 and item scores 1-3 (e.g. for the first row, the correlation between 0.5, 0.6, -0.2 and 0.2, 0.8, -0.3) to see how well the user and the item match.
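To make that concrete, per row I'm after something like this (a rough sketch, where scores stands for the table above):
user_cols <- c("user_score_1", "user_score_2", "user_score_3")
item_cols <- c("item_score_1", "item_score_2", "item_score_3")

# one Pearson correlation per row, comparing the user profile to the item profile
scores$match <- sapply(seq_len(nrow(scores)), function(i) {
  cor(unlist(scores[i, user_cols]), unlist(scores[i, item_cols]))
})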
The other alternative would be to do the correlation before I join the users and items into an all-by-all dataset, but I'm not sure how best to do that either.
I don't think I can transpose the table, as in reality the number of users and items is very large.
Any thoughts on a good approach?
Thanks,
Andrew
I can't find how to tell R that I want to make a kind of "continuous" vector.
I want something like x <- c(-1, 1) to give me a vector of n values with a specific interval (e.g. 0.1, or anything I want), so I can generate a "continuous" vector such as
x
[1] -1.0 -0.9 -0.8 -0.7 ... 1.0
I know this should be basic but I can't find my way to the solution.
Thank you
It sounds like you're looking for seq:
seq(-1, 1, by = .1)
# [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4
# [16] 0.5 0.6 0.7 0.8 0.9 1.0
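If you would rather specify how many points you want than the step size, seq also accepts length.out:
seq(-1, 1, length.out = 21)
# same values as above: -1.0, -0.9, ..., 1.0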
I asked a question like this before, but I decided to simplify my data format because I'm very new at R and didn't understand what was going on. Here's the link to that question: How to handle more than multiple sets of data in R programming?
I edited what my data should look like and decided to leave it in this format:
X1.0 X X2.0 X.1
0.9 0.9 0.2 1.2
1.3 1.4 0.8 1.4
As you can see, I have four columns of data; the real data I'm dealing with has up to 2000 data points. Columns "X1.0" and "X2.0" are times, so what I want is the average of "X" and "X.1" every 100 seconds, based on my two time columns "X1.0" and "X2.0". I can do it using this command:
# bin the X1.0 times into 400-second intervals, then average X within each bin
cuts <- cut(data$X1.0, breaks=seq(0, max(data$X1.0)+400, 400))
by(data$X, cuts, mean)
But this only gives me the averages for one pair of columns, "X1.0" and "X". How do I do it so that I get the averages for more than one pair? I also want to stop getting this kind of output:
cuts: (0,400]
[1] 0.7
------------------------------------------------------------
cuts: (400,800]
[1] 0.805
Note that this output was done every 400 s. What I really want is a list of those cut averages at the different intervals. Please help. (I just used data = read.delim("clipboard") to get my data into R.)
It is a little bit confusing what output you want to get.
First I change the column names, but this is optional:
colnames(dat) <- c('t1','v1','t2','v2')
Then I use ave, which is like by but with better-shaped output. I use a matrix as a trick to index the column pairs:
matrix(1:ncol(dat), ncol = 2)  ## first column pairs columns 1 and 2 (t1, v1), second pairs 3 and 4 (t2, v2)
[,1] [,2]
[1,] 1 3
[2,] 2 4
Then I use this matrix with apply. Here is the entire solution:
cbind(dat,
      apply(matrix(1:ncol(dat), ncol = 2), 2,
            function(x, by = 10) {   ## by 10 seconds! you can replace this
                                     ## with 100 or 400 in your real data
              t.col <- dat[, x][, 1] ## time column of this pair
              v.col <- dat[, x][, 2] ## value column of this pair
              ave(v.col,
                  cut(t.col, breaks = seq(0, max(t.col), by)),
                  FUN = mean)
            }))
EDIT: corrected the cut and simplified the code:
cbind(dat,
      apply(matrix(1:ncol(dat), ncol = 2), 2,
            ## average the value column, grouped by which "by"-second
            ## interval the time column falls in
            function(x, by = 10) ave(dat[, x][, 2], dat[, x][, 1] %/% by)))
   X1.0   X X2.0 X.1    1   2
1   0.9 0.9  0.2 1.2 1.38 1.3
2   1.3 1.4  0.8 1.4 1.38 1.3
3   2.0 1.7  1.6 1.1 1.38 1.3
4   2.6 1.9  2.2 1.6 1.38 1.3
5   9.7 1.0  2.8 1.3 1.38 1.3
6  10.7 0.8  3.5 1.1 1.65 1.3
7  11.6 1.5  4.1 1.8 1.65 1.3
8  12.1 1.4  4.7 1.2 1.65 1.3
9  12.6 1.8  5.4 1.2 1.65 1.3
10 13.2 2.1  6.3 1.3 1.65 1.3
11 13.7 1.6  6.9 1.1 1.65 1.3
12 14.2 2.2  9.4 1.3 1.65 1.3
13 14.6 1.8 10.0 1.5 1.65 1.5
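If you only want one value per interval instead of the interval mean repeated on every row, tapply gives that directly. A sketch on the same dat, using column positions and by = 10 (use 100 for your real data):
tapply(dat[, 2], dat[, 1] %/% 10, mean)  # one mean of the first value column per interval
tapply(dat[, 4], dat[, 3] %/% 10, mean)  # one mean of the second value column per interval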