Sampling Error - Basic Regression Model in Stan - stan

I am trying to learn Stan and am working through some deliberately simple problems to get myself up to speed. I have got stuck very much at level 1, having tried to run a simple bivariate regression.
I have data of the following format
stan_data <- list("y"=y,
"year"=year,
"N_obs" = N_obs)
The full data is pasted at the bottom of this post, out of the way.
Anyway, my Stan code for a bivariate regression looks like this:
lm <- "data {
int<lower=1> N_obs;
real year[N_obs];
real y[N_obs];
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
transformed parameters{
}
model {
vector[N_obs] mu_hat;
alpha ~ normal(0, 100);
beta ~ normal(0, 100);
sigma ~ uniform(0, 100);
for(i in 1:N_obs){
mu_hat[i] <- alpha + beta * year[i];
y[i] ~ normal(mu_hat[i], sigma);
}
}"
write(lm, file="lm.stan")
lm.fit0 <- stan(file="lm.stan",
data=stan_data,
chains=1,
iter=5000)
First, I have a query. Why do we have the statement in the model block vector[N_obs] mu_hat; (instead of real mu_hat[N_obs]; in the transformed parameters block)? It seems from a little Googling that this is what you need to do.
Second (and more seriously), when I try to run the code I get the following error:
TRANSLATING MODEL 'lm' FROM Stan CODE TO C++ CODE NOW.
COMPILING THE C++ CODE FOR MODEL 'lm' NOW.
SAMPLING FOR MODEL 'lm' NOW (CHAIN 1).
Error : Error in function stan::prob::normal_log(d): Random variable is nan, but must not be nan!
In addition: Warning message:
In storage.mode(x) <- "integer" : NAs introduced by coercion
error occurred during calling the sampler; sampling not done
As usual, any help greatly appreciated.
The data in fact looks like this:
stan_data
$y
42089728 9339536 9781184 138361088 30910448 30411792
629997056 21062368 1167006 7631744 6925444 5893008
35743680 -55904 116299776 966712 178152 19397504
101188992 1536242176 44078264 1243806 105937664 43202352
-4213172 40201728 84412544 16671128 0 19432968
44403296 89021120 33442736 5850532 68061664 0
86286272 636771072 65779408 6416524 25559184 0
0 11437649 128506560 26867136 1646992 -16684608
43974528 6812660 0 0 -906249 17730360
6571846 -14056304 -2317026 29722656 43035904 70388248
-202987 24308224 0 19598944 25241600 31093140
172198080 68365824 -15307088 345229424 0 91912288
6387084 6936104 362958976 10828080 34233728 465616896
185831488 4554222 14789792 19448168 27692960 88308096
75171552 -246307584 11228152 8361832 2265296 172424512
1182046720 22629408 1165429 348064512 77001792 11092408
84706848 -19970752 -2386432 66124424 19266104 72069984
14311872 -1680048 509040 188740112 318636288 170175680
-244937216 16264160 6017916 327072 159117760 0
8156479 320665728 36684736 17502416 29556064 47395008
12937934 168051632 0 892982 10329560 1355983
-4529648 -43117 -10704432 226641152 23704368 -3433973
-73329408 0 3594688 51327088 59915116 293390016
382384192 -12102624 -336263424 0 -24685504 -899952 10155976 218019584 48748112 30058752 1842414592 44083792
5092000 24174848 10985128 33436544 159885024 36513376
140204416 12631560 8951732 25929808 353803264 3143784448 60253136 702773 506841344 38420128 11721112 92972608
60845840 30016168 37990192 -6470864 78287520 21554528
29755168 3766984 35639136 26794784 583849280 267967488
37916960 11501600 22704880 133042624 513627 3389580
289430272 21665616 85471472 39646656 116267616 -13407846
15678080 27691000 682450 9635360 580544 16791136
793524 38486832 -79701376 -63242544 2160139 202091584
300 60001872 120758144 50716744 13548672 623414144
21202400 0 0 17696512 -5566584 -3197064
201575680 34187360 50923296 1267788800 28845072 1021406
20589376 5255816 19726800 43046336 84012320 93750016
1549232 4102708 20721248 36500736 5098330112 -20425392
781041 247644672 28292416 21682296 52508672 38884352
57993648 953560 1437008 81498304 86611584 23846608
5454052 37785760 99136512 58742016 1308937472 37354624
14447532 19370288 81054432 108383989 5834392 196654592
-37886048 199787840 -38083360 -19815904 1496112 7065456
30429000 -190947 3102040 5150997 6569152 711859
42429536 148236256 70894720 -888473 62231296 15503290
-17289808 106739712 -46661248 -185851136 602047616 15609200
940000 0
$year
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4 4
$N_obs
284

Please see https://groups.google.com/d/msg/stan-users/fsM8GPG4cpM/YVedcWYcmW8J
In short, this is a bug in RStan: it tries to change the storage mode of integer-valued data to integer. Here some of the values are too large to be represented as R integers, so NAs are introduced by the coercion. It will be fixed in the next release.
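The coercion can be reproduced in plain R, without Stan. The sketch below uses three values copied from the y vector in the question. Dividing by 1e6 is only an assumed workaround, not something taken from the linked thread: it keeps every value well inside the integer range, and it also puts y on a scale closer to the normal(0, 100) priors.

```r
# Three values copied from the y vector in the question.
y <- c(42089728, 3143784448, 5098330112)

# R integers are 32-bit, so the largest representable value is 2147483647.
.Machine$integer.max

# Values above that limit cannot be stored as integers ...
too_big <- y > .Machine$integer.max

# ... and coercing them yields NA (with a coercion warning, suppressed here).
y_int <- suppressWarnings(as.integer(y))

# Assumed workaround: rescale y (here to millions) before passing it to Stan.
y_scaled <- y / 1e6
```

With the rescaled data, alpha, beta and sigma are then expressed in millions of the original unit.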

Related

The R function I have previously used successfully no longer works

I am trying to create a summary for this data set
Morph ID black white orange green
1 O 1 2 1 0 3
2 O 2 2 1 3 0
3 O 3 2 1 1 2
4 O 4 3 0 2 1
5 O 5 3 0 2 1
6 O 6 3 0 1 2
7 O 7 3 0 1 2
8 O 8 3 0 3 0
9 O 9 0 3 2 1
10 O 10 3 0 3 0
11 O 11 3 0 1 2
12 O 12 0 3 2 1
13 O 13 3 0 2 1
14 O 14 3 0 2 1
15 O 15 2 1 1 2
I created the summary below before with a data set that has the exact same format.
n mean sd min Q1 median Q3 max percZero Choice se
sum.greenO 15 0.8666667 1.187234 0 0 0 2 3 60.00000 Orange 0.3065424
sum.greenG 15 2.1333333 1.187234 0 1 3 3 3 13.33333 Green 0.3065424
I used the function Summarize() but this function is no longer working.
I need to create the same bar graph I made for this previous data set, which I can't do without "n", "sd", or "se". (I created "se" using "n" and "sd" - it didn't come with the initial function output).
I am confused about how a function can stop working? Is there an alternative function I am not aware of?
Please let me know if this doesn't make any sense.
The following R packages on CRAN all provide a function called "Summarize" with a capital S:
> collidr::CRAN_packages_and_functions() %>% filter(function_names == "Summarize")
package_names function_names
1 alakazam Summarize
2 basket Summarize
3 bayesm Summarize
4 ChemoSpec Summarize
5 ChemoSpecUtils Summarize
6 cold Summarize
7 dataMaid Summarize
8 fastJT Summarize
9 FSA Summarize
10 GLMpack Summarize
11 LAGOSNE Summarize
12 lslx Summarize
13 MapGAM Summarize
14 MetaIntegrator Summarize
15 NetMix Summarize
16 PKNCA Summarize
17 ppclust Summarize
18 qad Summarize
19 radiant.model Summarize
20 ssmrob Summarize
Of course it is not guaranteed you made the previous summary with one of them, but hopefully this helps you find the right one.
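If the original package cannot be identified, the same columns can also be computed with base R. This is a minimal sketch, assuming the green column from the 15 rows shown in the question. The helper name summarize_col is made up for illustration and does not come from any of the packages listed above, and its quartiles use quantile()'s default method, which may differ slightly from what the original function reported.

```r
# green counts copied from the 15 rows shown in the question
green <- c(3, 0, 2, 1, 1, 2, 2, 0, 1, 0, 2, 1, 1, 1, 2)

# Hypothetical helper: a one-row summary with the columns of the old output.
summarize_col <- function(x) {
  x <- x[!is.na(x)]                       # drop missing values first
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  data.frame(
    n = length(x), mean = mean(x), sd = sd(x),
    min = min(x), Q1 = q[1], median = q[2], Q3 = q[3], max = max(x),
    percZero = 100 * sum(x == 0) / length(x),  # share of zeros, in percent
    se = sd(x) / sqrt(length(x))               # standard error of the mean
  )
}

sum.green <- summarize_col(green)
sum.green
```

Because n, sd and se are all in the result, the bar graph with error bars can be rebuilt from this data frame alone.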

problem in changing matrix to a data frame with same dimensions

I have tried to create a data frame from a matrix; however, the result has different dimensions from the original matrix. Please see my code below:
out <- table(UL_Final$Issue_Year, UL_Final$Insured_Age_Group)
out <- out/rowSums(out) #changing all numbers to ratio
The result is a matrix 12 by 7:
1 2 3 4 5 6 7
1387 0.165137615 0.036697248 0.229357798 0.321100917 0.201834862 0.018348624 0.027522936
1388 0.149222065 0.110325318 0.197312588 0.342291372 0.136492221 0.055162659 0.009193777
1389 0.144979508 0.101946721 0.222848361 0.335553279 0.138575820 0.046362705 0.009733607
1390 0.146991622 0.120030465 0.191622239 0.336024372 0.142269612 0.052551409 0.010510282
1391 0.165462754 0.111794582 0.185835214 0.321049661 0.135553047 0.064503386 0.015801354
1392 0.162399144 0.109583402 0.165321917 0.317388441 0.146344476 0.076115594 0.022847028
1393 0.181602139 0.116447173 0.151104070 0.325131201 0.148628577 0.062778493 0.014308347
1394 0.163760504 0.098529412 0.142489496 0.323792017 0.178728992 0.076050420 0.016649160
1395 0.137097032 0.094699511 0.128981757 0.321320170 0.197610147 0.098245950 0.022045433
1396 0.167187958 0.103851041 0.112696706 0.293202033 0.200689082 0.099306031 0.023067149
1397 0.193250090 0.130540713 0.108114843 0.270743930 0.186411584 0.091364656 0.019574185
1398 0.208026156 0.147573562 0.100455157 0.249503173 0.191935380 0.083338676 0.019167895
then using the code below:
out <- data.frame(out)
However, the result is a data frame with dimensions 84 by 3:
Var1 Var2 Freq
1 1387 1 0.165137615
2 1388 1 0.149222065
3 1389 1 0.144979508
4 1390 1 0.146991622
5 .... .......
I am not sure why this happens. In another case, explained below, I do not see this strange behavior. There I used the code below to calculate a ratio for another variable:
out <- table( df_select$Insured_Age_Group,df_select$Policy_Status)
out <- cbind(out, Ratio = out[,2]/rowSums(out))
the result is :
Issuance Surrended Ratio
1 31046 5735 0.1559229
2 20039 4409 0.1803420
3 20399 9228 0.3114726
4 48677 17216 0.2612721
5 30045 8132 0.2130078
6 13947 4106 0.2274414
7 3157 1047 0.2490485
Now if we use the code below (by @Ronak Shah):
out <- data.frame(out) %>% mutate(x = row_number())
the result is :
Issuance Surrended Ratio x
1 31046 5735 0.1559229 1
2 20039 4409 0.1803420 2
3 20399 9228 0.3114726 3
4 48677 17216 0.2612721 4
5 30045 8132 0.2130078 5
6 13947 4106 0.2274414 6
7 3157 1047 0.2490485 7
As you can see the result is now a data frame with same dimension. Can anyone explain why this happens?
See ?table for an explanation:
The as.data.frame method for objects inheriting from class "table" can be used to convert the array-based representation of a contingency table to a data frame containing the classifying factors and the corresponding entries (the latter as component named by responseName). This is the inverse of xtabs.
A workaround is to use as.data.frame.matrix:
m <- table(mtcars$carb, mtcars$gear)
as.data.frame(m)
# Var1 Var2 Freq
# 1 1 3 3
# 2 2 3 4
# 3 3 3 3
# 4 4 3 5
# 5 6 3 0
# 6 8 3 0
# 7 1 4 4
# 8 2 4 4
# 9 3 4 0
# 10 4 4 4
# 11 6 4 0
# 12 8 4 0
# 13 1 5 0
# 14 2 5 2
# 15 3 5 0
# 16 4 5 1
# 17 6 5 1
# 18 8 5 1
as.data.frame.matrix(m)
# 3 4 5
# 1 3 4 0
# 2 4 4 2
# 3 3 0 0
# 4 5 4 1
# 6 0 0 1
# 8 0 0 1
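The same reasoning explains both cases in the question: dividing a table by its row sums keeps the "table" class, whereas cbind() returns a plain matrix, which data.frame() converts column by column. A small sketch, using mtcars as a stand-in for the question's data (the object names p, wide and b are only illustrative):

```r
m <- table(mtcars$carb, mtcars$gear)

# Dividing by the row sums keeps the "table" class ...
p <- m / rowSums(m)
class(p)

# ... so data.frame(p) would give the long Var1/Var2/Freq layout.
# as.data.frame.matrix() keeps the 6 x 3 shape instead.
wide <- as.data.frame.matrix(p)
dim(wide)

# cbind() returns an ordinary matrix, so data.frame() keeps its shape.
b <- cbind(m, Ratio = m[, 2] / rowSums(m))
class(b)
dim(data.frame(b))
```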

How to do a co-occurrence matrix from multiple data frames in R

English is not my first language, so I apologize in advance for any mistakes. I'm also new to R, as you will notice.
I'm trying to build a co-occurrence matrix. I have several data frames and I am interested in 3 variables: idT, numname and numstim.
This is the unique dataframe that contains the merged data :
z=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12,df13,df14,
df15,df16,df17,df18,df19,df20,df21,df22,df23,df24,df25,df26,df27,df28,df29,df30,df31,df32)
write.csv(z, file = ".../listz.csv")
Then I extracted the 3 variables with :
#Extract columns 3 & 6 from all the files within the list
z1 = z[,c(3,6)]
#Create a new variable 'numname' to convert name groups into numeric groups,
#then obtain levels with facNum
z1$numname <- as.numeric(z1$namegroup)
colnames(z1) <- c("namegroup", "idT", "numname")
facNum <- factor(z1$numname)
write.csv(z1, file = "...D:/z1.csv")
And data look like :
namegroup idT numname
1 GLISSEVIBREVITE 1 6
2 CINETIQUE 1 3
3 VIBRATIONS_LEGERES 1 20
4 DIFFUS 1 5
5 LIQUIDE 1 8
6 PICOTEMENTS 1 10
How to read the table: each idT is classified into a group (namegroup), and this group is then converted to a numeric variable (numname).
# Specify z1 as a data frame to make next operations
z1 = as.data.frame(z1, idT = z1$numstim, numgroup = z1$numname)
tab1 <- table(z1)
write.csv(tab1, file = ".../tab1test.csv")
out1 <- data.matrix(tab1 %*% t(tab1))
write.csv(out1, file = ".../bmtest.csv")
But the bmtest matrix doesn't look like a count of idT pairs: only 22 users participated and there are 32 idT, yet some of the numbers are much higher:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 24 10 7 7 11 7 7 8 10 8 11 8 6 11 11 12
2 10 32 27 7 5 4 7 4 4 4 5 3 2 6 6 14
3 7 27 40 0 3 1 0 2 0 0 2 2 1 2 0 15
4 7 7 0 30 7 14 15 9 15 13 13 7 5 12 13 5
5 11 5 3 7 24 7 9 20 12 13 10 19 14 20 12 7
I want a matrix that counts how often each pair of idT was grouped together. It should look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 15 3 2 2 3 3 2 1 2 1 3 3 1 3 3 5
2 3 15 9 2 0 1 2 0 0 0 0 0 0 0 1 3
3 2 9 15 0 2 1 0 2 0 0 1 1 1 2 0 2
4 2 2 0 15 1 6 5 1 7 5 6 2 0 1 3 2
5 3 0 2 1 15 1 2 12 4 5 3 13 9 11 3 2
In other words, I want to see which idT have been paired together. I've looked at this topic but didn't find a way to solve my problem.
Also, I tried :
library(igraph)
library(tnet)
idT_numname <- cbind(z1$idT, z1$numname)
igraph <- graph.data.frame(idT_numname)
item_item <- projecting_tm(net = idT_numname, method="sum")
item_item <- tnet_igraph(item_item,type="weighted one-mode tnet")
itemmat <- get.adjacency(item_item,attr="weight")
itemmat #8x8 matrix of items to items
But I get an error message, and I don't know how to get past the "duplicated entries in the edgelist", because it seems to me that duplicated entries are necessary to build a co-occurrence matrix:
> idT_numname <- cbind(z1$idT, z1$numname)
> item_item <- projecting_tm(idT_numname, method="sum")
Error in as.tnet(net, type = "binary two-mode tnet") :
There are duplicated entries in the edgelist
> item_item <- as.tnet(net = idT_numname, type ="binary two-mode tnet", method="sum")
Error in as.tnet(net = idT_numname, type = "binary two-mode tnet", method = "sum") :
unused argument (method = "sum")
> item_item <- as.tnet(net = idT_numname, type ="binary two-mode tnet")
Error in as.tnet(net = idT_numname, type = "binary two-mode tnet") :
There are duplicated entries in the edgelist
Your help is greatly appreciated.
I like to do data analysis and I want to learn more and more everyday !
Thank you
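One possible direction, sketched on made-up data: build an incidence matrix with one row per group formed by a participant and one column per idT, reduce it to 0/1 so duplicated rows cannot inflate the counts, and multiply it with its own transpose. The toy data frame and its user column are assumptions for illustration only; the question does not show a participant identifier, and the real data may need a different grouping key.

```r
# Toy data (hypothetical): two participants sorting four stimuli into groups.
toy <- data.frame(
  user    = c(1, 1, 1, 1, 2, 2, 2, 2),
  numname = c(1, 1, 2, 2, 1, 1, 1, 2),
  idT     = c(1, 2, 3, 4, 1, 2, 3, 4)
)

# One row per (user, group) combination, one column per idT.
grp <- interaction(toy$user, toy$numname, drop = TRUE)
inc <- unclass(table(grp, toy$idT))

# Binarise so that repeated rows in the data are counted only once.
inc[inc > 1] <- 1

# t(inc) %*% inc: cell [i, j] counts the groups containing both idT i and idT j.
cooc <- crossprod(inc)
cooc
```

The diagonal then holds how many groups each idT appears in, and the off-diagonal cells hold the pair counts.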

Convert year-month string to three month bins with gaps - how to assign contiguous ascending values?

I have used the code below to "bin" a year.month string into three month bins. The problem is that I want each of the bins to have a number that corresponds where the bin occurs chronologically (i.e. first bin =1, second bin=2, etc.). Right now, the first month bin is assigned to the number 4, and I am not sure why. Any help would be highly appreciated!
> head(Master.feed.parts.gn$yr.mo, n=20)
[1] "2007.10" "2007.10" "2007.10" "2007.11" "2007.11" "2007.11" "2007.11" "2007.12" "2008.01"
[10] "2008.01" "2008.01" "2008.01" "2008.01" "2008.02" "2008.03" "2008.03" "2008.03" "2008.04"
[19] "2008.04" "2008.04"
>
> yearmonth_to_integer <- function(xx) {
+ yy_mm <- as.integer(unlist(strsplit(xx, '.', fixed=T)))
+ return( (yy_mm[1] - 2006) + (yy_mm[2] %/% 3) )
+ }
>
> Cluster.GN <- sapply(Master.feed.parts.gn$yr.mo, yearmonth_to_integer)
> Cluster.GN
2007.10 2007.10 2007.10 2007.11 2007.11 2007.11 2007.11 2007.12 2008.01 2008.01 2008.01
4 4 4 4 4 4 4 5 2 2 2
2008.01 2008.01 2008.02 2008.03 2008.03 2008.03 2008.04 2008.04 2008.04 2008.04 2008.05
2 2 2 3 3 3 3 3 3 3 3
2008.05 2008.05 2008.06 2008.10 2008.11 2008.11 2008.12 <NA> 2009.05 2009.05 2009.05
3 3 4 5 5 5 6 NA 4 4 4
2009.06 2009.07 2009.07 2009.07 2009.09 2009.10 2009.11 2010.01 2010.02 2010.02 2010.02
5 5 5 5 6 6 6 4 4 4 4
UPDATE:
I was asked to provide sample input (yr.mo) and the desired output (Cluster.GN). I have a year-month string with varying numbers of observations for each month, and some months don't have any observations. What I want to do is bin each run of three consecutive months that have data, assigning each three-month "bin" a number as shown below.
yr.mo Cluster.GN
1 2007.10 1
2 2007.10 1
3 2007.10 1
4 2007.10 1
5 2007.10 1
6 2007.11 1
7 2007.11 1
8 2007.11 1
9 2007.11 1
10 2007.12 1
11 2007.12 1
12 2007.12 1
13 2007.12 1
14 2008.10 2
15 2008.10 2
16 2008.10 2
17 2008.10 2
18 2008.12 2
19 2008.12 2
20 2008.12 2
21 2008.12 2
22 2008.12 2
1) Convert the strings to zoo's "yearqtr" class and then to integers:
s <- c("2007.10", "2007.10", "2007.10", "2007.11", "2007.11", "2007.11",
"2007.11", "2007.12", "2008.01", "2008.01", "2008.01", "2008.01",
"2008.01", "2008.02", "2008.03", "2008.03", "2008.03", "2008.04",
"2008.04", "2008.04")
library(zoo)
yq <- as.yearqtr(s, "%Y.%m")
as.numeric(factor(yq))
## [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
The last line could alternately be: 4*(yq - yq[1])+1
Note that in the question 2007.12 is classified as in a different quarter than 2007.10 and 2007.11; however, they are all in the same quarter and we assume you did not intend this.
2) Another possibility depending on what you want is:
f <- factor(s)
nlev <- nlevels(f)
levels(f) <- gl(nlev, 3, nlev)
f
## [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
## Levels: 1 2 3
If there are missing months then this will give a different answer than (1), so it all depends on what you are looking for.
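As for why the first bin in the question comes out as 4: for "2007.10" the function computes (2007 - 2006) + (10 %/% 3) = 1 + 3. The year contributes only 1 per year rather than 4 quarters, and month 12 falls into a different bin than months 10 and 11. A base-R sketch of the arithmetic without zoo follows; the variable names and the shortened input vector are illustrative only.

```r
s <- c("2007.10", "2007.11", "2007.12", "2008.01", "2008.04", "2008.10")

# Split "YYYY.MM" into integer year and month.
parts <- do.call(rbind, strsplit(s, ".", fixed = TRUE))
yr <- as.integer(parts[, 1])
mo <- as.integer(parts[, 2])

# The question's formula, reproduced to show where the 4 comes from.
original <- (yr - 2006L) + (mo %/% 3L)

# Count quarters on one continuous scale: 4 per year, months 1-3 -> 0, 4-6 -> 1, ...
quarter_index <- yr * 4L + (mo - 1L) %/% 3L

# Calendar numbering: quarters without data still use up a number.
bin_calendar <- quarter_index - min(quarter_index) + 1L

# Contiguous numbering: only quarters that contain data are counted.
bin_contiguous <- match(quarter_index, sort(unique(quarter_index)))
```

bin_calendar corresponds to the 4*(yq - yq[1])+1 variant, and bin_contiguous to as.numeric(factor(yq)).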

R: grouped data table with proportions

I have copied my code below. I start with a list of 50 small integers, representing the number of televisions owned by 50 families. My objective is shown in the object 'tv.final' below. My effort seems very wordy and inefficient.
Question: is there a better way to start with a list of 50 integers and end with a grouped data table with proportions? (Just taking my first baby steps with R, sorry for such a stupid question, but inquiring minds want to know.)
tv.data <- read.table("Tb02-08.txt",header=TRUE)
str(tv.data)
# 'data.frame': 50 obs. of 1 variable:
# $ TVs: int 1 1 1 2 6 3 3 4 2 4 ...
tv.table <- table(tv.data)
tv.table
# tv.data
# 0 1 2 3 4 5 6
# 1 16 14 12 3 2 2
tv.prop <- prop.table(tv.table)*100
tv.prop
# tv.data
# 0 1 2 3 4 5 6
# 2 32 28 24 6 4 4
tvs <- rbind(tv.table,tv.prop)
tvs
# 0 1 2 3 4 5 6
# tv.table 1 16 14 12 3 2 2
# tv.prop 2 32 28 24 6 4 4
tv.final <- t(tvs)
tv.final
# tv.table tv.prop
# 0 1 2
# 1 16 32
# 2 14 28
# 3 12 24
# 4 3 6
# 5 2 4
# 6 2 4
You can treat the object returned by table() as any other vector/matrix:
tv.table <- table(tv.data)
round(100 * tv.table/sum(tv.table))
That will give you the proportions in rounded percentage points.
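Building on that, the two-column result from the question can be produced in one step with cbind(). A short sketch: since Tb02-08.txt is not available here, the 50 values are reconstructed from the counts printed in the question, and the column names count and percent are arbitrary.

```r
# Rebuild the 50 observations from the printed frequency table.
tvs <- rep(0:6, times = c(1, 16, 14, 12, 3, 2, 2))

counts <- table(tvs)

# Counts and rounded percentages side by side, one row per number of TVs.
tv.final <- cbind(count = counts,
                  percent = round(100 * prop.table(counts)))
tv.final
```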
