Confidence interval of episode duration frequencies - r

I have the episode duration data (in days)
dur<-c(1, 2, 1, 2, 1, 3, 11, 2, 2, 3, 2, 4, 1, 2, 2, 1, 2, 10, 1, 1, 2, 2, 18, 2, 2, 2, 1, 7, 1, 1, 11, 25, 17, 2, 2, 9, 3, 3, 2, 5, 3, 2, 3, 2, 5, 363, 1, 1, 2, 2)
This means that in one instance the episode duration was 1 day, in the next 2 days, then 1 day, and so on.
table(dur) summarizes the duration data (12 instances of 1 day, 20 instances of 2 days, etc.)
freq.table<-(table(dur)/sum(table(dur))) gives me the frequency of the observed durations of episodes (point estimates).
How can I get confidence intervals of freq.table in R? What would be the most appropriate way for this kind of data?
Edit: I am interested in estimating the CI of the frequency of episode durations of 1, 2, ..., n days

A fast and easy way to get CIs for proportions in R is the function binom.test as in
dur <- c(1, 2, 1, 2, 1, 3, 11, 2, 2, 3, 2, 4,
1, 2, 2, 1, 2, 10, 1, 1, 2, 2, 18, 2,
2, 2, 1, 7, 1, 1, 11, 25, 17, 2, 2, 9,
3, 3, 2, 5, 3, 2, 3, 2, 5, 363, 1, 1, 2, 2)
t <- table(dur)
n <- length(dur)
ci <- sapply(t, function(x) binom.test(x, n, conf.level = .95)$conf.int)
rownames(ci) <- c("lower", "upper")
print(ci)
This assumes that the data-generating process for each episode duration is something like a binomial process.
Edit after first comment
As Roland pointed out in a comment above, you have not stated the problem in unambiguous statistical terms, so I made some assumptions. I suppose Roland would suggest trying to find a distribution for all the possible durations as a whole. Considering the mode at 2 and the existence of an observation with value 363, this is unlikely to be a common distribution such as Poisson or binomial. Knowing nothing about the data-generating process, I estimated a confidence interval for each observed outcome on its own, without regard to the distribution as a whole. For each observed outcome I assumed a binomial distribution, which you should look up before you use my suggestion for anything serious.
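To read the point estimates next to their intervals, the per-duration calculation can be collected into one table. This is only a sketch under the same assumption as above (each duration treated on its own as a binomial count out of n episodes); the name ci_table is made up for illustration.

```r
dur <- c(1, 2, 1, 2, 1, 3, 11, 2, 2, 3, 2, 4,
         1, 2, 2, 1, 2, 10, 1, 1, 2, 2, 18, 2,
         2, 2, 1, 7, 1, 1, 11, 25, 17, 2, 2, 9,
         3, 3, 2, 5, 3, 2, 3, 2, 5, 363, 1, 1, 2, 2)
counts <- table(dur)
n <- length(dur)

# One row per observed duration: count and observed relative frequency.
ci_table <- data.frame(duration = as.integer(names(counts)),
                       count = as.vector(counts))
ci_table$estimate <- ci_table$count / n

# Exact (Clopper-Pearson) 95% interval for each count, one duration at a time.
cis <- t(sapply(ci_table$count,
                function(x) binom.test(x, n, conf.level = 0.95)$conf.int))
ci_table$lower <- cis[, 1]
ci_table$upper <- cis[, 2]

print(ci_table)
```

Note that these are separate intervals, one per duration; they are not a joint statement about all frequencies at once.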

Related

How do I generate a polychoric correlation matrix in R-psych

I am trying to generate a polychoric correlation matrix in R-psych for a 227 x 6 data table which I have called nepr. Importing the data from an excel spreadsheet and entering the code:
nepr=as.data.frame(nepr)
attach(nepr)
library(psych)
out=polychoric(nepr)
neprpoly=out$rho
print(neprpoly,digits=2)
generates the following error message:
Error in if (any(lower > upper)) stop("lower>upper integration limits"): missing value where TRUE/FALSE needed
In addition: warning messages:
1. In polychoric(nepr): The items do not have an equal number of response alternatives, global set to FALSE.
2. In qnorm(cumsum(rsum)[-length(rsum)]): NaNs produced
I was expecting the code which I entered to produce a polychoric correlation matrix based on the dataframe nepr and don't know how to interpret/ act on the error messages which I have received.
Can anyone suggest what changes I need to make to the code to address the error messages?
A sample of the dataset is as follows:
structure(list(Balance = c(4, 4, 5, 5, 3, 4, 3, 4, 2, 2, 2, 5,
2, 2, 2, 2, 1, 2, 4, 1), Earth = c(4, 5, 5, 5, 5, 5, 5, 4, 4,
4, 4, 5, 3, 4, 4, 2, 5, 4, 5, 5), Plants = c(2, 2, 2, 3, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 2, 2, 4), Modify = c(2, 2, 1,
1, 2, 2, 2, 2, 4, 2, 4, 2, 4, 2, 2, 2, 2, 2, 2, 2), Growth =
c(2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 4, 1, 4, 2, 2, 4, 4, 4, 1, 2),
Mankind = c(2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1,
1, 1, 2)), row.names = c(NA,20L), class = "data.frame")
The data consists of inputs of Likert scale rankings (ranked 1-5) to the items 'Balance', 'Earth', 'Plants', 'Modify', 'Growth', and 'Mankind'. There are no missing values in any cells of the 227 row x 6 item matrix; Balance, Plants, & Growth all contain the values 1-5; Earth contains the values 2-5 (no ranking of 1 recorded); Mankind contains the values 1-4 (no ranking of 5 recorded). When I ran the original data set (before reversing the valence of the last 3 columns) I was able to get a polychoric matrix with no problems even though the data contained the Earth data as it appears in the nepr data set. I assume that it is not uncommon to have similar data sets from surveys where variables do not necessarily contain the full range of response values.
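Since the first warning is about items not having an equal number of response alternatives, one thing that can be checked with base R alone is which categories actually occur in each item and how many cells of each pairwise cross-tabulation are empty. The sketch below uses only the 20-row sample shown above; it is a diagnostic, not a fix, and it does not by itself show what triggers the error.

```r
# The 20-row sample from the question, rebuilt as a plain data frame.
nepr <- data.frame(
  Balance = c(4, 4, 5, 5, 3, 4, 3, 4, 2, 2, 2, 5, 2, 2, 2, 2, 1, 2, 4, 1),
  Earth   = c(4, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 5, 3, 4, 4, 2, 5, 4, 5, 5),
  Plants  = c(2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 2, 2, 4),
  Modify  = c(2, 2, 1, 1, 2, 2, 2, 2, 4, 2, 4, 2, 4, 2, 2, 2, 2, 2, 2, 2),
  Growth  = c(2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 4, 1, 4, 2, 2, 4, 4, 4, 1, 2),
  Mankind = c(2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2)
)

# Which response categories occur in each item, and how many are there?
observed_levels <- lapply(nepr, function(x) sort(unique(x)))
n_categories <- sapply(observed_levels, length)
print(n_categories)

# For every pair of items, count the empty cells in their cross-tabulation.
item_pairs <- combn(names(nepr), 2)
empty_cells <- apply(item_pairs, 2, function(p) {
  sum(table(nepr[[p[1]]], nepr[[p[2]]]) == 0)
})
names(empty_cells) <- apply(item_pairs, 2, paste, collapse = " x ")
print(empty_cells)
```

Running the same two checks on the full 227-row data would show whether reversing the last three columns changed the pattern of observed categories.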

How do I interpret the coefficients of a glm with binomial error distribution?

I would be happy if someone could help me understand a glm with a binomial error distribution.
Let's assume the following df:
year<-c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3,3, 3, 3, 3, 3, 3, 3, 3)
success<-c(1, 0, 3, 1, 1, 2, 6, 0, 1, 1, 12, 2, NA, 6, 12, 0, 10,
7, 4, 10, 13, 1, 2, 1, 18, 6, 3, 8, 3, 1, 9, 15, 6, 12,
6, 15, 13, 6, 8, 6, 2, 11, 6, 1, 12, 0, 4, 15, 0, 3, 18,
5, 6, 17, 5, 3, 17, 8, 0, 7, 12, 10, 26, 12, 4, 17, 1, 8,
2, 7, 14, 8)
no_success<-c(1, 9, 5, 4, 6, 1, 4, 4, 6, 10, 16, 4, NA, 3, NA, 3,
5, 5, 6, 10, 0, 5, 3, 10, 1, 7, 11, 8, 20, 4, 3, 3,
19, 1, 11, 4, 6, 4, 9, 4, 10, 4, 2, 8, 3, 1, 13, 3,
5, 7, 5, 9, 3, 6, 3, 4, 3, 13, 6, 5, 10, 3, 1, 0,
18, 6, 13, 0, 3, 2, 2, 2)
df<-data.frame(year,success,no_success)
df$success<-as.integer(df$success)
df$no_success<-as.integer(df$no_success)
If I want to know whether there is a linear increase or decrease across years in the success or no_success of a made-up treatment, I apply a binomial glm:
m<- glm(cbind(success, no_success)~year,
data=df, family = "quasibinomial",
na.action=na.exclude)
summary(m)
I changed to "quasibinomial" here because of overdispersion.
From the summary I see that there is a significant effect: P: 0.0219 *
As the coefficients in a binomial glm represent log odds,
I get exp(estimate) = exp(0.3099) = 1.363
So the odds of success increase by a factor of 1.363 per year.
My questions are:
1.) When I exponentiate a negative estimate, the result is always positive. This cannot be correct; there must be a way to express negative relationships.
2.) When I want to visualize multiple linear models, I like to display the estimates.
In a "normal" lm I would display the estimate and confidence interval like this: divide the estimate by the mean of the observations, and then subtract and add the standard error (also divided by the mean of the observations) times 1.96.
Estimate.mean<-exp(0.3099)/mean(df$or,na.rm=TRUE)
Std.Error.mean<-exp(0.1321)/mean(df$or,na.rm=TRUE)
low<-Estimate.mean-Std.Error.mean*1.96
high<-Estimate.mean+Std.Error.mean*1.96
If this confidence interval does not touch the zero line, the effect should be significant, i.e. significantly different from zero.
But here the low bound is -0.3901804 and the high bound is 1.608095. This does not appear to be a significant linear relationship despite the low p-value from the glm (0.0219).
What have I mixed up here?
I am happy for any suggestions
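The "zero line" in this case is at 1 and not at 0, because the exponentiated estimates are on the odds scale.
Question 2:
The question is whether there is an effect that is different from zero. On the log-odds scale, "no effect" is a coefficient of 0, and exp(0) = 1. So an odds ratio of 1 basically means zero effect, and a confidence interval for the odds ratio indicates a significant effect when it does not include 1.
Question 1:
When the estimate is exponentiated, the result cannot be negative. Instead, odds ratios below 1 (between 0 and 1) express a negative effect.
Here are some sources on calculating the confidence interval, for anyone stumbling over this post.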
https://fromthebottomoftheheap.net/2018/12/10/confidence-intervals-for-glms/
https://stats.stackexchange.com/questions/304833/how-to-calculate-odds-ratio-and-95-confidence-interval-for-logistic-regression
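As a rough illustration with the numbers quoted in the question (estimate 0.3099, standard error 0.1321), the interval can be built on the log-odds scale and exponentiated afterwards. This is a Wald-type sketch using a normal quantile; that choice is an assumption here, and for a quasibinomial fit a t quantile or a profile-based interval may be preferable.

```r
estimate  <- 0.3099  # coefficient of year on the log-odds scale, from the question
std_error <- 0.1321  # its standard error, from the question

# Build the interval on the log-odds scale first ...
z <- qnorm(0.975)
log_ci <- estimate + c(-1, 1) * z * std_error

# ... then move estimate and interval to the odds-ratio scale.
odds_ratio <- exp(estimate)
or_ci <- exp(log_ci)
print(round(c(odds_ratio = odds_ratio, lower = or_ci[1], upper = or_ci[2]), 3))

# A negative coefficient of the same size gives an odds ratio below 1,
# which is how a decrease in the odds is expressed.
print(exp(-estimate))
```

The resulting interval lies entirely above 1, which agrees with the small p-value reported by the glm.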

How to plot a rating scale in R

What is the best way to represent the following trait rating scale? I'd like to label the traits (8 traits) and the degree of each emotion (1 being low feelings, 5 being strong feelings) across the Democratic and Republican parties. Do I need to aggregate the items? I'm new to R and not sure how to tackle this.
Survey question and scale:
"Below is a list of feelings or moods that could be caused by an object. Please use the list below to describe how the U.S. FEDERAL parties (and its elected officials) make you feel. If the word definitely describes how a party makes you feel, then choose the number 5. If you decide that the word does not at all describe how the party makes you feel, then choose the number 1. Use the intermediate numbers between 1 and 5 to indicate responses between these two extremes."
Survey sample:
dput(df[Book3(1:nrow(df), 30),])
structure(list(TRAITDEM1 = c(3, 4, 3, 3, 3, 3, 3, 1, 2, 2, 2,
3, 3, 2, 2, 1, 1, 3, 1, 5, 1, 1, 3, 1, 4, 4, 3, 1, 2, 4), TRAITDEM2 = c(3,
1, 1, 2, 2, 2, 3, 5, 4, 2, 2, 2, 3, 3, 3, 4, 1, 2, 3, 1, 4, 5,
2, 3, 1, 1, 1, 4, 1, 2), TRAITDEM3 = c(3, 4, 4, 2, 3, 3, 3, 1,
1, 2, 2, 3, 3, 2, 2, 1, 1, 3, 1, 5, 1, 1, 3, 1, 4, 5, 4, 1, 3,
5), TRAITDEM4 = c(3, 2, 1, 2, 2, 2, 4, 5, 4, 5, 2, 3, 2, 3, 3,
4, 3, 4, 3, 1, 5, 4, 1, 4, 3, 4, 2, 4, 2, 1), TRAITDEM5 = c(3,
4, 3, 4, 4, 3, 2, 1, 1, 2, 2, 3, 4, 2, 2, 1, 1, 3, 1, 5, 1, 1,
2, 1, 4, 4, 4, 1, 3, 4), TRAITDEM6 = c(3, 1, 1, 1, 1, 1, 1, 2,
1, 1, 1, 2, 2, 2, 2, 4, 3, 1, 1, 1, 4, 5, 1, 3, 1, 1, 1, 1, 1,
1), TRAITDEM7 = c(3, 1, 3, 3, 2, 2, 1, 1, 1, 2, 3, 4, 3, 2, 2,
1, 1, 2, 2, 5, 1, 1, 1, 3, 3, 4, 2, 1, 5, 5), TRAITDEM8 = c(3,
1, 1, 1, 2, 1, 3, 5, 2, 4, 1, 1, 2, 2, 3, 1, 3, 1, 2, 1, 5, 5,
2, 2, 1, 2, 1, 2, 1, 1), TRAITREP1 = c(1, 1, 1, 1, 1, 1, 1, 1,
1, 4, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1,
1), TRAITREP2 = c(1, 5, 5, 5, 5, 5, 5, 2, 5, 2, 5, 5, 5, 5, 4,
5, 1, 5, 5, 5, 5, 1, 5, 4, 5, 5, 5, 3, 5, 5), TRAITREP3 = c(1,
1, 1, 1, 2, 1, 1, 2, 1, 4, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 3,
1, 1, 1, 1, 1, 1, 1, 2), TRAITREP4 = c(1, 5, 5, 1, 5, 5, 5, 3,
5, 2, 5, 4, 5, 5, 5, 5, 3, 5, 5, 5, 5, 1, 5, 3, 5, 5, 5, 4, 5,
1), TRAITREP5 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 2,
1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1), TRAITREP6 = c(1,
5, 5, 5, 3, 3, 3, 1, 1, 1, 3, 3, 5, 3, 4, 5, 3, 4, 5, 4, 5, 1,
5, 3, 4, 4, 5, 1, 1, 3), TRAITREP7 = c(1, 1, 1, 1, 2, 2, 1, 1,
1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1,
2), TRAITREP8 = c(1, 5, 5, 5, 4, 5, 5, 2, 5, 2, 5, 4, 5, 5, 4,
1, 3, 5, 5, 5, 5, 3, 4, 4, 5, 5, 5, 3, 5, 5), PARTYID_Strength = c(5,
1, 2, 1, 2, 1, 8, 7, 6, 3, 1, 6, 6, 1, 7, 8, 7, 1, 1, 1, 2, 4,
1, 6, 1, 1, 1, 7, 6, 8)), row.names = c(NA, -30L), class = c("tbl_df",
"tbl", "data.frame"))
"PartyID_Strength" represents 8 measures of political parties:
1 - Strong Democrat
2 - Not very strong Democrat
3 - Strong Republican
4 - Not very strong Republican
5 - Independent
6 - Independent - Democrat
7 - Independent - Republican
8 - Other
I tried it this way (graph below) but it's still not plotting the remaining four traits:
Cleaning the data
To solve your problem, we first have to transform your data into tidy format.
Observation
There are a few particular problems with your original dataset:
The data are in wide format, i.e. most of the columns of your data frame can be represented by 3 variables;
The names of the variables are not self-explanatory. They are in upper case, which by itself carries no useful information, and they are hard to read and to type;
There is additional information we can extract from the variable names: Party and Feelings toward the Party. The first is an abbreviation ('dem' or 'rep'); the second is the numerically encoded feeling towards the political party. However, the order of the numbers encoding the feelings does not reflect the natural order of emotions from disgust up to joy;
The variable PARTYID_Strength is a numerically encoded Political Party [self-]Identification; it also does not reflect the natural order from strongest Democrats through Independents to strongest Republicans;
Plan
Convert data from wide into long format using all variables starting with TRAIT, and leaving PARTYID_Strength variable unchanged;
Extract useful information from the TRAIT... variables (Political Party, Feelings Toward the Party);
Convert all numerically encoded variables into factors with reasonably ordered levels;
Give all variables meaningful names;
Summarize the data;
Transformations
We need to create several lookup tables, which will simplify the workflow.
Affiliation lookup table:
aff_lookup <- c(
'Strong Democrat',
'Not very strong Democrat',
'Strong Republican',
'Not very strong Republican',
'Independent',
'Independent-Democrat',
'Independent-Republican',
'Other'
)
We can further order aff_lookup by this vector:
aff_order = c(1, 2, 6, 5, 7, 4, 3, 8)
Emotions/Feelings lookup table:
emo_lookup <- c(
'Delighted',
'Angry',
'Happy',
'Annoyed',
'Joy',
'Hateful',
'Relaxed',
'Disgusted'
)
And we can order emo_lookup by this vector:
emo_order <- c(8, 6, 2, 4, 7, 3, 1, 5)
Political party lookup table:
party_lookup <- c(
dem = 'National Democratic Party',
rep = 'National Republican Party'
)
Finally, with all helper variables, we can transform our data into desirable form.
library(tidyverse)
library(magrittr)  # provides the %<>% assignment pipe, which library(tidyverse) does not attach
dat %<>%
rename_all(tolower) %>%
pivot_longer(
cols = starts_with('trait'),
names_to = c('party', 'emotion'),
names_pattern = 'trait(dem|rep)(\\d)',
values_to = 'score'
) %>%
mutate(
party = factor(party_lookup[party]),
affiliation = factor(
aff_lookup[partyid_strength],
levels = aff_lookup[aff_order]
),
emotion = factor(
emo_lookup[as.numeric(emotion)],
levels = emo_lookup[emo_order]
)
) %>%
group_by(party, emotion, affiliation) %>%
summarise(score = median(score)) %>%
ungroup()
head(dat)
## A tibble: 6 x 4
# party emotion affiliation score
# <fct> <fct> <fct> <dbl>
#1 National Democratic Party Disgusted Strong Democrat 1
#2 National Democratic Party Disgusted Not very strong Democrat 2
#3 National Democratic Party Disgusted Independent-Democrat 2
#4 National Democratic Party Disgusted Independent 3
#5 National Democratic Party Disgusted Independent-Republican 3
#6 National Democratic Party Disgusted Not very strong Republican 5
Plot the data
Plan
Now we can plot the data as two separate plots for Democrats and Republicans, with Affiliation (Political Party Identification) on the X-axis and Emotions (Feelings) on the Y-axis.
Each Emotion/Affiliation point is going to be represented as a bar, with the height of the bar representing the Score.
We can also add color encoding to our plot. From my point of view, encoding Emotions/Feelings with a color gradient from red (Disgust) to green (Joy) could help us grasp the internal structure of our data.
Plot
dat %>%
ggplot(
aes(
x = affiliation,
y = as.numeric(emotion) + (score / max(score) * .95) / 2,
height = (score / max(score) * .95),
width = .95,
fill = emotion,
label = score
)
) +
geom_tile(show.legend = FALSE) +
geom_text(size = 3.5, color = 'gray25', alpha = .75) +
facet_wrap(~ party, scales = 'free') +
scale_fill_brewer(palette = 'RdYlGn') +
scale_y_continuous(breaks = sort(emo_order), labels = emo_lookup[emo_order]) +
labs(x = 'Affiliations', y = 'Emotions') +
ggthemes::theme_tufte() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
axis.ticks.x = element_blank(),
axis.text.y = element_text(hjust = 0, vjust = -0.025),
axis.ticks.y = element_blank()
)
Which gives us the following figure:
Explanation
There is a trick with this plot: it looks like a series of barplots, but they are not real barplots (they only look like them; functionally they are something else).
What I do:
The core of this solution is the use of geom_tile() for each data point. A tile is just a rectangle (a square by default) whose geometric center is determined by the given coordinates (Affiliation, Emotion).
Both Affiliation and Emotion are factors, not numerics. That is fine for Affiliation, because we only want to position each tile according to the Affiliation it represents.
It is more complicated with Emotion, because we want to position each tile according to the Emotion it represents, but we also want to encode the Score by the height of the tile.
To define the height of the tile we use the height parameter within aes(). We want the tile height to be less than or equal to one (with a 0.05 offset), so that the tiles of, let's say, Angry and Annoyed do not overlap. That's why we use (score / max(score) * .95) for the height parameter.
We also need to give each tile a different y-coordinate, so that the center of the tile is placed not on the imaginary line representing each emotion, but half a height above it. When the tile is drawn, its center (on the y-axis) sits half a height above the "base line" and the tile extends half a height up and down from there, creating a fake barplot. That's what the following piece of code does: as.numeric(emotion) + (score / max(score) * .95) / 2.
We also give each tile a fixed width of .95 with width = .95, fill the tiles with a Red-Yellow-Green gradient and label each tile with the relevant Score.
The rest is just decoration. However, note how we relabel the Y-axis. As defined in aes() it is a continuous scale, but we want to make it look like a discrete axis, so we use this row:
scale_y_continuous(breaks = sort(emo_order), labels = emo_lookup[emo_order])
Here we just use our emo_order to say that we want breaks at the integers from 1 to 8, and then we label these breaks with the feelings from the ordered emo_lookup table.
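The arithmetic behind the fake bars can be checked without ggplot2. In this small sketch, emotion_level and the three scores are made-up values for a single emotion row; the point is only that each tile starts at the emotion's base line and never reaches the next one.

```r
# Hypothetical values: three tiles on the emotion with integer position 2.
emotion_level <- c(2, 2, 2)
score <- c(1, 3, 5)

# Same formulas as in aes(): scaled height, and a center shifted up by half of it.
height <- score / max(score) * 0.95
center <- emotion_level + height / 2

# geom_tile() draws from center - height / 2 to center + height / 2.
tiles <- data.frame(score,
                    ymin = center - height / 2,
                    ymax = center + height / 2)
print(tiles)
```

Every ymin equals the base line (2), and the tallest tile ends at 2.95, leaving the 0.05 gap below the next emotion.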

Is it possible to limit forecasts made by bsts to positive values only?

I am learning to use various forecasting packages available in R, and came across bsts(). The data I deal with is a time series of demands.
data=c(27, 2, 7, 7, 9, 4, 3, 3, 3, 9, 6, 2, 6, 2, 3, 8, 6, 1, 3, 8, 4, 5, 8, 5, 4, 4, 6, 1, 6, 5, 1, 3, 0, 2, 6, 7, 1, 2, 6, 2, 8, 6, 1, 1, 3, 2, 1, 3, 1, 6, 3, 4, 3, 7, 3, 4, 1, 7, 5, 6, 3, 4, 3, 9, 2, 1, 7, 2, 2, 9, 4, 5, 3, 4, 2, 4, 4, 8, 6, 3, 9, 2, 9, 4, 1, 3, 8, 1, 7, 7, 6, 0, 1, 4, 8, 9, 2, 5)
ts.main=ts(data, start=c(1910,1), frequency=12)
ss <- AddLocalLinearTrend(list(), y=ts.main)
ss <- AddSeasonal(ss, y=as.numeric(ts.main), nseasons=12)
model <- bsts(as.numeric(ts.main),
state.specification = ss,
niter = 1000)
pred <- predict(model, horizon = 12)
Is there a way I can restrict pred$mean from becoming negative?
Since your data are a time series of counts, you need to take that into account rather than assume Gaussian errors; for some discussion on this and elaboration of some approaches, see for example Brandt et al 2000 and Brandt and Williams 2001. Luckily, the bsts package has a built-in functionality for this, the family option (see pages 24 to 26 of the documentation).
So, you can just do this
model <- bsts(as.numeric(ts.main),
state.specification = ss,
family = 'poisson',
niter = 1000)
so that the bsts() function correctly considers the data as counts, which will solve your issue, since the draws from the posterior predictive distribution will then be non-negative by definition.
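The reason count-based draws cannot go negative can be seen with base R alone. The sketch below does not call bsts; the log-rate draws and the Gaussian parameters are made-up stand-ins for posterior quantities, and the log link is assumed here only for illustration.

```r
set.seed(1)
n_draws <- 1000

# Hypothetical draws of the log-rate; exp() maps any real value to a positive rate.
log_rate <- rnorm(n_draws, mean = log(2), sd = 0.8)
poisson_draws <- rpois(n_draws, lambda = exp(log_rate))

# Hypothetical Gaussian predictive draws with a small mean and wide spread.
gaussian_draws <- rnorm(n_draws, mean = 2, sd = 3)

print(range(poisson_draws))      # counts: the minimum cannot be below 0
print(sum(gaussian_draws < 0))   # number of Gaussian draws that fell below 0
```

With a low demand level like the one in the question, a Gaussian predictive distribution puts noticeable mass below zero, while Poisson draws are counts by construction.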

How can you find the polynomial for a decimated LFSR?

I know that if you decimate the series generated by a linear feedback shift register, you get a new series and a new polynomial. For example, if you sample every fifth element in the series generated by an LFSR with polynomial x^4+x+1, you get the series generated by x^2+x+1. I can find the second polynomial (x^2+x+1) by brute force, which is fine for low-order polynomials. However, for higher-order polynomials, the time required to brute force it gets unreasonable.
So the question is: is it possible to find the decimated polynomial analytically?
I recently read this article and thought of it when I saw your question; I hope it helps.
Given a primitive polynomial over GF(q), one can obtain another primitive polynomial by decimating an LFSR sequence obtained from the initial polynomial. This is demonstrated in the code below.
K := GF(7);
C := PrimitivePolynomial(K, 2);
C;
D^2 + 6*D + 3
In order to generate an LFSR sequence, we must first multiply this polynomial by a suitable constant so that the trailing coefficient becomes 1.
C := C * Coefficient(C,0)^-1;
C;
5*D^2 + 2*D + 1
We are now able to generate an LFSR sequence of length 7^2 - 1 = 48. The initial state can be anything other than [0, 0].
t := LFSRSequence (C, [K| 1,1], 48);
t;
[ 1, 1, 0, 2, 3, 5, 3, 4, 5, 5, 0, 3, 1, 4, 1, 6, 4, 4, 0, 1, 5, 6, 5, 2, 6, 6,
0, 5, 4, 2, 4, 3, 2, 2, 0, 4, 6, 3, 6, 1, 3, 3, 0, 6, 2, 1, 2, 5 ]
We decimate the sequence by a value d having the property gcd(d, 48)=1.
t := Decimation(t, 1, 5);
t;
[ 1, 5, 0, 6, 5, 6, 4, 4, 3, 1, 0, 4, 1, 4, 5, 5, 2, 3, 0, 5, 3, 5, 1, 1, 6, 2,
0, 1, 2, 1, 3, 3, 4, 6, 0, 3, 6, 3, 2, 2, 5, 4, 0, 2, 4, 2, 6, 6 ]
B := BerlekampMassey(t);
B;
3*D^2 + 5*D + 1
To get the corresponding primitive polynomial, we multiply by a constant to make it monic.
B := B * Coefficient(B, 2)^-1;
B;
D^2 + 4*D + 5
IsPrimitive(B);
true
From these notes: "The decimation by n>0 of an m-sequence c, denoted as c[n], has a period equal to N/gcd(N,n) if it is not the all-zero sequence; its generator polynomial ĝ(x) has roots that are the nth powers of the roots of g(x)."
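The binary example from the question (x^4+x+1 decimated by 5 giving x^2+x+1) can be checked numerically. The sketch below is written in R rather than Magma and only verifies the example; it is not the analytic method. The function name lfsr_bits and the initial state 1,0,0,0 are choices made for this illustration.

```r
# LFSR over GF(2) for x^4 + x + 1, i.e. the recurrence s[k] = s[k-3] + s[k-4] (mod 2).
lfsr_bits <- function(n, init = c(1, 0, 0, 0)) {
  s <- numeric(n)
  s[seq_along(init)] <- init
  for (k in (length(init) + 1):n) {
    s[k] <- (s[k - 3] + s[k - 4]) %% 2
  }
  s
}

s <- lfsr_bits(60)                  # four periods of the length-15 m-sequence
decimated <- s[seq(1, 60, by = 5)]  # keep every fifth element, starting at the first

# x^2 + x + 1 corresponds to the recurrence t[k+2] = t[k+1] + t[k] (mod 2).
idx <- seq_len(length(decimated) - 2)
satisfies_x2_x_1 <- all(decimated[idx + 2] == (decimated[idx + 1] + decimated[idx]) %% 2)

print(s[1:15])
print(decimated)
print(satisfies_x2_x_1)
```

The decimated sequence has period 15/gcd(15,5) = 3, as the quoted note states. The starting phase matters for this example: beginning the decimation at the second element instead picks out only zeros, which is the all-zero case the note excludes.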
