I have a dataset (ge3_ratio) with 86 observations measuring the ratio of phonological errors produced by children. The data were collected from spontaneous speech transcripts of 3 children aged 21 to 65 months. The data look like this:
age  child     Transcripts  Process                              Liq  n  ratio
21   Luana     1            posteriorização                      N    1  1.0000000
23   Luana     1            apagamento de líquida                S    1  1.0000000
24   Leonardo  1            apagamento de líquida                S    1  1.0000000
24   Túlio     3            assimilação + apagamento de líquida  S    1  0.3333333
24   Túlio     3            metátese                             S    1  0.3333333
24   Túlio     3            semivocalização de líquidas          S    1  0.3333333
24   Túlio     3            substituição de líquida              S    1  0.3333333
25   Túlio     2            anteriorização                       N    1  0.5000000
25   Túlio     2            apagamento de líquida                S    1  0.5000000
26   Luana     1            apagamento de líquida                S    1  1.0000000
The ratio column was created by dividing n by Transcripts. The Liq column indicates whether the phonological error (Process) involves a liquid consonant (coded S) or not (coded N). I want to test the hypothesis that phonological errors involving liquid consonants last longer than those that do not. "Last longer" means not only that the curve spans a greater age range for S than for N, but also that the ratio stays higher for S than for N as the children grow older. I have the following graphs showing this developmental path ("Liquida" just means "Liquid"):
ggplot(ge3_ratio, aes(age, ratio, color = Liq)) +
geom_smooth() + scale_color_manual(values = kb) +
labs(x = "age in months")
ggplot(ge3_ratio, aes(age, ratio, color = Liq)) +
geom_path() + scale_color_manual(values = kb) +
labs(x = "age in months")
(The geom_smooth and geom_path plots are omitted here.)
I would like help on how to proceed with the analysis (taking my hypothesis into account): which tests I can run and how to do that in R. I started reading about growth curve analysis, but I am unsure how to apply the concepts to my data.
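To make my current understanding concrete, this is the kind of model I think growth curve analysis would suggest for data shaped like mine. It is only a sketch: the use of lme4, the polynomial degree for age, and the random-effects structure are my guesses, not something I have worked out.
library(lme4)
# Sketch: does the age trajectory of the error ratio differ between
# liquid (S) and non-liquid (N) processes? The age-by-Liq interaction
# terms would carry the hypothesis of interest.
gca_model <- lmer(ratio ~ poly(age, 2) * Liq + (1 | child), data = ge3_ratio)
summary(gca_model)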
I created ordered logit models using polr in RStudio and then used vif(). It worked fine and I got my results.
But then I decided to add more variables, such as dummies for age groups, education groups and income groups. Now when I use vif() I get an error about a subscript being out of bounds.
Here is an example of my data set:
Work less  lifestatisfied  country  Work much
0          8               GB       1
1          8               SE       0
0          9               DK       1
0          9               DE       1
NA         5               NO       NA
continued:
health  education  income  age  marital status
3       3          NA      61   NA
4       2          2       30   NA
1       3          4       39   6
5       7          5       52   4
4       1          5       17   5
gender is a dummy (1 or 2)
age is the respondent's age, e.g. 35, 47, etc.
income is scaled from 1 to 10
educ (education) is scaled from 1 to 7
health is scaled from 1 to 5
work less is a dummy (1 or 0)
work much is a dummy (1 or 0)
marital status is scaled from 1 to 6
lifesatisfied is the dependent variable and is scaled from 0 to 10.
My ordered regression model:
myorderedmodel = polr(factor(lifesatisfied, ordered = TRUE) ~ maritalstatus + gender + age + age20_29 + age30_39 + age40_49 + age50_59 + income + lowincome + avgincome + highincome + noeducation + loweducation + higheducation + health + child + religion + workless + workmuch, data = mydata, method = "logistic", Hess = TRUE)
vif(myorderedmodel)
Gives the following error:
Error in R[subs, subs] : subscript out of bounds
I really don't understand this error. What does it mean, and how can it be solved?
I have a dataset (LDA output) that looks like this.
lda_tt <- tidy(ldaOut)
lda_tt <- lda_tt %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
topic term beta
1 1 council 0.044069733
2 1 report 0.020086205
3 1 budget 0.016918569
4 1 polici 0.01646605
5 1 term 0.015051927
6 1 annual 0.014938797
7 1 control 0.014316583
8 1 audit 0.013637803
9 1 rate 0.012732765
10 1 fund 0.011997421
11 2 debt 0.033760856
12 2 plan 0.030379431
13 2 term 0.02925229
14 2 fiscal 0.021836885
15 2 polici 0.017802904
16 2 mayor 0.015548621
17 2 transpar 0.013175692
18 2 relat 0.012997722
19 2 capit 0.012463813
20 2 long 0.011989227
21 2 remain 0.011989227
22 3 parti 0.031795751
23 3 elect 0.029929187
24 3 govern 0.025496098
25 3 mayor 0.023046232
26 3 district 0.014588364
27 3 public 0.014471704
28 3 administr 0.013596752
29 3 budget 0.011730188
30 3 polit 0.011730188
31 3 seat 0.010563586
32 3 state 0.010563586
33 4 budget 0.037069484
34 4 revenu 0.025043026
35 4 account 0.018459577
36 4 oper 0.01721546
37 4 tax 0.015867667
38 4 debt 0.014416198
39 4 compani 0.013690464
40 4 expenditur 0.012135318
41 4 consolid 0.011305907
42 4 increas 0.010891202
43 5 invest 0.026534237
44 5 elect 0.023341538
45 5 administr 0.022296654
46 5 improv 0.02189031
47 5 develop 0.019162003
48 5 project 0.017826874
49 5 transport 0.016375647
50 5 local 0.016317598
51 5 infrastr 0.014401978
52 5 servic 0.014111733
I want to create 5 plots, one per topic, with the terms ordered by beta. This is the code:
lda_tt %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
The resulting graph (not shown here) demonstrates that, despite the sorting efforts, the terms are not ordered by beta: the term "budget", for example, should be the top term in topic 4, "invest" the top term in topic 5, and so on. How can I sort the terms within each topic in each facet? There are several questions on Stack Overflow about ggplot sorting, but none of them helped me solve the problem.
The link suggested by Tung provides a solution to the problem. It seems that each term needs to be coded as a distinct factor level to get proper sorting. We can append "_" and the topic number to each term (done in lines 2 and 3 of the code below), but display only the term itself without the "_" and topic number (the last line of code takes care of that). The following code generates a faceted graph with proper sorting.
lda_tt %>%
mutate(term = factor(paste(term, topic, sep = "_"),
levels = rev(paste(term, topic, sep = "_")))) %>%#convert to factor
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_discrete(labels = function(x) gsub("_.+$", "", x)) #remove "_" and topic number
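As a side note (my own addition, not part of the linked solution): since tidy(ldaOut) suggests the tidytext package is already in use, its reorder_within() and scale_x_reordered() helpers wrap the same append-a-suffix trick:
library(dplyr)
library(ggplot2)
library(tidytext)
lda_tt %>%
  mutate(term = reorder_within(term, beta, topic)) %>% # orders each term within its topic
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered() # strips the internal suffix from the axis labels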
I am trying to plot the predictions of a lmer model with the following code:
p1 <- ggplot(Mac_Data_Tracking, aes(x = Rspan, y = SubjEff, colour = NsCond)) +
geom_point(size=3) +
geom_line(data=newdat, aes(y=predict(SubjEff.model,newdata=newdat)),lineend="round")
print(p1)
I get weird inflections at the end of each line; is there a way to remove them? I have changed the data in newdat, but the lines always have these inflections.
(The plot of the lines with inflections at the ends is omitted here.)
Note that you have geom_line(data=newdat, aes(y=predict(SubjEff.model, newdata=newdat))). So you have fed newdat to geom_line as the data frame to use for plotting, but then for the y value you supply a separate vector of predictions (computed from newdat), when y should really just be a column of newdat. I'm not sure exactly why that produces the inflections at the ends (probably two different y values end up being supplied for each endpoint x value), but that is most likely the source of your problem.
Instead, you should create a column in newdat with the predictions (if you haven't already) and feed that column name to ggplot as the y in geom_line. To add a column of predictions, do the following:
newdat$pred = predict(SubjEff.model,newdata=newdat)
You should also give geom_line the x values that correspond to the y values in newdat. So your code would be:
geom_line(data=newdat, aes(y=pred, x=Rspan), lineend="round")
(Where Rspan will (automatically) be the Rspan column in newdat.)
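Putting the pieces together, the plotting code would look something like this (a sketch reusing the object names from the question):
newdat$pred <- predict(SubjEff.model, newdata = newdat)
ggplot(Mac_Data_Tracking, aes(x = Rspan, y = SubjEff, colour = NsCond)) +
  geom_point(size = 3) +
  geom_line(data = newdat, aes(x = Rspan, y = pred), lineend = "round")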
It was actually a problem with having two Subject values, not two x values.
The linear mixed model is:
Mixed.model <- lmer(Outcome ~ NsCond + Rspan + (1 | Subject), data = Data)
For newdat, I was initially using:
newdat <- expand.grid(Subject = c(min(Data$Subject), max(Data$Subject)),
                      Rspan = c(min(Data$Rspan), max(Data$Rspan)),
                      NsCond = unique(Data$NsCond))
Which gave me:
Subject Rspan NsCond
1 1 0.2916667 Pink
2 18 0.2916667 Pink
3 1 1.0000000 Pink
4 18 1.0000000 Pink
5 1 0.2916667 Babble
6 18 0.2916667 Babble
7 1 1.0000000 Babble
8 18 1.0000000 Babble
9 1 0.2916667 Loss
10 18 0.2916667 Loss
11 1 1.0000000 Loss
12 18 1.0000000 Loss
For each Rspan (x) there are 2 "Subjects" (1 and 18).
I changed newdat to:
newdat <- expand.grid(Subject = 1,
                      Rspan = c(min(Data$Rspan), max(Data$Rspan)),
                      NsCond = unique(Data$NsCond))
Which results in:
Subject Rspan NsCond
1 1 0.2916667 Pink
2 1 1.0000000 Pink
3 1 0.2916667 Babble
4 1 1.0000000 Babble
5 1 0.2916667 Loss
6 1 1.0000000 Loss
Now it looks good.
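One more note, not part of the original follow-up but standard lme4 behaviour: if the goal is population-level predictions that ignore the subject effect entirely, predict() for merMod objects accepts re.form = NA, which drops the random effects so newdat does not need a Subject column at all.
# re.form = NA drops the (1 | Subject) term, so only the fixed-effect
# variables are needed in newdat.
newdat <- expand.grid(Rspan = c(min(Data$Rspan), max(Data$Rspan)),
                      NsCond = unique(Data$NsCond))
newdat$pred <- predict(Mixed.model, newdata = newdat, re.form = NA)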
I've been trying to make a percentage bar chart to show the mortality of two species of tadpoles on three different backgrounds (for a camouflage study).
My dataset is:
campanha fundo sobreviventes especie intactos percentsob
1 5 light 10 Bs 9 66.66667
2 5 mixed 8 Bs 5 53.33333
3 5 dark 8 Bs 8 53.33333
4 6 light 15 Bs 13 100.00000
5 6 mixed 15 Bs 11 100.00000
6 6 dark 14 Bs 11 93.33333
7 5 light 7 Sm 5 46.66667
8 5 mixed 10 Sm 9 66.66667
9 5 dark 12 Sm 10 80.00000
10 6 light 14 Sm 6 93.33333
11 6 mixed 14 Sm 6 93.33333
12 6 dark 15 Sm 9 100.00000
and my script is (odonatapint is the data frame, and 15 is the total number of individuals used per trial):
odonatapint$percentsob<-odonatapint$sobreviventes*100/15
ggplot(data=odonatapint,aes(x=fundo,y=percentsob,fill=especie)) +
geom_bar(method="mean",stat="identity",position="dodge") +
scale_fill_manual(values=c("#999999", "#000000")) +
xlab("Background type (natural)") +
ylab("Tadpoles surviving (%)")
However, I noticed that the graph shows the highest value for each category instead of the mean (I tried to post the graph, but I wasn't allowed because I have only just registered). What should I do to fix it? And how can I add error bars once it displays the mean values?
This is what I did. First, calculate the mean and standard deviation (for the error bars):
library(dplyr)
odonatapint <- odonatapint %>%
  group_by(fundo, especie) %>%
  mutate(mean = mean(percentsob), sd = sd(percentsob))
Then plot with ggplot (first draw the bars with geom_bar, then add the error bars with geom_errorbar):
ggplot(data = odonatapint, aes(x = fundo, y = mean, fill = especie)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_errorbar(aes(ymax = mean + sd, ymin = mean - sd), position = position_dodge()) +
  scale_fill_manual(values = c("#999999", "#000000")) +
  xlab("Background type (natural)") +
  ylab("Tadpoles surviving (%)")
(The resulting figure is omitted here.)
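A variation on the same idea (my addition, not part of the original answer): summarising down to one row per fundo/especie combination before plotting means each bar is drawn exactly once rather than once per campanha, which gives the same picture with less overplotting.
library(dplyr)
library(ggplot2)
# One row per bar: mean and sd of the survival percentage per background/species
odonatapint_means <- odonatapint %>%
  group_by(fundo, especie) %>%
  summarise(mean = mean(percentsob), sd = sd(percentsob), .groups = "drop")
ggplot(odonatapint_means, aes(x = fundo, y = mean, fill = especie)) +
  geom_col(position = position_dodge()) +
  geom_errorbar(aes(ymax = mean + sd, ymin = mean - sd),
                position = position_dodge(width = 0.9), width = 0.2) +
  scale_fill_manual(values = c("#999999", "#000000")) +
  xlab("Background type (natural)") +
  ylab("Tadpoles surviving (%)")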
I don't know that much about geom_bar(), so this takes a slightly different approach, but it works with example datasets. It computes the error bar values by bootstrapping, which may be computationally intensive with larger datasets.
ggplot(data=odonatapint,aes(x=fundo,y=percentsob,fill=especie)) +
stat_summary(geom='bar', fun.y='mean', position='dodge') +
stat_summary(geom='errorbar', fun.data='mean_cl_boot', position='dodge')
I'm trying to plot a logistic regression in R, for a continuous independent variable and a dichotomous dependent variable. I have very limited experience with R, but my professor has asked me to add this graph to a paper I'm writing, and he said R would probably be the best way to create it. Anyway, I'm sure there are plenty of mistakes here, but this is the sort of thing previously suggested on Stack Overflow:
ggplot(vvv, aes(x = vvv$V1, y=vvv$V2)) + geom_point() + stat_smooth(method="glm", family="binomial", se=FALSE)
curve(predict(ggg, data.frame(V1=x), type="response"), add=TRUE)
where vvv is the name of my csv file (31 obs. of 2 variables), V1 is the continuous variable, and V2 is the dichotomous one. Also, ggg (List of 30?) is the following:
ggg<- glm(formula = vvv$V2 ~ vvv$V1, family = "binomial", data = vvv)
The ggplot function produces a graph of my data points, but no logistic regression curve. The curve function results in the following error:
"Error in curve(predict(ggg, data.frame(V1 = x), type = "resp"), add = TRUE) : 'expr' did not evaluate to an object of length 'n'
In addition: Warning message:'newdata' had 101 rows but variables found have 31 rows"
I'm not sure what the problem is, and I'm having trouble finding resources for this specific issue. Can anybody help? It would be greatly appreciated :)
Edit: Thanks to everyone who responded! My data, vvv, are shown below; the percentage was the initial probability of presence/absence of a species in a specific area, and the 1 or 0 indicates whether or not the species ended up being observed:
V1 V2
1 95.00% 1
2 95.00% 0
3 95.00% 1
4 92.00% 1
5 92.00% 1
6 92.00% 1
7 92.00% 1
8 92.00% 1
9 92.00% 1
10 92.00% 1
11 85.00% 1
12 85.00% 1
13 85.00% 1
14 85.00% 1
15 85.00% 1
16 80.00% 1
17 80.00% 0
18 77.00% 1
19 77.00% 1
20 75.00% 0
21 70.00% 1
22 70.00% 0
23 70.00% 0
24 70.00% 1
25 70.00% 0
26 69.00% 1
27 65.00% 0
28 60.00% 1
29 50.00% 1
30 35.00% 0
31 25.00% 0
As @MrFlick commented, V1 is probably a factor, so first you have to convert it to numeric. The following replaces "%" with nothing and divides by 100, so you end up with proportions stored as a numeric column:
vvv$V1<-as.numeric(sub("%","",vvv$V1))/100
Having done that, you can use essentially your own code and get the logistic regression plot (two small changes: map the bare column names inside aes() rather than vvv$..., and in recent versions of ggplot2 the family is passed via method.args):
ggplot(vvv, aes(x = V1, y = V2)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = binomial), se = FALSE)
This plots not only the points but also the logistic regression curve. I don't see the need for curve() here; from what I understand of your question, this is enough for what you need.
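For completeness, here is my reading of why the curve() call failed (an explanation I am adding, not something covered above): because the model was fit with vvv$V2 ~ vvv$V1, the predictor recorded in the model is literally "vvv$V1", so predict() cannot match it to the V1 column of newdata and falls back to the 31 original fitted values, while curve() expects 101 values. Fitting with plain column names and data = vvv makes the base-R approach work:
# Fit with column names so that predict() can use newdata
ggg <- glm(V2 ~ V1, family = binomial, data = vvv)
plot(vvv$V1, vvv$V2, xlab = "V1", ylab = "V2")
curve(predict(ggg, data.frame(V1 = x), type = "response"), add = TRUE)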