Frequency distribution of a categorical variable in R

I am trying to prepare a frequency distribution table of a categorical variable in my data, using the code below. The output looks OK when I view it, but it does not print properly in the report.
# These lines are not needed because the data below is already
# in that format
# STI<-STI_IPD1%>% select(Q18_1,Q54)
# STI$Q54<-as.factor(STI$Q54)
STI = structure(list(Q18_1 = c(101L, 120L, 29L, 101L, 94L, 16L, 47L,
141L, 154L, 47L, 141L, 154L, 154L, 29L, 58L, 154L, 101L, 154L,
47L, 141L, 75L, 1L, 120L, 16L, 154L, 141L, 141L, 154L, 154L,
154L, 29L, 141L, 38L, 47L, 101L, 16L, 154L, 154L, 101L, 192L,
58L, 154L, 16L, 120L, 101L, 1L, 38L, 1L, 154L, 1L, 16L, 58L,
75L, 154L, 47L, 58L, 120L, 141L, 1L, 141L, 16L, 141L, 58L, 29L,
101L, 58L, 154L, 75L, 75L, 141L, 29L, 101L, 101L, 154L, 16L,
101L, 101L, 47L, 47L, 181L, 16L, 154L, 47L, 154L, 47L, 120L,
75L, 47L, 192L, 1L, 154L, 154L, 120L, 141L, 58L, 47L, 154L, 101L,
75L, 141L, 75L, 16L, 47L, 1L, 58L, 141L), Q54 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "Discretionary if earnings per share goals are met.",
"initial funding by targets and as year goes on begin to include financial results",
"Non-represented are targets focused and budgeted and union plans are self funded based on operating margin achievements."
), class = "factor")), class = c("data.table", "data.frame"), row.names = c(NA,
-106L), .Names = c("Q18_1",
"Q54"))
as.data.frame(table(STI$Q54))
Is there any other way to prepare such output?
I want the output as a table of counts for each factor level: the factor levels in one column and the counts in another column.
I am writing the output to a Word file using R Markdown. Also, in the output window the output is not printed as a two-column table.

To print a data frame as a table in Markdown, one can use the kable() function in knitr.
library(knitr)
kable(aDataFrame)
For example...
Combining data.frame() with the kable() function is a really useful technique for communicating tabular information in R Markdown. For a couple of more complicated examples using this technique, please read my article Commentary on ToothGrowth Factorial ANOVA, where I compare Robert Kabacoff's analysis to the requirements of the Johns Hopkins University Statistical Inference course on Coursera.
regards,
Len
(11/22/2017) UPDATE: Responding to a comment from @sandhya-ghildiyal, here is how to exclude the blank row from the table output. If we save the result of table() into an object, we can then use the extract operator [ within the kable() function to exclude the row where the factor's integer code is 1, which is the blank level.
theTable <- as.data.frame(table(STI$Q54))
kable(theTable[as.numeric(theTable$Var1) != 1,])
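The same idea can be sketched on a small made-up factor, so that it runs without the STI data. The names answers, freq and freq_nonblank are hypothetical and used only for illustration; filtering on the level text ("") rather than on its integer code is a swapped-in variant that should keep working if the blank level is not the first one.

```r
library(knitr)

# Toy factor with a blank level, standing in for STI$Q54 (made-up values)
answers <- factor(c("", "", "Discretionary", "", "Self funded"))

# Counts per level as a two-column data frame
freq <- as.data.frame(table(answers), stringsAsFactors = FALSE)
names(freq) <- c("Response", "Count")

# Drop the blank level by its label instead of its integer code
freq_nonblank <- freq[freq$Response != "", ]

# Render as a Markdown table; in an R Markdown document knitted to Word
# this becomes a regular two-column table
kable(freq_nonblank, row.names = FALSE)
```

Renaming the columns before calling kable() also gives the report readable headers instead of Var1 and Freq.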

Related

T-tests on specific subgroups

I'm trying to calculate mean and SD and then perform t.tests on three different measurements (height, weight, speed) between multiple subgroups.
I started with a simple dataset that only contains two groups (control vs drug) and I have it all working well enough.
simple.df<-
structure(list(trial = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), levels = c("control", "drug"), class = "factor"), height = c(15,
17, 25, 21, 11, 29, 18, 20), weight = c(80, 90, 81, 79, 200,
230, 215, 210), speed = c(50, 45, 60, 51, 52, 80, 41, 19)), class = "data.frame", row.names = c(NA,
-8L))
library(rstatix)
simple.df %>% group_by(trial) %>% get_summary_stats(type = "mean_sd")
testing<- data.frame(lapply(simple.df[-1], function(x) t.test(x~simple.df$trial)$p.value))
testing
Where I'm running into trouble is with the t.testing on a larger experiment similar to the dataframe below. I still have control vs drug and height, weight & speed, but now all the measurements were done at two timepoints in both males and females. I'm only concerned with comparing control versus drug for the same sex/age. I'm still ok calculating the mean and SD for each group, but have gotten stuck with figuring out the t-testing.
Specifically, I just want the t-test on each of the three measurements for drug vs control in young males, drug vs control in old males, drug vs control in young females and drug vs control in old females, so 12 p-values total with some identification for what comparison each value represents.
Thanks for your help and expertise!
big.df<- structure(list(age = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = c("old", "young"
), class = "factor"), sex = structure(c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = c("f", "m"), class = "factor"),
trial = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), levels = c("control", "drug"
), class = "factor"), height = c(15L, 17L, 25L, 21L, 11L,
29L, 18L, 20L, 300L, 320L, 316L, 325L, 170L, 175L, 172L,
180L, 28L, 40L, 33L, 35L, 60L, 45L, 67L, 52L, 250L, 260L,
240L, 248L, 11L, 19L, 16L, 4L), weight = c(80L, 90L, 81L,
79L, 200L, 230L, 215L, 210L, 152L, 150L, 148L, 155L, 160L,
158L, 157L, 140L, 176L, 164L, 135L, 196L, 175L, 178L, 120L,
147L, 160L, 155L, 175L, 142L, 139L, 142L, 150L, 145L), speed = c(50L,
45L, 60L, 51L, 52L, 80L, 41L, 19L, 55L, 56L, 61L, 67L, 85L,
90L, 100L, 77L, 90L, 80L, 77L, 80L, 81L, 95L, 87L, 91L, 50L,
60L, 55L, 59L, 71L, 65L, 66L, 62L)), row.names = c(NA, -32L
), class = "data.frame")
big.df %>% group_by (sex, age, trial) %>%
get_summary_stats (type = "mean_sd") %>%
arrange (variable, sex, age, trial)
RYann had a good idea by defining a function to pull out subgroups and then doing all the t-tests on each subgroup. That approach was helpful.
I ended up building on his strategy and simplifying things a bit more by vectorizing the t-tests inside the function using lapply. I then stored each of the age/sex combinations in a dataframe and used mapply to pass those combinations to the t-testing function.
# t-tests of control vs drug for one age/sex subgroup;
# returns one p-value per measurement (columns 4 to 6: height, weight, speed)
t.script <- function(a_age, a_sex) {
group<-big.df %>% filter(age == a_age & sex == a_sex)
data.frame(lapply(group[4:6], function(x) t.test(x~group$trial)$p.value))
}
combos <- data.frame(age = c("young","young","old","old"),
sex = c("m","f","m","f"))
t.test.df <- data.frame(mapply(t.script, a_age = combos$age, a_sex = combos$sex))
colnames(t.test.df) <- paste(combos$age, combos$sex, sep = " ")
       young m      young f      old m       old f
height 1            1.939896e-05 0.01175771  1.630232e-08
weight 4.435875e-05 0.6368126    0.5196617   0.1299121
speed  0.80433      0.004320253  0.1526353   0.01539331
I hope this code will work out for you
big.df<- structure(list(age = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = c("old", "young"
), class = "factor"), sex = structure(c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = c("f", "m"), class = "factor"),
trial = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), levels = c("control", "drug"
), class = "factor"), height = c(15L, 17L, 25L, 21L, 11L,
29L, 18L, 20L, 300L, 320L, 316L, 325L, 170L, 175L, 172L,
180L, 28L, 40L, 33L, 35L, 60L, 45L, 67L, 52L, 250L, 260L,
240L, 248L, 11L, 19L, 16L, 4L), weight = c(80L, 90L, 81L,
79L, 200L, 230L, 215L, 210L, 152L, 150L, 148L, 155L, 160L,
158L, 157L, 140L, 176L, 164L, 135L, 196L, 175L, 178L, 120L,
147L, 160L, 155L, 175L, 142L, 139L, 142L, 150L, 145L), speed = c(50L,
45L, 60L, 51L, 52L, 80L, 41L, 19L, 55L, 56L, 61L, 67L, 85L,
90L, 100L, 77L, 90L, 80L, 77L, 80L, 81L, 95L, 87L, 91L, 50L,
60L, 55L, 59L, 71L, 65L, 66L, 62L)), row.names = c(NA, -32L
), class = "data.frame")
library(tidyverse) # dplyr, tidyr, stringr and ggplot2 are all used below
# A function to extract the 3 comparisons
multi_t <- function(a_sex,a_age){
df_func <- big.df %>% filter(sex==a_sex,age==a_age)
h <- t.test(height~trial,df_func)$p.value
w <- t.test(weight~trial,df_func)$p.value
s <- t.test(speed~trial,df_func)$p.value
# cat(
# "sex =",a_sex,"\nage =",a_age,"\n\n"
# )
return(cbind(height=h,weight=w,speed=s))
}
# Table in a long version
ptable <- data.frame(
multi_t("m","young"),
multi_t("m","old"),
multi_t("f","young"),
multi_t("f","old")
) %>% pivot_longer(cols=everything(),
names_to = "value",
values_to = "p.values") %>%
mutate(comparison = rep(c("young males","old males",
"young females","old females"),each=3),
value=str_remove_all(value,"\\.\\d"))
ptable
# Table in a wider version
ptable %>% group_by(value) %>% mutate(id=row_number()) %>%
pivot_wider(names_from = value,values_from = p.values) %>%
select(-id)
ptable %>%
mutate(sig=p.values<0.05) %>%
ggplot(aes(x=value,y=p.values,color=sig))+
geom_point(show.legend = T)+facet_wrap(~comparison,scales="free")+
theme(legend.position = "bottom")+
labs(title="P values of 3 different measurements",
subtitle = "For 4 different populations")

Why is prediction error discrete in adabag?

I've got a table of 55 observations with 5 variables (F,H,R,T,U) and 1 class variable ("Group") in which I have two groups.
I'm doing data sampling by splitting the data into the training set (70%) and test set (30%). Then I run adaboosting and check how it works.
I want to get the AdaBoost error distribution for 100 samplings. But the distribution turns out to be discrete, taking only five values: 0, 0.0588235294117647, 0.117647058823529, 0.176470588235294 and 0.235294117647059. It doesn't change with the mfinal argument. I would expect more distinct values. How does this work?
I use the following code:
predictions<-list()
for (i in 1:100){
train.ind<-sample(nrow(df), nrow(df) * 0.7)
assign(paste0("ada",i), do.call(boosting,
c(formula=Group~F + H + R + T + U,
data=substitute(df[train.ind,]), mfinal=50, boos=FALSE,
coeflearn='Breiman'),envir = parent.frame()))
assign(paste0("pred",i), predict(get(paste0("ada",i)),df[-train.ind,]))
predictions[[i]]<-get(paste0("pred",i))$error
}
hist(100*unlist(predictions),breaks=10,
main="Error probability [%] ntrees=10. 100 sampling operations", xlab="AdaBoost error")
dput(df)
structure(list(Group = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Canines", "Sled"), class = "factor"), F = c(0.263150566678734,
0.260347316635598, 0.26437277258488, 0.265710057607949, 0.254866055219663,
0.263294264681227, 0.261901194801303, 0.257318268395066, 0.26420207103455,
0.252093225560912, 0.255473253732324, 0.259067858940115, 0.259528043446917,
0.267331491048901, 0.260246447333382, 0.26035486437815, 0.254553215708594,
0.274074579975413, 0.262896904742862, 0.260504330262876, 0.258329960879536,
0.262664861154909, 0.256148832094211, 0.258509128895957, 0.256292083925698,
0.262358651734143, 0.254578103664353, 0.255386025800537, 0.264120912009577,
0.275232714712253, 0.265375720277527, 0.267601768121804, 0.262932226832642,
0.263633189245163, 0.262826186070212, 0.261058637786334, 0.262979366135887,
0.259232168979912, 0.252933156025384, 0.263963451214447, 0.258511197058683,
0.261957295373665, 0.253412282699461, 0.260748166588172, 0.263136039863289,
0.255317062006506, 0.258822015633545, 0.252757763183064, 0.260840486010478,
0.258620689655172, 0.263738813871524, 0.26241134751773, 0.26405425581719,
0.263685152057245, 0.262062787572784), H = c(0.242711147002311,
0.243850477245014, 0.245132979060713, 0.241794831140003, 0.235370262206577,
0.241392449436832, 0.236787894677703, 0.240434935369935, 0.234076675284456,
0.236978505926275, 0.23489414817613, 0.236461115627298, 0.241377100655228,
0.240778565421122, 0.238954656595734, 0.237237027626932, 0.23562891291975,
0.228247507171151, 0.235543469567304, 0.238348073568565, 0.237639956832591,
0.237993655975811, 0.23053394888479, 0.237553985998722, 0.238716430501961,
0.241044553515742, 0.23579805839771, 0.244646715997643, 0.245211405561299,
0.248463204730402, 0.237910443860818, 0.23772859908127, 0.242517289073306,
0.230376515634971, 0.239386381312522, 0.242971498213445, 0.248246377553633,
0.245227816034538, 0.237968589560153, 0.235998092571798, 0.235639593181493,
0.240320284697509, 0.239383587641388, 0.237939850635807, 0.240409493084614,
0.239705089012767, 0.235291279312896, 0.237725562711216, 0.251017166425148,
0.244410329082034, 0.247581475626206, 0.244082639531298, 0.248022977743474,
0.246127343801762, 0.246345535241663), R = c(0.23238005068085,
0.233913128793082, 0.232906768805408, 0.234580624702711, 0.23729616240706,
0.232552468336102, 0.23566425708828, 0.233370934038501, 0.23413197660754,
0.241255572873247, 0.240609653949119, 0.233790113420818, 0.239086204963073,
0.233644719452121, 0.23849468613068, 0.236846146329206, 0.239755264655663,
0.225925420024587, 0.239355887920232, 0.237429996633718, 0.23819641170916,
0.232039177131833, 0.223832380603256, 0.235838907338977, 0.236669843303285,
0.234916072348618, 0.238304558463179, 0.235904655883701, 0.232124394623714,
0.222879222527955, 0.233232723139038, 0.233871666714818, 0.235947441217151,
0.242585880964708, 0.234693056561268, 0.233941777691605, 0.229366135886539,
0.23539800906269, 0.239803390172875, 0.236505714593364, 0.24647853698133,
0.235569395017794, 0.242526379716086, 0.236207360559779, 0.234180854122081,
0.240408036487878, 0.239601762794737, 0.245058343429191, 0.234449894103222,
0.237875925051173, 0.230698942666106, 0.233475177304965, 0.231384358432554,
0.233114688928642, 0.230655428424067), T = c(0.261758235638105,
0.261889077326307, 0.257587479549, 0.257914486549337, 0.272467520166701,
0.262760817545838, 0.265646653432713, 0.268875862196498, 0.267589277073454,
0.269672695639567, 0.269022944142428, 0.270680912011768, 0.260008650934782,
0.258245224077857, 0.262304209940204, 0.265561961665713, 0.270062606715993,
0.271752492828849, 0.262203737769602, 0.263717599534841, 0.265833670578713,
0.267302305737446, 0.289484838417743, 0.268097977766344, 0.268321642269056,
0.261680722401497, 0.271319279474757, 0.264062602318119, 0.258543287805409,
0.253424858029389, 0.263481112722616, 0.260797966082108, 0.258603042876902,
0.263404414155158, 0.263094376055998, 0.262028086308617, 0.259408120423941,
0.26014200592286, 0.269294864241588, 0.263532741620391, 0.259370672778494,
0.262153024911032, 0.264677749943065, 0.265104622216242, 0.262273612930016,
0.264569812492848, 0.266284942258822, 0.264458330676529, 0.253692453461153,
0.25909305621162, 0.257980767836164, 0.260030835646007, 0.256538408006782,
0.25707281521235, 0.260936248761486), U = c(0.276642254462421,
0.275750907536407, 0.274138521440258, 0.279385339041277, 0.283770344294126,
0.273124933319108, 0.276770665567999, 0.272796198013943, 0.273326789343435,
0.278824893979485, 0.282917535762971, 0.269035729493284, 0.276381346021371,
0.275681845488406, 0.280473043309851, 0.274957072857482, 0.279453614114969,
0.265400901516186, 0.284438401450319, 0.275270067631668, 0.277080803992985,
0.268341093323935, 0.26334299428362, 0.27494270078114, 0.277070411973316,
0.276364671746617, 0.277622940087166, 0.275489489882784, 0.275412200032649,
0.267636555236813, 0.275475938484053, 0.27914367434201, 0.281161825726141,
0.287341513046201, 0.274277898463271, 0.272041104617345, 0.268317034458041,
0.277054269097656, 0.276448903327891, 0.282483963758864, 0.288513266166897,
0.280409252669039, 0.283610415243301, 0.27874587902846, 0.274619094771137,
0.275604453090517, 0.286100299160421, 0.288513039597016, 0.270078586556683,
0.280480764184118, 0.274123602187187, 0.277940178846747, 0.273784368554907,
0.282369310276287, 0.277372857201026)), na.action = structure(c(`2` = 2L,
`4` = 4L, `19` = 18L, `24` = 20L, `28` = 24L, `29` = 25L, `30` = 26L,
`32` = 28L, `33` = 29L, `42` = 38L, `54` = 46L, `69` = 54L, `74` = 58L,
`77` = 59L, `79` = 60L, `80` = 61L, `83` = 62L), class = "omit"), row.names = c(5L,
6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 15L, 16L, 17L, 18L, 20L,
25L, 26L, 27L, 31L, 41L, 44L, 46L, 47L, 48L, 50L, 51L, 52L, 55L,
57L, 64L, 65L, 66L, 67L, 68L, 70L, 71L, 72L, 85L, 86L, 87L, 88L,
89L, 90L, 91L, 92L, 93L, 94L, 95L, 96L, 97L, 98L, 99L, 100L,
101L, 102L, 103L), class = "data.frame")
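A likely explanation for the discrete values can be checked with a little arithmetic. This is only a sketch, assuming that the error reported for a prediction is the fraction of misclassified test rows. With 55 rows and a 70% training sample, the test set has 17 rows, so every error must be a multiple of 1/17, whatever mfinal is. The variable names below are hypothetical.

```r
n <- 55

# sample() is asked for 38.5 indices and returns 38 of them
n_train <- length(sample(n, n * 0.7))
n_test <- n - n_train

# With k misclassified rows out of n_test, the error is k / n_test
possible_errors <- (0:n_test) / n_test

n_test
# 17
head(possible_errors, 5)
# 0.00000000 0.05882353 0.11764706 0.17647059 0.23529412
```

The five observed values correspond to 0 to 4 misclassified rows out of 17. A finer-grained distribution would need a larger test set, or an error estimate pooled over many resamples.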

How can I remove the legend from this boxplot in ggplot? [duplicate]

This question already has answers here:
Remove legend ggplot 2.2
(4 answers)
Closed 3 years ago.
Please find My Data below.
How can I remove the red, encircled legend from my boxplot?
I wish to keep the same colors and design. I have tried numerous different solutions, but unfortunately none of them has solved the problem.
This might be kind of basic, but I simply can't figure out how to solve this. I hope you can help - thanks in advance!
My script is:
df <- data.frame(x = as.factor(c(p$WHO.Grade)),
y = c(p$ki67pro),
f = rep(c("Ki67pro"), c(nrow(p))))
ggplot(df) +
geom_boxplot(aes(x, y, fill = f, colour = f), outlier.alpha = 0, position = position_dodge(width = 0.78)) +
scale_x_discrete(name = "", label=c("WHO-I\nn=108","WHO-II\nn=34","WHO-III\nn=1")) +
scale_y_continuous(name="Ki-67 proliferative index", breaks=seq(0,30,5), limits=c(0,30)) +
stat_boxplot(aes(x, y, colour = f), geom = "errorbar", width = 0.3,position = position_dodge(0.7753)) +
geom_point(aes(x, y, fill = f, colour = f), size = 3, shape = 21, position = position_jitterdodge()) +
scale_fill_manual(values = c("#52C1C76D"), name = "",
labels = c("\nTotal cohort\nn=159\n ")) +
scale_colour_manual(values = c("#51BFC4"), name = "",
labels = c("\nTotal cohort\nn=159\n "))
And My Data
p <- structure(list(WHO.Grade = c(1L, 2L, 1L, 1L, 1L, 1L, 3L, 2L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), ki67pro = c(1L, 12L, 3L, 3L, 5L,
3L, 20L, 25L, 7L, 4L, 5L, 12L, 3L, 15L, 4L, 5L, 7L, 8L, 3L, 12L,
10L, 4L, 10L, 7L, 3L, 2L, 3L, 7L, 4L, 7L, 10L, 4L, 5L, 5L, 3L,
5L, 2L, 5L, 3L, 3L, 3L, 4L, 4L, 3L, 2L, 5L, 1L, 5L, 2L, 3L, 1L,
2L, 3L, 3L, 5L, 4L, 20L, 5L, 0L, 4L, 3L, 0L, 3L, 4L, 1L, 2L,
20L, 2L, 3L, 5L, 4L, 8L, 1L, 4L, 5L, 4L, 3L, 6L, 12L, 3L, 4L,
4L, 2L, 5L, 3L, 3L, 3L, 2L, 5L, 4L, 2L, 3L, 4L, 3L, 3L, 2L, 2L,
4L, 7L, 4L, 3L, 4L, 2L, 3L, 6L, 2L, 3L, 10L, 5L, 10L, 3L, 10L,
3L, 4L, 5L, 2L, 4L, 3L, 4L, 4L, 4L, 5L, 3L, 12L, 5L, 4L, 3L,
2L, 4L, 3L, 4L, 2L, 1L, 6L, 1L, 4L, 12L, 3L, 4L, 3L, 2L, 6L,
5L, 4L, 3L, 4L, 4L, 4L, 3L, 5L, 4L, 5L, 4L, 1L, 3L, 3L, 4L, 0L,
3L)), class = "data.frame", row.names = c(1L, 2L, 3L, 9L, 10L,
11L, 12L, 13L, 14L, 15L, 16L, 18L, 19L, 20L, 21L, 22L, 23L, 24L,
25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L,
38L, 39L, 40L, 41L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L,
53L, 54L, 55L, 57L, 59L, 60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L,
68L, 69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L, 77L, 78L, 79L, 80L,
81L, 82L, 83L, 84L, 85L, 87L, 89L, 90L, 91L, 92L, 93L, 94L, 96L,
97L, 98L, 99L, 100L, 101L, 102L, 103L, 104L, 105L, 106L, 107L,
109L, 110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L,
120L, 121L, 123L, 124L, 125L, 126L, 127L, 128L, 130L, 131L, 132L,
133L, 134L, 135L, 136L, 137L, 138L, 139L, 140L, 141L, 142L, 143L,
144L, 145L, 146L, 147L, 148L, 149L, 150L, 151L, 152L, 153L, 154L,
155L, 156L, 157L, 158L, 159L, 160L, 161L, 162L, 163L, 164L, 165L,
166L, 167L, 168L, 169L, 170L, 171L, 172L, 173L, 174L, 175L))
You can use theme() as follows:
... + theme(legend.position = "none")
This should eliminate the legend
reference: https://www.datanovia.com/en/blog/ggplot-legend-title-position-and-labels/
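As a minimal sketch of the suggestion above, here is a self-contained version on made-up data. The data frame toy and the object names are hypothetical; the colours are copied from the question.

```r
library(ggplot2)

# Small made-up data set standing in for the question's df
toy <- data.frame(
  x = factor(c(1, 1, 1, 2, 2, 2)),
  y = c(1, 3, 5, 8, 12, 20),
  f = "Ki67pro"
)

p_with_legend <- ggplot(toy) +
  geom_boxplot(aes(x, y, fill = f, colour = f)) +
  scale_fill_manual(values = c("#52C1C76D"), name = "") +
  scale_colour_manual(values = c("#51BFC4"), name = "")

# Same plot and colours, but with the legend suppressed
p_no_legend <- p_with_legend + theme(legend.position = "none")
p_no_legend
```

Because the legend is switched off in the theme, the fill and colour scales stay in place, so the boxes keep their appearance.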

dplyr n_distinct() in filter takes forever, whereas base length(unique()) works like a charm

I have a data frame such as this:
structure(list(x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L,
6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 12L, 12L,
13L, 13L, 14L, 14L, 15L, 15L, 16L, 16L, 17L, 17L, 18L, 18L, 19L,
19L, 20L, 20L, 21L, 21L, 22L, 22L, 23L, 23L, 24L, 24L, 25L, 25L,
26L, 26L, 27L, 27L, 28L, 28L, 29L, 29L, 30L, 30L, 31L, 31L, 32L,
32L, 33L, 33L, 34L, 34L, 35L, 35L, 36L, 36L, 37L, 37L, 38L, 38L,
39L, 39L, 40L, 40L, 41L, 41L, 42L, 42L, 43L, 43L, 44L, 44L, 45L,
45L, 46L, 46L, 47L, 47L, 48L, 48L, 49L, 49L, 50L, 50L, 51L, 51L,
52L, 52L, 53L, 53L, 54L, 54L, 55L, 55L, 56L, 56L, 57L, 57L, 58L,
58L, 59L, 59L, 60L, 60L, 61L, 61L, 62L, 62L, 63L, 63L, 64L, 64L,
65L, 65L, 66L, 66L, 67L, 67L, 68L, 68L, 69L, 69L, 70L, 70L, 71L,
71L, 72L, 72L, 73L, 73L, 74L, 74L, 75L, 75L, 76L, 76L, 77L, 77L,
78L, 78L, 79L, 79L, 80L, 80L, 81L, 81L, 82L, 82L, 83L, 83L, 84L,
84L, 85L, 85L, 86L, 86L, 87L, 87L, 88L, 88L, 89L, 89L, 90L, 90L,
91L, 91L, 92L, 92L, 93L, 93L, 94L, 94L, 95L, 95L, 96L, 96L, 97L,
97L, 98L, 98L, 99L, 99L, 100L, 100L), y = structure(c(1L, 2L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L,
2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L,
1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 2L), .Label = c("one", "two"), class = "factor")), class = "data.frame", row.names = c(NA,
-200L), .Names = c("x", "y"))
I am trying to filter groups of x that have two distinct y values using:
library(dplyr)
df %>% group_by(x) %>% filter(n_distinct(y) > 1)
On a large data set, this almost never finishes.
Changing to this works reasonably fast for the full data set:
library(dplyr)
df %>% group_by(x) %>% filter(length(unique(y)) > 1)
Any idea why n_distinct() is so slow that it almost never finishes?
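Before comparing speed, it can help to confirm that the two filters select the same rows. The sketch below does that on a tiny made-up data frame (toy is a hypothetical name); it says nothing about why one form is slower.

```r
library(dplyr)

# Toy data: group 1 has two distinct y values, groups 2 and 3 have one each
toy <- data.frame(
  x = c(1L, 1L, 2L, 2L, 3L, 3L),
  y = factor(c("one", "two", "two", "two", "one", "one"))
)

via_n_distinct <- toy %>% group_by(x) %>% filter(n_distinct(y) > 1)
via_unique     <- toy %>% group_by(x) %>% filter(length(unique(y)) > 1)

# Compare the selected rows, ignoring the grouping metadata
all.equal(as.data.frame(via_n_distinct), as.data.frame(via_unique))
```

For timing on the real data, wrapping each pipeline in system.time() would show how large the difference is for a given dplyr version.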

How to plot exponential decay in geom_smooth in ggplot2 in R?

Data
> dput(new.gapdata.cc)
structure(list(gap.interval = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 34L
), .Label = c("[0.0568,10.1]", "(10.1,20.1]", "(20.1,30.1]",
"(30.1,40.1]", "(40.1,50.1]", "(50.1,60.1]", "(60.1,70.1]", "(70.1,80.1]",
"(80.1,90.1]", "(90.1,100]", "(100,110]", "(110,120]", "(120,130]",
"(130,140]", "(140,150]", "(150,160]", "(160,170]", "(170,180]",
"(180,190]", "(190,200]", "(200,210]", "(210,220]", "(220,230]",
"(230,240]", "(240,250]", "(250,260]", "(260,270]", "(270,280]",
"(280,290]", "(290,300]", "(300,310]", "(310,320]", "(320,330]",
"(330,340]", "(340,350]", "(350,360]", "(360,370]", "(370,380]",
"(380,390]", "(390,400]", "(400,410]", "(410,420]", "(420,430]",
"(430,440]", "(440,450]", "(450,460]", "(460,470]", "(470,480]",
"(480,490]", "(490,500]", "(500,510]", "(510,520]", "(520,530]",
"(530,540]", "(540,550]", "(550,560]", "(560,570]", "(570,580]",
"(580,590]", "(590,600]", "(600,610]", "(610,620]", "(620,630]",
"(630,640]", "(640,650]", "(650,660]", "(660,670]", "(670,680]",
"(680,690]", "(690,700]", "(700,710]", "(710,720]", "(720,730]",
"(730,740]", "(740,750]"), class = "factor"), Vehicle.class = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Car following",
"Heavy-Vehicle following"), class = "factor"), PrecVehClass = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Car",
"Heavy-Vehicle"), class = "factor"), sd.speed = c(8.10631184218832,
11.4437550056097, 11.8038327709683, 10.8543703246156, 9.99720748006444,
9.44865875583687, 8.96606665646703, 8.49351869704553, 7.93669264490773,
8.13551032227591, 7.84202528436342, 8.0475744381228, 7.91648183675322,
7.43125313026708, 7.35410275703108, 7.60500908370333, 7.0498555301719,
7.55232413932399, 8.06598948864824, 6.76873032867712, 9.5638441069889,
8.04863015016668, 6.3210319215341, 4.64833690603376, 6.62719482422681,
6.64056528224281, 4.73744287133819, 7.47515815690314, 7.69289983159388,
0.306328206216196, 0.686563613792699), m.speed = c(7.49142882761648,
14.9015932672865, 23.2183766318976, 29.4281833927603, 33.2698195905316,
35.8151829762138, 37.5490804914733, 38.5477371278585, 39.3540677299243,
40.6919294171912, 41.1003756008852, 41.8182626555034, 43.0467747414578,
42.8363357874289, 43.4938190765401, 43.3542212600658, 45.4415004558705,
46.0292158248193, 45.2411112123218, 45.3142872888847, 45.8483490730252,
44.9081708678314, 48.91998889291, 47.3070826500395, 47.6670737425671,
46.3952054632908, 43.9972157634013, 51.2984320152685, 60.9675201903266,
44.7204961417801, 49.3765339447783), m.gapdist = c(7.7653843749647,
16.1638754974281, 25.4776617248361, 35.2445820779774, 44.9431006950918,
54.8030747287456, 64.7488740187079, 74.7493853439047, 84.7618392182203,
94.6265821702835, 104.858371321352, 114.633780836178, 124.562176064196,
134.473095135859, 144.806940411055, 154.554692908294, 164.982952591097,
174.906212522406, 185.553895860064, 194.461299821333, 204.825162321106,
215.128853160835, 225.333436194581, 235.137188240688, 244.880475531984,
255.160919142993, 264.314402521448, 274.575498681999, 285.224335149303,
293.119840359603, 337.618758706201)), .Names = c("gap.interval",
"Vehicle.class", "PrecVehClass", "sd.speed", "m.speed", "m.gapdist"
), row.names = c(3L, 8L, 13L, 18L, 24L, 31L, 37L, 43L, 49L, 55L,
61L, 66L, 71L, 76L, 81L, 85L, 88L, 91L, 94L, 96L, 98L, 100L,
102L, 105L, 107L, 109L, 112L, 114L, 116L, 118L, 121L), class = "data.frame")
What I want to achieve
I have 'sd.speed/m.speed' as the dependent variable and 'm.gapdist' as the explanatory variable. When I make a scatter plot, the trend looks like exponential decay. So I want to get the summary statistics as well as a fitted curve plotted over the data points. I used the following code:
ggplot() +
geom_point(data=new.gapdata.cc,
aes(y=sd.speed/m.speed, x=m.gapdist, shape=interaction(Vehicle.class,PrecVehClass)),
size=3) +
geom_smooth(data=new.gapdata.cc,
mapping = aes(y= sd.speed/m.speed, x=m.gapdist,
linetype=interaction(Vehicle.class,PrecVehClass)), method="lm", formula = log(y) ~ x,
se=F, size=1, color="black")
Question
This does not plot the exponential decay curve on the points. How can I fit the curve on points?
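One reason the curve misses the points is that formula = log(y) ~ x makes geom_smooth() draw the predictions of log(y), which are on the log scale rather than the scale of the plotted points. Two possible workarounds are sketched below on simulated data: the data frame toy and its decay parameters are made up for illustration and are not taken from new.gapdata.cc.

```r
library(ggplot2)

# Simulated exponential decay with multiplicative noise (illustrative only)
set.seed(1)
toy <- data.frame(x = seq(5, 300, length.out = 30))
toy$y <- 0.9 * exp(-0.008 * toy$x) * exp(rnorm(30, sd = 0.1))

# Option 1: fit on the log scale, then back-transform the predictions.
# summary(fit) gives the coefficient table and R-squared for the log-linear fit.
fit <- lm(log(y) ~ x, data = toy)
summary(fit)
toy$fitted <- exp(predict(fit))

p_lm <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(size = 3) +
  geom_line(aes(y = fitted), colour = "black")

# Option 2: let geom_smooth() fit a log-link GLM, so the curve is drawn
# on the original scale of y
p_glm <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(size = 3) +
  geom_smooth(method = "glm", formula = y ~ x,
              method.args = list(family = gaussian(link = "log")),
              se = FALSE, colour = "black")
```

The two options are not identical: the first minimises squared error in log(y), the second in y itself, so the curves can differ slightly. With the question's data, y would be sd.speed/m.speed and x would be m.gapdist.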

Resources