I have some rather messy data and a model with an interaction between two factors, which I want to plot:
f1 <- structure(list(tipo = c("digitables", "digitables", "digitables",
"digitables", "digitables", "digitables", "digitables", "digitables",
"payments", "payments", "payments", "payments", "payments", "payments",
"payments", "payments", "traditionals", "traditionals", "traditionals",
"traditionals", "traditionals", "traditionals", "traditionals",
"traditionals"), categoria = c("Advice", "Digital banks", "Exchange",
"FinTech", "Insurance", "Investments", "Lending", "Payments and transfers",
"Advice", "Digital banks", "Exchange", "FinTech", "Insurance",
"Investments", "Lending", "Payments and transfers", "Advice",
"Digital banks", "Exchange", "FinTech", "Insurance", "Investments",
"Lending", "Payments and transfers"), Total = c(63L, 450L, 279L,
63L, 36L, 108L, 567L, 549L, 63L, 450L, 279L, 63L, 36L, 108L,
567L, 549L, 35L, 250L, 155L, 35L, 20L, 60L, 315L, 305L), Frequencia = c(44L,
266L, 118L, 9L, 14L, 45L, 134L, 242L, 33L, 68L, 2L, 10L, 3L,
8L, 11L, 78L, 27L, 226L, 142L, 10L, 20L, 45L, 300L, 245L), Perc = c(69.84,
59.11, 42.29, 14.29, 38.89, 41.67, 23.63, 44.08, 52.38, 15.11,
0.72, 15.87, 8.33, 7.41, 1.94, 14.21, 77.14, 90.4, 91.61, 28.57,
100, 75, 95.24, 80.33), Failure = c(19L, 184L, 161L, 54L, 22L,
63L, 433L, 307L, 30L, 382L, 277L, 53L, 33L, 100L, 556L, 471L,
8L, 24L, 13L, 25L, 0L, 15L, 15L, 60L)), row.names = c(NA, -24L
), class = "data.frame")
# Packages
library(dplyr)
library(ggplot2)
library(emmeans) #version 1.4.8. or 1.5.1
# Works as expected
m1 <- glm(cbind(Frequencia, Failure) ~ tipo*categoria,
data = f1, family = binomial(link = "logit"))
l1 <- emmeans(m1, ~categoria|tipo)
plot(l1, type = "response",
comparison = T,
by = "categoria")
Using by = "tipo" instead results in an error:
# Doesn't work:
plot(l1, type = "response",
comparison = T,
by = "tipo")
Error: Aborted -- Some comparison arrows have negative length!
In addition: Warning message:
Comparison discrepancy in group digitables, Advice - Insurance:
Target overlap = -0.0241, overlap on graph = 0.0073
If I use comparison = F, as suggested by the explanation supplement vignette, it works. However, it does not show the arrows, which are very important to me.
Q1 - Is there a workaround for this? (Or is it impossible with my data?)
As the last plot shows, there is a category with probability = 1 (categoria = Insurance and tipo = traditionals). So I deleted only this row from my data frame and tried to redo the plot, which gives:
f1 <- f1 %>%
filter(!Perc ==100)
m1 <- glm(cbind(Frequencia, Failure) ~ tipo*categoria,
data = f1, family = binomial(link = "logit"))
l1 <- emmeans(m1, ~categoria|tipo)
plot(l1, type = "response",
comparison = T,
by = "categoria")
Error in if (dif[i] > 0) lmat[i, id1[i]] = rmat[i, id2[i]] = wgt * v1[i] else rmat[i, :
missing value where TRUE/FALSE needed
Q2 - How can I plot my results when one level of a variable is missing for a level of the other variable? I would expect the Insurance facet to show only the payments and digitables levels (while the other facets remain the same).
First, please don't ever re-use the same variable names for more than one thing; that makes things not reproducible. If you modify a dataset, or a model, or whatever, give it a new name so it can be distinguished.
Q1
As documented, comparison arrows cannot always be computed, and this is such a case. I suggest displaying the results some other way, e.g. using pwpp() or pwpm().
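As a rough illustration of those two alternatives, here is a minimal sketch on a stand-in model. It uses the built-in warpbreaks data rather than the question's data, and the object names fit, emm and p are made up for the example; for the question's own object the analogous calls would presumably be pwpp(l1) and pwpm(l1).

```r
library(emmeans)

# Stand-in model on a built-in data set (NOT the question's data)
fit <- lm(breaks ~ wool * tension, data = warpbreaks)
emm <- emmeans(fit, ~ tension | wool)

# Pairwise P-value plot: one panel per level of the "by" factor (wool)
p <- pwpp(emm)
print(p)

# Pairwise P-value matrix: estimates, differences and P values in one table
pwpm(emm)
```

Neither display relies on comparison arrows, so neither runs into the negative-length problem.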
Q2
There was a bug in handling missing cases. This has been fixed in the GitHub version:
f2 <- f1 %>%
filter(!Perc ==100)
m2 <- glm(cbind(Frequencia, Failure) ~ tipo*categoria,
data = f2, family = binomial(link = "logit"))
l2 <- emmeans(m2, ~categoria|tipo)
plot(l2, type = "response",
comparison = TRUE,
by = "categoria")
plot(l2, type = "response",
comparison = TRUE,
by = "tipo")
## Error: Aborted -- Some comparison arrows have negative length!
## (in group "payments")
I am trying to find a solution to the following problem:
how many points per group lie on a straight line?
I could not find any solution for this in R.
Below is some sample data, together with a plot to show what it looks like:
data <- structure(list(Group = c(22782L, 22782L, 22782L, 22782L, 22782L,
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L,
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L,
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L,
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L,
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L,
22782L, 11553L, 11553L, 11553L, 11553L, 11553L, 7059L, 7059L,
7059L, 7059L, 22782L), x = c(100L, 150L, 250L, 287L, 312L, 387L,
475L, 550L, 837L, 937L, 987L, 1087L, 1175L, 1300L, 1325L, 1487L,
1662L, 1700L, 1725L, 1812L, 1912L, 2412L, 3012L, 3562L, 4162L,
4762L, 5362L, 5750L, 5712L, 6225L, 6825L, 6887L, 7237L, 7850L,
7800L, 7937L, 7975L, 8275L, 8362L, 8662L, 8725L, 8950L, 9100L,
9312L, 9400L, 9600L, 4637L, 900L, 4187L, 5800L, 7075L, 1125L,
3400L, 3562L, 3462L, 5412L), y = c(493L, 482L, 479L, 476L, 481L,
479L, 474L, 480L, 480L, 491L, 489L, 490L, 485L, 485L, 485L, 479L,
482L, 482L, 482L, 482L, 484L, 489L, 491L, 489L, 496L, 498L, 500L,
0L, 498L, 500L, 502L, 506L, 497L, 0L, 495L, 506L, 497L, 494L,
498L, 500L, 496L, 499L, 496L, 495L, 495L, 498L, 825L, 284L, 850L,
360L, 790L, 861L, 883L, 882L, 881L, 502L)), row.names = c(23L,
24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L,
37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L,
51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L,
64L, 65L, 66L, 67L, 68L, 69L, 281L, 312L, 313L, 315L, 316L, 377L,
378L, 380L, 511L, 815L), class = "data.frame")
The data consist of a group column (3 groups in this case) and x and y coordinates:
Group x y
22782 100 493
22782 150 482
22782 250 479
22782 287 476
22782 312 481
Below we can find a plot of the group 22782:
As you can see, many points lie almost exactly on the same line, and I would like to find out how many points per group meet this condition.
Expected Output would look like this:
Group Max Points
22782 20
I would appreciate any help or tips! Thanks
Let's assume that you know only a minority of points are not on the line. You also mention that you only want to consider horizontal lines.
In that case, you can use the median as a robust estimate of the position of the horizontal line. You could use the mean, but it may be swayed by extreme values that are not on the line anyway.
The code is self-explanatory:
library(dplyr)
tolerance <- 10  # maximum distance from the line, in units of y
data %>%
group_by(Group) %>%
mutate(y_line = median(y),
on_line = abs(y - y_line) <= tolerance) %>%
count(Group, on_line)
Result:
# Group on_line n
# <int> <lgl> <int>
# 1 7059 FALSE 1
# 2 7059 TRUE 3
# 3 11553 FALSE 4
# 4 11553 TRUE 1
# 5 22782 FALSE 13
# 6 22782 TRUE 34
You can of course pipe that into filter(on_line) to keep only the count of points that are on the line.
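For a result with one row per group, as in the expected output of the question, the same median-plus-tolerance idea can be sketched in base R. The toy data frame below copies the y values of groups 7059 and 11553 from the question; count_on_line is a hypothetical helper name, not part of any package.

```r
# y values of two of the question's groups
toy <- data.frame(
  Group = c(7059, 7059, 7059, 7059, 11553, 11553, 11553, 11553, 11553),
  y     = c(861, 883, 882, 881, 825, 284, 850, 360, 790)
)
tolerance <- 10

# Hypothetical helper: number of points within `tol` of the group median
count_on_line <- function(y, tol) sum(abs(y - median(y)) <= tol)

# Apply the helper to each group
max_points <- tapply(toy$y, toy$Group, count_on_line, tol = tolerance)
result <- data.frame(Group = names(max_points),
                     Max_Points = as.vector(max_points))
result
```

The counts agree with the TRUE rows of the dplyr result above: 3 points for group 7059 and 1 point for group 11553.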
Because we do not know at which values ggplot draws its grid lines, we first need to find out which breaks are set by default. This is answered here and used in my code.
The following function reports how many points lie on the lines per group. You can set a tolerance value for the deviation from a line that you are willing to accept. Note that points may lie on different lines, as in ggplot(subset(data, Group == 22782), aes(x=x,y=y)) + geom_point(), where points lie on two different lines (0 and 500).
For this case you can decide whether you want the sum of all points on any line, or the largest number of points gathered around a single line (here, how many points are at 500). You choose this with any_or_max_line.
The function
points.on.lines <- function(data, tolerance, any_or_max_line){
# runs the code below per group
sapply(unique(data$Group), function(group_i){
# chooses i-th group
data_group_i <- subset(data, Group == group_i)
# find on which y-values the lines are
line_values <-
with(data_group_i,
labeling::extended(range(y)[1], range(y)[2], m = 5))
# find out per line how many points are on or around that line
points_on_lines <- sapply(line_values, function(line_values_i){
sum(data_group_i$y >= line_values_i - tolerance &
data_group_i$y <= line_values_i + tolerance)})
# decides whether to take into account the line with most points or all points on any line
if(any_or_max_line == "max"){
points_on_lines <- max(points_on_lines)
} else {
points_on_lines <- sum(points_on_lines)
}
# names results by group
names(points_on_lines) <- paste0("Group_", group_i)
return(points_on_lines)
})}
Example
points.on.lines(data= data, tolerance= 50,
any_or_max_line= "max")
Group_22782 Group_11553 Group_7059
45 3 4
To me this looks like an interval optimisation problem (or, more generally, clustering of one-dimensional data). Unless you have fixed breaks or lines, one way to solve it is the Jenks natural breaks optimization,
which is already implemented in R in the package BAMMtools.
You first fix the lines, and then determine which line each point belongs to (the closest one).
One parameter you have to set is the number of lines (or rather clusters) in the function getJenksBreaks.
There may be other methods to cluster these points, but here is the Jenks approach:
library(BAMMtools)
library(dplyr)
library(ggplot2)
mydata <- data  # the data frame from the question
lines <- getJenksBreaks(mydata$y, 5)
lines
# [1] 0 0 360 506 883
mydata <- mydata %>%
rowwise() %>%
mutate(line_id = as.character(which.min(abs(y-unique(lines)))))
mydata %>%
group_by(Group, line_id) %>%
summarise(cnt =n()) %>%
group_by(Group) %>%
summarise(max_points = max(cnt))
#
# # A tibble: 3 x 2
# Group max_points
# <int> <dbl>
# 1 7059 4
# 2 11553 3
# 3 22782 45
mydata %>%
#filter(Group == 22782) %>%
ggplot(aes(x,y, color = line_id)) +
geom_point() +
geom_hline(yintercept = lines,
color = 'red',
#alpha = 0.5,
linetype ='dashed',
size = 0.3) +
facet_grid(.~Group)
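The "closest line" step can also be looked at in isolation. Below is a minimal base R sketch; the break values are copied from the getJenksBreaks output above with the duplicate 0 dropped, and the short y vector is only a hand-picked subset of the question's values.

```r
# Break values from the output above (duplicate 0 removed)
lines <- c(0, 360, 506, 883)

# A few y values taken from the question's data
y <- c(493, 482, 0, 498, 825, 284, 861, 883)

# For each point, the index of the line with the smallest absolute distance
nearest <- vapply(y, function(v) which.min(abs(v - lines)), integer(1))

# Number of points assigned to each line
table(factor(nearest, levels = seq_along(lines)))
```

Because every point is assigned to some line regardless of distance, a tolerance check would still be needed if far-away points should not count.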
I am trying to draw a barplot from a CSV file, but the plot shows all categories at the same height even though the frequencies vary from about 450 to 800.
Below is the plot I get:
https://imgur.com/9HZuiaK
I have tried setting height=x and width=x.
This removes the labels completely and does not fix the original problem.
setwd("~/Desktop")
causes<-read.csv('causes.csv')
head(causes)
table(causes$Intentional.self.harm..suicide)
barplot(table(causes$Intentional.self.harm..suicide))
barplot(table(causes$Intentional.self.harm..suicide), ylab='Frequency',
main='Barplot of Intentional self-harm (suicide)', col='lightblue')
dput(head(causes, 20))
Intentional.self.harm..suicide. = c(535L,
579L, 480L, 541L, 499L, 537L, 466L, 453L, 459L, 494L, 520L, 553L,
525L, 588L, 578L, 631L, 676L, 656L, 757L, 673L)
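I believe you are misusing the table function: your column already contains the counts, and table() counts how often each distinct value occurs. Since all your values are unique, every count is 1 and all bars get the same height. So this is neither a graphical nor a functional problem. All you have to do is pass the values directly, or compute the frequencies as follows: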
Intentional.self.harm..suicide. = c(535L,
579L, 480L, 541L, 499L, 537L, 466L, 453L, 459L, 494L, 520L, 553L,
525L, 588L, 578L, 631L, 676L, 656L, 757L, 673L)
df <- Intentional.self.harm..suicide.
barplot(height = df) # correct barplot with counts
df_prop <- table(df) # gives a table that count modalities (all unique ie 1)
str(df_prop)
# With a data.frame if you want to include labels
df_prop <- data.frame(
"type" = paste0("t", 1:20),
"freq" = df/sum(df) # alternatively use prop.table(df)
)
# sum(df_prop$freq) # to check
barplot(height = df_prop$freq) # same 'profile' as the first barplot
# --- EDIT / Follow up
# looking at documentation of barplot to set labels and ordinate limits
?barplot
barplot(height = df_prop$freq, names.arg = df_prop$type, ylim=c(0, max(df_prop$freq * 1.2)))
I am very new to R and to statistics in general, and I am having trouble adding multiple confidence ellipses to a PCA plot.
My interest is in highlighting potential groupings/clusters in the PCA plot with 95% confidence ellipses. I have tried using the dataEllipse function in R, however I cannot figure out how to add multiple ellipses with different centers to the PCA plot (the centers would be at various points that appear to contain a cluster, in this case lithic sources and lithic tools likely made from that source).
Thanks for any help with this!
lithic_final <- LITHIC.DATASHEET.FOR.R.COMPLETE.FORMAT
lithic_final
pca1 <- princomp(lithic_final); pca1
lithic_source <- c("A1", "A1", "A1", "A1", "A2","A2", "A2", "A3","A3","A3","B","B","B","B","B","B","C","C","C","C","C","C","C","D","D","D","D","D","D","D","D","E","E","E","E","E","E","E","E","F","F","G","G","G","G","H","H","H","H","H","H","H","I1","I1","I1","I2","I2","I2","I2","I2","J1","J1","J2","J2","J2","J2","J2","J2","J2","J2","J2","K","K","K","K","K","K","K","L","L","L","L","L","L","L","L","L","L","L","L","L","L","BB1","BB1","BB1","FC","FC","FC","JRPP","JRPP","JRPP","BB2","BB2","BB2","BB2","MWP","MWP","MWP","MWP","RPO","RPO","RPO")
lithic_source
summary(pca1)
plot(pca1)
#Plotting the scores with the Lithic Source Info
round(pca1$scores[,1:2], 2)
pca_scores <-round(pca1$scores[,1:2], 2)
plot(pca1$scores[,1], pca1$scores[,2], type="n")
text(pca1$scores[,1], pca1$scores[,2],labels=abbreviate(lithic_source, minlength=3), cex=.45)
#Plotting PCA Scores of EACH SAMPLE for PCA 2 and 3 with Lithic Source Info
round(pca1$scores[,2:3], 2)
pca2_3_scores <-round(pca1$scores[,2:3], 2)
plot(pca1$scores[,2], pca1$scores[,3], type="n")
text(pca1$scores[,2], pca1$scores[,3], labels=abbreviate(lithic_source, minlength=3), cex=.45)
#Plotting PCA Scores of EACH SAMPLE for PCA 3 and 4 with Lithic Source Info
round(pca1$scores[,3:4], 2)
pca3_4_scores <-round(pca1$scores[,3:4], 2)
plot(pca1$scores[,3], pca1$scores[,4], type="n")
text(pca1$scores[,3], pca1$scores[,4], labels=abbreviate(lithic_source, minlength=3), cex=.45)
#Plotting PCA Scores of EACH SAMPLE for PCA 1 and 3 with Lithic Source Info
round(pca1$scores[,1:3], 2)
pca1_3_scores <-round(pca1$scores[,1:3], 2)
plot(pca1$scores[,1], pca1$scores[,3], type="n")
text(pca1$scores[,1], pca1$scores[,3], labels=abbreviate(lithic_source, minlength=3), cex=.45)
#Plotting PCA Scores of EACH SAMPLE for PCA 1 and 4 with Lithic Source Info
round(pca1$scores[,1:4], 2)
pca1_4_scores <-round(pca1$scores[,1:4], 2)
plot(pca1$scores[,1], pca1$scores[,4], type="n")
text(pca1$scores[,1], pca1$scores[,4], labels=abbreviate(lithic_source, minlength=3), cex=.45)
#TRYING TO GET ELLIPSES ADDED TO PCA 1 and 4 scores
dataEllipse(pca1$scores[,1], pca1$scores[,4],centers=12,add=TRUE,levels=0.9, plot.points=FALSE)
structure(list(Ca.K12 = c(418L, 392L, 341L, 251L, 297L, 238L,
258L, 5L, 2L, 37L), Cr.K12 = c(1L, 12L, 15L, 6L, 9L, 6L, 35L,
7L, 45L, 32L), Cu.K12 = c(89L, 96L, 81L, 63L, 88L, 103L, 104L,
118L, 121L, 90L), Fe.K12 = c(18627L, 18849L, 18413L, 12893L,
17757L, 17270L, 16198L, 2750L, 4026L, 3373L), K.K12 = c(20L,
23L, 28L, 0L, 34L, 17L, 45L, 102L, 150L, 147L), Mn.K12 = c(205L,
212L, 235L, 120L, 216L, 212L, 246L, 121L, 155L, 115L), Nb.K12 = c(139L,
119L, 154L, 91L, 122L, 137L, 137L, 428L, 414L, 428L), Rb.K12 = c(99L,
42L, 79L, 49L, 210L, 243L, 168L, 689L, 767L, 705L), Sr.K12 = c(3509L,
3766L, 3481L, 2715L, 2851L, 2668L, 2695L, 202L, 220L, 217L),
Ti.K12 = c(444L, 520L, 431L, 293L, 542L, 622L, 531L, 82L,
129L, 84L), Y.K12 = c(135L, 121L, 105L, 74L, 144L, 79L, 85L,
301L, 326L, 379L), Zn.K12 = c(131L, 133L, 108L, 78L, 124L,
111L, 114L, 81L, 78L, 59L), Zr.K12 = c(1348L, 1479L, 1333L,
964L, 1506L, 1257L, 1296L, 3967L, 4697L, 4427L)), .Names = c("Ca.K12",
"Cr.K12", "Cu.K12", "Fe.K12", "K.K12", "Mn.K12", "Nb.K12", "Rb.K12",
"Sr.K12", "Ti.K12", "Y.K12", "Zn.K12", "Zr.K12"), row.names = c(NA,
10L), class = "data.frame")
I think you would have received a speedier reply if you had focused on your question instead of all the extraneous material. You gave us your commands for plotting a number of principal components that have nothing to do with your question. The question is: how do you plot ellipses by group? Your sample data, with 10 rows and three groups, is not helpful because three points per group are not enough to plot data ellipses. You are using the dataEllipse function in package car, which also provides the simplest answer to your question.
First, a reproducible example:
set.seed(42) # so you can get the same numbers I get
source_a <- data.frame(X1=rnorm(25, 50, 5), X2=rnorm(25, 40, 5))
source_b <- data.frame(X1=rnorm(25, 20, 5), X2=rnorm(25, 40, 5))
source_c <- data.frame(X1=rnorm(25, 35, 5), X2=rnorm(25, 25, 5))
lithic_dat <- rbind(source_a, source_b, source_c)
lithic_source <- c(rep("a", 25), rep("b", 25), rep("c", 25))
Plot ellipses with scatterplot() and add text:
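library(car)  # provides scatterplot() and dataEllipse()
# Newer versions of car spell these arguments regLine=FALSE and ellipse=list(levels=.9)
scatterplot(X2~X1 | lithic_source, data=lithic_dat, pch="", smooth=FALSE,
reg.line=FALSE, ellipse=TRUE, levels=.9)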
text(lithic_dat$X1, lithic_dat$X2, lithic_source, cex=.75)
Scatterplot can be tweaked to do everything you want, but it is also
possible to plot the ellipses without using it:
sources <- unique(lithic_source) # vector of the different sources
plot(lithic_dat$X1, lithic_dat$X2, type="n")  # empty frame; y axis must be X2
text(lithic_dat$X1, lithic_dat$X2, lithic_source, cex=.75)
for (i in sources) with(lithic_dat, dataEllipse(X1[lithic_source==i],
X2[lithic_source==i], levels=.9, plot.points=FALSE))
This will work for your principal components and any other data.
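To make clearer what a per-group ellipse is, here is a rough base R sketch that draws one without car. The helper ellipse_outline is a made-up name, and the radius uses a chi-square quantile, which assumes roughly bivariate normal data; dataEllipse may compute its radius differently, so the outlines need not match exactly.

```r
set.seed(42)
dat <- data.frame(
  X1  = c(rnorm(25, 50, 5), rnorm(25, 20, 5)),
  X2  = c(rnorm(25, 40, 5), rnorm(25, 40, 5)),
  src = rep(c("a", "b"), each = 25)
)

# Hypothetical helper: outline of the ellipse at the given coverage level
ellipse_outline <- function(x, y, level = 0.9, n = 100) {
  centre <- c(mean(x), mean(y))
  S      <- cov(cbind(x, y))
  radius <- sqrt(qchisq(level, df = 2))   # ASSUMPTION: chi-square radius
  angles <- seq(0, 2 * pi, length.out = n)
  circle <- rbind(cos(angles), sin(angles))
  # t(chol(S)) maps the unit circle onto the covariance ellipse
  t(centre + radius * t(chol(S)) %*% circle)
}

plot(dat$X1, dat$X2, type = "n", xlab = "X1", ylab = "X2")
text(dat$X1, dat$X2, dat$src, cex = 0.75)
for (s in unique(dat$src)) {
  keep <- dat$src == s
  lines(ellipse_outline(dat$X1[keep], dat$X2[keep], level = 0.9))
}
```

The loop mirrors the one above: subset the coordinates by source and draw one outline per subset.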
Here is a simple solution using a package called ggbiplot (available on github) with Iris data. I hope this is what you were looking for.
library(devtools);install_github('vqv/ggbiplot')
library(ggbiplot)
pca = prcomp(iris[,1:4])
ggbiplot(pca,groups = iris$Species,ellipse = T,ellipse.prob = .95)
I want to create a barplot and my data is in a csv file in the following format
0,22
40,50
80,62
120,70
160,62
200,49
240,52
280,64
320,57
360,50
400,47
440,52
480,73
520,70
560,68
600,71
640,69
680,61
720,59
760,59
800,62
840,62
880,62
920,72
960,81
1000,89
1040,86
1080,76
1120,80
1160,95
The element before the comma should be the position on the x axis, and the element after the comma the height of the bar at that position. I can do this in Excel, but the data set is large.
The graph I want would look like this.
I have tried the following but I think it sums the data in each row.
data <- as.matrix(read.csv(file="data.csv",sep=",",header=FALSE))
barplot(data)
barplot(x$V2, names.arg = seq_len(nrow(x)), cex.names = .6)
Two things. First, if you supply the whole matrix to the height parameter of barplot, it draws one stacked bar per column, so the values in each column are summed. Instead, give it only the column that holds your bar heights.
dput(dat)
structure(c(0L, 40L, 80L, 120L, 160L, 200L, 240L, 280L, 320L,
360L, 400L, 440L, 480L, 520L, 560L, 600L, 640L, 680L, 720L, 760L,
800L, 840L, 880L, 920L, 960L, 1000L, 1040L, 1080L, 1120L, 1160L,
22L, 50L, 62L, 70L, 62L, 49L, 52L, 64L, 57L, 50L, 47L, 52L, 73L,
70L, 68L, 71L, 69L, 61L, 59L, 59L, 62L, 62L, 62L, 72L, 81L, 89L,
86L, 76L, 80L, 95L), .Dim = c(30L, 2L), .Dimnames = list(NULL,
c("V1", "V2")))
barplot(height=dat[,2])
Second, you need to supply names.arg to barplot to get the labeling:
barplot(height=dat[,2], names.arg=dat[,1])
A side note: it's best to avoid naming variables after built-in R functions. data is probably the most commonly overwritten one! I regularly use dat instead.
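Put together, a minimal self-contained sketch of the two points might look like this. It reads the first five rows of the question's data from an inline string instead of data.csv, so the names csv_text and mids are only illustrative.

```r
# First five rows of the question's file, inlined so the example runs anywhere
csv_text <- "0,22
40,50
80,62
120,70
160,62"

dat <- read.csv(text = csv_text, header = FALSE)  # columns are named V1 and V2

# Heights from the second column, axis labels from the first
mids <- barplot(height = dat$V2, names.arg = dat$V1,
                xlab = "position", ylab = "height")
```

barplot() returns the midpoints of the bars, which is handy if you later want to add text or points above them.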
Using your method of getting the data into R:
myData <- read.csv(file = "data.csv", sep = ",", header = FALSE)
To make sure that the order of the bars follows the order of the values in the first column (although this is not strictly what you asked for in your question):
myData2 <- myData[order(myData[, 1]), ]
barplot(myData2[, 2], names.arg = myData2[, 1])
For tweaking the graph, I recommend spending some time reading ?barplot and ?par