Adding labels to outliers in a scatterplot


I currently have a matrix (table) like this that contains 6 women's height and weight:
V1 V2 V3 V4
1 Bella 161 60
2 Jessica 160 55
3 Indigo 179 72
4 Tina 165 54
5 Sofia 178 70
6 Fiona 163 51
On the scatterplot (height vs weight), I want to label the outliers with each woman's name. Is there a way to do this? I've tried
text(V4, V3, labels=V2)
but it doesn't seem to work.

The text function doesn't know where to get V4 and such.
A few options, sticking with base graphics:
text(dat$V4, dat$V3, labels = dat$V2)
with(dat, text(V4, V3, labels = V2))
text(V3 ~ V4, data = dat, labels = V2)
Demonstration:
plot(mpg ~ disp, data = mtcars, pch = 16, col = "gray90")
text(mpg ~ disp, data = mtcars[2:4,], labels = cyl)
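To label only the outliers, one option is to subset the data before calling text(). The following is a minimal sketch using the data from the question; the column names (name, height, weight) and the outlier rule (height above 175 and weight above 69) are assumptions chosen for illustration, so substitute your own definition.

```r
# Data from the question; the column names are illustrative choices
dat <- data.frame(
  name   = c("Bella", "Jessica", "Indigo", "Tina", "Sofia", "Fiona"),
  height = c(161, 160, 179, 165, 178, 163),
  weight = c(60, 55, 72, 54, 70, 51)
)

# Assumed outlier rule (replace with your own definition)
is_outlier <- dat$height > 175 & dat$weight > 69

# Weight on the x-axis, height on the y-axis, as in the question's text() call
plot(height ~ weight, data = dat, pch = 16)

# Label only the flagged points; pos = 2 places the text left of each point
with(dat[is_outlier, ], text(weight, height, labels = name, pos = 2))
```

Because the labels come from the same subset as the coordinates, each name stays attached to its own point.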

You can also try a ggplot2 approach. The code below uses your data, but you will have to define what counts as an outlier. Here is the approach:
library(ggplot2)
#Data
df <- data.frame(V1=1:6,
V2=c('Bella','Jessica','Indigo','Tina','Sofia','Fiona'),
V3=c(161,160,179,165,178,163),
V4=c(60,55,72,54,70,51),stringsAsFactors = FALSE)
We have to define the outlier. Let's say values greater than 175 in V3 and greater than 69 in V4 are outliers:
#Create label
df$label <- ifelse(df$V3>175 & df$V4>69,df$V2,NA)
Now we plot:
#Plot
ggplot(df,aes(x=V3,y=V4))+
geom_point()+
geom_text(aes(label=label),vjust=-0.5)
Also take note of the great suggestions from @r2evans. They are very clear!

Related

R Plot Bar graph transposed dataframe

I'm trying to plot the following data frame as a bar plot, where the values for the filteredprovince column are listed in a separate column (n).
Usually ggplot and the other plotting functions work on a horizontal data frame, and after several searches I have not found a way to plot this "transposed" version of the data frame.
Each cluster should form one group of bars, and within each cluster I would plot each filteredprovince based on the value of the n column.
Thank you for the support.
d <- read.table(text=
" cluster PROVINCIA n filteredprovince
1 1 08 765 08
2 1 28 665 28
3 1 41 440 41
4 1 11 437 11
5 1 46 276 46
6 1 18 229 18
7 1 35 181 other
8 1 29 170 other
9 1 33 165 other
10 1 38 153 other ", header=TRUE,stringsAsFactors = FALSE)
UPDATE
Thanks to the suggestion in the comments I almost achieved the desired format:
ggplot(tab_s, aes(x = cluster, y = n, fill = factor(filteredprovince))) + geom_col()
Is there any way to show percentages instead of frequencies on the y-axis?

If I understand correctly, you're trying to use geom_bar(), which gives you problems because it tries to count rows itself (like a histogram), but you have already done that summary.
(If you had provided the code you have tried so far, I would not have to guess.)
In that case you can use geom_col() instead.
ggplot(d, aes(x = filteredprovince, y = n, fill = factor(PROVINCIA))) + geom_col()
Alternatively, you can change the default stat of geom_bar() from "count" to "identity"
ggplot(d, aes(x = filteredprovince, y = n, fill = factor(PROVINCIA))) +
geom_bar(stat = "identity")
See this SO question for what a stat is
EDIT: Update in response to OP's update:
To display percentages, you will have to modify the data itself.
Just divide n by the sum of all n and multiply by 100.
d$percentage <- d$n / sum(d$n) * 100
ggplot(d, aes(x = cluster, y = percentage, fill = factor(filteredprovince))) + geom_col()
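As a hedged alternative, ggplot2 can do the rescaling itself: position = "fill" stacks each bar to a total of 1, and scales::percent formats the axis labels. The sketch below rebuilds a small version of the question's data; the column name pct_in_cluster is my own, and the per-cluster percentage via ave() is only needed if you want the numbers in the data as well.

```r
library(ggplot2)

# Small version of the question's data (province codes kept as strings)
d <- data.frame(
  cluster = 1,
  filteredprovince = c("08", "28", "41", "11", "46", "18",
                       "other", "other", "other", "other"),
  n = c(765, 665, 440, 437, 276, 229, 181, 170, 165, 153)
)

# Percentage of each row within its own cluster (sums to 100 per cluster)
d$pct_in_cluster <- ave(d$n, d$cluster, FUN = function(x) x / sum(x) * 100)

# Let ggplot2 rescale each bar to 100% and label the axis as percentages
p <- ggplot(d, aes(x = factor(cluster), y = n, fill = filteredprovince)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "cluster", y = "share of n")
p
```

With several clusters, dividing by the overall sum and dividing within each cluster give different numbers, so pick whichever matches the question you are asking.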

I'm not sure I perfectly understand, but if the problem is the orientation of your dataframe, you can transpose it with t(data) where data is your dataframe.
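One caveat worth checking before relying on t(): applied to a data frame, it returns a matrix, and if the columns have mixed types everything is coerced to character. A small sketch with made-up values in the shape of the question's data:

```r
# Two columns of different types: character codes and numeric counts
d <- data.frame(PROVINCIA = c("08", "28"), n = c(765, 665))

# t() converts the data frame to a matrix before transposing
m <- t(d)

class(m)         # a matrix, no longer a data frame
is.character(m)  # the numeric column has been coerced to strings
```

For plotting with ggplot2, keeping the data frame as it is and using geom_col() is usually the simpler route.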

Accessing the values by their rowname and columnname, instead of numbers

I have a table which has multiple columns and rows. I want to access each value by its column name and row name, and make a plot with these values.
The table looks like this with 101 columns:
IDs Exam1 Exam2 Exam3 Exam4 .... Exam100
Ellie 12 48 33 64
Kate 98 34 21 76
Joe 22 53 49 72
Van 77 40 12
Xavier 88 92
What I want is to be able to reach the marks for a given row (IDs) and a given column (exams) as:
table[Ellie,Exam3] --> 48
table[Ellie,Exam100] --> 64
table[Ellie,Exam2] --> (empty)
Then, with these numbers, I want to see the distribution of how Ellie did in Exam2, Exam3 and Exam100 compared with the rest of the exams.
I have almost figured out this part with R:
library(data.table)
library(ggplot2)
pdf("distirbution_given_row.pdf")
selectedvalues <- c(table[Ellie,Exam3] ,table[Ellie,Exam100])
library(plyr)
cdat <- ddply(selected values, "IDs", summarise, exams.mean=mean(exams))
selectedvaluesggplot <- ggplot(selectedvalues, aes(x=IDs, colour=exams)) + geom_density() + geom_vline(data=cdat, aes(xintercept=exams.mean, colour=IDs), linetype="dashed", size=1)
dev.off()
This should plot Ellie's marks for the exams of interest versus the rest of the marks (if a mark is blank, it should not be treated as zero; it is still a blank).
Red: Marks for Exam3, 100 and 2 , Blue: The marks for the remaining 97 exams
(The code and the plot are taken as an example of ggplot2 from this link.)
All ideas are appreciated!

To access your data, you can at least do the following:
library(dplyr) # provides %>% and select()
df=data.frame(IDs=c("Ellie","Kate","Joe","Van","Xavier"),Exam1=c(12,98,22,77,NA),Exam2=c(NA,34,53,NA,NA),
Exam3=c(48,21,49,40,NA),Exam4=c(33,76,NA,12,88))
row.names(df)=df$IDs
df=df%>%select(-IDs)
> df['Joe','Exam2']
[1] 53
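The same lookup can be sketched in base R alone by setting the row names when the data frame is created. The values are the ones used above; the object names scores and ellie_selected are my own.

```r
# Row names are set directly, so no IDs column has to be dropped afterwards
scores <- data.frame(
  Exam1 = c(12, 98, 22, 77, NA),
  Exam2 = c(NA, 34, 53, NA, NA),
  Exam3 = c(48, 21, 49, 40, NA),
  Exam4 = c(33, 76, NA, 12, 88),
  row.names = c("Ellie", "Kate", "Joe", "Van", "Xavier")
)

# Single value by row name and column name
scores["Joe", "Exam2"]

# A blank mark stays NA; it is not turned into zero
is.na(scores["Ellie", "Exam2"])

# Several exams for one student, as a named numeric vector
ellie_selected <- unlist(scores["Ellie", c("Exam3", "Exam4")])
ellie_selected
```

Note that the names must be quoted: scores[Ellie, Exam3] would look for R objects called Ellie and Exam3.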
I have also prepared an example with randomly generated numbers to illustrate what you could do. First, let us create an example data frame:
df=as.data.frame(matrix(rnorm(505,50,10),ncol=101))
colnames(df)=c("IDs",paste0("Exam",as.character(1:100)))
df$IDs=c("Ellie","Kate","Joe","Van","Xavier")
To work with ggplot it is recommended to convert the data to long format:
library(tidyr) # provides gather()
library(ggplot2)
df0=df%>%gather(key="exams",value="score",-IDs)
From here on you can play with your variables as desired. For instance plotting the density of the score per ID:
ggplot(df0, aes(x=score,col=IDs)) + geom_density()
or selecting only Exams 2,3,100 and plotting density for different exams
df0=df0%>%filter(exams=="Exam2"|exams=="Exam3"|exams=="Exam100")
ggplot(df0, aes(x=score,col=exams)) + geom_density()
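In current versions of tidyr, pivot_longer() is the recommended replacement for gather(). The sketch below reshapes the small five-student table, flags a set of selected exams, and keeps only one student's non-missing marks. The choice of Exam2 and Exam3 as "selected" and the names scores_long, selected and ellie are illustrative assumptions.

```r
library(tidyr)
library(ggplot2)

scores <- data.frame(
  IDs   = c("Ellie", "Kate", "Joe", "Van", "Xavier"),
  Exam1 = c(12, 98, 22, 77, NA),
  Exam2 = c(NA, 34, 53, NA, NA),
  Exam3 = c(48, 21, 49, 40, NA),
  Exam4 = c(33, 76, NA, 12, 88)
)

# One row per student and exam
scores_long <- pivot_longer(scores, cols = -IDs,
                            names_to = "exams", values_to = "score")

# Flag the exams of interest (assumed selection)
scores_long$selected <- ifelse(scores_long$exams %in% c("Exam2", "Exam3"),
                               "selected", "rest")

# Keep one student's marks and drop blanks instead of treating them as zero
ellie <- scores_long[scores_long$IDs == "Ellie" & !is.na(scores_long$score), ]

p <- ggplot(ellie, aes(x = exams, y = score, fill = selected)) + geom_col()
p
```

With the full 100-exam table, the same long data frame can feed geom_density() with colour = selected, as in the plots above.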

If I understand correctly, you want to plot each ID's selected exams against all other exams. Consider the following steps:
Reshape your data to long format, replacing NAs with zero if needed.
Run by() to subset the data by IDs, then build the aggregated means and the ggplots.
Within by(), create a SelectValues indicator column for the selected exams, then graph the densities with a dashed vertical line at each group mean.
Data
txt = 'IDs Exam1 Exam2 Exam3 Exam4 Exam100
Ellie 12 NA 48 33 64
Kate 98 34 21 76 NA
Joe 22 53 49 NA 72
Van 77 NA 40 12 NA
Xavier NA NA NA 88 92'
exams_df <- read.table(text=txt, header = TRUE)
# ADD OTHER EXAM COLUMNS (SEEDED FOR REPRODUCIBILITY)
set.seed(444)
exams_df[paste0("Exam", 5:99)] <- replicate(99-4, sample(100, 5))
Reshape and Graph
library(ggplot2) # ONLY PACKAGE NEEDED
# FILL NA
exams_df[is.na(exams_df)] <- 0
# RESHAPE (BASE R VERSION)
exams_long_df <- reshape(exams_df,
timevar = "Exam",
times = names(exams_df)[grep("Exam", names(exams_df))],
v.names = "Score",
varying = names(exams_df)[grep("Exam", names(exams_df))],
new.row.names = 1:1000,
direction = "long")
# GRAPH BY EACH ID
by(exams_long_df, exams_long_df$IDs, FUN=function(df) {
df$SelectValues <- ifelse(df$Exam %in% c("Exam1", "Exam3", "Exam100"), "Select Exams", "All Else")
cdat <- aggregate(Score ~ SelectValues, df, FUN=mean)
ggplot(df, aes(Score, colour=SelectValues)) +
geom_density() + xlim(-50, 120) +
labs(title=paste(df$IDs[[1]], "Density Plot of Scores"), x ="Exam Score", y = "Density") +
geom_vline(data=cdat, aes(xintercept=Score, colour=SelectValues), linetype="dashed", size=1)
})
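Since the question asks that blank marks not be treated as zero, it may be worth seeing how much the zero fill changes a mean. This is a small sketch using Ellie's five marks from the sample data; the variable names are my own.

```r
# Ellie's marks from the sample data, with the blank kept as NA
ellie <- c(Exam1 = 12, Exam2 = NA, Exam3 = 48, Exam4 = 33, Exam100 = 64)

# Mean over the exams she actually has a mark for
mean_ignoring_blanks <- mean(ellie, na.rm = TRUE)

# Mean after replacing the blank with zero, as in the FILL NA step
zero_filled <- ifelse(is.na(ellie), 0, ellie)
mean_zero_filled <- mean(zero_filled)

c(ignoring_blanks = mean_ignoring_blanks, zero_filled = mean_zero_filled)
```

If blanks should stay blank, one option is to skip the zero-fill line and pass na.rm = TRUE wherever a mean is computed; geom_density() drops missing values on its own, with a warning.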

Adding Legend in R using row names

I have a data frame, and I want to pass the values of its first two columns plus the variable names to the legend.
Inside df I have groups of data, labelled with letters from a to h.
What I want to achieve is something like 78_256_DQ0_a,
78_256_DQ1_a and 78_256_DQ2_a as legend entries for group a, and so on for the other groups.
I don't know how to pass this format to ggplot.
Any help will be appreciated.
Let's say I have a data frame like this:
df <- do.call(rbind,lapply(1,function(x){
AC <- as.character(rep(rep(c(78,110),each=10),times=3))
AR <- as.character(rep(rep(c(256,320,384),each=20),times=1))
state <- rep(rep(c("Group 1","Group 2"),each=5),times=6)
V <- rep(c(seq(2,40,length.out=5),seq(-2,-40,length.out=5)),times=2)
DQ0 = sort(replicate(6, runif(10,0.001:1)))
DQ1 = sort(replicate(6, runif(10,0.001:1)))
DQ2 = sort(replicate(6, runif(10,0.001:1)))
No = c(replicate(1,rep(letters[1:6],each=10)))
data.frame(AC,AR,V,DQ0,DQ1,DQ2,No)
}))
head(df)
AC AR V DQ0 DQ1 DQ2 No
1 78 256 2.0 0.003944916 0.00902776 0.00228837 a
2 78 256 11.5 0.006629239 0.01739512 0.01649540 a
3 78 256 21.0 0.048515226 0.02034436 0.04525160 a
4 78 256 30.5 0.079483625 0.04346118 0.04778420 a
5 78 256 40.0 0.099462310 0.04430493 0.05086738 a
6 78 256 -2.0 0.103686255 0.04440260 0.09931459 a
This is the code for plotting df:
library(reshape2)
df_new <- melt(df,id=c("V","No"),measure=c("DQ0","DQ1","DQ2"))
library(ggplot2)
ggplot(df_new,aes(y=value,x=V,group=No,colour=No))+
geom_point()+
geom_line()

Adding lty = variable to your aesthetics, like so:
ggplot(df_new, aes(y = value, x = V, lty = variable, colour = No)) +
geom_point() +
geom_line()
will give you separate lines for DQ0, DQ1, and DQ2.
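If the legend entries themselves should read like 78_256_DQ0_a, one possible approach is to keep AC and AR as id variables when melting and paste the pieces into a single label column. This is a sketch on a small made-up data frame with the same column names as the question; the label column name and the random values are assumptions.

```r
library(reshape2)
library(ggplot2)

set.seed(1)
# Small stand-in for the question's df (two groups, three points each)
df <- data.frame(
  AC  = "78",
  AR  = "256",
  V   = rep(c(2, 11.5, 21), times = 2),
  DQ0 = runif(6), DQ1 = runif(6), DQ2 = runif(6),
  No  = rep(c("a", "b"), each = 3)
)

# Keep AC and AR as id variables so they survive the melt
df_new <- melt(df, id.vars = c("AC", "AR", "V", "No"),
               measure.vars = c("DQ0", "DQ1", "DQ2"))

# Build legend labels such as "78_256_DQ0_a"
df_new$label <- paste(df_new$AC, df_new$AR, df_new$variable, df_new$No, sep = "_")

p <- ggplot(df_new, aes(x = V, y = value, colour = label, group = label)) +
  geom_point() +
  geom_line()
p
```

With many groups this produces a long legend, so combining colour = No with lty = variable as shown above may stay more readable.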

Correct parameters for the geom_line layer in ggplot2

I have a dataframe as such:
1 Pos like 77
2 Neg like 58
3 Pos make 44
4 Neg make 34
5 Pos movi 154
6 Neg movi 145
...
20 Neg will 45
I would like to produce a plot using the geom_text layer in ggplot2.
I have used this code
q <- ggplot(my_data_set, aes(x=value, y=value, label=variable))
q <- q + geom_text()
q
which produced this plot:
Obviously, this is not an ideal plot.
I would like to produce a plot similar, except I would like to have the Positive class on the x-axis, and the Negative class on the y-axis.
UPDATE: Here is an example of something I am attempting to emulate:
I can't seem to figure out the correct way to give the arguments to the geom_line layer.
What is the correct way to plot the value of the Positive arguments on the X-axis, and the value of the Negative arguments on the Y-axis, given the data frame I have?
Thanks for your attention.

my_data_set <- read.table(text = "
id variable value
Pos like 77
Neg like 58
Pos make 44
Neg make 34
Pos movi 154
Neg movi 145", header = TRUE)
library(data.table)
my_data_set <- as.data.frame(data.table(my_data_set)[, list(
Y = value[id == "Neg"],
X = value[id == "Pos"]),
by = variable])
library(ggplot2)
q <- ggplot(my_data_set, aes(x=X, y=Y, label=variable))
q <- q + geom_text()
q

This can also be easily done with reshape2 (with the same result as David Arenburg's answer):
df <- read.table(text = "id variable value
Pos like 77
Neg like 58
Pos make 44
Neg make 34
Pos movi 154
Neg movi 145", header = TRUE)
require(reshape2)
df2 <- dcast(df, variable ~ id, value.var="value")
library(ggplot2)
ggplot(df2, aes(x=Pos, y=Neg, label=variable)) +
geom_text()

Creating Heatmaps in R

I want to create a heatmap using R.
Here is what my dataset looks like:
sortCC
Genus Location Number propn
86 Flavobacterium CC 580 0.3081827843
130 Algoriphagus CC 569 0.3023379384
88 Joostella CC 175 0.0929861849
215 Paracoccus CC 122 0.0648246546
31 Leifsonia CC 48 0.0255047821
sortN
Genus Location Number propn
119 Niastella N 316 0.08205661
206 Aminobacter N 252 0.06543755
51 Nocardioides N 222 0.05764736
121 Niabella N 205 0.05323293
257 Pseudorhodoferax??? N 193 0.05011685
149 Pedobacter N 175 0.04544274
Here is the code I have so far:
row.names(sortCC) <- sortCC$Genus
sortCC_matrix <- data.matrix(sortCC)
sortCC_heatmap <- heatmap(sortCC_matrix, Rowv=NA, Colv=NA, col = cm.colors(256), scale="column", margins=c(5,10))
I was going to generate two separate heatmaps, but the output of the code above looked wrong.
Questions: 1) Is it possible to combine the two data sets, since they have the same genus and differ only in location, number and proportion? 2) If it is not possible to combine the two, how do I exclude the Location column from the heatmap?
Any suggestions will be much appreciated! Thanks!

Since you have the same columns, you can bind your data frames and use facets to differentiate them. Here is a solution based on ggplot2:
dat <- rbind(sortCC,sortN)
library(ggplot2)
ggplot(dat, aes(y = factor(Number),x = factor(Genus))) +
geom_tile(aes(fill = propn)) +
theme_bw() +
theme(axis.text.x=element_text(angle=90)) +
facet_grid(Location~.)
To remove the extra column, you can use subset:
subset(dat,select=-c(Location))
If you still want to merge the data by Genus, you can do this, for example:
sortCC <- subset(sortCC,select=-c(Location))
sortN <- subset(sortN,select=-c(Location))
merge(sortCC,sortN,by='Genus')
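For the base heatmap() call from the question, a likely reason it looked wrong is that data.matrix() turns the Genus and Location text columns into integer codes and plots them as if they were measurements. A minimal sketch that keeps only the numeric columns follows; the data are the sortCC rows from the question, and the helper names num_cols and m are my own.

```r
sortCC <- data.frame(
  Genus    = c("Flavobacterium", "Algoriphagus", "Joostella",
               "Paracoccus", "Leifsonia"),
  Location = "CC",
  Number   = c(580, 569, 175, 122, 48),
  propn    = c(0.3081827843, 0.3023379384, 0.0929861849,
               0.0648246546, 0.0255047821)
)

# Keep only the numeric columns, so Genus and Location are excluded
num_cols <- vapply(sortCC, is.numeric, logical(1))
m <- as.matrix(sortCC[, num_cols])

# Use the genus names as row labels instead of as a data column
rownames(m) <- sortCC$Genus

heatmap(m, Rowv = NA, Colv = NA, col = cm.colors(256),
        scale = "column", margins = c(5, 10))
```

Selecting columns by type rather than by name means the same lines should work for sortN without changes.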
