ggplot: How to create a legend using row and column names (R)

I have a data frame, and I want to use the row values of its first two columns together with the variable column names to build the legend.
The data in df are grouped by letters from a to h. In particular, I want to combine the values in the AC and AR columns with the variable names DQ0 to DQ2 and show them in the legend in that format:
something like 78_256_DQ0, 78_256_DQ1 and 78_256_DQ2 for data group a,
and likewise for the remaining letters in df.
My reproducible df looks like this:
df <- do.call(rbind,lapply(1,function(x){
  AC <- as.character(rep(rep(c(78,110),each=10),times=3))
  AR <- as.character(rep(rep(c(256,320,384),each=20),times=1))
  V <- rep(c(seq(2,40,length.out=5),seq(-2,-40,length.out=5)),times=2)
  # uniform draws between 0.001 and 1, sorted
  DQ0 = sort(replicate(6, runif(10,0.001,1)))
  DQ1 = sort(replicate(6, runif(10,0.001,1)))
  DQ2 = sort(replicate(6, runif(10,0.001,1)))
  No = c(replicate(1,rep(letters[1:6],each=10)))
  data.frame(AC,AR,V,DQ0,DQ1,DQ2,No)
}))
head(df)
AC AR V DQ0 DQ1 DQ2 No
1 78 256 2.0 0.003944916 0.00902776 0.00228837 a
2 78 256 11.5 0.006629239 0.01739512 0.01649540 a
3 78 256 21.0 0.048515226 0.02034436 0.04525160 a
4 78 256 30.5 0.079483625 0.04346118 0.04778420 a
5 78 256 40.0 0.099462310 0.04430493 0.05086738 a
6 78 256 -2.0 0.103686255 0.04440260 0.09931459 a
*****************************************************
library(reshape2)
df_new <- melt(df,id=c("V","No"),measure=c("DQ0","DQ1","DQ2"))
library(ggplot2)
ggplot(df_new,aes(y=value,x=V,group=No,colour=No))+
geom_point()+
geom_line()
UPDATE
After #...'s answer I made a little progress. The solution is only partially OK, because of what happens when we melt with names:
df$names <- interaction(df$AC,df$AR,names(df)[4:6])
df_new <- melt(df,id=c("V","No","names"),measure=c("DQ0","DQ1","DQ2"))
This command plots 4 rows for each group a to h.
The output looks like this:
head(df)
AC AR V DQ0 DQ1 DQ2 No names
1 78 256 2.0 0.002576547 0.04294134 0.008302918 a 78.256.DQ0
2 78 256 11.5 0.010150299 0.04570650 0.011749370 a 78.256.DQ1
3 78 256 21.0 0.012540026 0.06977744 0.013887357 a 78.256.DQ2
4 78 256 30.5 0.036532977 0.11460343 0.071172301 a 78.256.DQ0
5 78 256 40.0 0.042801967 0.11518191 0.073756228 a 78.256.DQ1
6 78 256 -2.0 0.043275144 0.13033194 0.076569977 a 78.256.DQ2
**************************************************************
and with a modified plot command:
ggplot(df_new,aes(y=value,x=V,lty=variable,colour=names))+
geom_point()+
geom_line()
The output format I would prefer is one where I can refer to all rows of DQ0, DQ1 and DQ2 within each group. Any suggestions?

You can use df$names <- interaction(df$AC,df$AR,"DQ0") and then also set names as an id in your melt command. Later, use color=names in your aes function.
This adds a column called names that holds the combination of the given columns. You can also set sep='_' if you prefer that over the default '.'.
If you now use this column for colouring, you will get those labels as legend names.

Finally, I found a way using gather from tidyr.
library(tidyr)
df_gather <- df %>% gather(DQ, value,-No, -AC, -AR, -V)
and using the interaction function from #drmariod's answer:
df_gather$names <- interaction(df_gather$AC,df_gather$AR,df_gather$DQ)
And here is the result of this question :)
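For reference, here is a minimal sketch of the whole approach described above, assuming tidyr and ggplot2 are installed. It uses pivot_longer (the newer tidyr replacement for gather) and passes sep = "_" to interaction so the labels read 78_256_DQ0. The set.seed call and the simplified data construction are assumptions for illustration, not part of the original data.

```r
library(tidyr)
library(ggplot2)

set.seed(42)  # assumption: seeded only so the sketch is reproducible

# Same layout as the df in the question: 6 groups (a-f) of 10 rows each
df <- data.frame(
  AC  = rep(c("78", "110"), each = 10, times = 3),
  AR  = rep(c("256", "320", "384"), each = 20),
  V   = rep(c(seq(2, 40, length.out = 5), seq(-2, -40, length.out = 5)), times = 6),
  DQ0 = sort(runif(60, 0.001, 1)),
  DQ1 = sort(runif(60, 0.001, 1)),
  DQ2 = sort(runif(60, 0.001, 1)),
  No  = rep(letters[1:6], each = 10)
)

# Long format: one row per observation and DQ variable
df_long <- pivot_longer(df, cols = c("DQ0", "DQ1", "DQ2"),
                        names_to = "DQ", values_to = "value")

# Legend label such as "78_256_DQ0"; drop = TRUE removes unused combinations
df_long$names <- interaction(df_long$AC, df_long$AR, df_long$DQ,
                             sep = "_", drop = TRUE)

# One colour (and one legend entry) per AC_AR_DQ combination
p <- ggplot(df_long, aes(x = V, y = value, colour = names)) +
  geom_point() +
  geom_line()
p
```

Because AC and AR are constant within each letter group in this data, every group ends up with exactly three legend entries, one per DQ variable.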

Related

Merging specific rows by summing certain columns on grouping variables

The following dataframe is a subset of a bigger df, which contains duplicated information
df<-data.frame(Caught=c(92,134,92,134),
Discarded=c(49,47,49,47),
Units=c(170,170,220,220),
Hours=c(72,72,72,72),
Colour=c("red","red","red","red"))
In base R, I would like to get the following:
df_result<-data.frame(Caught=226,
Discarded=96,
Units=390,
Hours=72,
Colour="red")
So basically the result is the sum of the unique values in the columns Caught, Discarded and Units, while Hours and Colour keep their single value (Caught=92+134, Discarded=49+47, Units=170+220, Hours=72, Colour="red").
However, I intend to do this in a much bigger data.frame with several columns. My idea was to apply a function based on column names as:
l <- lapply(df, function(x) {
if(names(x) %in% c("Caught","Discarded","Units"))
sum(unique(x))
else
unique(x)
})
as.data.frame(l)
However, this does not work, as I am not sure how to access the column names when using lapply() and similar functions.
I have tried without success to use the by() and apply() functions.
Thanks
Since you are asking for base R:
l <- lapply( df, function(n) {
if( is.numeric(n) )
sum( unique(n) )
else
unique( n )
})
as.data.frame(l)
This solution takes advantage of the fact that data.frames are really just lists of vectors.
It produces this:
# Caught Discarded Units Hours Colour
# 226 96 390 72 red
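If the choice really has to depend on the column name rather than the column type, one possible variation is to iterate over names(df) instead of over df itself. Inside lapply(df, ...) the function only receives the bare column vector, so names(x) is NULL there. The sketch below is an assumption about what the question was aiming for; sum_cols is a hypothetical helper name.

```r
df <- data.frame(Caught    = c(92, 134, 92, 134),
                 Discarded = c(49, 47, 49, 47),
                 Units     = c(170, 170, 220, 220),
                 Hours     = c(72, 72, 72, 72),
                 Colour    = c("red", "red", "red", "red"))

# Hypothetical helper: names of the columns whose unique values are summed
sum_cols <- c("Caught", "Discarded", "Units")

# Iterate over the column names so each name is available inside the function;
# simplify = FALSE keeps a list that is named after the columns
l <- sapply(names(df), function(nm) {
  x <- df[[nm]]
  if (nm %in% sum_cols) sum(unique(x)) else unique(x)
}, simplify = FALSE)

df_result <- as.data.frame(l)
df_result
```

This only yields a one-row result when every column outside sum_cols holds a single unique value, as Hours and Colour do here.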
A proposition:
df <-data.frame(Caught=c(92,134,92,134),
Discarded=c(49,47,49,47),
Units=c(170,170,220,220),
Hours=c(72,72,72,72),
Colour=c("red","red","red","red"))
df
#> Caught Discarded Units Hours Colour
#> 1 92 49 170 72 red
#> 2 134 47 170 72 red
#> 3 92 49 220 72 red
#> 4 134 47 220 72 red
df_results <- data.frame(Caught = sum(unique(df$Caught)),
Discarded = sum(unique(df$Discarded)),
Units = sum(unique(df$Units)),
Hours = unique(df$Hours),
Colour = unique(df$Colour))
df_results
#> Caught Discarded Units Hours Colour
#> 1 226 96 390 72 red
# Created on 2021-02-23 by the reprex package (v0.3.0.9001)

Accessing the values by their row name and column name, instead of numbers

I have a table with multiple columns and rows. I want to access each value by its column name and row name, and make a plot with these values.
The table looks like this with 101 columns:
IDs     Exam1  Exam2  Exam3  Exam4  ....  Exam100
Ellie   12            48     33           64
Kate    98     34     21     76
Joe     22     53     49                 72
Van     77            40     12
Xavier                       88           92
What I want is to be able to look up the mark for a given row (IDs) and a given column (exams), like this:
table[Ellie,Exam3] --> 48
table[Ellie,Exam100] --> 64
table[Ellie,Exam2] --> (empty)
Then, with these numbers, I want to see the distribution of Ellie's marks, comparing Exam2, Exam3 and Exam100 with the rest of the exams.
I have almost figured out this part in R:
library(data.table)
library(ggplot2)
pdf("distirbution_given_row.pdf")
selectedvalues <- c(table[Ellie,Exam3] ,table[Ellie,Exam100])
library(plyr)
cdat <- ddply(selectedvalues, "IDs", summarise, exams.mean=mean(exams))
selectedvaluesggplot <- ggplot(selectedvalues, aes(x=IDs, colour=exams)) + geom_density() + geom_vline(data=cdat, aes(xintercept=exams.mean, colour=IDs), linetype="dashed", size=1)
dev.off()
This should show Ellie's marks for the exams of interest versus the rest of her marks (a blank should not be treated as zero; it stays blank).
Red: marks for Exam3, Exam100 and Exam2. Blue: marks for the remaining 97 exams.
(The code and the plot are adapted from a ggplot2 example at this link.)
All ideas are appreciated!
To access your data, you can at least do the following:
library(dplyr) # provides %>% and select()
df=data.frame(IDs=c("Ellie","Kate","Joe","Van","Xavier"),Exam1=c(12,98,22,77,NA),Exam2=c(NA,34,53,NA,NA),
Exam3=c(48,21,49,40,NA),Exam4=c(33,76,NA,12,88))
row.names(df)=df$IDs
df=df%>%select(-IDs)
df['Joe','Exam2']
# [1] 53
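The same lookup also works without any packages. The following base R sketch shows name-based indexing on the small example above, including how a blank mark shows up as NA rather than zero. The variable names joe_exam2, ellie_exam2 and ellie_marks are made up for illustration.

```r
df <- data.frame(
  IDs   = c("Ellie", "Kate", "Joe", "Van", "Xavier"),
  Exam1 = c(12, 98, 22, 77, NA),
  Exam2 = c(NA, 34, 53, NA, NA),
  Exam3 = c(48, 21, 49, 40, NA),
  Exam4 = c(33, 76, NA, 12, 88)
)

# Use the IDs as row names, then drop the now redundant column
rownames(df) <- df$IDs
df$IDs <- NULL

# Single cell by row name and column name
joe_exam2   <- df["Joe", "Exam2"]
ellie_exam2 <- df["Ellie", "Exam2"]   # blank mark, stored as NA

# Several exams for one person, as a named numeric vector
ellie_marks <- unlist(df["Ellie", c("Exam1", "Exam3")])

# na.rm = TRUE skips blanks instead of treating them as zero
mean(ellie_marks, na.rm = TRUE)
```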
I have also prepared an example with randomly generated numbers to illustrate what you could do. First, let us create an example data frame:
df=as.data.frame(matrix(rnorm(505,50,10),ncol=101))
colnames(df)=c("IDs",paste0("Exam",as.character(1:100)))
df$IDs=c("Ellie","Kate","Joe","Van","Xavier")
To work with ggplot it is recommended to convert it to long format:
library(tidyr) # provides gather()
df0=df%>%gather(key="exams",value="score",-IDs)
From here on you can play with your variables as desired. For instance plotting the density of the score per ID:
ggplot(df0, aes(x=score,col=IDs)) + geom_density()
or selecting only Exams 2,3,100 and plotting density for different exams
df0=df0%>%filter(exams=="Exam2"|exams=="Exam3"|exams=="Exam100")
ggplot(df0, aes(x=score,col=exams)) + geom_density()
If I understand correctly, you want to plot, for each ID, the selected exams against all other exams. Consider the following steps:
Reshape your data to long format, replacing NAs with zero if needed.
Run by() to subset the data by IDs and build the mean aggregate data and the ggplots.
Within by(), create a SelectValues indicator column for the selected exams, then plot the densities with a dashed vertical line at each group mean.
Data
txt = 'IDs Exam1 Exam2 Exam3 Exam4 Exam100
Ellie 12 NA 48 33 64
Kate 98 34 21 76 NA
Joe 22 53 49 NA 72
Van 77 NA 40 12 NA
Xavier NA NA NA 88 92'
exams_df <- read.table(text=txt, header = TRUE)
# ADD OTHER EXAM COLUMNS (SEEDED FOR REPRODUCIBILITY)
set.seed(444)
exams_df[paste0("Exam", 5:99)] <- replicate(99-4, sample(100, 5))
Reshape and Graph
library(ggplot2) # ONLY PACKAGE NEEDED
# FILL NA
exams_df[is.na(exams_df)] <- 0
# RESHAPE (BASE R VERSION)
exams_long_df <- reshape(exams_df,
timevar = "Exam",
times = names(exams_df)[grep("Exam", names(exams_df))],
v.names = "Score",
varying = names(exams_df)[grep("Exam", names(exams_df))],
new.row.names = 1:1000,
direction = "long")
# GRAPH BY EACH ID
by(exams_long_df, exams_long_df$IDs, FUN=function(df) {
df$SelectValues <- ifelse(df$Exam %in% c("Exam1", "Exam3", "Exam100"), "Select Exams", "All Else")
cdat <- aggregate(Score ~ SelectValues, df, FUN=mean)
ggplot(df, aes(Score, colour=SelectValues)) +
geom_density() + xlim(-50, 120) +
labs(title=paste(df$IDs[[1]], "Density Plot of Scores"), x ="Exam Score", y = "Density") +
geom_vline(data=cdat, aes(xintercept=Score, colour=SelectValues), linetype="dashed", size=1)
})
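The question asks that blank marks stay blank rather than count as zero. As a variation on the steps above, the base R sketch below keeps the NAs and lets aggregate() drop them through its default na.action. It uses only the five exam columns shown in the question and the selection Exam2, Exam3 and Exam100 from the question text. The names exam_cols, exams_long and ellie are made up for illustration.

```r
txt <- "IDs Exam1 Exam2 Exam3 Exam4 Exam100
Ellie 12 NA 48 33 64
Kate 98 34 21 76 NA
Joe 22 53 49 NA 72
Van 77 NA 40 12 NA
Xavier NA NA NA 88 92"
exams_df <- read.table(text = txt, header = TRUE)

# Long format built by hand: columns are stacked one after another
exam_cols <- grep("^Exam", names(exams_df), value = TRUE)
exams_long <- data.frame(
  IDs   = rep(exams_df$IDs, times = length(exam_cols)),
  Exam  = rep(exam_cols, each = nrow(exams_df)),
  Score = unlist(exams_df[exam_cols], use.names = FALSE)
)

# Indicator for the exams of interest
exams_long$SelectValues <- ifelse(
  exams_long$Exam %in% c("Exam2", "Exam3", "Exam100"),
  "Select Exams", "All Else"
)

# Group means for one person; rows with an NA Score are dropped, not zeroed
ellie <- subset(exams_long, IDs == "Ellie")
cdat  <- aggregate(Score ~ SelectValues, ellie, FUN = mean)
cdat
```

With the zero fill, Ellie's blank Exam2 would pull the mean of the selected exams down; here it is simply left out of the average.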

Adding Legend in R using row names

I have a data frame, and I want to pass the row values of its first two columns plus the variable names to the legend.
The data in df are grouped by letters from a to h.
What I want to achieve is something like 78_256_DQ0_a,
78_256_DQ1_a and 78_256_DQ2_a as the legend entries for group a, and so on for the other groups.
I don't know how to pass this format to ggplot.
Any help will be appreciated.
Let's say I have a data frame like this:
df <- do.call(rbind,lapply(1,function(x){
  AC <- as.character(rep(rep(c(78,110),each=10),times=3))
  AR <- as.character(rep(rep(c(256,320,384),each=20),times=1))
  state <- rep(rep(c("Group 1","Group 2"),each=5),times=6)
  V <- rep(c(seq(2,40,length.out=5),seq(-2,-40,length.out=5)),times=2)
  # uniform draws between 0.001 and 1, sorted
  DQ0 = sort(replicate(6, runif(10,0.001,1)))
  DQ1 = sort(replicate(6, runif(10,0.001,1)))
  DQ2 = sort(replicate(6, runif(10,0.001,1)))
  No = c(replicate(1,rep(letters[1:6],each=10)))
  data.frame(AC,AR,V,DQ0,DQ1,DQ2,No)
}))
head(df)
AC AR V DQ0 DQ1 DQ2 No
1 78 256 2.0 0.003944916 0.00902776 0.00228837 a
2 78 256 11.5 0.006629239 0.01739512 0.01649540 a
3 78 256 21.0 0.048515226 0.02034436 0.04525160 a
4 78 256 30.5 0.079483625 0.04346118 0.04778420 a
5 78 256 40.0 0.099462310 0.04430493 0.05086738 a
6 78 256 -2.0 0.103686255 0.04440260 0.09931459 a
*****************************************************
This is the code for plotting df:
library(reshape2)
df_new <- melt(df,id=c("V","No"),measure=c("DQ0","DQ1","DQ2"))
library(ggplot2)
ggplot(df_new,aes(y=value,x=V,group=No,colour=No))+
geom_point()+
geom_line()
Adding lty = variable to your aesthetics, like so:
ggplot(df_new, aes(y = value, x = V, lty = variable, colour = No)) +
geom_point() +
geom_line()
will give you separate lines for DQ0, DQ1, and DQ2.
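To get legend entries in the requested 78_256_DQ0_a format, one option is to keep AC and AR as id variables when melting and then paste the pieces together. This is a sketch under the assumption that reshape2 and ggplot2 are installed. The label column name, the set.seed call and the simplified data construction are assumptions for illustration.

```r
library(reshape2)
library(ggplot2)

set.seed(1)  # assumption: seeded only so the sketch is reproducible

# Same layout as the df in the question: 6 groups (a-f) of 10 rows each
df <- data.frame(
  AC  = rep(c("78", "110"), each = 10, times = 3),
  AR  = rep(c("256", "320", "384"), each = 20),
  V   = rep(c(seq(2, 40, length.out = 5), seq(-2, -40, length.out = 5)), times = 6),
  DQ0 = sort(runif(60, 0.001, 1)),
  DQ1 = sort(runif(60, 0.001, 1)),
  DQ2 = sort(runif(60, 0.001, 1)),
  No  = rep(letters[1:6], each = 10)
)

# Keep AC and AR as id variables so they survive the melt
df_new <- melt(df, id.vars = c("AC", "AR", "V", "No"),
               measure.vars = c("DQ0", "DQ1", "DQ2"))

# Hypothetical label column, e.g. "78_256_DQ0_a"
df_new$label <- paste(df_new$AC, df_new$AR, df_new$variable, df_new$No, sep = "_")

# Colour by the combined label; line type still separates DQ0, DQ1 and DQ2
p <- ggplot(df_new, aes(x = V, y = value, colour = label, lty = variable)) +
  geom_point() +
  geom_line()
p
```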

Creating Heatmaps in R

I want to create a heatmap using R.
Here is what my dataset looks like:
sortCC
Genus Location Number propn
86 Flavobacterium CC 580 0.3081827843
130 Algoriphagus CC 569 0.3023379384
88 Joostella CC 175 0.0929861849
215 Paracoccus CC 122 0.0648246546
31 Leifsonia CC 48 0.0255047821
sortN
Genus Location Number propn
119 Niastella N 316 0.08205661
206 Aminobacter N 252 0.06543755
51 Nocardioides N 222 0.05764736
121 Niabella N 205 0.05323293
257 Pseudorhodoferax??? N 193 0.05011685
149 Pedobacter N 175 0.04544274
Here is the code I have so far:
row.names(sortCC) <- sortCC$Genus
sortCC_matrix <- data.matrix(sortCC)
sortCC_heatmap <- heatmap(sortCC_matrix, Rowv=NA, Colv=NA, col = cm.colors(256), scale="column", margins=c(5,10))
I was going to generate two separate heatmaps, but the result of the code above looked wrong.
Questions: 1) Is it possible to combine the two data sets, since they share the same kind of Genus column and differ only in location, number and proportion? 2) If it is not possible to combine the two, how do I exclude the Location column from the heatmap?
Any suggestions will be much appreciated! Thanks!
Since you have the same columns, you can bind your data.frames and use facets to tell them apart. Here is a solution based on ggplot2:
dat <- rbind(sortCC,sortN)
library(ggplot2)
ggplot(dat, aes(y = factor(Number),x = factor(Genus))) +
geom_tile(aes(fill = propn)) +
theme_bw() +
theme(axis.text.x=element_text(angle=90)) +
facet_grid(Location~.)
To remove the extra column, you can use subset:
subset(dat,select=-c(Location))
If you still want to merge the data by Genus, you can do this, for example:
sortCC <- subset(sortCC,select=-c(Location))
sortN <- subset(sortN,select=-c(Location))
merge(sortCC,sortN,by='Genus')
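As for why the original heatmap() call looked wrong: data.matrix() converts character and factor columns to integer codes, so Genus and Location were plotted as if they were measurements. A minimal base R sketch that passes only the numeric columns is shown below. The data frame is retyped from the sortCC printout in the question, and num_cols is a hypothetical helper name.

```r
# Retyped from the sortCC printout in the question
sortCC <- data.frame(
  Genus    = c("Flavobacterium", "Algoriphagus", "Joostella",
               "Paracoccus", "Leifsonia"),
  Location = "CC",
  Number   = c(580, 569, 175, 122, 48),
  propn    = c(0.3081827843, 0.3023379384, 0.0929861849,
               0.0648246546, 0.0255047821)
)

# Hypothetical helper: the columns that actually hold measurements
num_cols <- c("Number", "propn")

# Build the matrix from the numeric columns only and label rows by genus
sortCC_matrix <- data.matrix(sortCC[, num_cols])
rownames(sortCC_matrix) <- sortCC$Genus

heatmap(sortCC_matrix, Rowv = NA, Colv = NA,
        col = cm.colors(256), scale = "column", margins = c(5, 10))
```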

How to column bind and row bind a large number of data frames in R?

I have a large data set of vehicles. They were recorded every 0.1 seconds, so their IDs repeat in the Vehicle ID column. In total there are 2169 vehicles. I filtered the 'Vehicle velocity' column for every vehicle (using a for loop), which produced a new column with the first and last 30 values removed (per vehicle). To bind it to the original data frame, I also removed the first and last 30 rows of the table and then combined them using cbind(). This works only for the last vehicle. I want this smoothing and column binding for all vehicles, and finally I want to combine the data frames of all vehicles into one single table, that is, row binding in the order of the vehicle IDs. This is what I have written so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep=' ', header=F)
head(traj1)
names (traj1)<-c('Vehicle ID', 'Frame ID','Total Frames', 'Global Time','Local X', 'Local Y', 'Global X','Global Y','Vehicle Length','Vehicle width','Vehicle class','Vehicle velocity','Vehicle acceleration','Lane','Preceding Vehicle ID','Following Vehicle ID','Spacing','Headway')
# TIME COLUMN
Time <- sapply(traj1$'Frame ID', function(x) x/10)
traj1$'Time' <- Time
# SMOOTHING VELOCITY
smooth <- function (x, D, delta){
z <- exp(-abs(-D:D/delta))
r <- convolve (x, z, type='filter')/convolve(rep(1, length(x)),z,type='filter')
r
}
for (i in unique(traj1$'Vehicle ID')){
veh <- subset (traj1, traj1$'Vehicle ID'==i)
svel <- smooth(veh$'Vehicle velocity',30,10)
svel <- data.frame(svel)
veh <- head(tail(veh, -30), -30)
fta <- cbind(veh,svel)
}
'fta' now only holds the data frame for the last vehicle, but I want the data frames for all vehicles 'i' combined by row. Maybe a for loop is not the right way to do it, but I don't know how to use tapply (or any other apply function) to do so many things at the same time.
EDIT
I can't reproduce my data set here, but the 'Orange' data set in R provides a good analogy. Using the same smoothing function, the for loop would look like this (if the 'age' column is smoothed and the 'Tree' column is equivalent to my 'Vehicle ID' column):
for (i in unique(Orange$Tree)){
  tre <- subset (Orange, Orange$'Tree'==i)
  age2 <- round(smooth(tre$age,2,0.67),digits=2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb <- cbind(tre,age2)
}
Umair, I am not sure I understood what you want.
If I understood right, you want to combine all the results by row. To do that you could save all the results in a list and then do.call an rbind:
comb <- list() ### create list to save the results
length(comb) <- length(unique(Orange$Tree))
##Your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))){
tre <- subset (Orange, Tree==unique(Orange$Tree)[i])
age2 <- round(smooth(tre$age,2,0.67),digits=2)
age2 <- data.frame(age2)
tre <- head(tail(tre, -2), -2)
comb[[i]] <- cbind(tre,age2) ### save results in the list
}
final.data<-do.call("rbind", comb) ### combine all results by row
This will give you:
Tree age circumference age2
3 1 664 87 687.88
4 1 1004 115 982.66
5 1 1231 120 1211.49
10 2 664 111 687.88
11 2 1004 156 982.66
12 2 1231 172 1211.49
17 3 664 75 687.88
18 3 1004 108 982.66
19 3 1231 115 1211.49
24 4 664 112 687.88
25 4 1004 167 982.66
26 4 1231 179 1211.49
31 5 664 81 687.88
32 5 1004 125 982.66
33 5 1231 142 1211.49
Just for fun, a different way to do it using plyr::ddply and sapply with split:
library(plyr)
data<-ddply(Orange, .(Tree), tail, n=-2)
data<-ddply(data, .(Tree), head, n=-2)
data<- cbind(data,
age2=matrix(sapply(split(Orange$age, Orange$Tree), smooth, D=2, delta=0.67), ncol=1, byrow=FALSE))
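The same split, apply and combine pattern can also be written in base R without a loop or plyr. This is only a sketch on the Orange analogy: smooth_one is a hypothetical helper name, and split() returns the pieces in the order of the factor levels of Tree, so the row order may differ from the loop version.

```r
# Smoothing function from the question (note: it masks stats::smooth)
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D / delta))
  convolve(x, z, type = "filter") / convolve(rep(1, length(x)), z, type = "filter")
}

# Hypothetical helper: smooth one tree and trim the rows the filter cannot cover
smooth_one <- function(tre) {
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  cbind(head(tail(tre, -2), -2), age2 = age2)
}

# Split by tree, process each piece, then row-bind the results
pieces <- lapply(split(Orange, Orange$Tree), smooth_one)
final.data <- do.call(rbind, pieces)
rownames(final.data) <- NULL
final.data
```

For the vehicle data the same idea would apply with the 'Vehicle ID' column as the split variable and D = 30, as in the original loop.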
