Error: colours encodes as numbers must be positive - r

I am trying to recreate this plot, but ggplot fails with an error that seems to be caused by the negative numbers in the data frame: Error: colours encodes as numbers must be positive. Does anyone know what the problem is? The data frames are very large, but I would not have expected that to matter.
## Load packages
library(tidyverse)
require(data.table)
## Read in data frames
m1<-fread("m1.csv", header = F)
m2<-fread("m2.csv", header = F)
L<-fread("l.csv", header = F)
LP<-fread("LP.csv", header = F)
## Get rate by taking m1 from m2
rate<-m1[1,]-m2[1,] ### subtract p1 rate from p2
## Transpose the data frame
t_rate <- transpose(rate)
## Create row ID's to merge data frames
L$row_num <- seq.int(nrow(L))
t_rate$row_num <- seq.int(nrow(t_rate))
all<-merge(L, t_rate, by = "row_num") ## merge the dataframes based on their ID
## Get rid of ID now we don't need it
all$row_num=NULL
## Plot the graph
ggplot(all,x=all$V1.x,y=all$V2,col=all$V1.y)+
geom_point(data=all,x=all$V1.x,y=all$V2,col=all$V1.y,size=0.1)+
geom_point(data=LP,x=LP$V1,y=LP$V2,size=1)
### Data (all)
structure(list(V1.x = c(163.75, 164.25, 164.75, 165.25, 165.75,
166.25), V2 = c(-75.25, -75.25, -75.25, -75.25, -75.25, -75.25
), V1.y = c(1.55995, 1.56093, 1.56237, 1.56545, 1.56764, 1.56827
)), class = c("data.table", "data.frame"), row.names = c(NA,
-6L), .internal.selfref = <pointer: 0x7f9bd4811ae0>)
## Data (LP)
structure(list(V1 = c(169.7, 147.93, 150.01, 146.71, 147.31,
-63.26), V2 = c(-46.47, -42.344, -36.59, -38.64, -43.3, 44.739
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x7f9bd4811ae0>)

The issue is that you did not map your columns to aesthetics inside aes(); instead you passed the vectors directly as arguments. Outside aes(), the colour (col) argument is used as-is, so it must contain colour names, colour codes, or positive numbers.
To fix this, map the columns to aesthetics like so:
library(ggplot2)
ggplot(all, aes(x = V1.x, y = V2)) +
geom_point(aes(color = V1.y), size = 0.1) +
geom_point(data = LP, aes(x = V1, y = V2), size = 1)
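To see what the mapping does, here is a minimal sketch. The demo data frame is a made-up stand-in for all, with a negative rate included on purpose. When a numeric column is mapped inside aes(), ggplot2 passes it through a continuous colour scale, so negative values are not a problem:

```r
library(ggplot2)

# Hypothetical stand-in for `all`; note the negative value in V1.y
demo <- data.frame(
  V1.x = c(163.75, 164.25, 164.75),
  V2   = c(-75.25, -75.25, -75.25),
  V1.y = c(-1.2, 0.3, 1.56)
)

p <- ggplot(demo, aes(x = V1.x, y = V2)) +
  geom_point(aes(color = V1.y))

# layer_data() returns the layer's data after the scales have been applied:
# each number has been converted to a hex colour string
mapped_colours <- layer_data(p, 1)$colour
print(mapped_colours)
```

The scale, not the raw number, decides the colour, which is why the mapped version works where col = all$V1.y does not.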

Related

Comparing "Unlimited" value to numerical values in ggplot

I am trying to make a visual comparison between an input vector and my database. However, the input vector or the database may contain the value "UL", which stands for an unlimited (infinite) amount. Think of it as an unlimited voice plan, with which you can make an unlimited number of calls.
Here is the code I have used to try to make a visual comparison between "UL" and other numerical values.
library(tidyverse) # provides gather(), mutate(), %>% and ggplot()
# d is the database data.frame, with which we want to compare the input vector
d = structure(list(Type = c("H1", "H2", "H3"),
P1 = c(2000L, 1500L, 1000L),
P2 = c(60L, 40L, 20L),
P3 = c("UL", 3000L, 2000L)),
class = "data.frame",
row.names = c(NA, -3L))
# d2 is the input vector
d2 = structure(list(Type = "New_offre", P1 = 1200L, P2 = "UL", P3 = 2000),
class = "data.frame",
row.names = c(NA, -1L))
#Check if there are some unlimited values in both d and d2
y1 <-rbind(d,d2)
y <- y1
if("UL" %in% y$P3){
max_P3_scale <- max(as.numeric(y[y$P3!="UL","P3"]))
y[y$P3=="UL","P3"]= 2*max_P3_scale
}
if("UL" %in% y$P2){
max_P2_scale <- max(as.numeric(y[y$P2!="UL","P2"]))
y[y$P2=="UL","P2"]= 2*max_P2_scale
}
y <- transform(y,P1=as.numeric(P1),
P2=as.numeric(P2),
P3=as.numeric(P3))
d <- y[1:nrow(d),]
d2<- y[nrow(d)+1,]
d %>% gather(var1, current, -Type) %>%
mutate(new = as.numeric(d2[cbind(rep(1, max(row_number())),
match(var1, names(d2)))]),
slope = factor(sign(current - new), -1:1)) %>%
gather(var2, val, -Type, -var1, -slope) %>%
ggplot(aes(x = factor(var2,levels = c("new","current")), y = val, group = 1)) +
geom_point(aes(fill = var2), shape = 2,size=4) +
geom_line(aes(colour = slope)) +
scale_colour_manual(values = c("green","green", "red")) +
facet_wrap(Type ~ var1,scales = "free")
My first attempt was to check whether there are any "UL" values in P2 and P3. If so, I find the maximum numeric value among the remaining entries and replace every "UL" with twice that maximum, so that "UL" always appears as the largest value in the plot.
The problem is that the plot no longer distinguishes actual values from "UL" ones.
Here is what my plot looks like with this approach.
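One possible way to keep actual values and "UL" values distinguishable is to record a flag at the moment of substitution. The sketch below is only an illustration in base R; replace_unlimited is a hypothetical helper, not part of any package, and it uses the same "twice the maximum" rule as the code above:

```r
# Hypothetical helper: replaces "UL" with twice the largest numeric value
# and returns a flag marking which entries were substituted
replace_unlimited <- function(x, marker = "UL") {
  is_ul <- x == marker
  num <- suppressWarnings(as.numeric(x))   # "UL" becomes NA here
  if (any(is_ul) && any(!is_ul)) {
    num[is_ul] <- 2 * max(num[!is_ul])
  }
  data.frame(value = num, unlimited = is_ul)
}

# P3 values from d and d2 combined, as character
p3 <- replace_unlimited(c("UL", "3000", "2000", "2000"))
print(p3)
```

The unlimited column could then be mapped to an aesthetic such as shape or colour, so that substituted points look different from measured ones.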

How to highlight excel cells in R

I have a bunch of data that I am looking through. In the past, I have used the openxlsx package to highlight entire rows. I want to step it up a bit and highlight specific cells. Here is a sample of the format of the data I am working with:
df <- structure(list(Name = c("ENSCAFG00000000019","ENSCAFG00000000052", "ENSCAFG00000000094","ENSCAFG00000000210"), baseMean = c(692.430970065448, 391.533849079888, 1223.74083601928, 280.477417588943), log2FoldChange = c("0.0819834415495699",
"-2.6249568393179099", "6.15181461329998", "0.23483770613468"
), lfcSE = c("0.247177913269579", "0.65059275393549898", "0.33371763683349598", "0.353449339778654"), stat = c("4.3773467751931898", "-4.0347157625707997",
"3.4514646101088902", "3.4936766522410099"), pvalue = c("1.20132758621478E-5", "5.4668435006169397E-5", "5.5755287106466398E-4", "4.7641767052765697E-4"), padj = c("9.8372077245438908E-4", "0.00004", "0.000006", "1.47480018315951E-2"), symbol = c("ZNF516", "CDH19", "LMAN1", "NA"), entrez = c("483930", "483948", "476186", "NA")), .Names = c("Names", "baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", "padj", "symbol", "entrez"), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
What I want to do is highlight cells in log2FoldChange that are either <= -1 or >= 1, and highlight cells in padj that are <= 0.05. Can this be done? I have read a lot about highlighting rows, but not about highlighting specific cells based on a condition.
This is roughly what I am hoping the data will look like. The log2FoldChange and padj values don't need to match the example.
Thanks in advance
Here is one example. Note, however, that all cells in column padj have values below 0.05.
library(openxlsx)
# note that some columns of df look numeric, but are character
df <- data.frame(
Name = c("ENSCAFG00000000019","ENSCAFG00000000052", "ENSCAFG00000000094","ENSCAFG00000000210"),
baseMean = c(692.430970065448, 391.533849079888, 1223.74083601928, 280.477417588943),
log2FoldChange = c(0.0819834415495699, -2.6249568393179099, 6.15181461329998, 0.23483770613468),
lfcSE = c(0.247177913269579, 0.65059275393549898, 0.33371763683349598, 0.353449339778654),
stat = c(4.3773467751931898, -4.0347157625707997, 3.4514646101088902, 3.4936766522410099),
pvalue = c(1.20132758621478E-5, 5.4668435006169397E-5, 5.5755287106466398E-4, 4.7641767052765697E-4),
padj = c(9.8372077245438908E-4, 0.00004, 0.000006, 1.47480018315951E-2),
symbol = c("ZNF516", "CDH19", "LMAN1", "NA"), entrez = c("483930", "483948", "476186", "NA"),
stringsAsFactors=FALSE
)
# write dataset
wb <- createWorkbook()
addWorksheet(wb, sheetName="df")
writeData(wb, sheet="df", x=df)
# define style
yellow_style <- createStyle(fgFill="#FFFF00")
# log2FoldChange
y <- which(colnames(df)=="log2FoldChange")
x <- which(abs(df$log2FoldChange)>=1)
addStyle(wb, sheet="df", style=yellow_style, rows=x+1, cols=y, gridExpand=TRUE) # +1 for header line
# padj
y <- which(colnames(df)=="padj")
x <- which(abs(df$padj)<=0.05)
addStyle(wb, sheet="df", style=yellow_style, rows=x+1, cols=y, gridExpand=TRUE) # +1 for header line
# write result
saveWorkbook(wb, "yellow.xlsx", overwrite=TRUE)
You may also want to have a look at BERT (Basic Excel R Toolkit), an add-in for running R from within Excel.
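The row arithmetic in the answer can be checked without writing a workbook. This is a small sketch using the values from the question: the worksheet row is the data row plus one, because row 1 of the sheet holds the header.

```r
log2FoldChange <- c(0.0819834415495699, -2.6249568393179099,
                    6.15181461329998, 0.23483770613468)
padj <- c(9.8372077245438908E-4, 0.00004, 0.000006, 1.47480018315951E-2)

# Data rows that meet each condition, shifted by one for the header row
lfc_sheet_rows  <- which(abs(log2FoldChange) >= 1) + 1
padj_sheet_rows <- which(padj <= 0.05) + 1

print(lfc_sheet_rows)   # sheet rows 3 and 4
print(padj_sheet_rows)  # sheet rows 2 to 5, i.e. every data row
```

Because every padj value in the sample is below 0.05, the whole padj column ends up highlighted, which matches the note at the start of the answer.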

Plot multiple rows as columns with ggplotly

I have the following data
dput(head(new_data))
structure(list(series = c("serie1", "serie2", "serie3",
"serie4"), Chr1_Coverage = c(0.99593043561, 0.995148711122,
0.996666194154, 1.00012127128), Chr2_Coverage = c(0.998909597935,
0.999350808049, 0.999696737431, 0.999091916132), Chr3_Coverage = c(1.0016871729,
1.00161108919, 0.997719609642, 0.999887319775), Chr4_Coverage = c(1.00238874787,
1.00024296426, 1.0032143002, 1.00118558895), Chr5_Coverage = c(1.00361001984,
1.00233184803, 1.00250793369, 1.00019989912), Chr6_Coverage = c(1.00145962318,
1.00085036645, 0.999767433622, 1.00018523387), Chr7_Coverage = c(1.00089620637,
1.00201715802, 1.00430458519, 1.00027257509), Chr8_Coverage = c(1.00130277775,
1.00332841536, 1.0027493578, 0.998107829176), Chr9_Coverage = c(0.998473062701,
0.999400379593, 1.00130178863, 0.9992796405), Chr10_Coverage = c(0.996508132358,
0.999973856701, 1.00180072957, 1.00172163916), Chr11_Coverage = c(1.00044015107,
0.998982489577, 1.00072330837, 0.998947935281), Chr12_Coverage = c(0.999707836898,
0.996654676531, 0.995380321719, 1.00116773966), Chr13_Coverage = c(1.00199118466,
0.99941499519, 0.999850500793, 0.999717689167), Chr14_Coverage = c(1.00133747054,
1.00232593477, 1.00059139379, 1.00233368187), Chr15_Coverage = c(0.997036875653,
1.0023727983, 1.00020943048, 1.00089130742), Chr16_Coverage = c(1.00527426537,
1.00318861724, 1.0004269482, 1.00471256502), Chr17_Coverage = c(0.995530811404,
0.995103514254, 0.995135851149, 0.99992196636), Chr18_Coverage = c(0.99893371568,
1.00452723685, 1.00006262572, 1.00418478844), Chr19_Coverage = c(1.00510422346,
1.00711968194, 1.00552123413, 1.00527171097), Chr20_Coverage = c(1.00113612137,
1.00130658886, 0.999390191542, 1.00178637085), Chr21_Coverage = c(1.00368753618,
1.00162782873, 1.00056883447, 0.999797571642), Chr22_Coverage = c(0.99677846234,
1.00168287612, 0.997645576841, 0.999297594524), ChrX_Coverage = c(1.04015901555,
0.934772492047, 0.98981339011, 0.999960536561), ChrY_Coverage = c(9.61374227868e-09,
2.50609172398e-07, 8.30448295172e-08, 1.23741398572e-08)), .Names = c("series",
"Chr1_Coverage", "Chr2_Coverage", "Chr3_Coverage", "Chr4_Coverage",
"Chr5_Coverage", "Chr6_Coverage", "Chr7_Coverage", "Chr8_Coverage",
"Chr9_Coverage", "Chr10_Coverage", "Chr11_Coverage", "Chr12_Coverage",
"Chr13_Coverage", "Chr14_Coverage", "Chr15_Coverage", "Chr16_Coverage",
"Chr17_Coverage", "Chr18_Coverage", "Chr19_Coverage", "Chr20_Coverage",
"Chr21_Coverage", "Chr22_Coverage", "ChrX_Coverage", "ChrY_Coverage"
), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
and I would like to plot it like this:
I thought of transposing the data from the second column onwards and naming the transposed columns after the first column of the original data, with the following code:
output$Plot_1 <- renderPlotly({
Plot_1_new_data[,2:24] <- lapply(Plot_1_new_data[,2:24], as.numeric)
# first remember the names
n <- as.data.frame(Plot_1_new_data[0:nrow(Plot_1_new_data),1])
# transpose all but the first column (name)
Plot_1_new_data_T <- as.data.frame(t(Plot_1_new_data[,-1]))
colnames(Plot_1_new_data_T) <- n
#plot data
library(reshape)
melt_Transposed_Plot_1_new_data <- melt(Plot_1_new_data_T,id="series")
ggplotly(melt_Transposed_Plot_1_new_data,aes(x=series,y=value,colour=variable,group=variable)) + geom_line()
})
However, when I inspect Plot_1_new_data_T, the first column is named c("serie1","serie2",..."serie14") and the remaining columns are named NA.
Any idea how to proceed? I am new to both R and shiny.
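Something like this?
library(reshape2) # melt(); the reshape package used in the question also provides it
library(ggplot2)
library(plotly)
x = new_data # assumption: x is the data frame from the question
xm = melt(x)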
ggplot(xm[xm$variable != 'ChrY_Coverage' & xm$variable != 'ChrX_Coverage', ],
aes(as.integer(variable), value, color=series)) +
geom_line() +
scale_x_continuous(breaks = as.integer(xm$variable),
labels = as.character(xm$variable)) +
theme(axis.text.x = element_text( angle=45, hjust = 1))
ggplotly()
Note that the last two columns were removed from this plot, because they are of such a different scale that including them masks any variation in the other columns. If you want to include all the columns, you could use this instead:
ggplot(xm, aes(as.integer(variable), value, color=series)) +
geom_line() +
...
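For readers unfamiliar with melt(), the long format it produces can be sketched in base R. The wide data frame below is a shortened, hypothetical version of the question's data with only two series and two coverage columns; it illustrates the shape of the result, not the exact implementation of melt().

```r
# Shortened stand-in for the question's data
wide <- data.frame(
  series = c("serie1", "serie2"),
  Chr1_Coverage = c(0.9959, 0.9951),
  Chr2_Coverage = c(0.9989, 0.9994),
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(wide), "series")

# One row per (series, column) pair; unlist() stacks the columns in order
long <- data.frame(
  series   = rep(wide$series, times = length(value_cols)),
  variable = factor(rep(value_cols, each = nrow(wide)), levels = value_cols),
  value    = unlist(wide[value_cols], use.names = FALSE)
)
print(long)
```

Keeping variable as a factor with the original column order is what makes as.integer(variable) usable as an x position in the plot above.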

Running correlation (gtools)

I am an R newbie, trying to do simple things.
I wanted to examine the running correlation between two time series (two CSV files).
Below is my code, after loading the gtools package:
v1<-read.csv("var1.csv", header = FALSE)
v2<-read.csv("var2.csv", header = FALSE)
running(v1,v2,fun=cor, width=5)
I receive the following error message:
named list()
Then I try again by assigning first a variable:
p1<-running(v1,v2,fun=cor, width=5)
plot(p1)
I receive the following error message:
Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' is a list, but
does not have components 'x' and 'y'
What am I missing?
How can I create a plot that shows the running correlation and the line that represents the 95% confidence interval?
Thanks!
v1 and v2 are as follows:
v1 = structure(list(var1 = c(-0.888829723, -0.638363898, -0.820331055, -0.711637919, -3.631666745, 0.528082315, -0.888551728, 3.670203445, -0.406498322, 1.185030346, 1.427746793, -0.393369446, 2.905055593, -0.401353407, -0.563123881, 1.140042632, 7.078661195, 2.556181809, 0.888551728, -3.670203445, 0.406498322, -1.185030346, -1.427746793, 0.393369446, -2.016225871, 1.039717305, 1.383454936, -0.428404714, -3.44699445, -3.084264124)), .Names = "var1", class = "data.frame", row.names = c(NA, -30L))
v2 = structure(list(var2 = c(0.008871463, -0.218818955, 1.055065334, 1.353131909, -1.021284981, -2.153524661, 1.825212612, 0.460388983, 1.48721711, -1.78249802, 0.46047233, -0.894777526, -0.852226438, 0.136373161, -0.248409748, -0.411183561, 0.912205699, -1.856740048, -1.825212612, -0.460388983, -1.48721711, 1.78249802, -0.46047233, 0.894777526, 0.843354976, 0.082445794, -0.806655586, -0.941948347, 0.109079282, 4.010264709)), .Names = "var2", class = "data.frame", row.names = c(NA, -30L))
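The key change is to pass numeric vectors to running() rather than one-column data frames. Let's say you want to run a 5-year rolling correlation and your period is 1988-2017 (30 years):
library(gtools)
# In practice the data would be read from the CSV files:
# v1<-read.csv("var1.csv", header = FALSE)
# v2<-read.csv("var2.csv", header = FALSE)
# Here the data posted in the question is used instead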
v1 = structure(list(var1 = c(-0.888829723, -0.638363898, -0.820331055, -0.711637919, -3.631666745, 0.528082315, -0.888551728, 3.670203445, -0.406498322, 1.185030346, 1.427746793, -0.393369446, 2.905055593, -0.401353407, -0.563123881, 1.140042632, 7.078661195, 2.556181809, 0.888551728, -3.670203445, 0.406498322, -1.185030346, -1.427746793, 0.393369446, -2.016225871, 1.039717305, 1.383454936, -0.428404714, -3.44699445, -3.084264124)), .Names = "var1", class = "data.frame", row.names = c(NA, -30L))
v1 = as.vector(v1$var1)
v2 = structure(list(var2 = c(0.008871463, -0.218818955, 1.055065334, 1.353131909, -1.021284981, -2.153524661, 1.825212612, 0.460388983, 1.48721711, -1.78249802, 0.46047233, -0.894777526, -0.852226438, 0.136373161, -0.248409748, -0.411183561, 0.912205699, -1.856740048, -1.825212612, -0.460388983, -1.48721711, 1.78249802, -0.46047233, 0.894777526, 0.843354976, 0.082445794, -0.806655586, -0.941948347, 0.109079282, 4.010264709)), .Names = "var2", class = "data.frame", row.names = c(NA, -30L))
v2 = as.vector(v2$var2)
rc <- running(v1, v2, fun = cor, width = 5)
length(rc)
plot((2017-length(rc) + 1):2017, rc, type="l")
This should give you the rolling correlation plot.
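For comparison, the same idea can be sketched in base R without gtools. rolling_cor is a hypothetical helper written for this illustration; it only evaluates full windows, which is assumed to match what running() does with its default settings.

```r
# Hypothetical helper: correlation over each full window of `width` points
rolling_cor <- function(x, y, width) {
  stopifnot(length(x) == length(y), width <= length(x))
  starts <- seq_len(length(x) - width + 1)
  vapply(starts, function(i) {
    idx <- i:(i + width - 1)
    cor(x[idx], y[idx])
  }, numeric(1))
}

# Toy data: y rises with x at first, then falls
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 2, 3, 3, 2, 1)
rc <- rolling_cor(x, y, width = 3)
print(rc)
```

With n points and a window of width w there are n - w + 1 values, so 30 points and width 5 give 26 correlations, which is why the x axis in the plot above is built from length(rc).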

'height' must be a vector or a matrix. barplot error

I am trying to create a simple bar chart, but I keep receiving the error message
'height' must be a vector or a matrix
The barplot function I have been trying is
barplot(data, xlab="Percentage", ylab="Proportion")
I have inputted my csv, and the data looks as follows:
34.88372093 0.00029997
35.07751938 0.00019998
35.27131783 0.00029997
35.46511628 0.00029997
35.65891473 0.00069993
35.85271318 0.00069993
36.04651163 0.00049995
36.24031008 0.0009999
36.43410853 0.00189981
...
Where am I going wrong here?
Thanks in advance!
EDIT:
dput(head(data)) outputs:
structure(list(V1 = c(34.88372093, 35.07751938, 35.27131783,
35.46511628, 35.65891473, 35.85271318), V2 = c(0.00029997, 0.00019998,
0.00029997, 0.00029997, 0.00069993, 0.00069993)), .Names = c("V1",
"V2"), row.names = c(NA, 6L), class = "data.frame")
and barplot(as.matrix(data)) produced a chart with all the data in one bar, as opposed to each value in a separate bar.
You can specify the two variables you want to plot rather than passing the whole data frame, like so:
data <- structure(list(V1 = c(34.88372093, 35.07751938, 35.27131783, 35.46511628, 35.65891473, 35.85271318),
V2 = c(0.00029997, 0.00019998, 0.00029997, 0.00029997, 0.00069993, 0.00069993)),
.Names = c("V1", "V2"), row.names = c(NA, 6L), class = "data.frame")
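# names.arg labels each bar with its V1 value; a second positional argument would be read as bar widths
barplot(data$V2, names.arg = data$V1, xlab="Percentage", ylab="Proportion")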
Alternatively, you can use ggplot to do this:
library(ggplot2)
ggplot(data, aes(x=V1, y=V2)) + geom_bar(stat="identity") +
labs(x="Percentage", y="Proportion")
The format of the data frame itself is probably the problem. The same thing happened to me when I created the columns individually and then combined them into a data frame. Building a matrix instead worked for me:
table.values = c(value1, value2,.......)
table = matrix(table.values,nrow=number of rows ,byrow = T)
colnames(table) = c("column1","column2",........)
row.names(table) = c("row1", "row2",............)
barplot(table, beside = T, xlab= "X-axis",ylab= "Y-axis")
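The reason for the error can be checked directly. In this sketch (using the first three rows from the question), the data frame is neither a plain vector nor a matrix, while a single column is a numeric vector that barplot() accepts:

```r
data <- data.frame(
  V1 = c(34.88372093, 35.07751938, 35.27131783),
  V2 = c(0.00029997, 0.00019998, 0.00029997)
)

# A data frame is neither a plain vector nor a matrix, hence the error
is.vector(data)   # FALSE
is.matrix(data)   # FALSE

# A single column is a numeric vector
heights <- data$V2
is.vector(heights)   # TRUE

# barplot() returns the x positions of the bar midpoints, one per bar
mids <- barplot(heights, names.arg = round(data$V1, 2),
                xlab = "Percentage", ylab = "Proportion")
length(mids)
```

The returned midpoints are useful if you later want to add text or points above each bar.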
