I have a csv that looks similar to the following.
Library Parameter1 Parameter2 Parameter3
A 3 6 4
A 4 6 3
A 7 8 9
B 2 10 7
B 4 4 5
B 3 5 4
C 4 6 4
C 6 3 12
C 5 6 8
I would like to write a function that creates a histogram for a specific library and parameter, e.g. a histogram of the frequency of Parameter2 in Library B.
I kind of know how to use the histogram function; here's what I have right now.
### x = "Parameter"
histogram <- function(x) {hist(filename[[x]], main = "Normalized",
xlab = "x", ylab = "Frequency", breaks = ceiling(sqrt(nrow(filename))))}
Edit: This is the actual data frame I am working with. It is quite large, so I couldn't put the dput output in here:
https://www.dropbox.com/s/2ivbhc7wyqms0fy/All-Norm.csv?dl=0
(Sorry if I've done anything incorrectly, still very new.)
One solution would be to subset your data first:
sub <- subset(yourdata, Library == "B")$Parameter2
hist(sub)   # plot the subset vector directly
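To get closer to the single function the question asks for, here is a minimal sketch of a function that takes the data frame plus a library name and a parameter name (the function and argument names are just illustrative, not from the question):

# Histogram of one parameter within one library.
# dat: a data frame with a "Library" column and numeric parameter columns,
#      as in the sample data above.
plot_param_hist <- function(dat, lib, param) {
  vals <- dat[dat$Library == lib, param]
  hist(vals,
       main = paste("Library", lib, "-", param),
       xlab = param, ylab = "Frequency",
       breaks = ceiling(sqrt(length(vals))))
}

# e.g. the frequency of Parameter2 in Library B:
# plot_param_hist(yourdata, "B", "Parameter2")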
This is a really simple ggplot I just put together
Code:
dat <- data.frame(Library = c("A","A","A","B","B","B","C","C","C"),
                  Parameter1 = c(3,4,7,2,4,3,4,6,5),
                  Parameter2 = c(6,6,8,10,4,5,6,3,6),
                  Parameter3 = c(4,3,9,7,5,4,4,12,8))
dat <- data.table::melt(dat, id.vars = "Library")
library(ggplot2)
ggplot(dat, aes(x = value)) + geom_histogram() + facet_grid(Library ~ variable)
Output: a 3x3 grid of histograms, one panel per Library and Parameter combination.
Obviously this could be cleaned up a lot, but this is a place to start.
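For instance, a lightly cleaned-up sketch, continuing from the dat and library(ggplot2) calls above (the bin count and labels here are arbitrary choices):

ggplot(dat, aes(x = value)) +
  geom_histogram(bins = 10, colour = "black", fill = "grey70") +
  facet_grid(Library ~ variable) +
  labs(x = "Value", y = "Frequency")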
Related
I am struggling with creating multiple ggplots using a loop.
I use data in the following format:
a <- c(1,2,3,4)
b <- c(5,6,7,8)
c <- c(9,10,11,12)
d <- c(13,14,15,16)
time <- c(1,2,3,4)
data <- cbind(a,b,c,d,time)
What I want to create is a list of plots that plot one of the letters against the variable time.
Which I tried in the following way:
library(ggplot2)
library(gridExtra)
plots <- list()
for (i in 1:4){
plots[[i]] <- ggplot() + geom_line(data = data, aes(x = time, y = data[,i]))
}
grid.arrange(plots[[1]], plots[[2]], plots[[3]], plots[[4]])
This results in four times the fourth plot. How do I index this correctly in a way that creates the four intended plots?
(Up front: the reason your plots are all identical is ggplot's "lazy" evaluation of code. As discussed below, data[,i] is only evaluated when you try to plot, at which point i is 4, the value from the last pass of the for loop.)
It's generally preferred/recommended to use data.frames instead of matrices or vectors (as you're doing here). It gives a bit more power and control.
data <- data.frame(a,b,c,d,time)
Also, I tend to prefer lapply to for-loops and lists, for various (some subjective) reasons. Ultimately, the issue you're having is that ggplot2 is evaluating the data lazily, so plots is a list with four plots that make reference to i ... and that is realized when you try to plot them all, at which point i is 4 (from the last pass through the loop). One benefit of using lapply is that the i referenced is a local-only (inside of the anon-func) version of i that is preserved as you would expect.
plots <- lapply(names(data)[1:4],
function(nm) ggplot(data, aes(x = time, y = .data[[nm]])) + geom_line())
gridExtra::grid.arrange(plots[[1]], plots[[2]])
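For completeness, the original for-loop can also be made to work by building each plot's data explicitly inside the loop, so that nothing is left referring to i after the loop ends. A sketch, assuming the data.frame version of data from above:

library(ggplot2)
plots <- list()
for (i in 1:4) {
  # Build a small data frame now, so the plot no longer depends on i.
  d <- data.frame(time = data$time, value = data[[i]])
  plots[[i]] <- ggplot(d, aes(x = time, y = value)) + geom_line()
}
gridExtra::grid.arrange(plots[[1]], plots[[2]], plots[[3]], plots[[4]])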
I also prefer patchwork to gridExtra, mostly because it makes more-customized layouts a bit more intuitive, plus adds functionality such as axis-alignment, shared legends, shared titles, etc. (None of those other features are demonstrated here.)
library(patchwork)
plots[[1]] / plots[[2]] # same arrangement as the grid.arrange call above
plots[[1]] + plots[[2]] # side-by-side instead of top/bottom
(plots[[1]] + plots[[2]]) / (plots[[3]] + plots[[4]]) # grid
Ultimately, though, I suggest that facets can be useful and very powerful. For this, we need to melt/pivot the data into a "long" format so that the column names a-d are actually in one column.
reshape2::melt(data, id.vars = "time") |>
ggplot(aes(time, value)) +
geom_line() +
facet_grid(variable ~ ., scales = "free_y")
I assumed the preference for independent (free) y-scales, ergo the scales="free_y". Try it without if you want to see the options. (There are also scales="free_x" and scales="free" (both).)
To see what I mean by "long" format:
reshape2::melt(data, id.vars = "time")
#    time variable value
# 1     1        a     1
# 2     2        a     2
# 3     3        a     3
# 4     4        a     4
# 5     1        b     5
# 6     2        b     6
# 7     3        b     7
# 8     4        b     8
# 9     1        c     9
# 10    2        c    10
# 11    3        c    11
# 12    4        c    12
# 13    1        d    13
# 14    2        d    14
# 15    3        d    15
# 16    4        d    16
This can also be done with tidyr::pivot_longer(data, -time), although the variable column is then called name. For this use there is no advantage of reshape2::melt over tidyr::pivot_longer or vice versa; the latter supports significantly more complex pivoting, which is not relevant with this data.
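For example, a sketch of the same faceted plot with tidyr (assuming tidyr is installed; the measured-variable column is called name rather than variable):

library(tidyr)
library(ggplot2)

pivot_longer(data, -time) |>
  ggplot(aes(time, value)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y")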
Data
data <- structure(list(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8), c = c(9, 10, 11, 12), d = c(13, 14, 15, 16), time = c(1, 2, 3, 4)), class = "data.frame", row.names = c(NA, -4L))
I would like to have an equivalent of the Excel function "if". It seems basic enough, but I could not find relevant help.
I would like to assign "NA" to specific cells if two consecutive cells in a different column are not identical. In Excel, the formula would be the following (say in C1): if(A1 = A2, B1, "NA"). I would then just expand it to the rest of the column.
But in R, I am stuck!
Here is an equivalent of my R code so far.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"))
df
To get the following Type of each Type in another column, I found a useful function on StackOverflow that does the job.
# determines the following Type of each Type
shift <- function(x, n){
c(x[-(seq(n))], rep(6, n))
}
df$TypeFoll <- shift(df$Type, 1)
df
Now, I would like to keep TypeFoll in a specific row when the File for this row is identical to the File on the next row.
Here is what I tried. It failed!
for(i in 1:length(df$File)){
df$TypeFoll2 <- ifelse(df$File[i] == df$File[i+1], df$TypeFoll, "NA")
}
df
In the end, my data frame should look like:
aim = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"),
TypeFoll = c("2","3","4","4","5","6"),
TypeFoll2 = c("2","NA","4","4","NA","6"))
aim
Oh, and by the way, if someone knows how to easily put the columns TypeFoll and TypeFoll2 just after the column Type, that would be great!
Thanks in advance
I would do it as follows (not keeping the result from the shift function)
df = data.frame(Type = c("1","2","3","4","4","5"),
                File = c("A","A","B","B","B","C"), stringsAsFactors = FALSE)

# This is your shift function
len <- nrow(df)
A1 <- df$File[1:(len-1)]
A2 <- df$File[2:len]
# Why do you save the result of the shift function in the df?
Then assign the equivalent of if(A1 = A2, B1, "NA"). As akrun mentioned, ifelse is vectorised. (By the way, this is also how you append a column to a data.frame:)
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[-1], NA), 6)  # Type[-1] is the "following" Type; why 6, though?
Since the 6 is hardcoded there, something like:
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[-1], NA), max(as.numeric(df$Type)) + 1)
Is more generic.
First off, 'for' loops are pretty slow in R, so try to think of this as vector manipulation instead.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"));
Create shifted Type and File values and put them in new columns:
df$TypeFoll = c(as.character(df$Type[2:nrow(df)]), "NA");
df$FileFoll = c(as.character(df$File[2:nrow(df)]), "NA");
Now, df looks like this:
> df
  Type File TypeFoll FileFoll
1    1    A        2        A
2    2    A        3        B
3    3    B        4        B
4    4    B        4        B
5    4    B        5        C
6    5    C       NA       NA
Then, create TypeFoll2 by combining these:
df$TypeFoll2 = ifelse(df$File == df$FileFoll, df$TypeFoll, "NA");
And you should have something that looks a lot like what you want:
> df;
  Type File TypeFoll FileFoll TypeFoll2
1    1    A        2        A         2
2    2    A        3        B        NA
3    3    B        4        B         4
4    4    B        4        B         4
5    4    B        5        C        NA
6    5    C       NA       NA        NA
If you want to remove the FileFoll column:
df$FileFoll = NULL;
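As for the side question about putting TypeFoll and TypeFoll2 right after Type, a simple base-R sketch (using the column names from the question):

# Reorder the columns so TypeFoll and TypeFoll2 directly follow Type.
df <- df[, c("Type", "TypeFoll", "TypeFoll2", "File")]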
Using R package pheatmap to draw heatmaps. Is there a way to assign a color to NAs in the input matrix? It seems NA gets colored in white by default.
E.g.:
library(pheatmap)
m<- matrix(c(1:100), nrow= 10)
m[1,1]<- NA
m[10,10]<- NA
pheatmap(m, cluster_rows=FALSE, cluster_cols=FALSE)
Thanks
It is possible, but requires some hacking.
First of all let's see how pheatmap draws a heatmap. You can check that just by typing pheatmap in the console and scrolling through the output, or alternatively using edit(pheatmap).
You will find that colours are mapped using
mat = scale_colours(mat, col = color, breaks = breaks)
The scale_colours function seems to be an internal function of the pheatmap package, but we can check the source code using
getAnywhere(scale_colours)
Which gives
function (mat, col = rainbow(10), breaks = NA)
{
    mat = as.matrix(mat)
    return(matrix(scale_vec_colours(as.vector(mat), col = col,
        breaks = breaks), nrow(mat), ncol(mat), dimnames = list(rownames(mat),
        colnames(mat))))
}
Now we need to check scale_vec_colours, which turns out to be:
function (x, col = rainbow(10), breaks = NA)
{
    return(col[as.numeric(cut(x, breaks = breaks, include.lowest = T))])
}
So, essentially, pheatmap is using cut to decide which colours to use.
Let's try and see what cut does if there are NAs around:
as.numeric(cut(c(1:100, NA, NA), seq(0, 100, 10)))
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
[29] 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6
[57] 6 6 6 6 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9
[85] 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 NA NA
It returns NA! So, here's your issue!
Now, how do we get around it?
The easiest thing is to let pheatmap draw the heatmap, then overplot the NA values as we like.
Looking again at the pheatmap function you'll see it uses the grid package for plotting (see also this question: R - How do I add lines and text to pheatmap?)
So you can use grid.rect to add rectangles to the NA positions.
What I would do is find the coordinates of the heatmap border by trial and error, then work from there to plot the rectangles.
For instance:
library(pheatmap)
library(grid)   # grid.rect() and gpar() come from the grid package

m <- matrix(1:100, nrow = 10)
m[1, 1] <- NA
m[10, 10] <- NA
hmap <- pheatmap(m, cluster_rows = FALSE, cluster_cols = FALSE)

# These values were found by trial and error.
# They WILL be different on your system and will vary when you change
# the size of the output, so you may want to take that into account.
min.x <- 0.005
min.y <- 0.01
max.x <- 0.968
max.y <- 0.990
width <- 0.095
height <- 0.095

coord.x <- seq(min.x, max.x - width, length.out = ncol(m))
coord.y <- seq(max.y - height, min.y, length.out = nrow(m))

for (x in seq_along(coord.x))
{
  for (y in seq_along(coord.y))
  {
    # coord.x walks over columns and coord.y over rows, so test m[y, x]
    if (is.na(m[y, x]))
      grid.rect(coord.x[x], coord.y[y], just = c("left", "bottom"),
                width, height, gp = gpar(fill = "green"))
  }
}
A better solution would be to hack the code of pheatmap using the edit function and have it deal with NAs as you wish...
Actually, the question is easy now: the current pheatmap function has a parameter, na_col, for assigning a color to NA values. Example:
na_col = "grey90"
You can enable assigning a colour to NAs by using the development version of pheatmap from GitHub. You can install it using devtools:
# this part loads the dev pheatmap package from github
if (!require("devtools")) {
  install.packages("devtools", dependencies = TRUE)
  library(devtools)
}
install_github("raivokolde/pheatmap")
Now you can use the parameter "na_col" in the pheatmap function:
pheatmap(..., na_col = "grey", ...)
Edit: don't forget to load it afterwards with library(pheatmap). Once it is installed, you can treat it like any other installed package.
If you don't mind using heatmap.2 from gplots instead, there's a convenient na.color argument. Taking the example data m from above:
library(gplots)
heatmap.2(m, Rowv = F, Colv = F, trace = "none", na.color = "Green")
If you want the NAs to be grey, you can simply force the "NA" as double.
m[is.na(m)] <- as.double("NA")
pheatmap(m, cluster_rows=F, cluster_cols=F)
There is something I don't understand.
I have this data frame:
     Var1 Freq
1 2008-05    1
2 2008-07    7
3 2008-08    5
4 2008-09    3
I need to insert a row in the second position; for example, it would be:
2008-06 0
I followed this (Add a new row in specific place in a dataframe). First step: add an index column; second step: append the new row with an index number; then sort by the index.
df$ind <- seq_len(nrow(df))
df <- rbind(df,data.frame(Var1 = "2008-06", Freq = "0",ind=1.1))
df <- df[order(df$ind),]
OK, everything seems good. Although I don't know why a column called "row.names" has appeared, I get:
  row.names    Var1 Freq ind
1         1 2008-05    1   1
2         5 2008-06    0 1.1
3         2 2008-07    7   2
4         3 2008-08    5   3
5         4 2008-09    3   4
Now, I plot it, with ggplot2.
ggplot(df, aes(y = Freq, x = Var1)) + geom_bar()
Here we are. On the x axis, "2008-06" is placed at the end, after "2008-09" (i.e. with index 5). In other words, the data frame has not really been sorted, even though it appears to be.
Where am I wrong? Thanks for the help...
Try this:
df$Var1 <- factor(df$Var1, df$Var1[order(df$ind)])
If you want ggplot2 to order labels, you have to specify the ordering yourself.
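After that re-levelling, the plot itself needs stat = "identity" because a y value is supplied. A sketch (the as.numeric is there because the rbind with "0" may have coerced Freq to character):

library(ggplot2)

ggplot(df, aes(x = Var1, y = as.numeric(Freq))) +
  geom_bar(stat = "identity") +
  ylab("Freq")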
You might also want to look into converting Var1 to some sort of date class, then dispensing with the index variable altogether. This would make things clearer, I think. The zoo package actually has a nice class for representing months of a given year, and you could use this for Var1. For example:
library(zoo)
df$Var1 <- as.yearmon(df$Var1)
df <- rbind(df,data.frame(Var1 = as.yearmon("2008-06"), Freq = "0"))
Now you can just order your data frame by Var1 without having to worry about keeping an index:
> df[order(df$Var1), ]
      Var1 Freq
1 May 2008    1
5 Jun 2008    0
2 Jul 2008    7
3 Aug 2008    5
4 Sep 2008    3
A plot in ggplot2 will turn out as expected:
ggplot(df, aes(as.Date(Var1), Freq)) + geom_bar(stat="identity")
Though you do have to convert Var1 to Date, since ggplot2 doesn't understand yearmon objects.
It is because somewhere along the way you got a factor in the mix. This produces what you're after (without the rownames column):
df <- read.table(text=" Var1 Freq
1 2008-05 1
2 2008-07 7
3 2008-08 5
4 2008-09 3", header=TRUE, stringsAsFactors = FALSE)
df$ind <- seq_len(nrow(df))
df <- rbind(df,data.frame(Var1 = "2008-06", Freq = "0",ind=1.1, stringsAsFactors = FALSE))
df <- df[order(df$ind),]
ggplot(df, aes(y = Freq, x = Var1)) + geom_bar(stat = "identity")
Notice the stringsAsFactors = FALSE?
As far as the order goes, if you already have factors (as you do), you need to reorder the factor levels. If you want more detailed info, see this post.
In a previous question,
Convert table into matrix by column names
I want to use the same approach for a CSV file or a table already in R. Could you show me how to modify the first command line?
x <- read.table(textConnection('
models cores time
4 1 0.000365
4 2 0.000259
4 3 0.000239
4 4 0.000220
8 1 0.000259
8 2 0.000249
8 3 0.000251
8 4 0.000258'), header=TRUE)

library(reshape)
cast(x, models ~ cores)
Should I use the following for a data.csv file?
x <- read.csv(textConnection("data.csv"), header=TRUE)
Should I use the following for an R table named xyz?
x <- xyz(textConnection(xyz), header=TRUE)
Is it necessary to use textConnection in order to use the cast command?
Thank you.
Several years later...
read.table and its derivatives like read.csv now have a text argument, so you don't need to mess around with textConnections directly anymore.
read.table(text = "
x y z
1 1.9 'a'
2 0.6 'b'
", header = TRUE)
The main use for textConnection is when people who ask questions on SO just dump their data onscreen, rather than writing code to let answerers generate it themselves. For example,
Blah blah blah I'm stuck here is my data plz help omg
x y z
1 1.9 'a'
2 0.6 'b'
etc.
In this case you can copy the text from the screen and wrap it in a call to textConnection, like so:
the_data <- read.table(tc <- textConnection("x y z
1 1.9 'a'
2 0.6 'b'"), header = TRUE); close(tc)
It is much nicer when questioners provide code, like this:
the_data <- data.frame(x = 1:2, y = c(1.9, 0.6), z = letters[1:2])
When you are using your own data, you shouldn't ever need to use textConnection.
my_data <- read.csv("my data file.csv") should suffice.
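To tie this back to the original question: when the data live in a CSV file there is no textConnection involved at all. A sketch, assuming a comma-separated data.csv containing the models/cores/time columns shown earlier:

# Read the file directly, then cast it to wide format.
x <- read.csv("data.csv", header = TRUE)

library(reshape)
cast(x, models ~ cores)  # cast uses the remaining column (time) as the value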