Convert CSV data to a matrix to a heatmap in R

I'm doing some visualizations for a paper I'm writing and am stuck transferring data from a CSV-loaded table to a matrix (so that I can plot a heatmap from it afterwards).
I'm doing this:
dta.tesiscsv<- read.csv("dtatesis.csv", header=TRUE)
to load a data sample that looks like this:
Col,Row,Kf
1,1,100
1,2,97.14285714
2,1,100
...,...,...
but am kind of lost on the next step (creating an empty matrix and transferring data from the table to it based on a formula):
X<- matrix(nrow= 48, ncol=12)
X[dta.test[,c(1:2)]] <- dta.test$Kf

You can use acast from reshape2 package to get the data in the matrix form you desire.
require(reshape2)
acast(dta.test, Row ~ Col, value.var = "Kf")
This'll fill missing values with NA. If you want to fill them, for example, with 0 instead, then,
acast(dta.test, Row ~ Col, value.var = "Kf", fill = 0)
would accomplish that. You can wrap this call in heatmap(.) to get the heatmap.

How about (which should make sense if there is one row per Col/Row-combination):
dta.tesiscsv <- read.table(text="Col,Row,Kf
1,1,100
1,2,97.14285714
2,1,100",h=T,sep=",")
X <- tapply(dta.tesiscsv[,3],dta.tesiscsv[,2:1],head,1)
heatmap(X)
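Note that head(..., 1) silently keeps only the first value whenever a Col/Row pair occurs more than once. A quick way to check for that before building the matrix, sketched here on made-up toy data (the data frame dta and its values are assumptions for illustration):

```r
# Toy data with one deliberately repeated Col/Row pair (values assumed)
dta <- data.frame(Col = c(1, 1, 2, 2),
                  Row = c(1, 2, 1, 1),
                  Kf  = c(100, 97.1, 100, 88))

# TRUE for every row whose Col/Row pair already appeared earlier
dup <- duplicated(dta[, c("Col", "Row")])

any(dup)    # are there repeated cells at all?
dta[dup, ]  # the rows that tapply(..., head, 1) would drop
```

If any(dup) is FALSE, there is exactly one row per combination and the tapply approach loses nothing.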

You're real close. To use matrix indexing, the indices have to be a matrix, not a data.frame.
X[as.matrix(dta.test[,c(1:2)])] <- dta.test$Kf
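One thing to watch, offered as a hedged sketch rather than a definitive fix: matrix indexing treats the first column of the index matrix as the row and the second as the column. With a table laid out as Col,Row,Kf, it may be safer to select the index columns by name, in Row, Col order. The toy data and the matrix size derived from it are assumptions for illustration.

```r
# Toy data in the same Col,Row,Kf layout as the question (values assumed)
dta.test <- data.frame(Col = c(1, 1, 2),
                       Row = c(1, 2, 1),
                       Kf  = c(100, 97.14285714, 100))

# Empty matrix sized from the data instead of hard-coded dimensions
X <- matrix(NA_real_, nrow = max(dta.test$Row), ncol = max(dta.test$Col))

# Index matrix: first column = row position, second column = column position
idx <- as.matrix(dta.test[, c("Row", "Col")])
X[idx] <- dta.test$Kf

X  # cell [2, 2] stays NA because the toy data has no value for it
```

Once the matrix is filled, heatmap(X) can be called on it (heatmap needs at least two rows and two columns).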

Related

For Loop Across Specific Column Range in R

I have a wide data frame consisting of 1000 rows and over 300 columns. The first 2 columns are GroupID and Categorical fields. The remaining columns are all continuous numeric measurements. What I would like to do is loop through a specific range of these columns in R, beginning with the first numeric column (column #3). For example, loop through columns 3:10. I would also like to retain the column names in the loop. I've started with the following code using
for(i in 3:ncol(df)){
print(i)
}
But this includes all columns to the right of column #3 (not the range 3:10), and this does not identify column names. Can anyone help get me started on this loop so I can specify the column range and also retain column names? TIA!
Side Note: I've used tidyr to gather the data frame in long format. That works, but I've found it makes my data frame very large and therefore eats a lot of time and memory in my loop.
Since you did not include your data, I created similar dummy data (1000 rows and 302 columns, 2 ID vars) to show you how to select columns and prepare them for a plot:
library(reshape2)
library(ggplot2)
set.seed(123)
#Dummy data
Numvars <- as.data.frame(matrix(rnorm(1000*300),nrow = 1000,ncol = 300))
vec1 <- 1:1000
vec2 <- rep(paste0('class',1:5),200)
IDs <- data.frame(vec1,vec2,stringsAsFactors = F)
#Bind data
Data <- cbind(IDs,Numvars)
#Select vars (in your case 10 initial vars)
df <- Data[,1:12]
#Prepare for plot
df.melted <- melt(data = df,id.vars = c('vec1','vec2'))
#Plot
ggplot(df.melted,aes(x=vec1,y=value,group=variable,color=variable))+
geom_line()+
facet_wrap(~vec2)
You will end up with a line plot of the ten selected variables, with one panel per vec2 class.
I hope this helps.
You can keep column names by feeding them into an lapply function, here's an example with the iris dataset:
lapply(names(iris)[2:4], function(columntoplot){
df <- data.frame(datatoplot=iris[[columntoplot]])
graphname <- columntoplot
ggplot(df, aes(x = datatoplot)) +
geom_histogram() +
ggtitle(graphname)
ggsave(filename = paste0(graphname, ".png"), width = 4, height = 4)
})
In the lapply function, you create a new dataset comprising one column (note the double brackets). You can then plot and optionally save the output within the function (see ggsave line). You're then able to use the column name as the plot title as well as the file name.
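If you would rather stay with a plain for loop, the same idea works: loop over the names of the columns in the range you want instead of over all indices. A minimal sketch, assuming a toy data frame with two ID columns followed by numeric columns (the data and the per-column mean are just placeholders for whatever you do inside the loop):

```r
# Toy wide data frame: 2 ID columns followed by 10 numeric columns (assumed)
set.seed(1)
df <- data.frame(GroupID = 1:5,
                 Category = letters[1:5],
                 matrix(rnorm(50), nrow = 5))

cols <- names(df)[3:10]   # names of columns 3 through 10 only
col_means <- numeric(0)

for (nm in cols) {
  # nm is the column name; df[[nm]] is the column itself
  col_means[nm] <- mean(df[[nm]])
}

col_means  # named vector, one entry per looped column
```

Because the loop variable is the name, you can use it directly for titles, file names, or result labels.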

Finding Mean of a column in an R Data Set, by using FOR Loops to remove Missing Values

I have a data set with Air Quality Data. The Data Frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally I want to do that using Control Structures (for loops and if-else loops)
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean. Though the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the mean of the corresponding column. By default, na.aggregate uses mean as its aggregation function (the FUN argument).
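For comparison, here is a rough base R sketch of the same column-mean fill, without zoo. This is a stand-in for illustration, not how na.aggregate is implemented, and the small df1 below is made up:

```r
# Toy all-numeric data frame with missing values (assumed)
df1 <- data.frame(a = c(1, NA, 3),
                  b = c(NA, 4, 8))

# Replace each NA with the mean of the non-missing values in its column
df1[] <- lapply(df1, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

df1
```

Note that this fills the NAs rather than excluding them, which is a different operation from computing mean(..., na.rm = TRUE).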
I can't see your data, but probably something like this? The vector needed to be initialized before the loop. It is better to avoid loops in R when you can...
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
if(!is.na(myDataFrame[i,1])){
New.Vec <- c(New.Vec, myDataFrame[i,1])
}
}
print(mean(New.Vec))
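The same loop can be tried on the reproducible vector from the question. This variant is only a sketch: instead of growing a vector, it keeps a running total and a count, which avoids reallocating New.Vec on every iteration (the names total, n, and loop_mean are made up for the example).

```r
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)

total <- 0  # running sum of the non-missing values
n <- 0      # how many non-missing values were seen

for (i in seq_along(y)) {
  if (!is.na(y[i])) {
    total <- total + y[i]
    n <- n + 1
  }
}

loop_mean <- total / n
loop_mean  # 7.5, the same as mean(y, na.rm = TRUE)
```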

Print histograms including variable name for all variables in R

I'm trying to generate a simple histogram for every variable in my dataframe, which I can do using sapply below. But, how can I include the name of the variable in either the title or the x-axis so I know which one I'm looking at? (I have about 20 variables.)
Here is my current code:
x = # initialize dataframe
sapply(x, hist)
Here's a way to modify your existing approach to include column name as the title of each histogram, using the iris dataset as an example:
# loop over column *names* instead of actual columns
sapply(names(iris), function(cname){
# (make sure we only plot the numeric columns)
if (is.numeric(iris[[cname]]))
# use the `main` param to put column name as plot title
print(hist(iris[[cname]], main=cname))
})
After you run that, you'll be able to flip through the plots with the arrows in the viewer pane (assuming you're using R Studio).
p.s. check out grid::grob(), gridExtra::grid.arrange(), and related functions if you want to arrange the histograms onto a single plot window and save it to a single file.
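If you don't want to pull in grid/gridExtra, a base graphics alternative is par(mfrow = ...), which splits one device into a grid of panels. A small sketch with iris, writing everything to a single PDF (the file name and the 2 x 2 layout are assumptions chosen for the four numeric iris columns):

```r
# Names of the numeric columns only
num_cols <- names(iris)[sapply(iris, is.numeric)]

# Hypothetical output file in the session's temp directory
out_file <- file.path(tempdir(), "iris_histograms.pdf")

pdf(out_file, width = 8, height = 8)
par(mfrow = c(2, 2))  # 2 x 2 grid: one panel per numeric column
for (cname in num_cols) {
  hist(iris[[cname]], main = cname, xlab = cname)
}
dev.off()
```

With about 20 variables you would pick a larger grid, for example par(mfrow = c(4, 5)), and a bigger device size.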
How about this? Assuming you have wide data, you can transform it to long format with gather, then use a ggplot solution with geom_histogram and facet_wrap:
library(tidyverse)
# make wide data (20 columns)
df <- matrix(rnorm(1000), ncol = 20)
df <- as.data.frame(df)
colnames(df) <- LETTERS[1:20]
# transform to long format (2 columns)
df <- gather(df, key = "name", value = "value")
# plot histograms per name
ggplot(df) +
geom_histogram(aes(value)) +
facet_wrap(~name, ncol = 5)

Loop through and plot columns of two identical dataframes

I have two dataframes I'd like to plot against each other:
> df1 <- data.frame(HV = c(3,3,3), NAtlantic850t = c(0.501, 1.373, 1.88), AO = c(-0.0512, 0.2892, 0.0664))
> df2 <- data.frame(HV = c(3,3,2), NAtlantic850t = c(1.2384, 1.3637, -0.0332), AO = c(-0.5915, -0.0596, -0.8842))
They're identical, I'd like to plot them column vs column (e.g. df1$HV, df2$HV) - loop through the dataframe columns and plot them against each other in a scatter graph.
I've looked through 20+ questions asking similar things and can't figure it out - would appreciate some help on where to start. Can I use lapply and plot or ggplot when they're two DFs? Should I merge them first?
As you suggest, I would indeed first rearrange into a list of plottable data frames before calling the plot command. I think that would especially be the way to go if you want to feed the data argument into ggplot. Something like:
plot_dfs <- lapply(names(df1),function(nm)data.frame(col1 = df1[,nm], col2 = df2[,nm]))
for (df in plot_dfs)plot(x = df[,"col1"], y = df[,"col2"])
or using ggplot:
for (df in plot_dfs){
print(
ggplot(data = df, aes(x=col1, y=col2)) +
geom_point())}
and if you want to add the column names as plot titles, you can do:
for (idx in seq_along(plot_dfs)){
print(
ggplot(data = plot_dfs[[idx]], aes(x=col1, y=col2)) +
ggtitle(names(df1)[idx]) +
geom_point())}
You can loop through the columns like this:
for(col in 1:ncol(df1)){
plot(df1[,col], df2[,col])
}
Make sure that both data frames have the same number of columns (and that the columns are in the same order) before running this.
Here's one way to do it: loop over the column indices and create the plots one by one, adding them to a list and writing each one to a file:
library(ggplot2)
# create some data to plot
df1 <- iris[, sapply(iris, is.numeric)]
df2 <- iris[sample(1:nrow(iris)), sapply(iris, is.numeric)]
# a list to catch each plot object
plot_list <- vector(mode="list", length=ncol(df1))
for (idx in seq_along(df1)){
plot_list[[idx]] <- ggplot2::qplot(df1[[idx]], df2[[idx]]) +
labs(title=names(df1)[idx])
ggsave(filename=paste0(names(df1)[idx], ".pdf"), plot=plot_list[[idx]])
}
As you suggest in the question, you can also use s/lapply() with an anonymous function, e.g. like this (though here we're not storing the plots, just writing each one to disk):
lapply(seq_along(df1), function(idx){
the_plot <- ggplot2::qplot(df1[[idx]], df2[[idx]]) + labs(title=names(df1)[idx])
ggsave(filename=paste0(names(df1)[idx], ".pdf"), plot=the_plot)
})
If you want to keep the list of plots (as in the for-loop example), just assign the result of lapply() to a variable (e.g. plot_list) and add a line like return(the_plot) before closing the function.
There are tons of ways you could modify/adapt this approach, depending on what your objectives are.
Hope this helps ~~
p.s. if it's possible the columns won't be in the same order, it is better to loop over column names instead of column indices (i.e. use for (colname in names(df1)){... instead of for (idx in seq_along(df1)){...). You can use the same [[ subsetting syntax with both names and indices.
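A minimal sketch of that name-based loop, using the data from the question with the columns of df2 deliberately shuffled. Restricting the loop to intersect() of the names is an extra precaution I'm assuming here, in case one data frame has columns the other lacks; the output file name is made up.

```r
df1 <- data.frame(HV = c(3, 3, 3),
                  NAtlantic850t = c(0.501, 1.373, 1.88),
                  AO = c(-0.0512, 0.2892, 0.0664))
# Same columns as in the question, but in a different order
df2 <- data.frame(AO = c(-0.5915, -0.0596, -0.8842),
                  HV = c(3, 3, 2),
                  NAtlantic850t = c(1.2384, 1.3637, -0.0332))

# Only loop over columns present in both data frames
shared <- intersect(names(df1), names(df2))

pdf(file.path(tempdir(), "column_scatter.pdf"))  # one page per column
for (colname in shared) {
  plot(df1[[colname]], df2[[colname]],
       main = colname,
       xlab = paste("df1", colname),
       ylab = paste("df2", colname))
}
dev.off()
```

Because the columns are matched by name, the shuffled order in df2 does not matter.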

Ploting a matrix using ggplot2 in R

I want to plot using ggplot2 the distribution of 5 variables corresponding to a matrix's column names
a <- matrix(runif(1:25),ncol=5,nrow=1)
colnames(a) <- c("a","b","c","d","e")
rownames(a) <- c("count")
I tried:
ggplot(data=melt(a),aes(x=colnames(a),y=a[,1]))+ geom_point()
However, this gives a result as if all columns had the same y value
Note: i'm using the reshape package for the melt() function
All columns look like they have the same y-value because you are only specifying 1 number in the y= statement. You are saying y=a[,1] which if you type a[,1] into your command window you will find is 0.556 (the number that everything is appearing at). I think this is what you want:
library(reshape2)
library(ggplot2)
a_melt<- melt(a)
ggplot(data=a_melt,aes(x=Var2,y=value))+ geom_point()
Note that I saved a new dataset called a_melt so that things were easier to reference. Also, since the data was melted, it is cleaner if we define our x-values to be the Var2 column of a_melt rather than the column names of a.
