Plotting a matrix "by parts" in R? - r

I have a 50k by 50k square matrix saved to disk in a text file and I would like to produce a simple histogram to see the distribution of the values in the matrix.
Obviously, when I try to load the matrix in R with read.table(), I run into a memory error because the matrix is too big. Is there any way I could load smaller submatrices one at a time, but still produce a histogram that considers all the values of the original matrix? I can indeed load smaller submatrices, but each new submatrix simply overwrites the histogram I built from the previous one.

Here's an approach. I don't have all the details because you did not provide sample data or the expected output, but one way to do this is with the read_csv_chunked() function in the readr package. First, write a summarisation function and then apply it to each chunk. See below for a full reprex.
# Call the Required Libraries
library(dplyr)
library(ggplot2)
library(readr)
# First Generate Some Fake Data
temp <- tempfile(fileext = ".csv")
fake_dat <- as.data.frame(matrix(rnorm(1000*100), ncol = 100))
write_csv(fake_dat, temp)
# Now write a summarisation function
# This will be applied to each chunk that is read into
# memory
summarise_for_hist <- function(x, pos){
  x %>%
    mutate(added_bin = cut(V1, breaks = -6:6)) %>%
    count(added_bin)
}
# Note that I manually set the cutpoints or "breaks"
# argument. You would need to refine this based on your
# data and subject matter expertise
# Apply the summarisation function to each chunk of the file
small_read <- read_csv_chunked(temp, # data
                               DataFrameCallback$new(summarise_for_hist),
                               chunk_size = 200 # number of lines to read per chunk
                               )
Now that we have summarised our data, we can combine and plot it.
# Generate our histogram by combining all of the results
# and plotting
small_read %>%
  group_by(added_bin) %>%
  summarise(total = sum(n)) %>%
  ggplot(aes(added_bin, total)) +
  geom_col()
This yields a histogram of the binned counts across all chunks.

Related

How to visualize Markov chains for NLP using ggplot?

I am working on analyzing some text in R and have settled on (for the moment) Markov chains as part of my procedure. Here is an example of what I'm doing:
# Required libraries
library(stringi) # Input cleaning
library(tidyverse) # dplyr, ggplot, etc.
library(hunspell) # Spell checker
library(markovchain) # Markov chain calculation
# Input
shake <- c("To be, or not to be - that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune Or to take arms against a sea of troubles And by opposing end them.")
# Process to clean input
miniclean <- function(x = ""){
# x is character string input
words_i = x %>%
gsub(pattern = "[^[:alpha:][:space:]\']", replacement = "") %>%
#gsub(pattern = "[\n]+", replacement = "") %>% # Drop line breaks
stri_trans_tolower() %>%
strsplit(split = " ") %>%
unlist()
correct = hunspell_check(words_i)
words_o = words_i[correct]
return(words_o)
}
# Clean input
cleans <- miniclean(shake)
# Compute Markov chain using cleaned input
mark2 <- markovchainFit(cleans)
# Plot results
plot(mark2$estimate)
The base plot graphics produces this visualization:
I would really like a bit more control over the plot (e.g., increasing arrow lengths to increase the overall size to make it more readable), but I don't see how to do it.
Ideas?
(edited to make a complete example)
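One possible direction (a sketch, not from the original post) is to pull the transition matrix out of the fitted markovchain object and plot it directly with ggplot2, for example as a heatmap, which gives full control over size, labels, and theme:
# Sketch: plot the fitted transition matrix as a ggplot2 heatmap
# (assumes mark2 from above; transitionMatrix is a slot of the markovchain class)
tm <- mark2$estimate@transitionMatrix
tm_df <- as.data.frame(as.table(tm)) # columns: Var1 (from), Var2 (to), Freq (probability)
ggplot(tm_df, aes(Var2, Var1, fill = Freq)) +
  geom_tile() +
  labs(x = "to", y = "from", fill = "P(transition)")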

Set common y axis limits from a list of ggplots

I am running a function that returns a custom ggplot from an input data (it is in fact a plot with several layers on it). I run the function over several different input data and obtain a list of ggplots.
I want to create a grid with these plots to compare them but they all have different y axes.
I guess what I have to do is extract the maximum and minimum y axes limits from the ggplot list and apply those to each plot in the list.
How can I do that? I guess it's through the use of ggplot_build. Something like this:
test = ggplot_build(plot_list[[1]])
> test$layout$panel_scales_x
[[1]]
<ScaleContinuousPosition>
Range:
Limits: 0 -- 1
I am not familiar with the structure of a ggplot_build object, and maybe this one in particular is not a standard one as it comes from a "custom" ggplot.
For reference, these plots are created with the gseaplot2 function from the enrichplot package.
I don't know how to "upload" an R object, but if that would help, let me know how to do it.
Thanks!
edit after comments (thanks for your suggestions!)
Here is an example of a gseaplot2 plot. GSEA stands for Gene Set Enrichment Analysis, a technique used in genomic studies. The gseaplot2 function calculates a running average and then plots it along with a bar plot at the bottom.
and here is the grid I create to compare the plots generated from different data:
I would like to have a common scale for the "Running Enrichment Score" part.
I guess I could try to recreate the gseaplot2 function and input all of the datasets and then create the grid by facet_wrap, but I was wondering if there was an easy way of extracting parameters from a plot list.
As a reproducible example (from the enrichplot package):
library(clusterProfiler)
library(enrichplot) # for gseaplot2()
library(dplyr)      # for the pipe and select()
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
wpgmtfile <- system.file("extdata/wikipathways-20180810-gmt-Homo_sapiens.gmt", package="clusterProfiler")
wp2gene <- read.gmt(wpgmtfile)
wp2gene <- wp2gene %>% tidyr::separate(term, c("name","version","wpid","org"), "%")
wpid2gene <- wp2gene %>% dplyr::select(wpid, gene) #TERM2GENE
wpid2name <- wp2gene %>% dplyr::select(wpid, name) #TERM2NAME
ewp2 <- GSEA(geneList, TERM2GENE = wpid2gene, TERM2NAME = wpid2name, verbose=FALSE)
gseaplot2(ewp2, geneSetID=1, subplots=1:2)
And this is how I generate the plot list (probably there is a much more elegant way):
plot_list = list()
for (i in 1:3) {
  fig_i = gseaplot2(ewp2,
                    geneSetID = i,
                    subplots = 1:2)
  plot_list[[i]] = fig_i
}
library(ggpubr) # for ggarrange()
ggarrange(plotlist = plot_list)
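For the general problem, one approach (a sketch that assumes each element of plot_list is a plain ggplot; a composite gseaplot2 object may need its sub-plots extracted first) is to read the y range of each plot via ggplot_build and re-apply the widest range to all of them:
# Sketch: harmonise y limits across a list of plain ggplot objects
y_ranges <- lapply(plot_list, function(p) ggplot_build(p)$layout$panel_scales_y[[1]]$range$range)
ylim_all <- range(unlist(y_ranges))
plot_list_common <- lapply(plot_list, function(p) p + coord_cartesian(ylim = ylim_all))
ggarrange(plotlist = plot_list_common)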

Advise a Chemist: Automate/Streamline his Voltammetry Data Graphing Code

I am a chemist dealing with a significant amount of voltammetry data recently. Let me be very clear and give some research information. I run scans from a starting voltage to an ending voltage on solid state conductive films. These scans are saved as .txt files (name scheme: run#.txt) in a single folder. I am looking at how conductance changes as temperature changes. The LINEST line plotting current v. voltage at a given temperature gives me a line with slope = conductance. Once I have the conductances (slopes) for each scan, I plot conductance v. temperature to see the temperature dependent conductance characteristics. I had been doing this in Excel, but have found quicker ways to get the job done using R. I am brand new to R (Rstudio) and recognize that my coding is not the best. Without doubt, this process can be streamlined and sped up which would help immensely. This is how I am performing the process currently:
# Set working directory with folder containing all .txt files for inspection
# Add all .txt files to the global environment
allruns<-list.files(pattern=".txt")
for(i in 1:length(allruns))assign(allruns[i],read.table(allruns[i]))
Since the voltage column (a 1x1000 matrix) is the same for all runs and is in column V1 of each .txt file, I assign a x to be the voltage column from the first folder
x<-run1.txt$V1
All currents (these change as the voltage changes) are found in the V2 column of all the .txt files, so I assign a y# to each. These are entered one at a time:
y1<-run1.txt$V2
y2<-run2.txt$V2
y3<-run3.txt$V2
# ...
yn<-runn.txt$V2
This lets me get the equation for each LINEST fit (one per scan, plotted with abline later). Again, these are entered one at a time:
run1<-lm(y1~x)
run2<-lm(y2~x)
run3<-lm(y3~x)
# ...
runn<-lm(yn~x)
To obtain a single graph with all LINEST (one for each scan ) on the same plot, without the data points showing up, I have been using this pattern of coding to first get all data points on a single plot in separate series:
plot(x,y1,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y3,yn)))
par(new=TRUE)
plot(x,y2,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y3,yn)))
par(new=TRUE)
plot(x,y3,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y1,yn)))
# ...
par(new=TRUE)
plot(x,yn,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y1,yn)))
#To obtain all LINEST lines (one for each scan, on the single graph):
abline(run1, col = "", lwd = 1)
abline(run2, col = "", lwd = 1)
abline(run3, col = "", lwd = 1)
# ...
abline(runn, col = "", lwd = 1)
# Then to get each LINEST equation:
summary(run1)
summary(run2)
summary(run3)
# ...
summary(runn)
Each time I use summary(), I copy the slope and paste it into an Excel sheet, along with the corresponding scan temperature, which I have recorded separately. I then graph the conductance v. temperature points for the film as an X-Y scatter with smooth lines to give the temperature-dependent conductance curve. This gives me a single plot of LINEST lines in R and the conductance v. temperature plot in Excel.
This technique is actually MUCH quicker than doing it all in Excel, but it could surely be done much more quickly and efficiently! Also, if I need to change something, this entire process needs to be re-executed with whatever change is necessary. The process takes me maybe 5 hours in Excel and 1.5 hours in R (maybe I am too slow). Nonetheless, any tips to help automate/streamline this further are greatly appreciated.
There are plenty of questions about operating on data in lists; storing a list of matrices or a list of data.frames is fast, and code that operates cleanly on one can be applied to the remaining n-1 very easily.
(Note: the way I'm showing it here is one technique: maintaining everything in well-compartmentalized lists. Others will suggest -- very justifiably -- that combining things into a single data.frame and adding a group variable (to identify from which file/experiment the data originated) will help with more advanced multi-experiment regression or combined plotting, such as with ggplot2. I'm not going to go into that latter technique here, not yet.)
The for(...) assign(..., read.table(...)) pattern has long been discouraged; you have the important part done, so this is relatively easy:
allruns <- sapply(list.files(pattern = "*.txt"), read.table, simplify = FALSE)
(The use of sapply(..., simplify=FALSE) is similar to lapply(...), but it has a nice side-effect of naming the individual list-ified elements with, in this case, each filename. It may not be critical here but is quite handy elsewhere.)
Extracting your invariant and variable data is simple enough:
allLMs <- lapply(allruns, function(mdl) lm(V2 ~ V1, data = mdl))
I'm using each table's V1 here instead of a once-extracted x ... though you might wonder why, I argue for keeping it like this for two reasons: (1) JUST IN CASE the V1 variable is ever even one row different, this will save you; (2) it is very easy to construct the model like this.
At this point, each object within allLMs is an lm object, meaning we might do:
summary(allLMs[[1]])
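Or, staying in list mode, you can summarise them all at once:
all_summaries <- lapply(allLMs, summary)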
Plotting: I think I understand why you are using par(new=TRUE), and I have to laugh ... I had been deep in R for a while before I started using that technique. What I think you need is actually much simpler:
xlim <- rev(range(allruns[[1]]$V1))
ylim <- range(sapply(allruns, `[`, "V2"))
# this next plot just sets the box and axes, no points
plot(NA, type = "n", xlim = xlim, ylim = ylim)
# no need to plot points with "transparent" ...
ign <- sapply(allLMs, abline, col = "gray40") # pick a colour; other abline options ...
Copying all models into Excel, again, using lists:
out <- do.call(rbind, lapply(allLMs, function(m) summary(m)$coefficients[, 1]))
This is now a single matrix with one row per model and the intercept and slope in two columns. (Feel free to use similar techniques to extract the other model summary attributes, including the std. error, t value, or Pr(>|t|) (in $coefficients); or $r.squared, $adj.r.squared, etc.)
write.table(out, file = "clipboard", sep = "\t")
and paste into Excel. (Or, better yet, save it to a CSV file and import that, since you might want to keep it around.)
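If the temperature of each scan is also available in R, the final conductance-vs-temperature step can stay in R as well (a sketch; temps is a hypothetical vector of your recorded scan temperatures, one per file, in the same order as allLMs):
# Hypothetical: temps holds the recorded temperature of each scan, in file order
temps <- c(300, 305, 310) # replace with your actual values, one per run
cond <- data.frame(temperature = temps,
                   conductance = out[, "V1"]) # slope column of the coefficient matrix
plot(cond$temperature, cond$conductance, type = "o",
     xlab = "temperature", ylab = "conductance (slope)")
write.csv(cond, "conductance_vs_temperature.csv", row.names = FALSE)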
One of the tricks to using lists for this is to persevere: keep things in lists as long as you can, so that you don't have to deal with models individually. One mantra is that if you do it once, you shouldn't have to type it again; just loop/apply/map/whatever. Don't extract too much from the lists before you have to.
Note: r2evans' answer provides good general advice and doesn't require heavy package dependencies. But it probably doesn't hurt to see alternative strategies.
The tidyverse can be quite handy for this sort of thing; here's a dummy example for illustration:
library(tidyverse)
# creating dummy data files
dummy <- function(T) {
  V <- seq(-5, 5, length = 20)
  I <- jitter(T*V + T, factor = 1)
  write.table(data.frame(V = V, I = I),
              file = paste0(T, ".txt"),
              row.names = FALSE)
}
purrr::walk(300:320, dummy)
# reading
lf <- list.files(pattern = "\\.txt")
read_one <- function(f, ...) {
  cbind(T = as.numeric(gsub("\\.txt", "", f)), read.table(f, ...))
}
m <- purrr::map_df(lf, read_one, header = TRUE, .id="id")
head(m)
ggplot(m, aes(V, I, group = T)) +
  facet_wrap(~ T) +
  geom_point() +
  geom_smooth(se = FALSE)

models <- m %>%
  split(.$T) %>%
  map(~ lm(I ~ V, data = .))

coefs <- models %>% map_df(broom::tidy, .id = "T")

ggplot(coefs, aes(as.numeric(T), estimate)) +
  geom_line() +
  facet_wrap(~ term, scales = "free")

Identify spikes/peaks in density plot by group

I created a density plot with the ggplot2 package for R. I would like to identify the spikes/peaks in the plot which occur between 0.01 and 0.02. There are too many legend entries to pick it out, so I deleted the legend. I tried filtering my data to find which group has the most rows between 0.01 and 0.02, then removed that group to see whether the spike/peak would disappear, but it is still there in the plot. Can you suggest a way to identify these spikes/peaks?
Here is some code :
ggplot(NumofHitsnormalized, aes(NumofHits_norm, fill = name)) + geom_density(alpha=0.2) + theme(legend.position="none") + xlim(0.0 , 0.15)
## To filter out the data that is in the range of first spike
test <- NumofHitsnormalized[which(NumofHitsnormalized$NumofHits_norm > 0.01 & NumofHitsnormalized$NumofHits_norm <0.02),]
## To figure out which group (name column) has the most rows,
## since I thought that group's data might be what produces the spike
testMatrix <- matrix(ncol = 2, nrow = length(unique(test$name)))
for (i in 1:length(unique(test$name))){
  testMatrix[i, 1] <- unique(test$name)[i]
  testMatrix[i, 2] <- sum(test$name == unique(test$name)[i]) # number of rows for this group
}
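For reference, a more compact way to get the same per-group counts (assuming dplyr is loaded) would be:
test %>%
  count(name, sort = TRUE) # rows per group in the 0.01-0.02 window, largest first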
Konrad,
This is the new plot made after I filtered my data with the extremevalues package. There are new peaks located at different intervals, and 96% of the initial groups still have data in the new plot (even though the number of rows in the filtered data dropped to 0.023% of the initial dataset), so I can't identify which peaks belong to which groups.
I had a similar problem to this.
What I did was create a rolling mean and SD of the y values with a window of 3.
Calculate the average SD of your baseline data (the data you know won't have peaks).
Set a threshold value.
If above the threshold, 1, else 0.
library(TTR) # provides runMean() and runSD()
d5$roll_mean = runMean(d5$`Current (pA)`, n = 3)
d5$roll_sd = runSD(x = d5$`Current (pA)`, n = 3)
d5$delta = ifelse(d5$roll_sd > 1, 1, 0)
currents = subset(d5, d5$delta == 1, na.rm = TRUE) # finds all peaks
My threshold was an SD > 1. Depending on your data you may want to use the mean or the SD; for slow-rising peaks the mean would be a better choice than the SD.
Without looking at the code, I drafted this simple function to add TRUE/FALSE flags to variables indicating outliers:
GenerateOutlierFlag <- function(x) {
  # Load required packages
  Vectorize(require)(package = c("extremevalues"), char = TRUE)
  # Run check for outliers
  out_flg <- ifelse(1:length(x) %in% getOutliers(x, method = "I")$iLeft,
                    TRUE, FALSE)
  out_flg <- ifelse(1:length(x) %in% getOutliers(x, method = "I")$iRight,
                    TRUE, out_flg)
  return(out_flg)
}
If you care to read about the extremevalues package you will see that it provides some flexibility in terms of identifying outliers but broadly speaking it's a good tool for finding various peaks or spikes in the data.
Side point
You could actually optimise it significantly by creating one object corresponding to getOutliers(x, method = "I") instead of calling the method twice.
More sensible syntax
GenerateOutlierFlag <- function(x) {
  # Load required packages
  require("extremevalues")
  # Outliers object
  outObj <- getOutliers(x, method = "I")
  # Run check for outliers
  out_flg <- ifelse(1:length(x) %in% outObj$iLeft,
                    TRUE, FALSE)
  out_flg <- ifelse(1:length(x) %in% outObj$iRight,
                    TRUE, out_flg)
  return(out_flg)
}
Results
x <- c(1:10, 1000000, -99099999)
table(GenerateOutlierFlag(x))
FALSE TRUE
10 2
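To apply this per group on the data from the question (a sketch assuming the NumofHitsnormalized data frame with name and NumofHits_norm columns, and enough points per group for getOutliers() to fit its model):
library(dplyr)
flagged <- NumofHitsnormalized %>%
  group_by(name) %>%
  mutate(out_flg = GenerateOutlierFlag(NumofHits_norm)) %>%
  ungroup()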

Plot LOESS (STL) decomposition using Ggvis

I want to be able to plot the three different elements of The Seasonal Trend Decomposition using Loess (STL) with Ggvis.
However, I receive this error:
Error: data_frames can only contain 1d atomic vectors and lists
I am using the nottem data set.
# The Seasonal Trend Decomposition using Loess (STL) with Ggvis
# Load nottem data set
library(datasets)
nottem <- nottem
# Decompose using stl()
nottem.stl = stl(nottem, s.window="periodic")
# Plot decomposition
plot(nottem.stl)
Now, this is the information I am interested in. In order to make this into a plot that I can play around with I transform this into a data frame using the xts-package. So far so good.
# Transform nottem.stl to a data.frame
library(xts)
df.nottem.stl <- as.data.frame(as.xts(nottem.stl$time.series))
# Add date to data.frame
df.nottem.stl$date <- data.frame(time = seq(as.Date("1920-01-01"), by = ("months"), length =240))
# Glimpse data
library(dplyr) # for glimpse()
glimpse(df.nottem.stl)
# Plot simple line of trend
plot(df.nottem.stl$date, df.nottem.stl$trend, type = "o")
This is pretty much the plot I want. However, I want to be able to use it with Shiny and therefore Ggvis is preferable.
# Plot ggvis
df.nottem.stl%>%
ggvis(~date, ~trend)%>%
layer_lines()
This is where I get my error.
Any hints on what might go wrong?
First of all, your df.nottem.stl data.frame contains a date column that is itself a data.frame, so you should be using the date$time column. Then using the layer_paths function instead of layer_lines will make it work. I always find layer_paths works better than layer_lines:
So this will work:
library(ggvis)
df.nottem.stl %>%
  ggvis(~date$time, ~trend) %>%
  # for points
  layer_points() %>%
  # for lines
  layer_paths()
Output:
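Alternatively (a sketch, not part of the original answer), the error can be avoided at the source by storing the date as a plain atomic vector column instead of a nested data.frame, after which ~date works directly:
# Store the date as an atomic vector column rather than a nested data.frame
df.nottem.stl$date <- seq(as.Date("1920-01-01"), by = "months", length.out = 240)
df.nottem.stl %>%
  ggvis(~date, ~trend) %>%
  layer_paths()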
