R plot data.frame to get more effective overview of data

At work when I want to understand a dataset (I work with portfolio data in life insurance), I would normally use pivot tables in Excel to look at e.g. the development of variables over time or dependencies between variables.
I remembered from university the nice R function where you can plot every column of a data frame against every other column, as in the standard plot(data.frame) call.
For the dependency between issue.age and duration this plot is actually interesting, because you can clearly see that high issue ages come with shorter policy durations (because there is a maximum age for each policy). However, the plots involving the issue year iss.year are much less "visual"; in fact, you can't see anything from them. I would like to see at a glance whether the distribution of issue ages has changed over the different issue years, something like a chart of the issue-age distribution per issue year,
where you could see immediately that the average age of newly issued policies has been increasing from 2014 to 2016.
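In ggplot2 terms, the kind of view I have in mind might look like the sketch below (portfolio is a placeholder name for my data frame; the columns are the ones above):
library(ggplot2)
# one box per issue year, to see how the issue-age distribution shifts over time
ggplot(portfolio, aes(x = factor(iss.year), y = issue.age)) +
  geom_boxplot() +
  labs(x = "Issue year", y = "Issue age")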
I don't want to write code that needs to be customized for every dataset that I put in because then I can also do it faster manually in Excel.
So my question is, is there an easy way to plot each column of a matrix against every other column with more flexible chart types than with the standard plot(data.frame)?

Use the ggpairs() function from the GGally package. It has a lot of capability for visualizing columns of all different types, and provides a lot of control over what to visualize.
For example, here is a snippet from the GGally vignette:
library(GGally)                  # provides ggpairs()
data(tips, package = "reshape")  # example data: restaurant tips
ggpairs(tips)                    # plots every column against every other
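You can also steer what goes into the individual panels; a small sketch (the panel choices and colour mapping here are illustrative, not the only options):
library(GGally)
library(ggplot2)
data(tips, package = "reshape")
ggpairs(
  tips,
  mapping = aes(colour = sex),                      # colour every panel by a grouping column
  lower = list(continuous = "smooth"),              # scatter plus fitted line for numeric pairs
  upper = list(continuous = wrap("cor", size = 3))  # smaller correlation labels
)
For the insurance example, mapping colour to iss.year (as a factor) would make shifts in the issue-age distribution across years visible directly in the pairs plot.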

Related

Analysing vocal similarity of little owls using warbleR in R

I am struggling a bit with an analysis I need to do. I have collected data consisting of little owl calls that were recorded along transects. I want to analyse these recordings for similarity, in order to see which recorded calls are from the same owls and which are from different owls. In that way I can make an estimate of the size of the population at my study area.
I have done a bit of research and it seems that the package warbleR is suitable for this. However, I am far from an R expert and am struggling a bit with how to go about it. Do any of you have experience with these types of analyses and maybe have example scripts? It seems to me that I could use the function cross_correlation and maybe run a PCA; however, in the warbleR vignette I looked at they only do this for different types of calls, not for the same type of call from different individuals, so I am not sure whether it would work.
To be able to run analyses with warbleR, you need to input the data using the "selection_table" format. Take a look at the example data lbh_selec_table to get a sense of the format:
library(warbleR)
data(lbh_selec_table)
head(lbh_selec_table)
The whole point of these objects is to tell R the time location (in seconds) within your sound files of the signals you want to analyze. Take a look at the warbleR documentation for more details on this object structure and how to import your own selections into R.
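From there, a minimal sketch of the similarity analysis described in the question, using only the example data shipped with warbleR (argument values are illustrative, and cross_correlation was called xcorr in older warbleR versions, so check your version's docs):
library(warbleR)
# example data: a selection table plus the four sound files it points to
data(list = c("lbh_selec_table", "Phae.long1", "Phae.long2", "Phae.long3", "Phae.long4"))
writeWave(Phae.long1, file.path(tempdir(), "Phae.long1.wav"))
writeWave(Phae.long2, file.path(tempdir(), "Phae.long2.wav"))
writeWave(Phae.long3, file.path(tempdir(), "Phae.long3.wav"))
writeWave(Phae.long4, file.path(tempdir(), "Phae.long4.wav"))
# pairwise spectrographic cross-correlation between all selections
xc <- cross_correlation(X = lbh_selec_table, wl = 300, ovlp = 90, path = tempdir())
# turn similarities into distances and ordinate, e.g. with classical MDS;
# calls from the same individual should then sit close together
mds <- stats::cmdscale(stats::as.dist(1 - xc))
plot(mds, xlab = "Dim 1", ylab = "Dim 2", pch = 16)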

R newbie- is there a way to separate or filter out items listed in a single cell for plotting purposes?

Problem
R and Stack Overflow newbie here, so try to be patient with me. I am currently working on a data.frame that will act as a summary of various modelling approaches used to predict either fall events or fall rates within an in-patient setting, based on a range of hospital, environmental and individual-level variables.
My data is in long format and some studies have several rows (I have created a row for each model type, and some studies built multiple models). For some columns (e.g., model performance) I have multiple entries separated by commas (e.g., C-statistic, Hosmer-Lemeshow test, likelihood ratio, and so forth). My question is: is there a way to separate these so that I can create a barplot in ggplot2 showing the prevalence of the different methods, with one bar per statistic/test type and the height of each bar being the number of times it occurs in the data frame? At the moment this obviously does not work, because some bars have a label that contains a whole list of values (e.g., "C-statistic, Hosmer-Lemeshow test, likelihood ratio"), which means there can be multiple bars containing "C-statistic", because each list is slightly different.
Screenshots and code
I have attached a screenshot of my data.frame below. The column I refer to is "Statistic.reported".
Screenshot of data frame:
I have also attached an image of what happens when I create a basic barplot with the following code:
Bar <- ggplot(Modelling.Data, aes(x = Statistic.reported)) + geom_bar() + theme_classic()
Image of plot using current basic code:
Things I have tried
I have tried using the tidyr function separate_rows; my code for this was as follows:
separate_rows(Modelling.Data,Modelling.Data$Statistic.reported, sep = ",")
From this I got an error that said "Can't subset columns that don't exist".
Hopefully, this makes sense, but I'm really new to all of this so if you need anything else please tell me. Any tips or advice would be hugely appreciated! Apologies in advance for my complete lack of knowledge.
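For what it's worth, the usual cause of that error is passing the column as Modelling.Data$Statistic.reported instead of a bare column name; a sketch of the fix and the resulting count-per-statistic barplot (assuming the entries are comma-separated as described, possibly with spaces after the commas):
library(tidyr)
library(dplyr)
library(ggplot2)
plot_data <- Modelling.Data %>%
  separate_rows(Statistic.reported, sep = ",") %>%         # bare column name, no $
  mutate(Statistic.reported = trimws(Statistic.reported))  # drop stray spaces
ggplot(plot_data, aes(x = Statistic.reported)) +
  geom_bar() +
  theme_classic()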

Package for Summarizing Data in My DataFrame R

I have a huge dataset containing information about 1774 counties in the US. The variables there are things like income quartile, voter preferences, median household income etc.
I would like to know whether there is a package which would allow me to quickly see, for example, the number of counties with income over a certain level that voted Republican, or the number of counties where more than 50% work in services while the average educational attainment is HS or lower.
I know that I can do so with dplyr functions, however, that is extremely time-consuming when I want to do it with large amounts of variables.
Thank you for any recommendations!
I recommend you try the explore package.
While you can use it manually to explore specific parts of your dataset, it has additional features to explore data interactively via shiny (explore_shiny) and to generate a report of your entire dataset via rmarkdown (report).
Exploring pairs of variables (e.g. income by party voted for) is possible by specifying one variable as the target and selecting the second variable. But it won't always give you the comparison you need. Hence I would recommend the explore package as an initial starting point for understanding your data, but for specific analysis you will probably need to write your own dplyr, ggplot, and/or plotly code (or whichever other packages you favour).
Further worked examples are found in its vignette.
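A minimal sketch of that workflow (the data frame and column names here are placeholders for your county data):
library(explore)
library(dplyr)
# interactive exploration of the whole dataset in a shiny app
counties %>% explore()
# one variable split by a target, e.g. income quartile by voting outcome
counties %>% explore(income_quartile, target = voted_republican)
# HTML report describing every variable at once
counties %>% report(output_dir = tempdir())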

Plotting a subset of data from a prcomp matrix without re-running prcomp

I am asking a question similar to a post from two years ago that never received a full answer (subset of prcomp object in R). P.S. Sorry for commenting on it to ask for an answer.
Basically, my question is the same. I have generated a PCA table using prcomp from a dataset with 10,000+ genes and 1,700+ cells, made up of 7 timepoints. Plotting all of them in a single figure makes it difficult to see anything.
I would like to plot each timepoint separately, using the same PCA results table (ie without re-running prcomp).
Thanks Dean for giving me tips on posting. Thinking of a way to describe my dataset without actually loading it here would take me a week, I believe. I also tried the
dput(droplevels(head(object,2)))
option, but it was just too much info since I have such a large dataset. In short, it is a large matrix of single-cell data of the kind commonly seen in packages such as Seurat (https://satijalab.org/seurat/pbmc3k_tutorial_1_4.html). EDIT: I have posted a screenshot of a subset of my matrix.
Sorry, I don't know how to re-create this or even export it in a text format, but this is what I can provide:
My TPM matrix has 16541 rows (defining genes), and 1798 columns (defining cells).
In it, I have "re-labelled" my columns based on timepoints, using codes such as:
D0<-c(colnames(TPM[,grep("20180419-24837-1-*", colnames(TPM))])) #D0: 286 cells
D7<-c(colnames(TPM[,grep("20180419-24837-2-*", colnames(TPM))])) #D7: 237 cells
D10<-c(colnames(TPM[,grep("20180419-24947-5-*", colnames(TPM))])) #D10: 304 cells
...... and I continued to label each timepoint.
Each timepoint was also given a specific colour.
rc<-rep("white", ncol(TPM))
rc[grep("20180419-24837-1-*", colnames(TPM))] <- "magenta"
...... and I continued to give colour to each timepoint.
I performed a PCA using this code:
pcaRes<-prcomp(t(log(TPM+1)), center= TRUE, scale. = TRUE)
Then I proceeded to plot a PCA plot using:
plot(pcaRes$x[,1], pcaRes$x[,2], xlab="PC1", ylab="PC2",
cex=1.0, col= rc, pch=16, main="")
Then, when I wanted to plot a PCA plot with D0 only, using the same PCA output (pcaRes)... this is where I am stuck.
P.S. If anyone else has an easier way of advising how to input an example data here from my large matrix, I welcome any help. Thanks so much! Sorry I am very new in bioinformatics.
Stack Exchange for Bioinformatics is where you will need to go to ask questions or learn about the packages and functions for your area of specialty. Stack Exchange for Bioinformatics is linked with Stack Overflow, so you will just need to join; you'll have the same login.
Classes S3, S4 and Base.
This is a very basic overview of classes in R. Think of a class as the parent you inherit skills and abilities from: as a result you are able to do certain tasks better than others, and in some cases you will not be able to do the task at all.
In R, as in all programming, parent classes are created to save re-inventing the wheel, so that the average person does not have to repeatedly write a function to do something simple like plot() a graph. This machinery is hidden; to access it, you inherit from the parent. The child reads the traits off the parent(s), and then it either performs the task or gives you a cryptic error message.
Base and S3 classes work well together; they are the working-class people of the R world. S4 is a specialized class made for specific fields of study, to provide the specific functionality needed in their industry. This means you can only use certain Base and S3 functions with S4 objects; most are just not compatible. So it's nothing you've done wrong: plot() and ggplot() just have the wrong parent(s) to work with your dataset.
Typical Base/S3 class data frame: box-like structure, with all the column names neatly stacked along the left-hand side.
Seurat S4 class object: tree-like structure, formatted to be read by specific functions.
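To make that concrete, a tiny sketch (the Record class below is made up purely for illustration):
df <- data.frame(x = 1:5, y = (1:5)^2)  # a plain S3 data.frame
df$x        # S3 parts are reached with $
plot(df)    # works: plot() has a method for data.frames
setClass("Record", slots = c(x = "numeric", y = "numeric"))  # a minimal S4 class
r <- new("Record", x = as.numeric(1:5), y = as.numeric((1:5)^2))
r@x         # S4 parts live in slots, reached with @
# plot(r) fails: no plot() method was ever written for this class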
Well, hope that helps, and I wish you well in your career. Cheers, Conrad.
P.S. If this helps, then click the up arrow. :)
Thanks @ConradThiele for your suggestion, I will check out that site.
I had a chat with other bioinformaticians around the institute. My query has little to do with the object being an S4 class, since I am performing prcomp outside of the package: I have extracted my matrix out of the object and then run prcomp on it.
The solution is simple: run prcomp on the full dataset, transform the prcomp output into a data frame, add extra columns for details like "timepoint", create new data frame(s) containing only the timepoint/variable of interest from the prcomp result, and then plot each sub-data-frame using plot() or whatever function you use.
This was not my own solution but came from a bioinformatician at my institute whom I went to for help. Hope this helps others! Thanks again for your time.
P.S. If I have the time, I will post a copy of the code I suggested soon.
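In the meantime, a minimal sketch of that approach, reusing the objects and grep patterns from the question (extend the labelling to the remaining timepoints in the same way):
# PC scores for all cells; rownames are the cell names from colnames(TPM)
scores <- as.data.frame(pcaRes$x[, 1:2])
# add a timepoint column using the same patterns as above
scores$timepoint <- NA_character_
scores$timepoint[grep("20180419-24837-1-", rownames(scores))] <- "D0"
scores$timepoint[grep("20180419-24837-2-", rownames(scores))] <- "D7"
# ... and so on for the other timepoints
# one sub-data-frame per timepoint, plotted on its own without re-running prcomp
d0 <- subset(scores, timepoint == "D0")
plot(d0$PC1, d0$PC2, xlab = "PC1", ylab = "PC2", col = "magenta", pch = 16, main = "D0")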

Shiny - Efficient way to use ggplot2(boxplot) & a 'reactive' subset function

I have a dataset with > 1000K rows and 5 columns (material and price being the relevant columns).
I have written a 'reactive' Shiny app which uses ggplot2 to create a boxplot of the price of the various materials.
e.g. the user selects 4-5 materials from a list and then Shiny creates a boxplot of the price of each material:
Price spread of: Made of Cotton, Made of Paper, Made of Wood
It also creates a material-combination plot of the price spread of all the selected materials combined,
e.g. a boxplot of
Price spread of: Made of Cotton & Paper & Wood
It is working relatively quickly for the sample dataset (~5000 rows) but I am worried about scaling it effectively.
The dataset is static, so I looked at the following solutions:
1. Calculate the quartile ranges of the various materials (data <- summary(data)) and then use googleVis to create a candlestick chart. However, I run into problems when trying to calculate the material-combination plot: there are over 100 materials, so calculating all the possible combinations offline is not feasible.
2. Calculate the quartile ranges of the various materials (data <- summary(data)) and then create a matrix which stores the row number of the summary data (min, median, max, 1st & 3rd quartile) for each material. I can then use some rough calculations to establish the summary() data for the material-combination plot, and then plot using googleVis. However, I have little experience with this type of calculation in Shiny.
Can anyone suggest the most robust and scalable way to calculate & boxplot reactive subsets using Shiny?
I understand this is a question about method rather than code, but I am new to the capabilities of R, am still digesting what the different classes can do, and don't want to 'miss a trick', so to speak.
As always thanks!
Please see below for methods reviewed.
Quartile Clustering: A quartile based technique for Generating Meaningful Clusters
http://arxiv.org/ftp/arxiv/papers/1203/1203.4157.pdf
Conditionally subsetting and calculating a new variable in dataframe in shiny
If you really have a dataset with more than 1000K rows, i.e. over a million, it is probably in a flat file or a database. You can always do some precalculations, store the results in a database table, and have the shiny app query that table instead of loading everything into R every time someone opens the app.
I have built several shiny apps for internal use, and the lesson I have learned is that before you build your app, you need to think carefully about how to minimize the calculations R has to do while still delivering the info to the app user. Some of our data is 10 billion+ rows and a Hive query takes more than an hour, so I ended up precalculating the results and putting the job on a crontab to update the result table every midnight.
I would prefer something like your method 2, or storing the precalculations in a MySQL database (with, say, a Python script updating the table once a day if you need a near-real-time feature later).
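As a sketch of that precalculation idea (the table and column names are made up; the point is that the app only ever touches a small table of precomputed statistics):
library(dplyr)
library(ggplot2)
# offline, once: five-number summary per material
material_stats <- raw_prices %>%
  group_by(material) %>%
  summarise(ymin   = min(price),
            lower  = quantile(price, 0.25),
            middle = median(price),
            upper  = quantile(price, 0.75),
            ymax   = max(price))
# inside the shiny server: filter the small stats table and draw the boxes
# from precomputed values instead of recomputing them from >1M raw rows
ggplot(filter(material_stats, material %in% input$materials),
       aes(x = material, ymin = ymin, lower = lower, middle = middle,
           upper = upper, ymax = ymax)) +
  geom_boxplot(stat = "identity")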
