Stacked or Grouped Barcharts from Weighted Survey Data (Class="survey.design2" "survey.design") in R

I am working with weighted survey data of the class survey.design2 and survey.design. With the package survey, and the function call svytable, I can create contingency tables for survey data. With these contingency tables, I can then create normal bar-charts using lattice. The standard way for doing this (e.g. barchart(cars ~ mpg | factor(cyl), data=mtcars,...)) doesn't work for this data type.
I am used to working with ggplot2, and would like to create either stacked or grouped bar-charts, if possible with facet-wraps. Unfortunately, ggplot2 does not know how to deal with data of the type survey.design2 either. As far as I am aware, there is also no add-on that would allow ggplot2 to deal with this kind of data.
So far I have:
subset my data set,
converted it into class survey.design2 with the function call svydesign(),
plotted multiple bar-charts in one window using grid.arrange(). This provides a workaround for faceting, but still doesn't allow me to create stacked or grouped bar-charts.
I'd be grateful for any suggestions.
Thank you

Good morning MatthewR
I have a data set with 62732 observations and 691 variables.
Original Data Set
So any example based on a random number generator should work as well, I guess. I am really just interested in a workaround for this issue, not necessarily the final code.
I then convert the data frame into survey.design format using:
df_Survey <- svydesign(id=~1, weights=~IXPXHJ, data=df)
IXPXHJ is the variable by which the original sample data set is weighted to represent the entire population. head(df$IXPXHJ) looks something like this:
87.70876
78.51809
91.95209
94.38899
105.32005
56.30210
str(df_Survey) looks something like this.
Survey Data Structure
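One possible workaround, sketched under assumptions: svytable() returns a weighted contingency table, and as.data.frame() turns that table into a long data frame with a Freq column, which ggplot2 can plot like any other data. The simulated data and the columns answer, region and sex below are made up for illustration; only IXPXHJ and the svydesign() call come from the question.

```r
library(survey)
library(ggplot2)

# Simulated stand-in for the real data (hypothetical columns, random values)
set.seed(1)
df <- data.frame(
  region = sample(c("North", "South"), 200, replace = TRUE),
  answer = sample(c("Yes", "No", "Unsure"), 200, replace = TRUE),
  sex    = sample(c("F", "M"), 200, replace = TRUE),
  IXPXHJ = runif(200, 50, 110)  # survey weights
)

df_Survey <- svydesign(id = ~1, weights = ~IXPXHJ, data = df)

# Weighted three-way contingency table
tab <- svytable(~answer + region + sex, design = df_Survey)

# Long format: one row per cell, weighted count in column Freq
plot_df <- as.data.frame(tab)

# Stacked bars, one panel per level of sex
p <- ggplot(plot_df, aes(x = region, y = Freq, fill = answer)) +
  geom_col(position = "stack") +
  facet_wrap(~sex)
print(p)
```

Replacing position = "stack" with position = "dodge" should give grouped instead of stacked bars.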

Related

Add extra explanatory layer onto PCA

I am trying to run a PCA on a dataframe which is accompanied by a metadata table. The PCA table is all normalized, scaled, etc.; the metadata, however, is not.
I want the PCA not only to cluster based on the dataframe, but also to have the option of adding one or multiple columns from the metadata table as explanatory variables. Again, these are not scaled and normalized with the main dataset. Also, I am not looking to color the plots by a certain data column; I want the column to be considered for the actual clustering.
I am aware that this sounds kind of vague, but I am having a hard time finding the exact words. After looking around for a bit, I found demixed PCA, which seems to be very close to what I want to achieve. Sadly, there is no package in R to run it.
Any recommendations are welcome and thank you in advance.

R newbie- is there a way to separate or filter out items listed in a single cell for plotting purposes?

Problem
R and Stack Overflow newbie here, so please be patient with me. I am currently working on a data.frame that will act as a summary of various modeling approaches used to predict either fall events or fall rates within an in-patient setting, based on a range of hospital, environmental and individual-level variables.
My data is in long format and some studies have several rows (I have created a row for each model type, with some studies having built multiple). For some columns (e.g., Model performance) I have multiple entries separated by a comma (e.g., C-statistic, Hosmer-Lemeshow test, likelihood ratio, and so forth). My question is: is there a way to separate these so I can create a barplot in ggplot2 that shows the prevalence of the different methods, with one bar per statistic/test type and the height of each bar being the number of times it occurs in the data frame? At the moment this does not work, because some bars have a label that contains all of the values (e.g., C-statistic, Hosmer-Lemeshow test, likelihood ratio). As a result there can be multiple bars that contain "c-statistic", for example, because the list is slightly different each time.
Screenshots and code
I have attached a screenshot of my data.frame below. The column I refer to is "Statistic.reported"
Screenshot of data frame:
I have also attached an image of what happens when I create a basic barplot with the following code:
Bar <- ggplot(Modelling.Data, aes(x=Statistic.reported)) +geom_bar()+ theme_classic()
Image of plot using current basic code:
Things I have tried
I have tried using the tidyr package function separate_rows. My code for this was as follows:
separate_rows(Modelling.Data,Modelling.Data$Statistic.reported, sep = ",")
From this I got an error that said "Can't subset columns that don't exist".
Hopefully, this makes sense, but I'm really new to all of this so if you need anything else please tell me. Any tips or advice would be hugely appreciated! Apologies in advance for my complete lack of knowledge.
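A minimal sketch of how the splitting might be done, under assumptions: separate_rows() expects a bare column name, so passing Modelling.Data$Statistic.reported is probably what triggers the "Can't subset columns that don't exist" error. The toy Modelling.Data below (including the Study column) is made up for illustration; lower-casing the labels is an extra step added here so that "C-statistic" and "c-statistic" count as the same bar.

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Toy stand-in for the real data frame (hypothetical values)
Modelling.Data <- data.frame(
  Study = c("A", "B", "C"),
  Statistic.reported = c(
    "C-statistic, Hosmer-Lemeshow test, likelihood ratio",
    "C-statistic",
    "c-statistic, Hosmer-Lemeshow test"
  ),
  stringsAsFactors = FALSE
)

long <- Modelling.Data %>%
  # One row per statistic; the column is given as a bare name
  separate_rows(Statistic.reported, sep = ",\\s*") %>%
  # Normalise whitespace and case so identical statistics share one label
  mutate(Statistic.reported = tolower(trimws(Statistic.reported)))

# One bar per statistic, height = number of occurrences
Bar <- ggplot(long, aes(x = Statistic.reported)) +
  geom_bar() +
  theme_classic()
print(Bar)
```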

Data structure and package for a radial dendrogram in R

I'd like to create a radial dendrogram in R, but being new to the software, I don't know if I chose the correct data structure and package.
I've created a YAML file that looks as follows:
Data structure
I know the exact hierarchy of the languages, but I need R to calculate x and y values. I'd use hclust for that, I think?
I found this instruction here for example: https://stats.stackexchange.com/questions/4062/how-to-plot-a-fan-polar-dendrogram-in-r, but it uses the mtcars dataset. I'd just like to know whether it makes sense to set up my data as above or whether I should use a different structure. When I try to import the datasets I get an error message saying I've got more columns than column headers so I must be doing something wrong.

Is shiny a good solution to display a computationally intensive fixed big dataset?

Here is my problem:
I have a big dataset in R, an object of ~500 MB, that I plot with ggplot2.
There are 20 million numeric values to plot along an integer axis, each associated with a 5-level factor used for the color aesthetic.
I would like to set up a web app where users can visualize this dataset, using different filters that rely on the factor to display all the data at once or, for example, a subset corresponding to one level of the factor.
The problem is that rendering the plot takes a long time (~10 minutes).
Solution 1: The best one for the user would be a Shiny UI. But is there a way to have the plot somehow prewritten, thanks to ggplot2 or Shiny tricks, so it can be displayed quickly?
Solution 2: Without Shiny, I would generate different plots of the dataset in advance and build my own UI to let users view the resulting pictures. If I do that, I will have to restrict the possible ways of displaying the data.
Looking forward to advice and discussion.
Ideally, you shouldn't need to plot anything this big really. If you're getting the data from a database then just write a sequence of queries that will aggregate the data on the DB side and drag very little data to output in shiny. Seems to be a bad design on your part.
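The aggregation idea can also be sketched in R itself. This is only an illustration under assumptions: the data frame big, its columns pos, value and group, and the choice of 500 bins are all hypothetical, and the example uses far fewer rows than the 20 million in the question.

```r
library(ggplot2)

# Hypothetical stand-in for the large dataset
set.seed(42)
n <- 1e5
big <- data.frame(
  pos   = sample.int(1e6, n, replace = TRUE),           # integer axis
  value = rnorm(n),                                     # numeric values
  group = factor(sample(letters[1:5], n, replace = TRUE))  # 5-level factor
)

# Assign each position to one of 500 equal-width bins
big$bin <- cut(big$pos, breaks = 500, labels = FALSE)

# One mean value per bin and factor level instead of one point per row
small <- aggregate(value ~ bin + group, data = big, FUN = mean)

p <- ggplot(small, aes(x = bin, y = value, colour = group)) +
  geom_line()
print(p)
```

The aggregated table has at most 500 x 5 rows, so filtering by factor level and redrawing it should be fast compared with plotting every raw value.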
That being said, the author of highcharter package did work on implementing boost.js module to help with plotting millions of points. https://rpubs.com/jbkunst/highcharter-boost.
Also have a look at the bigvis package, which allows 'Exploratory data analysis for large datasets (10-100 million observations)' and has been built by Hadley Wickham: https://github.com/hadley/bigvis. There is a nice presentation about the package at this meetup
Consider the following procedure:
With ggplot2 you can produce an R object.
plot_2_save <- ggplot()
An object can be saved with
saveRDS(object, "file.rds")
and in the shiny server.R you can load this data
plot_from_data <- readRDS("path/.../file.rds")
I used this setup for some kind of text classification with a really (really) huge svm model implemented as an application on shiny-server.
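The procedure above can be sketched end to end as follows. The mtcars plot and the temporary file path are placeholders chosen for illustration; they are not part of the original setup.

```r
library(ggplot2)

# Build the plot object once, outside the app (placeholder data)
plot_2_save <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Save the plot object to disk
path <- tempfile(fileext = ".rds")
saveRDS(plot_2_save, path)

# Later, e.g. at the top of server.R, load it again
plot_from_data <- readRDS(path)
print(plot_from_data)
```

In a Shiny app, plot_from_data could then be returned from renderPlot(). Note that readRDS() restores the plot object; the drawing itself still happens when the plot is displayed.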

R plot data.frame to get more effective overview of data

At work when I want to understand a dataset (I work with portfolio data in life insurance), I would normally use pivot tables in Excel to look at e.g. the development of variables over time or dependencies between variables.
I remembered from university the nice R-function where you can plot every column of a dataframe against every other column like in:
For the dependency between issue.age and duration this plot is actually interesting, because you can clearly see that high issue ages come with shorter policy durations (because there is a maximum age for each policy). However, the plots involving the issue year iss.year are much less "visual". In fact, you can't see anything from them. I would like to see at a glance whether the distribution of issue ages has changed over the different issue years, something like
where you could see immediately that the average age of newly issued policies has been increasing from 2014 to 2016.
I don't want to write code that needs to be customized for every dataset that I put in because then I can also do it faster manually in Excel.
So my question is, is there an easy way to plot each column of a matrix against every other column with more flexible chart types than with the standard plot(data.frame)?
Try the ggpairs() function from the GGally package. It has a lot of capability for visualizing columns of all different types, and provides a lot of control over what to visualize.
For example, here is a snippet from the vignette linked to above:
library(GGally)
data(tips, package = "reshape")
ggpairs(tips)
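Applied to the kind of data in the question, a sketch might look like the following. The portfolio data frame is simulated for illustration; only the column names issue.age, duration and iss.year are taken from the question. Storing iss.year as a factor is an assumption that lets ggpairs() treat it as a grouping variable rather than a number.

```r
library(GGally)
library(ggplot2)

# Simulated stand-in for the portfolio data (hypothetical values)
set.seed(7)
n <- 300
portfolio <- data.frame(
  issue.age = round(runif(n, 20, 70)),
  iss.year  = factor(sample(2014:2016, n, replace = TRUE))
)
# Older issue ages get shorter durations, capped by a maximum age of 85
portfolio$duration <- pmin(round(runif(n, 5, 40)), 85 - portfolio$issue.age)

# Pairwise plots of all columns, coloured by issue year; numeric vs factor
# panels show per-year distributions rather than scatter plots
p <- ggpairs(portfolio, mapping = aes(colour = iss.year))
print(p)
```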
