Plotting difference between datasets using ggplot - r

I'm struggling with something I'm trying to do in R.
I have two datasets with the same (categorical) columns but different values. I want to compare the count of each combination of columns (e.g. male and married, female and single etc) visually.
This is easy enough to do with ggplot's geom_bar for each dataset, and I know I can put the counts for each dataset next to each other by binding them and setting position = "dodge".
My question is whether there's an easy way to plot the difference between the two counts for each pair of variables, and whether there's a way of changing the default 'count' method in geom_count (ironic, I know) to other things (like a proportion or, maybe in this case, a predefined set of values for the difference).
Thanks
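A rough sketch of one way to do this, assuming two data frames df1 and df2 with the same categorical columns (here called sex and marital; all of these names are hypothetical): compute the counts per combination with dplyr, join them, and plot the difference with geom_col(), which plots the values as given rather than counting rows:

library(dplyr)
library(ggplot2)

# counts per combination in each dataset (df1, df2, sex, marital are hypothetical names)
diffs <- full_join(
  count(df1, sex, marital, name = "n1"),
  count(df2, sex, marital, name = "n2"),
  by = c("sex", "marital")
) |>
  mutate(across(c(n1, n2), ~ coalesce(.x, 0L)),  # combinations missing from one dataset count as 0
         diff = n1 - n2)

ggplot(diffs, aes(x = interaction(sex, marital, sep = " / "), y = diff)) +
  geom_col()

geom_col() (equivalently geom_bar(stat = "identity")) is the usual way to plot precomputed values such as differences or proportions instead of the default counts.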

Related

Multiple boxplots in one graph, R

I'm working with a dataset where I have one continuous variable (V1) and want to see how that variable differs depending on demographics such as sex, age group, etc.
I would like to make one graph that contains multiple boxplots, so that V1 is on the y-axis and all my demographic variables (sex, age groups, etc.) are on the x-axis with their corresponding p-values. Anyone know how to do this in R?
I've added two photos to illustrate my dataset and the output I want.
Thanks!
Output example
Data example
It would be nice to have actual data and the code you already have, so we can replicate what you have and work out what you want. That being said, this link might be what you are looking for:
https://statisticsglobe.com/draw-multiple-boxplots-in-one-graph-in-r#example-2-drawing-multiple-boxplots-using-ggplot2-package
Scroll down about half way to Example 4: Drawing Multiple Boxplots for Each Group Side-by-Side
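As a rough sketch (assuming a data frame dat with the numeric V1 and categorical columns sex and age_group; all names hypothetical), one common approach is to reshape the demographic columns to long format and facet by variable:

library(tidyr)
library(ggplot2)

# one row per observation and demographic variable
# (the demographic columns are assumed to share a type, e.g. character)
long <- pivot_longer(dat, cols = c(sex, age_group),
                     names_to = "demographic", values_to = "level")

ggplot(long, aes(x = level, y = V1)) +
  geom_boxplot() +
  facet_wrap(~ demographic, scales = "free_x")

If the p-values are also wanted on the plot, ggpubr::stat_compare_means() can add them to each facet.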

R: binning problem in multiple of consistent width

I have been searching for R cutting or binning packages but I could not quite find what I really want.
I have a dataset of 1000 variables; some columns might have values ranging from 0.01 to 0.2, while others might range from 0 to 2000, and some might contain negative numbers.
I would like to plot a histogram for each variable but with more consistent bin labels, i.e. I would like the bin width to be a multiple of either 1, 2.5 or 5 (for decimal numbers maybe 0.01, 0.02 or 0.05). I am flexible about the number of bins, which can vary between 20 and 40 (they can be fixed if that's easier), and I don't much care about the amount of data in each bin.
The reason for this is that I might get new data for the same variables, and I would like consistent binning for their distributions, and perhaps for model results, in the same bins. There are simply too many variables for me to do this manually.
Any thoughts on how to write a function that returns bins consistent between the old and new data, before the new data arrives?
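A sketch of one possible approach (not from an existing answer): choose a "nice" bin width, i.e. 1, 2.5 or 5 times a power of ten, that puts the variable's range into roughly 20-40 bins, then build breaks aligned to multiples of that width so old and new data for the same variable can share them:

nice_breaks <- function(x, min_bins = 20, max_bins = 40) {
  rng <- range(x, na.rm = TRUE)
  span <- diff(rng)
  # candidate widths: 1, 2.5 and 5 times powers of ten
  widths <- sort(as.vector(outer(c(1, 2.5, 5), 10^(-4:4))))
  n_bins <- span / widths
  ok <- widths[n_bins >= min_bins & n_bins <= max_bins]
  w <- if (length(ok) > 0) max(ok) else widths[which.min(abs(n_bins - max_bins))]
  # align the breaks to multiples of the chosen width
  seq(floor(rng[1] / w) * w, ceiling(rng[2] / w) * w, by = w)
}

# usage: compute the breaks once and reuse them for later data
# b <- nice_breaks(old_data$V1)
# hist(old_data$V1, breaks = b)
# hist(new_data$V1, breaks = b)  # extend the breaks if new values fall outside the old range

Since new data may extend beyond the old range, storing the chosen width per variable (rather than the break vector itself) makes it easy to extend the break sequence later.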

Violin plot in R binning most groups into an "other" category?

I have a dataframe that I am currently grouping by a categorical variable with about a thousand levels. This creates an overly wide chart, and I'm actually not interested in most of the groups, since they are all alike.
What I want is to see only the plots for items whose maximum value is above a threshold, and possibly combine all of the others into an "other" category.
Is there a canned way to do this?
fct_lump() from the forcats package might work.
You would need to process the values to proportions first, but it has the argument prop which "preserves values that occur at least prop of the time" (it groups the rest into an "Other" level).
There is alternatively an n argument for the number of levels to keep (also grouping the rest into an "Other" level).
Here's a bit more information about forcats.
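A minimal sketch with made-up data (the column names group and value are hypothetical), keeping only the most frequent levels and lumping the rest before drawing the violin plot:

library(ggplot2)
library(forcats)

df <- data.frame(
  group = sample(paste0("item", 1:1000), 5000, replace = TRUE),
  value = rnorm(5000)
)

# keep the 10 most frequent levels, collapse the rest into "Other"
df$group_lumped <- fct_lump_n(df$group, n = 10)

ggplot(df, aes(x = group_lumped, y = value)) +
  geom_violin()

To keep levels chosen by a threshold on their maximum value instead of by frequency, one alternative is to compute the keep set yourself and pass it to fct_other().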

R cluster analysis Ward auto deleting outliers

How can I code in R to duplicate a cluster analysis done in SAS that used method=Ward and the TRIM=10 option to automatically delete 10% of the cases as outliers? (This dataset has 45 variables, each with some outlier responses.)
When I searched for R cluster analysis using Ward's method, the trim option was described as something that shortens names rather than something that removes outliers.
If I don't trim the datasets before the cluster analysis, one big cluster emerges with lots of single-case "clusters" representing outlying individuals. With the outlying 10% of cases automatically removed, 3 or 4 meaningful clusters emerge. There are too many variables and cases for me to remove the outliers on a case-by-case basis.
Thanks!
You haven't provided any information on how you want to identify outliers. Assuming the simplest case of removing the top and bottom 5% of cases for every variable (i.e. on a variable-by-variable basis), you could do this with the quantile function.
To illustrate with the example from the link above, you could do something like:
# keep only the values between the 5th and 95th percentiles
duration <- faithful$eruptions
duration[duration <= quantile(duration, 0.95) & duration > quantile(duration, 0.05)]
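Extending that to a whole data frame (here called dat, a hypothetical frame of numeric variables) and then running Ward clustering in base R might look like:

# keep rows that fall inside the 5th-95th percentile on every variable
keep <- Reduce(`&`, lapply(dat, function(v) {
  v >= quantile(v, 0.05) & v <= quantile(v, 0.95)
}))
trimmed <- dat[keep, ]

# Ward's method; "ward.D2" is the hclust option matching the usual Ward criterion
hc <- hclust(dist(scale(trimmed)), method = "ward.D2")
plot(hc)

Note that requiring every variable to be inside its central 90% can remove considerably more than 10% of the cases when there are 45 variables, so this is only a rough approximation of SAS's TRIM=10 behaviour.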

R - Assign observation into the classes (sturges rule)

I have a list of 70 observations (amounts) that I would like to assign to classes (intervals) and then perform some basic calculations on (relative frequency, cumulative frequency, etc.).
My first question is whether there is a function for Sturges' rule (i.e. one that returns the number and width of the classes).
My second question is whether there is a function in R similar to Excel's FREQUENCY function (one that counts the observations per class based on the class borders).
Thanks!
Sturges' rule is the default binning used by the hist function, and the function that computes it is:
?nclass.Sturges
There are various grouping functions in R. I suspect one of cut, table or xtabs may be what you want. (I didn't understand what was meant by "based on classes borders counts the observations per class".) cut gives a vector of the same length, whereas the other two tally the counts, returning a contingency table.
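A small sketch of emulating Excel's FREQUENCY with those pieces (x here just stands in for the 70 amounts):

x <- rnorm(70)                      # placeholder for the 70 observed amounts
k <- nclass.Sturges(x)              # number of classes by Sturges' rule
breaks <- pretty(range(x), n = k)   # class borders
classes <- cut(x, breaks = breaks, include.lowest = TRUE)

freq <- table(classes)              # absolute frequency per class
prop.table(freq)                    # relative frequency
cumsum(freq)                        # cumulative frequency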
