R-Programming: Chart the Z distribution of a factor's frequency

I have reviewed a number of posts regarding histograms/barcharts from categorical data but I still can't seem to progress. I have a data set of names (single column) and each name occurs anywhere from once to 8,000 times. I can create a table with variable and frequency, and I can move that table to a data frame, but no matter what I try I can't even get a barplot, much less a histogram, with the variable on the x axis and frequency on the y axis.
Ultimately, I want to use the table or dataframe with name and frequency to calculate the Z score for each name and then graph the distribution. I can do this easily with a series of numbers but doing it with a categorical variable has me stumped.
thanks,
rms

Is this what you're looking for?
example_data <- data.frame(Name = sample(paste0("Name", 1:15), size = 8000, replace=TRUE, prob = (1:15)/sum(1:15)))
counts <- as.data.frame(table(example_data))
colnames(counts) <- c("Name", "Freq")
library(ggplot2)
ggplot(data = counts, aes(x = Name, y = Freq)) + geom_bar(stat="identity")
For future reference, it's a little easier to answer if you provide a reproducible example, or go into more detail about what you've tried already. Hope this helps!
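To get from the frequency table to the Z scores the question asks about, something along these lines might work. It is a minimal base-R sketch on the same kind of simulated data; the `Z` column name and the plot labels are my own choices, and it assumes "Z score" means each name's count standardized against the mean and standard deviation of all the counts.

```r
# Simulated names, following the answer's example (the data are made up)
set.seed(42)
example_data <- data.frame(
  Name = sample(paste0("Name", 1:15), size = 8000, replace = TRUE,
                prob = (1:15) / sum(1:15))
)

# One row per name with its frequency
counts <- as.data.frame(table(Name = example_data$Name))

# Z score of each name's frequency: (count - mean count) / sd of counts
counts$Z <- (counts$Freq - mean(counts$Freq)) / sd(counts$Freq)

# Z score per name, then the distribution of the Z scores
barplot(counts$Z, names.arg = counts$Name, las = 2,
        ylab = "Z score of frequency")
hist(counts$Z, main = "Distribution of frequency Z scores", xlab = "Z score")
```

The barplot keeps the names on the x axis, while the histogram drops them and only shows how the standardized counts are spread.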

Related

Question about zero values in grouped_ggbetween stats (R)

does anyone know if it's possible to plot a grouped ggbetweenstats (using grouped_ggbetweenstats) plot if some variables in my x-axis hold all zero values for some of the groupings (i.e. it cannot be plotted, but I'd like it to be left blank, or for the graph to add a boxplot/point on the zero mark for those categories)? And if so, how do I do this?
I've tried googling it but have found no answers so far.
This is a relatively complex question to ask without giving any sample data, and if your data is exactly as you describe it, then it is not clear what your problem is.
Suppose we simulate some data for demonstration purposes:
library(ggstatsplot)
set.seed(1)
df <- data.frame(x = rep(paste("Class", LETTERS[1:3]), each = 20),
                 y = rnorm(60, rep(1:3, each = 20)),
                 group = rep(paste("Group", 1:2)))
This gives us 10 random values for each combination of two grouping variables, x, which we plot on the x axis, and group, which we use as the grouping variable. When we plot it looks like this:
grouped_ggbetweenstats(df, x, y, grouping.var = group)
Suppose now that Class B only contains 0 values, which from your description is how your own data is structured.
df$y[df$x == "Class B"] <- 0
But we can still plot the results:
grouped_ggbetweenstats(df, x, y, grouping.var = group)
And the zero-only variable is still plotted with a value of zero, as desired.
Is there some assumption that I have made wrongly?

How do I plot an average of a column subset against another column?

I will preface this by saying that I am a complete R novice and have been asked to do some calculations that are way over my head, so please forgive me in advance if this is not the right way to ask this question!!
I have an R data frame that has 2 columns: one is age (18-80) and the other is a dependent variable that has three possible outcomes (0,1,2). I would like to plot a graph that has x = age and y = the average of the dependent variable by age. I know how to make a simple graph and I know how to calculate the average of my (0,1,2) column by age individually, but it seems really labor-intensive to do that manually for every age from 18 to 80 and then plot that against age in a new data frame that I guess I'd have to make.
How do I find the mean of my dependent variable by subset (age) and then plot it against age?
You could also do this with ggplot:
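library(ggplot2)
# Simulated data: one row per person, y is the 0/1/2 outcome
dat <- data.frame(age = sample(18:80, 250, replace = TRUE),
                  y = sample(0:2, 250, replace = TRUE))
# stat_summary computes the mean of y at each age and draws it as a line
ggplot(dat, aes(x = age, y = y)) +
  stat_summary(fun.data = function(y) data.frame(y = mean(y)),
               geom = "line")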
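If you'd rather have the per-age means as their own data frame (for example, to inspect them before plotting), a base-R sketch could look like this. The data are simulated and `means` is just a name I picked.

```r
# Simulated data in the shape described in the question
set.seed(1)
dat <- data.frame(age = sample(18:80, 250, replace = TRUE),
                  y   = sample(0:2, 250, replace = TRUE))

# Mean of y within each age; the result has one row per age present in the data
means <- aggregate(y ~ age, data = dat, FUN = mean)

# Plot the mean outcome against age
plot(means$age, means$y, type = "l", xlab = "Age", ylab = "Mean of y")
```

`aggregate` does the "mean by subset" step in one call, so there is no need to compute each age by hand.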

plotting two categorical vectors in ggridges

I have a dataset with a few organisms, which I would like to plot on my y-axis, against date, which I would like to plot on the x-axis. However, I want the fluctuation of the curve to represent the abundance of the organisms. I.e I would like to plot a time series with the relative abundance separated by the organism to show similar patterns with time.
However, of course, plotting just date against an organism does not yield any information on the abundance. So, my question is, is there a way to make the curve represent abundance using ggridges?
Here is my code for an example dataset:
set.seed(1)
Data <- data.frame(
Abundance = sample(1:100),
Organism = sample(c("organism1", "organism2"), 100, replace = TRUE)
)
Date = rep(seq(from = as.Date("2016-01-01"), to = as.Date("2016-10-01"), by = 'month'),
           times = 10)
Data <- cbind(Date, Data)
ggplot(Data, aes(x = Abundance, y = Organism)) +
geom_density_ridges(scale=1.15, alpha=0.6, color="grey90")
This produces a plot with the two organisms, but I want the date on the x-axis rather than abundance, and simply swapping them doesn't work. I have read that you need to specify group=Date or change the date into Julian day; however, this doesn't change the fact that I do not get to incorporate abundance into the plot.
Does anyone have an example of a plot with date vs. a categorical variable (i.e. organism) plotted against a continuous variable in ggridges?
I really like the output from ggridges and would like to be able to use it for these visualizations. Thank you in advance for your help!
Cheers,
Anni
To use geom_density_ridges, it'll help to reshape the data to show observations in separate rows, vs. as summarized by Abundance.
library(ggplot2); library(ggridges); library(dplyr)
# Uncount copies the row "Abundance" number of times
Data_sum <- Data %>%
tidyr::uncount(Abundance)
ggplot(Data_sum, aes(x = Date, y = Organism)) +
ggridges::geom_density_ridges(scale=1, alpha=0.6, color="grey90")
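In case it helps to see what the uncount step does to the data, here is a rough base-R equivalent. It is only a sketch on the question's example data, and `Data_long` is a name I made up: each row is repeated as many times as its Abundance value, so the density over Date ends up weighted by abundance.

```r
# Rebuild the example data from the question
set.seed(1)
Data <- data.frame(
  Date = rep(seq(as.Date("2016-01-01"), as.Date("2016-10-01"), by = "month"),
             times = 10),
  Abundance = sample(1:100),
  Organism = sample(c("organism1", "organism2"), 100, replace = TRUE)
)

# Repeat row i Abundance[i] times and keep only the columns needed for plotting
Data_long <- Data[rep(seq_len(nrow(Data)), times = Data$Abundance),
                  c("Date", "Organism")]

# One row per counted individual
nrow(Data_long) == sum(Data$Abundance)
```

The resulting data frame can be passed to the same ggplot call in place of Data_sum.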

Convert absolute values to ranges for charting in R

Warning: still new to R.
I'm trying to construct some charts (specifically, a bubble chart) in R that shows political donations to a campaign. The idea is that the x-axis will show the amount of contributions, the y-axis the number of contributions, and the area of the circles the total amount contributed at this level.
The data looks like this:
CTRIB_NAML CTRIB_NAMF CTRIB_AMT FILER_ID
John Smith $49 123456789
The FILER_ID field is used to filter the data for a particular candidate.
I've used the following functions to convert this data frame into a bubble chart (thanks to help here and here).
vals<-sort(unique(dfr$CTRIB_AMT))
sums<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, sum)
counts<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, length)
symbols(vals,counts, circles=sums, fg="white", bg="red", xlab="Amount of Contribution", ylab="Number of Contributions")
text(vals, counts, sums, cex=0.75)
However, this results in way too many intervals on the x-axis. There are several million records all told, and divided up for some candidates could still result in an overwhelming amount of data. How can I convert the absolute contributions into ranges? For instance, how can I group the vals into ranges, e.g., 0-10, 11-20, 21-30, etc.?
----EDIT----
Following comments, I can convert vals to numeric and then slice into intervals, but I'm not sure then how I combine that back into the bubble chart syntax.
new_vals <- as.numeric(as.character(sub("\\$","",vals)))
new_vals <- cut(new_vals,100)
But regraphing:
symbols(new_vals,counts, circles=sums)
Is nonsensical -- all the values line up at zero on the x-axis.
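Now that you've binned vals into a factor with cut, you can just use tapply again to find the counts and the sums using these new breaks. One catch: tapply needs a grouping factor with one entry per row, so cut the full numeric column rather than the unique vals. For example:
amt = as.numeric(sub("\\$", "", dfr$CTRIB_AMT))
bins = cut(amt, 100)
counts = tapply(amt, bins, length)
sums = tapply(amt, bins, sum)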
For this type of thing, though, you might find the plyr and ggplot2 packages helpful. Here is a complete reproducible example:
require(ggplot2)
require(plyr)  # provides ddply
# Options
n = 1000
breaks = 10
# Generate data
set.seed(12345)
CTRIB_NAML = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_NAMF = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_AMT = paste('$', round(runif(n, 0, 100), 2), sep='')
FILER_ID = replicate(10, paste(as.character((0:9)[sample(9)]), collapse=''))[sample(10, n, replace=T)]
dfr = data.frame(CTRIB_NAML, CTRIB_NAMF, CTRIB_AMT, FILER_ID)
# Format data
dfr$CTRIB_AMT = as.numeric(sub('\\$', '', dfr$CTRIB_AMT))
dfr$CTRIB_AMT_cut = cut(dfr$CTRIB_AMT, breaks)
# Summarize data for plotting
plot_data = ddply(dfr, 'CTRIB_AMT_cut', function(x) data.frame(count=nrow(x), total=sum(x$CTRIB_AMT)))
# Make plot
dev.new(width=4, height=4)
# opts() and theme_text() were removed from ggplot2; theme() and element_text() replace them
qplot(CTRIB_AMT_cut, count, data=plot_data, geom='point', size=total) + theme(axis.text.x=element_text(angle=90, hjust=1))
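If you want to stay with base graphics and symbols(), the binned values can be fed back into the bubble chart roughly like this. It is a sketch under a few assumptions of mine: the amounts are simulated stand-ins for dfr$CTRIB_AMT, the bins are fixed $10 ranges, and each bubble is placed at its bin midpoint (`mids`).

```r
set.seed(12345)
# Hypothetical stand-in for dfr$CTRIB_AMT: dollar strings such as "$49.12"
amt_raw <- paste0("$", round(runif(1000, 0, 100), 2))
amt <- as.numeric(sub("\\$", "", amt_raw))

# Bin every contribution into $10 ranges (one factor entry per record)
edges <- seq(0, 100, by = 10)
bins <- cut(amt, breaks = edges, include.lowest = TRUE)

# Number of contributions and total amount per bin
# (an empty bin would give NA here; the simulated data fill every bin)
counts <- tapply(amt, bins, length)
sums <- tapply(amt, bins, sum)

# Place each bubble at the midpoint of its bin
mids <- head(edges, -1) + 5

# circles= sets the radius, so use sqrt() to make the area track the total
symbols(mids, counts, circles = sqrt(sums), inches = 0.3,
        fg = "white", bg = "red",
        xlab = "Amount of Contribution (bin midpoint, $)",
        ylab = "Number of Contributions")
text(mids, counts, round(sums), cex = 0.75)
```

Using numeric midpoints instead of the factor itself on the x-axis is what stops all the points from collapsing onto one position.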

Subset of data included in more than one ggplot facet

I have a population and a sample of that population. I've made a few plots comparing them using ggplot2 and its faceting option, but it occurred to me that having the sample in its own facet will distort the population plots (however slightly). Is there a way to facet the plots so that all records are in the population plot, and just the sampled records in the second plot?
Matt,
If I understood your question properly - you want to have a faceted plot where one panel contains all of your data, and the subsequent facets contain only a subset of that first plot?
There's probably a cleaner way to do this, but you can create a new data.frame object with the appropriate faceting variable that corresponds to each subset. Consider:
library(ggplot2)
df <- data.frame(x = rnorm(100), y = rnorm(100), sub = sample(letters[1:5], 100, TRUE))
df2 <- rbind(
cbind(df, faceter = "Whole Sample")
, cbind(df[df$sub == "a" ,], faceter = "Subset A")
#other subsets go here...
)
qplot(x,y, data = df2) + facet_wrap(~ faceter)
Let me know if I've misunderstood your question.
-Chase
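Adapted to the population/sample wording of the question, the same idea might look like the sketch below. The `sampled` flag and the `panel` column are hypothetical names; I'm assuming the sampled records can be marked with a logical column on the population data frame.

```r
set.seed(1)
# Hypothetical population with a logical flag marking the sampled records
pop <- data.frame(x = rnorm(100), y = rnorm(100))
pop$sampled <- seq_len(nrow(pop)) %in% sample(nrow(pop), 20)

# Every record goes into the "Population" panel;
# sampled records are duplicated into the "Sample" panel
both <- rbind(
  cbind(pop, panel = "Population"),
  cbind(pop[pop$sampled, ], panel = "Sample")
)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(both, aes(x, y)) + geom_point() + facet_wrap(~ panel))
}
```

Because the sampled rows appear in both panels, the population facet is no longer missing them.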
