A very simple histogram with R? - r

I'm just starting out with R programming. I'm using RStudio for an exam, and I have to represent graphically the results of some calculations on a dataset.
I have a structure like this:
and what I was thinking of doing was making some histograms with the 3 values of the mean for each row, and the same for the median and trimmed mean.
First question: Is this a correct way to represent this kind of data graphically? Or is there a better plot?
Second question: Could someone give me the code to draw a graph with the 3 strings ("Lobby", "R & D", "ROE") on the x axis and, on the y axis, a scale of values that includes the results, so that the histograms show the differences between investment in lobbying, R & D and the ROE obtained?
I hope I've been clear enough; if I haven't specified something relevant, please ask me.

It sounds like you want to do the following. With your data in a csv called bar.csv that has this format:
Dept Mean Median Trimmed_Mean
Lobby 0.008 0.0018 0.0058
R & D 6.25 3.2 4.78
ROE 19.08 16.66 16.276
You can use library(ggplot2) and library(reshape) and the commands listed here
dat.m<-read.csv("bar.csv")
dat.m<-melt(dat.m,id.vars="Dept")
ggplot(dat.m, aes(x = Dept, y = value,fill=variable)) + geom_bar(stat='identity')+
facet_wrap(~ Dept, ncol = 3,scales="free_y") #facet wrapped
ggplot(dat.m, aes(x = Dept, y = value,fill=variable)) + geom_bar(stat='identity')
#stacked bar
to display the graphs below:
As zhaoy says, a histogram (usually) works with raw data, and what you have is summary data. Alternatively, you could use library(ggplot2) to produce a boxplot summary graph like this (using the InsectSprays data set that ships with R):
library(ggplot2)
p<-qplot(spray,count,data=InsectSprays,geom='boxplot')
p<-p+stat_summary(fun=mean,shape=1,col='red',geom='point') # use fun.y=mean on ggplot2 older than 3.3.0
print(p)
Or simply using the standard boxplot command, with the same data, with added functionality to display the means:
boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
means <- tapply(InsectSprays$count,InsectSprays$spray,mean)
points(means,col="red",pch=18)

In response to question 1: The purpose of histograms is to display the density or frequency of continuous data. If you're trying to compare the mean / median / trimmed mean across the 3 categories in the row.name column, I suggest bar graphs. I'm not sure comparing mean / median / trimmed mean in a single graph is coherent to viewers, so it may be ideal to generate 3 bar graphs.
In response to question 2: If you aim to compare the 3 categories in the row.name column using multiple columns of data, I suggest a box-plot. I realize that the box-plot does not traditionally include the mean, but it is one of the best visualizations for comparing data across categories. Please see r-bloggers.com/box-plot-with-r-tutorial for an example.
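As a rough illustration of the "3 bar graphs" suggestion, here is a minimal base-R sketch. It assumes the summary values shown in the other answer (Lobby, R & D, ROE); the object name stats is made up for this example.

```r
# Summary values copied from the table in the answer above
stats <- rbind(
  Mean         = c(0.008, 6.25, 19.08),
  Median       = c(0.0018, 3.2, 16.66),
  Trimmed_Mean = c(0.0058, 4.78, 16.276)
)
colnames(stats) <- c("Lobby", "R & D", "ROE")

# One bar graph per statistic, drawn side by side
op <- par(mfrow = c(1, 3))
for (s in rownames(stats)) {
  barplot(stats[s, ], main = s, ylab = "Value")
}
par(op)  # restore the previous plotting layout
```

Each panel gets its own y scale here, so the very small Lobby values are still hard to see next to ROE; a separate plot for Lobby may be needed.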

Related

Creating multiple density plots using only summary statistics (no raw data) in R

I work with a massive 4D nifti file (x - y - z - subject; MRI data) and due to the size I can't convert to a csv file and open in R. I would like to get a series of overlaying density plots (classic example here) one for each subject with the idea to just visualise that there is not much variance in density distributions across the sample.
I could however, extract summary statistics for each subject (mean, median, SD, range etc. of the variable of interest) and use these to create the density plots (at least for the variables that are normally distributed). Something like this would be fantastic but I am not sure how to do it for density plots.
Your help will be much appreciated.
So these really aren't density plots per se: they are plots of the densities of normal distributions with given means and standard deviations.
That can be done in ggplot2, but you need to expand your table of subjects and summaries into grids of points and normal densities at those points.
Here's an example. First, make up some data, consisting of subject IDs and some simulated sample averages and sample standard deviations.
library(tidyverse)
set.seed(1)
foo <- tibble(Subject = LETTERS[1:10], avg=runif(10, 10,20), stdev=runif(10,1,2)) # tibble() replaces the deprecated data_frame()
Now, for each subject we need to obtain a suitable grid of "x" values along with the normal density (for that subject's avg and stdev) evaluated at those "x" values. I've chosen plus/minus 4 standard deviations. This can be done using do. But that produces a funny data frame with a column consisting of data frames. I use unnest to explode out the data frame.
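bar <- foo %>%
  group_by(Subject) %>%
  # one small data frame per subject: a grid of x values and the normal density at each
  do(densities=tibble(x=seq(.$avg-4*.$stdev, .$avg+4*.$stdev, length.out = 50),
                      density=dnorm(x, .$avg, .$stdev))) %>%
  unnest(densities) # newer tidyr versions expect the list column to be named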
Have a look at bar to see what happened. Now we can use ggplot2 to put all these normal densities on the same plot. I'm guessing with lots of subjects you wouldn't want a legend for the plot.
bar %>%
ggplot(aes(x=x, y=density, color=Subject)) +
geom_line(show.legend = FALSE)
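For readers without the tidyverse, the same expand-to-a-grid idea can be sketched in base R. This is only an illustration under the same made-up data; density_grid is a hypothetical helper name, not part of any package.

```r
set.seed(1)
foo <- data.frame(Subject = LETTERS[1:10],
                  avg = runif(10, 10, 20),
                  stdev = runif(10, 1, 2))

# Hypothetical helper: normal density on a grid of +/- 4 standard deviations
density_grid <- function(subject, avg, stdev, n = 50) {
  x <- seq(avg - 4 * stdev, avg + 4 * stdev, length.out = n)
  data.frame(Subject = subject, x = x, density = dnorm(x, avg, stdev))
}

# One data frame per subject, then stacked into a single long table
pieces <- Map(density_grid, foo$Subject, foo$avg, foo$stdev)
bar <- do.call(rbind, pieces)

# Empty plot sized to hold every curve, then one line per subject
plot(range(bar$x), c(0, max(bar$density)), type = "n",
     xlab = "x", ylab = "density")
for (p in pieces) lines(p$x, p$density)
```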

Combine logistic regression with bar graph for maturity results

I am trying to present the results of a logistic regression analysis for the maturity schedule of a fish species. Below is my reproducible code.
#coded with R version 3.0.2 (2013-09-25)
#Frisbee Sailing
rm(list=ls())
library(ggplot2)
library(FSA)
#generate sample data 1 mature, 0 non mature
m<-rep(c(0,1),each=25)
tl<-seq(31,80, 1)
dat<-data.frame(m,tl)
# add some non mature individuals at random in the middle of df to
#prevent glm.fit: fitted probabilities numerically 0 or 1 occurred error
tl<-sample(50:65, 15)
m<-rep(c(0),each=15)
dat2<-data.frame(tl,m)
#final dataset
data3<-rbind(dat,dat2)
ggplot can produce a logistic regression graph showing each of the data points employed, with the following code:
#plot logistic model
ggplot(data3, aes(x=tl, y=m)) +
stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)+ # ggplot2 >= 2.0 takes family via method.args
geom_point()
I want to combine the probability of being mature at a given size, which is obtained, and plotted with the following code:
#plot proportion of mature
#clump data in 5 cm size classes
l50<-lencat(~tl,data=data3,startcat=30,w=5)
#table of frequency of mature individuals by size
mat<-with(l50, table(LCat, m))
#proportion of mature
mat_prop<-as.data.frame.matrix(prop.table(mat, margin=1))
colnames(mat_prop)<-c("nm", "m")
mat_prop$tl<-as.factor(seq(30,80, 5))
# Bar plot probability mature
ggplot(mat_prop, aes(x=tl,y=m)) +
geom_bar(stat="bin")
What I've been trying to do, with no success, is to make a graph that combines both. Since the axes are the same it should be straightforward, but I can't seem to make it work. I have tried:
ggplot(mat_prop, aes(x=tl,y=m)) +
geom_bar(stat="bin")+
stat_smooth(method="glm", family="binomial", se=FALSE)
but it does not work. Any help would be greatly appreciated. I am new, so I am not able to add the resulting graphs to this post.
I see three problems with your code:
Using stat="bin" in your geom_bar() is inconsistent with giving values for the y-axis (y=m). If you bin, then you count the number of x-values in an interval and use that count as the y-value, so there is no need to map your data to the y-axis.
The data for the glm-plot is in data3, but your combined plot only uses mat_prop.
The x-axes of the two plots are actually not quite the same. In the bar plot, you use a factor variable on the x-axis, making the axis discrete, while in the glm-plot, you use a numeric variable, which leads to a continuous x-axis.
The following code gave a graph combining your two plots:
mat_prop$tl<-seq(30,80, 5)
ggplot(mat_prop, aes(x=tl,y=m)) +
geom_bar(stat="identity") +
geom_point(data=data3) +
geom_smooth(data=data3,aes(x=tl,y=m),method="glm", method.args=list(family="binomial"), se=FALSE) # ggplot2 >= 2.0 takes family via method.args
I could run it after first sourcing your script to define all the variables. The three problems mentioned above are addressed as follows:
I used geom_bar(stat="identity") in order not to use binning in the bar plot.
I use the data argument in geom_point and geom_smooth in order to use the correct data (data3) for these parts of the plot.
I redefine mat_prop$tl to make it numeric. It is then consistent with the column tl in data3, which is numeric as well.
(I also added the points. If you don't want them, just remove geom_point(data=data3).)
The plot looks as follows:
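On the third point above, here is a small base-R sketch of why the factor version of tl cannot share an axis with the numeric tl in data3. The variable names are made up for this illustration.

```r
# The size classes stored as a factor, as in the original bar plot
tl_factor <- as.factor(seq(30, 80, 5))

# Converting a factor directly gives the internal level codes, not the sizes
codes <- as.numeric(tl_factor)

# Going through character recovers the original numeric size classes
sizes <- as.numeric(as.character(tl_factor))

print(codes)
print(sizes)
```

This is why the answer rebuilds mat_prop$tl with seq(30,80, 5) instead of converting the factor directly.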

line graph with 2 categorical variables and 1 continuous in R

I'm quite new to R and statistics in general. I am trying to plot in a line graph 2 categorical variables (part of speech "pos", condition "trcond") and a numerical one (score "totacc") in ggplot2.
> df1<-df[, c("trcond", "subtitle", "pos", "totacc")]
> head(df1)
trcond subtitle pos totacc
7 L New Scene_16 lex 0.250
29 N New Scene_16 lex 0.500
8 L New Scene_25 lex 0.875
30 N New Scene_25 lex 0.666
9 L New Scene_29 lex 1.000
31 N New Scene_29 lex 0.833
I have used this ggplot2 command:
>ggplot(data=summdfo, aes(x=pos, y=totacc, group=trcond, colour=trcond))
+ geom_line() + geom_point()
But it is not working: the graph has coloured (blue and red) dots all over the place and more than just two lines linking them. I would like to post the graph I get, as I lack the words to explain it, but this is my first post and I don't seem to be able to upload pictures.
I would like to get a standard simple 2-line graph such as the blue and red ones in this page (where y=total bill, by x=time (lunch,dinner) grouped by gender): http://www.cookbook-r.com/Graphs/Bar_and_line_graphs_%28ggplot2%29/
Is this possible with my data set at all? If so, what am I doing wrong with the code?
Here I tried to create a data frame based on the limited sample of your data.
df1 <- data.frame(trcond=rep(c('L', 'N'), 3),
subtitle=rep('New Scene_29', 6), # Not in use, just a dummy
pos=c('lex', 'lex', 'lex', 'noLex', 'noLex', 'noLex'),
totacc=c(0.250, 0.5, 0.875, 0.666, 1.000, 0.833))
Because trcond by pos is not balanced in this data frame, the plot is going to be jumbled up like this:
ggplot(data=df1, aes(x=pos, y=totacc, group=trcond, color=trcond))+
geom_line() +
geom_point()
However, if you apply a summary function which will compute means for each condition, a correct plot will appear:
ggplot(data=df1, aes(x=pos, y=totacc, group=trcond, color=trcond))+
geom_line(stat='summary', fun='mean') + # use fun.y='mean' on ggplot2 older than 3.3.0
geom_point(stat='summary', fun='mean')
Again, this is only a guess at what's in your data. It would be best if you provided a sample of your data using dput(head(df1, 50)), so that we can give you a better answer.
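To see what the summary step computes, the group means can also be worked out explicitly in base R. This is a sketch on the same made-up data frame (without the unused subtitle column); means is a name chosen for this example.

```r
df1 <- data.frame(trcond = rep(c("L", "N"), 3),
                  pos    = c("lex", "lex", "lex", "noLex", "noLex", "noLex"),
                  totacc = c(0.250, 0.5, 0.875, 0.666, 1.000, 0.833))

# One mean per combination of condition and part of speech
means <- aggregate(totacc ~ trcond + pos, data = df1, FUN = mean)
print(means)

# Base-R two-line plot: pos on the x axis, one line per trcond
with(means, interaction.plot(factor(pos), factor(trcond), totacc,
                             xlab = "pos", ylab = "mean totacc",
                             trace.label = "trcond"))
```

With one value per condition and position, each line has exactly one point per x category, which is what the jumbled plot was missing.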

Print five point summary values(Min,Q1,Median,Q3,Max) on the boxplot

I am trying to create a simple boxplot with all the labels. I have a dataset about the number of customer visits. It has two columns: Customer ID and AvgVisits.
custID AvgVisits
1 10
2 4
3 12
I want a simple boxplot that is horizontally oriented and displays the five summary points on the graph, with nice colors and axes. I am able to set the heading and make it horizontally oriented, but I am unable to report the summary numbers on the graph itself.
@Henrik's link seems to answer your question. This answer may also be helpful in terms of applying annotation to multiple boxplots on the same graph.
For completeness:
boxplot() will calculate the numbers to plot (the same as fivenum()), which you can verify by storing the result:
AvgVisits <- c(10,4,12)
b1 <- boxplot(AvgVisits)
b1$stats == fivenum(AvgVisits)
Here's a solution with ggplot2 which you may find appealing. Change the values of aes(x=) to move the position up/down (as co-ordinates already flipped).
require(ggplot2)
q1 <- qplot(x=1, b1$stats, geom = "boxplot")
q1 +coord_flip() +
geom_text(aes(x=1.1,y=b1$stats,label=b1$stats)) +
theme( # theme()/element_blank() replace the removed opts()/theme_blank()
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank()
)
Giving:
Use the text() command, with the format text(location, "print this text", pos). pos should be one of the following: 1=below, 2=left, 3=above, 4=right. If you need further assistance please include the code you have so far. More here: http://www.statmethods.net/advgraphs/axes.html
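A minimal base-R sketch of the text() approach, assuming the three example values from the question. The colour, axis label, title and label height are arbitrary choices for illustration.

```r
AvgVisits <- c(10, 4, 12)

# Horizontal boxplot; keep the returned object to reuse its statistics
b1 <- boxplot(AvgVisits, horizontal = TRUE, col = "lightblue",
              xlab = "Average visits", main = "Customer visits")

# b1$stats holds min, lower hinge, median, upper hinge, max (one column per box)
five <- as.vector(b1$stats)

# With horizontal = TRUE the values run along the x axis and the box sits at y = 1,
# so the labels are placed just above the box
text(x = five, y = 1.4, labels = five, pos = 3, xpd = TRUE)
```

Labels for values that lie close together can overlap; nudging y for alternate labels is one way around that.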

Nested tables and calculating summary statistics with confidence intervals in R

This question is about the statistical program R.
Data
I have a data frame, study_data, that has 100 rows, each representing a different person, and three columns, gender, height_category, and freckles. The variable gender is a factor and takes the value of either "male" or "female". The variable height_category is also a factor and takes the value of "tall" or "short". The variable freckles is a continuous, numeric variable that states how many freckles that individual has.
Here are some example data (thanks to Roland for this):
set.seed(42)
DF <- data.frame(gender=sample(c("m","f"),100,T),
height_category=sample(c("tall","short"),100,T),
freckles=runif(100,0,100))
Question 1
I would like to create a nested table that divides these patients into "male" versus "female", further subdivides them into "tall" versus "short", and then calculates the number of patients in each sub-grouping along with the median number of freckles with the lower and upper 95% confidence interval.
Example
The table should look something like what is shown below, where the # signs are replaced with the appropriate calculated results.
gender height_category n median_freckles LCI UCI
male tall # # # #
short # # # #
female tall # # # #
short # # # #
Question 2
Once these results have been calculated, I would then like to create a bar graph. The y axis will be the median number of freckles. The x axis will be divided into male versus female. However, these sections will be subdivided by height category (so there will be a total of four bars in groups of two). I'd like to overlay the 95% confidence bands on top of the bars.
What I've tried
I know that I can make a nested table using the MASS library and xtabs command:
ftable(xtabs(formula = ~ gender + height_category, data = study_data))
However, I'm not sure how to incorporate calculating the median of the number of freckles into this command and then getting it to show up in the summary table. I'm also aware that ggplot2 can be used to make bar graphs, but am not sure how to do this given that I can't calculate the data that I need in the first place.
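You should really provide a reproducible example. Anyway, you may find library(plyr) helpful. Be careful with these confidence intervals: they are t-based intervals around the mean, and the normal approximation behind them (the Central Limit Theorem) is unreliable for small groups (a common rule of thumb is n < 30).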
library(plyr)
# DF is the example data frame from the question
ddply(DF, .(gender, height_category), summarize,
n=length(freckles), median_freckles=median(freckles),
# t-based interval around the mean; the standard error is sd/sqrt(n)
LCI=qt(.025, df=length(freckles) - 1)*sd(freckles)/sqrt(length(freckles))+mean(freckles),
UCI=qt(.975, df=length(freckles) - 1)*sd(freckles)/sqrt(length(freckles))+mean(freckles))
EDIT: I forgot to add the bit on the plot. Assuming we save the previous result as tab:
library(ggplot2)
# tab already has one row per bar, so it can be plotted without melt()
dodge <- position_dodge(width=0.9)
ggplot(tab, aes(fill=height_category, x=gender, y=median_freckles))+
geom_bar(stat="identity", position=dodge) + geom_errorbar(aes(ymax=UCI, ymin=LCI), position=dodge, width=0.25)
set.seed(42)
DF <- data.frame(gender=sample(c("m","f"),100,T),
height_category=sample(c("tall","short"),100,T),
freckles=runif(100,0,100))
library(plyr)
res <- ddply(DF,.(gender,height_category),summarise,
n=length(na.omit(freckles)),
median_freckles=quantile(freckles,0.5,na.rm=TRUE),
LCI=quantile(freckles,0.025,na.rm=TRUE),
UCI=quantile(freckles,0.975,na.rm=TRUE))
library(ggplot2)
p1 <- ggplot(res,aes(x=gender,y=median_freckles,ymin=LCI,ymax=UCI,
group=height_category,fill=height_category)) +
geom_bar(stat="identity",position="dodge") +
geom_errorbar(position="dodge")
print(p1)
#a better plot that doesn't require precalculating the stats
library(Hmisc) # package names are case-sensitive
p2 <- ggplot(DF,aes(x=gender,y=freckles,colour=height_category)) +
stat_summary(fun.data="median_hilow",geom="pointrange",position = position_dodge(width = 0.4))
print(p2)
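Neither answer above computes a confidence interval for the median itself: the first uses a t-interval around the mean, and the second uses quantiles of the raw data. One possible alternative, not taken from either answer, is a percentile bootstrap. The sketch below uses base R only; boot_median_ci is a hypothetical helper and the 2000 resamples are an arbitrary choice.

```r
set.seed(42)
DF <- data.frame(gender = sample(c("m", "f"), 100, TRUE),
                 height_category = sample(c("tall", "short"), 100, TRUE),
                 freckles = runif(100, 0, 100))

# Hypothetical helper: percentile bootstrap interval for the median
boot_median_ci <- function(x, n_boot = 2000, level = 0.95) {
  meds <- replicate(n_boot, median(x[sample.int(length(x), replace = TRUE)]))
  alpha <- (1 - level) / 2
  unname(quantile(meds, c(alpha, 1 - alpha)))
}

# Split freckles by gender and height category, then summarise each group
groups <- split(DF$freckles, list(DF$gender, DF$height_category), drop = TRUE)
res <- do.call(rbind, lapply(names(groups), function(g) {
  x <- groups[[g]]
  ci <- boot_median_ci(x)
  data.frame(group = g, n = length(x), median_freckles = median(x),
             LCI = ci[1], UCI = ci[2])
}))
print(res)
```

The resulting LCI and UCI columns could be passed to geom_errorbar() in the same way as in the plots above.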

Resources