I have something similar. I have a dataset with 22000 values and want to show them in a proper way: a graph for every river, with the fish species caught in this river on the y-axis and the number of fish caught per species on the x-axis.
dat<-file[file$RiverName=="Mississippi",]
boxplot(FishCought ~ FishName, data=dat, cex.axis=0.7, horizontal=TRUE, las=2, col="green", xlab="Abundanz [Ind./ha]")
If I do so, the graph shows all FishName values on the y-axis, but only draws a boxplot for those fish which were caught in this river. How can I get rid of the fish names that weren't caught in this river (to make the graph better-looking)?
Any suggestions?
I'm assuming that FishCought is actually FishCaught... The syntax would be
boxplot(FishCaught ~ FishName, data =
within(subset(file, RiverName=="Mississippi" & FishCaught > 0),
FishName <- factor(FishName)))
subset(file, RiverName=="Mississippi" & FishCaught > 0) selects only the samples you want.
within(...,FishName <- factor(FishName)) returns a data frame with FishName as a categorical variable in which fish not caught in this river are not included as a category (or "factor level" in R parlance).
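A minimal sketch of the same idea on made-up data (the data frame `catches` and its values are invented for illustration): subsetting keeps every original factor level, and re-creating the factor, or calling droplevels(), removes the unused ones.

```r
# Toy data, invented for illustration only
catches <- data.frame(
  RiverName  = c("Mississippi", "Mississippi", "Ohio", "Ohio"),
  FishName   = factor(c("Carp", "Pike", "Trout", "Carp")),
  FishCaught = c(12, 3, 7, 5)
)

miss <- subset(catches, RiverName == "Mississippi" & FishCaught > 0)

# The subset still carries every level, including the unused "Trout"
levels(miss$FishName)

# Re-creating the factor keeps only the levels that actually occur
miss$FishName <- factor(miss$FishName)
levels(miss$FishName)

# droplevels() does the same for all factor columns of a data frame
dropped <- droplevels(subset(catches, RiverName == "Mississippi"))
levels(dropped$FishName)
```

Either form can be passed as the data argument of boxplot(), so only the species present in that river get an axis label.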
Related
I have already searched the web for how to do this, but nothing has worked; there is always an element missing. I am trying to plot a bar plot of mean fish condition (y axis) for males and females (colour/legend) for each age group (x axis). What complicates it further is that I have multiple sites, so I am not sure how to go about it. I have attached something close to what I would like to plot (note that one factor, site, is missing). I have tried a lot of code, but I will provide a sample below. My data is saved as a csv with different columns for sex, site, fish condition, etc.
F8 is the csv sheet that has multiple sites from one year
K is the fish condition column that I want means calculated for
gender.mean <- t(tapply(F8$K,
list(F8$Age1, F8$Sex), mean))
^I am not sure how to factor in the site as well.
I plotted this (code below), which gives me the mean condition for both males and females for each age group, but it is for all sites combined.
barplot(gender.mean, col=c("darkblue","red"), beside=TRUE, legend=c("F","M"))
^For some reason the x axis line (not the numbers) is missing, and the legend is drawn on top of the bars.
In addition, I would like the bars to have standard error bars, but
geom_errorbar( aes(x=name, ymin=value-sd, ymax=value+sd), width=0.4, colour="orange", alpha=0.9, size=1.3)
the above code has not worked.
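geom_errorbar() is a ggplot2 layer, so it cannot be added to a base barplot(). As a rough sketch in base R only, the means and standard errors can be computed per sex, age and site with tapply() and the error bars drawn with arrows(). The column names K, Age1 and Sex follow the question; the Site column name and all values are assumptions made up for this example.

```r
# Toy data standing in for F8; the Site column name is an assumption
set.seed(1)
F8 <- expand.grid(Age1 = 1:3, Sex = c("F", "M"), Site = c("A", "B"), rep = 1:5)
F8$K <- rnorm(nrow(F8), mean = 1, sd = 0.1)

se <- function(x) sd(x) / sqrt(length(x))   # standard error of the mean

# 3-d arrays indexed by Sex x Age x Site
means <- tapply(F8$K, list(F8$Sex, F8$Age1, F8$Site), mean)
ses   <- tapply(F8$K, list(F8$Sex, F8$Age1, F8$Site), se)

# One panel per site
sites <- dimnames(means)[[3]]
par(mfrow = c(1, length(sites)))
for (site in sites) {
  m <- means[, , site]
  s <- ses[, , site]
  mids <- barplot(m, beside = TRUE, col = c("darkblue", "red"),
                  ylim = c(0, max(m + s) * 1.3),   # headroom for the legend
                  main = paste("Site", site), xlab = "Age", ylab = "Mean K",
                  legend.text = rownames(m), args.legend = list(x = "topright"))
  abline(h = 0)                                    # draws the x axis line
  arrows(mids, m - s, mids, m + s, angle = 90, code = 3, length = 0.05)
}
```

Extending ylim leaves room for the legend above the bars, and abline(h = 0) supplies the baseline that barplot() does not draw by default.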
I would like to plot two levels of data (high, low) for two days (day 0, day 1) for both male and female subjects. I have succeeded in faceting by day and by level, but I have not managed to combine these and identify the genders. I would like to show male and female together on day 0 and day 1. Below is the code I have been trying.
Thanks in advance
data <- function(ids,time_vec) {
obs.data <-
data.frame(expand.grid(ids,time_vec),DOSE=0,Conc=rnorm(13,10,2),Day=0)
names(obs.data) <- c("ID","TIME","DOSE","Conc","Day")
obs.data<-obs.data[order(obs.data$ID),]
return(obs.data)
}
test<-data(ids=1:4, time_vec= seq(0,120,10))
test$Gender<-ifelse(test$ID==1|test$ID ==3,"Male","Female")
test$Day<-ifelse(test$ID==1|test$ID==2,"Day 1","Day 2")
test$DoseLevel<-ifelse(test$ID==1|test$ID==2,"Low","High")
library(ggplot2)
gf1<-ggplot(test,aes(x=TIME, y=Conc, group=interaction(DoseLevel,Day,
Gender)))+
geom_line(size=1.25)+
facet_grid(DoseLevel~.,as.table=FALSE)
gf1
# show.legend replaces the older show_guide argument
gf2<-gf1+ geom_point(aes(shape=factor(Day), fill=factor(Day),
colour=factor(Day)),size=4,show.legend=TRUE)+
scale_shape_manual(values=c(21, 21))+
scale_fill_manual(values=c("black","white"))+
scale_colour_manual(values=c("black","black"))
gf2
This link will help you:
Plotting continuous and discrete series in ggplot with facet
Just take care when using the melt function.
I'm working on some fish electroshocking data and looking at fish species abundance per transects in a river. Essentially, I have an abundance of different species per transect that I'm plotting in a stacked bar chart. But, what I would like to do is label the top of the bar, or underneath the x-axis tick mark with N = Total Preds for that particular transect. The abundance being plotted is the number of that particular species divided by the total number of fish (preds) that were caught at that transect. I am having trouble figuring out a way to do this since I don't want to label the plot with the actual y-value that is being plotted.
Excuse the crude code. I am new to R and not very familiar with generating random datasets. The following is what I came up with. Obviously, in my real data the abundance % per transect always adds up to 100 %, but the idea is to be able to label the graph with TotalPreds for a transect.
#random data
Transect<-c(1:20)
Habitat<-c("Sand","Gravel")
Species<-c("Smallmouth","Darter","Rock Bass","Chub")
Abund<-runif(20,0.0,100.0)
TotalPreds<-sample(1:139,20,replace=TRUE)
data<-data.frame(Transect,Habitat,Species,Abund,TotalPreds)
#Generate plot
AbundChart<-ggplot(data=data,aes(x=Transect,y=Abund,fill=Species))
AbundChart+labs(title="Shocking Fish Abundance")+theme_bw()+
scale_y_continuous("Relative Abundance (%)",expand=c(0.02,0),
breaks=seq(0,100,by=20),labels=seq(0,100,by=20))+
scale_x_discrete("Transect",expand=c(0.03,0))+
theme(plot.title=element_text(face='bold',vjust=2,size=25))+
theme(legend.title=element_text(vjust=5,size=15))+
geom_bar(stat="identity",colour="black")+
facet_grid(~Habitat,labeller=label_both,scales="free_x")
I get this plot that I would like to label with TotalPreds as described previously.
Again my plot would have bars that reached 100% for abundance, and in my real data transects 1-10 are gravel and 11-20 are sand. Excuse my poor sample dataset.
*Update
My actual data looks like this:
Variable in this case is the fish species and value is the abundance of that species at that particular electroshocking transect. Total_Preds is repeated when the data moves to a new species, because total preds is indicative of the total preds caught at that particular transect (i.e. each transect only has 1 total preds value). Maybe the melt function wasn't the right way to analyze this, but I have like 17 fish species that were caught at different rates across these 20 transects. I guess habitat type is singular to a transect as well, with 1-10 being gravel and 11-20 being sand, and that is repeated in my dataset across fish species as well.
Edited in response to the update: you should be able to create a new data frame containing the TotalPreds data (not repeated) and use that in geom_text. I can't test this without data, but maybe:
# select non-repeated half of melted data for use in geom_text
textlabels <- data[c(1:19),]
#Generate plot
AbundChart<-ggplot(data=data,aes(x=Transect,y=Abund,fill=Species))
AbundChart+labs(title="Shocking Fish Abundance")+theme_bw()+
scale_y_continuous("Relative Abundance (%)",expand=c(0.02,0),breaks=seq(0,100,by=20),labels=seq(0,100,by=20))+
scale_x_discrete("Transect",expand=c(0.03,0))+
theme(plot.title=element_text(face='bold',vjust=2,size=25))+
theme(legend.title=element_text(vjust=5,size=15))+
geom_bar(stat="identity",colour="black")+
facet_grid(~Habitat,labeller=label_both,scales="free_x") +
geom_text(data = textlabels, aes(x = Transect_ID, y = value, vjust = -0.5,label = TotalPreds))
You might have to play around with different values for vjust to get the labels where you want them.
See the geom_text help page for more info.
Hope that edit works with your data.
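As a hedged sketch of that approach on made-up long-format data (the data frame `long`, the helper columns `label` and `y`, and all values are invented here; the real column names may differ), unique() can collapse the repeated rows to one label row per transect:

```r
# Toy data in long format; column names follow the question's sample code
long <- expand.grid(Transect = 1:4, Species = c("Smallmouth", "Darter"))
long$Habitat <- ifelse(long$Transect <= 2, "Gravel", "Sand")
totals <- c(35, 12, 80, 51)                  # one TotalPreds value per transect
long$TotalPreds <- totals[long$Transect]
long$Abund <- ifelse(long$Species == "Smallmouth", 60, 40)  # sums to 100 per transect

# One row per transect: drop the repeats introduced by melting
textlabels <- unique(long[, c("Transect", "Habitat", "TotalPreds")])
textlabels$label <- paste("N =", textlabels$TotalPreds)
textlabels$y <- 100                          # top of a stacked bar that sums to 100 %

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = factor(Transect), y = Abund)) +
    geom_col(aes(fill = Species), colour = "black") +
    geom_text(data = textlabels, aes(y = y, label = label), vjust = -0.5) +
    facet_grid(~Habitat, scales = "free_x", labeller = label_both) +
    labs(x = "Transect", y = "Relative Abundance (%)")
  print(p)
}
```

Because fill is mapped only inside geom_col(), the label data frame does not need a Species column.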
This question is about the statistical program R.
Data
I have a data frame, study_data, that has 100 rows, each representing a different person, and three columns, gender, height_category, and freckles. The variable gender is a factor and takes the value of either "male" or "female". The variable height_category is also a factor and takes the value of "tall" or "short". The variable freckles is a continuous, numeric variable that states how many freckles that individual has.
Here are some example data (thanks to Roland for this):
set.seed(42)
DF <- data.frame(gender=sample(c("m","f"),100,T),
height_category=sample(c("tall","short"),100,T),
freckles=runif(100,0,100))
Question 1
I would like to create a nested table that divides these patients into "male" versus "female", further subdivides them into "tall" versus "short", and then calculates the number of patients in each sub-grouping along with the median number of freckles with the lower and upper 95% confidence interval.
Example
The table should look something like what is shown below, where the # signs are replaced with the appropriate calculated results.
gender  height_category  n  median_freckles  LCI  UCI
male    tall             #  #                #    #
        short            #  #                #    #
female  tall             #  #                #    #
        short            #  #                #    #
Question 2
Once these results have been calculated, I would then like to create a bar graph. The y axis will be the median number of freckles. The x axis will be divided into male versus female. However, these sections will be subdivided by height category (so there will be a total of four bars in groups of two). I'd like to overlay the 95% confidence bands on top of the bars.
What I've tried
I know that I can make a nested table using the xtabs and ftable commands (both are in the stats package that ships with R):
ftable(xtabs(formula = ~ gender + height_category, data = study_data))
However, I'm not sure how to incorporate calculating the median of the number of freckles into this command and then getting it to show up in the summary table. I'm also aware that ggplot2 can be used to make bar graphs, but am not sure how to do this given that I can't calculate the data that I need in the first place.
You should really provide a reproducible example. Anyway, you may find library(plyr) helpful. Be careful with these confidence intervals: they are t-based intervals around the mean, and the normal approximation behind them is unreliable for small groups (roughly n < 30).
library(plyr)
# t-based 95% interval around the mean; the standard error is sd / sqrt(n)
ddply(study_data, .(gender, height_category), summarize,
n=length(freckles), median_freckles=median(freckles),
LCI=mean(freckles)+qt(.025, df=length(freckles) - 1)*sd(freckles)/sqrt(length(freckles)),
UCI=mean(freckles)+qt(.975, df=length(freckles) - 1)*sd(freckles)/sqrt(length(freckles)))
EDIT: I forgot to add the bit on the plot. Assuming we save the previous result as tab:
library(ggplot2)
# tab already has one row per bar, so no melting is needed
dodge <- position_dodge(width=0.9)
ggplot(tab, aes(fill=height_category, x=gender, y=median_freckles))+
geom_bar(stat="identity", position=dodge) + geom_errorbar(aes(ymax=UCI, ymin=LCI), position=dodge, width=0.25)
set.seed(42)
DF <- data.frame(gender=sample(c("m","f"),100,T),
height_category=sample(c("tall","short"),100,T),
freckles=runif(100,0,100))
library(plyr)
res <- ddply(DF,.(gender,height_category),summarise,
n=length(na.omit(freckles)),
median_freckles=quantile(freckles,0.5,na.rm=TRUE),
LCI=quantile(freckles,0.025,na.rm=TRUE),
UCI=quantile(freckles,0.975,na.rm=TRUE))
library(ggplot2)
p1 <- ggplot(res,aes(x=gender,y=median_freckles,ymin=LCI,ymax=UCI,
group=height_category,fill=height_category)) +
geom_bar(stat="identity",position="dodge") +
geom_errorbar(position="dodge")
print(p1)
#a better plot that doesn't require precalculating the stats
library(Hmisc) # package names are case-sensitive; median_hilow needs Hmisc
p2 <- ggplot(DF,aes(x=gender,y=freckles,colour=height_category)) +
stat_summary(fun.data="median_hilow",geom="pointrange",position = position_dodge(width = 0.4))
print(p2)
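Note that quantile(freckles, 0.025) and quantile(freckles, 0.975) are percentiles of the raw data, not a confidence interval for the median. One possible alternative, sketched here in base R under the assumption that a percentile bootstrap is acceptable for this purpose, resamples each group and takes the quantiles of the bootstrapped medians. The helper boot_median_ci and its arguments are hypothetical names introduced for this example.

```r
set.seed(42)
DF <- data.frame(gender = sample(c("m", "f"), 100, TRUE),
                 height_category = sample(c("tall", "short"), 100, TRUE),
                 freckles = runif(100, 0, 100))

# Percentile bootstrap interval for the median (hypothetical helper).
# Assumes each group has at least two non-missing values.
boot_median_ci <- function(x, n_boot = 2000, level = 0.95) {
  x <- x[!is.na(x)]
  meds <- replicate(n_boot, median(sample(x, replace = TRUE)))
  alpha <- (1 - level) / 2
  ci <- quantile(meds, c(alpha, 1 - alpha), names = FALSE)
  c(n = length(x), median_freckles = median(x), LCI = ci[1], UCI = ci[2])
}

# One vector of freckle counts per gender x height_category group
groups <- split(DF$freckles, list(DF$gender, DF$height_category), drop = TRUE)
res <- as.data.frame(t(sapply(groups, boot_median_ci)))
res
```

The resulting res has one row per subgroup and can be fed to geom_errorbar() in the same way as the precalculated table above, after splitting the row names back into gender and height_category.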
I am trying to get my head around ggplot2, which creates beautiful graphs, as you probably all know :)
I have a dataset with some transactions of sold houses in it (courtesy of: http://support.spatialkey.com/spatialkey-sample-csv-data/ )
I would like to have a line chart that plots the cities on the x axis and 4 lines showing the number of transactions in my datafile per city for each of the 4 home types. Doesn't sound too hard, so I found two ways to do this.
1. using an intermediate table doing the counts and geom_line() to plot the results
2. using geom_freqpoly() on my raw dataframe
The basic charts look the same; however, chart nr. 2 seems to be missing plots for all the 0 values of the counts (e.g. for the cities right of SACRAMENTO, there is no data for Condo, Multi-Family or Unknown (which seems to be missing completely in this graph)).
I personally like the syntax of method number 2 more than that of number 1 (it's a personal thing probably).
So my question is: Am I doing something wrong or is there a method to have the 0 counts also plotted in method 2?
# line chart example
# setup the libraries
library(RCurl) # so we can download a dataset
library(ggplot2) # so we can make nice plots
library(gridExtra) # so we can put plots on a grid
# get the data in from the web straight into a dataframe (all data is from: http://support.spatialkey.com/spatialkey-sample-csv-data/)
data <- read.csv(text=getURL('http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'))
# create a data frame that counts the number of trx per city/type combination
df_city_type<-data.frame(table(data$city,data$type))
# correct the column names in the dataframe
names(df_city_type)<-c('city','type','qty')
# alternative 1: create a ggplot with a geom_line on the calculated values - to show the nr. trx per city (on the x axis) with a different colored line for each type
cline1<-ggplot(df_city_type,aes(x=city,y=qty,group=type,color=type)) + geom_line() + theme(axis.text.x=element_text(angle=90,hjust=0))
# alternative 2: create a ggplot with a geom_freqpoly on the source data - to show the nr. trx per city (on the x axis) with a different colored line for each type
c_line <- ggplot(na.omit(data),aes(city,group=type,color=type))
cline2<- c_line + geom_freqpoly() + theme(axis.text.x=element_text(angle=90,hjust=0))
# plot the two graphs in rows to compare, see that right of SACRAMENTO we miss two lines in plot 2, while they are in plot 1 (and we want them)
myplot<-grid.arrange(cline1,cline2)
As @joran pointed out, this gives a "similar" plot, when using "continuous" values:
ggplot(data, aes(x=as.numeric(factor(city)), group=type, colour=type)) +
geom_freqpoly(binwidth=1)
However, this is not exactly the same (compare the start of the graph), as the breaks are off. Instead of binning from 1 to 39 with a binwidth of 1, it for some reason starts at 0.5 and goes to 39.5.
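For comparison, a small base-R sketch on made-up data (the data frame `sales` and its values are invented for illustration) of why the intermediate-table approach from the question keeps the zero counts: table() tabulates every combination of factor levels, including combinations that never occur.

```r
# Made-up transactions; values are for illustration only
sales <- data.frame(
  city = factor(c("SACRAMENTO", "SACRAMENTO", "ELK GROVE", "GALT")),
  type = factor(c("Residential", "Condo", "Residential", "Residential"))
)

# table() counts every city/type combination, including absent ones
counts <- as.data.frame(table(city = sales$city, type = sales$type),
                        responseName = "qty")

# The combinations that never occur are kept with qty == 0
counts[counts$qty == 0, ]
```

Plotting counts with geom_line(), as in alternative 1, therefore draws each type's line down to zero for cities where that type was not sold.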