Dealing with Factors in R

I have an object that is a factor with a number of levels:
x <- as.factor(c(rep("A",20),rep("B",10),rep("C",15)))
In the shortest manner possible, I would like to use ggplot to create a bar graph of the % frequency of each factor.
I keep finding that there are a lot of little annoyances that get in between summarizing and plotting when I have a factor. Here are a few examples of what I mean by annoyances:
as.data.frame(summary(x))
In the example above, you have to rename the columns, and the first column's values end up as row names. In the next one, you have to cheat to use cast() and then relabel because it defaults to a column name of "(all)".
dat <- as.data.frame(q1$com.preferred)
dat$value <- 1
colnames(dat) <- c("pref", "value")
dat <- cast(dat, pref ~ .)
colnames(dat)[2] <- "value"
Here's another example, somewhat better, but less than ideal.
data.frame(x=names(summary(x)),y=summary(x))
If there's a quick way to do this within ggplot, I'd be more than interested to see it. So far, my biggest problem is changing counts to frequencies.

Following up on @Dirk's and @joran's suggestions (@joran really gets credit; I thought as.data.frame(), and not just data.frame(), was necessary, but it turns out @joran's right):
x <- as.factor(c(rep("A",20),rep("B",10),rep("C",15)))
t1 <- table(x)
t2 <- data.frame(t1)
t3 <- data.frame(prop.table(t1))
qplot(x,Freq,data=t2,geom="bar",ylab="Count")
qplot(x,Freq,data=t3,geom="bar",ylab="Proportion")
Edit: shortened slightly (incorporated @Chase's prop.table() too).
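(A note for current ggplot2 releases: qplot() is deprecated and geom = "bar" no longer accepts a y value directly, so the calls above may error on a recent install. A minimal sketch of the proportion plot using geom_col(), which draws bars whose heights come straight from the data:)
library(ggplot2)
ggplot(t3, aes(x = x, y = Freq)) + geom_col() + ylab("Proportion")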

You can have qplot do the summary work for you without the outside computations; try any of the following:
x <- rep(c('A','B','C'), c(20,10,15))
qplot(x, weight=1/length(x), ylab='Proportion')
qplot(x, weight=100/length(x), ylab='Percent')
qplot(x, weight=1/length(x), ylab='Percent') + scale_y_continuous(formatter='percent')
ggplot(data.frame(x=x),aes(x, weight=1/length(x))) + geom_bar() + ylab('Proportion')
There is probably a way to do this using transformations inside the ggplot functions as well, but I have not found it yet.
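For what it's worth, newer ggplot2 versions (3.3.0 and later) do let you compute the proportion inside the plot call via after_stat(); the old formatter='percent' argument has also been replaced by labels = scales::percent. A sketch:
library(ggplot2)
x <- rep(c('A', 'B', 'C'), c(20, 10, 15))
# after_stat(count) refers to the per-bar count computed by stat_count()
ggplot(data.frame(x = x), aes(x = x, y = after_stat(count / sum(count)))) +
  geom_bar() +
  scale_y_continuous(labels = scales::percent) +
  ylab('Percent')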

Did you try the ggplot equivalent of just calling barplot(table(x)/length(x))? I.e.
R> x <- as.factor(c(rep("A",20),rep("B",10),rep("C",15)))
R> table(x)
x
A B C
20 10 15
which we turn into percentages easily
R> table(x)/length(x)*100
x
A B C
44.4444 22.2222 33.3333
and can then plot
R> barplot(table(x)/length(x)*100)
just fine (plot not shown here).
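Since the question asked for ggplot specifically, here is a sketch of the ggplot2 analogue of that last barplot() call:
library(ggplot2)
tab <- prop.table(table(x)) * 100
pct <- data.frame(x = names(tab), Percent = as.numeric(tab))
ggplot(pct, aes(x = x, y = Percent)) + geom_col()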

Related

Plotting a matrix using ggplot2 in R

I want to use ggplot2 to plot the distribution of 5 variables corresponding to a matrix's column names:
a <- matrix(runif(1:25),ncol=5,nrow=1)
colnames(a) <- c("a","b","c","d","e")
rownames(a) <- c("count")
I tried:
ggplot(data=melt(a),aes(x=colnames(a),y=a[,1]))+ geom_point()
However, this gives a result as if all columns had the same y value.
Note: I'm using the reshape package for the melt() function.
All columns look like they have the same y-value because you are only specifying one number in the y= statement. You are saying y=a[,1], which, if you type a[,1] into your console, you will find is 0.556 (the number everything appears at). I think this is what you want:
library(reshape2)
library(ggplot2)
a_melt<- melt(a)
ggplot(data=a_melt, aes(x=Var2, y=value)) + geom_point()
Note that I saved a new dataset called a_melt so that things were easier to reference. Also, since the data was melted, it is cleaner to define our x-values as the Var2 column of a_melt rather than the columns of a.
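For reference, a fully self-contained version of that (note that runif(1:25) in the question draws 25 numbers of which only 5 are used; runif(5) avoids the warning):
library(reshape2)
library(ggplot2)
a <- matrix(runif(5), ncol = 5, nrow = 1,
            dimnames = list("count", c("a", "b", "c", "d", "e")))
a_melt <- melt(a)   # melt() on a named matrix gives columns Var1, Var2, value
ggplot(a_melt, aes(x = Var2, y = value)) + geom_point()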

Vectorized meta data computation based on multiple columns on R data.frame

I have a data.frame with 3 columns, each of which can be thought of as a factor. I'd like to compute some stats on the data.frame and store it in a new frame. To be more specific, I have the following fields:
obs, len, src
A 10 X
B 10 Y
I'd like to compute the breakdown of each source at each length (i.e. what percentage of observations from source X of length 10 are "A", "B", etc.).
An obvious approach to this is to use two for loops to iterate over the lengths and sources and then use nrow() and count() to get the values I'd need to compute, like so:
relevant_subset <- data[data$src==source & data$len==length,]
breakdown_info <- count(relevant_subset)
breakdown_info$frac <- breakdown_info$freq / nrow(relevant_subset)
Is there a way to avoid using the double for loop and use a more vectorized approach? Is there a smart way to pre-allocate the new frame that would hold the modified breakdown_info for each length and source?
aggregate is your friend for these tasks:
Example data:
set.seed(23)
test <- data.frame(
obs=sample(LETTERS[1:2],20,replace=TRUE),
len=sample(c(10,20),20,replace=TRUE),
src=sample(LETTERS[24:25],20,replace=TRUE)
)
Aggregate it:
aggregate(obs ~ src + len,data=test, function(x) prop.table(table(x)))
src len obs.A obs.B
1 X 10 0.6000000 0.4000000
2 Y 10 0.2000000 0.8000000
3 X 20 0.2500000 0.7500000
4 Y 20 0.1666667 0.8333333
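Note that aggregate() stores the two proportions in a single matrix column named obs; if ordinary columns are needed, wrapping the result in do.call(data.frame, ...) splits it up:
res <- aggregate(obs ~ src + len, data = test, function(x) prop.table(table(x)))
flat <- do.call(data.frame, res)   # plain columns: src, len, obs.A, obs.B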
This is what the plyr package was made for!
The format is <input_type><output_type>ply. For example if the input is a data.frame and you want the output to be a data.frame use ddply.
To use it, you specify the input data.frame, the columns to group by, and then a function that constructs a data.frame from each group. The resulting data.frames, with the grouping columns appended, are assembled into the output data.frame.
In something similar to your example, you could do
require(plyr)
a <- data.frame(
obs=factor(c('A','A','A','B','B')),
len=c(10,10,10,10,210),
src=factor(c('X','X','Y','Y','Z')))
then
z <- ddply(
a,
.(obs),
function(df){
data.frame(mean.len=mean(df$len))
})
would produce
data.frame(
obs=c('A', 'B'),
mean.len=c(10, 110))
while
ddply(a, .(src), function(df){
data.frame(
num.obs.A = sum(df$obs == 'A'),
num.obs.B = sum(df$obs == 'B'))})
would produce
data.frame(
src=c('X','Y', 'Z'),
num.obs.A = c(2,1,0),
num.obs.B = c(0,1,1))
The website http://plyr.had.co.nz/ has good documentation too.
You haven't stated a reason why you want a data.frame as output here. Perhaps it's best for you, perhaps not. You also aren't really clear on which proportions you want, but I think the following might solve your problem best.
prop.table( table(test) )
You could enter it slightly differently and play with the order of columns so that what you want to compare is most easily examined. But, this output is a 3-dimensional array and quite a bit different from a data.frame.
(example of alternate usage)
prop.table(with(test, table(src, obs, len) ))
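And if a data.frame is wanted after all, as.data.frame() on a table returns the long format directly:
as.data.frame(prop.table(table(test)))   # columns: obs, len, src, Freq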

How do I tell R to remove the outlier from a correlation calculation?

How do I tell R to remove an outlier when calculating correlation? I identified a potential outlier from a scatter plot, and am trying to compare correlation with and without this value. This is for an intro stats course; I am just playing with this data to start understanding correlation and outliers.
My data looks like this:
"Australia" 35.2 31794.13
"Austria" 29.1 33699.6
"Canada" 32.6 33375.5
"CzechRepublic" 25.4 20538.5
"Denmark" 24.7 33972.62
...
and so on, for 26 lines of data. I am trying to find the correlation of the first and second numbers.
I did read this question; however, I am only trying to remove a single point, not a percentage of points. Is there a command in R to do this?
You can't do that with the basic cor() function but you can
use a correlation function from one of the robust statistics packages, e.g. covRob() from the robust package
use a winsorize() function, e.g. from robustHD, to treat your data
Here is a quick example for the 2nd approach:
R> set.seed(42)
R> x <- rnorm(100)
R> y <- rnorm(100)
R> cor(x,y) # correlation of two unrelated series: almost zero
[1] 0.0312798
Then we "contaminate" one point in each with a big outlier:
R> x[50] <- y[50] <- 10
R> cor(x,y) # bigger correlation due to one bad data point
[1] 0.534996
So let's winsorize:
R> x <- robustHD::winsorize(x)
R> y <- robustHD::winsorize(y)
R> cor(x,y)
[1] 0.106519
R>
and we're back down to a less correlated measure.
If you apply the same conditional expression to both vectors, you can exclude that "point".
cor( DF[2][ DF[2] > 100 ], # items in 2nd column excluded based on their values
DF[3][ DF[2] > 100 ] ) # items in 3rd col excluded based on the 2nd col values
In the following, I worked from the presumption (reading between your lines) that you have identified that single outlier visually (i.e., from a graph). From your limited data set it's probably easy to identify that point based on its value. If you have more data points, you could use something like this.
tmp <- qqnorm(bi$bias.index)
qqline(bi$bias.index)
(X <- identify(tmp, , labels=rownames(bi)))
qqnorm(bi$bias.index[-X])
qqline(bi$bias.index[-X])
Note that I just copied my own code because I couldn't work from sample code from you. Also check ?identify beforehand.
It makes sense to put all your data on a data frame, so it's easier to handle.
I always like to keep track of outliers by using an extra column (in this case, B) in my data frame.
df <- data.frame(A=c(1,2,3,4,5), B=c(T,T,T,F,T))
And then filter out data I don't want before getting into the good analytical stuff.
myFilter <- with(df, B==T)
df[myFilter, ]
This way, you don't lose track of the outliers, and you are able to manage them as you see fit.
EDIT:
Improving upon my answer above, you could also use conditionals to define the outliers.
df <- data.frame(A=c(1,2,15,1,2))
df$B<- with(df, A > 2)
subset(df, B == F)
You are getting some great and informative answers here, but they seem to be answers to more complex questions. Correct me if I'm wrong, but it sounds like you just want to remove a single observation by hand. Specifying the negative of its index will remove it.
Assuming your dataframe is A and columns are V1 and V2.
WithAus <- cor(A$V1,A$V2)
WithoutAus <- cor(A$V1[-1], A$V2[-1])
or you can remove several indexes. Let's say 1, 5 and 20
ToRemove <- c(-1,-5,-20)
WithAus <- cor(A$V1,A$V2)
WithoutAus <- cor(A$V1[ToRemove], A$V2[ToRemove])

R: summarise data frame with repeating rows into boxplots

I am an R neophyte, with a data frame of database function runtimes with the following data:
> head(data2)
dbfunc runtime
1 fn_slot03_byperson 38.083
2 fn_slot03_byperson 32.396
3 fn_slot03_byperson 41.246
4 fn_slot03_byperson 92.904
5 fn_slot03_byperson 130.512
6 fn_slot03_byperson 113.853
The data covers 127 discrete functions and comprises some 1,940,170 rows.
I would like to:
Summarise the data to only include database functions with a mean runtime of over 100 ms
Produce boxplots of the 25 slowest database functions showing the distribution of runtimes, sorted by slowest first.
I'm particularly stumped by the summary step.
Note: I've also asked this question at stats.stackexchange.com.
Here's one approach using ggplot and plyr. The steps you outlined could be combined to be slightly more efficient, but for learning purposes I'll show you the steps as you asked them.
#Load ggplot2 and plyr, and make some fake data
library(ggplot2)
library(plyr)
dat <- data.frame(dbfunc = rep(letters[1:10], each = 100)
, runtime = runif(1000, max = 300))
#Use plyr to calculate a new variable for the mean runtime by dbfunc and add as
#a new column
dat <- ddply(dat, "dbfunc", transform, meanRunTime = mean(runtime))
#Subset only those dbfunc with mean run times greater than 100. Is this step necessary?
dat.long <- subset(dat, meanRunTime > 100)
#Reorder the levels of the dbfunc variable by mean runtime. Note that reorder()
#accepts a function like mean, so if the subset step above isn't necessary, we can simply
#use that instead.
dat.long$dbfunc <- reorder(dat.long$dbfunc, -dat.long$meanRunTime)
#Subset one more time to get the top *n* dbfunctions based on mean runtime. I chose three here...
dat.plot <- subset(dat.long, dbfunc %in% levels(dbfunc)[1:3])
#Now you have your top three dbfuncs, but a bunch of unused levels hanging out so let's drop them
dat.plot$dbfunc <- droplevels(dat.plot$dbfunc)
#Plotting time!
ggplot(dat.plot, aes(dbfunc, runtime)) +
geom_boxplot()
Like I said, I feel a few of those steps could be combined and made more efficient, but wanted to show you the steps as you outlined them.
The summary step is easy:
attach(data2)
func_mean = tapply(runtime, dbfunc, mean)
ad question 1:
func_mean[func_mean > 100]
ad question 2:
slowest25 = head(sort(func_mean, decreasing = TRUE), n=25)
sl25_data = merge(data.frame(dbfunc = names(slowest25)), data2, sort = F)
plot(sl25_data$runtime ~ sl25_data$dbfunc)
Hope this helps. Note, though, that the boxplots are not sorted in the plot.
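(If the ordering matters, one option, untested against the real data, is to reorder the factor levels by decreasing mean runtime before plotting:)
sl25_data$dbfunc <- reorder(factor(sl25_data$dbfunc), -sl25_data$runtime, FUN = mean)
plot(sl25_data$runtime ~ sl25_data$dbfunc)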
I'm posting this as the 'answer', whereas Tomas's and Chase's answers are in fact more complete. In Chase's case I couldn't get ggplot to operate, and time was short. In Tomas's case I got stuck at the sl25_data step.
We ended up using the following, which works with one remaining problem:
# load data frame
dbruntimes <- read.csv("db_runtimes.csv",sep=',',header=FALSE)
# calc means
meanruns <- aggregate(dbruntimes["runtime"],dbruntimes["dbfunc"],mean)
# filter
topmeanruns <- meanruns[meanruns$runtime>100,]
# order by means
meanruns <- meanruns[rev(order(meanruns$runtime)),]
# get top 25 results
drawfuncs <- meanruns[1:25,"dbfunc"]
# subset for plot
forboxplot <- subset(dbruntimes,dbfunc %in% levels(drawfuncs)[0:25])
# plot
boxplot(forboxplot$runtime~forboxplot$dbfunc)
This gives us the result we are looking for, but all the functions are still shown on the plot's x-axis, rather than just the top 25.
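(The leftover x-axis labels come from unused factor levels surviving the subset; re-factoring, or droplevels() if the column is already a factor, before plotting should take care of it:)
forboxplot$dbfunc <- factor(forboxplot$dbfunc)   # re-factoring drops the unused levels
boxplot(forboxplot$runtime ~ forboxplot$dbfunc)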

Read rows with specific column values

I want to extract a set of rows of an existing dataset:
dataset.x <- dataset[(as.character(dataset$type))=="x",]
however when I run
summary(dataset.x$type)
It displays all types which were present in the original dataset. Basically I get a result that says
x 12354235 #the correct itemcount
y 0
z 0
a 0
...
Not only is the presence of 0 elements ugly, but it also messes up any plot of dataset.x due to the presence of hundreds of entries with the value 0.
Building on Chase's answer, subsetting and dropping unused levels in factors comes up a lot, so it pays to just create your own function by combining droplevels and subset:
subsetDrop <- function(...){droplevels(subset(...))}
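Applied to the data in the question, that would be something like:
dataset.x <- subsetDrop(dataset, type == "x")
summary(dataset.x$type)   # only the "x" level remains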
I'm assuming this is a factor? If so, droplevels() can be used: http://stat.ethz.ch/R-manual/R-patched/library/base/html/droplevels.html
If you add a small reproducible example, it will help others get on the same page and give better advice if this isn't right.
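For example, applied to the subset from the question (droplevels() on a whole data frame drops unused levels in every factor column):
dataset.x <- droplevels(dataset[as.character(dataset$type) == "x", ])
summary(dataset.x$type)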
Others have explained what is happening and how to fix it, I just want to show why it is a desirable default.
Consider the following sample code:
mydata <- data.frame(
x = factor( rep( c(0:5,0:5), c(0,5,10,20,10,5,5,10,20,10,5,0))),
sex = rep( c('F','M'), each=50 ) )
mydata.males <- mydata[ mydata$sex=='M', ]
mydata.males.dropped <- droplevels(mydata.males)
mydata.females <- mydata[ mydata$sex=='F', ]
mydata.females.dropped <- droplevels(mydata.females)
par(mfcol=c(2,2))
barplot(table(mydata.males$x), main='Male', sub='Default')
barplot(table(mydata.females$x), main='Female', sub='Default')
barplot(table(mydata.males.dropped$x), main='Male', sub='Drop')
barplot(table(mydata.females.dropped$x), main='Female', sub='Drop')
This produces a 2x2 grid of barplots (plot not shown here). Now, which is the more meaningful comparison: the two plots on the left, or the two plots on the right?
Instead of dropping unused levels it may be better to rethink what you are doing. If the main goal is to get the count of the x's then you can use sum rather than subsetting and getting the summary. And how meaningful can a plot be on a variable that you have already forced to be a single value?
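For instance:
sum(dataset$type == "x")   # count of rows with type "x", no subsetting needed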
Try
dataset$type <- as.character(dataset$type)
followed by your original code. It's probably just that R is still treating that column as a factor and keeping all of the information about that factor in the column.
