starred bar chart - r

I'm trying to make a simple bar chart that first distinguishes between two groups, say by sex (male or female), and then, after running stats, marks whether each sample/individual has a significant P-value. I know how to color-code the bars by sex, but I want R to automatically put a star above each sample/individual whose P-value is less than, say, 0.05.
I'm currently just using the simple barplot(x) function.
I've tried to look around for answers but haven't found anything for this yet.
Below is a link to my example data set:
DivShare File - test.csv: http://www.divshare.com/download/22797284-187
I'd like to put the time on the y axis, color-code the bars to distinguish between male and female, and then, for individuals in either group who have a 1 under significance, put a star above their corresponding bar.
Thanks for any suggestions in advance.

I messed with your data a bit to make it friendlier:
## dput(read.csv("barcharttest.csv"))
x <- structure(list(ID = 1:7,
sex = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 2L), .Label = c("female", "male"),
class = "factor"),
val = c(309L, 192L, 384L, 27L, 28L, 245L, 183L),
stat = structure(c(1L, 2L, 2L, 1L, 2L, 1L, 1L), .Label = c("NS", "sig"),
class = "factor")),
.Names = c("ID", "sex", "val", "stat"),
class = "data.frame", row.names = c(NA, -7L))
Which looks like this:
ID sex val stat
1 1 female 309 NS
2 2 female 192 sig
3 3 female 384 sig
4 4 male 27 NS
5 5 male 28 sig
6 6 female 245 NS
7 7 male 183 NS
Now the plot:
sexcols <- c("pink","blue")
## png("barplot.png") ## for output graph
par(las=1,bty="l") ## I prefer these settings; see ?par
b <- with(x,barplot(val,col=sexcols[sex])) ## b saves x coords of bars
legend("topright",levels(x$sex),fill=sexcols,bty="n")
## use xpd=NA to make sure that star on tallest bar doesn't get clipped;
## pos=3 puts the text above the (x,y) location specified
text(b,x$val,ifelse(x$stat=="sig","*",""),pos=3,cex=2,xpd=NA)
axis(side=1,at=b,labels=x$ID) ## spell out labels= rather than relying on partial matching
## dev.off()
I should also add "Time" and "ID" labels on the relevant axes.
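A minimal sketch of adding those labels, assuming base graphics and the same data as above. Here xlab and ylab are passed straight to barplot(), and names.arg stands in for the separate axis() call. The ylim padding is an assumption added here to leave room for the stars.

```r
## same data as above, rebuilt with data.frame() for brevity
x <- data.frame(
  ID   = 1:7,
  sex  = factor(c("female", "female", "female", "male", "male", "female", "male")),
  val  = c(309, 192, 384, 27, 28, 245, 183),
  stat = factor(c("NS", "sig", "sig", "NS", "sig", "NS", "NS"))
)
sexcols <- c("pink", "blue")
par(las = 1, bty = "l")
## names.arg labels each bar with its ID; ylim is padded (an assumption)
## so the star above the tallest bar has some headroom
b <- barplot(x$val, col = sexcols[x$sex], names.arg = x$ID,
             xlab = "ID", ylab = "Time", ylim = c(0, max(x$val) * 1.15))
legend("topright", levels(x$sex), fill = sexcols, bty = "n")
## only draw stars where stat is "sig"
sig <- x$stat == "sig"
text(b[sig], x$val[sig], "*", pos = 3, cex = 2, xpd = NA)
```

Subsetting with sig avoids drawing empty strings; the ifelse() version above gives the same picture.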


How to create a new dataset with aggregated values by month in r? [duplicate]

My data set "data1" somewhat looks like this
Price class
243 1
32 2
45 3
245 1
67 2
343 3
567 1
.
.
and so on; the class column repeats 1, 2, 3 continuously until the end of the data (298 observations).
I want to aggregate it so that I get the mean of each class. The result should be in a new dataset, "classdata", and look like this:
class column_name
1 mean of all class 1 prices
2 mean of all class 2 prices
3 mean of all class 3 prices
I tried this code
classdata = aggregate(x=data1$Price, by=list(data1$class), FUN="mean")
But I am not getting the desired result. Please help.
You probably want proper column names. To get them, also put x= into a list, and name the list elements in both arguments.
aggregate(x=list(column_name=data1$Price), by=list(class=data1$class), FUN="mean")
# class column_name
# 1 1 351.6667
# 2 2 49.5000
# 3 3 194.0000
Data:
data1 <- structure(list(Price = c(243L, 32L, 45L, 245L, 67L, 343L, 567L
), class = c(1L, 2L, 3L, 1L, 2L, 3L, 1L)), class = "data.frame", row.names = c(NA,
-7L))
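As a hedged alternative, aggregate() also has a formula interface that keeps the original column names without wrapping anything in list(). A minimal sketch, assuming the same seven-row sample; the rename step is only needed if you want a name other than Price:

```r
data1 <- data.frame(Price = c(243, 32, 45, 245, 67, 343, 567),
                    class = c(1, 2, 3, 1, 2, 3, 1))
# formula interface: mean of Price within each class
classdata <- aggregate(Price ~ class, data = data1, FUN = mean)
# optional: rename the mean column
names(classdata)[names(classdata) == "Price"] <- "column_name"
classdata
```

This gives the same three means as the list() version above.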
Welcome to Stack Overflow. Another option is to use the tidyverse data processing model:
# use the data jay.sf made
data1 <- structure(list(Price = c(243L, 32L, 45L, 245L, 67L, 343L, 567L),
class = c(1L, 2L, 3L, 1L, 2L, 3L, 1L)),
class = "data.frame", row.names = c(NA, -7L))
library(tidyverse)
data1 %>% # start with sample data and pipe it to the next line
group_by(class) %>% # group the data by class and pipe it to the next line
summarise(`The Mean Price` = mean(Price)) # Make a variable called "The
# Mean Price" holding the mean of
# the price variable.

How to order contingency table based on data order?

Given
Group ss
B male
B male
B female
A male
A female
X male
Then
tab <- table(res$Group, res$ss)
I want the group column to be in the order B, A, X, as it appears in the data. Currently it's in alphabetical order, which is not what I want. This is what I want:
MALE FEMALE
B 5 5
A 5 10
X 10 12
If you arrange the factor levels based on the order you want, you'll get the desired result.
res$Group <- factor(res$Group, levels = c('B', 'A', 'X'))
#If it is based on occurrence in Group column we can use
#res$Group <- factor(res$Group, levels = unique(res$Group))
table(res$Group, res$ss)
#Or just
#table(res)
# female male
# B 1 2
# A 1 1
# X 0 1
data
res <- structure(list(Group = structure(c(2L, 2L, 2L, 1L, 1L, 3L),
.Label = c("A", "B", "X"), class = "factor"), ss = structure(c(2L, 2L, 1L, 2L,
1L, 2L), .Label = c("female", "male"), class = "factor")),
class = "data.frame", row.names = c(NA, -6L))
unique returns the unique elements of a vector in the order they occur. A table can be ordered like any other structure by extracting its elements in the order you want. So if you pass the output of unique to [,] then you'll get the table sorted in the order of occurrence of the vector.
tab <- table(res$Group, res$ss)[unique(res$Group),]
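One caveat, offered as a hedged variant rather than a correction: when Group is a factor, unique() returns a factor, and indexing with a factor uses its integer codes, not its labels. That works here because the table rows are in level order, but converting to character makes the lookup by row name explicit. A minimal sketch, assuming the res data from the answer above:

```r
res <- data.frame(
  Group = factor(c("B", "B", "B", "A", "A", "X")),
  ss    = factor(c("male", "male", "female", "male", "female", "male"))
)
# order of first appearance, as plain strings
ord <- as.character(unique(res$Group))
# index rows by name; the trailing comma keeps all columns
tab <- table(res$Group, res$ss)[ord, ]
tab
```

The rows come out as B, A, X whether Group is stored as a factor or as character.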

Correlation for multiple categorical variables tableau

Question updated!!
I have 15 columns of categorical variables and I want the correlation among them. The data set is 20,000+ long and the data set looks like this:
state | job | hair_color | car_color | marital_status
NY | cs | brown | blue | s
FL | mt | black | blue | d
NY | md | blond | white | m
NY | cs | brown | red | s
Notice that in the 1st and last rows, NY, cs, and s repeat. I want to find that kind of pattern: NY and cs are highly correlated. I need to rank the combinations of values across the columns. Please note that this is NOT about counting NY or cs on their own; it's about finding out how many times, e.g., NY and blond appear together in the same row. I need to do that for all values, by row. I hope the question makes sense now.
I tried to utilize cor() with R but since these are categorical variables the function doesn't work. How can I work with this data set to find the correlation among them?
You may wish to refer to Ways to calculate similarity. Suppose your data is
d <- structure(list(state = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("FL",
"NY"), class = "factor"), job = structure(c(2L, 1L, 4L, 3L, 2L
), .Label = c("bs", "cs", "md", "mt"), class = "factor"), hair_color = structure(c(3L,
3L, 1L, 2L, 3L), .Label = c("black", "blond", "brown"), class = "factor"),
car_color = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("blue",
"red", "white"), class = "factor"), marital_status = structure(c(3L,
1L, 1L, 2L, 3L), .Label = c("d", "m", "s"), class = "factor")), .Names = c("state",
"job", "hair_color", "car_color", "marital_status"), class = "data.frame", row.names = c(NA,
-5L))
Data:
> d
state job hair_color car_color marital_status
1 NY cs brown blue s
2 FL bs brown red d
3 FL mt black blue d
4 NY md blond white m
5 NY cs brown red s
We can calculate the "dissimilarities" between observations:
library(cluster)
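## all columns are factors, so daisy() uses Gower's coefficient whatever metric is requested;
## asking for "gower" explicitly just makes that visible
daisy(d, metric = "gower")
Output:
> daisy(d, metric = "gower")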
Dissimilarities :
1 2 3 4
2 0.8
3 0.8 0.6
4 0.8 1.0 1.0
5 0.2 0.6 1.0 0.8
Metric : mixed ; Types = N, N, N, N, N
Number of objects : 5
which tells us that observations 1 and 5 are least dissimilar. With many observations, it is obviously impossible to visually inspect the dissimilarity matrix, but we can filter out the pairs that fall below a certain threshold, e.g.
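out <- daisy(d, metric = "gower")
## a dist object stores the lower triangle column by column: (2,1), (3,1), ..., (5,4)
pairs <- expand.grid(2:5, 1:4)
pairs <- pairs[pairs[,1]>pairs[,2],] ## keep row > column only, so rows line up with out
similars <- pairs[which(out<.8),]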
Given a threshold of 0.8,
> similars
Var1 Var2
4 5 1
6 3 2
8 5 2
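The question also asks how often two values appear together in the same row. That is a different measure from the row dissimilarities above, and one way to get at it is a plain cross-tabulation of every pair of columns. A minimal sketch in base R, assuming the five-row sample d; the names col_pairs and cooc are made up for illustration:

```r
d <- data.frame(
  state          = c("NY", "FL", "FL", "NY", "NY"),
  job            = c("cs", "bs", "mt", "md", "cs"),
  hair_color     = c("brown", "brown", "black", "blond", "brown"),
  car_color      = c("blue", "red", "blue", "white", "red"),
  marital_status = c("s", "d", "d", "m", "s"),
  stringsAsFactors = TRUE
)
# every unordered pair of column names
col_pairs <- combn(names(d), 2, simplify = FALSE)
# cross-tabulate each pair and stack the counts in long format
cooc <- do.call(rbind, lapply(col_pairs, function(p) {
  counts <- as.data.frame(table(d[[p[1]]], d[[p[2]]]), stringsAsFactors = FALSE)
  data.frame(col1 = p[1], value1 = counts$Var1,
             col2 = p[2], value2 = counts$Var2,
             n = counts$Freq, stringsAsFactors = FALSE)
}))
# drop combinations that never occur and rank the rest
cooc <- cooc[cooc$n > 0, ]
cooc <- cooc[order(-cooc$n), ]
head(cooc)
```

With 15 columns this produces 105 cross-tabulations, which should be manageable at 20,000 rows, but I have not timed it.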

Multiple aggregations (categorical and numeric) with dplyr in one chain

I had a problem today figuring out a way to do an aggregation in dplyr in R but for some reason was unable to come up with a solution (although I think this should be quite easy).
I have a data set like this:
structure(list(date = structure(c(16431, 16431, 16431, 16432,
16432, 16432, 16433, 16433, 16433), class = "Date"), colour = structure(c(3L,
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L), .Label = c("blue", "green",
"red"), class = "factor"), shape = structure(c(2L, 2L, 3L, 3L,
3L, 2L, 1L, 1L, 1L), .Label = c("circle", "square", "triangle"
), class = "factor"), value = c(100, 130, 100, 180, 125, 190,
120, 100, 140)), .Names = c("date", "colour", "shape", "value"
), row.names = c(NA, -9L), class = "data.frame")
which shows like this:
date colour shape value
1 2014-12-27 red square 100
2 2014-12-27 blue square 130
3 2014-12-27 blue triangle 100
4 2014-12-28 green triangle 180
5 2014-12-28 green triangle 125
6 2014-12-28 red square 190
7 2014-12-29 red circle 120
8 2014-12-29 blue circle 100
9 2014-12-29 blue circle 140
My goal is to calculate the most frequent colour, shape and the mean value per day. My expected output is the following:
date colour shape value
1 27/12/2014 blue square 110
2 28/12/2014 green triangle 165
3 29/12/2014 blue circle 120
I ended up doing it using split and writing my own function to calculate the above for a data.frame, then used snow::clusterApply to run it in parallel. It was efficient enough (my original dataset is about 10M rows long) but I am wondering whether this can happen in one chain using dplyr. Efficiency is really important for this so being able to run it in one chain is quite important.
You could do
dat %>% group_by(date) %>%
summarize(colour = names(which.max(table(colour))),
shape = names(which.max(table(shape))),
value = mean(value))
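To see what names(which.max(table(...))) does on its own, here is a minimal sketch in base R. The helper name most_frequent is made up for illustration, and the split()/lapply() loop only stands in for the group_by() step so the example runs without dplyr. Note the tie behaviour: table() sorts the values and which.max() takes the first maximum, so ties go to the alphabetically first value.

```r
# hypothetical helper: most frequent value of a vector, as a string
most_frequent <- function(v) names(which.max(table(v)))

dat <- data.frame(
  date   = as.Date(rep(c("2014-12-27", "2014-12-28", "2014-12-29"), each = 3)),
  colour = c("red", "blue", "blue", "green", "green", "red", "red", "blue", "blue"),
  shape  = c("square", "square", "triangle", "triangle", "triangle", "square",
             "circle", "circle", "circle"),
  value  = c(100, 130, 100, 180, 125, 190, 120, 100, 140)
)
# one summary row per date
res <- do.call(rbind, lapply(split(dat, dat$date), function(g) {
  data.frame(date = g$date[1],
             colour = most_frequent(g$colour),
             shape  = most_frequent(g$shape),
             value  = mean(g$value))
}))
res
```

The same helper can be dropped into summarize() in the chain above, e.g. colour = most_frequent(colour).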

Plot Bar Chart in R

I have some data with, for example, two columns. The first column is continuous; the second is a binary value (t|f). I want to plot this as a bar chart in R. I want to group the numbers in the first column into categories like 0-100, 101-200, ..., and then plot the number of t's on the y axis. I'm using ggplot2, but I'm not clear on how to group the x-axis data.
1 123 t
2 145 t
3 222 t
4 345 f
5 455 t
6 567 t
7 245 t
8 300 t
9 150 t
10 600 t
11 333 t
First, here's your sample data in a data.frame
dd<-structure(list(V1 = 1:11, V2 = c(123L, 145L, 222L, 345L, 455L,
567L, 245L, 300L, 150L, 600L, 333L), V3 = structure(c(2L, 2L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
.Label = c("f", "t"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -11L))
Here's a strategy for plotting
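library(ggplot2)
ggplot(dd, aes(x=cut(V2, breaks=c(0,1:9*100)), weight=as.numeric(V3=="t"))) +
geom_bar() + xlab("value")
We define x and weight in the aes(). We use cut() to break up the numbers into ranges. Then we use weight to turn each row into a zero/one value; geom_bar() sums these within each range, so the bar height is the number of t's. (Older ggplot2 versions accepted stat="bin" here; current versions need the default count stat for a discrete x.)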
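To check the bar heights without plotting, the same binning can be done in base R. A minimal sketch, assuming the sample data above; bins and t_counts are names chosen for illustration:

```r
dd <- data.frame(
  V2 = c(123, 145, 222, 345, 455, 567, 245, 300, 150, 600, 333),
  V3 = c("t", "t", "t", "f", "t", "t", "t", "t", "t", "t", "t")
)
# right-closed bins: (0,100], (100,200], ..., (800,900]
bins <- cut(dd$V2, breaks = seq(0, 900, by = 100))
# count only the rows where V3 is "t"; empty bins show up as 0
t_counts <- table(bins[dd$V3 == "t"])
t_counts
```

For this sample, (100,200] and (200,300] each hold three t's, and (300,400] holds one, because 345 is an f.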
