How to plot profiles in R with ggplot2

I have a large data set with protein IDs and corresponding abundance profiles across a number of gel fractions. I want to plot these profiles of abundances across the fractions.
The data looks like this
IDs<- c("prot1", "prot2", "prot3", "prot4")
fraction1 <- c(3,4,2,4)
fraction2<- c(1,2,4,1)
fraction3<- c(6,4,6,2)
plotdata<-data.frame(IDs, fraction1, fraction2, fraction3)
> plotdata
IDs fraction1 fraction2 fraction3
1 prot1 3 1 6
2 prot2 4 2 4
3 prot3 2 4 6
4 prot4 4 1 2
I want it to look like this:
Every protein has a profile. Every fraction has a corresponding abundance value per protein. I want to have multiple proteins per plot.
I tried figuring out ggplot2 using the cheat sheet and failed. I don't know what the input df should look like and what method I should use to get these profiles.
I would use excel, but a bug draws the wrong profile of my data depending on order of data, so I can't trust it to do what I want.

First, you'll have to reorganize your data.frame into long format for ggplot2. You can do it in one step with reshape2::melt. The variable.name and value.name arguments of melt let you change the default 'variable' and 'value' column names.
library(reshape2)
library(dplyr)
library(ggplot2)
data2 <- melt(plotdata, id.vars = "IDs")
Then, we'll group the data by protein:
data2 <- group_by(data2, IDs)
Finally, you can plot it quite simply:
ggplot(data2) +
geom_line(aes(variable, value, group = IDs,
color = IDs))
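As a hedged illustration of the renaming mentioned above, the sketch below rebuilds the question's plotdata and passes variable.name and value.name to melt(). The column names fraction and abundance are chosen here for illustration and are not from the original answer.

```r
library(reshape2)
library(ggplot2)

# Example data from the question
plotdata <- data.frame(
  IDs = c("prot1", "prot2", "prot3", "prot4"),
  fraction1 = c(3, 4, 2, 4),
  fraction2 = c(1, 2, 4, 1),
  fraction3 = c(6, 4, 6, 2)
)

# Wide to long: one row per protein/fraction pair.
# "fraction" and "abundance" are illustrative names, not from the original answer.
long <- melt(plotdata, id.vars = "IDs",
             variable.name = "fraction", value.name = "abundance")

# One line per protein; group = IDs tells ggplot2 which points to connect
p <- ggplot(long, aes(fraction, abundance, group = IDs, color = IDs)) +
  geom_line()
print(p)
```

The long data frame has 12 rows (4 proteins times 3 fractions), which is the shape ggplot2 expects when several profiles share one plot.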

Related

How to plot the numbers in each of 10 columns for one row in a line chart

This seems so simple. I can easily do it in Excel but I want to automate the process through R. I have installed ggplot2. Using RStudio I have read in my CSV file.
The resulting data frame has over 200 rows, each a town in New Hampshire. The first column is titled "Town" and each row below that has the text name of the town, (e.g., "Concord" or "Lancaster"). Column 2 contains a number for each town (spending per elementary school pupil) and the title of that column in the CSV file is "01/02 Elem PPE" - but it shows as "X01.02.Elem.PPE" when using View(). Column 3 has similar numbers for each town and its title in View() is "X02.03.Elem.PPE". Columns 4 through 11 are similar.
I just want to plot a line graph of the numbers in columns 2-11 for one row (one town). It will show how the spending per pupil has changed in that town over time. There must be a simple way to do this, but I can't find it.
Please help. I am a 77 year old with some programming experience 3-5 decades ago but new to R and Rstudio only yesterday.
First, I'll make some new data that mimics yours. It should have more or less the same properties.
library(glue)
library(tidyverse)
set.seed(4314)
mat <- matrix(rpois(40, 5000), ncol=10)
colnames(mat) <- glue("X{sprintf('%2.0f', 1:10)}.{sprintf('%2.0f', 2:11)}.Elem.PPE", sep="") %>%
gsub(". ", ".0", ., fixed=TRUE) %>%
gsub("X ", "X0", ., fixed=TRUE)
df <- tibble(town = c("Concord", "Lancaster", "Manchester", "Nashua"))
df <- bind_cols(df, as_tibble(mat))
Now, this is where you would start. I'm going to assume that you read your CSV into an object called df. The first thing you should do to make plotting easier is to pivot the data from wide form (one row and 10 columns per town) to long form (10 rows per town, with a single column of values). I'm going to save this in an object called df2. The pivot_longer function is in the tidyr package. After the piped data frame, its first argument selects the columns to change from wide to long form; in this case, that's everything except town. Then you tell it a variable name for the column names and a variable name for the values. Finally, I'm using a couple of regular expressions to go from X01.02.Elem.PPE to 01/02 for plotting purposes.
df2 <- df %>%
pivot_longer(-town, names_to="time", values_to="val") %>%
mutate(time = gsub("X(.*)\\.Elem\\.PPE", "\\1", time),
time = gsub("\\.", "/", time))
The resulting data frame looks like this:
# # A tibble: 40 x 3
# town time val
# <chr> <chr> <int>
# 1 Concord 01/02 4965
# 2 Concord 02/03 4953
# 3 Concord 03/04 5066
# 4 Concord 04/05 5100
# 5 Concord 05/06 4979
# 6 Concord 06/07 5090
# 7 Concord 07/08 5136
# 8 Concord 08/09 5076
# 9 Concord 09/10 5079
# 10 Concord 10/11 4945
Next, we can make a plot for a single place (before we think about automation). Let's try Concord. First, we'll save the values that we want to put on the x-axis:
xlabs <- unique(df2$time)
Next, we can use ggplot() to make the plot. In the code below, we're first piping the data frame to a filter that will pull out the values for a single town. The filtered data frame is piped into the ggplot() function. Since time in the data frame is a character vector, we need to turn it into a factor and then into a numeric to make the line plot. We add the line geometry to plot the line. Then we change the x-axis labels with scale_x_continuous(). The labs() function changes the axis labels for the x- and y-axes. Finally, ggtitle() puts the title at the top of the plot. I also like theme_bw() rather than the gray background, but that's entirely a matter of personal preference. The resulting plot looks like this:
df2 %>% filter(town == "Concord") %>%
ggplot(aes(x=as.numeric(as.factor(time)), y=val)) +
geom_line() +
scale_x_continuous(breaks=1:10, labels = xlabs) +
labs(x="Time", y="Spending per Pupil") +
ggtitle("Concord") +
theme_bw()
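To see why the as.numeric(as.factor(time)) conversion produces usable x positions, here is a small base R sketch with illustrative labels in the same format as df2$time. It assumes the zero-padded labels sort alphabetically in chronological order, which holds for labels such as 01/02 through 10/11.

```r
# Labels in the same format as df2$time (values here are illustrative)
time <- c("01/02", "02/03", "03/04", "01/02")

# as.factor() sorts the unique labels alphabetically to form the levels;
# as.numeric() then returns each label's position among those levels
f <- as.factor(time)
levels(f)
pos <- as.numeric(f)
pos
```

Those integer positions are what scale_x_continuous() receives as breaks, and the original text labels are put back through its labels argument.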
Now, the next part you mentioned was automation: you want to do this for every row of the original data frame. We could do that as follows. First, untown grabs the unique values of town from the data. The for() loop runs from 1 to the number of values in untown. Where the previous plot had "Concord", we now have untown[i]. We also call ggsave() at the end, building the file name from the town name and .png. This saves a separate plot for each town in R's working directory.
untown <- unique(df2$town)
for(i in seq_along(untown)){
  # build the plot for one town and keep it in p
  p <- df2 %>% filter(town == untown[i]) %>%
    ggplot(aes(x=as.numeric(as.factor(time)), y=val)) +
    geom_line() +
    scale_x_continuous(breaks=1:10, labels = xlabs) +
    labs(x="Time", y="Spending per Pupil") +
    ggtitle(untown[i]) +
    theme_bw()
  # pass the plot explicitly rather than relying on the last plot created
  ggsave(glue("{untown[i]}.png"), plot = p, width=9, height=6)
}

How to create correlation using dplyr in R studio

I have a data set with 3 attributes (organization hierarchy region-area-territory, territory is the lowest grain) plus two numeric fields (sales qty and headcount).
How do I generate correlation between sales qty and territory headcount, and display the correlation by region, area and territory?
I used the dplyr package: g = group_by(mydataset, region, area, territory), and then summarize(g, cor(sales_qty, headcount)). The display looks right, but every correlation is NA. If I omit territory, then the result looks right (grouped by region and area). Even though territory is the lowest level, can I still use group_by? Why is it showing NA?
Thank you for helping!
Without seeing your code and data, it is hard to tell what is going wrong. Here is what I tried to get a correlation per group, and it works well.
set.seed(1234)
df <- data.frame(group = rep(1:5, 100), x = rnorm(500) , y = rnorm(500) )
library(dplyr)
df %>%
group_by(group) %>%
do(data.frame(x=cor(.$x,.$y)))
Output:
group x
<int> <dbl>
1 1 0.1293551648
2 2 0.0006703073
3 3 0.2021294935
4 4 -0.0162522307
5 5 0.0995898089
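One possible explanation for the NA values in the question, offered as an assumption since the original data is not shown: if territory is the lowest grain, each region/area/territory group may contain a single row, and a correlation cannot be computed from one observation. The base R sketch below uses made-up numbers and a hypothetical helper, cor_by, to show the effect.

```r
# Made-up data: two areas, each territory appears exactly once
sales <- data.frame(
  area      = c("A1", "A1", "A1", "A2", "A2", "A2"),
  territory = c("T1", "T2", "T3", "T4", "T5", "T6"),
  sales_qty = c(10, 25, 31, 8, 14, 30),
  headcount = c(2, 4, 6, 1, 3, 5)
)

# Hypothetical helper: correlation within each group of one grouping column
cor_by <- function(d, col) {
  sapply(split(d, d[[col]]), function(g) cor(g$sales_qty, g$headcount))
}

by_territory <- cor_by(sales, "territory")  # one row per group: all NA
by_area      <- cor_by(sales, "area")       # three rows per group: numeric
by_territory
by_area
```

If that is the situation, the correlation has to be computed at a level that has several rows per group, such as area.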

Creating stacked barplots in R using different variables

I am a novice R user, hence the question. I refer to the solution on creating stacked barplots from R programming: creating a stacked bar graph, with variable colors for each stacked bar.
My issue is slightly different. I have data with 4 columns, where the last column is the sum of the first 3. I want to plot a bar chart in which 1) each bar's height is the summed total (i.e. the 4th column), and 2) each bar is split by the relative contributions of the three columns.
I was hoping someone could help.
Regards,
Bernard
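If I understood it correctly, this may do the trick.
The following code works well for the example df data frame:
# read the example values into a data frame with columns a, b, c and sum
df <- read.table(header = TRUE, text = "
a b c sum
1 9 8 18
3 6 2 11
1 5 4 10
23 4 5 32
5 12 3 20
2 24 1 27
1 2 4 7")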
As you don't want to plot a count of observations but the actual values in your data frame, you need to use geom_bar(stat = "identity") in ggplot2. Some data manipulation is necessary too. And you don't need the sum column: ggplot stacks the values, so the bar height is the sum.
library(reshape2) # for melt()
library(ggplot2)
df <- df[, -ncol(df)] # drop the last column (assumed to be the sum)
df$event <- seq.int(nrow(df)) # row index, so values from the same row end up in the same bar
df <- melt(df, id.vars = 'event') # reshape to long format so ggplot can read it
px <- ggplot(df, aes(x = event, y = value, fill = variable)) + geom_bar(stat = "identity")
print(px)
This code generates the plot below.

Get ordered kmeans cluster labels

Say I have a data set x and do the following kmeans cluster:
fit <- kmeans(x,2)
My question is in regards to the output of fit$cluster: I know that it will give me a vector of integers (from 1:k) indicating the cluster to which each point is allocated. Instead, is there a way to have the clusters be labeled 1,2, etc... in order of decreasing numerical value of their center?
For example: If x=c(1.5,1.4,1.45,.2,.3,.3) , then fit$cluster should result in (1,1,1,2,2,2) but not result in (2,2,2,1,1,1)
Similarly, if x=c(1.5,.2,1.45,1.4,.3,.3) then fit$cluster should return (1,2,1,1,2,2), instead of (2,1,2,2,1,1)
Right now, fit$cluster seems to label the cluster numbers randomly. I've looked into documentation but haven't been able to find anything. Please let me know if you can help!
I had a similar problem. I had a vector of ages that I wanted to separate into 5 factor groups based on a logical ordinal set. I did the following:
I ran the k-means function:
k5 <- kmeans(all_data$age, centers = 5, nstart = 25)
I built a data frame of the k-means indexes and centres; then arranged it by centre value.
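library(dplyr) # provides %>%, arrange() and mutate(), and re-exports tibble()
kmeans_index <- as.numeric(rownames(k5$centers))
k_means_centres <- as.numeric(k5$centers)
k_means_df <- tibble(index=kmeans_index, centres=k_means_centres) # tibble() replaces the deprecated data_frame()
k_means_df <- k_means_df %>%
  arrange(centres)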
Now that the centres are in the df in ascending order, I created my 5 element factor list and bound it to the data frame:
factors <- c("very_young", "young", "middle_age", "old", "very_old")
k_means_df <- cbind(k_means_df, factors)
Looks like this:
> k_means_df
index centres factors
1 2 23.33770 very_young
2 5 39.15239 young
3 1 55.31727 middle_age
4 4 67.49422 old
5 3 79.38353 very_old
I saved my cluster values in a data frame and created a dummy factor column:
cluster_vals <- tibble(cluster=k5$cluster, factor=NA_character_) # tibble() replaces the deprecated data_frame()
Finally, I iterated through the factor options in k_means_df and replaced the cluster value with my factor/character value within the cluster_vals data frame:
for (i in 1:nrow(k_means_df))
{
index_val <- k_means_df$index[i]
factor_val <- as.character(k_means_df$factors[i])
cluster_vals <- cluster_vals %>%
mutate(factor=replace(factor, cluster==index_val, factor_val))
}
Voila; I now have a vector of factors/characters that were applied based on their ordinal logic to the randomly created cluster vector.
# A tibble: 3,163 x 2
cluster factor
<int> <chr>
1 4 old
2 2 very_young
3 2 very_young
4 2 very_young
5 3 very_old
6 3 very_old
7 4 old
8 4 old
9 2 very_young
10 5 young
# ... with 3,153 more rows
Hope this helps.
K-means is a randomized algorithm, so it is expected that the labels are neither consistent across runs nor sorted in any particular order.
But you can of course remap the labels however you like.
You seem to be using 1-dimensional data. Then k-means is actually not the best choice for you.
In contrast to 2- and higher-dimensional data, 1-dimensional data can efficiently be sorted. If your data is 1-dimensional, use an algorithm that exploits this for efficiency. There are much better algorithms for 1-dimensional data than for multivariate data.
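The remapping of labels can be sketched in base R as follows. This is one possible approach, not the method of either answer: it ranks the fitted centres in decreasing order and uses that ranking as a lookup table for the original labels. The names new_label and ordered_cluster are illustrative.

```r
# First example vector from the question
x <- c(1.5, 1.4, 1.45, 0.2, 0.3, 0.3)

set.seed(42)
fit <- kmeans(x, centers = 2, nstart = 10)

# For 1-dimensional data fit$centers is a k x 1 matrix; row i is the centre of cluster i.
# rank(-centre) assigns 1 to the largest centre, 2 to the next largest, and so on.
new_label <- rank(-fit$centers[, 1], ties.method = "first")

# Use the original labels as an index into the lookup table
ordered_cluster <- unname(new_label[fit$cluster])
ordered_cluster
```

Because the lookup depends only on the fitted centres, the cluster with the largest centre is labelled 1 regardless of which arbitrary number kmeans() gave it in a particular run.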

plotting the top 5 values from a table in R

I'm very new to R so this may be a simple question. I have a table of data that contains frequency counts of species like this:
Acidobacteria 47
Actinobacteria 497
Apicomplexa 7
Aquificae 16
Arthropoda 26
Ascomycota 101
Bacillariophyta 1
Bacteroidetes 50279
...
There are about 50 species in the table. As you can see some of the values are a lot larger than the others. I would like to have a stacked barplot with the top 5 species by percentage and one category of 'other' that has the sum of all the other percentages. So my barplot would have 6 categories total (top 5 and other).
I have 3 additional datasets (sample sites) that I would like to do the same thing to only highlighting the first dataset's top 5 in each of these datasets and put them all on the same graph. The final graph would have 4 stacked bars showing how the top species in the first dataset change in each additional dataset.
I made a sample plot by hand (tabulated the data outside of R and just fed in the final table of percentages) to give you an idea of what I'm looking for: http://dl.dropbox.com/u/1938620/phylumSum2.jpg
I would like to put these steps into an R script so I can create these plots for many datasets.
Thanks!
Say your data is in the data.frame DF
DF <- read.table(textConnection(
"Acidobacteria 47
Actinobacteria 497
Apicomplexa 7
Aquificae 16
Arthropoda 26
Ascomycota 101
Bacillariophyta 1
Bacteroidetes 50279"), stringsAsFactors=FALSE)
names(DF) <- c("Species","Count")
Then you can determine which species are in the top 5 by
top5Species <- DF[rev(order(DF$Count)),"Species"][1:5]
Each of the data sets can then be converted to these 5 and "Other" by
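library(plyr) # provides ddply(), .() and summarise()
# label every species outside the top 5 as "Other"
DF$Group <- ifelse(DF$Species %in% top5Species, DF$Species, "Other")
DF$Group <- factor(DF$Group, levels=c(top5Species, "Other"))
# total count and proportion per group
DF.summary <- ddply(DF, .(Group), summarise, total=sum(Count))
DF.summary$prop <- DF.summary$total / sum(DF.summary$total)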
Making Group a factor keeps them all in the same order in DF.summary (largest to smallest per the first data set).
Then you just put them together and plot them as you did in your example.
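The "put them together and plot" step might look like the base R sketch below. It is a minimal illustration under stated assumptions, not the answer's own code: site1 holds the counts from the question, the site2 counts are made up, and the helper group_props is hypothetical. It uses vapply() and barplot() instead of ddply() so that it runs without extra packages.

```r
# Site 1 holds the counts from the question; the site 2 counts are made up for illustration.
species <- c("Acidobacteria", "Actinobacteria", "Apicomplexa", "Aquificae",
             "Arthropoda", "Ascomycota", "Bacillariophyta", "Bacteroidetes")
sites <- list(
  site1 = data.frame(Species = species,
                     Count = c(47, 497, 7, 16, 26, 101, 1, 50279),
                     stringsAsFactors = FALSE),
  site2 = data.frame(Species = species,
                     Count = c(300, 120, 40, 5, 60, 10, 15, 900),
                     stringsAsFactors = FALSE)
)

# The top 5 species are always taken from the first data set
first <- sites[[1]]
top5Species <- first$Species[order(first$Count, decreasing = TRUE)][1:5]
groups <- c(top5Species, "Other")

# Hypothetical helper: proportion of each group (top 5 plus "Other") within one data set
group_props <- function(d) {
  grp <- ifelse(d$Species %in% top5Species, d$Species, "Other")
  totals <- vapply(groups, function(g) sum(d$Count[grp == g]), numeric(1))
  totals / sum(totals)
}

# One column per site, one row per group
props <- sapply(sites, group_props)

# Stacked bars: each column of the matrix becomes one bar
barplot(props, legend.text = rownames(props), ylab = "Proportion of counts")
```

Fixing the row order through groups keeps the segments in the same order in every bar, which serves the same purpose as making Group a factor.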
A data.table version of the same idea:
library(data.table)
DT<-data.table(DF,key="Count")
DT[order(-rank(Count), Species)[6:nrow(DT)],Species:="Other"]
DT<-DT[, list(Count=sum(Count),Pcnt=sum(Count)/DT[,sum(Count)]),by="Species"]
