How to plot data from two files? - r

I am trying to plot the first 5 and the last 5 values of my data.
I have two files, A and B.
A is ordered, which is why I need the first five and the last five rows.
I want to plot the intensity of each replicate; those values are in another file, B.
If the names in A are found in B, I want to plot all the values.
File B has the same row names as file A (and more), so I want the names common to A and B, and for each match I want to plot all the values for each replicate (column) from file B.
So at the end I'll have 5 plots for the first 5 and 5 plots for the last 5.
I already tried something like this, but I am not very good at R programming.
Just to begin, I want to see if I can make the link between the two files. This doesn't work.
5fisrt <- A[1:5,]
5last <- A[(nrow(RT)-5+1):nrow(A), ]
i <- 0
for (i in 5fisrt)
{
if row.names(5fisrt[i]) == row.names(B[i])
plot <- boxplot(B,aes(B[i]))
print (plot)
}
I'll do the same for 5last.
How do I solve the problem?

If I understand, you want to make a barplot for each row in B, if the name of that row is the same as the name of one of the first 5 rows or last 5 rows of A.
If you want all 10 plots in one graphic, you can use facet_grid:
library(tidyverse)
n <- nrow(A)
C <- B[row.names(A[c(1:5,(n-4):n),]),]
C %>%
mutate(rowName = row_number()) %>%
gather(key = 'replicates', value = 'intensity', -rowName) %>%
ggplot(
aes(
x = replicates,
y = intensity
)
) +
geom_col() +
facet_grid(rowName~.)
Note that my answer assumes that there are at least 10 rows in A. The line that defines C subsets B as you want and assigns the result to a new object, just for clarity in the code. I could have piped that subset of B straight into mutate without defining C.
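One caveat with the subsetting step: indexing a data frame by a row name it does not contain yields a row of NAs. A minimal base R sketch, using made-up toy frames A and B (the column names and the `gene` labels are assumptions for illustration, not the asker's data), restricts the lookup to names present in both:

```r
# Toy stand-ins for the asker's files (hypothetical data)
A <- data.frame(score = 12:1, row.names = paste0("gene", 1:12))
B <- data.frame(rep1 = 1:14, rep2 = 15:28, rep3 = 29:42,
                row.names = paste0("gene", 15:2))  # gene1 is absent from B

n <- nrow(A)
wanted <- row.names(A)[c(1:5, (n - 4):n)]   # first 5 and last 5 names of A
common <- intersect(wanted, row.names(B))   # keep only names that B also has
C <- B[common, , drop = FALSE]              # matching rows of B, in A's order

print(C)
```

Because `intersect` keeps the order of its first argument, the rows of C follow the ordering of A, and the real row names stay available for plot titles.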
UPDATE:
Since you don't want to use facet, you could try one of these:
library(tidyverse)
n <- nrow(A)
C <- B[row.names(A[c(1:5,(n-4):n),]),]
D <- C %>%
mutate(rowName = row_number()) %>%
gather(key = 'replicates', value = 'intensity', -rowName)
# Plot to the RStudio viewer
for(i in 1:10){
p <-
D %>%
filter(rowName == i) %>%
ggplot(
aes(
x = replicates,
y = intensity
)
) +
geom_col()
print(p)
}
# Save the plots to files
for(i in 1:10){
p <-
D %>%
filter(rowName == i) %>%
ggplot(
aes(
x = replicates,
y = intensity
)
) +
geom_col()
ggsave(paste0("myPlot",i,".png"), plot = p)
}

Related

Data cleaning in R: grouping by number and then by name

A small sample of my dataset looks something like this:
x <- c(1,2,3,4,1,7,1)
y <- c("A","b","a","F","A",".A.","B")
data <- cbind(x,y)
My goal is to first group data that have the same number together and then group the same names together (A, a and .A. count as the same name in my case).
In other words, the final output should look something like this:
xnew <- c(1,1,3,7,1,2,4)
ynew <- c("A","A","a",".A.","B","b","F")
datanew <- cbind(xnew,ynew)
Currently, I am only able to group by number in the column labelled x. I am unable to group by name yet. I would appreciate any help given.
Note: I need an automated solution as my raw dataset contains over 10,000 lines for the x and y columns.
Assuming you have a data frame, data <- data.frame(x,y), and not the matrix that cbind generates, you can combine the different spellings into one level with fct_collapse and then arrange the data by this new column (z) and by x.
library(dplyr)
library(forcats)
data %>%
mutate(z = fct_collapse(y,
"A" = c('A', '.A.', 'a'),
"B" = c('B', 'b'))) %>%
arrange(z, x) %>%
select(-z) -> result
result
# x y
#1 1 A
#2 1 A
#3 3 a
#4 7 .A.
#5 1 B
#6 2 b
#7 4 F
Or you can remove all punctuation from the y column, convert it to upper or lower case, and then arrange.
data %>%
mutate(z = toupper(gsub("[[:punct:]]", "", y))) %>%
arrange(z, x) %>%
select(-z) -> result
result
library(dplyr)
data %>%
as.data.frame() %>%
group_by(x, y) %>%
summarise(records = n()) %>%
arrange(x, y)
According to your question it's just a matter of ordering data.
result <- data[order(data$x, data$y),]
or, considering that you want to collate A, a and .A.:
result <- data[order(data$x, toupper(gsub("[^A-Za-z]","",data$y))),]
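The expected output in the question keeps equal names together first and sorts by number within each name, so a base R sketch that reproduces it orders by the cleaned name before x. This assumes data is a data frame with a character y column; the helper vector `key` is named only for illustration:

```r
x <- c(1, 2, 3, 4, 1, 7, 1)
y <- c("A", "b", "a", "F", "A", ".A.", "B")
data <- data.frame(x, y, stringsAsFactors = FALSE)

# Normalise the names: drop non-letters and ignore case
key <- toupper(gsub("[^A-Za-z]", "", data$y))

# Sort by normalised name first, then by number within each name
result <- data[order(key, data$x), ]
rownames(result) <- NULL
print(result)
```

Swapping the two arguments of `order` gives the number-first ordering instead, so the same two lines cover either reading of the question.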

ggplot facet grid within a factor

Consider data that looks like this
fitem<-rep(rep(1:16,each=3),2)
fsubs<-factor(rep(rep(paste('sub',1:3,sep=''),16),2))
ftime<-factor(as.character(rep(c('a','b'),each=48)))
fcounts<-as.numeric(round(runif(96,1,10)))
fdf<-data.frame(fsubs,fitem,fcounts,ftime)
head(fdf)
fsubs fitem fcounts ftime
1 sub1 1 8 a
2 sub2 1 10 a
3 sub3 1 4 a
4 sub1 2 4 a
5 sub2 2 1 a
6 sub3 2 6 a
I would like to plot a facet grid that shows the counts for the two time points ('a','b'), subject-wise. I can't seem to figure out how to plot this in ggplot.
Here is my ugly attempt to do it:
fdf_counts<-data.frame()
for (i in unique(fdf$fsubs)){
fdf_counts<-append(fdf_counts,cbind(fdf%>%filter(fsubs==i,ftime=='a')%>%dplyr::select(fcounts),
fdf%>%filter(fsubs==i,ftime=='b')%>%dplyr::select(fcounts)))
fdf_counts<-data.frame(fdf_counts)
}
s1<-ggplot(fdf_counts,aes(x=fcounts,y=fcounts.1))+geom_point()+geom_smooth(method='lm')+labs(x='a',y='b',title='sub1')
s2<-ggplot(fdf_counts,aes(x=fcounts.2,y=fcounts.3))+geom_point()+geom_smooth(method='lm')+labs(x='a',y='b',title='sub2')
s3<-ggplot(fdf_counts,aes(x=fcounts.4,y=fcounts.5))+geom_point()+geom_smooth(method='lm')+labs(x='a',y='b',title='sub3')
plot_grid(s1,s2,s3)#from 'cowplot' package
How can I do this using the original fdf data.frame, especially as the number of subs increases?
Or, for example, what if I wanted one scatter plot across all of the subs, plotting fcounts against each other with ftime 'a' on the x axis and ftime 'b' on the y axis?
Consider merging the data frame with itself on fsubs and fitem (fitem being the sequential item number within each fsubs and ftime grouping). This approach keeps one row per subject and item, which suits ggplot, since you can then facet_grid on fsubs without iteration.
mdf <- merge(subset(fdf, ftime=="a"),
subset(fdf, ftime=="b"),
by=c("fsubs", "fitem"),
suffixes=c("", "_"))
ggplot(mdf, aes(x=fcounts, y=fcounts_)) +
geom_point() +
geom_smooth(method='lm') +
labs(x='a', y='b') +
facet_grid(~fsubs)
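To see what the self-merge produces before plotting, here is a small base R sketch on simulated data like the question's. The `set.seed` call and the per-subject correlation at the end are additions for illustration only:

```r
set.seed(1)  # only so the simulated counts are reproducible
fdf <- data.frame(
  fsubs   = factor(rep(rep(paste0("sub", 1:3), 16), 2)),
  fitem   = rep(rep(1:16, each = 3), 2),
  fcounts = round(runif(96, 1, 10)),
  ftime   = factor(rep(c("a", "b"), each = 48))
)

# Pair each (subject, item) at time 'a' with the same pair at time 'b'
mdf <- merge(subset(fdf, ftime == "a"),
             subset(fdf, ftime == "b"),
             by = c("fsubs", "fitem"),
             suffixes = c("", "_"))

str(mdf)  # fcounts holds time 'a', fcounts_ holds time 'b'

# One number per subject, with no explicit loop over subjects
per_subject <- sapply(split(mdf, mdf$fsubs),
                      function(d) cor(d$fcounts, d$fcounts_))
print(per_subject)
```

Each of the 48 rows now carries both counts, which is the shape a scatter plot of 'a' against 'b' needs.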
This should get you close:
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
fitem<-rep(rep(1:16,each=3),2)
fsubs<-factor(rep(rep(paste('sub',1:3,sep=''),16),2))
ftime<-factor(as.character(rep(c('a','b'),each=48)))
fcounts<-as.numeric(round(runif(96,1,10)))
fdf<-tibble(fsubs,fitem,fcounts,ftime)
fdf <- fdf %>%
group_by(ftime) %>%
mutate(row_id = row_number()) %>%
pivot_wider(values_from = fcounts,
names_from = ftime)
ggplot(data = fdf, aes(x = a, y = b)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(fsubs ~ ., ncol = 1)
The tidyr function pivot_wider allows us to create the shape of the data we need without explicit loops: create new columns a and b with values from fcounts. We do need to create a unique row id to make this work.
By the way, when I run your code the plots look different from what you posted in the question.
With this output:
Just tried to create a visualization that would analyze all 4 variables. Got a geom_histogram
```{r}
fitem<-rep(rep(1:16,each=3),2)
fsubs<-factor(rep(rep(paste('sub',1:3,sep=''),16),2))
ftime<-factor(as.character(rep(c('a','b'),each=48)))
fcounts<-as.numeric(round(runif(96,1,10)))
fdf<-data.frame(fsubs,fitem,fcounts,ftime)
fdf_counts<-data.frame()
for (i in unique(fdf$fsubs)){
fdf_counts<-append(fdf_counts,cbind(fdf%>%filter(fsubs==i,ftime=='a')%>%dplyr::select(fcounts),
fdf%>%filter(fsubs==i,ftime=='b')%>%dplyr::select(fcounts)))
fdf_counts<-data.frame(fdf_counts)
}
# Refer to columns by bare name inside aes(); using fdf$... there breaks faceting
ggplot(data = fdf, mapping = aes(x = fsubs, y = fcounts, fill = fitem)) + geom_bar(stat = "identity", position = "dodge") + facet_grid(cols = vars(ftime))
```

R stacked bar charts including "other" (using ggplot2)

I want to make a stacked barchart that describes abundances of taxa at two locations in three different seasons. I'm using ggplot2. Making the plot is ok, but I have 48 taxa so I end up with a lot of different colours in the bar. There are only eight taxa that occur frequently and abundantly, so I'd like to group the others into "Other" for the plot.
My data looks like this:
SampleID  TransectID  SampleYear  Season  Location  Taxa1  Taxa2  Taxa3  ....  Taxa48
BW15001   1           2015        fall    SiteA     25     0      0            0
BW15001   2           2015        fall    SiteA     32     0      0            2
BW15001   2           2015        fall    SiteA     6      0      45           0
BW15001   3           2015        fall    SiteA     78     1      2            0
This is what I have tried (modified from here):
y <- rowSums(invert[6:54])
x<-invert[6:54]/y
x<-invert[,order(-colSums(x))]
#Extract list of top N Taxa
N<-8
taxa_list<-colnames(x)[1:N]
#remove "__Unknown__" and add it to others
taxa_list<-taxa_list[!grepl("Unknown",taxa_list)]
N<-length(taxa_list)
#Generate a new table with everything added to Others
new_x<-data.frame(x[,colnames(x) %in% taxa_list],
Others=rowSums(x[,!colnames(x) %in% taxa_list]))
df<-NULL
for (i in 1:dim(new_x)[2]){
tmp<-data.frame(row.names=NULL,Sample=rownames(new_x),
Taxa=rep(colnames(new_x)[i],dim(new_x) [1]),Value=new_x[,i],Type=grouping_info[,1])
if(i==1){df<-tmp} else {df<-rbind(df,tmp)}
}
To plot the graph:
colours <- c("#F0A3FF", "#0075DC", "#993F00","#4C005C","#2BCE48","#FFCC99","#808080","#94FFB5","#8F7C00","#9DCC00","#C20088","#003380","#FFA405","#FFA8BB","#426600","#FF0010","#5EF1F2","#00998F","#740AFF","#990000","#FFFF00");
library(ggplot2)
p<-ggplot(df,aes(Sample,Value,fill=Taxa))+
geom_bar(stat="identity")+
facet_grid(. ~ Type, drop=TRUE,scale="free",space="free_x")
p<-p+scale_fill_manual(values=colours[1:(N+1)])
p<-p+theme_bw()+ylab("Proportions")
p<-p+ scale_y_continuous(expand = c(0,0))+
theme(strip.background = element_rect(fill="gray85"))+
theme(panel.spacing = unit(0.3, "lines"))
p<-p+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
p
The main problem that I would like help with today is pulling out the main taxa and lumping the rest as "Other". I think I can figure out how to group the graph by Season and Location using facet_grid() later...
Thanks!
Expanding on my comment. Take a look at the forcats package. Without a full example, it's hard to say, but the following should work:
library(tidyverse)
library(forcats)
temp <- df %>%
gather(taxa, amount, -c(1:5))
# Reshape the data so that there is one record for each unit of amount
tidy_df <- temp[rep(rownames(temp), times = temp$amount), ]
tidy_df %>%
select(-amount) %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>% # Check out this line
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
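You can change fct_lump(taxa, n = 2) to fct_lump(taxa, n = 8) to keep the top 8 categories and lump the rest into "Other". Alternatively, you can lump by proportion with the prop argument, e.g. fct_lump(taxa, prop = 0.9), which lumps every level whose share of the records falls below prop.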
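If you would rather keep the abundances and stay in the wide layout, the same top-N idea can be sketched in base R. The data frame `invert` below is a made-up five-taxa stand-in for the real table, and N = 2 is chosen only to keep the example small:

```r
# Hypothetical miniature of the wide abundance table
invert <- data.frame(
  SampleID = c("S1", "S2", "S3"),
  Taxa1 = c(25, 32, 6),
  Taxa2 = c(0, 0, 1),
  Taxa3 = c(0, 45, 2),
  Taxa4 = c(1, 0, 0),
  Taxa5 = c(3, 2, 4)
)

N <- 2
taxa_cols <- grep("^Taxa", names(invert), value = TRUE)

# Rank taxa by total abundance and keep the N most abundant
totals <- sort(colSums(invert[taxa_cols]), decreasing = TRUE)
top <- names(totals)[seq_len(N)]
rest <- setdiff(taxa_cols, top)

# Keep the id columns and top taxa, and sum everything else into Other
lumped <- cbind(invert[setdiff(names(invert), taxa_cols)],
                invert[top],
                Other = rowSums(invert[rest]))
print(lumped)
```

Row totals are unchanged by the lumping, so proportions computed afterwards still sum to 1 per sample.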
If you are simply going after the "presence" of the taxa in a sample (and not the value or amount), things are a bit simpler and can likely be handled in one pipe:
df %>%
gather(taxa, amount, -c(1:5)) %>%
mutate(amount = na_if(amount, 0)) %>%
na.omit() %>%
mutate(taxa = fct_lump(taxa, n = 2)) %>%
ggplot(., aes(x = SampleID, fill = taxa)) +
geom_bar()
One way of doing it:
library(plyr)
library(reshape2) # provides melt()
library(ggplot2)
d=data.frame(SampleID=rep('BW15001',4),
TransectID=c(1,2,2,3),
SampleYear=rep(2015,4),
Taxa1=c(25,32,6,78),
Taxa2=c(0,0,0,1),
Taxa3=c(0,0,45,3))
#Reshape the df so that all taxa columns are melted into two
d=melt(d,id=colnames(d[,1:3]))
d$variable=as.character(d$variable)
# rename all uninteresting taxa as 'other'
`%ni%` <- Negate(`%in%`) # Here I decided to select the ones to keep, but the other way around is fine as well of course
d[d$variable %ni% c('Taxa1','Taxa2'),'variable']='Other' #here you could add a function to automatically determine which taxa you want to keep, as you already did
# aggregate all data for 'other'
d=ddply(d,colnames(d[,1:4]),summarise,value=sum(value))
#make your plot, this one is just a bad example
ggplot(d,aes(SampleID,value,fill=variable))+
geom_bar(stat="identity") # no facet on Type here: this toy data frame has no Type column

Using functionals instead of for loops to identify sequential changes in a vector

My data look like this:
I want to identify which "downward trend" each observation is part of, so I can group them and do things like make this graph:
My logic for distinguishing "downward trends" is that they end when the next observation has a higher measurement.
I've written a loop to do this, but I'm wondering if there's a better way to do it with one of the apply functions or something like them.
##Create sample data
df <- data.frame(timestamp = seq(1:20),
measurement = seq(10, 1, by = -1))
## This is the for loop I'm hoping to improve
df$downward.trend.seq <- 0
seq <- 1
for(i in 1:nrow(df)){
df$downward.trend.seq[i] <- seq
if (i < nrow(df) & df$measurement[i] < df$measurement[i+1]) {
seq <- seq + 1
}
}
## Code for plots
library(ggplot2)
library(dplyr)
ggplot(df, aes(x = timestamp, y = measurement)) + geom_point()
ggplot(df, aes(x = timestamp, y = measurement, group = downward.trend.seq)) + geom_line(aes(color=downward.trend.seq %>% factor))
You can use which and diff to identify where the downward trend changes occur, and use cumsum to fill out the group membership.
# set up new column with all 0s
df$downward.trend.seq <- 0
# use diff to identify indices to change to 1
df$downward.trend.seq[which(c(NA, diff(df$measurement)) > 0)] <- 1
# use cumsum to fill in proper group membership
df$downward.trend.seq <- cumsum(df$downward.trend.seq)
Here is a dplyr solution
df %>% mutate(data_group = cumsum( c(0, diff(measurement)) > 0 ))
This takes the cumulative sum of a logical vector and assigns the result to data_group. Note that the groups are numbered from 0 here, whereas the loop in the question numbers them from 1.
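As a quick check that the vectorised form agrees with the original loop, here is a base R sketch on the question's sample data. The variable names `loop_grp` and `vec_grp` are made up for the comparison:

```r
df <- data.frame(timestamp = 1:20, measurement = rep(10:1, times = 2))

# Loop from the question: start a new group after each rise
loop_grp <- numeric(nrow(df))
g <- 1
for (i in seq_len(nrow(df))) {
  loop_grp[i] <- g
  if (i < nrow(df) && df$measurement[i] < df$measurement[i + 1]) {
    g <- g + 1
  }
}

# Vectorised: TRUE wherever a value is higher than its predecessor,
# then a running count of those rises; + 1 matches the loop's numbering
vec_grp <- cumsum(c(FALSE, diff(df$measurement) > 0)) + 1

print(all(vec_grp == loop_grp))
```

The leading `FALSE` stands in for the first observation, which has no predecessor and therefore cannot start a new trend.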

Substituting dates with number of days in time series

I have the following data on student scores on several pretests before their true exam.
a<-(c("2013-02-25","2013-03-13","2013-04-24","2013-05-12","2013-07-12","2013-08-11","actual_exam_date"))
b<-c(300,230,400,NA,NA,NA,"2013-04-30")
c<-c(NA,260,410,420,NA,NA,"2013-05-30")
d<-c(300,230,400,NA,370,390,"2013-08-30")
df<-as.data.frame(rbind(b,c,d))
colnames(df)<-a
rownames(df)<-(c("student 1","student 2","student 3"))
The actual datasheet is much larger. Since the dates vary so much, and the timing of the pretests relative to the exam is fairly similar across students, I would rather convert the dates into the number of days before the exam, so that the new column names are day counts, not dates. I understand that this will merge some of the columns, which is OK. How can I do that?
This is another good use case for reshape2, because you want to go to long form for plotting. For example:
# you are going to need the student id as a field
df$student_id <- row.names(df)
library('reshape2')
df2 <- melt(df, id.vars = c('student_id','actual_exam_date'),
variable.name = 'pretest_date',
value.name = 'pretest_score')
# drop empty observations
df2 <- df2[!is.na(df2$pretest_score),]
# these need to be dates
df2$actual_exam_date <- as.Date(df2$actual_exam_date)
df2$pretest_date <- as.Date(df2$pretest_date)
# date difference
df2$days_before_exam <- as.integer(df2$actual_exam_date - df2$pretest_date)
# scores need to be numeric
df2$pretest_score <- as.numeric(df2$pretest_score)
# now you can make some plots
library('ggplot2')
ggplot(df2, aes(x = days_before_exam, y = pretest_score, col=student_id) ) +
geom_line(lwd=1) + scale_x_reverse() +
geom_vline(xintercept = 0, linetype = 'dashed', lwd = 1) +
ggtitle('Pretest Performance') + xlab('Days Before Exam') + ylab('Pretest Score')
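The date arithmetic at the core of this approach can be tried in isolation. A minimal base R sketch, using student 1's exam date and the first three pretest dates from the question:

```r
pretest_dates <- as.Date(c("2013-02-25", "2013-03-13", "2013-04-24"))
exam_date <- as.Date("2013-04-30")

# Subtracting Date objects gives a difftime in days; as.integer drops the units
days_before_exam <- as.integer(exam_date - pretest_dates)
print(days_before_exam)
```

Subtraction only works once the strings are real Date objects, which is why both answers convert the columns with `as.Date` first.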
Here is one way to approach this one. I am sure there are many others. I commented the code to explain what is going on at each step:
# Load the libraries you need
library(tidyr)
library(dplyr)
library(ggplot2)
# Construct data frame you provided
a <- (c("2013-02-25","2013-03-13","2013-04-24","2013-05-12","2013-07-12","2013-08-11","actual_exam_date"))
b <- c(300,230,400,NA,NA,NA,"2013-04-30")
c <- c(NA,260,410,420,NA,NA,"2013-05-30")
d <- c(300,230,400,NA,370,390,"2013-08-30")
df <- as.data.frame(rbind(b,c,d))
colnames(df) <- a
# Add student IDs as a column instead of row names and move them to first position
df$StudentID <- row.names(df)
row.names(df) <- NULL
df <- select(df, StudentID, everything())
# Gather date columns as 'categories' with score as the new column value
newdf <- df %>% gather(Date, Score, -actual_exam_date, -StudentID) %>% arrange(StudentID)
# Convert dates coded as factor variables into actual dates so we can do days to exam computation
newdf$actual_exam_date <- as.Date(as.character(newdf$actual_exam_date))
newdf$Date <- as.Date(as.character(newdf$Date))
# Create a new column of days before exam per student ID (group) and filter
# out dates with missing scores for each student
newdf <- newdf %>%
group_by(StudentID) %>%
mutate(daysBeforeExam = as.integer(difftime(actual_exam_date, Date, units = 'days'))) %>%
filter(!is.na(Score))
# Plot the trends using ggplot
ggplot(newdf, aes(x = daysBeforeExam, y = Score, col = StudentID, group = StudentID)) + geom_line(size = 1) + geom_point(size = 2)
