Use R to compare metagenomic data

Good evening everyone,
I am not exactly new to R; I have done a course on Coursera, but I haven't really done anything serious with R yet.
Now I have some metagenomic data split into tibbles: the domains of metagenome 1 in one tibble, those of metagenome 2 in another, and so on, and similarly for phyla, classes, orders, families, and genera. I need to make comparisons of the data, for example comparing the genera present in one metagenome with those in four or five other metagenomes. Can you point me towards libraries and functions with which I can compare data like this?
Example data (the genus and family tibbles are even wider, with hundreds of columns):
Archaea  Bacteria  Eukaryota  Viruses  other.sequences  unclassified.sequences
    649    423655       4901       64                7                     317
Now I understand that I should tidy the data by turning the column names into a column (e.g. taxon) using pivot_longer().
But what are some good ways to visualize data like this?
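One possible workflow (a sketch, not the only way) is to combine the per-metagenome tibbles, pivot them long with tidyr, and plot relative abundances with ggplot2; for more specialised community comparisons, the vegan and phyloseq packages are widely used. The second metagenome's counts below are hypothetical stand-ins:

library(tibble)
library(tidyr)
library(dplyr)
library(ggplot2)

# One wide tibble per metagenome, shaped like the example above
mg1 <- tibble(Archaea = 649, Bacteria = 423655, Eukaryota = 4901,
              Viruses = 64, other.sequences = 7, unclassified.sequences = 317)
mg2 <- tibble(Archaea = 820, Bacteria = 390000, Eukaryota = 5100,   # made-up values
              Viruses = 90, other.sequences = 12, unclassified.sequences = 501)

# Label each metagenome, stack them, and pivot to long form
domains <- bind_rows(list(metagenome1 = mg1, metagenome2 = mg2),
                     .id = "metagenome") %>%
  pivot_longer(-metagenome, names_to = "taxon", values_to = "count") %>%
  group_by(metagenome) %>%
  mutate(rel_abundance = count / sum(count)) %>%  # proportion within each metagenome
  ungroup()

# Stacked bars of relative abundance, one bar per metagenome
ggplot(domains, aes(x = metagenome, y = rel_abundance, fill = taxon)) +
  geom_col()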

Related

Is there an R function for finding shared traits among variables?

I have a data set of plants and plant traits. It is a large data set with over 150 plants and over 300 different traits. However, I do not have data for all 300 traits for all 150 plants: some plants have data for 100 traits, others for only 2 or 3 traits.
I have figured out how to isolate which plants have the most trait data, but I can't figure out how to isolate which traits those plants have in common.
For example: I have 10 plants, numbered 1-10, and each of these 10 plants has data for 75 traits, with trait numbers ranging from 1 to 3000. So each plant has 75 different traits, with some overlap between plants, and I want to isolate the traits they all share so I can analyze them.
Is there an easy way to do this in R? It seems like there should be a relatively easy way, but I can't quite figure it out.
My data set looks something like this, just much larger.
In this example I would want to highlight Traits #1 and #4, because those are the two which have data for all three plants.
I hope this all makes sense. Thanks everyone in advance for your help!
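One possible sketch, assuming the data can be arranged in long form with one row per plant/trait pair that has data (the column names here are hypothetical): split the trait numbers by plant, then intersect the sets with Reduce().

# Long-format toy data: one row per plant/trait combination with data
traits <- data.frame(
  Plant = c(1, 1, 1, 2, 2, 2, 3, 3),
  Trait = c(1, 2, 4, 1, 3, 4, 1, 4)
)

# One vector of trait numbers per plant, then intersect across all plants
trait_sets <- split(traits$Trait, traits$Plant)
shared <- Reduce(intersect, trait_sets)
shared  # 1 4 -- the traits for which every plant has data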

Scatterplot for comparing species abundance

I have a homework question that states the following:
The file “channel_islands_counts_edit.csv” contains survey data on temperate rocky reef fishes from the Channel Islands, collected at many sites over many years. The data has columns for Year, Date, Site, count, and SpeciesName (broken into adults and juveniles). The version of the data that I’ve given you looks at 16 sites over 27 years, with count data for 27 categories of fish. Imagine we’re interested in whether the abundances of different species are correlated across sites (to get a sense of whether species have similar habitat preferences and/or interact with each other), and whether the across-site correlations are consistent over time. To visualize this, write some code that does the following:
For each year, draw a scatterplot that compares the abundance of Hypsypops rubicundus (adults) and the abundance of Paralabrax clathratus (adults) across sites. Feel free to transform the data for plotting purposes, if you think that helps you see any patterns.
I imported my data set and ran the following code, which gives me 27 plots with Site on the x-axis and count on the y-axis, but no data is shown in the plots.
head(channel_islands)
sapply(channel_islands, class)
levels(channel_islands$SpeciesName)
par(mfrow = c(6, 5))  # set the plotting area into a 6-row by 5-column array
for (i in 1:27) {
  HR11 <- subset(channel_islands, SpeciesName == "Hypsypops rubicundus,adult"[i] & Site == 11)
  PC15 <- subset(channel_islands, SpeciesName == "Paralabrax clathratus,adult"[i] & Site == 15)
  with(HR11, plot(count ~ Site, type = "b", pch = 19, ylim = c(0, 10), xlim = c(0, 16), col = "green", main = i))
  with(PC15, plot(count ~ Site, type = "b", pch = 19, ylim = c(0, 10), xlim = c(0, 16), col = "blue", main = i))
}
If anyone could help me figure out how to compare species abundance across sites, over 27 years, I would really appreciate it.
The code "Hypsypops rubicundus,adult"[i] doesn't really make sense: it indexes a character vector of length one, so it works when i == 1 but returns NA for any larger i. Since SpeciesName == NA is never TRUE, the subset comes back empty.
Consider looking into using ggplot2 with facet_grid to quickly make multiple plots without the loop. The R Graphics Cookbook has good examples on using facets.
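A sketch of that facet approach, assuming channel_islands has the columns described in the question (Year, Site, count, SpeciesName) and the species labels shown above:

library(dplyr)
library(tidyr)
library(ggplot2)

# Keep the two adult species, total their counts per year and site, then
# put the species side by side so each point is one site in one year
two_species <- channel_islands %>%
  filter(SpeciesName %in% c("Hypsypops rubicundus,adult",
                            "Paralabrax clathratus,adult")) %>%
  group_by(Year, Site, SpeciesName) %>%
  summarise(abundance = sum(count), .groups = "drop") %>%
  pivot_wider(names_from = SpeciesName, values_from = abundance,
              values_fill = 0)

# One scatterplot per year; log1p() keeps zero counts plottable
ggplot(two_species,
       aes(x = log1p(`Hypsypops rubicundus,adult`),
           y = log1p(`Paralabrax clathratus,adult`))) +
  geom_point() +
  facet_wrap(~ Year)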

Criteria for deciding which character columns should be converted to factors

I have been working through the book "Analyzing Baseball Data with R" by Marchi and Albert and am wondering about an issue which they don't address.
Many of the datasets I need to import are fairly large (though not really "Big" in the sense of "Big Data"). For example, the Retrosheet Game Logs have one CSV file per year dating back to 1871, where each file has a row for each game played that year and 161 columns. When I read one into a dataframe using read.csv() with the default stringsAsFactors setting, fully 75 of the 161 columns become factors. Some of these columns conceptually are factors (such as one containing "D" or "N" for day or night games), but others are probably better left as strings (many of the columns contain names of starting pitchers, closers, etc.). I know how to convert columns from factors to strings or vice versa, but I don't want to scan through 161 columns making an explicit decision for 75 of them.
The reason I think this is important is that I've noticed that conceptually small dataframes obtained by subsetting these game logs are surprisingly large, apparently because they retain the full factor information. For example, after downloading, unzipping, and reading in the file, object.size(GL2016) is about 2.8 MB, and when I use:
df <- with(GL2016,GL2016[V7 == "CLE" & V13 == "D",])
to extract the home day games played by the Cleveland Indians in 2016, I get a df with 26 rows. 26/2428 (where 2428 is the number of rows in the whole dataframe) is slightly more than 1%, but object.size(df) is around 1.3 MB, which is far more than 1% of the size of GL2016.
I came up with an ad-hoc solution. I first defined a function:
big.factor <- function(v, k) { is.factor(v) && length(levels(v)) > k }
And then used mutate_if from dplyr like this:
GL2016 %>% mutate_if(function(v) big.factor(v, 30), as.character) -> GL2016
30 is the number of teams in the MLB and I somewhat arbitrarily decided that any factor with more than 30 levels should probably be treated as a string.
After this code has been run, the number of factor variables is reduced from 75 to 12. It works in the sense that although GL2016 is now around 3.2 MB (slightly larger than before), subsetting out the Cleveland day games now yields a dataframe of just 0.1 MB.
Questions:
1) What criteria (hopefully less ad-hoc than what I used above) are relevant for deciding which character columns should be converted to factors when importing a large data set?
2) I am aware of the cost in terms of memory footprint of converting all character data to factors, but am I incurring any hidden costs (say in processing time) when I convert most of these factors back into strings?
Essentially, I think what you need to do is:
df <- with(GL2016,GL2016[V7 == "CLE" & V13 == "D",])
df <- droplevels(df)
The droplevels function will remove all the unused factor levels, and thus reduce the size of df immensely.
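A minimal illustration of the mechanism (toy data, not the game logs): subsetting a factor keeps the full set of levels, which is exactly what droplevels() strips away.

# Subsetting a factor keeps every level, used or not
f <- factor(sample(letters, 1e5, replace = TRUE))
sub <- f[f == "a"]
length(levels(sub))              # 26: all levels retained
length(levels(droplevels(sub)))  # 1: unused levels dropped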

R ttest on multiple levels of a factor

I'm trying to perform multiple t-tests on my dataset in R and got totally confused by the capabilities of the apply functions, aggregate, and for loops.
My data is as follows: the observations are different products, and for each product I have multiple numeric variables I'd like to compare. In addition, there are 13 different categories of products, and another factor variable that differentiates between new, used, and old products. A sample of my data may look like the following:
ProdID  Category  Cond  No. of instances  Sales  Time since launch
aaaaa   Sports    New   100               40000  30
bbbb    Crafts    New   0                 0      20
ccccc   Music     Used  20                1000   10
My goal is the following: for each Category (Sports, Crafts, Music, etc.), I want to output separately the results of a t-test. This t-test should compare the means of each numeric variable between "New" and "Used" products (I'm not interested in the "Old" values at all). So in the end I want to see the comparison of "Time since launch", "Sales", and "No. of instances" between new and used in Sports, then the same in Crafts, the same in Music, and so on...
I've tried this in so many ways, but with each of them (aggregate, tapply, for loop) I ran into a different problem... It seems that I'm missing something here. (I'm kind of new to R; I used to do this in SPSS using split file...)
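One way to sketch this is with split() and lapply(), assuming a data frame called products with syntactic column names (Instances, Sales, and TimeSinceLaunch are hypothetical stand-ins for the columns above) and observations for both conditions in every category:

# Numeric variables to test (hypothetical names)
num_vars <- c("Instances", "Sales", "TimeSinceLaunch")

# Keep only the two conditions of interest; droplevels() so t.test()
# sees exactly two groups
products_nu <- droplevels(subset(products, Cond %in% c("New", "Used")))

# One t-test per Category x numeric variable
results <- lapply(split(products_nu, products_nu$Category), function(d) {
  setNames(lapply(num_vars, function(v) t.test(d[[v]] ~ d$Cond)), num_vars)
})

results$Sports$Sales  # e.g. the New vs Used comparison of Sales within Sports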

R: iterating through unique values of a vector in for loop

I'm new to R and I am having some trouble iterating over the unique elements of a vector. I have a dataframe School with 700 different teachers. Each teacher has around 40 students.
I want to loop through each teacher, create a graph of his or her students' mean scores over time, save the graph in a folder, and automatically email that folder to the teacher.
I'm just getting started and am having trouble setting up the for loop. In Stata I know how to loop through each unique element in a list, but I'm having trouble doing that in R. Any help would be appreciated.
School$Teacher  School$Student  School$ScoreNovember  School$ScoreDec  School$TeacherEmail
A               1               35                    45               A@school.org
A               2               43                    65               A@school.org
B               1               66                    54               B@school.org
A               3               97                    99               A@school.org
C               1               23                    45               C@school.org
Your question seems a bit vague and it looks like you want us to write your whole project. Could you share what you have done so far and where exactly you are struggling?
see ?subset
School <- data.frame(Teacher = c("A", "B"), ScoreNovember = 10:11, ScoreDec = 13:14)

for (teacher in unique(School$Teacher)) {
  teacher_df <- subset(School, Teacher == teacher)
  MeanScoreNovember <- mean(teacher_df$ScoreNovember)
  MeanScoreDec <- mean(teacher_df$ScoreDec)
  # do your plot
  # send your email
}
I think you have three questions here, which would be best split into separate posts. How do I:
Create graphs
Automatically email output
Compute a subset mean based on group
For the third one, I like using the plyr package; other people will recommend the data.table or dplyr packages. You can also use aggregate from base. To get a teacher's mean:
library(plyr)
ddply(School, .(Teacher), summarise, Nov_m = mean(ScoreNovember))
If you want it per student per teacher, etc., just add grouping columns between the brackets, like:
library(plyr)
ddply(School, .(Teacher, Student), summarise, Nov_m = mean(ScoreNovember))
You could do that for each score column (and then chart it). If your data were long rather than wide, you could also add the date ('November', 'Dec') as a group in the brackets, or:
library(plyr)
ddply(School, .(Teacher, Student), summarise, Nov_m = mean(ScoreNovember), Dec_m = mean(ScoreDec))
See if that helps with the third one, but do look at splitting your questions up too.
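For comparison, a dplyr equivalent of that last ddply() call might look like this (assuming the full School data frame with a Student column, as in the original post):

library(dplyr)

School %>%
  group_by(Teacher, Student) %>%
  summarise(Nov_m = mean(ScoreNovember),
            Dec_m = mean(ScoreDec),
            .groups = "drop")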
