Is there an R function for finding shared traits among variables?

I have a data set of plants and plant traits. It is a large data set with over 150 plants and over 300 different traits. However I do not have data for all 300 traits for all of the 150 plants. Some plants have data for 100 traits, other plants have data for only 2 or 3 traits.
I have figured out how to isolate which plants have the most trait data, but I can’t figure out how to isolate which traits these plants have in common.
For example, say I have 10 plants, numbered 1-10, and each of these 10 plants has data for 75 traits, with trait numbers varying from 1-3000. Each plant has its own set of 75 traits, with some overlap between plants. I want to isolate the traits that all of the plants share so that I can analyze them.
Is there an easy way to do this in R? It seems like there should be a relatively easy way, but I can’t quite figure it out.
My data set looks something like this, just much larger.
In this example I would want to highlight Traits #1 and #4, because those are the two which have data for all three plants.
I hope this all makes sense. Thanks everyone in advance for your help!
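One possible approach, sketched under the assumption that the data sit in a wide data frame with one row per plant, one column per trait, and NA wherever a trait was not measured. The names `traits`, `Plant`, `Trait1` to `Trait4`, and `long` are made up for illustration, and the values are toy numbers. A trait is shared by every plant exactly when its column contains no NA. A long-format variant (one row per plant-trait pair) is included in case the data are stored that way.

```r
# Toy data (hypothetical values): 3 plants, 4 traits, NA = no data.
traits <- data.frame(
  Plant  = c("Plant1", "Plant2", "Plant3"),
  Trait1 = c(2.1, 3.4, 1.8),
  Trait2 = c(NA, 5.0, 4.2),
  Trait3 = c(0.7, NA, NA),
  Trait4 = c(10, 12, 9)
)

# Keep only the trait columns (drop the plant identifier).
trait_cols <- traits[, setdiff(names(traits), "Plant"), drop = FALSE]

# A trait is shared when no plant is missing it.
shared <- names(trait_cols)[colSums(is.na(trait_cols)) == 0]
print(shared)

# Restrict the data to the shared traits for further analysis.
shared_data <- traits[, c("Plant", shared)]

# Long-format alternative: one row per plant-trait measurement.
long <- data.frame(
  plant = rep(c("Plant1", "Plant2", "Plant3"), each = 3),
  trait = c(1, 3, 4,  1, 2, 4,  1, 2, 4)
)

# Intersect the trait sets of all plants.
shared_long <- Reduce(intersect, split(long$trait, long$plant))
print(shared_long)
```

To apply this to only the best-covered plants, filter the rows of `traits` to those plants first and then run the same `colSums(is.na(...)) == 0` check.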

Related

Use R to compare metagenomic data

Good evening everyone,
I am not exactly new to R (I have done a course on Coursera), but I haven't really done anything serious with R yet.
Now I have some metagenomic data, split into tibbles: the domains of metagenome 1 in one tibble, the domains of metagenome 2 in another, and so on, and similarly for phylum, class, order, family, genus, etc. I need to compare these data, for example the genera present in one metagenome against those in four or five other metagenomes. Can you point me towards libraries and functions with which I can compare data like this?
Example data (the tibbles with genus and family data are even longer, with hundreds of columns):
Archaea  Bacteria  Eukaryota  Viruses  other.sequences  unclassified.sequences
    649    423655       4901       64                7                     317
Now I understand that I should clean the data by turning the column names into a column (e.g. taxon) using pivot_longer().
But what are some good ways to visualize data similar to this?
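As a rough sketch of one way to compare such tables, the following uses base R only so that it runs without extra packages. The `mg1` counts are the ones from the example; `mg2` is invented purely for illustration, and the names `mg1`, `mg2`, `counts`, `rel`, and `long` are assumptions. The idea is to stack the metagenomes into one matrix, convert to relative abundance so samples of different sequencing depth are comparable, and draw a grouped bar chart on a log scale.

```r
# Counts per domain; mg1 is from the example, mg2 is invented for illustration.
mg1 <- c(Archaea = 649, Bacteria = 423655, Eukaryota = 4901,
         Viruses = 64, other.sequences = 7, unclassified.sequences = 317)
mg2 <- c(Archaea = 1200, Bacteria = 250000, Eukaryota = 9000,
         Viruses = 150, other.sequences = 3, unclassified.sequences = 500)

# One row per metagenome, one column per taxon.
counts <- rbind(metagenome1 = mg1, metagenome2 = mg2)

# Relative abundance within each metagenome (each row sums to 1).
rel <- prop.table(counts, margin = 1)
print(round(rel, 4))

# Long format (one row per metagenome-taxon pair), the same shape that
# tidyr::pivot_longer() would produce.
long <- as.data.frame(as.table(counts), responseName = "count")
names(long)[1:2] <- c("metagenome", "taxon")

# Grouped bar chart on a log scale, since Bacteria dominates the counts.
barplot(counts, beside = TRUE, log = "y", las = 2,
        legend.text = rownames(counts), ylab = "Read count (log scale)")
```

For the wider genus and family tables, a heatmap of `rel` (for example with `heatmap()`) may be easier to read than bars, and the long data frame is the shape ggplot2 expects.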

Scatterplot for comparing species abundance

I have a homework question that states the following:
The file “channel_islands_counts_edit.csv” contains survey data on temperate rocky reef fishes from the Channel Islands, collected at many sites over many years. The data has columns for Year, Date, Site, count, and SpeciesName (broken into adults and juveniles). The version of the data that I’ve given you looks at 16 sites over 27 years, with count data for 27 categories of fish. Imagine we’re interested in whether the abundance of different species are correlated across sites (to get a sense for whether species have similar habitat preferences and/or interact with each other), and whether the across-site correlations are consistent over time. To visualize this, make some code that does the following:
For each year, draw a scatterplot that compares the abundance of Hypsypops rubicundus (adults) and the abundance of Paralabrax clathratus (adults) across sites. Feel free to transform the data for plotting purposes, if you think that helps you see any patterns.
I imported my data set, and ran the following code which is giving me 27 plots, with Site as x and Count as y, but there is no data shown in the plots.
head(channel_islands)
sapply(channel_islands, class)
levels(channel_islands$SpeciesName)
par(mfrow= c(6,5)) # set the plotting area into a 6 row*5 column array
for (i in 1:27) {
HR11<-subset(channel_islands,SpeciesName=="Hypsypops rubicundus,adult"[i] & Site==11)
PC15<-subset(channel_islands,SpeciesName=="Paralabrax clathratus,adult"[i] & Site==15)
with(HR11,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='green',main=i))
with(PC15,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='blue',main=i))
}
If anyone could help me figure out how to compare species abundance across sites, over 27 years, I would really appreciate it.
The expression "Hypsypops rubicundus,adult"[i] doesn't really make sense. It indexes a length-one character vector, so it works when i == 1 but returns NA for any larger i. SpeciesName == NA is never TRUE (it evaluates to NA), so subset() gives you an empty data frame and nothing is plotted.
Consider looking into using ggplot2 with facet_grid to quickly make multiple plots without the loop. The R Graphics Cookbook has good examples on using facets.
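A base R sketch of a corrected loop follows (this is not the ggplot2 facet approach suggested above, just a fix in the style of the original code). It assumes the columns are named Year, Site, count, and SpeciesName as described in the question, and that the species labels match the strings used in the question's code. The data are simulated stand-ins with only 3 years, so the values and the year range are made up. The key change is to put one species on each axis, with one point per site, and to loop over years rather than indexing the species string.

```r
# Simulated stand-in for channel_islands (3 years, 16 sites, 2 species).
set.seed(1)
channel_islands <- expand.grid(
  Year = 2001:2003, Site = 1:16,
  SpeciesName = c("Hypsypops rubicundus,adult", "Paralabrax clathratus,adult"),
  stringsAsFactors = FALSE
)
channel_islands$count <- rpois(nrow(channel_islands), lambda = 5)

# Total count per Year x Site for each species.
hr <- subset(channel_islands, SpeciesName == "Hypsypops rubicundus,adult")
pc <- subset(channel_islands, SpeciesName == "Paralabrax clathratus,adult")
hr_sum <- aggregate(count ~ Year + Site, data = hr, FUN = sum)
pc_sum <- aggregate(count ~ Year + Site, data = pc, FUN = sum)

# Put the two species side by side: one row per Year-Site combination.
both <- merge(hr_sum, pc_sum, by = c("Year", "Site"),
              suffixes = c("_hr", "_pc"))

# One scatterplot per year; log1p() compresses large counts and keeps zeros.
years <- sort(unique(both$Year))
par(mfrow = c(1, length(years)))
for (yr in years) {
  d <- both[both$Year == yr, ]
  plot(log1p(d$count_hr), log1p(d$count_pc), pch = 19, main = yr,
       xlab = "log(1 + H. rubicundus)", ylab = "log(1 + P. clathratus)")
}
```

With the full 27 years, `par(mfrow = c(6, 5))` as in the question gives enough panels.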

How to use image() function to plot the data in R

I have a clinical dataset and I would like to plot it using the image() function to see if I can spot the different groups within my data.
The structure of this data is a List of 2: 56 samples and 5000 gene expressions.
When I use image(lung), all I see is a plot of orange color, and no pattern or group stands out to me.
Basically, there are four types of clinical conditions in the dataset: Colon cancer (13 samples), smallcell (6 samples), etc.
I wanted to see whether, for instance, "smallcell" with its 6 samples has its own pattern compared to the rest of the groups/conditions within this dataset.
load(url("https://github.com/hughng92/dataset/raw/master/lung.RData"))
rownames(lung)
image(lung)
This is all I see:
I am wondering whether it will look different if I combine separate plots of the 4 conditions from the data set.
Any tip would be great!
I'd suggest looking at the image output after rearranging the like types together. I think I now see some group differences in those gene expression profiles. Specifically, the "Normal" category has generally fewer red bands, although there are a couple where "Normal" is red and the others are not. I think it is interesting, and not particularly surprising, that there appears to be less variability within the Normal columns (in the image) than there is within each of the tumor types. I have a friend who's a molecular biologist who characterizes tumors as "genetic train wrecks":
table( rownames( lung[order(rownames(lung)), ]))
Carcinoid     Colon    Normal SmallCell
       20        13        17         6
------------------
image( lung[order(rownames(lung)), ])
This would give a better indication of the boundaries of the type grouping:
image( lung[order(rownames(lung)), ], xaxt="n")
axis(1, at=(cumsum( table( rownames( lung[order(rownames(lung)), ])))-1)/56 ,
labels=names(table( rownames( lung[order(rownames(lung)), ]))),las=2)

How do I put more than 1 condition on a set of data?

I've got a set of data, olympic_height.txt, where each row corresponds to a person. There are 3 columns that tell you their height, their gender and what sport they play respectively. How do I obtain a subset of the data that only contains people that are male and play basketball for example?
I tried this:
MBP=read.table(file="olympic_height.txt", header=T)
MBP$sex=="M"
MBP$sport=="Basketball"
t=MBP
boxplot(t)
My goal is to have one boxplot of heights of male basketball players and one of heights of male football players. When I try this, I end up with 2 identical boxplots and I'm certain they should be very different. What am I doing wrong?
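A minimal sketch of one way to do this, assuming the three columns are named `height`, `sex`, and `sport` (only `sex` and `sport` appear in the code above; `height` is a guess) and using simulated data in place of olympic_height.txt. In the attempt above, the two comparisons only print TRUE/FALSE vectors and are never used to select rows, so `t` is still the whole data set. Combining the conditions with `&` inside the row index, or inside `subset()`, gives the filtered data.

```r
# Simulated stand-in for olympic_height.txt; the column name "height" is assumed.
set.seed(42)
MBP <- data.frame(
  height = round(rnorm(60, mean = 185, sd = 10)),
  sex    = rep(c("M", "F"), each = 30),
  sport  = rep(c("Basketball", "Football", "Swimming"), times = 20)
)

# A comparison alone only returns TRUE/FALSE; use it to index the rows.
# Combine conditions with & (and) or | (or).
male_bb <- MBP[MBP$sex == "M" & MBP$sport == "Basketball", ]

# subset() does the same with less typing; %in% matches several values.
male_two <- subset(MBP, sex == "M" & sport %in% c("Basketball", "Football"))

# One box per sport, drawn from the filtered rows only.
boxplot(height ~ sport, data = male_two, ylab = "Height")
```

The formula `height ~ sport` is what produces one box per sport; calling `boxplot()` on a whole data frame instead draws one box per column.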

R ttest on multiple levels of a factor

I'm trying to perform multiple t-tests on my dataset in R and got totally confused by the capabilities of the apply functions, aggregate and for loops.
My data is as follows: my observations are different products. For each product I have multiple numeric variables, which I'd like to compare. In addition, I have 13 different categories of products, and another factor variable which differentiates between new, used, and old products. So a sample of my data may look like the following:
ProdID  Category  Cond  No. of instances  Sales  Time since launch
aaaaa   Sports    New   100               40000  30
bbbb    Crafts    New   0                 0      20
ccccc   Music     Used  20                1000   10
My goal is to output, separately for each Category (Sports, Crafts, Music, etc.), the results of a t-test on each numeric variable, comparing the "New" mean to the "Used" mean (I'm not interested in the "Old" values at all). So at the end I want to see the comparison of "Time since launch", "Sales" and "No. of instances" between New and Used in Sports, then the same in Crafts, the same in Music, and so on.
I've tried it in so many ways, but in each of them (aggregate, tapply, for loop) I ran into a different problem. It seems that I'm missing something here (I'm fairly new to R; I used to do this in SPSS using Split File).
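One way this could be done in base R, sketched with simulated data: split the data by Category (the rough equivalent of a split file), then run `t.test()` on each numeric variable within each piece. The column names `Category`, `Cond`, `Instances`, `Sales`, and `TimeSinceLaunch` are assumed stand-ins for the headers in the sample, and all values are randomly generated.

```r
# Simulated data; column names and values are assumptions for illustration.
set.seed(7)
n <- 120
products <- data.frame(
  Category        = rep(c("Sports", "Crafts", "Music"), each = 40),
  Cond            = rep(c("New", "Used", "Old", "New"), times = 30),
  Instances       = rpois(n, 20),
  Sales           = rnorm(n, 10000, 2000),
  TimeSinceLaunch = rnorm(n, 20, 5)
)

# Drop the "Old" rows, keeping only the two groups to compare.
nu <- subset(products, Cond %in% c("New", "Used"))
num_vars <- c("Instances", "Sales", "TimeSinceLaunch")

# For each category, run a Welch t-test of New vs Used on every numeric variable.
results <- lapply(split(nu, nu$Category), function(d) {
  sapply(num_vars, function(v) {
    tt <- t.test(d[[v]] ~ d$Cond)
    c(mean_New  = unname(tt$estimate[1]),
      mean_Used = unname(tt$estimate[2]),
      t         = unname(tt$statistic),
      p         = tt$p.value)
  })
})

# One matrix per category: rows are statistics, columns are variables.
print(results$Sports)
```

`t.test()` defaults to the Welch (unequal variance) test; pass `var.equal = TRUE` for the classic Student version. With 13 categories and 3 variables this runs 39 tests, so adjusting the p-values with `p.adjust()` may be worth considering.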
