R t-test on multiple levels of a factor

I'm trying to perform multiple t-tests on my dataset in R and have gotten thoroughly confused by the capabilities of the apply functions, aggregate, and for loops.
My data is as follows: the observations are different products, and for each product I have several numeric variables that I'd like to compare. In addition, the products fall into 13 different categories, and there is another factor variable that distinguishes between new, used, and old products. So a sample of my data may look like this:
ProdID  Category  Cond  No. of instances  Sales  Time since launch
aaaaa   Sports    New   100               40000  30
bbbb    Crafts    New   0                 0      20
ccccc   Music     Used  20                1000   10
My goal is the following: for each Category (Sports, Crafts, Music, etc.) I want to output, separately, the results of a t-test. Each t-test should compare the mean of a numeric variable between "New" and "Used" (I'm not interested in the "Old" values at all). So in the end I want to see the comparison of "Time since launch", "Sales" and "No. of instances" between New and Used in Sports, then the same in Crafts, the same in Music, and so on.
I've tried this in many ways, but with each of them (aggregate, tapply, a for loop) I ran into a different problem... It seems I'm missing something here (I'm fairly new to R; I used to do this in SPSS with Split File...).
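For what it's worth, here is a minimal base-R sketch of one way to structure this, assuming the data frame is called prod and the numeric columns have been renamed to NumInstances, Sales and TimeSinceLaunch (both the object name and those column names are assumptions):

prod_nu <- droplevels(subset(prod, Cond %in% c("New", "Used")))  # drop the "Old" rows entirely
num_vars <- c("NumInstances", "Sales", "TimeSinceLaunch")

results <- lapply(split(prod_nu, prod_nu$Category), function(d) {
  setNames(lapply(num_vars, function(v) {
    t.test(d[[v]] ~ d$Cond)   # compares the "New" mean with the "Used" mean
  }), num_vars)
})

results[["Sports"]][["Sales"]]   # the Sales t-test for the Sports category

Here split() plays roughly the role of Split File in SPSS: it hands each category's rows to the inner lapply(), which runs one t-test per numeric variable.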

Related

Use R to compare metagenomic data

Good evening everyone,
I am not exactly new to R; I have done a course on Coursera, but I haven't really done anything serious with R yet.
Now I have some metagenomic data, split into tibbles: the domains of metagenome 1 in one tibble, metagenome 2 in another, and so on, and similarly for phyla, class, order, genus, family, etc. I need to make comparisons of the data, e.g. compare the genera present in one metagenome with those in four or five other metagenomes. Can you point me towards libraries and functions with which I can compare data like this?
Example data (the tibbles with the genus and family data are even longer, with hundreds of columns):
Archaea  Bacteria  Eukaryota  Viruses  other.sequences  unclassified.sequences
649      423655    4901       64       7                317
Now I understand that I should tidy the data by turning the column names into a column (e.g. taxon) using pivot_longer().
But what are some good ways to visualize data like this?
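One possible starting point, sketched under the assumption that one of the domain-level tibbles is called dom1 and has the one-row wide layout shown above (the object name is made up):

library(tidyverse)

dom_long <- dom1 %>%
  pivot_longer(everything(), names_to = "taxon", values_to = "count")

# counts span several orders of magnitude, so a log scale helps
ggplot(dom_long, aes(x = taxon, y = count)) +
  geom_col() +
  scale_y_log10() +
  coord_flip()

To compare several metagenomes at once, you could bind_rows() the reshaped tibbles with an .id column (e.g. metagenome) and map that column to fill or to facets in the plot.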

Scatterplot for comparing species abundance

I have a homework question that states the following:
The file “channel_islands_counts_edit.csv” contains survey data on temperate rocky reef fishes from the Channel Islands, collected at many sites over many years. The data has columns for Year, Date, Site, count, and SpeciesName (broken into adults and juveniles). The version of the data that I’ve given you looks at 16 sites over 27 years, with count data for 27 categories of fish. Imagine we’re interested in whether the abundance of different species are correlated across sites (to get a sense for whether species have similar habitat preferences and/or interact with each other), and whether the across-site correlations are consistent over time. To visualize this, make some code that does the following:
For each year, draw a scatterplot that compares the abundance of Hypsypops rubicundus (adults) and the abundance of Paralabrax clathratus (adults) across sites. Feel free to transform the data for plotting purposes, if you think that helps you see any patterns.
I imported my data set and ran the following code, which gives me 27 plots with Site on the x axis and count on the y axis, but no data is shown in the plots.
head(channel_islands)
sapply(channel_islands, class)
levels(channel_islands$SpeciesName)

par(mfrow = c(6, 5))  # set the plotting area into a 6-row * 5-column array
for (i in 1:27) {
  HR11 <- subset(channel_islands, SpeciesName == "Hypsypops rubicundus,adult"[i] & Site == 11)
  PC15 <- subset(channel_islands, SpeciesName == "Paralabrax clathratus,adult"[i] & Site == 15)
  with(HR11, plot(count ~ Site, type = "b", pch = 19, ylim = c(0, 10), xlim = c(0, 16), col = "green", main = i))
  with(PC15, plot(count ~ Site, type = "b", pch = 19, ylim = c(0, 10), xlim = c(0, 16), col = "blue", main = i))
}
If anyone could help me figure out how to compare species abundance across sites, over 27 years, I would really appreciate it.
The code "Hypsypops rubicundus,adult"[i] doesn't really make sense. Technically, it should work for when i == 1 but beyond that it would just return NA. I'm assuming SpeciesName == NA will never be true so you will get an empty subset.
Consider looking into using ggplot2 with facet_grid to quickly make multiple plots without the loop. The R Graphics Cookbook has good examples on using facets.
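To make the facet suggestion concrete, here is a rough ggplot2 sketch (using facet_wrap rather than facet_grid, since there is only one faceting variable). It assumes the data frame is channel_islands with the Year, Site, count and SpeciesName columns described in the assignment; the exact species labels may need adjusting:

library(dplyr)
library(tidyr)
library(ggplot2)

two_sp <- channel_islands %>%
  filter(SpeciesName %in% c("Hypsypops rubicundus,adult",
                            "Paralabrax clathratus,adult")) %>%
  group_by(Year, Site, SpeciesName) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  pivot_wider(names_from = SpeciesName, values_from = count)

ggplot(two_sp, aes(x = `Hypsypops rubicundus,adult`,
                   y = `Paralabrax clathratus,adult`)) +
  geom_point() +
  facet_wrap(~ Year)   # one scatterplot of site-level abundances per year

A log1p() transform of both counts inside aes() is one option if the points pile up near zero, which the assignment explicitly allows.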

How to add only one observation at a time amongst several observations in R?

Say I have observations of financial data for several periods. How can I create a function in R that adds only one observation at a time throughout my dataset, so that I can compare how each single observation impacts my original data?
Say for instance that I have something like this:
Apple Microsoft Tesla Amazon
2010 0.8533719 0.8078440 0.2620114 0.1869552
2011 0.7462573 0.5127501 0.5452448 0.1369686
2012 0.7580671 0.5062639 0.7847919 0.8362821
2013 0.3154078 0.6960258 0.7303597 0.6057027
2014 0.4741735 0.3906580 0.4515726 0.1396147
2015 0.4230036 0.4728911 0.1262413 0.7495193
2016 0.2396552 0.5001825 0.6732861 0.8535837
2017 0.2007575 0.8875209 0.5086837 0.2211072
#And I define my original covariance matrix as follows:
cov.m <- cov(x[1:5,])
#I would like to add only one new observation at a time, so the results should be:
cov(x[1:5,]), cov(x[1:6,]), cov(x[1:7,]), cov(x[1:8,])
I have tried using rbind and a repeat loop, but it seems I still have to specify every row to include in rbind, which is quite tedious if I want to test on, say, 100+ different observations, since I would then have to list all the observations manually; the repeat loop would be of no use in that case either.
Does this get you closer to your expected output?
lapply(5:nrow(x), function(y) cov(x[1:y, ]))
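As a small follow-up, naming the list elements makes it easier to pull out a particular matrix (this assumes the years shown above are the row names of x):

cov.list <- lapply(5:nrow(x), function(y) cov(x[1:y, ]))
names(cov.list) <- rownames(x)[5:nrow(x)]
cov.list[["2015"]]   # covariance of the 2010-2015 rows, i.e. cov(x[1:6, ])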

Best way to get list of SNPs by gene id?

I have a long data frame of genes and various forms of ids for them (e.g. OMIM, Ensembl, Genatlas). I want to get the list of all SNPs that are associated with each gene. (This is the reverse of this question.)
So far, the best solution I have found is using the biomaRt package (Bioconductor). There is an example of the kind of lookup I need to do here. Adapted for my purposes, here is my code:
library(biomaRt)

# load the human variation data
variation <- useEnsembl(biomart = "snp", dataset = "hsapiens_snp")

# look up a single gene and get SNP data
getBM(attributes = c("ensembl_gene_stable_id",
                     "refsnp_id",
                     "chr_name",
                     "chrom_start",
                     "chrom_end",
                     "minor_allele",
                     "minor_allele_freq"),
      filters = "ensembl_gene",
      values = "ENSG00000166813",
      mart = variation)
This outputs a data frame that begins like this:
ensembl_gene_stable_id refsnp_id chr_name chrom_start chrom_end minor_allele minor_allele_freq
1 ENSG00000166813 rs8179065 15 89652777 89652777 T 0.242412
2 ENSG00000166813 rs8179066 15 89652736 89652736 C 0.139776
3 ENSG00000166813 rs12899599 15 89629243 89629243 A 0.121006
4 ENSG00000166813 rs12899845 15 89621954 89621954 C 0.421126
5 ENSG00000166813 rs12900185 15 89631884 89631884 A 0.449681
6 ENSG00000166813 rs12900805 15 89631593 89631593 T 0.439297
(4612 rows)
The code works, but the running time is extremely long. For the above, it takes about 45 seconds. I thought maybe this was related to the allele frequencies, which the server perhaps calculates on the fly. But looking up the bare minimum of only the SNP rs ids still takes something like 25 seconds. I have a few thousand genes, so this would take an entire day (assuming no timeouts or other errors). This can't be right. My internet connection is not slow (20-30 Mbit).
I tried looking up more genes per query. This did not help: looking up 10 genes at once is roughly 10 times as slow as looking up a single gene.
What is the best way to get a vector of SNPs that are associated with a vector of gene ids?
If I could just download two tables, one with genes and their positions and one with SNPs and their positions, then I could easily solve this problem using dplyr (or maybe data.table). I haven't been able to find such tables.
Since you're using R, here's an idea that uses the package rentrez. It utilizes NCBI's Entrez database system and, in particular, the E-utilities linking function elink. You'll have to write some code around this and probably tweak parameters, but it could be a good start.
library(rentrez)

# for converting gene name -> gene id
gene_search <- entrez_search(db = "gene", term = "(PTEN[Gene Name]) AND Homo sapiens[Organism]", retmax = 1)
geneId <- gene_search$ids

# elink function
snp_links <- entrez_link(dbfrom = "gene", id = geneId, db = "snp")

# access results with $links
length(snp_links$links$gene_snp)
#> 5779
head(snp_links$links$gene_snp)
#> '864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
I suggest you manually double-check that the number of SNPs is about what you'd expect for your genes of interest -- you may need to drill down further and limit by transcript, etc...
For multiple gene ids:
multi_snp_links <- entrez_link(dbfrom = "gene", id = c("5728", "374654"), db = "snp", by_id = TRUE)
lapply(multi_snp_links, function(x) head(x$links$gene_snp))
#> 1. '864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
#> 2. '797045093' '797044466' '797044465' '797044464' '797044463' '797016353'
With by_id = TRUE, the results are grouped by gene.
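Scaled up to the full gene list, something along these lines might work; gene_ids here is a hypothetical vector holding all the Entrez gene ids from your data frame, and it is worth spot-checking that the i-th result really corresponds to the i-th input id:

all_links <- entrez_link(dbfrom = "gene", id = gene_ids, db = "snp", by_id = TRUE)
# one elink object per input id; extract and name the SNP id vectors
snps_by_gene <- setNames(lapply(all_links, function(x) x$links$gene_snp), gene_ids)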

R: iterating through unique values of a vector in for loop

I'm new to R and I am having some trouble iterating through the unique element of a vector. I have a dataframe "School" with 700 different teachers. Each teacher has around 40 students.
I want to be able to loop through each teacher, create a graph of his/her students' mean scores over time, save the graph in a folder, and automatically email that folder to the teacher.
I'm just getting started and am having trouble setting up the for-loop. In Stata, I know how to loop through each unique element in a list, but am having trouble doing that in R. Any help would be appreciated.
School$Teacher  School$Student  School$ScoreNovember  School$ScoreDec  School$TeacherEmail
A               1               35                    45               A#school.org
A               2               43                    65               A#school.org
B               1               66                    54               B#school.org
A               3               97                    99               A#school.org
C               1               23                    45               C#school.org
Your question seems a bit vague and it looks like you want us to write your whole project. Could you share what you have done so far and where exactly you are struggling?
see ?subset
School <- data.frame(Teacher = c("A", "B"), ScoreNovember = 10:11, ScoreDec = 13:14)
for (teacher in unique(School$Teacher)) {
  teacher_df <- subset(School, Teacher == teacher)
  MeanScoreNovember <- mean(teacher_df$ScoreNovember)
  MeanScoreDec <- mean(teacher_df$ScoreDec)
  # do your plot
  # send your email
}
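As a rough illustration only, the plotting step inside that loop could save one small chart per teacher along these lines (the plots/ folder and the file naming are assumptions, and the folder must already exist):

png(file.path("plots", paste0("teacher_", teacher, ".png")))
plot(c(MeanScoreNovember, MeanScoreDec), type = "b", xaxt = "n",
     xlab = "", ylab = "Mean score", main = paste("Teacher", teacher))
axis(1, at = 1:2, labels = c("November", "December"))
dev.off()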
I think you have three questions here, which really belong in separate posts. How do I:
1. Create graphs
2. Automatically email output
3. Compute a subset mean based on group
For the third one, I like using the plyr package; other people will recommend the data.table or dplyr packages. You can also use aggregate from base R. To get a teacher's mean:
library(plyr)
ddply(School,.(Teacher),summarise,Nov_m=mean(ScoreNovember))
If you want it per student per teacher, etc., just add more grouping columns between the brackets, like:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember))
You could do that for each score column (and then chart it). If your data were long rather than wide, you could also add the date ('November', 'Dec') as a group in the brackets, or:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember),Dec_m=mean(ScoreDec))
See if that helps with the 3rd, but look at splitting your questions up too.
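Since dplyr is mentioned above as an alternative, the equivalent of that last ddply call would look roughly like this (grouping just by Teacher, to match the toy School data frame from the other answer):

library(dplyr)
School %>%
  group_by(Teacher) %>%
  summarise(Nov_m = mean(ScoreNovember),
            Dec_m = mean(ScoreDec))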
