Sum of subvectors - r

My vector contains the frequency per day of a certain event in a certain month.
I want to see which run of 16 days contains the highest frequency, and I would like to extract the dates on which that run starts and ends.
vector=table(date[year(date)==2001&month(date)==05])
I know how to do this, but my method is (obviously) too primitive.
max(c(sum(vector[1:16]),sum(vector[2:17]),sum(vector[3:18]),sum(vector[4:19]),sum(vector[5:20]),sum(vector[6:21]))/sum(vector))
Edit: For reproducibility the data in vector is provided in .csv form below:
"","Var1","Freq"
"1","2001-05-06",1
"2","2001-05-08",1
"3","2001-05-09",7
"4","2001-05-10",2
"5","2001-05-11",10
"6","2001-05-12",10
"7","2001-05-13",7
"8","2001-05-14",20
"9","2001-05-15",24
"10","2001-05-16",15
"11","2001-05-17",27
"12","2001-05-18",17
"13","2001-05-19",13
"14","2001-05-20",15
"15","2001-05-21",13
"16","2001-05-22",26
"17","2001-05-23",17
"18","2001-05-24",19
"19","2001-05-25",7
"20","2001-05-26",5
"21","2001-05-27",6
"22","2001-05-28",2
"23","2001-05-29",1
"24","2001-05-31",1

Assuming the data in vector is as shown in your data example (a data frame with columns Var1 and Freq), something like this, using rollmean() from the zoo package:
library(zoo)
max_start <- which.max(rollmean(vector$Freq, 16, align="left"))
date_max_start <- vector$Var1[max_start]
date_max_end <- vector$Var1[max_start + 15]  # a 16-row window starting at i ends at row i + 15
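For a self-contained check against the posted Freq values, here is a base-R sketch (note the table omits days with zero events, e.g. 2001-05-07, so a 16-row window equals a 16-calendar-day run only if every date is present):

```r
freq <- c(1, 1, 7, 2, 10, 10, 7, 20, 24, 15, 27, 17, 13, 15, 13, 26,
          17, 19, 7, 5, 6, 2, 1, 1)  # the Freq column from the question
k <- 16
# Sum every window of k consecutive rows
win <- sapply(seq_len(length(freq) - k + 1),
              function(i) sum(freq[i:(i + k - 1)]))
start <- which.max(win)  # row where the best window begins
end <- start + k - 1     # row where it ends (inclusive)
# best run: rows 5 to 20 (2001-05-11 through 2001-05-26), total 245
```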


How to alter R code in an efficient way

I have a sample created as follows:
survival1a= data.frame(matrix(vector(), 50, 2,dimnames=list(c(), c("Id", "district"))),stringsAsFactors=F)
survival1a$Id <- 1:nrow(survival1a)
survival1a$district<- sample(1:4, size=50, replace=TRUE)
This sample has 50 individuals from 4 different districts.
I have a matrix of probabilities (Migdata) that gives the likelihood of migration from one district to another:
district   prob1     prob2     prob3     prob4
1          0.83790   0.08674   0.05524   0.02014
2          0.02184   0.88260   0.03368   0.06191
3          0.01093   0.03565   0.91000   0.04344
4          0.03338   0.06933   0.03644   0.86090
I merge these probabilities with my data with this code:
survival1a<-merge( Migdata,survival1a, by.x=c("district"), by.y=c("district"))
I would like to know which district each person resides in by the end of the year, based on the migration probabilities in Migdata.
I have already written code that works, but with big data it is very time-consuming since it is based on a loop:
for (k in 1:nrow(survival1a)) {
  survival1a$migration[k] <- sample(1:4, size = 1, replace = TRUE, prob = survival1a[k, 2:5])
}
Now, I want to write the code in a way that it would not be based on a loop and shows every person district by the end of the year.
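One loop-free approach (a sketch, with the example data rebuilt from the question): each row of probabilities defines a discrete distribution, so you can draw a single uniform number per person and invert the row-wise cumulative distribution:

```r
set.seed(42)

# Rebuild the example data from the question
Migdata <- data.frame(
  district = 1:4,
  prob1 = c(0.83790, 0.02184, 0.01093, 0.03338),
  prob2 = c(0.08674, 0.88260, 0.03565, 0.06933),
  prob3 = c(0.05524, 0.03368, 0.91000, 0.03644),
  prob4 = c(0.02014, 0.06191, 0.04344, 0.86090)
)
survival1a <- data.frame(Id = 1:50,
                         district = sample(1:4, size = 50, replace = TRUE))
survival1a <- merge(Migdata, survival1a, by = "district")

# One uniform draw per person; the sampled district is the first column
# whose cumulative probability reaches that draw (inverse-CDF sampling)
probs <- as.matrix(survival1a[, c("prob1", "prob2", "prob3", "prob4")])
cum <- t(apply(probs, 1, cumsum))  # row-wise cumulative probabilities
u <- runif(nrow(probs))
survival1a$migration <- rowSums(u > cum) + 1
```

Because u has one entry per row, the comparison u > cum aligns u[i] with row i in every column, so rowSums() counts how many cumulative probabilities each person's draw exceeds; adding 1 gives the sampled district.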

Create a new row to assign M/F to a column based on heading, referencing second table?

I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample ID# (~7000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample for that patient is a different letter (-A through -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (ex: BIOPSY7-A), recognize that it includes "BIOPSY7" (but not == BIOPSY7 because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, extrapolate M/F, and create a new row with M/F designation.
Honestly, I am so overwhelmed with coding this that I tried to open the file in Excel to manually input M/F, for the 7000 columns as it would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't quite know what your data looks like, so I made my own based on your description. I'm sure you can adapt this answer to your needs and your dataset's structure:
library(data.table)
genderfile <-data.frame("ID"=c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"),"Gender"=c("F","M","M","F","M"))
#you can read your own gender file into R with the line below
#genderfile <- read.csv("~/gender file.csv")
View(genderfile)
df<-matrix(rnorm(45, mean=10, sd=5),nrow=3)
colnames(df)<-c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C","BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C","BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C","BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df<-cbind(Gene=seq(1:3),df)
df<-as.data.frame(df)
#you can read your main df into R with the fread() line below; unlike read.csv, fread keeps the dashes in column names instead of converting them to periods (requires the data.table package)
#df<-fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (I dropped the first column with df[,-c(1)] because it is the Gene id):
substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender<-genderfile[, "Gender"][match(substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2), genderfile[,"ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
Last step is to add the Gender defined above as a row to the df:
df_withGender<-rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)
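An alternative sketch using a regex instead of substr(): strip everything from the final dash onward, which also copes with sample suffixes longer than one letter (example IDs made up):

```r
ids <- c("BIOPSY1-A", "BIOPSY12-B", "BIOPSY200-AA")
# Remove the trailing "-<letters>" suffix to recover the patient ID
sub("-[A-Za-z]+$", "", ids)
# "BIOPSY1"   "BIOPSY12"  "BIOPSY200"
```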

How to loop over rows sequentially while picking a specific column in each row

I am a relatively novice R user and am trying to recreate a 'Dogs of the DOW' strategy with the 6 biggest Canadian banks whereby you buy the poorest performing stock/bank from the previous year in the current year. I would like to go through each row and select the column with the preceding year's worst performer. Any suggestions or tips are greatly appreciated!
I have tried writing several versions of for loops but keep getting odd outputs. With the code below I get a vector where every entry is the same number.
In the code, I have a data frame (BtBB) which is the annual returns of the bank stocks from 2012 through to 2018 as the rows, and the 6 banks as the columns. BtBB_min is a vector which has 6 entries denoting which column the previous year's minimum return is in (so first value points to column 4 which is year 2012's worst performer, second value is column 2 which is 2013's worst performer etc.) BtBB_ret is meant to be the output showing the returns.
#Entering data
BtBB <- data.frame(
Date = as.Date(c("2012-12-31", "2013-12-31", "2014-12-31", "2015-12-31", "2016-12-31", "2017-12-31", "2018-12-31"), format = "%Y-%m-%d"),
CIBC = c(0.08375119, 0.13442541, 0.10052910, -0.08663862, 0.20144753, 0.11847390, -0.17023013),
RBC = c(0.151981531, 0.192551770, 0.123652150, -0.075897308, 0.225488874, 0.129635743, -0.089722358),
National = c(0.07069587, 0.14422579, 0.11880516, -0.18466828, 0.35276606, 0.15019255, -0.10634566),
BMO = c(0.08911954, 0.16348998, 0.16057054, -0.04989048, 0.23680840, 0.04162783, -0.11333135),
TD = c(0.097771953, 0.195319962, 0.108869357, -0.022878761, 0.220870206, 0.112201752, -0.078615071),
BNS = c(0.130434783, 0.156108597, -0.001806413, -0.155934248, 0.335715562, 0.085072231, -0.161119329))
BtBB_min <- apply(BtBB[-1], 1, which.min) # Finding Minimums
#Adding scalar to min vector so column numbers match properly with BtBB dataframe
BtBB_min <- BtBB_min + 1
#Removing last entry since only minimums from prior years matter, not current years
BtBB_min <- BtBB_min[-length(BtBB_min)]
#Removing first row from data frame since we want to reference current years
BtBB <- BtBB[-1,]
#Creating output vector for for loop
BtBB_ret <- vector("double", length = length(BtBB_min))
#Nested For loop where I'm having issue generating a proper output
for (h in seq_along(BtBB_ret)) {
  for (i in nrow(BtBB)) {
    for (j in seq_along(BtBB_min)) {
      BtBB_ret[h] <- BtBB[i, BtBB_min[j]]
    }
  }
}
Expect to get a vector of returns as:
.1442258, .10052910, -0.155934248, 0.3527661, 0.11847390, -0.11333135
Actually get BMO's return 6 times (-0.11333135). Can't figure out why. Have worked on this problem for like a week and can't seem to crack it :(
You are looping more than necessary and overwriting the BtBB_ret values over and over again. Note that for (i in nrow(BtBB)) iterates over the single value nrow(BtBB), not 1:nrow(BtBB), so i is always the last row; the innermost j loop then rewrites BtBB_ret[h] on every pass, leaving each entry equal to BtBB[6, BtBB_min[6]], which is BMO's 2018 return. One loop suffices:
for (i in 1:nrow(BtBB)) {
  BtBB_ret[i] <- BtBB[i, BtBB_min[i]]
}
BtBB_ret
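The same selection can also be written without an explicit loop; here is a self-contained sketch that rebuilds the question's data and uses sapply() to index one (row, column) pair per year:

```r
# Rebuild the data from the question
BtBB <- data.frame(
  Date = as.Date(c("2012-12-31", "2013-12-31", "2014-12-31", "2015-12-31",
                   "2016-12-31", "2017-12-31", "2018-12-31")),
  CIBC = c(0.08375119, 0.13442541, 0.10052910, -0.08663862, 0.20144753, 0.11847390, -0.17023013),
  RBC = c(0.151981531, 0.192551770, 0.123652150, -0.075897308, 0.225488874, 0.129635743, -0.089722358),
  National = c(0.07069587, 0.14422579, 0.11880516, -0.18466828, 0.35276606, 0.15019255, -0.10634566),
  BMO = c(0.08911954, 0.16348998, 0.16057054, -0.04989048, 0.23680840, 0.04162783, -0.11333135),
  TD = c(0.097771953, 0.195319962, 0.108869357, -0.022878761, 0.220870206, 0.112201752, -0.078615071),
  BNS = c(0.130434783, 0.156108597, -0.001806413, -0.155934248, 0.335715562, 0.085072231, -0.161119329)
)
BtBB_min <- apply(BtBB[-1], 1, which.min) + 1  # +1 offsets the Date column
BtBB_min <- BtBB_min[-length(BtBB_min)]        # drop the final year's minimum
BtBB <- BtBB[-1, ]                             # drop the first year of returns
# Pick, for each year, the return in the previous year's worst-performer column
BtBB_ret <- sapply(seq_along(BtBB_min), function(i) BtBB[i, BtBB_min[i]])
```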

Best way to get list of SNPs by gene id?

I have a long data frame of genes and various forms of ids for them (e.g. OMIM, Ensembl, Genatlas). I want to get the list of all SNPs that are associated with each gene. (This is the reverse of this question.)
So far, the best solution I have found is using the biomaRt package (bioconductor). There is an example of the kind of lookup I need to do here. Fitted for my purposes, here is my code:
library(biomaRt)
#load the human variation data
variation = useEnsembl(biomart="snp", dataset="hsapiens_snp")
#look up a single gene and get SNP data
getBM(attributes = c(
"ensembl_gene_stable_id",
'refsnp_id',
'chr_name',
'chrom_start',
'chrom_end',
'minor_allele',
'minor_allele_freq'),
filters = 'ensembl_gene',
values ="ENSG00000166813",
mart = variation
)
This outputs a data frame that begins like this:
ensembl_gene_stable_id refsnp_id chr_name chrom_start chrom_end minor_allele minor_allele_freq
1 ENSG00000166813 rs8179065 15 89652777 89652777 T 0.242412
2 ENSG00000166813 rs8179066 15 89652736 89652736 C 0.139776
3 ENSG00000166813 rs12899599 15 89629243 89629243 A 0.121006
4 ENSG00000166813 rs12899845 15 89621954 89621954 C 0.421126
5 ENSG00000166813 rs12900185 15 89631884 89631884 A 0.449681
6 ENSG00000166813 rs12900805 15 89631593 89631593 T 0.439297
(4612 rows)
The code works, but the running time is extremely long. For the above, it takes about 45 seconds. I thought maybe this was related to the allele frequencies, which the server perhaps calculates on the fly, but looking up the bare minimum of only the SNP rs ids still takes something like 25 seconds. I have a few thousand genes, so this would take an entire day (assuming no timeouts or other errors). This can't be right; my internet connection is not slow (20-30 Mbit/s).
I tried looking up more genes per query, but that did not help: looking up 10 genes at once is roughly 10 times as slow as looking up a single gene.
What is the best way to get a vector of SNPs associated with a vector of gene ids?
If I could just download two tables, one with genes and their positions and one with SNPs and their positions, then I could easily solve this problem using dplyr (or maybe data.table). I haven't been able to find such tables.
Since you're using R, here's an idea that uses the rentrez package. It utilizes NCBI's Entrez database system, and in particular the E-utilities function elink. You'll have to write some code around this and probably tweak parameters, but it could be a good start.
library(rentrez)
# for converting gene name -> gene id
gene_search <- entrez_search(db="gene", term="(PTEN[Gene Name]) AND Homo sapiens[Organism]", retmax=1)
geneId <- gene_search$ids
# elink function
snp_links <- entrez_link(dbfrom='gene', id=geneId, db='snp')
# access results with $links
length(snp_links$links$gene_snp)
5779
head(snp_links$links$gene_snp)
'864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
I suggest you manually double-check that the number of SNPs is about what you'd expect for your genes of interest -- you may need to drill down further and limit by transcript, etc...
For multiple gene ids:
multi_snp_links <- entrez_link(dbfrom='gene', id=c("5728", "374654"), db='snp', by_id=TRUE)
lapply(multi_snp_links, function(x) head(x$links$gene_snp))
1. '864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
2. '797045093' '797044466' '797044465' '797044464' '797044463' '797016353'
The results are grouped by gene when by_id=TRUE.
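On the question's fallback idea of joining two downloaded position tables: once you have them, the interval lookup itself is simple even in base R. A toy sketch (all names and coordinates made up):

```r
# Hypothetical gene and SNP position tables
genes <- data.frame(gene = c("g1", "g2"), chr = "15",
                    start = c(100, 500), end = c(200, 650))
snps <- data.frame(rsid = c("rs1", "rs2", "rs3"), chr = "15",
                   pos = c(150, 600, 900))
# Join on chromosome, then keep SNPs falling inside each gene's range
pairs <- merge(genes, snps, by = "chr")
hits <- pairs[pairs$pos >= pairs$start & pairs$pos <= pairs$end,
              c("gene", "rsid")]
```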

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference manual and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I use the following command to read the file into R:
tbl <- read.csv("foo.txt", header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to compute:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to the columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
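Putting it together with the sample data, a minimal sketch (reading the first three rows from an inline string rather than from foo.txt):

```r
# read.csv(text = ...) parses a string exactly as it would parse a file
tbl <- read.csv(text = "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10")
area <- tbl$area      # equivalently tbl[,1] or tbl[["area"]]
energy <- tbl$energy
```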
