How to improve this code for getting pairwise? - r

This question builds on a previous question (http://stackoverflow.com/questions/6538448/r-how-to-write-a-loop-to-get-a-matrix).
It differs from the previous one in that more detail is given, and the libraries and an example file are now included, following comments from DWin. I have therefore submitted it as a new question. Could you show me how to modify this code further?
To load the necessary libraries:
source("http://bioconductor.org/biocLite.R")
biocLite()
My protseq.fasta file has the following contents:
>drugbank_target|1 Peptidoglycan synthetase ftsI (DB00303)
MVKFNSSRKSGKSKKTIRKLTAPETVKQNKPQKVFEKCFMRGRYMLSTVLILLGLCALVARAAYVQSINADTLSNEADKR
SLRKDEVLSVRGSILDRNGQLLSVSVPMSAIVADPKTMLKENSLADKERIAALAEELGMTENDLVKKIEKNSKSGYLYLA
RQVELSKANYIRRLKIKGIILETEHRRFYPRVEEAAHVVGYTDIDGNGIEGIEKSFNSLLVGKDGSRTVRKDKRGNIVAH
ISDEKKYDAQDVTLSIDEKLQSMVYREIKKAVSENNAESGTAVLVDVRTGEVLAMATAPSYNPNNRVGVKSELMRNRAIT
DTFEPGSTVKPFVVLTALQRGVVKRDEIIDTTSFKLSGKEIVDVAPRAQQTLDEILMNSSNRGVSRLALRMPPSALMETY
QNAGLSKPTDLGLIGEQVGILNANRKRWADIERATVAYGYGITATPLQIARAYATLGSFGVYRPLSITKVDPPVIGKRVF
SEKITKDIVGILEKVAIKNKRAMVEGYRVGVKTGTARKIENGHYVNKYVAFTAGIAPISDPRYALVVLINDPKAGEYYGG
AVSAPVFSNIMGYALRANAIPQDAEAAENTTTKSAKRIVYIGEHKNQKVN
>drugbank_target|3 Histidine decarboxylase (DB00114; DB00117)
MMEPEEYRERGREMVDYICQYLSTVRERRVTPDVQPGYLRAQLPESAPEDPDSWDSIFGDIERIIMPGVVHWQSPHMHAY
YPALTSWPSLLGDMLADAINCLGFTWASSPACTELEMNVMDWLAKMLGLPEHFLHHHPSSQGGGVLQSTVSESTLIALLA
ARKNKILEMKTSEPDADESCLNARLVAYASDQAHSSVEKAGLISLVKMKFLPVDDNFSLRGEALQKAIEEDKQRGLVPVF
VCATLGTTGVCAFDCLSELGPICAREGLWLHIDAAYAGTAFLCPEFRGFLKGIEYADSFTFNPSKWMMVHFDCTGFWVKD
KYKLQQTFSVNPIYLRHANSGVATDFMHWQIPLSRRFRSVKLWFVIRSFGVKNLQAHVRHGTEMAKYFESLVRNDPSFEI
PAKRHLGLVVFRLKGPNCLTENVLKEIAKAGRLFLIPATIQDKLIIRFTVTSQFTTRDDILRDWNLIRDAATLILSQHCT
SQPSPRVGNLISQIRGARAWACGTSLQSVSGAGDDPVQARKIIKQPQRVGAGPMKRENGLHLETLLDPVDDCFSEEAPDA
TKHKLSSFLFSYLSVQTKKKTVRSLSCNSVPVSAQKPLPTEASVKNGGSSRVRIFSRFPEDMMMLKKSAFKKLIKFYSVP
SFPECSSQCGLQLPCCPLQAMV
>drugbank_target|5 Glutaminase liver isoform, mitochondrial (DB00130; DB00142)
MRSMKALQKALSRAGSHCGRGGWGHPSRSPLLGGGVRHHLSEAAAQGRETPHSHQPQHQDHDSSESGMLSRLGDLLFYTI
AEGQERTPIHKFTTALKATGLQTSDPRLRDCMSEMHRVVQESSSGGLLDRDLFRKCVSSSIVLLTQAFRKKFVIPDFEEF
TGHVDRIFEDVKELTGGKVAAYIPQLAKSNPDLWGVSLCTVDGQRHSVGHTKIPFCLQSCVKPLTYAISISTLGTDYVHK
FVGKEPSGLRYNKLSLDEEGIPHNPMVNAGAIVVSSLIKMDCNKAEKFDFVLQYLNKMAGNEYMGFSNATFQSEKETGDR
NYAIGYYHEEKKCFPKGVDMMAALDLYFQLCSVEVTCESGSVMAATLANGGICPITGESVLSAEAVRNTLSLMHSCGMYD
FSGQFAFHVGLPAKSAVSGAILLVVPNVMGMMCLSPPLDKLGNSHRGTSFCQKLVSLFNFHNYDNLRHCARKLDPRREGA
EIRNKTVVNLLFAAYSGDVSALRRFALSAMDMEQKDYDSRTALHVAAAEGHIEVVKFLIEACKVNPFAKDRWGNIPLDDA
VQFNHLEVVKLLQDYQDSYTLSETQAEAAAEALSKENLESMV
>drugbank_target|6 Coagulation factor XIII A chain (DB00130; DB01839; DB02340)
SETSRTAFGGRRAVPPNNSNAAEDDLPTVELQGVVPRGVNLQEFLNVTSVHLFKERWDTNKVDHHTDKYENNKLIVRRGQ
SFYVQIDFSRPYDPRRDLFRVEYVIGRYPQENKGTYIPVPIVSELQSGKWGAKIVMREDRSVRLSIQSSPKCIVGKFRMY
VAVWTPYGVLRTSRNPETDTYILFNPWCEDDAVYLDNEKEREEYVLNDIGVIFYGEVNDIKTRSWSYGQFEDGILDTCLY
VMDRAQMDLSGRGNPIKVSRVGSAMVNAKDDEGVLVGSWDNIYAYGVPPSAWTGSVDILLEYRSSENPVRYGQCWVFAGV
FNTFLRCLGIPARIVTNYFSAHDNDANLQMDIFLEEDGNVNSKLTKDSVWNYHCWNEAWMTRPDLPVGFGGWQAVDSTPQ
ENSDGMYRCGPASVQAIKHGHVCFQFDAPFVFAEVNSDLIYITAKKDGTHVVENVDATHIGKLIVTKQIGGDGMMDITDT
YKFQEGQEEERLALETALMYGAKKPLNTEGVMKSRSNVDMDFEVENAVLGKDFKLSITFRNNSHNRYTITAYLSANITFY
TGVPKAEFKKETFDVTLEPLSFKKEAVLIQAGEYMGQLLEQASLHFFVTARINETRDVLAKQKSTVLTIPEIIIKVRGTQ
VVGSDMTVTVQFTNPLKETLRNVWVHLDGPGVTRPMKKMFREIRPNSTVQWEEVCRPWVSGHRKLIASMSSDSLRHVYGE
LDVQIQRRPSM
To load the data to R for the analysis, I have done:
require("Biostrings")
data(BLOSUM100)
seqs <- readFASTA("./protseq.fasta", strip.descs=TRUE)
To get the pairwise index combinations (there are 4 sequences in total), I have done:
number <-c(1:4); dat <- expand.grid(number,number, stringsAsFactors=FALSE)
datr <- dat[dat[,1] > dat[,2] , ]
In order to calculate the score one by one, I can do this:
score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5))
However, I have a problem adding a new column, "score", that holds the score for each pair of proteins. I tried the following, but it did not work:
datr$score <- lapply(datr, 1, function(i) { x <- datr[i,1]; y<- datr[i,2]; score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5))})
Could you comment on how to improve it further? Thanks to DWin and diliop for the wonderful solutions to my previous question.

Try:
datr$score <- sapply(1:nrow(datr), function(i) {
x <- datr[i,1]
y <- datr[i,2]
score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100,gapOpening=0, gapExtension=-5))
})
To be able to reference your sequences better using their names, you might want to tidy up datr by doing the following:
colnames(datr) <- c("seq1id", "seq2id", "score")
datr$seq1name <- sapply(datr$seq1id, function(i) seqs[[i]]$desc)
datr$seq2name <- sapply(datr$seq2id, function(i) seqs[[i]]$desc)
Or if you just want to extract the accession IDs i.e. the contents of your parentheses, you could use stringr as such:
library(stringr)
datr$seq1name <- sapply(datr$seq1id, function(i) str_extract(seqs[[i]]$desc, "DB[0-9\\ ;DB]+"))
Hope this helps!
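The row-wise pattern used in this answer can be tried without Biostrings. This is a minimal sketch under stated assumptions: toy_seqs and toy_score are made-up stand-ins for the FASTA data and for score(pairwiseAlignment(...)), and combn() is swapped in as another way to list each unordered pair once.

```r
# Hypothetical stand-in data: short strings instead of protein sequences
toy_seqs <- c(a = "MVKF", b = "MVKL", c = "AAKF")

# Hypothetical stand-in for score(pairwiseAlignment(...)):
# counts the positions where two equal-length strings agree
toy_score <- function(s1, s2) {
  sum(strsplit(s1, "")[[1]] == strsplit(s2, "")[[1]])
}

# combn() lists each unordered pair of indices exactly once
pairs <- as.data.frame(t(combn(length(toy_seqs), 2)))
colnames(pairs) <- c("seq1id", "seq2id")

# vapply() works like sapply() but checks that each result is a single number
pairs$score <- vapply(seq_len(nrow(pairs)), function(i) {
  toy_score(toy_seqs[[pairs$seq1id[i]]], toy_seqs[[pairs$seq2id[i]]])
}, numeric(1))

print(pairs)
```

Replacing toy_score with the real alignment call should give the same kind of data frame as datr, one row per pair.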

Related

How to create a loop that changes part of a column name in a data frame

I am trying to find Cronbach's Alpha for survey data containing a series of multi-item measures. Rather than have to manually write out every single multi-item measure, it looks like something a loop should be able to manage far more effectively, but it needs to change only part of the column name, according to the question number.
The basic idea as it currently sits in my head would be...
for (N in 4:22) {
ytqN <- data.frame(YT_Data$QNa, YT_Data$QNb, YT_Data$QNc)
alpha(ytqN)
}
The loop would then create new data frames for each multi item measure and run Cronbach's Alpha as it goes.
This doesn't work though. :(
ytq4 <- data.frame(YT_Data$Q4a, YT_Data$Q4b, YT_Data$Q4c)
alpha(ytq4)
ytq5 <- data.frame(YT_Data$Q5a, YT_Data$Q5b, YT_Data$Q5c)
alpha(ytq5)
ytq6 <- data.frame(YT_Data$Q6a, YT_Data$Q6b, YT_Data$Q6c)
alpha(ytq6)
ytq7 <- data.frame(YT_Data$Q7a, YT_Data$Q7b, YT_Data$Q7c)
alpha(ytq7)
ytq8 <- data.frame(YT_Data$Q8a, YT_Data$Q8b, YT_Data$Q8c)
alpha(ytq8)
ytq9 <- data.frame(YT_Data$Q9a, YT_Data$Q9b, YT_Data$Q9c)
alpha(ytq9)
ytq10 <- data.frame(YT_Data$Q10a, YT_Data$Q10b, YT_Data$Q10c)
alpha(ytq10)
ytq11 <- data.frame(YT_Data$Q11a, YT_Data$Q11b, YT_Data$Q11c)
alpha(ytq11)
ytq12 <- data.frame(YT_Data$Q12a, YT_Data$Q12b, YT_Data$Q12c)
alpha(ytq12)
ytq13 <- data.frame(YT_Data$Q13a, YT_Data$Q13b, YT_Data$Q13c)
alpha(ytq13)
ytq14 <- data.frame(YT_Data$Q14a, YT_Data$Q14b, YT_Data$Q14c)
alpha(ytq14)
ytq15 <- data.frame(YT_Data$Q15a, YT_Data$Q15b, YT_Data$Q15c)
alpha(ytq15)
ytq16 <- data.frame(YT_Data$Q16a, YT_Data$Q16b, YT_Data$Q16c)
alpha(ytq16)
ytq17 <- data.frame(YT_Data$Q17a, YT_Data$Q17b, YT_Data$Q17c)
alpha(ytq17)
ytq18 <- data.frame(YT_Data$Q18a, YT_Data$Q18b, YT_Data$Q18c)
alpha(ytq18)
ytq19 <- data.frame(8 - YT_Data$Q19a, YT_Data$Q19b, YT_Data$Q19c)
# Reverse code Q19a
alpha(ytq19)
ytq20 <- data.frame(YT_Data$Q20a, YT_Data$Q20b, YT_Data$Q20c)
alpha(ytq20)
ytq21 <- data.frame(YT_Data$Q21a, YT_Data$Q21b, YT_Data$Q21c)
alpha(ytq21)
ytq22 <- data.frame(YT_Data$Q22a, YT_Data$Q22b, YT_Data$Q22c)
alpha(ytq22)
The desired results would be a single output containing all the Cronbach's Alphas for the multi item measures for questions 4-22 in the data set I am currently working on executed via a single piece of code, rather than have to go question by question.
It's easier to help if you include your data, but I guess this should work:
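alpha_list = list()
for (N in 4:22) {
  # Build the three column names for question N, e.g. "Q4a", "Q4b", "Q4c"
  ytq = data.frame(YT_Data[paste0("Q", N, "a")],
                   YT_Data[paste0("Q", N, "b")],
                   YT_Data[paste0("Q", N, "c")])
  # Store the result under a name such as "Q4", so the list has no empty slots 1 to 3
  alpha_list[[paste0("Q", N)]] = alpha(ytq)
}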
We are using paste0() to create the column names while looping on N. alpha_list will be a list with the results given by alpha()
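To see the name-building loop run end to end without the survey data, here is a small sketch. Everything in it is an assumption for illustration: YT_toy is made-up data, and cronbach() is a hand-written version of the standard Cronbach's alpha formula used in place of the alpha() function from the question. Reverse-coded items such as Q19a in the question would still need recoding before the loop.

```r
# Made-up survey data: two questions (Q4, Q5), three items each
set.seed(1)
base4 <- rnorm(50)
base5 <- rnorm(50)
YT_toy <- data.frame(
  Q4a = base4 + rnorm(50, sd = 0.5),
  Q4b = base4 + rnorm(50, sd = 0.5),
  Q4c = base4 + rnorm(50, sd = 0.5),
  Q5a = base5 + rnorm(50, sd = 0.5),
  Q5b = base5 + rnorm(50, sd = 0.5),
  Q5c = base5 + rnorm(50, sd = 0.5)
)

# Hand-written Cronbach's alpha (stand-in for alpha()):
# k / (k - 1) * (1 - sum of item variances / variance of the row totals)
cronbach <- function(items) {
  k <- ncol(items)
  item_vars <- sapply(items, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# paste0() is vectorised, so one call builds all three column names
alphas <- sapply(4:5, function(N) {
  cols <- paste0("Q", N, c("a", "b", "c"))
  cronbach(YT_toy[cols])
})
names(alphas) <- paste0("Q", 4:5)

print(alphas)
```

Returning a named numeric vector gives the single output asked for in the question, one value per question number.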

Subset an xts by year into a list. Subset an xts by year and month into a list

I'm new to R and the Stack platforms.
sti <- getSymbols("^STI", src = "yahoo", auto.assign = F, from = "2007-01-01", to = "2017-12-31")
sti_adjusted <- sti[, 6]
I did this in order to subset the data into a list of years.
ls_sti_adjusted <- list(sti_adjusted["2007"], sti_adjusted["2008"], sti_adjusted["2009"], sti_adjusted["2010"], sti_adjusted["2011"], sti_adjusted["2012"], sti_adjusted["2013"], sti_adjusted["2014"], sti_adjusted["2015"], sti_adjusted["2016"], sti_adjusted["2017"])
I'm looking for a more elegant solution, like a for-loop maybe?
ls_sti_adjusted <- list()
for (i in 2007:2017){
ls_sti_adjusted[[]] <- ???
}
The second issue is how can I further subset the elements into months in the year?
so for example: ls_sti_adjusted[[1]][[2]][[3]] returns the 3rd data point of February in 2007. Is this possible?
I hope that I am clear about the problem that I am facing. Thanks folks, plus any tips/tricks to understand loops and lists better would be greatly appreciated.
Combining .indexyear and split(x, f = "months") will give you the desired list.
lapply(unique(.indexyear(STI)),function(x) split.xts(STI[.indexyear(STI) == x ,],f='months'))
If you only need yearly lists leave out the split part, like so:
lapply(unique(.indexyear(STI)),function(x) STI[.indexyear(STI) == x ,])
UPDATE: OP’s follow-up question regarding naming of lists
Assuming you named the list-of-lists object STIlist, you can name its elements by year as follows (keep in mind that the names are converted to strings!):
names(STIlist) <- 2007:2018
To get the list of the year 2007:
> both(STIlist[['2007']])
STI.Open STI.High STI.Low STI.Close STI.Volume STI.Adjusted
2007-01-03 3015.74 3037.74 3010.22 3037.74 192739200 3037.74
2007-01-04 3035.08 3045.18 3008.23 3023.80 198216700 3023.80
2007-01-05 3031.09 3038.27 3000.50 3029.04 233321400 3029.04
STI.Open STI.High STI.Low STI.Close STI.Volume STI.Adjusted
2007-12-27 3469.11 3491.65 3459.97 3477.20 91474200 3477.20
2007-12-28 3452.18 3463.38 3441.96 3445.82 109442100 3445.82
2007-12-31 3424.48 3482.30 3424.48 3482.30 205741900 3482.30
If you need more information about naming lists, "Google is your best friend", or post another question :-)
For the first question, something like this?
library(lubridate)
index(sti_adjusted)=floor_date(index(sti_adjusted),unit="years")
ls_sti_adjusted <- lapply(unique(index(sti_adjusted)),function(x) sti_adjusted[index(sti_adjusted)==x,1])
We could use the indexing directly from xts, check ?index.xts:
split(sti_adjusted, .indexyear(sti_adjusted))
In order to keep the correct naming 2012, 2013, ..., we can try:
split(sti_adjusted, as.integer(format(index(sti_adjusted), '%Y')))
Of course this can be nested in a list as much as you want:
nestedList <- lapply(
split(sti_adjusted, .indexyear(sti_adjusted))
, function(x) split(x, .indexmon(x))
)
nestedList[[3]][[2]][3] #3.year, 2.month, 3. obs.
Example using built-in data from xts:
data(sample_matrix, package = "xts")
sample_matrix <- as.xts(sample_matrix)
nestedList <- lapply(
split(sample_matrix, .indexyear(sample_matrix))
, function(x) split(x, .indexmon(x))
)
nestedList[[1]][[3]][5]
Open High Low Close
2007-03-05 50.26501 50.3405 50.26501 50.29567
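The same year-then-month nesting can be sketched in base R, which also shows how naming the list elements makes lookups easier to read. This is an illustration under assumptions: toy is a made-up daily data frame standing in for sti_adjusted, and format() on a date column is used instead of the xts index helpers.

```r
# Made-up daily series standing in for sti_adjusted
dates <- seq(as.Date("2007-01-01"), as.Date("2008-12-31"), by = "day")
toy <- data.frame(date = dates, value = seq_along(dates))

# First level: one data frame per year, named "2007" and "2008"
by_year <- split(toy, format(toy$date, "%Y"))

# Second level: within each year, one data frame per month, named "01" to "12"
nested <- lapply(by_year, function(y) split(y, format(y$date, "%m")))

# Third observation of February 2007, looked up by name
feb3 <- nested[["2007"]][["02"]][3, ]
print(feb3)
```

Looking elements up by name ("2007", "02") avoids having to remember that positional index 1 means the first year in the data.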

R subset data by multiple conditions on every row

I have a very large dataset with many columns and rows. Not every co-worker is allowed to see all of the data. Based on the data frame Data_locatie, I want to subset my original data frame DF. The column acces tells me whether the co-worker may see a given combination: yes (=1) or no (=0). I made a reproducible example which you can use.
CityChargeSessions <-c("Amsterdam","Amsterdam","Amsterdam","Amsterdam","Beverwaard","De meern","De Meern","De Meern","Den Haag","Den Haag")
RegionAbbreviation <- c("G4", "G4","G4","G4","G4","G4","G4","G4","G4","G4")
Provider<- c("ALLEGO","Essent","EVBOX","Nuon","EVBOX","EVnet","Ballast Nedam", "Nuon","Alfen","EVnet")
acces<- c(0,1,1,0,1,1,0,0,1,0)
Data_locatie<- data.frame(CityChargeSessions,RegionAbbreviation,Provider,acces)
CityChargeSessions <-c("Amsterdam" ,"Amsterdam" ,"Den Haag" , "Den Haag" ,"Rotterdam", "Rotterdam", "Rotterdam", "Utrecht" , "Utrecht" )
RegionAbbreviation <- c("G4", "G4","G4","G4","G4","G4","G4","G4","G4")
Provider <- c("Essent","Nuon","Alfen","EVnet","Alfen","EVBOX", "EVnet","Ballast Nedam", "EVnet")
kWh<- c(3366231.03, 7547896.10, 2535700.80, 245951.82, 62004.86, 3074192.86, 221362.13, 1272956.51, 281451.94)
DF<- data.frame(CityChargeSessions,RegionAbbreviation,Provider,kWh)
My expected output is:
CityChargeSessions <-c("Amsterdam" ,"Den Haag")
RegionAbbreviation <- c("G4", "G4")
Provider <- c("Essent","Alfen")
kWh<- c(3366231.03, 2535700.80)
expected_output<- data.frame(CityChargeSessions,RegionAbbreviation,Provider,kWh)
Could you help me out?
Thanks for your help!
Martijn
You could use the data.table package and do the following:
require(data.table)
setDT(Data_locatie)
setkey(Data_locatie, "CityChargeSessions", "RegionAbbreviation", "Provider")
setDT(DF)
setkey(DF, "CityChargeSessions", "RegionAbbreviation", "Provider")
allowed_combinations <- DF[Data_locatie[acces==1], nomatch=0][, acces:=NULL]
not_allowed_combinations <- DF[Data_locatie[acces==0], nomatch=0][, acces:=NULL]
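For comparison, the same filter can be sketched with base R's merge(), which performs an inner join by default. This is not the answerer's method, only an alternative under assumptions: the data frames below are cut-down copies of the question's example, with the RegionAbbreviation key left out for brevity.

```r
# Cut-down version of the access table from the question
Data_locatie <- data.frame(
  CityChargeSessions = c("Amsterdam", "Amsterdam", "Den Haag", "Den Haag"),
  Provider = c("Essent", "Nuon", "Alfen", "EVnet"),
  acces = c(1, 0, 1, 0)
)

# Cut-down version of the data to be filtered
DF <- data.frame(
  CityChargeSessions = c("Amsterdam", "Amsterdam", "Den Haag", "Rotterdam"),
  Provider = c("Essent", "Nuon", "Alfen", "Alfen"),
  kWh = c(3366231.03, 7547896.10, 2535700.80, 62004.86)
)

keys <- c("CityChargeSessions", "Provider")

# Keep only the permitted key combinations, then inner-join them onto DF;
# rows of DF without a permitted match are dropped
allowed <- merge(DF, Data_locatie[Data_locatie$acces == 1, keys], by = keys)
print(allowed)
```

Rows for Rotterdam disappear because that city has no entry in the access table at all, which matches the expected output in the question.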

Is it possible to repeat code in R - but replace certain characters? (needs to be done 350 times)

I am writing a script to data scrape analyst share ratings and current share prices from the internet in R (using RStudio);
library(rvest)
BKGURL <- 'http://www.marketbeat.com/stocks/LON/BKG/' #analysts
BKGwebpage <- read_html(BKGURL)
BKGhtml <- html_nodes(BKGwebpage, "td:nth-child(5) , td:nth-child(4) , td:nth-child(3) , td:nth-child(2) , td:nth-child(1)")
BKG <- html_text(BKGhtml) #imports analyst text
BKGprice <- 'http://markets.investorschronicle.co.uk/research/Markets/Companies/Summary?s=BKG:LSE'
BKGpricewebpage <- read_html(BKGprice)
BKGpriceHTML <- html_nodes(BKGpricewebpage, "#wsod td.first")
BKGgbpp <- html_text(BKGpriceHTML) #imports current share price in text
before compiling them in a data frame; (code for INN not shown to save space)
Code <- c('BKG', 'INN')
Analysts_Opinion <- c(BKG [2], INN [2])
Consensus <- c(BKG [4], INN [4])
Price_target <- c(BKG [6], INN [6])
Last_rating <- c(BKG [7], INN [7])
Current_price <- c(BKGgbpp [1], INNgbpp [1])
Scrapev1 <- data.frame(Code, Analysts_Opinion, Consensus, Price_target, Last_rating, Current_price)
Scrapev1 then gives
  Code                                    Analysts_Opinion          Consensus Price_target Last_rating Current_price
1  BKG 2 Sell Rating(s), 6 Hold Rating(s), 8 Buy Rating(s) Hold (Score: 2.38) GBX 3,434.29   7/26/2016         2,650
2  INN                                     1 Buy Rating(s)  Buy (Score: 3.00)      GBX 190    2/2/2016        198.00
So the code works fine for importing the data, but I need to repeat/replicate the code in the top panel 350 times, changing "BKG" for the 349 other codes in each URL and name. Currently I am stumped on what to do as copy and pasting each would take quite some time, surely there is a quicker way of doing it in R?
Any help or suggestions as to how to tackle this problem would be much appreciated. Apologies if the code is sloppy, I have taught myself (poorly) R by using this very website and come from a Pharmacology background - with an interest in technology!
You could do it with strings, then parsing and evaluating them. However I wouldn't advise you to do so. The best way in my opinion would be to use lists and names. Something like:
library(rvest)
auxlist<- c('BKG', 'ASD', 'QWE')
URLS <- c() # Or list()
webpages <- list()
# etc...
for(comp in auxlist){
URLS[[comp]] <- paste0('http://www.marketbeat.com/stocks/LON/', comp, '/')
webpages[[comp]] <- read_html(URLS[[comp]])
# etc...
}
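To go from per-code results to the single data frame the question builds, one option is to wrap the per-code work in a function and bind the rows. This is a rough sketch: fetch_one is a hypothetical helper that only builds the URL and returns a one-row data frame, so it runs without network access; in real use its body would hold the read_html(), html_nodes() and html_text() calls from the question.

```r
codes <- c("BKG", "INN")

# Hypothetical helper: stands in for the scraping step and performs no
# network request; it only shows the shape of the value to return
fetch_one <- function(code) {
  url <- paste0("http://www.marketbeat.com/stocks/LON/", code, "/")
  data.frame(Code = code, URL = url, stringsAsFactors = FALSE)
}

# One one-row data frame per code, then stacked into a single data frame
rows <- lapply(codes, fetch_one)
Scrape <- do.call(rbind, rows)
print(Scrape)
```

With 350 codes, only the codes vector changes; the function body is written once.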

linking crsp and compustat in R via WRDS

I am using R to connect to WRDS. Now, I would like to link compustat and crsp tables. In SAS, this would be achieved using macros and the CCM link table. What would be the best way to approach this topic in R?
PROGRESS UPDATE:
I downloaded crsp, compustat and ccm_link tables from wrds.
sql <- "select * from CRSP.CCMXPF_LINKTABLE"
res <- dbSendQuery(wrds, sql)
ccmxpf_linktable <- fetch(res, n = -1)
ccm.dt <- data.table(ccmxpf_linktable)
rm(ccmxpf_linktable)
I am then converting the suggested matching routine from the wrds event study sas file into R:
ccm.dt[,typeflag:=linktype %in% c("LU","LC","LD","LN","LS","LX") & USEDFLAG=="1"]
setkey(ccm.dt, gvkey, typeflag)
for (i in 1:nrow(compu.dt)) {
gvkey.comp = compu.dt[i, gvkey]
endfyr.comp = compu.dt[i,endfyr]
PERMNO.val <- ccm.dt[.(gvkey.comp, TRUE),][linkdt<=endfyr.comp & endfyr.comp<=linkenddt,lpermno]
if (length(PERMNO.val)==0) PERMNO.val <- NA
suppressWarnings(compu.dt[i, "PERMNO"] <- PERMNO.val)
}
However, this code is fantastically inefficient. I started out with data.table, but do not really understand how to apply the logic in the for-loop. I am hoping that someone could point me to a way to improve the for-loop.
Matching fields in stages works better. Maybe someone will find this useful. Any suggestions for further improvement are of course very welcome!
# filter on ccm.dt
ccm.dt <- ccm.dt[linktype %in% c("LU","LC","LD","LN","LS","LX") & USEDFLAG=="1"]
setkey(ccm.dt, gvkey)
setkey(compu.dt, gvkey)
compu.merged <- merge(compu.dt, ccm.dt, all.x = TRUE, allow.cartesian = TRUE)
# deal with NAs in linkenddt - set NAs to today's date, assuming the links still exist.
today <- as.character(Sys.Date())
compu.merged[is.na(linkenddt), "linkenddt":=today]
# filter out date mismatches
compu <- compu.merged[linkdt <= endfyr & endfyr<=linkenddt]
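The staged approach (merge on gvkey, fill open-ended link dates, then filter on the date window) can be checked on toy data in base R. All values below are made up, and the last step, re-attaching unmatched firm-years with NA as the original for-loop does, is an addition of this sketch rather than part of the code above.

```r
# Made-up firm-years standing in for compu.dt
compu <- data.frame(
  gvkey = c("001", "001", "002"),
  endfyr = as.Date(c("2005-12-31", "2012-12-31", "2010-12-31"))
)

# Made-up link table standing in for ccm.dt; NA marks a still-active link
ccm <- data.frame(
  gvkey = c("001", "001"),
  lpermno = c(10001, 10002),
  linkdt = as.Date(c("2000-01-01", "2010-01-01")),
  linkenddt = as.Date(c("2009-12-31", NA))
)

# Stage 1: treat open-ended links as valid through today
ccm$linkenddt[is.na(ccm$linkenddt)] <- Sys.Date()

# Stage 2: match on gvkey only, which yields every firm-year/link combination
merged <- merge(compu, ccm, by = "gvkey", all.x = TRUE)

# Stage 3: keep combinations whose fiscal year end falls inside the link window
in_window <- !is.na(merged$linkdt) &
  merged$linkdt <= merged$endfyr &
  merged$endfyr <= merged$linkenddt
matched <- merged[in_window, c("gvkey", "endfyr", "lpermno")]

# Stage 4 (addition): re-attach firm-years without a link, with lpermno = NA
result <- merge(compu, matched, by = c("gvkey", "endfyr"), all.x = TRUE)
print(result)
```

Without the last step, firm-years that have no valid link are dropped by the date filter instead of being kept with a missing PERMNO.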
