Using a hash to determine whether 2 data frames are identical (PART 01) - r

I created a dataset from the WHO ATC/DDD Index a few months ago, and I want to check whether the online database remains unchanged today, so I downloaded it again and tried to use the digest package in R to do the comparison.
The two datasets (in txt format) can be downloaded here. (I am aware that you may worry the files are unsafe and may carry a virus, but I don't know how to generate a dummy dataset that replicates the issue I have, so I uploaded the datasets in the end.)
And I have written a little script as below:
library(digest)
# Read both snapshots, keeping strings as character
ddd.old <- read.table("ddd.table.old.txt", header=TRUE, stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt", header=TRUE, stringsAsFactors=FALSE)
# Coerce the ddd column to character so both versions hash the same way
ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])
# Hash each row and append the hashes as a new column
ddd.old <- data.frame(ddd.old, hash = apply(ddd.old, 1, digest), stringsAsFactors=FALSE)
ddd.new <- data.frame(ddd.new, hash = apply(ddd.new, 1, digest), stringsAsFactors=FALSE)
# Sort both data frames by the hash column
ddd.old <- ddd.old[order(ddd.old[,"hash"]),]
ddd.new <- ddd.new[order(ddd.new[,"hash"]),]
And something really interesting happens when I do the checking:
> table(ddd.old[,"hash"]%in%ddd.new[,"hash"]) #line01
TRUE
506
> table(ddd.new[,"hash"]%in%ddd.old[,"hash"]) #line02
TRUE
506
> digest(ddd.old[,"hash"])==digest(ddd.new[,"hash"]) #line03
[1] TRUE
> digest(ddd.old)==digest(ddd.new) #line04
[1] FALSE
line01 and line02 show that every row in ddd.old can be found in ddd.new, and vice versa.
line03 shows that the hash columns of both data frames are the same.
line04 shows that the two data frames are different.
What happened? Both data frames have identical rows (from line01 and line02) in the same order (from line03), yet they are different (from line04)?
Or do I misunderstand how digest works? Thanks.

Read in data as before.
ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])
Like Marek said, start by checking for differences with all.equal. (One reason line03 and line04 can disagree: digest hashes the entire object, including attributes such as row names, and after sorting each frame keeps its original, now differently ordered, row names.)
all.equal(ddd.old, ddd.new)
[1] "Component 6: 4 string mismatches"
[2] "Component 8: 24 string mismatches"
So we just need to look at columns 6 and 8.
different.old <- ddd.old[, c(6, 8)]
different.new <- ddd.new[, c(6, 8)]
Hash these columns.
hash.old <- apply(different.old, 1, digest)
hash.new <- apply(different.new, 1, digest)
And find the rows where they don't match.
different_rows <- which(hash.old != hash.new) #which is optional
Finally, combine the datasets.
cbind(different.old[different_rows, ], different.new[different_rows, ])
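If all you need is a yes/no answer on whether two data frames hold the same contents, another option is to strip the attributes that digest would otherwise hash. A minimal sketch, assuming the columns already match in order and type (hash_contents is a hypothetical helper, not part of digest):
# Hash only a data frame's contents, ignoring row order and row
# names, which digest() would otherwise include in the hash
hash_contents <- function(df) {
  df <- df[do.call(order, df), , drop = FALSE]  # canonical row order
  rownames(df) <- NULL                          # reset the row-name attribute
  digest(df)
}
# TRUE when both frames contain the same rows:
# hash_contents(ddd.old) == hash_contents(ddd.new)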

Related

Why does a subtraction give a different result when computed from a data frame in R?

I was checking my data and realized that a subtraction was wrong; I computed it manually and got the correct result.
My question is why, when I compute the subtraction from the data.frame, it gives a number ending in .5, and when I compute it with just the numbers, it gives the correct result.
I tried to reproduce the problem on another computer, and here is a toy example of the problem:
a <- data.frame(matrix(nrow = 1, ncol = 4))
a[1,] <- c(101292229, 101298224, 101190466, 101195700)
colnames(a) <- c("start_I","end_I","start_II","end_II")
a$MI <- (a$start_I + a$end_I)/2
a$MII <- (a$start_II + a$end_II)/2
a
start_I end_I start_II end_II MI MII
1 101292229 101298224 101190466 101195700 101295226 101193083
a$MII-a$MI
[1] -102143.5
101193083-101295226
[1] -102143
As Lyngbakr pointed out, a$MI is actually 101295226.5, but only appears to be 101295226 due to the default printing settings cutting off the decimal place.
If you override those settings with format(), you'll see the extra .5 appear:
format(a, digits = 10)
start_I end_I start_II end_II MI MII
1 101292229 101298224 101190466 101195700 101295226.5 101193083
Try
as.integer(a$MII - a$MI)
to get the result you are expecting.
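The display behaviour is easy to reproduce on its own: R's default of 7 significant digits (getOption("digits")) hides the trailing .5 on a nine-digit number.
x <- 101295226.5
print(x)              # 101295226 -- the .5 is hidden at the default precision
print(x, digits = 10) # 101295226.5
getOption("digits")   # 7 by default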

linking crsp and compustat in R via WRDS

I am using R to connect to WRDS. Now, I would like to link the compustat and crsp tables. In SAS, this would be achieved using macros and the CCM link table. What would be the best way to approach this in R?
PROGRESS UPDATE:
I downloaded the crsp, compustat, and ccm_link tables from WRDS.
sql <- "select * from CRSP.CCMXPF_LINKTABLE"
res <- dbSendQuery(wrds, sql)
ccmxpf_linktable <- fetch(res, n = -1)
ccm.dt <- data.table(ccmxpf_linktable)
rm(ccmxpf_linktable)
I am then converting the suggested matching routine from the WRDS event study SAS file into R:
ccm.dt[,typeflag:=linktype %in% c("LU","LC","LD","LN","LS","LX") & USEDFLAG=="1"]
setkey(ccm.dt, gvkey, typeflag)
for (i in 1:nrow(compu.dt)) {
  gvkey.comp <- compu.dt[i, gvkey]
  endfyr.comp <- compu.dt[i, endfyr]
  PERMNO.val <- ccm.dt[.(gvkey.comp, TRUE),][linkdt <= endfyr.comp & endfyr.comp <= linkenddt, lpermno]
  if (length(PERMNO.val) == 0) PERMNO.val <- NA
  suppressWarnings(compu.dt[i, "PERMNO"] <- PERMNO.val)
}
However, this code is fantastically inefficient. I started out with data.table but do not really understand how to apply its logic inside the for-loop. I am hoping that someone could point me to a way to improve the for-loop.
Matching fields in stages works better. Maybe someone finds this useful. Any suggestions for further improvement are of course very welcome!
# filter on ccm.dt
ccm.dt <- ccm.dt[linktype %in% c("LU","LC","LD","LN","LS","LX") & USEDFLAG=="1"]
setkey(ccm.dt, gvkey)
setkey(compu.dt, gvkey)
compu.merged <- merge(compu.dt, ccm.dt, all.x = TRUE, allow.cartesian = TRUE)
# deal with NAs in linkenddt - set NAs to today's date, assuming the links still exist
today <- as.character(Sys.Date())
compu.merged[is.na(linkenddt), "linkenddt":=today]
# filter out date mismatches
compu <- compu.merged[linkdt <= endfyr & endfyr<=linkenddt]
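On newer data.table versions (1.9.8+), the merge-then-filter steps can also be collapsed into a single non-equi join. A sketch, assuming compu.dt carries the gvkey and endfyr columns used above:
# For each compustat row, keep the link rows whose
# [linkdt, linkenddt] interval contains endfyr
compu <- ccm.dt[compu.dt, on = .(gvkey, linkdt <= endfyr, linkenddt >= endfyr)]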

Filter xts objects by common dates

I am stuck with the following code.
For reference, the code is taken from the following website (http://gekkoquant.com/2013/01/21/statistical-arbitrage-trading-a-cointegrated-pair/), and I am running it in RStudio.
library("quantmod")
startDate = as.Date("2013-01-01")
symbolLst<-c("WPL.AX","BHP.AX")
symbolData <- new.env()
getSymbols(symbolLst, env = symbolData, src = "yahoo", from = startDate)
stockPair <- list(
  a = coredata(Cl(eval(parse(text=paste("symbolData$\"",symbolLst[1],"\"",sep="")))))
  ,b = coredata(Cl(eval(parse(text=paste("symbolData$\"",symbolLst[2],"\"",sep="")))))
  ,hedgeRatio = 0.70
  ,name = title)
spread <- stockPair$a - stockPair$hedgeRatio*stockPair$b
I am getting the following error.
Error in stockPair$a - stockPair$hedgeRatio * stockPair$b :
non-conformable arrays
The reason these particular series don't match is that "WPL.AX" has an extra value (date 19-05-2014), so the matrix lengths are different, compared to "BHP.AX". How can I solve this issue when loading the data?
I have also tested other stock pairs, such as "ANZ" and "WBC", with src = "google", which produces two arrays of the same length.
> length(stockPair$a)
[1] 360
> length(stockPair$b)
[1] 359
Add code such as this prior to the stockPair computation, to trim each xts set to the intersection of dates:
common_dates <- as.Date(Reduce(intersect, eapply(symbolData, index)))
symbolData <- eapply(symbolData, `[`, i=common_dates)
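Note that eapply returns a plain list rather than an environment, which the [[ indexing used below still handles. A quick sanity check after the trim:
sapply(symbolData, NROW)  # every symbol should now report the same row count
length(common_dates)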
Your code works fine if you don't convert your xts objects to matrices via coredata. Then Ops.xts will ensure that only the rows with the same index are subtracted. And fortune(106) applies.
fortunes::fortune(106)
# If the answer is parse() you should usually rethink the question.
# -- Thomas Lumley
# R-help (February 2005)
stockPair <- list(
  a = Cl(symbolData[[symbolLst[1]]])
  ,b = Cl(symbolData[[symbolLst[2]]])
  ,hedgeRatio = 0.70
  ,name = "title")
spread <- stockPair$a - stockPair$hedgeRatio*stockPair$b
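The index alignment that Ops.xts performs is easy to see on toy data (a self-contained demo, not the Yahoo series):
library(xts)
d <- as.Date("2013-01-01") + 0:4
x <- xts(1:5, order.by = d)            # five observations
y <- xts((1:4) * 10, order.by = d[-3]) # four observations, one date missing
x - y  # four rows: only dates present in both series are subtracted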
Here's an alternative approach:
# merge stocks into a single xts object
stockPair <- do.call(merge, eapply(symbolData, Cl))
# ensure stockPair columns are in the same order as symbolLst, since
# eapply may loop over the environment in an order you don't expect
stockPair <- stockPair[,pmatch(symbolLst, colnames(stockPair))]
colnames(stockPair) <- c("a","b")
# add hedgeRatio and name as xts attributes
xtsAttributes(stockPair) <- list(hedgeRatio=0.7, name="title")
spread <- stockPair$a - attr(stockPair,'hedgeRatio')*stockPair$b

Scrape number of articles on a topic per year from NYT and WSJ?

I would like to create a data frame that scrapes the NYT and WSJ and has the number of articles on a given topic per year. That is:
NYT WSJ
2011 2 3
2012 10 7
I found this tutorial for the NYT, but it is not working for me :(. When I get to line 30, I get this error:
> cts <- as.data.frame(table(dat))
Error in provideDimnames(x) :
length of 'dimnames' [1] not equal to array extent
Any help would be much appreciated.
Thanks!
PS: This is my code that is not working (an NYT API key is needed: http://developer.nytimes.com/apps/register):
# Need to install from source http://www.omegahat.org/RJSONIO/RJSONIO_0.2-3.tar.gz
# then load:
library(RJSONIO)
### set parameters ###
api <- "API key goes here" ###### <<<API key goes here!!
q <- "MOOCs" # Query string, use + instead of space
records <- 500 # total number of records to return, note limitations above
# calculate parameter for offset
os <- 0:(records/10-1)
# read first set of data in
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[1], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F") # get them
res <- fromJSON(raw.data) # tokenize
dat <- unlist(res$results) # convert the dates to a vector
# read in the rest via loop
for (i in 2:length(os)) {
# concatenate URL for each offset
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[i], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F")
res <- fromJSON(raw.data)
dat <- append(dat, unlist(res$results)) # append
}
# aggregate counts for dates and coerce into a data frame
cts <- as.data.frame(table(dat))
# establish date range
dat.conv <- strptime(dat, format="%Y%m%d") # need to convert dat into POSIX format for this
daterange <- c(min(dat.conv), max(dat.conv))
dat.all <- seq(daterange[1], daterange[2], by="day") # all possible days
# compare dates from counts dataframe with the whole data range
# assign 0 where there is no count, otherwise take count
# (take out PSD at the end to make it comparable)
dat.all <- strptime(dat.all, format="%Y-%m-%d")
# can't seem to be able to compare POSIX objects with %in%, so coerce them to character for this:
freqs <- ifelse(as.character(dat.all) %in% as.character(strptime(cts$dat, format="%Y%m%d")), cts$Freq, 0)
plot (freqs, type="l", xaxt="n", main=paste("Search term(s):",q), ylab="# of articles", xlab="date")
axis(1, 1:length(freqs), dat.all)
lines(lowess(freqs, f=.2), col = 2)
UPDATE: the repo is now at https://github.com/rOpenGov/rtimes
There is an RNYTimes package created by Duncan Temple-Lang (https://github.com/omegahat/RNYTimes), but it is outdated because the NYTimes API is on v2 now. I've been working on one for political endpoints only, which is not relevant for you.
I'm rewriting RNYTimes right now... Install it from GitHub. You need to install devtools first to get install_github:
install.packages("devtools")
library(devtools)
install_github("rOpenGov/RNYTimes")
Then try your search with that, e.g.:
library(RNYTimes); library(plyr)
moocs <- searchArticles("MOOCs", key = "<yourkey>")
This gives you the number of articles found:
moocs$response$meta$hits
[1] 121
You could get word counts for each article with:
as.numeric(sapply(moocs$response$docs, "[[", 'word_count'))
[1] 157 362 1316 312 2936 2973 355 1364 16 880
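To get from there to the per-year counts in the question, one approach is to tabulate the publication years. A sketch, assuming each element of moocs$response$docs carries a pub_date field (as the v2 article search API returns):
pub_dates <- sapply(moocs$response$docs, "[[", "pub_date")
years <- substr(pub_dates, 1, 4)  # pub_date starts with YYYY
table(years)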

How to improve this code for getting pairwise?

This question builds upon a previous question (http://stackoverflow.com/questions/6538448/r-how-to-write-a-loop-to-get-a-matrix).
It differs from the previous one in that more detail is given, and the libraries and an example file are provided, following comments from DWin. So I submitted it as a new question. Would you mind teaching me how to modify this code further?
To load the necessary libraries:
source("http://bioconductor.org/biocLite.R")
biocLite()
My protseq.fasta file has the following contents:
>drugbank_target|1 Peptidoglycan synthetase ftsI (DB00303)
MVKFNSSRKSGKSKKTIRKLTAPETVKQNKPQKVFEKCFMRGRYMLSTVLILLGLCALVARAAYVQSINADTLSNEADKR
SLRKDEVLSVRGSILDRNGQLLSVSVPMSAIVADPKTMLKENSLADKERIAALAEELGMTENDLVKKIEKNSKSGYLYLA
RQVELSKANYIRRLKIKGIILETEHRRFYPRVEEAAHVVGYTDIDGNGIEGIEKSFNSLLVGKDGSRTVRKDKRGNIVAH
ISDEKKYDAQDVTLSIDEKLQSMVYREIKKAVSENNAESGTAVLVDVRTGEVLAMATAPSYNPNNRVGVKSELMRNRAIT
DTFEPGSTVKPFVVLTALQRGVVKRDEIIDTTSFKLSGKEIVDVAPRAQQTLDEILMNSSNRGVSRLALRMPPSALMETY
QNAGLSKPTDLGLIGEQVGILNANRKRWADIERATVAYGYGITATPLQIARAYATLGSFGVYRPLSITKVDPPVIGKRVF
SEKITKDIVGILEKVAIKNKRAMVEGYRVGVKTGTARKIENGHYVNKYVAFTAGIAPISDPRYALVVLINDPKAGEYYGG
AVSAPVFSNIMGYALRANAIPQDAEAAENTTTKSAKRIVYIGEHKNQKVN
>drugbank_target|3 Histidine decarboxylase (DB00114; DB00117)
MMEPEEYRERGREMVDYICQYLSTVRERRVTPDVQPGYLRAQLPESAPEDPDSWDSIFGDIERIIMPGVVHWQSPHMHAY
YPALTSWPSLLGDMLADAINCLGFTWASSPACTELEMNVMDWLAKMLGLPEHFLHHHPSSQGGGVLQSTVSESTLIALLA
ARKNKILEMKTSEPDADESCLNARLVAYASDQAHSSVEKAGLISLVKMKFLPVDDNFSLRGEALQKAIEEDKQRGLVPVF
VCATLGTTGVCAFDCLSELGPICAREGLWLHIDAAYAGTAFLCPEFRGFLKGIEYADSFTFNPSKWMMVHFDCTGFWVKD
KYKLQQTFSVNPIYLRHANSGVATDFMHWQIPLSRRFRSVKLWFVIRSFGVKNLQAHVRHGTEMAKYFESLVRNDPSFEI
PAKRHLGLVVFRLKGPNCLTENVLKEIAKAGRLFLIPATIQDKLIIRFTVTSQFTTRDDILRDWNLIRDAATLILSQHCT
SQPSPRVGNLISQIRGARAWACGTSLQSVSGAGDDPVQARKIIKQPQRVGAGPMKRENGLHLETLLDPVDDCFSEEAPDA
TKHKLSSFLFSYLSVQTKKKTVRSLSCNSVPVSAQKPLPTEASVKNGGSSRVRIFSRFPEDMMMLKKSAFKKLIKFYSVP
SFPECSSQCGLQLPCCPLQAMV
>drugbank_target|5 Glutaminase liver isoform, mitochondrial (DB00130; DB00142)
MRSMKALQKALSRAGSHCGRGGWGHPSRSPLLGGGVRHHLSEAAAQGRETPHSHQPQHQDHDSSESGMLSRLGDLLFYTI
AEGQERTPIHKFTTALKATGLQTSDPRLRDCMSEMHRVVQESSSGGLLDRDLFRKCVSSSIVLLTQAFRKKFVIPDFEEF
TGHVDRIFEDVKELTGGKVAAYIPQLAKSNPDLWGVSLCTVDGQRHSVGHTKIPFCLQSCVKPLTYAISISTLGTDYVHK
FVGKEPSGLRYNKLSLDEEGIPHNPMVNAGAIVVSSLIKMDCNKAEKFDFVLQYLNKMAGNEYMGFSNATFQSEKETGDR
NYAIGYYHEEKKCFPKGVDMMAALDLYFQLCSVEVTCESGSVMAATLANGGICPITGESVLSAEAVRNTLSLMHSCGMYD
FSGQFAFHVGLPAKSAVSGAILLVVPNVMGMMCLSPPLDKLGNSHRGTSFCQKLVSLFNFHNYDNLRHCARKLDPRREGA
EIRNKTVVNLLFAAYSGDVSALRRFALSAMDMEQKDYDSRTALHVAAAEGHIEVVKFLIEACKVNPFAKDRWGNIPLDDA
VQFNHLEVVKLLQDYQDSYTLSETQAEAAAEALSKENLESMV
>drugbank_target|6 Coagulation factor XIII A chain (DB00130; DB01839; DB02340)
SETSRTAFGGRRAVPPNNSNAAEDDLPTVELQGVVPRGVNLQEFLNVTSVHLFKERWDTNKVDHHTDKYENNKLIVRRGQ
SFYVQIDFSRPYDPRRDLFRVEYVIGRYPQENKGTYIPVPIVSELQSGKWGAKIVMREDRSVRLSIQSSPKCIVGKFRMY
VAVWTPYGVLRTSRNPETDTYILFNPWCEDDAVYLDNEKEREEYVLNDIGVIFYGEVNDIKTRSWSYGQFEDGILDTCLY
VMDRAQMDLSGRGNPIKVSRVGSAMVNAKDDEGVLVGSWDNIYAYGVPPSAWTGSVDILLEYRSSENPVRYGQCWVFAGV
FNTFLRCLGIPARIVTNYFSAHDNDANLQMDIFLEEDGNVNSKLTKDSVWNYHCWNEAWMTRPDLPVGFGGWQAVDSTPQ
ENSDGMYRCGPASVQAIKHGHVCFQFDAPFVFAEVNSDLIYITAKKDGTHVVENVDATHIGKLIVTKQIGGDGMMDITDT
YKFQEGQEEERLALETALMYGAKKPLNTEGVMKSRSNVDMDFEVENAVLGKDFKLSITFRNNSHNRYTITAYLSANITFY
TGVPKAEFKKETFDVTLEPLSFKKEAVLIQAGEYMGQLLEQASLHFFVTARINETRDVLAKQKSTVLTIPEIIIKVRGTQ
VVGSDMTVTVQFTNPLKETLRNVWVHLDGPGVTRPMKKMFREIRPNSTVQWEEVCRPWVSGHRKLIASMSSDSLRHVYGE
LDVQIQRRPSM
To load the data to R for the analysis, I have done:
require("Biostrings")
data(BLOSUM100)
seqs <- readFASTA("./protseq.fasta", strip.descs=TRUE)
To get the pairwise index combinations, as there are a total of 4 sequences, I have done:
number <-c(1:4); dat <- expand.grid(number,number, stringsAsFactors=FALSE)
datr <- dat[dat[,1] > dat[,2] , ]
In order to calculate the score one by one, I can do this:
score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5))
However, I have a problem adding a new column "score" to hold the score for each pair of proteins. I tried the following, but it did not work.
datr$score <- lapply(datr, 1, function(i) { x <- datr[i,1]; y<- datr[i,2]; score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5))})
Would you mind commenting on how to improve it further? Thanks to DWin and diliop for the wonderful solutions to my previous question.
Try:
datr$score <- sapply(1:nrow(datr), function(i) {
  x <- datr[i, 1]
  y <- datr[i, 2]
  score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5))
})
To be able to reference your sequences better using their names, you might want to tidy up datr by doing the following:
colnames(datr) <- c("seq1id", "seq2id", "score")
datr$seq1name <- sapply(datr$seq1id, function(i) seqs[[i]]$desc)
datr$seq2name <- sapply(datr$seq2id, function(i) seqs[[i]]$desc)
Or if you just want to extract the accession IDs, i.e. the contents of the parentheses, you could use stringr (note the index is seq1id when building seq1name):
library(stringr)
datr$seq1name <- sapply(datr$seq1id, function(i) str_extract(seqs[[i]]$desc, "DB[0-9\\ ;DB]+"))
Hope this helps!
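For example, run against the first description line in the FASTA file above, the pattern pulls out the accession ID:
library(stringr)
str_extract("Peptidoglycan synthetase ftsI (DB00303)", "DB[0-9\\ ;DB]+")
# [1] "DB00303"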