Accessing actual content of XML file in R? - r

I am working with a well-structured XML file with the following initial content:
<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="5.0" exported-on="2017-07-06">
<drug type="biotech" created="2005-06-13" updated="2016-08-17">
<drugbank-id primary="true">DB00001</drugbank-id>
<drugbank-id>BTD00024</drugbank-id>
<drugbank-id>BIOD00024</drugbank-id>
<name>Lepirudin</name>
<description>Lepirudin is identical to natural hirudin except for substitution of leucine for isoleucine at the N-terminal end of the molecule and the absence of a sulfate group on the tyrosine at position 63. It is produced via yeast cells. Bayer ceased the production of lepirudin (Refludan) effective May 31, 2012.</description>
<cas-number>138068-37-8</cas-number>
<unii>Y43GF64R34</unii>
<state>liquid</state>
<groups>
<group>approved</group>
</groups>
...
This file consists of many nodes, each representing one drug. I am aiming to extract two specific fields from each node of this file: name and drugbank-id primary="true"
... and save these to a neatly formatted table (with one column for name and the second column for drugbank-id).
I have reviewed a number of tutorials and had success with accessing the higher levels of this XML table structure, but where the examples provide syntax to access the actual values (e.g. specific drug names), this code is not working for me.
This is my current code:
library(XML)
# Save the database file as a tree structure
xmldata = xmlRoot(xmlTreeParse("DrugBank_TruncatedDatabase_v3_Small.xml"))
# Number of nodes in the entire database file
NumNodes <- xmlSize(xmldata)
# Create array structure to hold DrugBank ID values
DB_ID <- array(1:NumNodes, dim=c(1,NumNodes,1))
# Create array structure to hold Drug Name values
DrugName <- array(1:NumNodes, dim=c(1,NumNodes,1))
# for each node (i.e. each drug) in the database
for (i in 1:NumNodes){
# Assign the Drug Names to easy-to-comprehend DrugName array
DrugName[i] <- xmldata[[i]][["name"]]
# Assign the DrugBank ID numbers to easy-to-comprehend DB_ID array
DB_ID[i] <- xmldata[[i]][["drugbank-id"]]
}
EdgeListTable = data.frame(DrugName, DB_ID)
write.table(EdgeListTable, file="Output1.txt", quote=F)
The output file contains the following text, which is a level higher than I want:
X.name. X.name..1 X.name..2 X.name..3 X.drugbank.id. X.drugbank.id..1 X.drugbank.id..2 X.drugbank.id..3
1 name name name name drugbank-id drugbank-id drugbank-id drugbank-id
If I try:
xmlSApply(xmldata, function(x) xmlSApply(x, xmlValue))
...my output looks like:
$drug
$drug$drugbank-id
[1] "DB00001"
$drug$drugbank-id
[1] "BTD00024"
$drug$drugbank-id
[1] "BIOD00024"
$drug$name
[1] "Lepirudin"
...
...but after experimentation, I'm not sure how to actually access the values needed.
I appreciate any advice regarding the best way to store the values in the two fields of interest as a table.
============================================================
Update: I am able to extract the desired values using the following code:
DrugBankData <- xmlSApply(xmldata, function(x) xmlSApply(x, xmlValue))
for (i in 1:NumNodes){
DB_ID[i] <- DrugBankData[[i]][[1]]
DrugName[i] <- DrugBankData[[i]][[4]]
}
EdgeListTable = data.frame(DrugName, DB_ID)
write.table(EdgeListTable, file="Output1.txt", quote=F)
The output file looks like this:
X1 X2 X3 X4 X1.1 X2.1 X3.1 X4.1
1 Lepirudin Cetuximab Dornase alfa Denileukin diftitox DB00001 DB00002 DB00003 DB00004
So I am just working on getting this correctly formatted into columns and removing the first line of text from this file, and the "1" at the beginning of the second line...

Thanks for your response, herbaman. I ended up resolving the formatting issues (mostly, except that the columns still aren't aligned...) using the following code:
DrugName_Matrix = matrix(DrugName,nrow=NumNodes,ncol=1)
DrugID_Matrix = matrix(DB_ID,nrow=NumNodes,ncol=1)
Composite_Matrix = cbind(DrugName_Matrix,DrugID_Matrix,Target)
write.table(Composite_Matrix, file="Output1.txt", sep='\t', row.names=F, quote=F)
There remain mysterious column header names ("V1" and "V2") that don't appear in the contents of these two matrices; my attempts to rename them have been unsuccessful using standard methods, e.g.
colnames(Composite_Matrix)[colnames(Composite_Matrix)=="V1"] <- "Drug Name"
colnames(Composite_Matrix)[colnames(Composite_Matrix)=="V2"] <- "Drug ID"
or
setnames(Composite_Matrix, old=c("V1","V2"), new=c("DrugName", "DrugID"))
I'm not sure where these V column headers are originating...
As requested, the contents of the two matrices of interest are:
> DrugName_Matrix
[,1]
[1,] "Lepirudin"
[2,] "Cetuximab"
[3,] "Dornase alfa"
[4,] "Denileukin diftitox"
> DrugID_Matrix
[,1]
[1,] "DB00001"
[2,] "DB00002"
[3,] "DB00003"
[4,] "DB00004"
...and the output table is:
V1 V2
Lepirudin DB00001
Cetuximab DB00002
Dornase alfa DB00003
Denileukin diftitox DB00004

To read drug bank nodes I created the following method:
drug_sub_df <- function(rec, main_node, seconadary_node = NULL, id = "drugbank-id", byValue = FALSE) {
parent_key <- NULL
if (!is.null(id)) {
parent_key <- xmlValue(rec[id][[1]])
}
if (byValue) {
df <- map_df(rec[main_node], xmlValue)
} else {
if (is.null(seconadary_node) && !is.null(rec[[main_node]])) {
df <- xmlToDataFrame(rec[[main_node]], stringsAsFactors = FALSE)
} else {
df <- xmlToDataFrame(rec[[main_node]][[seconadary_node]], stringsAsFactors = FALSE)
}
}
if (nrow(df) > 0 && !is.null(parent_key)) {
df$parent_key <- parent_key
}
return(df)
}
Then I call the method like the following:
# Extract drug enzymes actions df
get_enzymes_actions_df <- function(rec) {
return(map_df(xmlChildren(rec[["enzymes"]]),
~ drug_sub_df(.x, "actions", id = "id")))
}
# Extract drug articles df
get_enzymes_articles_df <- function(rec) {
return(map_df(
xmlChildren(rec[["enzymes"]]),
~ drug_sub_df(.x, "references", seconadary_node = "articles", id = "id")
))
}
Of course. there are different situations that require different solutions like the following:
get_enzyme_rec <- function(r, drug_key) {
tibble(
id = xmlValue(r[["id"]]),
name = xmlValue(r[["name"]]),
organism = xmlValue(r[["organism"]]),
known_action = xmlValue(r[["known-action"]]),
inhibition_strength = xmlValue(r[["inhibition-strength"]]),
induction_strength = xmlValue(r[["induction-strength"]]),
position = ifelse(is.null(xmlGetAttr(r, name = "position")),
NA, xmlGetAttr(r, name = "position")),
parent_key = drug_key
)
}
get_enzymes_df <- function(rec) {
return(map_df(xmlChildren(rec[["enzymes"]]),
~ get_enzyme_rec(.x, xmlValue(rec["drugbank-id"][[1]]))))
}
or that one
get_atc_codes_rec <- function(r, drug_key) {
tibble(
atc_code = xmlGetAttr(r, name = "code"),
level_1 = xmlValue(r[[1]]),
code_1 = xmlGetAttr(r[[1]], name = "code"),
level_2 = xmlValue(r[[2]]),
code_2 = xmlGetAttr(r[[2]], name = "code"),
level_3 = xmlValue(r[[3]]),
code_3 = xmlGetAttr(r[[3]], name = "code"),
level_4 = xmlValue(r[[4]]),
code_4 = xmlGetAttr(r[[4]], name = "code"),
parent_key = drug_key
)
}
get_atc_codes_df <- function(rec) {
return (map_df(xmlChildren(rec[["atc-codes"]]),
~ get_atc_codes_rec(.x,
xmlValue(rec["drugbank-id"][[1]]))))
}
You can find more examples to extract contents of an drug bank XML database in R in different structures in this package
https://github.com/Dainanahan/dbparser

Related

See unmatched countries for joinCountryData2Map in rworldmap?

I'm using the joinCountryData2Map function in rworldmap to match my data to the countries in the world map.
I get this result:
230 codes from your data successfully matched countries in the map
11 codes from your data failed to match with a country code in the map
11 codes from the map weren't represented in your data
I cannot figure out how to view those two lists of 11 countries. I am guessing that those 11 countries have issues with their ISO2 codes that I need to correct, but am not sure which ones to check without being able to view those two lists.
I'm guessing there's a solution along the lines of just View(SomeObject$Countries) but I haven't been able to find anything that works.
Set joinCountryData2Map(...,verbose=TRUE) to print the names of the countries that failed to match in the console.
From the FAQ: "You can see that a summary of how many countries are successfully joined is output to the console. You can specify verbose=TRUE to get a full list of countries"
library(rworldmap)
data(countryExData)
# Set Angola to fail
countryExData[countryExData$ISO3V10 == "AGO", "ISO3V10"] <- "AGO_FAIL"
# Attempt to join
# With verbose=TRUE, failed joins (ie Angola) are printed in the console
sPDF <- joinCountryData2Map(
countryExData[,c("ISO3V10", "Country")],
joinCode = "ISO3",
nameJoinColumn = "ISO3V10",
verbose = TRUE)
# > 148 codes from your data successfully matched countries in the map
# > 1 codes from your data failed to match with a country code in the map
# > failedCodes failedCountries
# > [1,] "AGO_FAIL" "Angola"
# > 95 codes from the map weren't represented in your data
But what if you want to get the information on failed joins programmatically? I may have missed something, but I don't see an option for that (i.e., str(sPDF) or function arguments). However, looking at the internals of joinCountryData2Map(), the object failedCountries contains the info you want, so it should be easy enough to include it in the returned object.
Here's how you could modify joinCountryData2Map() to return a list with two elements: the first element is the default object, and the second element is failedCountries.
# Modify the function to return the failed joins in the environment
joinCountryData2Map_wfails <- function(
dF, joinCode = "ISO3", nameJoinColumn = "ISO3V10",
nameCountryColumn = "Country", suggestForFailedCodes = FALSE,
mapResolution = "coarse", projection = NA, verbose = FALSE) {
# Retain successful join as first element and failed join as second element
ll <- list() # MODIFIED
mapWithData <- getMap(resolution = mapResolution)
if (!is.na(projection))
warning("the projection argument has been deprecated, returning Lat Lon, use spTransform from package rgdal as shown in help details or the FAQ")
listJoinCodesNew <- c("ISO_A2", "ISO_A3", "FIPS_10_",
"ADMIN", "ISO_N3")
listJoinCodesOld <- c("ISO2", "ISO3", "FIPS",
"NAME", "UN")
listJoinCodes <- c(listJoinCodesOld, listJoinCodesNew)
if (joinCode %in% listJoinCodes == FALSE) {
stop("your joinCode (", joinCode, ") in joinCountryData2Map() is not one of those supported. Options are :",
paste(listJoinCodes, ""), "\n")
return(FALSE)
}
joinCodeOld <- joinCode
if (joinCode %in% listJoinCodesOld) {
joinCode <- listJoinCodesNew[match(joinCode, listJoinCodesOld)]
}
if (is.na(match(nameJoinColumn, names(dF)))) {
stop("your chosen nameJoinColumn :'", nameJoinColumn,
"' seems not to exist in your data, columns = ",
paste(names(dF), ""))
return(FALSE)
}
dF[[joinCode]] <- as.character(dF[[nameJoinColumn]])
dF[[joinCode]] <- gsub("[[:space:]]*$", "", dF[[joinCode]])
if (joinCode == "ADMIN") {
dF$ISO3 <- NA
for (i in 1:nrow(dF)) dF$ISO3[i] = rwmGetISO3(dF[[joinCode]][i])
joinCode = "ISO3"
nameCountryColumn = nameJoinColumn
}
matchPosnsInLookup <- match(as.character(dF[[joinCode]]),
as.character(mapWithData#data[[joinCode]]))
failedCodes <- dF[[joinCode]][is.na(matchPosnsInLookup)]
numFailedCodes <- length(failedCodes)
numMatchedCountries <- nrow(dF) - numFailedCodes
cat(numMatchedCountries, "codes from your data successfully matched countries in the map\n")
failedCountries <- dF[[nameCountryColumn]][is.na(matchPosnsInLookup)]
failedCountries <- cbind(failedCodes, failedCountries = as.character(failedCountries))
cat(numFailedCodes, "codes from your data failed to match with a country code in the map\n")
if (verbose)
print(failedCountries)
matchPosnsInUserData <- match(as.character(mapWithData#data[[joinCode]]),
as.character(dF[[joinCode]]))
codesMissingFromUserData <- as.character(mapWithData#data[[joinCode]][is.na(matchPosnsInUserData)])
countriesMissingFromUserData <- as.character(mapWithData#data[["NAME"]][is.na(matchPosnsInUserData)])
numMissingCodes <- length(codesMissingFromUserData)
cat(numMissingCodes, "codes from the map weren't represented in your data\n")
mapWithData#data <- cbind(mapWithData#data, dF[matchPosnsInUserData,
])
invisible(mapWithData)
ll[[1]] <- mapWithData # MODIFIED
ll[[2]] <- failedCountries # MODIFIED
return(ll) # MODIFIED
}
Usage:
sPDF_wfails <- joinCountryData2Map_wfails(
countryExData[,c("ISO3V10", "Country")],
joinCode = "ISO3",
nameJoinColumn = "ISO3V10",
verbose = TRUE)
# This is the result of the original function
# sPDF_wfails[[1]]
# This is info on the failed joins
sPDF_wfails[[2]]
# > failedCodes failedCountries
# > [1,] "AGO_FAIL" "Angola"

R & xml2: Locate elements by specific text value, store all children values in data.frame

I work with regularly refreshed XML reports and I would like to automate the munging process using R & xml2.
Here's a link to an entire example file.
Here's a sample of the XML:
<?xml version="1.0" ?>
<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
<includedFileHeader>
<outboundFileIdentifier>f2e55625-e70e-4f9d-8278-fc5de7c04d47</outboundFileIdentifier>
<cmsBatchIdentifier>RIP-2015-00096</cmsBatchIdentifier>
<cmsJobIdentifier>16220</cmsJobIdentifier>
<snapShotFileName>25032.BACKUP.D03152016T032051.dat</snapShotFileName>
<snapShotFileHash>20d887c9a71fa920dbb91edc3d171eb64a784dd6</snapShotFileHash>
<outboundFileGenerationDateTime>2016-03-15T15:20:54</outboundFileGenerationDateTime>
<interfaceControlReleaseNumber>04.03.01</interfaceControlReleaseNumber>
<edgeServerVersion>EDGEServer_14.09_01_b0186</edgeServerVersion>
<edgeServerProcessIdentifier>8</edgeServerProcessIdentifier>
<outboundFileTypeCode>RIDE</outboundFileTypeCode>
<edgeServerIdentifier>2800273</edgeServerIdentifier>
<issuerIdentifier>25032</issuerIdentifier>
</includedFileHeader>
<calendarYear>2015</calendarYear>
<executionType>P</executionType>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
<memberMonths>12.13</memberMonths>
<totalAllowedClaims>1000.00</totalAllowedClaims>
<totalPaidClaims>100.00</totalPaidClaims>
<moopAdjustedPaidClaims>100.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier>CADULT4SM00101</claimIdentifier>
<claimPaidAmount>100.00</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
<memberMonths>9.17</memberMonths>
<totalAllowedClaims>0.00</totalAllowedClaims>
<totalPaidClaims>0.00</totalPaidClaims>
<moopAdjustedPaidClaims>0.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier></claimIdentifier>
<claimPaidAmount>0</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>
I would like to:
Read in the XML into R
Locate a specific insuredMemberIdentifier
Extract the planIdentifier and all claimIdentifier data associated with the member ID in (2)
Store all text and values for insuredMemberIdentifier, planIdentifier, claimIdentifier, and claimPaidAmount in a data.frame with a row for each unique claim ID (member ID to claim ID is a 1 to many)
So far, I have accomplished 1 and I'm in the ballpark on 2:
## Step 1 ##
ride <- read_xml("/Users/temp/Desktop/RIDetailEnrolleeReport.xml")
## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
memID <- xml_find_all(ride, "//d1:insuredMemberIdentifier[text()='ARS001']", xml_ns(ride))
[I know that I can then use xml_text() to extract the text of the element.]
After the code in Step 2 above, I've tried using xml_parent() to locate the parent node of the insuredMemberIdentifier, saving that as a variable, and then repeating Step 2 for claim info on that saved variable node.
node <- xml_parent(memID)
xml_find_all(node, "//d1:claimIdentifier", xml_ns(ride))
But this just results in pulling all claimIdentifiers in the global file.
Any help/information on how to get to step 4, above, would be greatly appreciated. Thank you in advance.
Apologies for the late response, but for posterity, import data as above using xml2, then parse the xml file by ID, as hinted by har07.
# output object to collect all claims
res <- data.frame(
insuredMemberIdentifier = rep(NA, 1),
planIdentifier = NA,
claimIdentifier = NA,
claimPaidAmount = NA)
# vector of ids of interest
ids <- c('ARS001')
# indexing counter
starti <- 1
# loop through all ids
for (ii in seq_along(ids)) {
# find ii-th id
## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
memID <- xml_find_all(x = ride,
xpath = paste0("//d1:insuredMemberIdentifier[text()='", ids[ii], "']"))
# find node for
node <- xml_parent(memID)
# as har07's comment find claim id within this node
cid <- xml_find_all(node, ".//d1:claimIdentifier", xml_ns(ride))
pid <- xml_find_all(node, ".//d1:planIdentifier", xml_ns(ride))
cpa <- xml_find_all(node, ".//d1:claimPaidAmount", xml_ns(ride))
# add invalid data handling if necessary
if (length(cid) != length(cpa)) {
warning(paste("cid and cpa do not match for", ids[ii]))
next
}
# collect outputs
res[seq_along(cid) + starti - 1, ] <- list(
ids[ii],
xml_text(pid),
xml_text(cid),
xml_text(cpa))
# adjust counter to add next id into correct row
starti <- starti + length(cid)
}
res
# insuredMemberIdentifier planIdentifier claimIdentifier claimPaidAmount
# 1 ARS001 25032VA013000101 CADULT4SM00101 100.00

How to access data saved in an assign construct?

I made a list, read the list into a for loop, do some calculations with it and export a modified dataframe to [1] "IAEA_C2_NoStdConditionResiduals1" [2] "IAEA_C2_EAstdResiduals2" ect. When I do View(IAEA_C2_NoStdConditionResiduals1) after the for loop then I get the following error message in the console: Error in print(IAEA_C2_NoStdConditionResiduals1) : object 'IAEA_C2_NoStdConditionResiduals1' not found, but I know it is there because RStudio tells me in its Environment view. So the question is: How can I access the saved data (in this assign construct) for further usage?
ResidualList = list(IAEA_C2_NoStdCondition = IAEA_C2_NoStdCondition,
IAEA_C2_EAstd = IAEA_C2_EAstd,
IAEA_C2_STstd = IAEA_C2_STstd,
IAEA_C2_Bothstd = IAEA_C2_Bothstd,
TIRI_I_NoStdCondition = TIRI_I_NoStdCondition,
TIRI_I_EAstd = TIRI_I_EAstd,
TIRI_I_STstd = TIRI_I_STstd,
TIRI_I_Bothstd = TIRI_I_Bothstd
)
C = 8
for(j in 1:C) {
#convert list Variable to string for later usage as Variable Name as unique identifier!!
SubNameString = names(ResidualList)[j]
SubNameString = paste0(SubNameString, "Residuals")
#print(SubNameString)
LoopVar = ResidualList[[j]]
LoopVar[ ,"F_corrected_normed"] = round(LoopVar[ ,"F_corrected_normed"] / mean(LoopVar[ ,"F_corrected_normed"]),
digit = 5
)
LoopVar[ ,"F_corrected_normed_error"] = round(LoopVar[ ,"F_corrected_normed_error"] / mean(LoopVar[ ,"F_corrected_normed_error"]),
digit = 5
)
assign(paste(SubNameString, j), LoopVar)
}
View(IAEA_C2_NoStdConditionResiduals1)
Not really a problem with assign and more with behavior of the paste function. This will build a variable name with a space in it:
assign(paste(SubNameString, j), LoopVar)
#simple example
> assign(paste("v", 1), "test")
> `v 1`
[1] "test"
,,,, so you need to get its value by putting backticks around its name so the space is not misinterpreted as a parse-able delimiter. See what happens when you type:
`IAEA_C2_NoStdCondition 1`
... and from here forward, use paste0 to avoid this problem.

Huge data file and running multiple parameters and memory issue, Fisher's test

I have a R code that I am trying to run in a server. But it is stopping in the middle/get frozen probably because of memory limitation. The data files are huge/massive (one has 20 million lines) and if you look at the double for loop in the code, length(ratSplit) = 281 and length(humanSplit) = 36. The data has specific data of human and rats' genes and human has 36 replicates, while rat has 281. So, the loop is basically 281*36 steps. What I want to do is to process data using the function getGeneType and see how different/independent are the expression of different replicate combinations. Using Fisher's test. The data rat_processed_7_25_FDR_05.out looks like this :
2 Sptbn1 114201107 114200202 chr14|Sptbn1:114201107|Sptbn1:114200202|reg|- 2 Thymus_M_GSM1328751 reg
2 Ndufb7 35680273 35683909 chr19|Ndufb7:35680273|Ndufb7:35683909|reg|+ 2 Thymus_M_GSM1328751 rev
2 Ndufb10 13906408 13906289 chr10|Ndufb10:13906408|Ndufb10:13906289|reg|- 2 Thymus_M_GSM1328751 reg
3 Cdc14b 1719665 1719190 chr17|Cdc14b:1719665|Cdc14b:1719190|reg|- 3 Thymus_M_GSM1328751 reg
and the data fetal_output_7_2.out has the form
SPTLC2 78018438 77987924 chr14|SPTLC2:78018438|SPTLC2:77987924|reg|- 11 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
EXOSC1 99202993 99201016 chr10|EXOSC1:99202993|EXOSC1:99201016|rev|- 5 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
SHMT2 57627893 57628016 chr12|SHMT2:57627893|SHMT2:57628016|reg|+ 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
ZNF510 99538281 99537128 chr9|ZNF510:99538281|ZNF510:99537128|reg|- 8 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
PPFIBP1 27820253 27824363 chr12|PPFIBP1:27820253|PPFIBP1:27824363|reg|+ 10 Fetal_Brain_408_AGTCAA_L006_R1_report.txt reg
Now I have few questions on how to make this more efficient. I think when I run this code, R takes up lots of memory that ultimately causes problems. I am wondering if there is any way of doing this more efficiently
Another possibility is the usage of double for-loop'. Will sapply help? In that case, how should I apply sapply?
At the end I want to convert result into a csv file. I know this is a bit overwhelming to put code like this. But any optimization/efficient coding/programming will be A LOT! I really need to run the whole thing at least one to get the data soon.
#this one compares reg vs rev
date()
ratRawData <- read.table("rat_processed_7_25_FDR_05.out",col.names = c("alignment", "ratGene", "start", "end", "chrom", "align", "ratReplicate", "RNAtype"), fill = TRUE)
humanRawData <- read.table("fetal_output_7_2.out", col.names = c("humanGene", "start", "end", "chrom", "alignment", "humanReplicate", "RNAtype"), fill = TRUE)
geneList <- read.table("geneList.txt", col.names = c("human", "rat"), sep = ',')
#keeping only information about gene, alignment number, replicate and RNAtype, discard other columns
ratRawData <- ratRawData[,c("ratGene", "ratReplicate", "alignment", "RNAtype")]
humanRawData <- humanRawData[, c( "humanGene", "humanReplicate", "alignment", "RNAtype")]
#function to capitalize
capitalize <- function(x){
capital <- toupper(x) ## capitalize
paste0(capital)
}
#capitalizing the rna type naming for rat. So, reg ->REG, dup ->DUP, rev ->REV
#doing this to make data manipulation for making contingency table easier.
levels(ratRawData$RNAtype) <- capitalize(levels(ratRawData$RNAtype))
#spliting data in replicates
ratSplit <- split(ratRawData, ratRawData$ratReplicate)
humanSplit <- split(humanRawData, humanRawData$humanReplicate)
print("done splitting")
#HyRy :when some gene has only reg, rev , REG, REV
#HnRy : when some gene has only reg,REG,REV
#HyRn : add 1 when some gene has only reg,rev,REG
#HnRn : add 1 when some gene has only reg,REG
#function to be used to aggregate
getGeneType <- function(types) {
types <- as.character(types)
if ('rev' %in% types) {
return(ifelse(('REV' %in% types), 'HyRy', 'HyRn'))
}
else {
return(ifelse(('REV' %in% types), 'HnRy', 'HnRn'))
}
}
#logical function to see whether x is integer(0) ..It's used the for loop bellow in case any one HmYn is equal to zero
is.integer0 <- function(x) {
is.integer(x) && length(x) == 0L
}
result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis",
Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
for(j in 1:length(humanSplit)) {
ratReplicateName <- names(ratSplit[i])
humanReplicateName <- names(humanSplit[j])
#merging above two based on the one-to-one gene mapping as in geneList defined above.
mergedHumanData <-merge(geneList,humanSplit[[j]], by.x = "human", by.y = "humanGene")
mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")
mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
mergedRatData <- mergedRatData[,c(2,1,4,5)] #rearrange column
mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"
agg <- aggregate(RNAtype ~ human+rat, data= mergedHumanRatData, FUN=getGeneType) #agg to make HmYn form
HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.
#now assign these numbers to variables HmYn. Consider cases when some form of HmRy is not present in the table. That's why
#is.integer0 function is used
HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])
contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
# contingencyTable:
# HnRn --|--HyRn
# |------|-----|
# HnRy --|-- HyRy
#
fisherTest <- fisher.test(contingencyTable)
#make new line out of the result of fisherTest
newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p,
alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2],
oddratio = fisherTest$estimate[[1]])))
result <-rbind(result,newLine) #append newline to result
if(j%%10 = 0) print(c(i,j))
}
}
write.table(result, file = "compareRegAndRev.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
Referring to the accepted answer to Monitor memory usage in R, the amount of memory used by R can be tracked with gc().
If the script is, indeed, running short of memory (which would not surprise me), the easiest way to resolve the problem would be to move the write.table() from the outside to the inside of the loop, to replace the rbind(). It would just be necessary to create a new file name for the CSV file that is written from each output, e.g. by:
csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv",i,j)
If the CSV files are written without headers, they could then be concatenated separately outside R (e.g. using cat in Unix) and the header added later.
While this approach might succeed in creating the CSV file that is sought, it is possible that file might be too big to process subsequently. If so, it may be preferable to process the CSV files individually, rather than concatenating them at all.

Ordering Merged data frames

As a fairly new R programmer I seem to have run into a strange problem - probably my inexperience with R
After reading and merging successive files into a single data frame, I find that order does not sort the data as expected.
I have multiple references in each file but each file refers to measurement data obtained at a different time.
Here's the code
library(reshape)
# Enter file name to Read & Save data
FileName=readline("Enter File name:\n")
# Find first occurance of file
for ( round1 in 1 : 6) {
ReadFile=paste(round1,"C_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile))
break
}
x = data.frame(read.csv(ReadFile, header=TRUE),rnd=round1)
for ( round2 in (round1+1) : 6) {
#
ReadFile=paste(round2,"C_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) {
y = data.frame(read.csv(ReadFile, header=TRUE),rnd = round2)
if (round2 == (round1 +1))
z=data.frame(merge(x,y,all=TRUE))
z=data.frame(merge(y,z,all=TRUE))
}
}
ordered = order(z$lab_id)
results = z[ordered,]
res = data.frame( lab=results[,"lab_id"],bw=results[,"ZBW"],wi=results[,"ZWI"],pf_zbw=0,pf_zwi=0,r = results[,"rnd"])
#
# Establish no of samples recorded
nsmpls = length(res[,c("lab")])
# Evaluate Z_scores for Between Lab Results
for ( i in 1 : nsmpls) {
if (res[i,"bw"] > 3 | res[i,"bw"] < -3)
res[i,"pf_zbw"]=1
}
# Evaluate Z_scores for Within Lab Results
for ( i in 1 : nsmpls) {
if (res[i,"wi"] > 3 | res[i,"wi"] < -3)
res[i,"pf_zwi"]=1
}
dd = melt(res, id=c("lab","r"), "pf_zbw")
b = cast(dd, lab ~ r)
If anyone could see why the ordering only works for about 55 of 70 records and could steer me in the right direction I would be obliged
Thanks very much
Check whether z$lab_id is a factor (with is.factor(z$lab_id)).
If it is, try
z$lab_id <- as.character(z$lab_id)
if it is supposed to be a character vector; or
z$lab_id <- as.numeric(as.character(z$lab_id))
if it is supposed to be a numeric vector.
Then order it again.
Ps. I had previously put these in the comments.

Resources