In R - Substring based on repeating character

I have two tables. In one table (IPTable), there is a column containing IP addresses (which look like this: "10.100.20.13"). I am trying to match each of those to the data in a column in the other table (SubnetTable) which holds subnet addresses (which look like this: "10.100.20" - essentially a shortened version of the IP address, everything before the 3rd period). Both variables appear to be chr vectors.
Essentially the raw IP data looks like this:
IPTable$IPAddress
10.100.20.13
10.100.20.256
10.100.200.23
101.10.13.43
101.100.200.1
and the raw Subnet data I am comparing it against looks like this:
SubnetTable$Subnet
Varies
10.100.20
Remote Subnet
10.100.200, 101.10.13
Unknown Subnet
Notes:
sometimes the subnet entries contain two subnets within a field separated by a comma
the IPAddress field doesn't have consistent digit placement between the groups (e.g. both "10.110.20.13" and "101.10.20.13" could exist)
In a different scripting application I am able to simply compare these as strings in a foreach loop. That logic loops through each entry in the Subnet data (SubnetTable), splits it on the comma (to account for the entries with multiple subnet addresses) and then checks whether it finds a match in the IP Address field (e.g. is "10.100.20" found anywhere in "10.100.20.13"). I then use that field for a join/merge. I understand that this kind of looping isn't the most efficient approach in R, and in the other application it takes a long time, which is part of the reason I am moving to R.
I didn't see a method of doing the same thing against this type of data (I have done merges and joins, but I don't see a way of doing that without first getting two variables alike enough to link the two tables).
In the past I have been able to use R methods like sqldf with charindex and leftstr to look for a particular character "." and pull everything before it, but the difficulty here is that I would need to look for the 3rd occurrence of the period "." instead of the first. I didn't see a way of doing that, but if there is one, that might be best.
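(For what it's worth, a single regular expression can grab everything before the 3rd period without counting occurrences by hand; a minimal sketch:)
# capture three dot-free groups, then drop the rest
sub("^([^.]+\\.[^.]+\\.[^.]+)\\..*$", "\\1", "10.100.20.13")
#> [1] "10.100.20"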
My next attempt was to use strsplit and sapply on the IP address with the idea of reassembling only the first three portions to create a subnet to match against (in a new column/variable). That looked like this:
IPClassC <- sapply(strsplit(Encrypt_Remaining5$IPAddress, "[.]"), `[`)
This gives a "Large List" which makes the data look like this:
chr [1:4] "10" "100" "20" "13"
But when attempting to put it back together I am also losing the period between the octets. Sample code:
paste(c(IPClassC[[1]][1:3]), sep ="[.]", collapse = "")
This produces something like this:
"1010020"
In the end I have two questions:
1) Is there a method for doing the easy comparison I did before (essentially merging the subnet variable of one table against "most" of the IP Address of the other, based on everything before the third period ".") without having to split out and reassemble the IPAddress field?
2) If not, am I on the right track with trying to split and then reassemble? If so, what am I doing wrong with the reassembly or is there an easier/better way of doing this?
Thanks and let me know what else you need.

I think what you’re essentially asking is how to join these two tables, right? If this is the case, I would do it like this:
library(tidyr)
suppressPackageStartupMessages(library(dplyr))
IPTable <- data.frame(
  IPAddress = c(
    "10.100.20.13",
    "10.100.20.256",
    "10.100.200.23",
    "101.10.13.43",
    "101.100.200.1"
  ),
  stringsAsFactors = FALSE
)
I am not sure whether your SubnetTable really looks like this, i.e. mixing subnet addresses with other text? Anyway, this solution essentially ignores the other text.
SubnetTable <- data.frame(
  subnet_id = 1:5,
  Subnet = c(
    "Varies",
    "10.100.20",
    "Remote Subnet",
    "10.100.200, 101.10.13",
    "Unknown Subnet"
  ),
  stringsAsFactors = FALSE
)
First we separate multiple subnets into multiple rows. Note that this assumes that the SubnetTable$Subnet vector only contains ", " as a separator between two subnets, i.e. there are no strings like "Unknown, Subnet", or else those will be separated into two rows as well (see the sketch after the output below).
SubnetTable_tidy <- tidyr::separate_rows(SubnetTable, Subnet, sep = ", ")
SubnetTable_tidy
#> subnet_id Subnet
#> 1 1 Varies
#> 2 2 10.100.20
#> 3 3 Remote Subnet
#> 4 4 10.100.200
#> 5 4 101.10.13
#> 6 5 Unknown Subnet
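If the free-text entries could themselves contain a ", ", one option is a stricter separator that only splits when a digit follows the comma; a sketch (separate_rows accepts a regular expression, and the stringr/ICU engine behind it supports lookaheads):
SubnetTable_tidy <- tidyr::separate_rows(SubnetTable, Subnet, sep = ",\\s*(?=\\d)")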
Next we extract the Subnet by deleting a dot (\\.) followed by one to three digits (\\d{1,3}) at the end of the string ($) from IPTable$IPAddress.
IPTable$Subnet <- gsub("\\.\\d{1,3}$", "", IPTable$IPAddress)
IPTable
#> IPAddress Subnet
#> 1 10.100.20.13 10.100.20
#> 2 10.100.20.256 10.100.20
#> 3 10.100.200.23 10.100.200
#> 4 101.10.13.43 101.10.13
#> 5 101.100.200.1 101.100.200
Now we can join both tables.
IPTable_subnet <- dplyr::left_join(
  x = IPTable,
  y = SubnetTable_tidy,
  by = "Subnet"
)
IPTable_subnet
#> IPAddress Subnet subnet_id
#> 1 10.100.20.13 10.100.20 2
#> 2 10.100.20.256 10.100.20 2
#> 3 10.100.200.23 10.100.200 4
#> 4 101.10.13.43 101.10.13 4
#> 5 101.100.200.1 101.100.200 NA
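Base R's merge() gives the same result if you prefer to avoid dplyr; a sketch:
merge(IPTable, SubnetTable_tidy, by = "Subnet", all.x = TRUE)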

unlist(strsplit(SubnetTable$Subnet, split = ", ")) %in%
  gsub("^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*$", "\\1", IPTable$IPAddress)
This will give you a vector of class logical that matches a TRUE/FALSE to each item in Subnet (giving multiple responses for items with commas in them). Alternatively, you can flip the two sides to get a list of logicals for each of the IPAddress, telling you if it exists in the subnet list.
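A sketch of that flipped comparison (the same two pieces with the sides of %in% swapped):
gsub("^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*$", "\\1", IPTable$IPAddress) %in%
  unlist(strsplit(SubnetTable$Subnet, split = ", "))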
Is this what you were looking for?
You can also achieve a similar result with charmatch:
sapply(strsplit(SubnetTable$Subnet, split = ", "), charmatch, IPTable$IPAddress)
This gives the following result with your sample data:
[[1]]
[1] NA
[[2]]
[1] 0
[[3]]
[1] NA
[[4]]
[1] 3 4
[[5]]
[1] NA
Note that when there is a single match, you get the index for it, but where there are multiple matches, the value is 0.
Finally, flipping this one gives you, for each IP address, the index in Subnet that it matches:
sapply(gsub("^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*$", "\\1", IPTable$IPAddress), charmatch, SubnetTable$Subnet)
results in:
  10.100.20   10.100.20  10.100.200   101.10.13 101.100.200
          2           2           4          NA          NA
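If the end goal is the join itself, the same idea works with match() after splitting the comma fields; a base-R sketch, reusing the SubnetTable (with its subnet_id column) defined earlier (lookup is a hypothetical intermediate table built just for this example):
# one row per individual subnet, keeping its original subnet_id
subnets <- strsplit(SubnetTable$Subnet, split = ", ")
lookup <- data.frame(
  subnet_id = rep(SubnetTable$subnet_id, lengths(subnets)),
  Subnet = unlist(subnets),
  stringsAsFactors = FALSE
)
# map each IP's subnet prefix to its subnet_id (NA where unmatched)
IPTable$subnet_id <- lookup$subnet_id[
  match(gsub("\\.\\d{1,3}$", "", IPTable$IPAddress), lookup$Subnet)
]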

Related

How to save R DataFrame to a file in MSSQL backup format?

I need to feed an external MSSQL server with a large amount of data calculated in R.
No direct access to the DB is possible, so it must be an interim export file.
Excel format cannot be utilised because the number of data rows exceeds Excel's capacity.
CSV would be fine, but there are many obstacles in the data itself, like semicolons used in names, special characters, unclosed quotations (an odd number of "), and so on.
I am looking for a versatile method of transporting data from R to an MSSQL database, independent of data content. If I were able to save a DataFrame as a single-table database in MSSQL backup format, that would satisfy the need.
Any idea on how to achieve this? Any package available? Any suggestion would be appreciated.
I'm inferring you're hoping to bulk-insert the data using bcp or sqlcmd. While neither one deals well with embedded commas and embedded quotes, you can work around this by using a different field separator (one not contained within the data).
Setup:
evil_data <- data.frame(
  id = 1:2,
  chr = c('evil"string ,;\n\'', '",";:|"'),
  stringsAsFactors = FALSE
)
# con <- DBI::dbConnect(...)
DBI::dbExecute(con, "create table r2test (id INT, chr nvarchar(64))")
# [1] 0
DBI::dbWriteTable(con, "r2test", evil_data, create = FALSE, append = TRUE)
DBI::dbGetQuery(con, "select * from r2test")
# id chr
# 1 1 evil"string ,;\n'
# 2 2 ",";:|"
First, I'll use \U001 as the field separator and \U002 as the row separator. Those two should be "good enough", but if you have non-printable characters in your data, then you might either change your separators to other values or look for encoding options for the data (e.g., base64, though it might need to be stored that way).
write.table(evil_data, "evil_data.tab", sep = "\U001", eol = "\U002", quote = FALSE)
# or data.table::fwrite or readr::write_delim
Since I'm using bcp, it can use a "format file" to indicate separators and which columns on the source correspond with columns in the database. See references for how to create this file, but for this example I'll use:
fmt <- c("12.0", "2",
         "1 SQLCHAR 0 0 \"\001\" 1 id \"\" ",
         "2 SQLCHAR 0 0 \"\002\" 2 chr SQL_Latin1_General_CP1_CI_AS")
writeLines(fmt, "evil_data.fmt")
From here, assuming bcp is in your PATH (otherwise you'll need the absolute path to bcp), run this in a terminal (I'm using git-bash on Windows, but this should be the same in other shells). The first line is the part that matters for your data; the second line is specific to my database connection, so omit or change it for your own.
$ bcp [db_owner].[r2test] in evil_data.tab -F2 -f evil_data.fmt -r '\002' \
-S '127.0.0.1,21433' -U 'myusername' -d 'mydatabase' -P ***MYPASS***
Starting copy...
2 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 235 Average : (8.51 rows per sec.)
Proof that it worked:
DBI::dbGetQuery(con, "select * from r2test")
# id chr
# 1 1 evil"string ,;\n'
# 2 2 ",";:|"
# 3 1 1\001evil"string ,;\r\n'
# 4 2 2\001",";:|"
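Note the stray \001 inside rows 3 and 4, which is consistent with write.table having emitted row names as an extra leading field, shifting the two-field format file by one column. If that is what's happening in your run, dropping the row names should align the file with the format spec; a sketch:
write.table(evil_data, "evil_data.tab", sep = "\U001", eol = "\U002",
            quote = FALSE, row.names = FALSE)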
References:
Microsoft pages for bcp: Windows and Linux
non-XML format files
bcp and quoted-CSV

Getting range_boundaries_to_cidr to run through a vector of data gives an "Expecting a single value" error

I am creating an R script to manipulate a csv of IP addresses into the format I need. The starting csv has a single IP column in which some entries are ranges ("start-end") and some are already in CIDR format. (Note: Data is the name of the CSV and IP is the name of the column.) The code below splits each range entry into IP and End columns:
data2 <- separate(Data, IP, into = c("IP", "End"), sep = "-")
After I get to that point I want to use the code below to run through that list and convert the IPs that are not yet in CIDR format, but it returns the "expecting a single value" error and I am not sure where to go from here. I would also like it to skip over the ones that are already in CIDR format; I am not sure if it will do this automatically or if I need to code that in as well. I also tried a for loop and got the same error.
range_boundaries_to_cidr(ip_to_numeric(data2$IP), ip_to_numeric(data2$End))
Any help is appreciated!!! I am new to R so I am very open to any suggestions
You could try using my ipaddress R package. :-)
Starting from your second dataframe, I think the main difficulty is that the IP column is sometimes storing an IP address and sometimes storing an IP network. In other words, the data is dirty.
Here's an example with branching logic to handle the two cases.
library(tidyverse)
library(withr)
library(ipaddress)
tibble(
  IP = c("192.168.0.0", "172.16.0.0/12", "10.100.56.0"),
  End = c("192.168.55.255", NA, "10.100.56.255")
) %>%
  mutate(network = with_options(list(warn = -1), if_else(
    is.na(ip_network(IP)),
    common_network(ip_address(IP), ip_address(End)),
    ip_network(IP)
  )))
#> # A tibble: 3 x 3
#> IP End network
#> <chr> <chr> <ip_netwk>
#> 1 192.168.0.0 192.168.55.255 192.168.0.0/18
#> 2 172.16.0.0/12 <NA> 172.16.0.0/12
#> 3 10.100.56.0 10.100.56.255 10.100.56.0/24
Created on 2020-08-06 by the reprex package (v0.3.0)

Apply a filter on a column to delete rows

I am new to R and have a simple question about deleting rows if a condition is not met.
I have a CSV file with a column of IP addresses. The issue is that not all entries are IP addresses, and I want to delete the rows that are not.
Sample table My.Data:
ID   SIP          DIP
1.   123.243.0.1  56
2.   123.143.0.1  89
3.   0.16783633   44
4.   123.143.0.1  89
So I want to delete anything in My.Data$SIP that does not match [0-9]{3}\.[0-9]{3}\.[0-9]{3}\.
I am lost, and I am using dplyr.
Thank you,
Paul
Use filter with grepl. I'm assuming here that the pattern for an IP address is "digits plus dot, three times, then digits".
library(dplyr)
My.Data %>%
  filter(grepl("^(\\d+\\.){3}\\d+$", SIP))
It looks like you're already familiar with regular expressions so you can do something like:
hits <- grepl(pattern = "[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}", x = My.Data$SIP)
My.Data2 <- My.Data[hits, ]
Basically this just makes a logical vector of whether or not that column of your data frame matches the regular expression. We then subset the data based on that vector.
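If you need to validate all four octets rather than just the first three, anchoring the pattern and repeating the dot-group is a small extension; a sketch:
hits <- grepl("^[0-9]{1,3}(\\.[0-9]{1,3}){3}$", x = My.Data$SIP)
My.Data2 <- My.Data[hits, ]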

Gene ontology (GO) analysis for a list of Genes (with ENTREZID) in R?

I am very new to GO analysis and I am a bit confused about how to do it for my list of genes.
I have a list of genes (n=10):
gene_list
SYMBOL ENTREZID GENENAME
1 AFAP1 60312 actin filament associated protein 1
2 ANAPC11 51529 anaphase promoting complex subunit 11
3 ANAPC5 51433 anaphase promoting complex subunit 5
4 ATL2 64225 atlastin GTPase 2
5 AURKA 6790 aurora kinase A
6 CCNB2 9133 cyclin B2
7 CCND2 894 cyclin D2
8 CDCA2 157313 cell division cycle associated 2
9 CDCA7 83879 cell division cycle associated 7
10 CDCA7L 55536 cell division cycle associated 7-like
and I simply want to find their function; it has been suggested that I use GO analysis tools.
I am not sure if that is the correct way to do so.
Here is my solution:
library(org.Hs.eg.db)
x <- org.Hs.egGO
# Get the entrez gene identifiers that are mapped to a GO ID
xx <- as.list(x[gene_list$ENTREZID])
So I've got a list in which each gene's EntrezID is assigned to several GO terms.
for example:
> xx$`60312`
$`GO:0009966`
$`GO:0009966`$GOID
[1] "GO:0009966"
$`GO:0009966`$Evidence
[1] "IEA"
$`GO:0009966`$Ontology
[1] "BP"
$`GO:0051493`
$`GO:0051493`$GOID
[1] "GO:0051493"
$`GO:0051493`$Evidence
[1] "IEA"
$`GO:0051493`$Ontology
[1] "BP"
My question is: how can I find the function for each of these genes in a simpler way, and am I doing this right?
Ultimately I want to add the function to gene_list as a function/GO column.
Thanks in advance,
EDIT: There is a new Bioinformatics SE (currently in beta mode).
I hope I get what you are aiming at here.
BTW, for bioinformatics-related topics you can also have a look at Biostars, which has the same purpose as SO but for bioinformatics.
If you just want to have a list of each function related to the gene, you can query a database such as ENSEMBL through the biomaRt Bioconductor package, which is an API for querying BioMart databases.
You will need internet though to do the query.
Bioconductor offers packages for bioinformatics studies, and these packages generally come with good vignettes that walk you through the different steps of the analysis (and even highlight how you should design your data, and some of the pitfalls).
In your case, the code below comes directly from the biomaRt vignette - task 2 in particular.
Note: there are slightly quicker ways than the one I report below:
# load the library
library("biomaRt")
# I prefer ensembl, so that's the one I will query, but you can
# query other bases; try out: listMarts()
ensembl <- useMart("ensembl")
# as it seems that you are looking for human genes:
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# if you want other model organisms, have a look at:
# listDatasets(ensembl)
You need to create your query (your list of ENTREZ ids). To see which filters you can query:
filters = listFilters(ensembl)
And then you want to retrieve attributes : your GO number and description. To see the list of available attributes
attributes = listAttributes(ensembl)
For you, the query would look something like this:
goids <- getBM(
  # you want entrezgene so you know which is which; go_id is the GO ID and
  # name_1006 is the identifier of the 'GO term name'
  attributes = c('entrezgene', 'go_id', 'name_1006'),
  filters = 'entrezgene',
  values = gene_list$ENTREZID,
  mart = ensembl
)
The query itself can take a while.
Then you can always collapse the information into two columns (but I wouldn't recommend that for anything other than reporting purposes):
Go.collapsed <- Reduce(rbind, lapply(gene_list$ENTREZID, function(x) {
  tempo <- goids[goids$entrezgene == x, ]
  data.frame(
    'ENTREZGENE' = x,
    'Go.ID'      = paste(tempo$go_id, collapse = ' ; '),
    'GO.term'    = paste(tempo$name_1006, collapse = ' ; ')
  )
}))
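Merging this back onto the original table then gives the requested GO column; a sketch:
gene_list_annotated <- merge(gene_list, Go.collapsed,
                             by.x = "ENTREZID", by.y = "ENTREZGENE",
                             all.x = TRUE)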
Edit:
If you want to query a past version of the ensembl database:
ens82<-useMart(host='sep2015.archive.ensembl.org',
biomart='ENSEMBL_MART_ENSEMBL',
dataset='hsapiens_gene_ensembl')
and then the query would be:
goids = getBM(attributes=c('entrezgene','go_id', 'name_1006'),
filters='entrezgene',values=gene_list$ENTREZID,
mart=ens82)
However, if you had in mind to do a GO enrichment analysis, your list of genes is too short.

Unix programming to subset every 1Mb and name the subset

I need a way to subset a large data set in Unix. I have >50K SNPs, each with the genetic variance it explains and a location (chromosome and position). I need to bin the SNPs every 1 million base pairs (by position) within each chromosome to create what we call 1Mb windows. I also need to name these windows, for instance CHR:WINDOW.
My data is structured as:
SNP CHR POS GenVar
BTB-00074935 1 157284336 2.306141e-06
BTB-01512420 8 72495155 1.958865e-06
Hapmap35555-SCAFFOLD20017_21254 18 29600313 1.876211e-06
BTB-01098205 3 68702409 1.222881e-06
ARS-BFGL-NGS-115531 11 74038177 9.597669e-07
ARS-BFGL-NGS-25658 2 119059379 7.953552e-07
BTB-00411452 20 47919708 6.827312e-07
ARS-BFGL-NGS-100532 18 63878550 6.115242e-07
Hapmap60823-rs29019235 1 10717144 5.400144e-07
ARS-BFGL-NGS-42256 10 50282066 4.864838e-07
.
.
.
A basic first try, assuming no spaces in any of the first fields (SNP), that the "key" is (col2, first (length-6) digits of col3), and skipping the header row (NR>1):
awk 'NR>1{w=0+substr($3,1,length($3)-6); print >> sprintf("CHR%02d:WINDOW%03d", $2, w)}'
This prints to files named like CHR03:WINDOW456. If you only wanted something like 03:456 for filenames, edit out the CHR and WINDOW above.
Also note, subsequent runs will just keep expanding existing files, so you may need a rm *:* between runs.
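If you would rather do the same windowing in R, a minimal sketch, assuming the table above sits in a file called snps.txt (a hypothetical name):
snp <- read.table("snps.txt", header = TRUE, stringsAsFactors = FALSE)
# 1Mb window index: integer-divide the position by 1e6
snp$Window <- sprintf("%d:%d", snp$CHR, snp$POS %/% 1e6)
# one data frame per CHR:WINDOW label
windows <- split(snp, snp$Window)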
