Alternative to making a dictionary from a text document

I took word pairs from a text file and made a dictionary:
x = open('sustantivos.txt', 'r') ## opens file and assigns it to a variable
y = x.read() ## reads open file object and assigns it to variable y
y = str(y).lower().replace(":", "") ## turns open file object into a string, then makes it lower case and replaces ":" with whitespace
z = y.splitlines() # make a list with each element being a word pair string, then assign to variable z
bank = {}
for pair in z: #go through every word pair string
    (key, value) = pair.split() #split the word pair string making a list with two elements, assign these to variable key and value
    bank[key] = value #add key value pair
x.close()
For reference, this is an excerpt from the text file:
Amour: amor
Anglais: inglés
Argent: dinero
Bateau: barco
My question is: is there a more efficient approach, or anything you would do differently? I was also curious whether the understanding I included in the comments is correct. Thanks in advance.

Your inline notes are accurate, except that line 2 is where the opened file is read and its contents are turned into a string. The str(y) in the third line is therefore unnecessary; it could simply be written as y.lower()...
Your parsing strategy is sound as long as you know that the file will always contain a key:value pair on each and every line. However, there are a couple of recommendations I would make.
Use a with statement when opening files. This avoids errors that can occur if the file isn't closed properly.
Don't read the whole file in at once.
dict.update will accept an iterable of length-2 iterables (key-value pairs) as an argument.
Using those tips your code can be rewritten as:
bank = {}
with open('sustantivos.txt', 'r') as x:
    for line in x:
        key, value = line.strip().split(':')
        bank[key] = value.strip()  # strip() drops the space left after the colon
        # bank.update([line.strip().split(':')]) <- or this
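With the excerpt above, bank should end up mapping 'Amour' to 'amor', 'Anglais' to 'inglés', and so on. Note one behavioral difference: unlike your original, this rewrite does not lower-case the text, so call .lower() on each line if you still want that.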

Related

Finding and substituting a set of codes/words in a file based on a list of old and corrected ones

I have a FASTA_16S.txt file containing paragraphs of different lengths with a unique code (e.g. 16S317) at the top. After transfer into R, I have a list with 413 members that looks like this:
[1]">16S317_V._rotiferianus_A\n
AAATTGAAGAGTTTGATCATGGCTCAG..."
[2]">16S318_Salmonella_bongori\n
AAATTGAAGAGTTTGATCATGGCTCAGATT..."
[3]">16S319_Escherichia_coli\n
TTGAAGAGTTTGATCATGGCTCAGATTG...
I need to substitute the existing codes with the new ones from a table Code_16S:
   Old     New
1. 16S317  16S001
2. 16S318  16S307
3. 16S319  16S211
4. ...     ...
4. ... ...
Can anybody suggest code that would identify an old code and substitute it with the new one?
Bear in mind that some codes appear in both the New and Old columns, so direct application of gsub or replace to the whole list did not work (after a substitution we have two paragraphs with the same code, so one of the next steps then changes both of them).
Below is my solution to the problem, but I don't consider it optimal.
Instead of using lapply, it may be easier with str_replace_all. With a named vector it performs all the replacements in a single pass over each string, so a freshly substituted code should not be matched again by a later pattern (the chaining problem described above).
library(stringr)
library(tibble)
FASTA_16S <- str_replace_all(FASTA_16S, deframe(Code_16S))
-output
FASTA_16S
[1] ">16S001_V._rotiferianus_A\n\nAAATTGAAGAGTTTGATCATGGCTCAG..."
[2] ">16S307_Salmonella_bongori\n\nAAATTGAAGAGTTTGATCATGGCTCAGATT..."
data
FASTA_16S <- c(">16S317_V._rotiferianus_A\n\nAAATTGAAGAGTTTGATCATGGCTCAG...",
               ">16S318_Salmonella_bongori\n\nAAATTGAAGAGTTTGATCATGGCTCAGATT...")
Code_16S <- structure(list(Old = c("16S317", "16S318", "16S319"),
                           New = c("16S001", "16S307", "16S211")),
                      class = "data.frame", row.names = c("1.", "2.", "3."))
As long as the new codes are sorted to match the old ones, which corresponds to the order of the paragraphs in the file, we can perform the substitution paragraph by paragraph.
(Initially the table was sorted by the column New.)
Num <- seq_len(413)       # one index per paragraph
code_16S <- Code_16S$New  # the replacement codes, in paragraph order
F_16S <- function(x) {
  row <- code_16S[x]
  # replace the first 7 characters (">" plus the 6-character code)
  gsub("^.{7}", paste(">", row, sep = ""), FASTA_16S[[x]])
}
N_16S <- lapply(Num, F_16S)
With gsub("^[>].{7}", I tried to substitute first 6 characters (the code) except the first one (>) in each string, but it did not work, thus added paste function.

Swapping each letter in a string sequentially using R

For fellow DnD fans: I recently found the Ring of the Grammarian, so I am trying to make a quick script that generates a list of sensible words by swapping letters in an input string. For example, I want to input "mage hand" and have the program return a list or dataframe which reads:
cage hand
...yada yada ...
mage band
mage land
...yada yada ...
mage bang
So far, I've only gotten as far as this:
dictionary <- data.frame(DICTIONARY)
spell.suggester <- function(x){
  for (i in 1:nchar(x)) {
    for (k in 1:length(letters)) {
      res1 <- gsub(pattern = x[i], replace = letters[k], x)
      res2 <- grep("\\bres1\\b", dictionary[,1], value = F)
      if_else(res2 > 1, print(grep("\\bres1\\b", dictionary[,1], value = T)), "nonsense")
      return()
    }
  }
}
spell.suggester(x = "mage hand")
but I end up with an error message which reads
character(0)
NULL
I haven't found any answers on stack using R. Could someone please help me with some suggestions and guidance?
Your major problem here is that you're trying to index each letter of a string, and R doesn't let you do that: it treats a string as a single value, so indexing like x[i] returns the whole string (or NA) rather than the i-th letter.
To fix that, you can use strsplit to turn a string into a vector of individual characters that you can index as normal.
Your second issue is that the dictionary search seems a bit over-complicated; you can use %in% to check whether a value is present in a vector.
The code below shows a minimal example of how to do this; it only works with single words, and relies on you having a decent dictionary to check valid words against.
# minimal example of a valid word list
dictionary <- c("vane", "sane", "pane", "cane",
                "bone", "bans", "bate", "bale")

spell.suggester <- function(spell){
  # Split spell into a vector of single characters
  spell_letters <- strsplit(spell, "")[[1]]
  # Once for each letter in spell
  for (i in 1:nchar(spell)) {
    # Once for each letter in letters
    for (k in 1:length(letters)) {
      # If the letter isn't a space
      if (spell_letters[i] != " "){
        # Create a new word by changing the letter at position i only
        # (gsub would change every occurrence of that letter, e.g. both
        # a's in "mage hand", so positional replacement is used instead)
        word <- spell
        substr(word, i, i) <- letters[k]
        # If the word is in the list of valid words
        if (word %in% dictionary){
          # print the possibility
          print(word)
        }
      }
    }
  }
}
spell.suggester(spell="bane")
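With the toy dictionary above, that call should print one hit per line, in loop order:
[1] "cane"
[1] "pane"
[1] "sane"
[1] "vane"
[1] "bone"
[1] "bale"
[1] "bate"
[1] "bans"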

Specify the number of columns read_csv is applied to

Is it possible to pass column indices to read_csv?
I am passing many CSV files to read_csv with different header names so rather than specifying names I wish to use column indices.
Is this possible?
df.list <- lapply(myExcelCSV, read_csv, skip = headers2skip[i]-1)
Alternatively, you can use a compact string representation where each character represents one column: c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time, ? = guess, or ‘_’/‘-’ to skip the column.
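For example, col_types = "ci_d" would read a four-column file as character, integer, skipped, double.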
If you know the total number of columns in the file you could do it like this:
library(readr)

my_read <- function(..., tot_cols, skip_cols = numeric(0)) {
  csr <- rep("?", tot_cols)   # "?" = guess the type of every column
  csr[skip_cols] <- "_"       # "_" = skip the columns at these indices
  csr <- paste(csr, collapse = "")
  read_csv(..., col_types = csr)
}
If you don't know the total number of columns in advance you could add code to this function to read just the first line of the file and count the number of columns returned ...
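For instance, a minimal sketch of that idea (the wrapper name is mine; it assumes the first data row parses the same way as the rest):

library(readr)

my_read_auto <- function(file, skip_cols = numeric(0), ...) {
  # peek at the first row only, just to count the columns
  tot_cols <- ncol(read_csv(file, n_max = 1))
  csr <- rep("?", tot_cols)
  csr[skip_cols] <- "_"
  read_csv(file, col_types = paste(csr, collapse = ""), ...)
}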
FWIW the skip argument might not do what you think it does: it skips rows rather than selecting/deselecting columns. As I read ?readr::read_csv(), there doesn't seem to be any convenient way to skip and/or include particular columns (by name or by index) except by some ad hoc mechanism such as the one suggested above; this might be worth a feature request/discussion on the readr issues list (e.g. add cols_include and/or cols_exclude arguments that could be specified by name or position?).
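In the same ad hoc spirit, individual columns can also be dropped by name through col_types with col_skip(); a sketch using hypothetical file and column names:

library(readr)

# guess every column type except the named one, which is dropped
df <- read_csv("data.csv",
               col_types = cols(.default = col_guess(),
                                unwanted_col = col_skip()))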

Parsing a NASDAQ .tip file

Problem: I have a .tip file from NASDAQ that I need to parse. Official name: GENIUM CONSOLIDATED FEED.
The file is a CSV-like file that uses semicolons as separators and a newline for each new entry; the entries have different structures, so there is no constant header. It does come with a corresponding XSD schema file which should describe the contents and structure, but I can see no clear way to get from the file to a structured result. I have tried a list setup where messageType becomes a name in a list:
x <- scan("cran_tasks/NOMX_Comm_Close2.tip", what="", sep="\n")
y <- strsplit(x, ';')
names(y) <- sapply(y, `[[`, 1)
y <- sapply(y, `[`, -1, simplify = FALSE)
y <- sapply(y, as.list)
The file is structured like this (each entry is terminated by a newline):
messageType;key1Value;key2Value;...;..;\n
messageType;key1Value;key2Value;.....;\n
BDSr;i2;NAmGITS;
BDx;i106;Si18;s2;SYmNC;NAmNASDAQ OMX Commodities;CNyNO;MIcNORX;
BDm;i672;Si018171;s2;Ex106;NAmFuel Oil;SYmNCFO;TOTa+0200;LDa20141011;
BDIs;i10142;SiNP;s2;ISsNP;NAmNord Pool ASA;
m;i122745;t191500.001;Dt20170509;ISOcY;ISOtY;
m;i122745;t192808.721;Dt20170509;ISOcN;ISOtY;SEp275.45;
Oi;i122745;t054425.600;OPi2840;
I had a working SQL code set to parse the file, but it proved too case-specific to be robust against even minor changes in structure, such as the order of the different key-value pairs. So I'm looking for a way to exploit the structure of the information to build a robust and maintainable solution, preferably in R. I have tried some regular-expression matching, but I still end up with a lot of context-specific code, so I hope that some structuring with a table or data frame containing the key information can make for a sustainable solution.
Any hints or suggestions are more than welcome.
Link to the XML/XSD file, the HTML sheet specifying the keys, and a .tip file.
TIP Message Format: The TIP protocol is a tagged text protocol. A TIP message is a sequence of tag and value pairs separated with semicolons. A tag is zero or more UPPERCASE characters followed by a lowercase character; the tag is followed immediately by the value. Examples of tags are "FLd", "STa". The first tag in a message is always the message type. The message type tag has no value; an example of a message type tag is "BDSh". TIP messages are encoded with UTF-8 unless stated otherwise. The maximum length of a TIP message is indicated with the constant MAX_MESSAGE_LENGTH (2048 bytes). Any max field length excludes any escape characters '\'. No empty values will be sent; exceptions are message type tags and Boolean tags (the presence of the tag itself corresponds to a 'true' value). For a decimal field (i.e. the Float data type) the length is given as X,Y where X is the max number of digits in the integer part of the field (left of the separator) and Y is the number of decimals (right of the separator). The order of the disseminated tags is not fixed, i.e. the client may not make any assumptions about the order of tags. The only fixed component of a message is the message type, which is always placed first in the message data. Note that new messages and fields may be added in future versions of the protocol. To ensure forward compatibility, clients should ignore unrecognized message types and field tags.
The data.table solution below parses the given .tip file and returns a data.table with tag and value pairs. So, this is probably a good starting point for further extracting the relevant data.
library(data.table)

# read downloaded file from local disk
tip_wide <- fread(
  "NOMX_Comm_Close2.tip",
  sep = "\n",
  header = FALSE
)

# split tip messages into tag and value pairs,
# thereby reshaping from wide to long format
# and adding a row number
tip_long <- tip_wide[, unlist(strsplit(V1, ";")),
                     by = .(rn = seq_len(nrow(tip_wide)))]

# get message type tag as the first entry of each message
msg_type <- tip_long[, .(msg.type = first(V1)), by = rn]

# make message type a separate column for each tag-value pair using a join,
# then remove the rows that hold only the message type itself
tip_result <- msg_type[tip_long, on = "rn"][msg.type != V1]

# split tag and value pairs
tip_result[, c("tag", "value") :=
             data.table(stringr::str_split_fixed(V1, "(?<=^[A-Z]{0,9}[a-z])", 2))]
tip_result
# rn msg.type V1 tag value
# 1: 1 BDSr i2 i 2
# 2: 1 BDSr NAmGITS NAm GITS
# 3: 2 BDx i106 i 106
# 4: 2 BDx Si18 Si 18
# 5: 2 BDx s2 s 2
# ---
#905132: 95622 BDCl s2 s 2
#905133: 95622 BDCl i2368992 i 2368992
#905134: 95622 BDCl Il2368596 Il 2368596
#905135: 95622 BDCl Op1 Op 1
#905136: 95622 BDCl Ra1 Ra 1
Note that the value column is of type character.
The regular expression "(?<=^[A-Z]{0,9}[a-z])" uses a look-behind assertion (see ?"stringi-search-regex") to define the split pattern. Note that {0,9} is used here instead of * because the look-behind pattern must not be unbounded (no * or + operators).
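If one row per message is more convenient for the further extraction, the long result can be cast to wide format. A sketch (my addition; it assumes each tag occurs at most once per message, otherwise dcast needs an aggregation function):

library(data.table)

# one row per message, one column per tag
tip_table <- dcast(tip_result, rn + msg.type ~ tag, value.var = "value")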

Read selected files from the directory based on selection criteria in R

I would like to read only selected .txt files in a folder to construct a giant table. I have over 9K files and would like to import only the files with the selected distance and building type, which are indicated in part of the file name.
For example, I first want to select files whose names contain "_U0" and "_0_Final.txt":
Type = c(0,1)
D3Test = 1
Distance = c(0,50,150,300,650,800)
D2Test = 1;
files <- list.files(path=data.folder, pattern=paste("*U", Type[D3Test],"*_",Distance[D2Test],"_Final.txt",sep=""))
But the result came back empty...
Is there any problem with my construction?
filename <- scan(what="")
"M10_F1_T1_D1_U0_H1_0_Final.txt" "M10_F1_T1_D1_U0_H1_150_Final.txt" "M10_F1_T1_D1_U0_H1_300_Final.txt"
"M10_F1_T1_D1_U0_H1_50_Final.txt" "M10_F1_T1_D1_U0_H1_650_Final.txt" "M10_F1_T1_D1_U0_H1_800_Final.txt"
"M10_F1_T1_D1_U0_H2_0_Final.txt" "M10_F1_T1_D1_U0_H2_150_Final.txt" "M10_F1_T1_D1_U0_H2_300_Final.txt"
"M10_F1_T1_D1_U0_H2_50_Final.txt" "M10_F1_T1_D1_U0_H2_650_Final.txt" "M10_F1_T1_D1_U0_H2_800_Final.txt"
"M10_F1_T1_D1_U0_H3_0_Final.txt" "M10_F1_T1_D1_U0_H3_150_Final.txt" "M10_F1_T1_D1_U0_H3_300_Final.txt"
"M10_F1_T1_D1_U0_H3_50_Final.txt" "M10_F1_T1_D1_U0_H3_650_Final.txt" "M10_F1_T1_D1_U0_H3_800_Final.txt"
"M10_F1_T1_D1_U1_H1_0_Final.txt" "M10_F1_T1_D1_U1_H1_150_Final.txt" "M10_F1_T1_D1_U1_H1_300_Final.txt"
"M10_F1_T1_D1_U1_H1_50_Final.txt" "M10_F1_T1_D1_U1_H1_650_Final.txt" "M10_F1_T1_D1_U1_H1_800_Final.txt"
Another way would be to use sprintf and grepl.
x <- c("M10_F1_T1_D1_U0_H1_150_Final.txt", "M10_F1_T1_D1_U0_H2_650_Final.txt", "M10_F1_T1_D1_U1_H1_650_Final.txt")
x[grepl(sprintf("U%i_H%i_%i", 1, 1, 650), x)]
[1] "M10_F1_T1_D1_U1_H1_650_Final.txt"
You should look at the result that you are passing to pattern:
"*U0*_0_Final.txt"
It is not going to pick up any of those filenames. The second asterisk says zero or more instances of "0" between "U" and the underscore. If Type and Distance are not represented by T and D in the file names, then this delivers the correct pattern:
grep(pattern = paste0("_U", Type[D3Test], ".*_", Distance[D2Test], "_Final\\.txt"), filename)
#-----------
# [1]  1  7 13    <- so it matches 3 filenames
Notice that you need to escape (with two backslashes) the periods that you want treated as literal periods, because the period is a special character. You also need ".*" to allow a gap in the pattern.
files <- list.files(path = data.folder, pattern = paste("*U", Type[D3Test], "....", Distance[D2Test], sep = ""))
I revised my code and this one works! Basically the idea is to use a dot to represent each character between Type[D3Test] and Distance[D2Test], since the characters between the two are fixed at four.
Thanks to:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
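Putting the pieces together, a sketch of the whole select-and-import step (my addition; it assumes the matched files share a common column layout with a header row):

# build the pattern from the chosen type and distance
pat <- sprintf("_U%d_.*_%d_Final\\.txt$", Type[D3Test], Distance[D2Test])
files <- list.files(path = data.folder, pattern = pat, full.names = TRUE)

# read every matched file and stack the results into one big table
big_table <- do.call(rbind, lapply(files, read.table, header = TRUE))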
