Problem: I have a .tip file from NASDAQ that I need to parse. Official name: GENIUM CONSOLIDATED FEED
The file is a CSV-like file with semicolons as separators and a newline for each new entry; the entries have different structures, so there is no constant header. It does have a corresponding XSD schema file which should describe the contents and structure, but I can see no clear way to go from the file to a structured result. I have tried a list setup where the message type becomes a name in a list:
# read one line per message, then split each message on ";"
x <- scan("cran_tasks/NOMX_Comm_Close2.tip", what = "", sep = "\n")
y <- strsplit(x, ";")
# the first field is the message type; use it as the element name
names(y) <- sapply(y, `[[`, 1)
# drop the message type from each element, keep the key values as lists
y <- sapply(y, `[`, -1, simplify = FALSE)
y <- sapply(y, as.list)
The file is structured like this:
messageType;key1Value;key2Value;...;..;\n
messageType;key1Value;key2Value;.....;\n
BDSr;i2;NAmGITS;
BDx;i106;Si18;s2;SYmNC;NAmNASDAQ OMX Commodities;CNyNO;MIcNORX;
BDm;i672;Si018171;s2;Ex106;NAmFuel Oil;SYmNCFO;TOTa+0200;LDa20141011;
BDIs;i10142;SiNP;s2;ISsNP;NAmNord Pool ASA;
m;i122745;t191500.001;Dt20170509;ISOcY;ISOtY;
m;i122745;t192808.721;Dt20170509;ISOcN;ISOtY;SEp275.45;
Oi;i122745;t054425.600;OPi2840;
I have had a working SQL code set to parse the file, but it has proven too case-specific to be robust against even minor changes in structure, such as the order of the different key-value pairs. So I'm looking for a way to exploit the structure of the information to build a robust and maintainable solution, preferably in R. I have tried some regular-expression matching, but I still end up with a lot of context-specific code, so I hope that some structuring with a table or data frame containing the key information can make for a sustainable solution.
Any hints or suggestions are more than welcome.
link to the XML/XSD file and the html sheet specifying keys, and a .tip file
TIP Message Format The TIP protocol is a tagged text protocol. A
TIP message is a sequence of tag and value pairs separated with
semicolon. A tag is zero or more UPPERCASE characters followed by a
lowercase character. The tag is followed immediately by the value.
Examples of tags are "FLd", "STa". The first tag in a message is
always the message type. The message type tag has no value. An example
of a message type tag is "BDSh". TIP messages are encoded with UTF-8
unless stated otherwise. The maximum length of a TIP message is
indicated with the constant MAX_MESSAGE_LENGTH (2048 bytes). Any
max field length excludes any escape characters '\'. No empty values
will be sent; exceptions are message type tags and Boolean tags (the
presence of the tag itself corresponds to a 'true' value). For a
decimal field (i.e. the Float data type) the length is given as X,Y
where X is the max number of digits in the integer part of the field
(left of the separator). Y is the number of decimals (right of the
separator). The order of the disseminated tags is not fixed, i.e.
the client may not make any assumptions of the order of tags. The only
fixed component of a message is the message type, which is always
placed first in the message data. Note that new messages and fields
may be added in future versions of the protocol. To ensure forward
compatibility, clients should ignore unrecognized message types and
field tags.
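Read literally, that description already yields a parser for a single message. Here is a minimal sketch (the helper name parse_tip_message is made up; the regular expression simply encodes the tag rule quoted above):

parse_tip_message <- function(msg) {
  # split the message on ";"; the first field is the message type
  fields <- strsplit(msg, ";", fixed = TRUE)[[1]]
  # a tag is zero or more uppercase letters followed by one lowercase letter;
  # everything after the tag is the value
  tags   <- sub("^([A-Z]*[a-z]).*$", "\\1", fields[-1])
  values <- sub("^[A-Z]*[a-z]", "", fields[-1])
  list(type = fields[1], fields = setNames(values, tags))
}

parse_tip_message("BDSr;i2;NAmGITS;")
# $type
# [1] "BDSr"
#
# $fields
#      i    NAm
#    "2" "GITS"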
The data.table solution below parses the given .tip file and returns a data.table with tag and value pairs. So, this is probably a good starting point for further extracting the relevant data.
library(data.table)
# read downloaded file from local disk
tip_wide <- fread(
"NOMX_Comm_Close2.tip"
, sep = "\n"
, header = FALSE
)
# split tip messages into tag and value pairs
# thereby reshaping from wide to long format
# and adding a row number
tip_long <- tip_wide[, unlist(strsplit(V1, ";")),
by = .(rn = seq_len(nrow(tip_wide)))]
# get message type tag as the first entry of each message
msg_type <- tip_long[, .(msg.type = first(V1)), by = rn]
# make message type a separate column for each tag-value-pair using join
# remove unnecessary rows
tip_result <- msg_type[tip_long, on = "rn"][msg.type != V1]
# split tag and value pairs
tip_result[, c("tag", "value") :=
data.table(stringr::str_split_fixed(V1, "(?<=^[A-Z]{0,9}[a-z])", 2))]
tip_result
# rn msg.type V1 tag value
# 1: 1 BDSr i2 i 2
# 2: 1 BDSr NAmGITS NAm GITS
# 3: 2 BDx i106 i 106
# 4: 2 BDx Si18 Si 18
# 5: 2 BDx s2 s 2
# ---
#905132: 95622 BDCl s2 s 2
#905133: 95622 BDCl i2368992 i 2368992
#905134: 95622 BDCl Il2368596 Il 2368596
#905135: 95622 BDCl Op1 Op 1
#905136: 95622 BDCl Ra1 Ra 1
Note that the value column is of type character.
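If you then want one row per message with one column per tag, a possible next step (a sketch, not part of the answer above; type.convert is used to turn the character values into appropriate types) is:

# reshape e.g. all "m" messages to wide format, one column per tag
trades <- dcast(tip_result[msg.type == "m"], rn ~ tag, value.var = "value")
# convert the character columns to appropriate types where possible
trades[, names(trades) := lapply(.SD, type.convert, as.is = TRUE)]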
The regular expression "(?<=^[A-Z]{0,9}[a-z])" uses a look-behind assertion (see ?"stringi-search-regex") to define the split pattern. Note that {0,9} is used here instead of * because the look-behind pattern must not be unbounded (no * or + operators).
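A quick check of the split on a few sample fields from the file:

stringr::str_split_fixed(c("i106", "NAmGITS", "TOTa+0200"),
                         "(?<=^[A-Z]{0,9}[a-z])", 2)
#      [,1]   [,2]
# [1,] "i"    "106"
# [2,] "NAm"  "GITS"
# [3,] "TOTa" "+0200"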
Related
I took word pairs from a text file and made a dictionary:
x = open('sustantivos.txt', 'r') ## opens file and assigns it to a variable
y = x.read() ## reads open file object and assigns it to variable y
y = str(y).lower().replace(":", "") ## turns open file object into a string, then makes it lower case and replaces ":" with whitespace
z = y.splitlines() # make a list with each element being a word pair string, then assign to variable z
bank = {}
for pair in z: #go through every word pair string
    (key, value) = pair.split() #split the word pair string making a list with two elements, assign these to variable key and value
    bank[key] = value #add key value pair
x.close()
For reference this is an excerpt from the text file:
Amour: amor
Anglais: inglés
Argent: dinero
Bateau: barco
My question is: Is there a more efficient or different approach that you would take? Also, I was curious whether my understanding, as included in the comments, is correct. Thanks in advance.
Your inline notes are accurate except that line number 2 is where the opened file is read and its contents are turned into a string (also, replace(":", "") replaces each colon with an empty string, not whitespace). Your use of str(y) in the third line is unnecessary and could simply be written as y.lower()...
Your parsing strategy is sound as long as you know that the file will always contain a key:value pair on each and every line. However, there are a couple of recommendations I would make:
Use a with statement when opening files. This avoids errors that can occur if the file isn't closed properly.
Don't read the whole file in at once.
dict.update will accept an iterable of key/value pairs (each of length 2) as an argument
Using those tips your code can be rewritten as:
bank = {}
with open('sustantivos.txt', 'r') as x:
    for line in x:
        key, value = line.strip().split(':')
        bank[key] = value.strip()  # strip() drops the space left after the colon
        # bank.update([line.strip().split(':')]) <- or this
Please see the column name "if" in the second column. The difference is: when check.names = FALSE, the "." beside "if" disappears.
Sorry for the lack of code; I tried to type some code to generate the data.frame shown in the picture, but I failed because of the "if". We know that "if" is a reserved word in R (like else, for, while, function), and here I deliberately used "if" as the column name (the 2nd column) to see whether R would generate something novel.
So, going another way, I typed "if" in Excel and saved it in CSV format in order to use read.csv.
The question is:
Why does "if." change to "if" after I use check.names = FALSE?
?read.csv describes check.names= in a similar fashion:
check.names: logical. If 'TRUE' then the names of the variables in the
data frame are checked to ensure that they are syntactically
valid variable names. If necessary they are adjusted (by
'make.names') so that they are, and also to ensure that there
are no duplicates.
The default action is to allow you to do something like dat$<column-name>, but unfortunately dat$if will fail with Error: unexpected 'if' in "dat$if"; ergo check.names=TRUE changes it to something that the parser will not trip over. Note, though, that dat[["if"]] will work even when dat$if will not.
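A small demonstration of both behaviors, using inline text in place of your CSV file:

dat <- read.csv(text = "a,if\n1,2")                      # check.names = TRUE is the default
names(dat)                                               # "a" "if."  (make.names() appended ".")
dat <- read.csv(text = "a,if\n1,2", check.names = FALSE)
names(dat)                                               # "a" "if"
dat[["if"]]                                              # 2 -- works; dat$if is a parse error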
If you are wondering if check.names=FALSE is ever a bad thing, then imagine this:
dat <- read.csv(text = "a,a\n2,3")
dat
# a a.1
# 1 2 3
dat <- read.csv(text = "a,a\n2,3", check.names = FALSE)
dat
# a a
# 1 2 3
In the second case, how does one access the second column by name? dat$a returns 2 only (the first column wins). However, if you don't want to use $ or [[ and can instead rely on positional indexing for columns, then dat[,colnames(dat) == "a"] does return both of them.
I have the following backtick on my list's names. Prior lists did not have this backtick.
$`1KG_1_14106394`
[1] "PRDM2"
$`1KG_20_16729654`
[1] "OTOR"
I found out that this is an 'ASCII grave accent' and read the R help page on encoding types. However, what should I do about it? I am not clear whether this will affect some functions (such as matching on list names) or whether it is OK to leave it as is.
Encoding help page: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
Thanks!
My understanding (and I could be wrong) is that the backticks are just a means of escaping a list name which could not otherwise be used. One example of using backticks to refer to a list name is the case of a name containing spaces:
lst <- list(1, 2, 3)
names(lst) <- c("one", "after one", "two")
If you wanted to refer to the list element containing the number two, you could do this using:
lst[["after one"]]
But if you want to use the dollar sign notation you will need to use backticks:
lst$`after one`
Update:
I just poked around on SO and found this post, which discusses a similar question to yours. Backticks in variable names are necessary whenever a variable name would otherwise be forbidden. Spaces are one example, but so is using a reserved keyword as a variable name.
if <- 3 # forbidden because if is a keyword
`if` <- 3 # allowed, because we use backticks
In your case:
Your list has an element whose name begins with a number. The rules for variable names in R are pretty lax, but names cannot begin with a number, hence:
1KG_1_14106394 <- 3 # fails, variable name starts with a number
KG_1_14106394 <- 3 # allowed, starts with a letter
`1KG_1_14106394` <- 3 # also allowed, since escaped in backticks
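To address your concern about matching on list names: the backticks are not part of the stored name; they are only needed when you write the name as a bare symbol. A quick check with the names from your list:

lst <- list(`1KG_1_14106394` = "PRDM2", `1KG_20_16729654` = "OTOR")
names(lst)                          # "1KG_1_14106394" "1KG_20_16729654" -- no backticks stored
lst[["1KG_1_14106394"]]             # "PRDM2" -- string lookup needs no backticks
"1KG_20_16729654" %in% names(lst)   # TRUE   -- matching on names is unaffected
lst$`1KG_1_14106394`                # backticks only needed with $ notation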
I have a text file to read in R (and store in a data.frame). The file is organized in several rows and columns. Both "sep" and "eol" are customized.
Problem: the custom eol, i.e. "\t&nd" (without quotation marks), can't be set in read.table(...) (or read.csv(...), read.csv2(...), ...) nor in fread(...), and I haven't been able to find a solution.
I have searched here ("[r] read eol" and other queries I don't remember) and found no solution: the only one offered was to preprocess the file, changing the eol (not possible in my case, because some fields can contain things like \n, \r, \n\r, ", ... which is the reason for the customization).
Thanks!
You could approach this two different ways:
A. If the file is not too wide, you can read your desired rows using scan, split them into your desired columns with strsplit, then combine into a data.frame. Example:
# Provide reproducible example of the file ("raw.txt" here) you are starting with
your_text <- "a~b~c!1~2~meh!4~5~wow"
write(your_text,"raw.txt"); rm(your_text)
eol_str = "!" # whatever character(s) the rows divide on
sep_str = "~" # whatever character(s) the columns divide on
# read and parse the text file
# scan gives you an array of row strings (one string per row)
# sapply strsplit gives you a list of row arrays (as many elements per row as columns)
row_list <- sapply(scan("raw.txt", what = character(), sep = eol_str),
                   strsplit, split = sep_str)
df <- data.frame(do.call(rbind,row_list[2:length(row_list)]))
row.names(df) <- NULL
names(df) <- row_list[[1]]
df
# a b c
# 1 1 2 meh
# 2 4 5 wow
B. If A doesn't work, I agree with @BondedDust that you probably need an external utility -- but you can invoke it in R with system() and do a find/replace to reformat your file for read.table. The invocation will be specific to your OS. Example: https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands . Since you note that you already have \n and \r\n in your text, I recommend that you first find and replace them with temporary placeholders -- perhaps quoted versions of themselves -- and then convert them back after you have built your data.frame.
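For illustration, here is a hedged sketch of that find/replace idea done entirely in R (assuming the file fits in memory; the separator, placeholder strings, and header assumption are made up for the example):

# protect embedded line breaks, turn the custom eol into a real one,
# then hand the result to read.table
raw <- readChar("raw.txt", file.info("raw.txt")$size, useBytes = TRUE)
raw <- gsub("\r\n", "<CRLF>", raw, fixed = TRUE)   # protect embedded \r\n
raw <- gsub("\n",   "<LF>",   raw, fixed = TRUE)   # protect embedded \n
raw <- gsub("\t&nd", "\n",    raw, fixed = TRUE)   # custom eol -> real newline
df <- read.table(text = raw, sep = "~", header = TRUE, stringsAsFactors = FALSE)
# restore the placeholders inside the fields afterwards
df[] <- lapply(df, function(col)
  gsub("<LF>", "\n", gsub("<CRLF>", "\r\n", col, fixed = TRUE), fixed = TRUE))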
Is it possible to add or retain one or more leading zeros to a number without the result being converted to character? Every solution I have found for adding leading zeros returns a character string, including: paste, formatC, format, and sprintf.
For example, can x be 0123 or 00123, etc., instead of 123 and still be numeric?
x <- 0123
EDIT
It is not essential. I was just playing around with the following code, and the last two lines gave the wrong answer. I just thought that if I could have leading zeros in numeric format, obtaining the correct answer would be easier.
a7 = c(1,1,1,0); b7=c(0,1,1,1); # 4
a77 = '1110' ; b77='0111' ; # 4
a777 = 1110 ; b777=0111 ; # 4
length(b7[(b7 %in% intersect(a7,b7))])
R - count matches between characters of one string and another, no replacement
keyword <- unlist(strsplit(a77, ''))
text <- unlist(strsplit(b77, ''))
sum(!is.na(pmatch(keyword, text)))
ab7 <- read.fwf(file = textConnection(as.character(rbind(a777, b777))), widths = c(1,1,1,1), colClasses = rep("character", 2))
length(ab7[2,][(ab7[2,] %in% intersect(ab7[1,],ab7[2,]))])
You are not thinking correctly about what a "number" is. Programming languages store an internal representation which retains full precision to the machine limit. You are apparently concerned with what gets printed to your screen or console. By definition, those number characters are string elements, which is to say, a couple of bytes are processed by the ASCII decoder (or equivalent) to determine what to draw on the screen. What x "is," to draw happily on Presidential Testimony, depends on your definition of what "is" is.
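You can see this directly: the leading zero exists only in the typed source text, never in the stored value.

x <- 0123
identical(x, 123)   # TRUE -- the parser has already discarded the leading zero
x                   # 123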
You could always create your own class of objects that has one slot for the value of the number (but if it is stored as numeric then what we see as 123 will actually be stored as a binary value, something like 01111011, though probably with more leading 0's) and another slot or attribute for either the number of leading 0's or the number of significant digits. Then you can write methods for what to do with the number (and what effect that will have on the leading 0's, significant digits, etc.).
The print method could then make sure to print it with the leading zeros while keeping the internal value as a number.
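For example, a minimal sketch of that idea using an S3 class (the class name and attribute are made up for illustration):

# store the value as plain numeric, carry the printed width as an attribute
as_padded <- function(x, width) structure(x, width = width, class = "padded")

print.padded <- function(x, ...) {
  cat(formatC(unclass(x), width = attr(x, "width"), flag = "0"), "\n", sep = "")
  invisible(x)
}

x <- as_padded(123, width = 5)
x            # prints 00123
unclass(x)   # still the plain number 123 underneath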
But this seems a bit of overkill in most cases (though I know that some fields make a big deal about indicating the number of significant digits, so that leading 0's could be important). It may be simpler to use the conversion-to-character methods that you already know about, but do the printing in a way that does not obviously look like a number; see the cat and print functions for the options.