What is the main difference between Base64 encoding (base64_encode) and hashing (SHA1, MD5, ...)?
Base64 is decodable, but the others seem not to be. Is that their main difference?
Yes, that is the main difference: Base64 is decodable, while SHA1 and MD5 are not.
irb(main):001:0> source = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
=> "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
irb(main):002:0> require "base64"
=> true
irb(main):003:0> encoded = Base64.encode64(source)
=> "TG9yZW0gaXBzdW0gZG9sb3Igc2l0IGFtZXQsIGNvbnNlY3RldHVyIGFkaXBp\nc2NpbmcgZWxpdC4=\n"
irb(main):004:0> Base64.decode64(encoded)
=> "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
The other difference is the length of the output. The length of a Base64-encoded string varies, because it contains the original data. The lengths of SHA1 and MD5 hashes, however, are fixed (20 bytes for SHA1 and 16 bytes for MD5).
irb(main):001:0> source = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
=> "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
irb(main):002:0> require "digest"
=> true
irb(main):003:0> Digest::SHA1.hexdigest(source)
=> "e7505beb754bed863e3885f73e3bb6866bdd7f8c"
irb(main):004:0> Digest::MD5.hexdigest(source)
=> "35899082e51edf667f14477ac000cbba"
Base64 encoding and hashing (SHA1 etc.) are different concepts.
They both transform data into another format.
Encoding is reversible, hashing is not.
Encoding transforms data using a public algorithm, so it can easily be reversed.
Hashing produces a fixed-size digest that can be used to verify the integrity of the data, but it cannot be reversed.
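The same distinction can be sketched in R, assuming the base64enc and digest packages are available:
library(base64enc)  # Base64: a reversible encoding
library(digest)     # SHA1 / MD5: one-way digests

source_text <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

# Encoding is reversible: decoding returns the original data.
encoded <- base64encode(charToRaw(source_text))
identical(rawToChar(base64decode(encoded)), source_text)  # TRUE

# Hashing is one-way and fixed-length: 20 bytes (40 hex chars) for SHA1,
# 16 bytes (32 hex chars) for MD5, regardless of the input size.
digest(source_text, algo = "sha1", serialize = FALSE)
digest(source_text, algo = "md5", serialize = FALSE)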
... and then there is Encryption : )
Hope that helps
The abstract is often the longest section of the YAML front matter in a papaja Rmd, and I was wondering whether it is possible to move it into a separate document (e.g. another Rmd file) and include it by reference instead (just as other chapters can be).
Here are two options: Text-references are built into papaja, but are a little more limited than using an external Lua-filter.
Text-references
You can use bookdown text-references for this. This way you can move the abstract into the body of the document.
---
title: "Title"
abstract: "(ref:abstract)"
output: papaja::apa6_pdf
---
(ref:abstract) Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Similarly, you could use an inline code chunk for the text reference to put the abstract into a separate document.
(ref:abstract) `r readLines("path/to/abstract.Rmd")`
A limitation of this approach is that the paragraph must not be wrapped across multiple lines and must not end with white space.
Lua-filters
A more flexible alternative is this Lua-filter, which takes the abstract from a section of the document body.
---
title: "Title"
output:
  papaja::apa6_pdf:
    pandoc_args: ["--lua-filter", "path/to/abstract-to-meta.lua"]
---
# Abstract
The abstract text includes this.
* * * *
This text is the beginning of the document.
Here, the horizontal rule * * * * marks the end of the abstract. Again, you could use a code chunk to include an external file.
# Abstract
```{r}
#| child: "path/to/abstract.md"
```
* * * *
I have a text file (my.txt) with the following contents that I wish to process in R.
Lorem ipsum tag:[value_0], dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua tag:[value_01, value_02, value_03].
Ut enim ad minim veniam, tag:[value_04, value_05, value_06, value_07] quis nostrud exercitation, tag:[value_08, value_09, value_10].
I wish to process strings inside tags (tag:[ * ]).
Values inside the tags are comma-separated and consist of alphanumeric characters and punctuation (except commas and brackets).
The number of values inside a tag is variable (one or more).
I wish to replace the commas with ]+[.
The outcome I wish to have is as follows:
Lorem ipsum tag:[value_0], dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua tag:[value_01]+[value_02]+[value_03].
Ut enim ad minim veniam, tag:[value_04]+[value_05]+[value_06]+[value_07] quis nostrud exercitation, tag:[value_08]+[value_09]+[value_10].
All I have been able to figure out is how to capture the contents of the tags.
gsub(
pattern = paste0(
"tag:\\[([^]]*)\\]"
),
replacement = "\\1",
x = readLines("my.txt")
)
I cannot simply find and replace commas since there are commas outside the tags.
Is there a way to process \\1 further to replace commas with ]+[?
Is there a way to achieve my goal using base R?
Thanks very much.
You can do this with the stringr package using nested replaces: first find the tags, then, for each tag, replace the commas. str_replace_all allows you to pass a function for the transformation rather than a string.
input <- c(
"Lorem ipsum tag:[value_0], dolor sit amet",
"consectetur adipiscing elit",
"sed do eiusmod tempor incididunt ut labore et dolore magna aliqua tag:[value_01, value_02, value_03].",
"Ut enim ad minim veniam, tag:[value_04, value_05, value_06, value_07] quis nostrud exercitation, tag:[value_08, value_09, value_10]."
)
stringr::str_replace_all(input, "tag:\\[[^\\]]*\\]", function(x) {
stringr::str_replace_all(x, ", ", "]+[")
})
which returns
[1] "Lorem ipsum tag:[value_0], dolor sit amet"
[2] "consectetur adipiscing elit"
[3] "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua tag:[value_01]+[value_02]+[value_03]."
[4] "Ut enim ad minim veniam, tag:[value_04]+[value_05]+[value_06]+[value_07] quis nostrud exercitation, tag:[value_08]+[value_09]+[value_10]."
Here are some solutions.
In the question, a comma within square brackets is always followed by a space, and I have assumed that this is the general case. If a comma within square brackets can be followed by a non-space, remove the space after the comma in the pattern in each solution.
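The solutions below assume that the lines of my.txt from the question have been read into a character vector Lines:
Lines <- readLines("my.txt")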
1) gsubfn This one-liner uses gsubfn, which finds the matches to the pattern given in the first argument, passes each match to the function (which may be specified as a formula) in the second argument, and replaces the match with the function's output.
Here it matches tag:[ followed by a string up to the next ], and uses gsub to perform the required replacement within that match.
library(gsubfn)
gsubfn("tag:\\[.*?\\]", ~ gsub(", ", "]+[", x), Lines)
2) gsub It can be done in a single gsub, although note the caveat below. It looks for a comma followed by a space, followed by any number of non-square-bracket characters, followed by a right square bracket. If a left square bracket comes first, or no right square bracket is encountered, it won't match. Everything except the comma and space is within a zero-width lookahead: the lookahead is not regarded as part of the match, so only the comma and space are replaced, and the lookahead text continues to be scanned for further comma-space sequences.
(Unfortunately, lookbehind does not support repetition, so we can't use the same idea to check for a preceding tag:[. This solution is therefore not completely safe, although the checks it does perform seem sufficient for the example input in the question, and perhaps for your actual input as well.)
This only uses base R.
gsub(", (?=[^][]*\\])", "]+[", Lines, perl = TRUE)
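Applied to the Lines read from my.txt above, this yields exactly the requested result:
gsub(", (?=[^][]*\\])", "]+[", Lines, perl = TRUE)
# [1] "Lorem ipsum tag:[value_0], dolor sit amet,"
# [2] "consectetur adipiscing elit,"
# [3] "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua tag:[value_01]+[value_02]+[value_03]."
# [4] "Ut enim ad minim veniam, tag:[value_04]+[value_05]+[value_06]+[value_07] quis nostrud exercitation, tag:[value_08]+[value_09]+[value_10]."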
2a) This variation of (2) is longer, but it does check for tag:[ and still uses only base R. It assumes that there are no curly braces in the input; if there are, use some other characters that do not appear in the input, e.g. < and >. First it replaces each tag:[...] with {...}, then it performs the substitution as in (2) but using the braces, and finally it converts back.
Lines2 <- gsub("tag:\\[(.*?)\\]", "{\\1}", Lines)
Lines3 <- gsub(", (?=[^][{}]*})", "]+[", Lines2, perl = TRUE)
gsub("\\{(.*?)\\}", "tag:[\\1]", Lines3)
This may be a difficult or overly complex question to answer; however, I pose it in hopes of speeding up a query that applies criteria to two sets of documents linked in a hierarchy. This is in XQuery 3.1 (under eXist-db 4.7).
The database contains TEI XML documents divided into two categories, header and exemplum, where each header is the "master" document for a subset of exemplum documents. For example, the header document contains the author of all the exemplum documents under it. There are about 100 headers, and each header has anywhere between 50 and 300 exemplum documents under it.
The query receives criteria from an HTML form that allows users to search the exemplum documents using criteria from both header and exemplum. I indicate the structure of the documents, with the applicable fields, below.
Example of header:
<bibl xml:id="TC0001" type="header" subtype="published">
<title type="long">some long title</title>
<title type="short">some short title</title>
<author nymRef="stephanus_de_borbone"/>
<affiliation corresp="dominican"/>
<date notBefore="1267" notAfter="1290">
...
</bibl>
Example of exemplum (linked to header through @corresp):
<TEI xml:id="TE003679" corresp="TC0001" type="exemplum" subtype="published">
<text>
<front>
<div type="source-text">
<p>Nulla a mauris urna. Suspendisse urna felis, suscipit consectetur aliquam ac, fermentum sit amet eros. Morbi semper, nisl ac tincidunt laoreet, nulla dui interdum magna, quis rhoncus arcu lectus sit amet ex. Nulla ut malesuada augue, vel hendrerit quam. </p>
</div>
<div type="allegory" n="y"/>
<div type="keywords">
<list>
<item type="keyword" corresp="KW0003"/>
<item type="keyword" corresp="KW0078"/>
<item type="keyword" corresp="KW0537"/>
<item type="keyword" corresp="KW1972"/>
</list>
</div>
</front>
<body>
<p xml:lang="fr">As main soit tu elle. Fenetres jet feu quarante galopent but. Souvenirs corbeille chambrees vif demeurons gaillards oui. Son les noircir eau murmure entiere abattit puisque lettres. Cime la soir ai arcs sons. Remarquent petitement ah on diplomates cathedrale. </p>
<p xml:lang="it">Nervi cigli di farai oblio buone le ti veste. Fanciullo lavorando ha ho melagrani osservava rivederci si strappato da. Punge tardi verra al in passa ed te. Comprendi ch po distrutta statuario. Col ascoltami rammarico oltremare ama. Forse sta bel campo andro sapro. Salvata su seconda divieto ritrovi ai. </p>
<p xml:lang="en">Can curiosity may end shameless explained. True high on said mr on come. An do mr design at little myself wholly entire though. Attended of on stronger or mr pleasure. Rich four like real yet west get. Felicity in dwelling to drawings. His pleasure new steepest for reserved formerly disposed jennings. </p>
</body>
</text>
</TEI>
The request comes in the form of parameters that are scrubbed and set into sequences of strings to apply in the query. A user is unlikely to submit all of these, but I list them all here to illustrate the possible combinations of parameters acting on the header and the exemplum at different stages of the query process:
let $paramHeader := ("TC0003", "TC0019")
let $paramAuthor := ("stephanus_de_borbone", "johannes_gobi")
let $paramAffil := "dominican"
let $paramBegDate := "1245"
let $paramEndDate := "1300"
let $paramAlleg := "y"
let $paramKeyword := ("KW0002", "KW0034")
let $paramTerms := "sta*"
All of the elements/attributes affected by parameters have been indexed in eXist. The first step is to apply the relevant parameters to the header:
let $headers :=
    for $h in $mydb//bibl[$paramHeader = ("", @xml:id) and @subtype = "published"]
    where $h/author[$paramAuthor = ("", @nymRef)] and
          $h/affiliation[$paramAffil = ("", @corresp)] and
          $h/date[(@notBefore lt $paramBegDate and @notAfter gt $paramBegDate) or
                  (@notBefore lt $paramEndDate and @notAfter gt $paramEndDate)]
    return $h
From this result I can extract the header @xml:id values to apply as criteria to the exemplum:
let $headids := distinct-values($headers/@xml:id)
These are then submitted to a query which uses eXist's ft:query to perform a Lucene-based full-text search (I apply it only to p elements):
let $query :=
    <query>{
        for $t in $paramTerms
        return <wildcard>{normalize-space(lower-case($t))}</wildcard>
    }</query>
(: apply full text query :)
let $luchits := $mydb//TEI//p[ft:query(.,$query)]
(: then filter those hits with criteria from exemplum parameters, using /ancestor :)
return
for $luchit in $luchits
where $luchit/ancestor::TEI[@corresp = ($headids) and @subtype = "published"] and
      $luchit/ancestor::text/front/div[@type = "allegory" and $paramAlleg = ("", @n)] and
      $luchit/ancestor::text//item[@type = "keyword" and $paramKeyword = ("", @corresp)]
return $luchit
The Lucene query in eXist is very fast, and for that reason I apply ft:query first and only then apply the where clauses to its results. Doing this proved much faster than applying Lucene at the very end.
Depending on the criteria, the query takes 1-12 seconds to run. I'd like to see if I can shave down the upper end of that range by improving basic query technique.
Many thanks in advance.
Background:
I have an XML document with the following structure:
<records>
<record id="512" size="1">
<user id="8412" origin="ab"/>
<category id="105">Certificates</category>
<rating>80</rating>
<text>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</text>
</record>
<record id="452" size="2">
<user id="7623" origin="bb"/>
<category id="105">Certificates</category>
<rating>70</rating>
<text>
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</text>
</record>
</records>
What I'm trying to do:
Using R, I'm trying to convert this XML information into a dataframe, where each row represents a single record, and each column represents either attribute or text data for that record (with the goal of including all the data that exists in the XML document).
This is what the final output should look like:
Record ID | Size | User ID | ... | Text
----------|------|---------|-----|----------------
      452 |    2 |    7623 | ... | Lorem ipsum ...
Also, since there are around 1,000,000 records, and the file containing them is ~500MB, I'm trying to find a relatively efficient way to do this.
What I've tried so far:
I've looked at a number of related questions on the topic, but none of them offered a solution that applied in this case.
First, I tried the xmlToDataFrame function in the XML package with the following code, but it only extracts the text data, not the attributes:
library(XML)
doc = xmlParse("My_document.xml")
xmldf = xmlToDataFrame(doc, nodes = "//record")
xmldf = xmlToDataFrame(nodes = getNodeSet(doc, "//record"))
The same happens when I try to use the flatxml package, despite the fact that during the initial import of the XML document it does extract the relevant attribute data:
library(flatxml)
doc = fxml_importXMLFlat("My_document.xml")
xmldf = fxml_toDataFrame(doc, siblings.of = 2)
I also tried a slightly different approach using the xml2 package:
library(xml2)
doc <- read_xml('My_document.xml')
rows <- xml_children(doc)
data.frame(
Record_ID = as.numeric(xml_attr(rows,"id")),
Size = as.numeric(xml_attr(rows,"size")),
User_ID = as.numeric(xml_attr(rows,"id")),
Origin = as.character(xml_attr(rows,"origin")),
Category = as.character(xml_text(rows,"category")),
Category_ID = as.numeric(xml_attr(rows,"id")),
Rating = as.numeric(xml_text(rows,"rating")),
Text = as.character(xml_text(rows,"text"))
) -> xmldf
Here I had a different set of issues: I'm able to extract attribute data, but only from the record node. This means it copies the record's id into User_ID, and it cannot access data such as the origin attribute. In addition, every attempt to extract text pulls the text of all child nodes simultaneously.
Consider binding attributes with the internal method, xmlAttrsToDataFrame, and elements with xmlToDataFrame, assuming only one set of user and sibling tags per record.
library(XML)
...
# BIND ATTRIBUTES AND ELEMENTS
record_df <- cbind(XML:::xmlAttrsToDataFrame(getNodeSet(doc, path='//record')),
XML:::xmlAttrsToDataFrame(getNodeSet(doc, path='//user')),
xmlToDataFrame(doc, nodes = getNodeSet(doc, "//record"))
)
# RENAME COLUMNS
record_df <- setNames(record_df, c("record_id", "record_size", "user_id", "user_origin",
"record_user", "record_category", "record_rating", "record_text"))
record_df
# record_id record_size user_id user_origin record_user record_category record_rating record_text
# 1 512 1 8412 ab Certificates 80 \nLorem ipsum dolor ...
# 2 452 2 7623 bb Certificates 70 \nUt enim ad minim ...
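As a side note, the xml2 attempt from the question can also be made to work by querying each child element relative to the record nodes, which avoids the id collisions described above. A rough sketch (untested at the ~500MB scale):
library(xml2)
doc  <- read_xml("My_document.xml")
recs <- xml_find_all(doc, "//record")

# xml_find_first() is vectorised over the node set, returning the first
# matching child per record, so the user/category ids no longer collide
# with the record id.
xmldf <- data.frame(
  Record_ID   = as.numeric(xml_attr(recs, "id")),
  Size        = as.numeric(xml_attr(recs, "size")),
  User_ID     = as.numeric(xml_attr(xml_find_first(recs, "./user"), "id")),
  Origin      = xml_attr(xml_find_first(recs, "./user"), "origin"),
  Category    = xml_text(xml_find_first(recs, "./category")),
  Category_ID = as.numeric(xml_attr(xml_find_first(recs, "./category"), "id")),
  Rating      = as.numeric(xml_text(xml_find_first(recs, "./rating"))),
  Text        = trimws(xml_text(xml_find_first(recs, "./text")))
)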
I want to replace unwanted strings using a Unix sed regular expression.
Input string
echo ',"wanted1":"value1","unwanted";"unwanted";"wanted2":"value2",'
Required string
"wanted1":"value1","wanted2":"value2"
Try this: it first removes all commas, then the quoted "unwanted" fields, then the semicolons, and finally re-inserts a comma between the adjacent quoted pairs (so it assumes the wanted values themselves contain no commas or semicolons).
Script
echo ',"wanted1":"value1","unwanted";"unwanted";"wanted2":"value2",' | sed 's/,//g; s/"unwanted"//g; s/;//g; s/""/","/g'
Output
"wanted1":"value1","wanted2":"value2"