Received MethodError when cleaning String - julia

I have data in a .txt file that looks like this:
04:31 Yuri Kane feat Jeza – Love Comes (Original Mix) [PREMIER]
25:31 Heatbeat & Quilla – Secret (Original Mix) [ARMADA CAPTIVATING]
All of them have this pattern:
00:00 artist - title [studio]
I want to remove the time stamp and the studio, so the output looks like this:
1. Yuri Kane feat Jeza – Love Comes (Original Mix)
Here is what I tried:
function remove_time_from(str::String)
    return last(split(str,"0 "))
end
function remove_url(str::String)
    return first(rsplit(str,"["))
end
function main()
    tracks = String[]
    local number = 0
    for line in eachline("track-list.txt")
        number += 1
        removed_time = remove_time_from(line)
        cleaned = remove_url(removed_time)
        push!(tracks,"$number.$cleaned")
    end
    open("track-list-cleaned.txt", "w") do io
        for line in tracks
            write(io, "$line\n")
        end
    end
end
main()
but it returns:
MethodError: no method matching remove_url(::SubString{String})

When you call remove_time_from(), it returns last(split(...)), and the elements produced by split() are SubString{String} values, not String values:
track = "04:31 Yuri Kane feat Jeza – Love Comes (Original Mix) [PREMIER]"
println(typeof(remove_time_from(track))) # Output: SubString{String}
You have 2 ways to fix it:
Have both remove_time_from() and remove_url() convert the SubString to String before returning it. This way, no matter which function you use first, you'll get a String:
return convert(String,last(split(str,"0 ")))
Use AbstractString instead of String as the function parameter, because SubString is a subtype of AbstractString:
println(SubString <: AbstractString) # Output: true
This way, no matter which function you use first, it would accept a String (the variable type of line) or SubString (the type you end up with after using one of the functions).
Suggestions:
Using split(str,"0 ") won't remove the time stamp:
last(split("04:31 Yuri Kane feat Jeza – Love Comes (Original Mix) [PREMIER]", "0 "))
Output: 04:31 Yuri Kane feat Jeza – Love Comes (Original Mix) [PREMIER]
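What you need is chop(), which lets you specify how many characters to drop from the head and from the tail. Here head = 5 drops the time stamp (the leftover leading space can be removed with strip()), and tail = 0 is needed because chop() drops the last character by default.
chop(str, head = 5, tail = 0)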
You don't need to read in the lines, clean it, and then store it in a Vector to write later. You can clean it (do it in one line), and write it out to the file:
open("track-list-cleaned.txt", "w") do io
    for line in eachline("track-list.txt")
        number += 1
        cleaned = (remove_url(remove_time_from(line)))
        write(io, "$number.$cleaned\n")
    end
end
Use enumerate() to number the lines as you're reading them in:
for (number,line) in enumerate(eachline("track-list.txt"))
Code:
# Using the assignment form because each function has only one line.
# tail = 0 is needed because chop() drops the last character by default.
remove_time_from(str::AbstractString) = chop(str, head = 5, tail = 0)
# Split on "[" so that the trailing [studio] part is dropped.
remove_url(str::AbstractString) = first(rsplit(str, "["))
function main()
    open("track-list-cleaned.txt", "w") do io
        for (number, line) in enumerate(eachline("track-list.txt"))
            cleaned = strip(remove_url(remove_time_from(line)))
            write(io, "$number. $cleaned\n")
        end
    end
end
main()


Matching dict key to text file and returning Test Pass/fail

I'm a novice at Python, and am currently working on a small test case assignment where I am to find and match the dictionary keys to a small text file, and see if the keys are present in the text file.
As follows, the dictionary goes:
key_words = {"description, translation": "test_translation(serial,",
             "unit": "test_unit(",}
The text in text file, henceforth called "requirement.txt" as follows:
The description shall display the translation of XXX.
The unit shall be hidden.
The value is read from the file "version.txt".
For each key, I am to check whether it is present in the text or absent: a match should return a "test pass", and no match should return a skip.
The keys from the dictionary are to be sorted into a list, then iterated over and matched against the text. (The values from the dictionary are to be sorted into a separate list and iterated over a separate file, which I won't go into here.)
This is the code that I currently have (and stuck):
list = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        line_strings = line.split(' ')
        for word in list:
            if word in line:
                print("Test Pass")
                print(word)
                break
            else:
                print("Test Fail")
        print(line + "\n")
Result currently obtained:
Test Fail
Test Pass
display
The description shall display the translation of XXX.
Test Fail
Test Fail
Test Fail
Test Pass
unit
The unit shall be hidden.
Test Fail
Test Fail
Test Fail
Test Fail
The value is read from the file "version.txt".
With my current code, each line prints "Test Pass" and "Test Fail" several times, which suggests that every key is checked against each line and a result is printed for every check.
I am stuck on two fronts:
After separating the keys into a list, how do I order them in the sequence "description, translation", "unit"?
How do I modify the code so that the result is returned only once per line, as "Test Pass" or "Test Fail"?
Results should ideally return in the following format:
Ideal outcome:
('Text:', 'The description shall display the translation of XXX.')
('Key:', 'description, translation')
Test Pass
('Text:', 'The unit shall be hidden.')
('Key:', 'unit')
Test Pass
('Text:', 'The value is read from the file "version.txt".')
('Key:', (none))
Test Fail
For your kind enlightenment please, thank you!
Try with this:
list = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        # Create an empty list that will hold every key that matches
        words_found = []
        for word in list:
            # If the key matches, add it to the list words_found
            if word in line:
                words_found.append(word)
        print("('Text:', \"{}\")".format(line))
        print("('Keys:', \"{}\")".format(words_found))
        # If the list of keys found is not empty, the test passed
        if(words_found):
            print("Test Passed")
        else:
            print("Test Failed")
The idea is to build a list of the keys found in each line and then print them all.
I'm using the str.format() method; you can find a guide on how to use it here. The line if(words_found): checks whether the list is non-empty.
Additional Notes
In this case you won't need it, but if you wanted to solve only the second point, you could use the for ... else statement, as explained in the docs:
4.4 break and continue Statements, and else Clauses on Loops
Loop statements may have an else clause; it is executed when the loop terminates through exhaustion of the list (with for) or when the condition becomes false (with while), but not when the loop is terminated by a break statement.
If you reduce the indentation of the else by one level, it stops being the else of your if statement and becomes the else of the for statement, so it is executed only if the for loop never hit a break. That solves the problem.
list = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        line_strings = line.split(' ')
        for word in list:
            if word in line:
                print(word)
                print("Test Pass")
                break
        else:
            print("Test Fail")
        print(line + "\n")
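To see the for ... else behaviour in isolation, here is a minimal sketch that needs no input file. The helper name first_matching_key and the sample lines are assumptions made for illustration; they are not part of the original code.

```python
def first_matching_key(line, keys):
    """Return the first key found in line, or None when no key matches."""
    for key in keys:
        if key in line:
            result = key
            break  # leaving the loop through break skips the else clause
    else:
        # Runs only when the loop finished without hitting break.
        result = None
    return result


sample_lines = [
    "The unit shall be hidden.",
    'The value is read from the file "version.txt".',
]
for line in sample_lines:
    key = first_matching_key(line, ["description", "unit"])
    print("Test Pass" if key else "Test Fail", "-", line)
```

Each line produces exactly one verdict, because the else branch belongs to the loop as a whole rather than to each individual comparison.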
Edit
To split the key into description and translation, we just have to split it at the comma with the built-in split method. A key without a comma, such as "unit", yields a single term, so the loop checks every term instead of unpacking exactly two values:
list = sorted(key_words.keys(), key=lambda d: d[0])
with open('C:/Users-------/requirement.txt', 'r') as outfile:
    lines = outfile.readlines()
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        # Create an empty list that will hold every key that matches
        words_found = []
        for word in list:
            # "description, translation" -> ["description", "translation"]
            terms = [term.strip() for term in word.split(",")]
            # If every term of the key is in the line, add the key to words_found
            if all(term in line for term in terms):
                words_found.append(word)
        print("('Text:', \"{}\")".format(line))
        print("('Keys:', \"{}\")".format(words_found))
        # If the list of keys found is not empty, the test passed
        if(words_found):
            print("Test Passed")
        else:
            print("Test Failed")
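For anyone who wants to try the matching without a requirement.txt on disk, the following is a self-contained sketch under a few assumptions: the three requirement lines are inlined as a list, and the helper names match_keys and report are made up for this example. A key counts as found only when all of its comma-separated terms occur in the line.

```python
key_words = {
    "description, translation": "test_translation(serial,",
    "unit": "test_unit(",
}

# Stand-in for the contents of requirement.txt (assumed sample data).
requirement_lines = [
    "The description shall display the translation of XXX.",
    "The unit shall be hidden.",
    'The value is read from the file "version.txt".',
]


def match_keys(line, keys):
    """Return the keys whose comma-separated terms all occur in line."""
    found = []
    for key in keys:
        terms = [term.strip() for term in key.split(",")]
        if all(term in line for term in terms):
            found.append(key)
    return found


def report(lines, keys):
    """Print one verdict per non-empty line and return the collected results."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        found = match_keys(line, keys)
        verdict = "Test Pass" if found else "Test Fail"
        print(("Text:", line))
        print(("Key:", found))
        print(verdict)
        results.append((line, found, verdict))
    return results


results = report(requirement_lines, sorted(key_words))
```

Plain sorted() already puts "description, translation" before "unit", so no custom sort key is needed for these two keys.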

Can R read html-encoded emoji characters?

Question
My question, explained below, is:
How can R be used to read a string that includes HTML emoji codes like &#55358;&#56599;?
I'd like to:
(1) represent the emoji symbol (e.g., as a unicode symbol: 🤗) in the parsed string, OR (2) convert it into its text equivalent (":hugging face:")
Background
I have an XML dataset of text messages (from the Android/iOS app Signal) that I am reading into R for a text mining project. The data look like this, with each text message represented in an sms node:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!-- File Created By Signal -->
<smses count="1">
<sms protocol="0" address="+15555555555" contact_name="Jane Doe" date="1483256850399" readable_date="Sat, 31 Dec 2016 23:47:30 PST" type="1" subject="null" body="Hug emoji: &#55358;&#56599;" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" />
</smses>
Problem
I am currently reading the data using the xml2 package for R. When I use the xml2::read_xml function, however, I get the following error message:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
xmlParseCharRef: invalid xmlChar value 55358
Which, as I understand, indicates that the emoji character is not recognized as valid XML.
Using the xml2::read_html function does work, but drops the emoji character. A small example of this is here:
example_text <- "Hugging emoji: &#55358;&#56599;"
xml2::xml_text(xml2::read_html(paste0("<x>", example_text, "</x>")))
(Output: [1] "Hugging emoji: ")
This character is valid HTML -- Googling &#55358;&#56599; actually converts it in the search bar to the "hugging face" emoji, and brings up results relating to that emoji.
Other information I've found that seems relevant to this question
I've been searching Stack Overflow, and have not found any questions relating to this particular issue. I've also not been able to find a table that straightforwardly gives HTML codes next to the emoji they represent, and so am not able to do an (albeit inefficient) conversion of these HTML codes to their textual equivalents in a big loop before parsing the dataset; for example, neither this list nor its underlying dataset seem to include the string 55358.
tl;dr: the emoji aren't valid HTML entities; UTF-16 numbers have been used to build them instead of Unicode code points. I describe an algorithm at the bottom of the answer to convert them so that they are valid XML.
Identifying the Problem
R definitely handles emoji.
In fact, a few packages exist for handling emoji in R. For example, the emojifont and emo packages both let you retrieve emoji based on Slack-style keywords. It's just a question of getting your source characters through from the HTML-escaped format so that you can convert them.
xml2::read_xml seems to do fine with other HTML entities, like an ampersand or double quotes. I looked at this SO answer to see whether there were any XML-specific constraints on HTML entities, and it seemed like they were storing emoji fine. So I tried changing the emoji codes in your reprex to the ones in that answer:
body="Hug emoji: &#128512;&#128515;"
And, sure enough, they were preserved (though they're obviously not the hug emoji anymore):
> test8 = read_html('Desktop/test.xml')
> test8 %>% xml_child() %>% xml_child() %>% xml_child() %>% xml_attr('body')
[1] "Hug emoji: \U0001f600\U0001f603"
I looked up the hug emoji on this page, and the decimal HTML entity given there is not &#55358;&#56599;. It looks like the UTF-16 decimal codes for the emoji have been wrapped in &# and ;.
In conclusion, I think the answer is that your emoji are, in fact, not valid HTML entities. If you can't control the source, you might need to do some pre-processing to account for these errors.
So, why does the browser convert them properly? I'm wondering if the browser is a little more flexible with these things and is making some guesses about what those codes could be. I'm just speculating, though.
Converting UTF-16 to Unicode code points
After some more investigation, it looks like valid emoji HTML entities use the Unicode code point (in decimal, if it's &#...;, or hex, if it's &#x...;). The Unicode code point is different from the UTF-8 or UTF-16 code. (That link explains a lot about how emoji and other characters are variously encoded, BTW! Good read.)
So we need to convert the UTF-16 codes used in your source data to Unicode code points. Referring to this Wikipedia article on UTF-16, I've verified how it's done. For a code point above 0xFFFF (our target), you first subtract 0x10000, which leaves a 20-bit number, or five hex digits. When going from Unicode to UTF-16, you split that number into two 10-bit numbers (the middle hex digit gets cut in half, with two of its bits going to each block), then add 0xd800 to the first block and 0xdc00 to the second to get your result.
Going backwards, as you want to, it's done like this:
Your decimal UTF-16 number (which is in two separate blocks for now) is 55358 56599
Converting those blocks to hex (separately) gives 0x0d83e 0x0dd17
You subtract 0xd800 from the first block and 0xdc00 from the second to give 0x3e 0x117
Converting them to binary, padding them out to 10 bits and concatenating them, it's 0b0000 1111 1001 0001 0111
Then we convert that back to hex, which is 0x0f917
Finally, we add 0x10000, giving 0x1f917
Therefore, our (hex) HTML entity is &#x1f917;. Or, in decimal, &#129303;.
So, to preprocess this dataset, you'll need to extract the existing numbers, use the algorithm above, then put the result back in (with one &#...;, not two).
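The steps above can be sketched in a few lines of Python. This is an illustration rather than the R solution asked for: the names combine_surrogates and fix_entities are made up here, and the regular expression is only a coarse prefilter for decimal values in the surrogate range, backed by an exact range check.

```python
import re

# Coarse prefilter: a high surrogate (55296-56319) followed by a low
# surrogate (56320-57343), both written as decimal character references.
PAIR = re.compile(r"&#(5[56]\d{3});&#(5[67]\d{3});")


def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    # Shifting by 10 bits is the "pad to 10 bits and concatenate" step.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def fix_entities(text):
    """Replace each surrogate-pair entity with a single hex entity."""
    def replace(match):
        high, low = int(match.group(1)), int(match.group(2))
        try:
            return "&#x{:x};".format(combine_surrogates(high, low))
        except ValueError:
            return match.group(0)  # leave anything else untouched
    return PAIR.sub(replace, text)


print(fix_entities("Hug emoji: &#55358;&#56599;"))  # Hug emoji: &#x1f917;
```

Running fix_entities over the raw file contents before parsing should give the XML parser one valid character reference per emoji.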
Displaying emoji in R
As far as I'm aware, there's no solution to printing emoji in the R console: they always come out as "\U0001f600" (or what have you). However, the packages I described above can help you plot emoji in some circumstances (I'm hoping to expand ggflags to display arbitrary full-colour emoji at some point). They can also help you search for emoji to get their codes, but they can't get names given the codes AFAIK. But maybe you could try importing the emoji list from emojilib into R and doing a join with your data frame, if you've extracted the emoji codes into a column, to get the English names.
JavaScript Solution
I had this exact same problem, but needed the solution in JavaScript, not R. Using rensa's comment above (hugely helpful!), I created the following code to solve this issue, and I just wanted to share it in case anyone else happens across this thread as I did, but needed it in JavaScript.
str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
And, here's a full snippet of it working if you'd like to run it:
var str = '&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;'
str = str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
document.getElementById('result').innerHTML = str;
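// &#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;
// is turned into
// &#x1f60a;&#x1f618;&#x1f600;&#x1f606;&#x1f602;&#x1f601;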
// which is rendered by the browser as the emojis
Original:<br>😊😘😀😆😂😁<br><br>
Result:<br>
<div id='result'></div>
My SMS XML Parser application is working great now, but it stalls out on large XML files so, I'm thinking about rewriting it in PHP. If/when I do, I'll post that code as well.
I've implemented the algorithm described by rensa above in R, and am sharing it here. I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).
This is a quick and unpolished implementation of rensa's algorithm, but it works!
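# Assumes stringr (for str_match_all) and magrittr (for %>%) are attached.
library(stringr)
library(magrittr)
utf16_double_dec_code_to_utf8 <- function(utf16_decimal_code){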
string_elements <- str_match_all(utf16_decimal_code, "&#(.*?);")[[1]][,2]
string3a <- string_elements[1]
string3b <- string_elements[2]
string4a <- sprintf("0x0%x", as.numeric(string3a))
string4b <- sprintf("0x0%x", as.numeric(string3b))
string5a <- paste0(
# "0x",
as.hexmode(string4a) - 0xd800
)
string5b <- paste0(
# "0x",
as.hexmode(string4b) - 0xdc00
)
string6 <- paste0(
stringi::stri_pad(
paste0(BMS::hex2bin(string5a), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = ""),
stringi::stri_pad(
paste0(BMS::hex2bin(string5b), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = "")
)
string7 <- BMS::bin2hex(as.numeric(strsplit(string6, split = "")[[1]]))
string8 <- as.hexmode(string7) + 0x10000
unicode_pattern <- string8
unicode_pattern
}
make_unicode_entity <- function(x) {
paste0("\\U000", utf16_double_dec_code_to_utf8(x))
}
make_html_entity <- function(x) {
paste0("&#x", utf16_double_dec_code_to_utf8(x), ";")
}
# An example string, using the "hug" emoji:
example_string <- "test &#55358;&#56599; test"
output_string <- stringr::str_replace_all(
example_string,
"(&#[0-9]*?;){2}", # Find all two-character "&#...;&#...;" codes.
make_unicode_entity
# make_html_entity
)
cat(output_string)
# To print the Unicode string (it doesn't display in the R console, but can be
# copied and pasted elsewhere):
# (This assumes you've used 'make_unicode_entity' above in the str_replace_all
# call):
stringi::stri_unescape_unicode(output_string)
Translated Chad's JavaScript answer to Go since I too had the same issue, but needed a solution in Go.
https://play.golang.org/p/h9JBFzqcd90
package main
import (
"fmt"
"html"
"regexp"
"strconv"
"strings"
)
func main() {
emoji := "&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;"
regexp := regexp.MustCompile(`(&#\d+;){2}`)
matches := regexp.FindAllString(emoji, -1)
var builder strings.Builder
for _, match := range matches {
s := strings.Replace(match, "&#", "", -1)
parts := strings.Split(s, ";")
a := parts[0]
b := parts[1]
c, err := strconv.Atoi(a)
if err != nil {
panic(err)
}
d, err := strconv.Atoi(b)
if err != nil {
panic(err)
}
c = c - 0xd800
d = d - 0xdc00
e := strconv.FormatInt(int64(c), 2)
f := strconv.FormatInt(int64(d), 2)
// Left-pad each half to 10 bits, whatever its current length.
g := "0000000000"[len(e):] + e
h := "0000000000"[len(f):] + f
j, err := strconv.ParseInt(g + h, 2, 64)
if err != nil {
panic(err)
}
k := j + 0x10000
_, err = builder.WriteString("&#x" + strconv.FormatInt(k, 16) + ";")
if err != nil {
panic(err)
}
}
fmt.Println(html.UnescapeString(emoji))
emoji = html.UnescapeString(builder.String())
fmt.Println(emoji)
}

Concat 2 strings erlang and send with http

I'm trying to concat 2 variables Address and Payload. After that I want to send them with http to a server but I have 2 problems. When i try to concat the 2 variables with a delimiter ';' it doesn't work. Also sending the data of Payload or Address doesn't work. This is my code:
handle_rx(Gateway, #link{devaddr=DevAddr}=Link, #rxdata{port=Port, data= RxData }, RxQ)->
Data = base64:encode(RxData),
Devaddr = base64:encode(DevAddr),
TextAddr="Device address: ",
TextPayload="Payload: ",
Address = string:concat(TextAddr, Devaddr),
Payload = string:concat(TextPayload, Data),
Json=string:join([Address,Payload], "; "),
file:write_file("/tmp/foo.txt", io_lib:fwrite("~s.\n", [Json] )),
inets:start(),
ssl:start(),
httpc:request(post, {"http://192.168.0.121/apiv1/lorapacket/rx", [], "application/x-www-form-urlencoded", Address },[],[]),
ok;
handle_rx(_Gateway, _Link, RxData, _RxQ) ->
{error, {unexpected_data, RxData}}.
I have no errors that I can show you. When I write Address or Payload individually to the file it works but sending doesn't work...
Thank you for your help!
When i try to concat the 2 variables with a delimiter ';' it doesn't work.
5> string:join(["hello", <<"world">>], ";").
[104,101,108,108,111,59|<<"world">>]
6> string:join(["hello", "world"], ";").
"hello;world"
base64:encode() returns a binary, yet string:join() requires string arguments. You can do this:
7> string:join(["hello", binary_to_list(<<"world">>)], ";").
"hello;world"
Response to comment:
In erlang the string "abc" is equivalent to the list [97,98,99]. However, the binary syntax <<"abc">> is not equivalent to <<[97,98,99]>>, rather the binary syntax <<"abc">> is special short hand notation for the binary <<97, 98, 99>>.
Therefore, if you write:
Address = [97,98,99].
then the code:
Bin = <<Address>>.
after variable substitution becomes:
Bin = <<[97,98,99]>>.
and that isn't legal binary syntax.
If you need to convert a string/list contained in a variable, like Address, to a binary, you use list_to_binary(Address)--not <<Address>>.
In your code here:
Json = string:join([binary_to_list(<<Address>>),
binary_to_list(<<Payload>>)],
";").
Address and Payload were previously assigned the return value of string:concat(), which returns a string, so there is no reason to (attempt) to convert Address to a binary with <<Address>>, then immediately convert the binary back to a string with binary_to_list(). Instead, you would just write:
Json = string:join([Address, Payload], ";")
The problem with your original code is that you called string:concat() with a string as the first argument and a binary as the second argument--yet string:concat() takes two string arguments. You can use binary_to_list() to convert a binary to the string that you need for the second argument.
Sorry I'm new to Erlang
As with any language, you have to study the basics and write numerous toy examples before you can start writing code that actually does something.
You don't have to concatenate strings at all. Erlang's I/O functions accept an iolist, a possibly nested list of strings and binaries, and it is one of the best things in Erlang:
1> RxData = "Hello World!", DevAddr = "Earth",
1> Data = base64:encode(RxData), Devaddr = base64:encode(DevAddr),
1> TextAddr="Device address", TextPayload="Payload",
1> Json=["{'", TextAddr, "': '", Devaddr, "', '", TextPayload, "': '", Data, "'}"].
["{'","Device address","': '",<<"RWFydGg=">>,"', '",
"Payload","': '",<<"SGVsbG8gV29ybGQh">>,"'}"]
2> file:write_file("/tmp/foo.txt", Json).
ok
3> file:read_file("/tmp/foo.txt").
{ok,<<"{'Device address': 'RWFydGg=', 'Payload': 'SGVsbG8gV29ybGQh'}">>}

Test for exact string in testthat

I'd like to test that one of my functions gives a particular message (or warning, or error).
good <- function() message("Hello")
bad <- function() message("Hello!!!!!")
I'd like the first expectation to succeed and the second to fail.
library(testthat)
expect_message(good(), "Hello", fixed=TRUE)
expect_message(bad(), "Hello", fixed=TRUE)
Unfortunately, both of them pass at the moment.
For clarification: this is meant to be a minimal example, rather than the exact messages I'm testing against. If possible I'd like to avoid adding complexity (and probably errors) to my test scripts by needing to come up with an appropriate regex for every new message I want to test.
You can use ^ and $ anchors to indicate that the string must begin and end with your pattern.
expect_message(good(), "^Hello\\n$")
expect_message(bad(), "^Hello\\n$")
#Error: bad() does not match '^Hello\n$'. Actual value: "Hello!!!!!\n"
The \\n is needed to match the new line that message adds.
For warnings it's a little simpler, since there's no newline:
expect_warning(warning("Hello"), "^Hello$")
For errors it's a little harder:
good_stop <- function() stop("Hello")
expect_error(good_stop(), "^Error in good_stop\\(\\) : Hello\n$")
Note that any regex metacharacters, i.e. . \ | ( ) [ { ^ $ * + ?, will need to be escaped.
Alternatively, borrowing from Mr. Flick's answer here, you could convert the message into a string and then use expect_true, expect_identical, etc.
messageToText <- function(expr) {
con <- textConnection("messages", "w")
sink(con, type="message")
eval(expr)
sink(NULL, type="message")
close(con)
messages
}
expect_identical(messageToText(good()), "Hello")
expect_identical(messageToText(bad()), "Hello")
#Error: messageToText(bad()) is not identical to "Hello". Differences: 1 string mismatch
Your regex matches "Hello" in both cases, so it doesn't return an error. You'll need to set up word boundaries \\b on both sides. That would suffice if there were no punctuation or spaces here. In order to exclude those too, you'll need to add [^\\s ^\\w]
library(testthat)
expect_message(good(), "\\b^Hello[^\\s ^\\w]\\b")
expect_message(bad(), "\\b^Hello[^\\s ^\\w]\\b")
## Error: bad() does not match '\b^Hello[^\s ^\w]\b'. Actual value: "Hello!!!!!\n"

scanString end location: why it is end_index+1?

python/pyparsing
When I use the scanString method, it gives the start and end location of each matched token in the text.
e.g.
from pyparsing import Word, alphas
line = "cat bat"
pat = Word(alphas)
for i in pat.scanString(line):
    print(i)
I get the following:
((['cat'], {}), 0, 3)
((['bat'], {}), 4, 7)
But the end location of cat should be "2", right? Why is it reporting the next location as the end location?
This is consistent with Python's [begin:end] slicing conventions, where the "end" is the index of the next character. By putting the end as the next location, it is very straightforward to extract the matching substring using the returned values:
for t,start,end in pat.scanString(line):
    print(line[start:end])
You can see how this is used if you look in the pyparsing source code for the implementation of transformString.
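The same half-open convention can be checked without pyparsing installed. The sketch below uses the standard library's re module as a stand-in, since its match objects report start and end the same way; the pattern [A-Za-z]+ is an assumed equivalent of Word(alphas) for this ASCII input.

```python
import re

line = "cat bat"

# Collect (token, start, end) for every run of letters in the line.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"[A-Za-z]+", line)]
print(spans)  # [('cat', 0, 3), ('bat', 4, 7)]

for token, start, end in spans:
    # The end index is exclusive, so slicing needs no +1 adjustment ...
    assert line[start:end] == token
    # ... and the token length is simply end - start.
    assert end - start == len(token)
```

With an inclusive end of 2, every slice would have to be written line[start:end + 1], which is exactly the off-by-one bookkeeping the convention avoids.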
