I have found some text in this form:
=B0=A1=C1=CB,=C4=E3=D2=B2=C3=BB=C1=AA=CF=B5=CE=D2,=D7=EE=BD=FC=CA=C7=B2=BB=CA=C7=
=BA=DC=C3=A6=B0=A1
containing mostly sequences consisting of an equal sign followed by two hexadecimal digits.
I am told it could be converted into this Chinese sentence:
啊了你也没联系我最近是不是很忙啊
What is this =B0=A1=C1 notation, and how can I decode/convert it?
The Chinese sentence has been encoded into an 8-bit Guobiao encoding (GB2312, GBK or GB18030; most likely the latter, though it apparently decodes correctly as the former too), and then further encoded into the 7-bit MIME quoted-printable encoding.
To decode it into a Unicode string, first undo the quoted-printable encoding, then decode the Guobiao encoding. Here’s an example using Python:
import quopri

# First undo the quoted-printable layer, then decode the resulting GB18030 bytes.
print(quopri.decodestring("""\
=B0=A1=C1=CB,=C4=E3=D2=B2=C3=BB=C1=AA=CF=B5=CE=D2,=D7=EE=BD=FC=CA=C7=B2=BB=CA=C7=
=BA=DC=C3=A6=B0=A1\
""").decode('gb18030'))
This outputs 啊了,你也没联系我,最近是不是很忙啊 on my terminal.
The quoted-printable encoding is usually found in e-mail messages; whether it is actually in use should be determined from message headers. A message encoded in this manner should carry the header Content-Transfer-Encoding: quoted-printable. The text encoding (gb18030 in this case) should be specified in the charset parameter of the Content-Type header, but sometimes can be determined by other means.
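For a complete message, Python's email package can undo both layers automatically, since it reads the Content-Transfer-Encoding header and the charset parameter itself. A minimal sketch, assuming the encoded text arrived as the body of a raw message (the headers below are made up for illustration):

from email import message_from_bytes
from email.policy import default

# Hypothetical raw message carrying the quoted-printable, GB18030-encoded body.
raw = (b"MIME-Version: 1.0\r\n"
       b"Content-Type: text/plain; charset=gb18030\r\n"
       b"Content-Transfer-Encoding: quoted-printable\r\n"
       b"\r\n"
       b"=B0=A1=C1=CB,=C4=E3=D2=B2=C3=BB=C1=AA=CF=B5=CE=D2,=D7=EE=BD=FC=CA=C7=B2=BB=CA=C7=\r\n"
       b"=BA=DC=C3=A6=B0=A1")

msg = message_from_bytes(raw, policy=default)
# get_content() undoes the quoted-printable encoding and decodes the declared charset.
print(msg.get_content())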
I have a JSON file with Tweet data, containing fields such as text, published date, author, ID, etc.
I used the parseTweets function from streamR, but when I view the resulting data frame, the text has not been decoded correctly.
tweets <- parseTweets("C:/Users/...file.json", simplify = FALSE, verbose = TRUE, legacy = FALSE)
View(tweets)
This is what is shown in the "text" column of the parsed object
think you’re continuing the conversation
It should say: think you're continuing the conversation
I did some searching and this seems to be an encoding issue, but I can't seem to figure it out.
Would I need to run parseTweets first and then fix the text column afterwards? Or is there a wrapper method so that the text is parsed correctly the first time I read in the JSON?
Any help is appreciated, thank you!
Here is an example JSON snippet pulled from my larger file
{"created_at":"Sun Jun 10 00:01:12 +0000 2018","id":100565760896,"id_str":"1005600896","text":"think you’re continuing the conversation","source":"Twitter for iPhone","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":403340,"id_str":"40311840","name":"Dvo","screen_name":"ImBorau","location":"Florida, USA","url":"http://Instagram.com/ ","description":"ucf | I your sarcastic quips","translator_type":"none","protected":false,"verified":false,"followers_count":43,"friends_count":166,"listed_count":0,"favourites_count":839,"statuses_count":1460,"created_at":"Wed Nov 02 01:41:45 +0000 2011","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":null,"contributors_enabled":false,"is_translator":false,"profile_background_color":"9AE4E8","profile_background_image_url":"http://abs.twimg.com/images/themes/theme16/bg.gif","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme16/bg.gif","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"BDDCAD","profile_sidebar_fill_color":"DDFFCC","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http://pbs.twimg.com/profile_images/10014987138688/RYbZNdVR_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/100149871633688/RYbNdVR_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/40318340/107757914","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null,"updated":["description","name"]},"geo":null,"coordinates":null,"place":{"id":"4ec0163497","url":"https://api.twitter.com/1.1/geo/id/4ec1c9db497.json","place_type":"admin","name":"Florida","full_name":"Florida, USA","country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-87.634643,24.396308],[-87.634643,31.001056],[-79.974307,31.001056],[-79.974307,24.396308]]]},"attributes":{}},"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1528588108","matching_rules":[{"tag":null,"id":484862573421,"id_str":"48486970421"}]}
I'm using a MITM technique to study some apps' APIs, but I'm not able to restore the original data from this multipart gzip request.
Does anyone know how I can recover the content of this package?
POST /logging_client_events HTTP/1.1
Accept-Language: pt-BR, en-US
Content-Type: multipart/form-data; boundary=3TtLStKljJgtMAosyN-hY6JtpuUqhC
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 1129
--3TtLStKljJgtMAosyN-hY6JtpuUqhC
Content-Disposition: form-data; name="access_token"
567067343352427|f249176f09e26ce54212b472dbab8fa8
--3TtLStKljJgtMAosyN-hY6JtpuUqhC
Content-Disposition: form-data; name="format"
json
--3TtLStKljJgtMAosyN-hY6JtpuUqhC
Content-Disposition: form-data; name="cmsg"; filename="ae3ada0b-866d-4b0c-b0af-e0c66df71808_5_regular.batch.gz"
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
eRÛ®0üòG6¾GÊUhm/9Ö!#0Ð¥ù÷Ú¤Q¢VH\fvf׳ܪ×ê(÷cCu¬¤ÒTi.8µ¨uõ V2Ç(=é«m¦Ü»ÐôË¥ m¸FCç88A¥8ÊÖÄñÄ+¡Zë°6³¤Kì¾w¥ôSJ#DíqÜK"æ¡uTfeÂâÐ?4PGò$G=qZÔg ÕÌP5ËVLóÿ¾Ç.Mx^:2Ö
çfþ1¾ØÏ
®ùþ7ÖPf5²b2ôm<Ê$]ëê?Ñ¥-£kúíOye8BÀê:HDQsgPÑúZÝNL*¥eÚî®ëie»t³ÜRç©â¨u
['̹{QÎ`êøq«z¸ássðs\sýÓ
].ãÆSEùAð²³±ý¹`Îl_á¯yÊ~·j;ý3§UfJ&Û³yؾ\÷ÕøõoLv Wæã4B#óÁÏØFÒ}ù+rí°Ûv¥fïP*Xîh´BÉwêÿÞï?î
======================UPDATE===============
I uploaded three sample packages in this format, so anyone who knows how to solve the problem can try:
https://gofile.io/?c=fNakzX
The content you uploaded contains a lot of literal question marks (ASCII '\x3f') in all three files. I am pretty sure these stand in for every byte of the original data that wasn't a printable character. In replacing those bytes with question marks, the information was lost completely.
Your question at least contains a version that is not peppered with question marks, but since it is a plain-text rendering of binary data, I am also pretty sure that some (relevant) bytes are missing and/or that some of the characters cannot be transformed back to the original binary correctly.
If you do not have any other version of your input, I'm afraid your task cannot be accomplished, sorry.
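For reference, if the capture had preserved the raw bytes of the cmsg part (for example by saving that part from the proxy as a binary file instead of copying it as text), getting at the content would be straightforward, since the part is an ordinary gzip member and the request's format field suggests JSON inside. A minimal sketch, using the filename from the Content-Disposition header above:

import gzip

# Hypothetical: the cmsg part exported byte-for-byte from the proxy.
with open("ae3ada0b-866d-4b0c-b0af-e0c66df71808_5_regular.batch.gz", "rb") as f:
    compressed = f.read()

# Decompress the gzip member; the "format" field in the same request says json.
data = gzip.decompress(compressed)
print(data.decode("utf-8", errors="replace"))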
According to the HTTP specs, a header field can look like this:
Header-Name: value1,value2,value3,...
I try to parse the header values and store them as an array:
array('value1', 'value2', 'value3')
So far so good: I can just split the string wherever a comma appears.
BUT how should I handle headers like this one:
Expires: Thu, 01 Dec 1994 16:00:00 GMT
There's a comma, but it is part of the single value this header has. "Oh, that's easy," I thought, and figured out a rule: only split at a comma when there is no space before or after it. That way both examples parse correctly.
BUT then I came across a header like this:
Accept-Encoding: gzip, deflate
And now? Is this one value, array('gzip, deflate'), or two values, array('gzip', 'deflate')? To me these are two separate values, but then my rule from above no longer holds.
Is there a list of which headers are allowed to appear more than once? Then I could check against it to determine whether a comma is a value delimiter or not.
Comma concatenation can occur for any header field, even those that aren't designed for it; it's how libraries and intermediaries happen to work.
It is designed to be used for header fields that use list syntax (RFC 7230 has all the details).
Finally, you can't use generic code to tokenize, because the way the comma can occur inside values varies from field to field.
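A minimal sketch of why the splitting decision has to be made per field: naive comma splitting is correct for a list-syntax field like Accept-Encoding but mangles a field like Expires, whose single value legitimately contains a comma. The set of list-syntax field names below is illustrative, not complete, and elements that may contain quoted commas would still need a real parser.

def split_list_field(value):
    # Split a list-syntax value such as "gzip, deflate" into its elements.
    return [item.strip() for item in value.split(",") if item.strip()]

print(split_list_field("gzip, deflate"))
# ['gzip', 'deflate']  -- correct: Accept-Encoding uses list syntax

print(split_list_field("Thu, 01 Dec 1994 16:00:00 GMT"))
# ['Thu', '01 Dec 1994 16:00:00 GMT']  -- wrong: Expires holds a single date

# The decision must come from knowledge of the field, not from the comma itself.
LIST_SYNTAX_FIELDS = {"accept", "accept-encoding", "accept-language",
                      "cache-control", "connection", "vary", "via"}

def parse_field(name, value):
    if name.lower() in LIST_SYNTAX_FIELDS:
        return split_list_field(value)
    return [value]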
I found this string several times on the Internet, and I wonder what it means, and where it comes from:
3i2ndDfv2rTHiSisAbouNdArYfORhtTPEefj3q2f
It's often used as the boundary definition in the HTTP Content-Type header:
Content-Type: multipart/form-data; boundary=--3i2ndDfv2rTHiSisAbouNdArYfORhtTPEefj3q2f
http://www.w3.org/Protocols/rfc1341/7_2_Multipart.html
RFC 1341, section 7.2: The Multipart Content-Type
The body must then contain one or more "body parts," each preceded by an encapsulation boundary, and the last one followed by a closing boundary.
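Incidentally, the mixed-case middle of the string spells out "THiS is A bouNdArY fOR htTP", so it is simply an arbitrary delimiter chosen to be unlikely to occur in the payload. To make the quoted paragraph concrete, here is a minimal sketch of what a multipart/form-data body using that boundary looks like on the wire (the form fields are made up):

BOUNDARY = "3i2ndDfv2rTHiSisAbouNdArYfORhtTPEefj3q2f"

# Each body part is preceded by an encapsulation boundary ("--" + boundary),
# and the last part is followed by the closing boundary ("--" + boundary + "--").
body = (
    f"--{BOUNDARY}\r\n"
    'Content-Disposition: form-data; name="username"\r\n'
    "\r\n"
    "alice\r\n"
    f"--{BOUNDARY}\r\n"
    'Content-Disposition: form-data; name="comment"\r\n'
    "\r\n"
    "hello, world\r\n"
    f"--{BOUNDARY}--\r\n"
)

print(f"Content-Type: multipart/form-data; boundary={BOUNDARY}\r\n")
print(body)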