Add \\ to escape non-UTF 8 characters within a string using regex - r
I am working with a dataset that has free text containing special characters. I need to clean the text before using strsplit in a subsequent function, but I would prefer to add escapes (\\) before the special characters rather than delete them altogether.
For example, the string that looks like this:
Do you love great hair? Wind it up! Your curls are your gift- set
them free and help preserve your natural curl with bounce and
definition. Cleanse hair without weighing it down while reducing
frizz. Infused with pineapple, argan oil and quinoa. Let your natural
beauty shine bright!
Should look like this:
Do you love great hair\\? Wind it up\\! Your curls are your gift\\- set
them free and help preserve your natural curl with bounce and
definition. Cleanse hair without weighing it down while reducing
frizz. Infused with pineapple, argan oil and quinoa. Let your natural
beauty shine bright\\!
I've figured out how to remove a list of several special characters (~!##$%^&*(){}|<>/), but can't find the right tutorial for adding \\ before them.
Note: I am not looking to remove ALL punctuation because some characters are used for subsequent delimiting logic. Instead, I want to address a specific subset of special characters.
Sample data:
>dput(tar$clean.text[1:10])
list(c("Dove go fresh Cucumber and Green Tea Beauty Bar combines the refreshing scent of cucumber and green tea with Dove's gentle cleansers and _ moisturizing cream. Dove Beauty Bar is proven to be more gentle and mild on skin than ordinary soap. It can be used on your hands and as a mild facial cleanser, so if you're also after a fresh face and refreshed hands throughout the day, why not try adding Dove Beauty Bar go fresh Cucumber and Green Tea to your skin care routine? Light, hydrating feel and refreshing formula that effectively nourishes skin. A refreshing shower can be just what you need to start the day off right. Dove's go fresh range blends nourishing ingredients and light, fresh scents in a formula that's gentle on your skin. Dove go fresh beauty bars give you a feeling of hydrating freshness that leaves you and your skin feeling blissfully revived. For best results: Your hands are one of the driest parts of your body so give them a boost and lather your Dove beauty bar between wet hands. Once you've covered your body with the rich lather, making sure to avoid contact with your eyes, rinse away thoroughly. At Dove, our vision is of a world where beauty is a source of confidence, and not anxiety. So, we are on a mission to help the next generation of women develop a positive relationship with the way they look - helping them raise their self-esteem and realize their full potential.",
"Scent: Cucumber", "Health Facts: Sulfate-free", "Suggested Age: 5 Years and Up",
"Wellness Standard: Aluminum-free, paraben-free", "Recommended Skin Type: Normal",
"Beauty Purpose: Moisturizing, basic cleansing", "Package Quantity: 1",
"TCIN: 10819409", "UPC: 011111611023", "Item Number (DPCI): 049-00-0604"
), c("Me! Bath Bath Bomb Papaya Nectar 6 ct is a great idea to add to a spa gift basket. These bath bombs are like scoops for your bath to make mini bath ice cream that gives you super soft skin.",
"Scent: Fruit", "Health Facts: Vegan, paraben-free, aluminum-free",
"Product Form: Bath bomb", "Suggested Age: Adult Use Only", "Wellness Standard: Aluminum-free, cruelty-free, paraben-free, vegan",
"Recommended Skin Type: Normal", "Sustainability Claims: Cruelty-free",
"TCIN: 18828570", "UPC: 858858000358", "Item Number (DPCI): 037-08-1164"
), NA_character_, NA_character_, c("Aura Cacia pure essential oils in 4 fl oz Body Oil has a lavender and cocoa butter scent. This natural skin care oil shows skin tone improvement that you can feel.",
"Scent: Lavender, Cocoa Butter", "Health Facts: Contains lavender, butylparaben-free, phthalate-free, formaldehyde donor-free, formaldehyde-free, nonylphenol ethoxylate free, propylparaben-free, Sulfate-free, paraben-free, dye-free, aluminum-free",
"Product Form: Lotion", "Suggested Age: All Ages", "Recommended Skin Type: Normal",
"Beauty Purpose: Skin tone improvement", "Sustainability Claims: Not tested on animals, cruelty-free",
"TCIN: 50030689", "UPC: 051381911720", "Item Number (DPCI): 037-05-1378"
), c("Deep clean pores with the Facial Cleansing Brush from Eco",
"Tools. This compact brush features soft bristles for moderate exfoliation, leaving you with soft, supple skin. Your serums and moisturizers can more effectively penetrate your skin once all the dead skin cells are out of the way. The compact size is ideal for packing in your weekend tote or suitcase for cleansing on the go.",
"Material: Nylon", "Suggested Age: All Ages", "Beauty Purpose: Basic cleansing, exfoliating",
"TCIN: 52537254", "UPC: 079625074864", "Item Number (DPCI): 037-08-2254"
), c("Deep Steep Rosemary Mint Sugar Scrub gently exfoliates dead skin cells while moisturizing, leaving smooth, radiant, polished skin. This formula is made up of a smooth blend of shea butter, cocoa butter and carefully sourced sugar to give you light, blissful fragrance with just the right amount of exfoliation and no harsh scratching. Apply desired amount of Deep Steep Rosemary Mint Sugar Scrub to wet skin from shoulders to ankles. Massage in a circular motion. Rinse.",
"Scent: Rosemary", "Health Facts: Contains argan oil, contains coconut oil, contains shea butter, formaldehyde donor-free, gluten-free, dye-free, ethyl alcohol-free, paraben-free, phthalate-free, vegan",
"Product Form: Scrub", "Suggested Age: All Ages", "Recommended Skin Type: Dry, normal",
"Beauty Purpose: Exfoliating", "TCIN: 53242409", "UPC: 674749101153",
"Item Number (DPCI): 037-08-2123"), NA_character_, c("Want to feel gorgeously soft skin every day? Transform your daily shower into an irresistible treat with the exquisitely fragranced Caress Evenly Gorgeous body wash. Indulge your skin with a rich exfoliating lather delicately scented with burnt brown sugar and karite butter that makes this body wash smell good enough to eat. Subtle notes of soft crisp apple and berry open up to a bold floral heart, while rich scents of warm tonka bean, vanilla and balsam together round out the lush lather to leave you with perfectly buffed and glowing skin. Caress Evenly Gorgeous is a revitalizing body wash that blends rich, luxurious lather with expertly crafted fine fragrance It is a body wash that gently cleanses your skin to leave it delicately fragrant, beautifully soft.",
"Lather up and indulge in a deeply cleansing and reviving shower experience. With fine floral fragrance and gentle exfoliates, Caress Evenly Gorgeous will leave you feeling delicately perfumed and silky-smooth, making this the perfect body wash for every day? and every night. Caress body wash and beauty bar fragrances are crafted by the world's best perfumers to transform your daily shower into an indulging experience that will make you feel special every day?Scent: Fresh",
"Health Facts: Aluminum-free, paraben-free, fluoride-free", "Product Form: Liquid",
"Suggested Age: 5 Years and Up", "Wellness Standard: Aluminum-free, paraben-free",
"Recommended Skin Type: Normal", "Beauty Purpose: Basic cleansing",
"Package Quantity: 1", "TCIN: 13446229", "UPC: 011111014909",
"Item Number (DPCI): 049-00-0806"), c("Maintain a sanitary and healthy atmosphere with the MEDLINE n/a READYBATH, PREMIUM,FRAG FREE, 8/PK - 24pks. These sterile swab sticks are pre-treated with povidone-iodine for preparing skin for incision and other medical issues. Comes in disposable packages of 3.",
"Scent: Unscented", "Health Facts: No fragrance added", "Suggested Age: Adult Use Only",
"Recommended Skin Type: Normal", "Beauty Purpose: Basic cleansing",
"Package Quantity: 1", "TCIN: 14339945", "UPC: 080196731445",
"Item Number (DPCI): 037-13-0198"))`
Code that removes a list of symbols:
tar$clean.text<-str_replace_all(tar$clean.text, "~|!|#|#|$|%|^|&|\\*|\\(|\\)|\\{|\\}|_|\\\\|<|>|\\?|\\[|\\]|-", "") # Removes a ton of non-UTF characters
I'm sure there is a simple modification to my regexp, but can't seem to figure it out. All previous answers I've found are more specific to fixing a specific text pattern, rather than generally replacing across a lot of different variations.
You may use
str_replace_all(x, "[~!##$%^&*(){}_\\\\<>?\\[\\]|-]", "\\\\\\0")
A base R approach:
gsub("([]\\~!##$%^&*(){}_<>?[|-])", "\\\\\\1", "~!##$%^&*(){}_\\<>?[]|-")
Details
[ - start of a character class matching any of the following chars:
~ - ~
! - !
# - #
# - #
$ - $
% - %
^ - ^ (if you put it at the start, escape with \\)
& - &
* - * (no need to escape inside a character class)
( - (
) - )
{ - {
} - }
_ - _ (note it is a word char, and \W would not match it)
\\\\ - a \ char (a literal \ escaped with another literal \)
< - a <
> - >
? - ?
\\[ - a [ char (in ICU regex, it must be escaped inside a character class)
\\] - a ] char (ibid.)
| - a | char (it is not an OR operator inside a character class)
- - a - char
] - end of the character class.
The "\\\\\\0" replacement string holds the characters \\\0 once R has parsed the string literal: the first two backslashes produce a single literal backslash in the output, and \0 is a backreference to the whole match in the ICU regex flavor used by stringr.
Note that the TRE regex used by gsub is a bit trickier: ] must be the first char in the character class, [ should not be escaped, a literal \ is written only once at the regex level (regex escape sequences are not supported inside TRE bracket expressions), and - must be at the end. Also, there is no support for a whole-match backreference, so you need to wrap the whole pattern in a capturing group and replace with the \1 backreference.
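As a quick check of the base R variant, here is a minimal sketch applied to the first line of the example text from the question. The helper name escape_specials is made up for illustration, and the character set is trimmed (a single # and no backslash), so treat it as a starting point under those assumptions rather than a definitive implementation.

```r
# Hypothetical helper (not part of the original answer): prefix each
# character in the set with one literal backslash, using base R's default
# (TRE) regex engine. "]" comes first and "-" last so both stay literal
# inside the bracket expression.
escape_specials <- function(x) {
  gsub("([]~!#$%^&*(){}_<>?[|-])", "\\\\\\1", x)
}

x <- "Do you love great hair? Wind it up! Your curls are your gift- set"
escaped <- escape_specials(x)

# cat() shows the raw characters; print() would display each backslash as \\
cat(escaped, "\n")
# prints: Do you love great hair\? Wind it up\! Your curls are your gift\- set
```

Periods and commas are left alone because they are not in the set, which keeps them available for later delimiting logic.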
Let dat = tar$clean.text[1:10]; then you can do:
Map(gsub,"([[:punct:]])","\\\\\\1",dat)
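Because [[:punct:]] matches every punctuation character, including ones the question wants to keep for delimiting, a variant restricted to a chosen subset may be closer to what was asked. The sketch below assumes a small stand-in list shaped like tar$clean.text (character vectors mixed with NA); the data and the variable names are illustrative only.

```r
# Hypothetical stand-in for tar$clean.text: a list of character vectors and NAs
dat <- list(
  c("Want soft skin every day?", "Scent: Fresh"),
  NA_character_,
  c("Me! Bath Bomb", "UPC: 858858000358")
)

# Only this subset is escaped; ":" and "," are deliberately left out
pattern <- "([]~!#$%^&*(){}_<>?[|-])"

# lapply keeps the list structure; gsub is vectorised over each element
# and returns NA for NA input
escaped <- lapply(dat, gsub, pattern = pattern, replacement = "\\\\\\1")
str(escaped)
```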
Related
How to avoid word limit in ChatGPT in R?
I registered at this link to get an API key for ChatGPT, and I used the chatgpt package in R:
library(chatgpt)
I set the environment variable:
Sys.setenv(OPENAI_API_KEY = "sk-YOURAPI")
Then I called this function:
chatgpt::ask_chatgpt("How to make a strawberry pie to donate to my wife? Ingredients, please.")
Sometimes the output is missing words. For example:
*** ChatGPT input: How to make a strawberry pie to donate to my wife? Ingredients, please.
[1] ... and your wife is"
It does not complete the text after "is". Is there any solution for this in R? Similar question on reddit.
EDIT: I tried increasing the processing time of the function (so that the chat would finish typing all the content, without stopping midway), like this:
for (i in 35) {
  print(chatgpt::ask_chatgpt("How to make a strawberry pie to donate to my wife? Ingredients, please."))
  Sys.sleep(i)
}
What you are running into is prompt engineering. GPT is a sophisticated autocomplete engine. If you are not getting the right response, you need to recraft your prompt. You can always test your prompts in the OpenAI playground: https://platform.openai.com/playground
I was able to get a list of ingredients and cooking steps by using the following prompt:
How can I make a strawberry pie to donate to my wife? Please provide first a numbered list of ingredients, and secondly a numbered lists of steps.
Here is the output I got in the playground:
Ingredients:
1. 2 ½ cups of fresh or frozen strawberries
2. 1 9-inch pre-made pie crust
3. ¾ cup of granulated sugar
4. 2 tablespoons of cornstarch
5. ¼ teaspoon of salt
6. 1 tablespoon of fresh lemon juice
Steps:
1. Preheat oven to 425 degrees F.
2. Place the pre-made pie crust in a 9-inch pie dish and set aside.
3. In a medium bowl, combine the strawberries, sugar, cornstarch, salt, and lemon juice. Stir until the mixture is combined.
4. Pour the strawberry mixture into the pre-made pie crust.
5. Place the pie dish on a baking sheet and bake for 15 minutes.
6. Reduce the oven temperature to 375 degrees F and bake for an additional 25 minutes.
7. Allow the pie to cool completely before serving.
Another thing to note: the GitHub repo for the chatgpt R library says "The {chatgpt} R package provides a set of features to assist in R coding." Ref: https://github.com/jcrodriguez1989/chatgpt
I would use the OpenAI APIs directly; this way you will have a lot more control over your response. I am not an R specialist, but this is based on what the OpenAI Playground showed me (sent as a POST request with a JSON body, which the completions endpoint expects).
library(httr)
# The completions endpoint takes a POST request with a JSON body
response <- POST(
  "https://api.openai.com/v1/completions",
  add_headers(Authorization = "Bearer YOUR_OPENAI_API_KEY"),
  body = list(
    model = "text-davinci-003",
    prompt = "How can I make a strawberry pie to donate to my wife? Please provide first a numbered list of ingredients, and secondly a numbered lists of steps.",
    max_tokens = 200
  ),
  encode = "json"
)
content(response)
Ref: OpenAI playground
Increase max_tokens to get a longer answer.
Cracking an XOR crypt with a known key length
I'm trying to crack a crypt with a known key length. I deduced that the operation made was a hex XOR. Here is the crypt: 330a1448010816101c1e470b0248104711050903040a0844511317130d030817024812150d014817150d1c48050f0751181415071f0618060e510618000a051b190606144822080e1006040a42051d1302101e1b040a423d4651330a14480608101548010816101c1e470f1011511507170d0347161e48050f0751181d060c0548181311140417470b1f48100306181c18080c511c1e4716190d510206180a1d0242051d1302105f4838094205001447231f0c14144e511f1902101448050f07511b010201180d02470b0248180906180f14090d041b5d4716190d030242101a1447111e0514470d050014154212041e14071d115115071d09050206510b040b16181e1013071548010816101c1e4711010d120e07024651370d0509050807024806021014481809160307151201140c510817051b180307511c1902423006150211511a14000b1e0651010d041a5104071f1c04150b141b5106051e4451060c154819061414481302011e051447031f48180916140f03060e5118101516510717470f040b19470d1748050f07511f1e150e154f0247041e071547110418010b1b5f48381342181b51130a14480608101d0c561442170704151619451d0610160d02134217071e0342121a1e174e510e1e0b0e1e1f1809055105100e18144451100a14090547031f0c51150b120d5f I have tried to use this tool to decrypt it. The tool outputted multiple possible keys and for each key, an attempt do decipher the crypt. I know the result should be plain text English. The closest I got was this: T-e po1ato ,s a 6tarc-y, t0bero0s cr*p fr*m th per nnia) nig-tsha!e So)anumetube7osumeL. T-e wo7d po1ato (ay r fer 1o th pla+t it6elf ,n ad!itio+ to 1he e!ibleetube7. Inethe ndesi whe7e th spe&ies ,s in!igen*us, 1hereeare 6ome *thereclos ly r late! cul1ivat d po1ato 6peci s.P*tato s we7e in1rodu&ed o0tsid theeAnde6 reg,on f*ur c ntur,es a"o,a+d ha3e be&ome $n in1egra) par1 of (uch *f th wor)d's #ood 6uppl<. Iteis t-e wo7ld'sefour1h-la7gestefoodecropi fol)owin" mai?e, w-eat $nd r,ce. After some digging and manually tweaking the text, I got this: The potato is a starchy, tuberous crop from the perennial nightshade Solanum tuberosum L. 
The word "potato" may refer to the plant, itself, in addition to the edible tuber. In the Andes, where the species is indigenous, there are some other closely related cultivated potato species. Potatoes were introduced outside the Andes region four centuries ago, and have become an integral part of much of the world's food supply. It is the world's fourth-largest food crop, following maize, wheat and rice. It unfortunately did not work. I am now trying to find a clue as to what I should be doing to find the answer.
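One way to sanity-check a candidate key is a known-plaintext step: XOR the start of the ciphertext with the plaintext you believe it decodes to. The sketch below assumes a repeating-key XOR with a 5-byte key (the wrong characters in the tool's output recur every fifth position) and assumes the corrected plaintext really begins with "The potato". The helper names hex_to_raw, xor_raw and decrypt are made up for illustration.

```r
# Convert a hex string to a raw vector, two hex digits per byte
hex_to_raw <- function(hex) {
  starts <- seq(1, nchar(hex), by = 2)
  as.raw(strtoi(substring(hex, starts, starts + 1), base = 16L))
}

# Byte-wise XOR of two raw vectors of equal length
xor_raw <- function(a, b) as.raw(bitwXor(as.integer(a), as.integer(b)))

# First 10 bytes of the crypt from the question, and the assumed plaintext
cipher_prefix <- hex_to_raw("330a1448010816101c1e")
known_plain <- charToRaw("The potato")

# cipher XOR plaintext gives the key stream; with a repeating key it should
# show the same 5 bytes twice
keystream <- xor_raw(cipher_prefix, known_plain)
rawToChar(keystream)
# returns "gbqhqgbqhq"

# Recycle the 5-byte key over the ciphertext to decrypt it
key <- keystream[1:5]
decrypt <- function(cipher, key) {
  rawToChar(xor_raw(cipher, rep_len(key, length(cipher))))
}
decrypt(cipher_prefix, key)
# returns "The potato"
```

If the recovered bytes repeat cleanly like this, the key length and plaintext guess are consistent with each other, and the same decrypt call can be run over the full hex string.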
Converting a data frame that contains lists into a CSV with R
I am new to R and I am facing difficulties to convert my dataframe (named dffinal) which contains list into a csv. I tried the following code which gave a csv that is not usable: dput(dffinal, file="out.txt") new <- source("out.txt") write.csv2(dffinal,"C:/Users\\final.csv", row.names = FALSE) I tried all the option but I found nothing! Here is a sample of my dataframe: dput(head(dffinal[1:2])) structure(list(V1 = list("I heard about your products and I would like to give it a try but I'm not sure which product is better for my dry skin, Almond products or Shea Butter products? Thank you", "Hi,\n\nCan you please tell me the difference between the shea shower oil limited edition and the other shower gels? I got a sample of one in a kit that had a purple label on it. (Please see attached photo.) I love it!\nBut, what makes it limited edition, the smell or what? It is out of stock and I was wondering if it is going to be restocked or not?\n\nAlso, what makes it different from the almond one?\n\nThank you for your help.", "Hello, Have you discontinued Eau de toilette", "I both an eGift card for my sister and she hasn't received anything via her email\n\nPlease advise \n\nThank you \n\n cann", "I do not get Coco Pillow Mist. yet. When are you going to deliver it? I need it before January 3rd.", "Hello,\nI wish to follow up on an email I just received from Lol, notifying\nme that I've \"successfully canceled my subscription of bun Complete.\"\nHowever, I didn't request a cancelation and was expecting my next scheduled\nfulfillment later this month. Could you please advise and help? I'd\nappreciate it if you could reinstate my subscription.\n"), V2 = list("How long can I keep a product before opening it? shea butter original hand cream large size 5oz, i like to buy a lot during sales promotions, is this alright or should i only buy what i'll use immediately, are these natural organic products that will still have a long stable shelf life? 
thank you", "Hi,\nI recently checked to see if my order had been delivered, and I only received my gift box and free sample. Can you please send the advent calendar? Does not seem to have been included in the shipping. Thank you", "Is the gade fragrance still available?", "I previously contacted you because I purchased your raspberry lip scrub. When I opened the scrub, 25% of the product was missing. Your customer service department agreed to send me a replacement, but I never received the replacement rasberry lip scrub. Could you please tell me when I will receive the replacement product? Thanks, me", "To whom it may concern:\n\nI have 3 items in my order: 1 Shea Butter Intensive Hand Balm and 2 S‚r‚nit‚ Relaxing Pillow Mist. I have just received the hand balm this morning. I was wondering when I would receive the two bottles of pillow mist.\n\nThanks and regards,\n\nMe", "I have not received 2X Body Scalp Essence or any shipment information regarding these items. Please let me know if and when you will be shipping these items, otherwise please credit my card. Thanks")), row.names = c(NA, 6L), class = "data.frame")
We can do this in the tidyverse:
library(dplyr)
library(readr)
dffinal %>%
  mutate(across(everything(), unlist)) %>%
  write_csv('result.csv')
If every list element has length 1 for all the rows, as in the example shared, unlist will work:
dffinal[] <- lapply(dffinal, unlist)
If any list element has length greater than 1, use:
dffinal[] <- lapply(dffinal, sapply, toString)
Then write the data with write.csv:
write.csv(dffinal, 'result.csv', row.names = FALSE)
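To see the flattening step end to end, here is a small base R sketch on a made-up data frame (the column contents and the temporary file are illustrative, not the asker's data). It collapses only the list columns with toString and leaves ordinary columns untouched, which is one possible way to combine the two cases.

```r
# Hypothetical toy data frame with list columns, standing in for dffinal
dffinal <- data.frame(id = 1:2)
dffinal$V1 <- list("a", "b")            # every element has length 1
dffinal$V2 <- list(c("x", "y"), "z")    # one element has length 2

# Collapse each list column to a character vector; keep other columns as-is
flat <- dffinal
flat[] <- lapply(flat, function(col) {
  if (is.list(col)) sapply(col, toString) else col
})

# Write to a temporary file and read it back to confirm the CSV is usable
out <- tempfile(fileext = ".csv")
write.csv(flat, out, row.names = FALSE)
back <- read.csv(out, stringsAsFactors = FALSE)
print(back)
```

Elements with several values end up as a single comma-separated string in one cell ("x, y"), which write.csv quotes so the file still parses.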
r lang and extract meta description from facebook page
con=file("https://www.facebook.com/groups/368965769950169/","r",blocking=FALSE)
page=readLines(con)
d=grep('<title id="pageTitle">',page,perl=TRUE,value=TRUE)
res2=str_match(d,'"og:description\" content(.*?)/>')
The result is:
"=\"Super Healthy Kids. ٣٫٣ مليون تسجيل إعجاب. We love making healthy food fun, simple, and delicious! With recipes, strategies, tips, and more!\" "
But I need only "We love making healthy food fun, simple, and delicious! With recipes, strategies, tips, and more!"
How can I convert these latin1 strings to Arabic and English so that I can save them in a text file?
"=\"Super Healthy Kids. ٣٫٣ مليون تسجيل إعجاب. "
Extract string after first occurrence of a string pattern
I'm having trouble extracting all the text that occurs after the first occurrence of the word 'PRODUCTS'. The text I'm working with is below and is stored in test$description (There is more text but R truncates the last part) [1] "Hey guys! Been wanting to film a Get Ready With Me for a while, just to sit back and chill and chit chat with you all! It has been a MINUTE since I have done one of these so I hope you enjoy this first impressions get ready with me :D Love you guys! \n\nDONT FORGET TO HIT SUBSCRIBE! :D \n---------------------------------------------------------------------------------------------------------------\nFACE PRODUCTS : \n\nH2O Green Tea Matcha Facial Essence - \nMILK Makeup Blur Stick - \nLoreal Total Coverage Foundation - \nGallany Concealer - \n\nBecca Soft Light Powder - \nPixie X Maryam NYC Glow and Bronze Palete - \nClinique Honey Cheek Pop Blush -\n---------------------------------------------------------------------------------------------------------------\nEYE PRODUCTS! \n\nColourpop Pressed Eyeshadows - <truncated> When I use: sub(".*PRODUCTS",'',test$description) I get: [1] "! \n\nColourpop Pressed Eyeshadows - \n\nTarte Cosmetics Fake Away Pencil - \n\nKat Vond D Trooper Eyeliner - \n\nNubounsom Dragon Li Lashes - Use code MANNYMUA to save 20% - \n---------------------------------------------------------------------------------------------------------------\nLIPS \n\nMorphe Brushes Liquid Lipstick in the shade Mood - USE CODE MANNYMUA to save money -\n--------------------------------------------------------------------------------------------\nBRUSHES AND TOOLS - \n\nMorphe Brushes - use code \"MANNYMUA\" all caps for 10% off everything! - \n- \nMorphe E2 Bronzer Brush - \nMorphe E4 Blush Brush - \nMorphe MB13 Nose Contour - \nMorphe M510 Highlight Brush - \n\nEYES:\nE2... 
<truncated> So only everything after the second occurrence of 'PRODUCTS' When I use: sub(".*PRODUCTS ",'',test$description) I get: [1] ": \n\nH2O Green Tea Matcha Facial Essence - \nMILK Makeup Blur Stick - \n\nLoreal Total Coverage Foundation - \nGallany Concealer - \n\nBecca Soft Light Powder - \n\nPixie X Maryam NYC Glow and Bronze Palete - \n\nClinique Honey Cheek Pop Blush - \n\n---------------------------------------------------------------------------------------------------------------\nEYE PRODUCTS! \n\nColourpop Pressed Eyeshadows - \n\nTarte Cosmetics Fake Away Pencil - \n\nKat Vond D Trooper Eyeliner - \n\nNubounsom Dragon Li Lashes - Use code MANNYMUA to save 20% - \n\n---------------------------------------------------------------------------------------------------------------\nLIPS \n\nMorphe Brushes Liquid Lipstick in the shade Mood - USE CODE MANNYMUA to save money... <truncated> I think the issue is the space between 'PRODUCTS' and the colon in the first occurrence and the lack of space between 'PRODUCTS' and the exclamation point in the second occurrence. But I'm trying to tell R just to look for the string 'PRODUCTS'. How can I get it to ignore the spacing?
You almost had it. Instead use
sub(".*?PRODUCTS", '', test$description)
Note the added ? and that there is no space after PRODUCTS. By default, matching is "greedy": it matches as much as it can, so .*PRODUCTS runs up to the last occurrence of PRODUCTS. Adding the ? makes the quantifier lazy, so the match stops at the first occurrence.
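The difference is easy to see on a shortened string. This is a minimal sketch with made-up text standing in for test$description; the variable names are illustrative.

```r
# Hypothetical short stand-in for test$description
x <- "FACE PRODUCTS : foundation | EYE PRODUCTS! eyeshadow"

# Greedy: .* runs to the LAST "PRODUCTS", so only the tail survives
greedy <- sub(".*PRODUCTS", "", x)
greedy
# returns "! eyeshadow"

# Lazy: .*? stops at the FIRST "PRODUCTS", so everything after it is kept
lazy <- sub(".*?PRODUCTS", "", x)
lazy
# returns " : foundation | EYE PRODUCTS! eyeshadow"
```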