Extract text from CSV in R

I have an Excel .CSV file in which one column has the transcription of a conversation. Whenever the speaker uses Spanish, the Spanish is written within brackets.
One example sentence:
so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day
Ideally, I'd like to extract the English and Spanish separately, so one file would contain all the Spanish words, and another would contain all the English words.
Any ideas on how to do this? Or which function/package to use?
Edited to add: there are about 100 cells that contain text in this Excel sheet. I guess where I'm confused is: how do I treat this entire CSV as a "string"?
I don't want to copy and paste every cell as a "string" -- I was hoping I could somehow just upload the entire CSV.

To load the CSV into R, you could use readr::read_csv("YOUR_FILE.csv"). There are more options, some of which are available to you if you use the "File -- Import Dataset -- From Text (readr)" menu option in RStudio.
Supposing you have the data loaded, you will likely need to rely on some form of "regex" to parse the text into sections based on the brackets. There are some base R functions for this, but I find the functions in stringr (part of the tidyverse meta-package) to be useful for this. And tidyr::separate_rows is a nice way to split the text into more lines.
In the regex below, there are a few ingredients:
(?=...) means to split before the [ but to keep it.
\\[ is how we refer to [ because brackets have special meaning in regex so we need to "escape" them to treat them as a literal character.
(?<=...) means to split after the ] but keep it.
| means "or", so the pattern "\\[|\\]" in the last mutate matches either bracket.
(Granted, I'm still a regex beginner, so I expect there are more concise ways to do this.)
So we could do something like:
library(tidyverse)

df1 <- data.frame(text = "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day")

df1 %>%
  mutate(orig_row = row_number()) %>%
  separate_rows(text, sep = "(?=\\[)") %>%
  separate_rows(text, sep = "(?<=\\] )") %>%
  mutate(language = if_else(str_detect(text, "\\[|\\]"), "Espanol", "English"),
         text = str_remove_all(text, "\\[|\\]"))
Result
# A tibble: 5 × 3
text orig_row language
<chr> <int> <chr>
1 "so " 1 English
2 "usualmente " 1 Espanol
3 "maybe " 1 English
4 "me levanto como a las nueve y media " 1 Espanol
5 "like I exercise and the I like either go to class online or in person like it depends on the day" 1 English
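If you then want the two files the question asks for, you could filter the split result by language and write each subset out. A sketch, where df_split stands in for the output of the pipeline above and the file names are just examples:

```r
library(dplyr)
library(readr)

# df_split stands for the result of the pipeline above; a tiny stand-in:
df_split <- tibble(text = c("so", "usualmente"),
                   language = c("English", "Espanol"))

# one CSV per language:
df_split %>% filter(language == "Espanol") %>% write_csv("spanish_words.csv")
df_split %>% filter(language == "English") %>% write_csv("english_words.csv")
```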

Related

Grepl for 2 words/phrases in proximity in R (dplyr)

I'm trying to create a filter for a large dataframe. I'm using grepl to search for a series of text within a specific column. I've done this for single words/combinations, but now I want to search for two words in close proximity (i.e. the word tumo(u)r within 3 words of the word colon).
I've checked my regular expression on https://www.regextester.com/109207 and my search works there, but it doesn't work within R.
The error I get is
Error: '\W' is an unrecognized escape in character string starting ""\btumor|tumour)\W"
Example below - trying to search for tumo(u)r within 3 words of cancer.
Can anyone help?
library(tibble)
example.df <- tibble(number = 1:4, AB = c('tumor of the colon is a very hard disease to cure', 'breast cancer is also known as a neoplasia of the breast', 'tumour of the colon is bad', 'colon cancer is also bad'))
filtered.df <- example.df %>%
filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB, ignore.case=T)
R uses backslashes as escapes and the regex engine does, too, so you need to double your backslashes. This is explained in multiple prior questions on StackOverflow, as well as in the help page brought up by ?regex. You should try the escaped operators in a simpler set of tests before attempting complex operations, and pay closer attention to the proper placement of parentheses and quotes in the pattern argument.
filtered.df <- example.df %>%
  #filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB,
  # errors here ....^.^..............^..^...^..^.............^.^
  filter(grepl("(\\btumor|tumour)\\W|\\w+(\\w+\\W+){0,3}colon\\b", AB,
               ignore.case = TRUE))
> filtered.df
# A tibble: 2 × 2
number AB
<int> <chr>
1 1 tumor of the colon is a very hard disease to cure
2 3 tumour of the colon is bad
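Following the advice above to test escaped operators in isolation first, a couple of minimal checks show how doubled backslashes behave before you build the full pattern:

```r
# \\b is a word boundary once the backslash is doubled in the R string:
grepl("\\btumor\\b", "tumor of the colon")   # matches: "tumor" is a whole word
grepl("\\btumor\\b", "tumorous growth")      # no match: boundary fails mid-word

# alternation needs no escaping at all:
grepl("(tumor|tumour)", "tumour of the colon")
```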

regex: extract segments of a string containing a word, between symbols

Hello I have a data frame that looks something like this
dataframe <- data_frame(text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
                                 'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF'))
print(dataframe)
# A tibble: 2 × 1
text
<chr>
1 WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12
2 WUFF;other stuff to keep;WIFF2;yes yes IGWIFF
I want to extract the segment of the strings containing the word "keep". Note that these segments can be separated from other parts by different symbols, for example , and ;.
The final dataset should look something like this:
final_dataframe <- data_frame(text = c('some words to keep',
                                       'other stuff to keep'))
print(final_dataframe)
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Does anyone know how I could do this?
With stringr ...
library(stringr)
library(dplyr)
dataframe %>%
  mutate(text = trimws(str_extract(text, "(?<=[,;]).*keep")))
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Created on 2022-02-01 by the reprex package (v2.0.1)
I've made great use of the positive lookbehind and positive lookahead group constructs -- check this out: https://regex101.com/r/Sc7h8O/1
If you want to assert that the text you're looking for comes after a character/group -- in your first case the apostrophe, use (?<=').
If you want to do the same but match something before ' then use (?=')
And you want to match between 0 and unlimited characters surrounding "keep" so use .* on either side, and you wind up with (?<=').*keep.*(?=')
I did find in my test that a string like text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12', will also match the c(, which I didn't intend. But I assume your strings are all captured by pairs of apostrophes.
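One thing to watch when moving these lookarounds into R: stringr's ICU regex engine supports them directly, while base R only does with perl = TRUE. A small sketch with a made-up string wrapped in apostrophes:

```r
library(stringr)

x <- "'WAFF, some words to keep, ciao'"  # made-up example string

# stringr (ICU engine) handles the lookbehind as-is:
str_extract(x, "(?<=').*keep")
# [1] "WAFF, some words to keep"

# base R needs perl = TRUE for the same pattern:
regmatches(x, regexpr("(?<=').*keep", x, perl = TRUE))
```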

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets which have mentions starting with '#', I need to extract all of them and save each mention in that particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to a form a string separated by spaces as mentioned earlier.
Thanks in advance.
I trust it would be best if you used an "AsIs" list column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a single comma-separated string per row, then you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
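A minimal runnable version of the same idea, using a couple of made-up tweets in place of tweets_date$Tweet:

```r
library(stringr)

tweets <- c("Happy birthday #DineshKarthik of #KKRiders",
            "CHAMPIONS - 2018 #IPLFinal")

# str_extract_all gives one character vector per tweet ...
mentions <- str_extract_all(tweets, "#\\w+")

# ... and sapply + paste collapses each vector into one string:
sapply(mentions, paste, collapse = ", ")
# [1] "#DineshKarthik, #KKRiders" "#IPLFinal"
```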
Demo
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[[:alnum:]_]#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice

Extracting text between html tags and labelling it with the tag in R

I am trying to learn how to classify sentences in R.
I have a text file containing sentences in the following format:
<happy>
This did the trick : the boys now have a more distant friendship and David is much happier .
<\happy>
<happy>
When Anna left Inspector Aziz , she was much happier .
<\happy>
I intend to tag the sentences in the following way:
dataset$text = When Anna left Inspector Aziz , she was much happier
dataset$label = happy
I want to extract the sentence and label them with the emotion. How should I approach this? I know that I should use grouping in regex but I don't know how to do this in R. I am new to it and learning.
rl <- readLines('sentences.txt')
Presently that's badly-formatted XML, as:
XML uses forward slashes in closing tags, not backslashes. In fact, you can't even read that into R as-is, as it will try to parse \h as an escaped character unless you add extra backslashes to escape the backslashes themselves.
XML needs to be enclosed in a single root tag. That problem is much easier to remedy (paste on some tags), though.
If, as is not unlikely, your actual data is properly formatted XML, you can use the xml2 or XML packages to parse. I like purrr::map_df to iterate over nodes and coerce the results to a data.frame, but you can do the same thing in base R, if you prefer.
library(xml2)
library(purrr)
'<happy>
This did the trick : the boys now have a more distant friendship and David is much happier .
</happy>
<happy>
When Anna left Inspector Aziz , she was much happier .
</happy>' %>%
  paste('<sent>', ., '</sent>') %>% # add enclosing tags
  read_xml() %>%
  xml_find_all('//text()/parent::*') %>% # select nodes that are parents of text
  map_df(~list(text = xml_text(.x, trim = TRUE),
               emotion = xml_name(.x)))
## # A tibble: 2 × 2
## text emotion
## <chr> <chr>
## 1 This did the trick : the boys now have a more distant friendship and David is much happier . happy
## 2 When Anna left Inspector Aziz , she was much happier . happy
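As noted, the purrr step can be skipped in favour of base R, since xml_text() and xml_name() are vectorised over node sets. A sketch with two made-up sentences:

```r
library(xml2)

txt <- '<happy>Sentence one .</happy>
<happy>Sentence two .</happy>'

doc <- read_xml(paste('<sent>', txt, '</sent>'))  # add an enclosing root tag
nodes <- xml_find_all(doc, '//text()/parent::*')

# the accessors return one value per node, so a plain data.frame suffices:
data.frame(text = xml_text(nodes, trim = TRUE),
           emotion = xml_name(nodes),
           stringsAsFactors = FALSE)
```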

Dealing with spaces and "weird" characters in column names with dplyr::rename()

I have table with difficult headers like this:
Subject Cat Nbr Title Instruction..Mode!
1 XYZ 101 Intro I ONLINE
2 XYZ 102 Intro II CAMPUS
3 XYZ 135 Advanced CAMPUS
I would like to rename the columns with dplyr::rename()
df %>%
  rename(subject = Subject,
         code = Cat Nbr,
         title = title,
         mode = Instruction..Mode!)
But I am getting an Error: unexpected symbol in:
How might I reconcile this?
To refer to variables that contain non-standard characters or start with a number, wrap the name in back ticks, e.g., `Instruction..Mode!`
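For example, with a small made-up version of the table (check.names = FALSE is needed so data.frame keeps the space in "Cat Nbr"):

```r
library(dplyr)

df <- data.frame(Subject = c("XYZ", "XYZ"),
                 `Cat Nbr` = c(101, 102),
                 Title = c("Intro I", "Intro II"),
                 `Instruction..Mode!` = c("ONLINE", "CAMPUS"),
                 check.names = FALSE)

# back ticks let rename() refer to the non-standard names:
df %>%
  rename(subject = Subject,
         code = `Cat Nbr`,
         title = Title,
         mode = `Instruction..Mode!`)
```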
