Delimiting Text from Democratic Debate

I am trying to delimit the following data into first name, timestamp, and text. Currently the entire data sits in a single data-frame column called text. Here is how it looks:
text
First Name: 00:03 Welcome Back text text text
First Name 2: 00:54 Text Text Text
First Name 3: 01:24 Text Text Text
This is what I did so far:
text$specificname = str_split_fixed(text$text, ":", 2)
and it created the following
text specific name
First Name: 00:03 Welcome Back text text text First Name
First Name 2: 00:54 Text Text Text First Name 2
First Name 3: 01:24 Text Text Text First Name 3
How do I do the same for the timestamp and text? Is this the best way of doing it?
EDIT 1: This is how I brought in my data
library(rvest)

#Specifying the url for desired website to be scraped
url = 'https://www.rev.com/blog/transcript-of-july-democratic-debate-night-1-full-transcript-july-30-2019'
#Reading the HTML code from the website
wp = read_html(url)
#Assigning the <p> nodes to an object
alltext = html_nodes(wp, 'p')
#Turn data into text, then a data frame
alltext = html_text(alltext)
text = data.frame(alltext)

Assuming that text is in the form shown in the Note at the end, i.e. a character vector with one component per line, we can insert commas around the timestamp and then use read.table:
read.table(text = sub(" (\\d\\d:\\d\\d) ", ",\\1,", text), sep = ",", quote = "", as.is = TRUE)
giving this data.frame:
V1 V2 V3
1 First Name: 00:03 Welcome Back text text text
2 First Name 2: 00:54 Text Text Text
3 First Name 3: 01:24 Text Text Text
Note
Lines <- "First Name: 00:03 Welcome Back text text text
First Name 2: 00:54 Text Text Text
First Name 3: 01:24 Text Text Text"
text <- readLines(textConnection(Lines))
Update
Regarding the EDIT that was added to the question: define a regular expression pat which matches possible whitespace, 2 digits, a colon, 2 digits, and possibly more whitespace. Then grep out all lines that match it, giving tt; in each line, replace the first match with #, the captured timestamp, and # again, giving g. Finally, read g in using # as the field separator, giving DF.
pat <- "\\s*(\\d\\d:\\d\\d)\\s*"
tt <- grep(pat, text$alltext, value = TRUE)
g <- sub(pat, "#\\1#", tt)
DF <- read.table(text = g, sep = "#", quote = "", as.is = TRUE)
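To see what the substitution does to a single line (a quick illustrative check, not part of the original answer):
sub(pat, "#\\1#", "First Name: 00:03 Welcome Back text text text")
[1] "First Name:#00:03#Welcome Back text text text"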

Related

Extract disallowed characters

I have transcriptions with erroneous encodings, that is, characters that occur but should not occur.
In this toy data, the only allowed characters are this class:
"[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"
df <- data.frame(
  Utterance = c("~°maybe you (.) >should ¥just¥<",
                "SOME text |<-- pipe¿ and€",  # <--: | and €
                "blah%",                      # <--: %
                "text ^more text",            # <--: ^
                "£norm(hh)a::l£mal, (1.22)"))
What I need to do is:
detect Utterances that contain any wrong encodings
extract the wrong characters
I'm doing OK as far as detection is concerned but the extraction fails miserably:
library(stringr)
library(dplyr)
df %>%
  filter(!str_detect(Utterance, "[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
  mutate(WrongChar = str_extract_all(Utterance, "[^)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))
Utterance WrongChar
1 SOME text |<-- pipe¿ and€ SO, ME, t, ex, |<, --, p, ip, e¿, a, nd
2 blah% bl, ah
3 text ^more text te, xt, ^m, or, t, ex
How can the extraction be improved to obtain this expected result:
Utterance WrongChar
1 SOME text |<-- pipe¿ and€ |, €
2 blah% %
3 text ^more text ^
You need to
Ensure the [ and ] are escaped inside the character class
Add a whitespace pattern to both regex checks, as its absence is skewing your results.
So you need to use
df %>%
  filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
  mutate(WrongChar = str_extract_all(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))
Output:
Utterance WrongChar
1 SOME text |<-- pipe¿ and€ |, €
2 blah% %
3 text ^more text ^
Note that I used positive logic in filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")), so we get all items that contain at least one char other than an allowed one.
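As a quick cross-check (my own sketch, not part of the original answer), the same rows can be flagged with inverted logic: anchor the allowed class so it has to cover the whole string, and keep the rows where that full match fails.
library(stringr)
library(dplyr)
# Anchored pattern: matches only when the string consists solely of allowed characters
allowed <- "^[\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]*$"
df %>%
  filter(!str_detect(Utterance, allowed)) %>%
  mutate(WrongChar = str_extract_all(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))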

Using the same regex for multiple specific columns in R

I have the data as below
Data
df <- structure(list(obs = 1:4, text0 = c("nothing to do with this column even it contains keywords",
"FIFA text", "AFC text", "UEFA text"), text1 = c("here is some FIFA text",
"this row dont have", "some UEFA text", "nothing"), text2 = c("nothing here",
"I see AFC text", "Some samples", "End of text")), class = "data.frame", row.names = c(NA,
-4L))
obs text0 text1 text2
1 1 nothing to do with this column even it contains keywords here is some FIFA text nothing here
2 2 FIFA text this row dont have I see AFC text
3 3 AFC text some UEFA text Some samples
4 4 UEFA text nothing End of text
Expected Output:
obs text0 text1 text2
1 1 nothing to do with this column even it contains keywords here is some FIFA text nothing here
2 2 FIFA text this row dont have I see AFC text
3 3 AFC text some UEFA text Some samples
Question: I have several columns containing keywords (FIFA, UEFA, AFC) that I am looking for. I want to filter on these keywords in specific columns only (in this case text1 and text2). Any row where one of the keywords is found in text1 or text2 should be kept, as in the expected output; text0 is not involved. I am wondering if there is a regex-based way to get this result.
Using filter_at
library(dplyr)
library(stringr)
patvec <- c("FIFA", "UEFA", "AFC")
# Create a single pattern string by collapsing the vector with |
# Specify the word boundary (\b) so as not to have any mismatches
pat <- str_c("\\b(", str_c(patvec, collapse="|"), ")\\b")
df %>%
  filter_at(vars(c('text1', 'text2')),
            any_vars(str_detect(., pat)))
With across, filter() currently does the all_vars matching instead of any_vars. An option is rowwise with c_across:
df %>%
  rowwise %>%
  filter(any(str_detect(c_across(c(text1, text2)), pat))) %>%
  ungroup
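If your version of dplyr has if_any() (added in 1.0.4), it is the direct counterpart of any_vars and avoids the rowwise overhead; a minimal sketch:
df %>%
  filter(if_any(c(text1, text2), ~ str_detect(.x, pat)))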
Also you can try (base R):
#Keys
keys <- c('FIFA', 'UEFA', 'AFC')
keys <- paste0(keys,collapse = '|')
#Filter
df[grepl(pattern = keys, x = df$text1) | grepl(pattern = keys, x = df$text2), ]
Output:
obs text0 text1 text2
1 1 nothing to do with this column even it contains keywords here is some FIFA text nothing here
2 2 FIFA text this row dont have I see AFC text
3 3 AFC text some UEFA text Some samples
Another base R option:
pat <- sprintf("\\b(%s)\\b",paste(patvec, collapse = "|"))
subset(df, grepl(pat, do.call(paste, df[c("text1","text2")])))
obs text0 text1 text2
1 1 nothing to do with this column even it contains keywords here is some FIFA text nothing here
2 2 FIFA text this row dont have I see AFC text
3 3 AFC text some UEFA text Some samples
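A quick illustration of why the word boundary (\b) in pat matters (my own toy check, not from the answers): without it, a keyword embedded inside a longer token would also match.
grepl(pat, c("some UEFA text", "some UEFAX text"))
[1]  TRUE FALSE
grepl("FIFA|UEFA|AFC", "some UEFAX text")
[1] TRUE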

Add quotation marks to every row of specific column

Having a dataframe in this format:
data.frame(id = c(4,2), text = c("my text here", "another text here"))
How is it possible to add triple quotation marks at the start and end of every value/row in the text column?
Expected printed output:
id text
4 """my text here"""
2 """another text here"""
Without paste or cat/paste you can simply run:
data.frame(id = c(4,2), text = c('"""my text here"""', '"""another text here"""'))
id text
1 4 """my text here"""
2 2 """another text here"""

How to read a file with more than one tab as separator and where a space is part of a column value

I have to read a file with a tab ("\t") separator, and the tab can occur multiple times in a row. The read.table function has the special whitespace separator (sep = "") that treats any run of whitespace (tabs or spaces) as one delimiter. The problem is that I have the space character as part of some column values, so I cannot use the whitespace separator. When I use "\t" it only considers one occurrence.
Here is a toy example of my problem:
text1 <- "
a b c
11 12 13
21 22 23
"
ds <- read.csv(sep = "", text = text1)
Before the element [1,3], i.e. "13", there are two tabs as separator. Then I get:
a b c
1 11 12 13
2 21 22 23
This is the expected result.
Let's say we add a space in the third column's values between the first and second digit, so it becomes "1 3" and "2 3". Now we cannot use the whitespace delimiter, because the space is not a delimiter in this case; it is part of the column value. When I use "\t" I get this unexpected result:
text3 <- "
a b c
11 12 1 3
21 22 2 3
"
ds <- read.csv(sep = "\t", text = text3)
The string representation of the input text is:
"a\tb\tc\n11\t12\t\t1 3\n21\t22\t2 3\n"
And now the result is:
a b c
11 12 1 3
21 22 23
It seems to be simple, but I cannot find a way to do it using the read.table interface, because the input argument sep does not accept a regular expression as delimiter.
I think I found a workaround for this: 1) replace all runs of tabs with a single tab first, 2) read the file/text. For example:
read.csv(text = gsub("[\t]+", "\t", readLines(textConnection(text3)), perl = TRUE), sep = "\t")
and also using a file instead:
temp <- tempfile()
writeLines(text3, temp)
read.csv(text = gsub("[\t]+", "\t", readLines(temp), perl = TRUE), sep = "\t")
The text argument passed to read.csv will then be:
> text
[1] ""            "a\tb\tc"     "11\t12\t1 3" "21\t22\t2 3"
and the result of read.csv will be:
a b c
1 11 12 1 3
2 21 22 2 3
This is similar to @Badger's suggestion, just in one step.
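Another sketch along the same lines (my own variant): split each line on runs of tabs with a regex and rebuild the data frame by hand.
lines <- readLines(textConnection(text3))
lines <- lines[nzchar(lines)]        # drop the blank lines
parts <- strsplit(lines, "\t+")      # regex split: one or more tabs
ds <- as.data.frame(do.call(rbind, parts[-1]), stringsAsFactors = FALSE)
names(ds) <- parts[[1]]
ds
   a  b   c
1 11 12 1 3
2 21 22 2 3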
Okay I think I've got something for you:
write.table(
  gsub("\\r", "", gsub("\t", " ", readChar("C:/_Localdata/tab_sep.txt",
                                           file.info("C:/_Localdata/tab_sep.txt")$size))),
  "C:/_Localdata/test.txt", sep = " ", quote = FALSE, col.names = TRUE, row.names = FALSE)
## If there can be one or two tabs in a row, use gsub("\t\t|\t", ...) in place of gsub("\t", ...): list the longer alternative first, and add more \t's if needed, or simply use gsub("\t+", ...).
read.table("C:/_Localdata/test.txt",sep=" ",skip=1,header=T)
Okay, what just happened? First we read the file in as one massive character string using readChar(), telling R how big the file is via file.info(). From that string we strip the tabs using gsub() and the \t pattern (here each tab becomes a single space). We are then left with a character string containing \r's and \n's; \r is a carriage return and \n a line feed, and since R sees both within the file it reports both, so we get rid of one of them. Then we write the table out (ideally back to where it came from). Now you can read it in with an easy separating value of a single space, and skip the first line. The first line will be an X, an artifact of writing out a gsub result. Also declare a header and you should be good to go!
Let's say you have 500 files.
Place all your files in a folder, and set the pattern to the file type they are, or just allow R to view them all by removing the pattern call.
for (filename in list.files("C:/_Localdata/", pattern = ".txt", full.names = TRUE)) {
  write.table(gsub("\\r", "", gsub("\t", " ", readChar(filename, file.info(filename)$size))),
              filename, sep = " ", quote = FALSE, col.names = TRUE, row.names = FALSE)
}
Now your files are ready to be read in however you would like.

Exceptions to sep = " " when reading table into R? Dealing with whitespace within fields

I need to import a table into R that is separated by spaces. Unfortunately, within some of the fields, there are spaces which cause R to separate into a new row. Is there any way of making those fields 'stick together'?
For example, the table looks like this:
V1 V2 V3 V4
Text More 0.11 (a)kdfs hdfa ag$
Text More 1.12 a
Text More 0.21 v
Text More 1222 (a)sdfs sdfa->g
Text More 1232 (a)sdfs sdfa->g
But it gets turned into this when R reads it (using read.delim):
V1 V2 V3 V4
Text More 0.11 (a)kdfs
hdfa ag$
Text More 1.12 a
Text More 0.21 v
Text More 1222 (a)sdfs
sdfa->g
Text More 1232 (a)sdfs
sdfa->g
Those fields all have weird characters that aren't all shared with the other columns/rows. However, as seen, the spaces aren't flanked by the same characters.
In the original file, the rows are separated properly. Is there a way to do any of the following?
Stop separating by spaces after the fourth column is created
Have fields starting/ending with certain characters be stuck together as a string/add a non-space character where the spaces are
Generically, allow exceptions to sep
Quite new to R so sorry if this is very naive. Here is what my script looks like up to then:
strs <- readLines("file")
dat <- read.delim(text = strs,
                  skip = 17,
                  col.names = c("V1", "V2", "V3", "V4"),
                  sep = " ", header = FALSE)
Is there anything I can add to either read.delim or readLines or in between those to fix this problem? As there is fluff that needs to be cut out (hence the skip) I can't use read.table (correct me if I'm wrong).
Some of the characters around the spaces are shared, so I would be willing to use a more tedious method to put other characters in place of the spaces in between e.g. 's' and 's'. Would that be possible with gsub if there isn't an easier method?
Thanks so much!
EDIT: Flash of insight, would it be possible to make the fourth column a new table (that's of course not separated by spaces), then replace all spaces in that table with something else? How would I go about 'breaking off' the fourth column/columns after the third column?
1) Try this:
# Replace the first three runs of spaces on each line with commas;
# everything after the third separator then stays together as V4.
for(i in 1:3) strs <- sub(" +", ",", strs)
read.csv(text = strs)
The result of the last line is:
V1 V2 V3 V4
1 Text More 0.11 (a)kdfs hdfa ag$
2 Text More 1.12 a
3 Text More 0.21 v
4 Text More 1222.00 (a)sdfs sdfa->g
5 Text More 1232.00 (a)sdfs sdfa->g
2) Here is a second solution:
strs.comma <- sub("^(\\S+) +(\\S+) +(\\S+) +", "\\1,\\2,\\3,", strs)
read.csv(text = strs.comma)
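For completeness, a sketch of the same split with tidyr (my own addition, assuming strs holds the header line followed by the data lines): separate() with extra = "merge" keeps everything after the third space-run together in V4.
library(tidyr)
separate(data.frame(line = strs[-1]), line,
         into = c("V1", "V2", "V3", "V4"),
         sep = " +", extra = "merge")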
