Add quotation marks to every row of specific column - r

Having a dataframe in this format:
data.frame(id = c(4,2), text = c("my text here", "another text here"))
How is it possible to add triple quotation marks at the start and end of every value/row in the text column?
Expected printed output:
id text
4 """my text here"""
2 """another text here"""

Without paste or cat, you can simply write the quotes into the data when creating it:
data.frame(id = c(4,2), text = c('"""my text here"""', '"""another text here"""'))
  id                    text
1  4      """my text here"""
2  2 """another text here"""
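If the data frame already exists, the quotes do not have to be typed into the data by hand. A minimal sketch, assuming the text column is a character vector (the name df is only for illustration), is to wrap each value with paste0:

```r
# Hypothetical data frame matching the one in the question
df <- data.frame(id = c(4, 2), text = c("my text here", "another text here"))

# Prepend and append three double quotes to every value of the text column
df$text <- paste0('"""', df$text, '"""')

print(df)
```

Single quotes delimit the R string here so that the double quotes inside it need no escaping.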

Related

How do we insert \n every n-character or/and after n-th space in a string in R?

On SO I found a solution which helps to insert a value/character every n-th character in a string:
(?=(?:.{n})+$)
But it would be more reasonable to insert a value (for example, a tab character or \n) after every n-th space, so that words are not split. How could this regex be edited to do that?
I did my cluster analysis and now I want to attach labels to a dendrogram. Consider that the labels are very long strings like:
tibble(
id = "d2022_1",
label = "A very long label for the dendro that should be splitted so it will look nicely in the picture"
)
And I would like to have it split across several lines, so I want to insert \n:
A very long label for the dendro\nthat should be splitted so\nit will look nicely in the picture
You're reinventing the wheel here. R includes the strwrap function that can split a long string at appropriate word boundaries. This gives a more consistent line length than creating a break after n spaces.
For example, suppose I wanted a line break at most every 12 characters. I can do:
string <- "The big fat cat sat flat upon the mat"
strwrap(string, width = 12)
#> [1] "The big fat" "cat sat" "flat upon" "the mat"
If you want newlines instead of split strings, just paste the result using collapse:
paste(strwrap(string, width = 12), collapse = "\n")
#> [1] "The big fat\ncat sat\nflat upon\nthe mat"
EDIT
Using the newly added example:
library(dplyr)
df <- tibble(
id = "d2022_1",
label = rep("A very long label for the dendro that should be splitted so it will look nicely in the picture", 2)
)
df
#> # A tibble: 2 x 2
#> id label
#> <chr> <chr>
#> 1 d2022_1 A very long label for the dendro that should be splitted so it will look nic~
#> 2 d2022_1 A very long label for the dendro that should be splitted so it will look nic~
df %>% mutate(label = sapply(label, function(x) paste(strwrap(x, 20), collapse = "\n")))
#> # A tibble: 2 x 2
#> id label
#> <chr> <chr>
#> 1 d2022_1 "A very long label\nfor the dendro that\nshould be splitted\nso it will look~
#> 2 d2022_1 "A very long label\nfor the dendro that\nshould be splitted\nso it will look~
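For completeness, the literal request (a break after every n-th space) can also be done with a regex. The sketch below is only one possible way, assuming words are separated by single spaces; the helper name break_every_n_spaces is made up for illustration. strwrap usually gives more even line lengths.

```r
# Replace every n-th space with a newline (hypothetical helper, not from the answer above)
break_every_n_spaces <- function(x, n) {
  # Match n - 1 "word + space" groups, one more word, and then the n-th space
  pattern <- sprintf("((?:[^ ]+ ){%d}[^ ]+) ", n - 1)
  # Keep the captured words and swap the n-th space for a newline
  gsub(pattern, "\\1\n", x, perl = TRUE)
}

label <- "A very long label for the dendro that should be splitted so it will look nicely in the picture"
cat(break_every_n_spaces(label, 7), "\n")
```

Because gsub is vectorised, the helper can be applied directly to a whole column of labels.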

Using the same regex for multiple specific columns in R

I have the data as below
Data
df <- structure(list(obs = 1:4, text0 = c("nothing to do with this column even it contains keywords",
"FIFA text", "AFC text", "UEFA text"), text1 = c("here is some FIFA text",
"this row dont have", "some UEFA text", "nothing"), text2 = c("nothing here",
"I see AFC text", "Some samples", "End of text")), class = "data.frame", row.names = c(NA,
-4L))
  obs                                                    text0                  text1          text2
1   1 nothing to do with this column even it contains keywords here is some FIFA text   nothing here
2   2                                                FIFA text     this row dont have I see AFC text
3   3                                                 AFC text         some UEFA text   Some samples
4   4                                                UEFA text                nothing    End of text
Expected Output:
  obs                                                    text0                  text1          text2
1   1 nothing to do with this column even it contains keywords here is some FIFA text   nothing here
2   2                                                FIFA text     this row dont have I see AFC text
3   3                                                 AFC text         some UEFA text   Some samples
Question: I have several columns containing some keywords (FIFA, UEFA, AFC) that I am looking for. I want to filter on these keywords in specific columns only (in this case, text1 and text2). Any row where one of those keywords is found in text1 or text2 should be kept, as in the expected output. text0 should be ignored. I am wondering if there is any regex to get this result.
Using filter_at
library(dplyr)
library(stringr)
patvec <- c("FIFA", "UEFA", "AFC")
# // create a single pattern string by collapsing the vector with `|`
# // specify the word boundary (\\b) so as not to have any mismatches
pat <- str_c("\\b(", str_c(patvec, collapse="|"), ")\\b")
df %>%
filter_at(vars(c('text1', 'text2')),
any_vars(str_detect(., pat)))
With across, filter() currently applies all_vars-style matching (every column must match) rather than any_vars. An option is rowwise with c_across:
df %>%
rowwise %>%
filter(any(str_detect(c_across(c(text1, text2)), pat))) %>%
ungroup
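In more recent dplyr releases the same "any column matches" filter can be written with if_any. The sketch below assumes dplyr 1.0.4 or later (where if_any is available) and rebuilds the example data so that it runs on its own; the name res is only for illustration.

```r
library(dplyr)
library(stringr)

# Example data from the question
df <- data.frame(
  obs = 1:4,
  text0 = c("nothing to do with this column even it contains keywords",
            "FIFA text", "AFC text", "UEFA text"),
  text1 = c("here is some FIFA text", "this row dont have", "some UEFA text", "nothing"),
  text2 = c("nothing here", "I see AFC text", "Some samples", "End of text")
)

# One pattern for all keywords, with word boundaries
pat <- "\\b(FIFA|UEFA|AFC)\\b"

# Keep rows where at least one of text1 / text2 matches; text0 is not inspected
res <- df %>%
  filter(if_any(c(text1, text2), ~ str_detect(.x, pat)))

print(res)
```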
Also you can try (base R):
#Keys
keys <- c('FIFA', 'UEFA', 'AFC')
keys <- paste0(keys,collapse = '|')
#Filter
df[grepl(pattern = keys,x = df$text1) | grepl(pattern = keys,x = df$text2),]
Output:
  obs                                                    text0                  text1          text2
1   1 nothing to do with this column even it contains keywords here is some FIFA text   nothing here
2   2                                                FIFA text     this row dont have I see AFC text
3   3                                                 AFC text         some UEFA text   Some samples
Another base R option:
pat <- sprintf("\\b(%s)\\b",paste(patvec, collapse = "|"))
subset(df, grepl(pat, do.call(paste, df[c("text1","text2")])))
  obs                                                    text0                  text1          text2
1   1 nothing to do with this column even it contains keywords here is some FIFA text   nothing here
2   2                                                FIFA text     this row dont have I see AFC text
3   3                                                 AFC text         some UEFA text   Some samples

Delimiting Text from Democratic Debate

I am trying to split the following data into first name, time stamp, and then the text. Currently, all of the data sits in a single column of a data frame; this column is called Text 1. Here is how it looks:
text
First Name: 00:03 Welcome Back text text text
First Name 2: 00:54 Text Text Text
First Name 3: 01:24 Text Text Text
This is what I did so far:
text$specificname = str_split_fixed(text$text, ":", 2)
and it created the following
text specific name
First Name: 00:03 Welcome Back text text text First Name
First Name 2: 00:54 Text Text Text First Name2
First Name 3: 01:24 Text Text Text First Name 3
How do I do the same for the timestamp and text? Is this the best way of doing it?
EDIT 1: This is how I brought in my data
library(rvest)
#Specifying the url for desired website to be scraped
url = 'https://www.rev.com/blog/transcript-of-july-democratic-debate-night-1-full-transcript-july-30-2019'
#Reading the HTML code from the website
wp = read_html(url)
#assigning the class to an object
alltext = html_nodes(wp, 'p')
#turn data into text, then dataframe
alltext = html_text(alltext)
text = data.frame(alltext)
Assuming that text is in the form shown in the Note at the end, i.e. a character vector with one component per line and with the fields separated by runs of two or more spaces, we can replace each such run with a comma and use read.table:
read.table(text = gsub(" {2,}", ",", text), sep = ",", as.is = TRUE)
giving this data.frame:
             V1    V2                          V3
1   First Name: 00:03 Welcome Back text text text
2 First Name 2: 00:54              Text Text Text
3 First Name 3: 01:24              Text Text Text
Note
Lines <- "First Name:  00:03  Welcome Back text text text
First Name 2:  00:54  Text Text Text
First Name 3:  01:24  Text Text Text"
text <- readLines(textConnection(Lines))
Update
Regarding the EDIT that was added to the question: define a regular expression pat which matches optional whitespace, 2 digits, a colon, 2 digits and optionally more whitespace. Then grep out all lines that match it, giving tt, and in each of those lines replace the match with #, the captured timestamp (without the surrounding whitespace) and #, giving g. Finally read it in using # as the field separator, giving DF.
pat <- "\\s*(\\d\\d:\\d\\d)\\s*"
tt <- grep(pat, text$alltext, value = TRUE)
g <- sub(pat, "#\\1#", tt)
DF <- read.table(text = g, sep = "#", quote = "", as.is = TRUE)
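An alternative that names the three fields directly is strcapture from base R's utils package. This is only a sketch under the assumption that every line has the shape "name: mm:ss text"; the column names name, time and text are chosen here for illustration.

```r
# Sample lines in the layout shown in the question
lines <- c(
  "First Name: 00:03 Welcome Back text text text",
  "First Name 2: 00:54 Text Text Text",
  "First Name 3: 01:24 Text Text Text"
)

# Group 1: speaker (lazy, up to the colon before the timestamp)
# Group 2: mm:ss timestamp
# Group 3: the remaining text
pattern <- "^(.*?):\\s*(\\d{2}:\\d{2})\\s*(.*)$"

# proto defines the names and types of the resulting columns
parts <- strcapture(
  pattern, lines,
  proto = data.frame(name = character(), time = character(), text = character(),
                     stringsAsFactors = FALSE),
  perl = TRUE
)

print(parts)
```

Lines that do not match the pattern come back as rows of NA, which makes them easy to spot and filter out.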

Classification based on list of words R

I have a data set with article titles and abstracts that I want to classify based on matching words.
"This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text"
Topic 1 Topic 2 Topic (X)
word1 word4 word(a)
word2 word5 word(b)
word3 word6 word(c)
Given that the text above matches words in Topic 2, I want to add a new column containing this label. Ideally this would be done with tidyverse packages.
Given the sentence as a string and the topics in a data frame you can do something like this
input<- c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text")
df <- data.frame(Topic1 = c("word1", "word2", "word3"),Topic2 = c("word4", "word5", "word6"))
## This splits on space and punctuation (only , and .)
input<-unlist(strsplit(input, " |,|\\."))
newcol <- paste(names(df)[apply(df,2, function(x) sum(input %in% x) > 0)], collapse=", ")
Since I am unsure which data frame you want to add this to, I have stored the result in a vector, newcol.
If you had a data frame of long sentences then you can use a similar approach.
inputdf<- data.frame(title=c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text", "word2", "word3, word4"))
input <- strsplit(as.character(inputdf$title), " |,|\\.")
inputdf$newcolmn <-unlist(lapply(input, function(x) paste(names(df)[apply(df,2, function(y) sum(x %in% y)>0)], collapse = ", ")))
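Since the question asks for tidyverse packages, the same idea can be sketched with dplyr, tidyr and stringr. This is an illustration rather than a definitive solution: the names topics, docs, doc_id and result are assumptions, the titles are shortened versions of the example above, and the topic words are stored in long format (one row per word).

```r
library(dplyr)
library(tidyr)
library(stringr)

# Topic lexicon in long format: one row per (topic, word) pair
topics <- tibble(
  topic = rep(c("Topic1", "Topic2"), each = 3),
  word  = c("word1", "word2", "word3", "word4", "word5", "word6")
)

# Hypothetical documents to classify
docs <- tibble(
  doc_id = 1:3,
  title  = c("This is an example of text. word4, word5, text, text, text",
             "word2",
             "word3, word4")
)

labels <- docs %>%
  mutate(word = str_split(title, "[ ,.]+")) %>%  # split on spaces, commas and periods
  unnest(word) %>%                               # one row per word
  inner_join(topics, by = "word") %>%            # keep only words found in the lexicon
  distinct(doc_id, topic) %>%                    # each topic once per document
  group_by(doc_id) %>%
  summarise(topics = str_c(topic, collapse = ", "), .groups = "drop")

# Attach the labels; documents without any match get NA
result <- left_join(docs, labels, by = "doc_id")
print(result)
```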

Separating one file in two using timestamp

Having a txt file in the following format:
2011-01-01 00:00:00 text text text text
2011-01-01 00:01:00 text text text text
text
2011-01-01 00:02:00 text text text text
....
....
....
....
2011-01-02 00:00:00 text text text text
2011-01-02 00:01:00 text text text text
The whole file contains data for two calendar days.
Is it possible to separate the file into two different files, one for each day?
Read the data in with read.table().
We should have a data.frame similar to:
df <- data.frame(d = c("2011-01-01 00:00:00", "2011-01-01 00:01:00"), x = 0:1)
Apply split() on the date part of the timestamp, so that each day (rather than each individual timestamp) becomes one group:
dfl <- split(df, as.Date(df$d))
Map write.table over the split list:
Map(write.table, dfl, file = paste(names(dfl), "txt", sep = "."), row.names = FALSE, sep = ";")
dat <- readLines(textConnection(" 2011-01-01 00:00:00 text text text text
2011-01-01 00:01:00 text text text text text
2011-01-01 00:02:00 text text text text
2011-01-02 00:00:00 text text text text
2011-01-02 00:01:00 text text text text"))
grouped.lines <- split(dat, substr(dat, 1,11) )
grouped.lines
$` 2011-01-01`
[1] " 2011-01-01 00:00:00 text text text text"
[2] " 2011-01-01 00:01:00 text text text text text"
[3] " 2011-01-01 00:02:00 text text text text"
$` 2011-01-02`
[1] " 2011-01-02 00:00:00 text text text text"
[2] " 2011-01-02 00:01:00 text text text text"
It's more efficient to process these as separate items in a single list. It will create problems if you split them into separate objects. They can be accessed by text names or by numeric reference. (But do note that the leading space would need to be in the name if a leading space was in your text file.)
You will have to read all the lines of the file. The function read.table might help here; it is part of base R, so no extra package needs to be loaded. Then you will have to compare the date of each line and write the lines back out into two new files.
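The steps described in these answers can be put together roughly as follows. This is a sketch under a few assumptions: the lines have already been read with readLines, the first line starts with a date, and a line without a leading date (such as the lone "text" line in the example) belongs to the most recent dated line. Files are written to a temporary directory here only so the example does not touch the working directory.

```r
# Sample lines; in practice use lines <- readLines("yourfile.txt")
lines <- c(
  "2011-01-01 00:00:00 text text text text",
  "2011-01-01 00:01:00 text text text text",
  "text",
  "2011-01-01 00:02:00 text text text text",
  "2011-01-02 00:00:00 text text text text",
  "2011-01-02 00:01:00 text text text text"
)

# Which lines start with a yyyy-mm-dd date?
has_date <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2} ", lines)
stopifnot(has_date[1])  # assumption: the file begins with a dated line

# Carry the most recent date forward onto continuation lines
dated_days <- substr(lines[has_date], 1, 10)
day <- dated_days[cumsum(has_date)]

# One character vector per calendar day
by_day <- split(lines, day)

# Write one file per day, named after the date
out_dir <- tempdir()
for (d in names(by_day)) {
  writeLines(by_day[[d]], file.path(out_dir, paste0(d, ".txt")))
}
```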
