R - delete length-one strings and stopwords (using tidytext) in character

If I have a df:
Class sentence
1 Yes there is p beaker on the table
2 Yes they t the frown
3 Yes so Z it was asleep
How do I remove the length-one strings in the "sentence" column (things like "t", "p" and "Z"), and then do a final clean using the stop_words list in tidytext to get the result below?
Class sentence
1 Yes beaker table
2 Yes frown
3 Yes asleep

If we want to use tidytext, create a sequence column (row_number()) and apply unnest_tokens on the sentence column. Then do an anti_join with the default data from get_stopwords(), filter out the words that are only one character long, and finally group by the row identifier and paste the 'word' column back together to recreate 'sentence'.
library(dplyr)
library(tidytext)
library(stringr)
df %>%
mutate(rn = row_number()) %>%
unnest_tokens(word, sentence) %>%
anti_join(get_stopwords(), by = "word") %>%
filter(nchar(word) > 1) %>%
group_by(rn, Class) %>%
summarise(sentence = str_c(word, collapse = ' '), .groups = 'drop') %>%
select(-rn)
-Output
# A tibble: 3 x 2
Class sentence
<chr> <chr>
1 Yes beaker table
2 Yes frown
3 Yes asleep
Data
df <- structure(list(Class = c("Yes", "Yes", "Yes"), sentence = c("there is p beaker on the table",
"they t the frown", "so Z it was asleep")),
class = "data.frame", row.names = c("1",
"2", "3"))
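The same steps (tokenise, drop length-one tokens, drop stop words, paste back) can be sketched in base R. This is only an illustration: my_stopwords is a small hand-written vector assumed for this example, not the tidytext lexicon, and clean_sentence is a hypothetical helper name.

```r
# Example data from the question
df <- data.frame(
  Class = c("Yes", "Yes", "Yes"),
  sentence = c("there is p beaker on the table",
               "they t the frown",
               "so Z it was asleep")
)

# Hypothetical, hand-written stop word list (NOT the tidytext lexicon)
my_stopwords <- c("there", "is", "on", "the", "they", "so", "it", "was")

clean_sentence <- function(x, stopwords) {
  # One character vector of lower-cased tokens per sentence
  tokens <- strsplit(tolower(x), "\\s+")
  kept <- lapply(tokens, function(w) {
    # Drop length-one tokens and stop words
    w[nchar(w) > 1 & !(w %in% stopwords)]
  })
  # Paste the surviving tokens of each sentence back together
  vapply(kept, paste, character(1), collapse = " ")
}

df$sentence <- clean_sentence(df$sentence, my_stopwords)
df
```

With a real lexicon, the vector could be taken from the word column of the stop word data instead of being typed by hand.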

Related

Filter based on different conditions at different positions in a string in R

The middle part of the string is the ID, and I want only one occurrence of each ID. If there is more than one observation with the same six middle letters, I need to keep the one that says "07" rather than "08", or "A" rather than "B". I want to completely exclude if the number is "02". Other than that, if there is only one occurrence of the ID, I want to keep it. So if I had:
col1
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-XDUMNG-08B-01
ID-1-LOFBUM-02A-01
ID-1-ABYEMJ-08A-01
ID-1-ABYEMJ-08B-01
Then I would want:
col1
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-ABYEMJ-08A-01
I am thinking maybe I can use group_by to specify the 6 letter ID, and then some kind of if_else statement? But I can't figure out how to specify the positions of the characters in the string. Any help is greatly appreciated!

Using extract and some dplyr wrangling:
library(tidyr)
library(dplyr)
df %>%
extract(col1, "ID-\\d-(.*)-(\\d*)(A|B)-01",
into = c("ID", "number", "letter"),
remove = FALSE, convert = TRUE) %>%
group_by(ID) %>%
filter(number != 2) %>%
# order() returns sort positions, not ranks; order(order(...)) converts them to ranks
slice_min(order(order(number, letter)), n = 1) %>%
ungroup() %>%
select(col1)
# col1
#1 ID-1-ABYEMJ-08A-01
#2 ID-1-AMBCFG-07A-01
#3 ID-1-CGUMBD-08A-01
#4 ID-1-XDUMNG-07B-01

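An option with str_detect. Note that this keeps the first row of each ID, so it relies on the rows already being ordered so that "07" comes before "08" and "A" before "B" within an ID, as in the example data.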
library(stringr)
library(dplyr)
df1 %>%
group_by(ID = str_extract(col1, "ID-\\d+-\\w+")) %>%
filter(str_detect(col1, "02", negate = TRUE), row_number() == 1) %>%
ungroup %>%
select(-ID)
-output
# A tibble: 4 × 1
col1
<chr>
1 ID-1-AMBCFG-07A-01
2 ID-1-CGUMBD-08A-01
3 ID-1-XDUMNG-07B-01
4 ID-1-ABYEMJ-08A-01
data
df1 <- structure(list(col1 = c("ID-1-AMBCFG-07A-01", "ID-1-CGUMBD-08A-01",
"ID-1-XDUMNG-07B-01", "ID-1-XDUMNG-08B-01", "ID-1-LOFBUM-02A-01",
"ID-1-ABYEMJ-08A-01", "ID-1-ABYEMJ-08B-01")), class = "data.frame",
row.names = c(NA,
-7L))
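For the part of the question about addressing positions in the string, a base R sketch is shown here. It assumes every value has exactly five "-" separated fields with a two-digit number followed by one letter in the fourth field; the variable names (parts, id, number, letter) are chosen for this illustration only.

```r
df1 <- data.frame(col1 = c("ID-1-AMBCFG-07A-01", "ID-1-CGUMBD-08A-01",
                           "ID-1-XDUMNG-07B-01", "ID-1-XDUMNG-08B-01",
                           "ID-1-LOFBUM-02A-01", "ID-1-ABYEMJ-08A-01",
                           "ID-1-ABYEMJ-08B-01"))

# Split on "-" and address the pieces by position
parts  <- do.call(rbind, strsplit(df1$col1, "-", fixed = TRUE))
id     <- parts[, 3]                            # six-letter ID
number <- as.integer(substr(parts[, 4], 1, 2))  # "07" -> 7
letter <- substr(parts[, 4], 3, 3)              # "A" or "B"

# Sort so the preferred row (lowest number, then letter) comes first per ID
ord <- order(id, number, letter)
# Exclude rows whose number is 02
ord <- ord[number[ord] != 2]
# Keep the first remaining row of each ID, restored to the original row order
first_per_id <- ord[!duplicated(id[ord])]
result <- df1[sort(first_per_id), , drop = FALSE]
result
```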

How to join multiple columns together on blanks of one column in R

This is my dataframe:
df <- data.frame(option_1 = c("Box 1", "", ""), option_2 = c("", 4, ""), Width = c("","",3))
I want to get this data frame:
option_1
1 Box 1
2 4
3 3
I'm doing this on a much bigger dataframe with 5+ columns I'm merging on blanks with respect to the option_1 column. I have tried using coalesce, but some of the columns won't "merge" on the blanks. For example:
df %>%
mutate(option_value_1 = coalesce(option_value_1, option_value_2, option_value_3, option_value_4, option_value_5, option_value_6, option_value_7))
option_value_5 wouldn't come together with option_value_1 on the blanks, but the other option values did. Should I put the vectors in a list then use coalesce?

We convert the blanks ("") to NA and splice the columns into coalesce with the big-bang (!!!) operator. According to ?"!!!":
The big-bang operator !!! forces-splice a list of objects. The elements of the list are spliced in place, meaning that they each become one single argument.
library(dplyr)
df %>%
# na_if() no longer accepts a whole data frame as of dplyr 1.1.0, so apply it per column
mutate(across(everything(), ~ na_if(.x, ""))) %>%
transmute(option_1 = coalesce(!!! .))
-output
option_1
1 Box 1
2 4
3 3
If we are interested only in the 'option' columns, subset those columns (invoke from purrr can also be used with coalesce):
library(purrr)
df %>%
mutate(across(everything(), ~ na_if(.x, ""))) %>%
mutate(option_1 = invoke(coalesce,
across(starts_with("option"))), .keep = "unused")

With a base R approach:
df <- data.frame(option_1 = apply(df, 1, \(x) paste(x, collapse = "")))
df
#> option_1
#> 1 Box 1
#> 2 4
#> 3 3
Or using tidyverse:
df %>%
rowwise %>%
transmute(option_1 = str_c(c_across(everything()), collapse = "")) %>%
ungroup
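The "blank to NA, then take the first non-missing value per row" idea can also be sketched without any packages. The helper name blank_to_na is hypothetical, and the sketch assumes all columns are character and that blanks are exactly "" (values such as " " would need trimming first).

```r
df <- data.frame(option_1 = c("Box 1", "", ""),
                 option_2 = c("", "4", ""),
                 Width    = c("", "", "3"))

# Turn empty strings into NA, column by column
blank_to_na <- function(x) replace(x, x == "", NA)
cols <- lapply(df, blank_to_na)

# Fold the columns left to right, filling NAs from the next column;
# this mimics what coalesce() does across several vectors
first_non_missing <- Reduce(function(a, b) ifelse(is.na(a), b, a), cols)

result <- data.frame(option_1 = first_non_missing)
result
```

Unlike the paste-based versions, this keeps only the first non-blank value when a row has more than one.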

how to build a new variable by extract a string from another variable

I have a df that looks like this, and I would like to build a new variable Main when Subject contains Math or ELA. The sample data and my code are:
df<- structure(list(Subject = c("Math", "Math,ELA", "Math,ELA, PE",
"PE, Math", "ART,ELA", "PE,ART")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
df<-df %>%
mutate(Main=case_when (grepl("Math|ELA", Subject)~ paste0(str_extract_all(df$Subject, "Math|ELA"))))
However, my outcome is not the one I want. What did I do wrong? I feel that my code overcomplicates a simple step. Is there a better solution?

str_extract_all returns a list. We need to loop over the list and paste/str_c
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate(Main = case_when(grepl("Math|ELA", Subject)~
map_chr(str_extract_all(Subject, "Math|ELA"), toString)))
-output
# A tibble: 6 x 2
# Subject Main
# <chr> <chr>
#1 Math Math
#2 Math,ELA Math, ELA
#3 Math,ELA, PE Math, ELA
#4 PE, Math Math
#5 ART,ELA ELA
#6 PE,ART <NA>
Or another option is separate_rows from tidyr
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(Subject) %>%
group_by(rn) %>%
summarise(Main = toString(intersect(Subject, c("Math", "ELA"))),
.groups = 'drop') %>%
select(Main) %>%
bind_cols(df, .)
NOTE: without collapse, paste returns its input unchanged, and because str_extract_all returns a list, we need to loop over that list to combine the matches of each element
Or another option is to use
trimws(gsub("(Math|ELA)(*SKIP)(*FAIL)|\\w+", "", df$Subject, perl = TRUE), whitespace = ",\\s*")
#[1] "Math" "Math,ELA" "Math,ELA" "Math" "ELA" ""
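To see why the list has to be looped over, here is a small base R sketch. It uses regmatches/gregexpr in place of str_extract_all (both return one element per input string), and the two-element subject vector is made up for the illustration.

```r
subject <- c("Math,ELA", "PE,ART")

# One list element per input string, holding all matches found in it
matches <- regmatches(subject, gregexpr("Math|ELA", subject))

# paste0() on the list deparses each element instead of joining its contents
as_text <- paste0(matches)
as_text

# Looping over the list and collapsing each element gives one string per row
per_element <- vapply(matches, toString, character(1))
per_element
```

The first result contains R code as text, such as c("Math", "ELA"), which is the kind of unwanted outcome described in the question; the second contains "Math, ELA" and an empty string for the row without a match.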

Here is a base R option using regmatches
transform(
df,
Main = sapply(
regmatches(Subject, gregexpr("Math|ELA", Subject)),
function(x) replace(toString(x), !length(x), NA)
)
)
which gives
Subject Main
1 Math Math
2 Math,ELA Math, ELA
3 Math,ELA, PE Math, ELA
4 PE, Math Math
5 ART,ELA ELA
6 PE,ART <NA>

Simplifying tables (squashing them!) in R- basic q

I have a basic q I would like a quick R solution in...
I have a tab delimited table with multiple rows, but I want to "squash" all rows into one... for example:
name  day  red  blue  orange  black
bill  1    yes
bill  2         yes
bill  3                       yes
bill  4               no
But I want the output to be independent of day:
name red blue orange black
bill yes yes no yes
So essentially I am squashing the table down to include all answers regardless of the day. NB: There are never any overlaps i.e. Bill will select only one colour per day.
I could do this in excel, but I'd prefer to find an R solution... happy for guidance even wrt which libraries would be useful :).
Go easy on me, I'm a clinician not a bioinformatician!

Here is an option with dplyr. If the missing values are "", after grouping over 'name', summarise by looping across the columns and get the elements that are not a blank (.[. != ""])
library(dplyr)
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ .[.!= '']))
Or if the missing values are NA
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ .[!is.na(.)]))
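If there is more than one non-missing element per group, the code above no longer returns a single row per name (and it fails when the columns have different numbers of non-missing values). Instead, we can paste the values together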
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ toString(.[!is.na(.)])))
If there are both NA and "", an option is to convert the "" to NA and then use is.na or complete.cases or with na.omit
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ toString(na.omit(na_if(., "")))))

In base R, you could use aggregate and select non-blank values for each name.
aggregate(cbind(red,blue,orange,black)~name, df, function(x) toString(x[x!='']))
# name red blue orange black
#1 bill yes yes no yes
data
df <- structure(list(name = c("bill", "bill", "bill", "bill"), day = 1:4,
red = c("yes", "", "", ""), blue = c("", "yes", "", ""),
orange = c("", "", "", "no"), black = c("", "", "yes", ""
)), class = "data.frame", row.names = c(NA, -4L))
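A variant that treats both "" and NA as missing in one helper can be sketched as follows. The function name squash is hypothetical, and returning NA when a column has no answer at all is an assumption made for this example.

```r
df <- data.frame(name = c("bill", "bill", "bill", "bill"), day = 1:4,
                 red = c("yes", "", "", ""), blue = c("", "yes", "", ""),
                 orange = c("", "", "", "no"), black = c("", "", "yes", ""))

# Collapse one column of one group to a single string
squash <- function(x) {
  x <- x[!is.na(x) & x != ""]          # drop both kinds of missing value
  if (length(x) == 0) NA_character_ else toString(unique(x))
}

colour_cols <- c("red", "blue", "orange", "black")
result <- aggregate(df[colour_cols], by = list(name = df$name), FUN = squash)
result
```

Because day is not included in the aggregated columns, the result has one row per name.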

Iterate over column names and separate fields recursively with dplyr

I want to iterate over column names of the data frame, then using dplyr, separate fields using a delimiter(->) found among the row fields. This is how the dataset looks like :
dput(df)
structure(list(v1 = c("Silva->Mark", "Brandon->Livo", "Mango->Apple"),
v2 = c("Austin", "NA ", "Orange"),
v3 = c("James -> Jacy","NA->Jane", "apple -> Orange")),
class = "data.frame", row.names = c(NA, -3L))
Now I wrote a code that filters out column names with delimiter(->) on rows which are column v1 and column v3. Here is the code:
rows_true <- apply(df,2,function(x) any(sapply(x,function(y)grepl("->",y))))
ss<-df[,rows_true]
Then I tried to loop through those column names so that I can separate using the delimiter using this code but it ain't working
cols<- names(df)
if (names %in% df){
splitcols <- ss %>%
tidyr::separate(cols, into = c(paste0(names,+ "old"), "paste0(names,+ "New")"), sep = "->")
}
The reason I am using paste0 is because I do want the columns split into two using the delimiter then the newly formed columns should be named using the original name plus suffix Old for the first one and New for second split column
End result after looping through column names and recursively separating them should look like this
dput(df)
structure(list(v1_Old = c("Silva", "Brandon", "Mango"),
v1_New = c("Mark", "Livo", "Apple"),
v3_Old = c("James","NA", "apple"),
v3_New = c("Jacy","Jane", "Orange")),
class = "data.frame", row.names = c(NA, -3L))

For the sake of completeness, here is also a solution which uses data.table().
There are some differences to the other answers posted so far:
It is not required to identify the columns to be split beforehand. Instead, columns without "->" are dropped from the result on the fly.
The regular expression used for splitting, " *-> *", includes the surrounding white space (if any). This avoids calling trimws() on the resulting pieces afterwards or removing white space beforehand.
library(data.table)
library(magrittr) # piping used to improve readability
setDT(df)
lapply(names(df), function(x) {
mDT <- df[, tstrsplit(get(x), " *-> *")]
if (ncol(mDT) == 2L) setnames(mDT, paste0(x, c("_Old", "_New")))
}) %>% as.data.table()
v1_Old v1_New v3_Old v3_New
1: Silva Mark James Jacy
2: Brandon Livo NA Jane
3: Mango Apple apple Orange

One possibility involving dplyr and tidyr could be:
library(dplyr)
library(tidyr)
library(tibble) # for rowid_to_column()
df %>%
select(v1, v3) %>%
rowid_to_column() %>%
gather(var, val, -rowid) %>%
separate_rows(val, sep = "->", convert = TRUE) %>%
group_by(rowid) %>%
mutate(val = trimws(val),
var = make.unique(var)) %>%
ungroup() %>%
spread(var, val) %>%
select(-rowid)
v1 v1.1 v3 v3.1
<chr> <chr> <chr> <chr>
1 Silva Mark James Jacy
2 Brandon Livo <NA> Jane
3 Mango Apple apple Orange
Or to further match the expected output:
df %>%
select(v1, v3) %>%
rowid_to_column() %>%
gather(var, val, -rowid) %>%
separate_rows(val, sep = "->", convert = TRUE) %>%
group_by(rowid, var) %>%
mutate(val = trimws(val),
var2 = if_else(row_number() == 1, paste0(var, "_Old"), paste0(var, "_New"))) %>%
ungroup() %>%
select(-var) %>%
spread(var2, val) %>%
select(v1_Old, v1_New, v3_Old, v3_New)
v1_Old v1_New v3_Old v3_New
<chr> <chr> <chr> <chr>
1 Silva Mark James Jacy
2 Brandon Livo <NA> Jane
3 Mango Apple apple Orange

A different approach with dplyr, purrr, and stringr is the following (my_df is the data frame from the question).
library(dplyr)
library(purrr)
library(stringr)
# Detect the columns with at least one "->"
my_df_cols <- map_lgl(my_df, ~ any(str_detect(., "->")))
my_df %>%
# Select only the columns with at least one "->"
select(which(my_df_cols)) %>%
# Mutate these columns and only keep the mutated columns with new names;
# splitting on "\\s*->\\s*" also drops the white space around the arrow
transmute_all(list(old = ~ str_split(., "\\s*->\\s*", simplify = TRUE)[, 1],
new = ~ str_split(., "\\s*->\\s*", simplify = TRUE)[, 2]))
# v1_old v3_old v1_new v3_new
# 1 Silva James Mark Jacy
# 2 Brandon NA Livo Jane
# 3 Mango apple Apple Orange

We can also use cSplit from splitstackshape
#Detect columns with "->"
cols <- names(df)[colSums(sapply(df, grepl, pattern = "->")) > 0]
#Remove unwanted whitespaces before and after "->"
df[cols] <- lapply(df[cols], function(x) gsub("\\s+", "", x))
#Split into new columns specifying sep as "->"
splitstackshape::cSplit(df[cols], cols, sep = "->")
# v1_1 v1_2 v3_1 v3_2
#1: Silva Mark James Jacy
#2: Brandon Livo <NA> Jane
#3: Mango Apple apple Orange
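For comparison, iterating over the column names and building the _Old/_New columns can be sketched in base R alone. The helper split_col is hypothetical, and the sketch assumes each value in a selected column contains exactly one "->". As in some of the answers above, the string "NA" is kept as text rather than converted to a missing value.

```r
df <- data.frame(
  v1 = c("Silva->Mark", "Brandon->Livo", "Mango->Apple"),
  v2 = c("Austin", "NA ", "Orange"),
  v3 = c("James -> Jacy", "NA->Jane", "apple -> Orange")
)

# Which columns contain the delimiter at least once?
has_arrow <- vapply(df, function(x) any(grepl("->", x, fixed = TRUE)), logical(1))

# Split one column into two and name the pieces <name>_Old and <name>_New
split_col <- function(x, name) {
  # The pattern also swallows white space around the arrow
  pieces <- do.call(rbind, strsplit(x, "\\s*->\\s*"))
  out <- data.frame(pieces[, 1], pieces[, 2])
  names(out) <- paste0(name, c("_Old", "_New"))
  out
}

# Apply the helper to every selected column; unname() stops cbind()
# from prefixing the new column names with the list names
pieces_list <- Map(split_col, df[has_arrow], names(df)[has_arrow])
result <- do.call(cbind, unname(pieces_list))
result
```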
