Regular expression to remove ellipses - r

I am trying to remove the dots at the end of the stname column, but nothing I am trying is working.
This is what the dataset looks like.
df = structure(list(stname = c("Alabama……………………………………",
"Alaska………………………………………", "American Samoa……………………………",
"Arizona………………………………………", "Arkansas……………………………………",
"California………………………………"), value = c(34305795,
20236292, 103657, 267021650, 15045025, 3976908430)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
I tried the following but the dots are still there.
library(tidyverse)
#remove non alpha-numeric characters
df %>%
mutate(stname = str_replace_all(stname, "[^[:alnum:][:space:]]", ""))
#remove dots
df %>%
mutate(stname = str_replace(stname, "\\.+", ""))
Neither of those approaches worked.

The problem is that the characters in your stname column are not actual dots. Each one is a horizontal ellipsis (Unicode U+2026; it is number 133 in the ANSI character table: http://www.alanwood.net/demos/ansi.html).
That is why your regex needs to search for the horizontal ellipsis instead. Try this:
df %>%
mutate(stname = str_replace(stname, "…+", ""))
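To confirm what the trailing characters really are, you can inspect their code points. The following is a minimal base R sketch; the short stname vector is a made-up stand-in for the question's column, and anchoring the pattern with $ (so that only trailing ellipses are removed) is my own assumption about what is wanted.

```r
# Made-up stand-in for the stname column; \u2026 is the horizontal ellipsis
stname <- c("Alabama\u2026\u2026\u2026", "American Samoa\u2026\u2026")

# Code point of the last character of the first value
last_char <- substring(stname[1], nchar(stname[1]))
code_point <- utf8ToInt(last_char)  # 8230, i.e. U+2026 rather than 46 (".")

# Remove one or more ellipses, but only at the end of each string
cleaned <- sub("\u2026+$", "", stname)
print(cleaned)
```

If code_point were 46, the original "\\.+" pattern would have been the right one.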

Related

How to replace multiple string responses in multiple columns

I have imported a csv dataset with many columns and many responses. I want to look at specific columns and replace a set of responses.
In my dataset, I have: hairtypeDad, hairtypeMom, hairtypeBro1, hairtypeSis1, which are all located in different areas of my file. Within these are many responses that I want to change, including but not limited to:
Straight= straightened,
Curly= curled
Wavy = waved
wavyy=waved
cruley= curled
and so on.
Below is the code that I have tried so far:
hairdata <- read.csv ('alldata.csv', header = TRUE, stringAsFactors = FALSE)
hair_vars<- c ("hairtypeDad", "hairtypeMom", "hairtypeBro1", "hairtypeSis1")
hairdata[hair_vars]<-str_replace_all(hairdata[hair_vars],
c("Straight"= "straightened",
"Curly"= "curled",
"Wavy" = "waved",
"wavyy"= "waved"))
#I also tried:
hairdata %>% mutate(across(c("hairtypeDad", "hairtypeMom", "hairtypeBro1", "hairtypeSis1"),
fns= ~ str_replace_all(.,
c("Straight"= "straightened",
"Curly"= "curled",
"Wavy" = "waved",
"wavyy"= "waved"))
Ultimately, I want it to go from:
id  hairtypeMom  hairtypeDad  hairtypeBro1
1   Straight     Curly        wavyy
2   Wavy         Curly        Curly
to
id  hairtypeMom   hairtypeDad  hairtypeBro1
1   straightened  curled       waved
2   waved         curled       curled
but I am not getting what I need. Please help!
You were very close; you were just missing the period in .fns = (you had fns =). You were also missing a couple of closing parentheses.
library(tidyverse)
df %>%
mutate(across(
c("hairtypeDad", "hairtypeMom", "hairtypeBro1"),
.fns = ~ str_replace_all(.,
c("Straight" = "straightened",
"Curly" = "curled",
"Wavy" = "waved",
"wavyy" = "waved")
)
))
Output
  id  hairtypeMom hairtypeDad hairtypeBro1
1  1 straightened      curled        waved
2  2        waved      curled       curled
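Note that str_replace_all() replaces matching substrings, so a value such as "Wavyy" would become "wavedy". If whole values should be recoded instead, one alternative is a named lookup vector in base R. This is only a sketch: the lookup table and the choice to leave unmatched values unchanged are my assumptions, and the data frame repeats this answer's sample data.

```r
# Sample data (same values as in this answer)
df <- data.frame(
  id = 1:2,
  hairtypeMom = c("Straight", "Wavy"),
  hairtypeDad = c("Curly", "Curly"),
  hairtypeBro1 = c("wavyy", "Curly"),
  stringsAsFactors = FALSE
)

# Assumed lookup table: names are old values, elements are new values
lookup <- c(Straight = "straightened", Curly = "curled",
            Wavy = "waved", wavyy = "waved", cruley = "curled")

hair_vars <- c("hairtypeMom", "hairtypeDad", "hairtypeBro1")

# Recode each column by exact match; values not in the lookup stay as they are
df[hair_vars] <- lapply(df[hair_vars], function(col) {
  recoded <- unname(lookup[col])
  ifelse(is.na(recoded), col, recoded)
})
print(df)
```

Because the match is on the whole value, the order of the entries in lookup does not matter here.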
Data
df <- structure(list(id = 1:2, hairtypeMom = c("Straight", "Wavy"),
hairtypeDad = c("Curly", "Curly"), hairtypeBro1 = c("wavyy",
"Curly")), class = "data.frame", row.names = c(NA, -2L))

How to build a new variable from a col with a lot of words

I have data that looks like this:
I would like to build a new variable that shows only the music-related hobbies. I tried to use gsub to build it, but it did not work. Any suggestions on how to do this? The solution need not be limited to gsub.
My code is: df$music<-gsub("Sawing"|"Cooking", "", df$Hobby)
The outcome should be something that looks like this:
Sample data can be built using this code:
df<- structure(list(Hobby = c("cooking, sawing, piano, violin", "cooking, violin",
"piano, sawing", "sawing, cooking")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
The pattern should be a single quoted string, "Sawing|Cooking", not two quoted strings joined by |, as in "Sawing"|"Cooking".
df$music<- trimws(gsub("Sawing|Cooking", "", df$Hobby, ignore.case = TRUE),
whitespace ="(,\\s*){1,}")
trimws will remove any leading/trailing commas and spaces left behind by gsub.
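The two steps can be looked at separately to see why trimws is needed. This is a small sketch using the Hobby values from the question as a plain character vector; the variable names step1 and music are only for illustration, and the whitespace argument of trimws needs R 3.6.0 or later.

```r
hobby <- c("cooking, sawing, piano, violin", "cooking, violin",
           "piano, sawing", "sawing, cooking")

# Step 1: delete the unwanted words; the separators are left behind
step1 <- gsub("Sawing|Cooking", "", hobby, ignore.case = TRUE)
print(step1)

# Step 2: strip runs of "comma + optional spaces" from both ends
music <- trimws(step1, whitespace = "(,\\s*){1,}")
print(music)
```

This only cleans up the ends of each string; an unwanted word sitting between two wanted ones would still leave a doubled separator in the middle.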
The opposite would be to extract the words of interest and paste them
library(stringr)
sapply(str_extract_all(df$Hobby, 'piano|violin'), toString)
Another way to do this would be :
library(dplyr)
library(tidyr)
df %>%
mutate(index = row_number()) %>%
separate_rows(Hobby, sep = ',\\s*') %>%
group_by(index) %>%
summarise(Music = toString(setdiff(Hobby, c('sawing', 'cooking'))),
Hobby = toString(Hobby)) %>%
select(Hobby,Music)
#  Hobby                          Music
#  <chr>                          <chr>
#1 cooking, sawing, piano, violin "piano, violin"
#2 cooking, violin                "violin"
#3 piano, sawing                  "piano"
#4 sawing, cooking                ""

Removing Stop words from a list of strings in R

Sample data
Dput code of my data
x <- structure(list(Comments = structure(2:1, .Label = c("I have a lot of home-work to be completed..",
"I want to vist my teacher today only!!"), class = "factor"),
Comment_ID = c(704, 802)), class = "data.frame", row.names = c(NA,
-2L))
I want to remove the stop words from the above data set using tidytext::stop_words$word and also retain the same columns in the output. Along with this how can I remove punctuation in tidytext package?
Note: I don't want to change my dataset into corpus
You can collapse all the words in tidytext::stop_words$word into one regex, adding word boundaries around each word. However, tidytext::stop_words$word has 1149 entries, which might be too many for a single regex to handle, so you can drop the words you do not need before applying this.
For example, taking only the first 10 words from tidytext::stop_words$word, you can do:
gsub(paste0(paste0('\\b', tidytext::stop_words$word[1:10], '\\b',
collapse = "|"), '|[[:punct:]]+'), '', x$Comments)
#[1] "I want to vist my teacher today only"
# "I have lot of homework to be completed"
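The same idea can be tried without tidytext installed. In this sketch the five-word stop list is a made-up stand-in for tidytext::stop_words$word, and the final step that squeezes repeated spaces is my own addition, since deleting a word leaves its surrounding spaces behind.

```r
comments <- c("I want to vist my teacher today only!!",
              "I have a lot of home-work to be completed..")

# Hypothetical short stop list standing in for tidytext::stop_words$word
stop_words_demo <- c("a", "to", "of", "be", "my")

# One alternation wrapped in word boundaries: \b(a|to|of|be|my)\b
pattern <- paste0("\\b(", paste(stop_words_demo, collapse = "|"), ")\\b")

no_stop  <- gsub(pattern, "", comments, ignore.case = TRUE, perl = TRUE)
no_punct <- gsub("[[:punct:]]+", "", no_stop)

# Collapse the double spaces left by the deletions and trim the ends
cleaned <- trimws(gsub("\\s+", " ", no_punct))
print(cleaned)
```

The word boundaries are what keep "to" from being removed out of "today".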
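Another option is removeWords() from the tm package, which also works on a plain character vector (called clean_tweet here):
library(tm)
clean_tweet = removeWords(clean_tweet, stopwords("english"))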

Add space between a character in R

Background
I have a dataset, df. Whenever I try to rename the 'TanishaIsCool' column, I get an error: unexpected string constant. I wish to add spaces within my column name.
TanishaIsCool Hello
hi hi
This is what I am doing:
df1 <- df %>% rename(Tanisha Is Cool = `TanishaIsCool` )
Desired output
Tanisha Is Cool Hello
hi hi
dput
structure(list(TanishaIsCool = structure(1L, .Label = "hi", class = "factor"),
Hello = structure(1L, .Label = "hi", class = "factor")), class = "data.frame", row.names = c(NA,
-1L))
Your attempt was nearly there; it was only missing the backticks around the new name:
df1 <- df %>% rename(`Tanisha Is Cool` = TanishaIsCool)
However, most recommendations advise against spaces in variable names (and I agree completely, after struggling with one particular dataset myself), because every time you reference such a variable you have to wrap it in backticks, which gets cumbersome.
Just realised @thelatemail has answered exactly this in the comments!
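To illustrate the point about backticks, here is a small base R sketch. The one-row data frame mirrors the question's data, but the values are stored as plain strings rather than factors, which is my simplification.

```r
df1 <- data.frame(TanishaIsCool = "hi", Hello = "hi", stringsAsFactors = FALSE)

# Assigning through names() accepts any string, including one with spaces
names(df1)[1] <- "Tanisha Is Cool"

# A name with spaces needs backticks after $ ...
with_backticks <- df1$`Tanisha Is Cool`

# ... or has to be passed as a quoted string
with_brackets <- df1[["Tanisha Is Cool"]]

# A syntactic name needs neither
plain <- df1$Hello
```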
We can use gsub to capture a lower-case letter (([a-z])) followed by an upper-case letter (([A-Z])) and, in the replacement, use the backreferences to the captured groups (\\1, \\2) with a space between them
colnames(df1) <- gsub("([a-z])([A-Z])", "\\1 \\2", colnames(df1))
df1
# Tanisha Is Cool Hello
#1 hi hi
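To see what this pattern does and does not split, here is a small base R check. Only the first two names come from the question; myHTMLParser is invented for illustration.

```r
nms <- c("TanishaIsCool", "Hello", "myHTMLParser")

# Insert a space wherever a lower-case letter is directly followed by an upper-case one
spaced <- gsub("([a-z])([A-Z])", "\\1 \\2", nms)
print(spaced)
```

Runs of capitals such as HTML stay together, because the pattern only fires on a lower-case to upper-case transition.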
With tidyverse, an option is
library(dplyr)
library(stringr)
library(magrittr)
df1 %<>%
rename_all(~ str_replace_all(., "([a-z])([A-Z])", "\\1 \\2"))
For selected columns, use rename_at
df1 %<>%
rename_at(1, ~ str_replace_all(., "([a-z])([A-Z])", "\\1 \\2"))
Another option is regex lookaround
gsub("(?<=[a-z])(?=[A-Z])", " ", names(df1), perl = TRUE)
#[1] "Tanisha Is Cool" "Hello"
If we need to update only selected column names, then use an index; here it is the first column
names(df1)[1] <- gsub("(?<=[a-z])(?=[A-Z])", " ", names(df1)[1], perl = TRUE)

Spread dataframe

I have the following dataframe/tibble sample:
structure(list(name = c("Contents.Key", "Contents.LastModified",
"Contents.ETag", "Contents.Size", "Contents.Owner", "Contents.StorageClass",
"Contents.Bucket", "Contents.Key", "Contents.LastModified", "Contents.ETag"
), value = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0_0e94e664-4d5e-4646-b2b9-1937398cfaed_2019-01-01-07-54-46-064",
"2019-01-01T07:54:47.000Z", "\"378d04496cb27d93e1c37e1511a79ec7\"",
"24187", "e7c0d260939d15d18866126da3376642e2d4497f18ed762b608ed2307778bdf1",
"STANDARD", "vfevvv-edrfvevevev-streamed-data", "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0_33a8ba28-245c-490b-99b2-254507431d47_2019-01-01-07-54-56-755",
"2019-01-01T07:54:57.000Z", "\"df8cc7082e0cc991aa24542e2576277b\""
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
I want to spread the name column using the tidyr::spread() function, but I don't get the desired result
df %>% tidyr::spread(key = name, value = value)
I get an error:
Error: Duplicate identifiers for rows:...
I also tried the melt function, with the same result.
I have connected to S3 using the aws.s3::get_bucket() function and am trying to convert the result to a data frame. I am aware there is an aws.s3::get_bucket_df() function which should do this, but it doesn't work (you may look at my related question).
After getting the bucket list, I unlisted it and ran the enframe command.
Please advise.
You can introduce a row-number column first. This gives every row a unique identifier, but it also introduces NAs that you will have to deal with.
df %>%
mutate(RN=row_number()) %>%
group_by(RN) %>%
spread(name,value)
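If the NAs are a problem, one possible variation is to number the records rather than the rows, so that all fields of one object end up on the same line. The sketch below is not the method above: it swaps in base R's reshape() for spread() to stay self-contained, it assumes that every record starts with a Contents.Key row, and the short data frame with its values is made up for illustration.

```r
# Made-up long data: two complete records and one incomplete record
df <- data.frame(
  name  = c("Contents.Key", "Contents.Size",
            "Contents.Key", "Contents.Size",
            "Contents.Key"),
  value = c("a.json", "24187", "b.json", "512", "c.json"),
  stringsAsFactors = FALSE
)

# Assumption: a new record begins at each Contents.Key row
df$record <- cumsum(df$name == "Contents.Key")

# One row per record, one column per field; missing fields become NA
wide <- reshape(df, idvar = "record", timevar = "name", direction = "wide")
print(wide)
```

With this layout an NA only appears where a record really lacks a field, as in the last record here.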
