Add spaces between words in a column name in R

Background
I have a dataset, df. Whenever I try to rename the 'TanishaIsCool' column, I get an error: unexpected string constant. I wish to add spaces within my column name:
TanishaIsCool Hello
hi hi
This is what I am doing:
df1 <- df %>% rename(Tanisha Is Cool = `TanishaIsCool` )
Desired output
Tanisha Is Cool Hello
hi hi
dput
structure(list(TanishaIsCool = structure(1L, .Label = "hi", class = "factor"),
Hello = structure(1L, .Label = "hi", class = "factor")), class = "data.frame", row.names = c(NA,
-1L))

Your attempt was nearly there, except missing the backquotes/backticks:
df1 %>% rename(`Tanisha Is Cool` = TanishaIsCool)
However, I believe you will find that most recommendations (and I agree completely, after my own experience of struggling with one particular dataset) advise against using spaces in variable names: whenever you reference such a variable you will always have to wrap it in backticks, which can get pretty cumbersome.
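For example, a small sketch of that cost (assuming dplyr is loaded; df2 is just my name for the renamed copy): once the name contains a space, every later reference needs backticks:
df2 <- df1 %>% rename(`Tanisha Is Cool` = TanishaIsCool)
df2 %>% select(`Tanisha Is Cool`)
df2$`Tanisha Is Cool`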
Just realised @thelatemail has answered exactly this in a comment!

We can use gsub to capture the lower case letter (([a-z])), then the upper case letter (([A-Z])), and in the replacement use the backreferences to the captured groups (\\1 and \\2) to create the space between them:
colnames(df1) <- gsub("([a-z])([A-Z])", "\\1 \\2", colnames(df1))
df1
# Tanisha Is Cool Hello
#1 hi hi
With tidyverse, an option is
library(dplyr)
library(stringr)
library(magrittr)
df1 %<>%
rename_all(~ str_replace_all(., "([a-z])([A-Z])", "\\1 \\2"))
For selected columns, use rename_at
df1 %<>%
rename_at(1, ~ str_replace_all(., "([a-z])([A-Z])", "\\1 \\2"))
Another option is regex lookaround
gsub("(?<=[a-z])(?=[A-Z])", " ", names(df1), perl = TRUE)
#[1] "Tanisha Is Cool" "Hello"
If we need to update only selected column names, use an index; here it is the first column:
names(df1)[1] <- gsub("(?<=[a-z])(?=[A-Z])", " ", names(df1)[1], perl = TRUE)

Related

using tidyr separate function to split by \ backslash

I would like to split text in a column by '\' using the separate function in tidyr. Given this example data...
library(tidyr)
df1 <- structure(list(Parent.objectId = 1:2, Attachment.path = c("photos_attachments\\photos_image-20220602-192146.jpg",
"photos_attachments\\photos_image-20220602-191635.jpg")), row.names = 1:2, class = "data.frame")
And I've tried multiple variations of this...
df2 <- df1 %>%
separate(Attachment.path,c("a","b","c"),sep="\\",remove=FALSE,extra="drop",fill="right")
Which doesn't result in an error, but it doesn't split the string into two columns, likely because I'm not using the correct regular expression for the single backslash.
We may need to escape
library(tidyr)
separate(df1, Attachment.path,c("a","b","c"),
sep= "\\\\", remove=FALSE, extra="drop", fill="right")
According to ?separate
sep - ... The default value is a regular expression that matches any sequence of non-alphanumeric values.
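As a quick sanity check on the escaping (just an illustration): the "\\\\" in the R source is a two-character string, the regex \\, which matches one literal backslash:
nchar("\\\\") # 2 -- the string itself is just \\
grepl("\\\\", "photos_attachments\\photos_image-20220602-192146.jpg") # TRUE -- matches the single literal backslash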
Rather than splitting on \ at all (assuming you are trying to get the folder and file names), try these two functions:
#get filenames
basename(df1$Attachment.path)
# [1] "photos_image-20220602-192146.jpg" "photos_image-20220602-191635.jpg"
#get foldernames
basename(dirname(df1$Attachment.path))
# [1] "photos_attachments" "photos_attachments"

How to combine two specific cells using a comma separation in R

I have the following data set
I'd like it such that cell [2,6] retains its current content with the addition of the cell below it, separated by a ",". I have found paste() functions for concatenating columns into new columns, but can't find an answer for specific cell combinations. Any help appreciated.
For example, I'd like the highlighted cell in the above to read John, Mary
Here is a way you can paste values together with the values in the row below.
# Loading required libraries
library(dplyr)
# Creating sample data
example <- data.frame(Type = c("Director", NA_character_),
                      Name = c("John", "Mary"))
example %>%
  # Get the value of the row below
  mutate(new_col = lead(Name, 1),
         # If Type is Director then paste the names, else NA
         new_col = ifelse(Type == "Director", paste(Name, new_col, sep = ", "), NA_character_))
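With the sample data above, I would expect the result to look roughly like this:
#       Type Name    new_col
# 1 Director John John, Mary
# 2     <NA> Mary       <NA>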

How to build a new variable from a col with a lot of words

I have data that looks like this:
And I would like to build a new variable that only shows the music ones. I tried to use gsub to build it but it did not work. Any suggestion on how to do this, not limited to gsub.
My code is: df$music <- gsub("Sawing"|"Cooking", "", df$Hobby)
The outcome should be something that looks like this:
Sample data can be built using:
df<- structure(list(Hobby = c("cooking, sawing, piano, violin", "cooking, violin",
"piano, sawing", "sawing, cooking")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
The opening and closing double quotes should be a single pair, i.e. "Sawing|Cooking" and not "Sawing"|"Cooking", in the pattern:
df$music<- trimws(gsub("Sawing|Cooking", "", df$Hobby, ignore.case = TRUE),
whitespace ="(,\\s*){1,}")
trimws will remove the leading/trailing , along with spaces (if any)
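With the sample data above, df$music should come out roughly as:
# [1] "piano, violin" "violin"        "piano"         ""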
The opposite would be to extract the words of interest and paste them
library(stringr)
sapply(str_extract_all(df$Hobby, 'piano|violin'), toString)
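This should return the same values, roughly:
# [1] "piano, violin" "violin"        "piano"         ""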
Another way to do this would be :
library(dplyr)
library(tidyr)
df %>%
  mutate(index = row_number()) %>%
  separate_rows(Hobby, sep = ',\\s*') %>%
  group_by(index) %>%
  summarise(Music = toString(setdiff(Hobby, c('sawing', 'cooking'))),
            Hobby = toString(Hobby)) %>%
  select(Hobby, Music)
# Hobby Music
# <chr> <chr>
#1 cooking, sawing, piano, violin "piano, violin"
#2 cooking, violin "violin"
#3 piano, sawing "piano"
#4 sawing, cooking ""

conditional str_replace based on matching regex within mutate?

For any entries of the column "district" that match regex("[:alpha:]{2}AL"), I would like to replace the "AL" with "01".
For example:
df <- tibble(district = c("NY14", "MT01", "MTAL", "PA10", "KS02", "NDAL", "ND01", "AL02", "AL01"))
I tried:
df %>% mutate(district = replace(district,
                                 str_detect(district, regex("[:alpha:]{2}AL")),
                                 str_replace(district, "AL", "01")))
and
df %>% mutate(district = replace(district,
                                 str_detect(district, regex("[:alpha:]{2}AL")),
                                 paste(str_sub(district, start = 1, end = 2), "01", sep = "")))
but there is a vectorization problem.
Is this ok?
str_replace_all(string=df$district,
pattern="(\\w{2})AL",
replacement="\\101")
I replaced the regex with \\w, a word character: https://www.regular-expressions.info/shorthand.html
I am using \\1 to refer to the first captured region, i.e. whatever (\\w{2}) matched, so the replacement keeps the first two letters and then adds the 01.
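With the example df above, this should give something like:
# [1] "NY14" "MT01" "MT01" "PA10" "KS02" "ND01" "ND01" "AL02" "AL01"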
You can change the replace to ifelse
ifelse( str_detect(df$district, regex("[:alpha:]{2}AL")),
str_replace(df$district,"AL","01"),df$district)
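Since the question asks for this inside mutate, here is a sketch of the same idea in the pipe (assuming dplyr and stringr are loaded):
df %>%
  mutate(district = ifelse(str_detect(district, regex("[:alpha:]{2}AL")),
                           str_replace(district, "AL", "01"),
                           district))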

Subset strings in R

One of the strings in my vector (df$location1) is the following:
Potomac, MD 20854\n(39.038266, -77.203413)
The rest of the data in the vector follows the same pattern. I want to separate each component of the string into a separate data element and put it in new columns like: df$city, df$state, etc.
So far I have been able to isolate the lat. long. data into a separate column by doing the following:
df$lat.long <- gsub('.*\\\n\\\((.*)\\\)','\\\1',df$location1)
I was able to make it work by looking at other code online but I don't fully understand it. I understand the regex pattern but don't understand the "\\1" part. Since I don't understand it in full, I have been unable to use it to subset other parts of this same string.
What's the best way to subset data like this?
Is using regex a good way to do this? What other ways should I be looking into?
I have looked into splitting the string after a comma, subsetting using regex, using the scan() function, and many other variations. Now I am all confused. Thanks.
We can also use the separate function from the tidyr package (part of the tidyverse package).
library(tidyverse)
# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
stringsAsFactors = FALSE)
dat
# Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)
# Separate the Data column
dat2 <- dat %>%
separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
sep = ", |\\\n\\(|\\)|[[:space:]]")
dat2
# City State Zip Latitude Longitude
# 1 Potomac MD 20854 39.038266 -77.203413
You can try strsplit or data.table::tstrsplit(strsplit + transpose):
> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"
[[2]]
[1] "MD 20854"
[[3]]
[1] "39.038266"
[[4]]
[1] "-77.203413"
More generally, you can do this:
library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".
In case you want to separate state and zip, and city names may contain spaces, you can try a two-step approach:
# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
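As a quick check, a one-row data frame built from the sample string above (df here is my own construction) should give roughly:
library(data.table)
df <- data.frame(location1 = "Potomac, MD 20854\n(39.038266, -77.203413)")
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
df[c('city', 'state', 'zip', 'lat', 'long')]
#      city state   zip       lat       long
# 1 Potomac    MD 20854 39.038266 -77.203413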
Here is an option using base R
read.table(text= trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))),
header = FALSE, col.names = c("City", "State", "Zip", "Latitude", "Longitude"),
stringsAsFactors = FALSE)
# City State Zip Latitude Longitude
#1 Potomac MD 20854 39.03827 -77.20341
So this process might be a little longer, but for me it makes things clear. As opposed to splitting on delimiters, below I identify each value with its own specific regex. I make a vector of regexes to extract each value, a vector of variable names, then use a loop to extract the values and build the data frame from those vectors.
library(stringi)
library(dplyr)
library(purrr)
rgexVec <- c("[\\w\\s-]+(?=,)",
             "[A-Z]{2}",
             "\\d+(?=\\n)",
             "[\\d-\\.]+(?=,)",
             "[\\d-\\.]+(?=\\))")
varNames <- c("city",
              "state",
              "zip",
              "lat",
              "long")
# `value` was not defined in the original; I assume it is the example string from above
value <- dat$Data
map2_dfc(varNames, rgexVec, function(vn, rg) {
  extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
  names(extractedVal) <- vn
  extractedVal %>% as_tibble()
})
\\1 is a back reference in regex. In the replacement it refers to whatever the first capturing group, the part inside (...), matched, so the gsub above replaces the whole string with just that captured piece (the coordinates).
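A cleaner version of your gsub showing what the \\1 does (just an illustration on the single string):
x <- "Potomac, MD 20854\n(39.038266, -77.203413)"
gsub(".*\\n\\((.*)\\)", "\\1", x)
# [1] "39.038266, -77.203413"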
