Rename suffix part of column name but keep the rest the same - r

For now I am redoing a merge because I poorly named the columns, however, I would like to know how to match on a suffix of a column name and rename that part of the column, keeping the rest the same.
For example, if I have a data.frame (could be a data.table too, doesn't matter - I could convert it):
d <- data.frame("ID" = c(1, 2, 3),
                "Attribute1.prev" = c("A", "B", "C"),
                "Attribute1.cur" = c("D", "E", "F"))
Now imagine that there are hundreds of columns similar to columns 2 & 3 from my sample DT. How would I detect all columns ending in ".prev" and change that suffix to ".1", and likewise change ".cur" to ".2"?
So, the new column names would be: ID (unchanged), Attribute1.1, Attribute1.2 and so on for as many columns that match.

With base R we may do
names(d) <- sub("\\.prev", ".1", sub("\\.cur", ".2", names(d)))
d
# ID Attribute1.1 Attribute1.2
# 1 1 A D
# 2 2 B E
# 3 3 C F
With the stringr package you could also use
library(stringr)
names(d) <- str_replace_all(names(d), c("\\.prev" = ".1", "\\.cur" = ".2"))
If some column names contain other dots or spaces, you could also change the patterns "\\.prev" and "\\.cur" to "\\.prev$" and "\\.cur$" to make sure they match only at the end of the column names.
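To make the point about anchoring concrete, here is a minimal base R sketch. The column name "Score.current" is a made-up example, not taken from the question; it shows how an unanchored pattern can rewrite the middle of a name.

```r
# Hypothetical column names: "Score.current" contains ".cur" but does not end in it
nms <- c("ID", "Score.current", "Attribute1.prev", "Attribute1.cur")

# Unanchored: sub() rewrites the first ".prev" / ".cur" found anywhere in the name
unanchored <- sub("\\.prev", ".1", sub("\\.cur", ".2", nms))

# Anchored with $: only a trailing ".prev" / ".cur" is rewritten
anchored <- sub("\\.prev$", ".1", sub("\\.cur$", ".2", nms))

print(unanchored)
print(anchored)
```

With the unanchored patterns "Score.current" is mangled, while the anchored patterns leave it alone and still rename the two Attribute1 columns.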

Here's an idea using dplyr & stringr syntax
library(dplyr); library(stringr)
# Escape the dot and anchor with $ so only a literal trailing suffix is replaced
names(d) <-
  d %>% names() %>%
  str_replace("\\.prev$", ".1") %>%
  str_replace("\\.cur$", ".2")
Cheers!

Here is an option with gsubfn
library(gsubfn)
names(d) <- gsubfn("(\\w+)", list(prev = 1, cur = 2), names(d))
names(d)
#[1] "ID" "Attribute1.1" "Attribute1.2"
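If there are more than two suffixes to map, the same idea can be wrapped in a small helper. This is only a sketch: rename_suffix and its map argument are hypothetical names, and it uses endsWith() with literal suffixes so no regex escaping is needed.

```r
# Hypothetical helper: `map` is a named character vector, old suffix -> new suffix
rename_suffix <- function(nms, map) {
  for (old in names(map)) {
    hit <- endsWith(nms, old)
    if (!any(hit)) next
    # Drop the old suffix from matching names and append the new one
    stem <- substr(nms[hit], 1, nchar(nms[hit]) - nchar(old))
    nms[hit] <- paste0(stem, map[[old]])
  }
  nms
}

d <- data.frame(ID = c(1, 2, 3),
                Attribute1.prev = c("A", "B", "C"),
                Attribute1.cur = c("D", "E", "F"))
names(d) <- rename_suffix(names(d), c(".prev" = ".1", ".cur" = ".2"))
print(names(d))
```

Because the suffixes are compared literally, the dot in ".prev" cannot accidentally match another character.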

Related

Recoding factor with many levels

I need to recode a factor variable with almost 90 levels. It holds trait names from a database, which I then need to pivot to get the dataset for analysis.
Is there a way to do it automatically without typing each OldName=NewName?
This is how I do it with dplyr for fewer levels:
df$TraitName <- recode_factor(df$TraitName, 'Old Name' = "new.name")
My idea was to use a key dataframe with a column of old names and corresponding new names but I cannot figure out how to feed it to recode
You could quite easily create a named vector from your lookup table and pass that to recode using splicing. It may well be faster than a join.
library(tidyverse)
# test data
df <- tibble(TraitName = c("a", "b", "c"))
# Make a lookup table with your own data.
# You'll bind your two columns here instead.
# Keep the column order (old names first, new names second) so deframe() works.
# The column names don't matter.
lookup <- tibble(old = c("a", "b", "c"), new = c("aa", "bb", "cc"))
# Convert to named vector and splice it within the recode
df <-
  df |>
  mutate(TraitNameRecode = recode_factor(TraitName, !!!deframe(lookup)))
One way would be a lookup table, a join, and coalesce (to get the first non-NA value):
my_data <- data.frame(letters = letters[1:6])
levels_to_change <- data.frame(letters = letters[4:5],
new_letters = LETTERS[4:5])
library(dplyr)
my_data %>%
  left_join(levels_to_change) %>%
  mutate(new = coalesce(new_letters, letters))
Result
Joining, by = "letters"
letters new_letters new
1 a <NA> a
2 b <NA> b
3 c <NA> c
4 d D D
5 e E E
6 f <NA> f
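Both answers above rely on dplyr. As a rough base R sketch of the same lookup idea, the levels of the factor can be replaced directly with match(). The lookup table and the trait vector below are made-up stand-ins for the real key data frame.

```r
# Hypothetical lookup table: old level names and their replacements
lookup <- data.frame(old = c("a", "b"),
                     new = c("aa", "bb"),
                     stringsAsFactors = FALSE)
trait <- factor(c("a", "b", "c", "a"))

# Position of each existing level in the lookup (NA if the level is not listed)
idx <- match(levels(trait), lookup$old)

# Replace the listed levels and keep all other levels unchanged
levels(trait)[!is.na(idx)] <- lookup$new[idx[!is.na(idx)]]

print(trait)
```

Only the levels are touched, so the cost does not grow with the number of rows.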

Creating another column in R

This is my current data set
I want to take the numbers after "narrow" (e.g. 20) and make another vector. Any idea how I can do that?
We can use sub to remove the substring "Narrow", followed by a comma and zero or more spaces (\\s*), replace it with a blank ("") and convert the result to numeric
df1$New <- as.numeric(sub("Narrow,\\s*", "", df1$Stimulus))
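Since the data set itself is not shown, here is a small sketch on invented values. The Stimulus strings are assumptions about what the column looks like, and the second pattern is a generalization (strip everything up to the first comma) that is not part of the answer above.

```r
# Hypothetical data: the real data set is not shown in the question
df1 <- data.frame(Stimulus = c("Narrow, 20", "Narrow,35", "Wide,  5"),
                  stringsAsFactors = FALSE)

# Pattern from the answer: only strips a leading "Narrow," label,
# so it is applied here to the rows that carry that label
narrow_only <- as.numeric(sub("Narrow,\\s*", "", df1$Stimulus[1:2]))

# Generalized pattern: strip any label up to the first comma plus spaces
df1$New <- as.numeric(sub("^[^,]*,\\s*", "", df1$Stimulus))

print(narrow_only)
print(df1)
```

The generalized version is useful if the column also holds labels other than "Narrow".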
You could use separate to separate the stimulus column into two vectors.
library(tidyr)
df %>%
  separate(col = stimulus,
           sep = ", ",
           into = c("Text", "Number"))
Maybe you can try the code below, using regmatches
df$new <- with(df, as.numeric(unlist(regmatches(stimulus,gregexpr("\\d+",stimulus)))))
You want separate from the tidyr package.
library(tidyr)  # provides separate()
library(dplyr)
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate(x, c("A", "B"))
#> A B
#> 1 <NA> <NA>
#> 2 a b
#> 3 a d
#> 4 b c

How do I change all the character values of a column that starts with specific characters?

I have a dataset with millions of observations.
One of the columns of this dataset uses 4 or 5 characters to classify these observations.
My goal is to merge this classification into smaller groups. For example, I want to replace all the values of the column that START with "AA" (e.g., "AABC" or "AAUCC") with just "A". How can I do this?
To illustrate:
Considering that my data is labeled "f2016" and the column that I'm interested in is "SECT16", I've been using the following code to replace values:
f2016$SECT16[f2016$SECT16 == "AABB"] <- "A"
But I cannot do this to all combinations of letters that I have in the dataset. Is there a way that I can do the same replacement holding the first two letters constant?
Here is another base R solution:
# Index the column itself so other columns in the data frame are left untouched
f2016$SECT16[startsWith(f2016$SECT16, "AA")] <- "A"
f2016
# SECT16
# 1 A
# 2 A
# 3 ABBBBC
# 4 DDDDE
# 5 BABA
This replaces every value that starts with the specified prefix, in this case "AA". An excerpt from help(startsWith):
startsWith() is equivalent to but much faster than
substring(x, 1, nchar(prefix)) == prefix
or also
grepl("^<prefix>", x)
where prefix is not to contain special regular expression characters.
Data
f2016 <- data.frame(SECT16 = c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA"), stringsAsFactors = FALSE)
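The equivalence quoted from help(startsWith) can be checked with a short sketch. The vector x reuses the sample values above; the second vector y is a made-up example showing why the prefix must not contain special regular expression characters when grepl() is used.

```r
x <- c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA")
prefix <- "AA"

# Three ways to test for a prefix; all agree for a plain-letter prefix
by_startsWith <- startsWith(x, prefix)
by_substring  <- substring(x, 1, nchar(prefix)) == prefix
by_grepl      <- grepl(paste0("^", prefix), x)

# Hypothetical prefix with a regex metacharacter: "." matches any character in grepl()
y <- c("A.x", "ABx")
literal_match <- startsWith(y, "A.")
regex_match   <- grepl("^A.", y)

print(by_startsWith)
print(literal_match)
print(regex_match)
```

startsWith() always compares literally, so it is the safer choice when prefixes come from data.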
We can use grep/grepl
f2016$SECT16[grep("^AA", f2016$SECT16)] <- "A"
#f2016$SECT16[grepl("^AA", f2016$SECT16)] <- "A"
Consider this dataset
df <- data.frame(A = c("ABCD", "AACD", "DASDD", "AABB"), stringsAsFactors = FALSE)
df
# A
#1 ABCD
#2 AACD
#3 DASDD
#4 AABB
df$A[grep("^AA", df$A)] <- "A"
df
# A
#1 ABCD
#2 A
#3 DASDD
#4 A
You can use stringr and dplyr.
Modify all columns:
df <- df %>% mutate_all(function(x) stringr::str_replace(x, "^AA.*", "A"))
Modify specific columns:
df <- df %>% mutate_at(1, function(x) stringr::str_replace(x, "^AA.*", "A"))
Data
df <- data.frame(SECT16 = c("AABC", "AABB"),
SECT17 = c("AADD", "AAEE"))

I want to count every occurrence of a value/character in each row of a dataframe in R, INCLUDING when it is surrounded by other values/characters

I have read in an excel sheet as a dataframe that contains various numbers and characters in each row/col (with some NAs). For each row, I want to count how many occurrences of "g" there are, for example. My problem is that some cells contain something like, "g#" or "g a" or "1g", and, thus, are not being included in the count. I want to count EVERY occurrence of g, regardless of what is in the cell with it, and then add this count as a new variable to the current dataframe.
I have tried messing around with the following code, all of which works for counting each occurrence of EXACTLY "g" but not simply every occurrence of "g".
My hunch is that I am looking for a regular expression to place in any of the following snippets. (I searched for a few hours to no avail.) I also tried functions from the stringr package, such as str_count, but these seem to be applicable only to vectors.
oneelecsheet$countg <- rowSums(oneelecsheet == "g", na.rm = TRUE)
library(expss)
oneelecsheet$countg <- count_row_if("g", oneelecsheet)
oneelecsheet$countg <- apply(oneelecsheet, 1, function(x) length(which(x=="g")))
library(dplyr)
oneelecsheet$countg <- apply(oneelecsheet, 1, function(x) sum(x %in% "g"))
If there are multiple occurrences of "g" in a cell how would you want to count it? For example, if there is a word called "ageeg" would it be given a count of 1 or 2? Based on the answer to that question you can use any of the following.
1) If only one "g" has to be counted per cell
df$gcount <- colSums(apply(df, 1, grepl, pattern = "g"))
df
# a b gcount
#1 abcg#g good 2
#2 gg bad 1
#3 g# ugly 2
#4 abcdg ageeg 2
If we want to avoid apply we can use
rowSums(sapply(df, grepl, pattern = "g"))
Or (thanks to @thelatemail)
Reduce(`+`, lapply(df, grepl, pattern ="g"))
2) If every "g" has to be counted separately
df$gcount <- colSums(apply(df, 1, stringr::str_count, "g"))
df
# a b gcount
#1 abcg#g good 3
#2 gg bad 2
#3 g# ugly 2
#4 abcdg ageeg 3
We can use the non-apply versions here too
rowSums(sapply(df, stringr::str_count, "g"))
Or
Reduce(`+`, lapply(df, stringr::str_count, "g"))
data
df <- data.frame(a = c("abcg#g", "gg", "g#", "abcdg"),
b = c("good", "bad", "ugly", "ageeg"))
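The question mentions that the sheet contains NAs, which the sample data here does not. As a hedged base R sketch, a small helper can treat missing cells as zero before the per-row sum. The function name count_g and the data below are invented for illustration.

```r
# Hypothetical data with a missing cell, loosely modeled on the question
sheet <- data.frame(a = c("abcg#g", NA, "g#"),
                    b = c("good", "bad", "1g"),
                    stringsAsFactors = FALSE)

# Count every "g" in one column; NA cells count as zero
count_g <- function(x) {
  x <- as.character(x)
  x[is.na(x)] <- ""
  # Number of characters left after deleting everything except "g"
  nchar(gsub("[^g]", "", x))
}

# Add the per-column counts together to get one count per row
sheet$gcount <- Reduce(`+`, lapply(sheet, count_g))
print(sheet)
```

Using Reduce() over the columns avoids apply() and also works when the data frame has a single row.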
We can use pmap with str_count from tidyverse
library(tidyverse)
df %>%
mutate(gcount = pmap_int(., ~ str_count(c(...), "g") %>%
sum))
# a b gcount
#1 abcg#g good 3
#2 gg bad 2
#3 g# ugly 2
#4 abcdg ageeg 3
Or with unite and str_count
df %>%
unite(gcount, a, b, remove = FALSE) %>%
mutate(gcount = str_count(gcount, "g"))
Or using base R with gregexpr and lengths
lengths(gregexpr("g", do.call(paste, df)))
#[1] 3 2 2 3
or another option with gsub and nchar
with(df, nchar(gsub("[^g]+", "", paste(a, b))))
#[1] 3 2 2 3
data
df <- structure(list(a = c("abcg#g", "gg", "g#", "abcdg"), b = c("good",
"bad", "ugly", "ageeg")), class = "data.frame", row.names = c(NA,
-4L))
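One caveat with the gregexpr and lengths option: for a row with no "g" at all, gregexpr() returns -1, which lengths() still counts as one element. The sample data has a "g" in every row, so the issue does not show up there. A minimal sketch with made-up strings:

```r
rows <- c("bad apple", "good egg")

# gregexpr() returns -1 for "no match", so lengths() reports 1 for the first row
naive <- lengths(gregexpr("g", rows))

# regmatches() drops the -1 placeholder, so rows without a match count as zero
safe <- lengths(regmatches(rows, gregexpr("g", rows)))

print(naive)
print(safe)
```

The gsub and nchar option shown above does not have this problem, since an empty string simply has zero characters.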

R - Function that selects specific character in a column in data frame and replaces it

I am trying to create a function in R that takes four arguments, namely:
data frame, number, character 1 and character 2.
What I am trying to have as an output is this:
test_df <- data.frame(col1 = c("matt", "baby"), col2 = c("john", "luck"))
my_function(test_df, 1, "a", "u")
col1 col2
mutt john
buby luck
I was just wondering how I should define the function so that it takes the [number] column the user enters. For the renaming, I guess the function rename() would be fine. Do I need to substitute with [x,x]?
Thank you!
If you have to create a function that takes a column as an argument you need to split out the data frame and column specification (using gsub() to do the actual replacement):
my_function <- function(df, column, pattern, replacement) {
  gsub(pattern, replacement, df[[column]])
}
Which would work like:
my_function(df = test_df, column = 1, pattern = "a", replacement = "u")
## [1] "mutt" "buby"
But, this has the downside that if you want to loop over multiple columns, for example with lapply(), the list specification becomes more complicated:
test_df[] <- lapply(colnames(test_df), my_function, df = test_df, pattern = "a", replacement = "u")
test_df
# col1 col2
# 1 mutt john
# 2 buby luck
Which is much more complicated than:
test_df <- data.frame(test_df, stringsAsFactors = FALSE)
test_df[] <- lapply(test_df, gsub, pattern = "a", replacement = "u")
test_df
# col1 col2
# 1 mutt john
# 2 buby luck
(Note: ensure stringsAsFactors = FALSE for this to work; this has been the default since R 4.0.0. It's a good idea to use it unless you explicitly want factors anyway.)
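The question asks for a function that returns the whole data frame with one column changed, whereas my_function above returns only the modified column. A minimal sketch of that variant follows; the name replace_in_column is hypothetical, and fixed = TRUE is an assumption that the user wants literal characters rather than regular expressions.

```r
# Hypothetical variant: returns the full data frame with one column modified
replace_in_column <- function(df, column, pattern, replacement) {
  # fixed = TRUE treats `pattern` as a literal string rather than a regex
  df[[column]] <- gsub(pattern, replacement, df[[column]], fixed = TRUE)
  df
}

test_df <- data.frame(col1 = c("matt", "baby"),
                      col2 = c("john", "luck"),
                      stringsAsFactors = FALSE)
result <- replace_in_column(test_df, 1, "a", "u")
print(result)
```

The column argument can be a position or a name, because df[[column]] accepts both.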
