Rename rownames - r

I would like to rename row names by removing common part of a row name
a b c
CDA_Part 1 4 4
CDZ_Part 3 4 4
CDX_Part 1 4 4
result
a b c
CDA 1 4 4
CDZ 3 4 4
CDX 1 4 4

1.Create a minimal reproducible example:
df <- data.frame(a = 1:3, b = 4:6)
rownames(df) <- c("CDA_Part", "CDZ_Part", "CDX_Part")
df
Returns:
a b
CDA_Part 1 4
CDZ_Part 2 5
CDX_Part 3 6
2.Suggested solution using base Rs gsub:
rownames(df) <- gsub("_Part", "", rownames(df), fixed=TRUE)
df
Returns:
a b
CDA 1 4
CDZ 2 5
CDX 3 6
Explanation:
gsub uses regex to identify and replace parts of strings. The three first arguments are:
pattern the pattern to be replaced - i.e. "_Part"
replacement the string to be used as replacement - i.e. the empty string ""
x the string we want to replace something in - i.e. the rownames
An additional argument (not in the first 3):
fixed indicating if pattern is meant to be a regular expression or "just" an ordinary string - i.e. just a string

You can try this approach, you can use Reduce with intersect to determine the common parts in the name, Note I am assuming here that you have structure like below in your dataset, where underscore is a separator between two words. This solution will work with both word_commonpart or commonpart_word, like in the example below.
Logic:
Using strsplit, split-ted the column basis underscore(not eating underscore as well, so used look around zero width assertions), now using Reduce to find intersection between the strings of all rownames. Those found are then pasted as regex with pipe separated items and replaced by Nothing using gsub.
Input:
structure(list(a = 1:4, b = 4:7), class = "data.frame", row.names = c("CDA_Part",
"CDZ_Part", "CDX_Part", "Part_ABC"))
Solution:
red <- Reduce('intersect', strsplit(rownames(df),"(?=_)",perl=T))
##1. determining the common parts
e <- expand.grid(red, red)
##2. getting all the combinations of underscores and the remaining parts
rownames(df) <- gsub(paste0(do.call('paste0', e[e$Var1!=e$Var2,]), collapse = "|"), '', rownames(df))
##3. filtering only those combinations which are different and pasting together using do.call
##4. using paste0 to get regex seperated by pipe
##5.replacing the common parts with nothing here
Output:
> df
# a b
# CDA 1 4
# CDZ 2 5
# CDX 3 6
# ABC 4 7

Related

filter with a list of string conditions

This is an example what the data looks like:
height <- c("T_0.1", "T_0.2", "T_0.3", "T_0.11", "T_0.12", "T_0.13", "T_10.1", "T_10.2",
"T_10.3", "T_10.11", "T_10.12", "T_10.13","T_36.1", "T_36.2", "T_36.3", "T_36.11", "T_36.12",
"T_36.13")
value <- c(1,12,14,15,20,22,5,9,4,0.0,0.45,0.7,1,2,7,100,9,45)
df <- data.frame(height,value)
I want to filter all the values in height that ends with ".1", ".2", and ".3". However I want to do that using a "list of patterns" because the actual data frame has more than 1000 values.
Here what I tried:
vars_list <- c(".1", ".2",".3")
df_new<-df[grepl(paste(vars_list, collapse = "|"), df$height),]
matchPattern <- paste(vars_list, collapse = "|")
df_new <- df %>% select(matches(matchPattern))
Both codes returns 0 observation. I am not sure what it is the issue and I couldn't find a post that would help. So any help is very much appreciated!
The dot is a regex metacharacter, which matches any character except a new line. You need to escape it (i.e. tell R you are looking for a literal dot), by prepending it with \\.
However, your pattern will then match all rows in your sample data.
I assume you do not want to match, for example, "T_0.13", because it does not end with ".1", ".2" or ".3". In which case, you should add a dollar sign to indicate that you want your string to end with the desired match, rather than just contain it.
vars_list <- c("\\.1$", "\\.2$","\\.3$")
df_new<-df[grepl(paste(vars_list, collapse = "|"), df$height),]
df_new
# height value
# 1 T_0.1 1
# 2 T_0.2 12
# 3 T_0.3 14
# 7 T_10.1 5
# 8 T_10.2 9
# 9 T_10.3 4
# 13 T_36.1 1
# 14 T_36.2 2
# 15 T_36.3 7
Incidentally, another way you could express this is:
df[grepl("\\.[1-3]$", df$height),]
You can read more here about the syntax used in regular expressions.
Alternatively use the base function endsWith
df <- data.frame(height,value) %>% filter(endsWith(height,vars_list))
Created on 2023-02-12 with reprex v2.0.2
height value
1 T_0.1 1
2 T_0.2 12
3 T_0.3 14
4 T_10.1 5
5 T_10.2 9
6 T_10.3 4
7 T_36.1 1
8 T_36.2 2
9 T_36.3 7

R: explode a character-string and get the last element (row-wise)

I have the following data-frame
df <- data.frame(var1 = c("f253.02.ds.a01", "f253.02.ds.a02", "f253.02.ds.x.a01", "f253.02.ds.x.a02", "f253.02.ds.a10", "test"))
df
What's the easiest way to extract the last two digits of the variable var1? (e.g. 1, 2, 10, NA) I was experimenting with separate(), but the number of points in the characters is not always the same. Maybe with regular expressions?
With separate, we can use a regex lookaround
library(dplyr)
library(tidyr)
df %>%
separate(var1, into = c('prefix', 'suffix'),
sep="(?<=[a-z])(?=\\d+$)", remove = FALSE, convert = TRUE)
-output
# var1 prefix suffix
#1 f253.02.ds.a01 f253.02.ds.a 1
#2 f253.02.ds.a02 f253.02.ds.a 2
#3 f253.02.ds.x.a01 f253.02.ds.x.a 1
#4 f253.02.ds.x.a02 f253.02.ds.x.a 2
#5 f253.02.ds.a10 f253.02.ds.a 10
#6 test test NA
The expected output shown in the question has 4 elements but the input has 6 rows so we assume that the expected output shown in the question is erroneous and that the correct output is that shown below. tes).
Now assuming that the 2 digits are preceded by a non-digit and note that \D means non-digit (backslash must be doubled within double quo
df %>% mutate(last2 = as.numeric(sub(".*\\D", "", var1)))
giving:
var1 last2
1 f253.02.ds.a01 1
2 f253.02.ds.a02 2
3 f253.02.ds.x.a01 1
4 f253.02.ds.x.a02 2
5 f253.02.ds.a10 10
6 test NA

Remove prefix letter from column variables

I have all column names that start with 'm'. Example: mIncome, mAge. I want to remove the prefix. So far, I have tried the following:
df %>%
rename_all(~stringr::str_replace_all(.,"m",""))
This removes all the column names that has the letter 'm'. I just need it removed from from the start. Any suggestions?
You can use sub in base R to remove "m" from the beginning of the column names.
names(df) <- sub('^m', '', names(df))
We need to specify the location. The ^ matches the start of the string (or here the column name). So, if we use ^m, it will only match 'm' at the beginning or start of the string and not elsewhere.
library(dplyr)
library(stringr)
df %>%
rename_all(~stringr::str_replace(.,"^m",""))
# ba Mbgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Also, if the case should be ignored, wrap with regex and specify ignore_case = TRUE
df %>%
rename_all(~ stringr::str_replace(., regex("^m", ignore_case = TRUE), ""))
# ba bgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Another option is word boundary (\\bm), but this could match the beginning of words where there are multi word column names
NOTE: str_replace_all is used when we want to replace multiple occurrence of the pattern. Here, we just need to replace the first instance and for that str_replace is enough.
data
df <- data.frame(mba = 1:3, Mbgeg = 2:4, gmba = 4:6, cfor = 6:8)
Another way you can try
library(tidyverse)
df <- data.frame(mma = 1:2, mbapbe = 1:2)
df2 <- df %>%
rename_at(vars(c("mma", "mbapbe")) ,function(x) gsub("^m", "", x))
# ma bapbe
# 1 1 1
# 2 2 2

Regex solution for dataframe rownames

I have a dataframe returned from a function that looks like this:
df <- data.frame(data = c(1,2,3,4,5,6,7,8))
rownames(df) <- c('firsta','firstb','firstc','firstd','seconda','secondb','secondc','secondd')
firsta 1
seconda 5
firstb 2
secondb 6
my goal is to turn it into this:
df_goal <- data.frame(first = c(1,2,3,4), second = c(5,6,7,8))
rownames(df_goal) <- c('a','b','c','d')
first second
a 1 5
b 2 6
Basically the problem is that there is information in the row names that I can't discard because there isn't otherwise a way to distinguish between the column values.
This is a simple long-to-wide conversion; the twist is that we need to generate the key variable from the rownames by splitting the string appropriately.
In the data you present, the rowname consists of the concatination of a "position" (ie. 'first', 'second') and an id (ie. 'a', 'b'), which is stuck at the end. The structure of this makes splitting it complicated: ideally, you'd use a separator (ie. first_a, first_b) to make the separation unambiguous. Without a separator, our only option is to split on position, but that requires the splitting position to be a fixed distance from the start or end of the string.
In your example, the id is always the last single character, so we can pass -1 to the sep argument of separate to split off the last character as the ID column. If that wasn't always true, you would need to some up with a more complex solution to resolve the rownames.
Once you have converted the rownames into a "position" and "id" column, it's a simple matter to use spread to spread the position column into the wide format:
library(tidyverse)
df %>%
rownames_to_column('row') %>%
separate(row, into = c('num', 'id'), sep = -1) %>%
spread(num, data)
id first second
1 a 1 5
2 b 2 6
3 c 3 7
4 d 4 8
If row ids could be of variable length, the above solution wouldn't work. If you have a known and limited number of "position" values, you could use a regex solution to split the rowname:
Here, we extract the position value by matching to a regex containing all possible values (| is the OR operator).
We match the "id" value by putting that same regex in a positive lookahead operator. This regex will match 1 or more lowercase letters that come immediately after a match to the position value. The downside of this approach is that you need to specify all possible values of "position" in the regex -- if there are many options, this could quickly become too long and difficult to maintain:
df2
data
firsta 1
firstb 2
firstc 3
firstd 4
seconda 5
secondb 6
secondc 7
secondd 8
secondee 9
df2 %>%
rownames_to_column('row') %>%
mutate(num = str_extract(row, 'first|second'),
id = str_match(row, '(?<=first|second)[a-z]+')) %>%
select(-row) %>%
spread(num, data)
id first second
1 a 1 5
2 b 2 6
3 c 3 7
4 d 4 8
5 ee NA 9

R regular expression for p#q#c#

What would the regular expression be to encompass variable names such as p3q10000c150 and p29q2990c98? I want to add all variables in the format of p-any number-q-any number-c-any number to a list in R.
Thanks!
I think you are looking for something like matches function in dplyr::select:
df = data.frame(1:10, 1:10, 1:10, 1:10)
names(df) = c("p3q10000c150", "V1", "p29q2990c98", "V2")
library(dplyr)
df %>%
select(matches("^p\\d+q\\d+c\\d+$"))
Result:
p3q10000c150 p29q2990c98
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
matches in select allows you to use regex to extract variables.
If your objective is to pull out the 3 numbers and put them in a 3 column data frame or matrix then any of these alternatives would do it.
The regular expression in #1 matches p and then one or more digits and then q and then one or more digits and then c and one or more digits. The parentheses form capture groups which are placed in the corresponding columns of the prototype data frame given as the third argument.
In #2 each non-digit ("\\D") is replaced with a space and then read.table reads in the data using the indicated column names.
In #3 we convert each element of the input to DCF format, namely c("\np: 3\nq: 10000\nc: 150", "\np: 29\nq: 2990\nc: 98") and then read it in using read.dcf and conver the columns to numeric. This creates a matrix whereas the prior two alternatives create data frames.
The second alternative seems simplest but the third one is more general in that it does not hard code the header names or the number of columns. (If we used col.names = strsplit(input, "\\d+")[[1]] in #2 then it would be similarly general.)
# 1
strcapture("p(\\d+)q(\\d+)c(\\d+)", input,
data.frame(p = character(), q = character(), c = character()))
# 2
read.table(text = gsub("\\D", " ", input), col.names = c("p", "q", "c"))
# 3
apply(read.dcf(textConnection(gsub("(\\D)", "\n\\1: ", input))), 2, as.numeric)
The first two above give this data.frame and the third one gives the corresponding numeric matrix.
p q c
1 3 10000 150
2 29 2990 98
Note: The input is assumed to be:
input <- c("p3q10000c150", "p29q2990c98")
Try:
x <- c("p3q10000c150", "p29q2990c98")
sapply(strsplit(x, "[pqc]"), function(i){
setNames(as.numeric(i[-1]), c("p", "q", "c"))
})
# [,1] [,2]
# p 3 29
# q 10000 2990
# c 150 98
I'll assume you have a data frame called df with variables names names(df). If you want to only retain the variables with the structure p<somenumbers>q<somenumbers>c<somenumbers> you could use the regex that Wiktor Stribiżew suggested in the comments like this:
valid_vars <- grepl("p\\d+q\\d+c\\d", names(df))
df2 <- df[, valid_vars]
grepl() will return a vector of TRUE and FALSE values, indicating which element in names(df) follows the structure you suggested. Afterwards you use the output of grepl() to subset your data frame.
For clarity, observe:
var_names_test <- c("p3q10000c150", "p29q2990c98", "var1")
grepl("p\\d+q\\d+c\\d", var_names_test)
# [1] TRUE TRUE FALSE

Resources