Remove prefix letter from column variables - r

I have all column names that start with 'm'. Example: mIncome, mAge. I want to remove the prefix. So far, I have tried the following:
df %>%
rename_all(~stringr::str_replace_all(.,"m",""))
This removes every 'm' from the column names, not just the leading one. I just need it removed from the start. Any suggestions?

You can use sub in base R to remove "m" from the beginning of the column names.
names(df) <- sub('^m', '', names(df))
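
To see why the anchor matters, here is a small sketch comparing the anchored pattern with an unanchored one. The names "mammal" and "sum" are made up for illustration and do not come from the question.

```r
# "mIncome" and "mAge" are from the question; "mammal" and "sum" are assumed
cols <- c("mIncome", "mAge", "mammal", "sum")

# Unanchored: removes every "m", wherever it occurs
unanchored <- gsub("m", "", cols)

# Anchored: removes only an "m" at the very start of each name
anchored <- sub("^m", "", cols)

print(unanchored)
print(anchored)
```

Only the anchored version leaves the inner and trailing "m" characters alone.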

We need to specify the location. The ^ matches the start of the string (or here the column name). So, if we use ^m, it will only match 'm' at the beginning or start of the string and not elsewhere.
library(dplyr)
library(stringr)
df %>%
rename_all(~stringr::str_replace(.,"^m",""))
# ba Mbgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Also, if the case should be ignored, wrap with regex and specify ignore_case = TRUE
df %>%
rename_all(~ stringr::str_replace(., regex("^m", ignore_case = TRUE), ""))
# ba bgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Another option is word boundary (\\bm), but this could match the beginning of words where there are multi word column names
NOTE: str_replace_all is used when we want to replace multiple occurrences of the pattern. Here, we only need to replace the first instance, and str_replace is enough for that.
data
df <- data.frame(mba = 1:3, Mbgeg = 2:4, gmba = 4:6, cfor = 6:8)
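
The word-boundary caveat can be checked with a short base R sketch. The multi-word name "total mAge" is a hypothetical example, not part of the data above.

```r
# "total mAge" is an assumed multi-word column name, for illustration only
cols <- c("mba", "gmba", "total mAge")

# ^m only touches an "m" at the start of the whole name
start_only <- sub("^m", "", cols)

# \\bm touches an "m" at the start of any word inside the name
word_start <- gsub("\\bm", "", cols, perl = TRUE)

print(start_only)
print(word_start)
```

With the word boundary, the "m" in "mAge" is removed even though it is not at the start of the column name.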

Another way you can try
library(tidyverse)
df <- data.frame(mma = 1:2, mbapbe = 1:2)
df2 <- df %>%
rename_at(vars(c("mma", "mbapbe")) ,function(x) gsub("^m", "", x))
# ma bapbe
# 1 1 1
# 2 2 2
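
In dplyr 1.0.0 and later, rename_all and rename_at are superseded by rename_with. A minimal sketch of the same rename, assuming a recent dplyr is installed:

```r
library(dplyr)

df <- data.frame(mma = 1:2, mbapbe = 1:2)

# Apply the function only to columns whose names start with "m"
df2 <- df %>%
  rename_with(~ sub("^m", "", .x), .cols = starts_with("m"))

print(names(df2))
```

Selecting with starts_with avoids listing the column names by hand.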

Related

Separate column on the last digit

Say I have a df such as this:
x <- data.frame("SN" = 1:3, "Age" = c(21,15,2), "Name" = c("Q62yes","Q44_1_1Maybe", "Q2Some times"))
I would like separate out the Name column such that:
x_out <- data.frame("SN" = 1:3, "Age" = c(21,15,2), "Name" = c("Q62","Q44_1_1","Q2"), "New" = c("yes", 'Maybe', 'some times'))
I tried this, but my regex is not separating it into two groups as expected. Any suggestions?
x %>%
tidyr::separate(Name,c("name",'new'), sep = "(Q[[:digit:]]*_[[:digit:]])*([[:alpha:]]*\\s*)")
You can use
x %>%
tidyr::extract(Name,c("name",'new'), "(.*?\\d)([[:alpha:]].*)")
The regex means:
(.*?\d) - Group 1: zero or more characters, as few as possible, up to and including a digit that is immediately followed by the next subpattern
([[:alpha:]].*) - Group 2: a letter and then the rest of the string.
R test with output:
> x %>%
+ tidyr::extract(Name,c("name",'new'), "(.*?\\d)([[:alpha:]].*)")
SN Age name new
1 1 21 Q62 yes
2 2 15 Q44_1_1 Maybe
3 3 2 Q2 Some times
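
The two capture groups can also be inspected without tidyr. This is a base R sketch using the same pattern with backreferences. The vector Name holds the values from the question.

```r
Name <- c("Q62yes", "Q44_1_1Maybe", "Q2Some times")
pattern <- "(.*?\\d)([[:alpha:]].*)"

# \\1 keeps only group 1, \\2 keeps only group 2
name <- sub(pattern, "\\1", Name, perl = TRUE)
new  <- sub(pattern, "\\2", Name, perl = TRUE)

print(data.frame(name, new))
```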
We can use a regex lookaround in separate to split between a digit (\\d) and a letter ([A-Za-z])
library(tidyr)
library(dplyr)
x %>%
separate(Name, into = c("Name", "New"), sep="(?<=\\d)(?=[A-Za-z])")
-output
SN Age Name New
1 1 21 Q62 yes
2 2 15 Q44_1_1 Maybe
3 3 2 Q2 Some times
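Note that, unlike extract, this also works when a value contains no digits: separate keeps the value in the first column and fills the second with NA (with a warning), whereas extract returns NA in both columns.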
x$Name[3] <- "hello"
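
A small sketch of that difference, assuming tidyr is installed. The argument fill = "right" is added here only to silence the warning about the missing piece.

```r
library(tidyr)

x <- data.frame(SN = 1:3, Age = c(21, 15, 2),
                Name = c("Q62yes", "Q44_1_1Maybe", "hello"))

# extract(): a value that does not match the regex gives NA in both new columns
by_extract <- tidyr::extract(x, Name, c("name", "new"),
                             "(.*?\\d)([[:alpha:]].*)")

# separate(): the unsplit value stays in the first column, the second is NA
by_separate <- separate(x, Name, into = c("Name", "New"),
                        sep = "(?<=\\d)(?=[A-Za-z])", fill = "right")

print(by_extract)
print(by_separate)
```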

R: explode a character-string and get the last element (row-wise)

I have the following data-frame
df <- data.frame(var1 = c("f253.02.ds.a01", "f253.02.ds.a02", "f253.02.ds.x.a01", "f253.02.ds.x.a02", "f253.02.ds.a10", "test"))
df
What's the easiest way to extract the last two digits of the variable var1? (e.g. 1, 2, 10, NA) I was experimenting with separate(), but the number of points in the characters is not always the same. Maybe with regular expressions?
With separate, we can use a regex lookaround
library(dplyr)
library(tidyr)
df %>%
separate(var1, into = c('prefix', 'suffix'),
sep="(?<=[a-z])(?=\\d+$)", remove = FALSE, convert = TRUE)
-output
# var1 prefix suffix
#1 f253.02.ds.a01 f253.02.ds.a 1
#2 f253.02.ds.a02 f253.02.ds.a 2
#3 f253.02.ds.x.a01 f253.02.ds.x.a 1
#4 f253.02.ds.x.a02 f253.02.ds.x.a 2
#5 f253.02.ds.a10 f253.02.ds.a 10
#6 test test NA
The expected output shown in the question has 4 elements but the input has 6 rows, so we assume that the expected output is erroneous and that the correct output is the one shown below.
Now, assuming that the 2 digits are preceded by a non-digit, and noting that \D means non-digit (the backslash must be doubled within double quotes):
df %>% mutate(last2 = as.numeric(sub(".*\\D", "", var1)))
giving:
var1 last2
1 f253.02.ds.a01 1
2 f253.02.ds.a02 2
3 f253.02.ds.x.a01 1
4 f253.02.ds.x.a02 2
5 f253.02.ds.a10 10
6 test NA
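
To make the two steps visible, here is the same idea in base R with the intermediate string kept. The names trailing and last2 are chosen here for illustration.

```r
var1 <- c("f253.02.ds.a01", "f253.02.ds.a02", "f253.02.ds.x.a01",
          "f253.02.ds.x.a02", "f253.02.ds.a10", "test")

# ".*\\D" is greedy: it matches everything up to and including the last non-digit,
# so only the trailing digits (possibly none) are left
trailing <- sub(".*\\D", "", var1)

# An empty string converts to NA, which covers the "test" row
last2 <- as.numeric(trailing)

print(trailing)
print(last2)
```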

Rename rownames

I would like to rename the row names by removing the part they all have in common
a b c
CDA_Part 1 4 4
CDZ_Part 3 4 4
CDX_Part 1 4 4
result
a b c
CDA 1 4 4
CDZ 3 4 4
CDX 1 4 4
1. Create a minimal reproducible example:
df <- data.frame(a = 1:3, b = 4:6)
rownames(df) <- c("CDA_Part", "CDZ_Part", "CDX_Part")
df
Returns:
a b
CDA_Part 1 4
CDZ_Part 2 5
CDX_Part 3 6
2. Suggested solution using base R's gsub:
rownames(df) <- gsub("_Part", "", rownames(df), fixed=TRUE)
df
Returns:
a b
CDA 1 4
CDZ 2 5
CDX 3 6
Explanation:
gsub uses regex to identify and replace parts of strings. The three first arguments are:
pattern the pattern to be replaced - i.e. "_Part"
replacement the string to be used as replacement - i.e. the empty string ""
x the string we want to replace something in - i.e. the rownames
An additional argument (not in the first 3):
fixed indicating if pattern is meant to be a regular expression or "just" an ordinary string - i.e. just a string
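
The fixed argument makes no difference for "_Part", but it does once the pattern contains a regex metacharacter. A sketch with hypothetical row names that use a dot instead of an underscore:

```r
# Hypothetical row names, not from the question
rn <- c("CDA.Part", "CDZ.Part")

# As a regex, "." matches any character, so everything is removed
as_regex <- gsub(".", "", rn)

# With fixed = TRUE, "." is a literal dot
as_fixed <- gsub(".", "", rn, fixed = TRUE)

print(as_regex)
print(as_fixed)
```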
Alternatively, you can use Reduce with intersect to determine the parts that all row names have in common. This assumes that the row names are structured as in the input below, with an underscore separating two words. The solution works with both word_commonpart and commonpart_word.
Logic:
strsplit splits each row name at the underscore; a zero-width lookahead is used so that the underscore itself is kept rather than consumed. Reduce with intersect then finds the pieces shared by all row names. The shared pieces are pasted together into a regex of alternatives separated by pipes, and gsub replaces any match with an empty string.
Input:
df <- structure(list(a = 1:4, b = 4:7), class = "data.frame", row.names = c("CDA_Part",
"CDZ_Part", "CDX_Part", "Part_ABC"))
Solution:
## 1. determine the common parts
red <- Reduce('intersect', strsplit(rownames(df), "(?=_)", perl = TRUE))
## 2. get all combinations of the underscore and the remaining common parts
e <- expand.grid(red, red)
## 3. keep only the combinations whose two parts differ and paste each one together using do.call
## 4. use paste0 to join them into a regex separated by pipes
## 5. replace the common parts with nothing
rownames(df) <- gsub(paste0(do.call('paste0', e[e$Var1 != e$Var2, ]), collapse = "|"), '', rownames(df))
Output:
> df
# a b
# CDA 1 4
# CDZ 2 5
# CDX 3 6
# ABC 4 7

I have duplicate ids because of values in another column

Ids are duplicated because of multiple types in another column. I would like to remove duplicate ids and have an indicator column for specific types instead. Would be happy to see a solution in R and SAS if possible. Here's what I have and need:
have<-data.frame(id=c(1,1,2,3,3,3,4,5,5,6))
have$type<-c("healthy","healthy","injury1","healthy","injury2",
"injury1","healthy","injury2","healthy","injury2")
need<-data.frame(id=c(1,2,3,4,5,6))
need$injury_ind<-c(0,1,1,0,1,1)
In R, we can use str_detect (or grepl) to detect the 'injury' in 'type' after grouping by 'id'
library(dplyr)
library(stringr)
have %>%
group_by(id) %>%
summarise(injury_id = +(any(str_detect(type, 'injury'))))
# A tibble: 6 x 2
# id injury_id
# <dbl> <int>
#1 1 0
#2 2 1
#3 3 1
#4 4 0
#5 5 1
#6 6 1
We can make the regex a bit more specific by having '^injury\\d+$' to match the string 'injury' at the start (^) of the string followed by one or more digits (\\d+) at the end ($) of the string
Or with aggregate from base R
aggregate(cbind(injury_ind = type) ~ id, have,
FUN = function(x) +(any(grepl('injury', x))))
Or without grouping, we can make use of grepl to find the 'id's with 'injury' 'type' and check which among the unique 'id's are included
un1 <- unique(have$id)
data.frame(id = un1, injury_id = +(un1 %in%
unique(have$id[grepl('injury', have$type)])))
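
For comparison, here is another base R sketch using tapply. It relies on the same idea as the answers above: any() collapses each group to one logical value, which is then converted to 0/1 (the unary + used above does the same conversion). The name has_injury is chosen here for illustration.

```r
have <- data.frame(id = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 6),
                   type = c("healthy", "healthy", "injury1", "healthy", "injury2",
                            "injury1", "healthy", "injury2", "healthy", "injury2"))

# One TRUE/FALSE per id: does any row of that id mention an injury?
has_injury <- tapply(grepl("injury", have$type), have$id, any)

need <- data.frame(id = as.numeric(names(has_injury)),
                   injury_ind = as.integer(has_injury))
print(need)
```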

Regex solution for dataframe rownames

I have a dataframe returned from a function that looks like this:
df <- data.frame(data = c(1,2,3,4,5,6,7,8))
rownames(df) <- c('firsta','firstb','firstc','firstd','seconda','secondb','secondc','secondd')
firsta 1
seconda 5
firstb 2
secondb 6
my goal is to turn it into this:
df_goal <- data.frame(first = c(1,2,3,4), second = c(5,6,7,8))
rownames(df_goal) <- c('a','b','c','d')
first second
a 1 5
b 2 6
Basically the problem is that there is information in the row names that I can't discard because there isn't otherwise a way to distinguish between the column values.
This is a simple long-to-wide conversion; the twist is that we need to generate the key variable from the rownames by splitting the string appropriately.
In the data you present, the rowname consists of the concatenation of a "position" (i.e. 'first', 'second') and an id (i.e. 'a', 'b'), which is stuck at the end. The structure of this makes splitting it complicated: ideally, you'd use a separator (i.e. first_a, first_b) to make the separation unambiguous. Without a separator, our only option is to split on position, but that requires the splitting position to be a fixed distance from the start or end of the string.
In your example, the id is always the last single character, so we can pass -1 to the sep argument of separate to split off the last character as the ID column. If that wasn't always true, you would need to come up with a more complex solution to parse the rownames.
Once you have converted the rownames into a "position" and "id" column, it's a simple matter to use spread to spread the position column into the wide format:
library(tidyverse)
df %>%
rownames_to_column('row') %>%
separate(row, into = c('num', 'id'), sep = -1) %>%
spread(num, data)
id first second
1 a 1 5
2 b 2 6
3 c 3 7
4 d 4 8
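
In tidyr 1.0.0 and later, spread is superseded by pivot_wider. A sketch of the same pipeline with pivot_wider, assuming a recent tidyr; the result is a tibble rather than a data frame:

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- data.frame(data = c(1, 2, 3, 4, 5, 6, 7, 8))
rownames(df) <- c('firsta', 'firstb', 'firstc', 'firstd',
                  'seconda', 'secondb', 'secondc', 'secondd')

wide <- df %>%
  rownames_to_column('row') %>%
  # sep = -1 splits off the last character as the id
  separate(row, into = c('num', 'id'), sep = -1) %>%
  # one column per distinct value of num
  pivot_wider(names_from = num, values_from = data)

print(wide)
```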
If row ids could be of variable length, the above solution wouldn't work. If you have a known and limited number of "position" values, you could use a regex solution to split the rowname:
Here, we extract the position value by matching to a regex containing all possible values (| is the OR operator).
We match the "id" value by putting that same regex in a positive lookbehind. This regex will match 1 or more lowercase letters that come immediately after a match to the position value. The downside of this approach is that you need to specify all possible values of "position" in the regex: if there are many options, this could quickly become too long and difficult to maintain:
df2
data
firsta 1
firstb 2
firstc 3
firstd 4
seconda 5
secondb 6
secondc 7
secondd 8
secondee 9
df2 %>%
rownames_to_column('row') %>%
mutate(num = str_extract(row, 'first|second'),
id = str_extract(row, '(?<=first|second)[a-z]+')) %>%
select(-row) %>%
spread(num, data)
id first second
1 a 1 5
2 b 2 6
3 c 3 7
4 d 4 8
5 ee NA 9
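
The same split can be sketched in base R with one pattern and two capture groups, which avoids the lookbehind. The vector rn holds a few of the row names from df2, and the list of positions still has to be spelled out.

```r
rn <- c("firsta", "secondb", "secondee")

# Group 1: the known position values; group 2: the id of one or more lowercase letters
pattern <- "^(first|second)([a-z]+)$"

num <- sub(pattern, "\\1", rn)
id  <- sub(pattern, "\\2", rn)

print(data.frame(num, id))
```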

Resources