Changing the string pattern of column names in R

Is there an easy way to change the string pattern of my column names? I've got a data set like the following, and I would like to have all the column names without the "_R1".
df <- read.table(header=TRUE, text="
T_H_R1 T_S_R1 T_A_R1 T_V_R1 T_F_R1
5 1 2 1 5
3 1 3 3 4
2 1 3 1 3
4 2 5 5 3
5 1 4 1 2
")
Thank you!

We can use sub to match the _R1 at the end ($) of each name and replace it with blank ("").
names(df) <- sub("_R1$", "", names(df))
names(df)
#[1] "T_H" "T_S" "T_A" "T_V" "T_F"
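As a side note, the `$` anchor matters: without it, sub would also strip an `_R1` that happens to occur in the middle of a name. A quick sketch with a hypothetical second name illustrating the difference:

```r
nms <- c("T_H_R1", "T_R1_X")  # second name (made up) has "_R1" mid-string

sub("_R1", "", nms)   # unanchored: removes the first "_R1" anywhere -> "T_H" "T_X"
sub("_R1$", "", nms)  # anchored: removes only a trailing "_R1" -> "T_H" "T_R1_X"
```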

Related

Rename dataframe column names by switching string patterns

I have following dataframe and I want to rename the column names to c("WBC_MIN_D7", "WBC_MAX_D7", "DBP_MIN_D3")
> dataf <- data.frame(
+ WBC_D7_MIN=1:4,WBC_D7_MAX=1:4,DBP_D3_MIN=1:4
+ )
> dataf
WBC_D7_MIN WBC_D7_MAX DBP_D3_MIN
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
> names(dataf)
[1] "WBC_D7_MIN" "WBC_D7_MAX" "DBP_D3_MIN"
The rename_with function in the tidyverse can probably do it, but I cannot figure out how.
You can use capture groups with sub to capture the three parts of each name and reorder them -
names(dataf) <- sub('^(\\w+)_(\\w+)_(\\w+)$', '\\1_\\3_\\2', names(dataf))
Same regex can be used in rename_with -
library(dplyr)
dataf %>% rename_with(~ sub('^(\\w+)_(\\w+)_(\\w+)$', '\\1_\\3_\\2', .))
# WBC_MIN_D7 WBC_MAX_D7 DBP_MIN_D3
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
Alternatively, you can simply rename dataf by assigning a vector of names with names(yourDF) <- c("A","B",...,"Z"):
names(dataf) <- c("WBC_MIN_D7", "WBC_MAX_D7", "DBP_MIN_D3")
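If you prefer to avoid regex, the same field swap can be sketched in base R with strsplit and paste (the helper name `swap23` is made up for illustration; it assumes every name has exactly three underscore-separated fields):

```r
# split each name on "_" and re-join the three fields in order 1, 3, 2
swap23 <- function(x) {
  parts <- strsplit(x, "_", fixed = TRUE)
  vapply(parts, function(p) paste(p[c(1, 3, 2)], collapse = "_"), character(1))
}

swap23(c("WBC_D7_MIN", "WBC_D7_MAX", "DBP_D3_MIN"))
# [1] "WBC_MIN_D7" "WBC_MAX_D7" "DBP_MIN_D3"
```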

Extract multiple variables by naming convention, for more than two types of naming convention

I'm trying to extract multiple variables that start with certain strings. For this example I'd like to write a code that will extract all variables that start with X1 and Y2.
set.seed(123)
df <- data.frame(X1_1=sample(1:5,10,TRUE),
X1_2=sample(1:5,10,TRUE),
X2_1=sample(1:5,10,TRUE),
X2_2=sample(1:5,10,TRUE),
Y1_1=sample(1:5,10,TRUE),
Y1_2=sample(1:5,10,TRUE),
Y2_1=sample(1:5,10,TRUE),
Y2_2=sample(1:5,10,TRUE))
I know I can use the following to extract variables that begin with "X1"
Vars_to_extract <- c("X1")
tempdf <- df[ , grep( paste0(Vars_to_extract,".*" ) , names(df), value=TRUE)]
X1_1 X1_2
1 3 5
2 3 4
3 2 1
4 2 2
5 3 3
But I need to adapt the above code to extract multiple variable types at once, specified like this:
Vars_to_extract <- c("X1","Y2")
I've been trying to do it using %in% with .* within the grep part, but with little success. I know I can write the following, which is pretty manual, merging each set of variables separately.
tempdf <- data.frame(df[, grep("X1.*", names(df), value=TRUE)] , df[, grep("Y2.*", names(df), value=TRUE)] )
X1_1 X1_2 Y2_1 Y2_2
1 3 5 1 5
2 3 4 1 5
3 2 1 2 3
4 2 2 3 1
5 3 3 4 2
However, in real-world situations I often work with lots of variables and would have to do this numerous times. Is it possible to write it this way using %in%, or do I need to use a loop? Any help or tips will be gratefully appreciated. Thanks!
We could use contains if we want to extract column names that have the substring anywhere in the string
library(dplyr)
df %>%
select(contains(Vars_to_extract))
Or with matches, we can use a regex to specify that the string starts (^) with the specific substring
library(stringr)
df %>%
select(matches(str_c('^(', Vars_to_extract, ')', collapse="|")))
With grep, we could create a single pattern by paste with collapse = "|"
df[grep(paste0("^(",paste(Vars_to_extract, collapse='|'), ")"), names(df))]
# X1_1 X1_2 Y2_1 Y2_2
#1 3 5 5 3
#2 3 3 5 5
#3 2 3 3 3
#4 2 1 1 2
#5 3 4 4 5
#6 5 1 1 5
#7 4 1 1 3
#8 1 5 3 2
#9 2 3 4 2
#10 3 2 1 2
Another approach is to use startsWith with lapply and Reduce
df[Reduce(`|`, lapply(Vars_to_extract, startsWith, x = names(df)))]
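To see what that Reduce/startsWith line is doing, here is the intermediate result on a cut-down set of names (a sketch, with the names shortened from the question's data):

```r
nms <- c("X1_1", "X2_1", "Y1_2", "Y2_2")
Vars_to_extract <- c("X1", "Y2")

# one logical vector per prefix...
hits <- lapply(Vars_to_extract, startsWith, x = nms)

# ...OR-ed together: TRUE wherever any prefix matched
Reduce(`|`, hits)
# [1]  TRUE FALSE FALSE  TRUE
```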

Grouping by character matching & string length

Suppose I have a column of strings in a dataframe. I want to group the strings so that strings of the same length that contain the same characters are assigned to the same group.
The output should be grouped like the below provided sample:
Rule Group
x 1
x 1
xx 2
xx 2
xy 3
yx 3
xx 2
xyx 4
yxx 4
yyy 5
xyxy 6
yxyx 6
xyxy 6
You can split the Rule, sort and paste back together. Matching the result with the unique result will then give you what you need. In R,
v1 <- sapply(strsplit(df$Rule, ''), function(i)paste(sort(i), collapse = ''))
match(v1, unique(v1))
#[1] 1 1 2 2 3 3 2 4 4 5 6 6 6
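Put together as a runnable sketch on a shortened version of the example data (the data frame is reconstructed here, since the question didn't include code to build it):

```r
df <- data.frame(Rule = c("x", "x", "xx", "xy", "yx", "xyx"),
                 stringsAsFactors = FALSE)

# canonical key: the characters of each Rule, sorted and re-joined,
# so "xy" and "yx" collapse to the same key
v1 <- sapply(strsplit(df$Rule, ""), function(i) paste(sort(i), collapse = ""))
df$Group <- match(v1, unique(v1))
df$Group
# [1] 1 1 2 3 3 4
```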

Access specific instances in list in dataframe column, and also count list length - R

I have an R dataframe composed of columns. One column contains lists: i.e.
Column
1,2,4,7,9,0
5,3,8,9,0
3,4
5.8,9,3.5
6
NA
7,4,3
I would like to create a column which counts the length of these lists:
Column Count
1,2,4,7,9,0 6
5,3,8,9,0 5
3,4 2
5.8,9,3.5 3
6 1
NA NA
7,4,3 3
Also, is there a way to access specific elements in these lists, i.e., make a new column with only the first element of each list, or the last?
One solution is to use strsplit to split each element of the character vector, then use sapply to get the desired count:
df$count <- sapply(strsplit(df$Column, ","), function(x) {
  if (all(is.na(x))) {
    NA
  } else {
    length(x)
  }
})
df
# Column count
# 1 1,2,4,7,9,0 6
# 2 5,3,8,9,0 5
# 3 3,4 2
# 4 5.8,9,3.5 3
# 5 6 1
# 6 <NA> NA
# 7 7,4,3 3
If NA should instead be counted as 1, the solution is even simpler:
df$count <- sapply(strsplit(df$Column, ","),length)
Data:
df <- read.table(text = "Column
'1,2,4,7,9,0'
'5,3,8,9,0'
'3,4'
'5.8,9,3.5'
'6'
NA
'7,4,3'",
header = TRUE, stringsAsFactors = FALSE)
count.fields serves this purpose for a text file, and can be coerced to work with a column too:
df$Count <- count.fields(textConnection(df$Column), sep=",")
df$Count[is.na(df$Column)] <- NA
df
# Column Count
#1 1,2,4,7,9,0 6
#2 5,3,8,9,0 5
#3 3,4 2
#4 5.8,9,3.5 3
#5 6 1
#6 <NA> NA
#7 7,4,3 3
On a more general note, you're probably better off converting your column to a list, or stacking the data to a long form, to make it easier to work with:
df$Column <- strsplit(df$Column, ",")
lengths(df$Column)
#[1] 6 5 2 3 1 1 3
sapply(df$Column, `[`, 1)
#[1] "1" "5" "3" "5.8" "6" NA "7"
stack(setNames(df$Column, seq_along(df$Column)))
# values ind
#1 1 1
#2 2 1
#3 4 1
#4 7 1
#5 9 1
#6 0 1
#7 5 2
#8 3 2
#9 8 2
# etc
Here's a slightly faster way to achieve the same result:
df$Count <- nchar(gsub('[^,]', '', df$Column)) + 1
This one works by counting how many commas there are and adding 1.
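One detail worth knowing about this trick, sketched on a hypothetical vector with a missing value: gsub propagates NA, and in recent versions of R nchar of an NA character value is NA, so missing rows come out NA without any special handling:

```r
col <- c("1,2,4", "5.8,9,3.5", "6", NA)

# count commas, add 1; NA input stays NA all the way through
nchar(gsub("[^,]", "", col)) + 1
# [1]  3  3  1 NA
```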

R, add column to dataframe, count of substrings

This is my desired output:
> head(df)
String numSubStrings
1 1 1
2 1 1
3 1;1;1;1 4
4 1;1;1;1 4
5 1;1;1 3
6 1 1
Hi, I have a data frame which has a "String" column as above. I would like to add a column "numSubStrings" which contains the number of substrings separated by ";" in "String".
I have tried
lapply(df, transform, numSubStrings=length(strsplit(df$Strings,";")[[1]]))
which gives me 1s in the numSubStrings instead.
Please advise.
Thanks.
It sounds like you're looking for count.fields. Usage would be something like:
> count.fields(textConnection(mydf$String), sep = ";")
[1] 1 1 4 4 3 1
You may need to wrap the mydf$String in as.character, depending on how the data were read in or created.
Or, you can try lengths:
> lengths(strsplit(mydf$String, ";", TRUE))
[1] 1 1 4 4 3 1
We can use gsub to remove all characters except the ; and then count the remaining ; characters with nchar, adding 1:
df$numSubStrings <- nchar(gsub('[^;]+', '', df$String))+1
df$numSubStrings
#[1] 1 1 4 4 3 1
Or another option is stri_count from library(stringi) to count the ; characters and add 1.
library(stringi)
stri_count_fixed(df$String, ';')+1
#[1] 1 1 4 4 3 1
You may use str_count from the stringr package.
x <- " String
1 1
2 1
3 1;1;1;1
4 1;1;1;1
5 1;1;1
6 1 "
df <- read.table(text=x, header=TRUE)
library(stringr)
df$numSubStrings <- str_count(df$String, "[^;]+")
df
# String numSubStrings
# 1 1 1
# 2 1 1
# 3 1;1;1;1 4
# 4 1;1;1;1 4
# 5 1;1;1 3
# 6 1 1
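Note that the pattern "[^;]+" counts runs of non-semicolon characters, which is not the same as counting separators when empty fields can occur. A base-R sketch of the difference (the input string is hypothetical):

```r
x <- "1;;1"  # hypothetical string with an empty middle field

length(gregexpr("[^;]+", x)[[1]])  # runs of non-";" characters -> 2
nchar(gsub("[^;]", "", x)) + 1     # separators plus one        -> 3
```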
