Grouping by character matching & string length - r

Suppose I have a column in a dataframe with strings. I want to group the strings so that strings of the same length containing the same characters (regardless of order) are assigned to the same group.
The output should be grouped like the below provided sample:
Rule Group
x 1
x 1
xx 2
xx 2
xy 3
yx 3
xx 2
xyx 4
yxx 4
yyy 5
xyxy 6
yxyx 6
xyxy 6

You can split Rule into characters, sort them, and paste them back together. Matching the result against its unique values then gives the group numbers. In R,
v1 <- sapply(strsplit(df$Rule, ''), function(i)paste(sort(i), collapse = ''))
match(v1, unique(v1))
#[1] 1 1 2 2 3 3 2 4 4 5 6 6 6
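To make the intermediate step concrete, here is the full computation on the sample data (the data frame construction is mine, reproducing the sample above):

```r
df <- data.frame(
  Rule = c("x", "x", "xx", "xx", "xy", "yx", "xx",
           "xyx", "yxx", "yyy", "xyxy", "yxyx", "xyxy"),
  stringsAsFactors = FALSE
)

# Sort the characters within each string to build a canonical key:
# anagrams of the same length collapse to the same key.
v1 <- sapply(strsplit(df$Rule, ""), function(i) paste(sort(i), collapse = ""))
v1
# "x" "x" "xx" "xx" "xy" "xy" "xx" "xxy" "xxy" "yyy" "xxyy" "xxyy" "xxyy"

# match() against the unique keys numbers the groups in order of first appearance
df$Group <- match(v1, unique(v1))
df$Group
# 1 1 2 2 3 3 2 4 4 5 6 6 6
```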


Extract multiple variables by naming convention, for more than two types of naming convention

I'm trying to extract multiple variables that start with certain strings. For this example I'd like to write code that extracts all variables that start with X1 and Y2.
set.seed(123)
df <- data.frame(X1_1 = sample(1:5, 10, TRUE),
                 X1_2 = sample(1:5, 10, TRUE),
                 X2_1 = sample(1:5, 10, TRUE),
                 X2_2 = sample(1:5, 10, TRUE),
                 Y1_1 = sample(1:5, 10, TRUE),
                 Y1_2 = sample(1:5, 10, TRUE),
                 Y2_1 = sample(1:5, 10, TRUE),
                 Y2_2 = sample(1:5, 10, TRUE))
I know I can use the following to extract variables that begin with "X1"
Vars_to_extract <- c("X1")
tempdf <- df[ , grep( paste0(Vars_to_extract,".*" ) , names(df), value=TRUE)]
X1_1 X1_2
1 3 5
2 3 4
3 2 1
4 2 2
5 3 3
But I need to adapt the above code to extract multiple variable types at once, specified like this
Vars_to_extract <- c("X1","Y2")
I've been trying to do it using %in% with .* within the grep part, but with little success. I know I can write the following, which is pretty manual, merging each set of variables separately.
tempdf <- data.frame(df[, grep("X1.*", names(df), value=TRUE)] , df[, grep("Y2.*", names(df), value=TRUE)] )
X1_1 X1_2 Y2_1 Y2_2
1 3 5 1 5
2 3 4 1 5
3 2 1 2 3
4 2 2 3 1
5 3 3 4 2
However, in real world situations I often work with lots of variables and would have to do this numerous times. Is it possible to write it using %in%, or do I need to use a loop? Any help or tips will be gratefully appreciated. Thanks
We could use contains if we want to extract columns whose names have the substring anywhere in the string
library(dplyr)
df %>%
  select(contains(Vars_to_extract))
Or with matches, we can use a regex to specify that the string starts (^) with the specific substring
library(stringr)
df %>%
  select(matches(str_c('^(', Vars_to_extract, ')', collapse = "|")))
With grep, we can create a single pattern with paste and collapse = "|"
df[grep(paste0("^(",paste(Vars_to_extract, collapse='|'), ")"), names(df))]
# X1_1 X1_2 Y2_1 Y2_2
#1 3 5 5 3
#2 3 3 5 5
#3 2 3 3 3
#4 2 1 1 2
#5 3 4 4 5
#6 5 1 1 5
#7 4 1 1 3
#8 1 5 3 2
#9 2 3 4 2
#10 3 2 1 2
Another approach is to use startsWith with lapply and Reduce
df[Reduce(`|`, lapply(Vars_to_extract, startsWith, x = names(df)))]
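To sketch how the startsWith approach evaluates, with the column names hard-coded here to keep the example self-contained:

```r
Vars_to_extract <- c("X1", "Y2")
nms <- c("X1_1", "X1_2", "X2_1", "X2_2", "Y1_1", "Y1_2", "Y2_1", "Y2_2")

# One logical vector per prefix, TRUE where the name starts with that prefix
hits <- lapply(Vars_to_extract, startsWith, x = nms)

# OR the logical vectors together: a name is kept if any prefix matches
keep <- Reduce(`|`, hits)
nms[keep]
# "X1_1" "X1_2" "Y2_1" "Y2_2"
```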

R changing string pattern of column names

Is there an easy way to change the string pattern of my column names? I've got a data set like the following, and I would like to have all the column names without the "_R1".
df <- read.table(header=TRUE, text="
T_H_R1 T_S_R1 T_A_R1 T_V_R1 T_F_R1
5 1 2 1 5
3 1 3 3 4
2 1 3 1 3
4 2 5 5 3
5 1 4 1 2
")
Thank you!
We can use sub to match the _R1 at the end ($) of the string of names, and replace with blank ("")
names(df) <- sub("_R1$", "", names(df))
names(df)
#[1] "T_H" "T_S" "T_A" "T_V" "T_F"
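Because the pattern is anchored with $, only a trailing _R1 is removed; a name containing _R1 elsewhere is left untouched (the second name below is a made-up illustration):

```r
# Anchored replacement: strips "_R1" only when it ends the name
sub("_R1$", "", c("T_H_R1", "T_R1_H"))
# "T_H" "T_R1_H"
```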

How to check if rows in one column present in another column in R

I have a data set = data1 with id and emails as follows:
id emails
1 A,B,C,D,E
2 F,G,H,A,C,D
3 I,K,L,T
4 S,V,F,R,D,S,W,A
5 P,A,L,S
6 Q,W,E,R,F
7 S,D,F,E,Q
8 Z,A,D,E,F,R
9 X,C,F,G,H
10 A,V,D,S,C,E
I have another data set = data2 with check_email as follows:
check_email
A
D
S
V
I want to keep only those id from data1 whose emails contain at least one of the check_email values from data2.
My desired output will be:
id
1
2
4
5
7
8
10
I have created a code using for loop but it is taking forever because my actual dataset is very large.
Any advice in this regard will be highly appreciated!
You can use regular expressions to subset your data. First collapse everything into one pattern:
paste(data2$check_email, collapse = "|")
# [1] "A|D|S|V"
Then get the indices of the rows where the pattern matches the emails:
grep(paste(data2$check_email, collapse = "|"), data1$emails)
# [1] 1 2 4 5 7 8 10
And then combine everything:
data1[grep(paste(data2$check_email, collapse = "|"), data1$emails), ]
# id emails
# 1 1 A,B,C,D,E
# 2 2 F,G,H,A,C,D
# 3 4 S,V,F,R,D,S,W,A
# 4 5 P,A,L,S
# 5 7 S,D,F,E,Q
# 6 8 Z,A,D,E,F,R
# 7 10 A,V,D,S,C,E
Another base R option is to grepl each address separately and keep the rows where at least one matches:
data1[rowSums(sapply(data2$check_email, function(x) grepl(x, data1$emails))) > 0, "id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10
We can split each element of the character vector as.character(data1$emails) into substrings, then iterate over the resulting list with sapply, checking whether any substring is contained in data2$check_email. Finally we extract the matching rows from data1
emails <- strsplit(as.character(data1$emails), ",")
ind <- sapply(emails, function(e) any(e %in% as.character(data2$check_email)))
data1[ind, "id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10

Access specific instances in list in dataframe column, and also count list length - R

I have an R dataframe composed of columns. One column contains comma-separated lists, i.e.
Column
1,2,4,7,9,0
5,3,8,9,0
3,4
5.8,9,3.5
6
NA
7,4,3
I would like to create a column which counts the length of these lists:
Column Count
1,2,4,7,9,0 6
5,3,8,9,0 5
3,4 2
5.8,9,3.5 3
6 1
NA NA
7,4,3 3
Also, is there a way to access specific elements of these lists, i.e. make a new column with only the first element of each list, or the last?
One solution is to use strsplit to split each element of the character vector and sapply to get the desired count:
df$count <- sapply(strsplit(df$Column, ","), function(x) {
  if (all(is.na(x))) {
    NA
  } else {
    length(x)
  }
})
df
# Column count
# 1 1,2,4,7,9,0 6
# 2 5,3,8,9,0 5
# 3 3,4 2
# 4 5.8,9,3.5 3
# 5 6 1
# 6 <NA> NA
# 7 7,4,3 3
If it were desired to count NA as 1, the solution could be even simpler:
df$count <- sapply(strsplit(df$Column, ","),length)
Data:
df <- read.table(text = "Column
'1,2,4,7,9,0'
'5,3,8,9,0'
'3,4'
'5.8,9,3.5'
'6'
NA
'7,4,3'",
header = TRUE, stringsAsFactors = FALSE)
count.fields serves this purpose for a text file, and can be coerced to work with a column too:
df$Count <- count.fields(textConnection(df$Column), sep=",")
df$Count[is.na(df$Column)] <- NA
df
# Column Count
#1 1,2,4,7,9,0 6
#2 5,3,8,9,0 5
#3 3,4 2
#4 5.8,9,3.5 3
#5 6 1
#6 <NA> NA
#7 7,4,3 3
On a more general note, you're probably better off converting your column to a list, or stacking the data to a long form, to make it easier to work with:
df$Column <- strsplit(df$Column, ",")
lengths(df$Column)
#[1] 6 5 2 3 1 1 3
sapply(df$Column, `[`, 1)
#[1] "1" "5" "3" "5.8" "6" NA "7"
stack(setNames(df$Column, seq_along(df$Column)))
# values ind
#1 1 1
#2 2 1
#3 4 1
#4 7 1
#5 9 1
#6 0 1
#7 5 2
#8 3 2
#9 8 2
# etc
Here's a slightly faster way to achieve the same result, which works by counting how many commas there are and adding 1:
df$Count <- nchar(gsub('[^,]', '', df$Column)) + 1
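For reference, on a vector mirroring the sample column, the comma-counting approach gives (NA propagates through gsub and nchar, so the NA row stays NA):

```r
x <- c("1,2,4,7,9,0", "5,3,8,9,0", "3,4", "5.8,9,3.5", "6", NA, "7,4,3")

# Delete everything that is not a comma, count what remains, add 1
nchar(gsub("[^,]", "", x)) + 1
# 6 5 2 3 1 NA 3
```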

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only data that is within 2 SD per each numeric column.
I know how to do it for a single column but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the code line for getting only the data that is within 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the dataset columns and subset the numeric vectors (using an if/else condition) based on the mean and sd.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT:
I was under the impression that we needed to remove the outliers for each column separately. But if we need to keep only the rows that have no outliers in any numeric column, we can loop through the columns with lapply as before; instead of returning 'x', we return the indices of 'x', and then get the intersection of the list elements with Reduce. The numeric index can be used for subsetting the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD (?)
In that case I would suggest creating two filters: one that indicates the numeric columns, and a second that checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
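To see what scale contributes here, a tiny sketch (the toy vector is mine, not from the question): scale centers each column and divides by its SD, so abs(scale(.)) <= 2 flags the values within 2 standard deviations.

```r
m <- cbind(a = c(1, 2, 3, 2, 1, 2, 3, 100))

# z-scores: (x - mean(x)) / sd(x), computed column-wise
z <- scale(m)

# TRUE for values within 2 SD; the extreme value 100 is flagged FALSE
within2 <- abs(z) <= 2
within2[, "a"]
```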
