What would the regular expression be to encompass variable names such as p3q10000c150 and p29q2990c98? I want to add all variables in the format of p-any number-q-any number-c-any number to a list in R.
Thanks!
I think you are looking for something like the matches() function used inside dplyr::select():
df = data.frame(1:10, 1:10, 1:10, 1:10)
names(df) = c("p3q10000c150", "V1", "p29q2990c98", "V2")
library(dplyr)
df %>%
select(matches("^p\\d+q\\d+c\\d+$"))
Result:
p3q10000c150 p29q2990c98
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
matches() inside select() lets you pick the columns whose names match a regular expression.
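Since the question asks for the matching names collected in a list, a minimal base R sketch could use grep with value = TRUE. The vector name var_names is only an illustration, standing in for names(df):

```r
# Hypothetical stand-in for names(df)
var_names <- c("p3q10000c150", "V1", "p29q2990c98", "V2")

# value = TRUE returns the matching names themselves rather than their positions
matched <- grep("^p\\d+q\\d+c\\d+$", var_names, value = TRUE)

# Wrap in as.list() if an actual list is required
matched_list <- as.list(matched)
print(matched)
```

The anchors ^ and $ make sure the whole name has the p-number-q-number-c-number shape.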
If your objective is to pull out the 3 numbers and put them in a 3 column data frame or matrix then any of these alternatives would do it.
The regular expression in #1 matches p and then one or more digits and then q and then one or more digits and then c and one or more digits. The parentheses form capture groups which are placed in the corresponding columns of the prototype data frame given as the third argument.
In #2 each non-digit ("\\D") is replaced with a space and then read.table reads in the data using the indicated column names.
In #3 we convert each element of the input to DCF format, namely c("\np: 3\nq: 10000\nc: 150", "\np: 29\nq: 2990\nc: 98"), then read it in using read.dcf and convert the columns to numeric. This creates a matrix, whereas the prior two alternatives create data frames.
The second alternative seems simplest but the third one is more general in that it does not hard code the header names or the number of columns. (If we used col.names = strsplit(input, "\\d+")[[1]] in #2 then it would be similarly general.)
# 1
strcapture("p(\\d+)q(\\d+)c(\\d+)", input,
data.frame(p = character(), q = character(), c = character()))
# 2
read.table(text = gsub("\\D", " ", input), col.names = c("p", "q", "c"))
# 3
apply(read.dcf(textConnection(gsub("(\\D)", "\n\\1: ", input))), 2, as.numeric)
The first two above give this data frame (with character columns in #1, because the prototype uses character()), and the third one gives the corresponding numeric matrix.
p q c
1 3 10000 150
2 29 2990 98
Note: The input is assumed to be:
input <- c("p3q10000c150", "p29q2990c98")
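As a variation on #1, and only as a sketch, the prototype passed to strcapture can use integer() columns so that the captured digits come back as numbers instead of strings. The anchors and the name parts are additions for illustration:

```r
input <- c("p3q10000c150", "p29q2990c98")

# The prototype's column types decide the types of the result columns
parts <- strcapture(
  "^p(\\d+)q(\\d+)c(\\d+)$", input,
  proto = data.frame(p = integer(), q = integer(), c = integer())
)
print(parts)
str(parts)
```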
Try:
x <- c("p3q10000c150", "p29q2990c98")
sapply(strsplit(x, "[pqc]"), function(i){
setNames(as.numeric(i[-1]), c("p", "q", "c"))
})
# [,1] [,2]
# p 3 29
# q 10000 2990
# c 150 98
I'll assume you have a data frame called df with variable names names(df). If you want to retain only the variables with the structure p<somenumbers>q<somenumbers>c<somenumbers>, you could use a regex along the lines of the one Wiktor Stribiżew suggested in the comments, anchored here so that the whole name must match:
valid_vars <- grepl("^p\\d+q\\d+c\\d+$", names(df))
df2 <- df[, valid_vars, drop = FALSE]
grepl() returns a vector of TRUE and FALSE values indicating which elements of names(df) follow the structure you described. You then use the output of grepl() to subset your data frame; drop = FALSE keeps the result a data frame even if only one column matches.
For clarity, observe:
var_names_test <- c("p3q10000c150", "p29q2990c98", "var1")
grepl("^p\\d+q\\d+c\\d+$", var_names_test)
# [1] TRUE TRUE FALSE
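To illustrate why anchoring the pattern can matter with grepl(), here is a small sketch comparing an unanchored and an anchored version. The name "my_p3q1c2_backup" is made up for the comparison:

```r
candidates <- c("p3q10000c150", "my_p3q1c2_backup", "var1")

# Unanchored: TRUE whenever the pattern occurs anywhere in the name
loose <- grepl("p\\d+q\\d+c\\d+", candidates)

# Anchored: TRUE only when the entire name has the required shape
strict <- grepl("^p\\d+q\\d+c\\d+$", candidates)

print(rbind(loose, strict))
```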
This is an example of what the data looks like:
height <- c("T_0.1", "T_0.2", "T_0.3", "T_0.11", "T_0.12", "T_0.13", "T_10.1", "T_10.2",
"T_10.3", "T_10.11", "T_10.12", "T_10.13","T_36.1", "T_36.2", "T_36.3", "T_36.11", "T_36.12",
"T_36.13")
value <- c(1,12,14,15,20,22,5,9,4,0.0,0.45,0.7,1,2,7,100,9,45)
df <- data.frame(height,value)
I want to filter all the values in height that end with ".1", ".2", or ".3". However, I want to do that using a "list of patterns" because the actual data frame has more than 1000 values.
Here is what I tried:
vars_list <- c(".1", ".2",".3")
df_new<-df[grepl(paste(vars_list, collapse = "|"), df$height),]
matchPattern <- paste(vars_list, collapse = "|")
df_new <- df %>% select(matches(matchPattern))
Both pieces of code return 0 observations. I am not sure what the issue is, and I couldn't find a post that would help. So any help is very much appreciated!
The dot is a regex metacharacter, which matches any character except a new line. You need to escape it (i.e. tell R you are looking for a literal dot), by prepending it with \\.
However, your pattern will then match all rows in your sample data.
I assume you do not want to match, for example, "T_0.13", because it does not end with ".1", ".2" or ".3". In which case, you should add a dollar sign to indicate that you want your string to end with the desired match, rather than just contain it.
vars_list <- c("\\.1$", "\\.2$","\\.3$")
df_new<-df[grepl(paste(vars_list, collapse = "|"), df$height),]
df_new
# height value
# 1 T_0.1 1
# 2 T_0.2 12
# 3 T_0.3 14
# 7 T_10.1 5
# 8 T_10.2 9
# 9 T_10.3 4
# 13 T_36.1 1
# 14 T_36.2 2
# 15 T_36.3 7
Incidentally, another way you could express this is:
df[grepl("\\.[1-3]$", df$height),]
You can read more here about the syntax used in regular expressions.
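The three stages of the fix (unescaped dot, escaped dot, escaped dot plus end anchor) can be compared side by side. This is only a sketch on a few made-up values, not the full sample data:

```r
height <- c("T_0.1", "T_0.11", "T_10.2", "T_021")

# "." matches any character, so ".1" means "any character followed by 1"
unescaped <- grepl(".1", height)

# "\\." is a literal dot, but the match may occur anywhere in the string
escaped <- grepl("\\.1", height)

# "$" additionally requires the match to sit at the end of the string
anchored <- grepl("\\.1$", height)

print(data.frame(height, unescaped, escaped, anchored))
```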
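Alternatively, use the base function endsWith with the literal suffixes (no regex escaping needed). Because endsWith pairs its two arguments element by element, test each suffix separately and combine the results:
df <- data.frame(height,value) %>% filter(Reduce(`|`, lapply(c(".1", ".2", ".3"), function(s) endsWith(height, s))))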
Created on 2023-02-12 with reprex v2.0.2
height value
1 T_0.1 1
2 T_0.2 12
3 T_0.3 14
4 T_10.1 5
5 T_10.2 9
6 T_10.3 4
7 T_36.1 1
8 T_36.2 2
9 T_36.3 7
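One caveat worth illustrating: endsWith compares its arguments element by element, recycling the shorter one, so passing a vector of suffixes directly only works when the data happen to line up with it. A small sketch with made-up, reordered heights:

```r
height <- c("T_0.3", "T_0.1", "T_0.2")
suffixes <- c(".1", ".2", ".3")

# Element-wise pairing: "T_0.3" vs ".1", "T_0.1" vs ".2", "T_0.2" vs ".3"
paired <- endsWith(height, suffixes)

# Testing every suffix against every value gives one column per suffix
hits <- sapply(suffixes, function(s) endsWith(height, s))
keep <- rowSums(hits) > 0

print(paired)
print(keep)
```

Here paired is FALSE everywhere even though every value ends with one of the suffixes, while keep flags all three.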
I would like to rename the rows by removing the common part of each row name
a b c
CDA_Part 1 4 4
CDZ_Part 3 4 4
CDX_Part 1 4 4
result
a b c
CDA 1 4 4
CDZ 3 4 4
CDX 1 4 4
1. Create a minimal reproducible example:
df <- data.frame(a = 1:3, b = 4:6)
rownames(df) <- c("CDA_Part", "CDZ_Part", "CDX_Part")
df
Returns:
a b
CDA_Part 1 4
CDZ_Part 2 5
CDX_Part 3 6
2. Suggested solution using base R's gsub:
rownames(df) <- gsub("_Part", "", rownames(df), fixed=TRUE)
df
Returns:
a b
CDA 1 4
CDZ 2 5
CDX 3 6
Explanation:
gsub uses regex to identify and replace parts of strings. Its first three arguments are:
pattern: the pattern to be replaced, i.e. "_Part"
replacement: the string to use as the replacement, i.e. the empty string ""
x: the string in which to replace something, i.e. the row names
An additional argument (not among the first three):
fixed: whether pattern is treated as a literal string rather than a regular expression; here it is TRUE, so "_Part" is just a string
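To make the role of fixed concrete, here is a short sketch contrasting the literal replacement with a regex that only strips a trailing "_Part". The third name is invented to show the difference:

```r
rn <- c("CDA_Part", "CDZ_Part", "X_Part_Part")

# fixed = TRUE: "_Part" is a literal string and every occurrence is removed
literal_all <- gsub("_Part", "", rn, fixed = TRUE)

# Regex with an end anchor: only a "_Part" at the very end is removed
trailing_only <- sub("_Part$", "", rn)

print(literal_all)
print(trailing_only)
```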
You can try this approach: use Reduce with intersect to determine the parts that are common to all row names. Note that I am assuming your row names are structured as below, with an underscore separating two words. This solution works with both word_commonpart and commonpart_word, as in the example below.
Logic:
strsplit splits each row name at the underscore without consuming it (hence the zero-width lookahead assertion), and Reduce then finds the intersection of the pieces across all row names. The pieces found are pasted together into a regex with pipe-separated alternatives, and gsub replaces every match with nothing.
Input:
df <- structure(list(a = 1:4, b = 4:7), class = "data.frame", row.names = c("CDA_Part",
"CDZ_Part", "CDX_Part", "Part_ABC"))
Solution:
## 1. determine the parts common to all row names
red <- Reduce('intersect', strsplit(rownames(df), "(?=_)", perl = TRUE))
## 2. get all combinations of the underscore and the remaining common parts
e <- expand.grid(red, red)
## 3. keep only the combinations whose two parts differ and paste each pair together with do.call
## 4. collapse them with paste0 into a regex separated by pipes
## 5. replace the common parts with nothing
rownames(df) <- gsub(paste0(do.call('paste0', e[e$Var1 != e$Var2, ]), collapse = "|"), '', rownames(df))
Output:
> df
# a b
# CDA 1 4
# CDZ 2 5
# CDX 3 6
# ABC 4 7
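The same idea can be sketched without building a regex. This is a swapped-in variant, not the method above: it splits on a plain "_", intersects the pieces with Reduce, and keeps whatever is not common. It assumes, as above, that the underscore is the only separator:

```r
rn <- c("CDA_Part", "CDZ_Part", "CDX_Part", "Part_ABC")

# Split each name on a literal underscore (the underscore is dropped here)
tokens <- strsplit(rn, "_", fixed = TRUE)

# Pieces that occur in every row name
common <- Reduce(intersect, tokens)

# Keep only the pieces that are not common to all names
cleaned <- sapply(tokens, function(tk) paste(setdiff(tk, common), collapse = "_"))

print(common)
print(cleaned)
```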
I have data that has two columns. Each column of data has numerical values in it but some of them don't have any numerical values. I want to remove the rows which don't have all values numerical. In reality, the data has 1000 rows but for simplification, I made the data file in smaller size here. Thanks!
a <- c(1, 2, 3, 4, "--")
b <- c("--", 2, 3, "--", 5)
data <- data.frame(a, b)
One base R option could be:
data[!is.na(Reduce(`+`, lapply(data, as.numeric))), ]
a b
2 2 2
3 3 3
And for importing the data, use stringsAsFactors = FALSE.
Or using sapply():
data[!is.na(rowSums(sapply(data, as.numeric))), ]
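The stringsAsFactors = FALSE advice matters mostly in R versions before 4.0, where data.frame() converted strings to factors by default. As a small illustration with made-up values, as.numeric on a factor returns the internal level codes rather than the numbers shown:

```r
f <- factor(c("10", "20", "10"))

# Level codes, not the printed values
codes <- as.numeric(f)

# Converting via character recovers the actual numbers
values <- as.numeric(as.character(f))

print(codes)
print(values)
```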
An easier option is to convert each column with as.numeric and then check for NA. Non-numeric elements become NA (with a warning), which is.na detects, and filter_all uses that to remove the rows:
library(dplyr)
data %>%
filter_all(all_vars(!is.na(as.numeric(.))))
# a b
#1 2 2
#2 3 3
If we don't like the warnings, an option is to detect number-only elements with a regex: str_detect checks that the string consists of one or more digits or dots ([0-9.]+) from start (^) to end ($):
library(stringr)
data %>%
filter_all(all_vars(str_detect(., "^[0-9.]+$")))
# a b
#1 2 2
#2 3 3
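The regex check and the as.numeric check do not always agree, which may matter for messier data. A brief sketch on made-up strings, using suppressWarnings to silence the coercion warnings:

```r
candidates <- c("2", "3.5", "--", "1.2.3", "-5")

# Pattern check: digits and dots only, so "1.2.3" passes and "-5" fails
regex_ok <- grepl("^[0-9.]+$", candidates)

# Parse check: whatever as.numeric can actually convert
parse_ok <- !is.na(suppressWarnings(as.numeric(candidates)))

print(data.frame(candidates, regex_ok, parse_ok))
```

For the sample data in the question both checks give the same rows; they differ only on inputs such as negative numbers or strings with several dots.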
If -- is the only non-numeric value, the rows are easier to remove:
data[!rowSums(data == "--"),]
# a b
#2 2 2
#3 3 3
data
data <- data.frame(a,b, stringsAsFactors = FALSE)
I have a string whose structure and length can keep varying, that is
Input:
X <- ("A=12&B=15&C=15")
Y <- ("A=12&B=15&C=15&D=32&E=53")
What I want is to convert this string to a data frame
Output Expected:
Dataframe X
A B C
12 15 15
and Dataframe Y
A B C D E
12 15 15 32 53
What I tried was this:
X <- as.data.frame(strsplit(X, split="&"))
But this didn't work for me, as it created only one column and column name was messed up.
P.S: I cannot hard code the column names because they can vary, and at any given time a string will contain only one row
One option is to extract the numeric part of the string and read it with read.table. The pattern [^0-9]+ matches one or more characters that are not digits; the first gsub replaces each such run with a space, and read.table reads the result. The column names are passed in the col.names argument and are obtained by removing all characters that are not upper-case letters (second gsub):
f1 <- function(str1){
read.table(text=gsub("[^0-9]+", " ", str1),
col.names = scan(text=trimws(gsub("[^A-Z]+", " ", str1)),
what = "", sep=" ", quiet=TRUE))
}
f1(X)
# A B C
#1 12 15 15
f1(Y)
# A B C D E
#1 12 15 15 32 53
You can try this too:
library(stringr)
res <- str_match_all(X, "([A-Z]+)=([0-9]+)")[[1]]
df <- as.data.frame(matrix(as.integer(res[,3]), nrow=1))
names(df) <- res[,2]
df
A B C
1 12 15 15
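A further base R sketch splits first on "&" and then on "=", which also copes with keys that are not single upper-case letters. The function name parse_pairs is hypothetical, and the sketch assumes every pair has the form key=number:

```r
parse_pairs <- function(s) {
  # "A=12&B=15" -> list(c("A", "12"), c("B", "15"))
  pairs <- strsplit(strsplit(s, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  keys <- vapply(pairs, `[`, character(1), 1)
  vals <- vapply(pairs, `[`, character(1), 2)
  # A named list becomes a one-row data frame with one column per key
  as.data.frame(as.list(setNames(as.numeric(vals), keys)))
}

X <- "A=12&B=15&C=15"
Y <- "A=12&B=15&C=15&D=32&E=53"
res_x <- parse_pairs(X)
res_y <- parse_pairs(Y)
print(res_x)
print(res_y)
```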
I have a big data frame with varying numbers of columns and rows. I would like to search the data frame for the values in a given vector and remove every row in which any cell matches one of those values. I'd like to have this as a function because I have to run it on multiple data frames with varying rows and columns, and I would like to avoid for loops.
for example
ff<-structure(list(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15), .Names = c("j.1","j.2", "j.3"), row.names = c(NA, -13L), class = "data.frame")
remove all rows that have cells that contain the values 8,9,10
I guess I could use ff[ !ff[,1] %in% c(8, 9, 10), ] or subset(ff, !ff[,1] %in% c(8,9,10) )
but in order to remove all the values from the dataset I would have to check each column (probably with a for loop, something I wish to avoid).
Is there any other (cleaner) way?
Thanks a lot
Apply your test to each row:
keeps <- apply(ff, 1, function(x) !any(x %in% 8:10))
which gives a logical vector. Then subset with it:
ff[keeps,]
j.1 j.2 j.3
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 6
5 5 6 7
11 11 12 13
12 12 13 14
13 13 14 15
I suppose the apply strategy may turn out to be the most economical but one could also do either of these:
ff[ !rowSums( sapply( ff, function(x) x %in% 8:10) ) , ]
ff[ !Reduce("+", lapply( ff, function(x) x %in% 8:10) ) , ]
Both add up the logical vectors column by column (a nonzero sum is equivalent to any()) and then negate the result. I suspect the first one would be faster.
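Since the question asks for a reusable function, the column-wise idea can be wrapped up as follows. This is a sketch; the name drop_rows_with is made up, and it combines the per-column tests with a logical OR instead of a sum:

```r
drop_rows_with <- function(df, values) {
  # One logical vector per column, TRUE where the cell is one of `values`
  hit <- Reduce(`|`, lapply(df, function(col) col %in% values))
  # drop = FALSE keeps a data frame even when only one column is present
  df[!hit, , drop = FALSE]
}

ff <- data.frame(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15)
res <- drop_rows_with(ff, 8:10)
print(res)
```

Because lapply iterates over columns, no explicit for loop is needed, and the function works for any number of rows and columns.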