Convert string to dictionary in R - r

I have a data frame column with dictionary-like strings.
data = data.frame(date = c('2022-12-01', '2022-12-02'),
code = c("{\"551\":4,\"181\":4,\"180\":4,\"181\":4}",
"{\"321\":14,\"181\":4,\"230\":4,\"189\":12}"))
My goal is to calculate, for each row, the total of the values whose "dictionary" key starts with 18.
For example, in the first row (2022-12-01) three keys start with 18, so the total is 4+4+4 = 12.
In the second row (2022-12-02) two keys start with 18, so the total is 4+12 = 16.
I tried strsplit(data$code, "\\W"), which splits on every delimiter, and strsplit(data$code, ","), but neither stores the result as a dictionary-type structure.
I feel that after converting the string to a dictionary, filtering on names that start with 18 would be feasible, but I have no idea how to get started. Thank you for your advice!

data = data.frame(date = c('2022-12-01', '2022-12-02'),
code = c("{\"551\":4,\"181\":4,\"180\":4,\"181\":4}",
"{\"321\":14,\"181\":4,\"230\":4,\"189\":12}"))
data$count <- lapply(data$code,jsonlite::fromJSON) |> sapply(
\(x) sum(unlist(x)[grep("^18", names(x))]) )
data
#> date code count
#> 1 2022-12-01 {"551":4,"181":4,"180":4,"181":4} 12
#> 2 2022-12-02 {"321":14,"181":4,"230":4,"189":12} 16

Here are several approaches. The first uses strapply and is particularly short. The next shows how to create a dictionary using strapply and the last uses only base R.
In all of these use transform(data, sum = ...) or use mutate in dplyr to add the solution as a new column to data.
1) Match the number after an 18 and then convert the match to numeric and sum. Using strapply we get particularly concise code.
library(gsubfn)
sapply(strapply(data$code, '"18\\d+":(\\d+)', as.numeric), sum)
## [1] 12 16
2) In the question the desirability of creating a dictionary first was mentioned. To do that dict below is a list of dictionaries, one per row, and then we grep out the desired elements and sum.
library(gsubfn)
dict <- strapply(data$code, '"(\\d+)":(\\d+)', x + y ~ setNames(as.numeric(y), x))
sapply(lapply(dict, function(x) x[grepl("^18", names(x))]), sum)
## [1] 12 16
dict
## [[1]]
## 551 181 180 181
## 4 4 4 4
##
## [[2]]
## 321 181 230 189
## 14 4 4 12
3) A base solution replaces the {, } and comma characters with newline and then for each row reads the rest into two columns (the dictionary). It then subsets out the rows that begin with 18 and sums.
sapply(data$code, function(x)
gsub('[{},]', '\n', x) |>
read.table(text = _, sep = ":") |>
subset(grepl("^18", V1)) |>
with(sum(V2)), USE.NAMES = FALSE)
## [1] 12 16
If you just want that part of the code that constructs the dictionaries
lapply(data$code, function(x)
gsub('[{},]', '\n', x) |>
read.table(text = _, sep = ":"))
## [[1]]
## V1 V2
## 1 551 4
## 2 181 4
## 3 180 4
## 4 181 4
##
## [[2]]
## V1 V2
## 1 321 14
## 2 181 4
## 3 230 4
## 4 189 12
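As a minimal sketch of the transform(data, sum = ...) step mentioned above, the base R variant can be wrapped in a helper and added as a column. The helper name sum18 is made up for this illustration, and the code avoids the _ pipe placeholder so it does not depend on R 4.2 or later.

```r
data <- data.frame(date = c("2022-12-01", "2022-12-02"),
                   code = c('{"551":4,"181":4,"180":4,"181":4}',
                            '{"321":14,"181":4,"230":4,"189":12}'))

# Hypothetical helper: sum the values whose key starts with 18
sum18 <- function(code) {
  sapply(code, function(x) {
    # one key:value pair per line, read into columns V1 (key) and V2 (value)
    kv <- read.table(text = gsub("[{},]", "\n", x), sep = ":")
    sum(kv$V2[grepl("^18", kv$V1)])
  }, USE.NAMES = FALSE)
}

# add the result as a new column
data <- transform(data, sum = sum18(code))
data$sum
## [1] 12 16
```

Rows without any key starting with 18 get 0, because the sum of an empty selection is 0.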

I would first make a data.frame where each row is a {name, value} pair. I do this by first separating the pairs onto rows, then separating the name and value into separate columns. Then I parse the text to keep only the numbers. Finally we summarise the table by date, taking the sum of those values for which the name starts with "18".
library(tidyverse)
data %>%
separate_rows(code, sep = ',') %>%
separate(code, sep = '":', into = c('name', 'value')) %>%
mutate(across(c(name, value), parse_number)) %>%
group_by(date) %>%
summarise(result = sum(value[substr(name, 1, 2) == "18"]))

Using base R
data$Sum <- sapply(regmatches(data$code, gregexpr('(?<=18\\d":)(\\d+)',
data$code, perl = TRUE)), \(x) sum(as.numeric(x)))
data$Sum
[1] 12 16
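One thing to be aware of: the lookbehind above does not require the key to start at 18, so a key such as "2181" would also be counted. A possible tweak, shown here only as a sketch on a made-up test string, is to include the opening quote in the lookbehind. Like the original pattern, it still assumes keys of exactly three digits.

```r
# made-up input with a key ("2181") that merely contains 18
code <- '{"551":4,"181":4,"2181":7}'

# pattern from the answer: also picks up the value of "2181"
loose <- sum(as.numeric(
  regmatches(code, gregexpr('(?<=18\\d":)\\d+', code, perl = TRUE))[[1]]))

# lookbehind that includes the opening quote of the key
strict <- sum(as.numeric(
  regmatches(code, gregexpr('(?<="18\\d":)\\d+', code, perl = TRUE))[[1]]))

loose
## [1] 11
strict
## [1] 4
```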

A base R approach using strsplit and sub/gsub.
First remove the braces and quotes, then keep the strings that start with 18 (the ^ anchor) and finally sum the numbers after the colon.
cbind(data, Sum = sapply(strsplit(data$code, ","), function(x)
sum(as.numeric(
sub(".*:", "", grep("^18", gsub("\\{|\"|\\}", "", x), value = TRUE)))
)))
date code Sum
1 2022-12-01 {"551":4,"181":4,"180":4,"181":4} 12
2 2022-12-02 {"321":14,"181":4,"230":4,"189":12} 16

Related

How to convert a sequence of numbers from data frame to text file in special format in R?

I have a data frame with one field consisting of a sequence of numbers:
test <- data.frame(N=c(1,2,3,5,7,8,9,11,13,14,15))
> test
N
1 1
2 2
3 3
4 5
5 7
6 8
7 9
8 11
9 13
10 14
11 15
The field N contains a sequence of integers in ascending order
N sometimes skips some numbers, such as 2,3,5 (4 is missing).
I need to convert it into the following text format:
1-3,5,7-9,
11,13-15
The output is not a data frame but a plain text file that satisfies the following conditions:
Runs of consecutive numbers should have their middle values removed and replaced by -, i.e., 1,2,3 should become 1-3 and 1,2,3,5,6 should become 1-3,5,6
Each number (or shortened run) should be separated by a comma , (no space needed)
Once a line holds three numbers or shortened runs, the output should continue on the next line
Each line should end with a comma, except the last line
Currently I can only convert the data frame into a sequence of numbers, but the output is surrounded by c(), the consecutive numbers are not shortened, and there are no line breaks.
> tapply(test, (seq_along(test)-1)%/%3, paste, collapse=", ")
0
"c(1, 2, 3, 5, 7, 8, 9, 11, 13, 14, 15)"
I appreciate your idea to make it!!
Thank you in advance for your support.
Here's a possible solution using dplyr -
library(dplyr)
output <- test %>%
#Create groups to collapse consecutive numbers
group_by(grp = cumsum(c(TRUE, diff(N) > 1))) %>%
#If more than 1 number in a group paste first and last value
summarise(text = if(n() > 1) paste0(range(N), collapse = '-') else as.character(N)) %>%
#For each 3 groups collapse the ranges as one string
group_by(line_num = ceiling(row_number()/3)) %>%
summarise(text = toString(text))
output
# line_num text
# <dbl> <chr>
#1 1 1-3, 5, 7-9
#2 2 11, 13-15
#Write the output
cat(paste0(output$text, collapse = '\n'), file = 'output.txt')
The output text file contains the two strings from the text column, one per line.
Let v be the vector of numbers.
v <- c(1,2,3,5,7,8,9,11,13,14,15)
then split v into consecutive sets
vv <- split(v, cumsum(c(1, diff(v) != 1)))
vv
$`1`
[1] 1 2 3
$`2`
[1] 5
$`3`
[1] 7 8 9
$`4`
[1] 11
$`5`
[1] 13 14 15
Finally, transform each set into the form you want
library(magrittr) # provides %>%
lapply(vv, function(x) {
  if (length(x) == 1) {
    as.character(x)
  } else {
    paste0(x[1], "-", tail(x, n = 1))
  }
}) %>% unlist %>% as.vector
[1] "1-3" "5" "7-9" "11" "13-15"
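To get from these labels to the text layout asked for in the question (three items per line, no spaces, a trailing comma on every line but the last), something like the following base R sketch might work. It assumes, as both answers here do, that a run of two numbers is also shortened with a dash; the file name output.txt is only an example.

```r
v <- c(1, 2, 3, 5, 7, 8, 9, 11, 13, 14, 15)

# split into runs of consecutive numbers
runs <- split(v, cumsum(c(1, diff(v) != 1)))

# one label per run: a single number, or "first-last"
labels <- vapply(runs, function(x) {
  if (length(x) == 1) as.character(x) else paste0(x[1], "-", x[length(x)])
}, character(1))

# three labels per line, joined by commas without spaces
line_id <- ceiling(seq_along(labels) / 3)
lines <- tapply(labels, line_id, paste, collapse = ",")

# a comma ends every line except the last one
out <- paste(lines, collapse = ",\n")
cat(out, "\n", sep = "")
## 1-3,5,7-9,
## 11,13-15
```

The text can then be written with cat(out, "\n", sep = "", file = "output.txt").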

Using mutate with a stored list of formulas over specified columns

This is a follow-up to my previous question here, which @ronak_shah was kind enough to answer. I apologize as some of this information may be redundant to anyone who saw that post, but I figured it best to post a new question rather than modify the previous one.
I would still like to iterate through a stored list of columns and procedures to create n new columns based on this list. In the example below, we start with 3 columns, a, b, c and a simple function, func1.
The data frame col_mod identifies which column should be changed, what the second argument to the function that changes them should be, and then generates a statement to execute the function. Each of these modifications should be an addition to the original data frame, rather than replacements of the specified columns. The new names of these columns should be a_new and c_new, respectively.
At the bottom of the reprex below, I am able to obtain my desired result manually, but as before, I would like to automate this using a mapping function.
I am attempting to use the same approach that was provided as an answer to my previous question, but I keep on getting the following error: "Error in get(as.character(FUN), mode = "function", envir = envir) : object 'func1(a,3)' of mode 'function' was not found"
If anyone can help, it would be much appreciated!
library(tidyverse)
## fake data
dat <- data.frame(a = 1:5,
b = 6:10,
c = 11:15)
## function
func1 <- function(x, y) {x + y}
## modification list
col_mod <- data.frame("col" = c("a", "c"),
"y_val" = c(3, 4),
stringsAsFactors = FALSE) %>%
mutate(func = paste0("func1(", col, ",", y_val, ")"))
## desired end result
dat %>%
mutate(a_new = func1(a, 3),
c_new = func1(c, 4))
## attempting to generate new columns based on @ronak_shah's answer to my previous
## question, but this fails to run
dat[paste0(col_mod$col, '_new')] <- Map(function(x, y) match.fun(y)(x),
dat[col_mod$col], col_mod$func)
We can use pmap from purrr. For each row of col_mod, transmute creates one new column: the source column name comes from 'col' (..1), the second argument from 'y_val' (..2) and the function from 'func' (..3). The new column name is built with str_c (or paste) and assigned with :=, and the resulting columns are then bound to the original dataset
library(dplyr)
library(purrr)
library(stringr)
library(tibble)
col_mod$func <- 'func1'
pmap(col_mod, ~ dat %>%
transmute(!! str_c(..1, "_new") :=
match.fun(..3)(!! rlang::sym(..1), ..2))) %>%
bind_cols(dat, .)
-output
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
If we want to parse the call as it is, use parse_expr and eval, i.e. without the change to the func column made above, so that it still holds func1(a,3) and func1(c,4)
pmap(col_mod, ~ dat %>%
transmute(!! str_c(..1, "_new") :=
eval(rlang::parse_expr(..3)))) %>%
bind_cols(dat, .)
-output
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
Or using base R with Map
dat[paste0(col_mod$col, '_new')] <- do.call(Map, c(f =
function(x, y, z) eval(parse(text = z), envir = dat), unname(col_mod)))
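The error in the question arises because match.fun() is given the whole call string "func1(a,3)" and looks for a function with exactly that name. If the function name can be stored on its own, parsing and evaluating text is not needed at all. Below is a rough base R sketch of that idea; the column name fn is an assumption made for this illustration and is not part of the original col_mod.

```r
dat <- data.frame(a = 1:5, b = 6:10, c = 11:15)
func1 <- function(x, y) x + y

# modification list: column, second argument, and the function *name* only
# (the 'fn' column is hypothetical, introduced for this sketch)
col_mod <- data.frame(col = c("a", "c"),
                      y_val = c(3, 4),
                      fn = "func1",
                      stringsAsFactors = FALSE)

# look up each function by name and apply it to the matching column
new_cols <- Map(function(col, y, fn) match.fun(fn)(dat[[col]], y),
                col_mod$col, col_mod$y_val, col_mod$fn)

dat[paste0(col_mod$col, "_new")] <- unname(new_cols)
dat$a_new
## [1] 4 5 6 7 8
dat$c_new
## [1] 15 16 17 18 19
```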

How to separate a column into two columns

df <- data.frame(PATIENT_ID=c(1,2,3,4),
CODE=c('N18','N180','N190','M1920'))
I want to separate the variable 'CODE' into two variables. One holds the first letter of 'CODE' ('N' or 'M' in my case), the other holds the remaining number. If there are more than two digits, put a '.' after the second digit.
The output should be
df <- data.frame(PATIENT_ID=c(1,2,3,4),
CODE=c('N18','N180','N190','M1920'),
VOR_1=c('N','N','N','M'),
VOR_2=c('18','18.0','19.0','19.20'))
Finally, define the variable of 'VOR_2' as a numeric variable.
Using sub for a base R solution:
df$VOR_1 <- sub("^([A-Z]).*$", "\\1", df$CODE)
df$VOR_2 <- sub("^([0-9]{2})(?=[0-9])", "\\1.", sub("^[A-Z]([0-9]+)$", "\\1", df$CODE), perl=TRUE)
df$VOR_2 <- as.numeric(df$VOR_2) # if desired
df
PATIENT_ID CODE VOR_1 VOR_2
1 1 N18 N 18
2 2 N180 N 18.0
3 3 N190 N 19.0
4 4 M1920 M 19.20
An explanation of the logic behind VOR_2 is warranted. We first extract all the digits from the second character onwards using the simple regex ^[A-Z]([0-9]+)$. Then we make a second call to sub on the digit string to insert a decimal point after the second digit. The pattern uses a positive lookahead, which ensures that a dot is inserted only when there are three or more digits.
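For comparison, the same logic can be written without regular expressions. This is only a sketch, and it assumes every code is exactly one letter followed by digits; the intermediate names are made up for the example.

```r
code <- c("N18", "N180", "N190", "M1920")

vor_1  <- substr(code, 1, 1)   # leading letter
digits <- substring(code, 2)   # everything after the letter

# insert a dot after the second digit only when more digits follow
vor_2_chr <- ifelse(nchar(digits) > 2,
                    paste0(substr(digits, 1, 2), ".", substring(digits, 3)),
                    digits)
vor_2_chr
## [1] "18"    "18.0"  "19.0"  "19.20"

# numeric version: trailing zeros are no longer shown
vor_2 <- as.numeric(vor_2_chr)
vor_2
## [1] 18.0 18.0 19.0 19.2
```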
An idea via tidyr using separate can be,
library(dplyr)
library(tidyr) #separate
df %>%
separate(CODE, into = c("text", "num"), sep = "(?<=[A-Za-z])(?=[0-9])") %>%
mutate(num = as.numeric(num),
num = num / (10 ^ (nchar(num) - 2))
)
# PATIENT_ID text num
#1 1 N 18.0
#2 2 N 18.0
#3 3 N 19.0
#4 4 M 19.2
You can use str_extract and sub:
library(stringr)
df$VOR1 <- str_extract(df$CODE, "^[A-Z]")
Here, you simply grab the capital letter at the beginning of the string, marked by ^.
df$VOR2 <- sub("(\\d{2})(\\d{1,2})", "\\1.\\2", str_extract(df$CODE, "\\d+"))
Here, you first extract just the digits using str_extract and then insert the period where appropriate:
Result:
df
PATIENT_ID CODE VOR1 VOR2
1 1 N18 N 18
2 2 N180 N 18.0
3 3 N190 N 19.0
4 4 M1920 M 19.20

Remove part of a string based on overlapping patterns

I have the following data:
dat <- data.frame(x = c("this is my example text", "and here is my other text example", "my other text is short"),
some_other_cols = c(1, 2, 2))
Further, I have the following vector of patterns:
my_patterns <- c("my example", "is my", "my other text")
What I want to achieve is to remove any text of my_patterns that occurs in dat$x.
I tried the solution below, but the problem is that as soon as I remove the first pattern from the text (here: "my example"), my solution can no longer detect the occurrence of the second (here: "is my") or third pattern.
Wrong solution:
library(tidyverse)
my_patterns_c <- str_c(my_patterns, collapse = "|")
dat_new <- dat %>%
mutate(short_x = str_replace_all(x, pattern = my_patterns_c, replacement = ""))
I guess I could do something like looping through all patterns, collecting the string positions in dat$x that match my patterns, then combining them into a range and deleting that range from the text. E.g. I add columns to my dat data frame like start_pattern_1 and end_pattern_1 and so on. So for row 1 I get 9 (start) and 18 (end) for the first pattern, and 6/10 for the second pattern. I then need to check whether any end position overlaps with any start position (here start 9 and end 10), combine them into a range 6-18 and remove this range from the text.
Problem is that I potentially have many new start/end columns then (could be a few hundred patterns in my case) and if I need to pairwise compare the overlapping ranges, my computer will probably crash.
So I'm wondering how I could get it work or how I should best approach this solution. Maybe (and I hope so) there's a better/more elegant/easy solution.
Desired Output of dat would be:
x some_other_cols short_x
this is my example text 1 this text
and here is my other text example 2 and here example
my other text is short 2 is short
Appreciate your help! Thanks.
New option with str_locate_all, mentioned by Uwe in a comment under the question, which greatly simplifies the code:
library(stringr)
# Create function to remove matching part of text
# First argument is the text, second argument is a matrix of start/end positions
remove_matching_parts <- function(text, positions) {
  if (nrow(positions) == 0) return(text)
  ret <- strsplit(text, "")[[1]]
  # Mark every character covered by a match as NA
  for (i in seq_len(nrow(positions))) ret[positions[i, 1]:positions[i, 2]] <- NA
  paste0(ret[!is.na(ret)], collapse = "")
}
# Loop over the data to locate the patterns
# Result: one matrix of start/end positions per element of dat$x
matches <- lapply(dat$x, function(x) {
  do.call(rbind, str_locate_all(x, my_patterns)) # bind the list output of str_locate_all into one start/end matrix
})
# Avoid growing a vector in a for loop: create it beforehand, with the same length as the vector we work against
dat$result <- vector("character", length(dat$x))
# Loop over each value to remove the matching parts
for (i in seq_along(dat$x)) {
  dat$result[i] <- remove_matching_parts(as.character(dat$x[i]), matches[[i]])
}
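The same idea (locate every pattern in the original text first, then delete all covered characters in one go) can also be sketched in base R only. The function name remove_patterns is made up for this example, the patterns are treated as fixed strings rather than regular expressions, and the final whitespace clean-up is an extra step that the code above does not perform.

```r
# Hypothetical helper: drop every character covered by at least one pattern
remove_patterns <- function(text, patterns) {
  chars <- strsplit(text, "")[[1]]
  drop <- logical(length(chars))
  for (p in patterns) {
    m <- gregexpr(p, text, fixed = TRUE)[[1]]  # all start positions of p
    if (m[1] == -1) next                       # pattern not found
    len <- attr(m, "match.length")
    for (k in seq_along(m)) drop[m[k]:(m[k] + len[k] - 1)] <- TRUE
  }
  paste(chars[!drop], collapse = "")
}

x <- c("this is my example text",
       "and here is my other text example",
       "my other text is short")
my_patterns <- c("my example", "is my", "my other text")

res <- vapply(x, remove_patterns, character(1),
              patterns = my_patterns, USE.NAMES = FALSE)
# collapse repeated spaces and trim both ends
res <- gsub("\\s+", " ", trimws(res))
res
## [1] "this text"        "and here example" "is short"
```

Because all positions refer to the untouched text, overlapping patterns need no special handling: a character is dropped as soon as any pattern covers it.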
If you have control over the pattern definition and can create it by hand then it can be achieved with a regex solution:
> gsub("(is )?my (other text|example)?","",dat$x)
[1] "this text" "and here example" " is short"
The idea is to create the pattern with optional parts (the ? after the grouping parentheses).
So we have roughly:
(is )? <= optional "is" followed by space
my <= literal "my" followed by space
(other text|example)? <= Optional text after "my ", either "other text" or (the |) "example"
If you don't have control, things get messy. I hope I've commented enough for it to be understandable; given the number of loops involved, don't expect it to be quick:
# Given data
dat <- data.frame(x = c("this is my example text", "and here is my other text example", "my other text is short","yet another text"),
some_other_cols = c(1, 2, 2, 4))
my_patterns <- c("my example", "is my", "my other text")
# Create function to remove matching part of text
# First argument is text, second argument is a list of start and length
remove_matching_parts <- function(text, positions) {
ret <- strsplit(text,"")[[1]]
lapply(positions, function(x) { ifelse(is.na(x),,ret[ x[1]:x[2] ] <<- NA ) } )
paste0(ret[!is.na(ret)],separator="",collapse="")
}
# Create the matches between a vector and a pattern
# First argument is the pattern to match, second is the vector of characters
match_pat_to_vector <- function(pattern,vector) {
sapply(regexec(pattern,vector),
function(x) {
if(x>-1) {
c(start=as.numeric(x), end=as.numeric(x+attr(x,"match.length")) ) # Create a start/end vector from the index and length of the match
}
})
}
# Loop over the patterns to create a dataframe of matches
# row = length of vector, columns = length of pattern
matches <- sapply(my_patterns,match_pat_to_vector,vector=dat$x)
# Avoid growing a vector in a for loop: create it beforehand, with the same length as the vector we work against
dat$result <- vector("character",length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
dat$result[i] <- remove_matching_parts(as.character(dat$x[i]),matches[i,])
}
Result after run:
> dat
x some_other_cols result
1 this is my example text 1 this text
2 and here is my other text example 2 and here example
3 my other text is short 2 is short
4 yet another text 4 yet another text
There are two crucial points here:
The patterns to remove from a string may overlap
There may be multiple non-overlapping patterns to remove from the string
The solution below tries to address both issues using my favorite tools
library(data.table)
setDT(dat)[, rn := .I] # add row numbers to join on later
library(stringr)
library(magrittr) # piping used to improve readability
pos <-
# find start and end positions for each pattern
lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
lapply(as.data.table) %>%
rbindlist(idcol = "rn")) %>%
rbindlist() %>%
# collapse overlapping positions
setorder(rn, start, end) %>%
.[, grp := cumsum(cummax(shift(end, fill = 0)) < start), by = rn] %>%
.[, .(start = min(start), end = max(end)), by = .(rn, grp)]
Now, pos has become:
rn grp start end
1: 1 1 6 18
2: 2 1 10 25
3: 3 1 1 13
4: 5 1 6 10
5: 5 2 24 28
6: 6 1 1 13
7: 6 2 15 27
8: 7 1 3 7
9: 8 1 1 10
10: 8 2 12 16
11: 8 3 22 34
12: 9 1 1 10
13: 9 2 19 31
# remove patterns from strings from back to front
dat[, short_x := x]
for (g in rev(seq_len(max(pos$grp)))) {
# update join
dat[pos[grp == g], on = .(rn), short_x := `str_sub<-`(short_x, start, end, value = "")]
}
dat[, rn := NULL][ #remove row number
, short_x := str_squish(short_x)][] # remove whitespace
x some_other_cols short_x
1: this is my example text 1 this text
2: and here is my other text example 2 and here example
3: my other text is short 2 is short
4: yet another text 4 yet another text
5: this is my text where 'is my' appears twice 5 this text where '' appears twice
6: my other text is my example 6
7: This myself 7 Thself
8: my example is my not my other text 8 not
9: my example is not my other text 9 is not
The code to collapse overlapping positions is modified from this answer.
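The collapsing step may be easier to follow in isolation. The sketch below is a base R rendering of the same trick for a single string: sort the intervals, compare each start with the largest end seen so far, and open a new group whenever there is a gap. The function name merge_intervals is made up for the example, and it expects at least one interval.

```r
# Hypothetical helper: merge overlapping start/end intervals of one string
merge_intervals <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  # largest end position among all earlier intervals (0 for the first one)
  prev_max_end <- cummax(c(0, head(end, -1)))
  # a new group starts whenever an interval begins after everything before it
  grp <- cumsum(prev_max_end < start)
  data.frame(start = as.vector(tapply(start, grp, min)),
             end   = as.vector(tapply(end, grp, max)))
}

# row 1 above: "my example" (9-18) and "is my" (6-10) overlap
merge_intervals(c(9, 6), c(18, 10))
##   start end
## 1     6  18

# row 8 above: three matches that do not overlap stay separate
merge_intervals(c(1, 12, 22), c(10, 16, 34))
##   start end
## 1     1  10
## 2    12  16
## 3    22  34
```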
The intermediate result
lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
lapply(as.data.table) %>%
rbindlist(idcol = "rn"))
[[1]]
rn start end
1: 1 9 18
2: 6 18 27
3: 8 1 10
4: 9 1 10
[[2]]
rn start end
1: 1 6 10
2: 2 10 14
3: 5 6 10
4: 5 24 28
5: 6 15 19
6: 7 3 7
7: 8 12 16
[[3]]
rn start end
1: 2 13 25
2: 3 1 13
3: 6 1 13
4: 8 22 34
5: 9 19 31
shows that patterns 1 and 2 overlap in row 1 and patterns 2 and 3 overlap in row 2. Rows 5, 8, and 9 have non-overlapping patterns. Row 7 is to show that patterns are extracted regardless of word boundaries.
EDIT: dplyr version
The OP has mentioned that he/she has "successfully avoided data.table so far". So, I felt challenged to add a dplyr version:
library(dplyr)
library(stringr)
pos <-
# find start and end positions for each pattern
lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
lapply(as_tibble) %>%
bind_rows(.id = "rn")) %>%
bind_rows() %>%
# collapse overlapping positions
arrange(rn, start, end) %>%
group_by(rn) %>%
mutate(grp = cumsum(cummax(lag(end, default = 0)) < start)) %>%
group_by(rn, grp) %>%
summarize(start = min(start), end = max(end))
# remove patterns from strings from back to front
dat <- dat %>%
mutate(rn = row_number() %>% as.character(),
short_x = x %>% as.character())
for (g in rev(seq_len(max(pos$grp)))) {
dat <- dat %>%
left_join(pos %>% filter(grp == g), by = "rn") %>%
mutate(short_x = ifelse(is.na(grp), short_x, `str_sub<-`(short_x, start, end, value = ""))) %>%
select(-grp, -start, -end)
}
# remove row number
dat %>%
select(-rn) %>%
mutate(short_x = str_squish(short_x))
x some_other_cols short_x
1 this is my example text 1 this text
2 and here is my other text example 2 and here example
3 my other text is short 2 is short
4 yet another text 4 yet another text
5 this is my text where 'is my' appears twice 5 this text where '' appears twice
6 my other text is my example 6
7 This is myself 7 This self
8 my example is my not my other text 8 not
9 my example is not my other text 9 is not
The algorithm is essentially the same. However, there are two challenges here where dplyr differs from data.table:
dplyr requires explicit coercion from factor to character
there is no update join available in dplyr, so the for loop has become more verbose than the data.table counterpart (Perhaps, someone knows a fancy purrr function or a map-reduce trick to accomplish the same?)
EDIT 2
There are some bug fixes and improvements to the code above:
Collapsing positions has been corrected to work also for some edge case I have added to dat.
seq() has been replaced by seq_len().
str_squish() reduces repeated whitespace inside a string and removes whitespace from start and end of a string.
Data
I have added some use cases to test for non-overlapping patterns and complete removal, e.g.:
dat <- data.frame(
x = c(
"this is my example text",
"and here is my other text example",
"my other text is short",
"yet another text",
"this is my text where 'is my' appears twice",
"my other text is my example",
"This myself",
"my example is my not my other text",
"my example is not my other text"
),
some_other_cols = c(1, 2, 2, 4, 5, 6, 7, 8, 9)
)
my_patterns <- c("my example", "is my", "my other text")

R regular expression for p#q#c#

What would the regular expression be to encompass variable names such as p3q10000c150 and p29q2990c98? I want to add all variables in the format of p-any number-q-any number-c-any number to a list in R.
Thanks!
I think you are looking for something like matches function in dplyr::select:
df = data.frame(1:10, 1:10, 1:10, 1:10)
names(df) = c("p3q10000c150", "V1", "p29q2990c98", "V2")
library(dplyr)
df %>%
select(matches("^p\\d+q\\d+c\\d+$"))
Result:
p3q10000c150 p29q2990c98
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
matches in select allows you to use regex to extract variables.
If your objective is to pull out the 3 numbers and put them in a 3 column data frame or matrix then any of these alternatives would do it.
The regular expression in #1 matches p and then one or more digits and then q and then one or more digits and then c and one or more digits. The parentheses form capture groups which are placed in the corresponding columns of the prototype data frame given as the third argument.
In #2 each non-digit ("\\D") is replaced with a space and then read.table reads in the data using the indicated column names.
In #3 we convert each element of the input to DCF format, namely c("\np: 3\nq: 10000\nc: 150", "\np: 29\nq: 2990\nc: 98") and then read it in using read.dcf and convert the columns to numeric. This creates a matrix whereas the prior two alternatives create data frames.
The second alternative seems simplest but the third one is more general in that it does not hard code the header names or the number of columns. (If we used col.names = strsplit(input, "\\d+")[[1]] in #2 then it would be similarly general.)
# 1
strcapture("p(\\d+)q(\\d+)c(\\d+)", input,
data.frame(p = character(), q = character(), c = character()))
# 2
read.table(text = gsub("\\D", " ", input), col.names = c("p", "q", "c"))
# 3
apply(read.dcf(textConnection(gsub("(\\D)", "\n\\1: ", input))), 2, as.numeric)
The first two above give this data.frame and the third one gives the corresponding numeric matrix.
p q c
1 3 10000 150
2 29 2990 98
Note: The input is assumed to be:
input <- c("p3q10000c150", "p29q2990c98")
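A small variation on #1, offered as a sketch: the prototype data frame passed to strcapture also determines the column types, so integer prototype columns give integer results directly instead of character ones.

```r
input <- c("p3q10000c150", "p29q2990c98")

# the column classes of the prototype decide how the captures are converted
res <- strcapture("p(\\d+)q(\\d+)c(\\d+)", input,
                  data.frame(p = integer(), q = integer(), c = integer()))
res
##    p     q   c
## 1  3 10000 150
## 2 29  2990  98

is.integer(res$q)
## [1] TRUE
```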
Try:
x <- c("p3q10000c150", "p29q2990c98")
sapply(strsplit(x, "[pqc]"), function(i){
setNames(as.numeric(i[-1]), c("p", "q", "c"))
})
# [,1] [,2]
# p 3 29
# q 10000 2990
# c 150 98
I'll assume you have a data frame called df with variables names names(df). If you want to only retain the variables with the structure p<somenumbers>q<somenumbers>c<somenumbers> you could use the regex that Wiktor Stribiżew suggested in the comments like this:
valid_vars <- grepl("p\\d+q\\d+c\\d", names(df))
df2 <- df[, valid_vars]
grepl() will return a vector of TRUE and FALSE values, indicating which element in names(df) follows the structure you suggested. Afterwards you use the output of grepl() to subset your data frame.
For clarity, observe:
var_names_test <- c("p3q10000c150", "p29q2990c98", "var1")
grepl("p\\d+q\\d+c\\d", var_names_test)
# [1] TRUE TRUE FALSE
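Note that this pattern has no anchors, so it also accepts names that merely contain the structure. If only exact matches should be kept, anchoring with ^ and $ (as in the matches() example earlier) is one option. The extra name "xp1q2c3abc" below is made up to show the difference.

```r
var_names_test <- c("p3q10000c150", "p29q2990c98", "var1", "xp1q2c3abc")

# unanchored: matches anywhere inside the name
loose <- grepl("p\\d+q\\d+c\\d", var_names_test)
loose
## [1]  TRUE  TRUE FALSE  TRUE

# anchored: the whole name must have the form p<digits>q<digits>c<digits>
strict <- grepl("^p\\d+q\\d+c\\d+$", var_names_test)
strict
## [1]  TRUE  TRUE FALSE FALSE
```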
