Separate string into columns by extracting all groups that match regex

I have these strings in every row of one column.
example_df <- tibble(string = c("[{\"positieVergelekenMetSchooladvies\":\"boven niveau\",\"percentage\":9.090909090909092,\"percentageVergelijking\":19.843418733556412,\"volgorde\":10},{\"positieVergelekenMetSchooladvies\":\"op niveau\",\"percentage\":81.81818181818181,\"percentageVergelijking\":78.58821425834631,\"volgorde\":20},{\"positieVergelekenMetSchooladvies\":\"onder niveau\",\"percentage\":9.090909090909092,\"percentageVergelijking\":1.5683670080972694,\"volgorde\":30}]"))
I'm only interested in the numbers. This regex works:
example_df %>%
.$string %>%
str_extract_all(., "[0-9]+\\.[0-9]+")
Instead of separate() I want to use extract(). My understanding is that the two differ in what the regex matches: separate() matches the separator between values, while extract() matches the values that should populate the new columns. But where separate() splits on every occurrence of sep, extract() seems to match only one group.
example_df %>%
extract(string,
into = c("boven_niveau_school",
"boven_niveau_verg",
"op_niveau_school",
"op_niveau_verg",
"onder_niveau_school",
"onder_niveau_verg"),
regex = "([0-9]+\\.[0-9]+)")
What am I doing wrong?

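Instead of separate() or extract(), I would extract all the numbers from the string and then use unnest_wider() to create the new columns.
library(tidyverse)
example_df %>%
  # list-column holding every decimal number found in the string
  mutate(temp = str_extract_all(string, "[0-9]+\\.[0-9]+")) %>%
  # names_sep yields temp_1, temp_2, ...; newer tidyr releases may require it for unnamed values
  unnest_wider(temp, names_sep = "_")
You can then rename the columns as you like.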
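As for why the extract() call in the question fails: extract() expects one capture group per name in into, and the regex there has a single group for six names. Below is a minimal sketch of an extract()-based version. It assumes every string contains at least six decimal numbers in this order; the lazy `.*?` separator and the helper names number and six_numbers are illustrative choices, not part of the original answer.

```r
library(tidyr)
library(tibble)

# Same string as in the question
example_df <- tibble(string = "[{\"positieVergelekenMetSchooladvies\":\"boven niveau\",\"percentage\":9.090909090909092,\"percentageVergelijking\":19.843418733556412,\"volgorde\":10},{\"positieVergelekenMetSchooladvies\":\"op niveau\",\"percentage\":81.81818181818181,\"percentageVergelijking\":78.58821425834631,\"volgorde\":20},{\"positieVergelekenMetSchooladvies\":\"onder niveau\",\"percentage\":9.090909090909092,\"percentageVergelijking\":1.5683670080972694,\"volgorde\":30}]")

# Column names from the question
new_cols <- c("boven_niveau_school", "boven_niveau_verg",
              "op_niveau_school", "op_niveau_verg",
              "onder_niveau_school", "onder_niveau_verg")

# One capture group per new column, joined by a lazy "anything" separator
# (assumption: integers such as volgorde should be skipped, so only decimals are captured)
number <- "([0-9]+\\.[0-9]+)"
six_numbers <- paste(rep(number, length(new_cols)), collapse = ".*?")

# convert = TRUE turns the captured text into numeric columns
result <- tidyr::extract(example_df, string, into = new_cols,
                         regex = six_numbers, convert = TRUE)
```

If the number of decimals varies between rows, the str_extract_all() approach is likely the more robust choice, because extract() returns NA for rows where the full pattern does not match.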

We can use regmatches/gregexpr from base R:
out <- regmatches(example_df$string, gregexpr("\\d+\\.\\d+", example_df$string))[[1]]
example_df[paste0("new", seq_along(out))] <- as.list(out)
example_df
# A tibble: 1 x 7
# string new1 new2 new3 new4 new5 new6
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 "[{\"positieVergelekenMetSchooladvies\":\"boven niveau\",\"percentage\":9… 9.09090909… 19.84341873… 81.8181818… 78.588214… 9.0909090… 1.56836700…

Related

Creating new columns in R using parts of an existing column

I am trying to create new columns using the information in an existing column:
eg. the column 'name' contains the following value: 0112200015-1_R2_001.fastq.gz. From this I would like to generate a column 'sample_id' containing 0112200015 (first 10 digits), a column 'timepoint' containing 1 (from -1) and a column 'paired_end' containing 2 (from R2)
What would the correct code for this be?

tidyr::extract
You can use extract from tidyr package.
library(tidyr)
df %>%
extract(name, c("sample_id", "timepoint", "paired_end"),
regex = "^(\\d{10})-(\\d)_R(\\d)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
where df is:
df <- data.frame(name = "0112200015-1_R2_001.fastq.gz")
To tailor the solution to your needs, you should provide more examples so that rare cases and exceptions can be handled.
Several regexes could work for you. This one, for example, extracts the first three numbers it finds, separated by non-numeric characters:
df %>%
extract(name, c("sample_id", "timepoint", "paired_end"),
regex = "^(\\d+)\\D+(\\d+)\\D+(\\d+)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2

I assume you want to create a new data frame with this information.
I created a vector with values similar to your column names, but you should use the output of colnames() instead.
library(stringr)
vector <- c("1234-1_R2_001.fastq.gz", "5678-1_R2_001.fastq.gz", "1928-1_R2_001.fastq.gz")
df <- data.frame(sample_id = str_replace(vector, "-.*$", ""),   # drop everything from the first "-"
                 timepoint = str_extract(vector, "(?<=-)."),    # one character after "-"
                 paired_end = str_extract(vector, "(?<=R)."))   # one character after "R"
All the str_ functions are from the stringr package.

This should give you the correct answer using dplyr and stringr in a tidy way. It is based on the assumption that the timepoint and paired_end always consist of one digit. If this is not the case, the small adjustment of replacing "\\d{1}" by "\\d+" returns one or multiple digits, depending on the actual value.
library(dplyr)
library(stringr)
df <-
tibble(name = "0112200015-1_R2_001.fastq.gz")
df %>%
# Extract the 10 digit sample id
mutate(sample_id = str_extract(name, pattern = "\\d{10}"),
# Extract the 1 digit timepoint which comes after "-" and before the first "_"
timepoint = str_extract(name, pattern = "(?<=-)\\d{1}(?=_)"),
# Extract the 1 digit paired_end which comes after "_R"
paired_end = str_extract(name, pattern = "(?<=_R)\\d{1}"))
# A tibble: 1 x 4
# name sample_id timepoint paired_end
# <chr> <chr> <chr> <chr>
# 1 0112200015-1_R2_001.fastq.gz 0112200015 1 2

Using R Regex to identify two characters followed by a dash and two numbers

Very obnoxious regex question incoming! I have a column that I am trying to split into two based off a condition. I'd like a new column to be created when there are two characters, followed by a dash and two numbers (e.g., CA-01).
My code is:
mydf %>% extract(col = pilot_id, regex = "[a-z]{2}.d{2}", into = 'facility_test')
Where the column I'd like to identify the pattern in is pilot_id, and the new column I'd like to make is facility_test.

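extract() needs a capture group: the parenthesised part of the regex is what ends up in the new column. Note also that a digit is written \\d (the .d in the question matches any character followed by a literal d) and that the letters in the data are uppercase.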
library(dplyr)
library(tidyr)
mydf %>%
extract(col = pilot_id, regex = ".*-([A-Z]{2}-\\d{2})\\s.*",
into = 'facility_test')
# A tibble: 1 x 1
# facility_test
# <chr>
#1 FL-03
data
mydf <- tibble(pilot_id = "TGT Track -FL-03 (Hilsborough County) 3/3/2021")
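If only the matched code is needed, stringr::str_extract() returns the whole match, so no capture group is required. A small sketch on the same example string (the variable names are just for illustration):

```r
library(stringr)

pilot_id <- "TGT Track -FL-03 (Hilsborough County) 3/3/2021"

# Two uppercase letters, a dash, two digits; the full match is returned
facility_test <- str_extract(pilot_id, "[A-Z]{2}-\\d{2}")
facility_test
```

Inside a pipeline this would sit in a mutate() call, and rows without a match get NA.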

String with values mapped from other data frame in R

I would like to make a string basing on ids from other columns where the real value sits in a dictionary.
Ideally, this would look like:
library(tidyverse)
region_dict <- tibble(
id = c("reg_id1", "reg_id2", "reg_id3"),
name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
id = c("col_id1", "col_id2", "col_id3"),
name = c("col_1", "col_2", "col_3")
)
tibble(
region = c("reg_id1", "reg_id2", "reg_id3"),
color = c("col_id1", "col_id2", "col_id3"),
my_string = str_c(
"xxx_",
region_name,
"_",
color_name
))
#> # A tibble: 3 x 3
#> region color my_string
#> <chr> <chr> <chr>
#> 1 reg_id1 col_id1 xxx_reg_1_col_1
#> 2 reg_id2 col_id2 xxx_reg_2_col_2
#> 3 reg_id3 col_id3 xxx_reg_3_col_3
Created on 2021-03-01 by the reprex package (v0.3.0)
I know of dplyr's recode() function but I can't think of a way to use it the way I want.
I also thought about first using left_join() and then concatenating the string from the new columns. This is what would work but doesn't seem pretty to me as I would get columns that I'd need to remove later. In the real dataset I have 5 variables.
I'll be glad to read your ideas.

This could also be solved with a fuzzyjoin, but since the ids share a common suffix, it makes sense to remove the prefix from the 'id' column of each dictionary, do a left_join on the remainder, and then create 'my_string' by pasting the columns together.
library(stringr)
library(dplyr)
region_dict %>%
mutate(id1 = str_remove(id, '.*_')) %>%
left_join(color_dict %>%
mutate(id1 = str_remove(id, '.*_')), by = 'id1') %>%
transmute(region = id.x, color = id.y,
my_string = str_c('xxx_', name.x, '_', name.y))
Output:
# A tibble: 3 x 3
# region color my_string
# <chr> <chr> <chr>
#1 reg_id1 col_id1 xxx_reg_1_col_1
#2 reg_id2 col_id2 xxx_reg_2_col_2
#3 reg_id3 col_id3 xxx_reg_3_col_3
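If the goal is to avoid joins and helper columns altogether, each dictionary can be turned into a named vector and indexed by id. This is a sketch of that alternative, not the approach above; region_lookup and color_lookup are names made up for the example.

```r
library(dplyr)
library(stringr)
library(tibble)

region_dict <- tibble(
  id = c("reg_id1", "reg_id2", "reg_id3"),
  name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
  id = c("col_id1", "col_id2", "col_id3"),
  name = c("col_1", "col_2", "col_3")
)

# Named vectors: the names are the ids, the values are the display names
region_lookup <- setNames(region_dict$name, region_dict$id)
color_lookup <- setNames(color_dict$name, color_dict$id)

result <- tibble(
  region = c("reg_id1", "reg_id2", "reg_id3"),
  color = c("col_id1", "col_id2", "col_id3")
) %>%
  # Index each lookup by the id column; unname() drops the ids from the result
  mutate(my_string = str_c("xxx_",
                           unname(region_lookup[region]), "_",
                           unname(color_lookup[color])))
```

With five variables this needs one lookup vector per dictionary, but no temporary columns have to be removed afterwards. An id missing from a dictionary yields NA for that row.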

Unpack json columns into a dataframe

I have json strings inside a dataframe column. I want to bring all these new json columns into the dataframe.
# Input
JsonID <- as.factor(c(1,2,3))
JsonString1 = "{\"device\":{\"site\":\"Location1\"},\"tags\":{\"Engine Pressure\":\"150\",\"timestamp\":\"2608411982\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"2608411982\"}"
JsonString2 = "{\"device\":{\"site\":\"Location2\"},\"tags\":{\"Engine Pressure\":\"160\",\"timestamp\":\"3608411983\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"3608411983\"}"
JsonString3 = "{\"device\":{\"site\":\"Location3\"},\"tags\":{\"Brake Fluid\":\"100\",\"timestamp\":\"4608411984\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"4608411984\"}"
JsonStrings = c(JsonString1, JsonString2, JsonString3)
Example <- data.frame(JsonID, JsonStrings)
Using the jsonlite library I can make each json string into a 1 row dataframe.
library(jsonlite)
# One row dataframes
DF1 <- data.frame(fromJSON(JsonString1))
DF2 <- data.frame(fromJSON(JsonString2))
DF3 <- data.frame(fromJSON(JsonString3))
Unfortunately, the JsonID column is lost. All JSON strings share some column names, such as "time", but there are also column names they don't share. By pivoting the data longer I could rbind all the data frames together.
library(dplyr)
library(tidyr)
# Row bindable one row dataframes
DF1_RowBindable <- DF1 %>%
rename_all(~gsub("tags.", "", .x)) %>%
tidyr::pivot_longer(cols = c(colnames(.)[2]))
Is there a better way to do this?
I have never worked with json strings before. The solution must be computationally scalable.

We can store the result of fromJSON as a list column in the data frame itself, so we don't lose any information that is already in the data. We can then use unnest_wider to create new columns from the named lists.
library(dplyr)
library(tidyr)
library(jsonlite)
Example %>%
rowwise() %>%
mutate(data = list(fromJSON(JsonStrings))) %>%
unnest_wider(data) %>%
select(-JsonStrings) %>%
unnest_wider(tags) %>%
unnest_wider(device)
# JsonID site `Engine Pressure` timestamp historic adhoc `Brake Fluid` online time
# <fct> <chr> <chr> <chr> <lgl> <lgl> <chr> <lgl> <chr>
#1 1 Location1 150 2608411982 FALSE FALSE NA TRUE 2608411982
#2 2 Location2 160 3608411983 FALSE FALSE NA TRUE 3608411983
#3 3 Location3 NA 4608411984 FALSE FALSE 100 TRUE 4608411984
Since the columns (data, tags, device) have different lengths, we need to call unnest_wider separately on each of them.
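Because the question asks about scalability, here is a sketch of an alternative that parses all rows in one fromJSON() call instead of row by row. It assumes every string is a single valid JSON object; whether it is actually faster on the real data would need to be measured. The names json_array, parsed and result are illustrative.

```r
library(jsonlite)

JsonID <- as.factor(c(1, 2, 3))
JsonStrings <- c(
  "{\"device\":{\"site\":\"Location1\"},\"tags\":{\"Engine Pressure\":\"150\",\"timestamp\":\"2608411982\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"2608411982\"}",
  "{\"device\":{\"site\":\"Location2\"},\"tags\":{\"Engine Pressure\":\"160\",\"timestamp\":\"3608411983\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"3608411983\"}",
  "{\"device\":{\"site\":\"Location3\"},\"tags\":{\"Brake Fluid\":\"100\",\"timestamp\":\"4608411984\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"4608411984\"}"
)

# Wrap the individual objects in one JSON array so they are parsed together
json_array <- paste0("[", paste(JsonStrings, collapse = ","), "]")

# flatten = TRUE expands nested objects into columns such as device.site;
# keys missing from a row are filled with NA
parsed <- fromJSON(json_array, flatten = TRUE)

# Rows come back in input order, so JsonID can be attached directly
result <- cbind(JsonID = JsonID, parsed)
```

The nested keys keep their parent as a prefix (for example device.site), so the columns may need renaming to match the output above.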

How can I use R stringr to leave only the gene name?

I have a large spreadsheet with 3200 observations that has a list of genes in a column. The column however has a bunch of junk that I don't need (example below). How can I use stringr to remove the unnecessary junk and leave only the gene name?
Example: The gene names are TEM-126 and ykkD.
gb|AY628199|+|203-1064|ARO:3000988|TEM-126
gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD

If your gene names are always at the tail of your strings, you can try the code below
> gsub(".*\\|","",v)
[1] "TEM-126" "ykkD"
DATA
v <- c("gb|AY628199|+|203-1064|ARO:3000988|TEM-126",
"gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD")

Using stringr:
str_split_fixed(genes, '\\|', n = 6)[, 6]
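The call above relies on every string having exactly six fields. A sketch that only assumes the gene name follows the last | (here genes is assumed to be a character vector like the example in the question):

```r
library(stringr)

genes <- c("gb|AY628199|+|203-1064|ARO:3000988|TEM-126",
           "gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD")

# One or more non-pipe characters anchored at the end of the string,
# i.e. everything after the last "|"
gene_names <- str_extract(genes, "[^|]+$")
gene_names
```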

As you said you have those names in a column and it seems that the gene name is the last "word", you could easily do that using just two packages from tidyverse, dplyr and stringr.
library(dplyr)
library(stringr)
df <- tibble::tribble(
~Text,
"gb|AY628199|+|203-1064|ARO:3000988|TEM-126",
"gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD"
)
df %>%
mutate(gene = word(Text, start = -1, end = -1, sep = "\\|"))
#> # A tibble: 2 x 2
#> Text gene
#> <chr> <chr>
#> 1 gb|AY628199|+|203-1064|ARO:3000988|TEM-126 TEM-126
#> 2 gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD ykkD

If you have a vector genevec of these strings, you can apply the split to the whole vector at once:
stringr::str_split(genevec, pattern = "\\|", simplify = TRUE)[, 6]
