Delimit a column in R based on 2 characters

I have a column in a dataframe in R that contains values such as
C22/00556,
C21/00445,
B22/00111,
C22-00679, etc.
I would like to split this into 2 columns named initial and number, with "-" or "/" as the delimiter.
As a result I would expect a column containing C22, C21, B22, etc and another column containing 00556, 00445 etc.
I am trying to use the separate function but I am struggling with the sep= part.
I have tried using sep= c("/","-") but this is not working and throws an error.

You could use separate from tidyr with a regular expression that matches either / or - (the | means "or"), like this:
df <- data.frame(V1 = c("C22/00556", "C21/00445", "B22/00111", "C22-00679"))
library(tidyr)
df %>%
separate(V1, c("initial", "number"), sep = "/|-")
#> initial number
#> 1 C22 00556
#> 2 C21 00445
#> 3 B22 00111
#> 4 C22 00679
Created on 2023-01-05 with reprex v2.0.2

Using base R
read.table(text = chartr("/", "-", df$V1), sep = "-", header = FALSE,
col.names = c("initial", "number"), colClasses = "character")
initial number
1 C22 00556
2 C21 00445
3 B22 00111
4 C22 00679
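A character class is another way to write the same delimiter. The following is a minimal base R sketch, assuming every value contains exactly one "/" or "-"; the names x, parts and out are only illustrative.

```r
x <- c("C22/00556", "C21/00445", "B22/00111", "C22-00679")

# "[/-]" matches a single "/" or "-" (the hyphen is literal because it comes last)
parts <- strsplit(x, "[/-]")

out <- data.frame(
  initial = vapply(parts, `[`, character(1), 1),  # piece before the delimiter
  number  = vapply(parts, `[`, character(1), 2),  # piece after the delimiter
  stringsAsFactors = FALSE
)
out
```

Keeping number as character preserves the leading zeros.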

Related

Remove Last Character in R inplace

I came from a Python background and I am working in R with this data df.
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a
Using R language, how can I remove the a in the last part of the age column and do data modification in place??
(just like Python using inplace=True)
Can I attain it using
df$Age[which(df$Age == `a pattern`)] <- ""
This is a perfect use case for parse_number from the readr package (it is part of the tidyverse):
library(dplyr)
library(readr)
df %>%
mutate(age = parse_number(age))
name age
1 Anon1 52
2 Anon2 62
3 Anon3 44
4 Anon4 30
5 Anon5 110
data:
df <- structure(list(name = c("Anon1", "Anon2", "Anon3", "Anon4", "Anon5"
), age = c("52a", "62", "44a", "30", "110a")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
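You could use sub here. The pattern must be treated as a regular expression (so do not set fixed = TRUE), because $ anchors the match to the end of the string:
df$age <- sub("a$", "", df$age)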
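To see why the pattern has to be interpreted as a regular expression, this small sketch compares the two modes of sub on the ages from the question. The vector name age is only for illustration.

```r
age <- c("52a", "62", "44a", "30", "110a")

# fixed = TRUE searches for the literal two characters "a$", which never occur
literal <- sub("a$", "", age, fixed = TRUE)

# As a regex, "a$" means an "a" at the very end of the string
anchored <- sub("a$", "", age)

literal
anchored
```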
#A tidy solution
library(dplyr)
library(stringr)
df <- data.frame(name=c("anon1","anon2"),age=c("52","37a"))
df <- df %>%
mutate(age = str_extract(age,"^\\d+"))
df
name age
1 anon1 52
2 anon2 37
Here are two approaches. No packages are used.
1) We remove all non-digit characters; in a regular expression, \D means non-digit. If we knew that only a could appear as a non-digit we could instead use "a" as the first argument to gsub, and if we knew it only appears once we could use sub instead of gsub.
Also it is easier to debug code if you don't overwrite variables since then you always know that a particular variable is in its original state. Instead assign the result to a new variable.
transform(DF, age = as.numeric(gsub("\\D", "", age)))
This could also be written using pipes:
transform(DF, age = age |> gsub(pattern = "\\D", replacement = "") |> as.numeric())
2) We can use scan specifying that a is a comment character.
transform(DF, age = scan(text = age, comment.char = "a", quiet = TRUE))
Note
Lines <- "
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a"
DF <- read.table(text = Lines)
The inplace modifier in Python refers to making a change without creating a copy. The data.table package in R allows for this (called assignment by reference).
df <- read.table(text="
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a")
library(data.table)
library(stringi)
setDT(df)[, age:=stri_extract(age, regex='^\\d+')]
df
The first clause (setDT(df)) converts df to a data.table by reference (i.e., without making a copy), and the second clause ([, age:=...]) replaces the values in column age with ... also by reference.
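For comparison, base R has no inplace switch: a function that modifies a data frame works on its own copy, so the usual idiom is to assign the result back to the same name. A minimal sketch, where strip_a is a hypothetical helper name and the data is a shortened version of the question's:

```r
df <- data.frame(name = c("Anon1", "Anon2", "Anon3"),
                 age = c("52a", "62", "44a"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: returns a modified copy, the caller's object is untouched
strip_a <- function(d) {
  d$age <- sub("a$", "", d$age)
  d
}

cleaned <- strip_a(df)
unchanged <- identical(df$age, c("52a", "62", "44a"))  # original still has the "a"

# Rebinding the name is the closest base R equivalent of an in-place update
df <- strip_a(df)
```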

How to remove characters that don't match the string pattern from a column of a data frame

I have a column in my data frame as shown below.
I want to keep the data in the pattern "\\d+Zimmer" and remove all the digits from the column such as "9586" and "927" in the picture.
I tried following gsub function.
gsub("[^\\d+Zimmer]", "", flat_cl_one$rooms)
But it removes all the digits, as below.
What Regex can I use to get the correct result? Thank You in Advance
We can coerce the column to numeric, which turns every value that contains letters into NA, and then replace the values that are not NA (the purely numeric ones) with blanks.
library(dplyr)
flat_cl_one %>%
mutate(rooms = ifelse(!is.na(as.numeric(rooms)), "", rooms))
Or we can use str_detect from stringr:
library(stringr)
flat_cl_one %>%
mutate(rooms = ifelse(str_detect(rooms, "Zimmer", negate = TRUE), "", rooms))
Output
rooms
1 647Zimmer
2 394Zimmer
3
4
5 38210Zimmer
We could do the same thing with filter if you wanted to actually remove those rows.
flat_cl_one %>%
filter(is.na(as.numeric(rooms)))
# rooms
#1 647Zimmer
#2 394Zimmer
#3 38210Zimmer
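Note that as.numeric emits a warning ("NAs introduced by coercion") for the values that contain letters. Here is a base R sketch of the same idea with the warning silenced, assuming the purely numeric entries are the ones to blank out; the names rooms, is_number and cleaned are illustrative.

```r
rooms <- c("647Zimmer", "394Zimmer", "8796", "9389", "38210Zimmer")

# Values that survive the numeric conversion are the digit-only entries
is_number <- !is.na(suppressWarnings(as.numeric(rooms)))

# Blank out the digit-only entries, keep everything else unchanged
cleaned <- ifelse(is_number, "", rooms)
cleaned
```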
Data
flat_cl_one <- structure(list(rooms = c("647Zimmer", "394Zimmer", "8796", "9389",
"38210Zimmer")), class = "data.frame", row.names = c(NA, -5L))
Just replace strings that don't contain the word "Zimmer"
flat_cl_one$room[!grepl("Zimmer", flat_cl_one$room)] <- ""
flat_cl_one
#> room
#> 1 3Zimmer
#> 2 2Zimmer
#> 3 2Zimmer
#> 4 3Zimmer
#> 5
#> 6
#> 7 3Zimmer
#> 8 6Zimmer
#> 9 2Zimmer
#> 10 4Zimmer
Data
flat_cl_one <- data.frame(room = c("3Zimmer", "2Zimmer", "2Zimmer", "3Zimmer",
"9586", "927", "3Zimmer", "6Zimmer",
"2Zimmer", "4Zimmer"))
Another possible solution, using stringr::str_extract (I am using @AndrewGillreath-Brown's data, with thanks):
library(tidyverse)
df <- structure(
list(rooms = c("647Zimmer", "394Zimmer", "8796", "9389", "38210Zimmer")),
class = "data.frame",
row.names = c(NA, -5L))
df %>%
mutate(rooms = str_extract(rooms, "\\d+Zimmer"))
#> rooms
#> 1 647Zimmer
#> 2 394Zimmer
#> 3 <NA>
#> 4 <NA>
#> 5 38210Zimmer
The pattern [^\\d+Zimmer] is a negated character class: it matches any single character that is not one of the characters listed inside the brackets, rather than the sequence "digits followed by Zimmer".
Using gsub with perl = TRUE, you can use a negative lookahead (?!...) to assert that the string does not start with \\d+Zimmer, and then match 1 or more digits if the assertion is true.
gsub("^(?!^\\d+Zimmer\\b)\\d+\\b", "", flat_cl_one$rooms, perl = TRUE)
See an R demo.
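As a quick check, the lookahead pattern can be run on the sample data used in the other answers. The second line is an alternative sketch without lookahead that keeps a value only when it matches digits followed by Zimmer; the names cleaned and kept are illustrative.

```r
rooms <- c("647Zimmer", "394Zimmer", "8796", "9389", "38210Zimmer")

# Remove digit-only strings; strings that start with digits + "Zimmer" are left alone
cleaned <- gsub("^(?!^\\d+Zimmer\\b)\\d+\\b", "", rooms, perl = TRUE)

# Alternative: keep a value only if the whole string is digits followed by "Zimmer"
kept <- ifelse(grepl("^\\d+Zimmer$", rooms), rooms, "")

cleaned
kept
```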

Subset row based on first letter of each row

dput(abc)
structure(list(Comparison = structure(1:15, .Label = c("C1_C2 ",
"C1_C3 ", "C1_C4 ", "C1_C5 ", "C1_C6 ", "C2_C3 ", "C2_C4 ", "C2_C5 ",
"C2_C6 ", "C3_C4 ", "C3_C5 ", "C4_C5 ", "C4_C6 ", "C5_C6 ", "C6_C5 "
), class = "factor")), row.names = c(NA, -15L), class = "data.frame")
This is my data-frame
Comparison
1 C1_C2
2 C1_C3
3 C1_C4
4 C1_C5
5 C1_C6
6 C2_C3
7 C2_C4
8 C2_C5
9 C2_C6
10 C3_C4
11 C3_C5
12 C4_C5
13 C4_C6
14 C5_C6
15 C6_C5
So when I subset for C1 it works fine: I get C1_C2, C1_C3, C1_C4, C1_C5, C1_C6.
But when I grep for C2, it also finds rows such as C1_C2, which I don't want. I only want the rows that start with C2: C2_C3, C2_C4, C2_C5, C2_C6. The same goes for C3, C4, C5, C6.
My code to filter:
C1 <- comparsion %>% filter(str_detect(Comparison,"C1"))
If you are looking for a base R alternative, you can use startsWith. Note that your Comparison column is a factor and startsWith doesn't work on factors, so you need to wrap it inside as.character to make it work. Change the prefix to 'C2', 'C3', etc. to select the other groups (data here stands for your data frame):
data[startsWith(as.character(data$Comparison), 'C1'),,drop=FALSE]
You can also work with dplyr using startsWith, like below:
data %>%
filter(startsWith(as.character(Comparison), 'C1'))
Use ^ to indicate start of the string.
subset(abc, grepl('^C1', Comparison))
# Comparison
#1 C1_C2
#2 C1_C3
#3 C1_C4
#4 C1_C5
#5 C1_C6
With C2 :
subset(abc, grepl('^C2', Comparison))
# Comparison
#6 C2_C3
#7 C2_C4
#8 C2_C5
#9 C2_C6
In dplyr :
library(dplyr)
library(stringr)
abc %>% filter(str_detect(Comparison, '^C2'))
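The difference between the unanchored and the anchored pattern can be seen on a small vector. This is a sketch on a shortened, illustrative version of the Comparison column:

```r
comparison <- factor(c("C1_C2", "C1_C3", "C2_C3", "C2_C4", "C3_C4"))

# Without an anchor, "C2" matches anywhere in the string, including "C1_C2"
unanchored <- as.character(comparison[grepl("C2", comparison)])

# "^C2" only matches when the string starts with "C2"
anchored <- as.character(comparison[grepl("^C2", comparison)])

# startsWith needs a character vector, hence as.character on the factor
prefix <- as.character(comparison[startsWith(as.character(comparison), "C2")])
```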

R studio - using grepl() to grab specific characters and populate a new column in the dataframe

I have a data set in R studio (Aud) that looks like the following. ID is of type Character and Function is of type character as well
ID Function
F04 FZ000TTY WB002FR088DR011
F05 FZ000AGH WZ004ABD
F06 FZ0005ABD
my goal is to attempt and extract only the "FZ", "TTY", "WB", "FR", "WZ", "ABD" from all the rows in the data set and place them in a new unique column in the data set so that i have something like the following as an example
ID Function SUBFUN1 SUBFUN2 SUBFUN3 SUBFUN4 SUBFUN5
F04 FZ000TTY WB002FR088DR011 FZ TTY WB FR DR
I want to individualize the functions since they represent a certain behavior and that way i can plot per ID the behavior or functions which occur the most over a course of time
I tried the following
Aud$Subfun1<-
ifelse(grepl("FZ",Aud$Functions.NO.)==T,"FZ", "Other"))
Aud$Subfun2<-
ifelse(grepl("TTY",Aud$Functions.NO.)==T,"TTY","Other"))
I get the error message below in my attempts for subfun1 & subfun2:
Error in `$<-.data.frame`(`*tmp*`, Subfun1, value = logical(0)) :
replacement has 0 rows, data has 343456
Error in `$<-.data.frame`(`*tmp*`, Subfun2, value = logical(0)) :
replacement has 0 rows, data has 343456
I also tried substring() but substring seems to require a start and an end for the character range that needs to be captured in the new column. This is not ideal as the codes FZ, TTY, WB, FR, WZ and ABD all appear at different parts of the function string
Any help would be greatly appreciated with this
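One likely cause of the error, assuming the column is really called Function as in the sample shown above: Aud$Functions.NO. refers to a column that does not exist, so it evaluates to NULL and everything built on it has length zero. A small sketch using the first two rows from the question:

```r
Aud <- data.frame(ID = c("F04", "F05"),
                  Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD"),
                  stringsAsFactors = FALSE)

missing_col <- Aud$Functions.NO.   # NULL, because there is no such column

# grepl on NULL returns logical(0), and ifelse keeps that zero length
empty <- ifelse(grepl("FZ", missing_col), "FZ", "Other")

# With the real column name there is one value per row
Aud$Subfun1 <- ifelse(grepl("FZ", Aud$Function), "FZ", "Other")
```

Checking names(Aud) is a quick way to confirm the exact column name.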
Using data.table:
library(data.table)
Aud <- data.frame(
ID = c("F04", "F05", "F06"),
Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD"),
stringsAsFactors = FALSE
)
setDT(Aud)
cbind(Aud, Aud[, tstrsplit(Function, "[0-9]+| ")])
ID Function V1 V2 V3 V4 V5
1: F04 FZ000TTY WB002FR088DR011 FZ TTY WB FR DR
2: F05 FZ000AGH WZ004ABD FZ AGH WZ ABD <NA>
3: F06 FZ0005ABD FZ ABD <NA> <NA> <NA>
Staying in base R one could do something like the following:
our_split <- strsplit(Aud$Function, "[0-9]+| ")
cbind(
Aud,
do.call(rbind, lapply(our_split, "length<-", max(lengths(our_split))))
)
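The "length<-" step above pads each split result with NA so that all rows have the same number of pieces before rbind. A minimal sketch of that padding on two of the strings, written with an explicit function instead of "length<-"; the names width, padded and m are illustrative.

```r
our_split <- strsplit(c("FZ000TTY WB002FR088DR011", "FZ0005ABD"), "[0-9]+| ")

# Longest split decides the number of columns
width <- max(lengths(our_split))

# Extending a vector's length fills the new positions with NA
padded <- lapply(our_split, function(p) {
  length(p) <- width
  p
})

m <- do.call(rbind, padded)
m
```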
One can use tidyr::separate to divide Function column in multiple columns using regex as separator.
library(tidyverse)
df %>%
separate(Function, into = paste("V",1:5, sep=""),
sep = "([^[:alpha:]]+)", fill="right", extra = "drop")
# ID V1 V2 V3 V4 V5
# 1 F04 FZ TTY WB FR DR
# 2 F05 FZ AGH WZ ABD <NA>
# 3 F06 FZ ABD <NA> <NA> <NA>
([^[:alpha:]]+) : Separate on any run of non-alphabetic characters
Data:
df <- read.table(text=
"ID Function
F04 'FZ000TTY WB002FR088DR011'
F05 'FZ000AGH WZ004ABD'
F06 FZ0005ABD",
header = TRUE, stringsAsFactors = FALSE)
A tidyverse way that makes use of stringr::str_extract_all to get a nested list of all occurrences of the search terms, then spreads into the wide format you have as your desired output. If you were extracting any sets of consecutive capital letters, you could use "[A-Z]+" as your search term, but since you said it was these specific IDs, you need a more specific search term. If putting the regex becomes cumbersome, say if you have a vector of many of these IDs, you could paste it together and collapse by |.
library(tidyverse)
Aud <- tibble(
ID = c("F04", "F05", "F06"),
Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")
)
search_terms <- "(FZ|TTY|WB|FR|WZ|ABD)"
Aud %>%
mutate(code = str_extract_all(Function, search_terms)) %>%
select(-Function) %>%
unnest(code) %>%
group_by(ID) %>%
mutate(subfun = row_number()) %>%
spread(key = subfun, value = code, sep = "")
#> # A tibble: 3 x 5
#> # Groups: ID [3]
#> ID subfun1 subfun2 subfun3 subfun4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 F04 FZ TTY WB FR
#> 2 F05 FZ WZ ABD <NA>
#> 3 F06 FZ ABD <NA> <NA>
Created on 2018-07-11 by the reprex package (v0.2.0).
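Building the search term from a vector of IDs, as mentioned above, could look like the following. This is a base R sketch that uses regmatches and gregexpr in place of str_extract_all; the vector name ids is an assumption for illustration.

```r
ids <- c("FZ", "TTY", "WB", "FR", "WZ", "ABD")

# Collapse the IDs into one alternation: "(FZ|TTY|WB|FR|WZ|ABD)"
search_terms <- paste0("(", paste(ids, collapse = "|"), ")")

x <- c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")

# One character vector of matches per input string
found <- regmatches(x, gregexpr(search_terms, x))
found
```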

R: Pulling data from one column to create new columns

I have data with sample names that need to be unpacked and created into new columns.
sample
P10.1
P11.2
S1.1
S3.3
Using the sample ID data, I need to make three new columns: tissue, plant, stage.
sample tissue plant stage
P10.1 P 10 1
P11.2 P 11 2
S1.1 S 1 1
S3.3 S 3 3
Is there a way to pull the data from the sample column to populate the three new columns?
Using dplyr and tidyr:
First we insert a "." after the first character of the sample code, then we separate sample into 3 columns.
library(dplyr)
library(tidyr)
df %>%
mutate(sample = paste0(substring(df$sample, 1, 1), ".", substring(df$sample, 2))) %>%
separate(sample, into = c("tissue", "plant", "stage"), remove = FALSE)
sample tissue plant stage
1 P.10.1 P 10 1
2 P.11.2 P 11 2
3 S.1.1 S 1 1
4 S.3.3 S 3 3
data:
df <- structure(list(sample = c("P10.1", "P11.2", "S1.1", "S3.3")),
.Names = "sample",
class = "data.frame",
row.names = c(NA, -4L))
Similar to @phiver's answer, but using regular expressions.
Within pattern:
The first pair of parentheses captures any single uppercase letter (for tissue)
The second pair of parentheses captures any one- or two-digit number (for plant)
The third pair of parentheses captures any one- or two-digit number (for stage)
The sub() function pulls out those capturing groups and places them in new variables.
library(magrittr)
pattern <- "^([A-Z])(\\d{1,2})\\.(\\d{1,2})$"
df %>%
dplyr::mutate(
tissue = sub(pattern, "\\1", sample),
plant = as.integer(sub(pattern, "\\2", sample)),
stage = as.integer(sub(pattern, "\\3", sample))
)
Result (displayed with str()):
'data.frame': 4 obs. of 4 variables:
$ sample: chr "P10.1" "P11.2" "S1.1" "S3.3"
$ tissue: chr "P" "P" "S" "S"
$ plant : int 10 11 1 3
$ stage : int 1 2 1 3
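The three sub() calls can also be replaced by a single pass that returns all capture groups at once. This is a base R sketch using regexec and regmatches, assuming every sample matches the pattern; the names m, parts and out are illustrative.

```r
sample <- c("P10.1", "P11.2", "S1.1", "S3.3")
pattern <- "^([A-Z])(\\d{1,2})\\.(\\d{1,2})$"

# Each list element holds the full match followed by the three capture groups
m <- regmatches(sample, regexec(pattern, sample))
parts <- do.call(rbind, m)

out <- data.frame(
  sample = parts[, 1],
  tissue = parts[, 2],
  plant  = as.integer(parts[, 3]),
  stage  = as.integer(parts[, 4]),
  stringsAsFactors = FALSE
)
out
```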
This is similar to phiver's answer, but uses separate twice. Notice that we can specify a position index in the sep argument.
library(tidyr)
dat2 <- dat %>%
separate(sample, into = c("tissue", "number"), sep = 1, remove = FALSE) %>%
separate(number, into = c("plant", "stage"), sep = "\\.", remove = TRUE, convert = TRUE)
dat2
# sample tissue plant stage
# 1 P10.1 P 10 1
# 2 P11.2 P 11 2
# 3 S1.1 S 1 1
# 4 S3.3 S 3 3
DATA
dat <- read.table(text = "sample
P10.1
P11.2
S1.1
S3.3",
header = TRUE, stringsAsFactors = FALSE)
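The same two-step split (first by position, then by ".") can be sketched in base R without tidyr, assuming the tissue code is always a single character; the names rest and pieces are illustrative.

```r
sample <- c("P10.1", "P11.2", "S1.1", "S3.3")

tissue <- substr(sample, 1, 1)   # first character
rest <- substring(sample, 2)     # everything after it, e.g. "10.1"

# fixed = TRUE treats "." as a literal dot instead of a regex wildcard
pieces <- strsplit(rest, ".", fixed = TRUE)
plant <- as.integer(vapply(pieces, `[`, character(1), 1))
stage <- as.integer(vapply(pieces, `[`, character(1), 2))

data.frame(sample, tissue, plant, stage, stringsAsFactors = FALSE)
```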
