here is a dataframe for example:
test_df <- structure(list(plant_id = c("AB1234", "CC0000", "ZX9998", "AA1110", "LO8880"),
NewName = c("ZY8765", "XX9999", "AC0001", "ZZ8889", "OL1119")),
row.names = c(NA, -5L), class = "data.frame",
.Names = c("plant_sp", "NewName"))
As you can see, there is a column call "plant_sp" with a 6 character code.
I'd like to tansform this code to a new code (like at column "NewName" by this format:
For letters:
A-Z
B-Y
C-X
D-W
E-V
F-U
G-T
.
.
.
For numbers:
0-9
1-8
2-7
3-6
4-5
5-4
.
.
.
plant_sp NewName
1 AB1234 ZY8765
2 CC0000 XX9999
3 ZX9998 AC0001
4 AA1110 ZZ8889
5 LO8880 OL1119
So that each character will get the opposite one by its value (0=9, 1=8... A=Z, B=Y...)
How can I do it? a pipe solution would be great.
Thanks a lot!
One option to achieve your desired result would be via a lookup table and stringr::str_replace_all:
library(dplyr)
library(stringr)
lt_letters <- setNames(rev(LETTERS), LETTERS)
lt_numbers <- setNames(rev(0:9),0:9)
test_df %>%
mutate(NewName1 = str_replace_all(plant_sp, "[A-Z0-9]", function(x) c(lt_letters, lt_numbers)[x]))
#> plant_sp NewName NewName1
#> 1 AB1234 ZY8765 ZY8765
#> 2 CC0000 XX9999 XX9999
#> 3 ZX9998 AC0001 AC0001
#> 4 AA1110 ZZ8889 ZZ8889
#> 5 LO8880 OL1119 OL1119
Related
I have a column in my data frame as shown below.
I want to keep the data in the pattern "\\d+Zimmer" and remove all the digits from the column such as "9586" and "927" in the picture.
I tried following gsub function.
gsub("[^\\d+Zimmer]", "", flat_cl_one$rooms)
But it removes all the digits, as below.
What Regex can I use to get the correct result? Thank You in Advance
We can coerce any rows that have alphanumeric characters to NA and then replace the rows that don't have NA to blanks.
library(dplyr)
flat_cl_one %>%
mutate(rooms = ifelse(!is.na(as.numeric(rooms)), "", rooms))
Or we can use str_detect:
flat_cl_one %>%
mutate(rooms = ifelse(str_detect(rooms, "Zimmer", negate = TRUE), "", rooms))
Output
rooms
1 647Zimmer
2 394Zimmer
3
4
5 38210Zimmer
We could do the same thing with filter if you wanted to actually remove those rows.
flat_cl_one %>%
filter(is.na(as.numeric(rooms)))
# rooms
#1 647Zimmer
#2 394Zimmer
#3 38210Zimmer
Data
flat_cl_one <- structure(list(rooms = c("647Zimmer", "394Zimmer", "8796", "9389",
"38210Zimmer")), class = "data.frame", row.names = c(NA, -5L))
Just replace strings that don't contain the word "Zimmer"
flat_cl_one$room[!grepl("Zimmer", flat_cl_one$room)] <- ""
flat_cl_one
#> room
#> 1 3Zimmer
#> 2 2Zimmer
#> 3 2Zimmer
#> 4 3Zimmer
#> 5
#> 6
#> 7 3Zimmer
#> 8 6Zimmer
#> 9 2Zimmer
#> 10 4Zimmer
Data
flat_cl_one <- data.frame(room = c("3Zimmer", "2Zimmer", "2Zimmer", "3Zimmer",
"9586", "927", "3Zimmer", "6Zimmer",
"2Zimmer", "4Zimmer"))
Another possible solution, using stringr::str_extract (I am using #AndrewGillreath-Brown's data, to whom I thank):
library(tidyverse)
df <- structure(
list(rooms = c("647Zimmer", "394Zimmer", "8796", "9389", "38210Zimmer")),
class = "data.frame",
row.names = c(NA, -5L))
df %>%
mutate(rooms = str_extract(rooms, "\\d+Zimmer"))
#> rooms
#> 1 647Zimmer
#> 2 394Zimmer
#> 3 <NA>
#> 4 <NA>
#> 5 38210Zimmer
This pattern [^\\d+Zimmer] matches any character except a digit or the following characters + Z i m etc...
Using gsub, you can check if the string does not start with the pattern ^\\d+Zimmer using a negative lookahead (?! setting perl = TRUE and then match 1 or more digits if the assertion it true.
gsub("^(?!^\\d+Zimmer\\b)\\d+\\b", "", flat_cl_one$rooms, perl = TRUE)
See an R demo.
I am struggling with gsub and regular expressions in R and I need help with this. I have a data frame in R with the second column represent some codes presented as alphanumeric digits. I want to place a dot after three characters in codes comprising of four and five digits. Don't want to touch three-character codes,
My input is,
ID
code
1
C443
2
B479
3
E53
4
S9200
5
M8199
My required output is,
ID
code
1
C44.3
2
B47.9
3
E53
4
S92.00
5
M81.99
I am trying, but getting a dot also in the code of 3rd ID
Library(dplyr)
a <- a %>% mutate(code = as.numeric(paste0(substr(code,1,3),".",substr(code,4,nchar(code)))))
Thanks for the help
This is a nice way of doing so using RegEx:
a %>%
mutate(code = gsub("(^[A-Z][0-9]{2})([0-9]{1,2})", "\\1\\.\\2", code))
You could add an if_else to the existing code.
library(dplyr)
df <-
data.frame(id = c(1, 2, 3, 4, 5),
code = c("C443", "B479", "E53", "S9200", "M81999"))
df <-
df %>% mutate(code = if_else(nchar(code) > 3, paste0(
substr(code, 1, 3), ".", substr(code, 4, nchar(code))
), code))
df
#> id code
#> 1 1 C44.3
#> 2 2 B47.9
#> 3 3 E53
#> 4 4 S92.00
#> 5 5 M81.999
Created on 2021-10-01 by the reprex package (v2.0.1)
Using str_replace
library(stringr)
library(dplyr)
df %>%
mutate(code = str_replace(code, "(\\d{2})(\\d+)", "\\1.\\2"))
id code
1 1 C44.3
2 2 B47.9
3 3 E53
4 4 S92.00
5 5 M81.999
data
df <- structure(list(id = c(1, 2, 3, 4, 5), code = c("C443", "B479",
"E53", "S9200", "M81999")), class = "data.frame", row.names = c(NA,
-5L))
So I have a string that I'm attempting to parse through and then create 3 columns with the data I extract. From what I've seen, stringr doesn't really cover this case and the gsub I've used so far is excessive and involves me making multiple columns, parsing from those new columns, and then removing them and that seems really inefficient.
The format is this:
"blah, grabbed by ???-??-?????."
I need this:
???-??-?????
I've used placeholders here, but this is how the string typically looks
"blah, grabbed by PHI-80-J.Matthews."
or
"blah, grabbed by NE-5-J.Mills."
and sometimes there is text after the name like this:
"blah, grabbed by KC-10-T.Hill. Blah blah blah."
This is what I would like the end result to be:
Place
Number
Name
PHI
80
J.Matthews
NE
5
J.Mills
KC
10
T. Hill
Edit for further explanation:
Most strings include other people in the same format so "downed by" needs to be incorporated in someway to make sure it is grabbing the right name.
Ex.
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
Desired Output:
Place
Number
Name
KC
10
T. Hill
This solution simply extract the components based on the logic OP mentioned i.e. capture the characters that are needed as three groups - 1) one or more upper case letter ([A-Z]+) followed by a dash (-), 2) then one or more digits (\\d+), and finally 3) non-whitespace characters (\\S+) that follow the dash
library(tidyr)
extract(df1, col1, into = c("Place", "Number", "Name"),
".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*", convert = TRUE)
-ouputt
# A tibble: 4 x 3
Place Number Name
<chr> <int> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
4 KC 10 T.Hill
Or do this in base R
read.table(text = sub(".*grabbed by\\s((\\w+-){2}\\S+)\\..*", "\\1",
df1$col1), header = FALSE, col.names = c("Place", "Number", "Name"), sep='-')
Place Number Name
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
data
df1 <- structure(list(col1 = c("blah, grabbed by PHI-80-J.Matthews.",
"blah, grabbed by NE-5-J.Mills.", "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
This solution actually does what you say in the title, namely first remove the text around the the target substring, then split it into columns:
library(tidyr)
library(stringr)
df1 %>%
mutate(col1 = str_extract(col1, "\\w+-\\w+-\\w\\.\\w+")) %>%
separate(col1,
into = c("Place", "Number", "Name"),
sep = "-")
# A tibble: 3 x 3
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
Here, we make use of the fact that the character class \\w is for letters irrespective of case and for digits (and also for the underscore).
Here is an alternative way using sub with regex "([A-Za-z]+\\.[A-Za-z]+).*", "\\1" that removes the string after the second point.
separate that splits the string by by, and finally again separate to get the desired columns.
library(dplyr)
library(tidyr)
df1 %>%
mutate(test1 = sub("([A-Za-z]+\\.[A-Za-z]+).*", "\\1", col1)) %>%
separate(test1, c('remove', 'keep'), sep = " by ") %>%
separate(keep, c("Place", "Number", "Name"), sep = "-") %>%
select(Place, Number, Name)
Output:
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
I have data with sample names that need to be unpacked and created into new columns.
sample
P10.1
P11.2
S1.1
S3.3
Using the sample ID data, I need to make three new columns: tissue, plant, stage.
sample tissue plant stage
P10.1 P 10 1
P11.2 P 11 2
S1.1 S 1 1
S3.3 S 3 3
Is there a way to pull the data from the sample column to populate the three new columns?
using dplyr and tidyr.
First we insert a "." in the sample code, next we separate sample into 3 columns.
library(dplyr)
library(tidyr)
df %>%
mutate(sample = paste0(substring(df$sample, 1, 1), ".", substring(df$sample, 2))) %>%
separate(sample, into = c("tissue", "plant", "stage"), remove = FALSE)
sample tissue plant stage
1 P.10.1 P 10 1
2 P.11.2 P 11 2
3 S.1.1 S 1 1
4 S.3.3 S 3 3
data:
df <- structure(list(sample = c("P10.1", "P11.2", "S1.1", "S3.3")),
.Names = "sample",
class = "data.frame",
row.names = c(NA, -4L))
Similar to #phiver, but uses regular expressions.
Within pattern:
The first parentheses captures any single uppercase letter (for tissue)
The second parentheses captures any one or two digit number (for plant)
The third parentheses captures any one or two digit number (for stage)
The sub() function pulls out those capturing groups, and places then in new variables.
library(magrittr)
pattern <- "^([A-Z])(\\d{1,2})\\.(\\d{1,2})$"
df %>%
dplyr::mutate(
tissue = sub(pattern, "\\1", sample),
plant = as.integer(sub(pattern, "\\2", sample)),
stage = as.integer(sub(pattern, "\\3", sample))
)
Result (displayed with str()):
'data.frame': 4 obs. of 4 variables:
$ sample: chr "P10.1" "P11.2" "S1.1" "S3.3"
$ tissue: chr "P" "P" "S" "S"
$ plant : int 10 11 1 3
$ stage : int 1 2 1 3
This is similar to phiver's answer, but use separate twice. Notice that we can specify the position index in the sep argument.
library(tidyr)
dat2 <- dat %>%
separate(sample, into = c("tissue", "number"), sep = 1, remove = FALSE) %>%
separate(number, into = c("plant", "stage"), sep = "\\.", remove = TRUE, convert = TRUE)
dat2
# sample tissue plant stage
# 1 P10.1 P 10 1
# 2 P11.2 P 11 2
# 3 S1.1 S 1 1
# 4 S3.3 S 3 3
DATA
dat <- read.table(text = "sample
P10.1
P11.2
S1.1
S3.3",
header = TRUE, stringsAsFactors = FALSE)
There is the basic width : xxxx.xxxxxx (4digits before "." 6 digits after".")
Have to add "0" when each side before and after "." is not enough digits.
Use regexr find "[.]" location with combination of str_pad can
fix the first 4 digits but
don't know how to add value after the specific character with fixed digits.
(cannot find a library can count the location from somewhere specified)
Data like this
> df
Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402
Desired data
> df
Category
1 0300.030340
2 3400.040290
3 0700.070110
4 1700.090100
5 0700.070114
6 0700.079100
7 3600.050590
8 4400.040200
I am beginner of coding that sometime can't understand some regex like "["
e.t.c .With some explain of them would be super helpful.
Also i have a combination like this :
df$Category<-ifelse(regexpr("[.]",df$Category)==4,
paste("0",df1$Category,sep = ""),df$Category)
df$Category<-str_pad(df$Category,11,side = c("right"),pad="0")
Desire to know are there is any better way do this , especially count and
return the location from the END until specific character appear.
Using formatC:
df$Category <- formatC(as.numeric(df$Category), format = 'f', width = 11, flag = '0', digits = 6)
# > df
# Category
# 1 0300.030340
# 2 3400.040290
# 3 0700.070110
# 4 1700.090100
# 5 0700.070114
# 6 0700.079100
# 7 3600.050590
# 8 4400.040200
format = 'f': formating doubles;
width = 11: 4 digits before . + 1 . + 6 digits after .;
flag = '0': pads leading zeros;
digits = 6: the desired number of digits after the decimal point (format = "f");
Input df seems to be character data.frame:
structure(list(Category = c("300.030340", "3400.040290", "700.07011",
"1700.0901", "700.070114", "700.0791", "3600.05059", "4400.0402"
)), .Names = "Category", row.names = c(NA, -8L), class = "data.frame")
We can use sprintf
df$Category <- sprintf("%011.6f", df$Category)
df
# Category
#1 0300.030340
#2 3400.040290
#3 0700.070110
#4 1700.090100
#5 0700.070114
#6 0700.079100
#7 3600.050590
#8 4400.040200
data
df <- structure(list(Category = c(300.03034, 3400.04029, 700.07011,
1700.0901, 700.070114, 700.0791, 3600.05059, 4400.0402)),
.Names = "Category", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
There are plenty of great tricks, functions, and shortcuts to be learned, and I would encourage you to explore them all! For example, if you're trying to win code golf, you will want to use #akrun's sprintf() approach. Since you stated you're a beginner, it might be more helpful to breakdown the problem into its component parts. One transparent and easy-to-follow, in my opinion, approach would be to utilize the stringr package:
library(stringr)
location_of_dot <- str_locate(df$Category, "\\.")[, 1]
substring_left_of_dot <- str_sub(df$Category, end = location_of_dot - 1)
substring_right_of_dot <- str_sub(df$Category, start = location_of_dot + 1)
pad_left <- str_pad(substring_left_of_dot, 4, side = "left", pad = "0")
pad_right <- str_pad(substring_right_of_dot, 6, side = "right", pad = "0")
result <- paste0(pad_left, ".", pad_right)
result
Use separate in tidyr to separate Category on decimal. Use str_pad from stringr to add zeros in the front or back and paste them together.
library(tidyr) # to separate columns on decimal
library(dplyr) # to mutate and pipes
library(stringr) # to strpad
input_data <- read.table(text =" Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402", header = TRUE, stringsAsFactors = FALSE) %>%
separate(Category, into = c("col1", "col2")) %>%
mutate(col1 = str_pad(col1, width = 4, side= "left", pad ="0"),
col2 = str_pad(col2, width = 6, side= "right", pad ="0"),
Category = paste(col1, col2, sep = ".")) %>%
select(-col1, -col2)