Remove Last Character in R inplace - r

I came from a Python background and I am working in R with this data df.
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a
Using R, how can I remove the trailing a from the age column and modify the data in place?
(just like Python using inplace=True)
Can I attain it using
df$Age[which(df$Age == `a pattern`)] <- ""

This is a perfect use case for parse_number from the readr package (it is part of the tidyverse):
library(dplyr)
library(readr)
df %>%
mutate(age = parse_number(age))
name age
1 Anon1 52
2 Anon2 62
3 Anon3 44
4 Anon4 30
5 Anon5 110
data:
df <- structure(list(name = c("Anon1", "Anon2", "Anon3", "Anon4", "Anon5"
), age = c("52a", "62", "44a", "30", "110a")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))

You could use sub here:
df$age <- sub("a$", "", df$age)
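As a minimal sketch of the sub approach, assuming the only unwanted character is a single trailing a: the pattern "a$" is a regular expression (a at the end of the string), so it has to be interpreted as a regex rather than as a fixed string. The vector name ages is hypothetical and just mirrors the question's data.

```r
# Toy data mirroring the question; `ages` is a hypothetical name
ages <- c("52a", "62", "44a", "30", "110a")

# "a$" means: the letter a immediately before the end of the string.
# sub() treats the pattern as a regular expression by default.
cleaned <- sub("a$", "", ages)

# Once the suffix is gone, the values can be converted to integers
ages_int <- as.integer(cleaned)
print(ages_int)
```

Assigning the result back (df$age <- ...) is the usual base R way to update a column; R rebinds the column rather than exposing an inplace flag.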

#A tidy solution
library(dplyr)
library(stringr)
df <- data.frame(name=c("anon1","anon2"),age=c("52","37a"))
df <- df %>%
mutate(age = str_extract(age,"^\\d+"))
df
name age
1 anon1 52
2 anon2 37

Here are two approaches. No packages are used.
1) We remove all non-digit characters; in a regular expression, \D means non-digit. If we knew that only a could appear as a non-digit we could instead use "a" as the first argument to gsub, and if we knew it appears only once we could use sub instead of gsub.
Also it is easier to debug code if you don't overwrite variables since then you always know that a particular variable is in its original state. Instead assign the result to a new variable.
transform(DF, age = as.numeric(gsub("\\D", "", age)))
This could also be written using pipes:
transform(DF, age = age |> gsub(pattern = "\\D", replacement = "") |> as.numeric())
2) We can use scan specifying that a is a comment character.
transform(DF, age = scan(text = age, comment.char = "a", quiet = TRUE))
Note
Lines <- "
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a"
DF <- read.table(text = Lines)

The inplace modifier in Python refers to making a change without creating a copy. The data.table package in R allows for this (called assignment by reference).
df <- read.table(text="
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a")
library(data.table)
library(stringi)
setDT(df)[, age:=stri_extract(age, regex='^\\d+')]
df
The first clause (setDT(df)) converts df to a data.table by reference (i.e., without making a copy), and the second clause ([, age:=...]) replaces the values in column age with ... also by reference.

Related

How to remove characters that don't match the string pattern from a column of a data frame

I have a column in my data frame as shown below.
I want to keep the data in the pattern "\\d+Zimmer" and remove all the digits from the column such as "9586" and "927" in the picture.
I tried following gsub function.
gsub("[^\\d+Zimmer]", "", flat_cl_one$rooms)
But it removes all the digits, as below.
What Regex can I use to get the correct result? Thank You in Advance
We can convert the column with as.numeric, which turns any value containing letters into NA (with a warning), and then replace the values that are not NA, i.e. the purely numeric ones, with blanks.
library(dplyr)
flat_cl_one %>%
mutate(rooms = ifelse(!is.na(as.numeric(rooms)), "", rooms))
Or we can use str_detect from stringr:
library(stringr)
flat_cl_one %>%
mutate(rooms = ifelse(str_detect(rooms, "Zimmer", negate = TRUE), "", rooms))
Output
rooms
1 647Zimmer
2 394Zimmer
3
4
5 38210Zimmer
We could do the same thing with filter if you wanted to actually remove those rows.
flat_cl_one %>%
filter(is.na(as.numeric(rooms)))
# rooms
#1 647Zimmer
#2 394Zimmer
#3 38210Zimmer
Data
flat_cl_one <- structure(list(rooms = c("647Zimmer", "394Zimmer", "8796", "9389",
"38210Zimmer")), class = "data.frame", row.names = c(NA, -5L))
Just replace strings that don't contain the word "Zimmer"
flat_cl_one$room[!grepl("Zimmer", flat_cl_one$room)] <- ""
flat_cl_one
#> room
#> 1 3Zimmer
#> 2 2Zimmer
#> 3 2Zimmer
#> 4 3Zimmer
#> 5
#> 6
#> 7 3Zimmer
#> 8 6Zimmer
#> 9 2Zimmer
#> 10 4Zimmer
Data
flat_cl_one <- data.frame(room = c("3Zimmer", "2Zimmer", "2Zimmer", "3Zimmer",
"9586", "927", "3Zimmer", "6Zimmer",
"2Zimmer", "4Zimmer"))
Another possible solution, using stringr::str_extract (I am using @AndrewGillreath-Brown's data, for which I thank him):
library(tidyverse)
df <- structure(
list(rooms = c("647Zimmer", "394Zimmer", "8796", "9389", "38210Zimmer")),
class = "data.frame",
row.names = c(NA, -5L))
df %>%
mutate(rooms = str_extract(rooms, "\\d+Zimmer"))
#> rooms
#> 1 647Zimmer
#> 2 394Zimmer
#> 3 <NA>
#> 4 <NA>
#> 5 38210Zimmer
This pattern [^\\d+Zimmer] matches any character except a digit or the following characters + Z i m etc...
Using gsub, you can assert that the string does not start with the pattern ^\\d+Zimmer using a negative lookahead (?!...), setting perl = TRUE, and then match 1 or more digits if the assertion is true.
gsub("^(?!^\\d+Zimmer\\b)\\d+\\b", "", flat_cl_one$rooms, perl = TRUE)
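To make the lookahead behaviour concrete, here is a small sketch on a shortened version of the sample data. The names rooms, has_match and cleaned are hypothetical; the pattern is the one from the answer above.

```r
# Shortened sample data; names are illustrative only
rooms <- c("647Zimmer", "8796", "38210Zimmer")

# Negative lookahead: only match leading digits when the string
# does NOT start with digits followed by the word "Zimmer"
pattern <- "^(?!^\\d+Zimmer\\b)\\d+\\b"

# Lookaheads need the PCRE engine, hence perl = TRUE
has_match <- grepl(pattern, rooms, perl = TRUE)
cleaned <- gsub(pattern, "", rooms, perl = TRUE)

print(has_match)
print(cleaned)
```

Values of the form digits plus Zimmer fail the assertion and are left untouched, while purely numeric values are replaced by empty strings.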

Can dplyr remove all dots from as.character string?

Lets say I have a set of dates
> p
birth
1 22.12.1946
2 01.08.1948
3 02.11.2028
4 18.11.1953
5 28.03.1948
Is there a dplyr solution to remove all dots?
I tried
p %>% mutate(birth = str_replace(birth, ".", ""))
Data
p <- structure(list(birth = c("22.12.1946", "01.08.1948", "02.11.2028",
"18.11.1953", "28.03.1948")), row.names = c(NA, 5L), class = "data.frame")
We need to either wrap the dot in fixed() or escape it (\\.), because . in a regex matches any character and not the literal .
library(dplyr)
library(stringr)
p %>%
mutate(birth = str_remove_all(birth, fixed(".")))
-output
# birth
#1 22121946
#2 01081948
#3 02112028
#4 18111953
#5 28031948
NOTE: while str_replace_all would work as well, str_remove_all wrapped around fixed() is the more compact option
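The same point about the dot can be sketched in base R, without any packages. The variable names here are hypothetical. An unescaped "." matches any character, so replacing only the first match (as sub does) drops the first character instead of a dot.

```r
# Two of the dates from the question; names are illustrative
birth <- c("22.12.1946", "01.08.1948")

# Unescaped "." is a regex wildcard: sub() removes the first character
first_char_gone <- sub(".", "", birth)

# Two ways to target the literal dot instead
escaped <- gsub("\\.", "", birth)              # escape the metacharacter
literal <- gsub(".", "", birth, fixed = TRUE)  # treat the pattern as plain text

print(first_char_gone)
print(escaped)
print(literal)
```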
It is easier to convert to Date class first and then format it. Note that %y yields a two-digit year; use %Y in the output format to keep all four digits.
format(as.Date(p$birth, "%d.%m.%Y"), "%d%m%y")
#[1] "221246" "010848" "021128" "181153" "280348"

Remove 91 from mobile_number_column in R

Eg: Mobile_Number column contains
read.table(header=T,text=' Mobile_Number_Column
919177289917
917728991746
917728991748
919126380348
')
Now I want to remove 91 from Mobile_Number_Column
Expected Result:
Mobile_Number_Column
9177289917
7728991746
7728991748
9126380348
This can be accomplished with a regular expression. Since you're reading in the numbers as part of a data.frame, you can leverage the ^ start of string matcher plus the literal numbers of 91 in a sub call. No point in gsub since you only want to match once.
df = read.table(header=T,text=' Mobile_Number_Column
919177289917
917728991746
917728991748
919126380348
')
df$Mobile_Number_Column = sub("^91","",as.character(df$Mobile_Number_Column))
df
#> Mobile_Number_Column
#> 1 9177289917
#> 2 7728991746
#> 3 7728991748
#> 4 9126380348
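A short sketch of why the ^ anchor and sub (rather than gsub) matter here, using two of the numbers stored as character strings. The variable names are hypothetical.

```r
# Two of the numbers from the question, kept as strings
numbers <- c("919177289917", "917728991746")

# Anchored: removes "91" only at the start of each string
anchored <- sub("^91", "", numbers)

# Unanchored gsub: removes every "91", including ones inside the number
unanchored <- gsub("91", "", numbers)

print(anchored)
print(unanchored)
```

The anchored version always leaves the remaining ten digits intact, whereas the unanchored version also deletes any 91 that happens to occur later in the number.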
This uses the stringr and dplyr packages:
library(tidyverse)
data <- tibble(numbers = c(
919177289917,
917728991746,
917728991748,
919126380348)
)
data_2 <- data %>%
mutate(numbers = str_sub(numbers, start = 3L, end = -1L))

R: Pulling data from one column to create new columns

I have data with sample names that need to be unpacked and created into new columns.
sample
P10.1
P11.2
S1.1
S3.3
Using the sample ID data, I need to make three new columns: tissue, plant, stage.
sample tissue plant stage
P10.1 P 10 1
P11.2 P 11 2
S1.1 S 1 1
S3.3 S 3 3
Is there a way to pull the data from the sample column to populate the three new columns?
Using dplyr and tidyr:
First we insert a "." after the first character of the sample code, then we separate sample into 3 columns.
library(dplyr)
library(tidyr)
df %>%
mutate(sample = paste0(substring(sample, 1, 1), ".", substring(sample, 2))) %>%
separate(sample, into = c("tissue", "plant", "stage"), remove = FALSE)
sample tissue plant stage
1 P.10.1 P 10 1
2 P.11.2 P 11 2
3 S.1.1 S 1 1
4 S.3.3 S 3 3
data:
df <- structure(list(sample = c("P10.1", "P11.2", "S1.1", "S3.3")),
.Names = "sample",
class = "data.frame",
row.names = c(NA, -4L))
Similar to @phiver's answer, but uses regular expressions.
Within pattern:
The first set of parentheses captures any single uppercase letter (for tissue)
The second set of parentheses captures any one- or two-digit number (for plant)
The third set of parentheses captures any one- or two-digit number (for stage)
The sub() function pulls out those capturing groups and places them in new variables.
library(magrittr)
pattern <- "^([A-Z])(\\d{1,2})\\.(\\d{1,2})$"
df %>%
dplyr::mutate(
tissue = sub(pattern, "\\1", sample),
plant = as.integer(sub(pattern, "\\2", sample)),
stage = as.integer(sub(pattern, "\\3", sample))
)
Result (displayed with str()):
'data.frame': 4 obs. of 4 variables:
$ sample: chr "P10.1" "P11.2" "S1.1" "S3.3"
$ tissue: chr "P" "P" "S" "S"
$ plant : int 10 11 1 3
$ stage : int 1 2 1 3
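As an alternative sketch using the same pattern, base R's utils::strcapture can extract all three capture groups in one call and convert them to the types given in a prototype data frame. The names samples, proto, parts and result are hypothetical.

```r
# Sample IDs from the question; names are illustrative
samples <- c("P10.1", "P11.2", "S1.1", "S3.3")
pattern <- "^([A-Z])(\\d{1,2})\\.(\\d{1,2})$"

# The prototype fixes the name and type of each captured group
proto <- data.frame(tissue = character(),
                    plant = integer(),
                    stage = integer(),
                    stringsAsFactors = FALSE)

# One column per capture group, converted to the prototype's types
parts <- utils::strcapture(pattern, samples, proto)

# Keep the original sample ID alongside the new columns
result <- cbind(sample = samples, parts, stringsAsFactors = FALSE)
print(result)
```

This avoids repeating the pattern once per column, at the cost of being less explicit than the three separate sub() calls.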
This is similar to phiver's answer, but uses separate twice. Notice that we can specify the position index in the sep argument.
library(tidyr)
dat2 <- dat %>%
separate(sample, into = c("tissue", "number"), sep = 1, remove = FALSE) %>%
separate(number, into = c("plant", "stage"), sep = "\\.", remove = TRUE, convert = TRUE)
dat2
# sample tissue plant stage
# 1 P10.1 P 10 1
# 2 P11.2 P 11 2
# 3 S1.1 S 1 1
# 4 S3.3 S 3 3
DATA
dat <- read.table(text = "sample
P10.1
P11.2
S1.1
S3.3",
header = TRUE, stringsAsFactors = FALSE)

How to add value base on specific character ,also fix with certain digits in R

The target format is xxxx.xxxxxx (4 digits before the "." and 6 digits after it).
I have to add "0" on either side of the "." when that side does not have enough digits.
Using regexpr to find the location of "[.]" in combination with str_pad, I can
fix the first 4 digits, but
I don't know how to pad after a specific character to a fixed number of digits.
(I cannot find a function that counts a position starting from a specified character.)
Data like this
> df
Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402
Desired data
> df
Category
1 0300.030340
2 3400.040290
3 0700.070110
4 1700.090100
5 0700.070114
6 0700.079100
7 3600.050590
8 4400.040200
I am a beginner at coding and sometimes can't understand regex syntax such as "[",
so some explanation would be super helpful.
Also, I have a combination like this:
df$Category<-ifelse(regexpr("[.]",df$Category)==4,
paste("0",df$Category,sep = ""),df$Category)
df$Category<-str_pad(df$Category,11,side = c("right"),pad="0")
I would like to know whether there is a better way to do this, especially how to count and
return the position from the END of the string back to a specific character.
Using formatC:
df$Category <- formatC(as.numeric(df$Category), format = 'f', width = 11, flag = '0', digits = 6)
# > df
# Category
# 1 0300.030340
# 2 3400.040290
# 3 0700.070110
# 4 1700.090100
# 5 0700.070114
# 6 0700.079100
# 7 3600.050590
# 8 4400.040200
format = 'f': formatting doubles;
width = 11: 4 digits before . + 1 . + 6 digits after .;
flag = '0': pads leading zeros;
digits = 6: the desired number of digits after the decimal point (format = "f");
Input df seems to be character data.frame:
structure(list(Category = c("300.030340", "3400.040290", "700.07011",
"1700.0901", "700.070114", "700.0791", "3600.05059", "4400.0402"
)), .Names = "Category", row.names = c(NA, -8L), class = "data.frame")
We can use sprintf
df$Category <- sprintf("%011.6f", df$Category)
df
# Category
#1 0300.030340
#2 3400.040290
#3 0700.070110
#4 1700.090100
#5 0700.070114
#6 0700.079100
#7 3600.050590
#8 4400.040200
data
df <- structure(list(Category = c(300.03034, 3400.04029, 700.07011,
1700.0901, 700.070114, 700.0791, 3600.05059, 4400.0402)),
.Names = "Category", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
There are plenty of great tricks, functions, and shortcuts to be learned, and I would encourage you to explore them all! For example, if you're trying to win code golf, you will want to use @akrun's sprintf() approach. Since you stated you're a beginner, it might be more helpful to break down the problem into its component parts. One transparent and easy-to-follow approach, in my opinion, would be to use the stringr package:
library(stringr)
location_of_dot <- str_locate(df$Category, "\\.")[, 1]
substring_left_of_dot <- str_sub(df$Category, end = location_of_dot - 1)
substring_right_of_dot <- str_sub(df$Category, start = location_of_dot + 1)
pad_left <- str_pad(substring_left_of_dot, 4, side = "left", pad = "0")
pad_right <- str_pad(substring_right_of_dot, 6, side = "right", pad = "0")
result <- paste0(pad_left, ".", pad_right)
result
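The same decomposition can be sketched in base R, which also answers the question of counting from the end of the string back to the dot: the number of characters after the dot is nchar(x) minus the dot's position. This assumes character input with exactly one dot and no side longer than the target width; all variable names are hypothetical.

```r
# A few of the values from the question, as character strings
x <- c("300.030340", "3400.040290", "700.07011", "1700.0901")

# Position of the literal dot (fixed = TRUE avoids regex escaping)
dot_pos <- as.integer(regexpr(".", x, fixed = TRUE))

# Characters before the dot, and characters from the end back to the dot
n_before <- dot_pos - 1L
n_after <- nchar(x) - dot_pos

# Pad with zeros: up to 4 digits on the left, up to 6 on the right
# (pmax guards against negative repeat counts)
padded <- paste0(strrep("0", pmax(0L, 4L - n_before)),
                 x,
                 strrep("0", pmax(0L, 6L - n_after)))
print(padded)
```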
Use separate in tidyr to separate Category on decimal. Use str_pad from stringr to add zeros in the front or back and paste them together.
library(tidyr) # to separate columns on decimal
library(dplyr) # to mutate and pipes
library(stringr) # for str_pad
input_data <- read.table(text =" Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402", header = TRUE, stringsAsFactors = FALSE) %>%
separate(Category, into = c("col1", "col2")) %>%
mutate(col1 = str_pad(col1, width = 4, side= "left", pad ="0"),
col2 = str_pad(col2, width = 6, side= "right", pad ="0"),
Category = paste(col1, col2, sep = ".")) %>%
select(-col1, -col2)
