Remove all the characters before the last comma in R

I have a data table like this:
id number
1 5562,4024,...,1213
2 4244,4214,...,244
3 424,4213
4 1213,441
...
And I want to keep only the last part of each entry in the number column, which should look like:
id number
1 1213
2 244
3 4213
4 441
...
So what should I do to achieve that?

One option is to capture the digits that follow a , at the end ($) of the string as a group, and replace the match with the backreference (\\1) to the captured group
df$number <- as.numeric(sub(".*,(\\d+)$", "\\1", df$number))
Or match all characters (.*) up to and including the last , (the match is greedy) and replace them with an empty string ("")
df$number <- as.numeric(sub(".*,", "", df$number))
data
df <- structure(list(id = 1:4, number = c("5562,4024,...,1213",
"4244,4214,...,244",
"424,4213", "1213,441")), class = "data.frame", row.names = c(NA,
-4L))
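
As a regex-free alternative, the same result can be sketched by splitting each entry on commas and keeping the last piece. The vector below is hypothetical sample data modelled on the question (the "77" entry is added to show what happens when there is no comma), and the variable names are illustrative only.

```r
# Hypothetical input mirroring the question's `number` column,
# plus one entry without any comma
number <- c("5562,4024,1213", "4244,4214,244", "424,4213", "1213,441", "77")

# Split each entry on literal commas, then keep the last piece of each split
pieces <- strsplit(number, ",", fixed = TRUE)
last_value <- as.numeric(vapply(pieces, function(p) p[length(p)], character(1)))

# The greedy-regex approach from the answer, for comparison
last_value_regex <- as.numeric(sub(".*,", "", number))

print(last_value)
print(identical(last_value, last_value_regex))
```

Both approaches leave an entry without a comma unchanged, so it is simply converted to a number.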

Related

Change underscore behind word within column in R

Hi I have a data frame like this, with two columns (A and B):
A B
x_1234 rs4566
x_1567 rs3566
z_1444 rs78654
r_1234 rs34567
I would like to move the letter in front of each number in column A to after the number, still separated by an underscore.
Expected output:
A B
1234_x rs4566
1567_x rs3566
1444_z rs78654
1234_r rs34567
I tried something like, but it doesn't work:
DF$A <- gsub(".*_", "_*.", DF$A)
We can capture the two parts as groups: (.*) captures the characters before the _, and (\\d+) captures the one or more digits after it. The replacement then swaps them using the backreferences (\\2 followed by \\1, separated by a _)
DF$A <- sub("(.*)_(\\d+)", "\\2_\\1", DF$A)
-output
> DF
A B
1 1234_x rs4566
2 1567_x rs3566
3 1444_z rs78654
4 1234_r rs34567
The OP's code matches any characters (.*) followed by the _ and replaces them with the _ and the literal characters *. (regex metacharacters have no special meaning in the replacement string). Instead, the replacement should be based on the capture group backreferences
data
DF <- structure(list(A = c("x_1234", "x_1567", "z_1444", "r_1234"),
B = c("rs4566", "rs3566", "rs78654", "rs34567")),
class = "data.frame", row.names = c(NA,
-4L))
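
To make the difference concrete, here is a small sketch that runs the original attempt next to a capture-group version on the values of column A. The anchors ^ and $ are an added assumption that every value has exactly the form prefix_digits; the variable names are illustrative.

```r
# Values of column A from the question
A <- c("x_1234", "x_1567", "z_1444", "r_1234")

# The original attempt: the replacement "_*." is inserted literally
attempt <- gsub(".*_", "_*.", A)

# Capture both sides of the underscore and swap them via backreferences
swapped <- sub("^(.*)_(\\d+)$", "\\2_\\1", A)

print(attempt)
print(swapped)
```

Only backslash sequences such as \\1 are interpreted in the replacement string, which is why the asterisk and dot end up in the output of the first call.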

Replacing a dataframe entry with the last value in a listed entry in R

I have a dataframe that looks like the below:
BaseRating contRating Participant
5,4,6,3,2,4 5 01
4 4 01
I would first like to run some code that looks to see whether there are any commas in the dataframe, and returns a column number of where that is. I have tried some of the solutions in the questions below, which don't seem to work when looking for a comma instead of a string/whole value? I'm probably missing something simple here but any help appreciated!
Selecting data frame rows based on partial string match in a column
Filter rows which contain a certain string
Check if value is in data frame
Having determined whether there are commas in my data, I then want to extract just the last number in the list separated by commas in that entry, and replace the entry with that value. For instance, I want the first row in the BaseRating column to become '4' because it is last in that list.
Is there a way to do this in R without manually changing the number?
A possible solution is below.
EXPLANATION
In what follows, I explain the regular expression used in the str_extract function, as requested by @milsandhills:
The symbol | in the middle means the logical OR operator.
We use that because BaseRating can have multiple numbers or only one number — hence the need to use |, to treat each case separately.
The left-hand side of | means a number formed by one or more digits (\\d+), which starts (^) and finishes the string ($).
The right-hand side of | means a number formed by one or more digits (\\d+), which finishes the string ($). And (?<=\\,) is used to guarantee that the number is preceded by a comma.
You can find more details at the stringr cheat sheet.
library(tidyverse)
df <- data.frame(
BaseRating = c("5,4,6,3,2,4", "4"),
contRating = c(5L, 4L),
Participant = c(1L, 1L)
)
df %>%
mutate(BaseRating = sapply(BaseRating,
function(x) str_extract(x, "^\\d+$|(?<=\\,)\\d+$") %>% as.integer))
#> BaseRating contRating Participant
#> 1 4 5 1
#> 2 4 4 1
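
For readers without stringr, the same pattern can be tried in base R. This is only a sketch: it assumes perl = TRUE is acceptable (the lookbehind needs it), and the extra "12,7" and "n/a" entries are hypothetical values added to show a second match and a non-match.

```r
# Sample values; the last two are hypothetical additions
ratings <- c("5,4,6,3,2,4", "4", "12,7", "n/a")

# Same alternation as in the answer: a lone number, or a number after a comma
pattern <- "^\\d+$|(?<=,)\\d+$"

# regexpr() returns -1 where nothing matches; regmatches() returns only the hits
m <- regexpr(pattern, ratings, perl = TRUE)
last_rating <- rep(NA_character_, length(ratings))
last_rating[m > 0] <- regmatches(ratings, m)
last_rating <- as.integer(last_rating)

print(last_rating)
```

Entries that match neither branch of the alternation come back as NA, which mirrors what str_extract does.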
Or:
library(tidyverse)
df %>%
separate_rows(BaseRating, sep = ",", convert = TRUE) %>%
group_by(contRating, Participant) %>%
summarise(BaseRating = last(BaseRating), .groups = "drop") %>%
relocate(BaseRating, .before = 1)
#> # A tibble: 2 × 3
#> BaseRating contRating Participant
#> <int> <int> <int>
#> 1 4 4 1
#> 2 4 5 1
If we want a quick option, we can use trimws from base R (its whitespace argument requires R 3.6.0 or later)
df$BaseRating <- as.numeric(trimws(df$BaseRating, whitespace = ".*,"))
-output
> df
BaseRating contRating Participant
1 4 5 1
2 4 4 1
Another option is stri_extract_last_regex from the stringi package
library(stringi)
df$BaseRating <- as.numeric(stri_extract_last_regex(df$BaseRating, "\\d+"))
data
df <- structure(list(BaseRating = c("5,4,6,3,2,4", "4"), contRating = 5:4,
Participant = c(1L, 1L)), class = "data.frame", row.names = c(NA,
-2L))
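
The first part of the question (finding which columns contain commas) can be sketched in base R as follows. The helper names has_comma and comma_cols are illustrative, and the loop assumes every comma-containing column should be reduced to its last value.

```r
df <- data.frame(
  BaseRating = c("5,4,6,3,2,4", "4"),
  contRating = c(5L, 4L),
  Participant = c(1L, 1L),
  stringsAsFactors = FALSE
)

# For each column, check whether any entry contains a literal comma
has_comma <- vapply(df, function(col) any(grepl(",", col, fixed = TRUE)), logical(1))

# Positions (and names) of the columns that contain commas
comma_cols <- which(has_comma)
print(comma_cols)

# In those columns only, keep what follows the last comma
for (j in comma_cols) {
  df[[j]] <- as.integer(sub(".*,", "", df[[j]]))
}
print(df)
```

Because which() is applied to a named logical vector, the result carries both the column position and the column name.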

How to add leading zeroes to multiple parts of a string

I have the following data
v1
19956673-1
20043747-23
20056956-1
36628-2
45820-4
478
115
I need to add leading zeroes to both string fields (before and after the dash) so the desired output (v2) has 8 digits before the dash and 2 digits after. Also, data with no dash should be passed through as is.
v1 v2
19956673-1 19956673-01
20043747-23 20043747-23
20056956-1 20056956-01
36628-2 00036628-02
45820-4 00045820-04
478 478
115 115
Here is an option that splits each value at the -, then uses sprintf to zero-pad the part after the dash to 2 digits (the part before the dash is left unchanged)
i1 <- grep('-', df1$v1)
df1$v2 <- df1$v1
df1$v2[i1] <- sprintf('%s-%02d', sub('-.*', '', df1$v1[i1]),
as.numeric(sub('.*-', '', df1$v1[i1])))
-output
df1
# v1 v2
#1 19956673-1 19956673-01
#2 20043747-23 20043747-23
#3 20056956-1 20056956-01
#4 36628-2 36628-02
#5 45820-4 45820-04
#6 478 478
#7 115 115
Another option is a regex with capture groups: match the digits (\\d+) from the start (^) of the string and capture them as a group ((...)), followed by a -, then capture a single digit (\\d) at the end ($). The replacement uses the backreferences to the captured groups and inserts a 0 before the second one. Values that already have two digits after the dash, or no dash at all, do not match and are left unchanged
df1$v2 <- sub('^(\\d+)-(\\d)$', '\\1-0\\2', df1$v1)
data
df1 <- structure(list(v1 = c("19956673-1", "20043747-23", "20056956-1",
"36628-2", "45820-4", "478", "115")), row.names = c(NA, -7L),
class = "data.frame")
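
The outputs above pad only the part after the dash, whereas the desired output in the question also pads the part before it to 8 digits. A minimal sketch that pads both parts, assuming both parts are plain non-negative integers small enough for as.integer (the names has_dash, before and after are illustrative):

```r
v1 <- c("19956673-1", "20043747-23", "20056956-1",
        "36628-2", "45820-4", "478", "115")

# Only values that contain a dash are reformatted
has_dash <- grepl("-", v1, fixed = TRUE)

# Split into the parts before and after the dash and convert to integers
before <- as.integer(sub("-.*", "", v1[has_dash]))
after <- as.integer(sub(".*-", "", v1[has_dash]))

# Zero-pad to 8 and 2 digits; values without a dash are copied unchanged
v2 <- v1
v2[has_dash] <- sprintf("%08d-%02d", before, after)

print(data.frame(v1, v2))
```

If the part before the dash could exceed the integer range, padding the strings directly (for example with formatC on character input) would be a safer choice.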
A solution with sub and positive lookbehind:
v2 <- sub("(?<=-)(\\d)$", "0\\1", v1, perl = TRUE)
Result:
v2
[1] "19956673-01" "20043747-23" "20056956-01" "36628-02" "45820-04"
How this works:
(?<=-): positive lookbehind: "if you see a - on the left ...
(\\d)$: ... then remember (\\1) the single digit ((\\d)) right at the end of the string ($) and add a 0 to the left of it"
Data:
v1 <- c("19956673-1", "20043747-23", "20056956-1", "36628-2", "45820-4")

How to remove a mystery character from column header in R?

I have a mystery character in my dataframe in R:
df <- structure(list(`ID21` = c("23", "44"),
ID22 = c("53", "23"), `Drug-na�ve_D22` = c("53",
"45")), row.names = 1:2, class = "data.frame")
> df
ID21 ID22 Drug-na�ve_D22
1 23 53 53
2 44 23 45
What's the best way to remove this character? Would some sort of gsub with regular expression work?
In this example I've replaced it with the letter i:
> df
ID21 ID22 Drug-naive_D22
1 23 53 53
2 44 23 45
To remove any non-word characters (anything other than letters, digits and underscore) from your column names, you can use the following. Note that this also removes the hyphen
names(df) <- gsub("\\W", "", names(df))
If you want to replace the matched characters with a different character instead, put that character in the second argument (the replacement)
To match any character outside the printable ASCII range (space through tilde), you can use this pattern:
[^ -~]
So, for example, if you want to replace the character with i, you can use sub like this:
names(df) <- sub("[^ -~]", "i", names(df))
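
A sketch of the same idea with explicit hex escapes, which avoids relying on how a range like " -~" is ordered in the current locale. It assumes the names are valid UTF-8; the column name below uses a hypothetical "\u00ef" (i with diaeresis) as a stand-in for the mystery character, since the actual byte in the question is unknown.

```r
# Hypothetical column names; "\u00ef" stands in for the mystery character
col_names <- c("ID21", "ID22", "Drug-na\u00efve_D22")

# Flag the names that contain anything outside printable ASCII (0x20 to 0x7E)
non_ascii <- grepl("[^\\x20-\\x7E]", col_names, perl = TRUE)

# Replace each such character with "i"
clean_names <- gsub("[^\\x20-\\x7E]", "i", col_names, perl = TRUE)

print(non_ascii)
print(clean_names)
```

If the names are not valid in their declared encoding, the regex functions may raise an error, in which case converting with iconv first is one thing to try.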

Trimming and reformatting dates in R

I have a column of data with the following types of dates and number entries:
16-Jun
21-01A
7-04
Aug-99
5-09
I want to convert these all into numbers, by doing two things. First, where the data have a number before a dash (as in the first three examples), I want to trim the data from the dash onwards. So the entries would appear 16, 21 and 7.
Second, where the entry is written in month-date format (e.g. Aug-99), I want to convert that to the number of the month and then trim it. So in this example, the date would be converted to 8-99 and then trimmed to just 8.
How can I do this in R? When I use grep, sub and match commands, as in the answer below, I get:
[1] 16 21 7 5 8
When I am after: [1] 16 21 7 8 5
We use grepl to flag the elements that start with a letter, and sub to remove the substring from the - to the end of the string. We then subset 'v2' with 'i1': the elements that do not start with a letter are converted to numeric, and the ones that do are matched against month.abb to get the month index. Concatenating the two results puts all the numeric entries first, so the original order is kept only if the month entries come last.
i1 <- grepl("^[A-Z]", v1)
v2 <- sub("-.*", "", v1)
c(as.numeric(v2[!i1]), match(v2[i1], month.abb))
#[1] 16 21 7 8
For the new dataset, we can use ifelse, which works element by element and therefore keeps the original order
i1 <- grepl("^[A-Z]", df1$v1)
v2 <- sub("-.*", "", df1$v1)
as.numeric(ifelse(i1, match(v2, month.abb), v2))
#[1] 16 21 7 8 5
data
v1 <- c('16-Jun','21-01A','7-04','Aug-99')
df1 <- structure(list(v1 = c("16-Jun", "21-01A", "7-04", "Aug-99", "5-09"
)), .Names = "v1", class = "data.frame", row.names = c(NA, -5L))
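
The order issue can also be avoided with plain index assignment instead of ifelse. This is a sketch under the assumption that every entry starts with either a number or a three-letter English month abbreviation; the names prefix, month_idx, is_month and out are illustrative.

```r
v1 <- c("16-Jun", "21-01A", "7-04", "Aug-99", "5-09")

# Keep only the part before the dash
prefix <- sub("-.*", "", v1)

# Month number for abbreviations such as "Aug"; NA for everything else
month_idx <- match(prefix, month.abb)
is_month <- !is.na(month_idx)

# Fill a result vector in place so each value stays at its original position
out <- numeric(length(v1))
out[is_month] <- month_idx[is_month]
out[!is_month] <- as.numeric(prefix[!is_month])

print(out)
```

Writing into positions of a preallocated vector, rather than concatenating two subsets, is what keeps 8 ahead of 5 here.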
