Select part of a string after a certain number of special characters - R

I have a data.table with a column
V1
a_b_c_las_poi
a_b_c_kiosk_pran
a_b_c_qwer_ok
I would like to add a new column to this data.table which does not include the last part of string after the "_".
UPDATE
So I would like the output to be:
a_b_c_las
a_b_c_kiosk
a_b_c_qwer

If k is the number of fields to keep (4 for the desired output above), this adds the result as a new column V2:
k <- 4
DT[, V2 := do.call(paste, c(read.table(text = V1, fill = TRUE, sep = "_")[1:k], sep = "_"))]
fill = TRUE can be omitted if all rows have the same number of fields.
Note: DT in a reproducible form is:
library(data.table)
DF <- data.frame(V1 = c("a_b_c_las_poi", "a_b_c_kiosk_pran", "a_b_c_qwer_ok"),
stringsAsFactors = FALSE)
DT <- as.data.table(DF)
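As an alternative to read.table, the same "keep the first k fields" idea can be sketched in base R with strsplit. The helper name keep_fields is hypothetical (it does not come from the answer above), and the sketch assumes the separator is a literal string rather than a regular expression.

```r
# Hypothetical helper: keep the first k separator-delimited fields of each string.
keep_fields <- function(x, k, sep = "_") {
  parts <- strsplit(x, sep, fixed = TRUE)          # split on the literal separator
  vapply(parts,
         function(p) paste(head(p, k), collapse = sep),  # re-join the first k pieces
         character(1))
}

V1 <- c("a_b_c_las_poi", "a_b_c_kiosk_pran", "a_b_c_qwer_ok")
kept4 <- keep_fields(V1, 4)   # drops the last of five fields
kept2 <- keep_fields(V1, 2)   # keeps only the first two fields
```

With a data.table this could be used as DT[, V2 := keep_fields(V1, 4)]. Rows with fewer than k fields are returned unchanged, because head() simply returns what is there.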

You can do this with sub and a regular expression. Because (.*) is greedy, only the text after the last underscore is dropped.
sub("(.*)_.*", "\\1", DT$V1)
[1] "a_b_c_las" "a_b_c_kiosk" "a_b_c_qwer"
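To make the greedy-match reasoning concrete, here is a small base R sketch comparing that pattern with an equivalent one that anchors on the end of the string. The variable names are only illustrative.

```r
x <- c("a_b_c_las_poi", "a_b_c_kiosk_pran", "a_b_c_qwer_ok", "nounderscore")

# Greedy capture: "(.*)" grabs as much as it can, so "_.*" matches only the last field.
drop_last_greedy <- sub("(.*)_.*", "\\1", x)

# Anchored form: delete the final underscore and every non-underscore after it.
drop_last_anchored <- sub("_[^_]*$", "", x)
```

Both forms leave a string without any underscore untouched, since the pattern does not match and sub returns its input.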

Related

R Replace a value with a character from another data.table

I have run into a problem replacing values in R.
Here is the original data. There are two data.tables:
dt1 <- data.table(V1 = c(1,"A"))
dt2 <- data.table("1" = c(4,5,6), "A" = c("c","d","e"))
Now I want to replace each value in dt1 with the values from the dt2 column whose name matches it.
The desired output is:
dt3 <- data.table(V1 = c("4,5,6", "c,d,e"))
That is, I want to replace each value in dt1 with all the values in the corresponding column of dt2. This is a simplified example; I want to apply it to a whole data.table.
I have had a lot of trouble with this, so any help is appreciated.
Here is a way from your input to desired output.
dt1[, V1 := sapply(dt2, paste, collapse = ',')[V1]]
# Test
all.equal(dt1, dt3)
[1] TRUE
PS. Are you sure that storing the values as a comma-separated string is the best approach?
We may do
dt1[, V1 := unlist(lapply(V1, function(x) toString(dt2[[x]])))]
dt1
V1
1: 4, 5, 6
2: c, d, e
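The lookup-by-column-name idea in both answers can be sketched without data.table as follows. This is only an illustration using a plain data frame (the names lookup and keys are assumptions, standing in for dt2 and dt1$V1); it also shows a list as one possible alternative to the comma-separated string questioned in the PS above.

```r
# Stand-in for dt2; check.names = FALSE keeps the column name "1" as-is.
lookup <- data.frame("1" = c(4, 5, 6), "A" = c("c", "d", "e"),
                     check.names = FALSE, stringsAsFactors = FALSE)
keys <- c("1", "A")   # stand-in for dt1$V1

# Collapse each matching column into one comma-separated string.
collapsed <- unname(vapply(keys,
                           function(k) paste(lookup[[k]], collapse = ","),
                           character(1)))

# Alternative: keep the values as separate elements in a list, one entry per key.
as_list <- lapply(keys, function(k) lookup[[k]])
```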

How to group column names and add suffixes to them?

I would appreciate it if someone could help me with the task described below.
I have an R data frame with the following columns:
id
cols_len.max.(1,5]
cols_len.max.(1,55]
cols_width.min.(1,55]
cols_width.min.(2,15]
cols_width.upper.(1,15]
I want to rename these columns to get the following column names:
id
cols_len.max_1
cols_len.max_2
cols_width.min_1
cols_width.min_2
cols_width.upper
This is my current code:
colnames(df) <- gsub("\\(.*\\]*-*.","",colnames(df))
colnames(df) <- gsub("\\.","",colnames(df))
colnames(df) <- gsub("-","",colnames(df))
colnames(df) <- gsub("\\_","",colnames(df))
But this gives me duplicate column names (cols_len.max and cols_width.min):
id
cols_len.max
cols_len.max
cols_width.min
cols_width.min
cols_width.upper
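How can I append _N to them, where N is assigned as shown above? I am looking for an automated approach because my real data frame contains hundreds of columns.
One option is to remove the substring at the end and wrap the result in make.unique. Note that make.unique appends .1, .2, ... to repeated names only, so this yields cols_len.max and cols_len.max.1 rather than the _1 and _2 suffixes requested: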
v2 <- make.unique(sub("\\.\\(.*", "", v1))
Or another option is to use the sub output as a grouping variable and then append the sequence at the end
tmp <- sub("\\.\\(.*", "", v1)
t1 <- ave(seq_along(tmp), tmp, FUN = function(x)
if(length(x) == 1) "" else seq_along(x))
and paste it at the end of 'tmp'
i1 <- nzchar(t1)
tmp[i1] <- paste(tmp[i1], t1[i1], sep="_")
tmp
#[1] "id" "cols_len.max_1" "cols_len.max_2" "cols_width.min_1" "cols_width.min_2" "cols_width.upper"
Data
v1 <- c("id", "cols_len.max.(1,5]", "cols_len.max.(1,55]", "cols_width.min.(1,55]",
"cols_width.min.(2,15]", "cols_width.upper.(1,15]")
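Putting the grouping steps together, a compact variant can be sketched as below. It uses two ave calls (one for the position within each group, one for the group size) instead of returning "" from the function; the variable names are illustrative, and it assumes the interval always starts with ".(".

```r
v1 <- c("id", "cols_len.max.(1,5]", "cols_len.max.(1,55]", "cols_width.min.(1,55]",
        "cols_width.min.(2,15]", "cols_width.upper.(1,15]")

base_names <- sub("\\.\\(.*", "", v1)                           # strip the ".(a,b]" suffix
pos  <- ave(seq_along(base_names), base_names, FUN = seq_along) # position within each group
size <- ave(seq_along(base_names), base_names, FUN = length)    # size of each group

# Only names that occur more than once get a "_N" suffix.
new_names <- ifelse(size > 1, paste(base_names, pos, sep = "_"), base_names)

# For comparison: make.unique leaves the first repeat bare and uses ".N" afterwards.
unique_names <- make.unique(base_names)
```

The result could then be assigned with colnames(df) <- new_names.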

Splitting a string column by a separator and adding the parts as new columns in the same data frame using R

I have a column in a data frame df with values of the form 'name>year>format'. I want to split this column by > and add the pieces as new columns named name, year and format. How can I do this in R?
You can do that easily using the separate function from tidyr:
library(tidyr)
library(dplyr)
data <-
data.frame(
A = c("Joe>1993>student")
)
data %>%
separate(A, into = c("name", "year", "format"), sep = ">", remove = FALSE)
#                  A name year  format
# 1 Joe>1993>student  Joe 1993 student
If you do not want the original column in the resulting data frame, change remove to TRUE.
An option is read.table in base R
cbind(df, read.table(text = as.character(df$column), sep=">",
header = FALSE, col.names = c("name", "year", "format")))
In case your data is big, it would be a good idea to use data.table as it is very fast.
If you know how many fields your "combined" column has:
Suppose the column has 3 fields, and you know it:
library(data.table)
# the 1:3 should be replaced by 1:n, where n is the number of fields
dt1[, paste0("V", 1:3) := tstrsplit(y, split = ">", fixed = TRUE)]
If you DON'T know in advance how many fields the column has:
Now we can get some help from the stringi package:
library(data.table)
library(stringi)
maxFields <- dt2[, max(stri_count_fixed(y, ">")) + 1]
dt2[, paste0("V", 1:maxFields) := tstrsplit(y, split = ">", fixed = TRUE, fill = NA)]
Data used:
library(data.table)
dt1 <- data.table(x = c("A", "B"), y = c("letter>2018>pdf", "code>2020>Rmd"))
dt2 <- rbind(dt1, data.table(x = "C", y = "report>2019>html>pdf"))
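If neither data.table nor stringi is available, the "unknown number of fields" case can be sketched in base R by splitting and then padding each row with NA. This is a minimal illustration on a plain character vector (y mirrors the y column of dt2 above); the V1, V2, ... column names are just a convention.

```r
y <- c("letter>2018>pdf", "code>2020>Rmd", "report>2019>html>pdf")

parts <- strsplit(y, ">", fixed = TRUE)   # list of character vectors, one per row
n_max <- max(lengths(parts))              # largest number of fields in any row

# Pad every row to n_max fields with NA, then arrange one row per input string.
padded <- t(vapply(parts,
                   function(p) c(p, rep(NA_character_, n_max - length(p))),
                   character(n_max)))
colnames(padded) <- paste0("V", seq_len(n_max))
```

The resulting character matrix can be attached to a data frame with cbind(df, padded).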

Elegant way to match and replace part of string

I have two data tables, dt and dt1:
dt <- data.table(s=c("AA-AA-1", "BB-BB-2", "CC-CC-3"))
s
1 AA-AA-1
2 BB-BB-2
3 CC-CC-3
dt1 <- data.table(x=c(1,2,3), name=c("AA", "BB", "CC"))
x name
1: 1 AA
2: 2 BB
3: 3 CC
I need to replace the number after the last hyphen in the s column of dt with the name from dt1 whose x value matches that number, so that dt becomes:
s
1: AA-AA-AA
2: BB-BB-BB
3: CC-CC-CC
I know we can do it by splitting s and matching:
split <- sapply(strsplit(as.character(dt$s), split="-"), tail, n=1)
dt1$name[match(split, dt1$x)]
Is there a faster, more elegant way?
Here is a base R approach. We can create an x column in the first data table, dt, from the digits to the right of the final dash, converted to numeric so that its type matches dt1$x. Then we merge the two data tables on x and paste the matched name onto s in place of the trailing number.
dt$x <- as.numeric(sub(".*-", "", dt$s))
result <- merge(dt, dt1, by="x")
result$s <- paste0(sub("\\d+", "", result$s), result$name)
result$s
[1] "AA-AA-AA" "BB-BB-BB" "CC-CC-CC"
I'd take the straightforward approach:
dt1[dt[, .(x = as.integer(sub('.*-', '', s)), str = sub('[^-]+$', '', s))],
on = .(x), .(s = paste0(str, name))]
# s
#1: AA-AA-AA
#2: BB-BB-BB
#3: CC-CC-CC
base R, sprintf + sub
mapply(sprintf, sub("[^-]+$", "%s", dt$s), dt1$name)
# AA-AA-%s BB-BB-%s CC-CC-%s
# "AA-AA-AA" "BB-BB-BB" "CC-CC-CC"
I presumed that both data frames are in matching order (as they are in the example). If not, you need to match them first, for example:
mapply(sprintf, sub("-.?$", "-%s", dt$s), dt1$name[match(gsub("[^0-9]","", dt$s), dt1$x)])
Here is a slightly more general approach.
mapply(function(pat, repl, src){ sub(pat, repl, src) }, pat = dt1$x, repl = dt1$name, src = dt$s )
#[1] "AA-AA-AA" "BB-BB-BB" "CC-CC-CC"
If you say you always want to replace after last - (hyphen), then you can simplify as:
mapply(function(repl, src){ sub("(?<=-)[^-]+$", repl, src, perl = T) }, repl = dt1$name, src = dt$s )
Please note: my solution works only if dt and dt1 are ordered as in the example, meaning that the first rows correspond to each other, and so on. If this is not the case, consider combining Tim's approach (the merge) with my solution.
So here is a rock-solid solution using some of Tim's ideas:
dt <- data.table(s=c("AA-AA-1", "BB-BB-2", "CC-CC-3"))
dt1 <- data.table(x=3:1, name=c("CC", "BB", "AA")) # note the order is not right.
dt$x <- sub(".*-", "", dt$s)
dt <- merge.default(dt, dt1, by="x")
dt$endResult <- mapply(function(repl, src){ sub("(?<=-)[^-]+$", repl, src, perl = T) }, repl = dt$name, src = dt$s )
If they are sorted appropriately as in your example you can use stringr::str_replace:
library(stringr)
dt[,s := str_replace(s,as.character(dt1$x),dt1$name)]
dt
# s
# 1: AA-AA-AA
# 2: BB-BB-BB
# 3: CC-CC-CC
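Since several of the answers above depend on the two tables being in the same order, here is a small base R sketch that looks up each name by key with match, so row order does not matter. It uses plain vectors and a data frame as stand-ins for dt and dt1 (the names s, lookup, key and prefix are illustrative), and assumes the part after the last hyphen is always an integer.

```r
s <- c("AA-AA-1", "BB-BB-2", "CC-CC-3")
lookup <- data.frame(x = 3:1, name = c("CC", "BB", "AA"),   # deliberately out of order
                     stringsAsFactors = FALSE)

key    <- as.integer(sub(".*-", "", s))   # number after the last hyphen
prefix <- sub("[^-]+$", "", s)            # everything up to and including the last hyphen

# match() finds the row of lookup for each key, whatever the row order is.
replaced <- paste0(prefix, lookup$name[match(key, lookup$x)])
```

A key with no entry in lookup would produce a string ending in "NA", so it may be worth checking anyNA(match(key, lookup$x)) first.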

changing non character letters in selected variables

I've got a data frame with text. I'd like to change all "," to "-" in all observations of selected variables, and like to select the variables based on their names containing the word date.
I've tried to incorporate various variations of grep() expressions into MyFunc but haven't been able to get it to work.
Thanks!
starting point:
df <- data.frame(dateofbirth=c("25,06,1939","15,04,1941","21,06,1978","06,07,1946","14,07,1935"),recdate=c("26,06,1945","03,04,1964","21,06,1949","15,07,1923","07,12,1945"),b=c("8,ted,st","99,tes,rd","6,ldk,dr","2,sdd,jun","asd,2,st"),disdatenow=c("25,06,1975","25,05,1996","21,06,1932","26,07,1934","07,07,1965"), stringsAsFactors = FALSE)
desired outcome:
df <- data.frame(dateofbirth=c("25-06-1939","15-04-1941","21-06-1978","06-07-1946","14-07-1935"),recdate=c("26-06-1945","03-04-1964","21-06-1949","15-07-1923","07-12-1945"),b=c("8,ted,st","99,tes,rd","6,ldk,dr","2,sdd,jun","asd,2,st"),disdatenow=c("25-06-1975","25-05-1996","21-06-1932","26-07-1934","07-07-1965"), stringsAsFactors = FALSE)
Current code:
MyFunc <- function(x) {gsub(",","-",df$x)}
You can use mutate_at from dplyr:
library(dplyr)
df %>%
mutate_at(vars(contains("date")), function(x){gsub(",", "-", x)})
and that gives you this:
dateofbirth recdate b disdatenow
1 25-06-1939 26-06-1945 8,ted,st 25-06-1975
2 15-04-1941 03-04-1964 99,tes,rd 25-05-1996
3 21-06-1978 21-06-1949 6,ldk,dr 21-06-1932
4 06-07-1946 15-07-1923 2,sdd,jun 26-07-1934
5 14-07-1935 07-12-1945 asd,2,st 07-07-1965
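Using a corrected version of your function MyFunc (operating on its argument x rather than on df$x), this should also work with data.table:
MyFunc <- function(x) {gsub(",", "-", x)}
library(data.table)
setDT(df)
cols <- grep("date", names(df), value = TRUE)  # "dateofbirth" "recdate" "disdatenow"
df[, (cols) := lapply(.SD, MyFunc), .SDcols = cols]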
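For completeness, the same select-by-name-then-replace idea can be sketched in base R alone, with no extra packages. The two-row data frame below is a shortened version of the question's data, used only for illustration.

```r
df <- data.frame(dateofbirth = c("25,06,1939", "15,04,1941"),
                 recdate     = c("26,06,1945", "03,04,1964"),
                 b           = c("8,ted,st", "99,tes,rd"),
                 disdatenow  = c("25,06,1975", "25,05,1996"),
                 stringsAsFactors = FALSE)

# Pick the columns whose names contain "date".
date_cols <- grep("date", names(df), value = TRUE)

# Replace every literal comma with a hyphen in those columns only.
df[date_cols] <- lapply(df[date_cols], gsub,
                        pattern = ",", replacement = "-", fixed = TRUE)
```

Column b is left untouched because its name does not contain "date".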
