String split and merge in R

My dataset looks like this below
Id Col1
--------------------
133 Mary 7E
281 Feliz 2D
437 Albert 4C
What I am trying to do is take the first two characters of the first word in Col1 and the whole second word, and then merge them.
My final expected dataset should look like this below
Id Col1
--------------------
133 MA7E
281 FE2D
437 AL4C
Any suggestions on how to accomplish this is much appreciated.

You can do
my_data$Col1 <- sub("(\\w{2})(\\w* )(\\b\\w+\\b)", "\\1\\3", my_data$Col1)
my_data$Col1 <- toupper(my_data$Col1)
my_data
# Id Col1
# 1 133 MA7E
# 2 281 FE2D
# 3 437 AL4C
The parentheses delimit the three groups that are matched; only the first and the third are retained. \\w matches letters, digits and the underscore, and \\b matches a word boundary.
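To see what each group actually captures, the same pattern can be passed to regexec and regmatches. This is only an illustrative sketch: the vector x is a stand-in for my_data$Col1, filled with the sample values from the question.

```r
# Stand-in for my_data$Col1 (sample values taken from the question)
x <- c("Mary 7E", "Feliz 2D", "Albert 4C")
pattern <- "(\\w{2})(\\w* )(\\b\\w+\\b)"

# regexec() records the position of the full match and of each group;
# regmatches() converts those positions into the matched strings
groups <- regmatches(x, regexec(pattern, x))
groups[[1]]   # full match first, then groups 1, 2 and 3

# Keep groups 1 and 3 only, then upper-case the result
result <- toupper(sub(pattern, "\\1\\3", x))
result
```

Group 2 (the rest of the first word plus the space) is matched but left out of the replacement, which is why it disappears.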

We can also paste0 together the output of substr and str_split_fixed within a dplyr pipe chain:
library(dplyr)
library(stringr)
df <- data.frame(id = c(133,281,437),
Col1 = c("Mary 7E", "Feliz 2D", "Albert 4C"))
df %>%
# str_split_fixed returns a matrix, so [, 2] picks the second word of every row
mutate(Col1 = toupper(paste0(substr(Col1, 1, 2),
str_split_fixed(Col1, " ", 2)[, 2])))

You can do this in several steps. First split by space, then take the first two letters of the name and capitalize them. Paste that together with the second part. The result is in column final. You could keep all these intermediate steps or chain the commands into fewer statements, whatever floats your boat.
xy <- data.frame(id = c(133, 281, 437),
name = c("Mary 7E", "Feliz 2D", "Albert 4C"),
stringsAsFactors = FALSE)
xy$first <- sapply(strsplit(xy$name, " "), "[", 1)
xy$second <- sapply(strsplit(xy$name, " "), "[", 2)
xy$first_upper <- toupper(substr(x = xy$first, start = 1, stop = 2))
xy$final <- paste(xy$first_upper, xy$second, sep = "")
xy
id name first second first_upper final
1 133 Mary 7E Mary 7E MA MA7E
2 281 Feliz 2D Feliz 2D FE FE2D
3 437 Albert 4C Albert 4C AL AL4C

Here is another variation using sub. We can use lookarounds in Perl mode to selectively remove everything except for the first two, and last two, characters. Then, make a call to toupper() to capitalize all letters.
df$Col1 <- toupper(sub("(?<=^..).*(?=..$)", "", df$Col1, perl = TRUE))
df$Col1
[1] "MA7E" "FE2D" "AL4C"

Rather than a one-line solution, this version is easy to interpret and modify:
library(dplyr)
library(stringi)
xx_df <- data.frame(id = c(133,281,437),
Col1 = c("Mary 7E", "Feliz 2D", "Albert 4C"))
xx_df %>%
mutate(xpart1 = stri_split_fixed(Col1, " ", simplify = T)[,1]) %>%
mutate(xpart2 = stri_split_fixed(Col1, " ", simplify = T)[,2]) %>%
# first two characters of the first word plus the whole second word
mutate(Col1_new = paste0(substr(xpart1, 1, 2), xpart2)) %>%
select(id, Col1 = Col1_new) %>%
mutate(Col1 = toupper(Col1))
result is
id Col1
1 133 MA7E
2 281 FE2D
3 437 AL4C

This solution uses substr to take the first two and the last two characters of each string. Selecting the last two requires the string lengths from nchar (here via sapply). The pieces are joined with paste0, and toupper capitalizes the letters.
l2 <- sapply(df$Col1, function(x) nchar(x))
paste0(toupper(substr(df$Col1,1,2)), substr(df$Col1, l2-1, l2))
[1] "MA7E" "FE2D" "AL4C"
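Since nchar is already vectorised, the same idea can be wrapped in a small helper without sapply. The function name first2_last2 is made up for this sketch, and it assumes the second word is always exactly two characters long, as in the sample data.

```r
# Hypothetical helper: first two characters (upper-cased) + last two characters
first2_last2 <- function(x) {
  n <- nchar(x)   # vectorised: one length per string
  paste0(toupper(substr(x, 1, 2)), substr(x, n - 1, n))
}

codes <- first2_last2(c("Mary 7E", "Feliz 2D", "Albert 4C"))
codes
```

If the second word can vary in length, a split-based approach is the safer choice.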

Related

regex for mutate from substring

I'm still learning my way around regex, so help is much appreciated. I'm trying to extract the string at the beginning of the file name, as well as the last two characters inside the first pair of square brackets of "File" below, to generate the "Image" and "ID" variables with mutate, as shown in data.out.
data<- data.frame("File"= c("TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data",
"TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data"))
data.out<- data %>% data.frame("Image"= c("TA1317", "TA2654"), "ID" = c("2A", "3A"))
File Image ID
1 TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data TA1317 2A
2 TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data TA2654 3A
Another alternative is strcapture, with only one regex pattern instead of two:
out <- strcapture("^([^_]*).*?\\[[^,]*,([^,]*,[^,*])\\].*", data$File, list(Image = "", ID = ""))
out$ID <- gsub(",", "", out$ID, fixed = TRUE)
out
# Image ID
# 1 TA1317 2A
# 2 TA 2654 3A
cbind(data, out)
# File Image ID
# 1 TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data TA1317 2A
# 2 TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data TA 2654 3A
Within a dplyr pipe, you can still use it:
library(dplyr)
data %>%
bind_cols(strcapture("^([^_]*).*?\\[[^,]*,([^,]*,[^,*])\\].*", .$File, list(Image = "", ID = ""))) %>%
mutate(ID = gsub(",", "", ID, fixed = TRUE))
# File Image ID
# 1 TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data TA1317 2A
# 2 TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data TA 2654 3A
You can try :
transform(data, Image = sub('([A-Z0-9\\s]+)_.*', '\\1', File),
ID = sub('.*\\[.*(\\d+),([A-Z])\\].*', '\\1\\2', File))
# File Image ID
#1 TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data TA1317 2A
#2 TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data TA 2654 3A
Here Image captures one or more occurrences of A-Z, 0-9 or whitespace before the first underscore, and ID consists of a number followed by a comma and a letter just before a closing square bracket.
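For comparison, here is a sketch that avoids most of the regex by locating the first pair of square brackets literally. It assumes every file name contains at least one "[...]" group and that the wanted ID is the last two comma-separated items inside the first one; the variable names are illustrative only.

```r
files <- c("TA1317_Scan3_Core[1,2,A]_[7473,42737]_component_data",
           "TA 2654_Scan1_Core[1,3,A]_[6700,36673]_component_data")

# Image: drop everything from the first underscore on, then remove spaces
image <- gsub(" ", "", sub("_.*$", "", files), fixed = TRUE)

# ID: find the first "[" and "]" as literal characters
open_pos  <- as.integer(regexpr("[", files, fixed = TRUE))
close_pos <- as.integer(regexpr("]", files, fixed = TRUE))
inside <- substr(files, open_pos + 1, close_pos - 1)

# Keep the last two comma-separated pieces and glue them together
id <- vapply(strsplit(inside, ",", fixed = TRUE),
             function(p) paste(tail(p, 2), collapse = ""),
             character(1))

data.frame(File = files, Image = image, ID = id)
```

Removing the space gives "TA2654" for the second file, which is what the expected output in the question shows.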

Replace lowercase in names, not in surnames

I have a problem with a database of people's names. I want to abbreviate the first names but not the last names. The last name is separated from the first name by a comma, and the different people are separated from each other by a semicolon, as in this example:
Michael, Jordan; Bird, Larry;
If the name is a single word, the code would be like this:
breve$autor <- str_replace_all(breve$autor, "[:lower:]{1,}\\;", ".\\;")
Result with this code:
Michael, J.; Bird, L.;
The problem is in compound names. With this code, the name:
Jordan, Michael Larry;
It would be:
Jordan, Michael L.;
Could someone tell me how to remove all lowercase letters that are between the comma and the semicolon, so that it looks like this:
Jordan, M.L.;
Here is another solution:
x1 <- 'Michael, Jordan; Bird, Larry;'
x2 <- 'Jordan, Michael Larry;'
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x1, perl = TRUE)
# [1] "Michael, J.; Bird, L.;"
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE)
# [1] "Jordan, M. L.;"
Surnames are followed by a comma, while the parts of the given names are followed by a space or a semicolon. Here I use (?=[ ;]) to make sure that the character following the matched pattern is a space or a semicolon.
To remove the space between M. and L., an additional step is needed:
gsub('\\. ', '.', gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE))
# [1] "Jordan, M.L.;"
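The two gsub calls can probably be folded into one by letting the pattern consume the space between given names. The following is a sketch under the assumption that surnames are single words (a surname such as "De Niro" would be abbreviated too); abbreviate_given is a hypothetical name.

```r
# Hypothetical helper: abbreviate every given name in one pass.
# A name part is either followed by " " + capital letter (the space is consumed)
# or sits directly before ";" (the semicolon is only looked at, not consumed).
abbreviate_given <- function(x) {
  gsub("([A-Z])[a-z]+(?: (?=[A-Z])|(?=;))", "\\1.", x, perl = TRUE)
}

x <- c("Michael, Jordan; Bird, Larry;", "Jordan, Michael Larry;")
short <- abbreviate_given(x)
short
```

Surnames are left alone because they are followed by a comma, which neither alternative accepts.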
There must be a regular expression that will do this, of course. But that magic is a little beyond me. So here is an approach with simple string manipulation in a data frame using tidyverse functions.
library(stringr)
library(dplyr)
library(tidyr)
ballers <- "Michael, Jordan; Bird, Larry;"
mj <- "Jordan, Michael Larry"
c(ballers, mj) %>%
#split the players
str_split(., ";", simplify = TRUE) %>%
# remove white space
str_trim() %>%
#transpose to get players in a column
t %>%
#split again into last name and first + middle (if any)
str_split(",", simplify = TRUE) %>%
# convert to a tibble
as_tibble() %>%
# remove more white space
mutate(V2=str_trim(V2)) %>%
# remove empty rows (these can be avoided by different manipulation upstream)
filter(!V1 == "") %>%
# name the columns
rename("Last"=V1, "First_two"=V2) %>%
# separate the given names into first and middle (if any)
separate(First_two, into=c("First", "Middle"), sep=" ") %>%
# abbreviate to first letter
mutate(First_i=abbreviate(First, 1)) %>%
# abbreviate, but take into account that middle name might be missing
mutate(Middle_i=ifelse(!is.na(Middle), paste0(abbreviate(Middle, 1), "."), "")) %>%
# combine the First and middle initals
mutate(Initials=paste(First_i, Middle_i, sep=".")) %>%
# make the desired Last, F.M. vector
mutate(Final=paste(Last, Initials, sep=", "))
# A tibble: 3 x 7
Last First Middle First_i Middle_i Initials Final
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Michael Jordan NA J "" J. Michael, J.
2 Jordan Michael Larry M L. M.L. Jordan, M.L.
3 Bird Larry NA L "" L. Bird, L.
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 3].
Much longer than a regex.
There will probably be a better way to do this, but I managed to get it to work using the stringr and tibble packages (plus dplyr for pull()).
library(stringr)
library(tibble)
library(dplyr)
names <- 'Jordan, Michael; Bird, Larry; Obama, Barack; Bush, George Walker'
df <- as_tibble(str_split(unlist(str_split(names, '; ')), ', ', simplify = TRUE))
df[, 2] <- gsub('[a-z]+', '.', pull(df[, 2]))
This code generates the tibble df, which has the following contents:
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 Jordan M.
2 Bird L.
3 Obama B.
4 Bush G. W.
The names are first split into last and first names and stored in a data frame so that the gsub() call does not operate on the last names. Then, gsub() replaces each run of lowercase letters in the first names with a single .
Then, you can call str_c(str_c(pull(df[, 1]), ', ', pull(df[, 2])), collapse = '; ') (or str_c(pull(unite(df, full, c('V1', 'V2'), sep = ', ')), collapse = '; ') if you already have the tidyr package loaded) to return the string "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W.".
...also, did you mean Michael Jordan, not Jordan Michael? lol
Here's one that uses gsub twice. The inner one is for names with no middle names and the outer is for names that have a middle name.
x = c("Michael, Jordan; Jordan, Michael Larry; Bird, Larry;")
gsub(", ([A-Z])[a-z]+ ([A-Z])[a-z]+;", ", \\1.\\2.;", gsub(", ([A-Z])[a-z]+;", ", \\1.;", x))
#[1] "Michael, J.; Jordan, M.L.; Bird, L.;"

How to split a string (in a column) using 2 different conditions into 2 separate columns and only keep those 2 columns?

I have a column of strings that's like this:
|Image
|---
|CR 00_01_01
|SF 45_04_07
|etc.
I want to get an end result of this:
| Condition | Time |
| --- | --- |
| CR | 00 |
I have 2 steps of doing this but it's very cumbersome. Essentially, I split the string twice first using space and second using _.
df <- df[, c("Condition","T") := tstrsplit(Image, " ", fixed=T)]
df <- df[, c("Time") := tstrsplit(T, "_", fixed=TRUE, keep = 1L)]
Is there any better way of doing this?
Here is a strsplit solution that sounds like it is what you are looking for. Split based on space or underscore and select first two elements.
split_string <- strsplit(df1$Image, split = "\\s|_")
data.frame(Condition = sapply(split_string, `[`, 1),
Time = sapply(split_string, `[`, 2))
Condition Time
1 CR 00
2 SF 45
If the format of the Image column is always the same, you could extract based on position.
data.frame(Condition = substr(df1$Image, 1, 2),
Time = substr(df1$Image, 4, 5))
Condition Time
1 CR 00
2 SF 45
Or you could just use regex to extract the letters / first pair of numbers.
data.frame(Condition = gsub("^([[:alpha:]]+).*", "\\1", df1$Image),
Time = gsub(".*[[:space:]]([[:digit:]]+)_.*", "\\1", df1$Image))
Condition Time
1 CR 00
2 SF 45
Data:
df1 <- data.frame(Image = c("CR 00_01_01", "SF 45_04_07"), stringsAsFactors = F)
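Another base R option is strcapture from utils, which extracts both pieces with a single pattern and returns them as named columns. This is a sketch that assumes every value has the shape letters, whitespace, digits, underscore; images is a stand-in for df1$Image.

```r
# Stand-in for df1$Image
images <- c("CR 00_01_01", "SF 45_04_07")

# One capture group per output column; proto supplies column names and types
parts <- strcapture("^([[:alpha:]]+)[[:space:]]+([[:digit:]]+)_",
                    images,
                    proto = data.frame(Condition = character(),
                                       Time = character(),
                                       stringsAsFactors = FALSE))
parts
```

Time is kept as character here so that the leading zero in "00" is preserved; use a numeric column in proto if numbers are wanted instead.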
You can try this using dplyr and tidyr
df%>%separate(image,c("Image","Time")," ")%>%
mutate(Time=sub("([0-9]+).*","\\1",Time))
Image Time
1 CR 00
2 SF 45
Data
structure(list(image = c("CR 00_01_01", "SF 45_04_07")), class = "data.frame", row.names = c(NA,
-2L))

Splitting a column in a data frame by an nth instance of a character

I have a dataframe with several columns, and one of those columns is populated by pipes "|" and information that I am trying to obtain.
For example:
View(Table$Column)
"|1||KK|12|Gold||4K|"
"|1||Rst|E|Silver||13||"
"|1||RST|E|Silver||18||"
"|1||KK|Y|Iron|y|12||"
"|1||||Copper|Cpr|||E"
"|1||||Iron|||12|F"
And so on for about 120K rows.
What I am trying to extract is everything between the 5th pipe and the 6th pipe in this series, but in its own column vector, so the end result looks like this:
View(Extracted)
Gold
Silver
Silver
Iron
Copper
Iron
I don't want to use RegEx. My tools are only limited to R here. Would you guys happen to have any advice how to overcome this?
Thank you.
1) Assuming Table as defined reproducibly in the Note at the end, use read.table as shown. No regular expressions or packages are used.
read.table(text = Table$Column, sep = "|", header = FALSE,
as.is = TRUE, fill = TRUE)[6]
giving:
V6
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
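With 120K rows of real data, it may be worth switching off quote and comment handling so that a stray ' or # inside a field is not interpreted by read.table. A sketch of that variant, using the sample strings directly (the vector name Column is a stand-in for Table$Column):

```r
Column <- c("|1||KK|12|Gold||4K|",
            "|1||Rst|E|Silver||13||",
            "|1||RST|E|Silver||18||",
            "|1||KK|Y|Iron|y|12||",
            "|1||||Copper|Cpr|||E",
            "|1||||Iron|||12|F")

# quote = "" and comment.char = "" make read.table treat ', " and # as ordinary text
fields <- read.table(text = Column, sep = "|", header = FALSE,
                     quote = "", comment.char = "",
                     fill = TRUE, stringsAsFactors = FALSE)

# The leading pipe creates an empty first field, so the text after the
# 5th pipe is column 6
commodity <- fields[[6]]
commodity
```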
2) This alternative does use a regular expression (which the question asked not to) but just in case here is a tidyr solution. Note that it requires tidyr 0.8.2 or later since earlier versions of tidyr did not support NA in the into= argument.
library(dplyr)
library(tidyr)
Table %>%
separate(Column, into = c(rep(NA, 5), "commodity"), sep = "\\|", extra = "drop")
giving:
commodity
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
3) This is another base solution. It is probably not the one you want given that (1) is so much simpler but I wanted to see if we could come up with a second approach in base that did not use regexes. Note that if the split= argument of strsplit is "" then it is treated specially and so is not a regex. It creates a list each of whose components is a vector of single characters. Each such vector is passed to the anonymous function which labels | and the characters in the field after it with its ordinal number. We then take the characters corresponding to 5 (except the first as it is |) and collapse them together using paste.
data.frame(commodities = sapply(strsplit(Table$Column, ""), function(chars) {
wx <- which(cumsum(chars == "|") == 5)
paste(chars[seq(wx[2], tail(wx, 1))], collapse = "")
}), stringsAsFactors = FALSE)
giving:
commodities
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
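The cumsum labelling in (3) is easier to follow on a single string. The sketch below uses a hypothetical helper, field_after_nth_pipe, and drops the pipe by testing for it rather than by position, so an empty field simply yields an empty string.

```r
# Hypothetical helper: text between the n-th and (n+1)-th pipe of one string
field_after_nth_pipe <- function(s, n) {
  chars <- strsplit(s, "")[[1]]      # split = "" gives single characters, no regex
  field_id <- cumsum(chars == "|")   # each pipe and the text after it share a number
  paste(chars[field_id == n & chars != "|"], collapse = "")
}

one <- field_after_nth_pipe("|1||KK|12|Gold||4K|", 5)
empty <- field_after_nth_pipe("|1||KK|12|Gold||4K|", 2)

# Applied to several strings at once
many <- vapply(c("|1||KK|12|Gold||4K|", "|1||||Copper|Cpr|||E"),
               field_after_nth_pipe, character(1), n = 5, USE.NAMES = FALSE)
```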
Note
Table <- data.frame(Column = c("|1||KK|12|Gold||4K|",
"|1||Rst|E|Silver||13||",
"|1||RST|E|Silver||18||",
"|1||KK|Y|Iron|y|12||",
"|1||||Copper|Cpr|||E",
"|1||||Iron|||12|F"), stringsAsFactors = FALSE)
You can try this:
df <- data.frame(x = c("|1||KK|12|Gold||4K|", "|1||Rst|E|Silver||13||"), stringsAsFactors = FALSE)
library(stringr)
stringr::str_split(df$x, "\\|", simplify = TRUE)[, 6]
1) We can use strsplit from base R on the delimiter | and extract the 6th element from the list of vectors
sapply(strsplit(Table$Column, "|", fixed = TRUE), `[`, 6)
#[1] "Gold" "Silver" "Silver" "Iron" "Copper" "Iron"
2) Or using regex (again from base R), use sub to extract the 6th word
sub("^([|][^|]+){4}[|]([^|]*).*", "\\2",
gsub("(?<=[|])(?=[|])", "and", Table$Column, perl = TRUE))
#[1] "Gold" "Silver" "Silver" "Iron" "Copper" "Iron"
data
Table <- structure(list(Column = c("|1||KK|12|Gold||4K|",
"|1||Rst|E|Silver||13||",
"|1||RST|E|Silver||18||", "|1||KK|Y|Iron|y|12||", "|1||||Copper|Cpr|||E",
"|1||||Iron|||12|F")), class = "data.frame", row.names = c(NA,
-6L))
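A short sketch of why fixed = TRUE matters in the strsplit approach: without it, "|" is read as the regex alternation operator rather than as a literal pipe. With it, the leading pipe produces an empty first piece, so the wanted value is the 6th element. The vector x is a small stand-in for Table$Column.

```r
x <- c("|1||KK|12|Gold||4K|", "|1||||Copper|Cpr|||E")

# fixed = TRUE splits on the literal "|" character
pieces <- strsplit(x, "|", fixed = TRUE)
pieces[[1]]   # note the empty strings for the leading pipe and for "||"

# vapply() guarantees a character vector; rows with fewer than 6 pieces give NA
sixth <- vapply(pieces, function(p) p[6], character(1))
sixth
```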

Control text justification

I am trying to create a space-delimited input file for another program. I'm pasting together the contents of multiple columns and having problems when the numbers have different lengths, due to what appears to be a default right-justification in R. For example:
row_id monthly_spend
123 4.55
567 24.64
678 123.09
becomes :
row_id:123 monthly_spend: 4.55
row_id:567 monthly_spend: 24.64
row_id:678 monthly_spend:123.09
while what I need is this:
row_id:123 monthly_spend:4.55
row_id:567 monthly_spend:24.64
row_id:678 monthly_spend:123.09
the code I'm using is derived from this question here and looks like this:
paste(row_id, monthly_spend, sep=":", collapse=" ")
I've tried formatting the columns as numeric or integer without any change.
Any suggestions?
if you put your vectors into a data.frame (if they are not already)
you can use:
apply(sapply(names(myDF), function(x)
paste(x, myDF[, x], sep=":") ), 1, paste, collapse=" ")
# [1] "row_id:123 monthly_spend:4.55"
# [2] "row_id:567 monthly_spend:24.64"
# [3] "row_id:678 monthly_spend:123.09"
or alternatively:
do.call(paste, lapply(names(myDF), function(x) paste0(x, ":", myDF[, x])))
sprintf is also an option. You've got many ways of going about it
sample data used:
myDF <- read.table(header=TRUE, text=
"row_id monthly_spend
123 4.55
567 24.64
678 123.09")
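The sprintf option mentioned above could look like the following sketch. It assumes row_id is an integer column and relies on %s converting the numeric monthly_spend values to text without padding; the data frame is rebuilt inline so the example runs on its own.

```r
myDF <- data.frame(row_id = c(123L, 567L, 678L),
                   monthly_spend = c(4.55, 24.64, 123.09))

# %d for the integer id, %s for the spend so no padding or fixed width is applied
out <- sprintf("row_id:%d monthly_spend:%s", myDF$row_id, myDF$monthly_spend)
out

# writeLines(out, "someFileName") would write one record per line
```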
With your data snippet:
df <- read.table(text = "row_id monthly_spend
123 4.55
567 24.64
678 123.09", header = TRUE)
Then we can paste together but employ the format function with trim = TRUE to take care of stripping the spaces you don't want:
with(df, paste("row_id:", row_id,
"monthly_spend:", format(monthly_spend, trim = TRUE)))
Which gives:
> with(df, paste("row_id:", row_id,
+ "monthly_spend:", format(monthly_spend, trim = TRUE)))
[1] "row_id: 123 monthly_spend: 4.55" "row_id: 567 monthly_spend: 24.64"
[3] "row_id: 678 monthly_spend: 123.09"
If you need this in a data frame before writing out to file, use:
newdf <- with(df, data.frame(foo = paste("row_id:", row_id,
"monthly_spend:",
format(monthly_spend, trim = TRUE))))
newdf
> newdf
foo
1 row_id: 123 monthly_spend: 4.55
2 row_id: 567 monthly_spend: 24.64
3 row_id: 678 monthly_spend: 123.09
When you write this out, the columns will be justified as you want.
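To make the effect of trim visible, and to drop the space that paste puts after each colon, a small sketch with the same numbers (the vectors are recreated here so the example is self-contained):

```r
row_id <- c(123L, 567L, 678L)
monthly_spend <- c(4.55, 24.64, 123.09)

padded  <- format(monthly_spend)               # common width, left-padded with spaces
trimmed <- format(monthly_spend, trim = TRUE)  # same formatting, padding removed

# paste0() adds no separator, so nothing follows the colons
records <- paste0("row_id:", row_id, " monthly_spend:", trimmed)
records
```

Note that format still applies a common number of decimal places to the whole vector, so a value such as 4.5 would be written as 4.50 alongside 24.64.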
Here is a general answer (any number of variables), assuming your data is in a data.frame dat:
x <- mapply(names(dat), dat, FUN = paste, sep = ":")
write.table(x, file = stdout(),
quote = FALSE, row.names = FALSE, col.names = FALSE)
And you can replace stdout() with a filename.
assuming the data frame is called df
write.table(as.data.frame(sapply(1:ncol(df),FUN=function(x)paste(rep(colnames(df)[x],nrow(df)),df[,x],sep=":"))),"someFileName",row.names=FALSE,col.names=FALSE,sep=" ");
equivalent to following substeps:
# generating the column separated records
df_cp<-sapply(1:ncol(df),FUN=function(x)paste(rep(colnames(df)[x],nrow(df)),df[,x],sep=":"));
### casting to data frame
df_cp<-as.data.frame(df_cp);
### writing out to disk
write.table(df_cp,"someFileName",row.names=FALSE,col.names=FALSE,sep=" ");
