I have a data frame with a column consisting of words separated by a varying number of spaces, for example:
head(lst)
'fff fffd ddd'
'sss dd'
'de dd'
'dds sssd eew rrr'
'dsds eed'
What I would like to have is 2 columns:
The first column is the part before the first space
And the second column is the part after the last space,
meaning it should look like this:
V1 V2
'fff' 'ddd'
'sss' 'dd'
'de' 'dd'
'dds' 'rrr'
'dsds' 'eed'
I am able to get the first column, but the second one is a problem.
This is the code I use:
lst <- strsplit(athletes.df$V1, "\\s+")
v1 <- sapply(lst, `[`, 1)
v2 <- sapply(lst, `[`, 2)
What I get for column v2 is the second word. I know it's because I put 2 inside the sapply call. How do I tell it to take only what comes after the last space?
You can use tail to grab the last entry of each vector:
lst <- strsplit(athletes.df$V1, "\\s+")
v1 <- sapply(lst, head, 1) # example with head to grab first vector element
v2 <- sapply(lst, tail, 1) # example with tail to grab last vector element
Or perhaps the vapply version since you know your return type should be a character vector:
v2 <- vapply(lst, tail, 1, FUN.VALUE = character(1))
Another approach would be to modify your strsplit split criterion to something like the following, where you split on a space that can optionally be followed by one or more characters up to a final space.
strsplit(df$V1, "\\s(?:.+\\s)?")
#[[1]]
#[1] "fff" "ddd"
#
#[[2]]
#[1] "sss" "dd"
#
#[[3]]
#[1] "de" "dd"
#
#[[4]]
#[1] "dds" "rrr"
#
#[[5]]
#[1] "dsds" "eed"
As Sumedh points out, this regex works nicely with tidyr's separate:
tidyr::separate(df, V1, c("V1", "V2"), "\\s(?:.+\\s)?")
# V1 V2
#1 fff ddd
#2 sss dd
#3 de dd
#4 dds rrr
#5 dsds eed
Two stringi based approaches:
library(stringi)
v1 <- stri_extract_first_regex(df$V1, "\\S+")
v2 <- stri_extract_last_regex(df$V1, "\\S+")
Or
stri_extract_all_regex(df$V1, "^\\S+|\\S+$", simplify = TRUE)
# this variant explicitly checks for the spaces with lookarounds:
stri_extract_all_regex(df$V1, "^\\S+(?=\\s)|(?<=\\s)\\S+$", simplify = TRUE)
Maybe this?
lst <- strsplit(athletes.df$V1, "\\s+")
v1 <- sapply(lst, `[`, 1)
v2 <- sapply(lst, function(x) x[length(x)])
Or
data.frame(t(sapply(strsplit(athletes.df$V1, "\\s+"),
function(x) c(x[1], x[length(x)]))))
Without using any packages, this can be done with read.table after creating a delimiter using sub.
read.table(text=sub("^(\\S+)\\s+.*\\s+(\\S+)$", "\\1 \\2", df1$V1),
header=FALSE, stringsAsFactors= FALSE)
# V1 V2
#1 fff ddd
#2 sss dd
#3 de dd
#4 dds rrr
#5 dsds eed
Another convenient option is word from stringr
library(stringr)
transform(df1, V1 = word(V1, 1), V2 = word(V1, -1))
# V1 V2
#1 fff ddd
#2 sss dd
#3 de dd
#4 dds rrr
#5 dsds eed
data
df1 <- structure(list(V1 = c("fff fffd ddd", "sss dd", "de dd",
"dds sssd eew rrr",
"dsds eed")), .Names = "V1", class = "data.frame", row.names = c(NA,
-5L))
Related
I'm trying to loop through a column and remove, from the start of each row, any characters that fall under my predefined set of strings.
Reproducible Example
df <- data.frame(serial = 1:3, name = c("Javier", "Kenneth", "Kasey"))
serial name
1 1 Javier
2 2 Kenneth
3 3 Kasey
Condition Vector
Remove these strings from the front of name only!
vec <- c("Ja", "Ka")
Intended Output
serial name
1 1 vier
2 2 Kenneth
3 3 sey
We could create a pattern by pasting vec into a single regular expression and remove its occurrences using sub.
df$name <- sub(paste0("^", vec, collapse = "|"), "", df$name)
df
# serial name
#1 1 vier
#2 2 Kenneth
#3 3 sey
With stringr we can also use str_remove:
stringr::str_remove(df$name, paste0("^", vec, collapse = "|"))
#[1] "vier" "Kenneth" "sey"
Since we're using fixed-length vec strings in this example, it might even be more efficient to use substr replacements. This will only really pay off when df and/or vec is large, though, and it comes at the price of some flexibility.
df$name <- as.character(df$name)
sel <- substr(df$name, 1, 2) %in% vec
df$name[sel] <- substr(df$name, 3, nchar(df$name))[sel]
# serial name
#1 1 vier
#2 2 Kenneth
#3 3 sey
We can also do this with substring
library(stringr)
library(tidyr)  # replace_na() comes from tidyr
df$name <- substring(df$name, replace_na(str_locate(df$name,
                     paste(vec, collapse = "|"))[, 2] + 1, 1))
df$name
#[1] "vier" "Kenneth" "sey"
Or with str_replace
str_replace(df$name, paste0("^", vec, collapse="|"), "")
#[1] "vier" "Kenneth" "sey"
Or using gsubfn
library(gsubfn)
gsubfn("^.{2}", setNames(rep(list(""), length(vec)), vec), as.character(df$name))
#[1] "vier" "Kenneth" "sey"
I have a vector
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
which generally follows the pattern of alternating between two different sequences of strings (the first sequence being all alphabetical, the second being numerical with the symbol #).
However there are cases where no # term appears: so in the above between mp and jq, and then again after ez. I would like to define a function which "fills the gaps" with the character string #, so that I would have the output:
[1] "ab" "#4" "gw" "#29" "mp" "#" "jq" "#35" "ez" "#"
which I would then convert to a data frame
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
My attempt so far is rather clunky and relies on looping through the vector and filling the gaps. I'd be interested to see more elegant solutions.
My Solution
greplSpace <- function(pattern, replacement, x){
  j <- 1
  while( j < length(x) ){
    if( grepl(pattern, x[j+1]) ){
      j <- j+2
    } else {
      x <- c( x[1:j], replacement, x[(j+1):length(x)] )
      j <- j+2
    }
  }
  if( !grepl(pattern, tail(x, 1)) ){ x <- c(x, replacement) }
  return(x)
}
library(magrittr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
vec %>% greplSpace("#", "#", .) %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame
Starting with your vec, we can create your expected data frame directly with some functions from dplyr, tidyr, and stringr.
library(dplyr)
library(tidyr)
library(stringr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
dat <- data_frame(Value = vec)
dat2 <- dat %>%
  mutate(String = !str_detect(vec, "#"),
         Key = ifelse(String, "V1", "V2"),
         Row = cumsum(String)) %>%
  select(-String) %>%
  spread(Key, Value, fill = "#") %>%
  select(-Row)
dat2
# # A tibble: 5 x 2
# V1 V2
# <chr> <chr>
# 1 ab #4
# 2 gw #29
# 3 mp #
# 4 jq #35
# 5 ez #
Here is a base R option with split. Create a logical index by checking for "#" in each of the strings, take the cumulative sum, and split the original vector by this grouping variable into a list ('lst'). List elements that don't have two (the maximum length) elements are padded with NA at the end by assignment with length<-. Then rbind the list elements into a two-column matrix. If needed, convert those NA to "#".
lst <- split(vec, cumsum(!grepl("#", vec)))
out <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
out[,2][is.na(out[,2])] <- "#" #not recommended though
out
# [,1] [,2]
#1 "ab" "#4"
#2 "gw" "#29"
#3 "mp" "#"
#4 "jq" "#35"
#5 "ez" "#"
Wrap it with as.data.frame if we need a data.frame output
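For completeness, a minimal sketch of that last step, using the out matrix built above (the V1/V2 names are only an assumption to mirror the question's desired output):
out_df <- as.data.frame(out, stringsAsFactors = FALSE)
names(out_df) <- c("V1", "V2")
out_df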
You can use base R:
First collapse the vector into a string while replacing # where needed.
Then just read it in using read.csv.
vec1=gsub("([a-z]),\\s*([a-z])|$","\\1,#,\\2",toString(vec))
read.csv(text=gsub("(#.*?),","\\1\n",vec1),h=F)
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
Explanation:
First collapse the vector into a single string with toString.
Then, if there are letters on either side of a comma (i.e. [a-z],\s*[a-z]) or at the end of the string (i.e. |$), insert a #.
Then create line breaks after the numbers or # and read the data in as a table, as sketched below.
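To make those steps concrete, here is a rough sketch of the intermediate strings for the vec from the question (the commented values are worked out by hand, so treat them as illustrative):
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
# Step 1: collapse the vector into one comma-separated string
toString(vec)
# "ab, #4, gw, #29, mp, jq, #35, ez"
# Step 2: insert "#" between two alphabetical items and after the last one
vec1 <- gsub("([a-z]),\\s*([a-z])|$", "\\1,#,\\2", toString(vec))
# "ab, #4, gw, #29, mp,#,jq, #35, ez,#,"
# Step 3: turn the comma after each "#..." token into a line break and
# read the result back in as a two-column table
read.csv(text = gsub("(#.*?),", "\\1\n", vec1), h = FALSE)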
You can also do:
a=read.csv(h=F,text=toString(sub("([a-z]+)","\n\\1",vec)),na=c(" ",""))[1:2]
a
V1 V2
1 ab #4
2 gw #29
3 mp <NA>
4 jq #35
5 ez <NA>
data.frame(replace(as.matrix(a),is.na(a),"#"))
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
Another base possibility:
do.call(rbind, tapply(vec, cumsum(!grepl("^#", vec)), FUN = function(x){
  if(length(x) == 1) c(x, "#") else x}))
# [,1] [,2]
# 1 "ab" "#4"
# 2 "gw" "#29"
# 3 "mp" "#"
# 4 "jq" "#35"
# 5 "ez" "#"
Explanation:
Check if elements in vec start with #, and negate it: !grepl("^#", vec); this creates a logical vector.
Create a grouping variable by applying cumsum to the logical vector (note: 1 & 2 similar to #akrun).
Use tapply to apply a function to each subset of vec, defined by the grouping variable. Check if the length is 1. If so, pad by a trailing #, else just return the subset: if(length(x) == 1) c(x, "#") else x
Bind the resulting list together by row: do.call(rbind, ...). The grouping step is illustrated in the sketch below.
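As a small sketch of that grouping step (values for the example vec, worked out by hand):
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
!grepl("^#", vec)
# TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
cumsum(!grepl("^#", vec))
# 1 1 2 2 3 4 4 5  -> each element that does not start with "#" opens a new group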
Another one:
# create a row index
ri <- cumsum(!grepl("^#", vec))
# create a column index
ci <- ave(ri, ri, FUN = seq_along)
# create an empty matrix of desired dimensions
m <- matrix(nrow = max(ri), ncol = 2)
# assign 'vec' to matrix at relevant indices
m[cbind(ri, ci)] <- vec
# replace NA with '#'
m[is.na(m)] <- "#"
Using data.table. Create a grouping variable as above, and reshape from long to wide.
library(data.table)
d <- data.table(vec)
d[ , g := cumsum(!grepl("^#", vec))]
dcast(d, g ~ rowid(g), value.var = "vec", fill = "#")
# g 1 2
# 1: 1 ab #4
# 2: 2 gw #29
# 3: 3 mp #
# 4: 4 jq #35
# 5: 5 ez #
I have a column with alphanumeric strings separated by '->', and I'm trying to split them into columns.
Df:
column e
1. asd1->ref2
2. fde4 ->fre4
3. dfgt-fgr ->frt5
4. ftr5 -> lkh-oiut
5. rey6->usre-lynng->usre-lkiujh->kiuj-bunny
6. dge1->fgt4->okiuj-dfet
Desired output
   col 1   col 2
1. asd1    ref2
2. fde4    fre4
3.         frt5
4. ftr5
5. rey6
6. dge1    fgt4
I tried out <- strsplit(as.character(Df$column e), '_->_') with no output, and used str_extract(m1$column e, "(?<=\\[)[[:alnum:]]") -> m1$column f, and also strsplit(as.character(Df$column e), ' -> ', fixed=T)[[1]][[1]], but I'm not getting the desired output.
The column is of integer type and all are capital letters (I'm not sure if this is important).
Here is one way with tidyverse
library(tidyverse)
df1 %>%
  separate(columne, into = c('col1', 'col2'), sep = "->", extra = 'drop') %>%
  mutate_all(funs(replace(., str_detect(., '-'), "")))
#  col1 col2
#1 asd1 ref2
#2 fde4 fre4
#3      frt5
#4 ftr5
#5 rey6
#6 dge1 fgt4
A base R solution as well, though a fair bit less concise than #akrun's tidyverse one:
# split as appropriate
out <- strsplit( as.character( Df$column.e ), '->' )
out <- lapply( out, function(x) {
  # I assume you don't want the white space
  y <- trimws( x )
  # take the first two "columns"
  y <- y[1:2]
  # remove any items containing a hyphen
  y[ grepl( "-", y ) ] <- ""
  y
}
)
# then bind it all rowwise
out <- do.call( rbind, out )
data.frame( out )
    X1   X2
1 asd1 ref2
2 fde4 fre4
3      frt5
4 ftr5
5 rey6
6 dge1 fgt4
I have the following dataset
df <-data.frame(fact=c("a,bo,v", "c,b,v,d", "c"))
I wish to select the last two items for each row. So, ideally I wish to have this output:
fact
1 bo,v
2 v,d
3 c
I've tried to split the rows and then choose the last two items:
spl <- strsplit(as.character(df$fact), split = ",")
tail(spl[[1]], n=2)
But this does not give me the correct results.
You can do this:
lapply(lapply(strsplit(as.character(df$fact), split = ','),
              function(x) x[c(length(x) - 1, length(x))]),
       paste, collapse = ',')
You split the column and then extract the elements at indices n-1 and n, i.e. the last two. Then paste them together.
You can generalise this by doing:
lapply(strsplit(as.character(df$fact), split = ','), function(x) x[(length(x)-n):length(x)] )
where n is the number of backward steps you want to take, as in the sketch below.
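For instance, a minimal sketch with the df from the question (here n = 1, i.e. one step back from the last element, which keeps the last two items):
n <- 1  # one backward step from the last element keeps the last two items
lapply(strsplit(as.character(df$fact), split = ','),
       function(x) x[(length(x) - n):length(x)])
# [[1]] "bo" "v"    [[2]] "v" "d"    [[3]] "c"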
Using tail is even simpler.
lapply(strsplit(as.character(df$fact), split = ','), tail, n=2)
We can use sapply to loop over every element of fact, split it on ",", and then select the last n elements using tail:
n <- 2
sapply(as.character(df$fact), function(x) {
  temp = unlist(strsplit(x, ','))
  tail(temp, n)
}, USE.NAMES = F)
#[[1]]
#[1] "bo" "v"
#[[2]]
#[1] "v" "d"
#[[3]]
#[1] "c"
A better option, I feel, is dplyr with rowwise:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(last_two = paste0(tail(unlist(strsplit(as.character(fact), ",")), n),
                           collapse = ","))
# fact last_two
# <fctr> <chr>
#1 a,bo,v bo,v
#2 c,b,v,d v,d
#3 c c
In an R project, I want to extract strings from a data frame in which a column looks like:
"A|B|C"
"B|Z"
"I|P"
...
I want to have a new data frame with a column containing A B C Z I P.
I thought of doing it with a for loop and gsub, but it is not easy because the pattern extracts the |, and I am not sure that is the best or most elegant way to do this kind of task.
With a combination of strsplit, unlist, and unique you can do:
#Steps:
#1) split each element of column with separator as "|"
#2) combine output for all items with unlist
#3) retain unique elements of those
vec = c("A|B|C","B|Z","I|P")
newDF = data.frame(newCol = unique(unlist(lapply(vec,function(x) unlist(strsplit(x,"[|]")) ))),
stringsAsFactors = FALSE)
newDF$newCol
#[1] "A" "B" "C" "Z" "I" "P"
Starting with the data frame df, with base R we can try the following:
data.frame(col=unique(unlist(strsplit(as.character(df$col), split='\\|'))))
# col
#1 A
#2 B
#3 C
#4 Z
#5 I
#6 P
or with dplyr (plus tidyr for unnest):
df %>%
mutate(col = strsplit(col, "\\|")) %>%
unnest(col) %>% unique
# col
# (chr)
#1 A
#2 B
#3 C
#4 Z
#5 I
#6 P
data
df <- data.frame(col=c("A|B|C",
"B|Z",
"I|P"), stringsAsFactors = FALSE)
If you want them to be the names of the columns, try this:
symbols <- unique(unlist(strsplit(as.character(df$col), split='\\|')))
df <- data.frame(matrix(vector(), 0, length(symbols),
dimnames=list(c(), symbols)), stringsAsFactors=F)
df
#[1] A B C Z I P
#<0 rows> (or 0-length row.names)
We can use cSplit
library(splitstackshape)
unique(cSplit(df1, "V1", "|", "long"), by = "V1")
data
df1 <- data.frame(V1 = c("A|B|C","B|Z","I|P"))
The scan function with the text parameter appears suited for this task:
st <- c("A|B|C","B|Z","I|P")
scan(text=st, what="", sep="|")
Read 7 items
[1] "A" "B" "C" "B" "Z" "I" "P"
It wasn't clear to me from your problem description or example how you wanted this to be aligned with the original 3-row data frame.