Split a dataframe into multiple columns in R

My data frame is as follows:
User
JohnLenon03041965
RogerFederer12021954
RickLandsman01041975
and I am trying to get the output as
Name Lastname Birthdate
John Lenon 03041965
Roger Federer 12021954
Rick Landsman 01041975
I tried the following code:
a <- gsub('([[:upper:]])', ' \\1', df$User)
a <- as.data.frame(a)
library(tidyr)
a <- separate(a, a, into = c("Name", "Last"), sep = " (?=[^ ]+$)")
I get the following:
Name Last
John Lenon03041965
Roger Federer12021954
Rick Landsman01041975
I then tried a separator such as (?=[0-9]) to split off the digits, but I get this error:
c <-separate(c, c, into = c("last", "date"), sep = '(?=[0-9])')
Error in if (!after) c(values, x) else if (after >= lengx) c(x, values) else c(x[1L:after], : argument is of length zero

We can use a regex lookaround as sep, splitting either between a lower case letter and an upper case letter ((?<=[a-z])(?=[A-Z])) or (|) between a lower case letter and a digit ((?<=[a-z])(?=[0-9]+)).
library(dplyr)
library(tidyr)
df1 %>%
separate(User, into = c("Name", "LastName", "Birthdate"),
sep = "(?<=[a-z])(?=[A-Z])|(?<=[a-z])(?=[0-9]+)")
# Name LastName Birthdate
#1 John Lenon 03041965
#2 Roger Federer 12021954
#3 Rick Landsman 01041975
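For readers without tidyr, the same boundary logic can be sketched in base R. This is only an illustration, assuming every value follows the pattern in the question (a lower case letter followed by an upper case letter or a digit at each boundary). The names users, spaced and res, and the choice of a space as delimiter, are assumptions made for this sketch and are not part of the answer above. It inserts a space at each zero-width boundary with gsub(perl = TRUE) and then splits on that space:

```r
# Example input copied from the question
users <- c("JohnLenon03041965", "RogerFederer12021954", "RickLandsman01041975")

# Insert a space wherever a lower case letter is directly followed by
# an upper case letter or a digit (zero-width lookbehind + lookahead)
spaced <- gsub("(?<=[a-z])(?=[A-Z0-9])", " ", users, perl = TRUE)

# Split on the inserted space and bind the pieces row by row
parts <- do.call(rbind, strsplit(spaced, " ", fixed = TRUE))

res <- data.frame(
  Name = parts[, 1],
  LastName = parts[, 2],
  Birthdate = parts[, 3],   # kept as text so leading zeros survive
  stringsAsFactors = FALSE
)
print(res)
```

This only works when each value yields exactly three pieces; a name with an internal capital (for example "McDowell") would produce an extra piece and break the rbind step.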
Another option is extract, which captures characters as groups by placing each pattern inside parentheses ((...)). Here, the 1st capture group matches an upper case letter followed by one or more lower case letters (([A-Z][a-z]+)) from the start (^) of the string, the 2nd captures one or more characters that are not digits (([^0-9]+)), and the 3rd captures the rest of the characters ((.*)).
df1 %>%
extract(User, into = c("Name", "LastName", "Birthdate"),
"^([A-Z][a-z]+)([^0-9]+)(.*)")
# Name LastName Birthdate
#1 John Lenon 03041965
#2 Roger Federer 12021954
#3 Rick Landsman 01041975
data
df1 <- structure(list(User = c("JohnLenon03041965", "RogerFederer12021954",
"RickLandsman01041975")), .Names = "User", class = "data.frame", row.names = c(NA,
-3L))
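The capture-group idea used with extract can also be sketched in base R with utils::strcapture, which fills a prototype data frame from the groups of a regex. The column names mirror the answer; the Parsed column and the day-month-year reading of the digits are assumptions of this sketch, since the question does not say how the date is encoded.

```r
# Example input copied from the question
users <- c("JohnLenon03041965", "RogerFederer12021954", "RickLandsman01041975")

# Prototype: one character column per capture group
proto <- data.frame(
  Name = character(),
  LastName = character(),
  Birthdate = character(),
  stringsAsFactors = FALSE
)

# Same three groups as in the extract() call above
res <- strcapture("^([A-Z][a-z]+)([^0-9]+)(.*)$", users, proto)

# ASSUMPTION: the eight digits are ddmmyyyy; adjust the format if they are mmddyyyy
res$Parsed <- as.Date(res$Birthdate, format = "%d%m%Y")
print(res)
```

Rows that do not match the pattern come back as NA, which makes malformed input easy to spot.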

Related

Removing all characters before and after text in R, then creating columns from the new text

So I have a string that I'm attempting to parse through and then create 3 columns with the data I extract. From what I've seen, stringr doesn't really cover this case and the gsub I've used so far is excessive and involves me making multiple columns, parsing from those new columns, and then removing them and that seems really inefficient.
The format is this:
"blah, grabbed by ???-??-?????."
I need this:
???-??-?????
I've used placeholders here, but this is how the string typically looks
"blah, grabbed by PHI-80-J.Matthews."
or
"blah, grabbed by NE-5-J.Mills."
and sometimes there is text after the name like this:
"blah, grabbed by KC-10-T.Hill. Blah blah blah."
This is what I would like the end result to be:
Place Number Name
PHI 80 J.Matthews
NE 5 J.Mills
KC 10 T.Hill
Edit for further explanation:
Most strings include other people in the same format, so "grabbed by" needs to be incorporated in some way to make sure it is grabbing the right name.
Ex.
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
Desired Output:
Place Number Name
KC 10 T.Hill
This solution simply extracts the components based on the logic the OP mentioned, i.e. it captures the needed characters as three groups: 1) one or more upper case letters ([A-Z]+) followed by a dash (-), 2) one or more digits (\\d+) followed by another dash, and 3) the non-whitespace characters (\\S+) after that dash, up to the period that follows the name.
library(tidyr)
extract(df1, col1, into = c("Place", "Number", "Name"),
".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*", convert = TRUE)
-output
# A tibble: 4 x 3
Place Number Name
<chr> <int> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
4 KC 10 T.Hill
Or do this in base R
read.table(text = sub(".*grabbed by\\s((\\w+-){2}\\S+)\\..*", "\\1",
df1$col1), header = FALSE, col.names = c("Place", "Number", "Name"), sep='-')
Place Number Name
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
data
df1 <- structure(list(col1 = c("blah, grabbed by PHI-80-J.Matthews.",
"blah, grabbed by NE-5-J.Mills.", "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
This solution actually does what you say in the title, namely first remove the text around the target substring, then split it into columns:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(col1 = str_extract(col1, "\\w+-\\w+-\\w\\.\\w+")) %>%
separate(col1,
into = c("Place", "Number", "Name"),
sep = "-")
# A tibble: 3 x 3
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
Here, we make use of the fact that the character class \\w is for letters irrespective of case and for digits (and also for the underscore).
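Because the pattern \\w+-\\w+-\\w\\.\\w+ carries no context, it returns the first matching token in a string, which for the longer example in the question would be OAK-4-D.Carr and not the player after "grabbed by". One hedged way around this is a lookbehind that anchors the match to "grabbed by ". The sketch below uses base R so it has no package dependencies; the names plays, token and res are illustrative only.

```r
# Example strings taken from the question, including the multi-player case
plays <- c(
  "blah, grabbed by PHI-80-J.Matthews.",
  "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
  "Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)

# Only accept a token that is immediately preceded by "grabbed by "
pattern <- "(?<=grabbed by )\\w+-\\w+-\\w\\.\\w+"

# Note: regmatches() silently drops strings without a match
token <- regmatches(plays, regexpr(pattern, plays, perl = TRUE))

# Split the token on "-" into the three desired columns
res <- read.table(
  text = token, sep = "-",
  col.names = c("Place", "Number", "Name"),
  stringsAsFactors = FALSE
)
print(res)
```

If some rows may lack a "grabbed by" token, it is safer to check the result of regexpr for -1 before calling regmatches, so that rows stay aligned with the original data.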
Here is an alternative that first uses sub with the regex "([A-Za-z]+\\.[A-Za-z]+).*" and the replacement "\\1" to drop everything after the first initial-dot-surname token, then uses separate to split the string at " by ", and finally uses separate again to get the desired columns.
library(dplyr)
library(tidyr)
df1 %>%
mutate(test1 = sub("([A-Za-z]+\\.[A-Za-z]+).*", "\\1", col1)) %>%
separate(test1, c('remove', 'keep'), sep = " by ") %>%
separate(keep, c("Place", "Number", "Name"), sep = "-") %>%
select(Place, Number, Name)
Output:
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill

new column with paste0 in R

I am looking for a function that lets me add a new column, called ID, holding the ID that corresponds to each word. That is:
I have a list of words with their IDs:
car = 9112
red = 9512
employee = 6117
sky = 2324
words<- c("car", "sky", "red", "employee", "domestic")
match<- c("car", "red", "domestic", "employee", "sky")
The comparison is made against an Excel file that I read in: if a value equals an entry in my vector words, the word is replaced with its ID; otherwise the original word is left as is.
x10<- c(words)# string
words.corpus <- c(L4$`match`) # pattern
idwords.corpus <- c(L4$`ID`) # replace
words.corpus <- paste0("\\A",idwords.corpus, "\\z|\\A", words.corpus,"\\z")
vect.corpus <- idwords.corpus
names(vect.corpus) <- words.corpus
data15 <- str_replace_all(x10, vect.corpus)
result:
data15:
" 9112", "2324", "9512", "6117", "employee"
What I'm looking for is to add a new column with the ID, instead of replacing the word with the ID
words ID
car 9112
red 9512
employee 6117
sky 2324
domestic domestic
I'd use data.table for fast lookup based on the fixed words value. While it's not 100% clear what you are asking for, it sounds like you want to replace words with an index value if there is a match, or leave the word as a word if not. This code will do that:
library("data.table")
# associate your ids with fixed word matches in a keyed lookup table
ids <- data.table(
word = c("car", "red", "employee", "sky"),
ID = c(9112, 9512, 6117, 2324)
)
setkey(ids, word)
# this is what you would read in
data <- data.table(
word = c("car", "sky", "red", "employee", "domestic", "sky")
)
setkey(data, word)
data <- ids[data]
# replace NAs from no match with word
data[, ID := ifelse(is.na(ID), word, ID)]
data
## word ID
## 1: car 9112
## 2: domestic domestic
## 3: employee 6117
## 4: red 9512
## 5: sky 2324
## 6: sky 2324
Here the "domestic" is not matched so it remains as the word in the ID column. I also repeated "sky" to show how this will work for every instance of a word.
If you want to preserve the original sort order, you could create an index variable before the merge, and then reorder the output by that index variable.
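If keeping the original order matters and data.table is not required, a base R sketch with match() avoids the reorder step altogether, because match() returns positions in the order of the input. The vector names ids and words and the result columns are assumptions for illustration; the values are the ones from the question.

```r
# Lookup table from the question as a named vector: word -> ID
ids <- c(car = 9112, red = 9512, employee = 6117, sky = 2324)

# Words to look up, in their original order
words <- c("car", "sky", "red", "employee", "domestic")

# Position of each word in the lookup table (NA when there is no match)
hit <- match(words, names(ids))

# Use the ID where one exists, otherwise fall back to the word itself;
# the column is character because it mixes IDs and words
result <- data.frame(
  words = words,
  ID = ifelse(is.na(hit), words, as.character(ids[hit])),
  stringsAsFactors = FALSE
)
print(result)
```

Since the ID column ends up as text either way, it may be cleaner to keep unmatched rows as NA and only substitute the word at display time.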

Splitting column with differing syntax in R

I am having some trouble cleaning up my data. It consists of a list of sold houses. It is made up of the sell price, no. of rooms, m2 and the address.
As seen below the address is in one string.
head(DF, 3)
Address Price m2 Rooms
Petersvej 1772900 Hoersholm 10.000 210 5
Annasvej 2B2900 Hoersholm 15.000 230 4
Krænsvej 125800 Lyngby C 10.000 210 5
A Mivs Alle 119800 Hjoerring 1.300 70 3
The syntax for the address column is: road name, road no., followed by a 4-digit postal code and the city name (sometimes two words).
I also need to extract the postal code. I have been looking at the 'stringi' package but haven't been able to find any examples.
Any pointers are very much appreciated.
1) Use separate from tidyr to split Address into 3 fields, merging anything left over into the last one, and then use separate again to split the last 4 digits off the Number column generated by the first separate.
library(dplyr)
library(tidyr)
DF %>%
separate(Address, into = c("Road", "Number", "City"), extra = "merge") %>%
separate(Number, into = c("StreetNo", "Postal"), sep = -4)
giving:
Road StreetNo Postal City Price m2 Rooms CITY
1 Petersvej 77 2900 Hoersholm 10 210 5 Hoersholm
2 Annasvej 121B 2900 Hoersholm 15 230 4 Hoersholm
3 Krænsvej 12 5800 Lyngby C 10 210 5 C
2) Alternately, insert commas between the subfields of Address and then use separate to split the subfields out. It gives the same result as (1) on the input shown in the Note below.
DF %>%
mutate(Address = sub("(\\S.*) +(\\S+)(\\d{4}) +(.*)", "\\1,\\2,\\3,\\4", Address)) %>%
separate(Address, into = c("Road", "Number", "Postal", "City"), sep = ",")
Note
The input DF in reproducible form is:
DF <-
structure(list(Address = structure(c(3L, 1L, 2L), .Label = c("Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C", "Petersvej 772900 Hoersholm"), class = "factor"),
Price = c(10, 15, 10), m2 = c(210L, 230L, 210L), Rooms = c(5L,
4L, 5L), CITY = structure(c(2L, 2L, 1L), .Label = c("C",
"Hoersholm"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
Update
Added and fixed (2).
Check out the cSplit function from the splitstackshape package
library(splitstackshape)
df_new <- cSplit(df, splitCols = "Address", sep = " ")
#This will split your address column into 4 different columns split at the space
#you can then add an ifelse block to combine the last 2 columns to make up the city like
df_new$City <- ifelse(is.na(df_new$Address_4), as.character(df_new$Address_3), paste(df_new$Address_3, df_new$Address_4, sep = " "))
One way to do this is with regex.
In this instance you may use a simple regular expression which will match all alphabetical characters and space characters which lead to the end of the string, then trim the whitespace off.
library(stringr)
DF <- data.frame(Address=c("Petersvej 772900 Hoersholm",
"Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C"))
DF$CITY <- str_trim(str_extract(DF$Address, "[a-zA-Z ]+$"))
This will give you the following output:
Address CITY
1 Petersvej 772900 Hoersholm Hoersholm
2 Annasvej 121B2900 Hoersholm Hoersholm
3 Krænsvej 125800 Lyngby C Lyngby C
In R, the stringr package is convenient for regex work because functions such as str_match return each capture group separately, which in this example could allow you to separate each component of the address with one expression.
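The single-expression idea can be sketched as follows. To stay free of package dependencies this uses base R's utils::strcapture in place of stringr, and it assumes every address has the shape "road name, street number directly followed by a 4-digit postal code, then the city". The column names and the variable addr are illustrative choices for this sketch.

```r
# Addresses from the thread, including a multi-word road and a two-word city
addr <- c(
  "Petersvej 772900 Hoersholm",
  "Annasvej 121B2900 Hoersholm",
  "Kr\u00e6nsvej 125800 Lyngby C",
  "A Mivs Alle 119800 Hjoerring"
)

# One character column per capture group
proto <- data.frame(
  Road = character(), StreetNo = character(),
  Postal = character(), City = character(),
  stringsAsFactors = FALSE
)

# Groups: road (greedy), street number, last 4 digits before the space, city
pattern <- "^(.*) +(\\S+)([0-9]{4}) +(.*)$"

res <- strcapture(pattern, addr, proto, perl = TRUE)
print(res)
```

Postal is kept as text here so that any leading zeros would be preserved; convert it with as.integer only if that is not a concern.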

Using gsub across columns

I have some data:
testData <- tibble(fname = c("Alice", "Bob", "Charlie", "Dan", "Eric"),
lname = c("Smith", "West", "CharlieBlack", "DanMcDowell", "Bush"))
A few of the last names have first names concatenated to them.
What is an effective way to go through and fix the lname column?
I want it to look like this:
lname = c("Smith", "West", "Black", "McDowell", "Bush")
I can use a for loop but I have half a million rows of data so I'd like to find a more efficient method.
We can use str_remove
library(tidyverse)
testData %>%
mutate(lname = str_remove(lname, fname))
# A tibble: 5 x 2
# fname lname
# <chr> <chr>
#1 Alice Smith
#2 Bob West
#3 Charlie Black
#4 Dan McDowell
#5 Eric Bush
We can use gsub within apply:
apply(testData,1,function(x) gsub(x['fname'],"",x['lname']))
Output:
[1] "Smith" "West" "Black" "McDowell" "Bush"
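Since apply works row by row, it may be slow on half a million rows. A vectorized base R sketch that avoids regex entirely is shown below; it strips fname only when lname actually starts with it, which is a slightly stricter rule than removing the first occurrence anywhere in the string. The variable names has_prefix and fixed are illustrative.

```r
# Columns from the question as plain vectors
fname <- c("Alice", "Bob", "Charlie", "Dan", "Eric")
lname <- c("Smith", "West", "CharlieBlack", "DanMcDowell", "Bush")

# TRUE where the last name begins with the first name (compared element-wise)
has_prefix <- startsWith(lname, fname)

# Drop the first nchar(fname) characters only for those rows
fixed <- ifelse(has_prefix, substring(lname, nchar(fname) + 1), lname)
print(fixed)
```

Because startsWith compares literal text, first names containing regex metacharacters (for example a period) need no escaping.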
Try mutate with an ifelse clause to catch the lname entries that are concatenated, e.g.:
library(dplyr)
testData <- testData %>% mutate(lname = ifelse(grepl('[[:upper:]][[:lower:]]+[[:upper:]]', lname), gsub('^[[:upper:]][[:lower:]]+', "", lname), lname))
In this example, you are saying "mutate lname IF the string has an uppercase letter + at least one lowercase letter + an uppercase letter. If that condition is met, replace the first uppercase letter and following lowercase letters with nothing. If that condition is not met, just keep the original lname text".

R - how to loop through a dataframe to match multiple substrings - concatenate all matches in a new column

I am quite new to R - have worked on this all day but am out of ideas.
I have a dataframe with long descriptions in one column, eg:
df:
ID Name Description
1 A ABC DEF
2 B ARS XUY
3 C ASD
And I have a vector of search terms:
ABC
ARS
XUY
DE
I would like to go through each row in the dataframe and search the Description for any of the search terms. I then want all matches to be concatenated in a new column in the dataframe, e.g.:
ID Name Description Matches
1 A ABC DEF ABC
2 B ARS XUY ARS;XUY
3 C ASD
I would want to search ~100k rows with 1000 search terms.
Does anyone have any ideas? I was able to get a matrix with sapply and grepl, but I'd rather have a concatenated solution.
One option using strsplit and %in% instead of regex:
df$Matches <- sapply(strsplit(as.character(df$Description), '\\s'),
function(x){paste(search[search %in% x], collapse = ';')})
df
# ID Name Description Matches
# 1 1 A ABC DEF ABC
# 2 2 B ARS XUY ARS;XUY
# 3 3 C ASD
data:
search <- c("ABC", "ARS", "XUY", "DE")
df <- structure(list(ID = 1:3, Name = structure(1:3, .Label = c("A",
"B", "C"), class = "factor"), Description = structure(1:3, .Label = c("ABC DEF",
"ARS XUY", "ASD"), class = "factor"), Matches = c("ABC", "ARS;XUY",
"")), .Names = c("ID", "Name", "Description", "Matches"), row.names = c(NA,
-3L), class = "data.frame")
Another option, which I tried to use in the comments, is to use the stringr package. There are two potential downsides to this approach: 1) it uses regex, and 2) it returns the search term matched instead of the value found.
library(stringr)
df = data.frame(Name=LETTERS[1:3],
Description=c("ABC DEF", "ARS XUY", "ASD"),
stringsAsFactors=F)
search_terms = c("ABC", "ARS", "XUY", "DE")
regex = paste(search_terms, collapse="|")
df$Matches = sapply(str_extract_all(df$Description, regex), function(x) paste(x, collapse=";"))
df
# Name Description Matches
# (chr) (chr) (chr)
# 1 A ABC DEF ABC;DE
# 2 B ARS XUY ARS;XUY
# 3 C ASD
With that being said, I think Alistaire's solution is the better approach since it doesn't use regex.
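If a single combined regex is still preferred, the partial match on "DE" inside "DEF" shown in the output above can be avoided by wrapping the alternation in word boundaries. This is a base R sketch, assuming the search terms contain no regex metacharacters (they would need escaping otherwise); the names desc, terms and matches are illustrative.

```r
# Descriptions and search terms from the question
desc <- c("ABC DEF", "ARS XUY", "ASD")
terms <- c("ABC", "ARS", "XUY", "DE")

# ASSUMPTION: terms are plain text; \\b ... \\b restricts matches to whole words
pattern <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")

# All whole-word matches per description (character(0) when there are none)
hits <- regmatches(desc, gregexpr(pattern, desc, perl = TRUE))

# Collapse each set of matches; an empty set becomes ""
matches <- vapply(hits, function(x) paste(unique(x), collapse = ";"), character(1))
print(matches)
```

One scan per description with a combined pattern tends to scale better than one grepl call per search term, though with around 1000 terms the pattern becomes long, so it is worth timing both on real data.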
Here's an alternative:
df <- data.frame(ID=c(1L,2L,3L),Name=c('A','B','C'),Description=c('ABC DEF','ARS XUY','ASD'),stringsAsFactors=F);
st <- c('ABC','ARS','XUY','DE');
df$Matches <- apply(sapply(paste0('\\b',st,'\\b'),grepl,df$Description),1L,function(m) paste(collapse=';',st[m]));
df;
## ID Name Description Matches
## 1 1 A ABC DEF ABC
## 2 2 B ARS XUY ARS;XUY
## 3 3 C ASD
