Parsing names in mixed formats using R

I have a list of names in mixed formats, which I would like to separate into columns containing first and last names in R. An example dataset:
Names <- c("Mary Smith","Hernandez, Maria","Bonds, Ed","Michael Jones")
The goal is to assemble a dataframe that contains names in a format like this:
FirstNames <- c("Mary","Maria","Ed","Michael")
LastNames <- c("Smith","Hernandez","Bonds","Jones")
FinalData <- data.frame (FirstNames, LastNames)
I tried a few approaches to select either the First or Last name based on whether the names are separated by a space only versus comma-space. For instance I wanted to use regular expressions in gsub to copy first names from rows in which a comma-space separates the names:
FirstNames2 <- gsub (".*,\\s","",Names)
This worked for rows that contained names in the LastName, FirstName format, but gsub collected the entire contents in rows with names in FirstName LastName format.
How would you tackle this problem? Thanks to all in advance!

Here is a one-liner. The pattern tries Firstname lastname first and if that fails it tries lastname, firstname. No packages are used.
read.table(text = sub("(\\w+) (\\w+)|(\\w+), (\\w+)", "\\1\\4 \\2\\3", Names), as.is=TRUE)
giving:
V1 V2
1 Mary Smith
2 Maria Hernandez
3 Ed Bonds
4 Michael Jones
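
As a small variation on the one-liner above, the rearranged strings can be inspected before parsing, and read.table can be given column names directly. This is only a sketch: the intermediate name Swapped is my own, and the column names are borrowed from the question rather than from the answer.

```r
Names <- c("Mary Smith", "Hernandez, Maria", "Bonds, Ed", "Michael Jones")

# Groups 1/2 capture "First Last"; groups 3/4 capture "Last, First".
# The groups of the alternative that did not match are empty, so
# "\\1\\4 \\2\\3" always yields "First Last".
Swapped <- sub("(\\w+) (\\w+)|(\\w+), (\\w+)", "\\1\\4 \\2\\3", Names)
print(Swapped)

# Parse the normalised strings and name the columns in one step.
FinalData <- read.table(text = Swapped,
                        col.names = c("FirstNames", "LastNames"),
                        as.is = TRUE)
print(FinalData)
```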

You could rearrange the "Last, First" entries into "First Last" order and then just use strsplit.
FirstNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 1)
LastNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 2)
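
The same rearrange-then-split idea can be sketched so that the substitution runs only once and the result lands directly in a data frame. The helper names Rearranged and Parts are illustrative, and the sketch assumes every name has exactly two words.

```r
Names <- c("Mary Smith", "Hernandez, Maria", "Bonds, Ed", "Michael Jones")

# Rewrite "Last, First" as "First Last"; strings without a comma are unchanged.
Rearranged <- sub("^(\\w+), (\\w+)$", "\\2 \\1", Names)

# Split once on the single space and pick the first and second pieces.
Parts <- strsplit(Rearranged, " ", fixed = TRUE)
FinalData <- data.frame(
  FirstNames = vapply(Parts, `[`, character(1), 1),
  LastNames  = vapply(Parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
print(FinalData)
```

vapply is used instead of sapply only so that a malformed entry fails loudly rather than silently changing the result type.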

# Split on either ", " or " ", then swap the pieces for rows that contained a comma
temp <- strsplit(x = Names, split = "(, | )")
do.call(rbind, lapply(seq_along(temp), function(i) {
  if (grepl(pattern = ", ", x = Names[i])) {
    data.frame(F = temp[[i]][2], L = temp[[i]][1])  # "Last, First"
  } else {
    data.frame(F = temp[[i]][1], L = temp[[i]][2])  # "First Last"
  }
}))
# F L
#1 Mary Smith
#2 Maria Hernandez
#3 Ed Bonds
#4 Michael Jones

Related

Remove first n words and take count

I have a dataframe with a text column. I need to drop the first 2 words of each string and then count how often each remaining string occurs in that column.
b <- data.frame(text = c("hello sunitha what can I do for you?",
"hi john what can I do for you?"))
Expected output in dataframe 'b': how can we remove the first 2 words, so that the count of 'what can I do for you?' is 2?
You can use gsub to remove the first two words and then tapply and count, i.e.
i1 <- gsub("^\\w*\\s*\\w*\\s*", "", b$text)
tapply(i1, i1, length)
#what can I do for you?
# 2
If you need to remove any range of words, we can amend i1 as follows,
i1 <- sapply(strsplit(as.character(b$text), ' '), function(i)paste(i[-c(2:4)], collapse = ' '))
tapply(i1, i1, length)
#hello I do for you? hi I do for you?
# 1 1
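
As an alternative sketch of the first approach (drop two leading words, then count), table() can do the counting and return the result as a data frame. The names trimmed and counts are my own, and the pattern assumes words are separated by whitespace.

```r
b <- data.frame(text = c("hello sunitha what can I do for you?",
                         "hi john what can I do for you?"),
                stringsAsFactors = FALSE)

# Remove the first two whitespace-delimited words from each string.
trimmed <- sub("^\\s*\\S+\\s+\\S+\\s+", "", b$text)

# Count how often each remaining string occurs.
counts <- as.data.frame(table(text = trimmed), stringsAsFactors = FALSE)
print(counts)
```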
library(magrittr)  # provides the %>% pipe used below
b=data.frame(text=c("hello sunitha what can I do for you?","hi john what can I do for you?"),stringsAsFactors = FALSE)
b$processed = sapply(b$text, function(x) (strsplit(x," ")[[1]]%>%.[-c(1:2)])%>%paste0(.,collapse=" "))
b$count = sapply(b$processed, function(x) length(strsplit(x," ")[[1]]))
> b
text processed count
1 hello sunitha what can I do for you? what can I do for you? 6
2 hi john what can I do for you? what can I do for you? 6
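Are you looking for something like this? Note that count here is the number of words left in each processed string. Watch out for stringsAsFactors = FALSE: without it, R versions before 4.0.0 store the text column as a factor, which is harder to work with.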

how to match string from the input and read it as number?

I take a string given as input, match it against a table, and want to return the corresponding number as output.
Example:
my.string <- readline(prompt="Enter string 1: ")
print(my.string)
#[1] "coffee"
DataFrame:
ID name
22 coffee
23 tea
24 milk
Task :
I need to convert 'coffee' into 22 because I want to pass this number into a function.
How do I do that?
If your data.frame has only 2 columns, a named vector is an option worth considering. It provides an easy way to subset.
It seems that the name column is unique in your data.frame. If so, you can convert the name column to rownames, which makes selection much simpler.
#change the 'name' to rownames
rownames(df) <- df$name
df$name <- NULL
#Now selection will be much easier
df[my.string, "ID"]
#22
Using "named vector":
itemVector <- c(22,23,24)
names(itemVector) <- c("coffee", "tea", "milk")
itemVector[my.string]
# coffee
# 22
Data:
df <- read.table(text =
"ID name
22 coffee
23 tea
24 milk",
header = TRUE, stringsAsFactors = FALSE)
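
The named vector does not have to be typed by hand; as a rough sketch it can be built from the data.frame itself with setNames(). The names lookup and name_to_id are hypothetical, and the sketch assumes the name column has no duplicates.

```r
df <- data.frame(ID = c(22, 23, 24),
                 name = c("coffee", "tea", "milk"),
                 stringsAsFactors = FALSE)

# Build the lookup once: values are the IDs, names are the item names.
lookup <- setNames(df$ID, df$name)

# Hypothetical helper: returns the ID, or NA if the name is unknown.
name_to_id <- function(x) unname(lookup[x])

print(name_to_id("coffee"))
print(name_to_id("juice"))
```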
Like my comment said, this is rather straightforward.
my.string <- readline(prompt="Enter string 1: ")
inx <- which(DataFrame$name == my.string)
DataFrame$ID[inx]
#[1] 22
Data.
DataFrame <- read.table(text = "
ID name
22 coffee
23 tea
24 milk
", header = TRUE)
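
One thing to keep in mind with which() is that it returns an empty vector when nothing matches, so the lookup quietly yields a zero-length result. A possible variation, sketched here with a hard-coded string standing in for the readline() input, uses match(), which returns NA for an unknown name and is easy to check.

```r
DataFrame <- data.frame(ID = c(22L, 23L, 24L),
                        name = c("coffee", "tea", "milk"),
                        stringsAsFactors = FALSE)

my.string <- "coffee"  # stands in for the value read with readline()

# match() gives the row position of the first match, or NA if there is none.
id <- DataFrame$ID[match(my.string, DataFrame$name)]
if (is.na(id)) stop("no ID found for '", my.string, "'")
print(id)
```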

reading comma-separated strings with read.csv()

I am trying to load a comma-delimited data file that also has commas in one of its text columns. The following sample code generates such a file, 'test.csv', which I'll load using read.csv() to illustrate my problem.
> d <- data.frame(name = c("John Smith", "Smith, John"), age = c(34, 34))
> d
name age
1 John Smith 34
2 Smith, John 34
> write.csv(d, file = "test.csv", quote = F, row.names = F)
> d2 <- read.csv("test.csv")
> d2
name age
John Smith 34 NA
Smith John 34
Because of the ',' in Smith, John, d2 is not assigned correctly. How do I read the file so that d2 looks exactly like d?
Thanks.
1) read.pattern. read.pattern (in the gsubfn package) can read such files:
library(gsubfn)
pat <- "(.*),(.*)"
read.pattern("test.csv", pattern = pat, header = TRUE, as.is = TRUE)
giving:
name age
1 John Smith 34
2 Smith, John 34
2) Two pass. Another possibility is to read it in, fix it up and then re-read it. This uses no packages and gives the same output.
L <- readLines("test.csv")
read.table(text = sub("(.*),", "\\1|", L), header = TRUE, sep = "|", as.is = TRUE)
Note: For 3 fields, where the text field containing commas comes first and the other two fields follow it, use this in (1):
pat <- "(.*),([^,]+),([^,]+)"
For the same situation, use this in (2), assuming that there are non-spaces adjacent to each of the last two commas, at least one space adjacent to any comma in the text field, and that fields have at least 2 characters:
text = gsub("(\\S),(\\S)", "\\1|\\2", L)
If you have some other arrangement just modify the regular expression in (1) appropriately and the sub or gsub in (2).
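
If you control how the file is written, the problem can be avoided at the source: with write.csv's default quote = TRUE, character fields are wrapped in double quotes and read.csv reads the embedded comma back correctly. A minimal sketch, using a temporary file (tmp is my own name) in place of test.csv:

```r
d <- data.frame(name = c("John Smith", "Smith, John"),
                age = c(34, 34),
                stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")

# Leave quote at its default (TRUE) so the name field is written in quotes.
write.csv(d, file = tmp, row.names = FALSE)

d2 <- read.csv(tmp, stringsAsFactors = FALSE)
print(d2)

unlink(tmp)  # clean up the temporary file
```

This does not help when the unquoted file comes from somewhere else; in that case the regular-expression approaches above still apply.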

Merging two data frames with columns with certain patterns in strings

(I have been stuck on this problem for the past two days, so if it already has an answer on SO, please bear with me.)
I have two data frames A and B. I want to merge them on Name column. Suppose, A has two columns Name and Numbers. The Name column of A df has values ".tony.x.rds", ".tom.x.rds" and so on.
Name Numbers
.tony.x.rds 15.6
.tom.x.rds 14.5
The B df has two columns Name and ChaR. The Name column of B has values "tony.x","tom.x" and so on.
Name ChaR
tony.x ENG
tom.x US
The main element in column Name of both dfs is "tony", "tom" and so on.
So, ".tony.x.rds" is equal to "tony.x" and ".tom.x.rds" is equal to "tom.x".
I have tried gsub with various options, leaving me with "tony", "tom", and so on in column Name of both A and B data frames. But when I use
StoRe<-merge(A,B, all=T)
I get all the rows of A and B rather than a single row per name. That is, there are two rows for each of "tony", "tom" and so on, with their respective values in the Numbers and ChaR columns. For example:
Name Numbers ChaR
tony 15.6 NA
tony NULL ENG
tom 14.5 NA
tom NULL US
It has been giving me a splitting headache. Any help would be appreciated.
One possible solution. I am not completely sure what you want to do with the 'x' in the strings, so I have kept it in the linkage key; changing \\1\\2 to \\1 in both calls keeps only the first part (e.g. 'tony').
a <- data.frame(
Name = paste0(".", c("tony", "tom", "foo", "bar", "foobar"), ".x.rds"),
Numbers = rnorm(5)
)
b <- data.frame(
Name = paste0(c("tony", "tom", "bar", "foobar", "company"), ".x"),
ChaR = LETTERS[11:15]
)
# String consists of 'point letter1 point letter2 point rds'; replace by
# 'letter1 letter2'
a$Name_stand <- gsub("^\\.([a-z]+)\\.([a-z]+)\\.rds$", "\\1\\2", a$Name)
# String consists of 'letter1 point letter2'; replace by 'letter1 letter2'
b$Name_stand <- gsub("^([a-z]+)\\.([a-z]+)$", "\\1\\2", b$Name)
result <- merge(a, b, all = TRUE, by = "Name_stand")
Output:
#> result
# Name_stand Name.x Numbers Name.y ChaR
#1 barx .bar.x.rds 1.38072696 bar.x M
#2 companyx <NA> NA company.x O
#3 foobarx .foobar.x.rds -1.53076596 foobar.x N
#4 foox .foo.x.rds 1.40829287 <NA> <NA>
#5 tomx .tom.x.rds -0.01204651 tom.x L
#6 tonyx .tony.x.rds 0.34159406 tony.x K
Another approach, perhaps somewhat more robust to variations of the strings (such as 'tom.rds' and 'tom', which will still be linked; this can of course also be a disadvantage):
# Remove the rds from a$Name
a$Name_stand <- gsub("rds$" , "", a$Name)
# Remove all non alpha numeric characters from the strings
a$Name_stand <- gsub("[^[:alnum:]]", "", a$Name_stand)
b$Name_stand <- gsub("[^[:alnum:]]", "", b$Name)
result2 <- merge(a, b, all = TRUE, by = "Name_stand")
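
Applied to the two small data frames from the question, the same idea might look like the sketch below. It assumes that stripping the leading dot and the trailing ".rds" from A's names is enough to reproduce B's names; the column name Key is my own.

```r
A <- data.frame(Name = c(".tony.x.rds", ".tom.x.rds"),
                Numbers = c(15.6, 14.5),
                stringsAsFactors = FALSE)
B <- data.frame(Name = c("tony.x", "tom.x"),
                ChaR = c("ENG", "US"),
                stringsAsFactors = FALSE)

# Normalise A's names: drop the leading "." and the trailing ".rds".
A$Key <- sub("\\.rds$", "", sub("^\\.", "", A$Name))

# Join A's normalised key against B's Name column.
StoRe <- merge(A, B, by.x = "Key", by.y = "Name")
print(StoRe)
```

Adding all = TRUE to the merge call would also keep names that appear in only one of the two data frames.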

Read data with space character in R

Usually, read.table solves many of my data input problems. For example, it reads this without trouble:
China 2 3
USA 1 4
Sometimes, though, the data is more troublesome, like:
Chia 2 3
United States 3 4
Here read.table does not work, because the country name itself contains a space. Any assistance is appreciated.
P.S. The data file is in .dat format.
First set up some test data:
# create test data
cat("Chia 2 3
United States 3 4
", file = "file.with.spaces.txt")
1) Using the above read in the data, insert commas between fields and re-read:
L <- readLines("file.with.spaces.txt")
L2 <- sub("^(.*) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L) # 1
DF <- read.table(text = L2, sep = ",")
giving:
> DF
V1 V2 V3
1 Chia 2 3
2 United States 3 4
2) Another approach. Using L from above, replace the last string of spaces with comma twice (since there are three fields):
L2 <- L
for(i in 1:2) L2 <- sub(" +(\\S+)$", ",\\1", L2) # 2
DF <- read.table(text = L2, sep = ",")
If the column separator 'sep' is indeed whitespace, read.table logically cannot differentiate between spaces within a name and spaces that actually separate columns. I'd suggest changing your country names to single strings, i.e., strings without spaces. Alternatively, use semicolons to separate your data columns and use:
data <- read.table("foo.dat", sep = ";")
If you have many rows in your .dat file, you can consider using regular expressions to find spaces between the columns and replace them with semicolons.
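
A rough sketch of that regular-expression idea follows. It assumes that every numeric field starts with a digit and that no word inside a country name does; the column names country, a and b are placeholders, and the lines are given inline rather than read from the .dat file.

```r
L <- c("Chia 2 3",
       "United States 3 4")

# Replace each run of spaces that is followed by a digit with a semicolon.
# The lookahead (?=[0-9]) needs perl = TRUE.
L2 <- gsub(" +(?=[0-9])", ";", L, perl = TRUE)

DF <- read.table(text = L2, sep = ";",
                 col.names = c("country", "a", "b"),
                 as.is = TRUE)
print(DF)
```

For a real file, L would come from readLines() on the .dat file instead of being typed in.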
