I've got a list of names that have been written in a messy way in a single column. I'm trying to extract first name, middle names and last names out of this column to store separately.
To do this, I use gsub to take the first word of each entry and save it as the first name. I then remove the first and last words of each entry and save what is left as the middle names. Finally, I gsub the last word from each entry and save it as the last name.
This gave me a problem: for entries that contain only one name (so 'kevin' instead of 'kevin banks'), my code also saves the first name as the last name ('kevin kevin'). I tried to fix it with a for-loop that blanks the lastname column if the original name entry has only 1 word. When I try this, ALL the lastname entries are empty, even the ones that do have a last name!
This is my code:
df <- data.frame(ego = c("linda", "wendy pralice of rivera", "bruce springsteen", "dan", "sam"))
df$firstname <- gsub("([A-Za-z]+).*", "\\1", df$ego)
df$middlename <- gsub("^\\w*\\s*", "", gsub("\\s*\\w*\\.*$", "", df$ego))
df$lastname <- gsub("^.* ([A-Za-z]+)", "\\1", df$ego)
for(n in df$ego) {
if(lengths(strsplit(n, " ")) == 1) {
df$lastname <- ""
}
}
What am I doing wrong?
If there are 4 fields, put double quotes around the middle two. For example, a b c d would be changed to a "b c" d, giving s1. (If there are not 4 fields, no substitution is done and s1 is set to df$ego.)
If there are exactly two fields, insert an empty pair of double quotes between the two. For example, a b would be changed to a "" b. (If there are not exactly two fields, no substitution is done and s2 is set to s1.)
Finally, read the result in with read.table.
# the last group must be (\\w+), not (\\w)+, or only the final letter is captured
s1 <- sub('^(\\w+) (\\w+ \\w+) (\\w+)$', '\\1 "\\2" \\3', df$ego)
s2 <- sub('^(\\w+) (\\w+)$', '\\1 "" \\2', s1)
read.table(text = s2, as.is = TRUE, fill = TRUE,
col.names = c("first", "middle", "last"))
giving:
  first     middle        last
1 linda
2 wendy pralice of      rivera
3 bruce            springsteen
4   dan
5   sam
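As for why the loop in the question blanks every row: `df$lastname <- ""` assigns to the whole column rather than to the current row, so the first single-word entry wipes everything. Below is a minimal vectorised sketch that avoids the loop, assuming the words are separated by spaces. The helper names `parts` and `n` are chosen here for illustration only.

```r
df <- data.frame(ego = c("linda", "wendy pralice of rivera", "bruce springsteen", "dan", "sam"),
                 stringsAsFactors = FALSE)

# split every entry into its words once
parts <- strsplit(df$ego, " +")
n <- lengths(parts)

# the first word is always the first name
df$firstname <- vapply(parts, function(w) w[1], character(1))

# everything strictly between the first and last word; empty if fewer than 3 words
df$middlename <- vapply(parts, function(w) {
  if (length(w) > 2) paste(w[-c(1, length(w))], collapse = " ") else ""
}, character(1))

# the last word, but only when the entry has more than one word
df$lastname <- ifelse(n > 1,
                      vapply(parts, function(w) w[length(w)], character(1)),
                      "")
```

Because the condition `n > 1` is evaluated per row, single-word entries get an empty last name while the others keep theirs.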
I am trying to switch genders of words in a string in R. For example, if I have the sentence "My gf has a mother who talks to my father and his bf", I want it to read "My bf has a father who talks to my mother and her gf".
I have a key-value pair list which contains a list of gender pairs -- right now it is just a data frame which looks something like the one below. My naive way of solving it was to do a string replace where I iterate through the list and replace the key with the value. The obvious problem with this is that it ends up swapping everything in the sentence, and then swapping it all back. You can see this in the example code below.
library(stringr)
key_vals = data.frame(first_word = c("bf", "gf", "mother", "father", "his", "her"), second_word = c("gf", "bf", "father", "mother", "her", "his"))
ex = "My gf has a mother who talks to my father and his bf"
for(i in 1:nrow(key_vals)){
ex = str_replace_all(ex, key_vals$first_word[i], key_vals$second_word[i])
}
My other idea was making two lists, one which had all male keys and all female values, and one which was the opposite. Then if I split up the sentence into individual words, for each word I could do an if statement like "if a male string is present, replace it with a female string, elif a female string is present, replace it with a male string, else do nothing". However, I can't figure out how to get just the words alone in a way I can then easily recombine into a working sentence. String split based on regex etc. just deletes the words, so I'm really struggling.
Another problem is that if, for example, there is something like "mother", it might get replaced to be "mothis", since I'm using a stupid way of matching strings which doesn't first identify the words, so it seems like I need to split it into words in any case.
This feels like it should be much more straightforward than it has been for me! Any help would be very appreciated.
We may use gsubfn
library(gsubfn)
gsubfn("(\\w+)", setNames(as.list(key_vals[[2]]), key_vals[[1]]), ex)
[1] "My bf has a father who talks to my mother and her gf"
Change the for-loop part to this:
plyr::mapvalues(str_split(ex, ' ')[[1]], key_vals$first_word, key_vals$second_word) %>%
str_flatten(' ')
The following `from` values were not present in `x`: her
[1] "My bf has a father who talks to my mother and her gf"
ex
[1] "My gf has a mother who talks to my father and his bf"
I think the warning can be ignored as it is just complaining that her is not in the sentence that ex contains.
The code first splits the character into a vector, then replaces the individual words and then pastes them back together again.
Rather than relying on a data frame of replacements, you could use a named vector, which is similar to a dictionary of values:
replacements <- key_vals$second_word
names(replacements) <- key_vals$first_word
bf gf mother father his her
"gf" "bf" "father" "mother" "her" "his"
ex_split <- str_split(ex, ' ')[[1]]
swapped <- replacements[ex_split]
final <- paste0(ifelse(!is.na(swapped), swapped, ex_split), collapse = ' ')
"My bf has a father who talks to my mother and her gf"
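After creating ex_split, you could also substitute and glue everything together with Reduce. Note that Reduce uses the first word as its starting value, so that word is never looked up; this is harmless here because "My" has no replacement: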
Reduce(function(x, y) paste(x, ifelse(!is.na(replacements[y]), replacements[y], y)), ex_split)
Here is a base R option using strsplit + match like below
with(
key_vals,
{
v <- unlist(strsplit(ex, "(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)", perl = TRUE))
p <- second_word[match(v, first_word)]
paste0(ifelse(is.na(p), v, p), collapse = "")
}
)
and it yields
[1] "My bf has a father who talks to my mother and her gf"
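For comparison, here is a minimal base R sketch of the same single-pass idea using `gregexpr` and `regmatches<-`, which replaces whole words in place and leaves any punctuation or spacing untouched. It assumes that words consist of letters only; the names `lookup`, `hit` and `swapped` are illustrative.

```r
key_vals <- data.frame(first_word  = c("bf", "gf", "mother", "father", "his", "her"),
                       second_word = c("gf", "bf", "father", "mother", "her", "his"),
                       stringsAsFactors = FALSE)
ex <- "My gf has a mother who talks to my father and his bf"

# named vector used as a dictionary: names are keys, values are replacements
lookup <- setNames(key_vals$second_word, key_vals$first_word)

# locate every run of letters (assumption: a "word" is letters only)
m <- gregexpr("[[:alpha:]]+", ex)
words <- regmatches(ex, m)[[1]]

# swap only the words that appear in the dictionary, each exactly once
hit <- words %in% names(lookup)
words[hit] <- lookup[words[hit]]

# write the words back into their original positions
swapped <- ex
regmatches(swapped, m) <- list(words)
swapped
```

Each word is looked up once against the original sentence, so a replacement is never swapped back, and partial matches such as "his" inside a longer word cannot occur.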
This does what you need.
library(stringr)
# I've updated the column names, for clarity
key_vals <- data.frame(words = c("bf", "gf", "mother", "father", "his", "her"), swapped_words = c("gf", "bf", "father", "mother", "her", "his"))
# used str_split to break the sentence into multiple words
ex <- "My gf has a mother who talks to my father and his bf"
words <- stringr::str_split(ex, " ")[[1]] #break into words
# do a left join between the two tables (all.x = TRUE keeps every word of the sentence)
dict <- merge(data.frame(words=words), key_vals, by = "words", all.x = TRUE, incomparables = NA)
# now we basically apply the dictionary to the string, using an apply function
# we also use paste(..., collapse = " ") to make them into one sentence again
words <- paste(sapply(words, function(x) {
if (!x %in% key_vals$words)
return (x)
return(dict$swapped_words[dict$words == x])
}), collapse=" ")
I have a data frame 'key_words' with vectors of pairs of words
key_words <- data.frame(c1 = c('word1', 'word2'), c2 = c('word3', 'word4'), c3 = c('word5', 'word6'))
I would like to search for these pairs of key words in a character column 'text' in another data frame 'x' where each row can be a few sentences long. I want to grab the word following two consecutive matches of a column in the key_words data frame and insert that value into a table at the same index of where the match was found. For example, if 'word1' and 'word2' are found one after the other in text[1] then I want to grab the word that comes after in text[1] and insert it into table[1].
I have tried splitting each row in 'text' into a list, separating by a single space so that each word has its own index in each row. I have the following idea which seems very inefficient and I'm running into problems where the character value temp_list[k] is of length 0.
x <- x %>% mutate(text = strsplit(text, " "))
for (i in 1:ncol(key_words)) {
  word1 <- key_words[i, 1]
  word2 <- key_words[i, 2]
  for (j in 1:length(x$text)) {
    temp_list <- as.list(unlist(x$text[[j]]))
    for (k in 1:length(temp_list)) {
      if (word1 == temp_list[k]) {
        if (word2 == temp_list[k + 1]) {
          table$word_found[j] <- temp_list[k + 2]
        }
      }
    }
  }
}
Is there a better way to do this or can I search through the text column for 'word1 word2' and grab the next word which can be any length? I'm new to R and coding in general, but I know I should be avoiding nested loops like this. Any help would be appreciated, thanks!!
I would suggest that you create a small function like this one, that returns the word following the occurrence of the pair 'w1 w2'
get_word_after_pair <- function(text,w1,w2) {
stringr::str_extract(text, paste0("(?<=\\b", w1, "\\s", w2, "\\b\\s)\\w*(?=\\b)"))
}
and then you can do this
data.frame(
lapply(key_words, function(x) get_word_after_pair(texttable$text,x[1],x[2]))
)
Input:
(key_words is a list of word pairs, texttable is a data frame with a column text)
key_words <- list( pair1 = c('has','important'), pair2 = c('sentence','has'), pair3 = c('third','sentence'))
texttable = data.frame(text=c("this sentence has important words that we must find",
"this second sentence has important words to find",
"this is the third sentence and it also has important words within")
)
Output:
pair1 pair2 pair3
1 words important <NA>
2 words important <NA>
3 words <NA> and
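If you would rather not depend on stringr, roughly the same lookup can be sketched in base R with `regexec` and a capture group. This is only an illustration: it assumes R 3.3 or later (for `perl = TRUE` in `regexec`), assumes the keywords contain no regex metacharacters, and `word_after_pair` is a name made up here.

```r
word_after_pair <- function(text, w1, w2) {
  # capture the word that follows "w1 w2"; \\s+ tolerates repeated whitespace
  pattern <- paste0("\\b", w1, "\\s+", w2, "\\s+(\\w+)")
  m <- regexec(pattern, text, perl = TRUE)
  # each element holds the full match and the captured group, or is empty
  vapply(regmatches(text, m), function(g) {
    if (length(g) == 2) g[2] else NA_character_
  }, character(1))
}

texts <- c("this sentence has important words that we must find",
           "this second sentence has important words to find",
           "this is the third sentence and it also has important words within")

word_after_pair(texts, "has", "important")
word_after_pair(texts, "third", "sentence")
```

The function is vectorised over `text`, so it can be dropped into the `lapply` call above in place of `get_word_after_pair`.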
I am trying to extract the first letter of each of several strings that are separated by commas, and then count how many times that letter appears. An example of a column in my data frame looks like this:
test <- data.frame("Code" = c("EKST, STFO", "EFGG", "SSGG, RRRR, RRFK",
"RRRF"))
And I'd want a column added next to it that looks like this:
test2 <- data.frame("Code" = c("EKST, STFO", "EFGG", "SSGG, RRRR, RRFK",
"RRRF"), "Code_Count" = c("E1, S1", "E1", "S1, R2", "R1"))
The code count column extracts the first letter of the string and counts how many times that letter appears in that specific cell.
I looked into using strsplit to get the first letter in the column separated by commas, but I'm not sure how to attach the count of how many times that letter appears in the cell to it.
Here is one option using base R. This splits the Code column on the comma (and at least one space), then tabulates the number of times the first letter appears, then pastes them back together into your desired output. It does sort the new column in alphabetical order (which doesn't match your output). Hope this helps!
test2$Code_Count2 <- sapply(strsplit(test2$Code, ",\\s+"), function(x) {
tab <- table(substr(x, 1, 1)) # Create a table of the first letters
paste0(names(tab), tab, collapse = ", ") # Paste together the letter w/ the number and collapse them
} )
test2
              Code Code_Count Code_Count2
1       EKST, STFO     E1, S1      E1, S1
2             EFGG         E1          E1
3 SSGG, RRRR, RRFK     S1, R2      R2, S1
4             RRRF         R1          R1
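If the order of first appearance matters (so that row 3 reads "S1, R2" as in the desired output), one possible tweak is to tabulate against a factor whose levels follow the order in which the letters occur. This is a sketch, and `count_first_letters` is a name invented here.

```r
count_first_letters <- function(code) {
  vapply(strsplit(code, ",\\s*"), function(x) {
    first <- substr(x, 1, 1)
    # levels in order of first appearance, so table() does not sort alphabetically
    lv <- unique(first)
    counts <- table(factor(first, levels = lv))
    paste0(lv, as.vector(counts), collapse = ", ")
  }, character(1))
}

codes <- c("EKST, STFO", "EFGG", "SSGG, RRRR, RRFK", "RRRF")
count_first_letters(codes)
```

For the sample data this returns "E1, S1", "E1", "S1, R2" and "R1".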
Here is a tidier, stringr/purrr solution that grabs the first letter of a word and does the same thing (instead of splitting the string)
library(purrr)
library(stringr)
map_chr(str_extract_all(test2$Code, "\\b[A-Z]{1}"), function(x) {
tab <- table(x)
paste0(names(tab), tab, collapse = ", ")
} )
Data:
test2 <- data.frame("Code" = c("EKST, STFO", "EFGG", "SSGG, RRRR, RRFK",
"RRRF"), "Code_Count" = c("E1, S1", "E1", "S1, R2", "R1"))
test2[] <- lapply(test2, as.character) # factor to character
I need to replace a subset of a string with matches that are stored in a data frame.
For example -
input_string = "Whats your name and Where're you from"
I need to replace part of this string from a data frame. Say the data frame is
matching <- data.frame(from_word=c("Whats your name", "name", "fro"),
to_word=c("what is your name","names","froth"))
The expected output is "what is your name and Where're you from".
Note -
It should match the longest string. In this example, name is not replaced by names, because name was part of a bigger match.
It has to match whole words and not partial strings. The fro in "from" should not be replaced by "froth".
I referred to the below link but somehow could not get this work as intended/described above
Match and replace multiple strings in a vector of text without looping in R
This is my first post here. If I haven't given enough details, kindly let me know
Edit
Based on the input from Sri's comment I would suggest using:
library(gsubfn)
# words to be replaced
a <-c("Whats your","Whats your name", "name", "fro")
# their replacements
b <- c("What is yours","what is your name","names","froth")
# named list as an input for gsubfn
replacements <- setNames(as.list(b), a)
# the test string
input_string = "fro Whats your name and Where're name you from to and fro I Whats your"
# match entire words
gsubfn(paste(paste0("\\w*", names(replacements), "\\w*"), collapse = "|"), replacements, input_string)
Original
I would not say this is easier to read than your simple loop, but it might take better care of the overlapping replacements:
# define the sample dataset
input_string = "Whats your name and Where're you from"
matching <- data.frame(from_word=c("Whats your name", "name", "fro", "Where're", "Whats"),
to_word=c("what is your name","names","froth", "where are", "Whatsup"))
# load used libraries
library(gsubfn)
library(stringr) # needed for str_split below
# make sure data is of class character
matching$from_word <- as.character(matching$from_word)
matching$to_word <- as.character(matching$to_word)
# extract the words in the sentence
test <- unlist(str_split(input_string, " "))
# find where individual words from sentence match with the list of replaceable words
test2 <- sapply(paste0("\\b", test, "\\b"), grepl, matching$from_word)
# change rownames to see what is the format of output from the above sapply
rownames(test2) <- matching$from_word
# reorder the data so that largest replacement blocks are at the top
test3 <- test2[order(rowSums(test2), decreasing = TRUE),]
# where the word is already being replaced by larger chunk, do not replace again
test3[apply(test3, 2, cumsum) > 1] <- FALSE
# define the actual pairs of replacement
replacements <- setNames(as.list(as.character(matching[,2])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1]),
as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1])
# perform the replacement
gsubfn(paste(as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1], collapse = "|"),
replacements,input_string)
You can pass gsubfn a named list of the form
toreplace <- list("x1" = "y1", "x2" = "y2", ..., "xn" = "yn")
Each element pairs xi with yi:
xi is the pattern (find what), yi is the replacement (replace with).
input_string = "Whats your name and Where're you from"
toreplace <- list("Whats your name" = "what is your name", "name" = "names", "fro" = "froth")
gsubfn(paste(names(toreplace), collapse = "|"), toreplace, input_string)
Note that these patterns are not anchored to word boundaries, so fro also matches inside from.
Was trying out different things and the below code seems to work.
a <-c("Whats your name", "name", "fro")
b <- c("what is your name","names","froth")
c <- c("Whats your name and Where're you from")
for(i in seq_along(a)) c <- gsub(paste0('\\<',a[i],'\\>'), gsub(" ","_",b[i]), c)
c <- gsub("_"," ",c)
c
I took help from the link below: Making gsub only replace entire words?
However, I would like to avoid the loop if possible. Can someone please improve this answer so that it works without the loop?
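One loop-free possibility in base R is to build a single alternation, longest pattern first, wrap it in word boundaries, and write the replacements back with `regmatches<-`. This is a sketch under two assumptions: the search strings contain no regex metacharacters, and `replace_longest` is a helper name invented here.

```r
replace_longest <- function(x, from, to) {
  # PCRE tries alternatives left to right, so put the longest patterns first
  ord <- order(nchar(from), decreasing = TRUE)
  pattern <- paste0("\\b(?:", paste(from[ord], collapse = "|"), ")\\b")
  m <- gregexpr(pattern, x, perl = TRUE)
  # look up each matched string in `from` and substitute the paired `to`
  regmatches(x, m) <- lapply(regmatches(x, m), function(hit) to[match(hit, from)])
  x
}

a <- c("Whats your name", "name", "fro")
b <- c("what is your name", "names", "froth")
replace_longest("Whats your name and Where're you from", a, b)
```

For the example string the result is "what is your name and Where're you from": name is consumed by the longer match, and fro does not match inside from because of the trailing `\\b`.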
I have one text file like this
DOB
Name
Address
13-03-2003
ABC
xyz.
12-08-2004
dfs
1 infinite loop.
The text goes on for approximately 300 lines, and sometimes the Address data runs onto a second line. I want to convert this text data to CSV format, which would look like this:
DOB, Name, Address
13-03-2003,ABC,xyz.
or at least into one data frame. I have tried many things. When I call read.table("file.txt", sep="\n") it puts everything in one column. I also tried first reading the headers using
header <- read.table("file.txt",sep= "\n")
and then reading the data with data <- read.table("file.txt",skip = 3, sep ="\n") and combining both, but that does not work because my header vector has 3 entries and my data vector has approximately 300. Any help would be really appreciated :)
You could try
entries <- unlist(strsplit(text, "\\n")) #separate entries by line breaks
entries <- entries[nchar(entries) > 0] #remove empty lines
as.data.frame(matrix(entries, ncol=3, byrow=TRUE)) #assemble dataframe
# V1 V2 V3
#1 DOB Name Address
#2 13-03-2003 ABC xyz.
#3 12-08-2004 dfs 1 infinite loop.
data
text <-'DOB
Name
Address
13-03-2003
ABC
xyz.
12-08-2004
dfs
1 infinite loop.'
df <- read.table(text = text)
Two assumptions were made. First, there will not be any blank names or dates of birth. By "blank" I do not mean "NA", "", or any other marker that the value was missing. Second, names and DOBs will only occupy one line each.
s1 <- gsub("^\n|\n$", "", strsplit(x, "\n\n+")[[1]])
stars <- gsub("\n", ", ", sub("\n", "*", sub("\n", "*", s1)))
mat <- t(as.data.frame(strsplit(stars, "\\*")))
dimnames(mat) <- c(NULL, NULL)
write.csv(mat,"filename.csv")
We start by splitting the text by the blank lines and eliminating any leading or trailing newline tokens. Then we replace the first and second "\n" symbols with stars. Next we split on those new star markers that we created to always have 3 elements for each row. We create a matrix with the values and transpose it for display. Then write the data to csv.
When opened with Notepad on a test file, I get:
"","V1","V2","V3"
"1","DOB","Name","Address"
"2","13-03-2003","ABC","xyz."
"3","12-08-2004","dfs","1 infinite loop"
"4","01-01-2000","Bob Smith","1234 Main St, Suite 400"
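Row names can be dropped with row.names = FALSE (see ?write.csv). write.csv ignores attempts to change col.names, so use write.table if you also want to drop the header.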
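If the file has no blank lines between records, another option is to start a new record at every line that looks like a date. The sketch below assumes that every DOB has the form dd-mm-yyyy, that the first three lines are the header, and that the name occupies exactly one line; the variable names are illustrative.

```r
lines <- c("DOB", "Name", "Address",
           "13-03-2003", "ABC", "xyz.",
           "12-08-2004", "dfs", "1 infinite loop",
           "01-01-2000", "Bob Smith", "1234 Main St", "Suite 400")

header <- lines[1:3]
body <- lines[-(1:3)]

# a new record starts at every line that looks like dd-mm-yyyy
is_dob <- grepl("^\\d{2}-\\d{2}-\\d{4}$", body)
record_id <- cumsum(is_dob)

# per record: DOB, name, then all remaining lines joined as the address
rows <- lapply(split(body, record_id), function(r) {
  c(r[1], r[2], paste(r[-(1:2)], collapse = ", "))
})

out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(out) <- header
out
```

With a real file, `lines` would come from `readLines("file.txt")`, and `write.csv(out, "filename.csv", row.names = FALSE)` would then produce the CSV.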
Data
x <- "DOB
Name
Address

13-03-2003
ABC
xyz.

12-08-2004
dfs
1 infinite loop

01-01-2000
Bob Smith
1234 Main St
Suite 400
"