Detect string pattern in dataframe and conditionally fill another in R

I have a dataframe containing text and numeric references, and a dataframe of words that may appear in the text. For every instance in which a word from words_df$word appears in text_df$text, I want to record that word and the numeric reference from text_df$ref in a new dataframe (edge_df).
text_df <- data.frame(text = c("John went to the shops", "Sarita hates apples", "Wendy doesn't care about this"),
ref = c("13.5", "1.9.9", "20.1"))
words_df <- data.frame(word = c("shops", "John", "apples", "Wendy", "this"))
edge_df <- data.frame(ref = NA, word = NA)
The output should look like this:
> edge_df
ref word
1 13.5 shops
2 13.5 John
3 1.9.9 apples
4 20.1 Wendy
5 20.1 this
It isn't very elegant but I thought a for-loop would work, where each word is checked against the text using stringr::str_detect, and if the result is TRUE it would record the word and ref:
for (i in 1:nrow(text_df)) {
for (j in 1:nrow(words_df)) {
if (str_detect(text_df$text[i], words_df$word[j]) == TRUE) {
edge_df$ref <- text_df$ref[i]
edge_df$word <- words_df$word[j]
}
}
}
This did not work, and nor have several variations on this loop. If possible I would rather not use a loop at all as the dataframes I'm working with have around 1000 rows each and it takes far too long to loop through them. Any fixes to the loop much appreciated, and bonus points/props if you can do it without a loop at all.
Thank you!

Here is an option with str_extract_all and unnest. We extract the words from the 'text' column into a list column and use unnest to expand the rows.
library(dplyr)
library(stringr)
library(tidyr)
text_df %>%
transmute(ref, word = str_extract_all(text,
str_c(words_df$word, collapse="|"))) %>%
unnest(c(word))
# A tibble: 5 x 2
# ref word
# <chr> <chr>
#1 13.5 John
#2 13.5 shops
#3 1.9.9 apples
#4 20.1 Wendy
#5 20.1 this
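One caveat with joining the words with "|": the pattern matches substrings, so "this" would also match inside "thistle". If whole-word matches are wanted, one possible variation wraps the alternation in word boundaries. This is a sketch in base R rather than part of the answer above; the helper name extract_words and the sample sentence are made up for illustration, and it assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: return every whole-word match of `words`
# in each element of `text` (one character vector per element).
extract_words <- function(text, words) {
  # \\b marks a word boundary, so "this" no longer matches inside "thistle".
  # Assumption: the words contain no regex metacharacters.
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

words <- c("shops", "John", "apples", "Wendy", "this")
sample_text <- "this is a thistle"

# Plain alternation: substring matching
loose <- regmatches(sample_text,
                    gregexpr(paste(words, collapse = "|"), sample_text))
# Alternation wrapped in word boundaries: whole words only
strict <- extract_words(sample_text, words)

lengths(loose)   # 2: the standalone "this" and the "this" inside "thistle"
lengths(strict)  # 1: only the standalone "this"
```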

Try this tidyverse approach. The key is to reshape the data to long format by splitting each sentence into one word per row, and then use left_join(). Here is the code (I have used the data you provided):
library(tidyverse)
#Data
text_df <- data.frame(text = c("John went to the shops", "Sarita hates apples", "Wendy doesn't care about this"),
ref = c("13.5", "1.9.9", "20.1"),stringsAsFactors = F)
words_df <- data.frame(word = c("shops", "John", "apples", "Wendy", "this"),stringsAsFactors = F)
#Join
words_df %>% left_join(text_df %>% separate_rows(text,sep = ' ') %>%
rename(word=text))
Output:
word ref
1 shops 13.5
2 John 13.5
3 apples 1.9.9
4 Wendy 20.1
5 this 20.1
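Splitting on spaces means only whole space-delimited tokens can match, so a word followed by punctuation (for example "shops.") would not be found. The same split-then-join idea can be sketched in base R, without the tidyverse, roughly as follows. Note that merge() performs an inner join here, so words with no match are dropped rather than kept with NA, and the result is sorted by word.

```r
text_df <- data.frame(text = c("John went to the shops", "Sarita hates apples",
                               "Wendy doesn't care about this"),
                      ref = c("13.5", "1.9.9", "20.1"),
                      stringsAsFactors = FALSE)
words_df <- data.frame(word = c("shops", "John", "apples", "Wendy", "this"),
                       stringsAsFactors = FALSE)

# One character vector of tokens per sentence
tokens <- strsplit(text_df$text, " ", fixed = TRUE)

# Long format: one row per (ref, token) pair
long_df <- data.frame(ref = rep(text_df$ref, lengths(tokens)),
                      word = unlist(tokens),
                      stringsAsFactors = FALSE)

# Keep only the tokens that appear in words_df
edge_df <- merge(words_df, long_df, by = "word")
edge_df
```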

Here is a base R option
u <- lapply(text_df$text,function(x) words_df$word[sapply(words_df$word,function(y) grepl(y,x))])
edge_df <- data.frame(ref = rep(text_df$ref,lengths(u)),word = unlist(u))
which gives
ref word
1 13.5 shops
2 13.5 John
3 1.9.9 apples
4 20.1 Wendy
5 20.1 this

library(data.table)
words_df <- data.frame(word = c("shops", "John", "apples", "Wendy", "this"))
text_df <- data.frame(text = c("John went to the shops",
"Sarita hates apples", "Wendy doesn't care about this"),
ref = c("13.5", "1.9.9", "20.1"))
setDT(words_df)
setDT(text_df)
First we get our words vector ready.
wordvec <- paste0(words_df[,word],collapse="|")
Now all that remains is to extract, for each row, every match of the words in wordvec
## > text_df[,.(word=unlist(regmatches(text,gregexpr(wordvec,text)))),ref]
## ref word
## 1: 13.5 John
## 2: 13.5 shops
## 3: 1.9.9 apples
## 4: 20.1 Wendy
## 5: 20.1 this
The functions regmatches and gregexpr return a list containing all the words that match the pattern wordvec.
> regmatches("John went to the shops",gregexpr(wordvec,"John went to the shops"))
##[[1]]
##[1] "John" "shops"
Warning: to format the output quickly I'm relying on the ref variable as the grouping key, which assumes its values are unique ids. If that is not the case, it is best to create an id column and group by it in addition to ref. For instance:
text_df[,id:=1:.N][,.(word=unlist(regmatches(text,
gregexpr(wordvec,text)))),.(id,ref)]
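As for the loop in the question: it overwrites the whole ref and word columns of edge_df on every match, so earlier matches are lost. A minimal repaired sketch, assuming literal (non-regex) matching is what is wanted, collects one small data frame per match and binds them at the end. It uses base R grepl() with fixed = TRUE in place of str_detect, which is a substitution on my part.

```r
text_df <- data.frame(text = c("John went to the shops", "Sarita hates apples",
                               "Wendy doesn't care about this"),
                      ref = c("13.5", "1.9.9", "20.1"),
                      stringsAsFactors = FALSE)
words_df <- data.frame(word = c("shops", "John", "apples", "Wendy", "this"),
                       stringsAsFactors = FALSE)

rows <- list()  # one single-row data frame per match
for (i in seq_len(nrow(text_df))) {
  for (j in seq_len(nrow(words_df))) {
    # fixed = TRUE treats the word as a literal string, not a regex
    if (grepl(words_df$word[j], text_df$text[i], fixed = TRUE)) {
      rows[[length(rows) + 1]] <- data.frame(ref = text_df$ref[i],
                                             word = words_df$word[j],
                                             stringsAsFactors = FALSE)
    }
  }
}
# Bind all matches into the final edge list
edge_df <- do.call(rbind, rows)
edge_df
```

This still does nrow(text_df) * nrow(words_df) checks, so the vectorised answers above are likely to be faster on larger inputs.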

Related

udpipe (keywords_rake) how to link keywords to the document they were extracted from

I am using the function keywords_rake from the udpipe package (for R) to extract keywords from a bunch of documents.
udmodel_en <- udpipe_load_model(file = dl$file_model)
x <- udpipe_annotate(udmodel_en, x = data$text)
x <- as.data.frame(x)
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id",
relevant = x$xpos %in% c("NN", "JJ"), ngram_max = 2)
where data looks like this
Text
"cats are nice but dogs are better..."
"I really like dogs..."
"red flowers are pretty, especially roses..."
"once I saw a blue whale ..."
....
(each row is a separate document)
However, the output does not include the origin of the keywords; it provides a single list of keywords for all the documents.
How can I link these keywords to the documents they were taken from?
(I.e. have a list of keywords for each of the documents.)
Something like this:
keywords
doc1 dog, cat, blue whale
doc2 dog
doc3 red flower, tower, Donald Trump
You can use txt_recode_ngram together with the outcome of keywords_rake to do this. The advantage is that everything is back in the original data.frame and you can then select what you need. See example below using the dataset supplied with udpipe.
Disclaimer: Code copied from jwijffels' answer in issue 41 on the github page of udpipe.
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id",
relevant = x$xpos %in% c("NN", "JJ"), sep = "-")
head(keywords)
keyword ngram freq rake
1 openbaar-vervoer 2 19 2.391304
2 heel-fijn 2 2 2.236190
3 heel-vriendelijk 2 3 2.131092
4 herhaling-vatbaar 2 6 2.000000
5 heel-appartement 2 2 1.935450
6 steenworp-afstand 2 4 1.888889
x$term <- txt_recode_ngram(x$lemma, compound = keywords$keyword, ngram = keywords$ngram, sep = "-")
x$term <- ifelse(!x$term %in% keywords$keyword, NA, x$term)
head(x[!is.na(x$term), ])
doc_id language sentence_id token_id token lemma xpos term
67039 19991431 nl 4379 11 erg erg JJ erg-centraal
67048 19991431 nl 4379 20 leuk leuk JJ leuk-adres
67070 21054450 nl 4380 6 goede goed JJ goed-locatie
67077 21054450 nl 4380 13 Europese europees JJ europees-wijk
67272 23542577 nl 4393 84 uitstekende uitstekend JJ uitstekend-gastheer
67299 40676307 nl 4396 25 gezellige gezellig JJ gezellig-buurt
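Once every token row carries its term, collapsing to one line of keywords per document is a grouping step. A minimal sketch with base R aggregate() follows; the small data frame below is a made-up stand-in for the annotated x (only the doc_id and term columns matter), so no udpipe model is needed to run it.

```r
# Toy stand-in for the annotated data frame: doc_id plus the recoded term,
# with NA for tokens that are not part of any keyword (values are made up).
x <- data.frame(doc_id = c("doc1", "doc1", "doc1", "doc2", "doc3", "doc3"),
                term = c("dog", NA, "blue-whale", "dog", "red-flower", "red-flower"),
                stringsAsFactors = FALSE)

# Drop non-keyword rows, then paste the unique terms of each document together
keep <- x[!is.na(x$term), ]
per_doc <- aggregate(term ~ doc_id, data = keep,
                     FUN = function(v) paste(unique(v), collapse = ", "))
per_doc
```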

Replacing integers in a dataframe column that's a list of integer vectors (not just single integers) with character strings in R

I have a dataframe with a column that's really a list of integer vectors (not just single integers).
# make example dataframe
starting_dataframe <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
starting_dataframe$player_indices <-
list(as.integer(1),
as.integer(c(2, 5)),
as.integer(3),
as.integer(4),
as.integer(c(6, 7)))
I want to replace the integers with character strings according to a second concordance dataframe.
# make concordance dataframe
example_concord <-
data.frame(last_names = c("Rapinoe",
"Wambach",
"Naeher",
"Morgan",
"Dahlkemper",
"Mitts",
"O'Reilly"),
player_ids = as.integer(c(1,2,3,4,5,6,7)))
The desired result would look like this:
# make dataframe of desired result
desired_result <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
desired_result$player_indices <-
list(c("Rapinoe"),
c("Wambach", "Dahlkemper"),
c("Naeher"),
c("Morgan"),
c("Mitts", "O'Reilly"))
I can't for the life of me figure out how to do it and failed to find a similar case here on stackoverflow. How do I do it? I wouldn't mind a dplyr-specific solution in particular.
I suggest creating a "lookup dictionary" of sorts, and lapply across each of the ids:
example_concord_idx <- setNames(as.character(example_concord$last_names),
example_concord$player_ids)
example_concord_idx
# 1 2 3 4 5 6
# "Rapinoe" "Wambach" "Naeher" "Morgan" "Dahlkemper" "Mitts"
# 7
# "O'Reilly"
starting_dataframe$result <-
lapply(starting_dataframe$player_indices,
function(a) example_concord_idx[a])
starting_dataframe
# first_names player_indices result
# 1 Megan 1 Rapinoe
# 2 Abby 2, 5 Wambach, Dahlkemper
# 3 Alyssa 3 Naeher
# 4 Alex 4 Morgan
# 5 Heather 6, 7 Mitts, O'Reilly
(Code golf?)
Map(`[`, list(example_concord_idx), starting_dataframe$player_indices)
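One thing to watch: indexing a named vector with integers selects by position, not by name, so example_concord_idx[a] only works here because the player ids happen to be 1 to 7 in order. If the ids were arbitrary, a lookup through match() (or indexing with as.character(a)) would be safer. A small sketch with made-up ids:

```r
# Hypothetical concordance whose ids are NOT 1..n in order
concord <- data.frame(last_names = c("Rapinoe", "Wambach", "Naeher"),
                      player_ids = c(30L, 10L, 20L),
                      stringsAsFactors = FALSE)
indices <- list(10L, c(20L, 30L))

# match() returns the row position of each id in the concordance,
# so the lookup is by id value rather than by position
result <- lapply(indices,
                 function(a) concord$last_names[match(a, concord$player_ids)])
result
```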
For tidyverse enthusiasts, I adapted the second half of the accepted answer by r2evans to use map() and %>%:
require(tidyverse)
starting_dataframe <-
starting_dataframe %>%
mutate(
result = map(.x = player_indices, .f = function(a) example_concord_idx[a])
)
Definitely won't win code golf, though!
Another way is to unlist the list-column, and relist it after modifying its contents:
df1$player_indices <- relist(df2$last_names[unlist(df1$player_indices)], df1$player_indices)
df1
#> first_names player_indices
#> 1 Megan Rapinoe
#> 2 Abby Wambach, Dahlkemper
#> 3 Alyssa Naeher
#> 4 Alex Morgan
#> 5 Heather Mitts, O'Reilly
Data
## initial data.frame w/ list-column
df1 <- data.frame(first_names = c("Megan", "Abby", "Alyssa", "Alex", "Heather"), stringsAsFactors = FALSE)
df1$player_indices <- list(1, c(2,5), 3, 4, c(6,7))
## lookup data.frame
df2 <- data.frame(last_names = c("Rapinoe", "Wambach", "Naeher", "Morgan", "Dahlkemper",
"Mitts", "O'Reilly"), stringsAsFactors = FALSE)
NB: I set stringsAsFactors = FALSE to create character columns in the data.frames, but it works just as well with factor columns instead.

How do I extract a part of data from a column and paste it in another column using R?

I want to extract a part of data from a column and paste it in another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use separate() from tidyr, which is loaded as part of the tidyverse.
Note that I use NA instead of NULL inside the mobile vector, because c() silently drops NULL elements.
Also, I use the option stringsAsFactors = F when creating the dataframe to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
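Splitting on a single space assumes the place name itself contains no spaces. A base R sketch that instead pulls a trailing "+digits" token off the end of the string might look like this; the regular expressions are my own assumption about what a phone number looks like in this data.

```r
names <- c("Sia", "Ryan", "J", "Ricky")
country <- c("London +1234567890", "Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names, country, mobile, stringsAsFactors = FALSE)

# Assumption: a phone number is a "+" followed by digits at the end of the string
has_number <- grepl("\\+[0-9]+$", data$country)
number <- ifelse(has_number,
                 sub("^.*(\\+[0-9]+)$", "\\1", data$country),
                 NA_character_)

# Fill mobile only where it is missing, then strip the number from country
data$mobile <- ifelse(is.na(data$mobile), number, data$mobile)
data$country <- sub("[[:space:]]*\\+[0-9]+$", "", data$country)
data
```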

Extracting Column data from .csv and turning every 10 consecutive rows into corresponding columns

Below is the code I am trying to implement. I want to extract each run of 10 consecutive row values and turn them into corresponding columns.
This is what the data looks like: https://drive.google.com/file/d/0B7huoyuu0wrfeUs4d2p0eGpZSFU/view?usp=sharing
I have been trying, but temp1 and temp2 come out empty. Please help.
library(Hmisc) #for increment function
myData <- read.csv("Clothing_&_Accessories.csv",header=FALSE,sep=",",fill=TRUE) # reading the csv file
extract<-myData$V2 # extracting the desired column
x<-1
y<-1
temp1 <- NULL #initialisation
temp2 <- NULL #initialisation
data.sorted <- NULL #initialisation
limit<-nrow(myData) # Calculating no of rows
while (x! = limit) {
count <- 1
for (count in 11) {
if (count > 10) {
inc(x) <- 1
break # gets out of for loop
}
else {
temp1[y]<-data_mat[x] # extracting by every row element
}
inc(x) <- 1 # increment x
inc(y) <- 1 # increment y
}
temp2<-temp1
data.sorted<-rbind(data.sorted,temp2) # turn rows into columns
}
Your code is too complex. You can do this using only one for loop, without external packages, like this:
myData <- as.data.frame(matrix(c(rep("a", 10), "", rep("b", 10)), ncol=1), stringsAsFactors = FALSE)
newData <- data.frame(row.names=1:10)
for (i in 1:((nrow(myData)+1)/11)) {
start <- 11*i - 10
newData[[paste0("col", i)]] <- myData$V1[start:(start+9)]
}
You don't actually need all this though. You can simply remove the empty lines, split the vector in chunks of size 10 (as explained here) and then turn the list into a data frame.
vec <- myData$V1[nchar(myData$V1)>0]
as.data.frame(split(vec, ceiling(seq_along(vec)/10)))
# X1 X2
# 1 a b
# 2 a b
# 3 a b
# 4 a b
# 5 a b
# 6 a b
# 7 a b
# 8 a b
# 9 a b
# 10 a b
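When the non-empty values come in complete groups of 10, a matrix is another way to get the same shape, because matrix() fills column by column. A sketch using the same toy data as above (the resulting columns are named V1, V2 rather than X1, X2):

```r
myData <- as.data.frame(matrix(c(rep("a", 10), "", rep("b", 10)), ncol = 1),
                        stringsAsFactors = FALSE)

# Drop the empty separator rows
vec <- myData$V1[nchar(myData$V1) > 0]

# Guard: this only works if the values form complete groups of 10
stopifnot(length(vec) %% 10 == 0)

# Each consecutive run of 10 values becomes one column
newData <- as.data.frame(matrix(vec, nrow = 10), stringsAsFactors = FALSE)
newData
```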
We could create a numeric index based on the '' values in the 'V2' column, split the dataset, use Reduce/merge to get the columns in the wide format.
indx <- cumsum(myData$V2=='')+1
res <- Reduce(function(...) merge(..., by= 'V1'), split(myData, indx))
res1 <- res[order(factor(res$V1, levels=myData[1:10, 1])),]
colnames(res1)[-1] <- paste0('Col', 1:3)
head(res1,3)
# V1 Col1 Col2 Col3
#2 ProductId B000179R3I B0000C3XXN B0000C3XX9
#4 product_title Amazon.com Amazon.com Amazon.com
#3 product_price unknown unknown unknown
From the p1.png, the 'V1' column can also be the column names for the values in 'V2'. If that is the case, we can 'transpose' the 'res1' except the first column and change the column names of the output with the first column of 'res1' (setNames(...))
res2 <- setNames(as.data.frame(t(res1[-1]), stringsAsFactors=FALSE),
res1[,1])
row.names(res2) <- NULL
res2[] <- lapply(res2, type.convert)
head(res2)
# ProductId product_title product_price userid
#1 B000179R3I Amazon.com unknown A3Q0VJTU04EZ56
#2 B0000C3XXN Amazon.com unknown A34JM8F992M9N1
#3 B0000C3XX9 Amazon.com unknown A34JM8F993MN91
# profileName helpfulness reviewscore review_time
#1 Jeanmarie Kabala "JP Kabala" 7/7 4 1182816000
#2 M. Shapiro 6/6 5 1205107200
#3 J. Cruze 8/8 5 120571929
# review_summary
#1 Periwinkle Dartmouth Blazer
#2 great classic jacket
#3 Good jacket
# review_text
#1 I own the Austin Reed dartmouth blazer in every color
#2 This is the second time I bought this jacket
#3 This is the third time I bought this jacket
I guess this is just a reshaping issue. In that case, we can use dcast from data.table to convert from long to wide format
library(data.table)
DT <- dcast(setDT(myData)[V1!=''][, N:= paste0('Col', 1:.N) ,V1], V1~N,
value.var='V2')
data
myData <- structure(list(V1 = c("ProductId", "product_title",
"product_price",
"userid", "profileName", "helpfulness", "reviewscore", "review_time",
"review_summary", "review_text", "", "ProductId", "product_title",
"product_price", "userid", "profileName", "helpfulness",
"reviewscore",
"review_time", "review_summary", "review_text", "", "ProductId",
"product_title", "product_price", "userid", "profileName",
"helpfulness",
"reviewscore", "review_time", "review_summary", "review_text"
), V2 = c("B000179R3I", "Amazon.com", "unknown", "A3Q0VJTU04EZ56",
"Jeanmarie Kabala \"JP Kabala\"", "7/7", "4", "1182816000",
"Periwinkle Dartmouth Blazer",
"I own the Austin Reed dartmouth blazer in every color", "",
"B0000C3XXN", "Amazon.com", "unknown", "A34JM8F992M9N1",
"M. Shapiro",
"6/6", "5", "1205107200", "great classic jacket",
"This is the second time I bought this jacket",
"", "B0000C3XX9", "Amazon.com", "unknown", "A34JM8F993MN91",
"J. Cruze", "8/8", "5", "120571929", "Good jacket",
"This is the third time I bought this jacket"
)), .Names = c("V1", "V2"), row.names = c(NA, 32L),
class = "data.frame")

Replacing vector values in R based on a list (hash)

I have a dataframe, one column of which is names. In a later phase of analysis, I will need to merge with other data by this name column, and there are a few names which vary by source. I'd like to clean up my names using a hash (map) of names->cleaned names. I've found several references to using R lists as hashes (e.g., this question on SE), but I can't figure out how to extract values for keys in a vector only as they occur. So for example,
> players=data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8))
> xref = c("Bob"="Robert", "Fred Jr." = "Fred")
> players$names
[1] Joe John Bob
Levels: Bob Joe John
Whereas players$names gives a vector of names from the original frame, I need the same vector, only with any values that occur in xref replaced with their equivalent (lookup) values; my desired result is the vector Joe John Robert.
The closest I've come is:
> players$names %in% names(xref)
[1] FALSE FALSE TRUE
Which correctly indicates that only "Bob" in players$names exists in the "keys" (names) of xref, but I can't figure out how to extract the value for that name and combine it with the other names in the vector that don't belong to xref as needed.
note: in case it's not completely clear, I'm pretty new to R, so if I'm approaching this in the wrong fashion, I'm happy to be corrected, but my core issue is essentially as stated: I need to clean up some incoming data within R by replacing some incoming values with known replacements and keeping all other values; further, the map of original->replacement should be stored as data (like xref), not as code.
Updated answer: ifelse
ifelse is an even more straightforward solution, in the case that xref is a named vector and not a list.
players <- data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8), stringsAsFactors = FALSE)
xref <- c("Bob" = "Robert", "Fred Jr." = "Fred")
players$clean <- ifelse(is.na(xref[players$names]), players$names, xref[players$names])
players
Result
names scores clean
1 Joe 9.8 Joe
2 John 9.9 John
3 Bob 8.8 Robert
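If the same clean-up is needed in several places, the lookup can be wrapped in a small function. The sketch below is one possible shape (the name recode_values is made up); it converts the input to character first, so it behaves the same whether the column is a factor or a character vector.

```r
# Hypothetical helper: replace values found in names(map) with their mapped
# value and leave everything else unchanged.
recode_values <- function(x, map) {
  x <- as.character(x)             # works for factors and character vectors
  hit <- x %in% names(map)         # which values have a replacement
  x[hit] <- unname(map[x[hit]])    # look the replacements up by name
  x
}

xref <- c("Bob" = "Robert", "Fred Jr." = "Fred")
cleaned <- recode_values(factor(c("Joe", "John", "Bob")), xref)
cleaned
```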
Previous answer: sapply
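If xref is a list, then the sapply function can be used to do conditional look-ups. Use [[ ]] to pull the value out of the list so that the result is a character vector rather than a list column.
players <- data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8), stringsAsFactors = FALSE)
xref <- list("Bob" = "Robert", "Fred Jr." = "Fred")
# look each name up in xref; keep the original name when there is no entry
players$clean <- sapply(players$names, function(x) if (x %in% names(xref)) xref[[x]] else x, USE.NAMES = FALSE)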
players
Result
> players
names scores clean
1 Joe 9.8 Joe
2 John 9.9 John
3 Bob 8.8 Robert
If players$names is a factor (the default for data.frame() before R 4.0.0), you can replace the factor levels with the desired text. Here's an example which loops through xref and does the replacement:
for (n in names(xref)) {
levels(players$names)[levels(players$names) == n ] <- xref[n]
}
players
## names scores
## 1 Joe 9.8
## 2 John 9.9
## 3 Robert 8.8
Another example of replacing the factor levels.
allnames = levels(players$names)
levels(players$names)[ !is.na(xref[allnames]) ] = na.omit(xref[allnames])
players
# names scores
# 1 Joe 9.8
# 2 John 9.9
# 3 Robert 8.8
If you get into really big data sets, you might take a look at the merge function or the data.table package. Here is a data.table example of a join.
library(data.table)
players=data.table(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8), key="names")
nms = data.table(names=names(xref),names2=xref, key="names")
out = nms[players]
out[is.na(names2),names2:=names]
out
# names names2 scores
# 1: Bob Robert 8.8
# 2: Joe Joe 9.8
# 3: John John 9.9
Here is a similar example with the merge function.
players=data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8))
nms = data.frame(names=names(xref),names2=xref,row.names=NULL)
merge(nms,players,all.y=TRUE)
# names names2 scores
# 1 Bob Robert 8.8
# 2 Joe <NA> 9.8
# 3 John <NA> 9.9
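Unlike the data.table example, the merge() result leaves NA in names2 for names without a replacement. One way to fill those in, sketched here with stringsAsFactors = FALSE so that the columns are plain character vectors:

```r
xref <- c("Bob" = "Robert", "Fred Jr." = "Fred")
players <- data.frame(names = c("Joe", "John", "Bob"),
                      scores = c(9.8, 9.9, 8.8),
                      stringsAsFactors = FALSE)
nms <- data.frame(names = names(xref), names2 = unname(xref),
                  stringsAsFactors = FALSE)

# Left join: keep every player, attach a replacement where one exists
out <- merge(players, nms, by = "names", all.x = TRUE)

# Fall back to the original name where there was no replacement
missing <- is.na(out$names2)
out$names2[missing] <- out$names[missing]
out
```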
