My data is as shown below:
txt$txt:
my friend stays in adarsh nagar
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
I have an exhaustive list of city names. Listing few of them below:
city:
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
I am searching for city names (from the "city" list I have) in txt$txt and extracting them into another column if they are present. The simple loop below works for me, but it is taking a lot of time on the bigger dataset.
for (i in 1:nrow(txt)) {
  a <- c()
  for (j in 1:nrow(city)) {
    a[j] <- grepl(paste("\\b", city[j, 1], "\\b", sep = ""), txt$txt[i])
  }
  txt$city[i] <- ifelse(sum(a) > 0, paste(city[which(a), 1], collapse = "_"), "NONE")
}
I tried to use an apply function, and this is the furthest I could get:
apply(as.matrix(txt$txt), 1, function(x) {
  words <- unlist(strsplit(x, " "))
  ifelse(sum(words %in% city[, 1]) > 0,
         paste(words[which(words %in% city[, 1])], collapse = "_"),
         "NONE")
})
[1] "NONE" "NONE" "bangalore" "bkc"
Desired Output:
> txt
txt city
1 my friend stays in adarsh nagar adarsh nagar
2 I changed one apple one samsung S3 n one sony experia z. NONE
3 Hi girls..Friends meet at bangalore bangalore
4 what do u think of ccd at bkc bkc
I want a faster approach in R that does the same thing as the for loop above. Please advise. Thanks!
Here's a possibility using stri_extract_first_regex from the stringi package:
library(stringi)
# prepare some data
df <- data.frame(txt = c("in adarsh nagar", "sony experia z", "at bangalore"))
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore")
df$city <- stri_extract_first_regex(str = df$txt, paste(city, collapse = "|"))
df
# txt city
# 1 in adarsh nagar adarsh nagar
# 2 sony experia z <NA>
# 3 at bangalore bangalore
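If you also need whole-word matching and "NONE" instead of NA (as in the desired output), a small tweak should do it. This is only a sketch, reusing the df and city objects from above:
# add word boundaries around each city before collapsing into one pattern
pattern <- paste0("\\b(", paste(city, collapse = "|"), ")\\b")
df$city <- stri_extract_first_regex(df$txt, pattern)
df$city[is.na(df$city)] <- "NONE"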
This should be much faster:
bigPattern <- paste('(\\b', city[, 1], '\\b)', collapse = '|', sep = '')
txt$city <- sapply(regmatches(txt$txt, gregexpr(bigPattern, txt$txt)),
                   FUN = function(x) ifelse(length(x) == 0, 'NONE',
                                            paste(unique(x), collapse = '_')))
Explanation:
In the first line we build one big regular expression matching all the cities, e.g.:
(\\bahmedabad\\b)|(\\badarsh nagar\\b)|(\\bairoli\\b)| ...
Then we use gregexpr in combination with regmatches; this way we get, for each element of txt$txt, the list of its matches.
Finally, with a simple sapply, for each element of the list we concatenate the matched cities (after removing duplicates, i.e. cities mentioned more than once).
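For example, applied to a small data frame mirroring the question (a sketch; txt and city are rebuilt here with stringsAsFactors = FALSE and a made-up column name for city):
txt <- data.frame(txt = c("my friend stays in adarsh nagar",
                          "Hi girls..Friends meet at bangalore"),
                  stringsAsFactors = FALSE)
city <- data.frame(city = c("adarsh nagar", "bangalore", "bkc"),
                   stringsAsFactors = FALSE)
bigPattern <- paste('(\\b', city[, 1], '\\b)', collapse = '|', sep = '')
txt$city <- sapply(regmatches(txt$txt, gregexpr(bigPattern, txt$txt)),
                   FUN = function(x) ifelse(length(x) == 0, 'NONE',
                                            paste(unique(x), collapse = '_')))
txt$city
# [1] "adarsh nagar" "bangalore"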
Try this:
# YOUR DATA
##########
txt <- readLines(n = 4)
my friend stays in adarsh nagar and airoli
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
city <- readLines(n = 8)
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
# MATCHING
##########
matches <- unlist(setNames(lapply(city, grep, x = txt, fixed = TRUE),
                           city))
(res <- sapply(1:length(txt), function(x)
  paste0(names(matches)[matches == x], collapse = "___")))
# [1] "adarsh nagar___airoli" ""
# [3] "bangalore" "bkc"
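Two caveats worth noting: grep(..., fixed = TRUE) does plain substring matching (no word boundaries), and rows with no match come back as an empty string. If you want "NONE" there instead, to mirror the desired output:
res[res == ""] <- "NONE"
res
# [1] "adarsh nagar___airoli" "NONE" "bangalore" "bkc"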
The following names are in a column. I want to retain just five distinct names and replace the rest with "Other". How do I go about that?
df <- data.frame(names = c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm',
'Image Comics',NA,'Icon Comics',
'SyFy','Hanna-Barbera','George Lucas','Team Epic TV','South Park',
'HarperCollins','ABC Studios','Universal Studios','Star Trek','IDW Publishing',
'Shueisha','Sony Pictures','J. K. Rowling','Titan Books','Rebellion','Microsoft',
'J. R. R. Tolkien'))
If I am understanding you correctly, use %in% and ifelse. Here, I chose the first five names as an example. I also put the result in a new column, but you could overwrite the original column or create a separate vector instead:
df <- data.frame(names = c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm',
'Image Comics',NA,'Icon Comics',
'SyFy','Hanna-Barbera','George Lucas','Team Epic TV','South Park',
'HarperCollins','ABC Studios','Universal Studios','Star Trek','IDW Publishing',
'Shueisha','Sony Pictures','J. K. Rowling','Titan Books','Rebellion','Microsoft',
'J. R. R. Tolkien'))
fivenamez <- c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm')
df$names_transformed <- ifelse(df$names %in% fivenamez, df$names, "Other")
# names names_transformed
# 1 Marvel Comics Marvel Comics
# 2 Dark Horse Comics Dark Horse Comics
# 3 DC Comics DC Comics
# 4 NBC - Heroes NBC - Heroes
# 5 Wildstorm Wildstorm
# 6 Image Comics Other
# 7 <NA> Other
# 8 Icon Comics Other
# 9 SyFy Other
If you want to keep NA values as NA, just use df$names_transformed <- ifelse(df$names %in% fivenamez | is.na(df$names), df$names, "Other")
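A quick check of that NA-preserving variant (a sketch, reusing df and fivenamez from above and assuming the names column is character, i.e. stringsAsFactors = FALSE):
df$names_transformed <- ifelse(df$names %in% fivenamez | is.na(df$names),
                               df$names, "Other")
df[6:8, ]
#          names names_transformed
# 6 Image Comics             Other
# 7         <NA>              <NA>
# 8  Icon Comics             Other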
You can also use case_when from dplyr. The following code keeps Marvel, Dark Horse, DC Comics, J. K. Rowling and George Lucas the same and changes all others to "Other". It is functionally the same as u/jpsmith's answer, but (in my humble opinion) offers a little more flexibility, because you can change multiple values more easily or map different names to the same label should you choose to do so (see the sketch after the code below).
library(dplyr)

df <- df %>%
  mutate(new_names = case_when(names == 'Marvel Comics' ~ 'Marvel Comics',
                               names == 'Dark Horse Comics' ~ 'Dark Horse Comics',
                               names == 'DC Comics' ~ 'DC Comics',
                               names == 'George Lucas' ~ 'George Lucas',
                               names == 'J. K. Rowling' ~ 'J. K. Rowling',
                               TRUE ~ "Other"))
I have two data.frames, one containing partial names and the other one containing full names, as follows:
partial <- data.frame("partial.name" = c("Apple", "Apple", "WWF",
                                          "wizz air", "WeMove.eu", "ILU"))
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe",
                                   "World Wide Fundation (WWF)", "(ILU)", "Ilusion"))
In an ideal world, I would love to have a table like this (my real partial df has 12,794 rows):
print(partial)
partial full
Apple Apple Inc
Apple Apple Inc
WWF World Wide Fundation (WWF)
wizz air wizzair
WeMove.eu We Move Europe
... 12,794 rows in total
For every row without a match, I would like the value to be NA.
I tried many things: fuzzyjoin with regex, regex_left_join, even the sqldf package. I get some results, but I know it would be better if regex_left_join understood that I am looking for whole words. I know stringr has boundary(type = "word"), but I do not know how to implement it here.
For now, I have just prepared the partial df to get rid of the non-alphanumeric characters and to make everything lowercase:
partial$regex <- str_squish((str_replace_all(partial$partial.name, regex("\\W+"), " ")))
partial$regex <- tolower(partial$regex)
How can I match partial$partial.name with full$full.name based on the maximum number of words in common?
Partial string matching is time-consuming to get right. I believe the Jaro-Winkler distance is a good candidate, but you will need to spend time tweaking the parameters. Here's an example to get you going.
library(stringdist)
partial <- data.frame( "partial.name" = c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU", 'None'), stringsAsFactors = F)
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe", "World Wide Foundation (WWF)", "(ILU)", "Ilusion"), stringsAsFactors = F)
mydist <- function(partial, list_of_fulls, method = 'jw', p = 0, threshold = 0.4) {
  find_dist <- function(first, second, method = method, p = p) {
    stringdist(a = first, b = second, method = method, p = p)
  }
  distances <- unlist(lapply(list_of_fulls,
                             function(full) find_dist(first = full, second = partial,
                                                      method = method, p = p)))
  # If the distance is too great, assume NA
  if (min(distances) > threshold) {
    NA
  } else {
    closest_index <- which.min(distances)
    list_of_fulls[closest_index]
  }
}

partial$match <- unlist(lapply(partial$partial.name,
                               function(partial) mydist(partial = partial,
                                                        list_of_fulls = full$full.name,
                                                        method = 'jw')))
partial
# partial.name match
#1 Apple Apple Inc
#2 Apple Apple Inc
#3 WWF World Wide Foundation (WWF)
#4 wizz air wizzair
#5 WeMove.eu We Move Europe
#6 ILU (ILU)
#7 None <NA>
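If the lapply loop gets slow on the full 12,794 rows, a vectorised variant built on stringdistmatrix may be worth trying. This is only a sketch on the same partial/full data, with the same 0.4 cutoff; the match2 column name is mine:
# full distance matrix: one row per partial name, one column per full name
d <- stringdistmatrix(partial$partial.name, full$full.name, method = "jw", p = 0)
best <- apply(d, 1, which.min)
partial$match2 <- ifelse(apply(d, 1, min) > 0.4, NA, full$full.name[best])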
I want to match cities with regions in a data frame. The columns are a little bit messy, so I would like to extract the names of the cities/regions that appear in both columns, as in the following example.
A <- c("Berlin",
"Hamburg",
"Munich",
"Stuttgart",
"Rhein Main Frankfurt",
"Hannover")
B <- c("Berlin Brandenburg",
"Hamburg",
"Munich Bayern",
"Region Stuttgart",
"Main Rhein Darmstadt",
"Wiesbaden")
The resulting column / data frame should look like this:
result <- c("Berlin",
"Hamburg",
"Munich",
"Stuttgart",
"Rhein Main",
NA
)
df <- data.frame(A, B, result)
It doesn't matter whether the result is "Rhein Main" or "Main Rhein".
Thank you for your help!
Maybe I am missing a smart regex trick, but one option would be to split the strings into words and find the common words using intersect.
df$Result <- mapply(function(x, y) paste0(intersect(x, y), collapse = " "),
strsplit(df$A, '\\s+'), strsplit(df$B, '\\s+'))
df
# A B Result
#1 Berlin Berlin Brandenburg Berlin
#2 Hamburg Hamburg Hamburg
#3 Munich Munich Bayern Munich
#4 Stuttgart Region Stuttgart Stuttgart
#5 Rhein Main Frankfurt Main Rhein Darmstadt Rhein Main
#6 Hannover Wiesbaden
This returns an empty string when there is no match. You can turn the empty string into NA if needed.
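A one-liner like this should do it:
df$Result[df$Result == ""] <- NA   # or dplyr::na_if(df$Result, "")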
I want to extract part of the data from one column and paste it into another column using R.
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate the phone number from the country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use the tidyverse, which provides a separate() function (from tidyr).
Note that I use NA rather than NULL inside the mobile vector.
Also, I use stringsAsFactors = FALSE when creating the data frame to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
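If you'd rather stay in base R, something along these lines should give the same result. This is only a sketch, assuming the NA-based data frame defined above (before the pipe) and the +digits format shown in the question:
num <- sub(".*\\s(\\+[0-9]+)$", "\\1", data$country)    # pull a trailing +number, if any
num[!grepl("\\+[0-9]+$", data$country)] <- NA           # no number present -> NA
data$mobile  <- ifelse(is.na(data$mobile), num, data$mobile)
data$country <- sub("\\s*\\+[0-9]+$", "", data$country) # drop the number from country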
I have a fairly straightforward question, but I am very new to R and struggling a little. Basically I need to delete duplicate rows and then change the remaining unique row based on the number of duplicates that were deleted.
In the original file I have directors and the company boards they sit on, with directors appearing in a new row for each company. I want each director to appear only once, but with a column that lists the number of their board seats (i.e. 1 + the number of duplicates that were removed) and a column that lists the names of all companies on whose boards they sit.
So I want to go from this:
To this
Bonus if I can also get the code to list the director's "home company", i.e. the company at which she/he is an executive rather than an outsider.
Thanks so very much in advance!
N
You could use the ddply function from the plyr package:
#First I will enter a part of your original data frame
Name <- c('Abbot, F', 'Abdool-Samad, T', 'Abedian, I', 'Abrahams, F', 'Abrahams, F', 'Abrahams, F')
Position <- c('Executive Director', 'Outsider', 'Outsider', 'Executive Director','Outsider', 'Outsider')
Companies <- c('ARM', 'R', 'FREIT', 'FG', 'CG', 'LG')
NoBoards <- c(1,1,1,1,1,1)
df <- data.frame(Name, Position, Companies, NoBoards)
# Then you could concatenate the Positions and Companies for each Name
library(plyr)
sumPosition <- ddply(df, .(Name), summarize, Position = paste(Position, collapse=", "))
sumCompanies <- ddply(df, .(Name), summarize, Companies = paste(Companies, collapse=", "))
# Merge the results into one data frame, using the name to join them
df2 <- merge(sumPosition, sumCompanies, by = 'Name')
# Summarize the number of boards (NoBoards) for each Name
names_NoBoards <- aggregate(df$NoBoards, by = list(df$Name), sum)
names(names_NoBoards) <- c('Name', 'NoBoards')
# Merge the result with df2
df3 <- merge(df2, names_NoBoards, by = 'Name')
You get something like this:
Name Position Companies NoBoards
1 Abbot, F Executive Director ARM 1
2 Abdool-Samad, T Outsider R 1
3 Abedian, I Outsider FREIT 1
4 Abrahams, F Executive Director, Outsider, Outsider FG, CG, LG 3
In order to list the director's "home company" as the company at which she/he is an executive rather than an outsider, you could use the following code:
ExecutiveDirector <- df[df$Position == 'Executive Director', c(1, 3)]
df4 <- merge(df3, ExecutiveDirector, by = 'Name', all.x = TRUE)
You get the following data frame:
Name Position Companies.x NoBoards Companies.y
1 Abbot, F Executive Director ARM 1 ARM
2 Abdool-Samad, T Outsider R 1 <NA>
3 Abedian, I Outsider FREIT 1 <NA>
4 Abrahams, F Executive Director, Outsider, Outsider FG, CG, LG 3 FG
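For what it's worth, the same summary can also be written with dplyr instead of plyr. This is a sketch on the same df (the Home column name is mine); note that Home is computed first, before Position and Companies are collapsed, so it still sees the original per-row values:
library(dplyr)

df %>%
  group_by(Name) %>%
  summarise(Home      = Companies[Position == 'Executive Director'][1],
            Position  = paste(Position, collapse = ', '),
            Companies = paste(Companies, collapse = ', '),
            NoBoards  = sum(NoBoards))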