Get row index to update another dataframe in loop - r

I have the following data (updated in an edit):
df <- data.frame(
  id = c(432, 324, 322, 331, 242, 443, 223),
  desc1 = c("metal", "steels&iron", "irons\\copper", "sports material",
            "leather material", "durable goods", "electronic store"),
  store_names = c("ik bros", "steel idrs", "kb materials", "ca pty (ltd)",
                  "bkk stores", "k/k \\shop", "h/j & jj"),
  class = c("", "unknown", "", "sports", "unknown", "unknown", "")
)
I want to search for keywords in both desc1 and store_names and assign a string value to the class column. For example, the keywords can be:
indus_1 <- c("iron", "steel")
goods_store_1 <- c("goods", "store", "stores")
electr_1 <- c("electronic", "chips", "semiconductor")
unlabelled_1 <- c("leather")
Here the variable names indus_1, goods_store_1 and so on will be used to assign a string value to class; for instance, if the keyword "iron" is found, I assign indus (the name after stripping away "_1") to class. In my approach, I find the indices of rows where keywords are found and copy them to a copy of the same data frame, but this takes quite long for a larger dataset, and it may miss a few classes because I am using \\b to find exact matches. Here is the expected output:
id  desc1            store_names  class
432 metal            ik bros
324 steels&iron      steel idrs   indus
322 irons\\copper    kb materials indus
331 sports material  ca pty (ltd) sports
242 leather material bkk stores   unlabelled
443 durable goods    k/k \\shop   goods_store
223 electronic store h/j & jj     electr
I am looking for a more efficient method to do the same; a fully dplyr version would be preferable. Thanks for suggestions.
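For reference, the index-and-copy approach described above might look roughly like the sketch below (the original loop is not shown in the question; this is only a guess at its shape):
## hypothetical reconstruction of the slow approach: one grep per keyword,
## writing the class label into a copy of the data frame
df2 <- df
for (kw in indus_1) {
  idx <- grep(sprintf("\\b%s\\b", kw), df$desc1)
  df2$class[idx] <- "indus"
}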

In that case, you could do:
library(tidyverse)

vars_1 <- mget(ls(pattern = '_1'))
vars_1 <- vars_1[!grepl('vars', names(vars_1))]  # drop vars_1 itself on a re-run

## named vector: names are the regexes, values are the class labels
pat <- sub("_1", "", names(vars_1))
names(pat) <- sprintf(".*(%s).*", sapply(vars_1, paste, collapse = "|"))

df %>%
  mutate(class = str_replace_all(invoke(str_c, across(starts_with('desc'))), pat))
   id            desc1        desc2       class
1 432   iron and metal      ik bros       indus
2 324  sports material ca pty (ltd)      sports
3 322 leather material   bkk stores  unlabelled
4 331    durable goods   k/k \\shop goods_store
5 242 electronic goods     h/j & jj      electr
(This output corresponds to the version of df before the edit, which still had a desc2 column.)
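The key here is that str_replace_all() accepts a named vector, with names as the patterns and values as the replacements; a quick illustration:
library(stringr)
## the name is the regex pattern, the value is the replacement
str_replace_all("durable goods", c(".*(goods|store|stores).*" = "goods_store"))
#> [1] "goods_store"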

Logically my answer is similar to @Onyambu's answer, but with a few tweaks.
library(tidyverse)

## build a named vector: regex patterns as names, class labels as values
mget(ls(pattern = '_1')) %>%
  stack() %>%
  group_by(ind = sub('_1', '', ind)) %>%
  summarise(values = sprintf('.*\\b(%s)\\b.*', paste0(values, collapse = '|'))) %>%
  select(2, 1) %>%
  deframe() -> pat
pat
# .*\\b(electronic|chips|semiconductor)\\b.*  .*\\b(goods|store|stores)\\b.*
#                                   "electr"                   "goods_store"
#                     .*\\b(iron|steel)\\b.*            .*\\b(leather)\\b.*
#                                    "indus"                    "unlabelled"
df %>%
  mutate(class2 = str_replace_all(desc1, pat),
         class2 = ifelse(desc1 == class2, '', class2))
#   id            desc1  store_names   class      class2
# 1 432            metal      ik bros
# 2 324      steels&iron   steel idrs unknown       indus
# 3 322    irons\\copper kb materials
# 4 331  sports material ca pty (ltd)  sports
# 5 242 leather material   bkk stores unknown  unlabelled
# 6 443    durable goods   k/k \\shop unknown goods_store
# 7 223 electronic store     h/j & jj              electr
For id = 322 it doesn't match indus because we are looking for an exact word match: indus_1 has iron, whereas the desc1 column has irons.
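If you do want those partial matches, one option (a sketch, not part of the original answer) is to build pat without the \\b word boundaries, at the cost of possible false positives (e.g. "store" would also match inside "restored"):
library(tidyverse)
## same pipeline as above, minus the \\b anchors, so "iron" also matches "irons"
mget(ls(pattern = '_1')) %>%
  stack() %>%
  group_by(ind = sub('_1', '', ind)) %>%
  summarise(values = sprintf('.*(%s).*', paste0(values, collapse = '|'))) %>%
  select(2, 1) %>%
  deframe() -> pat2

df %>% mutate(class2 = str_replace_all(desc1, pat2))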


Use regex to replace duplicate phrases

I need to parse large data files and for reasons unknown the addresses are sometimes repeated, like this:
library(data.table)
d <- data.table(
  name = c("bill", "tom"),
  address = c("35 Valerie Avenue / 35 Valerie Avenue",
              "702 / 9 Paddock Street / 702 / 9 Paddock Street")
)
I have figured out how to de-dupe the easy ones (e.g. "35 Valerie Avenue / 35 Valerie Avenue") with the following:
replace.dupe.addresses <- function(x){
  rep_expr <- "^(.*)/(.*)$"
  ## rows where the text after the last "/" equals the text before it
  idx <- grepl("/", x) & (trimws(sub(rep_expr, "\\2", x)) == trimws(sub(rep_expr, "\\1", x)))
  x[idx] <- trimws(sub(rep_expr, "\\1", x[idx]))
  x
}
d[,address := replace.dupe.addresses(address)]
But this doesn't work for addresses where the critical '/' is further embedded. I have tried this as my regex: rep_expr<- "^(.*)[:alpha:][:space:]?/(.*)$"
but this doesn't work. What regex expression would capture both of these repeating phrases?
See if this works for your dataset
library(data.table)
d[, .(name, address = lapply(strsplit(address, " / "), function(x)
paste(x[!duplicated(x)], collapse=" / "))), by=.I]
name address
1: bill 35 Valerie Avenue
2: tom 702 / 9 Paddock Street
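Since strsplit() already returns one list element per row, the row-wise grouping can also be dropped entirely; a sketch of the same idea:
library(data.table)
## strsplit() is vectorised over rows, so no by = .I is needed here
d[, .(name, address = sapply(strsplit(address, " / "), function(x)
  paste(x[!duplicated(x)], collapse = " / ")))]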
Please check the below code
library(tidyverse)
d %>% separate_rows(address, sep = '\\/') %>% mutate(address = trimws(address)) %>%
  group_by(name, address) %>% slice_head(n = 1) %>% group_by(name) %>%
  mutate(address = paste(address, collapse = '/')) %>% slice_head(n = 1)
# A tibble: 2 × 2
# Groups: name [2]
name address
<chr> <chr>
1 bill 35 Valerie Avenue
2 tom 702/9 Paddock Street
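A slightly shorter variant of the same idea (a sketch using distinct() instead of the grouped slice_head() calls):
library(tidyverse)
d %>%
  separate_rows(address, sep = ' / ') %>%
  distinct(name, address) %>%
  group_by(name) %>%
  summarise(address = paste(address, collapse = ' / '))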
Split on forward slash, then get unique and paste it back:
sapply(strsplit(d$address, " / ", fixed = TRUE),
function(i) paste(unique(i), collapse = "/"))
# [1] "35 Valerie Avenue" "702/9 Paddock Street"
Try using a regex
gsub("((?:\\S+\\s+/\\s+)?(?:\\S+\\s+){2}\\S+)\\s+/\\s+\\1+", "\\1", d$address)
[1] "35 Valerie Avenue" "702 / 9 Paddock Street"

Combine every two rows of data in R

I have a csv file that I have read in, but I now need to combine every two rows into one. There are 2000 rows in total, and I need to reduce that to 1000. Each pair of rows shares the same account number in one column and has the address split across the two rows in another column. For example, rows 1 and 2 are Acct# 1234 and have "123 Hollywood Blvd" and "LA California 90028" on their own lines respectively.
Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)
df %>%
  group_by(Acct) %>%
  summarise(Address = str_c(Address, collapse = " "))
# A tibble: 2 × 2
Acct Address
<dbl> <chr>
1 1234 123 Hollywood Blvd LA California 90028
2 4321 55 Park Avenue NY New York State 6666
Data:
df <- data.frame(
  Acct = c(1234, 1234, 4321, 4321),
  Address = c("123 Hollywood Blvd", "LA California 90028",
              "55 Park Avenue", "NY New York State 6666")
)
It can be fairly simple with the data.table package:
# assuming `dataset` is the name of your dataset, the column with the account
# number is called 'actN' and the column with the address is 'adr'
library(data.table)
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
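Applied to the example df from the previous answer (here Acct plays the role of actN and Address of adr), this should give something like:
library(data.table)
data.table(df)[, .(whole = paste0(Address, collapse = ", ")), by = .(Acct)]
#    Acct                                    whole
# 1: 1234 123 Hollywood Blvd, LA California 90028
# 2: 4321   55 Park Avenue, NY New York State 6666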

Grepl and Extract the Match in R

In R I have:
library(tidyverse)
full_names <- tibble(
  FIRM = c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE", "TESLA INC.", "ABBOTT LABORATORIES"),
  TICKER = c("AAPL", "MSFT", "GOOGL", "TSLA", "ABT"),
  ID = c(111, 222, 333, 444, 555)
) # a dataset with full names of firms, including some IDs
abbr_names <- c("Abbott", "Apple", "Coca-Cola", "Pepsi", "Microsoft", "Tesla") # a vector with abbreviated names of firms
I want to check whether the abbreviated names appear in the full-names dataset and, where they do, match the full_names row to the abbr_names vector, like:
    [1]       [2]                   [3]  [4]
[1] Abbott    ABBOTT LABORATORIES   ABT  555
[2] Apple     APPLE INC.            AAPL 111
[3] Microsoft MICROSOFT CORPORATION MSFT 222
[4] Tesla     TESLA INC.            TSLA 444
Tried several str_extract and grepl functions, but could not make it work yet.
matches <- unlist(sapply(toupper(abbr_names), grep, x = full_names$FIRM, value = TRUE))
That will give you a vector with the abbreviations as names and the firms as values:
names(matches)
# [1] "ABBOTT"    "APPLE"     "MICROSOFT" "TESLA"
c(matches, use.names = FALSE)
# [1] "ABBOTT LABORATORIES"   "APPLE INC."            "MICROSOFT CORPORATION" "TESLA INC."
There are a variety of ways to put this together from there. Following @Oscar's comment, we get the desired output with a total of two lines of code:
matches <- unlist(sapply(toupper(abbr_names), grep, x = full_names$FIRM, value = TRUE))
tibble(ABBR_FIRM = names(matches), FIRM = matches) %>%
  left_join(full_names, by = "FIRM")
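which should give something like:
# A tibble: 4 × 4
  ABBR_FIRM FIRM                  TICKER    ID
  <chr>     <chr>                 <chr>  <dbl>
1 ABBOTT    ABBOTT LABORATORIES   ABT      555
2 APPLE     APPLE INC.            AAPL     111
3 MICROSOFT MICROSOFT CORPORATION MSFT     222
4 TESLA     TESLA INC.            TSLA     444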
how about this?
library(dplyr)
full_names$row_num <- 1:nrow(full_names)
do.call(rbind,
        lapply(abbr_names,
               function(x){
                 if(sum(grepl(x, full_names$FIRM, ignore.case = TRUE)) > 0){
                   row <- grepl(x, full_names$FIRM, ignore.case = TRUE) %>% which()
                 } else {row <- 0}
                 data.frame("name" = x, "row_num" = row)
               })) %>%
  right_join(full_names, by = "row_num")
My advice: convert all the words to upper case or lower case first. That makes comparisons with functions such as grepl easier.
My code:
library(tidyverse)
full_names <- tibble(FIRM = c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE", "TESLA INC.", "ABBOTT LABORATORIES"),
TICKER = c("AAPL", "MSFT", "GOOGL", "TSLA", "ABT"),
ID = c(111, 222, 333, 444, 555)) # a dataset with full names of firms, including some IDs
abbr_names <- c("Abbott", "Apple", "Coca-Cola", "Microsoft", "Tesla") # a vector with abbreviated names of firms
Here I create a new column to hold the matches returned by grepl:
full_names$new_column <- NA
Then I loop over the names that we want to match in the data frame:
for(i in 1:length(abbr_names)){
  ## compare the first four characters, in lower case
  search_test <- grepl(tolower(substr(abbr_names[i], 0, 4)), tolower(full_names$FIRM))
  position <- grep("TRUE", search_test)
  full_names$new_column[position] <- abbr_names[i]
}
The result is the following data frame:
                   FIRM TICKER  ID new_column
1            APPLE INC.   AAPL 111      Apple
2 MICROSOFT CORPORATION   MSFT 222  Microsoft
3                GOOGLE  GOOGL 333         NA
4            TESLA INC.   TSLA 444      Tesla
5   ABBOTT LABORATORIES    ABT 555     Abbott
"GOOG" is not in the abbr_names vector, so the return is NA
Another option might be, e.g., this:
library(tidyverse)
map_int(abbr_names, ~ {
  idx <- grep(., full_names$FIRM, ignore.case = TRUE)
  if (length(idx) == 0) return(NA_integer_) else return(idx)
}) %>%
  cbind(ABBR = abbr_names, FIRM = full_names$FIRM[.]) %>%
  as_tibble() %>%
  left_join(full_names, by = "FIRM") %>%
  complete(FIRM)
# A tibble: 4 x 5
  FIRM                  .     ABBR      TICKER    ID
  <chr>                 <chr> <chr>     <chr>  <dbl>
1 ABBOTT LABORATORIES   5     Abbott    ABT      555
2 APPLE INC.            1     Apple     AAPL     111
3 MICROSOFT CORPORATION 2     Microsoft MSFT     222
4 TESLA INC.            4     Tesla     TSLA     444
Just wanted to still post it :)

how to extract substrings from a string like "Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander" using R and stringr

I have some property sale data downloaded from Internet. It is a PDF file. When I copy and paste the data into a text file, it looks like this:
> a
[1] "Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander" "Albert Park 106 Graham St 2 br h $0 SP RT Edgar"
Let's take the first line as an example. Every row is a record of a property, including suburb (Airport West), address (1/26 Cameron St), the count of bedrooms (3), property type (t), price ($830000), and sale type (S). The last part (Nelson Alexander) is the agent, which I do not need here.
I want to analyse this data. I need to extract the information first. I hope I can get the data like this: (b is a data frame)
> b
Suburb Address Bedroom PropertyType Price SoldType
1 Airport West 1/26 Cameron St 3 t 830000 S
2 Albert Park 106 Graham St 2 h 0 SP
Could anyone please tell me how to use stringr package or other methods to split the long string into the sub strings that I need?
1) gsubfn::read.pattern: read.pattern in the gsubfn package takes a regular expression whose capture groups (the parts within parentheses) are taken to be the fields of the input, and a data frame is created to assemble them.
library(gsubfn)
pat <- "^(.*?) (\\d.*?) (\\d) br (.) [$](\\d+) (\\w+) .*"
cn <- c("Suburb", "Address", "Bedroom", "PropertyType", "Price", "SoldType")
read.pattern(text = a, pattern = pat, col.names = cn, as.is = TRUE)
giving this data.frame:
Suburb Address Bedroom PropertyType Price SoldType
1 Airport West 1/26 Cameron St 3 t 830000 S
2 Albert Park 106 Graham St 2 h 0 SP
2) no packages: This could also be done without any packages, like this (pat and cn are from above):
replacement <- "\\1,\\2,\\3,\\4,\\5,\\6"
read.table(text = sub(pat, replacement, a), col.names = cn, as.is = TRUE, sep = ",")
Note: The input a in reproducible form is:
a <- c("Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander",
"Albert Park 106 Graham St 2 br h $0 SP RT Edgar")

how to read text files and create a data frame in R

I need to read the txt file at
https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt
and convert it into a data frame in R with the columns LastName, FirstName, streetno, streetname, city, state, and zip.
I tried to use a sep argument to separate the fields but failed.
Expanding on my comments, here's another approach. You may need to tweak some of the code if your full data set has a wider range of patterns to account for.
library(stringr) # For str_trim
# Read string data and split into data frame
dat = readLines("addr.txt")
dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}")), stringsAsFactors=FALSE)
names(dat) = c("LastName", "FirstName", "address", "city", "state", "zip")
# Separate address into number and street (if streetno isn't always numeric,
# or if you don't want it to be numeric, then just remove the as.numeric wrapper).
dat$streetno = as.numeric(gsub("([0-9]{1,4}).*","\\1", dat$address))
dat$streetname = gsub("[0-9]{1,4} (.*)","\\1", dat$address)
# Clean up zip
dat$zip = gsub("O","0", dat$zip)
dat$zip = str_trim(dat$zip)
dat = dat[,c(1:2,7:8,4:6)]
dat
   LastName FirstName streetno          streetname     city state        zip
1     Bania Thomas M.      725   Commonwealth Ave.   Boston    MA      02215
2   Barnaby     David      373       W. Geneva St. Wms. Bay    WI      53191
3    Bausch      Judy      373       W. Geneva St. Wms. Bay    WI      53191
...
41   Wright      Greg      791 Holmdel-Keyport Rd.  Holmdel    NY 07733-1988
42  Zingale   Michael     5640       S. Ellis Ave.  Chicago    IL      60637
Try this.
x <- scan("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt",
          what = list(LastName = "", FirstName = "", streetno = "", streetname = "",
                      city = "", state = "", zip = ""))
data <- as.data.frame(x)
I found it easiest to fix up the file into a csv by adding the commas where they belong, then read it.
## get the page as text
txt <- RCurl::getURL(
"https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt"
)
## fix the EOL (end-of-line) markers
g1 <- gsub(" \n", "\n", txt, fixed = TRUE)
## read it
df <- read.csv(
## add most comma-separators, then the last for the house number
text = gsub("(\\d+) (\\D+)", "\\1,\\2", gsub("\\s{2,}", ",", g1)),
header = FALSE,
## set the column names
col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip")
)
## result
head(df)
# LastName FirstName streetno streetname city state zip
# 1 Bania Thomas M. 725 Commonwealth Ave. Boston MA O2215
# 2 Barnaby David 373 W. Geneva St. Wms. Bay WI 53191
# 3 Bausch Judy 373 W. Geneva St. Wms. Bay WI 53191
# 4 Bolatto Alberto 725 Commonwealth Ave. Boston MA O2215
# 5 Carlstrom John 933 E. 56th St. Chicago IL 60637
# 6 Chamberlin Richard A. 111 Nowelo St. Hilo HI 96720
Here your problem is not how to use R to read in this data; rather, it's that your data is not sufficiently structured, with no regular delimiters between the variable-length input fields. In addition, the zip code field contains some alpha "O" characters that should be "0".
So here is a way to use regular-expression substitution to add delimiters, and then parse the delimited text using read.csv(). Note that, depending on exceptions in your full set of text, you may need to adjust the regular expressions. I have done them step by step here to make it clear what is being done, and so that you can adjust them as you find exceptions in your input text. (For instance, some city names, like "Wms. Bay", are two words.)
addr.txt <- readLines("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt")
addr.txt <- gsub("\\s+O(\\d{4})", " 0\\1", addr.txt) # replace O with 0 in zip
addr.txt <- gsub("(\\s+)([A-Z]{2})", ", \\2", addr.txt) # state
addr.txt <- gsub("\\s+(\\d{5}(\\-\\d{4}){0,1})\\s*", ", \\1", addr.txt) # zip
addr.txt <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", addr.txt) # streetno
addr.txt <- gsub("(^\\w*)(\\s+)", "\\1, ", addr.txt) # LastName (FirstName)
addr.txt <- gsub("\\s{2,}", ", ", addr.txt) # city, by elimination
addr <- read.csv(textConnection(addr.txt), header = FALSE,
                 col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip"),
                 stringsAsFactors = FALSE)
head(addr)
## LastName FirstName streetno streetname city state zip
## 1 Bania Thomas M. 725 Commonwealth Ave. Boston MA 02215
## 2 Barnaby David 373 W. Geneva St. Wms. Bay WI 53191
## 3 Bausch Judy 373 W. Geneva St. Wms. Bay WI 53191
## 4 Bolatto Alberto 725 Commonwealth Ave. Boston MA 02215
## 5 Carlstrom John 933 E. 56th St. Chicago IL 60637
## 6 Chamberlin Richard A. 111 Nowelo St. Hilo HI 96720
