My df has a comments column, and I need to search the comments for multiple names using keywords (each comment has a lot of irrelevant information and does not necessarily give the full name). I was able to accomplish this with nested ifelse, but there is a limit of 50 for nesting and my list has grown to more than 200 names, so the code is very tedious and I don't want to be editing it each time (instead I want to upload an Excel file with the list of names and key search terms).
I am currently using this statement, which should give a clear understanding of what the relevant columns contain:
comdata$name <- ifelse(grepl('jen', comdata$comments), 'Jennifer A',
                ifelse(grepl('rick', comdata$comments) | grepl('richard', comdata$comments), 'richard',
                ifelse(grepl('summ', comdata$comments), 'Summer', 'Others')))
Is it possible to do this with a loop or some other way if I create a list of the names and the possible 'key' search terms?
Basically, I need the correct syntax for the code below, which currently just gives 'Other' for most of the rows in comdata$name:
comdata$name <- ifelse(comdata$comments %like% name_list$Key.1, name_list$FullName, 'Other')
Create a key/val dataset and use regex_left_join
keyval <- data.frame(comments = c("jen", "rick"),
name = c("Jennifer A", "richard"))
library(fuzzyjoin)
regex_left_join(comdata, keyval, by = "comments")
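If adding a package is not an option, the same lookup-table idea can be sketched in base R. This is only a minimal sketch and not the fuzzyjoin approach: the `key` and `name` column names and the sample comments are made up for illustration, and in practice `keyval` could be read from a spreadsheet export instead of being typed in.

```r
# Hypothetical sample data; column names are assumptions for illustration
comdata <- data.frame(
  comments = c("spoke to jen today", "rick called back", "richard emailed",
               "nothing relevant"),
  stringsAsFactors = FALSE
)

# One row per name; alternative search keys are combined with "|"
keyval <- data.frame(
  key  = c("jen", "rick|richard", "summ"),
  name = c("Jennifer A", "richard", "Summer"),
  stringsAsFactors = FALSE
)

comdata$name <- "Others"
# Walk the lookup table in reverse so that earlier rows take priority,
# mirroring the order of the nested ifelse() calls
for (i in rev(seq_len(nrow(keyval)))) {
  hit <- grepl(keyval$key[i], comdata$comments)
  comdata$name[hit] <- keyval$name[i]
}
```

Because the keys live in a data frame, growing the list to a few hundred names only means adding rows, not editing the code.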
The page in question is this:
https://tolltariffen.toll.no/tolltariff/headings/03.02?language=en (Click on OPEN ALL LEVELS to get the complete data)
I'm using RSelenium to load the page and then getting the pagesource and using rvest to capture the required field. This is the data I'm trying to capture.
The code I've come up with so far splits some of the description data into multiple chunks, which is not useful for me.
x <- remdr$getPageSource()
xpg <- read_html(x[[1]])
# get the HS descriptions
treeView <- xpg %>%
html_nodes(xpath = '//*/div[@class="MuiGrid-root MuiGrid-container MuiGrid-wrap-xs-nowrap"]') %>%
html_nodes(xpath = '//*/p[contains(@class, "MuiTypography-body1")]') %>%
html_nodes('span') %>%
html_text(trim = TRUE)
I need all the descriptions in order as a list.
Update: This is the output format. Descriptions and the 8-digit code
General thoughts:
RSelenium isn't strictly needed, and you can avoid the overhead of launching a browser. There is an API call, you can see in the browser network tab, which supplies the content of interest, and this can be called with no requirement for additional configuration of the request e.g. headers.
The question of how to extract the items you want from the API response, in the format you want, then becomes a fun challenge (at least to me) as we do not know 1) how many levels of nesting there may be in this response (and possible future ones) 2) whether the level of nesting can vary across listings within a given response for the items of interest 3) whether there will be a commodityCode at a given level (though the pattern appears to be that there is one at the deepest level for a given listing); and we need to consider how we generate columns/lists of equal length for output. These are just some starting considerations that I go on to discuss how I handled below.
The API call:
The API response:
This request returns nested JSON:
The content of interest is a list of named lists, within the response, accessible via the parent "key" $headingItems:
Each of these named lists is nested as per the levels on the webpage:
You can see the repeated accessor key of headingItems (red boxed), with the first shown above as the parent list stored in data in code to follow.
Below that, indicated by level (orange boxed), are the expanded entries you are after; nested within the response JSON.
Finally, we have the descriptions (green boxed) which contains html for the descriptive text you are after, with English and Norwegian versions of the text:
In addition to this, there is, where present, a commodityCode key within the nested headingItems:
Approach and challenges:
Given that the commodityCode can be at different levels and may not be present (unless assumed to always be present at greatest depth of a given listing), and that it is unknown how many levels of headingItem there can be, the approach I chose was to use regex to identify the relevant child named list's names in a boolean mask (though for purposes here we could just say logical vector); one mask for English headers and one for the commodity codes. I processed each child list separately, using purrr::map and applying a custom function to extract data as a data.table/data.frame.
Example mask (descriptions|text):
The TRUE values are for the following chained accessors (chaining dependent on depth):
Notice how some accessor paths are repeated. This means therefore, that I do not use the mask to retrieve the names and extract the associated values. Instead, I keep the TRUE and FALSE values and thereby have equal lengths for both vectors. I combine the two logical vectors as columns within a data.table; along with the entire set of values within the child list:
This work is done within the custom function get_data, where I also then do the following steps:
I filter for only rows where there is a TRUE value i.e. a value I wish to retrieve
Apply a function utilizing gsub(), to remove non-breaking whitespace, and read_html() to convert those descriptions which are actual html to text. N.B. Some entries are not actually html and are handled by the if statement. In those cases, the input value is returned:
At this point the codes and descriptions/text are in a single column:
I use the booleans in commodity_code to update that column's value, where TRUE, to match the text column, and wrap this in an if to replace FALSE with NA.
Knowing that there is actually a 1 row offset between description and associated code, where applicable, I then shift the commodity column values down one row to correctly align with descriptions:
I then keep only the rows where description_header_flag is TRUE:
Finally, I remove the now not needed flag column:
This leaves me with a clean data.table to return from the function.
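The flag, shift and filter steps above can be sketched on toy data. This is a base R illustration with made-up values and hypothetical accessor names (the answer's get_data uses data.table and purrr); it assumes, as described above, that each code sits one row before the description it belongs to.

```r
# Toy stand-in for one flattened child list (values and names are made up)
y <- c(
  "description.en"              = "Fish, fresh",
  "headingItems.commodityCode"  = "03021100",
  "headingItems.description.en" = "Trout",
  "headingItems.description.no" = "Orret",
  "headingItems.commodityCode"  = "03021300",
  "headingItems.description.en" = "Pacific salmon"
)

# One row per value, with a logical flag per kind of entry
tab <- data.frame(
  text = unname(y),
  description_header_flag = grepl("description\\.en$", names(y)),
  commodity_code = grepl("commodityCode$", names(y)),
  stringsAsFactors = FALSE
)

# Keep only rows that carry an English description or a code
tab <- tab[tab$description_header_flag | tab$commodity_code, ]

# Replace the logical flag with the code itself (NA elsewhere)
tab$commodity_code <- ifelse(tab$commodity_code, tab$text, NA)

# Shift codes down one row so each lands beside its description
tab$commodity_code <- c(NA, head(tab$commodity_code, -1))

# Keep the description rows and drop the helper flag
tab <- tab[tab$description_header_flag, c("text", "commodity_code")]
```

The top-level description has no code before it, so it ends up with NA in the code column.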
Generating the final output:
As map() applying the custom function above to a list returns a list of data.tables, I then simply call rbindlist() to combine these into a single data.table:
df <- rbindlist(map(data, get_data))
This can then be written to csv for example.
fwrite(df, 'result.csv')
Example rows in df:
N.B. I return a data.table as you showed 2 columns in your desired output.
R:
library(jsonlite)
library(tidyverse)
library(rvest)
library(data.table)
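get_data <- function(x) {
  # Flatten the nested child list into a single named vector
  y <- x %>% unlist(recursive = TRUE)
  # One row per value, with logical flags for English descriptions and commodity codes
  t <- data.table(
    text = y,
    description_header_flag = grepl("(?:headingItems\\.)description\\.en$|^description.en$", names(y)),
    commodity_code = grepl("*commodityCode$", names(y))
  )
  # Keep only the rows flagged as a description or a code
  t <- t[description_header_flag | commodity_code, ]
  # Replace non-breaking spaces; convert descriptions that are html to plain text
  t$text <- map2(t$text, t$description_header_flag, ~ gsub(intToUtf8(160), " ", if (.y & str_detect(.x, pattern = "<div>|<p>")) {
    html_text(read_html(.x))
  } else {
    .x
  }))
  # Turn the TRUE flags into the code values themselves; FALSE becomes NA
  t$commodity_code <- map2(t$commodity_code, t$text, ~ if (.x) {
    .y
  } else {
    NA
  })
  # Shift codes down one row so they sit beside their descriptions
  t[, commodity_code := c(NA, commodity_code[.I - 1])]
  # Keep description rows only, then drop the helper flag
  t <- t[description_header_flag == TRUE, ]
  t[, description_header_flag := NULL]
  return(t)
}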
data <- jsonlite::read_json("https://tolltariffen.toll.no/api/search/headings/03.02") %>% .$headingItems
df <- rbindlist(map(data, get_data))
fwrite(df, "result.csv")
Sample output:
Credits:
gsub solution taken from: @shabbychef here
row shift solution adapted from: @Gary Weissman here
I have an 'Agency_Reference' table containing a column 'agency_lookup', with 200 string entries such as:
alpha
beta
gamma etc..
I have a dataframe 'TEST' with a million rows containing a 'Campaign' column with entries such as :
Alpha_xt2010
alpha_xt2014
Beta_xt2016 etc..
I want to loop through each entry in the reference table, find which string is present within each entry of the Campaign column, and create a new agency_identifier column in the table.
My current code is below and is slow to execute. I am requesting guidance on how to optimize it, and I would like to learn how to do it the data.table way.
Agency_Reference <- data.frame(agency_lookup = c('alpha','beta','gamma','delta','zeta'))
TEST <- data.frame(Campaign = c('alpha_xt123','ALPHA345','Beta_xyz_34','BETa_testing','code_delta_'))
TEST$agency_identifier <- 0
for (agency_lookup in as.vector(Agency_Reference$agency_lookup)) {
  TEST$agency_identifier <- ifelse(grepl(tolower(agency_lookup), tolower(TEST$Campaign)), agency_lookup, TEST$agency_identifier)
}
Expected Output :
Campaign       agency_identifier
alpha_xt123    alpha
ALPHA345       alpha
Beta_xyz_34    beta
BETa_testing   beta
code_delta_    delta
Try
TEST <- data.frame(Campaign = c('alpha_xt123','ALPHA345','Beta_xyz_34','BETa_testing','code_delta_'))
pattern = tolower(c('alpha','Beta','gamma','delta','zeta'))
TEST$agency_identifier <- sub(pattern = paste0('.*(', paste(pattern, collapse = '|'), ').*'),
replacement = '\\1',
x = tolower(TEST$Campaign))
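One thing to check with the `sub()` approach: a campaign that matches none of the patterns comes back as the whole lowercased string rather than as a missing value. Below is a minimal base R sketch of one way to get `NA` instead, using `regexpr()` and `regmatches()`; the extra `'unknown_x1'` row is a made-up example, not part of the question's data.

```r
TEST <- data.frame(
  Campaign = c('alpha_xt123', 'ALPHA345', 'Beta_xyz_34', 'BETa_testing',
               'code_delta_', 'unknown_x1'),  # last row is hypothetical
  stringsAsFactors = FALSE
)
pattern <- c('alpha', 'beta', 'gamma', 'delta', 'zeta')

# One alternation regex, so the column is scanned once rather than once per agency
rx <- paste(pattern, collapse = '|')
m <- regexpr(rx, TEST$Campaign, ignore.case = TRUE)

# Default to NA, then fill in the lowercased match where one was found
TEST$agency_identifier <- NA_character_
TEST$agency_identifier[m != -1] <- tolower(regmatches(TEST$Campaign, m))
```

`regmatches()` returns only the strings that matched, which is why the assignment is restricted to the rows where `m != -1`.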
This will not answer your question per se, but from what I understand you want to dissect the Campaign column and do something with the values it provides.
Take a look at Tidy data, more specifically the part "Multiple variables stored in one column". I think you'll make some great progress using tidyr::separate. That way you don't have to use a for-loop.
First question post. Please excuse any formatting issues that may be present.
What I'm trying to do is conditionally replace a factor level in a dataframe column. Reason being due to unicode differences between a right single quotation mark (U+2019) and an apostrophe (U+0027).
All of the columns that need this replacement begin with "INN8", so I'm using
grep("INN8", colnames(demoDf)) -> apostropheFixIndices
for(i in apostropheFixIndices) {
levels(demoDfFinal[i]) <- c(levels(demoDf[i]), "I definitely wouldn't")
(insert code here)
}
to get the indices in order to perform the conditional replacement.
I've taken a look at a myriad of questions that involve naming variables on the fly, as well as how to assign values to dynamic variables. I have also explored the R-FAQ on turning a string into a variable and looked into Ari Friedman's suggestion that named elements in a list are preferred. However, I'm unsure as to the execution as well as the significance of the best practice suggestion.
I know I need to do something along the lines of
demoDf$INN8xx[demoDf$INN8xx=="I definitely wouldn’t"] <- "I definitely wouldn't"
but the iterations I've tried so far haven't worked.
Thank you for your time!
If I understand you correctly, then you don't want to rename the columns. Then this might work:
demoDf <- data.frame(A=rep("I definitely wouldn’t",10) , B=rep("I definitely wouldn’t",10))
newDf <- apply(demoDf, 2, function(col) {
gsub(pattern="’", replacement = "'", x = col)
})
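It just checks all columns for the wrong symbol. Note that apply() returns a character matrix, so wrap the result in as.data.frame() if you need a data frame back.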
Or if you have a vector containing the column indices you want to check then you could go with
# Let's say you identified columns 2, 5 and 8
cols <- c(2,5,8)
# Assign the cleaned columns back in one step (avoids <<- inside sapply)
demoDf[cols] <- lapply(demoDf[cols], function(col) {
  gsub(pattern="’", replacement = "'", x = col)
})
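Since the question is about factor columns, another option is to edit the factor levels directly, which keeps the columns as factors (gsub on the column itself returns character). A minimal sketch, assuming the affected columns are factors whose names start with "INN8"; the column names and values here are made up.

```r
# Hypothetical data: two factor columns to fix and one column to leave alone
demoDf <- data.frame(
  INN8a = factor(c("I definitely wouldn\u2019t", "Maybe")),
  INN8b = factor(c("Maybe", "I definitely wouldn\u2019t")),
  other = c("left alone", "left alone"),
  stringsAsFactors = FALSE
)

# Indices of the columns whose names start with "INN8"
fix_cols <- grep("^INN8", colnames(demoDf))

for (i in fix_cols) {
  # Swap U+2019 for U+0027 in the level labels; the underlying codes are untouched
  levels(demoDf[[i]]) <- gsub("\u2019", "'", levels(demoDf[[i]]), fixed = TRUE)
}
```

Note the double brackets: `demoDf[[i]]` is the column itself, whereas `demoDf[i]` is a one-column data frame with no levels of its own.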
More 'feels like it should be' simple stuff which seems to be eluding me today. Thanks in advance for assistance.
Within a loop, that's within a function, I'm trying to add a column, and name it based on a formula.
I can bind a column & its name is taken from the bound object: data<-cbind(data,bothdata)
I can bind a column & manually name the bound object: data<-cbind(data,newname=bothdata)
I can bind a column which is the product of an equation & manually name the bound object: data<-cbind(data,newname2=bothdata-1)
Or another way: data <- transform(data, newColumn = bothdata-1)
What I can't do is have the name be the product of a formula. My actual formula-derived example name is paste("E_wgt",rev(which(rev(Esteps) == q))-1,"%") & equation for column: baddata - q.
A simpler one: data<-cbind(data,paste("magic",100,"beans")=bothdata-1). This fails because cbind isn't expecting the = even though it's fine in previous examples. Same fail for transform.
My first thought was assign, but while I've used this successfully for creating formula-named objects, I can't see how to get it to work for formula-named columns.
If I use an intermediary step to put the naming formula in an object container then use that, e.g.:
name <- paste("magic",100,"beans")
data<-cbind(data,name=bothdata-1)
the column name is "name" not "magic100beans". If I assign the equation result to a formula-named object:
assign(paste("magic",100,"beans"),bothdata-1)
Then try to cbind that via get:
data<-cbind(data,get(paste("magic",100,"beans")))
The column is called "get(paste("magic",100,"beans"))". Boo! Any thoughts anyone? It occurs to me that I can do cbind then separately colnames(data)[ncol(data)] <- paste("magic",100,"beans"), which I guess I'll settle for now, but I would still be interested to find out if there is a direct way.
Thanks.
Chances are that cbind is overkill for your use case. In almost every instance, you can simply mutate the underlying data frame using data$newname2 <- data$bothdata - 1.
In the case where the name of the column is dynamic, you can just refer to it using the [[ operator -- data[["newcol"]] <- data$newname + 1. See ?'[' and ?'[.data.frame' for other tips and usages.
EDIT: Incorporated @Marek's suggestion for [["newcol"]] instead of [, "newcol"]
It may help you to know that data$col1 is the same as data[,"col1"], which is the same as data[,x] if x is "col1". This is how I usually access/set columns programmatically.
So this should work:
name <- paste0("magic",100,"beans")
data[,name] <- bothdata-1
Note that you don't have to use the temporary variable name. This is equivalent to:
data$magic100beans <- bothdata-1
Itself equivalent, for a data.frame, to:
data<-cbind(data, magic100beans=bothdata-1)
Just so you know, you could also set the names afterwards:
old_names <- names(data)
name <- paste("magic",100,"beans")
data <- cbind(data, bothdata-1)
data <- setNames(data, c(old_names, name))
# or
names(data) <- c(old_names, name)
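Put together for the loop case in the question, the idea can be sketched as follows. This is a minimal illustration, not the asker's actual code: the data, the loop values of `q` and the exact name format are made up, loosely modelled on the question's `E_wgt` example.

```r
# Hypothetical data frame with one existing column
data <- data.frame(bothdata = c(10, 20, 30))

for (q in c(25, 50)) {
  # Build the column name as a string first ...
  name <- paste0("E_wgt", q, "%")
  # ... then use it with [[ ]], which accepts any string as a column name
  data[[name]] <- data$bothdata - q
}

names(data)
```

`paste0()` is used rather than `paste()` because `paste()` separates its arguments with spaces by default.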
I have three lists of country names.
orig_list <- c("MURICA","uSa","Ger","Gerrmany",...) #[LONG, about 15000 entries]
unique_spellings <- c("MURICA","uSa","Ger","Gerrmany") #[600 entries]
norm_spellings <- c("USA","USA","Germany","Germany") #[600 entries]
So the idea is that I want the entries in the original list to be replaced by the corresponding normed spellings. Each unique spelling from orig_list is represented in unique_spellings, and the normed spelling can be found at the same index in norm_spellings.
A related thread (which I cannot find again for the life of me) suggested using setNames for a similar task, like
map = setNames(norm_spellings, unique_spellings)
orig_list = map[orig_list]
This almost works, but produces a lot of NAs for some reason. Is there a clever way to do it without multiple for loops?
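A minimal sketch of the same lookup using `match()`, which also makes it easy to list the entries that fail to match. It assumes all three vectors are plain character vectors. The entry with a trailing space is a made-up example of one way an exact match can fail; whether that is the cause of the NAs here cannot be told from the question.

```r
# Shortened, hypothetical versions of the three vectors
orig_list        <- c("MURICA", "uSa", "Ger", "Gerrmany", "uSa ", "Ger")
unique_spellings <- c("MURICA", "uSa", "Ger", "Gerrmany")
norm_spellings   <- c("USA", "USA", "Germany", "Germany")

# For each entry, its position in unique_spellings (NA if there is no exact match)
idx <- match(orig_list, unique_spellings)
normed <- norm_spellings[idx]

# The distinct entries that could not be matched, for inspection
unmatched <- unique(orig_list[is.na(idx)])
```

Inspecting `unmatched` on the real data should show which spellings are missing from, or differ slightly from, `unique_spellings`.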