Splitting rows that contain "|" with separate() does not work - r

My data looks like this:
> company
name category_list
11 1-4 All Entertainment|Games|Software
12 1.618 Technology Networking|Real Estate|Web Hosting
13 1-800-DENTIST Health and Wellness
14 1-800-DOCTORS Health and Wellness
15 1-800-PublicRelations, Inc. Internet Marketing|Media|Public Relations
I need to split the category_list column based on its values: when the values are pipe-separated, the row should be split into one row per value.
I tried this with the separate() function, but the new column is not populated with any values:
c1 <- company %>% separate(category_list,into=c("primary_Sector"), sep="|")
Actual output:
name primary_Sector
11 1-4 All
12 1.618 Technology
13 1-800-DENTIST
14 1-800-DOCTORS
15 1-800-PublicRelations, Inc.
Expected output
name category_list
11 1-4 All Entertainment
12 1-4 All Games
13 1-4 All Software
Can someone tell me what is wrong?

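tidyr::separate() splits a value into several columns, while tidyr::separate_rows() splits it into several rows. In both functions sep is a regular expression, and an unescaped | is the regex alternation operator, so a literal pipe has to be written as "\\|":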
library(tidyr)
read.table(
text="name;category_list
1-4 All;Entertainment|Games|Software
1.618 Technology;Networking|Real Estate|Web Hosting
1-800-DENTIST;Health and Wellness
1-800-DOCTORS;Health and Wellness
1-800-PublicRelations, Inc.;Internet Marketing|Media|Public Relations",
sep=";", header = TRUE, stringsAsFactors = FALSE
) %>%
separate_rows(category_list, sep = "\\|")
## name category_list
## 1 1-4 All Entertainment
## 2 1-4 All Games
## 3 1-4 All Software
## 4 1.618 Technology Networking
## 5 1.618 Technology Real Estate
## 6 1.618 Technology Web Hosting
## 7 1-800-DENTIST Health and Wellness
## 8 1-800-DOCTORS Health and Wellness
## 9 1-800-PublicRelations, Inc. Internet Marketing
## 10 1-800-PublicRelations, Inc. Media
## 11 1-800-PublicRelations, Inc. Public Relations
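A quick way to see what the escaping changes is base R's strsplit(), which, like sep here, interprets its pattern as a regular expression by default. This is only a minimal sketch; the variable names are illustrative and not from the original post.

```r
x <- "Entertainment|Games|Software"

# Unescaped: "|" is regex alternation between two empty patterns,
# so the string is split between every character.
unescaped <- strsplit(x, "|")[[1]]

# Escaped: "\\|" matches a literal pipe.
escaped <- strsplit(x, "\\|")[[1]]

# Alternative: switch regex interpretation off entirely.
literal <- strsplit(x, "|", fixed = TRUE)[[1]]

print(escaped)
```

Both the escaped and the fixed = TRUE variants return the three category names.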

Related

How to iterate one dataframe based on a mapping file in R?

Serial No.  Company 1  Company 2  Company 3
01          NA         2          NA
02          2          NA         5
03          NA         NA         4
04          1          NA         NA
05          NA         4          NA
I have a data structure like this, where the column headings represent companies and the row headings represent consumers who buy the products. 'NA' means the consumer made no purchase of that company's products.
I have a second mapping file where the companies are represented as row headings, as follows:
Company    Country  Category
Company 1  UK       FMCG
Company 2  UK       FMCG
Company 3  India    FMCG
Company 4  US       Nicotine
The data set covers over 10000 consumers and 1000 companies. I'm getting the market share for different countries and categories using the aggregate function and the mapping file.
I want to make a loop that iterates over values in the first data frame to change the share for different countries and categories. The idea is that I can choose which country's (or category's) share needs to be changed, along with the new share, and then use the mapping file to adjust the values for companies in that country (or category). The values need to be changed only for those consumers who buy products from companies belonging to that country (or category).
Can someone suggest how this can be done in R (preferably) or Python?
Edit:
Before iteration I will use the aggregate function in R to get the shares for a country (or category) like this -
Country  Share
UK       0.33
US       0.02
IN       0.41
IR       0.11
PK       0.13
In the loop I want to be able to set the share for some country (say UK) to whatever is required (say 0.5). The mapping file will then be used to carry the change back to the first data structure, for the people who have bought products from companies in the UK.
The final output will be something like this.
Country  Share
UK       0.50
US       0.00
IN       0.38
IR       0.11
PK       0.01
Here's a guess: ultimately, this is a combination of reshape from wide to long, then merge/join, and finally aggregation/summarizing by group. If you need more information for either operation, using those key-words (on SO) will provide very useful information.
base R (and reshape2)
## reshape
dat1melted <- reshape2::melt(dat1, "Serial No.", variable.name = "Company")
dat1melted$Company <- as.character(dat1melted$Company)
dat1melted <- dat1melted[!is.na(dat1melted$value),]
dat1melted
# Serial No. Company value
# 2 02 Company 1 2
# 4 04 Company 1 1
# 6 01 Company 2 2
# 10 05 Company 2 4
# 12 02 Company 3 5
# 13 03 Company 3 4
## merge
dat1merged <- merge(dat1melted, dat2, by = "Company", all.x = TRUE)
dat1merged
# Company Serial No. value Country Category
# 1 Company 1 02 2 UK FMCG
# 2 Company 1 04 1 UK FMCG
# 3 Company 2 01 2 UK FMCG
# 4 Company 2 05 4 UK FMCG
# 5 Company 3 02 5 India FMCG
# 6 Company 3 03 4 India FMCG
## aggregate by group
aggregate(value ~ Country, data = dat1merged, FUN = sum)
# Country value
# 1 India 9
# 2 UK 9
dplyr
library(dplyr)
# library(tidyr) # pivot_longer
dat1 %>%
## reshape
tidyr::pivot_longer(-`Serial No.`, names_to = "Company") %>%
filter(!is.na(value)) %>%
## merge
left_join(., dat2, by = "Company") %>%
## aggregate by group
group_by(Country) %>%
summarize(value = sum(value))
# # A tibble: 2 x 2
# Country value
# <chr> <int>
# 1 India 9
# 2 UK 9
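To get from these totals to the shares mentioned in the question, one option is to divide each country's total by the grand total. The sketch below uses base R only and rebuilds the two example tables from the question as dat1 and dat2. Defining a share as a country's total over the overall total is an assumption, since the question does not spell out the formula; the names long, merged and totals are illustrative.

```r
# Example data rebuilt from the tables in the question.
dat1 <- data.frame(
  `Serial No.` = c("01", "02", "03", "04", "05"),
  `Company 1`  = c(NA, 2, NA, 1, NA),
  `Company 2`  = c(2, NA, NA, NA, 4),
  `Company 3`  = c(NA, 5, 4, NA, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)
dat2 <- data.frame(
  Company  = paste("Company", 1:4),
  Country  = c("UK", "UK", "India", "US"),
  Category = c("FMCG", "FMCG", "FMCG", "Nicotine"),
  stringsAsFactors = FALSE
)

# Reshape wide to long without extra packages.
companies <- setdiff(names(dat1), "Serial No.")
long <- data.frame(
  `Serial No.` = rep(dat1[["Serial No."]], times = length(companies)),
  Company      = rep(companies, each = nrow(dat1)),
  value        = unlist(dat1[companies], use.names = FALSE),
  check.names = FALSE, stringsAsFactors = FALSE
)
long <- long[!is.na(long$value), ]

# Attach country/category, then aggregate and convert to shares.
merged <- merge(long, dat2, by = "Company", all.x = TRUE)
totals <- aggregate(value ~ Country, data = merged, FUN = sum)
totals$share <- totals$value / sum(totals$value)  # assumed share definition
print(totals)
```

Swapping Country for Category in the aggregate() formula would give the per-category shares in the same way.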

How to group similar strings together in a database in R

I have a tibble of just 1 column called 'title'.
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
2 zostavax shingles vaccine
3 xray operator
4 workplace mental health wellbeing workshop
5 zostavax recall toolkit
6 xray meetint
7 workplace mental health and wellbeing
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
I wish to find similar records and group them together as such (all the while keeping their indices):
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
2 zostavax shingles vaccine
5 zostavax recall toolkit
3 xray operator
6 xray meetint
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
4 workplace mental health wellbeing workshop
7 workplace mental health and wellbeing
I'm using the below function to find strings that are close enough to each other (cutoff = 0.75)
compareJW <- function(string1, string2, cutoff)
{
require(RecordLinkage)
jarowinkler(string1, string2) > cutoff
}
I've implemented the loop below to 'send' similar records together into a new data frame, but it's not working properly. I've tried a few variations, but nothing works yet.
# create new database
newDB <- data.frame(matrix(ncol = ncol(dat), nrow = 0))
colnames(newDB) <- names(dat)
newDB <- as_tibble(newDB)
for(i in 1:nrow(dat))
{
# print(dat$title[i])
for(j in 1:nrow(dat))
{
print(dat$title[i])
print(dat$title[j])
# score <- jarowinkler(dat$title[i], dat$title[j])
if(dat$title[i] != dat$title[j]
&&
compareJW(dat$title[i], dat$title[j], 0.75))
{
print("if")
# newDB <- rbind(newDB,
# dat$title[i],
# dat$title[j])
}
else
{
print("else")
# newDB <- rbind(newDB, dat$title[i])
}
}
}
(I've inserted prints in the loop 'to see what's happening')
Reproducible data:
dat <-
structure(list(title = c("lymphoedema clinic", "zostavax shingles vaccine",
"xray operator", "workplace mental health wellbeing workshop",
"zostavax recall toolkit", "xray meetint", "workplace mental health and wellbeing",
"lymphoedema early intervenstion", "lymphoedema expo", "lymphoedema for breast care nurses",
"xray meeting and case studies", "xray online examination", "xray operator in service paediatric extremities"
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
))
Any suggestions please?
EDIT: I'd also like a new index column called 'group' as below:
> dat
# A tibble: 13 x 1
index group title
<chr>
1 1 lymphoedema clinic
8 1 lymphoedema early intervenstion
9 1 lymphoedema expo
10 1 lymphoedema for breast care nurses
2 2 zostavax shingles vaccine
5 2 zostavax recall toolkit
3 3 xray operator
6 3 xray meetint
11 3 xray meeting and case studies
12 3 xray online examination
13 3 xray operator in service paediatric extremities
4 4 workplace mental health wellbeing workshop
7 4 workplace mental health and wellbeing
I'm afraid I've never tried RecordLinkage, but if you're just using the Jaro-Winkler distance it should also be fairly easy to cluster similar strings with the stringdist package. Using your dput above:
library(tidyverse)
library(stringdist)
map_dfr(dat$title, ~ {
i <- which(stringdist(., dat$title, "jw") < 0.40)
tibble(index = i, title = dat$title[i])
}, .id = "group") %>%
distinct(index, .keep_all = TRUE) %>%
mutate(group = as.integer(group))
Explanation:
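map_dfr iterates over each string in dat$title, extracts the indices of the closest matches computed by stringdist, creates a tibble with the indices and matches, then stacks these tibbles with a group variable corresponding to the integer position (and row number) of the original string. Note that stringdist returns a distance (0 means identical), so the condition < 0.40 plays the role of your "threshold" with the direction inverted; with the default p = 0, method "jw" is the plain Jaro distance. distinct then drops any cluster duplicates based on repeats of index, keeping the first group in which each index appears. The group numbers are therefore the row numbers of the strings that started each group, and a title can be claimed by an earlier group, as happens with row 11 in the output below.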
Output:
# A tibble: 13 x 3
group index title
<int> <int> <chr>
1 1 1 lymphoedema clinic
2 1 8 lymphoedema early intervenstion
3 1 9 lymphoedema expo
4 1 10 lymphoedema for breast care nurses
5 2 2 zostavax shingles vaccine
6 2 5 zostavax recall toolkit
7 2 11 xray meeting and case studies
8 3 3 xray operator
9 3 6 xray meetint
10 3 12 xray online examination
11 3 13 xray operator in service paediatric extremities
12 4 4 workplace mental health wellbeing workshop
13 4 7 workplace mental health and wellbeing
An interesting alternative would be to use tidytext with widyr to tokenize by word and compute the cosine similarity of the titles based on similar words, rather than characters as above.
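The word-based idea can be sketched without tidytext or widyr. The example below is base R only and is not the tidytext/widyr implementation: it builds a small word-count matrix by hand, computes cosine similarity between titles, and then groups them with single-linkage clustering. The four titles are taken from the question; the cut height of 0.9 and all variable names are assumptions chosen for illustration.

```r
titles <- c("lymphoedema clinic", "lymphoedema expo",
            "xray operator", "xray online examination")

# Tokenize by word and build a title-by-word count matrix.
tokens <- strsplit(titles, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(words) as.numeric(table(factor(words, levels = vocab))),
  numeric(length(vocab))
))

# Cosine similarity: dot product divided by the product of vector lengths.
norms   <- sqrt(rowSums(dtm^2))
cos_sim <- (dtm %*% t(dtm)) / outer(norms, norms)

# Turn similarity into a distance and cut the tree (0.9 is an arbitrary choice).
groups <- cutree(hclust(as.dist(1 - cos_sim), method = "single"), h = 0.9)
print(data.frame(index = seq_along(titles), group = groups, title = titles))
```

Because the comparison is on whole words, a typo such as "meetint" would not match "meeting" here, which is the trade-off against the character-based distance above.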

Search a column of names in another data frame and get the result with data from other column combined

I want to create a data frame from two distinct data frames.
The first one has the names of journals and their respective impact factors.
The second data frame has the names of the journals that I want to look up.
df1:
Full Journal Title Journal Impact Factor
CA-A CANCER JOURNAL FOR CLINICIANS 223.679
Nature Reviews Materials 74.449
NEW ENGLAND JOURNAL OF MEDICINE 70.670
LANCET 59.102
NATURE REVIEWS DRUG DISCOVERY 57.618
CHEMICAL REVIEWS 54.301
Nature Energy 54.000
NATURE REVIEWS CANCER 51.848
JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION 51.273
NATURE REVIEWS IMMUNOLOGY 44.019
NATURE REVIEWS GENETICS 43.704
NATURE REVIEWS MOLECULAR CELL BIOLOGY 43.351
NATURE 43.070
and continues...
str(df1)
'data.frame': 12541 obs. of 2 variables:
$ my.journal: Factor w/ 11879 levels "","2D Materials",..: 4155 1872 8866 8999 8033 8861 2143 8841 8856 5795 ...
$ jcr : Factor w/ 4732 levels "","0.000","0.006",..: 4731 2905 4614 4613 4337 4336 4335 4334 4333 4332 ...
df2:
my.journal
1 Bioscience journal
2 Summa phytopathologica (impresso)
3 Summa phytopathologica (impresso)
4 Summa phytopathologica (impresso)
5 Australian journal of crop science (online)
6 Summa phytopathologica (impresso)
7 Summa phytopathologica
8 Pesquisa agropecuaria tropical (online)
9 Crop breeding and applied biotechnology
10 Genetics and molecular research
11 Tropical plant pathology
12 Genetics and molecular research
13 Perspectivas online: biológicas e saúde
14 Científica (jaboticabal. online)
15 Journal of plant physiology & pathology
16 Tropical plant pathology
17 Summa phytopathologica (impresso)
> str(df2)
'data.frame': 17 obs. of 1 variable:
$ my.journal: Factor w/ 11 levels "Australian journal of crop science (online)",..: 2 10 10 10 1 10 9 8 4 5 ...
I want another data frame (df3) in which the journals in df2 are looked up in df1 and, where there is a match, I get something like this (without the NA):
In place of NA I want the Journal Impact Factor corresponding to the journal in my df2.
df3
journal jcr total
<chr> <fct> <int>
1 Summa phytopathologica (impresso) NA 5
2 Genetics and molecular research NA 2
3 Tropical plant pathology NA 2
4 Australian journal of crop science (online) NA 1
5 Bioscience journal NA 1
6 Científica (jaboticabal. online) NA 1
7 Crop breeding and applied biotechnology NA 1
8 Journal of plant physiology & pathology NA 1
9 Perspectivas online: biológicas e saúde NA 1
10 Pesquisa agropecuaria tropical (online) NA 1
11 Summa phytopathologica NA 1
I started using R a few months ago and I don't know where to start with this.
The two dataframes are in the link df1 and df2
Updated:
One solution would be to use join with dplyr:
library(dplyr)
df1 <- read.table("df1.txt", skip = 1, header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table("df2.txt", header = TRUE, stringsAsFactors = FALSE)
df1 <- df1 %>%
mutate(Full.Journal.Title = toupper(Full.Journal.Title))
df2 <- df2 %>%
mutate(my.journal = toupper(my.journal))
df2 %>%
left_join(df1, by = c("my.journal" = "Full.Journal.Title")) %>%
group_by(my.journal, Journal.Impact.Factor) %>%
summarize(total = n()) %>%
arrange(desc(total))
my.journal Journal.Impact.Factor total
<chr> <chr> <int>
1 SUMMA PHYTOPATHOLOGICA (IMPRESSO) NA 5
2 GENETICS AND MOLECULAR RESEARCH NA 2
3 TROPICAL PLANT PATHOLOGY 1.254 2
4 AUSTRALIAN JOURNAL OF CROP SCIENCE (ONLINE) NA 1
5 BIOSCIENCE JOURNAL 0.375 1
6 CIENTíFICA (JABOTICABAL. ONLINE) NA 1
7 CROP BREEDING AND APPLIED BIOTECHNOLOGY 1.026 1
8 JOURNAL OF PLANT PHYSIOLOGY & PATHOLOGY NA 1
9 PERSPECTIVAS ONLINE: BIOLóGICAS E SAúDE NA 1
10 PESQUISA AGROPECUARIA TROPICAL (ONLINE) NA 1
11 SUMMA PHYTOPATHOLOGICA NA 1
A few things to note to make this work:
Reading in df1, the header appears to take up two rows, so I skipped the first line (this most closely matched your earlier example).
Pass stringsAsFactors = FALSE to read.table if you do not want the columns read as factors.
Some journal names are upper case and others lower case. The join is case-sensitive, so I used toupper to make everything upper case before the join (alternatively, you can embed the toupper call inside the left_join if you want to leave the original data frames untouched).
Please let me know if this is what you had in mind.
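For the variant that leaves the original data frames untouched, one possible approach is to put the upper-cased names in a temporary key column and join on that. The sketch below uses base R's merge() rather than left_join, with a small toy stand-in for the real files; the two impact factors are the ones from the output above, while the mixed capitalisation in df1 and the names counts, lookup and key are assumptions made for illustration.

```r
# Toy stand-ins for the real files.
df1 <- data.frame(
  Full.Journal.Title    = c("TROPICAL PLANT PATHOLOGY", "Bioscience Journal"),
  Journal.Impact.Factor = c("1.254", "0.375"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  my.journal = c("Tropical plant pathology", "Bioscience journal",
                 "Tropical plant pathology", "Summa phytopathologica"),
  stringsAsFactors = FALSE
)

# Count how often each journal occurs in df2.
counts <- as.data.frame(table(my.journal = df2$my.journal),
                        responseName = "total", stringsAsFactors = FALSE)

# Join on an upper-cased key so the originals keep their spelling.
counts$key <- toupper(counts$my.journal)
lookup <- data.frame(key = toupper(df1$Full.Journal.Title),
                     jcr = df1$Journal.Impact.Factor,
                     stringsAsFactors = FALSE)
df3 <- merge(counts, lookup, by = "key", all.x = TRUE)
df3 <- df3[order(-df3$total), c("my.journal", "jcr", "total")]
print(df3)
```

Journals with no counterpart in df1 keep NA in jcr, as in the dplyr output.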

Extract before and after lines based on keyword in Pdf using R programming

I want to extract information related to the keyword "cancer" from a list of PDFs using R.
I want to extract the lines (or paragraph) before and after each line containing the word "cancer" in the text file.
abstracts <- lapply(mytxtfiles, function(i) {
j <- paste0(scan(i, what = character()), collapse = " ")
regmatches(j, gregexpr("(?m)(^[^\\r\\n]*\\R+){4}[cancer][^\\r\\n]*\\R+(^[^\\r\\n]*\\R+){4}", j, perl=TRUE))})
The regex above is not working.
Here's one approach:
library(textreadr)
library(tidyverse)
loc <- function(var, regex, n = 1, ignore.case = TRUE){
locs <- grep(regex, var, ignore.case = ignore.case)
# each matching line plus n lines of context on either side
out <- sort(unique(rep(locs, each = 2 * n + 1) + (-n:n)))
out <- out[out > 0]
out[out <= length(var)]
}
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
read_pdf() %>%
slice(loc(text, 'cancer'))
doc
## page_id element_id text
## 1 24 28 Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private
## 2 24 29 partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but
## 3 24 30 stresses that, in order for them to work, they should be voluntary, and the government
## 4 25 8 the availability of medicines to treat life-threatening diseases. It notes, for example, that
## 5 25 9 while an average estimate of the value of drugs to treat the country's cancer patients is
## 6 25 10 $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the
## 7 25 12 because of the high cost of these medicines,” says the Policy, which also calls for tax and
## 8 25 13 excise exemptions for anti-cancer drugs.
## 9 25 14 Another area for which PPPs are proposed is for drugs to treat HIV/AIDS, India's biggest health
## 10 32 19 Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective,
## 11 32 20 anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended
## 12 32 21 December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1
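The same index-based idea also works on plain text files without any PDF package, provided the lines are kept as separate elements (for example by reading with readLines() rather than collapsing everything into one string, which removes the line breaks the original regex relies on). Note also that [cancer] in the question's pattern is a character class matching a single character, not the word. The following is a minimal base R sketch; context_lines and the sample text are hypothetical.

```r
# Return the indices of matching lines plus n lines before and after each.
context_lines <- function(lines, pattern, n = 1, ignore.case = TRUE) {
  hits <- grep(pattern, lines, ignore.case = ignore.case)
  idx  <- sort(unique(rep(hits, each = 2 * n + 1) + (-n:n)))
  idx[idx >= 1 & idx <= length(lines)]  # drop indices outside the text
}

lines <- c("intro", "background", "rates of cancer rose",
           "methods", "results", "discussion")

print(lines[context_lines(lines, "cancer", n = 1)])
```

Raising n widens the window, and a pattern with no matches returns an empty selection rather than an error.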

reuters data scraping in R with rvest, find CSS selector

Yes, I know there are similar questions, I've read the answers and tried those which I could implement. So, sorry in advance in case the question is stupid :)
I'm scraping the age of company board members from Reuters for a list of companies.
Here's the link: http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT
I'm using the rvest library and SelectorGadget to find the proper CSS selector.
Here's the code:
library(rvest)
d = read_html("http://www.reuters.com/finance/stocks/companyOfficers?symbol=GAZP.RTS")
d %>% html_nodes("#companyNews:nth-child(1) td:nth-child(2)") %>% html_text()
The result is
character(0)
I think I have the wrong CSS selector. Can you please tell me how to select the table?
You need to use html_session to get the data loaded properly:
library(rvest)
url <- 'http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT.O'
site <- html_session(url) %>% read_html()
site %>% html_node('#companyNews:first-child table') %>% html_table()
## Name Age Since Current Position
## 1 John Thompson 66 2014 Independent Chairman of the Board
## 2 Bradford Smith 57 2015 President, Chief Legal Officer
## 3 Satya Nadella 48 2014 Chief Executive Officer, Director
## 4 William Gates 60 2014 Founder and Technology Advisor, Director
## 5 Amy Hood 43 2013 Chief Financial Officer, Executive Vice President
## 6 Christopher Capossela 45 2014 Executive Vice President, Chief Marketing Officer
## 7 Kathleen Hogan 49 2014 Executive Vice President - Human Resources
## 8 Margaret Johnson 54 2014 Executive Vice President - Business Development
## 9 Ifeanyi Amah NA 2016 Chief Technology Officer
## 10 Keith Lorizio NA 2016 Vice President - North America Sales
## 11 Teri List-Stoll 53 2014 Independent Director
## 12 G. Mason Morfit 40 2014 Independent Director
## 13 Charles Noski 63 2003 Independent Director
## 14 Helmut Panke 69 2003 Independent Director
## 15 Charles Scharf 50 2014 Independent Director
## 16 John Stanton 60 2014 Independent Director
## 17 Chris Suh NA NA General Manager - Investor Relations
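A selector can also be checked offline against a small HTML fragment before pointing it at the live page. The sketch below assumes rvest 1.0 or later, where html_session() was renamed session() and html_node() was renamed html_element() (the old names are deprecated). The HTML fragment is a made-up stand-in for the page structure, using two rows from the output above.

```r
library(rvest)

# Made-up fragment mimicking a table inside the #companyNews container.
page <- minimal_html('
  <div id="companyNews">
    <table>
      <tr><th>Name</th><th>Age</th></tr>
      <tr><td>John Thompson</td><td>66</td></tr>
      <tr><td>Satya Nadella</td><td>48</td></tr>
    </table>
  </div>')

# Select the first table under #companyNews and parse it into a data frame.
tbl <- page %>%
  html_element("#companyNews table") %>%
  html_table()

print(tbl)
```

If the selector works here but returns character(0) on the real site, the difference is more likely in how the page is loaded than in the selector itself.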
