reuters data scraping in R with rvest, find CSS selector - r

Yes, I know there are similar questions, I've read the answers and tried those which I could implement. So, sorry in advance in case the question is stupid :)
I'm scraping the age of company board members from Reuters for a list of companies.
Here's the link: http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT
I'm using the rvest library and SelectorGadget to find the proper CSS selector.
Here's the code:
library(rvest)
d = read_html("http://www.reuters.com/finance/stocks/companyOfficers?symbol=GAZP.RTS")
d %>% html_nodes("#companyNews:nth-child(1) td:nth-child(2)") %>% html_text()
The result is
character(0)
I think I have the wrong CSS selector. Can you please tell me how to select the table?

You need to use html_session() to get the data loaded properly (in rvest 1.0.0 and later the function is called session(), and html_session() is deprecated):
library(rvest)
url <- 'http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT.O'
site <- html_session(url) %>% read_html()
site %>% html_node('#companyNews:first-child table') %>% html_table()
## Name Age Since Current Position
## 1 John Thompson 66 2014 Independent Chairman of the Board
## 2 Bradford Smith 57 2015 President, Chief Legal Officer
## 3 Satya Nadella 48 2014 Chief Executive Officer, Director
## 4 William Gates 60 2014 Founder and Technology Advisor, Director
## 5 Amy Hood 43 2013 Chief Financial Officer, Executive Vice President
## 6 Christopher Capossela 45 2014 Executive Vice President, Chief Marketing Officer
## 7 Kathleen Hogan 49 2014 Executive Vice President - Human Resources
## 8 Margaret Johnson 54 2014 Executive Vice President - Business Development
## 9 Ifeanyi Amah NA 2016 Chief Technology Officer
## 10 Keith Lorizio NA 2016 Vice President - North America Sales
## 11 Teri List-Stoll 53 2014 Independent Director
## 12 G. Mason Morfit 40 2014 Independent Director
## 13 Charles Noski 63 2003 Independent Director
## 14 Helmut Panke 69 2003 Independent Director
## 15 Charles Scharf 50 2014 Independent Director
## 16 John Stanton 60 2014 Independent Director
## 17 Chris Suh NA NA General Manager - Investor Relations
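The selection step can be tried offline on a small inline document. This is only a sketch: the markup below is a hypothetical stand-in for the Reuters page (whose real structure is not shown here), and only the first two rows of the output above are reproduced.

```r
library(rvest)

# Hypothetical markup standing in for the real page: one table inside
# an element with id "companyNews".
html <- '
<div id="companyNews">
  <table>
    <tr><th>Name</th><th>Age</th><th>Since</th></tr>
    <tr><td>John Thompson</td><td>66</td><td>2014</td></tr>
    <tr><td>Bradford Smith</td><td>57</td><td>2015</td></tr>
  </table>
</div>'

page <- read_html(html)

# Select the table node first, then let html_table() parse it.
officers <- page %>%
  html_node("#companyNews table") %>%
  html_table()

# The ages are then just a column of the parsed table.
officers$Age
```

Selecting the whole table and indexing the column afterwards tends to be less fragile than a td:nth-child() selector, because it does not depend on how the rows are nested.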

Related

Separating characters in dfm object R

I have imported the sotu corpus from quanteda in R. I am fairly new to dfm objects and want to split the doc_id column into a name column and a year column. If this were a tibble, this code would work:
library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- as_tibble(data_corpus_sotu)
sotusubsetted <- sotu %>%
separate(doc_id, c("name","year"),"-")
However, since I am new with dfm and regex, I am not sure if there is an equivalent process if I load in the data as:
library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- corpus(data_corpus_sotu)
sotudfm <- dfm(sotu)
Is there some equivalent way to do this with dfm objects?
The safest method is also one that will work for any core quanteda object, meaning equally for a corpus, tokens, or dfm object. It involves using the accessor functions rather than addressing the internals of the corpus or dfm objects directly, which is strongly discouraged. You can do that, but your code could break in the future if those object structures are changed. In addition, our accessor functions are generally also the most efficient method.
For this task, you want to use the docnames() function for accessing the document IDs, and this works for the corpus as well as for the dfm.
library("quanteda")
## Package version: 2.1.2
data("data_corpus_sotu", package = "quanteda.corpora")
data.frame(doc_id = docnames(data_corpus_sotu[1:5])) %>%
tidyr::separate(doc_id, c("name", "year"), "-")
## name year
## 1 Washington 1790
## 2 Washington 1790b
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
data.frame(doc_id = docnames(dfm(data_corpus_sotu[1:5]))) %>%
tidyr::separate(doc_id, c("name", "year"), "-")
## name year
## 1 Washington 1790
## 2 Washington 1790b
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
You could also have taken this from the "President" docvar field and the "Date":
data.frame(
name = data_corpus_sotu$President,
year = lubridate::year(data_corpus_sotu$Date)
) %>%
head()
## name year
## 1 Washington 1790
## 2 Washington 1790
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
## 6 Washington 1794
Created on 2021-02-13 by the reprex package (v1.0.0)
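The same accessor-based idea can also write the split pieces back onto the dfm as document variables. A minimal sketch, with two assumptions: it uses data_corpus_inaugural (which ships with quanteda) instead of the sotu corpus, and its document names have the form "year-name" rather than "name-year". The column names year_from_id and name_from_id are made up for the example.

```r
library(quanteda)

# Build a small dfm from the first five inaugural addresses.
dfmat <- dfm(tokens(data_corpus_inaugural[1:5]))

# Split each document name on the literal "-" via the docnames() accessor.
parts <- strsplit(docnames(dfmat), "-", fixed = TRUE)

# Store the pieces as document variables through the docvars() accessor,
# without touching the internal slots of the dfm.
docvars(dfmat, "year_from_id") <- vapply(parts, `[`, character(1), 1)
docvars(dfmat, "name_from_id") <- vapply(parts, `[`, character(1), 2)

docvars(dfmat)[, c("year_from_id", "name_from_id")]
```

Because the document names themselves are left untouched, subsetting and filtering on the dfm should keep working as before.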
The following code will do exactly what you want, although it might break some operations in quanteda that look for docid_ in sotudfm@docvars, the data frame that stores the documents' relational data. For instance, it will break any filtering by sotudfm@Dimnames$docs, which is where the dimension names of the documents are listed.
sotudfm@docvars <- sotudfm@docvars %>% separate(col = docid_, c("name","year"),"-")
> sotudfm@docvars %>% as_tibble()
# A tibble: 241 x 10
docname_ name year segid_ FirstName President Date delivery type party
<chr> <chr> <chr> <int> <chr> <chr> <date> <fct> <fct> <fct>
1 Washington-1790 Washington 1790 1 George Washington 1790-01-08 spoken SOTU Independent
2 Washington-1790b Washington 1790b 1 George Washington 1790-12-08 spoken SOTU Independent
3 Washington-1791 Washington 1791 1 George Washington 1791-10-25 spoken SOTU Independent
4 Washington-1792 Washington 1792 1 George Washington 1792-11-06 spoken SOTU Independent
5 Washington-1793 Washington 1793 1 George Washington 1793-12-03 spoken SOTU Independent
6 Washington-1794 Washington 1794 1 George Washington 1794-11-19 spoken SOTU Independent
7 Washington-1795 Washington 1795 1 George Washington 1795-12-08 spoken SOTU Independent
8 Washington-1796 Washington 1796 1 George Washington 1796-12-07 spoken SOTU Independent
9 Adams-1797 Adams 1797 1 John Adams 1797-11-22 spoken SOTU Federalist
10 Adams-1798 Adams 1798 1 John Adams 1798-12-08 spoken SOTU Federalist
Here is the code that ended up working for me:
sotudfm@docvars <- sotudfm@docvars %>%
separate(col = docname_, c("name","year"),"-")
This kept the doc_id intact when I ran
head(sotudfm, 10)
It appears that docid_ and docname_ are identical.

Matching Pairs for R Dataframes

I have a data frame that contains the career records of employees in different offices of a large corporation. I want to identify every pair of employees who share working experience in the same office. My data frame looks like this:
Year Office Employee_Name
2011 Logistics Henry
2012 Logistics Henry
2013 HR Henry
2012 Marketing Peter
2013 HR Peter
2014 HR Peter
2015 HR Peter
2010 Logistics Bob
2011 Logistics Bob
2012 Logistics Bob
In the above sample, Henry and Peter worked together in HR in 2013. Henry also worked with Bob in Logistics in 2011 and 2012. I want the final result to look something like:
Year_of_shared_experience Person_A Person_B
1 Henry Peter
2 Henry Bob
The order of Person_A and Person_B does not matter (i.e., it can be Henry in Person_A or it can be Peter in Person_A column). Thanks!
You could merge the table with itself (i.e., a "self-join") and then filter out duplicate entries:
# read data
dat = "
Year Office Employee_Name
2011 Logistics Henry
2012 Logistics Henry
2013 HR Henry
2012 Marketing Peter
2013 HR Peter
2014 HR Peter
2015 HR Peter
2010 Logistics Bob
2011 Logistics Bob
2012 Logistics Bob"
dat = read.table(text=dat, header=TRUE)
# self-join
dat = merge(dat, dat, all=TRUE, by=c("Year", "Office"))
# filter out duplicates
dat = dat[dat$Employee_Name.x < dat$Employee_Name.y,]
dat
#> Year Office Employee_Name.x Employee_Name.y
#> 4 2011 Logistics Bob Henry
#> 8 2012 Logistics Bob Henry
#> 12 2013 HR Henry Peter
An option using the tidyverse (applied to the original dat, before the self-join above overwrote it):
library(dplyr)
full_join(dat, dat, by = c("Year", "Office")) %>%
filter(Employee_Name.x < Employee_Name.y)
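Both answers return one row per shared year, whereas the question asks for the number of shared years per pair. A minimal base R sketch of that extra counting step, assuming the same sample data; the column name Years_of_shared_experience is chosen here for readability:

```r
dat <- read.table(text = "
Year Office Employee_Name
2011 Logistics Henry
2012 Logistics Henry
2013 HR Henry
2012 Marketing Peter
2013 HR Peter
2014 HR Peter
2015 HR Peter
2010 Logistics Bob
2011 Logistics Bob
2012 Logistics Bob", header = TRUE, stringsAsFactors = FALSE)

# Self-join on Year and Office, then keep each unordered pair once.
pairs <- merge(dat, dat, by = c("Year", "Office"))
pairs <- pairs[pairs$Employee_Name.x < pairs$Employee_Name.y, ]

# Count the rows (shared years) for every pair of employees.
shared <- aggregate(Year ~ Employee_Name.x + Employee_Name.y,
                    data = pairs, FUN = length)
names(shared) <- c("Person_A", "Person_B", "Years_of_shared_experience")
shared
```

Note that a pair who shared two different offices in different years would have those years added together; group by Office as well if that distinction matters.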

Partial String Matching in R to unify text into one category

I have dataset as follow
EstablishmentName Freq
bahria university 20
bahria university islamabad 12
arid agriculture 3
arid agriculture university 15
arid rawalpindi 9
college of e&me, nust 20
college of e & me (nust) 15
college of eme 30
As you can see above, Bahria University and Bahria University Islamabad are almost the same, and the same goes for the other strings. I want to unify them into one category each, such that
Expected Output
EstablishmentName Freq
Bahria University 32
Arid Agriculture 27
College of EME 30
I have tried the following solution, but it doesn't seem to work.
library(SnowballC)
library(dplyr)
mutate(df, word = wordStem(EstablishmentName)) %>%
group_by(EstablishmentName) %>%
summarise(total = sum(Freq))
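One possible direction, sketched under stated assumptions rather than offered as a general solution: strip everything except letters from each name, map it to a category with a small hand-written lookup of patterns, and sum Freq per category. The lookup table and the Category column are hypothetical and would have to be extended for real data. Summing all three "college of eme" variants gives 65, not the 30 shown in the expected output above.

```r
df <- data.frame(
  EstablishmentName = c("bahria university", "bahria university islamabad",
                        "arid agriculture", "arid agriculture university",
                        "arid rawalpindi", "college of e&me, nust",
                        "college of e & me (nust)", "college of eme"),
  Freq = c(20, 12, 3, 15, 9, 20, 15, 30),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: regex on the cleaned name -> unified category.
lookup <- c("^bahria"       = "Bahria University",
            "^arid"         = "Arid Agriculture",
            "^collegeofeme" = "College of EME")

# Cleaned key: lower case, letters only ("e & me" and "e&me" become "eme").
key <- gsub("[^a-z]", "", tolower(df$EstablishmentName))

# Assign a category to every row whose key matches a pattern.
df$Category <- NA_character_
for (p in names(lookup)) {
  df$Category[grepl(p, key)] <- lookup[[p]]
}

# Sum the frequencies within each unified category.
totals <- aggregate(Freq ~ Category, data = df, FUN = sum)
totals
```

Rows that match no pattern keep NA as their category and are dropped by aggregate(), so they are easy to spot by comparing row counts.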

Splitting rows that contain "|" using separate() does not work

my data looks like
> company
name category_list
11 1-4 All Entertainment|Games|Software
12 1.618 Technology Networking|Real Estate|Web Hosting
13 1-800-DENTIST Health and Wellness
14 1-800-DOCTORS Health and Wellness
15 1-800-PublicRelations, Inc. Internet Marketing|Media|Public Relations
I have to split the category_list column based on its values: when the values are pipe-separated, the row should be split into one row per value.
I tried this with the separate() function, but the column is not populated with any values:
c1 <- company %>% separate(category_list,into=c("primary_Sector"), sep="|")
Actual output:
name primary_Sector
11 1-4 All
12 1.618 Technology
13 1-800-DENTIST
14 1-800-DOCTORS
15 1-800-PublicRelations, Inc.
Expected output
name category_list
11 1-4 All Entertainment
12 1-4 All Games
13 1-4 All Software
Can someone tell me what is wrong?
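The sep argument is interpreted as a regular expression, and an unescaped | is the alternation operator, so it has to be escaped as "\\|" to match a literal pipe. In addition, tidyr::separate() does column-wise separation, while tidyr::separate_rows() does row-wise separation: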
library(tidyr)
read.table(
text="name;category_list
1-4 All;Entertainment|Games|Software
1.618 Technology;Networking|Real Estate|Web Hosting
1-800-DENTIST;Health and Wellness
1-800-DOCTORS;Health and Wellness
1-800-PublicRelations, Inc.;Internet Marketing|Media|Public Relations",
sep=";", header = TRUE, stringsAsFactors = FALSE
) %>%
separate_rows(category_list, sep = "\\|")
## name category_list
## 1 1-4 All Entertainment
## 2 1-4 All Games
## 3 1-4 All Software
## 4 1.618 Technology Networking
## 5 1.618 Technology Real Estate
## 6 1.618 Technology Web Hosting
## 7 1-800-DENTIST Health and Wellness
## 8 1-800-DOCTORS Health and Wellness
## 9 1-800-PublicRelations, Inc. Internet Marketing
## 10 1-800-PublicRelations, Inc. Media
## 11 1-800-PublicRelations, Inc. Public Relations
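The escaping issue is not specific to tidyr and can be checked with base R's strsplit(), which also treats its split argument as a regular expression by default. A small sketch using one value from the data above:

```r
x <- "Entertainment|Games|Software"

# Escaped pipe: the regex matches a literal "|" character.
escaped <- strsplit(x, "\\|")[[1]]

# Alternative: switch regex interpretation off with fixed = TRUE.
literal <- strsplit(x, "|", fixed = TRUE)[[1]]

escaped
literal
identical(escaped, literal)
```

Both calls split the string into the three category names, whereas an unescaped "|" used as a regular expression does not match the pipe character at all.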

R: Fuzzy Logic Name match

I have been working on a large data set of customer names. Each name has to be checked against a master file of correct names (300 KB); if it matches, the master name should be appended to the customer file as a new column value. The approach from my previous question worked only for small data sets.
Both the customer and master files have been cleaned using tm. I have tried different approaches, but they only work on a small sample and are not effective on the full files. In my opinion plain pattern matching does not help here, because no name follows an exact pattern.
Cus File
1 chang chun petrochemical
2 chang chun plastics
3 church dwight
4 citrix systems asia pacific
5 cnh industrial services srl
6 conoco phillips
7 conocophillips
8 dfk laurence varnay
9 dtz worldwide
10 electro motive maintenance operati
11 enterasys networks
12 esso resources
13 expedia
14 expedia
15 exponential interactive aust
16 exxonmobil asia pacific pte
17 exxonmobil chemical asia pac div
18 exxonmobil png
19 formula world championship
20 fortitech asia pacific sdn bhd
Master
1 chang chun group
2 church dwight
3 citrix systems asia pacific
4 cnh industrial nv
5 conoco phillips
6 dfk laurence varnay
7 dtz group zealand
8 caterpillar
9 enterasys networks
10 exxon mobil group
11 expedia group
12 exponential interactive aust
13 formula world championship
14 fortitech asia pacific sdn bhd
15 frhi hotels resorts
16 gardner denver industries
17 glencore xstrata international plc
18 grace
19 incomm nz
20 information resources
21 kbr holdings llc
22 kennametal
23 komatsu
24 leonhard hofstetter pelzdesign
25 communications corporation
26 manhattan associates
27 mattel
28 mmg finance
29 nokia oyj group
30 nortek
I have tried this simple loop:
for (i in 1:100){
result$x[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
#result$Y[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
Result:
1 chang chun petrochemical <NA> NA
2 chang chun plastics <NA> NA
3 church dwight church dwight 2
4 citrix systems asia pacific citrix systems asia pacific 3
5 cnh industrial services srl <NA> NA
6 conoco phillips church dwight 2
7 conocophillips <NA> NA
8 dfk laurence varnay <NA> NA
9 dtz worldwide church dwight 2
10 electro motive maintenance operati <NA> NA
11 enterasys networks <NA> NA
12 esso resources church dwight 2
13 expedia <NA> NA
14 expedia <NA> NA
15 exponential interactive aust church dwight 2
16 exxonmobil asia pacific pte <NA> NA
17 exxonmobil chemical asia pac div <NA> NA
18 exxonmobil png church dwight 2
19 formula world championship <NA> NA
20 fortitech asia pacific sdn bhd
I also tried lapply, but to no avail. As you can see, my master file is large, and sometimes I get an error that the row lengths do not match.
mm<-dt[lapply(result, function(x) levenshteinDist(x ,lapply(result1, function(x) x)))]
#using looping stat. for checking each cus name with all the master names
for(i in seq(nrow(result)) )
{
if((levenshteindist(result[i],lapply(result1, function(x) String(x))))==0)
sprintf("%s", x)
}
Which method would be best for this? I looked at a few similar questions on Stack Overflow, but they were not much help.
It might be naive, but it misbehaves when applied to huge data sets. Could anybody familiar with R correct the levenshteinDist code above?
code:
#check with each value of master file and if matches more than .90 then return master value.
for(i in seq(1:nrow(gr1))
{
for(j in seq(1:nrow(gr2))
{
gr1$jar[i,j]<-jarowinkler(gr1$ICIS_Cust_Names[i],gr2$Master_Names[j])
if(gr1$jar[i,j]>.90)
gr1$res[i] = gr2$Master_Names[j]
}
}
#Please let me know if there is any small error in this code
If anybody has worked with such data in R, please help!
I achieved a partial result with:
df$result<-data.frame(df$Cust_Names, df$Master_Names[max.col(-adist(df$Cust_Names,df$Master_Names))])
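The adist() idea above always returns some master name, even for customers that have no reasonable match. A hedged sketch of one way to add a cut-off: normalise the edit distance by the length of the longer string and keep the best match only below a threshold. The sample vectors are taken from the lists above; the threshold of 0.3 is an arbitrary assumption that would need tuning on real data.

```r
cust <- c("church dwight", "conocophillips", "expedia", "chang chun plastics")
master <- c("chang chun group", "church dwight", "conoco phillips",
            "expedia group", "caterpillar")

# Generalized Levenshtein distance between every customer/master pair.
d <- adist(cust, master)

# Normalise by the longer of the two strings so values lie in [0, 1].
longest <- outer(nchar(cust), nchar(master), pmax)
nd <- d / longest

# Closest master name per customer and its normalised distance.
best <- apply(nd, 1, which.min)
best_dist <- nd[cbind(seq_along(cust), best)]

# Assumed cut-off: anything further away counts as "no match".
threshold <- 0.3
result <- data.frame(
  cust = cust,
  match = ifelse(best_dist <= threshold, master[best], NA_character_),
  dist = round(best_dist, 2),
  stringsAsFactors = FALSE
)
result
```

Because adist() is vectorised over both inputs, no explicit double loop is needed. For very large files the full distance matrix may not fit in memory, in which case the customer names could be processed in chunks.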
