Webscrape from webpage list with no clear delimiters in R

I am learning to web scrape in R from a list of contacts on this webpage: https://ern-euro-nmd.eu/board-members/
There are 65 rows (contacts), and there should be 3 columns of associated details (Name, institution, and location). Here is a copy/paste of one row of data from the webpage:
Adriano Chio
Azienda Ospedaliero Universitaria Città della Salute e della Scienza
Italy
My current approach lumps all the details into one column. How can I split the data into 3 columns?
There only appears to be white space between these details on the webpage, so I am not sure what to do.
# Below is my R code:
library(rvest)

EURONMD_KOLs <- read_html("https://ern-euro-nmd.eu/board-members/") %>%
  html_nodes(".detailsListing") %>%
  html_text()
EURONMD_KOLs

EURONMD_KOLs_table <- data.frame(EURONMD_KOLs)
# end of R code
My resulting table lumps everything into one column. I need to separate each row of data into 3 columns. Any help is appreciated.

Remove the leading and trailing newline characters from the text, split on '\n', and create a 3-column data frame.
library(rvest)

read_html("https://ern-euro-nmd.eu/board-members/") %>%
  html_nodes(".detailsListing") %>%
  html_text() %>%
  trimws() %>%                  # drop leading/trailing newlines
  strsplit('\n+') %>%           # split each entry on runs of newlines
  do.call(rbind, .) %>%         # bind the 3-element vectors into a matrix
  as.data.frame() %>%
  setNames(c('Name', 'institution', 'location')) -> result

head(result)
# Name institution location
#1 Adriano Chio Azienda Ospedaliero Universitaria Città della Salute e della Scienza Italy
#2 Alessandra Ferlini University Hospital St. Anna Italy
#3 Andrés Nascimento Hospital Sant Juan de Déu Universidad de Barcelona Spain
#4 Angela Kaindl Charité - Universitätsmedizin Berlin Germany
#5 Anna Kostera-Pruszczyk SPCSK, Medical University of Warsaw Poland
#6 Anneke van der Kooi Academic Medical Centre Netherlands
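An alternative sketch of the same trim-and-split, expressed with tidyr::separate() instead of strsplit() (same selector and the same newline delimiter; tibble/tidyr are extra dependencies here):

# Alternative sketch: same idea, using tidyr::separate() to split on newlines.
library(rvest)
library(magrittr)
library(tibble)
library(tidyr)

read_html("https://ern-euro-nmd.eu/board-members/") %>%
  html_nodes(".detailsListing") %>%
  html_text() %>%
  trimws() %>%                                  # drop leading/trailing newlines
  tibble(raw = .) %>%                           # one scraped entry per row
  separate(raw, into = c("Name", "institution", "location"), sep = "\n+")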

Related

Combine two rows in R that are separated

I am trying to clean the dataset so that all data is in its appropriate cell by combining rows that are oddly separated. There is a nuance to the dataset: some rows are correctly coded and some are not.
Here is an example of the data:
Rank  Store  City           Address                      Transactions  Avg. Value  Dollar Sales Amt.
40    1404   State College  Hamilton Square Shop Center
                            230 W Hamilton Ave           155548        52.86       8263499
41    2310   Springfield    149 Baltimore Pike           300258        27.24       8211137
42    2514   Erie           Yorktown Centre
                            2501 West 12th Street        190305        41.17       7862624
Here is an example of how I want the data:
Rank  Store  City           Address                                          Transactions  Avg. Value  Dollar Sales Amt.
40    1404   State College  Hamilton Square Shop Center, 230 W Hamilton Ave  155548        52.86       8263499
41    2310   Springfield    149 Baltimore Pike                               300258        27.28       8211137
42    2514   Erie           Yorktown Centre, 2501 West 12th Street           190305        41.17       7862624
Is there an Excel or R function to fix this, or does anyone know how to write an R function to correct this?
I read up on the CONCATENATE function in Excel and realized it was not going to accomplish anything, so I figured an R function would be the only way to fix this.
The CONCATENATE function will work here; alternatively, in Excel you can select the columns and use the merge formula from the Formulas option to complete the task.
I recommend checking how the file is being parsed. From the example data you provided, it looks like the address column is being split on ", " and spilling onto the next line.
Based on this assumption alone, below is a potential solution using the tidyverse:
library(tidyverse)

original_data <- tibble(
  Rank = c(40, NA, 41, 42, NA),
  Store = c(1404, NA, 2310, 2514, NA),
  City = c("State College", NA, "Springfield", "Erie", NA),
  Address = c("Hamilton Square Shop Center", "230 W Hamilton Ave",
              "149 Baltimore Pike", "Yorktown Centre", "2501 West 12th Street"),
  Transactions = c(NA, 155548, 300258, NA, 190305),
  `Avg. Value` = c(NA, 52.86, 27.24, NA, 41.17),
  `Dollar Sales Amt.` = c(NA, 8263499, 8211137, NA, 7862624)
)

new_data <- original_data %>%
  fill(Rank:City) %>%                     # carry Rank/Store/City down into the spill-over rows
  group_by_at(vars(Rank:City)) %>%
  mutate(Address1 = lag(Address)) %>%     # remember the first address line within each group
  slice(n()) %>%                          # keep only the last (complete) row per group
  ungroup() %>%
  mutate(Address = if_else(is.na(Address1), Address,
                           str_c(Address1, Address, sep = ", "))) %>%
  select(Rank:`Dollar Sales Amt.`)

new_data
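The same cleanup can be written a bit more compactly as a single summarise(). A minimal sketch using the same original_data, assuming the numeric columns are populated on only one row per Rank/Store/City group:

# Compact sketch: fill the key columns down, then collapse each group.
# Assumes Transactions / Avg. Value / Dollar Sales Amt. appear on one row per group.
original_data %>%
  fill(Rank:City) %>%
  group_by(Rank, Store, City) %>%
  summarise(Address = str_c(Address, collapse = ", "),
            across(Transactions:`Dollar Sales Amt.`, ~ max(.x, na.rm = TRUE)),
            .groups = "drop")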

Combine every two rows of data in R

I have a csv file that I have read in, but I now need to combine every two rows. There is a total of 2000 rows, which I need to reduce to 1000. Every two rows share the same account number in one column and have the address split across two rows in another. Two rows are taken up for each observation, and I want to combine the two address rows into one. For example, rows 1 and 2 are Acct# 1234 and have 123 Hollywood Blvd and LA California 90028 on their own lines respectively.
Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)

df %>%
  group_by(Acct) %>%
  summarise(Address = str_c(Address, collapse = " "))

# A tibble: 2 × 2
   Acct Address
  <dbl> <chr>
1  1234 123 Hollywood Blvd LA California 90028
2  4321 55 Park Avenue NY New York State 6666

Data:

df <- data.frame(
  Acct = c(1234, 1234, 4321, 4321),
  Address = c("123 Hollywood Blvd", "LA California 90028",
              "55 Park Avenue", "NY New York State 6666")
)
It can be fairly simple with the data.table package:

# assuming `dataset` is the name of your dataset, the column with the account
# number is called 'actN', and the column with the address is 'adr'
library(data.table)
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]
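For completeness, the same collapse can be done in base R with aggregate(); a minimal sketch using the df example data from the tidyverse answer above:

# Base R sketch: paste the two address lines together per account number.
aggregate(Address ~ Acct, data = df, FUN = paste, collapse = " ")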

How to clean up data in R using strings?

I need to clean up gender and dates columns of the dataset found here.
They apparently contain some misspellings and ambiguities. I am new to R and data cleaning so I am not sure how to go about doing this. For starters, I have tried to correct the misspellings using
factor(data$artist_data$gender)
str_replace_all(data$artist_data$gender, pattern = "femle", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "f.", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "F.", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "female", replacement = "Female")
But it doesn't seem to work, as I still have f., F. and femle in my output. Secondly, there seem to be empty cells inside. Do I need to remove them, or is it alright to leave them there? If I need to remove them, how?
Thirdly, for the dates column, how do I make it clearer, i.e. change the format of "born in xxxx" to "xxxx-yyyy" if the artist has died, or "xxxx-present" if still alive? E.g. for "born 1940", is it safe to assume that they are still alive? Also, one of the entries has the word "active" in it. I would like to make this data more straightforward.
Please help,
Thank you.
We have to escape the dot in f. and F.

library(dplyr)
library(stringr)
library(tibble)

pattern <- paste("f\\.|F\\.|female|femle", collapse = "|")

df[[2]] %>%
  mutate(gender = str_replace(string = gender,
                              pattern = pattern,
                              replacement = "Female")) %>%
  as_tibble()
name gender dates placeOfBirth placeOfDeath
<chr> <chr> <chr> <chr> <chr>
1 Abakanowicz, Magdalena Female born 1930 Polska ""
2 Abbey, Edwin Austin Male 1852–1911 Philadelphia, United States "London, United Kingdom"
3 Abbott, Berenice Female 1898–1991 Springfield, United States "Monson, United States"
4 Abbott, Lemuel Francis Male 1760–1803 Leicestershire, United Kingdom "London, United Kingdom"
5 Abrahams, Ivor Male born 1935 Wigan, United Kingdom ""
6 Absalon Male 1964–1993 Tel Aviv-Yafo, Yisra'el "Paris, France"
7 Abts, Tomma Female born 1967 Kiel, Deutschland ""
8 Acconci, Vito Male born 1940 New York, United States ""
9 Ackling, Roger Male 1947–2014 Isleworth, United Kingdom ""
10 Ackroyd, Norman Male born 1938 Leeds, United Kingdom ""
# ... with 3,522 more rows
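If you also want to handle the empty gender cells the question mentions, here is a minimal sketch (assuming the artist data is in df[[2]] as above): str_replace_all() accepts a named vector of pattern = replacement pairs, and na_if() turns blank strings into NA.

# Sketch: fix all gender spellings in one pass and convert blank cells to NA.
# Assumes the artist data lives in df[[2]], as in the answer above.
library(dplyr)
library(stringr)

gender_fixes <- c("^femle$" = "Female", "^f\\.$" = "Female",
                  "^F\\.$" = "Female", "^female$" = "Female")

df[[2]] %>%
  mutate(gender = str_replace_all(gender, gender_fixes),
         gender = na_if(gender, ""))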

Extract cities from each row in excel and export to its respective row using R

I have extracted tweets in .csv format and the data looks like this:
(row 1) The latest The Admin Resources Daily! Thanks to #officerenegade #roberthalf #elliottdotorg #airfare #jobsearch
(row 2) RT #airfarewatchdog: Los Angeles #LAX to Cabo #SJD $312 nonstop on #AmericanAir for summer travel. #airfare
(row 3) RT #TheFlightDeal: #Airfare Deal: [AA] New York - Mexico City, Mexico. $270 r/t. Details:
(row 4) The latest The Nasir Muhammad Daily! Thanks to #Matt_Revel #Roddee #JaeKay #lefforum #airfare
(row 5) RT #BarefootNomads: So cool! <U+2708> <U+2764><U+FE0F> #airfare deals w #Skyscanner Facebook Messenger Bot #traveldeals #cheapflights ht…
(row 6) Flights to #Oranjestad #Aruba are £169 for a 15 day trip departing Tue, Jun 7th. #airfare via #hitlist_app"
I have used NLP (named entity recognition) to extract city names from the tweets, but the output is a single list of cities, one city per row. It simply identifies all the city names and lists them, losing track of which tweet each city came from.
Output:
1 Los Angeles
2 New York
3 Mexico City
4 Mexico
5 Tue
6 London
7 New York
8 Fort Lauderdale
9 Los Angeles
10 Paris
I want the output to be something like:
1 Los Angeles Cabo (from the first tweet in row 2)
2 New York Mexico City Mexico (from the second tweet in row 3)
Code:
# Named Entity Recognition (NER)
bio <- readLines("C:\\xyz\\tweets.csv")
print(bio)

install.packages(c("NLP", "openNLP", "RWeka", "qdap"))
install.packages("openNLPmodels.en",
                 repos = "http://datacube.wu.ac.at/",
                 type = "source")

library(NLP)
library(openNLP)
library(RWeka)
library(qdap)
library(openNLPmodels.en)
library(magrittr)

bio <- as.String(bio)

word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
bio_annotations <- annotate(bio, list(sent_ann, word_ann))
class(bio_annotations)
head(bio_annotations)

bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)
sents(bio_doc) %>% head(2)
words(bio_doc) %>% head(10)

location_ann <- Maxent_Entity_Annotator(kind = "location")
pipeline <- list(sent_ann, word_ann, location_ann)
bio_annotations <- annotate(bio, pipeline)
bio_doc <- AnnotatedPlainTextDocument(bio, bio_annotations)

entities <- function(doc, kind) {
  s <- doc$content
  a <- annotations(doc)[[1]]
  if (hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}

entities(bio_doc, kind = "location")
cities <- entities(bio_doc, kind = "location")

library(xlsx)
write.xlsx(cities, "C:\\xyz\\xyz.xlsx")
Also is there a way that I can further separate the cities as origin and destination, i.e. by classifying cities before 'to' or '-' as origin cities and the rest as destination cities?
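One minimal sketch for keeping the extracted locations grouped per tweet (not a full answer): annotate each line separately instead of collapsing the whole file with as.String(), reusing the annotators and the entities() helper defined in the code above.

# Sketch: run the annotation pipeline per tweet so locations stay grouped.
# Reuses sent_ann, word_ann, location_ann and entities() from the code above.
bio_lines <- readLines("C:\\xyz\\tweets.csv")
locations_per_tweet <- lapply(bio_lines, function(tweet) {
  tweet <- as.String(tweet)
  anns  <- annotate(tweet, list(sent_ann, word_ann, location_ann))
  doc   <- AnnotatedPlainTextDocument(tweet, anns)
  entities(doc, kind = "location")
})
# locations_per_tweet[[2]] would then hold the cities found in the second tweet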

How to separate a row in a CSV and generate another CSV file from it in R?

I have a CSV file like
AdvertiserName,CampaignName
Wells Fargo,Gary IN MetroChicago IL Metro
EMC,Los Angeles CA MetroBoston MA Metro
Apple,Cupertino CA Metro
Desired Output in R
AdvertiserName,City,State
Wells Fargo,Gary,IN
Wells Fargo,Chicago,IL
EMC,Los Angeles,CA
EMC,Boston,MA
Apple,Cupertino,CA
I have tried it like this:
record <- read.csv("C:/Users/Administrator/Downloads/Campaignname.csv",header=TRUE)
ad <- record$AdvertiserName
camp <- record$CampaignName
read.table(text=gsub('Metro', '\n', c), col.names=c('City', 'State'))
It throws an error.
How to get the desired result?
Thanks in advance.
You can do this, for example:

## read the csv file; replace `text` here with your file name
xx <- read.table(text = 'AdvertiserName,CampaignName
Wells Fargo,Gary INMetro Chicago IL Metro
EMC,Los Angeles CAMetro Boston MA Metro', sep = ',', header = TRUE)

## use a regular expression to create city and state variables
## rows are separated by ":"
## columns are separated by a comma ","
res <- gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*', '\\1,\\2:\\3,\\4',
            xx$CampaignName)

## use strsplit to extract rows and columns
## this is compact code!
yy <- Map(function(x, y)
            cbind.data.frame(y, do.call(rbind, strsplit(x, ','))),
          strsplit(res, ':'), xx$AdvertiserName)

## create the final data.frame and set names
res <- do.call(rbind, yy)
setNames(res, c('AdvertiserName', 'City', 'State'))

  AdvertiserName        City State
1    Wells Fargo        Gary    IN
2    Wells Fargo     Chicago    IL
3            EMC Los Angeles    CA
4            EMC      Boston    MA
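A more compact sketch with tidyr is also possible, assuming every City/State pair in CampaignName is terminated by "Metro" as in the question's CSV:

# Compact tidyr sketch built on the question's CSV layout.
# Assumes each "City ST" chunk in CampaignName ends with "Metro".
library(dplyr)
library(tidyr)

record <- read.csv(text = "AdvertiserName,CampaignName
Wells Fargo,Gary IN MetroChicago IL Metro
EMC,Los Angeles CA MetroBoston MA Metro
Apple,Cupertino CA Metro", stringsAsFactors = FALSE)

record %>%
  separate_rows(CampaignName, sep = "Metro") %>%   # one City/State chunk per row
  mutate(CampaignName = trimws(CampaignName)) %>%
  filter(CampaignName != "") %>%                   # drop empty trailing pieces
  extract(CampaignName, into = c("City", "State"), regex = "(.*) ([A-Z]{2})$")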
