Webscraping question on appending several values to a single row with beautifulsoup - web-scraping

Lets say i want to scrape imdb for top 10 movies. I would like to fetch the title for the movies and the cast members for the movies.
Im easily able to fetch the title of the movies and append them to a list. The problem is i dont know how to append several values to a single row. Let say the first movie has 3 actors, the second movie has 5 actors, how can append the actors to a list so that the 3 actors on in first movie are in row 1 of the list and the 5 actors from the second movie are in row 2 and so on.

Just a general approach, cause there is no code provided in your question.
Request the webiste (example top 250 movies) and cook your soup:
response = requests.get('http://www.imdb.com/chart/top')
soup = BeautifulSoup(response.text, 'lxml')
Create your empty list that should store your results:
data = []
Iterate over the result set of your selection (example top 250 movies) and append a dict per iteration to your list:
for e in soup.select('.titleColumn a'):
data.append({
'title':e.text,
'director':e['title'].split('(dir.),')[0],
'actors':e['title'].split('(dir.),')[-1]
})
Print your data or create a data frame from your list of dicts and :
pd.DataFrame(data)
Output
title director actors
0 Die Verurteilten Frank Darabont Tim Robbins, Morgan Freeman
1 Der Pate Francis Ford Coppola Marlon Brando, Al Pacino
2 Der Pate 2 Francis Ford Coppola Al Pacino, Robert De Niro

Related

How do I automate the code to run for each row entry in the dataset in R?

Let's say I have this data frame of several random sentences
Sentences<-c("John is playing a video game at the moment","Tom will cook a delicious meal later",
"Kyle is with his friends watching the game",
"Diana is hosting her birthday party tomorrow night"
)
df<-data.frame(a)
keywords<-c("game","is","will","meal","birthday","party")
And I have a vector of key words. I need to create a new column in the data frame with only keywords mentioned in the sentence appearing.
na.omit(str_match(df[n,],keywords))
I have constructed this line of code which returns keywords that were used in those sentences (n stands for row number). How do I automate this code to be applied for each row?
We could use str_extract_all from stringr package for this:
library(dplyr)
library(stringr)
df %>%
mutate(new_col = str_extract_all(Sentences, paste(keywords, collapse = "|")))
Sentences new_col
1 John is playing a video game at the moment is, game
2 Tom will cook a delicious meal later will, meal
3 Kyle is with his friends watching the game is, is, game
4 Diana is hosting her birthday party tomorrow night is, birthday, party

How to retrieve movies' genres from wikidata using R

I would like to retrieve information from wikidata and store it in a dataframe. For the sake of simplicity I am going to assume that I want to get the genre of the following movies and then filter those that belong to sci-fi:
movies = c("Star Wars Episode IV: A New Hope", "Interstellar",
"Happythankyoumoreplease")
I know there is a package called WikidataR. If I am not wrong, and according to its vignettes there are two commands that may be useful: find_item and find_property allow you to retrieve a set of Wikidata items or properties where the aliase or descriptions match a particular search term. Apparently they are great for me, so I thought of doing something like
for (i in movies) {
info = find_item(i)
}
This is what I get from each item:
> find_item("Interstellar")
Wikidata item search
Number of results: 10
Results:
1 Interstellar (Q13417189) - 2014 US science fiction film
2 Interstellar (Q6057099)
3 interstellar medium (Q41872) - matter and fields (radiation) that exist in the space between the star systems in a galaxy;includes gas in ionic, atomic or molecular form, dust and cosmic rays. It fills interstellar space and blends smoothly into the surrounding intergalactic space
4 space colonization (Q686876) - concept of permanent human habitation outside of Earth
5 rogue planet (Q167910) - planetary-mass object that orbits the galaxy directly
6 interstellar cloud (Q1054444) - accumulation of gas, plasma and dust in a galaxy
7 interstellar travel (Q834826) - term used for hypothetical manned or unmanned travel between stars
8 Interstellar Boundary Explorer (Q835898)
9 starship (Q2003852) - spacecraft designed for interstellar travel
10 interstellar object (Q2441216) - astronomical object in interstellar space, such as a comet
>
Unfortunately, the information that I get from find_item (see below) has two problems:
it is not a dataframe with all wikidata information of the item I
am searching but a list of what seems to be metadata (wikidata's id,
link...).
it does not have the information I need (wikidata's
properties from each particular wikidata item).
Similarly, find_property provides metadata of a certain property. find_property("genre") retrieves the following information:
> find_property("genre")
Wikidata property search
Number of results: 4
Results:
1 genre (P136) - a creative work's genre or an artist's field of work (P101). Use main subject (P921) to relate creative works to their topic
2 radio format (P415) - describes the overall content broadcast on a radio station
3 sex or gender (P21) - sexual identity of subject: male (Q6581097), female (Q6581072), intersex (Q1097630), transgender female (Q1052281), transgender male (Q2449503). Animals: male animal (Q44148), female animal (Q43445). Groups of same gender use "subclass of" (P279)
4 gender of a scientific name of a genus (P2433) - determines the correct form of some names of species and subdivisions of species, also subdivisions of a genus
This has similar problems:
it is not a dataframe
it just stores metadata about the property
I don't find any way to link each property with each object in movies vector.
Is there any way to end up with a dataframe containing the genre's of those movies? (or a dataframe with all wikidata's information which I will have to manipulate in order to filter or select my desired data?)
These are just lists. you can get a picture with str(find_item("Interstellar")) for example.
Then you can go through each element of the list and pick the item that you need. For example. Getting the title and the label
a <- find_item("Interstellar")
b <- Reduce(rbind,lapply(a, function(x) cbind(x$title,x$label)))
data.frame(b)
## X1 X2
## 1 Q13417189 Interstellar
## 2 Q6057099 Interstellar
## 3 Q41872 interstellar medium
## 4 Q686876 space colonization
## 5 Q167910 rogue planet
## 6 Q1054444 interstellar cloud
## 7 Q834826 interstellar travel
## 8 Q835898 Interstellar Boundary Explorer
## 9 Q2003852 starship
## 10 Q2441216 interstellar object
This works easily for regular data if some element is missing then you will have to handle it for example some items don't have description. So you can get around with the following.
Reduce("rbind",lapply(a,
function(x) cbind(x$title,
x$label,
ifelse(length(x$description)==0,NA,x$description))))

How to write a for-loop that searches names from data.frame in a character vector?

I have a data.frame with names of football players, for example:
names <- data.frame(id=c(1,2,3,4,5,6,7),
year=c('Maradona', 'Cruyff', 'Messi', 'Ronaldo', 'Pele', 'Van Basten', 'Diego'))
> names
id year
1 1 Maradona
2 2 Cruyff
3 3 Messi
4 4 Ronaldo
5 5 Pele
6 6 Van Basten
7 7 Diego
I also have a 6,000 scraped text files, containing stories about these football players. These stories are stored as 6,000 elements in a large vector called stories.
Is there a way a loop (or an apply function) can be written that searches for the names of each of the football players. If a match or multiple matches occur, I would like to record the element number and the name(s) of the football player.
For example, consider the following text in stories[1]:
Diego Armando Maradona (born 30 October 1960) is a retired Argentine
professional footballer. He has served as a manager and coach at other
clubs as well as the national team of Argentina. Many in the sport,
including football writers, former players, current players and
football fans, regard Maradona as the greatest football player of all
time. He was joint FIFA Player of the 20th Century
with Pele.
The ideal data.frame would have the following structure:
> outcome
element name1 name2
1 1 Maradona Pele
Does somebody know a way to write such a code that results in one data.frame for with information on all football players?
I just did it with a loop, but maybe you can do it with an apply function
#Make sure you include stringsAsFactors = F or my code won't work
football_names <- data.frame(id=c(1:7),
year=c('Maradona', 'Cruyff', 'Messi', 'Ronaldo', 'Pele', 'Van Basten', 'Diego'),stringsAsFactors = F)
outcome <- data.frame(element=football_names$id)
for (i in 1:nrow(football_names)){
names_in_story <- football_names$year[football_names$year %in% unlist(strsplit(stories[i],split=" "))]
for (j in 1:length(names_in_story)){
outcome[i,j+1] <- names_in_story[j]
}
}
names(outcome) <- c("element",paste0("name",1:(ncol(outcome)-1)))
I don't undertsand your question exactly. But you can try to use a string match using astringr function and lapply.
I assumed that your data stories is a list.
The function finds all names you provide into the function as a vector and counts their occurence. The output is again a list.
foo <- function(x,y) table(unlist(str_match_all(x,paste0(y,collapse = "|"))))
The result
res <- lapply(series, foo,names$year)
Then you can merge and sum up the data (rowSums()) for example like this:
Reduce(function(...) merge(..., all=T, by="Var1"), res)

Name matching with different length data frames in R

I have two dataframes with numerous variables. Of primary concern are the following variables, df1.organization_name and df2.legal.name. I'm just using fully qualified SQL-esque names here.
df1 has dimensions of 15 x 2700 whereas df2 has dimensions of 10x40,000. And essentially, the 'common' or 'matching' columns are name fields.
I reviewed this post Merging through fuzzy matching of variables in R and it was very helpful but I can't really figure out how to wrangle the script to get it to work with my dfs.
I keep getting an error - Error in which(organization_name[i] == LEGAL.NAME) :
object 'LEGAL.NAME' not found.
Desired Matching and Outcome
What I am trying to do is compare each and every one of my df1.organization_name to every one of the df2.legal_name and make a comparison if they are a very close match (like >=85%). And then like in the script above, take matched customer name and the matched comparison name and put them into a data.frame for later analysis.
So, if one of my customer names is 'Johns Hopkins Auto Repair' and one of my public list names is, 'John Hopkins Microphone Repair', I would call that a good match and I want some sort of indicator appended to my customer list (in another column) that says, 'Partial Match' and the name from the public list.
Example(s) of the dfs for text wrangling:
df1.organization_name (these are fake names b/c I can't post customer names)
- My Company LLC
- John Johns DBA John's Repair
- Some Company Inc
- Ninja Turtles LLP
- Shredder Partners
df2.LEGAL.NAME (these are real names from the open source file)
- $1 & UP STORE CORP.
- $1 store 0713
- LLC 0baid/munir/gazem
- 1 2 3 MONEY EXCHANGE LLC
- 1 BOY & 3 GIRLS, LLC
- 1 STAR BEVERAGE INC
- 1 STOP LLC
- 1 STOP LLC
- 1 STOP LLC DBA TIENDA MEXICANA LA SAN JOSE
- 1 Stop Money Centers, LLC/Richard

Datalist or Datagrid to display rows of database looking like a table?

I'm working with a sql table that looks like this:
ID Region Store Sales
1 MEX Supermarket 10,000
2 USA Supermarket 5,000
3 MEX Club 10,000
4 USA Direct 1,000
5 MEX Direct 4,000
6 USA Club 0
I have 8 different region options(just 2 here to make it shorter). Also, 3 store options, and I want to display the info like this:
Store MEX USA
Supermarket 10,000 5,000
Club 10,000 0
Direct 4,000 1,000
So, I don't know what to choose to display the info like a table to compare the sales between the regions in each store segment, which would be the correct sql to call them and order in columns to make it look like the table.
Good candidate for PIVOT operator. Check this example.
For your situation, try this. You can add additional columns besides MEX and USA.
SELECT *
FROM ( SELECT [Region] ,
[Store] ,
[Sales]
FROM [RegionSales]
) rs PIVOT ( SUM(Sales) FOR Region IN ( [MEX], [USA] ) ) AS SalesPerRegion

Resources