BeautifulSoup.__init__() got multiple values for argument 'features' (when importing a custom BeautifulSoup wrapper function) - web-scraping

Problem
I had been using this wrapper function (defined in a util module) with no issues:

import requests
from bs4 import BeautifulSoup

def make_soup(response):
    return BeautifulSoup(response.text, 'lxml')

response = requests.get('https://www.webscraper.io/test-sites/e-commerce/allinone')
soup = make_soup(response)  # No error
However, after upgrading from Python 3.8 to 3.11, importing the function and using it produces the following error:
import requests
from util import make_soup

response = requests.get('https://www.webscraper.io/test-sites/e-commerce/allinone')
make_soup(response)  # Error
TypeError: BeautifulSoup.__init__() got multiple values for argument 'features'
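For context, this TypeError is Python's generic complaint when a single parameter receives both a positional and a keyword value in the same call. A minimal illustration with a hypothetical parse function (not part of bs4) whose signature mirrors BeautifulSoup(markup, features=...):

def parse(markup, features='html.parser'):
    # Hypothetical stand-in; returns the parser name it was given.
    return features

print(parse('<p>hi</p>', 'lxml'))  # fine: 'lxml' binds to features positionally

try:
    parse('<p>hi</p>', 'lxml', features='lxml')
except TypeError as err:
    print(err)  # parse() got multiple values for argument 'features'

This suggests that, somewhere along the import path, features is effectively being supplied twice when the wrapper runs.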
Attempted fix: Passing "features" as a keyword argument
I tried passing 'lxml' as a keyword argument instead (bs here is BeautifulSoup imported under an alias):

def make_soup(response):
    return bs(response.text, features='lxml')
Initially, this appeared to fix the issue, as the function ran fine when imported.
However, I noticed that the returned BeautifulSoup object was often missing large parts of the HTML that were not missing when I executed the same code without importing it.
Here's one more extreme example:

import requests
from bs4 import BeautifulSoup as bs  # the alias used in the snippets above
from util import make_soup

session = requests.Session()
response = session.get('https://www.deviantart.com/')  # one of the many sites I tested this on

soup = bs(response.text, 'lxml')               # parsed directly
soup_from_imported_func = make_soup(response)  # parsed via the imported function

num_divs = lambda soup: len(soup.find_all('div'))
print(num_divs(soup))
print(num_divs(soup_from_imported_func))
These printed values should be equal, but they're not:

1120
0

(the direct parse finds the divs; the soup from the imported function finds none)
If I print the broken soup, this is all I get:
<?xml version="1.0" encoding="utf-8"?>
I have tried different parsers (html.parser and html5lib), but the issue persists.
Conclusion
As a temporary workaround I can reluctantly inline the contents of the make_soup function wherever it's needed.
However, I would like to understand why this is happening, in case it causes more problems in the future.
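One hedged guess, not a confirmed diagnosis: since only the imported copy misbehaves, it is worth verifying that from util import make_soup resolves to the file you expect under the new 3.11 interpreter; a stale or shadowing util module elsewhere on sys.path could explain both the TypeError and the near-empty soup. A minimal diagnostic sketch, assuming only that util and bs4 are importable:

import inspect
import bs4
from util import make_soup

print(bs4.__version__)                   # which bs4 build this interpreter actually sees
print(inspect.getsourcefile(make_soup))  # which file the import resolved to
print(inspect.getsource(make_soup))      # the code that is really being run

If the printed path or source is not your wrapper, the import, not BeautifulSoup, is the problem.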

Related

Can't import biom-file with import_biom in R

I collected 6 fastq files from the same mock sample and merged them using gzip on Linux for further use with Kraken2. The output file from Kraken2 (.report) was converted to .biom format using kraken-biom on Linux. When I then try to import the .biom file into R using import_biom, I receive the following message:
Error in validObject(.Object) : invalid class “phyloseq” object:
Component sample names do not match. Try sample_names()
I have opened the .biom file and can see only one sample name (the name I gave the output file during gzip). I tried to use sample_names(), but can't, since the .biom file is not loaded into R. Does anyone know why the sample names do not match? Since I merged the files into one, shouldn't there be just one sample name?
Edit: When I run Kraken2 on the 6 fastq files without merging them and then use kraken-biom, the .biom file imports into R without error.

Importing CSV via API Stopped Working in R

I have some code that I have run multiple times successfully, but today it stopped working. Specifically, it coerces a response from an API into a data frame, much like this:
results <- data.frame(httr::content(response))
Again, this code worked before, and other than some filtering of the data I have not changed it.
The error I get is:
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
  cannot coerce class ‘c("xml_document", "xml_node")’ to a data.frame
I am trying to fix it and figure out what happened, but I am at a loss here. Let me know if you need more info.
Thanks.

Return query when using httr in a function in R

I was learning about the httr package and web scraping based on an exercise from Dataquest, and am attempting to implement it in my own practice programs. My issue comes from trying to make a query within a function.
For example, the following code:
library(httr)

api_request <- function(base_url, loc){
  # Build the full URL, then issue the GET request and return the response
  url <- modify_url(paste(base_url), path = paste(loc))
  response <- GET(url)
  return(response)
}
When I run the code, everything initially appears to work: the status comes back 200, and no errors or warnings show up. However, I cannot get the response to save to the global environment. I've tried this method, as well as changing return(response) to just response in the function as recommended by Dataquest, but it will not save to the global environment.
I can get this to work outside of a function, but I want to implement it inside a function so that, if any errors occur when making the query, I can stop the function and not save a 404 link.
How can I get the query to return from the function so that I can reference it later on in the code?

Python LDA gensim "DeprecationWarning: invalid escape sequence"

I am new to Stack Overflow and Python, so please bear with me.
I am trying to run a Latent Dirichlet Allocation (LDA) analysis on a text corpus with the gensim package in Python, using the PyCharm editor. I prepared the corpus in R and exported it to a CSV file using this R command:
write.csv(testdf, "C://...//test.csv", fileEncoding = "utf-8")
This creates the following CSV structure (though with much longer, already preprocessed texts):
,"datetimestamp","id","origin","text"
1,"1960-01-01","id_1","Newspaper1","Test text one"
2,"1960-01-02","id_2","Newspaper1","Another text"
3,"1960-01-03","id_3","Newspaper1","Yet another text"
4,"1960-01-04","id_4","Newspaper2","Four Five Six"
5,"1960-01-05","id_5","Newspaper2","Alpha Bravo Charly"
6,"1960-01-06","id_6","Newspaper2","Singing Dancing Laughing"
I then run the following minimal Python code (based on the gensim tutorials) to perform a simple LDA analysis:
import gensim
from gensim import corpora, models, similarities, parsing
import pandas as pd
from six import iteritems
import os
import pyLDAvis.gensim

class MyCorpus(object):
    def __iter__(self):
        # relies on the module-level 'dictionary' created below
        for row in pd.read_csv('//mpifg.local/dfs/home/lu/Meine Daten/Imagined Futures and Greek State Bonds/Topic Modelling/Python/test.csv',
                               index_col=False, header=0, encoding='utf-8')['text']:
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(row.split())

if __name__ == '__main__':
    dictionary = corpora.Dictionary(row.split() for row in pd.read_csv(
        '//.../test.csv', index_col=False, encoding='utf-8')['text'])
    print(dictionary)
    dictionary.save('//.../greekdict.dict')  # store the dictionary, for future reference

    ## create an mmCorpus
    corpora.MmCorpus.serialize('//.../greekcorpus.mm', MyCorpus())
    corpus = corpora.MmCorpus('//.../greekcorpus.mm')
    dictionary = corpora.Dictionary.load('//.../greekdict.dict')
    corpus = corpora.MmCorpus('//.../greekcorpus.mm')

    # train model
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                          num_topics=50, iterations=1000)
I get the following warnings, and the code exits:
...\Python\venv\lib\site-packages\setuptools-28.8.0-py3.6.egg\pkg_resources\_vendor\pyparsing.py:832: DeprecationWarning: invalid escape sequence \d
...\Python\venv\lib\site-packages\setuptools-28.8.0-py3.6.egg\pkg_resources\_vendor\pyparsing.py:2736: DeprecationWarning: invalid escape sequence \d
...\Python\venv\lib\site-packages\setuptools-28.8.0-py3.6.egg\pkg_resources\_vendor\pyparsing.py:2914: DeprecationWarning: invalid escape sequence \g
...\Python\venv\lib\site-packages\pyLDAvis\_prepare.py:387: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
I cannot find any solution and, to be honest, have no clue where exactly the problem comes from. I spent hours making sure that the CSV is encoded as UTF-8 and is exported (from R) and imported (in Python) correctly.
What am I doing wrong, or where else should I look? Cheers!
A DeprecationWarning is exactly that - a warning that a feature is deprecated, which is supposed to prompt the user to switch to some other functionality to maintain compatibility in the future. So in your case I would just watch for updates of the libraries you use.
Starting with the last warning: it looks like it originates from pandas and has been logged against pyLDAvis here.
The remaining ones come from the pyparsing module, but it does not seem that you are importing it explicitly. Maybe one of the libraries you use has it as a dependency and relies on some relatively old, deprecated functionality. To eradicate the warnings, I would first check whether upgrading helps. Good luck!
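If upgrading alone does not reveal which dependency triggers the warnings, one way to pin down the origin (a sketch using only the standard library's warnings module, not something from the original answer) is to escalate DeprecationWarning to an error, so the resulting traceback names the offending module:

import warnings

# Promote DeprecationWarning to an exception; the traceback then points at
# the exact module and line that raised it.
warnings.simplefilter("error", DeprecationWarning)

import pyLDAvis.gensim  # re-run the suspect imports under the strict filter

Note that "invalid escape sequence" warnings are emitted only when a module is first compiled, so cached bytecode may need to be cleared before they reappear.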
Try using this:

import warnings
warnings.filterwarnings("ignore")

pyLDAvis.enable_notebook()
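A gentler variant of that blanket ignore (a suggestion on top of the answer above, not part of it) restricts the filter to this one category, so other, potentially useful warnings stay visible:

import warnings

# Suppress only DeprecationWarning; everything else still prints.
warnings.filterwarnings("ignore", category=DeprecationWarning)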

Importing a file only if it has been modified since last import and then save to a new object

I am trying to create a script that I run about once a week. The goal is for it to check an MS Excel file that a co-worker manages, then test whether the date the file was modified is newer than the last time it was imported. If it is newer, it imports the file (I am using the readxl package - WONDERFUL!) into a new object whose name includes the date the original Excel file was last modified. I have everything working except for the assignment of the imported data.frame to a new object that includes the date.
An example of the code I am using is:
First I create an object with a path pointing to the file of interest.
pfdFilePath <- file.path("H:", "3700", "3780", "002-00",
                         "3. Project Information", "Program", "RAH program.xls")
After testing to verify that the file has been modified, I tried simple assignment ("test" is just an example for simplification):
paste("df-", as.Date(file.info(pfdFilePath)$mtime), sep = "") <- "test"
But that code produces an error:
Error in paste("df-", as.Date(file.info(pfdFilePath)$mtime), sep = "") <- "test" :
  target of assignment expands to non-language object
I then try the assign function:
assign(paste("df-", as.Date(file.info(pfdFilePath)$mtime), sep = ""), "test")
Running this code creates an object that looks to be okay, but when I evaluate it, or try using str() or class() on it, I get the following error:
Error in df - df-2016-08-09 :
  non-numeric argument to binary operator
I am pretty sure this error has to do with the environment in which I am using assign, but being relatively new to R, I cannot figure it out. I understand that the assign function is frowned upon, but those warnings seem centered on for loops vs. lapply functions. I am not really iterating within a function, though; I just want a dynamically named object whenever I run the script. I can't come up with a better way to do it. If there is another way to do this that doesn't require assign, or a better way to use assign, I would love to know it.
Thank you in advance, and sorry if this is a duplicate. I have spent the entire evening digging and can't find what I need.
Abdou provided the key:

assign(paste0("df.", "pfd.", strftime(file.info(pfdFilePath)$mtime, "%Y%m%d")), "test01")

I also converted to the cleaner paste0 function and got rid of the dashes, which R was parsing as subtraction, to avoid confusion. Lesson learned.
Works perfectly.
