Tab Delimited csv instead of comma delimited in scrapy - web-scraping

I am currently using the command
scrapy crawl myspider -o output.csv -t csv
to get output CSV files. These files are comma delimited by default. How do I get a tab delimited file instead?

Use this solution to override Scrapy's default CSV writer delimiter.
scraper/exporters.py
from scrapy.exporters import CsvItemExporter

class CsvCustomSeperator(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        kwargs['encoding'] = 'utf-8'
        kwargs['delimiter'] = '\t'
        super(CsvCustomSeperator, self).__init__(*args, **kwargs)
scraper/settings.py
FEED_EXPORTERS = {
    'csv': 'scraper.exporters.CsvCustomSeperator'
}
In terminal
$ scrapy crawl spider -o file.csv
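On newer Scrapy versions there may be a way to avoid the custom exporter class: as far as I know, the FEEDS setting (Scrapy 2.4+) accepts per-feed item_export_kwargs that are forwarded to the exporter. A minimal sketch, assuming your Scrapy version supports that setting:
# settings.py -- sketch assuming Scrapy >= 2.4
FEEDS = {
    'output.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
        'item_export_kwargs': {
            'delimiter': '\t',  # tab instead of the default comma
        },
    },
}
With this in settings.py, running scrapy crawl spider should write the tab-delimited output.csv without needing the -o flag.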

Related

snakemake Wildcards in input files cannot be determined from output files:

I am using Snakemake to create a pipeline that splits a BAM file by chromosome, but I ran into a problem:
Wildcards in input files cannot be determined from output files:
'OutputDir'
Can someone help me figure it out?
if config['ref'] == 'hg38':
    ref_chr = []
    for i in range(1, 23):
        ref_chr.append('chr' + str(i))
    ref_chr.extend(['chrX', 'chrY'])
elif config['ref'] == 'b37':
    ref_chr = []
    for i in range(1, 23):
        ref_chr.append(str(i))
    ref_chr.extend(['X', 'Y'])

rule all:
    input:
        expand(f"{OutputDir}/split/{name}.{{chr}}.bam", chr=ref_chr)

rule minimap2:
    input:
        TargetFastq
    output:
        Sortbam = "{OutputDir}/{name}.sorted.bam",
        Sortbai = "{OutputDir}/{name}.sorted.bam.bai"
    resources:
        mem_mb = 40000
    threads: nt
    singularity:
        OntSoftware
    shell:
        """
        minimap2 -ax map-ont -d {ref_mmi} --MD -t {nt} {ref_fasta} {input} | samtools sort -O BAM -o {output.Sortbam}
        samtools index {output.Sortbam}
        """

rule split_bam:
    input:
        rules.minimap2.output.Sortbam
    output:
        splitBam = expand(f"{OutputDir}/split/{name}.{{chr}}.bam", chr=ref_chr),
        splitBamBai = expand(f"{OutputDir}/split/{name}.{{chr}}.bam.bai", chr=ref_chr)
    resources:
        mem_mb = 30000
    threads: nt
    singularity:
        OntSoftware
    shell:
        """
        samtools view -@ {nt} -b {input} {chr} > {output.splitBam}
        samtools index -@ {nt} {output.splitBam}
        """
I changed the wildcards {OutputDir}, but it does not help.
expand(f"{OutputDir}/split/{name}.{{chr}}.bam", chr=ref_chr),
splitBamBai = expand(f"{OutputDir}/split/{name}.{{chr}}.bam.bai", chr=ref_chr),
A couple of comments on these lines:
You escape chr by using double braces, {{chr}}. This means you don't want chr to be expanded, which I doubt is correct. I suspect you want something like:
expand("{{OutputDir}}/split/{{name}}.{chr}.bam", chr=ref_chr),
The rule minimap2 does not contain the {chr} wildcard, hence the error you get.
As an aside, when you create a BAM file and its index in the same rule, the time stamp of the index file can end up older than the BAM file itself, which can later generate spurious warnings from samtools/bcftools. See https://github.com/snakemake/snakemake/issues/1378 (not sure whether it has been fixed).
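For completeness, one idiomatic way to express the split is to give split_bam a single per-chromosome output (with {chr} as the only wildcard) and let the expand in rule all request every chromosome. A rough sketch, reusing the question's variables (OutputDir, name, nt, OntSoftware are assumed to be defined earlier in the Snakefile):
rule split_bam:
    input:
        f"{OutputDir}/{name}.sorted.bam"  # concrete path, no unresolved wildcards
    output:
        splitBam = f"{OutputDir}/split/{name}.{{chr}}.bam",
        splitBamBai = f"{OutputDir}/split/{name}.{{chr}}.bam.bai"
    resources:
        mem_mb = 30000
    threads: nt
    singularity:
        OntSoftware
    shell:
        """
        samtools view -@ {threads} -b {input} {wildcards.chr} > {output.splitBam}
        samtools index -@ {threads} {output.splitBam}
        """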

How to access all file names in hydra config

I have a directory containing a bunch of txt files:
dir/train/[train1.txt, train2.txt, train3.txt]
I'm able to read a single file if I define the following in config.yaml:
file_name: ${paths.data_dir}/train/train1.txt
That gives me a str, which I use with np.loadtxt(self.hparams.file_name).
I tried
file_name: ${paths.data_dir}/train/*
so that I would have a List[str] that I could then loop over:
dat = []
for file in self.hparams.file_name:
    dat.append(np.loadtxt(file))
but it didn't work out.
You could define an OmegaConf custom resolver for this:
# my_app.py
import pathlib
from pathlib import Path
from typing import List

from omegaconf import OmegaConf

yaml_data = """
paths:
  data_dir: dir
file_names: ${pathlib_glob:${paths.data_dir}, 'train/*'}
"""

def pathlib_glob(data_dir: str, glob_pattern: str) -> List[str]:
    """Use pathlib glob to get a list of filenames."""
    data_dir_path = pathlib.Path(data_dir)
    file_paths: List[Path] = [p for p in data_dir_path.glob(glob_pattern)]
    filenames: List[str] = [str(p) for p in file_paths]
    return filenames

OmegaConf.register_new_resolver("pathlib_glob", pathlib_glob)

cfg = OmegaConf.create(yaml_data)
assert cfg.file_names == ['dir/train/train3.txt', 'dir/train/train2.txt', 'dir/train/train1.txt']
Now, at the command line:
mkdir -p dir/train
touch dir/train/train1.txt
touch dir/train/train2.txt
touch dir/train/train3.txt
python my_app.py # the assertion passes
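If the config lives in a real Hydra app rather than an inline string, the same resolver can be registered before the config is composed. A rough sketch under assumed names (conf/config.yaml holding the paths/file_names keys from above; the file name run_app.py is hypothetical):
# run_app.py -- hypothetical Hydra entry point
import pathlib

import hydra
import numpy as np
from omegaconf import DictConfig, OmegaConf

# Same resolver as above, registered before Hydra composes the config
OmegaConf.register_new_resolver(
    "pathlib_glob",
    lambda data_dir, pattern: [str(p) for p in pathlib.Path(data_dir).glob(pattern)],
)

@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    dat = [np.loadtxt(f) for f in cfg.file_names]  # load every matched txt file
    print(f"loaded {len(dat)} arrays")

if __name__ == "__main__":
    main()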

How to replace or remove special characters from scrapy?

I just started learning Scrapy and I am trying to make a spider that grabs some info from a website, and to replace or remove special characters in 'short_descr'.
import scrapy

class TravelspudSpider(scrapy.Spider):
    name = 'travelSpud'
    allowed_domains = ['www.tripadvisor.ca']
    start_urls = [
        'https://www.tripadvisor.ca/Attractions-g294265-Activities-c57-Singapore.html/'
    ]
    base_url = 'https://www.tripadvisor.ca'

    def parse(self, response, **kwargs):
        for items in response.xpath('//div[@class= "_19L437XW _1qhi5DVB CO7bjfl5"]'):
            yield {
                'name': items.xpath('.//span/div[@class= "_1gpq3zsA _1zP41Z7X"]/text()').extract()[1],
                'reviews': items.xpath('.//span[@class= "DrjyGw-P _26S7gyB4 _14_buatE _1dimhEoy"]/text()').extract(),
                'rating': items.xpath('.//a/div[@class= "zTTYS8QR"]/svg/@title').extract(),
                'short_descr': items.xpath('.//div[@class= "_3W_31Rvp _1nUIPWja _17LAEUXp _2b3s5IMB"]'
                                           '/div[@class="DrjyGw-P _26S7gyB4 _3SccQt-T"]/text()').extract(),
                'place': items.xpath('.//div[@class= "ZtPwio2G"]'
                                     '/div'
                                     '/div[@class= "DrjyGw-P _26S7gyB4 _3SccQt-T"]/text()').extract(),
                'cost': items.xpath('.//div[@class= "DrjyGw-P _26S7gyB4 _3SccQt-T"]'
                                    '/div[@class= "DrjyGw-P _1SRa-qNz _2AAjjcx8"]'
                                    '/text()').extract(),
            }

        next_page_partial_url = response.css("div._1I73Kb0a").css("div._3djM0GaD").xpath('.//a/@href').extract_first()
        if next_page_partial_url is not None:
            next_page_url = self.base_url + next_page_partial_url
            yield scrapy.Request(next_page_url, callback=self.parse)
The character I'm trying to replace is the bullet in Hiking Trails • Scenic Walking Areas. The dot in the middle comes out garbled (as mojibake) in the CSV file.
Everything else works like a charm.
I've tried to use .replace(), but I'm getting an error:
AttributeError: 'list' object has no attribute 'replace'
Any help would be appreciated.
If you're removing these special characters just because they appear weirdly in the CSV file, then I suggest not removing them. Simply add the following line to the settings.py file:
FEED_EXPORT_ENCODING = 'utf-8-sig'
This will write the special character correctly to your CSV file.
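If you do still want to replace the bullet, note that .extract() returns a list of strings rather than a single string, which is exactly why .replace() raises AttributeError: 'list' object has no attribute 'replace'. A small sketch (replacing '\u2022', the bullet character, with a plain dash is just an illustration):
raw = items.xpath('.//div[@class= "_3W_31Rvp _1nUIPWja _17LAEUXp _2b3s5IMB"]'
                  '/div[@class="DrjyGw-P _26S7gyB4 _3SccQt-T"]/text()').extract()
# .extract() gives a list, so apply the replacement per element
short_descr = [s.replace('\u2022', '-').strip() for s in raw]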

Generating consecutive numbered urls

I want to generate a text file containing the following lines:
http://example.com/file1.pdf
http://example.com/file2.pdf
http://example.com/file3.pdf
.
.
http://example.com/file1000.pdf
Can anyone advise how to do it using the unix command line, please?
Thank you
With an iterating for loop:
for (( i=1;i<=1000;i++ ));
do
echo "http://example.com/file$i.pdf";
done > newfile
With seq:
while read i;
do
echo "http://example.com/file$i.pdf";
done <<< $(seq 1000) > newfile
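If GNU seq is available, its format option can also produce the list in one line (a sketch; other seq implementations may differ):
seq -f "http://example.com/file%g.pdf" 1 1000 > newfile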
It is possible to create and run a Python script to generate this. Using vim, nano, or any other terminal editor, create a Python file as follows:
def genFile(fn, start, end):
    with open(fn, "w+") as f:
        f.writelines([f"http://example.com/file{str(i)}.pdf\n" for i in range(start, end + 1)])

try:
    fn = input("File Path: ")      # can be relative
    start = int(input("Start: "))  # inclusive
    end = int(input("End: "))      # inclusive
    genFile(fn, start, end)
except:
    print("Invalid Input")
Once this is written to a file, let's call it script.py. We can run the following command to execute the script:
python script.py
Then fill out the prompts for the file path, start, and end. This writes all of the URLs to the specified file, one per line.
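The same idea also works non-interactively as a tiny standalone snippet (urls.txt is just an assumed output name):
with open("urls.txt", "w") as f:
    f.writelines(f"http://example.com/file{i}.pdf\n" for i in range(1, 1001))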

How to open a text file in Google Colab

I am using a Google Colab Jupyter notebook. After uploading a text file, I am unable to open the file using the open function in Python 3.
from google.colab import files
import io

uploaded = files.upload()

for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))

data_path = io.StringIO(uploaded['fra.txt'].decode('utf-8'))
with open(data_path, 'rb') as f:
    lines = f.read().split('\n')
but it gives this error: TypeError: expected str, bytes or os.PathLike object, not _io.StringIO
How do I open a text file in a Google Colab Jupyter notebook?
Change it to just
data_path = 'fra.txt'
and it should work.
The _io.StringIO refers to the StringIO object (in-memory file stream). "For strings StringIO can be used like a file opened in text mode."
The issue is that the file is already open and you have it available to you as a StringIO buffer. I think you want to do readlines() on the StringIO object (data_path).
You can also call getvalue() on the object and get the str of the entire buffer.
https://docs.python.org/3/library/io.html#io.StringIO
See my example here; which I started with your code...
https://colab.research.google.com/drive/1Vbh13FVm02HMXeHXx-Zko1pFpqyp7bwI
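To illustrate the point above, a small sketch that skips open() entirely and reads straight from the in-memory buffer (fra.txt is the file name from the question):
import io
from google.colab import files

uploaded = files.upload()
buf = io.StringIO(uploaded['fra.txt'].decode('utf-8'))
lines = buf.getvalue().split('\n')  # or buf.readlines() to keep line endings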
Or do it like this:
import numpy as np

with open('anna.txt', 'r') as f:
    text = f.read()

vocab = sorted(set(text))
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)
