Cells in Jupyter notebooks - jupyter-notebook

I am trying to parse a pdf using PyPDF2 in a Jupyter notebook. Below is how I would like to write the different parts of the code, that is, the extract text statements in one cell and the RegEx in a new cell. However, if I separate the two pieces of code as below, the RegEx only runs through the last page of the file and not through the whole file (12 pages). Why does this happen? I would really like to use different cells.
import PyPDF2
import re
file = open(r'file.pdf', 'rb')
doc = PyPDF2.PdfFileReader(file)
#print(doc.getNumPages())
#new cell
for i in range(0, 12):
    page = doc.getPage(i)
    text = page.extractText()
    #print(text)
#new cell
doc_re = re.compile(r'S\d+_\d+', re.IGNORECASE)
result = doc_re.findall(text)
print(result)

Each time you run your for loop, you're resetting the value of text by using text = page.extractText()
The RegEx is running on what you give it, which is text. Even though your loop runs over 12 pages, the second cell of code receives the final value of text (which is whatever you assign it to be on the last iteration of the loop).
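You can either move the RegEx code inside your for loop or, better, append each page's text to text.
So, turning text = into text += should solve your problem, provided you initialize text = "" before the loop. Without that, the first += raises a NameError, or, in a notebook, appends to whatever text still holds from an earlier run.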
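A minimal sketch of the accumulate-then-search approach. To keep it runnable without a PDF, `page_texts` is a hypothetical stand-in for the strings that page.extractText() returns in the question, and `find_codes` is an illustrative name, not part of PyPDF2. In the notebook, the loop would live in one cell and the RegEx in a later one.

```python
import re

def find_codes(page_texts):
    """Join the text of every page, then run the regex once over all of it."""
    text = ""  # must start empty, otherwise += fails or reuses stale text
    for page_text in page_texts:
        # The added newline keeps a match from spanning two pages
        text += page_text + "\n"
    doc_re = re.compile(r"S\d+_\d+", re.IGNORECASE)
    return doc_re.findall(text)

# Hypothetical page contents standing in for page.extractText()
pages = ["intro S1_10 more", "nothing here", "s2_20 and S3_30"]
result = find_codes(pages)
print(result)
```

Because text now holds all pages, the RegEx cell can be re-run on its own without repeating the extraction.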

Related

How to select a random word from a txt file in python version 3.4?

I have a txt file called 'All_Words' that consists of 2000 words, and I'm making a hangman game, so I need to choose a random word. I've already thought of picking a random number from 0 to 2000 and reading the line with that number, but I don't know how to do that. Some background info:
I am in 8th grade and I like coding. I'm trying to get better, so I try to understand what people suggest and figure out what every part does, including reserved words such as 'global'.
I have also tried to shuffle the txt file, because I already got it to print the first word. If I'm able to shuffle the file, it would print a different word, and I could add an if statement saying that if the chosen word was already chosen, it would shuffle again and pick the first word again. I got the idea of shuffling the txt file from my dad, but he only did something called 'DOS' or something like that (he said he did it before it was even called coding), so I don't know if it would work in Python. I've asked my coding teacher, and he said he doesn't know how you would do that because he is used to Java and JavaScript.
This is what I have so far. I would like it to pick only one word instead of every word in order:
import random
with open("All_Words.txt") as file:
    for line in file:
        print(line)
        break
Assuming each word is on a new line in file, you can read the text file into a list and use random.choice() to pick a random element in the list. Then you remove the word from the list so you don't pick it again.
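import random
# Read the file and close it automatically when done
with open("All_Words.txt", "r") as file:
    # splitlines() avoids the empty last entry that split("\n") can produce
    listOfWords = file.read().splitlines()
randWord = random.choice(listOfWords)
print(randWord)
# Remove the chosen word so it cannot be picked again
listOfWords.remove(randWord)
newUniqueRandWord = random.choice(listOfWords)
print(newUniqueRandWord)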
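If several different words are needed at once, random.sample() can replace the choose-then-remove steps. This is a sketch under assumptions: the helper name `pick_unique_words` is made up for illustration, and a small temporary file stands in for All_Words.txt so the example runs anywhere.

```python
import os
import random
import tempfile

def pick_unique_words(path, count):
    """Return `count` randomly chosen, non-repeating entries from a one-word-per-line file."""
    with open(path) as f:
        # Skip blank lines so an empty string is never chosen
        words = [line.strip() for line in f if line.strip()]
    # sample() never returns the same list position twice
    return random.sample(words, count)

# Demo with a small temporary file standing in for All_Words.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("apple\nbanana\ncherry\ndate\n")
    demo_path = tmp.name

picked = pick_unique_words(demo_path, 2)
print(picked)
os.remove(demo_path)
```

random.sample() raises a ValueError if count is larger than the number of words in the file.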

Can cells be merged without adding a blank line

If I merge multiple cells in a Jupyter notebook, there is a blank line between the code from each cell.
Can I merge cells without this blank line?
No, this is the expected behavior.
Here is the source for mergeCells
Here is where toMerge is declared:
const toMerge: string[] = [];
Here is the line that joins the contents of toMerge:
newModel.value.text = toMerge.join('\n\n');
Note that the separator passed to .join() is two newlines ('\n\n'), which is what produces the blank line. So the answer to "Can I merge cells without this blank line?" is no, unless you submit a PR to the project or change that line yourself, then build and run the modified version on your system.
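The effect of the separator is easy to see outside the notebook. The sketch below is plain Python rather than the JupyterLab source, and the cell contents are hypothetical; it only illustrates why joining with two newlines leaves an empty line between cells.

```python
# Hypothetical contents of three cells about to be merged
cell_sources = ["a = 1", "b = 2", "print(a + b)"]

# Two newlines between cells: the first ends the line, the second adds a blank one
merged_with_blank_lines = "\n\n".join(cell_sources)

# One newline between cells: what the question asks for
merged_without_blank_lines = "\n".join(cell_sources)

print(merged_with_blank_lines)
print("---")
print(merged_without_blank_lines)
```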

R write over only first few lines of a file

Hi, I want to do something that I thought would be simple, but I can't find a way. I have some files in which I want to change something in the heading lines, identified by a keyword. The lines I want to change are always within the first 20 lines, but not necessarily at the same line number.
So I want to read in the first 20 lines (which is easy), find and change my string (again easy), and then write over the top 20 lines of the file while keeping the thousands of rows below unchanged. I will have to do this hundreds of times, which is why I don't want to read in the entire file and write it all back.
I have created a very simplified example. WARNING: running this creates a file called Temp.txt in your current working directory.
#Create Dummy text file
write.table(file = "Temp.txt", data.frame(Test = letters[1:26]), row.names = F, quote = F)
And then I can read in the lines and change
# read top 5 lines of text file (in real life I need to look at 20, but 5 is enough for the example)
TestHeader = readLines("Temp.txt", n = 5)
# Find and replace my search string
TestHeader[grepl("c",TestHeader)] <- "My New Line"
# <what to do here?> I want to only write over first 20 lines
Obviously I can read in the entire thing and change
TestHeader = readLines("Temp.txt")
TestHeader[grepl("c",TestHeader)] <- "My New Line"
writeLines(TestHeader, "Temp.txt")
But this would involve unnecessarily reading and writing thousands and thousands of lines.

fwrite(, append = TRUE) appends wrong way

I am facing a problem with the fwrite function from the data.table package in R. It appends the wrong way, and I end up with something like:
user ref version status Type DataExtraction
user1 2.02E+11 1 Pending 1 No
user2 2.02E+11 1 Saved 2 No"user3" 2.01806E+11 1 Saved NB No
I am using the function as follows :
library(data.table)
fwrite(Save, "~/Downloads/Register.csv", append = TRUE, sep = ",", quote = TRUE)
Reproducible example:
fwrite(data.table(user="user3",
                  ref="204094093",
                  version="2",
                  status="Pending",
                  Type="1",DataExtraction="No"),
       "~/Downloads/test.csv", sep = ",", append = FALSE)
fwrite(data.table(user="user3",
                  ref="204094093",
                  version="2",
                  status="Pending",
                  Type="1",DataExtraction="No"),
       "~/Downloads/test.csv", sep = ",", append = TRUE)
I'm not sure if it isolates the problem, but it seems that if I manually change something in the .csv file (for instance rename DataExtraction to Extraction), the problem of appending in the wrong way occurs.
Does someone know what is going wrong?
Thanks!
When I run your example code I have no problems with the behavior; the file comes out as desired. Based on your comments about manually changing the file, and on what the undesired output looks like, here is what I believe is happening.
When fwrite() (like many other I/O functions) writes to a file, each line ends with a line break (in R, generally represented as \n). This is desired, so that subsequent lines of data appear on subsequent lines of the file. It also means that many text editors show a blank line at the very end of the file, reflecting the line break of the last line written (different editors handle this differently, though).
So I suspect that when you manually edit the file in your editor, you are somehow losing that last line break. When you then write again using append, there is no line break at the end of the file, so the first appended row lands on the same line as the last existing row.
The solution is either to prevent your manual editing from deleting that last line break or, failing that, to write a single line break to the file from R before appending, e.g. with cat("\n", file = "~/Downloads/test.csv", append = TRUE).
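The check itself does not depend on R. As a rough sketch in Python, the hypothetical helper `ensure_trailing_newline` below looks at the last byte of the file and appends a line break only when one is missing, so it is safe to call before every append. The file contents in the demo are invented for illustration.

```python
import os
import tempfile

def ensure_trailing_newline(path):
    """Append a line break if the file's last byte is not one. Returns True if it wrote."""
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # empty file: nothing to fix
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b"\n":
            return False  # already ends with a line break
        f.seek(0, os.SEEK_END)
        f.write(b"\n")
        return True

# Demo: a CSV whose final line break was lost, as after a manual edit
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"user,ref\nuser1,1")
    demo_path = tmp.name

fixed = ensure_trailing_newline(demo_path)
with open(demo_path, "ab") as f:
    f.write(b"user2,2\n")  # the appended row now starts on its own line
with open(demo_path, "rb") as f:
    lines = f.read().decode().splitlines()
os.remove(demo_path)
print(fixed, lines)
```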

Implementing syntax highlighting for markdown titles in PySide/PyQt

I am trying to implement a syntax highlighter for markdown for my project in PySide. The current code covers the basic, with bold, italic, code blocks, and some custom tags. Below is an extract of the relevant part of the current code.
What is blocking me right now is how to implement the highlighting for titles (underlined with ===, for the main title, or --- for sub-titles). The method that is used by Qt/PySide to highlight the text is highlightBlock, which processes only one line at a time.
class MySyntaxHighlighter(QtGui.QSyntaxHighlighter):
    def highlightBlock(self, text):
        # do something with this line of text
        self.setCurrentBlockState(0)
        startIndex = 0
        if self.previousBlockState() != 1:
            startIndex = self.blockStartExpression.indexIn(text)
        while startIndex >= 0:
            endIndex = self.blockEndExpression.indexIn(
                text, startIndex)
            ...
There is a way to recover the previousBlockState, which is useful when a block has a defined start (for instance, the ~~~ syntax at the beginning of a code block). Unfortunately, nothing defines the start of a title except the underlining with === or ---, which takes place on the next line. All the examples I found only handle cases where the expression has a defined start, so that the previousBlockState gives you useful information (as in the example above).
The question is then: is there a way to recover the text of the next line inside highlightBlock? To perform a look-ahead, in some sense.
I thought about recovering the document currently being worked on, finding the current block in the document, then finding the next line and running the regular expression check on it. This would however break if a line in the document has the exact same wording as the title. Plus, it would become quite slow to do this systematically for all lines in the document. Thanks in advance for any suggestion.
If self.currentBlock() gives you the block being highlighted, then:
self.currentBlock().next().text()
should give you the text of the following block.
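With the next line's text available, the title check reduces to a test on two strings. The sketch below keeps that test free of Qt so it can be run on its own; the function name `title_level` and the two patterns are assumptions for illustration, not part of PySide. Inside highlightBlock, the second argument would be self.currentBlock().next().text().

```python
import re

# Assumed patterns: an underline is a line made only of '=' or only of '-'
MAIN_UNDERLINE = re.compile(r"^=+\s*$")
SUB_UNDERLINE = re.compile(r"^-+\s*$")

def title_level(current_text, next_text):
    """Return 1 for a main title, 2 for a sub-title, 0 otherwise."""
    if not current_text.strip():
        return 0  # an empty line cannot be a title
    if MAIN_UNDERLINE.match(next_text):
        return 1
    if SUB_UNDERLINE.match(next_text):
        return 2
    return 0

print(title_level("My document", "==========="))
print(title_level("A section", "---"))
print(title_level("Plain text", "more text"))
```

One caveat, as far as I know: Qt re-highlights the block that was edited, so typing the underline may not refresh the title line above it unless the previous block is re-highlighted as well.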
