Import a text file into R and change its format - r

I have data in a text file. An example of the file looks like this:
"vLatitude ='23.8145833';
vLongitude ='90.4043056';
vcontents ='LRP: LRPS</br>Start of Road From the End of Banani Rail Crossing Over Pass</br>Division:Gazipur</br>Sub-Division:Tongi';
vLocations = new Array(vcontents, vLatitude, vLongitude);
locations.push(vLocations);"
Can I convert it to something like this in R?
e.g.
latitude longitude contents
23.8145833 90.4043056 LRP: LRPS Start...Tongi

Solution 1
That looks a lot like JavaScript code. You can execute the JavaScript in a web browser, save the result as JSON, and then read that file in R with jsonlite.
With your example, create this file and save it as my_page.html:
<html>
<head>
<script>
// Initialize locations so that more values can be pushed into it;
// probably not required with your full code
var locations = [];
vLatitude ='23.8145833';
vLongitude ='90.4043056';
vcontents ='LRP: LRPS</br>Start of Road From the End of Banani Rail Crossing Over Pass</br>Division:Gazipur</br>Sub-Division:Tongi';
vLocations = new Array(vcontents, vLatitude, vLongitude);
locations.push(vLocations);
// convert locations to json
var jsonData = JSON.stringify(locations);
// actually write the json to file
function download(content, fileName, contentType) {
  var a = document.createElement("a");
  var file = new Blob([content], {type: contentType});
  a.href = URL.createObjectURL(file);
  a.download = fileName;
  a.click();
}
download(jsonData, 'export_json.txt', 'text/plain');
</script>
</head>
<body>
Download should start automatically. You can look at the web console for errors.
</body>
</html>
When you open it in your web browser, it should "download" a file that you can then read in R:
jsonlite::read_json("export_json.txt",simplifyVector = TRUE)
One problem is that the JavaScript code creates an array without names, so the names are not exported. I don't see how you could make JavaScript export them.
Solution 2
Instead of relying on a browser to execute the javascript code, you could do it directly in R with a javascript engine. It should give you the same result, but makes communication between the two easier.
Solution 3
If the file really looks like that throughout, you might be able to drop the JavaScript lines that organize the arrays and keep only the lines that define variables. Since the symbols = and ; are technically valid in R, it's not too hard to rewrite the JavaScript into R code. Note that this solution could be very fragile, depending on what else is in your JavaScript code!
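# stringr provides str_split(), str_trim(), str_detect(), str_remove() and re-exports %>%
library(stringr)
js_script <- "var locations = [];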
vLatitude ='23.8145833';
vLongitude ='90.4043056';
vcontents ='LRP: LRPS</br>Start of Road From the End of Banani Rail Crossing Over Pass</br>Division:Gazipur</br>Sub-Division:Tongi';
vLocations = new Array(vcontents, vLatitude, vLongitude);
locations.push(vLocations);
// convert locations to json
var jsonData = JSON.stringify(locations);" %>%
str_split(pattern = "\n", simplify=TRUE) %>%
as.character() %>%
str_trim()
# Find the lines that look like defining variables
js_script <- js_script[str_detect(js_script, pattern = "^\\w+ ?= ?'.*' ?;$")]
# make it into an R expression
r_code <- str_remove(js_script, ";$") %>%
paste(collapse = ",")
r_code <- paste0("c(", r_code, ")")
# Execute
eval(str2expression(r_code))
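The same line-filtering idea can also be sketched outside R. The Python snippet below is only an illustration, assuming every value is a single-quoted string on one line as in the example; `parse_js_assignments` is a hypothetical helper name, and turning `</br>` into spaces is a guess at the desired output.

```python
import re

# Matches lines such as: vLatitude ='23.8145833';
ASSIGNMENT = re.compile(r"^\s*(\w+)\s*=\s*'(.*)'\s*;\s*$")


def parse_js_assignments(js_text):
    """Collect simple  name = 'value';  lines into a dict (hypothetical helper)."""
    record = {}
    for line in js_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            name, value = match.groups()
            # Assumption: the </br> markers should become plain spaces.
            record[name] = value.replace("</br>", " ")
    return record


sample = """vLatitude ='23.8145833';
vLongitude ='90.4043056';
vcontents ='LRP: LRPS</br>Start of Road From the End of Banani Rail Crossing Over Pass</br>Division:Gazipur</br>Sub-Division:Tongi';
vLocations = new Array(vcontents, vLatitude, vLongitude);
locations.push(vLocations);"""

row = parse_js_assignments(sample)
print(row["vLatitude"], row["vLongitude"], row["vcontents"])
```

Lines that build or push arrays do not match the pattern, so only the three variable definitions survive. Like the R version, this is fragile if the real file contains multi-line strings or escaped quotes.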

Related

Iterate faster over a large collection of files (objects) inside an Observable List (JavaFX 8)

I have an Excel file that contains the filenames of all the images. The paths of these images are stored in an ObservableList of File objects, which was populated from the folder that contains all of the images. My goal is to create a hyperlink for each of these filenames by matching it against the pool of image files.
I would like to ask how I can iterate faster through a large collection of File objects in order to get their paths easily.
For example:
Image name from Excel :
ABC_0001
The Full path from the collection must be:
C:\Users\admin\Desktop\Images\ABC_0001.jpg
In order to get the full paths, I iterate with a Stream.
My procedure:
Extract the data using Apache POI.
Stream through the image collection, comparing each file's base name with the extracted data.
Take the result and store the full path on the object via getAbsolutePath().
Code:
//storage during iteration
ObservableList<DetailedData> dataCollection = FXCollections.observableArrayList();
//Image collection containing over 13k Images listed via commons-io
ObservableList<File> IMAGE_COLLECTION = FXCollections.observableArrayList(FileUtils.listFiles(browsedFOLDER, new String[]{"JPG", "JPEG", "TIF", "TIFF", "jpg", "jpeg", "tif", "tiff"}, true));
//Sheet data
Sheet sheet1 = wb.getSheetAt(0);
for (Row row: sheet1)
{
DetailedData data = new DetailedData();
//extracted data from excel
String FILENAME = row.getCell(0,Row.MissingCellPolicy.CREATE_NULL_AS_BLANK).getStringCellValue();
//to be filled up based on stream result.
String IMAGE_SOURCE = null;
//stream code with the help of commons-io
File IMAGE = IMAGE_COLLECTION.stream().filter(e -> FilenameUtils.getBaseName(e.getName()).toLowerCase().equals(FILENAME.toLowerCase())).findFirst().orElse(null);
if (IMAGE != null)
IMAGE_SOURCE = IMAGE.getAbsolutePath();
data.setFileName(FILENAME);
data.setFullPath(IMAGE_SOURCE);
dataCollection.add(data);
}
Result:
Excel rows = 9,400
Image Files = 13,000
Iteration Time = 120,000ms
Are these results normal, or can this be made faster?
I tried using parallelStream(), which was faster but uses more CPU.
The code below should speed things up a lot, but first I have a few questions about your code.
ObservableList<DetailedData> dataCollection = FXCollections.observableArrayList() Why are you using an ObservableList here? And why is this a list of DetailedData rather than File? DetailedData has setFileName and setFullPath, but File already provides the name and the full path.
ObservableList<File> IMAGE_COLLECTION = FXCollections.observableArrayList(FileUtils.listFiles(browsedFOLDER, new String[]{"JPG", "JPEG", "TIF", "TIFF", "jpg", "jpeg", "tif", "tiff"}, true)); Why ObservableList?
These two are small things, but I am curious.
So what I think you should do is use a Map. Your code should look something like the code below.
//storage during iteration
List<DetailedData> dataCollection = new ArrayList<>();
//Image collection containing over 13k Images listed via commons-io
List<File> IMAGE_COLLECTION = new ArrayList<>(FileUtils.listFiles(new File("C:\\Users\\blj0011\\Pictures"), new String[]{"JPG", "JPEG", "TIF", "TIFF", "jpg", "jpeg", "tif", "tiff"}, true));
//Use this to map file name to file
Map<String, File> map = new HashMap<>();
//Use this to add data to the map
IMAGE_COLLECTION.forEach((file) -> {map.put(file.getName().substring(0, file.getName().lastIndexOf(".")).toLowerCase(), file);});
for (Row row: sheet1)
{
//extracted data from excel
String FILENAME = row.getCell(0,Row.MissingCellPolicy.CREATE_NULL_AS_BLANK).getStringCellValue();
//If the map contains the file name, create `DetailedData` object. Then set data. Then add object to datacollection list.
if (map.containsKey(FILENAME.toLowerCase()))
{
DetailedData data = new DetailedData();
data.setFileName(FILENAME);
data.setFullPath(map.get(FILENAME.toLowerCase()).getAbsolutePath());
dataCollection.add(data);
}
}
Comments in the code
I still believe this could be cleaned up a little more if you used List<File> dataCollection = new ArrayList()
If you really want to speed up your search, try not to repeat work that could be done just once. For example, you could use two loops: the first to prepare the search and the second to actually perform it. Inside your filter you call FilenameUtils.getBaseName and convert to lower case twice, for every file on every row. It would be better to do these things only once in the first loop and store the resulting Strings in a list. In the second loop you then search that list.
I am also wondering why you use ObservableLists here. A simple List would do as well.
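The prepare-once, look-up-many idea from the answers above is independent of Java. As a rough sketch in Python (the file names, the `build_index` and `resolve` helpers, and the suffix set are all made up for illustration), the first pass normalizes each base name exactly once and the second pass does one dictionary lookup per spreadsheet row:

```python
from pathlib import Path

# Assumed set of image extensions, compared case-insensitively
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".tif", ".tiff"}


def build_index(paths):
    """First pass: map lower-cased base name -> path, computed once per file."""
    index = {}
    for path in paths:
        path = Path(path)
        if path.suffix.lower() in IMAGE_SUFFIXES:
            # setdefault keeps the first match, similar to findFirst() in the question
            index.setdefault(path.stem.lower(), path)
    return index


def resolve(names, index):
    """Second pass: one dictionary lookup per spreadsheet row."""
    return {name: index.get(name.lower()) for name in names}


# Made-up file list standing in for the 13k images
files = [Path("Images/ABC_0001.jpg"), Path("Images/abc_0002.TIF"), Path("Images/notes.txt")]
index = build_index(files)
found = resolve(["ABC_0001", "ABC_0002", "ABC_9999"], index)
for name, path in found.items():
    print(name, "->", path)
```

With 9,400 rows and 13,000 files this replaces roughly rows times files string comparisons with one pass over the files plus one lookup per row.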
I've tested another approach to this slow iteration.
It seems that the cause is creating the Stream repeatedly inside the for-each loop.
I tried Baeldung's Supplier<Stream> solution, declaring the supplier outside the loop together with parallelStream().
Sample Code:
Supplier<Stream<File>> streamSupplier = () -> imageCollection.parallelStream();
for (Row row : sheet)
{
File IMAGE = streamSupplier.get().filter(e -> FilenameUtils.getBaseName(e.getName()).toLowerCase().equals(FILENAME.toLowerCase())).findFirst().orElse(null);
if (IMAGE != null)
IMAGE_SOURCE = IMAGE.getAbsolutePath();
}
Result: the run time went down to 45,000 ms.
Please correct me if my approach is not right.

Is there a function to scrape the notes sections of Powerpoint Slides?

I am attempting to read through ~ 100 powerpoint slides and read the notes sections of each slide. I will do some text wrangling and write to csv after the fact, but need to get the notes in a workable format first.
I am working with the officer package, read_pptx function right now, but am open to whatever packages needed. It doesn't seem to pull in notes, but I may just be looking at this wrong.
To show a bit of what I've tried -->
library(officer)
ppt_var <- read_pptx('test_presentation.pptx')
view(ppt_var)
Ideally, I could get the text of each notes slide added to individual variables to write to a csv. I am confident that I can handle the manipulation once I get the notes read in, but cannot seem to get that part down.
Thank you for any pointers or support!
How to do that is shown in the code here: https://github.com/davidgohel/officer/issues/117 .
The following is based on that code:
library(magrittr)
library(officer)
library(xml2)
p <- read_pptx("mypresentation.pptx")
notes_dir <- file.path(p$package_dir, "ppt", "notesSlides")
files <- list.files(pattern = "\\.xml$", path = notes_dir, full.names = TRUE)
Notes <- lapply(files,
. %>%
read_xml %>%
xml_find_all("//a:t") %>%
xml_text
)
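The R code above works because a .pptx file is a zip archive whose notes live in XML parts under ppt/notesSlides. The same idea can be sketched with only the Python standard library; `read_notes` is a hypothetical helper name, and the number in each part name is not guaranteed to match the slide order, so treat the keys as part numbers rather than slide numbers.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used by the <a:t> text runs
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
NOTES_PART = re.compile(r"^ppt/notesSlides/notesSlide(\d+)\.xml$")


def read_notes(pptx_path):
    """Return {notes part number: text} for every notes part in the archive."""
    notes = {}
    with zipfile.ZipFile(pptx_path) as archive:
        for name in archive.namelist():
            match = NOTES_PART.match(name)
            if not match:
                continue
            root = ET.fromstring(archive.read(name))
            # Join all text runs of the notes part with single spaces
            runs = [node.text or "" for node in root.iter(A_NS + "t")]
            notes[int(match.group(1))] = " ".join(runs)
    return notes
```

Calling `read_notes("mypresentation.pptx")` would return a dictionary that can be written to CSV with the csv module. Note that the notes part may also contain placeholder text such as the slide number field.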
Assuming you are using the DocumentFormat.OpenXml dependencies in C#, a more native way would be:
public static SlidePart GetSlidePart(PresentationDocument pptxDoc, int index)
{
// Get the relationship ID of the slide at the given index.
PresentationPart presentationPart = pptxDoc.PresentationPart;
OpenXmlElementList slideIds = presentationPart.Presentation.SlideIdList.ChildElements;
string relId = (slideIds[index] as SlideId).RelationshipId;
// Get the slide part from the relationship ID.
return (SlidePart)presentationPart.GetPartById(relId);
}
public static string GetNoteText(PresentationDocument pptxDoc, int index)
{
//Get the Slide Part
SlidePart slidePart = GetSlidePart(pptxDoc, index);
//Extract the Note text
return slidePart.NotesSlidePart.NotesSlide.InnerText.ToString();
}

Python Selenium Webdriver Wait until Element is Loaded

The idea is to scrape a website. I wanted to do it by taking screenshots and then extracting the data from the screenshot, because the data I want is not in the HTML code and, to be honest, I didn't know how else to handle it (I am pretty new to Python/programming).
It is working fine so far, but I have the problem that WebDriverWait doesn't work properly.
That's the Webpage: https://exporo.de/investment/betreutes-wohnen-huerth and in detail it's this dynamic part:
<div class="key">Bereits investiert</div>
<div class="value"
ng-controller="pubSubController as pubSubCtrl"
ng-show="pubSubCtrl.hasProject(2385)"
ng-bind="pubSubCtrl.getProject(2385, 'total')"></div>
This is my code so far (just the loop):
while AktuellerWert1 < Endwert1:
    Zeit = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    driver1.get_screenshot_as_file(png_link % FileName1)
    img = Image.open(png_link % FileName1)
    PNG1 = image_to_string(img)
    PNG1_bearb = PNG1.split()
    AktuellerWert1 = PNG1_bearb[PNG1_bearb.index('investiert') + 1]
    Endwert1 = PNG1_bearb[PNG1_bearb.index('Finanzierungsziel') + 1]
    if AnfangsWert1 != AktuellerWert1:
        with open("/Users/davidoverbeck/Dropbox/Screen/Exporo/%s.csv" % FileName1, 'a') as csvFile:
            writer = csv.writer(csvFile)
            writer.writerow([AktuellerWert1, Zeit])
        print(AktuellerWert1)
    else:
        pass
    AnfangsWert1 = AktuellerWert1
    driver1.refresh()
    element = WebDriverWait(driver1, 2).until(EC.visibility_of_all_elements_located((By.XPATH, '/html/body/main/section[1]/section/div[2]/div[2]/div[1]/div[2]/div[10]/div[2]')))
else:
    with open("/Users/davidoverbeck/Dropbox/Screen/Abgeschlossen.csv", 'a') as csvFile:
        writer = csv.writer(csvFile)
        writer.writerow([Zeit, FileName1])
    print(FileName1, 'abgeschlossen')
    driver1.close()
It's working fine for 2 minutes and then it gives me the following error:
selenium.common.exceptions.TimeoutException: Message:
(no message behind it?!)
I am not sure whether the loop does anything at all or, in case it's working, what's wrong with it?
Thank you for your help!
I'm under the impression that the data you're looking for is here:
https://exporo.de/pubsub/initial .
In that case there is no need to parse the HTML; you only need to parse the JSON.
To check, open the browser developer tools (F12), go to the Network tab, and look for requests whose Type column says json.
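Reading the JSON directly could look roughly like the sketch below. The layout of the response is not shown in this thread, so instead of assuming a structure, the hypothetical `find_key` helper walks the whole parsed document and yields every value stored under a given key, such as the 'total' that appears in the page's ng-bind attribute. `fetch_json` is likewise a made-up name and is not called here; the demonstration uses invented data.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def find_key(payload, key):
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(payload, dict):
        for name, value in payload.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(payload, list):
        for item in payload:
            yield from find_key(item, key)


# Offline demonstration with made-up data; the real response layout may differ.
sample = json.loads('{"projects": [{"id": 1, "total": 100}, {"id": 2, "total": 250}]}')
print(list(find_key(sample, "total")))
```

Against the live site one would call `fetch_json("https://exporo.de/pubsub/initial")` and inspect the result first, which avoids the screenshot, OCR, and WebDriverWait steps entirely.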

Difference between two files view in HTML using Java or any jar

I want to write a script in Java that compares two files and shows their differences side by side in an HTML page. Can someone help me with how to write it (or where to start)? I am pulling my hair out over this...
I want to use this script in a BeanShell PostProcessor so that I can easily compare the standard output files with the result files.
I don't think you should be asking people for writing code for you here, consider hiring a freelancer instead.
Alternatively you can use the following approach:
Add JSR223 Assertion as a child of the request which you would like to fail if files won't be equal
Put the following code into "Script" area:
def file1 = new File('/path/to/file1')
def file2 = new File('/path/to/file2')
def file1Lines = file1.readLines('UTF-8')
def file2Lines = file2.readLines('UTF-8')
if (file1Lines.size() != file2Lines.size()) {
AssertionResult.setFailure(true)
AssertionResult.setFailureMessage('Files size is different, omitting line-by-line compare')
} else {
def differences = new StringBuilder()
file1Lines.eachWithIndex {
String file1Line, int number ->
String file2Line = file2Lines.get(number)
if (!file1Line.equals(file2Line)) {
differences.append('Difference # ').append(number).append('. Expected: ')
.append(file1Line).append('. Actual: ' + file2Line)
differences.append(System.getProperty('line.separator'))
}
}
if (differences.toString().length() > 0) {
AssertionResult.setFailure(true)
AssertionResult.setFailureMessage(differences.toString())
}
}
If the files differ, the differences will be listed one by one in the JSR223 Assertion result.
See Scripting JMeter Assertions in Groovy - A Tutorial for more details.
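The Groovy assertion above reports differences as text. For the side-by-side HTML view the question asks about, here is a rough illustration of what such a report generator can look like, using Python's standard difflib module. It is not Java and would not run inside a BeanShell or JSR223 element, so take it only as a sketch of the output format; `write_side_by_side_diff` is a hypothetical name.

```python
import difflib
from pathlib import Path


def write_side_by_side_diff(expected_path, actual_path, report_path):
    """Write an HTML page with both files in adjacent columns.

    Returns True if the files differ, False otherwise.
    """
    expected = Path(expected_path).read_text(encoding="utf-8").splitlines()
    actual = Path(actual_path).read_text(encoding="utf-8").splitlines()
    # HtmlDiff renders a two-column table with changed lines highlighted
    html = difflib.HtmlDiff(wrapcolumn=80).make_file(
        expected, actual, fromdesc=str(expected_path), todesc=str(actual_path)
    )
    Path(report_path).write_text(html, encoding="utf-8")
    return expected != actual
```

A call such as `write_side_by_side_diff("/path/to/file1", "/path/to/file2", "diff.html")` writes a page that can be opened in any browser.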

Reformatting date in google spreadsheet

I'm setting up a spreadsheet for someone else with a form to enter data.
One of the columns is supposed to hold a date. The input date format is like this example: "Jan 26, 2013" (there will be a lot of copy & paste involved to collect data, so changing the format at input step is not a real option).
I need this date column to be sortable, but the spreadsheet doesn't recognize this as a date but simply as a string. (It would recognize "Jan-26-2013", I've tried.)
So I need to reformat the input date.
My question is: how can I do this? I have looked around and google apps script looks like the way to go (though I haven't found a good example of reformatting yet).
Unfortunately my only programming experience is in Python, and of intermediate level. I could do this in Python without a problem, but I don't know any JavaScript.
(My Python approach would be:
def reformat(date):
    splitted = date.split()
    return "-".join([splitted[0], splitted[1][:-1], splitted[2]])
)
I also don't know how I'd go about linking the script to the spreadsheet - would I attach it to the cell, or the form, or where? And how? Any link to a helpful, understandable tutorial etc. on this point would help greatly.
Any help greatly appreciated!
Edit: Here's the code I ended up with:
//Function to filter unwanted " chars from date entries
function reformatDate() {
var sheet = SpreadsheetApp.getActiveSheet();
var startrow = 2;
var firstcolumn = 6;
var columnspan = 1;
var lastrow = sheet.getLastRow();
var dates = sheet.getRange(startrow, firstcolumn, lastrow, columnspan).getValues();
newdates = []
for(var i in dates){
var mydate = dates[i][0];
try
{
var newdate = mydate.replace(/"/g,'');
}
catch(err)
{
var newdate = mydate
}
newdates.push([newdate]);
}
sheet.getRange(startrow, firstcolumn, lastrow, columnspan).setValues(newdates)
}
For other confused Google Apps Script newbies like me:
Attaching the script to the spreadsheet works by creating the script from within the spreadsheet (Tools => Script Editor). Just putting the function in there is enough; you don't seem to need a function call etc.
You select the trigger of the script from the Script Editor (Resources => This Project's Triggers).
Important: the script will only work if there's an empty row at the bottom of the sheet in question! (Most likely this is because the third argument of getRange is a number of rows rather than an end row, so starting at row 2 and passing getLastRow() makes the range reach one row past the last row with data.)
Just an idea :
If you double-click on your date string in the spreadsheet, you will see that its real value is 'Jan 26, 2013, with a ' in front of the string, and that is what makes it a string instead of a date object. (The form does that to allow you to type whatever you want in the text area, including +322475... if it is a phone number, for example; that's a known trick in spreadsheet cells.) You could simply write a script that runs on form submit and removes the ' from the cells; I guess the spreadsheet would do the rest. (I didn't test that, so give it a try and consider this a suggestion.)
To remove the ' you can simply use the .replace() method:
var newValue = value.replace(/'/g,'');
here are some links to the relevant documentation : link1 link2
EDIT following your comment :
It could be simpler since the replace doesn't generate an error if no match is found. So you could make it like this :
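function reformatDate() {
  var sheet = SpreadsheetApp.getActiveSheet();
  // getRange(row, column, numRows, numColumns): rows 2..lastRow are lastRow - 1 rows
  var range = sheet.getRange(2, 6, sheet.getLastRow() - 1, 1);
  var dates = range.getValues();
  var newdates = [];
  for (var i = 0; i < dates.length; i++) {
    var mydate = dates[i][0];
    // Only strings have .replace(); leave Date or number cells untouched
    newdates.push([typeof mydate === 'string' ? mydate.replace(/"/g, '') : mydate]);
  }
  range.setValues(newdates);
}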
Also, you used the " in your code, presumably on purpose... my test showed ' instead. What made you make this choice ?
Solved it, I just had to change the comma to dot and it worked
