I have made a web scraper using rvest and created a data frame. One of the columns that I scraped contains a lot of text and is hard to view.
The text in that column looks like this:
Created 104 commits in 5 repositories raysan5/raylib 83 commits raysan5/raylib.com 17 commits raysan5/raygui 2 commits raysan5/rres 1 commit raysan5/raylib-games 1 commit , Reviewed 1 pull request in 1 repository raysan5/raylib 1 pull request Fixed #1455 Dec 13 , Started 1 discussion in 1 repository raysan5/raylib Welcome to raylib Discussions! Dec 8 , 2 contributions in private repositories Dec 18
Could someone help me find a way to add a \n after every comma (,) and perhaps remove the white space at the start of the next line? It would also be nice if a blank line could be inserted between entries, as shown in the desired layout. And if a bullet point could be added to each new paragraph, that would be great too.
The text should be formatted, for example, like this (desired layout):
Created 104 commits in 5 repositories raysan5/raylib 83 commits raysan5/raylib.com 17 commits raysan5/raygui 2 commits raysan5/rres 1 commit raysan5/raylib-games 1 commit
Reviewed 1 pull request in 1 repository raysan5/raylib 1 pull request Fixed #1455 Dec 13
Started 1 discussion in 1 repository raysan5/raylib Welcome to raylib Discussions! Dec 8
2 contributions in private repositories Dec 18
I hope I'm not asking for too much.
gsub can do that using a regular expression:
gsub(' *, *', '\n', text)
 *, * matches an arbitrary number of spaces, followed by a comma, followed by more spaces, and replaces all of that with a line break.
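If you also want the blank line and the leading bullet from your desired layout, you can build them into the replacement string. A small sketch, assuming the scraped column is a character vector called text:
# prepend a bullet to the first entry; replace each comma with a blank line plus a bullet
formatted <- paste0("* ", gsub(" *, *", "\n\n* ", text))
cat(formatted[1])  # print the first row to check the layout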
I'm trying to perform joins in SQLite on Hebrew words including vowel points and cantillation marks and it appears that the sources being joined built the components in different orders, such that the final strings/words appear identical on the screen but fail to match when they should. I'm pretty sure all sources are UTF-8.
I don't see a built-in method of Unicode normalization in SQLite, which would be the easiest solution; but I found this link about Tcl and Unicode, though it looks a bit old, referring to Tcl 8.3 and Unicode 1.0. Is this the most up-to-date method of normalizing Unicode in Tcl, and is it appropriate for Hebrew?
If Tcl doesn't have a viable method for Hebrew, is there a preferred scripting language for handling Hebrew that could be used to generate normalized strings for joining? I'm using Manjaro Linux but am a bit of a novice at most of this.
I'm capable enough with JavaScript, browser extensions, and the SQLite C API to pass the data from C to the browser to be normalized and back again to be stored in the database; but I figured there is likely a better method. I mention the browser because I assume browsers are kept most up to date in this area, for obvious reasons.
Thank you for any guidance you may be able to provide.
I used the following code in an attempt to turn the procedure provided by @DonalFellows into a SQLite function, so that the data would, as nearly as possible, not have to be brought into Tcl. I'm not sure how SQLite functions really work in that respect, but that is why I tried it. I used the foreach loop solely to print some indication that the query was running and progressing, because it took about an hour to complete.
However, that's probably pretty good for my ten-year-old machine, given that in that hour it ran on 1) the Hebrew with vowel points, 2) the Hebrew with vowel points and cantillation marks, and 3) the Septuagint translation of the Hebrew, for all thirty-nine books of the Old Testament, and then on two different manuscripts of Koine Greek for all twenty-seven books of the New Testament.
I still have to run the normalization on the other two sources to know how effective this is overall; however, after running it on this one, which is the most involved of the three, I ran the joins again and the number of matches nearly doubled.
proc normalize {string {form nfc}} {
    exec uconv -f utf-8 -t utf-8 -x "::$form;" << $string
}
# Arguments are: dbws function NAME ?SWITCHES? SCRIPT
dbws function normalize -returntype text -deterministic -directonly { normalize }
foreach { b } { 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 } {
    puts "Working on book $b"
    dbws eval { update src_original set uni_norm = normalize(original) where book_no=$b }
    puts "Completed book $b"
}
If you're not in a hurry, you can pass the data through uconv. You'll need to be careful when working with non-normalized data though; Tcl's pretty aggressive about format conversion on input and output. (Just… not about normalization; the normalization tables are huge and most code doesn't need to do it.)
proc normalize {string {form nfc}} {
    exec uconv -f utf-8 -t utf-8 -x "::$form;" << $string
}
The code above only really works on systems where the system encoding is UTF-8… but that's almost everywhere that has uconv in the first place.
Useful normalization forms are: nfc, nfd, nfkc and nfkd. Pick one and force all your text to be in it (ideally on ingestion into the database… but I've seen so many broken DBs in this regard that I suggest being careful).
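As a quick sanity check (not part of the answer above; it assumes uconv is on the PATH), two Hebrew strings whose combining marks are entered in different orders compare unequal until both are normalized:
set a "\u05E9\u05C1\u05B8"   ;# shin, shin dot, qamats
set b "\u05E9\u05B8\u05C1"   ;# shin, qamats, shin dot
puts [expr {$a eq $b}]                                  ;# 0, different code point order
puts [expr {[normalize $a nfc] eq [normalize $b nfc]}]  ;# 1, canonical ordering makes them match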
I have some trouble using CombiTimeTable.
I want to fill the table using a txt file that contains two columns: the first is the time and the second is the related value (a current sample). I also added #1 in the first line, as the manual says.
Moreover, I add the following parameters:
tableOnFile=true,
fileName="C:/Users/gg/Desktop/CurrentDrivingCycle.txt"
I also have to set the parameter tableName, but I don't know how to define it. I tried defining it using the name of the file (i.e. CurrentDrivingCycle), but I got this error message at the end of the simulation:
Table matrix "CurrentDrivingCycle" not found on file "C:/Users/ggalli/Desktop/CurrentDrivingCycle.txt".
simulation terminated by an assertion at initialization
Simulation process failed. Exited with code -1.
Do you know how I can solve this issue?
Thank you in advance!
See the documentation:
https://build.openmodelica.org/Documentation/Modelica.Blocks.Sources.CombiTimeTable.html
In the documentation's example, the name tab1 in tab1(6,2) is the tableName (the (6,2) part is the table's dimensions). So your file should look something like this:
#1
double CurrentDrivingCycle(6,2) # comment line
0 0
1 0
1 1
2 4
3 9
4 16
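With the file laid out like that, the component declaration could then look roughly like this (the instance name is just a placeholder; the path is the one from your question):
Modelica.Blocks.Sources.CombiTimeTable currentTable(
    tableOnFile = true,
    tableName = "CurrentDrivingCycle",
    fileName = "C:/Users/gg/Desktop/CurrentDrivingCycle.txt",
    columns = {2});   // column 2 holds the current samples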
I want to read MarkLogic logs (for example ErrorLog.txt) from Query Console using XQuery. I have the code below, but the problem is that the output is not properly formatted.
xquery version "1.0-ml";
for $hid in xdmp:hosts()
let $h := xdmp:host-name($hid)
return
xdmp:filesystem-file("file://" || $h || "/" || xdmp:data-directory($hid) || "/Logs/ErrorLog.txt")
The problem is that the result comes out host by host after running the XQuery: first all log entries of one host, then the entries starting at 00:00:01 of host 2, and then those starting at 00:00:01 of host 3. The output looks like this:
2019-07-02 00:00:35.668 Info: Merging 2 MB from /cams/q06data02/testQA2/Forests/testQA2-2.2/0002b4cd to /cams/q06data02/testQA2/Forests/testQA2-2.2/0002b4ce, timestamp=15620394303480170
2019-07-02 00:00:36.007 Info: Merged 3 MB at 9 MB/sec to /cams/q06data02/testQA2/Forests/test2-2.2/0002b4ce
2019-07-02 00:00:38.161 Info: Deleted 3 MB at 399 MB/sec /cams/q06data02/test2/Forests/test2-2.2/0002b4cd
Is it possible to include the hostname with each log entry and to sort the output by timestamp, something like:
host 1 : 2019-07-02 00:00:01 : Info Merging ....
host 2 : 2019-07-02 00:00:02 : Info Deleted 3 MB at 399 MB/sec ...
Log files are text files. You can parse and sort them like any other text file.
They can get very large (GB+), though, so simple methods may not be performant.
You also need to be able to parse the text into fields in order to sort by a field.
Since the first 20 bytes of every line are the timestamp, and that timestamp is in ISO format, which sorts lexically the same as it sorts chronologically, you can split the files into lines and sort them with basic collation to get time-ordered output across multiple files.
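A rough sketch of that approach (not from the answer above; it reuses the per-host file path from the question and assumes every line of interest starts with a timestamp, so continuation lines will sort imperfectly):
xquery version "1.0-ml";
let $lines :=
  for $hid in xdmp:hosts()
  let $h := xdmp:host-name($hid)
  let $log := xdmp:filesystem-file("file://" || $h || "/" || xdmp:data-directory($hid) || "/Logs/ErrorLog.txt")
  (: prefix every line with its host so the origin survives the sort :)
  for $line in fn:tokenize($log, "\n")[. ne ""]
  return $h || " : " || $line
for $entry in $lines
(: the ISO timestamp at the start of each original line gives the ordering :)
order by fn:substring-after($entry, " : ")
return $entry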
In V9 one can use the pair of xdmp:logfile-scan and xdmp:logmessage-parse to efficiently search over log files (remote as well as local) and then transform the results into text, XML (attribute or element format) or JSON.
One can also use the REST API for the same.
see: https://docs.marklogic.com/REST/GET/manage/v2/logs
Once the log files (ideally a selected subset of log messages that is small enough to manage) are converted to a structured format (XML, JSON or text lines), sorting, searching, enriching etc. are easily performed.
For something much better, take a look at Ops Director: https://docs.marklogic.com/guide/opsdir/intro
I have downloaded some data from the following site as a zip file and extracted it onto my computer. Now I'm having trouble opening the included JSON data files.
Running the following code:
install.packages("rjson")
library("rjson")
comp <- fromJSON("statsbomb/data/competitions")
gave this error:
Error in fromJSON("statsbomb/data/competitions") : unexpected character 's'
Also, is there a way to load all files at once instead of writing individual statements each time?
Here is what I did (Unix system).
Clone the GitHub repo (note its location):
git clone https://github.com/statsbomb/open-data.git
Set the working directory (the directory to which you cloned the repo or extracted the zip file):
setwd("path to directory where you cloned the repo")
Read the data:
jsonlite::fromJSON("competitions.json")
With rjson: rjson::fromJSON(file="competitions.json")
To read all the files at once, move all the .json files to a single directory and use lapply/assign to assign your objects to your environment (a sketch follows the sample result below).
Result (single file):
competition_id season_id country_name
1 37 4 England
2 43 3 International
3 49 3 United States of America
4 72 30 International
competition_name season_name match_updated
1 FA Women's Super League 2018/2019 2019-06-05T22:43:14.514
2 FIFA World Cup 2018 2019-05-14T08:23:15.306297
3 NWSL 2018 2019-05-17T00:35:34.979298
4 Women's World Cup 2019 2019-06-21T16:45:45.211614
match_available
1 2019-06-05T22:43:14.514
2 2019-05-14T08:23:15.306297
3 2019-05-14T08:02:00.567719
4 2019-06-21T16:45:45.211614
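The all-at-once step mentioned above could look roughly like this; just a sketch, assuming the .json files have been moved into the working directory:
# read every .json file into its own object named after the file (e.g. competitions)
files <- list.files(pattern = "\\.json$")
invisible(lapply(files, function(f) {
  assign(tools::file_path_sans_ext(f),
         rjson::fromJSON(file = f),
         envir = .GlobalEnv)
}))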
The function fromJSON takes a JSON string as a first argument unless you specify you are giving a file (fromJSON(file = "competitions.json")).
The error you mention comes from the function trying to parse 'statsbomb/data/competitions' itself as a JSON string rather than as a file name. In JSON, however, everything is enclosed in brackets and strings are inside quotation marks, so the s from "statsbomb" is not a valid first character.
To read all the JSON files you could do:
lapply(dir("open-data-master/", pattern = "\\.json$", recursive = TRUE), function(x) {
  assign(gsub("/", "_", x), fromJSON(file = paste0("open-data-master/", x)), envir = .GlobalEnv)
})
However, this will take a long time to complete! You should probably elaborate on this function a little, e.g. split the list of files obtained with dir into chunks of 50 before running the lapply call.
I have these two files:
File: 11
11
456123
File: 22
11
789
Output of diff 11 22:
2c2
< 456123
---
> 789
Desired output:
< 456123
> 789
I don't want it to print the 2c2 and --- lines. I looked at the man page but could not find anything helpful. Any ideas? The files have more than 1000 lines.
What about diff 11 22 | grep "^[<|>]"?
Update: as knitti pointed out, the correct pattern is ^[<>]
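With the files from the question this gives just the changed lines:
< 456123
> 789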
diff has a whole host of useful options, like --old-group-format, that are described very briefly in the help. They are explained in more detail at http://www.network-theory.co.uk/docs/diff/Line_Group_Formats.html
The following produces something similar to what you want:
diff 11.txt 22.txt --unchanged-group-format="" --changed-group-format="<%<>%>"
<456123
>789
You might also need to play with --old-group-format=FORMAT (which groups hunks containing only lines from the first file), --new-group-format=FORMAT, --old-line-format=FORMAT (which formats lines just from the first file), --new-line-format=FORMAT, and so on.
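A sketch along those lines, assuming GNU diff (%L stands for the line contents including its newline):
diff --unchanged-line-format="" --old-line-format="< %L" --new-line-format="> %L" 11 22
which should print only the < and > lines.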
Disclaimer: I have not used these for real before; in fact, I have only just understood them. If you have further questions, I am happy to look at it later.
Edited to change order of lines