getting data from pdf file using ElasticSearch / Symfony - symfony

I'm using Symfony and have integrated ElasticSearch.
I want to extract data (name, age, address) from a PDF file or a text file in order to pre-fill a form with that data. How can I do that?
PS: I have configured elastic.yml and added all the indexes, and when I visit 127.0.0.1:9200 I get my object back successfully (JSON format). All I need is how to read a PDF file and retrieve its data.

You have to use a PDF parsing library such as https://pdfparser.org/documentation
$parser = new \Smalot\PdfParser\Parser();   // instantiate the parser
$pdf = $parser->parseFile('document.pdf');  // load and parse the PDF document
$text = $pdf->getText();                    // extract the full text content
echo $text;

Related

reading in stored searched tweets in rds format

I recently used the search_fullarchive() function in the rtweet package to read in a large quantity of tweets (~500,000). Because the file is so large and I needed to transfer the data to another computer, I saved it in .rds format, which appeared to be smaller in size. I used the following function in base R to read in the file:
tweet<-readRDS("20201211164534-1.rds")
However, instead of a dataset I got the object captured in the screenshot below.
[screenshot of View(tweet)]
typeof(tweet)
[1] "list"
I opened every node but I cannot find the dataset. Does anyone have an idea if/where I can find the dataset with my tweets in the .rds file? Thank you!
Update: Someone asked about the function I used to save the tweet dataset. I did not use one; R automatically stored the .rds files in my directory when I ran search_fullarchive() as below:
fever.june.1 <- search_fullarchive(
  q = "fever",               # search query on which to match/filter tweets
  n = 500000,                # number of tweets to return; best set in intervals of 100 for the '30day' API and either 100 (sandbox) or 500 (paid) for the 'fullarchive' API; default is 100
  fromDate = "202006010000", # oldest date-time (YYYYMMDDHHMM)
  toDate = "202006302359",   # newest date-time (YYYYMMDDHHMM)
  env_name = "develop",      # name/label of the developer environment to use for the search
  # safedir = "~/Desktop/",  # directory to which each response object should be saved
  parse = TRUE,              # logical indicating whether to convert the data into a data frame
  token = token              # a token associated with a user-created app
)
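Not part of the original thread, but a minimal sketch of how such an object could be inspected; it assumes the .rds holds a list whose elements (or some of them) are parsed tweet data frames, which is an assumption rather than something confirmed by the question:
library(dplyr)
tweet <- readRDS("20201211164534-1.rds")
str(tweet, max.level = 2)                    # show the top two levels of the list
is_df <- vapply(tweet, is.data.frame, logical(1))
tweets_df <- bind_rows(tweet[is_df])         # combine any data-frame elements into one dataset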

File Manipulation: output file and prompt user for name on Windows

I'm cringing just asking this here... and likely searching with the wrong terms, so apologies if this is a redundant question. How can I have a .R file create an output file and prompt the user for a file name, using an interactive dialog box? Or, more simply put, a "Save As" prompt?
Basically the reverse of:
str <- "Microsoft Excel Comma Separated Values File (.csv)"
data <- read_csv(choose.files(multi = FALSE, filters = cbind(str, "*.csv")))
I want to use write_csv() and have the user decide the name and directory.
Solution comes from: How to let user choose output file name in writecsv
str <- "Microsoft Excel Comma Separated Values File (.csv)"
write_csv(data, path = choose.files(caption = "Save As...",
multi = FALSE,
filters = cbind(str, "*.csv")))
This produces the desired output. It prompts the user and filters files by only .csv extension. If a file doesn't exist it will create it as.csv by default.
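One wrinkle worth noting (my addition, not part of the linked solution): choose.files() returns a zero-length character vector when the user cancels the dialog, so a small guard avoids write_csv() failing on an empty path. A minimal sketch:
library(readr)
target <- choose.files(caption = "Save As...",
                       multi = FALSE,
                       filters = cbind(str, "*.csv"))
if (length(target) == 0) {
  message("Save cancelled; no file written.")
} else {
  write_csv(data, target)
}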

How can I open a .json.gz format file in R?

I am a Data Science student writing my thesis using product review data. However, this is packed in a .gz file.
The file name when downloaded is 'xxx.json.gz' and when I look into the properties it says the type of file is gz Archive (.gz), Opens with 7-Zip File Manager.
I found the following code:
z <- gzfile("xxx.json.gz")
data = read.csv(z)
But the object 'data' is now a list. All the columns are factors and the column with the review text is not right at all. I think the read.csv() part is wrong since it's supposed to be a JSON file.
Does anyone have a solution? I also have the URL of the data if that's better to use: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz
Loading it at the moment: I got 5,152,500 records, so it is probably the review text that is clogging it up.
library(jsonlite)
happy_data <- stream_in(
  gzcon(
    url("http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz")
  )
)
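Since the file is already downloaded, a local-file variant should work as well (my addition; it assumes the archive sits in the working directory under the name used in the question):
library(jsonlite)
# stream the newline-delimited JSON straight from the local gzip archive
happy_data <- stream_in(gzfile("xxx.json.gz"))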

Check if write_xlsx has correctly executed and the file has been written

I am using a code like this:
writexl::write_xlsx(
  x = list(
    "Sheet" = df
  ),
  path = paste0("User/", name, "_", date, ".xlsx")
)
The code should write file.xlsx into the User folder, which is in the working directory.
I need the console to display a message if the write operation does not succeed.
For example, if the object name contains a "/", Windows does not allow the file to be written, but no message appears. The same happens when Windows refuses to write the file because of a permission problem.
I tried with tryCatch without success.
Thanks
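There is no accepted answer in the thread, but a minimal sketch of the tryCatch approach combined with a file.exists() check on the target path might look like this (paths and object names are the asker's; the wrapper itself is an assumption):
library(writexl)
out_path <- paste0("User/", name, "_", date, ".xlsx")
ok <- tryCatch({
  write_xlsx(x = list("Sheet" = df), path = out_path)
  TRUE
}, error = function(e) {
  message("write_xlsx failed: ", conditionMessage(e))
  FALSE
})
# belt and braces: confirm the file actually landed on disk
if (!ok || !file.exists(out_path)) {
  message("The Excel file was not written to ", out_path)
}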

integrating hadoop, revo-scaleR and hive

I have a requirement to fetch data from HIVE tables into a csv file and use it in RevoScaleR.
Currently we pull the data from Hive and manually put it into a file on the Unix file system for ad hoc analysis; however, the requirement is to redirect the result directly into an HDFS location and use RevoScaleR from there.
How do I do that? Or what sort of connection do I need to establish for this?
If I understand your question correctly, you could use a RevoScaleR ODBC connection to import the Hive table and do further analysis from there.
Here is an example using the Hortonworks-provided ODBC driver:
OdbcConnString <- "DSN=Sample Hortonworks Hive DSN"
odbcDS <- RxOdbcData(sqlQuery = "SELECT * FROM airline",
                     connectionString = OdbcConnString,
                     stringsAsFactors = TRUE,
                     useFastRead = TRUE,
                     rowsPerRead = 150000)
xdfFile <- "airlineHWS.xdf"
if (file.exists(xdfFile)) file.remove(xdfFile)
Flights <- rxImport(odbcDS, outFile = xdfFile, overwrite = TRUE)
rxGetInfo(data = "airlineHWS.xdf", getVarInfo = TRUE, numRows = 10)
Chenwei's approach is OK, but there is one problem: the data is temporarily stored in memory as a data frame in the odbcDS object. If we have a huge table in Hive, that becomes a problem.
I would suggest keeping everything on disk by using external tables in Hive and then using the backend data directly in Revolution R.
Something along these lines:
Create an external table from the existing Hive table in text-file (CSV, tab, etc.) format:
CREATE EXTERNAL TABLE ext_table
LIKE your_original_table_name
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/your/hdfs/location';
Here we are creating an external table which is stored as CSV files in HDFS.
Next, copy the original table into the external table using an insert overwrite command:
insert overwrite table ext_table select * from your_original_table_name;
If we want to check the backend data on HDFS, type:
hadoop fs -ls /your/hdfs/location/
We can see the part files stored at that location. Go ahead and cat them to be doubly sure.
Now we can use the RxTextData function to read the data produced in the step above:
hive_data <- RxTextData(file = '/your/hdfs/location/', delimiter = ',')
Now you can create an .xdf file using hive_data as the input (for example via rxImport into an RxXdfData target) to be more efficient for further processing, but above all the data has never touched memory.
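A minimal sketch of that last step (my addition; the HDFS paths are placeholders and a Hadoop compute context with RxHdfsFileSystem is assumed):
library(RevoScaleR)
hdfs <- RxHdfsFileSystem()
# text data source pointing at the CSV part files written by the external table
hive_data <- RxTextData(file = "/your/hdfs/location/",
                        delimiter = ",",
                        fileSystem = hdfs)
# target .xdf on HDFS; rxImport streams the data across in chunks
xdf_target <- RxXdfData("/your/hdfs/location_xdf/", fileSystem = hdfs)
rxImport(inData = hive_data, outFile = xdf_target, overwrite = TRUE)
rxGetInfo(xdf_target, getVarInfo = TRUE, numRows = 5)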
