Import multiple data points from a webpage into Google Sheets - web-scraping

I'm creating a stocks sheet with lots of stock items. Each stock has multiple data points, and I'm scraping them from multiple websites.
Currently, as my sheet keeps growing, I'm starting to have trouble with the IMPORTXML and IMPORTHTML functions.
Question: Would it be possible to import, let's say, an entire webpage's source into a cell just once, and then run my IMPORTHTML/IMPORTXML functions with that cell as the source? I'm thinking about this because then I would only have to call that particular page once and could process all the different data inside the sheet itself.
Any ideas would be appreciated, thanks!

Use Google Apps Script instead of a built-in formula.
This is because there is no built-in function that imports the data "as-is":
IMPORTDATA splits the source code at commas and line breaks.
IMPORTXML doesn't import tags, only the text that they enclose.
IMPORTHTML only imports the content of tables and lists.
On the other hand, these functions can't be used to parse data from cells; they are only able to parse content from external sources referred to by means of a URL.
Regarding Google Apps Script, it has the URL Fetch Service, which can retrieve a page's source code as-is.
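As a rough illustration, here is a minimal custom-function sketch using UrlFetchApp (the function name GETPAGESOURCE is my own invention, and note that a single cell can hold at most 50,000 characters, so a very large page may not fit):

/**
 * Returns the raw HTML source of a URL so it can be stored in a cell,
 * e.g. =GETPAGESOURCE("https://example.com")
 * @customfunction
 */
function GETPAGESOURCE(url) {
  // UrlFetchApp is the Apps Script URL Fetch Service.
  const response = UrlFetchApp.fetch(url);
  return response.getContentText();
}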

To import source code you can use the IMPORTDATA formula. Depending on your website's structure, you may need to wrap it in ARRAY_CONSTRAIN:
=ARRAY_CONSTRAIN(IMPORTDATA("url-here"), 5000, 25)

Related

How to update a Google Sheet directly in R without creating a CSV file on the computer

I am trying to develop web-scraping code. I need to automate it and run it on Google Cloud daily. The daily web-scraped data has to be saved in a Google Sheet. The following is the relevant part of the code that I have developed to save data in a CSV file and then upload it to an existing Google Sheet.
# Here is a sample data set
apt_link <- c('https://www.immobilienscout24.at/expose/6220b265d188d1cf74252fbb',
'https://www.immobilienscout24.at/expose/622f314859ff6df2ed86c2ee',
'https://www.immobilienscout24.at/expose/619ca702f1a2b400224637d4',
'https://www.immobilienscout24.at/expose/61cc1cf099a6ef002161f721',
'https://www.immobilienscout24.at/expose/606761cd2c34720022d4117f')
rooms <- c(4,5,2,4,3)
Surface <-c(87.09,104.00,44.90,138.00,146.00)
cost <- c(389000,497000,279000,1890000,1600000)
address <-c('1140 Wien','1210 Wien','1210 Wien','1180 Wien','1060 Wien')
# Creating a data frame with the web-scraped data
df_one <- cbind.data.frame(apt_link, rooms, Surface, cost, address, Sys.time())
# Saving the data as a CSV file on the computer
con <- file('Real_Estate_Wien_Data.csv', encoding = "UTF-8")
write.csv(df_one, file = con, row.names = TRUE)
# Write Google sheets
library(googlesheets4)
library(googledrive)
drive_auth()
# Link to the folder in my google drive
td <- drive_get("https://drive.google.com/drive/u/0/folders/1ZK6vUGXhRfzCPJ9I-gIrj3Xbzu72R1e3")
# Update
drive_put('Real_Estate_Wien_Data.csv', name = "Real_Estate_Wien_Data", type="spreadsheet", path=as_id(td)) # keeps id because of other links
The issue here is that this code creates a CSV file on my computer, so when I automate it on the Google Cloud Platform I don't think it will be possible to save the CSV file. There has to be another way to write the data directly to a Google Sheet.
Thank you in advance, and your suggestions are much appreciated.
I would recommend using Google Apps Script, as it is specifically built to interact with Sheets and other Google files. It seems to me that you would like to accomplish three different tasks; I've summarized them below:
Fetching Drive folders and files: This can be accomplished with Apps Script's DriveApp class. From there you can fetch folders via getFolderById() or getFoldersByName(), and fetch individual files in the same way.
Writing data into spreadsheets: You can do that using the SpreadsheetApp class. There are many ways in which a spreadsheet can be modified via code; a simple example using Range.setValues() to write some data into the spreadsheet is sketched after this list.
Running the code daily: Within Apps Script, you can easily set up Triggers (read more about them here) that will automatically run the code daily in the cloud, without interacting in any way with your local computer.
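A minimal sketch of those three pieces, assuming a placeholder folder ID, a file named Real_Estate_Wien_Data, and an invented row layout:

// Writes scraped rows into a spreadsheet that lives inside a Drive folder.
function writeScrapedData() {
  // 1. Fetch the Drive folder and the spreadsheet file (ID and file name are placeholders).
  const folder = DriveApp.getFolderById('YOUR_FOLDER_ID');
  const file = folder.getFilesByName('Real_Estate_Wien_Data').next();
  const sheet = SpreadsheetApp.openById(file.getId()).getSheets()[0];

  // 2. Write data with Range.setValues(); the range size must match the array's shape.
  const rows = [
    ['https://www.immobilienscout24.at/expose/6220b265d188d1cf74252fbb', 4, 87.09, 389000, '1140 Wien', new Date()]
  ];
  sheet.getRange(sheet.getLastRow() + 1, 1, rows.length, rows[0].length).setValues(rows);
}

// 3. Run this once to create a time-driven trigger that calls writeScrapedData() daily.
function createDailyTrigger() {
  ScriptApp.newTrigger('writeScrapedData')
    .timeBased()
    .everyDays(1)
    .atHour(6)
    .create();
}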
Not sure if you ever found the solution, but you can absolutely use the googlesheets4 package to write your data to a new or existing spreadsheet. Check out the write_sheet() function here.

How to convert data types from a CSV before loading into DynamoDB

I want to load a CSV file into DynamoDB, but I can't find a way to specify the type for each column of my file.
Take the following data from my CSV file:
"discarded","query","uuid","range_key"
false,"How can I help you?","h094dfd9e-a604-4187-99ff--mmxk","log#en#MISMATCH#2021-04-30T12:00:00.000Z"
The discarded column should be treated as a BOOL, but DynamoDB imports it as a String.
Is there any way I can specify a type before importing the CSV, or should I process the data with a script to handle the conversions myself?
AWS does not currently provide any tools to simplify this kind of operation other than the REST API.
However, Dynobase, a third-party application developed to easily manage DynamoDB, allows you to import/export data in CSV and JSON formats.
The import tool allows you to select the type of the data before insertion.
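If you would rather script the conversion yourself, as the question suggests, a rough Node.js sketch using the AWS SDK's DocumentClient could look like the following (the table name, region and file name are assumptions, and the naive CSV split only works because this sample has no embedded commas):

// load_csv_to_dynamodb.js - convert CSV strings to typed attributes before writing.
const fs = require('fs');
const AWS = require('aws-sdk');

const docClient = new AWS.DynamoDB.DocumentClient({ region: 'eu-west-1' }); // assumed region

const lines = fs.readFileSync('data.csv', 'utf8').trim().split('\n');
const headers = lines[0].split(',').map(h => h.replace(/^"|"$/g, ''));

const items = lines.slice(1).map(line => {
  // Naive split; use a real CSV parser if fields can contain commas.
  const values = line.split(',').map(v => v.replace(/^"|"$/g, ''));
  const item = Object.fromEntries(headers.map((h, i) => [h, values[i]]));
  // DocumentClient maps JavaScript booleans to the DynamoDB BOOL type.
  item.discarded = item.discarded === 'true';
  return item;
});

(async () => {
  for (const item of items) {
    await docClient.put({ TableName: 'my-table', Item: item }).promise(); // assumed table name
  }
})();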

Power Automate Flow: Convert JSON to readable Power Automate items

In CRM I have a 'Doc_Config' value.
The 'Doc_Config' value gets passed to a Power Automate flow.
With the data I populate a Microsoft Word document. My problem is that instead of the data, the raw text is filled into the Word document.
Is there a way to convert the raw text so Power Automate recognizes the data I actually want, as if it were presented to the flow directly?
Problem: You have probably copied the path of your objects and pasted the path value into your 'Doc_Config' value. The problem here is the #{...} pattern.
Solution: Remove the #{...} pattern from any objects that you refer to by their path, as in the example below:
incorrect:
#{items('Apply_to_each_2')?['productname']}
correct:
items('Apply_to_each_2')?['productname']
Background:
In Power Automate cloud flows, you reference objects that the dynamic content tooling offers. However, sometimes you want to catch objects that the dynamic content tooling cannot see, or does not provide. At these times, you can refer to them by specifying the path for them as in the example below.
items('Apply_to_each_2')?['productname']
You can observe the path of an object by hovering over any object that the dynamic content tooling provides.
Another option would be to simply parse the data from your array, as it is already JSON.
The idea is very simple:
Append your data to your array variable.
Add a Parse JSON action, click 'Generate from sample', and paste in the JSON you use.
The parsed outputs can then be used in all other steps.

Is there an R package to generate a .ris file based on a query in a database?

For a scoping study/systematic literature review, I would like a package which generates a reference list as a .ris file directly from publisher databases such as Wiley, PubMed, Science Direct, Web of Science and JSTOR.
Is there a package (or a workaround with an API) that can output all listed resources of a database query as a file/dataframe in R?
I have read about "refwork" and "revtools" so far, but they seem to need .ris data upfront. I am looking for something that generates this file for me, rather than me doing it manually (which means ticking results page by page and exporting them).

How can I use an R model to power a cell in a google sheet?

I want to use a stats::loess model object created in R to generate a cell's value in Google Sheets automatically, with two other cells in the sheet as the inputs to the function.
I created a model object using loess() that takes two features:
predict(df, tibble(x, y)) = prediction
I want to use this model to power a cell in a google sheet so that non-technical teammates can change the inputs to see what outputs they'd get in different scenarios.
I can read and write to a google sheet with R, but I want someone shared on the google sheet to have the output value live-updated if they change the input values.
Is there a way to do that?
I don't think that's possible. You can only interact with Google Sheets from R when a manual trigger is fired, like when you run code to edit cells in Google Sheets or download values from a sheet. For a dynamic platform I recommend using Shiny.
