pretty new to R and just trying to get my head round web scrapping.
So I want to download into R a CSV file (easy enough I can do that). The issue that i am having is that every month the CSV file name changes on the website, so the URL then also changes.
So is there a way in R to tell it to download the the CSV file without the exact file url?
here is the website i am practicing it on:
https://www.police.uk/pu/your-area/police-scotland/performance/999-data-performance/
Thanks :)
So tried the basic stuff like below, issue is the url will change when the file is updated.
dat <- read.csv(
"https://www.police.uk/contentassets/069a2c11fcb444bbbeb519f69875577e/2022/sept/999-data-nov-21---sep-22.csv"
, header = T)
view(dat)
Related
Right now I have to inport a file on Rstudio using a URL, I tried to inport it like I do for a cvs file (read.csv...), but it doesn't work. Can someone please tell me how can I do it?
I'm trying to download all the zipped csv file from the following pg: http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12301&reportTitle=Settlement%20Point%20Prices%20at%20Resource%20Nodes,%20Hubs%20and%20Load%20Zones&showHTMLView=&mimicKey
I've started by trying to download one file as an example before I move on to downloading multiple. This site contains prices from specific locations in Texas - interesting, given recent power outages due to cold weather in Texas.
url <- "http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12301&reportTitle=Settlement%20Point%20Prices%20at%20Resource%20Nodes,%20Hubs%20and%20Load%20Zones&showHTMLView=&mimicKey/cdr.00012301.0000000000000000.20210220.141704636.SPPHLZNP6905_20210220_1415_csv.zip"
temp <- tempfile()
download.file(url,temp, mode = "wb")
data <- read.csv(unzip(temp, "cdr.00012301.0000000000000000.20210220.141704.SPPHLZNP6905_20210220_1415.csv"))
unlink(temp)
Keep receiving the following error message: "error 1 in extracting from zip file."
I'm relatively new to R, so any advice would be helpful.
Edit: If the link above doesn't work, another way to get to the link is following this: http://www.ercot.com/mktinfo/rtm and going to "Real-Time Price Reports" and selecting the last option "Settlement Point Prices at Resource Nodes, Hubs, and Load Zones." Might look a little overwhelming, but my goal for right now is just to download and open the first zipped csv file on there (and ignore all the other files on there)
I've got a folder full of .doc files and I want to merge them all into R to create a dataframe with filename as one column and content as another column (which would include all content from the .doc file.
Is this even possible? If so, could you provide me with an overview of how to go about doing this?
I tried starting out by converting all the files to .txt format using readtext() using the following code:
DATA_DIR <- system.file("C:/Users/MyFiles/Desktop")
readtext(paste0(DATA_DIR, "/files/*.doc"))
I also tried:
setwd("C:/Users/My Files/Desktop")
I couldn't get either to work (output from R was Error in list_files(file, ignore_missing, TRUE, verbosity) : File '' does not exist.) but I'm not sure if this is necessary for what I want to do.
Sorry that this is quite vague; I guess I want to know first and foremost if what I want to do can be done. Many thanks!
I would like to open an Excel file saved as webpage using R and I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (excel webpage format) and extension format (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are more strict and don't see the actual file type they were expecting, so they fail.
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp = readHTMLTable([filename], stringsAsFactors = FALSE)
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
3) Extract the relevant list element, e.g.
df = dTemp[2]
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.
I seriously doubt Excel is 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kind of files (In particular Excel and h5) are binary rather than text files. This needs an added setting to warn R that it is a binary file and should be handled appropriately.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
or, for use downloader, and ty something like this.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.csv", mode="wb")
I know that this question has been asked exhaustively on this site, but I cannot find any question which addresses my problem.
I am trying to import multiple .csv files into R which are located in nested .zip files on my PC. The other questions seem to relate to importing a single file from a URL, which is not my issue.
I have set my working directory to the folder which contains the first .zip file, but there is another one inside of it, which then contains normal folders, and finally hundreds of .csv files which I am looking to access.
Up to now I have always manually extracted the data since I have no idea where to begin with unzipping code, but considering this folder contains around 20GB of data, I'm going to need to try something else.
Any help would be appreciated!
EDIT - CODE:
setwd("C:/docs/data/241115")
temp <- tempfile()
unzip("C:/docs/data/241115/Requested.zip",exdir=temp)
l = list.files(temp)
unzip("C:/docs/data/241115/Requested/Data Requested.zip",exdir=temp)
> error 1 in extracting from zip file
Without a minimal reproducible example it's difficult to know exactly where the problem lies. My best guess is that using a tempfile() is causing problems.
I would create a folder within your working directory to unzip the files to. You can do this from within R if you like:
# Create the folder 'temp' in your wd
dir.create("temp")
Now assuming your zip file is in the working directory I would unzip the first .zip in to temp in one step:
unzip("Requested.zip", exdir = "temp")
Finally, unzip the final .zip:
unzip("temp/Data Requested.zip", exdir = "temp")