R: Read file or sheet name of a csv file

Is it possible to read the file name or the sheet name of a .csv file when importing it into R? I generated a .csv by clicking on the url:
https://www.populationpyramid.net/api/pp/4/2019/?csv=true
The file has the name "Afghanistan-2019" and the sheet name is the same.
Now I tried to do the same with R using
library(readr)
df <- read_csv("https://www.populationpyramid.net/api/pp/4/2019/?csv=true")
However, that only gives me access to the data; the file/sheet name is lost. Any suggestions?

You can use the excel_sheets function from the readxl package to get a character vector of all the sheets contained in the excel file.
Edit:
Sorry, I realized now that you are downloading a CSV file. CSV files are flat files and as such don't have any sheet names, so your only option is the file name. Since you are essentially querying an API, you could use the httr package instead to send a GET request:
library(httr)
library(stringr)
res <- httr::GET("https://www.populationpyramid.net/api/pp/4/2019/?csv=true")
This gives you a response object which contains all kinds of interesting information, including both the actual data (duh) and the file name.
You can get the data with the content function:
httr::content(res)
#> # A tibble: 21 x 3
#> Age M F
#> <chr> <dbl> <dbl>
#> 1 0-4 2891330 2747452
#> 2 5-9 2765393 2636519
#> 3 10-14 2614937 2501560
#> 4 15-19 2321520 2197654
#> 5 20-24 1950650 1843985
#> 6 25-29 1551332 1433056
#> 7 30-34 1255855 1138037
#> 8 35-39 1033269 954327
#> 9 40-44 834402 758533
#> 10 45-49 649695 603870
#> # … with 11 more rows
To retrieve the file name, we need to get a bit more creative. The file name is stored in the content-disposition element in the headers section of the res object:
res$headers$`content-disposition`
#> [1] "attachment; filename=Afghanistan-2019.csv"
We can extract it with a regex which pulls out all the text after the first =:
stringr::str_extract(res$headers$`content-disposition`, "(?<=\\=).*")
#> [1] "Afghanistan-2019.csv"
Since response objects should always contain the same information in the same places (especially when retrieved from the same API), you could easily automate this process.
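As a rough sketch of that automation, the header parsing can be wrapped in a small helper. The function name filename_from_disposition is hypothetical (it is not part of httr), and the pattern assumes the header carries a plain filename= parameter like the one shown above, optionally wrapped in double quotes.

```r
# Hypothetical helper: pull the file name out of a Content-Disposition value.
filename_from_disposition <- function(disposition) {
  # Capture everything after "filename=", stopping at a quote or semicolon.
  m <- regmatches(disposition, regexec('filename="?([^";]+)"?', disposition))[[1]]
  # m[1] is the full match, m[2] the captured file name (if there was a match).
  if (length(m) < 2) NA_character_ else m[2]
}

file_name <- filename_from_disposition("attachment; filename=Afghanistan-2019.csv")
file_name

# Drop the extension to get a label such as "Afghanistan-2019".
tools::file_path_sans_ext(file_name)

# With an httr response this would be called as:
# filename_from_disposition(res$headers$`content-disposition`)
```

Returning NA when the header has no filename= part lets a loop over many URLs continue instead of stopping on the first response without one.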

Related

Reading a JSON file (with 1 key to many values mapping) in R

I have a file named data.json. It has the following contents:
{
"ID":["1","2","3","4","5","6","7","8" ],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],
"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
In RStudio, I have installed the 'rjson' package and have the following code:
library("rjson")
myData <- fromJSON(file="data.json")
print(myData)
As per the description of the fromJSON() function, it should read the contents of the 'data.json' file into an R object 'myData'. When I executed it, I got the following error:
Error in fromJSON(file = "data.json") :
not all data was parsed (0 chars were parsed out of a total of 3 chars)
I validated the structure of the 'data.json' file on https://jsonlint.com/. It was valid.
I searched stackoverflow.com and got the following page: Error in fromJSON("employee.json") : not all data was parsed (0 chars were parsed out of a total of 13 chars)
My program already complies with the answers given here but the 'data.json' file is still not getting parsed.
I would be grateful if you could point out what mistake I am making in the R program or JSON file as I am new to both.
Thank You.
I can confirm the error for rjson, but jsonlite::fromJSON appears to work.
jsonlite::fromJSON('data.json') |> as.data.frame()
# ID Name Salary StartDate Dept
# 1 1 Rick 623.3 1/1/2012 IT
# 2 2 Dan 515.2 9/23/2013 Operations
# 3 3 Michelle 611 11/15/2014 IT
# 4 4 Ryan 729 5/11/2014 HR
# 5 5 Gary 843.25 3/27/2015 Finance
# 6 6 Nina 578 5/21/2013 IT
# 7 7 Simon 632.8 7/30/2013 Operations
# 8 8 Guru 722.5 6/17/2014 Finance
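Every value in the file is a quoted JSON string, so the columns arrive as character vectors. The sketch below parses a shortened copy of the data from a string (an assumption made to keep the example self-contained) and converts the columns afterwards. It assumes jsonlite is installed and that the dates are in month/day/year order.

```r
library(jsonlite)

# Shortened copy of the question's data, supplied as a string instead of a file.
json_text <- '{
  "ID": ["1", "2"],
  "Name": ["Rick", "Dan"],
  "Salary": ["623.3", "515.2"],
  "StartDate": ["1/1/2012", "9/23/2013"]
}'

emp <- as.data.frame(fromJSON(json_text))

# The JSON stores everything as strings, so convert the columns explicitly.
emp$ID <- as.integer(emp$ID)
emp$Salary <- as.numeric(emp$Salary)
emp$StartDate <- as.Date(emp$StartDate, format = "%m/%d/%Y")

str(emp)
```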

scraping with select/ option dropdown

I am new to web scraping, and after a couple of Wikipedia pages I found this page, from which I want to extract the tables for all the portfolio managers. I could not get the approaches I found online to work. I thought it would be easy since it's just a table, but I am not able to extract even a single table after filling out the form. Can someone please tell me how I could get this done in R? I have added an image to this post, but it seems to show up as a link that says "enter image description here".
https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes
library(tidyverse)
library(rvest)
library(httr)
library(RCurl)
url <- "https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes"
result <- postForm(url,
pmrId="RIGHT HORIZONS PORTFOLIO MANAGEMENT PRIVATE LIMITED",
year="2022",
month="August")
attr(result,"Content-Type")
result
If you change the values you pass to the corresponding value attributes of the options (i.e. "8" instead of "August" for <option value="8">August</option>), you should be all set.
You can also check the actual payload of the POST request in your browser's DevTools.
A lazy approach is to use Copy as cURL in DevTools and paste the result into https://curlconverter.com/r/ to convert it to an httr request.
library(rvest)
resp <- httr::POST("https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes",
body = list(
pmrId="INP000004417##INP000004417##AEQUITAS INVESTMENT CONSULTANCY PRIVATE LIMITED",
year="2022",
month="8"))
tables <- resp %>%
read_html() %>%
html_elements("table") %>%
html_table()
# first table:
tables[[1]]
#> # A tibble: 11 × 2
#> X1 X2
#> <chr> <chr>
#> 1 Name of the Portfolio Manager "Aeq…
#> 2 Registration Number "INP…
#> 3 Date of Registration "201…
#> 4 Registered Address of the Portfolio Manager ",,,…
#> 5 Name of Principal Officer ""
#> 6 Email ID of the Principal Officer ""
#> 7 Contact Number (Direct) of the Principal Officer ""
#> 8 Name of Compliance Officer ""
#> 9 Email ID of the Compliance Officer ""
#> 10 No. of clients as on last day of the month "124…
#> 11 Total Assets under Management (AUM) as on last day of the month (Amoun… "143…
Created on 2022-10-11 with reprex v2.0.2
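The label-to-value lookup mentioned in this answer can also be done in code rather than by reading the page source. The sketch below is only an illustration: the HTML fragment is a hypothetical stand-in for the real form, and the select name "month" is assumed from the POST parameter used above.

```r
library(rvest)

# Hypothetical fragment standing in for the month dropdown of the real form.
form_html <- '<select name="month">
  <option value="7">July</option>
  <option value="8">August</option>
</select>'

page <- read_html(form_html)
opts <- html_elements(page, "select[name='month'] option")

# Named vector: visible label -> value attribute expected by the server.
month_values <- setNames(html_attr(opts, "value"), html_text2(opts))
month_values[["August"]]
```

Against the live site, read_html(url) would replace the fragment, and the same lookup could be repeated for the pmrId and year dropdowns.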

Read multilingual data in R

I have a dataset with more than 1 language (e.g. Korean, Chinese).
Country,Name
USA,Alix
Korea,티디
Germany,Zürn Gm
China,和 Taiwan
I have saved the file in csv format with UTF-8.
library(readr)
guess_encoding("test.csv", n_max = 1000)
# A tibble: 1 x 2
encoding confidence
<chr> <dbl>
1 UTF-8 1
However, when I load the file into R, it shows invalid characters (e.g. <U+D2F0>):
df <- read.csv("test.csv",encoding = "UTF-8")
Country Name
1 USA Alix
2 Korea <U+D2F0><U+B514>
3 Germany Zürn Gm
4 China <U+548C> Taiwan
How can I load and write the file to show the correct foreign characters?
If you don't need to write the file as a .csv, you can consider saving the data frame with save(), which stores the R object itself so that load() restores it with the characters intact.
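A minimal sketch of that round trip is below. The data frame is rebuilt by hand from two rows of the question's data using Unicode escapes, and the temporary file path is an assumption made for illustration.

```r
# Two rows from the question, written with Unicode escapes.
df <- data.frame(
  Country = c("Korea", "China"),
  Name = c("\ud2f0\ub514", "\u548c Taiwan"),
  stringsAsFactors = FALSE
)

# Save the object itself rather than a text representation of it.
rdata_path <- tempfile(fileext = ".RData")
save(df, file = rdata_path)

# Load into a separate environment so the original df is not overwritten.
restored <- new.env()
load(rdata_path, envir = restored)

# Compare the restored strings with the originals.
identical(restored$df$Name, df$Name)
```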

openxlsx: read.xlsx throws an error if the sheet name contains the "&" character

Create an .xlsx file with three sheets named: "Test 1", "S&P500 TR" and "SP500 TR". Put some random content in each sheet and save it as "Book1.xlsx".
Run:
> a <- getSheetNames("Book1.xlsx")
> a
[1] "Test 1" "S&P500 TR" "SP500 TR"
Now try:
> read.xlsx("Book1.xlsx", a[2])
Error in read.xlsx.default("Book1.xlsx", a[2]) :
Cannot find sheet named "S&P500 TR"
First, check whether typing the sheet name "S&P500 TR" directly, instead of passing a[2], changes anything.
Alternatively, you can use the readxl package to import the sheet:
library(readxl)
X1 <- read_excel("C:/1.xls", sheet = "S&P500 TR")
This is a spreadsheet that I had, and this is the result after importing it:
head(X1)
# A tibble: 6 × 4
# Year Month Community ` Average Daily`
# <dbl> <chr> <chr> <dbl>
# 1 2016 Jan Arlington 5.35
# 2 2016 Jan Ashland 1.26
# 3 2016 Jan Bedford 2.62
# 4 2016 Jan Belmont 3.03
# 5 2016 Jan Boston 84.89
# 6 2016 Jan Braintree 8.16
I ran into the same problem, but found a workaround. First load the workbook using loadWorkbook(). Then rename the problematic sheet to avoid the ampersand. To fix the code in your example:
wb <- loadWorkbook("Book1.xlsx")
renameWorksheet(wb, "S&P500 TR", "NEW NAME")
output <- read.xlsx(wb, "NEW NAME")
Hope this helps!
First load the workbook, then use the which and grepl functions to return the number of the sheet whose name matches (the name can include the '&' character when done this way). This seems to work quite well in an application I am currently working on.
An (incomplete) example is given below that should be easily modified to your context. In my case 'i' is a file name (looping over many files). The "toy" code is here:
wb <- loadWorkbook(file = i)
which( grepl("CAPEX & Depreciation", names(wb)) )
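To show the lookup step on its own, the sketch below runs it against the sheet names from the question as a plain character vector, so it works without a workbook file. Using fixed = TRUE is an added assumption that the sheet name should be matched literally rather than as a regular expression.

```r
# Sheet names as returned by getSheetNames() in the question.
sheets <- c("Test 1", "S&P500 TR", "SP500 TR")

# fixed = TRUE matches the text literally instead of as a regex pattern.
idx <- which(grepl("S&P500 TR", sheets, fixed = TRUE))
idx

# With openxlsx loaded and wb <- loadWorkbook("Book1.xlsx"), the index
# could then be passed on in place of the name:
# read.xlsx(wb, sheet = idx)
```

Note that grepl finds partial matches, so a short pattern such as "TR" would return both the second and third sheets; sheets == "S&P500 TR" is the stricter alternative.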

Trouble opening a tempfile in R

I am trying and failing to use a tempfile() to get data from a .gz file posted on the web without writing the archive to my hard drive and manually extracting the desired file. I'm re-using code that has worked in similar situations before, and R can find other tempfiles with no trouble.
Here's the code I'm using:
temp <- tempfile()
download.file("http://unified-democracy-scores.org/files/20140312/z/uds_summary.csv.gz", temp)
UDS <- read.csv(unz(temp, "uds_summary.csv"), stringsAsFactors = FALSE)
Here's the error it's throwing:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot open zip file 'C:\Users\Jay\AppData\Local\Temp\RtmpKs4ZWm\file100877485507'
I tried setting the mode in download.file() to other options (e.g., mode="wb") to no avail. Ditto for varying the method at that step. If I download the archive to my hard drive and manually extract the .csv using the name used in the third line of my code, it reads in fine.
Any ideas what I'm doing wrong here?
Use gzfile instead of unz. A .gz file is a single gzip-compressed file rather than a zip archive, so unz() cannot open it:
UDS <- read.csv(gzfile(temp), stringsAsFactors = FALSE)
This gives the output:
head(UDS)
#> country year cowcode mean sd median pct025
#> 1 United States 1946 2 1.086431 0.2962744 1.072743 0.5424734
#> 2 United States 1947 2 1.094423 0.2989538 1.077987 0.5516301
#> 3 United States 1948 2 1.050040 0.2604016 1.038927 0.5642550
#> 4 United States 1949 2 1.039801 0.2585845 1.031048 0.5628056
#> 5 United States 1950 2 1.084971 0.2449264 1.071610 0.6280569
#> 6 United States 1951 2 1.043591 0.2551857 1.033722 0.5695530
#> pct975
#> 1 1.694063
#> 2 1.719771
#> 3 1.588783
#> 4 1.567912
#> 5 1.589253
#> 6 1.577150
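The same pattern can be tried without a download. The sketch below writes a tiny gzip-compressed CSV to a temporary file and reads it back through gzfile(). The two rows are made-up stand-ins, not the real UDS data.

```r
# Write a small gzip-compressed CSV to a temporary file.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "wt")
write.csv(data.frame(country = "United States", year = 1946:1947),
          con, row.names = FALSE)
close(con)

# gzfile() decompresses on the fly, so read.csv sees plain CSV text.
uds_small <- read.csv(gzfile(gz_path), stringsAsFactors = FALSE)
nrow(uds_small)
```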
