I am using the httr package to retrieve data from our reporting system using its REST API. I am specifying the content to be an .xlsx. The response contains the raw (binary?) file.
Here's what my request looks like:
request <- GET("http://server/.../documents/123456",
               add_headers(.headers = c(
                 'Accept' = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
                 'authtoken' = logonToken)),
               content_type("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
               encode = 'raw'
)
content(request)
[1] 50 4b 03 04 0a 00 08 08 08 00 19 88 79 54 00 00 00 00 00 00 00
[44] 2f 73 68 65 65 74 31 2e 78 6d 6c a5 57 5b 6f 9b 30 14 7e 9f b4
[87] 00 23 43 2f db af 9f c1 94 d8 c6 58 93 92 87 54 f5 77 f1 39 fe
... etc
The result can be saved as a .xlsx and opened in Excel. However, I would like to read this data directly into a data frame. Is there a way to turn the response into readable input within the same script?
I am able to pass an extra write_disk() argument to save the response directly as a file (specifying a path is required). I tried using tempfile() to write the response and read it straight back in, but wasn't able to get it to work.
Is there any way to read a raw file from an R environment object?
Yes. Here's a fully reproducible example URL:
url <- paste0('https://file-examples.com/storage/fe91183158623ded19eb446/',
'2017/02/file_example_XLSX_100.xlsx')
Now download the file and get its raw contents:
raw_xlsx <- httr::GET(url)$content
Let's create a temporary file to store it:
tmp <- tempfile(fileext = '.xlsx')
Now write the raw data to the file:
writeBin(raw_xlsx, tmp)
Our Excel file is now saved in the temporary file, which we can read the same way we would normally read Excel files into R:
my_excel <- readxl::read_excel(tmp)
And the result is:
my_excel
#> # A tibble: 100 x 8
#> `0` `First Name` `Last Name` Gender Country Age Date Id
#> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
#> 1 1 Dulce Abril Female United States 32 15/10/2017 1562
#> 2 2 Mara Hashimoto Female Great Britain 25 16/08/2016 1582
#> 3 3 Philip Gent Male France 36 21/05/2015 2587
#> 4 4 Kathleen Hanner Female United States 25 15/10/2017 3549
#> 5 5 Nereida Magwood Female United States 58 16/08/2016 2468
#> 6 6 Gaston Brumm Male United States 24 21/05/2015 2554
#> 7 7 Etta Hurn Female Great Britain 56 15/10/2017 3598
#> 8 8 Earlean Melgar Female United States 27 16/08/2016 2456
#> 9 9 Vincenza Weiland Female United States 40 21/05/2015 6548
#> 10 10 Fallon Winward Female Great Britain 28 16/08/2016 5486
#> # ... with 90 more rows
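Alternatively, since the question mentions write_disk(): httr can write the response body straight to the temporary file, skipping the manual writeBin() step. A minimal sketch of the same download:

tmp <- tempfile(fileext = '.xlsx')
# write_disk() streams the response body to tmp as it downloads
httr::GET(url, httr::write_disk(tmp, overwrite = TRUE))
my_excel <- readxl::read_excel(tmp)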
These are the survey results. I have tried pairwise testing (pairwise.wilcox.test) on the results collected in Spring and Autumn at each site, but I can't get a specific p-value showing which site has the most influence.
This is the error message I keep getting. My dataset is unbalanced, i.e. some sites were not surveyed in Spring, which I think may be the issue.
Error in wilcox.test.default(xi, xj, paired = paired, ...) :
'x' must be numeric
So I'm not sure whether I have laid the table out wrong for testing how much site influences the results between Spring and Autumn.
Site Autumn Spring
Stokes Bay 25 6
Stokes Bay 54 6
Stokes Bay 31 0
Gosport Wall 213 16
Gosport Wall 24 19
Gosport Wall 54 60
No Mans Land 76 25
No Mans Land 66 68
No Mans Land 229 103
Osbourne 1 77
Osbourne 1 92
Osbourne 1 92
Osbourne 2 114 33
Osbourne 2 217 114
Osbourne 2 117 64
Osbourne 3 204 131
Osbourne 3 165 85
Osbourne 3 150 81
Osbourne 4 124 15
Osbourne 4 79 64
Osbourne 4 176 65
Ryde Roads 217 165
Ryde Roads 182 63
Ryde Roads 112 53
Ryde Sands 386 44
Ryde Sands 375 25
Ryde Sands 147 45
Spit Bank 223 23
Spit Bank 78 29
Spit Bank 60 15
St Helen's 1 247 11
St Helen's 1 126 36
St Helen's 1 107 20
St Helen's 2 108 115
St Helen's 2 223 25
St Helen's 2 126 30
Sturbridge 58 43
Sturbridge 107 34
Sturbridge 156 0
Osbourne Deep 1 76 59
Osbourne Deep 1 64 52
Osbourne Deep 1 77 30
Osbourne Deep 2 153 60
Osbourne Deep 2 106 88
Osbourne Deep 2 74 35
Sturbridge Shoal 169 45
Sturbridge Shoal 19 84
Sturbridge Shoal 81 44
Mother's Bank 208
Mother's Bank 119
Mother's Bank 153
Ryde Middle 16
Ryde Middle 36
Ryde Middle 36
Stanswood 14 132
Stanswood 47 87
Stanswood 14 88
This is what I've done so far:
MWU <- read.csv(file.choose(), header = T)
#attach file to workspace
attach(MWU)
#Read column names of the data
colnames(MWU) # Site, Autumn, Spring
MWU.1 <- MWU[c(1,2,3)] #It included blank columns in the df
kruskal.test(MWU.1$Autumn ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Autumn by MWU.1$Site
#Kruskal-Wallis chi-squared = 36.706, df = 24, p-value = 0.0468
kruskal.test(MWU.1$Spring ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Spring by MWU.1$Site
#Kruskal-Wallis chi-squared = 35.134, df = 21, p-value = 0.02729
wilcox.test(MWU.1$Autumn, MWU.1$Spring, paired = T)
#Wilcoxon signed rank exact test
#data: MWU.1$Autumn and MWU.1$Spring
#V = 1066, p-value = 8.127e-08
#alternative hypothesis: true location shift is not equal to 0
#Tried this version too to see if it would give a summary of where the influence is.
pairwise.wilcox.test(MWU.1$Spring, MWU.1$Autumn)
#Error in wilcox.test.default(xi, xj, paired = paired, ...) : not enough (non-missing) 'x' observations
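One likely cause, offered as a sketch rather than a definitive fix: pairwise.wilcox.test() expects a numeric response vector as its first argument and a grouping factor as its second, while the calls above pass two numeric season columns (or a character Site column) in those slots. Assuming MWU.1 has the Site, Autumn, and Spring columns shown above, reshaping to long format lets you compare sites directly:

library(tidyr)

#one row per site/season observation
MWU.long <- pivot_longer(MWU.1, cols = c(Autumn, Spring),
                         names_to = "Season", values_to = "Count")
#numeric response first, grouping factor second
pairwise.wilcox.test(MWU.long$Count, MWU.long$Site, p.adjust.method = "BH")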
I've read through a number of similar posts and tutorials but am really struggling to understand the solution to my issue. I have a wide dataset, and when I make it longer I want to collapse two sets of columns (both duration and results).
For each participant (id), there is a category, and then a series of blood test results. Each test has both duration (in days) and a result (numeric value).
Here's how it looks now:
id  category     duration_1  results_1  duration_2  results_2  duration_3  results_3
01  diabetic     58          32         65          56         76          87
02  prediabetic  54          32         65          25         76          35
03  unknown      46          65         65          56         21          67
How I'd like it to be is:
id  category     duration  results
01  diabetic     58        32
01  diabetic     65        56
01  diabetic     76        87
02  prediabetic  54        32
02  prediabetic  65        25
02  prediabetic  76        35
03  unknown      46        65
03  unknown      65        56
03  unknown      21        67
I can get pivot_longer() to work for "results", but I can't get it to pivot on both "results" and "duration".
Any assistance would be greatly appreciated. I'm still fairly new to R.
Thanks!
One way is to separate the column names into two columns while you pivot (hence the names_sep below). Then, you can just drop the column number.
library(tidyverse)
df %>%
  tidyr::pivot_longer(!c(id, category),
                      names_to = c(".value", "num"),  # ".value" keeps duration/results as columns
                      names_sep = "_") %>%
  dplyr::select(-num)                                 # drop the column number
Output
# A tibble: 9 × 4
id category duration results
<chr> <chr> <dbl> <dbl>
1 01 diabetic 32 23
2 01 diabetic 87 67
3 01 diabetic 98 78
4 02 prediabetic 43 45
5 02 prediabetic 34 65
6 02 prediabetic 12 12
7 03 unknown 32 54
8 03 unknown 75 45
9 03 unknown 43 34
Data
df <-
structure(
list(
id = c("01", "02", "03"),
category = c("diabetic", "prediabetic", "unknown"),
duration_1 = c(32, 43, 32),
results_1 = c(23, 45, 54),
duration_2 = c(87, 34, 75),
results_2 = c(67, 65, 45),
duration_3 = c(98, 12, 43),
results_3 = c(78, 12, 34)
),
class = "data.frame",
row.names = c(NA,-3L)
)
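If the real column names are less regular than a clean name_number pair, names_pattern is a drop-in alternative to names_sep; a sketch on the same df, using a regular expression with one capture group per entry in names_to:

df %>%
  tidyr::pivot_longer(!c(id, category),
                      names_to = c(".value", "num"),
                      names_pattern = "(.*)_(\\d+)") %>%  # regex split instead of a fixed separator
  dplyr::select(-num)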
Using R, I want to grab the .csv from this site: https://vacationwithoutacar.com/gvrat-tracking/ but there is no link associated with the download; it just pops up automatically. Usually, I would use httr to download, but when I right-click the link it doesn't give me the option to copy a URL. There's not much of a reproducible example because, well, I can't even get the file downloaded.
GET("https://vacationwithoutacar.com/gvrat-tracking/",
write_disk("gvrat.txt",
overwrite = TRUE)
)
This seems to work...
I opened the URL in Chrome, hit F12, clicked Network, clicked the trash can to clear, and reloaded the page. Then I looked for a request that downloads the data, and saw that the URL below returns JSON.
library(jsonlite)
df <- fromJSON("https://vacationwithoutacar.com/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=2596&target_action=get-all-data&default_sorting=old_first")
> head(df)
pos bib participantsname event home g a miles km yourapproximatelocation comp projfin 51 52 53 54
1 1 17395 Gingerbread Man GVRAT US-TN M 145 275.6 443.5 Cowley Hollow 43.50% tbd 83 43 49 52
2 2 1074 Terri Biloski GVRAT CA-ON F 44 274.6 442.0 Cowley Hollow 43.30% tbd 63 59 52 52
3 3 1429 Dave Proctor GVRAT CA-AB M 39 255.3 410.8 Cowley Hollow 40.30% tbd 62 62 5 62
4 4 743 John Sharp GVRAT US-TX M 42 235.0 378.2 Reed Hollow 37.10% tbd 34 41 46 54
5 5 386 Roc Powell GVRAT US-OH M 60 226.0 363.7 Britton Hollow 35.60% tbd 35 75 63 43
6 6 322 Rob Donkersloot GVRAT AU M 60 190.7 306.9 past Deerfield 30.10% tbd 42 42 43 32
55 56 57 58 59 51_1
1 48
2 48
3 63
4 60
5 10
6 32
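Since the original goal was a .csv on disk, you can then write the parsed table out yourself (assuming the flat data frame above is what you want to keep):

write.csv(df, "gvrat.csv", row.names = FALSE)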
I am trying to get a DTM of the words on this page:
https://en.wikipedia.org/wiki/Talk:Libyan_Civil_War_(2011)/Archive_1
My problem is that the pseudonyms of the people posting (which are words in my corpus) never appear in my DTM, even when I set dictionary to NULL. For instance, I expect the word "Lihaas" to appear 31 times, but it does not show up in my DTM.
My code:
library(tm)
docs <- VCorpus(DirSource(directory = "~dir"))
docsTDM <- DocumentTermMatrix(docs, control=list(dictionary=NULL))
I obtain:
the 2011 february utc
628 319 293 280
talk and this that
236 197 163 152
for are not uprising
106 101 92 79
libyan protests but support
76 75 68 68
with there revolt its
68 65 62 61
protest article have now
58 57 53 50
has civil should which
47 46 44 44
more think war was
43 43 41 41
from libya what would
40 40 36 35
about revolution added sources
34 34 32 32
comment government people some
30 30 30 30
all just section you
29 29 29 29
than unsigned will can
27 27 27 26
talk•contribs then even name
26 26 25 25
It might have to do with the fact that "Lihaas" is adjacent to a preceding "." in all of the cases that I see, or inside parentheses. So it is likely to be due to issues with tm's tokeniser.
Here is an alternative that produces what you want, using the quanteda package.
# read the document using the readtext package
wikitxt <- readtext::readtext("Talk:Libyan Civil War (2011):Archive 1 - Wikipedia.html")
library("quanteda")
wikidfm <- dfm(corpus(wikitxt), tolower = FALSE)
wikidfm
## Document-feature matrix of: 1 document, 3,006 features (0% sparse).
wikidfm[, c("lihaas", "Lihaas")]
## Document-feature matrix of: 1 document, 2 features (0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
## features
## docs lihaas Lihaas
## text1 1 30
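If you would rather stay with tm, one possible workaround, a sketch assuming the problem really is punctuation glued to the pseudonyms, is to strip punctuation before building the DTM so that tokens like "Lihaas." and "(Lihaas" collapse to "Lihaas":

library(tm)
docs <- VCorpus(DirSource(directory = "~dir"))
# remove the adjacent "." and "(" before tokenisation
docs <- tm_map(docs, removePunctuation)
docsTDM <- DocumentTermMatrix(docs)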
I have a CSV data file in the following format:
Stock prices over the period of Jan 1, 2015 to Sep 26, 2017
Now I use the following code to import the data as zoo object:
sensexzoo1<- read.zoo(file = "/home/bidyut/Downloads/SENSEX.csv",
format="%d-%B-%Y", header=T, sep=",")
It produces the following error:
Error in read.zoo(file = "/home/bidyut/Downloads/SENSEX.csv", format =
"%d-%B-%Y", : index has 679 bad entries at data rows: 1 2 3 4 5 6
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
100 ...
What is wrong with this? Please suggest a fix.
The problem is the mismatch between the header and the data. The header line has 5 fields and the remaining lines of the file have 6 fields:
head(count.fields("SENSEX.csv", sep = ","))
## [1] 5 6 6 6 6 6
When that happens, read.zoo assumes that the first field of the data holds the row names, so by default the next field (which in fact contains the Open data) is taken to be the time index.
We can address this in several alternative ways:
1) The easiest way to fix this is to add a field called Volume, say, to the header so that the header looks like this:
Date,Open,High,Low,Close,Volume
2) If you have many files of this format so that it is not feasible to modify them we can read the data in without the headers and then add them on in a second pass. The [, -5] drops the column of NAs and the [-1] on the second line drops the Date header.
z <- read.zoo("SENSEX.csv", format="%d-%B-%Y", sep = ",", skip = 1)[, -5]
names(z) <- unlist(read.table("SENSEX.csv", sep = ",", nrow = 1))[-1]
giving:
> head(z)
Open High Low Close
2015-01-01 27485.77 27545.61 27395.34 27507.54
2015-01-02 27521.28 27937.47 27519.26 27887.90
2015-01-05 27978.43 28064.49 27786.85 27842.32
2015-01-06 27694.23 27698.93 26937.06 26987.46
2015-01-07 26983.43 27051.60 26776.12 26908.82
2015-01-08 27178.77 27316.41 27101.94 27274.71
3) A third approach is to read the file in as text, use R to append ",Volume" to the first line and then read the text with read.zoo:
Lines <- readLines("SENSEX.csv")
Lines[1] <- paste0(Lines[1], ",Volume")
z <- read.zoo(text = Lines, header = TRUE, sep = ",", format="%d-%B-%Y")
Note: The first few lines of SENSEX.csv are shown below to make this self-contained (not dependent on the link in the question which could disappear in the future):
Date,Open,High,Low,Close
1-January-2015,27485.77,27545.61,27395.34,27507.54,
2-January-2015,27521.28,27937.47,27519.26,27887.90,
5-January-2015,27978.43,28064.49,27786.85,27842.32,
6-January-2015,27694.23,27698.93,26937.06,26987.46,
7-January-2015,26983.43,27051.60,26776.12,26908.82,
8-January-2015,27178.77,27316.41,27101.94,27274.71,
9-January-2015,27404.19,27507.67,27119.63,27458.38,
12-January-2015,27523.86,27620.66,27323.74,27585.27,
13-January-2015,27611.56,27670.19,27324.58,27425.73,
14-January-2015,27432.14,27512.80,27203.25,27346.82,