I have a spreadsheet uploaded as a CSV file on Google Drive, shared so that anyone can read it.
This is the link to the csv file:
https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/edit?usp=sharing
I am trying to read it from R but I am getting a long list of error messages. I am using:
id = "170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk"
read.csv(sprintf("https://docs.google.com/spreadsheets/d/uc?id=%s&export=download", id))
Could someone suggest how to read files from google drive directly into R?
I would try to publish the sheet as a CSV file, and then read it from there.
It seems like your file is already published as a CSV. So, this should work. (Note that the URL ends with /pub?output=csv)
read.csv("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
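If a sheet is only link-shared rather than published, the /export?format=csv endpoint on the spreadsheet itself usually works too; note that the original attempt also misspelled sprintf() as sprint(). A minimal sketch (the actual read is commented out since it needs network access):

```r
id  <- "170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk"
url <- sprintf("https://docs.google.com/spreadsheets/d/%s/export?format=csv", id)
url
#> [1] "https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/export?format=csv"
# df <- read.csv(url)  # requires network access
```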
To read the CSV file faster, you can use vroom, which is even faster than data.table::fread().
Now using vroom,
library(vroom)
vroom("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
#> Rows: 387048 Columns: 14
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): StationCode, SampleID, WeatherCode, OrganismCode, race, race2
#> dbl (7): WaterTemperature, Turbidity, Velocity, ForkLength, Weight, Count, ...
#> date (1): SampleDate
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 387,048 × 14
#> StationCode SampleDate SampleID WeatherCode WaterTemperature Turbidity
#> <chr> <date> <chr> <chr> <dbl> <dbl>
#> 1 Gate 11 2000-04-25 116_00 CLD 13.1 2
#> 2 Gate 5 1995-04-26 117_95 CLR NA 2
#> 3 Gate 2 1995-04-21 111_95 W 10.4 12
#> 4 Gate 6 2008-12-13 348_08 CLR 49.9 1.82
#> 5 Gate 5 1999-12-10 344_99 CLR 7.30 1.5
#> 6 Gate 6 2012-05-25 146_12 CLR 55.5 1.60
#> 7 Gate 10 2011-06-28 179_11 RAN 57.3 3.99
#> 8 Gate 11 1996-04-25 116_96 CLR 13.8 21
#> 9 Gate 9 2007-07-02 183_07 CLR 56.6 2.09
#> 10 Gate 6 2009-06-04 155_09 CLR 58.6 3.08
#> # … with 387,038 more rows, and 8 more variables: Velocity <dbl>,
#> # OrganismCode <chr>, ForkLength <dbl>, Weight <dbl>, Count <dbl>,
#> # race <chr>, year <dbl>, race2 <chr>
Created on 2022-07-08 by the reprex package (v2.0.1)
I am trying to web-scrape historical DFS NFL ownership data from fantasylabs.com using RSelenium. I am able to navigate to the page, and even to highlight the element I am trying to scrape, but I get an error when I try to put it into a table.
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"webElement"’
I have looked up the error but cannot find a reason for it. I am essentially following a Stack Overflow example for this web-scraping problem. Could someone help me understand why I am not able to scrape this table, and what I could do differently in order to do so?
here is my full code:
library(RSelenium)
library(XML)
library(RCurl)
# start the Selenium server
rdriver <- rsDriver(browser = "chrome",
                    chromever = "106.0.5249.61")
# creating a client object and opening the browser
obj <- rdriver$client
# navigate to the url
appURL <- 'https://www.fantasylabs.com/nfl/contest-ownership/?date=10112022'
obj$navigate(appURL)
obj$findElement(using = 'xpath', '//*[@id="ownershipGrid"]')$highlightElement()
tableElem <- obj$findElement(using = 'xpath', '//*[@id="ownershipGrid"]')
projTable <- readHTMLTable(tableElem, header = TRUE, tableElem$getElementAttribute("outerHTML")[[1]])
dvpCTable <- projTable[[1]]
dvpCTable
The page fills that table from a JSON API, so you can skip RSelenium entirely and request the endpoint directly:
library(tidyverse)
library(httr2)
"https://www.fantasylabs.com/api/contest-ownership/1/10_12_2022/4/75377/0/" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
as_tibble()
#> # A tibble: 43 x 4
#> Prope~1 $Fant~2 $Posi~3 $Play~4 $Team $Salary $Actu~5 Playe~6 SortV~7 Fanta~8
#> <int> <int> <chr> <chr> <chr> <int> <dbl> <int> <lgl> <int>
#> 1 50882 1376298 TE Albert~ "DEN" 2800 NA 50882 NA 1376298
#> 2 51124 1376299 TE Andrew~ "DEN" 2500 1.7 51124 NA 1376299
#> 3 33781 1385590 RB Austin~ "LAC" 7500 24.3 33781 NA 1385590
#> 4 55217 1376255 QB Brett ~ "DEN" 5000 NA 55217 NA 1376255
#> 5 2409 1376309 QB Chase ~ "LAC" 4800 NA 2409 NA 1376309
#> 6 40663 1385288 WR Courtl~ "DEN" 6100 3.4 40663 NA 1385288
#> 7 50854 1376263 RB Damare~ "DEN" 4000 NA 50854 NA 1376263
#> 8 8580 1376342 WR DeAndr~ "LAC" 3600 4.7 8580 NA 1376342
#> 9 8472 1376304 D Denver~ "DEN" 2500 7 8472 NA 1376304
#> 10 62112 1376262 RB Devine~ "" 4000 NA 62112 NA 1376262
#> # ... with 33 more rows, 34 more variables:
#> # Properties$`$5 NFL $70K Flea Flicker [$20K to 1st] (Mon-Thu)` <dbl>,
#> # $Average <dbl>, $Volatility <lgl>, $GppGrade <chr>, $MyExposure <lgl>,
#> # $MyLeverage <lgl>, $MyLeverage_rnk <lgl>, $MediumOwnership_pct <lgl>,
#> # $PlayerId_rnk <int>, $PlayerId_pct <dbl>, $FantasyResultId_rnk <int>,
#> # $FantasyResultId_pct <dbl>, $Position_rnk <lgl>, $Position_pct <lgl>,
#> # $Player_Name_rnk <lgl>, $Player_Name_pct <lgl>, $Team_rnk <lgl>, ...
Created on 2022-11-03 with reprex v2.0.2
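As for the original error itself: readHTMLTable() dispatches on the class of its first argument and has no method for RSelenium webElement objects, so you have to hand it the HTML markup (e.g. the element's outerHTML string) rather than the element. A minimal illustration with hypothetical table markup standing in for tableElem$getElementAttribute("outerHTML")[[1]]:

```r
library(XML)

# Hypothetical stand-in for the element's outerHTML string
html <- '<table id="ownershipGrid">
  <tr><th>Player</th><th>Ownership</th></tr>
  <tr><td>Austin Ekeler</td><td>24.3%</td></tr>
</table>'

# Pass the markup, not the webElement, to readHTMLTable()
tbl <- readHTMLTable(html, header = TRUE)[[1]]
# tbl is a 1-row data frame with columns Player and Ownership
```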
I am working with time series. I have two different time series, each with two columns and a different number of rows.
df_1=read.table("data_1")
df_2=read.table("data_2")
I would like to compare the values of df_1$V2 (the second column) with the values in df_2$V2; where they are equal, I want to calculate the time difference between them (df_2$V1[j] - df_1$V1[i]).
here is my code:
vect = c()
horo = c()
j = 1
for (i in 2:nrow(df_1)) {
  for (j in 1:nrow(df_2)) {
    if (df_1$V2[i] == df_2$V2[j]) {
      calc = abs(df_2$V1[j] - df_1$V1[i])
      vect = append(vect, calc)
    }
  }
}
The problems are:
there may be many elements of df_2$V2 equal to df_1$V2[i], and I only want the first one;
I know that in my data, if (for example) df_1$V2[1] == df_2$V2[8], then on the next iteration there is no need to compare df_1$V2[2] with the first 8 values of df_2$V2, and I can start comparing from df_2$V2[9];
it takes too much time because of the for loops, so is there another way to do it?
Thank you for your help!
data example:
df_1=
15.942627 2633
15.942630 2664
15.942831 2699
15.943421 3068
15.943422 4256
15.943423 5444
15.943425 6632
15.943426 7820
15.945489 9008
15.945490 10196
15.945995 11384
15.960359 12572
15.960360 13760
15.960413 14948
15.960414 16136
15.961537 17202
15.962138 18390
15.962139 18624
16.042805 18659
16.043349 18851
....
df_2=
15.942244 2376
15.942332 2376
15.942332 2376
15.959306 2633
15.960350 2633
15.961223 3068
15.967225 6632
15.978364 10196
15.982280 12572
15.994296 16136
15.994379 18624
16.042336 18624
16.060262 18659
16.065397 21250
16.069239 24814
16.073407 28378
16.077236 31942
You've mentioned that your for-loop is slow; it's generally advisable to avoid writing your own for-loops in R and to let built-in vectorisation handle things efficiently.
Here's a non-for-loop-dependent solution using the popular dplyr package from the tidyverse.
Read in data
First, let's read in your data for the sake of reproducibility. Note that I've added names to your data, because unnamed data is confusing and hard to work with.
require(vroom) # useful package for flexible data reading
df_1 <- vroom(
"timestamp value
15.942627 2633
15.942630 2664
15.942831 2699
15.943421 3068
15.943422 4256
15.943423 5444
15.943425 6632
15.943426 7820
15.945489 9008
15.945490 10196
15.945995 11384
15.960359 12572
15.960360 13760
15.960413 14948
15.960414 16136
15.961537 17202
15.962138 18390
15.962139 18624
16.042805 18659
16.043349 18851")
#> Rows: 20 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: " "
#> dbl (2): timestamp, value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_2 <- vroom(
"timestamp value
15.942244 2376
15.942332 2376
15.942332 2376
15.959306 2633
15.960350 2633
15.961223 3068
15.967225 6632
15.978364 10196
15.982280 12572
15.994296 16136
15.994379 18624
16.042336 18624
16.060262 18659
16.065397 21250
16.069239 24814
16.073407 28378
16.077236 31942")
#> Rows: 17 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: " "
#> dbl (2): timestamp, value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Comparing time differences for matching values
Let's go through the solution step-by-step:
Add id for each row of df_1
We'll need this later to remove unwanted values.
require(dplyr)
#> Loading required package: dplyr
require(forcats) # provides fct_inorder()
#> Loading required package: forcats
df_1 <- mutate(df_1, id = paste0("id_", row_number()) |>
  ## For the sake of neatness, we'll make id an ordered factor
  ## whose levels follow its current arrangement
  fct_inorder(ordered = TRUE))
df_1 <- relocate(df_1, id)
head(df_1)
#> # A tibble: 6 × 3
#> id timestamp value
#> <ord> <dbl> <dbl>
#> 1 id_1 15.9 2633
#> 2 id_2 15.9 2664
#> 3 id_3 15.9 2699
#> 4 id_4 15.9 3068
#> 5 id_5 15.9 4256
#> 6 id_6 15.9 5444
Join rows from df_2 on matching values
joined <- left_join(df_1, df_2, by = "value", suffix = c(".1", ".2"))
head(joined)
#> # A tibble: 6 × 4
#> id timestamp.1 value timestamp.2
#> <ord> <dbl> <dbl> <dbl>
#> 1 id_1 15.9 2633 16.0
#> 2 id_1 15.9 2633 16.0
#> 3 id_2 15.9 2664 NA
#> 4 id_3 15.9 2699 NA
#> 5 id_4 15.9 3068 16.0
#> 6 id_5 15.9 4256 NA
Get the first returned value for each value in df_1
We can do this by grouping by our id column, then just getting the first() row from each group.
joined <- group_by(joined, id) # group by row identifiers
summary <- summarise(joined, across(everything(), first))
head(summary)
#> # A tibble: 6 × 4
#> id timestamp.1 value timestamp.2
#> <ord> <dbl> <dbl> <dbl>
#> 1 id_1 15.9 2633 16.0
#> 2 id_2 15.9 2664 NA
#> 3 id_3 15.9 2699 NA
#> 4 id_4 15.9 3068 16.0
#> 5 id_5 15.9 4256 NA
#> 6 id_6 15.9 5444 NA
Get time difference
A simple case of using mutate() to subtract timestamp.1 from timestamp.2:
times <- mutate(summary, time_diff = timestamp.2 - timestamp.1) |>
relocate(value, .after = id) # this is just for presentation
## You may want to remove rows with no time diff?
filter(times, !is.na(time_diff))
#> # A tibble: 8 × 5
#> id value timestamp.1 timestamp.2 time_diff
#> <ord> <dbl> <dbl> <dbl> <dbl>
#> 1 id_1 2633 15.9 16.0 0.0167
#> 2 id_4 3068 15.9 16.0 0.0178
#> 3 id_7 6632 15.9 16.0 0.0238
#> 4 id_10 10196 15.9 16.0 0.0329
#> 5 id_12 12572 16.0 16.0 0.0219
#> 6 id_15 16136 16.0 16.0 0.0339
#> 7 id_18 18624 16.0 16.0 0.0322
#> 8 id_19 18659 16.0 16.1 0.0175
Created on 2022-10-25 with reprex v2.0.2
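If you'd rather stay in base R, the "first match only" requirement is exactly what match() does, which removes both loops in one vectorised step. A sketch using a small subset of the data above, with columns named V1/V2 as in the original question:

```r
df_1 <- data.frame(V1 = c(15.942627, 15.942630, 15.943421, 15.943425),
                   V2 = c(2633, 2664, 3068, 6632))
df_2 <- data.frame(V1 = c(15.959306, 15.960350, 15.961223, 15.967225),
                   V2 = c(2633, 2633, 3068, 6632))

idx       <- match(df_1$V2, df_2$V2)   # index of the FIRST match in df_2 (NA if none)
time_diff <- df_2$V1[idx] - df_1$V1    # NA where a value has no match
round(time_diff, 4)
#> [1] 0.0167     NA 0.0178 0.0238
```

These differences agree with the time_diff column in the dplyr solution above.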
I can download in the browser a file from this website
https://www.cmegroup.com/ftp/pub/settle/comex_future.csv
However when I try the following
url <- "https://www.cmegroup.com/ftp/pub/settle/comex_future.csv"
dest <- "C:\\COMEXfut.csv"
download.file(url, dest)
I get the following error message
Error in download.file(url, dest) :
cannot open URL 'https://www.cmegroup.com/ftp/pub/settle/comex_future.csv'
In addition: Warning message:
In download.file(url, dest) :
InternetOpenUrl failed: 'The operation timed out'
even if I choose:
options(timeout = max(600, getOption("timeout")))
Any idea why this is happening? Thanks!
The problem here is that the site you are downloading from needs a couple of additional request headers. The easiest way to supply them is with the httr package:
library(httr)
url <- "https://www.cmegroup.com/ftp/pub/settle/comex_future.csv"
UA <- paste('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0)',
'Gecko/20100101 Firefox/98.0')
res <- GET(url, add_headers(`User-Agent` = UA, Connection = 'keep-alive'))
This should download in less than a second.
If you want to save the file you can do
writeBin(res$content, 'myfile.csv')
Or if you just want to read the data straight into R without even saving it, you can do:
content(res)
#> Rows: 527 Columns: 20
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (10): PRODUCT SYMBOL, CONTRACT MONTH, CONTRACT DAY, CONTRACT, PRODUCT DESCRIPTIO...
#> dbl (10): CONTRACT YEAR, OPEN, HIGH, LOW, LAST, SETTLE, EST. VOL, PRIOR SETTLE, PRIO...
#>
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 527 x 20
#> `PRODUCT SYMBOL` `CONTRACT MONTH` `CONTRACT YEAR` `CONTRACT DAY` CONTRACT
#> <chr> <chr> <dbl> <chr> <chr>
#> 1 0GC 07 2022 NA 0GCN22
#> 2 4GC 03 2022 NA 4GCH22
#> 3 4GC 05 2022 NA 4GCK22
#> 4 4GC 06 2022 NA 4GCM22
#> 5 4GC 08 2022 NA 4GCQ22
#> 6 4GC 10 2022 NA 4GCV22
#> 7 4GC 12 2022 NA 4GCZ22
#> 8 4GC 02 2023 NA 4GCG23
#> 9 4GC 04 2023 NA 4GCJ23
#> 10 4GC 06 2023 NA 4GCM23
#> # ... with 517 more rows, and 15 more variables: PRODUCT DESCRIPTION <chr>, OPEN <dbl>,
#> # HIGH <dbl>, HIGH AB INDICATOR <chr>, LOW <dbl>, LOW AB INDICATOR <chr>, LAST <dbl>,
#> # LAST AB INDICATOR <chr>, SETTLE <dbl>, PT CHG <chr>, EST. VOL <dbl>,
#> # PRIOR SETTLE <dbl>, PRIOR VOL <dbl>, PRIOR INT <dbl>, TRADEDATE <chr>
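If you'd rather avoid the httr dependency, note that since R 4.0 download.file() itself accepts a headers argument, so the same User-Agent trick should work in base R. This is a sketch and untested against this particular site; the download call is commented out because it needs network access:

```r
url  <- "https://www.cmegroup.com/ftp/pub/settle/comex_future.csv"
dest <- "comex_future.csv"
hdrs <- c(`User-Agent` = paste("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0)",
                               "Gecko/20100101 Firefox/98.0"),
          Connection   = "keep-alive")
# download.file(url, dest, headers = hdrs, mode = "wb")  # network call
```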
This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 11 months ago.
My data frame has different dates as rows. Every unique date occurs approximately 500 times. I want to make a new data frame where every column is a unique date and the rows are all the observations of that date from my old data set. So for every column that represents a certain date, I should have approximately 500 rows that each contain a rel_spread from that day.
You can use pivot_wider from tidyr:
library(tidyr)
pivot_wider(df, names_from = date, values_from = rel_spread, values_fn = list) %>%
unnest(everything())
#> # A tibble: 2 x 17
#> `20000103` `20000104` `20000105` `20000106` `20000107` `20000108` `20000109`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.0234 -0.0128 0.00729 0.0408 -0.0298 0.0398 0.0445
#> 2 0.0492 -0.0120 0.0277 0.0435 -0.0288 0.0152 -0.0374
#> # ... with 10 more variables: `20000110` <dbl>, `20000111` <dbl>,
#> # `20000112` <dbl>, `20000113` <dbl>, `20000114` <dbl>, `20000115` <dbl>,
#> # `20000116` <dbl>, `20000117` <dbl>, `20000118` <dbl>, `20000119` <dbl>
Note that we don't have your data (and I wasn't about to transcribe a picture of your data), but I created a little reproducible data set which should match the structure of your data set, except it only has two values per date for demo purposes:
set.seed(1)
df <- data.frame(date = rep(as.character(20000103:20000119), 2),
rel_spread = runif(34, -0.05, 0.05))
df
#> date rel_spread
#> 1 20000103 -0.0234491337
#> 2 20000104 -0.0127876100
#> 3 20000105 0.0072853363
#> 4 20000106 0.0408207790
#> 5 20000107 -0.0298318069
#> 6 20000108 0.0398389685
#> 7 20000109 0.0444675269
#> 8 20000110 0.0160797792
#> 9 20000111 0.0129114044
#> 10 20000112 -0.0438213730
#> 11 20000113 -0.0294025425
#> 12 20000114 -0.0323443247
#> 13 20000115 0.0187022847
#> 14 20000116 -0.0115896282
#> 15 20000117 0.0269841420
#> 16 20000118 -0.0002300758
#> 17 20000119 0.0217618508
#> 18 20000103 0.0491906095
#> 19 20000104 -0.0119964821
#> 20 20000105 0.0277445221
#> 21 20000106 0.0434705231
#> 22 20000107 -0.0287857479
#> 23 20000108 0.0151673766
#> 24 20000109 -0.0374444904
#> 25 20000110 -0.0232779331
#> 26 20000111 -0.0113885907
#> 27 20000112 -0.0486609667
#> 28 20000113 -0.0117612043
#> 29 20000114 0.0369690846
#> 30 20000115 -0.0159651003
#> 31 20000116 -0.0017919885
#> 32 20000117 0.0099565825
#> 33 20000118 -0.0006458693
#> 34 20000119 -0.0313782399
Allan’s answer is perfect if you have the same number of rows for each date. If this isn’t the case, the following should work:
library(tidyr)
library(dplyr)
data_wide <- data_long %>%
group_by(date) %>%
mutate(daterow = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = date, values_from = rel_spread) %>%
select(!daterow)
data_wide
Output:
# A tibble: 6 x 4
`20000103` `20000104` `20000105` `20000106`
<dbl> <dbl> <dbl> <dbl>
1 -0.626 0.184 -0.836 -0.621
2 1.60 0.330 -0.820 -2.21
3 0.487 0.738 0.576 1.12
4 -0.305 1.51 0.390 -0.0449
5 NA NA NA -0.0162
6 NA NA NA 0.944
Example data:
set.seed(1)
data_long <- data.frame(
date = c(rep(20000103:20000105, 4), rep(20000106, 6)),
rel_spread = rnorm(18)
)
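For completeness: when every date does have the same number of rows, base R's unstack() performs the same reshape in one call (a sketch reusing the example data from the first answer; column names gain an X prefix because they start with digits):

```r
set.seed(1)
df <- data.frame(date = rep(as.character(20000103:20000119), 2),
                 rel_spread = runif(34, -0.05, 0.05))

# Split rel_spread by date and bind the equal-length groups as columns
wide <- unstack(df, rel_spread ~ date)
dim(wide)
#> [1]  2 17
```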
I'm trying to read in a small (17kb), simple csv file from EdX.org (for an online course), and I've never had this trouble with readr::read_csv() before. Base-R read.csv() reads the file without generating the problem.
A small (17kb) csv file from EdX.org
library(tidyverse)
df <- read_csv("https://courses.edx.org/assets/courseware/v1/ccdc87b80d92a9c24de2f04daec5bb58/asset-v1:MITx+15.071x+1T2020+type#asset+block/WHO.csv")
head(df)
Gives this output
#> # A tibble: 6 x 13
#> Country Region Population Under15 Over60 FertilityRate LifeExpectancy
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 Afghan… Easte… 29825 47.4 3.82 "\r5.4\r" 60
#> 2 Albania Europe 3162 21.3 14.9 "\r1.75\r" 74
#> 3 Algeria Africa 38482 27.4 7.17 "\r2.83\r" 73
#> 4 Andorra Europe 78 15.2 22.9 <NA> 82
#> 5 Angola Africa 20821 47.6 3.84 "\r6.1\r" 51
#> 6 Antigu… Ameri… 89 26.0 12.4 "\r2.12\r" 75
#> # … with 6 more variables: ChildMortality <dbl>, CellularSubscribers <dbl>,
#> # LiteracyRate <chr>, GNI <chr>, PrimarySchoolEnrollmentMale <chr>,
#> # PrimarySchoolEnrollmentFemale <chr>
You'll notice that the column FertilityRate has "\r" added to the values. I've downloaded the csv file and cannot find them there.
Base-R read.csv() reads in the file with no problems, so I'm wondering what the problem is with my usage of the tidyverse read_csv().
head(df$FertilityRate)
#> [1] "\r5.4\r" "\r1.75\r" "\r2.83\r" NA "\r6.1\r" "\r2.12\r"
How can I fix my usage of read_csv() so that the "\r" strings are not there?
If possible, I'd prefer not to have to individually specify the type of every single column.
In a nutshell, the characters are inside the file (probably by accident) and read_csv is right to not remove them automatically: since they occur within quotes, this by convention means that a CSV parser should treat the field as-is, and not strip out whitespace characters. read.csv is wrong to do so, and this is arguably a bug.
You can strip them out yourself once you’ve loaded the data:
df = mutate_if(df, is.character, ~ stringr::str_remove_all(.x, '\r'))
This seems to be good enough for this file, but in general I'd be wary that the file might be damaged in other ways, since the presence of these characters is clearly not intentional, and the file follows no common line-ending convention (it's neither a conventional Windows nor a Unix file).
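If you want the cleanup to happen at read time rather than as a separate mutate step, one option (a sketch, assuming readr ≥ 2.0 for the I() literal-data interface) is to strip the carriage returns from the raw text before handing it to the parser; that way the affected columns are also type-guessed as numeric:

```r
library(readr)

# Minimal stand-in for the downloaded file: quoted fields polluted with \r
txt <- "FertilityRate,Country\n\"\r5.4\r\",Afghanistan\n\"\r1.75\r\",Albania\n"

# Remove the stray \r characters, then parse the cleaned text
clean <- read_csv(I(gsub("\r", "", txt, fixed = TRUE)), show_col_types = FALSE)
clean$FertilityRate
#> [1] 5.40 1.75
```

For the real file you would first fetch the text (e.g. with readLines() or httr) and apply the same gsub() before parsing.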