I want to import the dataset from the following URL directly into R to work with the data:
http://www.football-data.co.uk/mmz4281/2223/B1.csv
Previously I have used read_csv() or fread() like this:
data <- data.table::fread("http://www.football-data.co.uk/mmz4281/2223/B1.csv")
data <- readr::read_csv("http://www.football-data.co.uk/mmz4281/2223/B1.csv")
This used to work: the data came in as a data.frame and looked like the original file. Now, however, the output appears to be HTML:
e.g. if using read_csv()
head(data)
# A tibble: 6 x 1
`<HTML>`
<chr>
1 "<HEAD>"
2 "<TITLE>Football Betting | Football Results | Free Bets | Betting Odds</TITLE>"
3 "<meta name=\"twitter:card\" content=\"summary\" />"
4 "<meta name=\"twitter:site\" content=\"#12Xpert\" />"
5 "<meta name=\"twitter:title\" content=\"Football-Data.co.uk\" />"
Is there a way to import the CSV directly from such a downloadable link, without having to download the file onto your computer and then load it?
Good morning Jalapic, I think you are almost there.
Short answer: check your download link. You will have success with https:// (and not http://).
bets2223 <- readr::read_csv("https://www.football-data.co.uk/mmz4281/2223/B1.csv")
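If you want to see what the server actually returns for the two schemes, you can look at the response headers. This is just a sketch using the httr package (an addition, not part of the original answer):
library(httr)
# compare the Content-Type the server reports for the http:// and https:// links
http_resp  <- GET("http://www.football-data.co.uk/mmz4281/2223/B1.csv")
https_resp <- GET("https://www.football-data.co.uk/mmz4281/2223/B1.csv")
headers(http_resp)[["content-type"]]   # the HTML landing page seen in the question
headers(https_resp)[["content-type"]]  # the actual CSV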
Longer answer: you can always work out your links using tools that the R ecosystem offers.
library(rvest) # package to download web-content
library(dplyr) # tidyverse data wrangling
library(readr) # tidyverse read package
# check the page given by Jalapic
# and extract all `href` links
page <- read_html("https://www.football-data.co.uk/belgiumm.php")
links <- page %>% html_nodes("a") %>% html_attr("href")
# the list can be reduced to our "payload" mmz... files
links <- links[grepl(pattern = "^mmz4281", x = links)]
base_url <- "https://www.football-data.co.uk/"
# construct a vector of all links that fit our search pattern
download_urls <- paste0(base_url, links)
download_urls[1:4] # to shorten the output - show only first 4 results
This yields
[1] "https://www.football-data.co.uk/mmz4281/2223/B1.csv" "https://www.football-data.co.uk/mmz4281/2122/B1.csv"
[3] "https://www.football-data.co.uk/mmz4281/2021/B1.csv" "https://www.football-data.co.uk/mmz4281/1920/B1.csv"
Ok. We now spot the right file name (or write a loop to download all files of interest).
For your case we pick the first one, i.e. download_urls[1].
bets2223 <- read_csv(download_urls[1])
glimpse(bets2223)
This gets us what we are looking for (note: truncated for presentation purposes):
Rows: 134
Columns: 105
$ Div <chr> "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B…
$ Date <chr> "22/07/2022", "23/07/2022", "23/07/2022", "23/07/2022", "23/07/2022", "24/07/2022", "24/07/2022", "24/07/2022", "24/07/202…
$ Time <time> 19:45:00, 15:00:00, 17:15:00, 17:15:00, 19:45:00, 12:30:00, 15:00:00, 17:30:00, 20:00:00, 19:45:00, 15:00:00, 17:15:00, 1…
$ HomeTeam <chr> "Standard", "Charleroi", "Kortrijk", ...
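As mentioned above, you could also loop over all files of interest instead of picking just one. A minimal sketch, assuming the download_urls vector built earlier:
# read every season's CSV into a named list of data frames
all_seasons <- lapply(download_urls, readr::read_csv)
# name each element after the season folder in the URL, e.g. "2223", "2122", ...
names(all_seasons) <- basename(dirname(download_urls))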
Related: there are many other questions which read similar to this, but none address my query (in a way I understand).
I have multiple dataframes ('snps', 'snp2', 'snp3', 'snp4', 'snp5') which all have the same format:
library(tidyr)
library(TwoSampleMR)
glimpse(snps)
Rows: 4,873
Columns: 4
$ chrpos <chr> "6:39016574-39055519", "6:39016574-39055519", "6:39016574-39055519"
$ target <chr> "GL", "GL", "GL"
$ query <chr> "6:39016574-39055519", "6:39016574-39055519", "6:39016574-39055519"
$ name <chr> "rs113920163", "rs183723208", "rs555466268"
The only thing that differs is the 'target' column.
I want to run a function from the TwoSampleMR package
glc <- extract_outcome_data(snps = snps$name, outcomes = 'ukb-a-583')
But I want to do this for the other dataframes, in a loop.
I made a list, called targets, of all the names in each of the 'snps..' dataframes:
glimpse(targets)
Rows: 11
Columns: 1
$ target <chr> "GL", "ML", "HL", "TD", "ED"
And I have been trying to loop through it:
list <- targets$target
l <- list()
for(i in list)
{
  l[[i]] <- extract_outcome_data(snps = targets$target, outcomes = 'ukb-a-583')
}
This runs, but the object it makes, 'l', is empty, and there is no error message, so I don't know what to change.
Thanks!
*** EDIT ***
I am now trying
my_files <- list.files(pattern = "*_files.txt")
my_data <- lapply(my_files, read.table)
for (i in 1:length(my_data)){
  dat <- extract_outcome_data(snps = my_data$[i]$targets, outcomes = 'ukb-a-583')
}
It runs fine up to the for loop, and the for loop does run, but it does not extract the information that I need. I think it's due to the 'snps = my_data$[i]$targets' bit. How do I access the column 'targets' in each of the dataframes in the my_data list?
Thanks!
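For what it's worth, a common way to reach a column inside every data frame of a list is to loop over the list elements themselves (with lapply or a for loop) rather than indexing with $. A minimal sketch, assuming my_data and a column named targets as in the edit above:
# apply extract_outcome_data() to the 'targets' column of each data frame in my_data;
# the results are collected in a list with one element per input file
results <- lapply(my_data, function(df) {
  extract_outcome_data(snps = df$targets, outcomes = 'ukb-a-583')
})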
My query returns a bunch of columns that I can't manually rename (using project-rename) one by one. Also, the input query is fixed: I can't change it, and it may return a different number of columns each time I run it, so I can't have a fixed project-rename statement. For example, let's say that for one run of the input query the result has the following columns:
fixedstring_region
fixedstring_state
fixedstring_level
fixedstring_reach
fixedstring_mode
fixedstring_something
fixedstring_otherthing
... etc
There can be hundreds of these. I want to remove 'fixedstring_' from all of them. Is there some wildcard technique for this?
There's no built-in way to do it.
The best I can think of is a rather inefficient way that also changes the order of the columns:
datatable(fixedstring_region:string, fixedstring_state:string, fixedstring_level:string, fixedstring_reach:string)
[
"a1", "b1", "c1", "d1",
"a2", "b2", "c2", "d2"
]
| project PackedRecord = todynamic(replace('"fixedstring_([a-zA-Z0-9_]*)":"', @'"\1":"', tostring(pack_all())))
| evaluate bag_unpack(PackedRecord)
Output:
level   reach   region   state
c1      d1      a1       b1
c2      d2      a2       b2
I am new to R so thank you in advance for your patience.
I would like to create a multiple-choice quiz in R using the learnr package (the quiz content is not about R code). I have all of the questions, response options, and correct answers in a spreadsheet. Since my item bank has over 100 items, I will give a simpler example:
Stem <- c("stem1", "stem2", "stem3")
OptionA <- c("a1", "a2", "a3")
OptionB <- c("b1", "b2", "b3")
OptionC <- c("c1", "c2", "c3")
Correct <- c("c1", "b2", "a3")
items <- cbind(Stem, OptionA, OptionB, OptionC, Correct)
Currently, the only way I know how to pull in the data from the spreadsheet is like this:
learnr::question(items$Stem[1],
  answer(items$OptionA[1]),
  answer(items$OptionB[1]),
  answer(items$OptionC[1], correct = TRUE)
)
However, this still requires me to write that chunk of code for each item and manually assign the correct answers. Does anyone know an easier way of doing this, either with learnr or another package?
You can simply loop over the rows of your data or spreadsheet and use a function to set up the questions and save them in a list. My approach uses purrr::map, but a simple for-loop would also do the trick. Try this:
---
title: "Tutorial"
output: learnr::tutorial
runtime: shiny_prerendered
---
```{r setup, include=FALSE}
library(learnr)
library(dplyr)
library(purrr)
knitr::opts_chunk$set(echo = FALSE)
```
```{r}
Stem <- c("stem1", "stem2", "stem3")
OptionA <- c("a1", "a2", "a3")
OptionB <- c("b1", "b2", "b3")
OptionC <- c("c1", "c2", "c3")
Correct <- c("c1", "b2", "a3")
items <- data.frame(Stem, OptionA, OptionB, OptionC, Correct)
```
## Topic 1
### Quiz
```{r quiz}
make_q <- function(x) {
  # x is a one-row data frame; an option is marked correct when it matches x$Correct
  question(x$Stem,
           answer(x$OptionA, correct = x$Correct == x$OptionA),
           answer(x$OptionB, correct = x$Correct == x$OptionB),
           answer(x$OptionC, correct = x$Correct == x$OptionC))
}
questions <- items %>%
split(.$Stem) %>%
purrr::map(make_q)
```
```{r}
quiz(
questions[[1]],
questions[[2]],
questions[[3]])
```
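With more than three items you would not want to list questions[[1]], questions[[2]], ... by hand. As a sketch (using the same questions list built above), do.call() can splice the whole list into quiz(), however many items the spreadsheet contains:
```{r}
# pass every generated question to quiz(); unname() avoids passing the stems as argument names
do.call(quiz, unname(questions))
```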
I'm trying to use R to analyse some downloaded Facebook messages. Some of the messages have the apostrophe replaced with "â<U+0080><U+0099>", and I'm trying to replace this using str_replace_all().
As an example, take the following data.table
library(data.table)
library(stringr)

names <- c("Me", "Me", "You", "You", "Me", "You")
content <- c("Iâ<U+0080><U+0099>ve got my party on the 5th", "Hello", "Bears", "Four times four", "what do you want to eat?", "get some music")
date <- c("1/1/2001", "2/1/2001", "3/1/2001", "4/1/2001", "5/1/2001", "6/1/2001")
fbmessagesexample <- data.table(names, date, content)
I then try to use str_replace_all
fbmessagesexample[, content := str_replace_all(content, pattern = fixed("â<U\\+0080><U\\+0099>"), replacement=fixed("'"))]
The first row in content is not being replaced. Is there something I'm doing wrong?
Pass the pattern as a plain character vector instead of wrapping it in fixed(): inside fixed() the backslash in "\\+" is treated as a literal character to match, so the pattern never matches, whereas as a regular expression "\\+" matches a literal "+". The following code snippet results in the console output shown below.
library(data.table)
library(tidyverse)
names <- c("Me", "Me", "You", "You", "Me", "You")
content <- c("Iâ<U+0080><U+0099>ve got my party on the 5th", "Hello", "Bears", "Four times four", "what do you want to eat?", "get some music")
date <- c("1/1/2001", "2/1/2001", "3/1/2001", "4/1/2001", "5/1/2001", "6/1/2001")
fbmessagesexample <- data.table(names, date, content)
pattern <- c("â<U\\+0080><U\\+0099>")
fbmessagesexample[, content := str_replace_all(content, pattern, replacement=fixed("'"))]
Console:
> fbmessagesexample
names date content
1: Me 1/1/2001 I've got my party on the 5th
2: Me 2/1/2001 Hello
3: You 3/1/2001 Bears
4: You 4/1/2001 Four times four
5: Me 5/1/2001 what do you want to eat?
6: You 6/1/2001 get some music
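To see why the original fixed() pattern never matched, here is a small check on a single string (same literal text as in the example data):
library(stringr)
x <- "Iâ<U+0080><U+0099>ve got my party on the 5th"
# wrapped in fixed(), the backslash in "\\+" must appear literally in the text, so there is no match:
str_detect(x, fixed("â<U\\+0080><U\\+0099>"))  # FALSE
# as a regular expression, "\\+" matches a literal "+", so the pattern is found:
str_detect(x, "â<U\\+0080><U\\+0099>")         # TRUE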
The webpage contains a kind of table in which a single cell can hold more than one element. I can crawl the content of the table with the following code, but I cannot bind these elements together the way they are structured on the webpage. Is there a method to combine these elements properly, or should I use another approach to get each element?
library(XML)
dataissued <- "http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued"
ec_parsed <- htmlTreeParse(dataissued, encoding = "UTF-8", useInternalNodes = TRUE)
# gather content in the table and build the data frame
# title and introduction link of IR resource
item_title <- xpathSApply(ec_parsed, '//td[@headers="t1"]//a', xmlValue)
item_hrefs <- xpathSApply(ec_parsed, '//td[@headers="t1"]//a/@href')
# author and introduction link of IR resource
auth_name <- xpathSApply(ec_parsed, '//td[@headers="t2"]//a', xmlValue)
auth_hrefs <- xpathSApply(ec_parsed, '//td[@headers="t2"]//@href')
# publication date of IR resource
pub_date <- xpathSApply(ec_parsed, '//td[@headers="t3"]', xmlValue)
# full-text link of IR resource
con_link <- xpathSApply(ec_parsed, '//td[@headers="t3"]//a[@href]', xmlValue)
item_table <- cbind(item_title, item_hrefs, auth_name, auth_hrefs, pub_date, con_link)
colnames(item_table) <- c("t1", "href1", "t2", "href2", "t3", "href4")
I have tried many times but still cannot organise them as they should be: one paper may have several authors, and all the authors and their links should be saved in one "row", but now each author ends up in its own row and the paper title is repeated. That makes the result messed up.
This is one way to make a long data frame from that table:
library(rvest)
library(purrr)
library(tibble)
pg <- read_html("http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued")
# extract the columns
col1 <- html_nodes(pg, "td[headers='t1']")
col2 <- html_nodes(pg, "td[headers='t2']")
col3 <- html_nodes(pg, "td[headers='t3']")
# this is the way to get the full text column
col4 <- html_nodes(pg, "td[headers='t3'] + td")
# now, iterate over the rows; map_df() will bind all our data.frame's together
map_df(1:length(col1), function(i) {
# extract the links
a1 <- xml_nodes(col1[i], "a")
a2 <- xml_nodes(col2[i], "a")
a4 <- xml_nodes(col4[i], "a")
# put the row into a long data.frame for the row
data_frame( title = html_text(a1, trim=TRUE),
title_link = html_attr(a1, "href"),
author = html_text(a2, trim=TRUE),
author_link = html_attr(a2, "href"),
issue_date = html_text(col3[i], trim=TRUE),
full_text = html_attr(a4, "href"))
})
The biggest problem when using the "rvest" package is garbled text. Even though the "encoding" parameter is used in the program, the result still contains garbled characters, although the web page encoding is UTF-8. For example:
library(rvest)
pg <- read_html("http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued", encoding = "UTF-8")
In my tests, "XML" performs best: when I use the getNodeSet function the result is correct, with no garbled text at all. However, I only get the whole node and cannot gather each row of the table with its structure.
library(XML)
pg <- "http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued"
pg_tables <- getNodeSet(htmlParse(pg), "//table[@summary='This table browse all dspace content']")
# gather the node of the whole table
papernode <- getNodeSet(pg_tables[[1]], "//td[@headers='t1']")
paper_hrefs <- xpathSApply(papernode[[1]], '//a/@href')
paper_name <- xpathSApply(papernode[[1]], '//a', xmlValue)
# gather the authors in the table
authnode <- getNodeSet(pg_tables[[1]], "//td[@headers='t2']")
# gather the dates in the table
datenode <- getNodeSet(pg_tables[[1]], "//td[@headers='t3']")
With this program I can get these "nodes" separately. However, crawling the headers and their links seems harder, because the result class of "getNodeSet" is not the same as that of "html_nodes". How can the node sets returned by "getNodeSet" be processed automatically to extract the titles and their links from each node in an exact way?
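One detail that may help: in the XML package an XPath beginning with "//" always searches from the document root, even when it is applied to a single node, whereas prefixing it with "." restricts the search to that node. A minimal sketch (assuming the papernode and authnode objects from above, and that each row has matching t1 and t2 cells) that keeps authors grouped with their own paper:
# for each row, search only inside its own cells (note the leading "."),
# so multiple authors stay attached to the paper they belong to
rows <- lapply(seq_along(papernode), function(i) {
  data.frame(
    title       = xpathSApply(papernode[[i]], ".//a", xmlValue),
    title_href  = as.character(xpathSApply(papernode[[i]], ".//a/@href")),
    author      = paste(xpathSApply(authnode[[i]], ".//a", xmlValue), collapse = "; "),
    author_href = paste(xpathSApply(authnode[[i]], ".//a/@href"), collapse = "; "),
    stringsAsFactors = FALSE
  )
})
item_table <- do.call(rbind, rows)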