I am trying to extract some data from an html site. I got 500 nodes which should conatain a date, a title and a summary. By using
url <- "https://www.bild.de/suche.bild.html?type=article&query=Migration&resultsPerPage=1000"
html_raw <- xml2::read_html(url)
main_node <- xml_find_all(html_raw, "//section[#class='query']/ol") %>%
xml_children()
xml_find_all(main_node, ".//time") #time
xml_find_all(main_node, ".//span[#class='headline']") #title
xml_find_all(main_node, ".//p[#class='entry-content']") #summary
it returns three vectors with dates, titles and summaries, which can then be knitted together. At least in theory. Unfortunately, my code finds 500 dates and 500 titles but only 499 summaries; the reason is that one of the summary nodes is simply missing.
This leaves me with the problem that I cannot bind these into a data frame because of the difference in length: the summaries would no longer line up with the correct dates and titles.
An easy solution would be to loop through the nodes and replace the missing node with a placeholder such as NA.
dates <- c()
titles <- c()
summaries <- c()
for (i in 1:length(main_node)) {
  date_temp <- xml_find_all(main_node[i], ".//time") %>%
    xml_text(trim = TRUE) %>%
    as.Date(format = "%d.%m.%Y")
  title_temp <- xml_find_all(main_node[i], ".//span[@class='headline']") %>%
    xml_text(trim = TRUE)
  summary_temp <- xml_find_all(main_node[i], ".//p[@class='entry-content']") %>%
    xml_text(trim = TRUE)
  if (length(summary_temp) == 0) summary_temp <- NA
  dates <- c(dates, date_temp)
  titles <- c(titles, title_temp)
  summaries <- c(summaries, summary_temp)
}
But this makes a simple three-line piece of code unnecessarily long. So my question, I guess, is: is there a more sophisticated approach than a loop?
You could use the purrr library to avoid the explicit loop:
library(purrr)
dates <- main_node %>% map_chr(. %>% xml_find_first(".//time") %>% xml_text())
titles <- main_node %>% map_chr(. %>% xml_find_first(".//span[@class='headline']") %>% xml_text())
summaries <- main_node %>% map_chr(. %>% xml_find_first(".//p[@class='entry-content']") %>% xml_text())
This uses the fact that xml_find_first() returns NA if an element is not found, as pointed out by @Dave2e.
But also, in general, growing a vector by appending to it on each iteration of a loop is very inefficient in R. It's better to pre-allocate the vector (since it will be of a known length) and then assign a value to the proper slot on each iteration (out[i] <- val). There's not really anything wrong with loops themselves in R; it's the repeated memory reallocation that slows things down.
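For illustration, here is a minimal sketch of that pre-allocation pattern, reusing main_node from the question (only the summaries are shown; the same idea applies to the dates and titles):
n <- length(main_node)
summaries <- character(n)                      # pre-allocate to the known length
for (i in seq_len(n)) {
  node <- xml_find_first(main_node[[i]], ".//p[@class='entry-content']")
  summaries[i] <- xml_text(node, trim = TRUE)  # a missing node yields NA automatically
}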
Hello, I have created a for loop to split my data by a certain variable, like so:
for(i in 1:(nrow(df)))
{
team_[[i]] <- df %>% filter(team == i)
}
R doesn't like this, as it says team_ not found. The code does run if I initialise a list first, as such:
team_ <- list()
for(i in 1:(nrow(df)))
{
team_[[i]] <- df %>% filter(team == i)
}
This works... However, I am given a list with thousands of empty items and just a few that contain my filtered data sets.
Is there a simpler way to create the data sets without this list approach?
Thank you.
A simpler option is split from base R, which would be faster than using == to subset in a loop:
team_ <- split(df, df$team)
If we want to do some operations for each row, in the tidyverse it can be done with rowwise
library(dplyr)
df %>%
rowwise %>%
... step of operations ...
or with group_by
df %>%
group_by(team) %>%
...
The methods akrun suggests are much better than a loop, but you should understand why this isn't working. Remember that for(i in 1:nrow(df)) will give you one list item for i = 1, i = 2, etc., right up until i = nrow(df), which is several thousand by the sounds of things. If you don't have any rows where team is 1, you will get an empty data frame as the first item, and the same will be true for every other value of i that isn't represented.
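A tiny illustration with made-up team IDs (assumed data, not taken from the question):
library(dplyr)
df <- data.frame(team = c(5, 5, 9), x = 1:3)
team_ <- list()
for (i in 1:nrow(df)) team_[[i]] <- df %>% filter(team == i)
sapply(team_, nrow)  # 0 0 0 -- i runs over row numbers, not team IDs, so every slot is empty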
A loop like this would work:
for(i in unique(df$team)) team_[[i]] <- df %>% filter(team == i)
But I would stick to a non-looping method as described by akrun.
I'm pretty new to R, so please bear with me.
I read in an Excel spreadsheet containing 31 years of MLB prospect data on separate sheets using the following code:
path = "../Documents/BA Prospects 1990-2021.xlsx"
prospect_data <- excel_sheets(path = path) %>%
map(~ data.frame(read_excel(path, sheet = .)))
I then wrote a function to clean up the data a bit, and applied it to all 31 elements:
pull_df <- function(data, n = 1, year = 1990) {
prospect_data[[n]] %>%
data.frame() %>%
filter_all(any_vars(!is.na(.))) %>%
mutate(year = year) %>%
select(year, everything())
}
prospect_data <- lapply(prospect_data[1:31], pull_df)
What I want is to take the 31 dataframes and save each one globally. However, those dataframes are each nested within a separate list. Those 31 lists are nested within the list prospect_data. No matter what I try, with for loops and all, I can't extract those dataframes from the prospect_data list to be able to manipulate them further. Honestly, I'd settle for one big data frame with 3100 rows and 17 columns at this point. I just want to get my data into a dataframe.
I know I did a poor job of explaining it, but please help!
It is not good practice to create individual dataframes in the global environment. Using lists is a better idea, or, if you prefer, combine them into one big dataframe. Also, length(1990:2021) gives 32 values, so the answer below uses 1990:2020 instead, since you have 31 dataframes in total in prospect_data.
library(dplyr)
library(purrr)
pull_df <- function(data, year) {
data %>%
filter_all(any_vars(!is.na(.))) %>%
mutate(year = year) %>%
select(year, everything())
}
combine_df <- map2_df(prospect_data, 1990:2020, pull_df)
If I understand your problem correctly, prospect_data is a list of one-element lists, where the single element in each component list is a dataframe.
If this is the case, you can "flatten" prospect_data with:
flattened_prospect_data <- unlist(prospect_data, recursive = FALSE)
However, if prospect_data is a list of arbitrary-length lists, where the dataframe of interest is at index j, you can do the following:
flattened_prospect_data <- lapply(prospect_data, function(x) x[[j]])
If each dataframe is nested even deeper, say each one is the kth element of the jth element of the ith element of an element of prospect_data, then you can perform recursive extraction with:
flattened_prospect_data <- lapply(prospect_data, function(x) x[[c(i, j, k)]])
Once you have all the dataframes in a list, you can assign each one to a name based on its year using:
get_year <- function(df) unique(df$year)
create_name <- function(x) paste(x, "Prospects") # create_name(1990) returns "1990 Prospects"
df_names <- sapply(flattened_prospect_data, function(x) create_name(get_year(x)))
Map(function(x, y) assign(x, value = y, envir = .GlobalEnv), df_names, flattened_prospect_data)
I am trying to develop a function that will take data, see if it matches a value in a category (e.g., 'Accident'), and if so, build a new dataframe using the following code.
cat.df <- function(i) {
sdb.i <- sdb %>%
filter(Category == i) %>%
group_by(Year) %>%
summarise(count = n()}
The name of the dataframe should be sdb.i, where i is the name of the category (e.g., 'Accident'). Unfortunately, I cannot get it to work. I'm notoriously bad with functions and would love some help.
It's not entirely clear what you are after so I am making a guess.
First of all, your function cat.df is missing a closing parenthesis, so it would not run.
I think it is good practice to pass all objects as parameters to a function. In my example I use the iris dataset so I pass this explicitly to the function.
You cannot set the name of a data frame from inside a function in the way you describe. I offer two alternatives: if the number of categories is small, you can just create a separately named object for each one; if you have many categories, it is best to combine all result objects into a list.
library(dplyr)
data(iris)
cat.df <- function(data, i) {
data <- data %>%
filter(Species== i) %>%
group_by(Petal.Width) %>%
summarise(count = n())
return(data)
}
result.setosa <- cat.df(iris, "setosa") # unique name
Species <- sort(unique(iris$Species))
results_list <- lapply(Species, function(x) cat.df(iris, x)) # combine all df's into a list
names(results_list) <- Species # name the list elements
You can then get the list elements as e.g. results_list$setosa or results_list[[1]].
I have got a list of dataframes on which I want to perform some data wrangling operations. For every year, I have a new list of data frames:
results_2018 <- list_of_objects %>%
map(~dplyr::top_n(.x, 10, Germany)) %>%
map(~rename(.x, "Answers" = "Answer.Options"))
results_2019 <- list_of_objects_2 %>%
map(~dplyr::top_n(.x, 10, Germany)) %>%
map(~rename(.x, "Answers" = "Data.Points"))
This is my code, where I calculate the top 10 values for each year for one country. Since there are 10 years of history, is there a way to combine these calculations into a single function?
I guess map2 and pmap might do the job, but I can't get my head around how this works.
Can anyone help me? (sorry for not providing reproduceable data, datasets are quite large)
You can try :
library(tidyverse)
list_obj <- list(list_of_objects, list_of_objects_2, ..., ..)
#If there are lot of them use
#list_obj <- mget(ls(pattern = 'list_of_objects\\d+'))
output <- map(list_obj, ~map(.x, function(x)
  x %>% top_n(10, Germany) %>% rename("Answers" = "Answer.Options")))
This would return a list of lists; using map_df for the inner map might be useful if you want a data frame per list element instead:
map(list_obj, ~map_df(.x, function(x)
  x %>% top_n(10, Germany) %>% rename("Answers" = "Answer.Options")))
I have a dataframe with 500 observations for each of 3106 US counties. I would like to merge that dataframe with a SpatialPolygonsDataFrame.
I have tried a few approaches. I have found that if I filter the data by a variable iter_id, I can use sp::merge() to merge the datasets; I presume that I can then rbind them back together. sp::merge() does not allow a right or full join, and the spatial data needs to be in the left position, so a many-to-one merge will not work. The really nasty way I have tried is:
(I am not sure how to represent the dataframe with the variables of interest here)
library(choroplethr)
data(continental_us_states)
us <- tigris::counties(continental_us_states)
gm_y_corr <- tribble(~GEOID,~iter_id,~neat_variable,
01001,1,"value_1",
01003,1,"value_2",
...
01001,2,"value_3",
01003,2,"value_4",
...
01001,500,"value_5",
01003,500,"value_6")
filtered <- gm_y_corr %>%
  filter(iter_id == 1)
us.gm <- sp::merge(us, filtered, by = 'GEOID')
for (j in 2:500) {
  tmp2 <- gm_y_corr %>%
    filter(iter_id == j)
  tmp3 <- sp::merge(us, tmp2, by = 'GEOID')
  us.gm <- rbind(us.gm, tmp3)
}
I know there must be a better way. I have tried group_by, but multiple matches are found, so I must not be understanding group_by.
> geo_dat <- gm_y_corr %>%
+ group_by(iter_id)%>%
+ sp::merge(us, .,by='GEOID')
Error in .local(x, y, ...) : non-unique matches detected
I would like to merge the spatial data with the interesting data.
Here you can use the splitting functionality of base R via split, or the more recent dplyr::group_split. This will separate your data frame according to your splitting variable; you can then lapply or purrr::map a function such as merge over the pieces and use dplyr::bind_rows to collapse the returned list back to a data frame. Since I can't manage to get the us data, I have just written what I suspect would work.
gm_y_corr %>%
group_by(iter_id) %>% # group
group_split() %>% # split
lapply(., function(x){ # apply function(x) merge(us, x, by = "GEOID") to each list element
merge(us, x, by = "GEOID")
}) %>%
bind_rows() # collapse to data frame
Equivalently, this is the same as using base R functionality; the newer group_by %>% group_split is a little more intuitive in my opinion.
gm_y_corr %>%
split(.$iter_id) %>%
lapply(., function(x){
merge(us, x, by = "GEOID")
}) %>%
bind_rows()
If you wanted to just use group_by, you would have to follow up with the dplyr::do function, which I believe does a similar thing to what I have just done above, but without you having to split the data yourself.
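For reference, a rough sketch of that group_by + do pattern (do() requires each result to be a data frame, so this joins against the attribute table as.data.frame(us) rather than the spatial object itself; treat it as an illustration, not a drop-in replacement for the spatial merge above):
gm_y_corr %>%
  group_by(iter_id) %>%
  do(merge(as.data.frame(us), ., by = "GEOID")) %>%
  ungroup()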