R loop function for selecting an element in multiple objects

I'm using the Google Trends R package to perform several queries of keywords like so:
trends1 <- gtrends(keyword="compare", gprop=channel,geo="AU", time=time, category=249)
trends2 <- gtrends(keyword="switch", gprop=channel,geo="AU", time=time, category=249)
trends3 <- gtrends(keyword="change", gprop=channel,geo="AU", time=time, category=249)
I'm only interested in the interest over time results, so I single them out:
time_trend1 <- trends1$interest_over_time
time_trend2 <- trends2$interest_over_time
time_trend3 <- trends3$interest_over_time
But I have 60 of these (and many more to add). I want to write a repeat loop (I think):
#select only interest over time
x <- 0
repeat {
time_trend(x+1) <- trends(x+1)$interest_over_time
if (x == 61){break}
}
but I get the error: Error in trends(x + 1) : could not find function "trends"
what am I missing?

You could use lapply to iterate over a list of keywords and extract the requested element like this:
library(gtrendsR)
time <- "today+5-y"
channel <- "web"
keywords <- list("compare", "switch", "change")
trends <- setNames(lapply(keywords, function(x) gtrends(keyword = x,
    gprop = channel, geo = "AU", time = time, category = 249)), keywords)
lapply(trends, `[[`, "interest_over_time")
#> $compare
#> date hits geo time keyword gprop category
#> 1 2015-04-26 25 AU today+5-y compare web 249
#> 2 2015-05-03 26 AU today+5-y compare web 249
#> 3 2015-05-10 41 AU today+5-y compare web 249
#> 4 2015-05-17 29 AU today+5-y compare web 249
#> 5 2015-05-24 32 AU today+5-y compare web 249
# ...
#> 260 2020-04-12 9 AU today+5-y compare web 249
#>
#> $switch
#> date hits geo time keyword gprop category
#> 1 2015-04-26 0 AU today+5-y switch web 249
#> 2 2015-05-03 0 AU today+5-y switch web 249
#> 3 2015-05-10 0 AU today+5-y switch web 249
#> 4 2015-05-17 0 AU today+5-y switch web 249
#> 5 2015-05-24 0 AU today+5-y switch web 249
# ...
#> 260 2020-04-12 0 AU today+5-y switch web 249
#>
#> $change
#> date hits geo time keyword gprop category
#> 1 2015-04-26 45 AU today+5-y change web 249
#> 2 2015-05-03 68 AU today+5-y change web 249
#> 3 2015-05-10 23 AU today+5-y change web 249
#> 4 2015-05-17 52 AU today+5-y change web 249
#> 5 2015-05-24 76 AU today+5-y change web 249
# ...
#> 260 2020-04-12 38 AU today+5-y change web 249
Created on 2020-04-20 by the reprex package (v0.3.0)
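As a side note, the same extraction can also be written with purrr (part of the tidyverse); this is just an equivalent sketch of the lapply call above:
library(purrr)
# passing a character string to map() extracts that named element from each list item
map(trends, "interest_over_time")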
Edit:
It may be easiest to do further manipulation once the individual list elements are combined into a single data.table, tibble, or data.frame. Shown below is an example of how to remove unwanted columns; to subset by keyword, one could do, e.g., res[keyword=="compare"].
library(gtrendsR)
library(data.table)
time <- "today+5-y"
channel <- "web"
keywords <- list("compare", "switch", "change")
trends <- setNames(lapply(keywords, function(x) gtrends(keyword = x,
    gprop = channel, geo = "AU", time = time, category = 249)), keywords)
res <- rbindlist(lapply(trends, `[[`, "interest_over_time"))
res[,-c("geo","category","time")]
#> date hits keyword gprop
#> 1: 2015-04-26 25 compare web
#> 2: 2015-05-03 26 compare web
#> 3: 2015-05-10 41 compare web
#> 4: 2015-05-17 29 compare web
#> 5: 2015-05-24 32 compare web
#> ---
#> 776: 2020-03-15 51 change web
#> 777: 2020-03-22 27 change web
#> 778: 2020-03-29 20 change web
#> 779: 2020-04-05 0 change web
#> 780: 2020-04-12 35 change web
Created on 2020-04-21 by the reprex package (v0.3.0)
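A quick sketch of the keyword subset mentioned above, combined with the same column drop (using the res data.table built in this example):
res[keyword == "compare", -c("geo", "category", "time")]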

You can collect the existing objects into a list with ls + mget, then use lapply to iterate over that list and pull the "interest_over_time" element out of each one.
total_list <- lapply(mget(ls(pattern = 'trends\\d+')), `[[`, "interest_over_time")
total_list is a list of data frames. It is usually better to keep the data in a list, since a list is easier to manage and does not clutter the environment with lots of objects. However, if you want a separate object for each data frame, you can use list2env.
list2env(total_list, .GlobalEnv)
To drop certain columns, we can do :
total_list <- lapply(mget(ls(pattern = 'trends\\d+')), function(x) {
  data <- x$interest_over_time
  data[setdiff(names(data), c("geo", "category", "time"))]
})
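If you would rather end up with one combined data frame than a list of them, a minimal base R sketch is enough here, since every element has the same columns:
# stack all extracted interest_over_time tables into a single data frame
combined <- do.call(rbind, total_list)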

Related

how to calculate mean based on conditions in for loop in r

I have what I think is a simple question but I can't figure it out! I have a data frame with multiple columns. Here's a general example:
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
test.df
I would like for R to calculate average activity based on the age of the colony in the data frame. Specifically, I want it to only calculate the average activity of the colonies that are the same age or older than the colony in that row, not including the activity of the colony in that row. For example, colony 29683 is 21 years old. I want the average activity of colonies older than 21 for this row of my data. That would include colony 25077 and colony 4865; and the mean would be (45+33)/2 = 39. I want R to do this for each row of the data by identifying the age of the colony in the current row, then identifying the colonies that are older than that colony, and then averaging the activity of those colonies.
I've tried doing this in a for loop in R. Here's the code I used:
test.avg = vector("numeric", nrow(test.df))
for (i in 1:10){
test.avg[i] <- mean(subset(test.df$activity,test.df$age >= age[i])[-i])
}
R returns a list of values where half of them are correct and the other half are not (I'm not even sure how it calculated those incorrect numbers...). The numbers that are correct are also out of order compared to how they're listed in the data frame. It's clearly able to do the right thing for some iterations of the loop but not all. If anyone could help me out with my code, I would greatly appreciate it!
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
library(tidyverse)
test.df %>%
  mutate(result = map_dbl(age, ~ mean(activity[age > .x])))
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
# base
test.df$result <- with(test.df, sapply(age, FUN = function(x) mean(activity[age > x])))
test.df
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
Created on 2021-03-22 by the reprex package (v1.0.0)
The issue in your solution is that the index i refers to rows of the original data frame, but you apply it to the already-subsetted vector, so the positions no longer match.
Try something like this: first store the current row's age as the minimum threshold, then exclude that row and average the activity of the remaining cases with age >= that threshold.
for (i in 1:10){
  test.avg[i] <- {amin = age[i]; mean(subset(test.df[-i,], age >= amin)$activity)}
}
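The same logic also fits in a single sapply call if you prefer to avoid the explicit loop; this is just a sketch of the idea (drop row i first, then apply the age condition to what is left):
test.df$test.avg <- sapply(seq_len(nrow(test.df)), function(i) {
  mean(test.df$activity[-i][test.df$age[-i] >= test.df$age[i]])
})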
You can use map_df :
library(tidyverse)
test.df %>%
  mutate(map_df(1:nrow(test.df), ~
    test.df %>%
      filter(age >= test.df$age[.x]) %>%
      summarise(av_acti = mean(activity))))

World Bank API query

I want to get data using the World Bank's API. For this purpose I use the following query.
wb_data <- httr::GET("http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO?format=json") %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  data.frame()
It works pretty well. However, when I try to specify more than one indicator, it doesn't work.
http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO;NE.CON.PRVT.ZS?format=json
Note that if I change the format to xml and also add source=2 (because the data come from the same database, the World Development Indicators), the query works:
http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO;NE.CON.PRVT.ZS?source=2&format=xml
However, if I want to get data from different databases (e.g. WDI and Doing Business), it doesn't work either.
So, my first question is: how can I get multiple indicators from different databases using one query? According to the World Bank API tutorial I can include about 60 indicators.
My second question is: how can I specify the number of rows per page? As far as I know I can add something like &per_page=100 to get 100 rows as output. Should I calculate the number of rows myself, or can I use something like &per_page=9999999 to get all the data in one request?
P.S. I don't want to use any packages (such as wb or wbstats). I want to do it myself and also to learn something new.
Here's an answer to your question. To use multiple indicators and return JSON, you need to provide both the source ID and the format type, as mentioned in the World Bank API tutorial. You can get the total number of records from one of the returned JSON parameters, called "total". You can then pass this value as the per_page parameter in a second GET request to return all of the records in one page.
library(magrittr)
library(httr)
library(jsonlite)
# set up the target url - you need BOTH the source ID and the format parameters
target_url <- "http://api.worldbank.org/v2/country/chn;ago/indicator/AG.AGR.TRAC.NO;SP.POP.TOTL?source=2&format=json"
# look at the metadata returned for the target url
httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # the metadata is in the first item in the returned list of JSON
  extract2(1)
#> $page
#> [1] 1
#>
#> $pages
#> [1] 5
#>
#> $per_page
#> [1] 50
#>
#> $total
#> [1] 240
#>
#> $sourceid
#> NULL
#>
#> $lastupdated
#> [1] "2019-12-20"
# get the total number of records for the target url query
wb_data_totalrecords <- httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # get the first item in the returned list of JSON
  extract2(1) %>%
  # get the total number of records, which is a named element called "total"
  extract2("total")
# get the data
wb_data <- httr::GET(paste0(target_url, "&per_page=", wb_data_totalrecords)) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # get the data, which is the second item in the returned list of JSON
  extract2(2) %>%
  data.frame()
# look at the data
dim(wb_data)
#> [1] 240 11
head(wb_data)
#> countryiso3code date value scale unit obs_status decimal indicator.id
#> 1 AGO 2019 NA 0 AG.AGR.TRAC.NO
#> 2 AGO 2018 NA 0 AG.AGR.TRAC.NO
#> 3 AGO 2017 NA 0 AG.AGR.TRAC.NO
#> 4 AGO 2016 NA 0 AG.AGR.TRAC.NO
#> 5 AGO 2015 NA 0 AG.AGR.TRAC.NO
#> 6 AGO 2014 NA 0 AG.AGR.TRAC.NO
#> indicator.value country.id country.value
#> 1 Agricultural machinery, tractors AO Angola
#> 2 Agricultural machinery, tractors AO Angola
#> 3 Agricultural machinery, tractors AO Angola
#> 4 Agricultural machinery, tractors AO Angola
#> 5 Agricultural machinery, tractors AO Angola
#> 6 Agricultural machinery, tractors AO Angola
tail(wb_data)
#> countryiso3code date value scale unit obs_status decimal indicator.id
#> 235 CHN 1965 715185000 <NA> 0 SP.POP.TOTL
#> 236 CHN 1964 698355000 <NA> 0 SP.POP.TOTL
#> 237 CHN 1963 682335000 <NA> 0 SP.POP.TOTL
#> 238 CHN 1962 665770000 <NA> 0 SP.POP.TOTL
#> 239 CHN 1961 660330000 <NA> 0 SP.POP.TOTL
#> 240 CHN 1960 667070000 <NA> 0 SP.POP.TOTL
#> indicator.value country.id country.value
#> 235 Population, total CN China
#> 236 Population, total CN China
#> 237 Population, total CN China
#> 238 Population, total CN China
#> 239 Population, total CN China
#> 240 Population, total CN China
Created on 2020-01-30 by the reprex package (v0.3.0)
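If this two-step pattern (read the metadata, then re-request with per_page set to the total) is something you will reuse, it can be wrapped in a small helper. A rough sketch, with a function name of my own choosing (not part of any package), assuming the url already ends with format=json:
library(httr)
library(jsonlite)

# hypothetical helper: fetch every record of a World Bank v2 query on a single page
wb_fetch_all <- function(url) {
  meta <- fromJSON(content(GET(url), "text", encoding = "UTF-8"), flatten = TRUE)[[1]]
  full <- fromJSON(content(GET(paste0(url, "&per_page=", meta$total)), "text", encoding = "UTF-8"), flatten = TRUE)[[2]]
  data.frame(full)
}
# usage: wb_data <- wb_fetch_all(target_url)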

Struggling to Create a Pivot Table in R

I am very, very new to any type of coding language. I am used to Pivot tables in Excel, and trying to replicate a pivot I have done in Excel in R. I have spent a long time searching the internet/ YouTube, but I just can't get it to work.
I am looking to produce a table in which the left-hand column shows a number of locations, and the columns across the top show the different pages that have been viewed. I want the table to show the number of views per location for each of these pages.
The data frame 'specificreports' shows all views over the past year for different pages on an online platform. I want to filter for the month of October, and then pivot the different Employee Teams against the number of views for different pages.
specificreports <- readxl::read_excel("Multi-Tab File - Dashboard Usage.xlsx", sheet = "Specific Reports")
specificreportsLocal <- tbl_df(specificreports)
specificreportsLocal %>% filter(Month == "October") %>%
group_by("Employee Team") %>%
This bit works, in that it groups the different team names and filters entries for the month of October. After this I have tried using the summarise function to summarise the number of hits but can't get it to work at all. I keep getting errors regarding data type. I keep getting confused because solutions I look up keep using different packages.
I would appreciate any help, using the simplest way of doing this as I am a total newbie!
Thanks in advance,
Holly
Let's see if I can help a bit. It's hard to know what your data looks like from the info you gave us, so I'm going to guess and make some fake data for us to play with. It's worth noting that having field names with spaces in them is going to make your life really hard; you should start by renaming your fields to something more manageable. Since I'm just making data up, I'll give my fields names without spaces:
library(tidyverse)
## this makes some fake data
## a data frame with 3 fields: month, team, value
n <- 100
specificreportsLocal <-
  data.frame(
    month = sample(1:12, size = n, replace = TRUE),
    team = letters[1:5],
    value = sample(1:100, size = n, replace = TRUE)
  )
That's just a data frame called specificreportsLocal with three fields: month, team, value
Let's do some things with it:
# This will give us total values by team when month = 10
specificreportsLocal %>%
  filter(month == 10) %>%
  group_by(team) %>%
  summarize(total_value = sum(value))
#> # A tibble: 4 x 2
#> team total_value
#> <fct> <int>
#> 1 a 119
#> 2 b 172
#> 3 c 67
#> 4 d 229
I think that's sort of like what you already did, except I added the summarize to show how it works.
Now let's use all months and reshape it from 'long' to 'wide'
# if I want to see all months I leave out the filter and
# add a group_by month
specificreportsLocal %>%
  group_by(team, month) %>%
  summarize(total_value = sum(value)) %>%
  head(5) # this just shows the first 5 values
#> # A tibble: 5 x 3
#> # Groups: team [1]
#> team month total_value
#> <fct> <int> <int>
#> 1 a 1 17
#> 2 a 2 46
#> 3 a 3 91
#> 4 a 4 69
#> 5 a 5 83
# to make this 'long' data 'wide', we can use the `spread` function
specificreportsLocal %>%
  group_by(team, month) %>%
  summarize(total_value = sum(value)) %>%
  spread(team, total_value)
#> # A tibble: 12 x 6
#> month a b c d e
#> <int> <int> <int> <int> <int> <int>
#> 1 1 17 122 136 NA 167
#> 2 2 46 104 158 94 197
#> 3 3 91 NA NA NA 11
#> 4 4 69 120 159 76 98
#> 5 5 83 186 158 19 208
#> 6 6 103 NA 118 105 84
#> 7 7 NA NA 73 127 107
#> 8 8 NA 130 NA 166 99
#> 9 9 125 72 118 135 71
#> 10 10 119 172 67 229 NA
#> 11 11 107 81 NA 131 49
#> 12 12 174 87 39 NA 41
Created on 2018-12-01 by the reprex package (v0.2.1)
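A side note: in tidyr 1.0 and later, spread has been superseded by pivot_wider; a sketch of the same reshape with the newer function would be:
specificreportsLocal %>%
  group_by(team, month) %>%
  summarize(total_value = sum(value)) %>%
  pivot_wider(names_from = team, values_from = total_value)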
Now I'm not really sure if that's what you want. So feel free to make a comment on this answer if you need any of this clarified.
Welcome to Stack Overflow!
I'm not sure I correctly understand your need without a data sample, but this may work for you:
library(rpivotTable)
specificreportsLocal %>%
  filter(Month == "October") %>%
  rpivotTable(rows = "Employee Team", cols = "page", vals = "views", aggregatorName = "Sum")
Otherwise, if you do not need it interactive (as the Pivot Tables in Excel), this may work as well:
specificreportsLocal %>%
  filter(Month == "October") %>%
  group_by_at(c("Employee Team", "page")) %>%
  summarise(nr_views = sum(views, na.rm = TRUE))
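If you then want the Excel-style wide layout (locations down the side, pages across the top), one could spread that summary out as well, e.g. (same hypothetical column names as above):
specificreportsLocal %>%
  filter(Month == "October") %>%
  group_by_at(c("Employee Team", "page")) %>%
  summarise(nr_views = sum(views, na.rm = TRUE)) %>%
  tidyr::spread(page, nr_views)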

R tidyr::spread duplicate error

I have the following data:
ID AGE SEX RACE COUNTRY VISITNUM VSDTC VSTESTCD VSORRES
32320058 58 M WHITE UKRAINE 2 2016-04-28 DIABP 74
32320058 58 M WHITE UKRAINE 1 2016-04-21 HEIGHT 183
32320058 58 M WHITE UKRAINE 1 2016-04-21 SYSBP 116
32320058 58 M WHITE UKRAINE 2 2016-04-28 SYSBP 116
32320058 58 M WHITE UKRAINE 1 2016-04-21 WEIGHT 109
22080090 75 M WHITE MEXICO 1 2016-05-17 DIABP 81
22080090 75 M WHITE MEXICO 1 2016-05-17 HEIGHT 176
22080090 75 M WHITE MEXICO 1 2016-05-17 SYSBP 151
I would like to reshape the data using tidyr::spread to get the following output:
ID AGE SEX RACE COUNTRY VISITNUM VSDTC DIABP SYSBP WEIGHT HEIGHT
32320058 58 M WHITE UKRAINE 2 2016-04-28 74 116 NA NA
32320058 58 M WHITE UKRAINE 1 2016-04-21 NA 116 109 183
22080090 75 M WHITE MEXICO 1 2016-05-17 81 151 NA 176
I receive duplicate errors, although I don't have duplicates in my data!
df1=spread(df,VSTESTCD,VSORRES)
Error: Duplicate identifiers for rows (36282, 36283), (59176, 59177), (59179, 59180)
I assume I understand your question correctly: spread fails because several rows share the same identifier values, so the fix is to create a unique identifier column before spreading. Here is an example with the iris dataset:
# As several rows share the same identifier values, we should create a unique identifier column
# Let's take the iris dataset as an example
# load the library
library(tidyverse)
# check the dataset (iris)
head(iris)
# gather all columns in the iris dataset except the Species variable:
# create a unique identifier column and transform the wide data to long data as follows
iris_gather <- iris %>%
  dplyr::mutate(ID = row_number(Species)) %>%
  tidyr::gather(key = Type, value = my_value, 1:4)
# check the first six rows
head(iris_gather)
# use *spread* to spread the data back out
iris_spread <- iris_gather %>%
  dplyr::group_by(ID) %>%
  tidyr::spread(key = Type, value = my_value) %>%
  dplyr::ungroup() %>%
  dplyr::select(-ID)
# check the first six rows of iris_spread
head(iris_spread)
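Applied to the data in your question, the same idea might look roughly like this (untested sketch, assuming the original data frame is called df as in your spread call):
library(dplyr)
library(tidyr)
df1 <- df %>%
  group_by(ID, VISITNUM, VSDTC, VSTESTCD) %>%
  mutate(row_id = row_number()) %>%  # unique identifier within each visit/test combination
  ungroup() %>%
  spread(VSTESTCD, VSORRES) %>%
  select(-row_id)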

rvest: follow different links with same tag

I'm doing a little project in R that involves scraping some football data from a website. Here's the link to one of the years of data:
http://www.sports-reference.com/cfb/years/2007-schedule.html.
As you can see, there is a "Date" column with the dates hyperlinked, this hyperlink takes you to the stats from that particular game, which is the data I would like to scrape. Unfortunately, a lot of games take place on the same dates, which means their hyperlinks are the same. So if I scrape the hyperlinks from the table (which I have done) and then do something like:
url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
links = character vector with scraped date links
for (i in 1:length(links)) {
stats = html_session(url) %>%
follow_link(link[i]) %>%
html_nodes('whateverthisnodeis') %>%
html_table()
}
it will scrape from the first link corresponding to each date. For example there were 11 games that took place on Aug 30, 2007, but if I put that in the follow_link function, it grabs data from the first game (Boise St. Weber St.) every time. Is there any way I can specify that I want it to move down the table?
I have already found a workaround by finding out the formula for the urls to which the date hyperlinks take you, but it's a pretty convoluted process, so I thought I'd see if anyone knew how to do it this way.
This is a complete example:
library(rvest)
library(dplyr)
library(pbapply)
# Get the main page
URL <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
pg <- read_html(URL)
# Get the dates links
links <- html_attr(html_nodes(pg, xpath="//table/tbody/tr/td[3]/a"), "href")
# I'm only limiting to 10 since I rly don't care about football
# enough to waste the bandwidth.
#
# You can just remove the [1:10] for your needs
# pblapply gives you a much-needed progress bar for free
scoring_games <- pblapply(links[1:10], function(x) {
  game_pg <- read_html(sprintf("http://www.sports-reference.com%s", x))
  scoring <- html_table(html_nodes(game_pg, xpath = "//table[@id='passing']"), header = TRUE)[[1]]
  colnames(scoring) <- scoring[1, ]
  filter(scoring[-1, ], !Player %in% c("", "Player"))
})
# you can bind_rows them all together but you should
# probably add a column for the game then
bind_rows(scoring_games)
## Source: local data frame [27 x 11]
##
## Player School Cmp Att Pct Yds Y/A AY/A TD Int Rate
## (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1 Taylor Tharp Boise State 14 19 73.7 184 9.7 10.7 1 0 172.4
## 2 Nick Lomax Boise State 1 5 20.0 5 1.0 1.0 0 0 28.4
## 3 Ricky Cookman Boise State 1 2 50.0 9 4.5 -18.0 0 1 -12.2
## 4 Ben Mauk Cincinnati 18 27 66.7 244 9.0 8.9 2 1 159.6
## 5 Tony Pike Cincinnati 6 9 66.7 57 6.3 8.6 1 0 156.5
## 6 Julian Edelman Kent State 17 26 65.4 161 6.2 3.5 1 2 114.7
## 7 Bret Meyer Iowa State 14 23 60.9 148 6.4 3.4 1 2 111.9
## 8 Matt Flynn Louisiana State 12 19 63.2 128 6.7 8.8 2 0 154.5
## 9 Ryan Perrilloux Louisiana State 2 3 66.7 21 7.0 13.7 1 0 235.5
## 10 Michael Henig Mississippi State 11 28 39.3 120 4.3 -5.4 0 6 32.4
## .. ... ... ... ... ... ... ... ... ... ... ...
You are iterating in a loop, but assigning to the same variable every time. Try this:
url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
links = character vector with scraped date links
stats = vector("list", length(links))
for (i in 1:length(links)) {
  stats[[i]] = html_session(url) %>%
    follow_link(links[i]) %>%
    html_nodes('whateverthisnodeis') %>%
    html_table()
}
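Note that following a link by its text will still land on the first match whenever several games share a date. If links instead holds the scraped href attributes (as in the first answer), one alternative is to jump to those hrefs directly with jump_to(), which comes from the same pre-1.0 rvest that provides html_session(); a sketch:
stats <- vector("list", length(links))
for (i in 1:length(links)) {
  stats[[i]] <- html_session(url) %>%
    jump_to(links[i]) %>%  # links[i] is the href scraped for one particular game
    html_nodes('whateverthisnodeis') %>%
    html_table()
}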
