I wrote a web-scraping function that gets data from an API, and I pass one of my df columns to one of its arguments. The problem is that the URL accepts at most 500 IDs in that parameter, and my df has 2000 rows.
How would I split the rows by 500 in order to pass the values into the function?
I've created a very basic reprex that shows the workflow I'm after. I want to pass the split df column to the parse function. I'm guessing I would need to wrap the JSON parse with map_dfr.
library(tidyverse)
sample_df <- tibble(id = 1:20,
col_2 = rnorm(1:20))
# parse function
parse_people <- function(ids = c("1", "10"), argument_2 = NULL){
# Fake Base Url
base_url <- "https://www.thisisafakeurl.com/api/people?Ids="
# fix query parameters to collapse Ids to pass to URL
ids<- stringr::str_c(ids, collapse = ",")
url <- glue::glue("{base_url}{ids}")
# Get URL
resp <- httr::GET(url)
# Save Response in JSON Format
out <- httr::content(resp, as = "text", encoding = "UTF-8")
# Read into JSON format.
jsonlite::fromJSON(out, simplifyDataFrame = TRUE, flatten = TRUE)
}
sample_parse <- parse_people(sample_df$id)
I think I probably need to create 2 functions. 1 function that parses the data, and one that uses map_dfr based off of the splits.
Something like:
# Split ID's from DF here. I want blocks of 500 rows to pass below
# Map Split ID's over parse_people
ids %>%
map_dfr(parse_people)
If we need to split the data.frame into a list of data.frames, an option is group_split with gl:
library(dplyr)
library(purrr) # for map()
n <- 3
lst1 <- sample_df %>%
group_split(grp = as.integer(gl(n(), n, n())), .keep = FALSE) %>%
map(pull, id)
and pass it to the function as
map(lst1, ~ parse_people(ids = .x))
Possible duplicate here.
In the meantime, you can split your 20 row dataframe into 5 dataframes of 4 rows each via:
sample_df <- tibble(id = 1:20,
col_2 = rnorm(1:20))
split(sample_df, rep(1:5, each = 4))
Then you can pass the resulting list of dataframes to a purrr function.
Edit: If you don't know the total rows in advance, want to split by a given number, but also include all rows, there's another solution in the link:
chunk <- 3
n <- nrow(sample_df)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(sample_df,r)
Here I want chunks of 3, but it will include all rows (the last data frame in the list has 2 rows)
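For the original question (2000 IDs, at most 500 per request), the same chunking idea can be applied to the ID vector directly. This is only a minimal sketch: fake_parse is a hypothetical stand-in for parse_people so that nothing is sent over the network, and chunk_size and id_chunks are names made up for the example.

```r
# Hypothetical stand-in for parse_people(): returns one row per id
# and records how many ids were sent in the same "request".
fake_parse <- function(ids) {
  data.frame(id = ids, ids_in_request = length(ids))
}

ids <- 1:2000
chunk_size <- 500

# Label each position with its chunk number (1, 1, ..., 2, 2, ...) and split on it
id_chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))

# One call per chunk, then row-bind the pieces
result <- do.call(rbind, lapply(id_chunks, fake_parse))
```

With purrr loaded, the last line could presumably be written as map_dfr(id_chunks, parse_people) instead, which is the shape the question was asking about.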
I have a list of unnamed lists of named dataframes of varying lengths. I'm looking for a way to grep or search through the indices of the list elements to find specific named dfs.
Here is the current method:
library(tibble) # for tibbles
## list of lists of dataframes
abc_list <- list(list(dfAAA = tibble(names = state.abb[1:10]),
dfBBB = tibble(junk = state.area[5:15]),
dfAAA2 = tibble(names = state.abb[8:20])),
list(dfAAA2 = tibble(names = state.abb[10:15]),
dfCCC = tibble(junk2 = state.area[4:8]),
dfGGG = tibble(junk3 = state.area[12:14])))
# Open list, manually ID list index which has "AAA" dfs
# extract from list of lists into separate list
desired_dfs_list <- abc_list[[1]][grepl("AAA", names(abc_list[[1]]))]
# unlist that list into a combined df
desired_rbinded_list <- as.data.frame(data.table::rbindlist(desired_dfs_list, use.names = F))
I know there's a better way than this.
What I've attempted so far:
## attempt:
## find pattern in df names
aaa_indices <- sapply(abc_list, function(x) grep(pattern = "AAA", names(x)))
## apply that to rbind ???
desired_aaa_rbinded_list <- purrr::map_df(aaa_indices, data.table::rbindlist(abc_list))
the steps from the manual example would be:
pull identified list items (dfs) into a separate list
rbind the list of dfs into one df
I'm just not sure how to do that in a way that allows me more flexibility, instead of manually opening the lists and ID-ing the indices to pull.
thanks for any help or ideas!
If your tibbles (or data frames) are always exactly one level deep (that is, a top-level list whose elements are lists of data frames), you can use unlist to get rid of the first level:
all_dfs_list <- unlist(abc_list,
recursive = FALSE # will stop unlisting after the first level
)
This will result in a list of tibbles:
> all_dfs_list
$dfAAA
# A tibble: 10 x 1
names
<chr>
1 AL
2 AK
...
Then you can filter by name and use rbindlist on the desired elements, as you already did in your question:
desired_dfs_list <- all_dfs_list[grepl("AAA", names(all_dfs_list))]
desired_rbinded_list <- as.data.frame(
data.table::rbindlist(desired_dfs_list, use.names = FALSE)
)
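The two steps (flatten, then filter by name and bind) can be wrapped in one small helper. This is a sketch in base R only: find_and_bind is a made-up name, and plain data.frames are used instead of tibbles so the example has no package dependencies.

```r
# Same shape as abc_list in the question, built from base data.frames
abc_list <- list(
  list(dfAAA  = data.frame(names = state.abb[1:10]),
       dfBBB  = data.frame(junk = state.area[5:15]),
       dfAAA2 = data.frame(names = state.abb[8:20])),
  list(dfAAA2 = data.frame(names = state.abb[10:15]),
       dfCCC  = data.frame(junk2 = state.area[4:8]),
       dfGGG  = data.frame(junk3 = state.area[12:14]))
)

# Hypothetical helper: flatten one level, keep elements whose name
# matches the pattern, and row-bind them into a single data frame.
find_and_bind <- function(nested, pattern) {
  flat <- unlist(nested, recursive = FALSE)
  hits <- flat[grepl(pattern, names(flat))]
  do.call(rbind, unname(hits))
}

aaa <- find_and_bind(abc_list, "AAA")
```

Note that rbind assumes the matching data frames share the same column names, which holds for the "AAA" elements here.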
I'm currently having an issue where I'm trying to nest simulated data for an efficient frontier inside a tibble containing all 250 simulations. The tibble will have one column named "sim" which indicates the number of the simulation, i.e. the rows in this column run from 1 to 250. The other column should contain the nested simulation data, which is a 3x123 tibble for each simulation. (Really hope this makes sense.)
I've tried to replicate the problem such that you don't need all of the previous code and data to see the issue. Problem is that the nested data is saved as a list:
library(tidyverse)
counter = 0
table <- tibble(sim = 1:250, obs = NA)
for(i in (1:250)){
counter = counter + 1
tibble <- tibble(a = NA, b = 1:113, c = 2, d = 3)
tibble$a <- counter
nested_tibble <- tibble %>% nest(data = -a) %>% select(-a)
table$obs[i] <- nested_tibble
}
In this simplified reproducible example the values in the tibble are identical, whereas in the assignment I'm working on the tibble contains values for the efficient frontier. Variable 'a' in the tibble corresponds to the simulation number, and this is the variable I use to nest the efficient frontier. Afterwards I wish to remove variable 'a' and insert the nested tibble in the corresponding 'obs' field, which is currently NA.
I really hope this makes sense. I'm still very new to R and coding. If you need any additional documentation please let me know.
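Your nested_tibble is a one-column tibble whose data column is a list holding the nested tibble, so assigning it directly stores a list in obs[i] rather than the tibble itself. Double bracket notation, nested_tibble[[1]], extracts that list column, and assigning it to table$obs[i] stores the tibble itself in row i. So to get the result you want you can change your loop as follows: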
counter = 0
table <- tibble(sim = 1:250, obs = NA)
for(i in (1:250)){
counter = counter + 1
tibble <- tibble(a = NA, b = 1:113, c = 2, d = 3)
tibble$a <- counter
nested_tibble <- tibble %>% nest(data = -a) %>% select(-a)
table$obs[i] <- nested_tibble[[1]]
}
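If the loop is not essential, the list column can also be built in one go. The following is a sketch under the same simplified setup as the reprex; make_sim is a hypothetical placeholder for whatever produces one simulation's data.

```r
library(tibble)

# Hypothetical placeholder for the code that generates one simulation
make_sim <- function(i) {
  tibble(b = 1:113, c = 2, d = 3)
}

# Build all 250 simulation tibbles as a plain list ...
sims <- lapply(1:250, make_sim)

# ... and store that list directly as a list column
table <- tibble(sim = 1:250, obs = sims)
```

Each element table$obs[[i]] is then the tibble for simulation i, which should work with tidyr::unnest(table, obs) if the data needs to be flattened again later.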
I am working on parsing data from PowerBI REST API for activity data. The way this API functions is - the same end point may return data with 10 fields today and tomorrow it may return with 15 fields. My goal is to run a scheduled process that would extract daily data (say into a SQL table). I predefined an output data frame with the columns that I need. But I am looking for a way to handle the case when - say I defined 12 columns in my output data frame and in today's REST API extract, the results did not contain 1 of those 12 columns. I would like to source them as NA (or an empty string). How to do that in R? Below is a block of code I am working with:
response<-httr::GET(url=RESTEndPoint,config=httpHeader)
parsedResp<-httr::content(response, "text",encoding = "UTF-8")%>%jsonlite::fromJSON(flatten = TRUE)
df<-as.data.frame(parsedResp$activityEventEntities)
outputDF<-df %>%
dplyr::select(
LogID='Id'
,CreationTimeD='CreationTime'
,Operation='Operation'
,OrganizationID='OrganizationId'
)
Say the field 'Operation' is missing from the parsed response; this then throws the error 'Error: Can't subset columns that don't exist.', since that's how dplyr::select works. Is there a way to say: when the field 'Operation' is missing in the parsed response, assign it as NA and move on with the next iteration of the loop?
Thank you!
Not sure if this is a solution to your problem without seeing the data. The delivery_* data frames below are example deliveries. You can also create an empty data frame with the columns you need, called "data_delivery_cols_needed", and delete it later on.
library(data.table)
library(tidyverse)
cols_needed <- c('a', 'b', 'c')
delivery_1 <- data.frame(a = 1, b = 2, x = 3, z = 4)
delivery_2 <- data.frame(c = 1)
delivery_3 <- data.frame(a = 1, b = 2, c = 3)
delivery_4 <- data.frame(f = 5)
# Create a list of all deliveries
all_deliveries <- mget(ls(pattern = "^delivery_"))
# Combine everything into one - fill = TRUE
all_deliveries_data_frame <- rbindlist(all_deliveries, fill = TRUE, idcol = "delivery_file")
final_data <- all_deliveries_data_frame %>% select(all_of(cols_needed))
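For a single daily extract, the missing fields can also be added as NA before selecting. This is a minimal base R sketch: fill_missing_cols is a made-up helper, and the today data frame is invented sample data that mimics a response without the 'Operation' field.

```r
# Hypothetical helper: add every wanted column that is absent as NA,
# then return the columns in the wanted order.
fill_missing_cols <- function(df, wanted) {
  for (col in setdiff(wanted, names(df))) {
    df[[col]] <- NA
  }
  df[, wanted, drop = FALSE]
}

# Invented sample extract in which 'Operation' was not returned
today <- data.frame(
  Id = c("a1", "b2"),
  CreationTime = c("2021-01-01", "2021-01-02"),
  OrganizationId = c("org", "org"),
  stringsAsFactors = FALSE
)

wanted <- c("Id", "CreationTime", "Operation", "OrganizationId")
outputDF <- fill_missing_cols(today, wanted)
```

After this step all wanted columns exist, so a renaming dplyr::select like the one in the question should no longer fail on a missing field.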
I couldn't find a solution in stack, so here's my issue:
I have a df with 342 columns.
I want to make a new df with only specific columns
The list of columns to keep is in another df, listed in 3 columns titled X,Y,Z for 3 new dataframes
Here's my code right now:
# Read the data:
data <- data.table::fread("data_30_9.csv")
# Import variable names #
variable.names.full = openxlsx::read.xlsx("variables2.xlsx")
Y.variable.names = na.omit(variable.names.full[1])
X.variable.names = na.omit(variable.names.full[2])
Z.variable.names = na.omit(variable.names.full[3])
# Make new DF with only specific columns:
X.Data = data %>% select(as.character(X.variable.names)) # This works as X has only 1 variable
Y.Data = data %>% select(as.character(Y.variable.names)) # This give an error: Error:
# # Can't subset columns that don't exist.
Help?
the data is available here:
https://github.com/amirnakar/TammyA/blob/main/data_30_9.csv
https://github.com/amirnakar/TammyA/blob/main/Variables2.xlsx
The problem is that Y.variable.names is a data.frame which you cannot use to subset another data.frame.
You can check by typing class(Y.variable.names).
So the solution to your problem is subsetting Y.variable.names:
Y.Data = data %>% select(Y.variable.names[,1])
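Use lapply on variable.names.full and select the columns from data. Since data was read with fread it is a data.table, so with = FALSE is needed to select by a character vector of names.
list_data <- lapply(variable.names.full, function(x)
data[, as.character(na.omit(x)), with = FALSE])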
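The same idea on a small self-contained example, in case it helps to see it without the real files. Everything here (the toy data, the variable_names table, the subsets name) is invented for illustration, and intersect is an extra guard added so that a name not present in the data is skipped instead of raising "Can't subset columns that don't exist".

```r
# Invented toy data standing in for data_30_9.csv
data <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Invented stand-in for variables2.xlsx: one column of names per target
# data frame, padded with NA; "zzz" does not exist in the data.
variable_names <- data.frame(
  Y = c("a", "b", "zzz"),
  X = c("c", NA, NA),
  stringsAsFactors = FALSE
)

# For each column of names: drop NAs, keep only names that exist, then subset
subsets <- lapply(variable_names, function(x) {
  keep <- intersect(as.character(na.omit(x)), names(data))
  data[, keep, drop = FALSE]
})
```

subsets$Y and subsets$X are then the two new data frames.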
I do not want to loop if I don't have to!
I am trying to iterate through a list; for each value in the list I want to look up that value in a dataframe and pull data from another column (like a vlookup). I did my best to explain in more detail in my code below.
# Create First dataframe
df = data.frame(Letter=c("a","b","c"),
Food=c("Apple","Bannana","Carrot"))
# Create Second dataframe
df1 = data.frame(Testing=c("ab","abc","c"))
# Create Function
SplitAndCalc <- function(i,dat){
# Split into characters
EachCharacter <- strsplit(as.character(dat$Testing), "")
# Iterate through Each Character, look up the matching Letter in df, pull back Food from df
# In the end df1 will look something like Testing=c("ab","abc","c"), Food=c("Apple","AppleBannanaCarrot","Carrot")
return(Food)
}
library("parallel")
library("snow")
# Detect the number of CPU cores on local workstation
num.cores <- detectCores()
# Create cluster on local host
cl <- makeCluster(num.cores, type="SOCK")
# Get count of rows in dataframe
row.cnt = nrow(df1)
# Call function in parallel
system.time(Weight <- parLapply(cl, c(1:row.cnt), SplitAndCalc, dat=df1))
# Create new column in dataframe to store results from function
df1$Food <- NA
# Unlist the Weight to fill out dataframe
df1$Food <- as.numeric(unlist(Weight))
Thanks!
I think I found something that will work for me, posting in case it can help someone else...
# Create First dataframe
df = data.frame(Letter=c("a","b","c"),
Food=c("Apple","Bannana","Carrot"))
# Create Second dataframe
df1 = data.frame(ID=c(1,2,3,4,5),
Testing=c("ab","abc","a","cc","abcabcabc"))
# Split into individual characters
EachCharacter <- strsplit(as.character(df1$Testing), "")
# Set the names of list values to df1$ID so we can merge back together later
temp <- setNames(EachCharacter,df1$ID)
# Unlist temp list and rep ID for each letter
out.dat <- data.frame(ID = rep(names(temp), sapply(temp, length)),
Letter = unlist(temp))
# Merge Individual letter weight
PullInLetterFood <- (merge(df, out.dat, by = 'Letter'))
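The merge above leaves one row per letter. To get back to one string per row of df1, a possible alternative is to look each letter up with match and paste the results together. This is a sketch using the three-row df1 from the question; chars is a name made up for the example.

```r
df <- data.frame(Letter = c("a", "b", "c"),
                 Food = c("Apple", "Bannana", "Carrot"),
                 stringsAsFactors = FALSE)
df1 <- data.frame(Testing = c("ab", "abc", "c"),
                  stringsAsFactors = FALSE)

# Split each Testing value into single characters
chars <- strsplit(df1$Testing, "")

# For each set of characters: look up the Food for every letter
# (match keeps the original letter order) and collapse to one string
df1$Food <- vapply(
  chars,
  function(ch) paste(df$Food[match(ch, df$Letter)], collapse = ""),
  character(1)
)
# df1$Food is now "AppleBannana", "AppleBannanaCarrot", "Carrot"
```

Because vapply is vectorised over the rows, this needs neither an explicit loop nor a cluster for data of this size.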