I am creating a pipeline that allows for an arbitrary number of dataset names to be put in, where they will all be put through similar cleaning processes. To do this, I am using the targets package, and using the tar_map function from tarchetypes, I subject each dataset to a series of tidying and wrangling functions.
My issue now is that one dataset needs to be split into three datasets by a factor (a la split) while the rest should remain untouched. The pipeline would then theoretically move on by processing each dataset individually, including the three 'daughter' datasets.
Here's my best attempt:
library(targets)
library(tarchetypes)
library(tidyverse)
# dir.create("./data")
# tibble(nums = 1:300, groups = rep(letters[1:3], each = 100)) |>
# write_csv("./data/td1.csv")
# tibble(nums = 301:600, groups = rep(letters[1:3], each = 100)) |>
# write_csv("./data/td2.csv")
# tibble(nums = 601:900, groups = rep(letters[1:3], each = 100)) |>
# write_csv("./data/td3.csv")
tar_option_set(
packages = c("tidyverse")
)
read_data <- function(paths) {
read_csv(paths)
}
get_group <- function(data, groups) {
filter(data, groups == groups)
}
do_nothing <- function(data) {
data
}
list(
map1 <- tar_map(
values = tibble(datasets = c("./data/td1.csv", "./data/td2.csv", "./data/td3.csv")),
tar_target(data, read_data(datasets)),
map2 <- tar_map(values = tibble(groups = c("a", "b", "c")),
tar_skip(tester, get_group(data, groups), !str_detect(tar_name(), "td3\\.csv$"))
),
tar_target(dn, do_nothing(list(data, tester)))
)
)
The skipping method is a bit clumsy, I may be thinking about that wrong as well.
I'm obviously trying to combine the code poorly at the end there by putting them in a list, but I'm at a loss as to what else to do.
The datasets can't be combined by, say, rbind, since in actuality they are SummarizedExperiment objects.
Any help is appreciated - let me know if any further clarification is needed.
If you know the levels of that factor in advance, you can handle the splitting of that third dataset with a separate tar_map() call similar to what you do now. If you do not know the factor levels in advance, then the splitting needs to be handled with dynamic branching, and I recommend something like tarchetypes::tar_group_by().
I do not think tar_skip() is relevant here, and I recommend removing it.
If you start with physical files (or write physical files) then I strongly suggest you track them with format = "file": https://books.ropensci.org/targets/files.html#external-input-files.
library(targets)
library(tarchetypes)
tar_option_set(packages = "tidyverse")
list(
tar_map(
values = list(paths = c("data/td1.csv", "data/td2.csv")),
tar_target(file, paths, format = "file"),
tar_target(data, read_csv(file, col_types = cols()))
),
tar_target(file3, "data/td3.csv", format = "file"),
tar_group_by(data3, read_csv(file3, col_types = cols()), groups),
tar_target(
data3_row_counts,
tibble(group = data3$groups[1], n = nrow(data3)),
pattern = map(data3)
)
)
Related
I'm collecting some Economic indicator data. In this process, I also want to collect hourly tweet counts with the script. I asked a similar question with simple data before. As the historical data grows, the run times will get longer. Since the result table will be a dataframe, can I run this script more effectively with functions such as apply family or do.call?
library(httr)
library(dplyr)
library(lubridate)
library(tidyverse)
library(stringr)
sel1<-c('"#fed"','"#usd"','"#ecb"','"#eur"')
for (i in sel1)
{
for (ii in 1:20){
headers = c(
`Authorization` = 'Bearer #enter your Bearer token#'
)
params = list(
`query` =i,
#my sys.time is different
`start_time` = strftime(Sys.time()-(ii+1)*60*60, "%Y-%m-%dT%H:%M:%SZ",tz ='GMT'),
`end_time` =strftime(Sys.time()-ii*60*60, "%Y-%m-%dT%H:%M:%SZ",tz ='GMT'),
`granularity` = 'hour'
)
res1<- httr::GET(url = 'https://api.twitter.com/2/tweets/counts/recent', httr::add_headers(.headers=headers), query = params) %>%
content( as = 'parsed')
x1<-cbind(data.frame(res1),topic=str_replace_all(i, "([\n\"#])", ""))
if(!exists("appnd1")){
appnd1 <- x1
} else{
appnd1 <- rbind(appnd1, x1)
}
}
}
In general, iteratively rbind-ing data in a for loop will always get worse with time: each time you do one rbind, it copies all of the previous frame into memory, so you have two copies of everything. With small numbers this is not so bad, but you can imagine that copying a lot of data around in memory can be a problem. (This is covered in the R Inferno, chapter 2, Growing objects. It's good reading, even if it is not a recent document.)
The best approach is to create a list of frames (see https://stackoverflow.com/a/24376207/3358227), add contents to it, and then when you are done combine all frames within the list into a single frame.
Untested, but try this modified process:
library(httr)
library(dplyr)
library(lubridate)
library(tidyverse)
library(stringr)
sel1<-c('"#fed"','"#usd"','"#ecb"','"#eur"')
listofframes <- list()
for (i in sel1) {
for (ii in 1:20){
headers = c(
`Authorization` = 'Bearer #enter your Bearer token#'
)
params = list(
`query` =i,
#my sys.time is different
`start_time` = strftime(Sys.time()-(ii+1)*60*60, "%Y-%m-%dT%H:%M:%SZ",tz ='GMT'),
`end_time` =strftime(Sys.time()-ii*60*60, "%Y-%m-%dT%H:%M:%SZ",tz ='GMT'),
`granularity` = 'hour'
)
res1<- httr::GET(url = 'https://api.twitter.com/2/tweets/counts/recent', httr::add_headers(.headers=headers), query = params) %>%
content( as = 'parsed')
x1<-cbind(data.frame(res1),topic=str_replace_all(i, "([\n\"#])", ""))
listofframes <- c(listofframes, list(x1))
}
}
# choose one of the following based on your R-dialect/package preference
appnd1 <- do.call(rbind, listofframes)
appnd1 <- dplyr::bind_rows(listofframes)
appnd1 <- data.table::rbindlist(listofframes)
I'm a newbie in R, so please have some patience and... tips are most welcome.
My goal is to create tibble that holds a "Full Name" (of a person, that may have 2 to 4 names) and his/her gender. I must start from a tibble that contains typical Male and Female names.
Below I present a minimum working example.
My problem: I can call get_name() multiple time (in 10.000 for loop!!) and get the right answer. But, I was looking for a more 'elegant' way of doing it. replicate() unfortunately returns a vector... which make it unusable.
My doubts: I know I have some (very few... right!!) issues, like the if statement, that is evaluated every time (which is redundant), but I don't find another way to do it. Any suggestion?
Any other suggestions about code struct are also welcome.
Thank you very much in advance for your help.
# Dummy name list
unit_names <- tribble(
~Women, ~Man,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
set.seed(12345) # seed for test
# Create a tibble with the full names
full_name <- tibble("Full Name" = character(), "Gender" = character() )
get_name <- function() {
# Get the Number of 'Unit-names' to compose a 'Full-name'
nbr_names <- sample(2:4, 1, replace = TRUE)
# Randomize the Gender
gender <- sample(c("Women", "Man"), 1, replace = TRUE)
if (gender == "Women") {
lim_names <- sum( !is.na(unit_names$"Women"))
} else {
lim_names <- sum( !is.na(unit_names$"Man"))
}
# Sample the Fem/Man List names (may have duplicate)
sample(unlist(unit_names[1:lim_names, gender]), nbr_names, replace = TRUE) %>%
# Form a Full-name
paste ( . , collapse = " ") %>%
# Add it to the tibble (INCLUDE the Gender)
add_row(full_name, "Full Name" = . , "Gender" = gender)
}
# How can I make 10k of this?
full_name <- get_name()
If you pass a larger number than 1 to sample this problem becomes easier to vectorise.
One thing that currently makes your problem much harder is the layout of your unit_names table: you are effectively treating male and female names as individually paired, but they clearly aren’t: hence they shouldn’t be in columns of the same table. Use a list of two vectors, for instance:
unit_names = list(
Women = c("fem1", "fem2", "fem3", "fem4", "fem5", "fem6", "fem7"),
Men = c("male1", "male2", "male3", "male4", "male5")
)
Then you can generate random names to your heart’s delight:
generate_names = function (n, unit_names) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
names = Map(sample, unit_names[genders], name_length, replace = TRUE) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
A note on style, unlike your function the above doesn’t use any global variables. Furthermore, don’t "quote" variable names (you do this in unit_names$"Women" and for the arguments of add_row). R allows this, but this is arguably a mistake in the language specification: these are not strings, they’re variable names, making them look like strings is misleading. You don’t quote your other variable names, after all. You do need to backtick-quote the `Full name` column name, since it contains a space. However, the use of backticks, rather than quotes, signifies that this is a variable name.
I am not 100% of what you are trying to get, but if I got it right...did you try with mutate at dplyr? For example:
result= mutate(data.frame,
concated_column = paste(column1, column2, column3, column4, sep = '_'))
With a LITTLE help from Konrad Rudolph, the following elegant (and vectorized ... and fast) solution that I was looking. map2 does the necessary trick.
Here is the full working example if someone needs it:
(Just a side note: I kept the initial conversion from tibble to list because the data arrives to me as a tibble...)
Once again thanks to Konrad.
# Dummy name list
unit_names <- tribble(
~Women, ~Men,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
name_list <- list(
Women = unit_names$Women[!is.na(unit_names$Women)],
Men = unit_names$Men[!is.na(unit_names$Men)]
)
generate_names = function (n, name_list) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
#names = lapply(name_list[genders], sample, name_length) %>%
names = map2(name_list[genders], name_length, sample) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
full_name <- generate_names(10000, name_list)
I have 7 of two different asc files loaded into R, asc[i] and wasc[i], [i] denotes that there are 1:7 ascs and wascs loaded into R. I need to combine the wasc[i] with the asc[i][[1]] (Just the first column in asc[i] with the whole wasc[i] file).
This should be repeated for every pair of asc and wasc files.
The code keeps giving me blank data frames, so I don't know why this doesn't work. The naming is correct, yet the code is not recognizing that the asc[i] and wasc[i] correlate with previously loaded files.
Any help will be greatly appreciated.
# These data frames will reproduce my issue
asc1 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc1 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
asc2 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc2 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
asc3 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc3 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
for (i in 1:3) {
d <- paste("asc", i, sep ="")
f <- paste("wasc", i, sep ="")
full_wing <- as.character(paste("full_wing", i, sep = ""))
assign(full_wing,cbind(d[[1]], f))
}
# Output of full_wing1 data frame
dput(full_wing1)
structure(c("asc1", "wasc1"), .Dim = 1:2, .Dimnames = list(NULL,
c("", "f")))
Additional Information:
asc files are 19 columns long
wasc files are 13 columns long
I only want to combine column 1 from the asc file with the entire wasc file, thus cutting out the remaining 18 columns of the asc file.
# put data in a list
asc = mget(ls(pattern = "^asc"))
wasc = mget(ls(pattern = "^wasc"))
full_wing = Map(f = function(w, a) cbind(w, a[[1]]), w = wasc, a = asc)
Map is a nice shortcut for iterating in parallel over multiple arguments. It returns a nice list. You can access the individual elements with, e.g., full_wing[[1]], full_wing[[3]], etc. Map is just a shortcut, the above code is basically equivalent to the for loop below:
results = list()
for (i in seq_along(asc)) {
results[[i]] = cbind(wasc[[i]], asc[[i]][[1]])
}
I use mget to put the data in a list because in your example you already have objects like asc1, asc2, etc. A much better way to go is to never create those variables in the first place, instead read the files directly into a list, something like this:
asc_paths = list.files(pattern = "^asc")
asc = lapply(asc_paths, read.table)
You can see a lot more explanation of this at How to make a list of data frames?
If you only ever need one column of the asc files, another way to simplify this would be to only read in the needed column, see Only read limited number of columns for some recommendations there.
I am a long-time Stata user but am trying to familiarize myself with the syntax and logic of R. I am wondering if you could help me with writing more efficient codes as shown below (The "The Not-so-efficient Codes")
The goal is to (A) read several files (each of which represents the data of a year), (B) create the same variables for each file, and (C) combine the files into a single one for statistical analysis. I have finished revising "part A", but are struggling with the rest, particularly part B. Could you give me some ideas as to how to proceed, e.g. use unlist to unlist data.l first, or lapply to each element of data.l? I appreciate your comments-thanks.
More Efficient Codes: Part A
# Creat an empty list
data.l = list()
# Create a list of file names
fileList=list.files(path="C:/My Data, pattern=".dat")
# Read the ".dat" files into a single list
data.l = sapply(fileList, readLines)
The Not-so-efficient Codes: Part A, B and C
setwd("C:/My Data")
# Part A: Read the data. Each "dat" file is text file and each line in the file has 300 characters.
dx2004 <- readLines("2004.INJVERBT.dat")
dx2005 <- readLines("2005.INJVERBT.dat")
dx2006 <- readLines("2006.INJVERBT.dat")
# Part B-1: Create variables for each year of data
dt2004 <-data.frame(hhx = substr(dx2004,7,12),fmx = substr(dx2004,13,14),
,iphow = substr(dx2004,19,318),stringsAsFactors = FALSE)
dt2005 <-data.frame(hhx = substr(dx2005,7,12),fmx = substr(dx2005,13,14),
,iphow = substr(dx2005,19,318),stringsAsFactors = FALSE)
dt2006 <-data.frame(hhx = substr(dx2006,7,12),fmx = substr(dx2006,13,14),
iphow = substr(dx2006,19,318),stringsAsFactors = FALSE)
# Part B-2: Create the "iid" variable for each year of data
dt2004$iid<-paste0("2004",dt2004$hhx, dt2004$fmx, dt2004$fpx, dt2004$ipepno)
dt2005$iid<-paste0("2005",dt2005$hhx, dt2005$fmx, dt2005$fpx, dt2005$ipepno)
dt2006$iid<-paste0("2006",dt2006$hhx, dt2006$fmx, dt2006$fpx, dt2006$ipepno)
# Part C: Combine the three years of data into a single one
data = rbind(dt2004,dt2005, dt2006)
you are almost there. Its a combination of lapply and do.call/rbind to work with lapply's list output.
Consider this example:
test1 = "Thisistextinputnumber1"
test2 = "Thisistextinputnumber2"
test3 = "Thisistextinputnumber3"
data.l = list(test1, test2, test3)
makeDF <- function(inputText){
DF <- data.frame(hhx = substr(inputText, 7, 12), fmx = substr(inputText, 13, 14), iphow = substr(inputText, 19, 318), stringsAsFactors = FALSE)
DF <- within(DF, iid <- paste(hhx, fmx, iphow))
return(DF)
}
do.call(rbind, (lapply(data.l, makeDF)))
Here test1, test2, test3 represent your dx200X, and data.l should be the list format you get from the efficient version of Part A.
In makeDF you create your desired data.frame. The do.call(rbind, ) is somewhat standard if you work with lapply-return values.
You also might want to consider checking out the data.table-package which features the function rbindlist, replacing any do.call-rbind construction (and is much faster), next to other great utility for large data sets.
Is there a way to make matching values at scale more programmatic? Basically what I want to do is add a bunch of columns for value lookups onto a dataframe, but I don't want to write the match[] argument every time. It seems like this would be a use case for mapply but I can't quite figure out how to use it here. Any suggestions?
Here's the data:
data <- data.frame(
region = sample(c("northeast","midwest","west"), 50, replace = T),
climate = sample(c("dry","cold","arid"), 50, replace = T),
industry = sample(c("tech","energy","manuf"), 50, replace = T))
And the corresponding lookup tables:
lookups <- data.frame(
orig_val = c("northeast","midwest","west","dry","cold","arid","tech","energy","manuf"),
look_val = c("dir1","dir2","dir3","temp1","temp2","temp3","job1","job2","job3")
)
So now what I want to do is: First add a column to "data" that's called "reg_lookups" and it will match the region to its appropriate value in "lookups". Do the same for "climate_lookups" and so on.
Right now, I've got this mess:
data$reg_lookup <- lookups$look_val[match(data$region, lookups$orig_val)]
data$clim_lookup <- lookups$look_val[match(data$climate, lookups$orig_val)]
data$indus_lookup <- lookups$look_val[match(data$industry, lookups$orig_val)]
I've tried using a function to do this, but the function doesn't seem to work, so then applying that to mapply is a no-go (plus I'm confused about how the mapply syntax would work here):
match_fun <- function(df, newval, df_look, lookup_val, var, ref_val) {
df$newval <- df_look$lookup_val[match(df$var, df_look$ref_val)]
return(df)
}
data2 <- match_fun(data, reg_2, lookups, look_val, region, orig_val)
I think you're just trying to do this:
data <- merge(data,lookups[1:3,],by.x = "region",by.y = "orig_val",all.x = TRUE)
data <- merge(data,lookups[4:6,],by.x = "climate",by.y = "orig_val",all.x = TRUE)
data <- merge(data,lookups[7:9,],by.x = "industry",by.y = "orig_val",all.x = TRUE)
But it would be much better to store the lookups either in separate data frames. That way you can control the names of the new columns more easily. It would also allow you to do something like this:
lookups1 <- split(lookups,rep(1:3,each = 3))
colnames(lookups1[[1]]) <- c('region','reg_lookup')
colnames(lookups1[[2]]) <- c('climate','clim_lookup')
colnames(lookups1[[3]]) <- c('industry','indus_lookup')
do.call(cbind,mapply(merge,
x = list(data[,1,drop = FALSE],data[,2,drop =FALSE],data[,3,drop = FALSE]),
y = lookups1,
moreArgs = list(all.x = TRUE),
SIMPLIFY = FALSE))
and you should be able to wrap that do.call bit in a function.
I used data[,1,drop = FALSE] in order to preserve them as one column data frames.
The way you structure mapply calls is to pass named arguments as lists (the x = and y = parts). I wanted to be sure to preserve all the rows from data, so I passed all.x = TRUE via moreArgs, so that gets passed each time merge is called. Finally, I need to stitch them all together myself, so I turned off SIMPLIFY.