How to loop through URLs in a column using download.file() in R

I have this df from which I need to download all file urls:
library(RCurl)
view(df)
   Date     column_1
   <chr>    <chr>
 1 5/1/21   https://this.is.url_one.tar.gz
 2 5/2/12   https://this.is.url_two.tar.gz
 3 7/3/19   https://this.is.url_three.tar.gz
 4 8/3/13   https://this.is.url_four.tar.gz
 5 10/1/17  https://this.is.url_five.tar.gz
 6 12/12/10 https://this.is.url_six.tar.gz
 7 9/9/16   https://this.is.url_seven.tar.gz
 8 4/27/20  https://this.is.url_eight.tar.gz
 9 7/20/15  https://this.is.url_nine.tar.gz
10 8/30/19  https://this.is.url_ten.tar.gz
# … with 30 more rows
Of course I do not want to type download.file(url = 'https://this.is.url_number.tar.gz', destfile = 'files.tar.gz', method = 'curl') 40 times, once for each URL. How can I loop over all URLs in column_1 with download.file()?

Here is one way using a for loop:
for (i in seq_len(nrow(df))) {
  download.file(url = df$column_1[i],
                destfile = paste0('file', i, '.tar.gz'),
                method = 'curl')
}

You can use Map -
Map(download.file, df$column_1, sprintf('file%d.tar.gz', seq_len(nrow(df))))
where sprintf is used to create a distinct destination filename for each URL.
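If you prefer the tidyverse, here is a minimal equivalent sketch using purrr::walk2 (purrr is not part of the original answers and is assumed to be installed; walk2 fits because the downloads are side effects and the return values can be discarded):
library(purrr)
# iterate over the URLs and destination names in parallel
walk2(df$column_1,
      sprintf('file%d.tar.gz', seq_len(nrow(df))),
      ~ download.file(url = .x, destfile = .y, method = 'curl'))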

Related

How to access Youtube Data API v3 with R

I am trying to use R to retrieve data from the YouTube API v3 and there are few/no tutorials out there that show the basic process. I have figured out this much so far:
# YouTube API query
library(httr)     # for GET()
library(stringr)  # for str_c()

base_url <- "https://youtube.googleapis.com/youtube/v3/"
my_yt_search <- function(search_term, max_results = 20) {
  # my_api_key is assumed to be defined earlier in the script
  my_api_url <- str_c(base_url, "search?part=snippet&",
                      "maxResults=", max_results, "&",
                      "q=", search_term, "&key=", my_api_key, sep = "")
  result <- GET(my_api_url)
  return(result)
}
my_yt_search(search_term = "salmon")
But I am just getting some general meta-data and not the search results. Help?
PS. I know there is a package 'tuber' out there, but I found it very unstable, and since I just need to perform simple searches I prefer to code the requests myself.
Sadly there is no way to get the durations directly: you'll need to call the videos endpoint (with part=contentDetails) after doing the search if you want that information. However, you can pass up to 50 IDs in a single call, so we can save some time by pasting all the IDs together.
library(httr)
library(jsonlite)
library(tidyverse)
my_yt_duration <- function(...) {
  my_api_url <- paste0(base_url, "videos?part=contentDetails",
                       paste0("&id=", ..., collapse = ""),
                       "&key=", my_api_key)
  resp <- GET(my_api_url)
  # keep only the video id and its contentDetails, then flatten the duration
  tb <- fromJSON(content(resp, "text"))$items %>%
    as_tibble() %>%
    select(id, contentDetails)
  tibble(id = tb$id, duration = tb$contentDetails$duration)
}
### getting the video IDs
res <- my_yt_search(search_term = "salmon")
## converting from JSON, then selecting all the video ids
# fromJSON(content(res, as = "text"))$items$id$videoId
tib.id.duration <- my_yt_duration(fromJSON(content(res, as = "text"))$items$id$videoId)
# A tibble: 20 x 2
id duration
<chr> <chr>
1 -x2E7T3-r7k PT4M14S
2 b0ahREpQqsM PT3M35S
3 ROz8898B3dU PT14M17S
4 jD9VJ92xyzA PT5M42S
5 ACfeJuZuyxY PT3M1S
6 bSOd8r4wjec PT6M29S
7 522BBAsijU0 PT10M51S
8 1P55j9ub4es PT14M59S
9 da8JtU1YAyc PT3M4S
10 4MpYuaJsvRw PT8M27S
11 _NbbtnXkL-k PT2M53S
12 3q1JN_3s3gw PT6M17S
13 7A-4-S_k_rk PT9M37S
14 txKUTx5fNbg PT10M2S
15 TSSPDwAQLXs PT3M11S
16 NOHEZSVzpT8 PT7M51S
17 4rTMdQzsm6U PT17M24S
18 V9eeg8d9XEg PT10M35S
19 K4TWAvZPURg PT3M3S
20 rR9wq5uN_q8 PT4M53S
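Since the videos endpoint caps each call at 50 IDs, here is a minimal sketch for longer ID vectors (it reuses my_yt_duration from above; the chunking step is an assumption of mine, not part of the original answer):
ids <- fromJSON(content(res, as = "text"))$items$id$videoId
# one API call per chunk of at most 50 ids, then bind the results
chunks <- split(ids, ceiling(seq_along(ids) / 50))
do.call(rbind, lapply(chunks, my_yt_duration))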

Extract and match sets from list of filenames

I have a dataset of 4000+ images. For the purpose of figuring out the code, I moved a small subset of them to another folder.
The files look like this:
folder
[1] "r01c01f01p01-ch3.tiff" "r01c01f01p01-ch4.tiff" "r01c01f02p01-ch1.tiff"
[4] "r01c01f03p01-ch2.tiff" "r01c01f03p01-ch3.tiff" "r01c01f04p01-ch2.tiff"
[7] "r01c01f04p01-ch4.tiff" "r01c01f05p01-ch1.tiff" "r01c01f05p01-ch2.tiff"
[10] "r01c01f06p01-ch2.tiff" "r01c01f06p01-ch4.tiff" "r01c01f09p01-ch3.tiff"
[13] "r01c01f09p01-ch4.tiff" "r01c01f10p01-ch1.tiff" "r01c01f10p01-ch4.tiff"
[16] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
[19] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
[22] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
I cannot remove the name prior to the -ch# as that information is important. What I want to do, however, is to filter this list of images and return only sets (i.e., r01c02f10p01) that have all four ch values (ch1-4).
I was originally thinking that we could approach the issue along the lines of this:
ch1 <- dir(path="/Desktop/cp/complete//", pattern="ch1")
ch2 <- dir(path="/Desktop/cp/complete//", pattern="ch2")
ch3 <- dir(path="/Desktop/cp/complete//", pattern="ch3")
ch4 <- dir(path="/Desktop/cp/complete//", pattern="ch4")
and then applying these lists with the file.remove function, similar to this:
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
file.remove(folder,final2)
However, creating new variables for each ch value fragments out each file. I am unsure how to use these to actually distinguish whether an individual image has all four ch values to meaningfully filter my images. I'm kind of at a loss, as the other sources I've seen have issues that don't quite match this problem.
Earlier, I was able to remove all the images with ch5 from my image set like this. I was thinking this may be helpful in trying to filter only images which have ch1-ch4, but I'm not sure how to proceed.
##Create folder variable which has all image files
folder <- list.files(getwd())
##Create final2 variable which has all image files ending in ch5
final2 <- dir(path="/Desktop/cp1/Images//", pattern="ch5")
##Remove final2 from folder
file.remove(folder,final2)
To summarize: I expect to filter the files from a random assortment with incomplete ch values (i.e., maybe only ch1 and ch2, or ch3 and ch4) down to an assortment that only contains complete sets (four files with ch1, ch2, ch3, and ch4).
Starting with a vector of filenames like you would get from list.files or something similar, you can create a data frame of filenames, then use a regex to extract the alphanumeric part at the beginning and the number that follows "-ch". Then check that every element of an expected set (I put this in ch_set, though you may need to build it differently) occurs in each group's set of ch values.
# assume this is the vector of file names that comes from list.files
# or something comparable
files <- c("r01c01f01p01-ch3.tiff", "r01c01f01p01-ch4.tiff", "r01c01f02p01-ch1.tiff",
           "r01c01f03p01-ch2.tiff", "r01c01f03p01-ch3.tiff", "r01c01f04p01-ch2.tiff",
           "r01c01f04p01-ch4.tiff", "r01c01f05p01-ch1.tiff", "r01c01f05p01-ch2.tiff",
           "r01c01f06p01-ch2.tiff", "r01c01f06p01-ch4.tiff", "r01c01f09p01-ch3.tiff",
           "r01c01f09p01-ch4.tiff", "r01c01f10p01-ch1.tiff", "r01c01f10p01-ch4.tiff",
           "r01c01f11p01-ch1.tiff", "r01c01f11p01-ch2.tiff", "r01c01f11p01-ch3.tiff",
           "r01c01f11p01-ch4.tiff", "r01c02f10p01-ch1.tiff", "r01c02f10p01-ch2.tiff",
           "r01c02f10p01-ch3.tiff", "r01c02f10p01-ch4.tiff")
library(dplyr)
ch_set <- 1:4
files_to_keep <- data.frame(filename = files, stringsAsFactors = FALSE) %>%
  tidyr::extract(filename, into = c("group", "ch"),
                 regex = "(^[\\w\\d]+)\\-ch(\\d)", remove = FALSE) %>%
  mutate(ch = as.numeric(ch)) %>%
  group_by(group) %>%
  filter(all(ch_set %in% ch))
files_to_keep
#> # A tibble: 8 x 3
#> # Groups: group [2]
#> filename group ch
#> <chr> <chr> <dbl>
#> 1 r01c01f11p01-ch1.tiff r01c01f11p01 1
#> 2 r01c01f11p01-ch2.tiff r01c01f11p01 2
#> 3 r01c01f11p01-ch3.tiff r01c01f11p01 3
#> 4 r01c01f11p01-ch4.tiff r01c01f11p01 4
#> 5 r01c02f10p01-ch1.tiff r01c02f10p01 1
#> 6 r01c02f10p01-ch2.tiff r01c02f10p01 2
#> 7 r01c02f10p01-ch3.tiff r01c02f10p01 3
#> 8 r01c02f10p01-ch4.tiff r01c02f10p01 4
Now that you have a dataframe of the complete groups, just pull the matching filenames back out:
files_to_keep$filename
#> [1] "r01c01f11p01-ch1.tiff" "r01c01f11p01-ch2.tiff" "r01c01f11p01-ch3.tiff"
#> [4] "r01c01f11p01-ch4.tiff" "r01c02f10p01-ch1.tiff" "r01c02f10p01-ch2.tiff"
#> [7] "r01c02f10p01-ch3.tiff" "r01c02f10p01-ch4.tiff"
One thing to note is that this worked without the mutate line where I converted ch to numeric (that is, comparing character versions of those numbers to regular numeric versions), because under the hood %in% converts to matching types. That didn't seem totally safe if you needed to scale this, so I converted them to have matching types.
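For comparison, here is a minimal base-R sketch of the same idea (assuming the files vector above; the regexes mirror the dplyr version and make the same assumption about the naming scheme):
group <- sub("-ch\\d\\.tiff$", "", files)
ch <- as.numeric(sub(".*-ch(\\d)\\.tiff$", "\\1", files))
# keep a file only if its group contains all of ch1-ch4
keep <- as.logical(ave(ch, group, FUN = function(x) all(1:4 %in% x)))
files[keep]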

Need to use jsonlite to handle ndjson message list using stream_in() and stream_out()

I have an ndjson data source. For a simple example, consider a text file with three lines, each containing a valid json message. I want to extract 7 variables from the messages and put them in a dataframe.
Please use the following sample data in a text file. You can paste this data into a text editor and save it as "ndjson_sample.txt"
{"ts":"1","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":12353,\"Var5\":1,\"Var6\":\"abc\",\"Var7\":\"x\"}"}
{"ts":"2","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-68,\"Var4\":4528,\"Var5\":1,\"Var6\":\"def\",\"Var7\":\"y\"}"}
{"ts":"3","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":-5409,\"Var5\":1,\"Var6\":\"ghi\",\"Var7\":\"z\"}"}
The following three lines of code accomplish what I want to do:
file1 <- "ndjson_sample.txt"
json_data1 <- ndjson::stream_in(file1)
raw_df_temp1 <- as.data.frame(ndjson::flatten(json_data1$ct))
For reasons I won't get into, I cannot use the ndjson package. I must find a way to use the jsonlite package to do the same thing using the stream_in() and stream_out() functions. Here's what I tried:
con_in1 <- file(file1, open = "rt")
con_out1 <- file(tmp <- tempfile(), open = "wt")
callback_func <- function(df) {
  jsonlite::stream_out(df, con_out1, pagesize = 1)
}
jsonlite::stream_in(con_in1, handler = callback_func, pagesize = 1)
close(con_out1)
con_in2 <- file(tmp, open = "rt")
raw_df_temp2 <- jsonlite::stream_in(con_in2)
This is not giving me the same data frame as a final output. Can you tell me what I'm doing wrong and what I have to change to make raw_df_temp1 equal raw_df_temp2?
I could potentially solve this with the fromJSON() function operating on each line of the file, but I'd like to find a way to do it with the stream functions. The files I will be dealing with are quite large, so efficiency will be key. I need this to be as fast as possible.
Thank you in advance.
Currently, under ct you'll find a string that can subsequently be fed to fromJSON on its own, but stream_in will not parse it as such. Ignoring your stream_out(stream_in(...), ...) test, here are a couple of ways to read it in:
library(jsonlite)
json <- stream_in(file('ds_guy.ndjson'), simplifyDataFrame=FALSE)
# opening file input connection.
# Imported 3 records. Simplifying...
# closing file input connection.
cbind(
  ts = sapply(json, `[[`, "ts"),
  do.call(rbind.data.frame, lapply(json, function(a) fromJSON(a$ct)))
)
# ts Var1 Var2 Var3 Var4 Var5 Var6 Var7
# 1 1 6 6 -70 12353 1 abc x
# 2 2 6 6 -68 4528 1 def y
# 3 3 6 6 -70 -5409 1 ghi z
Calling fromJSON on each string can be cumbersome, and with larger data this slow-down is exactly why stream_in exists. So if we can capture the "ct" component into a stream of its own, then ...
writeLines(sapply(json, `[[`, "ct"), 'ds_guy2.ndjson')
(There are far more efficient ways to do this with non-R tools, including perhaps a simple
sed -e 's/.*"ct":"\({.*\}\)"}$/\1/g' -e 's/\\"/"/g' ds_guy.ndjson > ds_guy2.ndjson
though this makes a few assumptions about the data that may not be perfectly safe. A better solution is to use jq, which should "always" parse proper JSON correctly; its -r flag emits the raw string value with the escaped quotes already resolved, so no sed pass is needed:
jq -r '.ct' ds_guy.ndjson > ds_guy2.ndjson
and you can run either command with system(...) from R if needed.)
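For instance, a one-liner sketch from R (assuming jq is available on the PATH):
# shell out to jq to pull the embedded "ct" strings into their own ndjson file
system("jq -r '.ct' ds_guy.ndjson > ds_guy2.ndjson")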
From there, under the assumption that each line will contain exactly one row of data.frame data:
json2 <- stream_in(file('ds_guy2.ndjson'), simplifyDataFrame=TRUE)
# opening file input connection.
# Imported 3 records. Simplifying...
# closing file input connection.
cbind(ts=sapply(json, `[[`, "ts"), json2)
# ts Var1 Var2 Var3 Var4 Var5 Var6 Var7
# 1 1 6 6 -70 12353 1 abc x
# 2 2 6 6 -68 4528 1 def y
# 3 3 6 6 -70 -5409 1 ghi z
NB: in the first example, "ts" is a factor and all the other columns are character, because that's what fromJSON gives. In the second example, all strings are factors. This can easily be addressed with judicious use of stringsAsFactors=FALSE, depending on your needs.
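If you want to stay entirely inside the stream functions, here is a minimal one-pass sketch that parses ct inside stream_in's handler (assuming the sample file "ndjson_sample.txt" from the question; the pagesize of 500 is arbitrary):
library(jsonlite)
pages <- list()
stream_in(file("ndjson_sample.txt"), handler = function(df) {
  # df$ct is a character vector of embedded JSON strings; parse each one
  parsed <- do.call(rbind.data.frame, lapply(df$ct, fromJSON))
  pages[[length(pages) + 1]] <<- cbind(ts = df$ts, parsed)
}, pagesize = 500)
raw_df_temp2 <- do.call(rbind, pages)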

Saving object attributes using a function

I'm trying to modify the save() function so that the script from which the object originated is stored as an attribute of the object.
s = function(object, filepath, original.script.name){
  # modified save() function
  # stores the name of the script from which the object originates
  # as an attribute, then saves as normal
  attr(object, "original.script") = original.script.name
  save(object, file = filepath)
}
Sample:
testob = 1:10
testob
# [1] 1 2 3 4 5 6 7 8 9 10
s(testob, filepath = "rotation1scripts_v4/saved.objects/testob", "this.is.the.name")
load(file = "rotation1scripts_v4/saved.objects/testob")
testob
# [1] 1 2 3 4 5 6 7 8 9 10
attributes(testob)
# NULL
Investigating further, it seems that the object is not being loaded into the environment:
testob2 = 1:5
testob2
# [1] 1 2 3 4 5
s(testob2, "rotation1scripts_v4/saved.objects/testob2", "this.is.the.name")
rm(testob2)
load(file = "rotation1scripts_v4/saved.objects/testob2")
testob2
# Error: object 'testob2' not found
Why isn't it working?
You need to be careful with save(). It saves variables under the same name that's passed to save(). So when you call save(object, ...), it saves the variable as "object" and not as "testob", which you seem to be expecting. You can do some non-standard environment shuffling to make this work, though. Try
s <- function(object, filepath, original.script.name) {
  objectname <- deparse(substitute(object))
  attr(object, "original.script") <- original.script.name
  save_envir <- new.env()
  save_envir[[objectname]] <- object
  save(list = objectname, file = filepath, envir = save_envir)
}
We use deparse(substitute()) to get the name of the variable passed to the function. Then we create a new environment in which we can create an object with that same name, so that name can be used when actually saving the object.
This appears to work if we test with
testob <- 1:10
s(testob, filepath = "test.rdata", "this.is.the.name")
rm(testob)
load(file = "test.rdata")
testob
# [1] 1 2 3 4 5 6 7 8 9 10
# attr(,"original.script")
# [1] "this.is.the.name"

Show the map using mapview when the data.frame includes multibyte characters

I want to display data using the mapview package, but when the data includes multibyte characters, the map sometimes cannot be shown. What would be the best way to show the map?
library(mapview)
data(atlStorms2005)
test1 <- test2 <- atlStorms2005
test1@data$test <- as.factor(c("日本語", "てすと"))
test2@data$test <- as.factor(c("日本語", "五十嵐"))
mapview(test1) # can show the map
mapview(test2) # cannot show
re.data.frame <- function(data, encoding = "UTF-8", fileEncoding = "UTF-8") {
  # round-trip the data through a UTF-8 encoded CSV file
  write.csv(data, file("tmp.csv", encoding = encoding),
            row.names = FALSE, fileEncoding = fileEncoding)
  tmp <- readr::read_csv("tmp.csv", col_types = readr::cols())
  return(tmp)
}
test2@data <- re.data.frame(test2@data)
mapview(test2) # can show
But the text in the popup's test column is corrupted, even though the underlying data is correct:
head(test2@data)
# A tibble: 6 × 4
Name MaxWind MinPress test
<chr> <int> <int> <chr>
1 ALPHA 45 998 日本語
2 ARLENE 60 989 五十嵐
3 BRET 35 1002 日本語
4 CINDY 65 991 五十嵐
5 DELTA 60 980 日本語
6 DENNIS 130 930 五十嵐
As of commit bc2c57f, this should have been fixed. Until the next CRAN release of mapview, simply use the development version (devtools::install_github("environmentalinformatics-marburg/mapview", ref = "develop")) to solve this issue.
In brief, this behavior was related to our Rcpp routines, which run under the hood to keep the creation of popup tables computationally efficient. There, the user's native encoding was used instead of UTF-8 when writing the JSON output files, resulting in corrupted text on machines where UTF-8 was not the default.
