I'm trying to use the youtubecaption library to download all the transcripts for a playlist and then create a dataframe with all the results.
I have a list of the video URLs and have tried to create a for loop to pass them into the get_caption() function, but I can only get one video's transcript added to the df.
I've tried a few approaches:
vids <- as.list(mydata$videoId)
for (i in 1:length(vids)) {
  vids2 <- paste("https://www.youtube.com/watch?v=", vids[i], sep = "")
  test_transcript2 <-
    get_caption(
      url = vids2,
      language = "en",
      savexl = FALSE,
      openxl = FALSE,
      path = getwd())
  rbind(test_transcript, test_transcript2)
}
Also using the column of the main dataframe:
captions <- sapply(mydata[,24], FUN = get_caption)
Is there an efficient way to accomplish this?
In your code, you do rbind(test_transcript, test_transcript2) but never assign the result, so it is lost. Combined with my comment about avoiding the rbind(old, newrow) paradigm (it copies the whole frame on every iteration), your code might be
vids <- as.list(mydata$videoId)
out <- list()
for (i in 1:length(vids)) {
  vids2 <- paste("https://www.youtube.com/watch?v=", vids[i], sep = "")
  test_transcript2 <-
    get_caption(
      url = vids2,
      language = "en",
      savexl = FALSE,
      openxl = FALSE,
      path = getwd())
  out <- c(out, list(test_transcript2))
}
alldat <- do.call(rbind, out)
Some other pointers:
for (i in 1:length(.)) is bad practice if this is ever functionalized: with an empty vector, 1:length(vids) yields c(1, 0) and the body still runs; it's better to use for (i in seq_along(vids))
we never need the index number itself, so we can loop over the values directly with for (vid in vids)
we can do the pasting in one shot, generally faster in R since paste0 is vectorized: for (vid in paste0("https://www.youtube.com/watch?v=", vids)), and then url = vid in the call to get_caption — see the sketch just below these pointers
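Putting those pointers together, the loop version might look like this (untested, same data as above):
urls <- paste0("https://www.youtube.com/watch?v=", mydata$videoId)
out <- list()
for (vid in urls) {
  out <- c(out, list(
    get_caption(url = vid, language = "en",
                savexl = FALSE, openxl = FALSE, path = getwd())))
}
alldat <- do.call(rbind, out)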
with all that, it might be even simpler to use lapply for the whole thing:
path <- getwd()
out <- lapply(paste0("https://www.youtube.com/watch?v=", vids),
get_caption, language = "en", savexl = FALSE,
openxl = FALSE, path = path)
do.call(rbind, out)
(NB: untested.)
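One further hedged suggestion, not in the answer above: get_caption() makes a network request per video, so you may want to guard each call with tryCatch so that one failing video doesn't abort the whole run. A minimal sketch, assuming get_caption() throws an error for videos whose captions can't be fetched:
safe_caption <- function(u) {
  tryCatch(
    get_caption(url = u, language = "en",
                savexl = FALSE, openxl = FALSE, path = getwd()),
    error = function(e) NULL)  # skip videos whose captions can't be fetched
}
out <- lapply(paste0("https://www.youtube.com/watch?v=", vids), safe_caption)
alldat <- do.call(rbind, out)  # rbind silently drops the NULL entries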
I'm writing some R code to handle pairs of files, an Excel file and a csv (Imotions.txt). I need to extract a column from the Excel file and merge it into the csv, pair by pair. Below is my abbreviated script. The nested for loop makes it quadratic in the number of files: the body runs 4 times (once per Excel × csv combination) instead of once per pair.
Basically, is there a general way to think about running code over a paired set of files that I can translate to this and other languages?
excel_files <- list.files(pattern = ".xlsx", full.names = TRUE)
imotion_files <- list.files(pattern = 'Imotions.txt', full.names = TRUE)
for (imotion_file in imotion_files) {
  for (excel_file in excel_files) {
    filename <- paste(sub("_Imotions.txt", "", imotion_file))
    raw_data <- extract_raw_data(imotion_file)
    event_data <- extract_event_data(imotion_file)
    # convert times to milliseconds
    latency_ms <- as.data.frame(
      sapply(
        df_col_only_ones$latency,
        convert_to_ms,
        raw_data_first_timestamp = raw_data_first_timestamp
      )
    )
    # read in paradigm data
    paradigm_data <- read_excel(path = excel_file, range = "H30:H328")
    merged <- bind_cols(latency_ms, paradigm_data)
    print(paste("writing = ", filename))
    write.table(
      merged,
      file = paste(filename, "_EVENT", ".txt", sep = ""),
      sep = '\t',
      col.names = TRUE,
      row.names = FALSE,
      quote = FALSE
    )
  }
}
Some of the operations are not entirely clear from the question. Here is an option in tidyverse
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(readxl)
out <- crossing(excel_files, imotion_files) %>%
  mutate(filename = str_remove(imotion_files, "_Imotions.txt"),
         raw_data = map(imotion_files, extract_raw_data),
         event_data = map(imotion_files, extract_event_data),
         paradigm_data = map(excel_files, ~
           read_excel(.x, range = "H30:H328") %>%
             bind_cols(latency_ms, .)))
Based on the OP's code, latency_ms can be created once outside the loop and reused while binding the columns
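A possible follow-up, not part of the original answer (untested): the OP's loop also wrote each merged table to disk, which could be reproduced from the list-column with pwalk, using the filename and paradigm_data columns built above:
out %>%
  select(filename, paradigm_data) %>%
  pwalk(function(filename, paradigm_data)
    write.table(paradigm_data,
                file = paste0(filename, "_EVENT.txt"),
                sep = '\t', col.names = TRUE,
                row.names = FALSE, quote = FALSE))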
Based on the naming of raw_data_first_timestamp, I'm assuming it's created by the extract_raw_data function - otherwise you can move the latency_ms calculation outside the loop entirely, as akrun mentioned.
If you don't want to use tidyverse, see the modified version of your code at bottom. Notice that the loops have been broken out to cut down on duplicated actions.
Some general tips to improve efficiency when working with loops:
Before attempting to improve nested loop efficiencies, consider whether the loops can be broken out so that data from earlier loops is stored for usage in later loops. This can also be done with nested loops and variables tracking whether data has already been set, but it's usually simpler to break the loops out and negate the need for the tracking variables.
Create variables and call functions before the loop where possible. Depending on the language and/or compiler (if one is used), variable creation outside loops may not help with efficiency, but it's still good practice.
Variables and functions which must be created or called inside loops should be done in the highest scope - or the outermost loop - possible.
Disclaimer - I have never used R, so there may be syntax errors.
excel_files <- list.files(pattern = ".xlsx", full.names = TRUE)
imotion_files <- list.files(pattern = 'Imotions.txt', full.names = TRUE)
paradigm_data_list <- vector("list", length(excel_files))
for (i in seq_along(excel_files)) {
  # read in paradigm data, once per Excel file
  paradigm_data_list[[i]] <- read_excel(path = excel_files[[i]], range = "H30:H328")
}
for (imotion_file in imotion_files) {
  filename <- paste(sub("_Imotions.txt", "", imotion_file))
  raw_data <- extract_raw_data(imotion_file)
  event_data <- extract_event_data(imotion_file)
  # convert times to milliseconds
  latency_ms <- as.data.frame(
    sapply(
      df_col_only_ones$latency,
      convert_to_ms,
      raw_data_first_timestamp = raw_data_first_timestamp
    )
  )
  for (paradigm_data in paradigm_data_list) {
    merged <- bind_cols(latency_ms, paradigm_data)
    print(paste("writing = ", filename))
    write.table(
      merged,
      file = paste(filename, "_EVENT", ".txt", sep = ""),
      sep = '\t',
      col.names = TRUE,
      row.names = FALSE,
      quote = FALSE
    )
  }
}
I'm trying to load multiple images to do some machine learning in R. I can load a single image just fine, but when I try to load multiple images using either lapply or a for loop, I get the following error: "Error in wrap.url(file, load.image.internal) : File not found". I checked that the files do exist, my WD is set correctly, and R recognizes that the files and directory exist. No matter what I change, the error is always the same, and it makes no difference whether I give the path from the drive root or relative to the WD. I've asked many people for help with no success. I've posted my code using lapply and a for loop below. I'm still relatively new to R, so if there is something I'm missing I'd greatly appreciate knowing. Also, I'm using imager here to load the files.
eggs2015 <- list()
file_list <- list.files(path="~/Grad School/Thesis Work/Machine Learning R/a2015_experimental_clustering_R/*.jpg", pattern="*.jpg", full.names = TRUE)
for (i in 1:length(file_list)){
  Path <- paste0("a2015_experimental_clustering_R", file_list[i])
  eggs2015 <- c(eggs2015, list(load.image(Path)))
}
names(eggs2015) <- file_list
eggs2015 <- list.files(path = "~/Grad School/Thesis Work/Machine Learning R/2015_experimental_clustering_R", pattern = ".jpg", all.files = TRUE, full.names = TRUE)
eggs2015 <- lapply(list, FUN = load.image("~/Grad School/Thesis Work/Machine Learning R/a2015_experimental_clustering_R/*.jpg"))
eggs2015 <- as.data.frame(eggs2015)
Personally for this kind of operation I prefer to use sapply so I can identify images with the original file names later on (if needed):
library(imager)  # for load.image()
FilesToRead <- list.files(path = "~/Grad School/Thesis Work/Machine Learning R/2015_experimental_clustering_R", pattern = ".jpg", all.files = TRUE, full.names = TRUE)
ListOfImages <- sapply(FilesToRead, FUN = load.image, simplify = FALSE, USE.NAMES = TRUE)
should work and give you a list of elements with your images using the file paths as names
Or using lapply (sapply is just a wrapper for lapply)
ListOfImages <- lapply(FilesToRead, FUN = load.image)
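If you want shorter names than the full paths (a small addition of mine, untested), you can set them from the base file names:
ListOfImages <- lapply(FilesToRead, FUN = load.image)
names(ListOfImages) <- basename(FilesToRead)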
As you can see, your code just needed a little tweaking.
Hope it helps
I want to run the following code in R for all the files. I made a for loop for that, but when I run it, it is applied to only one file, not all of them. BTW, my files do not have a header.
You use [[ to subset something from peaks. However, once the file has been read in by name, peaks is a data frame that carries no further reference to the file name, so you just have to get rid of the [[i]].
for (i in filelist.coverages) {
  peaks <- read.delim(i, sep = '', header = FALSE)
  PeakSizes <- c(PeakSizes, peaks$V3 - peaks$V2)
}
By using the iterator i within read.delim() which holds a new file name each time, every time R goes through the loop, peaks will have the content of a new file.
In your code, i references a file name. Use indices instead.
And, by the way, don't use setwd; use the full.names = TRUE option in list.files. And preallocate PeakSizes — as a list, since each file contributes a whole vector of peak sizes: PeakSizes <- vector("list", length(filelist.coverages)).
So do:
filelist.coverages <- list.files('K:/prostate_cancer_porto/H3K27me3_ChIPseq/',
                                 pattern = 'island.bed', full.names = TRUE)
## all 97 bed files
PeakSizes <- vector("list", length(filelist.coverages))
for (i in seq_along(filelist.coverages)) {
  peaks <- read.delim(filelist.coverages[i], sep = '', header = FALSE)
  PeakSizes[[i]] <- peaks$V3 - peaks$V2  # one vector of peak sizes per file
}
PeakSizes <- unlist(PeakSizes)
Or you could simply use sapply or purrr::map (not map_dbl, since each file yields a vector of sizes, not a single number) and flatten the result:
PeakSizes <- unlist(sapply(filelist.coverages, function(file) {
  peaks <- read.delim(file, sep = '', header = FALSE)
  peaks$V3 - peaks$V2
}))
Suppose that I have 30 tsv files of twitter data, say for Google, Facebook, LinkedIn, etc. I want to perform a set of operations on all of them, and was wondering if I can do so using a loop.
Specifically, I know that I can create variables using a loop, such as
index = c("fb", "goog", "lkdn")
for (i in 1:length(index)){
file_name = paste(names[i], ".data", sep = "")
assign(file_name, read.delim(paste(index$report_id[i],
"-tweets.tsv", sep = ""), header = T,
stringsAsFactors = F))
}
But how do I perform operations to all these data files in the loop? For example, if I want to order the datafiles using data[order(data[,4]), ], how do I make sure that the data file name is changed in each iteration of the loop? Thanks!
Build a function that does all of the operations you need it to do and then create a loop calling that function instead. If you insist on using assign to create lots of variables (not a great practice for this very reason) then try something like:
files <- dir("path/to/files", pattern = "*.tsv")
fileFunction <- function(x){
df <- read.delim(x, sep = "\t", header = T, stringsAsFactors = F)
df <- df[order(df[,4]),]
return(df)
}
for (a in files){
assign(a, fileFunction(a))
}
I am still pretty new to R and very new to for-loops and functions, but I searched quite a bit on stackoverflow and couldn't find an answer to this question. So here we go.
I'm trying to create a script that will (1) read in multiple .csv files and (2) apply a function that strips twitter handles from urls in these files and does some other things to them. I developed the script for these two tasks separately, so I know that most of my code works, but something goes wrong when I try to combine them. I prepare for doing so using the following code:
# specify directory for your files and replace 'file' with the first, unique part of the
# files you would like to import
mypath <- "~/Users/you/data/"
mypattern <- "file+.*csv"
# Get a list of the files
file_list <- list.files(path = mypath,
                        pattern = mypattern)
# List of names to be given to data frames
data_names <- str_match(file_list, "(.*?)\\.")[,2]
# Define function for preparing datasets
handlestripper <- function(data){
  data$handle <- str_match(data$URL, "com/(.*?)/status")[,2]
  data$rank <- c(1:500)
  names(data) <- c("dateGMT", "url", "tweet", "twitterid", "rank")
  data <- data[, c(4, 1:3, 5)]
}
That all works fine. The problem comes when I try to execute the function handlestripper() within the for-loop.
# Read in data
for(i in data_names){
  filepath <- file.path(mypath, paste(i, ".csv", sep = ""))
  assign(i, read.delim(filepath, colClasses = "character", sep = ","))
  i <- handlestripper(i)
}
When I execute this code, I get the following error: Error in data$URL : $ operator is invalid for atomic vectors. I know this means that my function is being applied to the string I pulled from data_names, but I don't know how to tell R that, in the last line of my for-loop, I want the function applied to the object named i that I just created with the assign command, rather than to i itself.
Inside your loop, you can change this:
assign(i, read.delim(filepath, colClasses = "character", sep = ","))
i <- handlestripper(i)
to
tmp <- read.delim(filepath, colClasses = "character", sep = ",")
assign(i, handlestripper(tmp))
I think you should make as few get and assign calls as you can, but there's nothing wrong with indexing your loop with names as you are doing. I do it all the time, anyway.
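For completeness, here is a sketch with no assign or get at all (untested): build a named list of the cleaned data frames and keep working with that:
datasets <- lapply(data_names, function(nm) {
  filepath <- file.path(mypath, paste0(nm, ".csv"))
  handlestripper(read.delim(filepath, colClasses = "character", sep = ","))
})
names(datasets) <- data_names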