Read files recursively from github repos - r

I have some files on GitHub that I would like to read recursively in R. So if I do this, I get a list of all files.
library(httr)
req <- GET("https://api.github.com/repos/jakevdp/data-USstates/git/trees/master?recursive=1")
stop_for_status(req)
all.files <- unlist(lapply(content(req)$tree, "["), use.names = F)
file.names.only <- unlist(lapply(content(req)$tree, "[", "path"), use.names = F)
Which is not what I actually wanted. I would like to be able to read these files from the repository itself, just like using list.files locally. How can I make this work? Or, at least, how can I get a list of full URLs to each file in the repository so that each one can be read directly?
Say, from this repository: https://github.com/jakevdp/data-USstates

We can do this fairly simply with the rvest library. We select the links by using the .js-navigation-open html node, and then pull the href values from the links. We get a couple of empty strings with that, and .[. != ""] removes those.
library(rvest)
fileList <- read_html("https://github.com/jakevdp/data-USstates") %>%
  html_nodes(".js-navigation-open") %>%
  html_attr("href") %>%
  .[. != ""] # remove empty elements
[1] "/jakevdp/data-USstates/blob/master/README.md" "/jakevdp/data-USstates/blob/master/state-abbrevs.csv"
[3] "/jakevdp/data-USstates/blob/master/state-areas.csv" "/jakevdp/data-USstates/blob/master/state-population.csv"

Related

Read latest txt in a Folder R

I am working with stock files from a repository, a new file is generated every day:
For example:
"stock2021-11-05.txt"
I need to read the last generated file, or failing that, read all the files that begin with the word stock and join them.
I currently use the following code:
fileList <- list.files( pattern= "*.txt")
But this brings me all the txt files from the repository and not just the ones that start with the word stock.
I would appreciate a help with this.
Thanks!
Simply use:
list.files(pattern = "stock.*\\.txt")
to find all files that begin with "stock" and end with ".txt"
Check out this REGEX cheat sheet from the stringr package to learn more:
https://github.com/rstudio/cheatsheets/blob/main/strings.pdf
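One small caveat: the pattern argument of list.files() is a regular expression, so "stock.*\\.txt" also matches names that merely contain "stock" somewhere. Anchoring the pattern makes the "begins with" requirement explicit:
list.files(pattern = "^stock.*\\.txt$")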
So you managed to filter out files which are not .txt files. Two steps are missing. A possible fileList now could be:
fileList <- c("stock2021-11-05.txt",
"stock2020-11-15.txt",
"stock2021-02-05.txt",
"vwxyz2018-01-01.txt")
1 - Filter stock-files
> fileList_stock <- grep("^stock", fileList, value = TRUE)
> fileList_stock
[1] "stock2021-11-05.txt" "stock2020-11-15.txt" "stock2021-02-05.txt"
2 - Get latest file
> sort(fileList_stock, decreasing = TRUE)[1]
[1] "stock2021-11-05.txt"
1+2 - Wrapper function (the %>% pipe requires a package such as magrittr or dplyr to be loaded)
> get_last_stock_file <- function(x){
+ grep("^stock", x, value = TRUE) %>% sort(decreasing = TRUE) %>% .[1]
+ }
> get_last_stock_file(fileList)
[1] "stock2021-11-05.txt"

Copy and rename Specific Files based on parent directories in R

I am attempting to solve this issue in R, but I'll upvote answers in any programming language.
I have an example vector of filenames like so called file_list
c("D:/example/sub1/session1/OD/CD/text.txt", "D:/example/sub2/session1/OD/CD/text.txt",
"D:/example/sub3/session1/OD/CD/text.txt")
What I'm trying to do is move and rename the text files based on the parts of the parent directory that identify the sub and session. So the first file would be renamed sub1_session1_text.txt and be copied, along with the other text files, to a single new directory called all_files.
I'm struggling with some of the specifics of how to rename the file. I'm trying to use substr combined with str_locate_all and paste0 to copy and rename the files based on these parent directories.
Locate the position in each element of the vector file_list to construct starting and ending position for substr
library(stringr)
ending<-str_locate_all(pattern="/OD",file_list)
starting <- str_locate_all(pattern="/sub", file_list)
I then want to somehow pull the starting and ending positions of those patterns out of those lists for each element, feed them to substr to build the new names, and then use paste0 to construct the destination paths.
What I'd like is something like
substr_naming_vector<-substr(file_list, start=starting[starting_position],stop=ending[starting_position])
but I don't know how to index the list so that the correct starting_position is used for each element. Once I figure that out, I'd fill in something like this:
#paste the filenames into a vector that represents them being renamed in a new directory
all_files <- paste0("D:/all_files/", substr_naming_vector)
#rename and copy the files
file.copy(from = file_list, to = all_files)
Here's an example using regular expressions, which makes it somewhat shorter:
library(stringr)
library(magrittr)
all_dirs <-
  c("D:/example/sub1/session1/OD/CD/text.txt",
    "D:/example/sub2/session1/OD/CD/text.txt",
    "D:/example/sub3/session1/OD/CD/text.txt")
new_dirs <-
  all_dirs %>%
  # Match each group using regex
  str_match_all("D:/example/(.+)/(.+)/OD/CD/(.+)") %>%
  # Paste the matched groups into one path
  vapply(function(x) paste0(x[2:4], collapse = "_"), character(1)) %>%
  paste0("D:/all_files/", .)
# Copy them.
file.copy(all_dirs, new_dirs)
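Note that file.copy() will typically just return FALSE without copying if the destination folder does not exist yet, so you may need to create D:/all_files first:
# create the destination folder if it is not already there
dir.create("D:/all_files", showWarnings = FALSE)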
This is one way of doing it. I assumed your file is always called text.txt.
library(stringr)
my_files <- c("D:/example/sub1/session1/OD/CD/text.txt",
"D:/example/sub2/session1/OD/CD/text.txt",
"D:/example/sub3/session1/OD/CD/text.txt")
# get the sub information
subs <- str_extract(string = my_files,
pattern = "sub[0-9]")
# get the session information
sessions <- str_extract(string = my_files,
pattern = "session[0-9]")
# paste it all together
new_file_names <- paste("D:/all_files/",
paste(subs,
sessions,
"text.txt",
sep = "_"),
sep = "")
file.copy(from = my_files,
to = new_file_names)
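One assumption worth flagging: the patterns "sub[0-9]" and "session[0-9]" only capture a single digit. If you ever have something like sub10 or session12, add a + quantifier so the full number is kept:
subs     <- str_extract(string = my_files, pattern = "sub[0-9]+")
sessions <- str_extract(string = my_files, pattern = "session[0-9]+")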

Exporting scraped data to one CSV

I managed to write a scraper in R (rvest) for gathering election info, but now I am struggling with how to save the data in one CSV file rather than in separate CSV files.
Here is my working code, where I scrape pages 11, 12 and 13 separately.
library(rvest)
library(xml2)
do.call(rbind, lapply(11:13, function(n) {
  url <- paste0("http://www.cvk.gov.ua/pls/vnd2014/WP040?PT001F01=910&pf7331=", n)
  mi <- read_html(url) %>% html_table(fill = TRUE)
  mi[[8]]
  file <- paste0("election2014_", n, ".csv")
  if (!file.exists(file)) write.csv(mi[[8]], file)
  Sys.sleep(5)
}))
I tried to do this in the end, but it is not working as I expected
write.csv(rbind(mi[[8]],url), file="election2014.csv")
Try this one:
library(rvest)
library(tidyverse)
scr <- function(n){
  url <- paste0("http://www.cvk.gov.ua/pls/vnd2014/WP040?PT001F01=910&pf7331=", n)
  df <- read_html(url) %>%
    html_table(fill = TRUE) %>%
    .[[8]] %>%
    data.frame()
  colnames(df) <- df[1, ]
  df <- df[-1, ]
}
res <- 11:13 %>%
  map_df(., scr)
write.csv2(res, "odin_tyr.csv")
I wasn't able to get your code to work, but you could try creating an empty data frame before running your code, and then do this before writing a csv file with the complete data:
df = rbind(df,mi[[8]])
You could also consider turning your csv files into one using the purrr package (read_csv below comes from readr):
files = list.files("folder_name", pattern = "*.csv", full.names = TRUE)
df = files %>%
  map(read_csv) %>%
  reduce(rbind)
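For completeness, going back to the original do.call/lapply loop from the question, a minimal fix (a sketch, keeping the same URL and table index) is to let the anonymous function return the table and write the combined result once at the end:
library(rvest)
res <- do.call(rbind, lapply(11:13, function(n) {
  url <- paste0("http://www.cvk.gov.ua/pls/vnd2014/WP040?PT001F01=910&pf7331=", n)
  tbl <- read_html(url) %>% html_table(fill = TRUE) %>% .[[8]]
  Sys.sleep(5)
  tbl  # last value is returned, so do.call(rbind, ...) can stack the pages
}))
write.csv(res, file = "election2014.csv", row.names = FALSE)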

How to replace the title of columns in a merged document with the file directory using R?

I have performed an experiment under different conditions. Each of those conditions has its own folder. In each of those folders, there is a subfolder for each replicate that contains a text file called DistList.txt, so the folders "C1.1", "C1.2" and so on each contain one of the mentioned .txt files.
I have now managed to combine all those single DistList.txt files using the following script:
setwd("~/Desktop/Experiment/.")
fileList <- list.files(path = ".", recursive = TRUE, pattern = "DistList.txt", full.names = TRUE)
listData <- lapply(fileList, read.table)
names(listData) <- gsub("DistList.txt","",basename(fileList))
library(tidyverse)
library(reshape2)
bind_rows(listData, .id = "FileName") %>%
  group_by(FileName) %>%
  mutate(rowNum = row_number()) %>%
  dcast(rowNum ~ FileName, value.var = "V1") %>%
  select(-rowNum) %>%
  write.csv(file = "Result.csv")
This then yields a .csv file whose column titles are just numbers, which are not that useful for me.
I would rather have the directory of the DistList.txt files, or even better just the name of the folder they are in, as the column title. I thought that I could do that using the function list.dirs() and colnames, but I somehow didn't manage to get it to work.
I would be very grateful, if someone could help me with this issue!
I think this line
names(listData) <- gsub("DistList.txt", "", basename(fileList))
should be:
names(listData) <- gsub("DistList.txt", "", fileList)
Because by using basename we are removing all the folders, leaving us with the filename "DistList.txt", and that filename then gets replaced by the empty string "" using gsub.
We might actually want the following instead, which extracts the last directory and should give, in your case, something like c("C1.1", "C1.2", ...):
names(listData) <- basename(dirname(fileList))
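As a quick illustration (with a made-up fileList that matches the layout described above):
fileList <- c("Condition1/C1.1/DistList.txt", "Condition1/C1.2/DistList.txt")
basename(dirname(fileList))
# [1] "C1.1" "C1.2"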

Read the file created/modified last in different directories in R

I want to read the CSV file modified (or created) most recently in each of several different directories and then put them into a pre-existing single dataframe (df_total).
I have two kinds of directories to read:
A:/LogIIS/FOLDER01/files.csv
On others there is a folder with several files.csv, as in the example below:
A:/LogIIS/FOLDER02/FOLDER_A/files.csv
A:/LogIIS/FOLDER02/FOLDER_B/files.csv
A:/LogIIS/FOLDER02/FOLDER_C/files.csv
A:/LogIIS/FOLDER03/FOLDER_A/files.csv
A:/LogIIS/FOLDER03/FOLDER_B/files.csv
A:/LogIIS/FOLDER03/FOLDER_C/files.csv
A:/LogIIS/FOLDER03/FOLDER_D/files.csv
Something like this...
#get a vector of all filenames
files <- list.files(path="A:/LogIIS",pattern="files.csv",full.names = TRUE,recursive = TRUE)
#get the directory names of these (for grouping)
dirs <- dirname(files)
#find the last file in each directory (i.e. latest modified time)
lastfiles <- tapply(files,dirs,function(v) v[which.max(file.mtime(v))])
You can then loop through these and read them in.
If you just want the latest file overall, this will be files[which.max(file.mtime(files))].
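The "loop through these and read them in" step could look like this (a sketch, assuming the CSVs share the same columns as the pre-existing df_total):
latest_data <- do.call(rbind, lapply(lastfiles, read.csv))
df_total <- rbind(df_total, latest_data)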
Here is a tidyverse-friendly solution:
list.files("data/",full.names = T) %>%
enframe(name = NULL) %>%
bind_cols(pmap_df(., file.info)) %>%
filter(mtime==max(mtime)) %>%
pull(value)
Consider creating a data frame of files as file.info maintains OS file system metadata per path such as created time:
setwd("A:/LogIIS")
files <- list.files(getwd(), full.names = TRUE, recursive = TRUE)
# DATAFRAME OF FILE, DIR, AND METADATA
filesdf <- cbind(file = files,
                 dir = dirname(files),
                 data.frame(file.info(files), row.names = NULL),
                 stringsAsFactors = FALSE)
# SORT BY DIR AND CREATED TIME (DESC)
filesdf <- with(filesdf, filesdf[order(dir, -xtfrm(ctime)),])
# AGGREGATE LATEST FILE PER DIR
latestfiles <- aggregate(.~dir, filesdf, FUN=function(i) head(i)[[1]])
# LOOP THROUGH LATEST FILE VECTOR FOR IMPORT
df_total <- do.call(rbind, lapply(latestfiles$file, read.csv))
Here is a pipe-friendly way to get the most recent file in a folder. It uses an anonymous function which in my view is slightly more readable than a one-liner. file.mtime is faster than file.info(fpath)$ctime.
dir(path = "your_path_goes_here", full.names = T) %>% # on W, use pattern="^your_pattern"
(function(fpath){
ftime <- file.mtime(fpath) # file.info(fpath)$ctime for file CREATED time
return(fpath[which.max(ftime)]) # returns the most recent file path
})
