I have the code below, which downloads a Google Sheet and stores it in Documents.
library(dplyr)
library(data.table)
library(googlesheets)
library(rJava)
t.start<-Sys.Date()
t.start<-as.character(t.start)
#gs_auth(new_user = TRUE)
#gs_ls()
#gs_auth()
as<-gs_title("XYZ")
gs_download(as, overwrite = TRUE)
I want to store the sheet XYZ in a specific location (i.e. E:\My_data\File) with the conditions below.
I want to run this script twice a day and rename the file XYZ based on Sys.Date() and the time of day. (For example, if Sys.Date() = 01/01/2017 and the time is before 15:00 hrs, the name should be '01/01/2017_A_XYZ.xlsx'; after 15:00 hrs it should be '01/01/2017_B_XYZ.xlsx'.)
I want to automatically create folders in E:\My_data\File based on Sys.Date() (i.e. year and month). If Sys.Date() = 01/01/2017, there should be one folder named 2017 containing a subfolder named Jan-17, and inside that subfolder two subfolders: A (for files before 15:00 hrs for that particular year/month) and B (for files after 15:00 hrs for that particular year/month).
If the year/month changes, a new folder should be created with the same structure.
You can use the following code to do that:
# To handle the googlesheets
require(googlesheets)
# For easier date manipulation
require(lubridate)
# Get current time
t <- Sys.time()
# Set your base path and create the basic file structure
base_path <- "E:/My_data/File/"   # note the trailing slash so paste0() builds valid paths
dir.create(paste0(base_path, year(t)))
# Month-year subfolder, e.g. "Jan-17"
sub_folder_path <- paste0(base_path, year(t), "/", month(t, label = TRUE), "-", format(t, "%y"))
dir.create(sub_folder_path)
AB_split <- ifelse(hour(t)<15, "A", "B")
dir.create(paste0(sub_folder_path, "/", AB_split))
# Set your gsheet title and the wanted file-name
ws_title <- "XYZ"
ws_file_name <- paste0(date(t), "_", AB_split, "_", ws_title, ".xlsx")
ws_file_path <- paste0(sub_folder_path, "/", AB_split, "/", ws_file_name)
# Download it
as<-gs_title(ws_title)
gs_download(as, to = ws_file_path, overwrite = TRUE)
Trying to create an existing folder results in a warning. If you want to suppress the warnings, wrap the dir.create() calls in suppressWarnings(dir.create(...)).
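Alternatively, here is a minimal sketch (base R only, reusing the variables defined above) that creates the whole nested path in one call and silences the "already exists" warning:
# Build the full year/month/A-or-B path and create any missing levels at once;
# showWarnings = FALSE keeps repeated runs quiet
ab_path <- paste0(sub_folder_path, "/", AB_split)
dir.create(ab_path, recursive = TRUE, showWarnings = FALSE)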
I would strongly recommend NOT working with the worksheet title but using the key instead. See ?gs_key.
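For example, a hedged sketch of registering the sheet by key instead of by title (the key string is a placeholder; take the real key from the sheet's browser URL or from gs_ls()):
# Register the sheet by its key rather than its (possibly ambiguous) title
sheet <- gs_key("<your-sheet-key>")  # placeholder, replace with the real key
gs_download(sheet, to = ws_file_path, overwrite = TRUE)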
I need to shape the data stored in Excel files and save it as new .csv files. I figured out what specific actions should be done, but can't understand how to use lapply.
All Excel files have the same structure. Each of the .csv files should have the name of the original file.
## the original actions successfully performed on a single file
library(readxl)
library("reshape2")
DataSource <- read_excel("File1.xlsx", sheet = "Sheet10")
DataShaped <- melt(subset(DataSource [-(1),], select = - c(ng)), id.vars = c ("itemname","week"))
write.csv2(DataShaped, "C:/Users/Ol/Desktop/Meta/File1.csv")
## my attempt to apply to the rest of the files in the directory
lapply(Files, function (i){write.csv2((melt(subset(read_excel(i,sheet = "Sheet10")[-(1),], select = - c(ng)), id.vars = c ("itemname","week"))))})
R returns the result to the console but doesn't create any files. The output resembles the .csv structure.
Could anybody explain what I am doing wrong? I'm new to R and would be really grateful for the help.
Answer
Thanks to the prompt answer from @Parfait, the code is working! So glad. Here it is:
library(readxl)
library(reshape2)
Files <- list.files(full.names = TRUE)
lapply(Files, function(i) {
  write.csv2(
    melt(subset(read_excel(i, sheet = "Decomp_Val")[-(1), ],
                select = -c(ng)), id.vars = c("itemname", "week")),
    file = sub("\\.xlsx$", ".csv", i))
})
It reads an Excel file in the directory, drops the first data row (keeping the headers) and the column named "ng", melts the data by the id variables "itemname" and "week", and writes the result as a .csv to the working directory under the name of the original file. And then: rinse and repeat for every file.
Simply pass an actual file path to write.csv2. Otherwise, as noted in the docs (?write.csv), the default value of the file argument is the empty string "":
file: either a character string naming a file or a connection open for writing. "" indicates output to the console.
Below concatenates the Excel file stem to the specified path directory with .csv extension:
path <- "C:/Users/Ol/Desktop/Meta/"
lapply(Files, function (i){
write.csv2(
melt(subset(read_excel(i, sheet = "Sheet10")[-(1),],
select = -c(ng)),
id.vars = c("itemname","week")),
file = paste0(path, sub(".xlsx", ".csv", i))
)
})
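If Files was built with full.names = TRUE (as in the snippet accepted above), a variant that strips the directory part with basename() keeps the output in the Meta folder; this is just a defensive tweak of the same call, nothing else changes:
lapply(Files, function(i) {
  write.csv2(
    melt(subset(read_excel(i, sheet = "Sheet10")[-(1), ],
                select = -c(ng)),
         id.vars = c("itemname", "week")),
    # basename() drops any leading directory so only "File1.csv" etc. is appended
    file = paste0(path, sub("\\.xlsx$", ".csv", basename(i)))
  )
})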
I have to download a lot of files in the .gz format (one file is ~40 MB, ~40k rows).
The files also contain data from other countries; I would like to keep only the data for France ("fr"), limiting the number of columns as well.
I am trying to automate this process, but I have problems with unpacking.
The data is on a webpage, and I'm interested in the data from the whole folder.
I am trying this approach:
create a tempfile
download the archive to the tempfile
unzip it, read it and select the rows I need
save as a new file and repeat for the next file
I would like to ask if this way of thinking is correct (the code below will go inside a for loop).
temp <- tempfile()
temp1 <- "C:/Users/tdo/Desktop/data/test.txt"
download.file("https://dumps.wikimedia.org/other/pageviews/2018/2018-
06/pageviews-20180601-000000.gz",temp) # example
unzip(files = temp,exdir = temp1)
data <- read.table(..)
data[data$name == 'fr', ]
write.table(...)
This is how I created the links:
library(rvest)
library(dplyr)
dumpList <- read_html("https://dumps.wikimedia.org/other/pageviews/2018/2018-04/")
links <- data_frame(filename = html_attr(html_nodes(dumpList, "a"), "href")) %>%
  filter(grepl(x = filename, "pageviews")) %>% # keep only the pageviews files
  mutate(link = paste0("https://dumps.wikimedia.org/other/pageviews/2018/2018-04/", filename))
Why not directly read the gzipped files? I don't see the need to locally unpack the archives, if all you want to do is subset/filter the data and store as new local files.
I recommend using readr::read_table2 to directly read the gzipped file.
Here is a minimal example:
# List of files to download
# url is the link, target the local filename
lst.files <- list(
list(
url = "https://dumps.wikimedia.org/other/pageviews/2018/2018-06/pageviews-20180601-000000.gz",
target = "pageviews-20180601-000000.gz"))
# Download gzipped files (only if file does not exist)
lapply(lst.files, function(x)
if (!file.exists(x$target)) download.file(x$url, x$target))
# Open files
library(readr)
lst <- lapply(lst.files, function(x) {
df <- read_table2(x$target)
# Filter/subset entries
# Write to file with write_delim
})
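For completeness, here is a hedged sketch of what the two placeholder comments could look like filled in. It assumes the pageview dumps have four space-separated columns with no header row and that the project code ("fr" for the French wiki) is in the first column; the column names are made up for illustration:
library(readr)
library(dplyr)
lst <- lapply(lst.files, function(x) {
  # Assumed column layout of the dump: project, page, views, bytes (no header)
  df <- read_table2(x$target,
                    col_names = c("project", "page", "views", "bytes"))
  # Keep only the French rows (assumption: the code of interest is "fr")
  df_fr <- filter(df, project == "fr")
  # Write the subset as a new local file next to the original
  write_delim(df_fr, paste0("fr_", sub("\\.gz$", ".csv", x$target)), delim = ";")
  df_fr
})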
I need to read parts of an Excel file into R. I have some existing code, but the authority changed the source. Previously there was a direct URL to the document; now the document can only be reached through a landing page on their website.
Could someone tell me which package I can use to achieve that? The landing page is: http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/
There I am looking for the document "Dati operativi relativi al bilanciamento del sistema post Del. 312/2016/R/gas - Database 2018".
I add my previous code below to give an idea of what I did. As you can see, I only needed read.xlsx for this first step.
Many thanks in advance!
library(ggplot2)
library(lubridate)
library(openxlsx)
library(reshape2)
library(dplyr)
Bilres <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G", startRow = 1, colNames = TRUE)
# Select the pubblicazione and BILANCIAMENTO.RESIDUALE (column R) columns from Storico_G and store them in Bilres_df
Bilres_df <- data.frame(Bilres$pubblicazione, Bilres$BILANCIAMENTO.RESIDUALE )
# Convert pubblicazione to date-time format
Bilres_df$pubblicazione <- ymd_h(Bilres_df$Bilres.pubblicazione)
Bilreslast=tail(Bilres_df,1)
Bilreslast=data.frame(Bilreslast)
Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE <- as.numeric(as.character((Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE)))
If you copy the URL from the web page, you can use download.file() first to download it as a binary file and then read.xlsx() to read the data. Depending on how frequently the content changes on the web page, you may be better off just copying the URL than parsing it from the page.
oldFile <- "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx"
newFile <- "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
if(!file.exists("./data/downloadedXlsx.xlsx")){
download.file(newFile,"./data/downloadedXlsx.xlsx",
method="curl", #use "curl" for OS X / Linux, "wininet" for Windows
mode="wb") # "wb" means "write binary"
} else message("file already loaded locally, using disk version")
library(openxlsx)
Bilres <- read.xlsx(xlsxFile = "./data/downloadedXlsx.xlsx",
sheet = "Storico_G", startRow = 1, colNames = TRUE)
head(Bilres[,1:3])
...and the output:
> head(Bilres[,1:3])
pubblicazione aggiornato.il IMMESSO
1 2017_01_01_06 42736.24 1915484
2 2017_01_01_07 42736.28 1915484
3 2017_01_01_08 42736.33 1866326
4 2017_01_01_09 42736.36 1866326
5 2017_01_01_10 42736.41 1866326
6 2017_01_01_11 42736.46 1866326
>
UPDATE: Added logic to avoid downloading the file once it has been downloaded.
You can find the .xlsx links this way:
library(rvest)
library(magrittr)
pg <- read_html("http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/")
# get all the Excel (xlsx) links on that page:
html_nodes(pg, xpath=".//a[contains(@href, '.xlsx')]") %>%
html_attr("href") %>%
sprintf("http://www.snamretegas.it%s", .) -> excel_links
head(excel_links)
## [1] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
## [2] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2018/DatiOperativi_2018-IT.xlsx"
And, pass in what you want to your Excel reading function:
openxlsx::read.xlsx(excel_links[1], sheet = "Storico_G", startRow = 1, colNames = TRUE)
## data frame output here that I'm not going to show
BUT!!
This is a very selfish and unkind way to do this since you hit that site for the Excel file every time you want to read it, wasting their CPU and bandwidth and your bandwidth.
You should use the download.file() technique Len described to cache a local copy and only re-download when necessary.
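A rough sketch of that combination, assuming the excel_links vector from above and a local data/ folder (both are assumptions about your setup):
# Cache each workbook locally and download it only if it is missing
dir.create("data", showWarnings = FALSE)
local_files <- file.path("data", basename(excel_links))
for (i in seq_along(excel_links)) {
  if (!file.exists(local_files[i])) {
    download.file(excel_links[i], local_files[i], mode = "wb")
  }
}
# Read from the cached copy instead of hitting the site again
Bilres <- openxlsx::read.xlsx(local_files[1], sheet = "Storico_G",
                              startRow = 1, colNames = TRUE)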
This should get you going in the right direction.
library(data.table)
library(openxlsx)
# fread() cannot parse .xlsx, so read the sheet with openxlsx and convert to a data.table
mydat <- as.data.table(read.xlsx("http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx", sheet = "Storico_G"))
head(mydat)
I have figured out part of the code, which I describe below, but I find it hard to iterate (loop) the function over a list of files:
library(Hmisc)
filter_173 <- c("kp|917416", "kp|835898", "kp|829747", "kp|767311")
# This is a vector of values that I want to exclude from the files
setwd("full_path_of_directory_with_desired_files")
filepath <- "//full_path_of_directory_with_desired_files"
list.files(filepath)
predict_files <- list.files(filepath, pattern="predict.txt")
# all files that I want to filter have _predict.txt in them
predict_full <- file.path(filepath, predict_files)
# generates full pathnames of all desired files I want to filter
sample_names <- sapply(strsplit(predict_files, "_"), `[`, 1)
Now here is an example of the simple filtering I want to do with one specific file; this works great. How do I repeat this in a loop over all filenames in predict_full?
test_predict <- read.table("a550673-4308980_A05_RepliG_rep2_predict.txt", header = T, sep = "\t")
# this is a file in my current working directory that I set with setwd above
test_predict_filt <- test_predict[test_predict$target_id %nin% filter_173, ]
write.table(test_predict_filt, file = "test_predict")
Finally, how do I place the filtered files in a folder with the same name as the original plus the suffix "filtered"?
predict_filt <- file.path(filepath, "filtered")
# Place filtered files in the filtered/ subdirectory
filtPreds <- file.path(predict_filt, paste0(sample_names, "_filt_predict.txt"))
I always get stuck at looping! It is hard to share a 100% reproducible example since everyone's working directory and file paths are unique, but all the code I shared works if you adapt it to appropriate path names on your machine.
This should work to loop through each of the files and write them out to the new location with the filename specifications you needed. Just be sure to change the directory paths first.
filter_173 <- c("kp|917416", "kp|835898", "kp|829747", "kp|767311") #This is a vector of values that I want to exclude from the files
filepath <- "//full_path_of_directory_with_desired_files"
filteredpath <- "//full_path_of_directory_with_filtered_results/"
# Get vector of predict.txt files
predict_files <- list.files(filepath, pattern="predict.txt")
# Get vector of full paths for predict.txt files
predict_full <- file.path(filepath, predict_files)
# Get vector of sample names
sample_names <- sapply(strsplit(predict_files, "_"), `[`, 1)
# Set for loop to go from 1 to the number of predict.txt files
for(i in 1:length(predict_full))
{
# Load the current file into a dataframe
df.predict <- read.table(predict_full[i], header=T, sep="\t")
# Filter out the unwanted rows
df.predict <- df.predict[!(df.predict$target_id %in% filter_173), ]
# Write the filtered dataframe to the new directory
write.table(df.predict, file = file.path(filteredpath, paste(sample_names[i],"_filt_predict.txt",sep = "")))
}
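One detail worth checking: the input files are read with sep = "\t", while write.table() defaults to space-separated output with quoting and row names. If the filtered files should keep the same layout as the originals, a hedged variant of the write call inside the loop could be:
# Write tab-separated output without quotes or row names,
# matching the layout of the original *_predict.txt files
write.table(df.predict,
            file = file.path(filteredpath, paste0(sample_names[i], "_filt_predict.txt")),
            sep = "\t", quote = FALSE, row.names = FALSE)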
I have a list of files that are all named similarly: "FlightTrackDATE.txt" where the date is expressed in YYYYMMDD. I read in all the files with the list.files() command, but this gives me all the files in that folder (only flight track files are in this folder). What I would like to do is create a new file that will combine all the files from the last 90 days (or three months, whichever is easier) and ignore the other files.
You can try this :
#date from which you want to consolidate (replace with required date)
fromDate = as.Date("2015-12-23")
for (filename in list.files()){
  # Extract the date from the filename using substr (characters 12-19)
  filenameDate = as.Date(substr(filename, 12, 19), format = "%Y%m%d")
  # Read and consolidate if the file date is on or after the from date
  if ((filenameDate - fromDate) >= 0){
    # Create the consolidated data from the first matching file
    if (!exists('consolidated')){
      consolidated <- read.table(filename, header = TRUE)
    } else{
      data = read.table(filename, header = TRUE)
      # Row-bind to consolidate
      consolidated = rbind(consolidated, data)
    }
  }
}
OUTPUT:
I have three sample files :
FlightTrack20151224.txt
FlightTrack20151223.txt
FlightTrack20151222.txt
Sample data:
Name Speed
AA101 23
Consolidated data:
Name Speed
1 AA102 24
2 AA101 23
Note:
1. Create the from date by subtracting from the current date (see the sketch below) or by using a fixed date as above.
2. Remember to clean up the existing consolidated data if you are running the script again. Data duplication might occur otherwise.
3. Save consolidated to file :)
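A minimal sketch of points 1 and 2 (base R only):
# Point 1: consolidate everything from the last 90 days, relative to today
fromDate = Sys.Date() - 90
# Point 2: drop any previous consolidated data before re-running, to avoid duplication
if (exists('consolidated')) rm(consolidated)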
Consider an lapply() solution without a need for list.files() since you know ahead of time the directory and file name structure:
path = "C:/path/to/txt/files"
# LIST OF ALL LAST 90 DATES IN YYYYMMDD FORMAT
dates <- lapply(0:90, function(x) format(Sys.Date()-x, "%Y%m%d"))
# IMPORT ALL FILES INTO A LIST OF DATAFRAMES
dfList <- lapply(paste0(path, "FlightTrack", dates, ".txt"),
                 # header = TRUE assumes the files have a header row, as in the sample above
                 function(x) if (file.exists(x)) {read.table(x, header = TRUE)})
# COMBINE EACH DATA FRAME INTO ONE
df <- do.call(rbind, dfList)
# OUTPUT FINAL FILE TO TXT
write.table(df, paste0(path, "FlightTrack90Days.txt"), sep = ",", row.names = FALSE)