I have an object Period which will hold the current month ID.
Now, I have different files whose names carry that period as a suffix, and I want R to read and work on those files at many places in a program.
Example
Period="202105"
file 1=SG202105
file 2=MN202105
How can I create an object Period and call it at various places in the program?
You can use the Period object as the pattern argument in list.files to get the names of the files that contain that value.
Period="202105"
list.files(pattern = Period)
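For example, once the matching names are collected, they can all be read in one pass (a sketch, assuming the files are CSVs in the working directory):

```r
Period <- "202105"

# Names of all files containing the period, e.g. "SG202105.csv", "MN202105.csv"
files <- list.files(pattern = Period)

# Read each file into a named list of data frames
data_list <- lapply(files, read.csv)
names(data_list) <- files
```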
To create an object with the current year and month in the format yyyymm, use this function. It accepts any R object of class "Date" or "POSIXt".
new_period <- function(date = Sys.Date()) {
  d <- format(date, "%Y%m")
  d
}
Period <- new_period()
Period
#[1] "202105"
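The function also accepts an explicit date, which is handy for reproducing a past period (the function is redefined here so the snippet is self-contained):

```r
new_period <- function(date = Sys.Date()) {
  format(date, "%Y%m")
}

# Build the period for a specific month instead of the current one
new_period(as.Date("2020-12-15"))
#> [1] "202012"
```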
I think you need paste0 and assign; try this example:
Period = "202105"
file1 <- paste0("SG", Period, ".csv")
file2 <- paste0("MN", Period)
# read file
myFile <- read.csv(file1)
# create a data.frame
assign("file2", data.frame(x = 1, y = 2))
ls()
# [1] "file1"  "file2"  "myFile" "Period"
I have a folder with more than 500 .dta files. I would like to load some of these files into a single R object.
My .dta files have a generic name composed of four parts: 'two letters/four digits/y/.dta'. For instance, a name can be 'de2015y.dta' or 'fr2008y.dta'. Only the parts corresponding to the two letters and the four digits change across the .dta files.
I have written a code that works, but I am not satisfied with it. I would like to avoid using a loop and to shorten it.
My code is:
# Select the .dta files I want to load
#.....................................
name <- list.files(path="E:/Folder") # names of the .dta files in the folder
db <- as.data.frame(name)
db$year <- substr(db$name, 3, 6)
db <- subset (db, year == max(db$year)) # keep last year available
db$country <- substr(db$name, 1, 2)
list.name <- as.list(db$country)
# Loading all the .dta files into the global environment
#..................................................
library(readstata13)  # provides read.dta13
for (i in list.name) {
  obj_name  <- paste0(i, '2015y')
  file_name <- file.path('E:/Folder', paste(obj_name, 'dta', sep = '.'))
  input     <- read.dta13(file_name)
  assign(obj_name, value = input)
}
# Merge the files into a single object
#..................................................
df2015 <- rbind(at2015y, be2015y, bg2015y, ch2015y, cy2015y, cz2015y, dk2015y, ee2015y, es2015y, fi2015y,
                fr2015y, gr2015y, hr2015y, hu2015y, ie2015y, is2015y, it2015y, lt2015y, lu2015y, lv2015y,
                mt2015y, nl2015y, no2015y, pl2015y, pt2015y, ro2015y, se2015y, si2015y, sk2015y, uk2015y)
Does anyone know how I can avoid using a loop and shorten my code?
You can also use purrr for your task.
First create a named vector of all files you want to load (as I understand your question, you simply need all files from 2015). The setNames() part is only necessary in case you'd like an ID variable in your data frame and it is not already included in the .dta files.
After that, simply use map_df() to read all files and return a data frame. Specifying .id is optional and results in an ID column the values of which are based on the names of in_files.
library(purrr)
library(haven)
in_files <- list.files(path="E:/Folder", pattern = "2015y", full.names = TRUE)
in_files <- setNames(in_files, in_files)
df2015 <- map_df(in_files, read_dta, .id = "id")
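Note that this hard-codes the 2015 pattern; the latest year can instead be derived from the names themselves. A sketch on an in-memory vector, assuming the 'two letters + four digits + y' naming described above:

```r
# Example file names following the described pattern
files <- c("de2015y.dta", "fr2008y.dta", "at2015y.dta")

# Extract the four-digit year (positions 3-6) and keep only the latest
years  <- substr(files, 3, 6)
latest <- files[years == max(years)]
latest
#> [1] "de2015y.dta" "at2015y.dta"
```

The resulting names can then be fed to list.files as a pattern, or used directly if full paths were kept.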
The following steps should give you what you want:
Load the foreign package:
library(foreign) # or alternatively: library(haven)
create a list of file names
file.list <- list.files(path="E:/Folder", pattern='\\.dta$', full.names = TRUE)
determine which files to read (NOTE: you have to check whether these are the correct positions for substr; they are an estimate on my side)
vec <- as.integer(substr(file.list,13,16))
file.list2 <- file.list[vec==max(vec)]
read the files
df.list <- sapply(file.list2, read.dta, simplify=FALSE)
remove the path from the list names
names(df.list) <- gsub("E:/Folder/", "", names(df.list))
bind the data frames together into one data.frame/data.table and create an id-column as well
library(data.table)
df <- rbindlist(df.list, idcol = "id")
# or with 'dplyr'
library(dplyr)
df <- bind_rows(df.list, .id = "id")
Now you have a data.frame with an id-column that identifies the different original files.
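As a hypothetical follow-up, the id column can be split back into country and year, since both are fixed-position substrings of the file name:

```r
# Toy data frame standing in for the bound result
df <- data.frame(id = c("de2015y", "fr2015y"), x = 1:2)

df$country <- substr(df$id, 1, 2)  # first two letters
df$year    <- substr(df$id, 3, 6)  # four-digit year
df
```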
I would change the working directory for this task...
Then does this do what you are asking for?
setwd("C:/.../yourfiles")
# get file names where the year equals "2015"
name <- list.files(pattern = "\\.dta$")
name <- name[substr(name, 3, 6) == "2015"]
# read the files into a list
files <- lapply(name, foreign::read.dta)
# remove ".dta" from the file names and
# name the file contents in the list accordingly
names(files) <- lapply(name, function(x) substr(x, 1, nchar(x) - 4))
# or alternatively
names(files) <- as.list(substr(name, 1, nchar(name) - 4))
# optional: put all file contents into one data frame
# (the data frames need the same row counts for this last step to work)
mydatafrm <- data.frame(files)
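If the per-file row counts differ, column-binding with data.frame() will fail; row-binding is the usual alternative. A small sketch with toy data:

```r
# Two "files" with different row counts
files <- list(a = data.frame(x = 1:2), b = data.frame(x = 3))

# Stack them instead of placing them side by side
mydatafrm <- do.call(rbind, files)
nrow(mydatafrm)
#> [1] 3
```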
I have created an R script that analyses and manipulates two different data file extensions. For example, one task is to extract certain values from the data and export them as a .txt file. Here is the beginning of my script and the data files that I use:
setwd('C:\\Users\\Zack\\Documents\\RScripts\***')
heat_data="***.heat"
time ="***.timestamp"
ts_heat = read.table(heat_data)
ts_heat = ts_heat[-1,]
rownames(ts_heat) <- NULL
ts_time = read.table(time)
back_heat = subset(ts_heat, V3 == 'H')
back_time = ts_time$V1
library(data.table)
datDT[, newcol := fcoalesce(
nafill(fifelse(track == "H", back_time, NA_real_), type = "locf"),
0)]
last_heat = subset(ts_heat, V3 == 'H')
last_time = last_heat$newcol
x = back_time - last_time
dataest = data.frame(back_time, x)
write_tsv(dataest, file="dataestimation.txt")
I then use those two files to calculate and extract specific values.
So can anyone please tell me how I can run this script on each and every .heat and .timestamp file?
My objective is to calculate and extract these values for each file. I note that each dataset comes as
a .heat and a .timestamp file. I note also that I am a Windows user.
Thank you for your help
You can use list.files
heat_data <- list.files(pattern = "\\.heat$")
time <- list.files(pattern = "\\.timestamp$")
and then process each file in a loop (or use lapply)
for (i in heat_data) {
h <- read.table(i)
# other code
}
for (j in time) {
t <- read.table(j)
# other code
}
you may want to pass the path to list.files as well instead of using setwd:
heat_data <- list.files("your/path/", pattern = "\\.heat$")
After the question edit
Let's say you have 3 .heat files and 3 .timestamp files in your path named
1.heat
2.heat
3.heat
1.timestamp
2.timestamp
3.timestamp
so there is a correspondence between heat and timestamp (given by the file name).
You can read these files with
heat_data <- list.files("your/path/", pattern = "\\.heat$")
time <- list.files("your/path/", pattern = "\\.timestamp$")
At this point, create a function that does exactly what you want. This function takes as input an index and the two file-name vectors:
process_files <- function(i, heat_data, time) {
  ts_heat <- read.table(heat_data[i])
  ts_time <- read.table(time[i])
  #
  # other code
  #
  write_tsv(dataestimation, file = paste0("dataestimation_", i, ".txt"))
}
This way you will have files named dataestimation_1.txt, dataestimation_2.txt and dataestimation_3.txt.
Finally, use lapply to call the function for all the files in the folder
lapply(1:3, process_files, heat_data, time)
This is just one of the possible ways to proceed.
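Another option is to pair the two vectors directly with Map, avoiding the index altogether (a sketch; process_pair and its body are hypothetical placeholders for the analysis code):

```r
# Hypothetical per-pair worker; the body stands in for the original analysis
process_pair <- function(heat_file, time_file, out_file) {
  ts_heat <- read.table(heat_file)
  ts_time <- read.table(time_file)
  # ... analysis code from the original script ...
  # write_tsv(result, file = out_file)
}

heat_data <- list.files("your/path/", pattern = "\\.heat$",      full.names = TRUE)
time      <- list.files("your/path/", pattern = "\\.timestamp$", full.names = TRUE)
out       <- paste0("dataestimation_", seq_along(heat_data), ".txt")

# Calls process_pair once per (heat, timestamp, output) triple
Map(process_pair, heat_data, time, out)
```

This relies on the two list.files calls returning the files in the same (sorted) order, which holds when the base names match.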
I want to work with the Health and Retirement Study in R. Their website provides ".da" files and a SAS extract program. The SAS program reads the ".da" files like a fixed width file:
libname EXTRACT 'c:\hrs1994\sas\' ;
DATA EXTRACT.W2H;
INFILE 'c:\hrs1994\data\W2H.DA' LRECL=358;
INPUT
HHID $ 1-6
PN $ 7-9
CSUBHH $ 10-10
ETC ETC
;
LABEL
HHID ="HOUSEHOLD IDENTIFIER"
PN ="PERSON NUMBER"
CSUBHH ="1994 SUB-HOUSEHOLD IDENTIFIER"
ASUBHH ="1992 SUB-HOUSEHOLD IDENTIFIER"
ETC ETC
;
1) What type of file is this? I can't find anything about this file type.
2) Is there an easy way to read this into R without the intermediate step of exporting a .csv from SAS? Is there a way for read.fwf() to work without explicitly stating hundreds of variable names?
Thank you!
After a little more research, it appears that you can use the Stata dictionary files (*.DCT) to retrieve the formatting for the data files (*.DA). For this to work you will need to download both the "Data files" .zip file and the "Stata data descriptors" .zip file from the HRS website. Just remember to use the correct dictionary file for each data file, i.e., use the "W2FA.DCT" file to define "W2FA.DA".
library(readr)
# Set path to the data file "*.DA"
data.file <- "C:/h94da/W2FA.DA"
# Set path to the dictionary file "*.DCT"
dict.file <- "C:/h94sta/W2FA.DCT"
# Read the dictionary file
df.dict <- read.table(dict.file, skip = 1, fill = TRUE, stringsAsFactors = FALSE)
# Set column names for dictionary dataframe
colnames(df.dict) <- c("col.num","col.type","col.name","col.width","col.lbl")
# Remove last row which only contains a closing }
df.dict <- df.dict[-nrow(df.dict),]
# Extract numeric value from column width field
df.dict$col.width <- as.integer(sapply(df.dict$col.width, gsub, pattern = "[^0-9\\.]", replacement = ""))
# Convert column types to format to be used with read_fwf function
df.dict$col.type <- sapply(df.dict$col.type, function(x) ifelse(x %in% c("int","byte","long"), "i", ifelse(x == "float", "n", ifelse(x == "double", "d", "c"))))
# Read the data file into a dataframe
df <- read_fwf(file = data.file, fwf_widths(widths = df.dict$col.width, col_names = df.dict$col.name), col_types = paste(df.dict$col.type, collapse = ""))
# Add column labels to headers
attributes(df)$variable.labels <- df.dict$col.lbl
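The same steps can be wrapped in a small helper so each DA/DCT pair is handled in one call (a sketch; read_da_dct is a hypothetical name, and the paths in the usage line are the ones from above):

```r
library(readr)

# Hypothetical wrapper: parse a .DCT dictionary, then read the matching
# .DA fixed-width file with those widths, names, and types
read_da_dct <- function(data.file, dict.file) {
  d <- read.table(dict.file, skip = 1, fill = TRUE, stringsAsFactors = FALSE)
  colnames(d) <- c("col.num", "col.type", "col.name", "col.width", "col.lbl")
  d <- d[-nrow(d), ]                                  # drop the closing }
  d$col.width <- as.integer(gsub("[^0-9.]", "", d$col.width))
  types <- ifelse(d$col.type %in% c("int", "byte", "long"), "i",
           ifelse(d$col.type == "float",  "n",
           ifelse(d$col.type == "double", "d", "c")))
  out <- read_fwf(data.file,
                  fwf_widths(d$col.width, col_names = d$col.name),
                  col_types = paste(types, collapse = ""))
  attr(out, "variable.labels") <- d$col.lbl
  out
}

# Usage (paths as above):
# df <- read_da_dct("C:/h94da/W2FA.DA", "C:/h94sta/W2FA.DCT")
```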
I have a series of data files; they look like
sales20160101.txt,
sales20160102.txt, ...,
sales20171231.txt.
Now I want to read them all and combine them, but I also need a date variable
to help me identify their occurrence time, so the date variable will be
20160101, 20160102, ..., 20171231.
My idea is:
split the filename into "sales" + "time"
duplicate the time, whenever I read a file, according to the length of the data
cbind the data and the time.
Thanks a lot.
We could do this with fread and rbindlist from data.table
library(data.table)
#find the files that have names starting as 'sales' followed by numbers
#and have .txt extension
files <- list.files(pattern = "^sale.*\\d+\\.txt", full.names = TRUE)
#get the dates
dates <- readr::parse_number(basename(files))
#read the files into a list and rbind it
dt <- rbindlist(setNames(lapply(files, fread), dates), idcol = 'date')
I usually would do a variation of the following:
# find the files
files <- list.files(pattern = '^sales')
# Get the dates
dates <- gsub('sales', '', tools::file_path_sans_ext(files))
# read in the data
dfs <- lapply(files, read.table)
# match the dates
names(dfs) <- dates
# bind all data together and include the date as a column
df <- dplyr::bind_rows(dfs, .id = 'date')
I have a list of files that are all named similarly: "FlightTrackDATE.txt" where the date is expressed in YYYYMMDD. I read in all the files with the list.files() command, but this gives me all the files in that folder (only flight track files are in this folder). What I would like to do is create a new file that will combine all the files from the last 90 days (or three months, whichever is easier) and ignore the other files.
You can try this :
# date from which you want to consolidate (replace with the required date)
fromDate <- as.Date("2015-12-23")

for (filename in list.files()) {
  # extract the date from the filename using substr (characters 12-19)
  filenameDate <- as.Date(substr(filename, 12, 19), format = "%Y%m%d")
  # read and consolidate if the file date is on or after fromDate
  if ((filenameDate - fromDate) >= 0) {
    # create the consolidated data from the first file
    if (!exists('consolidated')) {
      consolidated <- read.table(filename, header = TRUE)
    } else {
      data <- read.table(filename, header = TRUE)
      # row-bind to consolidate
      consolidated <- rbind(consolidated, data)
    }
  }
}
OUTPUT:
I have three sample files :
FlightTrack20151224.txt
FlightTrack20151223.txt
FlightTrack20151222.txt
Sample data:
Name Speed
AA101 23
Consolidated data:
Name Speed
1 AA102 24
2 AA101 23
Note:
1. Create the from date by subtracting from the current date, or use a fixed date as above.
2. Remember to clean up the existing consolidated data if you run the script again; otherwise data duplication might occur.
3. Save consolidated to file :)
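The exists() bookkeeping (and the cleanup concern in note 2) can be avoided by filtering the names first and binding once; a sketch under the same file-naming assumption:

```r
fromDate <- as.Date("2015-12-23")

# Keep only FlightTrack files dated on or after fromDate, then bind once
files <- list.files(pattern = "^FlightTrack")
dates <- as.Date(substr(files, 12, 19), format = "%Y%m%d")
keep  <- files[!is.na(dates) & dates >= fromDate]

consolidated <- do.call(rbind, lapply(keep, read.table, header = TRUE))
```

Since consolidated is rebuilt from scratch on every run, rerunning the script cannot duplicate rows.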
Consider an lapply() solution without a need for list.files() since you know ahead of time the directory and file name structure:
path = "C:/path/to/txt/files/"  # trailing slash needed, since paste0 is used below
# LIST OF ALL LAST 90 DATES IN YYYYMMDD FORMAT
dates <- lapply(0:90, function(x) format(Sys.Date()-x, "%Y%m%d"))
# IMPORT ALL FILES INTO A LIST OF DATAFRAMES
dfList <- lapply(paste0(path, "FlightTrack", dates, ".txt"),
function(x) if (file.exists(x)) {read.table(x)})
# COMBINE EACH DATA FRAME INTO ONE
df <- do.call(rbind, dfList)
# OUTPUT FINAL FILE TO TXT
write.table(df, paste0(path, "FlightTrack90Days.txt"), sep = ",", row.names = FALSE)