How can I read a ".da" file directly into R?

I want to work with the Health and Retirement Study in R. Their website provides ".da" files and a SAS extract program. The SAS program reads the ".da" files like a fixed width file:
libname EXTRACT 'c:\hrs1994\sas\' ;
DATA EXTRACT.W2H;
INFILE 'c:\hrs1994\data\W2H.DA' LRECL=358;
INPUT
HHID $ 1-6
PN $ 7-9
CSUBHH $ 10-10
ETC ETC
;
LABEL
HHID ="HOUSEHOLD IDENTIFIER"
PN ="PERSON NUMBER"
CSUBHH ="1994 SUB-HOUSEHOLD IDENTIFIER"
ASUBHH ="1992 SUB-HOUSEHOLD IDENTIFIER"
ETC ETC
;
1) What type of file is this? I can't find anything about this file type.
2) Is there an easy way to read this into R without the intermediate step of exporting a .csv from SAS? Is there a way for read.fwf() to work without explicitly stating hundreds of variable names?
Thank you!

After a little more research, it appears that you can use the Stata dictionary files (*.DCT) to retrieve the formatting for the data files (*.DA). For this to work you will need to download both the "Data files" .zip file and the "Stata data descriptors" .zip file from the HRS website. Just remember to use the correct dictionary file with each data file, i.e., use the "W2FA.DCT" file to define "W2FA.DA".
library(readr)
# Set path to the data file "*.DA"
data.file <- "C:/h94da/W2FA.DA"
# Set path to the dictionary file "*.DCT"
dict.file <- "C:/h94sta/W2FA.DCT"
# Read the dictionary file
df.dict <- read.table(dict.file, skip = 1, fill = TRUE, stringsAsFactors = FALSE)
# Set column names for dictionary dataframe
colnames(df.dict) <- c("col.num","col.type","col.name","col.width","col.lbl")
# Remove last row which only contains a closing }
df.dict <- df.dict[-nrow(df.dict),]
# Extract numeric value from column width field
df.dict$col.width <- as.integer(gsub("[^0-9.]", "", df.dict$col.width))
# Convert column types to the compact type codes used by the read_fwf function
df.dict$col.type <- ifelse(df.dict$col.type %in% c("int", "byte", "long"), "i",
                    ifelse(df.dict$col.type == "float", "n",
                    ifelse(df.dict$col.type == "double", "d", "c")))
# Read the data file into a dataframe
df <- read_fwf(file = data.file, fwf_widths(widths = df.dict$col.width, col_names = df.dict$col.name), col_types = paste(df.dict$col.type, collapse = ""))
# Add column labels to headers
attributes(df)$variable.labels <- df.dict$col.lbl
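Since each wave comes as several data file / dictionary pairs, it can help to wrap the steps above in a small helper so that the matching dictionary is always applied to its data file. The following is only a sketch of the same logic; the function name read_hrs_da is my own and not part of any package.
library(readr)
# Sketch: read one HRS ".DA" data file using its matching ".DCT" dictionary file
read_hrs_da <- function(data.file, dict.file) {
  df.dict <- read.table(dict.file, skip = 1, fill = TRUE, stringsAsFactors = FALSE)
  colnames(df.dict) <- c("col.num", "col.type", "col.name", "col.width", "col.lbl")
  df.dict <- df.dict[-nrow(df.dict), ]                    # drop the closing "}"
  df.dict$col.width <- as.integer(gsub("[^0-9.]", "", df.dict$col.width))
  types <- ifelse(df.dict$col.type %in% c("int", "byte", "long"), "i",
           ifelse(df.dict$col.type == "float", "n",
           ifelse(df.dict$col.type == "double", "d", "c")))
  df <- read_fwf(data.file,
                 fwf_widths(df.dict$col.width, col_names = df.dict$col.name),
                 col_types = paste(types, collapse = ""))
  attr(df, "variable.labels") <- df.dict$col.lbl
  df
}
# e.g. df.w2fa <- read_hrs_da("C:/h94da/W2FA.DA", "C:/h94sta/W2FA.DCT")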

Related

Cannot combine files in list of files when opening multiple .dta files [duplicate]

I have a folder with more than 500 .dta files. I would like to load some of these files into a single R object.
My .dta files have a generic name composed of four parts: 'two letters/four digits/y/.dta'. For instance, a name can be 'de2015y.dta' or 'fr2008y.dta'. Only the parts corresponding to the two letters and the four digits change across the .dta files.
I have written code that works, but I am not satisfied with it. I would like to avoid using a loop and to shorten it.
My code is:
# Select the .dta files I want to load
#.....................................
name <- list.files(path="E:/Folder") # names of the .dta files in the folder
db <- as.data.frame(name)
db$year <- substr(db$name, 3, 6)
db <- subset (db, year == max(db$year)) # keep last year available
db$country <- substr(db$name, 1, 2)
list.name <- as.list(db$country)
# Loading all the .dta files in the Global environment
#..................................................
for(i in c(list.name)){
obj_name <- paste(i, '2015y', sep='')
file_name <- file.path('E:/Folder',paste(obj_name,'dta', sep ='.'))
input <- read.dta13(file_name)
assign(obj_name, value = input)
}
# Merge the files into a single object
#..................................................
df2015 <- rbind (at2015y, be2015y, bg2015y, ch2015y, cy2015y, cz2015y, dk2015y, ee2015y, ee2015y, es2015y, fi2015y,
fr2015y, gr2015y, hr2015y, hu2015y, ie2015y, is2015y, it2015y, lt2015y, lu2015y, lv2015y, mt2015y,
nl2015y, no2015y, pl2015y, pl2015y, pt2015y, ro2015y, se2015y, si2015y, sk2015y, uk2015y)
Does anyone know how I can avoid using a loop and shorten my code?
You can also use purrr for your task.
First create a named vector of all files you want to load (as I understand your question, you simply need all files from 2015). The setNames() part is only necessary in case you'd like an ID variable in your data frame and it is not already included in the .dta files.
After that, simply use map_df() to read all files and return a data frame. Specifying .id is optional and results in an ID column the values of which are based on the names of in_files.
library(purrr)
library(haven)
in_files <- list.files(path="E:/Folder", pattern = "2015y", full.names = TRUE)
in_files <- setNames(in_files, in_files)
df2015 <- map_df(in_files, read_dta, .id = "id")
The following steps should give you what you want:
Load the foreign package:
library(foreign) # or alternatively: library(haven)
create a list of file names
file.list <- list.files(path="E:/Folder", pattern='*.dta', full.names = TRUE)
determine which files to read (NOTE: check that 13 and 16 are the correct positions for substr; this is an estimate on my side)
vec <- as.integer(substr(file.list,13,16))
file.list2 <- file.list[vec==max(vec)]
read the files
df.list <- sapply(file.list2, read.dta, simplify=FALSE)
remove the path from the listnames
names(df.list) <- gsub("E:/Folder/","",names(df.list))
bind the dataframes together into one data.frame/data.table and create an id-column as well
library(data.table)
df <- rbindlist(df.list, idcol = "id")
# or with 'dplyr'
library(dplyr)
df <- bind_rows(df.list, .id = "id")
Now you have a data.frame with an id-column that identifies the different original files.
I would change the working directory for this task...
Then does this do what you are asking for?
setwd("C:/.../yourfiles")
# get file names where year equals "2015"
name=list.files(pattern="*.dta")
name=name[substr(name,3,6)=="2015"]
# read in the files in a list
files=lapply(name,foreign::read.dta)
# remove ".dta" from file names and
# give the file contents in the list their name
names(files)=lapply(name,function(x) substr(x, 1, nchar(x)-4))
#or alternatively
names(files)=as.list(substr(name,1,nchar(name)-4))
# optional: put all file contents into one data-frame
#(data-frames (vectors) need to have the same row counts (lengths) for this last step to work)
mydatafrm = data.frame(files)

renaming existing .csv files within a folder

I have a folder called "data" which contains .csv files with data from individual participants. The filename of each participant's .csv file is an alphanumeric code assigned to them (which is also stored in a column within the .csv called "ppt"), plus the word "data" and the hour they completed the study (e.g., 13 = 1pm).
So for example, the filename of this participant .csv would be "3ht2phfu7data13.csv"
ppt        choice  error
3ht2phfu7  d       0
3ht2phfu7  d       0
3ht2phfu7  k       1
whilst the filename of this participant's .csv would be "3a5tzoirpdata15.csv"
ppt        choice  error
3a5tzoirp  k       1
3a5tzoirp  d       0
3a5tzoirp  k       1
These are just 2 examples, but there are 60 individual .csv files in total.
I am trying to rename each of these files, so that instead of containing the participant alphanumeric code, each participant is assigned a number ranging from 1 to 60. So for example, instead of an individual participant file being named "3ht2phfu7data.csv", I'd like it to be named "1data.csv", and for the ppt column to also change to be "1" for each row (to match the new filename), rather than the "3ht2phfu7" that it currently is.
Then going along with another example, for "3a5tzoirpdata.csv" to be named "2data.csv" and for the ppt column to also change to be "2" for each row (to match the new filename). And then so on with the remaining 58 .csv files in the folder.
I have tried the following code; no error message appears, but it is not producing amended .csv files. Any help would be really appreciated.
files <- list.files(path = 'data/', pattern = '*.csv', full.names = TRUE)
sapply(files, function(file){
x <- read.csv(file)
x$participant <- c(1:60)
write.csv(paste0(x, "data", file))
}
You had the right idea, but there were some problems in the sapply.
You can't iterate over filenames if you want to assign a consecutive number.
In the write.csv call, the object to write to file was missing. For the file name, we first have to extract the file's directory with dirname and then add the desired filename.
files <- list.files(path = 'data/', pattern = '*.csv', full.names = TRUE)
sapply(seq_along(files), function(i){
  # read file
  x <- read.csv(files[i])
  # change participant code to consecutive number
  x$participant <- i
  write.csv(x, paste0(dirname(files[i]), "/", i, "data.csv"),
            row.names = FALSE, quote = FALSE)
})
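Note that in your sample files the identifier column is called ppt rather than participant, so if you want the existing column overwritten, adjust the assignment accordingly. A sketch of the same loop with that change, plus an optional clean-up of the originals (only run the clean-up once the renamed copies look right):
files <- list.files(path = 'data/', pattern = '*.csv', full.names = TRUE)
sapply(seq_along(files), function(i){
  x <- read.csv(files[i])
  # overwrite the existing ppt column with the participant's new number
  x$ppt <- i
  write.csv(x, paste0(dirname(files[i]), "/", i, "data.csv"),
            row.names = FALSE, quote = FALSE)
})
# optional: remove the original files afterwards
# file.remove(files)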

R: how to find select files in a folder based on matching specific column title

Sorry for the generic question. I'm looking for pointers for sorting out a data folder in which I have numerous .txt files. All of them have different titles, and for the vast majority of them the files have the same dimensions, that is, the number of columns is the same. However, the pain is that some of the files, despite having the same number of columns, have different column names; in those files, some other variables were measured.
I want to weed out these files, and I cannot do this by simply comparing column counts. Is there any method where I can pass the name of a column and check how many files in the directory have that column, so that I can move them into a different folder?
UPDATE:
I have created a dummy folder with files that reflect the problem.
Please see the link below to access the files on my Google Drive. In this folder, I have put 4 files that have the problem columns.
https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing
The problem is that the code seems to be able to find files matching the selection criteria, i.e. the actual names of the problem columns, but I cannot extract the real index of such files in the list. Any pointers?
library(data.table)
#read in the example file that have the problem column content
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT", header = T, sep = "\t")
#read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT", header = T, sep = "\t")
#get the names of columns of each file
standar.names <- names(df_standard)
var.names <- names(df_var)
same.titles <- var.names %in% standar.names
dff.titles <- !var.names %in% standar.names
#confirm that the only 3 problem columns are columns 129, 130 and 131
mismatched.names <- colnames(df_var[129:131])
#visually check the names of the problematic columns
mismatched.names
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],
sep = "\t",
header = T,
nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get unique names of files
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
#here the "to_keep" returns an integer vector that I don't understand
#I thought the numbers should represent the ID/index of the elements,
#but I have fewer than 10 files, while the numbers in to_keep are around 1000
#this is probably because it's matching the actual index of the unlisted list,
#but if I use to_keep <- which(column_names %in% unique_names[1]) it returns an empty vector
to_keep <- which(unlist(column_names) %in% unique_names[1])
#now if I slice the file list using to_keep, files_to_keep returns NA NA NA
files_to_keep <- files_in_wd[to_keep]
#once I have a list of targeted files, I can move them into a new folder using file.move
library(filesstrings)
file.move(files_to_keep, "C:/Users/mli/Desktop/weeding/need to reanalysis" )
If you can distinguish the files you'd like to keep from those you'd like to drop depending on the column names, you could use something along these lines:
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],
sep = ';',
header = T,
nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get unique names of files
unique_names <- unique(column_names)
# decide which files to keep
to_keep <- which(column_names %in% unique_names[1])
files_to_keep <- files_in_wd[to_keep]
If you have many files you should probably avoid the loop or just read in the header of the corresponding file.
edit after your comment:
by adding nrows = 2 the code only reads the first 2 rows plus the header.
I assume that the first file in the folder has the structure you'd like to keep; that's why column_names is checked against unique_names[1].
files_to_keep contains the names of the files you'd like to keep.
You could try running that on a subset of your data, see if it works, and worry about efficiency later. A vectorized approach might work better, I think; a header-only sketch follows.
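For example (a sketch, not tested on your data; it assumes the files are tab-separated, so that identical header lines mean identical column names, and it reuses the files_in_wd vector from above):
# read only the first line (the header) of every file and compare it to the first file's header
headers <- vapply(files_in_wd, function(f) readLines(f, n = 1), character(1))
files_to_move <- files_in_wd[headers != headers[1]]
files_to_move  # files whose header differs from the reference file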
edit:
This code works with your dummy-data.
library(filesstrings)
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],
sep = "\t",
header = T,
nrows = 2,
encoding = "UTF-8",
check.names = FALSE
)
}
# get column names of all files
column_names <- lapply(l_files, names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok
# check if the other files have the same header:
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),
'filename' = files_in_wd,
'keep' = NA)
for(i in 2:length(files_in_wd)){
df_filehelper$keep[i] <- identical(to_keep, column_names[[i]])
}
df_filehelper$keep[1] <- TRUE # keep the original file used for selecting the right columns
# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects file that are not to be kept
file.move(files_to_move, "C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
Due to the large number and size of files it might be worth looking at alternatives to R, e.g. in bash:
for f in ctrl*.txt
do
if [[ "$(head -1 ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt | md5)" != "$(head -1 $f | md5)" ]]
then echo "$f"
fi
done
This command compares the column names of the 'good file' to the column names of every file and prints out the names of files that do not match.
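If you prefer to stay in R and literally pass one of the problem column names, as the question asks, something along these lines should work (a sketch; has_column is my own helper, and mismatched.names and files_in_wd come from the code above):
# read only the header of each file and test whether it contains a given column name
has_column <- function(file, col) {
  hdr <- names(read.delim(file, sep = "\t", header = TRUE, nrows = 1, check.names = FALSE))
  col %in% hdr
}
bad_files <- files_in_wd[sapply(files_in_wd, has_column, col = mismatched.names[1])]
bad_files  # the files that contain the problem column, ready to be moved elsewhere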

searching multiple txt files for data and reporting result to new table

I have thousands of txt files containing Mass, %Base data. I need to search each file for a row within a specific mass range, then report that row into a new table with the filename as an additional character column. The goal is a table of (Mass, %Base, Filename) for all of the text files, based on the condition of the search.
Existing File example for file1name.txt:
Mass %Base
100 .1
101 26.2
...
900 0
Goal:
Mass %Base File
375.004 98 file1name
375.003 96 file2name
My current code is:
library(tidyverse)
library(readr)
#setwd to where data is located
setwd("Z:/Dnigra")
#set path where data is located
path <- "Z:/Dnigra"
mc <- 375.3 #mc is the calculated target mass
limit<- 0.1 # the width of the search window
#finds the files with the correct extensions
fs <-list.files(path, pattern=glob2rx("*.txt$"))
for (f in fs){
fname <- file.path(path, f)
df <- read_tsv(fname,col_names=FALSE, skip =1)
#filters the data that includes the target mass
df <- between(mc,limit,limit)
#create new data based on contents
allSpectra <- data.frame(df,f)
#write new data to sep file
write.table(allSpectra ,"allwobble.csv",
append= T,
sep=",",
row = F
)
}
The end result is a table with:
df f
FALSE filename
Also errors:
Parsed with column specification: cols( X1 = col_character(), X2 = col_character() ) Warning: 2536 parsing failures.
I think there may be a few things here to address:
First, with read_tsv you might want to specify the column types as double if appropriate, so values are not read in as character strings. This would affect your ability to filter and subset based on Mass.
Next, the between statement has the syntax of:
between(x, left, right)
where x <= right and x >= left. If you want to make sure your mc value is between 375.2 and 375.4 you might want between(X1, mc-limit, mc+limit) instead. Note that since no header was read in, the Mass variable is assumed first as X1.
When you use write.table and append, you might want to set col.names to FALSE (or include header on first write).
Hope this is helpful to you.
for (f in fs){
fname <- file.path(path, f)
df <- read_tsv(fname, col_names = FALSE, skip=1, col_types = "dd")
#filters the data that includes the target mass
df <- filter(df, between(X1, mc-limit, mc+limit))
#create new data based on contents
allSpectra <- data.frame(df,f)
#write new data to sep file
write.table(allSpectra, "allwobble.csv",
            append = TRUE,
            sep = ",",
            row.names = FALSE,
            col.names = FALSE)
}
Thanks @Ben. I had gotten to that point last night and had added a tolerance calculation. The "dd" definitely helped, but it required specifying col_names to get past another error. The final code is below. A parsing error comes up, but it does what it needs to do!
tol<- .02 # the width of the search window
mmneg <- mc - tol
mmpos <- mc + tol
#finds the files with the correct extensions
fs <-list.files(path, pattern=glob2rx("*.txt$"))
for (f in fs){
fname <- file.path(path, f)
df <- read_tsv(fname, skip =1,skip_empty_rows = T, col_types="dd", col_names=c("X1","X2"))
#filters the data that excludes the offending peak
df<- filter(df,between(X1,mmneg,mmpos))
#create new data based on contents
allSpectra <- data.frame(df,f)
#write new data to sep file
write.table(allSpectra, "Caviunin_20_.csv",
            append = TRUE,
            sep = ",",
            row.names = FALSE,
            col.names = FALSE)
}
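If the repeated append-writes ever become a bottleneck, an alternative is to collect the filtered rows from every file in a list and write the combined table once at the end. A sketch using the same objects as above (the output file name is only an example):
library(dplyr)
library(readr)
# read, filter and tag each file, then bind everything and write a single csv
res <- lapply(fs, function(f) {
  df <- read_tsv(file.path(path, f), skip = 1, col_types = "dd", col_names = c("X1", "X2"))
  df <- filter(df, between(X1, mmneg, mmpos))
  if (nrow(df) > 0) cbind(df, file = f) else NULL
})
allSpectra <- do.call(rbind, res)   # NULL entries (files with no match) are dropped by rbind
write.csv(allSpectra, "allwobble_single_write.csv", row.names = FALSE)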

How to read and name different CSV files in R

I would like to work on several csv files to make some comparisons, so I wrote this code to read the different csv files I have:
path <- "C:\\data\\"
files <- list.files(path=path, pattern="*.csv")
for(file in files)
{
perpos <- which(strsplit(file, "")[[1]]==".")
assign(
gsub(" ","",substr(file, 1, perpos-1)),
read.csv(paste(path,file,sep="")))
}
My csv files are something like this:
Start Time,End Time,Total,Diffuse,Direct,Reflected
04/09/14 00:01:00,04/09/14 00:01:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
04/09/14 00:02:00,04/09/14 00:02:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
04/09/14 00:03:00,04/09/14 00:03:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
(...)
Using my code, R separates all the files correctly, but for each of them it creates a table with an extra empty column at the beginning:
|Start Time |End Time |Total |Diffuse |Direct |Reflected
04/09/14 00:01:00|04/09/14 00:01:00|2.221220E-003|5.797364E-004|0.000000E+000|1.641484E-003|NA
...
How can I fix it?
Moreover, considering that the original name of each file is really long, is it possible to name each data.frame using the last letters of the file? Or just a cardinal number?
I would suggest using the data.table package - it's faster, and for the blank column at the end it converts the values to NA (in my experience). Here's some code I wrote for a similar task:
library(data.table)
read_func <- function(z) {
dat <- fread(z, stringsAsFactors = FALSE)
names(dat) <- c("start_time", "end_time", "Total", "Diffuse", "Direct", "Reflect")
dat$start_time <- as.POSIXct(strptime(dat$start_time,
format = "%d/%m/%y %H:%M:%S"), tz = "Pacific/Easter")
patrn <- "([0-9][0-9][0-9])\\.csv"
dat$type <- paste("Dataset",gsub(".csv", "", regmatches(z,regexpr(patrn, z))),sep="")
return(as.data.table(dat))
}
path <- ".//Data/"
file_list <- dir(path, pattern = "csv")
file_names <- unname(sapply(file_list, function(x) paste(path, x, sep = "")))
data_list <- lapply(file_names, read_func)
dat <- rbindlist(data_list, use.names = TRUE)
rm(path, file_list, file_names)
This will give you a list with each item as the data.table from the corresponding file name. I assumed that all file names have a three-digit number before the extension, which I used to assign a type variable to each data.table. You can change patrn to match your specific use case. This way, when you combine all of them into a single data.table dat, you can always sort/filter based on type. For example, if you wanted to plot Diffuse vs Direct for Dataset158 and Dataset222, you could do the following:
library(ggplot2)
ggplot(data = dat[type == 'Dataset158' | type == 'Dataset222'],
       aes(x = Diffuse, y = Direct)) + geom_point()
Hope this helps!
You're having a problem because your csv files have a blank column at the end... which makes your data end in a comma:
04/09/14 00:01:00,04/09/14 00:01:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
This leads R to think your data consists of 7 columns rather than 6. The correct solution is to resave all your csv files correctly. Otherwise, R will see 7 columns but only 6 column names, and will logically treat the first column as row names. Here you can apply the patch we came up with together with @konradrudolph:
library(tibble)
library(dplyr)
df %>% rownames_to_column() %>% setNames(c(colnames(.)[-1], 'DROP')) %>% select(-DROP)
where df is the data read from the csv. But patches like this can lead to unexpected results, so it's better to save the csv files correctly.
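If resaving all the files isn't practical, another workaround is to read the header and the data separately, so the trailing blank field gets its own throwaway column that you can drop afterwards (a base R sketch; "DROP" is an arbitrary name and file stands for the path to one of your csv files):
# read the six real column names, then read the data with an explicit seventh dummy name
hdr <- names(read.csv(file, nrows = 1))
df <- read.csv(file, skip = 1, header = FALSE, col.names = c(hdr, "DROP"))
df$DROP <- NULL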
