Using unz() to read a SAS data set into R

I am trying to read in a data set from SAS using the unz() function in R. I do not want to unzip the file. I have successfully used the following to read one of them in:
dir <- "C:/Users/michael/data/"
setwd(dir)
dir_files <- as.character(unzip("example_data.zip", list = TRUE)$Name)
ds <- read_sas(unz("example_data.zip", dir_files))
That works great. I'm able to read the data set in and conduct the analysis. When I try to read in another data set, though, I encounter an error:
dir2_files <- as.character(unzip("data.zip", list = TRUE)$Name)
ds2 <- read_sas(unz("data.zip", dir2_files))
Error in read_connection_(con, tempfile()) :
Evaluation error: error reading from the connection.
I have read other questions on here saying that the file path may be incorrectly specified. Some answers mentioned submitting list.files() to the console to see what is listed.
list.files()
[1] "example_data.zip" "data.zip"
As you can see, both zip files are listed, and I was able to read the data set in from "example_data.zip", but I cannot read anything from "data.zip".
What am I missing? Thanks in advance.

Your "dir2_files" is String vector of the names of different files in "data.zip". So for example if the files that you want to read have them names at the positions "k" in "dir_files" and "j" in "dir2_files" then let update your script like that:
dir <- "C:/Users/michael/data/"
setwd(dir)
dir_files <- as.character(unzip("example_data.zip", list = TRUE)$Name)
ds <- read_sas(unz("example_data.zip", dir_files[k]))
dir2_files <- as.character(unzip("data.zip", list = TRUE)$Name)
ds2 <- read_sas(unz("data.zip", dir2_files[j]))

Related

How can I import several .txt files into R using loops for text analysis

I'm working on a bibliometric analysis in r that requires me to work with data obtained from the Web of Science database. I import my data into r using the following code:
file1 <- "data1.txt"
data1 <- convert2df(file = file1, dbsource = "isi", format = "plaintext")
I have about 35 text files that I need to repeat this code for. I would like to do so using loops, something I don't have much experience with. I tried something like this but it did not work:
list_of_items <- c("file1", "file2")
dataset <- vector(mode = "numeric", length = length(list_of_items))
for (i in list_of_items){
dataset[i] <- convert2df(file = list_of_items[i], dbsource = "isi", format = "plaintext")
print(dataset)}
I get the following error:
Error in file(con, "r") : invalid 'description' argument
I'm not very familiar with using loops but I need to finish this work. Any help would be appreciated!
R wants to open file1, but you only have file1.txt. The filenames in the list are incorrect.
I once had that problem as well; maybe this solution works for you too. Put all the text files in one folder and read the whole folder, which might be easier.
library(bibliometrix) # convert2df() comes from bibliometrix
FOLDER_PATH <- "C:\\your\\path\\here" # paste the path from the Explorer bar (Windows); replace \ with \\
file_paths <- list.files(path = FOLDER_PATH,
pattern = "\\.txt$", # only pick up the text files
full.names = TRUE # so you do not have to change the working dir
)
# using lapply you do not need a for loop, but this is optional
dataset <- lapply(file_paths, convert2df, dbsource = "isi", format = "plaintext")
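If the goal is one combined data frame, and assuming every file converts to the same set of columns, you could then name and bind the list (a sketch, not part of the original answer):
names(dataset) <- basename(file_paths) # remember which element came from which file
combined <- do.call(rbind, dataset) # works if all files yield the same columns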

R Function to predict which csv file would not be modified

I am trying to identify which types of csv files would not be modified in the future.
There are 540 csv files in one folder, and only 518 of them get modified. Basically, I wrote code to read and prepare these files so a Java application can modify them; running it from the terminal on Linux is what does the modification.
This is what terminal shows:
data_3_5.csv
Error in mapmatching or profiling!
No edge matches found for path. Too short? Sequence size 2
directory <- "/path/folder"
directory_jar <- "/path/path.jar"
setwd(directory)
file_names <-list.files(directory)
predict(file_names, model, filename="", fun=predict, ext=NULL,
const=NULL, index=1, na.rm=TRUE)
I think it fails only for the files that are too short? Maybe I should just write code that calculates the length of the columns in every csv file and flags the ones smaller than some n?
Welcome, and good job posting some code. You're pretty close; the predict function is used in modelling, though. Try this:
directory <- "/path/folder"
directory_jar <- "/path/path.jar"
setwd(directory)
## a little protection to ensure we only pick up csvs
file_names <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
## ^ ok so the above gives us all the filenames, but we haven't read them in yet...
## so let's create a function that reads the files in and counts how many columns in each.
library(tidyverse)
## if the above fails, run install.packages("tidyverse")
## let's create a function that will open the csv file and read the number of columns for each.
openerFun <- function(x){ ## here x is the input, or the path
openedFile <- read.csv(x, stringsAsFactors = FALSE) ## open the file
numCols <- ncol(openedFile) ## Count columns
tibble(name = x, numCols = numCols) ## output the file with the # columns
}
## now call it with map_dfr instead of map, which is better here because it binds everything into one nice data frame
map_dfr(file_names, openerFun)
Once you have that, you can use it to compare against which files failed... hopefully that will help!
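As a follow-up sketch (not from the original answer): if you also collect the names of the failing files from the terminal log into a character vector, say failed_files, you can flag them against the column counts and see whether short files really are the culprits.
col_counts <- map_dfr(file_names, openerFun)
failed_files <- c("data_3_5.csv") # hypothetical; fill in from the Java/terminal log
col_counts %>%
mutate(failed = basename(name) %in% failed_files) %>%
arrange(numCols) # short files should rise to the top if length is the issue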

How to get a vector of the file names contained in a tempfile in R?

I am trying to automatically download a bunch of zipfiles using R. These files contain a wide variety of files, I only need to load one as a data.frame to post-process it. It has a unique name so I could catch it with str_detect(). However, using tempfile(), I cannot get a list of all files within it using list.files().
This is what I've tried so far:
temp <- tempfile()
download.file("https://url/file.zip", destfile = temp)
files <- list.files(temp) # this is where I only get "character(0)"
# After, I'd like to use something along the lines of:
data <- read.table(unz(temp, str_detect(files, "^file123.txt"), header = TRUE, sep = ";")
unlink(temp)
I know that the read.table() command probably won't work, but I think I'll be able to figure that out once I get a vector with the list of the files within temp.
I am on a Windows 7 machine and I am using R 3.6.0.
Following what was said before, this structure should let you check that the download worked and then list the archive's contents. Note that list.files() lists a directory, not the inside of a zip file, so call unzip(..., list = TRUE) on the downloaded archive instead:
temp <- tempfile(fileext = ".zip")
download.file("https://url/file.zip", destfile = temp, mode = "wb")
files <- unzip(temp, list = TRUE)$Name
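From there, a sketch of the read step, keeping the asker's example name file123.txt and the semicolon separator as assumptions:
target <- grep("^file123", files, value = TRUE)[1] # the one file you need
data <- read.table(unz(temp, target), header = TRUE, sep = ";")
unlink(temp)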

Assigning Directory as a Variable in R

I need to create a function called PollutantMean with the following arguments: directory, pollutant, and id = 1:332.
I have most of the code written but I can't figure out how to assign my directory as a variable. My current working directory is C:/Users/User/Documents. I tried writing the variable as:
directory <- "C:/Users/User/specdata" and that didn't work.
Next I tried the following:
directory <- list.files("specdata", full.names=TRUE) and that didn't work either.
Any ideas on how to change this?
If you are trying to assign your current working directory to the variable "directory", why not take the simple route and add:
directory <- getwd()
This assigns the path of the current working directory to the variable "directory".
I've worked with directories as variables before; I usually declare them like this,
directory <- "C:/Users/User/specdata/"
to reuse your example.
Then, if I want to read a specific file in this directory, I will just go like :
read.table(paste(directory,"myfile.txt",sep=""),...)
It's the same process to write in a file
write.table(res,file=paste(directory,"myfile.txt",sep=""),...)
Does this help?
EDIT: you can then use read.csv and it will work fine.
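A related sketch, purely as a design note: file.path() builds the same paths without you having to manage trailing slashes or sep = "" yourself (myfile.txt and res are just the placeholders from the example above).
directory <- "C:/Users/User/specdata"
read.table(file.path(directory, "myfile.txt"))
write.table(res, file = file.path(directory, "myfile.txt"))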
I think you are confused by the assignment operation in R. The following line
directory <- "C:/Users/User/specdata"
assigns a string to a new object that just happened to be called directory. It has the same effect on your working environment as
elephant <- "C:/Users/User/specdata"
To change where R reads its files, use the function setwd (short for set working directory):
setwd("C:/Users/User/specdata")
You can also specify full path names to functions that read in data (like read.table). For your specific problem,
# creates a character vector with the full paths of all files ending in `csv` (i.e. all csv files)
all.specdata.files <- list.files(path = "C:/Users/User/specdata", pattern = "csv$", full.names = TRUE)
# creates a list resulting from the application of `read.csv` to
# each of these files (which may be slow!!)
all.specdata.list <- lapply(all.specdata.files, read.csv)
Then we use dplyr::bind_rows (the current replacement for the old rbind_all) to row-bind them into one data frame.
library(dplyr)
all.specdata <- bind_rows(all.specdata.list)
Then use colMeans to determine the grand means. Not sure how to do this without seeing the data.
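For the colMeans step, one generic sketch that avoids guessing column names is to average every numeric column; adapt it once you know which column holds the pollutant values.
num_cols <- sapply(all.specdata, is.numeric)
colMeans(all.specdata[, num_cols, drop = FALSE], na.rm = TRUE)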
Assuming that the columns in each of the 300+ csv files are the same, that is, column j contains the same type of data in every file, the following example should be of use:
# let's use a temp directory for storing the files
tmpdr <- tempdir()
# Let's create a large matrix of values and then split it into many different
# files
original_data <- data.frame(matrix(rnorm(10000L), nrow = 1000L))
# write each row to a file
for(i in seq(1, nrow(original_data), by = 1)) {
write.csv(original_data[i, ],
file = paste0(tmpdr, "/", formatC(i, format = "d", width = 4, flag = 0), ".csv"),
row.names = FALSE)
}
# get a character vector with the full path of each of the files
files <- list.files(path = tmpdr, pattern = "\\.csv$", full.names = TRUE)
# read each file into a list
read_data <- lapply(files, read.csv)
# bind the read_data into one data.frame,
read_data <- do.call(rbind, read_data)
# check that our two data.frames are the same.
all.equal(read_data, original_data)
# [1] TRUE

R loop for anova on multiple files

I would like to execute anova on multiple datasets stored in my working directory. I have come up so far with:
files <- list.files(pattern = ".csv")
for (i in seq_along(files)) {
mydataset.i <- files[i]
AnovaModel.1 <- aov(DES ~ DOSE, data=mydataset.i)
summary(AnovaModel.1)
}
As you can see I am very new to loops and cannot make this work. I also understand that I need to add code to append all the summary outputs to one file. I would appreciate any help guiding me to a working loop that can run anovas on multiple .csv files in the directory (same headers) and save the outputs for the record.
You might want to use list.files with full.names = TRUE in case your working directory is not the same as the files' directory.
files <- list.files("path_to_my_dir", pattern="*.csv", full.names = T)
# use lapply to loop over all files
out <- lapply(1:length(files), function(idx) {
# read the file
this.data <- read.csv(files[idx], header = TRUE) # choose TRUE/FALSE accordingly
aov.mod <- aov(DES ~ DOSE, data = this.data)
# if you want just the summary as object of summary.aov class
summary(aov.mod)
# if you require it as a matrix, comment the previous line and uncomment the one below
# as.matrix(summary(aov.mod)[[1]])
})
head(out)
This should give you a list with each entry of the list having a summary matrix in the same order as the input file list.
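If you want to know which summary came from which file once the list is back, one optional addition is to name the list elements after the input files (mydata1.csv is just a hypothetical name to show the lookup).
names(out) <- basename(files)
out[["mydata1.csv"]]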
Your error is that your loop never loads your data. Your list of file names is in "files"; you then move through that list and set mydataset.i equal to the name of the file matching your iterator i, but then you try to run aov on the file name stored in mydataset.i rather than on the data itself!
The command you are looking for to redirect your output to a file is sink. Consider the following:
sink("FileOfResults.txt") #starting the redirect to the file
files <- list.files("path_to_my_dir", pattern="*.csv", full.names = T) #using the fuller code from Arun
for (i in seq_along(files)){
mydataset.i <- files[i]
mydataset.d <- read.csv(mydataset.i) #this line is new
AnovaModel.1 <- aov(DES ~ DOSE, data=mydataset.d) #this line is modified
print(summary(AnovaModel.1))
}
sink() #ending the redirect to the file
I prefer this approach to Arun's because the results are stored directly to the file without jumping through a list and then having to figure out how to store the list to a file in a readable fashion.
