How to automate read.csv command in R?

I'm doing something stupid and I cannot get write.csv to write a lot of files.
If I write:
write.csv(X1, file = "X1.csv")
Then it writes a ~2mb csv file which is ok. I have around 2000 variables in memory and I've tried
for (i in seq_along(fotos)) {
  write.csv(paste("X", i, sep = ""), file = paste(paste("X", i, sep = ""), "csv", sep = "."))
}
I obtain the desired files, but they are only ~2kb and X1.csv contains just one cell saying "X1.csv". All the files are like this: X1000.csv contains only "X1000.csv". This is unlike the command write.csv(X1, file = "X1.csv"), which creates a file X1.csv containing a 96x96 matrix.
Any idea of what I'm doing wrong?
Many thanks in advance.

You can get the object by name with the function get(). However, it is much better to read the data frames into a list than into separate objects related only by a common naming scheme.
So you can create a list of the data frames:
X <- lapply(seq_along(fotos), function(i) get(paste0("X", i)))
names(X) <- fotos
And then write them (and this is what you'd use if you had a list to start with):
lapply(names(X), function(name) write.csv(X[[name]], paste(name, 'csv', sep='.')))
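As an aside, if the objects originally came from files, a minimal sketch of reading them straight into a named list instead of separate objects (filenames here is a hypothetical character vector of .csv paths):
filenames <- list.files(pattern = "\\.csv$")
X <- setNames(lapply(filenames, read.csv), sub("\\.csv$", "", filenames))
With everything in one structure, get() is never needed.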

You could try using the get() function:
for (i in seq_along(fotos)) {
  write.csv(get(paste0("X", i)),
            file = paste0("X", i, ".csv"))
}

Related

Can I automate an increasing value in a file name in R?

So I have .csv's of nesting data that I need to trim. I wrote a series of functions in R that do the trimming and then spit out a new, clean .csv. The issue is that I need to do this with 59 .csv's and I would like to automate the file name.
data1 <- read.csv("Nest001.csv", skip = 3, header=F)
# functions functions functions
write.csv(data1, file.path(out.path, "Nest001_NEW.csv"), row.names = FALSE)
So... is there any way for me to loop the name from Nest001 to Nest059 so that I don't have to delete and retype the name for every .csv?
EDIT to incorporate Gregor's suggestion:
One option:
filenames_in <- sprintf("Nest%03d.csv", 1:59)
filenames_out <- sub(pattern = "(\\d{3})(\\.)", replacement = "\\1_NEW\\2", filenames_in)
all_files <- matrix(c(filenames_in, filenames_out), ncol = 2)
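For example, filenames_in[1] is "Nest001.csv", and the sub() call turns it into filenames_out[1] = "Nest001_NEW.csv": the three captured digits and the dot are kept, with _NEW spliced between them.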
And then loop through them:
for (i in 1:nrow(all_files)) {
  temp <- read.csv(all_files[[i, 1]], skip = 3, header = FALSE)
  # do stuff
  write.csv(temp, all_files[[i, 2]], row.names = FALSE)
}
To do this purrr-style, you would create the two vectors of file names as above, and then write a custom function that reads in a file, performs all the processing, and writes it out.
e.g.
purrr::walk2(
  .x = filenames_in,
  .y = filenames_out,
  .f = ~ my_function(.x, .y)
)
Think of .x and .y as the i in the for loop: walk2() steps through both vectors simultaneously and applies the function to each pair of items.
More info is available in the purrr documentation.
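For illustration, a minimal sketch of such a helper, keeping the question's skip = 3 and the my_function name from the snippet above (the cleaning step is a placeholder):
my_function <- function(file_in, file_out) {
  temp <- read.csv(file_in, skip = 3, header = FALSE)
  # ... trimming / cleaning steps from the question go here ...
  write.csv(temp, file_out, row.names = FALSE)
}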
Your best bet is to put all of these CSVs into one folder, without any other CSVs in that folder. Then, you can write a loop to go over every file in that folder, and read them in.
library(dplyr)
setwd("path to the folder with CSV's goes here")
combinedData = data.frame()
files = list.files()
for (file in files)
{
  temp = read.csv(file)
  combinedData = bind_rows(combinedData, temp)
}
EDIT: if there are other files in the folder that you don't want to read, you can add this line of code to only read in files that contain the word "Nest" in the title:
files = files[grepl("Nest", files)]
Note that grepl() is case sensitive by default.
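If you need a case-insensitive match, one option:
files = files[grepl("nest", files, ignore.case = TRUE)]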

Importing multiple files in sparklyr

I'm very new to sparklyr and spark, so please let me know if this is not the "spark" way to do this.
My problem
I have 50+ .txt files of around 300 MB each, all in the same folder (call it x), that I need to import into sparklyr, preferably as one table.
I can read them individually like
spark_read_csv(path=x, sc=sc, name="mydata", delimiter = "|", header=FALSE)
If I were to import them all outside of sparklyr, I would probably create a list with the file names, call it filelist, and then import them all into a list with lapply:
filelist = list.files(pattern = ".txt")
datalist = lapply(filelist, function(x)read.table(file = x, sep="|", header=FALSE))
This gives me a list where element k is the k-th .txt file in filelist. So my question is: is there an equivalent way in sparklyr to do this?
What I've tried
I've tried to use lapply() and spark_read_csv(), like I did above outside sparklyr, just changing read.table to spark_read_csv and adjusting the arguments:
datalist = lapply(filelist, function(x)spark_read_csv(path = x, sc = sc, name = "name", delimiter="|", header=FALSE))
which gives me a list with the same number of elements as .txt files, but every element (.txt file) is identical to the last .txt file in the file list.
> identical(datalist[[1]],datalist[[2]])
[1] TRUE
I obviously want each element to be one of the datasets. My idea is that after this, I can just rbind them together.
Edit:
Found a way. The problem was that the argument name in spark_read_csv needs to be updated each time a new file is read, otherwise it will overwrite the previous table. So I used a for loop instead of lapply, and in each iteration I change the name. Are there better ways?
datalist <- list()
for(i in 1:length(filelist)){
  name <- paste("dataset", i, sep = "_")
  datalist[[i]] <- spark_read_csv(path = filelist[i], sc = sc,
                                  name = name, delimiter = "|", header = FALSE)
}
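If you go this route, a sketch for then combining the list into the single table you want, assuming your version of sparklyr provides sdf_bind_rows():
combined <- do.call(sparklyr::sdf_bind_rows, datalist)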
Since you (emphasis mine)
have 50+ .txt files at around 300 mb each, all in the same folder
you can just use a wildcard in the path:
spark_read_csv(
  path = "/path/to/folder/*.txt",
  sc = sc, name = "mydata", delimiter = "|", header = FALSE)
If the directory contains only the data, you can simplify this even further:
spark_read_csv(
  path = "/path/to/folder/",
  sc = sc, name = "mydata", delimiter = "|", header = FALSE)
Native Spark readers also support reading multiple paths at once (Scala code):
spark.read.csv("/some/path", "/other/path")
but as of sparklyr 0.7.0-9014 this is not properly supported (the current implementation of spark_normalize_path doesn't support vectors of length greater than one).

R - write.table overwrites file

My script reads in a list of text files from a folder. A calculation for all values in a few columns in each text file is made.
At the end I want to write the resulting data.frame into a new text file in a different location.
The problem is that the script keeps overwriting the file it created before, so I end up with only one file (the last one that was read in).
But I don't get what I am doing wrong here. The output file name is different each time, so in my head it should produce separate files.
The script looks as follows:
RAW <- "C:/path/tofiles"
files <- list.files(RAW, full.names = TRUE)
for(j in length(files)) {
  if(file.exists(files[[j]])){
    data <- read.csv(files[[j]], skip = 0, header = FALSE)
    data[9] <- do.call(cbind, lapply(data[9], function(x){(data[9]*0.01701)/0.00848}))
    data[11] <- do.call(cbind, lapply(data[11], function(x){(data[11]*0.01834)/0.00848}))
    data[13] <- do.call(cbind, lapply(data[13], function(x){(data[13]*0.00982)/0.00848}))
    data[15] <- do.call(cbind, lapply(data[15], function(x){(data[15]*0.01011)/0.00848}))
    OUT <- paste("C:/path/to/destination_folder", basename(files[[j]]), sep = "")
    write.table(data, OUT, sep = ",", row.names = FALSE, col.names = FALSE, append = FALSE)
  }
}
The problem is in your for loop: length(files) provides just one value, the length of your files vector, whereas you want a sequence of that length. Try for(j in seq_along(files)), or iterate over the files directly with for(j in files) (adjusting the indexing inside the loop accordingly).
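A minimal sketch of the corrected loop, keeping the question's paths and logic (note the sep = "/" so that the folder and file name don't run together):
RAW <- "C:/path/tofiles"
files <- list.files(RAW, full.names = TRUE)
for (j in seq_along(files)) {
  data <- read.csv(files[[j]], skip = 0, header = FALSE)
  # ... column calculations as in the question ...
  OUT <- paste("C:/path/to/destination_folder", basename(files[[j]]), sep = "/")
  write.table(data, OUT, sep = ",", row.names = FALSE, col.names = FALSE)
}
Now j takes every index from 1 to length(files) instead of only the last one, so each input file produces its own output file.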

Function to read in multiple delimited text files

Using this answer, I have created a function that should read in all the text datasets in a directory:
read.delims = function(dir, sep = "\t"){
  # Make a list of all data frames in the "data" folder
  list.data = list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
  # Read them in
  for (i in 1:length(list.data)) {
    assign(list.data[i],
           read.delim(paste(dir, list.data[i], sep = "/"),
                      sep = sep))
  }
}
However, even though there are .txt and .csv files in the specified directory, no R objects get created (I'm guessing this happens because I'm using read.delim within a function). How can I correct this?
You can add the parameter envir to your assignment, like this:
read.delims = function(dir, sep = "\t"){
  # Make a list of all data frames in the "data" folder
  list.data = list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
  # Read them in
  for (i in 1:length(list.data)) {
    assign(list.data[i],
           read.delim(paste(dir, list.data[i], sep = "/"),
                      sep = sep),
           envir = .GlobalEnv)
  }
}
Doing this, your objects will be created in the global environment and not just inside the function's environment.
As I said in my comment, it is necessary to return() a value after assigning. I don't really see the point in using assign() though, so here it is with a simple for-loop, assuming you want your output to be a list of data frames.
Note that I changed the reading function to read.table() for personal convenience. You might want to adjust that.
read.delims <- function(dir, sep = "\t"){
  # Make a list of all data frames in the "data" folder
  list.data <- list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
  list.out <- as.list(1:length(list.data))
  # Read them in
  for (i in 1:length(list.data)) {
    list.out[[i]] <- read.table(paste(dir, list.data[i], sep = "/"), sep = sep)
  }
  return(list.out)
}
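Usage would then look like this (the folder name "data" is just an example):
dfs <- read.delims("data")
str(dfs[[1]])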
Maybe you should also anchor your regular expression with a $ so that only files ending in those extensions are matched.
Cheers.

Executing function on objects of name 'i' within for-loop in R

I am still pretty new to R and very new to for-loops and functions, but I searched quite a bit on stackoverflow and couldn't find an answer to this question. So here we go.
I'm trying to create a script that will (1) read in multiple .csv files and (2) apply a function that strips Twitter handles from URLs and does some other things to these files. I have developed scripts for these two tasks separately, so I know that most of my code works, but something goes wrong when I try to combine them. I prepare for doing so using the following code:
# specify directory for your files and replace 'file' with the first, unique part of the
# files you would like to import
mypath <- "~/Users/you/data/"
mypattern <- "file+.*csv"
# Get a list of the files
file_list <- list.files(path = mypath,
pattern = mypattern)
library(stringr)  # str_match() comes from stringr
# List of names to be given to data frames
data_names <- str_match(file_list, "(.*?)\\.")[,2]
# Define function for preparing datasets
handlestripper <- function(data){
  data$handle <- str_match(data$URL, "com/(.*?)/status")[,2]
  data$rank <- c(1:500)
  names(data) <- c("dateGMT", "url", "tweet", "twitterid", "rank")
  data <- data[, c(4, 1:3, 5)]
}
That all works fine. The problem comes when I try to execute the function handlestripper() within the for-loop.
# Read in data
for(i in data_names){
  filepath <- file.path(mypath, paste(i, ".csv", sep = ""))
  assign(i, read.delim(filepath, colClasses = "character", sep = ","))
  i <- handlestripper(i)
}
When I execute this code, I get the following error: Error in data$URL : $ operator is invalid for atomic vectors. I know that this means that my function is being applied to the string I called from within the vector data_names, but I don't know how to tell R that, in this last line of my for-loop, I want the function applied to the objects of name i that I just created using the assign command, rather than to i itself.
Inside your loop, you can change this:
assign(i, read.delim(filepath, colClasses = "character", sep = ","))
i <- handlestripper(i)
to
tmp <- read.delim(filepath, colClasses = "character", sep = ",")
assign(i, handlestripper(tmp))
I think you should make as few get and assign calls as you can, but there's nothing wrong with indexing your loop with names as you are doing. I do it all the time, anyway.
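For what it's worth, a sketch of a list-based version that avoids get and assign entirely, reusing mypath, data_names, and handlestripper from the question:
datasets <- lapply(data_names, function(i) {
  filepath <- file.path(mypath, paste0(i, ".csv"))
  handlestripper(read.delim(filepath, colClasses = "character", sep = ","))
})
names(datasets) <- data_names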
