How to convert a .txt file to a .csv file in R

I found this code here, and it worked to convert the .txt file to .csv, but the file is not broken into columns. I'm pretty sure there's an easy fix or a line to add here, but I'm not finding it. I'm still new to R and working through it, so any help or direction is appreciated.
EDIT: The file contains the following, a list of invasive plants:
Header: Noxious Weed List.
'(a) Abrus precatorius – rosary pea '
'(b) Aeginetia spp. – aeginetia'
'(c) Ageratina adenophora – crofton weed '
'(d) Alectra spp. – alectra '
So I would like to get all the parts, i.e., genus, species, and common name, into separate columns, and if possible, drop the letter labels like '(a)' and the ' – ' hyphen separating the names.
filelist <- list.files(pattern = "\\.txt$")  # escape the dot and anchor, so "mytxtfile.R" is not matched
for (i in seq_along(filelist)) {
  input <- filelist[i]
  output <- paste0(gsub("\\.txt$", "", input), ".csv")
  print(paste("Processing the file:", input))
  data <- read.delim(input, header = TRUE)
  write.table(data, file = output, sep = ",", col.names = TRUE, row.names = FALSE)
}

You'll need to adjust if you have common names with three or more words, but this is the general idea:
path <- "C:\\Your File Path Here\\"
file <- paste0(path, "WeedList.txt")
DT <- read.delim(file, header = FALSE, sep = " ")
DT <- DT[-c(1),-c(1,4,7)]
colnames(DT) <- c("Genus", "Species", "CommonName", "CommonName2")
DT$CommonName <- gsub("'", "", DT$CommonName)
DT$CommonName2 <- gsub("'", "", DT$CommonName2)
DT$CommonName <- paste(DT$CommonName, DT$CommonName2, sep = " ")
DT <- DT[,-c(4)]
write.csv(DT, paste0(path, "WeedList.csv"), row.names = FALSE)
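If the common names can have any number of words, a regex split avoids hard-coding column positions entirely. This is only a sketch, built on the sample lines from the question (the column names and the hyphen/en-dash handling are my own choices):

```r
# Sketch: parse each line with regexes instead of fixed space-separated columns.
lines <- c("(a) Abrus precatorius - rosary pea",
           "(b) Aeginetia spp. - aeginetia")
# Drop the leading "(a) "-style label and any stray surrounding quotes
lines <- trimws(gsub("^'?\\([a-z]+\\)\\s*|'\\s*$", "", lines))
# Split into the Latin name and the common name at the separating dash
# (the class [-\u2013] matches a plain hyphen or an en dash)
parts  <- strsplit(lines, "\\s+[-\u2013]\\s+")
latin  <- sapply(parts, `[`, 1)
common <- sapply(parts, `[`, 2)
# First word of the Latin name is the genus, the rest is the species
DT <- data.frame(Genus      = sub("\\s.*$", "", latin),
                 Species    = sub("^\\S+\\s*", "", latin),
                 CommonName = common,
                 stringsAsFactors = FALSE)
```

From here, `write.csv(DT, ...)` works as in the answer above, with no column count to adjust.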

For loop using grepl

Example Data
I'm writing a script with the intent to copy input files, each to multiple locations. Below is an example of functional code to achieve this:
##### File 1 #####
output_paths_1 <- list(c(paste0(path_1, "file_1", ".xlsx"),
                         paste0(path_2, "file_1", ".xlsx"),
                         paste0(path_3, "file_1", " ", gsub("-", "", Sys.Date()), ".xlsx")))
lapply(output_paths_1, function(x) file.copy(paste0(input_path, "input_1.xlsx"), x, overwrite = T))
##### File 2 #####
output_paths_2 <- list(c(paste0(path_1, "file_2", ".xlsx"),
                         paste0(path_2, "file_2", ".xlsx"),
                         paste0(path_3, "file_2", " ", gsub("-", "", Sys.Date()), ".xlsx")))
lapply(output_paths_2, function(x) file.copy(paste0(input_path, "input_2.xlsx"), x, overwrite = T))
##### File 3 #####
output_paths_3 <- list(c(paste0(path_1, "file_3", ".xlsx"),
                         paste0(path_2, "file_3", ".xlsx"),
                         paste0(path_3, "file_3", " ", gsub("-", "", Sys.Date()), ".xlsx")))
lapply(output_paths_3, function(x) file.copy(paste0(input_path, "input_3.xlsx"), x, overwrite = T))
Reprex
But I suspect there are more efficient methods. In my latest attempt, which does not work, I used a nested 'for' loop. I create data frames containing each input and file name. Then (in theory), for each i in inputs, I write an output-paths data frame for each i in files, filtering that data frame to one file at a time using grepl. See the code below:
files <- data.frame(data = c("file_1", "file_2", "file_3"))
inputs <- data.frame(data = c("input_1.xlsx", "input_2.xlsx", "input_3.xlsx"))
for (i in seq_along(inputs)) {
  for (i in seq_along(files)) {
    output_paths <- data.frame(data = c(paste0(path_1, files[[i]], ".xlsx"),
                                        paste0(path_2, files[[i]], ".xlsx"),
                                        paste0(path_3, files[[i]], " ", gsub("-", "", Sys.Date()), ".xlsx"))) %>%
      filter(grepl(files[[i]], `data`))
    lapply(output_paths, function(x) file.copy(paste0(input_path, inputs[[i]]), x, overwrite = T))
  }
}
I expected this to copy the first file to three locations, then the next file to those same locations, etc. Instead, the following Warning appears, and only the first file is copied to the desired locations:
Warning message:
In grepl(files[[i]], data) :
argument 'pattern' has length > 1 and only the first element will be used
Running the code without including the grepl function does nothing at all - no files are copied to the desired locations.
Questions:
How might I tweak the code above to iterate for all elements, instead of the first element only?
Is there a more elegant approach entirely? (just looking for pointers, not reprex necessarily)
I don't understand what you are trying to accomplish with your "Reprex" approach. But if you want to do what your first bit of code does while writing less code, you could do something like:
files = c("file1", "file2", "file3") # file names
opaths = c("path1", "path2", "path3") # output paths
df = expand.grid(file = files, path = opaths, stringsAsFactors = F)
df$from = file.path(input_path, df$file)
df$to = file.path(df$path, df$file)
file.copy(from = df$from, to = df$to)
If you want the timestamp in the file name for path3, you could then do something like
df$to[df$path == "path3"] <- file.path(df$path[df$path == "path3"],
paste0(format(Sys.Date(), "%Y%m%d_"), df$file[df$path == "path3"])
)
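To see the expand.grid approach end to end, here is a self-contained demo run against temporary directories (all the paths and file names below are invented for illustration, not the asker's real ones):

```r
# Demo of the expand.grid/file.path pattern with throwaway dirs and files.
input_path <- file.path(tempdir(), "copy_demo_in")
opaths     <- file.path(tempdir(), paste0("copy_demo_out", 1:3))
for (p in c(input_path, opaths)) dir.create(p, showWarnings = FALSE)
files <- c("file1.xlsx", "file2.xlsx")
file.create(file.path(input_path, files))  # dummy input files to copy
# One row per (file, destination) combination
df <- expand.grid(file = files, path = opaths, stringsAsFactors = FALSE)
df$from <- file.path(input_path, df$file)
df$to   <- file.path(df$path, df$file)
ok <- file.copy(from = df$from, to = df$to, overwrite = TRUE)
```

After this runs, every output directory holds a copy of every input file, and `ok` is a logical vector with one TRUE per successful copy.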

Reading multiple files in R as string in a for.loop

I have code that reads two different csv files from a folder at execution time. I need to use a for loop in this context to run it multiple times and write the output to a separate csv file of the form "bsc_.csv". The two input csv files are of the form "base_.csv" and "fut_.csv". The files are incrementally numbered, and that is the pattern I need to iterate over. The sample code is attached below.
library('CDFt')
d1<-read.csv("base1.csv",header=TRUE)
d2<-read.csv("fut1.csv",header=TRUE)
A1<-d1[,2]
A2<-d1[,3]
A3<-d2[,2]
CT<-CDFt(A1,A2,A3)
x<-CT$x
FGp<-CT$FGp
FGf<-CT$FGf
FRp<-CT$FRp
FRf<-CT$FRf
ds<-CT$DS
d<-round(ds,3)
dat<-replace(d,d<0,0)
write.table(dat,"bsc1.csv", row.names=F,na="NA",append=T, quote= FALSE, sep=",", col.names=F)
Try this (untested):
bases <- list.files(pattern = "base[0-9]*\\.csv$")
futs <- list.files(pattern = "fut[0-9]*\\.csv$")
mismatches <- setdiff(gsub("^base", "", bases), gsub("^fut", "", futs))
if (length(mismatches)) {
  warning("'bases' files not in 'futs': ", paste(sQuote(mismatches), collapse = ", "))
  bases <- setdiff(bases, paste0("base", mismatches))
}
# and the reverse
mismatches <- setdiff(gsub("^fut", "", futs), gsub("^base", "", bases))
if (length(mismatches)) {
  warning("'futs' files not in 'bases': ", paste(sQuote(mismatches), collapse = ", "))
  futs <- setdiff(futs, paste0("fut", mismatches))
}
ign <- Map(function(fb, ff) {
  bdat <- read.csv(fb, header = TRUE)
  fdat <- read.csv(ff, header = TRUE)
  # ...
  newfn <- gsub("^base", "bsc", fb)
  write.table(dat, newfn, ...)
}, bases, futs)

How to delete part of a filename for multiple files using R

I'm using R 3.4.4 (via RStudio) on Windows.
I have file names that are like this:
1-2.edit-sites.annot.xlsx
2-1.edit-sites.annot.xlsx
...
10-1.edit-sites.annot.xlsx
I'm using the following code to rename the files (in order, including 1,2,3...10,11, etc)
file.rename(mixedsort(sort(list.files(pattern="*edit-sites.annot.xlsx"))), paste0("Sample ", 1:30))
But I can never remove the edit-sites.annot part. It seems like the . before edit is making R mess up the file extension! When I use the code above, I get Sample 1.edit-sites.annot.xlsx, but I would like Sample 1.xlsx.
This ought to work. For each file that is named "[num1]-[num2].edit-sites.annot.xlsx" it will rename it to "Sample [num1]-[num2].xlsx":
fnames <- dir(path = choose.dir(), pattern = '.*edit-sites.annot.xlsx')
> fnames
[1] "1-2.edit-sites.annot.xlsx" "10-2.edit-sites.annot.xlsx" "11-2.edit-sites.annot.xlsx" "21-2.edit-sites.annot.xlsx"
[5] "31-2.edit-sites.annot.xlsx" "4-2.edit-sites.annot.xlsx"
ptrn <- '^([[:digit:]]{1,3}-[[:digit:]]{1,3}).*xlsx'
ptrn2 <- '(.*).edit-sites.annot.xlsx'
lapply(fnames, function(z){
  suffix <- regmatches(x = z, m = regexec(pattern = ptrn, text = z))[[1]][2] # ptrn2 also works
  file.rename(from = z, to = paste('Sample', " ", suffix, ".xlsx", sep = ""))
})
After running code:
> dir(pattern = '.*xlsx')
[1] "Sample 1-2.xlsx" "Sample 10-2.xlsx" "Sample 11-2.xlsx" "Sample 21-2.xlsx" "Sample 31-2.xlsx"
[6] "Sample 4-2.xlsx"
ptrn and ptrn2 are based on the sample filenames that you supplied. Since I wasn't sure if the file names are always consistent with the pattern provided, I included a regex pattern to match for the leading digits separated by a dash.
I tend to batch rename files (same as the example here) and include some print statements in my code (using cat) to show what the original name was and what it was changed to:
lapply(fnames, function(z){
  suffix <- regmatches(x = z,
                       m = regexec(pattern = ptrn, text = z))[[1]][2] # ptrn2 also works
  new_name <- paste('Sample', " ", suffix, ".xlsx", sep = "") # sep = "" avoids extra spaces in the name
  cat('Original file name: ')
  cat(z)
  cat("\n")
  cat('Renamed to: ')
  cat(new_name)
  cat("\n")
  file.rename(from = z, to = new_name)
})
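Since file.rename() is itself vectorized, the whole lapply can also be collapsed into a single sub() call. A sketch on sample names, where the regex assumes the "[num]-[num].edit-sites.annot.xlsx" pattern shown above:

```r
# Build all the new names in one vectorized sub() call; a single
# file.rename(fnames, new_names) then renames the whole batch.
fnames <- c("1-2.edit-sites.annot.xlsx", "10-1.edit-sites.annot.xlsx")
new_names <- sub("^([0-9]+-[0-9]+)\\.edit-sites\\.annot\\.xlsx$",
                 "Sample \\1.xlsx", fnames)
# In the real directory you would follow this with:
# file.rename(fnames, new_names)
```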

Combine csv files with common file identifier

I have a list of approximately 500 csv files, each with a filename that consists of a six-digit number followed by a year (ex. 123456_2015.csv). I would like to append together all files that share the same six-digit number. I tried to implement the code suggested in this question:
Import and rbind multiple csv files with common name in R, but I want the appended data saved as new csv files in the same directory as the original files. I have also tried to implement the code below, but the csv files it produces contain no data.
rm(list=ls())
filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test")
NAPS_ID <- gsub('.+?\\([0-9]{5,6}?)\\_.+?$', '\\1', filenames)
Unique_NAPS_ID <- unique(NAPS_ID)
n <- length(Unique_NAPS_ID)
for(j in 1:n){
  curr_NAPS_ID <- as.character(Unique_NAPS_ID[j])
  NAPS_ID_pattern <- paste(".+?\\_(", curr_NAPS_ID, "+?)\\_.+?$", sep = "")
  NAPS_filenames <- list.files(path = "C:/Users/smithma/Desktop/PM25_test", pattern = NAPS_ID_pattern)
  write.csv(do.call("rbind", lapply(NAPS_filenames, read.csv, header = TRUE)),
            file = paste("C:/Users/smithma/Desktop/PM25_test/MERGED", "MERGED_", Unique_NAPS_ID[j], ".csv", sep = ""),
            row.names = FALSE)
}
Any help would be greatly appreciated.
Because you're not doing any data manipulation, you don't need to treat the files like tabular data. You only need to copy the file contents.
filenames <- list.files("C:/Users/smithma/Desktop/PM25_test", full.names = TRUE)
NAPS_ID <- substr(basename(filenames), 1, 6)
Unique_NAPS_ID <- unique(NAPS_ID)
for (curr_NAPS_ID in Unique_NAPS_ID) {
  NAPS_filenames <- filenames[startsWith(basename(filenames), curr_NAPS_ID)]
  output_file <- paste0(
    "C:/Users/smithma/Desktop/PM25_test/MERGED_", curr_NAPS_ID, ".csv"
  )
  for (fname in NAPS_filenames) {
    line_text <- readLines(fname)
    # Write the header from the first file
    if (fname == NAPS_filenames[1]) {
      cat(line_text[1], '\n', sep = '', file = output_file)
    }
    # Append every line in the file except the header; the trailing ''
    # makes cat end the chunk with a newline so files don't run together
    line_text <- line_text[-1]
    cat(line_text, '', file = output_file, sep = '\n', append = TRUE)
  }
}
My changes:
list.files(..., full.names = TRUE) is usually the best way to go.
Because the digits appear at the start of the filenames, I suggest substr. It's easier to get an idea of what's going on when skimming the code.
Instead of looping over the indices of a vector, loop over the values. It's more succinct and less likely to cause problems if the vector's empty.
startsWith and endsWith are relatively new functions, and they're great.
You only care about copying lines, so just use readLines to get them in and cat to get them out.
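A minimal, self-contained run of the readLines/cat merge, using temp files in place of the real CSVs (the file names and contents are invented):

```r
# Two fake input CSVs sharing the ID 123456, merged with the header written once.
d <- file.path(tempdir(), "merge_demo")
dir.create(d, showWarnings = FALSE)
writeLines(c("id,value", "1,10", "2,20"), file.path(d, "123456_2015.csv"))
writeLines(c("id,value", "3,30"),         file.path(d, "123456_2016.csv"))
parts <- sort(list.files(d, pattern = "^123456_", full.names = TRUE))
out <- file.path(d, "MERGED_123456.csv")
for (fname in parts) {
  line_text <- readLines(fname)
  if (fname == parts[1]) {
    cat(line_text[1], "\n", sep = "", file = out)   # header once
  }
  # trailing "" makes cat end each chunk with a newline
  cat(line_text[-1], "", file = out, sep = "\n", append = TRUE)
}
```

The merged file then holds one header line followed by the data rows of both inputs.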
You might consider something like this:
path <- "C:/Users/smithma/Desktop/PM25_test/"
filenames <- list.files(path)
## will take the first 6 characters of each file name
six.digit.filenames <- substr(filenames, 1, 6)
unique.numbers <- unique(six.digit.filenames)
for (j in unique.numbers) {
  sub <- filenames[which(substr(filenames, 1, 6) == j)]
  data.for.output <- c()
  for (file in sub) {
    ## now do your stuff with these files, including reading them in
    data <- read.csv(paste0(path, file))
    data.for.output <- rbind(data.for.output, data)
  }
  write.csv(data.for.output, paste0(path, j, '.csv'), row.names = F)
}
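The same grouping can also be expressed with split(), which pairs each six-digit ID with its file vector directly. A self-contained sketch on temp files (the names and data below are invented):

```r
# Group file paths by their leading 6-digit ID, then rbind-and-write each group.
d <- file.path(tempdir(), "split_demo")
dir.create(d, showWarnings = FALSE)
write.csv(data.frame(x = 1:2), file.path(d, "654321_2015.csv"), row.names = FALSE)
write.csv(data.frame(x = 3),   file.path(d, "654321_2016.csv"), row.names = FALSE)
files  <- list.files(d, pattern = "^[0-9]{6}_[0-9]{4}\\.csv$", full.names = TRUE)
groups <- split(files, substr(basename(files), 1, 6))   # named list: ID -> file paths
for (id in names(groups)) {
  merged <- do.call(rbind, lapply(groups[[id]], read.csv))
  write.csv(merged, file.path(d, paste0("MERGED_", id, ".csv")), row.names = FALSE)
}
```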

How to load txt files one by one in R rather than reading them all at once and combining them into a single matrix

I have 100 text files in a folder. I can use the function below to read all the files and store them in myfile.
file_list <- list.files("C:/Users/User/Desktop/code/Test/", full=T)
file_con <- lapply(file_list, function(x){
  return(read.table(x, head = F, quote = "\"", skip = 6, sep = ","))
})
myfile <- do.call(rbind, file_con)
My question is how I can read the first file in the Test folder before I read the second file. All the text file names are different, and I cannot change them to, for example, numbers from 1 to 100. I was thinking I could maybe add an integer in front of each text file name and then use a for loop to match and read the files, but is this possible?
I need to read the first file, do some calculation, and then export the result into result.txt before reading the second file. Right now I'm doing this manually, and I have almost 800 files, so it will be a big trouble for me to sit and wait for it to compute. The code below is the one I currently use.
myfile = read.table("C:/Users/User/Desktop/code/Test/20081209014205.txt", header = FALSE, quote = "\"", skip = 0, sep = ",")
The following setup will read one file at a time, perform an analysis, and save it back with a slightly modified name.
save_file_list <- structure(
  .Data = gsub(
    pattern = "\\.txt$",
    replacement = "-e.txt",
    x = file_list),
  .Names = file_list)

your_function <- function(.file_content) {
  ## The analysis you want to do on the content of each file.
}

for (.file in file_list) {
  .file_content <- read.table(
    file = .file,
    head = FALSE,
    quote = "\"",
    skip = 6,
    sep = ",")
  .result <- your_function(.file_content)
  write.table(
    x = .result,
    file = save_file_list[.file])
}
Now I can read a file and do the calculation using:
for (e in 1:100) {
  myfile <- read.table(file_list[e], header = FALSE, quote = "\"", skip = 0, sep = ",")
  while (condition) {
    # Calculation
  }
  myresult <- file.path("C:/Users/User/Desktop/code/Result/", paste0("-", e, ".txt"))
  write.table(x, file = myresult, row.names = FALSE, col.names = FALSE, sep = ",")
}
Now my problem is how I can make the output file have the same name as the original file, but with a -e value appended at the back.
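One way to handle that follow-up: derive the output name from the input's basename instead of from e alone. A sketch with invented paths, keeping e as the appended index to match the loop above:

```r
# Keep each original file name and insert "-<e>" before the extension.
file_list  <- c("C:/Test/20081209014205.txt", "C:/Test/20081210020101.txt")
result_dir <- "C:/Users/User/Desktop/code/Result"
out_names <- character(length(file_list))
for (e in seq_along(file_list)) {
  out_names[e] <- file.path(result_dir,
                            sub("\\.txt$", paste0("-", e, ".txt"),
                                basename(file_list[e])))
}
```

Inside the real loop, the same `sub(...)` expression would replace the `paste0("-", e, ".txt")` path, so write.table targets e.g. 20081209014205-1.txt.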
