I have a folder containing several .csv files. I need to delete the first three rows and the very last row of all those .csv files and then save them all as .txt. All the files have the same format so it's always the same rows that I would need to delete.
I know how to modify a single dataframe but I do not know how to load, modify and save as txt several dataframes.
I am a beginner using R so I do not have examples of things I have tried yet.
Any help will be really appreciated!
It's hard to start with stack overflow but the other comments about reproducible examples are worth thinking about for the future. My suggestion would be to write a function that reads, modifies, and writes and then loop it across all the files.
I can't tell exactly how to do this as I can't see your data but something like this should work:
library('tidyverse')
old_paths = list.files(
  path = your_folder,
  pattern = '\\.csv$',
  full.names = TRUE
)
read_write = function(path){
  new_filename = str_replace(
    string = path,
    pattern = '\\.csv$',
    replacement = '.txt'
  )
  read_csv(path) %>%
    slice(-(1:3)) %>%
    slice(-n()) %>%
    write_tsv(new_filename) %>%
    invisible()
}
lapply(old_paths, read_write)
Let's do this for one data frame first, referencing only its file name:
input_file = "my_data_1.csv"
data = read.csv(input_file)
# modify
data = data[-(1:3), ] # delete first 3 rows
data = data[-nrow(data), ] # delete last row
# save as .txt
output_file = sub("csv$", "txt", input_file)
write.table(x = data, file = output_file, sep = "\t", row.names = FALSE)
Now we can turn it into a function taking the file name as an argument:
my_txt_convert = function(input_file) {
  data = read.csv(input_file)
  # modify
  data = data[-(1:3), ] # delete first 3 rows
  data = data[-nrow(data), ] # delete last row
  # save as .txt
  output_file = sub("csv$", "txt", input_file)
  write.table(x = data, file = output_file, sep = "\t", row.names = FALSE)
}
Then we call the function on all your files:
# '\\.csv$' matches only names that end in ".csv"
to_convert = list.files(pattern = '\\.csv$')
for (file in to_convert) {
  my_txt_convert(file)
}
# or
lapply(to_convert, my_txt_convert)
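Both answers above parse each file into a data frame, so the header line becomes the column names and the "first three rows" are the first three data rows. If the rows to drop are instead raw lines of the file (header line included), a minimal sketch that works on plain text lines could look like the following. The function name `strip_lines_to_txt` and the demo file are made up for illustration, and the output keeps the comma separators rather than converting to tabs.

```r
# Hypothetical helper: drop the first three and the last raw line of a
# .csv file and save the rest under the same name with a .txt extension.
strip_lines_to_txt <- function(path) {
  lines <- readLines(path)
  kept <- head(tail(lines, -3), -1)   # tail(-3) drops 3 from the top, head(-1) drops 1 from the end
  out_path <- sub("\\.csv$", ".txt", path)
  writeLines(kept, out_path)
  invisible(out_path)
}

# Demo in a throwaway folder with made-up content
demo_dir <- file.path(tempdir(), "strip_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("junk 1", "junk 2", "junk 3", "x,y", "1,2", "3,4", "footer"),
           file.path(demo_dir, "a.csv"))

# Apply the helper to every .csv file in the folder
csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
txt_files <- vapply(csv_files, strip_lines_to_txt, character(1))
```

Using `head()` and `tail()` with negative counts avoids surprises with negative indexing when a file has fewer lines than expected.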
I have a script that merges all the csv files in a folder.
My problem is that a new column named "...20" is created with empty data. How can I avoid that?
Thanks for helping
My script :
folderfiles <- list.files(path = "//myserver/Depots/",
                          pattern = "\\.csv$",
                          full.names = TRUE)
data_csv <- folderfiles %>%
  set_names() %>%
  map_dfr(.f = read_delim,
          delim = ";",
  )
and the message :
It's difficult to debug this without access to specific files. However, you can attempt to specify the columns you want to read using the cols_only function. For example, let's assume that you only want to read the mpg column. You can do that in the following manner:
library("fs")
library("readr")
library("tidyverse")
# Generating some sample files
temp_dir_files <- path_temp("cars")
dir_create(temp_dir_files)
for (i in 1:10) {
  write_csv(mtcars, file = path(temp_dir_files, paste0("cars", i, ".csv")))
}
# Selected column import
# read_* can handle a vector of paths
read_csv(
  file = dir_ls(temp_dir_files, glob = "*.csv"),
  col_types = cols_only(
    mpg = col_double()
  ),
  id = "input_file"
)
The cols_only specification passed to read_csv makes it skip every column that is not listed and import only the columns whose names match.
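If the goal is to keep every real column and drop only the auto-named empty one, another option is to filter by column name after reading. The sketch below assumes the "...20" column comes from a trailing delimiter at the end of each line, which is only a guess since the files are not shown. The sample text is made up.

```r
library(readr)

# Made-up sample: every line ends with a trailing ';', which readr reads
# as one extra, unnamed, empty column.
sample_text <- "a;b;\n1;2;\n3;4;\n"
raw <- read_delim(I(sample_text), delim = ";", show_col_types = FALSE)

# readr names unnamed columns "...<position>"; keep all the other columns.
# startsWith() does a literal prefix match, so "..." is not a regex here.
cleaned <- raw[, !startsWith(names(raw), "..."), drop = FALSE]
names(cleaned)
```

The same filter could be applied to the combined data frame after the files are merged, which avoids listing every wanted column by hand.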
So I have 2 text files. The actual files are very large (thousands of lines each) but this is an extract from them:
File 1:
L["Corn Flakes"] = ""
L["Rice Oats"] = ""
L["Shreddies"] = ""
File 2:
L["Marshmellows"] = "Tesco"
L["Golden Syrup"] = "Morrisons"
L["Corn Flakes"] = "Tesco"
L["Bran Flakes"] = "Asda"
L["Super Flakes"] = "Asda"
L["Rice Oats"] = "Asda"
L["Shreddies"] = "Morrisons"
L["Rice Krispies"] = "Tesco"
So I want to merge these files so that I end up with this in a new text file:
Merged file:
L["Corn Flakes"] = "Tesco"
L["Rice Oats"] = "Asda"
L["Shreddies"] = "Morrisons"
In other words, I want to merge file 1 and file 2 but the merged file should only contain the rows that are in file 1 but the actual rows in the merged file should be from file 2.
The merged file should be output to a new, empty text file and it needs to be in UTF-8 format so that it works in any language. All the files (file 1, file 2 and the merged file) need to be standard .txt files.
How can I do this in R?
Thank you.
Read the two files with '=' as the separator, so each becomes a data frame with two columns. Keep the rows of file2 whose first column (V1) is also present in file1. Write the result to a new text file if needed.
file1 <- read.table('file1.txt', sep = '=', quote = '')
file2 <- read.table('file2.txt', sep = '=', quote = '')
result <- file2[file2$V1 %in% file1$V1, ]
To include all the rows of file1, regardless of whether they are present in file2, you can try a join approach.
library(dplyr)
inner_join(file1 %>% select(-any_of('V2')), file2, by = 'V1') %>%
  bind_rows(anti_join(file1, file2, by = 'V1')) %>%
  data.frame() -> result
Write the result :
write.table(result, 'result.txt', sep = '=', col.names = FALSE, row.names = FALSE, quote = FALSE)
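As a self-contained check of the approach, here is a sketch in base R that also addresses two points from the question: the UTF-8 requirement (via the `fileEncoding` argument) and keeping the rows in the order of file 1 (via `match()`, which is a swap for the `%in%` filter above). The temporary file names and sample lines are made up, and the sketch assumes no key or value contains an extra '=' character.

```r
# Made-up sample files in a temporary folder
tmp <- tempdir()
f1  <- file.path(tmp, "file1.txt")
f2  <- file.path(tmp, "file2.txt")
out <- file.path(tmp, "merged.txt")

writeLines(c('L["Corn Flakes"] = ""',
             'L["Rice Oats"] = ""',
             'L["Shreddies"] = ""'), f1)
writeLines(c('L["Shreddies"] = "Morrisons"',
             'L["Marshmellows"] = "Tesco"',
             'L["Corn Flakes"] = "Tesco"',
             'L["Rice Oats"] = "Asda"'), f2)

# Split each line at '='; quote = "" keeps the double quotes as plain text
file1 <- read.table(f1, sep = "=", quote = "", fileEncoding = "UTF-8",
                    stringsAsFactors = FALSE)
file2 <- read.table(f2, sep = "=", quote = "", fileEncoding = "UTF-8",
                    stringsAsFactors = FALSE)

# Position in file2 of every key of file1 (NA when the key is missing)
idx <- match(file1$V1, file2$V1)
result <- file2[idx[!is.na(idx)], ]

write.table(result, out, sep = "=", col.names = FALSE, row.names = FALSE,
            quote = FALSE, fileEncoding = "UTF-8")
```

Because the spaces around '=' stay attached to the two columns when reading, writing with `sep = "="` reproduces the original `key = value` layout.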
Some background for my question: This is an R script that a previous research assistant wrote, but he did not provide any guidance to me on using it for myself. After working through an R textbook, I attempted to use the code on my data files.
What this code is supposed to do is load multiple .csv files, delete certain items/columns from them, and then write the new cleaned .csv files to a specified directory.
When I run my code, I don't get any errors, but the code isn't doing anything. I originally thought that this was a problem with file permissions, but I'm still having the problem after changing them. Not sure what to try next.
Here's the code:
library(data.table)
library(magrittr)
library(stringr)
# create a function to delete unnecessary variables from a CAFAS or PECFAS
# data set and save the reduced copy
del.items <- function(file)
{
  data <- read.csv(input = paste0("../data/pecfas|cafas/raw",
                                  str_match(pattern = "cafas|pecfas", string = file) %>% tolower, "/raw/",
                                  file), sep = ",", header = TRUE, na.strings = "", stringsAsFactors = FALSE,
                   skip = 0, colClasses = "character", data.table = FALSE)
  data <- data[-grep(pattern = "^(CA|PEC)FAS_E[0-9]+(T(Initial|[0-9]+|Exit)|SP[a-z])_(G|S|Item)[0-9]+$", x = names(data))]
  write.csv(data, file = paste0("../data/pecfas|cafas/items-del",
                                str_match(pattern = "cafas|pecfas", string = file) %>% tolower, "/items-del/",
                                sub(pattern = "ExportData_", x = file, replacement = "")) %>% tolower,
            sep = ",", row.names = FALSE, col.names = TRUE)
}
# delete items from all cafas data sets
cafas.files <- list.files("../data/cafas/raw/", pattern = ".csv")
for (file in cafas.files){
del.items(file)
}
# delete items from all pecfas data sets
pecfas.files <- list.files("../data/pecfas/raw/", pattern = ".csv")
for (file in pecfas.files){
del.items(file)
}
Some background for my question: This is an R script that a previous research assistant wrote, but he did not provide any guidance to me on using it for myself. After working through an R textbook, I attempted to use the code on my data files.
What this code is supposed to do is load multiple .csv files, delete certain items/columns from them, and then write the new cleaned .csv files to a specified directory.
Currently, the files are being created in the right directory with the right file name, but the .csv files that are being created are empty.
I am currently getting the following warning message:
Warning in
fread(input = paste0("data/", str_match(pattern = "CAFAS|PECFAS",: Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: (variable names).
This is my code:
library(data.table)
library(magrittr)
library(stringr)
# create a function to delete unnecessary variables from a CAFAS or PECFAS
# data set and save the reduced copy
del.items <- function(file){
  data <- fread(input = paste0("data/", str_match(pattern = "CAFAS|PECFAS", string = file) %>% tolower,
                               "/raw/", file),
                sep = ",", header = TRUE, na.strings = "", stringsAsFactors = FALSE, skip = 0,
                colClasses = "character", data.table = FALSE)
  data <- data[-grep(pattern = "^(CA|PEC)FAS_E[0-9]+(TR?(Initial|[0-9]+|Exit)|SP[a-z])_(G|S|Item)[0-9]+$", x = names(data))]
  write.csv(data, file = paste0("data/", str_match(pattern = "CAFAS|PECFAS", string = file) %>% tolower,
                                "/items-del/", sub(pattern = "ExportData_", x = file, replacement = "")) %>% tolower,
            row.names = FALSE)
}
# delete items from all cafas data sets
cafas.files <- list.files("data/cafas/raw", pattern = ".csv")
for (file in cafas.files){
del.items(file)
}
# delete items from all pecfas data sets
pecfas.files <- list.files("data/pecfas/raw", pattern = ".csv")
for (file in pecfas.files){
del.items(file)
}
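One possible cause of the empty output, offered as a guess since the data is not shown: when `grep()` finds no matching column names it returns `integer(0)`, and negating that is still `integer(0)`, so `data[-grep(...)]` selects zero columns instead of all of them. That could happen here if the warning means the header line was discarded, leaving default column names that never match the pattern. The toy data frame and pattern below are made up to show the behaviour, together with a `grepl()`-based alternative that keeps all columns when nothing matches.

```r
toy <- data.frame(id = 1:2, score = c(10, 20))  # made-up data
pattern <- "^CAFAS_"                             # matches no column name here

# Negative indexing with an empty match drops every column
dropped_by_grep <- toy[-grep(pattern, names(toy))]
ncol(dropped_by_grep)   # 0

# A logical mask keeps every column that does not match
kept_by_grepl <- toy[!grepl(pattern, names(toy))]
ncol(kept_by_grepl)     # 2
```

Printing `names(data)` right after the read step would show whether the real column names survived the import.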
I am working on combining csv files into one large csv file that will not be able to fit into my machine's RAM. Is there any way to go about doing that in R? I realize that I could load each individual csv file into R and append the file to an existing database table but for quirky reasons I'm looking to end up with a large csv file.
Try reading the csv files one at a time and writing each one out with write.table and the option append = TRUE.
Something like this:
read one csv file;
write.table(..., append = TRUE) to the final csv file (with col.names = FALSE after the first file, so the header is not repeated);
remove the table with rm();
gc().
Repeat until all files are written out.
You can use the option append = TRUE
first <- data.frame(x = c(1,2), y = c(10,20))
second <- data.frame(c(3,4), c(30,40))
write.table(first, "file.csv", sep = ",", row.names = FALSE)
write.table(second, "file.csv", append = TRUE, sep = ",", row.names = FALSE, col.names = FALSE)
First create 3 test files, then create a variable Files containing their names. We used Sys.glob to get the vector of file names, but you may need to modify this statement. Then define outFile as the name of the output file. For each component of Files, read in the file with that name and write it out: if it is the first file, write all of it; if it is a subsequent file, write everything except the header, being sure to use append = TRUE. Note that L is overwritten each time a file is read in, so only one file takes up space at a time.
# create test files using built in data frame BOD
write.csv(BOD, "BOD1.csv", row.names = FALSE)
write.csv(BOD, "BOD2.csv", row.names = FALSE)
write.csv(BOD, "BOD3.csv", row.names = FALSE)
Files <- Sys.glob("BOD*.csv") # modify as appropriate
outFile <- "out.csv"
for(f in Files) {
  L <- readLines(f)
  if (f == Files[1]) cat(L, file = outFile, sep = "\n")
  else cat(L[-1], file = outFile, sep = "\n", append = TRUE)
}
# check that the output file was written properly
file.show(outFile)
The loop could alternately be replaced with this:
for(f in Files) {
  d <- read.csv(f)
  first <- f == Files[1]
  write.table(d, outFile, sep = ",", row.names = FALSE, col.names = first, append = !first)
}
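Both loops above hold one whole input file in memory at a time. If a single input file might itself be too large, a variation is to stream it through a connection in fixed-size chunks of lines. This is only a sketch: the function name `append_in_chunks` and the `chunk_lines` parameter are made up, and it assumes every file starts with the same one-line header.

```r
# Hypothetical helper: copy several csv files into one, a chunk of lines
# at a time, writing the header only once.
append_in_chunks <- function(files, out_file, chunk_lines = 10000L) {
  out <- file(out_file, open = "w")
  on.exit(close(out))
  for (i in seq_along(files)) {
    con <- file(files[i], open = "r")
    header <- readLines(con, n = 1L)          # first line of each file
    if (i == 1L) writeLines(header, out)      # keep it only for the first file
    repeat {
      chunk <- readLines(con, n = chunk_lines)
      if (length(chunk) == 0L) break          # end of this input file
      writeLines(chunk, out)
    }
    close(con)
  }
  invisible(out_file)
}

# Demo with the built-in BOD data frame, written to a temporary folder
demo_dir <- file.path(tempdir(), "chunk_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(BOD, file.path(demo_dir, paste0("BOD", i, ".csv")), row.names = FALSE)
}
in_files <- Sys.glob(file.path(demo_dir, "BOD*.csv"))
out_path <- file.path(tempdir(), "chunk_out.csv")
append_in_chunks(in_files, out_path, chunk_lines = 4L)
```

Since the lines are copied as text and never parsed, the column types and quoting of the inputs are left exactly as they were.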