So I have 2 text files. The actual files are very large (thousands of lines each) but this is an extract from them:
File 1:
L["Corn Flakes"] = ""
L["Rice Oats"] = ""
L["Shreddies"] = ""
File 2:
L["Marshmellows"] = "Tesco"
L["Golden Syrup"] = "Morrisons"
L["Corn Flakes"] = "Tesco"
L["Bran Flakes"] = "Asda"
L["Super Flakes"] = "Asda"
L["Rice Oats"] = "Asda"
L["Shreddies"] = "Morrisons"
L["Rice Krispies"] = "Tesco"
So I want to merge these files so that I end up with this in a new text file:
Merged file:
L["Corn Flakes"] = "Tesco"
L["Rice Oats"] = "Asda"
L["Shreddies"] = "Morrisons"
In other words, I want to merge file 1 and file 2, but the merged file should contain only the keys that appear in file 1, while the actual rows (with their values) should come from file 2.
The merged file should be written to a new, empty text file, and it needs to be in UTF-8 so that it works in any language. All the files (file 1, file 2 and the merged file) need to be standard .txt files.
How can I do this in R?
Thank you.
Read the two files with '=' as the separator so that each becomes a two-column data frame. Keep the rows of file2 whose first column (V1) is present in file1, then write the result back to a new text file.
file1 <- read.table('file1.txt', sep = '=', quote = '')
file2 <- read.table('file2.txt', sep = '=', quote = '')
result <- file2[file2$V1 %in% file1$V1, ]
To include all the rows of file1, irrespective of whether they are present in file2, you can use a join approach:
library(dplyr)
inner_join(file1 %>% select(-any_of('V2')), file2, by = 'V1') %>%
  bind_rows(anti_join(file1, file2, by = 'V1')) %>%
  data.frame() -> result
Write the result (adding fileEncoding = 'UTF-8' to satisfy the UTF-8 requirement):
write.table(result, 'result.txt', sep = '=', col.names = FALSE, row.names = FALSE, quote = FALSE, fileEncoding = 'UTF-8')
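If you also want to keep the matching lines from file 2 exactly as they appear (spacing, quoting) and be explicit about UTF-8, a minimal readLines-based sketch, assuming the key is everything before the first '=', would be:
# Sketch only: match on the text before the first '=' and keep the raw lines of
# file 2, so their original formatting is preserved in the merged file.
lines1 <- readLines('file1.txt', encoding = 'UTF-8')
lines2 <- readLines('file2.txt', encoding = 'UTF-8')

keys1 <- sub('\\s*=.*$', '', lines1)   # key = everything before the first '='
keys2 <- sub('\\s*=.*$', '', lines2)

merged <- lines2[keys2 %in% keys1]     # rows of file 2 whose key is in file 1

con <- file('result.txt', open = 'w', encoding = 'UTF-8')
writeLines(merged, con)
close(con)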
I have a folder containing several .csv files. I need to delete the first three rows and the very last row of all those .csv files and then save them all as .txt. All the files have the same format, so it's always the same rows that need to be deleted.
I know how to modify a single data frame, but I do not know how to load, modify and save several data frames as .txt.
I am a beginner using R so I do not have examples of things I have tried yet.
Any help will be really appreciated!
Getting started on Stack Overflow is hard, but the other comments about reproducible examples are worth keeping in mind for the future. My suggestion would be to write a function that reads, modifies and writes a single file, and then loop it over all the files.
I can't tell exactly how to do this since I can't see your data, but something like this should work:
library('tidyverse')

# all .csv files in your folder
old_paths = list.files(
  path = your_folder,
  pattern = '\\.csv$',
  full.names = TRUE
)

read_write = function(path){
  # same file name, but with a .txt extension
  new_filename = str_replace(
    string = path,
    pattern = '\\.csv$',
    replacement = '.txt'
  )
  read_csv(path) %>%
    slice(-(1:3)) %>%   # drop the first three rows
    slice(-n()) %>%     # drop the last row
    write_tsv(new_filename) %>%
    invisible()
}

lapply(old_paths, read_write)
Let's first do this for one data frame, referencing only its file name:
input_file = "my_data_1.csv"
data = read.csv(input_file)
# modify
data = data[-(1:3), ] # delete first 3 rows
data = data[-nrow(data), ] # delete last row
# save as .txt
output_file = sub("csv$", "txt", input_file)
write.table(x = data, file = output_file, sep = "\t", row.names = FALSE)
Now we can turn it into a function taking the file name as an argument:
my_txt_convert = function(input_file) {
data = read.csv(input_file)
# modify
data = data[-(1:3), ] # delete first 3 rows
data = data[-nrow(data), ] # delete last row
# save as .txt
output_file = sub("csv$", "txt", input_file)
write.table(x = data, file = output_file, sep = "\t", row.names = FALSE)
}
Then we call the function on all your files:
to_convert = list.files(pattern = '\\.csv$')
for (file in to_convert) {
my_txt_convert(file)
}
# or
lapply(to_convert, my_txt_convert)
I have multiple EEG data files in .txt format all saved in a single folder, and I would like R to read all the files in said folder, add column headings (i.e., electrode numbers denoted by ordered numbers from 1 to 129) to every file, and overwrite old files with new ones.
rm(list=ls())
setwd("C:/path/to/directory")
files <- Sys.glob("*.txt")
for (file in files){
# read data:
df <- read.delim(file, header = TRUE, sep = ",")
# add header to every file:
colnames(df) <- paste("electrode", 1:129, sep = "")
# overwrite old text files with new text files:
write.table(df, file, append = FALSE, quote = FALSE, sep = ",", row.names = FALSE, col.names = TRUE)
}
I expect the column headings (i.e., electrode1 to electrode129) to appear on the first row of every text file, but the code doesn't seem to work.
I bet the solution is ridiculously simple, but I just haven't found any useful information regarding this issue...
Try this; the key change is header = FALSE, so that the first data row is not consumed as column names, and the result is written back to the same file it was read from:
for (file in files) {
  df = read.delim(file, header = FALSE, sep = ",")
  colnames(df) = paste("electrode", 1:129, sep = "")
  write.table(df, file = file, sep = ",", quote = FALSE, row.names = FALSE, col.names = TRUE)
}
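To confirm the header was added, you can peek at the first couple of lines of one of the rewritten files, for example:
readLines(files[1], n = 2)   # first line should now read electrode1,electrode2,...,electrode129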
Some background for my question: This is an R script that a previous research assistant wrote, but he did not provide any guidance to me on using it for myself. After working through an R textbook, I attempted to use the code on my data files.
What this code is supposed to do is load multiple .csv files, delete certain items/columns from them, and then write the new cleaned .csv files to a specified directory.
Currently, the files are being created in the right directory with the right file name, but the .csv files that are being created are empty.
I am currently getting the following warning message:
Warning in fread(input = paste0("data/", str_match(pattern = "CAFAS|PECFAS", : Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: (variable names).
This is my code:
library(data.table)
library(magrittr)
library(stringr)
# create a function to delete unnecessary variables from a CAFAS or PECFAS
# data set and save the reduced copy
del.items <- function(file){
  data <- fread(input = paste0("data/", str_match(pattern = "CAFAS|PECFAS", string = file) %>% tolower, "/raw/", file),
                sep = ",", header = TRUE, na.strings = "", stringsAsFactors = FALSE,
                skip = 0, colClasses = "character", data.table = FALSE)
  data <- data[-grep(pattern = "^(CA|PEC)FAS_E[0-9]+(TR?(Initial|[0-9]+|Exit)|SP[a-z])_(G|S|Item)[0-9]+$", x = names(data))]
  write.csv(data, file = paste0("data/", str_match(pattern = "CAFAS|PECFAS", string = file) %>% tolower, "/items-del/",
                                sub(pattern = "ExportData_", x = file, replacement = "")) %>% tolower,
            row.names = FALSE)
}
# delete items from all cafas data sets
cafas.files <- list.files("data/cafas/raw", pattern = ".csv")
for (file in cafas.files){
del.items(file)
}
# delete items from all pecfas data sets
pecfas.files <- list.files("data/pecfas/raw", pattern = ".csv")
for (file in pecfas.files){
del.items(file)
}
I have 900 text files in my directory.
Each file consists of data in the following format:
667869 667869.000000
580083 580083.000000
316133 316133.000000
11065 11065.000000
I would like to extract the fourth row from each text file and store the values in an array. Any suggestions are welcome.
This sounds more like a StackOverflow question, similar to
Importing multiple .csv files into R
You can try something like:
setwd("/path/to/files")
files <- list.files(path = getwd(), recursive = FALSE)
head(files)
myfiles = lapply(files, function(x) read.csv(file = x, header = TRUE))
mydata = lapply(myfiles, FUN = function(df){df[4,]})
str(mydata)
do.call(rbind, mydata)
A lazy answer is:
array <- c()
for (file in dir()) {
row4 <- read.table(file,
header = FALSE,
row.names = NULL,
skip = 3, # Skip the 1st 3 rows
nrows = 1, # Read only the next row after skipping the 1st 3 rows
sep = "\t") # change the separator if it is not "\t"
array <- cbind(array, row4)
}
You can additionally keep the names of the files:
colnames(array) <- dir()
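If you only need the raw values, a shorter sketch that reads just the first four lines of each file with readLines (assuming whitespace-separated columns) might be:
# Sketch: grab the fourth line of every file; names(fourth) are the file names
files  <- dir()                                  # or dir(pattern = "\\.txt$")
fourth <- sapply(files, function(f) readLines(f, n = 4)[4])

# optionally split each line into its two numeric columns
vals <- do.call(rbind, lapply(strsplit(trimws(fourth), "\\s+"), as.numeric))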
I am working on combining csv files into one large csv file that will not be able to fit into my machine's RAM. Is there any way to do that in R? I realize that I could load each individual csv file into R and append it to an existing database table, but for quirky reasons I'm looking to end up with a large csv file.
Try reading each csv file one by one and writing it out with write.table and the option append = TRUE.
Something like this:
read one csv file;
write.table(..., append = TRUE) to the final csv file;
remove the table with rm();
call gc().
Repeat until all files are written out.
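A minimal sketch of that recipe (the file pattern and output name below are just placeholders):
# Assumes every csv has the same columns; the header is written only once.
files <- list.files(pattern = "\\.csv$")
out   <- "combined.csv"
for (i in seq_along(files)) {
  d <- read.csv(files[i])
  write.table(d, out, sep = ",", row.names = FALSE,
              col.names = (i == 1),   # header only for the first file
              append    = (i > 1))
  rm(d)
  gc()
}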
You can use the option append = TRUE
first <- data.frame(x = c(1,2), y = c(10,20))
second <- data.frame(x = c(3,4), y = c(30,40))
write.table(first, "file.csv", sep = ",", row.names = FALSE)
write.table(second, "file.csv", append = TRUE, sep = ",", row.names = FALSE, col.names = FALSE)
First create 3 test files, then create a variable Files containing their names. We used Sys.glob to get the vector of file names, but you may need to modify this statement. Then define outFile as the name of the output file. For each component of Files, read in the file with that name and write it out: if it is the first file, write it all out; if it is a subsequent file, write everything except the header, being sure to use append = TRUE. Note that L is overwritten each time a file is read in, so only one file takes up space at a time.
# create test files using built in data frame BOD
write.csv(BOD, "BOD1.csv", row.names = FALSE)
write.csv(BOD, "BOD2.csv", row.names = FALSE)
write.csv(BOD, "BOD3.csv", row.names = FALSE)
Files <- Sys.glob("BOD*.csv") # modify as appropriate
outFile <- "out.csv"
for(f in Files) {
L <- readLines(f)
if (f == Files[1]) cat(L, file = outFile, sep = "\n")
else cat(L[-1], file = outFile, sep = "\n", append = TRUE)
}
# check that the output file was written properly
file.show(outFile)
The loop could alternately be replaced with this:
for(f in Files) {
d <- read.csv(f)
first <- f == Files[1]
write.table(d, outFile, sep = ",", row.names = FALSE, col.names = first, append = !first)
}