How to remove certain columns from multiple files in R?

Hi everyone. I want to remove certain columns from multiple files (.csv).
For example, I have 50 files, and I want to delete columns a, b, and c in every one of them.
The point is I don't know how to loop over the files, save the change to every single file, and keep the original file names.
library(tidyverse)
library(here)  # needed for here()

# I want to delete some columns which contain messy code
# input a list of files
df <- list.files(here("Data"), pattern = "\\.csv$", full.names = TRUE) %>%
  lapply(read_csv) %>%                  # read each csv into a list of data frames
  lapply(subset, select = -c(a, b, c))  # remove the messy columns
write.csv(df, file = here())  # this step is wrong: df is a list, not a single data frame
# I want to save the change in the original files, but I don't know how to do it.

Read all the files (if they are all in the working directory) directly into a list and process them.
files <- list.files(pattern = "\\.csv$") # all the csv files in the working directory
lst2 <- lapply(files, function(x) read.csv(x, header = TRUE))
lst2 <- lapply(lst2, function(d) d[, !names(d) %in% c("a", "b", "c")]) # drop columns a, b, c
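To also save each cleaned data frame back under its original file name (the part the question was stuck on), a minimal sketch, assuming the originals can simply be overwritten in place:

files <- list.files(pattern = "\\.csv$", full.names = TRUE)
lst2 <- lapply(files, read.csv)
lst2 <- lapply(lst2, function(d) d[, !names(d) %in% c("a", "b", "c")])
# write each cleaned data frame back to the file it came from
invisible(Map(function(d, f) write.csv(d, file = f, row.names = FALSE), lst2, files))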

Related

Merging csv files whose names start with the same string using R

I have a number of csv files in the working directory. Some of these files share a string (e.g. ny, nj, etc.) at the beginning of their name (screenshot omitted).
What I want to do is import and merge the csv files that share a prefix. I have searched and seen people suggest regex, but I am not sure that is the best way to go. I appreciate any help with this.
Best,
Kaveh
Here's a function that may be more efficient than for loops, though there may be more elegant solutions.
Since I don't know what your csv files contain, I created several dummy files with a few columns ("A", "B", and "C"). I don't know what you would merge by; in this example I merge by column "A".
Given the ambiguity in the files, I have edited this to include both merge and bind approaches, depending on what is needed.
To test these functions, create a few CSV files in a folder (I created NJ_1.csv, NJ_2.csv, NJ_3.csv, NY_1.csv, NY_2.csv, each with columns A, B, and C.)
For all options, this code needs to be run.
setwd("insert path where folder with csv files is located")
library(dplyr)
OPTION 1:
If you want to merge files containing different data with a unique identifier.
Example: one file contains temperature and one file contains precipitation for a given geographic location
importMerge <- function(x, mergeby){
  temp <- list.files(pattern = paste0("^", x)) # files whose names start with x
  files <- lapply(temp, read.csv)
  merge <- files %>%
    Reduce(function(dtf1, dtf2) left_join(dtf1, dtf2, by = mergeby), .)
  return(merge)
}
NJmerge <- importMerge("NJ", "A")
NYmerge <- importMerge("NY", "A")
OPTION 2:
If you want to bind files containing the same columns.
Example: Files contain both temperature and precipitation, and each file is a given geographic location. Note: all columns need to have the same names in each file.
importBind <- function(x){
  temp <- list.files(pattern = paste0("^", x)) # files whose names start with x
  files <- lapply(temp, read.csv)
  bind <- do.call("rbind", files)
  return(bind)
}
NJbind <- importBind("NJ")
NYbind <- importBind("NY")
OPTION 3:
If you want to bind only certain columns from files containing the same column names.
Example: Files contain temperature and precipitation, along with other columns that aren't needed, and each file is a given geographic location. Note: all columns need to have the same names in each file. Since the default for keeps is NULL, leaving it out behaves like option 2 above.
importBindKeep <- function(x, keeps = NULL){ # default is to keep all columns
  temp <- list.files(pattern = paste0("^", x))
  files <- lapply(temp, read.csv)
  # if you only want to keep a few columns, subset each data frame
  if(!is.null(keeps)) files <- lapply(files, "[", , keeps)
  bind <- do.call("rbind", files)
  return(bind)
}
NJbind.keeps <- importBindKeep("NJ", keeps = c("A","B")) # keep only columns A and B
NYbind.keeps <- importBindKeep("NY", keeps = c("A","B"))
See How to import multiple .csv files at once? and Simultaneously merge multiple data.frames in a list, for more information.

Reading files in a batch does not read them correctly

I have 30 .txt files that I need to read into a tibble. It's panel data, about 108 MB altogether.
The issue is that some files are read correctly with all values there, but in others values are read as NA even though they are there! Also, the files include a lot of blank lines...
Here is what I use:
library(tidyverse) # for set_names(), map_df(), mutate(), str_replace_all()

read_clean_table <- function(x){
  x <- read.table(x, header = TRUE, fill = TRUE)
  x[-(1:4), ] # first 4 rows are system data
}
filenames <- list.files(path = "./ML", pattern = "\\.txt$", full.names = TRUE)
# read files and merge into one table, first rows removed; FileName is the name of the file
files <- filenames %>%
  set_names(.) %>%
  map_df(read_clean_table, .id = "FileName") %>%
  mutate(FileName = str_replace_all(basename(FileName), pattern = "\\.txt", ""))
I tried read.delim as well, with the same result...
This is what the issue looks like (screenshot omitted).
Edited: I added two of the files here:
https://drive.google.com/drive/folders/1gDss6qV9aFUMpJFGHPMQZbTITJ9av-py?usp=sharing
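A possible workaround (a hedged suggestion, assuming the stray NAs come from the blank lines and ragged rows): data.table::fread is often more forgiving than read.table here, since it skips blank lines and can pad short rows:

library(data.table)

read_clean_table <- function(x){
  # fill = TRUE pads short rows; blank.lines.skip = TRUE drops the empty lines
  x <- fread(x, header = TRUE, fill = TRUE, blank.lines.skip = TRUE)
  x[-(1:4), ] # first 4 rows are system data
}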

Error in splitting and writing txt files in RStudio

I have this big file named Objects_Population - AllCells.txt that is ~3GB; it has 25704373 rows and 132 variables. I want to read the file and split the rows based on one variable, the column named treatmentsum. This column holds experimental drug treatments under different conditions (3S or UNS), as strings linked with "_". The split will put all rows with the same treatment together. After splitting the file, I want to write out the split files, using the treatmentsum as the file names.
My code is below :
#load libraries
library(tidyverse)
library(vroom)
library(dplyr)
library(stringr)
#read in the file, skip the first 9 rows
files <- vroom("Objects_Population - AllCells.txt", delim = "\t", skip = 9, col_names = TRUE)
#split the file based on treatmentsum
splited <- files %>%
  group_split(files$treatmentsum)
#write out the split files
output <- lapply(splited, function(i){
  for (i in 1:length(splited)) {
    write.table(splited[[i]][, 1:131], file = paste(unique(splited[[i]]$treatmentsum), ".txt"),
                sep = "\t", row.names = FALSE)
  }
})
When I run it, the file reads correctly and the split works fine: treatments are split as expected, and I get a list of 1092 (shown in the environment), where each element contains the rows with the same treatment. However, the code dies every time after writing 233 files. I have screenshotted the error, and all the files generated so far are 3S; no UNS files were generated (as you can see in the file directory screenshot, both omitted here). Can someone help me with this and let me know what the error means?
I figured it out: some of the file names fail because the treatment names contain "/", which is not allowed in file names. Inspired by this https://stackoverflow.com/a/49647853/12362355
library(tidyverse) # attaches dplyr and stringr as well
library(vroom)

files <- vroom("Objects_Population - AllCells.txt", delim = "\t", skip = 9, col_names = TRUE)
splited <- files %>%
  group_split(treatmentsum)
# one write per group; strip "/" from each treatment name before using it as a file name
output <- lapply(seq_along(splited), function(i){
  write.table(splited[[i]][, 1:131],
              file = paste0(gsub("/", "", unique(splited[[i]]$treatmentsum)), ".txt"),
              sep = "\t", row.names = FALSE)
})
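An alternative sketch (not from the original thread, assuming the goal is simply one file per treatment): dplyr::group_walk writes each group directly, without the intermediate list of 1092 data frames:

library(dplyr)

files %>%
  group_by(treatmentsum) %>%
  group_walk(~ write.table(
    .x, # the group's rows, without the treatmentsum column itself
    file = paste0(gsub("/", "", .y$treatmentsum), ".txt"),
    sep = "\t", row.names = FALSE
  ))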

Reading in multiple srt files

I'd like to read multiple srt files into R. I can read them into a list, but I need to load them sequentially, in the order they were created in the file directory.
I'd also like to make a column telling which file each row comes from, so I can tell which data came from file 1, file 2, etc.
I can read them in as a list, but the files are named like "1 - FileTest", "2 - FileTest", "#10 FileTest", etc.
The list then loads like 1, 10, 11, etc., even though file 11 was created after file 9, for instance. I should just need a parameter that loads them sequentially, so that when I put them in a data frame they show up in chronological order.
list_of_files <- list.files(path = path,
                            pattern = "\\.srt$",
                            full.names = TRUE)
Files <- lapply(list_of_files, srt.read)
Files <- data.frame(matrix(unlist(Files), byrow = TRUE), stringsAsFactors = FALSE)
The files load in, but not in chronological order, so it is difficult to tell which data is associated with which file.
I have approximately 150 files, so being able to compile them into a single data frame would be very helpful. Thanks!
Consider extracting the files' metadata with file.info (which includes created/modified time, file size, owner, group, etc.). Then order the resulting data frame by created date/time, and finally import the .srt files using that ordered list:
raw_list_of_files <- list.files(path = path,
                                pattern = "\\.srt$",
                                full.names = TRUE)
# CREATE DATA FRAME OF FILE INFO (row names are the file paths)
meta_df <- file.info(raw_list_of_files)
# SORT BY CREATED DATE/TIME (note: ctime is creation time on Windows,
# but last-status-change time on Unix-alikes)
meta_df <- with(meta_df, meta_df[order(ctime), ])
# IMPORT DATA FRAMES IN THE SORTED FILE ORDER
srt_list <- lapply(row.names(meta_df), srt.read)
final_df <- data.frame(matrix(unlist(srt_list), byrow = TRUE),
                       stringsAsFactors = FALSE)
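If the leading number in each file name reflects the intended order (an assumption; creation times can be unreliable across platforms), a sketch that sorts by the number parsed out of each name instead:

# extract the digits from each file name and order the files numerically
nums <- as.numeric(gsub("\\D", "", basename(raw_list_of_files)))
srt_list <- lapply(raw_list_of_files[order(nums)], srt.read)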

Creating a data frame with the contents of multiple txt files

I'm new to R programming and am having difficulty trying to create one data frame from a number of text files. I have a directory containing over 100 text files. Each of the files has a different file name, but the contents have a similar format, e.g. 3 columns (name, age, gender). I want to load each of the text files into R and merge them into one data frame.
So far I have:
txt_files = list.files(path = 'names/', pattern = "*.txt")
do.call("rbind", lapply(txt_files, as.data.frame))
This has created a list of the file names but not the contents of the files. I'm able to read in the content of one file and create a data frame but I can't seem to do it for multiple files at once. If anyone could offer any help I'd really appreciate it as I'm completely stuck!
Thanks in advance!
I think you might want something like this:
# Put in your actual path where the text files are saved
mypath = "C:/Users/Dave/Desktop"
setwd(mypath)
# Create list of text files
txt_files_ls = list.files(path = mypath, pattern = "\\.txt$")
# Read the files in, assuming comma separator
txt_files_df <- lapply(txt_files_ls, function(x) {read.table(file = x, header = TRUE, sep = ",")})
# Combine them (read.table already returns data frames, so they can be rbind-ed directly)
combined_df <- do.call("rbind", txt_files_df)
At least that worked for me when I created a couple of sample text files.
Hope that helps.
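If you also want to know which file each row came from (not part of the original answer; the source_file column name is just an illustration), a small extension of the same approach:

# read each file and tag its rows with the file it came from
txt_files_df <- lapply(txt_files_ls, function(x) {
  df <- read.table(file = x, header = TRUE, sep = ",")
  df$source_file <- x # hypothetical column recording the origin file
  df
})
combined_df <- do.call("rbind", txt_files_df)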
