I have a big file named Objects_Population - AllCells.txt that is ~3 GB, with 25704373 rows and 132 variables. I want to read the file and split the rows based on one variable, the column named treatmentsum. This column holds experimental drug treatments under different conditions (3S or UNS), as strings joined with "_". The split should put all rows with the same treatment together. After splitting the file, I want to write out the split files, naming each file after its treatmentsum.
My code is below:
#load libraries
library(tidyverse)
library(vroom)
library(dplyr)
library(stringr)
#read in the file, skip the first 9 rows
files <- vroom("Objects_Population - AllCells.txt", delim = "\t", skip = 9, col_names = TRUE)
#split the files based on treatmentsum
splited <- files %>%
  group_split(files$treatmentsum)
#write out the splitted files
output <- lapply(splited, function(i){
  for (i in 1:length(splited)) {
    write.table(splited[[i]][, 1:131], file = paste(unique(splited[[i]]$treatmentsum), ".txt"), sep = "\t", row.names = FALSE)
  }
})
So when I run it, the file reads correctly, the split works fine, and the treatments are split as expected: I get a list of 1092 (shown in the environment), and each list element contains the rows for one treatment. However, the code dies every time after it writes 233 files. I have screenshotted the error, and all the files generated are 3S ones; no UNS files are generated (as you can see in the file directory screenshot at the bottom right). Can someone help me with this and let me know what the error means?
I figured it out: some of the file names fail because some treatment names contain "/". Inspired by this https://stackoverflow.com/a/49647853/12362355
library(tidyverse)
library(vroom)
library(dplyr)
library(stringr)
files <- vroom("Objects_Population - AllCells.txt", delim = "\t", skip = 9, col_names = TRUE)
#split the file based on treatmentsum
splited <- files %>%
  group_split(treatmentsum)
#write each split out once, stripping "/" (illegal in file names) from the treatment name;
#a plain for loop is enough here; the earlier lapply wrapper re-ran the whole loop once per list element
for (i in seq_along(splited)) {
  write.table(splited[[i]][, 1:131],
              file = paste0(gsub("/", "", unique(splited[[i]]$treatmentsum)), ".txt"),
              sep = "\t", row.names = FALSE)
}
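A note in case other treatment names contain further characters that are illegal in file names: the gsub above can be widened into a small sanitizer. A sketch, assuming Windows-style file name rules (the exact character class is my assumption, not something from the original post):
#replace any character that Windows forbids in file names with "_"
safe_name <- function(x) gsub("[/\\\\:*?\"<>|]", "_", x)
#then: file = paste0(safe_name(unique(splited[[i]]$treatmentsum)), ".txt")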
I have a number of CSV files in the working directory. Some of these files share a string (e.g. ny, nj, etc.) at the beginning of their name. Below is a screenshot:
What I want to do is import and merge the CSV files that share a string. I have searched and seen people suggest regex, but I am not sure that is the best way to go. I appreciate any help with this.
Best,
Kaveh
Here's a function that may be more efficient than for loops, though there may be more elegant solutions.
Since I don't know what your CSV files contain, I created several dummy files with a few columns ("A", "B", and "C"). I don't know what you would merge by; in this example I merged by column "A".
Given the ambiguity in the files, I have edited this to include both merge and bind approaches, depending on what is needed.
To test these functions, create a few CSV files in a folder (I created NJ_1.csv, NJ_2.csv, NJ_3.csv, NY_1.csv, NY_2.csv, each with columns A, B, and C.)
For all options, this code needs to be run.
setwd("insert path where folder with csv files is located")
library(dplyr)
OPTION 1:
If you want to merge files containing different data with a unique identifier.
Example: one file contains temperature and one file contains precipitation for a given geographic location
importMerge <- function(x, mergeby){
  temp <- list.files(pattern = paste0("^", x)) # files whose names start with x
  files <- lapply(temp, read.csv)
  merge <- files %>% Reduce(function(dtf1, dtf2) left_join(dtf1, dtf2, by = mergeby), .)
  return(merge)
}
NJmerge <- importMerge("NJ", "A")
NYmerge <- importMerge("NY", "A")
OPTION 2:
If you want to bind files containing the same columns.
Example: files contain both temperature and precipitation, and each file is a given geographic location. Note: all columns need to have the same names in each file.
importBind <- function(x){
  temp <- list.files(pattern = paste0("^", x))
  files <- lapply(temp, read.csv)
  bind <- do.call("rbind", files)
  return(bind)
}
NJbind <- importBind("NJ")
NYbind <- importBind("NY")
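As a side note, if the files might not have exactly matching columns, dplyr's bind_rows is more forgiving than rbind. A sketch of option 2 rewritten with it (bind_rows fills columns missing from a file with NA instead of erroring):
importBindRows <- function(x){
  temp <- list.files(pattern = paste0("^", x))
  files <- lapply(temp, read.csv)
  bind_rows(files) # tolerant of column order; fills missing columns with NA
}
NJbind2 <- importBindRows("NJ")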
OPTION 3
If you want to bind only certain columns from files containing the same column names.
Example: files contain temperature and precipitation, along with other columns that aren't needed, and each file is a given geographic location. Note: all columns need to have the same names in each file. Since the default for keeps is NULL, leaving keeps out falls back to option 2 above.
importBindKeep <- function(x, keeps = NULL){ # default is to keep all columns
  temp <- list.files(pattern = paste0("^", x))
  files <- lapply(temp, read.csv)
  # if you want to keep only a few columns, pass their names as keeps
  if(!is.null(keeps)) files <- lapply(files, "[", , keeps)
  bind <- do.call("rbind", files)
  return(bind)
}
NJbind.keeps <- importBindKeep("NJ", keeps = c("A","B")) # keep only columns A and B
NYbind.keeps <- importBindKeep("NY", keeps = c("A","B"))
See How to import multiple .csv files at once? and Simultaneously merge multiple data.frames in a list, for more information.
I have 30 .txt files that I need to read into a tibble. It's panel data, 108 MB altogether.
The issue is that some files are read correctly with all values there, but some are read in as NA while the values are clearly there! Also, the files include a lot of blank lines...
Here is what I use:
library(tidyverse) # for set_names, map_df, mutate, str_replace_all
read_clean_table <- function(x){
  x <- read.table(x, header = TRUE, fill = TRUE)
  x[-(1:4), ] # first 4 rows are system data
}
filenames <- list.files(path = "./ML", pattern = "\\.txt$", full.names = TRUE)
#read the files and merge them into one table, first rows removed; FileName is the name of the file
files <- filenames %>%
  set_names(.) %>%
  map_df(read_clean_table, .id = "FileName") %>%
  mutate(FileName = str_replace_all(basename(FileName), pattern = "\\.txt", ""))
I tried read.delim as well, with the same result...
This is what the issue looks like:
Edited: I added two example files:
https://drive.google.com/drive/folders/1gDss6qV9aFUMpJFGHPMQZbTITJ9av-py?usp=sharing
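Since the reading code relies on fill = TRUE, one quick diagnostic (a sketch; the file name is a placeholder) is to check whether read.table sees a consistent number of fields per line; a mix of counts means fill = TRUE is silently padding short rows with NA:
#tabulate how many whitespace-separated fields each line of one problem file has
table(count.fields("./ML/problem_file.txt", blank.lines.skip = TRUE))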
Hi everyone. I want to remove certain columns in multiple files (CSV).
For example, I have 50 files, and I want to delete columns a, b, and c in every file.
The problem is that I don't know how to save the change back to every single file while keeping the original file names.
library(tidyverse)
library(here)
# I want to delete some columns which contain garbled text
# input: a list of files
df <- list.files(here("Data"), pattern = ".csv", full.names = TRUE) %>%
  lapply(read_csv) %>% #read each csv
  lapply(subset, select = -c(a, b, c)) #remove the garbled columns
write.csv(df, file = here())
# I want to save the change in the original files, but I don't know how to do it.
Read all the files (if all the files are in the working directory) directly into a list and process it.
files <- list.files() #if you want to read all the files in the working directory
lst2 <- lapply(files, function(x) read.table(x, header = TRUE))
#drop columns a, b and c from every data frame
lst3 <- lapply(lst2, function(x) x[setdiff(names(x), c("a", "b", "c"))])
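To cover the save-back part of the question, each cleaned data frame can be written out under its original file name. A minimal sketch, assuming the files really are CSVs (note this overwrites the originals, so keep a backup):
#write each cleaned data frame back under its original file name (overwrites!)
invisible(Map(function(d, f) write.csv(d, file = f, row.names = FALSE), lst3, files))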
I have code here where I am trying to manipulate my files so that one column contains only 1s, not 1s and 0s. I have multiple files and multiple columns, but regardless, filtering one column to keep only the 1s while keeping everything else should be easy. I cannot get the dplyr filter to work with d %>% filter(CreaseUp>0). Maybe there is another command with lapply that would work? Everything else works: I can get the files summarized and output into one file. I'm so close to getting this right. Please help.
setwd("~/OneDrive/School/R/R Workspace/2016_Coda-Brundage/cb")
#assuming your working directory is the folder with the CSVs
f = list.files(pattern="*.csv")
for (i in 1:length(f)) assign(f[i], read.csv(f[i]))
d<-lapply(f, read.csv)
f.1<-d %>%
filter(CreaseUp>0)
w<-lapply(f.1, summary)
write.table(w, file = "SeedScan_results1.csv", sep = ",", col.names = NA,
qmethod = "double")
Final script. I had to open the .txt file in Office, change the spaces in between the headings and numbers to commas, and then create a table from the text. From there I could put it in Excel and pull my means from this set.
setwd("~/OneDrive")
#assuming your working directory is the folder with the CSVs
f <- list.files(pattern = "*.csv")
library(dplyr)
sink("SeedScan_results1.txt")
for (i in 1:length(f)){
  df <- read.csv(f[i])
  df <- filter(df, CreaseUp > 0)
  print(lapply(df, summary))
}
sink(NULL)
The d seems to be a list of data frames, not a data frame, so dplyr can't handle it. Also, what is that loop with assign doing now? Why not put the read (and possibly the filtering) inside the loop?
alldfs <- NULL #needs dplyr loaded for filter() and bind_rows()
for (i in f){
  df <- read.csv(i)
  df <- filter(df, CreaseUp > 0)
  alldfs <- bind_rows(alldfs, df)
}
# print summary etc.
EDIT - if you want to print the summary from within the loop:
sink("SeedScan_results1.txt")
for (i in f){
  df <- read.csv(i)
  df <- filter(df, CreaseUp > 0)
  print(lapply(df, summary))
}
sink(NULL)
The append flag might be helpful if you want to move sink inside the loop.
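A sketch of that variant, with sink moved inside the loop and append = TRUE so each pass adds to the same results file rather than truncating it:
for (i in f){
  df <- read.csv(i)
  df <- filter(df, CreaseUp > 0)
  sink("SeedScan_results1.txt", append = TRUE) #append so earlier output is kept
  print(lapply(df, summary))
  sink(NULL)
}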
I have a folder with several hundred CSV files. I want to use lapply to calculate the mean of one column within each CSV file and save that value into a new CSV file with two columns: column 1 would be the name of the original file, and column 2 would be the mean value of the chosen field from that file. Here's what I have so far:
setwd("C:/~~~~")
list.files()
filenames <- list.files()
read_csv <- lapply(filenames, read.csv, header = TRUE)
dataset <- lapply(filenames[1], mean)
write.csv(dataset, file = "Expected_Value.csv")
Which gives the error message:
Warning message: In mean.default("2pt.csv"[[1L]], ...) : argument is not numeric or logical: returning NA
So I think I have (at least) two problems that I cannot figure out.
First, why doesn't R recognize that column 1 is numeric? I double- and triple-checked the CSV files and I'm sure this column is numeric.
Second, how do I get the output file to return two columns the way I described above? I haven't gotten far with the second part yet;
I wanted to get the first part to work first. Any help is appreciated.
I didn't use lapply but have done something similar. Hope this helps!
n <- 1:2 ##indices of the files to process; modify as needed
##create an empty data frame
df <- NULL
##directory from which all files are to be read
directory <- "C:/mydir/"
##read all csv file names from the directory
x <- as.character(list.files(directory, pattern = 'csv'))
xpath <- paste(directory, x, sep = "")
##for loop to read each file and save the metric and the file name
for(i in n)
{
  file <- read.csv(xpath[i], header = TRUE, sep = ",")
  first_col <- file[, 1]
  d <- NULL
  d$mean <- mean(first_col)
  d$filename <- x[i]
  df <- rbind(df, d)
}
###write all output to csv
write.csv(df, file = "C:/mydir/final.csv")
The output CSV looks like this:
mean filename
1999.000661 hist_03082015.csv
1999.035121 hist_03092015.csv
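The same result can be had with the apply family the question asked about. A sketch, reusing x and xpath from above and assuming column 1 of every file is numeric:
##mean of the first column of each file, one value per file
means <- vapply(xpath, function(p) mean(read.csv(p)[[1]]), numeric(1))
result <- data.frame(filename = x, mean = means, row.names = NULL)
write.csv(result, file = "C:/mydir/final.csv", row.names = FALSE)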
Thanks for the two answers. After much review, it turns out that there was a much easier way to accomplish my goal. The CSV files I had were originally one file, which I had split into multiple files by location. At the time, I thought this was necessary to calculate the mean for each type. Clearly, that was a mistake. I went back to the original file and used aggregate. Code:
setwd("C:/~~")
allshots <- read.csv("All_Shots.csv", header=TRUE)
EV <- aggregate(allshots$points, list(Location = allshots$Loc), mean)
write.csv(EV, file= "EV_location.csv")
This was a simple solution. Thanks again for the answers. I'll need to get better at lapply for future projects, so they were not a waste of time.
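For comparison, a dplyr equivalent of that aggregate call would look roughly like this (a sketch; column names taken from the code above):
library(dplyr)
EV <- allshots %>%
  group_by(Location = Loc) %>%
  summarise(mean_points = mean(points))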