I have, say, 10 csv files with names like:
file_1_tail.csv
file_2_tail.csv
file_3_tail.csv
...
file_10_tail.csv
The only difference in the names is the number (1 to 10). Each file has the same structure: 1000 rows and 100 columns.
I need to read them into R, select specific columns and write the result to a new file. My code for one file is below:
file_2_tail = read_csv("file_2_tail.csv")
file_2_tail_selected = file_2_tail[,c(1:7,30)]
write.csv2(file_2_tail_selected, file = "file_2_selected.csv")
And now I want to use a loop to automate this for all ten files.
for (i in 1:10){
file_"i"_tail = read_csv("file_"i"_tail.csv")
file_"i"_tail_selected = file_"i"_tail[,c(1:7,30)]
write.csv2(file_"i"_tail_selected, file = "file_"i"_selected.csv")
}
And of course it doesn't work - i is not readable in this notation. How should I fix this?
You can't build a variable name by pasting strings like that. Instead, build the file name with paste0() and store the data in a temporary variable tmp:
library(readr)  # provides read_csv()
for (i in 1:10){
  tmp <- read_csv(paste0("file_", i, "_tail.csv"))
  tmp <- tmp[, c(1:7,30)]
  write.csv2(tmp, file = paste0("file_", i, "_selected.csv"))
}
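If the file numbers are not known in advance, the same loop can be driven by list.files() instead of 1:10. This is only a sketch using base R: the temporary folder, the small stand-in files and row.names = FALSE are my assumptions, not part of the question.

```r
# Sketch: loop over whatever files match the naming pattern.
# demo_dir and the fake input files are assumptions for illustration only.
demo_dir <- file.path(tempdir(), "csv_loop_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Create three small stand-in files with 30 columns each
for (i in 1:3) {
  fake <- as.data.frame(matrix(seq_len(5 * 30) + i, nrow = 5, ncol = 30))
  write.csv(fake, file.path(demo_dir, paste0("file_", i, "_tail.csv")),
            row.names = FALSE)
}

# The pattern is matched against the file name, not the full path
in_files <- list.files(demo_dir, pattern = "^file_[0-9]+_tail\\.csv$",
                       full.names = TRUE)

for (f in in_files) {
  tmp <- read.csv(f)
  tmp <- tmp[, c(1:7, 30)]
  # file_2_tail.csv -> file_2_selected.csv
  out <- sub("_tail\\.csv$", "_selected.csv", f)
  write.csv2(tmp, file = out, row.names = FALSE)
}

out_files <- list.files(demo_dir, pattern = "_selected\\.csv$")
```

Deriving the output name from the input name with sub() means the loop never needs to know the number itself.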
Btw, this is probably a more convenient way to read multiple files into one data frame. Select the columns inside the map call, because .id adds FileName as the first column and would otherwise shift the column positions:
library(tidyverse)
filePattern <- "\\.csv$"
fileList <- list.files(path = ".", recursive = FALSE,
                       pattern = filePattern, full.names = TRUE)
result <- fileList %>%
  purrr::set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  purrr::map_df(~ read_csv(.x) %>% select(1:7, 30), .id = "FileName")
result
Related
I have several files with the names RTDFE, TRYFG, FTYGS, WERTS...like 100 files in txt format. For each file, I'm using the following code and writing the output in a file.
name = c("RTDFE")
file1 <- paste0(name, "_filter",".txt")
file2 <- paste0(name, "_data",".txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
nrow(C)
145
Output:
Samples Common
RTDFE 145
Every time, I assign the file name to the variable name, run my code, and write the output to the file. Instead, I want the code to run on all the files in one go and produce the following output. Common is the number of rows of the merged data frame C.
The output I need:
Samples Common
RTDFE 145
TRYFG ...
FTYGS ...
WERTS ...
How to do this? Any help.
How about putting all your names in a single vector, called names, like this:
names<-c("TRYFG","RTDFE",...)
and then feeding each one to a function that reads the files, merges them, and returns the rows
f<-function(n) {
fs = paste0(n,c("_filter", "_data"),".txt")
C = merge(
read.delim(fs[1],sep="\t", header=F),
read.delim(fs[2],sep="\t", header=F), by="XYZ")
data.frame(Samples=n,Common=nrow(C))
}
Then just call this function f on each of the values in names, row binding the results together
do.call(rbind, lapply(names, f))
An easy way to create the vector names is like this:
p = "_(filter|data)\\.txt$"
names = unique(gsub(p,"",list.files(pattern = p)))
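A self-contained sketch of the same idea that can be run as is. The temporary folder, the helper write_pair(), the function name count_common() and the tiny tables are all made up for illustration. It also assumes the files have a header row containing a column named XYZ, which is why it reads with header = TRUE; with header = FALSE the columns would be called V1, V2, ... and by = "XYZ" would not find a match.

```r
# Sketch: count the rows shared by each <sample>_filter.txt / <sample>_data.txt pair.
# demo_dir, write_pair() and the toy data are assumptions for illustration only.
demo_dir <- file.path(tempdir(), "merge_demo")
dir.create(demo_dir, showWarnings = FALSE)

write_pair <- function(sample, filter_ids, data_ids) {
  write.table(data.frame(XYZ = filter_ids, f = seq_along(filter_ids)),
              file.path(demo_dir, paste0(sample, "_filter.txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
  write.table(data.frame(XYZ = data_ids, d = seq_along(data_ids)),
              file.path(demo_dir, paste0(sample, "_data.txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}
write_pair("RTDFE", c("a", "b", "c"), c("b", "c", "d"))  # two shared keys
write_pair("TRYFG", c("a", "b"), c("x", "y"))            # no shared keys

# Read one pair, merge on XYZ and return a one-row summary
count_common <- function(n, dir = demo_dir) {
  fs <- file.path(dir, paste0(n, c("_filter", "_data"), ".txt"))
  A <- read.delim(fs[1], sep = "\t", header = TRUE)
  B <- read.delim(fs[2], sep = "\t", header = TRUE)
  data.frame(Samples = n, Common = nrow(merge(A, B, by = "XYZ")))
}

# Recover the sample names from the files that are actually present
p <- "_(filter|data)\\.txt$"
sample_names <- unique(sub(p, "", list.files(demo_dir, pattern = p)))

res <- do.call(rbind, lapply(sample_names, count_common))
```

res then has one row per sample, in the Samples / Common layout asked for in the question.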
I am making some assumptions here.
The first assumption is that you have all these files in a folder with no other text files (.txt) in this folder.
If so you can get the list of files with the command list.files.
But when doing so you will get both the "_data.txt" and the "_filter.txt" files.
We need a way to extract the basic part of the name.
I use "str_replace" to remove the "_data.txt" and the "_filter.txt" from the list.
But when doing so you will get every name twice. Therefore I use the "unique" command.
I store this in "lfiles", which will now contain "RTDFE, TRYFG, FTYGS, WERTS..." and any other file that satisfies the conditions.
After this I run a for loop on this list.
I reopen the files similarly as you do.
I merge by XYZ and I immediately put the results in a data frame.
By using rbind I keep adding results to the data frame "res".
library(stringr)
lfiles=list.files(path = ".", pattern = ".txt")
## we strip, from the files, the "_filter and the data
lfiles=unique( sapply(lfiles, function(x){
x=str_replace(x, "_data.txt", "")
x=str_replace(x, "_filter.txt", "")
return(x)
} ))
res=NULL
for(i in lfiles){
file1 <- paste0(i, "_filter.txt")
file2 <- paste0(i, "_data.txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
# append to res; without res as the first argument only the last sample is kept
res=rbind(res, data.frame(Samples=i, Common=nrow(merge(A, B, by="XYZ"))))
}
Ok, I will assume you have a folder called "data" with files named RTDFE_filter.txt, RTDFE_data.txt, TRYFG_filter.txt, TRYFG_data.txt, etc. (only and exactly these files).
This code shows one possible way:
# save the file names (with their folder, so read.delim can find them)
files = list.files("data", full.names = TRUE)
# get indexes for "data" (for "filter" indexes, add 1)
files_data_index = seq(1, length(files), 2) # 1, 3, 5, ...
# loop on indexes
results = lapply(files_data_index, function(i) {
  A <- read.delim(files[i+1], sep = "\t", header = FALSE)
  B <- read.delim(files[i], sep = "\t", header = FALSE)
  C <- merge(A, B, by="XYZ")
  # sample name is the part of the file name before the underscore
  samp = strsplit(basename(files[i]), "_")[[1]][1]
  com = nrow(C)
  return(c(Samples = samp, Common = com))
})
# combine results
do.call(rbind, results)
I am scraping a site in R (rvest package) and I want to add a new column to every parsed csv file and either 1) fill it with the number from my loop or 2) fill it with a special value (which I get using rvest nodes). I can assign these values if I scrape only one page, but that is not what I need. The for loop itself works smoothly.
Here is my code with for loop
registered <- for (n in c(11:12)){
url_2019 <-
paste0("https://www.cvk.gov.ua/pls/vnd2019/wp033pt001f01=919pf7331=", n
,".html")
results_2019 <- read_html(url_2019)%>% html_table(fill = TRUE)
results_2019[[6]]%>%as.data.frame
#dir.create("registered_major_2019")
file <- paste0("registered_major_2019/dist_", n, ".csv")
if (!file.exists(file)) write.csv(results_2019[[6]], file, fileEncoding
= "Windows-1251")
Sys.sleep(0.5)
}
And I know to do it separately
url_2019 <-
paste0("https://www.cvk.gov.ua/pls/vnd2019/wp033pt001f01=919pf7331=11.html")
results_2019 <- read_html(url_2019)%>% html_table(fill = TRUE)
pfont <- read_html(url_2019)%>% html_node("font")%>%html_text()
# This is actually what I need
results_2019a <- data.frame(results_2019[[6]], pfont)
But I can't figure out how to do it in for(). I tried this, but it doesn't work:
registered <- for (n in c(11:12)){
url_2019 <-
paste0("https://www.cvk.gov.ua/pls/vnd2019/wp033pt001f01=919pf7331=", n
,".html")
results_2019 <- read_html(url_2019)%>% html_table(fill = TRUE)%>%data.frame()
pfont <- read_html(url_2019)%>% html_node("font")%>%html_text()
df <- data.frame(results_2019[[6]], pfont)
#dir.create("registered_major_2019")
file <- paste0("registered_major_2019/dist_", n, ".csv")
if (!file.exists(file)) write.csv(df, file, fileEncoding = "Windows-1251")
Sys.sleep(0.5)
}
I have managed to combine the data from several files and am currently trying to extract a file number from my file names and insert it into a column.
fnames = dir("../data/temperature_trials", full.names=TRUE)
print(fnames)
for (i in 1: length(fnames) ) {
#open each file in turn
temp = read.csv(fnames[i])
if (i == 1) {
res = temp
} else {
res = rbind(res, temp)
}
}
I imported 12 .csv files and used rbind to combine all the data. The files are named:
Trial1.csv
Trial2.csv
.
.
.
Trial12.csv
for (i in 1: length(fnames)) {
  loc = regexpr(pattern = "Trial[0-9]*", text = fnames[i])
  trialNumber = as.numeric(substr(fnames[i], start = loc[[1]][1]+5,
                                  stop = loc[[1]][1] + attr(loc, 'match.length')-1))
  print(trialNumber)
  res1 = cbind(trialNumber, res)
}
I am trying to extract the trial numbers from each .csv file name and place them into a column named TrialNumber. When I do so it will only place a 12 into this column for every data point. Since it is using a loop I am assuming this is why, but can not figure out how to fix this or another way to do so. I need to assign the trial number to each data point corresponding with each .csv file.
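Maybe you can simply add the trial number during each iteration of the loop, before binding:
for (i in 1: length(fnames) ) {
  #open each file in turn
  temp = read.csv(fnames[i])
  # tag the rows of this file before combining, so earlier files keep their own number
  temp$trial_number=i
  if (i == 1) {
    res = temp
  } else {
    res = rbind(res, temp)
  }
}
This way you will have a trial number column which corresponds to the file that was imported. Note that i is the position of the file in fnames, so it matches the number in the file name only if the files are listed in that order.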
You can also try extracting the numeric part of the file name as pointed out in this answer-
Extract numeric part of strings of mixed numbers and characters in R
I'd create a list of data frames from the CSV files, using the file name as the basis for each list element name:
fnames <- list.files("full/path/to/data/temperature_trials",
                     pattern = "\\.csv$", full.names = TRUE)
temp <- lapply(fnames, read.csv)
names(temp) <- tools::file_path_sans_ext(basename(fnames))
Then dplyr::bind_rows() will create a data frame from the list, with the list element name (Trial1, Trial2, ...) in the .id column:
library(dplyr)
temp_df <- bind_rows(temp, .id = "TrialNumber")
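If the column should hold the number itself (12 rather than "Trial12"), the digits can be pulled out of each file name while reading. A minimal base R sketch, assuming the files are named exactly Trial<number>.csv; the temporary folder and the toy temperature values are invented for the example.

```r
# Sketch: read every TrialN.csv and store N, taken from the file name, in a column.
# demo_dir and the fake files are assumptions for illustration only.
demo_dir <- file.path(tempdir(), "trial_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (k in c(1, 2, 12)) {
  write.csv(data.frame(temp = c(20, 21) + k),
            file.path(demo_dir, paste0("Trial", k, ".csv")),
            row.names = FALSE)
}

fnames <- list.files(demo_dir, pattern = "^Trial[0-9]+\\.csv$",
                     full.names = TRUE)

pieces <- lapply(fnames, function(f) {
  d <- read.csv(f)
  # "Trial12.csv" -> 12; the number is attached to this file's rows only
  d$TrialNumber <- as.integer(sub("^Trial([0-9]+)\\.csv$", "\\1", basename(f)))
  d
})

res <- do.call(rbind, pieces)
```

Because the number comes from the name and not from the loop position, it stays correct even though Trial12.csv sorts before Trial2.csv in the file listing.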
I feel I am very close to the solution, but at the moment I can't figure out how to get there.
I've got the following problem.
In my folder "Test" I've got stacked data files with the names M1_1, M1_2, M1_3 and so on: /Test/M1_1.dat for example.
Now I want to separate the files, so that I get: M1_1[1].dat, M1_1[2].dat, M1_1[3].dat and so on. These files I'd like to save in specific subfolders: Test/M1/M1_1[1]; Test/M1/M1_1[2] and so on, and Test/M2/M1_2[1], Test/M2/M1_2[2] and so on.
Now I already created the subfolders. And I have the following command to split up the files so that I get M1_1.dat[1] and so on:
for (e in dir(path = "Test/", pattern = ".dat", full.names=TRUE, recursive=TRUE)){
data <- read.table(e, header=TRUE)
df <- data[ -c(2) ]
out <- split(df , f = df$.imp)
lapply(names(out),function(z){
write.table(out[[z]], paste0(e, "[",z,"].dat"),
sep="\t", row.names=FALSE, col.names = FALSE)})
}
Now the paste0 command gets me my desired split-up data (although it's M1_1.dat[1] instead of M1_1[1].dat), but I can't figure out how to get this data into my subfolders.
Maybe you've got an idea?
Thanks in advance.
I don't have any idea what your data looks like, so I am going to attempt to recreate the scenario with the gender datasets available at baby names.
Assuming all the files from the zip folder are stored in "inst/data",
store all file paths in the all_fi variable:
all_fi <- list.files("inst/data",
full.names = TRUE,
recursive = TRUE,
pattern = "\\.txt$")
> head(all_fi, 3)
[1] "inst/data/yob1880.txt" "inst/data/yob1881.txt"
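Preset function that will apply to each file in the directory
library(dplyr)     # %>%, select(), mutate()
library(jsonlite)  # rbind.pages(); recent jsonlite releases spell it rbind_pages()
f.it <- function(f_in = NULL){
  # Create the new folder based on the existing basename of the input file
  new_folder <- tools::file_path_sans_ext(f_in)
  dir.create(new_folder)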
data.table::fread(f_in) %>%
select(name = 1, gender = 2, freq = 3) %>%
mutate(
gender = ifelse(grepl("F", gender), "female","male")
) %>% (function(x){
# Dataset contains names for males and females
# so that's what I'm using to mimic your split
out <- split(x, x$gender)
o <- rbind.pages(
lapply(names(out), function(i){
# New filename for each iteration of the split dataframes
###### THIS IS WHERE YOU NEED TO TWEAK FOR YOUR NEEDS
new_dest_file <- sprintf("%s/%s.txt", new_folder, i)
# Write the sub-data-frame to the new file
data.table::fwrite(out[[i]], new_dest_file)
# For our purposes return a dataframe with file info on the new
# files...
data.frame(
file_name = new_dest_file,
file_size = file.size(new_dest_file),
stringsAsFactors = FALSE)
})
)
o
})
}
Now we can just loop through:
NOTE: for my purposes I'm not going to spend time looping through each file, for your purposes this would apply to each of your initial files, or in my case all_fi rather than all_fi[2:5].
> rbind.pages(lapply(all_fi[2:5], f.it))
============================ =========
file_name file_size
============================ =========
inst/data/yob1881/female.txt 16476
inst/data/yob1881/male.txt 15306
inst/data/yob1882/female.txt 18109
inst/data/yob1882/male.txt 16923
inst/data/yob1883/female.txt 18537
inst/data/yob1883/male.txt 15861
inst/data/yob1884/female.txt 20641
inst/data/yob1884/male.txt 17300
============================ =========
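Applied to the folder layout in the question, the same approach could look roughly like the base R sketch below. It assumes the subfolder is "M" plus the number after the underscore (M1_2.dat goes to Test/M2), as the question describes; the temporary root folder and the toy stacked files are made up. It lists only the top level of the folder, so files written into the subfolders are not picked up again on a second run.

```r
# Sketch: split each stacked file by .imp and write the pieces into a subfolder.
# root and the fake stacked files are assumptions for illustration only.
root <- file.path(tempdir(), "Test")
dir.create(root, showWarnings = FALSE)
for (k in 1:2) {
  stacked <- data.frame(.imp = rep(1:3, each = 2),
                        .id  = rep(1:2, times = 3),
                        x    = seq_len(6) * k)
  write.table(stacked, file.path(root, paste0("M1_", k, ".dat")),
              sep = "\t", row.names = FALSE)
}

for (e in list.files(root, pattern = "\\.dat$", full.names = TRUE)) {
  data <- read.table(e, header = TRUE)
  df <- data[-2]                                  # drop the second column, as in the question
  out <- split(df, f = df$.imp)

  stem <- tools::file_path_sans_ext(basename(e))  # e.g. "M1_2"
  # Subfolder named after the number behind the underscore: "M1_2" -> Test/M2
  sub_dir <- file.path(root, paste0("M", sub("^.*_", "", stem)))
  dir.create(sub_dir, showWarnings = FALSE)

  for (z in names(out)) {
    # Put the index before the extension: M1_2[1].dat rather than M1_2.dat[1]
    write.table(out[[z]], file.path(sub_dir, paste0(stem, "[", z, "].dat")),
                sep = "\t", row.names = FALSE, col.names = FALSE)
  }
}
```

Building the target path from the file stem with file.path() fixes both issues at once: the pieces land in the subfolder, and the index sits before the .dat extension.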
I need to run the same set of code for multiple CSV files. I want to do it the same way as with a macro. Below is the code that I am executing, but the results are not coming out properly. It is reading the data in 2-d format while I need it in 3-d format.
lf = list.files(path = "D:/THD/data", pattern = ".csv",
full.names = TRUE, recursive = TRUE, include.dirs = TRUE)
ds<-lapply(lf,read.table)
I don't know if this is going to be useful, but one of the ways I do it is:
##Step 1 read files
mycsv = dir(pattern=".csv")
n <- length(mycsv)
mylist <- vector("list", n)
for(i in 1:n) mylist[[i]] <- read.csv(mycsv[i],header = T)
Then I usually just use lapply to change things, for example:
## Change column names
mylist <- lapply(mylist, function(x) {names(x) <- c("type","date",paste0("v", 1:24),"total") ; return(x)})
## changing type column for weekday/weekend
mylist <- lapply(mylist, function(x) {
f = c("we", "we", "wd", "wd", "wd", "wd", "wd")
x$type = rep(f,52, length.out = 365)
return(x)
})
and so on.
Then I save with the following code after all the changes I made (it is also sometimes useful to split the original file name and add parts of it to the data, so that I can track each individual file later)
## for example some of my files had a pattern in the file name such as "201_E424220_N563500.csv", so I split this to save it with the data like this:
mylist <-lapply(1:length(mylist), function(i) {
mylist.i <- mylist[[i]]
s = strsplit(mycsv[i], "_" , fixed = TRUE)[[1]]
d = cbind(mylist.i[, c("type", "date")], ID = s[1], Easting = s[2], Northing = s[3], mylist.i[, 3:ncol(mylist.i)])
return(d)
})
for(i in 1:n)
  write.csv(mylist[[i]], file = paste("file", i, ".csv", sep = ""), row.names = F)
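One detail to watch in the file-name split: splitting "201_E424220_N563500.csv" on "_" leaves the extension attached to the last piece ("N563500.csv"). A small sketch that strips the extension first; parse_name() and the toy data frame are hypothetical names used only for this example.

```r
# Sketch: turn a file name like "201_E424220_N563500.csv" into ID/Easting/Northing.
# parse_name() and the toy data frame d are assumptions for illustration only.
parse_name <- function(fname) {
  stem <- tools::file_path_sans_ext(basename(fname))  # drop folder and ".csv"
  s <- strsplit(stem, "_", fixed = TRUE)[[1]]
  c(ID = s[1], Easting = s[2], Northing = s[3])
}

s <- parse_name("201_E424220_N563500.csv")

d <- data.frame(type = c("we", "wd"),
                date = c("2001-01-01", "2001-01-02"),
                v1   = c(1.5, 2.5))

# Insert the three identifiers between the first two columns and the rest
tagged <- cbind(d[, c("type", "date")],
                ID = s[["ID"]], Easting = s[["Easting"]], Northing = s[["Northing"]],
                d[, 3:ncol(d), drop = FALSE])
```

The scalar identifiers are recycled down all rows by cbind(), so every row of the file carries its ID, Easting and Northing.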
I hope this will help. When you get some time, please read about the plyr package, as I am sure it will be very useful for you; it has lots of data analysis options. plyr has apply functions such as:
## l_ply split list, apply function and discard result
## ldply split list, apply function and return result in data frame
## laply split list, apply function and return result in an array
For example, you can use ldply to read all your csv files and return a data frame, something like:
library(plyr)
data = ldply(list.files(pattern = ".csv"), function(fname) {
  j = read.csv(fname, header = T)
  return(j)
})
So here data will be your data frame with the data from all your csv files.
Thanks,Ayan