My files are systematically named and all in the same folder, so I want to take advantage of that and write a function that reads them one by one instead of doing this manually for each file.
The names are stored in this text file:
DF <- read.table(text=" site column row
1 abs 1259 463
2 adm 1253 460
3 afrm 1258 463", header=T)
I want to write a function that goes through DF row by row. For instance, for the first row:
DF$site is abs, so:
file1=read.table("C:\\Data\\abs.txt",sep = "\t")
DF$column is 1259
DF$row is 463
So:
wf= read.table("C:\\Users\\measurement_1259_463.txt", sep =' ' , header =TRUE)
Now I do some calculations with file1 and wf...
And then go to the second row and so on.
Create a character vector with the file names you want to read and follow the instructions in consolidating data frames in R or reading multiple csv files in R.
files <- data.frame(
site = paste("C:\\Data\\", DF$site, ".txt", sep=""),
measurement = paste("C:\\Users\\measurement_", DF$column, "_",
DF$row, ".txt", sep=""),
stringsAsFactors = FALSE)
results <- Map(function(s, m){
file1 <- read.table(s, sep="\t")
wt <- read.table(m, sep=' ', header=TRUE)
# Do stuff
return(result)
}, files$site, files$measurement)
# Alternatively
results <- vector("list", nrow(files))
for (i in seq_len(nrow(files))) {
  file1 <- read.table(files$site[i], sep = "\t")
  wt <- read.table(files$measurement[i], sep = " ", header = TRUE)
  # Do stuff that produces `result`
  results[[i]] <- result
}
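The path-building step can also be checked on its own before any file is read. Below is a minimal sketch, assuming the same DF as in the question; the forward slashes and the use of sprintf() instead of paste() are choices made for this illustration, not part of the original answer.

```r
# Same lookup table as in the question
DF <- read.table(text = " site column row
1 abs 1259 463
2 adm 1253 460
3 afrm 1258 463", header = TRUE, stringsAsFactors = FALSE)

# Build both paths for every row in one vectorised step
site_paths <- file.path("C:/Data", paste0(DF$site, ".txt"))
meas_paths <- sprintf("C:/Users/measurement_%d_%d.txt", DF$column, DF$row)

print(site_paths)
print(meas_paths)

# List the paths that do not exist, so typos show up before the loop runs
all_paths <- c(site_paths, meas_paths)
missing_paths <- all_paths[!file.exists(all_paths)]
```

Inspecting missing_paths first makes it easier to tell a wrong file name apart from a problem inside the calculation itself.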
Related
I have 6 txt files and I want to combine them into 1 dataframe. I know how to read them simultaneously and combine them in default way.
I learned to do this in this website:
txt_files_ls = list.files(path=mypath, pattern="*.txt")
txt_files_df <- lapply(txt_files_ls, function(x) {read.table(file = x, header = T, sep ="\t")})
# Combine them
combined_df <- do.call("rbind", lapply(txt_files_df, as.data.frame))
What I want now is to make read.table read the txt files in a sequence that I define, so that after combining them I can label each row with the name of its original txt file. Thank you.
You can try this:
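# full.names = TRUE returns paths that can be read from any working directory;
# pattern is a regular expression, not a glob
txt_files_ls <- list.files(path = mypath, pattern = "\\.txt$", full.names = TRUE)
# The function for reading
read.data <- function(x) {
  y <- read.table(file = x, header = TRUE, sep = "\t")
  y$var <- basename(x)  # file name without the directory part
  return(y)
}
# Read data
txt_files_df <- lapply(txt_files_ls, read.data)
# Combine them
combined_df <- do.call("rbind", txt_files_df)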
Where var contains the name of each file.
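If the reading order matters, one option is to spell the order out explicitly instead of relying on list.files(). The following is a small self-contained sketch; the temporary folder, the two demo files, and the names wanted_order and read_with_source are hypothetical and exist only for this illustration.

```r
# Hypothetical demo data: two small tab-separated files in a temp folder
demo_dir <- file.path(tempdir(), "txt_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(x = 1:2, y = c("a", "b")),
            file.path(demo_dir, "first.txt"), sep = "\t", row.names = FALSE)
write.table(data.frame(x = 3:5, y = c("c", "d", "e")),
            file.path(demo_dir, "second.txt"), sep = "\t", row.names = FALSE)

# The order of this vector decides the order of the rows in the result
wanted_order <- c("second.txt", "first.txt")
paths <- file.path(demo_dir, wanted_order)

# Read one file and tag every row with the file it came from
read_with_source <- function(p) {
  d <- read.table(p, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
  d$source <- basename(p)
  d
}

combined <- do.call(rbind, lapply(paths, read_with_source))
print(combined)
```

Here the rows from second.txt come first simply because that file is listed first in wanted_order.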
I have 280 *.csv files in a directory. Each file has 3 columns and 1000 rows. I want to estimate Pearson's correlation between column 2 and 3 of each file and put the correlation value in the first cell of column 4, and also all 280 correlation values in a separate file. How can I do this in R?
I have tried several approaches, including the one below. I know it is incorrect, but I do not know how to fix it. Please help.
files <- list.files(path="mydirectory", pattern="*.csv", full.names=TRUE,
recursive=FALSE)
function(files)
lapply(files,function(x){
x <- read.csv(files, header = TRUE)
out <- function(cor(files[,2:3])
write.csv(out, sep = "\t", quote = FALSE, row.names = FALSE)
})
As for the first part, that's easy. You can calculate the correlations in an lapply loop and write them to a new file:
lapply(files, function(f) {
# Read CSV data
csv_data <- read.csv(f, header=TRUE)
# Calculate correlation
csv_data[, 4] <- cor(csv_data[, 2], csv_data[, 3])
# Create a new filename by replacing the ending of the
# input file (.csv) with (_cor.csv)
newfile <- gsub("\\.csv$", "_cor.csv", f)
write.csv(csv_data, file = newfile, quote = FALSE)
})
Since R wants columns in data.frames to have the same number of rows, this will fill every row of the 4th column with the correlation value. I would roll with this, but if you have a lot of data this can waste storage. Here's a not very elegant solution to only have the correlation in the first row:
lapply(files, function(f) {
# Read CSV data
csv_data <- read.csv(f, header=TRUE)
# Calculate correlation
csv_data[, 4] <- cor(csv_data[, 2], csv_data[, 3])
# Now delete duplicate values of cor
csv_data[2:nrow(csv_data), 4] <- NA
# Create a new filename by replacing the ending of the
# input file (.csv) with (_cor.csv)
newfile <- gsub("\\.csv$", "_cor.csv", f)
# Now when we write, we tell R to write an empty string when it encounters
# missing values
write.csv(csv_data, file = newfile, quote = FALSE, na = "")
})
Also:
You do not need to call function() when you use functions that already exist (like lapply() or cor()). You only need to use that when you want to define a new function yourself.
If you want to have the output in a single data.frame try:
my_df <- do.call(rbind,
lapply(files, function(f) {
# Read CSV data
csv_data <- read.csv(f, header=TRUE)
# Calculate correlation
data.frame(File=f, Correlation=cor(csv_data[, 2], csv_data[, 3]))
})
)
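To also get all correlation values into one separate file, the combined result can be written out once at the end. Here is a self-contained sketch using vapply(); the temporary folder, the two demo files, and the output file name all_correlations.csv are assumptions made for this example.

```r
# Hypothetical demo data: column 3 rises with column 2 in one file, falls in the other
demo_dir <- file.path(tempdir(), "cor_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:5, a = 1:5, b = 2 * (1:5)),
          file.path(demo_dir, "up.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:5, a = 1:5, b = -(1:5)),
          file.path(demo_dir, "down.csv"), row.names = FALSE)

files <- sort(list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE))

# One number per file; vapply() checks that each result really is a single numeric
cors <- vapply(files, function(f) {
  d <- read.csv(f, header = TRUE)
  cor(d[, 2], d[, 3])
}, numeric(1))

summary_df <- data.frame(File = basename(files), Correlation = unname(cors))

# Write the summary outside demo_dir so it is not picked up as input on a rerun
write.csv(summary_df, file.path(tempdir(), "all_correlations.csv"), row.names = FALSE)
print(summary_df)
```

Compared with do.call(rbind, lapply(...)), vapply() fails early if a file yields something other than a single number, which can help when one of many files is malformed.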
In a directory, I have 780 files and I need to bind them by rows, using R, in 78 different files and then write a .txt by file. The names of files are like these:
S1_S1_F1.xlsx
S1_S2_F1.xlsx
...
S1_S5_F1.xlsx
S1_S6_F2.xlsx
...
S1_S10_F2.xlsx
S2_S1_F1.xlsx
The first part of the expression S1_(.*).xlsx repeats 10 times, then changes up to S78_(.*).xlsx, with the second part changing from (.*)_S1(.*).xlsx to (.*)S10(.*).xlsx. I need to combine the files just by the second term to have 78 files from S1.txt to S78.txt.
I'm far from being an expert in R, so my approach was to do it file by file with the following code:
S1<-list.files(pattern = "^S1(.*).xlsx")
S1<-lapply(S1,read_excel)
S1 <- bind_rows(S1)
write.table(S1, "S1.txt", sep="\t",row.names=FALSE)
up to
S78<-list.files(pattern = "^S78(.*).xlsx")
S78 <-lapply(S78,read_excel)
S78 <- bind_rows(S78)
write.table(S78, "S78.txt", sep="\t",row.names=FALSE)
As you can see, this code seems to have been written by an australopithecus (which I'm not), so I beg your help! How can I do it with a for loop?
Simply wrap another lapply (which is a loop) around your lines iterating through the sequence of 1 to 78. Below will output the 78 txt files and leave you a list of 78 dataframes:
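library(readxl)  # read_excel()
library(dplyr)   # bind_rows()
dfList <- lapply(seq(1, 78), function(i) {
  # The "_" after the number stops S1 from also matching S10 to S19;
  # "\\." matches a literal dot and "$" anchors the extension
  f <- list.files(pattern = paste0("^S", i, "_.*\\.xlsx$"))
  dfs <- lapply(f, read_excel)
  df <- bind_rows(dfs) # OR base R's do.call(rbind, dfs)
  write.table(df, paste0("S", i, ".txt"), sep = "\t", row.names = FALSE)
  return(df)
})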
dfList[[1]]
dfList[[2]]
...
dfList[[78]]
And even use sapply to return a named list:
dfList <- sapply(paste0("S", seq(1, 78)), function(i) {
  # Same anchored pattern: "S1_" must not match "S10_"
  f <- list.files(pattern = paste0("^", i, "_.*\\.xlsx$"))
  dfs <- lapply(f, read_excel)
  df <- bind_rows(dfs)
  write.table(df, paste0(i, ".txt"), sep = "\t", row.names = FALSE)
  return(df)
}, simplify = FALSE)
dfList$S1
dfList$S2
...
dfList$S78
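One thing to watch with prefix patterns such as "^S1(.*).xlsx" is that they also match S10 through S19, because nothing forces the number to end after the 1. The short check below illustrates this on a few made-up file names that follow the naming scheme from the question.

```r
# Made-up names that follow the scheme from the question
file_names <- c("S1_S1_F1.xlsx", "S1_S10_F2.xlsx", "S10_S1_F1.xlsx", "S78_S3_F1.xlsx")

# Loose pattern: "S1" followed by anything, so "S10_..." slips through
loose <- grepl("^S1(.*).xlsx", file_names)

# Strict pattern: the underscore ends the number, the dot is literal, "$" anchors the end
strict <- grepl("^S1_.*\\.xlsx$", file_names)

print(data.frame(file_names, loose, strict))
```

With the loose pattern, the rows of S10_S1_F1.xlsx would end up in S1.txt as well as in S10.txt.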
I need to read many files into R, do some clean up, and then combine them into one data frame. The files all basically start like this:
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.07.11 09:47:35 =~=~=~=~=~=~=~=~=~=~=~=
up
Upload #18
Reader: S1 Site: AA
--------- upload 18 start ---------
Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap
E,2016-07-05,11:45:44.17,"upload 17 complete"
D,2016-07-05,11:46:24.69,00:00:00.87,HA,900_226000745055,A2,8,1102
D,2016-07-05,11:46:43.23,00:00:01.12,HA,900_226000745055,A2,10,143
The row with column headers is "Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap". Data should have 9 columns. The problem is that the number of rows above the header string is different for every file, so I cannot simply use skip = 5. I also only need lines that begin with "D,", everything else is messages, not data.
What is the best way to read in my files, ensuring that I have 9 columns and skipping all the junk?
I have been using the read_csv function from the readr package because thus far it has produced the fewest formatting issues. But I am open to any new ideas, including a way to read in just the lines that begin with "D,". I toyed with using read.table and skip = grep("Type,", readLines(i)), but it doesn't seem to find the header string correctly. Here's my basic code:
dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
d01 <- read_csv(i, col_names = F, na = "NA", skip = 35)
# do clean-up stuff
datalist[[i]] <- d
}
Another base R solution: read the file line by line and get the indices of the rows that begin with "D," and of the header row. Then split these lines on ",", put the pieces in a data.frame, and assign the names from the header row to it.
lines <- readLines(i)
dataRows <- grep("^D,", lines)
# Anchor the pattern so that only the header line itself is matched
header <- unlist(strsplit(lines[grep("^Type,", lines)[1]], split = ","))
data <- as.data.frame(matrix(unlist(strsplit(lines[dataRows], ",")), nrow = length(dataRows), byrow = TRUE))
names(data) <- header
Output:
Type Date Time Duration Type Tag ID Ant Count Gap
1 D 2016-07-05 11:46:24.69 00:00:00.87 HA 900_226000745055 A2 8 1102
2 D 2016-07-05 11:46:43.23 00:00:01.12 HA 900_226000745055 A2 10 143
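A variation on the same idea is to hand the filtered lines back to read.csv() through its text argument, which then takes care of quoting and column types. The sketch below uses the sample lines from the question as an in-memory vector instead of readLines(i); the variable names are illustrative only.

```r
# Sample log from the question, held in memory for the demo
lines <- c(
  "=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.07.11 09:47:35 =~=~=~=~=~=~=~=~=~=~=~=",
  "up",
  "Upload #18",
  "Reader: S1 Site: AA",
  "--------- upload 18 start ---------",
  "Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap",
  "E,2016-07-05,11:45:44.17,\"upload 17 complete\"",
  "D,2016-07-05,11:46:24.69,00:00:00.87,HA,900_226000745055,A2,8,1102",
  "D,2016-07-05,11:46:43.23,00:00:01.12,HA,900_226000745055,A2,10,143"
)

# Keep only the header line and the lines that start with "D,"
header_line <- grep("^Type,", lines, value = TRUE)[1]
data_lines <- grep("^D,", lines, value = TRUE)

# Parse the kept lines as if they were a small CSV file
parsed <- read.csv(text = c(header_line, data_lines), stringsAsFactors = FALSE)
str(parsed)
```

Because the header contains Type twice, read.csv() renames the second one to Type.1, and the space in Tag ID becomes a dot unless check.names = FALSE is passed.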
You can use a custom function to loop over each file and filter only those which start with D in the type column and bind them all together at the end. Drop the bind_rows if you want them as separate lists.
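library(dplyr)
load_data <- function(path) {
  # Full paths, so no setwd() is needed and no extension has to be re-added
  files <- list.files(path, full.names = TRUE)
  read_files <- function(x) {
    # Read every line as 9 character fields; shorter junk lines are padded with NA
    data_file <- read.csv(x, header = FALSE, col.names = paste0("V", 1:9), fill = TRUE, colClasses = "character", na.strings = c("", "NA"))
    row.number <- grep("^Type$", data_file[, 1])[1]
    # "Type" occurs twice in the header, so make the names unique (Type, Type.1)
    colnames(data_file) <- make.unique(unname(unlist(data_file[row.number, ])))
    # Drop everything up to and including the header row
    data_file <- data_file[-seq_len(row.number), ]
    data_file <- data_file %>%
      filter(grepl("^D$", Type))
    return(data_file)
  }
  lapply(files, read_files)
}
list_of_file <- bind_rows(load_data("YOUR_FOLDER_PATH"))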
If your header row always begins with the word Type, you can simply omit the skip option from your initial read, and then remove any rows before the header row. Here's some code to get you started (not tested):
library(readr)
dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
  d01 <- read_csv(i, col_names = FALSE, na = "NA")
  headerRow <- which(d01[[1]] == "Type")[1]
  d01 <- d01[(headerRow + 1):nrow(d01), ] # This keeps all rows after the header row.
  # do clean-up stuff
  datalist[[i]] <- d01
}
If you want to keep the header, you can use:
for (i in dataFiles) {
  d01 <- read_csv(i, col_names = FALSE, na = "NA")
  headerRow <- which(d01[[1]] == "Type")[1]
  # Get names from the header row before it is dropped; "Type" occurs twice
  header <- make.unique(as.character(unlist(d01[headerRow, ])))
  d01 <- d01[(headerRow + 1):nrow(d01), ] # This keeps all rows after the header row.
  d01 <- setNames(d01, header) # Assign names; the result has to be stored.
  # do clean-up stuff
  datalist[[i]] <- d01
}
I have a folder with 142 tab-delimited text files. Each file has 19 variables, and then a number of rows beneath (usually no more than 30 rows, but it varies).
I want to do several things with these files in R automatically, and I can't seem to get exactly what I want with my code. I am new to loops. I got both sections of code from previous posts here on Stack Overflow but can't figure out how to combine their functions.
I want to turn the filename into a variable when reading the files into R, so that each row has the identifying file name
Concatenate all files (with filename variable and no header) into one dataframe with dimensions Yx19, where Y=however many resulting rows there are.
I am able to create a list of the 142 dataframes using this code:
myFiles = list.files(path="~/Documents/ForR/", pattern="*.txt")
data <- lapply(myFiles, read.table, sep="\t", header=FALSE)
names(data) <- myFiles
for(i in myFiles)
data[[i]]$Source = i
do.call(rbind, data)
I am able to create the dataframe I want with 19 variables, but the filename is not present:
files <- list.files(path="~/Documents/ForR/.", pattern=".txt")
DF <- NULL
for (f in files) {
dat <- read.csv(f, header=F, sep="\t", na.strings="", colClasses="character")
DF <- rbind(DF, dat)
}
How do I add the file name (without .txt if possible) as a variable to the loop?
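Add this line to the loop:
dat$file <- unlist(strsplit(f, split=".", fixed=TRUE))[1]
so that the whole loop becomes:
# pattern is a regular expression, so the dot is escaped and the end is anchored
files <- list.files(path="~/Documents/ForR/.", pattern="\\.txt$")
DF <- NULL
for (f in files) {
  dat <- read.csv(f, header=FALSE, sep="\t", na.strings="", colClasses="character")
  # Everything before the first "." in the file name
  dat$file <- unlist(strsplit(f, split=".", fixed=TRUE))[1]
  DF <- rbind(DF, dat)
}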
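The row.names produced by do.call(rbind, ...) have the format names(list)[n].i, where i runs over 1:number_of_rows of data frame n, provided the list is named. So you can build the column from the row.names:
data <- lapply(myFiles, read.table, sep="\t", header=FALSE)
names(data) <- myFiles # without names, the row names are just 1, 2, 3, ...
combined.data <- do.call(rbind, data)
# Strip the trailing ".i" row counter to recover the file name
combined.data$file_origin <- sub("\\.[0-9]+$", "", row.names(combined.data))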
You can use basename to get the last path element( filename) , for example:
(files = file.path("~","Documents","ForR",c("file1.txt", "file2.txt")))
"~/Documents/ForR/file1.txt" "~/Documents/ForR/file2.txt"
(basename(files))
[1] "file1.txt" "file2.txt"
Then sub to remove the extension ".txt":
sub('.txt','',basename(files),fixed=TRUE)
[1] "file1" "file2"
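As an alternative to sub(), the tools package that ships with R has a helper for removing extensions. A minimal sketch, reusing the two example paths from above:

```r
# Same example paths as above
files <- file.path("~", "Documents", "ForR", c("file1.txt", "file2.txt"))

# basename() drops the directory, file_path_sans_ext() drops the extension
stems <- tools::file_path_sans_ext(basename(files))
print(stems)
```

Unlike sub('.txt', '', ..., fixed=TRUE), which removes the first ".txt" wherever it occurs in the name, this only strips the trailing extension.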