I am trying to read the number of rows and columns in several csv files inside a folder.
My code reads all the files, but it shows 0 rows and 0 columns for each of them.
files_map <- "C:/Users/Windows 10/Desktop/dados/planilhas LLS"
files <- list.files(full.names = F)

library(data.table)
output <- data.table::rbindlist(lapply(files, function(file) {
  dt <- data.table::fread(paste(files_map, file, sep = " "))
  list("number_of_cols" = ncol(dt), "number_of_rows" = nrow(dt), "name_of_file" = file)
}))
How could I solve this?
Thanks
I made a test on my computer, slightly changing your files, and this produces the correct output. You need to change paste to paste0 because you don't want spaces in your file paths, and add a "/" separator. You also need to pass the folder to list.files() so that files actually lists that folder.
library(data.table)
setwd("Desktop/")

## make up some random files
fwrite(mtcars, "test_a")
fwrite(mtcars, "test_b")
fwrite(mtcars, "test_c")

files_map <- "~/Desktop"
files <- list.files(files_map, pattern = "^test_") # list only the test files we just wrote

output <- data.table::rbindlist(lapply(files, function(file) {
  dt <- data.table::fread(paste0(files_map, "/", file))
  list("number_of_cols" = ncol(dt), "number_of_rows" = nrow(dt), "name_of_file" = file)
}))
output
number_of_cols number_of_rows name_of_file
1: 11 32 test_a
2: 11 32 test_b
3: 11 32 test_c
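As an aside, base R's file.path() builds the same path without hand-placed separators, so the paste vs. paste0 question goes away entirely:

dt <- data.table::fread(file.path(files_map, file)) # file.path() inserts the "/" for you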
Related
I have radiotelemetry data that is downloaded as a series of text files. I was provided with code in 2018 that looped through all the text files and converted them into CSV files. Up until 2021 this code worked. However, now the below code (specifically the lapply loop) returns the following error:
"Error in setnames(x, value) :
Can't assign 1 names to a 4 column data.table"
# set the working directory to the folder that contain this script, must run in RStudio
setwd(dirname(rstudioapi::callFun("getActiveDocumentContext")$path))
# get the path to the master data folder
path_to_data <- paste(getwd(), "data", sep = "/", collapse = NULL)
# extract .TXT file
files <- list.files(path=path_to_data, pattern="*.TXT", full.names=TRUE, recursive=TRUE)
# regular expression of the record we want
regex <- "^\\d*\\/\\d*\\/\\d*\\s*\\d*:\\d*:\\d*\\s*\\d*\\s*\\d*\\s*\\d*\\s*\\d*"
# vector of column names, no whitespace
columns <- c("Date", "Time", "Channel", "TagID", "Antenna", "Power")
# loop through all .TXT files, extract valid records and save to .csv files
lapply(files, function(x){
  df <- read_table(file) # read the .TXT file to a DataFrame
  dt <- data.table(df) # convert the dataframe to a more efficient data structure
  colnames(dt) <- c("columns") # modify the column name
  valid <- dt %>% filter(str_detect(col, regex)) # filter based on regular expression
  valid <- separate(valid, col, into = columns, sep = "\\s+") # split into columns
  towner_name <- str_sub(basename(file), start = 1, end = 2) # extract tower name
  valid$Tower <- rep(towner_name, nrow(valid)) # add Tower column
  file_path <- file.path(dirname(file), paste(str_sub(basename(file), end = -5), ".csv", sep = ""))
  write.csv(valid, file = file_path, row.names = FALSE, quote = FALSE) # save to .csv
})
I looked up possible fixes for this and found that adding setnames(skip_absent = TRUE) in the loop resolved the setnames error, but instead gave the error "Error in is.data.frame(x) : argument "x" is missing, with no default":
lapply(files, function(file){
  df <- read_table(file) # read the .TXT file to a DataFrame
  dt <- data.table(df) # convert the dataframe to a more efficient data structure
  setnames(skip_absent = TRUE)
  colnames(dt) <- c("col") # modify the column name
  valid <- dt %>% filter(str_detect(col, regex)) # filter based on regular expression
  valid <- separate(valid, col, into = columns, sep = "\\s+") # split into columns
  towner_name <- str_sub(basename(file), start = 1, end = 2) # extract tower name
  valid$Tower <- rep(towner_name, nrow(valid)) # add Tower column
  file_path <- file.path(dirname(file), paste(str_sub(basename(file), end = -5), ".csv", sep = ""))
  write.csv(valid, file = file_path, row.names = FALSE, quote = FALSE) # save to .csv
})
I'm confused as to why this code is no longer working despite working fine last year. Any help would be greatly appreciated!
The error occurred at the line colnames(dt) <- c("columns"), where you provided only one name to rename the (supposedly) 4-column data.table. If you meant to rename a particular column, you can use
colnames(dt)[i] <- c("columns")
where i is the index of the column you are renaming. Alternatively, provide a vector with 4 new names.
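For illustration, here are both options on a hypothetical 4-column data.table (the names below are made up, not from your data):

library(data.table)
dt <- data.table(V1 = 1, V2 = 2, V3 = 3, V4 = 4) # stand-in for your parsed file
colnames(dt)[1] <- "col"              # rename only the first column
colnames(dt) <- c("a", "b", "c", "d") # or supply one name per column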
Is there any way to get the number of rows and columns of multiple CSV files in R and save that information to a CSV file? Here is my R code:
#Library
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("fs")) install.packages("fs")
#Mentioning Files Location
file_paths <- fs::dir_ls("C:\\Users\\Desktop\\FileCount\\Test")
file_paths[[2]]
#Reading Multiple CSV Files
file_paths %>%
  map(function(path) {
    read_csv(path, col_names = FALSE)
  })

#Counting Number of Rows
lapply(X = file_paths, FUN = function(x) {
  length(count.fields(x))
})

#Counting Number of Columns
lapply(X = file_paths, FUN = function(x) {
  length(ncol(x))
})
#Saving CSV File
write.csv(file_paths,"C:\\Users\\Desktop\\FileCount\\Test\\FileName.csv", row.names = FALSE)
A couple of things are not working:
The number of columns of each CSV file is not being counted.
When I save the file, I want to save the filename, the number of rows, and the number of columns.
Any help appreciated.
Welcome to SO! Using the tidyverse and data.table, here's a way to do it:
Note: all the .csv files are in my TestStack directory, but you can change it to your own directory (C:/Users/Desktop/FileCount/Test).
Code:
library(tidyverse)

csv.file <- list.files("TestStack") # Directory with your .csv files

data.frame.output <- data.frame(number_of_cols = NA,
                                number_of_rows = NA,
                                name_of_csv = NA) # The df to be written

MyF <- function(x){
  csv.read.file <- data.table::fread(paste("TestStack", x, sep = "/"))
  number.of.cols <- ncol(csv.read.file)
  number.of.rows <- nrow(csv.read.file)
  data.frame.output <<- add_row(data.frame.output,
                                number_of_cols = number.of.cols,
                                number_of_rows = number.of.rows,
                                name_of_csv = str_remove_all(x, "\\.csv")) %>%
    filter(!is.na(name_of_csv))
}

map(csv.file, MyF)
Output:
number_of_cols number_of_rows name_of_csv
1 3 2150 CH_com
2 2 34968 epci_com_20
3 3 732 g1g4
4 7 161905 RP
I have this output because my TestStack had 4 files named CH_com.csv, epci_com_20.csv,...
You can then write the object data.frame.output to a .csv as you wanted: data.table::fwrite(data.frame.output, file = "Output.csv")
files_map <- "test"
files <- list.files(files_map)
library(data.table)
output <- data.table::rbindlist(
lapply(files, function(file) {
dt <- data.table::fread(paste(files_map, file, sep = "/"))
list("number_of_cols" = ncol(dt), "number_of_rows" = nrow(dt), "name_of_csv" = file)
})
)
data.table::fwrite(output, file = "Filename.csv")
Or with map and a separate function to do the tasks, but without creating an empty table first and updating it with a global assignment. I see this pattern a lot in apply functions, while it is not needed at all.
library(purrr)

myF <- function(file) {
  dt <- data.table::fread(paste(files_map, file, sep = "/"))
  data.frame("number_of_cols" = ncol(dt), "number_of_rows" = nrow(dt), "name_of_csv" = file)
}

output <- do.call(rbind, map(files, myF))
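If you are loading purrr anyway, map_dfr() row-binds the results directly, so the do.call(rbind, ...) step disappears (a minor stylistic variation, same result):

output <- purrr::map_dfr(files, myF) # same as do.call(rbind, map(files, myF))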
I have approximately 400 .csv files and need to take just one value from each of them (cell B2 if opened using spreadsheet software).
Each file is an extract from a single date and is named accordingly (e.g. extract_2017-11-01.csv, extract_2018-04-05, etc.)
I know that I can do something like this to iterate over the files (correct me if I am wrong, or if there is a better way, please do tell me):
path <- "~/csv_files"
out.file <- ""
file.names <- dir(path, pattern = ".csv")
for(i in 1:length(file.names)){
  file <- read.table(file.names[i], header = TRUE, sep = ",")
  out.file <- rbind(out.file, file)
}
I want to effectively add something at the end of this which creates a data frame consisting of two columns: the first column will show the date (this ideally would be taken from the filename) and the second column will hold the value in cell B2.
How can I do this?
This lets you select only the second row and the second column when you import:
extract_2018_11_26 <- read.table("csv_files/extract_2018-11-26.csv",
                                 sep = ";", header = TRUE, nrows = 1,
                                 colClasses = c("NULL", NA, "NULL"))
This works because nrows = 1 means that we read only the first row (after the header), and in colClasses you specify "NULL" for a column you want to skip and NA for a column you want to keep.
Here, following your code, gsub() lets you find a pattern in a string and replace it:
out.file <- data.frame()
for(i in 1:length(file.names)){
  file <- read.table(file.names[i],
                     sep = ";", header = TRUE, nrows = 1,
                     colClasses = c("NULL", NA, "NULL"))
  date <- gsub("extract_|\\.csv", "", x = file.names[i]) # extracts the date from the file name
  out.file <- rbind(out.file, data.frame(date, col = file[, 1]))
}
out.file
# date col
# 1 2018-11-26 2
# 2 2018-11-27 2
Here are the two original .csv files:
#first file, name: extract_2018-11-26.csv
col1 col2 col3
1 1 2 3
2 4 5 6
#second file, name: extract_2018-11-27.csv
col1 col2 col3
1 1 2 3
2 4 5 6
data.table approach
library(data.table)

# build a list of the csv files you want to load
files <- list.files(path = "yourpath", pattern = ".*\\.csv$", full.names = TRUE)

# get the value from the second row (skip = 1), second column (select = 2) of each csv
# using data.table::fread, then bind the list together using data.table::rbindlist
rbindlist(lapply(files, fread, nrows = 1, skip = 1, select = 2))
Extracting the date from the filename is a different, regex-related question; please ask it in a separate question.
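That said, a minimal sketch of the idea, assuming filenames of the form extract_YYYY-MM-DD.csv as in your example (the pattern is an assumption, adjust to your real names):

dates <- sub("^extract_(\\d{4}-\\d{2}-\\d{2})\\.csv$", "\\1", basename(files))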
I need to read many files into R, do some clean up, and then combine them into one data frame. The files all basically start like this:
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.07.11 09:47:35 =~=~=~=~=~=~=~=~=~=~=~=
up
Upload #18
Reader: S1 Site: AA
--------- upload 18 start ---------
Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap
E,2016-07-05,11:45:44.17,"upload 17 complete"
D,2016-07-05,11:46:24.69,00:00:00.87,HA,900_226000745055,A2,8,1102
D,2016-07-05,11:46:43.23,00:00:01.12,HA,900_226000745055,A2,10,143
The row with column headers is "Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap". Data should have 9 columns. The problem is that the number of rows above the header string is different for every file, so I cannot simply use skip = 5. I also only need lines that begin with "D,", everything else is messages, not data.
What is the best way to read in my files, ensuring that I have 9 columns and skipping all the junk?
I have been using the read_csv function from the readr package because thus far it has produced the fewest formatting issues. But I am open to any new ideas, including a way to read in just the lines that begin with "D,". I toyed with using read.table and skip = grep("Type,", readLines(i)), but it doesn't seem to find the header string correctly. Here's my basic code:
dataFiles <- Sys.glob("*.*")
datalist <- list()

for (i in dataFiles) {
  d01 <- read_csv(i, col_names = F, na = "NA", skip = 35)
  # do clean-up stuff
  datalist[[i]] <- d
}
One other base R solution is the following: you read in the file by lines, get the indices of the rows that begin with "D" and of the header row, then split these lines by ",", put them in a data.frame, and assign the names from the header row to it.
lines <- readLines(i)
dataRows <- grep("^D,", lines) # indices of the data rows
names <- unlist(strsplit(lines[grep("Type,", lines)], split = ",")) # header row as a character vector
data <- as.data.frame(matrix(unlist(strsplit(lines[dataRows], ",")),
                             nrow = length(dataRows), byrow = TRUE))
names(data) <- names
Output:
Type Date Time Duration Type Tag ID Ant Count Gap
1 D 2016-07-05 11:46:24.69 00:00:00.87 HA 900_226000745055 A2 8 1102
2 D 2016-07-05 11:46:43.23 00:00:01.12 HA 900_226000745055 A2 10 143
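An equivalent variant feeds the matching lines straight to read.csv() via its text argument, which also handles type conversion; this is a sketch of the same idea, not the answer's original code:

data <- read.csv(text = lines[dataRows], header = FALSE,
                 col.names = names, stringsAsFactors = FALSE)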
You can use a custom function to loop over each file, keep only the rows whose Type column starts with D, and bind them all together at the end. Drop the bind_rows if you want them as a list of separate data frames.
load_data <- function(path) {
  require(dplyr)
  files <- dir(path)
  read_files <- function(x) {
    data_file <- read.csv(file.path(path, x), stringsAsFactors = FALSE,
                          na.strings = c("", "NA"))
    row.number <- grep("^Type$", data_file[, 1]) # find the header row
    colnames(data_file) <- data_file[row.number, ] # use it as the column names
    data_file <- data_file[-(1:row.number), ] # drop everything up to and including it
    data_file <- data_file %>%
      filter(grepl("^D", Type)) # keep only the "D" records
    return(data_file)
  }
  data <- lapply(files, read_files)
}
list_of_file <- bind_rows(load_data("YOUR_FOLDER_PATH"))
If your header row always begins with the word Type, you can simply omit the skip option from your initial read, and then remove any rows before the header row. Here's some code to get you started (not tested):
dataFiles <- Sys.glob("*.*")
datalist <- list()

for (i in dataFiles) {
  d01 <- read_csv(i, col_names = F, na = "NA")
  headerRow <- which(d01[, 1] == 'Type')
  d01 <- d01[-(1:headerRow), ] # This keeps all rows after the header row.
  # do clean-up stuff
  datalist[[i]] <- d01
}
If you want to keep the header, you can use:
for (i in dataFiles) {
  d01 <- read_csv(i, col_names = F, na = "NA")
  headerRow <- which(d01[, 1] == 'Type')
  header <- unlist(d01[headerRow, ]) # Get the names from the header row first.
  d01 <- d01[-(1:headerRow), ] # Then keep all rows after the header row.
  d01 <- setNames(d01, header) # Assign the names.
  # do clean-up stuff
  datalist[[i]] <- d01
}
I would like to read multiple text files from my directory. The files are named in the following format:
regional_vol_GM_atlas1.txt
regional_vol_GM_atlas2.txt
........
regional_vol_GM_atlas152.txt
Data in the files looks like the following:
667869 667869
580083 580083
316133 316133
3631 3631
Following is the script that I have written:
library(readr)
library(stringr)
library(data.table)

array <- c()
for (file in dir("/media/dev/Daten/Task1/subject1/t1")) # path to the directory where .txt files are located
{
  row4 <- read.table(file = list.files(pattern = "regional_vol*.txt"),
                     header = FALSE,
                     row.names = NULL,
                     skip = 3, # Skip the 1st 3 rows
                     nrows = 1, # Read only the next row after skipping the 1st 3 rows
                     sep = "\t") # change the separator if it is not "\t"
  array <- cbind(array, row4)
}
I am incurring the following error:
Error in file(file, "rt") : invalid 'description' argument
Kindly suggest where I went wrong in the script.
This seems to work fine for me. Make changes as per the code comments in case your files have headers:
[Answer Edited to reflect new information posted by OP]
# rm(list=ls()) # clean memory if you can afford to
mydir <- "~/Desktop/a" # change as per your path

# read full paths
myfiles <- list.files(mydir, pattern = "regional_vol*", full.names = TRUE)
myfiles # check that files listed correctly

# initialise the dataframe from the first file
# change header = T/F depending on presence of header
# make sure sep is correct
df <- read.csv(myfiles[1], header = F, skip = 0, nrows = 4, sep = "")[-c(1:3), ]

# check that the first line was read correctly
df

# read all the other files and update the dataframe
# we read 4 lines to get past the header reliably, then remove the first 3
ans <- lapply(myfiles[-1], function(x){ read.csv(x, header = F, skip = 0, nrows = 4, sep = "")[-c(1:3), ] })
ans

# update the dataframe
lapply(ans, function(x){ df <<- rbind(df, x) })

# this should be the required dataframe
df
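As a side note, the same result can be had without the <<- global assignment, in the spirit of the do.call(rbind, ...) pattern used elsewhere in this thread (a minor stylistic variation):

df <- do.call(rbind, lapply(myfiles, function(x) {
  read.csv(x, header = F, skip = 0, nrows = 4, sep = "")[-c(1:3), ]
}))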
Also, if you are on Linux, a much simpler method would be to simply make the OS do it for you:
awk 'FNR == 4' regional_vol*.txt
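If you'd rather stay inside R, data.table::fread() can run that same command through its cmd argument (assuming awk is available on your system):

data.table::fread(cmd = "awk 'FNR == 4' regional_vol*.txt") # 4th line of every file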
This should do it for you.
# set the working directory (where the files are saved)
setwd("C:/Users/your_path_here/Desktop/")

file_names = list.files(getwd())
file_names = file_names[grepl(".TXT", file_names)]

# print file_names vector
file_names

# read one file, just for testing
# file = read.csv("C:/Users/your_path_here/Desktop/regional_vol_GM_atlas1.txt", header = F, stringsAsFactors = F)
# see the data structure
# str(file)

# run read.csv on all values of file_names
files = lapply(file_names, read.csv, header = F, stringsAsFactors = F)
files = do.call(rbind, files)

# set column names (adjust to one name per column in your files)
names(files) = c("field1", "field2", "field3", "field4", "field5")
str(files)

write.table(files, "C:/Users/your_path_here/Desktop/mydata.txt", sep = "\t")
write.csv(files, "C:/Users/your_path_here/Desktop/mydata.csv")