How to find specific data in a csv file - r

I have lots of csv files and I want to find some data in them. Because the position of the data is different in each file, I want to know how to find the data in the red box (from my screenshot) in the different csv files.
Each csv file includes the same kinds of data, such as the different months. My idea is to find EnergyConsumptionElectricityNaturalGasMonthly in the csv file, get back its position, and then pick the red-box data according to that position.

Read the file into a character vector of lines using readLines:
con <- file("temp2Table.csv", "r")
x <- readLines(con)
close(con)
Then find the row where we need to subset:
grep("EnergyConsumptionElectricityNaturalGasMonthly", x)
# [1] 16534
Once we know the row number, we can subset the rows at offsets 4 to 20 (the header plus the following 16 data rows) and
write them out to a file:
write(x[ grep("EnergyConsumptionElectricityNaturalGasMonthly", x) + 4:20 ], "tempOut.csv")
Then we can read that file as a normal csv:
dfClean <- read.csv("tempOut.csv")
And subset columns as we need:
dfClean[, 2:3]
# X.1 ELECTRICITY.FACILITY..kWh.
# 1 January 11675.57
# 2 February 9148.04
# 3 March 13862.50
# 4 April 16274.57
# 5 May 23918.16
# 6 June 29293.78
# 7 July 32953.04
# 8 August 34111.54
# 9 September 24398.53
# 10 October 14577.93
# 11 November 13931.94
# 12 December 12137.73
# 13 NA
# 14 Annual Sum or Average 236283.34
# 15 Minimum of Months 9148.04
# 16 Maximum of Months 34111.54
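Since you have lots of files, here is a minimal sketch that wraps the same idea in a function and applies it to every csv in a folder. It assumes every file uses the same marker and the same 4:20 offsets as temp2Table.csv above; read.csv can parse the subset straight from a text connection, so no temporary file is needed:
extract_energy_block <- function(file) {
  x <- readLines(file)
  pos <- grep("EnergyConsumptionElectricityNaturalGasMonthly", x)
  # assumes exactly one match per file, with the header 4 lines below it
  read.csv(text = x[pos + 4:20])
}
files <- list.files(pattern = "\\.csv$")
blocks <- lapply(files, extract_energy_block)
names(blocks) <- files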

I would read the csvs and subset the terms you want. Assuming they all have the same file structure and are contained in the same folder, you could do the following:
library(data.table) # needed for fread; you can just use read.csv if you prefer
# create a vector of the files in the folder
folder <- 'address_to_folder' # without the trailing "/"
files <- list.files(path = folder, pattern = "\\.csv$")
# read the files into a list and then combine them into one data.table
mycsv <- lapply(paste(folder, files, sep = '/'), fread)
mydata <- rbindlist(mycsv)
# This part will need interpretation of the data frame:
# you have to see where the column you want is,
# whether it is correctly formatted, and how you can search it
search_result <- mydata[ mydata$column == 'search term', ]
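If mydata is a data.table (which fread returns), you can also filter with data.table's own syntax; column is still a placeholder for your real column name:
# same filter in native data.table syntax
search_result <- mydata[column == 'search term']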

Related

How to deal with "text_field must refer to a character mode column"?

I was just trying to convert my dataframe into a corpus (a very normal procedure) when the following error popped up: Error in corpus.data.frame(df, text_field = "Text"): text_field must refer to a character mode column. My dataframe is a normal one with specified columns (Date and Text). The only "new" thing I did compared to the past is that I read the texts from text files, and when trying to read one document I got Error in nchar(lev, "w"): invalid multibyte string, element 198, just because there were some symbols that R could not read.
#Read the data from the folder (read_file is from the readr package)
library(readr)
library(quanteda)
file.list <- list.files(pattern = '*.txt')
df.list <- lapply(file.list, read_file)
#Convert to a dataframe
df <- do.call(rbind.data.frame, df.list)
colnames(df) <- c("Text")
#Create the Corpus
cp <- corpus(df, text_field = "Text")
The dataframe looks like this:
Date Text
1 5 January 2000 Text
2 3 February 2000 Text
3 2 March 2000 Text
4 30 March 2000 Text
5 13 April 2000 text
6 11 May 2000
I have no idea how to deal with it. Can anyone help me? Thanks a lot.
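A note in case it helps: this corpus() error usually means the Text column is a factor rather than character, which is what rbind.data.frame produces by default in R versions before 4.0. A minimal sketch of the usual fix, assuming df.list is the list of raw texts read above:
# build the data frame with character columns (R < 4.0 defaults to factors)
df <- data.frame(Text = unlist(df.list), stringsAsFactors = FALSE)
# or repair the existing data frame in place
df$Text <- as.character(df$Text)
cp <- corpus(df, text_field = "Text")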

Importing multiple csv files from different folders and extracting the file name as additional columns: header fate and the multi-folder case

Use of ldply (package "plyr") to import multiple csv files from a folder: what happens to the headers, and how to do it for multiple folders?
set up:
- Desktop: MacBook Pro (Early 2011) with macOS 10.13.6
- Software version: R version 3.5.1 (2018-07-02) -- "Feather Spray"
- RStudio: Version 1.1.456
I would like to import multiple csv files from specific folders and merge them into one file with 5 columns: Variable1/Variable2/file_name/experiment_nb/pulse_nb
I have managed, using previous similar questions on Stack Overflow, to import all the files from one folder into the same data.frame. However, I am not sure how to do it for different folders, nor what happens to the header of each file after the merge. As the files are too big to handle manually (200,000 lines per file), I want to make sure there is no mistake that would cause all subsequent analysis to fail, such as the header line of an imported csv file ending up among the data rows.
The csv file names look like this: "20190409-0001_002.csv", with the date, followed by the name of the experiment (0001 in the example) and the number of the pulse (002).
#setting package and directory
library(plyr)
library(stringr)
setwd("/Users/macbook/Desktop/Project_Folder/File_folder1")
#Creating a list of all the filenames:
filenames <- list.files(path = "/Users/macbook/Desktop/Project_Folder/File_folder1")
#creating a function to read a csv and at the same time add a column with the file name
read_csv_filename <- function(filename) {
  ret <- read.csv(filename, header = TRUE, sep = ",")
  ret$Source <- filename # store the source file name
  ret
}
#importing
import <- ldply(filenames, read_csv_filename)
#making a copy of import
data<-import
#modify the file name to remove ".csv" and change the header
data$Source<-str_sub(data$Source, end=-5)
data[1,3]<-"date_expnb_pulsenb"
t<-substr(data[1,3],1,3)
head(data, n=10)
#create a column with the experiment number, extracted from the file name
data$expnb<-substr(data$Source, 10, 13)
data$expnb<-as.numeric(data$expnb)
head(data, n=10)
tail(data, n=10)
1° Now I need to import all the other folders into the same data frame. I could eventually do this manually because the number of folders is manageable (9-10), but I am considering writing code for this as well, for future experiments with a large number of experiments. How do I do that? First list all the folders, then list all the files in those folders, and then regroup them into one file list? Is this doable with list.files?
The folder names will look like this: "20190409-0001"
2° The result from the code above (head(data, n=10)) looks like this:
> head(data, n=10)
Time Channel.A Source pulsenb expnb
1 (us) (A) expnb_pulsenb NA NA
2 -20.00200030 -0.29219970 20190409-0001_002 2 1
3 -20.00100030 -0.29219970 20190409-0001_002 2 1
and
> tail(data, n=10)
Time Channel.A Source pulsenb expnb
20800511 179.99199405 -0.81815930 20190409-0001_105 105 1
20800512 179.99299405 -0.81815930 20190409-0001_105 105 1
I would like to run extensive data analysis on the now big table, and I am wondering how to check that there are no lines with file headers hidden in the middle of it. As the headers are the same in all the csv files, does the ldply function already take the headers into account? Would the file headers end up as separate lines in the "import" data frame? How can I check that? (Unfortunately, there are around 200,000 lines in each file, so I cannot really check for headers manually.)
I hope I have added all the required details and put the questions in the right format, as it is my first time posting here :)
Thank you guys in advance for your help!
I have created a mock environment of folders and files, assuming that you would logically regroup all your files and folders.
# ---
# set up folders and data
lapply( as.list(paste0("iris", 1:3)), dir.create )
iris_write <- function(name) write.csv(x = iris, file = name)
lapply( as.list(paste0("iris", 1:3, "/iris", 1:3, ".csv")), iris_write)
# Supposing you got them all in one folder, one level up
ldir <- list.dirs()
ldir <- ldir[stringr::str_detect(string = ldir, pattern = "iris")] # use 20190409-0001 in your case
# Getting all files
lfiles <- lapply( as.list(ldir), list.files )
# Getting all path
path_fun <- function(dirname) paste0(dirname, "/", list.files(dirname) )
lpath <- lapply( as.list(ldir), path_fun )
Using base R and/or the data.table package
# ---
# --- Import, with functions that detect automatically headers, sep + are way faster to read data
# *** Using data.table
library(data.table)
read_csv_filename <- function(filename){
  ret <- fread(filename)
  ret$Source <- filename # store the source file name
  ret
}
ldata <- lapply( unlist(lpath), read_csv_filename ) # unlist so each file path is read on its own
# --- if you want to regroup them
# with r base
df_final <- do.call("rbind", ldata)
# using data.table
df_final <- rbindlist(ldata)
Using the dplyr package (reading with read_csv from readr)
# *** using dplyr and readr
library(dplyr)
library(readr)
read_csv_filename2 <- function(filename){
  ret <- read_csv(filename)
  ret$Source <- filename # store the source file name
  ret
}
ldata <- lapply( unlist(lpath), read_csv_filename2 )
df_final <- bind_rows(ldata)
# you may do this with plyr::ldply also
df_final2 <- plyr::ldply(ldata, data.frame)
# *** END loading
Last suggestion: file_path_sans_ext from the tools package
# modifying the file name so it removes ".csv" and change the header
library(tools)
data$Source <- tools::file_path_sans_ext( data$Source )
#create a column with the experiment number, extracted from the file name
data$expnb <- substr(data$Source, 10, 13)
data$expnb <- as.numeric(data$expnb)
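On your question about stray header lines: fread and read.csv consume each file's header once while reading, so the column names themselves should not land in the data rows, but extra sub-header lines inside the files (like the "(us)" / "(A)" units row in your head(data) output) will. A quick check sketch, assuming the merged object and the Time column from your output:
# rows where Time is not parseable as a number are leftover header or
# units lines, if any
sum(is.na(suppressWarnings(as.numeric(df_final$Time))))
# inspect those rows directly
df_final[is.na(suppressWarnings(as.numeric(df_final$Time))), ]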
Hope this helps :)
I'll add my solution too, using purrr's map_dfr.
Generate Data
This will just generate a lot of csv files in a temp directory for us to manipulate. This is a good approach for helping us answer questions for you.
library(tidyverse)
library(fs)
temp_directory <- tempdir()
library(nycflights13)
purrr::iwalk(
  split(flights, flights$carrier),
  ~ {
    str(.x$carrier[[1]])
    vroom::vroom_write(.x, paste0(temp_directory, "/", glue::glue("flights_{.y}.csv")), delim = ",")
  }
)
Custom Function
It looks like you have a custom function to read in some information because the format of the file might be different. Here's my hack at what you were doing.
# List of files
my_files <- fs::dir_ls(temp_directory, glob = "*.csv")
custom_read_csv <- function(file){
  # Read without column names, so the header comes in as a data row
  ret <- read_csv(file, col_names = FALSE)
  # Pull out the column names from that first row
  my_colnames <- unlist(ret[1,])
  # Remove the row
  ret <- ret[-1,]
  # Assign the column names
  colnames(ret) <- my_colnames
  # Trick to drop rows where a column you know should be numeric (Time in
  # your data; the first column here) still holds leftover header text
  ret <- filter(ret, !is.na(suppressWarnings(as.numeric(ret[[1]]))))
  ret
}
Now you can read in all of your files with the custom function and combine into a single dataframe using map_dfr:
all_files <- map_dfr(my_files, custom_read_csv, .id = "filename")
head(all_files)
Which looks like this:
> head(all_files)
# A tibble: 6 x 20
filename year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 C:/User~ 2013 1 1 810 810 0 1048 1037 11 9E 3538 N915XJ
2 C:/User~ 2013 1 1 1451 1500 -9 1634 1636 -2 9E 4105 N8444F
3 C:/User~ 2013 1 1 1452 1455 -3 1637 1639 -2 9E 3295 N920XJ
4 C:/User~ 2013 1 1 1454 1500 -6 1635 1636 -1 9E 3843 N8409N
5 C:/User~ 2013 1 1 1507 1515 -8 1651 1656 -5 9E 3792 N8631E
Then you could remove the root path using the following syntax (my path is in there now):
all_files %>%
mutate(filename = str_remove(filename, "C:/Users/AppData/Local/Temp/RtmpkdmJCE/"))
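Alternatively, instead of hard-coding the path, base R's basename() drops the directory part of any path:
# keep only the file name, whatever directory it came from
all_files %>%
  mutate(filename = basename(filename))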

Need to run R code on all text files in a folder

I have a text file. I wrote R code to extract a certain line of information from it.
###Read file and format
txt_files <- list.files(pattern = '*.txt')
text <- lapply(txt_files, readLines)
text <- sapply(text, function(x) iconv(x, "latin1", "ASCII", sub=""))
###Search and store grep
l =grep("words" ,text)
(k<- length(l))
###Matrix to store data created
mat <- matrix(data = NA, nrow = k, ncol = 2)
nrow(mat)
###Main
for (i in 1:k) {
  u <- 1
  # walk upwards until a blank line is found
  while (text[(l[i]) - u] != "") {
    line.num <- u
    u <- u + 1
  }
  mat[i, 2] <- text[(l[i]) - u - 1]
  mat[i, 1] <- i
}
###Write the output file
write.csv(mat, file = "Evaluation.csv")
It runs on one file at a time. I need to run it on many files and append all the results into a single file, with an additional column that tells me the name of the file each result came from. I am unable to come up with a solution. What changes do I make?
Applying your operations to all files in a folder:
txt_files <- list.files(pattern = '*.txt')
# Apply all your operations to each file in a for loop; use indexes
# (txt_files[i]) wherever you were using a single file name
for (i in 1:length(txt_files)) {
  # Operation 1
  # Operation 2
  # Operation 3
  write.table(mat, file = paste0("./", sub(".txt", "", txt_files[i]), ".csv"),
              row.names = FALSE, quote = FALSE, sep = ",")
}
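Since you want one output file with a column naming the source file, a sketch along these lines may be closer to what you asked for (the operations inside the loop are your existing code that builds mat):
results <- vector("list", length(txt_files))
for (i in 1:length(txt_files)) {
  # Operations 1-3 on txt_files[i], producing 'mat'
  results[[i]] <- data.frame(file = txt_files[i], mat)
}
# append all the per-file results and write a single csv
all_results <- do.call(rbind, results)
write.csv(all_results, file = "Evaluation.csv", row.names = FALSE)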
Merging files together with the same headers: I have two csv files with the same header (Data and Value), named File1.csv and File2.csv, inside a Header folder. I merge them together to get one header and all the rows and columns. Make sure both files have the same number of columns and the same headers in the same order.
## Read into a list of files, an Example below
library(plyr)
library(gdata)
setwd("./Header") # CSV Files to be merged are in this direcory
## Read into a list of files:
filenames <- list.files(path="./",pattern="*.csv")
fullpath=file.path("./",filenames)
print (filenames)
print (fullpath)
dataset <- do.call("rbind",lapply(filenames,FUN=function(files){read.table(files,sep=",",header=T)}))
dataset
# Data Value
# 1 ABC 23
# 2 PQR 33
# 3 MNP 43 # Till here was File1.csv
# 4 AC 24
# 5 PQ 34
# 6 MN 44 # Till here was File2.csv
write.table(dataset,file="dataset.csv",sep=",",quote=F,row.names=F,col.names=T)

Merging a bunch of csv files into one with headers

I have a couple of csv files I want to combine as a list then output as one merged csv. Suppose these files are called file1.csv, file2.csv, file3.csv, etc...
file1.csv # example of what each might look like
V1 V2 V3 V4
12 12 13 15
14 12 56 23
How would I create a list of these csvs so that I can output a merged csv that would have headers as the file names and the column names at the top as comments? So a csv that would look something like this in Excel:
# 1: V1
# 2: V2
# 3: V3
# 4: V4
file1.csv
12 12 13 15
14 12 56 23
file2.csv
12 12 13 15
14 12 56 23
file3.csv
12 12 13 15
14 12 56 23
I am trying to use the list function inside a double for loop to merge these csvs together, write each list to a variable, and write each variable to a table output. However, this does not work as intended.
# finding the correct files in the directory
files <- dir("test files/shortened")
files_filter <- files[grepl("*\\.csv", files)]
levels <- unique(gsub( "-.*$", "", files_filter))
# merging
for (i in 1:length(levels)) {
  level_specific <- files_filter[grepl(levels[i], files_filter)]
  bindme
  for (j in 1:length(level_specific)) {
    bindme2 <- read.csv(paste("test files/shortened/", level_specific[j], sep = ""))
    bindme <- list(bindme, bindme2)
    assign(levels[i], bindme)
  }
  write.table(levels[i], file = paste(levels[i], "-output.csv", sep = ""), sep = ",")
}
Looking at your code, I think you don't need a for-loop. With the data.table package you could do it as follows:
filenames <- list.files(pattern="*.csv")
files <- lapply(filenames, fread) # fread is the fast reading function from the data.table package
merged_data <- rbindlist(files)
write.csv(merged_data, file="merged_data_file.csv", row.names=FALSE)
If at least one of the csvs has column names set, they will be used in the resulting datatable.
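And if some files turn out to have extra or missing columns, rbindlist can still combine them by column name:
# match columns by name and fill missing ones with NA
merged_data <- rbindlist(files, use.names = TRUE, fill = TRUE)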
Considering your code, it could be improved considerably. This:
files <- dir("test files/shortened")
files_filter <- files[grepl("*\\.csv", files)]
can be replaced by just:
filenames <- list.files(pattern="*.csv")
In your for-loop, the first time you call bindme it isn't doing anything, because it was never defined. What is it? A list? A dataframe? You could use something like:
bindme <- data.table() # or data.frame()
Furthermore, the part:
write.table(levels[i],file = paste(levels[i],"-output.csv",sep=""),sep=",")
will generate several csv-files, but you wanted just one merged file.
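If you do want the exact layout from your example (comment lines with the column names, then each file name followed by its rows), here is a rough sketch that appends the blocks to a single output file; the output file name is made up, and the values are written space-separated to match your example:
filenames <- list.files(pattern = "\\.csv$")
out <- "merged-output.csv"
# write the column names once as comment lines ("# 1: V1" etc.)
cols <- names(read.csv(filenames[1]))
writeLines(sprintf("# %d: %s", seq_along(cols), cols), out)
# then append each file name followed by its data rows
for (f in filenames) {
  cat(f, "\n", file = out, sep = "", append = TRUE)
  write.table(read.csv(f), file = out, append = TRUE, sep = " ",
              row.names = FALSE, col.names = FALSE)
}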
Would this help?
library(plyr) # for ldply
mergeMultipleFiles <- function(dirPath, nameRegex, outputFilename){
  filenames <- list.files(path = dirPath, pattern = nameRegex, full.names = TRUE, recursive = TRUE)
  dataList <- lapply(filenames, read.csv, header = TRUE, check.names = FALSE)
  combinedData <- ldply(dataList, rbind)
  write.csv(combinedData, outputFilename)
}
P.S.: there is a regex thrown in for the file names, just in case you want to merge only a certain "pattern" of files.
Modify this example. If I understood your question correctly it will help you.
# get the names of the csv files in your current directory
file_names = list.files(pattern = "[.]csv$")
# for every name you found go and read the csv with that name
# (this creates a list of files)
import_files = lapply(file_names, read.csv)
# append those files one after the other (collapse list elements to one dataset) and save it as d
d <- do.call(rbind, import_files)

Making the same calculation for multiple files

I have different csv files with different names. I want to make some calculations and after that I want to save the results into one csv file.
My data of two csv files have this format:
File 1:
day price
2000-12-01 00:00:00 2
2000-12-01 06:00:00 3
2000-12-01 12:00:00 NA
2000-12-01 18:00:00 3
File 2:
day price
2000-12-01 00:00:00 12
2000-12-01 06:00:00 NA
2000-12-01 12:00:00 14
2000-12-01 18:00:00 13
To read the files I use this:
file1 <- read.csv(path_for_file1, header=TRUE, sep=",")
file2 <- read.csv(path_for_file2, header=TRUE, sep=",")
An example of the calculation process:
library(xts)
file1 <- na.locf(file1)
file2 <- na.locf(file2)
And save the results into a csv where the rows are matched on the timestamp:
merg <- merge(x = file1, y = file2, by = "day", all = TRUE)
write.csv(merg, file = 'path.csv', row.names = FALSE)
To read multiple files I tried this. Any ideas how I can extend the process from 2 files to n files?
You say that your data are comma-separated, but you show them as space-separated. I'm going to assume that your data are truly comma-separated.
Rather than reading them into separate objects, it's easier to read them into a list. It's also easier to use read.zoo instead of read.csv because merging time-series is a lot easier with xts/zoo objects.
# get list of all files (change pattern to match your actual filenames)
files <- list.files(pattern="file.*csv")
# loop over each file name and read data into an xts object
xtsList <- lapply(files, function(f) {
  d <- as.xts(read.zoo(f, sep = ",", header = TRUE, FUN = as.POSIXct))
  d <- align.time(d, 15 * 60)
  ep <- endpoints(d, "minutes", 15)
  period.apply(d, ep, mean)
})
# set the list names to the file names
names(xtsList) <- files
# merge all the file data into one object, filling in NA with na.locf
x <- do.call(merge, c(xtsList, fill=na.locf))
# write out merged data
write.zoo(x, "path.csv", sep=",")
