I have found some variations of this question and tried all the possibilities, but nothing helps. I have been able to extract the content, but I would also like to have the file name associated with each row in a CSV file: if the content ("Flash Point") is found in a .txt file, extract the content and use the .txt file name as the associated row name in the CSV. If the content is not found, skip both the content and the file and go on to the next extraction. The issue here is that the row names are given based on a specific condition. Here is the initial code. Any help would be greatly appreciated.
lst <- list()   # container for the extracted results; txt holds the .txt file paths
for (i in 1:length(txt)){
  doc <- readLines(txt[i])
  doc <- doc[grepl("Flash point", doc)]   # keep only the matching lines
  lst[[txt[[i]]]] <- doc %>% stringr::str_extract("(\\d|>).*")
  results <- lst[[txt[[i]]]]
  write.table(results, file = "outputestrod.csv", row.names = FALSE,
              col.names = FALSE, sep = ",", append = TRUE)
}
I am adding an example here (screenshots omitted): one showing the content extracted, one showing the content extracted with the file name as the row label when the specific value is found, and one showing the result of the suggested results <- paste(txt[i], lst[[txt[[i]]]]).
It sounds like you need to use a paste() command to combine two strings: the file name and the contents of the file.
Try changing the line
results<-lst[[txt[[i]]]]
to this:
results <- paste(txt[i], lst[[txt[[i]]]])
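For illustration (the file name and matched values here are made up), paste() prepends the file name to each extracted value:
# hypothetical inputs, just to show what paste() produces
txt_i <- "sample1.txt"
matches <- c("> 93 C", "102 C")
paste(txt_i, matches)
#> [1] "sample1.txt > 93 C" "sample1.txt 102 C"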
Here is a tidyverse version of what I think you are trying to do. Consider the resource http://r4ds.had.co.nz/ if you want to learn to code like this. Your loop does not take advantage of R's vector operations.
library(tidyverse)

filenames <- dir("path/to/your/folder", full.names = TRUE)  # the folder with your .txt files

file_and_content_with_string <- function(filename, string){
  doc <- readLines(filename)
  doc <- doc[grepl(string, doc)]                        # keep lines containing the string
  file_text <- doc %>% stringr::str_extract("(\\d|>).*")
  data.frame(filename = filename, content = file_text)  # return the file name alongside the content
}

all_results <- map_df(filenames, function(x) file_and_content_with_string(x, "Flash point"))
write_csv(all_results, "outputestrod.csv")
I have a script where I'd prefer to read in a certain file, but if that file doesn't exist then I want to read in a different file. Files change by day so I don't want to pick and choose manually.
I'm currently using tryCatch to do this. Basically, try to read the first file, and if there's an error then return an empty df. Then have logic where if the table is blank, read in the file I know exists. While my solution works, it feels extremely clunky and I'd like to learn best practices going forward.
Current solution:
df <- suppressWarnings(tryCatch(
  read.csv('this file will not exist on your computer', stringsAsFactors = FALSE),
  error = function(e) data.frame()
))
if(nrow(df) == 0){
  df <- cars # this will be a file I know exists
}
## Continue with function using a df I know exists ##
I don't love using suppressWarnings if I don't need to, but I disliked the warning that kept popping up (obviously the file doesn't exist; that's why I'm tryCatching).
Ideally this could be three lines of code: 1) check if the file exists; 2) if it does, read it; 3) if it doesn't, read the other file I know exists. Any thoughts?
As suggested by @All Downhill From Here, using file.exists will check whether the file exists at the given location. Since file.exists returns a logical value, you can wrap it in an if/else block.
if(file.exists('data.csv')) {
data <- read.csv('data.csv')
} else {
data <- read.csv('confirm.csv')
}
This could also be an option, using purrr::map_if. First we create a character vector of our desired file names:
library(purrr)
vec <- c("data.csv")
vec %>%
  map_if(~ file.exists(.x),
         ~ read.csv(.x, header = TRUE),
         .else = ~ read.csv(gsub("([a-z0-9_]+)\\.([a-z]+$)", "confirm.\\2",
                                 .x, perl = TRUE), header = TRUE))
We may make use of an index, since file.exists returns TRUE or FALSE, which coerce to 1 and 0:
data <- read.csv(c("confirm.csv", "data.csv")[1 + file.exists("data.csv")])
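A quick illustration of the indexing trick (assuming data.csv does not exist on this machine):
# FALSE coerces to 0, so the subscript is 1 and the fallback file is chosen
1 + file.exists("data.csv")
#> [1] 1
c("confirm.csv", "data.csv")[1 + file.exists("data.csv")]
#> [1] "confirm.csv"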
I'm new here and I don't know how this site works, so if I make mistakes, sorry.
So, I have 23 xlsx files with many sheets in them.
I have to create a dataset which contains all of those files, but with only one sheet from each. The columns and the names of the sheets are the same.
I have to bind them by rows.
If anyone knows how to do it, I will be very grateful.
file.list <- list.files("D:/Profile/name/Desktop/Viss/foldername", pattern=".xlsx")
df.list <- lapply(file.list, read_excel)
This fails with: Error: path does not exist
df <- rbindlist(df.list, idcol = "id")
I don't know where to specify which sheet to extract, and I don't know what to write in idcol="".
I think your approach is correct, but you should use the full path in file.list <-list.files("D:/Profile/name/Desktop/Viss/foldername",pattern=".xlsx", full.names=TRUE)
EDIT: You should use pattern="\\.xlsx" in
list.files("D:/Profile/name/Desktop/Viss/foldername",pattern="\\.xlsx", full.names=TRUE)
EDIT2: You can always see any function's help by running ? followed by the function name, like ?rbindlist, or in RStudio by pressing F1 on the function name. The idcol parameter should be TRUE or FALSE (or a column name); in your case, probably FALSE.
idcol: Generates an index column. Default (NULL) is not to. If idcol=TRUE then the column is auto named .id. Alternatively the column name can be directly provided, e.g., idcol = "id". If the input is a named list, ids are generated using them, else using an integer vector from 1 to the length of the input list. See examples.
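A tiny demonstration of idcol, using made-up one-column tables in a named list:
library(data.table)
# the list names "a" and "b" become the values of the id column
rbindlist(list(a = data.table(x = 1), b = data.table(x = 2)), idcol = "id")
#>    id x
#> 1:  a 1
#> 2:  b 2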
EDIT3: if you want to specify the sheet name, you can use
lapply(file.list, function(x) read_excel(x, sheet="sheetname"))
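Putting the pieces together, a minimal sketch (assuming the folder path from the question, and that the sheet you want really is named "sheetname"):
library(readxl)
library(data.table)
# full.names = TRUE so that read_excel receives complete paths
file.list <- list.files("D:/Profile/name/Desktop/Viss/foldername",
                        pattern = "\\.xlsx$", full.names = TRUE)
df.list <- lapply(file.list, function(x) read_excel(x, sheet = "sheetname"))
df <- rbindlist(df.list)  # no id column needed here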
I'm new, and I have a problem:
I got a dataset (csv file) with 15 columns and 33,000 rows.
When I view the data in Excel it looks good, but when I try to load the data into RStudio I have a problem:
I used the code:
x <- read.csv(file = "1energy.csv", head = TRUE, sep="")
View(x)
The result is that the column names are good, but the data (row 2 and further) are all in my first column.
In the first column the data is separated with ";". But when I try the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";")
The next problem is: Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
So I made the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";", row.names = NULL)
And it looks like it worked... But now the data is in the wrong columns (for example, the "name" column now contains the "time" value, and the "time" column contains the "costs" value).
Does anybody know how to fix this? I can rename the columns but I think that is not the best way.
Excel, in its English version at least, may use a comma as the separator, so you may want to try
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=",")
I once had a similar problem where the header had a long entry that contained a character that read.csv mistook for a column separator. In reality, it was part of a long name that wasn't quoted properly.
Try skipping the header and see if the problem persists:
x1 <- read.csv(file = "1energy.csv", skip = 1, head = FALSE, sep=";")
In reply to your comment:
Two things you can do. The simplest one is to assign names manually:
myColNames <- c("col1.name", "col2.name")
names(x1) <- myColNames
The other way is to read just the name row (the first line in your file):
# read only the first line and split it into a character vector
nameLine <- readLines(con = "1energy.csv", n = 1)
fileColNames <- unlist(strsplit(nameLine, ";"))
Then see how you can fix the problem, and assign names to your x1 data frame. I don't know what exactly is wrong with your first line, so I can't tell you how to fix it.
Yet another, cruder option is to open your csv file in a text editor and edit the column names there.
It happens because of Excel's specifics. The easy solution is just to copy all your data (Ctrl+C) into Notepad and save it again from Notepad as filename.csv (don't forget to remove .txt if necessary). It worked well for me: R opened the newly created csv file correctly, with all the data separated into the right columns.
Open your file in a text editor and see if it really is separated with commas...
Sometimes .csv files are separated with tabs instead of commas or semicolons; opening them in Excel causes no problem, but in R you have to specify the separator like this:
x <- read.csv(file = "1energy.csv", head = TRUE, sep="\t")
I once had the same problem; this was my solution. Hope it works for you.
This problem can arise due to the regional settings of the Excel application where the .csv file was created.
While in most places a "," separates the columns of a COMMA separated file (which makes sense), in other places it is a ";".
Depending on your regional settings, you can experiment with:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=",") #used in North America
or,
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";") #used in some parts of Asia and Europe
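If you don't know the separator in advance, one heuristic (my own sketch, not taken from the answers above) is to peek at the header line and pick whichever candidate occurs more often:
# count how many times a character occurs in a line
count_char <- function(line, ch) {
  m <- gregexpr(ch, line, fixed = TRUE)[[1]]
  if (m[1] == -1) 0L else length(m)
}
hdr <- readLines("1energy.csv", n = 1)
sep <- if (count_char(hdr, ";") > count_char(hdr, ",")) ";" else ","
x1 <- read.csv("1energy.csv", header = TRUE, sep = sep)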
You could use:
df <- read.csv("filename.csv", sep = ";", quote = "")
It solved one of my problems similar to yours.
So I made the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";", row.names = NULL)
And it looks like it worked... But now the data is in the wrong columns (for example, the "name" column now contains the "time" value, and the "time" column contains the "costs" value).
Does anybody know how to fix this? I can rename the columns but I think that is not the best way.
I had the exact same issue. After quite some research I found out that the CSV was ill-formed.
In the header line of the CSV there were all the labels (separated by the separator) and then a line break.
Starting with line 2, there was an additional separator at the end of each line. So an example of such an ill-formed CSV file looks like this:
Field1;Field2 <-- see the *missing* semicolon at the end
12;23; <-- see the *trailing* semicolon in each of the data lines
34;67;
45;56;
Such ill-formed files are even harder to spot in TAB-separated files.
Excel does not care about that when importing CSV files, but R does.
When you use skip=1 you skip the header line, which contains part of the mismatch. The data frame will be imported fine, but there will be a column of NA values at the end of each row, and obviously you will not have column names, as these were skipped.
Easiest solution: edit the CSV file and either add an additional separator at the end of the header line as well, or remove the trailing delimiters from the data lines. You can also use R's generic text-file read and write functions to automate that editing, as sketched below.
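A minimal sketch of that automated fix (assuming the file name from the question and a ";" separator):
# remove the trailing ";" from every data line, leaving the header untouched
lines <- readLines("1energy.csv")
lines[-1] <- sub(";$", "", lines[-1])
writeLines(lines, "1energy_fixed.csv")
x1 <- read.csv("1energy_fixed.csv", header = TRUE, sep = ";")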
You can transform the data by arranging it into cells corresponding to columns:
1. Open your csv file.
2. Copy the content, paste it into a txt file, save it, and copy its content again.
3. Open a new Excel file.
4. In Excel, go to the section responsible for data; it is actually called "Data".
5. On the left side, go to the external data query (in German, "externe Daten abfragen").
6. Go ahead step by step and separate by commas.
7. Save your file as csv.
I had the same problem and it was frustrating...
However, I found the ultimate solution.
First take this csv file and convert it online to a JSON file and download it... then redo the whole thing backwards (re-convert the JSON to csv) online... download the converted file... give it a name...
Then load it in your RStudio:
my_data <- read.csv(file = 'name_of_your_file.csv')
... took me 4 days to think out of the box... 🙂🙂🙂
I am working to combine multiple .txt files using the read.fwf function. My issue is that each text file is preceded by several header lines, varying from 23 to 28 lines, before the data actually start. I want to somehow delete the first n rows of each file, so that all I am importing and combining are the data themselves.
Does anyone have any clues on how to do this? The start of each data file will be the same ("01Jan") followed by a year. I basically want to delete everything before 01Jan in the file.
Right now, my code looks like:
for (i in 1:length(files.x)){
  if (!exists("X")){
    X <- read.fwf(files.x[i], c(11, 5, 16), header=FALSE, skip=23, stringsAsFactors=FALSE)
    X <- head(X, -1) # delete the last row of each table
    names(X) <- c("Date", "Time", "Data")
  } else if (exists("X")){
    temp_X <- read.fwf(files.x[i], c(11, 5, 16), header=FALSE, skip=23, stringsAsFactors=FALSE) # read in fixed width file
    temp_X <- head(temp_X, -1)
    names(temp_X) <- c("Date", "Time", "Data")
    X <- rbind(X, temp_X)
  }
}
I need the skip=23 to vary according to the file being read in. Any ideas other than manually reading in each file and then combining?
Perhaps
hdr <- readLines(files.x[i],n=50) ## or some reasonable upper bound
firstLine <- grep("^01Jan",hdr)[1]
X <- read.fwf(files.x[i], skip=firstLine-1, ...)
Also, it would be more efficient to read in all the files via fileList <- lapply(files.x, getFile) (where getFile is a little utility function you write to encapsulate the logic of reading in a single file) and then combine them with do.call(rbind, fileList).
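A minimal sketch of that refactor; getFile is the suggested little helper, and the widths, column names, and "01Jan" marker are taken from the question:
getFile <- function(f) {
  hdr <- readLines(f, n = 50)           # reasonable upper bound on header length
  firstLine <- grep("^01Jan", hdr)[1]   # the data always start with "01Jan<year>"
  X <- read.fwf(f, c(11, 5, 16), header = FALSE,
                skip = firstLine - 1, stringsAsFactors = FALSE)
  X <- head(X, -1)                      # drop the last row, as in the original loop
  names(X) <- c("Date", "Time", "Data")
  X
}
fileList <- lapply(files.x, getFile)
X <- do.call(rbind, fileList)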
I have some data that I am trying to load into R. It is in .csv files and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada.)
The data is coded in an unusual manner. A typical line is:
12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11
There is a quotation mark at the end of Central Nova but not at the beginning. So in order to read in the data, I suppressed the quotes, which worked fine for the first few files, i.e.
test<-read.csv("pollresults_resultatsbureau11001.csv",header = TRUE,sep=",",fileEncoding="latin1",as.is=TRUE,quote="")
Now here is the problem: in another file (eg. pollresults_resultatsbureau12002.csv), there is a line of data like this:
12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28
Because I need to suppress the quotes, the entry "Pictou, Subd. A" makes R want to split this into 2 variables. The data can't be read in, since R wants to add a column halfway through constructing the data frame.
Excel and OpenOffice both open these files with no problem. Somehow, Excel and OpenOffice know that quotation marks only matter if they are at the beginning of a variable entry.
Do you know what option I need to enable in R to get this data in? I have >300 files that I need to load (each with ~1000 rows), so a manual fix is not an option...
I have looked all over the place for a solution but can't find one.
Building on my comments, here is a solution that would read all the CSV files into a single list.
# Deal with French properly
options(encoding="latin1")
# Set your working directory to where you have
# unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")
# Get the file names
temp <- list.files()
# Extract the 5-digit code which we can use as names
Codes <- gsub("pollresults_resultatsbureau|.csv", "", temp)
# Read all the files into a single list named "pollResults"
pollResults <- lapply(seq_along(temp), function(x) {
  T0 <- readLines(temp[x])
  T0[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', T0[-1])
  final <- read.csv(text = T0, header = TRUE)
  final
})
names(pollResults) <- Codes
You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).
Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".
invisible(lapply(seq_along(pollResults), function(x) {
  NewFilename <- paste("Corrected", temp[x], sep = "_")
  write.csv(pollResults[[x]], file = NewFilename,
            quote = TRUE, row.names = FALSE)
}))
Hope this helps!
This answer is mainly for @AnandaMahto (see the comments on the original question).
First, it helps to set some options globally because of the french accents in the data:
options(encoding="latin1")
Next, read in the data verbatim using readLines():
temp <- readLines("pollresults_resultatsbureau13001.csv")
Following this, simply replace the first comma in each line of data with a comma plus a quotation mark. This works because the first field is always 5 characters long, so the comma is always the sixth character. Note that it leaves the header untouched.
temp[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', temp[-1])
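A quick illustration with a shortened version of the sample line from the question:
line <- '12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299'
cat(gsub('^(.{6})(.*)$', '\\1\\"\\2', line))
#> 12002,"Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299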
Penultimately, write over the original file.
fileConn<-file("pollresults_resultatsbureau13001.csv")
writeLines(temp,fileConn)
close(fileConn)
Finally, simply read the data back into R:
data<-read.csv(file="pollresults_resultatsbureau13001.csv",header = TRUE,sep=",")
There is probably a more parsimonious way to do this (and one that can be iterated more easily), but this process made sense to me.