I am working to combine multiple .txt files using the read.fwf function. My issue is that each text file is preceded by several header lines, varying from 23 to 28 lines, before the data actually start. I want to somehow skip the first n rows of each file, so that all I am importing and combining are the data themselves.
Does anyone have any clues on how to do this? The start of each data file will be the same ("01Jan") followed by a year. I basically want to delete everything before 01Jan in the file.
Right now, my code looks like:
for (i in 1:length(files.x)) {
  if (!exists("X")) {
    X <- read.fwf(files.x[i], c(11, 5, 16), header = FALSE, skip = 23,
                  stringsAsFactors = FALSE)
    X <- head(X, -1)  # delete the last row of each table
    names(X) <- c("Date", "Time", "Data")
  } else if (exists("X")) {
    temp_X <- read.fwf(files.x[i], c(11, 5, 16), header = FALSE, skip = 23,
                       stringsAsFactors = FALSE)  # read in fixed-width file
    temp_X <- head(temp_X, -1)
    names(temp_X) <- c("Date", "Time", "Data")
    X <- rbind(X, temp_X)
  }
}
I need the skip=23 to vary according to the file being read in. Any ideas other than manually reading in each file and then combining?
Perhaps
hdr <- readLines(files.x[i], n = 50)  ## or some reasonable upper bound
firstLine <- grep("^01Jan", hdr)[1]
X <- read.fwf(files.x[i], skip = firstLine - 1, ...)
Also, it would be more efficient to read in all the files via fileList <- lapply(files.x, getFile) (where getFile is a little utility function you write to encapsulate the logic of reading in a single file) and then combine them with do.call(rbind, fileList).
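Putting the pieces together, a minimal sketch might look like this (getFile is just an illustrative name; the widths, column names, and trailing-row trim are copied from the question):
getFile <- function(f) {
  hdr <- readLines(f, n = 50)           # reasonable upper bound on the header
  firstLine <- grep("^01Jan", hdr)[1]   # data always start with "01Jan<year>"
  X <- read.fwf(f, c(11, 5, 16), header = FALSE,
                skip = firstLine - 1, stringsAsFactors = FALSE)
  X <- head(X, -1)                      # drop the last row, as in the question
  names(X) <- c("Date", "Time", "Data")
  X
}
fileList <- lapply(files.x, getFile)
X <- do.call(rbind, fileList)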
I have an issue where I'm reading in big (500+ MB) CSV files and then want to verify that all the data have been read in correctly. To do so, I have been comparing the length() of readLines() with the nrow() of read.csv2.
The following is my R code:
df <- readFileFromServer(HOST, KEY,
                         paste0(SERVER_PATH, SERVER_FOLDER),
                         FILENAME,
                         FUN = read.csv2,
                         sep = ";",
                         quote = "", encoding = "UTF-8", skipNul = TRUE)
df_check <- readFileFromServer(HOST, KEY,
                               paste0(SERVER_PATH, SERVER_FOLDER),
                               FILENAME,
                               FUN = readLines, skipNul = TRUE)
Then I verify that all data was loaded, by checking:
if(nrow(df) != (length(df_check) - dif)){
stop("some error msg")
}
dif is set to 1 to account for the header in the CSV files.
This check has been working as intended up until this point, but now it fails for one particular CSV file, and I cannot fully understand why.
The one CSV file that fails the check has "NULL" in the data, which I believe readLines somehow interprets as a line break, producing an extra line and making the check fail, but I'm really not sure.
I tried passing different parameters to my read functions, but the issue persists.
I expect readLines and read.csv2 to agree, i.e. length() - 1 should equal nrow(), as in my code snippet.
This is not a proper answer, but it was too long for a comment. This would be my debugging strategy here.
Pick a file that fails. Slurp it with readLines.
Save the file locally using writeLines.
Your first job is to make sure that the check also fails when the file is loaded from disk. My first thought would be that the file transfers from the two readFileFromServer calls were not precisely identical.
Now, if your problem persists for the given file when you read it locally with read.csv2 (a different number of rows than the number of lines in the readLines output), your job becomes much easier (and probably faster) to solve.
First, take a look at the beginning of the CSV file and at its end. Are they as they should be? Do they match the data in the head and tail of your data frame? If yes, then you need to find the missing lines systematically.
Since a CSV is just a plain text file with separated values, you can compare each line read from it with readLines against the line as it should be based on the table you read with read.csv2. How to do this depends on what your original CSV file looks like (whether you need to insert quotes, etc.). Basically, you need to figure out a way of reconstructing the lines of the CSV file from the data in your data frame, and then look for the first line that differs.
Here is some code to give you an idea what I mean:
## first, prepare data -- for this example only!
f <- file("test.csv", "w")
writeLines(c("a,b,c", "1,what ever,42", "12,89,one"), f)
close(f)
## actual test
## first, read the file with readLines
f <- file("test.csv", "r")
rl <- readLines(f)
close(f)
## then, read it with read.csv
csv <- read.csv("test.csv")
## third, reconstruct the lines as they should look based on the data frame
rl_sim <- do.call(paste, c(csv, sep = ","))
## find the first mismatch (rl[1] is the header, hence the i + 1 offset)
for (i in seq_along(rl_sim)) {
  if (rl_sim[i] != rl[i + 1]) {
    message("Problems start at line ", i, ":\n", rl_sim[i], "\n", rl[i + 1])
    break
  }
}
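As an aside, one classic way for the two counts to diverge (with default quoting, at least) is a quoted field containing an embedded line break: read.csv2 parses it as one row, while readLines counts two lines. A self-contained illustration (the file name is hypothetical):
## the quoted field in the single data row spans two physical lines
writeLines(c("a;b", "\"line one\nline two\";42"), "mismatch.csv")
length(readLines("mismatch.csv"))  # 3 -- readLines sees three lines
nrow(read.csv2("mismatch.csv"))    # 1 -- read.csv2 sees one data row
Something along these lines may be happening with your problem file.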
I have found some variations on this question and tried all the possibilities, but nothing helps. I have been able to extract the content, but I would also like to have the file name associated with each row in a CSV file: if the content ("Flash point") is found in a .txt file, extract it and use the .txt file's name as the associated row name in the CSV; if the content is not found, skip both the content and the file and go on to the next extraction. The issue here is that the row names are given based on a specific condition. Here is my initial code. Any help would be greatly appreciated, thanks a lot.
library(magrittr)  # for the %>% pipe; stringr is called via ::
lst <- list()      # collects the extracted lines per file
for (i in 1:length(txt)) {
  doc <- readLines(txt[i])
  doc <- doc[grepl("Flash point", doc)]
  lst[[txt[[i]]]] <- doc %>% stringr::str_extract("(\\d|>).*")
  results <- lst[[txt[[i]]]]
  write.table(results, file = "outputestrod.csv", row.names = FALSE,
              col.names = FALSE, sep = ",", append = TRUE)
}
I am adding an example here (the attached screenshots are not reproduced): the content as extracted, the content extracted with the file names as rows when the specific value is found, and the result of the suggested results <- paste(txt[i], lst[[txt[[i]]]]).
It sounds like you need to use a paste() command to combine two strings: the file name and the matching contents of the file.
Try changing the line
results<-lst[[txt[[i]]]]
to this:
results <- paste(txt[i], lst[[txt[[i]]]])
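Because paste() is vectorized, the single file name txt[i] is recycled across every extracted line, so each output row carries its source file. For example (made-up values):
## made-up values, just to show the recycling behaviour
paste("sample1.txt", c("> 93 C", "> 110 C"))
#> [1] "sample1.txt > 93 C"  "sample1.txt > 110 C"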
Here is a tidyverse version of what I think you are trying to do. Consider the book R for Data Science (http://r4ds.had.co.nz/) if you want to learn to code like this. Your loop does not take advantage of R's vectorized operations.
library(tidyverse)
filenames <- dir("path/to/your/folder", full.names = TRUE)  # substitute your folder
file_and_content_with_string <- function(filename, string) {
  doc <- readLines(filename)
  doc <- doc[grepl(string, doc)]
  file_text <- doc %>% stringr::str_extract("(\\d|>).*")
  data.frame(filename = filename, content = file_text, stringsAsFactors = FALSE)
}
all_results <- map_df(filenames, function(x) file_and_content_with_string(x, "Flash point"))
write_csv(all_results, "outputestrod.csv")
I'm new, and I have a problem:
I have a dataset (a CSV file) with 15 columns and 33,000 rows.
When I view the data in Excel it looks good, but when I try to load the data into RStudio I have a problem.
I used the code:
x <- read.csv(file = "1energy.csv", head = TRUE, sep="")
View(x)
The result is that the column names are good, but the data (row 2 and onward) all end up in my first column.
Within that column the data are separated with ";". But when I try the code:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";")
The next problem is:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
  duplicate 'row.names' are not allowed
So I changed the code to:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";", row.names = NULL)
And it looked like it worked... But now the data are in the wrong columns (for example, the "name" column now contains the "time" values, and the "time" column contains the "costs" values).
Does anybody know how to fix this? I can rename the columns, but I think that is not the best way.
Excel, in its English version at least, may use a comma as the separator, so you may want to try:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=",")
I once had a similar problem where the header had a long entry containing a character that read.csv mistook for a column separator. In reality, it was part of a long name that wasn't quoted properly.
Try skipping the header and see if the problem persists:
x1 <- read.csv(file = "1energy.csv", skip = 1, head = FALSE, sep=";")
In reply to your comment:
There are two things you can do. The simplest is to assign the names manually:
myColNames <- c("col1.name", "col2.name")
names(x1) <- myColNames
The other way is to read just the name row (the first line in your file) and split it into a character vector:
nameLine <- readLines(con="1energy.csv", n=1)
fileColNames <- unlist(strsplit(nameLine, ";"))
Then see how you can fix the problem and assign the names to your x1 data frame. I don't know what exactly is wrong with your first line, so I can't tell you how to fix it.
Yet another, cruder option is to open your CSV file in a text editor and edit the column names there.
It happens because of Excel's specifics. The easy solution is to copy all your data (Ctrl+C) into Notepad and save it from Notepad as filename.csv (don't forget to remove the .txt extension if necessary). It worked well for me: R opened the newly created CSV file correctly, with all the data separated into the right columns.
Open your file in a text editor and see if it really is separated with commas...
Sometimes .csv files are separated with tabs instead of commas or semicolons. Opening such a file in Excel causes no problem, but in R you have to specify the separator like this:
x <- read.csv(file = "1energy.csv", head = TRUE, sep="\t")
I once had the same problem, and this was my solution. I hope it works for you.
This problem can arise from the regional settings of the Excel application in which the .csv file was created.
While in most places a "," separates the columns in a COMMA-separated file (which makes sense), in other places it is a ";".
Depending on your regional settings, you can experiment with:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=",") #used in North America
or,
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";") #used in some parts of Asia and Europe
You could use:
df <- read.csv("filename.csv", sep = ";", quote = "")
It solved one of my problems similar to yours.
So I changed the code to:
x1 <- read.csv(file = "1energy.csv", head = TRUE, sep=";", row.names = NULL)
And it looked like it worked... But now the data are in the wrong columns (for example, the "name" column now contains the "time" values, and the "time" column contains the "costs" values).
Does anybody know how to fix this? I can rename the columns, but I think that is not the best way.
I had the exact same issue. After quite some research I found out that the CSV was ill-formed.
In the header line of the CSV there were all the labels (separated by the separator) and then a line break.
Starting with line 2, there was an additional separator at the end of each line, so an example of such an ill-formed CSV file looks like this:
Field1;Field2 <-- see the *missing* semicolon at the end
12;23; <-- see the *trailing* semicolon in each of the data lines
34;67;
45;56;
Such ill-formed files are even harder to spot when they are tab-separated. Excel does not care about this when importing CSV files, but R does.
When you use skip = 1 you skip the header line, which contains part of the mismatch. The data frame will be imported fine, but there will be a column of NAs at the end of each row, and obviously you will not have column names, as these were skipped.
Easiest solution: edit the CSV file and either add an additional separator at the end of the header line as well, or remove the trailing delimiters from the data lines. You can also use R's generic text-file read and write functions to automate that editing, as sketched below.
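For instance, a small helper along these lines could do it (fix_csv is my own illustrative name, and it assumes the only defect is the header's missing trailing separator):
## rough sketch: append the missing separator to the header line, in place
fix_csv <- function(path, sep = ";") {
  lines <- readLines(path)
  lines[1] <- paste0(lines[1], sep)  # header now matches the data lines
  writeLines(lines, path)
}
fix_csv("1energy.csv")
Stripping the trailing delimiters from the data lines instead, e.g. with sub(";$", "", lines[-1]), would work equally well.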
You can transform the data by arranging it into cells corresponding to columns:
1. Open your CSV file.
2. Copy the content, paste it into a .txt file, save it, and copy its content again.
3. Open a new Excel file.
4. In Excel, go to the section responsible for data; it is actually called "Data".
5. Then, on the left side, go to the external data query (in German, "externe Daten abfragen").
6. Go through the wizard step by step and separate by commas.
7. Save your file as CSV.
I had the same problem and it was frustrating...
However, I found the ultimate solution.
First take the CSV file and convert it online to a JSON file and download it... then redo the whole thing backwards (re-convert the JSON to CSV) online... download the converted file... give it a name...
Then load it in RStudio:
my_data <- read.csv(file = "your_file.csv")  # substitute your file name
... took me 4 days to think out of the box... 🙂🙂🙂
I have searched through this forum for most of the day trying to find a solution, but could not, so I am posting. If the answer is already out there, please point me in the right direction.
What I have -
A directory with 40 text files named as follows:
test_63x_disc_z00.txt
*01.txt
*02.txt
...
*39.txt
In each of these files there are 10 columns of data with no header and a varying number of rows.
What I want -
I want to have an individual data.frame in R for each text file with names:
file00
file01
...
file39.
I then want to add column names to each of these data.frames.
I then want to be able to manipulate the data with ease (this last part I can sort out once I have read in the data).
This is what I have accomplished so far (don't laugh now) -
I can input a single text file as a data frame and add a header, like so:
d <- read.delim("test_63x_disc_z00.txt", header = FALSE)
colnames(d) <- c("cell", "CentX", "CentY", "CountLabels", "AvgGreen",
                 "DeviationsGreen", "AvgRed", "DeviationsRed", "GUI-ID", "Slice")
I am not sure how to set up a loop to perform each of the commands to all 40 files and maintain distinct file names.
To quickly read in a lot of data frames you can:
listy <- lapply(list.files(), read.table, sep = "", header = FALSE)
Then, to name the list items to match your file numbering, you can:
names(listy) <- sprintf("file%02d", 0:39)
They are then called by listy$file00, listy$file01, etc.
Thanks Metrics for editing my input code... I wasn't sure how to give it the format you did.
So I figured it out (dirty version), but it still needs work -
stem <- c("/Users/stefanzdraljevic/Northwestern/2013/Carthew-Rotation/Sample-Images-Ritika/ritika-tes/statistics3/test_63x_disc_z")
The above is the stem for the file naming.
addition <- c("00","01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","34","35","36","37","38","39")
This will add the number of the text file to the end of stem. I am not sure how to incorporate the "00" numbering structure without writing them all out.
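For what it's worth, sprintf() can generate the zero-padded labels instead of typing them out; this one-liner also reproduces the gap at 32/33 in the vector above:
## "%02d" pads each number to two digits with a leading zero
addition <- sprintf("%02d", c(0:31, 34:39))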
colnames <- c("cell","CentX","CentY","CountLabels","AvgGreen","DeviationsGreen","AvgRed","DeviationsRed","GUI-ID","Slice")
This will add column names to the data.frame
data <- list()
for (j in stem) {
  data[[j]] <- list()
  for (i in addition) {
    data[[j]][[i]] <- read.table(paste(j, i, ".txt", sep = ""),
                                 header = FALSE, col.names = colnames)
  }
}
This loop does the trick.
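The individual tables can then be pulled back out of the nested list, or stacked into a single data frame:
## access one table, or stack them all with the usual do.call(rbind, ...) idiom
z00 <- data[[stem]][["00"]]
all_data <- do.call(rbind, data[[stem]])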
I have some data that I am trying to load into R. It is in .csv files and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada, available here.)
The data is coded in an unusual manner. A typical line is:
12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11
There is a " at the end of "Central Nova" but not at the beginning. So in order to read in the data, I suppressed the quotes, which worked fine for the first few files, i.e.
test<-read.csv("pollresults_resultatsbureau11001.csv",header = TRUE,sep=",",fileEncoding="latin1",as.is=TRUE,quote="")
Now here is the problem: in another file (eg. pollresults_resultatsbureau12002.csv), there is a line of data like this:
12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28
Because I need to suppress the quotes, the entry "Pictou, Subd. A" makes R want to split this into two fields. The data can't be read in, since R wants to add a column halfway through constructing the data frame.
Excel and OpenOffice can both open these files with no problem. Somehow they know that quotation marks only matter at the beginning of a field.
Do you know what option I need to enable in R to get this data in? I have >300 files that I need to load (each with ~1000 rows), so a manual fix is not an option...
I have looked all over the place for a solution but can't find one.
Building on my comments, here is a solution that would read all the CSV files into a single list.
# Deal with French properly
options(encoding="latin1")
# Set your working directory to where you have
# unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")
# Get the file names
temp <- list.files()
# Extract the 5-digit code which we can use as names
Codes <- gsub("pollresults_resultatsbureau|.csv", "", temp)
# Read all the files into a single list named "pollResults"
pollResults <- lapply(seq_along(temp), function(x) {
T0 <- readLines(temp[x])
T0[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', T0[-1])
final <- read.csv(text = T0, header = TRUE)
final
})
names(pollResults) <- Codes
You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).
Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".
invisible(lapply(seq_along(pollResults), function(x) {
NewFilename <- paste("Corrected", temp[x], sep = "_")
write.csv(pollResults[[x]], file = NewFilename,
quote = TRUE, row.names = FALSE)
}))
Hope this helps!
This answer is mainly for @AnandaMahto (see the comments to the original question).
First, it helps to set some options globally because of the French accents in the data:
options(encoding="latin1")
Next, read in the data verbatim using readLines():
temp <- readLines("pollresults_resultatsbureau13001.csv")
Following this, simply replace the first comma in each line of data with a comma plus an opening quotation mark. This works because the first field is always 5 characters long. Note that it leaves the header untouched.
temp[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', temp[-1])
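To see the substitution in action on the start of the sample line from the question:
## the first capture group grabs the 5-digit code plus its comma (6 chars);
## the replacement re-inserts it, followed by the missing opening quote
gsub('^(.{6})(.*)$', '\\1\\"\\2', '12002,Central Nova"')
#> [1] "12002,\"Central Nova\""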
Penultimately, write over the original file.
fileConn <- file("pollresults_resultatsbureau13001.csv")
writeLines(temp, fileConn)
close(fileConn)
Finally, simply read the data back into R:
data<-read.csv(file="pollresults_resultatsbureau13001.csv",header = TRUE,sep=",")
There is probably a more parsimonious way to do this (and one that can be iterated more easily) but this process made sense to me.