I'm trying to read in a set of data tables, each representing a different part of a larger Excel table, selected using "filter" and saved individually as a .csv file. Most of my tables have 5 rows of data, but two of them have 4 rows. The tables with 5 rows of data read into R as expected:
Y <- read.csv(file = "MyFile.csv", row.names = 1, header = TRUE, sep = ";")
No problem.
The tables with 4 rows of data give the following error message:
In read.csv("MyFile.csv", quote = "", : incomplete final line found
by readTableHeader on ' MyFile.csv'
It’s the same problem with
Z <- read.table("MyFile.csv", quote = "", sep = ';', header = TRUE)
There is no missing data in the file, and when I print the Y or Z object in R, no missing data is visible (or invisible, as it were).
I know the problem is extremely simple, but as I’ve got frustration pouring out of my ears my officemates would really appreciate your help.
The final line of your CSV doesn't have a line feed or carriage return.
Plan A: open the files in a text editor, go to the end of the final line, hit enter and then save the modified file.
Plan B: if there are too many files for Plan A, you could simply ignore the warnings, since the files seem to be loaded fine (apart from that message).
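If there are too many files for hand-editing but you would still rather fix them than live with the warning, here is a minimal sketch in R (the folder path is hypothetical, not from the question):
# Rewrite each CSV so it ends with a newline. readLines(warn = FALSE)
# silences the same "incomplete final line" message on the way in;
# writeLines appends "\n" after every line it writes back out.
files <- list.files("path/to/csvs", pattern = "\\.csv$", full.names = TRUE)
for (f in files) {
  txt <- readLines(f, warn = FALSE)
  writeLines(txt, f)
}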
Related
I have a problem with a task where I have to load a data set and make sure that missing values are read in properly and that column names are unambiguous.
The format of .txt file:
In the end, the data set should contain only the country column and the median age.
I tried using read.delim, specifically this chunk:
rawdata <- read.delim("rawdata_343.txt", sep = "", stringsAsFactors = FALSE, header = TRUE)
And when I run it, I get this:
It confuses me that when a country name has multiple words (Turks and Caicos Islands), each word is assigned to a different column.
Since I am still a beginner in R, any suggestion would be very helpful for me. Thanks!
Three points to note about your input file: (1) the first two lines at the top are not tabular and should be skipped with skip = 2, (2) your column separators are tabs, which should be specified with sep = "\t", and (3) you have no headers, so header = FALSE. Your command should be:
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2)
UPDATE: A fourth point is that the first column includes row numbers, so row.names = 1. This also addresses the follow-up comment.
rawdata <- read.delim("rawdata_343.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE, skip = 2, row.names = 1)
It looks like the delimiter you are specifying in the sep = argument is telling R to treat spaces as the column delimiter. Looking at your data as a .txt file, there is no apparent delimiter (like the commas you would find in a typical .csv). If you can put the data in tabular form in something like a .csv or .xlsx file, R is much better at reading it as expected. As it is, you may struggle to get the .txt format to read in a tabular fashion, which I assume is what you want.
P.S. You can use read.csv() if you do end up putting the data in that format.
I am currently trying to analyze Twitter data collected via Python and saved as a tab-delimited CSV file. However, a problem occurs when I try to read it into R.
The data is comprised of 8 columns (e.g., col1: Twitter ID, col2: date of tweets, ... col4: tweet messages, ... col9: location information).
So I expect each row to contain information for those eight columns for every data point. However, for some reason, in col4, where there should only be tweet messages, multiple fields (i.e., information from other rows, col1 to col8) end up in one specific cell.
Below is the code I tried running. It is strange, as this problem does not occur when I read this CSV file in Python. I just have no clue what is happening. Has anyone encountered a similar issue?
data <- read.csv("Blacklives.csv", header = F, sep = '\t')
data <- read.csv2("Blacklives.csv", header = F, sep = '\t')
data <- read.delim2("Blacklives.csv", header = F, sep = '\t')
So I will try to provide more information about my error. The initial data generated by Python looks like this in .csv form. It looks a bit weird, but the data is a tab-delimited .csv file from Python. However, when I read this data into R (I read the data in R and re-saved it to .csv just to show you the rows with the problematic cells), some cells (i.e., cells that should each contain a tweet from one person) hold a large block of information. See below for an example of the information contained in a cell (.csv file from R).
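No accepted answer is quoted here, but a common cause of this symptom is quote characters inside the tweets themselves: read.csv treats everything between two quotes, including tabs and newlines, as a single field. A sketch of the usual first fix, assuming quotes are the culprit:
# Disable quote processing so stray " or ' characters inside tweets
# cannot glue several records together into one cell.
data <- read.delim("Blacklives.csv", header = FALSE, sep = "\t",
                   quote = "", stringsAsFactors = FALSE)
# If tweets contain genuine embedded newlines, readr::read_tsv()
# handles quoted multi-line fields more robustly than base read.csv.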
Similar to How can you read a CSV file in R with different number of columns, I have some complex CSV files. Mine are from SAP BusinessObjects and pose challenges different from those of the quoted question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many CSV files, but let's start with one of them.
Given: One CSV file containing several flat tables.
Wanted: Several dataframes or other structure holding all data (S4?)
The method so far:
get the line numbers of header rows by counting the number of columns on each line
get the headers by reading each line index held in the vector described above
read the data by calculating skip and nrows between datasets from the header line indices above
assign the column names from the corresponding header to the data just read
I need help getting on the right track: avoiding loops and making the code more readable and compact when reading headers and datasets.
These CSVs are formatted as normal CSVs, except that they contain a more or less arbitrary number of subtables. For each dataset I export, the structure is different. In the current example, I will suppose there are five tables included in the CSV.
To give you an idea, here is some fictitious sample data with line numbers. Separators and quotes have been stripped:
1: n, Name, Species, Description, Classification
2: 90, Mickey, Mouse, Big ears, rat
3: 45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
33212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14
There are a number of ways to go about reading each data set. Here is what I have come up with so far:
filepath <- file.path(Sys.getenv("USERPROFILE"), "SAMPLE.CSV")
# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")
# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)
# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
With this, I can do things like:
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] - 1, nrows = 1)
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5] - iHeaders[4] - 1)
names(data) <- unlist(header)  # header is a one-row data frame, so flatten it first
As in the intro of this post, I have made a couple of functions which make it easier to get headers for each dataset:
Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]
I have two functions now. One is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe header names (no æøå, %, etc.).
The other returns a list of string vectors holding all headers:
GetHeaders <- function(filepath, linenums) {
  # init an empty list of length(linenums)
  l.headers <- vector(mode = "list", length = length(linenums))
  for (i in seq_along(linenums)) {
    # read.csv2(filepath, skip = linenums[i] - 1, nrows = 1)
    l.headers[[i]] <- GetHeader(filepath, linenums[i])
  }
  l.headers
}
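For reference, a hypothetical reconstruction of GetHeader (the post describes it but does not show its body, so the sanitizing step here is an assumption):
GetHeader <- function(filepath, linenum) {
  # read exactly the one header line
  row <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"",
                          skip = linenum - 1, nrows = 1,
                          stringsAsFactors = FALSE)
  # transliterate to ASCII and coerce to syntactically valid, unique names
  safe <- iconv(unlist(row), to = "ASCII//TRANSLIT", sub = "")
  make.names(safe, unique = TRUE)
}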
What I struggle with is how to read in all possible datasets in one go. The last set in particular is hard to wrap my head around: if I write a common function, I only know the line number of the header, not the number of lines in the data that follows.
Also, what is the best data structure for holding such a collection? The subtables are all related to each other (they can be used to normalize parts of the data). I understand that I must do manual work for each CSV layout, but as I have to read tons of these files, some common functions to structure them in a predictable manner on each pass would be excellent.
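A hedged sketch of one way to do it, building on the iHeaders vector above (this assumes iHeaders holds the 1-based line number of each header row; the end of the file closes the last block, and a plain named list of data frames is probably the simplest structure):
starts <- iHeaders                                   # header line of each block
ends   <- c(iHeaders[-1] - 1, length(fieldVector))   # last data line of each block

ReadBlock <- function(h, e) {
  data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"",
                           skip = h, nrows = e - h,
                           stringsAsFactors = FALSE)
  names(data) <- GetHeader(filepath, h)  # header sits on line h itself
  data
}

# Map() walks the two index vectors in parallel and returns a list of
# data frames, one entry per subtable.
tables <- Map(ReadBlock, starts, ends)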
Before you answer, please keep in mind that, no, using a different export format is not an option.
Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all possible solutions in this particular domain.
I want to import a table (.txt file) in R with read.table().
table1<- read.table("input.txt",sep = "\t")
The file contains values like 0.09165395632583884.
After reading, the value becomes 0.09165396; the last few digits appear to be lost, and I want to avoid this problem.
If I use
options(digits=22)
then it creates another problem: for example, maindata = 0.19285969062961023, but when I write the data to a file with
write.table(table1,file = "output.txt",col.names = F, row.names = F)
I get data = 0.192859690629610225. Here, the last digit is extra and the second-to-last digit has changed.
Can someone give me a hint how to solve the problem?
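No answer is quoted here, but a short sketch of what is happening may help: the value is stored at full double precision, and only print() rounds it; digits beyond about 17 significant figures are artifacts of the binary representation.
x <- 0.09165395632583884
print(x)             # 0.09165396 -- the default getOption("digits") is 7
sprintf("%.17g", x)  # 17 significant digits round-trip a double exactly
# Writing with an explicit format avoids both the apparent truncation and
# the spurious extra digits that options(digits = 22) exposes:
write.table(format(table1, digits = 17), file = "output.txt",
            col.names = FALSE, row.names = FALSE, quote = FALSE)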
I would like to read a .txt file into R, and have done so numerous times.
At the moment however, I am not getting the desired output.
I have a .txt file that contains the data X that I want, plus other data that I do not, which appears before and after data X.
Here is a screenshot of the .txt file.
I am able to read in the .txt file as follows:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header=TRUE,skip=88, nrows=266)
This gives me a dataframe with 266 obs of 1 variable.
But I want these 266 observations in 4 columns (ID, Species, Endpoint, BLM NOEC).
So I tried the following script:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header=TRUE,skip=88, nrows=266, sep = " ")
But then I get the error
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
Using sep = "\t" also gives the same error.
And I am not sure how I can fix this.
Any help is much appreciated!
Try read.fwf and specify the widths of each column. Start reading at the Aelososoma sp. row and add the column names afterwards with something like:
df <- read.fwf("C:/Users/toxicologie/Cobalt/WB1", header = FALSE, skip = 88, n = 266, widths = c(2, 35, 15, 15))
colnames(df) <- c("ID", "Species", "Endpoint", "BLM NOEC")
Provide the txt file for a more complete answer.
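If the exact widths aren't obvious, one way to eyeball them (not part of the original answer; the path and line counts are taken from the question):
# Peek at a few raw rows to measure the fixed-width layout before
# choosing the widths vector.
lines <- readLines("C:/Users/toxicologie/Cobalt/WB1", n = 95)
cat(tail(lines, 5), sep = "\n")  # inspect the raw text
nchar(tail(lines, 5))            # line lengths hint at column boundaries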