Reading a text file with an irregular header (in R)

I am trying to read a flat file into R.
It is separated by ';' and has 12 leading lines of comments describing the content.
I want to read the file and exclude the comments.
The problem, however, is that commented line 11 contains the data headers, as follows:
# Fields: labno; name; dob; sex; location; date
Is there a way that I can extract the headers from the comments and apply them to the data? The way I thought of doing it was to read the first 11 lines only and store everything from labno onwards as a vector. Then I would read everything from line 13 and use the stored vector as column names for the data.
Is there a way to read the first 11 lines and remove everything before labno?
Thanks.

Step 1: read only the eleventh row, which contains the column names. Reading it with sep="\n" keeps the whole line as a single string:
hdrs <- read.table("somefile.txt", nrows=1, skip=10, comment.char="",
                   sep="\n", stringsAsFactors=FALSE)[1, 1]
Step 2: read the rest of the file (it is ';'-separated), allowing default automatic names:
dat <- read.table("somefile.txt", skip=12, sep=";")
Step 3: remove the extraneous characters before applying the 'fields' as column names:
names(dat) <- scan(textConnection(sub("# Fields:", "", hdrs, fixed=TRUE)),
                   what="character", sep=";", strip.white=TRUE)
Later versions of R allow ‘scan’ to have a ‘text’ argument rather than requiring the awkward textConnection function.
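With that argument, the last step becomes (a sketch of the same call using text= instead of a connection):
names(dat) <- scan(text=sub("# Fields:", "", hdrs, fixed=TRUE),
                   what="character", sep=";", strip.white=TRUE)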

Related

R: Why am I getting an extra column titled "X.1" in my dataframe after reading my .txt file?

I have a .txt file output by a microscope that I need to process.
# read the .txt file generated by the microscope, skipping the first 9 lines of garbage information
df <- read.csv("Objects_Population - AllCells.txt", sep="\t", skip=9, header=TRUE, fill=TRUE)
Then I started looking at the structure of the data frame. Everything seems fine, except that I found an extra column at the end of the data frame named "X.1", in which all rows are NA values. I don't see this column when I open the .txt file in Excel. I suspect the problem has something to do with the column names generated by the microscope, as they contain quite a few special characters.
Below is the data frame as read by Excel (only showing the last 2 columns, since I have 132 columns and their names are disgustingly long):
AllCells - Cell Contact Area with Neighbors [%] AllCells - Nucleus Nearest Neighbor Distance [µm]
0 4.82083
21.9512 0
15.7895 0
29.4118 0.584611
0 4.21569
0 1.99599
0 3.50767
...
This has happened to me before, but I never took it too seriously as I was only ever interested in a subset of my data frame. Now that I'm looking at all the columns, it has started to bother me.
Is there any way I can read the file correctly without R attaching that additional "X.1" column at the end? Preferably without manually deleting or subsetting out the last column...
If all the other column names are correct, you probably have a trailing \t in the text file. R tries to include it and gives the resulting empty column the generic name X.1.
You could try to read the file first as plain text, remove the trailing \t, and only then use read.csv:
file_connection <- file("Objects_Population - AllCells.txt")
content <- readLines(file_connection)
close(file_connection)
Now we try to get rid of these trailing \t (this might need some testing to fit your needs):
sanitized <- gsub("\\t$", "", content)
Then we read this sanitized string as if it were a file (using the text argument):
df <- read.csv(text=paste0(sanitized, collapse="\n"), sep="\t", skip=9, header=TRUE, fill=TRUE)
Had that problem too. Fixed it by saving the file as "CSV (MS-DOS) (*.csv)" instead of what I originally had, "CSV (Comma delimited) (*.csv)".
This is almost certainly because you've got an extra empty column in your spreadsheet.
In Excel, open your sheet and press Ctrl-End. If you end up in an empty cell outside the range of your data, there's the problem. Select the column (Ctrl-Space), right-click, and choose Delete.
I also encountered a similar problem: three extra columns (X, X.1, X.2) were created after I loaded a dataset from an Excel sheet into RStudio.
Steps I followed:
a) I went to the Excel sheet and selected the three extra columns after the last column with actual values, by placing the cursor on top of the columns, then right-clicked and chose Delete.
b) I loaded the Excel sheet into R again, and the 3 columns were gone.

write_csv - Exporting trailing spaces (no elimination)

I am trying to export a table to CSV format, but one of my columns is special: it's like a number string, except that the length of the string needs to be the same every time, so I add trailing spaces to shorter numbers to bring them to a fixed length (in this case, length 5).
library(dplyr)
library(readr)
library(stringr)
df <- read.table(text="ID Something
22 Red
55555 Red
123 Blue
", header=TRUE)
df <- mutate(df, ID=str_pad(ID, 5, "right", " "))
df
ID Something
1 22 Red
2 55555 Red
3 123 Blue
Unfortunately, when I write_csv this somewhere, the trailing spaces disappear, which is not good for what I want to use this for. I think it's because I am downloading the csv from the R server and then opening it in Excel, which messes around with the data. Any tips?
str_pad() appears to be a function from the stringr package, which is not currently available for R 3.5.0 (which I am using); this may be the cause of your issues as well. If the function actually works for you, please ignore the next step and skip straight to my Excel comments below.
Adding spaces. Here is how I have accomplished this task with base R
# a custom function to add an arbitrary number of trailing spaces
SpaceAdd <- function(x, desiredLength = 5) {
  additionalSpaces <- ifelse(nchar(x) < desiredLength,
                             paste(rep(" ", desiredLength - nchar(x)), collapse = ""),
                             "")
  paste(x, additionalSpaces, sep = "")
}
# use the function on your df
df$ID <- mapply(df$ID, FUN = SpaceAdd)
# write csv normally
write.csv(df, "df.csv")
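As an aside, base R's formatC() can do the same padding in one vectorized call; a negative width left-justifies, i.e. pads on the right:
df$ID <- formatC(as.character(df$ID), width = -5)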
NOTE: when you import the file into Excel, you should use the 'import from text' wizard rather than just opening the .csv. This is because you need to mark your 'ID' column as text in order to keep the spaces.
NOTE 2: I have learned today that having your first column named 'ID' might actually cause further problems with Excel, since it may misinterpret the nature of the file and treat it as a SYLK file instead. So it may be best to avoid this column name if possible.
Here is a wiki tl;dr:
A commonly encountered (and spurious) 'occurrence' of the SYLK file happens when a comma-separated value (CSV) format is saved with an unquoted first field name of 'ID', that is the first two characters match the first two characters of the SYLK file format. Microsoft Excel (at least to Office 2016) will then emit misleading error messages relating to the format of the file, such as "The file you are trying to open, 'x.csv', is in a different format than specified by the file extension..."
details: https://en.wikipedia.org/wiki/SYmbolic_LinK_(SYLK)
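Worth noting: write.csv() quotes character fields and column names by default, so the header is written as "ID",... and the SYLK trap mainly bites when quoting is turned off:
write.csv(df, "df.csv")                # default quote=TRUE: header starts with "ID"
write.csv(df, "df.csv", quote=FALSE)   # header starts with bare ID - may trigger the SYLK bug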

R - extract data which changes position from file to file (txt)

I have a folder with tons of txt files from which I have to extract specific data. The problem is that the format of the files changed at one point, and the position of the data I need to extract changed with it, so I have to deal with files in different formats.
To try to make it clearer: in column 4 I have the name of the variable and in column 5 I have its value, but sometimes these appear in a different row. Is there a way to find the row in which a variable's name appears and then extract its value?
Thanks in advance
EDIT:
In some files I will have the data like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Current--------28
But at some point there was a change in the software to add another variable, and the new files look like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
So I need to deal with these 2 types of data, extracting the same variables which are in different rows.
If these files can't be read with read.table, use readLines and then find the lines that start with the keyword you need.
For example:
Sample file 1 (with the dashes included and extra line breaks):
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
Sample file 2 (with a comma as separator):
Column 1,Column 2.
Device ID,A.
Current,555
Voltage, 500.
Error,5.
For both cases do:
text <- readLines("your filename here")
curr <- text[grepl("^Current", text, ignore.case = TRUE)]
Which returns:
for file 1:
[1] "Current--------28"
for file 2:
[1] "Current,555"
Then use gsub to remove anything that is not a number.
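For example (keeping the decimal point as well, since some of the other values carry one):
curr_value <- as.numeric(gsub("[^0-9.]", "", curr))
# file 1: 28
# file 2: 555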

Read in data from .txt file (no header, no separator)

I have a large dataset (~200 MB) stored in a .txt file which I need to read into R. Unfortunately, there are no separators (like " " or ",") between the values of the variables, and there is no header.
But there is a codebook, which gives the variable names and also specifies which columns belong to which variable. Some of the variables take one column of space, some take more (so read.fwf won't work); but their widths are the same for all cases.
I probably only have to read in a few of these variables, so I expect that I will just have to select the necessary columns and name the variables. What would be an elegant solution to do this (and maybe even preselect meaningful variable types)?
You can consider loading the data as-is and then parsing each line using strsplit with an appropriate regular expression.
con <- file("yourfile.txt", open = "r")
lines <- readLines(con)
close(con)
Iterate over the lines, apply strsplit to each one, and add the pieces to your data table with rbind.
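Since the column positions are fixed, a vectorized alternative to the strsplit-plus-rbind loop is base R's substring(), which slices each line by character position and avoids iterating entirely. A minimal sketch, assuming (hypothetically) that the codebook says characters 1-6 hold an id and characters 7-8 hold an age:
lines <- readLines("yourfile.txt")
# hypothetical codebook positions: id = characters 1-6, age = characters 7-8
dat <- data.frame(
  id  = substring(lines, 1, 6),
  age = as.integer(substring(lines, 7, 8)),
  stringsAsFactors = FALSE
)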

duplicate 'row.names' are not allowed error

I am trying to load a csv file that has 14 columns like this:
StartDate, var1, var2, var3, ..., var14
when I issue this command:
systems <- read.table("http://getfile.pl?test.csv", header = TRUE, sep = ",")
I get an error message:
duplicate 'row.names' are not allowed
It seems to me that the first column name is causing the issue. When I manually download the file and remove the StartDate name from the file, R successfully reads the file and replaces the first column name with X. Can someone tell me what is going on? The file is a (comma separated) csv file.
Then tell read.table not to use row.names:
systems <- read.table("http://getfile.pl?test.csv",
                      header=TRUE, sep=",", row.names=NULL)
and now your rows will simply be numbered.
Also look at read.csv, which is a wrapper for read.table that already sets the sep="," and header=TRUE arguments, so your call simplifies to
systems <- read.csv("http://getfile.pl?test.csv", row.names=NULL)
This related question points out a part of the ?read.table documentation that explains your problem:
If there is a header and the first row contains one fewer field
than the number of columns, the first column in the input is used
for the row names. Otherwise if row.names is missing, the rows are numbered.
Your header row likely has one fewer column than the rest of the file, so read.table assumes that the first column holds the row.names (which must all be unique), rather than being an ordinary column (which can contain duplicated values). You can fix this by using one of the following two solutions:
adding a delimiter (e.g. \t or ,) to the front or end of your header row in the source file, or
removing any trailing delimiters in your data
The choice will depend on the structure of your data.
Example:
Here the header row is interpreted as having one fewer column than the data because the delimiters don't match:
v1,v2,v3 # 3 items!!
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
This is how it is interpreted by default:
   v1  v2  v3
a1 a2  a3
b1 b2  b3
The first column (with no header) values are interpreted as row.names: a1 and b1. If this column contains duplicates, which is entirely possible, then you get the duplicate 'row.names' are not allowed error.
If you set row.names = FALSE, the shift doesn't happen, but you still have a mismatching number of items in the header and in the data because the delimiters don't match.
Solution 1
Add trailing delimiter to header:
v1,v2,v3, # 4 items!!
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
Solution 2
Remove excess trailing delimiter from non-header rows:
v1,v2,v3 # 3 items
a1,a2,a3 # 3 items!!
b1,b2,b3 # 3 items!!
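A small reproduction of both the error and the row.names=NULL workaround (a sketch that writes a throwaway file with the mismatched header above):
writeLines(c("v1,v2,v3", "a1,a2,a3,", "a1,b2,b3,"), "test.csv")
read.csv("test.csv")
# Error: duplicate 'row.names' are not allowed (a1 occurs twice in the first column)
read.csv("test.csv", row.names=NULL)
# reads, but the names are shifted: row.names, v1, v2, v3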
In my case there was a comma at the end of every line. Removing those trailing commas fixed it.
I had this error when opening a CSV file in which one of the fields had commas embedded in it. The field had quotes around it, and I had cut and pasted the read.table call with quote="" in it. Once I took quote="" out, the default behavior of read.table took over and solved the problem. So I went from this:
systems <- read.table("http://getfile.pl?test.csv", header=TRUE, sep=",", quote="")
to this:
systems <- read.table("http://getfile.pl?test.csv", header=TRUE, sep=",")
I used read_csv from the readr package.
In my experience, the parameter row.names=NULL in the read.csv function will lead to a wrong reading of the file if a column name is missing, i.e. every column will be shifted. read_csv solves this.
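A minimal sketch of that approach; readr flags header/field-count mismatches as parsing problems (inspectable with problems()) instead of silently shifting the names:
library(readr)
systems <- read_csv("test.csv")
problems(systems)   # lists any rows whose field count did not match the header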
Another possible reason for this error is that you have entire rows duplicated. If that is the case, the problem is solved by removing the duplicate rows.
The answer here (https://stackoverflow.com/a/22408965/2236315) by @adrianoesch should help (e.g., it solves "If you know of a solution that does not require the awkward workaround mentioned in your comment (shift the column names, copy the data), that would be great." and "...requiring that the data be copied" raised by @Frank).
Note that if you open the file in a text editor, you should see that the number of header fields is less than the number of columns below the header row. In my case, the data set had a "," missing at the end of the last header field.
It seems the problem can arise from more than one reason. The following steps worked when I was having the same error:
I saved my file as MS-DOS CSV (earlier it was saved as just CSV, in Excel Starter 2010).
I opened the csv in Notepad++ and checked that no commas were inconsistent (consistency as described above by @Brian).
I noticed I was not using the argument sep=",". I added it and it worked (even though that is the default argument!).
