Reading text files using read.table

I have a text file with an id and name column, and I'm trying to read it into a data frame in R:
d = read.table("foobar.txt", sep="\t")
But for some reason, a lot of lines get merged -- e.g., in row 500 of my data frame, I'll see something like
row 500: 500 Bob\n501\tChris\n502\tGrace
[So if my original text file has, say, 5000 lines, the dimensions of my table will only end up being 1000 rows and 2 columns.]
I've had this happen to me quite a few times. Does anyone know what the problem is, or how to fix it?

From ?read.table: The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary.
So, perhaps your data file isn't clean. Being more specific will help the data import:
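d = read.table("foobar.txt",
               sep="\t",
               col.names=c("id", "name"),
               fill=FALSE,
               strip.white=TRUE)
This names the two columns explicitly, so the column count no longer depends on the first five lines, and fill=FALSE makes read.table raise an error on a short row instead of silently padding it with NA.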
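One further possibility, not raised in the answer above, is that the names contain apostrophes. read.table treats both " and ' as quote characters by default, so an unmatched apostrophe can make it read several lines as one quoted field, which would match the merged rows described in the question. The sketch below uses made-up sample data and assumes apostrophes are the cause; passing quote="" turns quote handling off.

```r
# Made-up sample data: two of the names contain an apostrophe
txt <- "1\tBob\n2\tO'Neil\n3\tChris\n4\tD'Arcy\n5\tEve"

# Default quote = "\"'": the apostrophes may be taken as quote delimiters.
# Wrapped in try() because the outcome depends on the data.
d_default <- try(read.table(text = txt, sep = "\t"), silent = TRUE)

# quote = "" disables quote handling, so every apostrophe is ordinary text
d_fixed <- read.table(text = txt, sep = "\t", quote = "",
                      col.names = c("id", "name"),
                      stringsAsFactors = FALSE)
dim(d_fixed)
```

Comparing dim(d_default) with dim(d_fixed) on your own file is a quick way to check whether quoting is involved.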

Related

read.csv importing two columns instead of one

I'm trying to import a csv file into a vector. There are 100 entries in this csv file, and this is what the file looks like:
My code reads as follows:
> choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")
> choice_vector
And yet, when I try to display said vector, it shows up as:
It is somehow creating a second column, and I cannot figure out why. In addition, writing the data to a new csv file writes the contents of that second column as well.

The second column was "enabled" in Excel.
Option 1: Manually delete the column in Excel.
Option 2: Drop every column that contains only NA:
choice_vector2 <- choice_vector[,colSums(is.na(choice_vector))<nrow(choice_vector)]
If you are only interested in the first column:
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")[,1]
Good luck!

Short answer:
You have an issue with your data file, but
choice_vector <- read.csv("choices.csv", header = FALSE, fileEncoding="UTF-8-BOM")$V1
should create the vector that you're expecting.
Long answer:
The read.csv function returns a data frame and you need to address a particular column within the data frame with the $ operator in order to extract that column as a vector. As for why you have an unexpected column of NAs, your CSV probably codes for two columns. When you read a CSV with R, a comma indicates a data field to its right. If you look at your CSV with a text editor, I'm guessing it'll look like this:
A,
B,
D,
A,
A,
F,
The absence of anything (other than another comma or a line break) to the right of a comma is interpreted as NA.
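A minimal sketch of the trailing-comma case, assuming the file really does end each line with a comma. The inline text is a made-up stand-in for choices.csv, and stringsAsFactors = FALSE is set only so the result is a character vector on older R versions too.

```r
# Made-up stand-in for choices.csv: every line ends with a trailing comma
csv_text <- "A,\nB,\nD,\nA,\nA,\nF,"

choices <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)
str(choices)   # two columns: V1 holds the letters, V2 is entirely NA

# Keep only columns that have at least one non-NA value;
# drop = FALSE keeps the result a data frame even if one column remains
keep <- colSums(is.na(choices)) < nrow(choices)
cleaned <- choices[, keep, drop = FALSE]

# Extract the remaining column as a plain character vector
choice_vector <- cleaned$V1
choice_vector
```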

If we are using fread from data.table, there is a select option to select only the columns of interest
library(data.table)
dt <- fread("choices.csv", select = 1)
Other than that, it is not clear why the issue happens. It could be some stray white space. If that is the case, specify strip.white = TRUE (the default is FALSE):
read.csv("choices.csv", header = FALSE,
         fileEncoding="UTF-8-BOM", strip.white = TRUE)
Or, as mentioned in the comments, copy the columns of interest into a new file, save it, and then read it with read.csv.

CSV with multiple datasets/different-number-of-columns

Similar to How can you read a CSV file in R with different number of columns, I have some complex CSV-files. Mine are from SAP BusinessObjects and hold challenges different to those of the quoted question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many CSV-files, but let's start with one of them.
Given: One CSV file containing several flat tables.
Wanted: Several dataframes or other structure holding all data (S4?)
The method so far:
get line numbers of header data by counting number of columns
get headers by reading every line index held in vector described above
read data by calculating skip and nrows between data sets in index described by header lines as above.
give the read data column names from read header
I need help getting me on the right track to avoid loops/making the code more readable/compact when reading headers and datasets.
These CSVs are formatted as normal CSVs, except that they contain a more or less arbitrary number of subtables. For each dataset I export, the structure is different. In the current example I will suppose there are five tables included in the CSV.
To give you an idea, here is some fictitious sample data with line numbers. Separators and quotes have been stripped:
1: n, Name, Species, Description, Classification
2: 90, Mickey, Mouse, Big ears, rat
3: 45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
33212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14
There are a number of ways going about reading each data set. What I have come up with so far:
filepath <- file.path(Sys.getenv("USERPROFILE"), "SAMPLE.CSV")
# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")
# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)
# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
With this, I can do things like:
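# Header of the fourth table: skip everything before it and read a single line
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] - 1, nrows = 1)
# Data of the fourth table: start right after its header and stop before the fifth header
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5] - iHeaders[4] - 1)
names(data) <- as.character(unlist(header))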
As mentioned in the intro of this post, I have made a couple of functions which make it easier to get headers for each dataset:
Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]
I have two functions now - one is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe headernames (no æøå % etc).
The other returns a list of string vectors holding all headers:
GetHeaders <- function(filepath, linenums) {
  # init an empty list of length(linenums)
  l.headers <- vector(mode = "list", length = length(linenums))
  for(i in seq_along(linenums)) {
    # read.csv2(filepath, skip = linenums[i]-1, nrows = 1)
    l.headers[[i]] <- GetHeader(filepath, linenums[i])
  }
  l.headers
}
What I struggle with is how to read in all possible datasets in one go. Specifically the last set is a bit hard to wrap my head around if I should write a common function, where I only know the line number of header, and not the number of lines in the following data.
Also, what is the best data structure for such a structure as described? The data in the subtables are all relevant to each other (can be used to normalize parts of the data). I understand that I must do manual work for each read CSV, but as I have to read TONS of these files, some common functions to structure them in a predictable manner at each pass would be excellent.
Before you answer, please keep in mind that, no, using a different export format is not an option.
Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all possible solutions in this particular domain.
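One possible way to read every subtable in a single pass is sketched below. It is only a sketch under stated assumptions: read_stacked_csv is a hypothetical helper (not from the question), neighbouring tables are assumed to differ in their number of fields, the first line of each run is assumed to be its header, and no quoted field is assumed to span several lines. The sample file is made up to mirror the layout shown earlier.

```r
# Hypothetical helper: split a CSV holding several stacked tables
# wherever the number of fields per line changes.
read_stacked_csv <- function(path, sep = ",", quote = "\"") {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(trimws(lines))]          # ignore blank lines

  con <- textConnection(lines)
  on.exit(close(con))
  n_fields <- utils::count.fields(con, sep = sep, quote = quote,
                                  comment.char = "")

  # Each run of equal field counts is treated as one table
  runs   <- rle(n_fields)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1

  lapply(seq_along(starts), function(i) {
    block <- lines[starts[i]:ends[i]]            # header line + data rows
    utils::read.csv(text = block, header = TRUE, sep = sep, quote = quote,
                    strip.white = TRUE, stringsAsFactors = FALSE)
  })
}

# Made-up sample file with three stacked tables
tmp <- tempfile(fileext = ".csv")
writeLines(c("n,Name,Species", "90,Mickey,Mouse", "45,Minnie,Mouse",
             "Code,Species", "RT,rat",
             "Last update", "2017/05/09 02:09:14"), tmp)

tables <- read_stacked_csv(tmp)
length(tables)
names(tables[[2]])
```

Because the last run simply extends to the end of the file, the final table needs no special handling. A plain list of data frames (optionally named after the first header field) is probably the simplest structure to start with.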

Combining CSV files and splitting the column into 2 columns using R

I have 40 CSV files with only 1 column each. I want to combine all 40 files data into 1 CSV file with 2 columns.
Data format is like this :
I want to split this column by space and combine all 40 CSV files into 1 file. I want to preserve the number format as well.
I tried the code below, but the number format is not preserved and an extra third column is added for negative numbers. I am not sure why.
My Code :
filenames <- list.files(path="C://R files", full.names=TRUE)
merged <- data.frame(do.call("rbind", lapply(filenames, read.csv, header = FALSE)))
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1))," ",fixed=FALSE))
write.csv(data, "export1.csv", row.names=FALSE, na="NA")
The output I got is shown below. As you can see, the negative numbers are put into an extra column. I just want to split by space into 2 columns, keeping the exact number format of the input.
R Output:

The problem is that the source data is delimited by:
one space when the second number is negative, and
two spaces when the second number is positive (space for the absent minus sign).
The trick is to split the string on one or more spaces:
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1))," +",fixed=FALSE))
I'm a bit OCD on charsets, unreliable files, etc., so I tend to use splitters such as "[[:space:]]+" instead, since that catches other whitespace variants (tabs, for instance) and not only the plain space " ".
(In regex-speak, the + says "one or more". Other modifiers include ? as zero or one, and * as zero or more.)
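A small sketch of the difference, using two made-up lines in the layout described above (one space before a negative number, two before a positive one):

```r
# Made-up sample: one space before a negative second number,
# two spaces before a positive one
x <- c("1.50 -2.25", "3.00  4.75")

# Splitting on a single space leaves an empty string between the two spaces,
# so the second line yields three pieces instead of two
lengths(strsplit(x, " "))

# Splitting on one or more whitespace characters yields two pieces per line;
# the values stay character strings, so the number format is untouched
parts <- do.call(rbind, strsplit(trimws(x), "[[:space:]]+"))
parts
```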

write and read.csv different number of columns

A strange problem with write and read.csv. I have ways to work around this but would be great if someone can identify what is going on.
I have code from someone else which dynamically creates a series of CSVs by appending new rows. The problem is that read.csv appears to read the newly created csv inconsistently.
Dummy code example:
datfile <- "E:/temp.csv"
write(paste("Name","tempname",sep=","),datfile,1)
write(paste("VShort",50,sep=","),datfile,1,append=T)
write(paste("Short1",1,1,sep=","),datfile,1,append=T)
write(paste("Short2",0,2,sep=","),datfile,1,append=T)
write(paste("Short3",0,2,sep=","),datfile,1,append=T)
write(paste("Long",0,0.3,0.6,1,sep=","),datfile,1,append=T)
write(paste("Short4",2,0,sep=","),datfile,1,append=T)
read.csv(datfile,header=F,colClasses="character")
Seven rows of data written to CSV, but read.csv reads in 8 rows (Long is split over two rows). Eight rows and three columns read in.
The problem is fixed by opening temp.csv in Excel and saving. Then read.csv reads in the 7 lines appropriately.
The problem only appears to exist under certain conditions. For example, remove Short3 and there is no problem:
datfile2 <- "E:/temp2.csv"
write(paste("Name","tempname",sep=","),datfile2,1)
write(paste("VShort",50,sep=","),datfile2,1,append=T)
write(paste("Short1",1,1,sep=","),datfile2,1,append=T)
write(paste("Short2",0,2,sep=","),datfile2,1,append=T)
write(paste("Long",0,0.3,0.6,1,sep=","),datfile2,1,append=T)
write(paste("Short4",2,0,sep=","),datfile2,1,append=T)
read.csv(datfile2,header=F,colClasses="character")
Six rows and five columns are read in.
Any ideas what is going on here?
R version 3.2.4 Revised
Windows 10

This is probably related to the following in ?read.csv:
The number of data columns is determined by looking at the first five
lines of input (or the whole file if it has less than five lines), or
from the length of col.names if it is specified and is longer. This
could conceivably be wrong if fill or blank.lines.skip are true, so
specify col.names if necessary (as in the ‘Examples’).
It just happens that the row with the most columns is the sixth row in your first example, so it is not among the five lines used to determine the column count.
I suggest using col.names to get around this, e.g.:
read.csv(..., col.names = paste0('V', 1:6))
As the OP notes in a comment to this answer, you can find out the number of
columns required using readLines:
Ncol <- max(unlist(lapply(strsplit(readLines(datfile), ","), length)))
and then modify the above to give:
read.csv(datfile,header=F,colClasses="character", col.names=paste0("V", 1:Ncol))
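As a self-contained check, the sketch below recreates the seven-line file from the question in a temporary location (the tempfile path is an assumption in place of E:/temp.csv) and uses count.fields as an alternative way to find the widest line:

```r
# Recreate the seven-line file from the question in a temporary location
datfile <- tempfile(fileext = ".csv")
writeLines(c("Name,tempname", "VShort,50", "Short1,1,1", "Short2,0,2",
             "Short3,0,2", "Long,0,0.3,0.6,1", "Short4,2,0"), datfile)

# Widest line in the file, counted without building a data frame first
Ncol <- max(utils::count.fields(datfile, sep = ","))

# Supplying that many column names fixes the column count up front
dat <- read.csv(datfile, header = FALSE, colClasses = "character",
                col.names = paste0("V", seq_len(Ncol)))
dim(dat)
```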

Using cbind to create a columnar .csv file but 'X' always appears in row one

I have a script that is working perfectly except that in my R cbind operation, adjacent to the numerical value that I require in the first row, is an 'X'.
Here is my script:
library(ncdf)
library(Kendall)
library(forecast)
library(zoo)
setwd("/home/cohara/RainfallData")
files=list.files(pattern="*.nc")
j=81
for (i in seq(1,9))
{
file<-open.ncdf(sprintf("/home/cohara/RainfallData/%s.nc",i))
year<-get.var.ncdf(file,"time")
data<-get.var.ncdf(file,"var61")
fit<-lm(data~year) #least squares regression
mean=rollmean(data,4,fill=NA)
kendall<-Kendall(data,year)
write.table(kendall[[2]],file="/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(kendall[[1]],file="/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv",append=TRUE,quote=FALSE,row.names=FALSE,col.names=FALSE)
png(sprintf("./10 percent increase over %s years.png",j))
par(family="serif",mar=c(4,6,4,1),oma=c(1,1,1,1))
plot(year,data,pch="*",col=4,ylab="Precipitation (mm)",main=(sprintf("10 percent increase over %s years",j)),cex.lab=1.5,cex.main=2,ylim=c(800,1400),abline(fit,col="red",lty=1.5))
par(new=T)
plot(year,mean,type="l",xlab="year",ylab="Precipitation (mm)",cex.lab=1.5,ylim=c(800,1400),lty=1.5)
legend("bottomright",legend=c("Kendall tau = ",kendall[[1]]))
legend("bottomleft",legend=c("Kendall 2-tailed p-value = ",kendall[[2]]))
legend(x="topright",c("4 year moving average","Simple linear trend"),lty=1.5,col=c("black","red"),cex=1.2)
legend("topleft",c("Annual total"),pch="*",col="blue",cex=1.2)
dev.off()
j=j+1
}
tmp<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv")
tmp2<-read.csv("/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_-_89_years.csv")
tmp<-cbind(tmp,tmp2)
tmp3<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv")
tmp4<-read.csv("/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_-_89_years.csv")
tmp3<-cbind(tmp3,tmp4)
write.table(tmp,"/home/cohara/RainfallAnalysis/Kendall_p-value_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
write.table(tmp3,"/home/cohara/RainfallAnalysis/Kendall_tau_for_10%_increase_over_81_to_89_years.csv",sep="\t",row.names=FALSE)
The output looks like this, from the .csv files created:
X0.0190228056162596 X0.000701081415172666
0.0395622998 0.00531819
0.0126547674 0.0108218994
0.0077754743 0.0015568719
0.0001407317 0.002680057
0.0096391216 0.012719159
0.0107234037 0.0092436085
0.0503448173 0.0103918528
0.0167525802 0.0025036721
I want to be able to use excel functions on the data, so, for simplicity, I don't want row names (I'll be running this loop maybe a hundred times), but I need column names because otherwise the first set of values is cut off.
Can anyone tell me where the 'X' is coming from and how to get rid of it?
Thanks in advance,
Ciara

Here is what I think is going on. Start by running these small examples:
df1 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994")
df2 <- read.csv(text = "0.0190228056162596, 0.000701081415172666
0.0395622998, 0.00531819
0.0126547674, 0.0108218994", header = FALSE)
df1
df2
str(df1)
str(df2)
names(df1)
names(df2)
make.names(c(0.0190228056162596, 0.000701081415172666))
Please read ?read.csv and about the header argument. As you will find, header = TRUE is default in read.csv. Thus, if the csv file you read lacks header, read.csv will still 'assume' that the file has a header, and use the values in the first row as a header. Another argument in read.csv is check.names, which defaults to TRUE:
If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names).
In your case, it seems that the data you read lack a header and that the first row is numbers only. By default, read.csv will treat this row as a header. make.names takes the values in the first row (here the numbers 0.0190228056162596 and 0.000701081415172666) and turns them into the 'syntactically valid variable names' X0.0190228056162596 and X0.000701081415172666, which is not what you want.
Thus, you need to explicitly set header = FALSE to prevent read.csv from converting the first row to (valid) variable names.
For next time, please provide a minimal, self-contained example.
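Applied to the write step in the question, the fix might look like the sketch below. The inline text stands in for one of the Kendall output files and the temporary output path is an assumption. Reading with header = FALSE keeps the first value as data, and writing with col.names = FALSE leaves no header row for Excel to trip over.

```r
# Made-up stand-in for one of the headerless files written in the loop
csv_text <- paste("0.0190228056162596,0.000701081415172666",
                  "0.0395622998,0.00531819",
                  "0.0126547674,0.0108218994", sep = "\n")

# Default header = TRUE: the first row is consumed and becomes "X..." names
with_header <- read.csv(text = csv_text)
names(with_header)

# header = FALSE: all three rows are data, columns are named V1 and V2
no_header <- read.csv(text = csv_text, header = FALSE)
nrow(no_header)

# Write the values back without row names or column names
out <- tempfile(fileext = ".csv")
write.table(no_header, out, sep = "\t", row.names = FALSE, col.names = FALSE)
readLines(out)[1]
```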
