Problem reading data vector - My csv data file (rab.csv) has just one row of > 10,000 numbers read into R with:
bab <- read.table("rab.csv") #which yields:
bab
V1
1 23,29,9,28,16,10,8,24,16,20,14,15,17,31,25,19,24,55,28,55,23, . . . and so on
In using this data vector, I get:
Error: data vector must consist of at least two distinct values!
It seems to only see the number "1" that was somehow added in front of the data.
I'm quite new to this so probably something simple, but I've spent 2 days searching every possibility I can think of without finding a solution.
We can use scan to read the file as a vector.
v1 <- scan("rab.csv", what=numeric(), sep=",")
In the read.table, if we don't specify header=FALSE, it will take the first column as header and as it is numeric, it will append X as prefix. (though, it can be avoided by using check.names=FALSE argument)
Related
I have several hundred text delimited files. In some columns, a newline before the end of the row appears in random columns. When is try to read, it looks for the correct number of columns, but because it is split on to the next row.
the arg fill=T does not help because it creates empty incorrect columns.
If I have:
"Aa|Bb|C\nc\ntwo|three|four"
But really should be two rows by three columns:
"Aa|Bb|Cc\ntwo|three|four"
How can I get there for all rows of the data (the error occurs randomly throughout)?
Note that you have C\nc in the string, which introduces c to a new line. I guess you need to ensure the format of your input string as the first step, otherwise it is difficult to fixed via post-processing.
I am not sure if the code below is what you are after. Do you mean something like using read.csv?
read.csv(text = sub("\n","",s),sep = "|",header = FALSE)
which gives
V1 V2 V3
1 Aa Bb Cc
2 two three four
If you are using data.table, you can try fread (thank #akrun)
fread(sub("\n", "", s))
Data
s <- "Aa|Bb|C\nc\ntwo|three|four"
Simple but frustrating problem here:
I've imported xls data into R, which unfortunately is the only current way to get the data - no csv option or direct DB query.
Anyways - I'm looking to do quite a bit of manipulation on this data set, however the variable names are extraordinarily messy: ie. col2 = "\r\n\r\n\r\n\r\r XXXXXX YYYYY ZZZZZZ" - you get my gist. Each column head has an equally messy name as this example and there are typically >15 columns per spreadsheet.
Ideally I'd like to program a name manipulation solution via R to avoid manually changing the names in xls prior to importing. But I can't seem to find the right solution, since every R function I try/check requires the column name be spelled out and set to a new variable. Spelling out the entire column name is tedious and impractical and plus the special characters seem to break R's functions anyways.
Does anyone know how to do a global replace all names or a global rename by column number rather than name?
I've tried
replace()
for loops
lapply()
Remove non-printing characters in the first gsub. Then trim whitespace off the ends using trimws and replace consecutive strings of the same character with just one of them in the second gsub. No packages are used.
# test input
d <- data.frame("\r\r\r\r\r\n\n\n\n\n\n XXXX YYYY ZZZZ" = 0, check.names = FALSE)
names(d) <- trimws(gsub("[^[:print:]]", "", names(d)))
names(d) <- gsub("(.)\\1+", "\\1", names(d))
d
## X Y Z
## 1 0
With R 3.6 or later you could consider replacing the first gsub line with this trimws line:
names(d) <- trimws(names(d), "both", "\\s")
If you want syntactic names add this after the above code:
names(d) <- make.names(names(d))
d
## X.Y.Z
## 1 0
I have a text file to read in R (and store in a data.frame). The file is organized in several rows and columns. Both "sep" and "eol" are customized.
Problem: the custom eol, i.e. "\t&nd" (without quotations), can't be set in read.table(...) (or read.csv(...), read.csv2(...),...) nor in fread(...), and I can't able to find a solution.
I'have search here ("[r] read eol" and other I don't remember) and I don't find a solution: the only one was to preprocess the file changing the eol (not possible in my case because into some fields I can find something like \n, \r, \n\r, ", ... and this is the reason for the customization).
Thanks!
You could approach this two different ways:
A. If the file is not too wide, you can read your desired rows using scan and split it into your desired columns with strsplit, then combine into a data.frame. Example:
# Provide reproducible example of the file ("raw.txt" here) you are starting with
your_text <- "a~b~c!1~2~meh!4~5~wow"
write(your_text,"raw.txt"); rm(your_text)
eol_str = "!" # whatever character(s) the rows divide on
sep_str = "~" # whatever character(s) the columns divide on
# read and parse the text file
# scan gives you an array of row strings (one string per row)
# sapply strsplit gives you a list of row arrays (as many elements per row as columns)
f <- file("raw.txt")
row_list <- sapply(scan("raw.txt", what=character(), sep=eol_str),
strsplit, split=sep_str)
close(f)
df <- data.frame(do.call(rbind,row_list[2:length(row_list)]))
row.names(df) <- NULL
names(df) <- row_list[[1]]
df
# a b c
# 1 1 2 meh
# 2 4 5 wow
B. If A doesn't work, I agree with #BondedDust that you probably need an external utility -- but you can invoke it in R with system() and do a find/replace to reformat your file for read.table. Your invocation will be specific to your OS. Example: https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands . Since you note that you have \n, and \r\n in your text already, I recommend that you first find and replace them with temporary placeholders -- perhaps quoted versions of themselves -- and then you can convert them back after you have built your data.frame.
I have a dataframe Genotypes and it has columns of loci labeled D2S1338, D2S1338.1, CSF1PO, CSF1PO.1, Penta.D, Penta.D.1. These names were automatically generated when I imported the Excel spreadsheet into R such that the for the two columns labeled CSF1PO, the column with the first set of alleles was labeled CSF1PO and the second column was labeled CSF1PO.1. This works fine until I get to Penta D which was listed with a space in Excel and imported as Penta.D. When I apply the following code, Penta.D gets combined with Penta.C and Penta.E to give me nonsensical results:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE), function(x) x[1])))
Expected <- sapply(locuses, function(x) 1 - sum(unlist(Freqs[grepl(x, names(Freqs))])^2))
This code works great for all loci except the Pentas because of how they were automatically names. How do I either write an exception for the strsplit at Penta.C, Penta.D, and Penta.E or change these names to PentaC, PentaD, and PentaE so that the above code works as expected? I run the following line:
Genotypes <- transform(Genotypes, rename.vars(Genotypes, from="Penta.C", to="PentaC", info=TRUE))
and it tells me:
Changing in Genotypes
From: Penta.C
To: PentaC
but when I view Genotypes, it still has my Penta loci written as Penta.C. I thought this function would write it back to the original data frame, not just a copy. What am I missing here? Thanks for your help.
The first line of your code is splitting the variable names by . and extracting the first piece. It sounds like you instead want to split by . and extract all the pieces except for the last one:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE),
function(x) paste(x[1:(length(x)-1)], collapse=""))))
Looks like you want to remove ".n" where n is a single digit if and only if it appears at the end of a line.
loci.columns <- read.table(header=F,
text="D2S1338,D2S1338.1,CSF1PO,CSF1PO.1,Penta.D,Penta.D.1",
sep=",")
loci <- gsub("\\.\\d$",replace="",unlist(loci.columns))
loci
# [1] "D2S1338" "D2S1338" "CSF1PO" "CSF1PO" "Penta.D" "Penta.D"
loci <- unique(loci)
loci
# [1] "D2S1338" "CSF1PO" "Penta.D"
In gsub(...), \\. matches ".", \\d matches any digit, and $ forces the match to be at the end of the line.
The basic problem seems like the names are being made "valid" on import by the make.names function
> make.names("Penta C")
[1] "Penta.C"
Avoid R's column re-naming with use of the check.names=FALSE argument to read.table. If you refer explicitly to columns you'll need to provide a back-quoted strings
df$`Penta C`
I am reading this from a CSV file, and i need to write a function that churns out a final data frame, so given a particular entry, i have
x
[1] {2,4,5,11,12}
139 Levels: {1,2,3,4,5,6,7,12,17} ...
i can change it to
x2<-as.character(x)
which gives me
x
[1] "{2,4,5,11,12}"
how do i extract 2,4,5,11,12 out? (having 5 elements)
i have tried to use various ways, like gsub, but to no avail
can anyone please help me?
It sounds like you're trying to import a database table that contains arrays. Since R doesn't know about such data structures, it treats them as text.
Try this. I assume the column in question is x. The result will be a list, with each element being the vector of array values for that row in the table.
dat <- read.csv("<file>", stringsAsFactors=FALSE)
dat$x <- strsplit(gsub("\\{(.*)\\}", "\\1", dat$x), ",")