I have a file similar to
ColA ColB ColC
A 1 0.1
B 2 0.2
But with many more columns.
I want to read the table and set the correct type of data for each column.
I am doing the following:
data <- read.table("file.dat", header = FALSE, na.string = "",
dec = ".",skip = 1,
colClasses = c("character", "integer","numeric"))
But I get the following error:
Error in scan(...): scan() expected 'an integer', got 'ColB'
What am I doing wrong? Why is R also trying to parse the first line according to colClasses, despite skip = 1?
Thanks for your help.
Some notes: the file was generated in a Linux environment and is being worked on in a Windows environment. I suspect a problem with newline characters, but I have no idea what to do about it.
Also, if I read the table without colClasses, the file is read correctly (the first line is skipped), but all columns come out as factors. I could change the classes afterwards, but I would still like to understand what is happening.
Instead of skipping the first line, you can set header = TRUE and it should work fine:
data <- read.table("file.dat", header = TRUE, na.string = "",
dec = ".",colClasses = c("character", "integer","numeric"), sep = ",")
I'm trying to read a large file into R which is separated by the "not" sign (¬). What I normally do is change this symbol into semicolons using a text editor and save the result as a csv file, but this file is too large and my computer keeps crashing when I try. I have tried the following options:
my_data <- read.delim("myfile.txt", header = TRUE, stringsAsFactors = FALSE, quote = "", sep = "\t")
which results in a dataframe with a single row. This makes sense, I know, since my file is not separated by tabs but by the not sign. However, when I try to change sep to ¬ or \¬, I get the following message:
Error in scan(file, what = "", sep = sep, quote = quote, nlines = 1, quiet = TRUE, :
invalid 'sep' value: must be one byte
I have also tried with
my_data <- read.csv2(file.choose("myfile.txt"))
and
my_data <- read.table("myfile.txt", sep="\¬", quote="", comment.char="")
getting similar results. I have searched for options similar to mine, but this kind of separator is not commonly used.
You can try to read in a piped translation of it.
Setup:
writeLines("a¬b¬c\n1¬2¬3\n", "quux.csv")
The work:
read.csv(pipe("tr '¬' ',' < quux.csv"))
# a b c
# 1 1 2 3
If commas don't work for you, this works equally well with other replacement chars:
read.table(pipe("tr '¬' '\t' < quux.csv"), header = TRUE)
# a b c
# 1 1 2 3
The tr utility is available on all Linux distributions, should be available on macOS, and is included in Rtools for Windows (as well as in Git Bash, if you have that).
If there is an issue using pipe, you can always use the tr tool to create another file (replacing your text-editor step):
system2("tr", c("¬", ","), stdin="quux.csv", stdout="quux2.csv")
read.csv("quux2.csv")
# a b c
# 1 1 2 3
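If tr or pipe() isn't an option, a base-R alternative is to do the translation in R itself; this is a sketch assuming the file fits in memory and that the ¬ character survives your locale's encoding:

lines <- readLines("quux.csv")                       # raw lines, untouched
dat <- read.csv(text = gsub("¬", ",", lines, fixed = TRUE))
dat
#   a b c
# 1 1 2 3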
I'm trying to import a dataframe into R with read.csv. I exported it from my database manager (DBeaver) with UTF-8 encoding.
In some factor vectors, what is NULL in the database manager is not recognized as such. I think NULL is replaced by blank space(s), and when I try, I can't turn them into NULL or NA.
I'm using:
tb1 <- read.csv("pacientes.csv", header = TRUE, sep = ",", dec = ".")
I identified the problem when I ran
table(tb1$var2, useNA = "ifany")
on factor variables that I know have missing values: I get a table with "blank space(s)" as a category (along with the correct categories).
I have 59 columns, so configuring read.csv column by column is impractical, and I really believe there's an easier way to fix the problem. Can anyone help me? Thank you very much!
Try
read.csv("pacientes.csv", header = TRUE, sep = ",", dec = ".", na.strings = c("NA", "", " "))
I have data in a csv file. When I read it in, the columns are factors, and I cannot do any computation with them.
I used as.numeric(df$variablename), but it returns a completely different set of values for the variable.
Original data in the variable: 2961,488,632,
as.numeric output: 1,8,16
When reading data using read.table you can:
- specify how your data is separated: sep = ","
- specify what the decimal point is: dec = "."
- specify what NA values look like: na.strings = c("", "-")
- specify that you do not want to convert strings to factors: stringsAsFactors = FALSE
In your case you could use something like:
read.table("mycsv.csv", header = TRUE, sep = ",", dec = ".", stringsAsFactors = F,
na.strings = c("", "-"))
In addition to the answer by Cettt, there's also colClasses.
If you know in advance what data types the columns of your csv file have, you can specify this. It stops R from "guessing" the data type, and lets you know when something isn't right, rather than silently deciding a column must be a string. For example, if your 4-column csv file has columns that are text, factor, integer, and numeric, you can use
read.table("mycsv.csv", header = T, sep = ",", dec = ".",
colClasses=c("character", "factor", "integer", "numeric"))
Edited to add:
As pointed out by gersht, the issue is likely a non-number somewhere in the numbers column; often, this is how the value NA was coded. Specifying colClasses makes R raise an error when it encounters any such "not numeric or NA" value, so you can easily see the issue. If it's a non-default coding of NA, use the argument na.strings = c("NA", "YOUR NA VALUE"). If it's another issue, you'll likely have to fix the file before importing. For example:
read.table(sep=",",
colClasses=c("character", "numeric"),
text="
cat,11
canary,12
dog,1O") # NB not a 10; it's a 1 and a capital-oh.
gives
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'a real', got '1O'
I have a large data set (~20000 x 1). Not all the fields are filled; in other words, the data has missing values. Each feature is a string.
I have run the following code:
Input:
data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)
Output for the second call:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 80 elements
Input:
datar <- read.csv("data.csv", header = TRUE, na.strings = NA)
Output:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
As I see it, I run into essentially four problems. Two of them are the error messages above. The third: when no error is raised, the global environment window shows that not all my rows are accounted for (~14000 samples are missing), although the number of features is right. The fourth: again, not all the samples are accounted for, and the number of features is wrong as well.
How can I solve this?
Try the argument comment.char = "" as well as quote = "". The hash (#) is read by R as a comment character and will cut a line short.
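Putting both suggestions together (a sketch; the filename follows the question, and whether it resolves all four symptoms depends on the actual file contents):

data <- read.csv("data.csv", header = TRUE,
                 quote = "", comment.char = "")  # disable quote and comment handling
nrow(data)  # check the row count against what you expect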
Can you open the CSV in Notepad++? That will let you see 'invisible' and other non-printable characters; the file may not contain what you think it contains! Once the sourcing issue is resolved, you can choose the CSV file with a selector tool:
filename <- file.choose()
data <- read.csv(filename, skip=1)
name <- basename(filename)
Or, hard-code the path and read the data into R:
# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")
I'm trying to produce some files that have slightly unusual field separators.
require(data.table)
dset <- data.table(MPAN = c(rep("AAAA",1000),rep("BBBB",1000),rep("CCCC",1000)),
INT01 = runif(3000,0,1), INT02 = runif(3000,0,1), INT03 = runif(3000,0,1))
write.table(dset,"C:/testing_write_table.csv",
sep = "|",row.names = FALSE, col.names = FALSE, na = "", quote = FALSE, eol = "")
I'm finding, however, that the rows are not being kept separate in the output file, e.g.
AAAA|0.238683722680435|0.782154920976609|0.0570344978477806AAAA|0.9250325632......
Would you know how to ensure the text file retains distinct rows?
Cheers
You are using the wrong eol argument. The end-of-line argument needs to be a line break.
This worked for me:
require(data.table)
dset <- data.table(MPAN = c(rep("AAAA",1000),rep("BBBB",1000),rep("CCCC",1000)),
INT01 = runif(3000,0,1), INT02 = runif(3000,0,1), INT03 = runif(3000,0,1))
write.table(dset,"C:/testing_write_table.csv", #save as .txt if you want to open it with notepad as well as excel
sep = "|",row.names = FALSE, col.names = FALSE, na = "", quote = FALSE, eol = "\n")
Using the line-break character '\n' as the end-of-line argument creates separate lines for me.
Turns out this was a UNIX vs. Windows line-ending issue, so something of a red herring, but perhaps worth recording in case anyone else hits this initially perplexing problem.
It turns out that Windows Notepad sometimes struggles to render files generated on UNIX properly. A quick test for whether this is the issue is to open the file in Windows WordPad instead; you may find that it renders correctly.
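For completeness, a sketch of the other direction (eol = "\r\n" is a documented value for write.table): if the file must open cleanly in older Windows editors such as Notepad, write Windows-style line endings explicitly:

write.table(dset, "C:/testing_write_table.csv",
            sep = "|", row.names = FALSE, col.names = FALSE,
            na = "", quote = FALSE, eol = "\r\n")  # CRLF line endings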