I have data in a csv file. when i get it read, the columns are in factor levels using which I cannot do any computation.
I used
as.numeric(df$variablename) but it renders a completely different set of data for the variable.
original data in the variable: 2961,488,632,
as.numeric output: 1,8,16
When reading data using read.table you can
specify how your data is separated sep = ,
what the decimal point is dec = ,
how NA characters look like na.strings =
that you do not want to convert strings to factors stringsAsFactors = F
In your case you could use something like:
read.table("mycsv.csv", header = TRUE, sep = ",", dec = ".", stringsAsFactors = F,
na.strings = c("", "-"))
In addition to the answer by Cettt , there's also colClasses.
If you know in advance what data types the columns your csv file has, you can specify this. This stops R from "guessing" what the datatype is, and lets you know when something isn't right, rather than deciding it must be a string. e.g. if your 4-column csv file has columns that are Text, Factors, Integer, Numeric, you can use
read.table("mycsv.csv", header = T, sep = ",", dec = ".",
colClasses=c("character", "factor", "integer", "numeric"))
Edited to add:
As pointed out by gersht, the issue is likely some non-number in the numbers column. Often, this can be how the value NA was coded. Specifying colClasses causes R to give an error message when it encounters any such "not numeric or NA" values, so you can easily see the issue. If it's a non-default coding of NA, use the argument na.strings = c("NA", "YOUR NA VALUE") If it's another issue, you'll likely have to fix the file before importing. For example:
read.table(sep=",",
colClasses=c("character", "numeric"),
text="
cat,11
canary,12
dog,1O") # NB not a 10; it's a 1 and a capital-oh.
gives
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'a real', got '1O'
Related
I have a file in which every row is a string of numbers. Example of a row: 0234
Example of this file:
00020
04921
04622
...
When i use read.table it delete all the first 0 of each row (00020 becomes 20, 04921 -> 4921,...). I use:
example <- read.table(fileName, sep="\t",check.names=FALSE)
After this, for obtain a vector i use as.vector(unlist(example)).
I try different options of read.table but the problem remains
The read.table by default checks the column values and change the column type accordingly. If we want a custom type, specify it with colClasses
example <- read.table(fileName, sep="\t",check.names=FALSE,
colClasses = "character", stringsAsFactors = FALSE)
When we are not specifying the colClasses, the function use type.convert to automatically assign the column types based on the value
read.table # function
...
...
data[[i]] <- if (is.na(colClasses[i]))
type.convert(data[[i]], as.is = as.is[i], dec = dec,
numerals = numerals, na.strings = character(0L))
...
...
If I understand the issue correctly, you read in your data file with read.table but since you want a vector, not a data frame, you then unlist the df. And you want to keep the leading zeros.
There is a simpler way of doing the same, use scan.
example <- scan(file = fileName, what = character(), sep = "\t")
I have a large data set of (~20000x1). Not all the fields are filled, in other words the data does have missing values. Each feature is a string.
I have done the following code runs:
Input:
data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)
Output for the second code:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 80 elements
Input:
datar <- read.csv("data.csv", header = TRUE, na.strings = NA)
Output:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
I run into essentially 4 problems, that I see. Two of the problems are the error message stated above. The third one is if it doesn't spit out an error message, when I look at the global environment window, I see not all my rows are accounted for, like ~14000 samples are missing but the feature number is right. The other problem I see is, again, not all the samples are counted for and the feature number is not correct.
How can I solve this??
Try the argument comment.char = "" as well as quote. The hash (#) is being read by R as a comment and will cut the line short.
Can you open the CSV using Notepad++? This will allow you to see 'invisible' characters and any other non-printable characters. That file may not contain what you think it contains! When you get the sourcing issue resolved, you can choose the CSV file with a selector tool.
filename <- file.choose()
data <- read.csv(filename, skip=1)
name <- basename(filename)
Or, hard-code the path, and read the data into R.
# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")
I am currently trying to read in a .txt file.
I have researched here and found Error in reading in data set in R - however, it did not solve my problem.
The data are political contributions listed by the Federal Election Commission of the U.S. at ftp://ftp.fec.gov/FEC/2014/webk14.zip
Upon inspection of the .txt, I realized that the data is weirdly structured. Especially, the end of the any line is not separated at all from the first cell of the next line (not by a "|", not by a space).
Strangely enough, import via Excel and Access seems to work just fine. However, R import does not work.
To avoid the Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 90 did not have 27 elements error, I use the following command:
webk14 <- read.table(header = FALSE, fill = TRUE, colClasses = "character", sep = "|", file = "webk14.txt", stringsAsFactors = FALSE, dec = ".", col.names = c("cmte_id", "cmte_nm", "cmte_tp", "cmte_dsgn", "cmte_filing_freq", "ttl_receipts", "trans_from_aff", "indv_contrib", "other_pol_cmte_contrib", "cand_contrib", "cand_loans", "ttl_loans_received", "ttl_disb", "tranf_to_aff", "indv_refunds", "other_pol_cmte_refunds", "cand_loan_repay", "loan_repay", "coh_bop", "coh_cop", "debts_owed_by", "nonfed_trans_received", "contrib_to_other_cmte", "ind_exp", "pty_coord_exp", "nonfed_share_exp","cvg_end_dt"))
This does not result in an error, however, the results a) have a different line count than with Excel import and b) fail to correctly separate columns (which is probably the reason for a))
I would like not to do a detour via Excel and directly import into R. Any ideas what I am doing wrong?
It might be related to the symbols inside the variable names so turn of interpretation of these using comment.char="", which gives you:
webk14 <- read.table(header = FALSE, fill = TRUE, colClasses = "character", comment.char="",sep = "|",file = "webk14.txt", stringsAsFactors = FALSE, dec = ".", col.names = c("cmte_id", "cmte_nm", "cmte_tp", "cmte_dsgn", "cmte_filing_freq", "ttl_receipts", "trans_from_aff", "indv_contrib", "other_pol_cmte_contrib", "cand_contrib", "cand_loans", "ttl_loans_received", "ttl_disb", "tranf_to_aff", "indv_refunds", "other_pol_cmte_refunds", "cand_loan_repay", "loan_repay", "coh_bop", "coh_cop", "debts_owed_by", "nonfed_trans_received", "contrib_to_other_cmte", "ind_exp", "pty_coord_exp", "nonfed_share_exp","cvg_end_dt"))
I have a 3 million row, 500 column dataset. Although the columns are numeric, when importing from csv file, all are treated as factor, not numeric. I am trying to convert them back to numeric with the command
wikifixedn<-as.numeric(as.character(wikifixed))
wikifixed is the dataframe.
It's taking forever... My MacBook Pro, with 16GB ram and 2.3GHz Core i7 has been churning at this for more than an hour. Can I see somewhere how far along am I in the process or if the process is moving along? Is here another, faster method to deal with the conversation problem?
BTW: I tried, when importing the csv file, to force the columns to be treated as numeric using
> wikifixed<-read.csv('~/OneDrive/kredible/finaldata/wutao/wikipediausers.csv', header = TRUE, stringsAsFactors=F)
Yet, when checking I get
> is.numeric(wikifixed)
[1] FALSE
See here
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
you probably should create a vector for colClasses
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
stringsAsFactors
logical: should character vectors be converted to factors? Note that this is overridden by as.is and colClasses, both of which allow finer control.
colClasses
character. A vector of classes to be assumed for the columns. Recycled as necessary, or if the character vector is named, unspecified values are taken to be NA.
Possible values are NA (the default, when type.convert is used), "NULL" (when the column is skipped), one of the atomic vector classes (logical, integer, numeric, complex, character, raw), or "factor", "Date" or "POSIXct". Otherwise there needs to be an as method (from package methods) for conversion from "character" to the specified formal class.
Note that colClasses is specified per column (not per variable) and so includes the column of row names (if any).
ALSO see here in case you want to go to data.table because you may run into more issues.
fread in R imports a large .csv file as a data frame with one row
require(data.table)
fread("pre2012_alldatapoints.csv", sep = ",", header= TRUE)
and read
the data.table FAQ at
https://github.com/Rdatatable/data.table/wiki
I'm trying to read a .csv file into R where all the column are numeric. However, they get converted to factor everytime I import them.
Here's a sample of how my CSV looks like:
This is my code:
options(StringsAsFactors=F)
data<-read.csv("in.csv", dec = ",", sep = ";")
As you can see, I set dec to , and sep to ;. Still, all the vectors that should be numerics are factors!
Can someone give me some advice? Thanks!
Your NA strings in the csv file, N/A, are interpreted as character and then the whole column is converted to character. If you have stringsAsFactors = TRUE in options or in read.csv (default), the column is further converted to factor. You can use the argument na.strings to tell read.csv which strings should be interpreted as NA.
A small example:
df <- read.csv(text = "x;y
N/A;2,2
3,3;4,4", dec = ",", sep = ";")
str(df)
df <- read.csv(text = "x;y
N/A;2,2
3,3;4,4", dec = ",", sep = ";", na.strings = "N/A")
str(df)
Update following comment
Although not apparent from the sample data provided, there is also a problem with instances of '$' concatenated to the numbers, e.g. '$3,3'. Such values will be interpreted as character, and then the dec = "," doesn't help us. We need to replace both the '$' and the ',' before the variable is converted to numeric.
df <- read.csv(text = "x;y;z
N/A;1,1;2,2$
$3,3;5,5;4,4", dec = ",", sep = ";", na.strings = "N/A")
df
str(df)
df[] <- lapply(df, function(x){
x2 <- gsub(pattern = "$", replacement = "", x = x, fixed = TRUE)
x3 <- gsub(pattern = ",", replacement = ".", x = x2, fixed = TRUE)
as.numeric(x3)
}
)
df
str(df)
You could have gotten your original code to work actually - there's a tiny typo ('stringsAsFactors', not 'StringsAsFactors'). The options command wont complain with the wrong text, but it just wont work. When done correctly, it'll read it as char, instead of factors. You can then convert columns to whatever format you want.
I just had this same issue, and tried all the fixes on this and other duplicate posts. None really worked all that well. The way I went about fixing it was actually on the excel side. If you highlight all the columns in your source file (in excel), right click==> format cells then select 'number' it'll import perfectly fine (so long as you have no non-numeric characters below the header)