I am currently trying to read in a .txt file.
I have researched here and found Error in reading in data set in R - however, it did not solve my problem.
The data are political contributions listed by the Federal Election Commission of the U.S. at ftp://ftp.fec.gov/FEC/2014/webk14.zip
Upon inspection of the .txt file, I realized that the data is weirdly structured. In particular, the end of any line is not separated at all from the first cell of the next line (not by a "|", not by a space).
Strangely enough, import via Excel and Access seems to work just fine. However, R import does not work.
To avoid the error
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 90 did not have 27 elements
I use the following command:
webk14 <- read.table(header = FALSE, fill = TRUE, colClasses = "character",
                     sep = "|", file = "webk14.txt", stringsAsFactors = FALSE,
                     dec = ".",
                     col.names = c("cmte_id", "cmte_nm", "cmte_tp", "cmte_dsgn",
                                   "cmte_filing_freq", "ttl_receipts", "trans_from_aff",
                                   "indv_contrib", "other_pol_cmte_contrib",
                                   "cand_contrib", "cand_loans", "ttl_loans_received",
                                   "ttl_disb", "tranf_to_aff", "indv_refunds",
                                   "other_pol_cmte_refunds", "cand_loan_repay",
                                   "loan_repay", "coh_bop", "coh_cop", "debts_owed_by",
                                   "nonfed_trans_received", "contrib_to_other_cmte",
                                   "ind_exp", "pty_coord_exp", "nonfed_share_exp",
                                   "cvg_end_dt"))
This does not result in an error; however, the result a) has a different line count than the Excel import and b) fails to separate the columns correctly (which is probably the cause of a)).
I would like to avoid a detour via Excel and import directly into R. Any ideas what I am doing wrong?
It might be related to comment symbols (#) appearing inside the values, so turn off their interpretation using comment.char = "", which gives you:
webk14 <- read.table(header = FALSE, fill = TRUE, colClasses = "character",
                     comment.char = "", sep = "|", file = "webk14.txt",
                     stringsAsFactors = FALSE, dec = ".",
                     col.names = c("cmte_id", "cmte_nm", "cmte_tp", "cmte_dsgn",
                                   "cmte_filing_freq", "ttl_receipts", "trans_from_aff",
                                   "indv_contrib", "other_pol_cmte_contrib",
                                   "cand_contrib", "cand_loans", "ttl_loans_received",
                                   "ttl_disb", "tranf_to_aff", "indv_refunds",
                                   "other_pol_cmte_refunds", "cand_loan_repay",
                                   "loan_repay", "coh_bop", "coh_cop", "debts_owed_by",
                                   "nonfed_trans_received", "contrib_to_other_cmte",
                                   "ind_exp", "pty_coord_exp", "nonfed_share_exp",
                                   "cvg_end_dt"))
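To check whether the comment character was indeed the culprit, you can count the fields per line before importing; a quick diagnostic sketch, assuming webk14.txt is in the working directory:
fields <- count.fields("webk14.txt", sep = "|", quote = "", comment.char = "")
# With comment handling disabled, every line should report 27 fields.
table(fields)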
Related
I have data in a csv file. When I read it in, the columns come in as factors, and I cannot do any computation with them.
I used
as.numeric(df$variablename)
but it returns a completely different set of values for the variable.
Original data in the variable: 2961, 488, 632
as.numeric output: 1, 8, 16
When reading data using read.table you can specify:
- how your data is separated: sep = ","
- what the decimal point is: dec = "."
- what NA values look like: na.strings = ...
- that you do not want to convert strings to factors: stringsAsFactors = FALSE
In your case you could use something like:
read.table("mycsv.csv", header = TRUE, sep = ",", dec = ".", stringsAsFactors = F,
na.strings = c("", "-"))
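A side note on the 1, 8, 16 output: as.numeric applied to a factor returns the factor's internal level codes, not the original values. A minimal sketch of the usual fix, assuming df$variablename was already read in as a factor of number-like strings:
# Convert via character first; as.numeric() on the factor itself
# would return the level codes rather than the original numbers.
as.numeric(as.character(df$variablename))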
In addition to the answer by Cettt, there's also colClasses.
If you know in advance what data types the columns in your csv file have, you can specify them. This stops R from "guessing" the data types, and lets you know when something isn't right, rather than silently deciding a column must be a string. For example, if your 4-column csv file has columns that are text, factor, integer, and numeric, you can use
read.table("mycsv.csv", header = T, sep = ",", dec = ".",
colClasses=c("character", "factor", "integer", "numeric"))
Edited to add:
As pointed out by gersht, the issue is likely some non-number in the numbers column. Often, this is how the value NA was coded. Specifying colClasses causes R to give an error message when it encounters any such "not numeric or NA" value, so you can easily spot the issue. If it's a non-default coding of NA, use the argument na.strings = c("NA", "YOUR NA VALUE"). If it's another issue, you'll likely have to fix the file before importing. For example:
read.table(sep=",",
colClasses=c("character", "numeric"),
text="
cat,11
canary,12
dog,1O") # NB not a 10; it's a 1 and a capital-oh.
gives
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'a real', got '1O'
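If the odd token is really a stand-in for a missing value, declaring it via na.strings makes the same read succeed; a minimal sketch that treats "1O" as an assumed NA marker:
read.table(sep=",",
           colClasses=c("character", "numeric"),
           na.strings="1O",  # assumed NA marker; matched before type conversion
           text="
cat,11
canary,12
dog,1O")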
I have a .csv file that contains 285000 observations. When I tried to import the dataset, I got the warning below and only 166000 observations were read.
Joint <- read.csv("joint.csv", header = TRUE, sep = ",")
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
When I added the quote argument, as follows:
Joint2 <- read.csv("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
And when I used read.table like this, it showed 483000 observations:
Joint <- read.table("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
What should I do to read the file properly?
I think the problem has to do with file encoding. There are a lot of special characters in the header.
If you know how your file is encoded, you can specify it with the fileEncoding argument to read.csv.
Otherwise you could try to use fread from data.table. It is able to read the file despite the encoding issues. It will also be significantly faster for reading such a large data file.
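A minimal sketch of the fread route, assuming the data.table package is installed and joint.csv is the file from the question:
library(data.table)
# fread auto-detects the separator and handles quoting more robustly,
# and is considerably faster than read.csv on large files.
Joint <- fread("joint.csv")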
I have a large data set (~20000 x 1). Not all the fields are filled; in other words, the data does have missing values. Each feature is a string.
I have run the following code:
Input:
data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)
Output of the second call:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 80 elements
Input:
datar <- read.csv("data.csv", header = TRUE, na.strings = NA)
Output:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
I run into essentially four problems, as I see it. Two of them are the error messages shown above. The third: when no error message appears, the global environment window shows that not all my rows are accounted for (roughly 14000 samples are missing), although the feature count is right. The fourth: again, not all the samples are accounted for, and the feature count is not correct either.
How can I solve this?
Try the argument comment.char = "" as well as quote = "". The hash (#) is read by R as a comment character by default and will cut the line short.
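A minimal sketch applying both to the read.table call from the question (data.csv as above):
# Disable comment handling and quote handling so that stray "#"
# characters and unmatched quotes cannot cut lines short.
datan <- read.table("data.csv", header = TRUE, sep = ",", fill = TRUE,
                    quote = "", comment.char = "")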
Can you open the CSV using Notepad++? This will allow you to see 'invisible' characters and any other non-printable characters. That file may not contain what you think it contains! When you get the sourcing issue resolved, you can choose the CSV file with a selector tool.
filename <- file.choose()
data <- read.csv(filename, skip=1)
name <- basename(filename)
Or, hard-code the path, and read the data into R.
# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")
I have a csv file which has only one column, with empty cells in the first couple of rows. When I tried to read it into R, intending to read those blank lines as missing values, I ran into the following error. Any help is appreciated!
Test = read.csv("test.csv", header = FALSE, blank.lines.skip = FALSE, nrows = 10)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
empty beginning of file
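read.table cannot infer any columns when the file begins with blank lines, hence the error. One possible workaround (a sketch, assuming the file really is a single column; test.csv as in the question) is to read the raw lines and convert blanks to NA yourself:
lines <- readLines("test.csv", n = 10)
# Keep the blank lines as rows, but turn them into missing values.
Test <- data.frame(V1 = ifelse(lines == "", NA, lines))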
I am trying to read a .tsv (tab-separated value) file into R using a specific encoding. It's supposedly windows-1252. And it has a header.
Any suggestions for the code to put it into a data frame?
Something like this perhaps?
mydf <- read.table('thefile.txt', header=TRUE, sep="\t", fileEncoding="windows-1252")
str(mydf)
You can also use:
read.delim('thefile.txt', header= T, fileEncoding= "windows-1252")
Simply entering the command into your R console:
> read.delim
function (file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
reveals that read.delim is a packaged read.table command that already specifies tabs as your data's separator. read.delim might be more convenient if you're working with a lot of tsv files.
The difference between the two commands is discussed in more detail in this Stack question.
df <- read.delim("~/file_directory/file_name.tsv", header = TRUE) will work fine for a single .tsv file, because read.delim already uses tab separation, so there is no need for sep = "\t". fileEncoding = "windows-1252" could be added but is not necessary.