Pretty simple question (I think). I'm trying to import a .csv file into R, from an experiment in which people respond by pressing either the "e" or the "i" key. In testing it, I responded only with the "i" key, so the response variable in the data set is basically a list of "i"s (without the quotation marks). When I try to import the data into R:
noload=read.csv("~/Desktop/eprime check no load.csv", na.strings = "")
the response variable comes out all NAs. When I try it with all "e"s, or a mixture of "e" and "i", it works fine.
What is it about the letter i that makes R treat it as NA (n.b. it does this even without the na.strings = "" part)?
Thanks in advance for any help.
When you ask R to read in a table without specifying data types for the columns, it will try to "guess" the data types. In this case, it guesses "complex" for the data type. For example, if you had datafile.csv with contents
Var
i
i
i
and you do:
df = read.csv("datafile.csv", header = TRUE, na.strings = "")
class(df$Var)
you'll get
[1] "complex"
R interprets the i as a purely imaginary value. To fix this, simply specify the data types with colClasses, like so:
df = read.csv("datafile.csv", header = TRUE, na.strings = "", colClasses = "factor")
or replace "factor" with whatever type you want. It's usually good practice to specify data types up front like this so you don't run into confusing errors later.
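A minimal sketch of that fix, assuming a throwaway file created with tempfile() in place of the asker's real data. Reading the column with an explicit type keeps the letters as text, whatever type R would otherwise guess:

```r
# Write a small stand-in for datafile.csv (hypothetical data).
path <- tempfile(fileext = ".csv")
writeLines(c("Var", "i", "i", "i"), path)

# Declare the column type up front instead of letting R guess it.
df <- read.csv(path, header = TRUE, na.strings = "", colClasses = "character")

class(df$Var)   # "character"
anyNA(df$Var)   # FALSE
```

"character" is used here rather than "factor" only because it is the easiest type to inspect; either avoids the guessing step.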
I am trying to import multiple CSV files in a for loop. By iteratively fixing the errors the code produced, I arrived at the code below.
for (E in EDCODES) {
  Filename <- paste("$. Data/2. Liabilities/", E, sep = "")
  Framename <- gsub("\\..*", "", E)
  assign(Framename,
         read.csv(Filename,
                  header = TRUE,
                  sep = ",",
                  stringsAsFactors = FALSE,
                  na.strings = c("\"ND", "ND,5", "5\""),
                  colClasses = c("BAA35" = "double"),
                  encoding = "UTF-8",
                  quote = ""))
}
First I realized that the code does not always recognize the most important column "BAA35" as numeric, so I added the colClasses argument. Then I realized that the data has multiple versions of "NA", so I added the na.strings argument. The most common NA value is "ND, 5", which contains the separator ",". So if I add the na.strings argument as defined above I get a lot of EOF within quoted string warnings. The others are also versions of "ND, [NUMBER]" or "ND, 4, [YYYY-MM]".
If I then try to treat that issue with the most common recommendation I could find, adding quote = "", I just end up with a "more columns than column names" error.
The data has 78 columns, so I don't believe posting it here will display in a usable way.
Can somebody recommend a solution for how I can reliably import this column as a numeric value and have R recognize the NAs in the data correctly?
I think the issue might be that the na.strings contain commas and in some cases the ND,5 is read as one column with ND and one with a 5 and in other cases it's seen as the na.string. Any way to tell R to not split "ND,5" into two columns?
I'm still very new to R, I have no other coding experience, and I don't understand some of the fundamentals, so please bear with me.
I'm trying to do a multiple regression on the data set found at:
https://studysites.sagepub.com/dsur/study/DSUR%20Data%20Files/Chapter%207/ChildAggression.dat
The website's answers don't mention any transformation of the data, but suggest one could just go ahead with the lm() function.
aggro <- read.delim("ChildAggression.dat", header = TRUE)
aggro.reg1 <- lm(Aggression ~ Parenting_Style + Sibling_Aggression, data = aggro)
Error in eval(predvars, data, env) : object 'Aggression' not found
I don't understand why it isn't finding the object.
Any help is much appreciated.
The default separator for read.delim is \t, but the file isn't tab separated. You want sep = "" instead.
Having read in the file as you did:
aggro <- read.delim("ChildAggression.dat", header = TRUE)
there are numerous ways to detect that something is wrong:
> dim(aggro) #number of columns is clearly wrong
[1] 666 1
> names(aggro) #only one long concatenated column name
[1] "Aggression.Television.Computer_Games.Sibling_Aggression.Diet.Parenting_Style"
> colnames(aggro) #only one long concatenated column name
[1] "Aggression.Television.Computer_Games.Sibling_Aggression.Diet.Parenting_Style"
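The difference can be reproduced on a small stand-in file. This is only a sketch: the three column names are borrowed from the real header, but the numbers are made up, and the file is assumed to be space-separated like the original.

```r
# Stand-in for ChildAggression.dat: fields separated by spaces, not tabs.
path <- tempfile(fileext = ".dat")
writeLines(c("Aggression Television Parenting_Style",
             "0.37 0.17 -0.28",
             "0.77 -0.03 0.36"), path)

bad  <- read.delim(path, header = TRUE)            # default sep = "\t"
good <- read.delim(path, header = TRUE, sep = "")  # sep = "" means any whitespace

dim(bad)     # a single column
dim(good)    # three columns
names(good)  # the individual variable names
```

With the columns split correctly, a formula such as Aggression ~ Parenting_Style can find its variables in the data frame.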
After reading a csv file
data<-read.table(paste0('C:/Users/data/','30092017ARB.csv'),header=TRUE, sep=";")
I get factor as the type for nearly all the numeric variables, especially the last column.
I tried all the suggestions here. However, I get a warning for every one of them:
Warning message:
NAs introduced by coercion
Someone even mentioned in this post:
"Every answer in this post failed to generate results for me , NAs were getting generated."
Any idea how I can solve this problem?
Addendum: in the following pic you can see one possible approach suggested in here
However, I always get the same NA.
The percent sign is clearly the problem. Replace the "%" by the empty string, "", and then convert to numeric.
data[[3]] <- sub("%", "", data[[3]])
data[[3]] <- as.numeric(data[[3]])
You can do this in one line of code,
data[[3]] <- as.numeric(sub("%", "", data[[3]]))
Also, two notes on reading the data in.
First, some files use the semicolon as a column separator. This is common in countries where the decimal separator is the comma. That is why R has two functions to read files in the CSV format.
These functions are both calls to read.table with some defaults changed.
read.csv - Sets arguments header = TRUE and sep = ",".
read.csv2 - Sets arguments header = TRUE, sep = ";" and dec = ",".
For a full explanation see read.table or at an R prompt run help("read.table").
Second, you can avoid factor problems if you use argument stringsAsFactors = FALSE from the start, when reading in the data.
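Putting the pieces of this answer together, here is a small sketch on made-up data (the column names and values are assumptions, not the asker's file): a semicolon-separated file with decimal commas and a percent column.

```r
# Hypothetical file: ";" separates columns, "," is the decimal mark.
path <- tempfile(fileext = ".csv")
writeLines(c("name;price;share",
             "a;2,5;10%",
             "b;3,75;25%"), path)

# read.csv2 already implies sep = ";" and dec = ",".
dat <- read.csv2(path, stringsAsFactors = FALSE)

# The percent column arrives as text; strip the sign, then convert.
dat$share <- as.numeric(sub("%", "", dat$share))

str(dat)
```

Because stringsAsFactors = FALSE is set when reading, share is plain character before the conversion, so no detour through factor levels is needed.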
I am trying to read data from the msigdb database into my R environment, but I am having trouble reading it into the format that I would like. Right now, when I read the data in, it is read as the type "integer". I want it read in as the type "character" (or any other type), so that when I transfer data between data frames/matrices I don't get the integer value for an item instead of the letters that make up its name.
df<-read.table("msigdb.v5.2.symbols.txt", fill = TRUE)
This is what I currently have, but like I said when I do typeof(df[1,1]) I get "integer".
To summarize:
After reading in data with columns that should be character, the current behavior is: typeof(df[1,1]) produces "integer". The desired behavior is: typeof(df[1,1]) produces "character".
Reproducible example:
library(dplyr)
write.table(band_instruments, "test.txt")
df <- read.table("test.txt", header = TRUE)
typeof(df[1,1])
# [1] "integer"
Thank you!
df<-read.table("msigdb.v5.2.symbols.txt", fill = TRUE, stringsAsFactors = FALSE)
By default, read.table reads all columns as character unless specified otherwise in colClasses*, and then converts them. In R versions before 4.0.0, read.table and data.frame also convert character columns to factors by default. When you extract a single cell of a factor, typeof reports the type of its internal integer code.
Setting stringsAsFactors = FALSE in the call to read.table resolves this.
*despite the comment below, this is true. read.table reads all columns as character first, then converts them. This is in the documentation, and you can see it from the source code. You can confirm this with the following code:
write.table(mtcars, "mtcars.txt")
read.table("mtcars.txt", header = TRUE, quote = ".")
# Fails because it reads the decimals in the numeric data as quotes
# From the documentation: Quoting is only considered for columns read
# as character, which is all of them unless colClasses is specified
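The effect of stringsAsFactors can be seen side by side in a short sketch. The data frame below is made up for illustration, and stringsAsFactors is passed explicitly in both calls so the result does not depend on the default of the R version in use.

```r
# Hypothetical two-column text table written the same way as in the question.
path <- tempfile(fileext = ".txt")
write.table(data.frame(name  = c("alpha", "beta", "gamma"),
                       plays = c("bass", "guitar", "drums")),
            path)

as_factor <- read.table(path, header = TRUE, stringsAsFactors = TRUE)
as_text   <- read.table(path, header = TRUE, stringsAsFactors = FALSE)

typeof(as_factor[1, 1])  # "integer": the factor's internal code
typeof(as_text[1, 1])    # "character"
```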
I am very new to R and I am having trouble accessing a dataset I've imported. I'm using RStudio and used the Import Dataset function when importing my csv-file and pasted the line from the console-window to the source-window. The code looks as follows:
setwd("c:/kalle/R")
stuckey <- read.csv("C:/kalle/R/stuckey.csv")
point <- stuckey$PTS
time <- stuckey$MP
However, the data isn't integer or numeric as I am used to but factors so when I try to plot the variables I only get histograms, not the usual plot. When checking the data it seems to be in order, just that I'm unable to use it since it's in factor form.
Both the data import function (here: read.csv()) and a global option let you set stringsAsFactors=FALSE, which should fix this.
By default, read.csv tries to convert each variable to a logical, integer, numeric or complex type. If a variable contains any value that can't be interpreted that way, it assumes the variable is character data, and character variables are converted to factors.
It looks like the PTS and MP variables in your dataset contain non-numerics, which is why you're getting unexpected results. You can force these variables to numeric with
point <- as.numeric(as.character(point))
time <- as.numeric(as.character(time))
But any values that can't be converted will become missing. (The R FAQ gives a slightly different method for factor -> numeric conversion but I can never remember what it is.)
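For reference, the idiom usually cited for this conversion (it appears on the ?factor help page, and is most likely the one meant here) indexes the converted levels by the factor itself. A small sketch on a made-up factor:

```r
f <- factor(c("10", "2", "33", "2"))

as.numeric(f)                # 1 2 3 2: the internal codes, not the values
as.numeric(as.character(f))  # 10 2 33 2
as.numeric(levels(f))[f]     # 10 2 33 2: converts each level only once
```

The last two lines give the same result; the levels-based form does less work when a factor has many rows but few distinct levels.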
You can set this globally for all read.csv/read.* commands with
options(stringsAsFactors = FALSE)
Then read the file as follows:
my.tab <- read.table("filename.csv", as.is = TRUE)
When importing csv data files, the import command should reflect both the separator between columns (;) and the decimal separator of your numeric values (for a value written as 2,5 this would be ",").
The command for importing the csv therefore has to be a bit more comprehensive, with more arguments:
stuckey <- read.csv2("C:/kalle/R/stuckey.csv", header=TRUE, sep=";", dec=",")
This should import all variables as either integers or numeric.
None of these answers mention the colClasses argument which is another way to specify the variable classes in read.csv.
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "numeric") # all variables to numeric
or you can specify which columns to convert:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = c("PTS" = "numeric", "MP" = "numeric")) # specific columns to numeric
Note that if a variable can't be read as numeric when colClasses is not given, it is converted to a factor by default, which makes it more awkward to convert to a number afterwards. It can therefore be advisable to read all variables in as character (colClasses = "character") and then convert the specific columns to numeric once the csv is read in:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "character")
point <- as.numeric(stuckey$PTS)
time <- as.numeric(stuckey$MP)
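Reading as character first also makes it easy to see which raw entries refuse to convert. A sketch on made-up data, where the "DNP" entry is purely hypothetical:

```r
# Hypothetical stand-in for stuckey.csv with one non-numeric entry.
path <- tempfile(fileext = ".csv")
writeLines(c("PTS,MP", "12,30", "DNP,0", "8,21"), path)

stuckey <- read.csv(path, colClasses = "character")

# Convert, silencing the "NAs introduced by coercion" warning on purpose.
point <- suppressWarnings(as.numeric(stuckey$PTS))

stuckey$PTS[is.na(point)]  # the raw entries that could not be converted
```

Inspecting those entries shows whether the NAs are genuine missing values or a formatting problem worth cleaning before the conversion.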
I'm new to R as well and faced the exact same problem. But then I looked at my data and noticed that it was caused by my csv file using a comma as the thousands separator in all numeric columns (e.g. 1,233,444.56 instead of 1233444.56).
I removed the comma separator in my csv file and then reloaded it into R. My data frame now recognises all columns as numbers.
I'm sure there's a way to handle this within the read.csv function itself.
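One possible way to do it without editing the file is to read the affected column as text and strip the commas in R. This is a sketch on made-up data and assumes such values are quoted in the file; unquoted, the commas would be treated as column separators.

```r
# Hypothetical file: amounts are quoted and use "," as thousands separator.
path <- tempfile(fileext = ".csv")
writeLines(c('id,amount',
             '1,"1,233,444.56"',
             '2,"987.10"'), path)

# Read the problem column as text, leave the others to be guessed.
dat <- read.csv(path, colClasses = c(amount = "character"))

# Drop the thousands separators, then convert.
dat$amount <- as.numeric(gsub(",", "", dat$amount, fixed = TRUE))
dat$amount
```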
This only worked right for me when including strip.white = TRUE in the read.csv command.
(I found the solution here.)
For me, the solution was to include skip = 0 (the number of rows to skip at the top of the file; it can be set > 0):
mydata <- read.csv(file = "file.csv", header = TRUE, sep = ",", skip = 22)
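A small sketch of what skip does, using a made-up file with two lines of preamble before the real header (the contents are assumptions for illustration):

```r
# Hypothetical export with two non-data lines at the top.
path <- tempfile(fileext = ".csv")
writeLines(c("Exported by some tool",
             "Run date: unknown",
             "x,y",
             "1,2",
             "3,4"), path)

# Skip the preamble so that the third line is used as the header.
mydata <- read.csv(file = path, header = TRUE, sep = ",", skip = 2)
names(mydata)
```

Without skip, the preamble line would be taken as the header and the numeric columns would no longer be read as numbers.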