Reading data into R

I am trying to read data from the msigdb database into my R environment, but I am having trouble reading it in the format I want. Right now the data is read in as type "integer". I want it read in as type "character" (or any other type), so that when I transfer data between data frames/matrices I get the letters that make up the name of each item rather than an integer value.
df<-read.table("msigdb.v5.2.symbols.txt", fill = TRUE)
This is what I currently have, but like I said when I do typeof(df[1,1]) I get "integer".
To summarize:
After reading in data with columns that should be character, the current behavior is: typeof(df[1,1]) produces "integer". The desired behavior is: typeof(df[1,1]) produces "character".
Reproducible example:
library(dplyr)
write.table(band_instruments, "test.txt")
df <- read.table("test.txt", header = TRUE)
typeof(df[1,1])
# [1] "integer"
Thank you!

df<-read.table("msigdb.v5.2.symbols.txt", fill = TRUE, stringsAsFactors = FALSE)
By default, read.table reads all columns as character unless specified otherwise in colClasses*, and, in R versions before 4.0.0, read.table and data.frame then convert character columns to factors. A factor stores each value as an internal integer code, which is why typeof on a single cell reports "integer".
Setting stringsAsFactors = FALSE in the call to read.table resolves this.
* Despite the comment below, this is true: read.table reads all columns as character first, then converts them. This is in the documentation, and you can see it from the source code. You can confirm this with the following code:
write.table(mtcars, "mtcars.txt")
read.table("mtcars.txt", header = TRUE, quote = ".")
# Fails because it reads the decimals in the numeric data as quotes
# From the documentation: Quoting is only considered for columns read
# as character, which is all of them unless colClasses is specified
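A minimal sketch of the factor-versus-character difference, using a temporary file and a made-up two-column table (the data and column names are illustrative, not taken from msigdb). stringsAsFactors is passed explicitly in both calls because its default changed to FALSE in R 4.0.0.

```r
# Hypothetical toy data standing in for the real file.
tmp <- tempfile(fileext = ".txt")
write.table(
  data.frame(name  = c("Mick", "John", "Paul"),
             plays = c("vocals", "guitar", "bass")),
  tmp
)

# Strings converted to factors: each cell is stored as an integer code.
as_factor <- read.table(tmp, header = TRUE, stringsAsFactors = TRUE)
typeof(as_factor[1, 1])

# Strings kept as character: each cell is the text itself.
as_char <- read.table(tmp, header = TRUE, stringsAsFactors = FALSE)
typeof(as_char[1, 1])

unlink(tmp)
```

Copying as_char into a matrix or another data frame then carries the names themselves rather than their level codes.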

Related

read a csv file with quotation marks and regex R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines. The first will serve as the column names. All fields are separated by commas, and every value except the first is in quotation marks, which I think is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and unselect the ones I don't want later, it's no worries.
The data comes in a .csv file that was given to me.
If I open this file in excel the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (btw I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the dataframe has the columns' names nicely separated but the values are all in the first one
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
This is the best output for now, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect.
Both read_delim and fread give warning messages; I can include them if they might be useful.
update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class are distinct.
I tried with the input you provided with read.csv and had no problems; when subsetting each column is accessible. As for your other options, you're getting the quote option wrong, it needs to be "\""; the double quote character needs to be escaped i.e.: df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 line and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
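For anyone who wants to try this without the original file, here is a small sketch. The two-line file is a shortened, made-up stand-in with the same shape as the question's data (unquoted first field, quoted remaining fields, a comma inside the last one). It is written to a temporary file, read with the double quote as the quote character, and then reduced to the two columns of interest.

```r
# Hypothetical sample in the same shape as the question's data.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  'ne,class,regex,match,event,msg',
  'R1,"ethernet","eth(\\d+)","4/2","updown","line protocol on eth4/2, changed state to down"'
), tmp)

# The double quote is the quoting character, so the comma inside msg
# does not start a new column.
df <- read.csv(tmp, quote = "\"", stringsAsFactors = FALSE)
dim(df)

# Keep only the columns of interest.
wanted <- df[, c("class", "msg")]
wanted

unlink(tmp)
```

If the real file still collapses into one column with this call, the quoting in the file probably differs from the two lines shown in the question, and inspecting a few raw lines with readLines would be the next step.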

How to convert a factor type into a numeric type in R after reading a csv file?

After reading a csv file
data<-read.table(paste0('C:/Users/data/','30092017ARB.csv'),header=TRUE, sep=";")
Almost all of the numeric variables come in with type factor, especially the last column.
I tried all the suggestions here. However, I get the same warning for every one of them:
Warning message:
NAs introduced by coercion
Someone even mentioned this in that post:
"Every answer in this post failed to generate results for me , NAs were getting generated."
any idea how can I solve this problem?
Addendum: in the following picture you can see one possible approach suggested there.
However, I always get the same NA.
The percent sign is clearly the problem. Replace the "%" by the empty string, "", and then convert to numeric.
data[[3]] <- sub("%", "", data[[3]])
data[[3]] <- as.numeric(data[[3]])
You can do this in one line of code,
data[[3]] <- as.numeric(sub("%", "", data[[3]]))
Also, two notes on reading the data in.
First, some files use the semicolon as a column separator. This is common in countries where the decimal separator is the comma. That is why R has two functions to read files in the CSV format.
These functions are both calls to read.table with some defaults changed.
read.csv - Sets arguments header = TRUE and sep = ",".
read.csv2 - Sets arguments header = TRUE, sep = ";" and dec = ",".
For a full explanation see read.table or at an R prompt run help("read.table").
Second, you can avoid factor problems if you use argument stringsAsFactors = FALSE from the start, when reading in the data.
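Putting both pieces of advice together, a minimal sketch: the semicolon-separated sample below is invented (column names id, count and share are assumptions), with a percent sign in the last column as in the question.

```r
# Hypothetical semicolon-separated file with a percent column.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;count;share", "a;10;45.5%", "b;20;7%"), tmp)

# Keep strings as character from the start, so no factor is created.
data <- read.table(tmp, header = TRUE, sep = ";", stringsAsFactors = FALSE)
class(data$share)  # character, because of the "%"

# Strip the literal "%" and convert what is left.
data$share <- as.numeric(sub("%", "", data$share, fixed = TRUE))
data$share

unlink(tmp)
```

Here fixed = TRUE only makes explicit that "%" is matched literally; the pattern has no special meaning as a regular expression either way.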

write.table unintendedly adds subscript x to header

I have got a comma delimited csv document with predefined headers and a few rows. I just want to exchange the comma delimiter to a pipe delimiter. So my naive approach is:
myData <- read.csv(file="C:/test.CSV", header=TRUE, sep=",", check.names = FALSE)
Viewing myData gives me results without X prefixes in the column headers. If I set check.names = TRUE, the column headers get an X prefix.
Now I am trying to write a new csv with pipe-delimiter.
write.table(myData, file = "C:/test_pipe.CSV", row.names=FALSE, na="", col.names=TRUE, sep="|")
In the next step I am going to test my results:
mydata.test <- read.csv(file="C:/test_pipe.CSV", header=TRUE, sep="|")
Import seems fine, but unfortunately the X subscript in column headers appear again. Now my question is:
Is there something wrong with the original file or is there an error in my naive approach?
The original csv test.csv was created with Excel, of course without X subscripts in column headers.
Thanks in advance
You have to keep using check.names = FALSE, also the second time.
Otherwise your header will be modified, because it apparently contains names that are not considered valid column names for a data.frame. For example, special characters are replaced by dots (.), and names that start with a number are prefixed with X.
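A small round trip illustrates this. The data frame and its two column names ("2017" and "my col") are made up for the example; the write.table and read.csv calls mirror the ones in the question.

```r
# Hypothetical data with column names that are not syntactically valid in R.
original <- data.frame(`2017` = 1:2, `my col` = c("a", "b"),
                       check.names = FALSE)

tmp <- tempfile(fileext = ".csv")
write.table(original, file = tmp, row.names = FALSE, na = "",
            col.names = TRUE, sep = "|")

# Default check.names = TRUE: names are passed through make.names().
checked <- read.csv(tmp, header = TRUE, sep = "|")
names(checked)

# check.names = FALSE on the second read keeps the header as written.
unchecked <- read.csv(tmp, header = TRUE, sep = "|", check.names = FALSE)
names(unchecked)

unlink(tmp)
```

So the file on disk is fine; only the names of the data frame created on re-import change.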

R: read.csv importing the letter i as NA

Pretty simple question (I think). I'm trying to import a .csv file into R, from an experiment in which people respond by pushing either the "e" or the "i" key. In testing it, I responded only with the "i" key, so the response variable in the data set is basically a list of "i"s (without the quotation marks). When I try to import the data into R:
noload=read.csv("~/Desktop/eprime check no load.csv", na.strings = "")
the response variable comes out all NAs. When I try it with all "e"s, or a mixture of "e" and "i", it works fine.
What is it about the letter i that makes R treat it as NA (n.b. it does this even without the na.strings = "" part)?
Thanks in advance for any help.
When you ask R to read in a table without specifying data types for the columns, it will try to "guess" the data types. In this case, it guesses "complex" for the data type. For example, if you had datafile.csv with contents
Var
i
i
i
and you do:
df = read.csv("datafile.csv", header = TRUE, na.strings = "")
class(df$Var)
you'll get
[1] "complex"
R interprets the i as a purely imaginary value. To fix this, simply specify the data types with colClasses, like so:
df = read.csv("datafile.csv", header = TRUE, na.strings = "", colClasses = "factor")
or replace factor with whatever you want. It's good practice usually to specify data types up front like this so you don't run into confusing errors later.
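As a sketch of the same fix with a different target type, the snippet below recreates the three-line example file in a temporary location and reads the column as character. The choice of "character" is only an illustration; any class that suits the analysis works.

```r
# Recreate the example file from above in a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Var", "i", "i", "i"), tmp)

# Declaring the class up front skips R's type guessing entirely.
df <- read.csv(tmp, header = TRUE, na.strings = "", colClasses = "character")
class(df$Var)
df$Var

unlink(tmp)
```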

Imported a csv-dataset to R but the values becomes factors

I am very new to R and I am having trouble accessing a dataset I've imported. I'm using RStudio and used the Import Dataset function when importing my csv-file and pasted the line from the console-window to the source-window. The code looks as follows:
setwd("c:/kalle/R")
stuckey <- read.csv("C:/kalle/R/stuckey.csv")
point <- stuckey$PTS
time <- stuckey$MP
However, the data isn't integer or numeric as I am used to but factors so when I try to plot the variables I only get histograms, not the usual plot. When checking the data it seems to be in order, just that I'm unable to use it since it's in factor form.
Both the data import function (here: read.csv()) and a global option let you set stringsAsFactors=FALSE, which should fix this.
By default, read.csv examines the values in each column to decide whether to treat the variable as numeric. If it finds non-numeric values, it assumes the variable is character data, and character variables are converted to factors.
It looks like the PTS and MP variables in your dataset contain non-numerics, which is why you're getting unexpected results. You can force these variables to numeric with
point <- as.numeric(as.character(point))
time <- as.numeric(as.character(time))
But any values that can't be converted will become missing. (The R FAQ gives a slightly different method for factor -> numeric conversion but I can never remember what it is.)
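To see why the as.character step matters, here is a small sketch with a made-up factor. The last line uses the form I believe the R FAQ (section 7.10) recommends, as.numeric(levels(f))[f], which converts the levels once and then indexes them by the factor's codes.

```r
# Hypothetical factor as it might come out of a file with a stray text value.
f <- factor(c("10", "20", "10", "n/a"))

# Converting the factor directly returns the internal level codes.
as.integer(f)

# Going through character returns the values; "n/a" becomes NA.
values <- suppressWarnings(as.numeric(as.character(f)))
values

# Alternative form: convert the levels, then index by the factor.
faq <- suppressWarnings(as.numeric(levels(f))[f])
faq
```

suppressWarnings is used here only to keep the expected "NAs introduced by coercion" warning out of the output; in real data that warning is a useful hint that some values are not numeric.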
You can set this globally for all read.csv/read.* commands with
options(stringsAsFactors=FALSE)
Then read the file as follows:
my.tab <- read.table( "filename.csv", as.is=TRUE )
When importing csv data files, the import command should reflect both the separator between columns (;) and the decimal separator of your numeric values (for a numeric value such as 2,5 this would be ",").
The command for importing a csv therefore has to be a bit more comprehensive, with more arguments:
stuckey <- read.csv2("C:/kalle/R/stuckey.csv", header=TRUE, sep=";", dec=",")
This should import all variables as either integers or numeric.
None of these answers mention the colClasses argument which is another way to specify the variable classes in read.csv.
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "numeric") # all variables to numeric
or you can specify which columns to convert:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = c("PTS" = "numeric", "MP" = "numeric")) # specific columns to numeric
Note that if a variable can't be converted to numeric then it will be converted to factor by default, which makes it more difficult to convert to a number. Therefore, it can be advisable to read all variables in as character (colClasses = "character") and then convert the specific columns to numeric once the csv is read in:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "character")
point <- as.numeric(stuckey$PTS)
time <- as.numeric(stuckey$MP)
I'm new to R as well and faced the exact same problem. But then I looked at my data and noticed that the cause was that my csv file used a comma as a thousands separator (,) in all numeric columns (e.g. 1,233,444.56 instead of 1233444.56).
I removed the comma separator in my csv file and then reloaded into R. My data frame now recognises all columns as numbers.
I'm sure there's a way to handle this within the read.csv function itself.
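One way to handle this without editing the file, sketched under the assumption that the affected values are quoted in the csv (otherwise the commas would be read as column separators): read the column as character, strip the commas, then convert. The sample file and its column names are made up for the example.

```r
# Hypothetical file with thousands separators inside quoted values.
tmp <- tempfile(fileext = ".csv")
writeLines(c('item,amount', 'a,"1,233,444.56"', 'b,"12,000"'), tmp)

# Read everything as text so nothing is turned into a factor.
raw <- read.csv(tmp, colClasses = "character")

# Remove the literal commas, then convert to numeric.
raw$amount <- as.numeric(gsub(",", "", raw$amount, fixed = TRUE))
raw$amount

unlink(tmp)
```

This does the cleanup after the read rather than inside read.csv itself, which keeps the original file untouched.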
This only worked right for me when including strip.white = TRUE in the read.csv command.
(I found the solution here.)
for me the solution was to include skip = 0
(number of rows to skip at the top of the file. Can be set >0)
mydata <- read.csv(file = "file.csv", header = TRUE, sep = ",", skip = 22)
