Prevent variable name getting mangled by read.csv/read.table? - r

My data set testdata has 2 variables named PWGTP and AGEP
The data are in a .csv file.
When I do:
> head(testdata)
The variables show up as
ï..PWGTP AGEP
23 55
26 56
24 45
22 51
25 54
23 35
So, for some reason, R is reading PWGTP as ï..PWGTP. No biggie.
HOWEVER, when I use some function to refer to the variable ï..PWGTP, I get the message:
Error: id variables not found in data: ï..PWGTP
Similarly, when I use some function to refer to the variable PWGTP, I get the message:
Error: id variables not found in data: PWGTP
2 Questions:
Is there anything I should be doing to the source file to prevent mangling of the variable name PWGTP?
It should be trivial to rename ï..PWGTP to something else -- but R is unable to find a variable named as such. Your thoughts on how one should try to repair the variable name?

This is a BOM (Byte Order Mark) UTF-8 issue.
To prevent this from happening, 2 options:
Save your file as UTF-8 without BOM / signature -- or --
Use fileEncoding = "UTF-8-BOM" when using read.table or read.csv
Example:
mydata <- read.table(file = "myfile.txt", fileEncoding = "UTF-8-BOM")
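As a rough, self-contained check of the second option, the sketch below writes a small CSV that begins with the UTF-8 BOM bytes and reads it back. The file contents and the temporary path are made up for illustration; only fileEncoding = "UTF-8-BOM" comes from the answer above.

```r
# Write a small CSV that starts with the UTF-8 byte order mark (EF BB BF).
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("PWGTP,AGEP\n23,55\n26,56\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" tells R to drop the BOM before parsing the header.
# (How the BOM shows up without it depends on platform and locale.)
testdata <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(testdata)
```

The same argument can be passed to read.table, since read.csv forwards it.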

It is possible that the column name in the file is something like 1 PWGTP, i.e. with a space between the number (or some other character) and the name; R converts such characters to .. when reading the file. One way to prevent this would be to use check.names = FALSE in read.csv/read.table
d1 <- read.csv("yourfile.csv", header=TRUE, stringsAsFactors=FALSE, check.names=FALSE)
However, it is better not to have names that start with a number or contain spaces.
So, if the OP read the data with the default options, i.e. with check.names = TRUE, we can use sub to change the column names
names(d1) <- sub(".*\\.+", "", names(d1))
As an example
sub(".*\\.+", "", "ï..PWGTP")
#[1] "PWGTP"
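For the second question (R not finding the mangled name), one way to avoid typing the odd characters at all is to rename by position. This is a minimal sketch on a made-up two-column data frame standing in for testdata; it assumes the mangled column is the first one, as it is when the stray bytes precede the header.

```r
# Stand-in for the data read from the file; "\u00ef" is the "ï" character.
testdata <- data.frame(a = c(23, 26), AGEP = c(55, 56))
names(testdata)[1] <- "\u00ef..PWGTP"

# Rename the first column by index, so the mangled name never has to be matched.
names(testdata)[1] <- "PWGTP"
names(testdata)
```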

Related

read a csv file with quotation marks and regex R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first one serves as the column names. All fields are separated by commas and the values are in quotation marks except for the first one, which I think is what causes the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and unselect the ones I don't want later, it's no worries.
The data comes in a .csv file that was given to me.
If I open this file in excel the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (btw I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the dataframe has the columns' names nicely separated but the values are all in the first one
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it there, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect.
Both read.delim and fread give warning messages; I can include them if they might be useful.
update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class are distinct.
I tried read.csv with the input you provided and had no problems; after reading, each column is accessible by subsetting. As for your other attempts, the quote option is wrong: it needs to be "\"", i.e. the double quote character, escaped: df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 row and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
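To make this check reproducible without the original file, here is a sketch that feeds the two sample lines to read.csv through its text argument and keeps only class and msg. It assumes the real file is quoted the same way as the sample; lines with unbalanced quotes elsewhere in the 300k-line file would still need separate handling.

```r
# The two sample lines from the question; backslashes and double quotes are
# escaped only because they sit inside an R string literal.
sample_lines <- c(
  "ne,class,regex,match,event,msg",
  "BOU2-P-2,\"tengigabitethernet\",\"tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})\",\"4/2\",\"lineproto-5-updown\",\"%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down\""
)

# read.csv treats only the double quote as a quoting character by default,
# so the single quotes inside the regex and the comma inside msg are kept.
df <- read.csv(text = sample_lines, stringsAsFactors = FALSE)

# Keep only the two columns of interest.
wanted <- df[, c("class", "msg")]
wanted
```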

read txt files with left-aligned data but inconsistent number of spaces in R

I have a series of txt files formatted in the same way.
The first few rows are all about file information. There are no variable names. As you can see, the spaces between factors are inconsistent, but columns are left-aligned or right-aligned. I know SAS can read data in this format directly and wonder whether R provides a similar function.
I tried the read.csv function to load these data into a data.frame with 3 columns, but it turns out that the sep argument does not accept a regular expression such as "\s" (multiple spaces).
So I tried to read these data in a variable first and use substr function to split them as following.
Factor<-data.frame(substr(Share$V1,1,9),substr(Share$V1,9,14),as.numeric(substr(Share$V1,15,30)))
But this is clumsy and requires counting the spaces between columns.
I wonder if there is any method to load the data directly as three columns.
> Factor
F T S
1 +B2P A 1005757219
2 +BETA A 826083789
We can use read.table to read it as 3 columns
read.table(text=as.character(Share$V1), sep="", header=FALSE,
stringsAsFactors=FALSE, col.names = c("FactorName", "Type", "Share"))
# FactorName Type Share
#1 +B2P A 1005757219
#2 +BETA A 826083789
#3 +E2P A 499237181
#4 +EF2P A 38647147
#5 +EFCHG A 866171133
#6 +IL1QNS A 945726018
#7 +INDMOM A 862690708
Another option would be to read it directly from the file, skipping the header line and change the column names
read.table("yourfile.txt", header=FALSE, skip=1, stringsAsFactors=FALSE,
col.names = c("FactorName", "Type", "Share"))
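A self-contained version of the first approach, for trying it without the original files: the three lines below are retyped from the output above with uneven spacing, and sep = "" (the read.table default) treats any run of whitespace as one separator. The exact spacing in the real files is an assumption.

```r
# Sample rows with deliberately inconsistent numbers of spaces.
raw_lines <- c(
  "+B2P      A     1005757219",
  "+BETA     A      826083789",
  "+E2P      A      499237181"
)

# sep = "" means "one or more whitespace characters", so no counting is needed.
Factor <- read.table(text = raw_lines, sep = "", header = FALSE,
                     stringsAsFactors = FALSE,
                     col.names = c("FactorName", "Type", "Share"))
str(Factor)
```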

R convert exponent (read by R as string) into simple number

I read a CSV file into R with the following command:
myfile <- read.csv('C:/Users/myfilepath.csv', sep=',', header = F)
With this I get a nice data frame looking a little like this:
year / Variable1 / Variable2 / etc.
1958 / 1.42547014192473E-08 / 3.06399766669684E-10 / etc.
1959 / 2.05022315791225E-09 / 8.80152568089836E-08 / etc.
1960 / etc. .... ....
However, R seems to treat the letter E for exponents as string. So I need to convert these first into a simple number before I can analyze the data. The data set has 50 rows and 12 columns.
I tried as.numeric but get the error message
Error: (list) object cannot be coerced to type 'double'
Any ideas?
You can format the DF using:
format(myfile,scientific=FALSE)
You can use "options("scipen"=100)" before you read the file.
If you see trailing zeros, then I suggest you check the csv file before importing it.
The answers by Soto and Alistair work if the cells in the imported csv are formatted as 'scientific'. Otherwise they don't. Thanks guys!
Code used:
mydata<- read.csv('C:/Users/mydata.csv', sep=',', na.strings=c("", "NA"), header = F)
mydata <- sapply(mydata, as.numeric)
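The error in the question arises because as.numeric works on vectors, while a data frame is a list of columns. A minimal sketch of a column-by-column conversion that keeps the result a data frame (sapply returns a matrix instead); the values are copied from the question, and the column names are R's defaults for header = F:

```r
# Stand-in for the imported data, with the numbers read as character strings.
myfile <- data.frame(
  V1 = c("1958", "1959"),
  V2 = c("1.42547014192473E-08", "2.05022315791225E-09"),
  V3 = c("3.06399766669684E-10", "8.80152568089836E-08"),
  stringsAsFactors = FALSE
)

# Convert each column in turn; assigning into myfile[] keeps the data frame.
myfile[] <- lapply(myfile, as.numeric)
str(myfile)
```

If a column had been read as a factor rather than character, as.numeric(as.character(x)) would be needed instead, since as.numeric on a factor returns the level codes.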

Selecting a Column in R

I imported a dataset with no column headings, and I'm trying to label the columns for convenience. I've used R quite a bit before, so I'm confused as to why this code isn't working:
library(mosaic)
`0605WindData` <- read.csv("~/pathnamehere/0605WindData.txt", header=F)
Station = 0605WindData[,1]
Error: unexpected symbol in "Station = 0605WindData"
I swear I have experience with R (albeit I'm a bit out of practice), but I seem to be stuck on something pretty simple. I know I've used this select column command before. Suggestions?
You forgot to quote the object name when subsetting:
> `0605WindData` <- data.frame(A = 1:10, B = 1:10)
> `0605WindData`[,1]
[1] 1 2 3 4 5 6 7 8 9 10
As Roman points out, object names are not supposed to start with a digit. Your read.csv() line only worked because you back-tick quoted the object name. You have to continue to quote the object in every line of code now because you used a non-standard name for that object. Save yourself some trouble and change the name of the object you assign the output from read.csv() to.
`0605WindData` <- read.csv("~/pathnamehere/0605WindData.txt", header=F)
Station = 0605WindData[,1]
Instead of quoting the name with backticks, start the variable name with a letter, such as
winddata060 <- read.csv("~/pathnamehere/0605WindData.txt", header=F)
Now select the required column
Station = winddata060[,1]
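Since the goal was to label the columns, here is a small sketch of that step with a syntactic object name. The data and the label WindSpeed are hypothetical; only Station appears in the question.

```r
# Made-up stand-in for the imported file; read.csv with header = FALSE
# names the columns V1, V2, ...
winddata0605 <- data.frame(V1 = c("KSFO", "KLAX"), V2 = c(12.5, 8.0),
                           stringsAsFactors = FALSE)

# Label the columns once, then refer to them by name without backticks.
names(winddata0605) <- c("Station", "WindSpeed")
Station <- winddata0605$Station
Station
```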

How to read in numbers with a comma as decimal separator?

I have a series of CSV files where numbers are formatted in the european style using commas instead of decimal points, i.e. 0,5 instead of 0.5.
There are too many of these files to edit them before importing to R. I was hoping there is an easy parameter for the read.csv() function, or a method to apply to the extracted dataset in order for R to treat the data as a number rather than a string.
When you check ?read.table you will probably find all the answers that you need.
There are two issues with (continental) European csv files:
What does the c in csv stand for? For standard csv this is a ,, for European csv this is a ;
sep is the corresponding argument in read.table
What is the character for the decimal point? For standard csv this is a ., for European csv this is a ,
dec is the corresponding argument in read.table
To read standard csv use read.csv, to read European csv use read.csv2. These two functions are just wrappers to read.table that set the appropriate arguments.
If your file does not follow either of these standards set the arguments manually.
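A quick way to see the two routes side by side, using an inline string instead of a file. The column names and values are invented for illustration:

```r
# A "European" csv: semicolon between fields, comma as decimal mark.
european <- "id;amount\n1;0,5\n2;1,25\n"

# read.csv2 sets sep = ";" and dec = "," for you.
d <- read.csv2(text = european)

# The same result with the arguments set manually on read.table.
d_manual <- read.table(text = european, header = TRUE, sep = ";", dec = ",")

d$amount
```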
From ?read.table:
dec the character used in the file for decimal points.
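And yes, you can pass dec to read.csv as well. (On its own that is not enough, though: with the default sep = "," an unquoted 0,5 is split into two fields, so the separator has to change too.)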
Alternatively, you can also use
read.csv2
which assumes a "," decimal separator and a ";" for column separators.
read.csv(... , sep=";")
Suppose this imported field is called "amount", you can fix the type in this way if your numbers are being read in as character:
d$amount <- sub(",",".",d$amount)
d$amount <- as.numeric(d$amount)
I have this happen to me frequently along with a bunch of other little annoyances when importing from excel or excel csv. As it seems that there's no consistent way to ensure getting what you expect when you import into R, post-hoc fixes seem to be the best method. By that I mean, LOOK at what you imported - make sure it's what you expected and fix it if it's not.
The dec argument can be used as follows:
mydata <- read.table(fileIn, dec=",")
input file (fileIn):
D:\TEST>more input2.txt
06-05-2014 09:19:38 3,182534 0
06-05-2014 09:19:51 4,2311 0
Problems may also be solved if you indicate how your missing values are represented (na.strings=...). For example V1 and V2 here have the same format (decimals separated by "," in csv file), but since NAs are present in V1 it is interpreted as factor:
dat <- read.csv2("...csv", header=TRUE)
head(dat)
> ID x time V1 V2
> 1 1 0:01:00 0,237 0.621
> 2 1 0:02:00 0,242 0.675
> 3 1 0:03:00 0,232 0.398
dat <- read.csv2("...csv", header=TRUE, na.strings="---")
head(dat)
> ID x time V1 V2
> 1 1 0:01:00 0.237 0.621
> 2 1 0:02:00 0.242 0.675
> 3 1 0:03:00 0.232 0.398
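The effect of na.strings can be reproduced on a tiny inline example. The data below is invented; it only assumes, as in the answer, that missing values are written as "---":

```r
# V1 contains a "---" placeholder for a missing value; V2 does not.
csv_text <- "ID;V1;V2\n1;0,237;0,621\n2;---;0,675\n3;0,232;0,398\n"

# Without na.strings, "---" cannot be parsed as a number, so all of V1
# stays as text (or becomes a factor in older R versions).
as_read <- read.csv2(text = csv_text, stringsAsFactors = FALSE)
class(as_read$V1)

# Declaring the placeholder lets V1 be parsed as numeric with an NA.
fixed <- read.csv2(text = csv_text, na.strings = "---")
class(fixed$V1)
```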
Maybe
as.is=T
This also prevents character columns from being converted into factors.
Just to add to Brandon's answer above, which worked well for me (I don't have enough rep to comment):
If you're using
d$amount <- sub(",",".",d$amount)
d$amount <- as.numeric(d$amount)
don't forget that you may need sub("[.]", "", d$amount, perl=T) first, to remove a . used as a thousands separator (use gsub if a value can contain more than one).
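If the values mix a thousands separator . with a decimal comma (an assumption about the data, e.g. 1.234,56), the order of the two substitutions matters: strip the dots first, then turn the comma into a point. A minimal sketch on a made-up vector:

```r
# Hypothetical amounts as they might arrive from a European-style export.
amount <- c("1.234,56", "0,5", "12.345.678,9")

# 1. Remove every thousands separator (gsub, because there may be several).
amount <- gsub(".", "", amount, fixed = TRUE)

# 2. Replace the single decimal comma with a decimal point.
amount <- sub(",", ".", amount, fixed = TRUE)

# 3. Convert to numeric.
amount <- as.numeric(amount)
amount
```

Doing step 2 before step 1 would turn the decimal mark into a dot that step 1 then deletes.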
You can pass the decimal character as a parameter (dec = ","):
# Semicolon as separator and comma as decimal point by default
read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",",
fill = TRUE, comment.char = "", encoding = "unknown", ...)
More info on https://r-coder.com/read-csv-r/
For the pythonistas:
import pandas as pd
mycsv = pd.read_csv("file.csv", delimiter=";", decimal=",")
