Selecting a Column in R

I imported a dataset with no column headings, and I'm trying to label the columns for convenience. I've used R quite a bit before, so I'm confused as to why this code isn't working:
library(mosaic)
`0605WindData` <- read.csv("~/pathnamehere/0605WindData.txt", header=F)
Station = 0605WindData[,1]
Error: unexpected symbol in "Station = 0605WindData"
I swear I have experience with R (albeit I'm a bit out of practice), but I seem to be stuck on something pretty simple. I know I've used this select column command before. Suggestions?

You forgot to quote the object name when subsetting:
> `0605WindData` <- data.frame(A = 1:10, B = 1:10)
> `0605WindData`[,1]
[1] 1 2 3 4 5 6 7 8 9 10
As Roman points out, object names are not supposed to start with a digit. Your read.csv() line only worked because you back-tick quoted the object name. You have to continue to quote the object in every line of code now because you used a non-standard name for that object. Save yourself some trouble and change the name of the object you assign the output from read.csv() to.

`0605WindData` <- read.csv("~/pathnamehere/0605WindData.txt", header=F)
Station = 0605WindData[,1]
Instead of quoting the name with back-ticks, start the variable name with a letter, such as
winddata060 <- read.csv("~/pathnamehere/0605WindData.txt", header=F)
Now select the required variable name
Station = winddata060[,1]
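To label the columns as the question originally intended, here is a minimal sketch; a small in-memory data frame stands in for the read.csv() call, and the column names are made up for illustration:

```r
# Stand-in for: read.csv("~/pathnamehere/0605WindData.txt", header = FALSE)
wind_data <- data.frame(V1 = c("S1", "S2"), V2 = c(3.2, 4.1))

# Label the columns for convenience (these names are hypothetical)
names(wind_data) <- c("Station", "Speed")

# Columns are now addressable by name, with no back-ticks required
Station <- wind_data$Station
```

Because the object name starts with a letter, every subsequent line can refer to it without quoting.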

Related

How can I import a vcf file into R, or load it? [duplicate]

I have this VCF-format file that I want to read into R. However, the file contains some redundant lines that I want to skip: I want the data to start from the row matching #CHROM, as shown in the result below.
This is what I have tried:
chromo1 <- try(scan("myfile.vcf", what = character(), n = 5000, sep = "\n", skip = 0, fill = TRUE, na.strings = "", quote = "\"")) ## find the start of the vcf file
skip.lines <- grep("^#CHROM", chromo1)
column.labels <- read.delim("myfile.vcf", header = FALSE, nrows = 1, skip = (skip.lines - 1), sep = "\t", fill = TRUE, stringsAsFactors = FALSE, na.strings = "", quote = "\"")
num.vars <- dim(column.labels)[2]
myfile.vcf
#not wanted line
#unnecessary line
#junk line
#CHROM POS ID REF ALT
11 33443 3 A T
12 33445 5 A G
result
#CHROM POS ID REF ALT
11 33443 3 A T
12 33445 5 A G
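The skip-to-#CHROM approach sketched in the question can be made self-contained like this; a temporary file stands in for myfile.vcf, and the contents mirror the example above:

```r
# Write the example VCF lines to a temporary file standing in for myfile.vcf
vcf_file <- tempfile(fileext = ".vcf")
writeLines(c("#not wanted line",
             "#unnecessary line",
             "#junk line",
             "#CHROM\tPOS\tID\tREF\tALT",
             "11\t33443\t3\tA\tT",
             "12\t33445\t5\tA\tG"), vcf_file)

# Find the header row, then skip everything before it
all_lines  <- readLines(vcf_file)
header_row <- grep("^#CHROM", all_lines)
vcf <- read.delim(vcf_file, header = TRUE, skip = header_row - 1,
                  sep = "\t", stringsAsFactors = FALSE, check.names = FALSE)
```

check.names = FALSE keeps the first column name as "#CHROM" instead of mangling it into a syntactic name.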
Maybe this could be good for you:
# read two times the vcf file, first for the columns names, second for the data
tmp_vcf<-readLines("test.vcf")
tmp_vcf_data<-read.table("test.vcf", stringsAsFactors = FALSE)
# filter for the columns names
tmp_vcf<-tmp_vcf[-(grep("#CHROM",tmp_vcf)+1):-(length(tmp_vcf))]
vcf_names<-unlist(strsplit(tmp_vcf[length(tmp_vcf)],"\t"))
names(tmp_vcf_data)<-vcf_names
P.S.: If you have several vcf files, you can use lapply to apply this to each one.
data.table::fread reads it as intended, see example:
library(data.table)
#try this example vcf from GitHub
vcf <- fread("https://raw.githubusercontent.com/vcflib/vcflib/master/samples/sample.vcf")
#or if the file is local:
vcf <- fread("path/to/my/vcf/sample.vcf")
We can also use the vcfR package; see its manuals for details.
I don't know how fread read the vcf correctly in the answer above, but you can use skip to define the line where the data starts (or, if given an integer, the number of rows to skip).
library(data.table)
df = fread(file='some.vcf', sep='\t', header = TRUE, skip = '#CHROM')
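Here is a self-contained sketch of that skip = "#CHROM" call; it assumes data.table is installed, and the file contents are made up to match the question's example:

```r
library(data.table)

# Fake VCF with junk metadata lines before the #CHROM row
vcf_file <- tempfile(fileext = ".vcf")
writeLines(c("##fileformat=VCFv4.2",
             "##some junk metadata",
             "#CHROM\tPOS\tID\tREF\tALT",
             "11\t33443\t3\tA\tT",
             "12\t33445\t5\tA\tG"), vcf_file)

# skip = "#CHROM" tells fread to start reading at the first line
# containing that string, so the junk lines above it are dropped
df <- fread(file = vcf_file, sep = "\t", header = TRUE, skip = "#CHROM")
```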

Read a csv file with quotation marks and regex in R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first one serves as the column names. The fields are all separated by commas, with the values in quotation marks except for the first one, and I think that is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and drop the ones I don't want later, no worries.
The data comes in a .csv file that was given to me.
If I open this file in excel the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (btw I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the data frame has the column names nicely separated, but the values are all in the first one.
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets split in two, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it there, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect by hand.
Both read.delim and fread give warning messages; I can include them if they might be useful.
update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class come out distinct.
I tried read.csv with the input you provided and had no problems; each column is accessible when subsetting. As for your other options, you are getting the quote option wrong: it needs to be "\"", because the double-quote character has to be escaped, i.e. df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 line and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
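For reference, here is a sketch reproducing the question's two lines in a temporary file and reading them back with plain read.csv; the default quote handling pairs the quotation marks, so commas inside quoted fields do not split columns:

```r
# Reproduce the question's header line and one data line in a temp file
csv_file <- tempfile(fileext = ".csv")
writeLines(c("ne,class,regex,match,event,msg",
             "BOU2-P-2,\"tengigabitethernet\",\"tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})\",\"4/2\",\"lineproto-5-updown\",\"%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down\""),
           csv_file)

# Default quote = "\"" keeps quoted fields intact, commas and all
df <- read.csv(csv_file, stringsAsFactors = FALSE)
```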

In R, a command works on the variable name rather than the object the variable addresses. How can I fix this?

I am stuck with a rather stupid issue. I run a loop where numbers are appended to a string to address various columns of a data frame. These column names are then stored in a variable. Then, I run a command on this variable. However, the command is run on the name of what the variable contains, rather than the object represented by the name.
I have previously tried as.character, deparse, as.array etc.
library(qdap)
for (q in 1:9) {
  lolz <- freq_terms(paste("df$col_0", q, sep = ""))
  assign(paste("freq", q, sep = ""), lolz)
}
Instead of the command freq_terms being applied to the qth column of df and stored in freq'q', the command is applied to the literal string "df$col_01".
An example of the data
col_01 col_02 col_03 col_04
hey no yes ok
Yo yes ok NA
no hey NA ok
The error I receive is
Error in names(y) <- c("WORD", "FREQ") :
'names' attribute [2] must be the same length as the vector [1]
Any help will be much appreciated!
I add this "answer" just to demonstrate that I cannot reproduce your problem with the information given. When I run
library(qdap)
df <- read.table(header = TRUE, text = "
col_01 col_02 col_03 col_04
hey no yes ok
Yo yes ok NA
no hey NA ok")
freq_terms(df[[paste0("col_0", 1)]])
I get as output
WORD FREQ
1 hey 1
2 no 1
3 yo 1
That is why I suggest adding a subset of your data to your question by copying the output of dput(head(df)) rather than how you did -- dput() will give us more information about the object df, which could be causing your problems. You may also want to add the output of sessionInfo() -- could be informative.
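As a general pattern, indexing with [[ and collecting results in a list avoids both the paste()-into-a-string problem and the assign() calls. In this sketch, table() stands in for qdap::freq_terms so the example runs without qdap installed:

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
col_01 col_02 col_03 col_04
hey no yes ok
Yo yes ok NA
no hey NA ok")

freqs <- list()
for (q in 1:4) {
  col_name <- paste0("col_0", q)
  # df[[col_name]] returns the column's values, not the string "df$col_0q"
  freqs[[col_name]] <- table(df[[col_name]])
}
```

With qdap available, table(df[[col_name]]) would be replaced by freq_terms(df[[col_name]]).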

Prevent variable name getting mangled by read.csv/read.table?

My data set testdata has 2 variables named PWGTP and AGEP
The data are in a .csv file.
When I do:
> head(testdata)
The variables show up as
ï..PWGTP AGEP
23 55
26 56
24 45
22 51
25 54
23 35
So, for some reason, R is reading PWGTP as ï..PWGTP. No biggie.
HOWEVER, when I use some function to refer to the variable ï..PWGTP, I get the message:
Error: id variables not found in data: ï..PWGTP
Similarly, when I use some function to refer to the variable PWGTP, I get the message:
Error: id variables not found in data: PWGTP
2 Questions:
Is there anything I should be doing to the source file to prevent mangling of the variable name PWGTP?
It should be trivial to rename ï..PWGTP to something else -- but R is unable to find a variable named as such. Your thoughts on how one should try to repair the variable name?
This is a BOM (Byte Order Mark) UTF-8 issue.
To prevent this from happening, 2 options:
Save your file as UTF-8 without BOM / signature -- or --
Use fileEncoding = "UTF-8-BOM" when using read.table or read.csv
Example:
mydata <- read.table(file = "myfile.txt", fileEncoding = "UTF-8-BOM")
It is possible that the column name in the file has extra characters or spaces before PWGTP, which get converted to the dots (..) while reading into R. One way to prevent this would be to use check.names = FALSE in read.csv/read.table:
d1 <- read.csv("yourfile.csv", header=TRUE, stringsAsFactors=FALSE, check.names=FALSE)
However, it is better not to have a name starting with number or have spaces in between.
So, suppose, if the OP read the data with the default options i.e. with check.names = TRUE, we can use sub to change the column names
names(d1) <- sub(".*\\.+", "", names(d1))
As an example
sub(".*\\.+", "", "ï..PWGTP")
#[1] "PWGTP"
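A self-contained sketch of the BOM behaviour: write a UTF-8 file with an explicit BOM at the start, then read it back with fileEncoding = "UTF-8-BOM" so the first column name comes through intact:

```r
# Write a CSV with an explicit UTF-8 BOM (the bytes EF BB BF) at the start
bom_file <- tempfile(fileext = ".csv")
con <- file(bom_file, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("PWGTP,AGEP\n23,55\n26,56\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" strips the BOM before parsing,
# so the first header cell is plain "PWGTP"
d1 <- read.csv(bom_file, fileEncoding = "UTF-8-BOM")
```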

Best way to assign variable names to values from a parameter file in R?

I have an R program with a large and growing parameter file. Currently, I read the file values and manually assign them to variables names used in the program. This is unwieldy especially when I want to add a new item in the parameter list.
For example, I can use the toy file stored in csv or text format,
## test using comments and blank lines
3 #an integer
TRUE # a logical
5.3 #numeric
# and here is some
# commentary ...
2014-6-5 # a date
7.8 #strange here
some text #text here
and use something like this to read back the file,
some.lst <- scan("Data/TestReadData.txt", what = character(),
sep = "\n", comment.char = "#",
strip.white = TRUE, quiet = TRUE)
This will result in a character vector (values of mode "character"). I then proceed to assign these to variables like so:
first.int <- as.integer(some.lst[1])
my.switch <- as.logical(some.lst[2])
parm1 <- as.numeric(some.lst[3])
the.date <- as.Date(some.lst[4])
nxt.val <- as.numeric(some.lst[5])
my.text <- some.lst[6]
But I have 40 of these, and now I want to add a new parameter, say below the 5.3 value in the example given. I have to add the new value to the parameter file, but also add a new variable to the assignment list and reshuffle all of the subsequent indices.
Surely, there is a better way to do this, but I can’t think of anything. Ideas much appreciated.
Suppose you have some stuff, all coerced to character when it was read into R:
some_stuff <- c("11","TRUE","1.2")
First, write down how you want each thing dealt with:
cleanem <- read.table(header=TRUE,text="
i nm type
1 first.int integer
2 my.switch logical
3 parm1 numeric
",stringsAsFactors=FALSE)
Then deal with them, storing the result in a list:
res <- with(cleanem,setNames(mapply(as,some_stuff,type,SIMPLIFY=FALSE),nm))
# $first.int
# [1] 11
# $my.switch
# [1] TRUE
# $parm1
# [1] 1.2
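A variant of the same idea uses a lookup table of converter functions, which also covers types like Date that as() cannot coerce to by default; the values and names here mirror the toy file from the question:

```r
# Character values as they would come back from scan()
raw_vals <- c("3", "TRUE", "5.3", "2014-6-5")

# One row per parameter: target variable name and target type
spec <- data.frame(nm   = c("first.int", "my.switch", "parm1", "the.date"),
                   type = c("integer", "logical", "numeric", "Date"),
                   stringsAsFactors = FALSE)

# Map each type name to the conversion function that handles it
converters <- list(integer = as.integer, logical = as.logical,
                   numeric = as.numeric, Date = as.Date)

# Convert each value and store the results in a named list
params <- setNames(Map(function(v, t) converters[[t]](v), raw_vals, spec$type),
                   spec$nm)
```

Adding a new parameter then means adding one line to the file and one row to spec, with no index reshuffling.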
