I'm trying to insert a check step in my R script to determine if the structure of the CSV table I'm reading is as expected.
See details:
table.csv has the following colnames:
[1] "A","B","C","D"
This file is generated by someone else, so I'd like to make sure at the beginning of my script that the colnames and the number/order of columns have not changed.
I tried to do the following:
#dataframes to import
df_table <- read.csv('table.csv')
#define correct structure of file
Correct_Columns <- c('A','B','C','D')
#read current structure of table
Current_Columns <- colnames(df_table)
#Check whether CSV was correctly imported from Source
if(Current_Columns != Correct_Columns)
{
# if structure has changed, stop the script.
stop('Imported CSV has a different structure, please review export from Source.')
}
#if not, continue with the rest of the script...
Thanks in advance for any help!
Using base R, I suggest you take a look at all.equal(), identical() or any().
See the following example:
a <- c(1,2)
b <- c(1,2)
c <- c(1,2)
d <- c(1,2)
df <- data.frame(a,b,c,d)
names.df <- colnames(df)
names.check <- c("a","b","c","d")
!all.equal(names.df,names.check)
# [1] FALSE
!identical(names.df,names.check)
# [1] FALSE
any(names.df!=names.check)
# [1] FALSE
Accordingly, your code could be modified as follows. Note the isTRUE() wrapper: when the vectors differ, all.equal() returns a character description of the differences rather than FALSE, so negating its result directly would throw an error.
if(!isTRUE(all.equal(Current_Columns, Correct_Columns)))
{
# call your stop statement here
}
Your code probably throws a warning because Current_Columns != Correct_Columns compares the vectors element-wise (i.e. running Current_Columns != Correct_Columns on its own in the console returns a vector of TRUE/FALSE values), whereas if() expects a single logical value.
In contrast, all.equal() and identical() compare the two vectors as whole objects.
For the sake of completeness, please be aware of the slight difference between all.equal() and identical(). In your case it doesn't matter which one you use, but it can become important when dealing with numeric vectors: all.equal() tolerates tiny numeric differences, while identical() does not. See ?all.equal for more information.
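For illustration, here is a minimal sketch of that difference (the values are made up for demonstration):
x <- 1
y <- 1 + 1e-9

all.equal(x, y)  # TRUE  -- equal within a small numeric tolerance
identical(x, y)  # FALSE -- exact comparison, no tolerance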
A quick way with data.table:
library(data.table)
DT <- fread("table.csv")
Correct_Columns <- c('A','B','C','D')
Current_Columns <- colnames(DT)
Check whether any of the pairwise comparisons is FALSE (note the parentheses: %in% binds more tightly than ==, so they are required):
if(FALSE %in% (Current_Columns == Correct_Columns)){
stop('Imported CSV has a different structure, please review export from Source.')
}
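As an aside, if all you need is to halt the script when the check fails, base R's stopifnot() condenses the test and the stop into one line (a sketch reusing the names above; the error message is generic rather than custom):
# halts with "identical(...) is not TRUE" if names differ in value or order
stopifnot(identical(colnames(DT), Correct_Columns))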
In R, is there a function or another way to list all variables that have been created in the global environment after a certain point in the script? I am using an R notebook, so the code is in chunks, and the goal is eventually to delete all variables that were made in certain chunks. The first part of the script has many variables (it takes a long time to re-run) that I would like to keep, but I then want to delete all the variables created in the second part of the script. I know I can just clear the environment etc., but for certain reasons I can't do this. I also have too many variables to selectively type the ones I want to rm(). The variables are all different. A (pseudo) example of what I want to do...
x <- 1
y <- 2
df <- data.frame()
rr <- raster()
## Function here to iteratively list all variables created after this line of code##
dd <- data.frame()
z <- c(1,2,3)
rm(list = listofvars) # listofvars contains "dd" and "z" only
Alternatively, is there a way to list all variables in the global environment in the order that they were created?
I hope this makes sense. Any help is appreciated.
I don't think it's a great idea to get into the business of parsing your script and determining variable definition order in that manner. Here's an alternative: set "known variables" checkpoints.
Imagine this is your notebook, first code block:
.known_vars <- list()
### first code block
# some code here
a <- 1
bb <- 2
# more code
.known_vars <- c(.known_vars, list(setdiff(ls(), unlist(.known_vars))))
End each of your code-blocks (or even more frequently, it's entirely up to you) with that last part, which appends a list of variables not known in the previous code block(s).
Next:
### second code block
# some code here
a <- 2 # over-write, not new
quux1 <- quux2 <- 9
# more code
.known_vars <- c(.known_vars, list(setdiff(ls(), unlist(.known_vars))))
Again, that last line is the same as before. Just use that same line of code.
When you want to do some cleanup, then
.known_vars
# [[1]]
# [1] "a" "bb"
# [[2]]
# [1] "quux1" "quux2"
In this case, if we want to remove all variables except those in the first code block, then we'd do
unlist(.known_vars[-1])
# [1] "quux1" "quux2"
rm(list = unlist(.known_vars[-1]))
The reason I chose a dot-leading variable name is that by default it is not shown in ls() output: you'd need ls(all.names=TRUE) to see it. While the default visibility is not a problem, I just want to keep things a little cleaner. If you choose not to start the name with a dot, and for some reason delete variables from the same code block in which .known_vars is defined, then you might lose the checkpoints for the other blocks, too.
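A quick sketch of that hiding behaviour in a fresh session:
.known_vars <- list()
ls()                  # character(0) -- the dot-leading name is hidden
ls(all.names = TRUE)  # [1] ".known_vars"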
If you want this a little more formal, then you can do
.vars <- local({
  .cache <- list()
  function(add = NULL, clear = FALSE) {
    if (clear) .cache <<- list()
    if (length(add)) .cache <<- c(.cache, list(setdiff(add, unlist(.cache))))
    if (is.null(add)) .cache else invisible(.cache)
  }
})
Calling it with no arguments returns its current state, and calling it with ls() records a new entry. Such as:
ls() # proof we're starting "empty"
# character(0)
.vars(clear = TRUE) # instantiate with an empty list of variables
### first code block
# some code here
a <- 1
bb <- 2
# more code
.vars(ls())
### second code block
# some code here
a <- 2 # over-write, not new
quux1 <- quux2 <- 9
# more code
.vars(ls())
.vars()
# [[1]]
# [1] "a" "bb"
# [[2]]
# [1] "quux1" "quux2"
And removing unwanted variables is done in the same way.
Since this is still just an object in the global environment, the next step in keeping it protected (perhaps then with a name that doesn't need the leading dot) would be to make sure it lives in its own environment (not .GlobalEnv) while still on R's search path. This is most easily done with a custom package, though that may be more work than you were expecting for this simple utility.
BTW: R does not store when an object is created, modified, or deleted, so you'd need to keep track of that, too. If you feel the need to add timestamps to .vars(), then you'll need to restructure things a bit ... again, perhaps more effort than needed here.
BTW 2: this is prone to deleted-then-redefined variables: it does not know if vars have been deleted, just that they were defined at some time. If anything else removes variables, this won't know, and then rm(list=...) will complain about missing variables. Not horrible, but still good to know.
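If that caveat matters in your workflow, one defensive tweak (a sketch, not part of the checkpoint code above; still_there is just a hypothetical helper name) is to drop names that no longer exist before calling rm():
# only remove variables that are still present in the global environment
still_there <- intersect(unlist(.known_vars[-1]), ls(all.names = TRUE))
rm(list = still_there)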
Using the script created in the Note at the end, read it in with readLines(), then grep out the lines that start with optional whitespace, a word, more optional whitespace, and <-. Then remove the <- and everything after it and trim off the surrounding whitespace, leaving the variable names v in the order they are encountered in the script. Finally, as an example, form vv as the subvector of v containing "df" and the variable names that follow it.
L <- grep("^\\s*\\S*\\s*<-", readLines("myscript.R"), value = TRUE)
v <- trimws(sub("<-.*", "", L)); v
## [1] "x" "y" "df" "rr" "dd" "z"
vv <- tail(v, -(match("df", v)-1)); vv
## [1] "df" "rr" "dd" "z"
To remove the variables in vv from the global environment, use rm(list = vv, envir = .GlobalEnv).
Note
Lines <- "
x <- 1
y <- 2
df <- data.frame()
rr <- raster()
## Function here to iteratively list all variables created after this line of code##
dd <- data.frame()
z <- c(1,2,3)
"
cat(Lines, file = "myscript.R")
I need to run through a large data frame and extract a vector with the name of the variables that are numeric type.
I've gotten stuck in my code; perhaps someone could point me to a solution.
This is how far I have got:
numericVarNames <- function(df) {
  numeric_vars <- c()
  for (i in colnames(df)) {
    if (is.numeric(df[i])) {
      numeric_vars <- c(numeric_vars, colnames(df)[i])
      message(numeric_vars[i])
    }
  }
  return(numeric_vars)
}
To run it:
teste <-numericVarNames(semWellComb)
The is.numeric assertion is not working. There is something wrong with my syntax for catching the type of each column. What is wrong?
Rather than a looping function, how about
df <- data.frame(a = c(1,2,3),
b = c("a","b","c"),
c = c(4,5,6))
## names(df)[sapply(df, class) == "numeric"]
## updated to be 'safer'
names(df)[sapply(df, is.numeric)]
[1] "a" "c"
## As variables can have multiple classes, the is.numeric() test is safer
## than comparing class() to "numeric" (see ?is.numeric and ?class)
Without test data it is hard to be sure, but it looks like there are just a couple of small "grammar" issues in your code.
You wrote:
numeric_vars <- c(numeric_vars, colnames(df)[i])
The way to get the column name into the concatenated vector is to include the subscript inside the parentheses:
numeric_vars <- c(numeric_vars, colnames(df[i]))
Note also that df[i] returns a one-column data frame, so is.numeric(df[i]) is always FALSE; in the test, use df[[i]], which extracts the column as a vector. Try running it with those changes and see what you get.
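Putting both fixes together, a minimal corrected sketch of the function could look like this (df[[i]] extracts the column as a vector, and since i already holds the column name it can be appended directly):
numericVarNames <- function(df) {
  numeric_vars <- c()
  for (i in colnames(df)) {
    if (is.numeric(df[[i]])) {  # [[ ]] returns the column vector, not a data frame
      numeric_vars <- c(numeric_vars, i)
    }
  }
  numeric_vars
}

numericVarNames(data.frame(a = 1:3, b = letters[1:3], c = 4:6))
# [1] "a" "c"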
I have an initial variable:
a = c(1,2,3)
attr(a,'name') <- 'numbers'
Now I want to create a new variable that is a subset of a and have it keep the same attributes as a. Is there something like a copy.over.attr function that does this, without me having to go in and identify which attributes are user-defined, etc.? This gets complicated when I have numerous attributes attached to a single variable.
There is mostattributes<-, which receives a list and attempts to set the attributes of the object in its argument accordingly. It should be used with caution and care. At the very least, reading its source code will give you some nice ideas on how to check attributes between objects. Here's a little run on your sample a vector. It succeeds since it's not violating any properties of b:
a = c(1,2,3)
attr(a,'name') <- 'numbers'
b <- a[-1]
attributes(b)
# NULL
mostattributes(b) <- attributes(a)
attributes(b)
# $name
# [1] "numbers"
Here's a sample of the source code where the names, dim, and dimnames attributes are checked:
if (h.nam <- !is.na(inam <- match("names", names(value)))) {
n1 <- value[[inam]]
value <- value[-inam]
}
if (h.dim <- !is.na(idin <- match("dim", names(value)))) {
d1 <- value[[idin]]
value <- value[-idin]
}
if (h.dmn <- !is.na(idmn <- match("dimnames", names(value)))) {
dn1 <- value[[idmn]]
value <- value[-idmn]
}
attributes(obj) <- value
There is also attr.all.equal. It's not the operation you want, but I think you would benefit from reading that source code too. There are many good checks you can learn about in that one.
Wouldn't a simple attributes(b) <- attributes(a) work?
It would just be executed after creating b from a subset of the data in a, so it's not really a single statement, but it should work.
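For completeness, a quick sketch of that approach on the example vector from the question:
a <- c(1, 2, 3)
attr(a, 'name') <- 'numbers'

b <- a[-1]                      # subsetting drops the custom attribute
attributes(b) <- attributes(a)  # copy everything back wholesale
attributes(b)
# $name
# [1] "numbers"
The wholesale copy errors if a carries size-dependent attributes (such as dim) that don't fit b, which is exactly the situation where the more careful mostattributes<- is the better choice.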
I have a matrix of information that I import from tab separated files. Once I import this data, I consolidate it in to a dataframe, and perform some editing on it to make it usable.
The last step is for me to convert all the numbers to numeric. In order to do this, I use as.numeric(as.character()). Unfortunately, the numbers do not change to numeric; they are still of chr type.
Here is my code:
stringsAsFactors=F
filelist <- list.files(path="C:\\Users\\LocalAdmin\\Desktop\\Correlation Project\\test", full.names=TRUE, recursive=FALSE)
temp <- data.frame()
TSV <- data.frame()
for (i in seq (1:length(filelist)))
{
temp <- read.table(file=filelist[i],head=TRUE,sep="\t")
TSV <- rbind(TSV,temp)
}
for (i in seq(15,1,-1)) #getting rid of extraneous dataframe entries
{
TSV <- TSV[-i,] #deleting by row
}
for(i in seq(1,ncol(TSV),1))
{
TSV[,i] <- as.numeric(as.character(TSV[,i]))
}
Thank you for your help!
You can use
TSV <- as.data.frame(apply(TSV, 2, as.numeric))
This keeps the column structure intact, but it will only work if all values can be transformed into numbers.
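If you'd rather convert column by column without going through a matrix, a sketch of an equivalent approach:
# convert every column in place; factors go through as.character() first
TSV[] <- lapply(TSV, function(x) as.numeric(as.character(x)))
The empty-bracket assignment TSV[] <- keeps TSV a data frame with its original names and dimensions.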
A couple of things here:
Prefer vector operations whenever possible.
There is no need to read the files in a for loop:
TSV <- do.call(rbind, lapply(filelist, read.delim))
Your loop to get rid of extraneous info can be reduced to a vector operation:
TSV <- TSV[-(1:15), ]
I'm assuming you are getting factors and integers that you want as numeric:
oldClasses <- sapply(TSV, class)
int2numeric <- oldClasses == "integer"
factor2numeric <- oldClasses == "factor"
TSV[int2numeric] <- lapply(TSV[int2numeric], as.numeric)
TSV[factor2numeric] <- lapply(TSV[factor2numeric], function(x) as.numeric(as.character(x)))
Note the lapply(): a logical subscript can select several columns at once, and as.numeric() cannot be applied to a whole data frame directly.
you could arguably reduce the 2 above to one, but I think this makes your intent clear
and that should be it
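Putting those pieces together, the whole import could be wrapped in one small helper (a sketch; read_tsv_folder is a hypothetical name, and the path and 15-row cleanup are assumptions carried over from the question):
read_tsv_folder <- function(path, drop_rows = 1:15) {
  # read and stack every file in the folder
  filelist <- list.files(path, full.names = TRUE)
  TSV <- do.call(rbind, lapply(filelist, read.delim))
  # drop the extraneous leading rows
  TSV <- TSV[-drop_rows, ]
  # coerce integer and factor columns to numeric
  classes <- sapply(TSV, class)
  TSV[classes == "integer"] <- lapply(TSV[classes == "integer"], as.numeric)
  TSV[classes == "factor"]  <- lapply(TSV[classes == "factor"],
                                      function(x) as.numeric(as.character(x)))
  TSV
}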
@JPC I finally managed to get this to work. Here is my code:
TSVnew<-apply(TSV,2,as.numeric)
rownames(TSVnew)<-rownames(TSV)
TSV<-TSVnew
However, I still don't understand why my previous attempt using this didn't work:
for(i in seq(1,ncol(TSV),1))
{
TSV[,i] <- as.numeric(as.character(TSV[,i]))
}
I have this .csv file:
ID,GRADES,GPA,Teacher,State
3,"C",2,"Teacher3","MA"
1,"A",4,"Teacher1","California"
And what I want to do is read in the file using the R statistical software and read the header into some kind of list or array (I'm new to R and have been looking for how to do this, but so far have had no luck).
Here's some pseudocode of what I want to do:
inputfile=read.csv("C:/somedirectory")
for eachitem in row1:{
add eachitem to list
}
Then I want to be able to use those names to call on each vertical column so that I can perform calculations.
I've been scouring Google for an hour, trying to find out how to do this, but there is not much out there on dealing with headers specifically.
Thanks for your help!
You mention that you will call on each vertical column so that you can perform calculations. I assume that you just want to examine each single variable. This can be done through the following.
df <- read.csv("myRandomFile.csv", header=TRUE)
df$ID
df$GRADES
df$GPA
It might also be helpful to assign a single column to its own variable:
var3 <- df$GPA
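If what you want is the header itself as a character vector (so you can loop over it or index with it), names() gives you exactly that. A sketch with the file from the question:
header <- names(df)  # c("ID", "GRADES", "GPA", "Teacher", "State")
df[[header[3]]]      # the same column as df$GPA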
You need read.csv("C:/somedirectory/some/file.csv") and in general it doesn't hurt to actually look at the help page including its example section at the bottom.
As Dirk said, the function you are after is 'read.csv' or one of the other read.table variants. Given your sample data above, I think you will want to do something like this:
setwd("c:/random/directory")
df <- read.csv("myRandomFile.csv", header=TRUE)
All we did in the above was set the directory to where your .csv file is and then read the .csv into a dataframe named df. You can check that the data loaded properly by checking the structure of the object with:
str(df)
Assuming the data loaded properly, you can then go on to perform any number of statistical methods on the data in your data frame. I think summary(df) would be a good place to start. Learning how to use the help in R will be immensely useful, and a quick read through the material on CRAN will save you lots of time in the future: http://cran.r-project.org/
You can use
df <- read.csv("filename.csv", header=TRUE)
# To loop each column
for (i in 1:ncol(df))
{
dosomething(df[,i])
}
# To loop each row
for (i in 1:nrow(df))
{
dosomething(df[i,])
}
Also, you may want to have a look at the apply() function (type ?apply or help(apply)) if you want to apply the same function to each row/column.
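For example, a sketch that applies one function across just the numeric columns (mean is an arbitrary choice here):
# column means of the numeric columns only
sapply(df[sapply(df, is.numeric)], mean)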
Please check out the following and see if it helps you:
df<-read.csv("F:/test.csv",header=FALSE,nrows=1)
df
V1 V2 V3 V4 V5
1 ID GRADES GPA Teacher State
a<-c(df)
a[1]
$V1
[1] ID
Levels: ID
a[2]
$V2
[1] GRADES
Levels: GRADES
a[3]
$V3
[1] GPA
Levels: GPA
a[4]
$V4
[1] Teacher
Levels: Teacher
a[5]
$V5
[1] State
Levels: State
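A note on the output above: with header=FALSE the values come back as factors (in versions of R before 4.0, or whenever stringsAsFactors is TRUE). If you want the header as a plain character vector instead, one sketch:
# read just the first line and keep it as character, not factor
hdr <- unlist(read.csv("F:/test.csv", header = FALSE, nrows = 1,
                       stringsAsFactors = FALSE), use.names = FALSE)
hdr
# [1] "ID"      "GRADES"  "GPA"     "Teacher" "State"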
Since you say you want to access by position once your data is read in, you should know about R's subsetting/ indexing functions.
The easiest is
df[row,column]
#example
df[1:5,] #rows 1:5, all columns
df[,5] #all rows, column 5.
Other indexing methods are described in R's help (see ?Extract). I personally use the dplyr package for intuitive data manipulation (not by position).
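For instance, a small sketch combining both subsetting styles on the example file from the question (the path is assumed, and the dplyr lines assume the package is installed):
df <- read.csv("test.csv", header = TRUE)
df[df$GPA > 2, c("ID", "GPA")]  # rows where GPA exceeds 2, two named columns
df[["GPA"]]                     # a single column by name, as a plain vector

library(dplyr)
df %>% filter(GPA > 2) %>% select(ID, GPA)  # the same row/column selection in dplyr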