I need to run through a large data frame and extract a vector with the name of the variables that are numeric type.
I've got stuck in my code, perhaps someone could point me to a solution.
This is how far I have got:
numericVarNames <- function(df) {
numeric_vars<-c()
for (i in colnames(df)) {
if (is.numeric(df[i])) {
numeric_vars <- c(numeric_vars, colnames(df)[i])
message(numeric_vars[i])
}
}
return(numeric_vars)
}
To run it:
teste <-numericVarNames(semWellComb)
The is.numeric assertion is not working. There is something wrong with my syntax for catching the type of each column. What is wrong?
Rather than a looping function, how about
df <- data.frame(a = c(1,2,3),
b = c("a","b","c"),
c = c(4,5,6))
## names(df)[sapply(df, class) == "numeric"]
## updated to be 'safer'
names(df)[sapply(df, is.numeric)]
[1] "a" "c"
## As variables can have multiple classes
This question is worth a read
Without test data it is hard to be sure, but it looks like there is just a "grammar" issue in your code.
You wrote:
numeric_vars <- c(numeric_vars, colnames(df)[i])
The way to get the column name into the concatenated list is to include the whole referred to subset in the parentheses:
numeric_vars <- c(numeric_vars, colnames(df[i]))
Try running it with that one change and see what you get.
Related
I am a beginner at R coming from Stata and my first head ache is to figure out how I can loop over a list of names conducting the same operation to all names. The names are variables coming from a data frame. I tried defining a list in this way: mylist<- c("df$name1", "df$name2") and then I tried: for (i in mylist) { i } which I hoped would be equivalent to writing df$name1 and then df$name2 to make R print the content of the variables with the names name1 and name2 from the data frame df. I tried other commands like deleting a variable i=NULL within the for command, but that didn't work either. I would greatly appreciate if someone could tell me what am I doing wrong? I wonder if it has somethign to do with the way I write the i, maybe R does not interpret it to mean the elements of my character vector.
For more clarification I will write out the code I would use for Stata in this instance. Instead of asking Stata to print the content of a variable I am asking it to give summary statistics of a variable i.e. the no. of observations, mean, standard deviation and min and max using the summarize command. In Stata I don't need to refer to the dataframe as I ususally have only one dataset in memory and I need only write:
foreach i in name1 name2 { #name1 and name2 being the names of the variables
summarize `i'
}
So far, I don't manage to do the same thing using the for function in R, which I naivly thought would be:
mylist<-c("df$name1", "df$name2")
for (i in mylist) {
summary(i)
}
you probably just need to print the name to see it. For example, if we have a data frame like this:
df <- data.frame("A" = "a", "B" = "b", "C" = "c")
df
# > A B C
# > 1 a b c
names(df)
# "A" "B" "C"
We can operate on the names using a for loop on the names(df) vector (no need to define a special list).
for (name in names(df)){
print(name)
# your code here
}
R is a little more reticent to let you use strings/locals as code than Stata is. You can do it with functions like eval but in general that's not the ideal way to do it.
In the case of variable names, though, you're in luck, as you can use a string to pull out a variable from a data.frame with [[]]. For example:
df <- data.frame(a = 1:10,
b = 11:20,
c = 21:30)
for (i in c('a','b')) {
print(i)
print(summary(df[[i]]))
}
Notes:
if you want an object printed from inside a for loop you need to use print().
I'm assuming that you're using the summary() function just as an example and so need the loop. But if you really just want a summary of each variable, summary(df) will do them all, or summary(df[,c('a','b')]) to just do a and b. Or check out the stargazer() function in the stargazer package, which has defaults that will feel pretty comfortable for a Stata user.
I’ve worked with SAS and SQL previously I’m trying to get into R via a course. I’ve been set the following task by my tutor:
“Using the Iris dataset, write an R function that takes as its arguments an Iris species and attribute name and returns the minimum and maximum values of the attribute for that species.”
Which sounded straightforward at first, but I’ve come unstuck trying to make the function. The below is as far as I've gotten
#write the function
question_2 <- function(x, y, data){
new_table <- subset(data, Species==x)
themin <-min(new_table$y)
themax <-max(new_table$y)
return(themin)
return(themax)}
#test the function - Species , Attribute, Data
question_2("setosa",Sepal.Width, iris)
I assumed I needed quotes around the species when running the function, but I get an error that there were "no non-missing arguments to min/max", which I'm guessing means my attempt at making 'new_table' has brought back zero observations.
Can anyone see where I'm going wrong?
edit: thanks all for the swift and insightful responses. i'll take that reading on board. thanks again!
Indeed, your teacher didn't give you the easiest thing to do in R. You were almost right. You can't return twice in a function.
question_2 <- function(x, y, data){
new_table <- subset(data, Species==x)
themin <-min(new_table[[y]])
themax <-max(new_table[[y]])
return(list(themin, themax))}
question_2("setosa","Sepal.Width", iris)
df$colname cannot be used with a variable to the right of $, because it will search for the column named "colname" ("y" in your case) rather than the character the variable colname (if it even exists) represents.
The syntax df[["colname"]] is useful in this case because it allows for character input (which may also be a variable representing a character). This holds for both object types list and data.frame. In fact, a data.frame can be seen as a list of vectors.
Example
df <- data.frame(col1 = 5:7, col2 = letters[1:3])
a <- "col1"
# $ vs [[
df$col1 # works because "col1" is a column of df
df$a # does not work because "a" is not a column of df
df[["col1"]] # works because "col1" is a column of df
df[[a]] # works because "col1" is a column of df
# dataframes can be seen as list of vectors
ls <- list(col1 = 5:7, col2 = letters[1:3])
ls$col1 # works
ls[[a]] # works
One problem is that Sepal.Width seems to be some object in the workspace. Otherwise R would yell at you Object "Sepal.Width" not found.. Whatever Sepal.Width (the object) is, it is probably not a character string with the value "Sepal.Width". But even if it were, R would not know how to use the $ operator to get that named column from new_table, not without some needlessly advanced programming. #Flo.P's suggestion of using [[ is a good one.
You must pass y as "Sepal.Width".
Another approach: you can take advantage of subset by writing this:
question_2 <- function(x, y, data){
newy <- subset(data, subset=Species==x, select=y)
themin <-min(newy)
themax <-max(newy)
return(c(themin, themax))
}
question_2("setosa","Sepal.Width", iris)
I'm trying to insert a check step in my R script to determine if the structure of the CSV table I'm reading is as expected.
See details:
table.csv has the following colnames:
[1] "A","B","C","D"
This file is generated by someone else, hence I'd like to make sure at beginning of my script that the colnames and the number/order of columns has not change.
I tried to do the following:
#dataframes to import
df_table <- read.csv('table.csv')
#define correct structure of file
Correct_Columns <- c('A','B','C','D')
#read current structure of table
Current_Columns <- colnames(df_table)
#Check whether CSV was correctly imported from Source
if(Current_Columns != Correct_Columns)
{
# if structure has changed, stop the script.
stop('Imported CSV has a different structure, please review export from Source.')
}
#if not, continue with the rest of the script...
Thanks in advance for any help!
Using base R, I suggest you take a look at all.equal(), identical() or any().
See the following example:
a <- c(1,2)
b <- c(1,2)
c <- c(1,2)
d <- c(1,2)
df <- data.frame(a,b,c,d)
names.df <- colnames(df)
names.check <- c("a","b","c","d")
!all.equal(names.df,names.check)
# [1] FALSE
!identical(names.df,names.check)
# [1] FALSE
any(names.df!=names.check)
# [1] FALSE
Following, your code could be modified as follows:
if(!all.equal(Current_Columns,Correct_Columns))
{
# call your stop statement here
}
Your code probably throws a warning because Current_Columns!=Correct_Columns will compare all entries of the vector (i.e. running Current_Columns!=Correct_Columns on its own on the console will return a vector with TRUE/FALSE values).
Contrary, all.equal() or identical() will compare the whole vectors while treating them as objects.
For the sake of completeness, please be aware of the slight difference between all.equal() and identical(). In your case it doesn't matter which one you use but it can get important when dealing with numerical vectors. See here for more information.
A quick way with data.table:
library(data.table)
DT <- fread("table.csv")
Correct_Columns <- c('A','B','C','D')
Current_Columns <- colnames(df_table)
Check if there is a false in pairwise matching:
if(F %in% Current_Columns == Correct_Columns){
stop('Imported CSV has a different structure, please review export from Source.')
}
}
I have an initial variable:
a = c(1,2,3)
attr(a,'name') <- 'numbers'
Now I want to create a new variable that is a subset of a and then have it have the same attributes as a. Is there like a copy.over.attr function or something around that does this without me having to go inside and identify which one is user defined attributes etc. This gets complicated when I have numerous attributes attached to a single variable.
It should be used with caution and care. There is mostattributes<-, which receives a list and attempts to set the attributes in the list to the object in its argument. At the very least, reading the source code will give you some nice ideas on how to check attributes between objects. Here's a little run on your sample a vector. It succeeds since it's not violating any properties of b
a = c(1,2,3)
attr(a,'name') <- 'numbers'
b <- a[-1]
attributes(b)
# NULL
mostattributes(b) <- attributes(a)
attributes(b)
# $name
# [1] "numbers"
Here's a sample of the source code where names are checked.
if (h.nam <- !is.na(inam <- match("names", names(value)))) {
n1 <- value[[inam]]
value <- value[-inam]
}
if (h.dim <- !is.na(idin <- match("dim", names(value)))) {
d1 <- value[[idin]]
value <- value[-idin]
}
if (h.dmn <- !is.na(idmn <- match("dimnames", names(value)))) {
dn1 <- value[[idmn]]
value <- value[-idmn]
}
attributes(obj) <- value
There is also attr.all.equal. It's not the operation you want, but I think you would benefit from reading that source code too. There are many good checks you can learn about in that one.
Wouldn't a simple attributes(b) <- attributes(a) work?
This will just be executed after creating b from a subset of the data in a, so it's not really a single statement, but should work.
Does anyone why the result of the following code is different?
a <- cbind(1:10,1:10)
b <- a
colnames(a) <- c("a","b")
colnames(b) <- c("c","d")
colnames(cbind(a,b))
> [1] "a" "b" "c" "d"
colnames(cbind(ts(a),ts(b)))
> [1] "ts(a).a" "ts(a).b" "ts(b).c" "ts(b).d"
Is this or compatibility reasons? Cbind for xts and zoo does not have this feature.
I always accepted this as given, but now my code is littered with the following:
ca<-colnames(a)
cb<-colnames(b)
out <- cbind(a,b)
colnames(out) <- c(ca,cb)
This is just what the cbind.ts method does. You can see the relevant code via stats:::cbind.ts, stats:::.cbind.ts, and stats:::.makeNamesTs.
I can't explain why it was made to be different, since I didn't write it, but here's a work-around.
cbts <- function(...) {
dots <- list(...)
ists <- sapply(dots,is.ts)
if(!all(ists)) stop("argument ", which(!ists), " is not a ts object")
do.call(cbind,unlist(lapply(dots,as.list),recursive=FALSE))
}
I take it that you're interested in why this happens.
Taking a look at the body of stats:::.cbind.ts, which is the function that does column binding for time series, shows that naming is performed by .makeNamesTs. Taking a look at stats:::.make.Names.Ts reveals that the names are derived directly from the arguments you pass to cbind, and there is no obvious way to influence this. As an example, try:
cbind(ts(a),ts(b, start = 2))
You will find that the start specification of the second time series appears in the name of the respective columns.
As to why that's the way things are ... I can't help you there!