I am using the following code, which works fine (improvement suggestions very much welcome):
WeeklySlopes <- function(Year, Week){
DynamicQuery <- paste('select DayOfYear, Week, Year, Close from SourceData where year =', Year, 'and week =', Week, 'order by DayOfYear')
SubData = sqldf(DynamicQuery)
SubData$X <- as.numeric(rownames(SubData))
lmfit <- lm(Close ~ X, data = SubData)
lmfit <- tidy(lmfit)
Slope <- as.numeric(sqldf("select estimate from lmfit where term = 'X'"))
e <- globalenv()
e$WeeklySlopesDf[nrow(e$WeeklySlopesDf) + 1,] = c(Year,Week, Slope)
}
WeeklySlopesDf <- data.frame(Year = integer(), Week = integer(), Slope = double())
WeeklySlopes(2017, 15)
WeeklySlopes(2017, 14)
head(WeeklySlopesDf)
Is there really no other way to append a row to my existing dataframe. I seem to need to access the globalenv. On the other hand, why can sqldf 'see' the 'global' dataframe SourceData?
dfrm <- data.frame(a=1:10, b=letters[1:10]) # reproducible example
myfunc <- function(new_a=20){ g <- globalenv(); g$dfrm[3,1] <- new_a; cat(dfrm[3,1])}
myfunc()
20
dfrm
a b
1 1 a
2 2 b
3 20 c # so your strategy might work, although it's unconventional.
Now try to extend dataframe outside a function:
dfrm[11, ] <- c(a=20,b="c")
An occult disaster (conversion of numeric column to character):
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: chr "1" "2" "20" "4" ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So use a list to avoid occult coercion:
dfrm <- data.frame(a=1:10, b=letters[1:10]) # start over
dfrm[11, ] <- list(a=20,b="c")
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
Now try within a function:
myfunc <- function(new_a=20, new_b="ZZ"){ g <- globalenv(); g$dfrm[nrow(dfrm)+1, ] <- list(a=new_a,b=new_b)}
myfunc()
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "ZZ") :
invalid factor level, NA generated
str(dfrm)
'data.frame': 12 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So it succeeds, but if there are any factor columns, non-existent levels will get turned into NA values (with a warning). You method of using named access to objects in the global environment is rather unconventional but there is a set of tested methods that you might want to examine. Look at ?R6. Other options are <<- and assign which allows one to specify the environment in which the assignment is to occur.
Related
I am trying to change the column nature for multiple data frames.
And here is an simple example, which I would like to have both data frames B_1 & B_2 column covert from integer to factor. However I have an error message from the code below.
B_1 = data.frame( x=c("01","02","03"))
B_1$x = as.integer(B_1$x)
B_2 = data.frame( x=c(1,2,3,4,5,6,7) )
B_2$x = as.integer(B_2$x)
for (i in 1:2)
get(paste0("B_",i))[["x"]] <- as.factor(get(paste0("B_",i))[["x"]])
We can use mget to get all the datasets in a list and then with mutate_each convert the columns to factor (if there are multiple columns)
library(dplyr)
lst <- lapply(mget(paste('B', 1:2, sep="_")),
function(x) mutate_each(x, funs(factor(.))) )
str(lst)
#List of 2
# $ B_1:'data.frame': 3 obs. of 1 variable:
# ..$ x: Factor w/ 3 levels "01","02","03": 1 2 3
# $ B_2:'data.frame': 7 obs. of 1 variable:
# ..$ x: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7
It is better to keep the datasets in a list.
One possible solution would be to first make a list with all your B_X data frames and then convert them to factors. Like this:
B_1 = data.frame( x=c("01","02","03"))
B_2 = data.frame( x=c(1,2,3,4,5,6,7) )
li = list()
for (i in 1:2) {li = c(li, get(paste0("B_",i)))}
for (k in 1:2) {li[[k]] = as.factor(li[[k]])}
this will return:
> li
$x
[1] 01 02 03
Levels: 01 02 03
$x
[1] 1 2 3 4 5 6 7
Levels: 1 2 3 4 5 6 7
I hope this solves your problem.
I'm trying to get the error rates for a Naive Bayes classifier, by adding in each variable incrementally. For example I have 25 variables in my dataset. I want to get the error rates of the model as I add in one variable at a time. So you know it would output the error rate of the model with the first 2 columns, the error rate with the first 3 columns, then with the first 4 columns, and so on up to the last column.
Here is the pseudocode of what I'm trying to achieve
START
IMPORT DATASET WITH ALL VARIABLES
num_variables = num_dataset_cols
i= 1
WHILE (i <= num_variables)
{
CREATE NEW DATASET WITH x COLUMNs
BUILD THE MODEL
GET THE ERROR RATE
ADD IN NEXT COLUMN
i = i + 1
}
Here is a reproducible question. Obviously you can't build a NB classifier with this data, but that's not my problem. My problem is adding in the columns one by one. So far, the only way I can do it is by overwriting each column. For a NB classifier, the first column is the class node, so there must be at least 2 columns starting off in order for it to run.
#REPRODUCIBLE EXAMPLE
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")
dataset <- data.frame(col1, col2, col3, col4,col5)
num_variables <- ncol(dataset)
i <- 1
while i <= num_variables
{
data <- dataset[c(1, i+1)]
str(data)
#BUILD MODEL AND GET VALIDATION ERROR
#INCREMENT i TO GET NEXT COLUMN
i <- i + 1
}
You should be able to see from the str(data) that each time the column is overwritten. Does anyone know how I could go about adding each column without overwriting the previous one? Someone suggested an array to me, but I'm not too familiar with arrays in R. Would this work?
I think this is what you want.
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")
dataset <- data.frame(col1, col2, col3, col4,col5)
dataset
num_variables <- ncol(dataset)
num_variables
i <- 1
while (i <= num_variables) {
data <- dataset[, 1:i]
print(str(data))
#BUILD MODEL AND GET VALIDATION ERROR
#INCREMENT i TO GET NEXT COLUMN
i <- i + 1
}
Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
NULL
'data.frame': 5 obs. of 2 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
NULL
'data.frame': 5 obs. of 3 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
NULL
'data.frame': 5 obs. of 4 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
$ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
NULL
'data.frame': 5 obs. of 5 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
$ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
$ col5: Factor w/ 5 levels "10","100","15",..: 1 3 5 2 4
NULL
You can use append function after defining output variable
data <- dataset[c(1, i+1)]
append(output, data)
str(data)
Using the "assign" function within a while loop can be helpful for issues like this. You don't show the model syntax, but something like this should work:
dataset$errorrate <- [whatever makes this calculation, assuming it is vectorized]
name <- paste0(errorrate, i)
assign(name, dataset$errorrate)
...
This should leave you with i variables containing error estimate for each model run. If you are looking for one parameter estimate per model you can assign the single estimate a unique name within the global environment using the process above and then rbind them together after the loop has finished
I have a wide data.frame that is all character vectors (df1). I have a separate vector(vec1) that contains the column classes I'd like to assign to each of the columns in df1.
If I was using read.csv(), I'd use the colClasses argument and set it equal to vec1, but there doesn't appear to be a similar option for an existing data.frame.
Any suggestions for a fast way to do this besides a loop?
I don't know if it will be of help but I have run into the same need many times and I have created a function in case it helps:
reclass <- function(df, vec){
df[] <- Map(function(x, f){
#switch below shows the accepted values in the vector
#you can modify it and/or add more
f <- switch(f,
as.is = 'force',
factor = 'as.factor',
num = 'as.numeric',
char = 'as.character')
#takes the name of the function and fetches the function
f <- get(f)
#apply the function
f(x)
},
df,
vec)
df
}
It uses Map to pass in a vector of classes to the data.frame. Each element corresponds to the class of the column. The length of both the dataframe and the vector need to be the same.
I am using switch as well to make the corresponding classes shorter to type. Use as.is to keep the class the same, the rest are self explanatory I think.
Small example:
df1 <- data.frame(1:10, letters[1:10], runif(50))
> str(df1)
'data.frame': 50 obs. of 3 variables:
$ X1.10 : int 1 2 3 4 5 6 7 8 9 10 ...
$ letters.1.10.: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ runif.50. : num 0.0969 0.1957 0.8283 0.1768 0.9821 ...
And after the function:
df1 <- reclass(df1, c('num','as.is','char'))
> str(df1)
'data.frame': 50 obs. of 3 variables:
$ X1.10 : num [1:50] 1 2 3 4 5 6 7 8 9 10 ...
$ letters.1.10.: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ runif.50. : chr [1:50] "0.0968757788650692" "0.19566105119884" "0.828283685725182" "0.176784737734124" ...
I guess Map internally is a loop but it is written in C so it should be fast enough.
May be you could try this function that makes the same work.
reclass <- function (df, vec_types) {
for (i in 1:ncol(df)) {
type <- vec_types[i]
class(df[ , i]) <- type
}
return(df)
}
and this is an example of vec_types (vector of types):
vec_types <- c('character', rep('integer', 3), rep('character', 2))
you can test the function (reclass) whith this table (df):
table <- data.frame(matrix(sample(1:10,30, replace = T), nrow = 5, ncol = 6))
str(table) # original column types
# apply the function
table <- reclass(table, vec_types)
str(table) # new column types
I have a data frame with two columns: one is strings, the other one is integers.
> rnames = sapply(1:20, FUN=function(x) paste("item", x, sep="."))
> x <- sample(c(1:5), 20, replace = TRUE)
> df <- data.frame(x, rnames)
> df
x rnames
1 5 item.1
2 3 item.2
3 5 item.3
4 3 item.4
5 1 item.5
6 3 item.6
7 4 item.7
8 5 item.8
9 4 item.9
10 5 item.10
11 5 item.11
12 2 item.12
13 2 item.13
14 1 item.14
15 3 item.15
16 4 item.16
17 5 item.17
18 4 item.18
19 1 item.19
20 1 item.20
I'm trying to aggregate the strings into list or vectors of strings (characters) with the 'c' or the 'list' function, but getting weird results:
> aggregate(rnames ~ x, df, c)
x rnames
1 1 16, 6, 11, 13
2 2 4, 5
3 3 12, 15, 17, 7
4 4 18, 20, 8, 10
5 5 1, 14, 19, 2, 3, 9
When I use 'paste' instead of 'c', I can see that the aggregate is working correctly - but the result is not what I'm looking for.
> aggregate(rnames ~ x, df, paste)
x rnames
1 1 item.5, item.14, item.19, item.20
2 2 item.12, item.13
3 3 item.2, item.4, item.6, item.15
4 4 item.7, item.9, item.16, item.18
5 5 item.1, item.3, item.8, item.10, item.11, item.17
What I'm looking for is that every aggregated group would be presented as a vector or a lit (hence the use of c) as opposed to the single string I'm getting with 'paste'. Something along the lines of the following (which in reality doesn't work):
> aggregate(rnames ~ x, df, c)
x rnames
1 1 item.5, item.14, item.19, item.20
2 2 item.12, item.13
3 3 item.2, item.4, item.6, item.15
4 4 item.7, item.9, item.16, item.18
5 5 item.1, item.3, item.8, item.10, item.11, item.17
Any help would be appreciated.
You fell in the usual trap of data.frame: your character column is not a character column, it is a factor column! Hence the numbers instead of the characters in your result:
> rnames = sapply(1:20, FUN=function(x) paste("item", x, sep="."))
> x <- sample(c(1:5), 20, replace = TRUE)
> df <- data.frame(x, rnames)
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 2 5 5 5 5 4 3 3 2 4 ...
$ rnames: Factor w/ 20 levels "item.1","item.10",..: 1 12 14 15 16 17 18 19 20 2 ...
To prevent the conversion to factors, use argument stringAsFactors=FALSE in your call to data.frame:
> df <- data.frame(x, rnames,stringsAsFactors=FALSE)
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 5 5 3 5 5 3 2 5 1 5 ...
$ rnames: chr "item.1" "item.2" "item.3" "item.4" ...
> aggregate(rnames ~ x, df, c)
x rnames
1 1 item.9, item.13, item.17
2 2 item.7
3 3 item.3, item.6, item.19
4 4 item.12, item.15, item.16
5 5 item.1, item.2, item.4, item.5, item.8, item.10, item.11, item.14, item.18, item.20
Another solution to avoid the conversion to factor is function I:
> df <- data.frame(x, I(rnames))
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 3 5 4 5 4 5 3 3 1 1 ...
$ rnames:Class 'AsIs' chr [1:20] "item.1" "item.2" "item.3" "item.4" ...
Excerpt from ?I:
In function data.frame. Protecting an object by enclosing it in I() in
a call to data.frame inhibits the conversion of character vectors to
factors and the dropping of names, and ensures that matrices are
inserted as single columns. I can also be used to protect objects
which are to be added to a data frame, or converted to a data frame
via as.data.frame.
It achieves this by prepending the class "AsIs" to the object's
classes. Class "AsIs" has a few of its own methods, including for [,
as.data.frame, print and format.
'm not sure just exactly what it is that you are looking for... so perhaps some reference output would be good to give us an idea of what we are aiming at?
But, since your last bit of code seems to be close to what you are after, maybe a solution like the following would work:
> library(plyr)
> ddply(df, .(x), summarize, rnames = paste(rnames, collapse = "|"))
x rnames
1 1 item.9|item.11|item.20
2 2 item.1|item.2|item.15|item.16
3 3 item.7|item.8
4 4 item.4|item.5|item.6|item.12|item.13
5 5 item.3|item.10|item.14|item.17|item.18|item.19
You can vary how the individual elements are stuck together by changing the collapse argument to paste().
Alternatively, if you want to just have each of the groups as a vetor then you could use this:
> df$rnames = as.character(df$rnames)
> L = dlply(df, .(x), function(df) {df$rnames})
> L
$`1`
[1] "item.9" "item.11" "item.20"
$`2`
[1] "item.1" "item.2" "item.15" "item.16"
$`3`
[1] "item.7" "item.8"
$`4`
[1] "item.4" "item.5" "item.6" "item.12" "item.13"
$`5`
[1] "item.3" "item.10" "item.14" "item.17" "item.18" "item.19"
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
x
1 1
2 2
3 3
4 4
5 5
This gives you a list of vectors, which is what you were after. And each group can be indexed out of the resulting list:
> L[[1]]
[1] "item.9" "item.11" "item.20"
I use ddply a lot. I use ordered factors occasionally. Calling ddply on a data frame that contains an ordered factor drops any ordering in the recombined data frame.
I wrote the following wrapper for ddply that records level ordering and then re-applies it on any columns that were ordered originally:
dat <- data.frame(a=runif(10),b=factor(letters[10:1],
levels=letters[10:1],ordered=TRUE),
c = rep(letters[1:2],times=5),
d = factor(rep(c('lev1','lev2'),times=5),ordered=TRUE))
#Drops ordering on b and d
dat1 <- ddply(dat,.(c),transform,log_a = log(a))
ddplyKeepOrder <- function(dat,...){
orderedCols <- colnames(dat)[sapply(dat,is.ordered)]
levs <- lapply(dat[,orderedCols,drop=FALSE],levels)
result <- ddply(.data = dat,...)
ind <- match(orderedCols,colnames(result))
levs <- levs[!is.na(ind)]
orderedCols <- orderedCols[!is.na(ind)]
ind <- ind[!is.na(ind)]
if (length(ind) > 0){
for (i in 1:length(ind)){
result[,orderedCols[i]] <- factor(result[,orderedCols[i]],
levels=levs[[i]],ordered=TRUE)
}
}
return(droplevels(result))
}
#Preserves ordering on b and d
dat2 <- ddplyKeepOrder(dat,.variables = .(c),.fun = transform,log_a = log(a))
I haven't checked this function thoroughly so there might be cases it doesn't handle. Is there a better/more complete way to handle this? I could probably remove the for loop if I thought about it a bit, I suppose.
In particular, the checking I do after the ddply call to see if there are still any of the original ordered factors present seems really ugly, but I would like the function to be able to handle cases where ddply alters which columns are present, possibly removing ordered factors.
Thoughts?
I use the code below for these types of problems ("ddply" not "ordered factor") and it seems to handle your specific example without issue (other than different row names).
> dat2 <- do.call(rbind, lapply(split(dat, dat$c), transform, log_a=log(a)))
> str(dat2)
'data.frame': 10 obs. of 5 variables:
$ a : num 0.216 0.607 0.197 0.171 0.797 ...
$ b : Ord.factor w/ 10 levels "j"<"i"<"h"<"g"<..: 1 3 5 7 9 2 4 6 8 10
$ c : Factor w/ 2 levels "a","b": 1 1 1 1 1 2 2 2 2 2
$ d : Ord.factor w/ 2 levels "lev1"<"lev2": 1 1 1 1 1 2 2 2 2 2
$ log_a: num -1.532 -0.499 -1.625 -1.767 -0.227 ...