Say I have a data frame which is a cross product of the sequence of 1 to 20 with itself:
a <- seq(1,20,1)
combis <- expand.grid(a,a)
colnames(combis) <- c("DaysBack","DaysForward")
So the data looks like:
DaysBack DaysForward
1 1
2 1
...
19 20
20 20
I want to apply a function which takes the days back, and days forward, and returns several values, and then adds these as columns to this data frame. So my function would look something like:
## operation to apply on each row
do_something <- function(days_back, days_forward)
{
# logic to work out some values
...
# return those values
c(value_1, value_2, value_3)
}
And then to add this to the original data frame, so "combis" should for example look like:
DaysBack DaysForward Value1 Value2 Value3
1 1 5 6 7
2 1 4 2 3
...
19 20 1 9 3
20 20 2 6 8
How do I do this and get back a data frame?
EDIT:
My do_something function currently operates on two values, days_back and days_forward. It uses these in the context of another dataframe called pod, which (for this example) looks something like:
Date Price
2016-01-01 3.1
2016-01-02 3.33
...
2016-04-12 2.12
Now say I pass in days_back=1 and days_forward=2. For each row of pod I find the price 1 day back and the price 2 days forward, and add the difference as a column called Diff. I do this by adding lead/lag columns as appropriate (I found shift code for this in What's the opposite function to lag for an R vector/dataframe?), so I'm not doing any looping. Once I have the differences per row, I calculate the mean and standard deviation of Diff and return these two values. I.e. for the combination days_back=1 and days_forward=2 I get some mean and sd of the diff. Now I want this for all combinations of days_back and days_forward, each ranging from 1 to 20. In the example data I gave when I first asked the question, mean_diff would correspond to Value1 and sd_diff to Value2, for example.
So to be clear, my do_something currently operates directly on two scalar values, not on two column vectors. I'm sure it can be rewritten to operate on two vectors, but even then I have the same issue: I don't know how to return the results so that I end up with a data frame that looks like my target output above.
Thanks
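If do_something has to stay scalar (as described in the edit), one way is to apply it row-wise with mapply and bind the transposed result back on. This is only a sketch; the body of do_something here is a made-up placeholder standing in for the real pod-based logic:

```r
# Placeholder scalar function returning three named values per row.
do_something <- function(days_back, days_forward) {
  c(value_1 = days_back * 2,
    value_2 = days_forward + 10,
    value_3 = days_back + days_forward)
}

a <- seq(1, 20, 1)
combis <- expand.grid(a, a)
colnames(combis) <- c("DaysBack", "DaysForward")

# mapply walks the two columns in parallel; each call returns a
# length-3 vector, so the result is a 3 x nrow(combis) matrix.
vals <- mapply(do_something, combis$DaysBack, combis$DaysForward)

# Transpose and bind so each returned value becomes a column.
combis <- cbind(combis, t(vals))
head(combis)
```

Map(do_something, ...) plus do.call(rbind, ...) is an equivalent spelling; both are loops in disguise, so a fully vectorized version will be faster when the logic allows it.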
Something like this:
# data
d <- matrix(1,3,2)
# function
foo <- function(x,y) {
m <- cbind(a=x+1,b=y+2) # calculations
m # return
}
# execute the function
res <- foo(d[,1],d[,2])
# add results to data.frame/matrix
cbind(d,res)
Edit: As you asked in the comments, here it is with your data:
a <- seq(1,20,1)
combis <- expand.grid(a,a)
colnames(combis) <- c("DaysBack","DaysForward")
# function
do_something <- function(x,y) cbind(a=x+1,b=y+2)
# results
m <- cbind(combis,do_something(combis$DaysBack,combis$DaysForward))
head(m)
  DaysBack DaysForward a b
1        1           1 2 3
2        2           1 3 3
3        3           1 4 3
4        4           1 5 3
5        5           1 6 3
6        6           1 7 3
I have a large dataset that I am using to train a machine learning algorithm in R. After all the data preprocessing, I have a data frame that contains factors and numeric values. I split this dataset into a training set and a test set, and save them to file with write.csv().
When I read back the test.csv and train.csv files it may happen that some of the factors have lost levels. This makes some of the algorithms fail when creating design matrices.
Here is a detailed example. Assume that originally I had a dataset with 12 rows that I split into a training set of 8 rows and a test set of 4 rows. I save the 8 training rows to train.csv and the 4 rows to test.csv. Note that factor2 has levels (a,b,c,d) in train.csv:
factor1 factor2 value
1 1 a 1
2 2 b 0
3 3 c 1
4 4 d 1
5 2 a 0
6 4 c 1
7 3 b 0
8 1 a 1
but only (a,b,c) in test.csv:
factor1 factor2
1 4 a
2 2 b
3 4 c
4 1 a
And same for factor1, level 3 is missing in the test set.
When I read back the file test.csv, R assumes that factor1 has levels (1,2,4) and factor2 has levels (a,b,c). I would like to find a way to tell R the actual levels.
The solution I thought of is to save the levels at the beginning, from the original dataset with 12 points, and then reassign them after reading train.csv and test.csv.
I would like to avoid using the save() method from R, because the datasets that I am creating may go to other languages/packages.
Thanks!
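One language-neutral sketch of that idea: write each factor's levels (taken from the full 12-row data) to a plain text file, then reapply them after reading the CSVs back with factor(..., levels = ...). The file name levels_factor2.txt and the data here are illustrative, not from the real dataset:

```r
# 'full' stands in for the original 12-row dataset.
full <- data.frame(
  factor2 = factor(c("a","b","c","d","a","c","b","a","d","b","c","a")))

# Save the levels as plain text, readable from any language.
writeLines(levels(full$factor2), "levels_factor2.txt")

# ... later: read.csv() on the test split has dropped level "d" ...
test_set <- data.frame(factor2 = factor(c("a","b","c","a")))

# Restore the full level set without changing any values.
lv <- readLines("levels_factor2.txt")
test_set$factor2 <- factor(test_set$factor2, levels = lv)
levels(test_set$factor2)
```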
In R, subsetting keeps all factor levels of a vector by default. Here let's imagine a is our data, column a is our categorical variable, and b is our response:
a <- data.frame(a = c("a", "b", "c"), b = c(1, 2, 3), stringsAsFactors = TRUE) # explicit: since R 4.0, strings no longer default to factors
z <- a[1:2, ]
z$a
[1] a b
Levels: a b c
If you are losing factor levels in your sub-setting to train and test sets, you need a better way of sub-setting.
If your problem is in writing the .csv, you probably want to re-include the missing levels as NA rows in the response column. You can do this a ton of ways; here's a merge:
merge(data.frame(a = levels(z$a)), z, all=TRUE)
a b
1 a 1
2 b 2
3 c NA
Edit: From your example, using the first data as dat and the second as dat2, the easiest way is to rebuild each factor with the full level set:
dat2$factor1 <- factor(dat2$factor1, levels = levels(dat$factor1))
dat2$factor2 <- factor(dat2$factor2, levels = levels(dat$factor2))
(Note that assigning with levels(dat2$factor1) <- ... would rename the existing levels positionally rather than add the missing ones.)
I have a data table that is in a "long" format, containing many entries for each unique ID. For example...
id <- c(1,1,1,2,2,2)
date <- c("A","A","B","C","C","C")
loc <- c("X", "X", "X", "X","Y","Z")
dfTest <- data.frame(id,date,loc)
Which creates a sample table.
id date loc
1 1 A X
2 1 A X
3 1 B X
4 2 C X
5 2 C Y
6 2 C Z
My goal is to create a table that looks like this.
id X Y Z
1 2 0 0
2 1 1 1
I would like to see how many times a location was visited uniquely. ID#1 visited X on day A and day B, giving a total of 2 unique visits. I approached this using reshape, thinking to turn this into a "wide" format, but I don't know how to factor in the second variable (the date). I'm trying to pull out the number of visits to each location on unique dates. The actual date itself otherwise does not matter; it only serves to identify the duplicate entries.
My current solution would be poor form in R (to use iterative loops to look at locations found within each unique date). I was hoping reshape, apply, aggregate, or perhaps another package may be of more help. I've looked through a bunch of other reshape guides, but am still a bit stuck on the clever way to do this.
By the sounds of it, you should be able to do what you need with:
table(unique(dfTest)[-2])
## loc
## id X Y Z
## 1 2 0 0
## 2 1 1 1
We can group by 'loc', 'id', get the length of unique elements of 'date' and use dcast to get the expected output.
library(data.table)#v1.9.6+
dcast(setDT(dfTest)[, uniqueN(date), .(loc, id)], id~loc, value.var='V1', fill=0)
# id X Y Z
#1: 1 2 0 0
#2: 2 1 1 1
Why is it that I can't assign a value to an entire column of a data frame, and then a single element in the same "within" statement? The code:
foo <- data.frame( a=seq(1,10) )
foo <- within(foo, {
b <- 1 # set all of b to 1
})
foo <- within(foo, {
c <- 1 # set all of c to 1
c[2] <- 20 # set one element to 20
b[2] <- 20
})
foo
Gives:
a b c
1 1 1 1
2 2 20 20
3 3 1 1
4 4 1 20
5 5 1 1
6 6 1 20
7 7 1 1
8 8 1 20
9 9 1 1
10 10 1 20
The value of b is what I expected. The value of c is strange. It seems to do what I expect if the assignment to the entire column (ie b <- 1) is in a different "within" statement than the assignment to a single element (ie b[2] <- 20). But not if they're in the same "within".
Is this a bug, or something I just don't understand about R?
My guess is that the assignments to new columns are done as you "leave" the function. When doing
c <- 1
c[2] <- 20
all you have really created is a vector c <- c(1, 20). When R has to assign this to a new column, the vector is recycled, creating the 1,20,1,20,... pattern you are seeing.
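You can see both halves of this outside and inside within() with a minimal check:

```r
# The two assignments inside within() first build a length-2 vector...
cc <- 1
cc[2] <- 20
cc                      # c(1, 20)

# ...and only when within() exits is that vector recycled to the
# data frame's length, producing the alternating pattern:
foo <- data.frame(a = seq(1, 10))
foo <- within(foo, { c <- 1; c[2] <- 20 })
foo$c
```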
That's an interesting one.
It has to do with the fact that c is defined only up to length 2, and after that the typical R "recycling rule" takes over and repeats c until it matches the length of the data frame. (And as an aside, this only works for whole multiples: you would not be able to recycle a vector of length 3 or 4 into a data frame of 10 rows.)
Recycling has its critics. I think it is an asset for a dynamically-typed interpreted language like R, particularly when one wants to interactively explore data. "Expanding" data to fit a container and expression is generally a good thing, even if it produces the odd puzzle as it does here.
This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp, and x is the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes the data easier to work with downstream.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to the proc expand.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
DF[is.na(DF$x),"x"] <- 0
DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
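Roughly the same fill-in can be done in base R alone, splitting by group and merging each piece against the full 0:9 grid. A sketch, not a drop-in replacement for the plyr version (the hard-coded 0:9 range is an assumption carried over from the question):

```r
test <- data.frame(
  a = c(1,1,1,1,1,1,1,1,1,1,1),
  b = c(1,1,1,1,1,2,2,2,2,2,2),
  t = c(0,2,3,4,7,3,4,6,7,8,9),
  x = c(1,2,1,2,2,1,1,2,1,1,3))

# Merge one (a, b) group against the full time grid 0:9.
expand_group <- function(DF) {
  out <- merge(data.frame(t = 0:9), DF, all.x = TRUE)
  out$a <- DF$a[1]              # re-fill the group keys on the new rows
  out$b <- DF$b[1]
  out$x[is.na(out$x)] <- 0      # missing timestamps get x = 0
  out
}

test.expanded <- do.call(rbind,
  lapply(split(test, list(test$a, test$b), drop = TRUE), expand_group))
rownames(test.expanded) <- NULL
test.expanded
```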
This is convoluted and only gets partway there: it appends the t values missing from the data as a whole, leaving a, b and x on the new rows as NA:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656