Single element assignment inside same within as entire column of data frame - r

Why is it that I can't assign a value to an entire column of a data frame, and then a single element in the same "within" statement? The code:
foo <- data.frame( a=seq(1,10) )
foo <- within(foo, {
b <- 1 # set all of b to 1
})
foo <- within(foo, {
c <- 1 # set all of c to 1
c[2] <- 20 # set one element to 20
b[2] <- 20
})
foo
Gives:
a b c
1 1 1 1
2 2 20 20
3 3 1 1
4 4 1 20
5 5 1 1
6 6 1 20
7 7 1 1
8 8 1 20
9 9 1 1
10 10 1 20
The value of b is what I expected. The value of c is strange. It seems to do what I expect if the assignment to the entire column (i.e. b <- 1) is in a different "within" statement than the assignment to a single element (i.e. b[2] <- 20), but not if they're in the same "within".
Is this a bug, or something I just don't understand about R?

My guess is that the assignments to new columns are done as you "leave" the function. When doing
c <- 1
c[2] <- 20
all you have really created is a vector c <- c(1, 20). When R has to assign this to a new column, the vector is recycled, creating the 1,20,1,20,... pattern you are seeing.
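That explanation can be checked outside within(): the same two assignments build a length-2 vector, and data.frame() recycles it across 10 rows (a small sketch, not part of the original answer).

```r
# The two assignments from the question, run in plain R:
v <- 1        # a length-1 vector
v[2] <- 20    # now a length-2 vector: c(1, 20)

# Assigning a length-2 vector into a 10-row data frame recycles it:
foo <- data.frame(a = seq(1, 10), c = v)
foo$c
# [1]  1 20  1 20  1 20  1 20  1 20
```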

That's an interesting one.
It has to do with the fact that c is defined only up to length 2, after which the usual R "recycling rule" takes over and repeats c until it matches the length of the data frame. (As an aside, this only works for whole multiples: you would not be able to recycle a vector of length 3 or 4 into a data frame of 10 rows.)
Recycling has its critics. I think it is an asset for a dynamically-typed interpreted language like R, particularly when one wants to explore data interactively. "Expanding" data to fit a container and expression is generally a good thing -- even if it produces the odd puzzle, as it does here.

Related

R set column to maximum of current entry and specified value in an elegant way [duplicate]

This question already has answers here:
R: Get the min/max of each item of a vector compared to single value
(1 answer)
Replace negative values by zero
(5 answers)
Closed 1 year ago.
NOTE: I technically know how to do this, but I feel like there has to be a "nicer" way. If such questions are not allowed here, just delete it, but I would really like to improve my R style, so any suggestions are welcome.
I have a dataframe data <- data.frame(foo=rep(c(-1,2),5))
foo
1 -1
2 2
3 -1
4 2
5 -1
6 2
7 -1
8 2
9 -1
10 2
Now I would like to be able to set the entries of foo to a certain value (for this example, let's say 1) if the current entry is smaller than that value.
So my desired output would be
foo
1 1
2 2
3 1
4 2
5 1
6 2
7 1
8 2
9 1
10 2
I feel like there should be something like data$foo <- max(data$foo,1) that does the job (but of course, that "maxes" over the whole column).
Is there an elegant way to do this?
data$foo <- ifelse(data$foo < 1,1,data$foo) and data$foo <- lapply(data$foo,function(x) max(1,x)) just feel somewhat "ugly".
max gives you the maximum of the whole column, but for your case you need pmax (parallel maximum), which gives you the maximum of 1 and each number in the vector.
data$foo <- pmax(data$foo, 1)
data
# foo
#1 1
#2 2
#3 1
#4 2
#5 1
#6 2
#7 1
#8 2
#9 1
#10 2
This works:
data <- data.frame(foo=rep(c(-1,2),5))
val <- 1
data[data$foo < val, ] <- val
Let's break this down. data$foo takes the column and makes it into a vector. data$foo < val checks which elements of this vector are smaller than val, creating a new vector of the same length filled with TRUE and FALSE at the correct positions.
Finally, the entire line data[data$foo < val, ] <- val uses that vector of TRUE and FALSE to select the rows of data (using [, ]) to which val is then assigned.
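One caveat: data[data$foo < val, ] <- val assigns val to every column of the matching rows, which happens to be harmless here because foo is the only column. A column-targeted variant of the same logical-indexing idea (a sketch) avoids that:

```r
data <- data.frame(foo = rep(c(-1, 2), 5))
val <- 1
idx <- data$foo < val   # logical vector: TRUE where foo is below val
data$foo[idx] <- val    # overwrite only those entries of foo
data$foo
# [1] 1 2 1 2 1 2 1 2 1 2
```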

Converting Summary Table of Binary Outcome to Long Tidy DataFrame

I want to convert a table that has several categorical variables, as well as a summary of the results of a binary experiment, to long format in order to easily run a logistic regression model.
Is there an easy way to do this that does not involve just making a bunch of vectors with rep() and then combining those into a dataframe? Ideally, I'd like one function that does this automatically, but maybe I'll just need to make my own.
For example, if I start with this summary table:
test group success n
A control 1 2
A treat 2 3
B control 3 5
B treat 1 3
I want to be able to switch it back to the following format:
test group success
A control 1
A control 0
A treat 1
A treat 1
A treat 0
B control 1
B control 1
B control 1
B control 0
B control 0
B treat 1
B treat 0
B treat 0
Thanks!
The reshape package is your friend here. In this case, melt() and untable() are useful for normalizing the data.
If the example summary data.frame is in a variable called df, an abbreviated answer is:
library(reshape)  # provides melt() and untable()

# replace total n with number of failures
df$fail = df$n - df$success
df$n = NULL
# melt and untable the data.frame
df = melt(df)
df = untable(df, df$value)
# recode the results, e.g., here by creating a new data.frame
df = data.frame(
test = df$test,
group = df$group,
success = as.numeric(df$variable == "success")
)
This is a great example of a very general problem: back-calculating the list of data that underlies a cross-tabulation. Given the cross-tabulation, the back-calculated list has one row for each datum and contains the attributes of each datum. Here is a post about the inverse of this question.
In "data geek" parlance, this is a question of putting tabulated data in First Normal Form -- if that is helpful to anyone. You can google data normalization, which will help you design agile data.frames that can be cross-tabulated and analyzed in many different ways.
In detail, for melt() and untable() to work here, the raw data need to be tweaked a bit to include fail (number of failures) rather than total n, but that is simple enough:
df$fail <- df$n - df$success
df$n <- NULL
which gives:
test group success fail
1 A control 1 1
2 A treat 2 1
3 B control 3 2
4 B treat 1 2
Now we can "melt" the table. melt() can back-calculate the original list of data that was used to create a cross tabulation.
df <- melt(df)
In this case, we get a new column called variable that contains either "success" or "fail", and a column called value that contains the datum from the original success or fail column.
test group variable value
1 A control success 1
2 A treat success 2
3 B control success 3
4 B treat success 1
5 A control fail 1
6 A treat fail 1
7 B control fail 2
8 B treat fail 2
The untable() function repeats each row of a table according to the value of a numeric "count" vector. In this case, df$value is the count vector, because it contains the number of successes and fails.
df <- untable(df, df$value)
which will yield one record for each datum, either a "success" or a "fail":
test group variable value
1 A control success 1
2 A treat success 2
2.1 A treat success 2
3 B control success 3
3.1 B control success 3
3.2 B control success 3
4 B treat success 1
5 A control fail 1
6 A treat fail 1
7 B control fail 2
7.1 B control fail 2
8 B treat fail 2
8.1 B treat fail 2
This is the solution. If required, the data can now be recoded to replace "success" with 1 and "fail" with 0 (and get rid of the extraneous value and variable columns...)
df <- data.frame(
test = df$test,
group = df$group,
success = as.numeric(df$variable == "success")
)
This returns the requested solution, though the rows are sorted differently:
test group success
1 A control 1
2 A treat 1
3 A treat 1
4 B control 1
5 B control 1
6 B control 1
7 B treat 1
8 A control 0
9 A treat 0
10 B control 0
11 B control 0
12 B treat 0
13 B treat 0
Obviously, the data.frame can be resorted, if necessary. How to sort a data.frame in R.
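For example, one possible re-sort with order() (a sketch on a tiny stand-in data frame, using the column names from the answer above):

```r
df <- data.frame(
  test    = c("B", "A", "A"),
  group   = c("treat", "control", "control"),
  success = c(1, 0, 1)
)
# order by test, then group, with successes (1) before failures (0)
df <- df[order(df$test, df$group, -df$success), ]
df$success
# [1] 1 0 1
```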

R apply function returning multiple values to each row of data frame and add these as new columns to the data frame

Say I have a data frame which is a cross product of the sequence of 1 to 20 with itself:
a <- seq(1,20,1)
combis <- expand.grid(a,a)
colnames(combis) <- c("DaysBack","DaysForward")
So the data looks like:
DaysBack DaysForward
1 1
2 1
...
19 20
20 20
I want to apply a function which takes the days back, and days forward, and returns several values, and then adds these as columns to this data frame. So my function would look something like:
## operation to apply on each row
do_something <- function(days_back, days_forward)
{
# logic to work out some values
...
# return those values
c(value_1, value_2, value_3)
}
And then to add this to the original data frame, so "combis" should for example look like:
DaysBack DaysForward Value1 Value2 Value3
1 1 5 6 7
2 1 4 2 3
...
19 20 1 9 3
20 20 2 6 8
How do I do this and get back a data frame?
EDIT:
My do_something function currently operates on two values, days_back and days_forward. It uses these in the context of another dataframe called pod, which (for this example) looks something like:
Date Price
2016-01-01 3.1
2016-01-02 3.33
...
2016-04-12 2.12
Now say I pass in days_back=1 and days_forward=2. For each row I find the price 1 day back and the price 2 days forward, and I add the difference as a column called Diff. I do this by adding lead/lag columns as appropriate (I found shift code to do this here: What's the opposite function to lag for an R vector/dataframe?), so I'm not doing any looping. Once I have the differences per row, I calculate the mean and standard deviation of Diff and return these two values. I.e. for the combination days_back=1 and days_forward=2 I have some mean and sd of the diff. Now I want this for all combinations of days_back and days_forward, with each ranging from 1 to 20. In the example data I gave when I first asked the question, mean_diff would correspond to Value1 and sd_diff to Value2.
So to be clear, currently my do_something operates directly on two values and not on two sets of column vectors. I'm sure it can be re-written to operate on two vectors, but then again I have the same issue in that I don't know how to return this data so that in the end I get a data frame that looks like what I showed above as my target output.
Thanks
Something like this:
# data
d <- matrix(1,3,2)
# function
foo <- function(x,y) {
m <- cbind(a=x+1,b=y+2) # calculations
m # return
}
# execute the function
res <- foo(d[,1],d[,2])
# add results to data.frame/matrix
cbind(d,res)
Edit: As you asked in the comments, here it is with your data:
a <- seq(1,20,1)
combis <- expand.grid(a,a)
colnames(combis) <- c("DaysBack","DaysForward")
# function
do_something <- function(x,y) cbind(a=x+1,b=y+2)
# results
m <- cbind(combis,do_something(combis$DaysBack,combis$DaysForward))
head(m)
DaysBack DaysForward a b
1 1 2 3
2 1 3 3
3 1 4 3
4 1 5 3
5 1 6 3
6 1 7 3
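If do_something really cannot be vectorised (for example because it slices another data frame on each call), mapply() applies it one row at a time. This sketch uses placeholder arithmetic in place of the real mean/sd calculation; the names mean_diff and sd_diff are illustrative, not from the original answer:

```r
a <- seq(1, 20, 1)
combis <- expand.grid(a, a)
colnames(combis) <- c("DaysBack", "DaysForward")

# Placeholder standing in for the real scalar function: it must
# return a vector of fixed length (named, so the columns get names).
do_something <- function(days_back, days_forward) {
  c(mean_diff = days_back + days_forward,   # pretend mean
    sd_diff   = days_back * days_forward)   # pretend sd
}

# mapply() calls the function once per row and collects the results
# as a 2 x 400 matrix; t() flips it to 400 rows of 2 named columns.
res <- t(mapply(do_something, combis$DaysBack, combis$DaysForward))
combis <- cbind(combis, res)
head(combis, 2)
```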

What's the difference between [1], [1,], [,1], [[1]] for a dataframe in R? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
In R, what is the difference between the [] and [[]] notations for accessing the elements of a list?
I'm confused with the difference of [1], [1,], [,1], [[1]] for dataframe type.
As far as I know, [1,] will fetch the first row of a matrix and [,1] will fetch the first column, while [[1]] will fetch the first element of a list.
But I checked the document of data.frame, which says
A data frame is a list of variables of the same number of rows with
unique row names
Then I typed in some code to test the usage.
>L3 <- LETTERS[1:3]
>(d <- data.frame(cbind(x=1, y=1:10), fac=sample(L3, 10, replace=TRUE)))
x y fac
1 1 1 C
2 1 2 B
3 1 3 C
4 1 4 C
5 1 5 A
6 1 6 B
7 1 7 C
8 1 8 A
9 1 9 A
10 1 10 A
> d[1]
x
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
>d[1,]
x y fac
1 1 1 C
>d[,1]
[1] 1 1 1 1 1 1 1 1 1 1
>d[[1]]
[1] 1 1 1 1 1 1 1 1 1 1
What confuses me is: [1,] and [,1] are only used on matrices, [[1]] is only used on lists, and [1] is used on vectors, so why are all of them available for a data frame?
Could anybody explain the difference of these usage?
In R, operators are not tied to a single data type. Operators can be overloaded for whatever data type you like (e.g. also S3/S4 classes).
In fact, that's the case for data.frames.
As data.frames are lists, the [i] and [[i]] (and $) forms show list-like behaviour.
Row and column indices have an intuitive meaning for tables, and data.frames look like tables. That is probably why the [i, j] methods for data.frame were defined.
You can even look at the definitions, they are coded in the S3 system (so methodname.class):
> `[.data.frame`
and
> `[[.data.frame`
(the backticks quote the function name, otherwise R would try to use the operator and end up with a syntax error)
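A quick way to see the two sets of behaviours side by side (a small sketch, not from the original answer) is to compare the class of each result:

```r
d <- data.frame(x = 1, y = 1:10)
class(d[1])     # "data.frame" -- list-style [ keeps a one-column data frame
class(d[[1]])   # "numeric"    -- list-style [[ extracts the column itself
class(d[1, ])   # "data.frame" -- matrix-style row index: a one-row data frame
class(d[, 1])   # "numeric"    -- matrix-style column index drops to a vector
```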

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp, and x is the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied with the solution. Is there a better way to do it?
For those who are familiar with SAS, this is similar to proc expand.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
DF[is.na(DF$x),"x"] <- 0
DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
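The same per-group fill-in can also be done in base R without plyr (a sketch, assuming t should always run 0 to 9, as in the question):

```r
test <- data.frame(
  a = c(1,1,1,1,1,1,1,1,1,1,1),
  b = c(1,1,1,1,1,2,2,2,2,2,2),
  t = c(0,2,3,4,7,3,4,6,7,8,9),
  x = c(1,2,1,2,2,1,1,2,1,1,3))

expand.one <- function(DF) {
  out <- merge(data.frame(t = 0:9), DF[, c("t", "x")], all.x = TRUE)
  out$x[is.na(out$x)] <- 0            # missing timestamps get x = 0
  cbind(a = DF$a[1], b = DF$b[1], out)
}
# split into (a, b) groups, expand each, and stack the pieces back up
test.expanded <- do.call(rbind, by(test, list(test$a, test$b), expand.one))
nrow(test.expanded)
# [1] 20
```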
This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656
