Converting a Summary Table of Binary Outcomes to a Long Tidy DataFrame

I want to convert a table that has several categorical variables, as well as a summary of the results of a binary experiment, into long format so I can easily run a logistic regression model.
Is there an easy way to do this that does not involve just making a bunch of vectors with rep() and then combining those into a dataframe? Ideally, I'd like one function that does this automatically, but maybe I'll just need to make my own.
For example, if I start with this summary table:
test group success n
A control 1 2
A treat 2 3
B control 3 5
B treat 1 3
I want to be able to switch it back to the following format:
test group success
A control 1
A control 0
A treat 1
A treat 1
A treat 0
B control 1
B control 1
B control 1
B control 0
B control 0
B treat 1
B treat 0
B treat 0
Thanks!

The reshape package is your friend here. In this case, melt() and untable() are useful for normalizing the data.
If the example summary data.frame is in a variable called df, an abbreviated answer is:
# load reshape for melt() and untable()
library(reshape)
# replace total n with the number of failures
df$fail <- df$n - df$success
df$n <- NULL
# melt and untable the data.frame
df <- melt(df)
df <- untable(df, df$value)
# recode the results, e.g., here by creating a new data.frame
df <- data.frame(
  test = df$test,
  group = df$group,
  success = as.numeric(df$variable == "success")
)
This is a great example of a very general problem: back-calculating the list of raw data that underlies a cross-tabulation. Given the cross-tabulation, the back-calculated list has one row for each datum and contains the attributes of each datum. Here is a post on the inverse of this question.
In "data geek" parlance, this is a question of putting tabulated data into First Normal Form, if that is helpful to anyone. You can Google data normalization, which will help you design flexible data.frames that can be cross-tabulated and analyzed in many different ways.
In detail, for melt() and untable() to work here, the raw data need to be tweaked a bit to include fail (number of failures) rather than total n, but that is simple enough:
df$fail <- df$n - df$success
df$n <- NULL
which gives:
test group success fail
1 A control 1 1
2 A treat 2 1
3 B control 3 2
4 B treat 1 2
Now we can "melt" the table. melt() can back-calculate the original list of data that was used to create a cross tabulation.
df <- melt(df)
In this case, we get a new column called variable that contains either "success" or "fail", and a column called value that contains the count from the original success or fail column.
test group variable value
1 A control success 1
2 A treat success 2
3 B control success 3
4 B treat success 1
5 A control fail 1
6 A treat fail 1
7 B control fail 2
8 B treat fail 2
The untable() function repeats each row of a table according to the values of a numeric "count" vector. In this case, df$value is the count vector, because it contains the number of successes and failures.
df <- untable(df, df$value)
which will yield one record for each datum, either a "success" or a "fail":
test group variable value
1 A control success 1
2 A treat success 2
2.1 A treat success 2
3 B control success 3
3.1 B control success 3
3.2 B control success 3
4 B treat success 1
5 A control fail 1
6 A treat fail 1
7 B control fail 2
7.1 B control fail 2
8 B treat fail 2
8.1 B treat fail 2
This is the solution. If required, the data can now be recoded to replace "success" with 1 and "fail" with 0 (and get rid of the extraneous value and variable columns...)
df <- data.frame(
test = df$test,
group = df$group,
success = as.numeric(df$variable == "success")
)
This returns the requested solution, though the rows are sorted differently:
test group success
1 A control 1
2 A treat 1
3 A treat 1
4 B control 1
5 B control 1
6 B control 1
7 B treat 1
8 A control 0
9 A treat 0
10 B control 0
11 B control 0
12 B treat 0
13 B treat 0
Obviously, the data.frame can be re-sorted if necessary; see How to sort a data.frame in R.
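For what it's worth, the same normalization can also be sketched with the tidyverse; this is an alternative to the reshape approach above, not the original answer's method, and it assumes a tidyr recent enough to provide pivot_longer() and uncount():
library(dplyr)
library(tidyr)

df %>%
  mutate(fail = n - success) %>%   # number of failures
  select(-n) %>%
  pivot_longer(c(success, fail),
               names_to = "outcome", values_to = "count") %>%
  uncount(count) %>%               # one row per datum
  mutate(success = as.numeric(outcome == "success")) %>%
  select(test, group, success)
Here uncount() plays the role of untable(), repeating each row "count" times.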

Related

R apply function returning multiple values to each row of data frame and add these as new columns to the data frame

Say I have a data frame which is a cross product of the sequence of 1 to 20 with itself:
a <- seq(1,20,1)
combis <- expand.grid(a,a)
colnames(combis) <- c("DaysBack","DaysForward")
So the data looks like:
DaysBack DaysForward
1 1
2 1
...
19 20
20 20
I want to apply a function which takes the days back and days forward, returns several values, and then adds these as new columns to this data frame. So my function would look something like:
## operation to apply on each row
do_something <- function(days_back, days_forward)
{
# logic to work out some values
...
# return those values
c(value_1, value_2, value_3)
}
And I then want to add this to the original data frame, so that "combis" would, for example, look like:
DaysBack DaysForward Value1 Value2 Value3
1 1 5 6 7
2 1 4 2 3
...
19 20 1 9 3
20 20 2 6 8
How do I do this and get back a data frame?
EDIT:
My do_something function currently operates on two values, days_back and days_forward. It uses these in the context of another dataframe called pod, which (for this example) looks something like:
Date Price
2016-01-01 3.1
2016-01-02 3.33
...
2016-04-12 2.12
Now say I pass in days_back=1 and days_forward=2. For each row I find the price 1 day back and the price 2 days forward, and I add the difference as a column called Diff to the data. I do this by adding lead/lag columns as appropriate (I found shift code to do this here: What's the opposite function to lag for an R vector/dataframe?), so I'm not doing any looping. Once I have the differences per row, I calculate the mean and standard deviation of Diff and return these two values. I.e., for the combination days_back=1 and days_forward=2 I get some mean and sd of the diff. Now I want this for all combinations of days_back and days_forward, with each ranging from 1 to 20. In the example data I gave when I first asked the question, mean_diff would correspond to Value1 and sd_diff would correspond to Value2, for example.
So to be clear, currently my do_something operates directly on two values and not on two sets of column vectors. I'm sure it can be rewritten to operate on two vectors, but then again I have the same issue in that I don't know how to return this data so that in the end I get a data frame that looks like what I showed above as my target output.
Thanks
Something like this:
# data
d <- matrix(1,3,2)
# function
foo <- function(x,y) {
m <- cbind(a=x+1,b=y+2) # calculations
m # return
}
# execute the function
res <- foo(d[,1],d[,2])
# add results to data.frame/matrix
cbind(d,res)
Edit: As you asked in the comments, here it is using your data:
a <- seq(1,20,1)
combis <- expand.grid(a,a)
colnames(combis) <- c("DaysBack","DaysForward")
# function
do_something <- function(x,y) cbind(a=x+1,b=y+2)
# results
m <- cbind(combis,do_something(combis$DaysBack,combis$DaysForward))
head(m)
  DaysBack DaysForward a b
1        1           1 2 3
2        2           1 3 3
3        3           1 4 3
4        4           1 5 3
5        5           1 6 3
6        6           1 7 3
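If do_something() genuinely cannot be vectorized and must see one pair of scalars at a time, a hedged alternative is base mapply(); this is a sketch with a toy stand-in for the question's do_something() (the arithmetic here is placeholder, only the shape of the return value matters):
# toy scalar function returning three values per call
do_something <- function(days_back, days_forward) {
  c(days_back + days_forward,
    days_back * days_forward,
    days_back - days_forward)
}
# mapply() calls do_something() once per row; with a length-3 return value
# it simplifies to a 3 x n matrix, so transpose before binding
res <- t(mapply(do_something, combis$DaysBack, combis$DaysForward))
colnames(res) <- c("Value1", "Value2", "Value3")
combis <- cbind(combis, res)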

How to retain/reassign the factor levels for a data.frame in R?

I have a large dataset that I am using to train a machine learning algorithm in R. After all the data preprocessing, I have a dataframe that contains factors and numeric values. I split such dataset into a training set and a test set, and save them to file with write.csv().
When I read back the test.csv and train.csv files it may happen that some of the factors have lost levels. This makes some of the algorithms fail when creating design matrices.
Here is a detailed example. Assume that originally I had a dataset with 12 rows that I split into a training set of 8 rows and a test set of 4 rows. I save the 8 training rows to train.csv and the 4 rows to test.csv. Note that factor2 has levels (a,b,c,d) in train.csv:
factor1 factor2 value
1 1 a 1
2 2 b 0
3 3 c 1
4 4 d 1
5 2 a 0
6 4 c 1
7 3 b 0
8 1 a 1
but only (a,b,c) in test.csv:
factor1 factor2
1 4 a
2 2 b
3 4 c
4 1 a
And the same goes for factor1: level 3 is missing in the test set.
When I read back the file test.csv, R assumes that factor1 has levels (1,2,4) and factor2 has levels (a,b,c). I would like to find a way to tell R the actual levels.
The solution I thought of is to save the levels at the beginning, from the original dataset with 12 points, and then reassign them after reading train.csv and test.csv.
I would like to avoid using the save() method from R, because the datasets that I am creating may go to other languages/packages.
Thanks!
In R, subsetting keeps all factor levels of a vector. Here, let's imagine a is our data, column a is our categorical variable, and b is our response:
a <- data.frame(a = c("a", "b", "c"), b = c(1, 2, 3))
z <- a[1:2, ]
z$a
[1] a b
Levels: a b c
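(As an aside, the inverse operation, dropping the unused levels, is base R's droplevels(); a minimal sketch using z from above:
droplevels(z$a)
# [1] a b
# Levels: a b
)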
If you are losing factor levels in your sub-setting to train and test sets, you need a better way of sub-setting.
If your problem is writing a .csv, you probably want to reinclude them as an NA in the response column. You can do this a ton of ways - here's a merge:
merge(data.frame(a = levels(z$a)), z, all=TRUE)
a b
1 a 1
2 b 2
3 c NA
Edit: From your example, using the first data as dat and the second as dat2, the easiest way is to rebuild each factor with the full set of levels:
dat2$factor1 <- factor(dat2$factor1, levels = levels(dat$factor1))
dat2$factor2 <- factor(dat2$factor2, levels = levels(dat$factor2))
(Assigning to levels(dat2$factor1) directly would be wrong here: levels<- relabels the existing levels positionally, so with test levels (1,2,4) and train levels (1,2,3,4) the value 4 would be renamed to 3 rather than level 3 being added.)
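As a quick check (a sketch reusing dat2 from the example), tabulating the rebuilt factor now shows the empty level:
table(dat2$factor1)
#
# 1 2 3 4
# 1 1 0 2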

Using Table on data frame by multiple variables

I have a data table that is in a "long" format, containing many entries for each unique ID. For example...
id <- c(1,1,1,2,2,2)
date <- c("A","A","B","C","C","C")
loc <- c("X", "X", "X", "X","Y","Z")
dfTest <- data.frame(id,date,loc)
Which creates a sample table.
id date loc
1 1 A X
2 1 A X
3 1 B X
4 2 C X
5 2 C Y
6 2 C Z
My goal is to create a table that looks like this.
id X Y Z
1 2 0 0
2 1 1 1
I would like to see how many times each location was visited on unique dates. ID #1 visited X on day A and day B, giving a total of 2 unique visits. I approached this using reshape, thinking to turn this into a "wide" format, but I don't know how to factor in the second variable (the date). I'm trying to pull out the number of visits to each location on unique dates; the actual date itself does not otherwise matter, it just identifies the duplicate entries.
My current solution would be poor form in R: iterative loops looking at the locations found within each unique date. I was hoping reshape, apply, aggregate, or perhaps another package may be of more help. I've looked through a bunch of other reshape guides, but am still a bit stuck on the clever way to do this.
By the sounds of it, you should be able to do what you need with the one-liner below: unique() drops the duplicated rows, [-2] drops the date column, and table() cross-tabulates id against loc:
table(unique(dfTest)[-2])
## loc
## id X Y Z
## 1 2 0 0
## 2 1 1 1
We can group by 'loc', 'id', get the length of unique elements of 'date' and use dcast to get the expected output.
library(data.table)#v1.9.6+
dcast(setDT(dfTest)[, uniqueN(date), .(loc, id)], id~loc, value.var='V1', fill=0)
# id X Y Z
#1: 1 2 0 0
#2: 2 1 1 1
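A tidyverse sketch of the same idea, for comparison (assuming dplyr and a tidyr recent enough to have pivot_wider()):
library(dplyr)
library(tidyr)

dfTest %>%
  distinct() %>%      # drop duplicate (id, date, loc) rows
  count(id, loc) %>%  # visits per id/loc on unique dates
  pivot_wider(names_from = loc, values_from = n, values_fill = 0)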

Single element assignment inside same within as entire column of data frame

Why is it that I can't assign a value to an entire column of a data frame, and then a single element in the same "within" statement? The code:
foo <- data.frame( a=seq(1,10) )
foo <- within(foo, {
b <- 1 # set all of b to 1
})
foo <- within(foo, {
c <- 1 # set all of c to 1
c[2] <- 20 # set one element to 20
b[2] <- 20
})
foo
Gives:
a b c
1 1 1 1
2 2 20 20
3 3 1 1
4 4 1 20
5 5 1 1
6 6 1 20
7 7 1 1
8 8 1 20
9 9 1 1
10 10 1 20
The value of b is what I expected. The value of c is strange. It seems to do what I expect if the assignment to the entire column (i.e., b <- 1) is in a different "within" statement than the assignment to a single element (i.e., b[2] <- 20), but not if they're in the same "within".
Is this a bug, or something I just don't understand about R?
My guess is that the assignments to new columns are done as you "leave" the function. When doing
c <- 1
c[2] <- 20
all you have really created is a vector c <- c(1, 20). When R has to assign this to a new column, the vector is recycled, creating the 1,20,1,20,... pattern you are seeing.
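A minimal demonstration of that recycling outside within() (using a throwaway vector x):
x <- 1
x[2] <- 20
x
# [1]  1 20
data.frame(a = 1:4, c = x)  # the length-2 vector is recycled down the rows
#   a  c
# 1 1  1
# 2 2 20
# 3 3  1
# 4 4 20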
That's an interesting one.
It has to do with the fact that c is defined only up to length 2, and after that the typical R "recycling rule" takes over and repeats c until it matches the length of the data frame. (And as an aside, this only works for whole multiples: you would not be able to recycle a vector of length 3 or 4 into a data frame of 10 rows.)
Recycling has its critics. I think it is an asset for a dynamically-typed, interpreted language like R, particularly when one wants to interactively explore data. "Expanding" data to fit a container and expression is generally a good thing, even if it gives the odd puzzle as it does here.

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp, and x is the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the dataframe and loops through each of its lines, adding missing lines one at a time. But I'm not entirely satisfied by the solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to PROC EXPAND.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr, which just leaves how to "fill in" the data sets. My approach is to use merge.
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
DF[is.na(DF$x),"x"] <- 0
DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
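For readers on newer setups, tidyr offers a one-call version of the same fill (a sketch assuming the tidyr package is available; complete() expands each observed (a, b) group to the full range of t and fills the missing x values with 0):
library(tidyr)
test.expanded <- complete(test, nesting(a, b), t = 0:9, fill = list(x = 0))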
This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656
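For instance, a minimal sketch with base order(), sorting the result above by t:
test <- test[order(test$t), ]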
