R ddply with multiple variables - r

Here is a simple data frame for my real data set:
df <- data.frame(ID=rep(101:102,each=9),phase=rep(1:3,6),variable=rep(LETTERS[1:3],each=3,times=2),mm1=c(1:18),mm2=c(19:36),mm3=c(37:54))
I would like to group by ID and variable and then, for each value column (mm1, mm2, mm3), subtract the phase 3 value from every phase (phases 1 to 3). That would make mm1 to mm3 all -2 in phase 1, all -1 in phase 2, and all 0 in phase 3.
R throws the error "Error in Ops.data.frame(x, x[3, ]) : - only defined for equally-sized data frames"
when I try:
df1 <- ddply(df, .(ID, variable), function(x) (x - x[3,]))
Any advice would be greatly appreciated.
The output should look like this:
ID phase variable mm1 mm2 mm3
101 1 A -2 -2 -2
101 2 A -1 -1 -1
101 3 A 0 0 0
101 1 B -2 -2 -2
101 2 B -1 -1 -1
101 3 B 0 0 0
101 1 C -2 -2 -2
101 2 C -1 -1 -1
101 3 C 0 0 0
102 1 A -2 -2 -2
102 2 A -1 -1 -1
102 3 A 0 0 0
102 1 B -2 -2 -2
102 2 B -1 -1 -1
102 3 B 0 0 0
102 1 C -2 -2 -2
102 2 C -1 -1 -1
102 3 C 0 0 0

Okay, took me a little bit to figure out what you want, but here is a solution:
library(plyr)
cols.to.sub <- paste0("mm", 1:3)
df1 <- ddply(
  df, .(ID, variable),
  function(x) {
    # subtract the phase 3 row from every row of the mm columns
    x[cols.to.sub] <- t(t(as.matrix(x[cols.to.sub])) - unlist(x[x$phase == 3, cols.to.sub]))
    x
  } )
This produces (first 6 rows):
ID phase variable mm1 mm2 mm3
1 101 1 A -2 -2 -2
2 101 2 A -1 -1 -1
3 101 3 A 0 0 0
4 101 1 B -2 -2 -2
5 101 2 B -1 -1 -1
6 101 3 B 0 0 0
Generally speaking the best way to debug this type of issue is to put a browser() statement inside the function you are passing to ddply, so you can examine the objects at your leisure. Doing so would have revealed that:
The data frame passed to your function includes the ID, phase, and variable columns, so your mm columns are not the first three (hence the need to define cols.to.sub).
Even if you address that, you can't operate on data frames that have unequal dimensions, so what I do here is convert to a matrix and then take advantage of vector recycling to subtract the one row from the rest of the matrix. I need t() (transpose) twice because vector recycling runs down columns.
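As a rough sketch of the same idea without plyr, the double transpose can be replaced by sweep(), which subtracts a vector across the columns of a matrix. The names groups, ref and res are only illustrative, and this assumes every group has exactly one phase 3 row:

```r
df <- data.frame(ID = rep(101:102, each = 9), phase = rep(1:3, 6),
                 variable = rep(LETTERS[1:3], each = 3, times = 2),
                 mm1 = 1:18, mm2 = 19:36, mm3 = 37:54)
cols.to.sub <- paste0("mm", 1:3)

# one data frame per ID/variable combination
groups <- split(df, list(df$variable, df$ID))

res <- do.call(rbind, lapply(groups, function(x) {
  # phase 3 values of mm1..mm3 for this group (assumed to be a single row)
  ref <- unlist(x[x$phase == 3, cols.to.sub])
  # MARGIN = 2 subtracts ref[j] from column j; FUN defaults to "-"
  x[cols.to.sub] <- sweep(as.matrix(x[cols.to.sub]), 2, ref)
  x
}))
rownames(res) <- NULL
head(res)
```

sweep() makes the direction of the subtraction explicit, so there is no need to reason about recycling order.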

Related

in R: how to take value from i+1th row of 1 dataframe and subtract from every row in i+1th column of 2nd dataframe

Note that the actual dataset has thousands of columns and hundreds of rows, so I am looking for a way that does not require me to name columns or rows manually.
With a dataset that has similar structure as follows:
subvalues <- c(1:10)
df <- data.frame(x = rpois(40,2), y = rpois(40,2), z = rpois(40,2), q = rpois(40,2), t = rpois(40,2))
Call the elements of subvalues SVa, SVb, SVc, ...
Call the rows of the data frame's columns Xa, Xb, Xc, ..., Ya, Yb, Yc, ..., etc.
What I am trying to build is a function that first takes the first element of subvalues (SVa) and subtracts it from every row in column x (Xa, Xb, Xc, etc.), then takes the second element (SVb) and subtracts it from every row in column y (Ya, Yb, Yc, etc.), and so on.
What I have so far is:
res <- numeric(length = length(x))
for (i in seq_along(x)) {
res[i] <- xpos - [**SVi+1**]
}
res
I need to figure out the 'SVi+1' indexing and how to properly write the loop within a loop.
Any help is much appreciated.
The example dataset you provide won't work, because the length of subvalues must equal the number of columns in df.
After some modifications, here is an example. You don't need to extract the value from subvalues separately, as it's just a subtraction.
Note that I've copied df into tmp so that the data.frame can be modified without losing your initial data. Also, if the entire data.frame is numeric, consider using a matrix, which can save you time.
subvalues <- c(1:5) # Note here the length 5 for the 5 columns of df.
df <- data.frame(x = rpois(40,2), y = rpois(40,2), z = rpois(40,2), q = rpois(40,2), t = rpois(40,2))
tmp <- df
for(i in seq_along(subvalues)){
# print(subvalues[i])
tmp[,i] <- tmp[,i] - subvalues[i]
}
tmp[,i] is a vector holding column i of the data.frame, so you can subtract a single value from it and save the result back in its original place.
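For what it's worth, the loop above can probably be collapsed into a single sweep() call. This is only a sketch, assuming (as above) that subvalues has one entry per column of df; out and tmp are illustrative names:

```r
set.seed(1)
subvalues <- 1:5  # one value per column of df
df <- data.frame(x = rpois(40, 2), y = rpois(40, 2), z = rpois(40, 2),
                 q = rpois(40, 2), t = rpois(40, 2))

# MARGIN = 2 pairs subvalues[j] with column j; FUN defaults to "-"
out <- sweep(df, 2, subvalues)

# same result as the explicit loop
tmp <- df
for (i in seq_along(subvalues)) {
  tmp[, i] <- tmp[, i] - subvalues[i]
}
all(as.matrix(out) == as.matrix(tmp))
```

Nothing is named by hand here, so it should scale to thousands of columns as long as subvalues lines up with the column order.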
Maybe you can try replicate to create a matrix of same dimensions as df, and do subtraction afterwards, i.e.,
dfout <- df - t(replicate(nrow(df),subvalues))
such that
> dfout
x y z q t
1 0 1 -1 2 -4
2 0 0 0 -2 -1
3 1 1 -2 -2 -3
4 3 0 -2 -3 -2
5 0 0 0 -1 -1
6 3 1 -2 -2 -3
7 3 -2 0 -2 -5
8 1 0 -3 -3 -4
9 1 1 -2 -3 -2
10 -1 1 -2 -2 -4
11 0 0 -2 -2 -3
12 0 2 -3 -4 -2
13 2 0 -1 -4 -2
14 0 -1 1 -2 -4
15 2 -2 0 0 -4
16 1 -2 0 -2 -1
17 2 -1 -1 -2 -3
18 5 0 -1 -2 -2
19 0 0 0 2 -3
20 2 0 -1 -2 -1
21 3 2 -1 -1 -4
22 0 -1 -2 -2 -4
23 1 0 -2 -3 -1
24 -1 -1 3 -3 -3
25 0 0 -1 -1 -1
26 0 -1 -2 -2 -4
27 -1 0 -3 -3 -2
28 0 1 -1 -1 -2
29 3 -2 1 -4 -1
30 0 2 -1 0 -3
31 1 -1 2 -2 -2
32 1 1 0 -2 -4
33 1 -1 -2 -3 -5
34 0 -1 -1 -2 -1
35 2 0 -2 -2 -4
36 1 2 -3 -3 -3
37 2 2 0 -2 -5
38 -1 -1 -3 -4 -2
39 2 1 -1 -3 -4
40 1 3 -1 -3 -2
DATA
set.seed(1)
subvalues <- c(1:5) # Note here the length 5 for the 5 columns of df.
df <- data.frame(x = rpois(40,2), y = rpois(40,2), z = rpois(40,2), q = rpois(40,2), t = rpois(40,2))

n number of column wise subtraction

I need help with column-wise subtraction across N columns. Below are the columns of the input data frame.
input dataframe:
A B C D
1 4 6 2
3 3 3 4
1 2 2 2
4 4 4 4
5 2 3 2
Expected Output:
A B-A C-B D-C
1 3 2 -4
3 0 0 1
1 1 0 0
4 0 0 0
5 -3 1 -1
Similarly, there will be many columns, up to 10.
I am able to write the code for 2 columns:
Code:
df$`B-A` <- df$B - df$A
df$`C-B` <- df$C - df$B
and so on, but this should be done in a loop, as there are almost 10 to 12 columns. Please help me.
Here is a vectorized way to do this:
cbind.data.frame(df[1], df[-1] - df[-ncol(df)])
which gives,
A B C D
1 1 3 2 -4
2 3 0 0 1
3 1 1 0 0
4 4 0 0 0
5 5 -3 1 -1
Here is a more instructive, step-by-step solution:
df <- data.frame(A=c(1,3,1,4,5), B=c(4,3,2,4,2), C=c(6,3,2,4,3), D=c(2,4,2,4,2))
df
Get the pattern:
cbind(df[1], df[2] - df[1], df[3] - df[2], df[4] - df[3]) # solved
Now generalize the pattern over all adjacent column pairs with sapply():
cbind(df[1], sapply(1:(ncol(df)-1), function(i) df[i+1] - df[i]))
Output:
A B C D
1 1 3 2 -4
2 3 0 0 1
3 1 1 0 0
4 4 0 0 0
5 5 -3 1 -1
Using apply() with diff(), you can also try this:
cbind(df[1], t(apply(df, 1, diff)))
Output:
A B C D
1 1 3 2 -4
2 3 0 0 1
3 1 1 0 0
4 4 0 0 0
5 5 -3 1 -1
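None of the outputs above carry the "B-A", "C-B", "D-C" headers from the expected output. A small sketch of one way to add them, building on the vectorized answer (d and out are illustrative names):

```r
df <- data.frame(A = c(1, 3, 1, 4, 5), B = c(4, 3, 2, 4, 2),
                 C = c(6, 3, 2, 4, 3), D = c(2, 4, 2, 4, 2))

# each column minus its left-hand neighbour
d <- df[-1] - df[-ncol(df)]

# label every difference as "<column>-<previous column>"
names(d) <- paste(names(df)[-1], names(df)[-ncol(df)], sep = "-")

out <- cbind(df[1], d)
out
```

Because the names are not syntactic, the new columns have to be accessed with backticks or [[ ]], e.g. out[["B-A"]].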

building matrix out of a vector with the difference of each value

dataset2 <- data.frame(bird=c("A","B","C","D","E","F"), rank=c(1:6))
I have this example dataset and now I want to build a 6*6 matrix with the rank difference between each pair of birds. How can I do this?
Is this what you want?
m <- with(dataset2, outer(rank, rank, '-'))
rownames(m) <- colnames(m) <- dataset2$bird
# A B C D E F
# A 0 -1 -2 -3 -4 -5
# B 1 0 -1 -2 -3 -4
# C 2 1 0 -1 -2 -3
# D 3 2 1 0 -1 -2
# E 4 3 2 1 0 -1
# F 5 4 3 2 1 0
You might also want to do this afterwards:
m[upper.tri(m)] <- 0
tail(m[,-ncol(m)],-1)
To get:
# A B C D E
#B 1 0 0 0 0
#C 2 1 0 0 0
#D 3 2 1 0 0
#E 4 3 2 1 0
#F 5 4 3 2 1
This is more or less the definition of a distance matrix, no?
dist(dataset2$rank, method="maximum")
  1 2 3 4 5
2 1
3 2 1
4 3 2 1
5 4 3 2 1
6 5 4 3 2 1
The difference is that it returns only positive distances, so it may not suit the OP.
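If the absolute differences are acceptable, the dist object can be expanded into a full 6*6 matrix with as.matrix(). A quick sketch (d and m.abs are illustrative names) that also checks it against the outer() approach:

```r
dataset2 <- data.frame(bird = c("A", "B", "C", "D", "E", "F"), rank = 1:6)

# pairwise absolute rank differences, lower triangle only
d <- dist(dataset2$rank)

# expand to a full symmetric matrix and label it with the bird names
m.abs <- as.matrix(d)
dimnames(m.abs) <- list(dataset2$bird, dataset2$bird)

# agrees with the absolute value of the signed outer() result
all(m.abs == abs(outer(dataset2$rank, dataset2$rank, "-")))
```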

Sampling a contingency table in R

I'd like to know if anyone can suggest an efficient method to sample a contingency table such that both the total number of observations and the column totals remain the same.
For example, in the following table, where the rows are cases and the columns are observations, I'd like to "scramble" the observations such that (a) the total number of observations is 54, and (b) the total number of observations in a variable (e.g., A) is 18, the same as the original column total for A.
x<-matrix(c(
4,6,0,0,8,0,0,
1,1,1,1,4,0,0,
3,0,1,1,6,0,1,
2,1,0,0,1,0,0,
1,1,0,1,0,1,1,
2,0,0,2,1,2,0),
ncol=6,byrow=F)
colnames(x)<-c("A","B","C","D","E","F")
I've seen a discussion of contingency table sampling in which the cell frequencies are the source of the sampling probabilities for a sample(...) call. This won't work for my purposes because, among other reasons, the column totals do not remain equal to the original column totals.
Any help would be greatly appreciated,
Patrick
EDIT
If there isn't an easy solution to this problem, perhaps someone can help me with my overly complicated (and failed) attempt. I first create a vector composed of the number of observations of each variable, e.g.,
v <- character(0)
for (h in seq_len(ncol(x))) {
  m <- rep(colnames(x)[h], sum(x[, h]))
  v <- c(v, m)
}
I then sample it to randomly shuffle the observations, and bind it to a sample of values equal to the number of cases
v<-sample(v,length(v))
p<-sample(seq(1:nrow(x)),length(v),T)
n<-as.data.frame(cbind(v,p))
t(table(n))
v
p A B C D E F
1 3 1 3 1 1 1
2 1 1 0 0 0 0
3 3 0 3 0 2 1
4 3 2 1 2 1 2
5 2 1 0 0 0 1
6 3 2 3 1 1 1
7 3 1 2 0 0 1
colSums(t(table(n)))
A B C D E F
18 8 12 4 5 7
This works great except when the sample p fails to contain one of the values in the sequence (i.e., a "case" is missing), which as I've learned happens quite frequently, particularly when there are many iterations of the sample (e.g., 1000).
Thanks again,
Patrick
Another way would be:
indx <- cbind(c(replicate(ncol(x), sample(1:nrow(x)))), c(col(x)))
x1 <- x
x1[] <- x[indx]
colSums(x1)
# A B C D E F
#18 8 12 4 5 7
colSums(x)
#A B C D E F
#18 8 12 4 5 7
sum(x1)
#[1] 54
Update
Based on the new info, which is a bit confusing, maybe this helps:
cSum <- colSums(x)
ind1 <- vector("list", length=ncol(x))
for(i in seq_along(cSum)){
  repeat{
    ind1[[i]] <- sample(0:cSum[i], nrow(x)-1, replace=TRUE)
    if(sum(ind1[[i]]) <= cSum[i]) break
  }
}
x1 <- do.call(cbind, ind1)
x2 <- rbind(x1,cSum-colSums(x1))
colSums(x2)
# A B C D E F
#18 8 12 4 5 7
sum(colSums(x2))
#[1] 54
x2
# A B C D E F
#[1,] 0 0 0 0 0 0
#[2,] 9 5 1 2 0 1
#[3,] 0 1 1 1 0 2
#[4,] 0 0 4 0 0 1
#[5,] 8 0 5 0 4 2
#[6,] 0 0 1 0 1 1
#[7,] 1 2 0 1 0 0
You can use
x.swapped <- apply(x, MARGIN=2, FUN=sample)
apply applies the function passed in the parameter FUN to the columns (if MARGIN is 2, rows when it is 1) of the matrix x.
In this case we apply the sample function.
When called without extra parameters, sample just reorders the elements of the vector (see ?sample for more help).
We can check that the totals in each column remain the same.
colSums(x)
A B C D E F
18 8 12 4 5 7
colSums(x.swapped)
A B C D E F
18 8 12 4 5 7
And obviously
sum(x)
[1] 54
sum(x.swapped)
[1] 54
An example of output may be (note that, unless you fix the RNG seed using set.seed the result from sample will differ each time).
x
A B C D E F
[1,] 4 1 3 2 1 2
[2,] 6 1 0 1 1 0
[3,] 0 1 1 0 0 0
[4,] 0 1 1 0 1 2
[5,] 8 4 6 1 0 1
[6,] 0 0 0 0 1 2
[7,] 0 0 1 0 1 0
x.swapped
A B C D E F
[1,] 6 4 1 0 1 0
[2,] 0 1 3 2 0 0
[3,] 0 0 1 1 1 2
[4,] 0 0 0 0 1 2
[5,] 8 1 1 1 1 2
[6,] 4 1 0 0 1 0
[7,] 0 1 6 0 0 1
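If the goal is to redistribute the observations across cases rather than only permute the existing cell counts, one possible sketch is to draw each column from a multinomial with the original column total as its size. This assumes every case is equally likely to receive an observation, which is my assumption and not something stated in the question; x.new is an illustrative name:

```r
x <- matrix(c(
  4,6,0,0,8,0,0,
  1,1,1,1,4,0,0,
  3,0,1,1,6,0,1,
  2,1,0,0,1,0,0,
  1,1,0,1,0,1,1,
  2,0,0,2,1,2,0),
  ncol = 6, byrow = FALSE)
colnames(x) <- c("A", "B", "C", "D", "E", "F")

set.seed(42)
# for each column, spread its total over the nrow(x) cases at random;
# equal probabilities per case are an assumption of this sketch
x.new <- sapply(colSums(x), function(n) {
  rmultinom(1, size = n, prob = rep(1, nrow(x)))
})

colSums(x.new)  # same as colSums(x)
sum(x.new)      # 54
```

Cases that receive no observations simply show up as zeros, so no row goes missing as it did with the table() approach in the question.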

R function works with some dataframes but not others?

I have a function that summarises the number of missing and non-missing observations in a data frame that is passed to it [1]. I have then been asked to test for differences between two treatment arms in the data I have (personally I disagree with the need or utility of doing so, but it's what I've been asked to do). So I've written a small function to do this...
quick.test <- function(x, y){
  chisq <- chisq.test(x = x, y = y)
  fisher <- fisher.test(x = x, y = y)
  results <- cbind(chisq = chisq$statistic,
                   df = chisq$parameter,
                   p = chisq$p.value,
                   fisher = fisher$p.value)
  results
}
I then use apply() to pass the relevant columns to this function as follows...
apply(miss.t1, 1, function(x) quick.test(x[2:3], x[4:5]))
This is fine for the above specified miss.t1 data frame, but I'm working with time-series data and have three time-points I wish to summarise so have miss.t2 and miss.t3 (each of which is summarising the number of present/missing data for each time point, and have been created in the same manner using the function described in [1]).
miss.t2 fails with the following error...
apply(miss.t2, 1, function(x) quick.test(x[2:3], x[4:5]))
Error in chisq.test(x = x, y = y) :
'x' and 'y' must have at least 2 levels
My initial thought was that one of the columns had a missing value for some reason, but this doesn't appear to be the case...
> describe(miss.t2)
miss.t2
5 Variables 171 Observations
--------------------------------------------------------------------------------
variable
n missing unique
171 0 171
lowest : Abtotal Abyn agg_ment agg_phys All.score
highest: z_pf z_re z_rp z_sf z_vt
--------------------------------------------------------------------------------
nmiss.1
n missing unique Mean
171 0 4 8.649
0 (6, 4%), 8 (9, 5%), 9 (153, 89%), 10 (3, 2%)
--------------------------------------------------------------------------------
npresent.1
n missing unique Mean
171 0 4 9.351
8 (3, 2%), 9 (153, 89%), 10 (9, 5%), 18 (6, 4%)
--------------------------------------------------------------------------------
nmiss.2
n missing unique Mean
171 0 4 10.65
0 (6, 4%), 11 (160, 94%), 12 (4, 2%), 13 (1, 1%)
--------------------------------------------------------------------------------
npresent.2
n missing unique Mean
171 0 4 14.35
12 (1, 1%), 13 (4, 2%), 14 (160, 94%), 25 (6, 4%)
--------------------------------------------------------------------------------
The next thing I tried was running subsets of miss.t2 by taking head(miss.t2, n=XX). It works fine up to row 53 but fails once row 54 is included...
> apply(head(miss.t2, n=53), 1, function(x) quick.test(x[2:3], x[4:5]))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[3,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[4,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[3,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[4,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
There were 50 or more warnings (use warnings() to see the first 50)
> apply(head(miss.t2, n=54), 1, function(x) quick.test(x[2:3], x[4:5]))
Error in chisq.test(x = x, y = y) :
'x' and 'y' must have at least 2 levels
> miss.t2[54,]
variable nmiss.1 npresent.1 nmiss.2 npresent.2
54 psq 10 8 11 14
> traceback()
5: stop("'x' and 'y' must have at least 2 levels") at #2
4: chisq.test(x = x, y = y) at #2
3: quick.test(x[2:3], x[4:5])
2: FUN(newX[, i], ...)
1: apply(head(miss.t2, n = 54), 1, function(x) quick.test(x[2:3],
x[4:5]))
Similarly with the 'bottom' of the data frame the last 26 rows are parsed fine, but not the 27th from last...
> apply(tail(miss.t2, n=26), 1, function(x) quick.test(x[2:3], x[4:5]))
146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[3,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[4,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
164 165 166 167 168 169 170 171
[1,] 0 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1
[3,] 1 1 1 1 1 1 1 1
[4,] 1 1 1 1 1 1 1 1
There were 26 warnings (use warnings() to see them)
> apply(tail(miss.t2, n=27), 1, function(x) quick.test(x[2:3], x[4:5]))
Error in chisq.test(x = x, y = y) :
'x' and 'y' must have at least 2 levels
In addition: Warning message:
In chisq.test(x = x, y = y) : Chi-squared approximation may be incorrect
> miss.t2[118,]
variable nmiss.1 npresent.1 nmiss.2 npresent.2
118 sf16 9 9 11 14
I can't see anything wrong with these two lines that means they should fail and the traceback() shown above doesn't reveal anything useful (to my mind).
Can anyone offer any suggestions as to why or where things are going wrong?
Many thanks in advance,
Neil
EDIT : Formatted reply to Vincent Zoonekynd ...
I opted for the chisq.test(x = x, y = y) version described in ?chisq.test(), using cbind() as you suggest to produce a matrix results in
Error in sum(x) : invalid 'type' (character) of argument.
Adding print statements that show x and y and their lengths results in the same error, but shows the values and lengths as being...
> miss.t2.res <- data.frame(t(apply(miss.t2, 1, function(x) quick.test(x[2:3], x[4:5]))))
[1] "Your x is : 9" "Your x is : 9"
[1] 2 ### < Length of x
[1] "Your y is : 11" "Your y is : 14"
[1] 2 ### < Length of y
Error in chisq.test(x = x, y = y) : 'x' and 'y' must have at least 2 levels
EDIT 2 : Thanks to Vincent Zoonekynd's pointers, the problem turned out to be that when the counts in two cells are identical, the call to chisq.test(x = x, y = y) treats the vectors as factors and collapses them to a single level. The solution was to modify the quick.test() function so that the arguments are coerced to numeric and passed as a matrix. The function that works is now:
quick.test <- function(x, y){
  chisq <- chisq.test(rbind(as.numeric(x), as.numeric(y)))
  fisher <- fisher.test(rbind(as.numeric(x), as.numeric(y)))
  results <- cbind(chisq = chisq$statistic,
                   df = chisq$parameter,
                   p = chisq$p.value,
                   fisher = fisher$p.value)
  results
}
Many thanks for the help & pointers Vincent, very much appreciated.
[1] http://gettinggeneticsdone.blogspot.co.uk/2011/02/summarize-missing-data-for-all.html
The solution suggested by Vincent Zoonekynd in the comments above was to modify the quick.test() function so that the arguments are coerced to numeric and passed as a matrix. The function that works is now:
quick.test <- function(x, y){
  chisq <- chisq.test(rbind(as.numeric(x), as.numeric(y)))
  fisher <- fisher.test(rbind(as.numeric(x), as.numeric(y)))
  results <- cbind(chisq = chisq$statistic,
                   df = chisq$parameter,
                   p = chisq$p.value,
                   fisher = fisher$p.value)
  results
}
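To see the failure in isolation, here is a small sketch using the counts from row 118 above (9, 9 and 11, 14). The two-row data frame miss is a made-up stand-in for miss.t2, and err, tab and ok are illustrative names:

```r
x <- c(9, 9)
y <- c(11, 14)

# With two vectors, chisq.test() converts both to factors and cross-tabulates
# them; x has only one distinct value, hence one level, hence the error.
err <- tryCatch(chisq.test(x = x, y = y),
                error = function(e) conditionMessage(e))
err

# Passing the counts as a 2 x 2 matrix tests the table of counts itself.
tab <- rbind(x, y)
ok <- chisq.test(tab)
ok$parameter  # degrees of freedom

# apply() over a data frame with a character column hands each row to the
# function as a character vector, which is why as.numeric() is needed.
miss <- data.frame(variable = c("psq", "sf16"),
                   nmiss.1 = c(10, 9), npresent.1 = c(8, 9),
                   nmiss.2 = c(11, 11), npresent.2 = c(14, 14))
row.class <- apply(miss, 1, function(r) class(r))
row.class
```

This also fits the "invalid 'type' (character)" error reported in the first edit: the values reaching quick.test() are strings until they are converted.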
