Finding variance of columns from 2 dataframes - r

I have two data frames, A and B:
A <- data.frame(a=c(1,2,3,4,5),b=c(2,4,6,8,10),c=c(3,6,9,12,15),x=c(4,8,12,16,20),y=c(5,10,15,20,25))
B <- data.frame(a=c(1,2,3,4,5),b=c(2,4,6,8,10),c=c(3,6,9,12,15),x=c(4,8,12,16,20),y=c(5,10,15,20,25))
A
a b c x y
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
B
a b c x y
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
Expected Output:
C
a b c x y
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
Both have a key column which is alpha-numeric.
Both data frames have 260 columns in all, of which 250 are floats.
Is there an easy way to compute the variance of each of the 250 columns and store the result in another data frame?

I think you want the difference between the respective columns of the two data frames:
temp = names(A)
data.frame(A["a"], do.call(cbind, lapply(temp[!temp %in% "a"], function(x) A[x] - B[x])))
# a b c x y
#1 1 0 0 0 0
#2 2 0 0 0 0
#3 3 0 0 0 0
#4 4 0 0 0 0
#5 5 0 0 0 0

We can use Map/mapply to find the difference between the corresponding columns of 'A' and 'B'
cbind(A[1], mapply(`-`, A[-1], B[names(A)[-1]]))
# a b c x y
#1 1 0 0 0 0
#2 2 0 0 0 0
#3 3 0 0 0 0
#4 4 0 0 0 0
#5 5 0 0 0 0
Or just
cbind(A[1], A[-1] - B[-1])
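Note that all of the approaches above subtract row by row, so they assume both data frames list their keys in the same order. If that is not guaranteed, one option is to align B to A by the key before subtracting. A minimal sketch, assuming the key column is called `a` and using small made-up data (the values are illustrative, not taken from the question):

```r
# Made-up example data: 'a' is an alphanumeric key, 'b' and 'c' are numeric
A <- data.frame(a = c("k1", "k2", "k3"), b = c(2, 4, 6), c = c(3, 6, 9))
B <- data.frame(a = c("k3", "k1", "k2"), b = c(6, 1, 4), c = c(9, 3, 5))

# Reorder the rows of B so that they follow the key order of A
B_aligned <- B[match(A$a, B$a), ]

# Select the numeric columns by type instead of by position
num_cols <- names(A)[sapply(A, is.numeric)]

# Keep the key and subtract the numeric columns pairwise
C <- cbind(A["a"], A[num_cols] - B_aligned[num_cols])
C
```

Selecting columns with `is.numeric` avoids hard-coding which of the 260 columns are the 250 numeric ones.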

Related

formatting table/matrix in R

I am trying to use a package whose example table is in a particular format. I am very new to R and don't know how to get my data into the same format so that I can use the package.
Their table looks like this:
Recipient
Actor 1 10 11 12 2 3 4 5 6 7 8 9
1 0 0 0 1 3 1 1 2 3 0 2 6
10 1 0 0 1 0 0 0 0 0 0 0 0
11 13 5 0 5 3 8 0 1 3 2 2 9
12 0 0 2 0 1 1 1 3 1 1 3 0
2 0 0 2 0 0 1 0 0 0 2 2 1
3 9 9 0 5 16 0 2 8 21 45 13 6
4 21 28 64 22 40 79 0 16 53 76 43 38
5 2 0 0 0 0 0 1 0 3 0 0 1
6 11 22 4 21 13 9 2 3 0 4 39 8
7 5 32 11 9 16 1 0 4 33 0 17 22
8 4 0 2 0 1 11 0 0 0 1 0 1
9 0 0 3 1 0 0 1 0 0 0 0 0
Where mine at the moment is:
X0 X1 X2 X3 X4 X5
0 0 2 3 3 0 0
1 1 0 4 2 0 0
2 0 0 0 0 0 0
3 0 2 2 0 1 0
4 0 0 3 2 0 2
5 0 0 3 3 1 0
I would like to add the Recipient and Actor labels to mine, as well as change the row and column names to 1, ..., 6.
Also my data is listed under Data in my Workspace and it says:
'num' [1:6,1:6] 0 1 ...
Whereas the example data in the workspace is shown in Values as:
'table' num [1:12,1:12] 0 1 13 ...
Please let me know if you have any suggestions for getting my data into the same type and style as theirs. All help is greatly appreciated!
OK, so you have a matrix like so:
m <- matrix(c(1:9), 3)
rownames(m) <- 0:2
colnames(m) <- paste0("X", 0:2)
# X0 X1 X2
#0 1 4 7
#1 2 5 8
#2 3 6 9
First you need to remove the Xs and turn it into a table:
colnames(m) <- sub("X", "", colnames(m))
m <- as.table(m)
# 0 1 2
#0 1 4 7
#1 2 5 8
#2 3 6 9
Then you can set the dimension names:
names(dimnames(m)) <- c("Actor", "Recipient")
# Recipient
#Actor 0 1 2
# 0 1 4 7
# 1 2 5 8
# 2 3 6 9
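Applied to the 6 x 6 matrix from the question, the same steps might look like the sketch below. The matrix is retyped by hand from the question, and the names `tab` and `ids` are only illustrative. Here the row and column names are replaced outright with 1, ..., 6 rather than stripping the X prefix:

```r
# The asker's matrix, entered column by column (X0 first, X5 last)
m <- matrix(c(0, 1, 0, 0, 0, 0,
              2, 0, 0, 2, 0, 0,
              3, 4, 0, 2, 3, 3,
              3, 2, 0, 0, 2, 3,
              0, 0, 0, 1, 0, 1,
              0, 0, 0, 0, 2, 0),
            nrow = 6,
            dimnames = list(as.character(0:5), paste0("X", 0:5)))

# Convert to a table object, then relabel both dimensions as 1..6
tab <- as.table(m)
ids <- as.character(seq_len(nrow(tab)))
dimnames(tab) <- list(Actor = ids, Recipient = ids)
tab
```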
However, usually you would create the contingency table from raw data using the table function, which would automatically return a table object. So, maybe you should fix the step creating your matrix?
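To illustrate that last point, here is a rough sketch of building the table directly from raw observations. The data frame `obs` and its columns are hypothetical, since the question does not show the raw data:

```r
# Hypothetical raw data: one row per observed interaction
obs <- data.frame(actor     = c(1, 1, 2, 3, 3, 3),
                  recipient = c(2, 3, 1, 1, 1, 2))

# Fixing the factor levels keeps the table square even if some
# individual never appears as an actor or as a recipient
lv <- 1:3
tab <- table(Actor     = factor(obs$actor, levels = lv),
             Recipient = factor(obs$recipient, levels = lv))
tab
```

The result already has class "table" and named dimensions, so no conversion step is needed afterwards.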

How to create a list of data frames as a result of subsetting an old one based on some conditions?

I have the following data frame:
T a b c
1 1 0 0 0
2 2 1 0 0
3 5 1 0 0
4 6 1 0 0
5 7 0 1 0
6 9 0 1 0
7 10 0 0 1
8 12 0 0 0
9 14 0 0 0
10 15 1 0 0
11 16 1 0 0
12 17 0 1 0
13 18 0 0 1
I want to subset this data frame and create a list of data frames. Each data frame should contain the rows (of the original) that form a run of consecutive 1s in column a, followed by a run in column b and finally one in column c. The expected result (for this data frame) would be a list of 2 data frames:
data frame 1:
T a b c
1 2 1 0 0
2 5 1 0 0
3 6 1 0 0
4 7 0 1 0
5 9 0 1 0
6 10 0 0 1
and data frame 2:
T a b c
1 15 1 0 0
2 16 1 0 0
3 17 0 1 0
4 18 0 0 1
Any ideas?
Thank you in advance!
Based on the expected output
i1 <- do.call(pmax, df1[-1])
grp <- inverse.rle(within.list(rle(i1 ==1), {values <- seq_along(values)}))
split(df1[i1==1,], grp[i1==1])
#$`2`
# T a b c
#2 2 1 0 0
#3 5 1 0 0
#4 6 1 0 0
#5 7 0 1 0
#6 9 0 1 0
#7 10 0 0 1
#$`4`
# T a b c
#10 15 1 0 0
#11 16 1 0 0
#12 17 0 1 0
#13 18 0 0 1
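The answer does not define df1, so here is a self-contained sketch of the same run-length idea with the intermediate steps spelled out. It uses `rowSums` and `rep` in place of `pmax` and `inverse.rle`, which is a substitution on my part, and the names `active` and `run_id` are illustrative. Like the answer, it only groups consecutive rows that contain a 1; it does not check that the 1s appear in the order a, b, c.

```r
# Data frame from the question
df1 <- data.frame(T = c(1, 2, 5, 6, 7, 9, 10, 12, 14, 15, 16, 17, 18),
                  a = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0),
                  b = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0),
                  c = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1))

# TRUE for rows that have a 1 in any of a, b, c
active <- rowSums(df1[c("a", "b", "c")]) > 0

# Number the runs of identical values: here the run lengths are 1, 6, 2, 4
r <- rle(active)
run_id <- rep(seq_along(r$lengths), r$lengths)

# Keep only the active rows and split them by run number
res <- split(df1[active, ], run_id[active])
res
```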

How to convert two factors to adjacency matrix in R?

I have a data frame with two columns (key and value) where each column is a factor:
df = data.frame(gl(3,4,labels=c('a','b','c')), gl(6,2))
colnames(df) = c("key", "value")
key value
1 a 1
2 a 1
3 a 2
4 a 2
5 b 3
6 b 3
7 b 4
8 b 4
9 c 5
10 c 5
11 c 6
12 c 6
I want to convert it to adjacency matrix (in this case 3x6 size) like:
1 2 3 4 5 6
a 1 1 0 0 0 0
b 0 0 1 1 0 0
c 0 0 0 0 1 1
So that I can run clustering on it (group keys that have similar values together) with either kmeans or hclust.
Closest that I was able to get was using model.matrix( ~ value, df) which results in:
(Intercept) value2 value3 value4 value5 value6
1 1 0 0 0 0 0
2 1 0 0 0 0 0
3 1 1 0 0 0 0
4 1 1 0 0 0 0
5 1 0 1 0 0 0
6 1 0 1 0 0 0
7 1 0 0 1 0 0
8 1 0 0 1 0 0
9 1 0 0 0 1 0
10 1 0 0 0 1 0
11 1 0 0 0 0 1
12 1 0 0 0 0 1
but results aren't grouped by key yet.
On the other hand, I can collapse this dataset into groups using:
aggregate(df$value, by=list(df$key), unique)
Group.1 x.1 x.2
1 a 1 2
2 b 3 4
3 c 5 6
But I don't know what to do next...
Can someone help to solve this?
An easy way to do it in base R:
res <- table(df)
res[res > 0] <- 1
res
#    value
#key 1 2 3 4 5 6
# a 1 1 0 0 0 0
# b 0 0 1 1 0 0
# c 0 0 0 0 1 1
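For the clustering step mentioned in the question, a possible continuation is sketched below. In the question's own data every key has a disjoint set of values, so all keys are equally far apart; the sketch therefore uses a made-up data frame `df2` in which keys a and b share values. The choice of Euclidean distance (the `dist` default) and of two clusters is an assumption for illustration only.

```r
# Made-up data: keys 'a' and 'b' share the same values, 'c' does not
df2 <- data.frame(key   = c("a", "a", "b", "b", "c", "c"),
                  value = c(1, 2, 1, 2, 5, 6))

# 0/1 adjacency matrix as a plain numeric matrix
counts <- table(df2$key, df2$value)
adj <- matrix(as.numeric(counts > 0), nrow = nrow(counts),
              dimnames = dimnames(counts))

# Hierarchical clustering on the rows (keys), cut into two groups
hc <- hclust(dist(adj))
groups <- cutree(hc, k = 2)
groups
```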

cumulative counter in dataframe R

I have a dataframe with many rows, but the structure looks like this:
year factor
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 1
18 0
19 0
20 0
I need to add a counter as a third column. It should count up over consecutive cells that contain zero and reset to zero whenever the value 1 is encountered. The result should look like this:
year factor count
1 0 0
2 0 1
3 0 2
4 0 3
5 0 4
6 0 5
7 0 6
8 0 7
9 1 0
10 0 1
11 0 2
12 0 3
13 0 4
14 0 5
15 0 6
16 0 7
17 1 0
18 0 1
19 0 2
20 0 3
I would be glad to do it in a quick way, avoiding loops, since I have to do the operations for hundreds of files.
You can reproduce my data frame by pasting it in place of "..." here:
dt <- read.table(text = "...", header = TRUE)
Perhaps a solution like this with ave would work for you:
A <- cumsum(dt$factor)
ave(A, A, FUN = seq_along) - 1
# [1] 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3
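This works because `cumsum` increases by one at every 1, so it labels each stretch that starts at a 1 (plus the leading stretch) with its own group number, and `ave` then numbers the rows within each group. Since the question mentions repeating this over many files, the two lines could be wrapped in a small helper. A sketch, where the function name `zeros_since_one` and the short test vector are my own:

```r
# Counter that restarts at 0 on every 1 and counts up otherwise
zeros_since_one <- function(f) {
  grp <- cumsum(f)                      # group id: number of 1s seen so far
  ave(grp, grp, FUN = seq_along) - 1    # position within group, zero-based
}

f <- c(0, 0, 0, 1, 0, 0, 1, 0)
zeros_since_one(f)
# [1] 0 1 2 0 1 2 0 1
```

With the data frame from the question this would be used as `dt$count <- zeros_since_one(dt$factor)`.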
Original answer:
(Missed that the first value was supposed to be "0". Oops.)
x <- rle(dt$factor == 1)
y <- sequence(x$lengths)
y[dt$factor == 1] <- 0
y
# [1] 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 0 1 2 3

In R, how to replace values in multiple columns with a vector of values equal to the same width?

I am trying to replace every row's values in 2 columns with a vector of length 2. It is easier to show you.
First, here is some data.
set.seed(1234)
x<-data.frame(x=sample(c(0:3), 10, replace=T))
x$ab<-0 #column that will be replaced
x$cd<-0 #column that will be replaced
The data looks like this:
x ab cd
1 0 0 0
2 2 0 0
3 2 0 0
4 2 0 0
5 3 0 0
6 2 0 0
7 0 0 0
8 0 0 0
9 2 0 0
10 2 0 0
Every time x=2 or x=3, I want to set ab=0 and cd=1.
My attempt is this:
x[with(x, which(x==2|x==3)), c(2:3)] <- c(0,1)
Which does not have the intended results:
x ab cd
1 0 0 0
2 2 0 1
3 2 1 0
4 2 0 1
5 3 1 0
6 2 0 1
7 0 0 0
8 0 0 0
9 2 1 0
10 2 0 1
Can you help me?
The reason it doesn't work as you want is that R stores matrices and arrays in column-major layout, and when you assign a shorter array to a longer one, R cycles through the shorter array. For example, if you have
x<-rep(0,20)
x[1:10]<-c(2,3)
then you end up with
[1] 2 3 2 3 2 3 2 3 2 3 0 0 0 0 0 0 0 0 0 0
What is happening in your case is that the sub-array where x is equal to 2 or 3 is being filled in column-wise by cycling through the vector c(0,1). I don't know of any simple way to change this behavior.
Probably the easiest thing to do here is simply fill in the columns one at a time. Or, you could do something like this:
indices<-with(x, which(x==2|x==3))
x[indices,c(2,3)]<-rep(c(0,1),each=length(indices))
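The column-wise filling is easiest to see on a small matrix. The sketch below uses a made-up 3 x 2 matrix rather than the data frame from the question, and contrasts plain recycling with the `rep(..., each = )` fix:

```r
# Plain recycling: c(0, 1) is laid down column 1 first, then column 2
m <- matrix(NA_real_, nrow = 3, ncol = 2)
m[] <- c(0, 1)
m
#      [,1] [,2]
# [1,]    0    1
# [2,]    1    0
# [3,]    0    1

# Repeating each value nrow times fills a whole column per value
m2 <- matrix(NA_real_, nrow = 3, ncol = 2)
m2[] <- rep(c(0, 1), each = nrow(m2))
m2
#      [,1] [,2]
# [1,]    0    1
# [2,]    0    1
# [3,]    0    1
```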
Another alternative: Using a data.table, this is a one-liner:
require(data.table)
DT <- data.table(x)
DT[x%in%2:3,`:=`(ab=0,cd=1)]
Original answer: You can pass a matrix of row-column pairs:
ijs <- expand.grid(with(x, which(x==2|x==3)),c(2:3))
ijs <- ijs[order(ijs$Var1),]
x[as.matrix(ijs)] <- c(0,1)
which yields
x ab cd
1 0 0 0
2 2 0 1
3 2 0 1
4 2 0 1
5 3 0 1
6 2 0 1
7 0 0 0
8 0 0 0
9 2 0 1
10 2 0 1
My original answer worked on my computer, but not a commenter's.
Generalized for multi-columns and multi-values:
mycol<-as.list(names(x)[-1])
myvalue<-as.list(c(0,1))
kk<-Map(function(y,z) list(x[x[,1] %in% c(2,3),y]<-z,x),mycol, myvalue)
myresult<-data.frame(kk[[2]][[2]])
x ab cd
1 1 0 0
2 1 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 3 0 1
7 2 0 1
8 3 0 1
9 3 0 1
10 0 0 0
You could use ifelse:
> set.seed(1234)
> dat<-data.frame(x=sample(c(0:3), 10, replace=T))
> dat$ab <- 0
> dat$cd <- ifelse(dat$x==2 | dat$x==3, 1, 0)
x ab cd
1 0 0 0
2 2 0 1
3 2 0 1
4 2 0 1
5 3 0 1
6 2 0 1
7 0 0 0
8 0 0 0
9 2 0 1
10 2 0 1
What about that?
x[x$x%in%c(2,3),c(2,3)]=matrix(rep(c(0,1),sum(x$x%in%c(2,3))),ncol=2,byrow=TRUE)
x$ab[x$x==2 | x$x==3] <- 0
x$cd[x$x==2 | x$x==3] <- 1
EDIT
Here is a general approach that would work with lots of columns. You simply create a vector of the replacement values you wish to use for each column.
set.seed(1234)
y<-data.frame(x=sample(c(0:3), 10, replace=T))
y$ab<-4 #column that will be replaced
y$cd<-2 #column that will be replaced
y$ef<-0 #column that will be replaced
y
# x ab cd ef
#1 0 4 2 0
#2 2 4 2 0
#3 2 4 2 0
#4 2 4 2 0
#5 3 4 2 0
#6 2 4 2 0
#7 0 4 2 0
#8 0 4 2 0
#9 2 4 2 0
#10 2 4 2 0
replacement.values <- c(10,20,30)
y2 <- y
y2[,2:ncol(y)] <- sapply(2:ncol(y), function(j) {
  apply(y, 1, function(i) {
    ifelse((i[1] %in% c(2,3)), replacement.values[j-1], i[j])
  })
})
y2
# x ab cd ef
#1 0 4 2 0
#2 2 10 20 30
#3 2 10 20 30
#4 2 10 20 30
#5 3 10 20 30
#6 2 10 20 30
#7 0 4 2 0
#8 0 4 2 0
#9 2 10 20 30
#10 2 10 20 30
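The same result can probably be had without the row-wise `apply`, by combining logical row selection with `rep(..., each = )` so that each replacement value fills one column. A sketch under that assumption; the x values are typed in explicitly (copied from the output above) because `sample()` can give different numbers for the same seed across R versions:

```r
# Same layout as y above, with x entered by hand
y <- data.frame(x = c(0, 2, 2, 2, 3, 2, 0, 0, 2, 2), ab = 4, cd = 2, ef = 0)
replacement.values <- c(10, 20, 30)

# Rows to change, and the columns that receive the replacement values
rows <- y$x %in% c(2, 3)
cols <- names(y)[-1]

y2 <- y
# One replacement value per column, repeated once per selected row
y2[rows, cols] <- rep(replacement.values, each = sum(rows))
y2
```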
