Insert new columns based on the union of colnames of two data frames - r

I want to write an R function that inserts several zero-filled columns into an existing data.frame. Here is an example:
Data.frame 1
  A  B C D
1 1  3 4 5
2 4  5 6 7
3 4  5 6 2
4 4 55 2 3
Data.frame 2
    A  B E X
11  5  1 5 5
22 44 55 9 6
33 12  4 2 4
44  9  7 4 2
Based on the union of the two sets of column names (that is, A, B, C, D, E, X), I want to update the two data frames like this:
Data.frame 1 (new)
  A  B C D E X
1 1  3 4 5 0 0
2 4  5 6 7 0 0
3 4  5 6 2 0 0
4 4 55 2 3 0 0
Data.frame 2 (new)
    A  B C D E X
11  5  1 0 0 5 5
22 44 55 0 0 9 6
33 12  4 0 0 2 4
44  9  7 0 0 4 2
Thanks in advance.
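For reference, the example data frames used in the answers below can be recreated like this (a minimal sketch; the numeric labels in front of each row are treated as row names):
df1 <- data.frame(A = c(1, 4, 4, 4),
                  B = c(3, 5, 5, 55),
                  C = c(4, 6, 6, 2),
                  D = c(5, 7, 2, 3))
df2 <- data.frame(A = c(5, 44, 12, 9),
                  B = c(1, 55, 4, 7),
                  E = c(5, 9, 2, 4),
                  X = c(5, 6, 4, 2),
                  row.names = c(11, 22, 33, 44))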

Option 1 (Thanks #Jilber for the edits)
I'm assuming the order of the columns doesn't matter:
# columns that are in df2 but not in df1, zeroed out, then appended to df1
df2part <- subset(df2, select = setdiff(colnames(df2), colnames(df1))) * 0
df1f <- cbind(df1, df2part)
# columns that are in df1 but not in df2, zeroed out, then appended to df2
df1part <- subset(df1, select = setdiff(colnames(df1), colnames(df2))) * 0
df2f <- cbind(df2, df1part)
If the order really matters, then just reorder the columns:
df2f <- df2f[, sort(names(df2f))]
Output
> df1f
  A  B C D E X
1 1  3 4 5 0 0
2 4  5 6 7 0 0
3 4  5 6 2 0 0
4 4 55 2 3 0 0
> df2f
    A  B C D E X
11  5  1 0 0 5 5
22 44 55 0 0 9 6
33 12  4 0 0 2 4
44  9  7 0 0 4 2
Option 2 - using data.table
library(data.table)
df1 <- data.table(df1)
df2 <- data.table(df2)
df1names <- colnames(df1)
df2names <- colnames(df2)
# add each missing column by reference, filled with 0; the parentheses around the
# left-hand side tell data.table to evaluate it as a vector of column names
df1[, (setdiff(df2names, df1names)) := 0]
df2[, (setdiff(df1names, df2names)) := 0]
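If you want to wrap this up as the reusable function the question asks for, a minimal sketch along the lines of Option 1 could look like this (pad_columns is just a hypothetical name):
pad_columns <- function(df, all_names) {
  # add a zero-filled column for every name the data frame is missing
  missing <- setdiff(all_names, colnames(df))
  if (length(missing) > 0) df[missing] <- 0
  # return the columns in a common order
  df[, all_names, drop = FALSE]
}

all_names <- union(colnames(df1), colnames(df2))
df1f <- pad_columns(df1, all_names)
df2f <- pad_columns(df2, all_names)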

Related

Subset data frame based on a column number threshold [duplicate]

I have a question about counting zeros per row. I have a dataframe like this:
a = c(1,2,3,4,5,6,0,2,5)
b = c(0,0,0,2,6,7,0,0,0)
c = c(0,5,2,7,3,1,0,3,0)
d = c(1,2,6,3,8,4,0,4,0)
e = c(0,4,6,3,8,4,0,6,0)
f = c(0,2,5,5,8,4,2,7,4)
g = c(0,8,5,4,7,4,0,0,0)
h = c(1,3,6,7,4,2,0,4,2)
i = c(1,5,3,6,3,7,0,5,3)
j = c(1,5,2,6,4,6,8,4,2)
DF<- data.frame(a=a,b=b,c=c,d=d,e=e,f=f,g=g,h=h,i=i,j=j)
a b c d e f g h i j
1 1 0 0 1 0 0 0 1 1 1
2 2 0 5 2 4 2 8 3 5 5
3 3 0 2 6 6 5 5 6 3 2
4 4 2 7 3 3 5 4 7 6 6
5 5 6 3 8 8 8 7 4 3 4
6 6 7 1 4 4 4 4 2 7 6
7 0 0 0 0 0 2 0 0 0 8
8 2 0 3 4 6 7 0 4 5 4
9 5 0 0 0 0 4 0 2 3 2
I want to count the number of zeros per row. If a row has more than a certain number of zeros, say 4, I want to remove that entire row. The resulting data frame looks like this:
a b c d e f g h i j
2 2 0 5 2 4 2 8 3 5 5
3 3 0 2 6 6 5 5 6 3 2
4 4 2 7 3 3 5 4 7 6 6
5 5 6 3 8 8 8 7 4 3 4
6 6 7 1 4 4 4 4 2 7 6
8 2 0 3 4 6 7 0 4 5 4
Is that possible?? Thank you!
It's not only possible, but very easy:
DF[rowSums(DF == 0) <= 4, ]
You could also use apply:
DF[apply(DF == 0, 1, sum) <= 4, ]
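If you also want to keep the count itself before filtering, a small sketch (n_zero and DF_kept are just illustrative names):
# count zeros per row and store the count in a new column
DF$n_zero <- rowSums(DF == 0)
# keep only the rows with at most 4 zeros
DF_kept <- DF[DF$n_zero <= 4, ]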

identifying rows having common values in two columns

How can I identify the rows whose values in two columns (here: treatment, replicate) also occur together in at least one other row?
set.seed(0)
x <- rep(1:10, 4)
y <- sample(c(rep(1:10, 2)+rnorm(20)/5, rep(6:15, 2) + rnorm(20)/5))
treatment <- sample(gl(8, 5, 40, labels=letters[1:8]))
replicate <- sample(gl(8, 5, 40))
d <- data.frame(x=x, y=y, treatment=treatment, replicate=replicate)
table(d$treatment, d$replicate)
# 1 2 3 4 5 6 7 8
# a 1 0 0 1 1 2 0 0
# b 1 1 0 0 1 2 0 0
# c 0 0 0 0 2 0 1 2
# d 2 0 1 1 0 0 1 0
# e 0 2 1 1 0 0 0 1
# f 0 1 1 0 1 1 1 0
# g 0 1 0 2 0 0 1 1
# h 1 0 2 0 0 0 1 1
From the above output, my guess is that the result should contain 16 rows (eight treatment/replicate combinations appear twice, giving 2 x 8 = 16 rows). Any idea how to achieve this?
Update:
library(dplyr)
d %>% group_by(treatment, replicate) %>% filter(n() > 1)
# A tibble: 16 x 4
x y treatment replicate
<int> <dbl> <fctr> <fctr>
1 2 7.050445 h 3
2 5 1.840198 b 6
3 8 9.160838 d 1
4 9 4.254486 h 3
5 2 8.870106 g 4
6 4 7.821616 a 6
7 6 9.752492 e 2
8 7 9.988579 c 5
9 9 10.480931 c 8
10 1 2.770469 c 8
11 2 7.913338 e 2
12 3 13.743080 d 1
13 9 5.692010 b 6
14 10 11.100722 a 6
15 3 12.198432 g 4
16 5 5.955146 c 5
I have identified this one approach, whose results seem to satisfy the condition. Are there any better solutions?
You can use duplicated as a condition:
dups <- d[which(duplicated(d[, c("treatment", "replicate")]) |
                duplicated(d[, c("treatment", "replicate")], fromLast = TRUE)), ]
> dups
x y treatment replicate
2 2 7.050445 h 3
5 5 1.840198 b 6
8 8 9.160838 d 1
9 9 4.254486 h 3
12 2 8.870106 g 4
14 4 7.821616 a 6
16 6 9.752492 e 2
17 7 9.988579 c 5
19 9 10.480931 c 8
21 1 2.770469 c 8
22 2 7.913338 e 2
23 3 13.743080 d 1
29 9 5.692010 b 6
30 10 11.100722 a 6
33 3 12.198432 g 4
35 5 5.955146 c 5
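A data.table alternative (a sketch, assuming the data.table package is available) keeps only the treatment/replicate groups that occur more than once; note that the grouping columns come first in the result:
library(data.table)
dt <- as.data.table(d)
# keep whole groups that appear more than once
dt[, if (.N > 1) .SD, by = .(treatment, replicate)]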

determining the total number of times the distinct values 0, 1, or NA occur in each column of a data frame in R

I have 15 columns and I want to count, for each column, how many of its values are 0, 1, or NA.
My dataset:
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
NA,1.0,0.0,0.0,NA,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,NA,NA,NA,NA
1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,NA,0.0,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0
1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0
0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0
0.0,1.0,1.0,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
NA,NA,1.0,NA,NA,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
0.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,NA,NA,NA,NA,NA
I want the output to look like this:
A B C D E F G H I J K L M N O
0 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
1 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
NA 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
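Assuming the block above is stored in a CSV file ("dataset.csv" is a hypothetical name), it can be read into the data frame the answer below calls df1:
# "NA" strings are recognised as missing values by read.csv by default
df1 <- read.csv("dataset.csv")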
We can loop over the columns of the dataset and apply table with useNA = "always":
sapply(df1, table, useNA="always")
If a column contains only one of the two values (say only 1), convert it to a factor with levels 0 and 1 so that every column reports counts for both values:
sapply(df1, function(x) table(factor(x, levels = 0:1), useNA = "always"))
# A B C D E F G H I J K L M N O
#0 4 3 8 7 17 15 14 11 14 12 12 10 8 11 9
#1 19 21 17 17 6 9 10 12 8 11 8 10 12 9 11
#<NA> 2 1 0 1 2 1 1 2 3 2 5 5 5 5 5
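An alternative base-R sketch that produces the same three rows (0, 1, NA) without converting to factors, counting each value explicitly:
rbind(`0`  = colSums(df1 == 0, na.rm = TRUE),
      `1`  = colSums(df1 == 1, na.rm = TRUE),
      `NA` = colSums(is.na(df1)))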
