How to suppress printing of 0 lines of a table? - r

I have two factor vectors v1 and v2, which appear to be closely related (the entropy of each is very close to their joint entropy). Indeed, when I do table(v1,v2), I see something like this:
v2
v1 a2 b2 c2
a1 0 100 0
b1 0 0 0
c1 0 0 0
v2
v1 d2 e2 f2
a1 0 0 0
b1 0 0 0
c1 0 0 0
and so on - each factor has dozens of levels, so I get plenty of lines with all 0.
How do I print a table omitting lines which have only zeros in them?

Everybody seems to use rowSums(d)==0 or equivalent, but that will also suppress any row with equal numbers of ones and minus ones or any other zero sum combo. Safer would be to use:
d[ rowSums(d==0) != ncol(d) , ]
I suppose in the case where the object is the result of 'table', there would not be the risk of negative entries, but the risk would occur when this strategy is inappropriately applied to other settings.
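To make the zero-sum pitfall concrete, here is a minimal sketch on a small made-up matrix (the object d and its row names are invented for illustration and are not taken from the question). It compares the sum-based test with a test that counts non-zero cells.

```r
# Hypothetical matrix: r1 sums to zero but is not empty, r2 is all zeros
d <- rbind(r1 = c(1, -1, 0),
           r2 = c(0,  0, 0),
           r3 = c(2,  0, 0))

keep_sum <- rowSums(d) != 0       # sum-based test: also drops r1
keep_any <- rowSums(d != 0) > 0   # keeps any row with at least one non-zero cell

d[keep_any, , drop = FALSE]       # r1 and r3 remain
```

rowSums(d != 0) > 0 selects the same rows as rowSums(d == 0) != ncol(d); it simply counts non-zero cells instead of zero cells.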

Using your example:
v1 <- factor(rep("a1", 100), levels = paste0(letters[1:3], 1))
v2 <- factor(rep("b2", 100), levels = paste0(letters[1:6], 2))
R> tab <- table(v1, v2)
R> tab
v2
v1 a2 b2 c2 d2 e2 f2
a1 0 100 0 0 0 0
b1 0 0 0 0 0 0
c1 0 0 0 0 0 0
Then the rowSums() function will compute the row sums for us. This works because a table is either a vector or a matrix in disguise. Note in the sequence below, which shows the intermediate steps, how we convert the row sums into a logical vector by asking if they exceed 0.
R> rowSums(tab)
a1 b1 c1
100 0 0
R> rowSums(tab) > 0
a1 b1 c1
TRUE FALSE FALSE
R> tab[rowSums(tab) > 0, ]
a2 b2 c2 d2 e2 f2
0 100 0 0 0 0
The above drops the empty dimension. If you want to keep the table format, add drop = FALSE to the call. Note the extra comma in there: we want all columns, hence the empty argument between the two commas:
R> tab[rowSums(tab) > 0, , drop = FALSE]
v2
v1 a2 b2 c2 d2 e2 f2
a1 0 100 0 0 0 0

I'd approach this with rowSums to get a logical vector of those greater than 0, and then use that vector for indexing, as in:
#make an example (please do this for yourself in the future)
d <- table(x=1:5, y=1:5)
d[1, 1] <- 0 #make one row have all 0s
d[rowSums(d) > 0, ]

Borrowing example data from @Gavin's answer
v1 <- factor(rep("a1", 100), levels = paste0(letters[1:3], 1))
v2 <- factor(rep("b2", 100), levels = paste0(letters[1:6], 2))
You can use droplevels to eliminate those levels that do not appear anywhere (equivalent to rows with all 0's, or columns with all 0's)
> table(droplevels(v1), droplevels(v2))
b2
a1 100
If you only want to drop rows:
> table(droplevels(v1), v2)
v2
a2 b2 c2 d2 e2 f2
a1 0 100 0 0 0 0
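If only the finished table is at hand rather than the original factors, a similar result can be sketched by indexing on both margins at once. This is an illustrative variant, not part of the answer above; the name trimmed is made up for the example.

```r
v1 <- factor(rep("a1", 100), levels = paste0(letters[1:3], 1))
v2 <- factor(rep("b2", 100), levels = paste0(letters[1:6], 2))
tab <- table(v1, v2)

# Keep rows and columns that contain at least one count;
# drop = FALSE keeps both dimensions even when a single cell is left
trimmed <- tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]
trimmed
```

Because counts in a table cannot be negative, testing the sums against 0 is safe here.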

Related

How to change values of R cells (dataframe) based on a condition for specific rows?

I have the following dataframe,
C1 C2 C3
0 0 0
1 1 0
0 0 0
1 1 0
0 0 0
I want to now apply the following condition on the dataframe for specific indexes only.
C1 should be equal to 0
A random number should be less than 0.5
If the above conditions match, I want to change the value of the Cell in C1 and C2 to 1 else do nothing.
I am trying the following (rowsIndex holds the specific row indexes on which I want to apply the conditions):
apply(DF[rowsIndex,], 2, fun)
where fun is:
fun<- function(x) {
ifelse(x==0,ifelse(runif(n=1)<0.5,x <- 1,x),x )
print(x)
}
My questions are:
In my function, how do I apply the conditions to a certain column only, i.e. C1? (I have tried using DF[rowsIndex,c(1)], but it gives an error.)
Is there any other approach I can take, since this approach is not giving me any results and the same DF is printed?
Thanks
If you want to stay in base R:
#your dataframe
DF <- data.frame(C1 = c(0, 1, 0, 1, 0),
C2 = c(0, 1, 0, 1, 0),
C3 = c(0, 0, 0, 0, 0))
fun<- function(x) {
if(x[1]==0 && runif(n=1)<0.5) {
x[1:2] <- 1
}
return(x)
}
#your selection of rows you want to process
rowsIndex <- c(1, 2, 3, 4)
#Using MARGIN = 1 applies the function to the rows of a dataframe
#apply() returns a matrix with one column per processed row, so t() turns it back into one row per selected row
DF_processed <- t(apply(DF[rowsIndex,], 1, fun))
#replace the selected rows in the original DF by the processed rows
DF[rowsIndex, ] <- DF_processed
print(DF)
Something like this?
library(dplyr)
df %>%
mutate(across(c(C1, C2), ~ifelse(C1 == 0 & runif(1) < 0.5, 1, .)))
C1 C2 C3
1 1 0 0
2 1 1 0
3 1 0 0
4 1 1 0
5 1 0 0
Applying it to your function:
fun<- function(df, x, y) {
df %>%
mutate(across(c({{x}}, {{y}}), ~ifelse({{x}} == 0 & runif(1) < 0.5, 1, .)))
}
fun(df, C1, C2)
C1 C2 C3
1 0 0 0
2 1 1 0
3 0 0 0
4 1 1 0
5 0 0 0
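Note that runif(1) inside ifelse() draws a single number that is recycled across all rows. If an independent draw per row is wanted, a vectorised base R sketch could look like the following. The seed is arbitrary and only makes the example repeatable, and the helper name hit is made up for illustration.

```r
set.seed(42)  # arbitrary seed, only so the example is repeatable

DF <- data.frame(C1 = c(0, 1, 0, 1, 0),
                 C2 = c(0, 1, 0, 1, 0),
                 C3 = c(0, 0, 0, 0, 0))
rowsIndex <- c(1, 2, 3, 4)

# One random draw per selected row, combined with the C1 == 0 condition
hit <- DF$C1[rowsIndex] == 0 & runif(length(rowsIndex)) < 0.5

# Update C1 and C2 only in the selected rows that met both conditions
DF[rowsIndex[hit], c("C1", "C2")] <- 1
DF
```

Rows outside rowsIndex (here row 5) are never touched, and no apply() call is needed.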

How to mapply a function with multiple variables across multiple lists

I'm a little out of my depth with mapply() and do.call...
So I have two lists like so:
ID START END
a1 1/1/2000 1/30/2000
a2 5/4/2000 3/1/2002
a3 5/8/2004 8/7/2005
a4 1/3/2012 5/7/2015
ID START END
b1 5/1/2000 1/30/2020
b2 6/4/2007 3/1/2008
b3 5/8/2014 8/7/2015
b4 1/3/1999 5/7/2019
Many of the dates overlap with each other, and that's what I'm trying to identify. I'm trying to create a column for each entry on the second list onto the first that says whether or not the date ranges overlap...
ID START END b1 b2 b3 b4
a1 1/1/2000 1/30/2000 0 0 0 1
a2 5/4/2000 3/1/2002 1 0 0 1
a3 5/8/2004 8/7/2005 1 0 0 1
a4 1/3/2012 5/7/2015 1 0 1 1
where a 0 represents un-overlapping events, and 1 represents overlap.
My effort so far has been to use dplyr mutate in a function with multiple variables. Then I'm trying to use mapply to feed the whole lists in as those variables...
builder <- function(id,start,finish){
resource_const_all <- resource_const %>%
mutate(id = ifelse(start > START & start < END,"1",
ifelse(finish > START & finish < END, "1",
ifelse(start < START & finish > END, "1", "0"))))
}
###if the start date falls in the date range, it returns 1.
###if the end date falls in the date range, it returns 1.
###if the start date is before the date range and the end date is after, it
###returns 1.
###Else the dates don't overlap, returns 0.
builder_output <- mapply(builder,id_list,start_list,end_list)
Thanks for any help!
Assume the data shown reproducibly in the Note at the end where we ensure that the START and END columns are of Date class. Then use outer as shown.
Note that overlap is a generic test and overlapAB makes it specific to A and B.
No packages are used.
overlap <- function(start1, end1, start2, end2) {
(start1 >= start2 & start1 <= end2) | (start2 >= start1 & start2 <= end1)
}
overlapAB <- function(idA, idB) {
i <- match(idA, A$ID)
j <- match(idB, B$ID)
overlap(A$START[i], A$END[i], B$START[j], B$END[j])
}
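m <- +outer(A$ID, B$ID, overlapAB) # unary + turns TRUE/FALSE into 1/0
colnames(m) <- B$ID # outer() sets no dimnames for unnamed vectors
cbind(A, m)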
giving:
ID START END b1 b2 b3 b4
1 a1 2000-01-01 2000-01-30 0 0 0 1
2 a2 2000-05-04 2002-03-01 1 0 0 1
3 a3 2004-05-08 2005-08-07 1 0 0 1
4 a4 2012-01-03 2015-05-07 1 0 1 1
Note
LinesA <- "ID START END
a1 1/1/2000 1/30/2000
a2 5/4/2000 3/1/2002
a3 5/8/2004 8/7/2005
a4 1/3/2012 5/7/2015"
LinesB <- "ID START END
b1 5/1/2000 1/30/2020
b2 6/4/2007 3/1/2008
b3 5/8/2014 8/7/2015
b4 1/3/1999 5/7/2019"
fmt <- "%m/%d/%Y"
A <- read.table(text = LinesA, header = TRUE, as.is = TRUE)
A$START <- as.Date(A$START, fmt)
A$END <- as.Date(A$END, fmt)
B <- read.table(text = LinesB, header = TRUE, as.is = TRUE)
B$START <- as.Date(B$START, fmt)
B$END <- as.Date(B$END, fmt)
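As a variation on the same idea, the overlap test can also be applied to the date columns directly, without looking the rows up by ID. The sketch below rebuilds the two data frames with the dates typed in ISO format; the helper names sA, eA, sB, eB, ov and res are made up for illustration.

```r
# Same dates as in the question, typed in ISO format for brevity
A <- data.frame(ID = paste0("a", 1:4),
                START = as.Date(c("2000-01-01", "2000-05-04", "2004-05-08", "2012-01-03")),
                END   = as.Date(c("2000-01-30", "2002-03-01", "2005-08-07", "2015-05-07")))
B <- data.frame(ID = paste0("b", 1:4),
                START = as.Date(c("2000-05-01", "2007-06-04", "2014-05-08", "1999-01-03")),
                END   = as.Date(c("2020-01-30", "2008-03-01", "2015-08-07", "2019-05-07")))

# Work on day counts so outer() only ever sees plain numbers
sA <- as.numeric(A$START); eA <- as.numeric(A$END)
sB <- as.numeric(B$START); eB <- as.numeric(B$END)

# Two closed intervals overlap when each one starts no later than the other ends
ov <- outer(sA, eB, "<=") & outer(eA, sB, ">=")
dimnames(ov) <- list(A$ID, B$ID)

res <- cbind(A, +ov)   # unary + turns TRUE/FALSE into 1/0
res
```

This condition is logically equivalent to the overlap() function above for intervals whose start is not after their end.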

How to create a CSV from a table that came from a merge routine (R)? Somehow the table is not saved and I can't convert it into a dataframe

library(reshape2)
Customer<- c("Susan","Louis", "Frank","Susan")
Seller<- c("Ivan", "Donald","Chris","Ivan")
Service<-c("COU","CAR", "FCL","CAR")
Billingmean<- c(100,200,300,400)
WrsHoldSum<-c(0,0,0,0)
Group<- c("n1","n2"," "," ")
B1<- c(0,2,2,1)
B2<-c(9,8,7,6)
B3<- c(5,4,3,2)
This dataframe includes information such as the billing mean of last year's sales, the seller, and the type of service
df<- data.frame(Customer, Seller,Service, Billingmean,WrsHoldSum, Group,B1,B2,B3)
This section uses dcast to change the configuration of the dframe so I can use this file later in Word to fill some Directories in the "Mail Merge" mode
sub1<- dcast(data= df, formula= Customer+Group+Seller+WrsHoldSum~Service,fun.aggregate= sum,value.var= "Billingmean")
sub2<- dcast(data= df, formula= Customer+Group+Seller+WrsHoldSum~Service,fun.aggregate= sum,value.var= "B1")
sub3<- dcast(data= df, formula= Customer+Group+Seller+WrsHoldSum~Service,fun.aggregate= sum,value.var= "B2")
sub4<- dcast(data= df, formula= Customer+Group+Seller+WrsHoldSum~Service,fun.aggregate= sum,value.var= "B3")
Here I add a suffix to the new columns and merge everything into one table. However, the result is not stored in the environment, so I cannot pass it to the write.csv() function.
tNames <- grep(x = ls(), pattern = "^sub", value = T)
lapply(seq_along(tNames), function(x){
tSym <- as.name(tNames[[x]])
d1 <- copy(eval(tSym))
cols <- grep(x = names(d1), pattern = "^CAR|^COU|^FCL", value = T)
setnames(d1, old = cols, new = paste0(cols, " B", x))
return(d1)
}) %>% Reduce(function(x, y) merge(x, y, by = c("Customer","Group","Seller","WrsHoldSum")), .)
I don't know if there's another way to merge the new columns of Billing 1 (B1), Billing 2 (B2), Billing 3 (B3)...
This is the expected output
Customer Group Seller WrsHoldSum CAR B1 COU B1 FCL B1 CAR B2 COU B2 FCL B2 CAR B3 COU B3 FCL B3 CAR B4 COU B4 FCL B4
1 Frank Chris 0 0 0 300 0 0 2 0 0 7 0 0 3
2 Louis n2 Donald 0 200 0 0 2 0 0 8 0 0 4 0 0
3 Susan Ivan 0 400 0 0 1 0 0 6 0 0 2 0 0
4 Susan n1 Ivan 0 0 100 0 0 0 0 0 9 0 0 5 0
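One likely reason the table seems "not saved" is that the value returned by the lapply() and Reduce() pipeline is only printed and never assigned to a name. A minimal base R sketch of assigning it and writing it out follows. The two small data frames stand in for the dcast() results, and keys, tables, renamed, merged and out_file are names made up for this illustration.

```r
# Hypothetical stand-ins for the dcast() results (sub1, sub2, ...)
sub1 <- data.frame(Customer = c("Frank", "Louis"), Seller = c("Chris", "Donald"),
                   CAR = c(0, 200), FCL = c(300, 0))
sub2 <- data.frame(Customer = c("Frank", "Louis"), Seller = c("Chris", "Donald"),
                   CAR = c(0, 2), FCL = c(2, 0))
keys <- c("Customer", "Seller")
tables <- list(sub1, sub2)

# Suffix the value columns with the table's position so names stay unique
renamed <- lapply(seq_along(tables), function(i) {
  d <- tables[[i]]
  vals <- setdiff(names(d), keys)
  names(d)[match(vals, names(d))] <- paste0(vals, " B", i)
  d
})

# Assign the merged result to a name; otherwise it is only printed
merged <- Reduce(function(x, y) merge(x, y, by = keys), renamed)

out_file <- file.path(tempdir(), "merged.csv")  # hypothetical output path
write.csv(merged, out_file, row.names = FALSE)
```

Once merged exists as an object, it is an ordinary data frame and can be passed to write.csv() or any other function.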

Phylogenetic Tree - how to create a branch by species matrix?

Working with a phylogenetic tree in R, I would like to create a matrix which indicates if each branch of the tree (B1 to B8) is associated with each species (A to E), where 1s indicate that the branch is associated. (Shown below)
The R function which.edge() is useful for identifying the terminal branch for a species, but it doesn't identify ALL the branches associated with each species. What function could I use to identify all the branches in the tree that go from the root to the tip for each species?
Example Tree
library(ape)
ex.tree <- read.tree(text="(A:4,((B:1,C:1):2,(D:2,E:2):1):1);")
plot(ex.tree)
edgelabels() #shows branches 1-8
This is the matrix I would like to create (Species A-E as columns, Branches B1-B8 as rows), but with an easy function rather than by hand.
B1 <- c(1,0,0,0,0)
B2 <- c(0,1,1,1,1)
B3 <- c(0,1,1,0,0)
B4 <- c(0,1,0,0,0)
B5 <- c(0,0,1,0,0)
B6 <- c(0,0,0,1,1)
B7 <- c(0,0,0,1,0)
B8 <- c(0,0,0,0,1)
Mat <- rbind(B1,B2,B3,B4,B5,B6,B7,B8)
colnames(Mat) <- c("A","B","C","D","E")
Mat
For example, Branch B2 goes to species B-E, but not to species A. For Species E, branches B2, B6, B8 are present.
Which R function(s) would be best? Thanks in advance!
I am unaware of any built-in function that does this. I wrote a helper function that can calculate this from the edge data stored in the tree object.
branchNodeAdjacency <- function(x) {
m <- matrix(0, ncol=length(x$tip.label), nrow=nrow(x$edge))
from <- x$edge[,1]
to <- x$edge[,2]
g <- seq_along(x$tip.label)
while (any(!is.na(g))) {
i <- match(g, to)
m[cbind(i, seq_along(i))] <- 1
g <- from[i]
}
rownames(m) <- paste0("B", seq.int(nrow(m)))
colnames(m) <- x$tip.label
m
}
branchNodeAdjacency(ex.tree)
# A B C D E
# B1 1 0 0 0 0
# B2 0 1 1 1 1
# B3 0 1 1 0 0
# B4 0 1 0 0 0
# B5 0 0 1 0 0
# B6 0 0 0 1 1
# B7 0 0 0 1 0
# B8 0 0 0 0 1
The idea is to start at every tip and repeatedly step up to the parent node, marking each edge crossed for that tip, until all paths have reached the root.
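The same walk from tip to root can be sketched one species at a time in base R, which may be easier to follow. The edge matrix below is typed in by hand and is assumed to match the numbering ape uses for the example tree (tips A-E as nodes 1-5, root as node 6); path_to_root is a made-up helper name.

```r
# Edge matrix (parent, child), assumed to match ape's numbering for the example tree
edge <- rbind(c(6, 1), c(6, 7), c(7, 8), c(8, 2),
              c(8, 3), c(7, 9), c(9, 4), c(9, 5))
tips <- c("A", "B", "C", "D", "E")

# Collect the edge rows crossed when walking from one tip up to the root
path_to_root <- function(tip, edge) {
  rows <- integer(0)
  node <- tip
  repeat {
    i <- match(node, edge[, 2])   # row of the edge whose child is `node`
    if (is.na(i)) break           # the root is never a child, so stop here
    rows <- c(rows, i)
    node <- edge[i, 1]            # move up to the parent
  }
  rows
}

m <- matrix(0, nrow = nrow(edge), ncol = length(tips),
            dimnames = list(paste0("B", seq_len(nrow(edge))), tips))
for (j in seq_along(tips)) m[path_to_root(j, edge), j] <- 1
m
```

With a real tree object, the hand-typed matrix would be replaced by the tree's own edge data and tip labels.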

Insert columns by column index

Given the following data frame:
> header = c("A1","A2","A3","B1","B2","B3")
> df = matrix(c(0,0,0,0,0,0),nrow = 1)
> colnames(df) = header
> df
A1 A2 A3 B1 B2 B3
[1,] 0 0 0 0 0 0
I know the column index numbers of the headers containing "2" by:
> index2 = grep("2", colnames(df))
> index2
[1] 2 5
I want to add two extra columns named "A2.1","A2.2" and "B2.1", "B2.2" next to the columns with index 2 and 5, so that:
A1 A2 A2.1 A2.2 A3 B1 B2 B2.1 B2.2 B3
[1,] 0 0 0 0 0 0 0 0 0 0
How can I do this?
Many thanks in advance!
Assuming that you want to insert columns based on 'index2', one option is
df1 <- cbind(df, do.call(cbind,
replicate(2,df[,index2, drop=FALSE], simplify=FALSE)))
df2 <- df1[,order(colnames(df1)), drop=FALSE]
colnames(df2) <- make.unique(colnames(df2))
df2
# A1 A2 A2.1 A2.2 A3 B1 B2 B2.1 B2.2 B3
#[1,] 0 0 0 0 0 0 0 0 0 0
You could try something like this:
set.seed(1234)
df <- data.frame(matrix(runif(100),ncol=5))
colnames(df) <- LETTERS[1:ncol(df)]
B.1 <- runif(20)
df <- cbind(df,B.1)
df <- df[,order(colnames(df))]
#> head(df)
# A B B.1 C D E
#1 0.1137034 0.31661245 0.03545673 0.5533336 0.86483383 0.9264005
#2 0.6222994 0.30269337 0.56507611 0.6464061 0.04185728 0.4719097
#3 0.6092747 0.15904600 0.28025778 0.3118243 0.31718216 0.1426153
#4 0.6233794 0.03999592 0.20419632 0.6218192 0.01374994 0.5442698
#5 0.8609154 0.21879954 0.13373890 0.3297702 0.23902573 0.1961747
#6 0.6403106 0.81059855 0.32568192 0.5019975 0.70649462 0.8985805
In other words, you first attach the new column on the right with cbind() and then reorder the columns by name. Note that this relies on the desired position matching alphabetical order. Hope this helps.
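When the column names do not sort into the desired order, the new columns can instead be placed by position. The following is a minimal sketch using the matrix from the question; reps and idx are helper names made up for illustration.

```r
df <- matrix(0, nrow = 1,
             dimnames = list(NULL, c("A1", "A2", "A3", "B1", "B2", "B3")))
index2 <- grep("2", colnames(df))

# Repeat each selected column three times (original plus two copies), others once
reps <- ifelse(seq_len(ncol(df)) %in% index2, 3, 1)
idx <- rep(seq_len(ncol(df)), times = reps)

df2 <- df[, idx, drop = FALSE]
colnames(df2) <- make.unique(colnames(df2))  # A2, A2.1, A2.2, ...
df2
```

Because the copies are created right next to their source column, no sorting step is involved.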
