I have a data set as follows:
A B C
R1 1 0 1
R2 0 1 0
R3 0 0 0
I want to add another column to the data set, named Index, that gives, for each row, the names of the columns whose value is greater than zero. The result I want is as follows:
A B C Index
R1 1 0 1 A,C
R2 0 1 0 B
R3 0 0 0 NA
Here is one approach using base R:
Use apply to go over the rows, find the elements that are equal to one, and paste together the corresponding column names:
df$Index <- apply(df, 1, function(x) paste(colnames(df)[which(x == 1)], collapse = ", "))
df$Index <- creates a new column called Index that holds the result of the operation
apply - applies a function over the rows and/or columns of a matrix/data frame
1 - specifies that the function should be applied to rows (2 means over columns)
function(x) - an unnamed function which is defined inline; x corresponds to each row
x == 1 - tests which elements of a row are equal to 1 (the output is TRUE/FALSE); which(x == 1) returns the positions of the TRUE elements
colnames(df) - names of the columns of the data frame
colnames(df)[which(x == 1)] - subsets the column names at the positions returned by which(x == 1)
paste with collapse = ", " - collapses a character vector (in this case the vector of column names acquired before) into a single string in which the elements are separated by a comma and a space.
Now replace the empty entries with NA:
df$Index[df$Index == ""] <- NA_character_
Here is what the output looks like:
#output
sample A B C Index
1 R1 1 0 1 A, C
2 R2 0 1 0 B
3 R3 0 0 0 <NA>
data:
structure(list(sample = structure(1:3, .Label = c("R1", "R2",
"R3"), class = "factor"), A = c(1L, 0L, 0L), B = c(0L, 1L, 0L
), C = c(1L, 0L, 0L)), .Names = c("sample", "A", "B", "C"), class = "data.frame", row.names = c(NA,
-3L))
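The question asks for values greater than zero, while the code above tests for values equal to 1. A minimal sketch of the same apply idea with a > 0 test, assuming the numeric columns are A, B and C (num_cols is a name introduced here for illustration), might look like this:

```r
# Same example data as above, with the sample names kept as a column
df <- data.frame(
  sample = c("R1", "R2", "R3"),
  A = c(1L, 0L, 0L),
  B = c(0L, 1L, 0L),
  C = c(1L, 0L, 0L)
)

# Hypothetical helper name: the columns that should be tested
num_cols <- c("A", "B", "C")

# df[num_cols] > 0 is a logical matrix; apply() walks over its rows
df$Index <- apply(df[num_cols] > 0, 1, function(hit) {
  # Return NA directly when no column is positive in this row
  if (any(hit)) paste(num_cols[hit], collapse = ",") else NA_character_
})

df$Index
# "A,C" "B" NA
```

Restricting apply to the numeric columns avoids converting the whole data frame, including the sample column, to a character matrix.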
A slightly different apply() solution:
df$index <- apply(df, 1, function(x) ifelse(any(x), toString(names(df)[x == 1]), NA))
A B C index
R1 1 0 1 A, C
R2 0 1 0 B
R3 0 0 0 <NA>
data:
df <- structure(
list(
A = c(1L, 0L, 0L),
B = c(0L, 1L, 0L),
C = c(1L, 0L, 0L)
),
row.names = paste0('R', 1:3),
class = "data.frame"
)
Background
I recently asked this question. However, I made the example slightly too simple, so I am adding some complexity here: the vector lengths are no longer equal to the table length.
Problem
I have a table as follows:
tableA <- structure(c(1L, 0L, 0L, 0L, 4L, 6L, 0L, 6L, 1L, 3L, 0L, 0L, 0L, 0L, 1L), dim = c(3L,
5L), dimnames = structure(list(c("X", "Y",
"Z"), c("A", "B", "C","D", "E")), names = c("", "")), class = "table")
A B C D E
X 1 0 0 3 0 (two positive numbers)
Y 0 4 6 0 0 (two positive numbers)
Z 0 6 1 0 1 (three positive numbers)
And a list of vectors as follows:
listB <- list(
"X" = c(0, 4),
"Y" = c(4, 5),
"Z" = c(7, 1, 0))
The vectors are not of equal length, but the number of elements in each vector equals the number of positive values in the corresponding row of the table.
I would like to replace all values in tableA, in columns B, C, and D, that are greater than zero with the corresponding values of listB.
Desired output:
A B C D E
X 1 0 0 4 0
Y 0 4 5 0 0
Z 0 7 1 0 1
Previous answer
The original answer by sindri_baldur suggested the following:
cols2replace = match(c('B', 'C', 'D'), colnames(tableA))
cells2replace = tableA[, cols2replace] > 0
tableB = matrix(unlist(listB), nrow = 3, byrow = TRUE)
tableA[, cols2replace][cells2replace] = tableB[, cols2replace][cells2replace]
In my actual data, however, the vectors in the list have unequal lengths. Therefore:
tableB = matrix(unlist(listB), nrow = 3, byrow = TRUE)
does not work.
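One way to see the problem, as a small sketch using the listB from above (the variable names are introduced here for illustration): the list holds seven values in total, which cannot be laid out evenly in a matrix with three rows.

```r
listB <- list(X = c(0, 4), Y = c(4, 5), Z = c(7, 1, 0))

# Number of elements per vector
vec_lengths <- lengths(listB)            # X = 2, Y = 2, Z = 3

# Total number of values after flattening
total_values <- length(unlist(listB))    # 7

# A 3-row matrix needs a multiple of 3 values
fills_evenly <- total_values %% 3 == 0   # FALSE
```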
Suggestion
I am wondering whether I could simply get all positive values from tableA, replace them with all positive values of listB (so that values which are 0 in listB are not replaced), and then put them back in the table. I started as follows:
# Get all positive values from the table
library(dplyr)
library(tidyr)
library(stringr)
out <- tableA %>%
pivot_longer(cols = -rn) %>%
filter(str_detect(value, '\\b0\\b', negate = TRUE)) %>%
group_by(rn) %>%
summarise(freq = list(value), .groups = 'drop')
But the code does not work on a table.
Assuming the number of elements in each vector in listB equals the number of positive values in the corresponding row of tableA, you could solve it with:
tab <- t(tableA)
tab[tab > 0] <- unlist(listB)
tableA[, c('B', 'C', 'D')] <- t(tab)[, c('B', 'C', 'D')]
tableA
# A B C D E
# X 1 0 0 4 0
# Y 0 4 5 0 0
# Z 0 7 1 0 1
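The transpose matters because R fills and indexes matrices column by column, so the positive cells of t(tableA) are visited row by row with respect to tableA. A hedged sketch of how the stated assumption could be checked before replacing, using a plain matrix in place of the table object (an assumption made here for brevity; the variable names are illustrative):

```r
tableA <- matrix(
  c(1L, 0L, 0L, 0L, 4L, 6L, 0L, 6L, 1L, 3L, 0L, 0L, 0L, 0L, 1L),
  nrow = 3,
  dimnames = list(c("X", "Y", "Z"), c("A", "B", "C", "D", "E"))
)
listB <- list(X = c(0, 4), Y = c(4, 5), Z = c(7, 1, 0))

# Number of positive cells in each row of the table
positives_per_row <- rowSums(tableA > 0)

# The list must be in the same row order and have matching lengths
same_order  <- identical(names(listB), rownames(tableA))
same_counts <- all(positives_per_row == lengths(listB))
stopifnot(same_order, same_counts)

# With the assumption confirmed, the replacement from above is safe
tab <- t(tableA)
tab[tab > 0] <- unlist(listB)
tableA[, c("B", "C", "D")] <- t(tab)[, c("B", "C", "D")]
```

If the check fails, the values of listB would be assigned to the wrong cells without any error, so it seems worth running first.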
You could also solve your problem as follows:
x = unlist(listB)
tableA = t(tableA)
tableA[c("B", "C", "D"), ][which(tableA[c("B", "C", "D"), ]>0)] = x[x>0]
tableA = t(tableA)
A B C D E
X 1 0 0 4 0
Y 0 4 5 0 0
Z 0 7 1 0 1
I want to check whether the columns named in one data frame are present in another data frame, and whether the values of those columns in the second data frame are non-zero. For example,
I have a data frame df1 as follows:
indx1 indx2
aa 1 ac
ac tg 0
I have another data frame df as follows:
col1 aa 1 ab 2 ac bd 5 tg 0
A 1 0 0 1 4
B 0 0 1 1 0
C 1 1 0 1 1
D 0 0 0 5 5
E 0 0 1 0 9
I want to check whether any row of df satisfies the criterion df1[i,1] > 0 and df1[i,2] > 0, where i goes from 1 to nrow(df1). For example:
When i = 1, I want to check whether any row of df satisfies the condition aa > 0 & ac > 0. Since none of the rows satisfies the condition, the code should return 0. When i = 2, the condition is ac > 0 & tg > 0. Here one row of df (the 5th row) satisfies the condition, so the code should return 1. The output should be saved to a new column of df1, as follows:
indx1 indx2 count_occ
aa 1 ac 0
ac tg 0 1
I have tried as follows:
for(i in 1:nrow(df1)){
d1 = subset(df, as.name(df1[i,1]) > 0 & as.name(df1[i,2]) > 0)
if(nrow(d1) >= 1){
df1[i,3] = 1
}else{
df1[i,3] = 0
}
}
But d1 = subset(df, as.name(df1[i,1]) > 0 & as.name(df1[i,2]) > 0) is not giving me the correct output. Any help would be highly appreciated. TIA.
We can use Map. Note that in the data below the names are swapped relative to the question: 'df' holds the index pairs and 'df1' holds the values.
Loop over the 'indx1' and 'indx2' columns of 'df' in Map
Extract the corresponding columns of 'df1' - df1[[x]], df1[[y]]
Create the combined logical expression with > and &
Check if there is any TRUE value among the rows of 'df1' - any
Coerce to binary - +( ), or use as.integer
Convert the list output to a vector - unlist - and assign it to create the 'count_occ' column in 'df'
df$count_occ <- unlist(Map(function(x, y)
+(any(df1[[x]] > 0 & df1[[y]] > 0, na.rm = TRUE)), df$indx1, df$indx2))
-output
df
indx1 indx2 count_occ
1 aa 1 ac 0
2 ac tg 0 1
data
df <- structure(list(indx1 = c("aa 1", "ac"), indx2 = c("ac", "tg 0"
)), class = "data.frame", row.names = c(NA, -2L))
df1 <- structure(list(col1 = c("A", "B", "C", "D", "E"), `aa 1` = c(1L,
0L, 1L, 0L, 0L), `ab 2` = c(0L, 0L, 1L, 0L, 0L), ac = c(0L, 1L,
0L, 0L, 1L), `bd 5` = c(1L, 1L, 1L, 5L, 0L), `tg 0` = c(4L, 0L,
1L, 5L, 9L)), class = "data.frame", row.names = c(NA, -5L))
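To make the steps concrete, here is a small sketch that evaluates a single pair by hand, using a subset of the columns of 'df1' from the data above (the intermediate variable names are introduced here for illustration):

```r
df1 <- data.frame(
  col1 = c("A", "B", "C", "D", "E"),
  `aa 1` = c(1L, 0L, 1L, 0L, 0L),
  ac = c(0L, 1L, 0L, 0L, 1L),
  `tg 0` = c(4L, 0L, 1L, 5L, 9L),
  check.names = FALSE  # keep the column names with spaces as they are
)

# Second pair: 'ac' and 'tg 0'
both_positive <- df1[["ac"]] > 0 & df1[["tg 0"]] > 0
both_positive   # FALSE FALSE FALSE FALSE TRUE

any_row <- any(both_positive, na.rm = TRUE)   # TRUE
as_flag <- +any_row                           # 1

# First pair: 'aa 1' and 'ac' never overlap, so the flag is 0
first_pair_flag <- +any(df1[["aa 1"]] > 0 & df1[["ac"]] > 0, na.rm = TRUE)
```

Map simply repeats this computation for every pair of names taken from 'indx1' and 'indx2'.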
I think the following solution can be used with a for loop:
for(i in 1:nrow(df)) {
for(j in 1:nrow(df1)) {
if(df1[j, df[i, 1]] > 0 & df1[j, df[i, 2]] > 0) {
df1[j, "id"] <- 1
} else {
df1[j, "id"] <- 0
}
}
if(any(df1$id == 1)) {
df[i, "count"] <- 1
} else {
df[i, "count"] <- 0
}
}
indx1 indx2 count
1 aa 1 ac 0
2 ac tg 0 1
I want to loop through a large data frame, counting in the first column how many values are > 0 and removing the rows that were counted, then moving on to column 2, counting the number of values > 0 and removing those rows, and so on.
the data frame
taxonomy A B C
1 cat 0 2 0
2 dog 5 1 0
3 horse 3 0 0
4 mouse 0 0 4
5 frog 0 2 4
6 lion 0 0 2
can be generated with
DF1 = structure(list(taxonomy = c("cat", "dog","horse","mouse","frog", "lion"),
A = c(0L, 5L, 3L, 0L, 0L, 0L), B = c(2L, 1L, 0L, 0L, 2L, 0L), C = c(0L, 0L, 0L, 4L, 4L, 2L)),
.Names = c("taxonomy", "A", "B", "C"),
row.names = c(NA, -6L), class = "data.frame")
and I expect the outcome to be
A B C
count 2 2 2
I wrote this loop, but it does not remove the rows as it goes:
res <- data.frame(DF1[1,], row.names = c('count'))
for(n in 1:ncol(DF1)) {
res[colnames(DF1)[n]] <- sum(DF1[n])
DF1[!DF1[n]==1]
}
It gives this incorrect result:
A B C
count 2 3 3
You could do ...
DF = DF1[, -1]
cond = DF != 0
p = max.col(cond, ties="first")
fp = factor(p, levels = seq_along(DF), labels = names(DF))
table(fp)
# A B C
# 2 2 2
To account for rows that are all zeros, I think this works:
fp[rowSums(cond) == 0] <- NA
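A small sketch of why the all-zero adjustment is needed, on made-up data with one all-zero row (the matrix m and its row names are assumptions for illustration): with ties.method = "first", max.col returns column 1 for a row in which every entry is tied, so such a row would otherwise be counted under A.

```r
# Illustrative data: the middle row contains only zeros
m <- matrix(
  c(0, 2, 0,
    0, 0, 0,
    0, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("cat", "none", "lion"), c("A", "B", "C"))
)

cond <- m != 0

# cond + 0 turns the logical matrix into a numeric one for max.col
p <- max.col(cond + 0, ties.method = "first")   # 2 1 3
fp <- factor(p, levels = seq_len(ncol(m)), labels = colnames(m))

before <- table(fp)   # A = 1, B = 1, C = 1: the all-zero row lands in A

# Mark rows without any non-zero entry as missing
fp[rowSums(cond) == 0] <- NA
after <- table(fp)    # A = 0, B = 1, C = 1
```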
We can update the dataset in each run. Create a temporary dataset without the 'taxonomy' column ('tmp') and initiate a named vector ('n'). Then loop through the columns of 'tmp': get a logical index based on whether the column is greater than 0 ('i1'), take the sum of the TRUE values to update 'n' for the corresponding column, and update 'tmp' by removing those rows using 'i1' as the row index.
tmp <- DF1[-1]
n <- setNames(numeric(ncol(tmp)), names(tmp))
for(i in seq_len(ncol(tmp))) {
i1 <- tmp[[i]] > 0
n[i] <- sum(i1)
tmp <- tmp[!i1, ]
}
n
# A B C
# 2 2 2
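It can also be done with Reduce. Note that from the second step on, x is shorter than the full-length column y, so y[!x] relies on recycling of the logical index; the counts match for this example, but check the result on your own data.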
sapply(Reduce(function(x, y) y[!x] > 0, DF1[3:4],
init = DF1[,2] > 0, accumulate = TRUE ), sum)
#[1] 2 2 2
Or using accumulate from purrr
library(purrr)
accumulate(DF1[3:4], ~ .y[!.x] > 0, .init = DF1[[2]] > 0) %>%
map_int(sum)
#[1] 2 2 2
This is easy with Reduce and sapply:
> first <- Reduce(function(a,b) b[a==0], DF1[-1], accumulate=TRUE)
> first
[[1]]
[1] 0 5 3 0 0 0
[[2]]
[1] 2 0 2 0
[[3]]
[1] 0 4 2
> then <- sapply(setNames(first, names(DF1[-1])), function(x) length(x[x>0]))
> then
A B C
2 2 2
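The Reduce-based one-liners above index a full-length column with a vector produced by the previous, shorter step, so from the third column on they depend on index recycling. A hedged sketch of a variant that keeps every vector at full length by tracking which rows have already been counted (count_first_hits is a hypothetical helper name, not part of any package):

```r
DF1 <- data.frame(
  taxonomy = c("cat", "dog", "horse", "mouse", "frog", "lion"),
  A = c(0L, 5L, 3L, 0L, 0L, 0L),
  B = c(2L, 1L, 0L, 0L, 2L, 0L),
  C = c(0L, 0L, 0L, 4L, 4L, 2L)
)

# Hypothetical helper: count, per column, the rows whose first
# positive value occurs in that column
count_first_hits <- function(d) {
  hits <- lapply(d, function(col) col > 0)
  # Rows counted up to and including each column
  seen_after <- Reduce(`|`, hits, accumulate = TRUE)
  # Rows counted strictly before each column
  seen_before <- c(list(rep(FALSE, nrow(d))), head(seen_after, -1))
  mapply(function(hit, seen) sum(hit & !seen), hits, seen_before)
}

counts <- count_first_hits(DF1[-1])
counts
# A B C
# 2 2 2

# A case where the row order makes recycling matter
diag_counts <- count_first_hits(
  data.frame(A = c(1, 0, 0), B = c(0, 1, 0), C = c(0, 0, 1))
)
diag_counts
# A B C
# 1 1 1
```

On the second data set, indexing C with the length-2 result of the B step would pick the wrong rows, whereas the full-length bookkeeping counts one row per column.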
I have a data frame of arbitrary but non-trivial size. Each entry has one of three distinct values 0, 1, or 2 randomly distributed. For example:
col.1 col.2 col.3 col.4 ...
0 0 1 0 ...
0 2 2 1 ...
2 2 2 2 ...
0 0 0 0 ...
0 1 1 1 ...
... ... ... ... ...
My goal is to remove any row that contains only one unique element, or, put differently, to select only those rows with at least two distinct elements. Originally I selected those rows where the row mean was not a whole number, but I realized that this could eliminate rows containing equal numbers of 0s and 2s, which I want to keep.
My current thought process is to use unique on each row of the data frame, followed by length to determine how many unique elements each row contains, but I can't seem to get the syntax right. I'm looking for something like this:
DataFrame[length(unique(DataFrame)) != 1, ]
Try any of these:
nuniq <- function(x) length(unique(x))
subset(dd, apply(dd, 1, nuniq) >= 2)
subset(dd, apply(dd, 1, sd) > 0)
subset(dd, apply(dd[-1] != dd[[1]], 1, any))
subset(dd, rowSums(dd[-1] != dd[[1]]) > 0)
subset(dd, lengths(lapply(as.data.frame(t(dd)), unique)) >= 2)
subset(dd, lengths(apply(dd, 1, table)) >= 2)
# nuniq is from above
subset(dd, tapply(as.matrix(dd), row(dd), nuniq) >= 2)
giving:
col.1 col.2 col.3 col.4
1 0 0 1 0
2 0 2 2 1
5 0 1 1 1
Alternatives to nuniq
In the above nuniq could be replaced with any of these:
function(x) nlevels(factor(x))
function(x) sum(!duplicated(x))
function(x) length(table(x))
dplyr::n_distinct
Note
dd in reproducible form is:
dd <- structure(list(col.1 = c(0L, 0L, 2L, 0L, 0L), col.2 = c(0L, 2L,
2L, 0L, 1L), col.3 = c(1L, 2L, 2L, 0L, 1L), col.4 = c(0L, 1L,
2L, 0L, 1L)), class = "data.frame", row.names = c(NA, -5L))
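As a quick illustration of what the apply-based variants compute, this sketch shows the number of distinct values per row for dd and the resulting selection (the variable names are introduced here for illustration):

```r
dd <- data.frame(
  col.1 = c(0L, 0L, 2L, 0L, 0L),
  col.2 = c(0L, 2L, 2L, 0L, 1L),
  col.3 = c(1L, 2L, 2L, 0L, 1L),
  col.4 = c(0L, 1L, 2L, 0L, 1L)
)

# Count the distinct values in each row
n_distinct_per_row <- apply(dd, 1, function(x) length(unique(x)))
n_distinct_per_row
# 2 3 1 1 2

# Keep the rows with at least two distinct values
kept <- dd[n_distinct_per_row >= 2, ]
rownames(kept)
# "1" "2" "5"
```

Rows 3 and 4 are dropped because each holds a single repeated value.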
What about something like this:
# some fake data
df<-data.frame(col1 = c(2,2,1,1),
col2 = c(1,0,2,0),col3 = c(0,0,0,0))
col1 col2 col3
1 2 1 0
2 2 0 0
3 1 2 0
4 1 0 0
# first we can convert 0 to NA
df[df == 0] <- NA
# a function that calculates the length of uniques, not counting NA as levels
fun <- function(x){
res <- unique(x[!is.na(x)])
length(res)
}
# apply it: not counting na, we can use 2 as threshold
df <- df[apply(df,1,fun)>=2,]
# convert the na to 0 as original
df[is.na(df)] <- 0
df
col1 col2 col3
1 2 1 0
3 1 2 0
I have a data frame in the following format
1 2 a b c
1 a b 0 0 0
2 b 0 0 0
3 c 0 0 0
I want to fill columns a through c with a TRUE/FALSE that says whether the column name is in columns 1 or 2
1 2 a b c
1 a b 1 1 0
2 b 0 1 0
3 c 0 0 1
I have a dataset of about 530,000 records, 4 description columns, and 95 output columns, so a for loop is not practical. I have tried code in the following format, but it was too time-consuming:
for(i in 3:5) {
for(j in 1:3) {
for(k in 1:2){
if(df[j,k]==colnames(df)[i]) df[j, i]=1
}
}
}
Is there an easier, more efficient way to achieve the same output?
Thanks in advance!
One option is mtabulate from qdapTools
library(qdapTools)
df1[-(1:2)] <- mtabulate(as.data.frame(t(df1[1:2])))[-3]
df1
# 1 2 a b c
#1 a b 1 1 0
#2 b 0 1 0
#3 c 0 0 1
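A base R sketch without extra packages, assuming the two description columns are named 1 and 2 as in the example (targets is a name introduced here for illustration), could compare each indicator column name against both description columns at once:

```r
df1 <- data.frame(
  `1` = c("a", "b", "c"),
  `2` = c("b", "", ""),
  a = 0L, b = 0L, c = 0L,
  check.names = FALSE,        # keep "1" and "2" as column names
  stringsAsFactors = FALSE
)

# Names of the indicator columns to fill
targets <- names(df1)[-(1:2)]

# For each target name, a vectorised comparison over all rows
df1[targets] <- lapply(targets, function(nm) {
  as.integer(df1[["1"]] == nm | df1[["2"]] == nm)
})

df1
#   1 2 a b c
# 1 a b 1 1 0
# 2 b   0 1 0
# 3 c   0 0 1
```

This loops only over the indicator columns, not over the rows, which is usually what makes the difference on a large data set.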
Or we melt the dataset after converting to matrix, use table to get the frequencies, and assign the output to the columns that are numeric.
library(reshape2)
df1[-(1:2)] <- table(melt(as.matrix(df1[1:2]))[-2])[,-1]
Or we can 'paste' the first two columns and use cSplit_e to get the binary format.
library(splitstackshape)
cbind(df1[1:2], cSplit_e(as.data.table(do.call(paste, df1[1:2])),
'V1', ' ', type='character', fill=0, drop=TRUE))
data
df1 <- structure(list(`1` = c("a", "b", "c"), `2` = c("b", "", ""),
a = c(0L, 0L, 0L), b = c(0L, 0L, 0L), c = c(0L, 0L, 0L)), .Names = c("1",
"2", "a", "b", "c"), class = "data.frame", row.names = c("1",
"2", "3"))