Select columns based on columns sum - r

Any suggestions on how to select, for each row, the columns where the value is 1 and the column sum is also 1? In other words, I want to select only the unique values, i.e. those not shared with the other individuals.
indv. X Y Z W T J
A 1 0 1 0 0 1
B 0 1 1 0 0 0
C 0 0 1 1 0 0
D 0 0 1 0 1 0
A: X, J
B: Y
C: W
D: T

Here you go! A solution in base r.
First we simulate your data, a data.frame with named rows and columns.
You can use sapply() to loop over the column indices.
A for-loop over the column indices will achieve the same thing.
Finally, save the results in a data.frame however you want.
# Simulate your example data
df <- data.frame(matrix(c(1, 0, 1, 0, 0, 1,
0, 1, 1, 0, 0, 0,
0, 0, 1, 1, 0, 0,
0, 0, 1, 0, 1, 0), nrow = 4, byrow = T))
# Names rows and columns accordingly
names(df) <- c("X", "Y", "Z", "W", "T", "J")
rownames(df) <- c("A", "B","C", "D")
> df
X Y Z W T J
A 1 0 1 0 0 1
B 0 1 1 0 0 0
C 0 0 1 1 0 0
D 0 0 1 0 1 0
Then we select the columns whose sum equals 1, i.e. the columns with unique values.
For every one of these columns, we find the row of this value.
# Select columns with unique values (if sum of column == 1)
unique.cols <- which(colSums(df) == 1)
# For every one of these columns, select the row where row-value==1
unique.rows <- sapply(unique.cols, function(x) which(df[, x] == 1))
> unique.cols
X Y W T J
1 2 4 5 6
> unique.rows
X Y W T J
1 2 3 4 1
The result is not labelled with row names yet (its elements still carry the names of unique.cols), so we index rownames(df) to get the row names.
# Data.frame of unique values
# Rows and columns in separate columns
df.unique <- data.frame(Cols = unique.cols,
Rows = unique.rows,
Colnames = names(unique.cols),
Rownames = rownames(df)[unique.rows],
row.names = NULL)
The result:
df.unique
Cols Rows Colnames Rownames
1 1 1 X A
2 2 2 Y B
3 4 3 W C
4 5 4 T D
5 6 1 J A
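The for-loop alternative mentioned above could look like the following minimal sketch. It rebuilds the example data so that it runs on its own; otherwise it only uses the names already introduced (df, unique.cols, unique.rows).

```r
# Example data from the question, rebuilt so the sketch is self-contained
df <- data.frame(X = c(1, 0, 0, 0), Y = c(0, 1, 0, 0), Z = c(1, 1, 1, 1),
                 W = c(0, 0, 1, 0), T = c(0, 0, 0, 1), J = c(1, 0, 0, 0),
                 row.names = c("A", "B", "C", "D"))

# Columns whose sum is exactly 1
unique.cols <- which(colSums(df) == 1)

# Pre-allocate the result and fill it one column at a time
unique.rows <- integer(length(unique.cols))
names(unique.rows) <- names(unique.cols)
for (k in seq_along(unique.cols)) {
  # Row index of the single 1 in this column
  unique.rows[k] <- which(df[, unique.cols[k]] == 1)
}

unique.rows
```

The loop gives the same named vector as the sapply() call; which form to use is mostly a matter of taste.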
Edit:
This is how you could summarise the values per row using dplyr.
library(dplyr)
df.unique %>% group_by(Rownames) %>%
summarise(paste(Colnames, collapse=", "))
# A tibble: 4 x 2
Rownames `paste(Colnames, collapse = ", ")`
<fct> <chr>
1 A X, J
2 B Y
3 C W
4 D T
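If dplyr is not available, the same per-row summary can be sketched in base R with tapply(). The two vectors below are typed in by hand to mirror the Colnames and Rownames columns of df.unique, so the snippet runs on its own.

```r
# Hand-typed copies of df.unique$Colnames and df.unique$Rownames
Colnames <- c("X", "Y", "W", "T", "J")
Rownames <- c("A", "B", "C", "D", "A")

# Collapse the column names belonging to each row name
per.row <- tapply(Colnames, Rownames, toString)
per.row
```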

One idea is to use rowwise apply to find the columns with 1, after we filter out the columns with sum != to 1, i.e.
apply(df[colSums(df) == 1], 1, function(i) names(df[colSums(df) == 1])[i == 1])
$A
[1] "X" "J"
$B
[1] "Y"
$C
[1] "W"
$D
[1] "T"
You can play around with the output to get it to desired state, i.e.
apply(df[colSums(df) == 1], 1, function(i) toString(names(df[colSums(df) == 1])[i == 1]))
# A B C D
#"X, J" "Y" "W" "T"
Or
data.frame(cols = apply(df[colSums(df) == 1], 1, function(i) toString(names(df[colSums(df) == 1])[i == 1])))
# cols
#A X, J
#B Y
#C W
#D T

Here is an option with tidyverse. We gather the dataset to 'long' format, group by 'key', filter the rows where 'val' is 1 and the sum of 'val' is 1, then group by 'indv.' and summarise the 'key' by pasting the elements together. This assumes df1 holds the example data with the individuals in a column named 'indv.'.
library(dplyr)
library(tidyr)
gather(df1, key, val, -indv.) %>%
group_by(key) %>%
filter(sum(val) == 1, val == 1) %>%
group_by(indv.) %>%
summarise(key = toString(key))
# A tibble: 4 x 2
# indv. key
# <chr> <chr>
#1 A X, J
#2 B Y
#3 C W
#4 D T

Related

Insert a blank row before zero

x<-c(0,1,1,0,1,1,1,0,1,1)
aaa<-data.frame(x)
How can I insert a blank row before each zero? When the first row is zero, do not add a blank row. Thank you.
Result:
0
1
1
.
0
1
1
1
.
0
1
1
Below we used dot but you can replace "." with NA or "" or something else depending on what you want.
1) We can use Reduce and append:
Append <- function(x, y) append(x, ".", y - 1)
data.frame(x = Reduce(Append, setdiff(rev(which(aaa$x == 0)), 1), init = aaa$x))
2) Another possibility is to convert to a character string, use gsub and convert back:
data.frame(x = strsplit(gsub("(.)0", "\\1.0", paste(aaa$x, collapse = "")), "")[[1]])
3) We can create a two row matrix in which the first row is dot before each 0 and NA otherwise. Then unravel it to a vector and use na.omit to remove the NA values.
data.frame(x = na.omit(c(rbind(replace(ifelse(aaa$x == 0, ".", NA), 1, NA), aaa$x))))
4) We can lapply over aaa$x[-1], outputting c(".", 0) for each 0 and 1 otherwise. Unlist that and insert aaa$x[1] back in. No packages are used.
repl <- function(x) if (!x) c(".", 0) else 1
data.frame(x = c(aaa$x[1], unlist(lapply(aaa$x[-1], repl))))
5) Create a list of all but the first element and replace the 0's in that list with c(".", 0) . Unlist that and insert the first element back in. No packages are used.
L <- as.list(aaa$x[-1])
L[aaa$x[-1] == 0] <- list(c(".", 0))
data.frame(x = c(aaa$x[1], unlist(L)))
6) Assuming aaa has two columns where the second column is character (NOT factor). Append a row of dots to aaa and then create an index vector using unlist and Map to access the appropriate row of the extended aaa.
aaa <- data.frame(x = c(0,1,1,0,1,1,1,0,1,1), y = letters[1:10],
stringsAsFactors = FALSE)
nr <- nrow(aaa); nc <- ncol(aaa)
fun <- function(ix, x) if (!is.na(x) & x == 0 & ix > 1) c(nr + 1, ix) else ix
rbind(aaa, rep(".", nc))[unlist(Map(fun, 1:nr, aaa$x)), ]
If we did want to have y be factor then note that we can't just add a dot to a factor if it is not a level of that factor so there is the question of what levels the factor can have. To get around that let us add an NA rather than a dot to the factor. Then we get the following which is the same except that aaa has been redefined so that y is a factor, we no longer need nc since we are assuming 2 columns and rep(...) in the last line is replaced with c(".", NA).
aaa <- data.frame(x = c(0,1,1,0,1,1,1,0,1,1), y = letters[1:10])
nr <- nrow(aaa)
fun <- function(ix, x) if (!is.na(x) & x == 0 & ix > 1) c(nr + 1, ix) else ix
rbind(aaa, c(".", NA))[unlist(Map(fun, 1:nr, aaa$x)), ]
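As a quick sanity check, approaches 1) to 4) can be run side by side on the question's vector and compared with the requested result. This is only a sketch: the names r1 to r4 and expected are made up for the comparison, and as.vector() is added around na.omit() purely to strip the attributes it attaches.

```r
x <- c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1)
expected <- c("0", "1", "1", ".", "0", "1", "1", "1", ".", "0", "1", "1")

# 1) Reduce and append, inserting from the back so positions stay valid
Append <- function(x, y) append(x, ".", y - 1)
r1 <- Reduce(Append, setdiff(rev(which(x == 0)), 1), init = x)

# 2) gsub on the collapsed string
r2 <- strsplit(gsub("(.)0", "\\1.0", paste(x, collapse = "")), "")[[1]]

# 3) two-row matrix, unravelled column by column
r3 <- as.vector(na.omit(c(rbind(replace(ifelse(x == 0, ".", NA), 1, NA), x))))

# 4) lapply over everything but the first element
repl <- function(x) if (!x) c(".", 0) else 1
r4 <- c(x[1], unlist(lapply(x[-1], repl)))

# All four give the same character vector
stopifnot(identical(r1, expected), identical(r2, expected),
          identical(r3, expected), identical(r4, expected))
```

Note that every variant returns a character vector, because a dot cannot live in a numeric column.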
One dplyr and tidyr possibility may be:
library(dplyr)
library(tidyr)
aaa %>%
uncount(ifelse(row_number() > 1 & x == 0, 2, 1)) %>%
mutate(x = ifelse(x == 0 & lag(x == 1, default = first(x)), NA_integer_, x))
x
1 0
2 1
3 1
4 NA
5 0
6 1
7 1
8 1
9 NA
10 0
11 1
12 1
It is not adding a blank row as you have a numeric vector. Instead, it is adding a row with NA. If you need a blank row, you can convert it into a character vector and then replace NA with blank.
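The conversion described above could look like the following minimal sketch. The vector x.with.na is a hypothetical stand-in for a result column that already contains the inserted NA rows.

```r
# Hypothetical result column with NA markers already inserted
x.with.na <- c(0, 1, 1, NA, 0, 1, 1, 1, NA, 0, 1, 1)

# Convert to character first, then swap the NA markers for empty strings
x.blank <- as.character(x.with.na)
x.blank[is.na(x.blank)] <- ""
x.blank
```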
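# Number of copies per row: 2 for every zero after the first row, otherwise 1
ind = with(aaa, ifelse(x == 0 & seq_along(x) > 1, 2, 1))
d = aaa[rep(1:NROW(aaa), ind), , drop = FALSE]
# Blank out the first copy of each duplicated row so the NA lands before the zero
transform(d, x = replace(x, rep(ind, ind) == 2 & sequence(ind) == 1, NA))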
Here is an option with rleid
library(data.table)
setDT(aaa)[, .(x = if(x[.N] == 1) c(x, NA) else x), rleid(x)][-.N, .(x)]
# x
# 1: 0
# 2: 1
# 3: 1
# 4: NA
# 5: 0
# 6: 1
# 7: 1
# 8: 1
# 9: NA
#10: 0
#11: 1
#12: 1
data.frame(x = head(unname(unlist(by(aaa$x, cumsum(aaa$x == 0), c, '.'))), -1))
x
1 0
2 1
3 1
4 .
5 0
6 1
7 1
8 1
9 .
10 0
11 1
12 1
My solution is
aaa <- data.frame(x = c(0,1,1,0,1,1,1,0,1,1), y = letters[1:10])
aaa$ind = with(aaa, ifelse(x == 0 & seq_along(x) > 1, 2, 1))
aaa <- aaa[rep(1:nrow(aaa), aaa$ind), ]
aaa[aaa$ind == 2 & !grepl("\\.1$", rownames(aaa)), ] <- NA
aaa$ind<- NULL
aaa
x y
1 0 a
2 1 b
3 1 c
4 NA <NA>
4.1 0 d
5 1 e
6 1 f
7 1 g
8 NA <NA>
8.1 0 h
9 1 i
10 1 j

Perform a function on a dataframe across variable number of columns after removing zeros

I'm trying to create a function where I can pass a function as a variable to perform on a variable number of columns, after removing zeros. I'm not too comfortable with ellipses yet, and I'm guessing this is where the problem is arising. The function is using all the values in the specified rows, summarizing them based on the selected function, and then mutating that one value. I'd like to maintain the function across the row (e.g. rowMeans)
Example:
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
df <- data.frame(a, b, c, d)
FUNexcludeZero <- function(function_name, ...){
# Match function name
FUN <- match.fun(function_name)
# get all the values - I'm sure this is the problem, need to somehow turn it back into a df?
vals <- unlist(list(...))
# Remove 0's and perform function
valsNo0 <- vals[vals != 0]
compiledVals <- FUN(valsNo0)
return(compiledVals)
}
df %>%
mutate(foo = FUNexcludeZero(function_name = 'sd', a, b))
a b c d foo
1 1 0 5 2 1.457738
2 2 4 4 0 1.457738
3 3 3 3 1 1.457738
4 4 0 2 0 1.457738
5 5 1 1 4 1.457738
df %>%
mutate(foo = FUNexcludeZero(function_name = 'min', a, b))
a b c d foo
1 1 0 5 2 1
2 2 4 4 0 1
3 3 3 3 1 1
4 4 0 2 0 1
5 5 1 1 4 1
# Try row-function (same error occurs with rowMeans)
df %>%
mutate(foo = FUNexcludeZero(function_name = 'pmin', a, b))
Error in mutate_impl(.data, dots) :
Column `foo` must be length 5 (the number of rows) or one, not 8
For function_name = 'sd' the column should be c(NA, 1.41, 0, NA, 2.828) and the min and pmin should be c(1, 2, 3, 4, 1). I'm 100% sure the error has something to do with the list/unlist, but any other way I try it I end up with an error.
I am not sure if this is exactly what you want. You need to perform a row-wise operation on the two vectors, so I used the apply function. This should work for any number of equal-length vectors.
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
#df <- data.frame(a, b, c, d) #not used
FUNexcludeZero <- function(function_name, ...){
# Match function name
FUN <- match.fun(function_name)
#combine the vectors into a matrix
df<-cbind(...)
#remove 0 from rows and apply function to the rows
compiledVals <- apply(df, 1, function(x) {
  x <- x[x != 0]
  FUN(x)
})
return(compiledVals)
}
FUNexcludeZero(function_name = 'sd', a, b)
#[1] NA 1.414214 0.000000 NA 2.828427
FUNexcludeZero(function_name = 'min', a, b)
#[1] 1 2 3 4 1
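To get the result back into the data frame, and to choose the columns at run time, something along these lines may work. It is a sketch under a few assumptions: the function is restated in compact form so the snippet is self-contained, and the column names foo_sd, foo_min and foo_mean as well as the selection in cols are made up for illustration.

```r
# Compact restatement of the function from the answer above
FUNexcludeZero <- function(function_name, ...) {
  FUN <- match.fun(function_name)
  m <- cbind(...)                           # one column per input vector
  apply(m, 1, function(x) FUN(x[x != 0]))   # drop zeros row by row
}

df <- data.frame(a = 1:5, b = c(0, 4, 3, 0, 1), c = 5:1, d = c(2, 0, 1, 0, 4))

# Fixed set of columns
df$foo_sd  <- FUNexcludeZero("sd", df$a, df$b)
df$foo_min <- FUNexcludeZero("min", df$a, df$b)

# Variable set of columns, passed through do.call()
cols <- c("a", "b", "d")
df$foo_mean <- do.call(FUNexcludeZero,
                       c(list("mean"), unname(as.list(df[cols]))))
df
```

Vectorised helpers such as pmin or rowMeans do not fit this pattern directly, because apply() hands the function one row at a time as a plain vector.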

Going from a list of elements to chemical formula

I have a list of elemental compositions, each element in its own column. Sometimes these elements have a zero.
C H N O S
1 5 5 0 0 0
2 6 4 1 0 1
3 4 6 2 1 0
I need to combine them so that they read, e.g. C5H5, C6H4NS, C4H6N2O.
This means that for any element of value "1" I should only take the column name, and for anything with value 0, the column should be skipped altogether.
I'm not really sure where to start here. I could add a new column to make it easier to read across the columns, e.g.
c C h H n N o O s S
1 C 5 H 5 N 0 O 0 S 0
2 C 6 H 4 N 1 O 0 S 1
3 C 4 H 6 N 2 O 1 S 0
This way, I just need the output to be a single string, but I need to ignore any zero values, and drop the one after the element name.
And here a base R solution:
df = read.table(text = "
C H N O S
5 5 0 0 0
6 4 1 0 1
4 6 2 1 0
", header=T)
apply(df, 1, function(x){return(gsub('1', '', paste0(colnames(df)[x > 0], x[x > 0], collapse='')))})
[1] "C5H5" "C6H4NS" "C4H6N2O"
paste0(colnames(df)[x > 0], x[x > 0], collapse='') pastes together the column names where the row values are bigger than zero. gsub then removes the ones. And apply does this for each row in the data frame.
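One caveat worth checking: gsub('1', '', ...) removes every digit 1, so a count such as 10, 11 or 12 would be altered as well. A minimal sketch that blanks only the counts exactly equal to 1 is shown below; the helper name formula_from_counts and the extra fourth row (C12H22O11) are assumptions added for illustration.

```r
# Hypothetical helper: build one formula string per row of element counts
formula_from_counts <- function(df) {
  apply(df, 1, function(x) {
    keep <- x > 0                                    # skip elements with count 0
    counts <- ifelse(x[keep] == 1, "", as.character(x[keep]))  # drop a lone 1
    paste0(colnames(df)[keep], counts, collapse = "")
  })
}

df <- data.frame(C = c(5, 6, 4, 12), H = c(5, 4, 6, 22),
                 N = c(0, 1, 2, 0),  O = c(0, 0, 1, 11),
                 S = c(0, 1, 0, 0))

formulas <- formula_from_counts(df)
formulas
```

For the three rows from the question this gives the same strings as before, while two-digit counts in the added row are left intact.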
Here's a tidyverse solution that uses some reshaping:
df = read.table(text = "
C H N O S
5 5 0 0 0
6 4 1 0 1
4 6 2 1 0
", header=T)
library(tidyverse)
df %>%
mutate(id = row_number()) %>% # add row id
gather(key, value, -id) %>% # reshape data
filter(value != 0) %>% # remove any zero rows
mutate(value = ifelse(value == 1, "", value)) %>% # replace 1 with ""
group_by(id) %>% # for each row
summarise(v = paste0(key, value, collapse = "")) # create the string value
# # A tibble: 3 x 2
# id v
# <int> <chr>
# 1 1 C5H5
# 2 2 C6H4NS
# 3 3 C4H6N2O
Assume that the input matrix m is as given reproducibly in the Note at the end -- convert it to a matrix if it is a data frame using as.matrix.
Now create a matrix lets with the same shape as m that holds just the column letters, so that lets contains the letters and m contains the numbers. Then paste the letters and numbers together and replace the cells whose number is zero with the empty string. Also replace any cells whose number is 1 with just the letter. Finally paste each row together. No packages are used and no loops or *apply functions are used.
lets <- t(replace(t(m), TRUE, colnames(m)))
mm <- paste0(lets, m)
mm <- replace(mm, m == 0, "")
mm <- ifelse(m == 1, lets, mm)
do.call("paste0", as.data.frame(mm))
## [1] "C5H5" "C6H4NS" "C4H6N2O"
Note
the input matrix m in reproducible form is assumed to be:
m <- matrix(c(5, 6, 4, 5, 4, 6, 0, 1, 2, 0, 0, 1, 0, 1, 0), 3, 5,
dimnames = list(NULL, c("C", "H", "N", "O", "S")))
Another idea that avoids the apply with margin 1,
gsub('1', '', sapply(split(df, 1:nrow(df)), function(i)
paste(paste0(names(i)[i != 0], i[i != 0]), collapse = '')))
# 1 2 3
# "C5H5" "C6H4NS" "C4H6N2O"
Another option
library(dplyr)
#Get indices of all non-zero numbers in the dataframe
inds <- which(df!=0, arr.ind = TRUE)
#Create a dataframe with row index, column index and value at that position
vals <- data.frame(inds, val = df[inds])
#For each row paste the name of the column and value together and then replace 1
vals %>%
group_by(row) %>%
summarise(chemical = paste0(names(df)[col], val,collapse = "")) %>%
mutate(chemical = gsub("[1]", "", chemical))
# row chemical
# <int> <chr>
#1 1 C5H5
#2 2 C6H4NS
#3 3 C4H6N2O

Removing columns that are all 0

I am trying to remove all columns in my dataframe that solely contain the value 0. My code is the following that I found on this website.
dataset = dataset[ ,colSums(dataset != 0) > 0]
However, I keep returning an error:
Error in [.data.frame(dataset, , colSums(dataset != 0) > 0) :
undefined columns selected
It's because you have an NA in at least one column. Fix like this:
dataset = dataset[ , colSums(dataset != 0, na.rm = TRUE) > 0]
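A small sketch of why the NA matters, using a made-up three-column data frame: without na.rm = TRUE the column sum of a column containing NA is itself NA, so the logical index contains NA and the subsetting fails.

```r
# Made-up data: b is all zero, d contains an NA
dataset <- data.frame(a = 1:3, b = 0, d = c(-1, NA, 1))

# Without na.rm the test for column d is NA rather than TRUE/FALSE
keep_raw <- colSums(dataset != 0) > 0
keep_raw

# With na.rm = TRUE every column gets a definite TRUE or FALSE
keep_ok <- colSums(dataset != 0, na.rm = TRUE) > 0
result <- dataset[, keep_ok, drop = FALSE]
names(result)
```

The drop = FALSE argument is an addition here; it keeps the result a data frame even if only one column survives.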
Here's some code that will check which columns are numeric (or integer) and drop those that contain all zeros and NAs:
# example data
df <- data.frame(
one = rep(0,100),
two = sample(letters, 100, T),
three = rep(0L,100),
four = 1:100,
stringsAsFactors = F
)
# create function that checks numeric columns for all zeros
only_zeros <- function(x) {
if (is.numeric(x)) {
all(x == 0, na.rm = TRUE)
} else {
FALSE
}
}
# apply that function to your data
df_without_zero_cols <- df[ , !sapply(df, only_zeros)]
There is an alternative using all():
dataset[, !sapply(dataset, function(x) all(x == 0))]
a c d f
1 1 -1 -1 a
2 2 0 NA a
3 3 1 1 a
In case of a large dataset, time and memory consuming copying can be avoided through removing the columns by reference
library(data.table)
cols <- which(sapply(dataset, function(x) all(x == 0)))
setDT(dataset)[, (cols) := NULL]
dataset
a c d f
1: 1 -1 -1 a
2: 2 0 NA a
3: 3 1 1 a
Data
dataset <- data.frame(a = 1:3, b = 0, c = -1:1, d = c(-1, NA, 1), e = 0, f ="a")
dataset
a b c d e f
1 1 0 -1 -1 0 a
2 2 0 0 NA 0 a
3 3 0 1 1 0 a

Imputing labels based on a comparison of columns

I don't think this question has been asked on this board before. I have two columns of 1s and 0s in a dataframe. Let's call these columns X and Y, respectively. In a comparison of X and Y for any row, one of four combinations is obviously possible:
A: 1, 0
B: 0, 1
C: 1, 1
D: 0, 0
Imagine the dataframe has m columns total, but we're interested only in X and Y. I'd like to write a function that compares only X and Y and then characterizes the particular combination with the corresponding labels A, B, C, or D in a new column (let's call it Z).
So say the data looks like:
X Y
1 1
0 1
0 0
1 1
The function will output:
X Y Z
1 1 C
0 1 B
0 0 D
1 1 C
I imagine this would be trivial but I'm an R newbie. Thanks for any guidance!
We create a key/value dataset of the unique combinations and then merge it with the input dataset on the 'X' and 'Y' columns
merge(df1, keyDat, by = c("X", "Y"), all.x=TRUE)
# X Y Z
#1 0 0 D
#2 0 1 B
#3 1 1 C
#4 1 1 C
Or to get the output in the same order, use left_join
library(dplyr)
left_join(df1, keyDat)
#Joining by: c("X", "Y")
# X Y Z
#1 1 1 C
#2 0 1 B
#3 0 0 D
#4 1 1 C
data
keyDat <- data.frame(X= c(1, 0, 1, 0), Y = c(0, 1, 1,
0), Z = c("A", "B", "C", "D"), stringsAsFactors=FALSE)
df1 <- data.frame(X= c(1, 0, 0, 1), Y=c(1, 1, 0, 1))
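A base R way to keep the original row order without dplyr is to look the pair up with match(). This is a sketch rather than a drop-in replacement for the join; it assumes the same keyDat and df1 as above and builds a temporary text key from X and Y.

```r
keyDat <- data.frame(X = c(1, 0, 1, 0), Y = c(0, 1, 1, 0),
                     Z = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
df1 <- data.frame(X = c(1, 0, 0, 1), Y = c(1, 1, 0, 1))

# Position of each (X, Y) pair of df1 within the key table
idx <- match(paste(df1$X, df1$Y), paste(keyDat$X, keyDat$Y))

# Look up the label; pairs missing from keyDat would come back as NA
df1$Z <- keyDat$Z[idx]
df1
```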
