Imputing labels based on a comparison of columns - r

I don't think this question has been asked on this board before. I have two columns of 1s and 0s in a dataframe. Let's call these columns X and Y, respectively. In a comparison of X and Y for any row, one of four combinations is obviously possible:
A: 1, 0
B: 0, 1
C: 1, 1
D: 0, 0
Imagine the dataframe has m columns total, but we're interested only in X and Y. I'd like to write a function that compares only X and Y and then characterizes the particular combination with the corresponding labels A, B, C, or D in a new column (let's call it Z).
So say the data looks like:
X Y
1 1
0 1
0 0
1 1
The function will output:
X Y Z
1 1 C
0 1 B
0 0 D
1 1 C
I imagine this would be trivial but I'm an R newbie. Thanks for any guidance!

We create a key dataset that holds each unique X/Y combination with its label, then merge it with the input dataset on the 'X' and 'Y' columns. Note that merge sorts the result by the key columns.
merge(df1, keyDat, by = c("X", "Y"), all.x=TRUE)
# X Y Z
#1 0 0 D
#2 0 1 B
#3 1 1 C
#4 1 1 C
Or, to keep the rows in their original order, use left_join from dplyr
library(dplyr)
left_join(df1, keyDat)
#Joining by: c("X", "Y")
# X Y Z
#1 1 1 C
#2 0 1 B
#3 0 0 D
#4 1 1 C
data
keyDat <- data.frame(X= c(1, 0, 1, 0), Y = c(0, 1, 1,
0), Z = c("A", "B", "C", "D"), stringsAsFactors=FALSE)
df1 <- data.frame(X= c(1, 0, 0, 1), Y=c(1, 1, 0, 1))
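
As an alternative to a join, the same mapping can be sketched in base R with a named lookup vector. This is only an illustration: the name label_lookup is made up here, and the sketch assumes X and Y contain nothing but 0s and 1s.

```r
# Same example data as above
df1 <- data.frame(X = c(1, 0, 0, 1), Y = c(1, 1, 0, 1))

# Hypothetical lookup vector: the name of each element is "<X><Y>"
label_lookup <- c("10" = "A", "01" = "B", "11" = "C", "00" = "D")

# Build the key for every row and look up its label
df1$Z <- unname(label_lookup[paste0(df1$X, df1$Y)])
df1
```

Because the lookup is vectorised, row order is preserved without any sorting step.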

Related

Perform a function on a dataframe across variable number of columns after removing zeros

I'm trying to write a function that takes another function as an argument and applies it across a variable number of columns, after removing zeros. I'm not too comfortable with ellipses yet, and I'm guessing this is where the problem arises. As written, the function pools all the values from the specified columns, summarizes them with the selected function, and mutate then fills the new column with that single value. I'd like the function to be applied row by row instead (as rowMeans does).
Example:
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
df <- data.frame(a, b, c, d)
FUNexcludeZero <- function(function_name, ...){
# Match function name
FUN <- match.fun(function_name)
# get all the values - I'm sure this is the problem, need to somehow turn it back into a df?
vals <- unlist(list(...))
# Remove 0's and perform function
valsNo0 <- vals[vals != 0]
compiledVals <- FUN(valsNo0)
return(compiledVals)
}
df %>%
mutate(foo = FUNexcludeZero(function_name = 'sd', a, b))
a b c d foo
1 1 0 5 2 1.457738
2 2 4 4 0 1.457738
3 3 3 3 1 1.457738
4 4 0 2 0 1.457738
5 5 1 1 4 1.457738
df %>%
mutate(foo = FUNexcludeZero(function_name = 'min', a, b))
a b c d foo
1 1 0 5 2 1
2 2 4 4 0 1
3 3 3 3 1 1
4 4 0 2 0 1
5 5 1 1 4 1
# Try row-function (same error occurs with rowMeans)
df %>%
mutate(foo = FUNexcludeZero(function_name = 'pmin', a, b))
Error in mutate_impl(.data, dots) :
Column `foo` must be length 5 (the number of rows) or one, not 8
For function_name = 'sd' the column should be c(NA, 1.41, 0, NA, 2.828) and the min and pmin should be c(1, 2, 3, 4, 1). I'm 100% sure the error has something to do with the list/unlist, but any other way I try it I end up with an error.
I am not sure if this is exactly what you want. You need to perform a row-wise operation on the two vectors, so I used the apply function. This should work for any number of equal-length vectors.
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
#df <- data.frame(a, b, c, d) #not used
FUNexcludeZero <- function(function_name, ...){
# Match function name
FUN <- match.fun(function_name)
#combine the vectors into a matrix
df<-cbind(...)
#remove 0 from rows and apply function to the rows
compiledVals <- apply(df, 1, function(x) { x<-x[x!=0]
FUN(x)})
return(compiledVals)
}
FUNexcludeZero(function_name = 'sd', a, b)
#[1] NA 1.414214 0.000000 NA 2.828427
FUNexcludeZero(function_name = 'min', a, b)
#[1] 1 2 3 4 1
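
To attach the result to the data frame, as the question intended, one possible base R sketch is shown below. The helper name row_apply_nonzero and its arguments are hypothetical, and it assumes the chosen function returns a single value per row.

```r
# Hypothetical helper: apply `fun` to each row of the selected columns,
# ignoring zeros (assumes `fun` returns one value per row)
row_apply_nonzero <- function(data, cols, fun) {
  fun <- match.fun(fun)
  m <- as.matrix(data[cols])
  unname(apply(m, 1, function(row) fun(row[row != 0])))
}

df <- data.frame(a = 1:5, b = c(0, 4, 3, 0, 1), c = 5:1, d = c(2, 0, 1, 0, 4))

# Row-wise minimum of a and b, with zeros removed
df$foo <- row_apply_nonzero(df, c("a", "b"), "min")
df
```

Passing the column names as a character vector avoids the ellipsis handling that caused trouble in the original function.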

dplyr mutate_at function applied to multiple columns - using dynamic column names

I have a data frame x.
x <- data.frame(a = c(10, 20, 30, 0), b = c(1, 2, 3, 0), c = c(1, 2, 3, 0), d = c(8, 16, 24, 0))
x
denominator_var <- "a"
numerator_vars <- c("b", "c", "d")
Using dplyr, I'm trying to add new columns (b_share, c_share, and d_share) such that each of them is equal to the corresponding column (b, c, and d) divided by a.
However, it is important to me to use NOT the original variable names but dynamic variable names.
My code below is not working. What's wrong?
x %>% mutate_at(vars(one_of(numerator_vars)),
funs(share = ifelse(!!(denominator_var) > 0, round(./!!(denominator_var) * 100, 2), 0)))
Thank you very much!
You can try using as.name before applying !!:
x %>% mutate_at(vars(one_of(numerator_vars)), funs(share =
ifelse(!!(as.name(denominator_var)) > 0, round(./!!(as.name(denominator_var)) * 100, 2), 0)))
#    a b c  d b_share c_share d_share
# 1 10 1 1  8      10      10      80
# 2 20 2 2 16      10      10      80
# 3 30 3 3 24      10      10      80
# 4  0 0 0  0       0       0       0
You can get the result you want by quoting your denominator beforehand:
denominator_var <- quo(a)
x %>% mutate_at(numerator_vars,
funs(share = ifelse(!!(denominator_var) > 0,
round(./!!(denominator_var) * 100, 2),
0)))
Also note that you don't need to use vars for your vector numerator_vars.
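
For comparison, the same calculation can be sketched in base R, where dynamic column names need no quoting at all. This is an illustration rather than a replacement for the dplyr answers; the variable denom is introduced here only for readability.

```r
x <- data.frame(a = c(10, 20, 30, 0), b = c(1, 2, 3, 0),
                c = c(1, 2, 3, 0), d = c(8, 16, 24, 0))
denominator_var <- "a"
numerator_vars <- c("b", "c", "d")

# Pull the denominator column out by its name
denom <- x[[denominator_var]]

# Add one "<name>_share" column per numerator column
for (v in numerator_vars) {
  x[[paste0(v, "_share")]] <- ifelse(denom > 0, round(x[[v]] / denom * 100, 2), 0)
}
x
```

The ifelse guard keeps the share at 0 in the last row, where the denominator is 0.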

Select columns based on columns sum

For each row, I would like to select the columns where the value is 1 and the column sum is also 1. In other words, I want only the values that are unique to one individual and not shared with the others. Any suggestions? The data and the desired output are:
indv. X Y Z W T J
A 1 0 1 0 0 1
B 0 1 1 0 0 0
C 0 0 1 1 0 0
D 0 0 1 0 1 0
A: X, J
B: Y
C: W
D: T
Here you go! A solution in base R.
First we simulate your data, a data.frame with named rows and columns.
You can use sapply() to loop over the column indices.
A for-loop over the column indices will achieve the same thing.
Finally, save the results in a data.frame however you want.
# Simulate your example data
df <- data.frame(matrix(c(1, 0, 1, 0, 0, 1,
0, 1, 1, 0, 0, 0,
0, 0, 1, 1, 0, 0,
0, 0, 1, 0, 1, 0), nrow = 4, byrow = TRUE))
# Names rows and columns accordingly
names(df) <- c("X", "Y", "Z", "W", "T", "J")
rownames(df) <- c("A", "B","C", "D")
> df
X Y Z W T J
A 1 0 1 0 0 1
B 0 1 1 0 0 0
C 0 0 1 1 0 0
D 0 0 1 0 1 0
Then we select the columns whose sum is 1, i.e. the columns that contain a single unique value.
For every one of these columns, we find the row of this value.
# Select columns with unique values (if sum of column == 1)
unique.cols <- which(colSums(df) == 1)
# For every one of these columns, select the row where row-value==1
unique.rows <- sapply(unique.cols, function(x) which(df[, x] == 1))
> unique.cols
X Y W T J
1 2 4 5 6
> unique.rows
X Y W T J
1 2 3 4 1
The rows are not named correctly yet (unique.rows still carries the element names of unique.cols). So we index into the rownames of df to get the row names.
# Data.frame of unique values
# Rows and columns in separate columns
df.unique <- data.frame(Cols = unique.cols,
Rows = unique.rows,
Colnames = names(unique.cols),
Rownames = rownames(df)[unique.rows],
row.names = NULL)
The result:
df.unique
Cols Rows Colnames Rownames
1 1 1 X A
2 2 2 Y B
3 4 3 W C
4 5 4 T D
5 6 1 J A
Edit:
This is how you could summarise the values per row using dplyr.
library(dplyr)
df.unique %>% group_by(Rownames) %>%
summarise(paste(Colnames, collapse=", "))
# A tibble: 4 x 2
Rownames `paste(Colnames, collapse = ", ")`
<fct> <chr>
1 A X, J
2 B Y
3 C W
4 D T
One idea is to use a row-wise apply to find the columns containing 1, after filtering out the columns whose sum is not equal to 1, i.e.
apply(df[colSums(df) == 1], 1, function(i) names(df[colSums(df) == 1])[i == 1])
$A
[1] "X" "J"
$B
[1] "Y"
$C
[1] "W"
$D
[1] "T"
You can play around with the output to get it to desired state, i.e.
apply(df[colSums(df) == 1], 1, function(i) toString(names(df[colSums(df) == 1])[i == 1]))
# A B C D
#"X, J" "Y" "W" "T"
Or
data.frame(cols = apply(df[colSums(df) == 1], 1, function(i) toString(names(df[colSums(df) == 1])[i == 1])))
# cols
#A X, J
#B Y
#C W
#D T
Here is an option with tidyverse. We gather the dataset to 'long' format, group by 'key', filter the rows where 'val' is 1 and the sum of 'val' is 1, then group by 'indv.' and summarise 'key' by pasting its elements together
library(dplyr)
library(tidyr)
gather(df1, key, val, -indv.) %>%
group_by(key) %>%
filter(sum(val) == 1, val == 1) %>%
group_by(indv.) %>%
summarise(key = toString(key))
# A tibble: 4 x 2
# indv. key
# <chr> <chr>
#1 A X, J
#2 B Y
#3 C W
#4 D T
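
Another way to get the same result in base R, sketched here only as an illustration, is to locate the 1s with which(arr.ind = TRUE) and group the column names by row name. The object names m, hits and res are made up for this sketch.

```r
df <- data.frame(X = c(1, 0, 0, 0), Y = c(0, 1, 0, 0), Z = c(1, 1, 1, 1),
                 W = c(0, 0, 1, 0), T = c(0, 0, 0, 1), J = c(1, 0, 0, 0),
                 row.names = c("A", "B", "C", "D"))

# Keep only the columns whose sum is 1
m <- as.matrix(df[colSums(df) == 1])

# Row and column index of every 1 in the reduced matrix
hits <- which(m == 1, arr.ind = TRUE)

# Collect the column names for each row name
res <- tapply(colnames(m)[hits[, "col"]], rownames(m)[hits[, "row"]], toString)
res
```

Column Z is dropped in the first step because every individual shares it.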

Transform categorical attribute vector into similarity matrix

I need to transform a categorical attribute vector into a "same attribute matrix" using R.
For example I have a vector which reports gender of N people (male = 1, female = 0). I need to convert this vector into a NxN matrix named A (with people names on rows and columns), where each cell Aij has the value of 1 if two persons (i and j) have the same gender and 0 otherwise.
Here is an example with 3 persons, first male, second female, third male, which produce this vector:
c(1, 0, 1)
I want to transform it into this matrix:
A = matrix( c(1, 0, 1, 0, 1, 0, 1, 0, 1), nrow=3, ncol=3, byrow = TRUE)
As lmo said in a comment, it's impossible to know the structure of your dataset, so what follows is just an example to show how it could be done.
First, make up some data.
set.seed(3488) # make the results reproducible
x <- LETTERS[1:5]
y <- sample(0:1, 5, TRUE)
df <- data.frame(x, y)
Now tabulate it according to your needs
A <- outer(df$y, df$y, function(a, b) as.integer(a == b))
dimnames(A) <- list(df$x, df$x)
A
# A B C D E
#A 1 1 1 0 0
#B 1 1 1 0 0
#C 1 1 1 0 0
#D 0 0 0 1 1
#E 0 0 0 1 1
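
Applied to the three-person vector from the question, the same outer call reproduces the requested matrix. The person names P1, P2 and P3 are placeholders, since the question does not give any.

```r
gender <- c(1, 0, 1)
people <- c("P1", "P2", "P3")  # placeholder names, not from the question

# 1 where two people share the same gender, 0 otherwise
A <- outer(gender, gender, function(a, b) as.integer(a == b))
dimnames(A) <- list(people, people)

# The matrix the question asked for, used here as a check
expected <- matrix(c(1, 0, 1, 0, 1, 0, 1, 0, 1), nrow = 3, ncol = 3, byrow = TRUE)
all(A == expected)
```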

Create a vector of counts

I wanted to create a vector of counts if possible.
For example: I have a vector
x <- c(3, 0, 2, 0, 0)
How can I create a frequency vector for all integers between 0 and 3? Ideally I wanted to get a vector like this:
> 3 0 1 1
which gives me the counts of 0, 1, 2, and 3 respectively.
Much appreciated!
You can do
table(factor(x, levels=0:3))
Simply using table(x) is not enough.
Or with tabulate which is faster
tabulate(factor(x, levels = min(x):max(x)))
You can do this using rle (I made this in minutes, so sorry if it's not optimized enough).
x = c(3, 0, 2, 0, 0)
r = rle(x)
f = function(x) sum(r$lengths[r$values == x])
s = sapply(FUN = f, X = as.list(0:3))
data.frame(x = 0:3, freq = s)
#> data.frame(x = 0:3, freq = s)
# x freq
#1 0 3
#2 1 0
#3 2 1
#4 3 1
You can just use table():
a <- table(x)
a
x
#0 2 3
#3 1 1
Then you can subset it:
a[names(a)==0]
#0
#3
Or convert it into a data.frame if you're more comfortable working with that:
u<-as.data.frame(table(x))
u
# x Freq
#1 0 3
#2 2 1
#3 3 1
Edit 1:
For levels:
a<- as.data.frame(table(factor(x, levels=0:3)))
