I have a data.frame that looks like this:
Col1 Col2 Col3
0 3 25
45 0 0
0 0 12
I would like to compute the correlation between elements only if both elements are != 0. In my case 0 is a "not informative" item, so it does not make sense to compute a correlation between, for example, 3 (informative) and 0 (not informative).
I cannot simply remove the columns containing 0 elements, because the 0 elements are scattered throughout my data.frame.
One half of what you are looking for is use = "pairwise.complete.obs" in cor:
If use has the value "pairwise.complete.obs" then the correlation or
covariance between each pair of variables is computed using all
complete pairs of observations on those variables.
However, this requires NA values instead of zeros, so let us transform our data first:
data <- data.frame(x = c(1, 0, -1, 0, 1),
                   y = c(-1, 0, 1, -1, 0),
                   z = c(0, 0, 1, -1, -1))
data
# x y z
# 1 1 -1 0
# 2 0 0 0
# 3 -1 1 1
# 4 0 -1 -1
# 5 1 0 -1
tempData <- data
tempData[tempData == 0] <- NA
tempData
# x y z
# 1 1 -1 NA
# 2 NA NA NA
# 3 -1 1 1
# 4 NA -1 -1
# 5 1 NA -1
Finally:
cor(tempData, use = "pairwise.complete.obs")
# x y z
# x 1 -1 -1
# y -1 1 1
# z -1 1 1
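The two steps above can be wrapped in a small helper. This is only a sketch: the name cor_nonzero is made up for illustration, and it assumes every column is numeric and that 0 always means "not informative".

```r
# Hypothetical helper (name is an assumption, not from the answer above):
# treat zeros as missing, then correlate using pairwise complete observations.
cor_nonzero <- function(df) {
  df[df == 0] <- NA                       # zeros become NA so cor() can skip them
  cor(df, use = "pairwise.complete.obs")  # each pair uses only rows where both are non-NA
}

data <- data.frame(x = c(1, 0, -1, 0, 1),
                   y = c(-1, 0, 1, -1, 0),
                   z = c(0, 0, 1, -1, -1))
res <- cor_nonzero(data)
res
```

Because the replacement happens on the function's local copy, the original data frame keeps its zeros.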
I've got a dataset that has a lot of numerical columns (in the example below these columns are x, y, z). I want to create individual flagging variables for each of those columns (x_YN, y_YN, z_YN) such that, if the numerical column is > 0, the flagging variable is = 1 and otherwise it's = 0. What might be the most efficient way to tackle this?
Thanks for the help!
x <- c(3, 7, 0, 10)
y <- c(5, 2, 20, 0)
z <- c(0, 0, 4, 12)
df <- data.frame(x,y,z)
We may build a logical matrix with df > 0 and coerce it to integer with the unary +:
df[paste0(names(df), "_YN")] <- +(df > 0)
-output
> df
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
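If the unary + looks cryptic, the same coercion can be spelled out with as.integer. A minimal sketch on a single vector (the variable names flag_plus and flag_int are only for illustration):

```r
x <- c(3, 7, 0, 10)

flag_plus <- +(x > 0)           # unary plus turns TRUE/FALSE into 1L/0L
flag_int  <- as.integer(x > 0)  # explicit version of the same coercion

identical(flag_plus, flag_int)
```

Both forms give an integer vector, so which one to use is mostly a matter of readability.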
The dplyr alternative:
library(dplyr)
df %>%
  mutate(across(everything(), ~ +(.x > 0), .names = "{col}_YN"))
output
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
I have a sequence of treatments, one per day (binary), say:
trt <- c(0, 0, 1, 0, 0, 0, 1, 0, 0)
I want to create a vector, days_since, that:
Is NA up until the first treatment.
Is 0 where trt is 1
Counts the days since the last treatment
So, the output days_since should be:
days_since <- c(NA, NA, 0, 1, 2, 3, 0, 1, 2)
How would I do this in R? To get days_since, I basically need to lag by one element and add 1, but resetting every time the original vector (trt) is 1. If this is doable without a for-loop, that would be ideal, but not absolutely necessary.
Maybe you can try the code below
v <- cumsum(trt)
replace(ave(trt,v,FUN = seq_along)-1,v<1,NA)
which gives
[1] NA NA 0 1 2 3 0 1 2
Explanation
First, we apply cumsum over trt to group the treatments
> v <- cumsum(trt)
> v
[1] 0 0 1 1 1 1 2 2 2
Secondly, using ave helps to add sequential indices within each group
> ave(trt,v,FUN = seq_along)-1
[1] 0 1 0 1 2 3 0 1 2
Finally, since the result should be NA before the first treatment, every position before v first reaches 1 has to be replaced by NA. Thus we use replace, with v < 1 as the index
> replace(ave(trt,v,FUN = seq_along)-1,v<1,NA)
[1] NA NA 0 1 2 3 0 1 2
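Since the question says a for-loop is acceptable, a plain loop can serve as a cross-check of the vectorised version. This is a sketch only; the names days_since_loop and last are made up for illustration.

```r
trt <- c(0, 0, 1, 0, 0, 0, 1, 0, 0)

# Loop version: remember the index of the most recent treatment
days_since_loop <- rep(NA_real_, length(trt))
last <- NA                                   # no treatment seen yet
for (i in seq_along(trt)) {
  if (trt[i] == 1) last <- i                 # reset on a treatment day
  if (!is.na(last)) days_since_loop[i] <- i - last
}

# Vectorised version from the answer above
v <- cumsum(trt)
days_since_vec <- replace(ave(trt, v, FUN = seq_along) - 1, v < 1, NA)

all.equal(days_since_loop, days_since_vec)
```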
We can also use
(NA^!cummax(trt)) * sequence(table(cumsum(trt)))-1
#[1] NA NA 0 1 2 3 0 1 2
Or with rowid from data.table
library(data.table)
(NA^!cummax(trt)) *rowid(cumsum(trt))-1
#[1] NA NA 0 1 2 3 0 1 2
I already searched the web and found no answer. I have a big data.frame that contains multiple columns. Each column is a factor variable.
I want to transform the data.frame such that each possible level of the factor variables becomes a variable that contains a "1" if that level is present in any of the factor columns of the row, or "0" otherwise.
Here is an example of what I mean.
labels <- c("1", "2", "3", "4", "5", "6", "7")
#create data frame (note, not all factor levels have to be in the columns,
#NA values are possible)
input <- data.frame(ID = c(1, 2, 3),
                    Cat1 = factor(c(4, 1, 1), levels = labels),
                    Cat2 = factor(c(2, NA, 4), levels = labels),
                    Cat3 = factor(c(7, NA, NA), levels = labels))
#the seven factor levels now are the variables of the data.frame
desired_output <- data.frame(ID = c(1, 2, 3),
                             Dummy1 = c(0, 1, 1),
                             Dummy2 = c(1, 0, 0),
                             Dummy3 = c(0, 0, 0),
                             Dummy4 = c(1, 0, 1),
                             Dummy5 = c(0, 0, 0),
                             Dummy6 = c(0, 0, 0),
                             Dummy7 = c(1, 0, 0))
input
ID Cat1 Cat2 Cat3
1 4 2 7
2 1 <NA> <NA>
3 1 4 <NA>
desired_output
ID Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 Dummy7
1 0 1 0 1 0 0 1
2 1 0 0 0 0 0 0
3 1 0 0 1 0 0 0
My actual data.frame has over 3000 rows and factors with more than 100 levels.
I hope you can help me convert the input to the desired output.
Greetings
sush
A couple of methods that riff off Gregor's and Aaron's answers.
From Aaron's: factorsAsStrings=FALSE keeps the values as factors, so all labels are retained when using dcast
library(reshape2)
dcast(melt(input, id="ID", factorsAsStrings=FALSE), ID ~ value, drop=FALSE)
ID 1 2 3 4 5 6 7 NA
1 1 0 1 0 1 0 0 1 0
2 2 1 0 0 0 0 0 0 2
3 3 1 0 0 1 0 0 0 1
Then you just need to remove the last column.
From Gregor's
na.replace <- function(x) replace(x, is.na(x), 0)
options(na.action='na.pass') # this keeps the NA's which are then converted to zero
Reduce("+", lapply(input[-1], function(x) na.replace(model.matrix(~ 0 + x))))
x1 x2 x3 x4 x5 x6 x7
1 0 1 0 1 0 0 1
2 1 0 0 0 0 0 0
3 1 0 0 1 0 0 0
Then you just need to cbind the ID column
One way to do this is with matrix indexing. You have data specifying which locations in your output matrix should be 1 (the rest should be zero), so we'll make a matrix of zeros and then fill in the 1's based on your data. To do that, your data needs to be in a two column matrix, with the first column being the row (ID) of the output and the second column being the columns.
Put input data in long format, remove missings, convert values to integers matching the labels, then make a matrix as needed.
in2 <- reshape2::melt(input, id.vars="ID")
in2 <- subset(in2, !is.na(value))
in2$value <- match(in2$value, labels)
in2$variable <- NULL
in2 <- as.matrix(in2)
Then make the new output matrix with all zeros, and fill in the ones using that matrix.
out <- matrix(0, nrow=nrow(input), ncol=length(labels))
colnames(out) <- labels
rownames(out) <- input$ID
out[in2] <- 1
out
## 1 2 3 4 5 6 7
## 1 0 1 0 1 0 0 1
## 2 1 0 0 0 0 0 0
## 3 1 0 0 1 0 0 0
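The key step is indexing a matrix with a two-column matrix of (row, column) pairs. A tiny standalone sketch of just that mechanism, with made-up positions (the names idx and demo are only for illustration):

```r
demo <- matrix(0, nrow = 3, ncol = 4)

# Each row of idx is one (row, column) cell to fill
idx <- cbind(row = c(1, 1, 3),
             col = c(2, 4, 1))

demo[idx] <- 1   # sets demo[1, 2], demo[1, 4] and demo[3, 1]
demo
```

Note that in the answer above the ID column doubles as the row index, which works because the IDs are 1, 2, 3; with arbitrary IDs one would presumably need something like match(ID, input$ID) first.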
Here's a way using model.matrix. We convert the missing values to 0s, and specify 0 as the reference level for the factor contrasts. Then we just add the individual model matrices together and stick on the IDs:
new_lab = as.character(0:7)
for (i in 2:4) {
  temp = as.character(input[[i]])
  temp[is.na(temp)] = "0"
  input[[i]] = factor(temp, levels = new_lab)
}
mm =
  model.matrix(~ Cat1, data = input) +
  model.matrix(~ Cat2, data = input) +
  model.matrix(~ Cat3, data = input)
mm[, 1] = input$ID
colnames(mm) = c("ID", paste0("Dummy", 1:(ncol(mm) - 1)))
mm
# ID Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 Dummy7
# 1 1 0 1 0 1 0 0 1
# 2 2 1 0 0 0 0 0 0
# 3 3 1 0 0 1 0 0 0
# attr(,"assign")
# [1] 0 1 1 1 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$Cat1
# [1] "contr.treatment"
You can leave the result as a model matrix, change it back to a data frame, or whatever else.
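For comparison, here is a base-R sketch that needs neither reshaping nor model matrices: for each label, check whether it occurs anywhere in the row. It is not taken from any of the answers; the name dummies is made up for illustration, and it assumes all Cat columns share the same set of labels.

```r
labels <- c("1", "2", "3", "4", "5", "6", "7")
input <- data.frame(ID = c(1, 2, 3),
                    Cat1 = factor(c(4, 1, 1), levels = labels),
                    Cat2 = factor(c(2, NA, 4), levels = labels),
                    Cat3 = factor(c(7, NA, NA), levels = labels))

# For every label: 1 if any Cat column in the row equals it, else 0.
# input[-1] == l gives a logical matrix; NA comparisons are dropped by na.rm.
dummies <- sapply(labels, function(l) {
  as.integer(rowSums(input[-1] == l, na.rm = TRUE) > 0)
})
colnames(dummies) <- paste0("Dummy", labels)

cbind(ID = input$ID, dummies)
```

With 100+ levels this loops once per level, which should still be fine for a few thousand rows, though the matrix-indexing approach is likely faster.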
This should work on your data frame. I converted the values to numeric before running the ifelse statement. Hope it works:
library(magrittr) # provides the %<>% assignment pipe
# Make dummy df
Cat1 = factor(c(4, 1, 1))
Cat2 = factor(c(2, NA, 4))
Cat3 = factor(c(7, NA, NA))
df <- data.frame(Cat1,Cat2,Cat3)
# Specify columns
cols <- seq_along(df)
# Convert Values To Numeric
df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
# Perform ifelse. If it's NA return 0, else return 1
df[,cols] %<>% lapply(function(x) ifelse(is.na(x), 0, 1))
Based on input:
Cat1 Cat2 Cat3
1 4 2 7
2 1 <NA> <NA>
3 1 4 <NA>
Output looks like this:
Cat1 Cat2 Cat3
1 1 1 1
2 1 0 0
3 1 1 0
I have a vector 'y' and I count the different values using table:
y <- c(0, 0, 1, 3, 4, 4)
table(y)
# y
# 0 1 3 4
# 2 1 1 2
However, I also want the result to include the fact that there are zero 2's and zero 5's. Can I use table() for this?
Desired result:
# y
# 0 1 2 3 4 5
# 2 1 0 1 2 0
Convert your variable to a factor, and set the categories you wish to include in the result using levels. Values with a count of zero will then also appear in the result:
y <- c(0, 0, 1, 3, 4, 4)
table(factor(y, levels = 0:5))
# 0 1 2 3 4 5
# 2 1 0 1 2 0
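Once the levels are fixed, the zero counts behave like any other entry, so they can be looked up by name or extracted as a plain vector. A short sketch (the name counts is only for illustration):

```r
y <- c(0, 0, 1, 3, 4, 4)
counts <- table(factor(y, levels = 0:5))

counts[["2"]]      # count for the value 2, which never occurs in y
as.vector(counts)  # counts in level order, without the names
```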
Here is a small example:
A <- c(1,1,1,1, 0, 0, 0, 2,2,2)
B <- c(1,1,1,1, 0, 0, 0, 0,2,2)
C <- c(1,1,3,3, 0,0, 2,2,2, NA)
myd <- data.frame (A, B, C)
I need to apply a function, say prod (prod(myd$myvar, na.rm = TRUE)), to a variable, but before applying it I need to count the number of 0's.
(1) If the number of zeros is less than or equal to 3, I need to replace them with NA
myd$A[myd$A ==0] <- NA
(2) If the number of zeros is greater than 3, no replacement needs to be done.
myd$B[myd$B ==0] <- 0
How can I count the zeros and apply the conditions to get the results?
Edit:
In the above dataset, A and C meet condition 1 and B meets condition 2.
Are you looking for something like this?
f <- function(X) {
  if(sum(X==0, na.rm=TRUE) <= 3) X[X==0] <- NA
  X
}
data.frame(lapply(myd, f))
# A B C
# 1 1 1 1
# 2 1 1 1
# 3 1 1 3
# 4 1 1 3
# 5 NA 0 NA
# 6 NA 0 NA
# 7 NA 0 2
# 8 2 0 2
# 9 2 2 2
# 10 2 2 NA
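To finish the task from the question, prod can then be applied per column with na.rm = TRUE. A sketch that repeats the data and f from above so it runs on its own (the names cleaned and products are made up for illustration):

```r
A <- c(1, 1, 1, 1, 0, 0, 0, 2, 2, 2)
B <- c(1, 1, 1, 1, 0, 0, 0, 0, 2, 2)
C <- c(1, 1, 3, 3, 0, 0, 2, 2, 2, NA)
myd <- data.frame(A, B, C)

# Same rule as above: replace zeros with NA only when there are 3 or fewer
f <- function(X) {
  if (sum(X == 0, na.rm = TRUE) <= 3) X[X == 0] <- NA
  X
}

cleaned  <- lapply(myd, f)
products <- sapply(cleaned, prod, na.rm = TRUE)  # NAs are skipped, kept zeros are not
products
```

Column B keeps its four zeros, so its product is 0, while A and C are multiplied over their non-missing values only.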