I have a data frame which looks like this:
df
colA colB
0 0
1 1
0 1
0 1
0 1
1 0
0 0
1 1
0 1
I would like to convert a certain proportion of the 0s in colA to NA, and a certain proportion of the 1s in colB to NA.
If I do this:
df["colA"][df["colA"] == 0] <- NA
all the 0s in colA are converted to NA, but I only want half of them to be converted.
Similarly, for colB I want only 1/3 of the 1s to be converted:
df["colB"][df["colB"] == 1] <- NA
Expected output:
colA colB
0 0
1 1
NA 1
0 1
NA 1
1 0
0 0
1 NA
NA NA
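One way is to find the positions of the 0s and sample half of them:
tmp <- which(df$colA == 0)
df$colA[sample(tmp, round(length(tmp) / 2))] <- NA
Similarly for colB, sampling a third of the 1s:
tmp <- which(df$colB == 1)
df$colB[sample(tmp, round(length(tmp) / 3))] <- NA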
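The two steps above can be wrapped into one function. This is only a sketch: `na_fraction` is a hypothetical helper name, not something from base R or from the answers here. It uses `sample.int()` on the number of candidate positions, because `sample(tmp, k)` samples from `1:tmp` when `tmp` happens to have length one.

```r
# Hypothetical helper (name invented for illustration): set a proportion
# `prop` of the entries of `x` that equal `value` to NA, chosen at random.
na_fraction <- function(x, value, prop) {
  idx <- which(x == value)         # positions holding `value`; which() skips NAs
  k <- round(length(idx) * prop)   # how many of them to blank out
  x[idx[sample.int(length(idx), k)]] <- NA
  x
}

set.seed(42)
df <- data.frame(colA = c(0, 1, 0, 0, 0, 1, 0, 1, 0),
                 colB = c(0, 1, 1, 1, 1, 0, 0, 1, 1))
df$colA <- na_fraction(df$colA, 0, 1/2)   # half of the six 0s
df$colB <- na_fraction(df$colB, 1, 1/3)   # a third of the six 1s
```

Which rows end up as NA depends on the seed, so the result will generally not match the expected output in the question row for row; only the counts are fixed.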
You can use prodNA from the missForest package
set.seed(1)
library(missForest)
df[df$colA == 0, "colA"] <- prodNA(df[df$colA == 0, "colA", drop=F], noNA = 0.5)
df[df$colB == 1, "colB"] <- prodNA(df[df$colB == 1, "colB", drop=F], noNA = 1/3)
df
colA colB
1 NA 0
2 1 NA
3 0 NA
4 NA 1
5 NA 1
6 1 0
7 0 0
8 1 1
9 0 1
I'll contribute a tidyverse approach here.
library(tidyverse)
df %>% mutate(id_colA = ifelse(colA == 1, NA, 1:n()),
colA = ifelse(id_colA %in% sample(na.omit(id_colA), sum(!is.na(id_colA))/2), NA, colA),
id_colB = ifelse(colB == 0, NA, 1:n()),
colB = ifelse(id_colB %in% sample(na.omit(id_colB), sum(!is.na(id_colB))/3), NA, colB)) %>%
select(-starts_with("id_"))
Related
I am essentially trying to merge information from two columns in the following manner:
if column hf is 0 and hf1 is 1 then create new column combining both information.
df <- structure (list(subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-2365"), hf = c("1","0","0","0","1","0"), hf1 = c("NA","1","1","1","NA","0")), class = "data.frame", row.names = c (NA, -6L))
desired output:
subject_id hf hf1 hf2
191-5467 1 NA 1
191-6784 0 1 1
191-3457 0 1 1
191-0987 0 1 1
191-1245 1 NA 1
191-2365 0 0 0
What I've tried:
df$hf2 <- with(df, +(!(df$hf1 == 1 & df$hf1 == 1)))
Output
subject_id hf hf1 hf2
191-5467 1 NA 1
191-6784 0 1 1
191-3457 0 1 1
191-0987 0 1 1
191-1245 1 NA 1
191-2365 0 0 1
Which is close but incorrect because 191-2365 hf2 should have remained 0.
The structure was created with the columns as character, but they should be numeric. After converting them, we can use pmax on the numeric columns:
df <- type.convert(df, as.is = TRUE)
df$hf2 <- with(df, pmax(hf, hf1, na.rm = TRUE))
-output
> df
subject_id hf hf1 hf2
1 191-5467 1 NA 1
2 191-6784 0 1 1
3 191-3457 0 1 1
4 191-0987 0 1 1
5 191-1245 1 NA 1
6 191-2365 0 0 0
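To see what the conversion step does, here is a small sketch on toy vectors (not the asker's full data). It assumes the default behaviour of `type.convert()`, which treats the string "NA" as a missing value.

```r
# Toy vectors mirroring the question's columns; "NA" is a string here
hf  <- c("1", "0", "0")
hf1 <- c("NA", "1", "0")

# type.convert() turns the strings into numbers and "NA" into a real NA
hf_num  <- type.convert(hf,  as.is = TRUE)
hf1_num <- type.convert(hf1, as.is = TRUE)

# Row-wise maximum, ignoring missing values
hf2 <- pmax(hf_num, hf1_num, na.rm = TRUE)
hf2
```

Without the conversion, pmax would compare the values as strings, which is not what is wanted here.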
I have two measures for the same object. The measure is binary (1,0) but many observations are also missing, such that the possible options are: 1, 0, NA.
Data Have:
Source1 Source2
NA NA
NA 0
NA 1
0 NA
0 0
0 1
1 NA
1 0
1 1
(Sources can contradict each other, ignore that for now).
I would like to create a third composite variable that summarizes the two variables, such that IF EITHER of the two sources = 1, then the composite variable should be equal to 1. Otherwise, if either of the sources is not missing, then the composite variable should be equal to zero. Lastly, only if both sources are missing, the composite variable should be set to missing.
Data Want:
Source1 Source2 Composite
NA NA NA
NA 0 0
NA 1 1
0 NA 0
0 0 0
0 1 1
1 NA 1
1 0 1
1 1 1
I have tried different approaches but continue to have the same issue.
Attempt 1:
df<- df %>% mutate(combined = ifelse(df$source1==1 | df$source2==1, 1,
ifelse(df$source1==0 | df$source2==0, 0, NA)))
Attempt 2:
df2<- df %>% mutate(combined = ifelse(is.na(df$source1) & is.na(df$source2), NA,
ifelse(df$source1 == 1 | df$source2 ==1, 1, 0)))
Attempt 3:
df3<- df %>% mutate(combined = ifelse(df$source1==1, 1,
ifelse(df$source1==0 & df$source2==1, 1,
ifelse(df$source1==0 & df$source2==0, 0,
ifelse(df$source1==0 & is.na(df$source2), 0,
ifelse(is.na(df$source1) & df$source2==1, 1,
ifelse(is.na(df$source1) & df$source2==0, 0, NA)))))))
Each attempt correctly identifies whether there is a 1 in either source, but all the remaining values come out as missing, regardless of whether there is a 0 or not.
Actual Output:
Source1 Source2 Composite
NA NA NA
NA 0 NA
NA 1 1
0 NA NA
0 0 NA
0 1 1
1 NA 1
1 0 1
1 1 1
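A likely contributor to this, sketched below on single values rather than on the asker's data, is how R propagates NA through comparisons and logical operators: a comparison with NA is NA, `NA | FALSE` stays NA, and `ifelse()` returns NA wherever its test is NA.

```r
# Comparing with a missing value gives a missing result
cmp <- NA == 1

# OR with NA is only resolved when the other side is TRUE
or_true  <- NA | TRUE
or_false <- NA | FALSE

# ifelse() returns NA when the test itself is NA
res <- ifelse(NA == 0 | 0 == 1, 1, 0)

c(cmp = cmp, or_true = or_true, or_false = or_false, res = res)
```

So a test such as `source1 == 1 | source2 == 1` is NA for a row like (NA, 0), and the nested ifelse never gets a chance to return 0 for it.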
Assuming both the Source1 and Source2 columns consist of 0s, 1s, and NAs (as you noted), you could use this as a base R solution. It uses do.call() to call pmax() across the relevant columns of your data frame:
cols = paste0("Source", 1:2)
df$newcol = do.call(pmax, c(df[cols], na.rm = TRUE))
# equivalent to: pmax(df$Source1, df$Source2, na.rm = TRUE)
df
Source1 Source2 Composite newcol
1 NA NA NA NA
2 NA 0 0 0
3 NA 1 1 1
4 0 NA 0 0
5 0 0 0 0
6 0 1 1 1
7 1 NA 1 1
8 1 0 1 1
9 1 1 1 1
Data:
df = read.table(header = TRUE, text = "Source1 Source2 Composite
NA NA NA
NA 0 0
NA 1 1
0 NA 0
0 0 0
0 1 1
1 NA 1
1 0 1
1 1 1")
One approach is to use case_when rather than nested if-else. It seems simplest to check for the case where both values are missing first, and then check the remaining cases afterwards:
library(tidyverse)
df %>%
  mutate(S1Miss = is.na(Source1),
         S2Miss = is.na(Source2)) %>%
  mutate(Composite = case_when(
    # Both sources missing: the composite is missing too
    S1Miss & S2Miss ~ NA_real_,
    # Either source equal to 1 (an NA condition counts as not matched)
    Source1 == 1 | Source2 == 1 ~ 1,
    # Otherwise at least one source is a non-missing 0
    TRUE ~ 0
  )) %>%
  select(Source1, Source2, Composite)
Note that I made it easier to read by first storing the missingness flags in one call to mutate, and then removing these intermediate columns using select.
This was fun, but I wouldn't recommend doing it like this.
source1<-c(NA, NA, NA, 0, 0, 0, 1, 1, 1)
source2<-c(NA, 0, 1, NA, 0, 1, NA, 0, 1)
df<-data.frame(source1, source2)
df$composite<-ifelse(test = is.na(df$source1) & is.na(df$source2), yes = NA,
no = ifelse(test = is.na(df$source1) & !is.na(df$source2), yes = df$source2,
no = ifelse(is.na(df$source2) & !is.na(df$source1), yes = df$source1,
no = ifelse(df$source1 > df$source2, yes = df$source1,
no = df$source2))))
source1 source2 composite
1 NA NA NA
2 NA 0 0
3 NA 1 1
4 0 NA 0
5 0 0 0
6 0 1 1
7 1 NA 1
8 1 0 1
9 1 1 1
I have searched on stack overflow for examples similar to the problem I am facing and am stuck, so any help would be appreciated! I have a dataframe that is similar to the one below:
df <- data.frame( "ID" = c(rep(1,6), rep(2,6), rep(3,5), rep(4,5)), "A" = c(0, rep(0,4),1 ,rep(0, 5), 1, rep(0,3), 1, rep(0,2), 1, rep(0,3)), "count" = NA)
and I would like to edit the "count" variable so the dataframe looks like this:
df2 <- data.frame( "ID" = c(rep(1,6), rep(2,6), rep(3,5), rep(4,5)), "A" = c(0, rep(0,4),1 ,rep(0, 5), 1, rep(0,3), 1, rep(0,2), 1, rep(0,3)), "count" = c(NA, NA, -3:-1,1, NA, NA, -3:-1,1, -3:-1, 1:2, -1, 1:3, NA ))
Within each df$ID, when df$A = 1 I need df$count = 1. Additionally, I need df$count to count forward from 1:3 and count backwards from -1:-3, omitting zero so df2 is produced. Any help is appreciated!
You can write a function which gives you the desired sequence:
library(dplyr)
add_num <- function(x) {
  # Get the indices where x is 1
  inds <- which(x == 1)
  # For each index, create a sequence with that index as 0
  num <- lapply(inds, function(i) {
    num <- seq_along(x) - i
    # Add 1 to values greater than or equal to 0, so that 0 is skipped
    num[num >= 0] <- num[num >= 0] + 1
    # Keep only counts within 3 positions of the 1
    num[num < -3 | num > 3] <- NA
    num
  })
  # Select the first non-NA value from the sequences at each position
  do.call(coalesce, num)
}
Now apply this function to each ID:
df %>% group_by(ID) %>% mutate(count = add_num(A))
# ID A count
#1 1 0 NA
#2 1 0 NA
#3 1 0 -3
#4 1 0 -2
#5 1 0 -1
#6 1 1 1
#7 1 0 2
#8 1 0 3
#9 1 0 NA
#...
#...
#46 4 0 NA
#47 4 0 NA
#48 4 0 -3
#49 4 0 -2
#50 4 0 -1
#51 4 1 1
#52 4 0 2
#53 4 0 3
#54 4 0 NA
#55 4 0 -3
#56 4 0 -2
#57 4 0 -1
#58 4 1 1
#59 4 0 2
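The same logic can also be sketched in base R, using `ave()` for the grouping instead of dplyr. This is an alternative to the function above, not the answerer's code; `count_around` and its `width` argument are hypothetical names, and the data is the 22-row example from the question.

```r
# Hypothetical base R variant: count positions around each 1, skipping 0,
# and keep only counts within `width` positions.
count_around <- function(a, width = 3) {
  out <- rep(NA_integer_, length(a))
  for (i in which(a == 1)) {
    offs <- seq_along(a) - i
    offs[offs >= 0] <- offs[offs >= 0] + 1L   # shift so the 1 itself gets count 1
    keep <- abs(offs) <= width & is.na(out)   # earlier 1s take precedence
    out[keep] <- offs[keep]
  }
  out
}

df <- data.frame(ID = c(rep(1, 6), rep(2, 6), rep(3, 5), rep(4, 5)),
                 A = c(0, rep(0, 4), 1, rep(0, 5), 1, rep(0, 3), 1,
                       rep(0, 2), 1, rep(0, 3)))

# ave() applies the function within each ID and returns a full-length vector
df$count <- ave(df$A, df$ID, FUN = count_around)
```

Letting earlier 1s take precedence mirrors what coalesce does in the function above; a group without any 1 simply stays NA.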
I already searched the web and found no answer. I have a big data.frame that contains multiple columns. Each column is a factor variable.
I want to transform the data.frame such that each possible value of the factor variables is a variable that either contains a "1" if the variable is present in the factor column or "0" otherwise.
Here is an example of what I mean.
labels <- c("1", "2", "3", "4", "5", "6", "7")
#create data frame (note, not all factor levels have to be in the columns,
#NA values are possible)
input <- data.frame(ID = c(1, 2, 3),
Cat1 = factor(c( 4, 1, 1), levels = labels),
Cat2 = factor(c(2, NA, 4), levels = labels),
Cat3 = factor(c(7, NA, NA), levels = labels))
#the seven factor levels now are the variables of the data.frame
desired_output <- data.frame(ID = c(1, 2, 3),
Dummy1 = c(0, 1, 1),
Dummy2 = c(1, 0, 0),
Dummy3 = c(0, 0, 0),
Dummy4 = c(1, 0, 1),
Dummy5 = c(0, 0, 0),
Dummy6 = c(0, 0, 0),
Dummy7 = c(1, 0, 0))
input
ID Cat1 Cat2 Cat3
1 4 2 7
2 1 <NA> <NA>
3 1 4 <NA>
desired_output
ID Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 Dummy7
1 0 1 0 1 0 0 1
2 1 0 0 0 0 0 0
3 1 0 0 1 0 0 0
My actual data.frame has over 3000 rows and factors with more than 100 levels.
I hope you can help me converting the input to the desired output.
Greetings
sush
Here are a couple of methods that riff off Gregor's and Aaron's answers.
From Aaron's: factorsAsStrings=FALSE keeps the variables as factors, and hence keeps all the levels when using dcast.
library(reshape2)
dcast(melt(input, id="ID", factorsAsStrings=FALSE), ID ~ value, drop=FALSE)
ID 1 2 3 4 5 6 7 NA
1 1 0 1 0 1 0 0 1 0
2 2 1 0 0 0 0 0 0 2
3 3 1 0 0 1 0 0 0 1
Then you just need to remove the last column.
From Gregor's
na.replace <- function(x) replace(x, is.na(x), 0)
options(na.action='na.pass') # this keeps the NA's which are then converted to zero
Reduce("+", lapply(input[-1], function(x) na.replace(model.matrix(~ 0 + x))))
x1 x2 x3 x4 x5 x6 x7
1 0 1 0 1 0 0 1
2 1 0 0 0 0 0 0
3 1 0 0 1 0 0 0
Then you just need to cbind the ID column
One way to do this is with matrix indexing. You have data specifying which locations in your output matrix should be 1 (the rest should be zero), so we'll make a matrix of zeros and then fill in the 1's based on your data. To do that, your data needs to be in a two column matrix, with the first column being the row (ID) of the output and the second column being the columns.
Put input data in long format, remove missings, convert values to integers matching the labels, then make a matrix as needed.
in2 <- reshape2::melt(input, id.vars="ID")
in2 <- subset(in2, !is.na(value))
in2$value <- match(in2$value, labels)
in2$variable <- NULL
in2 <- as.matrix(in2)
Then make the new output matrix with all zeros, and fill in the ones using that matrix.
out <- matrix(0, nrow=nrow(input), ncol=length(labels))
colnames(out) <- labels
rownames(out) <- input$ID
out[in2] <- 1
out
## 1 2 3 4 5 6 7
## 1 0 1 0 1 0 0 1
## 2 1 0 0 0 0 0 0
## 3 1 0 0 1 0 0 0
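The key step is indexing a matrix with a two-column matrix, where each row of the index gives one (row, column) position. A minimal sketch on a toy matrix, separate from the data above:

```r
# A 3 x 4 matrix of zeros
out <- matrix(0, nrow = 3, ncol = 4)

# Each row of idx is one (row, column) position to fill
idx <- cbind(c(1, 1, 3), c(2, 4, 1))
out[idx] <- 1
out
```

In the answer above the ID column doubles as the row index, which works because the IDs are 1, 2, 3. With other IDs one would presumably need to map them to row positions first, for example with match() against input$ID.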
Here's a way using model.matrix. We convert the missing values to 0s, and specify 0 as the reference level for the factor contrasts. Then we just add the individual model matrices together and stick on the IDs:
new_lab = as.character(0:7)
for (i in 2:4) {
temp = as.character(input[[i]])
temp[is.na(temp)] = "0"
input[[i]] = factor(temp, levels = new_lab)
}
mm =
model.matrix(~ Cat1, data = input) +
model.matrix(~ Cat2, data = input) +
model.matrix(~ Cat3, data = input)
mm[, 1] = input$ID
colnames(mm) = c("ID", paste0("Dummy", 1:(ncol(mm) - 1)))
mm
# ID Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 Dummy7
# 1 1 0 1 0 1 0 0 1
# 2 2 1 0 0 0 0 0 0
# 3 3 1 0 0 1 0 0 0
# attr(,"assign")
# [1] 0 1 1 1 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$Cat1
# [1] "contr.treatment"
You can leave the result as a model matrix, change it back to a data frame, or whatever else.
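For comparison, the same table can be built without model.matrix or extra packages. This is a sketch under the assumption that a dummy should be 1 whenever any of the Cat columns holds that level; the names `m`, `dummies` and `result` are made up for illustration.

```r
labels <- c("1", "2", "3", "4", "5", "6", "7")
input <- data.frame(ID = c(1, 2, 3),
                    Cat1 = factor(c(4, 1, 1), levels = labels),
                    Cat2 = factor(c(2, NA, 4), levels = labels),
                    Cat3 = factor(c(7, NA, NA), levels = labels))

# Character matrix of the factor columns (one row per ID)
m <- sapply(input[-1], as.character)

# For each level, flag the rows where any column holds that level
dummies <- sapply(labels, function(l) as.integer(rowSums(m == l, na.rm = TRUE) > 0))
colnames(dummies) <- paste0("Dummy", labels)

result <- data.frame(ID = input$ID, dummies)
result
```

This loops over the levels rather than the rows, so it should stay manageable with 100 or so levels, though it has not been timed on data of that size.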
This should work on your data frame. I converted the values to numeric before running the ifelse statement. Hope it works:
# Make dummy df
Cat1 = factor(c( 4, 1, 1))
Cat2 = factor(c(2, NA, 4))
Cat3 = factor(c(7, NA, NA))
df <- data.frame(Cat1,Cat2,Cat3)
# Specify columns
cols <- seq_along(df)
# Convert values to numeric (%<>% is the assignment pipe from magrittr)
library(magrittr)
df[, cols] %<>% lapply(function(x) as.numeric(as.character(x)))
# Perform ifelse: if the value is NA return 0, else return 1
df[, cols] %<>% lapply(function(x) ifelse(is.na(x), 0, 1))
Based on input:
Cat1 Cat2 Cat3
1 4 2 7
2 1 <NA> <NA>
3 1 4 <NA>
Output looks like this:
Cat1 Cat2 Cat3
1 1 1 1
2 1 0 0
3 1 1 0
This question is slightly similar to this question with a more theoretical component.
Given df below:
varA <- c(1,0,0,NA,NA)
varB <- c(NA,NA,NA,1,0)
df <- data.frame(varA, varB)
varA varB
1 NA
0 NA
0 NA
NA 1
NA 0
What's the most elegant method to generate var (with consideration given to NA) which combines the information from varA and varB?
varA varB var
1 NA 1
0 NA 0
0 NA 0
NA 1 1
NA 0 0
My approach, right now, is as follows:
df$var[df$varA == 1 | df$varB == 1] <- 1
df$var[df$varA == 0 | df$varB == 0] <- 0
As a side question, how does R handle NA in ifelse statements? For example, if I write the following code, it does not produce the output I intended.
df$var <- ifelse(df$varA == 1 | df$varB == 1, 1,
ifelse(df$varA == 0 | df$varB == 0, 0, NA))
You say that var "combines the information from varA and varB".
It seems like you are looking for coalesce:
library(dplyr)
df %>% mutate(var = coalesce(varA, varB))
# varA varB var
#1 1 NA 1
#2 0 NA 0
#3 0 NA 0
#4 NA 1 1
#5 NA 0 0
For your purposes, NA is equivalent to 0, so why not convert them to 0?
df[is.na(df)] <- 0
df$var <- with(df, as.integer(varA | varB))
> df
varA varB var
1 1 0 1
2 0 0 0
3 0 0 0
4 0 1 1
5 0 0 0
We can use pmax
df$var <- do.call(pmax, c(df, na.rm = TRUE))
df$var
#[1] 1 0 0 1 0
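The approaches above agree on this data because each row has exactly one non-missing value. They can differ when both columns are filled in or both are missing. A small sketch on made-up vectors, using `ifelse(is.na(...))` as a base R stand-in for coalesce:

```r
# Made-up data covering the cases the question's df does not contain
varA <- c(0, NA, 1, NA)
varB <- c(1, 0, NA, NA)

# Base R stand-in for coalesce(varA, varB): take varA unless it is missing
first_non_na <- ifelse(is.na(varA), varB, varA)

# Row-wise maximum, ignoring missing values
row_max <- pmax(varA, varB, na.rm = TRUE)

data.frame(varA, varB, first_non_na, row_max)
```

For the first row the coalesce-style result is 0 while pmax gives 1, and both leave the last row as NA. Replacing NA with 0 beforehand would instead give 0 for that last row.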