I have two binary columns:
col1 col2
0 1
0 0
1 0
1 1
I would like to merge these columns so that the merged value is 1 if the value 1 exists in either column (or in both). Example of output:
merged_col
1
0
1
1
The general merge I tried is this:
merge(df$col1, df$col2, all = TRUE)
Any idea how I can handle the values?
You can just treat them as logical values and use or...
df$col3 <- as.integer(df$col1|df$col2)
The code below should do what you need:
df <- data.frame(col1 = c(0, 0, 1, 1), col2 = c(1, 0, 0, 1))
df$merge_col <- ifelse(df$col1 == 1 | df$col2 == 1, 1, 0)
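Both answers give the same result on 0/1 data. As a quick check, and as a third option that is not taken from the answers above, base R's pmax() takes the element-wise maximum, which matches a logical OR when the values are only 0 and 1. A minimal sketch, assuming the columns contain no NA; the variable names are made up for illustration:

```r
df <- data.frame(col1 = c(0, 0, 1, 1), col2 = c(1, 0, 0, 1))

# Logical OR, converted back to 0/1
or_version <- as.integer(df$col1 | df$col2)

# Explicit comparison with ifelse()
ifelse_version <- ifelse(df$col1 == 1 | df$col2 == 1, 1, 0)

# Element-wise maximum (assumption: values are only 0 or 1, no NA)
pmax_version <- pmax(df$col1, df$col2)

df$merged_col <- or_version
```

If the columns can contain NA, the three variants may need explicit NA handling, which is not covered here.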
Related
I have got a data set that looks like this:
COMPANY DATABREACH CYBERBACKGROUND
A 1 2
B 0 2
C 0 1
D 0 2
E 1 1
F 1 2
G 0 2
H 0 2
I 0 2
J 0 2
Now I want to create the following: in 40% of the cases where the column DATABREACH has the value 1, I want CYBERBACKGROUND to take the value 2. I figure there must be some function to do this, but I cannot find it.
ind <- which(df$DATABREACH == 1)
ind <- ind[rbinom(length(ind), 1, prob = 0.4) > 0]
df$CYBERBACKGROUND[ind] <- 2
The above is a bit more efficient in that it only draws random numbers for the rows that strictly need them. If you aren't concerned about that (11000 doesn't seem too high), you can reduce it to
df$CYBERBACKGROUND <-
ifelse(df$DATABREACH == 1 & rbinom(nrow(df), 1, prob = 0.4) > 0,
2, df$CYBERBACKGROUND)
We may use
library(dplyr)
# sample exactly ceiling(40%) of the rows where DATABREACH is 1 and set them to 2
df1 <- df1 %>%
mutate(CYBERBACKGROUND = replace(CYBERBACKGROUND,
sample(which(DATABREACH == 1), ceiling(sum(DATABREACH) * 0.4)), 2))
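The rbinom() approaches change each DATABREACH == 1 row independently with probability 0.4, so the share that ends up changed is 40% only on average. If an exact count is wanted, one option is to sample a fixed number of row positions. A minimal base R sketch under stated assumptions: the toy data, the seed, and the names n_pick and picked are made up for illustration, and rounding up with ceiling() is a choice, not something the question specifies:

```r
set.seed(42)  # assumption: fixed seed only so the sketch is repeatable

df <- data.frame(DATABREACH      = c(1, 0, 0, 0, 1, 1, 0, 0, 0, 0),
                 CYBERBACKGROUND = c(1, 2, 1, 2, 1, 1, 2, 2, 2, 2))
before <- df

# Row positions where a breach occurred
ind <- which(df$DATABREACH == 1)

# Number of rows to change: 40% of the breach rows, rounded up
n_pick <- ceiling(length(ind) * 0.4)

# sample.int() on positions avoids the sample(x, n) surprise when x has length 1
picked <- ind[sample.int(length(ind), n_pick)]

df$CYBERBACKGROUND[picked] <- 2
```

Indexing through sample.int() is used because sample(x, n) treats a single number x as the range 1:x.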
I have the following dataframe:
C1 C2 C3
0 0 0
1 1 0
0 0 0
1 1 0
0 0 0
I now want to apply the following conditions to the dataframe, for specific row indexes only:
C1 should be equal to 0
A random number should be less than 0.5
If both conditions match, I want to change the value of the cells in C1 and C2 to 1; otherwise do nothing.
I am trying the following (rowsIndex holds the specific indexes on which I want to apply the conditions):
apply(DF[rowsIndex,], 2, fun)
where fun is:
fun<- function(x) {
ifelse(x==0,ifelse(runif(n=1)<0.5,x <- 1,x),x )
print(x)
}
My questions are:
In my function, how do I apply the conditions to a certain column only, i.e. C1? (I have tried using DF[rowsIndex, c(1)], but it gives an error.)
Is there any other approach I can take? This approach is not giving me any results, and the same DF is printed.
Thanks
If you want to stay in base R:
#your dataframe
DF <- data.frame(C1 = c(0, 1, 0, 1, 0),
C2 = c(0, 1, 0, 1, 0),
C3 = c(0, 0, 0, 0, 0))
fun<- function(x) {
if(x[1]==0 && runif(n=1)<0.5) {
x[1:2] <- 1
}
return(x)
}
#your selection of rows you want to process
rowsIndex <- c(1, 2, 3, 4)
#Using MARGIN = 1 applies the function to the rows of a dataframe
#apply() returns its results column-wise, so t() turns them back into one row per processed row (a matrix)
DF_processed <- t(apply(DF[rowsIndex,], 1, fun))
#replace the selected rows in the original DF by the processed rows
DF[rowsIndex, ] <- DF_processed
print(DF)
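Because the rule only depends on whole columns, it can also be written without apply(). A minimal sketch of a vectorised variant (not from the answer above), using the same DF and rowsIndex; the names flip, hits and original are made up for illustration, and set.seed() is only there to make the run repeatable:

```r
set.seed(123)  # assumption: fixed seed only so the sketch is repeatable

DF <- data.frame(C1 = c(0, 1, 0, 1, 0),
                 C2 = c(0, 1, 0, 1, 0),
                 C3 = c(0, 0, 0, 0, 0))
rowsIndex <- c(1, 2, 3, 4)
original <- DF

# One random draw per selected row, combined with the C1 == 0 condition
flip <- DF$C1[rowsIndex] == 0 & runif(length(rowsIndex)) < 0.5

# Positions in DF of the selected rows that satisfy both conditions
hits <- rowsIndex[flip]

# Set C1 and C2 to 1 for those rows only; all other cells are left alone
DF$C1[hits] <- 1
DF$C2[hits] <- 1

print(DF)
```

Drawing runif(length(rowsIndex)) gives every selected row its own random number, which matches the per-row draw in the apply() version.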
Something like this?
library(dplyr)
df %>%
mutate(across(c(C1, C2), ~ifelse(C1 == 0 & runif(1) < 0.5, 1, .)))
C1 C2 C3
1 1 0 0
2 1 1 0
3 1 0 0
4 1 1 0
5 1 0 0
Applying it to your function:
fun<- function(df, x, y) {
df %>%
mutate(across(c({{x}}, {{y}}), ~ifelse({{x}} == 0 & runif(1) < 0.5, 1, .)))
}
fun(df, C1, C2)
C1 C2 C3
1 0 0 0
2 1 1 0
3 0 0 0
4 1 1 0
5 0 0 0
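Note that runif(1) draws a single number that is shared by all rows; the two printed outputs above (every eligible C1 changed in the first run, nothing changed in the second) are consistent with that. The first output also suggests that C2 is evaluated after C1 has already been updated, so C2 never changes. A hedged sketch of one way around both points, assuming dplyr 1.0 or later; the temporary column name flip is made up for illustration:

```r
library(dplyr)

set.seed(123)  # assumption: fixed seed only so the sketch is repeatable

df <- data.frame(C1 = c(0, 1, 0, 1, 0),
                 C2 = c(0, 1, 0, 1, 0),
                 C3 = c(0, 0, 0, 0, 0))

out <- df %>%
  mutate(
    # decide once per row, before any column is modified
    flip = C1 == 0 & runif(n()) < 0.5,
    # apply the same decision to both columns
    across(c(C1, C2), ~ ifelse(flip, 1, .x)),
    # drop the helper column again
    flip = NULL
  )

out
```

Storing the decision in a helper column first means C1 and C2 are always changed together.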
I have a dataset named df.
I want to get the columns that have a zero ratio > 50% and make a new dataset from them:
df_new <- get columns where zero_ratio > 50%
Could you help?
Thank You
Try with colMeans:
df_new <- df[, colMeans(df == 0, na.rm = TRUE) > 0.5]
With a reproducible example:
df <- data.frame(a = c(1, 2, 0, 1, 3), b = c(0, 0, 1, 0, 1), c = 0)
df_new <- df[, colMeans(df == 0, na.rm = TRUE) > 0.5]
df_new
# b c
#1 0 0
#2 0 0
#3 1 0
#4 0 0
#5 1 0
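One detail worth knowing: when only a single column passes the threshold, df[, condition] returns a plain vector rather than a data frame. A small sketch on the same example data, assuming a data frame result is always wanted; the names zero_ratio and df_one are made up for illustration:

```r
df <- data.frame(a = c(1, 2, 0, 1, 3), b = c(0, 0, 1, 0, 1), c = 0)

# Share of zeros per column
zero_ratio <- colMeans(df == 0, na.rm = TRUE)

# drop = FALSE keeps the result a data frame even if one column is selected
df_new <- df[, zero_ratio > 0.5, drop = FALSE]

# With a stricter threshold only column c remains, still as a data frame
df_one <- df[, zero_ratio > 0.9, drop = FALSE]
```

Without drop = FALSE, df_one would be a numeric vector and would lose its column name.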
I'm trying to programmatically change a variable from a 0 to a 1 if there are three 1s before and after a 0.
For example, if the numbers in a vector were 1, 1, 1, 0, 1, 1, 1, then I want to change the 0 to a 1.
Here is data in the vector dummy_code in the data.frame df:
original_df <- data.frame(dummy_code = c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1))
Here is how I'm trying to have the values be recoded:
desired_df <- data.frame(dummy_code = c(1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1))
I tried to use the function fill in the package tidyr, but this fills in missing values, so it won't work. If I were to recode the 0 values to be missing, then that would not work either, because it would simply code every NA as 1, when I would only want to code every NA surrounded by three 1s as 1.
Is there an efficient way to do this programmatically?
An rle alternative, using the x from @G. Grothendieck's answer:
r <- rle(x)
Find the indexes of runs of three 1s:
i1 <- which(r$lengths == 3 & r$values == 1)
Check which of these runs of 1s surround a 0 (i.e. are two runs apart), and get the indexes of the 0 runs to be replaced:
i2 <- i1[which(diff(i1) == 2)] + 1
Replace relevant 0 with 1:
r$values[i2] <- 1
Reverse the rle operation on the updated runs:
inverse.rle(r)
# [1] 1 0 0 1 1 1 1 1 1 1 0 0 1
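The rle steps above can be wrapped into a small helper. This is a sketch rather than the answer's exact method: the function name fill_single_zero and the run argument are made up for illustration, and it adds one assumption the original steps do not check, namely that the run of 0s between the two runs of 1s has length exactly 1.

```r
fill_single_zero <- function(x, run = 3) {
  r <- rle(x)
  k <- length(r$values)
  if (k < 3) return(x)  # no run can have a neighbour on both sides

  # Candidate runs: everything except the first and last run
  mid <- 2:(k - 1)

  # A hit is a single 0 with a run of exactly `run` 1s on each side
  hit <- r$values[mid] == 0 & r$lengths[mid] == 1 &
    r$values[mid - 1] == 1 & r$lengths[mid - 1] == run &
    r$values[mid + 1] == 1 & r$lengths[mid + 1] == run

  r$values[mid[hit]] <- 1
  inverse.rle(r)
}

x <- c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1)
result <- fill_single_zero(x)

# Two zeros between the runs of 1s are left untouched by this variant
unchanged <- fill_single_zero(c(1, 1, 1, 0, 0, 1, 1, 1))
```

Whether runs longer than three 1s should also count is a design choice; here the comparison is exact, as in the steps above.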
A similar solution based on data.table::rleid, slightly more compact and perhaps easier to read:
library(data.table)
d <- data.table(x)
Calculate length of each run:
d[ , n := .N, by = rleid(x)]
For "x" values which are zero and whose preceding and subsequent runs of 1 are of length 3, set "x" to 1:
d[x == 0 & shift(n) == 3 & shift(n, type = "lead") == 3, x := 1]
d$x
# [1] 1 0 0 1 1 1 1 1 1 1 0 0 1
Here is a one-liner using rollapply from zoo:
library(zoo)
rollapply(c(0, 0, 0, x, 0, 0, 0), 7, function(x) if (all(x[-4] == 1)) 1 else x[4])
## [1] 1 0 0 1 1 1 1 1 1 1 0 0 1
Note: Input used was:
x <- c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1)
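The same sliding-window idea can be sketched in base R with embed(), which builds a matrix whose rows are consecutive windows of the input. This is an illustrative alternative, not part of the answers above; the names padded, windows, centre, others and out are made up:

```r
x <- c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1)

# Pad with three 0s on each side so every element gets a full window of 7
padded <- c(0, 0, 0, x, 0, 0, 0)

# Each row is one window of 7 values (in reversed order; column 4 is the centre)
windows <- embed(padded, 7)

centre <- windows[, 4]
others <- windows[, -4, drop = FALSE]

# Replace the centre by 1 when all six neighbours are 1, else keep it
out <- ifelse(rowSums(others == 1) == 6, 1, centre)
out
```

Because the centre column is the middle of a symmetric window, the reversed column order produced by embed() does not affect the result.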
I have a data frame with only zeros and ones, e.g.
df <- data.frame(v1 = rbinom(100, 1, 0.5),
v2 = rbinom(100, 1, 0.2),
v3 = rbinom(100, 1, 0.4))
Now I want to modify this data set so that each row sums to 1.
So this
1 0 0
1 1 0
0 0 1
1 1 1
0 0 0
should become this:
1 0 0
0.5 0.5 0
0 0 1
0.33 0.33 0.33
0 0 0
edit: rows with all zeros should be left as is
As already pointed out by @lmo, the data.frame (or matrix) can be modified with
df <- df / rowSums(df)
In the case of rows containing only zeros this will lead to rows containing only NaN. Since these rows should be kept as they were, the easiest way is probably to correct for this afterwards with
df[is.na(df)] <- 0
Here is a quick method:
# create matrix
temp <- matrix(c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1), ncol = 3, byrow = TRUE)
temp / rowSums(temp)
This exploits the fact that matrices are stored column-wise, so the element-by-element division by rowSums and the recycling are aligned.
In the case that all elements in a row are zero and you don't want a NaN (0 / 0), an alternative to @RHertel's method is the following:
# save rowSum:
mySums <- rowSums(temp)
temp / ifelse(mySums != 0, mySums, 1)
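Putting the pieces together on the five example rows from the question, including the all-zero row. A minimal sketch, assuming a data frame of 0/1 values without NA; the names s and normalised are made up for illustration:

```r
df <- data.frame(v1 = c(1, 1, 0, 1, 0),
                 v2 = c(0, 1, 0, 1, 0),
                 v3 = c(0, 0, 1, 1, 0))

# Row totals; replace 0 by 1 so that all-zero rows are divided by 1
s <- rowSums(df)
s[s == 0] <- 1

# The divisor is recycled down each column, so row i is divided by s[i]
normalised <- df / s

rowSums(normalised)
```

Every row of normalised should then sum to 1, except the all-zero row, which stays at 0.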