Taking a subset of equal value and other value of column - r

I have a dataframe:
> df <- data.frame(x = c('x1','x1','x2','x2','x2','x3','x3','x3'),
+ y = c(0,0,1,1,1,0,0,0),
+ z = c(1,1,0,0,0,0,0,0))
> df
x y z
1 x1 0 1
2 x1 0 1
3 x2 1 0
4 x2 1 0
5 x2 1 0
6 x3 0 0
7 x3 0 0
8 x3 0 0
I would like to create a subset of the rows where column y equals 1, keep the corresponding values of column x, and set those 1s to 0.
I have only found how to do the first step:
> length(which(df$y == 1))
[1] 3
How could I get a final output like this:
x y
x2 0
x2 0
x2 0

library(dplyr)
df %>%
filter(y == 1) %>%
select(x, y) %>%
mutate(y = 0)

transform(subset(df[1:2],y==1),y=0)
x y
3 x2 0
4 x2 0
5 x2 0

If you're open to using other packages, data.table is another option:
library(data.table)
setDT(df)[y == 1, .(x, y = 0)]
# x y
#1: x2 0
#2: x2 0
#3: x2 0
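For comparison, the same three steps (filter the rows, keep the columns, overwrite y) can be sketched in base R with plain bracket indexing. This is only an illustration; the name `res` is made up for the example and does not appear in the answers above.

```r
df <- data.frame(x = c('x1','x1','x2','x2','x2','x3','x3','x3'),
                 y = c(0,0,1,1,1,0,0,0),
                 z = c(1,1,0,0,0,0,0,0))

# Step 1 and 2: keep rows where y == 1 and only the x and y columns
res <- df[df$y == 1, c("x", "y")]

# Step 3: overwrite the matched 1s with 0
res$y <- 0
res
```

The original row names (3, 4, 5) are kept, as in the transform/subset answer.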

Related

pairwise subtraction of columns in a dataframe in R

Is there a way to automate (e.g., with a loop) the subtractions (X2-X1), (X3-X1), (X3-X2) in my data below and add the results to the data as three new columns?
m="
id X1 X2 X3
A 1 0 4
B 2 2 2
C 3 4 1"
data <- read.table(text = m, h = T)
This is very similar to this question; we basically just need to change the function that we are using in map2_dfc:
library(tidyverse)
combn(names(data)[-1], 2) %>%
map2_dfc(.x = .[1,], .y = .[2,],
.f = ~transmute(data, !!paste0(.y, "-", .x) := !!sym(.y) - !!sym(.x))) %>%
bind_cols(data, .)
#> id X1 X2 X3 X2-X1 X3-X1 X3-X2
#> 1 A 1 0 4 -1 3 4
#> 2 B 2 2 2 0 0 0
#> 3 C 3 4 1 1 -2 -3
With combn:
dif <- combn(data[-1], 2, \(x) x[, 2] - x[, 1])
colnames(dif) <- combn(names(data)[-1], 2, \(x) paste(x[2], x[1], sep = "-"))
cbind(data, dif)
# id X1 X2 X3 X2-X1 X3-X1 X3-X2
#1 A 1 0 4 -1 3 4
#2 B 2 2 2 0 0 0
#3 C 3 4 1 1 -2 -3
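Since the question asks about a loop, here is a minimal sketch of the same idea written as an explicit for loop in base R. The helper names `pairs`, `a` and `b` are illustrative only.

```r
data <- read.table(text = "
id X1 X2 X3
A 1 0 4
B 2 2 2
C 3 4 1", header = TRUE)

# Each column of `pairs` holds one pair of column names, e.g. "X1", "X2"
pairs <- combn(names(data)[-1], 2)

for (k in seq_len(ncol(pairs))) {
  a <- pairs[1, k]
  b <- pairs[2, k]
  # New column named like "X2-X1" holding the later column minus the earlier one
  data[[paste0(b, "-", a)]] <- data[[b]] - data[[a]]
}
data
```

Because combn enumerates the pairs from the column names, the loop should not need changing if more X columns are added.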

How to use mutate and ifelse in a loop?

What I do is to create dummies to indicate whether a continuous variable exceeds a certain threshold (1) or is below this threshold (0). I achieved this via several repetitive mutates, which I would like to substitute with a loop.
# load tidyverse
library(tidyverse)
# create data
data <- data.frame(x = runif(1:100, min=0, max=100))
# What I do
data <- data %>%
mutate(x20 = ifelse(x >= 20, 1, 0)) %>%
mutate(x40 = ifelse(x >= 40, 1, 0)) %>%
mutate(x60 = ifelse(x >= 60, 1, 0)) %>%
mutate(x80 = ifelse(x >= 80, 1, 0))
# What I would like to do
for (i in seq(from=0, to=100, by=20)){
data %>% mutate(paste(x,i) = ifelse(x >= i, 1,0))
}
Thank you.
You can use map_dfc here:
library(dplyr)
library(purrr)
breaks <- seq(from=0, to=100, by=20)
bind_cols(data, map_dfc(breaks, ~
  data %>% transmute(!!paste0('x', .x) := as.integer(x >= .x))))
# x x0 x20 x40 x60 x80 x100
#1 6.2772517 1 0 0 0 0 0
#2 16.3520358 1 0 0 0 0 0
#3 25.8958212 1 1 0 0 0 0
#4 78.9354970 1 1 1 1 0 0
#5 35.7731737 1 1 0 0 0 0
#6 5.7395139 1 0 0 0 0 0
#7 49.7069551 1 1 1 0 0 0
#8 53.5134559 1 1 1 0 0 0
#...
#....
That said, I think it is much simpler in base R:
data[paste0('x', breaks)] <- lapply(breaks, function(x) as.integer(data$x >= x))
You can use reduce() in purrr.
library(dplyr)
library(purrr)
reduce(seq(0, 100, by = 20), .init = data,
~ mutate(.x, !!paste0('x', .y) := as.integer(x >= .y)))
# x x0 x20 x40 x60 x80 x100
# 1 61.080545 1 1 1 1 0 0
# 2 63.036673 1 1 1 1 0 0
# 3 71.064322 1 1 1 1 0 0
# 4 1.821416 1 0 0 0 0 0
# 5 24.721454 1 1 0 0 0 0
The corresponding base way with Reduce():
Reduce(function(df, y){ df[paste0('x', y)] <- as.integer(df$x >= y); df },
seq(0, 100, by = 20), data)
Ronak's base R is probably the best, but for completeness here's another way similar to how you were originally doing it, just with dplyr:
for (i in seq(from=0, to=100, by=20)){
var <- paste0('x',i)
data <- mutate(data, !!var := ifelse(x >= i, 1,0))
}
x x0 x20 x40 x60 x80 x100
1 99.735037 1 1 1 1 1 0
2 9.075226 1 0 0 0 0 0
3 73.786282 1 1 1 1 0 0
4 89.744719 1 1 1 1 1 0
5 34.139207 1 1 0 0 0 0
6 88.138611 1 1 1 1 1 0

Create a new variable based on any 2 conditions being true

I have a dataframe in R with 4 variables and would like to create a new variable based on any 2 conditions being true on those variables.
I have attempted to create it with if/else statements; however, that would require a branch for every permutation of conditions being true. I would also need to scale this so that I can create a new variable based on any 3 conditions being true. Is there a more efficient method than if/else statements?
My example:
I have a dataframe X with following column variables
x1 = c(1,0,1,0)
X2 = c(0,0,0,0)
X3 = c(1,1,0,0)
X4 = c(0,0,1,0)
I would like to create a new variable X5 if any 2 of the variables are true (eg ==1)
The new variable based on the above dataframe would produce X5 (1,0,1,0)
This can easily be done by using the apply function:
x1 = c(1,0,1,0)
x2 = c(0,0,0,0)
x3 = c(1,1,0,0)
x4 = c(0,0,1,0)
df <- data.frame(x1,x2,x3,x4)
df$x5 <- apply(df,1,function(row) ifelse(sum(row != 0) == 2, 1, 0))
x1 x2 x3 x4 x5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
The second argument of apply (MARGIN = 1) means: call this function on every row. To scale this up to 3...N true values, just change the number compared against in the ifelse call.
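As a hedged sketch of that scaling step, the count of true conditions can be computed once and compared against a threshold. The names `k`, `n_true`, `exactly_k` and `at_least_k` are hypothetical and only serve the example; whether "any 2" should mean exactly 2 or at least 2 depends on the use case, so both are shown.

```r
df <- data.frame(x1 = c(1,0,1,0), x2 = c(0,0,0,0),
                 x3 = c(1,1,0,0), x4 = c(0,0,1,0))

k <- 2  # number of conditions that must hold (hypothetical parameter)

# Count, per row, how many of the variables equal 1
n_true <- rowSums(df == 1)

df$exactly_k  <- as.integer(n_true == k)   # exactly k conditions are true
df$at_least_k <- as.integer(n_true >= k)   # k or more conditions are true
df
```

Changing `k` to 3 is then the only edit needed for the "any 3 conditions" case.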
You can try this:
#Data
df <- data.frame(x1,X2,X3,X4)
#Code
df$X5 <- ifelse(rowSums(df, na.rm = TRUE) == 2, 1, 0)
x1 X2 X3 X4 X5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
You can use:
df$X5 <- 1*(apply(df == 1, 1, sum) == 2)
or
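# pass the columns as separate arguments so that sum() runs across each row
df$X5 <- 1*(do.call(mapply, c(list(FUN = sum), df)) == 2)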
Output
> df
X1 X2 X3 X4 X5
1 0 1 0 1
0 0 1 0 0
1 0 0 1 1
0 0 0 0 0
Data
df <- data.frame(X1,X2,X3,X4)

using crossprod under specific conditions

I am trying to organise a dataset in a very specific way for my research. However, I am new to R and I am really struggling, so any assistance would be greatly appreciated.
I am attempting to take the value of the cell at every third column (starting from the first one) and multiply it by the column beside it, but only if there is a negative value in said cell. Following this, I would like to sum the results together and store it in a new column in an external spreadsheet.
so far the code I have written is as follows:
NegTotal = NULL
p = NULL
for (i in 1:nrow(Datafile))
{for (j in 1:ncol(Datafile))
{if ((j %% 3 == 0) && (Datafile [i,j] < 0)) {
p <- (datafile[i,j] * datafile[i,j+1])
NegTotal <- sum(p) }
else { }
}
}
for (l in seq(along = NegTotal)) {
dim(newColumn)
AsNewData.DataColumn("datafile", GetType(System.String))
NewColumn.DefaultValue = "NegTotal"
table.Columns.Add(newColumn)
}
I am aware that this code is probably completely wrong, this is the first time I've used R and I am not very proficient at computer programming in general.
The current data is arranged as follows:
df <- data.frame(F1 = c( 1, -2, -1), E1 = c(1, 1, 0), Y1 = c(0, 0, 1),
F2 = c(-1, 2, -1), E2 = c(1, 1, 1), Y2 = c(0, 0, 1),
F3 = c(-2, -2, -1), E3 = c(1, 1, 1), Y3 = c(1, 1, 0))
# F1 E1 Y1 F2 E2 Y2 F3 E3 Y3
# 1 1 1 0 -1 1 0 -2 1 1
# 2 -2 1 0 2 1 0 -2 1 1
# 3 -1 0 1 -1 1 1 -1 1 0
Desired Output:
# F1 E1 Y1 F2 E2 Y2 F3 E3 Y3 NegTotal
# 1 1 1 0 -1 1 0 -2 1 1 -3
# 2 -2 1 0 2 1 0 -2 1 1 -4
# 3 -1 0 1 -1 1 1 -1 1 0 -2
So if x = Fy * Ey;
NegTotal = x1 + x2 + x3, only when F$y < 0.
I hope that all makes sense!
Here's how I would approach this with dplyr and tidyr:
library(dplyr)
library(tidyr)
# Add a respondent column (i.e. row number)
df$respondent <- 1:nrow(df)
df %>%
gather(key, value, -respondent) %>%
separate(key, c("letter", "letter_sub"), sep = 1) %>%
spread(letter, value) %>%
mutate(Neg = ifelse(F < 0, E * F, NA)) %>%
group_by(respondent) %>%
summarise(NegTotal = sum(Neg, na.rm = TRUE))
# Source: local data frame [3 x 2]
#
# respondent NegTotal
# (int) (dbl)
# 1 1 -3
# 2 2 -4
# 3 3 -2
To understand what's going on, I would run the pipeline in pieces. For example, look at the results of the first few functions:
df %>%
gather(key, value, -respondent) %>%
separate(key, c("letter", "letter_sub"), sep = 1) %>%
spread(letter, value)
# respondent letter_sub E F Y
# 1 1 1 1 1 0
# 2 1 2 1 -1 0
# 3 1 3 1 -2 1
# 4 2 1 1 -2 0
# 5 2 2 1 2 0
# 6 2 3 1 -2 1
# 7 3 1 0 -1 1
# 8 3 2 1 -1 1
# 9 3 3 1 -1 0
Getting the data into this form makes it easier to perform the summary tasks.
This code will give you your desired output. However, if your actual dataset has more columns than the example you gave, you may need a more general solution.
df$NegTotal<- (pmin(0,df$F1) * df$E1) + (pmin(0,df$F2) * df$E2) + (pmin(0,df$F3) * df$E3)
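One way the pmin idea might be generalised to any number of F/E pairs is sketched below. It assumes the columns follow the naming pattern of the example (F1, E1, F2, E2, ...); the names `f_cols` and `e_cols` are illustrative only.

```r
df <- data.frame(F1 = c( 1, -2, -1), E1 = c(1, 1, 0), Y1 = c(0, 0, 1),
                 F2 = c(-1,  2, -1), E2 = c(1, 1, 1), Y2 = c(0, 0, 1),
                 F3 = c(-2, -2, -1), E3 = c(1, 1, 1), Y3 = c(1, 1, 0))

# Find the F columns by name and derive the matching E columns
f_cols <- grep("^F[0-9]+$", names(df), value = TRUE)
e_cols <- sub("^F", "E", f_cols)

# pmin(..., 0) zeroes out the non-negative F values, so only negative
# F cells contribute to the row-wise sum of F * E
df$NegTotal <- rowSums(pmin(as.matrix(df[f_cols]), 0) * as.matrix(df[e_cols]))
df
```

Matching columns by name rather than by position avoids relying on every third column being an F column.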

Create a co-occurrence matrix from dummy-coded observations

Is there a simple approach to converting a data frame of dummies (binary coded for whether an aspect is present) into a co-occurrence matrix containing the counts of two aspects co-occurring?
E.g. going from this
X <- data.frame(rbind(c(1,0,1,0), c(0,1,1,0), c(0,1,1,1), c(0,0,1,0)))
X
X1 X2 X3 X4
1 1 0 1 0
2 0 1 1 0
3 0 1 1 1
4 0 0 1 0
to this
X1 X2 X3 X4
X1 0 0 1 0
X2 0 0 2 1
X3 1 2 0 1
X4 0 1 1 0
This will do the trick:
X <- as.matrix(X)
out <- crossprod(X) # Same as: t(X) %*% X
diag(out) <- 0 # (b/c you don't count co-occurrences of an aspect with itself)
out
# [,1] [,2] [,3] [,4]
# [1,] 0 0 1 0
# [2,] 0 0 2 1
# [3,] 1 2 0 1
# [4,] 0 1 1 0
To get the results into a data.frame exactly like the one you showed, you can then do something like:
nms <- paste("X", 1:4, sep="")
dimnames(out) <- list(nms, nms)
out <- as.data.frame(out)
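To see why crossprod gives co-occurrence counts: entry (i, j) of t(X) %*% X is the sum over rows of X[, i] * X[, j], and that product is 1 only in rows where both aspects are present. The short check below is a sketch of that reasoning on the example data; the variable name `manual` is made up for the illustration.

```r
X <- as.matrix(data.frame(rbind(c(1,0,1,0), c(0,1,1,0), c(0,1,1,1), c(0,0,1,0))))
out <- crossprod(X)

# Count by hand the rows in which X2 and X3 are both 1
manual <- sum(X[, "X2"] * X[, "X3"])
manual == out["X2", "X3"]

# Before the diagonal is zeroed, it holds each aspect's total count
all(diag(out) == colSums(X))
```

This also shows why the diagonal is set to 0 afterwards: it counts how often each aspect occurs at all, not a co-occurrence with another aspect.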
Though nothing can match the simplicity of the answer above, here is a tidyverse approach for future reference:
library(tidyverse)
Y <- X %>% mutate(id = row_number()) %>%
  pivot_longer(-id) %>% filter(value != 0)
merge(Y, Y, by = "id", all = TRUE) %>%
  filter(name.x != name.y) %>%
  group_by(name.x, name.y) %>%
  summarise(val = n()) %>%
  pivot_wider(names_from = name.y, values_from = val, values_fill = 0, names_sort = TRUE) %>%
  column_to_rownames("name.x")
X1 X2 X3 X4
X1 0 0 1 0
X2 0 0 2 1
X3 1 2 0 1
X4 0 1 1 0
