Create dummy variable based on conditions in multiple other columns [duplicate] - r

This question already has answers here:
How do I create a new column based on multiple conditions from multiple columns?
(3 answers)
Closed 7 months ago.
I am looking for help in adding a dummy variable to an existing dataframe based on conditions in multiple columns (this last bit is what separates my question from the answers I already found).
Here's a simple example:
y <- c(1,2,5,2,3,3)
z <- c("A", "B", "B", "A", "A", "B")
df <- data.frame(y, z)
Now I'd like to have a third column, which takes the value '1' if y is equal to 2 or if z is equal to B. So the column would show a value of 1 for all observations except the first (y = 1, z = A) and the fifth (y = 3, z = A).
I'm sure I know all the ingredients for doing this, I just cannot put it together right now. Any help would be much appreciated!

dplyr option using case_when:
y <- c(1,2,5,2,3,3)
z <- c("A", "B", "B", "A", "A", "B")
df <- data.frame(y = y, z = z)
library(dplyr)
df %>%
  mutate(dummy = case_when(y == 2 | z == "B" ~ 1,
                           TRUE ~ 0))
#> y z dummy
#> 1 1 A 0
#> 2 2 B 1
#> 3 5 B 1
#> 4 2 A 1
#> 5 3 A 0
#> 6 3 B 1
Created on 2022-07-19 by the reprex package (v2.0.1)
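For what it's worth, the same logic also works in base R without any packages. A minimal sketch (not part of the original answer), relying on the fact that a logical vector coerces to 0/1:

# base R: coerce the logical condition directly to 0/1
df$dummy <- as.integer(df$y == 2 | df$z == "B")
df
#>   y z dummy
#> 1 1 A     0
#> 2 2 B     1
#> 3 5 B     1
#> 4 2 A     1
#> 5 3 A     0
#> 6 3 B     1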

Related

Subset a data frame based on count of values of column x. Want only the top two in R

Here is the data frame:
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)
I want to subset the data frame such that I get a new data frame with only the top two "x" based on the count of "x".
One of the simplest ways to achieve what you want to do is with the package data.table. You can read more about it here. Basically, it allows for fast and easy aggregation of your data.
Please note that I modified your initial data by appending the elements 10 and c to p and x, respectively. This way, you won't see a NA when filtering the top two observations.
The idea is to sort your dataset and then use .SD, data.table's special symbol for the subset of data within each group, which is a convenient way of subsetting/filtering/extracting observations.
Please, see the code below.
library(data.table)
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2, 10)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b", "c")
df <- data.table(p, x)
# Sort by the group x and then by p in descending order
setorder( df, x, -p )
# Extract the first two rows by group "x"
top_two <- df[ , .SD[ 1:2 ], by = x ]
top_two
#> x p
#> 1: a 45
#> 2: a 6
#> 3: b 6
#> 4: b 3
#> 5: c 54
#> 6: c 10
Created on 2021-02-16 by the reprex package (v1.0.0)
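As a small side note (an alternative sketch, not from the original answer): head(.SD, 2) returns at most two rows per group, so it avoids the NA rows that .SD[1:2] produces for groups with a single observation, which is why the extra (10, "c") row was appended above.

# head(.SD, 2) takes up to two rows per group; groups with fewer than
# two rows simply return what they have, with no NA padding
top_two <- df[ , head(.SD, 2), by = x ]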
Does this work for you?
Using dplyr:
library(dplyr)
df %>%
  add_count(x) %>%
  slice_max(n, n = 2)
p x n
1 1 a 4
2 3 b 4
3 45 a 4
4 1 a 4
5 1 b 4
6 6 a 4
7 6 b 4
8 2 b 4
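If the question is read strictly as wanting only the rows that belong to the two most frequent values of x (rather than the two largest rows within each group), here is a sketch of one explicit way to express that with dplyr loaded as above and the original df <- data.frame(p, x):

# keep only rows whose x is among the two most common x values
df %>%
  filter(x %in% names(sort(table(x), decreasing = TRUE))[1:2])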

Take data weights into account when using dplyr:count [R]

I am working with purely categorical (factor) survey data, and I need to aggregate the data in order to visualise it. I am currently using the count() function from dplyr, but there is no option to take data weights into account. In particular, I want count() to count each row as its given weight.
Currently count(data, var1, var2, var3) returns an aggregated data frame where each row from data is counted as 1. I want to be able to specify a numeric weight column within my data so that each row is counted as the value in data$weight instead of simply 1.
You could repeat the rows data$weight times and then count. This can be done with the splitstackshape package:
library(splitstackshape)
library(dplyr)
mydf <- data.frame(x = c("a", "b", "q", "a", "b"),
                   y = c("c", "d", "r", "c", "r"),
                   count = c(2, 5, 3, 4, 4))
mydf
x y count
1 a c 2
2 b d 5
3 q r 3
4 a c 4
5 b r 4
mydf %>%
  expandRows("count") %>%
  count(x, y)
x y n
1 a c 6
2 b d 5
3 b r 4
4 q r 3
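As an aside, current versions of dplyr let count() do this directly: its wt argument sums a weight column instead of counting each row as 1, so the row expansion isn't needed. A sketch using the same mydf, with dplyr loaded as above:

# wt = count sums the count column per group instead of counting rows
mydf %>%
  count(x, y, wt = count)
#>   x y n
#> 1 a c 6
#> 2 b d 5
#> 3 b r 4
#> 4 q r 3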

How to sum the values of a numeric variable based on a string variable [duplicate]

This question already has answers here:
Why does summarize or mutate not work with group_by when I load `plyr` after `dplyr`?
(2 answers)
Closed 3 years ago.
Consider the following dataframe:
df <- data.frame(numeric=c(1,2,3,4,5,6,7,8,9,10), string=c("a", "a", "b", "b", "c", "d", "d", "e", "d", "f"))
print(df)
numeric string
1 1 a
2 2 a
3 3 b
4 4 b
5 5 c
6 6 d
7 7 d
8 8 e
9 9 d
10 10 f
It has a numeric variable and a string variable. Now, I would like to create another data frame in which the string variable contains only the unique values "a", "b", "c", "d", "e", "f", and the numeric variable is the sum of the numeric values in the previous data frame for each of those strings, resulting in this data frame:
print(new_df)
numeric string
1 3 a
2 7 b
3 5 c
4 22 d
5 8 e
6 10 f
This can be done using a for loop, but that would be rather inefficient on large datasets, so I would prefer other options. I have tried using the dplyr package, but I did not get the expected result:
library(dplyr)
> df %>% group_by(string) %>% summarize(result = sum(numeric))
result
1 55
It could be an issue of function masking from plyr (summarise/mutate also exist in plyr). We can explicitly specify the summarise from dplyr:
library(dplyr)
df %>%
  group_by(string) %>%
  dplyr::summarise(numeric = sum(numeric))
You can do this without loading any extra packages using tapply or aggregate.
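A quick sketch of what both could look like with the df above (not part of the original answer):

# aggregate(): one row per group, result returned as a data frame
aggregate(numeric ~ string, data = df, FUN = sum)
#>   string numeric
#> 1      a       3
#> 2      b       7
#> 3      c       5
#> 4      d      22
#> 5      e       8
#> 6      f      10

# tapply(): a named vector of group sums
tapply(df$numeric, df$string, sum)
#>  a  b  c  d  e  f
#>  3  7  5 22  8 10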

stacking rows as columns in R [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 3 years ago.
I'm trying to stack rows of data into columns so that the variables in another column will repeat. I would like to turn something like this
tib <- tribble(~x, ~y, ~z, "a", 1,2, "b", 3,4)
> tib
# A tibble: 2 x 3
x y z
<chr> <dbl> <dbl>
1 a 1 2
2 b 3 4
into
t <- tribble(~X, ~Y, "a", 1, "a", 2, "b", 3, "b", 4)
> t
# A tibble: 4 x 2
X Y
<chr> <dbl>
1 a 1
2 a 2
3 b 3
4 b 4
Thanks for your help and sorry if I've missed this solution somewhere. I did a search and tried applying gather() and spread(), but couldn't get them to work.
Here is an example using data.table::melt():
# Assuming your data is a data.frame
xyz <- data.frame(
x = c("a", "b"),
y = c(1, 3),
z = c(2, 4)
)
library(data.table)
melt(xyz, id.vars = "x")[c(1, 3)]
x value
1 a 1
2 b 3
3 a 2
4 b 4
This can be done with many packages. One possibility is tidyr and its gather() function.
EDIT
Using @sindri_baldur's data:
library(tidyr)
xyz %>%
  gather(class, measurement, -x)
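Note that gather() has since been superseded in tidyr by pivot_longer(), which does the same reshaping with a slightly different interface. A sketch on the same data:

library(tidyr)
# pivot_longer() is the current replacement for gather()
xyz %>%
  pivot_longer(cols = -x, names_to = "class", values_to = "measurement")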

Creating a matrix of co-occurence from many variables, and plotting it

I have a data set of individuals with a number of health conditions. Individuals either do (1) or do not (0) have each condition (my real data set has 14). What I want to do is summarise the data so I know how often pairs of conditions occur. Note that some individuals may have three or four of the conditions, but what I'm interested in is the pairwise co-occurence. I would then like to plot this as a heatmap.
I suspect that the solution involves the 'gather' function from tidyr, but I haven't been able to work it out. This is an example of what my input looks like and what I'd like to achieve:
Here's some data on individuals and whether or not they have conditions "a", "b" or "c":
library(tidyverse)
library(viridis)
dat <- tibble(
  id = c(1:15),
  a = c(1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1),
  b = c(1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1),
  c = c(0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0))
I want to summarise how often each of the conditions occur, and how often they co-occur. In this case, it's evident that conditions "a" and "b" co-occur more often than do either of these with "c", which usually occurs on its own. Below is my imagined idea of what the data will look like in a plottable format. The first column is 'variable 1', the second is 'variable 2', and the third, is the count of how often these occur together. Below that is the plot which I have in my mind.
plotdat <- tibble(
  var1 = c("a", "a", "a", "b", "b", "c"),
  var2 = c("a", "b", "c", "b", "c", "c"),
  count = c(7, 6, 2, 8, 3, 8))

ggplot(plotdat) +
  geom_tile(aes(var1, var2, fill = count)) +
  scale_fill_viridis()
Perhaps this is not the right approach at all and I actually need to convert the data into a 3x3 matrix. Any possible solutions would be gratefully received!
Here is a way
library(tidyverse)
as.matrix(dat[-1]) %>%
  crossprod() %>%
  `[<-`(upper.tri(.), NA) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(key, value, -rowname) %>%
  filter(!is.na(value))
# rowname key value
#1 a a 7
#2 b a 6
#3 c a 2
#4 b b 8
#5 c b 3
#6 c c 8
The most important piece is crossprod, I think. But let's go through it step by step.
You don't need the column id, so we exclude it and convert dat[-1] to a matrix, because that is what crossprod expects. crossprod(X) computes t(X) %*% X; since the columns are 0/1 indicators, each entry counts the rows in which both conditions are 1, i.e. the pairwise co-occurrences, while the diagonal gives how often each condition occurs overall.
as.matrix(dat[-1]) %>%
  crossprod()
# a b c
#a 7 6 2
#b 6 8 3
#c 2 3 8
Then we replace the upper triangle of this matrix with NA because you don't want the same pair counted twice (a-b and b-a are the same pair).
The next step is to convert to a data frame, turn the row names into a column, and reshape from wide to long:
as.matrix(dat[-1]) %>%
  crossprod() %>%
  `[<-`(upper.tri(.), NA) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(key, value, -rowname)
# rowname key value
#1 a a 7
#2 b a 6
#3 c a 2
#4 a b NA
#5 b b 8
#6 c b 3
#7 a c NA
#8 b c NA
#9 c c 8
Finally, we remove the NAs to get the desired output.
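To get the heatmap asked for in the question, the long result plugs straight into the geom_tile() call sketched there, just with rowname, key and value in place of var1, var2 and count (cooc is a name chosen here only for illustration):

cooc <- as.matrix(dat[-1]) %>%
  crossprod() %>%
  `[<-`(upper.tri(.), NA) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(key, value, -rowname) %>%
  filter(!is.na(value))

# same heatmap as in the question, built from the co-occurrence counts
ggplot(cooc) +
  geom_tile(aes(rowname, key, fill = value)) +
  scale_fill_viridis()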
