R if row value equals colnames assign 1 else 0 [duplicate]

This question already has answers here: "R - How to one hot encoding a single column while keep other columns still?" (5 answers)
The original table is like this:
id food
1  fish
2  egg
2  apple
For each row, the column that matches its food should be 1 and the other food columns 0, so the table should look like this:
id food  fish egg apple
1  fish     1   0     0
2  egg      0   1     0
2  apple    0   0     1

A suggestion using the dcast() function from the reshape2 package:
df1 <- read.table(header = TRUE, text = "
id food
1 fish
2 egg
2 apple
")
###
df2 <- reshape2::dcast(data = df1,
formula = id+food ~ food,
fun.aggregate = length,
value.var = "food")
df2
#> id food apple egg fish
#> 1 1 fish 0 0 1
#> 2 2 apple 1 0 0
#> 3 2 egg 0 1 0
###
df3 <- reshape2::dcast(data = df1,
formula = id+factor(food, levels=unique(food)) ~
factor(food, levels=unique(food)),
fun.aggregate = length,
value.var = "food")
names(df3) <- c("id", "food", "fish", "egg", "apple")
df3
#> id food fish egg apple
#> 1 1 fish 1 0 0
#> 2 2 egg 0 1 0
#> 3 2 apple 0 0 1
# Created on 2021-01-29 by the reprex package (v0.3.0.9001)
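For comparison, the same indicator columns can be built without reshape2. This is only a minimal base R sketch and not part of the original answer; the data frame is recreated so the block runs on its own, and the names foods, indicators and df_onehot are hypothetical.

```r
# Recreate the question's data so the sketch is self-contained
df1 <- data.frame(
  id = c(1, 2, 2),
  food = c("fish", "egg", "apple"),
  stringsAsFactors = FALSE
)

# Candidate column names, in order of first appearance
foods <- unique(df1$food)

# outer() compares every row's food with every candidate name;
# the unary + turns the logical matrix into 0/1 integers
indicators <- +outer(df1$food, foods, "==")
colnames(indicators) <- foods

# Bind the indicator columns next to the original columns
df_onehot <- cbind(df1, indicators)
df_onehot
```

Because the columns come from unique(), they keep the order in which the foods first appear, so no renaming step is needed.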

Related

Filter for if rows where ID is same but Value column has different value to first occurrence

I am looking for advice on the principle of filtering a dataset in R. I currently have the below code which allows for easy filtering of records where a value in column 'Value' is within the required list that I have created:
ValuesNumber <-
read.table(textConnection("CustomerID Value
1 Ball
1 Cat
2 Ball
2 Ball
3 Dog
4 Ball
4 Blitz"), header=TRUE)
#Filter for required values only
Values_List <- "Ball|Twist|Tester"
ValuesNumberFiltered <- ValuesNumber[grep(Values_List, ValuesNumber$Value), ]
I am looking to amend this so that the below criteria are met:
'CustomerID' appears in the dataset at least twice
The entry in 'Value' column for the second entry does not appear within a list of my choosing.
So for example if working with this dataset:
CustomerID Value
1          Ball
1          Cat
2          Ball
2          Ball
3          Dog
4          Ball
4          Blitz
I would then like to create a new column entitled 'Y/N' which has:
'1' if the row is a second or later occurrence of its CustomerID and its value does not match my list, or
'0' otherwise.
So the output would look like this:
CustomerID Value Y/N
1          Ball  0
1          Cat   1
2          Ball  0
2          Ball  0
3          Dog   0
4          Ball  0
4          Blitz 1
tidyverse solution:
library(dplyr)
Values_List <- c("Ball", "Twist", "Tester")
ValuesNumber %>%
group_by(CustomerID) %>%
mutate(`Y/N` = +(n() >= 2 & !(Value %in% Values_List)))
CustomerID Value `Y/N`
1 1 Ball 0
2 1 Cat 1
3 2 Ball 0
4 2 Ball 0
5 3 Dog 0
6 4 Ball 0
7 4 Blitz 1
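An alternative using case_when():
library(dplyr)
# grepl() expects a single regular expression, so use the pattern string from the question
Values_List <- "Ball|Twist|Tester"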
ValuesNumber %>%
group_by(CustomerID) %>%
mutate(`Y/N` = case_when(
row_number() == 1 ~ 0,
grepl(Values_List, Value) ~ 0,
TRUE ~ 1
)) %>%
ungroup()
# # A tibble: 7 × 3
# CustomerID Value `Y/N`
# <int> <chr> <dbl>
# 1 1 Ball 0
# 2 1 Cat 1
# 3 2 Ball 0
# 4 2 Ball 0
# 5 3 Dog 0
# 6 4 Ball 0
# 7 4 Blitz 1
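The same rule can also be written in base R. This is a minimal sketch, assuming that "after the first occurrence" means every repeat of a CustomerID in row order; duplicated() flags exactly those rows. The data is recreated so the block runs on its own.

```r
# Recreate the question's data
ValuesNumber <- data.frame(
  CustomerID = c(1, 1, 2, 2, 3, 4, 4),
  Value = c("Ball", "Cat", "Ball", "Ball", "Dog", "Ball", "Blitz"),
  stringsAsFactors = FALSE
)
Values_List <- c("Ball", "Twist", "Tester")

# TRUE for the second and later rows of each CustomerID
is_repeat <- duplicated(ValuesNumber$CustomerID)

# 1 only when the row is a repeat AND its value is not in the list
ValuesNumber$`Y/N` <- as.integer(is_repeat & !(ValuesNumber$Value %in% Values_List))
ValuesNumber
```

A customer that appears only once never has a repeat row, so the "appears at least twice" condition is covered without a separate count.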
rm(list = ls())
library(tidyverse)
values_number <- read.table(
textConnection("CustomerID Value
1 Ball
1 Cat
2 Ball
2 Ball
3 Dog
4 Ball
4 Blitz"), header = TRUE)
# Filter for required values only
value_list <- c("Ball", "Twist", "Tester")
count_id <- values_number |>
group_by(CustomerID) |>
summarise(count = length(CustomerID)) |> # count the occurrences of each customer ID
right_join(values_number, by = "CustomerID") |> # join the counts back onto the original data
mutate("Y/N" = case_when(
count > 1 & !(Value %in% value_list) ~ 1, # 1 if the customer ID occurs more than once and
TRUE ~ 0) # the value is not in the list;
) # 0 otherwise

Create contingency table that displays the frequency distribution of pairs of variables

I want to create a contingency table that displays the frequency distribution of pairs of variables. Here is an example dataset:
mm <- matrix(0, 5, 6)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("Horror", "Thriller", "Comedy", "Romantic", "Sci.fi", "gender")
All variables are binary, with 1 indicating either the presence of a specific movie type or the male gender. In the end, I would like a table that counts the presence of each movie type for each gender. Something like this:
         male female
Horror      1      1
Thriller    1      3
Comedy      2      2
Romantic    0      0
Sci.fi      2      0
I know I can create two tables of different movie types for male and female individually (see TarJae's answer here Create count table under specific condition) and cbind them later but I would like to do it in one chunk of code. How to achieve this in an efficient way?
You could do
sapply(split(df, df$gender), function(x) colSums(x[names(x)!="gender"]))
#> 0 1
#> Horror 1 1
#> Thriller 1 3
#> Comedy 0 0
#> Romantic 0 0
#> Sci.fi 1 3
Here is a solution using dplyr, tidyr and forcats:
library(dplyr)
library(tidyr)
library(forcats) # for fct_recode()
df %>% pivot_longer(cols = -gender, names_to = "type") %>%
mutate(gender = fct_recode(as.character(gender),Male = "0",Female = "1")) %>%
group_by(gender,type) %>%
summarise(sum = sum(value)) %>%
pivot_wider(names_from = gender,values_from = sum)
Which gives
# A tibble: 5 x 3
type Male Female
<chr> <dbl> <dbl>
1 Comedy 0 1
2 Horror 1 3
3 Romantic 1 1
4 Sci.fi 1 1
5 Thriller 1 1
The mutate() line is optional; it only relabels the levels of gender. Note that the question codes 1 as male, so the two labels should be swapped to match that coding.
Please find below a reprex with an alternative solution using data.table and magrittr (for the pipes), also in one chunk.
Reprex
Your data (I set a seed for reproducibility)
set.seed(452)
mm <- matrix(0, 5, 6)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("Horror", "Thriller", "Comedy", "Romantic", "Sci.fi", "gender")
df
#> Horror Thriller Comedy Romantic Sci.fi gender
#> 1 0 1 1 0 0 0
#> 2 0 0 0 0 1 0
#> 3 1 0 1 1 0 1
#> 4 0 1 0 0 0 1
#> 5 0 1 0 0 0 1
Code in one chunk
library(data.table)
library(magrittr) # for the pipes!
df %>%
transpose(., keep.names = "rn") %>%
setDT(.) %>%
{.[, .(rn = rn,
male = rowSums(.[,.SD, .SDcols = .[, .SD[.N]] == 1]),
female = rowSums(.[,.SD, .SDcols = .[, .SD[.N]] == 0]))][rn !="gender"]}
Output
#> rn male female
#> 1: Horror 1 0
#> 2: Thriller 2 1
#> 3: Comedy 1 1
#> 4: Romantic 1 0
#> 5: Sci.fi 0 1
Created on 2021-11-25 by the reprex package (v2.0.1)
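Since every column is 0/1, the per-gender counts are also a matrix product of the movie columns with a pair of gender indicator columns. The following is a minimal base R sketch of that idea, not taken from any of the answers above; it hard-codes the seeded data shown in the reprex and assumes 1 means male, as stated in the question. The names movies, groups and tab are hypothetical.

```r
# The seeded example data from the reprex above, typed out explicitly
df <- data.frame(
  Horror   = c(0, 0, 1, 0, 0),
  Thriller = c(1, 0, 0, 1, 1),
  Comedy   = c(1, 0, 1, 0, 0),
  Romantic = c(0, 0, 1, 0, 0),
  Sci.fi   = c(0, 1, 0, 0, 0),
  gender   = c(0, 0, 1, 1, 1)
)

# Movie indicators as a numeric matrix (one column per movie type)
movies <- as.matrix(df[names(df) != "gender"])

# One 0/1 column per gender; assumption: 1 = male, 0 = female
groups <- cbind(male = df$gender, female = 1 - df$gender)

# t(movies) %*% groups: each cell counts rows where both indicators are 1
tab <- crossprod(movies, groups)
tab
```

The result is a plain matrix with movie types as row names and male/female as column names, which matches the layout requested in the question.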

In R, take sum of multiple variables if combination of values in two other columns are unique

I am trying to expand on the answer to a previously solved problem, "Take Sum of a Variable if Combination of Values in Two Other Columns are Unique",
but because I am new to Stack Overflow, I can't comment directly on that post, so here is my problem:
I have a dataset like the following but with about 100 columns of binary data as shown in "ani1" and "bni2" columns.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum all the columns by location and season: for every column from the third onwards, I want one total per unique combination of location and season.
The problem is that not all the columns have a 1 value for every combination of location and season, and they all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!
You can use dplyr::summarise and across after group_by.
library(dplyr)
df %>%
group_by(Locations, seasons) %>%
summarise(across(everything(), ~sum(.x, na.rm = TRUE))) %>% # everything() skips the grouping columns, so bni2 is included too
ungroup()
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons ani1 bni2
<chr> <int> <int> <int>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
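The for loop in the question can also be replaced by a single aggregate() call, because a dot on the left-hand side of the formula stands for every column not named on the right. This is a minimal base R sketch under that assumption, using the question's data; the name totals is hypothetical.

```r
# Recreate the question's data
df <- data.frame(
  Locations = c("A", "A", "A", "A", "B", "B", "C", "C", "D", "D", "D"),
  seasons   = c("2", "2", "3", "4", "2", "3", "1", "2", "2", "4", "4"),
  ani1 = c(1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0),
  bni2 = c(0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1)
)

# "." expands to all remaining columns, so this scales to ~100 binary columns
totals <- aggregate(. ~ Locations + seasons, data = df, FUN = sum)

# aggregate() orders by the last grouping variable first; re-sort for readability
totals <- totals[order(totals$Locations, totals$seasons), ]
rownames(totals) <- NULL
totals
```

Note that the formula interface drops rows containing NA by default, so data with missing values may need na.action adjusted.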

R how to create a dictionary of unique values [duplicate]

This question already has answers here: "Split character column into several binary (0/1) columns" (7 answers)
I have a column in a dataframe that contains multiple values like this
fruits
1 apple,banana
2 banana,peaches
3 peaches
4 mango
Is there a way to create a dictionary of unique values for fruits, which
would create a new column fruits with the values:
fruits = apple,banana,peaches,mango
UPDATE: I need the values as columns, not just a list of unique values, so that I can create a final dataframe like the following:
fruits fruit_apple fruit_banana fruit_mango fruit_peaches
1 apple,banana 1 1 0 0
2 banana,peaches 0 1 0 1
3 peaches 0 0 0 1
4 mango 0 0 1 0
We can do this easily with cSplit_e from splitstackshape
library(splitstackshape)
cSplit_e(df1, "fruits", ",", type = "character", fill = 0)
# fruits fruits_apple fruits_banana fruits_mango fruits_peaches
#1 apple,banana 1 1 0 0
#2 banana,peaches 0 1 0 1
#3 peaches 0 0 0 1
#4 mango 0 0 1 0
data
df1 <- structure(list(fruits = c("apple,banana", "banana,peaches", "peaches",
"mango")), .Names = "fruits", class = "data.frame", row.names = c("1",
"2", "3", "4"))
Do you want the new column to be that concatenated list repeated? Sorry, it's not particularly clear. Assuming that's the case, and that your data.frame consists of strings rather than factors:
df <- read.delim(
text="fruits
apple,banana
banana,peaches
peaches
mango",
sep="\n",
header=TRUE,
stringsAsFactors=FALSE)
df
#> fruits
#> 1 apple,banana
#> 2 banana,peaches
#> 3 peaches
#> 4 mango
df$uniquefruits <- paste0(unique(unlist(strsplit(df$fruits, split=","))), collapse=",")
df
#> fruits uniquefruits
#> 1 apple,banana apple,banana,peaches,mango
#> 2 banana,peaches apple,banana,peaches,mango
#> 3 peaches apple,banana,peaches,mango
#> 4 mango apple,banana,peaches,mango
Or do you mean taking only the values from your first fruits column that are not duplicated elsewhere?
Update: Based on comments, I think this is what you're after:
uniquefruits <- unique(unlist(strsplit(df$fruits, split=",")))
uniquefruits
#> [1] "apple" "banana" "peaches" "mango"
df2 <- cbind(df,
sapply(uniquefruits,
function(y) apply(df, 1,
function(x) as.integer(y %in% unlist(strsplit(x, split=","))))))
df2
#> fruits apple banana peaches mango
#> 1 apple,banana 1 1 0 0
#> 2 banana,peaches 0 1 1 0
#> 3 peaches 0 0 1 0
#> 4 mango 0 0 0 1
In theory, you could do this with dplyr but I can't figure out how to automate the column processing for the rowwise mutate (anyone know how?)
library(dplyr)
df %>% rowwise() %>% mutate(apple = as.integer("apple" %in% unlist(strsplit(fruits, ","))),
banana = as.integer("banana" %in% unlist(strsplit(fruits, ","))),
peaches = as.integer("peaches" %in% unlist(strsplit(fruits, ","))),
mango = as.integer("mango" %in% unlist(strsplit(fruits, ","))))
#> Source: local data frame [4 x 5]
#> Groups: <by row>
#>
#> # A tibble: 4 x 5
#> fruits apple banana peaches mango
#> <chr> <int> <int> <int> <int>
#> 1 apple,banana 1 1 0 0
#> 2 banana,peaches 0 1 1 0
#> 3 peaches 0 0 1 0
#> 4 mango 0 0 0 1
With base R:
fruits <- sort(unique(unlist(strsplit(as.character(df$fruits), split=','))))
cols <- as.data.frame(matrix(rep(0, nrow(df)*length(fruits)), ncol=length(fruits)))
names(cols) <- fruits
df <- cbind.data.frame(df, cols)
df <- as.data.frame(t(apply(df, 1, function(x){fruits <- strsplit(x['fruits'], split=','); x[unlist(fruits)] <- 1;x})))
df
fruits apple banana mango peaches
1 apple,banana 1 1 0 0
2 banana,peaches 0 1 0 1
3 peaches 0 0 0 1
4 mango 0 0 1 0
You can use the steps below:
1) Split the column on commas using the strsplit function.
2) Unlist the resulting list of vectors into a single vector.
3) Take the unique values of the list.fruits character vector.
Here is the solution:
# DataFrame of fruits
f <- c("apple,banana","banana,peaches","peaches","mango")
fruits <- as.data.frame(f)
# fruits data frame:
#                f
# 1   apple,banana
# 2 banana,peaches
# 3        peaches
# 4          mango
list.fruits <- unlist(strsplit(f,split=","))
unique.fruits <- unique(list.fruits)
# Result
unique.fruits
[1] "apple" "banana" "peaches" "mango"
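The steps above stop at the vector of unique values, whereas the question's update asks for one 0/1 column per fruit. A minimal base R sketch of that last step follows; it is an illustration, not the answerer's code, and the names parts, indicators and result are hypothetical. The fruit_ prefix follows the column names in the question.

```r
# Recreate the question's data as character strings
df <- data.frame(
  fruits = c("apple,banana", "banana,peaches", "peaches", "mango"),
  stringsAsFactors = FALSE
)

# Steps 1-3 from above: split, unlist, de-duplicate
parts <- strsplit(df$fruits, split = ",", fixed = TRUE)
unique.fruits <- unique(unlist(parts))

# For each unique fruit, check row by row whether it is among that row's parts
indicators <- vapply(
  unique.fruits,
  function(fruit) as.integer(vapply(parts, function(p) fruit %in% p, logical(1))),
  integer(nrow(df))
)
colnames(indicators) <- paste0("fruit_", unique.fruits)

# Attach the indicator columns to the original data
result <- cbind(df, indicators)
result
```

Matching on the split parts with %in% avoids partial matches that a plain grepl() on the raw string could produce (for example "apple" inside "pineapple").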

Reshape from long to wide and create columns with binary value

I am aware of the spread function in the tidyr package but this is something I am unable to achieve.
I have a data.frame with 2 columns as defined below. I need to transpose the column Subject into binary columns with 1 and 0.
Below is the data frame:
studentInfo <- data.frame(StudentID = c(1,1,1,2,3,3),
Subject = c("Maths", "Science", "English", "Maths", "History", "History"))
> studentInfo
StudentID Subject
1 1 Maths
2 1 Science
3 1 English
4 2 Maths
5 3 History
6 3 History
And the output I am expecting is:
StudentID Maths Science English History
1 1 1 1 1 0
2 2 1 0 0 0
3 3 0 0 0 1
How can I do this with the spread() function or any other function?
Using reshape2, we can dcast from long to wide.
As you only want a binary outcome, we can de-duplicate the data with unique() first:
library(reshape2)
si <- unique(studentInfo)
dcast(si, formula = StudentID ~ Subject, fun.aggregate = length)
# StudentID English History Maths Science
#1 1 1 0 1 1
#2 2 0 0 1 0
#3 3 0 1 0 0
Another approach using tidyr and dplyr is
library(tidyr)
library(dplyr)
studentInfo %>%
mutate(yesno = 1) %>%
distinct %>%
spread(Subject, yesno, fill = 0)
# StudentID English History Maths Science
#1 1 1 0 1 1
#2 2 0 0 1 0
#3 3 0 1 0 0
Although I'm not a fan (yet) of tidyr syntax...
We can use table from base R
+(table(studentInfo)!=0)
# Subject
#StudentID English History Maths Science
# 1 1 0 1 1
# 2 0 0 1 0
# 3 0 1 0 0
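The table() result above is a table object rather than a data frame with a StudentID column. If a data frame is needed, one possible conversion is sketched below in base R; this is an assumption about what the asker wants downstream, and the names tab, ind and wide are hypothetical.

```r
# Recreate the question's data
studentInfo <- data.frame(
  StudentID = c(1, 1, 1, 2, 3, 3),
  Subject = c("Maths", "Science", "English", "Maths", "History", "History"),
  stringsAsFactors = FALSE
)

# Count each StudentID/Subject pair (duplicates give counts above 1)
tab <- table(studentInfo$StudentID, studentInfo$Subject)

# Collapse counts to 0/1 and rebuild a plain integer matrix with subject names
ind <- matrix(as.integer(tab > 0), nrow = nrow(tab),
              dimnames = list(NULL, colnames(tab)))

# The table's row names hold the student IDs
wide <- data.frame(StudentID = as.numeric(rownames(tab)), ind)
wide
```

As with the dcast() output, the subject columns come out in alphabetical order rather than in order of first appearance.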
Using tidyr :
library(tidyr)
studentInfo <- data.frame(
StudentID = c(1,1,1,2,3,3),
Subject = c("Maths", "Science", "English", "Maths", "History", "History"))
pivot_wider(studentInfo,
names_from = "Subject",
values_from = 'Subject',
values_fill = 0,
values_fn = function(x) 1)
#> # A tibble: 3 x 5
#> StudentID Maths Science English History
#> <dbl> <int> <int> <int> <int>
#> 1 1 1 1 1 0
#> 2 2 1 0 0 0
#> 3 3 0 0 0 1
Created on 2019-09-19 by the reprex package (v0.3.0)
