How to one-hot encode stacked columns in R

I have data that look like this
+---+-------+
| | col1 |
+---+-------+
| 1 | A |
| 2 | A,B |
| 3 | B,C |
| 4 | B |
| 5 | A,B,C |
+---+-------+
Expected Output
+---+---+---+---+
|   | A | B | C |
+---+---+---+---+
|1 | 1 | 0 | 0 |
|2 | 1 | 1 | 0 |
|3 | 0 | 1 | 1 |
|4 | 0 | 1 | 0 |
|5 | 1 | 1 | 1 |
+---+---+---+---+
How can I encode it like this?

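Maybe this could help (it uses dplyr and tidyr, and assumes col1 is a list column as in the data below):
library(dplyr)
library(tidyr)
df %>%
  mutate(r = 1:n()) %>%   # row id, so each original row stays identifiable
  unnest(col1) %>%        # one row per (row id, value) pair
  table() %>%
  t()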
which gives
col1
r A B C
1 1 0 0
2 1 1 0
3 0 1 1
4 0 1 0
5 1 1 1
Data
df <- tibble(
col1 = list(
"A",
c("A", "B"),
c("B", "C"),
"B",
c("A", "B", "C")
)
)
If your data is given in the following format
df <- data.frame(
col1 = c("A", "A,B", "B,C", "B", "A,B,C")
)
then you can try
with(
df,
table(rev(stack(setNames(strsplit(col1, ","), seq_along(col1)))))
)
which gives
values
ind A B C
1 1 0 0
2 1 1 0
3 0 1 1
4 0 1 0
5 1 1 1

You could use table() with map_df() from purrr to count the occurrences
in each element of a list and return a data frame. Wrapping that in a
function with some post-processing, and relying on dplyr's data frame
unpacking in mutate(), you could do something like this to stay within a
data frame context:
library(tidyverse)
one_hot <- function(x) {
map_df(x, table) %>%
mutate_all(as.integer) %>%
mutate_all(replace_na, 0L)
}
df <- data.frame(col1 = c("A", "A,B", "B,C", "B", "A,B,C"))
df %>%
mutate(
one_hot(strsplit(col1, ","))
)
#> col1 A B C
#> 1 A 1 0 0
#> 2 A,B 1 1 0
#> 3 B,C 0 1 1
#> 4 B 0 1 0
#> 5 A,B,C 1 1 1

An additional base R solution:
+(
  with(
    df,
    sapply(unique(unlist(strsplit(col1, ","))), grepl, col1)
  )
)
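One caveat with the grepl() approach above: grepl() matches substrings, so a label such as "A" would also match a longer label such as "AB". A minimal sketch of an exact-token variant follows, assuming comma-separated strings; the sample data and the names tokens, labels and one_hot are hypothetical and chosen only to show the difference.

```r
# Hypothetical data: the label "AB" contains "A" as a substring
df <- data.frame(col1 = c("A", "AB", "A,AB", "B"), stringsAsFactors = FALSE)

tokens <- strsplit(df$col1, ",", fixed = TRUE)  # list of character vectors
labels <- sort(unique(unlist(tokens)))           # distinct labels across all rows

# One row per input row, one column per label; %in% compares whole tokens,
# so "A" is not counted as present in a row that only contains "AB"
one_hot <- t(vapply(
  tokens,
  function(x) as.integer(labels %in% x),
  integer(length(labels))
))
colnames(one_hot) <- labels
one_hot
```

With the original single-letter labels both approaches give the same matrix; the difference only shows up when one label is contained in another.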

Related

calculate frequency of unique values per group in R

How can I count the number of unique values such that I go from:
organisation <- c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D")
variable <- c("0","0","1","2","0","0","1","1","0","0","1","1","0","0","2","2")
df <- data.frame(organisation,variable)
organisation | variable
A | 0
A | 1
A | 2
A | 2
B | 0
B | 0
B | 1
B | 1
C | 0
C | 0
C | 1
C | 1
D | 0
D | 2
D | 2
D | 2
To:
unique_values | frequency
0,1,2 | 1
0,1 | 2
0,2 | 1
There are only 3 possible sequences:
0,1,2
0,1
0,2

Try this
s <- aggregate(. ~ organisation , data = df , \(x) names(table(x)))
s$variable <- sapply(s$variable , \(x) paste0(x , collapse = ","))
setNames(aggregate(. ~ variable , data = s , length) , c("unique_values" , "frequency"))
output
unique_values frequency
1 0,1 2
2 0,1,2 1
3 0,2 1

You can do something simple like this:
library(dplyr)
library(stringr)
distinct(df) %>%
arrange(variable) %>%
group_by(organisation) %>%
summarize(unique_values = str_c(variable,collapse = ",")) %>%
count(unique_values)
Output:
unique_values n
<chr> <int>
1 0,1 2
2 0,1,2 1
3 0,2 1

Threshold Flag creation in R

id | 85| 291| 5680| 41
---+---+----+-----+----
597| 1 | 1 | 1 | 1
672| 1 | 0 | 0 | 0
680| 1 | 1 | 1 | 0
683| 1 | 1 | 1 | 1
I have a table that looks something like the one above. I want to flag each row where the 1 values account for at least 90% of the row (not including the id column).
So for this example only rows 1 and 4 would be flagged.
intended output:
id | 85| 291| 5680| 41 | flag |
---+---+----+-----+----+------+
597| 1 | 1 | 1 | 1 | yes |
672| 1 | 0 | 0 | 0 | no |
680| 1 | 1 | 1 | 0 | no |
683| 1 | 1 | 1 | 1 | yes |
How can I do this in R using tidyverse syntax? I tried a few things with rowSums(), but I can't come up with a solution.

Perhaps try using rowMeans:
df$flag = rowMeans(df[-1]) >= .9
This assumes you have only 1 and 0 for values here.
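To get the "yes"/"no" labels from the intended output rather than TRUE/FALSE, the same idea can be sketched as below. This is only an illustration: the data frame is rebuilt from the table in the question, share_of_ones is a made-up helper name, and treating "account for 90%" as `>= 0.9` is an assumption.

```r
# Data rebuilt from the question; check.names = FALSE keeps the numeric column names
df <- data.frame(
  id     = c(597, 672, 680, 683),
  `85`   = c(1, 1, 1, 1),
  `291`  = c(1, 0, 1, 1),
  `5680` = c(1, 0, 1, 1),
  `41`   = c(1, 0, 0, 1),
  check.names = FALSE
)

# Share of 1s per row, ignoring the id column (valid only for 0/1 data)
share_of_ones <- rowMeans(df[, setdiff(names(df), "id")])

# Assumption: "account for 90%" means at least 90%
df$flag <- ifelse(share_of_ones >= 0.9, "yes", "no")
df
```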

If your "table" is actually a data frame with all columns except the first being columns of 1s and 0s, you could do:
df %>% mutate(flag = apply(df[-1], 1, function(x) sum(x)/length(x) > 0.9))

An option in the tidyverse would be to reshape to 'long' format, compute the mean per id, and join the result back to the original dataset
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -id) %>%
group_by(id) %>%
summarise(flag = mean(value) > 0.9) %>%
right_join(df1) %>%
select(names(df1), everything())
# A tibble: 4 x 6
# id `85` `291` `5680` `41` flag
# <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#1 597 1 1 1 1 TRUE
#2 672 1 0 0 0 FALSE
#3 680 1 1 1 0 FALSE
#4 683 1 1 1 1 TRUE
data
df1 <- structure(list(id = c(597, 672, 680, 683), `85` = c(1, 1, 1,
1), `291` = c(1, 0, 1, 1), `5680` = c(1, 0, 1, 1), `41` = c(1,
0, 0, 1)), class = "data.frame", row.names = c(NA, -4L))

R: How to filter column as long as it contains combination of values?

I have a df like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
In R, how do I filter for VisitIDs as long as they contain Item A & B?
Expected Outcome:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
I tried df %>% group_by(VisitID) %>% filter(any(Item == 'A' & Item == 'B')) but it doesn't work.
df <- read_delim("ID | Item
1 | A
1 | B
2 | A
3 | B
1 | C
4 | C
5 | B
3 | A
4 | A
5 | D", delim = "|", trim_ws = TRUE)

Since you want both "A" and "B" you can use all
library(dplyr)
df %>% group_by(VisitID) %>% filter(all(c("A", "B") %in% Item))
# VisitID Item
# <int> <chr>
#1 1 A
#2 1 B
#3 1 C
#4 1 D
#5 2 A
#6 2 D
#7 2 B
Or, if you want to use any, apply it to each condition separately.
df %>% group_by(VisitID) %>% filter(any(Item == 'A') && any(Item == 'B'))
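The attempt in the question fails because `Item == 'A' & Item == 'B'` is evaluated row by row, and a single Item can never be both values at once. A small base R sketch of the same group-level test is shown below, assuming the data from the table in the question; has_both and result are illustrative names, not part of the answers above.

```r
# Data rebuilt from the table in the question
df <- data.frame(
  VisitID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L),
  Item    = c("A", "B", "C", "D", "A", "D", "B", "B", "C", "D", "C"),
  stringsAsFactors = FALSE
)

# Row-wise test: no row is "A" and "B" at the same time, so this is FALSE
row_wise <- any(df$Item == "A" & df$Item == "B")

# Group-wise test: does each visit contain both items somewhere?
has_both <- tapply(df$Item, df$VisitID, function(x) all(c("A", "B") %in% x))

# Keep every row belonging to a visit that passed the test
result <- df[df$VisitID %in% names(which(has_both)), ]
result
```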

An option with data.table
library(data.table)
setDT(df)[, .SD[all(c("A", "B") %in% Item)], VisitID]

Generate Data Frame from Count Data

I am trying to create an unsummarized data frame from a data frame of count data.
I have had some experience creating sample datasets but I am having some trouble trying to get a specific number of rows and proportion for each state/person without coding each of them separately and then combining them. I was able to do it using the following code but I feel like there is a better way.
set.seed(2312)
dragon <- sample(c(1),3,replace=TRUE)
Maine <- sample(c("Maine"),3,replace=TRUE)
Maine1 <- data.frame(dragon, Maine)
dragon <- sample(c(0),20,replace=TRUE)
Maine <- sample(c("Maine"),20,replace=TRUE)
Maine2 <- data.frame(dragon, Maine)
Maine2
library(dplyr)
maine3 <- bind_rows(Maine1, Maine2)
Is there a better way to generate this dataset than the code above?
I am trying to create a data frame from the following count data:
+-------------+--------------+--------------+
| | # of dragons | # no dragons |
+-------------+--------------+--------------+
| Maine | 3 | 20|
| California | 1 | 10|
| Jocko | 28 | 110515 |
| Jessica Day | 17 | 26122 |
| | 14 | 19655 |
+-------------+--------------+--------------+
And I would like it to look like this:
+-----------------------+---------------+
| | Dragons (1/0) |
+-----------------------+---------------+
| Maine | 1 |
| Maine | 1 |
| Maine | 1 |
| Maine | 0 |
| Maine….(2:20) | 0…. |
| California | 1 |
| California….(2:10) | 0… |
| Etc. | |
+-----------------------+---------------+
I do not want the code written for me, but would love ideas on functions or examples that you think might be helpful.

I am not completely sure what sampling has to do with this problem.
It looks to me like you are looking for untable (from the reshape package).
Here is an example
data:
set.seed(1)
no_drag = sample(1:5, 5)
drag = sample(15:25, 5)
df <- data.frame(names = LETTERS[1:5],
drag,
no_drag)
names drag no_drag
1 A 24 2
2 B 25 5
3 C 20 4
4 D 23 3
5 E 15 1
library(reshape)
library(tidyverse)
df %>%
gather(key, value, 2:3) %>% #convert to long format
{untable(.,num = .$value)} %>% #untable by value column
mutate(value = ifelse(key == "drag", 1, 0)) %>% #convert values to 1/0 (1 = dragon)
select(-key) %>% #remove unwanted column
arrange(names) #optional
#part of output
names value
1 A 1
2 A 1
3 A 1
4 A 1
5 A 1
6 A 1
7 A 1
8 A 1
9 A 1
10 A 1
11 A 1
12 A 1
13 A 1
14 A 1
15 A 1
16 A 1
17 A 1
18 A 1
19 A 1
20 A 1
21 A 1
22 A 1
23 A 1
24 A 1
25 A 0
26 A 0
27 B 1
28 B 1
29 B 1
30 B 1
There are other ways to tackle the problem.
One is what @Frank mentioned in the comments:
df %>%
gather(key, val, 2:3) %>%
mutate(v = Map(rep, key == "drag", val)) %>% #repeat TRUE/FALSE val times
unnest(v) %>%
select(-key, -val)
Another:
long <- gather(df, key, value, 2:3)
long <- long[rep(seq_len(nrow(long)), long$value), 1:2] #repeat each row value times
long$key <- long$key == "drag" #TRUE for dragon rows, FALSE otherwise

One can use tidyr::expand to expand the rows into the desired format.
Using the df from @missuse's answer, the solution looks like this:
library(tidyverse)
df %>% gather(key,value,-names) %>%
mutate(key = ifelse(key=="drag", 1, 0)) %>%
group_by(names,key) %>%
expand(value = 1:value) %>%
select(names, value = key) %>%
as.data.frame()
# names value
# 1 A 0
# 2 A 0
# 3 A 1
# 4 A 1
# 5 A 1
# 6 A 1
# 7 A 1
# 8 A 1
# 9 A 1
# 10 A 1
# ...so on
# 117 E 1
# 118 E 1
# 119 E 1
# 120 E 1
# 121 E 1
# 122 E 1
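Applied to the first two rows of the count table in the question, the "repeat each row by its count" idea behind all of the answers above can be sketched in base R with rep(). The column names state, dragons and no_dragons are assumptions made for this illustration.

```r
# First two rows of the count table from the question (column names are assumed)
counts <- data.frame(
  state      = c("Maine", "California"),
  dragons    = c(3L, 1L),
  no_dragons = c(20L, 10L),
  stringsAsFactors = FALSE
)

expanded <- data.frame(
  # each state appears once per observation (dragons + no dragons)
  state = rep(counts$state, times = counts$dragons + counts$no_dragons),
  # within each state: d ones followed by n zeros
  dragon = unlist(Map(
    function(d, n) rep(c(1L, 0L), times = c(d, n)),
    counts$dragons, counts$no_dragons
  )),
  stringsAsFactors = FALSE
)

table(expanded$state, expanded$dragon)
```

No sampling is needed here: every value is fully determined by the counts, so rep() is enough.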

Identify the occurrence of value after another specific value

I have the following table:
+----+------------+----------+
| ID | Date | Variable |
+----+------------+----------+
| a | 12/03/2017 | d |
| a | 15/04/2017 | d |
| a | 20/06/2017 | c |
| b | 14/05/2017 | c |
| b | 15/08/2017 | c |
| b | 16/09/2017 | c |
+----+------------+----------+
For each ID, I'd like to have a check in a separate column which tells whether there was a "c" value after the occurrence of a "d" value, like this:
+----+------------+----------+-------+------------+
| ID | Date | Variable | Check | Date |
+----+------------+----------+-------+------------+
| a | 12/03/2017 | d | 1 | 20/06/2017 |
| a | 15/04/2017 | d | 1 | 20/06/2017 |
| a | 20/06/2017 | c | 1 | 20/06/2017 |
| b | 14/05/2017 | c | 0 | 0 |
| b | 15/08/2017 | c | 0 | 0 |
| b | 16/09/2017 | c | 0 | 0 |
+----+------------+----------+-------+------------+
It's not just about finding the occurence of "c", but about seeing whether "c" occurs after d or not. It would also help to have the corresponding date in a separate column. I was trying with removing the duplicates & then identifying the lead value (or n of rows > 1), but is there a simpler way to do this?
Any dplyr or data.table approach would be most helpful.

A solution using dplyr. There must be a better way than this, but I think this should work. unique(Variable[!is.na(Variable)]) is to get a vector with only c("c", "d"), c("d", "c"), "c", or "d". If you are sure there are no NA, you can remove !is.na. Date[Variable %in% "c"][1] is to select the first date.
dat2 <- dat %>%
group_by(ID) %>%
mutate(Check = ifelse(identical(unique(Variable[!is.na(Variable)]), c("d", "c")),
1L, 0L)) %>%
mutate(Date2 = ifelse(Check == 1L, Date[Variable %in% "c"][1], "0")) %>%
ungroup()
dat2
# # A tibble: 6 x 5
# ID Date Variable Check Date2
# <chr> <chr> <chr> <int> <chr>
# 1 a 12/03/2017 d 1 20/06/2017
# 2 a 15/04/2017 d 1 20/06/2017
# 3 a 20/06/2017 c 1 20/06/2017
# 4 b 14/05/2017 c 0 0
# 5 b 15/08/2017 c 0 0
# 6 b 16/09/2017 c 0 0
DATA
dat <- read.table(text = "ID Date Variable
a '12/03/2017' d
a '15/04/2017' d
a '20/06/2017' c
b '14/05/2017' c
b '15/08/2017' c
b '16/09/2017' c",
header = TRUE, stringsAsFactors = FALSE)

A data.table solution. As also suggested by @RYoda, you can use data.table::shift to test for your condition and then merge the results back to the original dataset
check <- dat[, {
idx <- Variable =='d' & shift(Variable, type="lead") == "c"
list(MatchDate=ifelse(any(idx), shift(Date, type="lead", fill=NA_character_)[idx][1L], "0"),
Check=as.integer(any(idx)))
}, by=.(ID)]
dat[check, on=.(ID)]
# ID Date Variable MatchDate Check
# 1: a 12/03/2017 d 20/06/2017 1
# 2: a 15/04/2017 d 20/06/2017 1
# 3: a 20/06/2017 c 20/06/2017 1
# 4: b 14/05/2017 c 0 0
# 5: b 15/08/2017 c 0 0
# 6: b 16/09/2017 c 0 0
data:
library(data.table)
dat <- data.table(ID=rep(c('a','b'), each=3),
Date=c("12/03/2017","15/04/2017","20/06/2017","14/05/2017","15/08/2017","16/09/2017"),
Variable=c('d','d','c','c','c','c'))

One solution uses fill from the tidyr package. The approach is as follows:
First populate Check and c_Date for rows where Variable is c. Then fill the rows above using the fill function on both the Check and c_Date columns. This step populates the desired values in the rows where Variable is d. Finally, reset the values of Check and c_Date for the rows where Variable is c.
Note: OP suggested that Check for rows with Variable as c can be either 0 or 1. My solution sets it to 0.
# Data
df <- read.table(text = "ID Date Variable
a 12/03/2017 d
a 15/04/2017 d
a 20/06/2017 c
b 14/05/2017 c
b 15/08/2017 c
b 16/09/2017 c", header = T, stringsAsFactors = F)
df$Date <- as.POSIXct(df$Date, format = "%d/%m/%Y")
library(dplyr)
library(tidyr)
df %>% group_by(ID) %>%
arrange(ID, Date) %>%
mutate(Check = ifelse(Variable == "c", 1L, NA),
c_Date = ifelse(Variable == "c", as.character(Date), NA) ) %>%
fill(Check, .direction = "up") %>%
fill(c_Date, .direction = "up") %>%
mutate(Check = ifelse(Variable == "c", 0L, Check),
c_Date = ifelse(Variable == "c", NA, c_Date) )
# Result
# ID Date Variable Check c_Date
# <chr> <dttm> <chr> <int> <chr>
# 1 a 2017-03-12 00:00:00 d 1 2017-06-20
# 2 a 2017-04-15 00:00:00 d 1 2017-06-20
# 3 a 2017-06-20 00:00:00 c 0 <NA>
# 4 b 2017-05-14 00:00:00 c 0 <NA>
# 5 b 2017-08-15 00:00:00 c 0 <NA>
# 6 b 2017-09-16 00:00:00 c 0 <NA>
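For comparison, the condition "a c occurs somewhere after the first d" can also be written directly in base R. The sketch below is only an illustration under a few assumptions: dates are in dd/mm/yyyy format, every row of a matching ID gets Check = 1 as in the question's expected output, and NA is used instead of "0" for IDs without a match. The helper first_c_after_d is a hypothetical name.

```r
dat <- data.frame(
  ID = rep(c("a", "b"), each = 3),
  Date = c("12/03/2017", "15/04/2017", "20/06/2017",
           "14/05/2017", "15/08/2017", "16/09/2017"),
  Variable = c("d", "d", "c", "c", "c", "c"),
  stringsAsFactors = FALSE
)

# Sort chronologically within each ID so that "after" refers to time
dat <- dat[order(dat$ID, as.Date(dat$Date, format = "%d/%m/%Y")), ]

# Hypothetical helper: date of the first "c" that follows the first "d", else NA
first_c_after_d <- function(variable, date) {
  d_pos <- which(variable == "d")
  if (length(d_pos) == 0) return(NA_character_)
  c_pos <- which(variable == "c" & seq_along(variable) > d_pos[1])
  if (length(c_pos) == 0) NA_character_ else date[c_pos[1]]
}

# One result per ID, then mapped back onto every row of that ID
match_date <- vapply(
  split(dat, dat$ID),
  function(g) first_c_after_d(g$Variable, g$Date),
  character(1)
)
dat$MatchDate <- unname(match_date[dat$ID])
dat$Check <- as.integer(!is.na(dat$MatchDate))
dat
```

Unlike a test on adjacent rows only, this also covers sequences where other values sit between the d and the later c.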
