id | 85| 291| 5680| 41
---+---+----+-----+----
597| 1 | 1 | 1 | 1
672| 1 | 0 | 0 | 0
680| 1 | 1 | 1 | 0
683| 1 | 1 | 1 | 1
I have a table that looks something like the one above. I want to flag each row where the 1 values account for 90% of the row (not including the id column).
So for this example only rows 1 and 4 would be flagged.
Intended output:
id | 85| 291| 5680| 41 | flag |
---+---+----+-----+----+------+
597| 1 | 1 | 1 | 1 | yes |
672| 1 | 0 | 0 | 0 | no |
680| 1 | 1 | 1 | 0 | no |
683| 1 | 1 | 1 | 1 | yes |
How can I do this in R using tidyverse syntax? I tried some things with rowSums(), but I can't come up with a solution.
Perhaps try using rowMeans:
df$flag = rowMeans(df[-1]) >= .9
This assumes you have only 1 and 0 for values here.
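If the columns might hold values other than 0 and 1, one hedge is to compare against 1 explicitly before averaging. Below is a minimal base R sketch, assuming the data frame from the question (rebuilt here as df) and assuming "yes"/"no" labels are wanted as in the intended output:

```r
# Example data from the question; check.names = FALSE keeps the numeric column names
df <- data.frame(
  id = c(597, 672, 680, 683),
  `85` = c(1, 1, 1, 1),
  `291` = c(1, 0, 1, 1),
  `5680` = c(1, 0, 1, 1),
  `41` = c(1, 0, 0, 1),
  check.names = FALSE
)

# Share of cells equal to 1 in each row, ignoring the id column
share_of_ones <- rowMeans(df[-1] == 1)

# Label rows that reach the 90% threshold
df$flag <- ifelse(share_of_ones >= 0.9, "yes", "no")
df
```

With four value columns, three 1s give a share of 0.75, so only rows made up entirely of 1s pass the threshold here.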
If your "table" is actually a data frame with all columns except the first being columns of 1s and 0s, you could do:
df %>% mutate(flag = apply(df[-1], 1, function(x) sum(x)/length(x) > 0.9))
An option in the tidyverse would be to reshape to 'long' format, get the mean per id, and join the result back to the original dataset:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -id) %>%
group_by(id) %>%
summarise(flag = mean(value) > 0.9) %>%
right_join(df1) %>%
select(names(df1), everything())
# A tibble: 4 x 6
# id `85` `291` `5680` `41` flag
# <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#1 597 1 1 1 1 TRUE
#2 672 1 0 0 0 FALSE
#3 680 1 1 1 0 FALSE
#4 683 1 1 1 1 TRUE
data
df1 <- structure(list(id = c(597, 672, 680, 683), `85` = c(1, 1, 1,
1), `291` = c(1, 0, 1, 1), `5680` = c(1, 0, 1, 1), `41` = c(1,
0, 0, 1)), class = "data.frame", row.names = c(NA, -4L))
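The reshape can also be skipped by computing the row mean inside mutate(). This is a sketch rather than part of the answer above; it assumes dplyr 1.0 or later (for across()) and uses the same df1 data, with "yes"/"no" labels chosen to match the question's intended output:

```r
library(dplyr)

# Same data as df1 above
df1 <- data.frame(
  id = c(597, 672, 680, 683),
  `85` = c(1, 1, 1, 1),
  `291` = c(1, 0, 1, 1),
  `5680` = c(1, 0, 1, 1),
  `41` = c(1, 0, 0, 1),
  check.names = FALSE
)

# across(-id) selects every column except id; rowMeans() averages them per row
flagged <- df1 %>%
  mutate(flag = if_else(rowMeans(across(-id)) >= 0.9, "yes", "no"))

flagged
```

This keeps the original column order and row count, so no join back to df1 is needed.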
I have data that look like this
+---+-------+
| | col1 |
+---+-------+
| 1 | A |
| 2 | A,B |
| 3 | B,C |
| 4 | B |
| 5 | A,B,C |
+---+-------+
Expected Output
+---+-----------+
| | A | B | C |
+---+-----------+
|1 | 1 | 0 | 0 |
|2 | 1 | 1 | 0 |
|3 | 0 | 1 | 1 |
|4 | 0 | 1 | 0 |
|5 | 1 | 1 | 1 |
+---+---+---+---+
How can I encode it like this?
Maybe this could help
df %>%
mutate(r = 1:n()) %>%
unnest(col1) %>%
table() %>%
t()
which gives
col1
r A B C
1 1 0 0
2 1 1 0
3 0 1 1
4 0 1 0
5 1 1 1
Data
df <- tibble(
col1 = list(
"A",
c("A", "B"),
c("B", "C"),
"B",
c("A", "B", "C")
)
)
If your data is given in the following format
df <- data.frame(
col1 = c("A", "A,B", "B,C", "B", "A,B,C")
)
then you can try
with(
df,
table(rev(stack(setNames(strsplit(col1, ","), seq_along(col1)))))
)
which gives
values
ind A B C
1 1 0 0
2 1 1 0
3 0 1 1
4 0 1 0
5 1 1 1
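The one-liner above packs several steps together. Unpacked, it might look like the following sketch; the intermediate names (parts, long, tab) are only illustrative and do not come from the answer:

```r
df <- data.frame(col1 = c("A", "A,B", "B,C", "B", "A,B,C"),
                 stringsAsFactors = FALSE)

# 1. Split each string into its labels: a list with one character vector per row
parts <- strsplit(df$col1, ",", fixed = TRUE)

# 2. Name the list elements by row number so stack() can record where each label came from
names(parts) <- seq_along(parts)

# 3. stack() returns a long data frame with columns `values` (label) and `ind` (row)
long <- stack(parts)

# 4. Cross-tabulate rows against labels to get the 0/1 indicator table
tab <- table(row = long$ind, label = long$values)
tab
```

Each cell counts how often a label occurs in a row, so it stays 0/1 as long as no label is repeated within a row.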
You could use table() with map_df() from purrr to count the occurrences
in each element of a list, and return a data frame. Putting it into a
function with some post-processing, and using dplyr's data frame unpacking in
mutate(), you could do something like this to stay within a data frame
context:
library(tidyverse)
one_hot <- function(x) {
map_df(x, table) %>%
mutate_all(as.integer) %>%
mutate_all(replace_na, 0L)
}
df <- data.frame(col1 = c("A", "A,B", "B,C", "B", "A,B,C"))
df %>%
mutate(
one_hot(strsplit(col1, ","))
)
#> col1 A B C
#> 1 A 1 0 0
#> 2 A,B 1 1 0
#> 3 B,C 0 1 1
#> 4 B 0 1 0
#> 5 A,B,C 1 1 1
An additional base R solution:
+(
with(
df,
sapply(
unique(
unlist(
strsplit(
col1,
","
)
)
),
`grepl`,
col1
)
)
)
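One caveat with the grepl approach: each label is treated as a pattern and matched anywhere in the string, so a label such as "A" would also match a row containing only "AB". Below is a sketch of an exact-match variant, using hypothetical data chosen to show that case (the names parts, labels, and one_hot are illustrative):

```r
# Hypothetical data where one label ("A") is a substring of another ("AB")
df <- data.frame(col1 = c("A", "AB,B", "B,C"), stringsAsFactors = FALSE)

# Split each row into its labels and collect the distinct labels
parts <- strsplit(df$col1, ",", fixed = TRUE)
labels <- sort(unique(unlist(parts)))

# For each row, mark which labels appear exactly among that row's split values
one_hot <- t(sapply(parts, function(p) as.integer(labels %in% p)))
colnames(one_hot) <- labels
one_hot
```

Row 2 gets a 1 for "AB" and "B" only, whereas pattern matching on "A" would also have flagged it.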
I have a data frame in R which contains information about clients' purchasing history over the last year. The data frame looks something like this:
Client | Prod A | Prod B | Prod C
---------------------------------
A | 1 | 0 | 1
B | 1 | 1 | 0
C | 1 | 0 | 1
D | 0 | 0 | 1
E | 1 | 0 | 0
---------------------------------
Here 1 means the client has purchased the product at some point, and 0 means the client has not bought it at all.
In this particular table the most frequent combination is Product A and Product C with 2 cases out of 5.
I want to find a method/function that will get me the most common combination of products for a data frame of this type of any dimensions.
Thanks in advance for your help.
res <- as.data.frame(xtabs(~., data=dat[,-1]))
res
# Prod.A Prod.B Prod.C Freq
# 1 0 0 0 0
# 2 1 0 0 1
# 3 0 1 0 0
# 4 1 1 0 1
# 5 0 0 1 1
# 6 1 0 1 2
# 7 0 1 1 0
# 8 1 1 1 0
From this you can see the counts of combinations, the "max" of which is
subset(res, Freq == max(Freq))
# Prod.A Prod.B Prod.C Freq
# 6 1 0 1 2
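If a readable label for the winning combination is wanted, one possibility is to paste together the names of the purchased products per client and tabulate those labels. This is a base R sketch, assuming the question's data is in dat with the client in the first column; the names prod_cols, combo, and counts are illustrative:

```r
dat <- data.frame(
  Client = c("A", "B", "C", "D", "E"),
  Prod.A = c(1, 1, 1, 0, 1),
  Prod.B = c(0, 1, 0, 0, 0),
  Prod.C = c(1, 0, 1, 1, 0)
)

# Every column except the first is treated as a product indicator
prod_cols <- names(dat)[-1]

# One label per client listing the products that client bought
combo <- apply(dat[prod_cols] == 1, 1,
               function(bought) paste(prod_cols[bought], collapse = " + "))

# Frequency of each combination, most common first
counts <- sort(table(combo), decreasing = TRUE)
counts[1]
```

Because prod_cols is derived from the column names, the same code applies to any number of product columns. A client with no purchases would get an empty label.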
With your data frame in df:
aggregate(Client~Prod.A+Prod.B+Prod.C,df,length)
Prod.A Prod.B Prod.C Client
1 1 0 0 1
2 1 1 0 1
3 0 0 1 1
4 1 0 1 2
The last column, Client, gives the count.
Solution using dplyr
library(dplyr)
df <- data.frame(Client = c("A","B","C","D","E"),
`Prod A` = c(1,1,1,0,1),
`Prod B` = c(0,1,0,0,0),
`Prod C` = c(1,0,1,1,0))
df %>%
dplyr::group_by(Prod.A,Prod.B,Prod.C) %>%
dplyr::summarise(count = n())
# A tibble: 4 x 4
# Groups: Prod.A, Prod.B [3]
Prod.A Prod.B Prod.C count
<dbl> <dbl> <dbl> <int>
1 0 0 1 1
2 1 0 0 1
3 1 0 1 2
4 1 1 0 1
library(dplyr)
df <- data.frame(Client = c("A", "B", "C", "D", "E"),
`Prod A` = c(1, 1, 1, 0, 1),
`Prod B` = c(0, 1, 0, 0, 0),
`Prod C` = c(1, 0, 1, 1, 0))
df %>%
rowwise() %>%
mutate(length = sum(Prod.A, Prod.B, Prod.C)) %>%
group_by(Prod.A, Prod.B, Prod.C) %>%
mutate(count = n()) %>%
ungroup() %>%
filter(count == max(count) & length > 1) %>%
select(1:4)
which will produce:
Client Prod.A Prod.B Prod.C
<chr> <dbl> <dbl> <dbl>
1 A 1 0 1
2 C 1 0 1
I have a df looking like this:
| Speak_Dur|CNC_count|TNT_count|...
|0.5 | 1 | 0
|0.8 | 0 | 1
|4.3 | | 1
|5.5 | 1 | 0
I want to make a few new columns using if-else logic.
For this example, let's say I have only those 3 columns.
I want to make a new column for each "X_count" variable I already have, based on this condition:
New column "CNC_dur": if CNC_count = 1, copy the Speak_Dur value from the same row.
New column "TNT_dur": if TNT_count = 1, copy the Speak_Dur value from the same row.
Results should be:
| Speak_Dur | CNC_count | TNT_count | CNC_dur | TNT_dur |
| 0.5       | 1         | 0         | 0.5     | 0       |
| 0.8       | 0         | 1         | 0       | 0.8     |
| 4.3       | 0         | 1         | 0       | 4.3     |
| 5.5       | 1         | 0         | 5.5     | 0       |
For now, I tried:
mutate(
CNC_DUR = if_else(CNC_count[row_number() -1] =="1","speak_dur",0,0))
I guess the last line should be something else.
Hoping to get some help, thank you.
Here is one potential solution:
library(tidyverse)
df <- tribble(~"Speak_Dur", ~"CNC_count", ~"TNT_count",
0.5, 1, 0,
0.8, 0, 1,
4.3, NA, 1,
5.5, 1, 0)
df2 <- df %>%
mutate(CNC_dur = ifelse(CNC_count == 1, Speak_Dur, 0),
TNT_Dur = ifelse(TNT_count == 1, Speak_Dur, 0))
df2
#> A tibble: 4 x 5
#> Speak_Dur CNC_count TNT_count CNC_dur TNT_Dur
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#>1 0.5 1 0 0.5 0
#>2 0.8 0 1 0 0.8
#>3 4.3 NA 1 NA 4.3
#>4 5.5 1 0 5.5 0
Then, if you want to write the file using the pipe delimiter:
write.table(x = df2, file = "file_with_new_columns.txt", sep = "|", row.names = FALSE)
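Since the question asks for one new column per "X_count" variable, the same rule can be applied in a loop over every column whose name ends in "_count". The sketch below is base R and makes two assumptions not stated in the answer above: the new columns are named by swapping the "_count" suffix for "_dur", and a missing count is treated like 0 (as in the question's expected result):

```r
df <- data.frame(
  Speak_Dur = c(0.5, 0.8, 4.3, 5.5),
  CNC_count = c(1, 0, NA, 1),
  TNT_count = c(0, 1, 1, 0)
)

# All indicator columns, found by their "_count" suffix
count_cols <- grep("_count$", names(df), value = TRUE)

for (col in count_cols) {
  new_col <- sub("_count$", "_dur", col)
  # `%in% 1` is FALSE for NA, so missing counts fall through to 0
  df[[new_col]] <- ifelse(df[[col]] %in% 1, df$Speak_Dur, 0)
}

df
```

Using `== 1` instead of `%in% 1` would propagate NA into the new column, as in the output above.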
I would like to do a conditional sum in R on a table such as the one below. With this data, I would like a forward projection of the total value per desk for the next 5 days. A value should be counted for every day from its start date through its Out_date.
+-------+------------+-------+-------+------------+------+
| Index | Date | Desk | Value | Out_date | Days |
+-------+------------+-------+-------+------------+------+
| 16 | 2020-07-30 | Desk1 | 1 | 2020-08-17 | 12 |
| 51 | 2020-08-13 | Desk2 | 2.000 | 2020-08-14 | 4 |
| 52 | 2020-08-13 | Desk3 | 2.000 | 2020-08-15 | 4 |
| 53 | 2020-08-13 | Desk3 | 2.000 | 2020-08-16 | 4 |
+-------+------------+-------+-------+------------+------+
How do I solve this?
How the output should look:
+-------+------------+------------+------------+------------+------------+
| Desk | 2020-08-14 | 2020-08-15 | 2020-08-16 | 2020-08-17 | 2020-08-18 |
+-------+------------+------------+------------+------------+------------+
| Desk1 | 1 | 1 | 1 | 1 | 0 |
| Desk2 | 2 | 0 | 0 | 0 | 0 |
| Desk3 | 4 | 4 | 2 | 0 | 0 |
+-------+------------+------------+------------+------------+------------+
From your description, it sounds as though each row in your table represents a Value associated with a Desk for a given period of time. The Value associated with that desk starts on a particular Date, and continues until the Out_date. However, these associations can occur concurrently, which means that on any particular day, a desk may have several associated values. Your intention is to sum these values.
If my understanding is correct, then the following code will get you the relevant sums:
library(dplyr)
df %>%
mutate(Days = as.numeric(difftime(Out_date, Date, units = "day")) + 1) %>%
add_row(Index = max(df$Index) + 1, Date = max(df$Date),
Desk = "Desk1", Value = 0, Out_date = max(df$Date) + 1,
Days = 6) %>%
mutate(entry = seq(nrow(.)), n = Days) %>%
tidyr::uncount(Days) %>%
group_by(entry) %>%
mutate(Date_out = seq.Date(min(Date), length.out = max(n), by = "1 day")) %>%
group_by(Desk, Date_out) %>%
summarize(Value = sum(Value)) %>%
tidyr::pivot_wider(names_from = "Date_out", values_from = "Value") %>%
mutate_if(function(x) any(is.na(x)), function(x) replace(x, is.na(x), 0)) %>%
as.data.frame()
#> Desk 2020-07-30 2020-07-31 2020-08-01 2020-08-02 2020-08-03 2020-08-04
#> 1 Desk1 1 1 1 1 1 1
#> 2 Desk2 0 0 0 0 0 0
#> 3 Desk3 0 0 0 0 0 0
#> 2020-08-05 2020-08-06 2020-08-07 2020-08-08 2020-08-09 2020-08-10 2020-08-11
#> 1 1 1 1 1 1 1 1
#> 2 0 0 0 0 0 0 0
#> 3 0 0 0 0 0 0 0
#> 2020-08-12 2020-08-13 2020-08-14 2020-08-15 2020-08-16 2020-08-17 2020-08-18
#> 1 1 1 1 1 1 1 0
#> 2 0 2 2 0 0 0 0
#> 3 0 4 4 4 2 0 0
Data from question
df <- structure(list(Index = c(16L, 51L, 52L, 53L), Date = structure(c(18473,
18487, 18487, 18487), class = "Date"), Desk = c("Desk1", "Desk2",
"Desk3", "Desk3"), Value = c(1, 2, 2, 2), Out_date = structure(c(18491,
18488, 18489, 18490), class = "Date"), Days = c(12L, 4L, 4L,
4L)), row.names = c(NA, -4L), class = "data.frame")
Created on 2020-08-14 by the reprex package (v0.3.0)
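To limit the result to the 5-day window shown in the question, the sum can also be computed directly per day: a row contributes to a day when that day lies between its Date and Out_date. This is a base R sketch, not the method above; it assumes the window starts the day after the latest Date, and the names horizon, desks, and proj are illustrative:

```r
df <- data.frame(
  Desk = c("Desk1", "Desk2", "Desk3", "Desk3"),
  Value = c(1, 2, 2, 2),
  Date = as.Date(c("2020-07-30", "2020-08-13", "2020-08-13", "2020-08-13")),
  Out_date = as.Date(c("2020-08-17", "2020-08-14", "2020-08-15", "2020-08-16")),
  stringsAsFactors = FALSE
)

# Assumed projection window: the 5 days after the latest start date
horizon <- seq(max(df$Date) + 1, by = "day", length.out = 5)
desks <- sort(unique(df$Desk))

# One column per day: sum the values of rows that are active on that day, per desk
proj <- sapply(seq_along(horizon), function(i) {
  day <- horizon[i]
  active <- df$Date <= day & df$Out_date >= day
  sapply(desks, function(d) sum(df$Value[active & df$Desk == d]))
})
colnames(proj) <- format(horizon)
proj
```

The result is a matrix with one row per desk and one column per projected day, with 0 where no row is active.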
The dplyr and tidyr packages have what you need. Use group_by(Desk, Date) and summarize(forecast = your_function). Then you can pivot_wider() to get your desired output.
library(dplyr)
library(tidyr)
df %>%
group_by(Desk, Date) %>%
summarize(forecast = your_function) %>%
pivot_wider(names_from = "Date", values_from = "forecast")
You can use dplyr and tidyr for this.
input <- tibble::tibble(Desk = c("Desk1",
"Desk2",
"Desk1",
"Desk3"),
Date = c("30.07.20",
"10.08.20",
"10.08.20",
"13.08.20"),
Value = c(0.006,
5.500,
0.300,
2.500))
input %>%
dplyr::group_by(Desk, Date) %>%
dplyr::summarise(sum_value = sum(Value)) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(names_from = Date, values_from = sum_value)
I am trying to create an unsummarized data frame from a data frame of count data.
I have had some experience creating sample datasets but I am having some trouble trying to get a specific number of rows and proportion for each state/person without coding each of them separately and then combining them. I was able to do it using the following code but I feel like there is a better way.
set.seed(2312)
dragon <- sample(c(1),3,replace=TRUE)
Maine <- sample(c("Maine"),3,replace=TRUE)
Maine1 <- data.frame(dragon, Maine)
dragon <- sample(c(0),20,replace=TRUE)
Maine <- sample(c("Maine"),20,replace=TRUE)
Maine2 <- data.frame(dragon, Maine)
Maine2
library(dplyr)
maine3 <- bind_rows(Maine1, Maine2)
Is there a better way to generate this dataset than the code above?
I am trying to create a data frame from the following count data:
+-------------+--------------+--------------+
| | # of dragons | # no dragons |
+-------------+--------------+--------------+
| Maine | 3 | 20|
| California | 1 | 10|
| Jocko | 28 | 110515 |
| Jessica Day | 17 | 26122 |
| | 14 | 19655 |
+-------------+--------------+--------------+
And I would like it to look like this:
+-----------------------+---------------+
| | Dragons (1/0) |
+-----------------------+---------------+
| Maine | 1 |
| Maine | 1 |
| Maine | 1 |
| Maine | 0 |
| Maine….(2:20) | 0…. |
| California | 1 |
| California….(2:10) | 0… |
| Ect.. | |
+-----------------------+---------------+
I do not want the code written for me, but I would love ideas on functions or examples that you think might be helpful.
I am not completely sure what sampling has to do with this problem.
It looks to me like you are looking for untable.
Here is an example
data:
set.seed(1)
no_drag = sample(1:5, 5)
drag = sample(15:25, 5)
df <- data.frame(names = LETTERS[1:5],
drag,
no_drag)
names drag no_drag
1 A 24 2
2 B 25 5
3 C 20 4
4 D 23 3
5 E 15 1
library(reshape)
library(tidyverse)
df %>%
gather(key, value, 2:3) %>% #convert to long format
{untable(.,num = .$value)} %>% #untable by value column
mutate(value = ifelse(key == "drag", 0, 1)) %>% #convert values to 0/1
select(-key) %>% #remove unwanted column
arrange(names) #optional
#part of output
names value
1 A 0
2 A 0
3 A 0
4 A 0
5 A 0
6 A 0
7 A 0
8 A 0
9 A 0
10 A 0
11 A 0
12 A 0
13 A 0
14 A 0
15 A 0
16 A 0
17 A 0
18 A 0
19 A 0
20 A 0
21 A 0
22 A 0
23 A 0
24 A 0
25 A 1
26 A 1
27 B 0
28 B 0
29 B 0
30 B 0
There are other ways to tackle the problem.
One is what @Frank mentioned in the comments:
df %>%
gather(key, val, 2:3) %>%
mutate(v = Map(rep, key == "drag", val)) %>%
unnest(v) %>%
select(-key, -val)
Another:
df <- gather(df, key, value, 2:3)
df <- df[rep(seq_len(nrow(df)), df$value), 1:2]
df$key <- df$key != "drag"  # FALSE for "drag" rows, TRUE otherwise
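The same row-repetition idea can be applied directly to a count table shaped like the one in the question. The sketch below is base R and uses only the first two rows of that table; the column names (state, dragons, no_dragons) and the helper expand_one are hypothetical:

```r
# Two rows of the question's count table, with hypothetical column names
counts <- data.frame(
  state = c("Maine", "California"),
  dragons = c(3, 1),
  no_dragons = c(20, 10),
  stringsAsFactors = FALSE
)

# Expand one row of counts into n_yes rows of 1 followed by n_no rows of 0
expand_one <- function(state, n_yes, n_no) {
  data.frame(state = state, dragon = rep(c(1L, 0L), times = c(n_yes, n_no)))
}

# Apply the helper to every row and stack the pieces
long <- do.call(rbind, Map(expand_one, counts$state, counts$dragons, counts$no_dragons))
rownames(long) <- NULL

head(long)
```

tidyr::uncount(), which appears elsewhere on this page, performs a similar expansion once the counts are in long format.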
One can use tidyr::expand to expand rows into the desired format.
Using the df from @missuse's answer, the solution looks like this:
library(tidyverse)
df %>% gather(key,value,-names) %>%
mutate(key = ifelse(key=="drag", 1, 0)) %>%
group_by(names,key) %>%
expand(value = 1:value) %>%
select(names, value = key) %>%
as.data.frame()
# names value
# 1 A 0
# 2 A 0
# 3 A 1
# 4 A 1
# 5 A 1
# 6 A 1
# 7 A 1
# 8 A 1
# 9 A 1
# 10 A 1
# ...so on
# 117 E 1
# 118 E 1
# 119 E 1
# 120 E 1
# 121 E 1
# 122 E 1