I would like to do a conditional sum in R on a table such as the one below. From this data, I would like a forward projection of the total value per desk for the next 5 days. A value should be counted on every day from its start Date through its Out_date.
+-------+------------+-------+-------+------------+------+
| Index | Date | Desk | Value | Out_date | Days |
+-------+------------+-------+-------+------------+------+
| 16 | 2020-07-30 | Desk1 | 1 | 2020-08-17 | 12 |
| 51 | 2020-08-13 | Desk2 | 2.000 | 2020-08-14 | 4 |
| 52 | 2020-08-13 | Desk3 | 2.000 | 2020-08-15 | 4 |
| 53 | 2020-08-13 | Desk3 | 2.000 | 2020-08-16 | 4 |
+-------+------------+-------+-------+------------+------+
How do I solve this?
How the output should look:
+-------+------------+------------+------------+------------+------------+
| Desk | 2020-08-14 | 2020-08-15 | 2020-08-16 | 2020-08-17 | 2020-08-18 |
+-------+------------+------------+------------+------------+------------+
| Desk1 | 1 | 1 | 1 | 1 | 0 |
| Desk2 | 2 | 0 | 0 | 0 | 0 |
| Desk3 | 4 | 4 | 2 | 0 | 0 |
+-------+------------+------------+------------+------------+------------+
From your description, it sounds as though each row in your table represents a Value associated with a Desk for a given period of time. The Value associated with that desk starts on a particular Date, and continues until the Out_date. However, these associations can occur concurrently, which means that on any particular day, a desk may have several associated values. Your intention is to sum these values.
If my understanding is correct, then the following code will get you the relevant sums:
library(dplyr)
df %>%
mutate(Days = as.numeric(difftime(Out_date, Date, units = "day")) + 1) %>%
add_row(Index = max(df$Index) + 1, Date = max(df$Date),
Desk = "Desk1", Value = 0, Out_date = max(df$Date) + 1,
Days = 6) %>%
mutate(entry = seq(nrow(.)), n = Days) %>%
tidyr::uncount(Days) %>%
group_by(entry) %>%
mutate(Date_out = seq.Date(min(Date), length.out = max(n), by = "1 day")) %>%
group_by(Desk, Date_out) %>%
summarize(Value = sum(Value)) %>%
tidyr::pivot_wider(names_from = "Date_out", values_from = "Value") %>%
mutate_if(function(x) any(is.na(x)), function(x) replace(x, is.na(x), 0)) %>%
as.data.frame()
#> Desk 2020-07-30 2020-07-31 2020-08-01 2020-08-02 2020-08-03 2020-08-04
#> 1 Desk1 1 1 1 1 1 1
#> 2 Desk2 0 0 0 0 0 0
#> 3 Desk3 0 0 0 0 0 0
#> 2020-08-05 2020-08-06 2020-08-07 2020-08-08 2020-08-09 2020-08-10 2020-08-11
#> 1 1 1 1 1 1 1 1
#> 2 0 0 0 0 0 0 0
#> 3 0 0 0 0 0 0 0
#> 2020-08-12 2020-08-13 2020-08-14 2020-08-15 2020-08-16 2020-08-17 2020-08-18
#> 1 1 1 1 1 1 1 0
#> 2 0 2 2 0 0 0 0
#> 3 0 4 4 4 2 0 0
Data from question
df <- structure(list(Index = c(16L, 51L, 52L, 53L), Date = structure(c(18473,
18487, 18487, 18487), class = "Date"), Desk = c("Desk1", "Desk2",
"Desk3", "Desk3"), Value = c(1, 2, 2, 2), Out_date = structure(c(18491,
18488, 18489, 18490), class = "Date"), Days = c(12L, 4L, 4L,
4L)), row.names = c(NA, -4L), class = "data.frame")
Created on 2020-08-14 by the reprex package (v0.3.0)
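If only the five projected days are needed, as in the expected output of the question, a shorter route is to pair every entry with every day in the window and sum the values that are active on that day. This is a minimal sketch, not the method used above: the `horizon` window (the 5 days after the latest start date) and the helper column `active` are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Same rows as the question's table (Index and Days are not needed here)
df <- data.frame(
  Desk     = c("Desk1", "Desk2", "Desk3", "Desk3"),
  Date     = as.Date(c("2020-07-30", "2020-08-13", "2020-08-13", "2020-08-13")),
  Value    = c(1, 2, 2, 2),
  Out_date = as.Date(c("2020-08-17", "2020-08-14", "2020-08-15", "2020-08-16"))
)

# Assumed projection window: the 5 days following the latest start date
horizon <- seq(max(df$Date) + 1, by = "day", length.out = 5)

projection <- expand_grid(df, day = horizon) %>%         # every entry paired with every day
  mutate(active = day >= Date & day <= Out_date) %>%     # is the entry running on that day?
  group_by(Desk, day) %>%
  summarise(Value = sum(Value[active]), .groups = "drop") %>%
  pivot_wider(names_from = day, values_from = Value)

projection
```

Because inactive days contribute a sum of 0, no separate step is needed to replace NA values.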
The dplyr and tidyr packages have what you need. Use group_by(Desk, Date) and summarize(forecast = your_function). Then you can pivot_wider() to get your desired output.
library(dplyr)
library(tidyr)
df %>%
group_by(Desk, Date) %>%
summarize(forecast = your_function) %>%
pivot_wider(names_from = "Date", values_from = "forecast")
You can use dplyr and tidyr for this.
input <- tibble::tibble(Desk = c("Desk1",
"Desk2",
"Desk1",
"Desk3"),
Date = c("30.07.20",
"10.08.20",
"10.08.20",
"13.08.20"),
Value = c(0.006,
5.500,
0.300,
2.500))
input %>%
dplyr::group_by(Desk, Date) %>%
dplyr::summarise(sum_value = sum(Value)) %>%
dplyr::ungroup() %>%
tidyr::pivot_wider(names_from = Date, values_from = sum_value)
Related
I have a question about converting a dataframe from a wide format into a long format. I haven't found any solutions that fit with my dataframe. We had three measurement timeslots with the same questionnaires (e.g. PANAS and two more questionnaires). My dataframe looks like this right now:
| code| PANAS_1| PANAS_2| PANAS1_1| PANAS1_2| PANAS2_1| PANAS2_2|
|CAPQ | 4 | 3 | 1 | 5 | 2 | 4 |
|BANI | 2 | 3 | 4 | 4 | 3 | 2 |
I want to put it into a format that looks like this:
| code| timeslot| PANAS_1| PANAS_2 |
|CAPQ | 1 | 4 | 3 |
|CAPQ | 2 | 1 | 5 |
|CAPQ | 3 | 2 | 4 |
|BANI | 1 | 2 | 3 |
|BANI | 2 | 4 | 4 |
|BANI | 3 | 3 | 2 |
I tried melt(), but I don't know what to do because the variable names of the questionnaires aren't the same across timeslots: the variables in the first timeslot are plain "PANAS_1", the ones in the second timeslot have a 1 before the underscore ("PANAS1_1"), and the ones in the third timeslot have a 2 before the underscore ("PANAS2_1"). On top of that, I have no variable that indicates which timeslot the items come from.
I hope you can understand my problem and help me solve this. If you need further information, just let me know.
Here is an approach using data.table. With melt.data.table() you can use groups of measure.vars. In this case you can use patterns() to find the groups by their suffix.
library(data.table)
df <- read.table(text = "code| PANAS_1| PANAS_2| PANAS1_1| PANAS1_2| PANAS2_1| PANAS2_2
CAPQ | 4 | 3 | 1 | 5 | 2 | 4
BANI | 2 | 3 | 4 | 4 | 3 | 2
", sep = "|", header = TRUE)
setDT(df)
DT.long <- melt(df,
id.vars = "code",
measure.vars = patterns("_1", "_2"),
variable.name = "timeslot",
value.name = c("PANAS_1", "PANAS_2")
)[order(code), ]
DT.long
#> code timeslot PANAS_1 PANAS_2
#> 1: BANI 1 2 3
#> 2: BANI 2 4 4
#> 3: BANI 3 3 2
#> 4: CAPQ 1 4 3
#> 5: CAPQ 2 1 5
#> 6: CAPQ 3 2 4
Created on 2021-08-19 by the reprex package (v2.0.1)
Here is one approach using tidyverse. You can use pivot_longer to put into long format, and separate out the last number after the underscore. Then, you can add a timeslot variable for each code/number combination, assuming the times are in order. Finally, you can revert to wide format with pivot_wider (or leave as is for further processing/analysis).
library(tidyverse)
df %>%
pivot_longer(cols = -code, names_to = c("var", "PANAS"), names_sep = "_") %>%
group_by(code, PANAS) %>%
mutate(timeslot = 1:n()) %>%
pivot_wider(id_cols = c(code, timeslot), names_from = PANAS, names_prefix = "PANAS_", values_from = value)
Output
code timeslot PANAS_1 PANAS_2
<chr> <int> <dbl> <dbl>
1 CAPQ 1 4 3
2 CAPQ 2 1 5
3 CAPQ 3 2 4
4 BANI 1 2 3
5 BANI 2 4 4
6 BANI 3 3 2
Alternatively, you can rename your column names and include the time inside them explicitly:
names(df) <- c("code", paste("PANAS", rep(1:3, each = 2), rep(1:2, times = 3), sep = "_"))
df %>%
pivot_longer(cols = -code, names_to = c("timeslot", "PANAS"), names_pattern = "PANAS_(\\d+)_(\\d+)") %>%
pivot_wider(id_cols = c(code, timeslot), names_from = PANAS, names_prefix = "PANAS_", values_from = value)
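The rename-then-reshape idea can also be done in a single pivot_longer() call by using tidyr's ".value" sentinel, which turns part of each column name into an output column. This is only a sketch: the "__" separator and the renamed column names are assumptions made for the example, and the data frame is rebuilt by hand from the question's table.

```r
library(tidyr)

# Data from the question, typed in by hand
df <- data.frame(
  code     = c("CAPQ", "BANI"),
  PANAS_1  = c(4, 2), PANAS_2  = c(3, 3),
  PANAS1_1 = c(1, 4), PANAS1_2 = c(5, 4),
  PANAS2_1 = c(2, 3), PANAS2_2 = c(4, 2)
)

# Rename to "<item>__<timeslot>", e.g. PANAS_1__1, PANAS_2__1, PANAS_1__2, ...
# (the "__" separator is an arbitrary choice for this sketch)
names(df) <- c("code",
               paste0("PANAS_", rep(1:2, times = 3), "__", rep(1:3, each = 2)))

# ".value" keeps the part before "__" as the output column name,
# while the part after "__" goes into the timeslot column
long <- pivot_longer(df,
                     cols = -code,
                     names_to = c(".value", "timeslot"),
                     names_sep = "__")
long
```

The timeslot column comes out as character; wrap it in as.integer() if a numeric timeslot is needed.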
I have a df looking like this:
| Speak_Dur | CNC_count | TNT_count | ...
| 0.5       | 1         | 0         |
| 0.8       | 0         | 1         |
| 4.3       |           | 1         |
| 5.5       | 1         | 0         |
I want to make a few new columns using if else.
For this example, let's say I have only those 3 columns.
I want to make a new column for each "X_count" variable I already have, based on this condition:
new column "CNC_dur"= if cnc_count=1, then paste the "speak_dur" value from the same row.
new column "TNT_dur" = if tnt_count=1, then paste the "speak_dur" value from the same row.
results should be:
| Speak_Dur | CNC_count | TNT_count | CNC_dur | TNT_dur |
| 0.5       | 1         | 0         | 0.5     | 0       |
| 0.8       | 0         | 1         | 0       | 0.8     |
| 4.3       | 0         | 1         | 0       | 4.3     |
| 5.5       | 1         | 0         | 5.5     | 0       |
So far, I have tried:
mutate(
CNC_DUR = if_else(CNC_count[row_number() -1] =="1","speak_dur",0,0))
I guess the last line should be something else.
Any help would be appreciated, thank you.
Here is one potential solution:
library(tidyverse)
df <- tribble(~"Speak_Dur", ~"CNC_count", ~"TNT_count",
0.5, 1, 0,
0.8, 0, 1,
4.3, NA, 1,
5.5, 1, 0)
df2 <- df %>%
mutate(CNC_dur = ifelse(CNC_count == 1, Speak_Dur, 0),
TNT_dur = ifelse(TNT_count == 1, Speak_Dur, 0))
df2
#> A tibble: 4 x 5
#> Speak_Dur CNC_count TNT_count CNC_dur TNT_dur
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#>1 0.5 1 0 0.5 0
#>2 0.8 0 1 0 0.8
#>3 4.3 NA 1 NA 4.3
#>4 5.5 1 0 5.5 0
Then, if you want to write the file using the pipe delimiter:
write.table(x = df2, file = "file_with_new_columns.txt", sep = "|", row.names = FALSE)
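Since the question asks for one new column per "X_count" variable, the same rule can be applied to all of them at once with across(). This is a sketch under two assumptions that go beyond the answer above: every count column ends in "_count", and a missing count should be treated as 0 (as in the question's expected result).

```r
library(dplyr)

df <- tibble::tribble(
  ~Speak_Dur, ~CNC_count, ~TNT_count,
  0.5,        1,          0,
  0.8,        0,          1,
  4.3,        NA,         1,
  5.5,        1,          0
)

df2 <- df %>%
  mutate(across(
    ends_with("_count"),
    # coalesce() turns a missing count into 0 before the comparison (assumption)
    ~ if_else(coalesce(.x, 0) == 1, Speak_Dur, 0),
    # name each new column by swapping the "_count" suffix for "_dur"
    .names = "{sub('_count$', '_dur', .col)}"
  ))

df2
```

If missing counts should stay NA in the result, drop the coalesce() call.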
id | 85| 291| 5680| 41
---+---+----+-----+----
597| 1 | 1 | 1 | 1
672| 1 | 0 | 0 | 0
680| 1 | 1 | 1 | 0
683| 1 | 1 | 1 | 1
I have a table that looks something like the one above. I want to flag each row where the 1 values account for 90% of the row (not including the id column).
So for this example only row 1 and 4 would be flagged.
intended output:
id | 85| 291| 5680| 41 | flag |
---+---+----+-----+----+------+
597| 1 | 1 | 1 | 1 | yes |
672| 1 | 0 | 0 | 0 | no |
680| 1 | 1 | 1 | 0 | no |
683| 1 | 1 | 1 | 1 | yes |
How can I do this in R using tidyverse syntax? I tried some things with rowSums(), but I can't come up with a solution.
Perhaps try using rowMeans:
df$flag = rowMeans(df[-1]) >= .9
This assumes you have only 1 and 0 for values here.
If your "table" is actually a data frame with all columns except the first being columns of 1s and 0s, you could do:
df %>% mutate(flag = apply(df[-1], 1, function(x) sum(x)/length(x) > 0.9))
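To get the "yes"/"no" labels from the intended output rather than TRUE/FALSE, the row-mean test can be wrapped in if_else(). A minimal sketch, assuming the data frame is called df, that id is the only non-indicator column, and that the threshold is "at least 90%":

```r
library(dplyr)

# Data from the question; check.names = FALSE keeps the numeric column names
df <- data.frame(
  id     = c(597, 672, 680, 683),
  `85`   = c(1, 1, 1, 1),
  `291`  = c(1, 0, 1, 1),
  `5680` = c(1, 0, 1, 1),
  `41`   = c(1, 0, 0, 1),
  check.names = FALSE
)

flagged <- df %>%
  # share of 1s per row across every column except id
  mutate(flag = if_else(rowMeans(select(df, -id)) >= 0.9, "yes", "no"))

flagged
```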
An option in the tidyverse would be to reshape to 'long' format, compute the mean per id, and join the result back to the original dataset
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -id) %>%
group_by(id) %>%
summarise(flag = mean(value) > 0.9) %>%
right_join(df1) %>%
select(names(df1), everything())
# A tibble: 4 x 6
# id `85` `291` `5680` `41` flag
# <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#1 597 1 1 1 1 TRUE
#2 672 1 0 0 0 FALSE
#3 680 1 1 1 0 FALSE
#4 683 1 1 1 1 TRUE
data
df1 <- structure(list(id = c(597, 672, 680, 683), `85` = c(1, 1, 1,
1), `291` = c(1, 0, 1, 1), `5680` = c(1, 0, 1, 1), `41` = c(1,
0, 0, 1)), class = "data.frame", row.names = c(NA, -4L))
I have data in the form of:
M | Y | title | terma | termb | termc
4 | 2009 | titlea | 2 | 0 | 1
6 | 2001 | titleb | 0 | 1 | 0
4 | 2009 | titlec | 1 | 0 | 1
I'm using dplyr's group_by() and summarise() to count instances of terms for each title:
data %>%
gather(key = term, value = total, terma:termc) %>%
group_by(M, Y, title, term) %>%
summarise(total = sum(total))
Which gives me something like this:
M | Y | title |term | count
4 | 2009 | titlea | terma | 2
4 | 2009 |titlea |termc | 1
6 | 2001 | titleb | termb | 1
4 | 2009 | titlec | terma | 1
4 | 2009 | titlec | termc | 1
Instead, I would like to be able to group by M, Y, and term, then concatenate any titles that are grouped and add their totals together. Desired output would look like this:
M | Y | title | term | count
4 | 2009 | titlea, titlec | terma | 3
4 | 2009 | titlea, titlec | termc | 2
6 | 2001 | titleb | termb | 1
How can I do this? Any help appreciated!
@akrun was very close. This ended up working:
data %>%
pivot_longer(cols = terma:termc, names_to = 'term', values_to = 'count') %>%
filter(count != 0) %>%
group_by(M, Y, term) %>%
summarise(title = toString(title), count = sum(count))
We can do
library(dplyr)
library(tidyr)
data %>%
mutate_at(vars(starts_with('term')), na_if, 0L) %>%
pivot_longer(cols = starts_with('term'), names_to = 'term',
values_to = 'count', values_drop_na = TRUE) %>%
group_by(M, Y, term) %>%
summarise(title = toString(title), count = sum(count))
# A tibble: 3 x 5
# Groups: M, Y [2]
# M Y term title count
# <int> <int> <chr> <chr> <int>
#1 4 2009 terma titlea, titlec 3
#2 4 2009 termc titlea, titlec 2
#3 6 2001 termb titleb 1
data
data <- structure(list(M = c(4L, 6L, 4L), Y = c(2009L, 2001L, 2009L),
title = c("titlea", "titleb", "titlec"), terma = c(2L, 0L,
1L), termb = c(0L, 1L, 0L), termc = c(1L, 0L, 1L)),
class = "data.frame", row.names = c(NA,
-3L))
I am trying to create an unsummarized data frame from a data frame of count data.
I have had some experience creating sample datasets but I am having some trouble trying to get a specific number of rows and proportion for each state/person without coding each of them separately and then combining them. I was able to do it using the following code but I feel like there is a better way.
set.seed(2312)
dragon <- sample(c(1),3,replace=TRUE)
Maine <- sample(c("Maine"),3,replace=TRUE)
Maine1 <- data.frame(dragon, Maine)
dragon <- sample(c(0),20,replace=TRUE)
Maine <- sample(c("Maine"),20,replace=TRUE)
Maine2 <- data.frame(dragon, Maine)
Maine2
library(dplyr)
maine3 <- bind_rows(Maine1, Maine2)
Is there a better way to generate this dataset than the code above?
I am trying to create a data frame from the following count data:
+-------------+--------------+--------------+
| | # of dragons | # no dragons |
+-------------+--------------+--------------+
| Maine | 3 | 20|
| California | 1 | 10|
| Jocko | 28 | 110515 |
| Jessica Day | 17 | 26122 |
| | 14 | 19655 |
+-------------+--------------+--------------+
And I would like it to look like this:
+-----------------------+---------------+
| | Dragons (1/0) |
+-----------------------+---------------+
| Maine | 1 |
| Maine | 1 |
| Maine | 1 |
| Maine | 0 |
| Maine….(2:20) | 0…. |
| California | 1 |
| California….(2:10) | 0… |
| Etc. | |
+-----------------------+---------------+
I do not want the code written for me, but I would love ideas on functions or examples that you think might be helpful.
I am not sure what sampling has to do with this problem.
It looks to me like you are looking for untable.
Here is an example
data:
set.seed(1)
no_drag = sample(1:5, 5)
drag = sample(15:25, 5)
df <- data.frame(names = LETTERS[1:5],
drag,
no_drag)
names drag no_drag
1 A 24 2
2 B 25 5
3 C 20 4
4 D 23 3
5 E 15 1
library(reshape)
library(tidyverse)
df %>%
gather(key, value, 2:3) %>% #convert to long format
{untable(.,num = .$value)} %>% #untable by value column
mutate(value = ifelse(key == "drag", 0, 1)) %>% #convert values to 0/1
select(-key) %>% #remove unwanted column
arrange(names) #optional
#part of output
names value
1 A 0
2 A 0
3 A 0
4 A 0
5 A 0
6 A 0
7 A 0
8 A 0
9 A 0
10 A 0
11 A 0
12 A 0
13 A 0
14 A 0
15 A 0
16 A 0
17 A 0
18 A 0
19 A 0
20 A 0
21 A 0
22 A 0
23 A 0
24 A 0
25 A 1
26 A 1
27 B 0
28 B 0
29 B 0
30 B 0
There are other ways to tackle the problem.
One, as @Frank mentioned in the comments:
df %>%
gather(key, val, 2:3) %>%
mutate(v = Map(rep, key == "drag", val)) %>%
unnest(v) %>%
select(-key, -val)
Another:
df <- gather(df, key, value, 2:3)
df <- df[rep(seq_len(nrow(df)), df$value), 1:2]
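# One comparison instead of two assignments: after the first assignment the
# column no longer contains "drag", so the second would set every row to TRUE
df$key <- df$key != "drag"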
You can use tidyr::expand to expand the rows into the desired format.
Using the df from @missuse's answer, the solution looks like this:
library(tidyverse)
df %>% gather(key,value,-names) %>%
mutate(key = ifelse(key=="drag", 1, 0)) %>%
group_by(names,key) %>%
expand(value = 1:value) %>%
select(names, value = key) %>%
as.data.frame()
# names value
# 1 A 0
# 2 A 0
# 3 A 1
# 4 A 1
# 5 A 1
# 6 A 1
# 7 A 1
# 8 A 1
# 9 A 1
# 10 A 1
# ...so on
# 117 E 1
# 118 E 1
# 119 E 1
# 120 E 1
# 121 E 1
# 122 E 1
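The same expansion can also be sketched with tidyr::uncount(), which repeats each row according to a count column. The example below uses the first two rows of the question's count table; the column names state, dragons, no_dragons and dragon are made up for illustration.

```r
library(dplyr)
library(tidyr)

# First two rows of the count table from the question (column names are assumptions)
counts <- data.frame(
  state      = c("Maine", "California"),
  dragons    = c(3, 1),
  no_dragons = c(20, 10)
)

unsummarized <- counts %>%
  pivot_longer(c(dragons, no_dragons), names_to = "key", values_to = "n") %>%
  mutate(dragon = as.integer(key == "dragons")) %>%  # 1 = dragon, 0 = no dragon
  uncount(n) %>%                                     # repeat each row n times, dropping n
  select(state, dragon)

# One row per observation: 23 for Maine, 11 for California
table(unsummarized$state, unsummarized$dragon)
```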