Is there a way to transpose different column names - r

Is there a way to transpose a dataframe with different column names? For example:

ColA   ColB
Table1 Date
Table1 Country
Table2 Name
Table2 Date
Table3 ID
Table3 Place

Required output (ColB values with the same name, like Date, should be aligned in the same column):

ColA   Col1 Col2    Col3
Table1 Date Country
Table2 Date Name
Table3 ID   Place

It seems like, to get the desired output, you have to address the cases where there is more than one instance of a ColB value and the cases where there is only one separately.
Option 1:
library(data.table)
setDT(df)
# flag ColB values that occur only once
df[, single := .N == 1L, ColB]
# dense rank gives repeated ColB values a shared column id
df[, b_id := frank(ColB, ties.method = 'dense')]
out <-
  merge(
    # repeated values: spread by the shared id so they align across tables
    dcast(df[single == FALSE], ColA ~ b_id, value.var = 'ColB'),
    # singletons: spread by their position within each ColA
    dcast(df[single == TRUE], ColA ~ rowid(ColA), value.var = 'ColB'),
    by = 'ColA',
    all = TRUE
  )
# rename the value columns to Col1, Col2, ..., keeping the first name (ColA)
setnames(out, replace(paste0('Col', seq(0, ncol(out) - 1)), 1, names(out)[1]))
out
#      ColA Col1    Col2  Col3
# 1: Table1 Date Country  <NA>
# 2: Table2 Date    Name  <NA>
# 3: Table3 <NA>      ID Place
Option 2:
library(data.table)
setDT(df)
df[, single := .N == 1L, ColB]
# one id covering both cases: per-ColA position for singletons,
# dense rank for repeated values
df[, b_id :=
     interaction(single, fifelse(single, rowid(ColA), frank(ColB, ties.method = 'dense')))]
dcast(df, ColA ~ paste0('Col', as.integer(b_id)), value.var = 'ColB')
#      ColA Col2 Col3    Col4
# 1: Table1 <NA> Date Country
# 2: Table2 Name Date    <NA>
# 3: Table3   ID <NA>   Place
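Note that the column numbers in option 2 come straight from the interaction() level codes, so they start at Col2. If consecutive Col1..Coln names matter, a quick setnames() pass fixes that; a small sketch (out2 is just an illustrative name):

# renumber the value columns to Col1, Col2, ... after casting
out2 <- dcast(df, ColA ~ paste0('Col', as.integer(b_id)), value.var = 'ColB')
setnames(out2, setdiff(names(out2), 'ColA'),
         paste0('Col', seq_len(ncol(out2) - 1L)))
out2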
Input data:
df <- fread('
ColA ColB
Table1 Date
Table1 Country
Table2 Name
Table2 Date
Table3 ID
Table3 Place
')

Related

replace cell values greater than 0 with column name

I have a dataframe with the following structure:
Df = data.frame(
  Col1 = c(1, 0, 0),
  Col2 = c(0, 2, 1),
  Col3 = c(0, 0, 0)
)
What I'm trying to get is a dataframe where those cells with a value greater than 0 get replaced with the column name and those lower than 1 get replaced by NA. The resulting dataframe would be something like this:
Df = data.frame(
  Col1 = c("Col1", NA, NA),
  Col2 = c(NA, "Col2", "Col2"),
  Col3 = c(NA, NA, NA)
)
So far I tried this solution as well as functions like apply(), mutate_if(), and across(), but I can't get what I'm after.
You could do:
library(dplyr)

Df %>%
  mutate(across(everything(), ~ if_else(. > 0, cur_column(), NA_character_)))
  Col1 Col2 Col3
1 Col1 <NA> <NA>
2 <NA> Col2 <NA>
3 <NA> Col2 <NA>
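If you'd rather stay in base R, the same idea works with Map() over the columns and their names; a minimal sketch:

# base R: pair each column with its name, replace positives with the name
Df[] <- Map(function(x, nm) ifelse(x > 0, nm, NA_character_), Df, names(Df))
Df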

Keep empty groups when grouping with data.table in R

I want to keep empty groups (with a default value like NA or 0) when grouping by multiple conditions.
library(data.table)
dt = data.table(user = c("A", "A", "B"), date = c("t1", "t2", "t1"), duration = c(1, 2, 1))
dt[, .(total = sum(duration)), by = .(date, user)]
Result:
   date user total
1:   t1    A     1
2:   t2    A     2
3:   t1    B     1
Desired result:
   date user total
1:   t1    A     1
2:   t2    A     2
3:   t1    B     1
4:   t2    B    NA
One solution could be to add rows with 0 values before grouping, but that would require creating the Cartesian product of many columns and manually checking whether a value already exists for each combination. I would prefer a built-in / simpler one.
You can try:
dt[CJ(user = user, date = date, unique = TRUE), on = .(user, date)]
   user date duration
1:    A   t1        1
2:    A   t2        2
3:    B   t1        1
4:    B   t2       NA
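Since the question asks for totals, the grouped sum can be chained straight onto that join; a sketch (the NA for the empty group survives because sum() of a single NA is NA):

dt[CJ(user = user, date = date, unique = TRUE), on = .(user, date)
   ][, .(total = sum(duration)), by = .(date, user)]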
Here is an option with complete from tidyr:
library(tidyr)
library(dplyr)

dt1 <- dt[, .(total = sum(duration)), by = .(date, user)]
dt1 %>%
  complete(user, date)
#   user  date  total
#   <chr> <chr> <dbl>
# 1 A     t1        1
# 2 A     t2        2
# 3 B     t1        1
# 4 B     t2       NA
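complete() fills the new cells with NA by default; if you want 0 instead, it takes a fill list:

dt1 %>%
  complete(user, date, fill = list(total = 0))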
Or using dcast/melt:
melt(dcast(dt, user ~ date, value.var = 'duration', sum),
     id.var = 'user', variable.name = 'date', value.name = 'total')
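One caveat with this route: when dcast() is given an aggregate function, missing combinations are filled with the aggregate of a zero-length vector (sum() gives 0, not NA). Passing fill = NA should restore the NA from the desired output:

melt(dcast(dt, user ~ date, value.var = 'duration', sum, fill = NA),
     id.var = 'user', variable.name = 'date', value.name = 'total')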

data.table: calculate statistics of row times within a moving time window

library(data.table)
library(lubridate)
df <- data.table(
  col1 = c('A', 'A', 'A', 'B', 'B', 'B'),
  col2 = c("2015-03-06 01:37:57", "2015-03-06 01:39:57", "2015-03-06 01:45:28",
           "2015-03-06 02:31:44", "2015-03-06 03:55:45", "2015-03-06 04:01:40")
)
For each row I want to calculate the standard deviation of the times (col2) of rows that have the same value of col1 and whose time falls within the window of the past 10 minutes up to and including this row's time.
I use the following approach:
df$col2 <- as_datetime(df$col2)
gap <- 10L
df[, feat1 := .SD[.(col1 = col1, t1 = col2 - gap * 60L, t2 = col2)
                  , on = .(col1, col2 >= t1, col2 <= t2)
                  , .(sd_time = sd(as.numeric(col2))), by = .EACHI]$sd_time][]
As a result I see only NA values instead of values in seconds.
For example, for the third row (col1 = "A" and col2 = "2015-03-06 01:45:28") I calculated the value manually:
v <- c("2015-03-06 01:37:57", "2015-03-06 01:39:57", "2015-03-06 01:45:28")
v <- as_datetime(v)
sd(v)
# 233.5815
Two alternative data.table solutions (variations on my previous answer). The key difference from your attempt is using x.col2 for the matched rows' times: in a non-equi join, the x join column (col2) takes the values of the i bounds inside j, so your sd(as.numeric(col2)) was not computed over the actual matched times.
# option 1: join each row's *forward* 10-minute window; grouping by the matched
# row (x.col2) then collects all anchor times inside its backward window
df[.(col1 = col1, t1 = col2, t2 = col2 + gap * 60L)
   , on = .(col1, col2 >= t1, col2 <= t2)
   , .(col1, col2 = x.col2, times = as.numeric(t1))
   ][, .(feat1 = sd(times))
     , by = .(col1, col2)]
# option 2: same join, but assign the result back by reference
df[, feat1 := .SD[.(col1 = col1, t1 = col2, t2 = col2 + gap * 60L)
                  , on = .(col1, col2 >= t1, col2 <= t2)
                  , .(col1, col2 = x.col2, times = as.numeric(t1))
                  ][, .(sd_times = sd(times))
                    , by = .(col1, col2)]$sd_times][]
which both give:
   col1                col2     feat1
1:    A 2015-03-06 01:37:57        NA
2:    A 2015-03-06 01:39:57  84.85281
3:    A 2015-03-06 01:45:28 233.58153
4:    B 2015-03-06 02:31:44        NA
5:    B 2015-03-06 03:55:45        NA
6:    B 2015-03-06 04:01:40 251.02291
A pure data.table solution:
df[, col3 := as.numeric(col2)]
df[, feat1 := {
  # compare this group's time against every row of the full table
  d <- df$col3 - col3
  sd(df$col3[col1 == df$col1 & d <= 0 & d >= -gap * 60L])
}, by = list(col3, col1)]
Another way to loop over all combinations of col1 and col2, with mapply:
df[, col3 := as.numeric(col2)]
df[, feat1 := mapply(Date = col3, ID = col1, function(Date, ID) {
  # all times for this id, then keep those within the past 10 minutes
  DateVect <- df[col1 == ID, col3]
  d <- DateVect - Date
  sd(DateVect[d <= 0 & d >= -gap * 60L])
})][]

data.table: count rows within a moving time window

library(data.table)
library(lubridate)
df <- data.table(
  col1 = c('B', 'A', 'A', 'B', 'B', 'B'),
  col2 = c("2015-03-06 01:37:57", "2015-03-06 01:39:57", "2015-03-06 01:45:28",
           "2015-03-06 02:31:44", "2015-03-06 03:55:45", "2015-03-06 04:01:40")
)
For each row I want to count the number of rows with the same value of col1 and a time within the window of the past 10 minutes up to and including this row's time.
I run the following code:
df$col2 <- as_datetime(df$col2)
window = 10L
(counts = setDT(df)[.(t1 = col2 - window * 60L, t2 = col2),
                    on = .((col2 >= t1) & (col2 <= t2)),
                    .(counts = .N), by = col1]$counts)
df[, counts := counts]
and got the following error:
Error in `[.data.table`(setDT(df), .(t1 = col2 - window * 60L, t2 = col2), : Column(s) [(col2] not found in x
I want a result like this:
col1                col2 counts
   B 2015-03-06 01:37:57      1
   A 2015-03-06 01:39:57      1
   A 2015-03-06 01:45:28      2
   B 2015-03-06 02:31:44      1
   B 2015-03-06 03:55:45      1
   B 2015-03-06 04:01:40      2
A possible solution:
df[.(col1 = col1, t1 = col2 - window * 60L, t2 = col2)
   , on = .(col1, col2 >= t1, col2 <= t2)
   , .(counts = .N), by = .EACHI][, (2) := NULL][]
which gives:
   col1                col2 counts
1:    B 2015-03-06 01:37:57      1
2:    A 2015-03-06 01:39:57      1
3:    A 2015-03-06 01:45:28      2
4:    B 2015-03-06 02:31:44      1
5:    B 2015-03-06 03:55:45      1
6:    B 2015-03-06 04:01:40      2
A couple of notes about your approach:
You don't need setDT because you already constructed df with data.table(...).
Your on-statement isn't specified correctly: you need to separate the join conditions with a comma, not with &. For example: on = .(col1, col2 >= t1, col2 <= t2).
Use by = .EACHI to get the result for each row.
An alternative approach:
df[, counts := .SD[.(col1 = col1, t1 = col2 - window * 60L, t2 = col2)
                   , on = .(col1, col2 >= t1, col2 <= t2)
                   , .N, by = .EACHI]$N][]
which gives the same result.
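If the table gets large, the per-group count can also be done without a join at all, using sorted times and findInterval(); a sketch, assuming timestamps are distinct within each col1 group (ties sorted after the current row would be missed):

setorder(df, col1, col2)
df[, counts2 := {
  s <- as.numeric(col2)
  # for each position i: i minus the number of rows strictly before the
  # window start, i.e. the count of rows with time in [t - 10 min, t]
  seq_along(s) - findInterval(s - window * 60, s, left.open = TRUE)
}, by = col1][]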

Find duplicated rows with original

I can get duplicated rows in R on a data.table dt using
dt[duplicated(dt, by=someColumns)]
However, I would like to get pairs of duplicated rows and the "non-duplicates", for example consider dt:
col1 col2 col3
A    B    C1
A    B    C2
A    B1   C1
Now, dt[duplicated(dt, by = c("col1", "col2"))] would give me something along the lines of
col1 col2 col3
A    B    C2
I would like to get this together with the row that was not chosen as the duplicate, that is
col1 col2 col3
A    B    C1
A    B    C2
Speed comparison of answers:
> system.time(dt[duplicated(dt, by = t) | duplicated(dt, by = t, fromLast = TRUE)])
   user  system elapsed
  0.008   0.000   0.009
> system.time(dt[, .SD[.N > 1], by = t])
   user  system elapsed
 77.555   0.100  77.703
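(For steadier numbers than a single system.time() call, the microbenchmark package repeats each expression; a sketch, with t standing for the vector of key column names as above:)

library(microbenchmark)
microbenchmark(
  dup = dt[duplicated(dt, by = t) | duplicated(dt, by = t, fromLast = TRUE)],
  sd_ = dt[, .SD[.N > 1], by = t],
  times = 10L
)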
I believe this is essentially a duplicate of this question, though I can see how you may not have found it...
...here's an answer building off the logic outlined in the referenced question:
dt <- read.table(text = "col1 col2 col3
A B C1
A B C2
A B1 C1", header = TRUE, stringsAsFactors = FALSE)
idx <- duplicated(dt[, 1:2]) | duplicated(dt[, 1:2], fromLast = TRUE)
dt[idx, ]
#---
  col1 col2 col3
1    A    B   C1
2    A    B   C2
Since you are using data.table, this is probably what you want:
library(data.table)
dt <- data.table(dt)
dt[duplicated(dt, by = c("col1", "col2")) | duplicated(dt, by = c("col1", "col2"), fromLast = TRUE)]
#---
   col1 col2 col3
1:    A    B   C1
2:    A    B   C2
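If you need this in more than one place, it wraps naturally into a small helper (dup_with_originals is just an illustrative name):

dup_with_originals <- function(DT, cols) {
  DT[duplicated(DT, by = cols) | duplicated(DT, by = cols, fromLast = TRUE)]
}
dup_with_originals(dt, c("col1", "col2"))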
You can easily achieve this just by using .N:
dt[, .SD[.N > 1], by = list(col1, col2)]
##    col1 col2 col3
## 1:    A    B   C1
## 2:    A    B   C2
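A variant of the same idea that avoids subsetting .SD per group (often a bit faster) is to return the group only when it has more than one row:

dt[, if (.N > 1) .SD, by = list(col1, col2)]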
Edit:
You can also try to use binary search, which is very efficient, though it seems like duplicated is still more efficient:
setkey(dt[, indx := .N, by = list(col1, col2)], indx)[!J(1)]
## col1 col2 col3
## 1: A B C1
## 2: A B C2
