Count occurrences of value in multiple columns with duplicates - r

My problem is very similar to:
R: Count occurrences of value in multiple columns
However, the solution proposed there doesn't work for me: a value may appear twice in the same row, and I want to count each row in which it appears only once. I have worked out a solution, but it seems too long:
> toy_data = data.table(from=c("A","A","A","C","E","E"), to=c("B","C","A","D","F","E"))
> toy_data
from to
1: A B
2: A C
3: A A
4: C D
5: E F
6: E E
> #get a table with intra-link count
> A = data.table(table(unlist(toy_data[from==to,from ])))
> A
V1 N
1: A 1
2: E 1
> #get a table with total count
> B = data.table(table(unlist(toy_data[,c(from,to)])))
> B
V1 N
1: A 4
2: B 1
3: C 2
4: D 1
5: E 3
6: F 1
>
> # concatenate changing sign
> table = rbind(B,A[,.(V1,-N)],use.names=FALSE)
> # groupby and subtract
> table[,sum(N),by=V1]
V1 V1
1: A 3
2: B 1
3: C 2
4: D 1
5: E 2
6: F 1
Is there a function that would do the job in fewer lines? I thought that, as in Python, I could concatenate from and to and then use match(), but I cannot find the right syntax.
EDIT: I know that A=length(toy_data[from=="A"|to=="A",from]) would work, but I would like to avoid looping over the values "A", "B", ... (and I don't know how to format the output this way).

You can try the code below
> toy_data[, to := replace(to, from == to, NA)][, data.frame(table(unlist(.SD)))]
Var1 Freq
1 A 3
2 B 1
3 C 2
4 D 1
5 E 2
6 F 1
or
toy_data %>%
mutate(to = replace(to, from == to, NA)) %>%
unlist() %>%
table() %>%
as.data.frame()
which gives
. Freq
1 A 3
2 B 1
3 C 2
4 D 1
5 E 2
6 F 1

Using data.table
library(data.table)
toy_data[from == to, to := NA][, .(to = na.omit(c(from, to)))][, .N, to]

You could just subset the to vector:
data.table(table(unlist(toy_data[,c(from,to[to!=from])])))
V1 N
1: A 3
2: B 1
3: C 2
4: D 1
5: E 2
6: F 1

Using to:=NA as suggested by akrun, one can wrap the result in table(unlist()) and convert to data.table
data.table(table(unlist(toy_data[from==to, to:=NA, from])))
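The answers above share one idea: drop the value that is repeated within a row, then tabulate. A minimal base R sketch of that idea, which should also extend to more than two columns, is to take the unique values of each row first. The names per_row and counts are made up for illustration.

```r
# Same toy data as in the question, as a plain data.frame
toy_data <- data.frame(
  from = c("A", "A", "A", "C", "E", "E"),
  to   = c("B", "C", "A", "D", "F", "E"),
  stringsAsFactors = FALSE
)

# Unique values of each row, so a value repeated within a row counts once
per_row <- lapply(seq_len(nrow(toy_data)), function(i) {
  unique(unlist(toy_data[i, ], use.names = FALSE))
})

# Number of rows in which each value appears
counts <- table(unlist(per_row))
counts
```

This works row by row, so on large tables it will likely be slower than the vectorised data.table answers.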

Related

Creating data.table from a list of unequal vector lengths

I am looking to create a data.table from a list of unequal vectors, but instead of repeating the values for the "shorter" vector, I want it to be filled with NAs. I have one possible solution, but it repeats values and does not retain the NA as needed.
Example:
library(data.table)
my_list <- list(A = 1:4, B = letters[1:5])
as.data.table(do.call(cbind, my_list))
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 1 e
But I want it to look like:
as.data.table(do.call(cbind, my_list))
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: NA e
Thank you!
We need to make the lengths equal by appending NA to every list element that is shorter than the longest one:
mx <- max(lengths(my_list))
as.data.table(do.call(cbind, lapply(my_list, `length<-`, mx)))
-output
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: <NA> e
Instead of cbind/as.data.table, setDT is more compact
setDT(lapply(my_list, `length<-`, mx))[]
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: NA e
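The difference between <NA> and NA in the two outputs comes from cbind, which builds a character matrix and so turns column A into character. A minimal base R sketch, assuming a plain data.frame is acceptable, pads each element and keeps the list structure so that each column keeps its type. The names padded and out are illustrative.

```r
my_list <- list(A = 1:4, B = letters[1:5])

# Length of the longest element
mx <- max(lengths(my_list))

# Extend every element to that length; the new positions are filled with NA
padded <- lapply(my_list, function(v) {
  length(v) <- mx
  v
})

# Build the data.frame from the list, so A stays integer and B stays character
out <- as.data.frame(padded, stringsAsFactors = FALSE)
str(out)
```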
You may use stringi::stri_list2matrix to pad all list elements to equal length.
my_list |>
stringi::stri_list2matrix() |>
data.table::as.data.table() |>
type.convert(as.is = TRUE) |>
setNames(names(my_list))
# A B
#1: 1 a
#2: 2 b
#3: 3 c
#4: 4 d
#5: NA e

Get first and last value from groups using rle

I want to get first and last value for groups using grouping similar to what rle() function does.
For example I have this data frame:
> df
df time
1 1 A
2 1 B
3 1 C
4 1 D
5 2 E
6 2 F
7 2 G
8 1 H
9 1 I
10 1 J
11 3 K
12 3 L
13 3 M
14 2 N
15 2 O
16 2 P
I want to get something like this:
> want
df first last
1 1 A D
2 2 E G
3 1 H J
4 3 K M
5 2 N P
As you can see, I want to group my values the way rle() does: elements belong to the same group only when equal values are next to each other. group_by groups elements differently.
> rle(df$df)
Run Length Encoding
lengths: int [1:5] 4 3 3 3 3
values : num [1:5] 1 2 1 3 2
Is there a solution for my problem? Any advice will be appreciated.
There is a function rleid from data.table that does that job, i.e.
library(data.table)
# setDT() converts the data frame df to a data.table by reference
setDT(df)[, .(df = head(df, 1),
first = head(time, 1),
last = tail(time, 1)),
by = .(grp = rleid(df))][, grp := NULL][]
Which gives,
df first last
1: 1 A D
2: 2 E G
3: 1 H J
4: 3 K M
5: 2 N P
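rleid gives every run of equal consecutive values its own id. As a rough sketch of the same idea without data.table, the run ids can be built with cumsum over the positions where the value changes. The data frame below is reconstructed from the printout in the question, and run_id and want are illustrative names.

```r
# Data reconstructed from the question's printout
df <- data.frame(
  df   = c(1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 3, 3, 3, 2, 2, 2),
  time = LETTERS[1:16],
  stringsAsFactors = FALSE
)

# A new run starts at row 1 and wherever the value differs from the previous row
run_id <- cumsum(c(TRUE, df$df[-1] != df$df[-nrow(df)]))

# First value, first time and last time within each run
want <- data.frame(
  df    = as.vector(tapply(df$df,   run_id, function(x) x[1])),
  first = as.vector(tapply(df$time, run_id, function(x) x[1])),
  last  = as.vector(tapply(df$time, run_id, function(x) x[length(x)])),
  stringsAsFactors = FALSE
)
want
```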
Adding a dplyr approach, as @RonakShah mentions
library(dplyr)
df %>%
group_by(grp = cumsum(c(0, diff(df)) != 0)) %>%
summarise(df = first(df),
first = first(time),
last = last(time)) %>%
select(-grp)
Giving,
# A tibble: 5 x 3
df first last
<int> <chr> <chr>
1 1 A D
2 2 E G
3 1 H J
4 3 K M
5 2 N P
Here is a base R option using rle. Run rle on the first column, replicate the run index by the run lengths, use duplicated on that index to build logical indices for the first and last row of each run, and then subset the original data with them:
rl <- rle(df[,1])
i1 <- rep(seq_along(rl$values), rl$lengths)
i2 <- !duplicated(i1)
i3 <- !duplicated(i1, fromLast = TRUE)
wanted <- data.frame(df = df[i2,1], first = df[i2,2], last = df[i3,2])
wanted
# df first last
#1 1 A D
#2 2 E G
#3 1 H J
#4 3 K M
#5 2 N P

How can I choose the first 3 rows of a data table in data.table by group?

I currently have a dataset like:
ID RESULTS
1 M
1 A
1 M
1 C
1 B
2 Q
2 E
2 S
2 G
2 Z
......
From this, I would like to keep the first 3 rows, by group. Meaning, I'd like:
ID RESULTS
1 M
1 A
1 M
2 Q
2 E
2 S
I dug around in data.table, the closest I found was using something like mult or .I. Does anyone have a simple workaround? Thanks!
I would suggest a more concise way. You can find more detail in ?data.table or by running example(data.table).
DT = data.table(ID=rep(c(1,2),each=5),RESULTS=
c("M","A","M","C","B","Q","E","S","G","Z"))
> DT[,.SD[1:3],by=ID]
## ID RESULTS
## 1: 1 M
## 2: 1 A
## 3: 1 M
## 4: 2 Q
## 5: 2 E
## 6: 2 S
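One caveat worth sketching: .SD[1:3] always returns three rows per group, so a group with fewer than three rows is padded with NA. Assuming the padding is not wanted, head(.SD, 3) returns at most three rows instead. The shortened data set below is made up to show the difference.

```r
library(data.table)

# Illustrative data: group 2 has only two rows
DT <- data.table(ID = c(1, 1, 1, 1, 2, 2),
                 RESULTS = c("M", "A", "M", "C", "Q", "E"))

# .SD[1:3] pads the short group with an NA row
padded <- DT[, .SD[1:3], by = ID]

# head(.SD, 3) keeps at most three rows per group, with no padding
first3 <- DT[, head(.SD, 3), by = ID]
first3
```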

Formatting the output in R

I have a set of data which shows the visit ID and the subject name
visit<-c(1,2,3,1,2,1,1,2,3,1,2,3)
subject<-c("A","A","A","B","B","C","D","D","D","E","E","E")
data<-data.frame(visit=visit,subject=subject)
I attempted to work out the latest visit ID for each subject:
tapply(visit,subject,max)
And I get this output:
A B C D E
3 2 1 3 3
I am wondering if there is any way that I can change the output such that it becomes:
A 3
B 2
C 1
D 3
E 3
Thank you
You can try aggregate
aggregate(visit~subject, data, max)
# subject visit
#1 A 3
#2 B 2
#3 C 1
#4 D 3
#5 E 3
Or from tapply
res <- tapply(visit,subject,max)
data.frame(subject=names(res), visit=res)
Or data.table
library(data.table)
setDT(data)[, list(visit=max(visit)), by=subject]
And a dplyr solution would be:
library(dplyr)
data %>% group_by(subject) %>% summarize(max = max(visit))
## Source: local data frame [5 x 2]
## subject max
## 1 A 3
## 2 B 2
## 3 C 1
## 4 D 3
## 5 E 3
It may feel dirty, but using the base function as.matrix (or matrix for that matter) will give you what you need.
> as.matrix(tapply(visit,subject,max))
[,1]
A 3
B 2
C 1
D 3
E 3
You can easily do this in base R with stack:
stack(tapply(visit, subject, max))
# values ind
# 1 3 A
# 2 2 B
# 3 1 C
# 4 3 D
# 5 3 E
(Note: In this case, the values for "visit" and "subject" aren't actually coming from your data.frame. Just thought you should know!)
(Second note: You could also do data.frame(as.table(tapply(visit, subject, max))) but that is more deceptive than using stack so may lead to less readable code later on.)
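Following up on the first note, here is a small sketch that takes visit and subject from the data.frame itself by wrapping the call in with(), then reshapes the result into two columns. The names res and out are illustrative.

```r
data <- data.frame(
  visit   = c(1, 2, 3, 1, 2, 1, 1, 2, 3, 1, 2, 3),
  subject = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "E", "E", "E"),
  stringsAsFactors = FALSE
)

# with() makes tapply look up visit and subject inside data
res <- with(data, tapply(visit, subject, max))

# One row per subject, with the latest visit next to it
out <- data.frame(subject = names(res), visit = as.vector(res),
                  stringsAsFactors = FALSE)
out
```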

data preparation part II

There's another problem I encountered which is (I think) quite interesting:
dt <- data.table(K=c("A","A","A","B","B","B"),A=c(2,3,4,1,3,4),B=c(3,3,3,1,1,1))
dt
K A B
1: A 2 3
2: A 3 3
3: A 4 3
4: B 1 1
5: B 3 1
6: B 4 1
Now I want a "higher-level" view of the data. For each letter in K there should be only one row, and "A_sum" should hold the number of A values that share the same value of B. Here there are three values for B=3 and three values for B=1.
Resulting data.table:
dt_new
K A_sum B
1: A 3 3
2: B 3 1
It's not clear how you want to treat K, but here's one option:
dt_new <- dt[, list(A_sum = length(A)), by = list(K, B)]
# K B A_sum
# 1: A 3 3
# 2: B 1 3
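Since A_sum is just a row count per group, a slightly shorter sketch of the same grouping uses data.table's built-in .N symbol instead of length(A). It assumes the same grouping by K and B as the answer above.

```r
library(data.table)

dt <- data.table(K = c("A", "A", "A", "B", "B", "B"),
                 A = c(2, 3, 4, 1, 3, 4),
                 B = c(3, 3, 3, 1, 1, 1))

# .N is the number of rows in each (K, B) group
dt_new <- dt[, .(A_sum = .N), by = .(K, B)]
dt_new
```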
