How to calculate the frequency of each value in a column corresponding to each value in another column in R? - r

I have a dataset as follows:
col1 col2
A 1
A 2
A 2
B 1
B 1
C 1
C 1
C 2
I want the output as:
col1 col2 Frequency
A 1 1
A 2 2
B 1 2
C 1 2
C 2 1
I tried using the aggregate function and also the table function, but I am unable to get the desired result.

You can add a dummy column or use the rownames to aggregate on:
aggregate(rownames(mydf) ~ ., mydf, length)
# col1 col2 rownames(mydf)
# 1 A 1 1
# 2 B 1 2
# 3 C 1 2
# 4 A 2 2
# 5 C 2 1
table also works fine but will report combinations that may not be in your data as "0":
data.frame(table(mydf))
# col1 col2 Freq
# 1 A 1 1
# 2 B 1 2
# 3 C 1 2
# 4 A 2 2
# 5 B 2 0
# 6 C 2 1
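If the zero-count combinations are unwanted, they can be filtered out after tabulating. The following is a minimal base R sketch, assuming the data frame is called mydf as in the answer above and is rebuilt here from the question's data; the name freq is only illustrative.

```r
# Rebuild the example data; `mydf` is the name the answer above assumes.
mydf <- data.frame(
  col1 = c("A", "A", "A", "B", "B", "C", "C", "C"),
  col2 = c(1, 2, 2, 1, 1, 1, 1, 2)
)

# Tabulate every col1/col2 combination, then keep only those that occur.
freq <- as.data.frame(table(mydf))
freq <- freq[freq$Freq > 0, ]

# Sort by col1, then col2, to match the layout requested in the question.
freq <- freq[order(freq$col1, freq$col2), ]
rownames(freq) <- NULL
freq
```

Note that as.data.frame(table(...)) returns col1 and col2 as factors, so convert them back if the original types matter.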
Another nice approach is to use "data.table":
library(data.table)
as.data.table(mydf)[, .N, by = names(mydf)]

If your data is
col1 <- c("A","A","A","B","B","C","C","C")
col2 <- c(1,2,2,1,1,1,1,2)
df <- data.frame(col1,col2)
you can use dplyr:
1) group_by both variables, since your output is supposed to include every combination of them
2) count the number of observations in each group using n()
library(dplyr)
df %>% group_by(col1,col2) %>% summarize(frequency=n())
# output
col1 col2 frequency
1 A 1 1
2 A 2 2
3 B 1 2
4 C 1 2
5 C 2 1

Related

R function to replace tricky merge in Excel (vlookup + hlookup)

I have a tricky merge that I usually do in Excel via various formulas and I want to automate with R.
I have 2 dataframes, one called inputs looks like this:
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
And another called df
id v
1 1
1 2
1 3
2 2
3 1
I would like to combine them based on the id and v values such that I get
id v key
1 1 A
1 2 A
1 3 C
2 2 D
3 1 T
So I'm matching on id and then on one of the columns v1 through v3: in the first example, I match id = 1 and column v1, since the value of v equals 1. In Excel I do this by creatively combining VLOOKUP and HLOOKUP, but I want to make this simpler in R. The example data frames are simplified versions, as I have more records and the columns go from v1 up to v50.
Thanks!
You could use pivot_longer:
library(tidyr)
library(dplyr)
key %>% pivot_longer(!id,names_prefix='v',names_to = 'v') %>%
mutate(v=as.numeric(v)) %>%
inner_join(df)
Joining, by = c("id", "v")
# A tibble: 5 × 3
id v value
<int> <dbl> <chr>
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
Data:
key <- read.table(text="
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F",header=T)
df <- read.table(text="
id v
1 1
1 2
1 3
2 2
3 1 ",header=T)
You can use two-column matrices as index arguments to "[", so this is a one-liner. (Note that the names of the data objects are d1 and d2, where d1 is the wide table and d2 is the id/v table. I'm opposed to using df as a data object name.)
d1[-1][ data.matrix(d2)] # returns [1] "A" "A" "C" "D" "T"
So full solution is:
cbind( d2, key= d1[-1][ data.matrix(d2)] )
id v key
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
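The one-liner above relies on each id being equal to its row number and on the v columns being in order. A hedged variant that looks up both positions with match() might look like the sketch below; the names rows and cols are illustrative, and d1 and d2 are rebuilt from the question's data.

```r
d1 <- read.table(text = "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F", header = TRUE)

d2 <- read.table(text = "
id v
1 1
1 2
1 3
2 2
3 1", header = TRUE)

# Row positions come from matching ids rather than assuming id == row number.
rows <- match(d2$id, d1$id)
# Column positions come from the name "v<k>", so column order does not matter.
cols <- match(paste0("v", d2$v), names(d1))

# A two-column (row, column) matrix picks one cell per row of d2.
d2$key <- as.matrix(d1)[cbind(rows, cols)]
d2
```

An id or v value with no match yields NA in key rather than an error, which may or may not be what you want.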
Try this:
x <- "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
"
y <- "
id v
1 1
1 2
1 3
2 2
3 1
"
df <- read.table(textConnection(x) , header = TRUE)
df2 <- read.table(textConnection(y) , header = TRUE)
key <- c()
for (i in 1:nrow(df2)) {
key <- append(df[df2$id[i],(df2$v[i] + 1L)] , key)
}
df2$key <- rev(key)
df2
#> id v key
#> 1 1 1 A
#> 2 1 2 A
#> 3 1 3 C
#> 4 2 2 D
#> 5 3 1 T
Created on 2022-06-06 by the reprex package (v2.0.1)

create list from columns of data table expression

Consider the following dt:
dt <- data.table(a=c(1,1,2,3),b=c(4,5,6,4))
It looks like this:
> dt
a b
1: 1 4
2: 1 5
3: 2 6
4: 3 4
Here I am grouping each column by its unique values and then counting how many times each unique value occurs:
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 2
2: 2 1 5 1
3: 3 1 6 1
So 1 appears twice in dt and thus a.N is 2, the same logic goes on for the other values.
The problem is that if these transformations of the original data.table have different dimensions in the end, values get recycled.
For example this dt:
dt <- data.table(a=c(1,1,2,3,7),b=c(4,5,6,4,4))
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 3
2: 2 1 5 1
3: 3 1 6 1
4: 7 1 4 3
Warning message:
In as.data.table.list(jval, .named = NULL) :
Item 2 has 3 rows but longest item has 4; recycled with remainder.
That is no longer the right answer, because b.N should now have only 3 rows and the vector got recycled.
This is why I would like to turn the expression dt[,lapply(.SD,function(agg) dt[,.N,by=agg])] into a list whose items have different lengths, with the names of the list items being the column names of the new transformed dt.
A sketch of what I mean is:
newlist
$a.agg
1 2 3 7
$a.N
2 1 1 1
$b.agg
4 5 6
$b.N
3 1 1
An even better solution would be a data.table that tracks the source column in another column:
dt_final
agg N column
1 2 a
2 1 a
3 1 a
7 1 a
4 3 b
5 1 b
6 1 b
Get the data in long format and then aggregate by group.
library(data.table)
dt_long <- melt(dt, measure.vars = c('a', 'b'))
dt_long[, .N, .(variable, value)]
# variable value N
#1: a 1 2
#2: a 2 1
#3: a 3 1
#4: a 7 1
#5: b 4 3
#6: b 5 1
#7: b 6 1
In tidyverse -
library(dplyr)
library(tidyr)
dt %>%
pivot_longer(cols = everything()) %>%
count(name, value)
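For the list form sketched in the question, a base R approach with lapply and table is one option. This is a minimal sketch under the assumption that a plain data frame is acceptable (a data.table can be passed to lapply the same way); the names counts and dt_final are illustrative.

```r
dt <- data.frame(a = c(1, 1, 2, 3, 7), b = c(4, 5, 6, 4, 4))

# One frequency table per column; list elements may have different lengths.
counts <- lapply(dt, function(col) {
  tab <- table(col)
  data.frame(agg = as.numeric(names(tab)), N = as.vector(tab))
})

# Stack into the long layout, recording the source column.
dt_final <- do.call(rbind, Map(cbind, counts, column = names(counts)))
rownames(dt_final) <- NULL
dt_final
```

Note that table() orders the values by sorting them, not by first appearance, so the row order can differ from the grouped data.table result when the data is unsorted.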

R: Collapse duplicated values in a column while keeping the order

I'm sure this is super simple but just can't find the answer. I have a data frame like so
Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A
And I'd like to group by Id and collapse the distinct event values while keeping the event order like so
Id event
1 1 A
2 1 B
3 1 A
4 2 C
5 2 A
Most of my searches end up with using the distinct() or unique() functions, but that leads to losing the A event in row 3 for Id 1.
Thanks in advance!
We can use lead to compare each row with the next one and keep only the rows that differ from it. is.na(lead(Id)) makes sure the last row is also kept.
library(dplyr)
dat2 <- dat %>%
filter(!(Id == lead(Id) & event == lead(event)) | is.na(lead(Id)))
dat2
# Id event
# 1 1 A
# 2 1 B
# 3 1 A
# 4 2 C
# 5 2 A
DATA
dat <- read.table(text = " Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A",
header = TRUE, stringsAsFactors = FALSE)
You can just compare every row with the one after it.
df = read.table(text=" Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A",
header=TRUE)
# Append TRUE so the last row is kept explicitly instead of relying on index recycling.
df[c(rowSums(df[-1,] == head(df, -1)) != 2, TRUE), ]
Id event
1 1 A
2 1 B
4 1 A
6 2 C
7 2 A
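The comparison above keeps the last row of each run, which is why the row names are 4 and 6 rather than 3 and 5. If the first row of each run should be kept instead, the same idea can be written against the previous row. This is a minimal base R sketch; the name starts_run is illustrative.

```r
dat <- read.table(text = "Id event
1 A
1 B
1 A
1 A
2 C
2 C
2 A", header = TRUE, stringsAsFactors = FALSE)

n <- nrow(dat)
# TRUE where a row differs from the row before it in Id or event;
# the first row always starts a run.
starts_run <- c(TRUE, dat$Id[-1] != dat$Id[-n] | dat$event[-1] != dat$event[-n])

dat[starts_run, ]
```

Comparing the columns one by one avoids depending on how data frames with different row names are compared.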
Here is a solution with data.table:
library("data.table")
dt <- fread(
" Id event
1 A
1 B
1 A
1 A
2 C
2 C
2 A")
unique(dt[, r:=rleidv(event), Id])[, -3]
# Id event
# 1: 1 A
# 2: 1 B
# 3: 1 A
# 4: 2 C
# 5: 2 A
or
dt[, .SD[!duplicated(rleidv(event))], by = Id, .SDcols = "event"]
(thx to @mt1022 for the comment)
A base R solution using tapply and rle:
x <- tapply(dat$event,dat$Id,function(x) rle(x)$values)
do.call(rbind,Map(data.frame,Id=names(x),event=x))
# Id event
# 1.1 1 A
# 1.2 1 B
# 1.3 1 A
# 2.1 2 C
# 2.2 2 A
The distinct function looks like a fit, but note that it keeps only the first occurrence of each Id/event pair, so the second run of A for Id 1 is lost:
dat %>%
distinct(Id, event)

Count based on multiple grouping variables r

I'm just trying to get a count of occurrences of 'stop' in variable (A) for each of 3 grouping variables (B,C,D).
A B C D
start 1 1 1
start 1 1 1
start 2 1 2
start 2 1 2
stop 1 2 1
stop 1 2 1
stop 2 2 1
Any help would be appreciated - please ask for clarification
I would convert to a data.table and sum B, C and D over the rows where A is 'stop':
library(data.table)
DT <- as.data.table(DF)
DT[A == 'stop', lapply(.SD, sum), .SDcols=c('B', 'C', 'D')]
B C D
1: 4 6 3
If you were working in the hadleyverse, you could do this using reshape2 and dplyr. Firstly you would use reshape to melt the data so that each of B,C,D has its own row. Then you can group_by and tally as usual.
library(reshape2)
library(dplyr)
melt(df) %>%
filter(A == "stop") %>%
group_by(variable, value) %>%
tally()
# variable value n
# 1 B 1 2
# 2 B 2 1
# 3 C 2 3
# 4 D 1 3
Do you mean each combination of B, C and D? If so here is a base R solution:
df <- read.table(text = "A B C D
start 1 1 1
start 1 1 1
start 2 1 2
start 2 1 2
stop 1 2 1
stop 1 2 1
stop 2 2 1", header = TRUE)
num.stops <- aggregate((A == "stop") ~ B + C + D, df, FUN = sum)
# B C D (A == "stop")
# 1 1 1 1 0
# 2 1 2 1 2
# 3 2 2 1 1
# 4 2 1 2 0
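If the combinations without any 'stop' should not appear in the result, one option is to subset first and then count rows per group. This is a minimal base R sketch; the names stops and n_stop are illustrative.

```r
df <- read.table(text = "A B C D
start 1 1 1
start 1 1 1
start 2 1 2
start 2 1 2
stop 1 2 1
stop 1 2 1
stop 2 2 1", header = TRUE)

# Keep only the 'stop' rows, then count rows per B/C/D combination.
stops <- aggregate(A ~ B + C + D, data = subset(df, A == "stop"), FUN = length)
names(stops)[names(stops) == "A"] <- "n_stop"
stops
```

Because the counting is done with length, only the number of rows in each group matters, not the values of A themselves.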
library(dplyr)
# summarise_each() and funs() are deprecated; across() is the current equivalent.
df %>% filter(A == "stop") %>% summarise(across(-A, sum))

Remove observations for which there is not a duplicate

I would like to break a dataset into two frames: one containing the observations that are duplicated according to a condition, and one containing those that are not. In the following example, I would like to split the frame into one for the observations with only one coder and one for the observations with two coders:
frame <- data.frame(id = c(1,1,1,2,2,3), coder = c("A", "A", "B", "A", "B", "A"), y = c(4,5,4,1,1,2))
frame
From this, I would like to produce:
frame1:
id coder y
1 1 A 4
2 1 A 5
3 1 B 4
4 2 A 1
5 2 B 1
frame2:
id coder y
6 3 A 2
You can use aggregate to determine the ids you want in each data frame:
cts <- aggregate(coder~id, frame, function(x) length(unique(x)))
cts
# id coder
# 1 1 2
# 2 2 2
# 3 3 1
Then you can subset as appropriate based on this:
subset(frame, id %in% cts$id[cts$coder >= 2])
# id coder y
# 1 1 A 4
# 2 1 A 5
# 3 1 B 4
# 4 2 A 1
# 5 2 B 1
subset(frame, id %in% cts$id[cts$coder < 2])
# id coder y
# 6 3 A 2
You may also try:
indx <- !colSums(!table(frame$coder, frame$id))
frame[frame$id %in% names(indx)[indx],]
# id coder y
#1 1 A 4
#2 1 A 5
#3 1 B 4
#4 2 A 1
#5 2 B 1
frame[frame$id %in% names(indx)[!indx],]
# id coder y
#6 3 A 2
Explanation
table(frame$coder, frame$id)
# 1 2 3
# A 2 1 1
# B 1 1 0 #Here for id 3, B==0
Negating that gives a logical matrix that is TRUE where a coder never coded an id:
!table(frame$coder, frame$id)
Taking the colSums of the above gives the number of missing coders per id:
# 1 2 3
# 0 0 1
Negating again gives TRUE for the ids that were coded by every coder. You can then subset by matching the ids against the names of that index.
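Another way to express the same split is to attach the number of distinct coders to every row with ave() and subset on that. This is a minimal base R sketch; the name n_coders is illustrative, and the threshold of 2 is taken from the question's example.

```r
frame <- data.frame(
  id = c(1, 1, 1, 2, 2, 3),
  coder = c("A", "A", "B", "A", "B", "A"),
  y = c(4, 5, 4, 1, 1, 2)
)

# Number of distinct coders within each id, repeated for every row of that id.
n_coders <- ave(seq_len(nrow(frame)), frame$id,
                FUN = function(i) length(unique(frame$coder[i])))

frame1 <- frame[n_coders >= 2, ]  # ids seen by two or more coders
frame2 <- frame[n_coders < 2, ]   # ids seen by a single coder
```

Unlike the table-based index, this does not require every coder to appear for an id; it only counts how many different coders do.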
