cumulatively concatenate columns in data.table by group [duplicate]

This question already has answers here:
Cumulatively paste (concatenate) values grouped by another variable
(6 answers)
Closed 4 years ago.
I have a data.table like the following:
library(data.table)
x <- data.table(group = c('A', 'A', 'A', 'B', 'B'),
                row_id = c(1, 2, 3, 1, 2),
                value = c('a', 'b', 'c', 'd', 'e'))
I want to add a new column that cumulatively concatenates the values in 'value', ordered by 'row_id', within each group indicated by 'group'. So the output would look like:
group row_id value
1: A 1 a
2: A 2 a_b
3: A 3 a_b_c
4: B 1 d
5: B 2 d_e
Thank you for your help!

One option would be to group by 'group', loop through the row sequence with sapply, use each sequence 1..i as an index into 'value', paste those entries together with the delimiter _, and assign (:=) the result to update 'value':
x[, value := sapply(seq_len(.N), function(i)
      paste(value[seq(i)], collapse = "_")), by = group]
x
# group row_id value
#1: A 1 a
#2: A 2 a_b
#3: A 3 a_b_c
#4: B 1 d
#5: B 2 d_e
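An alternative (not part of the original answer, just a common data.table idiom) is to accumulate the pastes with Reduce(..., accumulate = TRUE). This sketch assumes the same x as above and that rows are already ordered by 'row_id' within each group, as in the example; otherwise order first, e.g. x[order(group, row_id)].
x[, value := Reduce(function(a, b) paste(a, b, sep = "_"),
                    value, accumulate = TRUE), by = group]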


Create id column of repeated rows [duplicate]

This question already has answers here:
Assign unique ID based on two columns [duplicate]
(2 answers)
Closed 3 years ago.
EDITED:
I have a very simple question. I have a dataframe (already given) with repeated rows. I want to identify each unique row and add a column with an ID number.
The original table has thousands of rows, but I have simplified it here. A toy df can be created in this way:
df <- data.frame(var1 = c('a', 'a', 'a', 'b', 'c', 'c', 'a'),
                 var2 = c('d', 'd', 'd', 'e', 'f', 'f', 'c'))
For each unique row, I want a numeric ID:
var1 var2 ID
1 a d 1
2 a d 1
3 a d 1
4 b e 2
5 c f 3
6 c f 3
7 a c 4
/EDITED
Here is a base R solution using cumsum + duplicated, i.e.,
df$ID <- cumsum(!duplicated(df))
such that
> df
var1 var2 ID
1 a d 1
2 a d 1
3 a d 1
4 b e 2
5 c f 3
6 c f 3
7 a c 4
EDIT
Well, the question was completely changed by the OP. For the updated question we can do:
df$ID <- match(paste0(df$var1, df$var2), unique(paste0(df$var1, df$var2)))
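One caveat (an addition, not from the original answer): paste0 with no separator can let different rows collide, e.g. c('ab', 'c') and c('a', 'bc') both become "abc". A sketch of the same match()/unique() idea with an explicit separator:
# assumes '_' does not occur inside var1 or var2
key <- paste(df$var1, df$var2, sep = "_")
df$ID <- match(key, unique(key))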
Original answer
One way would be to use uncount from tidyr
library(dplyr)
df %>% mutate(ID = row_number()) %>% tidyr::uncount(ID, .remove = FALSE)
# var1 var2 ID
#1 a d 1
#2 b e 2
#2.1 b e 2
#3 c f 3
#3.1 c f 3
#3.2 c f 3
In base R we can create a row number column in the dataframe and repeat rows based on that.
df$ID <- seq(nrow(df))
df[rep(df$ID, df$ID), ]
data
df <- structure(list(var1 = structure(1:3, .Label = c("a", "b", "c"
), class = "factor"), var2 = structure(1:3, .Label = c("d", "e",
"f"), class = "factor")), row.names = c(NA, -3L), class = "data.frame")

Accessing column name within the SD construct

I have a data table in R that looks like this:
library(data.table)
DT = data.table(a = c(1,2,3,4,5), a_mean = c(1,1,2,2,2),
                b = c(6,7,8,9,10), b_mean = c(3,2,1,1,2))
I want to create two more columns a_final and b_final defined as a_final = (a - a_mean) and b_final = (b - b_mean). In my real life use case, there can be a large number of such column pairs and I want a scalable solution in the spirit of R's data tables.
I tried something along the lines of
DT[,paste0(c('a','b'),'_final') := lapply(.SD, function(x) ((x-get(paste0(colnames(.SD),'_mean'))))), .SDcols = c('a','b')]
but this doesn't quite work. Any idea of how I can access the column name of the column being processed within the lapply statement?
We can create a character vector of column names, subset those columns from the original data.table, get their corresponding "mean" columns, subtract, and add the results as new columns.
library(data.table)
cols <- unique(sub('_.*', '', names(DT))) #Thanks to #Sotos
#OR just
#cols <- c('a', 'b')
DT[, paste0(cols, '_final')] <- DT[, cols, with = FALSE] -
                                DT[, paste0(cols, "_mean"), with = FALSE]
DT
# a a_mean b b_mean a_final b_final
#1: 1 1 6 3 0 3
#2: 2 1 7 2 1 5
#3: 3 2 8 1 1 7
#4: 4 2 9 1 2 8
#5: 5 2 10 2 3 8
Another option is using mget with Map:
cols <- c('a', 'b')
DT[, paste0(cols,'_final') := Map(`-`, mget(cols), mget(paste0(cols,"_mean")))]
Relying on the .SD construct you could do something along the lines of:
cols <- c('a', 'b')
DT[, paste0(cols, "_final") :=
     DT[, .SD, .SDcols = cols] -
     DT[, .SD, .SDcols = paste0(cols, "_mean")]]
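To address the literal question of getting at the column name inside the .SD loop, a sketch (a variation on the answers above, not from the original) is to pass names(.SD) alongside .SD via Map, so each column is paired with its own name and the matching '_mean' column can be looked up with get():
cols <- c('a', 'b')
DT[, paste0(cols, '_final') := Map(function(x, nm) x - get(paste0(nm, '_mean')),
                                   .SD, names(.SD)),
   .SDcols = cols]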

Assign category based on magnitude of value

I have a data.table like this:
dt1 = data.table(id = c(001, 001, 002, 002, 003, 003),
                 score = c(4, 6, 3, 7, 2, 8))
where each individual has 2 scores on the variable "score".
I would like to assign each individual to a category in the variable outcome based on their score.
For their lower score, they get an "A"; for their higher score, they get a "B". So the final table looks like this:
dt2 = data.table(id = c(001, 001, 002, 002, 003, 003),
                 score = c(4, 6, 3, 7, 2, 8),
                 category = c('A', 'B', 'A', 'B', 'A', 'B'))
Since the values in column "score" are random, the category should be assigned based on the magnitude of the numbers assigned to each person. Any help is much appreciated.
We can order by 'score' in i, grouped by 'id', and assign 'category' as c('A', 'B'):
library(data.table)
dt1[order(score), category := c('A', 'B'), by = id]
dt1
# id score category
#1: 001 4 A
#2: 001 6 B
#3: 002 3 A
#4: 002 7 B
#5: 003 2 A
#6: 003 8 B
Or another option is to convert a logical vector to a numeric index and replace the values based on that
dt1[, category := c('A', 'B')[(score != min(score)) + 1] ,by = id]
data
dt1 <- data.table(id = c('001', '001', '002', '002', '003', '003'),
                  score = c(4, 6, 3, 7, 2, 8))
We can use ifelse:
library(data.table)
dt1[, category := ifelse(score == min(score), 'A', 'B'), by = id]
Result:
id score category
1: 1 4 A
2: 1 6 B
3: 2 3 A
4: 2 7 B
5: 3 2 A
6: 3 8 B
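If an individual could have more than two scores, a hedged generalization of the same idea (not one of the original answers) is to rank the scores within each id and index into a vector of labels, e.g. with data.table::frank:
# ranks 1, 2, 3, ... within each id map to 'A', 'B', 'C', ...
dt1[, category := LETTERS[frank(score, ties.method = "first")], by = id]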

Counting the result of a left join using dplyr

What is the proper way to count the result of a left outer join using dplyr?
Consider the two data frames:
a <- data.frame( id=c( 1, 2, 3, 4 ) )
b <- data.frame( id=c( 1, 1, 3, 3, 3, 4 ), ref_id=c( 'a', 'b', 'c', 'd', 'e', 'f' ) )
a specifies four different IDs. b specifies six records that reference IDs in a. If I want to see how many times each ID is referenced, I might try this:
library(dplyr)
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=n() )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 1
3 3 3
4 4 1
However, the result is misleading because it indicates that ID 2 was referenced once when in reality, it was never referenced (in the intermediate data frame, ref_id was NA for ID 2). I would like to avoid introducing a separate library such as sqldf.
With data.table, you can do
library(data.table)
setDT(a); setDT(b)
b[a, .N, on="id", by=.EACHI]
id N
1: 1 2
2: 2 0
3: 3 3
4: 4 1
Here, the syntax is x[i, j, on, by=.EACHI].
.EACHI groups the computation by each row of i = a.
j = .N uses the special symbol .N, the number of matching rows for each group (0 when there is no match).
There are already some good answers, but since the question prefers not to introduce extra packages, here is a base R approach. We perform a left join of b onto a, append a refs column that is TRUE when ref_id is not NA, and then use aggregate to sum the refs column:
m <- transform(merge(a, b, all.x = TRUE), refs = !is.na(ref_id))
aggregate(refs ~ id, m, sum)
giving:
id refs
1 1 2
2 2 0
3 3 3
4 4 1
It does require another package, but I'd feel remiss not to mention tidylog, which provides reports for a wide range of tidyverse verbs. In your case, it would produce a report like:
library(tidylog)
a <- data.frame(id = c(1, 2, 3, 4 ))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4), ref_id = c('a', 'b', 'c', 'd', 'e', 'f'))
a %>% left_join(b, by='id')
left_join: added one column (ref_id)
> rows only in x 1
> rows only in y (0)
> matched rows 6 (includes duplicates)
> ===
> rows total 7
id ref_id
1 1 a
2 1 b
3 2 <NA>
4 3 c
5 3 d
6 3 e
7 4 f
See here and here for more examples/info
I'm having a hard time deciding if this is a hack or the proper way to count references, but this returns the expected result:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=sum( !is.na( ref_id ) ) )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 0
3 3 3
4 4 1
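If staying inside dplyr/tidyr is acceptable, another sketch (not one of the original answers, and it assumes a dplyr version that supports count(..., name = )) is to count the references in b first and join the counts back onto a, filling unmatched ids with zero:
library(dplyr)
library(tidyr)
b %>%
  count(id, name = "refs") %>%            # references per id present in b
  right_join(a, by = "id") %>%            # keep every id from a
  mutate(refs = replace_na(refs, 0L)) %>% # ids never referenced get 0
  arrange(id)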

R: Merging factor levels and creating sum in merged columns [duplicate]

This question already has answers here:
Cleaning up factor levels (collapsing multiple levels/labels)
(10 answers)
Closed 5 years ago.
I am a beginner in R and this is driving me nuts.
I have a dataframe:
someData = data.frame(Term=c('a', 'b', 'c', 'd', 'a', 'a', 'c', 'c'), Freq=c(1:8), Category=c(1,2,1,2,1,2,1,2));
someData$Term = as.factor(someData$Term)
someData$Category = as.factor(someData$Category)
and I would like to combine the terms a and c (both factors) into x, summing their respective frequencies while keeping the categories, so that I get the following resulting dataframe:
Term Freq Category
x 16 1
b 2 2
d 4 2
x 14 2
The following code only changes the level names to x, but does not sum the values.
combine <- c("a", "c");
levels(someData$Term)[levels(someData$Term) %in% combine] <- paste("x");
This looks valid:
#levels(someData$Term) = list(b = "b", d = "d", x = c("a", "c")) #just another approach
aggregate(Freq ~ Term + Category, someData, sum)
# Term Category Freq
#1 x 1 16
#2 b 2 2
#3 d 2 4
#4 x 2 14
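For comparison, a tidyverse sketch of the same collapse-then-sum idea (an alternative, assuming forcats and dplyr are acceptable; .groups needs dplyr >= 1.0):
library(dplyr)
library(forcats)
someData %>%
  mutate(Term = fct_collapse(Term, x = c("a", "c"))) %>% # merge levels a and c into x
  group_by(Term, Category) %>%
  summarise(Freq = sum(Freq), .groups = "drop")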
