Joining the first n factors (with different n) in R

A data frame contains ID, group, n (numeric), and several factor variables:
ID <- c(1,2,3,4,5,6,7,8,9,10)
group <- c("m", "m", "m", "f", "f", "m", "m", "f", "f", "m")
n <- c(1,2,6,3,6,8,4,1,4,2)
b1 <- c("a", "b", "", "a", "d", "d", "a", "c", "c", "b")
b2 <- c("a", "", "e", "a", "d", "d", "a", "c", "c", "b")
b3 <- c("a", "b", "", "a", "", "d", "a", "c", "c", "b")
b4 <- c("a", "b", "e", "a", "", "d", "a", "c", "c", "b")
b5 <- c("a", "b", "e", "a", "d", "", "", "", "c", "b")
b6 <- c("a", "", "", "", "d", "d", "", "c", "c", "b")
df <- data.frame(ID, group, n, b1, b2, b3, b4, b5, b6)
I need to create a new character column (call it y).
y is computed by joining the first n of the variables (b1, b2, b3, b4, b5, b6), separated by commas.
If a column is blank, it is left out of the join.
For example, for ID = 1, y = "a"; for ID = 2, y = "b" (not "b,"); for ID = 3, y = "e,e,e"; and so on.
The faster the code, the better.

A possible solution, although speed might still be an issue:
df$y <- sapply(seq_len(nrow(df)), function(i) {
  # take the first n values of columns b1..b6 for row i
  cvec <- head(unlist(df[i, 4:9]), df$n[i])
  # drop blanks before joining
  cvec <- cvec[cvec != '']
  paste(cvec, collapse = ',')
})
#    ID group n b1 b2 b3 b4 b5 b6         y
# 1   1     m 1  a  a  a  a  a  a         a
# 2   2     m 2  b     b  b  b            b
# 3   3     m 6     e     e  e        e,e,e
# 4   4     f 3  a  a  a  a  a        a,a,a
# 5   5     f 6  d  d        d  d   d,d,d,d
# 6   6     m 8  d  d  d  d     d d,d,d,d,d
# 7   7     m 4  a  a  a  a         a,a,a,a
# 8   8     f 1  c  c  c  c     c         c
# 9   9     f 4  c  c  c  c  c  c   c,c,c,c
# 10 10     m 2  b  b  b  b  b  b       b,b
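If the row-wise data frame indexing above turns out to be slow, one possible alternative is to convert the b columns to a character matrix once and mask it by column position. This is only a sketch, assuming the b columns hold plain strings; the names b and keep are introduced here for illustration and are not part of the original answer.

```r
# Example data from the question
df <- data.frame(
  ID    = 1:10,
  group = c("m", "m", "m", "f", "f", "m", "m", "f", "f", "m"),
  n     = c(1, 2, 6, 3, 6, 8, 4, 1, 4, 2),
  b1 = c("a", "b", "", "a", "d", "d", "a", "c", "c", "b"),
  b2 = c("a", "", "e", "a", "d", "d", "a", "c", "c", "b"),
  b3 = c("a", "b", "", "a", "", "d", "a", "c", "c", "b"),
  b4 = c("a", "b", "e", "a", "", "d", "a", "c", "c", "b"),
  b5 = c("a", "b", "e", "a", "d", "", "", "", "c", "b"),
  b6 = c("a", "", "", "", "d", "d", "", "c", "c", "b"),
  stringsAsFactors = FALSE
)

# Character matrix of the b columns (hypothetical helper name)
b <- as.matrix(df[paste0("b", 1:6)])

# TRUE where the column position is within the first n AND the cell is not blank;
# df$n is recycled down the rows, so cell [i, j] is compared with n[i]
keep <- col(b) <= df$n & b != ""

# Join the kept cells of each row
df$y <- vapply(seq_len(nrow(b)),
               function(i) paste(b[i, keep[i, ]], collapse = ","),
               character(1))
df$y
```

The loop still runs once per row, but each iteration only indexes a matrix, which is usually much cheaper than subsetting a data frame row.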

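Here is an option using gsub and paste. We paste the 'b' columns of 'df' together (do.call(paste0, df[-(1:3)])), use substring to keep only the first 'n' characters, and then use gsub to insert a comma between adjacent characters. Note that paste0 drops the blanks before substring counts characters, so this keeps the first n non-blank values rather than the non-blank values among the first n columns (ID = 2 gives "b,b" instead of "b"), and it assumes every value is a single character.
df$y <- gsub("(?<=\\S)(?=\\S)", ",",
             substring(do.call(paste0, df[-(1:3)]), 1, df$n), perl = TRUE)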
df
#   ID group n b1 b2 b3 b4 b5 b6         y
#1   1     m 1  a  a  a  a  a  a         a
#2   2     m 2  b     b  b  b          b,b
#3   3     m 6     e     e  e        e,e,e
#4   4     f 3  a  a  a  a  a        a,a,a
#5   5     f 6  d  d        d  d   d,d,d,d
#6   6     m 8  d  d  d  d     d d,d,d,d,d
#7   7     m 4  a  a  a  a         a,a,a,a
#8   8     f 1  c  c  c  c     c         c
#9   9     f 4  c  c  c  c  c  c   c,c,c,c
#10 10     m 2  b  b  b  b  b  b       b,b
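One way to keep the paste0 + substring idea while still honouring blank positions is to stand in a space for each blank before pasting, so that every column occupies exactly one character. The following is a sketch under the same single-character assumption; the space placeholder and the names b and first_n are choices made here for illustration.

```r
# Example data from the question (only the columns needed here)
df <- data.frame(
  n  = c(1, 2, 6, 3, 6, 8, 4, 1, 4, 2),
  b1 = c("a", "b", "", "a", "d", "d", "a", "c", "c", "b"),
  b2 = c("a", "", "e", "a", "d", "d", "a", "c", "c", "b"),
  b3 = c("a", "b", "", "a", "", "d", "a", "c", "c", "b"),
  b4 = c("a", "b", "e", "a", "", "d", "a", "c", "c", "b"),
  b5 = c("a", "b", "e", "a", "d", "", "", "", "c", "b"),
  b6 = c("a", "", "", "", "d", "d", "", "c", "c", "b"),
  stringsAsFactors = FALSE
)

b <- df[paste0("b", 1:6)]
b[b == ""] <- " "                      # placeholder keeps each column one character wide

# First n characters now correspond to the first n columns
first_n <- substring(do.call(paste0, b), 1, df$n)

# Drop the placeholders, then put a comma between adjacent characters
y <- gsub("(?<=\\S)(?=\\S)", ",",
          gsub(" ", "", first_n, fixed = TRUE), perl = TRUE)
y
```

With the placeholder in place, ID = 2 yields "b", which matches the result requested in the question.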

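Another option is to paste the first n values of each row with spaces, trim the ends, and turn the remaining runs of whitespace into commas:
df$y <- apply(df, 1, function(r) {
  # apply() passes each row as a character vector, so convert n back to an integer
  gsub("\\s+", ",", trimws(paste(head(r[4:9], as.integer(r["n"])), collapse = " ")))
})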
df
#    ID group n b1 b2 b3 b4 b5 b6         y
# 1   1     m 1  a  a  a  a  a  a         a
# 2   2     m 2  b     b  b  b            b
# 3   3     m 6     e     e  e        e,e,e
# 4   4     f 3  a  a  a  a  a        a,a,a
# 5   5     f 6  d  d        d  d   d,d,d,d
# 6   6     m 8  d  d  d  d     d d,d,d,d,d
# 7   7     m 4  a  a  a  a         a,a,a,a
# 8   8     f 1  c  c  c  c     c         c
# 9   9     f 4  c  c  c  c  c  c   c,c,c,c
# 10 10     m 2  b  b  b  b  b  b       b,b

Related

Replace certain columns in a data frame with the columns of another data frame

I have two data frames with the same column names and the same size. Each has 40 columns and 5000 rows. I would like to replace certain columns in one data frame with those from the other, matched by their common ID. The ID column holds the same values in both data frames, but not necessarily in the same order.
Let me provide an example for clarity.
df1 <- data.frame( ID = c("ID1", "ID2","ID3", "ID4","ID5", "ID6","ID7", "ID8", "ID9"),
A = c(1,2,3,4,5,6,7,8,9),
B = c(11,21,31,41,51,61,71,81,91),
C = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
D = c("a1","b1","c1", "d1","e1", "f1", "g1", "h1", "i1")
)
df1
df2 <- data.frame( ID = c("ID2", "ID1","ID3", "ID4","ID5", "ID6","ID9", "ID8", "ID7"),
A = sample(x = 1:20, size = 9),
B = sample(x = 1:50, size = 9),
C = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
D = c("A1","B1","C1", "D1","E1", "F1", "G1", "H1", "I1")
)
df2
This is what df2 should look like after replacing its columns A and B with those from df1, while keeping the remaining columns (C, D) unchanged:
df2_out <- data.frame( ID = c("ID2", "ID1","ID3", "ID4","ID5", "ID6","ID9", "ID8", "ID7"),
A = c(2,1,3,4,5,6,9,8,7),
B = c(21,11,31,41,51,61,91,81,71),
C = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
D = c("A1","B1","C1", "D1","E1", "F1", "G1", "H1", "I1")
)
In my real data set, the list of columns to be changed is long (30 columns):
changed_columns <- c("A", "B", ....)
Any help on how to do this?
Thank you

Using the data.table package, you can solve your problem as follows:
library(data.table)
setDT(df2)[df1, c("A", "B") := .(i.A, i.B), on = "ID"]
# ID A B C D
# 1: ID2 2 21 A A1
# 2: ID1 1 11 B B1
# 3: ID3 3 31 C C1
# 4: ID4 4 41 D D1
# 5: ID5 5 51 E E1
# 6: ID6 6 61 F F1
# 7: ID9 9 91 G G1
# 8: ID8 8 81 H H1
# 9: ID7 7 71 I I1

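Another base R option uses merge + subset. Note that the result is sorted by ID rather than in the original order of df2, and the replaced columns are named A.x and B.x:
df2_out <- subset(merge(df1[c("ID", "A", "B")], df2, all = TRUE, by = "ID"), select = -c(A.y, B.y))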
such that
> df2_out
ID A.x B.x C D
1 ID1 1 11 B B1
2 ID2 2 21 A A1
3 ID3 3 31 C C1
4 ID4 4 41 D D1
5 ID5 5 51 E E1
6 ID6 6 61 F F1
7 ID7 7 71 I I1
8 ID8 8 81 H H1
9 ID9 9 91 G G1

We can use match to find, for each ID in df1, the corresponding row of df2, and then overwrite changed_columns in those rows with the values from df1.
changed_columns <- c("A", "B")
df2[match(df1$ID, df2$ID), changed_columns] <- df1[changed_columns]
df2
# ID A B C D
#1 ID2 2 21 A A1
#2 ID1 1 11 B B1
#3 ID3 3 31 C C1
#4 ID4 4 41 D D1
#5 ID5 5 51 E E1
#6 ID6 6 61 F F1
#7 ID9 9 91 G G1
#8 ID8 8 81 H H1
#9 ID7 7 71 I I1
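The same idea can also be written the other way round, by looking up each df2 ID in df1. The sketch below is self-contained and uses a reduced version of the example data (fixed placeholder values in df2 instead of sample()); idx is a name introduced here. With 30 columns, changed_columns simply becomes a longer character vector.

```r
df1 <- data.frame(
  ID = paste0("ID", 1:9),
  A  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  B  = c(11, 21, 31, 41, 51, 61, 71, 81, 91),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = c("ID2", "ID1", "ID3", "ID4", "ID5", "ID6", "ID9", "ID8", "ID7"),
  A  = 0,                                   # placeholder values to be overwritten
  B  = 0,
  C  = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  stringsAsFactors = FALSE
)

changed_columns <- c("A", "B")

# For every row of df2, the position of its ID in df1
idx <- match(df2$ID, df1$ID)

# Pull the df1 rows in df2's order and overwrite only the chosen columns
df2[changed_columns] <- df1[idx, changed_columns]
df2
```

Written this way, an ID that exists in df2 but not in df1 would receive NA in the replaced columns instead of causing an error, which may or may not be what you want.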

Combining multiple column/ stacking multiple columns

I have 9 columns:
a b c d e f g h i
1 1 t p 1 h p 1 v g
2 2 e h 2 j m 2 c f
3 3 f g 3 k l 3 b d
and I want to know how I can stack them like this:
a b c
1 1 t p
2 2 e h
3 3 f g
4 1 h p
5 2 j m
6 3 k l
7 1 v g
8 2 c f
9 3 b d

We can use reshape from base R by specifying the columns to combine together in a list of vectors
out <- reshape(df1, direction = 'long',
varying = list(c('a', 'd', 'g'), c('b', 'e', 'h'),
c('c', 'f', 'i')))[c('a', 'b', 'c')]
row.names(out) <- NULL
out
# a b c
#1 1 t p
#2 2 e h
#3 3 f g
#4 1 h p
#5 2 j m
#6 3 k l
#7 1 v g
#8 2 c f
#9 3 b d
Or using melt from data.table
library(data.table)
melt(setDT(df1), measure = list(c('a', 'd', 'g'), c('b', 'e', 'h'),
c('c', 'f', 'i')), value.name = c('a', 'b', 'c'))[, variable := NULL][]
data
df1 <- structure(list(a = 1:3, b = c("t", "e", "f"), c = c("p", "h",
"g"), d = 1:3, e = c("h", "j", "k"), f = c("p", "m", "l"), g = 1:3,
h = c("v", "c", "b"), i = c("g", "f", "d")),
class = "data.frame", row.names = c("1",
"2", "3"))
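For comparison, the same stacking can be sketched with plain lapply and unlist, assuming the columns repeat in a fixed a/b/c pattern as in the example. The names groups and out are introduced here for illustration.

```r
df1 <- data.frame(
  a = 1:3, b = c("t", "e", "f"), c = c("p", "h", "g"),
  d = 1:3, e = c("h", "j", "k"), f = c("p", "m", "l"),
  g = 1:3, h = c("v", "c", "b"), i = c("g", "f", "d"),
  stringsAsFactors = FALSE
)

# Column positions dealt round-robin into 3 groups: (1,4,7), (2,5,8), (3,6,9)
groups <- split(seq_along(df1), rep(1:3, length.out = ncol(df1)))

# Concatenate the columns of each group end to end
out <- as.data.frame(
  lapply(groups, function(idx) unlist(df1[idx], use.names = FALSE)),
  stringsAsFactors = FALSE
)
names(out) <- c("a", "b", "c")
out
```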

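One option involving purrr could be:
library(purrr)
# df1 is the data frame shown above; its columns are dealt round-robin into 3 groups
map_dfc(.x = split.default(df1, rep(1:3, length.out = length(df1))),
        ~ stack(.)[1]) %>%
  setNames(c("a", "b", "c"))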
a b c
1 1 t p
2 2 e h
3 3 f g
4 1 h p
5 2 j m
6 3 k l
7 1 v g
8 2 c f
9 3 b d

Transform df into edges df for collaboration network

I have this df, which contains information on collaboration of articles:
author author2 author3 author4
1 A D E F
2 B G
3 C H F
I need to create an edges dataframe, which contains the relationship between the authors, like this:
from to
1 A D
2 A E
3 A F
4 B G
5 C H
6 C F
7 D E
8 D F
9 E F
10 H F
Any ideas how to do it?

We can gather each column against the columns to its right, and then bind all the results together.
library(tidyverse)
map_dfr(names(df)[-length(df)], ~select(df, .x:ncol(df)) %>% gather(k, to, -.x) %>%
  arrange(!!ensym(.x)) %>% select(-k) %>% filter(to != '') %>%
  rename(from = starts_with('author')))
from to
1 A D
2 A E
3 A F
4 B G
5 C H
6 C F
7 D E
8 D F
9 H F
10 E F
Data
df <- structure(list(author = c("A", "B", "C"), author2 = c("D", "G",
"H"), author3 = c("E", "", "F"), author4 = c("F","", "")), class = "data.frame", row.names = c("1",
"2", "3"))

You could apply combn row-wise inside a function, no need for packages.
edges <- setNames(as.data.frame(do.call(rbind, lapply(seq(nrow(d)), function(x)
matrix(unlist(t(combn(na.omit(unlist(d[x, ])), 2))), ncol=2)))), c("from", "to"))
edges
# from to
# 1 A D
# 2 A E
# 3 A F
# 4 D E
# 5 D F
# 6 E F
# 7 B G
# 8 C H
# 9 C F
# 10 H F
Or, using the igraph package, as @akrun suggested.
library(igraph)
edges <- do.call(rbind, apply(d, 1, function(x)
as_data_frame(graph_from_data_frame(t(combn(na.omit(x), 2))))))
edges
# from to
# 1 A D
# 2 A E
# 3 A F
# 4 D E
# 5 D F
# 6 E F
# 7 B G
# 8 C H
# 9 C F
# 10 H F
Data
d <- structure(list(author = c("A", "B", "C"), author2 = c("D", "G",
"H"), author3 = c("E", NA, "F"), author4 = c("F", NA, NA)), row.names = c(NA,
-3L), class = "data.frame")
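The combn code above relies on missing authors being NA. If the blanks are empty strings, as in the data from the first answer, a small variation can filter both. This is a sketch only; the names pairs and authors are introduced here, and rows with fewer than two authors are skipped.

```r
df <- data.frame(
  author  = c("A", "B", "C"),
  author2 = c("D", "G", "H"),
  author3 = c("E", "", "F"),
  author4 = c("F", "", ""),
  stringsAsFactors = FALSE
)

pairs <- lapply(seq_len(nrow(df)), function(i) {
  authors <- unlist(df[i, ], use.names = FALSE)
  authors <- authors[!is.na(authors) & authors != ""]  # drop NA and blank cells
  if (length(authors) < 2) return(NULL)                # no pair possible
  t(combn(authors, 2))                                 # one row per author pair
})

edges <- as.data.frame(do.call(rbind, pairs), stringsAsFactors = FALSE)
names(edges) <- c("from", "to")
edges
```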

Get a unique hash value out of a combination of columns

I have a data.table with 4+ columns. The first 3 are necessary to get the data about one unique individual.
c1 c2 c3 c4
a c e other_data
a c e other_data
a c f other_data
a c f other_data
a d f other_data
b d g other_data
# (c1 = "a" AND c2 = "c" AND c3 = "e") => one individual
# (c1 = "a" AND c2 = "c" AND c3 = "f") => another individual
I'd like to compute another column which would mark each individual:
c1 c2 c3 c4 unique_individual_id
a c e other_data 1
a c e other_data 1
a c f other_data 2
a c f other_data 2
a d f other_data 3
b d g other_data 4
I would like to get a unique hash out of the content of the 3 columns.
How would I do that in code?

as.numeric(as.factor(with(df, paste(c1, c2, c3))))
#[1] 1 1 2 2 3 4

We can use interaction to create the unique index
df1$unique_individual_id <- as.integer(do.call(interaction, c(df1[-4], drop = TRUE)))
df1$unique_individual_id
#[1] 1 1 2 2 3 4

Alternatively, you can paste the values of interest (for each row, paste together the values in columns 1, 2, and 3), convert the result to a factor, and then to a number. This returns a unique ID number for each combination.
df <- data.frame(c("a", "a", "b", "c", "c", "d", "d"),
c("a", "a", "b", "c", "d", "e", "e"),
c("c", "c", "d", "d", "e", "e", "e"))
df$ID <- as.numeric(as.factor(sapply(1:nrow(df), (function(i) {paste(df[i, 1:3], collapse = "")}))))
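If the IDs should follow the order in which each combination first appears (rather than the alphabetical order of factor levels), a vectorized sketch with match and unique could look like this. It assumes the values never contain the separator, here a tab; key is a name introduced for illustration.

```r
df <- data.frame(
  c1 = c("a", "a", "a", "a", "a", "b"),
  c2 = c("c", "c", "c", "c", "d", "d"),
  c3 = c("e", "e", "f", "f", "f", "g"),
  c4 = "other_data",
  stringsAsFactors = FALSE
)

# One string per row; a separator avoids collisions such as ("ab", "c") vs ("a", "bc")
key <- do.call(paste, c(df[c("c1", "c2", "c3")], sep = "\t"))

# Position of each key among the distinct keys, in order of first appearance
df$unique_individual_id <- match(key, unique(key))
df$unique_individual_id
```

Because paste is called once on whole columns, this avoids the per-row sapply loop.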

Remove/collapse consecutive duplicate values in sequence

I have the following dataframe:
a a a b c c d e a a b b b e e d d
The required result should be
a b c d e a b e d
That is, no two consecutive rows should have the same value. How can this be done without using a loop?
As my data set is quite huge, looping is taking lot of time to execute.
The dataframe structure is like the following
a 1
a 2
a 3
b 2
c 4
c 1
d 3
e 9
a 4
a 8
b 10
b 199
e 2
e 5
d 4
d 10
Result:
a 1
b 2
c 4
d 3
e 9
a 4
b 10
e 2
d 4
It should delete the entire row.

One easy way is to use rle:
Here's your sample data:
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items
rle returns a list with two values: the run length ("lengths"), and the value that is repeated for that run ("values").
rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
Update: For a data.frame
If you are working with a data.frame, try something like the following:
## Sample data
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10)
)
## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1] 1 4 5 7 8 9 11 13 15
mydf[Y, ]
# V1 V2
# 1 a 1
# 4 b 2
# 5 c 4
# 7 d 3
# 8 e 9
# 9 a 4
# 11 b 10
# 13 e 2
# 15 d 4
Update 2
The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:
library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
# rleid V2
# 1: 1 1
# 2: 2 2
# 3: 3 4
# 4: 4 3
# 5: 5 9
# 6: 6 4
# 7: 7 10
# 8: 8 2
# 9: 9 4
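The same row selection can also be sketched without rle, by comparing each value with the one before it. This assumes the data frame has at least one row and that V1 contains no NA; keep and result are names introduced here.

```r
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE
)

# The first row is always kept; every later row is kept only if it differs
# from the row directly above it
keep <- c(TRUE, mydf$V1[-1] != mydf$V1[-nrow(mydf)])

result <- mydf[keep, ]
result
```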

library(dplyr)
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
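# a character default also works with dplyr versions that require default to match the type of x
x[x != lag(x, default = "")]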
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
EDIT: For data.frame
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10),
stringsAsFactors=FALSE)
The dplyr solution is a one-liner:
mydf %>% filter(V1!= lag(V1, default="1"))
# V1 V2
#1 a 1
#2 b 2
#3 c 4
#4 d 3
#5 e 9
#6 a 4
#7 b 10
#8 e 2
#9 d 4
Postscript
lead(x, 1), suggested by @Carl Witthoft, compares each value with the next one, so it keeps the last row of each run instead of the first.
leadit<-function(x) x!=lead(x, default="what")
rows <- leadit(mydf[ ,1])
mydf[rows, ]
# V1 V2
#3 a 3
#4 b 2
#6 c 1
#7 d 3
#8 e 9
#10 a 8
#12 b 199
#14 e 5
#16 d 10

With base R, I like funny algorithmics:
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
x[x!=c(x[-1], FALSE)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

Much as I like... errr, love rle, here's a shootoff:
EDIT: Can't figure out exactly what's up with dplyr, so I used dplyr::lead. I'm on OSX, R 3.1.2, and the latest dplyr from CRAN.
library(dplyr)
library(microbenchmark)
xlet <- sample(letters, 1e5, replace = TRUE)
rleit <- function(x) rle(x)$values
lagit <- function(x) x[x != lead(x, default = "")]
tailit <- function(x) x[x != c(tail(x, -1), tail(x, 1))]
microbenchmark(rleit(xlet), lagit(xlet), tailit(xlet), times = 20)
Unit: milliseconds
         expr      min       lq   median       uq      max neval
  rleit(xlet) 27.43996 30.02569 30.20385 30.92817 37.10657    20
  lagit(xlet) 12.44794 15.00687 15.14051 15.80254 46.66940    20
 tailit(xlet) 12.48968 14.66588 14.78383 15.32276 55.59840    20

Tidyverse solution (consecutive_id requires dplyr 1.1.0 or later):
library(dplyr)
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x)
x |>
mutate(id = consecutive_id(x)) |>
distinct(x, id)
In addition, if there is another column y associated with the consecutive values column, this solution allows some flexibility:
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x, y = runif(length(x)))
x |>
group_by(id = consecutive_id(x)) |>
slice_min(y)
We can choose between the different slice functions, like slice_max, slice_min, slice_head, and slice_tail.
This Stack Overflow thread appeared in the second edition of R4DS, in the Numbers chapter of the book.
