I have two datasets. The first one is smaller but has more precise data.
I need to join them, but:
1. If a value is present in Data1, I use only the Data1 row.
2. If a value is missing from Data1 but present in Data2, I use only the row from Data2.
Data1 <- data.frame(
X = c(1,4,7,10,13,16),
Y = c("a", "b", "c", "d", "e", "f")
)
Data2 <- data.frame(
X = c(1:10),
Y = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
)
So the resulting data.frame should look like this:
DataJoin <- data.frame(
X = c(1,4,7,10,13,16,7,8,9,10),
Y = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
)
How can I do that?
I've tried merge from base R and from the data.table package, but I couldn't get the result I wanted.
There's no join needed. You can reformulate the problem as "add the data found in Data2 and not found in Data1 to Data1". So simply do:
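# TRUE for rows of Data2 whose Y already appears in Data1
id <- Data2$Y %in% Data1$Y
# keep all of Data1, then append only the Data2 rows it does not cover
DataJoin <- rbind(Data1, Data2[!id, ])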
Gives:
> DataJoin
X Y
1 1 a
2 4 b
3 7 c
4 10 d
5 13 e
6 16 f
7 7 g
8 8 h
9 9 i
10 10 j
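The same idea can be wrapped in a small helper. This is only a sketch: `prefer_first()` and its `key` argument are hypothetical names introduced here (they are not part of the answer above), and the function assumes both data frames have the same columns.

```r
# Hypothetical helper: keep every row of `primary`, then append the rows of
# `secondary` whose key value does not occur in `primary`.
prefer_first <- function(primary, secondary, key) {
  missing_in_primary <- !(secondary[[key]] %in% primary[[key]])
  out <- rbind(primary, secondary[missing_in_primary, , drop = FALSE])
  rownames(out) <- NULL  # renumber the rows 1..n
  out
}

Data1 <- data.frame(X = c(1, 4, 7, 10, 13, 16),
                    Y = c("a", "b", "c", "d", "e", "f"))
Data2 <- data.frame(X = 1:10,
                    Y = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"))

DataJoin <- prefer_first(Data1, Data2, key = "Y")
```

Passing the key column by name keeps the helper usable when the matching column is not called Y.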
Using data.table:
library(data.table)
d1 <- data.table(Data1, key="Y")[, X := as.integer(X)]
d2 <- data.table(Data2, key="Y")
# copy d2 so that it doesn't get modified by reference
# i.X refers to the column X of DT in 'i' = d1's 'X'
ans <- copy(d2)[d1, X := i.X]
X Y
1: 1 a
2: 4 b
3: 7 c
4: 10 d
5: 13 e
6: 16 f
7: 7 g
8: 8 h
9: 9 i
10: 10 j
DataJoin <- merge(Data1, Data2, by="Y", all=TRUE)
DataJoin$X.x[is.na(DataJoin$X.x)] <- DataJoin$X.y[is.na(DataJoin$X.x)]
DataJoin[,1:2]
# Y X.x
# 1 a 1
# 2 b 4
# 3 c 7
# 4 d 10
# 5 e 13
# 6 f 16
# 7 g 7
# 8 h 8
# 9 i 9
# 10 j 10
How do I rearrange the rows in tibble?
I wish to reorder the rows so that the row with x = "c" goes to the bottom of the tibble and everything else keeps its order.
library(dplyr)
tbl <- tibble(x = c("a", "b", "c", "d", "e", "f", "g", "h"),
y = 1:8)
An alternative to dplyr::arrange(), using base R:
tbl[order(tbl$x == "c"), ] # Thanks to Merijn van Tilborg
Output:
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
tbl |> dplyr::arrange(x == "c")
Using forcats, convert x to a factor with "c" as the last level inside arrange(). This doesn't change the class of the column x.
library(forcats)
tbl %>%
arrange(fct_relevel(x, "c", after = Inf))
# A tibble: 8 x 2
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
If the order of x is important, it is better to store it as a factor. The code below changes the class from character to factor, with "c" as the last level:
tbl %>%
mutate(x = fct_relevel(x, "c", after = Inf)) %>%
arrange(x)
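The base R `order()` approach shown earlier also extends to moving several values to the bottom at once. The sketch below makes two assumptions that are not in the original: it uses a plain data.frame instead of a tibble so it runs without extra packages, and `to_bottom` is a hypothetical vector name.

```r
# Plain data.frame stand-in for the tibble from the question
tbl <- data.frame(x = c("a", "b", "c", "d", "e", "f", "g", "h"),
                  y = 1:8)

# Hypothetical: the values of x that should end up in the last rows
to_bottom <- c("c", "e")

# FALSE sorts before TRUE, and order() keeps ties in their original order,
# so the remaining rows are left untouched
reordered <- tbl[order(tbl$x %in% to_bottom), ]
```

Because ties keep their original order, the moved rows also keep their relative order ("c" before "e").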
I have the following data:
data <- data.frame(name = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D", "E", "B", "C", "C"),
surname = c("aa", "bb", "cc", "dd", "hh", "ee", "ii", "aa", "qq", "ff", "gg", "ff", "gg", "cc"))
This data produces a connected graph:
library(igraph)
plot(graph_from_data_frame(data, directed = FALSE))
which obviously has 1 component.
I would like to count the number of components this data produces every time we add a row in the graph. For example, the initial graph will have 1 component, since the vertices A and aa in the first row of the data are connected. The next graph will have again 1 component, since we add the second row and because of the A value in the name column. When we include the fourth row (B, dd), the graph will have 2 components.
I use the following piece of code to get the number of components each time the data is updated:
for (i in seq_len(nrow(data))) {
  data$number_of_components[i] <- components(graph_from_data_frame(data[1:i, ], directed = FALSE))$no
}
Is there a smarter/more sophisticated way to get this? Thanks.
You can take a look at sapply().
# `data` is the data frame from the question
data$number_of_components <- sapply(seq_len(nrow(data)), function(x) {
  # build the graph from the first x rows only
  g <- graph_from_data_frame(data[seq_len(x), ], directed = FALSE)
  components(g)$no
})
data
# name surname number_of_components
# 1 A aa 1
# 2 A bb 1
# 3 A cc 1
# 4 B dd 2
# 5 B hh 2
# 6 C ee 3
# 7 D ii 4
# 8 D aa 3
# 9 D qq 3
# 10 D ff 3
# 11 E gg 4
# 12 B ff 3
# 13 C gg 2
# 14 C cc 1
You can try decompose(), as shown below:
transform(
data,
num_components = sapply(
seq_along(name),
function(k) length(decompose(graph_from_data_frame(head(data, k), directed = FALSE)))
)
)
or
transform(
data,
  num_components = lengths(
    # lapply() always returns a list, so lengths() counts the components per step
    lapply(
      seq_along(name),
      function(k) decompose(graph_from_data_frame(head(data, k), directed = FALSE))
    )
)
)
which gives
name surname num_components
1 A aa 1
2 A bb 1
3 A cc 1
4 B dd 2
5 B hh 2
6 C ee 3
7 D ii 4
8 D aa 3
9 D qq 3
10 D ff 3
11 E gg 4
12 B ff 3
13 C gg 2
14 C cc 1
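All of the approaches above rebuild the graph for every prefix of the data. As an alternative, the counts can be maintained incrementally with a union-find (disjoint-set) structure. The following is a base R sketch, not taken from the answers: `count_components_incrementally()` and the `edges` data frame are hypothetical names, and vertices are identified by their label, so a name and a surname with the same string would count as one vertex, which is how graph_from_data_frame treats them as well.

```r
# Hypothetical helper: number of connected components after each added edge
count_components_incrementally <- function(from, to) {
  from <- as.character(from)
  to <- as.character(to)
  labels <- unique(c(from, to))
  a <- match(from, labels)           # integer vertex ids of the edge ends
  b <- match(to, labels)
  parent <- seq_along(labels)        # every vertex starts as its own root
  seen <- logical(length(labels))    # has the vertex appeared in the graph yet?
  n_comp <- 0L
  out <- integer(length(from))

  # follow parent links until a root is reached
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  for (k in seq_along(from)) {
    # each vertex seen for the first time adds one component
    new_vertices <- unique(c(a[k], b[k]))
    new_vertices <- new_vertices[!seen[new_vertices]]
    seen[new_vertices] <- TRUE
    n_comp <- n_comp + length(new_vertices)

    # an edge between two different components merges them
    ra <- find_root(a[k])
    rb <- find_root(b[k])
    if (ra != rb) {
      parent[rb] <- ra
      n_comp <- n_comp - 1L
    }
    out[k] <- n_comp
  }
  out
}

edges <- data.frame(
  name = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D", "E", "B", "C", "C"),
  surname = c("aa", "bb", "cc", "dd", "hh", "ee", "ii", "aa", "qq", "ff", "gg", "ff", "gg", "cc")
)
edges$number_of_components <- count_components_incrementally(edges$name, edges$surname)
```

Each row is processed once, so no graph has to be rebuilt from scratch. The sketch omits path compression and union by rank, which would matter only for much larger inputs.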
I have a grouped data set that looks like this:
data = data.frame(group = c(1,1,1,1,2,2,2,2),
c1 = c("A", "E", "A", "J", "L", "M", "L", "J"),
c2 = c("B", "F", "F", "K", "B", "F", "T", "E"),
c3 = c("C", "G", "C", "L", "C", "X", "C", "V"),
c4 = c("D", "H", "I", "M", "D", "T", "I", "W"))
And I need to calculate the number of values in each row that are not duplicated within each group. For example, something that looks like this:
group c1 c2 c3 c4 uniq.vals
1 1 A B C D 2
2 1 E F G H 3
3 1 A F C I 1
4 1 J K L M 4
5 2 L B C D 2
6 2 M F X T 3
7 2 L T C I 1
8 2 J E V W 4
The count for row 1 would be 2, because B and D do not show up in any of the other rows within group 1.
I am familiar with using group_by and summarize but I am having trouble extending that to this particular situation, which requires that each value be checked across multiple columns and rows. For example, n_distinct on its own would not work because I'm looking for non-duplicated values, not unique values.
Ideally the solution would also ignore NAs and not count them as duplicated or non-duplicated values.
Here is an option with tidyverse. Reshape to 'long' format with pivot_longer, then, grouped by 'group', replace every duplicated 'value' with NA. Next, group by row number and summarise with n_distinct (the number of distinct elements, ignoring NA) to get the counts, and finally bind the result to the original data.
library(dplyr)
library(tidyr)
data %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = starts_with('c')) %>%
group_by(group) %>%
mutate(value = replace(value, duplicated(value)|duplicated(value,
fromLast = TRUE), NA)) %>%
group_by(rn) %>%
summarise(uniq.vals = n_distinct(value, na.rm = TRUE), .groups = 'drop') %>%
select(uniq.vals) %>%
bind_cols(data, .)
Output:
# group c1 c2 c3 c4 uniq.vals
#1 1 A B C D 2
#2 1 E F G H 3
#3 1 A F C I 1
#4 1 J K L M 4
#5 2 L B C D 2
#6 2 M F X T 3
#7 2 L T C I 1
#8 2 J E V W 4
In base R you would do:
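# per-group frequency table of every value in columns c1..c4
a <- tapply(unlist(data[-1]), data$group[row(data[-1])], table)
# for each row, count the values that occur only once within its group
data$uniq.vals <- c(by(data, seq(nrow(data)),
                       function(x) sum(a[[x[, 1]]][unlist(x[-1])] < 2)))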
group c1 c2 c3 c4 uniq.vals
1 1 A B C D 2
2 1 E F G H 3
3 1 A F C I 1
4 1 J K L M 4
5 2 L B C D 2
6 2 M F X T 3
7 2 L T C I 1
8 2 J E V W 4
Note that in your case, row 3 should have 1, since I is its only non-duplicated value.
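The question also asks that NAs be ignored. Here is a base R sketch that counts, per row, the values occurring exactly once within the row's group and skips NAs. `count_non_duplicated()` and its arguments are hypothetical names introduced for illustration, and the NA placed in the example data is an assumption used only to show that behaviour.

```r
# Hypothetical helper: per-row count of values that appear exactly once
# within the row's group; NA is never counted
count_non_duplicated <- function(df, group_col, value_cols) {
  vals <- as.matrix(df[value_cols])
  out <- numeric(nrow(df))
  for (g in unique(df[[group_col]])) {
    rows <- which(df[[group_col]] == g)
    block <- vals[rows, , drop = FALSE]
    counts <- table(block)                 # table() drops NA by default
    singles <- names(counts)[counts == 1]
    # %in% drops the matrix shape, so restore it before summing per row
    is_single <- matrix(block %in% singles, nrow = length(rows))
    out[rows] <- rowSums(is_single)
  }
  out
}

data <- data.frame(group = c(1, 1, 1, 1, 2, 2, 2, 2),
                   c1 = c("A", "E", "A", "J", "L", "M", "L", "J"),
                   c2 = c("B", "F", "F", "K", "B", "F", "T", "E"),
                   c3 = c("C", "G", "C", "L", "C", "X", "C", "V"),
                   c4 = c("D", "H", "I", "M", "D", "T", "I", "W"),
                   stringsAsFactors = FALSE)
data$uniq.vals <- count_non_duplicated(data, "group", paste0("c", 1:4))

# Same data with one NA (an assumption for illustration): the "A" in row 3
# is now non-duplicated, and the NA itself is not counted
data_na <- data
data_na$c1[1] <- NA
uniq_with_na <- count_non_duplicated(data_na, "group", paste0("c", 1:4))
```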
Here are two example data sets, "ex" and "data".
I would like to build the new data set shown on the right side of the attached picture.
I want to match the row names of "ex" to the row names of "data";
in addition, I want to match the values stored in "ex" to the row names of "data".
It is complicated to explain,
so I attached a picture of what I want.
My code is below. Unfortunately, I am having trouble creating the new data set.
How should I revise my code?
Thanks in advance.
ex <- data.frame(matrix(c(5, 12, 14, 20,
4, 19, 17, 9,
11, 15, 8, 10), ncol=4))
data <- data.frame(matrix(c("A","B","C","D","E","F","G",
"H", "I", "J", "K", "L", "M", "N",
"O", "P", "Q", "R", "S", "T",
"A","B","C","D","E","F","G",
"H", "I", "J", "K", "L", "M", "N",
"O", "P", "Q", "R", "S", "T"), ncol=2))
## something is wrong with this code
for (i in (1:nrow(ex)))
{
if (row.names(data)[i]==row.names(ex)[i])
{
data$group[i] <- i
}
else if (row.names(data)==as.character(ex[i,1])) {
data$group[i] <- i
}
else if (row.names(data)==as.character(ex[i,2])) {
data$group[i] <- i
}
else if (row.names(data)==as.character(ex[i,3])) {
data$group[i] <- i
}
else if (row.names(data)==as.character(ex[i,4])) {
data$group[i] <- i
}
}
Here's a tidyverse solution that will help you:
library(tidyverse)
# update ex dataset
ex_upd = ex %>%
rownames_to_column("Group") %>% # add row names as a column
gather(x, row_id, -Group) %>% # reshape dataset
select(-x) # remove column x
# update data and join ex_upd
data %>%
rownames_to_column("row_id") %>% # add row names as a column
mutate(row_id = as.numeric(row_id)) %>% # update to numeric variable
left_join(ex_upd, by="row_id") %>% # join ex updated dataset
column_to_rownames("row_id") # create row names from that column
# X1 X2 Group
# 1 A A <NA>
# 2 B B <NA>
# 3 C C <NA>
# 4 D D 2
# 5 E E 1
# 6 F F <NA>
# 7 G G <NA>
# 8 H H 2
# 9 I I 2
# 10 J J 3
# 11 K K 3
# 12 L L 2
# 13 M M <NA>
# 14 N N 3
# 15 O O 1
# 16 P P <NA>
# 17 Q Q 1
# 18 R R <NA>
# 19 S S 3
# 20 T T 1
Note: it's still not clear to me why in your expected output the first two rows have a Group value and the third one doesn't.
If you update your ex dataset like this:
ex_upd = ex %>%
rownames_to_column("Group") %>%
mutate(id = as.numeric(Group)) %>% # (new code added to previous one)
gather(x, row_id, -Group) %>%
select(-x)
You'll get a Group added to your first 3 rows.
A base R solution could be:
ex$X5 <- as.numeric(rownames(ex))
ex$Group <- ex$X5
data$Group <- numeric(nrow(data))
for(i in 1:nrow(ex)) {
select_rows <- unlist(ex[i, 1:5])
data$Group[select_rows] <- ex$Group[i]
}
data
# X1 X2 Group
# 1 A A 1
# 2 B B 2
# 3 C C 3
# 4 D D 2
# 5 E E 1
# 6 F F 0
# 7 G G 0
# 8 H H 2
# 9 I I 2
# 10 J J 3
# 11 K K 3
# 12 L L 2
# 13 M M 0
# 14 N N 3
# 15 O O 1
# 16 P P 0
# 17 Q Q 1
# 18 R R 0
# 19 S S 3
# 20 T T 1
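The loop can also be replaced by a single vectorised assignment. The sketch below is one possible variant, not the answerer's code: it rebuilds "ex" and "data" as in the question, fills unmatched rows with NA instead of 0, and leaves out the extra step that assigns rows 1 to 3 their own number. The `idx` name is introduced here for illustration.

```r
ex <- data.frame(matrix(c(5, 12, 14, 20,
                          4, 19, 17, 9,
                          11, 15, 8, 10), ncol = 4))
data <- data.frame(X1 = LETTERS[1:20], X2 = LETTERS[1:20])

# Every cell of `idx` is a row number of `data`; row(idx) gives, for each
# cell, the row of `ex` it sits in, which is the group to assign
idx <- as.matrix(ex)
data$Group <- NA_integer_
data$Group[as.vector(idx)] <- as.vector(row(idx))
```

Both `as.vector()` calls walk the matrix column by column, so each target row of "data" stays aligned with its group number.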
library(data.table)
DT1 <- data.table(id = 1:6, junk = c("T", "U", "V", "X", "Y", "Z"),
type = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(id = 4:6, junk = c("X", "Y", "Z"),
type = c("B", "A", "C"))
That is,
> DT1
id junk type
1: 1 T A
2: 2 U B
3: 3 V B
4: 4 X B
5: 5 Y A
6: 6 Z C
> DT2
id junk type
1: 4 X B
2: 5 Y A
3: 6 Z C
I would like to add a column frequency to DT2 which gives the number of occurrences of any given type in DT1. In other words, the result should look like this:
> DT2
id junk type frequency
1: 4 X B 3
2: 5 Y A 2
3: 6 Z C 1
(This seems somewhat related to Check frequency of data.table value in other data.table, but in that case, this could be accomplished by joining in the other direction. In this case, the resulting data table should be based on DT2.)
DT1[,frequency:=.N,by=type]
setkeyv(DT1, colnames(DT1)[-4])
DT1[DT2]
# id junk type frequency
#1: 4 X B 3
#2: 5 Y A 2
#3: 6 Z C 1
Suppose instead that your DT1 is
DT1 <- data.table(id = 1:5, junk = c("T", "U", "V", "X", "Y"),
type = c("A", "B", "B", "B", "A"))
The above code then gives
DT1[DT2]
# id junk type frequency
#1: 4 X B 3
#2: 5 Y A 2
#3: 6 Z C NA
Just try:
help<-DT1[,list(frequency=.N),by=type]
setkey(help, type)
setkey(DT2, type)
DT2[help]
# type id junk frequency
#1: A 5 Y 2
#2: B 4 X 3
#3: C 6 Z 1
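Since the result should be based on DT2, another option is an update join, which adds the column to DT2 by reference and keeps its row and column order. This is a sketch rather than one of the answers above; the `counts` name is introduced here for illustration.

```r
library(data.table)

DT1 <- data.table(id = 1:6, junk = c("T", "U", "V", "X", "Y", "Z"),
                  type = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(id = 4:6, junk = c("X", "Y", "Z"),
                  type = c("B", "A", "C"))

# one row per type, with the number of occurrences in column N
counts <- DT1[, .N, by = type]

# update join: for the DT2 rows matching `counts` on type, copy N into a
# new column `frequency`; i.N refers to the column N of `counts`
DT2[counts, on = "type", frequency := i.N]
```

Types that occur in DT2 but not in DT1 would get NA in `frequency`, and neither table needs a key.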