Here are two example data frames, "ex" and "data".
I would like to make the new data set shown on the right side.
I want to match rows in two ways: where the row name of "data" equals a row name of "ex",
and where the row name of "data" equals one of the values stored in "ex".
It is hard to explain in words,
so I attached a picture of what I want.
My code is below. Unfortunately, it does not produce the new data set.
How should I revise my code?
Thanks in advance.
ex <- data.frame(matrix(c(5, 12, 14, 20,
4, 19, 17, 9,
11, 15, 8, 10), ncol=4))
data <- data.frame(matrix(c("A","B","C","D","E","F","G",
"H", "I", "J", "K", "L", "M", "N",
"O", "P", "Q", "R", "S", "T",
"A","B","C","D","E","F","G",
"H", "I", "J", "K", "L", "M", "N",
"O", "P", "Q", "R", "S", "T"), ncol=2))
## something is wrong with this code
for (i in (1:nrow(ex)))
{
if (row.names(data)[i]==row.names(ex)[i])
{
data$group[i] <- i
}
else if (row.names(data)==as.character(ex[i,1])) {
data$group[i] <- i
}
else if (row.names(data)==as.character(ex[i,2])) {
data$group[i] <- i
}
else if (row.names(data)==as.character(ex[i,3])) {
data$group[i] <- i
}
else if (row.names(data)==as.character(ex[i,4])) {
data$group[i] <- i
}
}
Here's a tidyverse solution that will help you:
library(tidyverse)
# update ex dataset
ex_upd = ex %>%
rownames_to_column("Group") %>% # add row names as a column
gather(x, row_id, -Group) %>% # reshape dataset
select(-x) # remove column x
# update data and join ex_upd
data %>%
rownames_to_column("row_id") %>% # add row names as a column
mutate(row_id = as.numeric(row_id)) %>% # update to numeric variable
left_join(ex_upd, by="row_id") %>% # join ex updated dataset
column_to_rownames("row_id") # create row names from that column
# X1 X2 Group
# 1 A A <NA>
# 2 B B <NA>
# 3 C C <NA>
# 4 D D 2
# 5 E E 1
# 6 F F <NA>
# 7 G G <NA>
# 8 H H 2
# 9 I I 2
# 10 J J 3
# 11 K K 3
# 12 L L 2
# 13 M M <NA>
# 14 N N 3
# 15 O O 1
# 16 P P <NA>
# 17 Q Q 1
# 18 R R <NA>
# 19 S S 3
# 20 T T 1
Note: it's still not clear to me why in your expected output the first two rows have a Group value and the third one doesn't.
If you update your ex dataset like this:
ex_upd = ex %>%
rownames_to_column("Group") %>%
mutate(id = as.numeric(Group)) %>% # (new code added to previous one)
gather(x, row_id, -Group) %>%
select(-x)
You'll get a Group added to your first 3 rows.
A base R solution could be:
ex$X5 <- as.numeric(rownames(ex))
ex$Group <- ex$X5
data$Group <- numeric(nrow(data))
for(i in 1:nrow(ex)) {
select_rows <- unlist(ex[i, 1:5])
data$Group[select_rows] <- ex$Group[i]
}
data
# X1 X2 Group
# 1 A A 1
# 2 B B 2
# 3 C C 3
# 4 D D 2
# 5 E E 1
# 6 F F 0
# 7 G G 0
# 8 H H 2
# 9 I I 2
# 10 J J 3
# 11 K K 3
# 12 L L 2
# 13 M M 0
# 14 N N 3
# 15 O O 1
# 16 P P 0
# 17 Q Q 1
# 18 R R 0
# 19 S S 3
# 20 T T 1
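The loop above can also be written without iteration. This is only a sketch, assuming every value in ex is a valid row position in data and that data keeps its default row names 1..n; it reproduces the NA-style output of the tidyverse answer rather than the zeros of the loop.

```r
# Rebuild the example data from the question
ex <- data.frame(matrix(c(5, 12, 14, 20,
                          4, 19, 17, 9,
                          11, 15, 8, 10), ncol = 4))
data <- data.frame(matrix(rep(LETTERS[1:20], 2), ncol = 2))

m <- as.matrix(ex)          # 3 x 4 numeric matrix of row positions in data
data$Group <- NA_integer_   # rows that ex never mentions stay NA

# as.vector(m) and as.vector(row(m)) are both in column-major order,
# so each position is paired with the ex row it came from
data$Group[as.vector(m)] <- as.vector(row(m))

data[c(4, 5, 8, 20), ]
```

To also label the first rows with their own row number, as in the loop version, add data$Group[seq_len(nrow(ex))] <- seq_len(nrow(ex)) afterwards. If the same position appeared in two rows of ex, the later assignment would win.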
How do I rearrange the rows in tibble?
I wish to reorder rows such that: row with x = "c" goes to the bottom of the tibble, everything else remains same.
library(dplyr)
tbl <- tibble(x = c("a", "b", "c", "d", "e", "f", "g", "h"),
y = 1:8)
An alternative to dplyr::arrange(), using base R:
tbl[order(tbl$x == "c"), ] # Thanks to Merijn van Tilborg
Output:
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
The dplyr equivalent gives the same result:
tbl |> dplyr::arrange(x == "c")
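Both versions rely on sorting by a logical key. A small sketch of why that works, using a plain character vector instead of the tibble (the names is_c and idx are just for illustration):

```r
x <- c("a", "b", "c", "d", "e", "f", "g", "h")

is_c <- x == "c"     # FALSE everywhere except position 3
idx <- order(is_c)   # FALSE sorts before TRUE; ties keep their original order

idx      # 1 2 4 5 6 7 8 3
x[idx]   # "a" "b" "d" "e" "f" "g" "h" "c"
```

Sorting by x != "c" instead would move the matching row to the top rather than the bottom.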
Using forcats, convert x to a factor with "c" as the last level inside arrange(). This doesn't change the class of column x in the result.
library(forcats)
tbl %>%
arrange(fct_relevel(x, "c", after = Inf))
# # A tibble: 8 x 2
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
If the order of x is important, it is better to keep it as a factor. The code below changes the class from character to factor, with c as the last level:
tbl %>%
mutate(x = fct_relevel(x, "c", after = Inf)) %>%
arrange(x)
I have this dataframe
a <- c("a", "f", "n", "c", "d")
b <- c("L", "S", "N", "R", "S")
df <- data.frame(a,b)
a b
1 a L
2 f S
3 n N
4 c R
5 d S
I want the rows ordered by column b, with the rows where b is "S" placed first and the remaining rows sorted alphabetically:
a b
2 f S
5 d S
1 a L
3 n N
4 c R
You can replace the S with a space while ordering, so that it sorts ahead of the letters.
df[order(sub("S", " ", df$b)), ]
#df[order(chartr("S", " ", df$b)), ] #Alternative
# a b
#2 f S
#5 d S
#1 a L
#3 n N
#4 c R
Here is one option using factor.
df[order(factor(df$b, unique(c('S', sort(df$b))))), ]
# a b
#2 f S
#5 d S
#1 a L
#3 n N
#4 c R
Using dplyr
library(dplyr)
df %>%
arrange(b != 'S', b)
a b
1 f S
2 d S
3 a L
4 n N
5 c R
Or in base R
df[order(df$b != "S", df$b),]
a b
2 f S
5 d S
1 a L
3 n N
4 c R
I have a grouped data set that looks like this:
data = data.frame(group = c(1,1,1,1,2,2,2,2),
c1 = c("A", "E", "A", "J", "L", "M", "L", "J"),
c2 = c("B", "F", "F", "K", "B", "F", "T", "E"),
c3 = c("C", "G", "C", "L", "C", "X", "C", "V"),
c4 = c("D", "H", "I", "M", "D", "T", "I", "W"))
And I need to calculate the number of values in each row that are not duplicated within each group. For example, something that looks like this:
group c1 c2 c3 c4 uniq.vals
1 1 A B C D 2
2 1 E F G H 3
3 1 A F C I 1
4 1 J K L M 4
5 2 L B C D 2
6 2 M F X T 3
7 2 L T C I 1
8 2 J E V W 4
The count for row 1 would be 2, because B and D do not show up in any of the other rows within group 1.
I am familiar with using group_by and summarize but I am having trouble extending that to this particular situation, which requires that each value be checked across multiple columns and rows. For example, n_distinct on its own would not work because I'm looking for non-duplicated values, not unique values.
Ideally the solution would also ignore NAs and not count them as duplicated or non-duplicated values.
Here is an option with tidyverse. Reshape to 'long' format with pivot_longer and group by 'group'. Replace every duplicated 'value' with NA, then group by row number and summarise with n_distinct (the number of distinct elements, ignoring NA) to get the counts. Finally, bind the counts to the original data.
library(dplyr)
library(tidyr)
data %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = starts_with('c')) %>%
group_by(group) %>%
mutate(value = replace(value, duplicated(value)|duplicated(value,
fromLast = TRUE), NA)) %>%
group_by(rn) %>%
summarise(uniq.vals = n_distinct(value, na.rm = TRUE), .groups = 'drop') %>%
select(uniq.vals) %>%
bind_cols(data, .)
Output:
# group c1 c2 c3 c4 uniq.vals
#1 1 A B C D 2
#2 1 E F G H 3
#3 1 A F C I 1
#4 1 J K L M 4
#5 2 L B C D 2
#6 2 M F X T 3
#7 2 L T C I 1
#8 2 J E V W 4
In base R you would do:
a <- tapply(unlist(data[-1]), data$group[row(data[-1])],table)
data$uniq.vals <- c(by(data, seq(nrow(data)),
function(x)sum(a[[x[,1]]][unlist(x[-1])]<2)))
group c1 c2 c3 c4 uniq.vals
1 1 A B C D 2
2 1 E F G H 3
3 1 A F C I 1
4 1 J K L M 4
5 2 L B C D 2
6 2 M F X T 3
7 2 L T C I 1
8 2 J E V W 4
Note that in your case, row 3 should have 1, since I is the only non-duplicated value.
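Another base R way to express the same idea, as a sketch: flag every value that occurs more than once within a group with duplicated(), then count the unflagged, non-NA values per row. The helper name count_non_dup is made up for this example.

```r
data <- data.frame(group = c(1, 1, 1, 1, 2, 2, 2, 2),
                   c1 = c("A", "E", "A", "J", "L", "M", "L", "J"),
                   c2 = c("B", "F", "F", "K", "B", "F", "T", "E"),
                   c3 = c("C", "G", "C", "L", "C", "X", "C", "V"),
                   c4 = c("D", "H", "I", "M", "D", "T", "I", "W"))

# Hypothetical helper: for one group's value columns, count per row the
# values that occur exactly once in the whole group (NAs never count)
count_non_dup <- function(block) {
  m <- as.matrix(block)
  v <- as.vector(m)                                      # column-major order
  dup <- duplicated(v) | duplicated(v, fromLast = TRUE)  # TRUE for every repeated value
  keep <- !dup & !is.na(v)
  unname(rowSums(matrix(keep, nrow = nrow(m))))          # fold back into rows
}

per_group <- lapply(split(data[-1], data$group), count_non_dup)
data$uniq.vals <- unsplit(per_group, data$group)         # restore original row order
data
```

Because unsplit() puts the counts back by group membership, the rows do not need to be sorted by group. A value repeated inside a single row is treated as duplicated, the same as in the tidyverse answer.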
This is a small example:
a <- c("a", "b", "f", "c", "e")
b <- c("a", "c", "e", "d", "b")
p <- matrix(1:25, nrow = 5, dimnames = list(a, b))
p <- as.data.frame(p)
# the data frame looks like this:
a c e d b
a 1 6 11 16 21
b 2 7 12 17 22
f 3 8 13 18 23
c 4 9 14 19 24
e 5 10 15 20 25
The output I want:
score
a 1
b 22
c 9
e 15
This is the code I wrote:
L <- rownames(p)
output <- NULL
t <- 1
for (i in L) {
tar_column <- p[i]
score <- tar_column[t, ]
tar_score <- matrix(score, nrow = 1, dimnames = list(i, "score"))
output <- rbind(output, tar_score)
t <- t+1
}
The output I got:
score
a 1
b 22
Error in `[.data.frame`(p, i) : undefined columns selected
The problem is that the column names and row names do not match exactly. I think an if statement could skip a row name when it has no matching column name. Could someone help me fix this problem?
Just loop through each column/rowname (using sapply) and use square bracket notation to subset p on both that row and column:
sapply(c('a','b','c','e'), function(x) p[x,x])
a b c e
1 22 9 15
If you don't want to specify the variable names beforehand, you can just use either colnames or rownames:
sapply(colnames(p), function(x) p[x,x])
a c e d b
1 9 15 NA 22
If there isn't a matching rowname, this will return NA for that value. If desired, you can drop the NA values by subsetting the result:
result <- sapply(colnames(p), function(x) p[x,x])
result[!is.na(result)]
a c e b
1 9 15 22
Here is another option:
library(tidyverse)
p %>%
rownames_to_column("row") %>%
gather(col, score, -row) %>%
filter(row == col) %>%
select(-row)
#> col score
#> 1 a 1
#> 2 c 9
#> 3 e 15
#> 4 b 22
First we turn the row names into a variable, then we gather from wide to long format, and lastly we keep only the rows where row and col match.
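A vectorised base R variant, as a sketch: take the names that appear in both the row names and the column names, then pull those cells out in one step with a two-column character index. It assumes the data frame can be converted to a matrix without losing information, which holds here because every column is numeric.

```r
a <- c("a", "b", "f", "c", "e")
b <- c("a", "c", "e", "d", "b")
p <- as.data.frame(matrix(1:25, nrow = 5, dimnames = list(a, b)))

common <- intersect(rownames(p), colnames(p))   # names present on both axes
m <- as.matrix(p)

# A two-column character matrix indexes cells by (row name, column name)
score <- m[cbind(common, common)]

data.frame(score = score, row.names = common)
```

Since intersect() keeps the order of its first argument, the result follows the row order of p (a, b, c, e), and names without a partner such as f and d are dropped, so no NA filtering is needed.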
I have two datasets. The first one is smaller but has more precise data.
I need to join them according to these rules:
1. If a record exists in Data1, I use only the Data1 values.
2. If a record is missing from Data1 but exists in Data2, I use the values from Data2.
Data1 <- data.frame(
X = c(1,4,7,10,13,16),
Y = c("a", "b", "c", "d", "e", "f")
)
Data2 <- data.frame(
X = c(1:10),
Y = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
)
So my data.frame should look like this:
DataJoin <- data.frame(
X = c(1,4,7,10,13,16,7,8,9,10),
Y = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
)
How can I do that?
I've tried merge from the base package and from the data.table package, but I couldn't get the result I want.
There's no join needed. You can reformulate the problem as "add the data found in Data2 and not found in Data1 to Data1". So simply do:
id <- Data2$Y %in% Data1$Y
DataJoin <- rbind(Data1,Data2[!id,])
Gives:
> DataJoin
X Y
1 1 a
2 4 b
3 7 c
4 10 d
5 13 e
6 16 f
7 7 g
8 8 h
9 9 i
10 10 j
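The same idea can be wrapped in a small function so the key column is not hard-coded. This is a sketch; the name prefer_first and its arguments are invented for illustration, and it assumes both data frames have the same columns and that Y is the matching key, as in the answer above.

```r
Data1 <- data.frame(
  X = c(1, 4, 7, 10, 13, 16),
  Y = c("a", "b", "c", "d", "e", "f")
)
Data2 <- data.frame(
  X = c(1:10),
  Y = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
)

# Hypothetical helper: keep every row of `primary`, then append the rows of
# `secondary` whose key value does not occur in `primary`
prefer_first <- function(primary, secondary, key) {
  only_in_secondary <- !(secondary[[key]] %in% primary[[key]])
  rbind(primary, secondary[only_in_secondary, , drop = FALSE])
}

res <- prefer_first(Data1, Data2, key = "Y")
res
```

Swapping the first two arguments would give Data2 priority instead.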
Using data.table:
library(data.table)
d1 <- data.table(Data1, key="Y")[, X := as.integer(X)]
d2 <- data.table(Data2, key="Y")
# copy d2 so that the original is not modified by reference
# in the join d2[d1, ...], i.X refers to column X of d1 (the table passed as 'i')
ans <- copy(d2)[d1, X := i.X]
ans
X Y
1: 1 a
2: 4 b
3: 7 c
4: 10 d
5: 13 e
6: 16 f
7: 7 g
8: 8 h
9: 9 i
10: 10 j
Using merge from base R, then filling the gaps with the values from Data2:
DataJoin <- merge(Data1, Data2, by="Y", all=TRUE)
DataJoin$X.x[is.na(DataJoin$X.x)] <- DataJoin$X.y[is.na(DataJoin$X.x)]
DataJoin[,1:2]
# Y X.x
# 1 a 1
# 2 b 4
# 3 c 7
# 4 d 10
# 5 e 13
# 6 f 16
# 7 g 7
# 8 h 8
# 9 i 9
# 10 j 10