Combining values from two different data frames - r

I have two data frames like this:
df.1 <- data.frame(
var.1 = sample(1:10),
code = sample(c("A", "B", "C"), 10, replace = TRUE))
df.2 <- data.frame(
var.2 = sample(1:3),
row.names=c("A","B","C"))
What I need to do is add a third column, df.1$var.2, which for each value in df.1$code takes the value from df.2$var.2 according to the row names of df.2.
I got to this point, but with no success. Suggestions?
for (i in 1:length(df.1$code)){
if(df.1$code[i] == rownames(df.2))
df.1$var.2[i] <- df.2$var.2
}

You mean like this:
df.2$code <- rownames(df.2)
> merge(df.1,df.2,by = "code")
code var.1 var.2
1 A 5 1
2 B 3 2
3 B 2 2
4 B 7 2
5 B 10 2
6 C 8 3
7 C 4 3
8 C 1 3
9 C 9 3
10 C 6 3

Or use join() from the plyr package, which preserves the row order of df.1:
df.2$code <- rownames(df.2)
library(plyr)
join(df.1, df.2, by = "code")
var.1 code var.2
1 7 B 2
2 2 A 1
3 3 C 3
4 6 B 2
5 10 C 3
6 4 C 3
7 1 C 3
8 8 B 2
9 9 A 1
10 5 C 3
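For the exact task in the question (adding one column by row name), a plain lookup with match() may also be enough. This is only a sketch: the set.seed() call is an assumption added so the example is reproducible, and the data frames are rebuilt as in the question.

```r
# Rebuild the example data (set.seed is an assumption, added for reproducibility)
set.seed(1)
df.1 <- data.frame(
  var.1 = sample(1:10),
  code = sample(c("A", "B", "C"), 10, replace = TRUE))
df.2 <- data.frame(
  var.2 = sample(1:3),
  row.names = c("A", "B", "C"))

# match() gives, for each code in df.1, the position of that code
# among the row names of df.2; indexing var.2 by it does the lookup
df.1$var.2 <- df.2$var.2[match(df.1$code, rownames(df.2))]
df.1
```

This keeps the row order of df.1 and needs no extra column in df.2. Codes that have no matching row name would get NA.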

Related

How can I stack my dataset so each observation relates to all other observations but itself?

I would like to stack my dataset so all observations relate to all other observations but itself.
Suppose I have the following dataset:
df <- data.frame(id = c("a", "b", "c", "d" ),
x1 = c(1,2,3,4))
df
id x1
1 a 1
2 b 2
3 c 3
4 d 4
I would like observation a to be related to b, c, and d, and the same for every other observation. The result should look something like this:
id x1 id2 x2
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 a 1
5 b 2 c 3
6 b 2 d 4
7 c 3 a 1
8 c 3 b 2
9 c 3 d 4
10 d 4 a 1
11 d 4 b 2
12 d 4 c 3
So observation a is related to b,c,d. Observation b is related to a, c,d. And so on. Any ideas?
Another option:
library(dplyr)
left_join(df, df, by = character()) %>%
filter(id.x != id.y)
Or
output <- merge(df, df, by = NULL)
output = output[output$id.x != output$id.y,]
Thanks @ritchie-sacramento, I didn't know the by = NULL option for merge before, and thanks @zephryl for the by = character() option for dplyr joins.
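As a side note, in dplyr 1.1.0 and later by = character() is deprecated in favour of cross_join(). A minimal sketch of the same idea, assuming that version or newer is installed:

```r
library(dplyr)

df <- data.frame(id = c("a", "b", "c", "d"),
                 x1 = c(1, 2, 3, 4))

# cross_join() pairs every row of df with every row of df;
# the duplicated column names get the default .x / .y suffixes
pairs <- cross_join(df, df) %>%
  filter(id.x != id.y)   # drop the rows pairing an observation with itself
pairs
```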
tidyr::expand_grid() accepts data frames, which can then be filtered to remove rows that share the id:
library(tidyr)
library(dplyr)
expand_grid(df, df, .name_repair = make.unique) %>%
filter(id != id.1)
# A tibble: 12 × 4
id x1 id.1 x1.1
<chr> <dbl> <chr> <dbl>
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 a 1
5 b 2 c 3
6 b 2 d 4
7 c 3 a 1
8 c 3 b 2
9 c 3 d 4
10 d 4 a 1
11 d 4 b 2
12 d 4 c 3
You can use combn() to get all combinations of row indices, then assemble your dataframe from those:
rws <- cbind(combn(nrow(df), 2), combn(nrow(df), 2, rev))
df2 <- cbind(df[rws[1, ], ], df[rws[2, ], ])
# clean up row and column names
rownames(df2) <- 1:nrow(df2)
colnames(df2) <- c("id", "x1", "id2", "x2")
df2
id x1 id2 x2
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 c 3
5 b 2 d 4
6 c 3 d 4
7 b 2 a 1
8 c 3 a 1
9 d 4 a 1
10 c 3 b 2
11 d 4 b 2
12 d 4 c 3

Transpose and Merge columns in R [duplicate]

Quite new to R and I have a dataset in this format:
A B C
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
But I want it in this format:
A 1
A 2
A 3
A 4
A 5
B 1
B 2
B 3
...etc.
Seems like such a simple issue but I need HELP! Thanks
df <- data.frame(
A = 1:5,
B = 1:5,
C = 1:5
)
stack(df)
values ind
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 1 B
7 2 B
8 3 B
9 4 B
10 5 B
11 1 C
12 2 C
13 3 C
14 4 C
15 5 C
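stack() puts the values first and the column label second. If the label should come first, as in the question, the columns can simply be reordered. A small sketch (the name long is just an illustrative choice):

```r
df <- data.frame(
  A = 1:5,
  B = 1:5,
  C = 1:5
)

# stack() returns columns "values" and "ind"; select them in the other order
long <- stack(df)[, c("ind", "values")]
long
```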
Example using tidyr's gather() function (loaded here through tidyverse; gather() has since been superseded by pivot_longer()):
library(tidyverse)
A <- c(1,2,3,4,5)
B <- c(1,2,3,4,5)
C <- c(1,2,3,4,5)
df <- data.frame(A,B,C)
df %>% gather(key = "key", value = "value")
key value
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 B 1
7 B 2
8 B 3
9 B 4
10 B 5
11 C 1
12 C 2
13 C 3
14 C 4
15 C 5
You can use the package tidyr. This lets you choose which columns you want to gather into the column "variable".
# if not installed yet
install.packages("tidyr")
library(tidyr)
data <- data.frame(
A = 1:5,
B = 1:5,
C = 1:5
)
data %>% pivot_longer(c(A, B, C), names_to = "variable", values_to = "value")
# Result
variable value
<chr> <int>
1 A 1
2 B 1
3 C 1
4 A 2
5 B 2
6 C 2
7 A 3
8 B 3
9 C 3
10 A 4
11 B 4
12 C 4
13 A 5
14 B 5
15 C 5
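Note that pivot_longer() returns the rows interleaved (A, B, C, A, B, C, ...). To get all A rows first, as asked in the question, the result can be sorted afterwards. A sketch using base order(), where the names long and long_sorted are only illustrative:

```r
library(tidyr)

data <- data.frame(
  A = 1:5,
  B = 1:5,
  C = 1:5
)

long <- pivot_longer(data, c(A, B, C),
                     names_to = "variable", values_to = "value")

# order() leaves ties in their original order, so values stay 1..5 within each letter
long_sorted <- long[order(long$variable), ]
long_sorted
```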

r - remove first row of condition per subject in dataframe

I have a long format dataframe with multiple subjects and multiple conditions for each subject.
I want to remove the first row of each condition (except the first one) for all subjects.
My dataframe looks like this:
> df <- data.frame(subj = c(rep(1,4),rep(2,4), rep(3,4)), cond = (rep(c("A", "A", "B", "B"),times=3)), value = round(runif(12, min = 0, max = 10)))
> df
subj cond value
1 A 1
1 A 5
1 B 3
1 B 10
2 A 6
2 A 5
2 B 2
2 B 0
3 A 5
3 A 8
3 B 5
3 B 2
I have found the duplicated() function but it only removes the first row of each condition for the first subject:
df <- df[duplicated(df$cond),]
subj cond value
1 A 5
1 B 10
2 A 6
2 A 5
2 B 2
2 B 0
3 A 5
3 A 8
3 B 5
3 B 2
Is there a way to "reset" the finding of a duplicate whenever a new subject begins?
And how can I stop it from excluding the first row of the first condition?
Thank you all so much!
You could subset with the duplicated interaction of the two variables:
> df
subj cond value
1 1 A 5
2 1 A 7
3 1 B 4
4 1 B 8
5 2 A 5
6 2 A 2
7 2 B 8
8 2 B 5
9 3 A 8
10 3 A 1
11 3 B 1
12 3 B 5
df1 <- df[!duplicated(interaction(df$subj, df$cond)),]
> df1
subj cond value
1 1 A 5
3 1 B 4
5 2 A 5
7 2 B 8
9 3 A 8
11 3 B 1
Edit:
I've read your question again and it seems you want to remove the first row, not the last. In this case, use
df1 <- df[!duplicated(interaction(df$subj, df$cond), fromLast = TRUE),]
> df1
subj cond value
2 1 A 7
4 1 B 8
6 2 A 2
8 2 B 5
10 3 A 1
12 3 B 5
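Alternative (this depends on the actual df: it assumes the rows of each condition are contiguous, and it also drops the very first row of the data frame):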
df <- data.frame(subj = c(rep(1,4),rep(2,4), rep(3,4)),
cond = (rep(c("A", "A", "B", "B"),times=3)),
value = round(runif(12, min = 0, max = 10)))
df
dummy <- as.character(df$cond) # factor to character
mask <- c(FALSE, dummy[-1] == dummy[-length(dummy)])
df[mask,]
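If the first condition of each subject should be left untouched, as the question asks, the two ideas can be combined. This is a sketch under two assumptions: the "first condition" is whatever condition appears in a subject's first row, and the value column is replaced by 1:12 so the result is reproducible.

```r
# Deterministic stand-in for the example data (value = 1:12 is an assumption)
df <- data.frame(subj = rep(1:3, each = 4),
                 cond = rep(c("A", "A", "B", "B"), times = 3),
                 value = 1:12)

# TRUE for the first row of every subject/condition combination
first_of_group <- !duplicated(interaction(df$subj, df$cond))

# Condition found in each subject's first row
# (match() returns the index of the first occurrence of each subject)
first_cond <- as.character(df$cond)[match(df$subj, df$subj)]

# Drop first rows only when they do not belong to the subject's first condition
drop <- first_of_group & as.character(df$cond) != first_cond
result <- df[!drop, ]
result
```

With this data, rows 3, 7 and 11 (the first B row of each subject) are removed and all A rows are kept.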

Remove semi duplicate rows in R

I have the following data.frame.
a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)
a b c d
1 A NA 1 1
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
6 B NA 1 6
7 C 2 2 7
8 C NA 2 8
9 D NA 2 9
I want to remove duplicate rows (based on columns a and c) so that the rows with a value in column b are kept. In this example, rows 1, 6, and 8 are removed.
One way to do this is to order by 'a', then 'b', then the logical vector is.na(b), so that the 'NA' elements come last within each group of 'a'. Then apply duplicated() to columns 'a' and 'c' and keep only the non-duplicated rows:
df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
# a b c d
#2 A 1 1 2
#3 A 2 2 3
#5 B 1 1 5
#4 B 4 4 4
#7 C 2 2 7
#9 D NA 2 9
setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
First create two datasets, one with the rows whose value in column a is duplicated and one with the rows whose value in column a is unique, using the code below:
x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]
Now use na.omit() on x to delete the rows with NA, then rbind the result with x1 to get the final data set.
rbind(na.omit(x),x1)
Answer:
a b c d
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
7 C 2 2 7
9 D NA 2 9
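You can use dplyr to do this. Note that distinct() keeps the first row it encounters for each combination of a and c, so in the output below the NA row for A (c = 1) is kept instead of the row with a value in b.
library(dplyr)
df %>% distinct(a, c, .keep_all = TRUE)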
Output
a b c d
1 A NA 1 1
2 A 2 2 3
3 B 4 4 4
4 B 1 1 5
5 C 2 2 7
6 D NA 2 9
There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr
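To make distinct() prefer the rows that have a value in b, one option is to sort the NA rows to the end first. A sketch, not necessarily the only way, with the data frame rebuilt from the question:

```r
library(dplyr)

df <- data.frame(a = c(rep("A", 3), rep("B", 3), rep("C", 2), "D"),
                 b = c(NA, 1, 2, 4, 1, NA, 2, NA, NA),
                 c = c(1, 1, 2, 4, 1, 1, 2, 2, 2),
                 d = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

result <- df %>%
  arrange(is.na(b)) %>%                 # rows with a value in b come first
  distinct(a, c, .keep_all = TRUE)      # first row per (a, c) is kept
result
```

Here the rows with d equal to 1, 6 and 8 are dropped, which matches the rows the question wants removed. The D row is kept even though b is NA, because it is the only row for that combination.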

Select rows in a dataframe in r based on values in one row

I have a toy data-frame.
a = rep(1:5, each=3)
b = rep(c("a","b","c"), each = 5)
df = data.frame(a,b)
a b
1 1 a
2 1 a
3 1 a
4 2 a
5 2 a
6 2 b
7 3 b
8 3 b
9 3 b
10 4 b
11 4 c
12 4 c
13 5 c
14 5 c
15 5 c
I also have an index.
idx = c(2,3,5)
I want to select all the rows where the a is either 2, 3, or 5 as specified by the idx.
I've tried the following, but neither works.
df[df$a==idx, ]
subset(df, df$a==idx)
This shouldn't be too hard.
Use the %in% operator:
df[df$a %in% idx,]
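The reason == does not work is that R recycles idx along df$a and compares element by element, so a row only matches when its value happens to line up with the recycled position. A short sketch of the difference, using the data from the question (the names recycled and selected are only illustrative):

```r
a <- rep(1:5, each = 3)
b <- rep(c("a", "b", "c"), each = 5)
df <- data.frame(a, b)
idx <- c(2, 3, 5)

# idx is recycled to 2,3,5,2,3,5,... and compared position by position
recycled <- df$a == idx
sum(recycled)    # 3: only rows 4, 8 and 15 happen to line up

# %in% tests each value of df$a for membership in idx
selected <- df[df$a %in% idx, ]
nrow(selected)   # 9: all rows where a is 2, 3 or 5
```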
