Merging two data frames fails to populate columns when combined - r

I am new to R. I have two data frames (shown below) and I would like to add the information from df2 to df1. The only column the two data frames have in common is "Sample",
so I tried to use this column to merge them.
df1
structure(list(Segment = c(3L, 3L, 3L, 4L, 5L, 6L, 6L, 6L, 7L,
7L), Position = c(838L, 891L, 1204L, 732L, 1550L, 688L, 1167L,
1446L, 950L, 981L), `AA-REF` = structure(c(2L, 5L, 7L, 6L, 1L,
8L, 8L, 1L, 3L, 4L), .Label = c("", "D", "E", "H", "K", "L",
"Q", "T"), class = "factor"), `AA-ALT` = structure(c(4L, 2L,
2L, 3L, NA, 5L, 3L, NA, 1L, 4L), .Label = c("E", "K", "M", "N",
"T"), class = "factor"), SYN = structure(c(2L, 3L, 2L, 2L, 1L,
3L, 2L, 1L, 3L, 2L), .Label = c(" ", "N ", "Y "), class = "factor"),
Sample = c("AO103", "AO103", "AO103", "AO103", "AO103", "AO103",
"AO103", "AO103", "AO103", "AO103")), row.names = c(NA, 10L
), class = "data.frame")
Segment Position AA-REF AA-ALT SYN Sample
1 3 838 D N N AO103
2 3 891 K K Y AO103
3 3 1204 Q K N AO103
4 4 732 L M N AO103
5 5 1550 <NA> AO103
6 6 688 T T Y AO103
7 6 1167 T M N AO103
8 6 1446 <NA> AO103
9 7 950 E E Y AO103
10 7 981 H N N AO103
11 8 199 T N N AO103
12 1 341 T K N AO104
13 1 934 T A N AO104
14 1 1327 L F N AO104
15 1 1349 D G N AO104
df2
structure(list(Sample = c("AO208 ", "AO209 ", "AO210 ", "AO211 ",
"AO212 ", "AO213 ", "AO100 ", "AO101 ", "AO102 ", "AO103 "),
Quail = c(7, 8, 9, 10, 11, 12, 7, 8, 9, 10), day = c(3, 3,
3, 3, 3, 3, 5, 5, 5, 5), Expo = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = " DC ", class = "factor"),
Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = " var", class = "factor")), row.names = c(NA,
10L), class = "data.frame")
Sample Quail day Expo Group
1 AO208 7 3 DC var
2 AO209 8 3 DC var
3 AO210 9 3 DC var
4 AO211 10 3 DC var
5 AO212 11 3 DC var
6 AO213 12 3 DC var
7 AO100 7 5 DC var
8 AO101 8 5 DC var
9 AO102 9 5 DC var
10 AO103 10 5 DC var
11 AO104 11 5 DC var
NOTE: Not all entries in df2$Sample are present in df1$Sample
I would like to get something like the following:
Segment Position AA-REF AA-ALT SYN Sample Quail day Expo Group
1 3 838 D N N AO103 10 5 DC var
2 3 891 K K Y AO103 10 5 DC var
3 3 1204 Q K N AO103 10 5 DC var
4 4 732 L M N AO103 10 5 DC var
5 5 1550 <NA> AO103 10 5 DC var
6 6 688 T T Y AO103 10 5 DC var
7 6 1167 T M N AO103 10 5 DC var
8 6 1446 <NA> AO103 10 5 DC var
9 7 950 E E Y AO103 10 5 DC var
10 7 981 H N N AO103 10 5 DC var
11 8 199 T N N AO103 10 5 DC var
12 1 341 T K N AO104 11 5 DC var
13 1 934 T A N AO104 11 5 DC var
14 1 1327 L F N AO104 11 5 DC var
15 1 1349 D G N AO104 11 5 DC var
I tried:
x <- merge(df1, df2, by = "Sample", all = TRUE)
Although this adds the columns, everything from df2 is placed at the end of df1.
I also tried using dplyr's left_join (among others) as:
x <- df1 %>%
left_join(df2, by = "Sample")
This adds empty columns from df2 and no information at all.
I have been looking at many merging posts but none of those seem to address my problem.
I also tried match without success.

x <- merge(x=df1, y=df2, by = "Sample", all.x = TRUE)
You only want to keep all of the rows from df1 (a left join), so you only need all.x = TRUE rather than all = TRUE.
Shout out to Tanner33 if you want to use dplyr or tidyverse packages.
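One thing worth checking: the dput output for df2 in the question shows Sample values such as "AO103 " with a trailing space, while df1 has "AO103". If the real data looks the same, the keys never compare equal, which would explain the empty columns. Below is a minimal sketch with small made-up stand-ins for the two data frames (the values are illustrative, not the asker's full data), trimming the key with trimws() before merging.

```r
# Toy stand-ins for the question's data; df2$Sample carries a trailing space
df1 <- data.frame(Segment = c(3L, 4L), Sample = c("AO103", "AO103"))
df2 <- data.frame(Sample = c("AO102 ", "AO103 "), Quail = c(9, 10), day = c(5, 5))

# Strip leading/trailing whitespace so the keys compare equal
df2$Sample <- trimws(df2$Sample)

# Keep every row of df1 and pull in the matching df2 columns
x <- merge(df1, df2, by = "Sample", all.x = TRUE)
x
```

If the columns from df2 are still NA after trimming, comparing unique(df1$Sample) with unique(df2$Sample) is a quick way to spot other mismatches.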

Related

New data frame, if specific value(s) is contained AND other values aren't included in a range of columns in r

So, I have a large data frame with monthly observations of n individuals.
ind y_0101 y_0102 y_0103 y_0104_ .... y_0311 y_0312
A 33 6 1 2 1 5
B 36 5 0 2 1 5
C 22 4 1 NA 1 5
D 2 2 0 2 1 5
E 5 2 1 2 1 6
F 7 1 0 2 1 5
G 8 6 1 2 1 5
H 2 8 0 2 2 5
I 1 3 1 2 1 5
J 3 2 0 2 1 5
I want to create a new data frame that includes only the individuals who meet some specific conditions.
E.g. if, for individual i, the range of columns y_0101:y_0312 does NOT include the values 3, 6 or NA, AND includes the value 2 or 1, THEN individual i should be included in the new data frame. This would produce the following data frame:
ind y_0101 y_0102 y_0103 y_0104_ .... y_0311 y_0312
B 36 5 0 2 1 5
D 2 2 0 2 1 5
F 7 1 0 2 1 5
H 2 8 0 2 2 5
I tried different ways, but I can't figure out how to get multiple conditions included.
df <- df %>% filter(vars(starts_with("y_"))!=3 | !=6 | != NA)
or
df <- df %>% filter_at(vars(starts_with("y_")), all_vars(!=3 | !=6 | != NA)
I've tried some other things as well, like !%in%, but that doesn't seem to work. Any ideas?
I think you're almost there, but might need a slight shift in the logic:
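library(dplyr)
df <- data.frame(A1 = 1:10,
                 A2 = 10:1,
                 A3 = 1:10,
                 B1 = 1:10)
df %>%
  # keep rows where none of the A columns is 3, 6 or NA
  filter_at(vars(starts_with("A")), ~!(.x %in% c(3, 6, NA))) %>%
  # then keep rows where at least one A column is 1 or 2
  filter(if_any(starts_with("A"), ~ .x %in% c(1, 2)))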
In the first step, I filter out all rows where any of the columns are 3, 6, or NA. In the second step, I filter down to only rows where at least one of the columns is 1 or 2. Does this help with your case?
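The same two-step logic can probably be written with if_all() and if_any() inside a single filter() call. This is only a sketch, assuming a dplyr version that provides both helpers; the small data frame is a cut-down, illustrative version of the question's data using the "y_" prefix.

```r
library(dplyr)

# Cut-down illustration of the question's layout (values are examples only)
df <- data.frame(ind    = c("A", "B", "C", "D"),
                 y_0101 = c(33L, 36L, 22L, 2L),
                 y_0102 = c(6L, 5L, 4L, 2L),
                 y_0104 = c(2L, 2L, NA, 2L))

result <- df %>%
  filter(
    # every y_ column must be free of 3, 6 and NA
    if_all(starts_with("y_"), ~ !(.x %in% c(3, 6, NA))),
    # and at least one y_ column must be 1 or 2
    if_any(starts_with("y_"), ~ .x %in% c(1, 2))
  )
result
```

Using %in% rather than != matters here because comparisons with NA return NA, whereas %in% returns TRUE or FALSE.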
Here is a base R option using rowSums :
cols <- grep('y_', names(df))
include <- c(1, 2)
not_include <- c(3, 6, NA)
result <- subset(df, rowSums(sapply(df[cols], `%in%`, include)) > 0 &
rowSums(sapply(df[cols], `%in%`, not_include)) == 0)
result
# ind y_0101 y_0102 y_0103 y_0104 y_0311 y_0312
#2 B 36 5 0 2 1 5
#4 D 2 2 0 2 1 5
#6 F 7 1 0 2 1 5
#8 H 2 8 0 2 2 5
data
df <- structure(list(ind = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J"), y_0101 = c(33L, 36L, 22L, 2L, 5L, 7L, 8L, 2L, 1L,
3L), y_0102 = c(6L, 5L, 4L, 2L, 2L, 1L, 6L, 8L, 3L, 2L), y_0103 = c(1L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L), y_0104 = c(2L, 2L, NA, 2L,
2L, 2L, 2L, 2L, 2L, 2L), y_0311 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L), y_0312 = c(5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L
)), class = "data.frame", row.names = c(NA, -10L))

Two different id values for the same individuals in different datasets

I have two vectors of id values associated with two different datasets. The two vectors correspond to the same individuals, but the id vectors are unrelated (and there are multiple observations for each individual in each dataset). My goal is to merge them by id, but because the ids are different and they are different lengths there is no way to do that without matching on id. There's obviously a lot more data than what I included in the example.
a <- c(4033,4833,681,9567,6175,7112,3889,264,3918,7685)
b <- c(1,4,7,10,14,18,22,26,27,37)
So 4033 = 1; 4833 = 4...etc.
dummy dataset1:
id day y
1 1 10
1 2 4
1 3 2
4 1 9
4 2 10
4 3 6
dummy dataset2:
id day y1
4033 1 100
4033 1 120
4033 2 150
4033 3 200
4833 1 120
4833 2 100
4833 2 50
4833 3 100
4833 3 200
What I would like is an easy way to get:
dummy dataset1 output:
id day y id.2
1 1 10 4033
1 2 4 4033
1 3 2 4033
4 1 9 4833
4 2 10 4833
4 3 6 4833
I'm trying a solution in a forloop like:
for (i in length(dataset)) {
dataset$id[dataset[[1]] %in% int] <- int1
}
But that's not working correctly (probably for an obvious reason I'm missing).
As we have two vectors, we can easily create a match with a named vector in base R
df1$id.2 <- setNames(a, b)[as.character(df1$id)]
df1
# id day y id.2
#1 1 1 10 4033
#2 1 2 4 4033
#3 1 3 2 4033
#4 4 1 9 4833
#5 4 2 10 4833
#6 4 3 6 4833
Or another base R option is match
df1$id.2 <- a[match(df1$id, b)]
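As a small illustration of how the match() lookup behaves, the sketch below uses shortened versions of a and b plus one hypothetical id (99) that has no counterpart. Ids without a match come back as NA, which can be a handy check before merging.

```r
a <- c(4033, 4833, 681)
b <- c(1, 4, 7)

# 99 is a made-up id that does not appear in b
ids <- c(1, 4, 4, 99)

# match() gives the position of each id in b; indexing a by it translates the id
res <- a[match(ids, b)]
res
```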
data
df1 <- structure(list(id = c(1L, 1L, 1L, 4L, 4L, 4L), day = c(1L, 2L,
3L, 1L, 2L, 3L), y = c(10L, 4L, 2L, 9L, 10L, 6L)),
class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(id = c(4033L, 4033L, 4033L, 4033L, 4833L, 4833L,
4833L, 4833L, 4833L), day = c(1L, 1L, 2L, 3L, 1L, 2L, 2L, 3L,
3L), y1 = c(100L, 120L, 150L, 200L, 120L, 100L, 50L, 100L, 200L
)), class = "data.frame", row.names = c(NA, -9L))
Another approach is to make a data.frame of the IDs and use merge.
datasetID <- data.frame(id = b, id.2 = a)
merge(dataset1,datasetID)
id day y id.2
1 1 1 10 4033
2 1 2 4 4033
3 1 3 2 4033
4 4 1 9 4833
5 4 2 10 4833
6 4 3 6 4833
Data
a <- c(4033,4833,681,9567,6175,7112,3889,264,3918,7685)
b <- c(1,4,7,10,14,18,22,26,27,37)
dataset1 <- structure(list(id = c(1L, 1L, 1L, 4L, 4L, 4L), day = c(1L, 2L,
3L, 1L, 2L, 3L), y = c(10L, 4L, 2L, 9L, 10L, 6L)), class = "data.frame", row.names = c(NA,
-6L))

Match and replace in R

I would like to match row names from table 1 with column names from table 2 and then replace them with corresponding names from column n in table 1.
table1
x y n
CAAGCCAAGCTAGATA 5 6 um
AATCCCAAGTGACACC 4 1 cs
AATCTCAAGTCACACC 4 1 cs
table2
CAAGCCAAGCTAGATA AATCCCAAGTGACACC AATCTCAAGTCACACC
a 1 3 5
b 2 3 4
c 6 3 6
d 8 3 5
result
um cs cs
a 1 3 5
b 2 3 4
c 6 3 6
d 8 3 5
One option is also to pass a named vector to do the matching
names(df2) <- setNames(df1$n, row.names(df1))[colnames(df2)]
df2
# um cs cs
#a 1 3 5
#b 2 3 4
#c 6 3 6
#d 8 3 5
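An equivalent lookup can be sketched with match(), which finds the position of each column name of df2 among the row names of df1. The data frames below are shortened stand-ins for the ones in the question.

```r
df1 <- data.frame(n = c("um", "cs", "cs"),
                  row.names = c("CAAGCCAAGCTAGATA", "AATCCCAAGTGACACC",
                                "AATCTCAAGTCACACC"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(CAAGCCAAGCTAGATA = 1:2,
                  AATCCCAAGTGACACC = 3:4,
                  AATCTCAAGTCACACC = 5:6,
                  row.names = c("a", "b"))

# Position of each df2 column name in df1's row names, then take the label in n
names(df2) <- df1$n[match(colnames(df2), row.names(df1))]
df2
```

Note that the result has duplicated column names ("cs" twice), just like the desired output in the question.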
data
df1 <- structure(list(x = c(5L, 4L, 4L), y = c(6L, 1L, 1L), n = c("um",
"cs", "cs")), class = "data.frame", row.names = c("CAAGCCAAGCTAGATA",
"AATCCCAAGTGACACC", "AATCTCAAGTCACACC"))
df2 <- structure(list(CAAGCCAAGCTAGATA = c(1L, 2L, 6L, 8L), AATCCCAAGTGACACC = c(3L,
3L, 3L, 3L), AATCTCAAGTCACACC = c(5L, 4L, 6L, 5L)),
class = "data.frame", row.names = c("a",
"b", "c", "d"))

Tidy data.frame with repeated column names

I have a program that gives me data in this format
toy
file_path Condition Trial.Num A B C ID A B C ID A B C ID
1 root/some.extension Baseline 1 2 3 5 car 2 1 7 bike 4 9 0 plane
2 root/thing.extension Baseline 2 3 6 45 car 5 4 4 bike 9 5 4 plane
3 root/else.extension Baseline 3 4 4 6 car 7 5 4 bike 68 7 56 plane
4 root/uniquely.extension Treatment 1 5 3 7 car 1 7 37 bike 9 8 7 plane
5 root/defined.extension Treatment 2 6 7 3 car 4 6 8 bike 9 0 8 plane
My goal is to tidy the format into something with unique column names that is at least easier to tidy further with reshape
tidy_toy
file_path Condition Trial.Num A B C ID
1 root/some.extension Baseline 1 2 3 5 car
2 root/thing.extension Baseline 2 3 6 45 car
3 root/else.extension Baseline 3 4 4 6 car
4 root/uniquely.extension Treatment 1 5 3 7 car
5 root/defined.extension Treatment 2 6 7 3 car
6 root/some.extension Baseline 1 2 1 7 bike
7 root/thing.extension Baseline 2 5 4 4 bike
8 root/else.extension Baseline 3 7 5 4 bike
9 root/uniquely.extension Treatment 1 1 7 37 bike
10 root/defined.extension Treatment 2 4 6 8 bike
11 root/some.extension Baseline 1 4 9 0 plane
12 root/thing.extension Baseline 2 9 5 4 plane
13 root/else.extension Baseline 3 68 7 56 plane
14 root/uniquely.extension Treatment 1 9 8 7 plane
15 root/defined.extension Treatment 2 9 0 8 plane
If I try to melt from toy it doesn't work because only the first ID column will get used for id.vars (hence everything will get tagged as cars). Identical variables will get dropped.
Here's the dput of both tables
structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension",
"root/else.extension", "root/some.extension", "root/thing.extension",
"root/uniquely.extension"), class = "factor"), Condition = structure(c(1L,
1L, 1L, 2L, 2L), .Label = c("Baseline", "Treatment"), class = "factor"),
Trial.Num = c(1L, 2L, 3L, 1L, 2L), A = 2:6, B = c(3L, 6L,
4L, 3L, 7L), C = c(5L, 45L, 6L, 7L, 3L), ID = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "car", class = "factor"), A = c(2L,
5L, 7L, 1L, 4L), B = c(1L, 4L, 5L, 7L, 6L), C = c(7L, 4L,
4L, 37L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "bike", class = "factor"),
A = c(4L, 9L, 68L, 9L, 9L), B = c(9L, 5L, 7L, 8L, 0L), C = c(0L,
4L, 56L, 7L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "plane", class = "factor")), .Names = c("file_path",
"Condition", "Trial.Num", "A", "B", "C", "ID", "A", "B", "C",
"ID", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA,
-5L))
structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L, 3L,
4L, 2L, 5L, 1L, 3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension",
"root/else.extension", "root/some.extension", "root/thing.extension",
"root/uniquely.extension"), class = "factor"), Condition = structure(c(1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L), .Label = c("Baseline",
"Treatment"), class = "factor"), Trial.Num = c(1L, 2L, 3L, 1L,
2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 1L, 2L), A = c(2L, 3L, 4L,
5L, 6L, 2L, 5L, 7L, 1L, 4L, 4L, 9L, 68L, 9L, 9L), B = c(3L, 6L,
4L, 3L, 7L, 1L, 4L, 5L, 7L, 6L, 9L, 5L, 7L, 8L, 0L), C = c(5L,
45L, 6L, 7L, 3L, 7L, 4L, 4L, 37L, 8L, 0L, 4L, 56L, 7L, 8L), ID = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), .Label = c("bike",
"car", "plane"), class = "factor")), .Names = c("file_path",
"Condition", "Trial.Num", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA,
-15L))
You can use the make.unique function to create unique column names. After that you can use melt from the data.table package, which is able to create multiple value columns based on patterns in the column names:
# make the column names unique
names(toy) <- make.unique(names(toy))
# let the 'Condition' column start with a small letter 'c'
# so it won't be detected by the patterns argument from melt
names(toy)[2] <- tolower(names(toy)[2])
# load the 'data.table' package
library(data.table)
# tidy the data into long format
tidy_toy <- melt(setDT(toy),
measure.vars = patterns('^A','^B','^C','^ID'),
value.name = c('A','B','C','ID'))
which gives:
> tidy_toy
file_path condition Trial.Num variable A B C ID
1: root/some.extension Baseline 1 1 2 3 5 car
2: root/thing.extension Baseline 2 1 3 6 45 car
3: root/else.extension Baseline 3 1 4 4 6 car
4: root/uniquely.extension Treatment 1 1 5 3 7 car
5: root/defined.extension Treatment 2 1 6 7 3 car
6: root/some.extension Baseline 1 2 2 1 7 bike
7: root/thing.extension Baseline 2 2 5 4 4 bike
8: root/else.extension Baseline 3 2 7 5 4 bike
9: root/uniquely.extension Treatment 1 2 1 7 37 bike
10: root/defined.extension Treatment 2 2 4 6 8 bike
11: root/some.extension Baseline 1 3 4 9 0 plane
12: root/thing.extension Baseline 2 3 9 5 4 plane
13: root/else.extension Baseline 3 3 68 7 56 plane
14: root/uniquely.extension Treatment 1 3 9 8 7 plane
15: root/defined.extension Treatment 2 3 9 0 8 plane
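To see why the 'Condition' column is renamed before melting, it can help to look at what make.unique() produces and which names the pattern '^C' would pick up. A small sketch on a shortened version of the column names:

```r
nms <- c("file_path", "Condition", "Trial.Num",
         "A", "B", "C", "ID", "A", "B", "C", "ID")

# Duplicates get a numeric suffix; the first occurrence is left unchanged
unique_nms <- make.unique(nms)
unique_nms

# '^C' matches "Condition" as well as the C columns
c_matches <- grep("^C", unique_nms, value = TRUE)
c_matches
```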
Another option is to use a list of column-indexes for measure.vars:
tidy_toy <- melt(setDT(toy),
measure.vars = list(c(4,8,12), c(5,9,13), c(6,10,14), c(7,11,15)),
value.name = c('A','B','C','ID'))
Making the column-names unique isn't necessary then.
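Rather than typing the indexes by hand, they could also be derived from the duplicated names. A minimal sketch, assuming the column names are still the original (non-unique) ones:

```r
# Original, duplicated column names of 'toy'
nms <- c("file_path", "Condition", "Trial.Num", rep(c("A", "B", "C", "ID"), 3))

# For each repeated name, collect the positions where it occurs
idx <- lapply(c("A", "B", "C", "ID"), function(nm) which(nms == nm))
idx
```

The resulting list has the same shape as the one passed to measure.vars above.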
A more complicated method that creates names that are better distinguishable by the patterns argument:
# select the names that are not unique
tt <- table(names(toy))
idx <- which(names(toy) %in% names(tt)[tt > 1])
nms <- names(toy)[idx]
# make them unique
names(toy)[idx] <- paste(nms,
rep(seq(length(nms) / length(names(tt)[tt > 1])),
each = length(names(tt)[tt > 1])),
sep = '.')
# your columnnames are now unique:
> names(toy)
[1] "file_path" "Condition" "Trial.Num" "A.1" "B.1" "C.1" "ID.1" "A.2"
[9] "B.2" "C.2" "ID.2" "A.3" "B.3" "C.3" "ID.3"
# tidy the data into long format
tidy_toy <- melt(setDT(toy),
measure.vars = patterns('^A\\.\\d','^B\\.\\d','^C\\.\\d','^ID\\.\\d'),
value.name = c('A','B','C','ID'))
which will give the same end-result.
As mentioned in the comments, the janitor package can be helpful for this problem as well. Its clean_names() function works similarly to make.unique.
With tidyverse we can do:
library(tidyverse)
toy %>%
repair_names(sep="_") %>%
pivot_longer(-(1:3),names_to = c(".value","id"), names_sep="_") %>%
select(-id)
#> # A tibble: 15 x 7
#> file_path Condition Trial.Num A B C ID
#> <fct> <fct> <int> <int> <int> <int> <fct>
#> 1 root/some.extension Baseline 1 2 3 5 car
#> 2 root/some.extension Baseline 1 2 1 7 bike
#> 3 root/some.extension Baseline 1 4 9 0 plane
#> 4 root/thing.extension Baseline 2 3 6 45 car
#> 5 root/thing.extension Baseline 2 5 4 4 bike
#> 6 root/thing.extension Baseline 2 9 5 4 plane
#> 7 root/else.extension Baseline 3 4 4 6 car
#> 8 root/else.extension Baseline 3 7 5 4 bike
#> 9 root/else.extension Baseline 3 68 7 56 plane
#> 10 root/uniquely.extension Treatment 1 5 3 7 car
#> 11 root/uniquely.extension Treatment 1 1 7 37 bike
#> 12 root/uniquely.extension Treatment 1 9 8 7 plane
#> 13 root/defined.extension Treatment 2 6 7 3 car
#> 14 root/defined.extension Treatment 2 4 6 8 bike
#> 15 root/defined.extension Treatment 2 9 0 8 plane
#> Warning message:
#> Expected 2 pieces. Missing pieces filled with `NA` in 4 rows [1, 2, 3, 4].

change data frame in R

I have a data frame generated inside a for loop that has this structure:
V1 V2 V3
1 a a 1
2 a b 3
3 a c 2
4 a d 1
5 a e 3
6 b a 3
7 b b 1
8 b c 8
9 b d 1
10 b e 1
11 c a 2
12 c b 8
The data is longer than this, but that's the idea.
I want to transform it to a wide table (V1 by V2).
V3 is a value based on (V1, V2).
I want to rearrange the data to look like this (the first column holds the unique values of V1, the first row holds the unique values of V2, and the cells between them come from V3):
a b c d e
a 1 3 2 1 3
b 3 1 8 1 1
c 2 8 2 8 2
d 1 1 5 7 2
e 3 5 9 5 3
Thanks in advance.
Reproducible example of yours:
df <- structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), V2 = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L), .Label = c("a", "b", "c", "d", "e"), class = "factor"), V3 = c(1L, 3L, 2L, 1L, 3L, 3L, 1L, 8L, 1L, 1L, 2L, 8L)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
And compute a basic crosstable based on your variables:
> xtabs(V3~V1+V2, df)
V2
V1 a b c d e
a 1 3 2 1 3
b 3 1 8 1 1
c 2 8 0 0 0
I hope you meant this :)
If df is your data-frame, assuming a unique V3 is mapped to each V1,V2 combination, you can do it with
with(df, tapply(V3, list(V1,V2), identity))
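The two approaches differ in how they treat V1/V2 combinations that are absent from the data: xtabs sums the values, so a missing combination shows up as 0, while tapply leaves it as NA. A small sketch on made-up data with one missing combination ("b", "b"):

```r
df <- data.frame(V1 = c("a", "a", "b"),
                 V2 = c("a", "b", "a"),
                 V3 = c(1L, 3L, 3L))

# Cross-tabulation: missing cells are filled with 0
wide_sum <- xtabs(V3 ~ V1 + V2, df)

# tapply with identity: missing cells stay NA
wide_na <- with(df, tapply(V3, list(V1, V2), identity))

wide_sum
wide_na
```

Which behaviour is preferable depends on whether a 0 is a meaningful value for V3 in your data.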
Another method, perhaps slightly more baroque, for widening a data frame from a third column on the basis of the first two. I agree with Chase that the OP has not given an unambiguous problem description:
df2 <- expand.grid(A=LETTERS[1:5], B=LETTERS[1:5])
df2$N <- 1:25
mtx <- outer(X=LETTERS[1:5],Y=LETTERS[1:5], FUN=function(x,y){
df2[intersect(which(df2$A==x), which(df2$B==y)), "N"] })
colnames(mtx)<-LETTERS[1:5]; rownames(mtx)<-LETTERS[1:5]
mtx
A B C D E
A 1 6 11 16 21
B 2 7 12 17 22
C 3 8 13 18 23
D 4 9 14 19 24
E 5 10 15 20 25
I'm sure there are many other strategies using reshape in base or dcast in reshape2.
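For instance, the base reshape route might look roughly like the sketch below. It uses a small made-up data frame in the same V1/V2/V3 layout, not the asker's full data.

```r
df <- data.frame(V1 = c("a", "a", "b", "b"),
                 V2 = c("a", "b", "a", "b"),
                 V3 = c(1L, 3L, 3L, 1L))

# One row per V1, one V3.<level> column per value of V2
wide <- reshape(df, idvar = "V1", timevar = "V2", direction = "wide")
wide
```

The value columns come out named V3.a, V3.b and so on, so they may need renaming if plain a, b, ... headers are wanted.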

Resources