Mutate dataframes by matching ids in R

I have three data frames:
df1:
id score1
1 50
2 23
3 40
4 68
5 82
6 38
df2:
id score2
1 33
2 23
4 64
5 12
6 32
df3:
id score3
1 50
2 23
3 40
4 68
5 82
I want to combine the three scores into one data frame like this, using NA to denote missing values:
id score1 score2 score3
1 50 33 50
2 23 23 23
3 40 NA 40
4 68 64 68
5 82 12 82
6 38 32 NA
Or like this, deleting the NA values:
id score1 score2 score3
1 50 33 50
2 23 23 23
4 68 64 68
5 82 12 82
However, mutate (in dplyr) does not accept columns of different lengths, so I cannot use it here. How can I do this?

You can try Reduce with merge. By default merge keeps only the ids that appear in all three data frames:
Reduce(function(...) merge(..., by='id'), list(df1, df2, df3))
# id score1 score2 score3
#1 1 50 33 50
#2 2 23 23 23
#3 4 68 64 68
#4 5 82 12 82
If you have many dataset objects whose names follow the pattern 'df' followed by a number:
Reduce(function(...) merge(..., by='id'), mget(paste0('df',1:3)))
Or, instead of paste0('df', 1:3), you can pass ls(pattern='df\\d+') to mget, as suggested in a comment by @DavidArenburg.
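The merge shown above keeps only the ids present in every data frame, which corresponds to the second table in the question. For the first table, with NA for the missing scores, a minimal sketch is to pass all = TRUE to merge (a full outer join). The data frames are rebuilt from the question so the block runs on its own; the names full and complete are only illustrative.

```r
# Data copied from the question
df1 <- data.frame(id = 1:6, score1 = c(50, 23, 40, 68, 82, 38))
df2 <- data.frame(id = c(1L, 2L, 4L, 5L, 6L), score2 = c(33, 23, 64, 12, 32))
df3 <- data.frame(id = 1:5, score3 = c(50, 23, 40, 68, 82))

# all = TRUE keeps every id and fills the gaps with NA
full <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE),
               list(df1, df2, df3))

# Dropping incomplete rows afterwards gives the same rows as the inner merge
complete <- na.omit(full)
```

Here full has one row per id from 1 to 6, and complete keeps only ids 1, 2, 4 and 5.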

R: How to merge a new data frame to several other data frames in a list

I have several separate data frames that I would like to keep separate, because merging them together would create a very large object.
However, there are variables from another data frame that I would like to merge with all of them now.
Here is an example of what I would like to do:
df1 <- data.frame(ID1 = c(1:10), Var1 = rep(c(1,0),5))
df2 <- data.frame(ID1 = c(1:10), Var2 = c(21:30))
dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))
mergewith <- data.frame(ID1 = c(1:10), ID2 = c(41:50))
My goal is that df1 and df2 will look like this:
df1
ID1 Var1 ID2
1 1 1 41
2 2 0 42
3 3 1 43
4 4 0 44
5 5 1 45
6 6 0 46
7 7 1 47
8 8 0 48
9 9 1 49
10 10 0 50
df2
ID1 Var2 ID2
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
What I have tried so far is:
dat = lapply(dfs,function(x){
merge(names(x), mergewith, by = "ID1");x})
list2env(dat,.GlobalEnv)
However, then I get the following message:
"'by' must specify a uniquely valid column"
Is it possible to do this without using a loop?
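The error occurs because merge(names(x), ...) receives the character vector of column names instead of the data frame itself. You can try Map:
Map(function(x, y) merge(x, y, by = "ID1"), dfs, list(mergewith))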
[[1]]
ID1 Var1 ID2
1 1 1 41
2 2 0 42
3 3 1 43
4 4 0 44
5 5 1 45
6 6 0 46
7 7 1 47
8 8 0 48
9 9 1 49
10 10 0 50
[[2]]
ID1 Var2 ID2
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
You can use lapply to merge all the dataframes in dfs with mergewith. Use list2env to get the changed dataframes in the global environment.
list2env(lapply(dfs, function(x) merge(x, mergewith, by = 'ID1')), .GlobalEnv)
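If the data frames do not have to be collected with ls(), one variation is to build the named list by hand and let lapply pass the extra arguments straight to merge. This is only a sketch of the same idea; the name merged is illustrative, and the result is kept in a list rather than written back to the global environment.

```r
# Data copied from the question
df1 <- data.frame(ID1 = 1:10, Var1 = rep(c(1, 0), 5))
df2 <- data.frame(ID1 = 1:10, Var2 = 21:30)
mergewith <- data.frame(ID1 = 1:10, ID2 = 41:50)

# A named list keeps track of which result belongs to which input
dfs <- list(df1 = df1, df2 = df2)

# Each element of dfs becomes the x argument of merge
merged <- lapply(dfs, merge, y = mergewith, by = "ID1")

names(merged$df1)  # column names of the first merged data frame
```

Calling list2env(merged, .GlobalEnv) afterwards would overwrite df1 and df2, as in the answer above.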

Label columns with an ascending number

I want to label columns with an ascending number, because in a bigger dataset I want to be able to sort the columns into the right order.
How do I code this? Thanks!
set.seed(8)
id <- 1:6
diet <- rep(c("A","B"),3)
period <- rep(c(1,2),3)
score1 <- sample(1:100,6)
score2 <- sample(1:100,6)
score3 <- sample(1:100,6)
df <- data.frame(id, diet, period, score1, score2,score3)
df
id diet period score1 score2 score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
It should look like:
x1id x2diet x3period x4score1 x5score2 x6score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
I was thinking something like this, but something is missing....
colnames(wellbeing) <- paste(1:ncol, colnames(wellbeing))
Another option:
colnames(df) <- paste0('x', 1:dim(df)[2], colnames(df))
or
df %>%
  dplyr::rename_all(~ paste0('x', 1:ncol(df), .))
Both methods would yield the same output:
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
You can use:
names(df) <- paste0('x', seq_along(df), names(df))
df
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
Maybe add an underscore?
names(df) <- paste0('x', seq_along(df), "_", names(df))
names(df)
#[1] "x1_id" "x2_diet" "x3_period" "x4_score1" "x5_score2" "x6_score3"
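Here is a mapply approach. The result is assigned back to names(df), and USE.NAMES = FALSE returns a plain character vector:
names(df) <- mapply(paste0, paste0("x", 1:ncol(df)), names(df), USE.NAMES = FALSE)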
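Since the stated goal is to sort columns by these labels, note that plain numbers stop sorting correctly as text once there are ten or more columns ("x10" sorts before "x2"). A minimal sketch that zero-pads the number with formatC, assuming an underscore separator is acceptable; the 12-column data frame and the name width are made up for illustration.

```r
# Hypothetical data frame with 12 columns named V1 ... V12
df <- as.data.frame(matrix(1:24, nrow = 2))

# Number of digits needed for the largest column index
width <- nchar(ncol(df))

# Zero-padded index, e.g. "01", "02", ..., "12"
idx <- formatC(seq_along(df), width = width, flag = "0")

names(df) <- paste0("x", idx, "_", names(df))
names(df)
```

With the padding in place, sorting the names alphabetically keeps the original column order.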

Replace column values based on column in another dataframe

I would like to replace some column values in a data frame based on a column in another data frame.
This is the head of the first df:
df1
A tibble: 253 x 2
id sum_correct
<int> <dbl>
1 866093 77
2 866097 95
3 866101 37
4 866102 65
5 866103 16
6 866104 72
7 866105 99
8 866106 90
9 866108 74
10 866109 92
Some sum_correct values need to be replaced by the correct values from another data frame, using the id to trigger the replacement:
df2
A tibble: 14 x 2
id sum_correct
<int> <dbl>
1 866103 61
2 866124 79
3 866152 85
4 867101 24
5 867140 76
6 867146 51
7 867152 56
8 867200 50
9 867209 97
10 879657 56
11 879680 61
12 879683 58
13 879693 77
14 881451 57
How can I achieve this in RStudio? Thanks in advance for the help.
You can make an update join using match to find where id matches and remove non matches (NA) with which:
idx <- match(df1$id, df2$id)
idxn <- which(!is.na(idx))
df1$sum_correct[idxn] <- df2$sum_correct[idx[idxn]]
df1
id sum_correct
1 866093 77
2 866097 95
3 866101 37
4 866102 65
5 866103 61
6 866104 72
7 866105 99
8 866106 90
9 866108 74
10 866109 92
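To make the indexing concrete, here is the same match-based update on a small input cut down from the question's data (three rows of df1, two of df2). The variable name hit is illustrative; it plays the role of idxn above but as a logical vector.

```r
# Subset of the question's data
df1 <- data.frame(id = c(866101L, 866102L, 866103L),
                  sum_correct = c(37, 65, 16))
df2 <- data.frame(id = c(866103L, 866124L),
                  sum_correct = c(61, 79))

# Position of each df1 id in df2, NA where df2 has no such id
idx <- match(df1$id, df2$id)

# Rows of df1 that have a replacement value
hit <- !is.na(idx)

# Overwrite only those rows; ids that exist only in df2 are ignored
df1$sum_correct[hit] <- df2$sum_correct[idx[hit]]
df1
```

Only the row for id 866103 changes (16 becomes 61); id 866124 is absent from this df1 and is therefore not added.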
You can do a left_join and then use coalesce:
library(dplyr)
left_join(df1, df2, by = "id", suffix = c("_1", "_2")) %>%
  mutate(sum_correct_final = coalesce(sum_correct_2, sum_correct_1))
The new column sum_correct_final contains the value from df2 if it exists and from df1 if a corresponding entry from df2 does not exist.
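For comparison, a rough base R analogue of the left_join plus coalesce idea, assuming the same column names. It swaps in merge and ifelse for the dplyr functions, so it is not the answer's code. Note that merge sorts the rows by id. The toy values are taken from the question; joined and result are illustrative names.

```r
# Subset of the question's data
df1 <- data.frame(id = c(866101L, 866102L, 866103L),
                  sum_correct = c(37, 65, 16))
df2 <- data.frame(id = c(866103L, 866124L),
                  sum_correct = c(61, 79))

# all.x = TRUE keeps every row of df1 (a left join)
joined <- merge(df1, df2, by = "id", all.x = TRUE, suffixes = c("_1", "_2"))

# Take the df2 value where it exists, otherwise fall back to df1
joined$sum_correct_final <- ifelse(is.na(joined$sum_correct_2),
                                   joined$sum_correct_1,
                                   joined$sum_correct_2)

# Keep only the id and the final value
result <- joined[, c("id", "sum_correct_final")]
```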

Subset data frame where values are greater than another data frame

Say I have a data frame with 3 columns of data (a,b,c) and 1 column of categories with multiple instances of each category (class).
set.seed(273)
a <- floor(runif(20,0,100))
b <- floor(runif(20,0,100))
c <- floor(runif(20,0,100))
class <- floor(runif(20,0,6))
df1 <- data.frame(a,b,c,class)
print(df1)
a b c class
1 31 73 28 3
2 44 33 57 3
3 19 35 53 0
4 68 70 39 4
5 92 7 57 2
6 13 67 23 3
7 73 50 14 2
8 59 14 91 5
9 37 3 72 5
10 27 3 13 4
11 63 28 0 5
12 51 7 35 4
13 11 36 76 3
14 72 25 8 5
15 23 24 6 3
16 15 1 16 5
17 55 24 5 5
18 2 54 39 1
19 54 95 20 3
20 60 39 65 1
And I have another data frame with the same 3 columns of data and category column, however this only has one instance per category (class).
a <- floor(runif(6,0,20))
b <- floor(runif(6,0,20))
c <- floor(runif(6,0,20))
class <- seq(0,5)
df2 <- data.frame(a,b,c,class)
print(df2)
a b c class
1 8 15 13 0
2 0 3 6 1
3 14 4 0 2
4 7 10 6 3
5 18 18 16 4
6 17 17 11 5
How do I subset the first data frame so that it keeps only rows where a, b, and c are all greater than the values in the second data frame for the same class? For example, I only want rows where class == 0 if a > 8 & b > 15 & c > 13.
Note that I don't want to join the data frames, as the second data frame holds the lowest acceptable values for the first data frame.
As commented by Frank, this can be done with non-equi joins from the data.table package.
library(data.table)
# coerce both data frames to data.table by reference, then run a
# non-equi join to find which rows of df1 fulfill the conditions in df2
tmp <- setDT(df1)[setDT(df2), on = .(class, a > a, b > b, c > c),
                  nomatch = 0L, which = TRUE]
# return subset in original order of df1
df1[sort(tmp)]
a b c class
1: 31 73 28 3
2: 44 33 57 3
3: 19 35 53 0
4: 68 70 39 4
5: 92 7 57 2
6: 13 67 23 3
7: 73 50 14 2
8: 11 36 76 3
9: 2 54 39 1
10: 54 95 20 3
11: 60 39 65 1
The parameter which = TRUE returns a vector of the matching row numbers instead of the joined data set. This saves us from creating a row id column before the join. (Credit to @Frank for reminding me of the which parameter!)
Note that there is no row in df1 which fulfills the condition for class == 5 in df2. Therefore, the parameter nomatch = 0L is used to exclude non-matching rows from the result.
This can be put together in a "one-liner":
setDT(df1)[sort(df1[setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE])]
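For readers without data.table, the same filter can be sketched in base R by looking up each row's thresholds with match and comparing column by column. This is a different technique from the non-equi join above, shown only as an alternative. The four-row df1 is a subset of the question's data, and the name keep is illustrative.

```r
# Four rows taken from the question's df1
df1 <- data.frame(a = c(31, 19, 15, 92),
                  b = c(73, 35, 1, 7),
                  c = c(28, 53, 16, 57),
                  class = c(3, 0, 5, 2))

# Matching threshold rows from the question's df2
df2 <- data.frame(a = c(8, 14, 7, 17),
                  b = c(15, 4, 10, 17),
                  c = c(13, 0, 6, 11),
                  class = c(0, 2, 3, 5))

# Row of df2 holding the thresholds for each row of df1
i <- match(df1$class, df2$class)

# TRUE where all three values exceed their class thresholds
keep <- df1$a > df2$a[i] & df1$b > df2$b[i] & df1$c > df2$c[i]

# which() drops NA, so classes missing from df2 are excluded
rows <- which(keep)
df1[rows, ]
```

The class 5 row is dropped because 15 is not greater than the threshold 17 for a.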

Merging time series and cross-section time series dataframes

I am having an issue merging 2 dataframes in R - one is a time series cross-section (i.e. a panel) and the other is a simple time series. Suppose I have two dataframes, df1 and df2, that I would like to merge. The panel dataframe df1 is given by
id year var1
1 80 3
1 81 5
1 82 7
1 83 9
2 80 5
2 81 5
2 82 7
2 83 5
3 80 9
3 81 9
3 82 7
3 83 3
while the time series dataframe df2 is given by
year var2
80 10
81 15
82 17
83 19
I would like to merge df1 and df2 into a third dataframe df, while preserving the time series cross-section row ordering of df1. However, when I use the command
df <- merge(df1, df2, by="year")
the new dataframe clusters the observations by year.
year id var1 var2
80 1 3 10
80 2 5 10
80 3 9 10
81 1 5 15
81 2 5 15
81 3 9 15
82 1 7 17
82 2 7 17
82 3 7 17
83 1 9 19
83 2 5 19
83 3 3 19
Does anyone know how I can make the row ordering in df the same as in df1? Thanks in advance!
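One way to keep the row order of df1, sketched here under the assumption that each year appears only once in df2: skip merge and look up var2 by position with match. The data is copied from the question. Sorting the merged result by id and year would be another option.

```r
# Panel data from the question
df1 <- data.frame(id = rep(1:3, each = 4),
                  year = rep(80:83, times = 3),
                  var1 = c(3, 5, 7, 9, 5, 5, 7, 5, 9, 9, 7, 3))

# Time series from the question
df2 <- data.frame(year = 80:83, var2 = c(10, 15, 17, 19))

# Copy df1 and add var2 by looking up each year in df2;
# rows stay exactly where they were in df1
df <- df1
df$var2 <- df2$var2[match(df$year, df2$year)]
df
```

Years missing from df2 would get NA in var2 rather than being dropped, which differs from the default behaviour of merge.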
