Merging two dataframes by keeping certain column values in R

I have two dataframes that I need to merge. The second one has certain columns missing, and it also has some additional ids. Here is what the sample datasets look like.
df1 <- data.frame(id = c(1,2,3,4,5,6),
item = c(11,22,33,44,55,66),
score = c(1,0,1,1,1,0),
cat.a = c("A","B","C","D","E","F"),
cat.b = c("a","a","b","b","c","f"))
> df1
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
df2 <- data.frame(id = c(1,2,3,4,5,6,7,8),
item = c(11,22,33,44,55,66,77,88),
score = c(1,0,1,1,1,0,1,1),
cat.a = c(NA,NA,NA,NA,NA,NA,NA,NA),
cat.b = c(NA,NA,NA,NA,NA,NA,NA,NA))
> df2
id item score cat.a cat.b
1 1 11 1 NA NA
2 2 22 0 NA NA
3 3 33 1 NA NA
4 4 44 1 NA NA
5 5 55 1 NA NA
6 6 66 0 NA NA
7 7 77 1 NA NA
8 8 88 1 NA NA
The two datasets share the first 6 rows, and the second dataset has two more rows. When I merge, I need to keep the cat.a and cat.b information from the first dataframe. I also want to keep id=7 and id=8, with cat.a and cat.b missing for those rows.
Here is my desired output.
> df3
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
7 7 77 1 <NA> <NA>
8 8 88 1 <NA> <NA>
Any ideas?
Thanks!

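We may use rows_update from dplyr, which overwrites values in df2 with those from df1 for the rows whose key columns match: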
library(dplyr)
rows_update(df2, df1, by = c("id", "item", "score"))
Output:
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
7 7 77 1 <NA> <NA>
8 8 88 1 <NA> <NA>
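Depending on the installed dplyr version, rows_update may refuse to write df1's character values into the all-NA (logical) cat.a and cat.b columns of df2. A minimal base R sketch that avoids this, assuming the key columns id, item and score agree between the two data frames as they do in the sample, drops the empty columns from df2 and left-joins df1 onto what remains:

```r
df1 <- data.frame(id = c(1,2,3,4,5,6),
                  item = c(11,22,33,44,55,66),
                  score = c(1,0,1,1,1,0),
                  cat.a = c("A","B","C","D","E","F"),
                  cat.b = c("a","a","b","b","c","f"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c(1,2,3,4,5,6,7,8),
                  item = c(11,22,33,44,55,66,77,88),
                  score = c(1,0,1,1,1,0,1,1),
                  cat.a = NA,
                  cat.b = NA)

# Columns that df2 actually has data for; they double as the join keys
keys <- c("id", "item", "score")

# all.x = TRUE keeps ids 7 and 8, which have no match in df1
df3 <- merge(df2[, keys], df1, by = keys, all.x = TRUE)
df3 <- df3[order(df3$id), ]
df3
```

Rows that exist only in df2 come back with NA in cat.a and cat.b, which is the desired output above.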

Related

How to use join to combine two data frames by two variables and keep the rows that differ in the second variable

I want to combine two data frames by both the ID and date variables, keeping all IDs and all dates from both data frames.
Example:
data A:
ID date V1
1 1 a
1 4 b
2 9 d
3 10 e
data B:
ID date X
1 1 24
1 2 30
1 4 15
2 2 40
2 5 10
2 7 12
results:
ID date X V1
1 1 24 a
1 2 30 NA
1 4 15 b
2 2 40 NA
2 5 10 NA
2 7 12 NA
2 9 NA d
3 10 NA e
You could use a full join, which keeps the rows from both data frames (here df1 is data A and df2 is data B):
library(dplyr)
df1 %>%
full_join(df2, by = c("ID", "date")) %>%
arrange(ID, date)
ID date V1 X
1 1 1 a 24
2 1 2 <NA> 30
3 1 4 b 15
4 2 2 <NA> 40
5 2 5 <NA> 10
6 2 7 <NA> 12
7 2 9 d NA
8 3 10 e NA
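For reference, a base R sketch of the same full join, assuming the two inputs are built as below (the names A and B are only illustrative), uses merge with all = TRUE:

```r
A <- data.frame(ID = c(1, 1, 2, 3),
                date = c(1, 4, 9, 10),
                V1 = c("a", "b", "d", "e"),
                stringsAsFactors = FALSE)
B <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                date = c(1, 2, 4, 2, 5, 7),
                X = c(24, 30, 15, 40, 10, 12))

# all = TRUE keeps unmatched rows from both sides (a full outer join)
res <- merge(B, A, by = c("ID", "date"), all = TRUE)

# Sort explicitly so the row order does not depend on merge's defaults
res <- res[order(res$ID, res$date), ]
rownames(res) <- NULL
res
```

Putting B first gives the column order ID, date, X, V1 used in the expected result.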

Count row number first and then insert new row by condition

I need to count the number of rows in each group after a group_by and then add new row(s) so that every group has 6 rows whenever its row count is < 6.
My df has three variables (v1, v2, v3), where v1 = group name and v2 = row number (i.e., 1,2,3,4,5,6). In the new row(s), I want to repeat the v1 value, have v2 continue the row count, and set v3 = NA.
sample df
v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32
expected output
v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
2 5 NA #insert
2 6 NA #insert
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32
3 6 NA #insert
I tried to count the rows first with dplyr, but I don't know whether, or how, I can add this if/else condition using the pipe. Or is there an easier function?
My code
df %>%
group_by(v1) %>%
dplyr::summarise(N=n()) %>%
if (N < 6) {
# sth like that?
}
Thanks!
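We can use complete from tidyr, which adds a row for every combination of v1 and v2 found in the data (that is enough here because group 1 already contains all six v2 values; df1 is the sample data):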
library(tidyverse)
complete(df1, v1, v2)
# A tibble: 18 x 3
# v1 v2 v3
# <int> <int> <int>
# 1 1 1 79
# 2 1 2 32
# 3 1 3 53
# 4 1 4 33
# 5 1 5 76
# 6 1 6 11
# 7 2 1 32
# 8 2 2 42
# 9 2 3 44
#10 2 4 12
#11 2 5 NA
#12 2 6 NA
#13 3 1 22
#14 3 2 12
#15 3 3 12
#16 3 4 67
#17 3 5 32
#18 3 6 NA
Here is a way to do it using merge.
df <- read.table(text =
"v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32", header = TRUE)
toMerge <- data.frame(v1 = rep(1:3, each = 6), v2 = rep(1:6, times = 3))
m <- merge(toMerge, df, by = c("v1", "v2"), all.x = TRUE)
m
v1 v2 v3
1 1 1 79
2 1 2 32
3 1 3 53
4 1 4 33
5 1 5 76
6 1 6 11
7 2 1 32
8 2 2 42
9 2 3 44
10 2 4 12
11 2 5 NA
12 2 6 NA
13 3 1 22
14 3 2 12
15 3 3 12
16 3 4 67
17 3 5 32
18 3 6 NA
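The merge approach can be sketched without hard-coding the group ids by building the grid with expand.grid. The variable target (the number of rows wanted per group) is a name introduced here for illustration:

```r
df <- data.frame(v1 = rep(1:3, times = c(6, 4, 5)),
                 v2 = c(1:6, 1:4, 1:5),
                 v3 = c(79, 32, 53, 33, 76, 11,
                        32, 42, 44, 12,
                        22, 12, 12, 67, 32))

target <- 6  # rows wanted per group (illustrative name)

# Every (group, row number) pair that should exist in the result
grid <- expand.grid(v1 = unique(df$v1), v2 = seq_len(target))

# all.x = TRUE keeps grid rows with no match in df; their v3 becomes NA
m <- merge(grid, df, by = c("v1", "v2"), all.x = TRUE)
m <- m[order(m$v1, m$v2), ]
rownames(m) <- NULL
m
```

Because the grid is derived from unique(df$v1), the same code pads any number of groups.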

Unnest (separate) multiple column values into new rows using sparklyr

I am trying to split comma-separated column values into new rows, keeping the id of each row. I know how to do this in R using dplyr and tidyr, but I am looking to solve the same problem in sparklyr.
id <- c(1,1,1,1,1,2,2,2,3,3,3)
name <- c("A,B,C","B,F","C","D,R,P","E","A,Q,W","B,J","C","D,M","E,X","F,E")
value <- c("1,2,3","2,4,43,2","3,1,2,3","1","1,2","26,6,7","3,3,4","1","1,12","2,3,3","3")
dt <- data.frame(id,name,value)
R solution:
library(dplyr)
library(tidyr)
separate_rows(dt, name, sep=",") %>%
separate_rows(value, sep=",")
Desired output from the Spark dataframe (sparklyr package):
> final_result
id name value
1 1 A 1
2 1 A 2
3 1 A 3
4 1 B 1
5 1 B 2
6 1 B 3
7 1 C 1
8 1 C 2
9 1 C 3
10 1 B 2
11 1 B 4
12 1 B 43
13 1 B 2
14 1 F 2
15 1 F 4
16 1 F 43
17 1 F 2
18 1 C 3
19 1 C 1
20 1 C 2
21 1 C 3
22 1 D 1
23 1 R 1
24 1 P 1
25 1 E 1
26 1 E 2
27 2 A 26
28 2 A 6
29 2 A 7
30 2 Q 26
31 2 Q 6
32 2 Q 7
33 2 W 26
34 2 W 6
35 2 W 7
36 2 B 3
37 2 B 3
38 2 B 4
39 2 J 3
40 2 J 3
41 2 J 4
42 2 C 1
43 3 D 1
44 3 D 12
45 3 M 1
46 3 M 12
47 3 E 2
48 3 E 3
49 3 E 3
50 3 X 2
51 3 X 3
52 3 X 3
53 3 F 3
54 3 E 3
Note:
I have approx. 1000 columns with nested values, so I need a function that can loop over each column.
I know there is an sdf_unnest() function in the sparklyr.nested package, but I am not sure how to split the strings of multiple columns and apply this function. I am quite new to sparklyr.
Any help would be much appreciated.
You have to combine explode and split (sdt is assumed here to be dt copied to Spark):
sdt %>%
mutate(name = explode(split(name, ","))) %>%
mutate(value = explode(split(value, ",")))
# Source: lazy query [?? x 3]
# Database: spark_connection
id name value
<dbl> <chr> <chr>
1 1.00 A 1
2 1.00 A 2
3 1.00 A 3
4 1.00 B 1
5 1.00 B 2
6 1.00 B 3
7 1.00 C 1
8 1.00 C 2
9 1.00 C 3
10 1.00 B 2
# ... with more rows
Please note that lateral views have to be expressed as separate subqueries, so this:
sdt %>%
mutate(
name = explode(split(name, ",")),
value = explode(split(value, ",")))
won't work
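To check the Spark result against a small local sample, the same row-wise cross product of split names and split values can be sketched in plain base R. This is not sparklyr code; expand_row is a hypothetical helper introduced only for this comparison:

```r
id <- c(1,1,1,1,1,2,2,2,3,3,3)
name <- c("A,B,C","B,F","C","D,R,P","E","A,Q,W","B,J","C","D,M","E,X","F,E")
value <- c("1,2,3","2,4,43,2","3,1,2,3","1","1,2","26,6,7","3,3,4","1","1,12","2,3,3","3")
dt <- data.frame(id, name, value, stringsAsFactors = FALSE)

# Hypothetical helper: expand one input row into all name/value pairs
expand_row <- function(id, name, value) {
  n <- strsplit(name, ",")[[1]]
  v <- strsplit(value, ",")[[1]]
  data.frame(id = id,
             name = rep(n, each = length(v)),    # name varies slowest
             value = rep(v, times = length(n)),  # value varies fastest
             stringsAsFactors = FALSE)
}

res <- do.call(rbind, Map(expand_row, dt$id, dt$name, dt$value))
rownames(res) <- NULL
nrow(res)
```

The row count and ordering can then be compared with the collected Spark output.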

Distinct in R within groups of data

How do I transform a dataframe (on the left) to dataframe (on the right)?
I am trying to do this with dplyr, by grouping by name and calling distinct, but it gives only 3 rows:
df %>%
group_by(name) %>%
distinct(.,keep.all = T) %>%
View()
There is a simple way to access all the cells you want to change:
data <- data.frame(name = c(rep("A", 5), rep("B", 5), rep("C", 5)), subject = c(rep(1:5, 3)), marks = sample(1:100, 15))
> data
name subject marks
1 A 1 31
2 A 2 12
3 A 3 29
4 A 4 67
5 A 5 99
6 B 1 77
7 B 2 3
8 B 3 92
9 B 4 69
10 B 5 42
11 C 1 52
12 C 2 66
13 C 3 98
14 C 4 23
15 C 5 72
duplicated(data$name) flags the relevant cells. R has no way to leave a cell "blank", so to speak,
so you can either set them to NA or fill them with an empty string:
data$name[duplicated(data$name)] <- NA
> data
name subject marks
1 A 1 31
2 <NA> 2 12
3 <NA> 3 29
4 <NA> 4 67
5 <NA> 5 99
6 B 1 77
7 <NA> 2 3
8 <NA> 3 92
9 <NA> 4 69
10 <NA> 5 42
11 C 1 52
12 <NA> 2 66
13 <NA> 3 98
14 <NA> 4 23
15 <NA> 5 72
data$name <- as.character(data$name)
data$name[duplicated(data$name)] <- ""
> data
name subject marks
1 A 1 30
2 2 52
3 3 5
4 4 48
5 5 99
6 B 1 14
7 2 20
8 3 34
9 4 55
10 5 53
11 C 1 38
12 2 27
13 3 67
14 4 12
15 5 77
To use the latter solution with a factor variable, you first need to add "" as a factor level:
data$name <- factor(as.numeric(data$name), 1:4, c(levels(data$name), ""))
data$name[duplicated(data$name)] <- ""
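A slightly more general sketch of the factor case, assuming name is stored as a factor, appends the empty level with levels<- so the number of existing levels does not need to be hard-coded:

```r
data <- data.frame(name = factor(rep(c("A", "B", "C"), each = 5)),
                   subject = rep(1:5, 3),
                   marks = c(31, 12, 29, 67, 99, 77, 3, 92, 69, 42,
                             52, 66, 98, 23, 72))

# Register "" as a valid level; assigning an unknown level would give NA
levels(data$name) <- c(levels(data$name), "")

# Blank out every repeat of a name, keeping its first occurrence
data$name[duplicated(data$name)] <- ""
data
```

The column stays a factor, with "" as its fourth level.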

R: Set a value for certain data meeting two conditions

I don't know the easiest way to do this. I have a dataframe called Test, with a column containing some NA values. Now I want to set a value of 1 in all fields meeting both of the following conditions:
the row number is > 60
the specific field is NA
So far I have:
Test$MyColumn[is.na(Test$MyColumn)] <- 1
This handles the NA condition, but I don't know how to add the row-number condition :-/
If both conditions must apply before you change an element of bb to 1, here is an alternative (the example uses rows 4 and up of a 10-row data set in place of rows above 60):
aa <- 1:10
bb <- c(1,NA,6,4,NA,9,1,NA,2,5)
cc <- c(100,102,104,NA,78,54,99,NA,22,0)
dd <- data.frame(aa,bb,cc)
dd
dd$bb[4:nrow(dd)][is.na(dd$bb[4:nrow(dd)])] <- 1
dd
Here is the original data set:
aa bb cc
1 1 1 100
2 2 NA 102
3 3 6 104
4 4 4 NA
5 5 NA 78
6 6 9 54
7 7 1 99
8 8 NA NA
9 9 2 22
10 10 5 0
Here is the modified data set:
aa bb cc
1 1 1 100
2 2 NA 102
3 3 6 104
4 4 4 NA
5 5 1 78
6 6 9 54
7 7 1 99
8 8 1 NA
9 9 2 22
10 10 5 0
This sets every column to 1 in those rows among 4-10 where bb is NA:
aa <- 1:10
bb <- c(1,NA,6,4,NA,9,1,NA,2,5)
cc <- c(100,102,104,NA,78,54,99,NA,22,0)
dd <- data.frame(aa,bb,cc)
dd
dd[4:nrow(dd),1:3][is.na(dd$bb[4:nrow(dd)]),] <- 1
dd
aa bb cc
1 1 1 100
2 2 NA 102
3 3 6 104
4 4 4 NA
5 1 1 1
6 6 9 54
7 7 1 99
8 1 1 1
9 9 2 22
10 10 5 0
This sets every column to 1 in those rows among 4-10 where bb is NA, and then changes all remaining NA in bb to 1:
aa <- 1:10
bb <- c(1,NA,6,4,NA,9,1,NA,2,5)
cc <- c(100,102,104,NA,78,54,99,NA,22,0)
dd <- data.frame(aa,bb,cc)
dd
dd[4:nrow(dd),1:3][is.na(dd$bb[4:nrow(dd)]),] <- 1
dd$bb[is.na(dd$bb)] <- 1
dd
aa bb cc
1 1 1 100
2 2 1 102
3 3 6 104
4 4 4 NA
5 1 1 1
6 6 9 54
7 7 1 99
8 1 1 1
9 9 2 22
10 10 5 0
You can store the row number in a column like this:
Test$RowNumber <- 1:nrow(Test)
And then the condition would be:
Test$MyColumn[is.na(Test$MyColumn) & Test$RowNumber>60] <- 1
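You can also subset the rows directly. Note that this replaces NA in every column, not only in MyColumn, and that rows above 60 start at row 61:
Test[61:nrow(Test),][is.na(Test[61:nrow(Test),])] <- 1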
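The two conditions can also be combined without adding a helper column, by building the row-number test with seq_len. The sketch below uses a made-up 100-row Test data frame, since the real data is not shown:

```r
# Hypothetical data: NA in every even row of MyColumn
Test <- data.frame(MyColumn = rep(c(5, NA), times = 50))

# TRUE only where the value is NA and the row number is above 60
idx <- is.na(Test$MyColumn) & seq_len(nrow(Test)) > 60

Test$MyColumn[idx] <- 1

# NA values in rows 1-60 are left untouched
sum(is.na(Test$MyColumn))
```

Both tests yield logical vectors of length nrow(Test), so they combine elementwise with &.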
