Maintaining order in subsetting dataframe in R - r

I have a dataframe df :
V1 V2 V3
1 227 Day1
2 288 Day2
3 243 Day3
4 258 Day4
5 274 Day5
6 245 Day6
7 254 Day7
8 249 Day8
9 230 Day9
10 244 Day10
I want to subset df where V1 contains 5,1,7,3 in order. I used subset(df,V1 %in% c(5,1,7,3)) but what I am getting is:
V1 V2 V3
1 227 Day1
3 243 Day3
5 274 Day5
7 254 Day7
I want to maintain the order of rows in V1 to be 5,1,7,3 and not 1,3,5,7. How can I have the output to be:
V1 V2 V3
5 274 Day5
1 227 Day1
7 254 Day7
3 243 Day3

This is perhaps overcomplicated, but I would approach this task as follows:
An example data frame:
(data2 <- data.frame(V1=1:10, V2=letters[1:10]))
## V1 V2
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
## 7 7 g
## 8 8 h
## 9 9 i
## 10 10 j
Let's find in which row we have a match (NA - no match, i - i-th value was matched):
(m <- match(data2$V1, c(5,1,7,3)))
## [1] 2 NA 4 NA 1 NA 3 NA NA NA
Now we select the matching rows and permute them accordingly:
data2[!is.na(m),][order(na.omit(m)),]
## V1 V2
## 5 5 e
## 1 1 a
## 7 7 g
## 3 3 c
On the other hand, if you know that V1 consists of consecutive natural numbers (starting from 1), so that each value equals its row index, the solution is as easy as 1-2-3:
data2[c(5,1,7,3),]
## V1 V2
## 5 5 e
## 1 1 a
## 7 7 g
## 3 3 c
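A shorter variant of the match idea is to swap the arguments, so that match returns the row positions already in the requested order. This is only a sketch: it assumes the values in V1 are unique (match returns the first hit only), and the name wanted is made up for illustration.

```r
# The data frame from the question
df <- data.frame(V1 = 1:10,
                 V2 = c(227, 288, 243, 258, 274, 245, 254, 249, 230, 244),
                 V3 = paste0("Day", 1:10),
                 stringsAsFactors = FALSE)

wanted <- c(5, 1, 7, 3)          # hypothetical name: values in the desired order

# For each wanted value, the position of its first occurrence in df$V1
idx <- match(wanted, df$V1)

# Drop values that were not found, then index rows in that order
result <- df[idx[!is.na(idx)], ]
result
```

If V1 can contain repeated values, the merge approach is the safer choice, because match would silently keep only the first matching row.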

You can use merge with sort=FALSE; it seems to return the rows in the order in which they match. It will also work if you have repeated values in V1. E.g.:
dat <- data.frame(V1=1:10, V2=letters[1:10])
dat$V1[8] <- 5
dat
# V1 V2
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
#6 6 f
#7 7 g
#8 5 h
#9 9 i
#10 10 j
merge(data.frame(V1=c(5,1,7,3)),dat,by="V1",sort=FALSE)
# V1 V2
#1 5 e
#2 5 h
#3 1 a
#4 7 g
#5 3 c
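The documentation for merge describes the row order with sort=FALSE as unspecified, so if you would rather not rely on it, one option is to carry an explicit rank column through the merge and order by it afterwards. A minimal sketch; the column name ord and the object names wanted and res are made up for illustration.

```r
dat <- data.frame(V1 = 1:10, V2 = letters[1:10], stringsAsFactors = FALSE)
dat$V1[8] <- 5                       # introduce a repeated value, as above

# Lookup table: the wanted values plus their desired position
wanted <- data.frame(V1 = c(5, 1, 7, 3), ord = 1:4)

res <- merge(wanted, dat, by = "V1")            # row order not relied upon
res <- res[order(res$ord), c("V1", "V2")]       # restore the requested order
rownames(res) <- NULL
res
```

Both rows with V1 equal to 5 are kept and come first, followed by 1, 7 and 3.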

Merging two dataframes by keeping certain column values in r

I have two dataframes that I need to merge. The second one has certain columns missing and also contains some additional ids. Here is what the sample datasets look like.
df1 <- data.frame(id = c(1,2,3,4,5,6),
item = c(11,22,33,44,55,66),
score = c(1,0,1,1,1,0),
cat.a = c("A","B","C","D","E","F"),
cat.b = c("a","a","b","b","c","f"))
> df1
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
df2 <- data.frame(id = c(1,2,3,4,5,6,7,8),
item = c(11,22,33,44,55,66,77,88),
score = c(1,0,1,1,1,0,1,1),
cat.a = c(NA,NA,NA,NA,NA,NA,NA,NA),
cat.b = c(NA,NA,NA,NA,NA,NA,NA,NA))
> df2
id item score cat.a cat.b
1 1 11 1 NA NA
2 2 22 0 NA NA
3 3 33 1 NA NA
4 4 44 1 NA NA
5 5 55 1 NA NA
6 6 66 0 NA NA
7 7 77 1 NA NA
8 8 88 1 NA NA
The two datasets share the first 6 rows, and dataset 2 has two more rows. When I merge, I need to keep the cat.a and cat.b information from the first dataframe. I also want to keep id=7 and id=8, with the cat.a and cat.b columns missing.
Here is my desired output.
> df3
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
7 7 77 1 <NA> <NA>
8 8 88 1 <NA> <NA>
Any ideas?
Thanks!
We may use rows_update
library(dplyr)
rows_update(df2, df1, by = c("id", "item", "score"))
Output:
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
7 7 77 1 <NA> <NA>
8 8 88 1 <NA> <NA>
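If you prefer to stay in base R, a left join should give the same result for this data: drop the all-NA columns from df2 and merge the remainder with df1, keeping every row of df2. This is only a sketch under the assumption that id, item and score together identify a row; df3 is the name used in the question for the desired output.

```r
df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                  item = c(11, 22, 33, 44, 55, 66),
                  score = c(1, 0, 1, 1, 1, 0),
                  cat.a = c("A", "B", "C", "D", "E", "F"),
                  cat.b = c("a", "a", "b", "b", "c", "f"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8),
                  item = c(11, 22, 33, 44, 55, 66, 77, 88),
                  score = c(1, 0, 1, 1, 1, 0, 1, 1),
                  cat.a = NA,
                  cat.b = NA)

keys <- c("id", "item", "score")

# all.x = TRUE keeps the rows of df2 that have no partner in df1 (ids 7 and 8);
# their cat.a and cat.b are filled with NA
df3 <- merge(df2[keys], df1, by = keys, all.x = TRUE)
df3
```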

How to fill NA rows by conditions from columns in R

Here is an example:
df<-data.frame(v1=rep(1:2, 4),
v2=rep(c("a", "b"), each=4),
v3=paste0(rep(1:2, each=4), rep(c("m", "n", "o", "p"), each=2)),
v4=c(1,2, NA, NA, 3,4, NA,NA),
v5=c(5,6, NA, NA, 7,8, NA,NA),
v6=c(9,10, NA, NA, 11,12, NA,NA))
df
v1 v2 v3 v4 v5 v6
1 1 a 1m 1 5 9
2 2 a 1m 2 6 10
3 1 a 1n NA NA NA
4 2 a 1n NA NA NA
5 1 b 2o 3 7 11
6 2 b 2o 4 8 12
7 1 b 2p NA NA NA
8 2 b 2p NA NA NA
What I want is this: if v1, v2 and v3 are the same once the last letter of v3 is ignored, fill the NAs from the rows that are not NA. In this case, the NAs in row 3 should be filled from row 1, because both rows give 1a1 when the trailing letter (m or n) is ignored. So the desired output would be:
v1 v2 v3 v4 v5 v6
1 1 a 1m 1 5 9
2 2 a 1m 2 6 10
3 1 a 1n 1 5 9
4 2 a 1n 2 6 10
5 1 b 2o 3 7 11
6 2 b 2o 4 8 12
7 1 b 2p 3 7 11
8 2 b 2p 4 8 12
I don't know but I think this is a simpler way of producing your results
library(tidyverse)
df %>%
group_by(v1,v2) %>%
fill(v4:v6)
Adding the v3 logic
df %>%
mutate(v7 = v3 %>% as.character() %>% parse_number()) %>%
group_by(v1,v2,v7) %>%
fill(v4:v6) %>%
select(-v7)
Here is a solution that recodes v3 into a variable that only takes into account the numeric part.
library(dplyr)
library(stringr)
library(tidyr)  # provides fill()
#Extract numeric part of the string in v3
df$v7<-str_extract(df$v3,"[[:digit:]]+")
df %>%
group_by(v1,v2,v7) %>%
fill(v4:v6)
Here's a solution using data.table and zoo which ignores v3 column's last letter:
library(data.table)
setDT(df)[, match_cols := paste0(v1, v2, substr(v3, 1, nchar(as.character(v3)) - 1))
  ][, id := .GRP, by = match_cols
  ][, v4 := zoo::na.locf(v4, na.rm = FALSE), by = id
  ][, v5 := zoo::na.locf(v5, na.rm = FALSE), by = id
  ][, v6 := zoo::na.locf(v6, na.rm = FALSE), by = id
  ][, c("match_cols", "id") := NULL]
df
# v1 v2 v3 v4 v5 v6
#1: 1 a 1m 1 5 9
#2: 2 a 1m 2 6 10
#3: 1 a 1n 1 5 9
#4: 2 a 1n 2 6 10
#5: 1 b 2o 3 7 11
#6: 2 b 2o 4 8 12
#7: 1 b 2p 3 7 11
#8: 2 b 2p 4 8 12
Using na.locf from zoo
library(zoo)
library(data.table)
setDT(df)[, na.locf(.SD),.(v1, v2)]
# v1 v2 v3 v4 v5 v6
#1: 1 a 1m 1 5 9
#2: 1 a 1n 1 5 9
#3: 2 a 1m 2 6 10
#4: 2 a 1n 2 6 10
#5: 1 b 2o 3 7 11
#6: 1 b 2p 3 7 11
#7: 2 b 2o 4 8 12
#8: 2 b 2p 4 8 12
If we want to add the condition in 'v3'
setDT(df)[, names(df)[4:6] := na.locf(.SD),.(v1, v2, sub("\\D+", "", v3))][]
# v1 v2 v3 v4 v5 v6
#1: 1 a 1m 1 5 9
#2: 2 a 1m 2 6 10
#3: 1 a 1n 1 5 9
#4: 2 a 1n 2 6 10
#5: 1 b 2o 3 7 11
#6: 2 b 2o 4 8 12
#7: 1 b 2p 3 7 11
#8: 2 b 2p 4 8 12
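For completeness, the same grouping idea can be sketched in base R with ave, without any extra packages. Note that this is not last-observation-carried-forward: it fills every NA with the first non-NA value in its group, which gives the same result here only because each group has a single non-NA row. The name key is made up for illustration.

```r
df <- data.frame(v1 = rep(1:2, 4),
                 v2 = rep(c("a", "b"), each = 4),
                 v3 = paste0(rep(1:2, each = 4), rep(c("m", "n", "o", "p"), each = 2)),
                 v4 = c(1, 2, NA, NA, 3, 4, NA, NA),
                 v5 = c(5, 6, NA, NA, 7, 8, NA, NA),
                 v6 = c(9, 10, NA, NA, 11, 12, NA, NA),
                 stringsAsFactors = FALSE)

# Grouping key: v1, v2 and v3 with its trailing letter removed
key <- paste(df$v1, df$v2, sub("[[:alpha:]]$", "", df$v3))

for (col in c("v4", "v5", "v6")) {
  df[[col]] <- ave(df[[col]], key, FUN = function(z) {
    z[is.na(z)] <- z[!is.na(z)][1]   # first non-NA value in the group (NA if none)
    z
  })
}
df
```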

sorting a list of data frame on a condition

I have a list of data frames with different numbers of columns.
Say Y is a list of 3 data frames with 4, 10 and 5 columns respectively.
I want to sort each data frame in the list according to a specification of which column is sorted first, which second, and so on. For that I have two more objects:
i1 = list(c(0),c(4,5,2,3),c(3))
i2 = c(0,4,1)
In the first data frame I don't want to sort anything; for the second and third data frames I want to follow the order given in i1 and i2.
I have tried writing this loop, which works for one data frame but not for a list:
for (i in 1:length(i1)) {
  if (i2[i] < 1) {
    sorted[[i]] = y[[i]]
  } else {
    for (j in i1[[i]]) {
      sorted[[i]] <- y[[i]][order(y[[i]][j]), ]
    }
  }
}
We can do this with Map
Map(function(x,y, z) if(z < 1) x else x[do.call(order, x[y]),], Y, i1, i2)
#[[1]]
# V1 V2 V3 V4
#1 3 10 7 10
#2 3 3 4 2
#3 8 8 7 1
#4 6 9 7 6
#5 7 3 4 2
#[[2]]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#1 1 7 4 3 5 1 5 10 5 4
#3 8 6 4 7 3 4 5 3 3 10
#4 2 7 2 7 3 3 8 2 2 8
#2 6 1 3 8 4 4 9 5 3 10
#5 3 1 10 10 1 4 6 2 8 5
#[[3]]
# V1 V2 V3 V4 V5
#2 3 6 2 3 8
#4 10 1 3 4 2
#5 7 8 4 9 5
#1 2 4 6 4 4
#3 1 9 6 9 10
Data used:
set.seed(24)
Y <- list(as.data.frame(matrix(sample(1:10, 4*5, replace=TRUE), 5, 4)),
as.data.frame(matrix(sample(1:10, 10*5, replace=TRUE), 5, 10)),
as.data.frame(matrix(sample(1:10, 5*5, replace=TRUE), 5, 5)))
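To see what the do.call(order, ...) step does, here is a small deterministic sketch of the same Map pattern. The helper name sort_by_cols and the toy data are made up for illustration; unname() is added so that column names are not passed to order as argument names.

```r
# Hypothetical helper: sort one data frame by the columns listed in `cols`
# (first column is the primary key, later ones break ties), or return it
# unchanged when `n_keys` is below 1.
sort_by_cols <- function(x, cols, n_keys) {
  if (n_keys < 1) return(x)
  x[do.call(order, unname(as.list(x[cols]))), , drop = FALSE]
}

Y <- list(
  data.frame(V1 = c(3, 1, 2), V2 = c(9, 8, 7)),
  data.frame(V1 = c(2, 1, 2, 1), V2 = c(1, 2, 0, 1))
)
i1 <- list(c(0), c(1, 2))   # columns to sort by, in priority order
i2 <- c(0, 2)               # 0 means "leave this data frame alone"

sorted <- Map(sort_by_cols, Y, i1, i2)
sorted
```

The first data frame comes back untouched; the second is ordered by V1 and then by V2 within ties.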

Rolling Join multiple columns independently to eliminate NAs

I am trying to do a rolling join in data.table that brings in multiple columns, but rolls over both entire missing rows, and individual NAs in particular columns, even when the row is present. By way of example, I have two tables, A, and B:
library(data.table)
A <- data.table(v1 = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
v2 = c(6,6,6,4,4,6,4,4,4,6,4,4,4),
t = c(10,20,30,60,60,10,40,50,60,20,40,50,60),
key = c("v1", "v2", "t"))
B <- data.table(v1 = c(1,1,1,1,2,2,2,2,3,3,3,3),
v2 = c(4,4,6,6,4,4,6,6,4,4,6,6),
t = c(10,70,20,70,10,70,20,70,10,70,20,70),
valA = c('a','a',NA,'a',NA,'a','b','a', 'b','b',NA,'b'),
valB = c(NA,'q','q','q','p','p',NA,'p',NA,'q',NA,'q'),
key = c("v1", "v2", "t"))
B
## v1 v2 t valA valB
## 1: 1 4 10 a NA
## 2: 1 4 70 a q
## 3: 1 6 20 NA q
## 4: 1 6 70 a q
## 5: 2 4 10 NA p
## 6: 2 4 70 a p
## 7: 2 6 20 b NA
## 8: 2 6 70 a p
## 9: 3 4 10 b NA
## 10: 3 4 70 b q
## 11: 3 6 20 NA NA
## 12: 3 6 70 b q
If I do a rolling join (in this case a backwards join), it rolls over all the points when a row cannot be found in B, but still includes points when the row exists but the data to be merged are NA:
B[A, , roll=-Inf]
## v1 v2 t valA valB
## 1: 1 4 60 a q
## 2: 1 4 60 a q
## 3: 1 6 10 NA q
## 4: 1 6 20 NA q
## 5: 1 6 30 a q
## 6: 2 4 40 a p
## 7: 2 4 50 a p
## 8: 2 4 60 a p
## 9: 2 6 10 b NA
## 10: 3 4 40 b q
## 11: 3 4 50 b q
## 12: 3 4 60 b q
## 13: 3 6 20 NA NA
I would like to rolling join in such a way that it rolls over these NAs as well. For a single column, I can subset B to remove the NAs, then roll with A:
C <- B[!is.na(valA), .(v1, v2, t, valA)][A, roll=-Inf]
C
## v1 v2 t valA
## 1: 1 4 60 a
## 2: 1 4 60 a
## 3: 1 6 10 a
## 4: 1 6 20 a
## 5: 1 6 30 a
## 6: 2 4 40 a
## 7: 2 4 50 a
## 8: 2 4 60 a
## 9: 2 6 10 b
## 10: 3 4 40 b
## 11: 3 4 50 b
## 12: 3 4 60 b
## 13: 3 6 20 b
But for multiple columns, I have to do this sequentially, storing the result after each added column and then repeating:
B[!is.na(valB), .(v1, v2, t, valB)][C, roll=-Inf]
## v1 v2 t valB valA
## 1: 1 4 60 q a
## 2: 1 4 60 q a
## 3: 1 6 10 q a
## 4: 1 6 20 q a
## 5: 1 6 30 q a
## 6: 2 4 40 p a
## 7: 2 4 50 p a
## 8: 2 4 60 p a
## 9: 2 6 10 p b
## 10: 3 4 40 q b
## 11: 3 4 50 q b
## 12: 3 4 60 q b
## 13: 3 6 20 q b
The end result above is the desired output, but for multiple columns it quickly becomes unwieldy. Is there a better solution?
Joins are about matching up rows. If you want to match rows multiple ways, you'll need multiple joins.
I'd use a loop, but add columns to A (rather than creating new tables C, D, ... following each join):
k = key(A)
bcols = setdiff(names(B), k)
for (col in bcols) A[, (col) :=
B[!.(as(NA, typeof(B[[col]]))), on=col][.SD, roll=-Inf, ..col]
][]
A
v1 v2 t valA valB
1: 1 4 60 a q
2: 1 4 60 a q
3: 1 6 10 a q
4: 1 6 20 a q
5: 1 6 30 a q
6: 2 4 40 a p
7: 2 4 50 a p
8: 2 4 60 a p
9: 2 6 10 b p
10: 3 4 40 b q
11: 3 4 50 b q
12: 3 4 60 b q
13: 3 6 20 b q
B[!.(NA_character_), on="valA"] is an anti-join that drops rows with NAs in valA. The code above attempts to generalize this (since the NA needs to match the type of the column).
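As a small illustration of that anti-join, the sketch below compares it with a plain is.na filter on a toy table. The table and the names anti and filt are made up for illustration, and it relies on data.table matching NA to NA in joins.

```r
library(data.table)

toy <- data.table(k = 1:4, valA = c("a", NA, "b", NA))

# Not-join: keep the rows of toy whose valA does NOT match NA
anti <- toy[!.(NA_character_), on = "valA"]

# The same rows selected with an ordinary logical filter
filt <- toy[!is.na(valA)]

anti
```

The anti-join form takes the column name as a string via on, which is what makes it convenient inside a loop over column names.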

add an extra line to a table based on absence from other list

I was wondering if it is possible to add an extra line to a table based on the absence of a certain value from another table. This is what my situation looks like:
Text file with 2 columns
V1 V2
1 100
1 101
1 102
1 103
2 230
2 231
2 232
... ...
Other text file with 5 columns
V1 V2 V3 V4 V5
1 100 a b c
1 101 a b c
1 103 a b c
2 231 a b c
2 232 a b c
When a combination of V1 and V2 values from the first text file is NOT present in the second text file (in the example, 1 102 and 2 230 are not present), I want to add extra lines to the second file with the V1 and V2 values from the first file and with V3, V4 and V5 equal to 0.
So that the second file becomes like this:
V1 V2 V3 V4 V5
1 100 a b c
1 101 a b c
1 102 0 0 0
1 103 a b c
2 230 0 0 0
2 231 a b c
2 232 a b c
I cannot find the right command to do this in R. Could someone give me a hand?
Assuming the two objects are named "DF1" and "DF2", you can use merge as follows:
DFM <- merge(DF1, DF2, all = TRUE)
DFM
# V1 V2 V3 V4 V5
# 1 1 100 a b c
# 2 1 101 a b c
# 3 1 102 <NA> <NA> <NA>
# 4 1 103 a b c
# 5 2 230 <NA> <NA> <NA>
# 6 2 231 a b c
# 7 2 232 a b c
If you would really prefer 0 instead of NA, you can do the following:
# Convert the factors to characters
DFM[sapply(DFM, is.factor)] <- lapply(DFM[sapply(DFM, is.factor)], as.character)
# Identify the NA values and replace them with 0
DFM[is.na(DFM)] <- 0
DFM
# V1 V2 V3 V4 V5
# 1 1 100 a b c
# 2 1 101 a b c
# 3 1 102 0 0 0
# 4 1 103 a b c
# 5 2 230 0 0 0
# 6 2 231 a b c
# 7 2 232 a b c
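For anyone who wants to run this end to end, here is a self-contained sketch that builds the two tables from the question directly instead of reading them from text files. The construction of DF1 and DF2 is an assumption about how the files look once read in; with stringsAsFactors = FALSE the factor-to-character conversion step is not needed.

```r
DF1 <- data.frame(V1 = c(1, 1, 1, 1, 2, 2, 2),
                  V2 = c(100, 101, 102, 103, 230, 231, 232))
DF2 <- data.frame(V1 = c(1, 1, 1, 2, 2),
                  V2 = c(100, 101, 103, 231, 232),
                  V3 = "a", V4 = "b", V5 = "c",
                  stringsAsFactors = FALSE)

# Full outer join on the shared columns V1 and V2
DFM <- merge(DF1, DF2, all = TRUE)

# Only columns that actually contain NA are touched, so V1 and V2 stay numeric
DFM[is.na(DFM)] <- "0"
DFM
```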
