I wrote a functioning for loop, but it's slow over thousands of rows and I'm looking for a more efficient alternative. Thanks in advance!
The task:
If column a matches column b, column d becomes NA.
If column a does not match b, but b matches c, then column e becomes NA.
The for loop:
for (i in 1:nrow(data)) {
if (data$a[i] == data$b[i]) {data$d[i] <- NA}
if (!(data$a[i] == data$b[i]) & data$b[i] == data$c[i])
{data$e[i] <- NA}
}
An example:
a b c d e
F G G 1 10
F G F 5 10
F F F 2 8
Would become:
a b c d e
F G G 1 NA
F G F 5 10
F F F NA 8
If you're concerned about speed and efficiency, I'd recommend data.table (though vectorizing a normal data.frame, as recommended by @parfait, would probably speed things up more than enough):
library(data.table)
DT <- fread("a b c d e
F G G 1 10
F G F 5 10
F F F 2 8")
print(DT)
# a b c d e
# 1: F G G 1 10
# 2: F G F 5 10
# 3: F F F 2 8
DT[a == b, d := NA]
DT[a != b & b == c, e := NA]
print(DT)
# a b c d e
# 1: F G G 1 NA
# 2: F G F 5 10
# 3: F F F NA 8
Suppose df is your data then:
ab <- with(df, a==b)
bc <- with(df, b==c)
df$d[ab] <- NA
df$e[!ab & bc] <- NA
which would result in
# a b c d e
# 1 F G G 1 NA
# 2 F G F 5 10
# 3 F F F NA 8
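As a sanity check, the vectorized version can be compared against the original loop on the example data. This is only a minimal sketch: the data frame construction and the names looped and vec are my own, not from the question.

```r
# Example data from the question (construction is an assumption)
df <- data.frame(a = c("F", "F", "F"),
                 b = c("G", "G", "F"),
                 c = c("G", "F", "F"),
                 d = c(1, 5, 2),
                 e = c(10, 10, 8),
                 stringsAsFactors = FALSE)

# Row-by-row version, equivalent to the loop in the question
looped <- df
for (i in seq_len(nrow(looped))) {
  if (looped$a[i] == looped$b[i]) looped$d[i] <- NA
  if (looped$a[i] != looped$b[i] && looped$b[i] == looped$c[i]) looped$e[i] <- NA
}

# Vectorized version: compute both comparisons once for all rows
vec <- df
ab <- vec$a == vec$b
bc <- vec$b == vec$c
vec$d[ab] <- NA
vec$e[!ab & bc] <- NA

# Both approaches should agree
identical(looped, vec)
```

The comparisons are computed once over whole columns, so the per-row overhead of the loop disappears.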
We could create a list of quosures and evaluate them:
library(tidyverse)
qs <- setNames(quos(d*NA^(a == b), e*NA^((!(a ==b) & (b == c)))), c("d", "e"))
df1 %>%
mutate(!!! qs)
# a b c d e
#1 F G G 1 NA
#2 F G F 5 10
#3 F F F NA 8
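The NA^(condition) expressions above rely on NA^0 being 1 and NA^1 being NA in R. A small base R sketch of just that trick, using made-up vectors d and hit that are not from the answer:

```r
d   <- c(1, 5, 2)
hit <- c(FALSE, FALSE, TRUE)   # TRUE marks the values to blank out

# FALSE is treated as 0, so NA^FALSE is 1 and the value is kept;
# TRUE is treated as 1, so NA^TRUE is NA and the value becomes NA
res <- d * NA^hit
res
```

This only works for numeric columns, since the value is multiplied by the result.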
I got these two data frames:
a <- c('A','B','C','D','E','F','G','H')
b <- c(1,2,1,3,1,3,1,6)
c <- c('K','K','H','H','K','K','H','H')
frame1 <- data.frame(a,b,c)
a <- c('A','A','B','B','C','C','D','D','E','E','F','F','G','H','H')
d <- c(5,5,6,3,1,9,1,0,2,3,6,5,5,5,4)
e <- c('W','W','D','D','D','D','W','W','D','D','W','W','D','W','W')
frame2<- data.frame(a,d,e)
And now I want to include the column 'e' from 'frame2' into 'frame1' depending on the matching value in column 'a' of both data frames. Note: 'e' is the same for all rows with the same value in 'a'.
The result should look like this:
a b c e
1 A 1 K W
2 B 2 K D
3 C 1 H D
4 D 3 H W
5 E 1 K D
6 F 3 K W
7 G 1 H D
8 H 6 H W
Any suggestions?
You can use match to look up the matching values in column 'a' of both data frames:
frame1$e <- frame2$e[match(frame1$a, frame2$a)]
frame1
# a b c e
#1 A 1 K W
#2 B 2 K D
#3 C 1 H D
#4 D 3 H W
#5 E 1 K D
#6 F 3 K W
#7 G 1 H D
#8 H 6 H W
or using merge:
merge(frame1, frame2[!duplicated(frame2$a), c("a", "e")], all.x=TRUE)
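To see what the match lookup does, here is a small sketch on cut-down data. The shortened frames and the unmatched key "Z" are my own additions for illustration.

```r
frame1 <- data.frame(a = c("A", "B", "Z"), b = c(1, 2, 3),
                     stringsAsFactors = FALSE)
frame2 <- data.frame(a = c("A", "A", "B", "B"),
                     e = c("W", "W", "D", "D"),
                     stringsAsFactors = FALSE)

# match() returns, for each key in frame1$a, the position of its FIRST
# occurrence in frame2$a, or NA when the key is absent
idx <- match(frame1$a, frame2$a)

# Indexing with NA yields NA, so unmatched keys get a missing value
frame1$e <- frame2$e[idx]
frame1
```

Because only the first occurrence is used, this relies on 'e' being the same for all rows that share a value of 'a', as stated in the question.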
You can perform a join on column 'a' of both data frames and keep only the matched values. Do a left join, then drop the columns from the second data frame that aren't needed.
Using dplyr :
library(dplyr)
frame2 %>%
distinct(a, e, .keep_all = TRUE) %>%
right_join(frame1, by = 'a') %>%
select(-d) %>%
arrange(a)
# a e b c
#1 A W 1 K
#2 B D 2 K
#3 C D 1 H
#4 D W 3 H
#5 E D 1 K
#6 F W 3 K
#7 G D 1 H
#8 H W 6 H
I have a dataframe with multiple factors and multiple numeric vars. I would like to collapse one of the factors (say by mean).
In my attempts I could only think of nested sapply or for loops to isolate the numerical elements to be averaged.
var <- data.frame(A = c(rep('a',8),rep('b',8)), B =
c(rep(c(rep('c',2),rep('d',2)),4)), C = c(rep(c('e','f'),8)),
D = rnorm(16), E = rnorm(16))
> var
A B C D E
1 a c e 1.1601720731 -0.57092435
2 a c f -0.0120178626 1.05003748
3 a d e 0.5311032778 1.67867806
4 a d f -0.3399901000 0.01459940
5 a c e -0.2887561691 -0.03847519
6 a c f 0.0004299922 -0.36695879
7 a d e 0.8124655890 0.05444033
8 a d f -0.3777058654 1.34074427
9 b c e 0.7380720821 0.37708543
10 b c f -0.3163496271 0.10921373
11 b d e -0.5543252191 0.35020193
12 b d f -0.5753686426 0.54642790
13 b c e -1.9973216646 0.63597405
14 b c f -0.3728926714 -3.07669300
15 b d e -0.6461596329 -0.61659041
16 b d f -1.7902722068 -1.06761729
sapply(4:ncol(var), function(i){
sapply(1:length(levels(var$A)), function(j){
sapply(1:length(levels(var$B)), function(t){
sapply(1:length(levels(var$C)), function(z){
mean(var[var$A == levels(var$A)[j] &
var$B == levels(var$B)[t] &
var$C == levels(var$C)[z],i])
})
})
})
})
[,1] [,2]
[1,] 0.435707952 -0.3046998
[2,] -0.005793935 0.3415393
[3,] 0.671784433 0.8665592
[4,] -0.358847983 0.6776718
[5,] -0.629624791 0.5065297
[6,] -0.344621149 -1.4837396
[7,] -0.600242426 -0.1331942
[8,] -1.182820425 -0.2605947
Is there a way to do this without so many nested sapply calls? Maybe with mapply or outer?
Maybe just,
var <- data.frame(A = c(rep('a',8),rep('b',8)), B =
c(rep(c(rep('c',2),rep('d',2)),4)), C = c(rep(c('e','f'),8)),
D = rnorm(16), E = rnorm(16))
library(dplyr)
var %>%
group_by(A,B,C) %>%
summarise_if(is.numeric,mean)
(Note that the output you show isn't what I get when I run your sapply code, but the above is identical to what I get when I run your sapply's.)
For inline aggregation (keeping same number of rows of data frame), consider ave:
var$D_mean <- with(var, ave(D, A, B, C, FUN=mean))
var$E_mean <- with(var, ave(E, A, B, C, FUN=mean))
For full aggregation (collapsed to factor groups), consider aggregate:
aggregate(. ~ A + B + C, var, mean)
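The difference between the two is easier to see on a tiny deterministic example. The data frame toy below is made up for illustration.

```r
toy <- data.frame(A = c("a", "a", "b", "b"),
                  D = c(1, 3, 10, 20),
                  stringsAsFactors = FALSE)

# ave() returns one value per input row: each row gets its group's mean
toy$D_mean <- ave(toy$D, toy$A, FUN = mean)

# aggregate() collapses the data to one row per group
agg <- aggregate(D ~ A, data = toy, FUN = mean)

toy
agg
```

Here toy keeps its four rows with the group mean repeated, while agg has one row per level of A.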
I will complete the holy trinity with a data.table solution. Here .SD is a data.table of all the columns not listed in the by portion. This is a near-dupe of this question (only difference is >1 column being summarized), so click that if you want more solutions.
library(data.table)
setDT(var)
var[, lapply(.SD, mean), by = .(A, B, C)]
# A B C D E
# 1: a c e 0.07465822 0.032976115
# 2: a c f 0.40789460 -0.944631574
# 3: a d e 0.72054938 0.039781185
# 4: a d f -0.12463910 0.003363382
# 5: b c e -1.64343115 0.806838905
# 6: b c f -1.08122890 -0.707975411
# 7: b d e 0.03937829 0.048136471
# 8: b d f -0.43447899 0.028266455
I have 3 columns
2 are the same length
1 is of a lesser length
here are the columns:
column1 <- letters[1:10]
column2 <- letters[1:15]
column3 <- letters[1:15]
I want all 3 columns to be joined together, with the missing 5 values in column1 filled with NA.
What can I do to achieve this? A tibble?
You can change the length of a vector:
column1 <- letters[1:10]
column2 <- letters[1:15]
length(column1) <- length(column2)
Now
> column1
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" NA NA NA NA NA
We can wrap it in a function:
cbind_dif <- function(x = list()){
# Find max length
max_length <- max(unlist(lapply(x, length)))
# Set length of each vector to max_length (shorter vectors are padded with NA)
res <- lapply(x, function(x){
length(x) <- max_length
return(x)
})
return(as.data.frame(res))
}
# Example usage:
> cbind_dif(list(column1 = column1, column2 = column2))
column1 column2
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 g g
8 h h
9 i i
10 j j
11 <NA> k
12 <NA> l
13 <NA> m
14 <NA> n
15 <NA> o
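The same idea can be written more compactly by passing the replacement function `length<-` straight to lapply. This is a sketch of an alternative, not the function above; the names cols and padded are my own.

```r
cols <- list(column1 = letters[1:10],
             column2 = letters[1:15],
             column3 = letters[1:15])

# lengths() gives the length of each list element;
# `length<-`(x, n) returns x extended to length n, padded with NA
padded <- lapply(cols, `length<-`, max(lengths(cols)))

res <- as.data.frame(padded, stringsAsFactors = FALSE)
nrow(res)
```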
n <- max(length(column1), length(column2), length(column3))
data.frame(column1[1:n],column2[1:n],column3[1:n])
column1.1.n. column2.1.n. column3.1.n.
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
7 g g g
8 h h h
9 i i i
10 j j j
11 <NA> k k
12 <NA> l l
13 <NA> m m
14 <NA> n n
15 <NA> o o
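Using cbind.fill from the rowr package, you can do it easily. Pass fill = NA so that the shorter column is padded with NA instead of being recycled.
library(rowr)
new <- cbind.fill(column1, column2, column3, fill = NA)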
I hope this helps
library(tibble)
column1 <- letters[1:10]
column2 <- letters[1:15]
column3 <- letters[1:15]
tibble(a = c(column1, rep(NA, length(column2) - length(column1))), b = column2, c = column3)
# A tibble: 15 × 3
a b c
<chr> <chr> <chr>
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
7 g g g
8 h h h
9 i i i
10 j j j
11 NA k k
12 NA l l
13 NA m m
14 NA n n
15 NA o o
I want to add a total row (as in the Excel tables) while writing my data.frame in a worksheet.
Here is my present code (using openxlsx):
writeDataTable(wb=WB, sheet="Data", x=X, withFilter=F, bandedRows=F, firstColumn=T)
X contains a data.frame with 8 character variables and 1 numeric variable. Therefore the total row should only contain a total for the numeric column. Ideally I could turn on the Excel total row feature while writing the table to the workbook object, like I did with firstColumn, rather than adding a total row manually.
I searched for a solution both on Stack Overflow and in the official openxlsx documentation, but to no avail. Please suggest solutions using openxlsx.
EDIT:
Adding data sample:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
After Total row:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
na na na na na na na 22 na
library(janitor)
adorn_totals(df, "row")
#> A B C D E F G H I
#> a b s r t i s 5 j
#> f d t y d r s 9 s
#> w s y s u c k 8 f
#> Total - - - - - - 22 -
If you prefer empty space instead of - in the character columns you can specify fill = "" or fill = NA.
Assuming your data is stored in a data.frame called df:
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)
You can create a row using lapply
totals <- lapply(df, function(col) {
if (is.numeric(col)) sum(col) else NA
})
and add it to df using rbind()
df <- rbind(df, totals)
head(df)
A B C D E F G H I
1 a b s r t i s 5 j
2 f d t y d r s 9 s
3 w s y s u c k 8 f
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA>
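The two steps can be bundled into a small helper. This is a sketch under the assumption that only numeric columns should be summed; add_total_row is a hypothetical name, and the three-column data frame is a shortened version of the example.

```r
# Hypothetical helper: append a row holding the sum of each numeric
# column and NA for every other column
add_total_row <- function(df) {
  totals <- lapply(df, function(col) if (is.numeric(col)) sum(col) else NA)
  rbind(df, totals)
}

small <- data.frame(A = c("a", "f", "w"),
                    H = c(5, 9, 8),
                    I = c("j", "s", "f"),
                    stringsAsFactors = FALSE)

with_total <- add_total_row(small)
with_total
```

The numeric column stays numeric after rbind, so the result can still be written out as numbers rather than text.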
I have two data frames in R.
dataframe 1
A B C D E F G
1 2 a a a a a
2 3 b b b c c
4 1 e e f f e
dataframe 2
X Y Z
1 2 g
2 1 h
3 4 i
1 4 j
I want to match dataframe1's columns A and B with dataframe2's columns X and Y. It is NOT a pairwise comparison, i.e. row 1 of dataframe1 (A=1, B=2) is considered the same as both row 1 (X=1, Y=2) and row 2 (X=2, Y=1) of dataframe2.
When a match is found, I would like to add columns C, D, E, F, G of dataframe1 to the matched row of dataframe2, with rows that have no match filled with NA, as follows:
Final dataframe
X Y Z C D E F G
1 2 g a a a a a
2 1 h a a a a a
3 4 i na na na na na
1 4 j e e f f e
I only know how to do matching on a single column; matching on two interchangeable columns and merging the two data frames based on the result is difficult for me. Please suggest a smart way of doing this.
For ease of discussion (thanks to Vincent and DWin, who commented on my previous question that I should provide test code), here is the code for loading dataframes 1 and 2 into R.
df1 <- data.frame(A = c(1,2,4), B=c(2,3,1), C=c('a','b','e'),
D=c('a','b','e'), E=c('a','b','f'),
F=c('a','c','f'), G=c('a','c', 'e'))
df2 <- data.frame(X = c(1,2,3,1), Y=c(2,1,4,4), Z=letters[7:10])
The following works, but no doubt can be improved.
I first create a little helper function that performs a row-wise sort on A and B (and renames it to V1 and V2).
replace_index <- function(dat){
x <- as.data.frame(t(sapply(seq_len(nrow(dat)),
function(i)sort(unlist(dat[i, 1:2])))))
names(x) <- paste("V", seq_len(ncol(x)), sep="")
data.frame(x, dat[, -(1:2), drop=FALSE])
}
replace_index(df1)
V1 V2 C D E F G
1 1 2 a a a a a
2 2 3 b b b c c
3 1 4 e e f f e
This means you can use a straightforward merge to combine the data.
merge(replace_index(df1), replace_index(df2), all.y=TRUE)
V1 V2 C D E F G Z
1 1 2 a a a a a g
2 1 2 a a a a a h
3 1 4 e e f f e j
4 3 4 <NA> <NA> <NA> <NA> <NA> i
This is slightly clunky, and has some potential collision and order issues but works with your example
df1a <- df1; df1a$A <- df1$B; df1a$B <- df1$A #reverse A and B
merge(df2, rbind(df1,df1a), by.x=c("X","Y"), by.y=c("A","B"), all.x=TRUE)
to produce
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i <NA> <NA> <NA> <NA> <NA>
One approach would be to create an id key for matching that is order invariant.
# create id key to match
require(plyr)
df1 = adply(df1, 1, transform, id = paste(min(A, B), "-", max(A, B)))
df2 = adply(df2, 1, transform, id = paste(min(X, Y), "-", max(X, Y)))
# combine data frames using `match`
cbind(df2, df1[match(df2$id, df1$id),3:7])
This produces the output
X Y Z id C D E F G
1 1 2 g 1 - 2 a a a a a
1.1 2 1 h 1 - 2 a a a a a
NA 3 4 i 3 - 4 <NA> <NA> <NA> <NA> <NA>
3 1 4 j 1 - 4 e e f f e
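A similar order-invariant key can also be built in base R without plyr, using the vectorized pmin and pmax. This is a sketch of an alternative to the adply calls above, not the same code; the names key1, key2, idx and result are my own.

```r
df1 <- data.frame(A = c(1, 2, 4), B = c(2, 3, 1), C = c("a", "b", "e"),
                  D = c("a", "b", "e"), E = c("a", "b", "f"),
                  F = c("a", "c", "f"), G = c("a", "c", "e"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c(1, 2, 3, 1), Y = c(2, 1, 4, 4), Z = letters[7:10],
                  stringsAsFactors = FALSE)

# Smaller value first, larger value second, so (1, 2) and (2, 1) share a key
key1 <- paste(pmin(df1$A, df1$B), pmax(df1$A, df1$B), sep = "-")
key2 <- paste(pmin(df2$X, df2$Y), pmax(df2$X, df2$Y), sep = "-")

# For each row of df2, find the matching row of df1 (NA if none)
idx <- match(key2, key1)

result <- cbind(df2, df1[idx, c("C", "D", "E", "F", "G")])
rownames(result) <- NULL
result
```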
You could also join the tables both ways (X == A and Y == B, then X == B and Y == A) and rbind them. This will produce duplicate pairs where one way yielded a match and the other yielded NA, so you would then reduce duplicates by slicing only a single row for each X-Y combination, the one without NA if one exists.
library(dplyr)
m <- left_join(df2,df1,by = c("X" = "A","Y" = "B"))
n <- left_join(df2,df1,by = c("Y" = "A","X" = "B"))
rbind(m,n) %>%
group_by(X,Y) %>%
arrange(C,D,E,F,G) %>% # sort to put NA rows on bottom of pairs
slice(1) # take top row from combination
Produces:
Source: local data frame [4 x 8]
Groups: X, Y
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i NA NA NA NA NA
Here's another possible solution in base R. This solution cbind()s new key columns (K1 and K2) to both data.frames using the vectorized pmin() and pmax() functions to derive the canonical order of the key columns, and merges on those:
merge(
  cbind(df2, K1 = pmin(df2$X, df2$Y), K2 = pmax(df2$X, df2$Y)),
  cbind(df1, K1 = pmin(df1$A, df1$B), K2 = pmax(df1$A, df1$B)),
  all.x = TRUE
)[, -c(1:2, 6:7)]
## X Y Z C D E F G
## 1 1 2 g a a a a a
## 2 2 1 h a a a a a
## 3 1 4 j e e f f e
## 4 3 4 i <NA> <NA> <NA> <NA> <NA>
Note that the use of pmin() and pmax() is only possible for this problem because you only have two key columns; if you had more, then you'd have to use some kind of apply+sort solution to achieve the canonical key order for merging, similar to what @Andrie does in his helper function, which would work for any number of key columns, but would be less performant.
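For illustration, the apply+sort variant for more than two key columns might look like the following. The three-column keys data frame is invented for this sketch, and the columns are renamed afterwards because sort() reorders the original names.

```r
keys <- data.frame(K1 = c(3, 1, 2),
                   K2 = c(1, 2, 2),
                   K3 = c(2, 3, 1))

# Sort the values within each row; apply() returns one column per row,
# so transpose to get back to one row per original row
canon <- as.data.frame(t(apply(keys, 1, sort)))

# Give the sorted columns clean positional names
names(canon) <- paste0("K", seq_len(ncol(canon)))
canon
```

Rows 1 and 2 hold the same values in different orders, so they end up with identical canonical keys and would match each other in a merge.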