find unique a row of a data frame R - r

I have many columns (120) in a dataframe.
I would like to create a new column in this dataframe, with each row containing the unique values (ignoring NAs) from the values across each row in the 120 columns. For example:
V1 V2 V3 V4 V5
a a NA c d
c d e f e
x x x NA NA
a V6 column should be added
V6
a c d
c d e f
x
Any suggestion is more than welcome!
Wannes

By using apply and toString
db$New=apply(db,1,function(x) toString(sort(unique(x[!is.na(x)]))))
db
V1 V2 V3 V4 V5 New
1 a a <NA> c d a, c, d
2 c d e f e c, d, e, f
3 x x x <NA> <NA> x
Or using paste
db$New=apply(db,1,function(x) paste(sort(unique(x[!is.na(x)]),collapse = ' ')))
db
V1 V2 V3 V4 V5 New
1 a a <NA> c d a c d
2 c d e f e c d e f
3 x x x <NA> <NA> x
The added sort ensures that the same set of unique values always appears in the same order.

Related

Referencing a the index object's name in a loop in R

How can R report the actual name i, when using it to name columns and lists in a for loop.
For example, using the following data:
z <- data.frame(x= c(1,2,3,4,5), y = c("a", "b", "v", "d", "e"))
When I reference i from the loop when creating the columns it names it i as the column names.
a_final <- NULL
for(i in z$x){
print(data.frame(i = z$y))
}
Instead, I'd like the columns to be named by the value of each i in the loop, instead.
I'd like the results to look something like:
1 2 3 4 5 6
a a a a a a
b b b b b b
c c c c c c
d d d d d d
e e e e e e
You could create a matrix with data from z$y and dimensions same as nrow(z) and convert it into dataframe.
as.data.frame(matrix(z$y, ncol = nrow(z), nrow = nrow(z)))
# V1 V2 V3 V4 V5
#1 a a a a a
#2 b b b b b
#3 c c c c c
#4 d d d d d
#5 e e e e e
We can also use replicate
as.data.frame(replicate(nrow(z), z$y))

R - Adding a total row in Excel output

I want to add a total row (as in the Excel tables) while writing my data.frame in a worksheet.
Here is my present code (using openxlsx):
writeDataTable(wb=WB, sheet="Data", x=X, withFilter=F, bandedRows=F, firstColumn=T)
X contains a data.frame with 8 character variables and 1 numeric variable. Therefore the total row should only contain total for the numeric row (it will be best if somehow I could add the Excel total row feature, like I did with firstColumn while writing the table to the workbook object rather than to manually add a total row).
I searched for a solution both in StackOverflow and the official openxslx documentation but to no avail. Please suggest solutions using openxlsx.
EDIT:
Adding data sample:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
After Total row:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
na na na na na na na 22 na
library(janitor)
adorn_totals(df, "row")
#> A B C D E F G H I
#> a b s r t i s 5 j
#> f d t y d r s 9 s
#> w s y s u c k 8 f
#> Total - - - - - - 22 -
If you prefer empty space instead of - in the character columns you can specify fill = "" or fill = NA.
Assuming your data is stored in a data.frame called df:
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)
You can create a row using lapply
totals <- lapply(df, function(col) {
ifelse(!any(!is.numeric(col)), sum(col), NA)
})
and add it to df using rbind()
df <- rbind(df, totals)
head(df)
A B C D E F G H I
1 a b s r t i s 5 j
2 f d t y d r s 9 s
3 w s y s u c k 8 f
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA>

Get counts row wise and add result as new column

I have this R data frame:
v1 <- LETTERS[1:10]
v2 <- LETTERS[1:4]
v3 <- LETTERS[4:5]
dat <- data.frame(cbind(v1,v2,v3))
v1 v2 v3
A A D
B B E
C C D
D D E
E A D
F B E
G C D
H D E
I A D
J B E
I would like to get a count of the number of occurrences of a given value (e.g. "A") for each column,
and save that as a new column in my data frame.
I my dataframe I want to calculate the occurrences af "A" in column v1 thru v3, and make a new column (CountA) with the count of A's.
My desired output would be:
v1 v2 v3 CountA
A A D 2
B B E 0
C C D 0
D D E 0
E A D 1
F B E 0
G C D 0
H D E 0
I A D 1
J B E 0
Try this:
dat$CountA <- rowSums(dat=="A")

Count unique elements in data frame row and return one with maximum occurrence [duplicate]

This question already has answers here:
How to find mode across variables/vectors within a data row in R
(3 answers)
Closed 9 years ago.
Is it possible to count unique elements in data frame row and return one with maximum occurrence and as result form the vector.
example:
a a a b b b b -> b
c v f w w r t -> w
s s d f b b b -> b
You can use apply to use table function on every row of dataframe.
df <- read.table(textConnection("a a a b b b b\nc v f w w r t\ns s d f b b b"), header = F)
df$result <- apply(df, 1, function(x) names(table(x))[which.max(table(x))])
df
## V1 V2 V3 V4 V5 V6 V7 result
## 1 a a a b b b b b
## 2 c v f w w r t w
## 3 s s d f b b b b
Yes with table
x=c("a", "a", "a", "b" ,"b" ,"b" ,"b")
table(x)
x
a b
3 4
EDIT with data.table
DT = data.table(x=sample(letters[1:5],10,T),y=sample(letters[1:5],10,T))
#DT
# x y
# 1: d a
# 2: c d
# 3: d c
# 4: c a
# 5: a e
# 6: d c
# 7: c b
# 8: a b
# 9: b c
#10: c d
f = function(x) names(table(x))[which.max(table(x))]
DT[,lapply(.SD,f)]
# x y
#1: c c
Note that if you want to keep ALL max's, you need to ask for them explicitly.
You can save them as a list inside the data.frame. If there is only one per row, then the list will be simplified to a common vector
df$result <- apply(df, 1, function(x) {T <- table(x); list(T[which(T==max(T))])})
With Ties for max:
df2 <- df[, 1:6]
df2$result <- apply(df2, 1, function(x) {T <- table(x); list(T[which(T==max(T))])})
> df2
V1 V2 V3 V4 V5 V6 result
1 a a a b b b 3, 3
2 c v f w w r 2
3 s s d f b b 2, 2
With No Ties for max:
df$result <- apply(df, 1, function(x) {T <- table(x); list(T[which(T==max(T))])})
> df
V1 V2 V3 V4 V5 V6 V7 result
1 a a a b b b b 4
2 c v f w w r t 2
3 s s d f b b b 3

merge two dataframe based on matching two exchangable columns in each dataframe

I have two dataframe in R.
dataframe 1
A B C D E F G
1 2 a a a a a
2 3 b b b c c
4 1 e e f f e
dataframe 2
X Y Z
1 2 g
2 1 h
3 4 i
1 4 j
I want to match dataframe1's column A and B with dataframe2's column X and Y. It is NOT a pairwise comparsions, i.e. row 1 (A=1 B=2) are considered to be same as row 1 (X=1, Y=2) and row 2 (X=2, Y=1) of dataframe 2.
When matching can be found, I would like to add columns C, D, E, F of dataframe1 back to the matched row of dataframe2, as follows: with no matching as na.
Final dataframe
X Y Z C D E F G
1 2 g a a a a a
2 1 h a a a a a
3 4 i na na na na na
1 4 j e e f f e
I can only know how to do matching for single column, however, how to do matching for two exchangable columns and merging two dataframes based on the matching results is difficult for me. Pls kindly help to offer smart way of doing this.
For the ease of discussion (thanks for the comments by Vincent and DWin (my previous quesiton) that I should test the quote.) There are the quota for loading dataframe 1 and 2 to R.
df1 <- data.frame(A = c(1,2,4), B=c(2,3,1), C=c('a','b','e'),
D=c('a','b','e'), E=c('a','b','f'),
F=c('a','c','f'), G=c('a','c', 'e'))
df2 <- data.frame(X = c(1,2,3,1), Y=c(2,1,4,4), Z=letters[7:10])
The following works, but no doubt can be improved.
I first create a little helper function that performs a row-wise sort on A and B (and renames it to V1 and V2).
replace_index <- function(dat){
x <- as.data.frame(t(sapply(seq_len(nrow(dat)),
function(i)sort(unlist(dat[i, 1:2])))))
names(x) <- paste("V", seq_len(ncol(x)), sep="")
data.frame(x, dat[, -(1:2), drop=FALSE])
}
replace_index(df1)
V1 V2 C D E F G
1 1 2 a a a a a
2 2 3 b b b c c
3 1 4 e e f f e
This means you can use a straight-forward merge to combine the data.
merge(replace_index(df1), replace_index(df2), all.y=TRUE)
V1 V2 C D E F G Z
1 1 2 a a a a a g
2 1 2 a a a a a h
3 1 4 e e f f e j
4 3 4 <NA> <NA> <NA> <NA> <NA> i
This is slightly clunky, and has some potential collision and order issues but works with your example
df1a <- df1; df1a$A <- df1$B; df1a$B <- df1$A #reverse A and B
merge(df2, rbind(df1,df1a), by.x=c("X","Y"), by.y=c("A","B"), all.x=TRUE)
to produce
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i <NA> <NA> <NA> <NA> <NA>
One approach would be to create an id key for matching that is order invariant.
# create id key to match
require(plyr)
df1 = adply(df1, 1, transform, id = paste(min(A, B), "-", max(A, B)))
df2 = adply(df2, 1, transform, id = paste(min(X, Y), "-", max(X, Y)))
# combine data frames using `match`
cbind(df2, df1[match(df2$id, df1$id),3:7])
This produces the output
X Y Z id C D E F G
1 1 2 g 1 - 2 a a a a a
1.1 2 1 h 1 - 2 a a a a a
NA 3 4 i 3 - 4 <NA> <NA> <NA> <NA> <NA>
3 1 4 j 1 - 4 e e f f e
You could also join the tables both ways (X == A and Y == B, then X == B and Y == A) and rbind them. This will produce duplicate pairs where one way yielded a match and the other yielded NA, so you would then reduce duplicates by slicing only a single row for each X-Y combination, the one without NA if one exists.
library(dplyr)
m <- left_join(df2,df1,by = c("X" = "A","Y" = "B"))
n <- left_join(df2,df1,by = c("Y" = "A","X" = "B"))
rbind(m,n) %>%
group_by(X,Y) %>%
arrange(C,D,E,F,G) %>% # sort to put NA rows on bottom of pairs
slice(1) # take top row from combination
Produces:
Source: local data frame [4 x 8]
Groups: X, Y
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i NA NA NA NA NA
Here's another possible solution in base R. This solution cbind()s new key columns (K1 and K2) to both data.frames using the vectorized pmin() and pmax() functions to derive the canonical order of the key columns, and merges on those:
merge(cbind(df2,K1=pmin(df2$X,df2$Y),K2=pmax(df2$X,df2$Y)),cbind(df1,K1=pmin(df1$A,df1$B),K2=pmax(df1$A,df1$B)),all.x=T)[,-c(1:2,6:7)];
## X Y Z C D E F G
## 1 1 2 g a a a a a
## 2 2 1 h a a a a a
## 3 1 4 j e e f f e
## 4 3 4 i <NA> <NA> <NA> <NA> <NA>
Note that the use of pmin() and pmax() is only possible for this problem because you only have two key columns; if you had more, then you'd have to use some kind of apply+sort solution to achieve the canonical key order for merging, similar to what #Andrie does in his helper function, which would work for any number of key columns, but would be less performant.

Resources