R: delete non-max values in redundant rows

I have a matrix that contains following:
A B C D
a 1 3 2 5
b 3 2 5 8
a 2 1 0 9
a 4 2 1 3
c 4 3 1 1
b 2 5 1 9
A, B, C, D are column names and
a, b, c are row names.
I want to make it look like
A B C D
a 4 3 2 9
b 3 5 5 9
c 4 3 1 1
using R. That is, I want to
1) sort the rows alphabetically by row name, and
2) where there are redundant rows (i.e. several rows share the same row name), keep the maximum value in each column across those rows and drop the rest.
I first did this in Python, but I was wondering whether there is a
more convenient way to do this job in R.
I would appreciate any help.

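You can use data.table:
library(data.table)
# matrix_in is the input matrix; store its row names in a grouping column
dt_in <- data.table(matrix_in)
dt_in[, name := rownames(matrix_in)]
# column-wise maximum within each row name
dt_max <- dt_in[, list(A = max(A), B = max(B), C = max(C), D = max(D)), by = "name"]
as.matrix(data.frame(dt_max))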

Here's a one-liner using data.table. You can keep the row names while converting to a data.table (keep.rownames = TRUE stores them in the rn column) and then apply max over all columns using lapply(.SD, ...), grouped by rn:
library(data.table)
data.table(m, keep.rownames = TRUE)[, lapply(.SD, max), by = rn]
# rn A B C D
# 1: a 4 3 2 9
# 2: b 3 5 5 9
# 3: c 4 3 1 1

You can simply use the aggregate function, grouping by the row names (m is the matrix):
aggregate(m, by = list(name = rownames(m)), FUN = max)
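For comparison, here is a minimal base R sketch that returns a matrix directly. The matrix m is rebuilt from the question's example and the name out is only illustrative. tapply() groups each column by row name and returns the groups in sorted order, which should also cover the alphabetical ordering.

```r
# Example matrix from the question (a matrix may have duplicate row names)
m <- matrix(
  c(1, 3, 2, 5,
    3, 2, 5, 8,
    2, 1, 0, 9,
    4, 2, 1, 3,
    4, 3, 1, 1,
    2, 5, 1, 9),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("a", "b", "a", "a", "c", "b"), c("A", "B", "C", "D"))
)

# For every column, take the maximum within each row name
out <- apply(m, 2, function(col) tapply(col, rownames(m), max))
out
```

The result keeps the row names as real row names, so no extra name column has to be stripped afterwards.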


Create a new column from different columns of one data frame conditioned on another column from another data frame

Suppose I have two data frames:
df1 <- data.frame(A = 1:6, B = 7:12, C = rep(1:2, 3))
df2 <- data.frame(C = 1:2, D = c("A", "B"))
I want to create a new column E in df1 whose value depends on column C, which in turn is linked to column D in df2. For example, the C value in the first row of df1 is 1, and the value 1 of column C in df2 corresponds to "A" in column D, so the value of E in that row of df1 should come from column A, i.e., 1.
As suggested by "Select values from different columns based on a variable containing column names", I can achieve this in two steps:
setDT(df1)
setDT(df2)
df3 <- df1[df2, on = "C"] # step 1 combines the two data.tables
df3[, E := .SD[[.BY[[1]]]], by = D] # step 2
My question is: could we do this in one step? Furthermore, as my data is relatively large, the first step of this solution takes a lot of time. Could we do this in a faster way?
Any suggestions?
Here's how I would do it:
df1[df2, on=.(C), D := i.D][, E := .SD[[.BY$D]], by=D]
A B C D E
1: 1 7 1 A 1
2: 2 8 2 B 8
3: 3 9 1 A 3
4: 4 10 2 B 10
5: 5 11 1 A 5
6: 6 12 2 B 12
This adds the columns to df1 by reference instead of building a new table, so I would expect it to be more efficient than creating df3. Also, since the columns are added to df1, the rows retain their original order.
You can try this; here the C column is used directly as the position of the column in df1 to take the value from:
setDT(df1)
df1[, e := eval(parse(text = names(df1)[C])), by = 1:nrow(df1)]
df1
A B C e
1: 1 7 1 1
2: 2 8 2 8
3: 3 9 1 3
4: 4 10 2 10
5: 5 11 1 5
6: 6 12 2 12
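A base R alternative can be sketched with matrix indexing, which avoids both the join and the per-row grouping. This is only a sketch on plain data frames: the names src and idx are illustrative, and as.matrix() assumes that the candidate columns share a common type (they are all integer here).

```r
df1 <- data.frame(A = 1:6, B = 7:12, C = rep(1:2, 3))
df2 <- data.frame(C = 1:2, D = c("A", "B"), stringsAsFactors = FALSE)

# Name of the source column for every row of df1, looked up through df2
src <- df2$D[match(df1$C, df2$C)]

# Pick one cell per row with a two-column (row, column) index matrix
idx <- cbind(seq_len(nrow(df1)), match(src, names(df1)))
df1$E <- as.matrix(df1)[idx]
df1
```

Because everything is vectorized, no grouping step is needed, which may help on large inputs.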

How to keep rows with the same values in two variables in r?

I have a dataset with several variables, but I want to keep the rows that are the same based on two columns. Here is an example of what I want to do:
a <- c(rep('A',3), rep('B', 3), rep('C',3))
b <- c(1,1,2,4,4,4,5,5,5)
df <- data.frame(a,b)
a b
1 A 1
2 A 1
3 A 2
4 B 4
5 B 4
6 B 4
7 C 5
8 C 5
9 C 5
I know that if I use the duplicated function I can get:
df[!duplicated(df),]
a b
1 A 1
3 A 2
4 B 4
7 C 5
But since the level 'A' on column a does not have a unique value in b, I want to drop both observations to get a new data.frame as this:
a b
4 B 4
7 C 5
I don't mind having repeated values across b, as long as every level of a has a single value in b.
Is there a way to do this? Thanks!
This one maybe?
ag <- aggregate(b~a, df, unique)
ag[lengths(ag$b)==1,]
# a b
#2 B 4
#3 C 5
Maybe something like this:
> ind <- apply(sapply(with(df, split(b,a)), diff), 2, function(x) all(x==0) )
> out <- df[!duplicated(df),]
> out[out$a %in% names(ind)[ind], ]
a b
4 B 4
7 C 5
Here is another option with data.table
library(data.table)
setDT(df)[, if(uniqueN(b)==1) .SD[1L], by = a]
# a b
#1: B 4
#2: C 5
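The same filter can also be sketched in base R with ave(), which repeats a per-group statistic for every row. The names n_distinct and out are illustrative, and the data frame is rebuilt from the question so the block runs on its own.

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 3))
b <- c(1, 1, 2, 4, 4, 4, 5, 5, 5)
df <- data.frame(a, b)

# Number of distinct b values within each level of a, repeated per row
n_distinct <- ave(df$b, df$a, FUN = function(x) length(unique(x)))

# Keep only groups with a single b value, then drop the repeated rows
out <- unique(df[n_distinct == 1, ])
out
```

Unlike the split/diff approach, this does not require the groups to have the same number of rows.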

Count of unique values across all columns in a data frame

We have a data frame as below :
raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
I need a result data frame in the following format :
result<-data.frame(v1=c("A","B","C","D"), v2=c(3,2,2,3))
Used the following code to get the count across one particular column :
count_raw<-sqldf("SELECT DISTINCT(v1) AS V1, COUNT(v1) AS count FROM raw GROUP BY v1")
This would return count of unique values across an individual column.
Any help would be highly appreciated.
Use this
table(unlist(raw))
Output
A B C D
3 2 2 3
For data frame type output wrap this with as.data.frame.table
as.data.frame.table(table(unlist(raw)))
Output
Var1 Freq
1 A 3
2 B 2
3 C 2
4 D 3
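If the column names of the requested result matter, the table can be reshaped by hand. The sketch below assumes the v1/v2 layout from the question and uses the illustrative name counts. stringsAsFactors = FALSE is set so that unlist() works on plain character columns regardless of the R version.

```r
raw <- data.frame(
  v1 = c("A", "B", "C", "D"),
  v2 = c(NA, "B", "C", "A"),
  v3 = c(NA, "A", NA, "D"),
  v4 = c(NA, "D", NA, NA),
  stringsAsFactors = FALSE
)

# Flatten all columns into one vector; table() ignores NA by default
counts <- table(unlist(raw))

# Rebuild the requested layout: value in v1, its count in v2
result <- data.frame(v1 = names(counts), v2 = as.vector(counts))
result
```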
If you want a total count,
sapply(unique(raw[!is.na(raw)]), function(i) length(which(raw == i)))
#A B C D
#3 2 2 3
We can use apply with MARGIN = 1
cbind(raw[1], v2=apply(raw, 1, function(x) length(unique(x[!is.na(x)]))))
If it is for each column
sapply(raw, function(x) length(unique(x[!is.na(x)])))
Or if we need the count based on all the columns, convert to matrix and use the table
table(as.matrix(raw))
# A B C D
# 3 2 2 3
If your data frame contains only character values, as in the example you provided, you can unlist it and use unique to get the distinct values, or count (from plyr) to get the frequencies:
> library(plyr)
> raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
> unique(unlist(raw))
[1] A B C D <NA>
Levels: A B C D
> count(unlist(raw))
x freq
1 A 3
2 B 2
3 C 2
4 D 3
5 <NA> 6

replace values in a column based on another column but following the numeric index from the first replacement

I have a data.frame that looks like the one below. I need to replace the values in the first column based on the values in the second column, but the replacement needs to continue the numbering of column 1, and values in column 1 should only be replaced where ValB != "A".
>df1
ValA ValB
1 A
1 A
2 A
2 A
3 A
3 A
4 A
4 A
1 B
1 B
1 B
2 B
2 B
3 B
4 B
4 B
1 C
1 C
2 C
2 C
3 C
3 C
4 C
1 C
What I want is to replace the values in ValA for the rows where ValB is not A, so that the numbering continues from the values already used. For example, where ValA is 1 and ValB==B, ValA has to become 5, the 2 has to become 6, and so on. The desired output below should make it easier to understand what I am doing. I could write a for loop with if and else if statements, but I am sure that there is a cleaner way.
Desired output
>df1
ValA ValB
1 A
1 A
2 A
2 A
3 A
3 A
4 A
4 A
5 B
5 B
5 B
6 B
6 B
6 B
7 B
7 B
8 C
8 C
9 C
9 C
10 C
10 C
11 C
12 C
You could do something like this. It runs a cumulative sum over a logical vector that flags, for each row, whether ValA or ValB differs from the previous row:
# running count of runs, stored in a new column c
df1$c <- cumsum(c(
  # the first element is simply the first value of ValA
  df1$ValA[1],
  # for rows 2..n, compare each row with the row before it
  sapply(2:nrow(df1), function(index) {
    # TRUE (counted as 1) if ValA or ValB changed, FALSE (0) otherwise
    !(df1$ValA[index] == df1$ValA[index - 1] &
      df1$ValB[index] == df1$ValB[index - 1])
  })
))
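The same run-counting idea can be sketched without sapply() by comparing each column with a shifted copy of itself. The data frame is rebuilt from the question's input, and the names changed and run_id are illustrative. Note that this numbers every run of identical (ValA, ValB) rows, which is the interpretation used in the answer above and is not guaranteed to reproduce the desired output row for row.

```r
df1 <- data.frame(
  ValA = c(1, 1, 2, 2, 3, 3, 4, 4,
           1, 1, 1, 2, 2, 3, 4, 4,
           1, 1, 2, 2, 3, 3, 4, 1),
  ValB = rep(c("A", "B", "C"), each = 8),
  stringsAsFactors = FALSE
)

n <- nrow(df1)

# TRUE where ValA or ValB differs from the previous row;
# the first row has no predecessor, so it never counts as a change
changed <- c(FALSE,
             df1$ValA[-1] != df1$ValA[-n] | df1$ValB[-1] != df1$ValB[-n])

# Start at the first ValA and add 1 at every change
df1$run_id <- df1$ValA[1] + cumsum(changed)
df1
```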

Matching and merging headers in R

In R, I want to match and merge two matrices.
For example,
> A
ID a b c d e f g
1 ex 3 8 7 6 9 8 4
2 am 7 5 3 0 1 8 3
3 ple 8 5 7 9 2 3 1
> B
col1
1 a
2 c
3 e
4 f
Then, I want to match header of matrix A and 1st column of matrix B.
The final result should be a matrix like below.
> C
ID a c e f
1 ex 3 7 9 8
2 am 7 3 1 8
3 ple 8 7 2 3
*(My original data has more than 500 columns and more than 20,000 rows.)
Are there any tips for that? Would really appreciate your help.
*Additionally, if the matrix B looks like the one below instead,
> B
col1 col2 col3 col4
1 a c e f
How to make the matrix C in this case?
You want:
A[, c('ID', B[, 1])]
For the second case, you want to use row number 1 of the second matrix, instead of its first column.
A[, c('ID', B[1, ])]
If B is a data.frame instead of a matrix, the syntax changes somewhat — you can use B$col1 instead of B[, 1], and to select by row, you need to transform the result to a vector, because the result of selecting a row in a data.frame is again a data.frame, i.e. you need to do unlist(B[1, ]).
You can also subset by column name, which keeps the ID header intact:
A[names(A) %in% c("ID", B$col1)]
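Both layouts of B can be handled the same way once the wanted headers are in a plain character vector. The sketch below assumes A and B are data frames rebuilt from the question's example. B2, wanted, C1 and C2 are illustrative names, and intersect() is used so that a header missing from A is skipped rather than raising an error.

```r
A <- data.frame(
  ID = c("ex", "am", "ple"),
  a = c(3, 7, 8), b = c(8, 5, 5), c = c(7, 3, 7), d = c(6, 0, 9),
  e = c(9, 1, 2), f = c(8, 8, 3), g = c(4, 3, 1),
  stringsAsFactors = FALSE
)
B  <- data.frame(col1 = c("a", "c", "e", "f"), stringsAsFactors = FALSE)
B2 <- data.frame(col1 = "a", col2 = "c", col3 = "e", col4 = "f",
                 stringsAsFactors = FALSE)

# Case 1: the wanted headers run down the first column of B
C1 <- A[, c("ID", intersect(B$col1, names(A)))]

# Case 2: the wanted headers run across the first row of B2
wanted <- unname(unlist(B2[1, ]))
C2 <- A[, c("ID", intersect(wanted, names(A)))]
C1
```

Selecting by name touches only the column list, so it should scale to the 500-column, 20,000-row case mentioned in the question.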
