Matching and merging headers in R

In R, I want to match and merge two matrices.
For example,
> A
ID a b c d e f g
1 ex 3 8 7 6 9 8 4
2 am 7 5 3 0 1 8 3
3 ple 8 5 7 9 2 3 1
> B
col1
1 a
2 c
3 e
4 f
Then I want to match the header of matrix A against the first column of matrix B.
The final result should be a matrix like the one below.
> C
ID a c e f
1 ex 3 7 9 8
2 am 7 3 1 8
3 ple 8 7 2 3
(My original data has more than 500 columns and more than 20,000 rows.)
Are there any tips for that? I would really appreciate your help.
Also, what if matrix B looks like this instead?
> B
col1 col2 col3 col4
1 a c e f
How do I make matrix C in this case?

You want:
A[, c('ID', B[, 1])]
For the second case, you want to use row number 1 of the second matrix, instead of its first column.
A[, c('ID', B[1, ])]
If B is a data.frame instead of a matrix, the syntax changes somewhat: you can use B$col1 instead of B[, 1]. To select by row, you need to convert the result to a vector with unlist(B[1, ]), because selecting a row of a data.frame returns another data.frame.
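As a minimal, self-contained sketch of the indexing above, the following rebuilds the example matrices from the question. The names `keep`, `C`, `B2` and `C2` are illustrative and not from the original post. The `intersect()` guard is an added assumption: it drops any name in B that is not a column of A, which would otherwise raise a "subscript out of bounds" error.

```r
# Rebuild the question's example; mixing text and numbers in one matrix
# turns every cell into a character string.
A <- cbind(ID = c("ex", "am", "ple"),
           a = c(3, 7, 8), b = c(8, 5, 5), c = c(7, 3, 7), d = c(6, 0, 9),
           e = c(9, 1, 2), f = c(8, 8, 3), g = c(4, 3, 1))
B <- matrix(c("a", "c", "e", "f"), ncol = 1, dimnames = list(NULL, "col1"))

# Case 1: the wanted names are stored in the first column of B
keep <- intersect(B[, 1], colnames(A))
C <- A[, c("ID", keep)]

# Case 2: the wanted names are stored in the first row of B
B2 <- matrix(c("a", "c", "e", "f"), nrow = 1,
             dimnames = list(NULL, paste0("col", 1:4)))
C2 <- A[, c("ID", intersect(B2[1, ], colnames(A)))]
```

Indexing by name is a single vectorised lookup, so it should scale without trouble to the 500 columns and 20,000 rows mentioned in the question.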

If A and B are data frames rather than matrices, you can also subset by name:
A[names(A) %in% c('ID', B$col1)]


How to skip not completely empty rows in R

So, I'm trying to read an Excel file. Some of the rows are empty for some of the columns, but not for all of them. I want to skip all the rows that are not complete, i.e. that don't have information in all of the columns. For example:
In this case I would like to skip lines 1, 5, 6, 7, 8 and so on.
There is probably a more elegant way of doing it, but a possible solution is to count the number of non-NA elements per row and keep only the rows where that count equals the number of columns.
Using this dummy example:
df <- data.frame(A = LETTERS[1:6],
B = c(sample(1:10,5),NA),
C = letters[1:6])
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
6 F NA f
Using apply, you can count the number of non-NA elements in each row:
v <- apply(df,1, function(x) length(na.omit(x)))
[1] 3 3 3 3 3 2
Then keep only the rows where the number of elements equals the number of columns (these are the complete rows):
df1 <- df[v == ncol(df),]
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
Does this answer your question?
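The counting approach above can also be written with base R's `complete.cases()`, which flags the rows that contain no NA. This is only a sketch: it uses fixed values in place of `sample()` so that it is reproducible, and it assumes the empty spreadsheet cells have already been read in as NA.

```r
# Same shape as the dummy data above, with fixed values instead of sample()
df <- data.frame(A = LETTERS[1:6],
                 B = c(5, 9, 1, 3, 4, NA),
                 C = letters[1:6])

# complete.cases() returns TRUE for every row without any NA
df1 <- df[complete.cases(df), ]

# na.omit() keeps the same rows in a single call
df2 <- na.omit(df)
```

Unlike `apply()`, which first converts the whole data frame to a character matrix, `complete.cases()` works on the columns directly.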

How to keep rows with the same values in two variables in R?

I have a dataset with several variables, but I want to keep the rows that are the same based on two columns. Here is an example of what I want to do:
a <- c(rep('A',3), rep('B', 3), rep('C',3))
b <- c(1,1,2,4,4,4,5,5,5)
df <- data.frame(a,b)
a b
1 A 1
2 A 1
3 A 2
4 B 4
5 B 4
6 B 4
7 C 5
8 C 5
9 C 5
I know that if I use the duplicated function I can get:
df[!duplicated(df),]
a b
1 A 1
3 A 2
4 B 4
7 C 5
But since the level 'A' in column a does not have a unique value in b, I want to drop both observations to get a new data.frame like this:
a b
4 B 4
7 C 5
I don't mind having repeated values across b, as long as every level of a has the same value in b throughout.
Is there a way to do this? Thanks!
This one maybe?
ag <- aggregate(b~a, df, unique)
ag[lengths(ag$b)==1,]
# a b
#2 B 4
#3 C 5
Maybe something like this:
> ind <- apply(sapply(with(df, split(b,a)), diff), 2, function(x) all(x==0) )
> out <- df[!duplicated(df),]
> out[out$a %in% names(ind)[ind], ]
a b
4 B 4
7 C 5
Here is another option with data.table
library(data.table)
setDT(df)[, if(uniqueN(b)==1) .SD[1L], by = a]
# a b
#1: B 4
#2: C 5
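For comparison, here is a base-R sketch of the same idea using `ave()`: count the distinct values of b within each level of a, keep the groups where that count is 1, then drop the duplicate rows. The names `n_distinct` and `out` are illustrative. Unlike the `sapply(..., diff)` variant above, this does not rely on all groups having the same number of rows.

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 3))
b <- c(1, 1, 2, 4, 4, 4, 5, 5, 5)
df <- data.frame(a, b)

# For every row, the number of distinct b values within its group of a
n_distinct <- ave(df$b, df$a, FUN = function(x) length(unique(x)))

# Keep groups with a single distinct b, then remove the repeated rows
out <- unique(df[n_distinct == 1, ])
```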

Vectorise rows of a dataframe, apply vector function, return to original dataframe in R

Given the following df:
a=c('a','b','c')
b=c(1,2,5)
c=c(2,3,4)
d=c(2,1,6)
df=data.frame(a,b,c,d)
a b c d
1 a 1 2 2
2 b 2 3 1
3 c 5 4 6
I'd like to apply a function that normally takes a vector (and returns a vector) like cummax row by row to the columns in position b to d.
Then, I'd like to have the output back in the df, either as a vector in a new column of the df, or replacing the original data.
I'd like to avoid writing it as a for loop that would iterate every row, pull out the content of the cells into a vector, do its thing and put it back.
Is there a more efficient way? I've given the apply family functions a go, but I'm struggling to first get a good way to vectorise content of columns by row and get the right output.
The final output could look something like this (imagining I've applied the cummax() function):
a b c d
1 a 1 2 2
2 b 2 3 3
3 c 5 5 6
or
a b c d output
1 a 1 2 2 (1,2,2)
2 b 2 3 1 (2,3,3)
3 c 5 4 6 (5,5,6)
where output is a vector.
Seems this would just be a simple apply problem that you want to cbind to df:
> cbind(df, apply(df[ , 4:2] # work with columns in reverse order
, 1, # do it row-by-row
cummax) )
a b c d 1 2 3
d a 1 2 2 2 1 6
c b 2 3 1 2 3 6
b c 5 4 6 2 3 6
Ouch. Bitten by failing to notice that this is returned as a column-oriented matrix, so the result needs to be transposed. Such a newbie mistake. But it does show the value of having a question with a reproducible dataset, I suppose.
> cbind(df, t(apply(df[ , 4:2] , 1, cummax) ) )
a b c d d c b
1 a 1 2 2 2 2 2
2 b 2 3 1 1 3 3
3 c 5 4 6 6 6 6
To destructively assign the result to df you would just use:
df <- # .... that code.
This does the concatenation with commas (and as a result no longer needs to be transposed):
> cbind(df, output=apply(df[ , 4:2] , 1, function(x) paste( cummax(x), collapse=",") ) )
a b c d output
1 a 1 2 2 2,2,2
2 b 2 3 1 1,3,3
3 c 5 4 6 6,6,6
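The outputs above take the columns in reverse order (d, c, b). As a sketch of the left-to-right variant that reproduces the two layouts shown in the question, the columns can be taken as b, c, d instead. The names `cols`, `res` and `df_new` are illustrative.

```r
df <- data.frame(a = c("a", "b", "c"),
                 b = c(1, 2, 5), c = c(2, 3, 4), d = c(2, 1, 6))
cols <- c("b", "c", "d")

# apply() returns one column per input row, so transpose back to row layout
res <- t(apply(df[cols], 1, cummax))

# Layout 1: replace the original values in a copy of the data frame
df_new <- df
df_new[cols] <- res

# Layout 2: keep the original values and add one comma-separated column
df$output <- apply(res, 1, paste, collapse = ",")
```

Here `df_new` holds the running maxima in place of b, c and d, while `df` keeps the original values plus the `output` column.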

Count of unique values across all columns in a data frame

We have a data frame as below :
raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
I need a result data frame in the following format :
result<-data.frame(v1=c("A","B","C","D"), v2=c(3,2,2,3))
I used the following code to get the count for one particular column:
count_raw<-sqldf("SELECT DISTINCT(v1) AS V1, COUNT(v1) AS count FROM raw GROUP BY v1")
This returns the count of unique values for an individual column only.
Any help would be highly appreciated.
Use this
table(unlist(raw))
Output
A B C D
3 2 2 3
For data frame type output wrap this with as.data.frame.table
as.data.frame.table(table(unlist(raw)))
Output
Var1 Freq
1 A 3
2 B 2
3 C 2
4 D 3
If you want a total count,
sapply(unique(raw[!is.na(raw)]), function(i) length(which(raw == i)))
#A B C D
#3 2 2 3
We can use apply with MARGIN = 1
cbind(raw[1], v2=apply(raw, 1, function(x) length(unique(x[!is.na(x)]))))
If it is for each column
sapply(raw, function(x) length(unique(x[!is.na(x)])))
Or if we need the count based on all the columns, convert to matrix and use the table
table(as.matrix(raw))
# A B C D
# 3 2 2 3
If your data frame contains only character values, as in your example, you can unlist it and use unique to list the distinct values, or count from plyr to get their frequencies (note that count also reports the NA entries):
> library(plyr)
> raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
> unique(unlist(raw))
[1] A B C D <NA>
Levels: A B C D
> count(unlist(raw))
x freq
1 A 3
2 B 2
3 C 2
4 D 3
5 <NA> 6
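To get exactly the two-column layout of the `result` data frame requested in the question, the `table()` output can be converted and renamed. This is a sketch; the name `counts` is illustrative.

```r
raw <- data.frame(v1 = c("A", "B", "C", "D"), v2 = c(NA, "B", "C", "A"),
                  v3 = c(NA, "A", NA, "D"), v4 = c(NA, "D", NA, NA))

# table() ignores NA by default, so only A, B, C and D are counted
counts <- table(unlist(raw))

# Convert to a data frame and rename the columns to match the question
result <- setNames(as.data.frame(counts, stringsAsFactors = FALSE),
                   c("v1", "v2"))
```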

R delete non max values in redundant rows

I have a matrix that contains following:
A B C D
a 1 3 2 5
b 3 2 5 8
a 2 1 0 9
a 4 2 1 3
c 4 3 1 1
b 2 5 1 9
A, B, C, D are column names and
a, b, c are row names.
I want to make it look like
A B C D
a 4 3 2 9
b 3 5 5 9
c 4 3 1 1
using R, that is:
1) order the rows alphabetically,
2) then, if there are redundant rows (i.e. other rows with the same row name), pick the maximum value among the redundant rows for each column and delete the others.
I first used Python for this, but I was wondering if there is a more convenient way to do it in R.
I would appreciate any help.
You can use data.table (here matrix_in is your input matrix):
library(data.table)
dt_in <- data.table(matrix_in)
dt_in[, name := rownames(matrix_in)]
dt_max <- dt_in[, list(A = max(A), B = max(B), C = max(C), D = max(D)), by = "name"]
as.matrix(data.frame(dt_max))
Here's a one-liner using data.table. You can keep the row names while converting to a data.table, then apply max over all columns with lapply(.SD, ...), grouped by the rn variable (the saved row names):
library(data.table)
data.table(m, keep.rownames = TRUE)[, lapply(.SD, max), by = rn]
# rn A B C D
# 1: a 4 3 2 9
# 2: b 3 5 5 9
# 3: c 4 3 1 1
You can simply use the aggregate function:
aggregate(matrix ~ rownames(matrix), matrix, max)
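As a base-R sketch that keeps the result as a matrix, `tapply()` can take the per-group maximum of each column, grouped by row name. The matrix is called `m` here, as in the data.table one-liner, and `result` is an illustrative name. The groups come back in sorted order, which also covers the alphabetical ordering asked for.

```r
# Rebuild the example matrix with its repeated row names
m <- matrix(c(1, 3, 2, 5,
              3, 2, 5, 8,
              2, 1, 0, 9,
              4, 2, 1, 3,
              4, 3, 1, 1,
              2, 5, 1, 9),
            ncol = 4, byrow = TRUE,
            dimnames = list(c("a", "b", "a", "a", "c", "b"),
                            c("A", "B", "C", "D")))

# For each column, take the maximum within each group of equal row names
result <- apply(m, 2, function(col) tapply(col, rownames(m), max))
```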
