How to skip rows that are not completely filled in R

I'm trying to read an Excel file. Some of the rows are empty in some of the columns, but not in all of them. I want to skip every row that is not complete, i.e., that doesn't have a value in every column. For example:
In this case I would like to skip lines 1, 5, 6, 7, 8 and so on.

There is probably a more elegant way of doing it, but one possible solution is to count the number of non-NA elements in each row and keep only the rows where that count equals the number of columns.
Using this dummy example:
df <- data.frame(A = LETTERS[1:6],
B = c(sample(1:10,5),NA),
C = letters[1:6])
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
6 F NA f
Using apply, you can count the number of non-NA elements in each row:
v <- apply(df,1, function(x) length(na.omit(x)))
[1] 3 3 3 3 3 2
Then keep only the rows whose count equals the number of columns (which corresponds to the complete rows):
df1 <- df[v == ncol(df),]
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
Does it answer your question?
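As a possibly more compact variant of the same idea, base R's complete.cases returns TRUE for every row that contains no NA. The sketch below assumes the empty Excel cells were read in as NA; the B values are hard-coded to match the output shown above rather than sampled.

```r
# Dummy data matching the printed example (B is fixed instead of sampled)
df <- data.frame(A = LETTERS[1:6],
                 B = c(5, 9, 1, 3, 4, NA),
                 C = letters[1:6])

# Keep only the rows that have a value in every column
complete_rows <- df[complete.cases(df), ]

# na.omit() applied to a data frame drops the same rows
same_result <- na.omit(df)

complete_rows
```

Both calls drop row 6 here, the only row with an NA.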

Related

Count of unique values across all columns in a data frame

We have a data frame as below:
raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
I need a result data frame in the following format:
result<-data.frame(v1=c("A","B","C","D"), v2=c(3,2,2,3))
I used the following code to get the count for one particular column:
count_raw<-sqldf("SELECT DISTINCT(v1) AS V1, COUNT(v1) AS count FROM raw GROUP BY v1")
This returns the count of unique values for a single column only.
Any help would be highly appreciated.
Use this
table(unlist(raw))
Output
A B C D
3 2 2 3
For data frame type output wrap this with as.data.frame.table
as.data.frame.table(table(unlist(raw)))
Output
Var1 Freq
1 A 3
2 B 2
3 C 2
4 D 3
If you want a total count,
sapply(unique(raw[!is.na(raw)]), function(i) length(which(raw == i)))
#A B C D
#3 2 2 3
We can use apply with MARGIN = 1 to count the unique non-NA values in each row
cbind(raw[1], v2=apply(raw, 1, function(x) length(unique(x[!is.na(x)]))))
If the count is needed for each column
sapply(raw, function(x) length(unique(x[!is.na(x)])))
Or if we need the count based on all the columns, convert to matrix and use the table
table(as.matrix(raw))
# A B C D
# 3 2 2 3
If your data frame contains only character values, as in the example you provided, you can unlist it and use unique to get the distinct values or, to count the frequencies, use count from plyr
> library(plyr)
> raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
> unique(unlist(raw))
[1] A B C D <NA>
Levels: A B C D
> count(unlist(raw))
x freq
1 A 3
2 B 2
3 C 2
4 D 3
5 <NA> 6
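For reference, the overall counts can be sketched in base R without plyr, including the NA tally. This is only an illustration: the names values, counts and result are made up here, and the renaming to v1/v2 simply mirrors the result format requested in the question.

```r
raw <- data.frame(v1 = c("A", "B", "C", "D"),
                  v2 = c(NA, "B", "C", "A"),
                  v3 = c(NA, "A", NA, "D"),
                  v4 = c(NA, "D", NA, NA))

# Flatten all columns into a single vector
values <- unlist(raw, use.names = FALSE)

# Frequency of each value; useNA = "ifany" also reports the NA count
counts <- table(values, useNA = "ifany")

# Data frame in the requested shape (NA is dropped by default)
result <- as.data.frame(table(values), stringsAsFactors = FALSE)
names(result) <- c("v1", "v2")
result
```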

Return first occurring 2nd largest value in data frame rows using colnames & apply

Suppose I have a data frame:
> editor
A B C D E F G H I J
User1 1 0 5 6 5 6 5 6 2 6
User2 0 5 4 6 4 5 5 1 7 5
I want to store the column name of the first-occurring second-largest value in each of the rows above.
Expected results
> editor
A B C D E F G H I J 2nd_highest
User1 1 0 5 6 5 6 5 6 2 6 C
User2 0 5 4 6 4 5 5 1 7 5 D
I tried edited$2nd_highest <- colnames(edited)[apply(edited, 1, which.max)+1] but it didn't work well.
Any ideas?
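Here's an attempt to achieve this using algebra, in order to keep it vectorized and avoid by-row operations (though it still does a matrix conversion, similar to apply). The idea is to find the maximum of each row and subtract it from every value in that row, multiply by -1 so that the differences are non-negative, and take the log, which turns the largest value into -Inf (i.e., the smallest value). Taking 1/result then makes the value closest to the maximum the largest of the values that are left, and max.col with ties.method = "first" returns its first occurrence. Note that this relies on the values being integers: a gap smaller than 1 between the two largest values would produce a negative log and break the ordering.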
indx <- max.col(1/log((editor - editor[cbind(1:nrow(editor),
max.col(editor))]) * -1), ties.method = "first")
names(editor)[indx]
# [1] "C" "D"
Here is an idea. We first sort the unique values of each row and extract the second value. Since we specify decreasing = TRUE, the second value is the second highest. We then use the first value of each element of the resulting list as the index for the column names:
ind_lst <- apply(df, 1, function(i) which(i == sort(unique(i), decreasing = TRUE)[2]))
df$highest.two <- names(df)[unlist(lapply(ind_lst, '[', 1))]
df
# A B C D E F G H I J highest.two
#User1 1 0 5 6 5 6 5 6 2 6 C
#User2 0 5 4 6 4 5 5 1 7 5 D
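The same idea can be wrapped in a small helper, sketched below under the assumption that the data are numeric. The function name second_largest_col is hypothetical, and the data frame is rebuilt from the values shown in the question. match returns the first position of a value, which gives the first-occurring column directly.

```r
editor <- data.frame(
  A = c(1, 0), B = c(0, 5), C = c(5, 4), D = c(6, 6), E = c(5, 4),
  F = c(6, 5), G = c(5, 5), H = c(6, 1), I = c(2, 7), J = c(6, 5),
  row.names = c("User1", "User2")
)

# Hypothetical helper: position of the first occurrence of the
# second-largest distinct value in x (NA if there is no such value)
second_largest_col <- function(x) {
  vals <- sort(unique(x), decreasing = TRUE)
  if (length(vals) < 2) return(NA_integer_)
  match(vals[2], x)
}

idx <- apply(as.matrix(editor), 1, second_largest_col)
second_highest <- names(editor)[idx]
second_highest
```

Returning NA for rows with a single distinct value is a design choice made here; the original answers do not discuss that case.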
This can help you:
mat <- matrix(sample(1:8, 24, replace=TRUE), ncol=6)
mat
# [1] keeps only the first position when the second-highest value is tied
sec_highest <- apply(mat, 1, function(x) which(x == max(x[x != max(x)]))[1])
LETTERS[sec_highest] # letters display
Note that if a row has two second-highest values with the same score, only the first one will be displayed.

Counting number of unique rows that have repeated records in one column

This is what my dataframe looks like:
a <- c(1,1,4,4,5)
b <- c(1,2,3,3,5)
c <- c(1,4,4,4,5)
d <- c(2,2,4,4,5)
e <- c(1,5,3,3,5)
df <- data.frame(a,b,c,d,e)
I'd like to write something that returns all unique instances of vectors a,b,c,d that have a repeated value in vector e.
For example:
a b c d e
1 1 1 1 2 1
2 1 2 4 2 5
3 4 3 4 4 3
4 4 3 4 4 3
5 5 5 5 5 5
Rows 3 and 4 are exactly the same up to vector d (both have the combination 4344), so only one instance of them should be returned, but they have 2 repeated values in vector e. I would like to get a count of those, i.e. the combination 4344 has 2 repeated values in vector e.
The expected output would be how many times a certain combination such as 4344 had repeated values in vector e. So in this case it would be something like:
a b c d e
4 3 4 4 2
Both R and SQL work, whatever does the job.
Again, see my comments above, but I believe the following gives you a start on your first question. First, create a "key" variable (here named key_abcd) by using tidyr::unite to unite columns a, b, c, and d. Then count up e by this key_abcd variable. The group_by is implicit.
library(tidyr)
library(dplyr)
df <- data.frame(a,b,c,d,e)
df %>%
unite(key_abcd, a, b, c, d) %>%
count(key_abcd, e)
# key_abcd e n
# (chr) (dbl) (int)
# 1 1_1_1_2 1 1
# 2 1_2_4_2 5 1
# 3 4_3_4_4 3 2
# 4 5_5_5_5 5 1
From how you've worded the question, it appears you are only interested in combinations that occur more than once; if so, you could add %>% filter(n > 1) to the above code.
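For reference, here is a minimal base R sketch of the same count that avoids the package dependency. It is an alternative to the unite/count pipeline above, not that pipeline itself; the helper column n and the names counts and repeated are illustrative.

```r
a <- c(1, 1, 4, 4, 5)
b <- c(1, 2, 3, 3, 5)
c <- c(1, 4, 4, 4, 5)
d <- c(2, 2, 4, 4, 5)
e <- c(1, 5, 3, 3, 5)
df <- data.frame(a, b, c, d, e)

# Helper column: every row counts once
df$n <- 1

# Sum the helper column within each distinct (a, b, c, d, e) combination
counts <- aggregate(n ~ a + b + c + d + e, data = df, FUN = sum)

# Keep only the combinations that occur more than once
repeated <- counts[counts$n > 1, ]
repeated
```

With the example data, only the combination 4, 3, 4, 4 with e = 3 remains, with n = 2.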

Matching and merging headers in R

In R, I want to match and merge two matrices.
For example,
> A
ID a b c d e f g
1 ex 3 8 7 6 9 8 4
2 am 7 5 3 0 1 8 3
3 ple 8 5 7 9 2 3 1
> B
col1
1 a
2 c
3 e
4 f
Then, I want to match the header of matrix A against the first column of matrix B.
The final result should be a matrix like below.
> C
ID a c e f
1 ex 3 7 9 8
2 am 7 3 1 8
3 ple 8 7 2 3
*(My original data has more than 500 columns and more than 20,000 rows.)
Are there any tips for that? Would really appreciate your help.
*In addition, if matrix B looks like the one below,
> B
col1 col2 col3 col4
1 a c e f
how can I build matrix C in this case?
You want:
A[, c('ID', B[, 1])]
For the second case, you want to use row number 1 of the second matrix, instead of its first column.
A[, c('ID', B[1, ])]
If B is a data.frame instead of a matrix, the syntax changes somewhat: you can use B$col1 instead of B[, 1]. To select by row, you need to convert the result to a vector, because selecting a row of a data.frame returns another data.frame, i.e. you need to use unlist(B[1, ]).
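A small runnable sketch of the data.frame case, with A and B rebuilt from the values in the question. The name B_wide for the one-row variant is made up here, and as.character is added as a precaution in case the columns were read in as factors.

```r
A <- data.frame(ID = c("ex", "am", "ple"),
                a = c(3, 7, 8), b = c(8, 5, 5), c = c(7, 3, 7),
                d = c(6, 0, 9), e = c(9, 1, 2), f = c(8, 8, 3),
                g = c(4, 3, 1))

# Case 1: the wanted column names are stored in one column of B
B <- data.frame(col1 = c("a", "c", "e", "f"))
C <- A[, c("ID", as.character(B$col1))]

# Case 2: the wanted column names are stored in one row of B
B_wide <- data.frame(col1 = "a", col2 = "c", col3 = "e", col4 = "f")
C_wide <- A[, c("ID", as.character(unlist(B_wide[1, ])))]

C
```

Selecting by name raises an error if B contains a name that is not a column of A; intersect(B$col1, names(A)) can be used first if that may happen.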
You can use a subset (naming the first argument keeps the ID column name):
cbind(ID = A$ID, A[names(A) %in% B$col1])

R delete non max values in redundant rows

I have a matrix that contains following:
A B C D
a 1 3 2 5
b 3 2 5 8
a 2 1 0 9
a 4 2 1 3
c 4 3 1 1
b 2 5 1 9
A, B, C, D are column names and
a, b, c are row names.
I want to make it look like
A B C D
a 4 3 2 9
b 3 5 5 9
c 4 3 1 1
in R, that is, to
1) order the row in alphabetical order,
2) and then if there are redundant rows (i.e. there are other rows with the same row name), pick a maximum value among the redundant rows for each column and delete the others.
I first did this in Python, but I was wondering if there is a more convenient way to do it in R.
I would appreciate any help.
You can use data.table
library(data.table)
dt_in <- data.table(matrix_in)
dt_in[, name := rownames(matrix_in)]
dt_max <- dt_in[, list(A = max(A), B = max(B), C = max(C), D = max(D)), by = "name"]
as.matrix(data.frame(dt_max))
Here's a one-liner using data.table: you can keep the row names while converting to data.table (keep.rownames = TRUE) and then apply the max function over all columns using lapply(.SD, ...), grouped by the rn variable (the saved row names).
library(data.table)
data.table(m, keep.rownames = TRUE)[, lapply(.SD, max), by = rn]
# rn A B C D
# 1: a 4 3 2 9
# 2: b 3 5 5 9
# 3: c 4 3 1 1
You can simply use the aggregate function (here matrix is the name of the input matrix):
aggregate(matrix ~ rownames(matrix), matrix, max)
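For completeness, a base R sketch that keeps the result as a matrix. It is written independently of the answers above; the matrix is rebuilt from the question, and the names m, groups and result are illustrative.

```r
m <- matrix(c(1, 3, 2, 5,
              3, 2, 5, 8,
              2, 1, 0, 9,
              4, 2, 1, 3,
              4, 3, 1, 1,
              2, 5, 1, 9),
            ncol = 4, byrow = TRUE,
            dimnames = list(c("a", "b", "a", "a", "c", "b"),
                            c("A", "B", "C", "D")))

# Distinct row names in alphabetical order
groups <- sort(unique(rownames(m)))

# For each row name, take the column-wise maximum over its rows;
# drop = FALSE keeps single-row groups as matrices
result <- t(sapply(groups, function(g) {
  apply(m[rownames(m) == g, , drop = FALSE], 2, max)
}))
result
```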
