Return first occurring 2nd largest value in data frame rows using colnames & apply - r

Consider i have a df
> editor
A B C D E F G H I J
User1 1 0 5 6 5 6 5 6 2 6
User2 0 5 4 6 4 5 5 1 7 5
I want to store the column name of the first occuring 2nd largest value in above rows.
Expected results
> editor
A B C D E F G H I J 2nd_highest
User1 1 0 5 6 5 6 5 6 2 6 C
User2 0 5 4 6 4 5 5 1 7 5 D
i tried edited$2nd_highest <- colnames(edited)[apply(edited, 1, which.max)+1] but did'nt worked well .
Any ideas ?

Here's an attempt to achieve this using algebra in order to keep it vectorized and avoid by row operations (though it still does a matrix conversion similar to apply). The idea here is to find the maximum- then reduce it from the data set, then convert to log (after multiplying by -1) which will result in the largest value becoming -Inf (meaning the smallest value) and then do 1/result in order to find the largest value out of the values left.
indx <- max.col(1/log((editor - editor[cbind(1:nrow(editor),
max.col(editor))]) * -1), ties.method = "first")
names(editor)[indx]
# [1] "C" "D"

Here is an idea. We first sort the unique values of each row and extract the second value. Since we specify decreasing = TRUE, then the second value will be the second highest. We then use the first value of each element of the new list as the index for the column names
ind_lst <- apply(df, 1, function(i) which(i == sort(unique(i), decreasing = TRUE)[2]))
df$highest.two <- names(df)[unlist(lapply(ind_lst, '[', 1))]
df
# A B C D E F G H I J highest.two
#User1 1 0 5 6 5 6 5 6 2 6 C
#User2 0 5 4 6 4 5 5 1 7 5 D

This can help you:
mat <- matrix(sample(1:8, 24, replace=TRUE), ncol=6)
mat
sec_highest <- apply(mat, 1, function(x) which(x == max(x[which(x != max(x))])))
LETTERS[sec_highest] # letters display
Note that if you have two second highests with same scores, only one will be displayed.

Related

How to skip not completly empty rows in r

So, I'm trying to read a excel files. What happens is that some of the rows are empty for some of the columns but not for all of them. I want to skip all the rows that are not complete, i.e., that don't have information in all of the columns. For example:
In this case I would like to skip the lines 1,5,6,7,8 and so on.
There is probably more elegant way of doing it, but a possible solution is to count the number of elements per rows that are not NA and keep only rows with the number of elements equal to the number of columns.
Using this dummy example:
df <- data.frame(A = LETTERS[1:6],
B = c(sample(1:10,5),NA),
C = letters[1:6])
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
6 F NA f
Using apply, you can for each rows count the number of elements without NA:
v <- apply(df,1, function(x) length(na.omit(x)))
[1] 3 3 3 3 3 2
And then, keep only rows with the number of elements equal to the number of columns (which correspond to complete rows):
df1 <- df[v == ncol(df),]
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
Does it answer your question ?

Filtering observations using multivariate column conditions

I'm not very experienced R user, so seek advice how to optimize what I've build and in which direction to move on.
I have one reference data frame, it contains four columns with integer values and one ID.
df <- matrix(ncol=5,nrow = 10)
colnames(df) <- c("A","B","C","D","ID")
# df
for (i in 1:10){
df[i,1:4] <- sample(1:5,4, replace = TRUE)
}
df <- data.frame(df)
df$ID <- make.unique(rep(LETTERS,length.out=10),sep='')
df
A B C D ID
1 2 4 3 5 A
2 5 1 3 5 B
3 3 3 5 3 C
4 4 3 1 5 D
5 2 1 2 5 E
6 5 4 4 5 F
7 4 4 3 3 G
8 2 1 5 5 H
9 4 4 1 3 I
10 4 2 2 2 J
Second data frame has manual input, it's user input, I want to turn it into shiny app later on, that's why also I'm asking for optimization, because my code doesn't seem very neat to me.
df.man <- data.frame(matrix(ncol=5,nrow=1))
colnames(df.man) <- c("A","B","C","D","ID")
df.man$ID <- c("man")
df.man$A <- 4
df.man$B <- 4
df.man$C <- 3
df.man$D <- 4
df.man
A B C D ID
4 4 3 4 man
I want to filter rows from reference sequentially, following the rules:
If there is exact match in a whole row between reference table and manual than extract this(those) from reference and show me that row, if not then reduce number of matching columns from right to left until there is a match but not between less then two variables(columns A,B).
So with my limited knowledge I've wrote this:
# subtraction manual from reference
df <- df %>% dplyr::mutate(Adiff=A-df.man$A)%>%
dplyr::mutate(Bdiff=B-df.man$B)%>%
dplyr::mutate(Cdiff=C-df.man$C) %>%
dplyr::mutate(Ddiff=D-df.man$D)
# check manually how much in a row has zero difference and filter those
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0),
"less then two exact match")
))
tbl_df(df0[,1:5])
# A tibble: 1 x 5
A B C D ID
<int> <int> <int> <int> <chr>
1 4 4 3 3 G
It works and found ID G but looks ugly to me. So the first question is - What would be recommended way to improve this? Are there any functions, packages or smth I'm missing?
Second question - I want to complicate condition.
Imagine we have reference data set.
A B C D ID
2 4 3 5 A
5 1 3 5 B
3 3 5 3 C
4 3 1 5 D
2 1 2 5 E
5 4 4 5 F
4 4 3 3 G
2 1 5 5 H
4 4 1 3 I
4 2 2 2 J
Manual input is
A B C D ID
4 4 2 2 man
Filtering rules should be following:
If there is exact match in a whole row between reference table and manual than extract this(those) from reference and show me that row, if not then reduce number of matching columns from right to left until there is a match but not between less then two variables(columns A,B).
From those rows where I have only two variable matches filter those which has ± 1 difference in columns to the right. So I should have filtered case G and I from reference table from the example above.
keep going the way I did above, I would do the following:
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)))>0,
df01 <- df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)),
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1)))>0,
df01<- df0%>%filter(Cdiff %in% (-1:1)),
"NA"))
It will be about 11 columns at the end, but I assume it doesn't matter so much.
Keeping in mind this objective - how would you suggest to proceed?
Thanks!
This is a lot to sort through, but I have some ideas that might be helpful.
First, you could keep your df a matrix, and use row names for your letters. Something like:
set.seed(2)
df
A B C D
A 5 1 5 1
B 4 5 1 2
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
F 1 5 5 2
G 2 3 4 3
H 1 1 5 1
I 2 4 5 5
J 4 2 5 5
And for demonstration, you could use a vector for manual as this is input:
# Complete match example
vec.man <- c(3, 1, 5, 3)
To check for complete matches between the manual input and reference (all 4 columns), with all numbers, you can do:
df[apply(df, 1, function(x) all(x == vec.man)), ]
A B C D
3 1 5 3
If you don't have a complete match, would calculate differences between df and vec.man:
# Change example vec.man
vec.man <- c(3, 1, 5, 2)
df.diff <- sweep(df, 2, vec.man)
A B C D
A 2 0 0 -1
B 1 4 -4 0
C 0 0 -2 0
D 0 0 -4 2
E 0 0 0 1
F -2 4 0 0
G -1 2 -1 1
H -2 0 0 -1
I -1 3 0 3
J 1 1 0 3
The diffs that start with and continue with 0 will be your best matches (same as looking from right to left iteratively). Then, your best match is the column of the first non-zero element in each row:
df.best <- apply(df.diff, 1, function(x) which(x!=0)[1])
A B C D E F G H I J
1 1 3 3 4 1 1 1 1 1
You can see that the best match is E which was non-zero in the 4th column (last column did not match). You can extract rows that have 4 in df.best as your best matches:
df.match <- df[which(df.best == max(df.best, na.rm = T)), ]
A B C D
3 1 5 3
Finally, if you want all the rows with closest match +/- 1 if only 2 match, you could check for number of best matches (should be 3). Then, compare differences with vector c(0,0,1) which would imply 2 matches then 3rd column off by +/- 1:
# Example vec.man with only 2 matches
vec.man <- c(3, 1, 6, 9)
> df.match
A B C D
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
if (max(df.best, na.rm = T) == 3) {
vec.alt = c(0, 0, 1)
df[apply(df.diff[,1:3], 1, function(x) all(abs(x) == vec.alt)), ]
}
A B C D
3 1 5 3
This should be scalable for 11 columns and 4 matches.
To generalize for different numbers of columns, #IlyaT suggested:
n.cols <- max(df.best, na.rm=TRUE)
vec.alt <- c(rep(0, each=n.cols-1), 1)

replace values in a column based on another column but following the numeric index from the first replacement

I have a data.frame that looks like the one above. I need to replace the values in the first columns based on the values on second column but the replacement need to continue the numeric value of column 1, and only replacing the values in column 1 when !ValB==A
>df1
ValA ValB
1 A
1 A
2 A
2 A
3 A
3 A
4 A
4 A
1 B
1 B
1 B
2 B
2 B
3 B
4 B
4 B
1 C
1 C
2 C
2 C
3 C
3 C
4 C
1 C
What I want is replace the values in column1 but using ValB==B as the index for replacing the values in ValA. The replacement has to continue the values in ValA, i.e, when there is a 1 and the ValB==B the ValA has to be 5, the 2 has to be 6 and so on. Please here is the desired output, what will make easier to understand what I am doing. I could do a for loop with if and elseif statement but I am sure that there is a cleaner way,
Desired output
>df1
ValA ValB
1 A
1 A
2 A
2 A
3 A
3 A
4 A
4 A
5 B
5 B
5 B
6 B
6 B
6 B
7 B
7 B
8 C
8 C
9 C
9 C
10 C
10 C
11 C
12 C
You could do something like this. It basically runs a cumulative sum over a boolean vector which tells you whether ValA and ValB of one row are equal to the one of the previous row -
# do a running sum of the values
df$c = cumsum(
c(
# first value of the result is the same value as the first value of A
df$ValA[1],
# go through the second to the last value of the vector and compared it to the first to the n - 1th values
sapply(
2:nrow(df),
function(index) {
# look for change in value of A and B both
# if changed then return 1, else return 0
!(
df$ValA[index] == df$ValA[index - 1] &
df$ValB[index] == df$ValB[index - 1]
)
}
)
))

Matching and merging headers in R

In R, I want to match and merge two matrices.
For example,
> A
ID a b c d e f g
1 ex 3 8 7 6 9 8 4
2 am 7 5 3 0 1 8 3
3 ple 8 5 7 9 2 3 1
> B
col1
1 a
2 c
3 e
4 f
Then, I want to match header of matrix A and 1st column of matrix B.
The final result should be a matrix like below.
> C
ID a c e f
1 ex 3 7 9 8
2 am 7 3 1 8
3 ple 8 7 2 3
*(My original data has more than 500 columns and more than 20,000 rows.)
Are there any tips for that? Would really appreciate your help.
*In advance, if the matrix B is like below,
> B
col1 col2 col3 col4
1 a c e f
How to make the matrix C in this case?
You want:
A[, c('ID', B[, 1])]
For the second case, you want to use row number 1 of the second matrix, instead of its first column.
A[, c('ID', B[1, ])]
If B is a data.frame instead of a matrix, the syntax changes somewhat — you can use B$col1 instead of B[, 1], and to select by row, you need to transform the result to a vector, because the result of selecting a row in a data.frame is again a data.frame, i.e. you need to do unlist(B[1, ]).
You can use a subset:
cbind(A$ID, A[names(A) %in% B$col1])

Generating random number by length of blocks of data in R data frame

I am trying to simulate n times the measuring order and see how measuring order effects my study subject. To do this I am trying to generate integer random numbers to a new column in a dataframe. I have a big dataframe and i would like to add a column into the dataframe that consists a random number according to the number of observations in a block.
Example of data(each row is an observation):
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5))
A B C
1 1 x 1
2 1 b 2
3 1 c 4
4 2 g 1
5 2 h 5
6 3 g 7
7 3 g 1
8 3 u 2
9 3 l 5
What I'd like to do is add a D column and generate random integer numbers according to the length of each block. Blocks are defined in column A.
Result should look something like this:
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5),
D=c(2,1,3,2,1,4,3,1,2))
> df
A B C D
1 1 x 1 2
2 1 b 2 1
3 1 c 4 3
4 2 g 1 2
5 2 h 5 1
6 3 g 7 4
7 3 g 1 3
8 3 u 2 1
9 3 l 5 2
I have tried to use R:s sample() function to generate random numbers but my problem is splitting the data according to block length and adding the new column. Any help is greatly appreciated.
It can be done easily with ave
df$D <- ave( df$A, df$A, FUN = function(x) sample(length(x)) )
(you could replace length() with max(), or whatever, but length will work even if A is not numbers matching the length of their blocks)
This is really easy with ddply from plyr.
ddply(df, .(A), transform, D = sample(length(A)))
The longer manual version is:
Use split to split the data frame by the first column.
split_df <- split(df, df$A)
Then call sample on each member of the list.
split_df <- lapply(split_df, function(df)
{
df$D <- sample(nrow(df))
df
})
Then recombine with
df <- do.call(rbind, split_df)
One simple way:
df$D = 0
counts = table(df$A)
for (i in 1:length(counts)){
df$D[df$A == names(counts)[i]] = sample(counts[i])
}

Resources