Subset a dataset to leave the largest 2 values - r

I have a data set:
col1 col2
A 3
A 3
B 2
C 1
B 2
A 3
D 5
B 2
D 5
B 2
F 0
F 0
A 3
C 1
C 1
How can I subset it so as to "leave" the rows with the top 2 col2 values? So my output is this:
col1 col2
A 3
A 3
A 3
D 5
A 3
I have viewed this question, but it didn't answer my question.

Try this, but not sure why you only have one D:
newdf <- df[df$col2 %in% sort(unique(df$col2), decreasing = TRUE)[1:2], ]

I assume that your data is in a data.frame.
First of all, you need the top 2 values of col2. To get them, take the unique values of the column, sort them in decreasing order, and keep the first two elements:
col2Values <- unique(df$col2)
top2Elements <- sort(col2Values,decreasing = TRUE)[c(1,2)]
Now that you know the top 2 values, you just need to check where they appear in col2. This can be done via:
df[df$col2 %in% top2Elements,]
Update: Now it should work, I had some typos in there.
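For completeness, here is the whole approach as one reproducible sketch using the sample data from the question (the names `df`, `top2`, and `newdf` are my own):

```r
# Sample data from the question
df <- data.frame(col1 = c("A","A","B","C","B","A","D","B","D","B","F","F","A","C","C"),
                 col2 = c(3, 3, 2, 1, 2, 3, 5, 2, 5, 2, 0, 0, 3, 1, 1))

# Top 2 distinct values of col2 (here 5 and 3)
top2 <- sort(unique(df$col2), decreasing = TRUE)[1:2]

# Keep every row whose col2 is among those values
newdf <- df[df$col2 %in% top2, ]
newdf
```

Note that this keeps both D rows, since both have col2 = 5; the expected output in the question shows only one D, which is what the first answer's comment is pointing at.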

Related

I want to eliminate duplicates in a variable but only within a certain group of values in R

Not an extremely proficient programmer here so bear with me.
I want to eliminate duplicates in variable 'B', but only within the same values of variable 'A'. That is, I want to keep only one 'a' for the group of 1's without eliminating the 'a' in the group of 2's.
A <- c(1,1,1,2,2,2)
B <- c('a','b','a','c','a','d')
ab <- cbind(A,B)
AB <- as.data.frame(ab)
Thank you beforehand! Hope it was clear enough.
You may also want to take a look at the duplicated() function. Your example
a <- c(1,1,1,2,2,2)
b <- c('a','b','a','c','a','d')
ab <- cbind(a,b)
ab_df <- as.data.frame(ab)
gives you the following data frame:
> ab_df
a b
1 1 a
2 1 b
3 1 a
4 2 c
5 2 a
6 2 d
Obviously row 3 duplicates row 1. duplicated(ab_df) returns a logical vector indicating duplicated rows:
> duplicated(ab_df)
[1] FALSE FALSE TRUE FALSE FALSE FALSE
This in turn could be used to eliminate the duplicated rows from your original data frame:
> d <- duplicated(ab_df)
> ab_df[!d, ]
a b
1 1 a
2 1 b
4 2 c
5 2 a
6 2 d
You may use unique, which removes the duplicated rows of your data frame.
ab <- unique(ab)
ab
# A B
# 1 1 a
# 2 1 b
# 4 2 c
# 5 2 a
# 6 2 d
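Both answers compare entire rows. If the real data has more columns than just A and B, that would treat rows with differing extra values as distinct. A sketch restricting the duplicate check to the A/B pair only (the extra column `C` here is hypothetical):

```r
A <- c(1, 1, 1, 2, 2, 2)
B <- c('a', 'b', 'a', 'c', 'a', 'd')
C <- c(10, 20, 30, 40, 50, 60)  # hypothetical extra column
ab_df <- data.frame(A, B, C)

# Mark rows whose (A, B) pair has already been seen, then drop them;
# row 3 is removed even though its C value differs from row 1's
keep <- !duplicated(ab_df[c("A", "B")])
ab_df[keep, ]
```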

How to skip not completely empty rows in R

So, I'm trying to read an Excel file. What happens is that some of the rows are empty for some of the columns but not for all of them. I want to skip all the rows that are not complete, i.e., that don't have information in all of the columns. For example:
In this case I would like to skip the lines 1,5,6,7,8 and so on.
There is probably a more elegant way of doing it, but a possible solution is to count the number of non-NA elements per row and keep only the rows where that count equals the number of columns.
Using this dummy example:
df <- data.frame(A = LETTERS[1:6],
                 B = c(sample(1:10, 5), NA),
                 C = letters[1:6])
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
6 F NA f
Using apply, you can count, for each row, the number of elements that are not NA:
v <- apply(df,1, function(x) length(na.omit(x)))
[1] 3 3 3 3 3 2
And then keep only the rows whose count equals the number of columns (i.e., the complete rows):
df1 <- df[v == ncol(df),]
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
Does that answer your question?
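Base R also ships complete.cases() and na.omit() for exactly this, which avoids the apply step. A sketch on the same kind of dummy data (fixed values instead of sample() so the result is reproducible):

```r
df <- data.frame(A = LETTERS[1:6],
                 B = c(5, 9, 1, 3, 4, NA),
                 C = letters[1:6])

# complete.cases() returns TRUE for rows with no NA in any column
df1 <- df[complete.cases(df), ]

# na.omit() keeps the same rows in a single call
df2 <- na.omit(df)
```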

Count of unique values across all columns in a data frame

We have a data frame as below:
raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
I need a result data frame in the following format :
result<-data.frame(v1=c("A","B","C","D"), v2=c(3,2,2,3))
I used the following code to get the count across one particular column:
count_raw<-sqldf("SELECT DISTINCT(v1) AS V1, COUNT(v1) AS count FROM raw GROUP BY v1")
This would return count of unique values across an individual column.
Any help would be highly appreciated.
Use this
table(unlist(raw))
Output
A B C D
3 2 2 3
For data-frame output, wrap this in as.data.frame.table:
as.data.frame.table(table(unlist(raw)))
Output
Var1 Freq
1 A 3
2 B 2
3 C 2
4 D 3
If you want a total count,
sapply(unique(raw[!is.na(raw)]), function(i) length(which(raw == i)))
#A B C D
#3 2 2 3
We can use apply with MARGIN = 1
cbind(raw[1], v2=apply(raw, 1, function(x) length(unique(x[!is.na(x)]))))
If it is for each column
sapply(raw, function(x) length(unique(x[!is.na(x)])))
Or if we need the count based on all the columns, convert to matrix and use the table
table(as.matrix(raw))
# A B C D
# 3 2 2 3
If you have only character values in your data frame, as in the example you've provided, you can unlist it and use unique, or, to count the frequencies, use count from plyr:
> library(plyr)
> raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
> unique(unlist(raw))
[1] A B C D <NA>
Levels: A B C D
> count(unlist(raw))
x freq
1 A 3
2 B 2
3 C 2
4 D 3
5 <NA> 6
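A reproducible run of the table(unlist(...)) idiom from the accepted answer, reshaped into the exact result format the asker requested (the column names v1/v2 match their result data frame):

```r
raw <- data.frame(v1 = c("A","B","C","D"),
                  v2 = c(NA,"B","C","A"),
                  v3 = c(NA,"A",NA,"D"),
                  v4 = c(NA,"D",NA,NA))

counts <- table(unlist(raw))   # NA entries are dropped by default
result <- data.frame(v1 = names(counts),
                     v2 = as.vector(counts))
result
```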

removing gene duplicates from heatmap in r

I have drawn heat maps from a microarray expression data set, and in the heatmaps I see duplicates and triplicates for many of the genes I am interested in.
I am very new to R; is there a way to remove these duplicates or triplicates of genes?
For example, I see the name of one gene, say BMP1, 2 or 3 times in the heatmap.
Kindly suggest some solutions.
Regards
Ram
I'll try to guess what you need, but it will be better if you give an example of your problem:
> tmp <- data.frame("numbers" = 1:3, "letters" = letters[1:3])
> tmp
numbers letters
1 1 a
2 2 b
3 3 c
> tmp <- rbind(tmp,tmp)
> tmp
numbers letters
1 1 a
2 2 b
3 3 c
4 1 a
5 2 b
6 3 c
> unique(tmp)
numbers letters
1 1 a
2 2 b
3 3 c
From the base R help:
unique returns a vector, data frame or array like x but with duplicate elements/rows removed.
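One caveat: unique() only drops rows that are exact copies. With microarray data, duplicate gene names often come from different probes with different values, so no row is an exact copy. A common alternative, sketched here on a made-up matrix (the gene names, sample names, and values are all hypothetical), is to average the rows per gene before drawing the heatmap:

```r
# Hypothetical expression matrix: two probes for BMP1, one for TP53
expr <- matrix(c(1.2, 2.3,
                 1.8, 2.1,
                 0.5, 0.9),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("BMP1", "BMP1", "TP53"), c("s1", "s2")))

# Sum the rows per gene with rowsum(), then divide by the per-gene row
# counts; both rowsum() and table() sort groups alphabetically, so they align
avg <- rowsum(expr, group = rownames(expr)) /
       as.vector(table(rownames(expr)))
```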

Order a data frame only from a certain row index to a certain row index

Let's say we have a DF like this:
col1 col2
A 1
A 5
A 3
A 16
B 5
B 4
B 3
C 7
C 2
I'm trying to order col2 but only for same values in col1. Better said, I want it to look like this:
col1 col2
A 1
A 3
A 5
A 16
B 3
B 4
B 5
C 2
C 7
So, order col2 only within the A, B and C groups, not the entire col2 column.
x <- function() {
  values <- unique(DF[, 1])
  for (i in values) {
    currentData <- which(DF$col1 == i)
    ## what to do here ?
    data[order(data[, 2]), ]
  }
}
So in currentData I have the indexes of the col2 values for only the A's, only the B's, etc. But how do I order only those rows in my entire DF data frame? Is it somehow possible to tell the order function to operate only on certain row indexes of a data frame?
ave will group the data by the first element, and apply the named function to the second element for each group. Here is an application of ave sorting within groups:
DF$col2 <- ave(DF$col2, DF$col1, FUN=sort)
DF
## col1 col2
## 1 A 1
## 2 A 3
## 3 A 5
## 4 A 16
## 5 B 3
## 6 B 4
## 7 B 5
## 8 C 2
## 9 C 7
This will work even if the values in col1 are not consecutive, leaving them in their original positions.
If that is not an important consideration, there are better ways to do this, such as the other answer (by user314046).
It seems that
my_df[with(my_df, order(col1, col2)), ]
will do what you want - this just sorts the data frame by col1 and then col2. If you don't want to order by col1, a method is provided in the other answer.
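For reference, both answers run end-to-end on the question's data (a sketch; `DF` is built by hand from the example):

```r
DF <- data.frame(col1 = c("A","A","A","A","B","B","B","C","C"),
                 col2 = c(1, 5, 3, 16, 5, 4, 3, 7, 2))

# ave: sort col2 within each col1 group, leaving the groups in place
DF_ave <- DF
DF_ave$col2 <- ave(DF_ave$col2, DF_ave$col1, FUN = sort)

# order: sort by col1 first, then col2 within ties
DF_ord <- DF[with(DF, order(col1, col2)), ]
```

Because the col1 groups are already contiguous here, both approaches produce the same col2 sequence; they would differ if the groups were interleaved.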
