Finding the top values in data frame using r - r

How can I find the 5 highest values of a column in a data frame
I tried the order() function but it gives me only the indices of the rows, wherease I need the actual data from the column. Here's what I have so far:
tail(order(DF$column, decreasing=TRUE),5)

You need to pass the result of order back to DF:
DF <- data.frame( column = 1:10,
names = letters[1:10])
order(DF$column)
# 1 2 3 4 5 6 7 8 9 10
head(DF[order(DF$column),],5)
# column names
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
You're correct that order just gives the indices. You then need to pass those indices to the data frame, to pick out the rows at those indices.
Also, as mentioned in the comments, you can use head instead of tail with decreasing = TRUE if you'd like, but that's a matter of taste.

Related

data frame column names no longer unique when subsetting

I have a data frame that contains duplicate column names. I'm aware that it's non-standard to use duplicated column names, but these names are actually being reassigned downstream using user inputs. For now, I'm attempting to positionally subset a data frame, but the column names become deduplicated. Here's an example.
> df <- data.frame(x = 1:4, y = 2:5, y = LETTERS[2:5], y = (2+(2:5)), check.names = F)
> df
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
4 4 5 E 7
However, when I attempt to subset, the names change...
> df[, 1:3]
x y y.1
1 1 2 B
2 2 3 C
3 3 4 D
4 4 5 E
Is there any way to prevent this from happening? It only occurs when I subset on columns, not rows.
> df[1:3,]
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
Edit for others noticing this behavior:
I've done some digging into the behavior and this relevant section from the help page for extract.data.frame (type ?'[')
The relevant section states:
If [ returns a data frame it will have unique (and non-missing) row
names, if necessary transforming the row names using make.unique.
Similarly, if columns are selected column names will be transformed to
be unique if necessary (e.g., if columns are selected more than once,
or if more than one column of a given name is selected if the data
frame has duplicate column names).
This explains the why, appreciate the comments so far on addressing how to best navigate this.
Here is an option, although I think it is not a good idea to have duplicated column names.
as.data.frame(as.list(df)[1:3], check.names = F)
# x y y
# 1 1 2 B
# 2 2 3 C
# 3 3 4 D
# 4 4 5 E

populate Data Frame based on lookup data frame in R

How does one go about switching a data frame based on column names between to tables with a lookup table in between.
Orig
A B C
1 2 3
2 2 2
4 5 6
Ret
D E
7 8
8 9
2 4
lookup <- data.frame(Orig=c('A','B','C'),Ret=c('D','D','E'))
Orig Ret
1 A D
2 B D
3 C E
So that the final data frame would be
A B C
7 7 8
8 8 9
2 2 4
We can match the 'Orig' column in 'lookup' with the column names of 'Orig' to find the numeric index (although, it is in the same order, it could be different in other cases), get the corresponding 'Ret' elements based on that. We use that to subset the 'Ret' dataset and assign the output back to the original dataset. Here I made a copy of "Orig".
OrigN <- Orig
OrigN[] <- Ret[as.character(lookup$Ret[match(as.character(lookup$Orig),
colnames(Orig))])]
OrigN
# A B C
#1 7 7 8
#2 8 8 9
#3 2 2 4
NOTE: as.character was used as the columns in 'lookup' were factor class.
I believe that the following will work as well.
OrigN <- Orig
OrigN[, as.character(lookup$Orig)] <- Ret[, as.character(lookup$Ret)]
This method applies a column shuffle to Orig (actually a copy OrigN following #Akrun) and then fills these columns with the appropriately ordered columns of Ret using the lookup.

R: How can I remove rows from all the data frames in this list?

Say I have some data created like this
n <- 3
K <- 4
dat <- expand.grid(var1=1:n, var2=1:K)
dat looks like this:
var1 var2
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
12 3 4
I want to remove some rows from both data frames in the list at the same time. Let's say I want to remove the 11th row, and I want the 'gap' to be filled in, so that now the 12th row will become the 11th row.
I understand this is a list of two data frames. Thus the advice here does not apply, since dat[[11]]<-NULL would do nothing, while dat[[2]]<-NULL would remove the second data frame from the list
lapply(dat,"[",11) lets me access the relevant elements, but I don't know how to remove them.
Assuming that we want to remove rows from a list of data.frames, we loop the list elements using lapply and remove the rows using numeric index.
lapply(lst, function(x) x[-11,])
Or without the anonymous function
lapply(lst, `[`, -11,)
The 'dat' is a data.frame.
is.data.frame(dat)
#[1] TRUE
If we want to remove rows from 'dat',
dat[-11,]
If the row.names also needs to be changed
`row.names<-`(dat[-11,], NULL)
data
lst <- list(dat, dat)

keep values of a data frame column R

In my data frame df I want to get the id number satisfying the condition that the value of A is greater than the value of B. In the example I only would want Id=2.
Id Name Value
1 A 3
1 B 5
1 C 4
2 A 7
2 B 6
2 C 8
vecA<-vector();
vecB<-vector();
vecId<-vector();
i<-1
while(i<=length(dim(df)[1]){
if(df$Name[[i]]=="A"){vecA<-c(vecA,df$Value)}
if(df$Name[[i]]=="B"){vecB<-c(vecB,df$Value)}
if(vecA[i]>vecB[i]){vecId<-c(vecId,)}
i<-i+1
}
First, you could convert your data from long to wide so you have one row for each ID:
library(reshape2)
(wide <- dcast(df, Id~Name, value.var="Value"))
# Id A B C
# 1 1 3 5 4
# 2 2 7 6 8
Now you can use normal indexing to get the ids with larger A than B:
wide$Id[wide$A > wide$B]
# [1] 2
The first answer works out well for sure. I wanted to get to regular subset operations as well. I came up with this since you might want to check out some of the more recent R packages. If you had 3 groups to compare that would be interesting. Oh in the code below exp is the exact data.frame you started with.
library(plyr)
library(dplyr)
comp <- exp %>% filter(Name %in% c("A","B")) %>% group_by(Id) %>% filter(min_rank(Value)>1)
# If the whole row is needed
comp[which.max(comp$Value),]
# If not
comp[which.max(comp$Value),"Id"]

Using loop variables

I would like to rename a large number of columns (column headers) to have numerical names rather than combined letter+number names. Because of the way the data is stored in raw format, I cannot just access the correct column numbers by using data[[152]] if I want to interact with a specific column of data (because random questions are filtered completely out of the data due to being long answer comments), but I'd like to be able to access them by data$152. Additionally, approximately half the columns names in my data have loaded with class(data$152) = NULL but class(data[[152]]) = integer (and if I rename the data[[152]] file it appropriately allows me to see class(data$152) as integer).
Thus, is there a way to use the loop iteration number as a column name (something like below)
for (n in 1:415) {
names(data)[n] <-"n" # name nth column after number 'n'
}
That will reassign all my column headers and ensure that I do not run into question classes resulting in null?
As additional background info, my data is imported from a comma delimited .csv file with the value 99 assigned to answers of NA with the first row being the column names/headers
data <- read.table("rawdata.csv", header=TRUE, sep=",", na.strings = "99")
There are 415 columns with headers in format Q001, Q002, etc
There are approximately 200 rows with no row labels/no label column
You can do this without a loop, as follows:
names(data) <- 1:415
Let me illustrate with an example:
dat <- data.frame(a=1:4, b=2:5, c=3:6, d=4:7)
dat
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Now rename the columns:
names(dat) <- 1:4
dat
1 2 3 4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
EDIT : How to access your new data
#Ramnath points out very accurately that you won't be able to access your data using dat$1:
dat$1
Error: unexpected numeric constant in "dat$1"
Instead, you will have to wrap the column names in backticks:
dat$`1`
[1] 1 2 3 4
Alternatively, you can use a combination of character and numeric data to rename your columns. This could be a much more convenient way of dealing with your problem:
names(dat) <- paste("x", 1:4, sep="")
dat
x1 x2 x3 x4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7

Resources