Rank each row in a data frame in descending order

I want to apply rank() to each row in a data frame using apply(data.frame, 1, rank). However, rank() sorts in ascending order by default, so when I apply it to my first row with the values (2,1,3,5), I get
[1] 2 1 3 4
However, I want
[1] 3 4 2 1
How can I do this using apply(data.frame,1,rank)?

Try
apply(-data, 1, rank, ties.method='first')
and compare with
apply(data, 1, rank, ties.method='first')
For your specific example
v1 <- c(2,1,3,5)
rank(v1)
#[1] 2 1 3 4
rank(-v1)
#[1] 3 4 2 1
data
set.seed(24)
data <- as.data.frame(matrix(sample(1:20, 4*20, replace=TRUE), ncol=4))
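One thing to note: apply() over rows returns the ranks transposed (each row's ranks become a column), so a minimal sketch to get them back in the original layout, assuming that is the shape you want, is:
# hedged sketch: transpose the apply() result so rows line up with the original data
ranked <- t(apply(-data, 1, rank, ties.method = 'first'))
dim(ranked)  # 20 x 4, same orientation as data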

Related

Count unique instances in rows between two columns given by index

Hi, I have an example data frame as follows. What I would like to do is count the number of instances of a unique value (for example 1) that occur between the columns given by the indices ind1 and ind2. The output should be a vector with one number per row giving the count of instances for that row.
COL1 <- c(1,1,1,NA,1,1)
COL2 <- c(1,NA,NA,1,1,1)
COL3 <- c(1,1,1,1,1,1)
ind1 <- c(1,2,1,2,1,2)
ind2 <- c(3,3,2,3,3,3)
Data <- data.frame (COL1, COL2, COL3, ind1, ind2)
Data
  COL1 COL2 COL3 ind1 ind2
1    1    1    1    1    3
2    1   NA    1    2    3
3    1   NA    1    1    2
4   NA    1    1    2    3
5    1    1    1    1    3
6    1    1    1    2    3
so example output should look like
3, 1, 1, 2, 3, 2
My actual data set has many rows, so I want to avoid loops as much as possible to save time. I was thinking an apply function with sum(which(x==1)) may work; I'm just not sure how to get the column values from the given indices.
An option would be to loop over the rows, extract the values based on the sequence index from 'ind1' to 'ind2', and get the count with table:
apply(Data, 1, function(x) table(x[x['ind1']:x['ind2']]))
#[1] 3 1 1 2 3 2
Or using sum
apply(Data, 1, function(x) sum(x[x['ind1']:x['ind2']] == 1, na.rm = TRUE))
Or create a logical matrix and then use rowSums
rowSums(Data[1:3] * NA^!((col(Data[1:3]) >= Data$ind1) &
                         (col(Data[1:3]) <= Data$ind2)), na.rm = TRUE)
#[1] 3 1 1 2 3 2
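For comparison, here is a minimal sketch of the same count written as an explicit loop over row indices; it keeps Data as a data frame rather than the matrix that apply() coerces it to:
# hedged sketch: vapply over row indices, guaranteeing a numeric result of length nrow(Data)
vapply(seq_len(nrow(Data)), function(i) {
  cols <- Data$ind1[i]:Data$ind2[i]       # columns selected by ind1..ind2 for this row
  sum(Data[i, cols] == 1, na.rm = TRUE)   # count the 1s, ignoring NAs
}, numeric(1))
#[1] 3 1 1 2 3 2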

How to input nominal values into one column based on values in another column through If_else statements

I am trying to fill in nominal values based on a column dedicated to age. Basically, if someone is between the ages of 1 and 5, as indicated in the age column, then I want the age group column to have the value 1, since they are in age group 1. I'm trying to do this across multiple columns, since ages increase by one each year. I've tried doing this through a for loop that uses an if-else function, but it does not work.
my_vector_1<-c(1,3,5,7,9,11,2,4,6,8,10,12,3,5,7,9,11,13)
my_matrix_1<-matrix(data=my_vector_1, nrow=6, ncol=3)
colnames(my_matrix_1)<-c(paste0("Age", 2000:2002))
rownames(my_matrix_1)<-c(paste0("Participant", 1:6))
my_data_1<-data.frame(my_matrix_1)
my_data_1<-cbind("AgeGroup2000"=NA, "AgeGroup2001"=NA, "AgeGroup2002"=NA, my_data_1)
my_data_1
#I'm basically trying to make the below code into a for loop
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 1:5]<-1
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 6:10]<-2
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 11:15]<-3
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 1:5]<-1
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 6:10]<-2
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 11:15]<-3
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 1:5]<-1
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 6:10]<-2
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 11:15]<-3
Maybe it is better to use findInterval or cut here. We can use lapply to apply it across multiple columns:
my_data_1[paste0("AgeGroup_", 2000:2002)] <- lapply(my_data_1, findInterval, c(1, 6, 11))
# Age2000 Age2001 Age2002 AgeGroup_2000 AgeGroup_2001 AgeGroup_2002
#Participant1 1 2 3 1 1 1
#Participant2 3 4 5 1 1 1
#Participant3 5 6 7 1 2 2
#Participant4 7 8 9 2 2 2
#Participant5 9 10 11 2 3 3
#Participant6 11 12 13 3 3 3
Or mutate_all from dplyr
library(dplyr)
my_data_1 %>% mutate_all(list(Group = ~findInterval(., c(1, 6, 11))))
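Since the answer also mentions cut, here is a hedged sketch of that variant; the breaks c(0, 5, 10, 15) are an assumption matching the groups 1-5, 6-10 and 11-15:
# hedged sketch with cut(); labels = FALSE returns the integer group number directly
my_data_1[paste0("AgeGroup_", 2000:2002)] <- lapply(my_data_1, function(x)
  cut(x, breaks = c(0, 5, 10, 15), labels = FALSE))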
data
my_vector_1<-c(1,3,5,7,9,11,2,4,6,8,10,12,3,5,7,9,11,13)
my_matrix_1<-matrix(data=my_vector_1, nrow=6, ncol=3)
colnames(my_matrix_1)<-c(paste0("Age", 2000:2002))
rownames(my_matrix_1)<-c(paste0("Participant", 1:6))
my_data_1<-data.frame(my_matrix_1)

Create then populate columns in a dataframe

Hello, I'm trying to find a way to create new columns in a dataframe and then populate them.
For example:
id = c(2, 3, 5)
v1 = c(2, 1, 7)
v2 = c(1, 9, 5)
duration=c(v1+v2)
df = data.frame(id,v1,v2,duration,stringsAsFactors=FALSE)
id v1 v2 duration
1 2 2 1 3
2 3 1 9 10
3 5 7 5 12
Now I want to create new columns by dividing each value of a row by the 'duration' of that row. I know how to do it manually, but it is prone to errors and not really elegant...
df$I_v1=v1/duration
df$I_v2=v2/duration
Or is df <- df %>% mutate(I_v1 = v1/duration) quicker/better?
id v1 v2 duration I_v1 I_v2
1 2 2 1 3 0.6666667 0.3333333
2 3 1 9 10 0.1000000 0.9000000
3 5 7 5 12 0.5833333 0.4166667
It works, but I would like to know if it's possible to create (and name) the columns and populate them automatically.
Say that you have a cols vector containing the names of the columns you want to manipulate. In your example:
cols<-c("v1","v2")
Then you can try:
df[paste0("I_",cols)]<-df[cols]/df$duration
# id v1 v2 duration I_v1 I_v2
#1 2 2 1 3 0.6666667 0.3333333
#2 3 1 9 10 0.1000000 0.9000000
#3 5 7 5 12 0.5833333 0.4166667
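Since the question also brings up dplyr's mutate, here is a hedged sketch using across(), which builds and names all the new columns in one call (the I_ prefix in .names is taken from the example above):
library(dplyr)
# divide every column listed in cols by duration; results are named I_v1, I_v2, ...
df %>% mutate(across(all_of(cols), ~ .x / duration, .names = "I_{.col}"))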
You can use transform():
df <- data.frame(id=c(2, 3, 5), v1=c(2, 1, 7), v2=c(1, 9, 5))
df$duration <- df$v1 + df$v2 # or ... <- with(df, v1 + v2)
df_new <- transform(df, I_v1=v1/duration, I_v2=v2/duration )
... or (if you have many columns v1, v2, ...):
as.matrix(df[, 2:3])/df$duration # or with cbind():
cbind(df, as.matrix(df[, 2:3])/df$duration)
(similar to the answer from nicola)
All data frames have a row names attribute: a character vector whose length equals the number of rows, with no duplicates or missing values. You can name the rows with:
row.names(x) <- value
Arguments:
x: an object of class "data.frame", or any other class for which a method has been defined.
value: an object to be coerced to character unless it is an integer vector.

In R, how to sum certain rows of a data frame with certain logic?

Hi experienced R users,
It's kind of a simple thing.
I want to sum x by Group.1 depending on one controllable variable.
I'd like to sum x by grouping the first two rows when I say something like: number <- 2
If I say 3, it should sum x of the first three rows by Group.1
Any idea how I might tackle this problem? Should I write a function?
Thank y'all in advance.
Group.1 Group.2 x
1 1 Eggs 230299
2 2 Eggs 263066
3 3 Eggs 266504
4 4 Eggs 177196
If the sums you want are always cumulative, there's a function for that, cumsum. It works like this.
> cumsum(c(1,2,3))
[1] 1 3 6
In this case you might want something like
> mysum <- cumsum(yourdata$x)
> mysum[2] # the sum of the first two rows
> mysum[3] # the sum of the first three rows
> mysum[number] # the sum of the first "number" rows
Assuming your data is in mydata:
with(mydata, sum(x[Group.1 <= 2]))
You could use the by function.
For instance, given the following data.frame:
d <- data.frame(Group.1=c(1,1,2,1,3,3,1,3),Group.2=c('Eggs'),x=1:8)
> d
Group.1 Group.2 x
1 1 Eggs 1
2 1 Eggs 2
3 2 Eggs 3
4 1 Eggs 4
5 3 Eggs 5
6 3 Eggs 6
7 1 Eggs 7
8 3 Eggs 8
You can do this:
num <- 3 # sum only the first 3 rows
# The aggregation function:
# it is called for each group receiving the
# data.frame subset as input and returns the aggregated row
innerFunc <- function(subDf){
  # we create the aggregated row by taking the first row of the subset
  row <- head(subDf, 1)
  # we set the x column in the result row to the sum of the first "num"
  # elements of the subset
  row$x <- sum(head(subDf$x, num))
  return(row)
}
# Here we call the "by" function:
# it returns an object of class "by" that is a list of the resulting
# aggregated rows; we want to convert it to a data.frame, so we bind
# all of those rows together at once with "do.call(rbind, ... )"
d2 <- do.call(rbind,by(data=d,INDICES=d$Group.1,FUN=innerFunc))
> d2
Group.1 Group.2 x
1 1 Eggs 7
2 2 Eggs 3
3 3 Eggs 19
If you want to sum only a subset of your data:
my_data <- data.frame(c("TRUE","FALSE","FALSE","FALSE","TRUE"), c(1,2,3,4,5))
names(my_data)[1] <- "DESCRIPTION" #Change Column Name
names(my_data)[2] <- "NUMBER" #Change Column Name
sum(subset(my_data, my_data$DESCRIPTION=="TRUE")$NUMBER)
You should get 6.
Not sure why Eggs are important here ;)
df1 <- data.frame(Gr = seq(4),
                  x = c(230299, 263066, 266504, 177196))
now with n=2 i.e. first two rows:
n <- 2
sum(df1[, "x"][df1[, "Gr"]<=n])
The expression df1[, "Gr"] <= n creates a logical vector used to subset the elements in df1[, "x"] before summing them.
Also, it appears your Group.1 is the same as the row number. If so, this may be simpler:
sum(df1[, "x"][1:n])
or to get all at once
cumsum(df1[, "x"])
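And since the question asks whether a function is needed, a minimal hedged sketch wrapping the same idea (sum_first_n is a hypothetical name, and it assumes Group.1 follows the row order as in the posted data):
# hypothetical helper: sum the x column over the first n rows
sum_first_n <- function(data, n) sum(head(data$x, n))
sum_first_n(df1, 2)   # 230299 + 263066 = 493365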

In R, find duplicated dates in a dataset and replace their associated values with their mean

I have a rather small dataset of 3 columns (id, date and distance) in which some dates may be duplicated (otherwise unique) because there is a second distance value associated with that date.
For those duplicated dates, how do I average the distances then replace the original distance with the averages?
Let's use this dataset as the model:
z <- data.frame(id=c(1,1,2,2,3,4),var=c(2,4,1,3,5,2))
#   id var
# 1  1   2
# 2  1   4
# 3  2   1
# 4  2   3
# 5  3   5
# 6  4   2
The mean for id 1 is 3 and for id 2 is 2, and these means would then replace each of the original var values.
I've checked multiple questions to address this and have found related discussions. As a result, here is what I have so far:
# Check if any dates have two estimates (duplicate Epochs)
length(unique(Rdataset$Epoch)) == nrow(Rdataset)
# if 'TRUE' then each day has a unique data point (no duplicate Epochs)
# if 'FALSE' then duplicate Epochs exist, and the distances must be
# averaged for each duplicate Epoch
Rdataset$Distance <- ave(Rdataset$Distance, Rdataset$Epoch, FUN=mean)
Rdataset <- unique(Rdataset)
Then, with the distances for duplicate dates averaged and replaced, I wish to perform other functions on the entire dataset.
Here's a solution that doesn't bother to actually check whether the ids are duplicated; you don't actually need to, since for non-duplicated ids you can just use the mean of the single var value:
duplicated_ids = unique(z$id[duplicated(z$id)])
library(plyr)
z_deduped = ddply(
  z,
  .(id),
  function(df_section) {
    res_df = data.frame(id = df_section$id[1], var = mean(df_section$var))
  }
)
Output:
> z_deduped
id var
1 1 3
2 2 2
3 3 5
4 4 2
Unless I misunderstand:
library(plyr)
ddply(z, .(id), summarise, var2 = mean(var))
# id var2
# 1 1 3
# 2 2 2
# 3 3 5
# 4 4 2
Here is another answer in data.table style:
library(data.table)
z <- data.table(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))
z[, mean(var), by = id]
id V1
1: 1 3
2: 2 2
3: 3 5
4: 4 2
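If you prefer to keep the column name var instead of the default V1, a hedged variant of the same call:
# same aggregation, naming the result column explicitly
z[, .(var = mean(var)), by = id]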
There is no need to treat unique values differently from duplicated values, as the mean of a single value is the value itself.
zt<-aggregate(var~id,data=z,mean)
zt
id var
1 1 3
2 2 2
3 3 5
4 4 2
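For completeness, a hedged dplyr sketch of the same grouping and averaging, equivalent in spirit to the aggregate() call above:
library(dplyr)
# group by id and collapse var to its mean within each group
z %>% group_by(id) %>% summarise(var = mean(var))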
