Ranking based on two variables in R

I need to rank rows based on two variables and I just can't wrap my head around it.
Test data below:
df <- data.frame(A = c(12,35,55,7,6,NA,NA,NA,NA,NA), B = c(NA,12,25,53,12,2,66,45,69,43))
A B
12 NA
35 12
55 25
7 53
6 12
NA 2
NA 66
NA 45
NA 69
NA 43
I want to calculate a third variable, C, that ranks the rows: C should follow A whenever A is not NA, and fall back to B when A is NA, BUT a row where A is NA should never outrank a row where A has a value.
In the data above, the row with max(A) should get the best C rank, and the largest B can at best hold the sixth-highest C value, because A has five non-NA values. If a row with missing A has a B that would outrank a row with non-missing A, some form of transformation should take place to ensure the non-missing-A row always outranks the B row in the final C score.
I would like the result to look something like this:
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 69 6
NA 66 7
NA 45 8
NA 43 9
NA 2 10
So far the closest I can get is
df$C <- ifelse(is.na(df$A), min(df$A, na.rm=T)/df$B, df$A)
But that turns the ranking upside down for the rows where A is NA, so B==2 gets rank 6 instead of B==69:
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 2 6
NA 43 7
NA 45 8
NA 66 9
NA 69 10
I'm not sure if I could use some kind of weights?
Any suggestions are greatly appreciated! Thanks!

You can try:
df$C <- rank(-df$A, na.last = TRUE)  # rank the non-NA A values descending; NA rows get placeholder ranks
df[is.na(df$A), "C"] <- rank(-df$B[is.na(df$A)]) + sum(!is.na(df$A))  # B ranks start after the last A rank
and to see the rows in C order:
df[order(df$C), ]
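Alternatively, the "transformation" idea from the question can be made to work directly. Here is a minimal sketch: shift B so that its largest value always lands below the smallest non-NA A, then rank the combined score in one pass (this assumes A and B are never both NA in the same row):
shift <- max(df$B, na.rm = TRUE) - min(df$A, na.rm = TRUE) + 1  # push every B score below min(A)
score <- ifelse(is.na(df$A), df$B - shift, df$A)                # combined score; A always wins
df$C <- rank(-score, ties.method = "first")                     # 1 = best
This keeps the ordering within B intact (a larger B still ranks better) while guaranteeing that every row with a non-NA A outranks every row where A is missing.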

Related

How to keep NAs while excluding them from the analysis

I have a large column with NAs. I want to rank the time values as shown below, keeping the NAs in place while excluding them from the ranking:
df<-read.table(text="time
40
30
50
NA
60
NA
20
", header=True)
I want to get the following table:
time Rank
40 3
30 4
50 2
NA NA
60 1
NA NA
20 5
I have used the following code:
df$Rank<--df$time,ties.method="mim")
#fixed data
df<-read.table(text="time
40
30
50
NA
60
NA
20
", header=TRUE)
You can do something like
nonNaIndices <- !is.na(df$time)
df$Rank <- NA
df$Rank[nonNaIndices] <- rank(-df$time[nonNaIndices], ties.method = "min")  # negate so the largest time gets rank 1
resulting in
> df
  time Rank
1   40    3
2   30    4
3   50    2
4   NA   NA
5   60    1
6   NA   NA
7   20    5
Note: Please make sure to check your question for missing function calls before submitting it. In your case it could be guessed from the context.
You can use dense_rank from dplyr -
library(dplyr)
df %>% mutate(Rank = dense_rank(-time))
# time Rank
#1 40 3
#2 30 4
#3 50 2
#4 NA NA
#5 60 1
#6 NA NA
#7 20 5
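A base R one-liner can also work here: rank() with na.last = "keep" leaves the rank of a missing value as NA, so the indexing step above can be skipped entirely (a sketch, relying on documented rank() behaviour):
df$Rank <- rank(-df$time, na.last = "keep", ties.method = "min")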

How to combine/concatenate two dataframes one after the other but not merging common columns in R?

Suppose there are two dataframes, as follows, with the same column names, and I want to combine/concatenate them side by side without merging the common columns. There is a way of assigning it column-wise, like df1[3] <- df2[1], but I would like to know if there's some other way.
df1<-data.frame(A=c(1:10), B=c(2:5, rep(NA,6)))
df2<-data.frame(A=c(12:20), B=c(32:40))
Expected Output:
A B A.1 B.1
1 2 12 32
2 3 13 33
3 4 14 34
4 5 15 35
5 NA 16 36
6 NA 17 37
7 NA 18 38
8 NA 19 39
9 NA 20 40
10 NA NA NA
I tend to work with multiple frames like this as a list of frames. Try this:
LOF <- list(df1, df2)
maxrows <- max(sapply(LOF, nrow))
# indexing past a frame's last row yields all-NA rows, padding the shorter frames
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows), ]))
names(out) <- make.names(names(out), unique = TRUE)  # A, B, A.1, B.1
out
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
One advantage of this is that it allows you to work with an arbitrary number of frames, not just two.
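For example, a third frame drops straight into the same code (df3 below is made up for illustration):
df3 <- data.frame(A = 1:3, B = 4:6)  # hypothetical third frame
LOF <- list(df1, df2, df3)
maxrows <- max(sapply(LOF, nrow))
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows), ]))
names(out) <- make.names(names(out), unique = TRUE)  # A, B, A.1, B.1, A.2, B.2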
One base R way could be
setNames(Reduce(cbind.data.frame,
                Map(`length<-`, c(df1, df2), max(nrow(df1), nrow(df2)))),
         paste0(names(df1), rep(c('', '.1'), each = 2)))
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
Another option is to use the merge function. The documentation can be a bit cryptic, so here is a short explanation of the arguments:
by -- "the name "row.names" or the number 0 specifies the row names"
all = TRUE -- keeps all original rows from both dataframes
suffixes -- specify how you want the duplicated colnames to be distinguished
sort -- keep original sorting
merge(df1, df2, by = 0, all = TRUE, suffixes = c('', '.1'), sort = FALSE)
One way would be
cbind(
  df1,
  rbind(
    df2,
    rep(NA, nrow(df1) - nrow(df2))
  )
)
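Note that rbind() treats the rep(NA, ...) vector as a single row, so this only pads correctly because the frames differ by exactly one row here. A sketch of a more general variant pads with a block of NA rows instead:
pad <- as.data.frame(matrix(NA, nrow = nrow(df1) - nrow(df2), ncol = ncol(df2)))
names(pad) <- names(df2)  # align the padding with df2's columns
cbind(df1, rbind(df2, pad))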

Merge two data frames with different dimensions with partial overwrite in R

I've spent the better part of a day on this but I keep getting stuck. This wouldn't take me very long using INDEX-MATCH-MATCH in Excel, but I'm newer to R, and merging data doesn't seem very straightforward. I've searched the site and found similar problems but no solutions specific to this type of issue.
I have two data frames. They have different lengths in both dimensions. a is 4x4 and b is 3x3. They partially overlap:
a <- data.frame("ID" = c(1:4), "A" = c(21:24), "B" = c(31:34), "C" = c(41:44))
a
ID A B C
1 1 21 31 41
2 2 22 32 42
3 3 23 33 43
4 4 24 34 44
and
b <- data.frame("ID" = c(4:6), "C" = c(22:24), "D" = c(32:34))
b
ID C D
1 4 22 32
2 5 23 33
3 6 24 34
I'm merging on "ID" number. My goal is to get them to look like
c <- data.frame("ID" = c(1:6), "A" = c(21:24, NA, NA), "B" = c(31:34, NA, NA), "C" = c(41:43,22:24), "D" = c(NA, NA, NA, 32:34))
c
ID A B C D
1 21 31 41 NA
2 22 32 42 NA
3 23 33 43 NA
4 24 34 22 32
5 NA NA 23 33
6 NA NA 24 34
As you can see, the final data frame combines the two and assigns NA to the missing information. In column "C", I would like b to overwrite a where b has numerical values. In this example, the value in c[4, "C"] should change from 44 to 22.
Most of this is simple enough. But getting column "C" correct has been a nightmare. I did the simple thing first:
merge(a, b, by = "ID", all = T)
It almost does the trick but ends up with duplicate "C" columns:
ID A B C.x C.y D
1 1 21 31 41 NA NA
2 2 22 32 42 NA NA
3 3 23 33 43 NA NA
4 4 24 34 44 22 32
5 5 NA NA NA 23 33
6 6 NA NA NA 24 34
This wouldn't be so bad if I could find out how to merge the duplicate columns correctly, because then I could just run
merge(a[-4], b[-2], by = "ID", all = T)
ID A B D
1 1 21 31 NA
2 2 22 32 NA
3 3 23 33 NA
4 4 24 34 32
5 5 NA NA 33
6 6 NA NA 34
to merge everything else, then bring in the merged "C" after the fact.
But I can't figure it out how to deal with this part of it:
merge(a[c(1,4)], b[c(1,2)], by = "ID", all = T)
  ID C.x C.y
1  1  41  NA
2  2  42  NA
3  3  43  NA
4  4  44  22
5  5  NA  23
6  6  NA  24
which I would like to collapse into a single column:
  ID  C
1  1 41
2  2 42
3  3 43
4  4 22
5  5 23
6  6 24
There's gotta be a way.
Thanks for your help!
For anyone else looking at this in the future, I realized this could also be solved using the following in base rather than dplyr:
df <- merge(a, b, by = "ID", all = T)
df[,"C"] <- ifelse(is.na(df[,"C.y"]), df[,"C.x"], df[,"C.y"])
df <- df[,-c(match("C.x", names(df)),match("C.y", names(df)))]
This ended up being the method I used because down the road I came to needing to perform some steps that were very difficult with dplyr for a novice (using variables inside mutate() and select()) and much more straightforward in base using the above syntax.
Thanks again to CPak, without whom I could not have figured this out.
Try this
library(dplyr)
starthere <- merge(a, b, by = "ID", all = T)
starthere %>%
  mutate(C = ifelse(is.na(C.y), C.x, C.y)) %>%
  select(-C.x, -C.y)
# ID A B D C
# 1 1 21 31 NA 41
# 2 2 22 32 NA 42
# 3 3 23 33 NA 43
# 4 4 24 34 32 22
# 5 5 NA NA 33 23
# 6 6 NA NA 34 24
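If a reasonably recent dplyr is available, coalesce() expresses the same "prefer C.y, fall back to C.x" logic more directly; a sketch of the equivalent pipeline:
library(dplyr)
merge(a, b, by = "ID", all = TRUE) %>%
  mutate(C = coalesce(C.y, C.x)) %>%  # take C.y unless it is NA
  select(-C.x, -C.y)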

Conditional filtering of data.frame with preceding and trailing NA observations

I have a data.frame composed of observations and modelled predictions of data. A minimal example dataset could look like this:
myData <- data.frame(tree=c(rep("A", 20)), doy=c(seq(75, 94)), count=c(NA,NA,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,NA,2,NA,NA,NA,NA,NA), pred=c(0,0,0,0,1,1,1,2,2,2,2,3,3,3,3,6,9,12,20,44))
The count column represents when observations were made and predictions are modelled over a complete set of days, in effect interpolating the data to a day level (from every 5 days).
I would like to conditionally filter this dataset so that I end up truncating the predictions to the same range as the observations, in effect keeping all predictions between when count starts and ends (i.e. removing preceding and trailing rows/values of pred when they correspond to an NA in the count column). For this example, the ideal outcome would be:
tree doy count pred
5 A 79 0 1
6 A 80 NA 1
7 A 81 NA 1
8 A 82 NA 2
9 A 83 NA 2
10 A 84 1 2
11 A 85 NA 2
12 A 86 NA 3
13 A 87 NA 3
14 A 88 NA 3
15 A 89 2 3
I have tried to solve this problem by combining filter with first and last, by thinking about a conditional mutate that creates a 1/0 column recording whether the previous doy (probably using lag) held an observation and then filtering on that, and even by creating a second data.frame containing the proper doy range that could be joined to this data.
In my searches on StackOverflow I have come across the following questions that seemed close, but were not quite what I needed:
Select first observed data and utilize mutate
Conditional filtering based on the level of a factor R
My actual dataset is much larger with multiple trees over multiple years (with each tree/year having different period of observation depending on elevation of the sites, etc.). I am currently implementing the dplyr package across my code, so an answer within that framework would be great but would be happy with any solutions at all.
I think you're just looking to limit the rows to fall between the first and last non-NA count value:
myData[seq(min(which(!is.na(myData$count))), max(which(!is.na(myData$count)))),]
# tree doy count pred
# 5 A 79 0 1
# 6 A 80 NA 1
# 7 A 81 NA 1
# 8 A 82 NA 2
# 9 A 83 NA 2
# 10 A 84 1 2
# 11 A 85 NA 2
# 12 A 86 NA 3
# 13 A 87 NA 3
# 14 A 88 NA 3
# 15 A 89 2 3
In dplyr syntax, grouping by the tree variable:
library(dplyr)
myData %>%
  group_by(tree) %>%
  filter(seq_along(count) >= min(which(!is.na(count))) &
         seq_along(count) <= max(which(!is.na(count))))
# Source: local data frame [11 x 4]
# Groups: tree
#
# tree doy count pred
# 1 A 79 0 1
# 2 A 80 NA 1
# 3 A 81 NA 1
# 4 A 82 NA 2
# 5 A 83 NA 2
# 6 A 84 1 2
# 7 A 85 NA 2
# 8 A 86 NA 3
# 9 A 87 NA 3
# 10 A 88 NA 3
# 11 A 89 2 3
Try
indx <- which(!is.na(myData$count))
myData[seq(indx[1], indx[length(indx)]),]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
If this is based on groups
ind <- with(myData, ave(!is.na(count), tree,
                        FUN = function(x) cumsum(x) > 0 & rev(cumsum(rev(x)) > 0)))
myData[ind,]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
Or using na.trim from zoo
library(zoo)
do.call(rbind,by(myData, myData$tree, FUN=na.trim))
Or using data.table
library(data.table)
setDT(myData)[,.SD[do.call(`:`,as.list(range(.I[!is.na(count)])))] , tree]
# tree doy count pred
#1: A 79 0 1
#2: A 80 NA 1
#3: A 81 NA 1
#4: A 82 NA 2
#5: A 83 NA 2
#6: A 84 1 2
#7: A 85 NA 2
#8: A 86 NA 3
#9: A 87 NA 3
#10: A 88 NA 3
#11: A 89 2 3
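For completeness, the ave/cumsum trick above translates into the dplyr framework the question asks for via cumany(); a sketch:
library(dplyr)
myData %>%
  group_by(tree) %>%
  # keep rows from the first non-NA count through the last, per tree
  filter(cumany(!is.na(count)) & rev(cumany(rev(!is.na(count))))) %>%
  ungroup()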

Forcing unique values before casting (pivoting) in R

I have a data frame as follows
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30
I’d like to re-cast it with a single row for each Identifier and one column for each value in the current location column. I don’t care about the data in V1 but I need the data in V2 and these will become the values in the new columns.
Note that for the Location column there are repeated values for Identifiers 2 and 3.
I ASSUME that the first task is to make the values in the Location column unique.
I used the following (the data frame is called “Test”)
L <- length(Test$Identifier)
for (i in 1:L) {
  temp <- Test$Location[Test$Identifier == i]
  temp1 <- make.unique(as.character(temp), sep = "-")
  levels(Test$Location) <- c(levels(Test$Location), temp1)
  Test$Location[Test$Identifier == i] <- temp1
}
This produces
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B-1 23
3 43 A 10
3 43 B 17
3 43 A-1 18
3 43 B-1 20
3 43 C 25
3 43 A-2 30
Then using
cast(Test, Identifier ~ Location)
gives
Identifier A B C B-1 A-1 A-2
1 21 24 NA NA NA NA
2 NA 15 18 23 NA NA
3 10 17 25 20 18 30
And this is more or less what I want.
My questions are
Is this the right way to handle the problem?
I know R-people don’t use the “for” construction so is there a more R-elegant (relegant?) way to do this? I should mention that the real data set has over 160,000 rows and starts with over 50 unique values in the Location vector and the function takes just over an hour to run. Anything quicker would be good. I should also mention that the cast function had to be run on 20-30k rows of the output at a time despite increasing the memory limit. All the cast outputs were then merged
Is there a way to sort the columns in the output so that (here) they are A, A-1, A-2, B, B-1, C
Please be gentle with your replies!
Usually your original format is much better than your desired result. However, you can do this easily using the split-apply-combine approach, e.g., with package plyr:
DF <- read.table(text="Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30", header=TRUE, stringsAsFactors=FALSE)
#note that I make sure that there are only characters and not factors
#use as.character if you have factors
library(plyr)
DF <- ddply(DF, .(Identifier), transform, Loc2 = make.unique(Location, sep="-"))
library(reshape2)
DFwide <- dcast(DF, Identifier ~Loc2, value.var="V2")
# Identifier A B B-1 C A-1 A-2
#1 1 21 24 NA NA NA NA
#2 2 NA 15 23 18 NA NA
#3 3 10 17 20 25 18 30
If column order is important to you (usually it isn't):
DFwide[, c(1, order(names(DFwide)[-1])+1)]
# Identifier A A-1 A-2 B B-1 C
#1 1 21 NA NA 24 NA NA
#2 2 NA NA NA 15 23 18
#3 3 10 18 30 17 20 25
For reference, here's the equivalent of @Roland's answer in base R.
Use ave to create the unique "Location" columns....
DF$Location <- with(DF, ave(Location, Identifier,
                            FUN = function(x) make.unique(x, sep = "-")))
... and reshape to change the structure of your data.
## If you want both V1 and V2 in your "wide" dataset
## "dcast" can't directly do this--you'll need `recast` if you
## wanted both columns, which first `melt`s and then `dcast`s....
reshape(DF, direction = "wide", idvar = "Identifier", timevar = "Location")
## If you only want V2, as you indicate in your question
reshape(DF, direction = "wide", idvar = "Identifier",
        timevar = "Location", drop = "V1")
# Identifier V2.A V2.B V2.C V2.B-1 V2.A-1 V2.A-2
# 1 1 21 24 NA NA NA NA
# 3 2 NA 15 18 23 NA NA
# 6 3 10 17 25 20 18 30
Reordering the columns can be done the same way that @Roland suggested.
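Given the 160,000-row performance concern, a data.table sketch of the same split-apply-combine may also be worth timing (assuming a data.table version that ships its own dcast, i.e. >= 1.9.6; it starts from the original DF):
library(data.table)
setDT(DF)[, Loc2 := make.unique(as.character(Location), sep = "-"), by = Identifier]
dcast(DF, Identifier ~ Loc2, value.var = "V2")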
