I have a large column with NAs, I want to rank the time as shown below. I want to keep NAs while I exclude them from the analysis,
df<-read.table(text="time
40
30
50
NA
60
NA
20
", header=True)
I want to get the following table:
time Rank
40 3
30 4
50 2
NA NA
60 1
NA NA
20 5
I have used the following code:
df$Rank<--df$time,ties.method="mim")
#fixed data
df<-read.table(text="time
40
30
50
NA
60
NA
20
", header=TRUE)
You can do something like
nonNaIndices <- !is.na(df$time)
df$Rank <- NA
df$Rank[nonNaIndices] <- rank(df$time[nonNaIndices],ties.method="min")
resulting in
> df
time Rank
1 40 3
2 30 2
3 50 4
4 NA NA
5 60 5
6 NA NA
7 20 1
Note: Please make sure to check your question for missing function calls before submitting it. In your case it could be guessed from the context.
You can use dense_rank from dplyr -
library(dplyr)
df %>% mutate(Rank = dense_rank(-time))
# time Rank
#1 40 3
#2 30 4
#3 50 2
#4 NA NA
#5 60 1
#6 NA NA
#7 20 5
Related
This is a followup question to a previous post of mine about building a function for calculating row means.
I want to use any function of the apply family to iterate over my dataset and each time compute the row mean (which is what the function does) for a group of columns I specify. Unfortunately, I miss something critical in the way I should tweak apply(), because I get an error that I can't troubleshoot.
Example Data
capital_cities_df <-
data.frame("europe_paris" = 1:10,
"europe_london" = 11:20,
"europe_rome" = 21:30,
"asia_bangkok" = 31:40,
"asia_tokyo" = 41:50,
"asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(capital_cities_df,
function(cc) cc[ sample(c(TRUE, NA),
prob = c(0.70, 0.30),
size = length(cc),
replace = TRUE) ]))
> capital_cities_df
europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 NA NA NA 41 NA
2 NA 12 22 NA 42 52
3 3 NA 23 33 43 NA
4 NA 14 NA NA NA NA
5 NA 15 25 35 45 NA
6 6 NA NA 36 NA 56
7 NA 17 NA NA NA 57
8 NA 18 NA 38 48 NA
9 NA 19 NA 39 49 NA
10 10 NA 30 40 NA 60
Custom Function
library(dplyr)
library(rlang)
continent_mean <- function(df, continent) {
df %>%
select(starts_with(continent)) %>%
dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
## works for a single case:
continent_mean(capital_cities_df, "europe")
europe_paris europe_london europe_rome europe
1 1 NA 21 11
2 2 12 22 12
3 3 NA 23 13
4 4 14 NA 9
5 NA 15 25 20
6 6 16 26 16
7 NA 17 NA 17
8 NA 18 NA 18
9 NA 19 NA 19
10 10 20 30 20
Trying to apply the function over the data, unsuccessfully
apply(
capital_cities_df,
MARGIN = 2,
FUN = continent_mean(capital_cities_df, continent = "europe")
)
Error in match.fun(FUN) :
'continent_mean(capital_cities_df, continent = "europe")' is not a function, character or symbol
Any other combination of the arguments in apply() didn't work either, nor did sapply. This unsuccessful attempt of using apply is only for one type of columns I wish to get the mean for ("europe"). However, my ultimate goal is to be able to pass c("europe", "asia", etc.) with apply, so I could get the custom function to create row means columns for all groups of columns I specify, in one hit.
What is wrong with my code?
Thanks!
EDIT 19-AUG-2019
I was trying the solution suggested by A. Suliman (see below). It did work for the example data I posted here, but not when trying to scale it up to my real dataset, where I need to subset additional columns (rather than the "continent" batch only). More specifically, in my real data I have an ID column which I want to get outputted along the other data, when I apply my custom-made function.
Example data including "ID" column
capital_cities_df <- data.frame(
"europe_paris" = 1:10,
"europe_london" = 11:20,
"europe_rome" = 21:30,
"asia_bangkok" = 31:40,
"asia_tokyo" = 41:50,
"asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA),
prob = c(0.70, 0.30),
size = length(cc),
replace = TRUE) ]))
id <- 1:10
capital_cities_df <- cbind(id, capital_cities_df)
> capital_cities_df
id europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 1 NA NA NA 41 NA
2 2 NA 12 22 NA 42 52
3 3 3 NA 23 33 43 NA
4 4 NA 14 NA NA NA NA
5 5 NA 15 25 35 45 NA
6 6 6 NA NA 36 NA 56
7 7 NA 17 NA NA NA 57
8 8 NA 18 NA 38 48 NA
9 9 NA 19 NA 39 49 NA
10 10 10 NA 30 40 NA 60
My function (edited to select id as well)
continent_mean <- function(df, continent) {
df %>%
select(., id, starts_with(continent)) %>%
dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
> continent_mean(capital_cities_df, "europe") ## works in a single run
id europe_paris europe_london europe_rome europe
1 1 1 NA NA 1.000000
2 2 NA 12 22 12.000000
3 3 3 NA 23 9.666667
4 4 NA 14 NA 9.000000
5 5 NA 15 25 15.000000
6 6 6 NA NA 6.000000
7 7 NA 17 NA 12.000000
8 8 NA 18 NA 13.000000
9 9 NA 19 NA 14.000000
10 10 10 NA 30 16.666667
Trying to apply the function beyond the single use (based on A. Suliman's method) -- unsuccessfully
continents <- c("europe", "asia")
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))
## or:
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))
In either case I get a variety of error messages:
Error in inds_combine(.vars, ind_list) : Position must be between 0
and n
At other times:
Error: invalid column index : NA for variable: 'NA' = 'NA'
All I wanted was a simple function to let me calculate row means per specification of which columns to run over, but this gets nasty for some reason. Even though I'm eager to figure out what's wrong with my code, if anybody has a better overarching solution for the entire process I'd be thankful too.
Thanks!
Use lapply to loop through continents then use grep to select columns with the current continent
continents <- c("europe", "asia")
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))
#To a dataframe not a list
do.call(cbind, lst)
Using map_dfc from purrr we can get the result in one step
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))
Update:
#grep will return column positions when they match with "europe" or "asia", e.g
> grep("europe", names(capital_cities_df))
[1] 2 3 4
#If we need the column names then we add value=TRUE to grep
> grep("europe", names(capital_cities_df), value = TRUE)
[1] "europe_paris" "europe_london" "europe_rome"
So to add a new column we can just use the c() function and call the function as usual
#NOTE: Here I'm using the old function without select
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, c('id',grep(x, names(capital_cities_df), value = TRUE))], continent=x))
do.call(cbind, lst)
id europe_paris europe_london europe_rome europe id asia_bangkok asia_tokyo asia_kathmandu asia
1 1 1 NA NA 1.00000 1 NA 41 51 31.00000
2 2 NA 12 22 12.00000 2 NA 42 52 32.00000
3 3 3 13 23 10.50000 3 33 43 NA 26.33333
4 4 NA 14 NA 9.00000 4 NA 44 54 34.00000
5 5 NA 15 25 15.00000 5 35 45 55 35.00000
6 6 6 NA NA 6.00000 6 36 46 56 36.00000
7 7 7 17 27 14.50000 7 NA 47 57 37.00000
8 8 NA 18 28 18.00000 8 38 48 NA 31.33333
9 9 9 19 29 16.50000 9 39 49 NA 32.33333
10 10 10 NA 30 16.66667 10 40 NA 60 36.66667
#We have one problem, id column gets duplicated, map_dfc with select will solve this issue
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, c('id',grep(.x, names(capital_cities_df), value = TRUE))], continent=.x)) %>%
#Don't select any column name ends with id followed by one digit
select(-matches('id\\d'))
If you'd like to use the new function with select then just pass capital_cities_df without grep, e.g using map_dfc
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df, continent=.x)) %>%
select(-matches('id\\d'))
Correction: in continent_mean
continent_mean <- function(df, continent) {
df %>%
select(., id, starts_with(continent)) %>%
#Exclude id from the rowMeans calculation
dplyr::mutate(!!quo_name(continent) := rowMeans(.[grep(continent, names(.))], na.rm = TRUE))
}
In the example below, the event start is defined as when the prior value of "values" is 90 or more and the current value is below 90. The event end is when the current value is below 90 and the next value is 90 or above.
sequential_index <- seq(1,10)
values <- c(91,90,89,89,90,90,89,88,90,91)
df <- data.frame(sequential_index, values)
Looking at df in the example above, the first event occurs for observations 3-4 and the second event occurs for observations 7-8. I am trying, to no avail, to add an "events" column to the above data frame that looks something like this:
sequential_index values events
1 1 91 NA
2 2 90 NA
3 3 89 1
4 4 89 1
5 5 90 NA
6 6 90 NA
7 7 89 2
8 8 88 2
9 9 90 NA
10 10 91 NA
My dataset is rather large and I'm trying to avoid for loops.
Thanks in advance,
-jt
I have this solution using dplyr.
library(dplyr)
df %>%
# Define the start of events (putting 1 at the start of events)
mutate(events = case_when(lag(values)>=90 & values<90 ~ 1, TRUE ~ 0)) %>%
# Extend the events using cumsum()
mutate(events = case_when(values<90 ~ cumsum(events)))
Output :
sequential_index values events
1 1 91 NA
2 2 90 NA
3 3 89 1
4 4 89 1
5 5 90 NA
6 6 90 NA
7 7 89 2
8 8 88 2
9 9 90 NA
10 10 91 NA
One option with base R would be rle
df$events <- inverse.rle(within.list(rle(df$values < 90),
values[values] <- seq_along(values[values])
))
df$events[df$events == 0] <- NA
df$events
#[1] NA NA 1 1 NA NA 2 2 NA NA
Or in a compact way with data.table
library(data.table)
setDT(df)[, events := as.integer(factor(rleid(events < 90)[events < 90]))]
I need to rank rows based on two variables and I just can't wrap my head around it.
Test data below:
df <- data.frame(A = c(12,35,55,7,6,NA,NA,NA,NA,NA), B = c(NA,12,25,53,12,2,66,45,69,43))
A B
12 NA
35 12
55 25
7 53
6 12
NA 2
NA 66
NA 45
NA 69
NA 43
I want to calculate a third variable, C that equals A when A!=NA. When A==NA then C==B, BUT the C score should always follow that a row with A==NA should never outrank a row with A!=NA.
In the data above Max(A) should equal max(C) and max(B) only can hold the sixth highest C value, because A has five non-NA values. If A ==NA and B outranks a row with A!=NA, then some form of transformation should take place that ensures that the A!=NA row always outranks the B row in the final C score
I would like the result to look something like this:
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 69 6
NA 66 7
NA 45 8
NA 43 9
NA 2 10
So far the closest I can get is
df$C <- ifelse(is.na(df$A), min(df$A, na.rm=T)/df$B, df$A)
But that turns the ranking upside down when A==NA, so B==2 is ranked 6 instead of B==69
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 2 6
NA 43 7
NA 45 8
NA 66 9
NA 69 10
I'm not sure if I could use some kind of weights?
Any suggestions are greatly appreciated! Thanks!
You can try:
df$C <- order(-df$A)
df[is.na(df$A),"C"] <- sort.list(order(-df[is.na(df$A),"B"]))+length(which(!is.na(df$A)))
and the order for C:
df[order(df$C),]
I have a data.frame composed of observations and modelled predictions of data. A minimal example dataset could look like this:
myData <- data.frame(tree=c(rep("A", 20)), doy=c(seq(75, 94)), count=c(NA,NA,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,NA,2,NA,NA,NA,NA,NA), pred=c(0,0,0,0,1,1,1,2,2,2,2,3,3,3,3,6,9,12,20,44))
The count column represents when observations were made and predictions are modelled over a complete set of days, in effect interpolating the data to a day level (from every 5 days).
I would like to conditionally filter this dataset so that I end up truncating the predictions to the same range as the observations, in effect keeping all predictions between when count starts and ends (i.e. removing preceding and trailing rows/values of pred when they correspond to an NA in the count column). For this example, the ideal outcome would be:
tree doy count pred
5 A 79 0 1
6 A 80 NA 1
7 A 81 NA 1
8 A 82 NA 2
9 A 83 NA 2
10 A 84 1 2
11 A 85 NA 2
12 A 86 NA 3
13 A 87 NA 3
14 A 88 NA 3
15 A 89 2 3
I have tried to solve this problem through combining filter with first and last, thinking about using a conditional mutate to create a column that determines if there is an observation in the previous doy (probably using lag) and filling that with 1 or 0 and using that output to then filter, or even creating a second data.frame that contains the proper doy range that can be joined to this data.
In my searches on StackOverflow I have come across the following questions that seemed close, but were not quite what I needed:
Select first observed data and utilize mutate
Conditional filtering based on the level of a factor R
My actual dataset is much larger with multiple trees over multiple years (with each tree/year having different period of observation depending on elevation of the sites, etc.). I am currently implementing the dplyr package across my code, so an answer within that framework would be great but would be happy with any solutions at all.
I think you're just looking to limit the rows to fall between the first and last non-NA count value:
myData[seq(min(which(!is.na(myData$count))), max(which(!is.na(myData$count)))),]
# tree doy count pred
# 5 A 79 0 1
# 6 A 80 NA 1
# 7 A 81 NA 1
# 8 A 82 NA 2
# 9 A 83 NA 2
# 10 A 84 1 2
# 11 A 85 NA 2
# 12 A 86 NA 3
# 13 A 87 NA 3
# 14 A 88 NA 3
# 15 A 89 2 3
In dplyr syntax, grouping by the tree variable:
library(dplyr)
myData %>%
group_by(tree) %>%
filter(seq_along(count) >= min(which(!is.na(count))) &
seq_along(count) <= max(which(!is.na(count))))
# Source: local data frame [11 x 4]
# Groups: tree
#
# tree doy count pred
# 1 A 79 0 1
# 2 A 80 NA 1
# 3 A 81 NA 1
# 4 A 82 NA 2
# 5 A 83 NA 2
# 6 A 84 1 2
# 7 A 85 NA 2
# 8 A 86 NA 3
# 9 A 87 NA 3
# 10 A 88 NA 3
# 11 A 89 2 3
Try
indx <- which(!is.na(myData$count))
myData[seq(indx[1], indx[length(indx)]),]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
If this is based on groups
ind <- with(myData, ave(!is.na(count), tree,
FUN=function(x) cumsum(x)>0 & rev(cumsum(rev(x))>0)))
myData[ind,]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
Or using na.trim from zoo
library(zoo)
do.call(rbind,by(myData, myData$tree, FUN=na.trim))
Or using data.table
library(data.table)
setDT(myData)[,.SD[do.call(`:`,as.list(range(.I[!is.na(count)])))] , tree]
# tree doy count pred
#1: A 79 0 1
#2: A 80 NA 1
#3: A 81 NA 1
#4: A 82 NA 2
#5: A 83 NA 2
#6: A 84 1 2
#7: A 85 NA 2
#8: A 86 NA 3
#9: A 87 NA 3
#10: A 88 NA 3
#11: A 89 2 3
I always use "with" instead of "within" within the context of my research, but I originally thought they were the same. Just now I mistype "with" for "within" and the results returned are quite different. I am wondering why?
I am using the baseball data in the plyr package, so I first load the library by
require(plyr)
Then, I want to select all rows with an id "ansonca01". At first, as I said, I used "within", and run the function as follows:
within(baseball, baseball[id=="ansonca01", ])
I got very strange results which basically includes everything:
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
44 forceda01 1871 1 WS3 32 162 45 45 9 4 0 29 8 0 4 0 NA NA NA NA NA
68 mathebo01 1871 1 FW1 19 89 15 24 3 1 0 10 2 1 2 0 NA NA NA NA NA
99 startjo01 1871 1 NY2 33 161 35 58 5 1 1 34 4 2 3 0 NA NA NA NA NA
102 suttoez01 1871 1 CL1 29 128 35 45 3 7 3 23 3 1 1 0 NA NA NA NA NA
106 whitede01 1871 1 CL1 29 146 40 47 6 5 1 21 2 2 4 1 NA NA NA NA NA
113 yorkto01 1871 1 TRO 29 145 36 37 5 7 2 23 2 2 9 1 NA NA NA NA NA
.........
Then I use "with" instead of "within",
with(baseball, baseball[id=="ansonca01",])
and got the results that I expected
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
121 ansonca01 1872 1 PH1 46 217 60 90 10 7 0 50 6 6 16 3 NA NA NA NA NA
276 ansonca01 1873 1 PH1 52 254 53 101 9 2 0 36 0 2 5 1 NA NA NA NA NA
398 ansonca01 1874 1 PH1 55 259 51 87 8 3 0 37 6 0 4 1 NA NA NA NA NA
525 ansonca01 1875 1 PH1 69 326 84 106 15 3 0 58 11 6 4 2 NA NA NA NA NA
I checked the documentation of with and within by typing help(with) in R environment, and got the following:
with is a generic function that evaluates expr in a local environment constructed from data. The environment has the caller's environment as its parent. This is useful for simplifying calls to modeling functions. (Note: if data is already an environment then this is used with its existing parent.)
Note that assignments within expr take place in the constructed environment and not in the user's workspace.
within is similar, except that it examines the environment after the evaluation of expr and makes the corresponding modifications to data (this may fail in the data frame case if objects are created which cannot be stored in a data frame), and returns it. within can be used as an alternative to transform.
From this explanation of the differences, I don't get why I obtained different results with such a simple operation. Anyone has ideas?
I find simple examples often work to highlight the difference. Something like:
df <- data.frame(a=1:5,b=2:6)
df
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
with(df, {c <- a + b; df;} )
a b
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
within(df, {c <- a + b; df;} )
# equivalent to: within(df, c <- a + b)
# i've just made the return of df explicit
# for comparison's sake
a b c
1 1 2 3
2 2 3 5
3 3 4 7
4 4 5 9
5 5 6 11
The documentation is quite clear about the semantics and return values (and nicely matches the everyday meanings of the words “with” and “within”):
Value:
For ‘with’, the value of the evaluated ‘expr’. For ‘within’, the
modified object.
Since your code doesn’t modify anything inside baseball, the unmodified baseball is returned. with on the other hand doesn’t return the object, it returns expr.
Here’s an example where the expression modifies the object:
> head(within(cars, speed[dist < 20] <- 1))
speed dist
1 1 2
2 1 10
3 1 4
4 7 22
5 1 16
6 1 10
As above, with returns the value of the last evaluated expression. It is handy for one-liners such as:
with(cars, summary(lm (speed ~ dist)))
but is not suitable for sending multiple expressions.
I often find within useful for manipulating a data.frame or list (or data.table) as I find the syntax easy to read.
I feel that the documentation could be improved by adding examples of use in this regard, e.g.:
df1 <- data.frame(a=1:3,
b=4:6,
c=letters[1:3])
## library("data.table")
## df1 <- as.data.table(df1)
df1 <- within(df1, {
a <- 10:12
b[1:2] <- letters[25:26]
c <- a
})
df1
giving
a b c
1: 10 y 10
2: 11 z 11
3: 12 6 12
and
df1 <- as.list(df1)
df1 <- within(df1, {
a <- 20:23
b[1:2] <- letters[25:26]
c <- paste0(a, b)
})
df1
giving
$a
[1] 20 21 22 23
$b
[1] "y" "z" "6"
$c
[1] "20y" "21z" "226" "23y"
Note also that methods("within") gives only these object types, being:
within.data.frame
within.list
(and within.data.table if the package is loaded).
Other packages may define additional methods.
Perhaps unexpectedly for some, with and within are generally not appropriate choices when manipulating variables within defined environments...
To address the comment - there is no within.environment method. Using with requires you to have the function you're calling within the environment, which somewhat defeats the purpose for me e.g.
df1 <- as.environment(df1)
## with(df1, ls()) ## Error
assign("ls", ls, envir=df1)
with(df1, ls())