Find the Nth largest value across columns (not in a vector) in R

Consider the following example:
Var_A <- sample(1:100,5,replace=TRUE)
Var_B <- sample(1:100,5,replace=TRUE)
Var_C <- sample(1:100,5,replace=TRUE)
Var_D <- sample(1:100,5,replace=TRUE)
DF <- as.data.frame(cbind(Var_A,Var_B,Var_C,Var_D))
In R, functions already exist to find the element-wise max and min, so I could easily create a new variable that is equal to the largest (or smallest) value across the columns of interest:
> DF$Max <- pmax(Var_A,Var_B,Var_C,Var_D)
> DF
Var_A Var_B Var_C Var_D Max
1 44 33 6 72 72
2 29 66 51 12 66
3 35 29 47 79 79
4 39 79 47 65 79
5 97 60 36 81 97
But what if I need to create a variable that captures, say, the second largest value in each row (i.e., across the columns)?
In the real data set that I'm working with, I have 600+ columns and about 28 million records. I need to create variables that will identify and store the largest, second largest, third largest, etc. values found when looking across the variables (columns) for each record, much like pmax would do, but for other ordinals.
The only way that I have been able to functionally make it work on a subset of the data is to do a loop, but that loop won't finish in my lifetime if I run it on the entire data set. I also considered using the apply function, but my understanding is that apply will convert the data set to a matrix first, which my data set won't take kindly to.
Any suggestions on a non-loop way to do this? And with this amount of data, the faster the better...

This may be a solution...
Var_A <- sample(1:100,5,replace=TRUE)
Var_B <- sample(1:100,5,replace=TRUE)
Var_C <- sample(1:100,5,replace=TRUE)
Var_D <- sample(1:100,5,replace=TRUE)
DF <- as.data.frame(cbind(Var_A,Var_B,Var_C,Var_D))
result <- sapply(1:nrow(DF), function(x) {
  unname(sort(unlist(DF[x, ]), decreasing = TRUE))  # this row's values, largest first
})
result <- t(result)
output <- cbind(DF, result)
# name the new columns Max1, Max2, ... in rank order
colnames(output)[(ncol(DF) + 1):ncol(output)] <- paste0("Max", seq_len(ncol(DF)))
output
Var_A Var_B Var_C Var_D Max1 Max2 Max3 Max4
1 42 12 64 9 64 42 12 9
2 67 22 47 4 67 47 22 4
3 80 56 82 94 94 82 80 56
4 31 62 88 73 88 73 62 31
5 91 67 15 41 91 67 41 15
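For data on the scale described in the question (600+ columns, ~28 million rows), a row-wise sapply will still be slow. A minimal sketch of a faster alternative, assuming the columns of interest are all numeric so the data can be converted to a matrix once:
m <- as.matrix(DF)
sorted <- t(apply(m, 1, sort, decreasing = TRUE))   # sort each row, largest first
colnames(sorted) <- paste0("Max", seq_len(ncol(sorted)))
output <- cbind(DF, sorted)
# If only a single order statistic is needed, matrixStats avoids sorting whole rows;
# the second largest value per row is the (ncol(m) - 1)th smallest:
second <- matrixStats::rowOrderStats(m, which = ncol(m) - 1)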

Related

How to use a character variable to sum columns in R?

I have a data frame. Here is a small example:
a <- rnorm(100, 5, 2)
b <- rnorm(100, 10, 3)
c <- rnorm(100, 15, 4)
df <- data.frame(a, b, c)
And I have a character variable vect <- "c('a','b')"
When I try to calculate the sum of the variables using the command
df$d <- df[vect]
which should be equivalent to
df$d <- df[c('a','b')]
I get an error:
Error in `[.data.frame`(df, vect) : undefined columns selected
Your assumption that
vect <- "c('a','b')"
df$d <- df[vect]
is equivalent to
df$d <- df[c('a','b')]
is incorrect.
As #Karthik points out, you should remove the quotation marks in the assignment to vect.
However, from your question it sounds like you want to sum the elements specified in vect and assign the result to d. To do this you need to change your code slightly:
vect <- c('a','b')
df$d <- apply(X = df[vect], MARGIN = 1, FUN = sum)
This computes the row-wise sum of the columns in df specified by vect. MARGIN = 1 specifies that sum is applied across rows rather than down columns.
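For a plain sum, the vectorized rowSums is usually simpler and faster than apply; a minimal sketch:
vect <- c('a','b')
df$d <- rowSums(df[vect])  # vectorized row-wise sum of the selected columns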
EDIT:
As #ThomasIsCoding points out below, if for some reason vect has to be a string, you can parse the string into an R expression using str2lang:
vect <- "c('a','b')"
parsed_vect <- eval(str2lang(vect))
df$d <- apply(X = df[parsed_vect], MARGIN = 1, FUN = sum)
Perhaps you can try
> df[eval(str2lang(vect))]
a b
1 8.1588519 9.0617818
2 3.9361214 13.2752377
3 5.5370983 8.8739725
4 8.4542050 8.5704234
5 3.9044461 13.2642793
6 5.6679639 12.9529061
7 4.0183808 6.4746806
8 3.6415608 11.0308990
9 4.5237453 7.3255129
10 6.9379168 9.4594150
... (rows 11-99 omitted)
100 4.4166832 10.4901303

Filter all columns in a time series to keep only the top 1/3

I have a time series with about 100 dates, 50 entities per date (so 5,000 rows), and 50 columns (all different variables). How can I filter each column in the data frame, per unique date, to keep the top 1/3 of values for that column on that date, and then get the average Return for that group on that date? Thank you.
My data is organized as follows but the numbers in each column are random and vary like they do in column "a" (this is a sample, the real data has many more columns and many more rows):
Date Identity Return a b c d e f... ...z
2/1/19 X 5 75 43 67 85 72 56 92
2/1/19 Y 4 27 43 67 85 72 56 92
2/1/19 Z 7 88 43 67 85 72 56 92
2/1/19 W 2 55 43 67 85 72 56 92
2/2/19 X 7 69 43 67 85 72 56 92
2/2/19 Y 8 23 43 67 85 72 56 92
2/3/19 X 2 34 43 67 85 72 56 92
2/3/19 Y 3 56 43 67 85 72 56 92
2/3/19 Z 4 62 43 67 85 72 56 92
2/3/19 W 4 43 43 67 85 72 56 92
2/3/19 U 4 26 43 67 85 72 56 92
2/4/19 X 6 67 43 67 85 72 56 92
2/4/19 Y 1 78 43 67 85 72 56 92
2/5/19 X 4 75 43 67 85 72 56 92
2/7/19 X 5 99 43 67 85 72 56 92
2/7/19 Y 4 72 43 67 85 72 56 92
2/7/19 Z 4 45 43 67 85 72 56 92
I am trying to filter data into quantiles. I have a code that works for filtering into quantiles for one measure. However I want filtered results for many measures individually (i.e. I want a “high” group for a ton of columns).
The code that I have that works for one measure is as follows.
Columns are date, identity, and a; a is the indicator I want to sort on.
High = df[!is.na(df$a),] %>%
  group_by(df.date) %>%
  filter(a > quantile(a, .666)) %>%
  summarise(high_return = sum(df.return) / length(df.identity))
Now I want to loop this over many indicators, each sorted on individually (i.e., I do not want to sort them within one another; I want each sorted separately, with the results broken out by indicator).
I want the output of the loop to be a new data frame with the following format (where a_Return is the average return of the top 1/3 of the original a's on a given date):
Date   a_Return b_Return c_Return
2/1/19 6        7        3
2/3/19 4        2        5
2/4/19 2        4        6
I have tried the code below without it working:
Indicators <- c("a", "b", "c")
for(i in 1:length(Indicators)){
  High = df %>%
    group_by(df.date) %>%
    filter(High[[i]] > quantile(High[[i]], .666)) %>%
    summarise(g = sum(df.return) / length(df.identity))
}
With this attempt I get the error: "Error in filter_impl(.data, quo) : Result must have length 20, not 4719".
I also tried:
High %>%
  group_by(date) %>%
  filter_at(vars(Indicators[i]), any_vars(. > quantile(., .666))) %>%
  summarise(!!Indicators[i] := sum(Return) / n())
but with that code I get the error "Strings must match column names. Unknown Columns: NA"
I want High to turn up with a date column and then a column for each a, b, and c.
If you combine the filtering and calculations into a single function, then you can put that into summarize_at to apply it easily to each column. Since your example data isn't fully reproducible, I'll use the iris dataset. In your case, you'd replace Species with Date, and Petal.Width with Return:
library(dplyr)
top_iris <- iris %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = sum(Petal.Width[. > quantile(., .666)]) / length(Petal.Width[. > quantile(., .666)])))
top_iris
# A tibble: 3 x 4
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.1 2.22 2.09
The problem with using filter is that each function in the pipe runs in order, so any criteria you give to filter_* will have to be applied to the whole data.frame before the result is piped into summarize_at. Instead, we just use a single summarize_at statement, and filter each column as the summarization function is applied to it.
To explain this in more detail, summarize_at takes 2 arguments:
The first argument is one or more of the variable selector functions described in ?select_helpers, enclosed in the vars function. Here we use one_of, which just takes a vector of column names, but we could also use matches to select using a regular expression, or starts_with to choose based on a prefix, for example.
The second argument is a list of one or more function calls to be run on each selected column, enclosed in the funs function. Here we have 1 function call, to which we've given the name return.
Like with any tidyverse function, this is evaluated in a local environment constructed from the data piped in. So bare variable names like Petal.Width function as data$Petal.Width. In *_at functions, the . represents the variable passed in, so when the Sepal.Length column is being summarized:
Petal.Width[. > quantile(., .666)]
means:
data$Petal.Width[data$Sepal.Length > quantile(data$Sepal.Length, .666)]
Finally, since the function in funs is named (that's the return =), then the resulting summary columns have the function's name (return) appended to the original column names.
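Note that funs() and the summarize_at family have since been superseded in dplyr (>= 1.0) by across(); a minimal sketch of the same computation in the newer style, assuming a current dplyr:
library(dplyr)
top_iris <- iris %>%
  group_by(Species) %>%
  summarize(across(c(Sepal.Length, Sepal.Width, Petal.Length),
                   # mean Petal.Width over rows where this column is in its top third
                   ~ sum(Petal.Width[.x > quantile(.x, .666)]) / length(Petal.Width[.x > quantile(.x, .666)]),
                   .names = "{.col}_return"))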
If you want to remove missing data before running these calculations, you can use na.omit to strip out NA values.
To remove all rows containing NA, just pipe your data through na.omit before grouping:
iris2 <- iris
iris2[c(143:149), c(1:2)] <- NA
iris2 %>%
  na.omit() %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = sum(Petal.Width[. > quantile(., .666)]) / length(Petal.Width[. > quantile(., .666)])))
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.09 2.19 2.07
To strip NA values from each column as it's being summarized, you need to move na.omit inside the summarize function:
iris2 %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = {
                 var <- na.omit(.)
                 sum(Petal.Width[var > quantile(var, .666)]) / length(Petal.Width[var > quantile(var, .666)])
               }))
# A tibble: 3 x 4
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.11 2.2 2.09
Here we use curly braces to extend the function we run in summarize_at to multiple expressions. First, we strip out NA values, then we calculate the return values. Since this function is in summarize_at it gets applied to each variable based on the grouping established by group_by.
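The per-column NA handling carries over to the across() idiom as well; a sketch, which also keeps the row filter aligned with Petal.Width:
iris2 %>%
  group_by(Species) %>%
  summarize(across(c(Sepal.Length, Sepal.Width, Petal.Length),
                   ~ {
                     ok <- !is.na(.x)                         # drop NAs in this column only
                     keep <- ok & .x > quantile(.x[ok], .666) # top third of the non-NA values
                     sum(Petal.Width[keep]) / sum(keep)
                   },
                   .names = "{.col}_return"))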

use dplyr mutate() in programming

I am trying to assign a column name to a variable using mutate.
df <-data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
  df %>% mutate(name = ifelse(x < 50, "small", "big"))
}
When I run
new(name = "newVar")
it doesn't work. I know mutate_() could help, but I'm struggling to use it together with ifelse.
Any help would be appreciated.
Using dplyr 0.7.1 and its advances in NSE, you have to UQ the argument to mutate and then use := when assigning. There is lots of info on programming with dplyr and NSE here: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
I've changed the name of the function argument to myvar to avoid confusion. You could also use case_when from dplyr instead of ifelse if you have more categories to recode.
df <- data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(myvar){
  df %>% mutate(UQ(myvar) := ifelse(x < 50, "small", "big"))
}
new(myvar = "newVar")
This returns
x y newVar
1 37 1.82669 small
2 63 -0.04333 big
3 46 0.20748 small
4 93 0.94169 big
5 83 -0.15678 big
6 14 -1.43567 small
7 61 0.35173 big
8 26 -0.71826 small
9 21 1.09237 small
10 90 1.99185 big
... (rows 11-49 omitted)
50 87 -0.25517 big
Following the dplyr programming vignette, define your function as follows:
new <- function(name)
{
  nn <- enquo(name) %>% quo_name()
  df %>% mutate(!!nn := ifelse(x < 50, "small", "big"))
}
enquo takes its expression argument and quotes it, followed by quo_name converting it into a string. Since nn is now quoted, we need to tell mutate not to quote it a second time. That's what !! is for. Finally, := is a helper operator to make it valid R code. Note that with this definition, you can simply pass newVar instead of "newVar" to your function, maintaining dplyr style.
> new( newVar ) %>% head
x y newVar
1 94 -1.07642088 big
2 85 0.68746266 big
3 80 0.02630903 big
4 74 0.18323506 big
5 86 0.85086915 big
6 38 0.41882858 small
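In current dplyr (>= 1.0), the usual idiom is the glue-string form of :=, which accepts the new column's name as a plain string; a minimal sketch, assuming a recent dplyr/rlang:
library(dplyr)
new <- function(myvar) {
  df %>% mutate("{myvar}" := ifelse(x < 50, "small", "big"))  # myvar is a string like "newVar"
}
new("newVar")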
Base R solution
df <- data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
  df[, name] <- "small"
  df[, name][df$x >= 50] <- "big"   # same cutoff as ifelse(x < 50, ...)
  return(df)
}
I am using dplyr 0.5, so I just combine base R with mutate:
new <- function(Name){
  df = mutate(df, ifelse(x < 50, "small", "big"))
  names(df)[3] = Name
  return(df)
}
new("newVar")

Apply over all columns and rows of two different data frames in R

I am trying to apply a function over all rows and columns of two data frames, but I don't know how to solve it with apply.
I think the following script explains what I intend to do and the way I tried to solve it. Any advice would be warmly appreciated! Please note that simplefunction is only intended to be an example function, to keep it simple.
# some data and a function
df1<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
df2<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
simplefunction<-function(a,b){a+b}
# apply on a single row
simplefunction(df1[1,2],df2[1,2])
# apply over all colums
apply(?)
## apply over all columns and rows
# create df to receive results
df3<-df2
# loop it
for (i in 2:5)df3[i]<-apply(?)
My first mapply answer!! For your simple example you have...
mapply( FUN = `+` , df1[,-1] , df2[,-1] )
# a b c
# [1,] 60 35 75
# [2,] 57 39 92
# [3,] 72 71 48
# [4,] 31 19 85
# [5,] 47 66 58
You can extend it like so...
mapply( FUN = function(x,y,z,etc){ simplefunctioncodehere} , df1[,-1] , df2[,-1] , ... other dataframes here )
The dataframes will be passed in order to the function, so in this example df1 would be x, df2 would be y and z and etc would be some other dataframes that you specify in that order. Hopefully that makes sense. mapply will take the first row, first column values of all dataframes and apply the function, then the first row, second column of all data frames and apply the function and so on.
You can also use Reduce:
set.seed(45) # for reproducibility
Reduce(function(x,y) { x + y}, list(df1[, -1], df2[,-1]))
# a b c
# 1 53 22 23
# 2 64 28 91
# 3 19 56 51
# 4 38 41 53
# 5 28 42 30
You can just do:
df1[,-1] + df2[,-1]
Which gives :
a b c
1 52 24 37
2 65 63 62
3 31 90 89
4 90 35 33
5 51 33 45
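If you have more than two data frames, the Reduce approach scales directly over a list; a sketch, where df3 is a hypothetical third data frame with the same columns as df1 and df2:
dfs <- list(df1, df2, df3)                      # df3 is hypothetical
Reduce(`+`, lapply(dfs, function(d) d[, -1]))   # drop the name column, then sum element-wise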

Applying function to multiple rows using values from multiple rows

I have created the following simple function in R:
fun <- function(a,b,c,d,e){b+(c-a)*((e-b)/(d-a))}
I want to apply this function to a data.frame that looks something like:
> data.frame("x1"=seq(55,75,5),"x2"=round(rnorm(5,50,10),0),"x3"=seq(30,10,-5))
x1 x2 x3
1 55 51 30
2 60 45 25
3 65 43 20
4 70 57 15
5 75 58 10
I want to apply fun to each separate row to create a new variable x4, but now comes the difficult part (to me at least..): for the arguments d and e I want to use the values x2 and x3 from the next row. So for the first row of the example that would mean: fun(a=55,b=51,c=30,d=45,e=25). I know that I can use mapply() to apply a function to each row, but I have no clue how to tell mapply that it should use values from the next row, or whether I should be looking for a different approach than mapply().
Many thanks in advance!
Use mapply, but shift the values for the fourth and fifth arguments by one row. You can do it manually, or use taRifx::shift.
> dat
x1 x2 x3
1 55 25 30
2 60 58 25
3 65 59 20
4 70 68 15
5 75 43 10
library(taRifx)
> shift(dat$x2)
[1] 58 59 68 43 25
> mapply( dat$x1, dat$x2, dat$x3, shift(dat$x2), shift(dat$x3) , FUN=fun )
[1] 25.00000 -1272.00000 719.00000 -50.14815 26.10000
If you want the last row to be NA rather than wrapping, use wrap=FALSE,pad=TRUE:
> shift(dat$x2,wrap=FALSE,pad=TRUE)
[1] 58 59 68 43 NA
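If you'd rather avoid the taRifx dependency, the same shift is a one-liner in base R; a minimal sketch that pads with NA, like wrap=FALSE, pad=TRUE:
shift1 <- function(x) c(x[-1], NA)  # drop the first element, pad the end with NA
dat$x4 <- mapply(fun, dat$x1, dat$x2, dat$x3, shift1(dat$x2), shift1(dat$x3))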
