Filter all columns in timeseries to keep only top 1/3 - r

I have a timeseries with about 100 dates, 50 entities per date (so 5,000 rows) and 50 columns (all different variables). How can I filter each column in the data frame, per unique date, to keep the top 1/3 of values for each column on each date, and then get the average Return for that group on that date? Thank you.
My data is organized as follows, but the numbers in each column are random and vary like they do in column "a" (this is a sample; the real data has many more columns and many more rows):
Date Identity Return a b c d e f... ...z
2/1/19 X 5 75 43 67 85 72 56 92
2/1/19 Y 4 27 43 67 85 72 56 92
2/1/19 Z 7 88 43 67 85 72 56 92
2/1/19 W 2 55 43 67 85 72 56 92
2/2/19 X 7 69 43 67 85 72 56 92
2/2/19 Y 8 23 43 67 85 72 56 92
2/3/19 X 2 34 43 67 85 72 56 92
2/3/19 Y 3 56 43 67 85 72 56 92
2/3/19 Z 4 62 43 67 85 72 56 92
2/3/19 W 4 43 43 67 85 72 56 92
2/3/19 U 4 26 43 67 85 72 56 92
2/4/19 X 6 67 43 67 85 72 56 92
2/4/19 Y 1 78 43 67 85 72 56 92
2/5/19 X 4 75 43 67 85 72 56 92
2/7/19 X 5 99 43 67 85 72 56 92
2/7/19 Y 4 72 43 67 85 72 56 92
2/7/19 Z 4 45 43 67 85 72 56 92
I am trying to filter data into quantiles. I have code that works for filtering into quantiles for one measure. However, I want filtered results for many measures individually (i.e. I want a "high" group for a ton of columns).
The code that I have that works for one measure is as follows. Columns are date, identity, and a; a is the indicator I want to sort on.
High = df[!is.na(df$a),] %>%
  group_by(df.date) %>%
  filter(a > quantile(a, .666)) %>%
  summarise(high_return = sum(df.return) / length(df.identity))
Now I want to loop this for when I have many indicators to sort on individually (i.e. I do not want to sort them within one another; I want each sorted separately and the results broken out by indicator).
I want the output of the loop to be a new data frame with the following format (where a_Return is the average return of the top 1/3 of the original a's on a given date):
Date a_Return b_Return c_Return
2/1/19 6 7 3
2/3/19 4 2 5
2/4/19 2 4 6
I have tried the code below, but it does not work:
Indicators <- c("a", "b", "c")
for(i in 1:length(Indicators)){
  High = df %>%
    group_by(df.date) %>%
    filter(High[[I]] > quantile(High[[i]], .666)) %>%
    summarise(g = sum(df.return) / length(df.identity))
}
With this attempt I get the error: "Error in filter_impl(.data, quo) : Result must have length 20, not 4719".
I also tried:
High %>%
group_by(date) %>%
filter_at(vars(Indicators[i]), any_vars(. > quantile (., .666)))%>%
summarise(!!Indicators[I] := sum(Return) / n())
but with that code I get the error "Strings must match column names. Unknown Columns: NA"
I want High to turn up with a date column and then a column for each a, b, and c.

If you combine the filtering and calculations into a single function, then you can put that into summarize_at to apply it easily to each column. Since your example data isn't fully reproducible, I'll use the iris dataset. In your case, you'd replace Species with Date, and Petal.Width with Return:
library(dplyr)
top_iris <- iris %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = sum(Petal.Width[. > quantile(., .666)]) /
                             length(Petal.Width[. > quantile(., .666)])))
top_iris
# A tibble: 3 x 4
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.1 2.22 2.09
The problem with using filter is that each function in the pipe runs in order, so any criteria you give to filter_* will have to be applied to the whole data.frame before the result is piped into summarize_at. Instead, we just use a single summarize_at statement, and filter each column as the summarization function is applied to it.
To explain this in more detail, summarize_at takes 2 arguments:
The first argument is one or more of the variable selector functions described in ?select_helpers, enclosed in the vars function. Here we use one_of, which just takes a vector of column names, but we could also use matches to select using a regular expression, or starts_with to choose based on a prefix, for example.
The second argument is a list of one or more function calls to be run on each selected column, enclosed in the funs function. Here we have 1 function call, to which we've given the name return.
Like with any tidyverse function, this is evaluated in a local environment constructed from the data piped in. So bare variable names like Petal.Width function as data$Petal.Width. In *_at functions, the . represents the variable passed in, so when the Sepal.Length column is being summarized:
Petal.Width[. > quantile(., .666)]
means:
data$Petal.Width[data$Sepal.Length > quantile(data$Sepal.Length, .666)]
Finally, since the function in funs is named (that's the return =), then the resulting summary columns have the function's name (return) appended to the original column names.
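Translated back to the question's data, the same pattern would look roughly like this (a sketch, assuming your data frame is called df with columns Date, Identity, Return and indicator columns a, b, c, and no missing values; see below for NA handling). Naming the function Return gives the a_Return, b_Return, c_Return columns you asked for:
library(dplyr)

High <- df %>%
  group_by(Date) %>%
  summarize_at(vars(one_of('a', 'b', 'c')),   # add more indicator columns here as needed
               funs(Return = sum(Return[. > quantile(., .666)]) /
                             length(Return[. > quantile(., .666)])))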
If you want to remove missing data before running these calculations, you can use na.omit to strip out NA values.
To remove all rows containing NA, just pipe your data through na.omit before grouping:
iris2 <- iris
iris2[c(143:149), c(1:2)] <- NA
iris2 %>%
  na.omit() %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = sum(Petal.Width[. > quantile(., .666)]) /
                             length(Petal.Width[. > quantile(., .666)])))
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.09 2.19 2.07
To strip NA values from each column as it's being summarized, you need to move na.omit inside the summarize function:
iris2 %>%
  group_by(Species) %>%
  summarize_at(vars(one_of('Sepal.Length', 'Sepal.Width', 'Petal.Length')),
               funs(return = {
                 var <- na.omit(.)
                 sum(Petal.Width[var > quantile(var, .666)]) /
                   length(Petal.Width[var > quantile(var, .666)])
               }))
# A tibble: 3 x 4
Species Sepal.Length_return Sepal.Width_return Petal.Length_return
<fct> <dbl> <dbl> <dbl>
1 setosa 0.257 0.262 0.308
2 versicolor 1.44 1.49 1.49
3 virginica 2.11 2.2 2.09
Here we use curly braces to extend the function we run in summarize_at to multiple expressions. First, we strip out NA values, then we calculate the return values. Since this function is in summarize_at it gets applied to each variable based on the grouping established by group_by.
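As an aside, funs() has since been soft-deprecated in dplyr; in current versions (1.0+) the same idea can be written with across(), using mean() in place of the explicit sum-over-length (a sketch of the equivalent call):
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarize(across(c(Sepal.Length, Sepal.Width, Petal.Length),
                   ~ mean(Petal.Width[.x > quantile(.x, .666)]),
                   .names = "{.col}_return"))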


How do you select multiple values for grep across multiple columns in R?

this is my first question, sorry if I do this wrong, and sorry for it being so long...
I have a table of genomes from an entire genus that I would like to compare at a smaller level, such as within one or more species. My table contains 3 columns: p1, p2, and percent identity. Each row is a comparison between two genomes.
p1 contains genome names, as does p2. Whichever genome has the lower leading number is placed in p1 and the one with the higher number goes in p2. The genome names are in the format 1_1_1, so p1 may be 1_1_1 and p2 may be 2_1_1200, but in the next row p1 could be 2_1_1200 if p2 is 3_1_23. The third column is the percent identity between them, but I don't think it should be relevant.
Multiple genomes belong to the same species, but they are not in any kind of order. For example, 42, 54, 210, and 694 are the same species. I would like to find only the rows where both p1 and p2 contain these numbers, so 42 to 54, 54 to 210, etc, but not 1 to 42. This species only has 4 genomes, but some have as many as 582 to compare.
So far:
They are bacterial genomes, so the genes are not in the same order, and the third digit corresponds to the gene position, so I've been using "^42" to call 42_1_622, for example. I don't want 642_1, so I anchored the 42 to the beginning. All middle digits are 1.
subset_species_1 <- rbind(x[grep("^42_", x$p1), ],
x[grep("^42_", x$p2), ],
x[grep("^54_", x$p1), ],
x[grep("^54_", x$p2), ],
x[grep("^210_", x$p1), ],
x[grep("^210_", x$p2), ],
x[grep("^694_", x$p1), ],
x[grep("^694_", x$p2), ])
This is obviously tedious, and it gives me all of the rows with any of these genomes in either column, not only rows with these genomes in both columns.
In addition, each table only represents one gene, and ideally I'd like to use the same subsets for every table, of which there are thousands.
Thank you in advance, I need all the help I can get!
Edited to add: I'm doing this in R/RStudio.
How about something like this? Rather than using regex to find the beginnings, why not just split off the digits before the first underscore and check whether they are in some pre-defined vector of values? That's what I've done below, with find_vals being the values I'm looking for.
library(glue)
library(dplyr)
library(stringr)
set.seed(402943)
dat <- tibble(
p1 = glue("{sample(1:250, 250, replace=TRUE)}_1_{sample(1:250, 250, replace=TRUE)}"),
p2 = glue("{sample(1:250, 250, replace=TRUE)}_1_{sample(1:250, 250, replace=TRUE)}"),
p = runif(250, 0,1)
)
find_vals <- as.character(42:100)
dat %>% mutate(p11 = str_split(p1, "_", simplify=TRUE)[,1],
p21 = str_split(p2, "_", simplify=TRUE)[,1]) %>%
filter(p11 %in% find_vals & p21 %in% find_vals)
# A tibble: 16 x 5
# p1 p2 p p11 p21
# <glue> <glue> <dbl> <chr> <chr>
# 1 54_1_222 93_1_180 0.626 54 93
# 2 61_1_47 48_1_47 0.639 61 48
# 3 74_1_89 99_1_42 0.556 74 99
# 4 54_1_71 87_1_144 0.287 54 87
# 5 54_1_10 71_1_140 0.216 54 71
# 6 57_1_242 79_1_107 0.238 57 79
# 7 70_1_185 71_1_55 0.538 70 71
# 8 48_1_140 80_1_139 0.0752 48 80
# 9 72_1_105 62_1_56 0.213 72 62
# 10 70_1_241 64_1_220 0.857 70 64
# 11 57_1_213 97_1_47 0.432 57 97
# 12 55_1_56 45_1_249 0.907 55 45
# 13 55_1_9 44_1_156 0.633 55 44
# 14 59_1_153 96_1_228 0.154 59 96
# 15 61_1_97 99_1_189 0.556 61 99
# 16 83_1_56 86_1_85 0.787 83 86
#
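To apply the same subset to the thousands of per-gene tables mentioned in the question, the filter can be wrapped in a small helper and mapped over a list of tables. This is a sketch: same_species_rows and gene_tables are hypothetical names, and x is the table from the question.
library(dplyr)
library(stringr)

# Keep only rows where both p1 and p2 start with one of the given genome numbers
same_species_rows <- function(tbl, genome_ids) {
  ids <- as.character(genome_ids)
  tbl %>%
    filter(str_split(p1, "_", simplify = TRUE)[, 1] %in% ids,
           str_split(p2, "_", simplify = TRUE)[, 1] %in% ids)
}

# One table:
subset_species_1 <- same_species_rows(x, c(42, 54, 210, 694))

# A whole list of per-gene tables (assuming they are stored in a list called gene_tables):
subsets <- lapply(gene_tables, same_species_rows, genome_ids = c(42, 54, 210, 694))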

using map function to create a dataframe from google trends data

Relatively new to R. I have a list of words I want to run through the gtrends function to look at the Google search hits, and then create a tibble with dates as the index and the relevant hits for each word as columns. I'm struggling to do this using the map functions in purrr.
I started off trying to use a for loop, but I've been told to try map from the tidyverse instead. This is what I had so far:
library(gtrendsR)
words = c('cruise', 'plane', 'car')
for (i in words) {
rel_word_data = gtrends(i,geo= '', time = 'today 12-m')
iot <- data.frame()
iot[i] <- rel_word_data$interest_over_time$hits
}
I need to have the gtrends function take one word at a time, otherwise it will give a value for hits that is adjusted for the popularity of the other words. So basically, I need the gtrends function to run on the first word in the list, obtain the hits column in the interest_over_time section, and add it to a final dataframe that contains a column for each word and the date as the index.
I'm a bit lost as to how to do this without a for loop.
Assuming the gtrends output is the same length for every keyword, you can do the following:
# Load packages
library(purrr)
library(gtrendsR)
# Generate a vector of keywords
words <- c('cruise', 'plane', 'car')
# Download data by iterating gtrends over the vector of keywords
# Extract the hits data and make it into a dataframe for each keyword
trends <- map(.x = words,
~ as.data.frame(gtrends(keyword = .x, time = 'now 1-H')$interest_over_time$hits)) %>%
# Add the keywords as column names to the three dataframes
map2(.x = .,
.y = words,
~ set_names(.x, nm = .y)) %>%
# Convert the list of three dataframes to a single dataframe
map_dfc(~ data.frame(.x))
# Check data
head(trends)
#> cruise plane car
#> 1 50 75 84
#> 2 51 74 83
#> 3 100 67 81
#> 4 46 76 83
#> 5 48 77 84
#> 6 43 75 82
str(trends)
#> 'data.frame': 59 obs. of 3 variables:
#> $ cruise: int 50 51 100 46 48 43 48 53 43 50 ...
#> $ plane : int 75 74 67 76 77 75 73 80 70 79 ...
#> $ car : int 84 83 81 83 84 82 84 87 85 85 ...
Created on 2020-06-27 by the reprex package (v0.3.0)
You can use map to get all the data as a list and use reduce to combine the data.
library(purrr)
library(gtrendsR)
library(dplyr)
map(words, ~ gtrends(.x, geo = '', time = 'today 12-m')$interest_over_time %>%
      dplyr::select(date, !!.x := hits)) %>%
  reduce(full_join, by = 'date')
# date cruise plane car
#1 2019-06-30 64 53 96
#2 2019-07-07 75 48 97
#3 2019-07-14 73 48 100
#4 2019-07-21 74 48 100
#5 2019-07-28 71 47 100
#6 2019-08-04 67 47 97
#7 2019-08-11 68 56 98
#.....
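For completeness, the original for loop can also be made to work by creating the result data frame once, outside the loop, and filling in one column per word (a sketch, assuming every query returns the same set of dates):
library(gtrendsR)

words <- c('cruise', 'plane', 'car')
iot <- data.frame()
for (w in words) {
  hits <- gtrends(w, geo = '', time = 'today 12-m')$interest_over_time
  if (nrow(iot) == 0) iot <- data.frame(date = hits$date)   # set up the date index once
  iot[[w]] <- hits$hits                                     # one column per word
}
head(iot)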

Find Nth largest Across Columns (NOT in a vector)

Consider the following example:
Var_A <- sample(1:100,5,replace=TRUE)
Var_B <- sample(1:100,5,replace=TRUE)
Var_C <- sample(1:100,5,replace=TRUE)
Var_D <- sample(1:100,5,replace=TRUE)
DF <- as.data.frame(cbind(Var_A,Var_B,Var_C,Var_D))
In R, functions already exist to find the element-wise max and min, so I could easily create a new variable that is equal to the largest (or smallest) value across the columns of interest:
> DF$Max <- pmax(Var_A,Var_B,Var_C,Var_D)
> DF
Var_A Var_B Var_C Var_D Max
1 44 33 6 72 72
2 29 66 51 12 66
3 35 29 47 79 79
4 39 79 47 65 79
5 97 60 36 81 97
But what if I need to create a variable that captures, say, the second largest value in each row (i.e., across the columns)?
In the real data set that I'm working with, I have 600+ columns and about 28 million records. I need to create variables that will identify and store the largest, second largest, third largest, etc. values found when looking across the variables (columns) for each record, much like pmax would do, but for other ordinals.
The only way that I have been able to functionally make it work on a subset of the data is to do a loop, but that loop won't finish in my lifetime if I run it on the entire data set. I also considered using the apply function, but my understanding is that apply will convert the data set to a matrix first, which my data set won't take kindly to.
Any suggestions on a non-loop way to do this? And with this amount of data, the faster the better...
This may be a solution...
Var_A <- sample(1:100,5,replace=TRUE)
Var_B <- sample(1:100,5,replace=TRUE)
Var_C <- sample(1:100,5,replace=TRUE)
Var_D <- sample(1:100,5,replace=TRUE)
DF <- as.data.frame(cbind(Var_A,Var_B,Var_C,Var_D))
result <- sapply(1:nrow(DF), function(x) {
  df <- as.data.frame(DF[x, ])
  ord <- df[order(-DF[x, ])]
})
result <- t(result)
output <- cbind(DF, result)
for (i in (ncol(DF) + 1):ncol(output)) {
  colnames(output)[i] <- paste0("Max", i - ncol(DF))
}
output
Var_A Var_B Var_C Var_D Max1 Max2 Max3 Max4
1 42 12 64 9 64 42 12 9
2 67 22 47 4 67 47 22 4
3 80 56 82 94 94 82 80 56
4 31 62 88 73 88 73 62 31
5 91 67 15 41 91 67 41 15
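If the 600+ columns are all numeric, another option worth considering is converting to a matrix once and using the matrixStats package, whose row-wise order statistics avoid an explicit loop over 28 million rows. A sketch, assuming matrixStats is installed and the columns contain no NAs (rowOrderStats takes the k-th smallest value, so the largest is which = ncol(m)):
library(matrixStats)

m <- as.matrix(DF[, c("Var_A", "Var_B", "Var_C", "Var_D")])
DF$Max1 <- rowMaxs(m)                               # largest value in each row
DF$Max2 <- rowOrderStats(m, which = ncol(m) - 1)    # second largest
DF$Max3 <- rowOrderStats(m, which = ncol(m) - 2)    # third largest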

Several Grubbs tests simultaneously in R

I'm new to R and just starting with the outliers package. This is probably very easy, but could anybody tell me how to run several Grubbs tests at the same time? I have 20 columns and I want to test all of them simultaneously.
Thanks in advance
Edit: Sorry for not explaining well; I'll try. I started using R today and learned how to run a Grubbs test using grubbs.test(data$S1, type = 10 or 11 or 20), and it goes well. But I have a table with 20 columns, and I want to run a Grubbs test on each of them simultaneously. I can do it one by one, but I think there must be a way to do it faster.
I ran the code at How to repeat the Grubbs test and flag the outliers as well, and it works perfectly, but again, I would like to do it with my 20 samples.
As an example of my data:
S1 S2 S3 S4 S5 S6 S7
96 40 99 45 12 16 48
52 49 11 49 59 77 64
18 43 11 67 6 97 91
79 19 39 28 45 44 99
9 78 88 6 25 43 78
60 12 29 32 2 68 25
18 61 60 30 26 51 70
96 98 55 74 83 17 69
19 0 17 24 0 75 45
42 70 71 7 61 82 100
39 80 71 58 6 100 94
100 5 41 18 33 98 97
Hope this helps.
You can use lapply:
library(outliers)
df = data.frame(a=runif(20),b=runif(20),c=runif(20))
tests = lapply(df,grubbs.test)
# or with parameters:
tests = lapply(df,grubbs.test,opposite=T)
Results:
> tests
$a
Grubbs test for one outlier
data: X[[i]]
G = 1.80680, U = 0.81914, p-value = 0.6158
alternative hypothesis: highest value 0.963759744539857 is an outlier
$b
Grubbs test for one outlier
data: X[[i]]
G = 1.53140, U = 0.87008, p-value = 1
alternative hypothesis: highest value 0.975481075001881 is an outlier
$c
Grubbs test for one outlier
data: X[[i]]
G = 1.57910, U = 0.86186, p-value = 1
alternative hypothesis: lowest value 0.0136249314527959 is an outlier
You can access the results as follows:
> tests$a$statistic
G U
1.8067906 0.8191417
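To pull the key numbers out of all the tests at once, the list can be collapsed into a small summary data frame (a sketch using the tests object from above):
data.frame(column  = names(tests),
           G       = sapply(tests, function(t) unname(t$statistic["G"])),
           p_value = sapply(tests, function(t) t$p.value),
           row.names = NULL)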
Hope this helps.
@Florian's answer can be updated a bit. For example, a fancier, easier-to-read result can be achieved with the purrr package and the tidyverse. This can be useful if you are comparing loads of groups:
Load necessary packages:
library(dplyr)
library(purrr)
library(tidyr)
library(outliers)
Create some data - we're going to use the same data as in Florian's answer, but transformed into a tibble in long format:
df <- tibble(a = runif(20),
             b = runif(20),
             c = runif(20)) %>%
  # transform to long format
  tidyr::gather(letter, value)
Then instead of apply functions we can use map and map_dbl from purrr:
df %>%
group_by(letter) %>%
nest() %>%
mutate(n = map_dbl(data, ~ nrow(.x)), # number of entries
G = map(data, ~ grubbs.test(.x$value)$statistic[[1]]), # G statistic
U = map(data, ~ grubbs.test(.x$value)$statistic[[2]]), # U statistic
grubbs = map(data, ~ grubbs.test(.x$value)$alternative), # Alternative hypothesis
p_grubbs = map_dbl(data, ~ grubbs.test(.x$value)$p.value)) %>% # p-value
# Let's make the output more fancy
mutate(G = signif(unlist(G), 3),
U = signif(unlist(U), 3),
grubbs = unlist(grubbs),
p_grubbs = signif(p_grubbs, 3)) %>%
select(-data) %>% # remove temporary column
arrange(p_grubbs)
And the desired output would be this:
# A tibble: 3 x 6
letter n G U grubbs p_grubbs
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 c 20 1.68 0.843 lowest value 0.0489965472370386 is an outlier 0.84
2 a 20 1.58 0.862 lowest value 0.0174888013862073 is an outlier 1
3 b 20 1.57 0.863 lowest value 0.0656482006888837 is an outlier 1
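A small refinement on the above (a sketch using the same packages and the same long-format df): run grubbs.test once per group, store the htest object, and extract the pieces from it, instead of calling the test four times per group:
df %>%
  group_by(letter) %>%
  nest() %>%
  mutate(test     = map(data, ~ grubbs.test(.x$value)),   # run the test once per group
         n        = map_dbl(data, nrow),
         G        = map_dbl(test, ~ unname(.x$statistic["G"])),
         U        = map_dbl(test, ~ unname(.x$statistic["U"])),
         grubbs   = map_chr(test, "alternative"),
         p_grubbs = map_dbl(test, "p.value")) %>%
  select(-data, -test) %>%
  arrange(p_grubbs)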

Apply over all columns and rows of two diffrent dataframes in R

I'm trying to apply a function over all rows and columns of two dataframes, but I don't know how to solve it with apply.
I think the following script explains what I intend to do and the way I tried to solve it. Any advice would be warmly appreciated! Please note that simplefunction is only intended to be an example function, to keep things simple.
# some data and a function
df1<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
df2<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
simplefunction<-function(a,b){a+b}
# apply on a single row
simplefunction(df1[1,2],df2[1,2])
# apply over all colums
apply(?)
## apply over all columns and rows
# create df to receive results
df3<-df2
# loop it
for (i in 2:5)df3[i]<-apply(?)
My first mapply answer!! For your simple example you have...
mapply( FUN = `+` , df1[,-1] , df2[,-1] )
# a b c
# [1,] 60 35 75
# [2,] 57 39 92
# [3,] 72 71 48
# [4,] 31 19 85
# [5,] 47 66 58
You can extend it like so...
mapply( FUN = function(x,y,z,etc){ simplefunctioncodehere} , df1[,-1] , df2[,-1] , ... other dataframes here )
The dataframes are passed to the function in order, so in this example df1 would be x, df2 would be y, and z and etc would be whatever other dataframes you specify, in that order. Hopefully that makes sense. mapply steps through the dataframes column by column, passing the first column of each dataframe to the function, then the second column of each, and so on; because + is vectorised, the addition is then applied element-wise down each column.
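mapply returns a matrix here, so if you want to keep the identifier column alongside the sums, you can bind it back on (a sketch, assuming the name columns of df1 and df2 line up row for row):
df3 <- data.frame(name = df1$name, mapply(`+`, df1[, -1], df2[, -1]))
df3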
You can also use Reduce:
set.seed(45) # for reproducibility
Reduce(function(x,y) { x + y}, list(df1[, -1], df2[,-1]))
# a b c
# 1 53 22 23
# 2 64 28 91
# 3 19 56 51
# 4 38 41 53
# 5 28 42 30
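Reduce extends to any number of data frames of the same shape; for example, with a hypothetical third data frame dfX built the same way as df1 and df2:
dfX <- data.frame(name = c("aa", "bb", "cc", "dd", "ee"),
                  a = sample(1:50, 5), b = sample(1:50, 5), c = sample(1:50, 5))
Reduce(`+`, list(df1[, -1], df2[, -1], dfX[, -1]))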
You can just do:
df1[,-1] + df2[,-1]
Which gives :
a b c
1 52 24 37
2 65 63 62
3 31 90 89
4 90 35 33
5 51 33 45
