I have a data frame of values and want to count the number of values in each column that are greater than or equal to each of a series of thresholds. The ultimate goal is to produce a graph that looks like panels B and D in this figure.
I have figured out a way to do this with a for loop but would like to avoid a loop if possible:
data <- as.data.frame(attitude)
max <- 100
counts <- data.frame(t(c(score = 1, colSums(data >= 1))))
for (x in 2:max) {
counts <- rbind(counts, c(score = x, colSums(data >= x)))
}
I tried adapting the code from this question (Count the number of values per column above a range of thresholds in R), but it gives an error and does not produce the expected result:
as.data.frame() %>%
lapply(
function(z) table(findInterval(z, 0:max, rightmost.closed = TRUE))
) %>%
do.call(cbind, .) %>%
as.data.frame()
Is there a way to do this without a loop? Thanks.
You can do this with sapply/lapply:
data <- as.data.frame(attitude)
num <- seq_len(100)
result <- data.frame(score = num,
do.call(rbind, lapply(num, function(x) colSums(data >= x))))
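A minimal sketch of the same idea with vapply, in case a type-checked variant is useful. The names thresholds and counts_mat are made up for this illustration; the data is the built-in attitude dataset from the question.

```r
data <- as.data.frame(attitude)   # built-in dataset, as in the question
thresholds <- seq_len(100)        # hypothetical name for the score range

# vapply() returns one column per threshold (7 x 100 here), so transpose it
counts_mat <- t(vapply(thresholds,
                       function(x) colSums(data >= x),
                       numeric(ncol(data))))

# one row per threshold, one count column per original variable
counts <- data.frame(score = thresholds, counts_mat)
head(counts)
```

vapply() stops with an error if any call returns something other than a numeric vector of length ncol(data), which is the main reason to prefer it over sapply() here.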
I want to write a very simple program: it should go through all columns in a data frame and, if a column has at least one observation bigger than 1000, divide that variable by 1000 and add "in (000)" to the variable name.
My solution
set.seed(42)
df <- data.frame("Norm" = rnorm(100, 1000, 0.1), rexp(100))
for (var in seq_len(ncol(df))) {
if (max(df[, var], na.rm = TRUE) > 1000) {
df[, var] <- df[, var] / 1000
colnames(df)[var] <- paste(colnames(df)[var], "(in '000)")
}
}
However, I find it a little inconvenient, and I think the loop is not necessary here. I tried to do it with apply, but I'm not sure why I get a single value per column instead of a data frame in which each value is divided by 1000:
apply(df, 2, function(x) ifelse(max(x) > 1000, x/1000, x))
Norm rexp.100.
0.9999925 0.4473922
Do you know how it can be done without loops?
apply is for matrices; don't use it on data frames. And ifelse is for vectorized tests: its output has the same shape as the test. Your test in ifelse() is max(x) > 1000, which has length 1, so the result has length 1 (just the first element of x / 1000, or of x). You could use lapply instead of for and if(){}else{} instead of ifelse():
df[] <- lapply(df, function(x) if(max(x, na.rm = TRUE) > 1000) {x / 1000}else{x})
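To see the length behaviour described above in isolation, here is a minimal sketch on a toy vector (not the question's data):

```r
x <- c(1500, 2000, 2500)   # toy column, made up for this illustration

# the test has length 1, so ifelse() hands back only one element
short <- ifelse(max(x) > 1000, x / 1000, x)

# plain if/else returns whichever whole vector the branch evaluates to
full <- if (max(x) > 1000) x / 1000 else x

short
full
```

Here short holds a single value (the first element of x / 1000), while full keeps all three elements.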
But with the *apply family of functions you have to go back and change the names in a separate step, so I often prefer a for loop in a case like that.
But I would probably do it this way without looping:
cols_over_1000 = sapply(df, max, na.rm = TRUE) > 1000
df[cols_over_1000] = df[cols_over_1000] / 1000
names(df)[cols_over_1000] = paste(names(df)[cols_over_1000], "(in '000)")
Or in dplyr:
library(dplyr)
df %>%
mutate(across(
where(~ any(. > 1000)),
~ . / 1000,
.names = "{.col} (in '000)" # note: with .names, the rescaled columns are added and the originals are kept
))
ab1k <- sapply(df, function(x) any(x > 1000))
df[ab1k] <- df[ab1k]/1000
names(df)[ab1k] <- paste(names(df)[ab1k], "(in 000')")
You could try the purrr package like this:
library(dplyr)
library(purrr)
my_fun <- function(x,y){
if(max(x, na.rm = TRUE)>1000){
return(rename_with(tibble(x/1000),~paste0(y," (in '000)")))
}else{return(rename_with(tibble(x),~y))}
}
map2_dfc(df,names(df),my_fun)
Good evening,
I asked a question earlier and found the solution hard to implement, so I am asking again more clearly.
I want to add a column to a data frame of daily stock returns. Let's say the returns are normally distributed, and I would like to add a column containing the historical value at risk (VaR), computed with a function I wrote myself.
The restriction is that, for each observation, my function should use that observation together with the previous 249 observations.
When the next observation is processed, it should again use only the 249 observations from the days before it, so the input window moves forward over time. In other words, I want values from 251 days ago to be excluded. Hopefully that is clear enough; if not, maybe the code speaks for me:
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist<- function(x, n=250, hd=20, q=0.05){
width<-nrow(x)
NA.x<-na.omit(x)
quantil<-quantile(NA.x[(width-249):width],probs=q)
VaR<- quantil*sqrt(hd)%>%
return()
}
# Run the function on the dataframe
df$VaR<- df$Returns%>%VaR.hist()
Error in (width - 249):width : argument of length 0
This is the error I get instead of my new variable.
Thanks !!
As wibom wrote in the comments, nrow(x) does not work for vectors; what you need is length() instead. You also do not need return() in the last line, because R automatically returns the value of the last expression in a function if there is no earlier return().
library(dplyr)
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist <- function(x, n=250, hd=20, q=0.05){
width <- length(x) # x is a vector, so use length(); nrow() only works for data frames and matrices
NA.x <- na.omit(x)
quantil <- quantile(NA.x[(width-249):width], probs = q)
quantil*sqrt(hd)
}
# Run the function on the dataframe
df$VaR <- df$Returns %>% VaR.hist()
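A quick illustration of the nrow()/length() difference mentioned above, using a toy vector rather than the question's data:

```r
x <- rnorm(10)   # a plain numeric vector, like df$Returns

nrow(x)          # NULL: a vector has no dim attribute
NROW(x)          # NROW() treats a vector as a one-column matrix
length(x)

# (NULL - 249):NULL is effectively what the original function evaluated
msg <- tryCatch((nrow(x) - 249):nrow(x),
                error = function(e) conditionMessage(e))
msg
```

The captured message is the same "argument of length 0" error shown in the question.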
It's a bit hard to understand what you want to do exactly.
My understanding is that you wish to compute a new variable VaR, calculated from the current and previous 249 observations of df$Returns, right?
Is this roughly what you wish to do?
library(tidyverse)
set.seed(42)
df <- tibble(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns=rnorm(500)
)
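the_function <- function(i, mydata, hd = 20, q = .05) {
r <-
mydata %>%
# window of 250 rows: the current one plus the previous 249
filter(ridx <= i, ridx > i - 250) %>%
pull(Returns)
quantil <- quantile(r, probs = q)
quantil * sqrt(hd) # last expression, returned as the function's value
}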
df <-
df %>%
mutate(ridx = row_number()) %>%
mutate(VaR = map_dbl(ridx, the_function, mydata = .))
If you are looking for a base-R solution:
set.seed(42)
df <- data.frame(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns = rnorm(500)
)
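a_function <- function(i, mydata, hd = 20, q = .05) {
# window of 250 rows: the current one plus the previous 249
r <- mydata$Returns[mydata$ridx <= i & mydata$ridx > (i - 250)]
quantil <- quantile(r, probs = q)
quantil * sqrt(hd) # last expression, returned as the function's value
}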
df$ridx <- 1:nrow(df) # add index
df$VaR <- sapply(df$ridx, a_function, mydata = df)
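If you would rather get NA until a full window is available, a hedged variant could look like the sketch below. rolling_var is a hypothetical helper name, and the 250-row window (current value plus the previous 249) is my reading of the question, not something confirmed by the asker.

```r
set.seed(42)
returns <- rnorm(500)   # simulated returns, as in the question

# rolling_var is a made-up helper name for this illustration
rolling_var <- function(x, n = 250, hd = 20, q = 0.05) {
  vapply(seq_along(x), function(i) {
    if (i < n) return(NA_real_)        # not enough history yet
    window <- x[(i - n + 1):i]         # current value plus the previous n - 1
    unname(quantile(window, probs = q, na.rm = TRUE)) * sqrt(hd)
  }, numeric(1))
}

VaR <- rolling_var(returns)
head(VaR[250:500])
```

The first 249 entries are NA by construction; whether that or a partial window is preferable depends on how the column will be used.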
Below is my data
set.seed(100)
toydata <- data.frame(A = sample(1:50,50,replace = T),
B = sample(1:50,50,replace = T),
C = sample(1:50,50,replace = T)
)
Below is my swapping function
derangement <- function(x){
if(max(table(x)) > length(x)/2) return(NA)
while(TRUE){
y <- sample(x)
if(all(y != x)) return(y)
}
}
swapFun <- function(x, n = 10){
inx <- which(x < n)
y <- derangement(x[inx])
if(length(y) == 1) return(NA)
x[inx] <- y
x
}
In the first case, I get the new data toy by swapping across the entire data frame. The code is below:
toydata<-as.matrix(toydata)
toy<-swapFun(toydata)
toy<-as.data.frame(toy)
In the second case, I get the new data toy by swapping each column respectively. Below is the code:
toydata<-as.data.frame(toydata)
toy2 <- toydata # Work with a copy
toy2[] <- lapply(toydata, swapFun)
toy<-toy2
Below is the function that outputs the difference between the contingency tables after swapping.
# the function to compare contingency tables
f = function(x,y){
table1<-table(toydata[,x],toydata[,y])
table2<-table(toy[,x],toy[,y])
sum(abs(table1-table2))
}
# vectorise your function
f = Vectorize(f)
combn(x=names(toydata),
y=names(toydata), 2) %>%# create all combinations of your column names
t() %>% # transpose
data.frame(., stringsAsFactors = F) %>% # save as dataframe
filter(X1 != X2) %>% # exclude pairs of same
# column
mutate(SumAbs = f(X1,X2)) # apply function
In the second case, this mutate call works.
But in the first case, the mutate call does not work. It says:
+ filter(X1 != X2) %>% # exclude pairs of same column
+ mutate(SumAbs = f(X1,X2)) # apply function
Error in combn(x = names(toydata), y = names(toydata), 2) : n < m
However, in both cases the toy data are data frames with the same dimensions, the same row names and the same column names, so I am confused.
How can I fix it? Thanks.
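One thing worth checking, as a hedged sketch: in the first case toydata itself is left as a matrix (only toy is converted back to a data frame), and names() on a matrix returns NULL, which is enough for combn() to fail with "n < m". The objects df_demo and m_demo below are toy stand-ins, not the question's data.

```r
df_demo <- data.frame(A = 1:3, B = 4:6)   # toy stand-in for toydata
m_demo  <- as.matrix(df_demo)             # what as.matrix(toydata) produces

names(df_demo)     # column names of the data frame
names(m_demo)      # NULL: a matrix has dimnames, not names
colnames(m_demo)   # works for both matrices and data frames

# combn() on NULL fails because there are 0 elements to choose 2 from
res <- tryCatch(combn(names(m_demo), 2),
                error = function(e) conditionMessage(e))
res
```

If that is the cause, using colnames(toydata) in the combn() call, or converting toydata back with as.data.frame() before building the pairs, should avoid the error.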
Neither this post nor this post applies to my case.
Assume:
set.seed(42)
x<-rep(c("A","B","C"), c(3,4,1))
y<-rep(c("V","W"),c(5,3))
z<-rnorm(8,-2,1)
df<-data.frame(x,y,z)
boxplot(z~x+y,df)
I want my plot to include only groups with more than, say, one element, so it should show only A.V, B.V and B.W.
Furthermore, since my real graph has about 70 groups, I don't want to write the list out by hand.
Thanks
You can create a new column ('xy') using paste, build a logical index with ave that flags the 'xy' groups having more than one element, and then draw the boxplot.
df1 <- df # work on a copy of the question's data
df1$xy <- factor(paste(df1$x, df1$y, sep='.'))
index <- with(df1, ave(seq_len(nrow(df1)), xy, FUN=length)) > 1
boxplot(z~xy, droplevels(df1[index,]))
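To make the ave step less opaque, here is a small sketch of what the index contains for the question's grouping. The names group_size and keep are made up for this illustration.

```r
x  <- rep(c("A", "B", "C"), c(3, 4, 1))
y  <- rep(c("V", "W"), c(5, 3))
xy <- factor(paste(x, y, sep = "."))

# ave() replaces every row's value with the size of the group it belongs to
group_size <- ave(seq_along(xy), xy, FUN = length)
keep <- group_size > 1      # TRUE for rows in groups with more than one element

data.frame(xy, group_size, keep)
levels(droplevels(xy[keep]))
```

Only the single C.W row is flagged FALSE, so droplevels() leaves A.V, B.V and B.W for the plot.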
Or using ggplot
library(dplyr)
library(tidyr)
library(ggplot2)
df %>%
group_by(x,y) %>%
filter(n()>1) %>%
unite(xy, x,y) %>%
ggplot(., aes(xy, z))+
geom_boxplot()
You can check whether any bp$n are 0 and subset on that:
set.seed(42)
df <- data.frame(x = rep(c("A","B","C"), c(3,4,1)),
y = rep(c("V","W"),c(5,3)),
z = rnorm(8,-2,1))
bp <- boxplot(z ~ x + y, df, plot = FALSE)
wh <- which(bp$n == 0)
bp[] <- lapply(bp, function(x) if (length(x)) {
## `bp` contains a list of boxplot statistics, some vectors and
## some matrices which need to be indexed accordingly
if (!is.null(nrow(x))) x[, -wh] else x[-wh]
## some of `bp` will not be present depending on how you called
## `boxplot`, so if that is the case, you need to leave them alone
## to keep the original structure so `bxp` can deal with it
} else x)
## call `bxp` on the subset of `bp`
bxp(bp)
Or you can use any value you like:
wh <- which(bp$n <= 1)
bp[] <- lapply(bp, function(x) if (length(x)) {
if (!is.null(nrow(x))) x[, -wh] else x[-wh]
} else x)
bxp(bp)
I have daily data (4011 days) together with an indicator (1 = weekday, 2 = weekend). I want to find the weekly maxima together with the corresponding indicator. For example, suppose the data are:
mydat <- matrix(c(0.027,0.034,0.019,0.021,0.026,0.024,0.058,0.026,0.064,
0.066,0.026,0.101,0.069,0.054,rep(2,2),rep(1,5),rep(2,2),rep(1,5)), ncol=2)
I tried the following code. It gives me the maximum of each sequence (here, the weekly maxima), but it also takes the maximum of the indicator column, which I do not want; I want the indicator that belongs to each maximum. Here is the code:
week.max <- function(vec){
if(length(vec[is.na(vec)]) == 7){
return(NA)
}
else{
return(max(vec, na.rm = T))
}
}
max.week.dat <- apply(mydat, 2, function(x) tapply(x, rep(1:(length(x)/7),
each=7, len=length(x)), week.max))
and the result
matrix(c(0.058,0.101,2,2),ncol=2)
I want the output like this:
matrix(c(0.058,0.101,1,1),ncol=2)
Many thanks in advance.
Here is the data (with an extra day in the third week)
mydat <- data.frame(value = c(0.027,0.034,0.019,0.021,0.026,0.024,0.058,0.026,0.064,
0.066,0.026,0.101,0.069,0.054,0.95), ind = c(rep(2,2),rep(1,5),rep(2,2),rep(1,5),2))
Your function
week.max <- function(vec){
if(length(vec[is.na(vec)]) == 7){
return(NA)
}
else{
return(max(vec, na.rm = T))
}
}
Add the week information
mydat$week <- c(rep(1:2,each=7),3)
Then use the same approach as in the linked answer, keeping the rows where value equals the weekly maximum:
library(plyr)
ddply(mydat, .(week), subset, subset = value==week.max(value), select = -week)
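If you prefer to avoid the plyr dependency, a base-R sketch of the same row selection might look like this. It assumes there are no NAs in value (the week.max() helper above handles them; this sketch does not), and top is a made-up name.

```r
mydat <- data.frame(
  value = c(0.027, 0.034, 0.019, 0.021, 0.026, 0.024, 0.058, 0.026, 0.064,
            0.066, 0.026, 0.101, 0.069, 0.054, 0.95),
  ind   = c(rep(2, 2), rep(1, 5), rep(2, 2), rep(1, 5), 2)
)
mydat$week <- c(rep(1:2, each = 7), 3)

# ave() repeats each week's maximum on every row of that week,
# so the comparison keeps exactly the rows that attain the maximum
is_week_max <- mydat$value == ave(mydat$value, mydat$week, FUN = max)
top <- mydat[is_week_max, c("value", "ind")]
top
```

Ties within a week would return more than one row for that week, which is also how the subset-based ddply call behaves.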