How to randomly select rows from a data frame for which the row skewness is larger than a given value in R

I am trying to select random rows from a data frame with 1000 lines (and six columns) where the skewness of the line is larger than a given value (say Sk > 0.3).
I've generated the following data frame
df=data.frame(replicate(6,sample(10:100,1000,rep=TRUE)))
I can get row skewness from the fBasics package:
rowSkewness(df) gives:
[8] -0.2243295435 0.5306809351 0.0707122386 0.0341447417 0.3339384838 -0.3910593364 -0.6443905090
[15] 0.5603809206 0.4406091534 -0.3736108832 0.0397860038 0.9970040772 -0.7702547535 0.2065830354
But now I need to select, say, 10 rows of df that have a row skewness greater than, say, 0.1. Maybe with
for (a in 1:10) {
sample.data[a,] = sample(x=df[which(rowSkewness(df[sample(1:nrow(df),1)>0.1),], size = 1, replace = TRUE)
}
or something like this?
Any thoughts on this will be appreciated.
thanks in advance.

You can use the sample_n() function or sample_frac(), which makes your version a little shorter:
library(dplyr)
library(fBasics)
df=data.frame(replicate(6,sample(10:100,1000,rep=TRUE)))
x=df %>% dplyr::filter(rowSkewness(df)>0.1) %>% dplyr::sample_n(10)

Got it:
x <- df %>% dplyr::filter(rowSkewness(df) > 0.1)
samplesize <- 10
# sample() on a data frame draws columns, so sample row indices instead
sample.data <- x[sample(nrow(x), samplesize, replace = TRUE), ]

Just do a subset:
res1 <- DF[fBasics::rowSkewness(DF) > .1, ]
head(res1)
# X1 X2 X3 X4 X5 X6
# 7 56 28 21 93 74 24
# 8 33 56 23 44 10 12
# 12 29 19 29 38 94 95
# 13 35 51 54 98 66 10
# 14 12 51 24 23 36 68
# 15 50 37 81 22 55 97
Or with e1071::skewness:
res2 <- DF[apply(as.matrix(DF), 1, e1071::skewness) > .1, ]
stopifnot(all.equal(res1, res2))
Data
set.seed(42); DF <- data.frame(replicate(6, sample(10:100, 1000, rep=TRUE)))
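For readers without fBasics or e1071 installed, the filter-then-sample idea from the answers above can be sketched in base R. The helper row_skew is a hypothetical name, and its moment-based formula is an assumption; it should be close to what rowSkewness reports by default, but check against your package of choice before relying on it.

```r
set.seed(42)
df <- data.frame(replicate(6, sample(10:100, 1000, replace = TRUE)))

# Hypothetical helper: moment-based sample skewness of a numeric vector
row_skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / sd(v)^3
}

sk <- apply(as.matrix(df), 1, row_skew)   # one skewness value per row
eligible <- which(sk > 0.3)               # row indices passing the threshold

# Draw 10 distinct eligible rows; index into the candidates so that
# sample() is never handed a single number (it would treat it as 1:n)
picked <- df[eligible[sample.int(length(eligible), 10)], ]
```

Sampling positions within eligible, rather than calling sample(eligible, 10) directly, avoids the surprise that occurs when only one row passes the threshold.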

Related

Randomly Select 10 percent of data from the whole data set in R

For my project, I have a data set with 1296765 observations of 23 columns, and I want to take just 10% of this data at random. How can I do that in R?
I tried the code below, but it sampled only 10 rows, whereas I want a random 10% of the data. I am a beginner, so please help.
library(dplyr)
x <- sample_n(train, 10)
Here is a function from dplyr that selects rows at random by a specified proportion:
dplyr::slice_sample(train,prop = .1)
In base R, you can subset by sampling a proportion of nrow():
set.seed(13)
train <- data.frame(id = 1:101, x = rnorm(101))
train[sample(nrow(train), nrow(train) / 10), ]
id x
69 69 1.14382456
101 101 -0.36917269
60 60 0.69967564
58 58 0.82651036
59 59 1.48369123
72 72 -0.06144699
12 12 0.46187091
89 89 1.60212039
8 8 0.23667967
49 49 0.27714729
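If the rows that were not drawn are needed later (for example as a hold-out set), the same base R idea can keep both parts. This is only a sketch; the names n_keep, sampled and rest are made up for illustration, and floor() is one possible rounding choice.

```r
set.seed(13)
train <- data.frame(id = 1:101, x = rnorm(101))

n_keep <- floor(0.10 * nrow(train))        # 10 rows out of 101
idx <- sample.int(nrow(train), n_keep)     # row positions, drawn without replacement

sampled <- train[idx, ]                    # the 10% sample
rest    <- train[-idx, ]                   # the remaining rows
```

Making the rounding explicit with floor() (or ceiling()) keeps the sample size predictable when nrow(train) is not a multiple of 10.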

Apply loop for rollapply windows

I currently have a dataset with 50,000+ rows of data for which I need to find rolling sums. I have done this using rollapply, which has worked perfectly. I need to apply these rolling sums across a range of widths (600, 1200, 1800...6000), which I have done by cutting and pasting each line of script and changing the width. While it works, I'd like to tidy my script by applying a loop, or similar, if possible, so that once the rollapply function has completed its first 'pass' at width 600, it then does the same with 1200 and so on. Example:
Var1 Var2 Var3
1 11 19
43 12 1
4 13 47
21 14 29
41 15 42
16 16 5
17 17 16
10 18 15
20 19 41
44 20 27
width_2 <- rollapply(x$Var1, FUN = sum, width = 2)
width_3 <- rollapply(x$Var1, FUN = sum, width = 3)
width_4 <- rollapply(x$Var1, FUN = sum, width = 4)
Is there a way to run widths 2, 3, then 4 in a simpler way rather than cut and paste, particularly when I have up to 10 widths, and then need to run this across other cols. Any help would be appreciated.
We can use lapply in base R
lst1 <- lapply(2:4, function(i) rollapply(x$Var1, FUN = sum, width = i))
names(lst1) <- paste0('width_', 2:4)
list2env(lst1, .GlobalEnv)
NOTE: It is not recommended to create multiple objects in the global environment. Instead, the list would be better
Or with a for loop
for(v in 2:4) {
assign(paste0('width_', v), rollapply(x$Var1, FUN = sum, width = v))
}
Or create a function so that the same can be done for multiple columns:
f1 <- function(col1, i) {
rollapply(col1, FUN = sum, width = i)
}
lapply(x[c('Var1', 'Var2')], function(x) lapply(2:4, function(i)
f1(x, i)))
Instead of creating separate vectors in the global environment, you can add these as new columns in the existing data frame.
Note that rollapply(..., FUN = sum) is the same as rollsum.
library(dplyr)
library(zoo)
bind_cols(x, purrr::map_dfc(2:4,
~x %>% transmute(!!paste0('Var1_roll_', .x) := rollsumr(Var1, .x, fill = NA))))
# Var1 Var2 Var3 Var1_roll_2 Var1_roll_3 Var1_roll_4
#1 1 11 19 NA NA NA
#2 43 12 1 44 NA NA
#3 4 13 47 47 48 NA
#4 21 14 29 25 68 69
#5 41 15 42 62 66 109
#6 16 16 5 57 78 82
#7 17 17 16 33 74 95
#8 10 18 15 27 43 84
#9 20 19 41 30 47 63
#10 44 20 27 64 74 91
You can use seq to generate the variable window size.
seq(600, 6000, 600)
#[1] 600 1200 1800 2400 3000 3600 4200 4800 5400 6000
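Putting the pieces together, here is a minimal sketch that loops over several widths and several columns and keeps everything in one nested list. To stay self-contained it uses a hypothetical base R helper, roll_sum, built on cumsum instead of zoo; with zoo loaded, rollsumr(col, w, fill = NA) could take its place.

```r
x <- data.frame(
  Var1 = c(1, 43, 4, 21, 41, 16, 17, 10, 20, 44),
  Var2 = 11:20
)

# Hypothetical helper: right-aligned rolling sum of width w, NA-padded at the start
roll_sum <- function(v, w) {
  cs <- cumsum(c(0, v))
  out <- cs[(w + 1):length(cs)] - cs[1:(length(cs) - w)]
  c(rep(NA, w - 1), out)
}

widths <- 2:4   # for the real data this could be seq(600, 6000, 600)

# One list element per column, each holding one vector per width
res <- lapply(x[c("Var1", "Var2")], function(col) {
  out <- lapply(widths, function(w) roll_sum(col, w))
  names(out) <- paste0("width_", widths)
  out
})
```

A single result is then available as, for example, res$Var1$width_3, and nothing is written to the global environment. The helper assumes each width is no larger than the number of rows.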

Normalise only some columns in R

I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). The problem I have is as follows. My data looks like this (it is held in CSV):
RecNo Var1 Var2 Var3
41 800 201.8 Y
43 140 39 N
47 60 20.24 N
49 687 77 Y
54 570 135 Y
58 1250 467 N
61 211 52 N
64 96 117.3 N
68 687 77 Y
Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.
I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
sweep(x, 2, apply(x, 2, max), "/")
}
However, when I apply it to my above data by calling
myResult <- normalize(myData)
it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.
So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?
I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine:
1. akrun's suggestion above:
myData2 <- myData1 %>% mutate_at(2:3, funs((.-min(.))/max(.-min(.))))
This produces the following:
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
2. Alternatively, there is the package BBmisc, which allowed me the following after transforming my record numbers to factors:
> myData <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData, method="range", range = c(0,1), margin = 1)
> myNorm
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
EDIT: For completeness I include TUSHAr's solution as well, showing as always that there are many ways around a single problem:
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)
Thank you for your help!
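The original question asked for a function that accepts the names of the columns to transform and returns the full dataset. A minimal base R sketch of that idea follows; normalize_cols is a hypothetical name chosen to avoid clashing with BBmisc::normalize, and the data frame is typed in from the example above.

```r
myData <- data.frame(
  RecNo = c(41, 43, 47, 49, 54, 58, 61, 64, 68),
  Var1  = c(800, 140, 60, 687, 570, 1250, 211, 96, 687),
  Var2  = c(201.8, 39, 20.24, 77, 135, 467, 52, 117.3, 77),
  Var3  = c("Y", "N", "N", "Y", "Y", "N", "N", "N", "Y")
)

# Hypothetical helper: min-max scale only the named columns, keep the rest untouched
normalize_cols <- function(data, cols) {
  data[cols] <- lapply(data[cols], function(v) (v - min(v)) / (max(v) - min(v)))
  data
}

myResult <- normalize_cols(myData, c("Var1", "Var2"))
```

RecNo and Var3 pass through unchanged, so there is no need to convert the record numbers to factors first. A column with a single repeated value would divide by zero here, so such columns would need separate handling.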

Several Grubbs tests simultaneously in R

I'm new to R and just starting with the outliers package. This is probably very easy, but could anybody tell me how to run several Grubbs tests at the same time? I have 20 columns and I want to test all of them simultaneously.
Thanks in advance
Edit: Sorry for not explaining well. I'll try again. I started using R today and I learned how to run a Grubbs test using grubbs.test(data$S1, type=10 or 11 or 20), and it works well. But I have a table with 20 columns, and I want to run the Grubbs test on each of them simultaneously. I can do it one by one, but I think there must be a faster way.
I ran the code from "How to repeat the Grubbs test and flag the outliers" as well, and it works perfectly, but again, I would like to do it with my 20 samples.
As an example of my data:
S1 S2 S3 S4 S5 S6 S7
96 40 99 45 12 16 48
52 49 11 49 59 77 64
18 43 11 67 6 97 91
79 19 39 28 45 44 99
9 78 88 6 25 43 78
60 12 29 32 2 68 25
18 61 60 30 26 51 70
96 98 55 74 83 17 69
19 0 17 24 0 75 45
42 70 71 7 61 82 100
39 80 71 58 6 100 94
100 5 41 18 33 98 97
Hope this helps.
You can use lapply:
library(outliers)
df = data.frame(a=runif(20),b=runif(20),c=runif(20))
tests = lapply(df,grubbs.test)
# or with parameters:
tests = lapply(df,grubbs.test,opposite=T)
Results:
> tests
$a
Grubbs test for one outlier
data: X[[i]]
G = 1.80680, U = 0.81914, p-value = 0.6158
alternative hypothesis: highest value 0.963759744539857 is an outlier
$b
Grubbs test for one outlier
data: X[[i]]
G = 1.53140, U = 0.87008, p-value = 1
alternative hypothesis: highest value 0.975481075001881 is an outlier
$c
Grubbs test for one outlier
data: X[[i]]
G = 1.57910, U = 0.86186, p-value = 1
alternative hypothesis: lowest value 0.0136249314527959 is an outlier
You can access the results as follows:
> tests$a$statistic
G U
1.8067906 0.8191417
Hope this helps.
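To compare many columns at a glance, the list returned by lapply can be flattened into one row per column. The sketch below uses the first three columns of the example data from the question; the names dat and summary_tbl are made up for illustration, and it assumes the outliers package is installed.

```r
library(outliers)

# First three columns of the example table from the question
dat <- data.frame(
  S1 = c(96, 52, 18, 79, 9, 60, 18, 96, 19, 42, 39, 100),
  S2 = c(40, 49, 43, 19, 78, 12, 61, 98, 0, 70, 80, 5),
  S3 = c(99, 11, 11, 39, 88, 29, 60, 55, 17, 71, 71, 41)
)

tests <- lapply(dat, grubbs.test)   # one htest object per column

# Collect the pieces of interest into one row per column
summary_tbl <- data.frame(
  column      = names(tests),
  G           = sapply(tests, function(t) unname(t$statistic[1])),
  p_value     = sapply(tests, function(t) t$p.value),
  alternative = sapply(tests, function(t) t$alternative),
  row.names   = NULL
)
```

With 20 columns the same code applies unchanged, since lapply iterates over however many columns the data frame has.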
@Florian's answer can be extended a bit. For example, a tidier, easier-to-read result can be achieved with the purrr package and the tidyverse. It can be useful if you are comparing many groups:
Load necessary packages:
library(dplyr)
library(purrr)
library(tidyr)
library(outliers)
Create some data - we're going to use the same from Florian's answer, but transformed to a modern tibble and long format:
df <- tibble(a = runif(20),
b = runif(20),
c = runif(20)) %>%
# transform to long format
tidyr::gather(letter, value)
Then instead of apply functions we can use map and map_dbl from purrr:
df %>%
group_by(letter) %>%
nest() %>%
mutate(n = map_dbl(data, ~ nrow(.x)), # number of entries
G = map(data, ~ grubbs.test(.x$value)$statistic[[1]]), # G statistic
U = map(data, ~ grubbs.test(.x$value)$statistic[[2]]), # U statistic
grubbs = map(data, ~ grubbs.test(.x$value)$alternative), # Alternative hypothesis
p_grubbs = map_dbl(data, ~ grubbs.test(.x$value)$p.value)) %>% # p-value
# Let's make the output more fancy
mutate(G = signif(unlist(G), 3),
U = signif(unlist(U), 3),
grubbs = unlist(grubbs),
p_grubbs = signif(p_grubbs, 3)) %>%
select(-data) %>% # remove temporary column
arrange(p_grubbs)
And the desired output would be this:
# A tibble: 3 x 6
letter n G U grubbs p_grubbs
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 c 20 1.68 0.843 lowest value 0.0489965472370386 is an outlier 0.84
2 a 20 1.58 0.862 lowest value 0.0174888013862073 is an outlier 1
3 b 20 1.57 0.863 lowest value 0.0656482006888837 is an outlier 1

outcome variable as argument in regression function

I have a datasetup function which currently has 2 arguments: testData and ID1. I want to include outcome variable as an argument.
Suppose outcomevar=c(y1,y2,y3) then the function should create the lagged and differenced variable of my outcome variable.
preparedata<-function(testData,ID1,outcomevar){
#Order temp data by firm and date
testData <- testData[order(testData$firm,testData$date),]
#Create lagged outcomevar for each firm
testData <- ddply(testData, .(firm), transform,
ly1 = c( NA, y1[-length(y1)] ) )
#Create differenced variable
testData$dy1<-(testData$y1-testData$ly1)
}
where the "l" and "d" in front of y1 stand for lagged and differenced.
How can I include the outcome variable?
Thanks
T
Here's a solution using data tables:
# create sample dataset
set.seed(1)
df <- data.frame(firm=rep(LETTERS[1:5],each=10),
date=as.Date("2014-01-01")+1:10,
y1=sample(1:100,50),y2=sample(1:100,50),y3=sample(1:100,50))
preparedata<-function(testData,ID1,outcomevar){
require(data.table)
DT <- as.data.table(testData)
setkey(DT,firm,date)
DT[,lag := c(NA,unlist(.SD)[-.N]), by=firm, .SDcols=outcomevar]
DT[,diff := c(NA,diff(unlist(.SD))), by=firm, .SDcols=outcomevar]
setnames(DT,c("lag","diff"),paste0(c("l","d"),outcomevar))
return(DT)
}
result <- preparedata(df,1,outcomevar="y1")
head(result)
# firm date y1 y2 y3 ly1 dy1
# 1: A 2014-01-02 27 48 66 NA NA
# 2: A 2014-01-03 37 86 35 27 10
# 3: A 2014-01-04 57 43 27 37 20
# 4: A 2014-01-05 89 24 97 57 32
# 5: A 2014-01-06 20 7 61 89 -69
# 6: A 2014-01-07 86 10 21 20 66
This assumes you pass the name of the column containing the "outcomevar", not the column itself.
You should read the documentation on data tables (?data.table), but in brief this code converts the input data frame to a data table, orders the data table (using setkey(...)), and adds two new columns by reference: lag and diff. .SD is a special variable in the data table framework which is an alias for "the subset of the original DT containing the rows specified in by=...". You can specify which columns to include using .SDcols=.... The diff(...) function calculates lagged differences, which is the same thing you were doing. Finally, we rename the columns lag and diff to, e.g. ly1 and dy1.
Here is an outline of a function that relies more heavily on your example:
preparedata<-function(testData,outcomevar){
require(plyr)
testData <- testData[order(testData$firm,testData$date),]
testData$tmp.var <- with(testData, eval(parse(text=outcomevar)))
testData <- ddply(testData, .(firm), transform,
lvar = c( NA, tmp.var[-length(tmp.var)]))
testData$tmp.var <- NULL
testData <- within(testData, assign(paste("d", outcomevar, sep=""),
testData[,outcomevar]-testData$lvar))
colnames(testData)[grep("lvar", colnames(testData))] <- paste("l", outcomevar, sep="")
return(testData)
}
Using the df defined in jihoward's answer, we get
> head(preparedata(df,"y1"))
firm date y1 y2 y3 ly1 dy1
1 A 2014-01-02 27 48 66 NA NA
2 A 2014-01-03 37 86 35 27 10
3 A 2014-01-04 57 43 27 37 20
4 A 2014-01-05 89 24 97 57 32
5 A 2014-01-06 20 7 61 89 -69
6 A 2014-01-07 86 10 21 20 66
This function returns a dataframe where ly1 is the lagged variable, and dy1 is the differenced variable that was specified with the second argument outcomevar. Note that in this function, you pass the name (i.e. a character) to the function. That is, do not write y1, but "y1" when you call the function.
You could process all outcome variables simultaneously by first gathering them into a key-value column pair:
set.seed(1)
df <- data.frame(
firm = rep(LETTERS[1:5], each = 10),
date = as.Date("2014-01-01") + 1:10,
y1 = sample(100, 50),
y2 = sample(100, 50),
y3 = sample(100, 50)
)
library(dplyr)
library(tidyr)
df %>%
gather(key, value, y1:y3) %>%
group_by(firm, key) %>%
# diff is the current value minus the lagged value, matching dy1 in the question
mutate(lag = lag(value), diff = value - lag)
#> Source: local data frame [150 x 6]
#> Groups: firm, key
#>
#> firm date key value lag diff
#> 1 A 2014-01-02 y1 27 NA NA
#> 2 A 2014-01-03 y1 37 27 10
#> 3 A 2014-01-04 y1 57 37 20
#> 4 A 2014-01-05 y1 89 57 32
#> 5 A 2014-01-06 y1 20 89 -69
#> 6 A 2014-01-07 y1 86 20 66
#> 7 A 2014-01-08 y1 97 86 11
#> 8 A 2014-01-09 y1 62 97 -35
#> 9 A 2014-01-10 y1 58 62 -4
#> 10 A 2014-01-11 y1 6 58 -52
#> .. ... ... ... ... ... ...
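If the wide layout from the question is preferred (ly1, dy1, ly2, dy2, ...), a base R sketch that accepts a character vector of outcome names could look like the following. preparedata_multi is a hypothetical name, and the lag is taken within each firm after ordering by date, as in the question.

```r
set.seed(1)
df <- data.frame(
  firm = rep(LETTERS[1:5], each = 10),
  date = as.Date("2014-01-01") + 1:10,
  y1 = sample(100, 50), y2 = sample(100, 50), y3 = sample(100, 50)
)

# Hypothetical variant of preparedata(): outcomevar is a character vector of column names
preparedata_multi <- function(testData, outcomevar) {
  testData <- testData[order(testData$firm, testData$date), ]
  for (v in outcomevar) {
    # Lag within each firm: shift values down by one and pad with NA
    lagged <- ave(testData[[v]], testData$firm,
                  FUN = function(z) c(NA, z[-length(z)]))
    testData[[paste0("l", v)]] <- lagged
    testData[[paste0("d", v)]] <- testData[[v]] - lagged
  }
  testData
}

result <- preparedata_multi(df, c("y1", "y2", "y3"))
```

As with the other answers, the outcome variables are passed as names (character strings such as "y1"), not as bare symbols.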

Resources