Subset columns in R with specific values

Given a large data frame, e.g. df, with 500 columns and 100 rows, how do I subset only the columns whose values exceed a specific threshold, e.g. 1?
Thanks

Assuming you want the columns in which every value is greater than 1, you could do something like this. The same approach works for any other condition you might have; just change the if condition:
Example Data
a <- data.frame(matrix(runif(90),ncol=3))
> a
X1 X2 X3
1 0.33341130 0.09307143 0.51932506
2 0.78014395 0.30378432 0.67309736
3 0.19967771 0.30829771 0.60144888
4 0.77736355 0.42504910 0.23880491
5 0.60631868 0.55198423 0.29565519
6 0.24246456 0.57945721 0.17882712
7 0.10499677 0.48768998 0.54931955
8 0.92288335 0.29290491 0.72885160
9 0.85246128 0.87564673 0.60069170
10 0.39931205 0.29895856 0.83249469
11 0.33674259 0.85618041 0.62940935
12 0.27816980 0.51508938 0.76079354
13 0.19121182 0.27586235 0.21273823
14 0.66337625 0.18631150 0.67762964
15 0.00923405 0.84753915 0.08386400
16 0.33209371 0.54919903 0.49128825
17 0.97685675 0.25564765 0.56439142
18 0.26710042 0.75852884 0.88706946
19 0.32422355 0.58971620 0.84070049
20 0.73000898 0.09068726 0.92541277
21 0.80547283 0.93723241 0.31050230
22 0.28897215 0.80679092 0.06080124
23 0.32190269 0.12254342 0.42506740
24 0.52569405 0.68506407 0.68302356
25 0.31098388 0.66225007 0.08565480
26 0.67546897 0.08123716 0.58419470
27 0.29501987 0.17836528 0.79322116
28 0.20736102 0.81145297 0.44078101
29 0.75165829 0.51865202 0.36653840
30 0.63375066 0.03804626 0.69949846
Solution
Just a single lapply is enough. I use 0.05 as the threshold here because it is easier to demonstrate with my random data set (runif produces values between 0 and 1, so no column would pass a threshold of 1). Change it to whatever you want in your dataset.
b <- do.call(cbind, (lapply(a, function(x) if(all(x>0.05)) return(x) )))
Output
> b
X3
[1,] 0.51932506
[2,] 0.67309736
[3,] 0.60144888
[4,] 0.23880491
[5,] 0.29565519
[6,] 0.17882712
[7,] 0.54931955
[8,] 0.72885160
[9,] 0.60069170
[10,] 0.83249469
[11,] 0.62940935
[12,] 0.76079354
[13,] 0.21273823
[14,] 0.67762964
[15,] 0.08386400
[16,] 0.49128825
[17,] 0.56439142
[18,] 0.88706946
[19,] 0.84070049
[20,] 0.92541277
[21,] 0.31050230
[22,] 0.06080124
[23,] 0.42506740
[24,] 0.68302356
[25,] 0.08565480
[26,] 0.58419470
[27,] 0.79322116
[28,] 0.44078101
[29,] 0.36653840
[30,] 0.69949846
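Only column 3 satisfied the condition on this occasion, so it was the only one returned. Note that the result is a matrix rather than a data frame, because cbind combines the surviving columns into a matrix.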

Or another option (based on @LyzandeR's data): take the colSums of the logical condition (a <= 0.05), which counts the values in each column that fail the threshold, and negate it (!). A count of 0 becomes TRUE and any other count becomes FALSE, so the result can be used for selecting the columns.
a[,!colSums(a<=0.05), drop=FALSE]
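The two answers above can also be wrapped into a small reusable function. The following is only a sketch: the name subset_cols_above and the toy data frame are made up for illustration, and it assumes every column is numeric and free of NA values.

```r
# Hypothetical helper: keep the columns of `df` in which every value exceeds `threshold`.
# Assumes numeric columns with no NA values.
subset_cols_above <- function(df, threshold) {
  # One TRUE/FALSE per column; vapply guarantees a logical vector
  keep <- vapply(df, function(col) all(col > threshold), logical(1))
  # drop = FALSE keeps a data frame even when a single column survives
  df[, keep, drop = FALSE]
}

# Toy data: column y contains 0.5, so it fails a threshold of 1
toy <- data.frame(x = c(2, 3, 4), y = c(0.5, 3, 4), z = c(5, 6, 7))
res <- subset_cols_above(toy, 1)
names(res)
```

Unlike the do.call(cbind, ...) version, this returns a data frame, and it does the same selection as the colSums approach.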

Related

R: How do I copy a value from a table and paste it into all rows within a specified row number range?

The current issue I am facing is that I have generated a list of row numbers based on the location of specific values within a table. This is as follows:
row_values <- which(table[,1] == "place of work :")
This leaves me with a list of all the row numbers in which "place of work" appears:
[1] 1 406 811 1216 1621 2026 2431 2836 3241 3646 4051 4456 4861 5266 5671 6076 6481 6886 7291 7696 8101 8506 8911 9316 9721 10126
[27] 10531 10936 11341 11746 12151 12556
This is 32 entries in total. I then also have another table which contains a number of codes (32 in total):
Area Code (Origin)
[1,] "E41000293"
[2,] "E41000294"
[3,] "E41000295"
[4,] "E41000296"
[5,] "E41000297"
[6,] "E41000298"
[7,] "E41000299"
[8,] "E41000300"
[9,] "E41000301"
[10,] "E41000302"
[11,] "E41000303"
[12,] "E41000304"
[13,] "E41000305"
[14,] "E41000306"
[15,] "E41000307"
[16,] "E41000308"
[17,] "E41000309"
[18,] "E41000310"
[19,] "E41000311"
[20,] "E41000312"
[21,] "E41000313"
[22,] "E41000314"
[23,] "E41000315"
[24,] "E41000316"
[25,] "E41000317"
[26,] "E41000318"
[27,] "E41000319"
[28,] "E41000320"
[29,] "E41000321"
[30,] "E41000322"
[31,] "E41000323"
[32,] "E41000324"
I need to find a way to paste those "Area Code (Origin)" values into the original table, in every row between the positions given by row_values. These would sit in a blank column I have called "Area Codes (Destinations)". I.e., E41000293 should sit in the Area Codes (Destinations) column for every row from 1 to 405, E41000294 should sit in rows 406 to 810, etc.
Any help with this would be greatly appreciated as I am at a loss for how best to approach this.
Thanks,
Joe
You can use cumsum in a clever way to achieve this:
library(dplyr)
table %>%
  group_by(grp = cumsum(text == "place of work :")) %>% # change "text" to your column name
  mutate(Area_Code = Area[grp,]) %>% # change Area to your table name
  ungroup() %>%
  select(-grp)
All that's left to do is change the table names and column names to fit your problem.
This should do it:
# Each block runs from its own start row to the row before the next start;
# the last block runs to the end of the table.
block_ends <- c(row_values[-1] - 1, nrow(table))
for (i in seq_along(row_values)) {
  table[row_values[i]:block_ends[i], 'Area Codes (Destinations)'] <- Areacode[i]
}
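A loop-free variant is also possible in base R, by repeating each code for the length of its block. This is a sketch on toy data; tbl, codes and the column name text are stand-ins for the question's objects, and it assumes there are exactly as many codes as "place of work :" rows.

```r
# Toy stand-ins for the question's table and code list (hypothetical names)
tbl <- data.frame(
  text = c("place of work :", "a", "b",
           "place of work :", "c",
           "place of work :", "d", "e"),
  stringsAsFactors = FALSE
)
codes <- c("E41000293", "E41000294", "E41000295")

# Start row of each block, as in the question
row_values <- which(tbl$text == "place of work :")

# Block lengths: gaps between successive starts, with the last block
# running to the final row of the table
block_len <- diff(c(row_values, nrow(tbl) + 1))

# Repeat each code over its block and assign everything in one step
tbl[row_values[1]:nrow(tbl), "Area Codes (Destinations)"] <- rep(codes, times = block_len)
tbl
```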

How do I automatically populate a matrix with intervals given the size and number of the intervals?

bucket_size <- 30
bucket_amount <- 24
matrix(???, bucket_amount, 2)
I'm trying to populate a (bucket_amount x 2) matrix using the interval size given by bucket_size. Here is what it would look like with the current given values of bucket_size and bucket_amount.
[1 30]
[31 60]
[61 90]
[91 120]
.
.
.
[691 720]
I can obviously hard code this specific example out, but I'm wondering how I can do this for different values of bucket_size and bucket_amount and have the matrix populate automatically.
We can use seq, specifying from and by as 'bucket_size' and length.out as 'bucket_amount', to create the sequence of upper bounds ('v1'). For the lower bounds ('v2'), drop the last element of 'v1', add 1 to the rest, and prepend 1. Then cbind the two vectors to create a matrix:
v1 <- seq(bucket_size, length.out = bucket_amount , by = bucket_size)
v2 <- c(1, v1[-length(v1)] + 1)
m1 <- cbind(v2, v1)
-output
> head(m1)
v2 v1
[1,] 1 30
[2,] 31 60
[3,] 61 90
[4,] 91 120
[5,] 121 150
[6,] 151 180
> tail(m1)
v2 v1
[19,] 541 570
[20,] 571 600
[21,] 601 630
[22,] 631 660
[23,] 661 690
[24,] 691 720
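Since the question asks for something that works for any bucket_size and bucket_amount, the same idea can be wrapped in a function. This is a minimal sketch; the name make_buckets and the column names lower and upper are made up here.

```r
# Hypothetical helper: build a (bucket_amount x 2) matrix of interval bounds
make_buckets <- function(bucket_size, bucket_amount) {
  upper <- seq_len(bucket_amount) * bucket_size  # 30, 60, 90, ...
  lower <- upper - bucket_size + 1               # 1, 31, 61, ...
  cbind(lower, upper)
}

m <- make_buckets(bucket_size = 30, bucket_amount = 24)
head(m, 3)
tail(m, 1)
```

Using seq_len means a bucket_amount of 0 yields an empty matrix with two columns rather than an error.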

Using different row values to create a function

I'm using the sp500 dataset to produce some visualizations and I'm trying to figure out how to code the formula for the daily returns.
With 'T' being the day in question, I've figured out the formula but I can't conceptualize how to get it into R code. The formula is
(ClosingPrice(T)-ClosingPrice(T-1))/ClosingPrice(T-1)
I'm not sure how to reference the row that precedes the one in question (ClosingPrice[-1]?), but I need to make it into a function which I'll use to iterate through the rows, the value of which will go into the new column sp500$DailyReturns
I think this may be more an R programming problem than a statistics problem, but
sometimes the line between the two is fuzzy, so here goes.
If I understand the problem correctly, I believe the following R code will
give you a clue how to handle this.
My vector cp has 20 fake closing prices (with a bit of an upward trend), dif.cp has the day-to-day differences (with an NA for the first element), and dr is the daily return. Notice that the output of diff is
a vector one element shorter than its argument, which is why an NA is prepended. [Use the same set.seed
statement for exactly the same fake data; omit or set a different seed for
your own fresh example.]
set.seed(1883)
t = 1:20; cp = round(rnorm(20, 100, 5)) + 2*t
dif.cp = c(NA, diff(cp))
dr = c(NA, diff(cp)/cp[-length(cp)]) # divide by the previous day's close, as in the formula
cbind(cp,dif.cp, dr) # binds together column vectors to make matrix
cp dif.cp dr
[1,] 104 NA NA
[2,] 106 2 0.019230769
[3,] 106 0 0.000000000
[4,] 112 6 0.056603774
[5,] 119 7 0.062500000
[6,] 107 -12 -0.100840336
[7,] 108 1 0.009345794
[8,] 112 4 0.037037037
[9,] 116 4 0.035714286
[10,] 127 11 0.094827586
[11,] 115 -12 -0.094488189
[12,] 127 12 0.104347826
[13,] 118 -9 -0.070866142
[14,] 126 8 0.067796610
[15,] 128 2 0.015873016
[16,] 131 3 0.023437500
[17,] 131 0 0.000000000
[18,] 131 0 0.000000000
[19,] 141 10 0.076335878
[20,] 139 -2 -0.014184397
You may have to make slight adjustments to fit your needs exactly. And maybe someone else has a stunningly clever way to do this.
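The formula from the question, (ClosingPrice(T)-ClosingPrice(T-1))/ClosingPrice(T-1), can be written as a vectorized function, so no row-by-row iteration is needed. This is a sketch: daily_returns is a made-up name, and the small data frame sp with a Close column is a stand-in for the real sp500 data, whose column names are not shown in the question.

```r
# Hypothetical helper: simple daily return relative to the previous close.
# The first element is NA because day 1 has no previous day.
daily_returns <- function(close) {
  n <- length(close)
  c(NA, (close[-1] - close[-n]) / close[-n])
}

# Stand-in for the sp500 data frame (the column name Close is an assumption)
sp <- data.frame(Close = c(100, 110, 99))
sp$DailyReturns <- daily_returns(sp$Close)
sp
```

Here close[-1] is every price except the first and close[-n] is every price except the last, so the two vectors line up as (today, yesterday) pairs.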

Plotting in RGB system

I have the following matrix:
[1,] 0.41037159 0.035512698 0.29994815
[2,] 0.78614949 0.011315428 0.62326616
[3,] 0.42033801 0.061607952 0.25401746
[4,] 0.09617148 0.018400841 0.03194410
[5,] 0.20674738 0.006731245 0.04494770
[6,] 0.04557131 0.004572941 0.21202555
[7,] 0.34248003 0.049949400 0.15443408
[8,] 0.02531455 0.000000000 0.42509625
[9,] 0.90997863 0.997772243 0.22139140
[10,] 0.76310619 0.509855546 0.03353221
[11,] 0.00000000 0.012677219 0.46590562
[12,] 0.25175140 0.053030978 0.20539943
[13,] 0.45103356 0.066157072 0.25589777
[14,] 0.05925331 0.019370010 0.00000000
[15,] 0.47797323 0.028505669 0.50553749
[16,] 0.19155010 0.104653515 0.11193315
[17,] 0.51185644 0.135238576 0.10319339
[18,] 0.39407653 0.052845711 0.91779848
[19,] 0.13960324 0.004667373 0.06151135
[20,] 0.41404594 0.183680484 0.01052881
[21,] 0.16835070 0.045960588 0.99267994
[22,] 0.48752986 0.069917560 0.36119324
[23,] 0.37388790 0.030336825 0.21154492
[24,] 0.24967125 0.002199422 0.19477217
These values are coming from the values of the first three pca axes.
I also have the x, y coordinates of the 24 values:
x y
ABO 6.722778 46.27972
ANG -2.889466 56.64358
AUB 2.848056 44.68500
BPN -2.980000 48.07000
BRU 8.658332 47.02055
CHA 4.275278 46.43444
GAS 1.638333 43.15556
GNS -2.533333 49.45000
HFD -2.708111 52.06120
HOL 10.133333 54.33333
JER -2.116667 49.20000
LMS 1.332222 45.77083
MAN -0.702778 47.82861
MAR -1.083611 46.42222
MON 5.875556 47.19194
NOR -0.325278 49.20444
NRC 11.056110 60.79917
OUL -6.016670 33.41667
PMT 7.666667 45.06667
PRP -4.096389 47.99667
RMG 15.661940 38.11139
SAL 2.495000 45.13889
TAR 5.953056 45.53611
VOS 7.385556 48.00917
I want to plot the values of the pca using their coordinates into an RGB system.
I tried :
rgb(pca_matrix)
It does give me the RGB color values, but I can't figure out how to plot them at these coordinates.
Assuming that you need to plot the points in different colors, first save your rgb() output as an object.
coll<-rgb(pca_matrix)
Then use, for example, the plot() function to plot the points and set their colors according to the saved object coll. Your second data frame is called df in this example.
plot(df$x,df$y,col=coll,pch=16,cex=2)
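A slightly fuller sketch of the same idea follows. rgb() only accepts values between 0 and 1, so raw PCA scores generally need rescaling first. The matrix in the question already looks rescaled, so that step is an assumption about how it was produced. The three-row scores matrix and the rounded coordinates below are toy values for illustration only.

```r
# Toy PCA scores on three axes (made-up values, not the question's data)
scores <- matrix(c(-1.2, 0.3, 2.0,
                    0.5, -0.7, 0.1,
                    1.1, 0.4, -0.9), ncol = 3)

# Rescale each axis to [0, 1]; assumes no axis is constant (max > min)
pca_matrix <- apply(scores, 2, function(v) (v - min(v)) / (max(v) - min(v)))

# Coordinates for the same three sites (rounded from the question)
df <- data.frame(x = c(6.72, -2.89, 2.85), y = c(46.28, 56.64, 44.69),
                 row.names = c("ABO", "ANG", "AUB"))

# One hex colour per row: axis 1 = red, axis 2 = green, axis 3 = blue
coll <- rgb(pca_matrix)

plot(df$x, df$y, col = coll, pch = 16, cex = 2, xlab = "x", ylab = "y")
text(df$x, df$y, labels = rownames(df), pos = 3)  # label each point above it
```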

Cumulative sums, moving averages, and SQL "group by" equivalents in R

What's the most efficient way to create a moving average or rolling sum in R? How do you do the rolling function along with a "group by"?
While zoo is great, sometimes there are simpler ways. If your data behaves nicely and is evenly spaced, the embed() function effectively lets you create multiple lagged versions of a time series. If you look inside the vars package for vector auto-regression, you will see that the package author chooses this route.
For example, to calculate the 3-period rolling average of x, where x = (1:20)^2:
> x <- (1:20)^2
> embed (x, 3)
[,1] [,2] [,3]
[1,] 9 4 1
[2,] 16 9 4
[3,] 25 16 9
[4,] 36 25 16
[5,] 49 36 25
[6,] 64 49 36
[7,] 81 64 49
[8,] 100 81 64
[9,] 121 100 81
[10,] 144 121 100
[11,] 169 144 121
[12,] 196 169 144
[13,] 225 196 169
[14,] 256 225 196
[15,] 289 256 225
[16,] 324 289 256
[17,] 361 324 289
[18,] 400 361 324
> apply (embed (x, 3), 1, mean)
[1] 4.666667 9.666667 16.666667 25.666667 36.666667 49.666667
[7] 64.666667 81.666667 100.666667 121.666667 144.666667 169.666667
[13] 196.666667 225.666667 256.666667 289.666667 324.666667 361.666667
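The question also asks about doing this per group. One base R possibility is sketched below, using stats::filter for the trailing window and ave for the "group by" part. The data frame d and its columns g and v are made up for illustration, and every group is assumed to have at least as many rows as the window (stats::filter stops with an error otherwise).

```r
x <- (1:20)^2

# Trailing 3-period mean; the first two positions are NA because the
# window is incomplete there. stats:: is spelled out because dplyr
# masks filter() when it is loaded.
roll3 <- as.numeric(stats::filter(x, rep(1 / 3, 3), sides = 1))

# Toy grouped data (hypothetical names)
d <- data.frame(g = c("a", "a", "a", "b", "b", "b"),
                v = c(1, 2, 3, 10, 20, 30))

# Cumulative sum within each group, like SUM(...) OVER (PARTITION BY g)
d$cum_v <- ave(d$v, d$g, FUN = cumsum)

# Trailing 2-period mean within each group
d$roll2 <- ave(d$v, d$g, FUN = function(z) {
  as.numeric(stats::filter(z, rep(1 / 2, 2), sides = 1))
})
d
```

The non-NA part of roll3 matches the apply(embed(x, 3), 1, mean) result above.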
I scratched up a good answer from Achim Zeileis over on the r list. Here's what he said:
library(zoo)
## create data
x <- rnorm(365)
## transform to regular zoo series with "Date" index
x <- zooreg(x, start = as.Date("2004-01-01"))
plot(x)
## add rolling/running/moving average with window size 7
lines(rollmean(x, 7), col = 2, lwd = 2)
## if you don't want the rolling mean but rather a weekly
## time series of means you can do
nextfri <- function(x) 7 * ceiling(as.numeric(x - 1)/7) + as.Date(1)
xw <- aggregate(x, nextfri, mean)
## nextfri is a function which computes for a certain "Date"
## the next friday. xw is then the weekly series.
lines(xw, col = 4)
Achim went on to say:
Note, that the difference between the
rolling mean and the aggregated series
is due to different alignments. This
can be changed by changing the 'align'
argument in rollmean() or the
nextfri() function in the aggregate
call.
All this came from Achim, not from me:
http://tolstoy.newcastle.edu.au/R/help/05/06/6785.html
