Let's say I want to select the 1st, 3rd, and 12th element from a data frame or a matrix:
m = matrix(1:12, 3, 4)
m[c(1,3,12)] # as expected: selects the 1st, 3rd, and 12th element
However this does not seems to work for data frames:
df = data.frame(m)
df[c(1,3,12)] # doesn't select the elements
What I'm using is:
as.vector(df)[c(1,3,12)] # works as expected
Is there a simpler way to achieve the same result?
EDIT:
as.vector(df)[c(1,3,12)] # does not work
As Richard Scriven pointed out:
unlist(df, use.names=FALSE)[c(1, 3, 12)] # do work
But I'm still looking for a shorter notation (if possible).
df<-data.frame(11:13,14:16,17:19,20:22)
df
# X11.13 X14.16 X17.19 X20.22
# 1 11 14 17 20
# 2 12 15 18 21
# 3 13 16 19 22
c(df[df==11], df[df==13],df[df==22])
# [1] 11 13 22
Related
I have a R dataFrame df with the following:
time value reference
45 10 11
22 12 10
13 15 5
I would like to replace the last column of the dataFrame to obtain:
time value space
45 10 11
22 12 10
13 15 5
I tried this:
colnames(length(colnames(df)))<-"space"
but it does not work. How can I do it?
You can use names() instead:
names(df)[length(names(df))]<-"space"
Marginally less typing:
names(df)[ncol(df)] <- "space"
The following should do what you need:
colnames(df)[ncol(df)] <- "space"
Suppose we have a data frame of two columns
X Y
10 14
12 16
14 17
15 19
21 19
The first element of Y that is 14, the nearest value (or same) to it is 14 (which is 3rd element of X). Similarly, next element of Y is closest to 15 that is 4th element of X
So, the output I would like should be
3
4
4
5
5
As my data is large, Can you give me some advice on the systemic/proper code for doing it?
You can try this piece of code:
apply(abs(outer(d$X,d$Y,FUN = '-')),2,which.min)
# [1] 3 4 4 5 5
Here, abs(outer(d$X,d$Y,FUN = '-')) returns a matrix of unsigned differences between d$X and d$Y, and apply(...,2,which.min) will return position of the minimum by row.
I have a vector of: 0,24,12,12,12,96,12,12,12,12,12,12.
I want to repeat only a part of it from 96 to the last element (12). The first part (0, 24, 12, 12, 12) I want to keep constant.
Could you please help ?
The answer depends on whether number 96 is always located at the 6th position inside your vector. If so, please refer to the first comment underneath your question. If the position is variable, however, you could implement a simple query that identifies the position of 96 inside your vector, and then repeat the part of the vector starting from there as often as you wish (2 times in the below-mentioned code).
x <- c(0,24,12,12,12,96,12,12,12,12,12,12)
# Identify index of 96
id <- which(x == 96)
# Repeat part of vector starting from `id` 2 times
c(x[1:(id-1)], rep(x[id:length(x)], 2))
# # Which results in
# [1] 0 24 12 12 12 96 12 12 12 12 12 12 96 12 12 12 12 12 12
I am relatively new to R from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, stop value, and 100+ columns of numerical values. The goal is to get the sum of each row from the column that corresponds to the start value to the column that corresponds to the stop value. This is direct enough to do in a loop, that looks like this (data.frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
df$out[i] <- rowSums(df[i,df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
If you are dealing with values of all the same types, you typically want to do things in matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=T)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=T))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run the apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that your are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column), which you can then cbind back to your data frame (and that means you modify your data frame just once).
So As you can see I have a price and Day columns below
Price Day
2 1
5 2
8 3
11 4
14 5
17 6
20 7
23 8
26 9
29 10
32 11
35 12
38 13
41 14
44 15
47 16
50 17
53 18
56 19
59 20
I then want the output below
Difference Day
12 5
15 10
15 15
15 20
So now I have the difference in prices every 5 days...it just basically subtracts the 5th day with the first day.....and then the 10th day with the 5th day etc....
I already made a code that will seperate my data into 5 day intervals...but I want the code that will let me minus the 5th with the 1st day....the 10th day with the 5th day...etc
So the code should look something like this
difference<-tapply(Price[,1],Day, ____________)
So basically Price[,1] will be my Price data.....while "Day" is the variable that I created that will let me seperate my Day data into 5 day intervals.....I'm thinking that in the blank section I could put in the function or another variable that will let me subtract the 5th day with the 1st day prices and then the 10th day and 5th day prices...etc.....you dont have to help me to seperate my Days into intervals...just how to do "difference" section....thanks guys
Here's one option, assuming your data.frame is called "SODF":
within(SODF[c(1, seq(5, nrow(SODF), 5)), ], {
Price <- diff(c(0, Price))
})[-1, ]
# Price Day
# 5 12 5
# 10 15 10
# 15 15 15
# 20 15 20
The first step is basic subsetting. According to your description and expected answer, you want the first row, and then every fifth row starting from row 5:
> SODF[c(1, seq(5, nrow(SODF), 5)), ]
Price Day
1 2 1
5 14 5
10 29 10
15 44 15
20 59 20
From there, you can use diff on the "Price" column, but since diff will result in a vector that is one in length shorter than your input, you need to "pad" the input vector, which I did with diff(c(0, Price)).
# Correct values, but the number of rows needs to be 5
> diff(SODF[c(1, seq(5, nrow(SODF), 5)), "Price"])
[1] 12 15 15 15
Then, the [-1, ] at the end just deletes the extraneous row.
Update
In the comments below, #geektrader points out in the comments (thanks!), an alternative to using:
SODF[c(1, seq(5, nrow(SODF), 5)), ]
as your input data.frame, you may consider using the following instead:
rbind(SODF[1,], SODF[$Day %% 5 == 0,] )
The difference in the two approaches is that the first approach simply subsets by row number, while the second approach subsets according to the value in the "Day" column, extracting rows where "Day" is a multiple of 5. This second approach might be useful, for instance, when there are missing rows in the dataset.
Ananda's is a nice approach (always forget about within myself). Here's another approach:
dat2 <- dat[seq(0, nrow(dat), by=5), ]
data.frame(Difference=diff(c(dat[1,1], dat2[, 1])), Day=dat2[, 2])
Here a solution if you have a matrix as input.
The subsequent function, given a matrix m, a column col_id and a numeric interval interv, subtracts every interv rows the current value in the col_id column of the m matrix with the previous value (5 rows before, same column, obiviously).
The results are stored in a new column called diff and appended to the end of the m matrix.
In short, the approach is very similar to that used by #Ananda Mahto.
So, this is the function:
subtract_column <- function(m, col_id, interv) {
select <- c(1, seq(interv, nrow(m), interv))
cbind(m[select[-1], ], diff = diff(m[select, col_id]))
}
Example:
# this emulates your data as a matrix
price_vect <- c(2,5,8,11,14,17,20,23,26,29,32,35,38,41,44,47,50,53,56,59)
day_vect <- 1:20
matr <- do.call(cbind, list(price = price_vect, day = day_vect))
# and this calls the function above and does the job:
# subtracts every 5 rows the current and the previous (5 rows back) value in the column `price` of matrix `matr`
subtract_column(matr, 'price', 5)
Output:
price day diff
[1,] 14 5 12
[2,] 29 10 15
[3,] 44 15 15
[4,] 59 20 15