I'm new to R and have been struggling with the following for a while now so I was hoping someone would be able to help me out.
The sample data represents stock price returns (each row is a monthly period). The real data set is much bigger and is structured like the input below:
Input:
stock1 <- c(0.01, -0.02, 0.01, 0.05, 0.04, -0.02)
stock2 <- c(0, 0, 0.02, 0.04, -0.03, 0.02)
stock3 <- c(0, 0, 0.02, 0, -0.01, 0.03)
stock4 <- c(0, -0.02, 0.01, 0, 0, -0.02)
df <- cbind(stock1,stock2,stock3,stock4)
stock1 stock2 stock3 stock4
[1,] 0.01 0.00 0.00 0.00
[2,] -0.02 0.00 0.00 -0.02
[3,] 0.01 0.02 0.02 0.01
[4,] 0.05 0.04 0.00 0.00
[5,] 0.04 -0.03 -0.01 0.00
[6,] -0.02 0.02 0.03 -0.02
Any zeros that precede the first non-zero value for a given stock represent missing data, as opposed to a return of zero for the period. I would like to set these values to NA, so the output I would like to achieve is the following:
Desired Output:
stock1 <- c(0.01, -0.02, 0.01, 0.05, 0.04, -0.02)
stock2 <- c(NA, NA, 0.02, 0.04, -0.03, 0.02)
stock3 <- c(NA, NA, 0.02, 0, -0.01, 0.03)
stock4 <- c(NA, -0.02, 0.01, 0, 0, -0.02)
df <- cbind(stock1,stock2,stock3,stock4)
stock1 stock2 stock3 stock4
[1,] 0.01 NA NA NA
[2,] -0.02 NA NA -0.02
[3,] 0.01 0.02 0.02 0.01
[4,] 0.05 0.04 0.00 0.00
[5,] 0.04 -0.03 -0.01 0.00
[6,] -0.02 0.02 0.03 -0.02
I've tried a few things but they only seem to work for a single vector as opposed to a data set with multiple columns. I've tried using lapply to get around this but haven't had any luck so far. The closest I've gotten is shown below.
My single vector solution:
stock1[1:min(which(stock1!=0))-1] <- NA
My multiple vector solution which does not work:
lapply(df,function(x) x[1:min(which(x!=0))-1 <- NA]
Would greatly appreciate any guidance! Thanks!
There are three issues. First, writing:
df <- cbind(stock1,stock2,stock3,stock4)
doesn't create a data frame. It creates a matrix. This is an issue when you try to use lapply, which will operate over the columns of a data frame but over the elements of a matrix. Instead, you should write:
df <- data.frame(stock1,stock2,stock3,stock4)
Second, the function you're using in lapply needs to return the modified vector. Otherwise, the return value will be something unexpected (in this case, each call returns the value of the assignment, a single NA, so you get one NA per column instead of the data frame you want).
Third, you need to take care with 1:n when n can be zero (i.e., when the first stock quote is non-zero) because 1:0 gives the sequence c(1,0) instead of an empty sequence. (This is arguably one of R's stupidest features.)
Therefore, the following will give you what you want:
stock1 <- c(0.01, -0.02, 0.01, 0.05, 0.04, -0.02)
stock2 <- c(0, 0, 0.02, 0.04, -0.03, 0.02)
stock3 <- c(0, 0, 0.02, 0, -0.01, 0.03)
stock4 <- c(0, -0.02, 0.01, 0, 0, -0.02)
df <- data.frame(stock1,stock2,stock3,stock4)
as.data.frame(lapply(df, function(x) {
n <- min(which(x != 0)) - 1
if (n > 0)
x[1:n] <- NA
x
}))
The output is as expected:
stock1 stock2 stock3 stock4
1 0.01 NA NA NA
2 -0.02 NA NA -0.02
3 0.01 0.02 0.02 0.01
4 0.05 0.04 0.00 0.00
5 0.04 -0.03 -0.01 0.00
6 -0.02 0.02 0.03 -0.02
Update: As @Daniel_Fischer notes, there's a clever trick to avoid the 1:0 problem. You can instead write:
as.data.frame(lapply(df, function(x) {
n <- min(which(x != 0)) - 1
x[0:n] <- NA # use 0:n instead of 1:n
x
}))
This takes advantage of the fact that R ignores zeros in this type of indexing operation, so:
x[0:0] <- NA # same as x[0] <- NA and does nothing
x[0:1] <- NA # same as x[1] <- NA
x[0:2] <- NA # same as x[1:2] <- NA, etc.
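As a further alternative that is not part of the answer above, base R's seq_len(n) returns an empty index when n is 0, which avoids the 1:0 countdown without relying on zero indices being ignored. A minimal sketch using the stock1 vector from the question (the name y is only for illustration):

```r
x <- c(0.01, -0.02, 0.01, 0.05, 0.04, -0.02)  # stock1: no leading zeros
n <- min(which(x != 0)) - 1                   # n is 0 here

1:n          # counts down: 1 0
seq_len(n)   # integer(0), an empty index

y <- x
y[seq_len(n)] <- NA   # selects nothing, so y is left unchanged
identical(x, y)       # TRUE
```

Note that none of these variants handles a column that is entirely zero: which() is then empty and min() returns Inf with a warning, so that case would need its own check.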
This might not be the most elegant way, but I think it works:
changeValues <- function(x){
place <- min(which(diff(c(0,cumsum(x==0)))==0))-1;
x[0:place] <- NA
x
}
apply(df,2,changeValues)
EDIT: A brief explanation of the function: first I create a vector that increases at each position where there is a zero in the column. Then I check at which positions this vector does not increase (that is, where the value is non-zero). Taking the minimum of those positions gives the first non-zero entry, so only the leading zeros are replaced and zeros further down the column are left untouched.
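To make the steps concrete, here is a small walk-through of the intermediate vectors for stock3 from the question. It only unpacks the one-liner inside changeValues; the variable names zero_count and step are illustrative.

```r
x <- c(0, 0, 0.02, 0, -0.01, 0.03)   # stock3 from the question

zero_count <- cumsum(x == 0)         # 1 2 2 3 3 3: running count of zeros
step <- diff(c(0, zero_count))       # 1 1 0 1 0 0: 0 wherever x is non-zero
which(step == 0)                     # 3 5 6: positions of non-zero values
place <- min(which(step == 0)) - 1   # 2: number of leading zeros

x[0:place] <- NA
x                                    # NA NA 0.02 0.00 -0.01 0.03
```

The zero in position 4 survives because it comes after the first non-zero value.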
stock1 <- c(0.01, -0.02, 0.01, 0.05, 0.04, -0.02)
stock2 <- c(0, 0, 0.02, 0.04, -0.03, 0.02)
stock3 <- c(0, 0, 0.02, 0, -0.01, 0.03)
stock4 <- c(0, -0.02, 0.01, 0, 0, -0.02)
df <- data.frame(stock1,stock2,stock3,stock4) #the following function only works if df is actually a data.frame
df[] <- lapply(df, function(x) {ifelse(cumsum(x) == 0 & x == 0, NA, x)})
df
stock1 stock2 stock3 stock4
1 0.01 NA NA NA
2 -0.02 NA NA -0.02
3 0.01 0.02 0.02 0.01
4 0.05 0.04 0.00 0.00
5 0.04 -0.03 -0.01 0.00
6 -0.02 0.02 0.03 -0.02
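Some explanation: for each cell, check whether both the cumulative column sum and the current cell are equal to 0. If so, return NA, otherwise the original value. Note that this assumes the running sum never returns to exactly zero after the first non-zero return; a later zero at such a point would also be turned into NA. The empty brackets after df (df[]) make sure the result of lapply is assigned back into df as a data frame.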
Also, if you don't really need df to be a dataframe, this works as well:
df <- cbind(stock1,stock2,stock3,stock4)
apply(df, 2, function(x) {ifelse(cumsum(x) == 0 & x == 0, NA, x)})
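One caveat with the cumsum(x) test is that returns can cancel out, so the running sum may be zero again later in the series. A possible variant, sketched here as an assumption rather than as part of the answer above, counts non-zero entries instead of summing values (the function name leading_zeros_to_na is made up for this example):

```r
# Replace only the zeros that come before the first non-zero value.
leading_zeros_to_na <- function(x) {
  x[cumsum(x != 0) == 0] <- NA
  x
}

# Returns that cancel exactly: the running sum is back at 0 in position 4.
x <- c(0, 0.02, -0.02, 0, 0.01)

by_sum   <- ifelse(cumsum(x) == 0 & x == 0, NA, x)  # NA 0.02 -0.02 NA 0.01
by_count <- leading_zeros_to_na(x)                  # NA 0.02 -0.02 0  0.01
```

It can be applied per column in the same way, for example df[] <- lapply(df, leading_zeros_to_na).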
I have a large set of data that I want to reorder in groups of twelve using the sample() function in R to generate randomised data sets with which I can carry out a permutation test. However, this data has NA values where data could not be collected, and I would like them to stay in their original positions when the data is shuffled.
Currently, NAs are shuffled randomly with all other values. For example, where example.data is a made-up example set of 12 values:
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42)
sample(example.data, replace = F, prob = NULL)
[1] 0.64 0.83 NA 0.33 0.47 0.90 0.25 NA 0.12 0.42 1.00 NA
Whereas a suitable reordering would be:
[1] 0.64 0.83 NA 0.33 0.47 0.90 0.25 0.12 NA NA 0.42 1.00
Is there a simple way to do this?
Thank you for your help!
This has been solved, but I have a follow-up question.
Extending from this, if I have a set of data with a length of 24, how would I go about reordering the first and second sets of 12 values individually?
For example, a vector extending from the first example:
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42, 0.73, NA, 0.56, 0.12, 1.0, 0.47, NA, 0.62, NA, 0.98, NA, 0.05)
Where example.data[1:12] and example.data[13:24] are shuffled separately within their own respective groups.
The code I am trying to work this solution into is as follows:
shuffle.data = function(input.data,nr,ns){
simdata <- input.data
for(i in 1:nr){
start.row <- (ns*(i-1))+1
end.row <- start.row + actual.length[i] - 1
newdata = sample(input.data[start.row:end.row], size=actual.length[i], replace=F)
simdata[start.row:end.row] <- newdata
}
return(simdata)}
Where input.data is the raw input data (example.data); nr is the number of groups (2); ns is the size of each sample (12); and actual.length is a vector holding the length of each group excluding NAs (actual.length <- c(9, 8) in the example above).
Thank you again for your help!
Is this what you are looking for?
example.data[!is.na(example.data)] <- sample(example.data[!is.na(example.data)], replace = F, prob = NULL)
We can try with non-NA elements by creating an index
i1 <- which(!is.na(example.data))
example.data[i1] <- example.data[sample(i1)]
example.data
#[1] 0.25 0.64 NA 0.83 0.12 1.00 0.42 0.47 NA NA 0.33 0.90
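For the follow-up about shuffling each block of 12 separately, one possible sketch combines the index idea above with ave(), which applies a function within groups and puts the results back in place. The helper names shuffle_keep_na and shuffle_by_group are hypothetical, and the sketch uses sample.int(length(i)) rather than sample(i) because sample() treats a single number as 1:n.

```r
# Shuffle the non-NA values of x while leaving every NA where it is.
shuffle_keep_na <- function(x) {
  i <- which(!is.na(x))
  x[i] <- x[i[sample.int(length(i))]]
  x
}

# Shuffle within consecutive blocks of ns values.
shuffle_by_group <- function(x, ns) {
  groups <- ceiling(seq_along(x) / ns)   # 1,1,...,1,2,2,...
  ave(x, groups, FUN = shuffle_keep_na)
}

example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42,
                  0.73, NA, 0.56, 0.12, 1.0, 0.47, NA, 0.62, NA, 0.98, NA, 0.05)

set.seed(1)  # only to make the example repeatable
shuffled <- shuffle_by_group(example.data, ns = 12)
shuffled
```

With this approach the per-group count of non-NA values (actual.length in the question) does not need to be supplied, since it is derived from the data.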
I need to do a diagonal multiplication for the table below.
It's a 7*7 matrix:
Step 1: do a diagonal multiplication for the 7*7 matrix.
Step 2: then ignore the first column, select the next 7 columns and 7 rows, and do the diagonal multiplication.
Step 3: ignore the 1st and 2nd columns, select the next 7 columns and 7 rows, and do the diagonal multiplication.
Step 4: as in step 3, keep incrementing the number of ignored columns (1, 2, 3, ...) and so on.
Note: the diagonal runs upward, from the bottom-right corner to the top-left corner.
Data:
28/02/2013 31/03/2013 30/04/2013 31/05/2013 30/06/2013 31/07/2013 31/08/2013 30/09/2013 31/10/2013 30/11/2013 31/12/2013 31/01/2014 28/02/2014
0.04 0.03 0.03 0.04 0.04 0.07 0.86 0.28 0.05 0.05 0.05 0.04 0.04
0.44 0.44 0.42 0.43 0.40 0.32 0.64 0.02 0.33 0.36 0.30 0.27 0.37
0.57 0.57 0.52 0.59 0.62 0.51 0.79 0.23 0.64 0.66 0.50 0.55 0.60
0.61 0.58 0.60 0.63 0.65 0.59 0.81 0.83 1.00 0.63 0.57 0.63 0.74
0.70 0.65 0.66 0.71 0.73 0.66 0.86 0.90 0.55 0.76 0.65 0.66 0.74
0.76 0.76 0.79 0.74 0.83 0.83 0.86 1.00 0.61 0.83 0.38 0.74 0.75
0.80 0.84 0.89 0.84 0.82 0.83 0.98 0.84 0.44 0.93 0.88 0.78 0.78
Considering each column as A, B, C, D, E, F, G, H, I, J, K and so on ... there will be many columns, but the number of rows will be only 7.
Calculation of the 7*7 diagonal matrix will be as follows.
A is result for -> STEP 1, B -> STEP 2 AND C -> STEP 3 ... and so on.
A B C
G8*F7*E6*D5*C4*B3*A2 = 0.00 H8*G7*F6*E5*D4*C3*B2 = 0.02 I8*H7*G6*F5*E4*D3*C2 = 0.00
G8*F7*E6*D5*C4*B3 = 0.08 H8*G7*F6*E5*D4*C3 = 0.08 I8*H7*G6*F5*E4*D3 = 0.06
G8*F7*E6*D5*C4 = 0.19 H8*G7*F6*E5*D4 = 0.18 I8*H7*G6*F5*E4 = 0.14
G8*F7*E6*D5 = 0.37 H8*G7*F6*E5 = 0.31 I8*H7*G6*F5 = 0.22
G8*F7*E6 = 0.59 H8*G7*F6 = 0.47 I8*H7*G6 = 0.38
G8*F7 = 0.81 H8*G7 = 0.72 I8*H7 = 0.44
G8 = 0.98 H8 = 0.84 I8 = 0.44
So result should be printed as.
A B C
0 0.02 0.00
0.08 0.08 0.06
0.19 0.18 0.14
0.37 0.31 0.22
0.59 0.47 0.38
0.81 0.72 0.44
0.98 0.84 0.44
Similarly, there will be results for D, E, F, and so on.
Please help, thanks in advance.
sapply(lapply(7:NCOL(df), function(i)
df[, (i-6):i]), function(a)
round(x = rev(cumprod(rev(diag(as.matrix(a))))), digits = 2))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#[1,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00
#[2,] 0.09 0.08 0.06 0.08 0.08 0.03 0.00
#[3,] 0.19 0.18 0.14 0.21 0.26 0.05 0.15
#[4,] 0.37 0.31 0.22 0.41 0.33 0.23 0.24
#[5,] 0.59 0.48 0.38 0.51 0.40 0.23 0.38
#[6,] 0.81 0.72 0.44 0.57 0.73 0.30 0.58
#[7,] 0.98 0.84 0.44 0.93 0.88 0.78 0.78
Let me know if the output is correct
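To see what the rev(cumprod(rev(diag(...)))) step does on its own, here is a small illustration on a made-up 3*3 matrix (the matrix m is not from the question). Reversing the diagonal before cumprod makes the product accumulate from the bottom-right corner upward, and the outer rev puts the results back in row order.

```r
m <- matrix(c(2, 9, 9,
              9, 3, 9,
              9, 9, 5), nrow = 3, byrow = TRUE)

d <- diag(m)                       # 2 3 5: top-left to bottom-right
res <- rev(cumprod(rev(d)))        # 30 15 5
res
# row 1: 5 * 3 * 2 = 30
# row 2: 5 * 3     = 15
# row 3: 5         =  5
```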
DATA
df = structure(list(A = c(0.04, 0.44, 0.57, 0.61, 0.7, 0.76, 0.8),
B = c(0.03, 0.44, 0.57, 0.58, 0.65, 0.76, 0.84), C = c(0.03,
0.42, 0.52, 0.6, 0.66, 0.79, 0.89), D = c(0.04, 0.43, 0.59,
0.63, 0.71, 0.74, 0.84), E = c(0.04, 0.4, 0.62, 0.65, 0.73,
0.83, 0.82), F = c(0.07, 0.32, 0.51, 0.59, 0.66, 0.83, 0.83
), G = c(0.86, 0.64, 0.79, 0.81, 0.86, 0.86, 0.98), H = c(0.28,
0.02, 0.23, 0.83, 0.9, 1, 0.84), I = c(0.05, 0.33, 0.64,
1, 0.55, 0.61, 0.44), J = c(0.05, 0.36, 0.66, 0.63, 0.76,
0.83, 0.93), K = c(0.05, 0.3, 0.5, 0.57, 0.65, 0.38, 0.88
), L = c(0.04, 0.27, 0.55, 0.63, 0.66, 0.74, 0.78), M = c(0.04,
0.37, 0.6, 0.74, 0.74, 0.75, 0.78)), .Names = c("A", "B",
"C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M"), class = "data.frame", row.names = c(NA,
-7L))
I think a for loop is a good bet here - inspired from this
n <- nrow(df)
b <- ncol(df) - n + 1
out <- matrix(0, n, b)
ro <- 1:n
for(i in 1:b){
co <- i:(n + i - 1)
out[ro, i] <- rev(cumprod(rev(df[cbind(ro, co)])))
}
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0.003423605 0.002303868 0.001785601 0.003374663 0.00337162 0.00232112
# [2,] 0.085590113 0.076795599 0.059520050 0.084366587 0.08429050 0.03315886
# [3,] 0.194522983 0.182846664 0.138418720 0.210916467 0.26340780 0.05181072
# [4,] 0.374082660 0.309909600 0.223256000 0.413561700 0.33342760 0.22526400
# [5,] 0.593782000 0.476784000 0.378400000 0.510570000 0.40172000 0.22526400
# [6,] 0.813400000 0.722400000 0.440000000 0.567300000 0.73040000 0.29640000
# [7,] 0.980000000 0.840000000 0.440000000 0.930000000 0.88000000 0.78000000
Wrap the answer in round to alter how it is printed.
Another way, also using indexing:
ro <- nrow(df)
co <- ncol(df)
b <- co - ro + 1
# window i covers columns i:(ro + i - 1), so the end columns run from ro to co
ccols <- mapply(seq, 1:b, ro:co)
rrows <- rep(1:ro, b)
mat <- matrix(rev(df[cbind(rrows, c(ccols))]), nrow=ro)
matrix(rev(matrixStats::colCumprods(mat)), nrow=ro)
A quick benchmark on larger data seems to show that method two is considerably faster. However, if you convert the data frame to a matrix first, the for loop has similar speed.
I have 2 data frames A and B of dimensions 2 x 5 like this:
A = data.frame(GeneA1=c(-0.02, 1.89), GeneB2=c(0.25, 1.99), GeneB3=c(0.17, 1.87), GeneB4=c(0.3, 1.63), GeneC2=c(0.29, 1.97), row.names=c("sample 1", "sample 2"))
B = data.frame(GeneA1=c(0.52, -0.04), GeneB1=c(1.1, 0.08), GeneB3=c(0.72, 0.03), GeneB5=c(0.78, 0.06), GeneC2=c(0.78, 0.25), row.names=c("sample 1", "sample 2"))
For both A & B, the rows are samples and the columns are gene type
I want to merge A & B using rbind, adding NAs where the gene types don't match up. I've heard there's a way to do this using the setdiff function, but I don't know how.
Use merge
> AB <- merge(A, B, all=TRUE)
> AB[,order(names(AB))] # to get the result ordered by colnames
Gene A1 Gene B1 Gene B2 Gene B3 Gene B4 Gene B5 Gene C2
1 -0.04 0.08 NA 0.03 NA 0.06 0.25
2 -0.02 NA 0.25 0.17 0.30 NA 0.29
3 0.52 1.10 NA 0.72 NA 0.78 0.78
4 1.89 NA 1.99 1.87 1.63 NA 1.97
Where A and B are as follows:
A <- matrix(c(-0.02, 0.25, 0.17, 0.3, 0.29,
1.89, 1.99, 1.87, 1.63, 1.97),
nrow=2, byrow=TRUE,
dimnames=list(NULL, c("Gene A1", "Gene B2",
"Gene B3",
"Gene B4", "Gene C2")))
B <- matrix(c(0.52, 1.1, 0.72, 0.78, 0.78,
-0.04, 0.08, 0.03, 0.06,0.25),
nrow=2, byrow=TRUE,
dimnames=list(NULL, c("Gene A1", "Gene B1",
"Gene B3",
"Gene B5", "Gene C2")))
You can use the function merge:
A=data.frame(A1=c(-0.02,1.89),B2=c(0.25,1.99),B3=c(0.17,1.87),B4=c(0.3,1.63),C2=c(0.29,1.97))
B=data.frame(A1=c(0.52,-0.04),B1=c(1.1,0.08),B3=c(0.72,0.03),B5=c(0.78,0.06),C2=c(0.78,0.25))
C<-merge(A, B, all=T)
View(C)
Try this:
# dummy data
A <- read.table(text="
Gene A1, Gene B2, Gene B3, Gene B4, Gene C2
-0.02, 0.25, 0.17, 0.3, 0.29
1.89, 1.99, 1.87, 1.63, 1.97",
sep=",", header=TRUE)
B <- read.table(text="
Gene A1, Gene B1, Gene B3, Gene B5, Gene C2
0.52, 1.1, 0.72, 0.78, 0.78
-0.04, 0.08, 0.03, 0.06,0.25",
sep=",", header=TRUE)
#transpose and merge
tAB <- merge(t(A),t(B),by="row.names",all=TRUE)
#keep gene names
col <- tAB[,1]
#exclude rownames, transpose
output <- t(tAB[,-1])
#update colnames
colnames(output) <- col
#output
# Gene.A1 Gene.B1 Gene.B2 Gene.B3 Gene.B4 Gene.B5 Gene.C2
#V1.x -0.02 NA 0.25 0.17 0.30 NA 0.29
#V2.x 1.89 NA 1.99 1.87 1.63 NA 1.97
#V1.y 0.52 1.10 NA 0.72 NA 0.78 0.78
#V2.y -0.04 0.08 NA 0.03 NA 0.06 0.25
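For completeness, here is a sketch of the rbind plus setdiff route the question asks about. The helper name rbind_fill is hypothetical, and the data frames reuse the short column names from the merge answer above. Unlike merge, which joins on the shared columns and would combine two rows whose shared values happen to match, this always stacks the rows.

```r
# Stack two data frames, filling columns missing from either one with NA.
rbind_fill <- function(a, b) {
  for (nm in setdiff(names(b), names(a))) a[[nm]] <- NA  # columns only in b
  for (nm in setdiff(names(a), names(b))) b[[nm]] <- NA  # columns only in a
  rbind(a, b[names(a)])                                  # align column order
}

A <- data.frame(A1 = c(-0.02, 1.89), B2 = c(0.25, 1.99), B3 = c(0.17, 1.87),
                B4 = c(0.3, 1.63), C2 = c(0.29, 1.97))
B <- data.frame(A1 = c(0.52, -0.04), B1 = c(1.1, 0.08), B3 = c(0.72, 0.03),
                B5 = c(0.78, 0.06), C2 = c(0.78, 0.25))

AB <- rbind_fill(A, B)
AB[, order(names(AB))]   # columns sorted by name, as in the merge answer
```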
So my test data looks like this:
structure(list(day = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L
), Left = c(0.25, 0.33, 0, 0, 0.25, 0.33, 0.5, 0.33, 0.5, 0),
Left1 = c(NA, NA, 0, 0.5, 0.25, 0.33, 0.1, 0.33, 0.5, 0),
Middle = c(0, 0, 0.3, 0, 0.25, 0, 0.3, 0.33, 0, 0), Right = c(0.25,
0.33, 0.3, 0.5, 0.25, 0.33, 0.1, 0, 0, 0.25), Right1 = c(0.5,
0.33, 0.3, 0, 0, 0, 0, 0, 0, 0.75), Side = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("L", "R"), class = "factor")), .Names = c("day",
"Left", "Left1", "Middle", "Right", "Right1", "Side"), class = "data.frame", row.names = c(NA,
-10L))
or this:
day Left Left1 Middle Right Right1 Side
1 0.25 NA 0.00 0.25 0.50 R
1 0.33 NA 0.00 0.33 0.33 R
2 0.00 0.00 0.30 0.30 0.30 R
2 0.00 0.50 0.00 0.50 0.00 R
2 0.25 0.25 0.25 0.25 0.00 L
3 0.33 0.33 0.00 0.33 0.00 L
I would like to write a loop to find the standard error and average value for each day on the chosen side.
OK, so far I have this code:
td<-read.csv('test data.csv')
IDs<-unique(td$day)
se<-function(x) sqrt(var(x)/length(x))
for (i in 1:length (IDs)) {
day.i<-which(td$day==IDs[i])
td.i<-td[day.i,]
if(td$Side=='L'){
side<-cbind(td.i$Left + td.i$Left1)
}else{
side<-cbind(td.i$Right + td.i$Right1)
}
mean(side)
se(side)
print(mean)
print(se)
}
But I am getting error messages like this
Error: unexpected '}' in "}"
Obviously, I am also not getting the print out of means for each day.. Does anyone know why?
also working on things here: http://www.talkstats.com/showthread.php/27187-Writing-a-mean-loop..-(literally)
Convert your data into a list and work with that instead:
First, split up your data into a list according to Side, subsetting the relevant columns along the way.
td = split(td, td$Side)
NAMES = names(td)
td = lapply(1:length(td),
function(x) td[[x]][c(1, grep(NAMES[x],
names(td[[x]])))])
names(td) = NAMES
td
# $L
# day Left Left1
# 5 2 0.25 0.25
# 6 3 0.33 0.33
# 7 3 0.50 0.10
# 8 4 0.33 0.33
# 9 4 0.50 0.50
#
# $R
# day Right Right1
# 1 1 0.25 0.50
# 2 1 0.33 0.33
# 3 2 0.30 0.30
# 4 2 0.50 0.00
# 10 4 0.25 0.75
Then, use lapply and aggregate to apply whatever functions you want to your data.
lapply(1:length(td),
function(x) aggregate(list(td[[x]][-1]),
list(day = td[[x]]$day), mean))
# [[1]]
# day Left Left1
# 1 2 0.250 0.250
# 2 3 0.415 0.215
# 3 4 0.415 0.415
#
# [[2]]
# day Right Right1
# 1 1 0.29 0.415
# 2 2 0.40 0.150
# 3 4 0.25 0.750
Still not entirely sure I understand (that is, whether you want the mean and SE for both Left and Left1, or for some combination such as their sum). This is how I interpreted your question:
FUN <- function(dat, side = "L") {
DF <- split(dat, dat$Side)[[side]]
ind <- if(side=="L") 2:3 else 5:6
stderr <- function(x) sqrt(var(x)/length(x))
meanNse <- function(x) c(mean=mean(x), se=stderr(x))
OUT <- aggregate(DF[, ind], list(DF[, 1]), meanNse)
names(OUT)[1] <- "day"
return(OUT)
}
#test it
FUN(td)
FUN(td, "R")
Which yields:
> FUN(td)
day Left.mean Left.se Left1.mean Left1.se
1 2 0.250 NA 0.250 NA
2 3 0.415 0.085 0.215 0.115
3 4 0.415 0.085 0.415 0.085
> FUN(td, "R")
day Right.mean Right.se Right1.mean Right1.se
1 1 0.29 0.04 0.415 0.085
2 2 0.40 0.10 0.150 0.150
3 4 0.25 NA 0.750 NA
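If an explicit loop is still preferred, the following is a sketch of how the loop from the question could be restructured. It assumes the intended quantity is the sum of the two columns on the chosen side (Left + Left1 or Right + Right1), which is only one reading of the question. It tests the side on a single value rather than on the whole td$Side column, and it stores the computed values instead of printing the mean and se functions themselves. The results data frame and loop variables are illustrative names.

```r
td <- data.frame(
  day    = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4),
  Left   = c(0.25, 0.33, 0, 0, 0.25, 0.33, 0.5, 0.33, 0.5, 0),
  Left1  = c(NA, NA, 0, 0.5, 0.25, 0.33, 0.1, 0.33, 0.5, 0),
  Right  = c(0.25, 0.33, 0.3, 0.5, 0.25, 0.33, 0.1, 0, 0, 0.25),
  Right1 = c(0.5, 0.33, 0.3, 0, 0, 0, 0, 0, 0, 0.75),
  Side   = c("R", "R", "R", "R", "L", "L", "L", "L", "L", "R")
)

se <- function(x) sqrt(var(x) / length(x))

results <- data.frame()
for (d in unique(td$day)) {
  for (s in c("L", "R")) {
    rows <- td[td$day == d & td$Side == s, ]
    if (nrow(rows) == 0) next                     # no rows for this day/side
    side <- if (s == "L") rows$Left + rows$Left1 else rows$Right + rows$Right1
    results <- rbind(results,
                     data.frame(day = d, Side = s, n = length(side),
                                mean = mean(side), se = se(side)))
  }
}
print(results)
```

Groups with a single row get an se of NA, because the variance of one value is undefined.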