Looping Through R data to replace all values - r

I'm a beginner R programmer trying to recalibrate data using a weighted effect (wef) and a reference value (x). Specifically: 1) if the weighted effect is negative, take the row's value of x and subtract the person's value; 2) if the weighted effect is positive, take the person's value and subtract x.
Mock data:
p1 <- c(0.4,0.7,0.3,0.2)
p2 <- c(0.8,0.4,0.5,0.1)
p3 <- c(0.6,0.5,0.4,0.3)
wef <- c(1.5,-1.2,1.8,-1.3)
x <- c(0.5,0.4,0.6,0.2)
df <- data.frame(p1, p2, p3, wef, x)
print(df)
p1 p2 p3 wef x
1 0.4 0.8 0.6 1.5 0.5
2 0.7 0.4 0.5 -1.2 0.4
3 0.3 0.5 0.4 1.8 0.6
4 0.2 0.1 0.3 -1.3 0.2
I attempted this (which did nothing and likely would be inefficient with for loops):
for(row in 1:nrow(df)) {
for(col in 1:ncol(df)) {
ifelse(weightef[row] < 0, df[row,col]==(df$x[row]-df[row,col]),
df[row,col]==df[row,col]-df$x[row])
}
}
My desired output, in case the above was too hard to follow, is this:
person1 person2 person3 weightef x
1 -0.1 0.3 0.1 1.5 0.5
2 -0.3 0.0 -0.1 -1.2 0.4
3 -0.3 -0.1 -0.2 1.8 0.6
4 0.0 0.1 -0.1 -1.3 0.2

You can use the apply and ifelse functions in R. The first line does all the work, and you are not required to understand grep. The second line of code just puts everything back into a data frame.
result <- apply(df[, 1:3], 2, FUN = function(y) with(df, ifelse(wef < 0, x - y, y - x)))
df <- as.data.frame(cbind(result, wef, x))
p1 p2 p3 wef x
1 -0.1 0.3 0.1 1.5 0.5
2 -0.3 0.0 -0.1 -1.2 0.4
3 -0.3 -0.1 -0.2 1.8 0.6
4 0.0 0.1 -0.1 -1.3 0.2

We can do this without a loop in R
nm1 <- grep("^p\\d+", names(df), value = TRUE)
i1 <- df$wef > 0
df[i1, nm1] <- df[i1, nm1] - df$x[i1]
df[!i1, nm1] <- df$x[!i1] - df[!i1, nm1]
data
df <- data.frame(p1, p2, p3, wef, x)
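The two assignments in this vectorized answer differ only in sign, so they can be folded into one multiplication. The following is a minimal sketch of that idea, not the answerer's code; the names cols, flip and out are made up for illustration, and rows where wef is exactly 0 are treated like positive ones, as in the for-loop answer.

```r
# Mock data from the question
p1  <- c(0.4, 0.7, 0.3, 0.2)
p2  <- c(0.8, 0.4, 0.5, 0.1)
p3  <- c(0.6, 0.5, 0.4, 0.3)
wef <- c(1.5, -1.2, 1.8, -1.3)
x   <- c(0.5, 0.4, 0.6, 0.2)
df  <- data.frame(p1, p2, p3, wef, x)

cols <- c("p1", "p2", "p3")          # hypothetical name: columns to recalibrate
flip <- ifelse(df$wef < 0, -1, 1)    # -1 turns (p - x) into (x - p)

out <- df
# df$x and flip have one entry per row, so they are applied down every column
out[cols] <- (df[cols] - df$x) * flip
print(out)
```

Because both vectors have exactly nrow(df) elements, recycling lines them up with the rows of each selected column.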

If you want to use for loops you can do it in this way
#Create dataframe
df = data.frame(p1, p2, p3, wef, x)
#looping over the length of vector wef
for (w in 1:length(df$wef))
{
#Checking positive or negative weight
if (df$wef[w] >= 0)
{
#subtracting
df$p1[w] = df$p1[w] - df$x[w]
df$p2[w] = df$p2[w] - df$x[w]
df$p3[w] = df$p3[w] - df$x[w]
}
else
{
#subtracting
df$p1[w] = df$x[w] - df$p1[w]
df$p2[w] = df$x[w] - df$p2[w]
df$p3[w] = df$x[w] - df$p3[w]
}
}
#print result
print(df)
p1 p2 p3 wef x
1 -0.1 0.3 0.1 1.5 0.5
2 -0.3 0.0 -0.1 -1.2 0.4
3 -0.3 -0.1 -0.2 1.8 0.6
4 0.0 0.1 -0.1 -1.3 0.2

Related

Transform NA values based on first registration and nearest values

I already made a similar question but now I want just to restrict the new values of NA.
I have some data like this:
Date 1 Date 2 Date 3 Date 4 Date 5 Date 6
A NA 0.1 0.2 NA 0.3 0.2
B 0.1 NA NA 0.3 0.2 0.1
C NA NA NA NA 0.3 NA
D 0.1 0.2 0.3 NA 0.1 NA
E NA NA 0.1 0.2 0.1 0.3
I would like to change the NA values of my data based on the first date a value is registered. So for example for A, the first registration is Date 2. Then I want that before that registration the values of NA in A are 0, and after the first registration the values of NA become the mean of the nearest values (mean of date 3 and 5).
In case the last value is an NA, transform it into the last registered value (as in C and D). In the case of E all NA values will become 0.
Get something like this:
Date 1 Date 2 Date 3 Date 4 Date 5 Date 6
A 0 0.1 0.2 0.25 0.3 0.2
B 0.1 0.2 0.2 0.3 0.2 0.1
C 0 0 0 0 0.3 0.3
D 0.1 0.2 0.3 0.2 0.1 0.1
E 0 0 0.1 0.2 0.1 0.3
Can you help me? I'm not sure how to do it in R.
Here is a way using na.approx from the zoo package and apply with MARGIN = 1 (so this is probably not very efficient, but it gets the job done).
library(zoo)
df1 <- as.data.frame(t(apply(dat, 1, na.approx, method = "constant", f = .5, na.rm = FALSE)))
This results in
df1
# V1 V2 V3 V4 V5
#A NA 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C NA NA NA NA 0.3
#E NA NA 0.1 0.20 0.1
Replace NAs and rename columns.
df1[is.na(df1)] <- 0
names(df1) <- names(dat)
df1
# Date_1 Date_2 Date_3 Date_4 Date_5
#A 0.0 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C 0.0 0.0 0.0 0.00 0.3
#E 0.0 0.0 0.1 0.20 0.1
explanation
Given a vector
x <- c(0.1, NA, NA, 0.3, 0.2)
na.approx(x)
returns x with linear interpolated values
#[1] 0.1000000 0.1666667 0.2333333 0.3000000 0.2000000
But OP asked for constant values so we need the argument method = "constant" from the approx function.
na.approx(x, method = "constant")
# [1] 0.1 0.1 0.1 0.3 0.2
But this is still not what OP asked for because it carries the last observation forward while you want the mean for the closest non-NA values. Therefore we need the argument f (also from approx)
na.approx(x, method = "constant", f = .5)
# [1] 0.1 0.2 0.2 0.3 0.2 # looks good
From ?approx
f : for method = "constant" a number between 0 and 1 inclusive, indicating a compromise between left- and right-continuous step functions. If y0 and y1 are the values to the left and right of the point then the value is y0 if f == 0, y1 if f == 1, and y0*(1-f)+y1*f for intermediate values. In this way the result is right-continuous for f == 0 and left-continuous for f == 1, even for non-finite y values.
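To see the effect of f without loading zoo, the same step interpolation can be reproduced with base R's approx. This is only a sketch: it assumes that na.approx passes the non-NA positions and values on to approx, and the names ok and filled are made up for illustration.

```r
x <- c(0.1, NA, NA, 0.3, 0.2)
ok <- !is.na(x)                      # positions that hold observed values

# Step interpolation between observed points; f = 0.5 averages the
# left and right neighbours instead of carrying the left one forward
filled <- approx(x = which(ok), y = x[ok],
                 xout = seq_along(x),
                 method = "constant", f = 0.5)$y
print(filled)
```

Positions that fall outside the observed range stay NA under approx's default rule = 1, which is why the leading NAs still have to be set to 0 afterwards.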
Lastly, if we don't want to replace the NAs at the beginning and end of each row we need na.rm = FALSE.
From ?na.approx
na.rm : logical. If the result of the (spline) interpolation still results in NAs, should these be removed?
data
dat <- structure(list(Date_1 = c(NA, 0.1, NA, NA), Date_2 = c(0.1, NA,
NA, NA), Date_3 = c(0.2, NA, NA, 0.1), Date_4 = c(NA, 0.3, NA,
0.2), Date_5 = c(0.3, 0.2, 0.3, 0.1)), .Names = c("Date_1", "Date_2",
"Date_3", "Date_4", "Date_5"), class = "data.frame", row.names = c("A",
"B", "C", "E"))
EDIT
If there are NAs in the last column we can replace these with the last non-NAs before we apply na.approx as shown above.
dat$Date_6[is.na(dat$Date_6)] <- dat[cbind(1:nrow(dat),
max.col(!is.na(dat), ties.method = "last"))][is.na(dat$Date_6)]
This is another possible answer, using na.locf from the zoo package.
Edit: apply is actually not required; This solution fills in the last observed value if this value is missing.
# create the dataframe
Date1 <- c(NA,.1,NA,NA)
Date2 <- c(.1, NA,NA,NA)
Date3 <- c(.2,NA,NA,.1)
Date4 <- c(NA,.3,NA,.2)
Date5 <- c(.3,.2,.3,.1)
Date6 <- c(.1,NA,NA,NA)
df <- as.data.frame(cbind(Date1,Date2,Date3,Date4,Date5,Date6))
rownames(df) <- c('A','B','C','D')
> df
Date1 Date2 Date3 Date4 Date5 Date6
A NA 0.1 0.2 NA 0.3 0.1
B 0.1 NA NA 0.3 0.2 NA
C NA NA NA NA 0.3 NA
D NA NA 0.1 0.2 0.1 NA
# Load library
library(zoo)
df2 <- t(na.locf(t(df),na.rm = F)) # fill last observation carried forward
df3 <- t(na.locf(t(df),na.rm = F, fromLast = T)) # last obs carried backward
df4 <- (df2 + df3)/2 # mean of both dataframes
df4 <- t(na.locf(t(df4),na.rm = F)) # fill last observation carried forward
df4[is.na(df4)] <- 0 # NA values are 0
Date1 Date2 Date3 Date4 Date5 Date6
A 0.0 0.1 0.2 0.25 0.3 0.1
B 0.1 0.2 0.2 0.30 0.2 0.2
C 0.0 0.0 0.0 0.00 0.3 0.3
D 0.0 0.0 0.1 0.20 0.1 0.1
Here's another option with base R + rollmean from zoo (clearly easy to rewrite in base R for this case with window size k = 2).
t(apply(df, 1, function(x) {
means <- c(0, rollmean(na.omit(x), 2), tail(na.omit(x), 1))
replace(x, is.na(x), means[1 + cumsum(!is.na(x))[is.na(x)]])
}))
# Date1 Date2 Date3 Date4 Date5 Date6
# A 0.0 0.1 0.2 0.25 0.3 0.2
# B 0.1 0.2 0.2 0.30 0.2 0.1
# C 0.0 0.0 0.0 0.00 0.3 0.3
# D 0.1 0.2 0.3 0.20 0.1 0.1
# E 0.0 0.0 0.1 0.20 0.1 0.3
Explanation. Suppose that x is the first row of df:
# Date1 Date2 Date3 Date4 Date5 Date6
# A NA 0.1 0.2 NA 0.3 0.2
Then
means
# [1] 0.00 0.15 0.25 0.25 0.20
is a vector consisting of 0, the rolling means of each pair of consecutive non-NA elements, and the last non-NA element. Then all we need to do is replace the elements of x that are NA, using the elements of means at indices 1 + cumsum(!is.na(x))[is.na(x)]. That's the trickier part. Here
cumsum(!is.na(x))
# [1] 0 1 2 2 3 4
Meaning that the first element of x has seen 0 non-NA elements, while, say, the last one has seen 4 non-NA elements so far. Then
cumsum(!is.na(x))[is.na(x)]
# [1] 0 2
is about those NA elements in x that we want to replace. Notice that then
1 + cumsum(!is.na(x))[is.na(x)]
# [1] 1 3
corresponds to the elements of means that we want to use for replacement.
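Since the window size is 2, the rolling mean can be replaced by a plain average of neighbouring observed values. The following base R sketch applies the same indexing trick without zoo; fill_row and its local names are hypothetical and not part of the answer above.

```r
fill_row <- function(x) {
  obs <- x[!is.na(x)]                # observed values, in order
  n <- length(obs)
  # averages of consecutive observed values (rollmean with k = 2)
  mids <- if (n > 1) (obs[-1] + obs[-n]) / 2 else numeric(0)
  means <- c(0, mids, obs[n])        # leading fill, gap fills, trailing fill
  na <- is.na(x)
  # an NA that has seen j observed values so far takes means[j + 1]
  x[na] <- means[1 + cumsum(!na)[na]]
  x
}

row_a <- c(NA, 0.1, 0.2, NA, 0.3, 0.2)
row_c <- c(NA, NA, NA, NA, 0.3, NA)
print(fill_row(row_a))
print(fill_row(row_c))
```

It can be applied to a whole data frame in the same way as the zoo version, for example with t(apply(df, 1, fill_row)).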
I am finding the function below too complicated but it works, so here it goes.
fun <- function(x){
if(anyNA(x)){
inx <- which(!is.na(x))
if(inx[1] > 1) x[seq_len(inx[1] - 1)] <- 0
prev <- inx[1]
for(i in inx[-1]){
if(i - prev > 1){
m <- mean(c(x[i], x[prev]))
# fill only the NA gap strictly between the two observed values
x[(prev + 1):(i - 1)] <- m
}
prev <- i
}
}
x
}
res <- t(apply(df1, 1, fun))
res <- as.data.frame(res)
res
# Date.1 Date.2 Date.3 Date.4 Date.5
#A 0.0 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C 0.0 0.0 0.0 0.00 0.3
#E 0.0 0.0 0.1 0.20 0.1
Data.
df1 <- read.table(text = "
Date.1 Date.2 Date.3 Date.4 Date.5
A NA 0.1 0.2 NA 0.3
B 0.1 NA NA 0.3 0.2
C NA NA NA NA 0.3
E NA NA 0.1 0.2 0.1
", header = TRUE)

Calculations as per the formula

I need to do some calculation as per the below formula:
B1 = A1 + (1-A1) * B1
example:
B1 = 0.2 + (1 - 0.2) * 0.4
= 0.52
C1 = 0.52 + (1 - 0.52) * 0.8
= 0.904
D1 = 0.904 + (1 - 0.904) * 0.5
= 0.952
Same logic applied for other rows and other columns, there are total 11.
dataframe:
df
A B C D
0.2 0.4 0.8 0.5
0.4 0.5 0.6 0.2
0.8 0.1 0.5 0.4
0.3 0.4 0.1 0.8
Expected output:
A B C D
0.2 0.52 0.904 0.952
0.4 0.7 0.88 0.904
0.8 0.82 0.91 0.946
0.3 0.58 0.622 0.9244
I tried it for 1 with the below code:
Df <- df[-ncol(df)] + (1 - df[-ncol(df)]) * df[-1]
This gives me column B as in the expected output, but it does not work for the remaining columns.
Please help, thanks. BM.
You can do this recursively as follows:
do.call(cbind, Reduce(f = function(A1, B1) A1+(1-A1)*B1,
x = df,
accumulate = TRUE))
Explanation:
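Since df is a data.frame, which is a list of vectors, Reduce takes each column in turn and applies your function to the running result and the next column; accumulate = TRUE keeps every intermediate result. do.call(cbind, ...) then combines the results into a matrix (wrap it in as.data.frame() if you need a data.frame).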
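As a worked check, the sketch below runs the Reduce call on the data from the question and restores the column names. The names steps, out and alt are made up for illustration. The cross-check at the end relies on the identity 1 - new = (1 - previous) * (1 - current), which follows from rearranging the formula, so each column equals 1 minus a running product.

```r
df <- data.frame(A = c(0.2, 0.4, 0.8, 0.3),
                 B = c(0.4, 0.5, 0.1, 0.4),
                 C = c(0.8, 0.6, 0.5, 0.1),
                 D = c(0.5, 0.2, 0.4, 0.8))

# Each step combines the running result with the next column
steps <- Reduce(function(prev, cur) prev + (1 - prev) * cur,
                df, accumulate = TRUE)
out <- as.data.frame(setNames(steps, names(df)))
print(out)

# Cross-check: row-wise running product of (1 - value)
alt <- 1 - t(apply(1 - as.matrix(df), 1, cumprod))
print(all.equal(unname(as.matrix(out)), unname(alt)))
```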

Normalize blocks/sub-matrices within a matrix

I want to normalize (i.e., 0-1) blocks/sub-matrices within a square matrix based on row/col names. It is important that the normalized matrix correspond to the original matrix. The below code extracts the blocks, e.g. all col/row names == "A" and normalizes it by its max value. How do I put that matrix of normalized blocks back together so it corresponds to the original matrix, such that each single value of the normalized blocks are in the same place as in the original matrix. I.e. you cannot put the blocks together and then e.g. sort the normalized matrix by the original's matrix row/col names.
#dummy code
mat <- matrix(round(runif(81, 0, 50)),9,9)
rownames(mat) <- rep(LETTERS[1:3],3)
colnames(mat) <- rep(LETTERS[1:3],3)
mat.n <- matrix(0,nrow(mat),ncol(mat), dimnames = list(rownames(mat),colnames(mat)))
for(i in 1:length(LETTERS[1:3])){
? <- mat[rownames(mat)==LETTERS[1:3][i],colnames(mat)==LETTERS[1:3][i]] / max(mat[rownames(mat)==LETTERS[1:3][i],colnames(mat)==LETTERS[1:3][i]])
#For example,
mat.n[rownames(mat)==LETTERS[1:3][i],colnames(mat)==LETTERS[1:3][i]] <- # doesn't work
}
UPDATE
Using ave() as @G. Grothendieck suggested works for the blocks, but I'm not sure how it's normalizing beyond that.
mat.n <- mat / ave(mat, rownames(mat)[row(mat)], colnames(mat)[col(mat)], FUN = max)
Within block the normalization works, e.g.
mat[rownames(mat)=="A",colnames(mat)=="A"]
A A A
A 13 18 15
A 38 33 41
A 12 18 47
mat.n[rownames(mat.n)=="A",colnames(mat.n)=="A"]
A A A
A 0.2765957 0.3829787 0.3191489
A 0.8085106 0.7021277 0.8723404
A 0.2553191 0.3829787 1.0000000
But beyond that, it looks weird.
> round(mat.n,1)
A B C A B C A B C
A 0.3 0.2 0.1 0.4 0.2 1.0 0.3 0.9 1.0
B 0.9 0.8 0.9 0.4 0.5 0.4 0.4 0.9 0.0
C 0.0 0.4 0.4 0.0 0.8 0.5 0.4 0.9 0.0
A 0.8 0.9 0.5 0.7 0.9 0.6 0.9 0.4 0.4
B 0.1 0.8 0.7 1.0 0.3 0.5 0.1 1.0 0.8
C 0.4 0.0 0.2 0.2 0.2 0.6 1.0 0.4 1.0
A 0.3 0.4 0.3 0.4 0.6 0.8 1.0 1.0 0.3
B 0.6 0.2 0.5 0.9 0.3 0.2 0.9 0.3 1.0
C 0.5 0.9 0.7 1.0 0.4 0.5 1.0 1.0 0.9
In this case, I would expect 3 1s across the whole matrix- 1 for each block. But there're 10 1s, e.g. mat.n[3,2], mat.n[1,9]. I'm not sure how this function normalized between blocks.
UPDATE 2
#Original matrix.
#Suggested solution produces `NaN`
mat <- as.matrix(read.csv(text=",1.21,1.1,2.2,1.1,1.1,1.21,2.2,2.2,1.21,1.22,1.22,1.1,1.1,2.2,2.1,2.2,2.1,2.2,2.2,2.2,1.21,2.1,2.1,1.21,1.21,1.21,1.21,1.21,2.2,1.21,2.2,1.1,1.22,1.22,1.22,1.22,1.21,1.22,2.1,2.1,2.1,1.22
1.21,0,0,0,0,0,0,0,0,292,13,0,0,0,0,0,0,0,0,0,0,22,0,0,94,19,79,0,9,0,126,0,0,0,0,0,0,0,0,0,0,0,0
1.1,0,0,0,155,166,0,0,0,0,0,0,4,76,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,0,0,0,0,0,0,0,0,0,0
2.2,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1.1,0,201,0,0,79,0,0,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1.1,0,33,0,91,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1.21,8,0,0,0,0,0,0,0,404,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,37,26,18,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2.2,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,162,79,1,0,0,0,0,0,0,0,0,10,0,27,0,0,0,0,0,0,0,0,0,0,0
2.2,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,33,17,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0
1.21,207,0,0,0,0,1644,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,16,17,402,0,0,0,606,0,0,0,0,0,0,0,0,0,0,0,0
1.22,13,0,0,0,0,0,0,0,0,0,12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,26,0,0,15,0,0,0,0,0
1.22,0,0,0,0,0,0,0,0,0,71,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,374,6,121,6,21,0,0,0,0
1.1,0,0,0,44,0,0,0,0,0,0,0,0,103,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,0,0,0,0,0,0,0,0,0,0
1.1,0,0,0,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,12,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0
2.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,18,0,0,0,0,353,116,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,29,0,5,0
2.2,0,0,0,0,0,0,0,37,0,0,0,0,0,4,0,0,0,36,46,62,0,0,0,0,0,0,0,0,0,0,73,0,0,0,0,0,0,1,0,0,0,0
2.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,61,0,0,0,0,0,0,0,38,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2.2,17,0,23,0,0,0,444,65,0,0,0,0,0,0,0,78,0,0,42,30,15,0,0,0,0,0,0,0,4,0,18,0,0,0,0,0,0,0,0,0,0,0
2.2,0,0,0,0,0,0,75,8,0,0,0,0,0,0,0,87,0,74,0,85,0,0,0,0,0,0,0,0,1,0,19,0,25,0,0,0,0,0,0,0,0,0
2.2,0,0,13,0,0,0,12,20,0,0,0,0,0,0,0,118,0,29,92,0,25,0,0,0,0,0,0,0,0,0,16,0,48,0,0,0,0,0,0,0,0,0
1.21,14,0,1,0,0,0,0,0,17,0,0,0,0,0,0,0,0,0,0,14,0,0,0,0,0,0,0,0,3,0,20,0,0,0,0,0,0,0,0,0,0,0
2.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,204,0,0,0,0,0,0,0,133,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,44,0,0
2.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,67,0,0,0,0,0,0,143,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,12,15,0
1.21,79,0,0,0,0,0,0,0,34,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,38,26,6,9,0,112,0,0,0,0,0,0,0,0,0,0,0,0
1.21,11,0,0,0,0,17,0,0,49,0,0,0,0,0,0,0,0,0,0,0,0,0,0,28,0,0,0,32,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1.21,40,0,0,0,0,0,0,0,122,0,0,0,0,0,0,0,0,0,0,0,3,0,0,24,11,0,887,20,0,389,0,0,0,0,0,0,0,0,0,0,0,0
1.21,14,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,50,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1.21,34,0,0,0,0,26,0,0,56,0,0,0,0,0,0,0,0,0,0,0,0,0,0,54,9,297,13,0,0,16,0,0,0,0,0,0,0,0,0,0,0,0
2.2,0,0,0,0,0,0,39,0,0,0,0,0,0,0,0,25,0,17,12,20,25,0,0,0,0,0,0,0,0,0,393,0,7,0,0,0,0,0,0,0,0,0
1.21,177,0,0,0,0,8,0,0,775,0,0,0,0,0,0,0,0,0,0,0,0,0,0,113,0,227,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2.2,0,0,0,0,0,0,21,17,0,0,0,0,0,0,0,0,0,42,30,16,0,0,0,0,0,0,0,0,165,0,0,0,0,0,0,0,0,0,0,0,0,0
1.1,0,6,0,28,0,0,0,0,0,0,0,9,30,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1.22,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,4,37,0,0,0,0,0,0,0,0,3,0,0,0,0,14,7,0,0,18,0,0,0,0
1.22,0,0,0,0,0,0,0,0,0,44,785,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,21,0,44,177,13,24,0,0,0,0
1.22,0,0,0,0,0,0,30,0,0,182,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7,12,0,1231,135,17,0,0,0,0
1.22,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,73,1308,0,669,16,0,0,0,8
1.21,0,0,0,0,0,0,0,0,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,33,197,626,0,44,0,0,0,0
1.22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,24,37,12,80,0,0,0,0,16
2.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,24,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,24,54,0
2.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,75,0,0,0
2.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,58,0,1,0,0,0,0,28,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,61,2,0,0
1.22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,9,0,0,0,0"))
ids <- read.csv(text=",x
1,1.21
2,1.1
3,2.2
4,1.1
5,1.1
6,1.21
7,2.2
8,2.2
9,1.21
10,1.22
11,1.22
12,1.1
13,1.1
14,2.2
15,2.1
16,2.2
17,2.1
18,2.2
19,2.2
20,2.2
21,1.21
22,2.1
23,2.1
24,1.21
25,1.21
26,1.21
27,1.21
28,1.21
29,2.2
30,1.21
31,2.2
32,1.1
33,1.22
34,1.22
35,1.22
36,1.22
37,1.21
38,1.22
39,2.1
40,2.1
41,2.1
42,1.22")
mat <- mat[,-1]
rownames(mat) <- ids$x
colnames(mat) <- ids$x
ans <- mat / ave(mat, rownames(mat)[row(mat)], colnames(mat)[col(mat)], FUN = max)
Any help is much appreciated, thanks.
Use ave to get the maxima:
mat / ave(mat, rownames(mat)[row(mat)], colnames(mat)[col(mat)], FUN = max)
For example, there are 9 ones, as expected, and there is one 1 in each block, also as expected. (There could be more than 9 if the matrix happened to have multiple maxima in one or more blocks, but there should not be fewer than 9.)
set.seed(123)
mat <- matrix(round(runif(81, 0, 50)),9,9)
rownames(mat) <- rep(LETTERS[1:3],3)
colnames(mat) <- rep(LETTERS[1:3],3)
ans <- mat / ave(mat, rownames(mat)[row(mat)], colnames(mat)[col(mat)], FUN = max)
sum(ans == 1)
## [1] 9
# there are no duplicates (i.e. a block showing up more than once) hence
# there is exactly one 1 in each block
w <- which(ans == 1, arr = TRUE)
anyDuplicated(cbind(rownames(mat)[w[, 1]], colnames(mat)[w[, 2]]))
## [1] 0
ADDED
If some blocks are entirely zero (which is the case in UPDATE 2) then you will get NaNs for those blocks. If you want 0s instead for the all-zero blocks try this:
xmax <- function(x) if (all(x == 0)) 0 else x/max(x)
ave(mat, rownames(mat)[row(mat)], colnames(mat)[col(mat)], FUN = xmax)
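A small deterministic matrix makes the behaviour easier to check by eye than random data. The sketch below uses a made-up 4 x 4 matrix with two labels and one all-zero block; scale_block is a hypothetical variant of xmax that returns the zeros unchanged.

```r
mat <- matrix(c(2, 4,  0,  0,
                8, 6,  0,  0,
                1, 3,  5, 10,
                2, 9, 20, 15),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "A", "B", "B"),
                              c("A", "A", "B", "B")))

# Divide by the block maximum, leaving all-zero blocks as zeros
scale_block <- function(v) if (all(v == 0)) v else v / max(v)

# Each cell is grouped by its (row label, column label) pair
norm <- ave(mat,
            rownames(mat)[row(mat)],
            colnames(mat)[col(mat)],
            FUN = scale_block)
print(norm)
print(sum(norm == 1))  # one per non-zero block when maxima are unique
```

Every value stays in its original cell because ave writes the results back by position instead of reassembling the blocks.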

How to plot the min and max of each row in y-axis?

I'm new and learning R and got a problem plotting the min and max value of a matrix.
The matrix is something like this:
X Y1 Y2 Y3 Y4 Y5
1 0.5 0.6 0.3 0.3 0.2
2 0.3 0.4 0.1 0.7 0.4
3 0.4 0.3 0.5 0.6 0.3
Now I would like to plot the first column(X) as x-axis, and pick out the min and max values of each row (e.g. X=1, Ymin=0.2, Ymax=0.6 in the first row), and plot them as the y-axis.
Could someone help me to figure it out?
Here is one possibility, considering you want a scatterplot.
#reading your data
table = read.table(header=TRUE, text="
X Y1 Y2 Y3 Y4 Y5
1 0.5 0.6 0.3 0.3 0.2
2 0.3 0.4 0.1 0.7 0.4
3 0.4 0.3 0.5 0.6 0.3", sep= " ")
#using a for loop to filter only data to be used in the plot (X, Min_Y, Max_Y)
df = data.frame(X=NA,min_Y=NA,max_Y=NA)
for (i in seq_len(nrow(table))) {
X = table[i,1] #X values from table
min_Y = c(min(table[i,c(2:6)])) #minimum values inside table columns 2 to 6
max_Y = c(max(table[i,c(2:6)])) #maximum values inside table columns 2 to 6
df = rbind(df,c(X,min_Y,max_Y)) #new df with X, Min_Y, Max_Y
}
df = df[-1,]
df #df results
X min_Y max_Y
2 1 0.2 0.6
3 2 0.1 0.7
4 3 0.3 0.6
#produce scatterplot with R package ggplot2
library(ggplot2)
ggplot(df) +
geom_point(aes(x=X,y=min_Y),colour="red") +
geom_point(aes(x=X,y=max_Y),colour="blue") +
ylab("Y") +
theme_bw()
A solution with rbind and 2 apply functions (for min and max) (surely not the best tho) :
mat <- as.matrix(read.table(header = T, text = "X Y1 Y2 Y3 Y4 Y5
1 0.5 0.6 0.3 0.3 0.2
2 0.3 0.4 0.1 0.7 0.4
3 0.4 0.3 0.5 0.6 0.3"))
mat2 <- t(rbind(X = mat[ ,1], Ymin = apply(mat[ ,-1], 1, min), Ymax = apply(mat[ ,-1], 1, max)))
matplot(mat2[ ,1], mat2[ ,-1], pch = 20, cex = 1.5)
For example using pmin and pmax:
mn = Reduce(pmin,as.list(dat[,-1]))
mx = Reduce(pmax,as.list(dat[,-1]))
library(lattice)
xyplot(mn+mx~x,data.frame(x= dat[,1],mn=mn,mx=mx),
type='l',auto.key=T,
ylab=list(label="max and min"))
Where dat is :
dat <-
read.table(text='
X Y1 Y2 Y3 Y4 Y5
1 0.5 0.6 0.3 0.3 0.2
2 0.3 0.4 0.1 0.7 0.4
3 0.4 0.3 0.5 0.6 0.3',header=TRUE)
So here is (another...) way to get the row-wise min and max (using m as your matrix).
z <- t(apply(m,1,
function(x)return(c(x[1],min=min(x[2:length(x)]),max=max(x[2:length(x)])))))
z <- data.frame(z)
z
# X min max
# [1,] 1 0.2 0.6
# [2,] 2 0.1 0.7
# [3,] 3 0.3 0.6
From here, plotting is straightforward.
plot(z$X, z$max, ylim=c(min(z$min),max(z$max)),col="blue")
points(z$X, z$min, col="red")
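For reference, the row-wise extremes can also be computed without a loop or apply by handing the Y columns to pmin and pmax. This is a minimal sketch using the data from the question; the names dat and rng are made up for illustration.

```r
dat <- read.table(header = TRUE, text = "
X Y1 Y2 Y3 Y4 Y5
1 0.5 0.6 0.3 0.3 0.2
2 0.3 0.4 0.1 0.7 0.4
3 0.4 0.3 0.5 0.6 0.3")

ycols <- unname(dat[-1])             # the Y columns as an unnamed list
rng <- data.frame(X    = dat$X,
                  ymin = do.call(pmin, ycols),  # element-wise min across columns
                  ymax = do.call(pmax, ycols))  # element-wise max across columns
print(rng)
```

The resulting data frame has one row per X value and can be passed to any of the plotting calls shown in the answers above.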

Use index of a list of data.frames to apply a function in certain elements of a data frame

I have a data.frame that looks like this:
>df
A B C P1 P2 P3 P4 P5 P6
1 a 1 0.1 0.1 0.1 0.4 0.2 0.1 0.4
2 b 1 0.2 0.1 0.4 0.2 0.1 0.2 0.2
3 c 1 0.4 0.4 0.1 0.2 0.1 0.1 0.4
4 d 2 0.1 0.1 0.7 0.5 0.1 0.7 0.1
5 e 2 0.5 0.7 0.5 0.1 0.7 0.1 0.5
6 f 2 0.7 0.5 0.5 0.7 0.1 0.7 0.1
7 g 3 0.1 0.1 0.1 0.2 0.2 0.2 0.5
8 h 3 0.2 0.2 0.1 0.5 0.2 0.2 0.5
9 i 3 0.5 0.1 0.2 0.1 0.1 0.5 0.2
And a list of data.frames similar to this one:
list.1 <- list(data.frame(AA=c("a","b","c","d")),
data.frame(BB=c("e","f")),
data.frame(CC=c("a","b","i")),
data.frame(DD=c("d","e","f","g")))
Besides, I have this function:
Fisher.test <- function(p) {
Xsq <- -2*sum(log(p), na.rm=T)
p.val <- 1-pchisq(Xsq, df = 2*length(p))
return(p.val)
}
I would like to select the rows of df whose df$A values appear in each data.frame of the list and compute Fisher.test on P1...P6 for each selection. So far I have been merging df with each element of list.1 and then applying Fisher.test to each merged data.frame:
func <- function(x,y){merge(x,y, by.x=names(x)[1], by.y=names(y)[1])}
ll <- lapply(list.1, func, df)
ll.fis <- lapply(ll, FUN=function(i){apply(i[,4:9],2,Fisher.test)})
This works but my real data is huge, so I think that a different approach could use the index of elements of list.1[1] to calculate Fisher.test in df storing the result, then use the index of list.1[2] and calculate Fisher.test and so on. In this way, the merging would be avoided because all the calculations are made over df, also, the RAM resources would be also minimised with this approach. However, I have no clue how to achieve this. Perhaps a for loop?
Thanks
data.table is helpful here: setting a key lets you subset by the values in each list element with the .( ) syntax, and keyed subsetting is extremely fast on large data compared with, say, subset().
library(data.table)
# convert to data.table, setting the key to the column `A`
DT <- data.table(df, key="A")
p.col.names <- paste0("P", 1:6)
results <- lapply(list.1, function(ll)
DT[.(ll)][, lapply(.SD, Fisher.test), .SDcols=p.col.names] )
results
side note
You might want to fix the names of list.1 so that the results from lapply are properly named
# fix the names, helpful for the lapply
names(list.1) <- lapply(list.1, names)
results:
$AA
P1 P2 P3 P4 P5 P6
1: 0.04770305 0.1624142 0.2899578 0.029753 0.1070376 0.17549
$BB
P1 P2 P3 P4 P5 P6
1: 0.7174377 0.5965736 0.2561482 0.2561482 0.2561482 0.1997866
$CC
P1 P2 P3 P4 P5 P6
1: 0.0317663 0.139877 0.139877 0.05305057 0.1620897 0.2189595
$DD
P1 P2 P3 P4 P5 P6
1: 0.184746 0.4246214 0.2704228 0.1070376 0.3215871 0.1519672
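The index-based idea from the question can also be sketched in base R: look up the row positions with match and run the test on those rows directly, so nothing is merged. This is an illustrative sketch, not the answer's method. It uses only the P1 and P2 columns of the example data to stay short, and the names p.cols, ids and rows are made up.

```r
Fisher.test <- function(p) {
  Xsq <- -2 * sum(log(p), na.rm = TRUE)
  pchisq(Xsq, df = 2 * length(p), lower.tail = FALSE)
}

# Reduced copy of the example data (columns A, P1, P2 only)
df <- data.frame(A  = letters[1:9],
                 P1 = c(0.1, 0.1, 0.4, 0.1, 0.7, 0.5, 0.1, 0.2, 0.1),
                 P2 = c(0.1, 0.4, 0.1, 0.7, 0.5, 0.5, 0.1, 0.1, 0.2),
                 stringsAsFactors = FALSE)

list.1 <- list(AA = data.frame(AA = c("a", "b", "c", "d")),
               BB = data.frame(BB = c("e", "f")),
               CC = data.frame(CC = c("a", "b", "i")),
               DD = data.frame(DD = c("d", "e", "f", "g")))

p.cols <- c("P1", "P2")
results <- lapply(list.1, function(ids) {
  rows <- match(ids[[1]], df$A)      # row positions of the requested ids
  sapply(df[rows, p.cols], Fisher.test)
})
print(results)
```

Each element of results is a named vector with one combined p-value per P column, and only the selected rows are copied at each step.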
