Retrieve index of newly added row - for loop in R

I am trying to retrieve the index of a newly-added row, added via a for loop.
Starting from the beginning, I have a list of matrices of p-values, each with a variable number of rows and columns. This is because not all groups have an adequate number of treated individuals to run t-tests. The following is what prints to the console when I access this sample list:
$Group1
            Normal Treatment 1 Treatment 2
Treatment 1      1          NA          NA
Treatment 2      1           1          NA
Treatment 3      1           1           1

$Group2
            Normal Treatment 2
Treatment 2      1          NA
Treatment 4      1           1
I would like every group to have the same number of rows and columns, in the correct order, with the missing values just filled in with NAs. This is a sample of what I would like:
$Group1
            Normal Treatment 1 Treatment 2 Treatment 3
Treatment 1      1          NA          NA          NA
Treatment 2      1           1          NA          NA
Treatment 3      1           1           1          NA
Treatment 4     NA          NA          NA          NA

$Group2
            Normal Treatment 1 Treatment 2 Treatment 3
Treatment 1     NA          NA          NA          NA
Treatment 2      1          NA          NA          NA
Treatment 3     NA          NA          NA          NA
Treatment 4      1           1          NA          NA
Here is the code I have so far:
fix.results.row <- function(x, factors) {
  results.matrix <- x
  num <- 1
  for (i in factors) {
    if (!i %in% rownames(results.matrix)) {
      results.matrix <- rbind(results.matrix, NA)
      rownames(results.matrix)[num] <- i
    }
    num <- num + 1
  }
  results.matrix <- results.matrix[factors, , drop = FALSE]
  return(results.matrix)
}
In the function above, x would be my list of matrices, and factors would be a list of all the factors in the order I want them. I have a similar function for adding columns.
My problem, as I see it, is in Group 2. If it sees that I'm missing Treatment 1, it will replace the rowname Treatment 2 with the rowname Treatment 1, so the data for Treatment 2 is now mislabeled Treatment 1. Then it reorders the variables the way I want them, but the data are already mislabeled!
If I could access the index of the newly-added row, which changes from group to group, then I could just change that specific row name. Any suggestions? Please let me know if there's any more information I need to provide. I tried to cover everything but I'm not sure if there's anything else you all need.
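For what it's worth, the index of the newly added row is simply nrow(results.matrix) immediately after the rbind(), because rbind() always appends at the bottom. A minimal corrected sketch (my suggestion, not one of the answers below):

# Sketch: rename the row that was just appended rather than row `num`,
# then reorder once at the end.
fix.results.row <- function(x, factors) {
  results.matrix <- x
  for (i in factors) {
    if (!i %in% rownames(results.matrix)) {
      results.matrix <- rbind(results.matrix, NA)
      rownames(results.matrix)[nrow(results.matrix)] <- i  # index of the new row
    }
  }
  results.matrix[factors, , drop = FALSE]  # reorder by the desired factor order
}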

This isn't very elegant, but it might work better than using two functions to fill in the rows and columns separately.
Here, x is a list of all your matrices; factors is an optional list of desired row and column names.
fix_rc <- function(x, factors) {
  f <- function(x) factor(ul <- unique(unlist(x)), levels = sort(ul))
  if (missing(factors))
    factors <- list(f(sapply(x, rownames)),
                    f(sapply(x, colnames)))
  template <- matrix(NA, length(factors[[1]]), length(factors[[2]]),
                     dimnames = factors)
  lapply(x, function(xx) {
    ## original
    # xx <- rbind(xx, template[, colnames(xx)])
    # xx <- cbind(xx, template[rownames(xx), ])
    # xx[rownames(template), colnames(template)]
    ## better http://stackoverflow.com/questions/31050787/r-how-to-match-join-2-matrices-of-different-dimensions-nrow-ncol/31051218#31051218
    xx <- as.data.frame.table(xx)
    template[as.matrix(xx[, 1:2])] <- xx$Freq
    template
  })
}
Here is the data I am using
l <- list(Group1 = matrix(c(1,1,1,NA,1,1,NA,NA,1), 3, 3,
                          dimnames = list(paste('Treatment', 1:3),
                                          c('Normal', paste('Treatment', 1:2)))),
          Group2 = matrix(c(1,1,NA,1), 2, 2,
                          dimnames = list(paste('Treatment', c(2,4)),
                                          c('Normal', 'Treatment 2'))))
# $Group1
#             Normal Treatment 1 Treatment 2
# Treatment 1      1          NA          NA
# Treatment 2      1           1          NA
# Treatment 3      1           1           1
#
# $Group2
#             Normal Treatment 2
# Treatment 2      1          NA
# Treatment 4      1           1
And you can use it like this. Note that when you don't supply factors, the function will get all the row and column names from your list of matrices
fix_rc(l)
# $Group1
#             Normal Treatment 1 Treatment 2
# Treatment 1      1          NA          NA
# Treatment 2      1           1          NA
# Treatment 3      1           1           1
# Treatment 4     NA          NA          NA
#
# $Group2
#             Normal Treatment 1 Treatment 2
# Treatment 1     NA          NA          NA
# Treatment 2      1          NA          NA
# Treatment 3     NA          NA          NA
# Treatment 4      1          NA           1
I'm not sure where Treatment 3 in the columns of your desired output came from, but you can get it by supplying factors explicitly:
fix_rc(l, factors = list(paste('Treatment', 1:6),
                         c('Normal', paste('Treatment', 1:3))))
# $Group1
#             Normal Treatment 1 Treatment 2 Treatment 3
# Treatment 1      1          NA          NA          NA
# Treatment 2      1           1          NA          NA
# Treatment 3      1           1           1          NA
# Treatment 4     NA          NA          NA          NA
# Treatment 5     NA          NA          NA          NA
# Treatment 6     NA          NA          NA          NA
#
# $Group2
#             Normal Treatment 1 Treatment 2 Treatment 3
# Treatment 1     NA          NA          NA          NA
# Treatment 2      1          NA          NA          NA
# Treatment 3     NA          NA          NA          NA
# Treatment 4      1          NA           1          NA
# Treatment 5     NA          NA          NA          NA
# Treatment 6     NA          NA          NA          NA

Not a complete solution, but if you used data frames: wouldn't it be easier to get there?
df1 <- data.frame(normal = c(1,1,1)
                  , treatment1 = c(NA,1,1)
                  , treatment2 = c(NA,NA,1)
                  , row.names = c("Treatment1", "Treatment2", "Treatment3")
)
df2 <- data.frame(normal = c(1,1)
                  , treatment2 = c(NA,1)
                  , row.names = c("Treatment2", "Treatment4")
)
df1$names <- rownames(df1)
df2$names <- rownames(df2)
df3 <- merge(df1, df2, by = "names", all = TRUE)
df3
       names normal.x treatment1 treatment2.x normal.y treatment2.y
1 Treatment1        1         NA           NA       NA           NA
2 Treatment2        1          1           NA        1           NA
3 Treatment3        1          1            1       NA           NA
4 Treatment4       NA         NA           NA        1            1
Now all you have to do is combine columns based on their names
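For instance, a minimal sketch of that last step (my addition, not part of the answer above), assuming the non-NA values in each .x/.y pair never conflict, using dplyr::coalesce:

library(dplyr)
df4 <- df3 %>%
  mutate(normal     = coalesce(normal.x, normal.y),
         treatment2 = coalesce(treatment2.x, treatment2.y)) %>%
  select(names, normal, treatment1, treatment2)
df4
#        names normal treatment1 treatment2
# 1 Treatment1      1         NA         NA
# 2 Treatment2      1          1         NA
# 3 Treatment3      1          1          1
# 4 Treatment4      1         NA          1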

Related

Adding two variables which have NAs present

Let's say the data is 'ab':
a <- c(1,2,3,NA,5,NA)
b <- c(5,NA,4,NA,NA,6)
ab <-c(a,b)
I would like to have a new variable which is the sum of the two, but keeping NAs as follows:
desired output:
ab$c <-(6,2,7,NA,5,6)
so number + NA should equal the number.
I tried the following, but it does not work as desired:
ab$c <- a+b
gives me : 6 NA 7 NA NA NA
I also don't know how to include "na.rm=TRUE", which is something I was trying.
I would also like to create a third variable, a categorical based on a cutoff: <= 4 means event 1, otherwise 0.
desired output:
ab$d <-(1,1,1,NA,0,0)
I tried:
ab$d =ifelse(ab$a<=4|ab$b<=4,1,0)
print(ab$d)
gives me logical(0)
Thanks!
a <- c(1,2,3,NA,5,NA)
b <- c(5,NA,4,NA,NA,6)
dfd <- data.frame(a,b)
dfd$c <- rowSums(dfd, na.rm = TRUE)
dfd$c <- ifelse(is.na(dfd$a) & is.na(dfd$b), NA_integer_, dfd$c)
dfd$d <- ifelse(dfd$c >= 4, 1, 0)
dfd
   a  b  c  d
1  1  5  6  1
2  2 NA  2  0
3  3  4  7  1
4 NA NA NA NA
5  5 NA  5  1
6 NA  6  6  1
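Note that the cutoff above is applied to the sum c. If the intent was instead to flag rows where the smaller of a and b is <= 4, which reproduces the desired (1,1,1,NA,0,0), here is a hedged sketch (my addition, not part of the answer) using pmin:

# Sketch only: pmin() with na.rm = TRUE ignores a single NA but still
# returns NA when both values are NA.
dfd$d2 <- as.numeric(pmin(dfd$a, dfd$b, na.rm = TRUE) <= 4)
dfd$d2
# [1]  1  1  1 NA  0  0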

How do I calculate Euclidean distances across NA values in R

I have a data frame like this:
individual <- c("1",NA,NA,NA,NA,NA,NA,NA,"1","1")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5)
frame <- rep(1:10)
df <- data.frame(individual,x,y,frame)
I have an ID column labeled 'individual', xy coordinates, and a frame number.
I need to calculate the Euclidean distances for the x, y coordinates between rows, but across the NA values.
So, in the example I gave - I would need to calculate the distances between rows 1 and 9, as well as 10 and 9. In the real data there would be substantially more rows of course.
Eventually what I need to do is interpolate the data, so that if the euclidean distance is <5, fill in the data rows that are missing with the ID of the individual. If the euclidean distance is >5, then ignore and interpolate nothing.
Here is the example result data frame that's needed:
individual <- c("1","1","1","1","1","1","1","1","1","1")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5)
frame <- rep(1:10)
dist_measure <- c(NA,NA,NA,NA,NA,NA,NA,NA,2,2.828427)
df <- data.frame(individual,x,y,frame,dist_measure)
Any advice on an approach to this problem is greatly appreciated. My first thought was to have a function that calculates Euclidean distance and put it in a for loop, but I'm a bit stuck on how to make this work over the NA values. I thought using the lag function in the tidyverse might help, but again I'm not sure how to integrate that into the loop/function.
Thank you in advance.
This should work. I've added another individual into the hypothetical data to show how it works.
individual <- c("1",NA,NA,NA,NA,NA,NA,NA,"1","1",
"2",NA,NA,NA,NA,NA,NA,NA,"2","2")
x <- c(665,NA,NA,NA,NA,NA,NA,NA,663,665,
.665,NA,NA,NA,NA,NA,NA,NA,.663,.665)
y <- c(-474.5,NA,NA,NA,NA,NA,NA,NA,-474.5,-472.5,
-.4745,NA,NA,NA,NA,NA,NA,NA,-.4745,-.4725)
frame <- rep(1:10, 2)
df <- data.frame(individual,x,y,frame)
for(i in 1:2){
tmp <- df[min(which(df$individual == as.character(i))):
max(which(df$individual == as.character(i))), ]
ends <- range(which(is.na(tmp$individual))) + c(-1,1)
if(nrow(tmp) > 1 & ends[1] > 0 & ends[2] <= nrow(tmp)){
d <- c(dist(tmp[ends, c("x", "y")]))
if(d < 5){
df$individual[min(which(df$individual == as.character(i))):
max(which(df$individual == as.character(i)))] <- tmp$individual[ends[1]]
}
}
}
df
#    individual       x         y frame
# 1           1 665.000 -474.5000     1
# 2           1      NA        NA     2
# 3           1      NA        NA     3
# 4           1      NA        NA     4
# 5           1      NA        NA     5
# 6           1      NA        NA     6
# 7           1      NA        NA     7
# 8           1      NA        NA     8
# 9           1 663.000 -474.5000     9
# 10          1 665.000 -472.5000    10
# 11          2   0.665   -0.4745     1
# 12          2      NA        NA     2
# 13          2      NA        NA     3
# 14          2      NA        NA     4
# 15          2      NA        NA     5
# 16          2      NA        NA     6
# 17          2      NA        NA     7
# 18          2      NA        NA     8
# 19          2   0.663   -0.4745     9
# 20          2   0.665   -0.4725    10
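The loop above fills in the missing IDs but does not create the dist_measure column from the desired output. For that part, here is a minimal sketch (my addition, not from the answer), run on the question's original single-individual df and assuming the distance is measured between consecutive rows that actually have coordinates; with multiple individuals you would repeat it within each individual's block:

# Rows that have coordinates; distance is taken from each such row to the
# previous one, skipping the NA rows in between.
obs <- which(!is.na(df$x) & !is.na(df$y))
df$dist_measure <- NA_real_
for (k in seq_along(obs)[-1]) {
  prev <- obs[k - 1]
  cur  <- obs[k]
  df$dist_measure[cur] <- sqrt((df$x[cur] - df$x[prev])^2 +
                               (df$y[cur] - df$y[prev])^2)
}
df$dist_measure
# NA NA NA NA NA NA NA NA 2.000000 2.828427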

How to find whether at least one column satisfies a certain condition, with NAs

I have a dataframe with multiple columns: I need to identify those rows in which there is at least one outlier among some of the columns, but I do not know how to deal with NAs.
An example of dataframe (different from mine):
#  X atq ME.BE.crsp X2
#  1  10        0.5  4
# NA   2        1.3  5
#  3  NA          5  2
# NA  NA         NA NA
#  2   4         NA  3
I'm doing the following:
data = data %>%
  mutate(outlier = as.numeric(atq > quantile(atq, 0.99, na.rm = T) |
                              atq < quantile(atq, 0.01, na.rm = T) |
                              ME.BE.crsp > quantile(ME.BE.crsp, 0.99, na.rm = T) |
                              ME.BE.crsp < quantile(ME.BE.crsp, 0.01, na.rm = T)))
My expected result is (I'm making up the outliers, the point is about NAs):
#  X atq ME.BE.crsp X2 outlier
#  1  10        0.5  4       1
# NA   2        1.3  5       0
#  3  NA          5  2       0
# NA  NA         NA NA      NA
#  2   4         NA  3       1
What I get instead is:
#  X atq ME.BE.crsp X2 outlier
#  1  10        0.5  4       1
# NA   2        1.3  5       0
#  3  NA          5  2      NA
# NA  NA         NA NA      NA
#  2   4         NA  3      NA
So, it seems that as soon as as.numeric finds an NA in either data$atq or data$ME.BE.crsp, it just assigns NA to data$outlier, while I would like it to consider the non-NA value and assign 0 or 1 based on that one.
Any suggestions? Thanks!
If both 'atq' and 'ME.BE.crsp' are NA and the result should be NA, then use a condition with case_when:
library(dplyr)
data %>%
  mutate(outlier = case_when(
    is.na(atq) & is.na(ME.BE.crsp) ~ NA_real_,
    TRUE ~ as.numeric((atq > quantile(atq, 0.99, na.rm = TRUE)) & !is.na(atq) |
                      (atq < quantile(atq, 0.01, na.rm = T)) & !is.na(atq) |
                      (ME.BE.crsp > quantile(ME.BE.crsp, 0.99, na.rm = T)) & !is.na(ME.BE.crsp) |
                      (ME.BE.crsp < quantile(ME.BE.crsp, 0.01, na.rm = T)) & !is.na(ME.BE.crsp))
  ))
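An equivalent way to express the same idea (a sketch of my own, not part of the answer, and assuming the same data and dplyr as above) is a small helper with a hypothetical name that turns an NA comparison into FALSE, so only the all-NA case propagates NA:

# Hypothetical helper: treat NA comparison results as FALSE.
na_false <- function(x) ifelse(is.na(x), FALSE, x)

data %>%
  mutate(outlier = ifelse(is.na(atq) & is.na(ME.BE.crsp), NA_real_,
                          as.numeric(na_false(atq > quantile(atq, 0.99, na.rm = TRUE)) |
                                     na_false(atq < quantile(atq, 0.01, na.rm = TRUE)) |
                                     na_false(ME.BE.crsp > quantile(ME.BE.crsp, 0.99, na.rm = TRUE)) |
                                     na_false(ME.BE.crsp < quantile(ME.BE.crsp, 0.01, na.rm = TRUE)))))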

Using multiple data frames to introduce new variables into each other in R

I've got three data frames (Df1, Df2, Df3). These data frames have some variables in common, but they also each contain some unique variables. I'd like to make sure that all variables are represented in all data frames, e.g. material is present in Df2 but not Df1, so I'd like to create a variable named material in Df1 and set it to NA. Thanks for any help.
Starting point (dfs):
Df1 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"buyer"=c(1,1,1))
Df2 <- data.frame("color"=c(1,1,1),"material"=c(1,1,1),"size"=c(1,1,1))
Df3 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"key"=c(1,1,1))
Desired outcome (dfs):
Df1 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"material"=c(NA,NA,NA),"buyer"=c(1,1,1),"size"=c(NA,NA,NA),"key"=c(NA,NA,NA))
Df2 <- data.frame("color"=c(1,1,1),"price"=c(NA,NA,NA),"material"=c(1,1,1),"buyer"=c(NA,NA,NA),"size"=c(1,1,1),"key"=c(NA,NA,NA))
Df3 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"material"=c(NA,NA,NA),"buyer"=c(NA,NA,NA),"size"=c(NA,NA,NA),"key"=c(1,1,1))
My code so far is below. (I'm trying to compare the variable names in an individual data frame with the variable names in all three data frames, and use the ones not present in the individual data frame to generate the new variables set to NA. But I end up with: Error in VarDf1[, NewVariables] <- NA : incorrect number of subscripts on matrix.) I don't know how to fix it.
dfs <- list(Df1, Df2, Df3)
numdfs <- length(dfs)
for (i in 1:numdfs) {
  VarDf1 <- as.vector(names(Df1))
  VarDf2 <- as.vector(names(Df2))
  VarDf3 <- as.vector(names(Df3))
  VarAll <- c(VarDf1, VarDf2, VarDf3)
  NewVariables <- as.vector(setdiff(VarAll, dfs[i]))
  dfs[i][, NewVariables] <- NA
}
rbind.fill from the plyr package does what you expect while also combining everything into a big data.frame:
plyr::rbind.fill(Df1,Df2,Df3)
  color price buyer material size key
1     1     1     1       NA   NA  NA
2     1     1     1       NA   NA  NA
3     1     1     1       NA   NA  NA
4     1    NA    NA        1    1  NA
5     1    NA    NA        1    1  NA
6     1    NA    NA        1    1  NA
7     1     1    NA       NA   NA   1
8     1     1    NA       NA   NA   1
9     1     1    NA       NA   NA   1
You can subset the data back out into new data.frames, as sketched below.
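For example, a minimal sketch of that subsetting step (my addition, not part of the answer), relying on the fact that rbind.fill keeps the input rows in order so the blocks can be recovered from the original row counts:

combined <- plyr::rbind.fill(Df1, Df2, Df3)

# Recover the three blocks by their original row counts.
idx <- rep(1:3, times = c(nrow(Df1), nrow(Df2), nrow(Df3)))
out <- split(combined, idx)
Df1_full <- out[[1]]
Df2_full <- out[[2]]
Df3_full <- out[[3]]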
This method is similar to rbind.fill, but it will let you separate it back into 3 data frames at the end.
We use tibble::lst rather than list so that the names of the list become 'Df1', 'Df2' and 'Df3'.
bind_rows does the same thing as rbind.fill however we can specify a .id column that links the row to its original data frame. Using this column, we can split this data frame into 3.
library('tidyverse')
lst(Df1, Df2, Df3) %>%
bind_rows(.id = 'df_id') %>%
split(.$df_id)
# $Df1
#   df_id color price buyer material size key
# 1   Df1     1     1     1       NA   NA  NA
# 2   Df1     1     1     1       NA   NA  NA
# 3   Df1     1     1     1       NA   NA  NA
#
# $Df2
#   df_id color price buyer material size key
# 4   Df2     1    NA    NA        1    1  NA
# 5   Df2     1    NA    NA        1    1  NA
# 6   Df2     1    NA    NA        1    1  NA
#
# $Df3
#   df_id color price buyer material size key
# 7   Df3     1     1    NA       NA   NA   1
# 8   Df3     1     1    NA       NA   NA   1
# 9   Df3     1     1    NA       NA   NA   1
The split can also be written like this if you prefer "tidy" functions.
lst(Df1, Df2, Df3) %>%
bind_rows(.id = 'df_id') %>%
group_by(df_id) %>%
nest %>%
deframe
We can create a function, add_cols, and apply this function to all data frames.
# Create a list to store all data frames
Df_list <- list(Df1, Df2, Df3)

# Get the unique column names across all data frames
Cols <- unique(unlist(lapply(Df_list, colnames)))

# Create a function to add the missing columns
add_cols <- function(df, cols) {
  new_col <- cols[!cols %in% colnames(df)]
  df[, new_col] <- NA
  return(df)
}

# Use lapply to apply the function
Df_list2 <- lapply(Df_list, add_cols, Cols)

# View the results
Df_list2
[[1]]
  color price buyer material size key
1     1     1     1       NA   NA  NA
2     1     1     1       NA   NA  NA
3     1     1     1       NA   NA  NA

[[2]]
  color material size price buyer key
1     1        1    1    NA    NA  NA
2     1        1    1    NA    NA  NA
3     1        1    1    NA    NA  NA

[[3]]
  color price key buyer material size
1     1     1   1    NA       NA   NA
2     1     1   1    NA       NA   NA
3     1     1   1    NA       NA   NA
Here's an approach in base R
Get the column names in all data frames
cols = unique(unlist(lapply(list(Df1,Df2,Df3), FUN = colnames)))
add missing columns filled with NA
lapply(list(Df1, Df2, Df3), function(x) {
  for (i in cols[!cols %in% colnames(x)]) {
    x[[i]] = NA
  }
  return(x)
})
# output
[[1]]
  color price buyer material size key
1     1     1     1       NA   NA  NA
2     1     1     1       NA   NA  NA
3     1     1     1       NA   NA  NA

[[2]]
  color material size price buyer key
1     1        1    1    NA    NA  NA
2     1        1    1    NA    NA  NA
3     1        1    1    NA    NA  NA

[[3]]
  color price key buyer material size
1     1     1   1    NA       NA   NA
2     1     1   1    NA       NA   NA
3     1     1   1    NA       NA   NA
data:
Df1 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"buyer"=c(1,1,1))
Df2 <- data.frame("color"=c(1,1,1),"material"=c(1,1,1),"size"=c(1,1,1))
Df3 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"key"=c(1,1,1))
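Note that the filled data frames above end up with different column orders. If you also want a consistent order (an extra step of my own, not part of the answer), you could index by cols at the end:

lapply(list(Df1, Df2, Df3), function(x) {
  for (i in cols[!cols %in% colnames(x)]) {
    x[[i]] = NA
  }
  x[, cols]   # return the columns in the same order for every data frame
})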

Conditional calculations across rows in R

First, I'm brand new to R and am making the switch from SAS. I have a dataset that is 1000 rows by 24 columns, where the columns are different treatments. I want to count the number of times an observation meets a criterion across the rows of my dataset, listed below.
            Gene        A       B        C         D
1         AARS_3       NA      NA 4.168365        NA
2 AASDHPPT_21936       NA      NA       NA -3.221287
3     AATF_26432       NA      NA       NA        NA
4       ABCC2_22 4.501518 3.17992       NA        NA
5    ABCC2_26620       NA      NA       NA        NA
I was trying to create column vectors that counted
1) Number of NAs
2) Number of columns <0
3) Number of columns >0
I would then use cbind to add these to my large dataset
I solved the first one with:
NA.Count <- (apply(b01,MARGIN=1,FUN=function(x) length(x[is.na(x)])))
I tried to modify this to evaluate !is.na and then count the number of times the value was less than zero, with this:
lt0 <- (apply(b01,MARGIN=1,FUN=function(x) ifelse(x[!is.na(x)],count(x[x<0]))))
which didn't work at all.
I tried a dozen ways to get dplyr mutate to work with this and did not succeed.
What I want are the last two columns below; and if you had a cleaner version of the NA.Count I did, that would also be greatly appreciated.
            Gene        A       B        C         D NA.Count lt0 gt0
1         AARS_3       NA      NA 4.168365        NA        3   0   1
2 AASDHPPT_21936       NA      NA       NA -3.221287        3   1   0
3     AATF_26432       NA      NA       NA        NA        4   0   0
4       ABCC2_22 4.501518 3.17992       NA        NA        2   0   2
5    ABCC2_26620       NA      NA       NA        NA        4   0   0
Here is one way to do it taking advantage of the fact that TRUE equals 1 in R.
# test data frame
lil_df <- data.frame(Gene = c("AAR3", "ABCDE"),
                     A = c(NA, 3),
                     B = c(2, NA),
                     C = c(-1, -2),
                     D = c(NA, NA))
# is.na
NA.count <- rowSums(is.na(lil_df[,-1]))
# less than zero
lt0 <- rowSums(lil_df[,-1]<0, na.rm = TRUE)
# more that zero
mt0 <- rowSums(lil_df[,-1]>0, na.rm = TRUE)
# cbind to data frame
larger_df <- cbind(lil_df, NA.count, lt0, mt0 )
larger_df
   Gene  A  B  C  D NA.count lt0 mt0
1  AAR3 NA  2 -1 NA        2   1   1
2 ABCDE  3 NA -2 NA        2   1   1
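Since the question mentions trying to get dplyr mutate to work, here is a minimal sketch of that route (my addition, not part of the answer above), shown on the same lil_df test data; the same pattern would apply to the full b01 data set with the gene identifiers in a Gene column:

library(dplyr)

# Count on the numeric columns only (everything except Gene), then attach.
vals <- select(lil_df, -Gene)
lil_df %>%
  mutate(NA.count = rowSums(is.na(vals)),
         lt0      = rowSums(vals < 0, na.rm = TRUE),
         mt0      = rowSums(vals > 0, na.rm = TRUE))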
