Using multiple data frames to introduce new variables into each other in R

I've got three data frames (Df1, Df2, Df3). These data frames have some variables in common, but they also each contain some unique variables. I'd like to make sure that all variables are represented in all data frames; e.g., material is present in Df2 but not Df1, so I'd like to create a variable named material in Df1 and set it to NA. Thanks for any help.
Starting point (dfs):
Df1 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"buyer"=c(1,1,1))
Df2 <- data.frame("color"=c(1,1,1),"material"=c(1,1,1),"size"=c(1,1,1))
Df3 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"key"=c(1,1,1))
Desired outcome (dfs):
Df1 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"material"=c(NA,NA,NA),"buyer"=c(1,1,1),"size"=c(NA,NA,NA),"key"=c(NA,NA,NA))
Df2 <- data.frame("color"=c(1,1,1),"price"=c(NA,NA,NA),"material"=c(1,1,1),"buyer"=c(NA,NA,NA),"size"=c(1,1,1),"key"=c(NA,NA,NA))
Df3 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"material"=c(NA,NA,NA),"buyer"=c(NA,NA,NA),"size"=c(NA,NA,NA),"key"=c(1,1,1))
My code so far is below. I'm trying to compare the variable names in an individual data frame with the variable names across all three data frames, and use the ones not present in the individual data frame to generate the new variables set to NA. But I end up with "Error in VarDf1[, NewVariables] <- NA : incorrect number of subscripts on matrix" and don't know how to fix it.
dfs <- list(Df1, Df2, Df3)
numdfs <- length(dfs)
for (i in 1:numdfs) {
  VarDf1 <- as.vector(names(Df1))
  VarDf2 <- as.vector(names(Df2))
  VarDf3 <- as.vector(names(Df3))
  VarAll <- c(VarDf1, VarDf2, VarDf3)
  NewVariables <- as.vector(setdiff(VarAll, dfs[i]))
  dfs[i][, NewVariables] <- NA
}
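As an aside, a minimal sketch of how the loop itself could be repaired (the key fixes: index the list with dfs[[i]] rather than dfs[i], and take the setdiff against names(dfs[[i]]) rather than against the data frame itself):
dfs <- list(Df1 = Df1, Df2 = Df2, Df3 = Df3)
VarAll <- unique(c(names(Df1), names(Df2), names(Df3)))
for (i in seq_along(dfs)) {
  NewVariables <- setdiff(VarAll, names(dfs[[i]]))
  if (length(NewVariables) > 0) {
    dfs[[i]][, NewVariables] <- NA   # new columns are created and filled with NA
  }
}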

rbind.fill from the plyr package does what you expect while also combining everything into a big data.frame:
plyr::rbind.fill(Df1,Df2,Df3)
color price buyer material size key
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
4 1 NA NA 1 1 NA
5 1 NA NA 1 1 NA
6 1 NA NA 1 1 NA
7 1 1 NA NA NA 1
8 1 1 NA NA NA 1
9 1 1 NA NA NA 1
You can then subset the data back out into new data frames.
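For example, a sketch of splitting the combined result back into three data frames (assuming the original row counts are still available):
combined <- plyr::rbind.fill(Df1, Df2, Df3)
# label each row with the data frame it came from, then split on that label
source_id <- rep(c("Df1", "Df2", "Df3"), times = c(nrow(Df1), nrow(Df2), nrow(Df3)))
split(combined, source_id)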

This method is similar to rbind.fill, but it will let you separate it back into 3 data frames at the end.
We use tibble::lst rather than list so that the names of the list become 'Df1', 'Df2' and 'Df3'.
bind_rows does the same thing as rbind.fill, but it lets us specify a .id column that links each row to its original data frame. Using this column, we can split the combined data frame back into 3.
library('tidyverse')
lst(Df1, Df2, Df3) %>%
  bind_rows(.id = 'df_id') %>%
  split(.$df_id)
# $Df1
# df_id color price buyer material size key
# 1 Df1 1 1 1 NA NA NA
# 2 Df1 1 1 1 NA NA NA
# 3 Df1 1 1 1 NA NA NA
#
# $Df2
# df_id color price buyer material size key
# 4 Df2 1 NA NA 1 1 NA
# 5 Df2 1 NA NA 1 1 NA
# 6 Df2 1 NA NA 1 1 NA
#
# $Df3
# df_id color price buyer material size key
# 7 Df3 1 1 NA NA NA 1
# 8 Df3 1 1 NA NA NA 1
# 9 Df3 1 1 NA NA NA 1
The split can also be written like this if you prefer "tidy" functions.
lst(Df1, Df2, Df3) %>%
  bind_rows(.id = 'df_id') %>%
  group_by(df_id) %>%
  nest() %>%
  deframe()
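With the split(.$df_id) version above, the helper df_id column remains in each data frame (as seen in the output). If it shouldn't, it can be dropped after the split; a small sketch using purrr::map and dplyr::select, both loaded by library(tidyverse):
lst(Df1, Df2, Df3) %>%
  bind_rows(.id = 'df_id') %>%
  split(.$df_id) %>%
  map(~ select(.x, -df_id))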

We can create a function, add_cols, and apply this function to all data frames.
# Create a list to store all data frames
Df_list <- list(Df1, Df2, Df3)
# Get the unique column names across all data frames
Cols <- unique(unlist(lapply(Df_list, colnames)))
# Create a function to add the missing columns, filled with NA
add_cols <- function(df, cols) {
  new_col <- cols[!cols %in% colnames(df)]
  df[, new_col] <- NA
  return(df)
}
# Use lapply to apply the function
Df_list2 <- lapply(Df_list, add_cols, Cols)
# View the results
Df_list2
[[1]]
color price buyer material size key
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
[[2]]
color material size price buyer key
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
[[3]]
color price key buyer material size
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
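Note that the padded data frames above keep their original column ordering. If a common column order is also wanted, a small follow-up sketch:
# Reorder every padded data frame to the shared column order in Cols
Df_list3 <- lapply(Df_list2, function(df) df[, Cols])
Df_list3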

Here's an approach in base R.
Get the column names used across all data frames:
cols = unique(unlist(lapply(list(Df1, Df2, Df3), FUN = colnames)))
Add the missing columns, filled with NA:
lapply(list(Df1, Df2, Df3), function(x) {
  for (i in cols[!cols %in% colnames(x)]) {
    x[[i]] = NA
  }
  return(x)
})
#output
[[1]]
color price buyer material size key
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
[[2]]
color material size price buyer key
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
[[3]]
color price key buyer material size
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
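If the goal is to end up with updated Df1, Df2 and Df3 objects rather than a list, one possible follow-up is to assign the result, name it, and write it back with list2env (a sketch; overwriting the objects in the global environment is a choice made here, not part of the answer above):
padded <- lapply(list(Df1, Df2, Df3), function(x) {
  for (i in cols[!cols %in% colnames(x)]) {
    x[[i]] <- NA
  }
  x
})
names(padded) <- c("Df1", "Df2", "Df3")
# overwrite the original objects with their padded versions
list2env(padded, envir = .GlobalEnv)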
data:
Df1 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"buyer"=c(1,1,1))
Df2 <- data.frame("color"=c(1,1,1),"material"=c(1,1,1),"size"=c(1,1,1))
Df3 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"key"=c(1,1,1))

Related

Remove NA's by keeping all the populated cells in new columns using R

How can I drop all the elements with missing values, but instead of deleting entire columns, create columns with just the populated cells? For example, going from this
A B C D
1 NA 2 NA
NA 3 NA 4
NA 5 6 NA
(data1) to a data set containing only the populated cells, like this:
AB BB
1 2
3 4
5 6
Below I have created a small working example to test a solution.
> # Create example dataset (data1)
> data1 <- data.frame(matrix(c(1,NA,2,NA,NA,3,NA,4,NA,5,6,NA), nrow = 3, byrow = T))
> colnames(data1) <- c("A","B","C","D")
> print(data1)
A B C D
1 NA 2 NA
NA 3 NA 4
NA 5 6 NA
> # Create new dataset?
Here is a potential solution using akrun's/Valentin's answer from this question.
Let's say the data is
data1 <- data.frame(matrix(c(1,NA,2,NA,NA,3,NA,4,NA,5,NA,NA),nrow = 3, byrow = T))
> data1
X1 X2 X3 X4
1 1 NA 2 NA
2 NA 3 NA 4
3 NA 5 NA NA
Then use
df1 <- t(sapply(apply(data1, 1, function(x) x[!is.na(x)]), "length<-", max(lengths(lapply(data1, function(x) x[!is.na(x)])))))
to arrive at
> df1
X1 X3
[1,] 1 2
[2,] 3 4
[3,] 5 NA
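For readability, the same idea can be broken into steps (a sketch; the intermediate names are chosen here for clarity, and the padding length is taken as the longest row, which matches the output above for this data):
# 1. collect the non-NA values of each row as a list
non_na_by_row <- apply(data1, 1, function(x) x[!is.na(x)])
# 2. the number of columns needed is the length of the longest such run
max_len <- max(lengths(non_na_by_row))
# 3. pad every row to that length with NA and bind the rows back together
df1 <- t(sapply(non_na_by_row, `length<-`, max_len))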

How to find whether at least one column satisfies a certain condition, with NAs

I have a dataframe with multiple columns: I need to identify those rows in which there is at least one outlier among some of the columns, but I do not know how to deal with NAs.
An example of dataframe (different from mine):
# X atq ME.BE.crsp X2
# 1 10 0.5 4
# NA 2 1.3 5
# 3 NA 5 2
# NA NA NA NA
# 2 4 NA 3
I'm doing the following:
data = data %>%
  mutate(outlier = as.numeric(atq > quantile(atq, 0.99, na.rm = TRUE) |
                                atq < quantile(atq, 0.01, na.rm = TRUE) |
                                ME.BE.crsp > quantile(ME.BE.crsp, 0.99, na.rm = TRUE) |
                                ME.BE.crsp < quantile(ME.BE.crsp, 0.01, na.rm = TRUE)))
My expected result is (I'm making up the outliers, the point is about NAs):
# X atq ME.BE.crsp X2 outlier
# 1 10 0.5 4 1
# NA 2 1.3 5 0
# 3 NA 5 2 0
# NA NA NA NA NA
# 2 4 NA 3 1
What I get instead is:
# X atq ME.BE.crsp X2 outlier
# 1 10 0.5 4 1
# NA 2 1.3 5 0
# 3 NA 5 2 NA
# NA NA NA NA NA
# 2 4 NA 3 NA
So it seems that as soon as as.numeric finds an NA in either data$atq or data$ME.BE.crsp, it assigns NA to data$outlier, while I would like it to consider the non-NA value and assign 0 or 1 based on that one.
Any suggestions? Thanks!
If the result should be NA only when both 'atq' and 'ME.BE.crsp' are NA, then add that condition with case_when:
library(dplyr)
data %>%
  mutate(outlier = case_when(
    is.na(atq) & is.na(ME.BE.crsp) ~ NA_real_,
    TRUE ~ as.numeric(
      (atq > quantile(atq, 0.99, na.rm = TRUE) & !is.na(atq)) |
        (atq < quantile(atq, 0.01, na.rm = TRUE) & !is.na(atq)) |
        (ME.BE.crsp > quantile(ME.BE.crsp, 0.99, na.rm = TRUE) & !is.na(ME.BE.crsp)) |
        (ME.BE.crsp < quantile(ME.BE.crsp, 0.01, na.rm = TRUE) & !is.na(ME.BE.crsp))
    )
  ))
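A different way to get the same behaviour (just a sketch, not part of the answer above): compute per-column outlier flags with NA comparisons treated as FALSE, and only afterwards mark the rows where both columns are NA.
# flag values outside the 1st/99th percentile; NA comparisons become FALSE
flag_outlier <- function(x) {
  out <- x > quantile(x, 0.99, na.rm = TRUE) | x < quantile(x, 0.01, na.rm = TRUE)
  out & !is.na(out)
}
data$outlier <- as.numeric(flag_outlier(data$atq) | flag_outlier(data$ME.BE.crsp))
data$outlier[is.na(data$atq) & is.na(data$ME.BE.crsp)] <- NA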

Conditional calculations across rows in R

First, I'm brand new to R and am making the switch from SAS. I have a dataset that is 1000 rows by 24 columns, where the columns are different treatments. I want to count the number of times an observation meets a criterion across the rows of my dataset, listed below.
Gene A B C D
1 AARS_3 NA NA 4.168365 NA
2 AASDHPPT_21936 NA NA NA -3.221287
3 AATF_26432 NA NA NA NA
4 ABCC2_22 4.501518 3.17992 NA NA
5 ABCC2_26620 NA NA NA NA
I was trying to create column vectors that counted
1) Number of NAs
2) Number of columns <0
3) Number of columns >0
I would then use cbind to add these to my large dataset
I solved the first one with:
NA.Count <- (apply(b01,MARGIN=1,FUN=function(x) length(x[is.na(x)])))
I tried to modify this to evaluate !is.na and then count the number of times the value was less than zero with this:
lt0 <- (apply(b01,MARGIN=1,FUN=function(x) ifelse(x[!is.na(x)],count(x[x<0]))))
which didn't work at all.
I tried a dozen ways to get dplyr mutate to work with this and did not succeed.
What I want are the last two columns below; and if you had a cleaner version of the NA.Count I did, that would also be greatly appreciated.
Gene A B C D NA.Count lt0 gt0
1 AARS_3 NA NA 4.168365 NA 3 0 1
2 AASDHPPT_21936 NA NA NA -3.221287 3 1 0
3 AATF_26432 NA NA NA NA 4 0 0
4 ABCC2_22 4.501518 3.17992 NA NA 2 0 2
5 ABCC2_26620 NA NA NA NA 4 0 0
Here is one way to do it taking advantage of the fact that TRUE equals 1 in R.
# test data frame
lil_df <- data.frame(Gene = c("AAR3", "ABCDE"),
                     A = c(NA, 3),
                     B = c(2, NA),
                     C = c(-1, -2),
                     D = c(NA, NA))
# is.na
NA.count <- rowSums(is.na(lil_df[,-1]))
# less than zero
lt0 <- rowSums(lil_df[,-1] < 0, na.rm = TRUE)
# more than zero
mt0 <- rowSums(lil_df[,-1] > 0, na.rm = TRUE)
# cbind to data frame
larger_df <- cbind(lil_df, NA.count, lt0, mt0)
larger_df
larger_df
Gene A B C D NA.count lt0 mt0
1 AAR3 NA 2 -1 NA 2 1 1
2 ABCDE 3 NA -2 NA 2 1 1

Retrieve index of newly added row - for loop in R

I am trying to retrieve the index of a newly-added row, added via a for loop.
Starting from the beginning, I have a list of matrices of p-values, each with a variable number of rows and columns. This is because not all groups have an adequate number of treated individuals to run t-tests. The following is what prints to the console when I access this sample list:
$Group1
Normal Treatment 1 Treatment 2
Treatment 1 1 NA NA
Treatment 2 1 1 NA
Treatment 3 1 1 1
$Group2
Normal Treatment 2
Treatment 2 1 NA
Treatment 4 1 1
I would like every group to have the same number of rows and columns, in the correct order, with the missing values just filled in with NAs. This is a sample of what I would like:
$Group1
Normal Treatment 1 Treatment 2 Treatment 3
Treatment 1 1 NA NA NA
Treatment 2 1 1 NA NA
Treatment 3 1 1 1 NA
Treatment 4 NA NA NA NA
$Group2
Normal Treatment 1 Treatment 2 Treatment 3
Treatment 1 NA NA NA NA
Treatment 2 1 NA NA NA
Treatment 3 NA NA NA NA
Treatment 4 1 1 NA NA
Here is the code I have so far:
fix.results.row <- function(x, factors) {
  results.matrix <- x
  num <- 1
  for (i in factors) {
    if (!i %in% rownames(results.matrix)) {
      results.matrix <- rbind(results.matrix, NA)
      rownames(results.matrix)[num] <- i
    }
    num <- num + 1
  }
  rownames(results.matrix) <- results.matrix[rownames(factors), , drop = FALSE]
  return(results.matrix)
}
In the function above, x would be my list of matrices, and factors would be a list of all the factors in the order I want them. I have a similar function for adding columns.
My problem, as I see it, is in Group 2. If it sees that I'm missing Treatment 1, it will replace the rowname Treatment 2 with the rowname Treatment 1, so the data for Treatment 2 is now mislabeled Treatment 1. Then it reorders the variables the way I want them, but the data are already mislabeled!
If I could access the index of the newly-added row, which changes from group to group, then I could just change that specific row name. Any suggestions? Please let me know if there's any more information I need to provide. I tried to cover everything but I'm not sure if there's anything else you all need.
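On the narrow question of the new row's index: rbind() appends at the bottom, so inside the loop the freshly added row is always row nrow(results.matrix). A minimal sketch of the function with that change (assuming factors is a character vector of the desired row names; the last line reorders the rows into that order):
fix.results.row <- function(x, factors) {
  results.matrix <- x
  for (i in factors) {
    if (!i %in% rownames(results.matrix)) {
      results.matrix <- rbind(results.matrix, NA)
      # the row just appended is always the last one
      rownames(results.matrix)[nrow(results.matrix)] <- i
    }
  }
  # reorder the rows into the order given by factors
  results.matrix[factors, , drop = FALSE]
}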
This isn't very elegant, but it might work better than using two functions to fill in the rows and columns separately.
Here, x is a list of all your matrices; factors is an optional list of desired row and column names.
fix_rc <- function(x, factors) {
  f <- function(x) factor(ul <- unique(unlist(x)), levels = sort(ul))
  if (missing(factors))
    factors <- list(f(sapply(x, rownames)),
                    f(sapply(x, colnames)))
  template <- matrix(NA, length(factors[[1]]), length(factors[[2]]),
                     dimnames = factors)
  lapply(x, function(xx) {
    ## original
    # xx <- rbind(xx, template[, colnames(xx)])
    # xx <- cbind(xx, template[rownames(xx), ])
    # xx[rownames(template), colnames(template)]
    ## better: http://stackoverflow.com/questions/31050787/r-how-to-match-join-2-matrices-of-different-dimensions-nrow-ncol/31051218#31051218
    xx <- as.data.frame.table(xx)
    template[as.matrix(xx[, 1:2])] <- xx$Freq
    template
  })
}
Here is the data I am using
l <- list(Group1 = matrix(c(1,1,1,NA,1,1,NA,NA,1), 3, 3,
                          dimnames = list(paste('Treatment', 1:3),
                                          c('Normal', paste('Treatment', 1:2)))),
          Group2 = matrix(c(1,1,NA,1), 2, 2,
                          dimnames = list(paste('Treatment', c(2,4)),
                                          c('Normal', 'Treatment 2'))))
# $Group1
# Normal Treatment 1 Treatment 2
# Treatment 1 1 NA NA
# Treatment 2 1 1 NA
# Treatment 3 1 1 1
#
# $Group2
# Normal Treatment 2
# Treatment 2 1 NA
# Treatment 4 1 1
And you can use it like this. Note that when you don't supply factors, the function will get all the row and column names from your list of matrices
fix_rc(l)
# $Group1
# Normal Treatment 1 Treatment 2
# Treatment 1 1 NA NA
# Treatment 2 1 1 NA
# Treatment 3 1 1 1
# Treatment 4 NA NA NA
#
# $Group2
# Normal Treatment 1 Treatment 2
# Treatment 1 NA NA NA
# Treatment 2 1 NA NA
# Treatment 3 NA NA NA
# Treatment 4 1 NA 1
I'm not sure where Treatment 3 in the columns of your desired output came from, but you can get that here if you want, like so:
fix_rc(l, factors = list(paste('Treatment', 1:6),
                         c('Normal', paste('Treatment', 1:3))))
# $Group1
# Normal Treatment 1 Treatment 2 Treatment 3
# Treatment 1 1 NA NA NA
# Treatment 2 1 1 NA NA
# Treatment 3 1 1 1 NA
# Treatment 4 NA NA NA NA
# Treatment 5 NA NA NA NA
# Treatment 6 NA NA NA NA
#
# $Group2
# Normal Treatment 1 Treatment 2 Treatment 3
# Treatment 1 NA NA NA NA
# Treatment 2 1 NA NA NA
# Treatment 3 NA NA NA NA
# Treatment 4 1 NA 1 NA
# Treatment 5 NA NA NA NA
# Treatment 6 NA NA NA NA
Not a complete solution, but if you used data frames, wouldn't it be easier to get there?
df1 <- data.frame(normal = c(1, 1, 1)
                  , treatment1 = c(NA, 1, 1)
                  , treatment2 = c(NA, NA, 1)
                  , row.names = c("Treatment1", "Treatment2", "Treatment3")
)
df2 <- data.frame(normal = c(1, 1)
                  , treatment2 = c(NA, 1)
                  , row.names = c("Treatment2", "Treatment4")
)
df1$names <- rownames(df1)
df2$names <- rownames(df2)
df3 <- merge(df1,df2, by="names", all=TRUE)
df3
names normal.x treatment1 treatment2.x normal.y treatment2.y
1 Treatment1 1 NA NA NA NA
2 Treatment2 1 1 NA 1 NA
3 Treatment3 1 1 1 NA NA
4 Treatment4 NA NA NA 1 1
Now all you have to do is combine the columns based on their names.
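For instance, a base-R sketch of combining the duplicated columns (assuming the .x/.y suffixes produced by merge() above):
# take the .x value where present, otherwise fall back to the .y value
df3$normal <- ifelse(is.na(df3$normal.x), df3$normal.y, df3$normal.x)
df3$treatment2 <- ifelse(is.na(df3$treatment2.x), df3$treatment2.y, df3$treatment2.x)
df3[, c("names", "normal", "treatment1", "treatment2")]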

R - Perform operations on columns and place the result in a different column, with the operation specified by the output column's name

I have a dataframe with 3 columns of data - L1, L2, L3 - and empty columns labeled L1+L2, L2+L3, L3+L1, L1-L2, etc., i.e. combinations of column operations. Is there a way to check the column name and perform the necessary operation to fill that new column with data?
I am thinking:
- use match to find the appropriate original columns, and use a for loop to iterate over all of the columns in this search?
So if the column I am attempting to fill is L1+L2, I would have something like:
apply(dataframe[, c(i, j)], 1, sum)
It seems strange that you would store your operations in your column names, but I suppose it is possible to achieve:
As always, sample data helps.
## Creating some sample data
mydf <- setNames(data.frame(matrix(1:9, ncol = 3)),
                 c("L1", "L2", "L3"))
## The operation you want to do...
morecols <- c(
  combn(names(mydf), 2, FUN = function(x) paste(x, collapse = "+")),
  combn(names(mydf), 2, FUN = function(x) paste(x, collapse = "-"))
)
## THE FINAL SAMPLE DATA
mydf[, morecols] <- NA
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 NA NA NA NA NA NA
# 2 2 5 8 NA NA NA NA NA NA
# 3 3 6 9 NA NA NA NA NA NA
One solution could be to use eval(parse(...)) within lapply to perform the calculations and store them to the relevant column.
mydf[morecols] <- lapply(names(mydf[morecols]), function(x) {
  with(mydf, eval(parse(text = x)))
})
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 5 8 11 -3 -6 -3
# 2 2 5 8 7 10 13 -3 -6 -3
# 3 3 6 9 9 12 15 -3 -6 -3
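If you would rather avoid eval(parse(...)) altogether, the operation can also be recovered from the column name itself (a sketch, assuming every generated name has the form col1+col2 or col1-col2 with exactly one operator):
mydf[morecols] <- lapply(morecols, function(nm) {
  op <- if (grepl("+", nm, fixed = TRUE)) `+` else `-`   # pick the operator
  parts <- strsplit(nm, "[+-]")[[1]]                     # the two column names
  op(mydf[[parts[1]]], mydf[[parts[2]]])
})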
dfrm <- data.frame(L1 = 1:3, L2 = 1:3, L3 = 3 + 1, `L1+L2` = NA,
                   `L2+L3` = NA, `L3+L1` = NA, `L1-L2` = NA,
                   check.names = FALSE)
dfrm
#------------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 NA NA NA NA
2 2 2 4 NA NA NA NA
3 3 3 4 NA NA NA NA
#-------------
dfrm[, 4:7] <- lapply(names(dfrm[, 4:7]),
                      function(nam) eval(parse(text = nam), envir = dfrm))
dfrm
#-----------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 2 5 5 0
2 2 2 4 4 6 6 0
3 3 3 4 6 7 7 0
I chose to use eval(parse(text = ...)) rather than with, since the use of with is specifically cautioned against in its help page. I'm not sure I can explain why the eval(..., envir = dfrm) form should be any safer, though.
