How to replace values in dataframe based on a second dataframe in R?

I have a dataframe df1 with multiple columns, each column representing a species name (sp1, sp2, sp3, ...).
df1
sp1 sp2 sp3 sp4
NA NA r1 r1
NA NA 1 3
NA 5 NA NA
m4 NA NA m2
I would like to replace each value in df1 with a value based on a second dataframe, df2: each value in df1 should be matched against df2$scale_nr and replaced by the corresponding df2$percentage. In other words, the result should contain the values from df2$percentage in place of the original codes in df1.
df2
scale_nr percentage
r1 1
p1 1
a1 1
m1 1
r2 2
p2 2
a2 2
m2 2
1 10
2 20
3 30
4 40
...
Then after replacement df1 should look like
df1
sp1 sp2 sp3 sp4
NA NA 1 1
NA NA 10 30
NA 50 NA NA
4 NA NA 2
I tried this:
df2$percentage[match(df1$sp1, df2$scale_nr)] # this one works for one column
This works for a single column, and I know I should be able to apply it across all columns easily, but somehow I can't figure it out.
I know I could do it by 'hand', like
df[df == 'Old Value'] <- 'New value'
but this seems highly inefficient because I have 40 different values that need to be replaced.
Can someone please help me with a solution for this?

You can use lapply on the frame to iterate the same thing over multiple columns.
df1[] <- lapply(df1, function(z) df2$percentage[match(z, df2$scale_nr)])
df1
# sp1 sp2 sp3 sp4
# 1 NA NA 1 1
# 2 NA NA 10 30
# 3 NA NA NA NA
# 4 NA NA NA 2
The missing values are likely because of the truncated df2 in the sample data.
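As a quick sketch (the two extra rows below are hypothetical, using the mappings implied by the expected output above, i.e. 5 -> 50 and m4 -> 4), extending df2 and rerunning the same call on the original df1 would reproduce that expected result:
# hypothetical rows for the codes missing from the truncated sample
# (assumes df2$scale_nr is stored as character, the default in R >= 4.0)
df2_full <- rbind(df2, data.frame(scale_nr = c("5", "m4"), percentage = c(50, 4)))
df1[] <- lapply(df1, function(z) df2_full$percentage[match(z, df2_full$scale_nr)])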
If you want the option to preserve the previous value if not found in df2, then you can modify that slightly:
df1[] <- lapply(df1, function(z) {
  newval <- df2$percentage[match(z, df2$scale_nr)]
  ifelse(is.na(newval), z, newval)
})
df1
# sp1 sp2 sp3 sp4
# 1 <NA> NA 1 1
# 2 <NA> NA 10 30
# 3 <NA> 5 <NA> <NA>
# 4 m4 NA <NA> 2
FYI, the reassignment into df1[] <- is important, in contrast with df1 <-. The difference is that lapply is going to return a list, so if you use df1 <- then df1 will no longer be a data.frame. Using df1[] <-, you are telling it to replace the contents of the columns without changing the overall class of df1.
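A quick way to see the difference:
tmp <- lapply(df1, function(z) df2$percentage[match(z, df2$scale_nr)])
class(tmp)    # "list" -- plain df1 <- tmp would leave you with a list
df1[] <- tmp  # contents replaced, class kept
class(df1)    # "data.frame"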
If you need to do this on only a subset of columns, that's easy:
df1[1:3] <- lapply(df1[1:3], ...)

Related

Convert a single column into multiple columns based on delimiter in R

I have the following dataframe:
ID Parts
-- -----
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
And I would like the convert the Parts column into multiple columns by the : delimiter. so it should look like:
ID A B X2 J4 C D G4 X6 ........
-- - - -- -- - - -- --
1 A B na na na na na na
2 na na X2 na na na na na
3 na na na J4 na na na na
4 A na na na C D G4 X6
where there I would not know the number of potential columns in advance.
I have met my match on this one: I can strsplit() by the delimiter, but only with a fixed number of entries in the Parts column.
You can use a combination of tidyr::separate, tidyr::pivot_wider, and tidyr::pivot_longer. First, use stringr::str_count to determine the number of columns to split Parts into -- not the number of unique values (see "How it works" below):
library(dplyr)
library(tidyr)
library(stringr)
n_col <- max(stringr::str_count(df$Parts, ":")) + 1
df %>%
  tidyr::separate(Parts, into = paste0("col", 1:n_col), sep = ":") %>%
  dplyr::mutate(across(everything(), ~dplyr::na_if(., ""))) %>%
  tidyr::pivot_longer(-ID) %>%
  dplyr::select(-name) %>%
  tidyr::drop_na() %>%
  tidyr::pivot_wider(id_cols = ID,
                     names_from = value)
ID A B X2 J4 C D G4 X6
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 A B NA NA NA NA NA NA
2 2 NA NA X2 NA NA NA NA NA
3 3 NA NA NA J4 NA NA NA NA
4 4 A NA NA NA C D G4 X6
How it works
You do not need to know the number of unique values with this code -- the pivots take care of that. What you do need to know is how many new columns Parts will be split into with separate. That's easy to do by counting the number of delimiters and adding one with str_count. This way you have the appropriate number of columns to separate Parts into by your delimiter.
This is because pivot_longer will create a two column dataframe with repeated ID and a column with the delimited values of Parts -- an ID, Parts pairing. Then when you use pivot_wider the columns are automatically created for each unique value of Parts and the value is retained within the column. This function automatically fills with NA where an ID and Parts combination is not found.
Try running this pipe by pipe to better understand if need be.
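For example, stopping the pipe just before pivot_wider shows the intermediate ID/value pairs (a sketch; na_if() is restricted to character columns here so the integer ID column is left alone, and df and n_col are as defined in this answer):
df %>%
  tidyr::separate(Parts, into = paste0("col", 1:n_col), sep = ":") %>%
  dplyr::mutate(across(where(is.character), ~dplyr::na_if(., ""))) %>%
  tidyr::pivot_longer(-ID) %>%
  dplyr::select(-name) %>%
  tidyr::drop_na()
# one row per (ID, value) pair, e.g. 1-A, 1-B, 2-X2, 3-J4, 4-A, 4-C, ...
# pivot_wider then spreads the value column into one column per unique value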
Data
lines <- "
ID Parts
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
"
df <- read.table(text = lines, header = T)
Could the separate function from tidyr be what you are looking for?
https://tidyr.tidyverse.org/reference/separate.html
It might require some fancy regex implementation, but could potentially work.

Using multiple data frames to introduce new variables into each other R

I've got three data frames (Df1, Df2, Df3). These data frames have some variables in common, but they also each contain some unique variables. I'd like to make sure that all variables are represented in all data frames, e.g. material is present in Df2 but not Df1, so I'd like to create a variable named material in Df1 and set that variable to NA. Thanks for any help.
Starting point (dfs):
Df1 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"buyer"=c(1,1,1))
Df2 <- data.frame("color"=c(1,1,1),"material"=c(1,1,1),"size"=c(1,1,1))
Df3 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"key"=c(1,1,1))
Desired outcome (dfs):
Df1 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"material"=c(NA,NA,NA),"buyer"=c(1,1,1),"size"=c(NA,NA,NA),"key"=c(NA,NA,NA))
Df2 <- data.frame("color"=c(1,1,1),"price"=c(NA,NA,NA),"material"=c(1,1,1),"buyer"=c(NA,NA,NA),"size"=c(1,1,1),"key"=c(NA,NA,NA))
Df3 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"material"=c(NA,NA,NA),"buyer"=c(NA,NA,NA),"size"=c(NA,NA,NA),"key"=c(1,1,1))
My code so far is below. (I'm trying to compare the variable names in an individual data frame with the variable names in all three data frames, and use the ones not present in the individual data frame to generate the new variables set to NA. But I end up with: Error in VarDf1[, NewVariables] <- NA : incorrect number of subscripts on matrix.) I don't know how to fix it.
dfs <- list(Df1,Df2,Df3)
numdfs <- length(dfs)
for (i in 1:numdfs)
{
VarDf1 <- as.vector(names(Df1))
VarDf2 <- as.vector(names(Df2))
VarDf3 <- as.vector(names(Df3))
VarAll <- c(VarDf1, VarDf2,VarDf3)
NewVariables <- as.vector(setdiff(VarAll, dfs[i]))
dfs[i][ , NewVariables] <- NA
}
rbind.fill from the plyr package does what you expect while also combining everything into a big data.frame:
plyr::rbind.fill(Df1,Df2,Df3)
color price buyer material size key
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
4 1 NA NA 1 1 NA
5 1 NA NA 1 1 NA
6 1 NA NA 1 1 NA
7 1 1 NA NA NA 1
8 1 1 NA NA NA 1
9 1 1 NA NA NA 1
You can subset the data back out into new data.frames.
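For example, one way is to split the combined frame by row position (a sketch; the row ranges assume three rows per original frame, as in the example, and the *_new names are just illustrative):
combined <- plyr::rbind.fill(Df1, Df2, Df3)
Df1_new <- combined[1:3, ]
Df2_new <- combined[4:6, ]
Df3_new <- combined[7:9, ]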
This method is similar to rbind.fill, but it will let you separate it back into 3 data frames at the end.
We use tibble::lst rather than list so that the names of the list become 'Df1', 'Df2' and 'Df3'.
bind_rows does the same thing as rbind.fill; however, we can specify a .id column that links each row to its original data frame. Using this column, we can split the combined data frame back into 3.
library('tidyverse')
lst(Df1, Df2, Df3) %>%
  bind_rows(.id = 'df_id') %>%
  split(.$df_id)
# $Df1
# df_id color price buyer material size key
# 1 Df1 1 1 1 NA NA NA
# 2 Df1 1 1 1 NA NA NA
# 3 Df1 1 1 1 NA NA NA
#
# $Df2
# df_id color price buyer material size key
# 4 Df2 1 NA NA 1 1 NA
# 5 Df2 1 NA NA 1 1 NA
# 6 Df2 1 NA NA 1 1 NA
#
# $Df3
# df_id color price buyer material size key
# 7 Df3 1 1 NA NA NA 1
# 8 Df3 1 1 NA NA NA 1
# 9 Df3 1 1 NA NA NA 1
The split can also be written like this if you prefer "tidy" functions.
lst(Df1, Df2, Df3) %>%
  bind_rows(.id = 'df_id') %>%
  group_by(df_id) %>%
  nest() %>%
  deframe()
We can create a function, add_cols, and apply this function to all data frames.
# Create a list to store all data frames
Df_list <- list(Df1, Df2, Df3)
# Get the unique column names across all data frames
Cols <- unique(unlist(lapply(Df_list, colnames)))
# Create a function to add columns
add_cols <- function(df, cols){
  new_col <- cols[!cols %in% colnames(df)]
  df[, new_col] <- NA
  return(df)
}
# Use lapply to apply the function
Df_list2 <- lapply(Df_list, add_cols, Cols)
# View the results
Df_list2
[[1]]
color price buyer material size key
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
[[2]]
color material size price buyer key
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
[[3]]
color price key buyer material size
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
Here's an approach in base R
Get the column names in all data frames
cols = unique(unlist(lapply(list(Df1,Df2,Df3), FUN = colnames)))
add missing columns filled with NA
lapply(list(Df1, Df2, Df3), function(x){
  for (i in cols[!cols %in% colnames(x)]){
    x[[i]] = NA
  }
  return(x)
})
#output
[[1]]
color price buyer material size key
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
[[2]]
color material size price buyer key
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
[[3]]
color price key buyer material size
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 NA NA NA
data:
Df1 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"buyer"=c(1,1,1))
Df2 <- data.frame("color"=c(1,1,1),"material"=c(1,1,1),"size"=c(1,1,1))
Df3 <- data.frame("color"=c(1,1,1),"price"=c(1,1,1),"key"=c(1,1,1))

Conditional calculations across rows in R

First, I'm brand new to R and am making the switch from SAS. I have a dataset that is 1000 rows by 24 columns, where the columns are different treatments. I want to count the number of times an observation meets a criterion across the rows of my dataset, listed below.
Gene A B C D
1 AARS_3 NA NA 4.168365 NA
2 AASDHPPT_21936 NA NA NA -3.221287
3 AATF_26432 NA NA NA NA
4 ABCC2_22 4.501518 3.17992 NA NA
5 ABCC2_26620 NA NA NA NA
I was trying to create column vectors that counted
1) Number of NAs
2) Number of columns <0
3) Number of columns >0
I would then use cbind to add these to my large dataset
I solved the first one with :
NA.Count <- (apply(b01,MARGIN=1,FUN=function(x) length(x[is.na(x)])))
I tried to modify this to evaluate !is.na and then count the number of times the value was less than zero with this:
lt0 <- (apply(b01,MARGIN=1,FUN=function(x) ifelse(x[!is.na(x)],count(x[x<0]))))
which didn't work at all.
I tried a dozen ways to get dplyr mutate to work with this and did not succeed.
What I want are the last two columns below; and if you had a cleaner version of the NA.Count I did, that would also be greatly appreciated.
Gene A B C D NA.Count lt0 gt0
1 AARS_3 NA NA 4.168365 NA 3 0 1
2 AASDHPPT_21936 NA NA NA -3.221287 3 1 0
3 AATF_26432 NA NA NA NA 4 0 0
4 ABCC2_22 4.501518 3.17992 NA NA 2 0 2
5 ABCC2_26620 NA NA NA NA 4 0 0
Here is one way to do it taking advantage of the fact that TRUE equals 1 in R.
# test data frame
lil_df <- data.frame(Gene = c("AAR3", "ABCDE"),
A = c(NA, 3),
B = c(2, NA),
C = c(-1, -2),
D = c(NA, NA))
# is.na
NA.count <- rowSums(is.na(lil_df[,-1]))
# less than zero
lt0 <- rowSums(lil_df[,-1]<0, na.rm = TRUE)
# more than zero
mt0 <- rowSums(lil_df[,-1]>0, na.rm = TRUE)
# cbind to data frame
larger_df <- cbind(lil_df, NA.count, lt0, mt0 )
larger_df
Gene A B C D NA.count lt0 mt0
1 AAR3 NA 2 -1 NA 2 1 1
2 ABCDE 3 NA -2 NA 2 1 1
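Since you mentioned trying dplyr::mutate, here is a hedged sketch of the same idea with dplyr (assumes dplyr >= 1.0 for across(), and uses the lil_df above):
library(dplyr)
lil_df %>%
  mutate(NA.count = rowSums(is.na(across(A:D))),          # rowSums() works on the tibble returned by across()
         lt0 = rowSums(across(A:D) < 0, na.rm = TRUE),
         mt0 = rowSums(across(A:D) > 0, na.rm = TRUE))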

How does one merge dataframes by row name without adding a "Row.names" column?

If I have two data frames, such as:
df1 = data.frame(x=1:3,y=1:3,row.names=c('r1','r2','r3'))
df2 = data.frame(z=5:7,row.names=c('r5','r6','r7'))
(
R> df1
x y
r1 1 1
r2 2 2
r3 3 3
R> df2
z
r5 5
r6 6
r7 7
), I'd like to merge them by row names, keeping everything (so an outer join, or all=T). This does it:
merged.df <- merge(df1,df2,all=T,by='row.names')
R> merged.df
Row.names x y z
1 r1 1 1 NA
2 r2 2 2 NA
3 r3 3 3 NA
4 r5 NA NA 5
5 r6 NA NA 6
6 r7 NA NA 7
but I want the input row names to be the row names in the output dataframe (merged.df).
I can do:
rownames(merged.df) <- merged.df[[1]]
merged.df <- merged.df[-1]
which works, but seems inelegant and hard to remember. Anyone know of a cleaner way?
Not sure if it's any easier to remember, but you can do it all in one step using transform.
transform(merge(df1,df2,by=0,all=TRUE), row.names=Row.names, Row.names=NULL)
# x y z
#r1 1 1 NA
#r2 2 2 NA
#r3 3 3 NA
#r5 NA NA 5
#r6 NA NA 6
#r7 NA NA 7
From the help of merge:
If the matching involved row names, an extra character column called
Row.names is added at the left, and in all cases the result has
‘automatic’ row names.
So it is clear that you can't avoid the Row.names column, at least when using merge. But to remove this column you can subset by name rather than by index. For example:
dd <- merge(df1,df2,by=0,all=TRUE) ## by=0 easier to write than row.names ,
## TRUE is cleaner than T
Then I use subset to drop the Row.names column and set it as the row names:
res <- subset(dd,select=-c(Row.names))
rownames(res) <- dd[,'Row.names']
x y z
r1 1 1 NA
r2 2 2 NA
r3 3 3 NA
r5 NA NA 5
r6 NA NA 6
r7 NA NA 7

Creating categorical variables from mutually exclusive dummy variables

My question regards an elaboration on a previously answered question about combining multiple dummy variables into a single categorical variable.
In the question previously asked, the categorical variable was created from dummy variables that were NOT mutually exclusive. For my case, my dummy variables are mutually exclusive because they represent crossed experimental conditions in a 2X2 between-subjects factorial design (that also has a within subjects component which I'm not addressing here), so I don't think interaction does what I need to do.
For example, my data might look like this:
id conditionA conditionB conditionC conditionD
1 NA 1 NA NA
2 1 NA NA NA
3 NA NA 1 NA
4 NA NA NA 1
5 NA 2 NA NA
6 2 NA NA NA
7 NA NA 2 NA
8 NA NA NA 2
I'd like to now make categorical variables that combine ACROSS different types of conditions. For example, people who had values for conditions A and B might be coded with one categorical variable, and people who had values for conditions C and D with another.
id conditionA conditionB conditionC conditionD factor1 factor2
1 NA 1 NA NA 1 NA
2 1 NA NA NA 1 NA
3 NA NA 1 NA NA 1
4 NA NA NA 1 NA 1
5 NA 2 NA NA 2 NA
6 2 NA NA NA 2 NA
7 NA NA 2 NA NA 2
8 NA NA NA 2 NA 2
Right now, I'm doing this using ifelse() statements, which quite simply is a hot mess (and doesn't always work). Please help! There's probably some super-obvious "easier way."
EDIT:
The kinds of ifelse commands that I am using are as follows:
attach(df)
df$factor<-ifelse(conditionA==1 | conditionB==1, 1, NA)
df$factor<-ifelse(conditionA==2 | conditionB==2, 2, df$factor)
In reality, I'm combining across 6-8 columns each time, so a more elegant solution would help a lot.
Update (2019): Please use dplyr::coalesce(); it works pretty much the same.
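For example, with the df from the question (assuming the condition columns are numeric):
library(dplyr)
df$factor1 <- coalesce(df$conditionA, df$conditionB)  # first non-NA of A/B per row
df$factor2 <- coalesce(df$conditionC, df$conditionD)  # first non-NA of C/D per row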
My R package has a convenience function that allows you to choose the first non-NA value for each element across a list of vectors:
#library(devtools)
#install_github('kimisc', 'muelleki')
library(kimisc)
df$factor1 <- with(df, coalesce.na(conditionA, conditionB))
(I'm not sure if this works if conditionA and conditionB are factors. Convert them to numeric first with as.numeric(as.character(...)) if necessary.)
Otherwise, you could give interaction a try, combined with recoding of the levels of the resulting factor -- but to me it looks like you're more interested in the first solution:
df$conditionAB <- with(df, interaction(coalesce.na(conditionA, 0),
                                       coalesce.na(conditionB, 0)))
levels(df$conditionAB) <- c('A', 'B')
I think this function gives you what you need (admittedly, this is a quick hack).
to_indicator <- function(x, grp)
{
  # For each row of x, return its single non-NA value if that value's
  # column is one of grp, otherwise NA.
  apply(x, 1,
        function (x)
        {
          idx <- which(!is.na(x))
          nm <- names(idx)
          if (nm %in% grp)
            x[idx]
          else
            NA
        })
}
And here it is used with the example data you provided.
tbl <- read.table(header=TRUE, text="
conditionA conditionB conditionC conditionD
NA 1 NA NA
1 NA NA NA
NA NA 1 NA
NA NA NA 1
NA 2 NA NA
2 NA NA NA
NA NA 2 NA
NA NA NA 2")
tbl <- data.frame(tbl)
(tbl <- cbind(tbl,
              factor1 = to_indicator(tbl, c("conditionA", "conditionB")),
              factor2 = to_indicator(tbl, c("conditionC", "conditionD"))))
Well, I think you can do it simply with ifelse, something like:
factor1 <- ifelse(is.na(conditionA), conditionB, conditionA)
Another way could be:
factor1 <- conditionA
factor1[is.na(factor1)] <- conditionB[is.na(factor1)]
And a third solution, certainly more practical if you have more than two condition columns:
factor1 <- apply(df[,c("conditionA","conditionB")], 1, sum, na.rm=TRUE)
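Note that the first two snippets use bare column names (as if after attach()); as a sketch, you could wrap them in with() so they resolve against df, e.g.:
# same ifelse idea, evaluated inside df; factor2 is the analogous pair for C/D
df$factor1 <- with(df, ifelse(is.na(conditionA), conditionB, conditionA))
df$factor2 <- with(df, ifelse(is.na(conditionC), conditionD, conditionC))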
