Recoding a large number of variables using another data frame in R

I'd like to use a data frame (Df2) to recode the variables of another data frame (Df1), so that the end result is a data frame that contains text like local/international rather than 1s/2s (Df3). Missingness is present in the Df1 data frame, and I'd like to make sure it's represented as NA.
This is a minimal working example, the actual data set contains more than a hundred variables (all of which are of the character class) with between one and fifteen levels. Any help would be much appreciated.
Starting point (dfs)
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Desired outcome (df)
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
Thoughts, not really code, so far: (If there's a match between a row of the df2 NameOfVariable and a df1 variable name, as well as a match between a row of df2 VariableLevel and a df1 observation, then paste the corresponding row of df2 VariableDef into df1. Wondering if you can use if statements for it.)
if (Df2["NameOfVariable"]==names(Df1))
{
if (Df2["VariableLevel"]==Df1[ ])
{
Df1[ ] <- paste0("VariableDef")
}
}

Here is one method in base R using match and Map. Map applies a function to corresponding elements of its list arguments. Here there are two: Df1 (a data frame is a list of columns) and a list built by splitting the second and third columns of Df2 by the first column. The second list is reordered with [names(Df1)] so that its elements line up with the columns of Df1.
The applied function matches the values in a column of Df1 against VariableLevel in the corresponding piece of Df2 and uses the resulting positions as an index into VariableDef. Map returns a list, which is converted to a data.frame with the function of the same name.
data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
Df1,
split(Df2[2:3], Df2[1])[names(Df1)]))
This returns
  buyer_Q1 seller_Q2 price_Q1_2
1    local  internat    50-100K
2 internat     local   100-200K
3    local        NA      200+K
4    local  internat   100-200K
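To make the lookup easier to follow, here is a minimal sketch that traces the match step for a single column. It also converts the literal string "NA" from Df2 into a real missing value, on the assumption that this is what the question means by representing missingness as NA. The names key, pos and recoded are illustrative only.

```r
# Reduced version of the question's data (two variables only)
Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1), seller_Q2 = c(2, 1, 3, 2))
Df2 <- data.frame(
  NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2", "seller_Q2"),
  VariableLevel  = c(1, 2, 1, 2, 3),
  VariableDef    = c("local", "internat", "local", "internat", "NA"),
  stringsAsFactors = FALSE
)

# Lookup table for one variable
key <- Df2[Df2$NameOfVariable == "seller_Q2", ]

# Position of each observed code within the lookup table
pos <- match(Df1$seller_Q2, key$VariableLevel)

# Use the positions to pull out the text labels
recoded <- key$VariableDef[pos]

# Assumption: the string "NA" should become a real missing value
recoded[recoded == "NA"] <- NA
recoded
```

The same replacement can be applied to the whole recoded data frame in one step, for example with Df3[Df3 == "NA"] <- NA, provided the columns are character rather than factor.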

A solution using a loop and factors. Be careful: the results look equivalent to Df3, but they are not identical. The function fun returns a data frame whose columns are factors. If needed, you can convert them to character.
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
fun <- function(df, mdf) {
for (varn in names(df)) {
dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef),]
df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
}
return(df)
}
fun(Df1, Df2)
Df3
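The conversion from factor to character mentioned above could look like the following. This is a sketch on a small stand-in data frame (res is a hypothetical name for the result of fun), not part of the original answer.

```r
# Stand-in for the output of fun(): a data frame with factor columns
res <- data.frame(
  buyer_Q1 = factor(c(1, 2, 1), levels = c(1, 2), labels = c("local", "internat"))
)

# Convert every column to character while keeping the data frame structure
res[] <- lapply(res, as.character)
str(res)
```

Assigning into res[] rather than res keeps the object a data frame; a plain res <- lapply(...) would turn it into a list.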

A solution using dplyr and tidyr. The code still works when it emits warning messages, which appear because the columns are factors. If you don't want to see any warnings, set stringsAsFactors = FALSE when creating the data frames, as in the DATA section below.
library(dplyr)
library(tidyr)
Df3 <- Df1 %>%
mutate(ID = 1:n()) %>%
gather(NameOfVariable, VariableLevel, -ID) %>%
left_join(Df2, by = c("NameOfVariable", "VariableLevel")) %>%
select(-VariableLevel) %>%
spread(NameOfVariable, VariableDef) %>%
select(-ID)
Df3
  buyer_Q1 price_Q1_2 seller_Q2
1    local    50-100K  internat
2 internat   100-200K     local
3    local      200+K        NA
4    local   100-200K  internat
DATA
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),
"seller_Q2"=c(2,1,3,2),
"price_Q1_2"=c(2,5,7,5),
stringsAsFactors = FALSE)
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),
"VariableLevel"=c(1,2,1,2,3,2,5,7),
"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"),
stringsAsFactors = FALSE)
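In current versions of tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The same reshape, join and reshape-back idea can be sketched with the newer functions as follows, assuming tidyr 1.0.0 or later is installed; this is an equivalent rewrite, not the original answer's code.

```r
library(dplyr)
library(tidyr)

Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1),
                  seller_Q2 = c(2, 1, 3, 2),
                  price_Q1_2 = c(2, 5, 7, 5))
Df2 <- data.frame(
  NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
  VariableLevel = c(1, 2, 1, 2, 3, 2, 5, 7),
  VariableDef = c("local", "internat", "local", "internat", "NA",
                  "50-100K", "100-200K", "200+K"),
  stringsAsFactors = FALSE
)

Df3 <- Df1 %>%
  mutate(ID = row_number()) %>%                       # remember the original row
  pivot_longer(-ID, names_to = "NameOfVariable",
               values_to = "VariableLevel") %>%       # one row per (row, variable)
  left_join(Df2, by = c("NameOfVariable", "VariableLevel")) %>%
  select(-VariableLevel) %>%
  pivot_wider(names_from = NameOfVariable,
              values_from = VariableDef) %>%          # back to one row per ID
  select(-ID)
Df3
```

The result is a tibble rather than a base data frame; wrap it in as.data.frame() if that matters downstream.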

Related

Return the row indices of df1 when those row values occur in df2 in R

I'm coding in R. I have a big data frame (df1) and a little data frame (df2). df2 is a subset of df1, but in a random order. I need to know the row indices of df1 which occur in df2. All of the specific cell values have lots of duplicates: Tapirus terrestris shows up more than once, as does each ModType value. I tried experimenting with which() and grepl() but couldn't get my code to work.
df1 <- data.frame(
SpeciesName = c('Tapirus terrestris', 'Panthera onca', 'Leopardus tigrinus' , 'Leopardus tigrinus'),
ModType = c('ANN', 'GAM', 'GAM','RF'),
Variable_scale = c('aspect_s2_sd', 'CHELSAbio1019_s3_sd','CHELSAbio1015_s4_sd','CHELSAbio1015_s4_sd'))
df2 <- data.frame(
SpeciesName = c('Tapirus terrestris', 'Leopardus tigrinus'),
ModType = c('ANN', 'RF'),
Variable_scale = c('aspect_s2_sd', 'CHELSAbio1015_s4_sd'))
Should output an array: 1,4 because df1 rows 1 and 4 occur in df2.
You can create an index column in df1 and merge the datasets.
df1$index <- 1:nrow(df1)
df3 <- merge(df1, df2)
df3$index
#[1] 4 1
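You can use match, but the match has to be made on all columns at once. Matching on SpeciesName alone returns the first row with that species (row 3 for Leopardus tigrinus, not row 4). Pasting the columns into a single key per row avoids this:
match(do.call(paste, df2), do.call(paste, df1[names(df2)]))
#[1] 1 4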
Another option is tidyverse
library(dplyr)
df1 %>%
mutate(index = row_number()) %>%
inner_join(df2)

Select specific columns, where the column names are in another df in r

I couldn't find a solution in stack, so here's my issue:
I have a df with 342 columns.
I want to make a new df with only specific columns
The list of columns to keep is in another df, listed in 3 columns titled X,Y,Z for 3 new dataframes
Here's my code right now:
# Read the data:
data <- data.table::fread("data_30_9.csv")
# Import variable names #
variable.names.full = openxlsx::read.xlsx("variables2.xlsx")
Y.variable.names = na.omit(variable.names.full[1])
X.variable.names = na.omit(variable.names.full[2])
Z.variable.names = na.omit(variable.names.full[3])
# Make new DF with only specific columns:
X.Data = data %>% select(as.character(X.variable.names)) # This works as X has only 1 variable
Y.Data = data %>% select(as.character(Y.variable.names)) # This gives an error:
# Can't subset columns that don't exist.
Help?
the data is available here:
https://github.com/amirnakar/TammyA/blob/main/data_30_9.csv
https://github.com/amirnakar/TammyA/blob/main/Variables2.xlsx
The problem is that Y.variable.names is a data.frame which you cannot use to subset another data.frame.
You can check by typing class(Y.variable.names).
So the solution to your problem is subsetting Y.variable.names:
Y.Data = data %>% select(Y.variable.names[,1])
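Alternatively, use lapply on variable.names.full to select the columns from data for all three lists at once. Because fread returns a data.table, the column names must be passed with with = FALSE; otherwise j is evaluated as an expression and the names themselves are returned.
list_data <- lapply(variable.names.full, function(x)
data[, x[!is.na(x)], with = FALSE])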
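A small self-contained sketch of the same idea, using made-up data in place of the CSV and Excel files (the column names a to d and the contents of variable.names.full are hypothetical). It uses a base data.frame, so columns are selected with drop = FALSE rather than the data.table syntax.

```r
# Toy stand-ins for the real files
data <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
variable.names.full <- data.frame(
  Y = c("a", "b", NA),
  X = c("c", NA, NA),
  Z = c("b", "c", "d"),
  stringsAsFactors = FALSE
)

# as.character() on a data frame deparses each column into ONE string,
# so it does not yield three separate column names
n_strings <- length(as.character(variable.names.full["Z"]))

# One data frame per column of names, dropping the NA padding
list_data <- lapply(variable.names.full, function(x) {
  data[, x[!is.na(x)], drop = FALSE]
})
lapply(list_data, names)
```

The result is a named list, so list_data$Y, list_data$X and list_data$Z play the role of Y.Data, X.Data and Z.Data.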

Replace values in column with matching column in different DF

I have two data frames:
DF <- data.frame(A=letters[1:5],B=1:5)
DF_2 <- data.frame(match_col = c("a","a","c"))
Here we have to get only matching columns of DF_2$match_col
final_df <- data.frame(A=c("a","a","c","d","e"),B=1:5)
Your question is not very clear. In your DF_2, I am not sure whether there is a column B. I assume you forgot to include it, since that column is needed to perform the matching.
Please see below:
DF <- data.frame(A=letters[1:5],B=1:5)
DF_2 <- data.frame(match_col = c("a","a","c"))
DF_2$B=c(1:3)
DF$A= as.character(DF$A)
DF_2$match_col= as.character(DF_2$match_col)
for(id in 1:nrow(DF_2)){
DF$A[DF$B %in% DF_2$B[id]] <- DF_2$match_col[id]
}
DF
Here my DF matches with your final_df, therefore I presume my assumption is right.
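The loop can also be written as a vectorised lookup with match. This is a sketch under the same assumption as the answer above, namely that DF_2 has a column B to match on; idx is an illustrative name.

```r
DF <- data.frame(A = letters[1:5], B = 1:5, stringsAsFactors = FALSE)
DF_2 <- data.frame(match_col = c("a", "a", "c"), B = 1:3,
                   stringsAsFactors = FALSE)

# Row of DF_2 that corresponds to each row of DF (NA where there is none)
idx <- match(DF$B, DF_2$B)

# Keep the old value where nothing matched, otherwise take match_col
DF$A <- ifelse(is.na(idx), DF$A, DF_2$match_col[idx])
DF
```

This avoids one pass over DF per row of DF_2, which matters once DF_2 is large.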

Retain previous data frame based on condition in R

So I'm trying to update or retain a dataframe df2 based on a certain condition of another data frame df1.
For example, assume df1 is updated every 30 seconds. If df1 has rows, i.e. nrow(df1) != 0, then df2 <- df1; otherwise df2 should retain its previous values.
NOTE: On the first iteration, df2 can be initialized to a NULL dataframe.
Following is my code
#Initializing df2 as empty dataframe
df2 <- data.frame(weight = integer(), stringsAsFactors = FALSE)
#Condition to check if number of rows in df1 != 0
if(nrow(df1) != 0){
df2 <- df1
temp <- df1 #Another copy of df1
} else {
df2 <- temp
}
Here I created an another data frame called temp to keep a copy of df1 so that it can be used when nrow(df1) == 0. I don't know if the usage of temp is correct or not.
This code will create an empty dataframe named df2. If nrow(df1)>0 then it will effectively assign the contents of df1 to df2. If nrow(df1)==0 then df2 remains empty.
df2 <- data.frame()
if(nrow(df1)>0) df2 <- df1
I have a hard time imagining why this is useful. If, perhaps, you intended to "grow" df2 by appending on whatever is in df1 - which might be more common - then do something like this:
df2 <- data.frame()
if(nrow(df1)>0) df2 <- rbind(df2, df1)
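If the goal is to keep the last non-empty df1 across iterations, as the question describes, df2 has to be initialised once outside the repeated part and only overwritten when df1 has rows. A minimal sketch, where update_df2 is a hypothetical helper and the two calls stand in for successive 30-second updates:

```r
# Return the new data if there is any, otherwise keep what we already had
update_df2 <- function(df2, df1) {
  if (nrow(df1) > 0) df1 else df2
}

# Initialise once, before the first iteration
df2 <- data.frame(weight = integer())

# Iteration 1: df1 has rows, so df2 is replaced
df2 <- update_df2(df2, data.frame(weight = c(10L, 20L)))

# Iteration 2: df1 is empty, so df2 keeps the previous values
df2 <- update_df2(df2, data.frame(weight = integer()))
df2
```

With this arrangement the separate temp copy from the question is not needed, because df2 itself already holds the previous values.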

removing columns from a data frame which feature in a list, but don't feature in another list

Say, my variables are as follows.
df = read.csv('somedataset.csv') #contains 'col1','col2','col3','col4','col5' say
colsSomeRemoveSomeDontRemove = c('col1','col2','col3')
colsDontRemove = 'col2'
I would like to remove all those columns from df which feature in colsSomeRemoveSomeDontRemove, but are not part of colsDontRemove.
So basically, at the end my df should contain only columns 'col2','col4','col5'
How can I do that?
I have tried doing the following, but could not get it to work
df1 = cbind(df[,which(!(names(df) %in% colsSomeRemoveSomeDontRemove))],as.data.frame(df[,colsDontRemove]))
df[, !(colnames(df) %in% setdiff(colsSomeRemoveSomeDontRemove, colsDontRemove))]
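The one-liner above works in two steps: setdiff finds the columns that really should go, and %in% builds a logical mask over the column names. A sketch with a made-up data frame in place of somedataset.csv (to_drop and df_kept are illustrative names):

```r
# Toy stand-in for the CSV
df <- data.frame(col1 = 1:2, col2 = 3:4, col3 = 5:6, col4 = 7:8, col5 = 9:10)
colsSomeRemoveSomeDontRemove <- c("col1", "col2", "col3")
colsDontRemove <- "col2"

# Columns listed for removal that are not protected
to_drop <- setdiff(colsSomeRemoveSomeDontRemove, colsDontRemove)

# Keep everything else; drop = FALSE guards against a single remaining column
df_kept <- df[, !(colnames(df) %in% to_drop), drop = FALSE]
names(df_kept)
```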
