My overall goal is to assign values to a new variable from one of several variables with specific string matches conditional on the value of another variable. More specifically:
I am trying to add many columns to a data frame. Each new column (e.g. 'foo') takes its value from one of two existing columns whose names begin with the same string and end with one of two suffixes (e.g. 'foo.2009' and 'foo.2014'), conditional on the value of another column (e.g. 'year'). The data frame also contains columns unrelated to this operation, identified by their lack of a suffix (e.g. 'other_example' does not end in '.2009' or '.2014'), and I have created a vector of the names of the new columns. In the example data below, I want to assign values to foo from foo.2014 if year >= 2014 and from foo.2009 if year < 2014.
# Original data frame
df <- data.frame( foo.2009 = seq(1,3),
foo.2014 = seq(5,7),
foo = NA,
bar = NA,
other_example = seq(20,22),
year = c(2014,2009,2014))
print(df)
# The vector of variable names ending in `.####`
names <- c("foo")
# Target data frame
df$foo <- c(5,2,7)
print(df)
In my real data, I have many variables (e.g. bar) similar to foo where I want bar == bar.2014 if year >= 2014 and bar == bar.2009 if year < 2014. I am therefore trying to develop a solution where I can loop through (or use vectorized operations on) a vector of variable names (e.g. names) for an arbitrarily large number of variables where I want to replace the values:
# The vector of variable names ending in `.####`
names <- c("foo","bar")
# Original data frame
df <- data.frame( foo.2009 = seq(1,3),
foo.2014 = seq(5,7),
bar.2009 = seq(8,10),
bar.2014 = rep(5,3),
foo = NA,
bar = NA,
other_example = seq(20,22),
year = c(2014,2009,2014))
df
# Target data frame
df$foo <- c(5,2,7)
df$bar <- c(5,9,5)
df
I am particularly having trouble with evaluating multiple strings that make up variable names, whether in a loop or with a vectorized approach. My attempt is below: it uses dplyr::mutate_() to add the variables and then assign them values. It runs on the same data as above, which includes bar as an example of an additional variable to recode.
library(dplyr)
for (i in names){
var09 <- paste0(i, ".2009")
var14 <- paste0(i, ".2014")
dplyr::mutate_(df,
i = ifelse(df$year < 2010,
paste0("df$",i, ".2009"),
paste0("df$",i, ".2014")))}
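We can loop through the sequence in base R:
nm1 <- c("foo\\.\\d+", "bar\\.\\d+")  # patterns matching the suffixed columns
nm2 <- c("foo", "bar")                # names of the new columns
for(j in seq_along(nm1)){
  # matches come back in column order, so .2009 is first and .2014 second
  sub1 <- df[grep(nm1[j], names(df))]
  # cutoff follows the stated rule: year < 2014 takes the .2009 value
  df[[nm2[j]]] <- ifelse(df$year < 2014, sub1[[1]], sub1[[2]])
}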
df
# foo.2009 foo.2014 bar.2009 bar.2014 foo bar other_example year
#1 1 5 8 5 5 5 20 2014
#2 2 6 9 5 2 9 21 2009
#3 3 7 10 5 7 5 22 2014
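As a variation on the loop above, here is a minimal sketch that builds the source column names with paste0() instead of relying on column order. The helper name pick_by_year and its cutoff argument are hypothetical, introduced only for illustration; the suffixes and the year >= 2014 rule come from the question.

```r
# Hypothetical helper: for each stem, take <stem>.2014 when year >= cutoff,
# otherwise <stem>.2009. Column names are built explicitly with paste0().
pick_by_year <- function(df, stems, cutoff = 2014) {
  for (stem in stems) {
    df[[stem]] <- ifelse(df$year >= cutoff,
                         df[[paste0(stem, ".2014")]],
                         df[[paste0(stem, ".2009")]])
  }
  df
}

# Example data from the question
df <- data.frame(foo.2009 = 1:3, foo.2014 = 5:7,
                 bar.2009 = 8:10, bar.2014 = rep(5, 3),
                 other_example = 20:22,
                 year = c(2014, 2009, 2014))

res <- pick_by_year(df, c("foo", "bar"))
res[c("foo", "bar", "year")]
```

Because the names are constructed rather than matched by pattern, this version should not depend on the .2009 column appearing before the .2014 column in the data frame.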
In my data below, I want to replace any value in a column (excluding the first column) that occurs fewer than two times (e.g. 'greek' in column L1, and 'german' in column L2) with "others".
I have tried the following, but don't get the desired output. Is there a short and efficient way to do this in R?
data <- data.frame(study=c('a','a','b','c','c','d'),
L1= c('arabic','turkish','greek','arabic','turkish','turkish'),
L2= c(rep('english',5),'german'))
# I tried the following without success:
dd[-1] <- lapply(names(dd)[-1], function(i) ifelse(table(dd[[i]]) < 2,"others",dd[[i]]))
forcats has a specific function for this:
dd = data
dd[-1] = lapply(dd[-1], forcats::fct_lump_min, min = 2, other_level = "others")
dd
# study L1 L2
# 1 a arabic english
# 2 a turkish english
# 3 b others english
# 4 c arabic english
# 5 c turkish english
# 6 d turkish others
Your approach fails because ifelse() returns a vector the same length as its test. In your case the test is built from the table, so the result has one element per distinct value, but you are assigning to the whole column, which needs one element per row.
We can fix it like this:
dd[-1] <- lapply(names(dd)[-1], function(i) {
tt = table(dd[[i]])
drop = names(tt)[tt < 2]  # values occurring fewer than two times
ifelse(dd[[i]] %in% drop, "others", dd[[i]])
})
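The length issue described above can also be handled by indexing the table with the column itself, which gives one count per row. The following is a minimal sketch, assuming character columns; the helper name lump_rare and its min_n argument are hypothetical.

```r
# Data from the question (stringsAsFactors = FALSE keeps columns as character
# on R versions before 4.0)
dd <- data.frame(study = c('a', 'a', 'b', 'c', 'c', 'd'),
                 L1 = c('arabic', 'turkish', 'greek', 'arabic', 'turkish', 'turkish'),
                 L2 = c(rep('english', 5), 'german'),
                 stringsAsFactors = FALSE)

# Hypothetical helper: replace values seen fewer than min_n times
lump_rare <- function(x, min_n = 2) {
  x <- as.character(x)
  counts <- table(x)
  # Indexing the table by value name yields one count per element of x,
  # so the ifelse() test has the same length as the column.
  n_per_row <- as.integer(counts[x])
  ifelse(n_per_row < min_n, "others", x)
}

dd[-1] <- lapply(dd[-1], lump_rare)
dd
```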
Suppose I have two lists with the following embedded data frames:
# Data frames to intersect
US <- data.frame("Group" = c(1,2,3), "Age" = c(21,20,17), "Name" = c("John","Dora","Helen"))
CA <- data.frame("Group" = c(2,3,4), "Age" = c(21,20,19), "Name" = c("John","Dora","Dan"))
JP <- data.frame("Group" = c(4,5,6), "Age" = c(16,15,14), "Name" = c("Mac","Hector","Jack"))
# Lists to compare----
list1<-list(US,CA,JP)
names(list1)<-c("US","CA","JP")
# List 2 can serve as a "reference list," a duplicate of the first.
list2<-list(US,CA,JP)
names(list2)<-c("US","CA","JP")
I have a second list that serves as a "reference list" to the first. It is a copy and is only meant to be used as a reference in some operation, like a for loop. What I want to do is intersect the values from only the first column (e.g. Group), and store the intersected output in separate data frames or matrices. I do not want to intersect data frames that have the same name (i.e. the List 1 US groups should not be intersected with the List 2 US groups).
Ideally, a final list of data frames would be created, containing all possible pairwise intersections, each named after the pair it came from. The final output would be something to the effect of:
print(comb_list)
$US_CA
Group
1 2
2 3
$US_JP
data frame with 0 columns and 0 rows
$CA_JP
Group
1 4
Would it be possible to create this as a for-loop?
Sure, that looks doable with a nested for loop. There's no need to copy the initial list; the loop can iterate over the same list twice. I'd suggest using dplyr for its handy filter and select functions.
require(dplyr)
comb_list <- list()
for (i in 1:length(list1)) {
for (j in 1:length(list1)) {
# don't intersect country with itself
if (names(list1)[i] != names(list1)[j]) {
value <- filter(list1[[i]], Group %in% list1[[j]]$Group)
value <- select(value, Group)
name <- paste0(names(list1)[i], "_", names(list1[j]))
name_alt <- paste0(names(list1)[j], "_", names(list1[i]))
#don't store equivalent country intersections i.e. US_CA and CA_US
if (!name %in% names(comb_list) & !name_alt %in% names(comb_list)) {
comb_list[[name]] <- value
}
}
}
}
print(comb_list)
$US_CA
Group
1 2
2 3
$US_JP
[1] Group
<0 rows> (or 0-length row.names)
$CA_JP
Group
1 4
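For comparison, the same pairwise result can probably be reached without the nested loop or the duplicate-name check. This is a sketch in base R, assuming only the Group column is needed; it uses combn() to enumerate each unordered pair once and intersect() on the Group values.

```r
# Data frames from the question
US <- data.frame(Group = c(1, 2, 3), Age = c(21, 20, 17), Name = c("John", "Dora", "Helen"))
CA <- data.frame(Group = c(2, 3, 4), Age = c(21, 20, 19), Name = c("John", "Dora", "Dan"))
JP <- data.frame(Group = c(4, 5, 6), Age = c(16, 15, 14), Name = c("Mac", "Hector", "Jack"))
list1 <- list(US = US, CA = CA, JP = JP)

# Every unordered pair of names appears exactly once, so a data frame is
# never paired with itself and US_CA / CA_US are not both produced.
pairs <- combn(names(list1), 2, simplify = FALSE)

comb_list <- lapply(pairs, function(p) {
  data.frame(Group = intersect(list1[[p[1]]]$Group, list1[[p[2]]]$Group))
})
names(comb_list) <- vapply(pairs, paste, character(1), collapse = "_")

comb_list
```

Pairs with no common Group value come back as a data frame with zero rows rather than being dropped.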
I am trying to use the %in% operator in R to write an equivalent of the SAS code below:
If weather in (2,5) then new_weather=25;
else if weather in (1,3,4,7) then new_weather=14;
else new_weather=weather;
The SAS code produces a variable "new_weather" whose values are 25, 14, or the original value of "weather".
R code:
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[newcol] = df[col]
df[df[newcol] %in% c(2,5)]= 25
df[df[newcol] %in% c(1,3,4,7)] = 14
return(df)
}
Result: the values of "col" and "newcol" are the same after passing a data frame through the function "GS". The syntax does not seem to pick up the replacement values for "newcol". I would appreciate an explanation of the reason and a possible fix.
Is this what you are trying to do?
df <- data.frame(A=seq(1:4), B=seq(1:4))
add_and_adjust <- function(df, copy_column, new_column_name) {
df[new_column_name] <- df[copy_column] # make copy of column
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(2,5), 25, df[,new_column_name])
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(1,3,4,7), 14, df[,new_column_name])
return(df)
}
Usage:
add_and_adjust(df, 'B', 'my_new_column')
df[newcol] is a data frame (with one column), df[[newcol]] or df[, newcol] is a vector (just the column). You need to use [[ here.
You also need to be assigning the result to df[[newcol]], not to the whole df. And to be perfectly consistent and safe you should probably test the col values, not the newcol values.
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[[newcol]] = df[[col]]
df[[newcol]][df[[col]] %in% c(2,5)] = 25
df[[newcol]][df[[col]] %in% c(1,3,4,7)] = 14
return(df)
}
GS(data.frame(x = 1:7), "x", "new")
# x new
# 1 1 14
# 2 2 25
# 3 3 14
# 4 4 14
# 5 5 25
# 6 6 6
# 7 7 14
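The difference between [ and [[ mentioned above can be checked directly. The sketch below uses a throwaway data frame (the column name x is only for illustration) and shows that %in% applied to a one-column data frame yields a single value, whereas applied to the extracted vector it yields one value per row. That is probably why the original function changed nothing.

```r
df <- data.frame(x = 1:7)

class(df["x"])     # single bracket keeps the data frame wrapper
class(df[["x"]])   # double bracket extracts the column vector

# %in% treats the one-column data frame as a list with one element,
# so the result has length 1 instead of one entry per row.
whole  <- df["x"] %in% c(2, 5)
by_row <- df[["x"]] %in% c(2, 5)

length(whole)
length(by_row)
which(by_row)      # positions of the rows whose value is 2 or 5
```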
@user9231640 before you invest too much time in writing your own function, you may want to explore some of the recode functions that already exist in packages like car and Hmisc.
Depending on how complex your recoding gets your function will get longer and longer to check various boundary conditions or to change data types.
Just based upon your example you can do this in base R and it will be more self documenting and transparent at one level:
df <- data.frame(A=seq(1:30), B=seq(1:30))
df$my_new_column <- df$B
df$my_new_column <- ifelse(df$my_new_column %in% c(2,5), 25, df$my_new_column)
df$my_new_column <- ifelse(df$my_new_column %in% c(1,3,4,7), 14, df$my_new_column)
I think I'm missing something super simple, but I seem to be unable to find a solution directly relating to what I need: I've got a data frame that has a letter as the row name and two columns of numerical values. As part of a loop I'm running, I create a new vector (from an index) that has both a letter and a number (e.g. "f2"), which I then need to use as the name of a new row, with two numbers added next to it (based on some other section of code, but I'm fine with that). What I get instead is the name of the vector/index variable as the row name, and I'm not sure if I'm missing an argument of rbind or something else to make it easy.
Example code:
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
rownames(data.frame) <- row.names
data.frame
index.vector <- "f2"
#what I want the data frame to look like with the new row
data.frame <- rbind(data.frame, "f2" = c(6,11))
data.frame
#what the data frame looks like when I attempt to use a vector as a row name
data.frame <- rbind(data.frame, index.vector = c(6,11))
data.frame
#"why" I can't just type "f" every time
index.vector2 = paste(index.vector, "2", sep="")
data.frame <- rbind(data.frame, index.vector2 = c(6,11))
data.frame
In my loop the "index.vector" is a random sample, hence where I can't just write the letter/number in as a row name, so need to be able to create the row name from a vector or from the index of the sample.
The loop runs and a random number of new rows will be created, so I can't specify what number the row is that needs a new name - unless there's a way to just do it for the newest or bottom row every time.
Any help would be appreciated!
Not elegant, but works:
new_row <- data.frame(setNames(list(6, 11), colnames(data.frame)), row.names = paste(index.vector, "2", sep=""))
data.frame <- rbind(data.frame, new_row)
data.frame
# vector.1 vector.2
# a 1 2
# b 2 3
# c 3 4
# d 4 5
# e 5 6
# f22 6 11
I understood the problem but was not able to resolve it directly, so I am suggesting an alternative way to achieve the same result.
Alternate solution: collect your row labels alongside the rbind calls in your loop, then assign the row names to your data frame at the end.
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
#loop starts
index.vector <- "f2"
data.frame <- rbind(data.frame,c(6,11))
row.names<-append(row.names,index.vector)
#loop ends
rownames(data.frame) <- row.names
data.frame
output:
vector.1 vector.2
a 1 2
b 2 3
c 3 4
d 4 5
e 5 6
f2 6 11
Hope this would be helpful.
If you manipulate the data frame with rbind, then the newest elements will always be at the "bottom" of your data frame. Hence you could also set a single row name by
rownames(data.frame)[nrow(data.frame)] = "new_name"
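Put inside a loop, that idea might look like the sketch below. The labels in new_names are hypothetical stand-ins for the randomly sampled names, and the values 6 and 11 are reused from the question purely as placeholders.

```r
# Starting data frame with letter row names, as in the question
df <- data.frame(vector.1 = 1:5, vector.2 = 2:6, row.names = letters[1:5])

# Hypothetical labels standing in for the randomly sampled names
new_names <- c("f2", "h7")

for (nm in new_names) {
  df <- rbind(df, c(6, 11))        # the new row is always appended at the bottom
  rownames(df)[nrow(df)] <- nm     # so the last row is the one to rename
}

df
```

Since only the last row is renamed on each pass, the loop does not need to know in advance how many rows will be added.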
I have data like the below:
# Create fake data frame
score <- rep(seq(1:3), 2)
id <- rep(c(2014, 2015), each = 3)
var_if_1 <- rep(c(0.1, 0.8), each = 3)
var_if_2 <- rep(c(0.9, 0.7), each = 3)
var_if_3 <- rep(c(0.6, 0.2), each = 3)
data.frame(score, id, var_if_1, var_if_2, var_if_3)
More specifically, each row is uniquely defined by two vectors in a data frame (e.g. score and id) and there are a multitude of additional columns that begin with a string (e.g. "var_if_") and end with a different number (e.g. 1,2,3). Furthermore, for a given value of score (i.e. for any row with a given score) the value of the additional variables does not vary.
I am trying to convert these data into a data frame like the below:
# Desired output data frame
score <- rep(seq(1:3), 2)
id <- rep(c(2014, 2015), each = 3)
var <- c(0.1, 0.9, 0.6, 0.8, 0.7, 0.2)
data.frame(score, id, var)
More specifically, the additional variables (var_if_#) are removed and aggregated into a single new variable (e.g. var) which takes on the value of one of the additional variable columns based on the value of score. For example, if score == 2, then var == var_if_2.
Constraints on the solution
Looking to use base R or dplyr.
Looking for a solution that generalizes to a large number of values of 'score' and corresponding columns for 'var_if_#' and rows of arbitrary ordering.
The below exemplifies the arbitrary row ordering.
score <- rep(seq(1:3), 2)
id <- rep(c(2014, 2015), each = 3)
var_if_1 <- rep(c(0.1, 0.8), each = 3)
var_if_2 <- rep(c(0.9, 0.7), each = 3)
var_if_3 <- rep(c(0.6, 0.2), each = 3)
foo <- data.frame(score, id, var_if_1, var_if_2, var_if_3)
foo[sample(1:nrow(foo)), ] # arbitrary row order
I am also aware that I could just use ifelse() but this becomes
tedious with many possible values of score (unless there is a looping
approach that can reduce the tedium).
Use matrix indexing, which avoids slow looping or apply logic:
cbind(dat[1:2], var=dat[3:5][cbind(seq_len(nrow(dat)), dat$score)])
# score id var
#1 1 2014 0.1
#2 2 2014 0.9
#3 3 2014 0.6
#4 1 2015 0.8
#5 2 2015 0.7
#6 3 2015 0.2
If you are specifically matching on name patterns like var_if_1 etc, then use match to get the columns to extract:
dat[cbind( seq_len(nrow(dat)), match(paste0("var_if_", dat$score), names(dat)))]
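To see the name-based lookup work on rows in arbitrary order, here is a self-contained sketch. It assumes the data from the question stored under the name dat, and it restricts the matrix to the var_if_ columns first so that any non-numeric columns elsewhere in the data frame would not affect the result; the intermediate names vals and cols are introduced only for this example.

```r
# Data from the question, stored as `dat`
dat <- data.frame(score = rep(1:3, 2),
                  id = rep(c(2014, 2015), each = 3),
                  var_if_1 = rep(c(0.1, 0.8), each = 3),
                  var_if_2 = rep(c(0.9, 0.7), each = 3),
                  var_if_3 = rep(c(0.6, 0.2), each = 3))

# Shuffle the rows to mimic arbitrary ordering
dat <- dat[sample(nrow(dat)), ]

# Matrix of candidate columns only
vals <- as.matrix(dat[grep("^var_if_", names(dat))])

# For each row, the column whose name matches that row's score
cols <- match(paste0("var_if_", dat$score), colnames(vals))

# A two-column index matrix picks one (row, column) cell per row
dat$var <- vals[cbind(seq_len(nrow(dat)), cols)]

dat[order(dat$id, dat$score), c("score", "id", "var")]
```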
You can use the function apply, that will iterate over each row of your data frame. If the columns are in a specific order like in your example:
var <- apply(my_data_frame, 1, function(x) { x[x["score"] + 2] })
If you want to use the names of the columns instead of their positions, you could replace x[x["score"] + 2] with x[paste0("var_if_", x["score"])]