I have a series of data frames that are noncumulative. During the cleaning process I want to loop through each data frame, check whether certain columns exist, and create them if they aren't present. I can't for the life of me figure out a method to do this. I am not package-shy and prefer packages to base R.
Any direction is much appreciated.
You can use this dummy data frame df, with colToAdd holding the names of the columns to check and to add if they do not exist:
df <- data.frame(A = rnorm(5) , B = rnorm(5) , C = rnorm(5))
colToAdd <- c("B" , "D")
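Then check each name: if the column already exists, NULL is returned; otherwise the new column is created, e.g. with rnorm(5). simplify = FALSE keeps the result a named list even when every column is missing:
add <- sapply(colToAdd, \(x) if (!(x %in% colnames(df))) rnorm(5), simplify = FALSE)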
data.frame(do.call(cbind , c(df , add)))
output
A B C D
1 1.5681665 -0.1767517 0.6658019 -0.8477818
2 -0.5814281 -1.0720196 0.5343765 -0.8259426
3 -0.5649507 -1.1552189 -0.8525945 1.0447395
4 1.2024881 -0.6584889 -0.1551638 0.5726059
5 0.7927576 0.5340098 -0.5139548 -0.7805733
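For a whole series of data frames, the same check can be wrapped in a small helper. The sketch below is only one way to do it in base R: the name ensure_columns and the NA fill value are assumptions made for illustration, not part of the answer above.

```r
# Hypothetical helper: add any columns from `cols` that `df` lacks,
# filled with `fill` (NA by default). Existing columns are left untouched.
ensure_columns <- function(df, cols, fill = NA) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    df[missing_cols] <- fill
  }
  df
}

# Two small example data frames with different columns
dfs <- list(
  first  = data.frame(A = 1:3, B = 4:6),
  second = data.frame(B = 7:9, D = 10:12)
)

# Apply the same check to every data frame in the list
cleaned <- lapply(dfs, ensure_columns, cols = c("B", "D"))
lapply(cleaned, names)
```

Using setdiff() avoids the explicit per-column test, and lapply() applies the check to every data frame in one step.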
Related
I have a df containing 3 variables, and I want to create an extra variable for each possible product combination.
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
I want to create a new df (output) containing the result of a*b, a*c, b*c.
output <- data.frame(d = test$a * test$b
, e = test$a * test$c
, f = test$b * test$c)
This is easily doable (manually) with a small number of columns, but even above 5 columns, this activity can get very lengthy - and error-prone, when column names contain prefix, suffix or codes inside.
It would be a bonus if I could also control the number of columns combined at a time (in the example above I only considered 2 columns, but it would be great to set that as a parameter, so that an extra variable a*b*c can be added if needed).
My initial idea was to use expand.grid() with column names and then somehow do a lookup to select the whole columns values for the product - but I hope there's an easier way to do it that I am not aware of.
You can use combn to create combinations of column names taken 2 at a time and multiply them to create new columns.
cbind(test, do.call(cbind, combn(names(test), 2, function(x) {
setNames(data.frame(do.call(`*`, test[x])), paste0(x, collapse = '-'))
}, simplify = FALSE)))
# a b c a-b a-c b-c
#1 0.4098568 -0.3514020 2.5508854 -0.1440245 1.045498 -0.8963863
#2 1.4066395 0.6693990 0.1858557 0.9416031 0.261432 0.1244116
#3 0.7150305 -1.1247699 2.8347166 -0.8042448 2.026909 -3.1884040
#4 0.8932950 1.6330398 0.3731903 1.4587864 0.333369 0.6094346
#5 -1.4895243 1.4124826 1.0092224 -2.1039271 -1.503261 1.4255091
#6 0.8239685 0.1347528 1.4274288 0.1110321 1.176156 0.1923501
#7 0.7803712 0.8685688 -0.5676055 0.6778060 -0.442943 -0.4930044
#8 -1.5760181 2.0014636 1.1844449 -3.1543428 -1.866707 2.3706233
#9 1.4414434 1.1134435 -1.4500410 1.6049658 -2.090152 -1.6145388
#10 0.3526583 -0.1238261 0.8949428 -0.0436683 0.315609 -0.1108172
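The question also asks for control over how many columns are multiplied at once. The following is a minimal sketch of one way to generalise the combn approach, assuming a hypothetical helper named col_products with a parameter m for the combination size. It uses Reduce so that more than two columns can be multiplied, since `*` itself only takes two arguments.

```r
# Hypothetical helper: products of every combination of m columns of df.
col_products <- function(df, m = 2, sep = "_") {
  # All combinations of column names, m at a time, as a list
  combos <- combn(names(df), m, simplify = FALSE)
  # Multiply the selected columns element-wise, pairwise via Reduce
  prods <- lapply(combos, function(x) Reduce(`*`, df[x]))
  names(prods) <- vapply(combos, paste, character(1), collapse = sep)
  as.data.frame(prods)
}

test <- data.frame(a = 1:3, b = 4:6, c = 7:9)
col_products(test, 2)  # columns a_b, a_c, b_c
col_products(test, 3)  # single column a_b_c
```

The underscore separator is chosen here so that the resulting names remain syntactically valid column names.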
Could this also be a solution? Ronak's solution is more elegant, though!
library(dplyr)
# your data
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
# new dataframe output
output <- test %>%
  # use * for element-wise products; prod() would collapse everything to a single number
  mutate(a_b = a * b,
         a_c = a * c,
         b_c = b * c
  ) %>%
  select(-a, -b, -c)
I am currently developing an application and I need to loop through the columns of the data frame. For instance, if the data frame has the columns
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
If the input is given as "a", then the column name "b" should be assigned to the variable, say promote.
It throws the error: Error in `[.data.frame`(char_set, i + 1) : undefined columns selected. Is there any solution?
char_name <- "a"
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
for (i in 1:ncol(char_set)) {
promote <- ifelse(names(char_set) == char_name,char_set[i+1], "-")
print(promote)
}
Thanks in advance!!!
This is actually quite interesting. I would suggest doing something along these lines:
char_name <- "a"
char_set <- data.frame(
a = 1:2,
b = 3:4,
c = 5:6,
d = 8:9,
stringsAsFactors = FALSE
)
res_dta <- data.frame(matrix(nrow = 2, ncol = 3))
for (i in wrapr::seqi(1, NCOL(char_set) - 1)) {
print(i)
if (names(char_set)[i] == char_name) {
res_dta[i] <- char_set[i + 1]
} else {
res_dta[i] <- char_set[i]
}
}
Results
char_set
a b c d
1 1 3 5 8
2 2 4 6 9
res_dta
X1 X2 X3
1 3 3 5
2 4 4 6
A few general points:
When looping through columns, be careful not to fall outside the data frame's dimensions: running i + 1 at i = 4 asks for column 5, which returns an error for a data frame with four columns. You can either loop over one column fewer or break at a specific value of i.
I am not sure I understood your request correctly: for column name a you want to take the values of column b, and column b then stays as it was?
Broadly speaking, I think the names(char_set)[i] == char_name comparison requires more thought, but this answer gives you a start. Updating your post with the desired results would help in designing a solution.
The problem in your code is that you loop from 1 to the number of columns of char_set and, inside the loop, access char_set[i+1].
Thus, when the index i reaches its maximum value, char_set[i+1] returns an error because there is no column with that index.
You can try with this solution:
char_name<-"a"
# use <= so that the second-to-last column ("c") still returns the last one ("d")
promote<-ifelse((which(names(char_set)==char_name)+1)<=ncol(char_set),names(char_set)[which(names(char_set)==char_name)+1],"-")
promote
> [1] "b"
char_name<-"d"
promote<-ifelse((which(names(char_set)==char_name)+1)<=ncol(char_set),names(char_set)[which(names(char_set)==char_name)+1],"-")
promote
> [1] "-"
Here, when char_name takes the value a, promote takes the name of the column that comes right after a in char_set.
I suggest you also think about the case in which char_name takes the value d and there is no column after d in char_set.
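The lookup can also be written without a loop. The sketch below is only an illustration: the function name next_column and its "-" default are assumptions, and it uses match() so that a name that is absent, or that is the last column, falls back to the default.

```r
# Hypothetical helper: name of the column that follows `name` in `df`,
# or `default` if `name` is the last column or is not present at all.
next_column <- function(df, name, default = "-") {
  pos <- match(name, names(df))
  if (is.na(pos) || pos == ncol(df)) {
    return(default)
  }
  names(df)[pos + 1]
}

char_set <- data.frame(a = character(), b = character(),
                       c = character(), d = character(),
                       stringsAsFactors = FALSE)

next_column(char_set, "a")  # "b"
next_column(char_set, "c")  # "d"
next_column(char_set, "d")  # "-"
```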
I am writing a function to automatically read in columns of several data frames. The name and the position of the column that I want to access sometimes differ.
As a reproducible example:
a= c(1:3)
b= c(4:6)
A = data.frame(a,b)
colnames(A) = c("column1", "column2")
B = data.frame(b,a)
colnames(B) = c("col1", "col2")
Now I want to access the values of a, with something like:
`$`(A, "column1"|"col2")
Logical operators don't work on character strings, so I tried to go with
A[,which(colnames(A[,1]) == "column1" | names(A[,2]) =="col2")]
but it seems that the colnames() function doesn't work that way.
Does anyone have an idea how to approach this?
Try using grep. It works with both A and B.
A[, grep("column1|col2", names(A))]
#[1] 1 2 3
B[, grep("column1|col2", names(B))]
#[1] 1 2 3
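Note that grep matches by regular expression, so the pattern above would also match a name such as column10. If exact names are wanted, one possible alternative is to intersect a vector of candidate names with the names of the data frame. The helper pick_column below is a hypothetical name used only for this sketch.

```r
# Hypothetical helper: return the first column of `df` whose name
# appears in `candidates` (exact match, no regular expressions).
pick_column <- function(df, candidates) {
  hit <- intersect(candidates, names(df))
  if (length(hit) == 0) {
    stop("none of the candidate columns found")
  }
  df[[hit[1]]]
}

A <- data.frame(column1 = 1:3, column2 = 4:6)
B <- data.frame(col1 = 4:6, col2 = 1:3)

pick_column(A, c("column1", "col2"))
pick_column(B, c("column1", "col2"))
```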
I am trying to train a model on data that was converted from a document term matrix to a data frame. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that differentiates the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my data frame), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
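Applied to the positive/negative case from the question, this could look like the sketch below. The two small data frames pos and neg are made-up stand-ins for the converted document term matrices.

```r
# Made-up stand-ins for the two converted document term matrices
pos <- data.frame(hello = c(1, 0), great = c(0, 2))
neg <- data.frame(hello = c(0, 1), awful = c(3, 0))

# Tag every column with the field it came from
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))

# The shared word "hello" no longer clashes after combining
combined <- cbind(pos, neg)
names(combined)
```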
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to prepend the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it renames the columns by reference, so it doesn't create a copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
As a continuation of this question I am now looking for a way to mark only non-unique variables from the y-data-frame when I merge.
The default for suffixes is to look for a vector that has the length of two.
Say I have this list,
a <- list(A = data.frame(id = c(01, 02), a=runif(2), b=runif(2)),
B = data.frame(id = c(01, 02), b=runif(2), c=runif(2)),
C = data.frame(id = c(01, 02), c=runif(2), d=runif(2)))
a
$A
id a b
1 1 0.6922513 0.9966336
2 2 0.9216164 0.8256744
$B
id b c
1 1 0.2242940 0.7058331
2 2 0.4474754 0.9228213
$C
id c d
1 1 0.969796 0.1761250
2 2 0.633697 0.6618188
Then I do some customization where I merge some of the data frames together one by one, here exemplified by taking out one data frame,
df <- a[[1]]
a <- a[setdiff(names(a), names(a[1]))]
then I merge the list in this way,
for(i in seq_along(a)) {
v <- a[[i]] # extract value
ns <- names(a)
n <- ns[[i]] # extract name
df <-merge(df, v, by.x="id", by.y="id", all.x=T,
suffixes=paste(".", n, sep = ""))
}
df
id a b.B bNA c.C cNA d
1 1 0.6922513 0.9966336 0.2242940 0.7058331 0.969796 0.1761250
2 2 0.9216164 0.8256744 0.4474754 0.9228213 0.633697 0.6618188
The issue is, as shown above, that R adds a mark to both non-unique variables, but as I only supplied one name n I get an NA on the 'other' variable. In the example above I get a .B suffix on the variable from the A data frame.
Is there a way I can either add the correct data frame name to both variables or (preferred) exclusively mark y's variables when merging?
This was a fun little puzzle. I cheated and "borrowed" heavily from Hadley's merge_recurse function in the reshape package:
merge_recurse1 <- function (dfs, ...)
{
  n <- length(dfs)
  if (length(dfs) == 2) {
    # base case: merge the two remaining data frames, suffixing only y's columns
    merge(dfs[[1]], dfs[[2]], all = TRUE, sort = FALSE,
          suffixes = c('', names(dfs)[2]), ...)
  }
  else {
    # merge everything but the last data frame first, then add the last one
    merge(Recall(dfs[-n], ...), dfs[[n]], all = TRUE, sort = FALSE,
          suffixes = c('', names(dfs)[n]), ...)
  }
}
> merge_recurse1(a,by = "id")
id a b bB c cC d
1 1 0.2536158 0.6083147 0.3060572 0.1428531 0.6403072 0.4621454
2 2 0.9839910 0.7256161 0.2203161 0.6653415 0.1496376 0.8767888
In addition to the suffix changes I made, I found I need to add a ... argument to Recall just to get merge_recurse to work the way I thought it should. Not sure if that's a bug or if I'm just misunderstanding the function.
Sorry... It took me a little while to understand your problem. But, you're... like... 99% there.
Change the argument:
suffixes = paste(".", n, sep = "")
to:
suffixes = c("", paste(".", n, sep = ""))
And you should be OK. By doing this, I got a df that looks like this:
> df
id a b b.B c c.C d
1 1 -0.6039805 0.08297807 0.06426459 2.787147 -0.9566280 -0.36054991
2 2 -0.1694382 -0.95296450 0.37144139 -1.346691 0.7072892 0.09239593
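Put together, the corrected loop could look like the following sketch. It uses fixed numbers instead of runif() so the result is reproducible, and the variable name merged is chosen only for this illustration.

```r
# Same structure as the list in the question, with fixed values
a <- list(
  A = data.frame(id = 1:2, a = c(0.1, 0.2), b = c(0.3, 0.4)),
  B = data.frame(id = 1:2, b = c(0.5, 0.6), c = c(0.7, 0.8)),
  C = data.frame(id = 1:2, c = c(0.9, 1.0), d = c(1.1, 1.2))
)

merged <- a[[1]]
for (n in names(a)[-1]) {
  # An empty first suffix leaves x's columns unmarked;
  # only the columns coming from y get ".<name>" appended.
  merged <- merge(merged, a[[n]], by = "id", all.x = TRUE,
                  suffixes = c("", paste0(".", n)))
}
names(merged)
```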
By the way, instead of all of this, did you try some of the other recommendations from earlier Stack Overflow posts? Somewhere I remember seeing something using Reduce, which got me to this partial solution (with your original "a" data):
Reduce(function(x, y) merge(x, y, by="id", all=TRUE, suffixes=c("", "_2")),
a, accumulate=FALSE)
which gives you output like:
id a b b_2 c c_2 d
1 1 -0.6039805 0.08297807 0.06426459 2.787147 -0.9566280 -0.36054991
2 2 -0.1694382 -0.95296450 0.37144139 -1.346691 0.7072892 0.09239593
Are either of these more useful or closer to what you are looking for?