Using variable value as column name in data.frame or cbind - r

Is there a way in R to have a variable evaluated as a column name when creating a data frame (or in similar situations like using cbind)?
For example
a <- "mycol";
d <- data.frame(a=1:10)
this creates a data frame with one column named a rather than mycol.
This is less important than the case that would help me remove quite a few lines from my code:
a <- "mycol";
d <- cbind(some.dataframe, a=some.sequence)
My current code has the tortured:
names(d)[dim(d)[2]] <- a;
which is aesthetically barftastic.

> d <- setNames( data.frame(a=1:10), a)
> d
mycol
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10

Is structure(data.frame(1:10),names="mycol") aesthetically pleasing to you? :-)

Just use colnames after creation, e.g.:
a <- "mycolA"
b <- "mycolB"
d <- data.frame(a = 1:10, b = rnorm(10))
colnames(d) <- c(a, b)
d
   mycolA     mycolB
1       1 -1.5873866
2       2 -0.4195322
3       3 -0.9511075
4       4  0.2259858
5       5 -0.6619433
6       6  3.4669774
7       7  0.4087541
8       8 -0.3891437
9       9 -1.6163175
10     10  0.7642909

Simple solution:
df <- data.frame(1:5, letters[1:5])
logics <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
cities <- c("Warsaw","London","Paris","NY","Tokio")
m <- as.matrix(logics)
m2 <- as.matrix(cities)
name <- "MyCities"
# name the column after the variable itself ("logics")
colnames(m) <- deparse(substitute(logics))
# name the column after the string stored in `name`
colnames(m2) <- name
df <- cbind(df, m)
cbind(df, m2)
  X1.5 letters.1.5. logics MyCities
1    1            a   TRUE   Warsaw
2    2            b   TRUE   London
3    3            c  FALSE    Paris
4    4            d  FALSE       NY
5    5            e   TRUE    Tokio
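For the cbind case in the question, a minimal sketch of two ways to attach a column whose name is held in a variable, without renaming afterwards. The contents of some.dataframe and some.sequence are made-up stand-ins, since the question does not show them.

```r
a <- "mycol"

# Hypothetical stand-ins for the objects named in the question
some.dataframe <- data.frame(id = 1:3)
some.sequence  <- c(10, 20, 30)

# Creating a data frame: setNames() applies the name stored in `a`
d1 <- setNames(data.frame(1:10), a)

# Adding a column: [[<- evaluates `a`, so the new column is called "mycol"
d2 <- some.dataframe
d2[[a]] <- some.sequence

# cbind variant: wrap the new column in a list named via setNames()
d3 <- cbind(some.dataframe, setNames(list(some.sequence), a))

names(d2)
names(d3)
```

The [[<- form modifies a copy in place and reads naturally in a script; the cbind form keeps everything in a single expression.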

Related

Dynamic column rename based on a separate data frame in R

Generate df1 and df2 like this
pro <- c("Hide-Away", "Hide-Away")
sourceName <- c("New Rate2", "FST")
standardName <- c("New Rate", "SFT")
df1 <- data.frame(pro, sourceName, standardName, stringsAsFactors = FALSE)
A <- 1; B <- 2; C <-3; D <- 4; G <- 5; H <- 6; E <-7; FST <-8; Z <-8
df2<- data.frame(A,B,C,D,G,H,E,FST)
colnames(df2)[1]<- "New Rate2"
Then run this code.
df1 <- df1[,c(2,3)]
index<-which(colnames(df2) %in% df1[,1])
index2<-which(df1[,1] %in% colnames(df2) )
colnames(df2)[index] <- df1[index2,2]
The input of DF2 will be like
New Rate2 B C D G H E FST
1 2 3 4 5 6 7 8
The output of DF2 will be like
New Rate B C D G H E SFT
1 2 3 4 5 6 7 8
So clearly the code worked and swapped the names correctly. But now create df2 with the code below instead, and make sure to regenerate df1 to what it was before.
df2<- data.frame(FST,B,C,D,G,H,E,Z)
colnames(df2)[8]<- "New Rate2"
and then run
df1 <- df1[,c(2,3)]
index<-which(colnames(df2) %in% df1[,1])
index2<-which(df1[,1] %in% colnames(df2) )
colnames(df2)[index] <- df1[index2,2]
The input of df2 will be
FST B C D G H E New Rate2
8 2 3 4 5 6 7 8
The output of df2 will be
New Rate B C D G H E SFT
8 2 3 4 5 6 7 8
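So the new names end up on the wrong columns: FST became New Rate and New Rate2 became SFT. I know this is because of how I use %in%, which only tests membership and ignores which source name matched, but I am not sure of an easy fix that makes the renaming more dynamic.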
I am not totally sure about the question, as it seems a little vague. I'll try my best though--the best way I know to dynamically set column names is setnames from the data.table package. So let's say that I have a set of source names and a set of standard names, and I want to swap the source for the standard (which I take to be the question).
Given the data above, I have a data.frame structured like so:
> df2
A B C D G H E FST
1 1 2 3 4 5 6 7 8
as well as two vectors, sourceName and standardName.
sourceName <- c("A", "FST")
standardName <- c("New A", "FST 2: Electric Boogaloo")
I want to dynamically swap sourceName for standardName, and I can do this with setnames like so:
library(data.table)
df3 <- as.data.table(df2)
setnames(df3, sourceName, standardName)
> df3
New A B C D G H E FST 2: Electric Boogaloo
1: 1 2 3 4 5 6 7 8
Trying to follow your example, in your second pass I get an empty index,
> df2
New Rate B C D G H E SFT
1 8 2 3 4 5 6 7 8
> df1
sourceName standardName
1 New Rate2 New Rate
2 FST SFT
> index<-which(colnames(df2) %in% df1[,1])
> index
integer(0)
which would account for your expected ordering on assignment to column names.
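A base R sketch of an order-independent rename, assuming the df1 and df2 from the second scenario in the question. It swaps %in% for match(), which returns for each column name the row of df1 it matched, so every column receives its own standard name; pos and hit are illustrative variable names.

```r
# Lookup table and data frame as in the question's second scenario
df1 <- data.frame(sourceName   = c("New Rate2", "FST"),
                  standardName = c("New Rate", "SFT"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(FST = 8, B = 2, C = 3, D = 4, G = 5, H = 6, E = 7, Z = 8)
colnames(df2)[8] <- "New Rate2"

# For each column name, the matching row of df1 (NA when there is none)
pos <- match(colnames(df2), df1$sourceName)
hit <- !is.na(pos)

# Rename only the matched columns, each with its own standard name
colnames(df2)[hit] <- df1$standardName[pos[hit]]
colnames(df2)
```

Because the lookup is done per column, the result does not depend on the order of the rows in df1 or of the columns in df2.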

Sum the values of a 2 dimensional table according to labels in R

Coming from Sum the values according to labels in R.
I've been told that working with 2-dimensional tables is significantly different from working with 1-dimensional ones. For example, given:
a a,b a,b,c c
d 5 2 1 2
d,e 2 1 1 1
And we want to achieve:
a b c
d 12 5 5
e 4 2 2
So how can this be achieved using R?
A little bit convoluted, but it should work :
m <- as.matrix(data.frame('a'=c(5,2),'a,b'=c(2,1),
                          'a,b,c'=c(1,1),'c'=c(2,1),
                          check.names = FALSE,row.names=c('d','d,e')))
colNamesSplits <- strsplit(colnames(m),',')
rowNamesSplits <- strsplit(rownames(m),',')
colNms <- unique(unlist(colNamesSplits))
rowNms <- unique(unlist(rowNamesSplits))
colIdxs <- unlist(sapply(1:length(colNamesSplits),
function(i) rep.int(i,length(colNamesSplits[[i]]))))
rowIdxs <- unlist(sapply(1:length(rowNamesSplits),
function(i) rep.int(i,length(rowNamesSplits[[i]]))))
colIdxsMapped <- unlist(sapply(colNamesSplits, function(n) match(n,colNms)))
rowIdxsMapped <- unlist(sapply(rowNamesSplits, function(n) match(n,rowNms)))
# let's create the fully expanded matrix
expanded <- as.matrix(m[rowIdxs,colIdxs])
rownames(expanded) <- rowNms[rowIdxsMapped]
colnames(expanded) <- colNms[colIdxsMapped]
# aggregate expanded by cols :
expanded <- do.call(cbind,lapply(split(1:ncol(expanded),colnames(expanded)),
function(ii) rowSums(expanded[,ii,drop=FALSE])))
# aggregate expanded by rows :
expanded <- do.call(rbind,lapply(split(1:nrow(expanded),rownames(expanded)),
function(ii) colSums(expanded[ii,,drop=FALSE])))
> expanded
a b c
d 12 5 5
e 4 2 2
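The same aggregation can also be sketched with indicator matrices. This is an alternative formulation, not the answer's method: membership() is a hypothetical helper that marks which single labels occur in each composite label, and two matrix products then add every cell to each row and column label it belongs to.

```r
# Same table as above, built directly as a matrix
m <- matrix(c(5, 2, 2, 1, 1, 1, 2, 1), nrow = 2,
            dimnames = list(c("d", "d,e"), c("a", "a,b", "a,b,c", "c")))

# Hypothetical helper: rows are the single labels, columns the composite
# labels; an entry is 1 when the composite label contains the single label
membership <- function(labels) {
  parts <- strsplit(labels, ",", fixed = TRUE)
  keys  <- unique(unlist(parts))
  ind   <- sapply(parts, function(p) as.numeric(keys %in% p))
  matrix(ind, nrow = length(keys), dimnames = list(keys, labels))
}

# Sum over rows, then over columns, via matrix multiplication
result <- membership(rownames(m)) %*% m %*% t(membership(colnames(m)))
result
```

For the example table this gives the same 2 x 3 result as the expanded-matrix approach, with rows d, e and columns a, b, c.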

R Dataframe comparison which, scaling bad

The idea is to extract, for each character value in a df column, its position in a reference vector. Example:
L<-LETTERS[1:25]
A<-c(1:25)
df<-data.frame(L,A)
Compare <- LETTERS[sample(1:25, 25)]
df[] <- lapply(df, as.character)
for (i in 1:nrow(df)){
df[i,1]<-which(df[i,1]==Compare)
}
head(df)
L A
1 14 1
2 12 2
3 2 3
This works well but scales very badly, like all for loops. Any ideas using apply or dplyr?
Thanks
Just use match
Your data (use set.seed when providing data using sample)
df <- data.frame(L = LETTERS[1:25], A = 1:25)
set.seed(1)
Compare <- LETTERS[sample(1:25, 25)]
Solution
df$L <- match(df$L, Compare)
head(df)
# L A
# 1 10 1
# 2 23 2
# 3 12 3
# 4 11 4
# 5 5 5
# 6 21 6
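As a quick sanity check, the loop from the question and match() can be run on the same input and compared. This is only a sketch; loop_result and match_result are illustrative names.

```r
set.seed(1)
Compare <- LETTERS[sample(1:25, 25)]
L <- LETTERS[1:25]

# Loop version, as in the question: one which() call per element
loop_result <- integer(length(L))
for (i in seq_along(L)) {
  loop_result[i] <- which(L[i] == Compare)
}

# Vectorised version: a single call to match()
match_result <- match(L, Compare)

identical(loop_result, match_result)
```

Both give the position of each letter in Compare, but match() does the lookup in one vectorised call instead of one scan of Compare per row.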

pair-wise duplicate removal from dataframe [duplicate]

This seems like a simple problem but I can't seem to figure it out. I'd like to remove duplicates from a dataframe (df) if two columns have the same values, even if those values are in the reverse order. What I mean is, say you have the following data frame:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)
a b
1 A A
2 A B
3 A B
4 B C
5 B A
6 B A
7 C B
8 C B
If I now remove duplicates, I get the following data frame:
df[duplicated(df),]
a b
3 A B
6 B A
8 C B
However, I would also like to remove the row 6 in this data frame, since "A", "B" is the same as "B", "A". How can I do this automatically?
Ideally I could specify which two columns to compare since the data frames could have varying columns and can be quite large.
Thanks!
Extending Ari's answer so that you can specify which columns to check when the data frame also has other columns:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)
df$c = sample(1:10,8)
df$d = sample(LETTERS,8)
df
a b c d
1 A A 10 B
2 A B 8 S
3 A B 7 J
4 B C 3 Q
5 B A 2 I
6 B A 6 U
7 C B 4 L
8 C B 5 V
cols = c(1,2)
newdf = df[,cols]
for (i in 1:nrow(df)){
newdf[i, ] = sort(df[i,cols])
}
df[!duplicated(newdf),]
a b c d
1 A A 10 B
2 A B 8 S
4 B C 3 Q
One solution is to first sort each row of df:
for (i in 1:nrow(df))
{
df[i, ] = sort(df[i, ])
}
df
a b
1 A A
2 A B
3 A B
4 B C
5 A B
6 A B
7 B C
8 B C
At that point it's just a matter of removing the duplicated elements:
df = df[!duplicated(df),]
df
a b
1 A A
2 A B
4 B C
As thelatemail mentioned in the comments, your code actually keeps the duplicates. You need to use !duplicated to remove them.
The other answers use a for loop to assign a value for each and every row. While this is not an issue if you have 100 rows, or even a thousand, you're going to be waiting a while if you have large data of the order of 1M rows.
Stealing from the other linked answer using data.table, you could try something like:
df[!duplicated(data.frame(list(do.call(pmin,df),do.call(pmax,df)))),]
A comparison benchmark with a larger dataset (df2):
df2 <- df[sample(1:nrow(df),50000,replace=TRUE),]
system.time(
df2[!duplicated(data.frame(list(do.call(pmin,df2),do.call(pmax,df2)))),]
)
# user system elapsed
# 0.07 0.00 0.06
system.time({
for (i in 1:nrow(df2))
{
df2[i, ] = sort(df2[i, ])
}
df2[!duplicated(df2),]
}
)
# user system elapsed
# 42.07 0.02 42.09
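The pmin/pmax idea can be combined with the question's request to choose which two columns to compare. A minimal sketch, assuming both columns can be compared as character strings; dedupe_unordered is a hypothetical helper name, and the extra column c is made-up example data.

```r
# Hypothetical helper: drop rows whose (col1, col2) pair has already been
# seen in either order, keeping all other columns of the first occurrence
dedupe_unordered <- function(data, col1, col2) {
  x <- as.character(data[[col1]])
  y <- as.character(data[[col2]])
  # Order each pair so that (A, B) and (B, A) produce the same key
  key <- data.frame(lo = pmin(x, y), hi = pmax(x, y))
  data[!duplicated(key), , drop = FALSE]
}

a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c("A", "B", "B", "C", "A", "A", "B", "B")
df <- data.frame(a, b, c = 1:8)

out <- dedupe_unordered(df, "a", "b")
out
```

Converting with as.character() keeps the helper usable when the columns are factors, which pmin and pmax do not accept unless they are ordered.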
Using apply will be a better option than loops.
newDf <- data.frame(t(apply(df,1,sort)))
All you need to do now is remove duplicates.
newDf <- newDf[!duplicated(newDf),]
