Renaming duplicate strings in R

I have an R data frame with two columns of strings. One of the columns (say, Column1) contains duplicate values. I need to relabel that column so that the duplicated strings get ordered suffixes, as in Column1.new below:
Column1 Column2 Column1.new
1 A 1_1
1 B 1_2
2 C 2_1
2 D 2_2
3 E 3
4 F 4
Any ideas of how to do this would be appreciated.
Cheers,
Antti

Let's say your data (ordered by Column1) is stored in an object called tab. First create a run-length encoding object:
c1.rle <- rle(tab$Column1)
c1.rle
##lengths: int [1:4] 2 2 1 1
##values : int [1:4] 1 2 3 4
That gives you the values of Column1 and the corresponding number of appearances of each element. Then use that information to create the new column with unique identifiers:
tab$Column1.new <- paste0(rep(c1.rle$values, times = c1.rle$lengths), "_",
                          unlist(lapply(c1.rle$lengths, seq_len)))
Not sure if this is appropriate in your situation, but you could also just paste together Column1 and Column2 to create a unique identifier...
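For that paste idea, a minimal sketch (my addition, assuming the same tab object as above; the new column name is just illustrative):
# sketch only: combine the two columns into a single identifier, which is unique
# as long as each (Column1, Column2) pair occurs only once
tab$Column1.new <- paste(tab$Column1, tab$Column2, sep = "_")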

May be a little more of a workaround, but parts of this may be more useful and simpler for someone with slightly different needs. make.names with unique=TRUE appends a dot and a number to names that are repeated:
x <- make.names(tab$Column1, unique = TRUE)
> print(x)
[1] "X1" "X1.1" "X2" "X2.1" "X3" "X4"
This might be enough for some folks. Next, grab the first entries of elements that are repeated (but not elements that appear only once) and add a .0 to the end.
library(stringr)
y <- rle(tab$Column1)
tmp <- !duplicated(tab$Column1) & (tab$Column1 %in% y$values[y$lengths > 1])
x[tmp] <- str_replace(x[tmp], "$", "\\.0")
> print(x)
[1] "X1.0" "X1.1" "X2.0" "X2.1" "X3" "X4"
Replace the dots and remove the X
x <- str_replace(x,"X","")
x <- str_replace(x,"\\.","_")
> print(x)
[1] "1_0" "1_1" "2_0" "2_1" "3" "4"
Might be good enough for you. But if you want the indexing to start at 1, grab the numbers, add one, then put them back:
z <- str_match(x,"_([0-9]*)$")[,2]
z <- as.character(as.numeric(z)+1)
x <- str_replace(x,"_([0-9]*)$",paste0("_",z))
> print(x)
[1] "1_1" "1_2" "2_1" "2_2" "3" "4"
Like I said, more of a workaround here, but gives some options.

d <- read.table(text='Column1 Column2
1 A
1 B
2 C
2 D
3 E
4 F', header=TRUE)
transform(d,
  Column1.new = ifelse(duplicated(Column1) | duplicated(Column1, fromLast = TRUE),
                       paste(Column1, ave(Column1, Column1, FUN = seq_along), sep = '_'),
                       Column1))
# Column1 Column2 Column1.new
# 1 1 A 1_1
# 2 1 B 1_2
# 3 2 C 2_1
# 4 2 D 2_2
# 5 3 E 3
# 6 4 F 4

Building on @Cão's answer, using only base R:
x <- read.table(text="
Column1 Column2 #Column1.new
1 A #1_1
1 B #1_2
2 C #2_1
2 D #2_2
3 E #3
4 F #4", stringsAsFactors = FALSE, header = TRUE)
string <- x$Column1
mstring <- make.unique(as.character(string))
mstring <- sub("(.*)(\\.)([0-9]+)", "\\1_\\3", mstring)
y <- rle(string)
tmp <- !duplicated(string) & (string %in% y$values[y$lengths > 1])
mstring[tmp] <- gsub("(.*)", "\\1_0", mstring[tmp])
end <- sub(".*_([0-9]+)", "\\1", grep("_([0-9]*)$", mstring, value = TRUE))
beg <- sub("(.*_)[0-9]+", "\\1", grep("_([0-9]*)$", mstring, value = TRUE))
newend <- as.numeric(end) + 1
mstring[grep("_([0-9]*)$", mstring)] <- paste0(beg, newend)
x$Column1New <- mstring
x

It's a very old post, and I am probably missing something obvious, but what is wrong with(?):
tab$Column1 <- make.unique(tab$Column1, sep = "_")
Though I believe this requires character input.
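A minimal sketch of that idea with the character coercion made explicit (my addition, not from the original answer; note that make.unique leaves the first occurrence of each duplicate unsuffixed, e.g. "1" "1_1" rather than "1_1" "1_2"):
# sketch: coerce to character first, since make.unique() requires a character vector
tab$Column1.new <- make.unique(as.character(tab$Column1), sep = "_")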

Related

How to correct/standardize variable names if their format is not consistent

I am writing a script that loads RData files containing the results of earlier experiments and parses the data frames saved in them. I've noticed that the names of variables are not consistent; for instance, sometimes symbol is called gene_name or gene_symbol. The order of variables also differs between the data frames, so I can't simply rename them all with colnames(df) <- c('a', 'b', ...).
I'm looking for a way to rename variables based on their name that won't give an error if that variable isn't found. The below is what I want to do, but (ideally) without needing dozens of conditional statements.
if ('gene_name' %in% colnames(df)) {
  df <- df %>% dplyr::rename('symbol' = gene_name)
}
In the below example, I'd like to find an elegant way to rename the variable b to D that I can use safely on data frames that lack a variable b
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
dfs <- list(x,y)
dfs.fixed <- lapply(dfs, function(x) ?????)
Desired result:
dfs.fixed
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Try this approach:
STEP 1
A function substituting a list of colnames with another string (both info parameterized):
colnames_rep <- function(df, to_find, to_sub) {
  colnames(df)[which(colnames(df) %in% to_find)] <- to_sub
  return(df)
}
STEP 2
Use lapply to apply the function over each data.frame:
lapply(dfs,colnames_rep,to_find=c("b"),to_sub="D")
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Thanks to divibisan for the suggestion
We can use rename_at with map. Selecting the columns with vars(matches(...)) means data frames without a b column are simply left unchanged:
library(dplyr)
library(purrr)
map(dfs, ~ .x %>%
       rename_at(vars(matches("^b$")), sub, pattern = "^b$", replacement = "D"))
#[[1]]
# a D
#1 1 4
#2 2 5
#3 3 6
#[[2]]
# a c
#1 1 4
#2 2 5
#3 3 6
Here's an approach that is similar in concept to Terru_theTerror's, but extends it by allowing regular expressions. It might be overkill, but ...
First, we define a simple "map" that maps to the desired name (first string in each vector of the list) from any string (remaining strings in each vector). The function that does the matching accepts an argument of fixed=FALSE, in which case the 2nd and remaining strings can be regular expressions, which gives more power and responsibility.
If using fixed=TRUE (the default), then the map might look like this:
colnamemap <- list(
c("symbol", "gene_name", "gene_symbol"),
c("D", "c", "quux"),
c("bbb", "b", "ccc")
)
where "gene_name" and "gene_symbol" will both be changed to "symbol", etc. If you want to use patterns (fixed=FALSE), however, you should be as specific as possible to preclude mis- or multiple-matches (across columns).
colnamemapptn <- list(
c("symbol", "^gene_(name|symbol)$"),
c("D", "^D$", "^c$", "^quux$"),
c("bbb", "^b$", "^ccc$")
)
The function that does the actual remapping:
fixfunc <- function(df, namemap, fixed = TRUE, ignore.case = FALSE) {
  compare <- if (fixed) `%in%` else grepl
  downcase <- if (ignore.case) tolower else c
  newcn <- cn <- colnames(df)
  newnames <- sapply(namemap, `[`, 1L)
  matches <- sapply(namemap, function(nmap) {
    apply(outer(downcase(nmap[-1]), downcase(cn), Vectorize(compare)), 2, any)
  }) # dims: 1=cn; 2=map-to
  for (j in seq_len(ncol(matches))) {
    if (sum(matches[,j]) > 1) {
      warning("rule ", sQuote(newnames[j]), " matches multiple columns: ",
              paste(sQuote(cn[ matches[,j] ]), collapse=","))
      matches[,j] <- FALSE
    }
  }
  for (i in seq_len(nrow(matches))) {
    rowmatches <- sum(matches[i,])
    if (rowmatches == 1) {
      newcn[i] <- newnames[ matches[i,] ]
    } else if (rowmatches > 1) {
      warning("column ", sQuote(cn[i]), " matches multiple rules: ",
              paste(sQuote(newnames[ matches[i,] ]), collapse=","))
      matches[i,] <- FALSE
    }
  }
  if (any(matches)) colnames(df) <- newcn
  df
}
(You might extend it to ensure uniqueness, using make.names and/or make.unique; see the sketch at the end of this answer. There's also ignore.case, not really tested here but easily done, I believe.)
I'm going to extend your sample data by including one that will match multiple patterns resulting in ambiguity:
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
z <- data.frame('cc' = 1:3, 'ccc' = 2:4)
dfs <- list(x,y,z)
where the third data.frame has two columns that match my third non-pattern vector. When there are multiple matches, I think the safer thing to do is warn about it and change none of them.
This is correct, fixed-strings only:
lapply(dfs, fixfunc, colnamemap, fixed=TRUE)
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
This incorrectly uses the strings as patterns, which causes one of them to warn about multiple matches:
lapply(dfs, fixfunc, colnamemap, fixed=FALSE)
# Warning in FUN(X[[i]], ...) :
# rule 'D' matches multiple columns: 'cc','ccc'
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
A better use of fixed=FALSE, with strict patterns instead:
lapply(dfs, fixfunc, colnamemapptn, fixed=FALSE)
# same output as the first call
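As a sketch of the uniqueness extension mentioned in the parenthetical above (my addition, assuming the fixfunc, colnamemap and dfs objects defined in this answer): run the remapping first, then force the resulting column names to be unique.
fixed <- lapply(dfs, fixfunc, colnamemap, fixed = TRUE)
# post-process: if two rules happened to map different columns to the same new name,
# make.unique() appends suffixes so the names stay distinct
fixed <- lapply(fixed, function(df) {
  colnames(df) <- make.unique(colnames(df), sep = "_")
  df
})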

Combine a list of data frames into one preserving row names

I do know about the basics of combining a list of data frames into one, as has been answered before. However, I am interested in smart ways to maintain row names. Suppose I have a list of fairly similar data frames that I keep in a named list.
library(plyr)
library(dplyr)
library(data.table)
a = data.frame(x=1:3, row.names = letters[1:3])
b = data.frame(x=4:6, row.names = letters[4:6])
c = data.frame(x=7:9, row.names = letters[7:9])
l = list(A=a, B=b, C=c)
When I use do.call, the list names are combined with the row names:
> rownames(do.call("rbind", l))
[1] "A.a" "A.b" "A.c" "B.d" "B.e" "B.f" "C.g" "C.h" "C.i"
When I use any of rbind.fill, bind_rows or rbindlist the row names are replaced by a numeric range:
> rownames(rbind.fill(l))
> rownames(bind_rows(l))
> rownames(rbindlist(l))
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
When I remove the names from the list, do.call produces the desired output:
> names(l) = NULL
> rownames(do.call("rbind", l))
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i"
So is there a function that I'm missing that provides some finer control over the row names? I do need the names for a different context so removing them is sub-optimal.
To preserve rownames, you can simply do:
do.call(rbind, unname(l))
# x
#a 1
#b 2
#c 3
#d 4
#e 5
#f 6
#g 7
#h 8
#i 9
Or, as you pointed out, by setting the names of l to NULL, this can also be done with:
do.call(rbind, setNames(l, NULL))
We can use add_rownames from dplyr package before binding:
rbind_all(lapply(l, add_rownames))
# Source: local data frame [9 x 2]
#
# rowname x
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
# 5 e 5
# 6 f 6
# 7 g 7
# 8 h 8
# 9 i 9
Why not just use rbind:
rbind(l$A, l$B, l$C)
Here is another solution that I have just found; it works well (and efficiently) when you have a large list and therefore big data frames.
library(dplyr)  # for the %>% pipe
df <- data.table::rbindlist(l)
# add a column with the rownames
df[, Col := unlist(lapply(l, rownames))]
df <- df %>% dplyr::select(Col, everything())
> df
Col x
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
7: g 7
8: h 8
9: i 9
More info about rbindlist can be found in the data.table documentation.

Extract values from list of lists with R

I have list of lists similar to this sample:
z <- list(list(num1=list((list(tab1=list(list(a=1, b=2, c=5), list(a=3, b=4), list(d=4,e=7)))))),list(num2=list((list(tab2=list(list(a=1, b=2), list(a=3, b=4)))))))
I would like to extract the names of the innermost list elements.
Desired output: either a list (since the sub-lists have different lengths), or a data frame with one column per element of the main list:
[1] a b c a b d e
[2] a b a b
dataframe:
column1 column2
a a
b b
c a
a b
b ""
d ""
e ""
I have tried various combinations of sapply(z, "[[", c("a","b",...)) but failed, since the sublist names vary.
EDIT: Sorry, I needed the actual values, not the last node (letters)! Additionally, each numeric value has a column name, not set in the example above; it looks like this:
[[1]]$num1[[1]]$tab1[[1]]$a
Name
1
So the desired solution are values:
[1]
1 2 5 3 4 4 7
[2]
1 2 3 4
I would actually need the numeric values instead of the letters. If you could adjust your solution to this I would be grateful. Thanks.
Try
lapply(z, function(x) as.numeric(unlist(x)))
## [[1]]
## [1] 1 2 5 3 4 4 7
##
## [[2]]
## [1] 1 2 3 4
And for the data frame of names from the original question:
z1 <- lapply(z, function(x) names(unlist(x)))
z1 <- lapply(z1, function(x) gsub(".*\\.", "", x))
n <- max(sapply(z1, length))
z1 <- lapply(z1, `length<-`, value = n)
setNames(as.data.frame(z1), paste0("Column", seq_along(z1)))
# Column1 Column2
#1 a a
#2 b b
#3 c a
#4 a b
#5 b <NA>
#6 d <NA>
#7 e <NA>
A bit far-fetched and anything but elegant, but here is a way to get what you want:
lista <- unlist(lapply(strsplit(names(unlist(z)), "\\."), function(vec) vec[3]))
names(lista) <- unlist(lapply(strsplit(names(unlist(z)), "\\."), function(vec) vec[1]))
uninames <- unique(names(lista))
res <- sapply(uninames, function(x, vec) { vec[names(vec) == x] }, lista)
> res
$num1
num1 num1 num1 num1 num1 num1 num1
"a" "b" "c" "a" "b" "d" "e"
$num2
num2 num2 num2 num2
"a" "b" "a" "b"
UPDATE
To get the numbers :
a <- unlist(z)
# group by the first component of each name ("num1", "num2", ...)
b <- unlist(lapply(strsplit(names(a), "\\."), function(vec) vec[1]))
res <- sapply(unique(b), function(name, vec, l_name) { vec[l_name == name] }, a, b)
>res
$num1
num1.tab1.a num1.tab1.b num1.tab1.c num1.tab1.a num1.tab1.b num1.tab1.d num1.tab1.e
1 2 5 3 4 4 7
$num2
num2.tab2.a num2.tab2.b num2.tab2.a num2.tab2.b
1 2 3 4

Difference between `names(df[1]) <- ` and `names(df)[1] <- `

Consider the following:
df <- data.frame(a = 1, b = 2, c = 3)
names(df[1]) <- "d" ## First method
## a b c
##1 1 2 3
names(df)[1] <- "d" ## Second method
## d b c
##1 1 2 3
Neither method returned an error, but the first didn't change the column name, while the second did.
I thought it had something to do with the fact that I'm operating only on a subset of df, but then why, for example, does the following work fine?
df[1] <- 2
## a b c
##1 2 2 3
What I think is happening is that replacement into a data frame ignores the attributes of the data frame the values are drawn from. I am not 100% sure of this, but the following experiments appear to back it up:
df <- data.frame(a = 1:3, b = 5:7)
# a b
# 1 1 5
# 2 2 6
# 3 3 7
df2 <- data.frame(c = 10:12)
# c
# 1 10
# 2 11
# 3 12
df[1] <- df2[1] # in this case `df[1] <- df2` is equivalent
Which produces:
# a b
# 1 10 5
# 2 11 6
# 3 12 7
Notice how the values changed for df, but not the names. Basically the replacement operator `[<-` only replaces the values. This is why the name was not updated. I believe this explains all the issues.
In the scenario:
names(df[2]) <- "x"
You can think of the assignment as follows (this is a simplification, see end of post for more detail):
tmp <- df[2]
# b
# 1 5
# 2 6
# 3 7
names(tmp) <- "x"
# x
# 1 5
# 2 6
# 3 7
df[2] <- tmp # `tmp` has "x" for names, but it is ignored!
# a b
# 1 10 5
# 2 11 6
# 3 12 7
The last step of which is an assignment with `[<-`, which doesn't respect the names attribute of the RHS.
But in the scenario:
names(df)[2] <- "x"
you can think of the assignment as (again, a simplification):
tmp <- names(df)
# [1] "a" "b"
tmp[2] <- "x"
# [1] "a" "x"
names(df) <- tmp
# a x
# 1 10 5
# 2 11 6
# 3 12 7
Notice how we directly assign to names, instead of assigning to df which ignores attributes.
df[2] <- 2
works because we are assigning directly to the values, not the attributes, so there are no problems here.
EDIT: based on some commentary from @AriB.Friedman, here is a more elaborate version of what I think is going on (note I'm omitting the S3 dispatch to `[.data.frame`, etc., for clarity):
Version 1 names(df[2]) <- "x" translates to:
df <- `[<-`(
  df, 2,
  value = `names<-`( # `names<-` here returns a re-named one-column data frame
    `[`(df, 2),
    value = "x"
  ))
Version 2 names(df)[2] <- "x" translates to:
df <- `names<-`(
  df,
  `[<-`(
    names(df), 2, "x"
  ))
Also, it turns out this is "documented" in R Inferno Section 8.2.34 (thanks @Frank):
right <- wrong <- c(a=1, b=2)
names(wrong[1]) <- 'changed'
wrong
# a b
# 1 2
names(right)[1] <- 'changed'
right
# changed b
# 1 2

match values in dataframes with values in a column

I have two data.frames that looks like these ones:
>df1
V1
a
b
c
d
e
>df2
V1 V2
1 a,k,l
2 c,m,n
3 z,b,s
4 l,m,e
5 t,r,d
I would like to match the values in df1$V1 with those in df2$V2 and add a new column to df1 holding the corresponding value of df2$V1. The desired output would be:
>df1
V1 V2
a 1
b 3
c 2
d 5
e 4
I've tried this approach, but it only works if df2$V2 contains just one element:
match(as.character(df1[,1]), strsplit(as.character(df2[,2]), ",")) -> idx
df1$V2 <- df2[idx,1]
Many thanks
You can just use grep, which will return the position of the string found:
sapply(df1$V1, grep, x = df2$V2)
# a b c d e
# 1 3 2 5 4
If you expect repeats, you can use paste.
Let's modify your data so that there is a repeat:
df2$V2[3] <- "z,b,s,a"
And modify the solution accordingly:
sapply(df1$V1, function(z) paste(grep(z, x = df2$V2), collapse = ";"))
# a b c d e
# "1;3" "3" "2" "5" "4"
Similar to Tyler's answer, but in base using stack:
df.stack <- stack(setNames(strsplit(as.character(df2$V2), ","), df2$V1))
transform(df1, V2=df.stack$ind[match(V1, df.stack$values)])
produces:
V1 V2
1 a 1
2 b 3
3 c 2
4 d 5
5 e 4
One advantage of splitting over grep is that with grep you run the risk of searching for a and matching things like alabama, etc. (though you can be careful with the patterns to mitigate this, e.g. by including word boundaries).
Note this will only find the first matching value.
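If you do want to stick with grep, one way to reduce the partial-match risk mentioned above is to anchor each search term with word boundaries. A sketch (my addition, using the df1 and df2 from the question):
# sketch: "\\ba\\b" matches the token "a" in "a,k,l" but not the "a" in "alabama"
sapply(df1$V1, function(z) grep(paste0("\\b", z, "\\b"), df2$V2))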
Here's an approach:
library(qdap)
key <- setNames(strsplit(as.character(df2$V2), ","), df2$V1)
df1$V2 <- as.numeric(df1$V1 %l% key)
df1
## V1 V2
## 1 a 1
## 2 b 3
## 3 c 2
## 4 d 5
## 5 e 4
First we used strsplit to create a named list. Then we used qdap's lookup operator %l% to match values and create a new column (I converted to numeric though this may not be necessary).
