match values in dataframes with values in a column

I have two data.frames that look like these:
>df1
V1
a
b
c
d
e
>df2
V1 V2
1 a,k,l
2 c,m,n
3 z,b,s
4 l,m,e
5 t,r,d
I would like to match the values in df1$V1 against those in df2$V2 and add a new column to df1 holding the value of df2$V1 for the matching row. The desired output would be:
>df1
V1 V2
a 1
b 3
c 2
d 5
e 4
I've tried this approach, but it only works if df2$V2 contains just one element:
match(as.character(df1[,1]), strsplit(as.character(df2[,2]), ",")) -> idx
df1$V2 <- df2[idx,1]
Many thanks

You can just use grep, which will return the position of the string found:
sapply(df1$V1, grep, x = df2$V2)
# a b c d e
# 1 3 2 5 4
If you expect repeats, you can use paste.
Let's modify your data so that there is a repeat:
df2$V2[3] <- "z,b,s,a"
And modify the solution accordingly:
sapply(df1$V1, function(z) paste(grep(z, x = df2$V2), collapse = ";"))
# a b c d e
# "1;3" "3" "2" "5" "4"

Similar to Tyler's answer, but in base using stack:
df.stack <- stack(setNames(strsplit(as.character(df2$V2), ","), df2$V1))
transform(df1, V2=df.stack$ind[match(V1, df.stack$values)])
produces:
V1 V2
1 a 1
2 b 3
3 c 2
4 d 5
5 e 4
One advantage of splitting over grep is that with grep you run the risk of searching for a and matching things like alabama, etc. (though you can mitigate this by being careful with the patterns, e.g. by including word boundaries).
Note this will only find the first matching value.
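A minimal sketch of a split-based variant that matches whole tokens and reports every matching row rather than only the first. The data mirror the question, with the repeated "a" from the grep answer added; the names tokens, all_ids, grep_hit and token_hit are illustrative only.

```r
df1 <- data.frame(V1 = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
df2 <- data.frame(V1 = 1:5,
                  V2 = c("a,k,l", "c,m,n", "z,b,s,a", "l,m,e", "t,r,d"),
                  stringsAsFactors = FALSE)

# Split each comma-separated string into its individual tokens
tokens <- strsplit(df2$V2, ",", fixed = TRUE)

# For each value, collect the ids of all rows whose tokens contain it exactly
all_ids <- vapply(df1$V1, function(v) {
  hit <- vapply(tokens, function(tk) v %in% tk, logical(1))
  paste(df2$V1[hit], collapse = ";")
}, character(1))
all_ids

# Substring pitfall: grep finds "a" inside "alabama", exact token matching does not
grep_hit  <- length(grep("a", "alabama,x")) > 0
token_hit <- "a" %in% strsplit("alabama,x", ",", fixed = TRUE)[[1]]
```

Here all_ids gives "1;3" for a and a single id for the other values, while grep_hit is TRUE and token_hit is FALSE.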

Here's an approach:
library(qdap)
key <- setNames(strsplit(as.character(df2$V2), ","), df2$V1)
df1$V2 <- as.numeric(df1$V1 %l% key)
df1
## V1 V2
## 1 a 1
## 2 b 3
## 3 c 2
## 4 d 5
## 5 e 4
First we used strsplit to create a named list. Then we used qdap's lookup operator %l% to match values and create a new column (I converted to numeric though this may not be necessary).
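If qdap is not available, the same named-list idea can be sketched in base R by inverting the list into a named lookup vector. This is an assumption-laden illustration, not qdap's implementation; lookup is a made-up name, and like the stack approach it returns only the first match for a repeated value.

```r
df1 <- data.frame(V1 = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
df2 <- data.frame(V1 = 1:5,
                  V2 = c("a,k,l", "c,m,n", "z,b,s", "l,m,e", "t,r,d"),
                  stringsAsFactors = FALSE)

# Named list: names are the ids in df2$V1, elements are the split tokens
key <- setNames(strsplit(as.character(df2$V2), ","), df2$V1)

# Invert it: one entry per token, named by the token, holding the id
lookup <- setNames(rep(names(key), lengths(key)),
                   unlist(key, use.names = FALSE))

# Indexing by name picks the first entry with that name
df1$V2 <- as.numeric(lookup[as.character(df1$V1)])
df1
```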

Related

Extract certain part of name in string

I'm trying to extract a particular part of the names in a column of DF
DF
a b
a.b.c_tot 1
b.c.d_tot 2
d.e.g_tot 3
I need to extract the letter between the last . and _tot, so that
DF
a b c
a.b.c_tot 1 c
b.c.d_tot 2 d
d.e.g_tot 3 g
I suppose it could be done with sub, as I learnt today how to extract the letter before the first ., but how do I extract the "middle" part of the name?
I have been reading the sub documentation, but all my attempts just copy the full name from a to c.
Thank you for any tips.

We can call sub() to match the entire string, starting with (1) any number of any characters, then (2) a literal dot, then (3) use a capture group to capture the following character, then (4) a literal _tot. We can then use the \1 backreference atom (with the backslash properly backslash-escaped as per R's string encoding rules) to replace the entire string with the captured character.
DF$c <- sub('^.*\\.(.)_tot$','\\1',DF$a);
DF;
## a b c
## 1 a.b.c_tot 1 c
## 2 b.c.d_tot 2 d
## 3 d.e.g_tot 3 g
Yes, I see the problem; if DF$a were to contain values that do not match the expected pattern, the sub() call would pass them through to the new DF$c column. Here's a hacky solution using the Perl branch reset feature:
DF <- data.frame(a=c('a.b.c_tot','b.c.d_tot','d.e.g_tot','non-matching'),b=c(1L,2L,3L,4L),stringsAsFactors=F);
DF$c <- sub(perl=T,'(?|^.*\\.(.)_tot$|^.*$())','\\1',DF$a);
DF;
## a b c
## 1 a.b.c_tot 1 c
## 2 b.c.d_tot 2 d
## 3 d.e.g_tot 3 g
## 4 non-matching 4
Here's a better solution, involving storing the regex in a variable in advance, and using grepl() and replace() to replace non-matching values with NA prior to calling sub():
re <- '^.*\\.(.)_tot$';
DF$c <- sub(re,'\\1',replace(DF$a,!grepl(re,DF$a),NA));
DF;
## a b c
## 1 a.b.c_tot 1 c
## 2 b.c.d_tot 2 d
## 3 d.e.g_tot 3 g
## 4 non-matching 4 <NA>

Use regexpr and regmatches with a lookbehind and lookahead regex.
x <- c("a.b.c_tot", "b.c.d_tot", "d.e.g_tot")
regmatches(x, regexpr("(?<=\\.).(?=_tot)", x, perl = TRUE))
#[1] "c" "d" "g"
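One thing to keep in mind is that regmatches() with regexpr() returns only the elements that matched, so the result can be shorter than the input. A small sketch, assuming you want the result aligned with the input and NA where nothing matches (the variable names are illustrative):

```r
x <- c("a.b.c_tot", "non-matching", "d.e.g_tot")

# regexpr returns -1 for elements without a match
m <- regexpr("(?<=\\.).(?=_tot)", x, perl = TRUE)

# Start from all-NA and fill in only the positions that matched
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)
out
# [1] "c" NA  "g"
```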

We can use str_extract
library(stringr)
DF$c <- str_extract(DF$a, "\\w(?=_tot)")
DF$c
#[1] "c" "d" "g"

Match one column of a data.frame with all the columns in another data.frame

I have two data.frames:
DF1
Col1 Col2 ...... ...... Col2000
A H
c d
d e
n b
e A
b n
H c
DF2
A
b
c
d
e
n
H
I simply need to match the single column in DF2 against each column in DF1, because I need to know the exact rank of each match. I tried to write a function, but since I'm not an R expert something goes wrong in my code:
lapply(DF1, function(x) match(DF1[,i], DF2[,1]))

To get a correct result, you need a correct command:
lapply(DF1, function(x) match(x, DF2[,1]))
does what you're trying to do. Take:
DF1 <- data.frame(
Col1 = c('A','c','d','n','e','b','H'),
Col2 = c('H','d','e','b','A','n','c')
)
DF2 <- data.frame(c('A','b','c','d','e','n','H'))
Then:
> lapply(DF1, function(x) match(x, DF2[,1]))
$Col1
[1] 1 3 4 6 5 2 7
$Col2
[1] 7 4 5 2 1 6 3
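With 2000 columns, a matrix may be easier to work with than a list. A minimal sketch using sapply, which simplifies the per-column results into a matrix when they all have the same length (res is an illustrative name):

```r
DF1 <- data.frame(
  Col1 = c("A", "c", "d", "n", "e", "b", "H"),
  Col2 = c("H", "d", "e", "b", "A", "n", "c")
)
DF2 <- data.frame(c("A", "b", "c", "d", "e", "n", "H"))

# One column of match positions per column of DF1
res <- sapply(DF1, match, table = DF2[, 1])
res
```

Each column of res holds the same positions as the corresponding list element above.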

Numbers as factor after read.delim

I have a data frame that looks like this:
A B C D
1 2 3 4
E F G H
5 6 7 8
I would like to subset only the numeric portion using the following code:
sub_num = DF[sapply(DF, is.numeric)]
The problem is that the numbers are factors after reading the data.frame using read.delim. If I set stringsAsFactors = FALSE the numbers are characters.
This may be a basic problem but I'm not able to solve it.

Try the following instead
sub_num <- DF[!is.na(as.numeric(sapply(DF, as.character)))[1:nrow(DF)], ]
# V1 V2 V3 V4
# 2 1 2 3 4
# 4 5 6 7 8
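As for your sapply statement, sapply(DF, is.numeric) tests each column as a whole, and every column here is a factor (or character), so it returns FALSE for all of them. Wrapping the column in as.character does not help either, because is.numeric of a character vector is always FALSE:
sapply(DF, function(X) is.numeric(as.character(X)))
You would have to test the converted values instead, e.g. !is.na(as.numeric(as.character(X))), but that returns a matrix of per-cell flags and would not index your DF as you would expect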
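A sketch of one way to keep only the fully numeric rows and end up with real numeric columns. It assumes the file was read without a header, so the columns are V1 to V4 as in the output above; chr, num and keep are illustrative names.

```r
DF <- data.frame(V1 = c("A", "1", "E", "5"),
                 V2 = c("B", "2", "F", "6"),
                 V3 = c("C", "3", "G", "7"),
                 V4 = c("D", "4", "H", "8"),
                 stringsAsFactors = TRUE)

# Convert factors to a character matrix, then to numeric (letters become NA)
chr <- sapply(DF, as.character)
num <- suppressWarnings(
  matrix(as.numeric(chr), nrow = nrow(chr), dimnames = dimnames(chr))
)

# Keep the rows in which every cell parsed as a number
keep <- rowSums(is.na(num)) == 0
sub_num <- as.data.frame(num[keep, , drop = FALSE])
sub_num
```

Unlike indexing the original DF, this returns numeric columns rather than factors.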

Renaming duplicate strings in R

I have an R dataframe that has two columns of strings. In one of the columns (say, Column1) there are duplicate values. I need to relabel that column so that the duplicated strings are renamed with ordered suffixes, as in Column1.new below:
Column1 Column2 Column1.new
1 A 1_1
1 B 1_2
2 C 2_1
2 D 2_2
3 E 3
4 F 4
Any ideas of how to do this would be appreciated.
Cheers,
Antti

Let's say your data (ordered by Column1) is within an object called tab. First create a run length object
c1.rle <- rle(tab$Column1)
c1.rle
##lengths: int [1:4] 2 2 1 1
##values : int [1:4] 1 2 3 4
That gives you the values of Column1 and the corresponding number of appearances of each element. Then use that information to create the new column with unique identifiers:
tab$Column1.new <- paste0(rep(c1.rle$values, times = c1.rle$lengths), "_",
unlist(lapply(c1.rle$lengths, seq_len)))
Not sure if this is appropriate in your situation, but you could also just paste together Column1 and Column2 to create a unique identifier...
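Note that the rle code above appends a suffix to every value, so the singletons come out as 3_1 and 4_1. A hedged sketch of a small variation that only suffixes values occurring more than once, which is what the question's Column1.new shows. It still assumes the data are ordered by Column1; suffix is an illustrative name.

```r
tab <- data.frame(Column1 = c(1, 1, 2, 2, 3, 4), Column2 = LETTERS[1:6])

c1.rle <- rle(tab$Column1)

# Runs of length 1 get an empty suffix, longer runs get _1, _2, ...
suffix <- unlist(lapply(c1.rle$lengths, function(n) {
  if (n > 1) paste0("_", seq_len(n)) else ""
}))

tab$Column1.new <- paste0(rep(c1.rle$values, times = c1.rle$lengths), suffix)
tab$Column1.new
# [1] "1_1" "1_2" "2_1" "2_2" "3"   "4"
```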

This may be a little more of a workaround, but parts of it may be more useful and simpler for someone with not quite the same needs. make.names with the unique=T argument appends a dot and a number to names that are repeated:
x <- make.names(tab$Column1,unique=T)
> print(x)
[1] "X1" "X1.1" "X2" "X2.1" "X3" "X4"
This might be enough for some folks. Here you can then grab the first entries of elements that are repeated, but not elements that are not repeated, then add a .0 to the end.
library(stringr)  # provides str_replace and str_match used below
y <- rle(tab$Column1)
tmp <- !duplicated(tab$Column1) & (tab$Column1 %in% y$values[y$lengths>1])
x[tmp] <- str_replace(x[tmp],"$","\\.0")
> print(x)
[1] "X1.0" "X1.1" "X2.0" "X2.1" "X3" "X4"
Replace the dots and remove the X
x <- str_replace(x,"X","")
x <- str_replace(x,"\\.","_")
> print(x)
[1] "1_0" "1_1" "2_0" "2_1" "3" "4"
Might be good enough for you. But if you want the indexing to start at 1, grab the numbers, add one then put them back.
z <- str_match(x,"_([0-9]*)$")[,2]
z <- as.character(as.numeric(z)+1)
x <- str_replace(x,"_([0-9]*)$",paste0("_",z))
> print(x)
[1] "1_1" "1_2" "2_1" "2_2" "3" "4"
Like I said, more of a workaround here, but gives some options.

d <- read.table(text='Column1 Column2
1 A
1 B
2 C
2 D
3 E
4 F', header=TRUE)
transform(d,
Column1.new = ifelse(duplicated(Column1) | duplicated(Column1, fromLast=TRUE),
paste(Column1, ave(Column1, Column1, FUN=seq_along), sep='_'),
Column1))
# Column1 Column2 Column1.new
# 1 1 A 1_1
# 2 1 B 1_2
# 3 2 C 2_1
# 4 2 D 2_2
# 5 3 E 3
# 6 4 F 4

#Cão answer only with base R:
x=read.table(text="
Column1 Column2 #Column1.new
1 A #1_1
1 B #1_2
2 C #2_1
2 D #2_2
3 E #3
4 F #4", stringsAsFactors=F, header=T)
string<-x$Column1
mstring <- make.unique(as.character(string) )
mstring<-sub("(.*)(\\.)([0-9]+)","\\1_\\3",mstring)
y <- rle(string)
tmp <- !duplicated(string) & (string %in% y$values[y$lengths>1])
mstring[tmp]<-gsub("(.*)","\\1_0", mstring[tmp])
end <- sub(".*_([0-9]+)","\\1",grep("_([0-9]*)$",mstring,value=T) )
beg <- sub("(.*_)[0-9]+","\\1",grep("_([0-9]*)$",mstring,value=T) )
newend <- as.numeric(end)+1
mstring[grep("_([0-9]*)$",mstring)]<-paste0(beg,newend)
x$Column1New<-mstring
x

It's a very old post, and I am probably missing something obvious, but what is wrong with(?):
tab$Column1 <- make.unique(tab$Column1, sep="_")
Albeit I believe this requires character input.
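For reference, a quick sketch of what make.unique actually returns on the example data, with as.character added because it does need character input. The first occurrence is left untouched and later ones are numbered from 1, so the labels differ from the Column1.new format asked for in the question (res is an illustrative name):

```r
tab <- data.frame(Column1 = c(1, 1, 2, 2, 3, 4), Column2 = LETTERS[1:6])

# make.unique requires a character vector
res <- make.unique(as.character(tab$Column1), sep = "_")
res
# [1] "1"   "1_1" "2"   "2_1" "3"   "4"
```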

Convert the values in a column into row names in an existing data frame

I would like to convert the values in a column of an existing data frame into row names. Is it possible to do this without exporting the data frame and then reimporting it with a row.names = call?
For example I would like to convert:
> samp
names Var.1 Var.2 Var.3
1 A 1 5 0
2 B 2 4 1
3 C 3 3 2
4 D 4 2 3
5 E 5 1 4
Into:
> samp.with.rownames
Var.1 Var.2 Var.3
A 1 5 0
B 2 4 1
C 3 3 2
D 4 2 3
E 5 1 4

This should do:
samp2 <- samp[,-1]
rownames(samp2) <- samp[,1]
So in short, no, there is no alternative to reassigning.
Edit: Correcting myself, one can also do it in place: assign rowname attributes, then remove column:
R> df<-data.frame(a=letters[1:10], b=1:10, c=LETTERS[1:10])
R> rownames(df) <- df[,1]
R> df[,1] <- NULL
R> df
b c
a 1 A
b 2 B
c 3 C
d 4 D
e 5 E
f 6 F
g 7 G
h 8 H
i 9 I
j 10 J
R>

As of 2016 you can also use the tidyverse.
library(tidyverse)
samp %>% remove_rownames %>% column_to_rownames(var="names")

in one line
> samp.with.rownames <- data.frame(samp[,-1], row.names=samp[,1])

It looks like the one-liner got even simpler along the line (currently using R 3.5.3):
# generate original data.frame
df <- data.frame(a = letters[1:10], b = 1:10, c = LETTERS[1:10])
# use first column for row names
df <- data.frame(df, row.names = 1)
The column used for row names is removed automatically.
With a one-row dataframe
Beware that if the dataframe has a single row, the behaviour might be confusing. As the documentation mentions:
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
This means that, if you use the same command as above, it might look like it did nothing (when it actually named the first row "1", which won't look any different in the viewer).
In that case, you will have to stick to the more verbose:
df <- data.frame(a = "a", b = 1)
df <- data.frame(df, row.names = df[,1])
... but the column won't be removed. Also remember that, if you remove a column to end up with a single-column dataframe, R will simplify it to an atomic vector. In that case, you will want to use the extra drop argument:
df <- data.frame(df[,-1, drop = FALSE], row.names = df[,1])
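A short sketch of the simplification mentioned above, on a hypothetical two-column data frame, showing why drop = FALSE is needed when only one column remains:

```r
df <- data.frame(a = c("x", "y"), b = 1:2, stringsAsFactors = FALSE)

# Without drop = FALSE the remaining single column collapses to a vector
is.data.frame(df[, -1])                # FALSE
is.data.frame(df[, -1, drop = FALSE])  # TRUE

# Keep the data frame structure and use the first column as row names
res <- data.frame(df[, -1, drop = FALSE], row.names = df[, 1])
res
```

The result keeps b as its only column, with x and y as row names.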

You can execute this in 2 simple statements:
row.names(samp) <- samp$names
samp[1] <- NULL
