R - check elements of vector against other vector

I would like to match the elements of a column in a dataframe against another dataframe.
Consider these dataframes:
A=data.frame(par=c('long A story','long C story', 'blabla D'),val=1:3)
B=data.frame(par=c('Z','D','A'),val=letters[1:3])
Each element of B column 'par' should be matched against A column par.
If there is a match, it should be labeled in A.
[This then gives a column of common values for merging A and B].
The desired result is therefore:
A=transform(A,label=c('A','NA','D'))
How can this be done?
Henk

Hi, you can do something like this:
list <- lapply(1:length(B$par),function(x) grep(B$par[x],A$par))
list
[[1]]
integer(0)
[[2]]
[1] 3
[[3]]
[1] 1
label <- rep("NA",length(list))
B$par <-as.character(B$par)
label[unlist(list)] <- B$par[which(list != "integer(0)")]
label
[1] "A" "NA" "D"
A <- transform(A,label=label)
A
           par val label
1 long A story   1     A
2 long C story   2    NA
3     blabla D   3     D
Hope this helps.
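One caveat with grep() here: it treats each key as a regular expression, so a key containing characters such as . or ( would not be matched literally. A minimal sketch, assuming literal matching is what is wanted (the key a.b and the terms below are made up for illustration):

```r
# Hypothetical key and terms, not taken from the question
key   <- "a.b"
terms <- c("long A story", "axb", "a.b test")

regex_hit   <- grepl(key, terms)                # "." matches any character
literal_hit <- grepl(key, terms, fixed = TRUE)  # key is matched literally

regex_hit    # FALSE  TRUE  TRUE
literal_hit  # FALSE FALSE  TRUE
```

With the single-letter keys in the question both calls behave the same, so fixed = TRUE only matters once keys can contain regex metacharacters.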

The approach I thought of:
M <- lapply(strsplit(as.character(A$par), " "), function(x) x[x %in% B$par])
M[sapply(M, function(x) {identical(x, character(0))})] <- NA
A$label <- unlist(M)
A
           par val label
1 long A story   1     A
2 long C story   2  <NA>
3     blabla D   3     D
Microbenchmarked the answers here and here are the results:
Unit: microseconds
       expr      min       lq   median       uq      max
1  EDWARD() 1638.815 1678.934 1698.061 1726.983 4973.823
2   SONAL()  705.348  725.874  734.738  747.334 2085.721
3     TLM()  268.705  281.300  287.831  294.362 1465.744
4 TRINKER()  156.278  168.407  173.538  177.737 1331.391
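Note that the strsplit() approach matches whole space-separated words, whereas the grep()-based answers match substrings, so the two can disagree on other data. A small sketch with a made-up term ("An apple") that shows the difference:

```r
terms <- c("long A story", "An apple", "blabla D")  # "An apple" is hypothetical
key   <- "A"

# Substring match: "A" is also found inside "An"
substring_hit <- grepl(key, terms, fixed = TRUE)

# Whole-word match: split on spaces, then test membership
tokens    <- strsplit(terms, " ", fixed = TRUE)
token_hit <- vapply(tokens, function(tok) key %in% tok, logical(1))

substring_hit  # TRUE  TRUE FALSE
token_hit      # TRUE FALSE FALSE
```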

To do what you're asking for, try
A=data.frame(par=c('long A story','long C story', 'blabla D'),val=1:3)
B=data.frame(par=c('Z','D','A'),val=letters[1:3])
A$label <- NA
for (x in B$par){
  is.match <- lapply(A$par,function(y) grep(x, y))
  A$label[which(is.match > 0)] <- x
}
(I assumed you meant a capital A in your example A=transform(a,label=c('A','NA','D')); in that case, these match exactly). EDIT: I see you made that edit. They do match then.
The above method will work only if there is exactly one B that fits every A (in other words, there can be multiple As to a B but not multiple Bs to an A). This is because of the structure you want in the output.
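If several keys from B can occur in one row of A, one possible workaround is to collect all matches per row and collapse them into a single string. This is only a sketch: the third row of A is changed to contain two keys, and the comma separator is an arbitrary choice.

```r
# Third row is hypothetical: it contains both "D" and "A"
A <- data.frame(par = c("long A story", "long C story", "blabla D A"),
                val = 1:3, stringsAsFactors = FALSE)
B <- data.frame(par = c("Z", "D", "A"), val = letters[1:3],
                stringsAsFactors = FALSE)

# For each row of A, keep every key from B that occurs in it
hits <- lapply(A$par, function(s) {
  B$par[vapply(B$par, grepl, logical(1), x = s, fixed = TRUE)]
})

# Collapse to one string per row; NA where nothing matched
A$label <- vapply(hits, function(h) {
  if (length(h) > 0) paste(h, collapse = ",") else NA_character_
}, character(1))

A$label  # "A" NA "D,A"
```

A collapsed label such as "D,A" is no longer a usable merge key, so this only helps for inspecting ambiguous rows.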

Without loops in a handy function:
findkey <- function(key,terms) {
  result <- sapply(as.character(key),function(x) grepl(x,terms))
  result <- apply(result,1,function(x) names(x)[x==TRUE])
  result[(lapply(result,length)==0)] <- NA
  return(unlist(result))
}
Apply to current example:
A$label <- findkey(B$par,A$par)
Result:
> A
           par val label
1 long A story   1     A
2 long C story   2  <NA>
3     blabla D   3     D
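Since the question mentions that the label column is meant for merging A and B, here is a minimal sketch of that last step using base merge(). It assumes label holds a real NA (not the string "NA") and that every row of A should be kept; the suffixes are an arbitrary choice to tell the two val columns apart.

```r
A <- data.frame(par = c("long A story", "long C story", "blabla D"),
                val = 1:3, label = c("A", NA, "D"),
                stringsAsFactors = FALSE)
B <- data.frame(par = c("Z", "D", "A"), val = letters[1:3],
                stringsAsFactors = FALSE)

# Left join: keep every row of A, attach B's columns where label == B$par
merged <- merge(A, B, by.x = "label", by.y = "par",
                all.x = TRUE, suffixes = c(".A", ".B"))
merged[, c("label", "par", "val.A", "val.B")]
```

Rows of A without a label get NA in val.B.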

Related

Replace each observation in a data.frame with n copies [duplicate]

This question already has answers here:
Repeat rows of a data.frame N times
I want to repeat the rows of a data.frame, each N times. The result should be a new data.frame (with nrow(new.df) == nrow(old.df) * N) keeping the data types of the columns.
Example for N = 2:
                    A B   C
  A B   C         1 j i 100
1 j i 100   -->   2 j i 100
2 K P 101         3 K P 101
                  4 K P 101
So, each row is repeated 2 times and characters remain characters, factors remain factors, numerics remain numerics, ...
My first attempt used apply: apply(old.df, 2, function(co) rep(co, each = N)), but this one transforms my values to characters and I get:
A B C
[1,] "j" "i" "100"
[2,] "j" "i" "100"
[3,] "K" "P" "101"
[4,] "K" "P" "101"
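The character output is expected: apply() works on a matrix, so the data frame is first converted with as.matrix(), and a matrix can hold only one type. A short sketch (old.df here is a made-up stand-in for the asker's data):

```r
# Hypothetical two-column version of the asker's data
old.df <- data.frame(A = c("j", "K"), C = c(100, 101),
                     stringsAsFactors = FALSE)

m <- as.matrix(old.df)  # what apply() does internally before looping
typeof(m)               # "character"
m[, "C"]                # "100" "101"
```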
df <- data.frame(a = 1:2, b = letters[1:2])
df[rep(seq_len(nrow(df)), each = 2), ]
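To check that row indexing really keeps the column types, here is a small sketch on data shaped like the question's example (the column types are assumed, since the question does not state them). Resetting the row names is optional.

```r
old.df <- data.frame(A = c("j", "K"),
                     B = factor(c("i", "P")),
                     C = c(100, 101),
                     stringsAsFactors = FALSE)
N <- 2

# Repeat each row index N times, then subset by row
new.df <- old.df[rep(seq_len(nrow(old.df)), each = N), ]
rownames(new.df) <- NULL  # replace "1", "1.1", ... with 1..4

sapply(new.df, class)  # A: character, B: factor, C: numeric
```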
A clean dplyr solution, taken from here
library(dplyr)
df <- tibble(x = 1:2, y = c("a", "b"))
df %>% slice(rep(1:n(), each = 2))
There is a lovely vectorized solution that repeats only certain rows n-times each, possible for example by adding an ntimes column to your data frame:
  A B   C ntimes
1 j i 100      2
2 K P 101      4
3 Z Z 102      1
Method:
df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2,4,1))
df <- as.data.frame(lapply(df, rep, df$ntimes))
Result:
  A B   C ntimes
1 j i 100      2
2 j i 100      2
3 K P 101      4
4 K P 101      4
5 K P 101      4
6 K P 101      4
7 Z Z 102      1
This is very similar to Josh O'Brien and Mark Miller's method:
df[rep(seq_len(nrow(df)), df$ntimes),]
However, that method appears quite a bit slower:
df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2000,3000,4000))
microbenchmark::microbenchmark(
df[rep(seq_len(nrow(df)), df$ntimes),],
as.data.frame(lapply(df, rep, df$ntimes)),
times = 10
)
Result:
Unit: microseconds
                                      expr      min       lq      mean   median       uq      max neval
   df[rep(seq_len(nrow(df)), df$ntimes), ] 3563.113 3586.873 3683.7790 3613.702 3657.063 4326.757    10
 as.data.frame(lapply(df, rep, df$ntimes))  625.552  654.638  676.4067  668.094  681.929  799.893    10
If you can repeat the whole thing, or subset it first then repeat that, then this similar question may be helpful. Once again:
library(mefa)
rep(mtcars,10)
or simply
mefa:::rep.data.frame(mtcars)
Adding to what @dardisco mentioned about mefa::rep.data.frame(), it's very flexible.
You can either repeat each row N times:
rep(df, each=N)
or repeat the entire dataframe N times (think: like when you recycle a vectorized argument)
rep(df, times=N)
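For readers without mefa, the same each/times distinction can be sketched in base R by repeating the row indices. This is not mefa's implementation, only an equivalent under the assumption that plain row subsetting is acceptable:

```r
df <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)
N  <- 2
idx <- seq_len(nrow(df))

each_df  <- df[rep(idx, each  = N), ]  # rows 1, 1, 2, 2
times_df <- df[rep(idx, times = N), ]  # rows 1, 2, 1, 2

each_df$a   # 1 1 2 2
times_df$a  # 1 2 1 2
```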
Two thumbs up for mefa! I had never heard of it until now and I had to write manual code to do this.
For reference, and adding to the answers citing mefa, it might be worth taking a look at the implementation of mefa::rep.data.frame() in case you don't want to include the whole package:
> data <- data.frame(a=letters[1:3], b=letters[4:6])
> data
a b
1 a d
2 b e
3 c f
> as.data.frame(lapply(data, rep, 2))
a b
1 a d
2 b e
3 c f
4 a d
5 b e
6 c f
The rep.row function seems to sometimes make lists for columns, which leads to bad memory hijinks. I have written the following which seems to work well:
library(plyr)
rep.row <- function(r, n){
  colwise(function(x) rep(x, n))(r)
}
My solution is similar to mefa:::rep.data.frame, but a little faster, and it takes care of row names:
rep.data.frame <- function(x, times) {
  rnames <- attr(x, "row.names")
  x <- lapply(x, rep.int, times = times)
  class(x) <- "data.frame"
  if (!is.numeric(rnames))
    attr(x, "row.names") <- make.unique(rep.int(rnames, times))
  else
    attr(x, "row.names") <- .set_row_names(length(rnames) * times)
  x
}
Compare solutions:
library(Lahman)
library(microbenchmark)
microbenchmark(
mefa:::rep.data.frame(Batting, 10),
rep.data.frame(Batting, 10),
Batting[rep.int(seq_len(nrow(Batting)), 10), ],
times = 10
)
#> Unit: milliseconds
#>                                            expr       min       lq     mean   median        uq       max neval cld
#>              mefa:::rep.data.frame(Batting, 10) 127.77786 135.3480 198.0240 148.1749  278.1066  356.3210    10  a
#>                     rep.data.frame(Batting, 10)  79.70335  82.8165 134.0974  87.2587  191.1713  307.4567    10  a
#>  Batting[rep.int(seq_len(nrow(Batting)), 10), ] 895.73750 922.7059 981.8891 956.3463 1018.2411 1127.3927    10   b
try using for example
N=2
rep(1:4, each = N)
as an index
Another way to do this would be to first get the row indices, append extra copies of the df, and then order by the indices:
df$index = 1:nrow(df)
df = rbind(df,df)
df = df[order(df$index),][,-ncol(df)]
Although the other solutions may be shorter, this method may be more advantageous in certain situations.
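The rbind-and-order idea above doubles the rows; a sketch of how it might be generalised to N copies, assuming a helper column named index is acceptable (the name is arbitrary):

```r
df <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)
N  <- 3

df$index <- seq_len(nrow(df))                              # remember original order
stacked  <- do.call(rbind, replicate(N, df, simplify = FALSE))  # N stacked copies
result   <- stacked[order(stacked$index), names(stacked) != "index"]
rownames(result) <- NULL

result$a  # 1 1 1 2 2 2
```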

R: How to interpolate a string variable into a vector [duplicate]

I'm trying to set the default value for a function parameter to a named numeric. Is there a way to create one in a single statement? I checked ?numeric and ?vector but it doesn't seem so. Perhaps I can convert/coerce a matrix or data.frame and achieve the same result in one statement? To be clear, I'm trying to do the following in one shot:
test = c( 1 , 2 )
names( test ) = c( "A" , "B" )
The setNames() function is made for this purpose. As described in Advanced R and ?setNames:
test <- setNames(c(1, 2), c("A", "B"))
How about:
c(A = 1, B = 2)
A B
1 2
...as a side note, the structure function allows you to set ALL attributes, not just names:
structure(1:10, names=letters[1:10], foo="bar", class="myclass")
Which would produce
a b c d e f g h i j
1 2 3 4 5 6 7 8 9 10
attr(,"foo")
[1] "bar"
attr(,"class")
[1] "myclass"
The convention for naming vector elements is the same as with lists:
newfunc <- function(A=1, B=2) { body} # the parameters are an 'alist' with two items
If instead you wanted this to be a parameter that was a named vector (the sort of function that would handle arguments supplied by apply):
newfunc <- function(params =c(A=1, B=2) ) { body} # a vector with two elements
If instead you wanted this to be a parameter that was a named list:
newfunc <- function(params =list(A=1, B=2) ) { body}
# a single parameter (with two elements in a list structure)
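Tying this back to the original goal (a named numeric as a default argument), a minimal runnable sketch follows. The function scale_by and its body are invented for illustration; only setNames() and c() come from the answers above.

```r
# Hypothetical function: the default for `weights` is a named numeric
# built in a single expression
scale_by <- function(x, weights = setNames(c(1, 2), c("A", "B"))) {
  x * weights[["B"]]  # look up one element by name
}

scale_by(10)                        # 20, uses the default
scale_by(10, weights = c(A = 1, B = 3))  # 30, caller-supplied vector
```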
magrittr offers a nice and clean solution.
library(magrittr)
result = c(1,2) %>% set_names(c("A", "B"))
print(result)
A B
1 2
You can also use it to transform data.frames into vectors.
df = data.frame(value=1:10, label=letters[1:10])
vec = extract2(df, 'value') %>% set_names(df$label)
vec
a b c d e f g h i j
1 2 3 4 5 6 7 8 9 10
df
value label
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
To expand upon @joran's answer (I couldn't get this to format correctly as a comment): If the named vector is assigned to a variable, the values of A and B are accessed via subsetting using the [ function. Use the names to subset the vector the same way you might use the index number to subset:
my_vector = c(A = 1, B = 2)
my_vector["A"] # subset by name
# A
# 1
my_vector[1] # subset by index
# A
# 1
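As a small addition to the subsetting example, [ keeps the name while [[ returns the bare value, and a character vector of names can reorder the elements. A sketch in base R:

```r
my_vector <- c(A = 1, B = 2)

bare     <- my_vector[["A"]]         # 1, name dropped
reversed <- my_vector[c("B", "A")]   # elements picked in the requested order

bare             # 1
names(reversed)  # "B" "A"
```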

Split column into vectors by group R - independent of column order

Edit
This question seems to be a duplicate of the question How to group a vector into a list of vectors?, and the answer split(df$b, df$id) was suggested. First happy with the solution, I realized that the given answers do not fully address my question. In the below question, I would like to obtain a list in which the vector elements are assigned to the value of a third column (in my example df$a). This is important, as otherwise the order of df$b plays a role. I mean obviously I can arrange by df$a and then call split(), but maybe there is another way of doing that.
My sample df:
df <- data_frame(id = paste0('id',rep(1:2, each = 5)), a = rep(letters[1:5],2),b=c(1:5,5:1))
Df should be grouped by ID (in df$id). I would like to create a list of vectors for each group (id) element that contains the values of df$b. My approach
require(tidyr)
spread_df <- df %>% spread(id,b) # makes a new column for each id
list_group_elements <- list() # initialise the result list
# loop over the columns of spread_df
for (i in 1:length(spread_df)) {
  list_group_elements[i] <- list(spread_df[[i]])
  # I want each vector to be identified by the identifier of column df$a
  # therefore:
  names(list_group_elements[[i]]) <- list_group_elements[[1]]
}
This results in :
list_group_elements
[[1]]
a b c d e
"a" "b" "c" "d" "e"
[[2]]
a b c d e
1 2 3 4 5
[[3]]
a b c d e
5 4 3 2 1
I don't need the first element of the list, but the rest is basically what I need. I have the peculiar impression that my approach is somewhat not ideal and if someone has an idea to improve this, (e.g., with dplyr?) this would be highly appreciated. Why do I want this: I made a function that uses vectors as arguments and I would like to run this function over certain columns from dataframes - but only using the grouped values as arguments and not the entire column.
You may make df$b a named vector using setNames, and then split it into a list:
split(setNames(df$b, df$a), df$id)
# $id1
# a b c d e
# 1 2 3 4 5
#
# $id2
# a b c d e
# 5 4 3 2 1
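Because each value carries its df$a name, the result no longer depends on the row order of the input: the vectors can be put into a common order afterwards. A sketch on a shortened, deliberately shuffled version of the sample data (base data.frame used instead of data_frame to keep it dependency-free):

```r
df <- data.frame(id = rep(c("id1", "id2"), each = 3),
                 a  = rep(c("a", "b", "c"), 2),
                 b  = c(1:3, 3:1),
                 stringsAsFactors = FALSE)
shuffled <- df[c(3, 1, 2, 6, 4, 5), ]  # rows out of order on purpose

groups <- split(setNames(shuffled$b, shuffled$a), shuffled$id)
# Sort every vector by its names so all groups share the same element order
groups <- lapply(groups, function(v) v[order(names(v))])

groups$id1  # a = 1, b = 2, c = 3
groups$id2  # a = 3, b = 2, c = 1
```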
One way is
lapply(unique(df$id), function(L) df$b[df$id == L])
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 5 4 3 2 1
Consider by, an object-oriented wrapper for tapply, designed to split a data frame by factor(s):
by(df, df$id, FUN=function(i) i$b)
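Since the stated goal is to run a function on each group's values rather than on the whole column, the split result can be passed straight to sapply(). The function first_minus_last below is a made-up placeholder for the asker's own function:

```r
df <- data.frame(id = rep(c("id1", "id2"), each = 5),
                 a  = rep(letters[1:5], 2),
                 b  = c(1:5, 5:1),
                 stringsAsFactors = FALSE)

# Hypothetical stand-in for a function that takes a vector argument
first_minus_last <- function(v) v[1] - v[length(v)]

res <- sapply(split(df$b, df$id), first_minus_last)
res  # id1 = -4, id2 = 4
```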
