Assume that I have two lists of the same length.
l1 <- list(c("a", "b", "c"), "d")
l2 <- list(c("e", "f"), c("g", "h", "i"))
Each row/element of a list can be seen as a specific pair. So in this example the two vectors
c("a", "b", "c")
c("e", "f")
"belong together" and so do the two others.
I need to get all the possible combinations/permutations of those two vectors with the same index.
I know that I can use expand.grid(c("a", "b", "c"), c("e", "f")) for two vectors, but I'm struggling to do this over both lists iteratively. I tried to use mapply(), but couldn't come up with a solution.
The preferred output can be a dataframe or a list containing all possible row-wise combinations. It's not necessary to keep the information of the "source pair". I'm just interested in the combinations.
So, a possible output could look like this:
l1 l2
1 a e
2 b e
3 c e
4 a f
5 b f
6 c f
7 d g
8 d h
9 d i
You can use Map to loop over the list elements and then use rbind:
do.call(rbind, Map(expand.grid, l1, l2))
# Var1 Var2
#1 a e
#2 b e
#3 c e
#4 a f
#5 b f
#6 c f
#7 d g
#8 d h
#9 d i
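Map is just mapply with different defaults: it is a wrapper around mapply that does not try to simplify the result, so it always returns a list.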
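Since the question mentions mapply(), here is a minimal sketch of the same idea written with mapply() directly. It assumes SIMPLIFY = FALSE is what was missing; the column names l1 and l2 and the object names pairs and combos are only illustrative.

```r
l1 <- list(c("a", "b", "c"), "d")
l2 <- list(c("e", "f"), c("g", "h", "i"))

# SIMPLIFY = FALSE makes mapply() return a list of data frames, like Map()
pairs <- mapply(
  function(a, b) expand.grid(l1 = a, l2 = b, stringsAsFactors = FALSE),
  l1, l2,
  SIMPLIFY = FALSE
)

# stack the per-pair grids into one data frame
combos <- do.call(rbind, pairs)
print(combos)
```

Naming the arguments inside expand.grid() gives the l1/l2 column headers from the question instead of Var1/Var2.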
How can R use the actual value of i when naming columns and lists in a for loop?
For example, using the following data:
z <- data.frame(x= c(1,2,3,4,5), y = c("a", "b", "c", "d", "e"))
When I reference i inside the loop while creating the columns, the column is literally named i.
a_final <- NULL
for(i in z$x){
print(data.frame(i = z$y))
}
Instead, I'd like the columns to be named by the value of each i in the loop.
I'd like the results to look something like:
1 2 3 4 5
a a a a a
b b b b b
c c c c c
d d d d d
e e e e e
You could create a square matrix filled with z$y, with nrow(z) rows and nrow(z) columns, and convert it into a data frame.
as.data.frame(matrix(z$y, ncol = nrow(z), nrow = nrow(z)))
# V1 V2 V3 V4 V5
#1 a a a a a
#2 b b b b b
#3 c c c c c
#4 d d d d d
#5 e e e e e
We can also use replicate:
as.data.frame(replicate(nrow(z), z$y))
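Both calls above produce default column names (V1, V2, ...). To get the columns named after the values of z$x, as the question asks, one option is to assign the names afterwards. This is only a sketch; res and one_col are illustrative names.

```r
z <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c("a", "b", "c", "d", "e"),
                stringsAsFactors = FALSE)

# one copy of z$y per row of z, then rename the columns after z$x
res <- as.data.frame(replicate(nrow(z), z$y), stringsAsFactors = FALSE)
names(res) <- as.character(z$x)
print(res)

# inside a loop, setNames() uses the value of i rather than the symbol i
for (i in z$x) {
  one_col <- setNames(data.frame(z$y, stringsAsFactors = FALSE), as.character(i))
  print(names(one_col))
}
```

data.frame(i = z$y) treats i as a literal argument name, which is why the original loop always produced a column called i.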
I have a large data.frame, example:
> m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow=5, ncol=5)
> colnames(m) <- c("A", "B", "C", "D", "E")
> rownames(m) <- c("a", "b", "c", "d", "e")
> m
A B C D E
a 3 3 5 7 5
b 6 2 3 5 4
c 2 5 6 8 9
d 5 4 3 2 2
e 3 3 6 5 2
I would like to remove all rows where column A and/or column B has a greater value than columns C, D and E.
So in this case rows b, d, e should be removed and I should get this:
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
I cannot remove them one by one because the data.frame has more than a million rows.
Thanks
Use subsetting, together with pmin() and pmax(), to retain the rows that you want. I'm not sure that I fully understand your criteria (you said "C D and E", but since you want to throw away row e, I think that you meant C, D or E), but the following seems to do what you want:
> m[pmax(m[,"A"],m[,"B"])<=pmin(m[,"C"],m[,"D"],m[,"E"]),]
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
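To see why this works, it may help to look at the intermediate vectors. The sketch below splits the one-liner into steps; hi_ab, lo_cde and keep are illustrative names, not part of the original answer.

```r
m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow = 5, ncol = 5)
colnames(m) <- c("A", "B", "C", "D", "E")
rownames(m) <- c("a", "b", "c", "d", "e")

# element-wise (per-row) maximum of A and B
hi_ab <- pmax(m[, "A"], m[, "B"])             # 3 6 5 5 3
# element-wise (per-row) minimum of C, D and E
lo_cde <- pmin(m[, "C"], m[, "D"], m[, "E"])  # 5 3 6 2 2

# keep a row only if neither A nor B exceeds any of C, D, E
keep <- hi_ab <= lo_cde
print(m[keep, ])
```

pmax() and pmin() work on whole columns at once, so no row-by-row loop is needed.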
# create the data frame
m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow=5, ncol=5)
colnames(m) <- c("A", "B", "C", "D", "E")
rownames(m) <- c("a", "b", "c", "d", "e")
m <- as.data.frame(m)
# work on a copy and blank out the rows that should be dropped
df_n <- m
for(i in seq_len(nrow(m))){
  # drop row i if the larger of A and B exceeds the smallest of C, D and E
  if(max(m[i, 1:2]) > min(m[i, 3:5])){
    df_n[i,] <- NA
  }
}
# remove the blanked-out rows
df_n <- df_n[complete.cases(df_n), ]
print(df_n)
Results
> print(df_n)
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
Here's another solution with apply:
m[apply(m, 1, function(x) max(x[1], x[2]) < min(x[3], x[4], x[5])),]
Result:
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
I think what you actually meant is to remove rows where max(A, B) > min(C, D, E). The code above keeps rows where every value of A and B is strictly smaller than every value of C, D, and E; use <= instead of < if ties should be kept as well.
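Since the real data is a data.frame with more than a million rows, a vectorised variant that takes the column names as vectors may be more convenient than apply. This is a sketch only; low_cols, high_cols and keep are hypothetical names, and ties are kept here (<=).

```r
m <- as.data.frame(matrix(
  c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2),
  nrow = 5,
  dimnames = list(c("a", "b", "c", "d", "e"), c("A", "B", "C", "D", "E"))
))

low_cols  <- c("A", "B")        # columns that must not exceed the others
high_cols <- c("C", "D", "E")

# do.call() passes each selected column to pmax()/pmin() as a separate argument
keep <- do.call(pmax, m[low_cols]) <= do.call(pmin, m[high_cols])
print(m[keep, ])
```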
I have a data set in Excel with a lot of VLOOKUP formulas that I am trying to reproduce in R using the data.table package.
In my example below I am saying, for each row, find the value in column y within column x and return the value in column z.
The first row results in na because the value 6 doesn't exist in column x.
On the second row the value 5 appears twice in column x, but returning the first match is fine, which is e in this case.
I've added in the result column which is the expected outcome.
library(data.table)
dt <- data.table(x = c(1,2,3,4,5,5),
y = c(6,5,4,3,2,1),
z = c("a", "b", "c", "d", "e", "f"),
Result = c("na", "e", "d", "c", "b", "a"))
Many thanks
You can do this with a join, but need to change the order first:
setorder(dt, y)
dt[.(x = x, z = z), result1 := i.z, on = .("y" = x)]
setorder(dt, x)
# x y z Result result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 1 f a a
#6: 5 2 e b b
I haven't tested if this is faster than match for a big data.table, but it might be.
We can just use match to find the index in 'x' of the first element matching each element of 'y', and use that index to get the corresponding 'z':
dt[, Result1 := z[match(y,x)]]
dt
# x y z Result Result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 2 e b b
#6: 5 1 f a a
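The lookup itself does not depend on data.table. A base R sketch with plain vectors shows what match() returns at each step (idx and looked_up are illustrative names):

```r
x <- c(1, 2, 3, 4, 5, 5)
y <- c(6, 5, 4, 3, 2, 1)
z <- c("a", "b", "c", "d", "e", "f")

# position of the FIRST occurrence of each y value in x; NA when there is none
idx <- match(y, x)     # NA 5 4 3 2 1

# indexing with NA yields NA, which mirrors a failed VLOOKUP
looked_up <- z[idx]    # NA "e" "d" "c" "b" "a"
print(looked_up)
```

Because match() only reports the first occurrence, the duplicated 5 in x resolves to "e" rather than "f", as requested in the question.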
I have a large data frame from which I need to exclude rows that contain anything other than a small set of characters.
Currently I am using the following code to do so, and it works fine, but I can only seem to apply it to a single column at a time. That is inefficient and time-consuming, as I have many columns to work through.
df <- df[(df$column_name_01 %in% c("a", "b", "c", "d")),]
So far I have tried referring to multiple columns like so (as this approach works for single columns):
df <- df[(df[, 1:10] %in% c("a", "b", "c", "d")),]
But this is obviously not working as intended. Is there a concise way to exclude rows from a data frame that contain certain characters (or that do not match certain characters, either way)?
I think you want a regular old apply here:
df[apply(df[,1:10], 1, function(x) all(x %in% c("a", "b", "c", "d"))),]
Or, to keep only the rows in which none of the values match:
df[apply(df[,1:10], 1, function(x) all(! x %in% c("a", "b", "c", "d"))),]
You can test each of the 10 columns separately, which gives one logical vector per column, and then combine those into a single vector of rows to keep with Reduce and "&":
df[Reduce("&", lapply(df[,1:10], function(x) x %in% c("a", "b", "c", "d"))),]
# NA NA NA NA NA NA NA NA NA NA
# 14 a c a d d c d c c c
# 25 b a a a b a c a a c
# 29 d d d a a a b c c a
# 31 c b b d b c a b b c
# 33 b a c b a d c a a c
# 36 c d c b d a c a a a
# 42 b b a a b c d b d d
# 45 c c b b d a b a d b
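A tiny worked example may make the Reduce step clearer. The data frame small and the names allowed, per_col and keep are made up for illustration only.

```r
allowed <- c("a", "b", "c", "d")
small <- data.frame(p = c("a", "e", "b"),
                    q = c("d", "a", "c"),
                    stringsAsFactors = FALSE)

# one logical vector per column: is each value in the allowed set?
per_col <- lapply(small, function(x) x %in% allowed)
# per_col$p is TRUE FALSE TRUE, per_col$q is TRUE TRUE TRUE

# Reduce() folds the list with "&", so a row survives only if every column is TRUE
keep <- Reduce("&", per_col)   # TRUE FALSE TRUE
print(small[keep, ])
```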
You could also do this by converting the data frame to a matrix and using rowSums to make sure all values in the row fall in the desired set:
df[rowSums(matrix(unlist(df[,1:10]) %in% c("a", "b", "c", "d"), nrow(df))) == 10,]
# NA NA NA NA NA NA NA NA NA NA
# 14 a c a d d c d c c c
# 25 b a a a b a c a a c
# 29 d d d a a a b c c a
# 31 c b b d b c a b b c
# 33 b a c b a d c a a c
# 36 c d c b d a c a a a
# 42 b b a a b c d b d d
# 45 c c b b d a b a d b
Both of these solutions should be faster than an apply-based solution for large data frames (I benchmark a 100k-row data frame here) because they operate on a small number of columns instead of a large number of rows, which takes better advantage of vectorization:
josilber.lapply <- function(df) df[Reduce("&", lapply(df[,1:10], function(x) x %in% c("a", "b", "c", "d"))),]
josilber.rowSums <- function(df) df[rowSums(matrix(unlist(df[,1:10]) %in% c("a", "b", "c", "d"), nrow(df))) == 10,]
crimson.apply <- function(df) df[apply(df[,1:10], 1, function(x) all(x %in% c("a", "b", "c", "d"))),]
library(microbenchmark)
microbenchmark(josilber.lapply(big.df), josilber.rowSums(big.df), crimson.apply(big.df))
# Unit: milliseconds
# expr min lq mean median uq max neval
# josilber.lapply(big.df) 67.17092 71.0628 83.36787 74.74011 86.00722 231.6794 100
# josilber.rowSums(big.df) 98.75142 116.3975 136.28880 128.28851 149.55155 301.9346 100
# crimson.apply(big.df) 676.66290 725.6616 789.45954 762.74171 805.72437 2681.8203 100
Data:
set.seed(144)
df <- unname(do.call(data.frame, replicate(10, sample(letters[1:5], 50, replace=TRUE), simplify=FALSE)))
set.seed(144)
big.df <- unname(do.call(data.frame, replicate(10, sample(letters[1:5], 100000, replace=TRUE), simplify=FALSE)))
I would like to extract the next 'n' rows after I find a string in R.
For example, let's say I have the following data frame:
df<-as.data.frame(rep(c("a","b","c","d","e","f"),10))
I would like to extract every row that includes "b", as well as the next two rows (in this example, I would like to extract rows with "b", or "c", or "d")
BUT, please, I don't want to specify "c" and "d", I just want the next two rows after "b" as well (in my real data the next two rows are not consistent).
I've tried many things, but no success.. Thanks in advance! Nick
You can find the indices of rows with b and then use those and the next two of each, something like this:
df <- data.frame(col1=rep(c("a","b","c","d","e","f"),3), col2=letters[1:18], stringsAsFactors = FALSE)
df
col1 col2
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 a g
8 b h
9 c i
10 d j
11 e k
12 f l
13 a m
14 b n
15 c o
16 d p
17 e q
18 f r
bs <- which(df$col1=="b")
df[sort(bs + rep(0:2, each = length(bs))), ] # 2 is the number of rows you want after your desired match (b).
col1 col2
2 b b
3 c c
4 d d
8 b h
9 c i
10 d j
14 b n
15 c o
16 d p
I added a second column to illustrate the dataframe better, otherwise a vector would be returned.
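If the matches can sit close together or near the end of the data, the computed indices may overlap or run past the last row. A hedged sketch of one way to guard against that, using a small made-up data frame (n_after and idx are illustrative names):

```r
df <- data.frame(col1 = c("a", "b", "b", "c", "d", "b"),
                 col2 = letters[1:6],
                 stringsAsFactors = FALSE)

n_after <- 2                          # rows to take after each match
bs <- which(df$col1 == "b")           # 2 3 6

# every match plus its offsets, de-duplicated and sorted
idx <- sort(unique(as.vector(outer(bs, 0:n_after, "+"))))
# drop indices that fall beyond the end of the data frame
idx <- idx[idx <= nrow(df)]

print(df[idx, ])
```

Without the last filter, indexing past nrow(df) would add rows filled with NA.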
My "SOfun" package has a function called getMyRows which does what you ask for, with the exception of returning a list instead of a data.frame.
I had left the result as a list to make it easier to handle some edge cases, like where the requests for rows would overlap. For example, in the following sample data, there are two consecutive "b" values. There's also a "b" value in the final row.
df <- data.frame(col1 = c("a", "b", "b",
rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
col2 = letters[1:22])
library(SOfun)
getMyRows(df, which(df$col1 == "b"), 0:2, TRUE)
# [[1]]
# col1 col2
# 2 b b
# 3 b c
# 4 a d
#
# [[2]]
# col1 col2
# 3 b c
# 4 a d
# 5 b e
#
# [[3]]
# col1 col2
# 5 b e
# 6 c f
# 7 d g
#
# [[4]]
# col1 col2
# 11 b k
# 12 c l
# 13 d m
#
# [[5]]
# col1 col2
# 17 b q
# 18 c r
# 19 d s
#
# [[6]]
# col1 col2
# 22 b v
The usage is essentially:
1. Specify the data.frame.
2. Specify the index positions to use as the base. Here, we want all rows where "col1" equals "b" to be our base index position.
3. Specify the range of rows you are interested in. -1:3, for example, would give you one row before to three rows after the base.
4. TRUE means that you are specifying the starting points by their numeric indices.
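If installing the package is not an option, a rough base R approximation of the same list output can be sketched with lapply. This is not the package's implementation; rows_after is a hypothetical helper written only for illustration.

```r
df <- data.frame(col1 = c("a", "b", "b",
                          rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
                 col2 = letters[1:22],
                 stringsAsFactors = FALSE)

# hypothetical helper: one data frame per base index, clipped to valid rows
rows_after <- function(data, base, range) {
  lapply(base, function(i) {
    idx <- intersect(i + range, seq_len(nrow(data)))
    data[idx, , drop = FALSE]
  })
}

res <- rows_after(df, which(df$col1 == "b"), 0:2)
print(res[[1]])
print(res[[6]])
```

Keeping one list element per match preserves overlapping requests, and the final "b" in row 22 simply yields a one-row data frame.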