I have a large data frame from which I need to exclude rows that contain anything other than a small set of characters.
Currently I am using the following code to do so, and it works fine, but I can only apply it to a single column at a time, which is both inefficient and time consuming on my part, as I have many columns to work through.
df <- df[(df$column_name_01 %in% c("a", "b", "c", "d")),]
So far I have tried referring to multiple columns like so (as this approach works for single columns):
df <- df[(df[, 1:10] %in% c("a", "b", "c", "d")),]
But this is obviously not working as intended. Is there a concise way to exclude rows from a data frame that contain certain characters (or that do not match certain characters, either way)?
I think you want a regular old apply here:
df[apply(df[,1:10], 1, function(x) all(x %in% c("a", "b", "c", "d"))),]
Or, to keep only the rows where none of the values match:
df[apply(df[,1:10], 1, function(x) all(! x %in% c("a", "b", "c", "d"))),]
You can compute, for each of the 10 columns, whether its values fall in the allowed set, then combine the resulting logical vectors into a single row selector with Reduce and "&":
df[Reduce("&", lapply(df[,1:10], function(x) x %in% c("a", "b", "c", "d"))),]
# NA NA NA NA NA NA NA NA NA NA
# 14 a c a d d c d c c c
# 25 b a a a b a c a a c
# 29 d d d a a a b c c a
# 31 c b b d b c a b b c
# 33 b a c b a d c a a c
# 36 c d c b d a c a a a
# 42 b b a a b c d b d d
# 45 c c b b d a b a d b
You could also do this by converting the data frame to a matrix and using rowSums to make sure all values in the row fall in the desired set:
df[rowSums(matrix(unlist(df[,1:10]) %in% c("a", "b", "c", "d"), nrow(df))) == 10,]
# NA NA NA NA NA NA NA NA NA NA
# 14 a c a d d c d c c c
# 25 b a a a b a c a a c
# 29 d d d a a a b c c a
# 31 c b b d b c a b b c
# 33 b a c b a d c a a c
# 36 c d c b d a c a a a
# 42 b b a a b c d b d d
# 45 c c b b d a b a d b
Both of these solutions should be faster than an apply-based solution for large data frames (I benchmark a 100,000-row data frame here) because they operate on a small number of columns instead of looping over a large number of rows, better taking advantage of vectorization:
josilber.lapply <- function(df) df[Reduce("&", lapply(df[,1:10], function(x) x %in% c("a", "b", "c", "d"))),]
josilber.rowSums <- function(df) df[rowSums(matrix(unlist(df[,1:10]) %in% c("a", "b", "c", "d"), nrow(df))) == 10,]
crimson.apply <- function(df) df[apply(df[,1:10], 1, function(x) all(x %in% c("a", "b", "c", "d"))),]
library(microbenchmark)
microbenchmark(josilber.lapply(big.df), josilber.rowSums(big.df), crimson.apply(big.df))
# Unit: milliseconds
#                       expr       min       lq      mean    median        uq       max neval
#    josilber.lapply(big.df)  67.17092  71.0628  83.36787  74.74011  86.00722  231.6794   100
#   josilber.rowSums(big.df)  98.75142 116.3975 136.28880 128.28851 149.55155  301.9346   100
#     crimson.apply(big.df)  676.66290 725.6616 789.45954 762.74171 805.72437 2681.8203   100
Data:
set.seed(144)
df <- unname(do.call(data.frame, replicate(10, sample(letters[1:5], 50, replace=TRUE), simplify=FALSE)))
set.seed(144)
big.df <- unname(do.call(data.frame, replicate(10, sample(letters[1:5], 100000, replace=TRUE), simplify=FALSE)))
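As a quick sanity check (my own addition, not from the original answers), the three functions defined above should select exactly the same rows on the small example data:
identical(josilber.lapply(df), josilber.rowSums(df))  # expected TRUE
identical(josilber.lapply(df), crimson.apply(df))     # expected TRUE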
I am organizing a large dataset adapted to my research. Suppose that I have 9 observations (records) and 4 columns as follows:
z <- data.frame("fa" = c(1, NA, NA, 2, 1, 1, 2, 1, 1),
"fb" = c(2, 2, NA, 1, NA, NA, NA, 1, 2),
"initial_1" = c("A", "B", "B", "B", "A", "C", "D", "B", "A"),
"initial_2" = c("D", "C", "C", "A", "B", "A", "A", "D", "D"))
I want to create two new columns, fa_new and fb_new, according to the values of the first two columns, fa and fb, which point into the reference columns, initial_1 and initial_2, such that fa == # maps to initial_#.
For example, as can be seen above, the first record of the column fa is 1, which is linked to "A" of initial_1. Thus, the first record of the new column fa_new will be "A". Likewise, the first record of fb is 2, which is linked to "D" of initial_2; thus, the first record of fb_new will be "D".
Accordingly, my expectation is:
fa_new fb_new
1 A D
2 NA C
3 NA NA
4 A B
5 A NA
6 C NA
7 A NA
8 B B
9 A D
Is this possible using R?
You can use lapply to do this for multiple columns:
cols <- 1:2
init_cols <- paste0('initial_', cols)
new_cols <- paste0(names(z)[cols], '_new')
inds <- 1:nrow(z)
z[new_cols] <- lapply(z[cols], function(x) z[init_cols][cbind(inds, x)])
z
# fa fb initial_1 initial_2 fa_new fb_new
#1 1 2 A D A D
#2 NA 2 B C <NA> C
#3 NA NA B C <NA> <NA>
#4 2 1 B A A B
#5 1 NA A B A <NA>
#6 1 NA C A C <NA>
#7 2 NA D A A <NA>
#8 1 1 B D B B
#9 1 2 A D A D
The logic here is that we create a two-column matrix with cbind holding row and column numbers: the row number is inds (1:nrow(z)), while the column number comes from the fa/fb columns. Subsetting the reference columns of z with that matrix picks one element per row.
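To see the mechanism in isolation, here is a minimal sketch (with made-up values) of matrix indexing; note that an NA row value propagates to an NA result, which is how the NA entries in fa/fb are handled:
ref <- data.frame(initial_1 = c("A", "B"), initial_2 = c("D", "C"))
ref[cbind(c(1, 2), c(1, 2))]   # elements (1,1) and (2,2): "A" "C"
ref[cbind(c(1, NA), c(1, 2))]  # an NA index yields NA: "A" NA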
The actual data frame is a labelled dataset; the following adaptation should work on the real data, which has 94 'fuinitials_' reference columns.
cols <- 1:2
init_cols <- paste0('fuinitials_', 1:94)
new_cols <- paste0(names(z)[cols], '_new')
inds <- 1:nrow(z)
z1 <- data.frame(z)
z1[cols][z1[cols] < 1] <- NA
z1[new_cols] <- lapply(z1[cols], function(x) z1[init_cols][cbind(inds, x)])
I have a large data.frame, example:
> m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow=5, ncol=5)
> colnames(m) <- c("A", "B", "C", "D", "E")
> rownames(m) <- c("a", "b", "c", "d", "e")
> m
A B C D E
a 3 3 5 7 5
b 6 2 3 5 4
c 2 5 6 8 9
d 5 4 3 2 2
e 3 3 6 5 2
I would like to remove all rows where the A and/or B columns have a greater value than the C, D, and E columns.
So in this case rows b, d, e should be removed and I should get this:
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
I can't remove them one by one because the data.frame has more than a million rows.
Thanks
Use subsetting together with pmin() and pmax() to retain the rows that you want. I'm not sure that I fully understand your criteria (you said "C D and E", but since you want to throw away row e, I think you meant C, D, or E), but the following seems to do what you want:
> m[pmax(m[,"A"],m[,"B"])<=pmin(m[,"C"],m[,"D"],m[,"E"]),]
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
# creating the df
m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow=5, ncol=5)
colnames(m) <- c("A", "B", "C", "D", "E")
rownames(m) <- c("a", "b", "c", "d", "e")
# initialize as data frame.
m <- as.data.frame(m)
df_n <- m
for (i in 1:nrow(m)) {
  # mark the row for removal if max(A, B) exceeds the smallest of C, D, E
  if (max(m[i, 1:2]) > min(m[i, 3:5])) {
    df_n[i, ] <- NA
  }
}
# keep only the fully non-NA rows
df_n <- df_n[complete.cases(df_n), ]
print(df_n)
Results
> print(df_n)
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
Here's another solution with apply:
m[apply(m, 1, function(x) max(x[1], x[2]) < min(x[3], x[4], x[5])),]
Result:
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
I think what you actually meant is to remove rows where max(A, B) > min(C, D, E), which translates to keeping rows where every value of A and B is smaller than every value of C, D, and E.
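For completeness, the same keep condition can be written without apply, mirroring the pmax()/pmin() answer above but with the strict inequality used here:
m[pmax(m[, "A"], m[, "B"]) < pmin(m[, "C"], m[, "D"], m[, "E"]), ]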
I have a data set in Excel with a lot of vlookup formulas that I am trying to transpose in R using the data.table package.
In my example below I am saying, for each row, find the value in column y within column x and return the value in column z.
The first row results in na because the value 6 doesn't exist in column x.
On the second row the value 5 appears twice in column x, but returning the first match, which is e in this case, is fine.
I've added the Result column, which shows the expected outcome.
library(data.table)
dt <- data.table(x = c(1,2,3,4,5,5),
                 y = c(6,5,4,3,2,1),
                 z = c("a", "b", "c", "d", "e", "f"),
                 Result = c("na", "e", "d", "c", "b", "a"))
Many thanks
You can do this with a join, but you need to change the order first:
setorder(dt, y)
dt[.(x = x, z = z), result1 := i.z, on = .("y" = x)]
setorder(dt, x)
# x y z Result result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 1 f a a
#6: 5 2 e b b
I haven't tested if this is faster than match for a big data.table, but it might be.
We can just use match to find the index of the first element of 'x' that matches each element of 'y', and then use that index to pick the corresponding 'z':
dt[, Result1 := z[match(y,x)]]
dt
# x y z Result Result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 2 e b b
#6: 5 1 f a a
I would like to extract the next 'n' rows after I find a string in R.
For example, let's say I have the following data frame:
df<-as.data.frame(rep(c("a","b","c","d","e","f"),10))
I would like to extract every row that includes "b", as well as the next two rows (in this example, I would like to extract rows with "b", or "c", or "d")
BUT, please, I don't want to specify "c" and "d", I just want the next two rows after "b" as well (in my real data the next two rows are not consistent).
I've tried many things, but no success. Thanks in advance! Nick
You can find the indices of rows with b and then use those and the next two of each, something like this:
df <- data.frame(col1=rep(c("a","b","c","d","e","f"),3), col2=letters[1:18], stringsAsFactors = FALSE)
df
col1 col2
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 a g
8 b h
9 c i
10 d j
11 e k
12 f l
13 a m
14 b n
15 c o
16 d p
17 e q
18 f r
bs <- which(df$col1=="b")
df[sort(bs + rep(0:2, each = length(bs))), ] # 2 is the number of rows you want after your desired match (b)
col1 col2
2 b b
3 c c
4 d d
8 b h
9 c i
10 d j
14 b n
15 c o
16 d p
I added a second column to illustrate the dataframe better, otherwise a vector would be returned.
My "SOfun" package has a function called getMyRows which does what you ask for, with the exception of returning a list instead of a data.frame.
I had left the result as a list to make it easier to handle some edge cases, like where the requests for rows would overlap. For example, in the following sample data, there are two consecutive "b" values. There's also a "b" value in the final row.
df <- data.frame(col1 = c("a", "b", "b",
rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
col2 = letters[1:22])
library(SOfun)
getMyRows(df, which(df$col1 == "b"), 0:2, TRUE)
# [[1]]
# col1 col2
# 2 b b
# 3 b c
# 4 a d
#
# [[2]]
# col1 col2
# 3 b c
# 4 a d
# 5 b e
#
# [[3]]
# col1 col2
# 5 b e
# 6 c f
# 7 d g
#
# [[4]]
# col1 col2
# 11 b k
# 12 c l
# 13 d m
#
# [[5]]
# col1 col2
# 17 b q
# 18 c r
# 19 d s
#
# [[6]]
# col1 col2
# 22 b v
The usage is essentially:
Specify the data.frame.
Specify the index positions to use as the base. Here, we want all rows where "col1" equals "b" to be our base index position.
Specify the range of rows you are interested in. -1:3, for example, would give you one row before through three rows after the base.
TRUE means that you are specifying the starting points by their numeric indices.
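If you'd rather not install SOfun, a rough base-R equivalent of the call above might look like this (my own sketch, not from the package; out-of-range rows are simply dropped, which keeps the final-row "b" case safe):
lapply(which(df$col1 == "b"), function(i) {
  rows <- i + 0:2
  df[rows[rows >= 1 & rows <= nrow(df)], ]
})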
I currently have a dataframe in which there are several values I would like converted to NA. When I first imported this dataframe from a .csv, I could use na.strings=c("A", "B", "C") and so on to remove the values I didn't want.
I want to do the same thing again, but this time using a dataframe already, not importing another .csv
To import the data, I used:
data<-read.csv("code.csv", header=T, strip.white=TRUE, stringsAsFactors=FALSE, na.strings=c("", "A", "B", "C"))
Now, with "data", I would like to subset it while removing even more specific values in the rows.. I tried someting like:
data2<-data.frame(data, na.strings=c("D", "E", "F"))
Of course this doesn't work, because na.strings only works with the read.* functions, not with other functions. Is there any equivalent way to simply convert certain values into NA so I can na.omit(data2) fairly easily?
Thanks for your help.
Here's a way to replace values in multiple columns:
# an example data frame
dat <- data.frame(x = c("D", "E", "F", "G"),
y = c("A", "B", "C", "D"),
z = c("X", "Y", "Z", "A"))
# x y z
# 1 D A X
# 2 E B Y
# 3 F C Z
# 4 G D A
# values to replace
na.strings <- c("D", "E", "F")
# index matrix
idx <- Reduce("|", lapply(na.strings, "==", dat))
# replace values with NA
is.na(dat) <- idx
dat
# x y z
# 1 <NA> A X
# 2 <NA> B Y
# 3 <NA> C Z
# 4 G <NA> A
Just assign the NA values directly.
e.g.:
x <- data.frame(a=1:5, b=letters[1:5])
# > x
# a b
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
# convert the 'b' and 'd' in column b to NA
x$b[x$b %in% c('b', 'd')] <- NA
# > x
# a b
# 1 1 a
# 2 2 <NA>
# 3 3 c
# 4 4 <NA>
# 5 5 e
data[ data == "D" ] = NA
Note that if you were trying to replace NA with "D", the reverse (df[ df == NA ] = "D") will not work; you would need to use df[is.na(df)] <- "D"
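The reason the reverse fails is that any comparison with NA returns NA, never TRUE, so the logical index selects nothing useful. A quick illustration:
x <- c("a", NA, "b")
x == NA            # NA NA NA -- no TRUE values anywhere
is.na(x)           # FALSE TRUE FALSE -- the test you actually want
x[is.na(x)] <- "D"
x                  # "a" "D" "b"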
Since we don't have your data I will use mtcars. Suppose that we want to set values anywhere in mtcars that are equal to 4 or 19.2 to NA:
ind <- which(mtcars == 4 | mtcars == 19.2, arr.ind = TRUE)
mtcars[ind] <- NA
In your setting you would replace these numbers with "D" or "E".
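Adapted to character data, the same pattern might look like this (a sketch with made-up values):
dat <- data.frame(x = c("D", "E", "F"), y = c("A", "D", "B"))
ind <- which(dat == "D" | dat == "E", arr.ind = TRUE)
dat[ind] <- NA
dat
#      x    y
# 1 <NA>    A
# 2 <NA> <NA>
# 3    F    B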