I have a large data frame where I've collapsed my vectors into strings (using lapply and toString) so they fit into a data frame, and now I can't check whether one column is a subset of the other. Is there a simple way to do this?
X <- data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
X
y z
1 ABC ABC
2 A A,B,C
all(X$y %in% X$z)
[1] FALSE
(X$y[1] %in% X$z[1])
[1] TRUE
(X$y[2] %in% X$z[2])
[1] FALSE
I need to treat each y and z string value as a (comma-separated) vector again and then check whether y is a subset of z.
In the above case, A is a subset of A,B,C. However, because I've treated both as strings, it doesn't work.
In the example above, y holds a single value in each row, while z holds one value in row 1 and three in row 2. The data frame I'll be testing has 10,000 rows; y will have 1-5 values per row and z 1-100 per row. It looks like the 1-5 values are always a subset of z, but I'd like to check.
df = data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
apply(df, 1, function(x) { # perform row-wise operations
  y = unlist(strsplit(x[1], ",")) # split the y string in case it contains ","
  z = unlist(strsplit(x[2], ",")) # split the z string the same way
  all(y %in% z) # TRUE only if every element of y is present in z
})
# [1] TRUE TRUE
# Case 2: different data. Decide whether you need a strict subset or not; the code can be adjusted accordingly.
df = data.frame(y=c("ABC","A,B,D"), z=c("ABC","A,B,C"))
# Re-running the apply() call above gives:
# [1] TRUE FALSE
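One thing to watch: toString() joins elements with a comma followed by a space, so strings built that way need the space stripped when they are split again. A minimal sketch, assuming the columns really were produced with toString (the sample data and the helper name is_subset are made up for illustration):

```r
# Sample data in the ", "-separated form that toString() produces (illustrative).
df <- data.frame(y = c("ABC", "A, B", "A, D"),
                 z = c("ABC", "A, B, C", "A, B, C"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split both strings on a comma plus optional whitespace,
# then test whether every element of y occurs in z.
is_subset <- function(y, z) {
  y_items <- strsplit(y, ",\\s*")[[1]]
  z_items <- strsplit(z, ",\\s*")[[1]]
  all(y_items %in% z_items)
}

# mapply walks the two columns in parallel; unname drops the auto-generated names.
res <- unname(mapply(is_subset, df$y, df$z))
res
```

Splitting on "," alone would leave items such as " B" with a leading space, which would not match "B".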
I think it might work better to skip the lapply and toString combination and store the lists directly in your data frame. For this purpose, I find tibbles (class tbl_df, from the tibble package) friendlier, although I believe data.table objects can hold list columns as well (someone correct me if I'm wrong).
library(tibble)
y_char <- list("ABC", "A")
z_char <- list("ABC", c("A", "B", "C"))
X <- tibble(y = y_char, # tibble() replaces the deprecated data_frame()
z = z_char)
Notice that when you print X now, your entries in each row of the tibble are entries from the list. Now we can use mapply to do pairwise comparison.
# All y in z
mapply(function(x, y) all(x %in% y),
X$y,
X$z)
# All z in y
mapply(function(x, y) all(y %in% x),
X$y,
X$z)
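The same idea can also be tried with a plain data.frame, since a column can itself be a list. A small sketch under that assumption, using base R only (the id column and the all_in helper are made up for illustration):

```r
# Build a base data.frame and attach two list columns of the same length as its rows.
X <- data.frame(id = 1:2)
X$y <- list("ABC", "A")
X$z <- list("ABC", c("A", "B", "C"))

# Hypothetical helper: is every element of a found in b?
all_in <- function(a, b) all(a %in% b)

y_in_z <- mapply(all_in, X$y, X$z)  # all y in z, row by row
z_in_y <- mapply(all_in, X$z, X$y)  # all z in y, row by row
y_in_z
z_in_y
```

Because each cell keeps its original vector, no splitting or re-parsing of strings is needed before the comparison.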
library(tidyverse)
separator <- function(x){
format(as.numeric(x), big.mark = ".", decimal.mark = ",")
}
x <- c(1000)
y <- c(1000)
z <- c(1000)
df <- tibble(x, y, z)
df[ , 2:ncol(df)] <- apply(df[ , 2:ncol(df)], 2, separator)
Error:
Error: Assigned data `apply(df[, 2:ncol(df)], 2, separator)` must be compatible with existing data.
x Existing data has 1 row.
x Assigned data has 2 rows.
i Row updates require a list value. Do you need `list()` or `as.list()`?
Run `rlang::last_error()` to see where the error occurred.
Why does this happen? Why does the new data suddenly have 2 rows? It should just keep 1 row and add thousands separator, starting from the second variable to end.
If it's a data.frame, R doesn't complain and just does the job but not with tibbles. I know there are a few differences in tibbles and data.frames, but I couldn't find a reason why this happens.
Edit (pointed out by Ric S): The code works with tibbles that have more than one row, but not with one-row tibbles.
Use lapply:
df[ , 2:ncol(df)] <- lapply(df[ , 2:ncol(df)], separator)
If you had posted the full error/warning message, it would be clear:
# Error: Assigned data `apply(df[, 2:ncol(df)], 2, separator)` must be compatible with existing data.
# x Existing data has 1 row.
# x Assigned data has 2 rows.
# i Row updates require a list value. Do you need `list()` or `as.list()`?
# Run `rlang::last_error()` to see where the error occurred.
Then we can check whether the assigned value is compatible, i.e. whether it is a list:
is.list(apply(df[ , 2:ncol(df)], 2, separator))
# [1] FALSE
is.list(lapply(df[ , 2:ncol(df)], separator))
# [1] TRUE
is.list(df[ , 2:ncol(df)])
# [1] TRUE
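To see where the "2 rows" in the message likely comes from, it helps to look at the shape of what each call returns. A rough sketch using a one-row base data.frame instead of a tibble, so it runs without extra packages (the name one_row is just for illustration):

```r
separator <- function(x) {
  format(as.numeric(x), big.mark = ".", decimal.mark = ",")
}

one_row <- data.frame(x = 1000, y = 1000, z = 1000)

# apply() converts to a matrix and simplifies the result:
# with one row, it collapses to a plain character vector of length 2.
from_apply <- apply(one_row[, 2:3], 2, separator)

# lapply() keeps one list element per column, each of length 1.
from_lapply <- lapply(one_row[, 2:3], separator)

is.list(from_apply)    # FALSE
length(from_apply)     # 2
is.list(from_lapply)   # TRUE
lengths(from_lapply)   # one value per column
```

A length-2 atomic vector carries no column structure, which is consistent with the tibble assignment treating it as two rows, whereas the list from lapply still maps one element to each column.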
If you wanted to just apply the separator function to y and z columns, you could do this:
df <- df %>%
mutate(across(c(y, z), separator)) # across() supersedes mutate_at() in dplyr >= 1.0
Using R, I have to extract specific rows from a data frame depending on certain conditions. The data frame is large (5.5 million rows by 251 columns), but the code below creates a small sample data frame.
df <- data.frame("Name" = c("Name1", "Name1", "Name1", "Name1","Name1" ), "Value"=c("X", "X", "Y", "Y", "X"))
I need to step through the entire data frame row by row from the top and, whenever the value of the 'Value' column changes from X to Y or from Y to X, extract that row and the next row and append them to another data frame. For example, in the data frame above, the Value column of row 2 is X and that of row 3 is Y; since the value has changed from X to Y, I need to extract rows 2 and 3 in full and add them to another data frame.
The result of the operations can be seen by running the code below
dfextract <- data.frame("Name" = c("Name1", "Name1"), "Value"=c("X", "Y"))
Currently I use a 'for' loop to step from row to row and extract the rows when the values don't match, but it is very slow and inefficient. The code snippet is below
dfextract <- df[0, ] # start from an empty data frame with the same columns
for (i in seq_len(nrow(df) - 1)) { # stop one row early so that i + 1 stays in range
  if (df[i + 1, 2] != df[i, 2]) {
    dfextract <- rbind(dfextract, df[i, ], df[i + 1, ])
  }
}
I am looking for a better and faster solution to the above situation. Perhaps using the functions belonging to the family of 'apply()' or using 'by()'. Any help would be greatly appreciated.
Thanks in advance.
Maybe the following does it. Note that there are two lapply-based loops, in order to account for changes in the values of column Name.
diffstr <- function(x) x[-1] == x[-length(x)]
res <- lapply(split(df, df$Name), function(x) {
inx <- which(c(FALSE, !diffstr(x$Value)))
do.call(rbind, lapply(inx, function(i) x[(i - 1):i, ]))
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
How it works.
First, I define a helper function diffstr. It compares every value of x except the first with every value of x except the last. Note that x[-1] is the vector x[2], x[3], ..., x[length(x)]: a negative index removes that element from the vector. The same goes for x[-length(x)], where the negative index removes the last element of x. The result is TRUE wherever a value equals the one before it, so its negation marks the changes.
split(df, df$Name) splits the data frame into subsets, one for each value of Name.
I then lapply an unnamed function to these subsets. This function's argument x will be each of the sub-data frames mentioned above.
That function starts by determining where the changes in x$Value are. This is done with the call to the helper function diffstr, negated. I have to prepend a FALSE to the result because the first row has no previous row to differ from.
The next line is the tricky one. It uses lapply on the change points inx and, for each one, takes a two-row segment of the data frame x (the row before the change and the row of the change). Then do.call calls rbind on those two-row data frames to stack them back together.
Now res is a list with one sub-data frame for each Name (because of the split), so it needs to be put back together with another do.call(rbind, ...).
Final tidy-up. The whole process messes up the data frame's row names. Setting them to NULL is a well-known trick that forces R to renumber the rows.
That's it. If you need more explanations, just say so.
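The helper and the index step can be traced on the Value column of the sample data. A short walk-through (the vector v and the name same_as_prev are only for illustration):

```r
diffstr <- function(x) x[-1] == x[-length(x)]

v <- c("X", "X", "Y", "Y", "X")

v[-1]           # drops the first element: "X" "Y" "Y" "X"
v[-length(v)]   # drops the last element:  "X" "X" "Y" "Y"

# TRUE where an element equals the one before it.
same_as_prev <- diffstr(v)

# Negate to mark changes, and prepend FALSE for the first row.
inx <- which(c(FALSE, !same_as_prev))
inx
```

Here inx holds the positions where the value differs from the previous one, so x[(i - 1):i, ] picks the row before each change together with the changed row.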
We can use dplyr. lag shifts a column down by one row, so Value != lag(Value) is TRUE wherever a value differs from the previous one. which(Value != lag(Value)) converts the result to row numbers. After that, sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))) makes sure we also get the row numbers of the preceding rows. Finally, slice subsets the data frame by the row numbers provided.
library(dplyr)
df2 <- df %>%
slice(sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))))
df2
# A tibble: 4 x 2
Name Value
<fctr> <fctr>
1 Name1 X
2 Name1 Y
3 Name1 Y
4 Name1 X
If the code is too long to read, you can also calculate the index before using the slice function as follows.
library(dplyr)
ind <- which(df$Value != lag(df$Value))
ind2 <- sort(unique(c(ind, ind - 1)))
df2 <- df %>% slice(ind2)
df2
# A tibble: 4 x 2
Name Value
<fctr> <fctr>
1 Name1 X
2 Name1 Y
3 Name1 Y
4 Name1 X
Using base R, I would probably find the row indices where the value changes with diff (this needs a numeric column):
df <- data.frame(colA=c(1, 1, 1, 2, 1, 1, 1, 3, 3, 3, 1, 1),
colB=1:12)
keep <- which(diff(df$colA) != 0)
df[unique(c(keep, keep+1)), ]
colA colB
3 1 3
4 2 4
7 1 7
10 3 10
5 1 5
8 3 8
11 1 11
There is probably a faster option though.
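diff only works on numeric columns, so for a character column such as Value in the question the same idea can be sketched by comparing the vector with itself shifted by one. This is an adaptation under that assumption, not the code from the answer above; sort is added so the rows come back in their original order:

```r
df <- data.frame(Name = rep("Name1", 5),
                 Value = c("X", "X", "Y", "Y", "X"),
                 stringsAsFactors = FALSE)

n <- nrow(df)

# Position i is kept when row i differs from row i + 1.
keep <- which(df$Value[-1] != df$Value[-n])

# Take each such row and the one after it, without duplicates, in order.
rows <- sort(unique(c(keep, keep + 1)))
res <- df[rows, ]
res
```

The comparison is fully vectorised, so there is no row-by-row loop and no repeated rbind.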
When you have a large dataset, speed might be the bottleneck. In this case data.table might be the best option for you.
Using the data.table-library, I would solve it like so:
library(data.table)
dt <- data.table(Name = c("Name1", "Name1", "Name1", "Name1","Name1" ),
Value = c("X", "X", "Y", "Y", "X"))
# flag rows whose Value differs from the previous row (the first row is never flagged)
dt[, idx := Value != shift(Value, 1, fill = dt$Value[1])]
# keep each flagged row and the row right after it,
# then drop the helper column idx
dt[idx | shift(idx, 1)][, .(Name, Value)]
#> Name Value
#> 1: Name1 Y
#> 2: Name1 Y
#> 3: Name1 X
Why does it return an odd number of rows and not an even number?
Well, that is because in your example data the last row is selected because it changes, but there is no next row to select along with it.
I made data frame called x:
a b
1 2
3 NA
3 32
21 7
12 8
When I run
y <- x["a">2,]
The object y returned is identical to x. If I run
y <- x["a" == 1,]
y is an empty frame.
I made sure that the names of the x data frame have no white space (I set them myself with names()) and also that a and b are numeric.
PS: If I try
y <- x["a">2]
y is also returned as identical to x.
You're referencing the column of your data.frame x incorrectly.
"a" > 2 compares the character string "a" with 2, not the column a of x. The comparison yields a single TRUE, so every row is returned; likewise "a" == 1 is a single FALSE, so no rows are returned. To reference the column, use x$a or x["a"].
Try
y <- x[x$a >2 ,]
or
y <- x[x["a"] >2 ,]
or, even more clearly,
ix <- x["a"] > 2
y <- x[ix,]
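The difference between the two kinds of condition is easy to check directly. A minimal sketch that rebuilds the question's data frame by hand (the names cond_string and cond_column are only for illustration):

```r
x <- data.frame(a = c(1, 3, 3, 21, 12),
                b = c(2, NA, 32, 7, 8))

# Comparing the string "a" with a number: 2 is coerced to "2" and the
# two strings are compared, giving one logical value for the whole frame.
cond_string <- "a" > 2
length(cond_string)

# Comparing the column: one logical value per row.
cond_column <- x$a > 2
length(cond_column)

y_all   <- x[cond_string, ]   # a single TRUE is recycled, so all rows come back
y_none  <- x["a" == 1, ]      # a single FALSE selects nothing
y_right <- x[cond_column, ]   # only rows where a > 2
y_right
```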
A simple alternative would be using data.table
library(data.table)
setDT(x)
y <- x[ a > 2, ]
y <- x[ a == 1, ]
Given the following data set:
df <- read.table(text = " a b c
X Y Z", header = T)
The command
df[, sapply (df[1, ], as.character) %in% c("Y", "X")]
returns
a b
1 X Y
but the command
df[, sapply (df[1, ], as.character) %in% c("Y")]
returns
[1] Y
Levels: Y
and not
b
1 Y
Any idea why? And how can I retrieve the correct column name?
Likely a duplicate, but to provide an answer:
You've fallen into one of the major traps of the R Inferno, section 8.1.44: "By default dimensions of arrays are dropped when subscripting makes the dimension length 1. Subscripting with drop=FALSE overrides the default."
So, this will return what you expect:
df[, sapply (df[1, ], as.character) %in% c("Y"), drop = FALSE]
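The effect of drop can be compared side by side. A small sketch on the question's data (stringsAsFactors = FALSE is an assumption added here so the dropped result is a plain character vector rather than a factor; the names sel, dropped and kept are illustrative):

```r
df <- read.table(text = "a b c
X Y Z", header = TRUE, stringsAsFactors = FALSE)

# Logical selector: which columns hold "Y" in the first row?
sel <- sapply(df[1, ], as.character) %in% "Y"

dropped <- df[, sel]                 # single column collapses to a bare vector
kept    <- df[, sel, drop = FALSE]   # stays a one-column data frame

is.data.frame(dropped)
is.data.frame(kept)
names(kept)
```

With drop = FALSE the column name survives, so names(kept) gives the name that was lost in the dropped version.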
Using R 3.1.0
a = as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
b = unstack(stack(a))
# Returns FALSE
all(colnames(a) == colnames(b))
The documentation on stack/unstack says unstacking should "reverse this [stack] operation". Am I missing something? Why do I need to re-order the columns of b?
The last few lines of the stack function (see utils:::stack.data.frame) create a data.frame with two columns, "values" and "ind". The "ind" column is created with the code:
ind = factor(rep.int(names(x), lapply(x, length)))
But, look at how factor works in general (pay attention to the order of the "Levels"):
factor(c(1, 2, 3, 10, 4))
# [1] 1 2 3 10 4
# Levels: 1 2 3 4 10
factor(paste0("A", c(1, 2, 3, 10, 4)))
# [1] A1 A2 A3 A10 A4
# Levels: A1 A10 A2 A3 A4
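The level order can be controlled by passing levels explicitly, which is the idea behind the fixes that follow. A brief sketch of just that step (the variable names are illustrative):

```r
nms <- paste0("A", c(1, 2, 3, 10, 4))

# Default: levels are the sorted unique values, so "A10" sorts before "A2".
default_levels <- levels(factor(nms))

# Explicit: levels follow the order in which the names first appear.
ordered_levels <- levels(factor(nms, levels = unique(nms)))

default_levels
ordered_levels
```

Supplying unique(nms) as the levels keeps the original column order, whereas the default sorts the names as strings.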
If the functionality you describe is important for your analysis, you might do better modifying a version of stack.data.frame to capture the order of the data.frame names during the factoring process, like this:
Stack <- function (x, select, ...)
{
if (!missing(select)) {
nl <- as.list(1L:ncol(x))
names(nl) <- names(x)
vars <- eval(substitute(select), nl, parent.frame())
x <- x[, vars, drop = FALSE]
}
keep <- unlist(lapply(x, is.vector))
if (!sum(keep))
stop("no vector columns were selected")
if (!all(keep))
warning("non-vector columns will be ignored")
x <- x[, keep, drop = FALSE]
data.frame(values = unlist(unname(x)),
# REMOVE THIS --> ind = factor(rep.int(names(x), lapply(x, length))),
# AND ADD THIS:
ind = factor(rep.int(names(x), lapply(x, length)), unique(names(x))),
stringsAsFactors = FALSE)
}
Testing, one, two, three...
## Not using identical here because
## the factor levels are different
all.equal(Stack(a), stack(a))
# [1] TRUE
identical(unstack(Stack(a)), a)
# [1] TRUE
You'll never get me to defend the R documentation...
stack(...) creates a new data frame with two columns, values and ind. The latter has the column names from the original table, as a factor, ordered alphabetically. unstack(...) uses that factor to (re-) create columns of the new data frame. So the phrase "Unstacking reverses this operation" should be interpreted loosely...
To get the result you want, you need to reorder the factor ind, as follows:
a <- as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
c <- stack(a)
c$ind <- factor(c$ind, levels=colnames(a))
d <- unstack(c)
identical(a,d)
# [1] TRUE