Very similar questions have been asked here, here, and here. However, they all seem to rely on knowing the column names of the data.
I am trying to get the column index of a data frame that matches a numeric vector. For example, if I have some data and a vector like so:
dat <- data.frame(
x = c(1,2,3,4,5),
y = c(10,9,8,7,6),
z = c(2,4,6,8,10)
)
testVec <- c(2,4,6,8,10)
I would just like to return the column index of dat that matches testVec. We can see that dat$z matches testVec, so in this situation I would just like to return 3.
Any suggestions as to how I could do this?
Here's a base R approach, which compares every column in dat with testVec to see if they are identical. Use which to output the column index if they're identical.
which(sapply(1:ncol(dat), function(x) identical(dat[,x], testVec)))
[1] 3
UPDATE
@nicola has provided a better syntax for my original code (you can see it in the comment under this answer):
which(sapply(dat, identical, y = testVec))
z
3
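One thing to keep in mind with identical() is that it also compares storage type, so an integer test vector will not match a double column even when the values agree. A minimal sketch of a more tolerant variant using all.equal() follows; the name testInt is only an illustration and is not part of the original question.

```r
dat <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(10, 9, 8, 7, 6),
  z = c(2, 4, 6, 8, 10)
)

# Hypothetical integer version of the test vector
testInt <- c(2L, 4L, 6L, 8L, 10L)

# identical() is strict about type: integer vs double never matches
strict <- which(sapply(dat, identical, y = testInt))

# all.equal() compares values with a numeric tolerance instead
tolerant <- which(sapply(dat, function(col) isTRUE(all.equal(col, testInt))))

strict    # no match
tolerant  # matches column z
```

Whether the strict or the tolerant comparison is appropriate depends on how the data were created.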
You could perhaps try this
which(colSums(dat == testVec) == nrow(dat))
z
3
An option with select from dplyr
library(dplyr)
dat %>%
select(where(~ all(testVec == .x))) %>%
names %>%
match(names(dat))
[1] 3
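Subtract testVec and check that every absolute difference is zero. Taking abs() matters: a plain colSums(dat - testVec) == 0 would also match a column whose differences merely cancel out.
which(colSums(abs(dat - testVec)) == 0)
# z
# 3
Without name:
unname(which(colSums(abs(dat - testVec)) == 0))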
# [1] 3
Data:
dat <- structure(list(x = c(1, 2, 3, 4, 5), y = c(10, 9, 8, 7, 6), z = c(2,
4, 6, 8, 10)), class = "data.frame", row.names = c(NA, -5L))
testVec <- c(2, 4, 6, 8, 10)
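A caveat for difference-based checks in general: signed differences can cancel, so a column sum of zero does not by itself prove that every element matches. The small example below uses made-up data (dat2 and v are hypothetical names) to show the effect and how summing absolute differences avoids it.

```r
# Hypothetical data: column a differs from v, but its differences sum to zero
dat2 <- data.frame(a = c(1, 3), b = c(2, 2))
v <- c(2, 2)

# Signed differences: a gives (-1, 1), which sums to 0, so a is reported too
signed <- which(colSums(dat2 - v) == 0)

# Absolute differences: only a true element-wise match sums to 0
absolute <- which(colSums(abs(dat2 - v)) == 0)

signed    # both a and b
absolute  # only b
```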
I have a data.table with two columns "From" and "To" as follows:
data.table(From = c(1,1,1,1,2,2,2,2,3,3,3,4,4,5),
To = c(3,4,5,6,3,4,5,6,4,5,6,5,6,6))
The data.table will always be sorted as shown in the example above, with "From" and "To" values increasing from smallest to largest.
I need to find a 'path' starting from the first 'From' (which will always be '1'), through to the last 'To' value, subject to always choosing the lowest 'To' value.
In the above example, I would have 1 --> 3, then 3 --> 4, then 4 --> 5, then finally 5 --> 6.
I then want to return in a vector 1, 3, 4, 5, and 6, representing the linked values.
The only way that I can think of doing it is using a while or for loop and looping through each group of 'From' values and iteratively choosing the smallest. That seems inefficient though and will probably be very slow on my actual data set which is over 100,000 rows long.
Are there any data.table-like solutions?
I also thought that maybe igraph would have a method for this, but I must admit that I currently have pretty much zero knowledge of this package.
Any help would be greatly appreciated.
Thanks,
Phil
EDIT:
Thanks for all the responses so far.
My example/explanation wasn't a great one sorry, as I didn't explain that the 'From' / 'To' pairs don't need to go all the way through to the end value of the 'To' column.
Using the example from the comments below:
dt <- data.table(From = c(1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 5),
To = c(3, 4, 5, 6, 3, 4, 5, 6, 5, 6, 6))
The output would simply be a vector of c(1, 3), as it will start at 1, choose the lowest value which is 3, and then because there are no 'From' values of '3', it wouldn't continue any further.
Another example:
dt <- data.table(From = c(1,1,1,2,2,3,3,4,4),
To = c(2,3,4,5,6,4,7,8,9))
The intended output here is a vector c(1,2,5); following the path 1 --> 2, then 2 --> 5, at which point it stops as there is no '5' value in the "From" column.
Hopefully, that makes sense, and apologies for the lack of clarity in the original question.
Thanks,
Phil
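For reference, the rule described in the question (start at 1, always take the lowest 'To', stop when the current value has no 'From' entry) can be sketched in base R as follows. The function name follow_min_path is hypothetical, and the sketch assumes every 'To' is larger than its 'From', as in the examples, so the walk cannot cycle.

```r
follow_min_path <- function(from, to, start = 1) {
  # Sort so the first row of each From group holds its lowest To
  ord <- order(from, to)
  from <- from[ord]
  to <- to[ord]
  first <- !duplicated(from)
  keys <- from[first]  # distinct From values
  nxt <- to[first]     # lowest To for each From

  path <- start
  cur <- start
  repeat {
    i <- match(cur, keys)
    if (is.na(i)) break  # no outgoing edge from the current value: stop
    cur <- nxt[i]
    path <- c(path, cur)
  }
  path
}

# The three examples from the question
p1 <- follow_min_path(c(1,1,1,1,2,2,2,2,3,3,3,4,4,5),
                      c(3,4,5,6,3,4,5,6,4,5,6,5,6,6))
p2 <- follow_min_path(c(1,1,1,1,2,2,2,2,4,4,5),
                      c(3,4,5,6,3,4,5,6,5,6,6))
p3 <- follow_min_path(c(1,1,1,2,2,3,3,4,4),
                      c(2,3,4,5,6,4,7,8,9))
```

The loop runs once per step of the path rather than once per row, so its length is bounded by the number of distinct 'From' values.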
You can try the code below
library(dplyr)
library(igraph)
library(purrr)
dt %>%
group_by(From) %>%
slice_min(To) %>%
graph_from_data_frame() %>%
ego(
order = sum((m <- membership(components(.))) == m[names(m) == "1"]),
nodes = "1",
mode = "out"
) %>%
pluck(1) %>%
names() %>%
as.numeric()
or simpler with subcomponent (as @clp did)
dt %>%
group_by(From) %>%
slice_min(To) %>%
graph_from_data_frame() %>%
subcomponent(v = "1", mode = "out") %>%
names() %>%
as.integer()
which gives
For the first updated data
[1] 1 3
For the second updated data
[1] 1 2 5
Assuming an ordered From and To list, this may work.
It first groups by From, keeps the first (lowest) To in each group, then excludes non-matching From-To values using shift.
If jumps are missing (e.g. To 3 but From 3 missing), it returns NULL.
dt[, .(frst = first(To)), From][
, if(all((frst %in% From)[1:(.N - 1)])){
c(1, unique(frst[From == shift(frst, type = "lag", fill = T)]))}]
[1] 1 3 4 5 6
Using igraph and subcomponent().
After ThomasisCoding's comment, I realized that graph_from_data_frame creates a graph by name.
This is a waste of memory (and time) if the graph is large (1E6).
Note also that graph_from_edgelist(as.matrix(...)) is much faster.
dt2 <- setNames(aggregate(dt$To, list(dt$From), "min"), c("From", "To") )
g <- graph_from_edgelist(as.matrix(dt2), directed=TRUE)
as.numeric(as_ids(subcomponent(g, 1, mode="out")))
First attempt.
dt2 <- setNames(aggregate(dt$To, list(dt$From), "min"), c("From", "To") )
g <- graph_from_data_frame(dt2, directed=TRUE)
as.numeric(as_ids(subcomponent(g, 1, mode="out")))
I can't seem to get the other answers to work with certain tables. E.g.,
library(data.table)
library(igraph)
library(purrr)
dt <- data.table(
From = c(1, 1, 1, 1, 2, 2, 4, 5),
To = c(3, 4, 5, 6, 4, 6, 6, 6)
)
fPath1 <- function(dt) {
setorder(dt, From, To)[, wt := fifelse(rleid(To)==1,1,Inf), From] %>%
graph_from_data_frame() %>%
set_edge_attr(name = "weight", value = dt[, wt]) %>%
shortest_paths(min(dt[, From]), max(dt[, To])) %>%
pluck(1) %>%
unlist(use.names = FALSE)
}
fPath2 <- function(dt) {
dt[, .SD[which.min(To)], From] %>%
graph_from_data_frame() %>%
shortest_paths(min(dt[, From]), max(dt[, To])) %>%
pluck(1) %>%
unlist(use.names = FALSE)
}
fPath3 <- function(dt) {
dt[, .(frst = first(To)), From][
, if(all((frst %in% From)[1:(.N - 1)])){
c(1, unique(frst[From == shift(frst, type = "lag", fill = T)]))}]
}
fPath1(dt)
#> [1] 1 6
fPath2(dt)
#> Warning in shortest_paths(., min(dt[, From]), max(dt[, To])): At core/paths/
#> unweighted.c:368 : Couldn't reach some vertices.
#> integer(0)
fPath3(dt)
#> NULL
This igraph solution seems to work based on a little more extensive testing:
fPath4 <- function(dt) {
g <- graph_from_data_frame(dt)
E(g)$weight <- (dt$To - dt$From)^2
as.integer(V(g)[shortest_paths(g, V(g)[1], V(g)[name == dt$To[nrow(dt)]])$vpath[[1]]]$name)
}
fPath4(dt)
#> [1] 1 4 6
A sequential solution is feasible.
Copying one million data frame rows took 8 seconds on my system.
n <- 1E6
df1 <- data.frame(from=sample(n), to=sample(n))
path <- c()
system.time(
for (i in seq(nrow(df1)) ){
path[length(path) + 1] <- df1[i, "to"] # avoid copying.
}
)
mean(path)
length(path)
Output.
[1] 500000.5
[1] 1000000
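As a rough variation on the timing test above, and assuming that row-by-row data frame indexing accounts for a good part of the cost, the column can be pulled out as a plain vector once and the result preallocated. This is only a sketch with a smaller n; actual timings will vary by system.

```r
set.seed(1)
n <- 1e5
df1 <- data.frame(from = sample(n), to = sample(n))

to_vec <- df1$to    # extract the column once, outside the loop
path <- numeric(n)  # preallocate the result instead of growing it

for (i in seq_len(n)) {
  path[i] <- to_vec[i]  # plain vector indexing, no data frame lookup
}

length(path)
```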
Updated after last edit of Phil.
The first step is to simplify the input (df).
## Select min(To) by From.
if (nrow(df) > 0) {
df2 <- setNames(aggregate(df$To, list(df$From), "min"), c("From", "To"))
} else {
df2 <- df
}
Set path to the first start node and subsequently append end nodes.
## tt is the furthest node reached so far.
path <- df2[1,1]
tt <- df2[1,1]
for (i in seq_len(nrow(df2))){
if (df2[i, 1] < tt) next
else if (df2[i,1] == tt) { tt <- df2[i, 2]
path[length(path) + 1] <- df2[i, 2]
}
else break
}
head(path)
Output:
[1] 1 3 4 5 6 , df as in first example.
[1] 1 2 5 , df as in another example.
I have a dataset with these three columns and other additional columns
structure(list(from = c(1, 8, 3, 3, 8, 1, 4, 5, 8, 3, 1, 8, 4,
1), to = c(8, 3, 8, 54, 3, 4, 1, 6, 7, 1, 4, 3, 8, 8), time = c(1521823032,
1521827196, 1521827196, 1522678358, 1522701516, 1522701993, 1522702123,
1522769399, 1522780956, 1522794468, 1522794468, 1522794468, 1522794468,
1522859524)), class = "data.frame", row.names = c(NA, -14L))
I need the code to take all indices less than a number (e.g. 5) and for each of them do the following: Subset the data set if the index is either in column "from" or in column "to" and calculate a function (e.g the difference between the min and max in time). As a result I expect a dataframe with the indexes and the results of the calculation.
This is what I have, but it does not work.
dur<-function(x)max(x)-min(x) #The function to calculate the difference. In other cases I need to use other functions of my own
filternumber <- function(number,x){ #A function to filter data x by the number in the two columns
x <- x%>% subset(from == number | to == number)
return(x)
}
lista <- unique(c(data$from, data$to)) # Creates a list with all the indexes in the data. I do this to avoid having non-existing indexes
lista <-lista[lista <= 5] #Limit the list to 5. In my code this number would be an argument to a function
result<-lista%>%filteremployee(.,data) %>% select(time) %>% dur() #I use select because I have many other columns in the data
The result in this case should be a dataframe with 1036492 for 1, 967272 for 3 and 92475 for 4
I've also tried putting filteremployee(.,data) %>% select(time) %>% dur() inside mutate, but that does not work either.
Perhaps you are looking for something like this:
library(purrr)
library(dplyr)
index <- c(1, 3, 4)
names(index) <- index
index %>%
map_dfr(~ df %>%
filter(from == .x | to == .x) %>%
summarize(result = dur(time)),
.id = "index")
This returns
index result
1 1 1036492
2 3 967272
3 4 92475
The function was created with ==, which is elementwise. Here, we may need to loop
library(dplyr)
library(purrr)
map_dbl(lista, ~ filternumber(.x, data) %>%
select(time) %>%
dur)
[1] 1036492 967272 92475 0
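For comparison, the same per-index calculation can be sketched in base R with vapply(). The data and dur() are taken from the question; the names limit and idx are assumptions for illustration, and the cutoff uses "less than 5" as the question's text describes.

```r
df <- data.frame(
  from = c(1, 8, 3, 3, 8, 1, 4, 5, 8, 3, 1, 8, 4, 1),
  to   = c(8, 3, 8, 54, 3, 4, 1, 6, 7, 1, 4, 3, 8, 8),
  time = c(1521823032, 1521827196, 1521827196, 1522678358, 1522701516,
           1522701993, 1522702123, 1522769399, 1522780956, 1522794468,
           1522794468, 1522794468, 1522794468, 1522859524)
)

dur <- function(x) max(x) - min(x)

limit <- 5  # hypothetical cutoff argument
idx <- sort(unique(c(df$from, df$to)))
idx <- idx[idx < limit]

# For each index, keep rows where it appears in either column, then apply dur()
result <- data.frame(
  index = idx,
  result = vapply(idx, function(i) dur(df$time[df$from == i | df$to == i]),
                  numeric(1))
)
result
```

Swapping dur for another summary function only requires that it return a single number per subset.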
Hope you have a nice day.
Today I was trying to split one big column into two smaller ones in R. However, I haven't found a way to do it.
I have something like this (however, it is way bigger)
name3 <- c(1, 2, 3, 4, 5, 6)
df1 <- data.frame(name3)
print(df1)
I want to do something like this. My intention is just take the total number of variables and divide it into two equal groups.
name <- c(1, 2, 3)
name1 <- c(4, 5, 6)
df <- data.frame(name, name1)
print (df)
Thanks in advance!
One way to do it: first write the vector as a matrix in which you specify the number of columns,
then transform the matrix to a data frame.
From the data frame you can convert each column to a vector.
This is how I did it
name3 <- c(1, 2, 3, 4, 5, 6)
df <- as.data.frame(matrix(name3, ncol = 2))
name1 <- df$V1
name2 <- df$V2
Trying to accomplish this as close to base R as possible, this would be my method if the order of the sub-vectors doesn't matter:
# needed for index function
library(zoo)
# simple function to calculate even / odd
is.even <- function(x) x %% 2 == 0
# define my vector of values
name3 <- c(1, 2, 3, 4, 5, 6)
# split vector by even or odd index.
split(name3,f= is.even(index(name3)) )
Result:
$`FALSE`
[1] 1 3 5
$`TRUE`
[1] 2 4 6
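The same idea can be sketched without any extra package by using seq_along() for the positions. The second part below is a variant that cuts the vector into a first and a second half, which is the layout the question asks for; it assumes the vector has an even length.

```r
name3 <- c(1, 2, 3, 4, 5, 6)

# Split by odd / even position using base R only
by_parity <- split(name3, seq_along(name3) %% 2 == 0)

# Split into first half and second half (assumes even length)
half <- rep(1:2, each = length(name3) / 2)
by_half <- split(name3, half)

df <- data.frame(name = by_half[[1]], name1 = by_half[[2]])
df
```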
I have an R dataframe that contains 18 columns, I would like to write a function that compares column 1 to column 2, and if both columns contain the same value, a logical result of T or F is written to a new column (this part is not too hard for me), however I would like to repeat this process over for the next columns and write T/F to a new column.
values col 1 = values col 2, write T/F to new column, values col 3 = values col 4, write T/F to a new column (or write results to a new dataframe)
I have been trying to do this with the purrr package, and use the pmap/map function, but I know I am making a mistake and missing some important part.
This function should work if I understand your problem correctly.
df <-
data.frame(a = c(18, 6, 2 ,0),
b = c(0, 6, 2, 18),
c = c(1, 5, 6, 8),
d = c(3, 5, 9, 2))
compare_columns <-
function(x){
n_columns <- ncol(x)
odd_columns <- 2*1:(n_columns/2) - 1
even_columns <- 2*1:(n_columns/2)
comparisons_list <-
lapply(seq_len(n_columns/2),
function(y){
x[, odd_columns[y]] == x[, even_columns[y]]
})
comparisons_df <-
as.data.frame(comparisons_list,
col.names = paste0("column", odd_columns, "_column", even_columns))
return(cbind(x, comparisons_df))
}
compare_columns(df)
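A more compact way to express the same pairwise comparison is to walk over the odd and even columns together with mapply(). This is a sketch that assumes an even number of columns; the "_eq_" naming scheme for the new columns is made up for illustration.

```r
df <- data.frame(a = c(18, 6, 2, 0),
                 b = c(0, 6, 2, 18),
                 c = c(1, 5, 6, 8),
                 d = c(3, 5, 9, 2))

odd  <- seq(1, ncol(df) - 1, by = 2)  # columns 1, 3, ...
even <- odd + 1                       # columns 2, 4, ...

# Compare each odd column with the even column that follows it
same <- mapply(function(left, right) left == right, df[odd], df[even])
colnames(same) <- paste0(names(df)[odd], "_eq_", names(df)[even])

result <- cbind(df, same)
result
```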
I have multiple data frames and I want to perform the same action on all of them, for example, transform them all into data.tables (this is just an example; I want to apply other functions too).
A simple example can be (df1=df2=df3, without loss of generality here)
df1 <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 =c(1, 2, 2, 1, 2), var3 = c(10, 8, 15, 7, 9))
df2 <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 =c(1, 2, 2, 1, 2), var3 = c(10, 8, 15, 7, 9))
df3 <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 =c(1, 2, 2, 1, 2), var3 = c(10, 8, 15, 7, 9))
My approach was: (i) to create a list of the data frames (list.df), (ii) to create a list of how they should be called afterwards (list.dt) and (iii) to loop into those two lists:
list.df:
list.df<-vector('list',3)
for(j in 1:3){
name <- paste('df',j,sep='')
list.df[j] <- name
}
list.dt
list.dt<-vector('list',3)
for(j in 1:3){
name <- paste('dt',j,sep='')
list.dt[j] <- name
}
Loop (to make all data frames into data tables):
for(i in 1:3){
name<-list.dt[i]
assign(unlist(name), setDT(list.df[i]))
}
I am definitely doing something wrong as the result of this are three data tables with 1 variable, 1 observation (exactly the name list.df[i]).
I've tried to unlist the list.df thinking r would recognize that as an entire data frame and not only as a string:
for(i in 1:3){
name<-list.dt[i]
assign(unlist(name), setDT(unlist(list.df[i])))
}
But I get the error message:
Error in setDT(unlist(list.df[i])) :
Argument 'x' to 'setDT' should be a 'list', 'data.frame' or 'data.table'
Any suggestions?
You can just put all the data into one dataframe. Then, if you want to iterate through dataframes, use dplyr::do or, preferably, other dplyr functions
library(dplyr)
data =
list(df1 = df1, df2 = df2, df3 = df3) %>%
bind_rows(.id = "source") %>%
group_by(source)
Change your last snippet to this:
for(i in 1:3){
name <- list.dt[i]
assign(unlist(name), setDT(get(list.df[[i]])))
}
# Alternative to using lists
list.df <- paste0("df", 1:3)
# For loop that works with the length of the input 'list'/vector
# Creates the 'dt' objects on the fly
for(i in seq_along(list.df)){
assign(paste0("dt", i), setDT(get(list.df[i])))
}
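If the goal is to apply arbitrary functions rather than setDT specifically, an alternative to assign()/get() is to keep the data frames in a named list and use lapply(). The sketch below uses base R only; add_total is a hypothetical stand-in for whatever function needs to be applied.

```r
df1 <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 = c(1, 2, 2, 1, 2), var3 = c(10, 8, 15, 7, 9))
df2 <- df1
df3 <- df1

# Keep the data frames together in one named list
dfs <- list(df1 = df1, df2 = df2, df3 = df3)

# Hypothetical function to apply to every data frame
add_total <- function(d) {
  d$total <- d$var1 + d$var3
  d
}

results <- lapply(dfs, add_total)

# Rename the results df1 -> dt1, etc., instead of creating separate objects
names(results) <- sub("^df", "dt", names(results))
names(results)
```

Individual results are then available as results$dt1, results$dt2, and so on.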
Using data.table (which deserves far more advertising):
a) If you need all your data.frames converted to data.tables, then as was already suggested in the comments by @A5C1D2H2I1M1N2O1R2T1, iterate over your data.frames with setDT
library(data.table)
lapply(mget(paste0("df", 1:3)), setDT)
# or, if you wish to type them one by one:
lapply(list(df1, df2, df3), setDT)
class(df1) # check if coercion took place
# [1] "data.table" "data.frame"
b) If you need to bind your data.frames by rows, then use data.table::rbindlist
data <- rbindlist(mget(paste0("df", 1:3)), idcol = TRUE)
# or, if you wish to type them one by one:
data <- rbindlist(list(df1 = df1, df2 = df2, df3 = df3), idcol = TRUE)
Side note: If you like chaining/piping with the magrittr package (which you see almost always in combination with dplyr syntax), then it goes like:
library(data.table)
library(magrittr)
# for a)
mget(paste0("df", 1:3)) %>% lapply(setDT)
# for b)
data <- mget(paste0("df", 1:3)) %>% rbindlist(idcol = TRUE)