I would like to take the setdiff between consecutive groups without a for loop, ideally with data.table or a function from the apply family.
Data frame df:
id group
1 L1 1
2 L2 1
3 L1 2
4 L3 2
5 L4 2
6 L3 3
7 L5 3
8 L6 3
9 L1 4
10 L4 4
11 L2 5
I want to know how many new ids there are between consecutive groups. For example, comparing groups 1 and 2, there are two new ids, L3 and L4, so it returns 2 (not with setdiff directly, but with length()); comparing groups 2 and 3, L5 and L6 are the new ids, so it returns 2, and so on.
Expected results :
new_id
2
2
2
1
Data :
structure(list(id = structure(c(1L, 2L, 1L, 3L, 4L, 3L, 5L, 6L,
1L, 4L, 2L), .Label = c("L1", "L2", "L3", "L4", "L5", "L6"), class = "factor"),
group = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)), class = "data.frame", row.names = c(NA,
-11L), .Names = c("id", "group"))
Here is an option with mapply:
lst <- with(df, split(id, group))
mapply(function(x, y) length(setdiff(y, x)), head(lst, -1), tail(lst, -1))
#1 2 3 4
#2 2 2 1
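To make the pairing explicit, here is a self-contained sketch of the same idea (the data is rebuilt inline with id as character rather than factor, which does not change the counts):

```r
# Rebuild the example data (id as character; the counts are unchanged)
df <- data.frame(
  id = c("L1", "L2", "L1", "L3", "L4", "L3", "L5", "L6", "L1", "L4", "L2"),
  group = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)
)
lst <- with(df, split(id, group))
# head(lst, -1) drops the last group, tail(lst, -1) drops the first,
# so each pair (x, y) passed to the function is (group i, group i + 1)
new_ids <- mapply(function(x, y) length(setdiff(y, x)),
                  head(lst, -1), tail(lst, -1))
new_ids
#> 1 2 3 4
#> 2 2 2 1
```

The names of the result come from head(lst, -1), i.e. the earlier group of each pair.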
Here is a data.table way with merge. Suppose the original data.frame is named dt:
library(data.table)
setDT(dt)
dt2 <- copy(dt)[, group := group + 1]
merge(
  dt, dt2, by = 'group', allow.cartesian = TRUE
)[, .(n = length(setdiff(id.x, id.y))), by = group]
# group n
# 1: 2 2
# 2: 3 2
# 3: 4 2
# 4: 5 1
You could use Reduce to run a comparison function on successive pairs of elements of a list. For example:
xx <- Reduce(function(a, b) {
  x <- setdiff(b$id, a$id)
  list(id = b$id, new = x, newcount = length(x))
}, split(df, df$group),
accumulate = TRUE)[-1]
Then you can get the counts of new elements out with
sapply(xx, '[[', "newcount")
and you can get the new values with
sapply(xx, '[[', "new")
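A stripped-down illustration of what accumulate = TRUE does, on a toy integer vector rather than the question's data: Reduce keeps every intermediate result instead of only the last one.

```r
# With accumulate = TRUE, Reduce returns every intermediate result,
# not just the final fold
partials <- Reduce(`+`, 1:4, accumulate = TRUE)
partials
#> [1]  1  3  6 10
```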
L <- split(d, d$group) # split data ('d') by group to create a list
# use sapply to access 'id' within each subgroup of the list and take the setdiff
sapply(2:length(L), function(i)
  setNames(length(setdiff(L[[i]][, 1], L[[i - 1]][, 1])),
           nm = paste(names(L)[i], names(L)[i - 1], sep = "-")))
#2-1 3-2 4-3 5-4
# 2 2 2 1
Related
I'm looking to find the max values of different columns based on specified rows of each column.
My actual data frame is 50K columns and 1K+ rows so I can't use a loop without greatly increasing run time.
Data Frame:
row V1 V2 V3 V4
  1  5  2  4  5
  2  3  5  1  6
  3  7  3  2  6
  4  2  5  3 10
  5  6  9  1  2
beg_row <- c(2, 1, 2, 3)
end_row <- c(4, 3, 3, 5)
output:
c(7, 5, 2, 10)
You can try mapply (though I suspect it won't speed up the runtime much with that many columns):
> mapply(function(x, y, z) max(x[y:z]), df[-1], beg_row, end_row)
V1 V2 V3 V4
7 5 2 10
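If mapply overhead is a concern at 50K columns, a vapply sketch over column indices is a close variant (same data as in the Data block; vapply pre-allocates its result, though any speed gain here is likely modest):

```r
df <- structure(list(row = 1:5, V1 = c(5L, 3L, 7L, 2L, 6L),
                     V2 = c(2L, 5L, 3L, 5L, 9L), V3 = c(4L, 1L, 2L, 3L, 1L),
                     V4 = c(5L, 6L, 6L, 10L, 2L)),
                class = "data.frame", row.names = c(NA, -5L))
beg_row <- c(2, 1, 2, 3)
end_row <- c(4, 3, 3, 5)
# i + 1 skips the leading 'row' column; vapply pre-allocates the result
res <- vapply(seq_along(beg_row),
              function(i) max(df[[i + 1]][beg_row[i]:end_row[i]]),
              numeric(1))
res
#> [1]  7  5  2 10
```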
Data
df <- structure(list(row = 1:5, V1 = c(5L, 3L, 7L, 2L, 6L),
                     V2 = c(2L, 5L, 3L, 5L, 9L), V3 = c(4L, 1L, 2L, 3L, 1L),
                     V4 = c(5L, 6L, 6L, 10L, 2L)),
                class = "data.frame", row.names = c(NA, -5L))
beg_row <- c(2, 1, 2, 3)
end_row <- c(4, 3, 3, 5)
An option with dplyr
library(dplyr)
df1 %>%
  summarise(across(-row, ~ {
    i1 <- match(cur_column(), names(df1)[-1])
    max(.x[beg_row[i1]:end_row[i1]])
  }))
V1 V2 V3 V4
1 7 5 2 10
Or another option is to create NA outside the range and then use colMaxs
library(matrixStats)
colMaxs(as.matrix((NA^!(row(df1[-1]) >= beg_row[col(df1[-1])] &
row(df1[-1]) <= end_row[col(df1[-1])])) * df1[-1]), na.rm = TRUE)
[1] 7 5 2 10
The fastest approach that I have found is to use data.table and a for loop. I have tested it with a dataframe of 2K rows and 50K columns.
library(data.table)
beg_row <- sample(1:50, 49999, replace = T)
end_row <- sample(100:150, 49999, replace = T)
df <- matrix(sample(1:50, 50000*2000, replace = T), 2000, 50000)
df <- as.data.frame(df)
dt <- setDT(df)
vmax <- rep(0, ncol(dt)-1)
for (i in 2:ncol(dt)) {
vmax[i-1] <- max(dt[[i]][beg_row[i-1]:end_row[i-1]])
}
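If the data starts life as a matrix (as in the benchmark setup above, before the as.data.frame call), one further option is to skip the conversion entirely, since slicing a matrix column is cheap. A sketch with smaller made-up dimensions:

```r
set.seed(1)
# Smaller stand-in dimensions: 200 rows, 10 value columns
m <- matrix(sample(1:50, 200 * 10, replace = TRUE), 200, 10)
beg <- sample(1:50, 10, replace = TRUE)
end <- sample(100:150, 10, replace = TRUE)
# Column slicing on a matrix avoids data.frame overhead altogether
vmax <- vapply(seq_len(ncol(m)),
               function(j) max(m[beg[j]:end[j], j]),
               numeric(1))
```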
Another possible solution, based on purrr::pmap_dbl:
library(purrr)
pmap_dbl(list(beg_row, end_row, 2:ncol(df)), ~ max(df[..1:..2, ..3]))
#> [1] 7 5 2 10
Say I have a list c of three data frames:
> c
$first
a b
1 1 2
2 2 3
3 3 4
$second
a b
1 2 4
2 4 6
3 6 8
$third
a b
1 3 6
2 6 9
3 9 12
I want to run an lapply on c that will do a custom function on each data frame.
The custom function depends on three numbers and I want the function to use a different number depending on which data frame it's evaluating.
I was thinking of utilizing the names 'first', 'second', and 'third', but I'm unsure how to get those names once they're inside the lapply function. It would look something like this:
lapply(c, function(list, num1 = 1, num2 = -1, num3 = 0) {
  num <- ifelse(names(list) == "first", num1,
                ifelse(names(list) == "second", num2, num3))
  return(list * num)
})
So the result I would want would be first multiplied by 1, second multiplied by -1, and third multiplied by 0.
The names function gives the values a and b (the column names) instead of the name of the data frame itself, so that doesn't work. Is there a function that would be able to give me the 'first', 'second', and 'third' values I need?
Or alternatively, is there a better way of doing this in a lapply function?
Maybe it would be easier with Map. We pass the numbers of interest in the desired order and do a simple multiplication:
Map(`*`, lst1, c(1, -1, 0))
If the numbers are named
num1 <- setNames(c(1, -1, 0), c("first", "third", "second"))
then, match with the names of the list
Map(`*`, lst1, num1[names(lst1)])
#$first
# a b
#1 1 2
#2 2 3
#3 3 4
#$second
# a b
#1 0 0
#2 0 0
#3 0 0
#$third
# a b
#1 -3 -6
#2 -6 -9
#3 -9 -12
Or, if we decide to go with lapply, loop over the names of the list and extract each list element by name along with the corresponding element of the named vector:
lapply(names(lst1), function(nm) lst1[[nm]] * num1[nm])
Or with sapply
sapply(names(lst1), function(nm) lst1[[nm]] * num1[nm], simplify = FALSE)
Or another option is map2 from purrr
library(purrr)
map2(lst1, num1[names(lst1)], `*`)
Note: c is the name of a base function, so it is not recommended to name objects after functions.
data
lst1 <- list(first = structure(list(a = 1:3, b = 2:4), class = "data.frame",
row.names = c("1",
"2", "3")), second = structure(list(a = c(2L, 4L, 6L), b = c(4L,
6L, 8L)), class = "data.frame", row.names = c("1", "2", "3")),
third = structure(list(a = c(3L, 6L, 9L), b = c(6L, 9L, 12L
)), class = "data.frame", row.names = c("1", "2", "3")))
Besides the solutions by #akrun, you can also try the following code
mapply(`*`, lst1, c(1, -1, 0), SIMPLIFY = FALSE)
or
lapply(seq_along(lst1), function(k) lst1[[k]] * c(1, -1, 0)[k])
I have two data frames. dfOne is made like this:
X Y Z T J
3 4 5 6 1
1 2 3 4 1
5 1 2 5 1
and dfTwo is made like this
C.1 C.2
X Z
Y T
I want to obtain a new dataframe whose rows simultaneously have X, Y, Z, and T values greater than specific thresholds.
Example. I need simultaneously (in the same row):
X, Y > 2
Z, T > 4
I need to use the second data frame to reach my objective, I expect something like:
dfTwo$C.1>2
so the result would be a new dataframe with this structure:
X Y Z T J
3 4 5 6 1
How could I do it?
Here is a base R method with Map and Reduce.
# build lookup table of thresholds relative to variable name
vals <- setNames(c(2, 2, 4, 4), unlist(dat2))
# subset data.frame
dat[Reduce("&", Map(">", dat[names(vals)], vals)), ]
X Y Z T J
1 3 4 5 6 1
Here, Map returns a list of length 4 with logical variables corresponding to each comparison. This list is passed to Reduce which returns a single logical vector with length corresponding to the number of rows in the data.frame, dat. This logical vector is used to subset dat.
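To make the intermediate objects concrete, here is the same pipeline with each step materialized (dat and vals rebuilt inline to match the data block below):

```r
# Rebuild the example inputs
dat <- data.frame(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L),
                  Z = c(5L, 3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L))
vals <- c(X = 2, Y = 2, Z = 4, T = 4)
# Map returns one logical vector per compared column
cmp <- Map(">", dat[names(vals)], vals)
# Reduce folds them together with element-wise AND
keep <- Reduce("&", cmp)
keep
#> [1]  TRUE FALSE FALSE
dat[keep, ]
```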
data
dat <-
structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L,
3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)), .Names = c("X",
"Y", "Z", "T", "J"), class = "data.frame", row.names = c(NA,
-3L))
dat2 <-
structure(list(C.1 = structure(1:2, .Label = c("X", "Y"), class = "factor"),
C.2 = structure(c(2L, 1L), .Label = c("T", "Z"), class = "factor")), .Names = c("C.1",
"C.2"), class = "data.frame", row.names = c(NA, -2L))
We can use the purrr package
Here is the input data.
# Data frame from lmo's solution
dat <-
structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L,
3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)), .Names = c("X",
"Y", "Z", "T", "J"), class = "data.frame", row.names = c(NA,
-3L))
# A numeric vector to show the threshold values
# Notice that columns without any requirements need NA
vals <- c(X = 2, Y = 2, Z = 4, T = 4, J = NA)
Here is the implementation
library(purrr)
map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) %>% na.omit()
# A tibble: 1 x 5
X Y Z T J
<int> <int> <int> <int> <int>
1 3 4 5 6 1
map2_dfc loops through each column in dat and each value in vals in parallel, applying the given function. ~ifelse(.x > .y | is.na(.y), .x, NA) means: if the number in the column is larger than the corresponding value in vals, or the value in vals is NA, keep the original column value; otherwise replace it with NA. The output of map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) is therefore a data frame with NA values in the rows where some condition is not met. Finally, na.omit removes those rows.
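For comparison, here is a base-R sketch of the same NA-tolerant idea, using mapply to build a logical matrix and rowSums to keep only rows where every condition holds (dat and vals rebuilt inline so the snippet runs standalone):

```r
# dat and vals as above, rebuilt here for a standalone run
dat <- data.frame(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L),
                  Z = c(5L, 3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L))
vals <- c(X = 2, Y = 2, Z = 4, T = 4, J = NA)
# One logical column per variable; an NA threshold means "no condition"
ok <- mapply(function(col, v) col > v | is.na(v), dat, vals)
result <- dat[rowSums(ok) == ncol(dat), ]
result
#>   X Y Z T J
#> 1 3 4 5 6 1
```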
Update
Here I demonstrate how to convert the dfTwo dataframe into the vals vector used in my example.
First, let's create the dfTwo data frame.
dfTwo <- read.table(text = "C.1 C.2
X Z
Y T",
header = TRUE, stringsAsFactors = FALSE)
dfTwo
C.1 C.2
1 X Z
2 Y T
To complete the task, I load the dplyr and tidyr packages.
library(dplyr)
library(tidyr)
Now I begin the transformation of dfTwo. The first step is to use the stack function to convert the format.
dfTwo2 <- dfTwo %>%
stack() %>%
setNames(c("Col", "Group")) %>%
mutate(Group = as.character(Group))
dfTwo2
Col Group
1 X C.1
2 Y C.1
3 Z C.2
4 T C.2
The second step is to add the threshold information. One way to do this is to create a look-up table showing the association between Group and Value
threshold_df <- data.frame(Group = c("C.1", "C.2"),
Value = c(2, 4),
stringsAsFactors = FALSE)
threshold_df
Group Value
1 C.1 2
2 C.2 4
And then we can use the left_join function to combine the data frame.
dfTwo3 <- dfTwo2 %>% left_join(threshold_df, by = "Group")
dfTwo3
Col Group Value
1 X C.1 2
2 Y C.1 2
3 Z C.2 4
4 T C.2 4
Now for the third step. Notice that there is a column called J which does not need any threshold, so we need to add this information to dfTwo3. We can use the complete function from tidyr: the following code completes the data frame by adding any Col present in dat but missing from dfTwo3, with NA as the Value.
dfTwo4 <- dfTwo3 %>% complete(Col = colnames(dat))
dfTwo4
# A tibble: 5 x 3
Col Group Value
<chr> <chr> <dbl>
1 J <NA> NA
2 T C.2 4
3 X C.1 2
4 Y C.1 2
5 Z C.2 4
The fourth step is to arrange dfTwo4 in the right order. We can achieve this by turning Col into a factor whose levels follow the column order of dat.
dfTwo5 <- dfTwo4 %>%
mutate(Col = factor(Col, levels = colnames(dat))) %>%
arrange(Col) %>%
mutate(Col = as.character(Col))
dfTwo5
# A tibble: 5 x 3
Col Group Value
<chr> <chr> <dbl>
1 X C.1 2
2 Y C.1 2
3 Z C.2 4
4 T C.2 4
5 J <NA> NA
We are almost there. Now we can create vals from dfTwo5.
vals <- dfTwo5$Value
names(vals) <- dfTwo5$Col
vals
X Y Z T J
2 2 4 4 NA
Now we are ready to use the purrr package to filter the data.
The above is a breakdown of the steps. We can combine them all into the following code for simplicity.
library(dplyr)
library(tidyr)
threshold_df <- data.frame(Group = c("C.1", "C.2"),
Value = c(2, 4),
stringsAsFactors = FALSE)
dfTwo2 <- dfTwo %>%
stack() %>%
setNames(c("Col", "Group")) %>%
mutate(Group = as.character(Group)) %>%
left_join(threshold_df, by = "Group") %>%
complete(Col = colnames(dat)) %>%
mutate(Col = factor(Col, levels = colnames(dat))) %>%
arrange(Col) %>%
mutate(Col = as.character(Col))
vals <- dfTwo2$Value
names(vals) <- dfTwo2$Col
dfOne[Reduce(intersect, list(which(dfOne["X"] > 2),
which(dfOne["Y"] > 2),
which(dfOne["Z"] > 4),
which(dfOne["T"] > 4))),]
# X Y Z T J
#1 3 4 5 6 1
Or iteratively (so fewer inequalities are tested):
vals = c(X = 2, Y = 2, Z = 4, T = 4) # from #lmo's answer
dfOne[Reduce(intersect, lapply(names(vals), function(x) which(dfOne[x] > vals[x]))),]
# X Y Z T J
#1 3 4 5 6 1
I'm writing this assuming that the second DF is meant to categorize the fields in the first DF. It's way simpler if you don't need to use the second one to define the conditions:
dfNew = dfOne[dfOne$X > 2 & dfOne$Y > 2 & dfOne$Z > 4 & dfOne$T > 4, ]
Or, using dplyr:
library(dplyr)
dfNew = dfOne %>% filter(X > 2 & Y > 2 & Z > 4 & T > 4)
In case that's all you need, I'll save this comment while I poke at the more complicated version of the question.
I would like to use R to get all pairs from two columns with an index; it may need a loop. For example, turn two columns with gene name and index:
a 1,
b 1,
c 1,
d 2,
e 2
into a new matrix
a b 1,
b c 1,
a c 1,
d e 2
Can anyone help?
A tidyverse option using combn on a grouped data.frame:
library(tidyverse)
df %>% group_by(index) %>%
summarise(gene = list(as_data_frame(t(combn(gene, 2))))) %>%
unnest(.sep = '_')
## # A tibble: 4 × 3
## index gene_V1 gene_V2
## <int> <chr> <chr>
## 1 1 a b
## 2 1 a c
## 3 1 b c
## 4 2 d e
The same logic can be replicated in base R:
df2 <- aggregate(gene ~ index, df, function(x){t(combn(x, 2))})
do.call(rbind, apply(df2, 1, data.frame))
## index gene.1 gene.2
## 1 1 a b
## 2 1 a c
## 3 1 b c
## 4 2 d e
Data
df <- structure(list(gene = c("a", "b", "c", "d", "e"), index = c(1L,
1L, 1L, 2L, 2L)), .Names = c("gene", "index"), row.names = c(NA,
-5L), class = "data.frame")
Here is an option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)); grouped by 'index', we get the combn of 'gene', transpose it, and set the names of the 2nd and 3rd columns (if needed).
library(data.table)
setnames(setDT(df)[, transpose(combn(gene, 2, FUN = list)),
by = index], 2:3, paste0("gene", 1:2))[]
# index gene1 gene2
#1: 1 a b
#2: 1 a c
#3: 1 b c
#4: 2 d e
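Underneath both answers the workhorse is combn applied per group. A base-R-only sketch that rebuilds the same result without extra packages:

```r
df <- data.frame(gene = c("a", "b", "c", "d", "e"),
                 index = c(1L, 1L, 1L, 2L, 2L))
lst <- split(df$gene, df$index)
# For each index, enumerate the 2-element combinations and stack the results
res <- do.call(rbind, lapply(names(lst), function(i) {
  p <- t(combn(lst[[i]], 2))
  data.frame(gene1 = p[, 1], gene2 = p[, 2], index = i)
}))
res
```

Note that combn(x, 2) assumes every group has at least two genes; singleton groups would need guarding in real data.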
I have a dataset with an irregular dates column and want to create an index column. The index (e.g. 1) stays the same for three distinct consecutive dates, then changes (e.g. to 2) for the next three distinct dates, and so on. Here is a sample of the dates and how the desired column should look:
structure(list(Date = c(42370, 42371, 42371, 42371, 42372, 42372,
42375, 42375, 42375, 42377, 42377, 42383, 42383, 42385, 42386,
42386, 42386, 42393, 42393, 42394, 42394, 42395, 42398, 42398,
42398, 42398), Index = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4)), .Names = c("Date",
"Index"), row.names = c(NA, 26L), class = "data.frame")
Using rleid from the data.table package and cumsum:
library(data.table)
setDT(d1)[, index := (rleid(Date)-1) %% 3
][, index := cumsum(index < shift(index, fill=1))][]
gives:
Date index
1: 01-01-16 1
2: 02-01-16 1
3: 02-01-16 1
4: 02-01-16 1
5: 03-01-16 1
6: 03-01-16 1
7: 06-01-16 2
8: 06-01-16 2
9: 06-01-16 2
10: 08-01-16 2
11: 08-01-16 2
12: 14-01-16 2
13: 14-01-16 2
14: 16-01-16 3
15: 17-01-16 3
16: 17-01-16 3
17: 17-01-16 3
18: 24-01-16 3
19: 24-01-16 3
20: 25-01-16 4
21: 25-01-16 4
22: 26-01-16 4
23: 29-01-16 4
24: 29-01-16 4
25: 29-01-16 4
26: 29-01-16 4
Explanation:
The rleid function creates a run-length id: every time Date changes, the run-length id increases by 1.
Subtracting 1 from the run-length id and taking the modulus (the %% 3 part) gives a vector of repeating 0, 1, 2 sequences.
As a last step you take the cumulative sum of the comparison of each value with the previous one: whenever index < shift(index, fill = 1) is TRUE, cumsum counts a one, which starts a new group.
In order to better see what this code does, see the output of the following code which creates a variable for each step:
setDT(d1)[, index1 := (rleid(Date)-1) %% 3
][, index2 := cumsum(index1 < shift(index1, fill=1))][]
Used data:
d1 <- structure(list(Date = structure(c(16801, 16802, 16802, 16802, 16803, 16803, 16806,
16806, 16806, 16808, 16808, 16814, 16814, 16816,
16817, 16817, 16817, 16824, 16824, 16825, 16825,
16826, 16829, 16829, 16829, 16829), class = "Date")),
.Names = "Date", row.names = c(NA, 26L), class = "data.frame")
This constructs a grouped-by-3 index over the unique values of Date and then uses character names as a lookup table for the conversion:
fac <- ((seq(length(unique(dat$Date)))-1) %/%3) +1
names(fac) <- unique(dat$Date)
dat$myIndex <- fac[as.character(dat$Date)]
dat
#-------
Date Index myIndex
1 42370 1 1
2 42371 1 1
3 42371 1 1
4 42371 1 1
5 42372 1 1
6 42372 1 1
7 42375 2 2
8 42375 2 2
9 42375 2 2
10 42377 2 2
11 42377 2 2
12 42383 2 2
13 42383 2 2
14 42385 3 3
15 42386 3 3
16 42386 3 3
17 42386 3 3
18 42393 3 3
19 42393 3 3
20 42394 4 4
21 42394 4 4
22 42395 4 4
23 42398 4 4
24 42398 4 4
25 42398 4 4
26 42398 4 4
base R. We can modify the rle (run-length encoding) of the object to group trios of values:
DF$index = with(rle(DF$Date), {
g = ceiling(seq_along(values)/3)
split(values, g) <- seq(tail(g,1))
inverse.rle(list(lengths = lengths, values = values))
})
The weird split(x,g) <- bit was borrowed from ave. If the Date column is increasing, this can be done more simply (thanks to #Jaap):
DF$index = ceiling(match(DF$Date, unique(DF$Date))/3) # or...
DF$index = ceiling(as.integer(factor(DF$Date))/3)
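A quick sanity check of the match/unique version on a toy increasing vector:

```r
dates <- c(101, 102, 102, 103, 104, 104, 105, 106, 107)
# match(x, unique(x)) ranks each date by first appearance: 1 2 2 3 4 4 5 6 7
idx <- ceiling(match(dates, unique(dates)) / 3)
idx
#> [1] 1 1 1 1 2 2 2 2 3
```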
data.table. The data.table analogue is simpler:
library(data.table)
setDT(DF)[, index := ceiling(rleid(Date)/3)]
I used data from an earlier version of the question:
df <- data.frame(Date = c("01-01-16", "02-01-16", "02-01-16", "02-01-16",
"03-01-16", "03-01-16", "06-01-16", "06-01-16", "06-01-16", "08-01-16",
"08-01-16", "14-01-16", "14-01-16", "16-01-16", "17-01-16", "17-01-16",
"17-01-16", "24-01-16", "24-01-16", "25-01-16", "25-01-16", "26-01-16",
"29-01-16", "29-01-16", "29-01-16", "29-01-16"),
Index = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L))
I would start by converting the Date column from character to Date and making sure the data frame is sorted by date. (You can skip this with the new version of the data, where Date is already numeric, if you are sure the data frame is already sorted by Date.)
df$Date <- as.Date(df$Date, format="%d-%m-%y")
df <- df[ order(df$Date),]
Then I would convert the dates to consecutive integers (one way is to convert to factor and then unclass; here c is used as a shorthand for that) and cut the result into equal intervals:
df$ndx <- c(factor(as.numeric(df$Date)))
df$ndx <- cut(df$ndx, seq(0.5, max(df$ndx)+0.5, by=3), labels=FALSE)