Subset a dataframe using start and stop points from another dataframe?

I have a dataframe df with 3 columns: id, first, and last.
id <- c(27,27,134,134)
first <- c(14,20,9,16)
last <- c(17,24,13,20)
df <- as.data.frame(cbind(id,first,last))
df
Each row corresponds to a chunk of data from another dataframe that I want to keep.
first and last indicate the first and last frames of the relevant chunk.
I want to use this to subset the other dataframe dat, which is structured as below:
dat_id <- c(rep(27, 30), rep(134,30))
dat_frame <- c(seq(1:30), seq(1:30))
dat_data <- c(sample(1:60))
dat <- as.data.frame(cbind(dat_id,dat_frame,dat_data))
dat
The only way I know to extract the relevant portion is with a for loop as below (this produces the expected output), but I expect this is a horribly inefficient way to do it. What's a better way?
#header row
new_df <- data.frame(id = numeric(), frame = numeric(), data = numeric())
#populate
for (i in seq_len(nrow(df))) {
  new_df <- rbind(new_df, subset(dat, dat_id == df[i, "id"])[df[i, "first"]:df[i, "last"], ])
}
new_df

This can be done with a complex join in SQL, which avoids creating a large intermediate data frame by joining only on id and then cutting it down.
library(sqldf)
sqldf("
select dat.*
from dat
join df on dat.dat_id = df.id and
dat.dat_frame between df.first and df.last
")
Update
The example in the question changed, and the solution has been simplified to match the new example.

Using dplyr, we can do a left_join of dat and df and keep only those rows whose dat_frame lies between first and last for their respective id.
library(dplyr)
left_join(dat, df, by = c("dat_id" = "id")) %>%
filter(between(dat_frame, first, last)) %>%
select(-first, -last)
Or using the same logic in base R
subset(merge(dat, df, by.x = "dat_id", by.y = "id", all.x = TRUE),
dat_frame >= first & dat_frame <= last)

We can use a non-equi join for this. It would be faster and more efficient.
library(data.table)
setDT(dat)[, newcol := dat_frame][df, on = .(dat_id = id,
  newcol >= first, newcol <= last)][, .(dat_id, dat_frame, dat_data)]
# dat_id dat_frame dat_data
# 1: 27 14 26
# 2: 27 15 56
# 3: 27 16 30
# 4: 27 17 49
# 5: 27 20 23
# 6: 27 21 37
# 7: 27 22 7
# 8: 27 23 40
# 9: 27 24 12
#10: 134 9 57
#11: 134 10 35
#12: 134 11 31
#13: 134 12 53
#14: 134 13 38
#15: 134 16 15
#16: 134 17 14
#17: 134 18 33
#18: 134 19 54
#19: 134 20 43
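As an aside, the temporary newcol can be avoided by joining on dat_frame itself and recovering the original frame values with the x. prefix. This is only a sketch, assuming a fresh dat and data.table >= 1.9.8 (where x.-prefixed columns are available in non-equi joins):
library(data.table)
# each row of df picks out the dat rows with a matching id and a dat_frame
# within [first, last]; x.dat_frame restores dat's own frame values, which
# the join columns would otherwise show as the range bounds taken from df
setDT(dat)[df, on = .(dat_id = id, dat_frame >= first, dat_frame <= last),
  .(dat_id, dat_frame = x.dat_frame, dat_data)]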
Or another option is fuzzyjoin
library(fuzzyjoin)
library(dplyr)
dat %>%
mutate(newcol = dat_frame) %>%
fuzzy_left_join(df, by = c("dat_id" = "id", "newcol" = "first",
  "newcol" = "last"), match_fun = list(`==`, `>=`, `<=`)) %>%
na.omit %>%
select(dat_id, dat_frame, dat_data)
# dat_id dat_frame dat_data
#14 27 14 26
#15 27 15 56
#16 27 16 30
#17 27 17 49
#20 27 20 23
#21 27 21 37
#22 27 22 7
#23 27 23 40
#24 27 24 12
#39 134 9 57
#40 134 10 35
#41 134 11 31
#42 134 12 53
#43 134 13 38
#46 134 16 15
#47 134 17 14
#48 134 18 33
#49 134 19 54
#50 134 20 43
Or using base R
out <- do.call(rbind, Map(function(x, y) do.call(rbind,
Map(function(u, v) subset(x, dat_frame >= u & dat_frame <= v),
y$first, y$last)), split(dat, dat$dat_id), split(df, df$id)))
row.names(out) <- NULL
out
# dat_id dat_frame dat_data
#1 27 14 26
#2 27 15 56
#3 27 16 30
#4 27 17 49
#5 27 20 23
#6 27 21 37
#7 27 22 7
#8 27 23 40
#9 27 24 12
#10 134 9 57
#11 134 10 35
#12 134 11 31
#13 134 12 53
#14 134 13 38
#15 134 16 15
#16 134 17 14
#17 134 18 33
#18 134 19 54
#19 134 20 43
NOTE: All of the above solutions work. Also note that the between() solution in the accepted answer gives an error, because between() expects scalar bounds rather than the first and last columns:
left_join(dat, df, by = c("dat_id" = "id")) %>%
filter(between(dat_frame, first, last)) %>%
select(-first, -last)
#Error: Expecting a single value: [extent=120].
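If you hit that error, the same filter can be written with plain vectorised comparisons instead of between(); this is just a minimal variant of the accepted answer, mirroring the base R subset() condition above:
library(dplyr)
left_join(dat, df, by = c("dat_id" = "id")) %>%
  filter(dat_frame >= first & dat_frame <= last) %>%
  select(-first, -last)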

Related

Put the first row as the column names of my dataframe with dplyr in R

This is my dataframe:
x<-data.frame(A = c(letters[1:10]), M1 = c(11:20), M2 = c(31:40), M3 = c(41:50))
colnames(x)<-NULL
I want to transpose it (t(x)) and use the first column of x as the colnames of the new dataframe t(x).
Also, I need them (the colnames of t(x)) to be treated as words/letters (as character, right?).
Is it possible to do this with the dplyr package?
Any help?
The {janitor} package is good for this and is flexible enough to push any row to the column names:
library(tidyverse)
library(janitor)
x <- x %>% row_to_names(row_number = 1)
You can do this easily in base R. Just make the first column of x be the row names, then remove the first column and transpose.
row.names(x) <- x[, 1]
x <- t(x[, -1])
x
a b c d e f g h i j
M1 11 12 13 14 15 16 17 18 19 20
M2 31 32 33 34 35 36 37 38 39 40
M3 41 42 43 44 45 46 47 48 49 50
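Note that t() returns a matrix, not a data frame. If a data frame is needed downstream, here is a small sketch of the same idea with an explicit coercion (assuming, as the answer above does, that x still has its original column names):
row.names(x) <- x[, 1]
x <- as.data.frame(t(x[, -1]))  # t() yields a matrix; coerce back to a data frame
x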
Try this:
library(dplyr)
library(tidyr)
x <- data.frame(
A = c(letters[1:10]),
M1 = c(11:20),
M2 = c(31:40),
M3 = c(41:50))
x %>%
gather(key = key, value = value, 2:ncol(x)) %>%
spread(key = names(x)[1], value = "value")
key a b c d e f g h i j
1 M1 11 12 13 14 15 16 17 18 19 20
2 M2 31 32 33 34 35 36 37 38 39 40
3 M3 41 42 43 44 45 46 47 48 49 50
I think column_to_rownames from the tibble package would be your simplest solution. Use it before you transpose with t.
library(magrittr)
library(tibble)
x %>%
column_to_rownames("A") %>%
t
#> a b c d e f g h i j
#> M1 11 12 13 14 15 16 17 18 19 20
#> M2 31 32 33 34 35 36 37 38 39 40
#> M3 41 42 43 44 45 46 47 48 49 50
The "M1", "M2", "M3" above are row names. If you want to keep them inside (as a column), you can add rownames_to_column from the same package.
x %>%
column_to_rownames("A") %>%
t %>%
as.data.frame %>%
rownames_to_column("key")
#> key a b c d e f g h i j
#> 1 M1 11 12 13 14 15 16 17 18 19 20
#> 2 M2 31 32 33 34 35 36 37 38 39 40
#> 3 M3 41 42 43 44 45 46 47 48 49 50
Essentially,
column_to_rownames("A") converts column "A" in x to row names,
t transposes the data.frame (now a matrix),
as.data.frame reclassifies it back as a data.frame (which is necessary for the next function), and
rownames_to_column("key") converts the row names into a new column called "key".
Using rownames_to_column() from the tibble package
library(magrittr)
library(tibble)
x %>%
t() %>%
as.data.frame(stringsAsFactors = FALSE) %>%
rownames_to_column() %>%
`colnames<-`(.[1,]) %>%
.[-1,] %>%
`rownames<-`(NULL)
#> A a b c d e f g h i j
#> 1 M1 11 12 13 14 15 16 17 18 19 20
#> 2 M2 31 32 33 34 35 36 37 38 39 40
#> 3 M3 41 42 43 44 45 46 47 48 49 50
x %>%
`row.names<-`(.[, 1]) %>%
t() %>%
as.data.frame(stringsAsFactors = FALSE) %>%
.[-1,]
#> a b c d e f g h i j
#> M1 11 12 13 14 15 16 17 18 19 20
#> M2 31 32 33 34 35 36 37 38 39 40
#> M3 41 42 43 44 45 46 47 48 49 50

Group data.frame by column and select number of rows based on numeric vector

Let's say I have got a data.frame like the following:
df = read.table(text = 'A B
11 98
11 87
11 999
11 22
12 34
12 34
12 44
12 98
17 77
17 67
17 87
17 66
33 6
33 45
33 12
33 10', header = TRUE)
I need to group df by col A and select only a given number of rows based on the following vector:
n_rows = c(2, 3, 4, 2)
So that the first group will have only 2 rows (no matter their order), the second group 3 rows, etc...
Here my expected output:
A B
11 98
11 87
12 34
12 34
12 44
17 77
17 67
17 87
17 66
33 6
33 45
I tried to do the trick with dplyr by doing the following:
df %>%
group_by(A) %>%
top_n(n = n_rows, wt =B)
but I got the following error:
Error: n must be a scalar integer
Any suggestion?
thanks
Another base R option,
do.call(rbind, Map(function(x, y) x[seq(y),], split(df, df$A), n_rows))
which gives,
A B
11.1 11 98
11.2 11 87
12.5 12 34
12.6 12 34
12.7 12 44
17.9 17 77
17.10 17 67
17.11 17 87
17.12 17 66
33.13 33 6
33.14 33 45
Here's a possibility, first splitting the data.frame and then using map2:
library(dplyr)
library(purrr)
df %>% split(.$A) %>%
map2_dfr(n_rows,head)
# A B
# 1 11 98
# 2 11 87
# 3 12 34
# 4 12 34
# 5 12 44
# 6 17 77
# 7 17 67
# 8 17 87
# 9 17 66
# 10 33 6
# 11 33 45
If order doesn't matter, you don't need top_n; head works just fine (and is faster). Otherwise, just replace head with top_n.
EDIT:
Here is also a tidy solution; it is a few characters longer, but maybe more satisfying as you don't separate things of the same "kind" and instead work completely inside the data.frame (same output).
df %>% nest(B) %>%
mutate(data = map2(data,n_rows,head)) %>%
unnest
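For current tidyr (>= 1.0), nest() and unnest() want the nested column named explicitly; here is a sketch of the same pipeline under that assumption:
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  nest(data = B) %>%
  mutate(data = map2(data, n_rows, head)) %>%
  unnest(data)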
In base R, you can do something like:
df2 <- data.frame()
for (i in seq_along(unique(df$A))) {
df2 <- rbind(df2, df[df$A == unique(df$A)[i], ][1:n_rows[i], ])
}
> df2
A B
1 11 98
2 11 87
5 12 34
6 12 34
7 12 44
9 17 77
10 17 67
11 17 87
12 17 66
13 33 6
14 33 45
Here is an option with top_n
library(tidyverse)
df %>%
split(., .$A) %>%
map2_df(., n_rows, ~ top_n(., .y, wt = .$B))
If we are not looking for top_n, then another option is slice
df %>%
group_by(A) %>%
nest(B) %>%
mutate(newcol = map2(data, n_rows, ~ .x %>% slice(seq(.y)))) %>%
select(-data) %>%
unnest

Filter data frame by results from tapply function

I'm trying to apply a tapply function I wrote to filter a dataset. Here is a sample data frame (df) below to describe what I'm trying to do.
I want to keep in my data frame the rows where the value of df$Cumulative_Time is closest to the value of 14. It should do this for each factor level in df$ID (keep the row closest to the value 14 for each ID factor).
ID Date Results TimeDiff Cumulative_Time
A 7/10/2015 71 0 0
A 8/1/2015 45 20 20
A 8/22/2015 0 18 38
A 9/12/2015 79 17 55
A 10/13/2015 44 26 81
A 11/27/2015 98 37 118
B 7/3/2015 75 0 0
B 7/24/2015 63 18 18
B 8/21/2015 98 24 42
B 9/26/2015 70 30 72
C 8/15/2015 77 0 0
C 9/2/2015 69 15 15
C 9/4/2015 49 2 17
C 9/8/2015 88 2 19
C 9/12/2015 41 4 23
C 9/19/2015 35 6 29
C 10/10/2015 33 18 47
C 10/14/2015 31 3 50
D 7/2/2015 83 0 0
D 7/28/2015 82 22 22
D 8/27/2015 100 26 48
D 9/17/2015 19 17 65
D 10/8/2015 30 18 83
D 12/9/2015 96 51 134
D 1/6/2016 30 20 154
D 2/17/2016 32 36 190
D 3/19/2016 42 27 217
I got as far as the following:
spec_day = 14 # value I want to compare df$Cumulative_Time to
# applying function to calculate closest value to spec_day
tapply(df$Cumulative_Time, df$ID, function(x) which(abs(x - spec_day) == min(abs(x - spec_day))))
Question: how do I include this tapply function as a means to do the filtering of my data frame df? Am I approaching this problem the right way, or is there some simpler way to accomplish this that I'm not seeing? Any help would be appreciated--thanks!
Here's a way you can do it, note that I didn't use tapply:
spec_day <- 14
new_df <- do.call('rbind',
by(df, df$ID,
FUN = function(x) x[which.min(abs(x$Cumulative_Time - spec_day)), ]
))
new_df
ID Date Results TimeDiff Cumulative_Time
A A 8/1/2015 45 20 20
B B 7/24/2015 63 18 18
C C 9/2/2015 69 15 15
D D 7/28/2015 82 22 22
which.min (and its sibling which.max) is a very useful function.
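As a quick illustration with ID A's Cumulative_Time values from the question, which.min returns the index of the (first) smallest element:
which.min(abs(c(0, 20, 38, 55, 81, 118) - 14))
#> [1] 2  # 20 is closest to 14, so A's second row is kept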
Here's a more concise and faster alternative using data.table:
library(data.table)
setDT(df)[, .SD[which.min(abs(Cumulative_Time - 14))], by = ID]
# ID Date Results TimeDiff Cumulative_Time
#1: A 8/1/2015 45 20 20
#2: B 7/24/2015 63 18 18
#3: C 9/2/2015 69 15 15
#4: D 7/28/2015 82 22 22

How can I select the 10 largest values from three different columns and save them in a new data frame in R?

Var1 <- 90:115
Var2 <- 1:26
Var3 <- 52:27
data <- data.frame(Var1, Var2, Var3)
Hi, I want to select the 10 largest values from each column and save them in a new data frame. I know that in my example the new data frame will contain 20 rows, but I don't understand the correct workflow.
That's what I'm expecting:
Var1 Var2 Var3
90 1 52
91 2 51
92 3 50
93 4 49
94 5 48
95 6 47
96 7 46
97 8 45
98 9 44
99 10 43
106 17 36
107 18 35
108 19 34
109 20 33
110 21 32
111 22 31
112 23 30
113 24 29
114 25 28
115 26 27
I can solve my problem for three columns with this approach:
df <- subset(data, Var1 >=106 | Var2 >=17 | Var3 >=43)
but if I have to do that for 50+ columns it's not really the best solution.
This can be done by looping over the columns with lapply, sorting them, and getting the first 10 values with head:
data.frame(lapply(data, function(x) head(sort(x,
decreasing=TRUE) ,10)))
If we need the first 10 rows, just use
head(data, 10)
Update
Based on the OP's edit
data[sort(Reduce(union,lapply(data, function(x)
order(x,decreasing=TRUE)[1:10]))),]
I think this is what you want:
data[sort(unique(c(sapply(data,order,decreasing=T)[1:10,]))),]
Basically, index the top 10 elements of each column, merge the indices and remove duplicates, reorder them, and extract the corresponding rows from the original data.
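A commented, step-by-step equivalent of that one-liner (same logic, just unrolled):
# indices of the 10 largest values in each column, one column of indices per variable
idx <- sapply(data, order, decreasing = TRUE)[1:10, ]
# merge the index columns, drop duplicates, and restore the original row order
rows <- sort(unique(c(idx)))
# extract those rows from the original data
data[rows, ]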
A direct answer to your question:
nv1 <- sort(Var1,decreasing = TRUE)[1:10]
nv2 <- sort(Var2,decreasing = TRUE)[1:10]
nv3 <- sort(Var3, decreasing = TRUE)[1:10]
nd <- data.frame(nv1, nv2, nv3)
But why would you want to do such a thing? You're breaking the row alignment of the data: in your example, Var3 is decreasing while the others are increasing. Perhaps you want a list, rather than a data frame? See the sketch below.
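A sketch of that list-based alternative; each element keeps its own top 10 with no pretence of row alignment:
top10 <- lapply(data, function(x) sort(x, decreasing = TRUE)[1:10])
top10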
This might help:
thresh <- sapply(data,sort,decreasing=T)[10,]
data[!!rowSums(sapply(1:ncol(data),function(x) data[,x]>=thresh[x])),]
First, a vector thresh is defined, which contains the tenth largest value of each column. Then we loop over the columns to check whether each value is greater than or equal to the corresponding threshold. The !! is double negation, a shorthand for as.logical(), which (combined with rowSums) selects those rows where at least one of the values is greater than or equal to the threshold. In your example this yields the output:
# Var1 Var2 Var3
#1 90 1 52
#2 91 2 51
#3 92 3 50
#4 93 4 49
#5 94 5 48
#6 95 6 47
#7 96 7 46
#8 97 8 45
#9 98 9 44
#10 99 10 43
#17 106 17 36
#18 107 18 35
#19 108 19 34
#20 109 20 33
#21 110 21 32
#22 111 22 31
#23 112 23 30
#24 113 24 29
#25 114 25 28
#26 115 26 27
Which is equal to the output that you obtain with the command you posted:
#> identical(data[!!rowSums(sapply(1:ncol(data),function(x) data[,x]>=thresh[x])),], subset(data, Var1 >=106 | Var2 >=17 | Var3 >=43))
[1] TRUE

R: how to find corresponding value in different columns of a dataframe

I am new to R and I got stuck on something that may seem easy to you. I have a dataframe with a huge amount of data, including an AGE column; the age is tied to a particular person, so values repeat. I had to divide it into ranges and see how many people are in each group. So I have this:
        [,1]
(1,23]  5912
(23,26] 5579
(26,28] 3314
(28,33] 6693
(33,37] 4682
(37,41] 4514
(41,46] 5169
(46,51] 4812
(51,57] 4236
(57,76] 4031
Now I have another column G/B which indicates whether the person is BAD or GOOD (as 1 and 0, respectively).
I need to calculate how many 1s and 0s, i.e. 'bad's and 'good's, there are in each age group.
So the data should look something like:
        Total Bad   Good
(1,23]  5912  2912  3000
etc.
Hope to get help with this one.
Maybe you could try
library(data.table)
setDT(dat1)[,list(Total=.N, Bad=sum(GB), Good=sum(!GB)), keyby=range]
# range Total Bad Good
# 1: (0,1] 16 7 9
# 2: (1,23] 257 132 125
# 3: (23,26] 29 16 13
# 4: (26,28] 19 8 11
# 5: (28,33] 60 34 26
# 6: (33,37] 52 30 22
# 7: (37,41] 41 19 22
# 8: (41,46] 56 25 31
# 9: (46,51] 65 27 38
#10: (51,57] 57 28 29
#11: (57,76] 196 110 86
#12: (76,85] 101 44 57
#13: (85,100] 51 24 27
Or using dplyr
library(dplyr)
dat1 %>%
group_by(range) %>%
summarise(Total=n(), Bad=sum(GB), Good=sum(!GB))
Or using aggregate from base R
res <- do.call(`data.frame`,aggregate(GB~range, dat1,
FUN=function(x) c(length(x), sum(x), sum(!x))))
data
set.seed(42)
dat <- data.frame(AGE= sample(1:90, 1000, replace=TRUE),
GB=sample(0:1, 1000, replace=TRUE))
dat1 <- transform(dat, range=cut(AGE,
breaks=c(0,1,23,26,28,33,37,41,46,51,57,76,85,100)))
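As a quick cross-check without any extra packages, a plain contingency table over the simulated dat1 gives the same per-range counts (here column 0 corresponds to Good and column 1 to Bad):
with(dat1, table(range, GB))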
