Labeling the last date in R

I have a dataset of game sessions (id, session count, average session duration in seconds, and the date of each session for each id). Here is a sample of mydat:
mydat=read.csv("C:/Users/Admin/desktop/rty.csv", sep=";",dec=",")
mydat
structure(list(udid = c(74385162L, 79599601L, 79599601L, 91475825L,
91475825L, 91492531L, 92137561L, 96308016L, 96308016L, 96308016L,
96308016L, 96308016L, 96495076L, 97135620L, 97135620L, 97135620L,
97135620L, 97135620L, 97135620L, 97135620L, 97135620L, 97135620L,
97135620L, 97165942L), count = c(1L, 1L, 1L, 1L, 3L, 1L, 1L,
2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), avg_duration = c(39L, 1216L, 568L, 5L, 6L, 79L, 9L, 426L,
78L, 884L, 785L, 785L, 22L, 302L, 738L, 280L, 2782L, 5L, 2284L,
144L, 234L, 231L, 539L, 450L), date = structure(c(13L, 3L, 3L,
1L, 1L, 14L, 2L, 11L, 11L, 11L, 12L, 12L, 9L, 7L, 4L, 4L, 5L,
6L, 8L, 8L, 8L, 8L, 8L, 10L), .Label = c("11.10.16", "12.12.16",
"15.11.16", "15.12.16", "16.12.16", "17.12.16", "18.10.16", "18.12.16",
"21.10.16", "26.10.16", "28.11.16", "29.11.16", "31.10.16", "8.10.16"
), class = "factor")), .Names = c("udid", "count", "avg_duration",
"date"), class = "data.frame", row.names = c(NA, -24L))
For each id, I need to mark the last observation (the last date the player was seen) with 1 and all earlier observations of that id with 0.
For example, id 96308016 has 5 observations, so the fifth (last) observation is marked with 1 and the previous 4 observations are marked with 0.
If an id has only 1 observation, it is marked with 1 as well, like id 74385162.
To make it clearer, here is my expected output:
udid count avg_duration date diff
74385162 1 39 31.10.16 1
79599601 1 1216 15.11.16 0
79599601 1 568 15.11.16 1
91475825 1 5 11.10.16 0
91475825 3 6 11.10.16 1
91492531 1 79 8.10.16 1
92137561 1 9 12.12.16 1
96308016 2 426 28.11.16 0
96308016 2 78 28.11.16 0
96308016 1 884 28.11.16 0
96308016 1 785 29.11.16 0
96308016 1 785 29.11.16 1
96495076 1 22 21.10.16 1
97135620 2 302 18.10.16 0
97135620 1 738 15.12.16 0
97135620 1 280 15.12.16 0
97135620 1 2782 16.12.16 0
97135620 1 5 17.12.16 0
97135620 1 2284 18.12.16 0
97135620 1 144 18.12.16 0
97135620 1 234 18.12.16 0
97135620 1 231 18.12.16 0
97135620 1 539 18.12.16 1
97165942 1 450 26.10.16 1
How can I do that?

You could do the following:
library(dplyr)
mydat = mydat %>%
group_by(udid) %>%
mutate(diff=ifelse(row_number()==n(),1,0)) %>%
as.data.frame()
Output:
udid count avg_duration date diff
1 74385162 1 39 31.10.16 1
2 79599601 1 1216 15.11.16 0
3 79599601 1 568 15.11.16 1
4 91475825 1 5 11.10.16 0
5 91475825 3 6 11.10.16 1
6 91492531 1 79 8.10.16 1
7 92137561 1 9 12.12.16 1
8 96308016 2 426 28.11.16 0
9 96308016 2 78 28.11.16 0
10 96308016 1 884 28.11.16 0
11 96308016 1 785 29.11.16 0
12 96308016 1 785 29.11.16 1
13 96495076 1 22 21.10.16 1
14 97135620 2 302 18.10.16 0
15 97135620 1 738 15.12.16 0
16 97135620 1 280 15.12.16 0
17 97135620 1 2782 16.12.16 0
18 97135620 1 5 17.12.16 0
19 97135620 1 2284 18.12.16 0
20 97135620 1 144 18.12.16 0
21 97135620 1 234 18.12.16 0
22 97135620 1 231 18.12.16 0
23 97135620 1 539 18.12.16 1
24 97165942 1 450 26.10.16 1

If it's already sorted by date, then this will work:
mydat$diff = as.integer(!duplicated(mydat$udid, fromLast = TRUE))
head(mydat)
# udid count avg_duration date diff
# 1 74385162 1 39 31.10.16 1
# 2 79599601 1 1216 15.11.16 0
# 3 79599601 1 568 15.11.16 1
# 4 91475825 1 5 11.10.16 0
# 5 91475825 3 6 11.10.16 1
# 6 91492531 1 79 8.10.16 1
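If it's not already sorted by date, convert to Date class (note %m for the month; %M means minutes), sort, and then do the above:
mydat$date = as.Date(mydat$date, format = "%d.%m.%y")
mydat = mydat[order(mydat$udid, mydat$date), ]
mydat$diff = as.integer(!duplicated(mydat$udid, fromLast = TRUE))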
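As a minimal, self-contained sketch of the sort-then-flag approach in base R (the toy data frame and its values are made up for illustration; only the column names follow the question):

```r
# Toy data (hypothetical values), deliberately not sorted by date
toy <- data.frame(
  udid = c(1L, 2L, 2L, 2L, 3L),
  date = c("31.10.16", "29.11.16", "28.11.16", "29.11.16", "8.10.16"),
  stringsAsFactors = FALSE
)

# %d = day, %m = month, %y = two-digit year (%M would mean minutes)
toy$date <- as.Date(toy$date, format = "%d.%m.%y")

# Sort by player, then by date; rows that tie on both keep their original order
toy <- toy[order(toy$udid, toy$date), ]

# The last row of each udid is the one that is not a duplicate when scanning from the end
toy$diff <- as.integer(!duplicated(toy$udid, fromLast = TRUE))
toy
```

Because the flag is computed after sorting, exactly one row per udid gets a 1, even when several rows share the latest date.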

If you don't want to sort on date, the logical answer would be:
library(dplyr)
mydat$date = as.Date(mydat$date, "%d.%m.%y")
mydat %>%
group_by(udid) %>%
mutate(diff = ifelse(date == max(date), 1L, 0L)) #Last date
udid count avg_duration date diff
<int> <int> <int> <date> <int>
1 74385162 1 39 2016-10-31 1
2 79599601 1 1216 2016-11-15 1
3 79599601 1 568 2016-11-15 1
4 91475825 1 5 2016-10-11 1
5 91475825 3 6 2016-10-11 1
6 91492531 1 79 2016-10-08 1
7 92137561 1 9 2016-12-12 1
8 96308016 2 426 2016-11-28 0
9 96308016 2 78 2016-11-28 0
10 96308016 1 884 2016-11-28 0
# ... with 14 more rows
However, your sample data contains duplicate dates within an id, so every row on the latest date gets flagged (for example, both rows of 79599601), which does not match your expected output. The approach should work on real data, especially when the date column is a full date/time.
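If duplicate dates are expected and only one row per id should be flagged, one possible variant (a sketch in base R rather than dplyr, using made-up toy values and a hypothetical helper named flag_last) picks the last row, in the current row order, among those sharing the latest date, without reordering the data:

```r
# Toy data (hypothetical values) with repeated latest dates inside a group
toy <- data.frame(
  udid = c(1L, 2L, 2L, 3L, 3L, 3L),
  date = as.Date(c("2016-10-31", "2016-11-15", "2016-11-15",
                   "2016-11-29", "2016-11-28", "2016-11-29"))
)

# Within one group: flag only the last position whose date equals the group's latest date
flag_last <- function(d) as.numeric(seq_along(d) == max(which(d == max(d))))

# ave() applies flag_last per udid and returns the result in the original row order
toy$diff <- as.integer(ave(as.numeric(toy$date), toy$udid, FUN = flag_last))
toy
```

This keeps the rows where they are, which may matter if the original order carries meaning (for example, the order in which sessions were logged).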

Related

How can I identify the first row with value lower than the first row in different column in groups in R?

I have a data set that looks like this:
unique score value day
1 2 52 33.75 1
2 2 39 36.25 2
3 3 47 41.25 1
4 3 26 41.00 2
5 3 17 32.25 3
6 3 22 28.00 4
7 3 11 19.00 5
8 3 9 14.75 6
9 3 20 15.50 7
10 4 32 18.00 1
11 4 20 20.25 2
12 5 32 26.00 1
13 5 31 28.75 2
14 5 25 27.00 3
15 5 27 28.75 4
16 6 44 31.75 1
17 6 25 30.25 2
18 6 31 31.75 3
19 6 37 34.25 4
20 6 28 30.25 5
I would like to identify the first row in each group (unique) where the score is lower than the value on day 1.
I have tried this:
result<-df %>%
group_by(unique.id) %>%
filter(dailyMyoActivity < globaltma[globalflareday==1])
But it doesn't seem to do exactly what I want it to do.
Is there a way of doing this?
If I understood your rationale correctly, and if your dataset is already ordered by day, this dplyr solution may come in handy
library(dplyr)
df %>%
group_by(unique) %>%
filter(score < value[day==1]) %>%
slice(1)
Output
# A tibble: 3 x 4
# Groups: unique [3]
# unique score value day
# <int> <int> <dbl> <int>
# 1 3 26 41 2
# 2 5 25 27 3
# 3 6 25 30.2 2
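For comparison, the same idea can be sketched in base R. This is only an illustration: it uses a subset of the question's rows (groups 2, 3 and 5), the helper name first_below_day1 is made up, and it assumes each group has exactly one day-1 row and is already ordered by day.

```r
scores <- data.frame(
  unique = c(2L, 2L, 3L, 3L, 3L, 5L, 5L, 5L),
  score  = c(52L, 39L, 47L, 26L, 17L, 32L, 31L, 25L),
  value  = c(33.75, 36.25, 41.25, 41, 32.25, 26, 28.75, 27),
  day    = c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L)
)

first_below_day1 <- function(g) {
  baseline <- g$value[g$day == 1]      # the group's value on day 1
  hits <- which(g$score < baseline)    # rows scoring below that baseline
  if (length(hits) > 0) g[hits[1], ] else NULL   # first hit, or nothing for this group
}

# Apply per group, drop groups without a hit, and stack the remaining rows
pieces <- lapply(split(scores, scores$unique), first_below_day1)
result <- do.call(rbind, Filter(Negate(is.null), pieces))
result
```

Group 2 never scores below its day-1 value, so it drops out of the result, just as it does in the filter/slice version.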
This could help:
library(dplyr)
df %>% group_by(unique) %>% mutate(Index=ifelse(score<value & day==1,1,0))
# A tibble: 20 x 5
# Groups: unique [5]
unique score value day Index
<int> <int> <dbl> <int> <dbl>
1 2 52 33.8 1 0
2 2 39 36.2 2 0
3 3 47 41.2 1 0
4 3 26 41 2 0
5 3 17 32.2 3 0
6 3 22 28 4 0
7 3 11 19 5 0
8 3 9 14.8 6 0
9 3 20 15.5 7 0
10 4 32 18 1 0
11 4 20 20.2 2 0
12 5 32 26 1 0
13 5 31 28.8 2 0
14 5 25 27 3 0
15 5 27 28.8 4 0
16 6 44 31.8 1 0
17 6 25 30.2 2 0
18 6 31 31.8 3 0
19 6 37 34.2 4 0
20 6 28 30.2 5 0
Then you filter by Index==1
We could also use slice
library(dplyr)
df1 %>%
group_by(unique) %>%
slice(which(score < value[day == 1])[1])
# A tibble: 3 x 4
# Groups: unique [3]
# unique score value day
# <int> <int> <dbl> <int>
#1 3 26 41 2
#2 5 25 27 3
#3 6 25 30.2 2
data
df1 <- structure(list(unique = c(2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L), score = c(52L, 39L,
47L, 26L, 17L, 22L, 11L, 9L, 20L, 32L, 20L, 32L, 31L, 25L, 27L,
44L, 25L, 31L, 37L, 28L), value = c(33.75, 36.25, 41.25, 41,
32.25, 28, 19, 14.75, 15.5, 18, 20.25, 26, 28.75, 27, 28.75,
31.75, 30.25, 31.75, 34.25, 30.25), day = c(1L, 2L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20"))
Since you asked to identify the first row that fulfills the criterion score < value, a new column giving the row number has been added.
result <- df %>%
mutate(row_nr = row_number()) %>%
group_by(unique) %>%
filter(score < value) %>%
slice(1)

TABLE of age groups

I have different groups (2 in my example, 85 in my real data) and would like to produce a table of age classes (0-10, 11-20, 21-30, 31-40, etc.) for each group:
group age
1 1 34
2 1 37
3 1 22
4 1 10
5 1 11
6 1 12
7 1 14
8 2 56
9 2 46
10 2 25
11 2 24
12 2 13
13 2 13
14 2 45
15 2 45
16 2 23
17 2 56
18 2 54
19 2 31
20 2 68
I have tried various solutions from the forum:
mydf$ageclass<-cut(mydf$age, seq(0,100,10))
only works for the entire df and has no possibility of grouping.
mydf$ageclass<-Freq(mydf$age, breaks=c(0,20,30,40,50,60,70,80))
also only returns a solution for the entire dataframe
I have no way of integrating the "group" into these functions.
Also, both return a column with the age class given as '(30,40]' (meaning upper and lower class bound) and I would like the result to be a table like this:
group 0-10 11-20 21-30 31-40
1
2
What am I missing? perhaps a for loop? I am new to base R and really would enjoy some pointers as to how to think about the problem.
Is this what you are trying to achieve?
mydf$ageclass <- with(mydf, cut(age, seq(0,100,10)))
with(mydf, table(group, ageclass))
ageclass
group (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
1 1 3 1 1 0 0 0 0 0 0
2 0 2 3 1 3 3 1 0 0 0
Edit
cut() also has a labels argument:
mydf$ageclass <- with(mydf, cut(age, seq(0,100,10), labels = paste0(seq(0,90,10) + 1, "-", seq(0,90,10) + 10)))
with(mydf, table(group, ageclass))
ageclass
group 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100
1 1 3 1 1 0 0 0 0 0 0
2 0 2 3 1 3 3 1 0 0 0
Data
mydf <- structure(list(group = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), age = c(37L, 22L, 10L,
11L, 12L, 14L, 56L, 46L, 25L, 24L, 13L, 13L, 45L, 45L, 23L, 56L,
54L, 31L, 68L)), row.names = c(NA, -19L), class = "data.frame")
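The question asked for a first class labelled 0-10 rather than 1-10. One way to get that, sketched here under the assumption that ages are whole numbers from 0 to 100, is to open the lowest break downwards and build the labels from explicit lower and upper bounds (the names lower, upper and tab are just illustrative):

```r
mydf <- data.frame(
  group = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  age   = c(37L, 22L, 10L, 11L, 12L, 14L, 56L, 46L, 25L, 24L, 13L, 13L, 45L, 45L,
            23L, 56L, 54L, 31L, 68L)
)

upper <- seq(10, 100, 10)             # upper bound of each class: 10, 20, ..., 100
lower <- c(0, head(upper, -1) + 1)    # lower bound of each class: 0, 11, 21, ..., 91

mydf$ageclass <- cut(mydf$age,
                     breaks = c(-Inf, upper),           # first class also catches age 0
                     labels = paste0(lower, "-", upper))

# One row per group, one column per age class
tab <- table(mydf$group, mydf$ageclass)
tab
```

No loop over groups is needed: table() cross-tabulates the two columns in one call, so the same code should work unchanged for 85 groups.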

Calculating within group differences R

I’m trying to figure out how to append a column that identifies whether a difference of 10 exists between different IDs for a given day using the column named reading.
**Day ID Reading**
19-Jan 1 10
19-Jan 1 10
19-Jan 1 10
19-Jan 1 20
19-Jan 2 20
19-Jan 2 20
19-Jan 2 20
19-Jan 2 20
20-Jan 1 10
21-Jan 1 10
22-Jan 1 10
23-Jan 1 10
24-Jan 1 20
25-Jan 2 20
25-Jan 2 20
25-Jan 2 20
25-Jan 2 10
I would like:
**Day ID Reading Difference**
19-Jan 1 10 Y
19-Jan 1 10 Y
19-Jan 1 10 Y
19-Jan 1 20 Y
19-Jan 2 20 N
19-Jan 2 20 N
19-Jan 2 20 N
19-Jan 2 20 N
20-Jan 1 10 N
21-Jan 1 10 N
22-Jan 1 10 N
23-Jan 1 10 N
24-Jan 1 20 N
25-Jan 2 20 Y
25-Jan 2 20 Y
25-Jan 2 20 Y
25-Jan 2 10 Y
What you could do is to check whether the difference of the range is equal to or greater than 10 for each group.
dat$Diff <- with(dat, ave(Reading, Day, ID, FUN = function(x) diff(range(x)) >= 10))
dat
# Day ID Reading Diff
#1 19-Jan 1 10 1
#2 19-Jan 1 10 1
#3 19-Jan 1 10 1
#4 19-Jan 1 20 1
#5 19-Jan 2 20 0
#6 19-Jan 2 20 0
#7 19-Jan 2 20 0
#8 19-Jan 2 20 0
#9 20-Jan 1 10 0
#10 21-Jan 1 10 0
#11 22-Jan 1 10 0
#12 23-Jan 1 10 0
#13 24-Jan 1 20 0
#14 25-Jan 2 20 1
#15 25-Jan 2 20 1
#16 25-Jan 2 20 1
#17 25-Jan 2 10 1
data
dat <- structure(list(Day = c("19-Jan", "19-Jan", "19-Jan", "19-Jan",
"19-Jan", "19-Jan", "19-Jan", "19-Jan", "20-Jan", "21-Jan", "22-Jan",
"23-Jan", "24-Jan", "25-Jan", "25-Jan", "25-Jan", "25-Jan"),
ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L), Reading = c(10L, 10L, 10L, 20L, 20L, 20L,
20L, 20L, 10L, 10L, 10L, 10L, 20L, 20L, 20L, 20L, 10L)), .Names = c("Day",
"ID", "Reading"), class = "data.frame", row.names = c(NA, -17L
))
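The ave() call returns 1/0 rather than the Y/N asked for, because the logical result is written back into an integer vector. A small sketch of the extra conversion step follows; the intermediate name has_spread is made up, and pasting Day and ID into a single grouping key is a choice made here so that only combinations that actually occur are evaluated.

```r
dat <- data.frame(
  Day     = c(rep("19-Jan", 8), "20-Jan", "21-Jan", "22-Jan", "23-Jan", "24-Jan",
              rep("25-Jan", 4)),
  ID      = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Reading = c(10L, 10L, 10L, 20L, 20L, 20L, 20L, 20L, 10L, 10L, 10L, 10L, 20L,
              20L, 20L, 20L, 10L),
  stringsAsFactors = FALSE
)

# 1 if the readings within a Day/ID combination span at least 10, otherwise 0
has_spread <- with(dat, ave(Reading, paste(Day, ID),
                            FUN = function(x) diff(range(x)) >= 10))

# Map the 1/0 indicator to the Y/N labels from the question
dat$Difference <- ifelse(has_spread == 1, "Y", "N")
dat
```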
We can use data.table
library(data.table)
setDT(df1)[, Difference := abs(Reduce(`-`, as.list(range(Reading)))) >= 10,
.(ID, Day)]
df1
# Day ID Reading Difference
# 1: 19-Jan 1 10 TRUE
# 2: 19-Jan 1 10 TRUE
# 3: 19-Jan 1 10 TRUE
# 4: 19-Jan 1 20 TRUE
# 5: 19-Jan 2 20 FALSE
# 6: 19-Jan 2 20 FALSE
# 7: 19-Jan 2 20 FALSE
# 8: 19-Jan 2 20 FALSE
# 9: 20-Jan 1 10 FALSE
#10: 21-Jan 1 10 FALSE
#11: 22-Jan 1 10 FALSE
#12: 23-Jan 1 10 FALSE
#13: 24-Jan 1 20 FALSE
#14: 25-Jan 2 20 TRUE
#15: 25-Jan 2 20 TRUE
#16: 25-Jan 2 20 TRUE
#17: 25-Jan 2 10 TRUE
data
df1 <- structure(list(Day = c("19-Jan", "19-Jan", "19-Jan", "19-Jan",
"19-Jan", "19-Jan", "19-Jan", "19-Jan", "20-Jan", "21-Jan", "22-Jan",
"23-Jan", "24-Jan", "25-Jan", "25-Jan", "25-Jan", "25-Jan"),
ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L), Reading = c(10L, 10L, 10L, 20L, 20L, 20L,
20L, 20L, 10L, 10L, 10L, 10L, 20L, 20L, 20L, 20L, 10L)),
class = "data.frame", row.names = c(NA, -17L))
Using tidyverse you could do something like
library(tidyverse)
your_data %>%
group_by(Day, ID) %>%
mutate(difference = (max(Reading) - min(Reading)) >= 10)

ddply using "group_by" logic

I'm trying to use ddply to find the smallest distance between two positions pos where the corresponding chrom is the same in two dataframes:
head(bps, 10)
chrom pos iteration
1 1 4 1
2 1 14 1
3 1 68 1
4 1 79 1
5 1 200 1
6 1 205 1
7 1 270 1
8 1 304 1
9 2 7 1
10 2 13 1
head(flocs)
chrom pos
1 1 100
2 1 200
3 1 220
4 1 312
5 2 500
6 2 501
As an example, for the first line in bps, I want to find the closest pos in flocs where chrom = 1, which gives a value of -96.
The pseudocode for what I'm trying to do is:
foreach iteration (bps$iteration):
foreach chrom (bps$chrom):
foreach pos (bps$pos):
features_pos = pos in dataframe flocs closest to pos on the same chromosome
min_dist = feature_pos - pos
return features_pos, min_dist
I am trying to do this with ddply:
minDists <- ddply(bp_data, c("chrom", "pos"), function(x) {
index <- which.min(abs(flocs$pos[which(flocs$chrom==x$chrom)] - x$pos))
closestMotif <- flocs$pos[index]
chrom <- as.character(flocs$chrom[index])
dist <- (x$pos - closestMotif)
data.frame(features_pos = closestMotif, pos = x$pos, min_dist = dist, feature = feature)
})
But this doesn't constrain comparisons to the same chromosome:
head(minDists, 10)
chrom features_pos pos min_dist feature
1 1 100 4 -96 feature1
2 1 100 14 -86 feature1
3 1 100 68 -32 feature1
4 1 100 79 -21 feature1
5 1 200 200 0 feature1
6 1 200 205 5 feature1
7 1 312 270 -42 feature1
8 1 312 304 -8 feature1
9 2 100 7 -93 feature1 # bps chrom=2, flocs chrom=1
10 2 100 13 -87 feature1 # bps chrom=2, flocs chrom=1
The expected output here is:
chrom features_pos pos min_dist feature
1 1 100 4 -96 feature1
2 1 100 14 -86 feature1
3 1 100 68 -32 feature1
4 1 100 79 -21 feature1
5 1 200 200 0 feature1
6 1 200 205 5 feature1
7 1 312 270 -42 feature1
8 1 312 304 -8 feature1
9 2 500 7 -493 feature1 # bp1 chrom=2, flocs chrom=2
10 2 500 13 -487 feature1 # bp1 chrom=2, flocs chrom=2
I thought that by providing the columns c("chrom", "pos") essentially performed a group_by to the function call.
Is there any way that I can improve what I've written to achieve the desired result?
bps <- structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"
), class = "factor"), pos = c(4L, 14L, 68L, 79L, 200L, 205L,
270L, 304L, 7L, 13L, 23L, 39L, 100L, 150L, 17L, 55L, 75L, 79L,
102L, 109L, 123L, 155L, 157L, 200L, 260L, 299L, 300L, 320L, 323L,
345L, 450L, 550L), iteration = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor")), row.names = c(NA,
-32L), class = "data.frame")
flocs <- structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L,
3L), .Label = c("1", "2", "3"), class = "factor"), pos = c(100L,
200L, 220L, 312L, 500L, 501L, 123L, 444L)), row.names = c(NA,
-8L), class = "data.frame")
data.table approach using a rolling join...
updated answer
(initially forgot all about the by-reference joining, which is faster and most certainly shorter ;-) )
library( data.table )
#set data as data.table
setDT( bps, key = c("chrom", "pos") )
setDT( flocs, key = c("chrom", "pos") )
#perform by-reference rolling join
bps[, mindist := pos - flocs[bps, x.pos, roll = "nearest"]][]
output
# chrom pos iteration mindist
# 1: 1 4 1 -96
# 2: 1 14 1 -86
# 3: 1 68 1 -32
# 4: 1 79 1 -21
# 5: 1 200 1 0
# 6: 1 205 1 5
# 7: 1 270 1 -42
# 8: 1 304 1 -8
# 9: 2 7 1 -493
# 10: 2 13 1 -487
# 11: 2 23 1 -477
# 12: 2 39 1 -461
# 13: 2 100 1 -400
# 14: 2 150 1 -350
# 15: 3 17 1 -106
# 16: 3 55 1 -68
# 17: 3 75 1 -48
# 18: 3 79 1 -44
# 19: 3 102 1 -21
# 20: 3 109 1 -14
# 21: 3 123 1 0
# 22: 3 155 1 32
# 23: 3 157 1 34
# 24: 3 200 1 77
# 25: 3 260 1 137
# 26: 3 299 1 -145
# 27: 3 300 1 -144
# 28: 3 320 1 -124
# 29: 3 323 1 -121
# 30: 3 345 1 -99
# 31: 3 450 1 6
# 32: 3 550 1 106
# chrom pos iteration mindist
Benchmarking the answers so far
# Unit: milliseconds
# expr min lq mean median uq max neval
# Ronak_base 2.355879 2.555768 2.973069 2.626415 2.773581 8.016016 100
# Wimpel_data.table 1.697921 2.035788 2.416199 2.209616 2.361001 17.724528 100
# Pawel_tidyverse 14.845354 15.310505 16.333158 15.814819 16.541618 24.077871 100
microbenchmark::microbenchmark(
Ronak_base = {
bps$min_dist <- unlist(mapply(return_min_value, unique(bps$chrom), split(bps$pos, bps$chrom)))
},
Wimpel_data.table = {
setDT( bps, key = c("chrom", "pos") )
setDT( flocs, key = c("chrom", "pos") )
#perform by-reference rolling join
bps[, mindist := pos - flocs[bps, x.pos, roll = "nearest"]][]
},
Pawel_tidyverse = {
bps %>%
select(-iteration) %>%
unite('bps') %>%
crossing(flocs %>% unite('flocks')) %>%
separate(bps, c('chrom_bps', 'pos')) %>%
separate(flocks, c('chrom_flocks', 'features_pos')) %>%
filter(chrom_bps == chrom_flocks) %>%
select(-chrom_flocks) %>%
rename_at(1, ~'chrom') %>%
mutate_all(as.numeric) %>%
mutate(min_dist = pos - features_pos) %>%
group_by(chrom, pos) %>%
filter(abs(min_dist) == min(abs(min_dist)))
}
)
It looks like my data.table answer and the answer by Ronak Shah are pretty close together. I believe that data.table will gain a clear advantage as the data sets get larger (but I haven't tested this).
My base R attempt creates a helper function (return_min_value). This function subsets flocs based on the current chrom and then, after subtracting the flocs positions from pos, returns the difference with the smallest absolute value. We split the pos column by chrom and pass these values, along with the unique chrom values, through mapply to the return_min_value function.
return_min_value <- function(x, y) {
sapply(y, function(p) {
vals = p - flocs$pos[flocs$chrom == x]
vals[which.min(abs(vals))]
})
}
bps$min_dist <- unlist(mapply(return_min_value,
unique(bps$chrom), split(bps$pos, bps$chrom)))
bps
# chrom pos iteration min_dist
#1 1 4 1 -96
#2 1 14 1 -86
#3 1 68 1 -32
#4 1 79 1 -21
#5 1 200 1 0
#6 1 205 1 5
#7 1 270 1 -42
#8 1 304 1 -8
#9 2 7 1 -493
#10 2 13 1 -487
#...
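As for why the ddply attempt in the question mixes chromosomes: which.min() is applied to the filtered vector flocs$pos[which(flocs$chrom == x$chrom)], so the index it returns is a position within that subset, but the code then looks it up in the unfiltered flocs$pos. For chrom 2 the index is 1, and flocs$pos[1] is 100, which belongs to chrom 1. A minimal sketch of the fix, using the flocs values from the question and a hypothetical helper named nearest_feature, is to subset once and index into that same subset:

```r
flocs <- data.frame(
  chrom = c("1", "1", "1", "1", "2", "2", "3", "3"),
  pos   = c(100L, 200L, 220L, 312L, 500L, 501L, 123L, 444L),
  stringsAsFactors = FALSE
)

nearest_feature <- function(chrom, pos) {
  candidates <- flocs$pos[flocs$chrom == chrom]    # restrict to the same chromosome first
  candidates[which.min(abs(candidates - pos))]     # index into the subset, not flocs$pos
}

nearest_feature("2", 7)       # 500
7 - nearest_feature("2", 7)   # -493
```

The helper functions in the answers above follow the same pattern: the chromosome filter and the lookup always operate on the same vector.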
Check this solution:
library(tidyverse)
bps %>%
select(-iteration) %>%
unite('bps') %>%
crossing(flocs %>% unite('flocks')) %>%
separate(bps, c('chrom_bps', 'pos')) %>%
separate(flocks, c('chrom_flocks', 'features_pos')) %>%
filter(chrom_bps == chrom_flocks) %>%
select(-chrom_flocks) %>%
rename_at(1, ~'chrom') %>%
mutate_all(as.numeric) %>%
mutate(min_dist = pos - features_pos) %>%
group_by(chrom, pos) %>%
filter(abs(min_dist) == min(abs(min_dist)))
Output:
chrom pos features_pos min_dist
<dbl> <dbl> <dbl> <dbl>
1 1 4 100 -96
2 1 14 100 -86
3 1 68 100 -32
4 1 79 100 -21
5 1 200 200 0
6 1 205 200 5
7 1 270 312 -42
8 1 304 312 -8
9 2 7 500 -493
10 2 13 500 -487
# ... with 22 more rows

How to select data that have complete cases of a certain column?

I'm trying to get a data frame (just.samples.with.shoulder.values, say) containing only the samples that have non-NA Shoulders values. I've tried to accomplish this using the complete.cases function, but I imagine that I'm doing something wrong syntactically below:
data <- structure(list(Sample = 1:14, Head = c(1L, 0L, NA, 1L, 1L, 1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L), Shoulders = c(13L, 14L, NA,
18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L), Knees = c(1L,
1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L), Toes = c(324L,
5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA
)), .Names = c("Sample", "Head", "Shoulders", "Knees", "Toes"
), class = "data.frame", row.names = c(NA, -14L))
just.samples.with.shoulder.values <- data[complete.cases(data[,"Shoulders"])]
print(just.samples.with.shoulder.values)
I would also be interested to know whether some other route (using subset(), say) is a wiser idea. Thanks so much for the help!
You can use complete.cases, which returns a logical vector that lets you subset the rows of data by Shoulders (note the comma after the row index):
data[complete.cases(data$Shoulders), ]
# Sample Head Shoulders Knees Toes
# 1 1 1 13 1 324
# 2 2 0 14 1 5
# 4 4 1 18 1 NA
# 5 5 1 10 1 5
# 6 6 1 24 2 67
# 7 7 0 53 3 785
# 9 9 1 86 1 554
# 10 10 1 9 NA 456
# 11 11 1 65 2 7
# 12 12 1 87 3 NA
# 13 13 0 54 4 54
# 14 14 1 36 3 NA
You could try using is.na:
data[!is.na(data["Shoulders"]),]
Sample Head Shoulders Knees Toes
1 1 1 13 1 324
2 2 0 14 1 5
4 4 1 18 1 NA
5 5 1 10 1 5
6 6 1 24 2 67
7 7 0 53 3 785
9 9 1 86 1 554
10 10 1 9 NA 456
11 11 1 65 2 7
12 12 1 87 3 NA
13 13 0 54 4 54
14 14 1 36 3 NA
There is a subtle difference between using is.na and complete.cases.
is.na removes the actual NA values of the column it is applied to, whereas the objective here is only to control for one variable, not to deal with missing values/NAs in the other columns, which could belong to legitimate data points.
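To make the comparison concrete, here is a small sketch on the question's data (the result names are made up). On a single column, complete.cases(), is.na() and subset() select the same rows; the difference only appears when complete.cases() is given the whole data frame, in which case a row is dropped if any column is NA.

```r
data <- data.frame(
  Sample    = 1:14,
  Head      = c(1L, 0L, NA, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L),
  Shoulders = c(13L, 14L, NA, 18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L),
  Knees     = c(1L, 1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L),
  Toes      = c(324L, 5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA)
)

by_column   <- data[complete.cases(data$Shoulders), ]  # only Shoulders must be present
by_is_na    <- data[!is.na(data$Shoulders), ]          # same rows as by_column
by_subset   <- subset(data, !is.na(Shoulders))         # subset() variant from the question
all_columns <- data[complete.cases(data), ]            # every column must be present

nrow(by_column)    # 12
nrow(all_columns)  # 8
```

The subset() form reads well interactively; the bracket forms are usually preferred inside functions, since subset() evaluates column names non-standardly.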

Resources