Extract a column based on two other columns - r

ID MON in out
2 1 23 12
3 1 23 12
7 1 33 22
1 2 22 11
2 2 111 100
1 3 21 10
2 3 22 11
2 4 111 100
7 4 21 10
2 5 31 20
7 2046 41 30
I have a large data set in this format. I want to extract column 4 for the rows where column 1 equals 2 and column 2 is smaller than 5.

It's basic R.
df[,4][df[,1]==2 & df[,2]<5]
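The same selection can also be written with column names, which tends to be easier to read than positional indices. The sketch below rebuilds the sample table from the question (the data frame construction is an assumption about how the data is stored) and shows two equivalent forms:

```r
# Sample data from the question; `in` is a reserved word, so it is backquoted
df <- data.frame(
  ID   = c(2, 3, 7, 1, 2, 1, 2, 2, 7, 2, 7),
  MON  = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 2046),
  `in` = c(23, 23, 33, 22, 111, 21, 22, 111, 21, 31, 41),
  out  = c(12, 12, 22, 11, 100, 10, 11, 100, 10, 20, 30),
  check.names = FALSE
)

# Logical indexing by column name
res <- df$out[df$ID == 2 & df$MON < 5]

# Equivalent form using subset()
res2 <- subset(df, ID == 2 & MON < 5)$out

res
```

Both forms return the `out` values of the four rows with ID 2 and MON below 5.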

Related

R: How to apply a sliding conditional branch to consecutive values in sequential data

I want to apply a conditional statement to consecutive values in a sliding manner.
For example, I have a dataset like this:
data <- data.frame(ID = rep.int(c("A","B"), times = c(24, 12)),
                   time = c(1:24,1:12),
                   visit = as.integer(runif(36, min = 0, max = 20)))
and I got the table below:
> data
ID time visit
1 A 1 7
2 A 2 0
3 A 3 6
4 A 4 6
5 A 5 3
6 A 6 8
7 A 7 4
8 A 8 10
9 A 9 18
10 A 10 6
11 A 11 1
12 A 12 13
13 A 13 7
14 A 14 1
15 A 15 6
16 A 16 1
17 A 17 11
18 A 18 8
19 A 19 16
20 A 20 14
21 A 21 15
22 A 22 19
23 A 23 5
24 A 24 13
25 B 1 6
26 B 2 6
27 B 3 16
28 B 4 4
29 B 5 19
30 B 6 5
31 B 7 17
32 B 8 6
33 B 9 10
34 B 10 1
35 B 11 13
36 B 12 15
I want to flag each ID based on consecutive values of "visit".
If "visit" stays below 10 for 6 consecutive times, I'd attach "empty", and "busy" otherwise.
In the data above, "A" is continuously below 10 from rows 1 to 6, so it is "empty". On the other hand, "B" doesn't have 6 consecutive single-digit values, so it is "busy".
I want to apply the condition to the next segment of 6 values if the condition wasn't fulfilled in the previous segment.
I'd like to achieve this using R. Any advice will be appreciated.
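One possible reading of the rule is "an ID is empty if it contains at least one run of 6 or more consecutive visits below 10, wherever that run starts". Under that assumption, a minimal sketch with base R's rle() could look like the following. The helper name has_low_run and the small fixed data set are made up for illustration, since the original data comes from runif() without a seed.

```r
# TRUE if x contains at least `n` consecutive values below `limit`
has_low_run <- function(x, limit = 10, n = 6) {
  runs <- rle(x < limit)               # lengths of consecutive TRUE/FALSE runs
  any(runs$values & runs$lengths >= n)
}

# Small deterministic example (hypothetical values)
data <- data.frame(
  ID    = rep(c("A", "B"), times = c(8, 8)),
  time  = c(1:8, 1:8),
  visit = c(7, 0, 6, 6, 3, 8, 4, 10,    # A: seven values below 10 in a row
            6, 6, 16, 4, 19, 5, 17, 6)  # B: never more than two in a row
)

# Apply the check per ID and translate to the requested labels
low  <- vapply(split(data$visit, data$ID), has_low_run, logical(1))
flag <- ifelse(low, "empty", "busy")
flag
```

Because rle() looks at every run, the window effectively slides one position at a time. If the segments are meant to be non-overlapping blocks of 6, the check would need to be adapted.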

Finding the k-largest clusters in dbscan result

I have a dataframe df consisting of 2 columns: x and y coordinates.
Each row refers to a point.
I feed it into the dbscan function to obtain the clusters of the points in df.
library("fpc")
db = fpc::dbscan(df, eps = 0.08, MinPts = 4)
plot(db, df, main = "DBSCAN", frame = FALSE)
By using print(db), I can see the result returned by dbscan.
> print(db)
dbscan Pts=13131 MinPts=4 eps=0.08
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
border 401 38 55 5 2 3 0 0 0 8 0 6 1 3 1 3 3 2 1 2 4 3
seed 0 2634 8186 35 24 561 99 7 22 26 5 75 17 9 9 54 1 2 74 21 3 15
total 401 2672 8241 40 26 564 99 7 22 34 5 81 18 12 10 57 4 4 75 23 7 18
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
border 4 1 2 6 2 1 3 7 2 1 2 3 11 1 3 1 3 2 5 5 1 4 3
seed 14 9 4 48 2 4 38 111 5 11 5 14 111 6 1 5 1 8 3 15 10 15 6
total 18 10 6 54 4 5 41 118 7 12 7 17 122 7 4 6 4 10 8 20 11 19 9
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
border 2 4 2 1 3 2 1 1 3 1 0 2 2 3 0 3 3 3 3 0 0 2 3 1
seed 15 2 9 11 4 8 12 4 6 8 7 7 3 3 4 3 3 4 2 9 4 2 1 4
total 17 6 11 12 7 10 13 5 9 9 7 9 5 6 4 6 6 7 5 9 4 4 4 5
69 70 71
border 3 3 3
seed 1 1 1
total 4 4 4
From the above summary, I can see cluster 2 consists of 8186 seed points (core points), cluster 1 consists of 2634 seed points and cluster 5 consists of 561 seed points.
I define the largest cluster as the one that contains the largest number of seed points. So, in this case, the largest cluster is cluster 2, and the 1st, 2nd and 3rd largest clusters are 2, 1 and 5.
Is there any direct way to return the rows (points) in the largest cluster, or the k-largest cluster in general?
I can do it in an indirect way.
I can obtain the assigned cluster number of each point from db$cluster.
Hence, I can create a new dataframe df2 with db$cluster as an additional column besides the original x and y columns.
Then, I can aggregate df2 by the cluster numbers in the third column and find the number of points in each cluster.
After that, I can find the k-largest groups, which are 2, 1 and 5 again.
Finally, I can select the rows in df2 whose third column equals 2 to return the points in the largest cluster.
But the above approach re-computes many results that are already known from the summary of print(db).
The dbscan function doesn't appear to retain the data.
library(fpc)
set.seed(665544)
n <- 600
df <- data.frame(x=runif(10, 0, 10)+rnorm(n, sd=0.2), y=runif(10, 0, 10)+rnorm(n,sd=0.2))
(dbs <- dbscan(df, 0.2))
#dbscan Pts=600 MinPts=5 eps=0.2
# 0 1 2 3 4 5 6 7 8 9 10 11
#border 28 4 4 8 5 3 3 4 3 4 6 4
#seed 0 50 53 51 52 51 54 54 54 53 51 1
#total 28 54 57 59 57 54 57 58 57 57 57 5
attributes(dbs)
#$names
#[1] "cluster" "eps" "MinPts" "isseed"
#$class
#[1] "dbscan"
Your indirect steps are not that indirect (only two lines needed), and these commands won't recalculate the clusters. So just run those commands, or put them in a function and then call the function in one command.
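cluster_k <- function(dbs, data, k){
  # cluster sizes (total points), largest first; cluster 0 is noise, so drop it
  sizes <- sort(table(dbs$cluster[dbs$cluster != 0]), decreasing = TRUE)
  kth <- as.integer(names(sizes)[k]) #label of the k-th largest cluster
  data[dbs$cluster == kth,]
}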
cluster_k(dbs=dbs, data=df, k=1)
## x y
## 3 6.580695 8.715245
## 13 6.704379 8.528486
## 23 6.809558 8.160721
## 33 6.375842 8.756433
## 43 6.603195 8.640206
## 53 6.728533 8.425067
## a data frame with 59 rows
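The function above ranks clusters by their total number of points, whereas the question defines size by seed points. Since the object exposes an isseed component (see the attributes() output), a ranking by seed points could be sketched as below. To keep the example runnable without fpc, cluster and isseed are small stand-in vectors playing the role of dbs$cluster and dbs$isseed; their values are made up.

```r
# Stand-ins for dbs$cluster and dbs$isseed (hypothetical values)
cluster <- c(0, 1, 1, 1, 2, 2, 2, 2, 3, 3)
isseed  <- c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

# Count seed points per cluster, largest first (noise points are never seeds)
seed_counts <- sort(table(cluster[isseed]), decreasing = TRUE)

# Labels of the two clusters with the most seed points
top2 <- as.integer(names(seed_counts)[1:2])

# Row indices of all points (seed and border) in those clusters
rows_in_top2 <- which(cluster %in% top2)
top2
```

With a real result, the same lines would use dbs$cluster and dbs$isseed, and df[rows_in_top2, ] would return the corresponding points.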

Creating Groups by Matching Values of Different Columns

I would like to create groups from a base by matching values.
I have the following data table:
now<-c(1,2,3,4,24,25,26,5,6,21,22,23)
before<-c(0,1,2,3,23,24,25,4,5,0,21,22)
after<-c(2,3,4,5,25,26,0,6,0,22,23,24)
df<-as.data.frame(cbind(now,before,after))
which reproduces the following data:
now before after
1 1 0 2
2 2 1 3
3 3 2 4
4 4 3 5
5 24 23 25
6 25 24 26
7 26 25 0
8 5 4 6
9 6 5 0
10 21 0 22
11 22 21 23
12 23 22 24
I would like to get:
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
5 5 4 6 A
6 6 5 0 A
7 21 0 22 B
8 22 21 23 B
9 23 22 24 B
10 24 23 25 B
11 25 24 26 B
12 26 25 0 B
I would like to reach the answer to this without using a "for" loop because the real data is too large.
Any help you could provide will be appreciated.
Here is one way. It is hard to avoid a for-loop as this is quite a tricky algorithm. The objection to them is often on the grounds of elegance rather than speed, but sometimes they are entirely appropriate.
df$group <- seq_len(nrow(df)) #assign each row to its own group
stop <- FALSE #indicates convergence
while(!stop){
  pre <- df$group #group column at start of loop
  for(i in seq_len(nrow(df))){
    matched <- which(df$before==df$now[i] | df$after==df$now[i]) #check matches in before and after columns
    group <- min(df$group[i], df$group[matched]) #identify smallest group no of matching rows
    df$group[i] <- group #set to smallest group
    df$group[matched] <- group #set to smallest group
  }
  if(identical(df$group, pre)) stop <- TRUE #stop if no change
}
df$group <- LETTERS[match(df$group, sort(unique(df$group)))] #convert groups to letters
#(just use match(...) to keep them as integers - e.g. if you have more than 26 groups)
df <- df[order(df$group, df$now),] #reorder as required
df
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
8 5 4 6 A
9 6 5 0 A
10 21 0 22 B
11 22 21 23 B
12 23 22 24 B
5 24 23 25 B
6 25 24 26 B
7 26 25 0 B
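If the links always point to now - 1 and now + 1, as they do in the sample data (with 0 meaning "no neighbour"), the groups are simply runs of consecutive now values, and a loop-free sketch becomes possible. This is a swapped-in shortcut that depends on that assumption, not a general replacement for the matching algorithm above.

```r
now    <- c(1, 2, 3, 4, 24, 25, 26, 5, 6, 21, 22, 23)
before <- c(0, 1, 2, 3, 23, 24, 25, 4, 5, 0, 21, 22)
after  <- c(2, 3, 4, 5, 25, 26, 0, 6, 0, 22, 23, 24)
df <- data.frame(now, before, after)

# Sort by `now` so that chained rows become adjacent
df <- df[order(df$now), ]

# A new group starts wherever `now` does not follow the previous value by 1
grp <- cumsum(c(TRUE, diff(df$now) != 1))

# Letters only cover 26 groups; keep `grp` as an integer for more
df$group <- LETTERS[grp]
df
```

On the sample data this assigns rows 1 to 6 to group A and rows 21 to 26 to group B. If before/after can reference arbitrary values, the assumption breaks and an approach that follows the actual links is needed.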

Cumulative function for a specific range of values

I have a table with a column "Age" that has values from 1 to 10, and a column "Population" that has a value for each "Age" value. I want to generate a cumulative sum of the population such that the resulting values cover ages 1 and above, 2 and above, and so on. That is, the resulting array should be (203, 180, ... and so on). Any help would be appreciated!
Age Population Withdrawn
1 23 3
2 12 2
3 32 2
4 33 3
5 15 4
6 10 1
7 19 2
8 18 3
9 19 1
10 22 5
You can use cumsum and rev:
df$sum_above <- rev(cumsum(rev(df$Population)))
The result:
> df
Age Population sum_above
1 1 23 203
2 2 12 180
3 3 32 168
4 4 33 136
5 5 15 103
6 6 10 88
7 7 19 78
8 8 18 59
9 9 19 41
10 10 22 22
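The rev/cumsum trick relies on the rows being sorted by Age. As a cross-check, the sketch below rebuilds the table from the question and compares the result with a direct "sum of everything at this age or above" computation, which does not depend on row order (the name direct is only for illustration).

```r
df <- data.frame(
  Age        = 1:10,
  Population = c(23, 12, 32, 33, 15, 10, 19, 18, 19, 22)
)

# Reverse, accumulate, reverse back: each entry is the sum from that row onward
df$sum_above <- rev(cumsum(rev(df$Population)))

# Direct definition for comparison: total population with Age >= a
direct <- sapply(df$Age, function(a) sum(df$Population[df$Age >= a]))

all(df$sum_above == direct)
```

The direct version is quadratic in the number of rows, so the cumsum form is the better choice for large tables once the data is sorted.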

Counting rows based on column values in R

I have a dataframe df
Reads Counts
aaaa 10
bbbb 20
cccc 25
and so on.
I want to calculate the number of reads whose count is at least a certain value, and plot that. For example, I want a data frame that looks like
Counts>= #reads with Counts>=
1 3
2 3
3 3
11 2
20 2
21 1
and so on. Can you suggest how I can get such a data frame and plot it?
Given the levels you want to plot at...
cutoffs <- 1:30
... you could do something like:
data.frame(cutoff=cutoffs, num.above=Reduce("+", lapply(df$Counts, ">=", cutoffs)))
# cutoff num.above
# 1 1 3
# 2 2 3
# 3 3 3
# 4 4 3
# 5 5 3
# 6 6 3
# 7 7 3
# 8 8 3
# 9 9 3
# 10 10 3
# 11 11 2
# 12 12 2
# 13 13 2
# 14 14 2
# 15 15 2
# 16 16 2
# 17 17 2
# 18 18 2
# 19 19 2
# 20 20 2
# 21 21 1
# 22 22 1
# 23 23 1
# 24 24 1
# 25 25 1
# 26 26 0
# 27 27 0
# 28 28 0
# 29 29 0
# 30 30 0
Basically for each value in the original data frame you compute a vector of whether it's greater than or equal to each cutoff (using lapply with >=). Then you add them up (using Reduce with +), getting the total number greater than or equal to each cutoff.
Another option would be to use outer/colSums:
cutoffs <- 1:30
data.frame(cutoff=cutoffs, num.above=colSums(outer(df$Counts, cutoffs, ">=")))
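The question also asks for a plot. A minimal sketch with base graphics, assuming the three sample rows from the question and storing the result in a data frame called res (a name chosen here for illustration):

```r
df <- data.frame(Reads = c("aaaa", "bbbb", "cccc"), Counts = c(10, 20, 25))
cutoffs <- 1:30

# One row per cutoff: how many reads have Counts >= cutoff
res <- data.frame(cutoff = cutoffs,
                  num.above = colSums(outer(df$Counts, cutoffs, ">=")))

# Step plot, since the count only changes at observed Counts values
plot(res$cutoff, res$num.above, type = "s",
     xlab = "Counts >=", ylab = "Number of reads")
```

A step line (type = "s") is one reasonable choice here. A bar plot or points would work just as well, depending on how many cutoffs there are.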
