subset data based on matrix of row numbers - r

Say I have the following data:
B <- (5:20)
C <- (6:21)
D <- (7:22)
E <- (8:23)
data <- data.frame(B,C,D,E)
and I also have a vector of row numbers:
id <- c(4,7,9,12,15)
This vector holds the row numbers I want to extract into a new data.frame.
How can I use the subset function to subset the original data,
new <- subset(data, .....)
so that new consists of only those 5 observations?

Try
data[id,]
# B C D E
#4 8 9 10 11
#7 11 12 13 14
#9 13 14 15 16
#12 16 17 18 19
#15 19 20 21 22
The syntax data[i, j] selects row(s) i and column(s) j of data; leaving j empty, as here, keeps every column.
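Since the question asked about subset specifically, here is a hedged sketch of how the same selection might be written with it. subset filters on a logical condition, so the row numbers have to be turned into one first. The object names by_index, by_subset and two_cols are made up for illustration.

```r
# Rebuild the example data from the question
data <- data.frame(B = 5:20, C = 6:21, D = 7:22, E = 8:23)
id <- c(4, 7, 9, 12, 15)

# Direct indexing: rows listed in id, all columns
by_index <- data[id, ]

# subset() expects a logical condition, so test each row number against id
by_subset <- subset(data, seq_len(nrow(data)) %in% id)

# Rows listed in id, but only columns B and D
two_cols <- data[id, c("B", "D")]
```

One difference to keep in mind: %in% ignores the order of id and any duplicates in it, while data[id, ] returns rows in exactly the order given.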

Related

lag/lead entire dataframe in R

I am having a very hard time leading or lagging an entire data frame. I can shift individual columns with the attempts below, but not the whole thing:
require('DataCombine')
df_l <- slide(df, Var = var1, slideBy = -1)
Using colnames(x_ret_mon) as Var does not work; I am told that the variable names are not found in the data frame.
This attempt shifts the columns right but not down:
df_l<- dplyr::lag(df)
This only creates new variables for the lagged values, and I do not know how to effectively delete the old non-lagged values:
df_l<-shift(df, n=1L, fill=NA, type=c("lead"), give.names=FALSE)
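Use dplyr::mutate_all to apply lags or leads to all columns. Pass dplyr::lag explicitly: unless dplyr is attached, a bare lag resolves to stats::lag, which does not shift a plain vector.
df = data.frame(a = 1:10, b = 21:30)
dplyr::mutate_all(df, dplyr::lag)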
a b
1 NA NA
2 1 21
3 2 22
4 3 23
5 4 24
6 5 25
7 6 26
8 7 27
9 8 28
10 9 29
I don't see the point in lagging all columns in a data.frame. Wouldn't that just correspond to rbinding an NA row to your original data.frame (minus its last row)?
df = data.frame(a = 1:10, b = 21:30)
rbind(NA, df[-nrow(df), ])
# a b
#1 NA NA
#2 1 21
#3 2 22
#4 3 23
#5 4 24
#6 5 25
#7 6 26
#8 7 27
#9 8 28
#10 9 29
And similarly for leading all columns.
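A minimal base R sketch of that idea, covering both directions. shift_df is a hypothetical helper name, not a function from any package; by the convention assumed here, a positive n lags and a negative n leads.

```r
# Hypothetical helper: shift every column of a data frame by n rows.
# n > 0 lags (values move down), n < 0 leads (values move up).
shift_df <- function(df, n = 1L) {
  idx <- seq_len(nrow(df)) - n            # source row for each output row
  idx[idx < 1 | idx > nrow(df)] <- NA     # out-of-range positions become NA
  out <- df[idx, , drop = FALSE]          # an NA row index yields an all-NA row
  rownames(out) <- NULL                   # reset row names to 1..nrow
  out
}

df <- data.frame(a = 1:10, b = 21:30)
lagged <- shift_df(df, 1)    # an NA row first, then original rows 1..9
led    <- shift_df(df, -1)   # original rows 2..10, then a final NA row
```

Indexing by row position keeps the column types intact and needs no extra packages.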
A couple more options (lag here is dplyr::lag, so dplyr must be attached):
library(dplyr)
data.frame(lapply(df, lag))
require(purrr)
map_df(df, lag)
If your data is a data.table you can do
require(data.table)
as.data.table(shift(df))
Or, if you're overwriting df
df[] <- lapply(df, lag) # Thanks Moody
require(magrittr)
df %<>% map_df(lag)

Create multiple data frames from one based off values with a for loop

I have a large data frame that I would like to convert into smaller subset data frames using a for loop. I want the new data frames to be based on the values in a column of the large/parent data frame. Here is an example:
x<- 1:20
y <- c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","C","C","C")
df <- as.data.frame(cbind(x,y))
OK, now I want three data frames: one with columns x and y but only where y == "A", the second where y == "B", and so on. So the end result will be 3 new data frames: df.A, df.B, and df.C. I realize that this would be easy to do without a for loop, but my actual data has a lot of levels of y, so using a for loop (or similar) would be nice.
Thanks!
If you want to create separate objects in a loop, you can use assign. I used unique because you said you had many levels.
for(i in unique(df$y)) {
nam <- paste("df", i, sep = ".")
assign(nam, df[df$y==i,])
}
> df.A
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 8 A
> df.B
x y
9 9 B
10 10 B
11 11 B
12 12 B
13 13 B
14 14 B
15 15 B
16 16 B
17 17 B
I think you just need the split function:
split(df, df$y)
$A
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 8 A
$B
x y
9 9 B
10 10 B
11 11 B
12 12 B
13 13 B
14 14 B
15 15 B
16 16 B
17 17 B
$C
x y
18 18 C
19 19 C
20 20 C
From there it is just a matter of extracting elements from the list that split returns and storing them as objects, e.g. dfA <- split(df, df$y)[[1]], dfB <- split(df, df$y)[[2]], and so on.
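As a rough sketch of working with the split result, assuming the same toy data: the list elements can be used by name directly, and list2env can export them as separate objects if that is really needed. The names parts and rows_per_group are illustrative only.

```r
x <- 1:20
y <- rep(c("A", "B", "C"), times = c(8, 9, 3))
df <- data.frame(x, y)   # data.frame() keeps x numeric, unlike cbind()

parts <- split(df, df$y)   # named list with elements A, B and C

# Work on every piece without creating separate objects
rows_per_group <- sapply(parts, nrow)

# If separate objects are really wanted, rename the elements and export them
names(parts) <- paste("df", names(parts), sep = ".")
list2env(parts, envir = environment())
```

Keeping the pieces in one list is usually easier to loop over than many df.* objects, which is why the export step is left as optional.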

Filtering a Dataset by another Dataset in R

The task I am trying to accomplish is essentially filtering one dataset by the entries in an "id" column of another dataset. The data sets I am working with are quite large, with tens of thousands of entries and 30 or so variables. I have made toy datasets to help explain what I want to do.
The first dataset contains a list of entries, and each entry has its own unique accession number (this is the id).
Data1 = data.frame(accession_number = c('a','b','c','d','e','f'), values =c('1','3','4','2','3','12'))
>Data1
accession_number values
1 a 1
2 b 3
3 c 4
4 d 2
5 e 3
6 f 12
I am only interested in the entries that have the accession numbers 'c', 'd', and 'e'. (In reality, though, my list is around 100 unique accession numbers.) Next, I created a data frame with only the unique accession numbers and no other values.
>SubsetData1
accession_number
1 c
2 d
3 e
The second data set, which I am looking to filter, contains multiple entries, some of which have the same accession number.
>Data2
accession_number values Intensity col4 col6
1 a 1 -0.0251304 a -0.4816370
2 a 2 -0.4308735 b -1.0335971
3 c 3 -1.9001321 c 0.6416735
4 c 4 0.1163934 d -0.4489048
5 c 5 0.7586820 e 0.5408650
6 b 6 0.4294415 f 0.6828412
7 b 7 -0.8045201 g 0.6677730
8 b 8 -0.9898947 h 0.3948412
9 c 9 -0.6004642 i -0.3323932
10 c 10 1.1367578 j 0.9151915
11 c 11 0.7084980 k -0.3424039
12 c 12 -0.9618102 l 0.2386307
13 c 13 0.2693441 m -1.3861064
14 d 14 1.6059971 n 1.3801924
15 e 15 2.4166472 o -1.1806929
16 e 16 -0.7834619 p 0.1880451
17 e 17 1.3856535 q -0.7826357
18 f 18 -0.6660976 r 0.6159731
19 f 19 0.2089186 s -0.8222399
20 f 20 -1.5809582 t 1.5567113
21 f 21 0.3610700 u 0.3264431
22 f 22 1.2923324 v 0.9636267
What I'm looking to do is compare the subsetted list from the first data set (SubsetData1) with the second dataset (Data2) to create a filtered dataset that contains only the entries whose accession numbers appear in the subsetted list. The filtered dataset should look something like this.
accession_number values Intensity col4 col6
3 c 3 -1.9001321 c 0.6416735
4 c 4 0.1163934 d -0.4489048
5 c 5 0.7586820 e 0.5408650
9 c 9 -0.6004642 i -0.3323932
10 c 10 1.1367578 j 0.9151915
11 c 11 0.7084980 k -0.3424039
12 c 12 -0.9618102 l 0.2386307
13 c 13 0.2693441 m -1.3861064
14 d 14 1.6059971 n 1.3801924
15 e 15 2.4166472 o -1.1806929
16 e 16 -0.7834619 p 0.1880451
17 e 17 1.3856535 q -0.7826357
I don't know if I need to start making loops in order to tackle this problem, or if there is a simple R command that would help me accomplish this task. Any help is much appreciated.
Thank You
Try this:
WantedData=Data2[Data2$accession_number %in% SubsetData1$accession_number, ]
You can also use inner_join from the dplyr package.
dat = inner_join(Data2, SubsetData1, by = "accession_number")
The subset function is designed for basic subsetting:
subset(Data2,accession_number %in% SubsetData1$accession_number)
Alternately, here you could merge:
merge(Data2,SubsetData1)
The other solutions seem fine, but I like the readability of dplyr, so here's a dplyr solution.
library(dplyr)
new_dataset <- Data2 %>%
filter(accession_number %in% SubsetData1$accession_number)
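For comparison, a small self-contained base R sketch of the %in% filter and its complement. The toy values below are stand-ins rather than the question's full data, and keep, filtered and excluded are illustrative names.

```r
# Toy stand-ins for the question's data (values are illustrative)
Data2 <- data.frame(
  accession_number = c("a", "a", "c", "b", "c", "d", "e", "f"),
  values = 1:8,
  stringsAsFactors = FALSE
)
SubsetData1 <- data.frame(accession_number = c("c", "d", "e"),
                          stringsAsFactors = FALSE)

# TRUE for every row of Data2 whose id appears in the wanted list
keep <- Data2$accession_number %in% SubsetData1$accession_number

filtered <- Data2[keep, ]    # rows whose id is wanted
excluded <- Data2[!keep, ]   # the complement: rows whose id is not wanted
```

Unlike merge, the %in% approach keeps the original row order and does not duplicate rows when the wanted list contains repeated ids.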

Expand Records to create Edges for igraph

I have a dataset with multiple data points that I want to map. igraph uses 1-1 relationships, though, so I'm looking for a way to turn one long record into many 1-1 records. For example:
test <- data.frame(
drug1=c("A","B","C","D","E","F","G","H","I","J","K"),
drug2=c("P","O","R","T","L","A","N","D","R","A","D"),
drug3=c("B","O","R","I","S","B","E","C","K","E","R"),
age=c(15,20,35,1,35,58,51,21,54,80,75))
Which gives this output
drug1 drug2 drug3 age
1 A P B 15
2 B O O 20
3 C R R 35
4 D T I 1
5 E L S 35
6 F A B 58
7 G N E 51
8 H D C 21
9 I R K 54
10 J A E 80
11 K D R 75
I'd like to make a new table with the drug1-drug2 pairs and then stack the drug2-drug3 pairs into the same two columns. So it would look like this.
drug1 drug2 age
1 A P 15
2 P B 15
3 B O 20
4 O O 20
5 C R 35
drug2 is moved into the drug1 spot and drug3 is moved into the drug2 spot. I realize I can do this with multiple smaller steps, but I was wondering if anyone knew of a way to loop this process. I have up to 11 fields.
Here are the smaller steps.
a <- test[,c("drug1","drug2","age")]
b <- test[,c("drug2","drug3","age")]
names(b) <- c("drug1","drug2","age")
test2 <- rbind(a,b)
drug1 drug2 age
1 A P 15
2 B O 20
3 C R 35
4 D T 1
5 E L 35
6 F A 58
7 G N 51
8 H D 21
9 I R 54
10 J A 80
11 K D 75
12 P B 15
13 O O 20
14 R R 35
15 T I 1
16 L S 35
17 A B 58
18 N E 51
19 D C 21
20 R K 54
21 A E 80
22 D R 75
So if you have many fields, here's a helper function which can pull down the data into pairs.
pulldown <- function(data, cols=1:(min(attr)-1),
                     attr=ncol(data), newnames=names(data)[c(cols[1:2], attr)]) {
  # allow columns to be given by name as well as by index
  if(is.character(attr)) attr<-match(attr, names(data))
  if(is.character(cols)) cols<-match(cols, names(data))
  # embed(cols, 2) lists each pair of adjacent columns; stack one block per pair
  do.call(rbind, lapply(unname(data.frame(t(embed(cols,2)))), function(x) {
    `colnames<-`(data[, c(sort(x), attr)], newnames)
  }))
}
You can run it with your data with
pulldown(test)
It has a parameter called attr where you can specify the columns (by index or name) that you would like repeated on every row (it defaults to the last column). The cols parameter is a vector of all the columns that you would like to turn into pairs (the default is the first column up to the one before the first attr column). You can also specify a vector of newnames for the output columns.
With three columns your method is pretty simple; this might be a better choice for 11 columns.
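As an alternative hedged sketch, the question's "smaller steps" can be generalized with a plain loop over consecutive column pairs. This is not the helper above, just the same idea spelled out; drug_cols, pairs and edges are illustrative names.

```r
test <- data.frame(
  drug1 = c("A","B","C","D","E","F","G","H","I","J","K"),
  drug2 = c("P","O","R","T","L","A","N","D","R","A","D"),
  drug3 = c("B","O","R","I","S","B","E","C","K","E","R"),
  age   = c(15,20,35,1,35,58,51,21,54,80,75),
  stringsAsFactors = FALSE
)

drug_cols <- c("drug1", "drug2", "drug3")   # list further fields here if present

# One data frame per consecutive pair (drug1-drug2, drug2-drug3, ...)
pairs <- lapply(seq_len(length(drug_cols) - 1), function(i) {
  setNames(test[c(drug_cols[i], drug_cols[i + 1], "age")],
           c("drug1", "drug2", "age"))
})

edges <- do.call(rbind, pairs)   # stack the pairs into one edge list
```

With three drug columns this yields the same 22 rows as the manual rbind of a and b in the question.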
Slightly more compact, as a single expression:
test2 <- rbind( test[c("drug1","drug2","age")],
setNames(test[c("drug2", "drug3", "age")], c("drug1", "drug2", "age"))
)
The setNames function can be useful when column names are missing or need to be coerced to something else.

Taking a subset of a dataframe according to an integer vector

I have a dataframe and it has four columns. Now I want to take a subset of this dataframe according to an integer vector. I tried to use subset and looked at other posts in vain.
b=c('p','q','r','s','t','u')
a=c('at','bt','ct','dt','et','ft')
d=c(22,23,24,25,26,27)
e=c(1,2,3,4,5,6)
dat=data.frame(b,a,d,e)
dat
b a d e
1 p at 22 1
2 q bt 23 2
3 r ct 24 3
4 s dt 25 4
5 t et 26 5
6 u ft 27 6
test=c(2,5)
Now I want to select the rows whose numbers are in test, that is, the 2nd and 5th rows, keeping all of the columns.
Given your definitions of dat and test,
dat[test,]
# b a d e
# 2 q bt 23 2
# 5 t et 26 5
or
dat[dat$e %in% test,]
# b a d e
# 2 q bt 23 2
# 5 t et 26 5
The first approach just treats the elements of test as row numbers of dat. The second extracts all rows of dat for which dat$e is in test.
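The two approaches only agree here because e happens to equal the row number. A small sketch with made-up values of e shows how they can diverge; by_position and by_value are illustrative names.

```r
# Same idea as the question's data, but e no longer matches the row number
dat <- data.frame(b = c("p", "q", "r", "s", "t", "u"),
                  e = c(5, 2, 7, 8, 9, 10),
                  stringsAsFactors = FALSE)
test <- c(2, 5)

by_position <- dat[test, ]             # 2nd and 5th rows: b is "q" and "t"
by_value    <- dat[dat$e %in% test, ]  # rows where e is 2 or 5: b is "p" and "q"
```

Which one is appropriate depends on whether test holds row positions or values of a key column.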
