Taking a subset of a dataframe according to an integer vector - r

I have a dataframe and it has four columns. Now I want to take a subset of this dataframe according to an integer vector. I tried to use subset and looked at other posts in vain.
b=c('p','q','r','s','t','u')
a=c('at','bt','ct','dt','et','ft')
d=c(22,23,24,25,26,27)
e=c(1,2,3,4,5,6)
dat=data.frame(b,a,d,e)
dat
b a d e
1 p at 22 1
2 q bt 23 2
3 r ct 24 3
4 s dt 25 4
5 t et 26 5
6 u ft 27 6
test=c(2,5)
Now I want to select all the rows (keeping all the columns too) that are in test that is 2nd and 5th rows and keep all other columns.

Given your definitions of dat and test,
dat[test,]
# b a d e
# 2 q bt 23 2
# 5 t et 26 5
or
dat[dat$e %in% test,]
# b a d e
# 2 q bt 23 2
# 5 t et 26 5
The first approach just treats the elements of test as row numbers of dat. The second extracts all rows of dat for which dat$e is in test.

Related

Filtering a Dataset by another Dataset in R

The task I am trying to accomplish is essentially filtering one dataset by the entries in another dataset by entries in an "id" column. The data sets I am working with are quite large having 10 of thousands of entries and 30 or so variables. I have made toy datasets to help explain what I want to do.
The first dataset contains a list of entries and each entry has their own unique accession number(this is the id).
Data1 = data.frame(accession_number = c('a','b','c','d','e','f'), values =c('1','3','4','2','3','12'))
>Data1
accession_number values
1 a 1
2 b 3
3 c 4
4 d 2
5 e 3
6 f 12
I am only interested in the entries that have the accession number 'c', 'd', and 'e'. (In reality though my list is around 100 unique accession numbers). Next, I created a dataframe with the only the unique accession numbers and no other values.
>SubsetData1
accession_number
1 c
2 d
3 e
The second data set, which i am looking to filter, contains multiple entries some which have the same accession number.
>Data2
accession_number values Intensity col4 col6
1 a 1 -0.0251304 a -0.4816370
2 a 2 -0.4308735 b -1.0335971
3 c 3 -1.9001321 c 0.6416735
4 c 4 0.1163934 d -0.4489048
5 c 5 0.7586820 e 0.5408650
6 b 6 0.4294415 f 0.6828412
7 b 7 -0.8045201 g 0.6677730
8 b 8 -0.9898947 h 0.3948412
9 c 9 -0.6004642 i -0.3323932
10 c 10 1.1367578 j 0.9151915
11 c 11 0.7084980 k -0.3424039
12 c 12 -0.9618102 l 0.2386307
13 c 13 0.2693441 m -1.3861064
14 d 14 1.6059971 n 1.3801924
15 e 15 2.4166472 o -1.1806929
16 e 16 -0.7834619 p 0.1880451
17 e 17 1.3856535 q -0.7826357
18 f 18 -0.6660976 r 0.6159731
19 f 19 0.2089186 s -0.8222399
20 f 20 -1.5809582 t 1.5567113
21 f 21 0.3610700 u 0.3264431
22 f 22 1.2923324 v 0.9636267
What im looking to do is compare the subsetted list of the first data set(SubsetData1), with the second dataset (Data2) to create a filtered dataset that only contains the entries that have the same accession numbers defined in the subsetted list. The filtered dataset should look something like this.
accession_number values Intensity col4 col6
9 c 9 -0.6004642 i -0.3323932
10 c 10 1.1367578 j 0.9151915
11 c 11 0.7084980 k -0.3424039
12 c 12 -0.9618102 l 0.2386307
13 c 13 0.2693441 m -1.3861064
14 d 14 1.6059971 n 1.3801924
15 e 15 2.4166472 o -1.1806929
16 e 16 -0.7834619 p 0.1880451
17 e 17 1.3856535 q -0.7826357
I don't know if I need to start making loops in order to tackle this problem, or if there is a simple R command that would help me accomplish this task. Any help is much appreciated.
Thank You
Try this
WantedData=Data2[Data2$ccession_number %in% SubsetData1$accession_number, ]
You can also use inner_join of dplyr package.
dat = inter_join(Data2, SubsetData1)
The subset function is designed for basic subsetting:
subset(Data2,accession_number %in% SubsetData1$accession_number)
Alternately, here you could merge:
merge(Data2,SubsetData1)
The other solutions seem fine, but I like the readability of dplyr, so here's a dplyr solution.
library(dplyr)
new_dataset <- Data2 %>%
filter(accession_number %in% SubsetData1$accession_number)

Data frame manoeuvre [duplicate]

This question already has an answer here:
R programming - data frame manoevur
(1 answer)
Closed 7 years ago.
Suppose I have the following dataframe:
Categories Variable
1 a 11
2 b 21
3 c 34
4 d 45
5 e 52
6 f 65
7 g 76
8 a 13
9 b 24
I'd like to turn it into a new dataframe like the following:
Categories Variable
1 a 11
2 b 21
3 c 34
4 d+e 97
5 f 65
6 g 76
7 a 13
8 b 24
How can I do it? (Surely, the dataframe is much larger, but I want the sum of all categories of d and e and group it into a new category, say 'H').
Many thanks!
This is a good question but unfortunately OT here. So I'll answer until it get migrated.
I'm assuming Variable is of class factor, so you'll need to properly re-level it (assuming your data is called df)
levels(df$Categories)[levels(df$Categories) %in% c("d", "e")] <- "h"
Next, I'll use the data.table package as you have a large data set and it's devel version (v >= 1.9.5) has a convinient function called rleid (download from GitHub)
library(data.table) ## v >= 1.9.5
setDT(df)[, .(Variable = sum(Variable)), by = .(indx = rleid(Categories), Categories)]
# indx Categories Variable
# 1: 1 a 11
# 2: 2 b 21
# 3: 3 c 34
# 4: 4 h 97
# 5: 5 f 65
# 6: 6 g 76
# 7: 7 a 13
# 8: 8 b 24
You can try this:
# plyr package provides rbind.fill() function for row binding
library(plyr)
# Assuming you have a rows.cvs containing the data, read it into a data frame
data<-read.csv("rows.csv",stringsAsFactors=FALSE)
# Find the lowest index of d or e (whichever comes first)
index<-min(match("d",data$Var1.nominal.), match("e",data$Var1.nominal.))
# Returns all rows containing d and e in Var1(nominal) column
tempData<-data[data$Var1.nominal. %in% c("d","e"),]
# Remove all the rows containing d and e from original data frame
data<-data[!data$Var1.nominal. %in% c("d","e"),]
# Reorder row index numbers in data
rownames(data)<-NULL
# Combine rows containing d and e in Var1(nominal)column, and sum up the column Var2(numeric)
tempData<-data.frame(Var1.nominal.="d+e",Var2.numeric.=sum(tempData[,2]))
# Combine original data and tempData frame with use of index
data<-rbind.fill(data[1:(index-1),],tempData,data[index:length(data[,1]),])
# Renaming "d+e" to"h"
data[index,1]="h"
# Getting rid of the tempData data frame
rm(tempData)
Output:
> data
Var1.nominal. Var2.numeric.
1 a 11
2 b 21
3 c 34
4 h 97
5 f 65
6 g 76
7 a 13
8 b 24

Expand Records to create Edges for igraph

I have a dataset that has multiple data points I want to map. iGraph uses 1-1 relationships though so I'm looking for a way to take one long record into many 1-1 records. For Example:
test <- data.frame(
drug1=c("A","B","C","D","E","F","G","H","I","J","K"),
drug2=c("P","O","R","T","L","A","N","D","R","A","D"),
drug3=c("B","O","R","I","S","B","E","C","K","E","R"),
age=c(15,20,35,1,35,58,51,21,54,80,75))
Which gives this output
drug1 drug2 drug3 age
1 A P B 15
2 B O O 20
3 C R R 35
4 D T I 1
5 E L S 35
6 F A B 58
7 G N E 51
8 H D C 21
9 I R K 54
10 J A E 80
11 K D R 75
I'd like to make a new table with drug1-drug2 and then stack drug2-drug3 into the previous column. So it would look like this.
drug1 drug2 age
1 A P 15
2 P B 15
3 C R 20
4 R R 20
5 E L 35
drug2 is held in the drug1 spot and drug3 is moved to drug1. I realize I can do this by creating multiple smaller steps, but was was wondering if anyone new of a way to loop this process. I have up to 11 fields.
Here are the smaller steps.
a <- test[,c("drug1","drug2","age")]
b <- test[,c("drug2","drug3","age")]
names(b) <- c("drug1","drug2","age")
test2 <- rbind(a,b)
drug1 drug2 age
1 A P 15
2 B O 20
3 C R 35
4 D T 1
5 E L 35
6 F A 58
7 G N 51
8 H D 21
9 I R 54
10 J A 80
11 K D 75
12 P B 15
13 O O 20
14 R R 35
15 T I 1
16 L S 35
17 A B 58
18 N E 51
19 D C 21
20 R K 54
21 A E 80
22 D R 75
So if you have many fields, here's a helper function which can pull down the data into pairs.
pulldown <- function(data, cols=1:(min(attr)-1),
attr=ncol(data), newnames=names(data)[c(cols[1:2], attr)]) {
if(is.character(attr)) attr<-match(attr, names(data))
if(is.character(cols)) cols<-match(cols, names(data))
do.call(rbind, lapply(unname(data.frame(t(embed(cols,2)))), function(x) {
`colnames<-`(data[, c(sort(x), attr)], newnames)
}))
}
You can run it with your data with
pulldown(test)
It has a parameter called attr where you can specify the columns (index or names) you would like repeated every row (here I have it default to the last column). Then the cols parameter is a vector of all the columns that you would like to turn into pairs. (The default is the beginning to one before the first attr). You can also specify a vector of newnames for the columns as they come out.
With three columns your method is pretty simple, this might be a better choice for 11 columns.
Slightly more compact and a one-liner would be:
test2 <- rbind( test[c("drug1","drug2","age")],
setNames(test[c("drug3", "drug2", "age")], c("drug1", "drug2", "age"))
)
The setNames function can be useful when column names are missing or need to be coerced to something else.

Extract rows from data frame based on multiple identifiers in another data frame

I would like to extract a selection of rows from a data frame based on multiple identifying variables contained in another data frame. Consider the following illustrative data set:
df <- data.frame(id=c(1,2,2,3,4,4,4,4,5), ref=c("A","B","C","D","E","F","F","G","H"), amount=c(10,15,20,25,30,35,-35,40,45))
required <- data.frame(id=c(2,3,4,4), ref=c("B","D","E","F"))
I would like the output in a data frame with id, ref and amount as follows:
id ref amount
2 B 15
3 D 25
4 E 30
4 F 35
4 F -35
Note in particular that id 4 and ref F have two matches from the df with amounts 35 and -35.
You want to merge:
merge(df, required)
## id ref amount
## 1 2 B 15
## 2 3 D 25
## 3 4 E 30
## 4 4 F 35
## 5 4 F -35

Making a data frame that is a subset of two data frames

I am stumped again.
I have two data frames
dataframe1
a b c
[1] 21 12 22
[2] 11 9 6
[3] 4 6 7
and
dataframe2
f g h
[1] 21 12 22
[2] 11 9 6
[3] 4 6 7
I want to take the first column of dataframe1 and make three new dataframes with the second column being each of the three f,g and h
Obviously I could just do a subset over and over
subset1 <- cbind(dataframe1[,1]dataframe2[,1])
subset2 <- cbind(dataframe1[,1]dataframe2[,2])
but my dataframes will have variable numbers of columns and are very long row numberwise. So I am looking for a little more something general. My data frames will always be the same length.
The closest I have come to getting anything was with apply and cbind but I got either a set of three rows that were a and f, a and g, a and h each combined as single numeric vector or I get a single data frame with four columns, a,f,g,h.
Help is deeply appreciated.
You can use lapply it iterate over the columns of dataframe2 like so:
lapply(dataframe2, function(x) as.data.frame(cbind(dataframe1[,1], x)))
This will result in a list object where each entry corresponds to a column of dataframe2. For example:
$f
V1 x
1 21 21
2 11 11
3 4 4
$g
V1 x
1 21 12
2 11 9
3 4 6
$h
V1 x
1 21 22
2 11 6
3 4 7

Resources