Subsetting a dataframe by names in another dataframe [duplicate] - r

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 7 years ago.
I have a very big reference file with thousands of pairwise comparisons between thousands of objects ("OTUs"). The data frame is in long format:
'data.frame': 14845516 obs. of 3 variables:
$ OTU1 : chr "0" "0" "0" "0" ...
$ OTU2 : chr "8192" "1" "8194" "3" ...
$ gendist: num 78.7 77.8 77.6 74.4 75.3 ...
I also have a much smaller subset with observed data (slightly different structure):
'data.frame': 286903 obs. of 3 variables:
$ OTU1 : chr "1239" "1603" "2584" "1120" ...
$ OTU2 : chr "12136" "12136" "12136" "12136" ...
$ ecodist: num 2.08 1.85 2 1.73 1.53 ...
- attr(*, "na.action")=Class 'omit' Named int [1:287661] 1 759 760 1517 1518 1519 2275 2276 2277 2278 ...
.. ..- attr(*, "names")= chr [1:287661] "1" "759" "760" "1517" ...
Again, it's a pairwise comparison of objects ("OTUs"). All objects in the smaller dataset are also in the reference dataset.
I want to reduce the reference so that it only contains objects that are also found in the smaller dataset. It is very important that this is done on both columns (OTU1, OTU2).
Here is toy data:
library(reshape)
###reference
Ref <- cor(as.data.frame(matrix(rnorm(100),10,10)))
row.names(Ref) <- colnames(Ref) <- LETTERS[1:10]
Ref[upper.tri(Ref)] <- NA
diag(Ref) <- NA
Ref.m <- na.omit(melt(Ref, varnames = c('row', 'col')))
###query
tmp <- cor(as.data.frame(matrix(rnorm(25),5,5)))
row.names(tmp) <- colnames(tmp) <- LETTERS[1:5]
tmp[upper.tri(tmp)] <- NA
diag(tmp) <- NA
tmp.m <- na.omit(melt(tmp, varnames = c('row', 'col')))

The following works for me using your toy data:
Ref[rownames(tmp), colnames(tmp)]
This selects (by name) only those rows in Ref whose names are also the names of rows in tmp, and likewise for columns.
If you want to stick with the long format in the str outputs in the first part of your question, you can instead use something like:
data1[(data1$OTU1 %in% data2$OTU1) & (data1$OTU2 %in% data2$OTU2), ]
Here I'm creating a logical vector that indicates which rows of your reference data frame (data1) have their OTU1 entry somewhere in data2$OTU1, and the same for OTU2. Said logical vector is then used to select rows of data1.
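If an object can appear in OTU1 in one data set but in OTU2 in the other, matching column against column may drop valid pairs. A minimal sketch of a variant that pools the observed objects from both columns first (the toy values below are made up for illustration, and data1/data2 are the same placeholder names as above):

```r
# Toy long-format data: data1 is the reference, data2 the observed subset
data1 <- data.frame(OTU1 = c("A", "A", "B", "C", "D"),
                    OTU2 = c("B", "C", "C", "D", "E"),
                    gendist = c(1, 2, 3, 4, 5),
                    stringsAsFactors = FALSE)
data2 <- data.frame(OTU1 = c("A", "B"),
                    OTU2 = c("C", "C"),
                    ecodist = c(0.1, 0.2),
                    stringsAsFactors = FALSE)

# All objects observed in either column of the smaller data set
keep <- union(data2$OTU1, data2$OTU2)

# Keep reference rows where both members of the pair were observed
sub <- data1[data1$OTU1 %in% keep & data1$OTU2 %in% keep, ]
sub
```

Here the pair A-B is kept even though "B" never occurs in data2$OTU2; the column-by-column test would have dropped it. Whether that is desired depends on how the pairs are ordered in your data.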

Related

Object size increases hugely when transposing a data frame

I have a data frame of ca. 50,000 RNA transcripts in rows, with 10,000 different samples in columns. The size of the data frame is 4.9GB.
I then have to transpose the data in order to subset it properly later:
df <- data.frame(t(df))
After the transpose, the object size has ballooned to 70GB. Why is this happening? Should transposing data really change the file size that much?
str() of the first 20 columns:
str(df[1:20])
Classes 'tbl_df', 'tbl' and 'data.frame': 56202 obs. of 20 variables:
$ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "ENSG00000223972.4" "ENSG00000227232.4" "ENSG00000243485.2" "ENSG00000237613.2" ...
$ Description : chr "DDX11L1" "WASH7P" "MIR1302-11" "FAM138A" ...
$ GTEX-1117F-0226-SM-5GZZ7: num 0.1082 21.4 0.1602 0.0505 0 ...
$ GTEX-111CU-1826-SM-5GZYN: num 0.1158 11.03 0.0643 0 0 ...
$ GTEX-111FC-0226-SM-5N9B8: num 0.021 16.75 0.0467 0.0295 0 ...
$ GTEX-111VG-2326-SM-5N9BK: num 0.0233 8.172 0 0.0326 0 ...
$ GTEX-111YS-2426-SM-5GZZQ: num 0 7.658 0.0586 0 0 ...
$ GTEX-1122O-2026-SM-5NQ91: num 0.0464 9.372 0 0 0 ...
$ GTEX-1128S-2126-SM-5H12U: num 0.0308 10.08 0.1367 0.0861 0.1108 ...
$ GTEX-113IC-0226-SM-5HL5C: num 0.0936 13.56 0.2079 0.131 0.0562 ...
$ GTEX-117YX-2226-SM-5EGJJ: num 0.121 9.889 0.0537 0.0677 0 ...
$ GTEX-11DXW-0326-SM-5H11W: num 0.0286 9.121 0.0635 0 0 ...
$ GTEX-11DXX-2326-SM-5Q5A2: num 0 6.698 0.0508 0.032 0 ...
$ GTEX-11DZ1-0226-SM-5A5KF: num 0.0237 9.835 0 0.0664 0 ...
$ GTEX-11EI6-0226-SM-5EQ64: num 0.0802 13.1 0 0 0 ...
$ GTEX-11EM3-2326-SM-5H12B: num 0.0223 8.904 0.0496 0.0625 0.0402 ...
$ GTEX-11EMC-2826-SM-5PNY6: num 0.0189 16.59 0 0.0265 0.034 ...
$ GTEX-11EQ8-0226-SM-5EQ5G: num 0.0931 15.1 0.0689 0.0869 0 ...
$ GTEX-11EQ9-2526-SM-5HL66: num 0.0777 9.838 0 0 0 ...
First, you write that:
I then have to transpose this dataset in order to subset it properly later,
To be honest, I doubt you have to. Thus, this may be an XY problem. That said, I think it could be of general interest to dissect the issue.
The increase in object size is most likely because the class of the object changes when it is transposed, and objects of different classes have different sizes.
I will try to illustrate this with some examples. We begin with the change of class.
Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:
# set number of rows and columns
nr <- 5
nc <- 5
set.seed(1)
d <- data.frame(x = sample(letters, nr, replace = TRUE),
y = sample(letters, nr, replace = TRUE),
matrix(runif(nr * nc), nrow = nr),
stringsAsFactors = FALSE)
Transpose it:
d_t <- t(d)
Check the structure of the original data and its transposed sibling:
str(d)
# 'data.frame': 5 obs. of 7 variables:
# $ x : chr "g" "j" "o" "x" ...
# $ y : chr "x" "y" "r" "q" ...
# $ X1: num 0.206 0.177 0.687 0.384 0.77
# $ X2: num 0.498 0.718 0.992 0.38 0.777
# $ X3: num 0.935 0.212 0.652 0.126 0.267
# $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
# $ X5: num 0.482 0.6 0.494 0.186 0.827
str(d_t)
# chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:7] "x" "y" "X1" "X2" ...
# ..$ : NULL
The data frame has become a character matrix. How did this happen? Well, check the help text for the transpose method for data frames: ?t.data.frame:
A data frame is first coerced to a matrix: see as.matrix.
OK, see ?as.matrix:
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]
Whereas a data frame is a list where each column can be of different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of transpose. Then you coerce the matrix to data frame, where all columns are character (or factor, depending on your stringsAsFactors setting) - check str(data.frame(d_t)).
In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:
# original data frame
object.size(d)
# 2360 bytes
# transposed df - a character matrix
object.size(d_t)
# 3280 bytes
The transposed object is clearly larger. If we increase the number of rows and the number of numeric columns to mimic your data better, the relative difference is even larger:
nr <- 56202
nc <- 20
# re-create d and d_t exactly as above, now with these values of nr and nc
object.size(d)
# 9897712 bytes
object.size(d_t)
# 78299656 bytes
Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:
onedigit_int <- sample(1:9, 1e4, replace = TRUE)
onedigit_num <- as.numeric(onedigit_int)
onedigit_char <- as.character(onedigit_int)
object.size(onedigit_int)
# 40048 bytes
object.size(onedigit_num)
# 80048 bytes
object.size(onedigit_char)
# 80552 bytes
For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors with multi-digits (which you seem to have) and their corresponding vectors of multi-character strings:
multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
multidigit_num <- as.numeric(multidigit_int)
multidigit_char <- as.character(multidigit_int)
object.size(multidigit_int)
# 40048 bytes
object.size(multidigit_num)
# 80048 bytes
object.size(multidigit_char)
# 637360 bytes
The integer vector still occupies 4 bytes for each element, the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector is larger for larger strings.
Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.
Transposing a data frame with columns of different class is very rarely sensible. And if all columns are of same class, then we may just as well use a matrix from the start.
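As a rough sketch of that last point, assuming the character columns are only annotation and can be kept aside, one could transpose just the numeric part as a matrix (is_num, m_t and annot are names made up for this illustration):

```r
# Same toy data as above: two character columns, five numeric columns
set.seed(1)
d <- data.frame(x = sample(letters, 5, replace = TRUE),
                y = sample(letters, 5, replace = TRUE),
                matrix(runif(25), nrow = 5),
                stringsAsFactors = FALSE)

# Flag the numeric columns and transpose only those
is_num <- vapply(d, is.numeric, logical(1))
m_t <- t(as.matrix(d[is_num]))

# Keep the character columns separately as annotation
annot <- d[!is_num]

typeof(m_t)  # "double": no coercion to character
dim(m_t)
```

Because every column passed to as.matrix is numeric, the result stays a numeric matrix, so each element keeps its 8-byte size after transposing.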
Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham.

Change global environment to read in numerical order

I have a dataset in which I have 22 animals. Each animal has been named as follows: c("Shark1", "Shark2", "Shark3", ...).
I am trying to plot two categorical variables against each other to determine the proportion of time each shark spent at separate depths:
Sharks<-table(merge$DepthCat, merge$ID2) #Depth category vs. ID
merge$DepthCat[merge$Depth2>200]<-"4"
Sharks<-table(merge$DepthCat, merge$ID2)
plot(t(Sharks), main="",
col=c("whitesmoke", "slategray3", "slategray", "slategray4"),
ylab="Depth category", xlab="Month")
axis(side=4)
While the plot works, it orders the sharks alphabetically rather than numerically, so they appear in the wrong order in the resulting graph.
Does anyone know how to resolve this for the plot? I have researched the array method but am unsure how it would be implemented here.
You didn't provide your complete data set, so I generated my own random data. Given that the bar headers derived from ID2 are sorting lexicographically, I assumed they are stored as characters in your data.frame merge, so I generated them thusly.
set.seed(2L);
NR <- 300L;
merge <- data.frame(ID2=sample(as.character(1:22),NR,TRUE),Depth2=pmax(0,rnorm(NR,100,50)),stringsAsFactors=FALSE);
merge$DepthCat <- as.character(findInterval(merge$Depth2,c(0,66,133,200)));
str(merge);
## 'data.frame': 300 obs. of 3 variables:
## $ ID2 : chr "5" "16" "13" "4" ...
## $ Depth2 : num 148.8 91.5 136.1 57.8 163.9 ...
## $ DepthCat: chr "3" "2" "3" "1" ...
And sure enough, we can reproduce the problem with this test data:
Sharks <- table(merge$DepthCat,merge$ID2);
plot(t(Sharks),main='',col=c('whitesmoke','slategray3','slategray','slategray4'),ylab='Depth category',xlab='Month');
axis(side=4L);
The solution is to coerce the ID2 vector to numeric so it sorts numerically.
merge$ID2 <- as.integer(merge$ID2);
str(merge);
## 'data.frame': 300 obs. of 3 variables:
## $ ID2 : int 5 16 13 4 21 21 3 19 11 13 ...
## $ Depth2 : num 148.8 91.5 136.1 57.8 163.9 ...
## $ DepthCat: chr "3" "2" "3" "1" ...
Sharks <- table(merge$DepthCat,merge$ID2);
plot(t(Sharks),main='',col=c('whitesmoke','slategray3','slategray','slategray4'),ylab='Depth category',xlab='Month');
axis(side=4L);
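If the IDs really are strings such as "Shark1" rather than bare numbers, as.integer() would return NA for them. A possible alternative, sketched here on a made-up vector of IDs, is to keep the labels but turn them into a factor whose levels are ordered by the numeric part:

```r
# Hypothetical IDs in the form described in the question
ids <- c("Shark1", "Shark10", "Shark2", "Shark22", "Shark3")

# Strip the leading non-digit prefix and convert the rest to integer
num <- as.integer(sub("^[^0-9]+", "", ids))

# Order the unique labels by their numeric part and use them as levels
lev <- unique(ids[order(num)])
ids_f <- factor(ids, levels = lev)

levels(ids_f)
table(ids_f)   # tabulated in level order, not alphabetical order
```

table() and plot() follow the factor's level order, so applying the same idea to merge$ID2 before calling table() should give bars in numerical order while keeping the "Shark" labels.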

Subsetting SPSS data imported into r with package haven?

I've used the package haven to read SPSS data into R. All seems ok, except that when I try to subset the data it doesn't seem to behave correctly. Here's the code (I don't have SPSS to create example data and can't post the real stuff):
require(haven)
df <- read_spss("filename1.sav")
tmp <- df[as_factor(df$variable1) == "factor1",]
tmp <- tmp[!is.na(tmp$variable2), ]
The above df has "NA" scattered throughout. I expected the above to subset the data, keeping only rows where variable1 is "factor1" and discarding all rows with NA in variable2. The first subset works as expected. But the second subset does not. It removes rows, but NAs are still present.
I suspect the issue has something to do with the way haven structures the imported data and uses the class labelled instead of an actual factor variable, but it's over my head. Anyone know what could be happening and how to accomplish the same?
Here's the structure of df, variable1 and variable2:
> str(df)
'data.frame': 4573 obs. of 316 variables:
> str(df$variable1)
Class 'labelled' atomic [1:4573] 9 9 9 14 8 8 2 4 8 16 ...
..- attr(*, "labels")= Named num [1:18] 1 2 3 4 5 6 7 8 9 10 ...
.. ..- attr(*, "names")= chr [1:18] "factor1" "factor2" "factor3" "factor4" ...
> str(df$variable2)
Class 'labelled' atomic [1:4573] 3 NA 3 NA 3 NA 1 1 NA NA ...
..- attr(*, "labels")= Named num [1:3] 1 2 3
.. ..- attr(*, "names")= chr [1:3] "Sponsor" "Not a Sponsor" "Don't Know"

How to access composite elements in a data frame

I've created this data frame and want to access the individual elements for plotting. But it seems I can't. What kind of data frame have I created and how can I access its individual elements?
> print(df)
B.mean B.conf1 B.conf2
1 0.75000000 -0.18826132 1.68826132
2 0.66666667 0.01334534 1.31998799
3 0.33333333 -0.31998799 0.98665466
> names(df)
[1] "B"
> str(df)
'data.frame': 3 obs. of 1 variable:
$ B: num [1:3, 1:3] 0.75 0.6667 0.3333 -0.1883 0.0133 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mean" "conf1" "conf2"
The 'B' column is a matrix as evident from the str of 'df'. By using do.call with data.frame, it gets converted to 3 columns of a data.frame.
do.call(data.frame, df)
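To make this concrete, here is a small sketch that builds a data frame with a matrix column (the values are rounded from the question, and the id column is added only for illustration), accesses elements directly, and then flattens it:

```r
# A 3 x 3 matrix with named columns, filled column by column
m <- matrix(c(0.75, 0.67, 0.33,
              -0.19, 0.01, -0.32,
              1.69, 1.32, 0.99),
            nrow = 3,
            dimnames = list(NULL, c("mean", "conf1", "conf2")))

# Assigning a matrix to a single column stores it as a matrix column
df <- data.frame(id = 1:3)
df$B <- m
names(df)          # "id" "B"

# Elements can be reached through the matrix itself
df$B[, "mean"]     # the whole "mean" column
df$B[2, "conf1"]   # a single element

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, df)
names(flat)        # "id" "B.mean" "B.conf1" "B.conf2"
flat$B.mean
```

Either route works for plotting; flattening is mainly convenient when other functions expect one vector per column.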

Changing names in a list of dataframes [duplicate]

This question already has answers here:
Changing Column Names in a List of Data Frames in R
(6 answers)
Closed 4 years ago.
I read all textfiles in the working directory into a list, and cut some columns
all.files <- list.files(pattern = "\\.txt$")  # match only names ending in .txt
data.list <- lapply(all.files, function(x)read.table(x, sep="\t"))
names(data.list) <- all.files
data.list <- lapply(data.list, function(x) x[,1:3])
I am ending up with a "list of 2"
> str(data.list)
List of 2
$ 001.txt:'data.frame': 71330 obs. of 3 variables:
..$ V1: Factor w/ 71321 levels
..$ V2: Factor w/ 1382 levels
..$ V3: num [1:71330] 89.1 99.5 98.8 99.4 99.5 ...
$ 002.txt:'data.frame': 98532 obs. of 3 variables
..$ V1: Factor w/ 98517 levels
..$ V2: Factor w/ 1348 levels
..$ V3: num [1:98532] 99.5 99 99.5 98.4 100 ...
I want to rename V1,V2,V3 according to
new.names<-c("query", "sbjct", "ident")
How is this possible with lapply?
You can try setNames
data.list <- lapply(data.list, setNames, new.names)
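A self-contained sketch of the same call on a made-up list of two small data frames (standing in for the files read above), to show that every element gets the new column names while the list names are untouched:

```r
# Toy stand-in for the list of data frames read from the text files
data.list <- list(
  "001.txt" = data.frame(V1 = c("a", "b"), V2 = c("x", "y"), V3 = c(89.1, 99.5)),
  "002.txt" = data.frame(V1 = c("c", "d"), V2 = c("z", "w"), V3 = c(99.5, 99.0))
)
new.names <- c("query", "sbjct", "ident")

# setNames(object, nm) returns the object with its names replaced;
# lapply passes new.names as the second argument for each data frame
data.list <- lapply(data.list, setNames, new.names)

lapply(data.list, names)
```

This assumes every data frame has exactly as many columns as there are entries in new.names.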
