How do I delete columns from a matrix based on a list in R?

I have a very large matrix that looks like this:
AGAG AAGAA AGTG AGAT AAGAT AGTT
1001 14691 0 0 0 0 5
1002 13 12 0 5831 20473 4
1003 0 5831 20473 0 0 0
1004 0 7936 7936 7936 0 0
1005 16066 0 0 24 2 2
There are >8000 columns, and I need to delete many of them (~3000) by column name. I have abbreviated the column names here (they are genome sequences). Obviously, I can't type that many out individually.
I made a separate table that has the column names I want to delete. For example:
AGAG
AGTG
AGTT
AGAT
So far I have tried subset and %in%, to no avail.

Here's an example with the iris dataset:
col_to_rm <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
col_to_keep <- setdiff(colnames(iris), col_to_rm)
iris[, col_to_keep]
iris is a data frame, but this will work just as well if you have a matrix.
You should probably be thinking about whether that data structure (a matrix) is the best way to store your data: you've got very many columns...
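Since you mention %in%, here's a minimal sketch of that route as well; the small matrix m and its values are made up purely for illustration:
m <- matrix(1:12, nrow = 3,
            dimnames = list(NULL, c("AGAG", "AAGAA", "AGTG", "AGAT")))
col_to_rm <- c("AGAG", "AGTG")
m[, !colnames(m) %in% col_to_rm] # keep every column whose name is not in the removal list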

Related

Cannot get completed dataset using imputeMCA

I use the missMDA package to fill in multiple categorical columns. However, I cannot get any result from these two functions: estim_ncpMCA(test_fill) and imputeMCA(test_fill). The program keeps running without showing a progress bar or returning any results.
This is the sample from the dataset.
Hybrid G1 G5 G8 G9 G10
P1000:P2030 0 -1 0 1 0
P1006:P1384 0 0 0 0 1
P1006:P1401 0 NA NA NA 1
P1006:P1412 0 0 0 0 1
P1006:P1594 0 0 0 0 1
P1013:P1517 0 0 0 0 1
I am working on a genetics project in R. In this dataset, there are 497 rows and 11,226 columns. Every row is a genetic marker series for a particular hybrid, and every column is a genetic marker ("G1", "G2", etc.) with values 1, 0, -1, and NA. There are 746,433 missing values in total, and I am trying to fill them in with imputeMCA.
I also made some transformations on test_fill before running imputeMCA.
test_fill = as.matrix(test_fill)
test_fill[, -1] <- lapply(test_fill[, -1], as.numeric)
I am wondering whether this is the result of having too many columns in my dataset, and whether I need to transpose my rows and columns.
I don't know if you found your answer, but I think your function doesn't run because of your first column, which seems to contain the labels of the individuals. You can specify that it should not be included in the analysis.
estim_ncpMCA(test_fill[, 2:11226], ncp.max = 5) # estimate the number of components
imputeMCA(test_fill[, 2:11226], ncp = X) # X = the ncp value estimated above
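If it helps, here is a fuller sketch of the same workflow; this is a sketch only, assuming test_fill is a data.frame whose first column holds the hybrid labels and whose marker columns are factors (the names nb, res and completed are made up):
nb <- estim_ncpMCA(test_fill[, -1], ncp.max = 5) # estimate the number of components
res <- imputeMCA(test_fill[, -1], ncp = nb$ncp)  # impute using that estimate
completed <- res$completeObs                     # the completed dataset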
I hope this can help.

Identify most variable rows within multiple subsets of a data.frame and merge this information into a final data.frame

I have a data.frame named data containing 18472 rows by 2229 columns. The last column of this data.frame (data$bin) contains a bin number from 1:7, although this may become dynamic later down the road. What I'd like to accomplish is to identify the 25 most variable rows for each bin and create a final data.frame from these. Ultimately this would result in a data.frame with 25*7 rows by 2228 columns. I am able to identify variable rows, but I'm not sure how to perform this across all bins within data:
> # identify variable rows
> library(genefilter)
> mostVarRows = head(order(rowVars(data), decreasing=TRUE), 25)
Data looks something like this:
> head(data[(ncol(data)-3):ncol(data)])
D6_NoSort_6000x3b_CCCCCGCCCTGA D6_NoSort_2250b_ATTATACTATTT D6_EcadSort_6000x3b_CACGACCTCCAC bin
0610005C13RIK 0 0 0 2
0610007P14RIK 0 0 0 6
0610009B22RIK 0 0 0 3
0610009L18RIK 0 0 0 2
0610009O20RIK 0 0 0 3
0610010B08RIK 0 0 0 6
I need to extract the most variable rows from each bin into a separate data.frame!
Below I create a mock data set. For future reference, the burden is on you to do this since you know what you want better than I do.
# create mock data
set.seed(1)
data<-replicate(1000,rnorm(500,500,100))
data<-data.frame(data,bins= sample(c(1:7),500,replace=TRUE)) # create bins column
Next I find the variance of each row (assuming this is how you want to define "most variable"). Then I sort by bin and variance (greatest to lowest).
data$var_by_row<-apply(data[,1:1000],1,var) # find variance of each row
data<-data[order(data$bins, -data$var_by_row),] # sort by bin and variance
Since the data is sorted properly, it remains to take the first 25 observations of each bin and stack them together. You were definitely on the right track with your use of order() and head(). The do.call() step afterwards is necessary to stack the head() results and is probably what you're looking for.
data_sub_list<-by(data,INDICES = data$bins, head,n=25) # grab the first 25 observations of each bin
data_sub<-do.call('rbind',data_sub_list) # the above returns a list of 7 data frames...one per bin. this stacks them
> table(data_sub$bins) # each bin appears 25 times.
1 2 3 4 5 6 7
25 25 25 25 25 25 25
> nrow(data_sub) # number of rows is 25*7
[1] 175
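If you prefer to stay closer to the rowVars() idea in your question, here is a hedged sketch of the same result via split(), reusing the mock data above (genefilter assumed installed; top25 is a made-up name):
library(genefilter)
data$var_by_row <- rowVars(as.matrix(data[, 1:1000])) # per-row variance
top25 <- do.call(rbind, lapply(split(data, data$bins), # split by bin...
                               function(d) head(d[order(-d$var_by_row), ], 25))) # ...and take each bin's 25 most variable rows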

Calculate mean value of subsets and store them in a vector for further analysis

Hullo, I've been working on a dataset for a while now, but I'm kind of stuck. One question/answer here was already helpful, but I need to calculate the mean not for a single column, but for sixty.
My dataset is basically this:
> data[c(1:5, 111:116), c(1:6, 85:87)]
plotcode block plot subsample year month Alo.pra Ant.odo Arr.ela
91 B1A01 B1 A01 1 2003 May 0 9 0
92 B1A02 B1 A02 1 2003 May 38 0 0
93 B1A03 B1 A03 1 2003 May 0 0 0
94 B1A04 B1 A04 1 2003 May 0 0 0
95 B1A05 B1 A05 1 2003 May 0 0 0
214 B2A16 B2 A16 2 2003 May 0 0 0
215 B2A17 B2 A17 2 2003 May 0 0 0
216 B2A18 B2 A18 2 2003 May 486 0 0
217 B2A19 B2 A19 2 2003 May 0 0 0
218 B2A20 B2 A20 2 2003 May 0 0 0
219 B2A21 B2 A21 2 2003 May 0 0 0
The first few columns are general data about the data point. Each plot has had up to 4 subsamples. The columns 85:144 are the data I want to calculate the means of.
I used this command:
tapply(data2003[,85] , as.factor(data2003$plotcode), mean, na.rm=T)
But like I said, I need to calculate the mean sixty times, for columns 85:144.
My idea was to use a for-loop.
for (i in 85:144)
{
temp <- tapply(data2003[,i], data2003$plotcode, mean, na.rm=T)
mean.mass.2003 <- rbind(mean.mass.2003, temp)
}
But that doesn't work. I get multiple error messages, "number of columns of result is not a multiple of vector length (arg 2)".
What I basically want is a table in which the columns represent the species, the rows represent the plotcodes, and the entries are the respective means.
I figured and fiddled, and with some help arrived at something that works the way I wanted. I know it's a somewhat convoluted approach, but I've only just started with R, and I like to understand what I code:
data.plots <- matrix(NA, 88, 60) ## a new, empty matrix we'll fill with the loop
for (i in 85:144) # these numbers because that's where our relevant data is
{
  # tapply calculates the mean of the i-th column of data2003 for every group of
  # rows sharing the same plotcode, ignoring NAs. temp is a named vector of means.
  temp <- tapply(data2003[, i], data2003$plotcode, mean, na.rm = TRUE)
  data.plots[, i - 84] <- as.numeric(temp) # shunts each vector of means into the next column of data.plots
}
colnames(data.plots) <- colnames(data[85:144])
rownames(data.plots) <- as.data.frame(table(data$plotcode))[, 1] # table() acts as a count() here: the first column holds the unique plotcodes found
This works. tapply shunts the mean biomass per species into a temporary named vector as it is calculated for every unique entry in data2003$plotcode, and the loop then fills the columns of the target matrix data.plots one by one.
After naming the rows and columns of data.plots I can work with it without always having to remember each name.
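For reference, the same table can be produced in a single call with base R's aggregate(); this is a sketch, assuming data2003 is the 2003 subset used above:
mean.mass.2003 <- aggregate(data2003[, 85:144],
                            by = list(plotcode = data2003$plotcode),
                            FUN = mean, na.rm = TRUE)
This returns a data.frame with one row per plotcode and one column per species, with the plotcode in the first column rather than in the rownames.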

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
Date Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol 1 3 1 0 13
2 23Ap cital 1 5 3 1 6
3 23Ap gerol 0 3 0 0 9
4 23Ap mix 0 5 0 0 8
5 23Ap cital 0 5 1 0 13
6 23Ap cella 0 5 0 1 4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
Date Trt Treated Control Dead DeadinC AliveinC
4 23Ap mix 0 5 0 0 8
8 23Ap mix 0 5 1 0 8
10 23Ap mix 0 2 3 0 5
20 23Ap mix 0 0 0 0 18
25 23Ap mix 0 2 1 0 15
28 23Ap mix 0 1 0 0 12
So for G.test(x) to work when x is a matrix, the matrix must be constructed as 2 columns containing numbers, with 1 row per population. If I use the apply() function I can run the G.test on each row, provided my data set contains only two columns of numbers. I want to look only at the treated and control columns, for example, but I'm not sure how to omit columns so that G.test ignores the headers and the other columns. I've tried the following but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
G.test outputs this when you compare two numbers:
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question is: is there some way to add up all the df and G values that I get for each row (once I'm able to get all these numbers) for each treatment? Is there also some way to have R report the G, df, and p-values in a table to be summed, rather than printing them as above for each row?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as its first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert with as.matrix(...).
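As for your second question (reporting G, df, and p-values per treatment in one table), here is a sketch under the assumption that G.test() returns a standard htest object, as RVAideMemoire's does (res is a made-up name):
res <- by(bio, bio$Trt, function(x) {
  g <- G.test(as.matrix(x[, 3:4]))
  data.frame(G = unname(g$statistic), df = unname(g$parameter), p.value = unname(g$p.value))
})
do.call(rbind, res) # one row per treatment; sum the G and df columns with colSums() if you need pooled values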

Combining matrix of daily rows into weekly rows

I have a matrix with dates as row names and TAG#'s as column names. The matrix is populated with 0's and 1's for presence/absence.
eg
29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0
I have the following script for calculating site fidelity (% days present):
##Presence/absence data setup
##import file
read.csv('pn.csv')->'pn'
##strip out desired columns
pn[,c(5,7:9)]->pn
##create table of dates and tags
table(pn$Date,pn$Tag)->T
##convert to a matrix
as.matrix(T)->U
##convert to binary for presence/absence
1*(U>2)->U
##insert missing rows
library(micEcon)
insertRow(U,395,0)->U
rownames(U)[395]<-'2011-08-16'
insertRow(U,253,0)->U
rownames(U)[253]<-'2011-03-26'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-22'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-21'
##for presence/absence
##define i(tag or column)
1->i
##define place to store results
cbind(colnames(U),rep(NA,length(colnames(U))))->sfresult
##loop instructions
for(i in 1:ncol(U)){
##identify first detection day
grep(1,U[,i])[1]->tagrow
##count total days since first detection
nrow(U)-tagrow+1->days
##count days present
length(grep(1,U[,i]))->present
##calculate site fidelity
present/days->sfresult[i,2]
}
##change class of results column
as.numeric(sfresult[,2])->sfresult[,2]
##histogram
bins<-c(0,.3,.6,1)
xlab<-c('Low','Med','High')
hist(as.numeric(sfresult[,2]), breaks=bins,xaxt='n', col=heat.colors(3), xlab='Percent Days Present',ylab='Frequency (# of individuals)',main='Site Fidelity',freq=TRUE,labels=xlab)
axis(1,at=bins)
I'd like to calculate site fidelity on a weekly basis. I believe it would be easiest to collapse the matrix by combining every seven rows into a weekly matrix that sums the 0's and 1's from the daily matrix. Then the same site fidelity script would work on a weekly basis. The problem is that I'm a newbie, and I've had trouble finding an answer on how to collapse the daily matrix into a weekly one. Thanks for any suggestions.
Something like this should work:
x <- matrix(rbinom(1000,1,.2), nrow=50, ncol=20)
rownames(x) <- 1:50
colnames(x) <- paste0("id", 1:20)
require(data.table)
xdt <- as.data.table(x)
##assuming rows are sorted by date, that there are no missing days, and that the first row is the start of the week
###xdt[, week:=sort(rep(1:7, length.out=nrow(xdt)))] ##wrong
xdt[, week:=rep(1:ceiling(nrow(xdt)/7), each=7, length.out=nrow(xdt))] ##fixed (length.out trims the week index to the number of rows)
xdt[, lapply(.SD,sum), by="week",.SDcols=setdiff(names(xdt),"week")]
I can help you better preserve rownames if you provide a reproducible example (see: How to make a great R reproducible example?).
Edit:
Also, it's very atypical to use the right assignment -> as you do above.
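A base-R alternative sketch that keeps the data as a matrix throughout, reusing x and the week-index construction from above (week and weekly are made-up names):
week <- rep(1:ceiling(nrow(x)/7), each = 7, length.out = nrow(x))
weekly <- rowsum(x, group = week) # sums the daily 0/1 entries within each week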
R's cut function will trim Dates to their week (see ?cut.Date for more details). After that, it's a simple call to aggregate to get the result you need. Note that cut.Date takes a start.on.monday option.
Data
sites <- read.table(text="29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0",
header=TRUE, check.names=FALSE, row.names=1)
Answer
weeks.factor <- cut(as.Date(row.names(sites)),
breaks='weeks', start.on.monday=FALSE)
aggregate(sites, by=list(weeks.factor), FUN=function(col) sum(col)/length(col))
# Group.1 29735 29736 29737 29738 29739 29740
# 1 2010-07-11 1 0.6666667 0 0 0 0
# 2 2010-07-18 1 1.0000000 0 0 0 0
