How to add strata information to a genind

I have created a genind object from a table containing SNP information.
I need to insert population information into this genind.
I know which individuals (identified by numbers) should go into each population.
How do I pick the correct individuals and place them into separate populations?

It's always helpful to make a reproducible example when asking a question.
First, load the necessary library (pretty sure it's adegenet)
library(adegenet)
Making some fake data by first getting a vector of alleles
alleles <- paste0("0",1:4)
Setting number of loci, individuals per population, and the number of populations
nloci <- 10
nind <- 10
npops <- 2
Using a for loop to make the fake dataset
out <- NULL
for(i in 1:npops){
  # there are nind*nloci genotypes in each population
  # make a vector of random genotypes
  gts <- replicate(n = nind*nloci,
                   expr = paste0(sample(x = alleles, size = 1, replace = T),
                                 sample(x = alleles, size = 1, replace = T)))
  gts <- as.data.frame(matrix(data = gts,
                              nrow = nind, ncol = nloci, byrow = T))
  # making generic locus colnames
  colnames(gts) <- paste0("locus_", 1:nloci)
  out <- rbind(out, gts)
} # end of for loop
head(out)
Now converting that data.frame into a genind
obj <- df2genind(out, ploidy=2, ncode=2)
obj
Note that the row.names() are used as the individual IDs
Now for setting the populations; note that it's empty right now
obj@pop
You just need a vector that represents the populations corresponding to each individual.
Option 1
If your individual IDs are clustered by population (e.g. 1-10 are from pop1 and 11-20 are from pop2), then something like this should work
pops <- paste0("pop", 1:npops)
Set the populations using that vector; make sure it's a factor
obj@pop <- as.factor(rep(pops, each = nind))
obj@pop
Option 2
If the original data.frame (table) that contained your SNP information also contained population information, you could use that as your vector
e.g. If out looked like this
out$pops <- sample(x = pops,size = nrow(out),replace = T)
head(out)
Then you could use that column as your vector
obj@pop <- as.factor(out$pops)
obj@pop
Option 3
Alternatively, if you had another table that let you identify which individuals correspond to which population, you could use that information. This assumes that the second table (data.frame) has the same number of rows as out
Here is an example second table
df <- data.frame(pops = rep(pops, each = nind),
                 id = sample(x = 1:nrow(out), size = nrow(out), replace = F))
head(df)
Note that the IDs in df are not in order, but they were in out (and therefore are in obj), so df needs to be ordered by df$id
df <- df[order(df$id),]
head(df)
Once they are in the correct order
obj@pop <- as.factor(df$pops)
obj@pop
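Finally, since the question title mentions strata: adegenet can also store this information as strata, from which the active population is then set. A minimal sketch, assuming the ordered df from Option 3:
# attach a one-column strata data.frame (one row per individual, in obj's order)
strata(obj) <- data.frame(pop = df$pops)
# make that stratum the active population
setPop(obj) <- ~pop
obj@pop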


Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it does need to be a type that split() can coerce to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
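Note that %% interleaves rows across the groups. The question says order doesn't matter, but if you'd rather have contiguous blocks of rows, a small variation on the same idea (a sketch, reusing dat and N from above) is:
# contiguous blocks of roughly equal size instead of interleaved rows
dat$group <- ceiling(seq_len(nrow(dat)) / ceiling(nrow(dat) / N))
dat_list <- split(dat, dat$group)
names(dat_list) <- paste0("newdf_", seq_along(dat_list))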
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
  A = sample(letters, size = 70000000, replace = TRUE),
  B = rpois(70000000, lambda = 1)
)
Here's a tidyverse-based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
                                 DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
                                 chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
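For instance, a minimal sketch of such a wrapper (the function name is hypothetical, and it assumes the test.csv written above):
# hypothetical wrapper: read only the rows whose string column equals `letter`
read_string_subset <- function(letter, file = "test.csv", chunk_size = 1000) {
  read_csv_chunked(file,
                   DataFrameCallback$new(function(x, pos) filter(x, string == letter)),
                   chunk_size = chunk_size)
}
partial_data <- read_string_subset("b")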
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

Extract p values and r values for all pairwise variables

I have multiple variables for multiple countries over multiple years. I would like to generate a dataframe containing both an R^2 value and a P value for each pair of variables. I'm somewhat close, have a minimum working example and an idea of what the end product should look like, but am having some difficulties actually implementing it. If anyone could help, that would be most appreciated.
Please note, I would like to do this more manually than by using packages like Hmisc, as that has created a number of other issues. I've had a look around for similar solutions as well, but haven't had much luck.
# Code to generate minimum working example (country year pairs).
library(tidyindexR)
library(tidyverse)
library(dplyr)
library(reshape2)
# Function to generate minimum working example data
simulateCountryData = function(N=200, NEACH = 20, SEED=100){
  variableOne <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableOne[variableOne < 0] <- 0
  variableTwo <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableTwo[variableTwo < 0] <- 0
  variableThree <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableThree[variableThree < 0] <- 0
  geocodeNum <- factor(rep(seq(1, N/NEACH), each = NEACH))
  year <- rep(seq(2000, 2000 + NEACH - 1, 1), N/NEACH)
  # Putting it all together
  AllData <- data.frame(geocodeNum,
                        year,
                        variableOne,
                        variableTwo,
                        variableThree)
  return(AllData)
}
# This runs the function and generates the data
mySimData = simulateCountryData()
I have a reasonable idea of how to get correlations (both p values and r values) between 2 manually selected variables, but am having some trouble implementing it on the entire dataset and on a country level (rather than all at once).
# Example p value
corrP = cor.test(mySimData$variableOne, mySimData$variableTwo)$p.value
# Example r value
corrEst = cor(mySimData$variableOne, mySimData$variableTwo)
Finally, the end result should look something like this:
myVariables = colnames(mySimData)[3:ncol(mySimData)]
myMatrix = expand.grid(myVariables, myVariables)
# I'm having trouble actually trying to get the r values and p values into the dataframe
myMatrix = as.data.frame(myMatrix)
myMatrix$Pval = runif(9, 0.01, 1)
myMatrix$Rval = runif(9, 0.2, 1)
myMatrix
Thanks again :)
This will compute r and p for all the unique pairs.
# matrix of unique pairs coded as numeric
mx_combos <- combn(1:length(myVariables), 2)
# list of unique pairs coded as numeric
ls_combos <- split(mx_combos, rep(1:ncol(mx_combos), each = nrow(mx_combos)))
# for each pair in the list, create a 1 x 4 dataframe
ls_rows <- lapply(ls_combos, function(p) {
  # look up names of variables
  v1 <- myVariables[p[1]]
  v2 <- myVariables[p[2]]
  # perform the cor.test()
  htest <- cor.test(mySimData[[v1]], mySimData[[v2]])
  # record pertinent info in a dataframe
  data.frame(Var1 = v1,
             Var2 = v2,
             Pval = htest$p.value,
             Rval = unname(htest$estimate))
})
# row bind the list of dataframes
dplyr::bind_rows(ls_rows)
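The question also asks for results at the country level; one way to extend the same idea is to repeat the pairwise tests within each country (a sketch reusing myVariables, ls_combos, and mySimData from above):
# split the data by country and run the pairwise tests within each group
by_country <- lapply(split(mySimData, mySimData$geocodeNum), function(d) {
  rows <- lapply(ls_combos, function(p) {
    v1 <- myVariables[p[1]]
    v2 <- myVariables[p[2]]
    htest <- cor.test(d[[v1]], d[[v2]])
    data.frame(geocodeNum = d$geocodeNum[1],
               Var1 = v1, Var2 = v2,
               Pval = htest$p.value,
               Rval = unname(htest$estimate))
  })
  dplyr::bind_rows(rows)
})
country_results <- dplyr::bind_rows(by_country)
head(country_results)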

How to count number of occurrences of data combinations and save in a matrix in R?

I have a problem in which I try to create a matrix with the number of occurrences of specific 'coordinates'. I am working in R.
To illustrate, this is (part of) my data:
pre = c(3,1,3,2,2,4,3,5,3,4,6,5,6,5,4,5,6,6,5,6,5,7,6,7,7,7,4,8,4,8,8,4,4,8,3,9,8,6,9,8)
post = c(4,3,5,3,4,6,5,6,5,4,5,6,6,5,6,5,7,6,7,7,7,4,8,4,8,8,4,4,8,3,9,8,6,9,8,8,9,7,9,9)
df = data.frame(pre,post)
I then define this output matrix with the possible coordinate dimensions (range 1-20 across all the data):
matrix = matrix(NA, nrow=20, ncol=20)
colnames(matrix) = seq(1,20,1)
rownames(matrix) = seq(1,20,1)
I then need a loop to run through my data and to store how many of the specific pre-post combinations exist within the data:
for (i in 1:40){matrix[df$post[i], df$pre[i]] = 1}
This works in that the output now shows which 'coordinates' occurred in the data, but it doesn't say how many times.
For example, I know that pre=4, post=4 occurred twice.
Thus the loop needs to remember that a combination already occurred and add 1 to the count, but I don't know how to program this.
I hope somebody can be of help,
Anne
You could initialize the matrix with zeros instead of NA and then increment the matrix value like this:
pre = c(3,1,3,2,2,4,3,5,3,4,6,5,6,5,4,5,6,6,5,6,5,7,6,7,7,7,4,8,4,8,8,4,4,8,3,9,8,6,9,8)
post = c(4,3,5,3,4,6,5,6,5,4,5,6,6,5,6,5,7,6,7,7,7,4,8,4,8,8,4,4,8,3,9,8,6,9,8,8,9,7,9,9)
df = data.frame(pre,post)
matrix = matrix(0, nrow=20, ncol=20)
colnames(matrix) = seq(1,20,1)
rownames(matrix) = seq(1,20,1)
for (i in 1:40){
  matrix[df$post[i], df$pre[i]] = matrix[df$post[i], df$pre[i]] + 1
}
By the way, setting the matrix colnames and rownames is not needed unless you want them for other reasons.
We can use table to do this. Convert the 'pre' and 'post' columns to factor with levels specified as 1 to 20, and then use table (note this puts pre in the rows and post in the columns)
table(factor(df$pre, levels = 1:20), factor(df$post, levels = 1:20))
If we are using the already created 'matrix', an option is
out <- as.data.frame(table(df))
# 'out' holds (pre, post, Freq), while the matrix is indexed [post, pre],
# so the index columns are reversed before assigning
matrix[as.matrix(out[2:1])] <- out$Freq
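As a quick check against the question's example (pre=4, post=4 occurs twice), either approach above should agree with a direct count:
# both should report 2 for the pre=4, post=4 combination
matrix["4", "4"]
sum(df$pre == 4 & df$post == 4)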

Largest set of columns having at least k shared rows

I have a large data frame (50k by 5k). I would like to make a smaller data frame using the following rule. For a given k, 0 < k < n, I would like to select the largest set of columns such that k rows have non-NA values in all of these columns.
This seems like it might be too hard for a computer to do on a big data frame, but I'm hoping it is possible. I have written code for this operation.
It seems my way of doing this is too complex. It relies on (1) computing a list of all possible subsets of the set of columns, and then (2) checking how many shared rows they have. For even small numbers of columns (1) gets too slow (e.g. 45 seconds for 25 columns).
Question: Is it theoretically possible to get the largest set of columns sharing at least k non-NA rows? If so, what is a more realistic approach?
#alexis_laz's elegant answer to a similar question takes an inverse approach to mine, examining all (fixed-size) subsets of the observations/samples/draws/units and checking which variables are present in them.
Taking combinations of n observations is difficult for large n. For example, length(combn(1:500, 3, simplify = FALSE)) yields 20,708,500 combinations for 500 observations, and my computer fails to produce the combinations for sizes greater than 3. This makes me worry that neither approach will be feasible for large n and p.
I have included an example matrix for reproducibility.
require(dplyr)
# generate example matrix
set.seed(123)
n = 100
p = 25
missing = 25
mat = rnorm(n * p)
mat[sample(1:(n*p), missing)] = NA
mat = matrix(mat, nrow = n, ncol = p)
colnames(mat) = 1:p
# matrix reporting whether a value is na
hasVal = 1-is.na(mat)
system.time(
  # collect all possible subsets of the columns' indices
  nameSubsets <<- unlist(lapply(1:ncol(mat), combn, x = colnames(mat), simplify = FALSE),
                         recursive = FALSE,
                         use.names = FALSE)
)
# how many observations have all of the subset variables
countObsWithVars = function(varsVec){
  selectedCols = as.matrix(hasVal[, varsVec])
  countInRow = apply(selectedCols, 1, sum) # for each row, number of non-NA values
  numMatching = sum(countInRow == length(varsVec)) # rows that have all selected columns
}
system.time(
  numObsWithVars <<- unlist(lapply(nameSubsets, countObsWithVars))
)
# collect results into a data.frame
df = data.frame(subSetNum = 1:length(numObsWithVars),
                numObsWithVars = numObsWithVars,
                numVarsInSubset = unlist(lapply(nameSubsets, length)),
                varsInSubset = I(nameSubsets))
# find the largest set of columns for each number of rows
maxdf = df %>% group_by(numObsWithVars) %>%
  filter(numVarsInSubset == max(numVarsInSubset)) %>%
  arrange(numObsWithVars)
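As a more realistic starting point than full enumeration, one common heuristic (not guaranteed to find the optimum) is greedy elimination: repeatedly drop the column with the most NAs among the still-incomplete rows until at least k complete rows remain. A sketch, with greedy_cols a hypothetical helper operating on the mat from above (and assuming k is no larger than the number of rows):
# greedy heuristic: drop the "worst" column until at least k rows are complete
greedy_cols <- function(mat, k) {
  keep <- colnames(mat)
  while (length(keep) > 1) {
    sub <- mat[, keep, drop = FALSE]
    ok <- complete.cases(sub)
    if (sum(ok) >= k) break
    # among rows that are not yet complete, count NAs per remaining column
    na_counts <- colSums(is.na(sub[!ok, , drop = FALSE]))
    keep <- setdiff(keep, names(which.max(na_counts)))
  }
  keep
}
greedy_cols(mat, k = 95)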

Calculate statistics (e.g. average) across cells of identical data-frames

I have a list of identically sorted dataframes. More specifically, these are the imputed dataframes I get after doing multiple imputation with the AmeliaII package. Now I want to create a new dataframe that is identical in structure, but contains the mean values of the cells calculated across the dataframes.
The way I achieve this at the moment is the following:
## do the Amelia run ------------------------------------------------------------
a.out <- amelia(merged, m=5, ts="Year", cs="GEO", polytime=1)
## Calculate the output statistics ----------------------------------------------
left.side <- a.out$imputations[[1]][,1:2]
a.out.ncol <- ncol(a.out$imputations[[1]])
a <- a.out$imputations[[1]][,3:a.out.ncol]
b <- a.out$imputations[[2]][,3:a.out.ncol]
c <- a.out$imputations[[3]][,3:a.out.ncol]
d <- a.out$imputations[[4]][,3:a.out.ncol]
e <- a.out$imputations[[5]][,3:a.out.ncol]
# Calculate the mean of the matrices (with m=5 there are five imputations, a to e)
library(abind)
mean.right <- apply(abind(a,b,c,d,e,along=3), c(1,2), mean)
# recombine factors with values
mean <- cbind(left.side, mean.right)
I suppose there is a much better way of doing this by using apply, plyr or the like, but as an R newbie I am really a bit lost here. Do you have any suggestions on how to go about this?
Here's an alternate approach using Reduce and plyr::llply
dfr1 <- data.frame(a = c(1,2.5,3), b = c(9.0,9,9), c = letters[1:3])
dfr2 <- data.frame(a = c(5,2,5), b = c(6,5,4), c = letters[1:3])
tst = list(dfr1, dfr2)
require(plyr)
tst2 = llply(tst, function(df) df[,sapply(df, is.numeric)]) # strip out non-numeric cols
ans = Reduce("+", tst2)/length(tst2)
EDIT. You can simplify your code considerably and accomplish what you want in 5 lines of R code. Here is an example using the Amelia package.
library(Amelia)
data(africa)
# carry out imputations
a.out = amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
# extract numeric columns from each element of a.out$imputations
tst2 = llply(a.out$imputations, function(df) df[,sapply(df, is.numeric)])
# sum them up and divide by length to get mean
mean.right = Reduce("+", tst2)/length(tst2)
# compute fixed columns and cbind with mean.right
left.side = a.out$imputations[[1]][1:2]
mean0 = cbind(left.side,mean.right)
If I understand your question correctly, then this should get you a long way:
#set up some data:
dfr1<-data.frame(a=c(1,2.5,3), b=c(9.0,9,9))
dfr2<-data.frame(a=c(5,2,5), b=c(6,5,4))
tst<-list(dfr1, dfr2)
#since all variables are numerical, use a three-dimensional array
tst2<-array(do.call(c, lapply(tst, unlist)), dim=c(nrow(tst[[1]]), ncol(tst[[1]]), length(tst)))
#To see where you're at:
tst2
#rowMeans for a three-dimensional array with dims=2 takes the mean over the last dimension
result<-data.frame(rowMeans(tst2, dims=2))
rownames(result)<-rownames(tst[[1]])
colnames(result)<-colnames(tst[[1]])
#display the full result
result
HTH.
After many attempts, I've found a reasonably fast way to calculate cells' means across multiple data frames.
# First create an empty data frame for storing the average imputed values. This
# data frame will have the same dimensions as the original one
imp.df <- df
# Then create an array with the first two dimensions of the original data frame
# and the third dimension given by the number of imputations
a <- array(NA, dim=c(nrow(imp.df), ncol(imp.df), length(a.out$imputations)))
# Then copy each imputation into each "slice" of the array
for (z in 1:length(a.out$imputations)) {
  a[,,z] <- as.matrix(a.out$imputations[[z]])
}
# Finally, for each cell, replace the actual value with the mean across all
# "slices" in the array
for (i in 1:dim(a)[1]) {
  for (j in 1:dim(a)[2]) {
    imp.df[i, j] <- mean(as.numeric(a[i, j,]))
  }
}
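If all the columns are numeric, the two nested loops can typically be collapsed into a single apply over the first two margins (a sketch, assuming the array a built above):
# mean across the third dimension (the imputations) for every cell at once
imp.df[] <- apply(a, c(1, 2), mean)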
