Insert random NAs in a vector in R as a Loop

So I have a data frame consisting of values of 0 and 1. I would like to make a loop that randomly samples 38 of those observations and replaces them with NA. I can successfully do one iteration, replacing observations in the original vector with the following one-line code:
foo$V2[sample(seq(foo$V2), 38)] <- NA
However, I would like to do this 20 times and have each iteration compiled as a separate column in a single object. Ultimately, I would have a 20-column data frame, each column holding 191 observations with 38 randomly substituted NAs. At the end of the loop, I would like the data frame to be written out as a text file. Thank you for any help in the right direction.
Data Set:
https://drive.google.com/file/d/0BxfytpfgCdAcdEQ2LWFuVWVqMVU/view?usp=sharing

Maybe something like this:
# Fake data
set.seed(1998)
foo = data.frame(V2=sample(0:1, 191, replace=TRUE))
# Create 20 samples where we replace 38 randomly chosen values with NA
n = 20
samples = as.data.frame(replicate(n, {
  x = foo$V2
  x[sample(seq_along(x), 38)] = NA
  x
}))
Then you can write it in whatever format you wish. For example:
write.table(samples, "samples.txt")
write.csv(samples, "samples.csv", row.names=FALSE)
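As a quick sanity check (just a sketch), the NAs are written as the string "NA" by default and parse straight back to NA on the round trip:
samples_in <- read.table("samples.txt")
sum(is.na(samples_in$V1))  # each column should contain exactly 38 NAs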

Related

Sum rows by conditions

I have 18 models and each model is replicated 12 times, so I have a CSV file with 216 rows. I want to sum the column "sta_tss" over every group of 12 rows (one model). But some models don't run successfully: for example, the first model should have 12 rows, but its last replicate failed, so for the first model I only need to sum rows 1 to 11 of "sta_tss". There are other cases like this too. Could you help me with this?
The code below just sums every fixed block of 12 rows, which is not what I need. I wrote it as a loop because I have 10 similar files to run.
scal <- list.files("./outputs2/combine_evalu_by_species")
for(i in 1:length(scal)){
  filenames <- paste0("./outputs2/combine_evalu_by_species/sta_tss", "_", i, ".csv")
  df <- read.csv(filenames, header = TRUE)
  num <- rowsum(df$sta_tss, rep(seq_len(length(df$sta_tss)/12), each = 12))[, 1]
  curfilenam <- paste0("./outputs2/combine_evalu_by_species/sum_species", i, ".csv")
  write.csv(num, curfilenam, row.names = FALSE)
}
The CSV file has two columns (id and sta_tss): id runs from 1 to 216, and some rows may be absent, as I wrote above.
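One way to handle the absent rows (a sketch, assuming the id column keeps its original 1-to-216 numbering even when replicate rows are missing): derive the model number from id instead of from the row position, so missing replicates no longer shift the groups.
scal <- list.files("./outputs2/combine_evalu_by_species")
for (i in seq_along(scal)) {
  filenames <- paste0("./outputs2/combine_evalu_by_species/sta_tss_", i, ".csv")
  df <- read.csv(filenames, header = TRUE)
  # ids 1-12 belong to model 1, ids 13-24 to model 2, and so on
  model <- ceiling(df$id / 12)
  # rowsum() sums sta_tss within each model, over however many rows exist
  num <- rowsum(df$sta_tss, model)[, 1]
  curfilenam <- paste0("./outputs2/combine_evalu_by_species/sum_species", i, ".csv")
  write.csv(num, curfilenam, row.names = FALSE)
}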

For Loop to fill a Column in R

I have a data frame with zero columns and zero rows, and I want a for loop to fill it with the numbers 1 to 39, each number repeated twice. The result I am looking for is a single column in which each number appears twice in a row.
Assume st is the data frame I have set already. This is what I have so far:
for(i in 1:39) {
  append(st,i)
  for(i in 1:39) {
    append(st,i)
  }
}
The expected outcome, as a single column:
1
1
2
2
3
3
...
39
39
You don't need a for loop here. Use rep() instead:
# How many times you want each number to repeat sequentially
times_repeat <- 2
# Assign the repeated values as a data frame
test_data <- as.data.frame(rep(1:39, each = times_repeat))
# Change the column name if you want to
names(test_data) <- "Dont_encourage_the_use_of_blanks_in_column_names"
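Note the each argument: it is what makes every value repeat back-to-back. For comparison (illustrative only):
rep(1:3, each = 2)   # 1 1 2 2 3 3 -- each value repeated consecutively
rep(1:3, times = 2)  # 1 2 3 1 2 3 -- the whole vector repeated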

Performing a Specific Function for One Column For The First 12 Rows?

This is easy, but for some reason I'm having trouble with it. I have a set of Data like this:
File Trait Temp Value Rep
PB Mortality 16 52.2 54
PB Mortality 17 21.9 91
PB Mortality 18 15.3 50
...
And it goes on like that for 36 rows. What I need to do is divide the Value column by 100 in only the first 12 rows. I did:
NewData <- Data[1:12,4]/100
to try to create a new data frame without changing the old data. When I do this it divides the fourth column, but it saves only that column (rows 1-12) by itself as a value in the Global Environment, not as data alongside the rest of the rows/columns from the original set. Ultimately, I'm trying to fit NewData in an nls function, so I need the modified rows kept together with the rest of the data, not stored as a separate value. Is there a way to modify the first 12 rows without R saving the result as a standalone value?
Consider copying the data frame and then updating the column at the selected rows:
NewData <- Data
NewData$Value[1:12] <- NewData$Value[1:12]/100
# NewData[1:12,4] <- NewData[1:12,4]/100   # ALTERNATE EQUIVALENT
Alternatively, with dplyr:
library(dplyr)
newdata <- Data[1:12,] %>% mutate(newV = Value/100)
newdata$Value = newdata$newV
newdata = newdata %>% select(-newV)
Then you can do:
full_data = rbind(newdata, Data[13:36,])
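A more compact variant (a sketch using dplyr's row_number(), so the data never has to be split and re-bound):
library(dplyr)
full_data <- Data %>%
  mutate(Value = ifelse(row_number() <= 12, Value / 100, Value))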

Counting NA values by ID?

I'm learning R from scratch right now and am trying to count the number of NAs within a given table, aggregated by the ID of the file each row came from. I then want to output that information in a new data frame, showing just the ID and the sum of the NA lines it contains. I've looked at some similar questions, but they all seem to deal with very short datasets, whereas mine is comparatively long (10k+ lines), so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...
Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# get file names and initialize a vector for the counts
fileNames <- list.files(<filePath>)
missRowsVec <- integer(length(fileNames))

# loop through files, counting the rows with missing values
for (filePos in 1:length(fileNames)) {
  # read in files  **fill in <filePath>**
  temp <- read.csv(paste0(<filePath>, fileNames[filePos]), as.is = TRUE)
  # count the number of rows with missing values,
  # **fill in <fieldName#> with strings of variable names**
  missRowsVec[filePos] <- sum(apply(temp[, c(<field1Name>, <field2Name>)],
                                    1, anyNA))
} # end loop

# build data frame
myDataFrame <- data.frame("fileNames" = fileNames, "missCount" = missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
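If all the rows already live in one table with an ID column, as in the question, here is a base-R sketch with no file loop at all (assuming the table is called Data):
# NAs per row, ignoring the ID column itself, then summed within each ID
na_by_row <- rowSums(is.na(Data[, setdiff(names(Data), "ID")]))
na_by_id  <- tapply(na_by_row, Data$ID, sum)
result <- data.frame(ID = names(na_by_id), NA_Count = as.integer(na_by_id))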

Apply function to every 20 rows between pairs of columns in a matrix

I have a set of genetic SNP data that looks like:
Founder1 Founder2 Founder3 Founder4 Founder5 Founder6 Founder7 Founder8 Sample1 Sample2 Sample3 Sample...
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
The size of the matrix is 56 columns by 46482 rows. I need to first bin the matrix into groups of 20 rows, then compare each of the first 8 columns (founders) to each of columns 9-56, and divide the number of matching letters/alleles by the total number of rows in the bin (20). Ultimately I need 48 matrices of 8 columns by 2342 rows, which are essentially similarity matrices. I have tried to extract each pair separately with something like:
"length(cbind(odd[,9],odd[,1])[cbind(odd[,9],cbind(odd[,9],odd[,1])[,1])[,1]=="T" & cbind(odd[,9],odd[,1])[,2]=="T",])/nrow(cbind(odd[,9],odd[,1]))"
but this is nowhere near efficient, and I do not know of a faster way of applying the function to every 20 rows and across multiple pairs.
In the example given above, if all 20 rows in a bin were identical like the ones shown, then the first row of the matrix for Sample1 (one entry per founder) would be:
1 1 1 0 0 0 0 0
I think this is what you want? It helps to break the problem down into smaller pieces and then repeatedly apply a function to those pieces. My solution takes a few minutes to run on my laptop, but I think it should give you or others a start. If you're looking for better speed, I'd recommend looking at the data.table package. I'm sure there are other ways to make the code below a little faster too.
# Make a data set of random sequences
rows = 46482
cols = 56
binsize = 20
founder.cols = 1:8
sample.cols = setdiff(1:cols, founder.cols)
data = as.data.frame( matrix( sample( c("A","C","T","G"),
                                      rows * cols, replace=TRUE ),
                              ncol=cols ) )

# Split the data into bins of binsize (20) rows
binlevels = gl(n=ceiling(rows/binsize), k=binsize, length=rows)
databins = split(data, binlevels)

# Functions for making a similarity matrix:
# compare_cols gives the fraction of rows where columns i and j match
compare_cols = function(i, j, mat) mean(mat[,i] == mat[,j])
compare_group_cols = function(mat, group1.cols, group2.cols) {
  outer( X=group1.cols, Y=group2.cols,
         Vectorize( function(X,Y) compare_cols(X,Y,mat) ) )
}
# Apply the function to each bin
mats = lapply( databins, compare_group_cols, sample.cols, founder.cols )
# And just to check. Random sequences should match 25% of the time. Right?
hist( vapply(mats,mean,1), n=30 ) # looks like this is the case
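If you need the exact shape the question describes (one 8-column matrix per sample, one row per 20-row bin), the per-bin matrices can be reshaped; a sketch building on the objects above:
# One matrix per sample: rows = bins, columns = the 8 founders
per_sample = lapply( seq_along(sample.cols), function(s)
  t( vapply(mats, function(m) m[s, ], numeric(length(founder.cols))) ) )
length(per_sample)   # 48 matrices
dim(per_sample[[1]]) # ceiling(rows/binsize) rows by 8 columns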
