Set values less than threshold to zero, with column-specific thresholds - r

I have two data frames. One of them contains 165 columns (species names) and almost 193,000 rows; each cell holds a number from 0 to 1 giving the probability that the species is present in that cell.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
2 0.0279037 0.604687 0.0388309 0.0161980 0.0143966 0.240152
3 0.0294101 0.674846 0.0673055 0.0481405 0.0397423 0.231308
4 0.0292839 0.603869 0.0597947 0.0526606 0.0463431 0.188875
6 0.0331264 0.541165 0.0470451 0.0270871 0.0373348 0.256662
8 0.0393825 0.672371 0.0715808 0.0559353 0.0565391 0.230833
9 0.0376557 0.663732 0.0747417 0.0445794 0.0602539 0.229265
The second data frame contains 164 columns (species names, as in the first data frame) and a single row holding the threshold for each species: above it we assume the species is present, and below it we assume the species is absent.
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
0.3155 0.2816 0.2579 0.2074 0.3007 0.3513 0.3514
What I want is a new data frame that, for every species in the presence-probability data (my.data), keeps the probability where it is above the threshold (thres) and contains zero where it is below the threshold.
I know this would take a for loop and an if statement, but I am new to R and I don't know how to do this.
Please help me.

I think you want something like this:
(Make up small reproducible example)
set.seed(101)
speciesdat <- data.frame(pointID=1:10,matrix(runif(100),ncol=10,
dimnames=list(NULL,LETTERS[1:10])))
threshdat <- rbind(seq(0.1,1,by=0.1))
Now process:
thresh <- unlist(threshdat) ## make data frame into a vector
## 'sweep' runs the function column-by-column if MARGIN=2
ss2 <- sweep(as.matrix(speciesdat[,-1]),MARGIN=2,STATS=thresh,
FUN=function(x,y) ifelse(x<y,0,x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID,ss2)
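To see what the sweep call does, here is a tiny hand-checkable sketch. The matrix `m` and the vector `thr` are made-up values, not the asker's data.

```r
# Hypothetical toy data: 3 rows, 2 "species" columns
m <- matrix(c(0.10, 0.50, 0.90,    # column A
              0.20, 0.40, 0.60),   # column B
            ncol = 2, dimnames = list(NULL, c("A", "B")))
thr <- c(A = 0.3, B = 0.5)         # one threshold per column

# Zero out every value that is below its own column's threshold
out <- sweep(m, MARGIN = 2, STATS = thr,
             FUN = function(x, y) ifelse(x < y, 0, x))
out
# Column A keeps 0.5 and 0.9; column B keeps only 0.6; the rest become 0.
```

With MARGIN=2 each threshold is paired with one column, so the order of `thr` has to match the column order of the matrix.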

It's simpler to have the same number of columns (with the same meanings of course).
frame2 = data.frame(POINTID=0, frame2)
R works with vectors, so a row of frame1 can be directly compared to frame2
frame1[1,] < frame2
You could use an explicit loop over every row of frame1, but it's common to use the implicit loop of "apply"
answer = apply(frame1, 1, function(x) x < frame2)
This is all a rather sloppy solution (especially changing frame2), but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices when possible (they can still use labels but are generally faster).

The sweep call below produces a logical matrix which can be used to generate assignments with "[<-" (assuming the name of the multi-row data frame is "cols" and the named vector is "vec"):
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch of the number of columns with the length of the vector, but presumably you can adjust the length of the vector to be the correct number of entries.
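One way to adjust the vector is to match thresholds to columns by species name rather than by position. This is only a sketch: `my.data` and `thres` are the names used in the question, but the values below are a made-up subset, and it assumes the species names are spelled identically in both data frames.

```r
# Hypothetical stand-ins for the asker's `my.data` and `thres`
my.data <- data.frame(POINTID   = c(2, 3),
                      Abie_Xbor = c(0.03, 0.40),
                      Acer_Camp = c(0.60, 0.10))
thres <- data.frame(Abie_Xbor = 0.3155, Acer_Camp = 0.2816,
                    Acta_Spic = 0.3514)   # extra species with no match

# Keep only species present in both, in my.data's column order
common <- intersect(names(my.data)[-1], names(thres))
vec <- unlist(thres[1, common])           # named numeric vector

result <- my.data[c("POINTID", common)]
below <- sweep(as.matrix(result[common]), 2, vec, "<")
result[common][below] <- 0                # zero out values under the threshold
```

Species that appear in only one of the two data frames are silently dropped here, so it may be worth checking `setdiff(names(thres), names(my.data))` before relying on the result.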

Related

R: Separating several observations of a variable and building a matrix

I have a multiple-response-variable with seven possible observations: "Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker".
If a respondent chose more than one observation, however, the answers are not separated in the data (Data).
My goal is to create a matrix with all possible observations as variables and marked with 1 (yes) and 0 (No). Currently I am using this command:
einzeln_strategisch_2021 <- data.frame(strategisch_2021[, ! colnames (strategisch_2021) %in% "Q12"], model.matrix(~ Q12 - 1, strategisch_2021)) %>%
This gives me the matrix I want but it does not separate the observations, so now I have a matrix with 20 variables instead of the seven (variables).
I also tried separate() like this:
separate(Q12, into = c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker"), ";") %>%
This does separate the observations, but not in the right order and without the matrix.
How do I separate my observations and create a matrix with the possible observations as variables akin to the third picture (Matrix)?
Thank you very much in advance ;)
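A possible base R approach, as a sketch only: split each answer string and test it against the seven known options. The real `strategisch_2021` is not shown, so the data frame below is a made-up stand-in, and the ";" separator is an assumption taken from the separate() call above. `options_all`, `parts`, `dummies` and `einzeln` are hypothetical names.

```r
# Hypothetical stand-in for the asker's data
strategisch_2021 <- data.frame(
  id  = 1:3,
  Q12 = c("Inhalt;Arbeit", "Spitzenpolitiker", "Arbeit;Verhindern Koalition"),
  stringsAsFactors = FALSE
)
options_all <- c("Inhalt", "Arbeit", "Verhindern Koalition",
                 "Ermöglichen Koalition", "Verhindern Kanzlerschaft",
                 "Ermöglichen Kanzlerschaft", "Spitzenpolitiker")

# Split each answer string on ";" (assumed separator)
parts <- strsplit(strategisch_2021$Q12, ";", fixed = TRUE)

# One row per respondent, one 0/1 column per possible option
dummies <- t(sapply(parts, function(x) as.integer(options_all %in% trimws(x))))
colnames(dummies) <- options_all

einzeln <- cbind(strategisch_2021["id"], dummies)
```

Because the columns come from `options_all` rather than from the data, every option gets a column in a fixed order, even if nobody chose it.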

Apply/lapply function to all columns in matrix

I have a matrix called seq$num, consisting of 100 columns and 30k rows. Each column corresponds to the name of a specific sample (e.g. CAGTCA), and every row holds a numeric value. With this type of object, I can access the first row by writing seq$num[[1]] and so on for the other rows. This is a brief example of my data:
CAGTCA AGATCA GCTCGA GCTCGA
-0.4930 -2.0330 0.7100 0.1560
1.0030 0.0120 -1.0433 0.6701
0.0013 1.0013 1.2451 -1.3421
I would like to loop through all the samples using the lapply function and for each sample classify:
the numbers above 1.5 as "high".
the numbers below 0 as "low".
the numbers between 0 and 1.5 as "medium".
I also need to count how many high, low and medium numbers each sample has.
How can this be done? I've tried applying the lapply function, but I don't get the output I want.
You can write a function which divides data into categories.
classify <- function(x) ifelse(x >= 1.5, 'high', ifelse(x < 0, 'low', 'medium'))
For each dataframe in seq$num apply the classify function to it and use table to count.
res <- lapply(seq$num, function(x) table(classify(as.matrix(x))))
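An alternative sketch uses cut(), which returns a factor with fixed levels, so a category with no values still shows up with a count of 0. The small matrix `num` is made up, and the boundaries follow the classify function above (1.5 itself counts as "high", 0 itself as "medium").

```r
# Hypothetical 3 x 2 matrix standing in for the asker's data
num <- matrix(c(-0.5, 1.0, 2.0,    # sample CAGTCA
                 0.2, 0.7, 1.5),   # sample AGATCA
              nrow = 3, dimnames = list(NULL, c("CAGTCA", "AGATCA")))

classify <- function(x) {
  # right = FALSE gives the intervals [-Inf, 0), [0, 1.5), [1.5, Inf)
  cut(x, breaks = c(-Inf, 0, 1.5, Inf),
      labels = c("low", "medium", "high"), right = FALSE)
}

# One column of counts per sample, in the order low, medium, high
counts <- apply(num, 2, function(col) table(classify(col)))
counts
```

Since every column yields exactly three counts, the result is a single matrix instead of a list of tables of varying length.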

How to access R data frame contents using element in factor level

As below, dataframe factorizedss is the factorized version of a sourcedata dataframe ss.
ss <- data.frame(c('a','b','a'), c(1,2,1)); #There are string columns and number columns.
#So, I factorized them as below.
factorizedss <- data.frame(lapply(ss, as.factor)); #factorized version
indices <- data.frame(c(1,1,2,2), c(1,1,1,2)); #Now, given integer indices
With given indices, using factorizedss, is it possible to get corresponding element of the source dataframe as below? (The purpose is to access data frame element by integer number in factor level )
a 1
a 1
b 1
b 2
You can access the first column like this
factorizedss[indices[,1],][,1]
and the second in a similar way
factorizedss[indices[,2],][,2]
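It gets more difficult when trying to combine them; you might have to convert them back to native types (note that as.numeric() on a factor returns the integer codes, so go through as.character() first)
t(rbind(as.character(factorizedss[indices[,1],][,1]),as.numeric(as.character(factorizedss[indices[,2],][,2]))))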
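If the integers are meant as positions in each factor's levels rather than as row numbers, a lookup through levels() may be closer to the goal. This is a sketch under that assumption; `looked_up`, `restored`, `result` and the column names `letter` and `number` are hypothetical.

```r
ss <- data.frame(c("a", "b", "a"), c(1, 2, 1))
factorizedss <- data.frame(lapply(ss, as.factor))
indices <- data.frame(c(1, 1, 2, 2), c(1, 1, 1, 2))

# For each column, look up the level labels by integer code
looked_up <- Map(function(f, i) levels(f)[i], factorizedss, indices)

# Restore each column's original type (labels are character after lookup)
restored <- Map(function(v, src) if (is.numeric(src)) as.numeric(v) else v,
                looked_up, ss)

result <- data.frame(restored, stringsAsFactors = FALSE)
names(result) <- c("letter", "number")   # hypothetical column names
result
```

Unlike a matrix built with rbind, the data frame keeps the first column as character and the second as numeric.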

I need to build new matrices by selecting alternate columns of the original matrix

I have a matrix where the first column is the ID of the samples, the columns 2 to 15 are the observed presences of 14 fish species, and the columns 16 to 29 are the predicted presences of the same 14 species.
I need to build 14 matrices (1 per species) with 3 columns each: first column = ID of the samples (e.g. column 1 of the original matrix), 2nd column = observed presence of the species, 3rd column = predicted presence of the species.
Let's say that A is the ID of my samples:
A<-c(1,2,3,4,5,6,7,8,9,10)
B are the observed values for my species
B<-replicate(14,rnorm(10))
C are the predicted values for my species
C<-replicate(14,rnorm(10))
So I have the matrix "data":
data<-cbind(A, B, C)
I want to do something like this
A1<-cbind(data[,1],data[,2],data[,16])
A2<-cbind(data[,1],data[,3],data[,17])
etc.. until having A1 to A14 matrices, one for each species. I suspect that I need to use the lapply function but I am lost. Can anyone help me?
Thanks!!
We can use lapply to create a list of matrices by looping through the sequence of columns
lst <- lapply(seq_len(ncol(B)), function(i) cbind(A, B= B[,i], C=C[,i]))
names(lst) <- paste0("A", seq_along(lst))
It is better to keep it in a list instead of creating multiple objects in the global environment. But, if we need it anyway
list2env(lst, .GlobalEnv)
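If only the combined matrix `data` is available (and not the separate `B` and `C`), the same list can be built by column position. A minimal sketch, assuming the layout described in the question: ID first, then all observed columns, then all predicted columns in the same species order. The seed, `n_species` and the column labels are arbitrary choices.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
A <- 1:10
B <- replicate(14, rnorm(10))
C <- replicate(14, rnorm(10))
data <- cbind(A, B, C)             # 10 x 29, as in the question

n_species <- (ncol(data) - 1) / 2  # 14 under the assumed layout
lst <- lapply(seq_len(n_species), function(i) {
  # column 1 = ID, 1 + i = observed, 1 + n_species + i = predicted
  m <- data[, c(1, 1 + i, 1 + n_species + i)]
  colnames(m) <- c("ID", "observed", "predicted")
  m
})
names(lst) <- paste0("A", seq_along(lst))
```

Here `lst$A1` corresponds to cbind(data[,1], data[,2], data[,16]) from the question, `lst$A2` to columns 1, 3 and 17, and so on.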

Exclude data based on the number of non NA observations for each value of key

I have a dataset consisting of monthly observations for returns of US companies. I am trying to exclude from my sample all companies which have less than a certain number of non NA observations.
I managed to do what I want using foreach, but my dataset is very large and this takes a long time. Here is a working example which shows how I accomplished what I wanted and hopefully makes my goal clear
#load required packages
library(data.table)
library(foreach)
#example data
myseries <- data.table(
X = sample(letters[1:6],30,replace=TRUE),
Y = sample(c(NA,1,2,3),30,replace=TRUE))
setkey(myseries,"X") #so X is the company identifier
#here I create another data table with each company identifier and its number
#of non NA observations
nobsmyseries <- myseries[,list(NOBSnona = length(Y[complete.cases(Y)])),by=X]
# then I select the companies which have less than 3 non NA observations
comps <- nobsmyseries[NOBSnona <3,]
#finally I exclude all companies which are in the list "comps",
#that is, I exclude companies which have less than 3 non NA observations
#but I do for each of the companies in the list, one by one,
#and this is what makes it slow.
for (i in 1:dim(comps)[1]){
myseries <- myseries[X != comps$X[i],]
}
How can I do this more efficiently? Is there a data.table way of getting the same result?
If you have more than one column you wish to check for NA values then you can use complete.cases(.SD); however, as you want to test a single column, I would suggest something like
naCases <- myseries[,list(totalNA = sum(!is.na(Y))),by=X] # despite its name, totalNA counts the non-NA values per company
you can then join, given a threshold on the number of non-NA values
e.g.
threshold <- 3
myseries[naCases[totalNA >= threshold]]
you could also use a not-join to get the cases you have excluded
myseries[!naCases[totalNA >= threshold]]
As noted in the comments, something like
myseries[,totalNA := sum(!is.na(Y)),by=X][totalNA >= 3]
would work; however, in this case you are performing a vector scan on the entire data.table, whereas the previous solution performed the vector scan on a data.table that has only length(unique(myseries[['X']])) rows.
Given that this is a single vector scan, it will be efficient regardless (and perhaps binary join + small vector scan may be slower than the larger vector scan). However, I doubt there will be much difference either way.
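How about aggregating the number of non-NA values in Y over X, and then subsetting?
# Aggregate the number of non-NA values (rows with NA in Y are dropped by aggregate's default na.action)
num_nas <- as.data.table(aggregate(Y~X, data=myseries, FUN=function(x) sum(!is.na(x))))
# Subset: keep companies with at least 3 non-NA observations
myseries[X %in% num_nas$X[num_nas$Y>=3],]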
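For comparison, the same filter can be sketched in base R with ave(), which attaches the per-company count to every row so that one logical subset does the job without a loop. This is not a data.table solution; the seed and the name `min_obs` are assumptions made for the example.

```r
set.seed(42)   # arbitrary seed so the sketch is reproducible
myseries <- data.frame(
  X = sample(letters[1:6], 30, replace = TRUE),
  Y = sample(c(NA, 1, 2, 3), 30, replace = TRUE)
)
min_obs <- 3   # hypothetical name: minimum number of non-NA observations

# For each row, the number of non-NA Y values within that row's company
n_ok <- ave(as.integer(!is.na(myseries$Y)), myseries$X, FUN = sum)

# Keep only rows of companies that reach the minimum
kept <- myseries[n_ok >= min_obs, ]
```

The count is computed once per company and then compared row-wise, so no company is removed one at a time as in the original for loop.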

Resources