Apply/lapply function to all columns in matrix - r

I have a matrix called seq$num with 100 columns and 30k rows. Each column is named after a specific sample (e.g. CAGTCA), and every row holds numeric values. With this type of object, I can access the first row by writing seq$num[[1]], and so on for the other rows. This is a brief example of my data:
CAGTCA   AGATCA   GCTCGA   GCTCGA
-0.4930  -2.0330   0.7100   0.1560
 1.0030   0.0120  -1.0433   0.6701
 0.0013   1.0013   1.2451  -1.3421
I would like to loop through all the samples using the lapply function and for each sample classify:
the numbers above 1.5 as "high".
the numbers below 0 as "low".
the numbers between 0 and 1.5 as "medium".
Then I also need to record how many high, low and medium numbers each sample has.
How can this be done? I've tried applying the lapply function, but I don't get the output I want.

You can write a function which divides data into categories.
classify <- function(x) ifelse(x >= 1.5, 'high', ifelse(x < 0, 'low', 'medium'))
For each data frame in seq$num, apply the classify function and use table to count the categories.
res <- lapply(seq$num, function(x) table(classify(as.matrix(x))))
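If seq$num is a plain numeric matrix rather than a list of data frames, a column-wise variant may be closer to what the question asks for. The following is a minimal sketch under that assumption; the matrix m is rebuilt from the sample values in the question, and the fourth column name is a hypothetical placeholder to avoid a duplicate name. Fixing the factor levels keeps a row for every category, even when a sample has no values in it.

```r
# Same classification rule as in the answer above
classify <- function(x) ifelse(x >= 1.5, "high", ifelse(x < 0, "low", "medium"))

# Toy matrix built from the question's sample values
# (the column name "TTAGCA" is hypothetical)
m <- matrix(c(-0.4930, -2.0330,  0.7100,  0.1560,
               1.0030,  0.0120, -1.0433,  0.6701,
               0.0013,  1.0013,  1.2451, -1.3421),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("CAGTCA", "AGATCA", "GCTCGA", "TTAGCA")))

# One column of counts per sample; fixed levels keep zero counts visible
counts <- apply(m, 2, function(col)
  table(factor(classify(col), levels = c("low", "medium", "high"))))
print(counts)
```

The result is a 3 x 4 matrix with one row per category and one column per sample, so counts["low", "CAGTCA"] gives the number of low values for that sample.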

Related

R: Separating several observations of a variable and building a matrix

I have a multiple-response-variable with seven possible observations: "Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker".
If one chose more than one observation, the answers however are not separated in the data (Data)
My goal is to create a matrix with all possible observations as variables and marked with 1 (yes) and 0 (No). Currently I am using this command:
einzeln_strategisch_2021 <- data.frame(strategisch_2021[, ! colnames (strategisch_2021) %in% "Q12"], model.matrix(~ Q12 - 1, strategisch_2021)) %>%
This gives me the matrix I want, but it does not separate the observations, so now I have a matrix with 20 variables instead of seven.
I also tried separate() like this:
separate(Q12, into = c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker"), ";") %>%
This does separate the observations, but not in the right order and without the matrix.
How do I separate my observations and create a matrix with the possible observations as variables akin to the third picture (Matrix)?
Thank you very much in advance ;)
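One possible approach, sketched here with base R only, is to split each answer string and test every option for membership. The toy data frame below is hypothetical, and the ";" separator is an assumption taken from the separate() call in the question; the real Q12 column may be formatted differently.

```r
# Hypothetical toy data; the ";" separator is an assumption
strategisch_2021 <- data.frame(
  id  = 1:3,
  Q12 = c("Inhalt;Arbeit", "Spitzenpolitiker", "Arbeit;Verhindern Koalition"),
  stringsAsFactors = FALSE
)

# The seven possible observations named in the question
options <- c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition",
             "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft",
             "Spitzenpolitiker")

# Split every answer into its parts, then mark each option as 1 (chosen) or 0
chosen <- strsplit(strategisch_2021$Q12, ";", fixed = TRUE)
indicator <- t(sapply(chosen, function(ans) as.integer(options %in% trimws(ans))))
colnames(indicator) <- options

# Attach the indicator columns to the remaining variables
result <- cbind(strategisch_2021["id"], indicator)
print(result)
```

Because every option is checked by name, the columns come out in the order given in options, regardless of the order in which a respondent listed the answers.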

Summing a specific vector index

I'm having trouble figuring out how vectors are formatted. I need to find the average height of participants in the cystfibr data set from the ISwR package. When I print the entire height data set, it appears to be a 21x2 matrix with height values and a 1 or 2 to indicate sex. However, ncol returns NA, suggesting it is a vector. Trying to get specific indexes of the matrix (heightdata[1,]) also returns an "incorrect number of dimensions" error.
I'm looking to sum up only the height values in the vector, but when I run the code I get 25, which looks like the sum of the male and female integers.
install.packages("ISwR")
library(ISwR)
attach(cystfibr)
heightdata = table(height)
print(heightdata)
print(sum(heightdata))
This is what the output looks like.
You can work with cystfibr as a data frame and sum whichever columns you need.
install.packages("ISwR")
library(ISwR)
data <- data.frame(cystfibr) # copy the data set into a data frame
The data contain no unique identifier, so the sum is taken across all observations.
apply(data[, "height", drop = FALSE], 2, sum) # sum of the height column
height
3820
unlist(lapply(data , sum))
age sex height weight bmp fev1 rv frc tlc pemax
362.0 11.0 3820.0 960.1 1957.0 868.0 6380.0 3885.0 2850.0 2728.0
sapply(data, sum)
age sex height weight bmp fev1 rv frc tlc pemax
362.0 11.0 3820.0 960.1 1957.0 868.0 6380.0 3885.0 2850.0 2728.0
table gives you the count of each distinct value in the vector, so summing it returns the number of observations.
If you want to sum the height values held in heightdata, note that they are stored in the names of heightdata as character strings; convert them to numeric and sum.
sum(as.numeric(names(heightdata)))
#[1] 3177
which is the same as summing the unique values of height.
sum(unique(cystfibr$height))
#[1] 3177
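The difference between these sums is easier to see on a small vector. This is a minimal sketch using a hypothetical height vector (not the cystfibr data) in which one value occurs twice; it also shows the mean, which is what the question ultimately asks for.

```r
# Hypothetical heights; 124 occurs twice
height <- c(109, 112, 124, 124, 130)

tab <- table(height)            # counts per distinct height
n_obs <- sum(tab)               # number of observations, not a sum of heights
sum_unique <- sum(as.numeric(names(tab)))                    # each distinct height once
sum_all <- sum(as.numeric(names(tab)) * as.vector(tab))      # weights each height by its count
avg <- mean(height)             # average height, computed directly

print(c(n_obs = n_obs, sum_unique = sum_unique, sum_all = sum_all, avg = avg))
```

Summing the names of the table only matches sum(height) when no value is repeated, so for an average it is simpler to call mean on the original column.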

filter a correlation matrix based on value and occurrence

Does anyone have a way to filter a correlation matrix (or list of correlations) based on a ranking that includes value and breadth? For example, if a certain variable has a high enough correlation with a large enough number of other variables, then keep it. If a variable does not meet these criteria, filter it out.
As an example:
if a correlation > 0.25 is found in more than 3 entries, keep this variable. If not, discard the variable.
Currently I'm able to construct a correlation matrix and filter it based on values, but have not been able to progress past this. For filtering, I'm setting values below my threshold to 0:
correlation_matrix <- round(cor(data, method = "pearson", use = "pairwise.complete.obs"), digits = 4)
correlation_matrix[correlation_matrix < 0.13 & correlation_matrix > -0.13] <- 0
I've now done this using apply as Rui mentioned above.
This code selects all rows (and columns) of the correlation matrix that contain at least 75 (breadth) values whose absolute value is at least 0.2 (threshold):
1) define the variables; set the diagonal values from 1 to 0 so that self-correlations are not counted
threshold <- 0.2
breadth <- 75
correlation_matrix_filter <- correlation_matrix
diag(correlation_matrix_filter) <- 0
2) count how many values per row have an absolute value of at least the threshold of 0.2
filter <- apply(correlation_matrix_filter,1, function(x) sum(abs(x) >= threshold))
3) select only rows with at least 75 such values; subset the original correlation matrix to only include these rows (and columns)
sel <- filter >= breadth
correlation_matrix_final <- correlation_matrix[sel,sel]
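The three steps can be checked on a small hand-made matrix. This is a sketch with hypothetical correlation values and a much smaller breadth; it uses rowSums in place of apply, which should give the same counts for a numeric matrix.

```r
# Hypothetical 4 x 4 correlation matrix
cm <- matrix(c(1.0, 0.5, 0.4, 0.0,
               0.5, 1.0, 0.3, 0.1,
               0.4, 0.3, 1.0, 0.0,
               0.0, 0.1, 0.0, 1.0),
             nrow = 4, byrow = TRUE,
             dimnames = list(letters[1:4], letters[1:4]))

threshold <- 0.25
breadth <- 2

no_diag <- cm
diag(no_diag) <- 0                                # ignore self-correlations
n_strong <- rowSums(abs(no_diag) >= threshold)    # strong correlations per variable
keep <- n_strong >= breadth
result <- cm[keep, keep, drop = FALSE]            # drop = FALSE keeps a matrix even for one survivor
print(result)
```

Here variables a, b and c each have two correlations at or above 0.25 and are kept, while d has none and is dropped.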

Calculating Gini by Row in R

I'm trying to calculate the gini coefficient within each row of my dataframe, which is 1326 rows long, by 6 columns (1326 x 6).
My current code...
attacks$attack_gini <- gini(x = c(attacks$attempts_open_play,
attacks$attempts_corners,attacks$attempts_throws,
attacks$attempts_fk,attacks$attempts_set_play,attacks$attempts_penalties))
... fills all the rows with the same figure of 0.7522439 - which is evidently wrong.
Note: I'm using the gini function from the reldist package.
Is there a way that I can calculate the gini for the 6 columns in each row?
Thanks in advance.
The gini function from reldist does not accept a data frame as input. You can get the coefficient of a single column, for example the first one, like this:
> gini(attacks$attempts_open_play)
[1] 0.1124042
However, when you do c(attacks$attempts_open_play, attacks$attempts_corners, ...) you are actually building one long vector with the columns of your data frame placed one after the other, so your gini call returns a single number, e.g.:
> gini(c(attacks$attempts_open_play, attacks$attempts_corners))
[1] 0.112174
And that's why you are assigning the same single number to every row of attacks$attack_gini. If I understood properly, you want to calculate the gini coefficient across the columns within each row. For that you can use apply, something like
attacks$attack_gini <- apply(attacks[,c('attempts_open_play', 'attempts_corners', ...)], 1, gini)
where the second argument, 1, tells apply to call gini once per row. For example:
head(apply(attacks[,c('attempts_open_play', 'attempts_corners')], 1, gini))
[1] 0.026315789 0.044247788 0.008928571 0.053459119 0.019148936 0.007537688
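The row-wise pattern can be tried without installing reldist. This is a minimal sketch in which gini_simple is a hypothetical stand-in (a plain unweighted Gini formula, not the reldist implementation) and the numbers in attacks are made up for illustration; only the column names come from the question.

```r
# Hypothetical stand-in for reldist::gini: unweighted Gini coefficient
gini_simple <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

# Made-up values; column names taken from the question
attacks <- data.frame(attempts_open_play = c(5, 0),
                      attempts_corners   = c(5, 0),
                      attempts_throws    = c(5, 9))

cols <- c("attempts_open_play", "attempts_corners", "attempts_throws")

# MARGIN = 1: each row is passed to the function as one numeric vector
attacks$attack_gini <- apply(attacks[, cols], 1, gini_simple)
print(attacks)
```

The first row is perfectly even and gets 0, while the second row concentrates everything in one column and gets a much higher value, so each row now has its own coefficient.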
Hope it helps.

Set values less than threshold to zero, with column-specific thresholds

I have two data frames. The first contains 165 columns (species names) and almost 193,000 rows; each cell holds a number from 0 to 1, the probability that the species is present at that point.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
2 0.0279037 0.604687 0.0388309 0.0161980 0.0143966 0.240152
3 0.0294101 0.674846 0.0673055 0.0481405 0.0397423 0.231308
4 0.0292839 0.603869 0.0597947 0.0526606 0.0463431 0.188875
6 0.0331264 0.541165 0.0470451 0.0270871 0.0373348 0.256662
8 0.0393825 0.672371 0.0715808 0.0559353 0.0565391 0.230833
9 0.0376557 0.663732 0.0747417 0.0445794 0.0602539 0.229265
The second data frame contains 164 columns (the same species names as the first data frame) and one row holding the threshold for each species: above it we assume the species is present, and below it we assume the species is absent.
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
0.3155 0.2816 0.2579 0.2074 0.3007 0.3513 0.3514
What I want is a new data frame that, for every species in the presence data (my.data), keeps the probability if it is above the threshold (thres) and replaces it with zero if it is below the threshold.
I know this would need a for loop and an if statement, but I am new to R and I don't know how to do this.
Please help me.
I think you want something like this:
(Make up small reproducible example)
set.seed(101)
speciesdat <- data.frame(pointID=1:10,matrix(runif(100),ncol=10,
dimnames=list(NULL,LETTERS[1:10])))
threshdat <- rbind(seq(0.1,1,by=0.1))
Now process:
thresh <- unlist(threshdat) ## flatten the thresholds (needed if they are a one-row data frame)
## with MARGIN=2, 'sweep' pairs each column with its own threshold
ss2 <- sweep(as.matrix(speciesdat[,-1]),MARGIN=2,STATS=thresh,
             FUN=function(x,y) ifelse(x<y,0,x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID,ss2)
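Since the two data frames in the question do not have exactly the same columns, it may be safer to match thresholds to species by name before sweeping. The following is a sketch under that assumption; the column names come from the question, but the numbers in my.data are hypothetical.

```r
# Hypothetical probabilities; thresholds taken from the question's example
my.data <- data.frame(POINTID   = c(2, 3, 4),
                      Abie_Xbor = c(0.03, 0.40, 0.31),
                      Acer_Camp = c(0.60, 0.67, 0.20))
thres <- data.frame(Abie_Xbor = 0.3155, Acer_Camp = 0.2816)

# Only process species present in both data frames, in a common order
species <- intersect(names(my.data), names(thres))
thr <- unlist(thres[1, species])

vals <- as.matrix(my.data[species])
vals[sweep(vals, 2, thr, "<")] <- 0      # zero out values below their column's threshold

result <- data.frame(POINTID = my.data$POINTID, vals)
print(result)
```

Matching by name means a species that appears in only one of the two data frames is skipped instead of shifting every later threshold onto the wrong column.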
It's simpler to have the same number of columns (with the same meanings of course).
frame2 = data.frame(POINTID=0, frame2)
R works with vectors, so a row of frame1 can be directly compared to frame2
frame1[1,] < frame2
Could use an explicit loop for every row of frame1 but it's common to use the implicit loop of "apply"
answer = apply(frame1, 1, function(x) x < frame2)
This is a rather sloppy solution (especially changing frame2), but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices when possible (they can still use labels but are generally faster).
This produces a logical matrix which can be used to generate assignments with "[<-" (assuming the multi-row data frame is named "cols" and the named vector is "vec"):
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch of the number of columns with the length of the vector, but presumably you can adjust the length of the vector to be the correct number of entries.

Resources