Summing a specific vector index - r

I'm having trouble figuring out how vectors are formatted. I need to find the average height of participants in the cystfibr data set from the ISwR package. When I print the entire height data it appears to be a 21x2 matrix with height values and a 1 or 2 to indicate sex. However, ncol returns NA, which suggests it is a vector. Trying to index it like a matrix (heightdata[1,]) also fails with an "incorrect number of dimensions" error.
I want to sum only the height values in the vector, but when I run the code below I get what looks like the sum of the male and female integers (25).
install.packages("ISwR")
library(ISwR)
attach(cystfibr)
heightdata = table(height)
print(heightdata)
print(sum(heightdata))
This is what the output looks like.

You can work with cystfibr as a data frame and sum every column in it.
install.packages("ISwR")
library(ISwR)
data <- data.frame(cystfibr) # copy cystfibr as a data frame; attach() is not needed
There is no unique identifier in the data, so the sums are taken across all observations.
apply(data[, "height", drop = FALSE], 2, sum) # sum of the height column
# height
#   3820
unlist(lapply(data, sum))
#    age    sex height weight    bmp   fev1     rv    frc    tlc  pemax
#  362.0   11.0 3820.0  960.1 1957.0  868.0 6380.0 3885.0 2850.0 2728.0
sapply(data, sum)
#    age    sex height weight    bmp   fev1     rv    frc    tlc  pemax
#  362.0   11.0 3820.0  960.1 1957.0  868.0 6380.0 3885.0 2850.0 2728.0

table() returns the count of each distinct value in the vector, so sum(heightdata) is just the number of observations (25).
The height values themselves are stored as the names of heightdata, in character format. To sum them, convert the names to numeric first.
sum(as.numeric(names(heightdata)))
#[1] 3177
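which is the same as summing the unique values of height. Because repeated heights are counted only once, this is not the sum of all heights (sum(cystfibr$height) gives 3820).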
sum(unique(cystfibr$height))
#[1] 3177
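The difference between these sums is easier to see on a tiny vector. The sketch below uses a made-up `heights` vector (an assumption for illustration, not the cystfibr data) to show what `table()` stores and why `mean()` on the raw vector is what the question needs.

```r
# Hypothetical heights with one repeated value (not the ISwR data)
heights <- c(150, 160, 160, 175)

counts <- table(heights)                          # counts per distinct value
n_obs <- sum(counts)                              # number of observations
sum_unique <- sum(as.numeric(names(counts)))      # each distinct height once
sum_all <- sum(heights)                           # every observation
avg <- mean(heights)                              # the average height

print(c(n_obs = n_obs, sum_unique = sum_unique, sum_all = sum_all, avg = avg))

# With ISwR loaded, the equivalent call would be: mean(cystfibr$height)
```

Working on the column itself avoids `table()` entirely, so no conversion of names is needed.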

Related

create lists that contain the rownumbers for which column i contains the maximum value of that row

In a dataframe of 4 columns, I'm looking for an elegant way to get 3 lists that contain the names from column 1 if the maximum of that row in which that name is, is respectively in column 2, 3 or 4.
the first column contains parameter names,
column 2 a shapiro test outcome on the raw data of parameter x
column 3, shapiro test outcome of log10 transformed data for parameter x
column 4, shapiro test outcome of a custom transformation given by the user for parameter x
if this is the data:
Parameter xval xlog10val xcustomval
1 FWS.Range 0.62233371 0.9741614 0.9619065
2 FL.Red.Range 0.48195980 0.9855781 0.9643206
3 FL.Orange.Range 0.43338087 0.9727243 0.8239867
4 FL.Yellow.Range 0.53554943 0.9022795 0.9223407
5 FL.Red.Gradient 0.35194524 0.9905047 0.5718224
6 SWS.Range 0.46932823 0.9487955 0.9825318
7 SWS.Length 0.02927791 0.4565962 0.7309313
8 FWS.Fill.factor 0.93764311 0.8039806 0.0000000
9 FL.Red.Total 0.22437754 0.9655873 0.9923307
QUESTION: how do I get a list of all parameter names where xlog10val is the highest of the three columns (xval, xlog10val, xcustomval)?
detailed explanation, ignore perhaps. ....
list 1, the rows where xval is the highest value, should be looking like this: 'FWS.Fill.factor' since that is the only row where xval has the highest score
list 2 is the list of all rows where xlog10val is the maximum value, and thus should contain the names of parameters where xlog10val is the maximum of that row:
'FWS.Range', 'FL.Red.Range', 'FL.Orange.Range',
'FL.Red.Gradient', 'FWS.Fill.factor'
and list 3 the rest of the names
I tried something like
df$Parameter[which(df$xval == max(df[ ,2:4]))]
but this gives integer(0) results.
EDIT
to clarify:
Lets start with looking at column 2 (xval).
PER row I need to test whether xval is the maximum of the 3 columns; xval, xlog10val, xcustomval
if this is the case, add the parameter in THAT row to the list of xval_is_the_max_of_3_columns list
Then we do the same PER row for xlog10val. IF xlog10val in row i is max of columns 2:4, add the name of that ROW to xlog10val_is_the_max_of_3_columns list.
To make the DF:
df <- data.frame(Parameter = c('FWS.Range', 'FL.Red.Range', 'FL.Orange.Range', 'FL.Yellow.Range', 'FL.Red.Gradient','SWS.Range','SWS.Length','FWS.Fill.factor','FL.Red.Total'),
xval = c(0.622333705577588,0.481959800402278,0.433380866119736,0.535549430820635,0.351945244290616,0.469328232931424,0.0292779051823701,0.93764311477813,0.224377540663707),
xlog10val = c( 0.974161367853916,0.985578135386898,0.97272429360688,0.902279501804112,0.990504657326703,0.94879549470406,0.45659620937997,0.803980592920426,0.965587334461157),
xcustomval = c(0.961906534164457,0.964320569400919,0.823986745004031,0.922340716468745,0.571822393107348,0.982531798077881,0.73093132928955,0,0.992330722386105))
We can use max.col to get the index of the maximum value per each row and with that we subset the 'Parameter'
i1 <- max.col(df[-1], 'first')
split(df$Parameter, i1)
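As a sketch of how this plays out on the question's data, the block below rebuilds the data frame from the rounded values in the printed table and labels each group with the winning column name. The names `value_cols`, `winner` and `by_winner` are assumptions introduced here. It also shows why the original attempt returns nothing: `max(df[, 2:4])` is one global maximum, which sits in xcustomval, so no xval entry equals it.

```r
df <- data.frame(
  Parameter = c("FWS.Range", "FL.Red.Range", "FL.Orange.Range",
                "FL.Yellow.Range", "FL.Red.Gradient", "SWS.Range",
                "SWS.Length", "FWS.Fill.factor", "FL.Red.Total"),
  xval = c(0.62233371, 0.48195980, 0.43338087, 0.53554943, 0.35194524,
           0.46932823, 0.02927791, 0.93764311, 0.22437754),
  xlog10val = c(0.9741614, 0.9855781, 0.9727243, 0.9022795, 0.9905047,
                0.9487955, 0.4565962, 0.8039806, 0.9655873),
  xcustomval = c(0.9619065, 0.9643206, 0.8239867, 0.9223407, 0.5718224,
                 0.9825318, 0.7309313, 0.0000000, 0.9923307),
  stringsAsFactors = FALSE
)

value_cols <- c("xval", "xlog10val", "xcustomval")

# Column index of the row-wise maximum, translated to a column name
winner <- value_cols[max.col(as.matrix(df[value_cols]), ties.method = "first")]

# One character vector of parameter names per winning column
by_winner <- split(df$Parameter, factor(winner, levels = value_cols))
print(by_winner)

# The attempt from the question compares against the single global maximum
global_attempt <- which(df$xval == max(df[, 2:4]))
print(global_attempt)
```

Using a factor with explicit levels keeps the three groups in column order and keeps an empty group in the result if one column never wins.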
EDIT: Based on the discussion with @Mark
I'm not sure exactly how you're selecting the parameters for lists two and three; however, you can try something like this as well:
df$Parameter <- as.character(df$Parameter)
par.xval.max <- df[which.max(df$xval), "Parameter"]
par.col3.gt.max <- df[df$xlog10val > max(df$xval), "Parameter"]
par.rem <- df$Parameter[! df$Parameter %in% c(par.xval.max, par.col3.gt.max)]
In this case, the parameters taken from column three are those whose xlog10val is greater than max(df$xval), and the remaining parameters are obtained by negative selection using %in%.

Chaining dataframes in a list

I have a list of data.frames an example of which can be found in the example.data below
example.data <- list(
  stage1 = data.frame(stuff=c("Apples","Oranges","Bananas"),
                      Prop1=c(1,2,3),
                      Prop2=c(3,2,1),
                      Wt=c(1,2,3)),
  stage2 = data.frame(stuff=c("Bananas","Mango","Cherry","Quince","Gooseberry"),
                      Prop1=c(8,9,10,1,2),
                      Prop2=c(23,32,55,5,4),
                      Wt=c(45,23,56,99,2)),
  stage3 = data.frame(stuff=c("Gooseberry","Bread","Grapes","Butter"),
                      Prop1=c(9,8,9,10),
                      Prop2=c(34,45,67,88),
                      Wt=c(24,56,31,84))
)
The data.frames will always have the same number of columns but their rows will vary, as will the number of data.frames in the list. Notice the chain through the list: apples go to bananas, bananas go to gooseberry and gooseberry goes to butter. That is, each pair of adjacent data.frames has a common element.
I want to scale up the weights throughout the whole list as follows. Firstly, I need to input my final weight, say 20e3. Secondly, I need a scale factor for the last row, last column of the last data frame: in this particular case this will be 20e3/84. I want to use this scale factor at some point to create new columns in the last dataframe.
Next I want to scale between the last dataframe and the previous one. Using the scale factor previously calculated, the scale factor for stage2 is (24*20e3/84) / 2; that is, the stage3 Gooseberry weight multiplied by the stage3 scale factor, divided by the stage2 Gooseberry weight. This process is repeated (via Bananas) to give the stage1 scale factor.
In this particular example the scale factors should be approximately 42858.0, 2857.2 and 238.1 for stage1, stage2 and stage3.
I tried doing a for loop over the reverse of the length of the list with appropriate sub-setting after extracting the coordinates of the last element of each data.frame. This failed because the for loop was out of sync. I'm loath to post what I've tried in case I lead anyone astray.
Not getting many responses so here's what I've done so far ...
last.element <- function(a.list) {
  ## Returns the index of the last data frame in a list of data frames,
  ## followed by that data frame's row and column counts
  a <- length(a.list) ## required to subset the last element
  x <- dim(a.list[[a]])[1]
  y <- dim(a.list[[a]])[2]
  details <- c(a,x,y)
  return(details)
}
details <- as.data.frame(matrix(NA,nrow=length(example.data),ncol=3))
for (i in length(example.data):1) {
  details[i,1:3] <- last.element(example.data[1:i])
}
The function gives the last element in each of the data.frames down the list. I've set up a data.frame which I want to populate with the scale factor. Next,
details[,4] <- 1
for (i in length(example.data):1) {
  details[i,4] <- 20e3 / as.numeric(example.data[[i]][as.matrix(details[i,2:3])])
}
I set up an extra column in the details data.frame ready for the scale-up factors. But the for loop only gets the last scale factor right,
> details
V1 V2 V3 V4
1 1 3 4 6666.6667
2 2 5 4 10000.0000
3 3 4 4 238.0952
If I multiply 238.0952 by 84 it will give me 20000.
But the scale factor for the second data frame should be (24 * 238.0952) / 2 that is ... all the weights in the third data.frame are multiplied by the scale factor. A new scale factor is derived by dividing the scaled up Gooseberry value in the third data.frame by the Gooseberry value in the second data.frame. The scale factor for the first data frame is found in a similar manner.
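The chaining described above can be sketched as a backwards pass over the list. This is one possible reading of the question, not a confirmed solution: the function name `chain_scale_factors` is hypothetical, and it assumes each adjacent pair of data frames shares exactly one value of `stuff` and that the final weight refers to the last row of the last data frame.

```r
example.data <- list(
  stage1 = data.frame(stuff = c("Apples", "Oranges", "Bananas"),
                      Wt = c(1, 2, 3)),
  stage2 = data.frame(stuff = c("Bananas", "Mango", "Cherry", "Quince", "Gooseberry"),
                      Wt = c(45, 23, 56, 99, 2)),
  stage3 = data.frame(stuff = c("Gooseberry", "Bread", "Grapes", "Butter"),
                      Wt = c(24, 56, 31, 84))
)

# Hypothetical helper: walk the list from the last stage to the first
chain_scale_factors <- function(stages, final_weight) {
  n <- length(stages)
  factors <- numeric(n)
  last <- stages[[n]]
  # Last stage: final weight over the weight in its last row
  factors[n] <- final_weight / last$Wt[nrow(last)]
  for (i in rev(seq_len(n - 1))) {
    up <- stages[[i]]
    down <- stages[[i + 1]]
    # The item that links the two stages (assumed to be unique)
    link <- intersect(as.character(up$stuff), as.character(down$stuff))
    stopifnot(length(link) == 1)
    w_down <- down$Wt[as.character(down$stuff) == link]
    w_up <- up$Wt[as.character(up$stuff) == link]
    # Scaled-up downstream weight divided by the upstream weight
    factors[i] <- w_down * factors[i + 1] / w_up
  }
  names(factors) <- names(stages)
  factors
}

scale_factors <- chain_scale_factors(example.data, 20e3)
print(round(scale_factors, 1))
```

Only the `stuff` and `Wt` columns are reproduced here because the other columns play no part in the scale factors. Each factor depends on the one after it, which is why a loop that treats every stage independently cannot produce the chained values.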

How do I extract a series of n rows after initial row extraction?

Suppose I have a matrix called M,
"Date" "X" "Y"
1991 T 10
1992 T 5
1993 F 2
1994 F 1
1995 T 7
where Date is a character value, X is Boolean, and Y is numeric. Also, assume that the total number of rows is 50, each filled with values like those above.
My initial selection criterion is for the second column to be TRUE. Thus,
initial_row <- M[M[,2] == TRUE, ]
I am looking for a way to extract the 10 rows (or any constant number) following each initial row, regardless of their values in any column. Basically, I'm trying to mine all the rows that follow the initial extraction, then move on to the next row that meets the initial selection criterion.
This code should extract the rows following each initial row and put the batches of rows into a list (it assumes M is a data frame with a logical column X):
const <- 10
lapply(which(M$X), function(n){
  indices <- n + 1:const # 0:const to include initial row
  indices <- indices[indices <= nrow(M)] # exclude out of bounds values
  M[indices,]
})
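To see the batching on a small case, here is a sketch that uses the five rows shown in the question as a data frame and a window of 2 rows instead of 10. Both choices, and naming the batches by the Date of their initial row, are assumptions made to keep the result short and readable.

```r
M <- data.frame(Date = c("1991", "1992", "1993", "1994", "1995"),
                X = c(TRUE, TRUE, FALSE, FALSE, TRUE),
                Y = c(10, 5, 2, 1, 7),
                stringsAsFactors = FALSE)

const <- 2  # rows to take after each initial row

batches <- lapply(which(M$X), function(n) {
  indices <- n + seq_len(const)            # the rows after row n
  indices <- indices[indices <= nrow(M)]   # drop out-of-bounds indices
  M[indices, ]
})
names(batches) <- M$Date[M$X]  # label each batch by its initial row
print(batches)
```

The batch for 1995 comes back with zero rows because nothing follows the last row, and batches overlap whenever two initial rows are closer together than the window.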

Get the position of maximum value and the respective row element in a Data frame

I created a data frame named "data" that has 100 rows of names and corresponding ages (colnames "NAMES" and "AGES"). Now I try to find the maximum age with the max() function by using
max(data[,"AGES"])
I get the maximum age, but I also want the position and the names of the people who have the maximum age. After getting those names I want to arrange them alphabetically. How do I do this?
I tried searching on the net, but wasn't successful in putting the different pieces together.
Let's first generate some demo data:
data<-data.frame(NAMES=replicate(100, paste(sample(letters, 8, replace=TRUE), collapse="")), AGES=sample(20:60, 100, replace=TRUE))
head(data)
NAMES AGES
1 oepefudt 21
2 ibmuaemm 49
3 mkockaqu 23
4 whyzomna 59
5 omqqtbsz 35
6 qnbmjmuf 25
We can then find the rows that have the maximum age, extract their names, and finally sort them in alphabetical order in a single line:
sort(as.character(data$NAMES[data$AGES==max(data$AGES)]))
Or maybe more transparently:
# Find the maximum age
max.age<-max(data$AGES)
# Which rows have the maximum age value?
ind<-which(data$AGES==max.age)
# Extract the name using the ind from above
persons<-as.character(data$NAMES[ind])
# Sort the names
persons.sorted<-sort(persons)
persons.sorted
Would this help?
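Because the demo data above is random, a fixed example may make the positions easier to check. The sketch below uses a small made-up data frame called `people` (the names and ages are assumptions) in which two people share the maximum age.

```r
# Hypothetical data with a tie for the maximum age
people <- data.frame(NAMES = c("dana", "alex", "cory", "bert"),
                     AGES = c(60, 42, 60, 35),
                     stringsAsFactors = FALSE)

# Row positions of every person with the maximum age
oldest_rows <- which(people$AGES == max(people$AGES))

# Their names in alphabetical order
oldest_names <- sort(people$NAMES[oldest_rows])

# which.max() reports only the first position, so it misses ties
first_only <- which.max(people$AGES)

print(oldest_rows)
print(oldest_names)
print(first_only)
```

Comparing against `max()` inside `which()` is what keeps all tied rows, which matters when several people have the same maximum age.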

Set values less than threshold to zero, with column-specific thresholds

I have two data frames. One of them contains 165 columns (species names) and almost 193,000 rows; each cell holds a number from 0 to 1, which is the probability that the species is present in that cell.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
2 0.0279037 0.604687 0.0388309 0.0161980 0.0143966 0.240152
3 0.0294101 0.674846 0.0673055 0.0481405 0.0397423 0.231308
4 0.0292839 0.603869 0.0597947 0.0526606 0.0463431 0.188875
6 0.0331264 0.541165 0.0470451 0.0270871 0.0373348 0.256662
8 0.0393825 0.672371 0.0715808 0.0559353 0.0565391 0.230833
9 0.0376557 0.663732 0.0747417 0.0445794 0.0602539 0.229265
The second data frame contains 164 columns (the same species names as the first data frame) and a single row holding, for each species, the threshold above which we assume the species is present and below which we assume it is absent.
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
0.3155 0.2816 0.2579 0.2074 0.3007 0.3513 0.3514
What I want is a new data frame that, for every species in the presence probabilities (my.data), keeps the probability if it is above that species' threshold (thres) and contains zero if it is below.
I know this could be done with a for loop and an if statement, but I am new to R and I don't know how to do it.
Please help me.
I think you want something like this:
(Make up small reproducible example)
set.seed(101)
speciesdat <- data.frame(pointID=1:10,matrix(runif(100),ncol=10,
                         dimnames=list(NULL,LETTERS[1:10])))
threshdat <- rbind(seq(0.1,1,by=0.1))
Now process:
thresh <- as.numeric(unlist(threshdat)) ## flatten the one-row thresholds into a plain vector
## with MARGIN=2, 'sweep' pairs each column with its own threshold
ss2 <- sweep(as.matrix(speciesdat[,-1]),MARGIN=2,STATS=thresh,
             FUN=function(x,y) ifelse(x<y,0,x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID,ss2)
It's simpler to have the same number of columns (with the same meanings of course).
frame2 = data.frame(POINTID=0, frame2)
R works with vectors, so a row of frame1 can be directly compared to frame2
frame1[1,] < frame2
You could use an explicit loop over every row of frame1, but it's common to use the implicit loop of "apply"
answer = apply(frame1, 1, function(x) x < frame2)
This is all a rather sloppy solution (especially changing frame2), but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices when possible (they can still use labels but are generally faster).
sweep with a comparison operator produces a logical matrix, which can be used to make assignments with "[<-" (assuming the multi-row data frame is named "cols" and the named threshold vector is "vec"):
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch of the number of columns with the length of the vector, but presumably you can adjust the length of the vector to be the correct number of entries.
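One way to avoid the length mismatch is to line the thresholds up by species name before sweeping. The sketch below is an illustration under assumptions: the probabilities in `probs` are made up, the thresholds are the first two values and the Acta_Spic value shown in the question, and the names `probs`, `species` and `result` are introduced here.

```r
# Hypothetical presence probabilities for three points
probs <- data.frame(POINTID = c(2, 3, 4),
                    Abie_Xbor = c(0.0279, 0.3294, 0.0293),
                    Acer_Camp = c(0.6047, 0.6748, 0.2039))

# One-row thresholds; Acta_Spic has no matching column in probs
thres <- data.frame(Abie_Xbor = 0.3155, Acer_Camp = 0.2816, Acta_Spic = 0.3514)

# Species columns, used to pick thresholds by name and in the same order
species <- setdiff(names(probs), "POINTID")
thr <- unlist(thres[1, species])

m <- as.matrix(probs[species])
m[sweep(m, 2, thr, "<")] <- 0   # zero out values below the column's threshold

result <- data.frame(POINTID = probs$POINTID, m)
print(result)
```

Selecting thresholds by name means extra or reordered threshold columns do not shift the comparison onto the wrong species.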

Resources