Combining row vectors for data frame after using quantile function - r

Novice problem. I ran the following command:
CI_95_outcomes_male <- data.frame(do.call(cbind,lapply(1:ncol(outcomes_male_dt), function(r) quantile(outcomes_male_dt[,r],c(.95)))))
and ended up with this output:
CI_95_outcomes_male
X1 X2 X3 X4
95% 9629902039 0 2.968924e+15 2.968924e+15
I would like to combine this vector with the following vector to end up with a 2x4 matrix:
mean_outcomes_male
ylg_smoking_simS deaths_averted total_cig total_tax_
9.62990 0.0000 2.78248 2.782480
I tried:
CI_95_outcomes_male<-colnames(mean_outcomes_male)
data.frame(mean_outcomes_male,CI_95_outcomes_male)
Error in data.frame(mean_outcomes_male, CI_95_outcomes_male) :
arguments imply differing number of rows: 4, 0
Any guidance appreciated, thanks!

CI_95_outcomes_male<-colnames(mean_outcomes_male)
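I think you meant to wrap the left-hand side in colnames(), i.e. colnames(CI_95_outcomes_male) <- ...; as written, the line overwrites CI_95_outcomes_male itself. There is a second problem as well. I'm assuming that mean_outcomes_male is a vector, in which case colnames(mean_outcomes_male) is NULL, so CI_95_outcomes_male ends up NULL (hence the "4, 0" in the error message).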
data.frame(mean_outcomes_male,CI_95_outcomes_male)
Even if CI_95_outcomes_male were correct, the above command would produce a 4x5 data frame: the first column is the mean_outcomes_male vector, the second column is the CI_95_outcomes_male value for your first variable (repeated in every row), and so on up to the fifth column, which is the CI_95_outcomes_male value for your fourth variable (repeated in every row).
You need to do something like this:
set.seed(42)
# Generate a random dataset for outcomes_male_dt with 4 variables and n rows
n <- 100
outcomes_male_dt <- data.frame(x1=runif(n),x2=runif(n),x3=runif(n),x4=runif(n))
# I'm assuming you want the 95th percentile of each variable in outcomes_male_dt and store them in CI_95_outcomes_male
ptl <- .95 # if you want to add other percentiles you can replace this with something like "ptl <- c(.10,.50,.90,.95)"
CI_95_outcomes_male <- apply(outcomes_male_dt,2,quantile,probs=ptl)
# I'm going to assume that mean_outcomes_male is a vector of means for all the variables in outcomes_male_dt
mean_outcomes_male <- colMeans(outcomes_male_dt)
# You want to end up with a 2x4 matrix - I'm assuming you meant row 1 will be the means, and row 2 will be the 95th percentiles, and the columns will be the variables
want <- rbind(mean_outcomes_male, CI_95_outcomes_male)
colnames(want) <- colnames(outcomes_male_dt)
row.names(want) <- c('Mean',paste0("p",ptl*100)) # paste0("p",ptl*100) is equivalent to paste("p",ptl*100,sep="")
want # Resulting matrix
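For comparison, here is a more compact sketch of the same idea. It assumes, as above, that the goal is the mean and the 95th percentile of every column; the simulated outcomes_male_dt and the name summary_mat are stand-ins, not the asker's real data. The last lines show why the original attempt ended up with NULL: colnames() of a plain named vector is NULL, whereas names() returns the labels.

```r
set.seed(42)
# Simulated stand-in for the real data
outcomes_male_dt <- data.frame(x1 = runif(100), x2 = runif(100),
                               x3 = runif(100), x4 = runif(100))

# One pass over the columns: each call returns c(Mean, p95), and sapply()
# binds those length-2 vectors into a 2 x 4 matrix.
summary_mat <- sapply(outcomes_male_dt, function(x) {
  c(Mean = mean(x), p95 = quantile(x, probs = 0.95, names = FALSE))
})
summary_mat

# Why the original attempt produced NULL:
mean_outcomes_male <- colMeans(outcomes_male_dt)
colnames(mean_outcomes_male)  # NULL, because a vector has no columns
names(mean_outcomes_male)     # "x1" "x2" "x3" "x4"
```

If more percentiles are needed, the anonymous function can return a longer named vector and the result simply gains rows.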

Related

create lists that contain the rownumbers for which column i contains the maximum value of that row

In a dataframe of 4 columns, I'm looking for an elegant way to get 3 lists that contain the names from column 1 if the maximum of that row in which that name is, is respectively in column 2, 3 or 4.
the first column contains parameter names,
column 2 a shapiro test outcome on the raw data of parameter x
column 3, shapiro test outcome of log10 transformed data for parameter x
column 4, shapiro test outcome of a custom transformation given by the user for parameter x
if this is the data:
Parameter xval xlog10val xcustomval
1 FWS.Range 0.62233371 0.9741614 0.9619065
2 FL.Red.Range 0.48195980 0.9855781 0.9643206
3 FL.Orange.Range 0.43338087 0.9727243 0.8239867
4 FL.Yellow.Range 0.53554943 0.9022795 0.9223407
5 FL.Red.Gradient 0.35194524 0.9905047 0.5718224
6 SWS.Range 0.46932823 0.9487955 0.9825318
7 SWS.Length 0.02927791 0.4565962 0.7309313
8 FWS.Fill.factor 0.93764311 0.8039806 0.0000000
9 FL.Red.Total 0.22437754 0.9655873 0.9923307
QUESTION: how do I get a list of all parameter names where xlog10val is the highest of the three columns (xval, xlog10val, xcustomval)?
Detailed explanation (feel free to skip):
list 1, the rows where xval is the highest value, should be looking like this: 'FWS.Fill.factor' since that is the only row where xval has the highest score
list 2 is the list of all rows where xlog10val is the maximum value, and thus should contain the names of parameters where xlog10val is the maximum of that row:
'FWS.Range', 'FL.Red.Range', 'FL.Orange.Range',
'FL.Red.Gradient', 'FWS.Fill.factor'
and list 3 the rest of the names
I tried something like
df$Parameter[which(df$xval == max(df[ ,2:4]))]
but this gives integer(0) results.
EDIT
to clarify:
Let's start by looking at column 2 (xval).
PER row I need to test whether xval is the maximum of the 3 columns; xval, xlog10val, xcustomval
if this is the case, add the parameter in THAT row to the list of xval_is_the_max_of_3_columns list
Then we do the same PER row for xlog10val. IF xlog10val in row i is max of columns 2:4, add the name of that ROW to xlog10val_is_the_max_of_3_columns list.
To make the DF:
df <- data.frame(Parameter = c('FWS.Range', 'FL.Red.Range', 'FL.Orange.Range', 'FL.Yellow.Range', 'FL.Red.Gradient','SWS.Range','SWS.Length','FWS.Fill.factor','FL.Red.Total'),
xval = c(0.622333705577588,0.481959800402278,0.433380866119736,0.535549430820635,0.351945244290616,0.469328232931424,0.0292779051823701,0.93764311477813,0.224377540663707),
xlog10val = c( 0.974161367853916,0.985578135386898,0.97272429360688,0.902279501804112,0.990504657326703,0.94879549470406,0.45659620937997,0.803980592920426,0.965587334461157),
xcustomval = c(0.961906534164457,0.964320569400919,0.823986745004031,0.922340716468745,0.571822393107348,0.982531798077881,0.73093132928955,0,0.992330722386105))
We can use max.col to get the column index of the maximum value in each row, and then split 'Parameter' by that index:
i1 <- max.col(df[-1], 'first')
split(df$Parameter, i1)
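The attempt in the question returns nothing because max(df[ ,2:4]) is a single number, the largest value in the whole block, and no xval equals it. A row-wise comparison is needed instead. The sketch below is a variation on the max.col idea that labels each group with the column name rather than the index; it uses four rows from the question (values rounded), and the names vals, winner and by_winner are made up for illustration.

```r
df <- data.frame(
  Parameter  = c("FWS.Range", "SWS.Range", "FWS.Fill.factor", "FL.Red.Gradient"),
  xval       = c(0.622, 0.469, 0.938, 0.352),
  xlog10val  = c(0.974, 0.949, 0.804, 0.991),
  xcustomval = c(0.962, 0.983, 0.000, 0.572),
  stringsAsFactors = FALSE
)

vals <- df[-1]
# Name of the column holding each row's maximum (first column wins ties)
winner <- names(vals)[max.col(vals, ties.method = "first")]
# Fixing the factor levels keeps an (empty) entry for columns that never win
by_winner <- split(df$Parameter, factor(winner, levels = names(vals)))
by_winner
```

by_winner$xlog10val then holds the parameters whose row maximum is in xlog10val, and likewise for the other two columns.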
EDIT: Based on the discussion with @Mark
I'm not sure exactly how you're selecting the parameters for lists two and three; however, you can try something like this as well:
df$Parameter <- as.character(df$Parameter)
par.xval.max <- df[which.max(df$xval), "Parameter"]
par.col3.gt.max <- df[df$xlog10val > max(df$xval), "Parameter"]
par.rem <- df$Parameter[! df$Parameter %in% c(par.xval.max, par.col3.gt.max)]
Here the second group is every parameter whose column-three value (xlog10val) is greater than max(df$xval), and the remaining parameters are picked by negative selection using %in%.

Resampling in R

Consider the following data:
library(Benchmarking)
d <- data.frame(x1=c(200,200,3000), x2=c(200,200,1000), y=c(100,100,3))
So I have 3 observations.
Now I want to select 2 observations randomly out of d three times (without repetition; there are three combinations in total). For each of these three times I want to calculate the following:
e <- dea(d[c('x1', 'x2')], d$y)
weighted.mean(eff(e), d$y)
That is, I will get three numbers, which I want to average. Can someone show how to do this with a loop in R?
Example:
There are three combinations in total, so in this case I can only get one result. If I do the calculation manually, I get the following three results:
0.977 0.977 1
(The results could of course be in another order.)
And the mean of these three numbers is:
0.984
This is a simple example. In my real case there are many combinations and I do not use all of them (e.g. there could be 1,000,000 combinations, of which I select only 1,000).
I think it's better if you use sample.int and replicate instead of doing all the combinations, see my example:
nsample <- 2 # Number of selected observations
nboot <- 10 # Number of times you repeat the process
replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
I also checked the link you posted about this issue. If I understand correctly, you want to extract two rows (observations) at a time without replacement, which you can do with sample:
SelObs <- sample(1:nrow(d),2)
# for getting the selected observations just
dSel <- d[SelObs,]
And then do your calculations
If you want observations that were already selected to be excluded from the next random selection, the approach is similar, but you need to keep an index:
Obs <- 1:nrow(d)
SelObs <- sample(Obs, 2)
dSel <- d[SelObs, ]
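# and now, for removing those already selected (match by value, not by position,
# so this stays correct in later rounds when Obs is no longer 1:nrow(d))
Obs <- setdiff(Obs, SelObs)
# and keep going with next random selections and the above code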
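When the number of combinations is small enough to enumerate (the question mentions up to about 1,000,000), another option is to list them with combn and evaluate only a random subset, which guarantees that no combination is used twice. This is a minimal sketch: stat() is a hypothetical stand-in so the example runs without the Benchmarking package, and max_draws is an assumed cap; the real calculation would be weighted.mean(eff(dea(...)), y) on the selected rows.

```r
d <- data.frame(x1 = c(200, 200, 3000), x2 = c(200, 200, 1000), y = c(100, 100, 3))

# Hypothetical stand-in for the DEA-based statistic; replace with the real one
stat <- function(sub) weighted.mean(sub$y / sub$x1, sub$y)

k <- 2              # observations per draw
max_draws <- 1000   # assumed upper limit on how many combinations to evaluate

combos <- combn(nrow(d), k)   # each column is one set of row indices
if (ncol(combos) > max_draws) {
  # keep a random subset of distinct combinations
  combos <- combos[, sample.int(ncol(combos), max_draws), drop = FALSE]
}

results <- apply(combos, 2, function(idx) stat(d[idx, ]))
mean(results)
```

If the full set of combinations is too large to build in memory, the replicate/sample.int approach above is the more practical choice, at the cost of possibly drawing the same combination more than once.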

R chi-squared statistic for two different distribution

I have two file.dat (random1.dat and random2.dat) which are generated from a random uniform distribution (changing the seed):
http://www.filedropper.com/random1_1: random1.dat
http://www.filedropper.com/random2 : random2.dat
I would like to use R to run a chi-squared test to see whether the two distributions are statistically the same.
To do that I tried:
x1 <- read.table("random1.dat")
x2 <- read.table("random2.dat")
chisq.test(x1,x2)
but I receive an error message:
'x' and 'y' need to have the same length
The problem is that both files have 1000 rows, so I don't understand the error. Another question: if I want to automate this process (iterate it), for instance 100 times with 100 different files, can I write something like:
DO i=1,100
x1 <- read.table("random'(i)'.dat")
x2 <- read.table("fixedfile.dat")
chisq.test(x1,x2)
save results from the chisq analysis
END DO
Thanks so much for your help.
ADDED:
@eipi10,
I tried the first method you gave here and it works well for the data you generate. But when I try it on my data (a single file containing a 2-column matrix of 1000 rows, two uniform distributions generated with different seeds) something does not work correctly:
I load the file with: dat = read.table("random2col.dat");
I use the command: csq = lapply(dat[,-1], function(x) chisq.test(cbind(dat[,1],x))) and a warning message appears;
finally I use: unlist(lapply(csq, function(x) x$p.value)) but the output is something like:
[...] 1 1 1 1 1 1 1 1 1 1 1 1 1
[963] 1 1 1 1 1.....1 1 1 1
[1000] 1
I don't think you need to use a loop. You can use lapply instead. Also, you're entering x1 and x2 as separate columns of data. When you do this, chisq.test computes a contingency table from these two columns, which wouldn't be meaningful for columns of real numbers. Instead, you need to feed chisq.test a single matrix or data frame whose columns are x1 and x2. But even then, the chisq.test is expecting count data, which isn't what you have here (although the "expected" frequency doesn't necessarily have to be an integer). In any case, here's some code that will make the test run the way you seem to be hoping:
# Simulate data: 5 columns of data, each from the uniform distribution
dat = data.frame(replicate(5, runif(20)))
# Chi-Square test of each column against column 1.
# Note use of cbind to combine the two columns into a single matrix,
# rather than entering each column as separate arguments.
csq = lapply(dat[,-1], function(x) chisq.test(cbind(dat[,1],x)))
# Look at Chi-square stats and p-Values for each test
sapply(csq, function(x) x$statistic)
sapply(csq, function(x) x$p.value)
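One thing to watch with this pattern when the data has only two columns, as in the follow-up added to the question: dat[,-1] then drops to a plain numeric vector, so lapply runs one test per value instead of one test per column, which would explain getting 1000 p-values. The sketch below reproduces that with simulated data (the column names V1 and V2 mimic read.table defaults and are an assumption) and shows drop = FALSE as one way around it.

```r
set.seed(1)
dat <- data.frame(V1 = runif(1000), V2 = runif(1000))

# dat[, -1] is a numeric vector here, so lapply() visits 1000 single numbers
bad <- suppressWarnings(
  lapply(dat[, -1], function(x) chisq.test(cbind(dat[, 1], x)))
)
length(bad)    # one test per value

# drop = FALSE keeps a one-column data frame, so lapply() visits one column
good <- suppressWarnings(
  lapply(dat[, -1, drop = FALSE], function(x) chisq.test(cbind(dat[, 1], x)))
)
length(good)   # one test per remaining column
```

The warnings that are suppressed here are the usual "Chi-squared approximation may be incorrect" messages, which are expected when the inputs are not counts.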
On the other hand, if you were intending your data to be two streams of values that would then be converted into a contingency table, here's an example of that:
# Simulate data of 5 factor variables, each with 10 different levels
dat = data.frame(replicate(5, sample(c(1:10), 1000, replace=TRUE)))
# Chi-Square test of each column against column 1. Here the two columns of data are
# entered as separate arguments, so that chisq.test will convert them to a two-way
# contingency table before doing the test.
csq = lapply(dat[,-1], function(x) chisq.test(dat[,1],x))
# Look at Chi-square stats and p-Values for each test
sapply(csq, function(x) x$statistic)
sapply(csq, function(x) x$p.value)
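For the second part of the question (repeating the comparison over many files), here is a hedged sketch of how the loop could look in R. Because chisq.test expects counts, this example first bins each sample into ten equal-width intervals and then tests the resulting 2 x 10 table of counts; that binning step is a technique swapped in for illustration, not something from the answer above. The files are written to a temporary directory only so the example is self-contained; the names mirror those in the question, and three files stand in for the hundred.

```r
set.seed(1)
demo_dir <- tempfile("chisq_demo")
dir.create(demo_dir)

# Create example files (stand-ins for the real random*.dat and fixedfile.dat)
write.table(runif(1000), file.path(demo_dir, "fixedfile.dat"),
            row.names = FALSE, col.names = FALSE)
for (i in 1:3) {
  write.table(runif(1000), file.path(demo_dir, sprintf("random%d.dat", i)),
              row.names = FALSE, col.names = FALSE)
}

# Common bins so the two samples are counted on the same grid
breaks <- seq(0, 1, by = 0.1)
ref <- read.table(file.path(demo_dir, "fixedfile.dat"))[[1]]
ref_counts <- table(cut(ref, breaks, include.lowest = TRUE))

# One p-value per file; sprintf() builds the file name from the loop index
p_values <- sapply(1:3, function(i) {
  x <- read.table(file.path(demo_dir, sprintf("random%d.dat", i)))[[1]]
  counts <- table(cut(x, breaks, include.lowest = TRUE))
  chisq.test(rbind(ref_counts, counts))$p.value
})
p_values
```

The number and width of the bins are a judgment call and affect the result, so they should be chosen with the data in mind.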

Set values less than threshold to zero, with column-specific thresholds

I have two data frames. The first contains 165 columns (species names) and almost 193,000 rows; each cell holds a number from 0 to 1, the probability that the species is present in that cell.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
2 0.0279037 0.604687 0.0388309 0.0161980 0.0143966 0.240152
3 0.0294101 0.674846 0.0673055 0.0481405 0.0397423 0.231308
4 0.0292839 0.603869 0.0597947 0.0526606 0.0463431 0.188875
6 0.0331264 0.541165 0.0470451 0.0270871 0.0373348 0.256662
8 0.0393825 0.672371 0.0715808 0.0559353 0.0565391 0.230833
9 0.0376557 0.663732 0.0747417 0.0445794 0.0602539 0.229265
The second data frame contains 164 columns (the same species names as in the first data frame) and one row holding the threshold: above it we assume the species is present, and below it we assume the species is absent.
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
0.3155 0.2816 0.2579 0.2074 0.3007 0.3513 0.3514
What I want is a new data frame that, for every species in the presence-probability data (my.data), keeps the probability where it is above the threshold (thres) and contains zero where it is below the threshold.
I assume this needs a for loop and an if statement, but I am new to R and don't know how to do this.
Please help me.
I think you want something like this:
(Make up small reproducible example)
set.seed(101)
speciesdat <- data.frame(pointID=1:10,matrix(runif(100),ncol=10,
dimnames=list(NULL,LETTERS[1:10])))
threshdat <- data.frame(rbind(seq(0.1,1,by=0.1)))
Now process:
thresh <- unlist(threshdat) ## make data frame into a vector
## 'sweep' runs the function column-by-column if MARGIN=2
ss2 <- sweep(as.matrix(speciesdat[,-1]),MARGIN=2,STATS=thresh,
FUN=function(x,y) ifelse(x<y,0,x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID,ss2)
It's simpler to have the same number of columns (with the same meanings of course).
frame2 = data.frame(POINTID=0, frame2)
R works with vectors, so a row of frame1 can be compared directly to frame2
frame1[1,] < frame2
You could use an explicit loop over every row of frame1, but it's common to use the implicit loop of "apply"
answer = apply(frame1, 1, function(x) x < frame2)
This is a rather sloppy solution (especially changing frame2), but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices when possible (they can still use labels but are generally faster).
This produces a logical matrix which can be used to generate assignments with "[<-" (assuming the multi-row data frame is named "cols" and the named vector is "vec"):
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch of the number of columns with the length of the vector, but presumably you can adjust the length of the vector to be the correct number of entries.
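Since the two data frames in the question do not have exactly the same columns, one way to avoid the length mismatch is to match species by name before comparing. The sketch below assumes the thresholds are held in a named vector; the values, and the names probs, thres and result, are invented for illustration.

```r
probs <- data.frame(POINTID   = c(2, 3, 4),
                    Abie_Xbor = c(0.03, 0.40, 0.31),
                    Acer_Camp = c(0.60, 0.10, 0.28))
# Thresholds as a named vector; Acta_Spic has no matching column in probs
thres <- c(Abie_Xbor = 0.3155, Acer_Camp = 0.2816, Acta_Spic = 0.3514)

# Only species present in both objects, in the column order of probs
species <- intersect(names(probs), names(thres))

m <- as.matrix(probs[species])
# Compare every column with its own threshold, then zero out the values below it
below <- sweep(m, 2, thres[species], "<")
m[below] <- 0

result <- data.frame(POINTID = probs$POINTID, m)
result
```

Values exactly equal to the threshold are kept here; use "<=" in the sweep call if they should be set to zero instead.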

rowsum for matrix over specified number of columns in R

I'm trying to get the sum of columns in a matrix in R for a certain row. However, I don't want the whole row to be summed but only a specified number of columns i.e. in this case all column above the diagonal. I have tried sum and rowSums function but they are either giving me strange results or an error message. To illustrate, please see example code for an 8x8 matrix below. For the first row I need the sum of the row except item [1,1], for second row the sum except items [2,1] and [2,2] etc.
m1 <- matrix(c(0.2834803,0.6398198,0.0766999,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,
0.0000000,0.1101746,0.6354086,0.2544168,0.0000000,0.0000000,0.0000000,0.0000000,
0.0000000,0.0000000,0.0548145,0.9451855,0.0000000,0.0000000,0.0000000,0.0000000,
0.0000000,0.0000000,0.0000000,0.3614786,0.6385214,0.0000000,0.0000000,0.0000000,
0.0000000,0.0000000,0.0000000,0.0000000,0.5594658,0.4405342,0.0000000,0.0000000,
0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.7490395,0.2509605,0.0000000,
0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.5834363,0.4165637,
0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,1.0000000),
8, 8, byrow = TRUE,
dimnames = list(c("iAAA", "iAA", "iA", "iBBB", "iBB", "iB", "iCCC", "iD"),
c("iAAA_p", "iAA_p", "iA_p", "iBBB_p", "iBB_p", "iB_p", "iCCC_p", "iD_p")))
I have tried the following:
rowSums(m1[1, 2:8]) --> Error in rowSums(m1[1, 2:8]) :
'x' must be an array of at least two dimensions
Alternatively:
sum(m1[1,2]:m1[1,8]) --> wrong result of 0.6398198 (which is item [1,2])
As I understand rowSums needs an array rather than a vector (although not sure why). But I don't understand why the second way using sum doesn't work. Ideally, there is some way to only sum all columns in a row that lie above the diagonal.
Thanks a lot!
The problem is you are not passing an array to rowSums:
class(m1[1,2:8])
# [1] "numeric"
This is a numeric vector. Use more than a single row and it will work just fine:
class(m1[1:2,2:8])
# [1] "matrix" "array"   (just "matrix" before R 4.0.0)
rowSums(m1[1:2,2:8])
# iAAA iAA
#0.7165197 1.0000000
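As for why the sum() attempt returned only the first item: the colon operator in m1[1,2]:m1[1,8] builds a sequence running from the first value to the second in steps of 1, so with values between 0 and 1 the sequence contains just the starting value. The colon belongs inside the index instead. A small sketch on a made-up 3x3 matrix (m is not the asker's data):

```r
m <- matrix(c(0.2, 0.5, 0.3,
              0.0, 0.4, 0.6,
              0.0, 0.0, 1.0), 3, 3, byrow = TRUE)

m[1, 2]:m[1, 3]      # a sequence starting at 0.5; only 0.5 fits, so the sum is 0.5
sum(m[1, 2:3])       # index columns 2 to 3 of row 1, then sum them

# rowSums() also works on a single row if the matrix shape is kept
rowSums(m[1, 2:3, drop = FALSE])
```

drop = FALSE stops R from simplifying the one-row selection to a plain vector, which is what triggered the "at least two dimensions" error.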
If you want to sum all the columns that lie above the diagonal then you can use lower.tri to set all elements below the diagonal to 0 (or perhaps NA) and then use rowSums. If you do not want to include the diagonal elements themselves you can set diag = TRUE (thanks to @Fabio for pointing this out):
m1[lower.tri(m1 , diag = TRUE)] <- 0
rowSums(m1)
# iAAA iAA iA iBBB iBB iB iCCC iD
#0.7165197 0.8898254 0.9451855 0.6385214 0.4405342 0.2509605 0.4165637 0.0000000
# With 'NA'
m1[lower.tri(m1, diag = TRUE)] <- NA
rowSums(m1, na.rm = TRUE)
# iAAA iAA iA iBBB iBB iB iCCC iD
#0.7165197 0.8898254 0.9451855 0.6385214 0.4405342 0.2509605 0.4165637 0.0000000
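Both versions above overwrite m1. If the original matrix is still needed, a variation is to multiply by the logical mask from upper.tri, which leaves the input untouched. A minimal sketch on a made-up 3x3 matrix (m and above are illustrative names):

```r
m <- matrix(c(0.2, 0.5, 0.3,
              0.0, 0.4, 0.6,
              0.0, 0.0, 1.0), 3, 3, byrow = TRUE)

# upper.tri() is TRUE strictly above the diagonal; multiplying treats
# TRUE as 1 and FALSE as 0, so everything else becomes 0 in the copy
above <- m * upper.tri(m)
rowSums(above)

m   # unchanged
```

Passing diag = TRUE to upper.tri would include the diagonal in the sums.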
