Difference in cumulative sum based on specified values - R

I have a dataframe containing one column (depth, z) in which I am trying to find the difference in cumulative depth values at regular depth criteria. I would like to create a new dataframe with 3 columns: the criterion value, its respective cumulative depth value, and a third column with the difference between consecutive cumulative depths. For example:
z1<-c(1.2, 1.5, 0.8, 0.7, 1.6, 1.9, 1.1, 0.6, 1.3, 1.0)
z<-data.frame(z1)
crit1<-c(0.5,1,1.5,2)
# A loop comes to mind,
for(i in c(0.5,1,1.5,2)){
  print(sum(subset(z1, z1 <= i)))
} # But I get an error, because I cannot use integers
Error in FUN(X[[1L]], ...) :
only defined on a data frame with all numeric variables
Attempting with cumsum
cumsum(z1)[seq(0.5,2,by=0.5)] # Which doesn't work either
I would like to get a table like this:
Crit Cumulative Difference
0.5 0 0
1 3.1 3.1
1.5 8.2 5.1

Don't use a for loop here; use sapply instead, since you want to store the result.
y <- sapply(crit1, function(x) sum(z1[z1 <= x]))
d <- c(0, diff(y))
data.frame(Crit = crit1, Cumulative = y, Difference = d)
# Crit Cumulative Difference
# 1 0.5 0.0 0.0
# 2 1.0 3.1 3.1
# 3 1.5 8.2 5.1
# 4 2.0 11.7 3.5

You could try
Difference <- setNames(c(0, tapply(z1, cut(z1, breaks=crit1, labels=FALSE), FUN=sum)), NULL)
data.frame(Crit=crit1, Cumulative=cumsum(Difference), Difference)
# Crit Cumulative Difference
#1 0.5 0.0 0.0
#2 1.0 3.1 3.1
#3 1.5 8.2 5.1
#4 2.0 11.7 3.5
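As a quick check (my own, not part of either answer), the two approaches agree on this data because cut()'s default right-closed intervals (a, b] line up with the <= comparison used in the sapply version:
y <- sapply(crit1, function(x) sum(z1[z1 <= x]))
Difference <- setNames(c(0, tapply(z1, cut(z1, breaks=crit1, labels=FALSE), FUN=sum)), NULL)
all.equal(y, cumsum(Difference))
# [1] TRUE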

Related

Means/p-value table from t.test in R

Is there a way to extract the mean and p-value from a t.test output and create a table that includes the features, the mean, and the p-value? Say there are 10 columns put through t.test; that means there are 10 means and 10 p-values. How can I create a table that shows only those specific items?
Here is an example using the iris data:
a. b. c. d. e.
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
t.test(a)
t.test(b) # ...etc. We obtain the mean and p-value.
This is the output I'm looking for:
feature mean p-val
col1 0.01 0.95
col2 0.01 0.95
.
.
.
coln
hope it makes sense!
Using the iris built-in data set as an example:
t(sapply(iris[, 1:4], function(i){
  t.test(i)[c(5,3)]
}))
The sapply() call iteratively applies that custom function - which performs a t-test on a variable and returns the estimate and p-value - to columns 1 to 4 of iris. The result is then transposed with t() to rotate the data into the desired shape. You can store that as a data.frame using data.frame() and use row.names() to move the variable names into a new column if you like.
values <- t(sapply(iris[, 1:4], function(i){
  t.test(i)[c(5,3)]
}))
values <- data.frame("feature"=row.names(values), values)
row.names(values) <- NULL
values
Beware multiple testing though...
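If those p-values feed into a joint conclusion, one common follow-up (a sketch of my own, not part of the answer above) is to adjust them for multiple comparisons with p.adjust():
# Sketch: recompute the one-sample t-test p-values and adjust them
# (Holm's method here; any method supported by p.adjust() works)
pvals <- sapply(iris[, 1:4], function(i) t.test(i)$p.value)
p.adjust(pvals, method = "holm")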

Repeatedly apply a conditional summary to groups in a dataframe

I have a large dataframe that looks like this:
group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
...
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32
...
The dataframe is already sorted by group_id and then distance. I want to know an efficient dplyr or data.table equivalent of the following operations:
Within each group_id:
Let the unique and sorted values of distance within the current group_id be d_1, d_2, ..., d_n.
For each d in d_1, d_2, ..., d_n: compute some function f on all values of metric whose distance value is less than d. The function f is a custom user-defined function that takes in a vector and returns a scalar. Assume that the function f is well defined on an empty vector.
So, in the example above, the desired dataframe would look like:
group_id distance_less_than metric
1 1.1 f(empty vector)
1 1.7 f(0.85, 0.37)
1 2.3 f(0.85, 0.37, 0.93)
...
1 7.9 f(0.85, 0.37, 0.93, 0.45,...,0.29)
2 2.5 f(empty vector)
2 2.8 f(0.78)
...
Notice how distance values can be repeated, like the value 1.1 under group 1. In such cases, both rows are excluded when computing f for d = 1.1 (which here results in an empty vector).
A possible approach is to use the non-equi join available in data.table. The left table is the unique set of combinations of group_id and distance, and it is joined to all rows whose distance is less than the left table's distance.
f <- sum
DT[unique(DT, by=c("group_id", "distance")), on=.(group_id, distance<distance), allow.cartesian=TRUE,
   f(metric), by=.EACHI]
output:
group_id distance V1
1: 1 1.1 NA
2: 1 1.7 1.22
3: 1 2.3 2.15
4: 1 6.3 2.60
5: 1 7.9 2.89
6: 2 2.5 NA
7: 2 2.8 0.78
data:
library(data.table)
DT <- fread("group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32")
I don't think this would be faster than the data.table option, but here is one way using dplyr:
library(dplyr)
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .])))
where f is your function. map_dbl expects the function to return a double; if your function returns a different type, use map_int, map_chr, or the like.
If you want to keep only one entry per distance, you can remove the duplicates using filter and duplicated:
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .]))) %>%
  filter(!duplicated(distance))
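For a concrete run, here is a minimal sketch of my own (not part of the answer) using f <- sum on the same sample data as in the data.table answer; note that sum() over an empty vector gives 0 rather than the NA produced by the non-equi join above:
library(dplyr)
library(purrr)
f <- sum  # example function; substitute your own
df <- data.frame(
  group_id = c(1, 1, 1, 1, 1, 1, 2, 2),
  distance = c(1.1, 1.1, 1.7, 2.3, 6.3, 7.9, 2.5, 2.8),
  metric   = c(0.85, 0.37, 0.93, 0.45, 0.29, 0.12, 0.78, 0.32)
)
df %>%
  group_by(group_id) %>%
  mutate(new = map_dbl(distance, ~f(metric[distance < .]))) %>%
  filter(!duplicated(distance))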

R - How to create a new column in a dataframe with calculations based on a condition in another column

In a project, I measured the iodine concentration of tumors (column=ROI_IC) at different off-center positions, i.e. table heights (column=Offcenter), in a CT scanner. I know the true concentration of each of the tumors (column=Real_IC; there are 4 different tumors with 4 different Real_IC concentrations). Each tumor was measured at each off-center position 10 times (column=Measurement_repeat). I calculated the absolute error between the measured and the real iodine concentration (column=absError_IC).
This is just the head of the data:
Offcenter Measurement_repeat Real_IC ROI_IC absError_IC
1 0 1 0.0 0.4 0.4
2 0 2 0.0 0.3 0.3
3 0 3 0.0 0.3 0.3
4 0 4 0.0 0.0 0.0
5 0 5 0.0 0.0 0.0
6 0 6 0.0 -0.1 0.1
7 0 7 0.0 -0.2 0.2
8 0 8 0.0 -0.2 0.2
9 0 9 0.0 -0.1 0.1
10 0 10 0.0 0.0 0.0
11 0 1 0.4 0.4 0.0
12 0 2 0.4 0.3 0.1
13 0 3 0.4 0.2 0.2
14 0 4 0.4 0.0 0.4
15 0 5 0.4 0.0 0.4
16 0 6 0.4 -0.1 0.5
17 0 7 0.4 0.1 0.3
18 0 8 0.4 0.3 0.1
19 0 9 0.4 0.6 0.2
20 0 10 0.4 0.7 0.3
Now I would like to create a new column called corrError_IC.
In this column, the measured iodine concentration (ROI_IC) should be corrected based on the mean absolute error (mean of 10 measurements) found for that specific Real_IC concentration at Offcenter = 0.
Because there are 4 tumor concentrations, there are 4 mean values at Offcenter = 0 that I want to apply to the other off-center positions.
mean1 = mean of the 10 absError_IC measurements for `Real_IC=0`
mean2 = mean of the 10 absError_IC measurements for `Real_IC=0.4`
mean3 = mean of the 10 absError_IC measurements for `Real_IC=3`
mean4 = mean of the 10 absError_IC measurements for `Real_IC=5`
Basically, I want the average absolute error for a specific tumor at Offcenter = 0 (there are 4 different tumor types with four different Real_IC values), and then I want to correct all tumors at the other Offcenter positions by the absolute error values derived from the Offcenter = 0 data.
I tried ifelse statements but I was not able to figure it out.
EDIT: Off-center has specific levels: c(-6,-4,-3,-2,-1,0,1,2,3,4,6)
Here is how I would approach this problem:
Compute the mean of absError_IC grouped by Real_IC.
Left join the original data.frame with the grouped mean.
Code Example
## replicate sample data sets
ROI_IC = c(0.4, 0.3, 0.3, 0.0, 0.0, -0.1, -0.2, -0.2, -0.1, 0.0,
           0.4, 0.3, 0.2, 0.0, 0.0, -0.1, 0.1, 0.3, 0.6, 0.7)
df = data.frame("Offcenter"=rep(0, 40),
                "Measurement_repeat"=rep(c(1:10), 4),
                "Real_IC"=rep(c(0, 0.4, 3, 5), each=10),
                "ROI_IC"=rep(ROI_IC, 2),
                stringsAsFactors=F)
df$absError_IC = abs(df$Real_IC - df$ROI_IC)
## compute mean of "absError_IC" grouped by "Real_IC"
mean_values = aggregate(df[df$Offcenter==0, c("absError_IC")],
                        by=list("Real_IC"=df$Real_IC),
                        FUN=mean)
names(mean_values)[which(names(mean_values)=="x")] = "MAE"
## left join to append column
df = merge(df, mean_values, by.x="Real_IC", by.y="Real_IC", all.x=T, all.y=F, sort=F)
## notice that column order shifts based on "key"
df[c(1:5, 10:15), ]
I suggest using the data.table package, which is particularly useful when there is a need to manipulate large data.
library(data.table)
## dt = data.table(df) or dt = fread(<path>)
## dt[dt$Offcenter==0, c("absError_IC") := abs(dt$Real_IC - dt$ROI_IC)]
## compute grouped mean
mean_values = dt[, j=list("MAE"=mean(absError_IC)), by=list(Real_IC)]
## left join
dt = merge(dt, mean_values, by.x="Real_IC", by.y="Real_IC", all.x=T, all.y=F, sort=F)
Consider ave for inline aggregation: its first argument is the numeric quantity field, the following arguments are grouping fields, and the very last argument, which must be passed by name as FUN, is the function to apply: ave(num_vector, ..., FUN=func).
df$corrError_IC <- with(df, ave(absError_IC, Real_IC, FUN=mean))
To handle NAs, extend the function passed to FUN with the na.rm argument:
df$corrError_IC <- with(df, ave(absError_IC, Real_IC, FUN=function(x) mean(x, na.rm=TRUE)))
I found a way to compute what I want by creating an extra column that takes the average absolute errors of the 4 Real_IC levels at Offcenter = 0 and matches them wherever Real_IC has that level.
In a second step, I subtract these from the ROI_ICs. However, how can I simplify that code into a more general form (at the moment I calculate the average absolute errors based on their row location)? Sorry, I am an absolute beginner ;(
Of note: My data.frame is called "ds_M"
#Define absolute errors for the 4 Real_IC levels as variables
average1<-mean(ds_M$absError_IC[1:10]) #for Real_IC=0
average2<-mean(ds_M$absError_IC[11:20]) #for Real_IC=0.4
average3<-mean(ds_M$absError_IC[21:30]) #for Real_IC=3
average4<-mean(ds_M$absError_IC[31:40]) #for Real_IC=5
# New column assigning the correction factor to each Real_IC level
ds_M$absCorr[ds_M$Real_IC==0]<-average1
ds_M$absCorr[ds_M$Real_IC==0.4]<-average2
ds_M$absCorr[ds_M$Real_IC==3]<-average3
ds_M$absCorr[ds_M$Real_IC==5]<-average4
# Calculate new column with corrected ROI_ICs
ds_M$corrError_IC<-ds_M$ROI_IC - ds_M$absCorr
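A more general form of the same idea (a sketch building on the aggregate/merge answer above, so no row positions are hard-coded) is to compute the per-Real_IC means from the Offcenter == 0 rows only and match them back by Real_IC:
# Sketch: mean absError_IC per Real_IC at Offcenter == 0, matched back to every row
ref <- aggregate(absError_IC ~ Real_IC, data = subset(ds_M, Offcenter == 0), FUN = mean)
ds_M$absCorr <- ref$absError_IC[match(ds_M$Real_IC, ref$Real_IC)]
ds_M$corrError_IC <- ds_M$ROI_IC - ds_M$absCorr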

R - reduce dimensionality for LSA

I am following an example of svd, but I still don't know how to reduce the dimension of the final matrix:
a <- round(runif(10)*100)
dat <- as.matrix(iris[a,-5])
rownames(dat) <- c(1:10)
s <- svd(dat)
pc.use <- 1
recon <- s$u[,pc.use] %*% diag(s$d[pc.use], length(pc.use), length(pc.use)) %*% t(s$v[,pc.use])
But recon still has the same dimensions. I need to use this for latent semantic analysis.
The code you provided does not reduce the dimensionality. Instead, it takes the first principal component of your data, removes the remaining principal components, and then reconstructs the data from only that one PC.
You can check that this is happening by inspecting the rank of the final matrix:
library(Matrix)
rankMatrix(dat)
as.numeric(rankMatrix(dat))
[1] 4
as.numeric(rankMatrix(recon))
[1] 1
If you want to reduce dimensionality (the number of rows, i.e. the number of features once the matrix is transposed as below), you can select some principal components and compute the scores of your data on those components instead.
But first let's make some things clear about your data - it seems you have 10 samples (rows) with 4 features (columns). Dimensionality reduction will reduce the 4 features to a smaller set of features.
So you can start by transposing your matrix for svd():
dat <- t(dat)
dat
1 2 3 4 5 6 7 8 9 10
Sepal.Length 6.7 6.1 5.8 5.1 6.1 5.1 4.8 5.2 6.1 5.7
Sepal.Width 3.1 2.8 4.0 3.8 3.0 3.7 3.0 4.1 2.8 3.8
Petal.Length 4.4 4.0 1.2 1.5 4.6 1.5 1.4 1.5 4.7 1.7
Petal.Width 1.4 1.3 0.2 0.3 1.4 0.4 0.1 0.1 1.2 0.3
Now you can repeat the svd. Centering the data before this procedure is advisable:
s <- svd(dat - rowMeans(dat))
Principal components can be obtained by projecting your data onto PCs.
PCs <- t(s$u) %*% dat
Now if you want to reduce dimensionality by eliminating PCs with low variance you can do so like this:
dat2 <- PCs[1:2,] # would select first two PCs.
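Putting the steps together, here is a minimal end-to-end sketch of my own (using sample() with a seed for the row indices instead of the rounded runif() in the question) showing the feature dimension dropping from 4 to 2:
# Sketch: 10 iris rows, features as rows, SVD on centered data, keep 2 PCs
set.seed(1)
a <- sample(nrow(iris), 10)
dat <- t(as.matrix(iris[a, -5]))   # 4 features x 10 samples
s <- svd(dat - rowMeans(dat))
PCs <- t(s$u) %*% dat              # scores, still 4 x 10
dat2 <- PCs[1:2, ]                 # keep the first two PCs
dim(dat)
# [1]  4 10
dim(dat2)
# [1]  2 10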

R filtering a dataframe for a proportion of columns meeting criteria

I'm sure the answer to this question is out there already, but I can't find it, since I'm a beginner at R and don't know what search terms to use.
I want to retrieve the rows in a data frame where a given proportion of the columns meets a criterion, for example 2/3 of columns > 1.3.
Here is what I have so far:
a<-c(1.1,1.2,1.3,1.4,1.5)
b<-c(1.3,1.4,1.5,1.6,1.7)
c<-c(1.5,1.6,1.7,1.8,1.9)
data<-data.frame(a,b,c)
data
a b c
1 1.1 1.3 1.5
2 1.2 1.4 1.6
3 1.3 1.5 1.7
4 1.4 1.6 1.8
5 1.5 1.7 1.9
c<-function(x) (length(x[(x>1.4)]))>=(2/3*ncol(data))
d<-apply(data,1,c)
result<-data[d,]
result
a b c
3 1.3 1.5 1.7
4 1.4 1.6 1.8
5 1.5 1.7 1.9
This works, but I feel like there must be a simpler way, or that the function could be written differently? I'm still trying to properly understand this whole function thing.
Of course, in reality my dataframe would have a lot of columns.
/Grateful beginner
Maybe (this should be more efficient, as rowSums is vectorized and avoids the need for an apply loop):
data[rowSums(data > 1.4) >= 2/3*ncol(data),]
## a b c
## 3 1.3 1.5 1.7
## 4 1.4 1.6 1.8
## 5 1.5 1.7 1.9
Or if you prefer a function, could try
myfunc <- function(x) x[rowSums(x > 1.4) >= 2/3*ncol(x), ]
myfunc(data)
## a b c
## 3 1.3 1.5 1.7
## 4 1.4 1.6 1.8
## 5 1.5 1.7 1.9
Just to give another alternative to David's answer: you can use the mean function on a vector of logical values in R to get the proportion of TRUE values in the vector.
Create the data
a<-c(1.1, 1.2, 1.3, 1.4, 1.5)
b<-c(1.3, 1.4, 1.5, 1.6, 1.7)
c<-c(1.5, 1.6, 1.7, 1.8, 1.9)
data<-data.frame(a, b, c)
A function to return a logical vector indicating if the values are above the threshold
gt <- function(x, threshold){
  tmp <- x > threshold
  return(tmp)
}
An example using the first row of the data.frame
gt(data[1,], 1.4)
If you take the sum of the logical vector it returns the number of TRUE instances:
sum(gt(data[1,], 1.4))
# [1] 1
and if you use the mean function it returns the proportion of TRUE instances:
mean(gt(data[1,], 1.4))
# [1] 0.3333333
Using that you can use David's approach:
index <- apply(data,1, function(x) sum(gt(x, 1.4)) >= 2/3 * length(x))
or you can use the percentage via the mean function.
index <- apply(data,1, function(x) mean(gt(x, 1.4)) > 0.6)
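For completeness (my addition, not part of the original answer), either index can then be used to subset the data frame and reproduce the earlier result:
index <- apply(data, 1, function(x) mean(gt(x, 1.4)) >= 2/3)
data[index, ]
#     a   b   c
# 3 1.3 1.5 1.7
# 4 1.4 1.6 1.8
# 5 1.5 1.7 1.9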
