I apologize in advance if this has been asked before, or if I have missed something obvious.
I have two data sets, 'olddata' and 'newdata'
set.seed(0)
olddata <- data.frame(x = rnorm(10, 0,5), y = runif(10, 0, 5), z = runif(10,-10,10))
newdata <- data.frame(x = -5:5, z = -5:5)
I create a model from the old data, and want to predict values from the new data
mymodel <- lm(y ~ x+z, data = olddata)
predict.lm(mymodel, newdata)
However, I'd like to restrict the range of variables in 'newdata' to the range of variables on which the model was trained.
of course I could do this:
newnewdata <- subset(newdata,
                     x < max(olddata$x) & x > min(olddata$x) &
                     z < max(olddata$z) & z > min(olddata$z))
But this gets intractable over many dimensions. Is there a less repetitive way to do this?
It seems that all the values in your newdata are already within the appropriate ranges, so there's nothing there to subset. If we expand the ranges of newdata:
set.seed(0)
olddata <- data.frame(x = rnorm(10, 0,5), y = runif(10, 0, 5), z = runif(10,-10,10))
newdata <- data.frame(x = -10:10, z = -10:10)
newdata
x z
1 -10 -10
2 -9 -9
3 -8 -8
4 -7 -7
5 -6 -6
6 -5 -5
7 -4 -4
8 -3 -3
9 -2 -2
10 -1 -1
11 0 0
12 1 1
13 2 2
14 3 3
15 4 4
16 5 5
17 6 6
18 7 7
19 8 8
20 9 9
21 10 10
Then all we need to do is identify the ranges for each variable of olddata and then loop through as many iterations of subset as newdata has columns:
ranges <- sapply(olddata, range, na.rm = TRUE)
for (i in 1:ncol(newdata)) {
  col_name <- colnames(newdata)[i]
  newdata <- subset(newdata,
                    newdata[, col_name] >= ranges[1, col_name] &
                    newdata[, col_name] <= ranges[2, col_name])
}
newdata
x z
4 -7 -7
5 -6 -6
6 -5 -5
7 -4 -4
8 -3 -3
9 -2 -2
10 -1 -1
11 0 0
12 1 1
13 2 2
14 3 3
15 4 4
16 5 5
17 6 6
Here is an approach using the *apply family (using SchaunW's newdata):
set.seed(0)
olddata <- data.frame(x = rnorm(10, 0, 5), y = runif(10, 0, 5), z = runif(10,-10,10))
newdata <- data.frame(x = -10:10, z = -10:10)
minmax <- sapply(olddata[-2], range)
newdata[apply(newdata, 1, function(a) all(a > minmax[1,] & a < minmax[2,])), ]
Some care is required because I have assumed the columns of olddata (after dropping the second column) match those of newdata.
Brevity comes at the cost of speed. After increasing nrow(newdata) to 2000 to emphasize the difference, I found:
test replications elapsed relative user.self sys.self user.child sys.child
1 orizon() 100 2.193 27.759 2.191 0.002 0 0
2 SchaunW() 100 0.079 1.000 0.075 0.004 0 0
My guess at the main cause is that repeated subsetting avoids testing whether rows meet the criteria examined after they are excluded.
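For reference, here is a sketch of how such a timing comparison might be set up with the rbenchmark package; the wrappers orizon() and SchaunW() are assumed names for the two approaches above, not code from the original posts:
library(rbenchmark)
# assumed wrapper around the *apply approach above
orizon <- function() {
  newdata[apply(newdata, 1, function(a) all(a > minmax[1, ] & a < minmax[2, ])), ]
}
# assumed wrapper around the repeated-subset approach above
SchaunW <- function() {
  out <- newdata
  for (i in 1:ncol(out)) {
    nm <- colnames(out)[i]
    out <- out[out[, nm] >= ranges[1, nm] & out[, nm] <= ranges[2, nm], ]
  }
  out
}
benchmark(orizon(), SchaunW(), replications = 100)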
Related
The task is to multiply all negative numbers by 10 in 'df'.
So far I am only able to multiply everything by 10; when I add an if-statement, everything stops working.
The data frame df is built as follows:
x <- c('a', 'b', 'c', 'd', 'e')
y <- c(-4,-2,0,2,4)
z <- c(3, 4, -5, 6, -8)
# Join the variables to create a data frame
df <- data.frame(x,y,z)
df
##
x y z
1 a -4 3
2 b -2 4
3 c 0 -5
4 d 2 6
5 e 4 -8
My code so far:
df2 <- df
df2
for (i in 2:ncol(df2)) {
  df2[, i] <- df2[, i] * 10
}
df2
One base R option exploits the fact that df[-1] < 0 is a logical matrix, and 10^TRUE is 10 while 10^FALSE is 1, so negative entries are multiplied by 10 and everything else by 1:
cbind(df[1], 10^(df[-1] < 0) * df[-1])
x y z
1 a -40 3
2 b -20 4
3 c 0 -50
4 d 2 6
5 e 4 -80
You can achieve this using the dplyr functions mutate and across.
library(dplyr) # install if required
df %>%
  mutate(across(-x, ~ ifelse(. < 0, 10 * ., .)))
This says "for all columns except x, multiply by 10 where the value is < 0, otherwise leave the value as is".
Result:
x y z
1 a -40 3
2 b -20 4
3 c 0 -50
4 d 2 6
5 e 4 -80
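For comparison, a minimal base-R sketch of the same logic, assuming the df defined above; ifelse() is applied to every column except x:
df2 <- df
cols <- names(df2)[-1]  # all columns except x
df2[cols] <- lapply(df2[cols], function(col) ifelse(col < 0, 10 * col, col))
df2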
I have a dataframe with the following structure:
> str(data_l)
'data.frame': 800 obs. of 5 variables:
$ Participant: int 1 2 3 4 5 6 7 8 9 10 ...
$ Temperature: Factor w/ 4 levels "35","37","39",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Region : Factor w/ 5 levels "Eyes","Front",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Time : Factor w/ 5 levels "0","15","30",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Rating : num 5 5 5 4 5 5 5 5 5 5 ...
I want to run a one-sample t-test for each combination of the levels of all factors, for a total of 4*5*5 = 100 t-tests, with Rating as the dependent variable (y).
I am stuck at looping through the combinations and performing a t-test for each one. I tried splitting the dataframe by the factors and then running t.test() over the list with lapply(), but to no avail.
Does anyone have a better approach? Cheers!
Edit
My ultimate intention is to calculate a confidence interval for each combination of factor levels. For instance, I was able to do this:
subset1 <- data_l$Rating[data_l$Temperature == 35 & data_l$Region == "Front" & data_l$Time == 0]
Then,
t.test(subset1)$conf.int
But the problem is I will have to do this 100 times.
Edit 2
I am recreating the dataframe.
Temperature <- rep(seq(35, 41, 2), 10)
Region <- rep(c("Front", "Back", "Eyes", "Left", "Right"), 8)
Time <- rep(seq(0, 60, 15), 8)
Rating <- sample(1:5, 40, replace = TRUE)
data_l <- data.frame(Region = factor(Region), Temperature = factor(Temperature), Time = factor(Time), Rating = as.numeric(Rating))
Two things.
Can this be done? Certainly. Should it? Many of your combinations may have insufficient data to find a reasonable confidence interval. While your data sample is certainly reduced and simplified, I have no assurance that every combination of your factor levels is sufficiently populated.
table(sapply(split(data_l$Rating, data_l[,c("Temperature","Region","Time")]), length))
# 0 2
# 80 20
(There are 80 "empty" combinations of your factor levels.)
Let's try this:
outs <- aggregate(data_l$Rating, data_l[, c("Temperature", "Region", "Time")],
                  function(x) if (length(unique(x)) > 1) t.test(x)$conf.int else c(NA, NA))
nrow(outs)
# [1] 20
head(outs)
# Temperature Region Time x.1 x.2
# 1 35 Front 0 NA NA
# 2 37 Front 0 -9.706205 15.706205
# 3 39 Front 0 -2.853102 9.853102
# 4 41 Front 0 -15.559307 22.559307
# 5 35 Back 15 -15.559307 22.559307
# 6 37 Back 15 -4.853102 7.853102
Realize that this is not five columns; the fourth is really a matrix embedded in a frame column:
head(outs$x)
# [,1] [,2]
# [1,] NA NA
# [2,] -9.706205 15.706205
# [3,] -2.853102 9.853102
# [4,] -15.559307 22.559307
# [5,] -15.559307 22.559307
# [6,] -4.853102 7.853102
It's easy enough to extract:
outs$conf1 <- outs$x[,1]
outs$conf2 <- outs$x[,2]
outs$x <- NULL
head(outs)
# Temperature Region Time conf1 conf2
# 1 35 Front 0 NA NA
# 2 37 Front 0 -9.706205 15.706205
# 3 39 Front 0 -2.853102 9.853102
# 4 41 Front 0 -15.559307 22.559307
# 5 35 Back 15 -15.559307 22.559307
# 6 37 Back 15 -4.853102 7.853102
(If you're wondering why I have a conditional on length(unique(x)) > 1, then see what happens without it:
aggregate(data_l$Rating, data_l[, c("Temperature", "Region", "Time")],
          function(x) t.test(x)$conf.int)
# Error in t.test.default(x) : data are essentially constant
This is because there are combinations with empty data. You'll likely see something similar with non-empty but still invariant data.)
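Since the question mentions trying split() plus lapply(), here is a sketch of that route under the same caveat: the empty or constant groups must be dropped before t.test() is called.
grps <- split(data_l$Rating, data_l[, c("Temperature", "Region", "Time")])
grps <- grps[sapply(grps, function(x) length(unique(x)) > 1)]  # drop empty/constant combos
cis <- t(sapply(grps, function(x) t.test(x)$conf.int))
colnames(cis) <- c("conf1", "conf2")
head(cis)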
"I am stuck at looping through the combinations, and performing t-test at each combo."
I'm not sure if this is what you wanted.
N <- 800
df <- data.frame(Participant = 1:N,
                 Temperature = gl(4, 200),
                 Region = sample(1:5, 800, TRUE),
                 Time = sample(1:5, 800, TRUE),
                 Rating = sample(1:5, 800, TRUE))
head(df)
t_test <- function(data, y, x){
  x <- eval(substitute(x), data)
  y <- eval(substitute(y), data)
  comb <- combn(levels(x), m = 2)  # this gives all pair-wise combinations of levels
  n <- dim(comb)[2]
  t <- vector(n, mode = "list")
  for (i in 1:n) {
    xlevs <- comb[, i]
    DATA <- subset(data, subset = x %in% xlevs)
    x2 <- factor(x, levels = xlevs)
    tt <- t.test(y ~ x2, data = DATA)
    t[[i]] <- tt
    names(t)[i] <- toString(xlevs)
  }
  t
}
T.test <- t_test(df, Rating, Temperature)
T.test[1]
$`1, 2`
Welch Two Sample t-test
data: y by x2
t = -1.0271, df = 396.87, p-value = 0.305
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4079762 0.1279762
sample estimates:
mean in group 1 mean in group 2
2.85 2.99
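As a small follow-up, individual pieces of each htest object can be pulled out of the returned list, for instance (using the T.test list created above):
sapply(T.test, function(tt) tt$p.value)   # p-value of every pairwise comparison
sapply(T.test, function(tt) tt$conf.int)  # confidence intervals, one column per pair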
I have a data frame where I need to apply a formula to create new columns. The catch is, I need to calculate these numbers one row at a time. For example:
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
I now need to create columns 'c' and 'd' as follows. Column 'c' has its row 1 value fixed at 5, but from row 2 onwards 'c' is calculated as c from the previous row minus b from the previous row. Column 'd' has its row 1 value fixed at 10, but from row 2 onwards 'd' is calculated as 'c' from the current row minus 'd' from the previous row.
I want my output to look like this:
A B C D
1 21 5 10
2 22 -16 -26
3 23 -38 -12
And so on. My actual data has over 1000 rows and 18 columns. For every row, 5 of the column values come from different columns of the previous row only (no other rows). And the rest of the column values are calculated from these newly calculated row values. I am quite at a loss in creating a formula that will apply my formulae to each row, calculate values for that row and then move to the next row. I know that I have simplified the problem a bit here, but this captures the essence of what I am attempting.
This is what I attempted:
library(data.table)  # shift() comes from data.table
df <- within(df, {
  v1 <- shift(c)
  v2 <- shift(d)
  c <- v1 - shift(b)
  d <- c - v2
})
However, I need to apply this only from row 2 onwards and I have no idea how to do that. Because of that, I get something like this:
a b c d v2 v1
1 21 NA NA NA NA
2 22 4 -6 10 5
3 23 4 -6 10 5
I only get these values repeatedly for c, and d (4, -6, 10, 5).
Thank you for your help.
df <- data.frame(a = 1:10, b = 21:30, c = 5:-4, d = 10)
for (i in 2:nrow(df)) {
  df[i, "c"] <- df[i - 1, "c"] - df[i - 1, "b"]
  df[i, "d"] <- df[i, "c"] - df[i - 1, "d"]
}
df[1:3, ]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
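If the loop proves slow over many rows, this particular recursion can also be unrolled without one; a sketch, assuming the same df: c has a closed form via cumsum(), and d can be threaded through Reduce():
df <- data.frame(a = 1:10, b = 21:30)
df$c <- 5 - c(0, cumsum(head(df$b, -1)))  # c[i] = c[i-1] - b[i-1]
df$d <- Reduce(function(d_prev, c_cur) c_cur - d_prev,
               df$c[-1], init = 10, accumulate = TRUE)  # d[i] = c[i] - d[i-1]
df[1:3, ]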
Edit: adapting to your comment
# Let's define the coefficients of the equations into a dataframe
equation1 <- c("c", 0, 0, 0, 0, 0, -1, 1, 0) # c (from previous row) - b(from previous row)
equation2 <- c("d", 0, 0, 1, 0, 0, 0, 0, -1) # d is calculated as 'c' from R2 - d from previous row
equations <- data.frame(rbind(equation1,equation2), stringsAsFactors = F)
names(equations) <- c("y","a","b","c","d","a_previous","b_previous","c_previous","d_previous")
equations
# y a b c d a_previous b_previous c_previous d_previous
# "c" 0 0 0 0 0 -1 1 0
# "d" 0 0 1 0 0 0 0 -1
# define a function to multiply the rows of the dataframes
sumProd <- function(vect1, vect2) {
  return(as.numeric(as.numeric(vect1) %*% as.numeric(vect2)))
}
# Apply the formulas to the original dataframe
for (i in 2:nrow(df)) {
  for (e in 1:nrow(equations)) {
    df[i, equations[e, 'y']] <- sumProd(equations[e, c('a','b','c','d')], df[i, c('a','b','c','d')]) +
      sumProd(equations[e, paste0(c('a','b','c','d'), '_previous')], df[i - 1, c('a','b','c','d')])
  }
}
df[1:3,]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
It might not be the most elegant way to do it with a for loop, but it works.
Your column c sounds like a simple sequence to me. This is what I would do:
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
# Use a simple sequence for c
df$c <- seq(5, 5 - (dim(df)[1] - 1))
# Use a for loop to calculate d for every row after the first
for (i in 2:length(df$d)) {
  df$d[i] <- df$c[i] - df$d[i - 1]
}
> df
a b c d
1 1 21 5 10
2 2 22 4 -6
3 3 23 3 9
4 4 24 2 -7
5 5 25 1 8
6 6 26 0 -8
7 7 27 -1 7
8 8 28 -2 -9
9 9 29 -3 6
10 10 30 -4 -10
I have a dataframe as follows:
chr leftPos TBGGT 12_try 324Gtt AMN2
1 24352 34 43 19 43
1 53534 2 1 -1 -9
2 34 -15 7 -9 -18
3 3443 -100 -4 4 -9
3 3445 -100 -1 6 -1
3 3667 5 -5 9 5
3 7882 -8 -9 1 3
I have to create a loop which:
a) Calculates the upper and lower limit (UL and LL) for each column from the third column onwards.
b) Only includes rows that fall outside of the UL and LL (Zoutliers).
c) Then counts the number of rows where the Zoutlier is in the same direction (i.e. positive or negative) as the previous or the subsequent row for the same chr.
The output would therefore be:
ZScore1 TBGGT 12_try 324Gtt AMN2
nrow 4 6 4 4
So far I have code as follows:
library(data.table)  # v1.9.5
f1 <- function(df, ZCol){
  # A) Determine the UL and LL and then generate the Zoutliers
  UL = median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
  LL = median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
  Zoutliers <- which(ZCol > UL | ZCol < LL)
  # B) Exclude Zoutliers per chr if same direction as previous or subsequent row
  na.omit(as.data.table(df)[, {tmp = sign(eval(as.name(ZCol)))
          .SD[tmp == shift(tmp) | tmp == shift(tmp, type = 'lead')]},
          by = chr])[, list(.N)]
}
nm1 <- names(df)[-(1:2)]  # the score columns
setnames(do.call(cbind, lapply(nm1, function(x) f1(df, x))), nm1)[]
The code is patched together from various places. The problem I have is combining parts A) and B) of the code to get the output I want.
Can you try this function? I was not sure what alpha is, so I could not reproduce the expected output and included it as a variable in the function.
# read your data per copy&paste
d <- read.table("clipboard",header = T)
# or, as mentioned in Frank's comment, via fread
d <- data.table::fread("chr leftPos TBGGT 12_try 324Gtt AMN2
1 24352 34 43 19 43
1 53534 2 1 -1 -9
2 34 -15 7 -9 -18
3 3443 -100 -4 4 -9
3 3445 -100 -1 6 -1
3 3667 5 -5 9 5
3 7882 -8 -9 1 3")
# set up the function
foo <- function(x, alpha, chr){
  # your code for tasks a) and b)
  UL = median(x, na.rm = TRUE) + alpha*IQR(x, na.rm = TRUE)
  LL = median(x, na.rm = TRUE) - alpha*IQR(x, na.rm = TRUE)
  Zoutliers <- which(x > UL | x < LL)
  # part c)
  # factor which specifies the direction; 0 values are counted as positive
  pos_neg <- ifelse(x[Zoutliers] >= 0, "positive", "negative")
  # count the occurrences per chromosome and direction
  aggregate(x[Zoutliers], list(chr[Zoutliers], pos_neg), length)
}
# apply over the columns to get a list of dataframes with the number of outliers per chr and direction
apply(d[,3:ncol(d)], 2, foo, 0.95, d$chr)
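If only a total count per score column is wanted, as in the question's desired output, the list returned by apply() can be collapsed; a sketch, assuming the foo() defined above:
res <- apply(d[, 3:ncol(d)], 2, foo, 0.95, d$chr)
sapply(res, function(df_i) sum(df_i$x))  # total outlier rows per column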
I have a problem where I have a bunch of lengths and want to start at the origin (pretend I'm facing the positive end of the y axis). I make a right turn and move along the positive x axis for the distance length_i. At that point I make another right turn, walk the next length, and repeat n times. I can do this, but I think there's a more efficient way to do it, and I lack the math background:
## Fake Data
set.seed(11)
dat <- data.frame(id = LETTERS[1:6], lens=sample(2:9, 6),
x1=NA, y1=NA, x2=NA, y2=NA)
## id lens x1 y1 x2 y2
## 1 A 4 NA NA NA NA
## 2 B 2 NA NA NA NA
## 3 C 5 NA NA NA NA
## 4 D 8 NA NA NA NA
## 5 E 6 NA NA NA NA
## 6 F 9 NA NA NA NA
## Add a cycle-of-4 column
dat[, "cycle"] <- rep(1:4, ceiling(nrow(dat)/4))[1:nrow(dat)]
## For loop to use the information from the cycle column
for (i in 1:nrow(dat)) {
  ## set x1, y1
  if (i == 1) {
    dat[1, c("x1", "y1")] <- 0
  } else {
    dat[i, c("x1", "y1")] <- dat[(i - 1), c("x2", "y2")]
  }
  col1 <- ifelse(dat[i, "cycle"] %% 2 == 0, "x1", "y1")
  col2 <- ifelse(dat[i, "cycle"] %% 2 == 0, "x2", "y2")
  dat[i, col2] <- dat[i, col1]
  col3 <- ifelse(dat[i, "cycle"] %% 2 != 0, "x2", "y2")
  col4 <- ifelse(dat[i, "cycle"] %% 2 != 0, "x1", "y1")
  mag <- ifelse(dat[i, "cycle"] %in% c(1, 4), 1, -1)
  dat[i, col3] <- dat[i, col4] + (dat[i, "lens"] * mag)
}
This gives the desired result:
> dat
id lens x1 y1 x2 y2 cycle
1 A 4 0 0 4 0 1
2 B 2 4 0 4 -2 2
3 C 5 4 -2 -1 -2 3
4 D 8 -1 -2 -1 6 4
5 E 6 -1 6 5 6 1
6 F 9 5 6 5 -3 2
Here it is as a plot:
library(ggplot2); library(grid)
ggplot(dat, aes(x = x1, y = y1, xend = x2, yend = y2)) +
  geom_segment(aes(color = id), size = 3, arrow = arrow(length = unit(0.5, "cm"))) +
  ylim(c(-10, 10)) + xlim(c(-10, 10))
This seems slow and clunky. I'm guessing there's a better way to do this than the steps in the for loop. What's a more efficient way to keep making programmatic right turns?
(As suggested by #DWin) Here is a solution using complex numbers, which is flexible to any kind of turn, not just 90 degrees (-pi/2 radians) right angles. Everything is vectorized:
set.seed(11)
dat <- data.frame(id = LETTERS[1:6], lens = sample(2:9, 6),
turn = -pi/2)
dat <- within(dat, {
  facing <- pi/2 + cumsum(turn)
  move <- lens * exp(1i * facing)
  position <- cumsum(move)
  x2 <- Re(position)
  y2 <- Im(position)
  x1 <- c(0, head(x2, -1))
  y1 <- c(0, head(y2, -1))
})
dat[c("id", "lens", "x1", "y1", "x2", "y2")]
# id lens x1 y1 x2 y2
# 1 A 4 0 0 4 0
# 2 B 2 4 0 4 -2
# 3 C 5 4 -2 -1 -2
# 4 D 8 -1 -2 -1 6
# 5 E 6 -1 6 5 6
# 6 F 9 5 6 5 -3
The turn variable should really be considered as an input together with lens. Right now all turns are -pi/2 radians but you can set each one of them to whatever you want. All other variables are outputs.
Now having a little fun with it:
trace.path <- function(lens, turn) {
  facing <- pi/2 + cumsum(turn)
  move <- lens * exp(1i * facing)
  position <- cumsum(move)
  x <- c(0, Re(position))
  y <- c(0, Im(position))
  plot.new()
  plot.window(range(x), range(y))
  lines(x, y)
}
trace.path(lens = seq(0, 1, length.out = 200),
turn = rep(pi/2 * (-1 + 1/200), 200))
(My attempt at replicating the graph here: http://en.wikipedia.org/wiki/Turtle_graphics)
I'll also let you try these:
trace.path(lens = seq(1, 10, length.out = 1000),
turn = rep(2 * pi / 10, 1000))
trace.path(lens = seq(0, 1, length.out = 500),
turn = seq(0, pi, length.out = 500))
trace.path(lens = seq(0, 1, length.out = 600) * c(1, -1),
turn = seq(0, 8*pi, length.out = 600) * seq(-1, 1, length.out = 200))
Feel free to add yours!
This is yet another method using complex numbers. You can rotate a vector "to the right" in the complex plane by multiplying by -1i. The code below makes the first traversal go in the positive x direction (the Re()-al axis), and each subsequent traversal is rotated to the "right". Here lengths is a vector of segment lengths, as drawn with sample() in the last answer below:
imVecs <- lengths * c(0-1i)^(0:3)
imVecs
# [1] 9+0i 0-5i -9+0i 0+9i 8+0i 0-5i -8+0i 0+7i 8+0i 0-1i -5+0i 0+3i 4+0i 0-7i -4+0i 0+2i
#[17] 3+0i 0-7i -5+0i 0+8i
cumsum(imVecs)
# [1] 9+0i 9-5i 0-5i 0+4i 8+4i 8-1i 0-1i 0+6i 8+6i 8+5i 3+5i 3+8i 7+8i 7+1i 3+1i 3+3i 6+3i 6-4i 1-4i
#[20] 1+4i
plot(cumsum(imVecs))
lines(cumsum(imVecs))
This is the same approach, using complex-plane rotations to make 45 degree turns to the right:
> sqrt(-1i)
[1] 0.7071068-0.7071068i
> imVecs <- lengths*sqrt(0-1i)^(0:7)
Warning message:
In lengths * sqrt(0 - (0+1i))^(0:7) :
longer object length is not a multiple of shorter object length
> plot(cumsum(imVecs))
> lines(cumsum(imVecs))
And the plot:
This isn't a pretty plot, but I've included it to show that this 'vectorized' coordinate calculation produces correct results, which shouldn't be too hard to adapt to your needs:
xx <- c(1,0,-1,0)
yy <- c(0,-1,0,1)
coords <- suppressWarnings(cbind(x = cumsum(c(0,xx*dat$lens)),
y = cumsum(c(0,yy*dat$lens))))
plot(coords, type="l", xlim=c(-10,10), ylim=c(-10,10))
It might be useful to think about this in terms of distance and bearing. Distance is given by dat$lens, and bearing is the angle of movement relative to some arbitrary reference line (say, the x-axis). Then, at each step,
x.new = x.old + distance * cos(bearing)
y.new = y.old + distance * sin(bearing)
bearing = bearing + increment
Here, since we start at the origin and move in the +x direction, (x,y)=(0,0) and bearing starts at 0 degrees. A right turn is simply a bearing increment of -90 degrees (-pi/2 radians). So in R code, using your definition of dat:
x <- 0
y <- 0
bearing <- 0
for (i in 1:nrow(dat)) {
  dat[i, c(3, 4)] <- c(x, y)
  length <- dat[i, 2]
  x <- x + length * cos(bearing)
  y <- y + length * sin(bearing)
  dat[i, c(5, 6)] <- c(x, y)
  bearing <- bearing - pi/2
}
This produces what you had and has the advantage that you can update it very simply to make left turns, or 45 degree turns, or whatever. You can even add a bearing.increment column to dat to create a random walk.
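For instance, a sketch of that random-walk variant; the column name bearing.increment and the candidate angles are assumptions, not from the original post:
set.seed(1)
n <- 50
walk <- data.frame(lens = sample(1:5, n, replace = TRUE),
                   bearing.increment = sample(c(-pi/2, 0, pi/2), n, replace = TRUE))
bearing <- cumsum(c(0, walk$bearing.increment[-n]))  # heading before each step
x <- cumsum(c(0, walk$lens * cos(bearing)))
y <- cumsum(c(0, walk$lens * sin(bearing)))
plot(x, y, type = "l")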
Very similar to Josh's solution:
lengths <- sample(1:10, 20, replace = TRUE)
x <- cumsum(lengths * c(1, 0, -1, 0))
y <- cumsum(lengths * c(0, 1, 0, -1))
cbind(x, y)
x y
[1,] 9 0
[2,] 9 5
[3,] 0 5
[4,] 0 -4
[5,] 8 -4
[6,] 8 1
[7,] 0 1
[8,] 0 -6
[9,] 8 -6
[10,] 8 -5
[11,] 3 -5
[12,] 3 -8
[13,] 7 -8
[14,] 7 -1
[15,] 3 -1
[16,] 3 -3
[17,] 6 -3
[18,] 6 4
[19,] 1 4
[20,] 1 -4
Base graphics:
plot(cbind(x,y))
arrows(cbind(x,y)[-20,1],cbind(x,y)[-20,2], cbind(x,y)[-1,1], cbind(x,y)[-1,2] )
This does highlight the fact that both Josh's and my solutions are "turning the wrong way", so you need to change the signs in our "transition matrices". And we probably should have started at (0,0), but you should have no trouble adapting this to your needs.
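For completeness, a sketch with the signs flipped so the walk turns right (clockwise) and starts at (0,0), matching the original for-loop output:
x <- cumsum(c(0, lengths * c(1, 0, -1, 0)))
y <- cumsum(c(0, lengths * c(0, -1, 0, 1)))
cbind(x, y)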