Suppose we run a regression with dummy variables and we obtain results like
Yt=a+b1X1+b2X2+b3X3...+bnXn+C1Dum1+C2Dum2+C3Dum3+...+CnDumN
I want to create an index I, such that,
I=W1*Dum1+W2*Dum2+W3*Dum3+...+Wn*DumN
where the weights Wi are the regression coefficients of the dummies (i.e., the Ci above).
Suppose you have a data frame like this.
head(dat)
# X1 X2 Y D1 D2 D3 W1 W2 W3
# 1 1.37 -0.31 0.21 0 1 0 -1.04 1.39 -0.03
# 2 -0.56 -1.78 -0.36 0 1 0 -0.09 -0.48 0.11
# 3 0.36 -0.17 0.76 0 1 0 0.62 0.65 -0.49
# 4 0.63 1.21 -0.73 0 1 0 -0.95 1.39 -0.50
# 5 0.40 1.90 -1.37 0 1 0 -0.54 -1.11 -1.66
# 6 -0.11 -0.43 0.43 0 1 0 0.58 -0.86 -0.38
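You may use which.max to pick, for each row, the weight whose dummy equals 1.
dummies <- paste0("D", 1:3)  # names of the dummy columns
weights <- paste0("W", 1:3)  # names of the matching weight columns
dat <- transform(dat,
weight=apply(dat, 1, function(x) x[weights][which.max(x[dummies])]))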
dat
# X1 X2 Y D1 D2 D3 W1 W2 W3 weight
# 1 1.37 -0.31 0.21 0 1 0 -1.04 1.39 -0.03 1.39
# 2 -0.56 -1.78 -0.36 0 1 0 -0.09 -0.48 0.11 -0.48
# 3 0.36 -0.17 0.76 0 1 0 0.62 0.65 -0.49 0.65
# 4 0.63 1.21 -0.73 0 1 0 -0.95 1.39 -0.50 1.39
# 5 0.40 1.90 -1.37 0 1 0 -0.54 -1.11 -1.66 -1.11
# 6 -0.11 -0.43 0.43 0 1 0 0.58 -0.86 -0.38 -0.86
# 7 1.51 -0.26 -0.81 0 0 1 0.77 -1.13 -0.51 -0.51
# 8 -0.09 -1.76 1.44 0 0 1 0.46 -1.46 2.70 2.70
# 9 2.02 0.46 -0.43 1 0 0 -0.89 0.08 -1.36 -0.89
# 10 -0.06 -0.64 0.66 0 1 0 -1.10 0.65 0.14 0.65
# 11 1.30 0.46 0.32 0 0 1 1.51 1.20 -1.49 -1.49
# 12 2.29 0.70 -0.78 0 1 0 0.26 1.04 -1.47 1.04
# 13 -1.39 1.04 1.58 0 1 0 0.09 -1.00 0.12 -1.00
# 14 -0.28 -0.61 0.64 0 0 1 -0.12 1.85 -1.00 -1.00
# 15 -0.13 0.50 0.09 0 0 1 -1.19 -0.67 0.00 0.00
# 16 0.64 -1.72 0.28 0 1 0 0.61 0.11 -0.43 0.11
# 17 -0.28 -0.78 0.68 0 0 1 -0.22 -0.42 -0.61 -0.61
# 18 -2.66 -0.85 0.09 1 0 0 -0.18 -0.12 -2.02 -0.18
# 19 -2.44 -2.41 -2.99 0 0 1 0.93 0.19 -1.22 -1.22
# 20 1.32 0.04 0.28 0 1 0 0.82 0.12 0.18 0.12
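Because exactly one dummy equals 1 in each row, the same column can also be computed without apply, as the row-wise sum of weight times dummy, which is the index formula from the question. This is a minimal sketch on a three-row toy frame; the values are copied from rows 1, 9 and 7 above, and the name toy is only for illustration.

```r
# Toy frame: three rows taken from the output above (rows 1, 9 and 7).
toy <- data.frame(
  D1 = c(0, 1, 0), D2 = c(1, 0, 0), D3 = c(0, 0, 1),
  W1 = c(-1.04, -0.89, 0.77),
  W2 = c(1.39, 0.08, -1.13),
  W3 = c(-0.03, -1.36, -0.51)
)
dummies <- paste0("D", 1:3)
weights <- paste0("W", 1:3)

# Element-wise product of the weight and dummy matrices, summed per row:
# I = W1*D1 + W2*D2 + W3*D3
toy$weight <- rowSums(as.matrix(toy[weights]) * as.matrix(toy[dummies]))
toy$weight
# [1]  1.39 -0.89 -0.51
```

Unlike which.max, this form also gives a sensible answer (zero) for a row in which no dummy is set.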
Data:
I show here how I created the example data, which probably answers additional questions.
set.seed(42)
dat <- data.frame(matrix(round(rnorm(60), 2), 20, 3))
dat$X4 <- rbinom(20, 2, .5)
names(dat)[3] <- "Y"
dat <- cbind(dat[-4], setNames(data.frame(model.matrix(X1 ~ 0 + factor(X4), dat)),
paste0("D", 1:3)))
dat <- cbind(dat, setNames(data.frame(matrix(round(rnorm(60), 2), 20, 3)),
paste0("W", 1:3)))
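If the weights are meant to be the dummy coefficients of a fitted regression, as the question describes, they can be taken from the model object rather than stored as columns. The following is only a sketch under assumptions: the data are simulated, the names toy, g, shift, D and w are made up for illustration, and R's default treatment contrasts are used, so the reference level has no dummy and gets an index of 0.

```r
set.seed(1)

# Simulated data: one continuous regressor and a three-level factor.
toy <- data.frame(
  X1 = rnorm(30),
  g  = factor(rep(c("a", "b", "c"), each = 10))
)
shift <- c(a = 0, b = 2, c = -1)  # made-up group effects
toy$Y <- 1 + 0.5 * toy$X1 + unname(shift[as.character(toy$g)]) +
  rnorm(30, sd = 0.1)

# lm() expands the factor into dummies (gb, gc) internally.
fit <- lm(Y ~ X1 + g, data = toy)

# Rebuild the same dummy columns and drop the intercept column.
D <- model.matrix(~ g, data = toy)[, -1, drop = FALSE]

# Weights = estimated coefficients of those dummies, matched by name.
w <- coef(fit)[colnames(D)]

# Index I = sum_i W_i * Dum_i for every observation.
toy$I <- as.vector(D %*% w)
head(toy)
```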
I have 2 dataframes with different number of rows and columns, and I'd like to show both of them in a circos plot with circlize.
My data looks like this:
df1=data.frame(replicate(7,sample(-200:200,200,rep=TRUE))/100)
df2=data.frame(replicate(2,sample(-200:200,200,rep=TRUE))/100)
#head(df1)
X1 X2 X3 X4 X5 X6 X7
1 -0.03 0.63 -0.33 0.73 -1.37 -1.39 1.96
2 -1.81 -1.24 -1.63 1.58 0.13 1.39 -0.76
3 0.02 -2.00 -1.93 -1.35 1.06 -0.58 -0.77
4 -1.11 -1.38 -0.66 -0.40 1.69 -0.47 -1.55
5 0.98 0.06 0.00 -0.35 1.97 1.74 0.72
6 1.51 -1.68 -0.44 -1.74 0.15 0.26 0.36
#head(df2)
X1 X2
1 0.16 -0.81
2 -1.38 -0.16
3 -0.22 -0.74
4 0.73 -0.82
5 0.58 -1.87
6 -0.63 1.50
I want to build a single circos plot where the top is showing df1 and bottom is showing df2, but I can only show individual dfs. For instance, this is how I show df1:
library(circlize)  # provides colorRamp2(), circos.heatmap() and circos.clear()
col_fun1=colorRamp2(c(min(df1), 0, max(df1)), c("blue", "white", "red"))
circos.heatmap(df1, col = col_fun1, cluster = TRUE, track.height = 0.2, rownames.side = "outside", rownames.cex = 0.6)
circos.clear()
How can I show df1 only in the top half, and df2 only in the bottom half?
Suppose I have a dataframe as follows:
df <- data.frame(
alpha = 0:20,
beta = 30:50,
gamma = 100:120
)
I have a custom function that makes new columns. (Note, my actual function is a lot more complex and can't be vectorized without a custom function, so please ignore the substance of the transformation here.) For example:
newfun <- function(var = NULL) {
newname <- paste0(var, "NEW")
df[[newname]] <- df[[var]]/100
return(df)
}
I want to apply this over many columns of the dataset repeatedly and have the dataset "build up." This happens just fine when I do the following:
df <- newfun("alpha")
df <- newfun("beta")
df <- newfun("gamma")
Obviously this is redundant and a case for map. But when I do the following I get back a list of dataframes, which is not what I want:
df <- data.frame(
alpha = 0:20,
beta = 30:50,
gamma = 100:120
)
library(purrr)  # map(); purrr also re-exports %>%
out <- c("alpha", "beta", "gamma") %>%
map(function(x) newfun(x))
How can I iterate over a vector of column names AND see the changes repeatedly applied to the same dataframe?
Writing the function so that it reaches outside of its own scope to find some df is risky and will eventually bite you, especially when you see something like:
df[['a']] <- 2
# Error in df[["a"]] <- 2 : object of type 'closure' is not subsettable
You will get this error when R doesn't find your variable named df, and instead finds the function named df from the stats package (the F-distribution density). Two morals from this discovery:
1. While I admit to using df myself, it's generally bad practice to name variables the same as R functions (especially ones that ship with R); and
2. Scope-breach is sloppy: it renders a workflow unreproducible and often makes problems or changes difficult to troubleshoot.
To remedy this, and since your function relies on knowing what the old/new variable names are or should be, I think pmap or base R Map may work better. Further, I suggest that you name the new variables outside of the function, making it "data-only".
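dat <- data.frame(alpha = 0:20, beta = 30:50, gamma = 100:120)  # the question's data, renamed from df
cols <- c("alpha", "beta", "gamma")  # columns to transform
myfunc <- function(x) x/100
setNames(lapply(dat[,cols], myfunc), paste0("new", cols))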
# $newalpha
# [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17
# [19] 0.18 0.19 0.20
# $newbeta
# [1] 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47
# [19] 0.48 0.49 0.50
# $newgamma
# [1] 1.00 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17
# [19] 1.18 1.19 1.20
From here, we just need to column-bind (cbind) it:
cbind(dat, setNames(lapply(dat[,cols], myfunc), paste0("new", cols)))
# alpha beta gamma newalpha newbeta newgamma
# 1 0 30 100 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04
# ...
Special note: be careful if you plan on doing this iteratively (repeatedly). Iteratively adding rows to a frame is known to be a bad idea, and I suspect (without proof at the moment) that doing the same with columns is also bad. For that reason, if you do this a lot, consider using do.call(cbind, c(list(dat), ...)) where ... is the list of things to add. This results in a single call to cbind and therefore only a single memory-copy of the original dat. (Contrast that with iteratively calling the *bind functions, which make a complete copy with each pass and so scale poorly.)
additions <- lapply(1:3, function(i) setNames(lapply(dat[,cols], myfunc), paste0("new", i, cols)))
str(additions)
# List of 3
# $ :List of 3
# ..$ new1alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new1beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new1gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
# $ :List of 3
# ..$ new2alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new2beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new2gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
# $ :List of 3
# ..$ new3alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new3beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new3gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
do.call(cbind, c(list(dat), additions))
# alpha beta gamma new1alpha new1beta new1gamma new2alpha new2beta new2gamma new3alpha new3beta new3gamma
# 1 0 30 100 0.00 0.30 1.00 0.00 0.30 1.00 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01 0.01 0.31 1.01 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02 0.02 0.32 1.02 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03 0.03 0.33 1.03 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04 0.04 0.34 1.04 0.04 0.34 1.04
# 6 5 35 105 0.05 0.35 1.05 0.05 0.35 1.05 0.05 0.35 1.05
# ...
An alternative approach is to change your function to only return a vector:
newfun2 <- function(var = NULL) {
df[[var]] / 100
}
newfun2('alpha')
# [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13
#[15] 0.14 0.15 0.16 0.17 0.18 0.19 0.20
Then, using base, you can use lapply() to loop over your vector of column names:
cols <- c("alpha", "beta", "gamma")
df[, paste0(cols, 'NEW')] <- lapply(cols, newfun2)
#or
#df[, paste0(cols, 'NEW')] <- purrr::map(cols, newfun2)
df
alpha beta gamma alphaNEW betaNEW gammaNEW
1 0 30 100 0.00 0.30 1.00
2 1 31 101 0.01 0.31 1.01
3 2 32 102 0.02 0.32 1.02
4 3 33 103 0.03 0.33 1.03
5 4 34 104 0.04 0.34 1.04
6 5 35 105 0.05 0.35 1.05
7 6 36 106 0.06 0.36 1.06
8 7 37 107 0.07 0.37 1.07
9 8 38 108 0.08 0.38 1.08
10 9 39 109 0.09 0.39 1.09
11 10 40 110 0.10 0.40 1.10
12 11 41 111 0.11 0.41 1.11
13 12 42 112 0.12 0.42 1.12
14 13 43 113 0.13 0.43 1.13
15 14 44 114 0.14 0.44 1.14
16 15 45 115 0.15 0.45 1.15
17 16 46 116 0.16 0.46 1.16
18 17 47 117 0.17 0.47 1.17
19 18 48 118 0.18 0.48 1.18
20 19 49 119 0.19 0.49 1.19
21 20 50 120 0.20 0.50 1.20
Based on the way you wrote your function, a for loop that repeatedly assigns the result of newfun to df works pretty well.
vars <- names(df)
for (i in vars){
df <- newfun(i)
}
df
# alpha beta gamma alphaNEW betaNEW gammaNEW
# 1 0 30 100 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04
# 6 5 35 105 0.05 0.35 1.05
# 7 6 36 106 0.06 0.36 1.06
# 8 7 37 107 0.07 0.37 1.07
# 9 8 38 108 0.08 0.38 1.08
# 10 9 39 109 0.09 0.39 1.09
# 11 10 40 110 0.10 0.40 1.10
# 12 11 41 111 0.11 0.41 1.11
# 13 12 42 112 0.12 0.42 1.12
# 14 13 43 113 0.13 0.43 1.13
# 15 14 44 114 0.14 0.44 1.14
# 16 15 45 115 0.15 0.45 1.15
# 17 16 46 116 0.16 0.46 1.16
# 18 17 47 117 0.17 0.47 1.17
# 19 18 48 118 0.18 0.48 1.18
# 20 19 49 119 0.19 0.49 1.19
# 21 20 50 120 0.20 0.50 1.20
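A variant that keeps the "build up" behaviour without the function reaching into the global environment is to pass the data frame in as an argument and fold over the column names with base R's Reduce. This is a sketch only: add_scaled and dat are illustrative names, and the division by 100 stands in for the real transformation.

```r
# The function receives the data frame explicitly and returns the
# augmented copy, so it never depends on a global object named df.
add_scaled <- function(data, var) {
  data[[paste0(var, "NEW")]] <- data[[var]] / 100
  data
}

dat <- data.frame(alpha = 0:20, beta = 30:50, gamma = 100:120)
cols <- c("alpha", "beta", "gamma")

# Reduce() calls add_scaled(dat, "alpha"), feeds that result into
# add_scaled(<result>, "beta"), and so on, returning one data frame.
out <- Reduce(add_scaled, cols, dat)
names(out)
# [1] "alpha"    "beta"     "gamma"    "alphaNEW" "betaNEW"  "gammaNEW"
```

purrr::reduce(cols, add_scaled, .init = dat) should behave the same way for those who prefer to stay within purrr.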
Given a table of values, where A = state of system, B = length of state, and C = cumulative length of states:
A B C
1 1.16 1.16
0 0.51 1.67
1 1.16 2.84
0 0.26 3.10
1 0.59 3.69
0 0.39 4.08
1 0.78 4.85
0 0.90 5.75
1 0.78 6.53
0 0.26 6.79
1 0.12 6.91
0 0.51 7.42
1 0.26 7.69
0 0.51 8.20
1 0.39 8.59
0 0.51 9.10
1 1.16 10.26
0 1.10 11.36
1 0.59 11.95
0 0.51 12.46
How would I use R to calculate the number of transitions (where A gives the state) per constant interval length, where the intervals are consecutive and the interval length could be any arbitrary number (I chose a value of 2 in my image example)? For example, using the table values or the image included, we count 2 transitions from 0-2, 3 transitions from >2-4, 3 transitions from >4-6, etc.
This is straightforward in R. All you need is column C and ?cut. Consider:
d <- read.table(text="A B C
1 1.16 1.16
0 0.51 1.67
1 1.16 2.84
0 0.26 3.10
1 0.59 3.69
0 0.39 4.08
1 0.78 4.85
0 0.90 5.75
1 0.78 6.53
0 0.26 6.79
1 0.12 6.91
0 0.51 7.42
1 0.26 7.69
0 0.51 8.20
1 0.39 8.59
0 0.51 9.10
1 1.16 10.26
0 1.10 11.36
1 0.59 11.95
0 0.51 12.46", header=TRUE)
fi <- cut(d$C, breaks=seq(from=0, to=14, by=2))
table(fi)
# fi
# (0,2] (2,4] (4,6] (6,8] (8,10] (10,12] (12,14]
# 2 3 3 5 3 3 1
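Since the question asks for an arbitrary interval length, the same idea can be wrapped in a small helper that derives the breaks from the data. This is a sketch: count_per_interval, cum_len and width are hypothetical names, and it assumes all cumulative lengths are greater than 0 (a value of exactly 0 would fall outside the first interval).

```r
# Count how many cumulative lengths fall in each consecutive
# interval (0, width], (width, 2*width], ...
count_per_interval <- function(cum_len, width) {
  # Round the largest value up to the next multiple of width.
  upper <- ceiling(max(cum_len) / width) * width
  table(cut(cum_len, breaks = seq(0, upper, by = width)))
}

# Column C from the question.
cum_len <- c(1.16, 1.67, 2.84, 3.10, 3.69, 4.08, 4.85, 5.75, 6.53, 6.79,
             6.91, 7.42, 7.69, 8.20, 8.59, 9.10, 10.26, 11.36, 11.95, 12.46)

counts2 <- count_per_interval(cum_len, width = 2)  # same as the table above
counts4 <- count_per_interval(cum_len, width = 4)  # a wider interval
counts2
counts4
```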
I have the following data frame:
1 8.03 0.37 0.55 1.03 1.58 2.03 15.08 2.69 1.63 3.84 1.26 1.9692516
2 4.76 0.70 NA 0.12 1.62 3.30 3.24 2.92 0.35 0.49 0.42 NA
3 6.18 3.47 3.00 0.02 0.19 16.70 2.32 69.78 3.72 5.51 1.62 2.4812459
4 1.06 45.22 0.81 1.07 8.30 196.23 0.62 118.51 13.79 22.80 9.77 8.4296220
5 0.15 0.10 0.07 1.52 1.02 0.50 0.91 1.75 0.02 0.20 0.48 0.3094169
7 0.27 0.68 0.09 0.15 0.26 1.54 0.01 0.21 0.04 0.28 0.31 0.1819510
I want to calculate the geometric mean for each row. My code is:
dat <- read.csv("MXreport.csv")
if(any(dat$X18S > 25)){ print("Fail!") } else { print("Pass!")}
datpass <- subset(dat, dat$X18S <= 25)
gene <- datpass[, 42:52]
gm_mean <- function(x){ prod(x)^(1/length(x))}
gene$score <- apply(gene, 1, gm_mean)
head(gene)
I got this output after typing this code:
1 8.03 0.37 0.55 1.03 1.58 2.03 15.08 2.69 1.63 3.84 1.26 1.9692516
2 4.76 0.70 NA 0.12 1.62 3.30 3.24 2.92 0.35 0.49 0.42 NA
3 6.18 3.47 3.00 0.02 0.19 16.70 2.32 69.78 3.72 5.51 1.62 2.4812459
4 1.06 45.22 0.81 1.07 8.30 196.23 0.62 118.51 13.79 22.80 9.77 8.4296220
5 0.15 0.10 0.07 1.52 1.02 0.50 0.91 1.75 0.02 0.20 0.48 0.3094169
7 0.27 0.68 0.09 0.15 0.26 1.54 0.01 0.21 0.04 0.28 0.31 0.1819510
The problem is that I get NA for any row that contains an NA. How do I skip the NA values and still calculate the geometric mean for such a row?
When I used gene<- na.exclude(datpass[, 42:52]), the row containing NA was dropped entirely and no geometric mean was calculated for it. That is not what I want: I also want the geometric mean for rows that contain NA, computed from their non-missing values. How do I do this?
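One possible approach is to drop the missing values inside the function, so that both the product and the count use only the observed entries. The sketch below does this on the log scale and assumes all non-missing values are strictly positive. The small gene frame and its column names g1 to g3 are made up for illustration, using a few values from the rows shown above.

```r
# Geometric mean that ignores NA: exp of the mean of logs over the
# non-missing values (equivalent to prod(x)^(1/n) with n = number observed).
gm_mean <- function(x) {
  x <- x[!is.na(x)]
  exp(mean(log(x)))
}

# Hypothetical stand-in for datpass[, 42:52]; row 2 contains an NA.
gene <- data.frame(
  g1 = c(8.03, 4.76),
  g2 = c(0.37, 0.70),
  g3 = c(0.55, NA)
)

gene$score <- apply(gene, 1, gm_mean)
gene
```

Working with logs also avoids overflow or underflow when a row has many very large or very small values, which prod() can run into.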
I have a data frame ts1 with dimensions of about 100 rows by 2000 columns, as follows:
> ts1[1:8, 1:6]
DD LEVEL X136747 X136749 X136752 X136753 ... ...
1 D04MX.x LC 0.25 0.30 -0.01 -0.05
2 D08MX.x LC 0.22 0.11 0.11 0.00
3 D15MX.x LC 0.31 0.33 -0.23 -0.08
4 D29MX.x LC 0.28 0.14 -0.28 -0.08
5 D04HX.x SC 0.11 -0.26 -0.21 -0.33
6 D08HX.x SC 0.25 -0.23 -0.07 -0.25
7 D15HX.x SC 0.29 0.03 -0.05 -0.10
8 D29HX.x SC 0.29 0.13 -0.09 0.02
... ...
I would like to replace all the values that are between -0.1 and 0.1 under the columns named X###### (ts1[3:ncol(ts1)]) to be 0. I tried the following:
> ts1 <- ifelse(abs(ts1) < 1, 0, ts1)
Error in Math.data.frame(ts1) :
non-numeric variable in data frame: DDLEVEL
> ts1[which(abs(ts1) < 1)] <- 0
Error in Math.data.frame(ts1) :
non-numeric variable in data frame: DDLEVEL
> ts1[which(abs(is.numeric(ts1)) < 1)] <- 0
> ts1
DD LEVEL X1367471_at X1367495_at X1367527_at X1367536_at
1 0 LC 0.25 0.30 -0.01 -0.05
2 0 LC 0.22 0.11 0.11 0.00
3 0 LC 0.31 0.33 -0.23 -0.08
... ...
> ts1 <- ts1[, lapply(.SD[3:ncol(ts1)], ifelse(abs(ts1) < 1, 0, ts1))]
Error in Math.data.frame(ts1) :
non-numeric variable in data frame: DDLEVEL
What am I doing wrong? I do need to retain the first two columns. Any shortcut? Thanks.
Assuming your data is named df:
colsToEdit <- grepl("^X", names(df))  # columns whose names start with X
df[, colsToEdit][abs(df[, colsToEdit]) <= 0.1] <- 0
Gives you:
DD LEVEL X136747 X136749 X136752 X136753
1 D04MX.x LC 0.25 0.30 0.00 0.00
2 D08MX.x LC 0.22 0.11 0.11 0.00
3 D15MX.x LC 0.31 0.33 -0.23 0.00
4 D29MX.x LC 0.28 0.14 -0.28 0.00
5 D04HX.x SC 0.11 -0.26 -0.21 -0.33
6 D08HX.x SC 0.25 -0.23 0.00 -0.25
7 D15HX.x SC 0.29 0.00 0.00 0.00
8 D29HX.x SC 0.29 0.13 0.00 0.00
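If the column names are not a reliable guide, another option is to select the columns by type, which also avoids the Math.data.frame error because abs() is only ever applied to numeric columns. This is a sketch on a cut-down copy of the question's data, with three rows and two X columns. The helper name num_cols is an assumption.

```r
# Cut-down version of ts1 from the question.
ts1 <- data.frame(
  DD      = c("D04MX.x", "D08MX.x", "D15MX.x"),
  LEVEL   = c("LC", "LC", "LC"),
  X136747 = c(0.25, 0.22, 0.31),
  X136752 = c(-0.01, 0.11, -0.23)
)

# TRUE for numeric columns, FALSE for DD and LEVEL.
num_cols <- vapply(ts1, is.numeric, logical(1))

# Zero out small values column by column; non-numeric columns are untouched.
ts1[num_cols] <- lapply(ts1[num_cols], function(v) ifelse(abs(v) <= 0.1, 0, v))
ts1
```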