Manipulating data.frames in R

I have different data.frame objects with two columns. These data.frame objects are called Experiment1, Experiment2, Experiment3 ... Experiment{n}
> Experiment1
Name Statistic
1 a -1.050
2 b 0.058
3 c 0.489
4 d 1.153
5 e 0.736
6 f -1.155
7 g 0.186
> Experiment2
Name Statistic
1 a 0.266
2 b 0.067
3 c -0.385
4 d 0.068
5 e 1.563
6 f 0.745
7 g 1.671
> Experiment3
Name Statistic
1 a 0.004
2 b -2.074
3 c 0.746
4 d 0.207
5 e 0.700
6 f 0.158
7 g 0.067
> Experiment4
Name Statistic
1 a 0.255
2 b -0.542
3 c 0.477
4 d 1.552
5 e 0.025
6 f 1.027
7 g 0.326
> Experiment5
Name Statistic
1 a 1.817
2 b 0.147
3 c 0.052
4 d 0.194
5 e -0.137
6 f 2.321
7 g -0.939
> Experiment6
Name Statistic
1 a -0.827
2 b 0.219
3 c 1.561
4 d 1.493
5 e 0.063
6 f 0.844
7 g -0.373
> ExperimentalDesign$metabolite
[1] "butyrate" "h2s" "hippurate" "acetate" "propionate" "butyrate_2" [7] "h2s_2" "hippurate_2" "acetate_2" "propionate_2"
There are n such data.frame objects in total, where n is NumberTubes divided by NumberParameters.
Now I want to merge the .$Statistic column of each data.frame object into a table (3 statistic columns per output), e.g.
tab_1 <- cbind(Experiment1, Experiment2$Statistic, Experiment3$Statistic)
Each table should also get the corresponding metabolite from ExperimentalDesign$metabolite, in order; e.g. Table_3 would get hippurate.
NumberRepeats <- 3 (Table_1 merges Experiment1, Experiment2$Statistic, and Experiment3$Statistic; Table_2 merges Experiment4, Experiment5$Statistic, and Experiment6$Statistic; etc.)
Experiment_n <- 17 (i.e. Experiment1 through Experiment17)
skipTube <- c(11) (skip Experiment11)
Desired outputs:
Table_1:
Experiment1 Experiment2 Experiment3 metabolite
a -1.050 0.266 0.004 butyrate
b 0.058 0.067 -2.074 butyrate
c 0.489 -0.385 0.746 butyrate
d 1.153 0.068 0.207 butyrate
e 0.736 1.563 0.700 butyrate
f -1.155 0.745 0.158 butyrate
g 0.186 1.671 0.067 butyrate
Table_2
Experiment4 Experiment5 Experiment6 metabolite
a 0.255 1.817 -0.827 h2s
b -0.542 0.147 0.219 h2s
c 0.477 0.052 1.561 h2s
d 1.552 0.194 1.493 h2s
e 0.025 -0.137 0.063 h2s
f 1.027 2.321 0.844 h2s
g 0.326 -0.939 -0.373 h2s
TRIED SO FAR:
With this you merge one column from each of the data.frame objects into one table. You can control the number of columns with the NumberRepeats variable. All tables stored in the list have as many data columns as NumberRepeats, except possibly the last table...
# create test data
for(i in 1:17){
  Name <- letters[1:7]
  Statistic <- round(rnorm(7), 3)
  assign(paste0("Experiment", i), data.frame(Name, Statistic))
}
# set some parameters
NumberRepeats <- 3
Experiment_n <- 17
skipTube <- c(11)
# let's go
out <- list()
list_index <- 1
counter <- 1
while(counter <= Experiment_n) {
  tab <- NULL
  nam <- NULL
  # collect up to NumberRepeats statistic columns, skipping tubes in skipTube
  while((is.null(tab) || ncol(tab) < NumberRepeats) && Experiment_n >= counter){
    if(!any(counter == skipTube)){
      tab <- cbind(tab, get(paste0("Experiment", counter))$Statistic)
      # tab <- as.data.frame(tab)
      nam <- c(nam, paste0("Experiment", counter))
    }
    counter <- counter + 1
  }
  colnames(tab) <- nam
  rownames(tab) <- as.matrix(Experiment1$Name)
  out[[list_index]] <- tab
  assign(paste0('table_', list_index), tab)
  list_index <- list_index + 1
}
out
Output from above code:
Experiment1 Experiment2 Experiment3
a 0.136 0.260 -1.089
b 0.946 -1.165 -0.599
c -0.462 -1.445 0.044
d -1.936 -0.391 0.622
e 0.537 -0.502 1.192
f 0.259 0.096 -1.873
g 1.352 0.049 -0.644
Desired output from the above code:
Experiment1 Experiment2 Experiment3 metabolite
a -1.050 0.266 0.004 butyrate
b 0.058 0.067 -2.074 butyrate
c 0.489 -0.385 0.746 butyrate
d 1.153 0.068 0.207 butyrate
e 0.736 1.563 0.700 butyrate
f -1.155 0.745 0.158 butyrate
g 0.186 1.671 0.067 butyrate
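For the missing metabolite column, one minimal tweak to the loop above would be to attach it just before each table is stored; a sketch, assuming ExperimentalDesign exists and its metabolite order matches the table order:
# inside the while loop, just before out[[list_index]] <- tab:
# (assumes ExperimentalDesign$metabolite[list_index] is this table's metabolite)
tab <- data.frame(tab, metabolite = ExperimentalDesign$metabolite[[list_index]])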

Something like this should work, but it is also quite manual:
table1 = Reduce(function(x, y){cbind(x, y)},
                list(Experiment1$Statistic, Experiment2$Statistic,
                     Experiment3$Statistic, ExperimentalDesign$metabolite[1]))
table2 = Reduce(function(x, y){cbind(x, y)},
                list(Experiment4$Statistic, Experiment5$Statistic,
                     Experiment6$Statistic, ExperimentalDesign$metabolite[2]))
EDIT: A more robust solution:
First create a list of all the experiment data.frames named ldf:
ldf = list(Experiment1,Experiment2,Experiment3,...,Experimentn)
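Rather than typing every name, the list can also be built programmatically; a small sketch, assuming the objects Experiment1 through Experiment17 exist in the workspace:
# collect the existing Experiment objects into a named list
ldf <- mget(paste0("Experiment", 1:17))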
And then:
lapply(1:ceiling(length(ldf)/3),
       function(t, l, df){
         # the last group may contain fewer than 3 experiments
         ind <- ((3*t) - 2):min(3*t, length(l))
         cbind(Reduce(function(x, y){cbind(x, y)}, lapply(l[ind], '[[', 'Statistic')),
               df$metabolite[t])
       },
       ldf, ExperimentalDesign)

This solution should do what you want, assuming you want to aggregate every 3 tables.
library(reshape2)  # provides melt() and dcast()
for(i in 1:17){
  Name <- letters[1:7]
  Statistic <- round(rnorm(7), 3)
  ExperimentName <- rep(paste0("Experiment", i), 7)
  assign(paste0("Experiment", i),
         data.frame(ExperimentName, Name, Statistic, stringsAsFactors = FALSE))
}
# set some parameters
NumberRepeats <- 5
Experiment_n <- 17
skipTube <- c(3, 7, 11)
# Create a dummy vector for the metabolites
metabolites <- c("met1", "met2", "met3", "met4", "met5")
for (iteration in c(1:Experiment_n)){
  if (iteration %% 3 == 0){
    temp_df <- rbind(get(paste0("Experiment", iteration - 2)),
                     get(paste0("Experiment", iteration - 1)),
                     get(paste0("Experiment", iteration)))
    print(temp_df)
    temp_df <- melt(data = temp_df)
    aggregates <- dcast(data = temp_df, formula = Name ~ ExperimentName, value.var = "value")
    aggregates$metabolite <- metabolites[iteration / 3]
    print(aggregates)
  }
}
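Note that NumberRepeats and skipTube are defined above but never used inside the loop, which hard-codes groups of 3. One hypothetical way to honor them would be to compute the groups up front:
# drop the skipped tubes, then chunk the remaining experiment numbers in order
keep   <- setdiff(seq_len(Experiment_n), skipTube)
groups <- split(keep, ceiling(seq_along(keep) / NumberRepeats))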

Related

Replace values of one dataframe from the corresponding values from another dataframe in R

I have two data.frames, A (dat1) and B (dat2).
Is it possible to find small differences (up to tol) between one or more numeric columns (cols) across A and B, and then replace those values in A with the corresponding ones in B?
For example, if you look at the numeric columns across A and B, you'll see that the value marked THIS in A's upper.CL column is 1.770 but the corresponding value in B is 1.771, i.e., they differ by tol = .001. In this case, all we need is to replace 1.770 in A with 1.771 from B so that all numeric columns in A and B are the same.
Is it possible to write an R function that finds all numeric columns that differ by tol and replaces the values as described above?
foo <- function(dat1, dat2, cols = NULL, tol){
  # Solution
}
# EXAMPLE OF USE:
#### foo(dat1 = A, dat2 = B, cols = upper.CL, tol = .002)
# OR
#### foo(dat1 = A, dat2 = B, tol = .002)
A = read.table(h=TRUE, text="
task_dif time emmean SE lower.CL upper.CL
1 complex 1 1.733 0.023 1.686 1.779
2 simple 1 1.734 0.018 1.697 1.770# <- THIS
3 complex 2 1.702 0.025 1.652 1.751
4 simple 2 1.714 0.017 1.680 1.747
5 complex 3 1.757 0.019 1.720 1.794
6 simple 3 1.740 0.027 1.687 1.794
7 complex 4 1.773 0.019 1.735 1.810
8 simple 4 1.764 0.025 1.713 1.814")
B = read.table(h=TRUE, text="
order time emmean SE lower.CL upper.CL
1 c2s 1 1.733 0.023 1.686 1.779
2 s2c 1 1.734 0.018 1.697 1.771# <- THIS
3 c2s 2 1.714 0.017 1.680 1.747
4 s2c 2 1.702 0.025 1.652 1.751
5 c2s 3 1.757 0.019 1.720 1.794
6 s2c 3 1.740 0.027 1.687 1.794
7 c2s 4 1.764 0.025 1.713 1.814
8 s2c 4 1.773 0.019 1.735 1.810")
Desired output:
A = read.table(h=TRUE, text="
task_dif time emmean SE lower.CL upper.CL
1 complex 1 1.733 0.023 1.686 1.779
2 simple 1 1.734 0.018 1.697 1.771# <- Replaced using corresponding value in `B`
3 complex 2 1.702 0.025 1.652 1.751
4 simple 2 1.714 0.017 1.680 1.747
5 complex 3 1.757 0.019 1.720 1.794
6 simple 3 1.740 0.027 1.687 1.794
7 complex 4 1.773 0.019 1.735 1.810
8 simple 4 1.764 0.025 1.713 1.814")
B = read.table(h=TRUE, text="
order time emmean SE lower.CL upper.CL
1 c2s 1 1.733 0.023 1.686 1.779
2 s2c 1 1.734 0.018 1.697 1.771# <- THIS
3 c2s 2 1.714 0.017 1.680 1.747
4 s2c 2 1.702 0.025 1.652 1.751
5 c2s 3 1.757 0.019 1.720 1.794
6 s2c 3 1.740 0.027 1.687 1.794
7 c2s 4 1.764 0.025 1.713 1.814
8 s2c 4 1.773 0.019 1.735 1.810")
Try this:
library(dplyr)
close <- function(tol) function(a, b) abs(a - b) <= tol
mutate(A, rn = row_number()) %>%
  fuzzyjoin::fuzzy_left_join(
    select(B, upper.CL),
    by = "upper.CL",
    match_fun = list(close(0.001))) %>%
  distinct(rn, upper.CL.x, upper.CL.y, .keep_all = TRUE) %>%
  select(-upper.CL.x, upper.CL = upper.CL.y)
# task_dif time emmean SE lower.CL rn upper.CL
# 1 complex 1 1.733 0.023 1.686 1 1.779
# 2 simple 1 1.734 0.018 1.697 2 1.771
# 3 complex 2 1.702 0.025 1.652 3 1.751
# 4 simple 2 1.714 0.017 1.680 4 1.747
# 5 complex 3 1.757 0.019 1.720 5 1.794
# 6 simple 3 1.740 0.027 1.687 6 1.794
# 7 complex 4 1.773 0.019 1.735 7 1.810
# 8 simple 4 1.764 0.025 1.713 8 1.814
Notes:
I add rn to guard against the likely event of a 1-to-many join; if you look at the data before distinct(..) above, you'll see that rows 5 and 6 are repeated, which makes sense since 1.794 occurs twice each in A and B (though twice in one alone is sufficient for 1-to-many). Your real data might introduce more duplicate rows than distinct addresses; you can use rn to summarize/aggregate or reconstruct as needed.
Joining on a floating-point number is not guaranteed; while it should likely work well enough with this data, the issue never surfaces as an error: failures due to floating-point equality concerns will merely show up as "did not join", with no error or even a warning. The tol= should address most of that, but caveat emptor.
It seems likely that this join might also need more join columns. For those, within fuzzy_left_join one would include `==` as the match function. For instance, if time were also joined on equality (just work with me here), then
mutate(A, rn = row_number()) %>%
  fuzzyjoin::fuzzy_left_join(
    select(B, time, upper.CL),
    by = c("time", "upper.CL"),
    match_fun = list(`==`, close(0.001))) %>%
  ...
(If not clear, those are backticks, not single-quotes.)
My close is a function that returns a function ... that may seem too meta, but it works well in a more generalized fashion here. match_fun must be a function (either anonymous or named) or a ~-function (rlang-style). I think this looks more readable than the equivalent match_fun = list(~ abs(.x - .y) < 0.001), though that works as well.
In the end, I think this function might meet your needs.
close <- function(tol) function(a, b) abs(a - b) <= tol
myjoin <- function(X, Y, by = NULL, tol = 1e-9,
                   type = c("full", "left", "right"), reduce = TRUE) {
  if (is.null(names(by))) names(by) <- by
  stopifnot(
    all(names(by) %in% names(X)),
    all(by %in% names(Y)),
    all(!is.na(tol) & tol >= 0)
  )
  type <- match.arg(type)
  if (length(tol) == 1L) tol <- rep(tol, length(by))
  funs <- lapply(tol, function(z) if (z < 1e-15) `==` else close(z))
  joinfun <- switch(
    type,
    full = fuzzyjoin::fuzzy_full_join,
    left = fuzzyjoin::fuzzy_left_join,
    right = fuzzyjoin::fuzzy_right_join)
  out <- joinfun(
    transform(X, rn = seq_len(nrow(X))),
    subset(Y, select = by),
    by = by, match_fun = funs)
  if (reduce) {
    samename <- (names(by) == by)
    byx <- paste0(names(by), ifelse(samename, ".x", ""))
    byy <- paste0(by, ifelse(samename, ".y", ""))
    out <- out[!duplicated(out[, unique(c("rn", byx, byy))]), ]
    rownames(out) <- NULL
    # ASSUMING you always want to replace the LHS 'by' columns with the
    # RHS columns ...
    out[names(by)] <- out[byy]
    out[c(byx[samename], byy)] <- NULL
  }
  out
}
myjoin(A, B, by = "upper.CL", tol = 0.001, type = "left")
# task_dif time emmean SE lower.CL rn upper.CL
# 1 complex 1 1.733 0.023 1.686 1 1.779
# 2 simple 1 1.734 0.018 1.697 2 1.771
# 3 complex 2 1.702 0.025 1.652 3 1.751
# 4 simple 2 1.714 0.017 1.680 4 1.747
# 5 complex 3 1.757 0.019 1.720 5 1.794
# 6 simple 3 1.740 0.027 1.687 6 1.794
# 7 complex 4 1.773 0.019 1.735 7 1.810
# 8 simple 4 1.764 0.025 1.713 8 1.814
myjoin(A, B, by = "upper.CL", tol = 0.001, type = "left", reduce = FALSE)
# task_dif time emmean SE lower.CL upper.CL.x rn upper.CL.y
# 1 complex 1 1.733 0.023 1.686 1.779 1 1.779
# 2 simple 1 1.734 0.018 1.697 1.770 2 1.771
# 3 complex 2 1.702 0.025 1.652 1.751 3 1.751
# 4 simple 2 1.714 0.017 1.680 1.747 4 1.747
# 5 complex 3 1.757 0.019 1.720 1.794 5 1.794
# 6 complex 3 1.757 0.019 1.720 1.794 5 1.794
# 7 simple 3 1.740 0.027 1.687 1.794 6 1.794
# 8 simple 3 1.740 0.027 1.687 1.794 6 1.794
# 9 complex 4 1.773 0.019 1.735 1.810 7 1.810
# 10 simple 4 1.764 0.025 1.713 1.814 8 1.814
This might not be the fastest, since equal values also get reassigned, but I think it's the most straightforward implementation.
foo <- function(dat1, dat2, tol) {
  # keep only the numeric columns of each data.frame
  O <- lapply(list(dat1, dat2), \(x) Filter(is.numeric, x))
  # flag differences based on tol
  ERR <- (abs(O[[1]] - O[[2]]) <= tol)
  # reassign
  dat2[names(O[[2]])][ERR] <- O[[1]][ERR]
  dat2
}
foo(A,B,.001)
If the order and the size are the same, you can simply cbind(A, B[,6]).
Hope this solves your problem:
foo <- function(A, B, cols = NULL, tol) {
  if (!is.null(cols)) {
    C <- abs(A[cols] - B[cols])
    idx <- C > tol
    for (i in 1:nrow(A)) {
      if (idx[i]) {
        A[i, cols] <- B[i, cols]
      }
    }
    return(A)
  }
  # otherwise operate on all numeric columns
  num <- unlist(lapply(A, is.numeric), use.names = FALSE)
  nn <- unlist(lapply(A, function(x) !is.numeric(x)), use.names = FALSE)
  A_num <- A[, num]
  B_num <- B[, num]
  A_nn <- A[, nn]
  C <- abs(A_num - B_num)
  idx <- C > tol
  for (i in 1:nrow(A_num)) {
    for (j in 1:ncol(A_num)) {
      if (idx[i, j]) {
        A_num[i, j] <- B_num[i, j]
      }
    }
  }
  A_new <- cbind(A_nn, A_num)
  A_new
}
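Usage would then look like this (note that cols must be a character vector here, i.e. the quoted form of the example call at the top):
foo(A, B, cols = "upper.CL", tol = .002)
# or, over all numeric columns:
foo(A, B, tol = .002)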

Make a matrix of 2 rows into a row and a column in R

I'm using R
I have a csv file from single-cell data like this, where the 'cluster' value is repeated for every 'gene' in that cluster.
dput(markers)
p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
APOC1 0 1.696639642 0.939 0.394 0 0 APOC1
APOE 0 1.487160872 0.958 0.475 0 0 APOE
GPNMB 9.30E-269 1.31714457 0.745 0.301 2.49E-264 0 GPNMB
FTL 2.24E-230 0.766844152 1 0.977 6.00E-226 0 FTL
PSAP 2.27E-225 0.98726538 0.925 0.685 6.07E-221 0 PSAP
CTSB 4.84E-211 0.925031015 0.902 0.606 1.29E-206 0 CTSB
CTSS 1.37E-197 0.898457063 0.869 0.609 3.67E-193 0 CTSS
CSTB 8.05E-191 0.853658991 0.918 0.732 2.15E-186 0 CSTB
CTSD 1.23E-187 1.08931251 0.787 0.443 3.30E-183 0 CTSD
IGKC 0 1.560337702 0.998 0.237 0 1 IGKC
IGLC2 0 1.546344857 0.997 0.152 0 1 IGLC2
IGLC3 0 1.342649567 0.967 0.073 0 1 IGLC3
C11orf96 0 1.245172517 0.99 0.253 0 1 C11orf96
COL3A1 0 1.212528128 1 0.343 0 1 COL3A1
LUM 0 1.202452925 0.971 0.143 0 1 LUM
IGHG4 0 0.977399051 0.876 0.092 0 1 IGHG4
HSPG2 0 0.957478533 0.883 0.148 0 1 HSPG2
NNMT 0 0.952577589 0.945 0.213 0 1 NNMT
IGHG1 0 0.913733424 0.861 0.07 0 1 IGHG1
COL6A31 0 1.847828827 0.907 0.192 0 2 COL6A3
PDGFRA 5.38E-292 0.849349193 0.503 0.052 1.44E-287 2 PDGFRA
COL5A21 2.67E-280 1.400314195 0.649 0.105 7.14E-276 2 COL5A2
CALD1 1.11E-275 1.292924443 0.771 0.155 2.98E-271 2 CALD1
CCDC80 1.73E-271 1.168549626 0.706 0.123 4.64E-267 2 CCDC80
COL1A21 1.66E-268 2.004626869 0.966 0.326 4.45E-264 2 COL1A2
DCN1 1.47E-253 1.540631398 0.886 0.254 3.93E-249 2 DCN
COL3A11 3.88E-253 2.216642854 0.955 0.353 1.04E-248 2 COL3A1
FBN1 6.40E-251 0.949521182 0.525 0.07 1.71E-246 2 FBN1
I want to transform my matrix so that each unique cluster name heads a column containing all the genes from that cluster (desired output below; shown as a picture in the original). How should I write the code?
Desired output:
0 1 2
APOC1 IGKC COL6A3
APOE IGLC2 PDGFRA
GPNMB IGLC3 COL5A2
FTL C11orf96 CALD1
PSAP COL3A1 CCDC80
CTSB LUM COL1A2
CTSS IGHG4 DCN
CSTB HSPG2 COL3A1
CTSD NNMT FBN1
I tried this and the result file has no values.
markers = read.csv("./markers.csv", row.names=1, stringsAsFactors=FALSE)
z1 = matrix("", ncol = length(unique(markers$cluster)))
colnames(z1) = unique(markers$cluster)
for (i in 1:nrow(z1)){
  for (j in 1:ncol(z1)){
    genes1 = as.character(markers$gene)[markers$cluster == rownames(z1)[i]]
    z1[i,0] = paste(genes1, collapse=" ")
  }
}
write.csv(z1, "test.csv")
This may accomplish what you want, but first we need a reproducible example:
set.seed(42)
cluster <- c(rep(0, 8), rep(1, 10), rep(2, 12))
gene <- replicate(30, paste0(sample(LETTERS, 4), collapse=""))
markers <- data.frame(cluster, gene, stringsAsFactors=FALSE)
This data frame contains only the two columns you are interested in. We need to split the genes by cluster:
markers.split <- split(markers$gene, markers$cluster)
Print this out. It is a list containing 3 character vectors, one each for clusters 0, 1, and 2. The problem with the table format you want is that tables and matrices must have the same number of rows in each column, so we have to pad the vectors until they are all as long as the longest one (12 in this case):
rows <- max(sapply(markers.split, length))
markers.sp <- lapply(markers.split, function(x) c(x, rep("", rows - length(x))))
markers.df <- do.call(data.frame, list(markers.sp, stringsAsFactors=FALSE))
markers.df
# X0 X1 X2
# 1 QEAJ ZHDX TIKC
# 2 DRQO VRME PEXN
# 3 XGDE DBXR EVBR
# 4 NTRO CXWQ XQRE
# 5 CIDE URFX NHWY
# 6 METB BTCV UDYG
# 7 HCAJ UBWF JRMU
# 8 XKOV ZJHE VSPZ
# 9 AQGD QLIU
# 10 MJIL KYPH
# 11 WFAM
# 12 NEIW
R automatically adds "X" to any column name that starts with a number.
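Applied to the real markers data, the same idea fits in a few lines; a sketch reusing the column names from the question:
# split genes by cluster, pad with "" to equal length, and write out
ms  <- split(as.character(markers$gene), markers$cluster)
n   <- max(lengths(ms))
out <- sapply(ms, function(x) c(x, rep("", n - length(x))))
write.csv(out, "test.csv", row.names = FALSE)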

How to calculate the Bonferroni Lower and Upper limits in R?

With the following data, I am trying to calculate the Chi-square statistic and the Bonferroni lower and upper confidence limits. The column "Data_No" identifies the dataset (calculations need to be done separately for each dataset).
Data_No Area Observed
1 3353 31
1 2297 2
1 1590 15
1 1087 16
1 817 2
1 847 10
1 1014 28
1 872 29
1 1026 29
1 1215 21
2 3353 31
2 2297 2
2 1590 15
3 1087 16
3 817 2
The code I used is
library(dplyr)
setwd("F:/GIS/July 2019/")
total_data <- read.csv("test.csv")
result_data <- NULL
for(i in unique(total_data$Data_No)){
  data <- total_data[which(total_data$Data_No == i),]
  data <- data %>%
    mutate(RelativeArea = Area/sum(Area),
           Expected = RelativeArea*sum(Observed),
           OminusE = Observed-Expected,
           O2 = OminusE^2,
           O2divE = O2/Expected,
           APU = Observed/sum(Observed),
           Alpha = 0.05/2*count(Data_No),
           Zvalue = qnorm(Alpha, lower.tail=FALSE),
           lower = APU-Zvalue*sqrt(APU*(1-APU)/sum(Observed)),
           upper = APU+Zvalue*sqrt(APU*(1-APU)/sum(Observed)))
  result_data <- rbind(result_data, data)
}
write.csv(result_data, file='final_result.csv')
And the error message I get is:
Error in UseMethod("summarise_") : no applicable method for
'summarise_' applied to an object of class "c('integer', 'numeric')"
The column that I am calling "Alpha" is the alpha value of 0.05/(2K), where K is the number of categories; in my example, the first dataset has 10 categories (rows per "Data_No"), so "Alpha" needs to be 0.05/20 = 0.0025, and its corresponding Z value is 2.807. The second dataset has 3 categories (so 0.05/6) and the third has 2 categories (0.05/4). Using the values from the newly calculated "Alpha" column, I then need to calculate the Zvalue column (Zvalue = qnorm(Alpha, lower.tail=FALSE)), which I then use to calculate the lower and upper confidence limits.
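For instance, the Z value for the first dataset can be verified directly in R:
qnorm(0.05 / (2 * 10), lower.tail = FALSE)
# [1] 2.807034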
From the above data, here are the results that I should get, but note that I have had to manually calculate Alpha column and Zvalue, rather than insert those calculations within the R code:
Data_No Area Observed RelativeArea Alpha Z value lower upper
1 3353 31 0.237 0.003 2.807 0.092 0.247
1 2297 2 0.163 0.003 2.807 -0.011 0.033
1 1590 15 0.113 0.003 2.807 0.025 0.139
1 1087 16 0.077 0.003 2.807 0.029 0.146
1 817 2 0.058 0.003 2.807 -0.011 0.033
1 847 10 0.060 0.003 2.807 0.007 0.102
1 1014 28 0.072 0.003 2.807 0.078 0.228
1 872 29 0.062 0.003 2.807 0.083 0.234
1 1026 29 0.073 0.003 2.807 0.083 0.234
1 1215 21 0.086 0.003 2.807 0.049 0.181
2 3353 31 0.463 0.008 2.394 0.481 0.811
2 2297 2 0.317 0.008 2.394 -0.027 0.111
2 1590 15 0.220 0.008 2.394 0.152 0.473
3 1087 16 0.571 0.013 2.241 0.723 1.055
3 817 2 0.429 0.013 2.241 -0.055 0.277
Please note that I only included some of the columns generated from the code.
You need to check the closing bracket for the lower CI's sqrt value. The following code should work:
data <- read.csv("test.csv")
data <- data %>%
  mutate(RelativeArea = Area/sum(Area),
         Expected = RelativeArea*sum(Observed),
         OminusE = Observed-Expected,
         O2 = OminusE^2,
         O2divE = O2/Expected,
         APU = Observed/sum(Observed),
         lower = APU-2.394*sqrt(APU*(1-APU)/sum(Observed)),
         upper = APU+2.394*sqrt(APU*(1-APU)/sum(Observed)))
#Answer to follow-up question.
#Sample data:
Data_No Area Observed
1 3353 31
1 2297 2
2 1590 15
2 1087 16
#Code to run:
total_data <- read.csv("test.csv")
result_data <- NULL
for(i in unique(total_data$Data_No)){
  data <- total_data[which(total_data$Data_No == i),]
  data <- data %>%
    mutate(RelativeArea = Area/sum(Area),
           Expected = RelativeArea*sum(Observed),
           OminusE = Observed-Expected,
           O2 = OminusE^2,
           O2divE = O2/Expected,
           APU = Observed/sum(Observed),
           lower = APU-2.394*sqrt(APU*(1-APU)/sum(Observed)),
           upper = APU+2.394*sqrt(APU*(1-APU)/sum(Observed)))
  result_data <- rbind(result_data, data)
}
write.csv(result_data, file='final_result.csv')
#Issue in calculating Alpha. I have updated the code.
library(dplyr)
setwd("F:/GIS/July 2019/")
total_data <- read.csv("test.csv")
#Creating the NO_OF_CATEGORIES column based on your question.
total_data$NO_OF_CATEGORIES <- 0
total_data[which(total_data$Data_No==1),]$NO_OF_CATEGORIES <- 10
total_data[which(total_data$Data_No==2),]$NO_OF_CATEGORIES <- 3
total_data[which(total_data$Data_No==3),]$NO_OF_CATEGORIES <- 2
#Actual code
result_data <- NULL
for(i in unique(total_data$Data_No)){
  data <- total_data[which(total_data$Data_No == i),]
  data <- data %>%
    mutate(RelativeArea = Area/sum(Area),
           Expected = RelativeArea*sum(Observed),
           OminusE = Observed-Expected,
           O2 = OminusE^2,
           O2divE = O2/Expected,
           APU = Observed/sum(Observed),
           Alpha = 0.05/(2*(unique(data$NO_OF_CATEGORIES))),
           Zvalue = qnorm(Alpha, lower.tail=FALSE),
           lower = APU-Zvalue*sqrt(APU*(1-APU)/sum(Observed)),
           upper = APU+Zvalue*sqrt(APU*(1-APU)/sum(Observed)))
  result_data <- rbind(result_data, data)
}
write.csv(result_data, file='final_result.csv')
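As a side note, the hard-coded NO_OF_CATEGORIES column could be derived instead of typed by hand; a sketch with dplyr, assuming (as in the example) that the number of categories equals the number of rows per Data_No:
library(dplyr)
result_data <- total_data %>%
  group_by(Data_No) %>%
  mutate(Alpha  = 0.05 / (2 * n()),           # n() = rows (categories) in this dataset
         Zvalue = qnorm(Alpha, lower.tail = FALSE)) %>%
  ungroup()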

Non linear regression for exponential decay model in R

I have the following problem:
I asked 5 people (i=1, ..., 5) to forecast next period's return of 3 different stocks. This gives me the following data:
S_11_i_c <-read.table(text = "
i c_1 c_2 c_3
1 0.150 0.70 0.190
2 0.155 0.70 0.200
3 0.150 0.75 0.195
4 0.160 0.80 0.190
5 0.150 0.75 0.180
",header = T)
In words, in period t=10 participant i=1 expects the return of stock c_1 to be 0.15 in period t=11.
The forecasts are based on past returns of the stocks. These are the following:
S_t_c <-read.table(text = "
time S_c_1 S_c_2 S_c_3
1 0.020 0.015 0.040
2 0.045 0.030 0.050
3 0.060 0.045 0.060
4 0.075 0.060 0.060
5 0.090 0.070 0.060
6 0.105 0.070 0.090
7 0.120 0.070 0.120
8 0.125 0.070 0.140
9 0.130 0.070 0.160
10 0.145 0.070 0.180
",header = T)
In words, stock c=1 had a return of 0.145 in period 10.
So, the variables in table S_11_i_c are the dependent variables, and the variables in table S_t_c are the independent variables.
The model I want to estimate is the following:
[model equation shown as an image in the original post]
My problem with coding this is as follows: I only know how to express the inner sum of past returns (also shown as an image in the original post) with the help of a loop. As in:
# running sum of past returns per stock, built with a loop
Sum_S_t_c <- data.frame(
  s = 0:9,
  c_1 = numeric(10),
  c_2 = numeric(10),
  c_3 = numeric(10)
)
for (c in 2:4) {
  running <- 0
  for (s in 0:9) {
    running <- running + S_t_c[10 - s, c]
    Sum_S_t_c[s + 1, c] <- running
  }
}
However, loops within a regression are not possible. So, my other solution would be to rewrite the sum in its expanded form (again shown as an image in the original post). However, as my actual problem has a much larger n, this isn't really working for me.
Any ideas?
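For what it's worth, the inner sum can be vectorized so that the decay parameter sits directly inside nls(). A rough sketch, assuming the model is an exponentially weighted sum of past returns, S_11_i_c = a + b * sum_{s=0}^{9} lambda^s * S_{10-s,c} (my reading of the missing equation images, so treat the model form as an assumption):
# stack the forecasts into long form: one row per participant x stock
d <- data.frame(
  y = unlist(S_11_i_c[, c("c_1", "c_2", "c_3")]),
  k = rep(2:4, each = nrow(S_11_i_c))  # column index of each stock in S_t_c
)
# lambda^(0:9) weights the most recent return (t = 10) first
fit <- nls(y ~ a + b * sapply(k, function(kk) sum(lambda^(0:9) * S_t_c[10:1, kk])),
           data = d,
           start = list(a = 0, b = 1, lambda = 0.5))
summary(fit)
Starting values may need tuning, and with only 15 observations the fit is fragile; the point is just that the sum needs no explicit loop.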

How can I make a function for calculating variables using two dataframes?

Here are two dataframes, data1 and data2
data1
id A B C D E F G
1 id1 1.00 0.31 -3.20 2.50 3.1 -0.300 -0.214
2 id2 0.40 -2.30 0.24 -1.47 3.2 0.152 -0.140
3 id3 1.30 -3.20 2.00 -0.62 2.3 0.460 1.320
4 id4 -0.71 0.98 2.10 1.20 -1.5 0.870 -1.550
5 id5 2.10 -1.57 0.24 1.70 -1.2 -1.300 1.980
> data2
factor constant
1 A -0.321
2 B 1.732
3 C 1.230
4 D 3.200
5 E -0.980
6 F -1.400
7 G -0.300
Actually, data1 is a large dataset with ids up to id1000 and factors up to z.
data2 also has the factors from a to z and the corresponding constant for each.
I want to multiply each factor's value in data1 by the constant in data2 corresponding to that factor, for all factors. Then I want to put the total of those products into a new variable 'total' in data1.
For example, the 'total' of 'id1' = (A value 1.00 (data1) x A constant -0.321 (data2)) + (B 0.31 x 1.732) + (C -3.20 x 1.230) + (D 2.50 x 3.200) + (E 3.1 x -0.980) + (F -0.300 x -1.400) + (G -0.214 x -0.300).
If you have ordered the column names in data1 and the rows in data2 in the same order, you can do:
t(t(data1[-1]) * data2$constant)
# A B C D E F G
#1 -0.32100 0.53692 -3.9360 8.000 -3.038 0.4200 0.0642
#2 -0.12840 -3.98360 0.2952 -4.704 -3.136 -0.2128 0.0420
#3 -0.41730 -5.54240 2.4600 -1.984 -2.254 -0.6440 -0.3960
#4 0.22791 1.69736 2.5830 3.840 1.470 -1.2180 0.4650
#5 -0.67410 -2.71924 0.2952 5.440 1.176 1.8200 -0.5940
Or if you need the totals:
res = t(t(data1[-1]) * data2$constant)
res = cbind(res, total = rowSums(res))
res
# A B C D E F G total
#1 -0.32100 0.53692 -3.9360 8.000 -3.038 0.4200 0.0642 1.72612
#2 -0.12840 -3.98360 0.2952 -4.704 -3.136 -0.2128 0.0420 -11.82760
#3 -0.41730 -5.54240 2.4600 -1.984 -2.254 -0.6440 -0.3960 -8.77770
#4 0.22791 1.69736 2.5830 3.840 1.470 -1.2180 0.4650 9.06527
#5 -0.67410 -2.71924 0.2952 5.440 1.176 1.8200 -0.5940 4.74386
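If the orders might differ, a name-based lookup is safer; a small sketch using match():
# look up each column's constant by factor name, then take the row-wise dot product
const <- data2$constant[match(names(data1)[-1], data2$factor)]
data1$total <- drop(as.matrix(data1[-1]) %*% const)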
