How to divide a set of overlapping ranges into non-overlapping ranges in R?

Let's say we have two datasets:
assays:
BHID<-c(127,127,127,127,128)
FROM<-c(950,959,960,961,955)
TO<-c(958,960,961,966,969)
Cu<-c(0.3,0.9,2.5,1.2,0.5)
assays<-data.frame(BHID,FROM,TO,Cu)
and litho:
BHID<-c(125,127,127,127)
FROM<-c(940,949,960,962)
TO<-c(949,960,961,969)
ROCK<-c(1,1,2,3)
litho<-data.frame(BHID,FROM,TO,ROCK)
I want to join the two sets so that, after running the algorithm, the result is:
BHID FROM TO CU ROCK
125 940 970 - 1
127 949 950 - 1
127 950 958 0.3 1
127 958 959 - 1
127 959 960 0.9 1
127 960 961 2.5 2
127 961 962 1.2 -
127 962 966 1.2 3
127 966 969 - 3
128 955 962 0.5 -

Use merge:
merge(assays, litho, all = TRUE)
In essence, all = TRUE is the equivalent of SQL's FULL OUTER JOIN. I haven't specified any columns because, by default, merge joins on all columns that have the same name in both data frames.
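To see what that default does with the question's data, here is a small check. The shared columns are BHID, FROM and TO, so two rows are paired only when all three values are identical; the `by` argument below just spells out the default, and `common` and `joined` are names chosen for this illustration.

```r
assays <- data.frame(BHID = c(127, 127, 127, 127, 128),
                     FROM = c(950, 959, 960, 961, 955),
                     TO   = c(958, 960, 961, 966, 969),
                     Cu   = c(0.3, 0.9, 2.5, 1.2, 0.5))
litho <- data.frame(BHID = c(125, 127, 127, 127),
                    FROM = c(940, 949, 960, 962),
                    TO   = c(949, 960, 961, 969),
                    ROCK = c(1, 1, 2, 3))

# Columns present in both tables: these are the default join keys
common <- intersect(names(assays), names(litho))

# Full outer join, with the default keys written out explicitly
joined <- merge(assays, litho, by = common, all = TRUE)

# Rows where both tables contributed a value
both <- joined[!is.na(joined$Cu) & !is.na(joined$ROCK), ]
```

With this data only the 960 to 961 interval of hole 127 has identical bounds in both tables, so it is the only row with both Cu and ROCK filled in. Intervals that overlap without being identical are kept as separate rows rather than being split.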

Tough one, but the code seems to work. The idea is to first expand each row into many rows, each representing a one-unit step from FROM to TO. After merging, identify contiguous rows that share the same values and collapse them back into a single interval. Obviously this is not a very efficient approach, so it may not scale if your real data has very wide FROM-TO ranges. It also relies on FROM and TO being whole numbers, since seq() steps by 1.
library(plyr)
# expand every assay interval into one-unit intervals
ASSAYS <- adply(assays, 1, with, {
  SEQ <- seq(FROM, TO)
  data.frame(BHID,
             FROM = head(SEQ, -1),
             TO   = tail(SEQ, -1),
             Cu)
})
# same expansion for the lithology intervals
LITHO <- adply(litho, 1, with, {
  SEQ <- seq(FROM, TO)
  data.frame(BHID,
             FROM = head(SEQ, -1),
             TO   = tail(SEQ, -1),
             ROCK)
})
# TRUE wherever an element differs from the previous one (NA-aware)
not.as.previous <- function(x) {
  x1 <- head(x, -1)
  x2 <- tail(x, -1)
  c(TRUE, !is.na(x1) & !is.na(x2) & x1 != x2 |
          is.na(x1) & !is.na(x2) |
          !is.na(x1) & is.na(x2))
}
# full outer join of the expanded one-unit intervals
MERGED <- merge(ASSAYS, LITHO, all = TRUE)
# start a new group whenever BHID, Cu or ROCK changes
MERGED <- transform(MERGED,
                    gp.id = cumsum(not.as.previous(BHID) |
                                   not.as.previous(Cu) |
                                   not.as.previous(ROCK)))
# collapse each group back into a single FROM-TO interval
merged <- ddply(MERGED, "gp.id", function(x) {
  out <- head(x, 1)
  out$TO <- tail(x$TO, 1)
  out
})
merged
# BHID FROM TO Cu ROCK gp.id
# 1 125 940 949 NA 1 1
# 2 127 949 950 NA 1 2
# 3 127 950 958 0.3 1 3
# 4 127 958 959 NA 1 4
# 5 127 959 960 0.9 1 5
# 6 127 960 961 2.5 2 6
# 7 127 961 962 1.2 NA 7
# 8 127 962 966 1.2 3 8
# 9 127 966 969 NA 3 9
# 10 128 955 969 0.5 NA 10
Note that the first row (TO = 949) and the last row (TO = 969) are not exactly the same as in your expected output, but they follow the input data, so I think mine makes more sense.
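If expanding every interval into unit steps is too slow, or FROM and TO are not whole numbers, a possible alternative is to cut each hole at every boundary that occurs in either table and then look up which interval covers each piece. The sketch below uses base R only. `lookup` and `split_one` are hypothetical helper names, and it assumes that intervals within one table do not overlap each other for a given BHID. Unlike the plyr version, it does not fuse neighbouring pieces that happen to carry the same values.

```r
assays <- data.frame(BHID = c(127, 127, 127, 127, 128),
                     FROM = c(950, 959, 960, 961, 955),
                     TO   = c(958, 960, 961, 966, 969),
                     Cu   = c(0.3, 0.9, 2.5, 1.2, 0.5))
litho <- data.frame(BHID = c(125, 127, 127, 127),
                    FROM = c(940, 949, 960, 962),
                    TO   = c(949, 960, 961, 969),
                    ROCK = c(1, 1, 2, 3))

# Value of the interval in `tab` that fully covers [from, to], or NA if none does
lookup <- function(tab, value, from, to) {
  hit <- which(tab$FROM <= from & tab$TO >= to)
  if (length(hit) == 0) NA else tab[[value]][hit[1]]
}

# Split one drill hole at every boundary found in either table
split_one <- function(id) {
  a <- assays[assays$BHID == id, ]
  l <- litho[litho$BHID == id, ]
  cuts <- sort(unique(c(a$FROM, a$TO, l$FROM, l$TO)))
  out <- data.frame(BHID = id, FROM = head(cuts, -1), TO = tail(cuts, -1))
  out$Cu   <- mapply(function(f, t) lookup(a, "Cu", f, t), out$FROM, out$TO)
  out$ROCK <- mapply(function(f, t) lookup(l, "ROCK", f, t), out$FROM, out$TO)
  # drop pieces that neither table covers
  out[!(is.na(out$Cu) & is.na(out$ROCK)), ]
}

ids <- sort(unique(c(assays$BHID, litho$BHID)))
result <- do.call(rbind, lapply(ids, split_one))
rownames(result) <- NULL
result
```

On the sample data this yields the same ten intervals as the plyr output above, without the gp.id column.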

Related

R - Percentage of whole dataframe per column

I have a data frame reporting the count of answers per question (this is just a part of it), and I'd like to obtain the answer percentage for each question. I've found adorn_percentages, but it computes the percentages relative to the whole data frame, whereas I want the percentage within each column. Each column has a total of 2230 answers.
I was thinking of using something like (x/2230)*100, but I don't know how to proceed.
df<-data.frame(q1=c(159,139,1048,571,93), q2=c(106,284,1043,672,125), q3=c(99,222,981,843,94))
    q1   q2  q3
1  159  106  99
2  139  284 222
3 1048 1043 981
4  571  672 843
5   93  125  94
We can divide by colSums after replicating it so that its length matches the data:
100 * df/colSums(df)[col(df)]
Or use sweep:
100 * sweep(df, 2, colSums(df), `/`)
Or use proportions:
df[paste0(names(df), "_prop")] <- 100 * proportions(as.matrix(df), 2)
Output:
> df
q1 q2 q3 q1_prop q2_prop q3_prop
1 159 106 99 7.910448 4.753363 4.421617
2 139 284 222 6.915423 12.735426 9.915141
3 1048 1043 981 52.139303 46.771300 43.814203
4 571 672 843 28.407960 30.134529 37.650737
5 93 125 94 4.626866 5.605381 4.198303
You can apply prop.table to each column:
library(dplyr)
df %>% mutate(across(everything(), prop.table, .names = '{col}_prop') * 100)
# q1 q2 q3 q1_prop q2_prop q3_prop
#1 159 106 99 7.910448 4.753363 4.421617
#2 139 284 222 6.915423 12.735426 9.915141
#3 1048 1043 981 52.139303 46.771300 43.814203
#4 571 672 843 28.407960 30.134529 37.650737
#5 93 125 94 4.626866 5.605381 4.198303
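One thing worth checking before choosing between the answers above and the fixed (x/2230)*100 idea from the question: in the sample shown, the columns do not all add up to 2230, so the two approaches give different numbers. A small base R check, where `pct_by_colsum` and `pct_by_fixed` are names made up for this sketch:

```r
df <- data.frame(q1 = c(159, 139, 1048, 571, 93),
                 q2 = c(106, 284, 1043, 672, 125),
                 q3 = c(99, 222, 981, 843, 94))

# Actual totals of the sample columns
totals <- colSums(df)
totals
#   q1   q2   q3
# 2010 2230 2239

# Percentages relative to each column's own total (what the answers compute)
pct_by_colsum <- 100 * sweep(df, 2, totals, `/`)

# Percentages relative to a fixed denominator (the idea from the question)
pct_by_fixed <- 100 * df / 2230

colSums(pct_by_colsum)  # every column adds up to 100
colSums(pct_by_fixed)   # only q2 adds up to 100 in this sample
```

If the full data really has 2230 answers in every column, both versions agree; dividing by colSums is simply the variant that does not depend on that assumption.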

How to use loop to generate the data in a table in Shiny?

I started learning Shiny a few days ago, and I have been stuck on this problem for a long time.
I need to generate a two-column table whose values are calculated from the input (I can then use this table to draw a scatter plot with ggplot()).
I want to keep the code readable, so I would like to use a for loop to replace potentially hundreds of lines of highly repetitive code. Otherwise, it would look like ((input$meansy1)-1)^2, ((input$meansy1)-2)^2, ..., ((input$meansy1)-100)^2.
I don't know why the result can't be used correctly in data.frame().
This is part of the code:
shinyUI(fluidPage(
numericInput("y1", "y1:", sample(1:100,1), min = 1, max = 100)),
tableOutput("tb")
))
shinyServer(function(input, output,session) {
list <-c()
for (i in 1:100) {
local({
list[[i]] <-reactive(((input$y1)-i)^2)}
)}
dt = data.frame(y_roof = 1:100, B=list)
output$tb <- renderTable({
dt
})
})
When developing a feature for a Shiny app, it makes sense to look at the underlying operation separately from the Shiny context. That way you can figure out whether you have a Shiny-specific issue or not.
Let's look at the operation you want to do first: Iteratively subtracting the values 1 to 100 from x and squaring the result.
You can do this in base R, like this:
x <- 1
dt1 <- data.frame(y_roof = 1:100)
(x - dt1$y_roof)^2
#> [1] 0 1 4 9 16 25 36 49 64 81 100 121 144 169 196
#> [16] 225 256 289 324 361 400 441 484 529 576 625 676 729 784 841
#> [31] 900 961 1024 1089 1156 1225 1296 1369 1444 1521 1600 1681 1764 1849 1936
#> [46] 2025 2116 2209 2304 2401 2500 2601 2704 2809 2916 3025 3136 3249 3364 3481
#> [61] 3600 3721 3844 3969 4096 4225 4356 4489 4624 4761 4900 5041 5184 5329 5476
#> [76] 5625 5776 5929 6084 6241 6400 6561 6724 6889 7056 7225 7396 7569 7744 7921
#> [91] 8100 8281 8464 8649 8836 9025 9216 9409 9604 9801
To store the results in a dataframe change the last line to:
dt1$col2 <- (x - dt1$y_roof)^2
head(dt1)
#> y_roof col2
#> 1 1 0
#> 2 2 1
#> 3 3 4
#> 4 4 9
#> 5 5 16
#> 6 6 25
Doing the same in the tidyverse would look like this:
library(dplyr)
dt2 <-
data.frame(y_roof = 1:100) %>%
mutate(col2 = (x - y_roof)^2)
head(dt2)
#> y_roof col2
#> 1 1 0
#> 2 2 1
#> 3 3 4
#> 4 4 9
#> 5 5 16
#> 6 6 25
Now we can work this into the shiny app:
library(shiny)
library(dplyr)
ui <-
shinyUI(fluidPage(
numericInput("y1", "y1:", sample(1:100, 1), min = 1, max = 100),
tableOutput("tb")
))
server <-
shinyServer(function(input, output, session) {
output$tb <- renderTable({
data.frame(y_roof = 1:100) %>%
mutate(col2 = (input$y1 - y_roof) ^ 2)
})
})
shinyApp(ui, server, options = list(launch.browser = TRUE))
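One reason the loop in the question could not work is that a data.frame column has to hold plain values, whereas reactive() returns a function-like object, and input$y1 can only be read inside a reactive context such as reactive() or renderTable(). Since the question also mentions feeding the table to ggplot(), it may help to keep the calculation in a plain function that can be tested outside Shiny. A minimal sketch, where `make_squares_table` and the `plot` output id are hypothetical names:

```r
# Pure helper: builds the two-column table for a given y1, no Shiny needed
make_squares_table <- function(y1, n = 100) {
  y_roof <- seq_len(n)
  data.frame(y_roof = y_roof, col2 = (y1 - y_roof)^2)
}

head(make_squares_table(1), 3)
#   y_roof col2
# 1      1    0
# 2      2    1
# 3      3    4

# Inside the server function one could then write (not run here):
#   tbl <- reactive(make_squares_table(input$y1))
#   output$tb   <- renderTable(tbl())
#   output$plot <- renderPlot(ggplot(tbl(), aes(y_roof, col2)) + geom_point())
# The second output assumes a matching plotOutput("plot") in the UI.
```

Wrapping the call in a single reactive() means the table is computed once per input change and shared by both outputs.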

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only for the first rows where the values of 'X_POSITION' are increasing. That is, I only want to sum the first run within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and also I'm not sure on how to make sure it's only adding the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
group_by(TRIAL_INDEX) %>%
filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>%
summarise(FIRST_PASS_TIME=sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = TRUE)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
Here is something you can try with the dplyr package:
library(dplyr)
dat %>% group_by(TRIAL_INDEX) %>%
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
If you want to summarize it down to one row per trial, you can use summarize like this:
library(dplyr)
df <- tibble(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
             DURATION = c(204,172,186,670,186,134,182,806,323),
             X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
  group_by(TRIAL_INDEX) %>%
  mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
         x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
  filter(x.increasing == TRUE) %>%
  summarize(FIRST_PASS_TIME = sum(DURATION))  # sum the durations, not the positions
res
#  TRIAL_INDEX FIRST_PASS_TIME
#1           1             562
#2           2            1122
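Both dplyr answers above flag every row whose X_POSITION exceeds the previous one. That matches the sample data, but it would also count a second rise later in the same trial. If only the first increasing run should count, one option is to stop at the first decrease. A minimal base R sketch, with `first_pass` as a hypothetical helper name:

```r
dat <- data.frame(
  TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  DURATION    = c(204, 172, 186, 670, 186, 134, 182, 806, 323),
  X_POSITION  = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0)
)

# Sum `duration` over the leading run of rows in which `x` keeps increasing
first_pass <- function(duration, x) {
  increasing <- c(TRUE, diff(x) > 0)       # the first row always counts
  in_first_run <- cumprod(increasing) == 1 # turns FALSE for good at the first decrease
  sum(duration[in_first_run])
}

# One total per trial, then copied back onto every row of that trial
totals <- sapply(split(dat, dat$TRIAL_INDEX),
                 function(d) first_pass(d$DURATION, d$X_POSITION))
dat$FIRST_PASS_TIME <- unname(totals[as.character(dat$TRIAL_INDEX)])
dat
```

The cumprod() trick is the design choice here: once a FALSE appears, every later product is 0, so a later rise in X_POSITION is ignored.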

R Conditional summing

I've just started my adventure with programming in R. I need to create a program summing numbers divisible by 3 and 5 in the range of 1 to 1000, using the '%%' operator. I came up with an idea to create two matrices with the numbers from 1 to 1000 in one column and their remainders in the second one. However, I don't know how to sum the proper elements (kind of "sum if" function in Excel). I attach all I've done below. Thanks in advance for your help!
s1 <- 1:1000
in1 <- s1 %% 3
m1 <- matrix(c(s1, in1), 1000, 2, byrow = FALSE)
s2 <- 1:1000
in2 <- s2 %% 5
m2 <- matrix(c(s2, in2), 1000, 2, byrow = FALSE)
Mathematically, the best way is probably to find the least common multiple of the two numbers and check the remainder vs that:
# borrowed from Roland Rau
# http://r.789695.n4.nabble.com/Greatest-common-divisor-of-two-numbers-td823047.html
gcd <- function(a,b) if (b==0) a else gcd(b, a %% b)
lcm <- function(a,b) abs(a*b)/gcd(a,b)
s <- seq(1000)
s[ (s %% lcm(3,5)) == 0 ]
# [1] 15 30 45 60 75 90 105 120 135 150 165 180 195 210
# [15] 225 240 255 270 285 300 315 330 345 360 375 390 405 420
# [29] 435 450 465 480 495 510 525 540 555 570 585 600 615 630
# [43] 645 660 675 690 705 720 735 750 765 780 795 810 825 840
# [57] 855 870 885 900 915 930 945 960 975 990
Since your s is every number from 1 to 1000, you could instead do
seq(lcm(3,5), 1000, by=lcm(3,5))
Just use sum on either result if that's what you want to do.
Props to @HoneyDippedBadger for figuring out what the OP was after.
See if this helps:
x <- 1:1000                           # store the numbers 1 to 1000 in x
x                                     # print x
Div <- x[x %% 3 == 0 & x %% 5 == 0]   # extract the numbers divisible by both 3 and 5
Div                                   # print them
length(Div)                           # how many there are
table(x %% 3 == 0 & x %% 5 == 0)      # count TRUE/FALSE for the condition
sum(Div)                              # sum of the numbers divisible by both 3 and 5
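As a cross-check on the answers above, the sum can also be computed without filtering at all, because the numbers divisible by both 3 and 5 are exactly the multiples of 15. The sketch below compares the direct sum with the arithmetic-series formula; `multiples`, `n` and `closed_form` are names chosen for this illustration.

```r
# Multiples of 15 up to 1000
multiples <- seq(15, 1000, by = 15)
sum(multiples)
# [1] 33165

# Same result from the arithmetic series 15 * (1 + 2 + ... + n)
n <- 1000 %/% 15              # number of multiples, here 66
closed_form <- 15 * n * (n + 1) / 2
closed_form
# [1] 33165
```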

I want to repeat plotting for each section

I am trying to repeat a set of plotting commands in R.
My main code is:
data<-read.csv(file.choose(),header=TRUE)
t=data[,1]
PCI=data[,2]
plot(t,PCI,xlim=c(0,30))
boxplot(t,PCI,xlim=c(0,30))
# some starting values
alpha=1
betha=1
theta=1
# do the fit
fit = nls(ydata ~ alpha-beta*exp((-theta*(xdata^gama))), start=list(alpha=80,beta=15,theta=15,gama=-2))
ydata=data[,2]
xdata=data[,1]
new = data.frame(xdata = seq(min(xdata),max(xdata),len=200))
lines(new$xdata,predict(fit,newdata=new))
resid(fit)
qqnorm(resid(fit))
fitted(fit)
resid(fit)
xdata=fitted(fit)
ydata=resid(fit)
plot(xdata,ydata,xlab="fitted values",ylab="Residuals")
abline(0, 0)
My first column is the section number, my second column is t = x, and my third column is PCI = y. I want to repeat my plotting commands for each section individually, but I think I cannot use a loop since the number of data points in each section is not equal.
I would really appreciate your help since I am new to R.
SecNum t PCI
961 1 94.84
961 2 93.04
961 3 91.69
961 11 80.47
961 12 79.26
961 13 77
962 1 90.46
962 2 90.01
962 3 86.88
962 4 86.36
962 5 84.56
962 6 85.11
963 1 91.33
963 2 90.7
963 3 86.46
963 4 88.47
963 5 81.07
963 6 84.07
963 7 82.55
963 8 73.58
963 9 71.85
963 10 83.8
963 11 82.16
To repeat your code for each different SecNum in your data, do something like:
sections <- unique(data$SecNum)
for (sec in sections) {
  # just get the data for that section
  data.section <- subset(data, SecNum == sec)
  # now do all your plotting commands. `data.section` is the
  # subset of `data` that corresponds to this SecNum only.
}
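To make the loop above concrete, here is a hedged, self-contained sketch using a few rows copied from the question. Sections of different sizes are not a problem, because subset() simply returns however many rows each section has. The plots go to a PDF with one page per section; `out_file` is a hypothetical output path, and lm() is only a stand-in for the nls() model from the question, which needs its own starting values.

```r
# Sample rows copied from the question (sections have unequal sizes)
data <- data.frame(
  SecNum = c(961, 961, 961, 962, 962, 962, 962, 963, 963, 963, 963, 963),
  t      = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5),
  PCI    = c(94.84, 93.04, 91.69, 90.46, 90.01, 86.88, 86.36,
             91.33, 90.70, 86.46, 88.47, 81.07)
)

out_file <- file.path(tempdir(), "sections.pdf")  # hypothetical output file
pdf(out_file)                                     # each plot() starts a new page
fits <- list()
for (sec in unique(data$SecNum)) {
  # rows belonging to this section only
  data.section <- subset(data, SecNum == sec)
  plot(data.section$t, data.section$PCI, xlab = "t", ylab = "PCI",
       main = paste("Section", sec))
  # stand-in model; replace with the nls() call for the real fit
  fit <- lm(PCI ~ t, data = data.section)
  abline(fit)
  fits[[as.character(sec)]] <- fit                # keep the fit for later inspection
}
invisible(dev.off())
```

Afterwards, `fits[["962"]]` holds the model for section 962, so residual plots such as qqnorm(resid(fits[["962"]])) can be produced per section in the same way.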
