How to do this in R - r

I have a dataset that looks like this:
groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
head(sample.data)
groups A B position
1 1 1 3 2
2 2 3 2 1
3 3 2 4 2
4 4 4 1 1
5 5 2 5 2
6 6 5 2 1
The "position" column always alternates between 2 and 1. I want to do this calculation in R: starting from the first row, if it's in position 1, ignore it. If it starts at 2 (as in this example), then calculate as follows:
Take the first 2 values of column A that are at position 2, average them, then subtract the first value that is at position 1 (in this example: (1+2)/2 - 3 = -1.5). Then repeat the calculation for the next set of values, using the last position 2 value as the starting point, i.e. the next calculation would be (2+2)/2 - 4 = -2.
So basically, in this example, the calculations are done for the values of these sets of groups: 1-2-3, 3-4-5, 5-6-7, etc. (the last value of the previous is the first value of the next set of calculation)
Repeat the calculation until the end. Also do the same for column B.
Since I need the original data frame intact, put the newly calculated values in a new data frame(s), with columns dA and dB corresponding to the calculated values of column A and B, respectively (if not possible then they can be created as separated data frames, and I will extract them into one afterwards).
Desired output (from the example):
dA dB
1 -1.5 1.5
2 -2 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4

groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
start <- match(2, sample.data$position)
twos <- seq(from = start, to = nrow(sample.data), by = 2)
df <-
sapply(c("A", "B"), function(l) {
sapply(twos, function(i) {
mean(sample.data[c(i, i+2), l]) - sample.data[i+1, l]
})
})
df <- setNames(as.data.frame(df), c('dA', 'dB'))

Because the values in position always alternate between 1 and 2, you can define an index i1 for the position-2 rows and an index i2 for the position-1 rows, and do the calculation on whole vectors. Note that lead() comes from dplyr:
library(dplyr)
## If the first row has position == 1, shift both indexes by 1
inc <- 0
if (sample.data$position[1] == 1) inc <- 1
i1 <- seq(1 + inc, nrow(sample.data), by = 2)
i2 <- seq(2 + inc, nrow(sample.data), by = 2)
res <- data.frame(dA = (lead(sample.data$A[i1]) + sample.data$A[i1]) / 2 - sample.data$A[i2],
dB = (lead(sample.data$B[i1]) + sample.data$B[i1]) / 2 - sample.data$B[i2])
This returns:
dA dB
1 -1.5 1.5
2 -2.0 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4.0
7 -3.5 2.5
8 -3.0 3.0
9 -3.0 4.5
10 NA NA
The last row is NA because lead() has no following value for the final position-2 row; you can remove it if needed.
res=na.omit(res)
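For comparison, here is a minimal base R sketch of the same calculation that needs no extra packages. It assumes the data frame has at least three rows after the first position-2 row and that position strictly alternates, as stated in the question; the helper name delta is made up for this illustration. Stopping the index two rows before the end means every window is complete, so no trailing NA is produced.

```r
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
sample.data <- data.frame(groups = 1:20, A, B, position = rep(c(2, 1), 10))

# First row that is at position 2
start <- match(2, sample.data$position)

# Window starts: every second row, stopping so that i + 2 still exists
i <- seq(from = start, to = nrow(sample.data) - 2, by = 2)

# Mean of the two position-2 values minus the position-1 value between them
delta <- function(v) (v[i] + v[i + 2]) / 2 - v[i + 1]

res <- data.frame(dA = delta(sample.data$A), dB = delta(sample.data$B))
res
```

The original sample.data is left untouched; only res is created.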

Related

Resizing and interpolating middle values in column in R

I have a dataframe.
df <- data.frame(level = c(1:10), values = c(3,4,5,6,8,9,4,2,1,6))
I would like to resize it to fewer levels, let's say 6 levels.
Levels 1 and 10 should correspond to levels 1 and 6 in the new data frame. (I just guessed some floats in between; I am not sure what the result would actually be.)
level value
1 3
2 3.4
3 4.6
4 6.2
5 2.2
6 6
How would I go about doing this?
Maybe you want to use approxfun for interpolation like below?
data.frame(
level = 1:6,
values = approxfun(df$level, df$values)(seq(1, nrow(df), length.out = 6))
)
which gives
level values
1 1 3.0
2 2 4.8
3 3 7.2
4 4 7.0
5 5 1.8
6 6 6.0
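A closely related sketch uses approx(), which can choose the equally spaced interpolation points itself through its n argument. The names n_new and resized are only illustrative.

```r
df <- data.frame(level = 1:10, values = c(3, 4, 5, 6, 8, 9, 4, 2, 1, 6))

n_new <- 6  # number of levels wanted in the resized data frame

# approx() interpolates linearly at n_new points spanning the range of level
resized <- data.frame(
  level  = seq_len(n_new),
  values = approx(df$level, df$values, n = n_new)$y
)
resized
```

This should give the same values as the approxfun() call above, since both interpolate linearly at the same six points.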

How to divide all previous observations by the last observation iteratively within a data frame column by group in R and then store the result

I have the following data frame:
data <- data.frame("Group" = c(1,1,1,1,1,1,1,1,2,2,2,2),
"Days" = c(1,2,3,4,5,6,7,8,1,2,3,4), "Num" = c(10,12,23,30,34,40,50,60,2,4,8,12))
I need to take the last value in Num and divide it by all of the preceding values. Then, I need to move to the second to the last value in Num and do the same, until I reach the first value in each group.
Edited based on the comments below:
In plain language and showing all the math, starting with the first group as suggested below, I am trying to achieve the following:
Take 60 (last value in group 1) and:
Day Num Res
7 60/50 1.2
6 60/40 1.5
5 60/34 1.76
4 60/30 2
3 60/23 2.60
2 60/12 5
1 60/10 6
Then keep only the row that has the value 2, as I don't care about the others (I want the value that is greater or equal to 2 that is the closest to 2) and return the day of that value, which is 4, as well. Then, move on to 50 and do the following:
Day Num Res
6 50/40 1.25
5 50/34 1.47
4 50/30 1.67
3 50/23 2.17
2 50/12 4.17
1 50/10 5
Then keep only the row that has the value 2.17 and return the day of that value, which is 3, as well. Then, move on to 40 and do the same thing over again, move on to 34, then 30, then 23, then 12, the last value (or Day 1 value) I don't care about. Then move on to the next group's last value (12) and repeat the same approach for that group (12/8, 12/4, 12/2; 8/4, 8/2; 4/2)
I would like to store the results of these divisions but only the most recent result that is greater than or equal to 2. I would also like to return the day that result was achieved. Basically, I am trying to calculate doubling time for each day. I would also need this to be grouped by the Group. Normally, I would use dplyr for this but I am not sure how to link up a loop with dplyr to take advantage of group_by. Also, I could be overlooking lapply or some variation thereof. My expected dataframe with the results would ideally be this:
data2 <- data.frame(divres = c(NA,NA,2.3,2.5,2.833333333,3.333333333,2.173913043,2,NA,2,2,3),
obs_n =c(NA,NA,1,2,2,2,3,4,NA,1,2,2))
data3 <- bind_cols(data, data2)
I have tried this first loop to calculate the division but I am lost as to how to move on to the next last value within each group. Right now, this is ignoring the group, though I obviously have not told it to group as I am unclear as to how to do this outside of dplyr.
for(i in 1:nrow(data))
data$test[i] <- ifelse(!is.na(data$Num), last(data$Num)/data$Num[i] , NA)
I also get the following error when I run it:
number of items to replace is not a multiple of replacement length
To store the division, I have tried this:
division <- function(x){
if(x>=2){
return(x)
} else {
return(FALSE)
}
}
for (i in 1:nrow(data)){
data$test[i]<- division(data$test[i])
}
Now, this approach works but only if i need to run this once on the last observation and only if I apply it to 1 group. I have 209 groups and many days that I would need to run this over. I am not sure how to put together the first for loop with the division function and I also am totally lost as to how to do this by group and move to the next last values. Any suggestions would be appreciated.
You can modify your division function to handle a vector and return a data frame with two columns, divres and ind. The latter is the row index that will be used to calculate obs_n, as shown below:
division <- function(x){
lenx <- length(x)
y <- vector(mode="numeric", length = lenx)
z <- vector(mode="numeric", length = lenx)
for (i in lenx:1){
y[i] <- ifelse(length(which(x[i]/x[1:i]>=2))==0,NA,x[i]/x[1:i] [max(which(x[i]/x[1:i]>=2))])
z[i] <- ifelse(is.na(y[i]),NA,max(which(x[i]/x[1:i]>=2)))
}
df <- data.frame(divres = y, ind = z)
return(df)
}
Check the output of division function created above using data$Num as input
> division(data$Num)
divres ind
1 NA NA
2 NA NA
3 2.300000 1
4 2.500000 2
5 2.833333 2
6 3.333333 2
7 2.173913 3
8 2.000000 4
9 NA NA
10 2.000000 9
11 2.000000 10
12 3.000000 10
Use cbind to combine the above output with the data frame data, then use pipes and mutate from dplyr to look up the obs_n value in Days using ind, and select the appropriate columns to generate the desired data frame data2:
data2 <- cbind.data.frame(data, division(data$Num)) %>% mutate(obs_n = Days[ind]) %>% select(-ind)
Output
> data2
Group Days Num divres obs_n
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 2.300000 1
4 1 4 30 2.500000 2
5 1 5 34 2.833333 2
6 1 6 40 3.333333 2
7 1 7 50 2.173913 3
8 1 8 60 2.000000 4
9 2 1 2 NA NA
10 2 2 4 2.000000 1
11 2 3 8 2.000000 2
12 2 4 12 3.000000 2
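Note that division(data$Num) is called on the whole column, so ind is a row index into the full data frame rather than into each group. If ratios must never cross a group boundary, one option is to run the same search one group at a time. A minimal base R sketch under that assumption follows; the helper name doubling and the object res are hypothetical, not part of the answer above.

```r
data <- data.frame(Group = c(1,1,1,1,1,1,1,1,2,2,2,2),
                   Days  = c(1,2,3,4,5,6,7,8,1,2,3,4),
                   Num   = c(10,12,23,30,34,40,50,60,2,4,8,12))

# For each observation, find the most recent earlier observation whose
# ratio to the current value is at least 2, and record ratio and day.
doubling <- function(num, days) {
  n <- length(num)
  divres <- rep(NA_real_, n)
  obs_n  <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    j <- which(num[i] / num[seq_len(i - 1)] >= 2)  # empty when i == 1
    if (length(j) > 0) {
      j <- max(j)                                   # most recent match
      divres[i] <- num[i] / num[j]
      obs_n[i]  <- days[j]
    }
  }
  data.frame(divres, obs_n)
}

# Apply the helper within each group and stack the pieces back together
res <- do.call(rbind, lapply(split(data, data$Group), function(g) {
  cbind(g, doubling(g$Num, g$Days))
}))
res
```

Because the search only looks at rows of the same group, it does not rely on the earlier group's values happening to be too large to match.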
You can create a function with a for loop to get the desired day as given below. Then use that to get the divres in a dplyr mutation.
obs_n <- function(x, days) {
lst <- list()
for(i in length(x):1){
obs <- days[which(rev(x[i]/x[(i-1):1]) >= 2)]
if(length(obs)==0)
lst[[i]] <- NA
else
lst[[i]] <- max(obs)
}
unlist(lst)
}
Then use dense_rank to obtain the row number corresponding to each obs_n. This is needed in case the days are not consecutive, i.e. have gaps.
library(dplyr)
data %>%
group_by(Group) %>%
mutate(obs_n=obs_n(Num, Days), divres=Num/Num[dense_rank(obs_n)])
# A tibble: 12 x 5
# Groups: Group [2]
Group Days Num obs_n divres
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 1 2.3
4 1 4 30 2 2.5
5 1 5 34 2 2.83
6 1 6 40 2 3.33
7 1 7 50 3 2.17
8 1 8 60 4 2
9 2 1 2 NA NA
10 2 2 4 1 2
11 2 3 8 2 2
12 2 4 12 2 3
Explanation of dense ranks (from Wikipedia).
In dense ranking, items that compare equally receive the same ranking number, and the next item(s) receive the immediately following ranking number.
x <- c(NA, NA, 1,2,2,4,6)
dplyr::dense_rank(x)
# [1] NA NA  1  2  2  3  4
Compare with rank (default method="average"). Note that NAs are included at the end by default.
rank(x)
[1] 6.0 7.0 1.0 2.5 2.5 4.0 5.0

Creating uneven sequences in R

Given df=data.frame(x = seq(1:100), y = rnorm(100, mean=3, sd=0.5)) I would like to create a new vector whose ith element is determined by the row in question. If it's one of the first 3 elements after subsetting the data into 5 element subsets, I'd like to put an "a" otherwise a "b".
Output would look like so:
1 2.6 a
2 3.5 a
3 2.6 a
4 2.7 b
5 2.1 b
6 1.8 a
7 3.7 a
8 2.9 a
9 2.7 b
10 3.4 b
The only thought I have is that this question boils down to how one would create uneven sequences, hence the title. Something like if the ith row is a member of each sequence created by ((5*j)-4):((5*j)-2), then call it a, otherwise b. But how could I create a vector of these values? Something like the below would of course not work because each element in rows is itself a sequence and not all the numbers in the sequence.
>rows=vector()
>for (j in 1:(nrow(df)/5)) {
rows[j]=((5*j)-4):((5*j)-2)
}
>classify=vector()
>for (i in 1:(nrow(df))) {
if (is.element(df[i,1], rows)) {
classify[i]="a"
} else {
classify[i]="b"
}
}
>df=cbind(df, classify)
You can try:
df$z <- rep_len(c(rep("a", 3), rep("b", 2)), nrow(df))
head(df, 10)
x y z
1 1 3.467233 a
2 2 2.599982 a
3 3 3.941228 a
4 4 2.833142 b
5 5 4.070231 b
6 6 3.835760 a
7 7 3.688950 a
8 8 2.882646 a
9 9 3.071788 b
10 10 3.358480 b
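An alternative sketch that follows the question's own idea of testing the row number: rows whose zero-based index modulo 5 is 0, 1 or 2 are the first three of each block of five. The column name z is taken from the output above, and idx is an illustrative name; the block size and cut-off are the ones from the question.

```r
set.seed(1)  # only to make the example reproducible
df <- data.frame(x = 1:100, y = rnorm(100, mean = 3, sd = 0.5))

# Zero-based row index, so each block of five maps to 0, 1, 2, 3, 4
idx <- seq_len(nrow(df)) - 1

# First three rows of every block get "a", the remaining two get "b"
df$z <- ifelse(idx %% 5 < 3, "a", "b")
head(df$z, 10)
```

Unlike rep_len(), this form makes it easy to change the rule, for example to a different block size or cut-off, by editing the two numbers in the condition.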

Changing duplicated coordinate values by adding a decimal place R

I have UTM coordinate values from GPS collared leopards, and my analysis gets messed up if there are any points that are identical. What I want to do is add a 1 to the end of the decimal string to make each value unique.
What I have:
> View(coords)
> coords
X Y
1 623190.9 4980021
2 618876.6 4980729
3 618522.7 4980896
4 618522.7 4980096
5 618522.7 4980096
6 622674.1 4976161
I want something like this, or something that will make each number unique (doesn't have to be a +1)
> coords
X Y
1 623190.9 4980021
2 618876.6 4980729
3 618522.7 4980896
4 618522.71 4980096.1
5 618522.72 4977148.2
6 622674.1 4976161
I've looked at existing questions and got this to work for a simulated data set, but not for values that are duplicated more than once.
DF <- data.frame(A=c(5,5,6,6,7,7), B=c(1, 1, 2, 2, 2, 3))
>View(DF)
A B
1 5 1
2 5 1
3 6 2
4 6 2
5 7 2
6 7 3
DF <- do.call(rbind, lapply(split(DF, list(DF$A, DF$B)),
function(x) {
x$A <- x$A + seq(0, by=0.1, length.out=nrow(x))
x$B <- x$B + seq(0, by=0.1, length.out=nrow(x))
x
}))
>View(DF)
A B
5.1.1 5.0 1.0
5.1.2 5.1 1.1
6.2.3 6.0 2.0
6.2.4 6.1 2.1
7.2 7.0 2.0
7.3 7.0 3.0
The '2s' in column B don't continue to add a decimal place when there are more than 2. I also had a problem accomplishing this when the number was more than 4 digits (i.e. XXXXX vs XX). There's probably a better way to do this, but I would love help on adding these decimals and possibly altering them in the original data frame, which has 12 columns of various data.
It is easier to use make.unique
DF[] <- lapply(DF, function(x) as.numeric(make.unique(as.character(x))))
DF
# A B
#1 5.0 1.0
#2 5.1 1.1
#3 6.0 2.0
#4 6.1 2.1
#5 7.0 2.2
#6 7.1 3.0
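One caveat, as far as I can tell: make.unique() appends a suffix such as .1 to the text of the value, so a coordinate that already has a decimal part (for example 618522.7) would become the string 618522.7.1, which as.numeric() turns into NA. For the real coordinates, here is a sketch that adds a small numeric offset to repeated (X, Y) pairs instead. The 0.01 step is an arbitrary choice, and k is an illustrative name.

```r
coords <- data.frame(
  X = c(623190.9, 618876.6, 618522.7, 618522.7, 618522.7, 622674.1),
  Y = c(4980021, 4980729, 4980896, 4980096, 4980096, 4976161)
)

# Within each identical (X, Y) pair, number the occurrences 0, 1, 2, ...
k <- ave(seq_len(nrow(coords)), coords$X, coords$Y, FUN = seq_along) - 1

# Shift only the repeats; first occurrences (k == 0) stay unchanged
coords$X <- coords$X + k * 0.01
coords$Y <- coords$Y + k * 0.01
coords
```

Only rows 4 and 5 of the example are exact duplicates, so only row 5 is shifted. The other columns of a wider data frame are not touched.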

for loop through data frame and looping with unique values

I'm trying to work on code to build a function for three stage cluster sampling, however, I am just working with dummy data right now so I can understand what is going into my function.
I am working on for loops and have a data frame with grouped values. The data frame looks like this:
Cluster group value value.K.bar value.M.bar N.bar
1 1 A 1 1.5 2.5 4
2 1 A 2 1.5 2.5 4
3 1 B 3 4.0 2.5 4
4 1 B 4 4.0 2.5 4
5 2 B 5 4.0 6.0 4
6 2 C 6 6.5 6.0 4
7 2 C 7 6.5 6.0 4
and I am trying to run the for loop
n <- dim(data)[1]
e <- 0
total <- 0
for(i in 1:n) {e = data.y$value.M.bar[i] - data$N.bar[i]
total = total + e^2}
My question is: Is there a way to run the same loop but for the unique values in the group? Say by:
Group 'A', 'B', 'C'
Any help would be greatly appreciated!
Edit: for correct language
You can use by, for example, to apply a function to each group. First I wrap your code in a function that takes data as input.
get.total <- function(data){
n <- dim(data)[1]
e <- 0
total <- 0
for(i in 1:n) {
e <- data$value.M.bar[i] - data$N.bar[i] ## corrected: data.y replaced with data
total <- total + e^2
}
total
}
Then, to compute the total for each group, you do this:
by(data,data$group,FUN=get.total)
data$group: A
[1] 4.5
----------------------------------------------------------------------------------------------------
data$group: B
[1] 8.5
----------------------------------------------------------------------------------------------------
data$group: C
[1] 8
Better still, here is a vectorized version of your function:
by(data,data$group,
function(dat)with(dat, sum((value.M.bar - N.bar)^2)))
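If a plain named vector of totals is more convenient than the by object, tapply() is one alternative. This sketch rebuilds the example data from the table in the question; totals is an illustrative name.

```r
data <- data.frame(
  Cluster     = c(1, 1, 1, 1, 2, 2, 2),
  group       = c("A", "A", "B", "B", "B", "C", "C"),
  value       = 1:7,
  value.K.bar = c(1.5, 1.5, 4, 4, 4, 6.5, 6.5),
  value.M.bar = c(2.5, 2.5, 2.5, 2.5, 6, 6, 6),
  N.bar       = 4
)

# Squared deviations are computed for all rows at once,
# then summed separately within each group
totals <- tapply((data$value.M.bar - data$N.bar)^2, data$group, sum)
totals
```

Individual results can then be picked out by name, for example totals["B"].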
