My data frame looks like this.
data=data.frame(group=c("A","B","C","A","B","C","A","B","C"),
time= c(rep(1,3),rep(2,3), rep(3,3)),
value=c(0,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.2
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
I would like to emphasize my search at the time point 1 and based on the values on that
time point to filter out the groups that do not fulfil a condition from the later time points.
I would like to delete the values of the groups that on the time point 1 are bigger than 0.5
and smaller than 0.1.
I want my data.frame to look like this.
group time value
1 A 1 0.2
2 A 2 0.1
3 A 3 10.0
Any help is highly appreciated.
You can select groups where value at time = 1 is between 0.1 and 0.5.
library(dplyr)
data %>%
group_by(group) %>%
filter(between(value[time == 1], 0.1, 0.5))
# group time value
# <chr> <dbl> <dbl>
#1 A 1 0.2
#2 A 2 0.1
#3 A 3 10
My data is quite large, so I create a small matrix to better illustrate my demand.
test <- matrix(c(1:3, rep(0.5,3),4:1), nrow = 1, dimnames = list(1, 1:10))
The matrix will like this:
1 2 3 4 5 6 7 8 9 10
1 1 2 3 0.5 0.5 0.5 4 3 2 1
I want to subset this matrix with multiple columns when its value is equal to specifc value likes 0.5:
4 5 6
1 0.5 0.5 0.5
Since my data would have more than 10,000 columns, I'm looking for codes that can solve my problem.
I am trying to create a difference variable by subtracting each value in a column from the highest value within the three years prior, for each individual id.
My data looks like this:
data <- data.frame(id = c(1,1,1,1,2,2,3,3,3,3,4,4,4),
testocc = c(1,2,3,4,1,2,1,2,3,4,1,2,3),
score = c(0.8,0.3,0.1,0.2,0.1,0.5,0.9,0.5,0.7,0.6,0.3,0.2,0.6),
time = c(0,1,1,3,0,4,0,4,2,1,0,3,2))
And my desired output looks like this:
> data Score.Maximum
id testocc score time Within.3.Years.Prior Difference (= Score - Score Maximum within 3 Years Prior)
1 1 0.8 0 - 0
1 2 0.3 1 0.8 -0.5
1 3 0.1 1 0.8 -0.7
1 4 0.2 3 0.1 0.1
2 1 0.1 0 - 0
2 2 0.5 4 0.1 (or NA) 0.4 (or NA)
3 1 0.9 0 - 0
3 2 0.5 4 0.9 (or NA) -0.4 (or NA)
3 3 0.7 2 0.5 0.2
3 4 0.6 1 0.7 -0.1
4 1 0.3 0 - 0
4 2 0.2 3 0.3 -0.1
4 3 0.6 2 0.2 0.4
Time here (in years) is the time since the previous testocc, and I want to find out what the highest score is within the past three years from a single testocc. Then I want to subtract the current score from that highest score. Each individual id is treated separately.
I am also hoping for two versions of this:
If the only prior value is >3 ago, I still want to subtract the current value from it (as shown in the desired output above)
If the only prior value is >3 ago, I want to put an NA (as shown beside the desired output in brackets above).
I figure I'll have to calculate all the pairwise times between all testocc's, make a cutoff at 3 years, and then subtract from the max value within that, I just have no idea how to go about this.
Here's a way to solve your version 2 using dplyr. You first group by id and calculate the running time from the time steps. Then you can calculate the maxima using the do syntax. Finally you fix the initial times and calculate the difference.
require(dplyr)
data %>%
group_by(id) %>%
mutate(cumtime = cumsum(time)) %>%
do({
mutate(.,
max = sapply(.$cumtime, function(t){
max(.$score[.$cumtime < t & t - .$cumtime <= 3])
}))
}) %>%
mutate(max = ifelse(cumtime == 0, score, max),
max = ifelse(!is.finite(max), NA, max),
difference = score - max)
I have a dataset that looks like this:
groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
head(sample.data)
groups A B position
1 1 1 3 2
2 2 3 2 1
3 3 2 4 2
4 4 4 1 1
5 5 2 5 2
6 6 5 2 1
The "position" column always alternates between 2 and 1. I want to do this calculation in R: starting from the first row, if it's in position 1, ignore it. If it starts at 2 (as in this example), then calculate as follows:
Take the first 2 values of column A that are at position 2, average them, then subtract the first value that is at position 1 (in this example: (1+2)/2 - 3 = -1.5). Then repeat the calculation for the next set of values, using the last position 2 value as the starting point, i.e. the next calculation would be (2+2)/2 - 4 = -2.
So basically, in this example, the calculations are done for the values of these sets of groups: 1-2-3, 3-4-5, 5-6-7, etc. (the last value of the previous is the first value of the next set of calculation)
Repeat the calculation until the end. Also do the same for column B.
Since I need the original data frame intact, put the newly calculated values in a new data frame(s), with columns dA and dB corresponding to the calculated values of column A and B, respectively (if not possible then they can be created as separated data frames, and I will extract them into one afterwards).
Desired output (from the example):
dA dB
1 -1.5 1.5
2 -2 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4
groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
start <- match(2, sample.data$position)
twos <- seq(from = start, to = nrow(sample.data), by = 2)
df <-
sapply(c("A", "B"), function(l) {
sapply(twos, function(i) {
mean(sample.data[c(i, i+2), l]) - sample.data[i+1, l]
})
})
df <- setNames(as.data.frame(df), c('dA', 'dB'))
As your values in position always alternate between 1 and 2, you can define an index of odd rows i1 and an index of even rows i2, and do your calculations:
## In case first row has position==1, we add an increment of 1 to the indexes
inc=0
if(sample.data$position[1]==1)
{inc=1}
i1=seq(1+inc,nrow(sample.data),by=2)
i2=seq(2+inc,nrow(sample.data),by=2)
res=data.frame(dA=(lead(sample.data$A[i1])+sample.data$A[i1])/2-sample.data$A[i2],
dB=(lead(sample.data$B[i1])+sample.data$B[i1])/2-sample.data$B[i2]);
This returns:
dA dB
1 -1.5 1.5
2 -2.0 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4.0
7 -3.5 2.5
8 -3.0 3.0
9 -3.0 4.5
10 NA NA
The last row returns NA, you can remove it if you need.
res=na.omit(res)
I have a table like the one below with 100's of rows of data.
ID RANK
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3
4 6
4 7
4 7
4 7
4 7
4 7
4 6
I want to try to find a way to group the data by ID so that I can ReRank each group separately. The ReRank column is based on the Rank column and basically renumbering it starting at 1 from least to greatest, but it's important to note that the the number in the ReRank column can be put in more than once depending on the numbers in the Rank column .
In other words, the output needs to look like this
ID Rank ReRANK
1 3 2
1 2 1
1 3 2
2 4 1
2 8 2
3 3 1
3 3 1
3 3 1
For the life of me, I can't figure out how to be able to ReRank the the columns by the grouped columns and the value of the Rank columns.
This has been my best guess so far, but it definitely is not doing what I need it to do
ReRANK = mat.or.vec(length(RANK),1)
ReRANK[1] = counter = 1
for(i in 2:length(RANK)) {
if (RANK[i] != RANK[i-1]) { counter = counter + 1 }
ReRANK[i] = counter
}
Thank you in advance for the help!!
Here is a base R method using ave and rank:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) rank(i, ties.method="min"))
The min argument in rank assures that the minimum ranking will occur when there are ties. the default is to take the mean of the ranks.
In the case that you have ties lower down in the groups, rank will count those lower values and then add continue with the next lowest value as the count of the lower values + 1. These values wil still be ordered and distinct. If you really want to have the count be 1, 2, 3, and so on rather than 1, 3, 6 or whatever depending on the number of duplicate values, here is a little hack using factor:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) {
as.integer(factor(rank(i, ties.method="min"))))
Here, we use factor to build values counting from upward for each level. We then coerce it to be an integer.
For example,
temp <- c(rep(1, 3), 2,5,1,4,3,7)
[1] 2.5 2.5 2.5 5.0 8.0 2.5 7.0 6.0 9.0
rank(temp, ties.method="min")
[1] 1 1 1 5 8 1 7 6 9
as.integer(factor(rank(temp, ties.method="min")))
[1] 1 1 1 2 5 1 4 3 6
data
df <- read.table(header=T, text="ID Rank
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3 ")