How to make a new variable across conditions - R

I need to calculate a new variable (new Pheno) from my data using conditions. The data set is huge.
I have a data set with columns: Animal, Record, Days, Pheno
A R D P
1 1 240 300
1 2 230 290
2 1 305 350
2 2 260 290
3 1 350 450
Conditions are:
The constant pheno per day is 2.
If record days is more than 305, the old pheno should be kept.
If record days is less than 305 but there are later records, the pheno should be kept.
If record days is less than 305 and there are no later records, it should be calculated as: (305 - days) * constant + pheno, e.g. (305 - 260) * 2 + 290 = 380.
Example for animal 1, which has less than 305 days for both records: the first record keeps the same value in the new pheno, but the second record is the last one and has less than 305 days, so we need to recalculate: (305 - 230) * 2 + 290 = 440.
Finally the data will look like this:
A R D P N_P
1 1 240 300 300
1 2 230 290 440
2 1 305 350 350
2 2 260 290 380
3 1 350 450 450
How can I do this in R or on the Linux command line?

Here is a solution with base R
df <- read.table(header=TRUE, text=
"A R D P
1 1 240 300
1 2 230 290
2 1 305 350
2 2 260 290
3 1 350 450")

newP <- function(d) {
  np <- numeric(nrow(d))
  for (i in 1:nrow(d)) {
    if (d$D[i] > 305) { np[i] <- d$P[i]; next }                  # more than 305 days: keep old pheno
    if (d$D[i] <= 305 && i < nrow(d)) { np[i] <- d$P[i]; next }  # not the last record: keep old pheno
    np[i] <- (305 - d$D[i]) * 2 + d$P[i]                         # last record with <= 305 days: recalculate
  }
  d$N_P <- np
  return(d)
}

D <- split(df, df$A)   # one data frame per animal
D2 <- lapply(D, newP)
do.call(rbind, D2)
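For completeness, here is a more compact sketch of the same logic in base R (not part of the original answer; it assumes records are stored in order within each animal), using ave() with row indices so the last record per animal can be detected without splitting the data frame:
df$N_P <- with(df, ave(seq_len(nrow(df)), A, FUN = function(i) {
  is_last <- i == i[length(i)]            # TRUE only for the animal's last record
  ifelse(D[i] <= 305 & is_last,
         (305 - D[i]) * 2 + P[i],         # recalculate the last record
         P[i])                            # otherwise keep the old pheno
}))
df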

Check this out. (I assume R is the record number in order, so if an animal has 10 records the last one will have R = 10.)
library(dplyr)

df <- data.frame(A = c(1, 1, 2, 2, 3),
                 R = c(1, 2, 1, 2, 1),
                 D = c(240, 230, 305, 260, 350),
                 P = c(300, 290, 350, 290, 450))

df %>%
  group_by(A) %>%
  mutate(N_P = ifelse(D < 305 & R == n(),  # check if D < 305 & record is the last record
                      (305 - D) * 2 + P,   # calculate new P
                      P))                  # else: use old P
Source: local data frame [5 x 5]
Groups: A [3]
A R D P N_P
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 240 300 300
2 1 2 230 290 440
3 2 1 305 350 350
4 2 2 260 290 380
5 3 1 350 450 450
If you have predefined constants that depend on the R value in the df, for example:
const <- c(1, 2, 1.5, 2.5, 3)
you can replace the constant 2 in the code with const[R]:
df %>%
  group_by(A) %>%
  mutate(N_P = ifelse(D < 305 & R == n(),        # check if D < 305 & record is the last record
                      (305 - D) * const[R] + P,  # calculate new P with the record-specific constant
                      P))                        # else: use old P
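As a side note, if R is not guaranteed to run from 1 to n within each animal, a safer last-record test is to compare against the group maximum (a small sketch along the same lines, not from the original answer):
df %>%
  group_by(A) %>%
  mutate(N_P = ifelse(D < 305 & R == max(R),  # assumes the largest R within an animal marks its last record
                      (305 - D) * 2 + P,
                      P))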

Related

Why does the frequency decrease when I use the ifelse function in R? Is there a way to create categories from the combination of 2 variables/columns?

when I do
table(df$strategy.x)
0 1 2 3
70 514 223 209
table(df$strategy.y)
0 1 2 3
729 24 7 4
I want to create a variable with both of these combined. I tried this:
df <- df %>%
  mutate(nstrategy1 = ifelse(strategy.x == 1 | strategy.y == 1, 1, 0))
table(df$nstrategy1)
0 1
399 519
I am supposed to get 514 + 24 = 538 but I got 519 instead
df <- df %>% mutate(nstrategy2 = ifelse(strategy.x==2| strategy.y==2 , 1, 0))
table(df$nstrategy2)
0 1
578 228
Similarly, I am supposed to get 223 + 7 = 230, but I got 228 instead
Is there a good way to merge both strategy.x and strategy.y and end up with a table like the following with 4 categories?
0 1 2 3
799 538 230 213
table(mtcars$am) # 13 1's
table(mtcars$vs) # 14 1's
mtcars$ones = ifelse(mtcars$am == 1 | mtcars$vs == 1, 1, 0)
table(mtcars$ones) # 20 1's < 13 + 14 = 27
Why is it showing only 20 1's instead of 27? It's because there are 7 + 6 + 7 = 20 cars with a 1 in either am or vs (or both). There are 13 with am == 1 (6 + 7) and 14 with vs == 1 (7 + 7). Seven cars sit in the bottom-right cell because they have 1's in both columns, and those are the cars you are expecting to count twice.
table(mtcars$am, mtcars$vs)
# 0 1
# 0 12 7
# 1 6 7
The simplest way to get the sum of the two results would be by adding the two table objects:
table(mtcars$am) + table(mtcars$vs)
# 0 1
# 37 27
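Applied to the columns in the question (assuming strategy.x and strategy.y are both tabulated over the same levels 0-3), the same trick gives the combined counts directly:
table(df$strategy.x) + table(df$strategy.y)
#   0   1   2   3
# 799 538 230 213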

How to get 3 lists with no duplicates in a random sampling? (R)

I have done the first step:
how many persons have more than 1 point
how many persons have more than 3 points
how many persons have more than 6 points
My goal:
I need to have random samples (with no duplicates of persons)
of 3 persons that have more than 1 point
of 3 persons that have more than 3 points
of 3 persons that have more than 6 points
My dataset looks like this:
id person points
201 rt99 NA
201 rt99 3
201 rt99 2
202 kt 4
202 kt NA
202 kt NA
203 rr 4
203 rr NA
203 rr NA
204 jk 2
204 jk 2
204 jk NA
322 knm3 5
322 knm3 NA
322 knm3 3
343 kll2 2
343 kll2 1
343 kll2 5
344 kll NA
344 kll 7
344 kll 1
345 nn 7
345 nn NA
490 kk 1
490 kk NA
490 kk 2
491 ww 1
491 ww 1
489 tt 1
489 tt 1
325 ll 1
325 ll 1
325 ll NA
This is what I have tried so far; here is an example of code for finding persons that have more than 1 point:
persons_filtered <- dataset %>%
  group_by(person) %>%
  dplyr::filter(sum(points, na.rm = TRUE) > 1) %>%
  distinct(person) %>%
  pull()
persons_filtered

more_than_1 <- sample(persons_filtered, size = 3)
Question:
How can I write this code so that I end up with 3 lists of unique persons? (I need to prevent the same person from appearing in more than one list.)
Here's a tidyverse solution, where the sampling in the three categories of interest is done at the same time.
library(tidyverse)

dataset %>%
  # Group by person
  group_by(person) %>%
  # Get the sum of points per person
  summarize(sum_points = sum(points, na.rm = TRUE)) %>%
  # Classify the summed points into categories defined by breaks: (0,1], (1,3], (3,6], (6,Inf]
  # Inf as the last break ensures that all sums above 6 get classified as (6,Inf]
  mutate(point_class = cut(sum_points, breaks = c(0, 1, 3, 6, Inf))) %>%
  # ungroup
  ungroup() %>%
  # group by point class
  group_by(point_class) %>%
  # Sample 3 persons per point_class
  sample_n(size = 3) %>%
  # Eliminate the sum_points column
  select(-sum_points) %>%
  # If you need this data in lists you can nest the results in the sampled_data column
  nest(sampled_data = -point_class)
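If you actually need three plain vectors of persons rather than a nested tibble, one way (a sketch; it assumes the result of the pipeline above has been stored in an object called sampled) is to pull them out of the list-column:
# `sampled` is assumed to hold the result of the pipeline above
lists_of_persons <- setNames(
  lapply(sampled$sampled_data, function(d) d$person),
  as.character(sampled$point_class)
)
lists_of_persons  # one vector of 3 persons per point class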

Cross joining for the computation of a new variable

I have a game data set and I observe the number of points of one player.
da = data.frame(points = c(144,186,220,410,433))
da
points
1 144
2 186
3 220
4 410
5 433
I also know which level the player was in, because I know the ranges of points for the different levels.
ranges = data.frame(level = c(1,2,3,4,5), points_from = c(0,100,200,300,430), points_to = c(100,170,300,430,550))
ranges
level points_from points_to
1 1 0 100
2 2 100 170
3 3 200 300
4 4 300 430
5 5 430 550
Now I want to compute a new variable that indicates how far the player was from the next level. It is computed as da$points / ranges$points_to for the player's specific level.
For example, if the player has 144 points and the next level is reached at 170 points, the level progress is 144/170.
Thus, the data set I want to have looks like this:
da_new = data.frame(points = c(144,186,220,410,433), points_to = c(170,300,300,430,550), level_progress = c(144/170,186/300,220/300,410/430,433/550))
da_new
points points_to level_progress
1 144 170 0.8471
2 186 300 0.6200
3 220 300 0.7333
4 410 430 0.9535
5 433 550 0.7873
How can I now compute this variable?
The main idea is to use merge(da, ranges, all = TRUE) to do a "cross join" between the two data frames. Then we filter to rows where points lies between points_from and points_to (meaning 186 is not in the final data).
library(dplyr)

merge(da, ranges, all = TRUE) %>%
  # keep only rows where points fall between points_from and points_to
  filter(points >= points_from & points <= points_to) %>%
  mutate(level_progress = points / points_to)
points level points_from points_to level_progress
1 144 2 100 170 0.8470588
2 220 3 200 300 0.7333333
3 410 4 300 430 0.9534884
4 433 5 430 550 0.7872727
Another option is to filter to rows where points <= points_to and then keep, for each points value, the row where points_to is closest (this method keeps 186):
merge(da, ranges, all = TRUE) %>%
  filter(points <= points_to) %>%
  group_by(points) %>%
  slice(which.min(abs(points - points_to))) %>%
  mutate(level_progress = points / points_to)
points level points_from points_to level_progress
<dbl> <dbl> <dbl> <dbl> <dbl>
1 144 2 100 170 0.847
2 186 3 200 300 0.62
3 220 3 200 300 0.733
4 410 4 300 430 0.953
5 433 5 430 550 0.787
Here is a base R solution using findInterval
da_new <- da
da_new$points_to <- ranges$points_to[findInterval(da_new$points,c(0,ranges$points_to))]
da_new$level_progress <- da_new$points/da_new$points_to
such that
> da_new
points points_to level_progress
1 144 170 0.8470588
2 186 300 0.6200000
3 220 300 0.7333333
4 410 430 0.9534884
5 433 550 0.7872727
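To see what findInterval is doing here: it returns, for each score, the index of the half-open interval defined by the breaks that contains it, and that index is then used to look up points_to. A quick illustrative check with the breaks from the answer above:
findInterval(c(144, 186, 220, 410, 433), c(0, 100, 170, 300, 430, 550))
# [1] 2 3 3 4 5
# e.g. 186 falls in [170, 300), so index 3 picks ranges$points_to[3] == 300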

Divide a vector by different values based on the result of the division

I have a Df like this:
x y z
<dbl> <dbl> <dbl>
1 408001.9 343 0
2 407919.2 343 0
3 407839.6 343 0
4 407761.2 343 0
5 407681.7 343 0
6 407599.0 343 0
7 407511.0 343 0
8 407420.5 343 0
9 407331.0 343 0
10 407242.0 343 0
11 407152.7 343 0
12 407062.5 343 0
13 406970.7 343 0
14 406876.6 342 0
15 406777.1 342 0
16 406671.0 342 0
17 406560.9 342 0
18 406449.4 342 0
19 406339.0 342 0
20 406232.5 342 0
... ... ... ...
with x decreasing.
And a vector like
vec = c(a1, a2, a3, a4, a5, a6, ...)
with a1 < a2 < a3 < a4 ...
Now I want to divide df$x by vec[1], which gives the same (rounded) result as df$y.
But when the computed value in df$z drops by one, to 342, I want to divide df$x by vec[2] from then on to get the new df$z values.
From here on the result will differ from df$y, because for df$y the divisor is always vec[1] and never changes.
Every time the value I get for df$z drops by one, the following df$z values shall be calculated with the corresponding vec[i], where i is the number of drops so far plus 1.
In the end I want a vector df$z where each value is df$x / vec[i], and vec[i] depends on what the last value of df$z was.
Reproducible example:
test <- data.frame(x = sort(seq(500, 600, 2), decreasing = TRUE))
vec <- seq(10, 10.9, 0.03)

for (i in 1:31) {
  test[i + 1] <- round(test$x / vec[i])
}
This will give you a df with one column for every value of vec that test$x was divided by.
Now, in the end, my vector shall contain the values of col2 until the value in col2 drops from 60 to 59. After that I want the values from col3, until the value in col3 drops below 59 to 58. Then I want the values from col4, and so on.
How can I achieve this with arbitrary data (like my data above, which is not linearly distributed like this example)?
I tried some for and while loops, but none worked. I didn't even get close to what I want.
I think my problem is that I don't know how to make the condition depend on a value (the value of df$z at row i) that I want to calculate in the same operation. I want to calculate df$z[i] with the vec[t] that has been used so far, but if df$z drops by one at a certain row i, then vec[t + 1] shall be used for the division from then on.
Thanks for your help.
I hope I've understood what you are asking. This might be it...
test <- data.frame(x = sort(seq(500, 600, 2), decreasing = TRUE))
vec <- seq(10, 10.9, 0.03)

# this function determines the index of `vec` to use for each row
xcol <- function(v) {
  x <- rep(NA, length(v))
  x[1] <- 1
  for (i in 2:length(v)) {
    x[i] <- x[i - 1]
    # if the rounded quotient drops compared to the previous row,
    # switch to the next element of vec
    if (round(v[i] / vec[x[i]]) < round(v[i - 1] / vec[x[i]])) {
      x[i] <- x[i] + 1
    }
  }
  return(x)
}

test$xcol <- xcol(test$x)
test$z <- round(test$x / vec[test$xcol])
test
x xcol z
1 600 1 60
2 598 1 60
3 596 1 60
4 594 2 59
5 592 2 59
6 590 2 59
7 588 2 59
8 586 3 58
9 584 3 58
10 582 3 58
11 580 3 58
12 578 4 57
...

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column DURATION per TRIAL_INDEX, but only those first rows where the values of X_POSITION are increasing. I only want to sum the first run within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204 + 172 + 186), as these are the rows where X keeps reaching its highest value so far (going through the data frame row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr to generate a new data frame that can be merged with my original data frame.
However, the code doesn't work, and I'm also not sure how to make sure it only adds the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
  group_by(TRIAL_INDEX) %>%
  filter(dplyr::lag(dat$X_POSITION, 1) > dat$X_POSITION) %>%
  summarise(FIRST_PASS_TIME = sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
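To unpack the row-selection step, here is a quick check using the X_POSITION values of the first trial: which(diff(...) < 0) finds where the series first decreases, min(.N, ...) caps the result at the group size in case X never decreases, and .I[...] converts the within-group positions into row numbers of the full table.
x1 <- c(314.5, 471.6, 570.4, 539.5, 503.6)   # X_POSITION for TRIAL_INDEX 1
which(diff(x1) < 0)                          # 3 4 -> first decrease happens after row 3
min(5, which(diff(x1) < 0), na.rm = TRUE)    # 3  -> rows 1:3 are summed: 204 + 172 + 186 = 562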
Here is something you can try with the dplyr package:
library(dplyr)
dat %>%
  group_by(TRIAL_INDEX) %>%
  mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
  mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
  select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)

df <- data_frame(TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                 DURATION = c(204, 172, 186, 670, 186, 134, 182, 806, 323),
                 X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))

res <- df %>%
  group_by(TRIAL_INDEX) %>%
  mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
         x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
  filter(x.increasing == TRUE) %>%
  summarize(FIRST_PASS_TIME = sum(DURATION))
res
#Source: local data frame [2 x 2]
#
#  TRIAL_INDEX FIRST_PASS_TIME
#        (dbl)           (dbl)
#1           1             562
#2           2            1122
