Cross joining for the computation of a new variable - r

I have a game data set and I observe the number of points of one player.
da = data.frame(points = c(144,186,220,410,433))
da
points
1 144
2 186
3 220
4 410
5 433
I also now, in which the level the player was, because I know the ranges of points for different levels.
ranges = data.frame(level = c(1,2,3,4,5), points_from = c(0,100,200,300,430), points_to = c(100,170,300,430,550))
ranges
level points_from points_to
1 1 0 100
2 2 100 170
3 3 200 300
4 4 300 430
5 5 430 550
Now I want to compute a new variable, that indicates how far away the player was from the next level. It is computed by da$points/ranges$points_to of this specific level.
For example, if the player has 144 points and the next elvel is reached when achieving 170 points, the levle progress is 144/170.
Thus, the data set I want to have looks like this:
da_new = data.frame(points = c(144,186,220,410,433), points_to = c(170,300,300,430,550), level_progress = c(144/170,186/300,220/300,410/430,433/550))
da_new
points points_to level_progress
1 144 170 0.8471
2 186 300 0.6200
3 220 300 0.7333
4 410 430 0.9535
5 433 550 0.7873
How can I now compute this variable?

The main idea is to use merge(da, ranges, all = T) to do a "cross join" between the data. Then, we filter to where points is between points_from and points_to (meaning 186 is not in the final data).
library(dplyr)
merge(da, ranges, all = T) %>%
# keep only where points fall between points_from and points_to
filter(points >= points_from & points <= points_to) %>%
mutate(level_progress = points / points_to)
points level points_from points_to level_progress
1 144 2 100 170 0.8470588
2 220 3 200 300 0.7333333
3 410 4 300 430 0.9534884
4 433 5 430 550 0.7872727
Another option is to filter where points <= point_to, and find where points is closest to points_to (this method keeps 186):
merge(da, ranges, all = T) %>%
filter(points <= points_to) %>%
group_by(points) %>%
slice(which.min(abs(points - points_to))) %>%
mutate(level_progress = points / points_to)
points level points_from points_to level_progress
<dbl> <dbl> <dbl> <dbl> <dbl>
1 144 2 100 170 0.847
2 186 3 200 300 0.62
3 220 3 200 300 0.733
4 410 4 300 430 0.953
5 433 5 430 550 0.787

Here is a base R solution using findInterval
da_new <- da
da_new$points_to <- ranges$points_to[findInterval(da_new$points,c(0,ranges$points_to))]
da_new$level_progress <- da_new$points/da_new$points_to
such that
> da_new
points points_to level_progress
1 144 170 0.8470588
2 186 300 0.6200000
3 220 300 0.7333333
4 410 430 0.9534884
5 433 550 0.7872727

Related

Why does the frequency reduce if I use ifelse function in R?Is there a way to create categories from the combination of 2 variables/columns?

when I do
table(df$strategy.x)
0 1 2 3
70 514 223 209
table(df$strategy.y)
0 1 2 3
729 24 7 4
I want to create a variable with both of these combined. I tried this
df <- df %>%
mutate(nstrategy1 = ifelse(strategy.x==1| strategy.y==1 , 1, 0))
table(df$nstrategy1)
0 1
399 519
I am supposed to get 514 + 24 = 538 but I got 519 instead
df <- df %>% mutate(nstrategy2 = ifelse(strategy.x==2| strategy.y==2 , 1, 0))
table(df$nstrategy2)
0 1
578 228
Similarly, I am supposed to get 223 + 7 = 230, but I got 228 instead
Is there a good way to merge both strategy.x and strategy.y and end up with a table like the following with 4 categories?
0 1 2 3
799 538 230 213
table(mtcars$am) # 13 1's
table(mtcars$vs) # 14 1's
mtcars$ones = ifelse(mtcars$am == 1 | mtcars$vs == 1, 1, 0)
table(mtcars$ones) # 20 1's < 13 + 14 = 27
Why is it showing only 20 1's instead of 27? It's because there are 7 + 6 + 7 = 20 cars with either one or two 1's in am and vs. There are 13 with am==1 (6+7), and 14 with vs==1 (7+7). Seven cars are in the bottom left because they have 1's in both dimensions, which you are expecting/seeking to count twice.
table(mtcars$am, mtcars$vs)
# 0 1
# 0 12 7
# 1 6 7
The simplest way to get the sum of the two results would be by adding the two table objects:
table(mtcars$am) + table(mtcars$vs)
# 0 1
# 37 27

Transforming columns based off separate dataframe - R solution

I'm trying to write some more sophisticated code. I have a problem which I have a solution for, but I would like to increase the flexibility of the code.
Input data (d):
ID Height_cm Height_m Weight_kg Weight_lb
1 180 70
2 165 120
3 1.8 80
4 100 60
5 190 200
6 1.7 100
I want to transform height into cm and weight in to kg, all in one column like this:
ID Height Weight
1 180 70
2 165 54
3 180 80
4 100 45
5 190 90
6 170 100
I have a solution, but it's hard-coded:
library(tidyverse)
d$Height <- NA
d$Weight <-NA
d$Height <- ifelse(!is.na(d$Height_cm), d$Height_cm, d$Height_m * 0.01)
d$Weight <- ifelse(!is.na(d$Weight_kg), d$Weight_kg, d$Weight_lb * 0.45)
d <- d %>% select(ID, Height, Weight)
I want to be more sophisticated and take in an input file (below) and make the transformation based on that dataframe. The results will be the same, but it uses this transformation df to achieve the same thing:
Transformation_d:
marker unit new_col_name transformation
Height_cm centimetre Height 1
Height_m metre Height 0.01
Weight_kg kilogram Weight 1
Weight_lb pounds Weight 0.45
This is where I'm stuck... I'd appreciate some guidance.
Go easy, I'm new to R!
Height_m should have transformation value as 100, I guess.
You can use cur_column() to get the column name to match with Transformation_d dataframe and get corresponding transformation value to multiply.
library(dplyr)
d %>%
mutate(across(-1, ~. * Transformation_d$transformation[match(cur_column(),
Transformation_d$marker)])) %>%
transmute(ID,
Height = coalesce(Height_cm, Height_m),
Weight = coalesce(Weight_kg, Weight_lb))
# ID Height Weight
#1 1 180 70
#2 2 165 54
#3 3 180 80
#4 4 100 27
#5 5 190 90
#6 6 170 100
A similar process in base R
cbind(d[1],
transform(sweep(d[-1], 2,
Transformation_d$transformation[match(names(d)[-1],
Transformation_d$marker)], `*`),
Height = ifelse(is.na(Height_cm), Height_m, Height_cm),
Weight = ifelse(is.na(Weight_kg), Weight_lb, Weight_kg)))

How to get 3 lists with no duplicates in a random sampling? (R)

I have done the first step:
how many persons have more than 1 point
how many persons have more than 3 points
how many persons have more than 6 points
My goal:
I need to have random samples (with no duplicates of persons)
of 3 persons that have more than 1 point
of 3 persons that have more than 3 points
of 3 persons that have more than 6 points
My dataset looks like this:
id person points
201 rt99 NA
201 rt99 3
201 rt99 2
202 kt 4
202 kt NA
202 kt NA
203 rr 4
203 rr NA
203 rr NA
204 jk 2
204 jk 2
204 jk NA
322 knm3 5
322 knm3 NA
322 knm3 3
343 kll2 2
343 kll2 1
343 kll2 5
344 kll NA
344 kll 7
344 kll 1
345 nn 7
345 nn NA
490 kk 1
490 kk NA
490 kk 2
491 ww 1
491 ww 1
489 tt 1
489 tt 1
325 ll 1
325 ll 1
325 ll NA
That is what I have already tried to code, here is an example of code for finding persons that have more than 1 point:
persons_filtered <- dataset %>%
group_by(person) %>%
dplyr::filter(sum(points, na.rm = T)>1) %>%
distinct(person) %>%
pull()
person_filtered
more_than_1 <- sample(person_filtered, size = 3)
Question:
How to write this code better that I could have in the end 3 lists with unique persons. (I need to prevent to have same persons in the lists)
Here's a tidyverse solution, where the sampling in the three categories of interest is made at the same time.
library(tidyverse)
dataset %>%
# Group by person
group_by(person) %>%
# Get points sum
summarize(sum_points = sum(points, na.rm = T)) %>%
# Classify the sum points into categories defined by breaks, (0-1], (1-3] ...
# I used 100 as the last value so that all sum points between 6 and Inf get classified as (6-Inf]
mutate(point_class = cut(sum_points, breaks = c(0,1,3,6,Inf))) %>%
# ungroup
ungroup() %>%
# group by point class
group_by(point_class) %>%
# Sample 3 rows per point_class
sample_n(size = 3) %>%
# Eliminate the sum_points column
select(-sum_points) %>%
# If you need this data in lists you can nest the results in the sampled_data column
nest(sampled_data= -point_class)

How to make new variable across conditions

I need to calculate new variable from data using conditions. New Pheno.
Data set is huge.
I have data set: Animal, Record, Days, Pheno
A R D P
1 1 240 300
1 2 230 290
2 1 305 350
2 2 260 290
3 1 350 450
Conditions are:
Constant pheno per day is 2.
If record days is more than 305 old pheno should be keept.
If record is less than 305 but has next records Pheno should be keept.
If record is less than 305 and have no next records it should be calculated as : 305-days*constant+pheno = (305 - 260)*2+300
Example for animal 1 having less than 305 for both records. So First record will be same in new pheno, but secon record is las and has less than 305, so we need to re-calculate... (305-230)*2+290=440
Finaly data will be like:
A R D P N_P
1 1 240 300 300
1 2 230 290 440
2 1 305 350 350
2 2 260 290 380
3 1 350 450 450
How to do it in R or linux ...
Here is a solution with base R
df <- read.table(header=TRUE, text=
"A R D P
1 1 240 300
1 2 230 290
2 1 305 350
2 2 260 290
3 1 350 450")
newP <- function(d) {
np <- numeric(nrow(d))
for (i in 1:nrow(d)) {
if (d$D[i] > 305) { np[i] <- d$P[i]; next }
if (d$D[i] <= 305 && i<nrow(d)) { np[i] <- d$P[i]; next }
np[i] <- (305-d$D[i])*2 + d$P[i]
}
d$N_P <- np
return(d)
}
D <- split(df, df$A)
D2 <- lapply(D, newP)
do.call(rbind, D2)
Check this out (I assume R is the number of records sorted, so if you have 10 records the last will have R=10)
library(dplyr)
df <- data.frame(A=c(1,1,2,2,3),
R=c(1,2,1,2,1),
D=c(240,230,305,260,350),
P=c(300,290,350,290,450))
df %>% group_by(A) %>%
mutate(N_P=ifelse(( D<305 & R==n()), # check if D<305 & Record is last record
((305-D)*2)+P # calculate new P
,P)) # Else : use old P
Source: local data frame [5 x 5]
Groups: A [3]
A R D P N_P
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 240 300 300
2 1 2 230 290 440
3 2 1 305 350 350
4 2 2 260 290 380
5 3 1 350 450 450
If you have predefined constants that depend on R value in the df, for example :
const <- c(1,2,1.5,2.5,3)
You can replace R in the code by const[R]
df %>% group_by(A) %>%
mutate(N_P=ifelse(( D<305 & R==n()), # check if D<305 & Record is last record
((305-D)*const[R])+P # calculate new P
,P)) # Else : use old P

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but then only those first rows where the values of 'X_POSITION" are increasing. I only want to sum the first round within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and also I'm not sure on how to make sure it's only adding the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
group_by(TRIAL_INDEX) %>%
filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>%
summarise(FIRST_PASS_TIME=sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
Here is something you can try with dplyr package:
library(dplyr);
dat %>% group_by(TRIAL_INDEX) %>%
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)
df <- data_frame(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
DURATION = c(204,172,186,670, 186,134,182,806, 323),
X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
group_by(TRIAL_INDEX) %>%
mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
filter(x.increasing == TRUE) %>%
summarize(FIRST_PASS_TIME = sum(X_POSITION))
res
#Source: local data frame [2 x 2]
#
# TRIAL_INDEX FIRST_PASS_TIME
# (dbl) (dbl)
#1 1 1356.5
#2 2 1365.8

Resources