I have a dataset with two groups of subjects, Group A and Group B, like this:
Id Group Age
1 A 17
2 A 14
3 A 10
4 A 17
5 A 12
6 A 6
7 A 18
8 A 7
9 B 18
9 B 13
10 B 6
10 B 12
11 B 16
11 B 17
12 B 11
12 B 18
The subjects in Group A are unique: one row per subject. The subjects in Group B are not unique: there are two, or in some cases three, rows of observations per subject in Group B (for example, IDs 9, 10, and so on).
What I am trying to do is:
a) estimate the average distance of each subject in Group B to everyone in Group A, using Age to compute the distance;
b) estimate the distance of each subject in Group B to the mode of Group A, using Age in Group A to estimate the mode and Age in Group B to compute the distance from it.
I am expecting a dataset like this:
ID Group Age AvDistance DistanceToMedian
1 A 17 NA NA
2 A 14 NA NA
3 A 10 NA NA
4 A 17 NA NA
5 A 12 NA NA
6 A 6 NA NA
7 A 18 NA NA
8 A 7 NA NA
9 B 18 5.375 2.11
9 B 13 3.875 2.88
10 B 6 ... ...
10 B 12 ... ...
11 B 16 ... ...
11 B 17 ... ...
12 B 11 ... ...
12 B 18 ... ...
I can do this manually; any suggestions on how to make this more efficient are much appreciated. Thanks.
# Estimate the average distance of each Group B row to all subjects in Group A
(sqrt((17 - 18)^2) + sqrt((14 - 18)^2) + sqrt((10 - 18)^2) + sqrt((17 - 18)^2) + sqrt((12 - 18)^2) + sqrt((6 - 18)^2) + sqrt((18 - 18)^2) + sqrt((7 - 18)^2)) / 8  # = 5.375
(sqrt((17 - 13)^2) + sqrt((14 - 13)^2) + sqrt((10 - 13)^2) + sqrt((17 - 13)^2) + sqrt((12 - 13)^2) + sqrt((6 - 13)^2) + sqrt((18 - 13)^2) + sqrt((7 - 13)^2)) / 8  # = 3.875
estimate_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}
# Estimate Mode for Age in Group A
x <- c(17, 14, 10, 17, 12, 6, 18, 7)
estimate_mode(x)
m1 <- estimate_mode(x)
# Distance of each Group B age from the Group A mode
sqrt((18 - m1)^2)  # = 2.11
sqrt((13 - m1)^2)  # = 2.88
This will be easier with a unique row ID, so I'll create one:
library(dplyr)
library(tibble)
df = df %>%
  mutate(rownum = paste0("row", row_number()))
ages = setNames(df$Age, df$rownum)
## make a distance matrix
dist = outer(ages[df$Group == "B"], ages[df$Group == "A"], FUN = \(x, y) abs(x - y))
## calculate average distances
av_dist = data.frame(AvDist = rowMeans(dist)) %>% rownames_to_column("rownum")
## calculate median age for A
med_a = median(ages[df$Group == "A"])
## add back to original data
df %>%
  left_join(av_dist, by = "rownum") %>%
  mutate(DistanceToMedian = ifelse(Group == "B", abs(Age - med_a), NA))
# Id Group Age rownum AvDist DistanceToMedian
# 1 1 A 17 row1 NA NA
# 2 2 A 14 row2 NA NA
# 3 3 A 10 row3 NA NA
# 4 4 A 17 row4 NA NA
# 5 5 A 12 row5 NA NA
# 6 6 A 6 row6 NA NA
# 7 7 A 18 row7 NA NA
# 8 8 A 7 row8 NA NA
# 9 9 B 18 row9 5.375 5
# 10 9 B 13 row10 3.875 0
# 11 10 B 6 row11 6.625 7
# 12 10 B 12 row12 3.875 1
# 13 11 B 16 row13 4.375 3
# 14 11 B 17 row14 4.625 4
# 15 12 B 11 row15 4.125 2
# 16 12 B 18 row16 5.375 5
I used median, not mode, because I was looking at your column names, but you can easily swap in your mode instead.
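Swapping in the mode could look something like the sketch below. It reuses the `estimate_mode()` function from the question; the column name `DistanceToMode` is my own choice, and the data frame is rebuilt inline only so the snippet runs on its own.

```r
# Sample data from the question
df <- data.frame(
  Id    = c(1:8, 9, 9, 10, 10, 11, 11, 12, 12),
  Group = rep(c("A", "B"), each = 8),
  Age   = c(17, 14, 10, 17, 12, 6, 18, 7, 18, 13, 6, 12, 16, 17, 11, 18)
)

# Kernel-density mode estimate, as defined in the question
estimate_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

mode_a <- estimate_mode(df$Age[df$Group == "A"])

# Distance to the Group A mode for Group B rows only; NA for Group A
# (DistanceToMode is an assumed column name)
df$DistanceToMode <- ifelse(df$Group == "B", abs(df$Age - mode_a), NA)
df
```

Because the mode comes from a density estimate, its value depends on the bandwidth that `density()` picks, so treat the resulting distances as approximate.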
Using this sample data:
df = read.table(text = 'Id Group Age
1 A 17
2 A 14
3 A 10
4 A 17
5 A 12
6 A 6
7 A 18
8 A 7
9 B 18
9 B 13
10 B 6
10 B 12
11 B 16
11 B 17
12 B 11
12 B 18', header = T)
I want to create a new column in my data set containing the proficiency level of students based on their grade in a test. If a student's grade is between 0 and 10, the level assigned should be A1; if the grade is between 11 and 16, the level should be A2; and so on. How can I code this in R? I've tried the code below. The new column was created, but it contains only the level A1, so the condition did not work. Can anyone help me with that?
data$CatEnglishTest=as.factor(ifelse(data$EnglishTestGrade %in%
data$EnglishTestGrade<=10,'A1',
ifelse(data$EnglishTestGrade %in% data$EnglishTestGrade > 10
&& data$EnglishTestGrade < 15,'A2',
as.character(data$EnglishTestGrade))))
library(dplyr)
data <- data.frame(EnglishTestGrade = c(0, 5, 10, 14, 20))
data <- mutate(data,
  CatEnglishTest = case_when(
    EnglishTestGrade <= 10 ~ "A1",
    EnglishTestGrade < 15 ~ "A2",
    TRUE ~ "Undefined category"
  ))
Say you have the students' scores in:
scores <- data.frame(student=paste("ID",1:10),score = seq(10,100,10))
student score
1 ID 1 10
2 ID 2 20
3 ID 3 30
4 ID 4 40
5 ID 5 50
6 ID 6 60
7 ID 7 70
8 ID 8 80
9 ID 9 90
10 ID 10 100
And the grading scale in
scale <- data.frame(score = seq(0,100,25), grade = LETTERS[5:1])
score grade
1 0 E
2 25 D
3 50 C
4 75 B
5 100 A
Then you could use this code to assign each student a grade:
scores$grades <- scale$grade[sapply(scores$score, function(x) tail(which(x >= scale$score),1))]
student score grades
1 ID 1 10 E
2 ID 2 20 E
3 ID 3 30 D
4 ID 4 40 D
5 ID 5 50 C
6 ID 6 60 C
7 ID 7 70 C
8 ID 8 80 B
9 ID 9 90 B
10 ID 10 100 A
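The same lookup can probably be written without `sapply`, using base R's `findInterval()`. This is only a sketch on the same `scores` and `scale` objects; it assumes the cut points in `scale$score` are sorted in increasing order and that no score falls below the lowest cut point.

```r
scores <- data.frame(student = paste("ID", 1:10), score = seq(10, 100, 10))
scale  <- data.frame(score = seq(0, 100, 25), grade = LETTERS[5:1])

# findInterval() returns, for each score, the index of the last cut point <= score
scores$grades <- scale$grade[findInterval(scores$score, scale$score)]
scores
```

A score below 0 would get index 0 here, which drops the element and makes the assignment fail, so it is worth checking the range of the scores first.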
I have created a small example data frame.
EnglishTestGrade <- sample(1:20,20)
Student <- LETTERS[1:20]
data <- data.frame(EnglishTestGrade, Student)
str(data)
#> 'data.frame': 20 obs. of 2 variables:
#> $ EnglishTestGrade: int 1 15 19 12 3 14 8 16 2 18 ...
#> $ Student : chr "A" "B" "C" "D" ...
data$CatEnglishTest <- ifelse(data$EnglishTestGrade %in%
data[data$EnglishTestGrade <= 10,]$EnglishTestGrade, 'A1',
ifelse(data$EnglishTestGrade %in%
(data[data$EnglishTestGrade > 10,]$EnglishTestGrade &
data[data$EnglishTestGrade < 15,]$EnglishTestGrade), 'A2',
as.character(data$EnglishTestGrade)
)
)
#> Warning in data[data$EnglishTestGrade > 10, ]$EnglishTestGrade & data[data$EnglishTestGrade < : longer object length
#> is not a multiple of shorter object length
data
#> EnglishTestGrade Student CatEnglishTest
#> 1 1 A A1
#> 2 15 B 15
#> 3 19 C 19
#> 4 12 D 12
#> 5 3 E A1
#> 6 14 F 14
#> 7 8 G A1
#> 8 16 H 16
#> 9 2 I A1
#> 10 18 J 18
#> 11 9 K A1
#> 12 10 L A1
#> 13 7 M A1
#> 14 4 N A1
#> 15 11 O 11
#> 16 17 P 17
#> 17 20 Q 20
#> 18 13 R 13
#> 19 6 S A1
#> 20 5 T A1
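For range-based binning like this, base R's `cut()` is a common alternative to nested `ifelse()` calls. The sketch below assumes integer grades, takes the A2 range from the condition in the question's code (above 10 and below 15), and uses "Other" as a placeholder label for everything higher; none of those choices come from a confirmed grading scheme.

```r
data <- data.frame(EnglishTestGrade = c(0, 5, 10, 11, 14, 15, 20))

# Intervals are right-closed by default: (-Inf, 10], (10, 14], (14, Inf]
data$CatEnglishTest <- cut(
  data$EnglishTestGrade,
  breaks = c(-Inf, 10, 14, Inf),
  labels = c("A1", "A2", "Other")  # "Other" is a placeholder label
)
data
```

More levels can be added by extending `breaks` and `labels` together; `labels` must always have one element fewer than `breaks`.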
I have a data frame with multiple columns (mydata) and another data frame (df) with two columns, factor and coefficient. I want to create a new column in mydata that is the sum of each element in columns a to e of a row multiplied by the corresponding coefficient in df. The result for the first row of newcol should be 70 (1*1 + 2*2 + 3*3 + 4*4 + 8*5). Ideally, I would be able to replicate this 20+ times with different coefficients.
mydata <- data.frame(a = 1:10, b = 2:11, c = 3:12, d = 4:13, d_1 = 5:14, d_2 = 6:15, d_3 = 7:16, e = 8:17)
df <- data.frame(factor = c('a','b','c','d','e'), coefficient = 1:5)
mydata$newcol <- mydata[,c("a","b","c","d","e")] %*% df$coefficient
mydata$newcol2 <- mydata[,c("a","b","c","d_1","e")] %*% df$coefficient
Any advice would be helpful!
We can use sweep here: subset mydata by the factor column in df, multiply each column by its coefficient, and then take rowSums to get the sum.
mydata$newcol <- rowSums(sweep(mydata[as.character(df$factor)], 2,df$coefficient, `*`))
mydata
# a b c d d_1 d_2 d_3 e newcol
#1 1 2 3 4 5 6 7 8 70
#2 2 3 4 5 6 7 8 9 85
#3 3 4 5 6 7 8 9 10 100
#4 4 5 6 7 8 9 10 11 115
#5 5 6 7 8 9 10 11 12 130
#6 6 7 8 9 10 11 12 13 145
#7 7 8 9 10 11 12 13 14 160
#8 8 9 10 11 12 13 14 15 175
#9 9 10 11 12 13 14 15 16 190
#10 10 11 12 13 14 15 16 17 205
Or we can transpose mydata, multiply by the coefficients, and take colSums.
colSums(t(mydata[as.character(df$factor)]) * df$coefficient)
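The matrix-product attempt in the question can also be made to work, since `%*%` needs a numeric matrix rather than a data frame. A minimal sketch, where `weighted_sum()` is a hypothetical helper name of mine and not something from a package:

```r
mydata <- data.frame(a = 1:10, b = 2:11, c = 3:12, d = 4:13,
                     d_1 = 5:14, d_2 = 6:15, d_3 = 7:16, e = 8:17)
df <- data.frame(factor = c("a", "b", "c", "d", "e"), coefficient = 1:5)

# Hypothetical helper: weighted row sum over the named columns
weighted_sum <- function(data, cols, coef) {
  # as.matrix() converts the selected columns so %*% can be applied;
  # drop() turns the resulting one-column matrix into a plain vector
  drop(as.matrix(data[cols]) %*% coef)
}

mydata$newcol  <- weighted_sum(mydata, c("a", "b", "c", "d",   "e"), df$coefficient)
mydata$newcol2 <- weighted_sum(mydata, c("a", "b", "c", "d_1", "e"), df$coefficient)
head(mydata[c("newcol", "newcol2")], 3)
```

Wrapping the product in a function makes it easier to repeat with other column sets or coefficient vectors, which seems to be what the "20+ times" requirement is after.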
I am looking to separate rows of data by Cue and add a row that calculates the averages per subject. Here is an example:
Before:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379
After:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 0.67978 0.51071 0.31723
4 4 22 0.26855 0.17487 0.22461
5 4 20 0.15106 0.48767 0.49072
6 0.209 0.331 0.357
7 7 18 0.11627 0.12604 0.2832
8 7 24 0.50201 0.14252 0.21454
9 0.309 0.134 0.248
10 12 16 0.27649 0.96008 0.42114
11 12 18 0.60852 0.21637 0.18799
12 0.442 0.588 0.304
13 22 20 0.32867 0.65308 0.29388
14 22 24 0.25726 0.37048 0.32379
15 0.292 0.511 0.308
So in the "after" example, line 3 is the average of lines 1 and 2 (line 6 is the average of lines 4 and 5, etc...).
Any help/information would be greatly appreciated!
Thank you!
You can use base R to do something like:
Reduce(rbind,by(data,data[1],function(x)rbind(x,c(NA,NA,colMeans(x[-(1:2)])))))
Cue ITI a b c
1 0 16 0.820620 0.521850 0.276790
2 0 24 0.538940 0.499570 0.357670
3 NA NA 0.679780 0.510710 0.317230
32 4 22 0.268550 0.174870 0.224610
4 4 20 0.151060 0.487670 0.490720
31 NA NA 0.209805 0.331270 0.357665
5 7 18 0.116270 0.126040 0.283200
6 7 24 0.502010 0.142520 0.214540
33 NA NA 0.309140 0.134280 0.248870
7 12 16 0.276490 0.960080 0.421140
8 12 18 0.608520 0.216370 0.187990
34 NA NA 0.442505 0.588225 0.304565
9 22 20 0.328670 0.653080 0.293880
10 22 24 0.257260 0.370480 0.323790
35 NA NA 0.292965 0.511780 0.308835
Here is one idea. Split the data frame, perform the analysis, and then combine them together.
DF_list <- split(DF, f = DF$Cue)
DF_list2 <- lapply(DF_list, function(x){
  df_temp <- as.data.frame(t(colMeans(x[, -c(1, 2)])))
  df_temp[, c("Cue", "ITI")] <- NA
  df <- rbind(x, df_temp)
  return(df)
})
DF2 <- do.call(rbind, DF_list2)
rownames(DF2) <- 1:nrow(DF2)
DF2
# Cue ITI a b c
# 1 0 16 0.820620 0.521850 0.276790
# 2 0 24 0.538940 0.499570 0.357670
# 3 NA NA 0.679780 0.510710 0.317230
# 4 4 22 0.268550 0.174870 0.224610
# 5 4 20 0.151060 0.487670 0.490720
# 6 NA NA 0.209805 0.331270 0.357665
# 7 7 18 0.116270 0.126040 0.283200
# 8 7 24 0.502010 0.142520 0.214540
# 9 NA NA 0.309140 0.134280 0.248870
# 10 12 16 0.276490 0.960080 0.421140
# 11 12 18 0.608520 0.216370 0.187990
# 12 NA NA 0.442505 0.588225 0.304565
# 13 22 20 0.328670 0.653080 0.293880
# 14 22 24 0.257260 0.370480 0.323790
# 15 NA NA 0.292965 0.511780 0.308835
DATA
DF <- read.table(text = " Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379", header = TRUE)
A data.table approach, but if someone can offer some improvements I'd be keen to hear.
library(data.table)
dt <- data.table(df)
dt2 <- dt[, lapply(.SD, mean), by = Cue][,ITI := NA][]
data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
Cue ITI a b c
1: 0 16 0.820620 0.521850 0.276790
2: 0 24 0.538940 0.499570 0.357670
3: NA NA 0.679780 0.510710 0.317230
4: 4 22 0.268550 0.174870 0.224610
5: 4 20 0.151060 0.487670 0.490720
6: NA NA 0.209805 0.331270 0.357665
If you want to leave the Cue values as-is to confirm group, just drop the [is.na(ITI), Cue := NA] from the last line.
I would use group_by and summarise from the dplyr package to get a data frame with the average values. Then rbind the new data frame with the old one and sort by Cue:
library(dplyr)
df_averages <- df_orig %>%
  group_by(Cue) %>%
  summarise(ITI = NA, a = mean(a), b = mean(b), c = mean(c)) %>%
  ungroup()
df_all <- rbind(df_orig, df_averages)
# Sort by Cue; order() leaves ties in their original order, so each average row follows its group
df_all <- df_all[order(df_all$Cue), ]
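If only the per-Cue averages are needed, without interleaving them into the original rows, base R's `aggregate()` may be enough. A small sketch on the first two Cue groups of the sample data; `cue_means` is just an illustrative name:

```r
DF <- read.table(text = "Cue ITI a b c
0 16 0.82062 0.52185 0.27679
0 24 0.53894 0.49957 0.35767
4 22 0.26855 0.17487 0.22461
4 20 0.15106 0.48767 0.49072", header = TRUE)

# One row of column means per Cue value
cue_means <- aggregate(cbind(a, b, c) ~ Cue, data = DF, FUN = mean)
cue_means
```

The result has one row per Cue, so it can be bound back onto the original data the same way as in the answers above.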
My dataset has the following features: player IDs, team, weeks and points.
I want to calculate the mean of the team's points over previous weeks, but not all past weeks, only the last 5 (or fewer, if fewer than 5 previous weeks are available).
Example: for team = A and week = 7, the result will be the average of points for team = A over weeks 2, 3, 4, 5 and 6.
The dataset can be created using the following code:
# set the seed for reproducibility
set.seed(123)
player_id<-c(rep(1,15),rep(2,15),rep(3,15),rep(4,15))
week<-1:15
team<-c(rep("A",30),rep("B",30))
points<-round(runif(60,1,10),0)
mydata<- data.frame(player_id=player_id,team=team,week=rep(week,4),points)
I would like a solution without heavy looping, because the dataset is huge.
I have asked related questions here that might help, but I could not adapt them to this case.
Question 1
Question 2
Thank you!
We can adapt the approach from my answer to one of your other questions if you want a dplyr solution:
library(dplyr)
library(zoo)
# set the seed for reproducibility
set.seed(123)
player_id<-c(rep(1,15),rep(2,15),rep(3,15),rep(4,15))
week<-1:15
team<-c(rep("A",30),rep("B",30))
points<-round(runif(60,1,10),0)
mydata<- data.frame(player_id=player_id,team=team,week=rep(week,4),points)
roll_mean <- function(x, k) {
  result <- rollapplyr(x, k, mean, partial=TRUE, na.rm=TRUE)
  result[is.nan(result)] <- NA
  return( result )
}
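To make it clearer what this helper computes once it is applied to lagged values, here is a rough base R sketch of the same idea without zoo. `trailing_mean()` is a hypothetical name of mine, and it assumes the input has no missing values:

```r
# Hypothetical helper: mean of up to k values preceding each position
trailing_mean <- function(x, k = 5) {
  sapply(seq_along(x), function(i) {
    if (i == 1) return(NA_real_)    # no earlier values to average
    mean(x[max(1, i - k):(i - 1)])  # at most the k previous values
  })
}

# Team A's weekly totals for weeks 1 to 7, taken from the output shown below
trailing_mean(c(13, 11, 6, 13, 19, 10, 13))
# NA 13.00 12.00 10.00 10.75 12.40 11.80
```

This loops over positions in R, so for a huge dataset the zoo version is likely the better fit; the sketch is only meant to show the window logic.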
It might first be easier to aggregate by team:
team_data <- mydata %>%
  select(-player_id) %>%
  group_by(team, week) %>%
  arrange(week) %>%
  summarise(team_points = sum(points)) %>%
  mutate(rolling_team_mean = roll_mean(lag(team_points), k=5)) %>%
  arrange(team)
team_data
# A tibble: 30 x 4
# Groups: team [2]
team week team_points rolling_team_mean
<fctr> <int> <dbl> <dbl>
1 A 1 13 NA
2 A 2 11 13.00
3 A 3 6 12.00
4 A 4 13 10.00
5 A 5 19 10.75
6 A 6 10 12.40
7 A 7 13 11.80
8 A 8 16 12.20
9 A 9 16 14.20
10 A 10 12 14.80
# ... with 20 more rows
Then, if you like we can put everything back together:
mydata <- inner_join(mydata, team_data) %>%
  arrange(week, team, player_id)
mydata[1:12, ]
player_id team week points team_points rolling_team_mean
1 1 A 1 4 13 NA
2 2 A 1 9 13 NA
3 3 B 1 10 12 NA
4 4 B 1 2 12 NA
5 1 A 2 8 11 13
6 2 A 2 3 11 13
7 3 B 2 9 12 12
8 4 B 2 3 12 12
9 1 A 3 5 6 12
10 2 A 3 1 6 12
11 3 B 3 7 12 12
12 4 B 3 5 12 12
Here's one way:
# compute points per team per week
pts <- with(mydata, tapply(points, list(team, week), sum, default = 0))
pts
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#A 13 11 6 13 19 10 13 16 16 12 17 11 13 10 4
#B 12 12 12 11 10 6 13 11 6 9 5 7 13 13 6
# compute the 5-week averages
sapply(setNames(seq(2, ncol(pts)), seq(2, ncol(pts))),
       function(i) {
         apply(pts[, seq(max(1, i - 5), i - 1), drop = FALSE], 1, mean)
       })
# 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#A 13 12 10 10.75 12.4 11.8 12.2 14.2 14.8 13.4 14.8 14.4 13.8 12.6
#B 12 12 12 11.75 11.4 10.2 10.4 10.2 9.2 9.0 8.8 7.6 8.0 9.4
This will give the wrong result if the week variable has gaps.
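A quick way to check for that situation before relying on column positions might look like this. `has_week_gaps()` is a hypothetical helper, and it assumes weeks are coded as consecutive integers:

```r
# Hypothetical helper: TRUE if the observed weeks skip any integer in their range
has_week_gaps <- function(week) {
  observed <- sort(unique(week))
  any(diff(observed) != 1)
}

has_week_gaps(1:15)           # FALSE
has_week_gaps(c(1, 2, 4, 5))  # TRUE
```

If it returns TRUE, the missing weeks would need to be filled in (or the window defined by week value rather than by position) before the averages can be trusted.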
Given a data frame with 6 variables:
x1 var1 x2 var2 x3 var3
How do you count the missing values in variables: var1, var2, var3 BY ROW such that the data frame will have these variables:
x1 var1 x2 var2 x3 var3 num.missing
A reproducible data set with the expected answer would have been very helpful. I'll create one for you:
set.seed(1337)
dat <- data.frame(x1=1:10, var1=runif(10),
                  x2=11:20, var2=runif(10),
                  x3=21:30, var3=runif(10))
dat
x1 var1 x2 var2 x3 var3
1 1 0.57632155 11 0.97943029 21 0.84916377
2 2 0.56474213 12 0.99371759 22 0.72408821
3 3 0.07399023 13 0.82735873 23 0.04661798
4 4 0.45386562 14 0.19398230 24 0.15367816
5 5 0.37327926 15 0.98132543 25 0.56259417
6 6 0.33131745 16 0.02522857 26 0.98142569
7 7 0.94763002 17 0.97238848 27 0.93177423
8 8 0.28111731 18 0.92379666 28 0.89861494
9 9 0.24540405 19 0.33913968 29 0.46979326
10 10 0.14604362 20 0.24657940 30 0.99500811
Deleting a random sample of values:
dat[sample(1:10, 3), "var1"] <- NA
dat[sample(1:10, 3), "var2"] <- NA
dat[sample(1:10, 3), "var3"] <- NA
dat
x1 var1 x2 var2 x3 var3
1 1 NA 11 0.9794303 21 0.8491638
2 2 0.56474213 12 0.9937176 22 0.7240882
3 3 0.07399023 13 NA 23 NA
4 4 0.45386562 14 0.1939823 24 0.1536782
5 5 0.37327926 15 0.9813254 25 0.5625942
6 6 NA 16 NA 26 0.9814257
7 7 0.94763002 17 0.9723885 27 NA
8 8 0.28111731 18 NA 28 0.8986149
9 9 NA 19 0.3391397 29 0.4697933
10 10 0.14604362 20 0.2465794 30 NA
Given that logicals equate to binary integers (TRUE==1, FALSE==0), we can just sum the is.na() tests:
dat$num.missing <- is.na(dat$var1) + is.na(dat$var2) + is.na(dat$var3)
dat
x1 var1 x2 var2 x3 var3 num.missing
1 1 NA 11 0.9794303 21 0.8491638 1
2 2 0.56474213 12 0.9937176 22 0.7240882 0
3 3 0.07399023 13 NA 23 NA 2
4 4 0.45386562 14 0.1939823 24 0.1536782 0
5 5 0.37327926 15 0.9813254 25 0.5625942 0
6 6 NA 16 NA 26 0.9814257 2
7 7 0.94763002 17 0.9723885 27 NA 1
8 8 0.28111731 18 NA 28 0.8986149 1
9 9 NA 19 0.3391397 29 0.4697933 1
10 10 0.14604362 20 0.2465794 30 NA 1
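With more than a handful of variables, the same count can be written with `rowSums()` over a vector of column names. A small sketch on made-up values (not the data above), with `vars` as an assumed name for the columns of interest:

```r
dat <- data.frame(x1 = 1:4,
                  var1 = c(NA, 0.5, 0.1, NA),
                  x2 = 11:14,
                  var2 = c(0.9, 0.8, NA, NA),
                  x3 = 21:24,
                  var3 = c(0.8, 0.7, NA, 0.9))

# Columns whose missing values should be counted
vars <- c("var1", "var2", "var3")

# is.na() gives a logical matrix; rowSums() counts the TRUEs per row
dat$num.missing <- rowSums(is.na(dat[vars]))
dat$num.missing
# 1 0 2 2
```

The column names could also be selected by pattern, for example with `grep("^var", names(dat), value = TRUE)`, if they follow a naming scheme like the one here.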