Create a column in R based on a condition involving another column

I want to create a new column in my data set containing the proficiency level of students based on their grade in a test. If a student's grade is between 0 and 10, the level assigned should be A1; if the grade is between 11 and 16, the level should be A2; and so on. How can I code this in R? I've tried the code below. The new column was created, but it contains only the level A1, so the condition did not work. Can anyone help me with that?
data$CatEnglishTest=as.factor(ifelse(data$EnglishTestGrade %in%
data$EnglishTestGrade<=10,'A1',
ifelse(data$EnglishTestGrade %in% data$EnglishTestGrade > 10
&& data$EnglishTestGrade < 15,'A2',
as.character(data$EnglishTestGrade))))

library(dplyr)
data <- data.frame(EnglishTestGrade = c(0, 5, 10, 14, 20))
data <- mutate(data,
CatEnglishTest = case_when(
EnglishTestGrade <= 10 ~ "A1",
EnglishTestGrade < 15 ~ "A2",
TRUE ~ "Undefined category"
))
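For more than two or three levels, base R's cut() may be easier to maintain than a chain of conditions. The sketch below assumes the bands from the question (0 to 10 is A1, 11 to 16 is A2); the break points and the choice to leave out-of-range grades as NA are illustrative assumptions, not part of the original answer.

```r
# Example grades; same values as in the case_when() example above.
data <- data.frame(EnglishTestGrade = c(0, 5, 10, 14, 20))

# Bands are (lower, upper]; include.lowest = TRUE makes the first band [0, 10]
# so that a grade of 0 still falls in A1.
# Break points are assumed from the question: 0-10 -> A1, 11-16 -> A2.
data$CatEnglishTest <- cut(
  data$EnglishTestGrade,
  breaks = c(0, 10, 16),
  labels = c("A1", "A2"),
  include.lowest = TRUE
)

# Grades outside every band (here 20) become NA.
print(data)
```

Adding a level then only means adding one break point and one label.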

Say you have the students' scores in:
scores <- data.frame(student=paste("ID",1:10),score = seq(10,100,10))
student score
1 ID 1 10
2 ID 2 20
3 ID 3 30
4 ID 4 40
5 ID 5 50
6 ID 6 60
7 ID 7 70
8 ID 8 80
9 ID 9 90
10 ID 10 100
And the grading scale in
scale <- data.frame(score = seq(0,100,25), grade = LETTERS[5:1])
score grade
1 0 E
2 25 D
3 50 C
4 75 B
5 100 A
Then you could use this code to assign each student a grade
scores$grades <- scale$grade[sapply(scores$score, function(x) tail(which(x >= scale$score),1))]
student score grades
1 ID 1 10 E
2 ID 2 20 E
3 ID 3 30 D
4 ID 4 40 D
5 ID 5 50 C
6 ID 6 60 C
7 ID 7 70 C
8 ID 8 80 B
9 ID 9 90 B
10 ID 10 100 A
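The same lookup can probably be written without sapply() by using base R's findInterval(). This is a sketch under the assumption that the thresholds in scale$score are sorted in increasing order and that no score lies below the lowest threshold.

```r
scores <- data.frame(student = paste("ID", 1:10), score = seq(10, 100, 10))
scale  <- data.frame(score = seq(0, 100, 25), grade = LETTERS[5:1])

# findInterval() returns, for each score, the index of the last threshold
# that is <= the score; the thresholds must be sorted in increasing order.
idx <- findInterval(scores$score, scale$score)

# Use that index to look up the grade for every student in one step.
scores$grades <- scale$grade[idx]
print(scores)
```

A score below the lowest threshold would get index 0, which silently drops the element, so such values should be handled before the lookup.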

I have created a little example DF.
EnglishTestGrade <- sample(1:20,20)
Student <- LETTERS[1:20]
data <- data.frame(EnglishTestGrade, Student)
str(data)
#> 'data.frame': 20 obs. of 2 variables:
#> $ EnglishTestGrade: int 1 15 19 12 3 14 8 16 2 18 ...
#> $ Student : chr "A" "B" "C" "D" ...
data$CatEnglishTest <- ifelse(data$EnglishTestGrade %in%
data[data$EnglishTestGrade <= 10,]$EnglishTestGrade, 'A1',
ifelse(data$EnglishTestGrade %in%
(data[data$EnglishTestGrade > 10,]$EnglishTestGrade &
data[data$EnglishTestGrade < 15,]$EnglishTestGrade), 'A2',
as.character(data$EnglishTestGrade)
)
)
#> Warning in data[data$EnglishTestGrade > 10, ]$EnglishTestGrade & data[data$EnglishTestGrade < : longer object length
#> is not a multiple of shorter object length
data
#> EnglishTestGrade Student CatEnglishTest
#> 1 1 A A1
#> 2 15 B 15
#> 3 19 C 19
#> 4 12 D 12
#> 5 3 E A1
#> 6 14 F 14
#> 7 8 G A1
#> 8 16 H 16
#> 9 2 I A1
#> 10 18 J 18
#> 11 9 K A1
#> 12 10 L A1
#> 13 7 M A1
#> 14 4 N A1
#> 15 11 O 11
#> 16 17 P 17
#> 17 20 Q 20
#> 18 13 R 13
#> 19 6 S A1
#> 20 5 T A1
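The nested ifelse() from the question can also be made to work by dropping %in% and comparing the column directly. In the original attempt, %in% binds more tightly than <=, so the first test is evaluated as (grade %in% grade) <= 10, which is always TRUE, and every row becomes A1. In addition, && is not vectorised; the element-wise operator is &. The sketch below uses the first ten grades from the str() output above, and takes the A2 upper bound of 16 from the question text; the "Undefined category" label is an assumption for grades above that.

```r
data <- data.frame(
  EnglishTestGrade = c(1, 15, 19, 12, 3, 14, 8, 16, 2, 18),
  Student = LETTERS[1:10]
)

# Compare the column directly; use the vectorised & rather than &&.
data$CatEnglishTest <- ifelse(
  data$EnglishTestGrade <= 10, "A1",
  ifelse(data$EnglishTestGrade > 10 & data$EnglishTestGrade <= 16, "A2",
         "Undefined category")  # placeholder label, not from the question
)
print(data)
```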

Related

r average of distance by Id

I have a dataset with two groups of subjects, Group A, Group B like this.
Id Group Age
1 A 17
2 A 14
3 A 10
4 A 17
5 A 12
6 A 6
7 A 18
8 A 7
9 B 18
9 B 13
10 B 6
10 B 12
11 B 16
11 B 17
12 B 11
12 B 18
The subjects in Group A are unique: one row per subject. The subjects in Group B are not unique: there are two, or in some cases three, rows of observations per subject, for example IDs 9 and 10.
What I am trying to do is
a) estimate the average distance of each subject in Group B to everyone in Group A, using Age to estimate the distance.
b) estimate the distance of each subject in Group B to the mode of the subjects in Group A, using Age in Group A to estimate the mode and Age in Group B to estimate the distance from the mode.
Expecting a dataset like this.
ID Group Age AvDistance DistanceToMedian
1 A 17 NA NA
2 A 14 NA NA
3 A 10 NA NA
4 A 17 NA NA
5 A 12 NA NA
6 A 6 NA NA
7 A 18 NA NA
8 A 7 NA NA
9 B 18 5.375 2.11
9 B 13 3.875 2.88
10 B 6 ... ...
10 B 12 ... ...
11 B 16 ... ...
11 B 17 ... ...
12 B 11 ... ...
12 B 18 ... ...
I can do this manually, but any suggestions on how to make this more efficient are much appreciated. Thanks.
# Estimate average distance of each Id in Group B to all subjects in Group A
(sqrt((17 - 18)^2) + sqrt((14 - 18)^2) + sqrt((10 - 18)^2) + sqrt((17 - 18)^2) + sqrt((12 - 18)^2) + sqrt((6 - 18)^2) + sqrt((18 - 18)^2) + sqrt((7 - 18)^2)) / 8 # 43 / 8 = 5.375
(sqrt((17 - 13)^2) + sqrt((14 - 13)^2) + sqrt((10 - 13)^2) + sqrt((17 - 13)^2) + sqrt((12 - 13)^2) + sqrt((6 - 13)^2) + sqrt((18 - 13)^2) + sqrt((7 - 13)^2)) / 8 # 31 / 8 = 3.875
estimate_mode <- function(x) {
d <- density(x)
d$x[which.max(d$y)]
}
# Estimate Mode for Age in Group A
x <- c(17, 14, 10, 17, 12, 6, 18, 7)
estimate_mode(x)
m1 <- estimate_mode(x)
# Estimate distance of each Group B age from the Group A mode
sqrt((18 - m1)^2) # 2.11
sqrt((13 - m1)^2) # 2.88
This will be easier with a unique row ID, so I'll create one:
library(dplyr)
library(tibble)
df = df %>%
mutate(rownum = paste0("row", row_number()))
ages = setNames(df$Age, df$rownum)
## make a distance matrix
dist = outer(ages[df$Group == "B"], ages[df$Group == "A"], FUN = \(x, y) abs(x - y))
## calculate average distances
av_dist = data.frame(AvDist = rowMeans(dist)) %>% rownames_to_column("rownum")
## calculate median age for A
med_a = median(ages[df$Group == "A"])
## add back to original data
df %>%
left_join(av_dist, by = "rownum") %>%
mutate(DistanceToMedian = ifelse(Group == "B", abs(Age - med_a), NA))
# Id Group Age rownum AvDist DistanceToMedian
# 1 1 A 17 row1 NA NA
# 2 2 A 14 row2 NA NA
# 3 3 A 10 row3 NA NA
# 4 4 A 17 row4 NA NA
# 5 5 A 12 row5 NA NA
# 6 6 A 6 row6 NA NA
# 7 7 A 18 row7 NA NA
# 8 8 A 7 row8 NA NA
# 9 9 B 18 row9 5.375 5
# 10 9 B 13 row10 3.875 0
# 11 10 B 6 row11 6.625 7
# 12 10 B 12 row12 3.875 1
# 13 11 B 16 row13 4.375 3
# 14 11 B 17 row14 4.625 4
# 15 12 B 11 row15 4.125 2
# 16 12 B 18 row16 5.375 5
I used median, not mode, because I was looking at your column names, but you can easily swap in your mode instead.
Using this sample data:
df = read.table(text = 'Id Group Age
1 A 17
2 A 14
3 A 10
4 A 17
5 A 12
6 A 6
7 A 18
8 A 7
9 B 18
9 B 13
10 B 6
10 B 12
11 B 16
11 B 17
12 B 11
12 B 18', header = TRUE)
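Swapping the mode back in could look roughly like the following base R sketch. It reuses estimate_mode() from the question; the column name DistanceToMode and the helper variables is_b, ages_a and mode_a are made up for illustration.

```r
df <- data.frame(
  Id    = c(1:8, 9, 9, 10, 10, 11, 11, 12, 12),
  Group = rep(c("A", "B"), each = 8),
  Age   = c(17, 14, 10, 17, 12, 6, 18, 7, 18, 13, 6, 12, 16, 17, 11, 18)
)

# Kernel-density mode estimate, as defined in the question.
estimate_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

is_b   <- df$Group == "B"
ages_a <- df$Age[!is_b]
mode_a <- estimate_mode(ages_a)

# Mean absolute age difference between each Group B row and all of Group A.
df$AvDistance <- NA_real_
df$AvDistance[is_b] <- sapply(df$Age[is_b], function(a) mean(abs(a - ages_a)))

# Absolute distance of each Group B row from the estimated Group A mode.
df$DistanceToMode <- NA_real_
df$DistanceToMode[is_b] <- abs(df$Age[is_b] - mode_a)

print(df)
```

Because the mode comes from a kernel density estimate, its exact value depends on the bandwidth that density() picks, so the distances may differ slightly from hand-computed figures.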

How to code data selected and not selected in r

I have two datasets. One is the big original one (let's call it x). The other is a subset of the original x (let's call it y). I want to add a column to x that indicates whether each participant was selected or not. How do I do that?
thank you.
Difficult to tell exactly without looking at the data, but see if this example explains the approach:
> x <- data.frame(ID = 1:10,
+ Name = LETTERS[1:10],
+ Score = round(rnorm(10, 50,2)))
> x
ID Name Score
1 1 A 49
2 2 B 52
3 3 C 49
4 4 D 52
5 5 E 48
6 6 F 47
7 7 G 56
8 8 H 49
9 9 I 51
10 10 J 51
> y <- subset(x, ID > 6)
> y
ID Name Score
7 7 G 56
8 8 H 49
9 9 I 51
10 10 J 51
> x$In_y <- ifelse(x$ID %in% y$ID, 1, 0)
> x
ID Name Score In_y
1 1 A 49 0
2 2 B 52 0
3 3 C 49 0
4 4 D 52 0
5 5 E 48 0
6 6 F 47 0
7 7 G 56 1
8 8 H 49 1
9 9 I 51 1
10 10 J 51 1
>
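If readable labels are preferred over 1/0, the logical result of %in% can be kept and turned into a factor. This is only a sketch: the column names Selected and SelectedLabel are made up, and it assumes ID uniquely identifies a participant in both data sets.

```r
x <- data.frame(ID = 1:10, Name = LETTERS[1:10])
y <- subset(x, ID > 6)

# TRUE where the participant's ID also appears in the subset y.
x$Selected <- x$ID %in% y$ID

# Optional: a labelled version of the same flag for tables and plots.
x$SelectedLabel <- factor(x$Selected,
                          levels = c(FALSE, TRUE),
                          labels = c("not selected", "selected"))
print(table(x$SelectedLabel))
```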

Is there an R function for selecting common values of 2 dataframe?

I am trying to select the common values of two data frames. I have a big_df and a small_df.
What I am trying to obtain is a data frame containing only the rows whose "ID" values are present in both data frames, and I only want to keep the columns of big_df, not those of small_df.
library(dplyr)
df3 <- merge(big_df, small_df, by =("ID"))
> df3
ID Age Name Colour
1 1 21 a blue
2 4 20 d green
3 8 87 h red
4 9 9 i black
big_df <- data.frame("ID" = 1:10, "Age" = c(21,15,1,20,34,45,67,87,9,77), "Name" = c("a","b","c","d","e","f","g","h","i","l"))
> big_df
ID Age Name
1 1 21 a
2 2 15 b
3 3 1 c
4 4 20 d
5 5 34 e
6 6 45 f
7 7 67 g
8 8 87 h
9 9 9 i
10 10 77 l
small_df <- data.frame("ID" = c(1,4,8,9), "Colour" = c("blue","green","red","black"))
> small_df
ID Colour
1 1 blue
2 4 green
3 8 red
4 9 black
I would like to have this instead, without the colour information:
> df3
ID Age Name
1 1 21 a
2 4 20 d
3 8 87 h
4 9 9 i
dplyr's semi_join() was intended for exactly this:
big_df <- data.frame("ID" = 1:10, "Age" = c(21,15,1,20,34,45,67,87,9,77), "Name" = c("a","b","c","d","e","f","g","h","i","l"))
small_df <- data.frame("ID" = c(1,4,8,9), "Colour" = c("blue","green","red","black"))
library(dplyr)
semi_join(big_df,small_df,by='ID')
#
# ID Age Name
# 1 1 21 a
# 2 4 20 d
# 3 8 87 h
# 4 9 9 i
I have a feeling what you really need is:
#check which big IDs exist in small IDs and subset
big_df[big_df$ID %in% unique(small_df$ID), ]
# ID Age Name
#1 1 21 a
#4 4 20 d
#8 8 87 h
#9 9 9 i
So, I don't think you need a join in this case.
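If you would rather keep the merge() call from the question, one option is to pass only the ID column of small_df, so that no extra columns come along. A minimal sketch; unique() is added on the assumption that small_df could contain repeated IDs, which would otherwise duplicate rows.

```r
big_df <- data.frame(ID = 1:10,
                     Age = c(21, 15, 1, 20, 34, 45, 67, 87, 9, 77),
                     Name = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "l"))
small_df <- data.frame(ID = c(1, 4, 8, 9),
                       Colour = c("blue", "green", "red", "black"))

# Merge against the ID column only, so Colour never enters the result.
df3 <- merge(big_df, unique(small_df["ID"]), by = "ID")
print(df3)
```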

Check time series incongruencies

Let's say that we have the following data frame:
# Build the data frame directly: as.data.frame(cbind(...)) would coerce every
# column, including Visit and Age, to character (or factor), which breaks diff().
x <- data.frame(ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
Visit = c(1,2,3,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
Age = c(14,28,42,14,46,64,71,85,14,28,51,84,66,22,38,32,40,42))
The first column represents the subject ID, the second the visit number, and the third the age at each of these consecutive visits.
What would be the easiest way of finding visits where the age is wrong according to the age at the previous visit? (For example, in row 13, subject C is 66 years old, although at the previous visit he was already 84; in row 16, subject D is 32 years old, although at the previous visit he was already 38.)
How could I highlight the potential errors and remove rows 13 and 16?
I have tried to aggregate by ID and look at the difference between ages across visits, but this is hard because the error could occur at any visit.
How about this in base R?
df <- do.call(rbind.data.frame, lapply(split(x, x$ID), function(w)
w[c(1, which(diff(w[order(w$Visit), "Age"]) > 0) + 1), ]))
df
# ID Visit Age
#A.1 A 1 14
#A.2 A 2 28
#A.3 A 3 42
#B.4 B 1 14
#B.5 B 2 46
#B.6 B 3 64
#B.7 B 4 71
#B.8 B 5 85
#C.9 C 1 14
#C.10 C 2 28
#C.11 C 3 51
#C.12 C 4 84
#D.14 D 1 22
#D.15 D 2 38
#D.17 D 4 40
#D.18 D 5 42
Explanation: We split the dataframe on column ID, then order every dataframe subset by Visit, calculate differences between successive Age values, and only keep those rows where the difference is > 0 (i.e. Age is increasing); rbinding gives the final dataframe.
You could do it by filtering out the rows where diff(Age) is negative for each ID.
Using the dplyr package:
library(dplyr)
x %>% group_by(ID) %>% filter(c(0,diff(Age))>=0)
# A tibble: 16 x 3
# Groups: ID [4]
ID Visit Age
<fctr> <fctr> <fctr>
1 A 1 14
2 A 2 28
3 A 3 42
4 B 1 14
5 B 2 46
6 B 3 64
7 B 4 71
8 B 5 85
9 C 1 14
10 C 2 28
11 C 3 51
12 C 4 84
13 D 1 22
14 D 2 38
15 D 4 40
16 D 5 42
The aggregate() approach is pretty concise.
Removing bad rows
good <- do.call(c, aggregate(Age ~ ID, x, function(z) c(z[1], diff(z)) > 0)$Age)
x[good,]
# ID Visit Age
# 1 A 1 14
# 2 A 2 28
# 3 A 3 42
# 4 B 1 14
# 5 B 2 46
# 6 B 3 64
# 7 B 4 71
# 8 B 5 85
# 9 C 1 14
# 10 C 2 28
# 11 C 3 51
# 12 C 4 84
# 14 D 1 22
# 15 D 2 38
# 17 D 4 40
# 18 D 5 42
This will only highlight which groups have an inconsistency:
aggregate(Age ~ ID, x, function(z) all(diff(z) > 0))
# ID Age
# 1 A TRUE
# 2 B TRUE
# 3 C FALSE
# 4 D FALSE
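To highlight the suspicious rows rather than drop them, a flag column can be added with base R's ave(). This is a sketch that assumes Age is numeric and that a "bad" visit is one whose age is lower than at the previous visit of the same subject; the column name AgeDrop is made up.

```r
x <- data.frame(
  ID    = rep(c("A", "B", "C", "D"), times = c(3, 5, 5, 5)),
  Visit = c(1:3, 1:5, 1:5, 1:5),
  Age   = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 84, 66, 22, 38, 32, 40, 42)
)

# Make sure visits are in order within each subject before differencing.
x <- x[order(x$ID, x$Visit), ]

# Per-subject change in age since the previous visit (0 for the first visit).
age_change <- ave(x$Age, x$ID, FUN = function(a) c(0, diff(a)))

# Flag visits where the age went down.
x$AgeDrop <- age_change < 0
print(x[x$AgeDrop, ])
```

Filtering with x[!x$AgeDrop, ] then gives the cleaned data, while the flag itself stays available for inspection.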

R: subset a data frame based on conditions from another data frame

Here is a problem I am trying to solve. Say, I have two data frames like the following:
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
id = rep(1:4,5),
time1 = rep(c(2,7,12,17,22), each=4),
time2 = rep(c(4,9,14,19,24), each=4))
They both contain a column named id, which links the data frames. I want to have the measurements from observations for which time is between time1 and time2 from the sampletimes data frame. Additionally, I'd like to connect the appropriate location to each measurement.
I have successfully done this by converting my sampletimes to a wide format (i.e. all the time1 and time2 information in one row per entry for id), merging the two data frames by the id variable, and using conditional statements to take only instances when the time falls between at least one of the time intervals in the row, and then assigning location to the appropriate measurement.
However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.
For our example, the data frame I would hope to get back would be as follows:
id time measurement location
1 3 10.5163892 a
2 3 5.5774119 b
3 3 10.5057060 c
4 3 14.1563179 d
1 8 2.2653761 e
2 8 -1.0905546 f
3 8 12.7434161 g
4 8 17.6129261 h
1 13 10.9234673 i
2 13 1.6974481 j
3 13 -0.3664951 k
4 13 13.8792198 l
1 18 6.5038847 m
2 18 1.2032935 n
3 18 15.0889469 o
4 18 0.8934357 p
1 23 3.6864527 q
2 23 0.2404074 r
3 23 11.6028766 s
4 23 20.7466908 t
Here's a proposal with merge:
# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
# extract valid rows
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]
The result:
id time measurement location
11 1 3 7.086246 a
141 2 3 6.893162 b
251 3 3 16.052627 c
376 4 3 -6.559494 d
47 1 8 11.506810 e
137 2 8 10.959782 f
267 3 8 11.079759 g
402 4 8 11.082015 h
83 1 13 5.584257 i
218 2 13 -1.714845 j
283 3 13 -11.196792 k
418 4 13 8.887907 l
99 1 18 1.656558 m
234 2 18 16.573179 n
364 3 18 6.522298 o
454 4 18 1.005123 p
125 1 23 -1.995719 q
250 2 23 -6.676464 r
360 3 23 10.514282 s
490 4 23 3.863357 t
Not efficient, but it does the job:
subset(merge(observations,sampletimes), time > time1 & time < time2)
id time measurement location time1 time2
11 1 3 3.180321 a 2 4
47 1 8 6.040612 e 7 9
83 1 13 -5.999317 i 12 14
99 1 18 2.689414 m 17 19
125 1 23 12.514722 q 22 24
137 2 8 4.420679 f 7 9
141 2 3 11.492446 b 2 4
218 2 13 6.672506 j 12 14
234 2 18 12.290339 n 17 19
250 2 23 12.610828 r 22 24
251 3 3 8.570984 c 2 4
267 3 8 -7.112291 g 7 9
283 3 13 6.287598 k 12 14
360 3 23 11.941846 s 22 24
364 3 18 -4.199001 o 17 19
376 4 3 7.133370 d 2 4
402 4 8 13.477790 h 7 9
418 4 13 3.967293 l 12 14
454 4 18 12.845535 p 17 19
490 4 23 -1.016839 t 22 24
EDIT
Since you have around 2 million rows, you should try a data.table solution:
library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]
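The cartesian merge still builds every id-by-interval combination before filtering. A data.table non-equi join may avoid that intermediate table. The following is a sketch on small made-up data with the same column names as in the question; it assumes a reasonably recent data.table version (one that supports non-equi joins and nomatch = NULL).

```r
library(data.table)

# Small illustrative data; the values are invented, the column names are
# the ones used in the question.
OBS <- data.table(id = c(1, 1, 1, 2, 2, 2),
                  time = c(1, 3, 5, 1, 3, 5),
                  measurement = c(10, 11, 12, 20, 21, 22))
SAM <- data.table(location = c("a", "b"),
                  id = c(1, 2),
                  time1 = c(2, 2),
                  time2 = c(4, 4))

# Non-equi join: match on id and keep observations with time1 < time < time2.
# x.time refers to the original time column of OBS, because the join columns
# themselves take their values from SAM in the result.
res <- OBS[SAM,
           on = .(id, time > time1, time < time2),
           nomatch = NULL,
           .(id, time = x.time, measurement, location)]
print(res)
```

Whether this is actually faster on the real data would need to be measured, since it depends on the number of intervals per id.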
