Calculate variable based on criteria in R

How can I add a new column to my data frame that would take into consideration some criteria, such as:
ID AGE PERNO
1 30 1
1 25 2
2 25 1
2 24 2
2 3 3
3 65 1
3 55 2
to end with a table like:
ID AGE PERNO AGE_HEAD
1 30 1 30
1 25 2 30
2 25 1 25
2 24 2 25
2 3 3 25
3 65 1 65
3 55 2 65
Essentially, I want the AGE of the PERNO 1 row repeated across all rows with the same ID.

plyr solution:
library(plyr)
ddply(df,.(ID),transform,AGE_HEAD=head(AGE,1))
OR
ddply(df,.(ID),transform,AGE_HEAD=AGE[PERNO==1])
ID AGE PERNO AGE_HEAD
1 1 30 1 30
2 1 25 2 30
3 2 25 1 25
4 2 24 2 25
5 2 3 3 25
6 3 65 1 65
7 3 55 2 65
data.table solution:
library(data.table)
DT<-data.table(df)
DT[, AGE_HEAD := AGE[PERNO==1], by="ID"]
ID AGE PERNO AGE_HEAD
1: 1 30 1 30
2: 1 25 2 30
3: 2 25 1 25
4: 2 24 2 25
5: 2 3 3 25
6: 3 65 1 65
7: 3 55 2 65

As far as I understand, what you want is to pick the value of AGE for each level of ID where PERNO is 1, which in this example happens (by chance) to be the same as taking the maximum value of AGE. If I'm not wrong, this code is what you are after:
> transform(df, AGE_HEAD=rep(df$AGE[df$PERNO==1], rle(df$ID)$lengths))
ID AGE PERNO AGE_HEAD
1 1 30 1 30
2 1 25 2 30
3 2 25 1 25
4 2 24 2 25
5 2 3 3 25
6 3 65 1 65
7 3 55 2 65
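For completeness, the same lookup can also be written in base R or dplyr. This is a sketch, assuming the data frame is named df as above and that each ID has exactly one PERNO == 1 row:
# base R: look up, for each row's ID, the AGE of that ID's PERNO == 1 row
heads <- df[df$PERNO == 1, ]
df$AGE_HEAD <- heads$AGE[match(df$ID, heads$ID)]
# dplyr: broadcast the PERNO == 1 age within each ID
library(dplyr)
df %>% group_by(ID) %>% mutate(AGE_HEAD = AGE[PERNO == 1]) %>% ungroup()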

Related

Add rows to dataframe in R based on values in column

I have a dataframe with 2 columns: time and day. There are 3 days, and for each day, time runs from 1 to 12. I want to add new rows for each day with times -2, 1 and 0. How do I do this?
I have tried using add_row and specifying the row number to add to, but this changes each time a new row is added, making the process tedious. Thanks in advance.
We could use add_row, then slice the desired sequence, and bind all to a dataframe:
library(tibble)
library(dplyr)
df1 <- df %>%
  add_row(time = -2:0, Day = c(1, 1, 1), .before = 1) %>%  # prepend times -2, -1, 0 for Day 1
  slice(1:15)                                              # keep times -2..12 for a single day
df2 <- bind_rows(df1, df1, df1) %>%                        # stack one block per day
  mutate(Day = rep(row_number(), each = 15, length.out = n()))  # relabel Day: 1, 2, 3 per block of 15
Output:
# A tibble: 45 x 2
time Day
<dbl> <int>
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
Here's a fast way to create the desired dataframe from scratch using expand.grid(), rather than adding individual rows:
df <- expand.grid(-2:12,1:3)
colnames(df) <- c("time","day")
Results:
df
time day
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
You can use tidyr::crossing
library(dplyr)
library(tidyr)
add_values <- c(-2, 1, 0)
crossing(time = add_values, Day = unique(day$Day)) %>%
bind_rows(day) %>%
arrange(Day, time)
# A tibble: 45 x 2
# time Day
# <dbl> <int>
# 1 -2 1
# 2 0 1
# 3 1 1
# 4 1 1
# 5 2 1
# 6 3 1
# 7 4 1
# 8 5 1
# 9 6 1
#10 7 1
# … with 35 more rows
If you meant -2, -1 and 0 you can also use complete.
tidyr::complete(day, Day, time = -2:0)
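For comparison, the same cross-join-and-bind idea can be written with data.table; a sketch, assuming the data frame is called day with columns Day and time, and using -2, -1 and 0 as in the answers above:
library(data.table)
DT <- as.data.table(day)
new_rows <- CJ(Day = unique(DT$Day), time = c(-2, -1, 0))  # cross join of days and new times
out <- rbindlist(list(DT, new_rows), use.names = TRUE)
setorder(out, Day, time)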

Finding cumulative second max per group in R

I have a dataset where I would like to create a new variable that is the cumulative second largest value of another variable, and I would like to perform this function per group.
Let's say I create the following example data frame:
(df1 <- data.frame(patient = rep(1:5, each=8), visit = rep(1:2,each=4,5), trial = rep(1:4,10), var1 = sample(1:50,20,replace=TRUE)))
This is pretend data that represents 5 patients who each had 2 study visits, and each visit had 4 trials with a measurement taken (var1).
> head(df1,n=20)
patient visit trial var1
1 1 1 1 25
2 1 1 2 23
3 1 1 3 48
4 1 1 4 37
5 1 2 1 41
6 1 2 2 45
7 1 2 3 8
8 1 2 4 9
9 2 1 1 26
10 2 1 2 14
11 2 1 3 41
12 2 1 4 35
13 2 2 1 37
14 2 2 2 30
15 2 2 3 14
16 2 2 4 28
17 3 1 1 34
18 3 1 2 19
19 3 1 3 28
20 3 1 4 10
I would like to create a new variable, cum2ndmax, that is the cumulative 2nd largest value of var1 and I would like to group this variable by patient # and visit #.
I figured out how to calculate the cumulative 2nd max number like so:
df1$cum2ndmax <- sapply(seq_along(df1$var1),function(x){sort(df1$var1[seq(x)],decreasing=TRUE)[2]})
df1
However, this calculates the cumulative 2nd max across the whole dataset, not for each group. I have attempted to calculate this variable using grouped data like so after installing and loading package dplyr:
library(dplyr)
df2 <- df1 %>%
group_by(patient,visit) %>%
mutate(cum2ndmax = sapply(seq_along(df1$var1),function(x){sort(df1$var1[seq(x)],decreasing=TRUE)[2]}))
But I get an error:
Error: Problem with `mutate()` input `cum2ndmax`.
x Input `cum2ndmax` can't be recycled to size 4.
Ideally, my result would look something like this:
patient visit trial var1 cum2ndmax
1 1 1 25 NA
1 1 2 23 23
1 1 3 48 25
1 1 4 37 37
1 2 1 41 NA
1 2 2 45 41
1 2 3 8 41
1 2 4 9 41
2 1 1 26 NA
2 1 2 14 14
2 1 3 41 26
2 1 4 35 35
… … … … …
Any help in getting this to work in R would be much appreciated! Thank you!
One dplyr and purrr option could be:
df1 %>%
group_by(patient, visit) %>%
mutate(cum_second_max = map_dbl(.x = seq_along(var1),
~ ifelse(.x == 1, NA, var1[dense_rank(-var1[1:.x]) == 2])))
patient visit trial var1 cum_second_max
<int> <int> <int> <int> <dbl>
1 1 1 1 25 NA
2 1 1 2 23 23
3 1 1 3 48 25
4 1 1 4 37 37
5 1 2 1 41 NA
6 1 2 2 45 41
7 1 2 3 8 41
8 1 2 4 9 41
9 2 1 1 26 NA
10 2 1 2 14 14
11 2 1 3 41 26
12 2 1 4 35 35
13 2 2 1 37 NA
14 2 2 2 30 30
15 2 2 3 14 30
16 2 2 4 28 30
17 3 1 1 34 NA
18 3 1 2 19 19
19 3 1 3 28 28
20 3 1 4 10 28
Here is an Rcpp solution.
cum_second_max is a modification of cummax which keeps track of the second maximum.
library(tidyverse)
Rcpp::cppFunction("
NumericVector cum_second_max(NumericVector x) {
double max_value = R_NegInf, max_value2 = NA_REAL;
NumericVector result(x.length());
for (int i = 0 ; i < x.length() ; ++i) {
if (x[i] > max_value) {
max_value2 = max_value;
max_value = x[i];
}
else if (x[i] < max_value && x[i] > max_value2) {
max_value2 = x[i];
}
result[i] = isinf(max_value2) ? NA_REAL : max_value2;
}
return result;
}
")
df1 %>%
group_by(patient, visit) %>%
mutate(
c2max = cum_second_max(var1)
)
#> # A tibble: 20 x 5
#> # Groups: patient, visit [5]
#> patient visit trial var1 c2max
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 25 NA
#> 2 1 1 2 23 23
#> 3 1 1 3 48 25
#> 4 1 1 4 37 37
#> 5 1 2 1 41 NA
#> 6 1 2 2 45 41
#> 7 1 2 3 8 41
#> 8 1 2 4 9 41
#> 9 2 1 1 26 NA
#> 10 2 1 2 14 14
#> 11 2 1 3 41 26
#> 12 2 1 4 35 35
#> 13 2 2 1 37 NA
#> 14 2 2 2 30 30
#> 15 2 2 3 14 30
#> 16 2 2 4 28 30
#> 17 3 1 1 34 NA
#> 18 3 1 2 19 19
#> 19 3 1 3 28 28
#> 20 3 1 4 10 28
Thanks so much everyone! I really appreciate it and could not have solved this without your help! In the end, I used an approach similar to the one suggested by tmfmnk, since I was already using dplyr. I found an interesting result with the code suggested by tmfmnk, where for some reason it gave me a column of values that just repeated the first row's number. With a small tweak changing dense_rank to order, I got exactly what I wanted:
df1 %>%
group_by(patient, visit) %>%
mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                ~ ifelse(.x == 1, NA, var1[order(-var1[1:.x])[2]])))
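For reference, the OP's original sapply() logic from above also works per group once it refers to the grouped var1 rather than the whole df1$var1; a minimal sketch:
df1 %>%
  group_by(patient, visit) %>%
  mutate(cum2ndmax = sapply(seq_along(var1),
                            function(x) sort(var1[seq(x)], decreasing = TRUE)[2])) %>%
  ungroup()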

How to create a matrix in simple correspondence analysis?

I am trying to create a matrix in order to apply a simple correspondence analysis on it; I have 2 categorical variables: exp and conexinternet with 3 levels each.
obs conexinternet exp
1 1 2
2 1 1
3 2 2
4 1 1
5 1 1
6 2 1
7 1 2
8 1 2
9 1 2
10 2 1
11 1 1
12 2 1
13 2 2
14 2 1
15 1 1
16 2 2
17 1 1
18 2 2
19 2 2
20 2 2
21 2 2
22 1 1
23 2 3
24 1 1
25 2 1
26 2 1
27 1 1
28 2 2
29 2 1
30 1 2
31 1 2
32 2 3
33 2 1
34 2 1
35 2 1
36 3 2
37 2 1
38 3 2
39 2 3
40 2 3
41 2 2
42 2 3
43 2 2
44 2 2
45 2 1
46 2 2
47 2 3
48 1 3
49 2 3
50 3 2
51 2 2
52 2 2
53 2 1
54 1 2
55 1 1
56 2 3
57 3 2
58 3 1
59 3 1
60 1 2
61 2 3
62 2 2
63 3 1
64 3 2
65 3 2
66 1 2
67 3 2
68 3 2
69 3 3
70 2 1
71 3 3
72 3 2
73 3 2
74 3 2
75 3 1
76 3 2
77 3 1
I want to make a vector that categorizes the observations as 11, 12, 13, 21, 22, 23, 31, 32, 33. How can I do it?
Is this what you want?
d <- read.table(text="obs conexinternet exp
1 1 2
...
77 3 1", header=T)
(tab <- xtabs(~conexinternet+exp, d))
# exp
# conexinternet 1 2 3
# 1 10 9 1
# 2 14 15 9
# 3 5 12 2
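If the goal is literally the vector of combined codes (11, 12, ..., 33) per observation, pasting the two columns together gives it, and the contingency table above can then be fed to a correspondence analysis routine. A sketch, assuming the data frame d and table tab from above; MASS::corresp is used here as one possible CA function:
# combined category code per observation, e.g. "12" = conexinternet 1, exp 2
d$cell <- paste0(d$conexinternet, d$exp)
# simple correspondence analysis on the 3 x 3 table
library(MASS)
corresp(tab, nf = 2)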

dplyr append group id sequence?

I have a dataset like the one below; it's created by dplyr and is currently grouped by 'Stage'. How do I generate a sequence based on the unique, incremental values of Stage, starting from 1? (For example, row 4 should get 1, and rows 1 and 8 should get 4.)
X Y Stage Count
1 61 74 1 2
2 58 56 2 1
3 78 76 0 1
4 100 100 -2 1
5 89 88 -1 1
6 47 44 3 1
7 36 32 4 1
8 75 58 1 2
9 24 21 5 1
10 12 11 6 1
11 0 0 10 1
I tried the approach in the post below, but it didn't work.
how to mutate a column with ID in group
Thanks.
Here is another dplyr solution:
> df
# A tibble: 11 × 4
X Y Stage Count
<dbl> <dbl> <dbl> <dbl>
1 61 74 1 2
2 58 56 2 1
3 78 76 0 1
4 100 100 -2 1
5 89 88 -1 1
6 47 44 3 1
7 36 32 4 1
8 75 58 1 2
9 24 21 5 1
10 12 11 6 1
11 0 0 10 1
To create the group IDs, use dplyr's group_indices():
i <- df %>% group_indices(Stage)
df %>% mutate(group = i)
# A tibble: 11 × 5
X Y Stage Count group
<dbl> <dbl> <dbl> <dbl> <int>
1 61 74 1 2 4
2 58 56 2 1 5
3 78 76 0 1 3
4 100 100 -2 1 1
5 89 88 -1 1 2
6 47 44 3 1 6
7 36 32 4 1 7
8 75 58 1 2 4
9 24 21 5 1 8
10 12 11 6 1 9
11 0 0 10 1 10
It would be great if you could pipe both commands together. But, as of this writing, it doesn't appear to be possible.
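In more recent dplyr versions, though, the group index can be computed inside a single pipe, either with dense_rank() on Stage or, from dplyr 1.0 on, with cur_group_id(); a sketch, assuming the same df as above:
library(dplyr)
# rank-based: 1 for the smallest Stage, 2 for the next, and so on
df %>% mutate(group = dense_rank(Stage))
# or via the current group id after grouping
df %>% group_by(Stage) %>% mutate(group = cur_group_id()) %>% ungroup()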
After some experimenting, I did %>% ungroup() %>% mutate(test = rank(Stage)), which yields the following result:
X Y Stage Count test
1 100 100 -2 1 1.0
2 89 88 -1 1 2.0
3 78 76 0 1 3.0
4 61 74 1 2 4.5
5 75 58 1 2 4.5
6 58 56 2 1 6.0
7 47 44 3 1 7.0
8 36 32 4 1 8.0
9 24 21 5 1 9.0
10 12 11 6 1 10.0
11 0 0 10 1 11.0
I don't know whether this is the best approach; feel free to comment.
Update
Another approach, assuming the data is called Node:
lvs <- levels(as.factor(Node$Stage))
Node %>% mutate(Rank = match(Stage,lvs))

How do I use plyr to number rows?

Basically I want an auto-incremented ID column based on my cohorts, in this case .(kmer, cvCut).
> myDataFrame
size kmer cvCut cumsum
1 8132 23 10 8132
10000 778 23 10 13789274
30000 324 23 10 23658740
50000 182 23 10 28534840
100000 65 23 10 33943283
200000 25 23 10 37954383
250000 584 23 12 16546507
300000 110 23 12 29435303
400000 28 23 12 34697860
600000 127 23 2 47124443
600001 127 23 2 47124570
I want a column added that has new row names based on the kmer/cvCut group
> myDataFrame
size kmer cvCut cumsum newID
1 8132 23 10 8132 1
10000 778 23 10 13789274 2
30000 324 23 10 23658740 3
50000 182 23 10 28534840 4
100000 65 23 10 33943283 5
200000 25 23 10 37954383 6
250000 584 23 12 16546507 1
300000 110 23 12 29435303 2
400000 28 23 12 34697860 3
600000 127 23 2 47124443 1
600001 127 23 2 47124570 2
I'd do it like this:
library(plyr)
ddply(df, c("kmer", "cvCut"), transform, newID = seq_along(kmer))
Just add a new column each time plyr calls you:
R> DF <- data.frame(kmer=sample(1:3, 50, replace=TRUE),
+                   cvCut=sample(LETTERS[1:3], 50, replace=TRUE))
R> library(plyr)
R> ddply(DF, .(kmer, cvCut), function(X) data.frame(X, newId=1:nrow(X)))
kmer cvCut newId
1 1 A 1
2 1 A 2
3 1 A 3
4 1 A 4
5 1 A 5
6 1 A 6
7 1 A 7
8 1 A 8
9 1 A 9
10 1 A 10
11 1 A 11
12 1 B 1
13 1 B 2
14 1 B 3
15 1 B 4
16 1 B 5
17 1 B 6
18 1 C 1
19 1 C 2
20 1 C 3
21 2 A 1
22 2 A 2
23 2 A 3
24 2 A 4
25 2 A 5
26 2 B 1
27 2 B 2
28 2 B 3
29 2 B 4
30 2 B 5
31 2 B 6
32 2 B 7
33 2 C 1
34 2 C 2
35 2 C 3
36 2 C 4
37 3 A 1
38 3 A 2
39 3 A 3
40 3 A 4
41 3 B 1
42 3 B 2
43 3 B 3
44 3 B 4
45 3 C 1
46 3 C 2
47 3 C 3
48 3 C 4
49 3 C 5
50 3 C 6
R>
I think that this is what you want:
Load the data:
x <- read.table(textConnection(
"id size kmer cvCut cumsum
1 8132 23 10 8132
10000 778 23 10 13789274
30000 324 23 10 23658740
50000 182 23 10 28534840
100000 65 23 10 33943283
200000 25 23 10 37954383
250000 584 23 12 16546507
300000 110 23 12 29435303
400000 28 23 12 34697860
600000 127 23 2 47124443
600001 127 23 2 47124570"), header=TRUE)
Use ddply:
library(plyr)
ddply(x, .(kmer, cvCut), function(x) cbind(x, 1:nrow(x)))
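The plyr calls above still work, but for comparison, here is the same per-group row counter in dplyr and data.table; a sketch, assuming the data frame myDataFrame from the question:
# dplyr
library(dplyr)
myDataFrame %>% group_by(kmer, cvCut) %>% mutate(newID = row_number()) %>% ungroup()
# data.table
library(data.table)
DT <- as.data.table(myDataFrame)
DT[, newID := seq_len(.N), by = .(kmer, cvCut)]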
