How to sample from categorical variables in an R data.frame?

I am trying to sample from an R data frame, but I am having problems with the categorical variables.
I am not taking a random subsample of rows; rather, I am generating rows such that each new variable individually has the same distribution as in the original data.
> head(x0)
Symscore1 Symscore2 exercise3 exerciseduration3 groupchange age3
3 1 0 1 0 Transitional to Menopausal 52
4 0 0 5 2 Transitional to Menopausal 62
6 0 0 2 0 Transitional to Menopausal 54
8 0 0 5 3 Transitional to Menopausal 56
10 0 0 4 3 Transitional to Menopausal 59
13 0 1 4 3 Transitional to Menopausal 55
packyears bmi3 education3
3 2.357143 23.24380 Basic
4 2.000000 16.76574 University
6 1.000000 23.30668 Basic
8 1.428571 22.14533 University
10 1.428571 22.14533 University
13 0.000000 22.03857 University
> xa = as.data.frame(sapply(X = x0, FUN = sample))
> head(xa)
Symscore1 Symscore2 exercise3 exerciseduration3 groupchange age3 packyears
1 1 0 2 3 4 49 53.571430
2 0 0 3 0 3 46 2.142857
3 1 0 3 3 4 49 4.000000
4 0 1 3 3 4 58 0.000000
5 0 0 2 0 1 57 0.000000
6 0 0 3 0 1 47 26.871429
bmi3 education3
1 25.84777 2
2 21.25850 2
3 25.79592 3
4 23.93899 1
5 25.97012 2
6 23.53037 2
> X = rbind(x0,xa)
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = c(4, 3, 4, 4, 1, 1, 2, 4, 4, :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = c(2, 2, 3, 1, 2, 2, 3, 2, 2, :
invalid factor level, NA generated
>

You could try:
x2 <- x0
x2[] <- lapply(x0, FUN = sample)
x2
# Symscore1 Symscore2 exercise3 exerciseduration3 groupchange
#3 0 0 1 0 Transitional to Menopausal
#4 0 0 5 3 Transitional to Menopausal
#6 0 0 4 3 Transitional to Menopausal
#8 0 0 2 0 Transitional to Menopausal
#10 1 1 4 3 Transitional to Menopausal
#13 0 0 5 2 Transitional to Menopausal
# age3
#3 54
#4 59
#6 52
#8 56
#10 62
#13 5
rbind(x0,x2)
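For context on why the sapply() version breaks: sapply() simplifies its result to a matrix, which coerces factor columns (their labels are lost), so assigning those values back into factor columns produces the "invalid factor level" warnings. lapply() returns a list, and assigning it into x2[] keeps each column's class. A minimal sketch with toy data (hypothetical, not the original x0):

```r
# Toy frame with one factor column (hypothetical, not the original x0)
d <- data.frame(num = 1:4,
                grp = factor(c("a", "b", "a", "b")))

# sapply() simplifies to a matrix, so the factor's class is lost:
m <- sapply(d, sample)
is.matrix(m)        # TRUE -- no longer a data frame

# lapply() returns a list of shuffled columns; assigning into d2[]
# keeps the data-frame shape and each column's original class:
d2 <- d
d2[] <- lapply(d, sample)
is.factor(d2$grp)   # TRUE -- levels preserved, so rbind(d, d2) works
```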
data
x0 <- structure(list(Symscore1 = c(1L, 0L, 0L, 0L, 0L, 0L), Symscore2 = c(0L,
0L, 0L, 0L, 0L, 1L), exercise3 = c(1L, 5L, 2L, 5L, 4L, 4L), exerciseduration3 = c(0L,
2L, 0L, 3L, 3L, 3L), groupchange = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = "Transitional to Menopausal", class = "factor"),
age3 = c(52L, 62L, 54L, 56L, 59L, 5L)), .Names = c("Symscore1",
"Symscore2", "exercise3", "exerciseduration3", "groupchange",
"age3"), class = "data.frame", row.names = c("3", "4", "6", "8",
"10", "13"))

Related

Change variable in a column with a for loop using R

In my data frame, I have a column named Localisation. I want to change the data that is stored in it.
Thalamus, External capsule or Lenticulate for 0,
Cerebellum or Brain stem for 2 and
Frontal, Occipital, Parietal or Temporal for 1
I’m trying to do a For loop for this operation. My for statement doesn’t seem to be right, because I received
Error in for (. in i) 15:nrow(Localisation) :
4 arguments passed to 'for' which requires 3
and I don’t know how to compose my if statement.
dfM <- dfM %>%
for(i in dfM$Localisation) if(i = "Thalamus"| "External capsule"| "Lenticulate"){
dfM$Localisation <- "0"
} else if ( i = "Cerebellum"| "Brain stem") {
dfM$Localisation <- "2"
} else {
dfM$Localisation <- "1"
}
I know similar questions have been asked multiple times, but I can’t find a way to work with my data.
dput(head(dfM))
structure(list(new_id = c("5", "9", "10", "16", "30", "31"),
Localisation = c("Frontal", "Thalamus", "Occipital ", "Frontal",
"External capsule", "Cerebellum"), HIV.CT.initial = c(0L,
1L, 1L, 1L, 0L, 0L), Anticoagulant = c("Warfarin", "DOAC",
"Warfarin", "Warfarin", "Warfarin", "Warfarin"), Sex = c(1L,
1L, 1L, 2L, 2L, 1L), HTA = c(1L, 1L, 1L, 1L, 1L, 0L), Systolic_BP = c(116L,
169L, 164L, 109L, 134L, 146L), Diastolic_BP = c(70L, 65L,
80L, 60L, 75L, 85L), ACO = c(1L, 0L, 1L, 1L, 1L, 1L), Type.NACO = c(NA,
2L, NA, NA, NA, NA), Dose.ACO = c(NA, NA, NA, NA, NA, NA),
APT = c(0L, 0L, 1L, 0L, 1L, 0L), INR = c(4.2, 1.1, 1.9, 1.3,
3.6, 2.8), GLW = c(13L, 15L, 14L, 14L, 15L, 15L), GLW.Prog = c(NA,
NA, NA, NA, 14L, NA), mRS = c(4L, 3L, 4L, 6L, 6L, 5L), Time.to.scan = c(NA,
NA, NA, NA, NA, NA), Chx = c(0L, 0L, 0L, 0L, 0L, 0L), Type.Chx = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), Date.Chx = c("",
"", "", "", "", ""), décès.hospit = c(0L, 0L, 0L, 1L, 1L,
0L), décès.HIP = c(NA, NA, NA, 2L, 1L, NA), X3.mois = c(NA,
NA, NA, NA, NA, 1L), X6.mois = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), X12.mois = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), X3.mois.1 = c(NA, NA, NA, 1L, 1L, 1L), X6.mois.1 = c(NA,
NA, NA, 1L, 1L, 1L), X12.mois.1 = c(NA, NA, NA, 1L, 1L, 1L
), X.1.y = c(NA, NA, NA, NA, NA, NA), ID = c(5L, 9L, 10L,
16L, 30L, 31L)), row.names = c(NA, 6L), class = "data.frame")
Up front: in general, don't use for in a %>%-pipe; it is almost never what you intend.
Since you have the %>%, I'll infer you're using dplyr and/or other related packages. Here's one way:
library(dplyr)
dfM %>%
  mutate(Localisation = case_when(
    Localisation %in% c("Thalamus", "External capsule", "Lenticulate") ~ "0",
    Localisation %in% c("Cerebellum", "Brain stem") ~ "2",
    TRUE ~ "1"
  ))
# new_id Localisation HIV.CT.initial Anticoagulant Sex HTA Systolic_BP Diastolic_BP ACO Type.NACO Dose.ACO APT INR GLW GLW.Prog mRS Time.to.scan Chx Type.Chx Date.Chx décès.hospit décès.HIP X3.mois X6.mois X12.mois X3.mois.1 X6.mois.1 X12.mois.1 X.1.y ID
# 1 5 1 0 Warfarin 1 1 116 70 1 NA NA 0 4.2 13 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 5
# 2 9 0 1 DOAC 1 1 169 65 0 2 NA 0 1.1 15 NA 3 NA 0 NA 0 NA NA NA NA NA NA NA NA 9
# 3 10 1 1 Warfarin 1 1 164 80 1 NA NA 1 1.9 14 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 10
# 4 16 1 1 Warfarin 2 1 109 60 1 NA NA 0 1.3 14 NA 6 NA 0 NA 1 2 NA NA NA 1 1 1 NA 16
# 5 30 0 0 Warfarin 2 1 134 75 1 NA NA 1 3.6 15 14 6 NA 0 NA 1 1 NA NA NA 1 1 1 NA 30
# 6 31 2 0 Warfarin 1 0 146 85 1 NA NA 0 2.8 15 NA 5 NA 0 NA 0 NA 1 NA NA 1 1 1 NA 31
Perhaps relevant: one of your values, "Occipital ", has a space at the end; if you have inadvertent spaces at times and need matching to work, you can replace references to Localisation with trimws(Localisation) to reduce the chance of a mis-classification.
Another method is to maintain a frame of translations:
translations <- tribble(
  ~Localisation,      ~NewLocalisation,
  "Thalamus",         "0",
  "External capsule", "0",
  "Lenticulate",      "0",
  "Cerebellum",       "2",
  "Brain stem",       "2"
)
dfM %>%
  select(new_id, Localisation) %>% # just to reduce the display here on SO
  left_join(., translations, by = "Localisation")
# new_id Localisation NewLocalisation
# 1 5 Frontal <NA>
# 2 9 Thalamus 0
# 3 10 Occipital <NA>
# 4 16 Frontal <NA>
# 5 30 External capsule 0
# 6 31 Cerebellum 2
I intentionally left the "default" step out to highlight that the NA values are natural: those are the ones not specifically identified.
dfM %>%
  left_join(., translations, by = "Localisation") %>%
  mutate(NewLocalisation = if_else(is.na(NewLocalisation), "1", NewLocalisation)) %>%
  mutate(Localisation = NewLocalisation) %>%
  select(-NewLocalisation)
# new_id Localisation HIV.CT.initial Anticoagulant Sex HTA Systolic_BP Diastolic_BP ACO Type.NACO Dose.ACO APT INR GLW GLW.Prog mRS Time.to.scan Chx Type.Chx Date.Chx décès.hospit décès.HIP X3.mois X6.mois X12.mois X3.mois.1 X6.mois.1 X12.mois.1 X.1.y ID
# 1 5 1 0 Warfarin 1 1 116 70 1 NA NA 0 4.2 13 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 5
# 2 9 0 1 DOAC 1 1 169 65 0 2 NA 0 1.1 15 NA 3 NA 0 NA 0 NA NA NA NA NA NA NA NA 9
# 3 10 1 1 Warfarin 1 1 164 80 1 NA NA 1 1.9 14 NA 4 NA 0 NA 0 NA NA NA NA NA NA NA NA 10
# 4 16 1 1 Warfarin 2 1 109 60 1 NA NA 0 1.3 14 NA 6 NA 0 NA 1 2 NA NA NA 1 1 1 NA 16
# 5 30 0 0 Warfarin 2 1 134 75 1 NA NA 1 3.6 15 14 6 NA 0 NA 1 1 NA NA NA 1 1 1 NA 30
# 6 31 2 0 Warfarin 1 0 146 85 1 NA NA 0 2.8 15 NA 5 NA 0 NA 0 NA 1 NA NA 1 1 1 NA 31
The reason I offer this alternative is that it is sometimes easier to maintain a separate translation table (in Excel, CSV form, etc.) in this fashion for automated processing of data. It doesn't offer any capability that the case_when or @dcarlson's ifelse methods do not.
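A base-R variant of the same lookup idea (my sketch; the named vector plays the role of the translations frame, and the toy dfM mirrors the question's column):

```r
# Named character vector acting as the translation table (mapping taken
# from the question)
loc_map <- c("Thalamus" = "0", "External capsule" = "0", "Lenticulate" = "0",
             "Cerebellum" = "2", "Brain stem" = "2")

# Toy frame standing in for dfM
dfM <- data.frame(Localisation = c("Frontal", "Thalamus", "Occipital ",
                                   "External capsule", "Cerebellum"))

loc <- trimws(dfM$Localisation)                   # guard against "Occipital "
dfM$Localisation <- ifelse(loc %in% names(loc_map),
                           unname(loc_map[loc]),  # mapped groups
                           "1")                   # default for the rest
dfM$Localisation                                  # "1" "0" "1" "0" "2"
```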

How to add new variable with condition in longitudinal data in R

In the following data, I would like to add another variable, say z.
mydata
y x sl
1 199.92989 1 1
2 27.73883 2 1
3 144.00000 3 1
4 72.00000 4 1
5 0.00000 5 1
6 392.60636 1 2
7 749.52499 2 2
8 3120.00000 3 2
9 1600.00000 4 2
10 1000.00000 5 2
11 5840.00000 6 2
12 3960.00000 7 2
13 4700.00000 8 2
14 1660.00000 9 2
15 5620.00000 10 2
16 0.00000 1 585
17 0.00000 2 585
18 0.00000 3 585
19 3062.32962 1 587
20 2048.97458 2 587
21 1280.00000 3 587
22 1440.00000 4 587
23 2960.00000 5 587
24 460.00000 6 587
25 530.00000 7 587
26 5190.00000 8 587
27 3200.00000 9 587
28 4620.00000 10 587
29 0.00000 1 651
30 0.00000 2 651
31 0.00000 3 651
32 0.00000 4 651
z = c(5, 7, 8): the value 5 should be repeated 5 times and belongs to sl = 1, 7 should be repeated 10 times and belongs to sl = 2, and 8 should be repeated 10 times and belongs to sl = 587. If all the observations of y are 0 for a given sl (here sl = 585 and sl = 651), then z must take the value 0. The z column must therefore be z = c(rep(5, 5), rep(7, 10), rep(0, 3), rep(8, 10), rep(0, 4)), i.e. c(5 5 5 5 5 7 7 7 7 7 7 7 7 7 7 0 0 0 8 8 8 8 8 8 8 8 8 8 0 0 0 0).
How can I do it with the above conditions?
We can use case_when from dplyr and specify the conditions.
library(dplyr)
df %>%
  mutate(z = case_when(sl == 1 ~ 5,
                       sl == 2 ~ 7,
                       sl == 587 ~ 8,
                       all(y[sl == 585] == 0) ~ 0,
                       all(y[sl == 651] == 0) ~ 0))
which returns:
# y x sl z
#1 199.92989 1 1 5
#2 27.73883 2 1 5
#3 144.00000 3 1 5
#4 72.00000 4 1 5
#5 0.00000 5 1 5
#6 392.60636 1 2 7
#7 749.52499 2 2 7
#8 3120.00000 3 2 7
#9 1600.00000 4 2 7
#10 1000.00000 5 2 7
#11 5840.00000 6 2 7
#12 3960.00000 7 2 7
#13 4700.00000 8 2 7
#14 1660.00000 9 2 7
#15 5620.00000 10 2 7
#16 0.00000 1 585 0
#17 0.00000 2 585 0
#18 0.00000 3 585 0
#19 3062.32962 1 587 8
#20 2048.97458 2 587 8
#21 1280.00000 3 587 8
#22 1440.00000 4 587 8
#23 2960.00000 5 587 8
#24 460.00000 6 587 8
#25 530.00000 7 587 8
#26 5190.00000 8 587 8
#27 3200.00000 9 587 8
#28 4620.00000 10 587 8
#29 0.00000 1 651 0
#30 0.00000 2 651 0
#31 0.00000 3 651 0
#32 0.00000 4 651 0
If we do not know which sl would have all 0 or if there are multiple such sl we can use
df %>%
  mutate(z = case_when(sl == 1 ~ 5,
                       sl == 2 ~ 7,
                       sl == 587 ~ 8)) %>%
  group_by(sl) %>%
  mutate(z = replace(z, all(y == 0), 0))
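If the sl-to-z mapping grows, hard-coding case_when branches gets unwieldy. A base-R sketch of the same logic (column names follow the question; the lookup frame and toy data are mine) uses match() for the lookup and ave() to zero out groups whose y values are all 0:

```r
# Toy subset with the question's structure
df <- data.frame(y  = c(10, 20, 0, 0, 5),
                 sl = c(1L, 1L, 585L, 585L, 587L))

# Lookup table: one z value per sl
z_lookup <- data.frame(sl = c(1L, 2L, 587L), z = c(5, 7, 8))

df$z <- z_lookup$z[match(df$sl, z_lookup$sl)]          # join by sl
all_zero <- ave(df$y, df$sl, FUN = function(v) all(v == 0)) == 1
df$z[all_zero] <- 0                                    # all-zero groups get 0
df$z                                                   # 5 5 0 0 8
```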
data
df <- structure(list(y = c(199.92989, 27.73883, 144, 72, 0, 392.60636,
749.52499, 3120, 1600, 1000, 5840, 3960, 4700, 1660, 5620, 0,
0, 0, 3062.32962, 2048.97458, 1280, 1440, 2960, 460, 530, 5190,
3200, 4620, 0, 0, 0, 0), x = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L), sl = c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 585L, 585L, 585L,
587L, 587L, 587L, 587L, 587L, 587L, 587L, 587L, 587L, 587L, 651L,
651L, 651L, 651L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25", "26", "27", "28", "29", "30", "31", "32"))

ddply using "group_by" logic

I'm trying to use ddply to find the smallest distance between two positions pos where the corresponding chrom is the same in two dataframes:
head(bps, 10)
chrom pos iteration
1 1 4 1
2 1 14 1
3 1 68 1
4 1 79 1
5 1 200 1
6 1 205 1
7 1 270 1
8 1 304 1
9 2 7 1
10 2 13 1
head(flocs)
chrom pos
1 1 100
2 1 200
3 1 220
4 1 312
5 2 500
6 2 501
As an example, for the first line in bps, I want to find the closest pos in flocs where chrom = 1, which gives a value of -96.
The pseudocode for what I'm trying to do is:
foreach iteration (bps$iteration):
    foreach chrom (bps$chrom):
        foreach pos (bps$pos):
            features_pos = pos in dataframe flocs closest to pos on the same chromosome
            min_dist = features_pos - pos
return features_pos, min_dist
I am trying to do this with ddply:
minDists <- ddply(bp_data, c("chrom", "pos"), function(x) {
  index <- which.min(abs(flocs$pos[which(flocs$chrom == x$chrom)] - x$pos))
  closestMotif <- flocs$pos[index]
  chrom <- as.character(flocs$chrom[index])
  dist <- (x$pos - closestMotif)
  data.frame(features_pos = closestMotif, pos = x$pos, min_dist = dist, feature = feature)
})
But this doesn't constrain comparisons to the same chromosome:
head(minDists, 10)
chrom features_pos pos min_dist feature
1 1 100 4 -96 feature1
2 1 100 14 -86 feature1
3 1 100 68 -32 feature1
4 1 100 79 -21 feature1
5 1 200 200 0 feature1
6 1 200 205 5 feature1
7 1 312 270 -42 feature1
8 1 312 304 -8 feature1
9 2 100 7 -93 feature1 # bps chrom=2, flocs chrom=1
10 2 100 13 -87 feature1 # bps chrom=2, flocs chrom=1
The expected output here is:
chrom features_pos pos min_dist feature
1 1 100 4 -96 feature1
2 1 100 14 -86 feature1
3 1 100 68 -32 feature1
4 1 100 79 -21 feature1
5 1 200 200 0 feature1
6 1 200 205 5 feature1
7 1 312 270 -42 feature1
8 1 312 304 -8 feature1
9 2 500 7 -493 feature1 # bp1 chrom=2, flocs chrom=2
10 2 500 13 -487 feature1 # bp1 chrom=2, flocs chrom=2
I thought that by providing the columns c("chrom", "pos") I was essentially applying a group_by to the function call.
Is there any way that I can improve what I've written to achieve the desired result?
bps <- structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"
), class = "factor"), pos = c(4L, 14L, 68L, 79L, 200L, 205L,
270L, 304L, 7L, 13L, 23L, 39L, 100L, 150L, 17L, 55L, 75L, 79L,
102L, 109L, 123L, 155L, 157L, 200L, 260L, 299L, 300L, 320L, 323L,
345L, 450L, 550L), iteration = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor")), row.names = c(NA,
-32L), class = "data.frame")
flocs <- structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L,
3L), .Label = c("1", "2", "3"), class = "factor"), pos = c(100L,
200L, 220L, 312L, 500L, 501L, 123L, 444L)), row.names = c(NA,
-8L), class = "data.frame")
data.table approach using a rolling join...
updated answer
(initially forgot all about the by-reference joining, which is faster and most certainly shorter ;-) )
library( data.table )
#set data as data.table
setDT( bps, key = c("chrom", "pos") )
setDT( flocs, key = c("chrom", "pos") )
#perform by-reference rolling join
bps[, mindist := pos - flocs[bps, x.pos, roll = "nearest"]][]
output
# chrom pos iteration mindist
# 1: 1 4 1 -96
# 2: 1 14 1 -86
# 3: 1 68 1 -32
# 4: 1 79 1 -21
# 5: 1 200 1 0
# 6: 1 205 1 5
# 7: 1 270 1 -42
# 8: 1 304 1 -8
# 9: 2 7 1 -493
# 10: 2 13 1 -487
# 11: 2 23 1 -477
# 12: 2 39 1 -461
# 13: 2 100 1 -400
# 14: 2 150 1 -350
# 15: 3 17 1 -106
# 16: 3 55 1 -68
# 17: 3 75 1 -48
# 18: 3 79 1 -44
# 19: 3 102 1 -21
# 20: 3 109 1 -14
# 21: 3 123 1 0
# 22: 3 155 1 32
# 23: 3 157 1 34
# 24: 3 200 1 77
# 25: 3 260 1 137
# 26: 3 299 1 -145
# 27: 3 300 1 -144
# 28: 3 320 1 -124
# 29: 3 323 1 -121
# 30: 3 345 1 -99
# 31: 3 450 1 6
# 32: 3 550 1 106
# chrom pos iteration mindist
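For intuition, roll = "nearest" pairs each bps position with the closest flocs position that shares the join key. The nearest-match step can be sketched in base R with findInterval (my helper, not from the answer; it assumes the lookup positions are sorted, as the keyed data.table would be):

```r
# Nearest element of a sorted lookup vector, for each x
nearest <- function(x, table) {
  i  <- findInterval(x, table)              # index of largest value <= x
  lo <- table[pmax(i, 1L)]                  # candidate at or below x
  hi <- table[pmin(i + 1L, length(table))]  # candidate above x
  ifelse(abs(x - lo) <= abs(x - hi), lo, hi)
}

pos   <- c(4, 205, 270)          # bps positions on chrom 1
flpos <- c(100, 200, 220, 312)   # flocs positions on chrom 1
pos - nearest(pos, flpos)        # -96 5 -42, matching rows 1, 6 and 7 above
```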
Benchmarking the answers so far
# Unit: milliseconds
# expr min lq mean median uq max neval
# Ronak_base 2.355879 2.555768 2.973069 2.626415 2.773581 8.016016 100
# Wimpel_data.table 1.697921 2.035788 2.416199 2.209616 2.361001 17.724528 100
# Pawel_tidyverse 14.845354 15.310505 16.333158 15.814819 16.541618 24.077871 100
microbenchmark::microbenchmark(
  Ronak_base = {
    bps$min_dist <- unlist(mapply(return_min_value, unique(bps$chrom), split(bps$pos, bps$chrom)))
  },
  Wimpel_data.table = {
    setDT( bps, key = c("chrom", "pos") )
    setDT( flocs, key = c("chrom", "pos") )
    # perform by-reference rolling join
    bps[, mindist := pos - flocs[bps, x.pos, roll = "nearest"]][]
  },
  Pawel_tidyverse = {
    bps %>%
      select(-iteration) %>%
      unite('bps') %>%
      crossing(flocs %>% unite('flocks')) %>%
      separate(bps, c('chrom_bps', 'pos')) %>%
      separate(flocks, c('chrom_flocks', 'features_pos')) %>%
      filter(chrom_bps == chrom_flocks) %>%
      select(-chrom_flocks) %>%
      rename_at(1, ~'chrom') %>%
      mutate_all(as.numeric) %>%
      mutate(min_dist = pos - features_pos) %>%
      group_by(chrom, pos) %>%
      filter(abs(min_dist) == min(abs(min_dist)))
  }
)
Looks like my data.table answer and the answer by Ronak Shah are pretty close together. I believe that data.table will gain a clear advantage as the data sets get larger (but I haven't tested).
My base R attempt creates a helper function (return_min_value). This function subsets flocs based on the current chrom and then returns the minimum-magnitude difference from pos. We split the pos column by chrom and pass these values, along with the unique chrom values, to return_min_value via mapply.
return_min_value <- function(x, y) {
  sapply(y, function(p) {
    vals = p - flocs$pos[flocs$chrom == x]
    vals[which.min(abs(vals))]
  })
}
bps$min_dist <- unlist(mapply(return_min_value,
                              unique(bps$chrom), split(bps$pos, bps$chrom)))
bps
# chrom pos iteration min_dist
#1 1 4 1 -96
#2 1 14 1 -86
#3 1 68 1 -32
#4 1 79 1 -21
#5 1 200 1 0
#6 1 205 1 5
#7 1 270 1 -42
#8 1 304 1 -8
#9 2 7 1 -493
#10 2 13 1 -487
#...
Check this solution:
library(tidyverse)
bps %>%
  select(-iteration) %>%
  unite('bps') %>%
  crossing(flocs %>% unite('flocks')) %>%
  separate(bps, c('chrom_bps', 'pos')) %>%
  separate(flocks, c('chrom_flocks', 'features_pos')) %>%
  filter(chrom_bps == chrom_flocks) %>%
  select(-chrom_flocks) %>%
  rename_at(1, ~'chrom') %>%
  mutate_all(as.numeric) %>%
  mutate(min_dist = pos - features_pos) %>%
  group_by(chrom, pos) %>%
  filter(abs(min_dist) == min(abs(min_dist)))
Output:
chrom pos features_pos min_dist
<dbl> <dbl> <dbl> <dbl>
1 1 4 100 -96
2 1 14 100 -86
3 1 68 100 -32
4 1 79 100 -21
5 1 200 200 0
6 1 205 200 5
7 1 270 312 -42
8 1 304 312 -8
9 2 7 500 -493
10 2 13 500 -487
# ... with 22 more rows

How to subtract one row from multiple rows by group, for data set with multiple columns in R?

I would like to learn how to subtract one row from multiple rows by group, and save the results as a data table/matrix in R. For example, take the following data frame:
data.frame("patient" = c("a","a","a", "b","b","b","c","c","c"), "Time" = c(1,2,3), "Measure 1" = sample(1:100,size = 9,replace = TRUE), "Measure 2" = sample(1:100,size = 9,replace = TRUE), "Measure 3" = sample(1:100,size = 9,replace = TRUE))
patient Time Measure.1 Measure.2 Measure.3
1 a 1 19 5 75
2 a 2 64 20 74
3 a 3 40 4 78
4 b 1 80 91 80
5 b 2 48 31 73
6 b 3 10 5 4
7 c 1 30 67 55
8 c 2 24 13 90
9 c 3 45 31 88
For each patient, I would like to subtract the row where Time == 1 from all rows associated with that patient. The result would be:
patient Time Measure.1 Measure.2 Measure.3
1 a 1 0 0 0
2 a 2 45 15 -1
3 a 3 21 -1 3
4 b 1 0 0 0
5 b 2 -32 -60 -7
6 b 3 -70 -86 -76
7 c 1 0 0 0
....
I have tried the following code using the dplyr package, but to no avail:
raw_patient<- group_by(rawdata,patient, Time)
baseline_patient <-mutate(raw_patient,cpls = raw_patient[,]- raw_patient["Time" == 0,])
As there are multiple columns, we can use mutate_at, specifying the variables in vars, and then subtract from each column's elements the element that corresponds to Time equal to 1, after grouping by 'patient'.
library(dplyr)
df1 %>%
  group_by(patient) %>%
  mutate_at(vars(matches("Measure")), funs(. - .[Time == 1]))
# A tibble: 9 × 5
# Groups: patient [3]
# patient Time Measure.1 Measure.2 Measure.3
# <chr> <int> <int> <int> <int>
#1 a 1 0 0 0
#2 a 2 45 15 -1
#3 a 3 21 -1 3
#4 b 1 0 0 0
#5 b 2 -32 -60 -7
#6 b 3 -70 -86 -76
#7 c 1 0 0 0
#8 c 2 -6 -54 35
#9 c 3 15 -36 33
data
df1 <- structure(list(patient = c("a", "a", "a", "b", "b", "b", "c",
"c", "c"), Time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Measure.1 = c(19L,
64L, 40L, 80L, 48L, 10L, 30L, 24L, 45L), Measure.2 = c(5L, 20L,
4L, 91L, 31L, 5L, 67L, 13L, 31L), Measure.3 = c(75L, 74L, 78L,
80L, 73L, 4L, 55L, 90L, 88L)), .Names = c("patient", "Time",
"Measure.1", "Measure.2", "Measure.3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))

How to select data that have complete cases of a certain column?

I'm trying to get a data frame (just.samples.with.shoulder.values, say) that contains only samples with non-NA values of Shoulders. I've tried to accomplish this using the complete.cases function, but I imagine that I'm doing something wrong syntactically below:
data <- structure(list(Sample = 1:14, Head = c(1L, 0L, NA, 1L, 1L, 1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L), Shoulders = c(13L, 14L, NA,
18L, 10L, 24L, 53L, NA, 86L, 9L, 65L, 87L, 54L, 36L), Knees = c(1L,
1L, NA, 1L, 1L, 2L, 3L, 2L, 1L, NA, 2L, 3L, 4L, 3L), Toes = c(324L,
5L, NA, NA, 5L, 67L, 785L, 42562L, 554L, 456L, 7L, NA, 54L, NA
)), .Names = c("Sample", "Head", "Shoulders", "Knees", "Toes"
), class = "data.frame", row.names = c(NA, -14L))
just.samples.with.shoulder.values <- data[complete.cases(data[,"Shoulders"])]
print(just.samples.with.shoulder.values)
I would also be interested to know whether some other route (using subset(), say) is a wiser idea. Thanks so much for the help!
You can try complete.cases too, which will return a logical vector that allows you to subset the data by Shoulders:
data[complete.cases(data$Shoulders), ]
# Sample Head Shoulders Knees Toes
# 1 1 1 13 1 324
# 2 2 0 14 1 5
# 4 4 1 18 1 NA
# 5 5 1 10 1 5
# 6 6 1 24 2 67
# 7 7 0 53 3 785
# 9 9 1 86 1 554
# 10 10 1 9 NA 456
# 11 11 1 65 2 7
# 12 12 1 87 3 NA
# 13 13 0 54 4 54
# 14 14 1 36 3 NA
You could try using is.na:
data[!is.na(data["Shoulders"]),]
Sample Head Shoulders Knees Toes
1 1 1 13 1 324
2 2 0 14 1 5
4 4 1 18 1 NA
5 5 1 10 1 5
6 6 1 24 2 67
7 7 0 53 3 785
9 9 1 86 1 554
10 10 1 9 NA 456
11 11 1 65 2 7
12 12 1 87 3 NA
13 13 0 54 4 54
14 14 1 36 3 NA
There is a subtle difference between using is.na and complete.cases. Applied to the whole data frame, complete.cases would remove every row containing any NA, whereas the objective here is only to control for one variable; NAs in the other columns could be legitimate data points.
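Since the question also asks about subset(): it reads naturally for this filter, and tidyr's drop_na() is another option (both shown as sketches on a toy frame that reuses the question's column name):

```r
# Toy frame with the question's Shoulders column
data <- data.frame(Sample    = 1:4,
                   Shoulders = c(13, NA, 18, NA))

subset(data, !is.na(Shoulders))    # keeps rows 1 and 3
# or, if tidyr is available:
# tidyr::drop_na(data, Shoulders)
```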
