Subsetting Longitudinal Dataframe by Randomly Selected Participant ID - r

I'd like to subset a longitudinal dataset by a randomly sampled number of participants. In this example there are three entries per participant and I want to sample 4 participants.
id <- rep(c(1:6), each = 3)
score <- rnorm(18, 10, 3)
group <- rep(c("a", "b"), each = 3, times = 3)
df <- data.frame(id, group, score)
I tried with this command...
dfSub <- df[df$id %in% sample(df$id, 4, replace = FALSE),]
But it only returns the entries for three participants, not the four I stipulated. Can anyone tell me why this didn't work and how to do it better?

We can use unique
df[df$id %in%sample(unique(df$id), 4, replace = FALSE),]
# id group score
#7 3 a 8.123872
#8 3 a 12.685344
#9 3 a 12.824781
#10 4 b 11.868296
#11 4 b 13.000660
#12 4 b 9.541258
#13 5 a 9.722255
#14 5 a 3.889751
#15 5 a 10.851232
#16 6 b 10.945997
#17 6 b 11.632380
#18 6 b 3.289507
The OP's command didn't work out because of the following
sample(c(1, 1, 4,3), 3, replace=FALSE)
#[1] 3 4 1
sample(c(1, 1, 4,3), 3, replace=FALSE)
#[1] 1 3 1
If there are duplicate values, sample can still return duplicates instead of unique values for the size specified. The replace only does whether sampling should be done with replacement or not. In the dummy example, we have 2 1s. So, even with replace=FALSE, the number of 1s that can be possible in the sample is 2.

Related

Assign a specific number of random rows into datasets in R

I have a dataset with 54285 observations. What I need is to assign randomly 50% of the rows into another dataframe, 30% into another dataset, and the rest (20%) into another one. This should be done without duplicates.
This is an example:
data<-data.frame(numbers=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
data
1
2
3
4
5
6
7
8
9
10
What I expect would be:
df1
5
3
8
1
7
df2
2
4
9
df3
6
10
Multiply the ratio by number of rows in the dataset and split the data to divide them in separate dataframes.
set.seed(123)
result <- split(data, sample(rep(1:3, nrow(data) * c(0.5, 0.3, 0.2))))
names(result) <- paste0('df', seq_along(result))
list2env(result, .GlobalEnv)
df1
# numbers
#1 1
#3 3
#7 7
#9 9
#10 10
df2
# numbers
#4 4
#5 5
#8 8
df3
# numbers
#2 2
#6 6
For large dataframes using sample with prob argument should work as well. However, note that this might not give you exact number of rows that you expect like the above rep answer.
result <- split(data, sample(1:3, nrow(data), replace = TRUE, prob = c(0.5, 0.3, 0.2)))

impute data with condition (4 levels) in R

I have two different variables in my data. first one: vein size (there are NAs).
Second variable is: procedure site (c=(1,2,3,4))
I want to impute different value to vein size based on different procedure site. I tried if else, but it wasn't successful. e.g.: if procedure site is 1 or 2, impute 3; if procedure site is 3, impute 4; if procedure site is 4, impute 5.I am new to this field. Any help is much appreciated!
vein.size<-(3,3,3,NA,NA,NA)
procedure.site<-(1,2,2,3,4,4)
df<-cbind(vein.size,procedure.site)
My expected output is:
vein.size<-(3,3,3,4,5,5)
thank you
You can use chain of ifelse statement or try case_when from dplyr :
library(dplyr)
df <- df %>%
mutate(output = case_when(is.na(vein.size) & procedure.site %in% 1:2 ~ 3,
is.na(vein.size) & procedure.site == 3 ~ 4,
is.na(vein.size) & procedure.site == 4 ~ 5,
TRUE ~ vein.size))
# vein.size procedure.site output
#1 3 1 3
#2 3 2 3
#3 3 2 3
#4 NA 3 4
#5 NA 4 5
#6 NA 4 5
data
vein.size<-c(3,3,3,NA,NA,NA)
procedure.site<-c(1,2,2,3,4,4)
df<-data.frame(vein.size,procedure.site)
You could use a lookup table and then merge:
# your data
vein.size <- c(3,3,3,NA,NA,NA)
procedure.site <- c(1,2,2,3,4,4)
your_df <- data.frame(vein.size = vein.size,
procedure.site = procedure.site)
# the lookup table
lookup_df <- data.frame(
procedure.site = c(1, 2, 3, 4),
imputation = c(3, 3, 4, 5)
)
# result
merge(your_df, lookup_df, by='procedure.site')
Which gives:
procedure.site vein.size imputation
1 1 3 3
2 2 3 3
3 2 3 3
4 3 NA 4
5 4 NA 5
6 4 NA 5

Automate replacement of missing data on a sequence of variables using mutate_all

I am trying to automate a process to complete missing values on a sequence of variables using an ifelse statement and mutate_all function. The problem involves a dataframe with many variable names, for example, ax1, bx1, ...zx1, ax2, bx2, ...zx2, ax3, bx3, ...zx3. The following data give a small scenario:
df<-data.frame(
"id" = c(1:5),
"ax1" = c(1, "NA", 8, "NA", 17),
"bx1" = c(2, 7, "NA", 11, 12),
"ax2" = c(2, 1, 8, 15, 17),
"bx2" = c(2, 6, 4, 13, 11))
The process is to replace the missing values on the variables with the ending "x1" with their corresponding values on the variables with the ending "x2". That is, if ax1 is missing it is replaced by ax2 and any missingness on bx1 is replaced by bx2 and so on. Since there are many variables than the scenario presented here, I am looking for a way to automate this process. I have tried the following codes
library(dplyr)
df <- df %>%
mutate_all(vars(ends_with("x1", "x2")), function(x,y)
ifelse(is.na(x), y, x)))
but it does not work. I greatly appreciate any help on this.
The expected output is
id ax1 bx1 ax2 bx2
1 1 2 2 2
2 1 7 1 6
3 8 4 8 4
4 15 11 15 13
5 17 12 17 11
In base R, we can replace NA value in x1 with corresponding NA values in x2 using Map.
x1_cols <- grep('x1$', names(df))
x2_cols <- grep('x2$', names(df))
df[x1_cols] <- Map(function(x, y) {x[is.na(x)] <- y[is.na(x)];x},
df[x1_cols], df[x2_cols])
df
# id ax1 bx1 ax2 bx2
#1 1 1 2 2 2
#2 2 1 7 1 6
#3 3 8 4 8 4
#4 4 15 11 15 13
#5 5 17 12 17 11
We can use the same logic and use purrr::map2
df[x1_cols] <- purrr::map2(df[x1_cols], df[x2_cols],
~{.x[is.na(.x)] <- .y[is.na(.x)];.x})
data
Modified data a bit making sure that NA are actual NAs and not string "NA" which were actually making columns as factors.
df<-data.frame(id=c(1:5),
ax1=c(1,NA,8,NA,17),
bx1=c(2,7,NA,11,12),
ax2=c(2,1,8,15,17),
bx2=c(2,6,4,13,11))

Excluding both the minimum and maximum value

I want to exclude the minimum as well as the maximum value of each row in a data frame. (If one of those value are repeated, only one should be excluded.)
I can exclude either the minimum, or the maximum, but not both.
I don't seem to find a way to combine those (which both work fine by themselves):
d[-which(d == min(d))[1]]
d[-which(d == max(d))[1]]
This doesn't work:
d[
-which(d == min(d))[1] &
-which(d == max(d))[1]
]
It gives the full row.
(I also tried an approach using apply(d, 1, min/max), but this also fails.)
Update
Remembered after looking at #Rich Pauloo's answer, we can directly use which.max and which.min to get index of minimum and maximum value
as.data.frame(t(apply(df, 1, function(x) x[-c(which.max(x), which.min(x))])))
# V1 V2 V3
#1 13 11 6
#2 15 8 18
#3 5 10 21
#4 14 12 17
#5 19 9 20
Here which.max/which.min will ensure that you get the index of first minimum and maximum respectively for each row.
Some other variations could be
as.data.frame(t(apply(df, 1, function(x)
x[-c(which.max(x == min(x)), which.max(x == max(x)))])))
If you want to use which we can do
as.data.frame(t(apply(df, 1, function(x)
x[-c(which(x == min(x)[1]), which(x == max(x)[1]))])))
data
set.seed(1234)
df <- as.data.frame(matrix(sample(25), 5, 5))
df
# V1 V2 V3 V4 V5
#1 3 13 11 16 6
#2 15 1 8 25 18
#3 24 5 4 10 21
#4 14 12 17 2 22
#5 19 9 20 7 23
You were very close! With data.frames you need to use a comma within the brackets to accomplish row-column subsetting.
Use which.max() and which.min() to return the index of the max and min values of a vector, respectively.
Bind those indices into a new vector with c().
Use - and the vector from 2. to subset your data frame for the desired rows.
Here's an example to copy/paste:
d <- data.frame(a = 1:5) # make example data.frame
d[-c(which.max(d$a), which.min(d$a)), ]
[1] 2 3 4
This will remove the rows containing the min and max values of score as shown in the example data frame.
library(tidyverse)
df <- tribble(~name, ~score,
'John', 10,
'Mike', 2,
'Mary', 11,
'Jane', 1,
'Jill', 5)
df %>%
arrange(score) %>%
slice(-1, -nrow(.))
# A tibble: 3 x 2
name score
<chr> <dbl>
1 Mike 2
2 Jill 5
3 John 10
We can use
t(apply(df, 1, function(x) x[!x %in% range(x)]))

Take sum of rows for every 3 columns in a dataframe

I have searched high and low and also tried multiple options to solve this but did not get the desired output as mentioned below:
I have dataframe df3 with headers as date and values beteween 0-1 as shown below:
df = data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) = c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 = data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 = cbind(df2,df)
Now I need df4 in which sum of first 3 columns in series will form one column. This will be repeated in series for rest of the columns dynamically.
df4
Options I tried:
a) rbind.data.frame(apply(matrix(df3, nrow = n - 1), 1,sum))
b) col_list <- list(c("1/1/2018","1/2/2018","1/3/2018"), c("1/4/2018","1/5/2018","1/6/2018"))
lapply(col_list, function(x)sum(df3[,x])) %>% data.frame
One way would be to split df3 every 3 columns using split.default. To split the data we generate a sequence using rep, then for each dataframe we take rowSums and finally cbind the result together.
cbind(df3[1], sapply(split.default(df3[-1],
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))), rowSums))
# CUST_ID 1 2
#1 A 1 1
#2 B 2 0
#3 C 2 1
#4 D 1 1
#5 E 2 2
#6 F 2 2
FYI, the sequence generated from rep is
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))
#[1] 1 1 1 2 2 2
This makes it possible to split every 3 columns.
The results are different because OP used sample without set.seed.
If rep seems too long then we can generate the same sequence of columns using gl
gl(ncol(df3[-1])/3, 3)
#[1] 1 1 1 2 2 2
#Levels: 1 2
So the final code, would be
cbind(df3[1], sapply(split.default(df3[-1], gl(ncol(df3[-1])/3, 3)), rowSums))
We can use seq to create index, get the subset of columns within in a list, Reduce by taking the sum, and create new columns
df4 <- df3[1]
df4[paste0('col', c('123', '456'))] <- lapply(seq(2, ncol(df3), by = 3),
function(i) Reduce(`+`, df3[i:min((i+2), ncol(df3))]))
df4
# CUST_ID col123 col456
#1 A 2 2
#2 B 3 3
#3 C 1 3
#4 D 2 3
#5 E 2 1
#6 F 0 1
data
set.seed(123)
df <- data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) <- c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 <- data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 <- cbind(df2, df)

Resources