All variables not read from data pipeline

I have a dataset with 8 variables. When I run the dplyr pipeline below, my output data frame only has the variables I used in the dplyr code, while I want all the variables:
ShowID <- MyData %>%
  group_by(ID) %>%
  summarize(count = n()) %>%
  filter(count == min(count))
ShowID
So my output has only two variables, ID and count. How do I get the rest of my variables into the new data frame? Why is this happening, and what am I clueless about here?
> ncol(ShowID)
[1] 2
> ncol(MyData)
[1] 8
MYDATA
key           ID     v1    v2   v3   v4   v5     v6
0-0-70cf97     1     89    20   30   45   55     65
3ad4893b8c     1      4     5   45   45   55     65
0-0-70cf97d7   2    848    20   52   66   56     56
0-0-70cf       2     54     4  846   65    5      5
0-0-793b8c     3  56454    28    6    4    5     65
0-0-70cf98     2      8  4654   30   65    6     21
3ad4893b8c     2     89    66  518  156   16     65
0-0-70cf97d8   3     89    20  161    1   55  45465
0-0-70cf       5     89    79   48   45   55    456
0-0-793b8c     5     89    20   48  545  654      4
0-0-70cf99     6      9    20   30   45   55     65
DESIRED
key         ID  count  v1  v2  v3  v4  v5  v6
0-0-70cf99   6      1   9  20  30  45  55  65
RESULT FROM CODE
ID count
6 1

You can use the base R ave function to calculate the number of rows in each group (ID) and then select the groups that have the minimum number of rows.
num_rows <- ave(MyData$v1, MyData$ID, FUN = length)
MyData[which(num_rows == min(num_rows)), ]
# key ID v1 v2 v3 v4 v5 v6
#11 0-0-70cf99 6 9 20 30 45 55 65
You could also use which.min here to save a step; however, it would fail when there are multiple minimum values, hence the use of which.
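A quick illustration of the difference (my own example, not data from the question):
x <- c(2, 1, 1, 3)
which.min(x)        # 2 -- only the first minimum
which(x == min(x))  # 2 3 -- all positions tied for the minimum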

No need to summarize:
ShowID <- MyData %>%
  group_by(ID) %>%
  mutate(count = n()) %>%
  ungroup() %>%
  filter(count == min(count))
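For context, summarize() collapses each group to a single row and keeps only the grouping variables plus the summaries you compute, which is why the other columns disappear; mutate() keeps every row and column. If you prefer to skip the explicit group_by()/ungroup() pair, a sketch using add_count() (available in recent dplyr versions) should be equivalent:
ShowID <- MyData %>%
  add_count(ID, name = "count") %>%     # adds a per-ID row count without grouping side effects
  filter(count == min(count))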

Vectorizing lagged operations

How can I vectorize the following operation in R that involves modifying column Z recursively using lagged values of Z?
library(dplyr)
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df
X Y
1 20 27
2 69 49
3 92 32
4 28 56
5 10 26
6 70 20
7 53 39
8 81 89
9 96 55
10 11 84
df <- df %>% mutate(Z=if_else(row_number()==1, initial_Z-Y, NA_real_))
df
X Y Z
1 20 27 973
2 69 49 NA
3 92 32 NA
4 28 56 NA
5 10 26 NA
6 70 20 NA
7 53 39 NA
8 81 89 NA
9 96 55 NA
10 11 84 NA
for (i in 2:nrow(df)) {
  df$Z[i] <- (df$Z[i-1] * df$X[i-1] / df$X[i]) - df$Y[i]
}
df
X Y Z
1 20 27 973.000000
2 69 49 233.028986
3 92 32 142.771739
4 28 56 413.107143
5 10 26 1130.700000
6 70 20 141.528571
7 53 39 147.924528
8 81 89 7.790123
9 96 55 -48.427083
10 11 84 -506.636364
So the first value of Z is set based on initial_Z and the first value of Y. The rest of the values of Z are calculated from the lagged values of X and Z and the current value of Y.
My actual df is large, and I need to repeat this operation thousands of times in a simulation. Using a for loop takes too much time. I would prefer implementing this with dplyr, but other approaches are also welcome.
Many thanks in advance for any help.
I don't know that you can avoid the effect of for loops, but in general R should be pretty good at them. Given that, here is a Reduce variant that might suffice for you:
set.seed(5)
initial_Z=1000
df <- data.frame(X=round(100*runif(10),0), Y=round(100*runif(10),0))
df$Z <- with(df, Reduce(function(prevZ, i) {
  if (i == 1) return(prevZ - Y[i])
  prevZ * X[i-1] / X[i] - Y[i]
}, seq_len(nrow(df)), init = initial_Z, accumulate = TRUE))[-1]
df
# X Y Z
# 1 20 27 973.000000
# 2 69 49 233.028986
# 3 92 32 142.771739
# 4 28 56 413.107143
# 5 10 26 1130.700000
# 6 70 20 141.528571
# 7 53 39 147.924528
# 8 81 89 7.790123
# 9 96 55 -48.427083
# 10 11 84 -506.636364
To be clear, Reduce uses for loops internally to get through the data. I generally don't like using indices as the values for Reduce's x, but since Reduce only iterates over one value, and we need both X and Y, the indices (rows) are a required step.
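As a side note (my own observation, not part of the original answers): this particular recurrence telescopes if you multiply both sides by X[i], since Z[i]*X[i] = Z[i-1]*X[i-1] - Y[i]*X[i]. Writing W[i] = Z[i]*X[i] turns the recursion into a running subtraction, so a fully vectorized cumsum solution is possible:
# W[i] = initial_Z*X[1] - cumsum(X*Y)[i], then divide back by X
df$Z <- (initial_Z * df$X[1] - cumsum(df$X * df$Y)) / df$X
This reproduces the loop's output exactly and avoids iteration entirely, though it only works because this recurrence happens to be linear in Z; the Reduce and accumulate2 approaches generalize to recursions that don't telescope.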
The same can be accomplished with purrr's accumulate2. Note that these are still for loops under the hood; you should consider writing the loop in Rcpp if it's genuinely causing a performance problem in R.
library(purrr)

df %>%
  mutate(Z = accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 - ..2, .init = 1000)[-1])
X Y Z
1 20 27 973
2 69 49 233.029
3 92 32 142.7717
4 28 56 413.1071
5 10 26 1130.7
6 70 20 141.5286
7 53 39 147.9245
8 81 89 7.790123
9 96 55 -48.42708
10 11 84 -506.6364
You could unlist(Z) to get a regular numeric column instead of a list column:
df %>%
  mutate(Z = unlist(accumulate2(Y, c(1, head(X, -1)/X[-1]), ~ ..1 * ..3 - ..2, .init = 1000))[-1])

Use mutate and dynamically named variables in dplyr

I would like to apply a function that selects the best transformation of certain variables in a data frame, and then adds new columns to the data frame with the transformed data. I can currently get the transformation to run as follows. However, this rewrites the existing data, instead of adding new, transformed variables. I have seen the other stackoverflow posts about dynamically-added variables but can't quite seem to get it to work. Here is what I have:
df <- data.frame(study_id = c(1:10),
                 v1 = sample(1:100, 10),
                 v2 = sample(1:100, 10),
                 v3 = sample(1:100, 10),
                 v4 = sample(1:100, 10))
library(dplyr)
library(bestNormalize)

transformed <- function(x) {
  bn <- bestNormalize(x)
  return(bn$x.t)
}

df <- df %>%
  mutate(across(c(2, 4:5), transformed))
Current output:
study_id v1 v2 v3 v4
1 1 -0.001846842 43 0.6559159 0.37893888
2 2 -2.416625847 81 -1.2998111 -0.64356058
3 3 1.012132345 95 -1.5086228 -0.48845289
4 4 0.798561562 2 0.8301299 0.30168982
5 5 -0.257460026 35 0.1322051 0.78737617
6 6 -0.179681789 42 -1.1352463 -2.42438347
7 7 0.378206706 22 -0.3635088 0.79583687
8 8 0.909304988 70 1.0748401 0.63712357
9 9 0.325879668 32 0.9041796 -0.09711216
10 10 -0.568470765 7 0.7099185 0.75254380
Desired output:
study_id v1 v2 v3 v4 v1_transformed v3_transformed v4_transformed
1 1 72 7 87 100 4 3 2
2 2 57 78 64 69 10 8 6
3 3 35 65 83 96 3 5 4
4 4 24 58 94 53 6 10 10
5 5 100 62 82 63 -1 7 3
6 6 47 55 4 50 8 4 1
7 7 83 97 35 41 7 2 -1
8 8 78 86 22 73 1 -1 9
9 9 11 39 93 68 2 0 7
10 10 36 49 8 72 0 1 0
Many thanks in advance.
Use the .names= argument of across:
df %>%
  mutate(across(c(2, 4:5), transformed, .names = "{.col}_transformed"))
giving:
study_id v1 v2 v3 v4 v1_transformed v3_transformed v4_transformed
1 1 50 72 12 7 0.3850197 -0.7916019 -1.9775107
2 2 53 82 61 42 0.4425318 0.6132865 0.6790496
3 3 3 12 90 20 -2.3661268 0.9496526 -0.4232995
4 4 20 84 37 21 -0.5190229 0.1809655 -0.3508475
5 5 55 54 4 23 0.4790925 -1.7301008 -0.2157362
6 6 61 96 85 74 0.5812924 0.9002185 1.5209888
7 7 52 94 22 38 0.4237308 -0.2683955 0.5302984
8 8 72 41 57 35 0.7449435 0.5546340 0.4080778
9 9 13 67 6 45 -0.9434502 -1.3866702 0.7815968
10 10 74 48 93 14 0.7719892 0.9780114 -0.9526174
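As a follow-up, the .names specification also understands a {.fn} token, which is handy if you ever want several transformations at once. A sketch (same transformed function as above):
df %>%
  mutate(across(c(2, 4:5), list(transformed = transformed),
                .names = "{.col}_{.fn}"))
With a named list of functions, {.fn} expands to each function's name, so this produces the same v1_transformed, v3_transformed, v4_transformed columns and scales naturally to additional functions.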

Replace column values based on column in another dataframe

I would like to replace some column values in a df based on a column in another data frame.
This is the head of the first df:
df1
A tibble: 253 x 2
id sum_correct
<int> <dbl>
1 866093 77
2 866097 95
3 866101 37
4 866102 65
5 866103 16
6 866104 72
7 866105 99
8 866106 90
9 866108 74
10 866109 92
Some of the sum_correct values need to be replaced by the correct values from another df, using id to trigger the replacement:
df2
A tibble: 14 x 2
id sum_correct
<int> <dbl>
1 866103 61
2 866124 79
3 866152 85
4 867101 24
5 867140 76
6 867146 51
7 867152 56
8 867200 50
9 867209 97
10 879657 56
11 879680 61
12 879683 58
13 879693 77
14 881451 57
How can I achieve this in RStudio? Thanks for the help in advance.
You can make an update join using match to find where id matches, removing non-matches (NA) with which:
idx <- match(df1$id, df2$id)
idxn <- which(!is.na(idx))
df1$sum_correct[idxn] <- df2$sum_correct[idx[idxn]]
df1
id sum_correct
1 866093 77
2 866097 95
3 866101 37
4 866102 65
5 866103 61
6 866104 72
7 866105 99
8 866106 90
9 866108 74
10 866109 92
You can do a left_join and then use coalesce:
library(dplyr)
left_join(df1, df2, by = "id", suffix = c("_1", "_2")) %>%
  mutate(sum_correct_final = coalesce(sum_correct_2, sum_correct_1))
The new column sum_correct_final contains the value from df2 if it exists and from df1 if a corresponding entry from df2 does not exist.
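If you are on dplyr 1.1.0 or later, rows_update() expresses this update join directly; a sketch (unmatched = "ignore" is needed here because df2 contains ids that never appear in df1):
library(dplyr)

# overwrite sum_correct in df1 with matching values from df2, keyed by id
df1 <- rows_update(df1, df2, by = "id", unmatched = "ignore")
Unlike the left_join/coalesce route, this overwrites sum_correct in place rather than adding a new column.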

Melt data frame row by row

How can I melt a data frame row by row?
I found a really similar question on the forum but I still can't solve my problem without a different id variable.
This is my data set:
V1 V2 V3 V4 V5
51 20 29 12 20
51 22 51 NA NA
51 14 NA NA NA
51 75 NA NA NA
And I want to melt it into:
V1 variable value
51 V2 20
51 V3 29
51 V4 12
51 V5 20
51 V2 22
51 V3 51
51 V2 14
51 V2 75
Currently my approach is melting it row by row with a for loop and then rbinding the pieces together.
library(reshape)
df <- read.table(text = "V1 V2 V3 V4 V5
51 20 29 12 20
51 22 51 NA NA
51 14 NA NA NA
51 75 NA NA NA", header = TRUE)
dfall <- NULL
for (i in 1:NROW(df)) {
  dfmelt <- melt(df[i, ], id = "V1", na.rm = TRUE)
  dfall <- rbind(dfall, dfmelt)
}
Just wondering if there is any way to do this faster? Thanks!
We replicate the first column "V1" and the column names (all but the first) to create the first and second columns of the expected output, while the 'value' column is created by transposing the dataset without its first column.
na.omit(data.frame(V1 = df1[, 1][col(df1[-1])],
                   variable = names(df1)[-1][row(df1[-1])],
                   value = c(t(df1[-1]))))
# V1 variable value
#1 51 V2 20
#2 51 V3 29
#3 51 V4 12
#4 51 V5 20
#5 51 V2 22
#6 51 V3 51
#9 51 V2 14
#13 51 V2 75
NOTE: No additional packages used.
Or we can use gather (from tidyr) to convert the 'wide' to 'long' format after we create a row id column (add_rownames from dplyr) and then arrange the rows.
library(dplyr)
library(tidyr)
add_rownames(df1) %>%
  gather(variable, value, V2:V5, na.rm = TRUE) %>%
  arrange(rowname, V1) %>%
  select(-rowname)
# V1 variable value
# (int) (chr) (int)
#1 51 V2 20
#2 51 V3 29
#3 51 V4 12
#4 51 V5 20
#5 51 V2 22
#6 51 V3 51
#7 51 V2 14
#8 51 V2 75
Or with data.table
library(data.table)
melt(setDT(df1, keep.rownames = TRUE),
     id.var = c("rn", "V1"), na.rm = TRUE)[
  order(rn, V1)][, rn := NULL][]
You can make a column with a unique ID for each row, so you can sort on it after melting. Using dplyr:
library(reshape2)
library(dplyr)
df %>%
  mutate(id = seq_len(n())) %>%
  melt(id.var = c('V1', 'id'), na.rm = TRUE) %>%
  arrange(V1, id, variable) %>%
  select(-id)
# V1 variable value
# 1 51 V2 20
# 2 51 V3 29
# 3 51 V4 12
# 4 51 V5 20
# 5 51 V2 22
# 6 51 V3 51
# 7 51 V2 14
# 8 51 V2 75
...or with reshape2 and base R ordering:
library(reshape2)
df$id <- seq_along(df$V1)
df2 <- melt(df, id.var = c('V1', 'id'), na.rm = TRUE)
df2[order(df2$V1, df2$id, df2$variable),-2]
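Since gather() and reshape2::melt() have largely been superseded, here is a sketch of the same reshape with tidyr::pivot_longer() (assuming tidyr >= 1.0.0):
library(dplyr)
library(tidyr)

df %>%
  mutate(id = row_number()) %>%                                     # remember original row order
  pivot_longer(V2:V5, names_to = "variable", values_drop_na = TRUE) %>%
  arrange(id) %>%
  select(V1, variable, value)
The temporary id column preserves the original row order, exactly like the rowname/id tricks above.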

Apply dplyr function to all but one column

Given a data frame with numeric values in all columns except for the last one, how can I compute the mean across each row?
In this example, I am using all columns, including the name column, which I would like to omit.
library(dplyr)

df <- as.data.frame(matrix(1:40, ncol = 10)) %>%
  mutate(name = LETTERS[1:4]) %>%
  mutate(mean = rowMeans(.))
Desired data frame output:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 mean name
1 1 5 9 13 17 21 25 29 33 37 19 A
2 2 6 10 14 18 22 26 30 34 38 20 B
3 3 7 11 15 19 23 27 31 35 39 21 C
4 4 8 12 16 20 24 28 32 36 40 22 D
You could try:
df %>%
  mutate(mean = select(., -matches("name")) %>% rowMeans())
In your setting, you could use
df <- as.data.frame(matrix(1:40, ncol = 10)) %>%
  mutate(name = LETTERS[1:4]) %>%
  mutate(mean = rowMeans(.[, 1:10]))
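A more self-describing variant selects the columns by type rather than by position; a sketch using across() (dplyr >= 1.0.0):
df %>%
  mutate(mean = rowMeans(across(where(is.numeric))))
Because across() returns a data frame of the selected columns, it can be fed straight into rowMeans(), and where(is.numeric) skips the name column automatically.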
