adding noise to a column in dplyr - r

Related to a previous question, I want to add some random noise to every value in a column in dplyr. However, when I try the code below I get identical values back. I understand why this is happening (runif(1) generates a single random number, and dplyr then adds that same number to every value). Is there any way to prevent this?
data <- data.frame(value=c(1,1,1,1,1)) %>% mutate(value = value + 1e-3*runif(1)) %>% print
# print(data)
# value
# 1 1.000236
# 2 1.000236
# 3 1.000236
# 4 1.000236
# 5 1.000236

Here is a solution with jitter:
library(dplyr)
set.seed(2020) # Make the results reproducible
data <- data.frame(value=c(1,1,1,1,1)) %>% mutate(value = jitter(value))
data
# value
#1 1.0058761
#2 0.9957690
#3 1.0047401
#4 0.9990756
#5 0.9854439

You could generate your random vector externally and then add it to data$value:
nrows <- nrow(data)
rands <- 1e-3 * runif(nrows)
data$value <- data$value + rands
Stepwise clarity works better for me.

Found my own answer. Adding rowwise() evaluates each row individually and thus gives a new random number.
data <- data.frame(value=c(1,1,1,1,1)) %>% rowwise() %>% mutate(value = value + 1e-3*runif(1)) %>% print
# print(data)
# value
# 1 1.000625
# 2 1.000764
# 3 1.000588
# 4 1.000536
# 5 1.000079
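An alternative that avoids rowwise() is to draw one random number per row in a single vectorized call. This is a minimal sketch, assuming dplyr is loaded; n() is dplyr's helper for the number of rows in the current group, and the seed value is arbitrary:

```r
library(dplyr)

set.seed(2020)  # arbitrary seed, only to make the draw reproducible

# runif(n()) returns one draw per row, so no single value is recycled
data <- data.frame(value = c(1, 1, 1, 1, 1)) %>%
  mutate(value = value + 1e-3 * runif(n()))

data
```

On large data frames this is usually faster than rowwise(), because runif() is called once rather than once per row.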

Related

Sample percentages in different groups without repeating R

I have a df such as
mydata <- data.frame(variable = runif(142),
                     block = sample(x = c(1,2,3,4,5), 142, replace = TRUE))
I'm trying to sample 80% and 20% of the rows in each block, without repeating any row, and add each fraction to new dfs called train (80%) and test (20%). Important: sometimes my blocks will not split into exactly 80-20, but I'm trying to get as close as possible to this ratio.
How to proceed?
I was using sample_frac but wasn't able to avoid repeating and joining the data after.
We may use slice_sample with prop = 0.8 after grouping by 'block'. Create a sequence column (row_number()) before grouping, so that it can be used to create the 'test' data by removing the observations that were already taken into train:
library(dplyr)
train <- mydata %>%
  mutate(rn = row_number()) %>%
  group_by(block) %>%
  slice_sample(prop = 0.8) %>%
  ungroup()
test <- mydata[setdiff(seq_len(nrow(mydata)), train$rn), ]
train$rn <- NULL
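The same split can also be written with anti_join instead of index arithmetic. This is only a sketch under assumptions: the helper name indexed and the seed are made up for illustration, and the sample data is regenerated so the block runs on its own.

```r
library(dplyr)

set.seed(123)  # arbitrary seed for a reproducible split
mydata <- data.frame(variable = runif(142),
                     block = sample(x = c(1, 2, 3, 4, 5), 142, replace = TRUE))

# Tag each row with an id before grouping ("indexed" is a hypothetical name)
indexed <- mydata %>% mutate(rn = row_number())

# Draw (up to) 80% of the rows within each block
train <- indexed %>%
  group_by(block) %>%
  slice_sample(prop = 0.8) %>%
  ungroup()

# test = every row whose id was not drawn into train
test <- anti_join(indexed, train, by = "rn")

# Sanity checks: the two sets are disjoint and together cover mydata
stopifnot(length(intersect(train$rn, test$rn)) == 0,
          nrow(train) + nrow(test) == nrow(mydata))
```

Because the per-block sample size is rounded down, train may hold slightly less than 80% of a block, which matches the "as close as possible" requirement in the question.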
With rsample::initial_split with strata = block:
library(rsample)
split <- initial_split(mydata, prop = .8, strata = block)
training(split)
testing(split)
#> prop.table(table(training(split)$block))
# 1 2 3 4 5
#0.1769912 0.2300885 0.2035398 0.1769912 0.2123894
#> prop.table(table(testing(split)$block))
# 1 2 3 4 5
#0.1724138 0.2413793 0.2068966 0.1724138 0.2068966

Obtain a Count of all the combinations created in a column when grouping by another column in df with different length combinations in R

Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output
I have tried about a dozen different ideas but not getting close. Closest was setting up a crosstab but didn't know how to get counts from that. Long/wide got me nowhere. etc. Too new still to think out of the box I guess.
Try this approach. You can arrange your values and then use group_by() and summarise() to reach a structure similar to the expected one:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
  arrange(Guest, State) %>%
  group_by(Guest) %>%
  summarise(Chain = paste0(State, collapse = '-')) %>%
  group_by(Chain, .drop = TRUE) %>%
  summarise(N = n())
Output:
# A tibble: 4 x 2
Chain N
<chr> <int>
1 AL-MA-TX 1
2 AL-TX 2
3 IA-MA 2
4 IA-TX 1
We can use base R with aggregate and table
table(aggregate(State~ Guest, df[do.call(order, df),], paste, collapse='-')$State)
Output:
# AL-MA-TX AL-TX IA-MA IA-TX
# 1 2 2 1
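The final group_by() and summarise() pair in the dplyr answer can be collapsed into a single count() call. A minimal sketch, assuming the sample data from the question; the result name combos is made up for illustration:

```r
library(dplyr)

Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob",
           "fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest, State)

combos <- df %>%
  arrange(Guest, State) %>%                          # sort states within each guest
  group_by(Guest) %>%
  summarise(Chain = paste(State, collapse = "-")) %>%  # one chain per guest
  count(Chain, name = "N")                           # number of guests per chain

combos
```

Sorting the states first means that "TX, AL" and "AL, TX" are counted as the same combination.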

How to transpose the first rows into new columns in R?

I want to transpose the first two rows into two new columns and keep the rest of the data frame. How do I do it in R?
My original data
A <- c("2012","PL",3,2)
B <- c("2012","PL",6,1)
C <- c("2012","PL",7,4)
DF <- data.frame(A,B,C)
My final data after transpose
V1 <- c("2012","2012")
V2 <- c("PL","PL")
A <- c(3,2)
B <- c(6,1)
C <- c(7,4)
DF <- data.frame(V1,V2,A,B,C)
Where V1 and V2 are the names for new columns and they are created automatically.
Thank you for any assistance.
Base R:
cbind(t(DF[1:2, 1, drop=FALSE]), DF[-(1:2),])
# Warning in data.frame(..., check.names = FALSE) :
# row names were found from a short variable and have been discarded
# 1 2 A B C
# 1 2012 PL 3 6 7
# 2 2012 PL 2 1 4
though I have some concerns about the apparent key property of "2012" and "PL". That is, you start with three instances of each and end with two. Logically it makes sense, though really to me it looks as if you have a matrix of numbers associated with a single "2012","PL", but perhaps that's not how the data is coming to you. (If you can change the format of the data before getting to this point such that you have a matrix and its associated keys, then it might make data munging more direct, declarative, and resistant to bugs.)
Here is an option with slice
library(dplyr)
DF %>%
  select(A) %>%
  slice(1:2) %>%
  t() %>%
  as.data.frame() %>%
  bind_cols(DF %>%
              slice(-(1:2)))
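If the new columns should be named V1 and V2 as in the desired output, the key values can also be picked out directly. This is a rough sketch in base R, assuming (as in the sample data) that the first two rows hold the same key values in every column; the name result is made up for illustration:

```r
A <- c("2012", "PL", 3, 2)
B <- c("2012", "PL", 6, 1)
C <- c("2012", "PL", 7, 4)
DF <- data.frame(A, B, C)

# Take the two key values from column A and recycle them over the remaining rows
result <- data.frame(V1 = DF$A[1], V2 = DF$A[2], DF[-(1:2), ])
rownames(result) <- NULL  # drop the leftover row names "3" and "4"

result
```

Note that A, B and C were built with c() from mixed strings and numbers, so their values are still stored as text at this point and would need an explicit conversion before doing arithmetic.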

Looping dplyr and creating multiple dataframe

I have a large dataset in which I would like to use dplyr and filter and select the data to create 12 separate dataframes.
Essentially, I am using only two columns of data from a larger dataset. The first column is "plot", where I filter by "plot" number and another condition in another 3rd column ("pos_ID"). I want to create a loop that filters by plot number (I tried plot==[i]) and the 3rd condition, and then creates a new dataframe. The loop would repeat 12 times (because plot spans from 1-12).
Here is the code that I used without a loop (based on sample data)
p1_Germ <- data %>% #p1 stands for plot 1
filter(plot==1, pos_ID<21) %>%
select(germ_bin)
Here is the code that I tried to incorporate a loop (based on sample data)
for(i in seq_along(plot)) {
data %>%
group_by(plot[[i]], pos_ID<21) %>%
select(germ_bin)
}
Here is some sample data
plot <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12)
germ_bin <- c(0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,1,0,1,0)
pos_ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
dataset <- data.frame(plot, germ_bin, pos_ID)
dataset
My guess is to use a list, but I'm not familiar with loops and lists and could not find a solution online. I need to create 12 dataframes because I'm trying to convert each of them into a matrix afterwards for another function. Any help would be much appreciated!
We can use group_split and map to filter based on criteria to get list of dataframes.
library(dplyr)
library(purrr)
dataset %>%
group_split(plot) %>%
map(. %>% filter(pos_ID < 21) %>% select(germ_bin))
#[[1]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 0
#2 0
#[[2]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 1
#2 0
#[[3]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 1
#2 0
#....
For the shared example, if you want to drop empty groups you can filter first
dataset %>%
filter(pos_ID < 21) %>%
group_split(plot) %>%
map(. %>% select(germ_bin))
As far as your attempt with the for loop is concerned, you can correct it by doing
unique_plot <- unique(dataset$plot)
# Preallocate an empty list with one slot per plot
plot_list <- vector("list", length(unique_plot))
for(i in seq_along(unique_plot)) {
  plot_list[[i]] <- dataset %>%
    filter(plot == unique_plot[i], pos_ID < 21) %>%
    select(germ_bin)
}
Or keeping it completely in base R
lapply(split(dataset, dataset$plot), function(x)
  subset(x, pos_ID < 21, select = germ_bin, drop = FALSE))
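Since the stated goal is one matrix per plot, the list can be built as matrices directly. A sketch in base R, assuming the sample data from the question; the names filtered and mats, and the "p<plot>_Germ" naming scheme, are only illustrative:

```r
plot <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12)
germ_bin <- c(0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,1,0,1,0)
pos_ID <- 1:24
dataset <- data.frame(plot, germ_bin, pos_ID)

# Filter first, so plots with no matching rows are dropped
filtered <- subset(dataset, pos_ID < 21)

# Split germ_bin by plot and turn each piece into a one-column matrix
mats <- lapply(split(filtered$germ_bin, filtered$plot),
               function(v) matrix(v, ncol = 1))

# Name the elements like the hand-written p1_Germ object in the question
names(mats) <- paste0("p", names(mats), "_Germ")

mats$p2_Germ
```

Keeping the matrices in one named list, rather than 12 separate objects, makes it easy to pass them to another function with lapply().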

Error when combining dplyr inside a function

I'm trying to figure out what I'm doing wrong here. Using the following training data I compute some frequencies using dplyr:
group.count <- c(101,99,4)
data <- data.frame(
by = rep(3:1,group.count),
y = rep(letters[1:3],group.count))
data %>%
group_by(by) %>%
summarise(non.miss = sum(!is.na(y)))
Which gives me the outcome I'm looking for. However, when I try to do it as a function:
res0 <- function(x1,x2) {
  output = data %>%
    group_by(x2) %>%
    summarise(non.miss = sum(!is.na(x1)))
}
res0(y,by)
I get an error (index out of bounds).
Can anybody tell me what I'm missing?
Thanks in advance.
You can't do it like that in dplyr.
The problem is that the arguments are not treated as column names inside the function: group_by(x2) looks for a column literally named x2, which is not part of your data.frame. Your first thought might be to pass "by" as a string, but that won't work with dplyr either. To show this, build your data.frame with columns named x1 and x2:
data <- data.frame(
x2 = rep(3:1,group.count),
x1 = rep(letters[1:3],group.count)
)
Then call your function again and it will return the expected output.
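In more recent versions of dplyr, column names can be forwarded through a function with the embrace operator {{ }}. The following is a sketch under that assumption (dplyr 1.0 or later), not the approach described in the answer above; the extra df argument is added here for illustration:

```r
library(dplyr)

group.count <- c(101, 99, 4)
data <- data.frame(
  by = rep(3:1, group.count),
  y  = rep(letters[1:3], group.count)
)

# {{ }} forwards the unquoted column names given by the caller
# to group_by() and summarise()
res0 <- function(df, x1, x2) {
  df %>%
    group_by({{ x2 }}) %>%
    summarise(non.miss = sum(!is.na({{ x1 }})))
}

out <- res0(data, y, by)
out
```

Passing the data frame as an argument, instead of relying on a global object named data, also makes the function reusable on other data sets.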
I suggest changing the name of your dataframe to df.
This is basically what you have done:
df %>%
group_by(by) %>%
summarise(non.miss = sum(!is.na(y)))
which produces this:
# by non.miss
#1 1 4
#2 2 99
#3 3 101
but to count the number of observations per group, you could use length, which gives the same answer:
df %>%
group_by(by) %>%
summarise(non.miss = length(y))
# by non.miss
#1 1 4
#2 2 99
#3 3 101
or, use tally, which gives this:
df %>%
group_by(by) %>%
tally
# by n
#1 1 4
#2 2 99
#3 3 101
Now, if you really wanted to, you could put that into a function. The input would be the dataframe, like this:
res0 <- function(df) {
  df %>%
    group_by(by) %>%
    tally()
}
res0(df)
# by n
#1 1 4
#2 2 99
#3 3 101
This of course assumes that your dataframe will always have the grouping column named 'by'. I realize that these data are just fictional, but avoiding the column name 'by' might be a good idea, because by is also the name of a function in R, and code that uses it as a column name can be confusing to read.
