Random value in column in R

Does anyone have an idea how to generate a column of random values where only one randomly chosen row is marked with the number "1"? All other rows should be "0".
I need a function for this in R.
Here is what I need in photos:

df <- data.frame(subject = 1, choice = 0, price75 = c(0,0,0,1,1,1,0,1))
This command updates the choice column so that a single random row has the value 1 each time it is called. All other rows in the choice column are set to 0.
df$choice <- +(seq_along(df$choice) == sample(nrow(df), 1))

With integer(nrow(DF)) a vector of zeros is created, and `[<-` then places a 1 at the position drawn by sample(nrow(DF), 1L).
DF <- data.frame(subject=1, choice="", price75=c(0,0,0,1,1,1,0,1))
DF$choice <- `[<-`(integer(nrow(DF)), sample(nrow(DF), 1L), 1L)
DF
# subject choice price75
#1 1 0 0
#2 1 0 0
#3 1 0 0
#4 1 1 1
#5 1 0 1
#6 1 0 1
#7 1 0 0
#8 1 0 1

> x <- rep(0, 10)
> x[sample(1:10, 1)] <- 1
> x
[1] 0 0 0 0 0 0 0 1 0 0

Many ways to set a random value in a row\column in R
df<-data.frame(x=rep(0,10)) #make dataframe df, with column x, filled with 10 zeros.
set.seed(2022) #set a random seed - this is for repeatability
#two base methods for sampling:
#sample.int(n=10, size=1) # sample an integer from 1 to 10, sample size of 1
#sample(x=1:10, size=1) # sample from 1 to 10, sample size of 1
df$x[sample.int(n=10, size=1)] <- 1 # randomly selecting one of the ten rows, and replacing the value with 1
df
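Since the question asks for a function, the answers above can be wrapped into one. This is only a minimal sketch: the name mark_one_random_row and its col argument are made up for illustration and do not come from any of the answers.

```r
# Hypothetical helper: returns df with column `col` set to 0 everywhere
# except one randomly chosen row, which is set to 1.
mark_one_random_row <- function(df, col = "choice") {
  stopifnot(nrow(df) > 0)
  df[[col]] <- integer(nrow(df))               # all zeros
  df[[col]][sample.int(nrow(df), 1L)] <- 1L    # one random row becomes 1
  df
}

set.seed(42)  # arbitrary seed, only for repeatability
df <- data.frame(subject = 1, price75 = c(0, 0, 0, 1, 1, 1, 0, 1))
df <- mark_one_random_row(df)
sum(df$choice)  # always 1, whichever row was picked
```

Calling the function again on the same data frame draws a new random row.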

Related

How can I use vectorisation in R to change a DF value based on a condition?

Suppose I have the following DF:
C1 C2
0  0
1  1
1  1
0  0
.  .
.  .
I now want to apply the following conditions to the data frame:
The value for C1 should be 1
A random integer between 0 and 5 should be less than 2
If both these conditions are true, I change the C1 and C2 value for that row to 2
I understand this can be done by using the apply function, and I have used the following:
C1 <- c(0, 1,1,0,1,0,1,0,1,0,1)
C2 <- c(0, 1,1,0,1,0,1,0,1,0,1)
df <- data.frame(C1, C2)
fun <- function(x){
  if (sample(0:5, 1) < 2){
    x[1:2] <- 2
  }
  return (x)
}
index <- df$C1 == 1 # First condition
processed_Df <- t(apply(df[index,], 1, fun)) # Applies second condition
df[index,] <- processed_Df
Output:
C1 C2
0  0
2  2
1  1
0  0
.  .
.  .
Some rows meet both conditions and some don't (this is the main functionality I would like to achieve).
Now I want to achieve the same using vectorization, without loops or the apply function. My only confusion is: if I don't use apply, won't each row get the same result for the second condition? For example:
df$C1 <- ifelse(df$C1==1 & sample(0:5, 1) < 5, 2, df$C1)
This changes all the rows in my DF with C1==1 to 2, when many of them should possibly stay 1.
Is there a way to get different results for the second condition for each row without using the apply function? Hopefully my question makes sense.
Thanks
You need to draw one random value per row, i.e. nrow(df) values in total. Try this method:
set.seed(167814)
df[df$C1 == 1 & sample(0:5, nrow(df), replace = TRUE) < 2, ] <- 2
df
# C1 C2
#1 0 0
#2 2 2
#3 2 2
#4 0 0
#5 1 1
#6 0 0
#7 2 2
#8 0 0
#9 1 1
#10 0 0
#11 1 1
Here is a fully vectorized way. Create the logical index index just like in the question. Then sample all random integers r in one call to sample. Replace in place based on the conjunction of the index and the condition r < 2.
x <- 'C1 C2
0 0
1 1
1 1
0 0'
df1 <- read.table(textConnection(x), header = TRUE)
set.seed(1)
index <- df1$C1 == 1
r <- sample(0:5, length(index), TRUE)
df1[index & r < 2, c("C1", "C2")] <- 2
df1
#> C1 C2
#> 1 0 0
#> 2 1 1
#> 3 2 2
#> 4 0 0
Created on 2022-05-11 by the reprex package (v2.0.1)
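To make the difference between the attempt in the question and the answers concrete, here is a small sketch comparing a single draw with one draw per row. The names one_draw, per_row and new_C1 are illustrative only, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only for repeatability
C1 <- c(0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1)

# One draw: a length-1 logical that & (and ifelse) recycles to every row,
# so all rows with C1 == 1 change together or not at all.
one_draw <- sample(0:5, 1) < 2

# One draw per row: each row gets its own independent result.
per_row <- sample(0:5, length(C1), replace = TRUE) < 2

new_C1 <- ifelse(C1 == 1 & per_row, 2, C1)

length(one_draw)  # 1
length(per_row)   # 11
```

Rows where C1 is 0 are never touched, because the first condition is FALSE for them regardless of the draw.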

How to remove columns with more than 90% values as '0' in R

I had categorical variables, which I converted to dummy variables and got over 2381 variables. I won't be needing that many variables for analysis (say regression or correlation). I want to remove columns if over 90% of the total values in a given column is '0'. Also, is there a good metric to remove columns other than 90% of values being '0' ? Help!
This will give you a data.frame without the columns where more than 90% of the elements are 0:
df[sapply(df, function(x) mean(x == 0) <= 0.9)]
Or more elegantly, as markus suggests:
df[colMeans(df == 0) <= 0.9]
This is easily done with colSums:
Example data:
df <- data.frame(x = c(rep(0, 9), 1),
y = c(rep(0,9), 1),
z = c(rep(0, 8), 1, 1))
> df
x y z
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 1 1 1
df[, colSums(df == 0)/nrow(df) < .9, drop = FALSE]
z
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
10 1
Regarding the question about a useful metric, this heavily depends on what you want to analyze. Even a column with above 90 % 0 values may be useful for a regression model. I would look at the content of the variable, or use a stepwise exclusion based on AIC or BIC to measure the relevance of your variables.
Hi,
I wrote some code with the dplyr package. Here is an example of how you can get rid of columns in which more than 90% of the values are zero:
library(dplyr)
df <- data.frame(colA=sample(c(0,1), 100, replace=TRUE, prob=c(0.8,0.2)),
                 colB=sample(c(0,1), 100, replace=TRUE, prob=c(0.99,0.01)),
                 colC=sample(c(0,1), 100, replace=TRUE, prob=c(0.5,0.5)),
                 colD=sample(c(0,1), 100, replace=TRUE, prob=c(0,1)),
                 colE=rep(0, 100))
fct <- function (x) x==0
# count the zeros per column (funs() is deprecated in current dplyr, so the function is passed directly)
zero_count <- df %>% mutate_all(fct) %>% summarise_all(sum)
col_filter <- zero_count <= 0.9 * nrow(df)
df_filter <- df[, col_filter]
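If this filter is needed more than once, the colMeans answer can be wrapped in a small helper with an adjustable threshold. This is just a sketch: drop_sparse_cols and max_zero_share are hypothetical names, and na.rm = TRUE is an added assumption about how missing values should be treated.

```r
# Hypothetical helper: keep only columns whose share of zeros is at most
# max_zero_share (NA values are ignored when computing the share).
drop_sparse_cols <- function(df, max_zero_share = 0.9) {
  keep <- colMeans(df == 0, na.rm = TRUE) <= max_zero_share
  df[, keep, drop = FALSE]
}

df <- data.frame(x = rep(0, 10),                 # 100% zeros
                 z = c(rep(0, 8), 1, 1),         # 80% zeros
                 w = c(rep(0, 5), rep(1, 5)))    # 50% zeros
names(drop_sparse_cols(df))  # "z" "w"
```

Note that the answers above differ at the boundary: <= 0.9 keeps a column with exactly 90% zeros, while < .9 drops it.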

Add index to runs of positive or negative values of certain length

I have a data frame which contains 100,000 rows. It looks like this:
Value
1
2
-1
-2
0
3
4
-1
3
I want to create an extra column (column B) which consists of 0s and 1s.
It is 0 by default, but when there are 5 consecutive data points that are all positive OR all negative, it should be 1. This only applies if they are consecutive (e.g. when the run is positive and a negative number appears, the count starts again).
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but they didn't work. I also tried to convert the whole DF to a list (and loop over the list), unfortunately without success.
Here's an approach that uses the rollmean function from the zoo package.
set.seed(1000)
df = data.frame(Value = sample(-9:9,1000,replace=T))
sign = sign(df$Value)
library(zoo)
rolling = rollmean(sign,k=5,fill=0,align="right")
df$B = as.numeric(abs(rolling) == 1)
I generated 1000 values with positive and negative sets.
Extract the sign of the values - this will be -1 for negative, 1 for positive and 0 for 0
Calculate the right aligned rolling mean of 5 values (it will average x[1:5], x[2:6], ...). This will be 1 or -1 if all the values in a row are positive or negative (respectively)
Take the absolute value and store the comparison against 1. This is a logical vector that turns into 0s and 1s based on your conditions.
Note - there's no need for loops. This can all be vectorised (once we have the rolling mean calculated).
This will work. Not the most efficient way to do it but the logic is pretty transparent -- just check if there's only one unique sign (i.e. +, -, or 0) for each sequence of five adjacent rows:
dat <- data.frame(Value=c(1,2,1,2,2,3,4,-1,3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)){
if (length(unique(sign(dat$Value[(x-4):x])))==1){
dat$new_col[x] <- 1
} else {
dat$new_col[x] <- 0
}
}
Use the cumsum(...diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group.
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)), FUN = function(x){
as.integer(seq_along(x) > 4)})
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
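Another base R option, not taken from the answers above, is run-length encoding with rle. The sketch below assumes that zeros should never be flagged, which matches the rollmean answer; pos_in_run is an illustrative name.

```r
Value <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)

runs <- rle(sign(Value))               # runs of equal sign
pos_in_run <- sequence(runs$lengths)   # position of each value within its run

# 1 from the 5th consecutive value of the same non-zero sign onwards
B <- as.integer(pos_in_run >= 5 & sign(Value) != 0)
B
# [1] 0 0 0 0 1 1 1 0 0
```

Dropping the sign(Value) != 0 condition would also flag runs of five or more zeros.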

How to create a variable that indicates agreement from two dichotomous variables

I'd like to create a new variable that contains 1 and 0. A 1 represents agreement between the raters (both raters 1 or both raters 0) and a 0 represents disagreement.
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- cbind(rater_A, rater_B)
The new variable would be like the following vector I created manually:
df$agreement <- c(1,0,0,0,1,0,1,1,1,1)
Maybe there's a package or a function I don't know. Any help would be great.
You could create df as a data.frame (instead of using cbind) and use within and ifelse:
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- data.frame(rater_A, rater_B)
##
df <- within(df,
agreement <- ifelse(
rater_A==rater_B,1,0))
##
> df
rater_A rater_B agreement
1 1 1 1
2 0 1 0
3 1 0 0
4 1 0 0
5 1 1 1
6 0 1 0
7 0 0 1
8 1 1 1
9 0 0 1
10 0 0 1
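Because the comparison rater_A == rater_B already returns TRUE/FALSE, ifelse is not strictly needed. A shorter variant, shown here only as a sketch, converts the logical vector directly:

```r
rater_A <- c(1, 0, 1, 1, 1, 0, 0, 1, 0, 0)
rater_B <- c(1, 1, 0, 0, 1, 1, 0, 1, 0, 0)
df <- data.frame(rater_A, rater_B)

# TRUE/FALSE becomes 1/0
df$agreement <- as.integer(df$rater_A == df$rater_B)

mean(df$agreement)  # share of rows where the raters agree: 0.6
```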

Sample the original values and convert them to the number of times each occurred in the sample

t has 20 values; c also has 20 values, each 0 or 1. I am interested in the t matrix. Here I have a loop that repeats 5 times. Every time, sel gives 20 values. I want to store their frequencies in t.mat. How can I get the required result? The resulting table may look like the table below.
t <- 1:20
# c <- seq(0:1, 10)
t.mat <- array(dim = c(20, 5))
rep <- 5
for(mm in 1:rep){
sel <- sample(1:20, replace = TRUE)
tt <- t[sel]
# cc <- c[sel]
t.mat[, mm] = tt[1:20] # here the problem lies, I have no clue how
}
The output for the above may look like the table below. t will have 20 values, but I show only six rows as a rough illustration:
t v1 v2 v3 v4 v5
1 1 0 1 0 1
2 0 0 2 0 1
3 0 1 1 1 0
4 1 1 0 0 1
5 2 0 2 1 0
6 0 0 0 1 2
I'm guessing a little bit as to what you want, but it's probably this:
do.call(cbind, lapply(1:5, function(i)
tabulate(sample(t, replace = T), nbins = 20)))
sample generates the samples you want, tabulate counts frequencies (with the maximum specified manually via nbins, as it will not always occur in the sample), lapply repeats the procedure 5 times, and finally do.call(cbind, ...) binds it all together by column.
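The same idea can also be written with replicate, which collects the five count vectors into a matrix directly. This is a sketch under the assumption that the values are the integers 1 to 20, as in the question (tabulate only counts positive integers); t_vals is an illustrative name used to avoid masking the base function t.

```r
set.seed(1)  # arbitrary seed, only for repeatability
t_vals <- 1:20

# Each column holds how often each of the 20 values occurred in one resample
t.mat <- replicate(5, tabulate(sample(t_vals, replace = TRUE), nbins = 20))

dim(t.mat)      # 20 5
colSums(t.mat)  # every column sums to 20, the size of each resample
```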
