Update specific values in a dataframe based on array index position - r

Let's say I have a dataframe
> colA <- c(1, 14, 8)
> colB <- c(4, 8, 9)
> colC <- c(1, 2, 14)
> df <- data.frame(colA, colB, colC)
> df
colA colB colC
1 1 4 1
2 14 8 2
3 8 9 14
What I want to do is create a second data frame which has the same structure as df, but has 1 whenever a specific number is found, and 0 otherwise, e.g., if the number were 14, df2 would look like this
> df2
colA colB colC
1 0 0 0
2 1 0 0
3 0 0 1
I thought I could create a 3x3 data frame of 0s (df2), use which() to get the index for the number in df, and then use that index to change what shows up in df2
> number <- 14
> index <- which(df == number)
> index
[1] 2 9
or perhaps more helpfully
> index <- which(df == number, arr.ind = T)
> index
row col
[1,] 2 1
[2,] 3 3
However, I am unsure how to use this index to specify which values in the data frame of 0s should become 1 and which should stay 0 (i.e. how to reverse the which).
NB - I will actually be testing this for multiple numbers, so I figured I would do it inside a for loop. I want the final data frame to show 1s at every location that holds any of the numbers (i.e. gradually switching the 0s "on" to 1s).
> numbers <- c(14, 9, 1)
> for(i in numbers){
> index <- which(df == i, arr.ind = T)
> #then do whatever needs to be done to change the index locations in df2
> }
P.S., in general, I work in the tidyverse, so tidyverse specific solutions would be grand, but base r would also be brilliant.
Ohh, and yes, this is for day 4 of Advent of Code - it's a useful challenge to help this non-expert coder learn.
Thanks

Here's a full example of how it could be done.
Data
df <- structure(list(colA = c(1, 14, 8), colB = c(4, 8, 9), colC = c(1,
2, 14)), class = "data.frame", row.names = c(NA, -3L))
base R
data.frame( sapply( df, function(x) as.numeric( x == 14 | x == 8 ) ))
colA colB colC
1 0 0 0
2 1 1 0
3 1 0 1
for any number in a loop
setNames( data.frame( matrix( rowSums( sapply( c(14,8,1), function(x)
df==x ) ), dim(df) ) ), colnames( df ) )
colA colB colC
1 1 0 1
2 1 1 0
3 1 0 1
dplyr
library(dplyr)
df %>% mutate_all( ~ as.numeric( .x == 14 | .x == 8 ) )
colA colB colC
1 0 0 0
2 1 1 0
3 1 0 1
# or
df %>% mutate( across( everything(), ~ as.numeric( .x == 14 | .x == 8 ) ) )
colA colB colC
1 0 0 0
2 1 1 0
3 1 0 1
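To connect this back to the which(..., arr.ind = TRUE) approach in the question: a two-column row/col matrix can be used directly as an index into a matrix, so a loop can switch cells "on" one number at a time. The following is a minimal base R sketch, not taken from the answer above; the names m and hits are illustrative, and it assumes every column of df is numeric.

```r
df <- data.frame(colA = c(1, 14, 8), colB = c(4, 8, 9), colC = c(1, 2, 14))
numbers <- c(14, 9, 1)

# Work on a matrix so that a row/col index matrix can address single cells
m <- as.matrix(df)
hits <- matrix(0, nrow = nrow(m), ncol = ncol(m), dimnames = dimnames(m))

for (number in numbers) {
  index <- which(m == number, arr.ind = TRUE)  # one row per matching cell
  hits[index] <- 1                             # matrix indexing: sets exactly those cells
}

df2 <- as.data.frame(hits)
df2
```

With these three numbers, colA ends up as 1, 1, 0, colB as 0, 0, 1 and colC as 1, 0, 1. When the full set of numbers is known up front, the loop is not strictly needed, since m %in% numbers tests every cell at once.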

Related

Random Sample From a Dataframe With Specific Count

This question is probably best illustrated with an example.
Suppose I have a dataframe df with a binary variable b (values of b are 0 or 1). How can I take a random sample of size 10 from this dataframe so that the sample contains 2 instances where b=0 and 8 instances where b=1?
Right now, I know that I can do df[sample(nrow(df), 10), ] to get part of the answer, but that would give me a random number of 0 and 1 instances. How can I fix the number of 0 and 1 instances while still taking a random sample?
Here's an example of how I'd do this... take two samples and combine them. I've written a simple function so you can "just take one sample."
With a vector:
pop <- sample(c(0,1), 100, replace = TRUE)
yoursample <- function(pop, n_zero, n_one){
c(sample(pop[pop == 0], n_zero),
sample(pop[pop == 1], n_one))
}
yoursample(pop, n_zero = 2, n_one = 8)
[1] 0 0 1 1 1 1 1 1 1 1
Or, if you are working with a dataframe with some unique index called id:
# Where d1 is your data you are summarizing with mean and sd
dat <- data.frame(
id = 1:100,
val = sample(c(0,1), 100, replace = TRUE),
d1 = runif(100))
yoursample <- function(dat, n_zero, n_one){
c(sample(dat[dat$val == 0,"id"], n_zero),
sample(dat[dat$val == 1,"id"], n_one))
}
sample_ids <- yoursample(dat, n_zero = 2, n_one = 8)
sample_ids
mean(dat[dat$id %in% sample_ids,"d1"])
sd(dat[dat$id %in% sample_ids,"d1"])
Here is a suggestion:
First create a sample of 0s and 1s with an id column.
Then draw 2 rows where value is 0 and 8 rows where value is 1, and bind them together:
library(tidyverse)
set.seed(123)
df <- as_tibble(sample(0:1,size=50,replace=TRUE)) %>%
mutate(id = row_number())
df1 <- df[ sample(which (df$value ==0) ,2), ]
df2 <- df[ sample(which (df$value ==1), 8), ]
df_final <- bind_rows(df1, df2)
value id
<int> <int>
1 0 14
2 0 36
3 1 21
4 1 24
5 1 2
6 1 50
7 1 49
8 1 41
9 1 28
10 1 33
library(tidyverse)
set.seed(123)
df <- data.frame(a = letters,
b = sample(c(0,1),26,T))
bind_rows(
df %>%
filter(b == 0) %>%
sample_n(2),
df %>%
filter(b == 1) %>%
sample_n(8)
) %>%
arrange(a)
a b
1 d 1
2 g 1
3 h 1
4 l 1
5 m 1
6 o 1
7 p 0
8 q 1
9 s 0
10 v 1
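The answers above all follow the same pattern: sample within each value of the binary column, then combine. A generalised base R sketch of that pattern is shown below. The function name sample_by_count and its arguments are hypothetical, not from any package. It samples row positions with sample.int because sample(x, n) treats a length-one numeric x as 1:x, which can silently pick the wrong rows when only one candidate row exists.

```r
set.seed(42)
dat <- data.frame(id = 1:100, b = rep(c(0, 1), times = c(30, 70)))

# counts: named vector, names are the values of `col`, entries are sample sizes
sample_by_count <- function(dat, col, counts) {
  picked <- lapply(names(counts), function(level) {
    candidates <- which(as.character(dat[[col]]) == level)
    n <- counts[[level]]
    stopifnot(length(candidates) >= n)  # not enough rows with this value
    candidates[sample.int(length(candidates), n)]
  })
  dat[unlist(picked), , drop = FALSE]
}

smp <- sample_by_count(dat, "b", c("0" = 2, "1" = 8))
table(smp$b)
```

The result always has exactly 2 rows with b equal to 0 and 8 rows with b equal to 1; which rows are drawn depends on the seed.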

Calculate row sums by variable names

What's the easiest way to calculate row-wise sums? For example, what if I wanted to calculate the sum of all variables whose names start with "txt_"? (see example below)
df <- data.frame(var1 = c(1, 2, 3),
txt_1 = c(1, 1, 0),
txt_2 = c(1, 0, 0),
txt_3 = c(1, 0, 0))
base R
We can first use grepl to find the column names that start with txt_ (the ^ anchors the pattern to the start of the name), then use rowSums on the subset.
rowSums(df[, grepl("^txt_", names(df))])
[1] 3 1 0
To add the sums to the original dataframe as a new column, use cbind.
cbind(df, sums = rowSums(df[, grepl("^txt_", names(df))]))
var1 txt_1 txt_2 txt_3 sums
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
Tidyverse
library(tidyverse)
df %>%
mutate(sum = rowSums(across(starts_with("txt_"))))
var1 txt_1 txt_2 txt_3 sum
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
Or if you want just the vector, then we can use pull:
df %>%
mutate(sum = rowSums(across(starts_with("txt_")))) %>%
pull(sum)
[1] 3 1 0
Data Table
Here is a data.table option as well:
library(data.table)
dt <- as.data.table(df)
dt[ ,sum := rowSums(.SD), .SDcols = grep("txt_", names(dt))]
dt[["sum"]]
# [1] 3 1 0
Another dplyr option:
df %>%
rowwise() %>%
mutate(sum = sum(c_across(starts_with("txt"))))
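One case the examples above do not cover is missing values: rowSums returns NA for any row that contains an NA unless na.rm = TRUE is set. The sketch below is a base R variant under that assumption. The NA in txt_1 is introduced here purely for illustration, and drop = FALSE keeps the subset a data frame even if only one column matches.

```r
df <- data.frame(var1  = c(1, 2, 3),
                 txt_1 = c(1, NA, 0),   # NA added for illustration only
                 txt_2 = c(1, 0, 0),
                 txt_3 = c(1, 0, 0))

# startsWith() does a literal prefix match, so no regular expression is needed
txt_cols <- startsWith(names(df), "txt_")

# na.rm = TRUE treats missing values as contributing nothing to the sum
df$sum <- rowSums(df[, txt_cols, drop = FALSE], na.rm = TRUE)
df$sum
```

Here the sums are 3, 0 and 0. Without na.rm = TRUE the second value would be NA.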

Detect sequences of ordered strings and group them using R

I have a string vector with about 500K elements and I want to assign a value to each element to show its group number.
The grouping criteria are as follows:
Group numbers are assigned consecutively from the top of the list.
Each element gets its own group, unless at least 3 consecutive elements are in ascending alphabetical order, in which case those consecutive elements share one group.
How do I do this in R?
For example and expected output:
> my_strings <- c("xx1", "1xxx", "abc.xyz", "a", "ad022", "ghj1", "kf1", "991r",
+ "jdd", "12vd", "r34o", "z", "034mh")
> expected_output <- c(1, 2, 3, 4, 4, 4, 4, 5, 6, 7, 7, 7, 8)
> (df <- data.frame(input = my_strings, output = expected_output))
input output
1 xx1 1
2 1xxx 2
3 abc.xyz 3
4 a 4
5 ad022 4
6 ghj1 4
7 kf1 4
8 991r 5
9 jdd 6
10 12vd 7
11 r34o 7
12 z 7
13 034mh 8
So far, I have tried using dplyr::lead to assign an order based on each pair of consecutive elements. I don't know how to proceed from here, though.
res <- as_tibble(my_strings) %>%
mutate(after = lead(my_strings))
res$pre_group = apply(res, 1, function(x) order(c(x[1], x[2]))[2])
(Dang, this was a tough one :-)
tidyverse
library(dplyr)
df %>%
mutate(r1 = cumsum(c(TRUE, diff(rank(input)) < 0)) + 0) %>%
group_by(r1) %>%
mutate(r2 = r1 + seq(0, 0.9*(n() < 3), len = n()) / n()) %>%
ungroup() %>%
mutate(r1 = with(list(rl = rle(r2)$lengths), rep(seq_along(rl), times = rl))) %>%
select(-r2)
# # A tibble: 13 x 3
# input output r1
# <chr> <dbl> <int>
# 1 xx1 1 1
# 2 1xxx 2 2
# 3 abc.xyz 3 3
# 4 a 4 4
# 5 ad022 4 4
# 6 ghj1 4 4
# 7 kf1 4 4
# 8 991r 5 5
# 9 jdd 6 6
# 10 12vd 7 7
# 11 r34o 7 7
# 12 z 7 7
# 13 034mh 8 8
(The lengthy with(...) in the mutate is just an inline version of data.table::rleid.)
data.table
library(data.table)
as.data.table(df)[
, r1 := cumsum(c(TRUE, diff(rank(input)) < 0)) + 0 ][
, r1 := r1 + seq(0, 0.9*(.N < 3), len = .N), by = .(r1) ][
, r1 := rleid(r1) ]
If you want to blur the lines of R-dialects a little, then
library(data.table)
library(magrittr)
as.data.table(df) %>%
.[, r1 := cumsum(c(TRUE, diff(rank(input)) < 0)) + 0 ] %>%
.[, r1 := r1 + seq(0, 0.9*(.N < 3), len = .N), by = .(r1) ] %>%
.[, r1 := rleid(r1) ]
Notes:
... + 0 is short-hand for as.numeric(...). It is needed because data.table keeps a column's original class when that column is updated with :=. Without the + 0, the first definition of r1 would be integer, so the fractional values assigned in the next step would be coerced (truncated) back to integer, defeating the purpose.
seq(0, 0.9*(...)) reduces to seq(0,0) when there are three or more in a group, which results in a no-op on that group. (This uses dplyr's n() and data.table's .N for group-size.)
the implementations differ slightly because dplyr prohibits modifying the grouping variable(s); data.table has no issue with this. (I'm not certain which direction is correct or better ...)
Not nearly as good as r2evans', but also seems to give the result.
library(dplyr) # for lead(), lag() and tibble()
x <- my_strings
n <- length(x)
c(FALSE,x[-1L] > x[-n]) &
c(FALSE,FALSE,x[-1L][-1L] > x[-n][-(n-1)]) &
c(FALSE,FALSE,FALSE,x[-1L][-1L][-1L] > x[-n][-(n-1)][-(n-2)])
(lead(x, 1) > x & lead(x,2) > lead(x,1)) |
(lag(x, 1) < x & lead(x,1) > x) |
(lag(x, 1) < x & lag(x,2) < lag(x,1)) -> condition
condition[is.na(condition)] <- FALSE # remove NAs
#to visualize
tibble(lag(x,2), lag(x,1), x, lead(x,1), lead(x,2), condition)
# There may be a better way than a loop
cur_class <- 0
classes <- integer(n)
for(i in seq_len(n)){
if(!condition[i]){ #not in a sequence
cur_class <- cur_class + 1
classes[i] <- cur_class
} else if(i == 1 || !condition[i-1]){ #first of a sequence (i == 1 avoids reading condition[0])
cur_class <- cur_class + 1
classes[i] <- cur_class
} else{ #mid-sequence
classes[i] <- cur_class
}
}
tibble(x, classes, condition*1L)
# A tibble: 13 x 3
# x classes `condition * 1L`
# <chr> <dbl> <int>
# 1 xx1 1 0
# 2 1xxx 2 0
# 3 abc.xyz 3 0
# 4 a 4 1
# 5 ad022 4 1
# 6 ghj1 4 1
# 7 kf1 4 1
# 8 991r 5 0
# 9 jdd 6 0
# 10 12vd 7 1
# 11 r34o 7 1
# 12 z 7 1
# 13 034mh 8 0
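For comparison, the grouping rule can also be written as a short vectorised base R function, which may matter for a vector of around 500K elements. This is a sketch under stated assumptions rather than either answerer's method: the name group_ascending_runs is hypothetical, ties count as not ascending, NA values are not handled, and string comparison follows the current locale's collation (as rank() does in the answer above).

```r
group_ascending_runs <- function(x, min_len = 3) {
  n <- length(x)
  if (n == 0) return(integer(0))
  # TRUE where an element is strictly greater than the one before it
  ascending <- c(FALSE, x[-1] > x[-n])
  # every non-ascending element opens a new run of ascending values
  run_id <- cumsum(!ascending)
  # length of the run each element belongs to
  run_len <- ave(seq_len(n), run_id, FUN = length)
  # a new group starts at each run start, and at every element of a short run
  starts_group <- !ascending | run_len < min_len
  cumsum(starts_group)
}

my_strings <- c("xx1", "1xxx", "abc.xyz", "a", "ad022", "ghj1", "kf1", "991r",
                "jdd", "12vd", "r34o", "z", "034mh")
group_ascending_runs(my_strings)
```

On the example vector this reproduces the expected output from the question, 1 2 3 4 4 4 4 5 6 7 7 7 8.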

Variable names as Input in an R Function

I have a dataframe with several numeric variables along with factors. I want to go over the numeric variables and replace the negative values with missing values (NA), but I couldn't get that to work.
My alternative idea was to write a function that takes a dataframe and a variable and does the replacement. That didn't work either.
My code is:
NegativeToMissing = function(df,var)
{
df$var[df$var < 0] = NA
}
Error in $<-.data.frame(`*tmp*`, "var", value = logical(0)) : replacement has 0 rows, data has 40
What am I doing wrong?
Thank you.
Here is an example with some dummy data.
df1 <- data.frame(col1 = c(-1, 1, 2, 0, -3),
col2 = 1:5,
col3 = LETTERS[1:5])
df1
# col1 col2 col3
#1 -1 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 -3 5 E
Now find columns that are numeric
numeric_cols <- sapply(df1, is.numeric)
And replace negative values
df1[numeric_cols] <- lapply(df1[numeric_cols], function(x) replace(x, x < 0 , NA))
df1
# col1 col2 col3
#1 NA 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 NA 5 E
You could also do
df1[df1 < 0] <- NA
With tidyverse, we can use mutate with across to target only the numeric columns
library(tidyverse)
df1 %>%
mutate(across(where(is.numeric), ~ replace(.x, .x < 0, NA)))
If you still want to change only one selected variable, a solution with dplyr would be to use non-standard evaluation:
library(dplyr)
NegativeToMissing <- function(df, var) {
quo_var = quo_name(var)
df %>%
mutate(!!quo_var := ifelse(!!var < 0, NA, !!var))
}
NegativeToMissing(data, var=quo(val1)) # use quo() function without ""
# val1 val2
# 1 1 1
# 2 NA 2
# 3 2 3
Data used:
data <- data.frame(val1 = c(1, -1, 2),
val2 = 1:3)
data
# val1 val2
# 1 1 1
# 2 -1 2
# 3 2 3
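As for why the original function fails: df$var looks for a column literally named var instead of using the value of the argument, and the function never returns the modified dataframe. A minimal base R sketch that addresses both points is below. It assumes the column name is passed as a string, and the name negative_to_missing is only illustrative.

```r
negative_to_missing <- function(df, var) {
  values <- df[[var]]                          # [[ ]] looks up the column named by var
  values[!is.na(values) & values < 0] <- NA    # leave existing NAs untouched
  df[[var]] <- values
  df                                           # return the modified copy
}

data <- data.frame(val1 = c(1, -1, 2),
                   val2 = 1:3)
data <- negative_to_missing(data, "val1")      # reassign, since R copies on modify
data
```

Because R functions work on a copy of their arguments, the result has to be assigned back, as in the last step.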

Copy preceding row if condition is met

My data
set.seed(123)
df <- data.frame(loc = rep(1:5, each = 5),value = sample(0:4, 25, replace = T))
a <- c("x","y","z","k")
df$id <- ifelse(df$value == 0, "no.data", sample(a,1))
head(df)
loc value id
1 1 1 z
2 1 3 z
3 1 2 z
4 1 4 z
5 1 4 z
6 2 0 no.data
In rows for which I have no data, the id column is no.data and the value column is 0. For all such rows (id == no.data and value == 0), I want to copy the value and id from the preceding row.
loc value id
1 1 1 z
2 1 3 z
3 1 2 z
4 1 4 z
5 1 4 z
6 2 4 z
Something like:
df %>% group_by(loc) %>% mutate(value = ifelse(value == 0, copy the value from preceding row), id = ifelse(id== "no.data", copy the id from preceding row ))
We could replace the 0s by NA and then do a fill
library(tidyverse)
library(naniar)
df %>%
replace_with_na(replace = list(value = 0, id = "no.data")) %>%
fill(value, id)
Unless you have a very big dataset, a simple loop should do
for (r in 2:nrow(df)) {
if (with(df[r, ], id == "no.data" && value == 0)) {
df[r, c("id", "value")] <- df[r - 1L, c("id", "value")]
}
}
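If the loop turns out to be too slow, the same carry-forward can be sketched in vectorised base R without extra packages. The idea is to compute, for every row, the position of the most recent row that has data, then index by it. The small data frame and the names is_missing, last_ok and src below are illustrative only, and leading rows with no data are left unchanged because there is nothing to copy from.

```r
df <- data.frame(
  loc   = c(1, 1, 2, 2, 2),
  value = c(3, 4, 0, 0, 2),
  id    = c("z", "z", "no.data", "no.data", "k")
)

n <- nrow(df)
is_missing <- df$value == 0 & df$id == "no.data"

# position of the latest row with data at or before each row (0 if none yet)
last_ok <- cummax(ifelse(is_missing, 0L, seq_len(n)))

# rows with no earlier data point at themselves, so they stay as they are
src <- ifelse(last_ok == 0L, seq_len(n), last_ok)

df$value <- df$value[src]
df$id    <- df$id[src]
df
```

For this toy data, value becomes 3, 4, 4, 4, 2 and id becomes z, z, z, z, k. Like the loop, this ignores loc and copies across location boundaries, which matches the expected output in the question.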
