Calculate row sums by variable names

What's the easiest way to calculate row-wise sums? For example, how would I calculate the sum across all variables whose names start with "txt_"? (See the example below.)
df <- data.frame(var1 = c(1, 2, 3),
                 txt_1 = c(1, 1, 0),
                 txt_2 = c(1, 0, 0),
                 txt_3 = c(1, 0, 0))

base R
We can first use grepl to find the column names that start with txt_ (anchoring the pattern with ^ so it only matches at the start of the name), then use rowSums on that subset.
rowSums(df[, grepl("^txt_", names(df))])
[1] 3 1 0
If you want the result alongside the original dataframe, bind the output back with cbind:
cbind(df, sums = rowSums(df[, grepl("^txt_", names(df))]))
var1 txt_1 txt_2 txt_3 sums
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
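If you prefer to match a literal prefix without a regex, base R's startsWith() works too; a small equivalent sketch:
# startsWith() tests a literal prefix, so no regex anchoring is needed
rowSums(df[, startsWith(names(df), "txt_")])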
Tidyverse
library(tidyverse)
df %>%
  mutate(sum = rowSums(across(starts_with("txt_"))))
var1 txt_1 txt_2 txt_3 sum
1 1 1 1 1 3
2 2 1 0 0 1
3 3 0 0 0 0
Or, if you want just the vector, we can use pull:
df %>%
  mutate(sum = rowSums(across(starts_with("txt_")))) %>%
  pull(sum)
[1] 3 1 0
data.table
Here is a data.table option as well:
library(data.table)
dt <- as.data.table(df)
dt[, sum := rowSums(.SD), .SDcols = grep("^txt_", names(dt))]
dt[["sum"]]
# [1] 3 1 0
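In recent data.table versions, .SDcols also accepts the patterns() helper, which avoids the grep() call; an equivalent sketch:
dt[, sum := rowSums(.SD), .SDcols = patterns("^txt_")]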

Another dplyr option:
df %>%
  rowwise() %>%
  mutate(sum = sum(c_across(starts_with("txt_")))) %>%
  ungroup()
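The rowwise()/c_across() pattern is slower than rowSums() on large data, but it generalizes to any function without a built-in row-wise shortcut; for example, a row mean over the same columns:
df %>%
  rowwise() %>%
  mutate(avg = mean(c_across(starts_with("txt_")))) %>%
  ungroup()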


Random Sample From a Dataframe With Specific Count

This question is probably best illustrated with an example.
Suppose I have a dataframe df with a binary variable b (values of b are 0 or 1). How can I take a random sample of size 10 from this dataframe so that the sample contains 2 instances where b=0 and 8 instances where b=1?
Right now, I know that I can do df[sample(nrow(df), 10), ] to get part of the answer, but that would give me a random number of 0 and 1 instances. How can I specify an exact number of 0 and 1 instances while still sampling randomly?
Here's an example of how I'd do this... take two samples and combine them. I've written a simple function so you can "just take one sample."
With a vector:
pop <- sample(c(0, 1), 100, replace = TRUE)
yoursample <- function(pop, n_zero, n_one){
  c(sample(pop[pop == 0], n_zero),
    sample(pop[pop == 1], n_one))
}
yoursample(pop, n_zero = 2, n_one = 8)
[1] 0 0 1 1 1 1 1 1 1 1
Or, if you are working with a dataframe with some unique index called id:
# d1 is an example value column that you might summarize with mean() and sd()
dat <- data.frame(
  id = 1:100,
  val = sample(c(0, 1), 100, replace = TRUE),
  d1 = runif(100))
yoursample <- function(dat, n_zero, n_one){
  c(sample(dat[dat$val == 0, "id"], n_zero),
    sample(dat[dat$val == 1, "id"], n_one))
}
sample_ids <- yoursample(dat, n_zero = 2, n_one = 8)
sample_ids
mean(dat[dat$id %in% sample_ids,"d1"])
sd(dat[dat$id %in% sample_ids,"d1"])
Here is a suggestion:
First create a sample of 0s and 1s with an id column.
Then sample 2 rows where value == 0 and 8 rows where value == 1, and bind them together:
library(tidyverse)
set.seed(123)
df <- as_tibble(sample(0:1, size = 50, replace = TRUE)) %>%
  mutate(id = row_number())
df1 <- df[sample(which(df$value == 0), 2), ]
df2 <- df[sample(which(df$value == 1), 8), ]
df_final <- bind_rows(df1, df2)
df_final
value id
<int> <int>
1 0 14
2 0 36
3 1 21
4 1 24
5 1 2
6 1 50
7 1 49
8 1 41
9 1 28
10 1 33
library(tidyverse)
set.seed(123)
df <- data.frame(a = letters,
                 b = sample(c(0, 1), 26, TRUE))
bind_rows(
  df %>%
    filter(b == 0) %>%
    sample_n(2),
  df %>%
    filter(b == 1) %>%
    sample_n(8)
) %>%
  arrange(a)
a b
1 d 1
2 g 1
3 h 1
4 l 1
5 m 1
6 o 1
7 p 0
8 q 1
9 s 0
10 v 1
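Note that sample_n() is superseded in current dplyr; slice_sample() is the modern equivalent, e.g.:
bind_rows(
  df %>% filter(b == 0) %>% slice_sample(n = 2),
  df %>% filter(b == 1) %>% slice_sample(n = 8)
) %>%
  arrange(a)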

Update specific values in a dataframe based on array index position

Let's say I have a dataframe
> colA <- c(1, 14, 8)
> colB <- c(4, 8, 9)
> colC <- c(1, 2, 14)
> df <- data.frame(colA, colB, colC)
> df
colA colB colC
1 1 4 1
2 14 8 2
3 8 9 14
What I want to do is create a second data frame which has the same structure as df, but has 1 whenever a specific number is found, and 0 otherwise, e.g., if the number were 14, df2 would look like this
> df2
colA colB colC
1 0 0 0
2 1 0 0
3 0 0 1
I thought I could create a 3x3 data frame of 0s (df2), use which() to get the index for the number in df, and then use that index to change what shows up in df2
> number <- 14
> index <- which(df == number)
> index
[1] 2 9
or perhaps more helpfully
> index <- which(df == number, arr.ind = T)
> index
row col
[1,] 2 1
[2,] 3 3
However, I am unsure how to use this index to specify which cells of df2 should be set to 1 and which left at 0 (i.e., how to reverse the which()).
NB - I will actually be testing this for multiple numbers, so I figured I would do it inside a for loop, gradually switching the 0s "on" to 1s so that the final df2 has a 1 at every location that holds any of the numbers:
> numbers <- c(14, 9, 1)
> for(i in numbers){
>   index <- which(df == i, arr.ind = TRUE)
>   # then do whatever needs to be done to change the index locations in df2
> }
P.S., in general, I work in the tidyverse, so tidyverse-specific solutions would be grand, but base R would also be brilliant.
Oh, and yes, this is for day 4 of Advent of Code - it's a useful challenge to help this non-expert coder learn.
Thanks
Here's a full example of how it could be done.
Data
df <- structure(list(colA = c(1, 14, 8), colB = c(4, 8, 9), colC = c(1,
2, 14)), class = "data.frame", row.names = c(NA, -3L))
base R
data.frame(sapply(df, function(x) as.numeric(x == 14 | x == 8)))
colA colB colC
1 0 0 0
2 1 1 0
3 1 0 1
for any set of numbers at once
setNames(data.frame(matrix(rowSums(sapply(c(14, 8, 1),
  function(x) df == x)), nrow(df))), colnames(df))
colA colB colC
1 1 0 1
2 1 1 0
3 1 0 1
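To answer the "reverse the which()" part directly: the matrix returned by which(..., arr.ind = TRUE) can itself be used as a subscript in an assignment, so the asker's loop sketch can be completed like this (a minimal base R sketch):
df2 <- df   # copy the shape and names of df
df2[] <- 0  # fill every cell with 0
for(i in c(14, 9, 1)){
  index <- which(df == i, arr.ind = TRUE)
  df2[index] <- 1  # matrix indexing switches exactly those cells on
}
df2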
dplyr
library(dplyr)
df %>% summarise_all(~ as.numeric(.x == 14 | .x == 8))
colA colB colC
1 0 0 0
2 1 1 0
3 1 0 1
# or
df %>% summarise(across(everything(), ~ as.numeric(.x == 14 | .x == 8)))
colA colB colC
1 0 0 0
2 1 1 0
3 1 0 1
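For the multiple-numbers case, %in% avoids chaining == comparisons; with dplyr 1.0+ a sketch would be:
df %>% mutate(across(everything(), ~ as.numeric(.x %in% c(14, 8, 1))))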

Add multiple columns with dplyr and fill cells based on condition

I am trying to:
1) add multiple columns that correspond to existing columns (e.g., a1 exists, so add a1_yes), and
2) put 1 in an a#_yes column if the corresponding cell contains 1, 2, or 3, and 0 otherwise.
I can easily do this with base R, but I'm trying to also make it work with dplyr.
My data:
df <- data.frame(a1 = c(1, 2, 0, NA, NA),
                 a2 = c(NA, 1, 2, 3, 3))
With base R:
df[paste0("a", 1:2, "_yes")] <- NA # add columns
for(c in 1:2) {
  for(r in 1:nrow(df)) {
    df[r, c + 2] <- ifelse(df[r, c] %in% c(1, 2, 3), 1, 0)
  }
}
> df
a1 a2 a1_yes a2_yes
1 1 NA 1 0
2 2 1 1 1
3 0 2 0 1
4 NA 3 0 1
5 NA 3 0 1
Thank you
Here is an option, assuming you want to do this to all columns of your dataframe:
library(dplyr)
df %>%
  mutate_all(list(yes = ~ ifelse(.x %in% 1:3, 1, 0)))
# a1 a2 a1_yes a2_yes
#1 1 NA 1 0
#2 2 1 1 1
#3 0 2 0 1
#4 NA 3 0 1
#5 NA 3 0 1
Edits
As @akrun mentioned, you can do this without ifelse() by using as.integer() or unary +:
df %>%
  mutate_all(list(yes = ~ as.integer(.x %in% 1:3)))
You can also use mutate_at() to select specific vars:
df %>%
  mutate_at(vars(a1, a2), list(yes = ~ as.integer(.x %in% 1:3)))
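In dplyr 1.0 and later, mutate_all()/mutate_at() are superseded by across(), whose .names argument can build the _yes suffix; an equivalent sketch:
df %>%
  mutate(across(everything(), ~ as.integer(.x %in% 1:3), .names = "{.col}_yes"))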
The following works unchanged no matter how many columns you have, provided they are all in this format:
df %>%
  mutate_all(function(x) ifelse(x == 0 | is.na(x), 0, 1)) %>%
  rename_all(function(x) paste0(x, "_yes")) %>%
  bind_cols(df, .)
Here's a dplyr solution:
library(dplyr)
df <- data.frame(a1 = c(1, 2, 0, NA, NA),
                 a2 = c(NA, 1, 2, 3, 3))
df2 <- df %>%
  mutate(a1_yes = ifelse(a1 == 0 | is.na(a1), 0, 1),
         a2_yes = ifelse(a2 == 0 | is.na(a2), 0, 1))
Instead of writing the conditions for when the new columns should be 1, I wrote the conditions for when they should be 0.
Here is a solution:
df <- data.frame(a1 = c(1, 2, 0, NA, NA),
                 a2 = c(NA, 1, 2, 3, 3))
check_values <- c(1, 2, 3)
df %>% mutate(a1_yes = ifelse(a1 %in% check_values, 1, 0),
              a2_yes = ifelse(a2 %in% check_values, 1, 0))

Reshaping data from long to wide with both sums and counts

I am trying to reshape data from long to wide format in R. I would like to get both the count of occurrences of a type variable and the sum of a second variable (val), by id and type, as in the example below.
I was able to find answers for reshaping with either counts or sums but not for both simultaneously.
This is the original example data:
> df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
+ type = c("A", "A", "B", "A", "B", "C"),
+ val = c(0, 1, 2, 0, 0, 4))
> df
id type val
1 1 A 0
2 1 A 1
3 1 B 2
4 2 A 0
5 2 B 0
6 2 C 4
The output I would like to obtain is the following:
id A.count B.count C.count A.sum B.sum C.sum
1 1 2 1 0 1 2 0
2 2 1 1 1 0 0 4
where the count columns display the number of occurrences of type A, B and C and the sum columns the sum of the values by type.
To achieve the counts I can, as suggested in this answer, use reshape2::dcast with the default aggregation function, length:
> require(reshape2)
> df.c <- dcast(df, id ~ type, value.var = "type", fun.aggregate = length)
> df.c
id A B C
1 1 2 1 0
2 2 1 1 1
Similarly, as suggested in this answer, I can also perform the reshape with the sums as output, this time using the sum aggregation function in dcast:
> df.s <- dcast(df, id ~ type, value.var = "val", fun.aggregate = sum)
> df.s
id A B C
1 1 1 2 0
2 2 0 0 4
I could merge the two:
> merge(x = df.c, y = df.s, by = "id", all = TRUE)
id A.x B.x C.x A.y B.y C.y
1 1 2 1 0 1 2 0
2 2 1 1 1 0 0 4
but is there a way of doing it all in one go (not necessarily with dcast or reshape2)?
Since data.table v1.9.6, dcast can cast multiple value.var columns and apply multiple fun.aggregate functions. See below:
library(data.table)
df <- data.table(df)
dcast(df, id ~ type, fun.aggregate = list(length, sum), value.var = "val")
id val_length_A val_length_B val_length_C val_sum_A val_sum_B val_sum_C
1: 1 2 1 0 1 2 0
2: 2 1 1 1 0 0 4
Here is an approach with the tidyverse:
library(tidyverse)
df %>%
  group_by(id, type) %>%
  summarise(count = n(), Sum = sum(val)) %>%
  gather(key, val, count:Sum) %>%
  unite(typen, type, key, sep = ".") %>%
  spread(typen, val, fill = 0)
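gather() and spread() are retired in current tidyr; with tidyr 1.0+ the same result can be sketched with pivot_wider():
df %>%
  group_by(id, type) %>%
  summarise(count = n(), sum = sum(val), .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = c(count, sum),
              names_glue = "{type}.{.value}", values_fill = 0)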
The data.table solution suggested is probably better, but if you prefer reshape2::dcast and you have many value.var/fun.aggregate combinations, you could also do:
library(purrr)
library(dplyr) # for left_join()
cols <- c('type', 'val')
funs <- c(length, sum)
map2(cols, funs, ~ dcast(df, id ~ type, value.var = .x, fun.aggregate = .y)) %>%
  reduce(left_join, by = 'id', suffix = c('.count', '.sum'))

How to use window function in R

I have the following data frame structure :
id status
a 1
a 2
a 1
b 1
b 1
b 0
b 1
c 0
c 0
c 2
c 1
d 0
d 2
d 0
Here a, b, c, d are unique ids and status is a flag taking the values 0, 1, and 2.
I need to select each id whose status changed from 0 to 1 at any point over the whole time frame, so the expected output would be the two ids 'b' and 'c'.
I thought of using lag to accomplish that, but then I wouldn't be able to handle id 'c', which has a 0 at the beginning but only reaches 1 some rows later (with a 2 in between, so the 0 and the 1 are never adjacent). Any thoughts on how to achieve this using window functions (or any other technique)?
You want to find id's having a status of 1 after having had a status of 0.
Here is a dplyr solution:
library(dplyr)
# Generate data
mydf = data_frame(
id = c(rep("a", 3), rep("b", 4), rep("c", 4), rep("d", 3)),
status = c(1, 2, 1, 1, 1, 0, 1, 0, 0, 2, 1, 0, 2, 0)
)
mydf %>%
  group_by(id) %>%
  # Keep only 0s and 1s
  filter(status %in% c(0, 1)) %>%
  # Compute the difference between consecutive kept statuses
  mutate(dif = status - lag(status, 1)) %>%
  # A dif of 1 means a 0 directly followed by a 1
  filter(dif == 1) %>%
  # Keep the corresponding ids
  select(id) %>%
  unique()
One possible way using dplyr (Edited to include id only when a 1 appears after a 0):
library(dplyr)
df %>%
  group_by(id) %>%
  filter(status %in% c(0, 1)) %>%
  filter(status == 0 & lead(status, default = 0) == 1) %>%
  select(id) %>%
  unique()
#> # A tibble: 2 x 1
#> # Groups: id [2]
#> id
#> <chr>
#> 1 b
#> 2 c
Data
df <- read.table(text = "id status
a 1
a 2
a 1
b 1
b 1
b 0
b 1
c 0
c 0
c 2
c 1
d 0
d 2
d 0", header = TRUE, stringsAsFactors = FALSE)
I don't know if this is the most efficient way, but: split status by id, check whether there is any 0, and if so, check for a 1 at or after the first 0:
lst <- split(df$status, df$id)
f <- function(x) {
  if (!any(x == 0)) return(FALSE)
  # which.max() on a logical vector gives the index of the first TRUE,
  # i.e. the position of the first 0
  any(x[which.max(x == 0):length(x)] == 1)
}
names(lst)[sapply(lst, f)]
# [1] "b" "c"
