Get column names into a new variable based on conditions - r

I have a data frame like this in R. My problem can be divided into two steps.
SUBID   ABC  BCD  DEF
192838    4   -3    2
193928   -6   -2    6
205829    4   -5    9
201837    3    4    4
I want to make a new variable that contains a list of the column names that have a negative value for each SUBID. The output should look something like this:
SUBID   ABC  BCD  DEF  output
192838    4   -3    2  "BCD"
193928   -6   -2    6  "ABC","BCD"
205829    4   -5    9  "BCD"
201837    3    4    4  " "
And then, in the second step, I would like to collapse the SUBID into a more general ID and get the number of unique strings from the output variable for each ID (I just need the number, the specific strings in the parenthesis are just for illustration).
SUBID  output
19     2 ("ABC","BCD")
20     1 ("BCD")
Those are the two steps that I think should be done, but maybe there is a way to skip the first step and go directly to the second that I don't know about.
I would appreciate any help since right now I am not sure where to start on this. Thank you!

Another way:
library(dplyr)
library(tidyr)
df <- df %>% pivot_longer(-SUBID)
df1 <- df %>%
group_by(SUBID) %>%
summarise(output = paste(name[value < 0L], collapse = ','))
df2 <- df %>%
group_by(SUBID = substr(SUBID, 1, 2)) %>%
summarise(output_count = n_distinct(name[value < 0L]),
output = paste0(output_count, ' (', paste(unique(name[value < 0L]), collapse = ','), ')'))
Outputs (two columns are created in the second case, one with just the count and another following your example):
df1
# A tibble: 4 x 2
SUBID output
<int> <chr>
1 192838 "BCD"
2 193928 "ABC,BCD"
3 201837 ""
4 205829 "BCD"
df2
# A tibble: 2 x 3
SUBID output_count output
<chr> <int> <chr>
1 19 2 2 (BCD,ABC)
2 20 1 1 (BCD)

This answers the first part of your question; I didn't understand the second part.
df$output <-apply(df[,-1], 1, function(x) paste(names(df)[-1][x<0], collapse = ","))
df
SUBID ABC BCD DEF output
1 192838 4 3 -2 DEF
2 193928 -6 -2 6 ABC,BCD
3 205829 4 -5 9 BCD
4 201837 3 4 4
For the second part, try this:
id <- sapply(strsplit(sub("\\W+", "", df$output), split = ""), function(x){
sum(!(duplicated(x) | duplicated(x, fromLast = TRUE)))
} )
data.frame(SUBID = substr(df$SUBID, 1,2), output = id, string = df$output)
SUBID output string
1 19 3 DEF
2 19 2 ABC,BCD
3 20 3 BCD
4 20 0
I added the variable string so you can check that the count of unique values is right.
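If only the count per general ID is needed, both steps can be folded into one in base R. The following is a minimal sketch, assuming the data frame from the question and that the general ID is the first two digits of SUBID; the names value_cols, neg, id and counts are illustrative and do not come from any of the answers.

```r
# Data from the question.
df <- data.frame(SUBID = c(192838, 193928, 205829, 201837),
                 ABC = c(4, -6, 4, 3),
                 BCD = c(-3, -2, -5, 4),
                 DEF = c(2, 6, 9, 4))

value_cols <- setdiff(names(df), "SUBID")
neg <- df[value_cols] < 0          # logical matrix: TRUE where a value is negative
id <- substr(df$SUBID, 1, 2)       # assumed general ID: first two digits of SUBID

# rowsum() adds up the rows of each ID group column by column, so a column
# total above zero means that column was negative at least once in the group.
counts <- rowSums(rowsum(neg * 1, id) > 0)
counts
```

The result is a named vector with one entry per general ID, holding the number of distinct columns that were negative at least once within that ID.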

One option is to take advantage of dplyr::cur_data() to access the names() of the data and subset based on your criteria. Then you can take advantage of tibble list-columns to hold on to a set of column names of arbitrary length and finally calculate the number of unique values in that list.
library(tidyverse)
d <- structure(list(SUBID = c(192838, 193928, 205829, 201837), ABC = c(4, -6, 4, 3), BCD = c(-3, -2, -5, 4), DEF = c(2, 6, 9, 4)), row.names = c(NA, -4L), class = "data.frame")
d %>%
rowwise() %>%
mutate(neg_col_names = list(names(cur_data())[cur_data() < 0])) %>%
group_by(ID_grp = str_sub(SUBID, 1, 2)) %>%
summarize(neg_col_count = n_distinct(unlist(c(neg_col_names))))
#> # A tibble: 2 × 2
#> ID_grp neg_col_count
#> <chr> <int>
#> 1 19 2
#> 2 20 1
Created on 2022-11-22 with reprex v2.0.2

Related

How can I divide one variable into two variables in R?

I have a variable x which can take five values (0,1,2,3,4). I want to divide the variable into two variables. Variable 1 is supposed to contain the value 0 and variable two is supposed to contain the values 1,2,3 and 4.
I'm sure this is easy but I can't find out what i need to do.
what my data looks like:
|variable x|
|-----------|
|0|
|1|
|0|
|4|
|3|
|0|
|0|
|2|
so i get the table:
  0   1   2   3   4
125  34  14  15  15
But I want my data to look like this
variable 1  variable 2
125         78
So variable 1 is supposed to contain how often 0 is in my data
and variable 2 is supposed to contain the sum of how often 1,2,3 and 4 are in my data
You can convert the variable to logical by testing whether x == 0
x <- c(0, 1, 0, 4, 3, 0, 0, 2)
table(x)
#> x
#> 0 1 2 3 4
#> 4 1 1 1 1
table(x == 0)
#> FALSE TRUE
#> 4 4
If you want the exact headings, you can do:
# table(x != 0) lists FALSE (the zeros) first, so the labels line up with the counts
setNames(table(x != 0), c(0, paste(sort(unique(x[x != 0])), collapse = ",")))
#> 0 1,2,3,4
#> 4 4
And if you want to change the variable to a factor you could do:
data.frame(x = factor(c("zero", "not zero")[1 + (x != 0)]))
#> x
#> 1 zero
#> 2 not zero
#> 3 zero
#> 4 not zero
#> 5 not zero
#> 6 zero
#> 7 zero
#> 8 not zero
Created on 2022-04-02 by the reprex package (v2.0.1)
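To tie the logical test back to the numbers in the question, here is a small sketch that rebuilds a vector with the frequencies from the question's table and stores the two counts under the requested names. The vector is reconstructed with rep() as an assumption about what the real data looks like, and counts is an illustrative name.

```r
# Rebuild a vector with the frequencies shown in the question's table.
x <- rep(0:4, times = c(125, 34, 14, 15, 15))

# One count for the zeros, one for everything else.
counts <- c("variable 1" = sum(x == 0),
            "variable 2" = sum(x != 0))
counts
```

Counting with sum() on a logical vector avoids building the full table first, which is enough when only the two totals are needed.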
base R
You can use cbind:
x = sample(0:5, 200, replace = TRUE)
table(x)
# x
# 0 1 2 3 4 5
# 29 38 41 35 27 30
cbind(`0` = table(x)[1], `1,2,3,4` = sum(table(x)[2:5]))
# 0 1,2,3,4
# 0 29 141
tidyverse
library(tidyverse)
ta = as.data.frame(t(as.data.frame.array(table(x))))
ta %>%
mutate(!!paste(names(.[-1]), collapse = ",") := sum(c_across(`1`:`5`)), .keep = "unused")
# 0 1,2,3,4,5
# 1 29 171
Beginning with the vector, we can get the frequency from table then put it into a dataframe. Then, we can create a new column with the names collapsed (i.e., 1,2,3,4) and get the row sum for all columns except the first one.
library(tidyverse)
tab <- data.frame(value=c(0, 1, 2, 3, 4),
freq=c(125, 34, 14, 15, 15))
x <- rep(tab$value, tab$freq)
output <- data.frame(rbind(table(x))) %>%
rename_with(~str_remove(., 'X')) %>%
mutate(!!paste0(names(.)[-1], collapse = ",") := rowSums(select(., -1))) %>%
select(1, last_col())
Output
0 1,2,3,4
1 125 78
Then, to create the 2 variables in 2 dataframes, you can split the columns into a list, change the names, then put into the global environment.
list2env(setNames(
split.default(output, seq_along(output)),
c("variable 1", "variable 2")
), envir = .GlobalEnv)
Or you could just subset:
variable1 <- data.frame(`variable 1` = output$`0`, check.names = FALSE)
variable2 <- data.frame(`variable 2` = output$`1,2,3,4`, check.names = FALSE)
Update: deleted first answer:
df[paste(names(df[2:5]), collapse = ",")] <- rowSums(df[2:5])
df[, c(1,6)]
# A tibble: 1 × 2
`0` `1,2,3,4`
<dbl> <dbl>
1 125 78
data:
df <- structure(list(`0` = 125, `1` = 34, `2` = 14, `3` = 15, `4` = 15), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))

How do I group a new variable based on a variables first letter?

I have a variable that is alphanumeric, ex: A890. I have over 1 million records and all 26 letters of the alphabet are used to start the variable, and I would like to know how many of these records (the count) start with a given letter. Ideally creating a table would be great. Ex:
A - 2,900
B - 784,090
Etc.
Could anyone please help me with this?
Is this something you can use?
x <- c("A1234", "A234", "B7654", "A76768", "C980", "A767", "Z90898")
library(stringr)
table(str_extract(x, "^[A-Z]"))
A B C Z
4 1 1 1
Here you extract the upper-case letter occurring in first position in the string and tabulate the result.
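The same tabulation can be sketched in base R without stringr. This version assumes the first character is always a letter, upper-cases it in case the data mixes cases, and uses a factor with levels LETTERS so that letters with no records still appear with a count of zero; first_letter and counts are illustrative names.

```r
x <- c("A1234", "A234", "B7654", "A76768", "C980", "A767", "Z90898")

# Take the first character, normalise its case, and fix the levels to A-Z
# so that every letter gets an entry in the table.
first_letter <- factor(toupper(substr(x, 1, 1)), levels = LETTERS)
counts <- table(first_letter)

# Show only the letters that actually occur.
counts[counts > 0]
```

With a data frame, the same expression can be applied to the relevant column, for example table(substr(df$name, 1, 1)).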
Does this answer:
> df <- data.frame(name = c('A123','A2321','B32','B3232','C098','A989','C321','D233','D123','B2132'),
+ value = round(rnorm(10,100,2),2))
> df %>% group_by(substr(name,1,1)) %>% summarise(occurances = n())
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 2
`substr(name, 1, 1)` occurances
<chr> <int>
1 A 3
2 B 3
3 C 2
4 D 2
>
A dplyr solution
library(dplyr)
df2 <- data.frame(alphanumeric = c(rep("A890", 5), rep("B2011", 8)))
df2 %>%
mutate(starter = substr(x = alphanumeric, 1, 1)) %>%
group_by(starter) %>%
tally
#> # A tibble: 2 x 2
#> starter n
#> <chr> <int>
#> 1 A 5
#> 2 B 8

Rolling sum of one variable in data.frame in number of steps defined by another variable

I'm trying to sum up the values in a data.frame in a cumulative way.
I have this:
df <- data.frame(
a = rep(1:2, each = 5),
b = 1:10,
step_window = c(2,3,1,2,4, 1,2,3,2,1)
)
I'm trying to sum up the values of b, within the groups a. The trick is, I want the sum of b values that corresponds to the number of rows following the current row given by step_window.
This is the output I'm looking for:
data.frame(
a = rep(1:2, each = 5),
step_window = c(2,3,1,2,4,
1,2,3,2,1),
b = 1:10,
sum_b_step_window = c(3, 9, 3, 9, 5,
6, 15, 27, 19, 10)
)
I tried to do this using the RcppRoll but I get an error Expecting a single value:
df %>%
group_by(a) %>%
mutate(sum_b_step_window = RcppRoll::roll_sum(x = b, n = step_window))
I'm not sure whether a variable window size is possible in any of the rolling functions. Here is one way to do this using map2_dbl:
library(dplyr)
df %>%
group_by(a) %>%
mutate(sum_b_step_window = purrr::map2_dbl(row_number(), step_window,
~sum(b[.x:(.x + .y - 1)], na.rm = TRUE)))
# a b step_window sum_b_step_window
# <int> <int> <dbl> <dbl>
# 1 1 1 2 3
# 2 1 2 3 9
# 3 1 3 1 3
# 4 1 4 2 9
# 5 1 5 4 5
# 6 2 6 1 6
# 7 2 7 2 15
# 8 2 8 3 27
# 9 2 9 2 19
#10 2 10 1 10
1) rollapply
rollapply in zoo supports vector widths. partial=TRUE says that if the width goes past the end then use just the values within the data. (Another possibility would be to use fill=NA instead, in which case it would fill with NAs if there were not enough data left.) align="left" specifies that the current value at each step is the left end of the range to sum.
library(dplyr)
library(zoo)
df %>%
group_by(a) %>%
mutate(sum = rollapply(b, step_window, sum, partial = TRUE, align = "left")) %>%
ungroup
2) SQL
This can also be done in SQL by left joining df to itself on the indicated condition and then for each row summing over all rows for which the condition matches.
library(sqldf)
sqldf("select A.*, sum(B.b) as sum
from df A
left join df B on B.rowid between A.rowid and A.rowid + A.step_window - 1
and A.a = B.a
group by A.rowid")
Here is a solution with the package slider.
library(dplyr)
library(slider)
df %>%
group_by(a) %>%
mutate(sum_b_step_window = hop_vec(b, row_number(), step_window+row_number()-1, sum)) %>%
ungroup()
It is flexible on different window sizes.
Output:
# A tibble: 10 x 4
a b step_window sum_b_step_window
<int> <int> <dbl> <int>
1 1 1 2 3
2 1 2 3 9
3 1 3 1 3
4 1 4 2 9
5 1 5 4 5
6 2 6 1 6
7 2 7 2 15
8 2 8 3 27
9 2 9 2 19
10 2 10 1 10
slider is a fairly recent tidyverse package dedicated to sliding-window functions. See the package website and its vignette for more info.
hop is the engine of slider. With this solution we are triggering different .start and .stop to sum the values of b according to the a groups.
With _vec you're asking hop to return a vector: a double in this case.
row_number() is a dplyr function that allows you to return the row number of each group, thus allowing you to slide along the rows.
data.table solution using cumulative sums
library(data.table)
setDT(df)
df[, sum_b_step_window := {
cs <- c(0,cumsum(b))
cs[pmin(.N+1, 1:.N+step_window)]-cs[pmax(1, (1:.N))]
},by = a]
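The cumulative-sum idea can also be written out in base R, which makes the indexing easier to follow. This is a sketch under the assumption that windows running past the end of a group are simply truncated, as in the expected output of the question; window_sum is a hypothetical helper name.

```r
df <- data.frame(
  a = rep(1:2, each = 5),
  b = 1:10,
  step_window = c(2, 3, 1, 2, 4, 1, 2, 3, 2, 1)
)

# Sum of b[i..(i + w[i] - 1)] for every i, computed from one cumulative sum.
window_sum <- function(b, w) {
  n <- length(b)
  cs <- c(0, cumsum(b))           # cs[k + 1] holds the sum of b[1..k]
  start <- seq_len(n)
  end <- pmin(n, start + w - 1)   # truncate windows at the end of the group
  cs[end + 1] - cs[start]
}

# Apply the helper per group of a and put the results back in row order.
df$sum_b_step_window <- unsplit(
  Map(window_sum, split(df$b, df$a), split(df$step_window, df$a)),
  df$a
)
df
```

Each window sum is the difference of two entries of the cumulative sum, so the cost per group is linear in the number of rows regardless of the window sizes.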

R - Find a sequence of row elements based on time constraints in a dataframe

Consider the following dataframe (ordered by id and time):
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,32,1,2,6,17,24))
df
id event time
1 1 a 1
2 1 b 3
3 1 b 6
4 1 b 12
5 1 a 24
6 1 b 30
7 1 a 42
8 2 a 1
9 2 a 2
10 2 b 6
11 2 a 17
12 2 a 24
I want to count how many times a given sequence of events appears in each "id" group. Consider the following sequence with time constraints:
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
It means that event "a" can start at any time, event "b" must start no earlier than 2 and no later than 8 after event "a", another event "a" must start no earlier than 12 and no later than 18 after event "b".
Some rules for creating sequences:
Events don't need to be consecutive with respect to "time" column. For example, seq can be constructed from rows 1, 3, and 5.
To be counted, sequences must have a different first event. For example, if seq = rows 8, 10, and 11 was counted, then seq = rows 8, 10, and 12 must not be counted.
The events may be included in many constructed sequences if they do not violate the second rule. For example, we count both sequences: rows 1, 3, 5 and rows 5, 6, 7.
The expected result:
df1
id count
1 1 2
2 2 2
There are some related questions in R - Identify a sequence of row elements by groups in a dataframe and Finding rows in R dataframe where a column value follows a sequence.
Is there a way to solve the problem using "dplyr"?
I believe this is what you're looking for. It gives you the desired output. Note that there is a typo in your original question where you have a 32 instead of a 42 when you define the time column in df. I say this is a typo because it doesn't match your output immediately below the definition of df. I changed the 32 to a 42 in the code below.
library(dplyr)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
df %>%
full_join(df,by='id',suffix=c('1','2')) %>%
full_join(df,by='id') %>%
rename(event3 = event, time3 = time) %>%
filter(event1 == seq[1] & event2 == seq[2] & event3 == seq[3]) %>%
filter(time1 %>% between(time_LB[1],time_UB[1])) %>%
filter((time2-time1) %>% between(time_LB[2],time_UB[2])) %>%
filter((time3-time2) %>% between(time_LB[3],time_UB[3])) %>%
group_by(id,time1) %>%
slice(1) %>% # slice 1 row for each unique id and time1 (so no duplicate time1s)
group_by(id) %>%
count()
Here's the output:
# A tibble: 2 x 2
id n
<dbl> <int>
1 1 2
2 2 2
Also, if you omit the last 2 parts of the dplyr pipe that do the counting (to see the sequences it is matching), you get the following sequences:
Source: local data frame [4 x 7]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3
<dbl> <fctr> <dbl> <fctr> <dbl> <fctr> <dbl>
1 1 a 1 b 6 a 24
2 1 a 24 b 30 a 42
3 2 a 1 b 6 a 24
4 2 a 2 b 6 a 24
EDIT IN RESPONSE TO COMMENT REGARDING GENERALIZING THIS: Yes it is possible to generalize this to arbitrary length sequences but requires some R voodoo. Most notably, note the use of Reduce, which allows you to apply a common function on a list of objects as well as foreach, which I'm borrowing from the foreach package to do some arbitrary looping. Here's the code:
library(dplyr)
library(foreach)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
multi_full_join = function(df1,df2) {full_join(df1,df2,by='id')}
df_list = foreach(i=1:length(seq)) %do% {df}
df2 = Reduce(multi_full_join,df_list)
names(df2)[grep('event',names(df2))] = paste0('event',seq_along(seq))
names(df2)[grep('time',names(df2))] = paste0('time',seq_along(seq))
df2 = df2 %>% mutate_if(is.factor,as.character)
df2 = df2 %>%
mutate(seq_string = Reduce(paste0,df2 %>% select(grep('event',names(df2))) %>% as.list)) %>%
filter(seq_string == paste0(seq,collapse=''))
time_diff = df2 %>% select(grep('time',names(df2))) %>%
t %>%
as.data.frame() %>%
lapply(diff) %>%
unlist %>% matrix(ncol=2,byrow=TRUE) %>%
as.data.frame
foreach(i=seq_along(time_diff),.combine=data.frame) %do%
{
time_diff[[i]] %>% between(time_LB[i+1],time_UB[i+1])
} %>%
Reduce(`&`,.) %>%
which %>%
slice(df2,.) %>%
filter(time1 %>% between(time_LB[1],time_UB[1])) %>% # deal with time1 bounds, which we skipped over earlier
group_by(id,time1) %>%
slice(1) # slice 1 row for each unique id and time1 (so no duplicate time1s)
This outputs the following:
Source: local data frame [4 x 8]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3 seq_string
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
1 1 a 1 b 6 a 24 aba
2 1 a 24 b 30 a 42 aba
3 2 a 1 b 6 a 24 aba
4 2 a 2 b 6 a 24 aba
If you want just the counts, you can group_by(id) then count() as in the original code snippet.
Perhaps it's easier to represent event sequences as strings and use regex:
df.str = lapply(split(df, df$id), function(d) {
z = rep('-', tail(d,1)$time); z[d$time] = as.character(d$event); z })
df.str = lapply(df.str, paste, collapse='')
# > df.str
# $`1`
# [1] "a-b--b-----b-----------a-----b-----------a"
#
# $`2`
# [1] "aa---b----------a------a"
# gregexpr() returns -1 when nothing matches, so count the positive match positions
df1 = lapply(df.str, function(s) sum(gregexpr('(?=a.{1,7}b.{11,17}a)', s, perl=TRUE)[[1]] > 0))
data.frame(id=names(df1), count=unlist(df1))
# id count
# 1 1 2
# 2 2 2
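Another way to handle sequences of arbitrary length without joins is a small recursive search in base R. The sketch below assumes the corrected data (time 42 in row 7) and the same reading of the bounds as the dplyr answer: the first bound applies to the absolute time of the first event, and every later bound applies to the gap from the previous event. count_sequences, can_finish and pattern are hypothetical names introduced for illustration.

```r
df <- data.frame(
  id = c(rep(1, 7), rep(2, 5)),
  event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
  time = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24)
)
pattern <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

# Count the first events from which the whole pattern can be completed.
count_sequences <- function(event, time, pattern, lb, ub) {
  # Can steps k, k + 1, ... be matched after an event at time t_prev?
  can_finish <- function(k, t_prev) {
    if (k > length(pattern)) return(TRUE)
    gap <- time - t_prev
    candidates <- which(event == pattern[k] & gap >= lb[k] & gap <= ub[k])
    for (i in candidates) {
      if (can_finish(k + 1, time[i])) return(TRUE)
    }
    FALSE
  }
  starts <- which(event == pattern[1] & time >= lb[1] & time <= ub[1])
  sum(vapply(starts, function(i) can_finish(2, time[i]), logical(1)))
}

counts <- sapply(split(df, df$id), function(d) {
  count_sequences(as.character(d$event), d$time, pattern, time_LB, time_UB)
})
data.frame(id = names(counts), count = unname(counts))
```

Because each first event is counted at most once, the rule that sequences must differ in their first event is satisfied by construction.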

Making a new column by subtracting values based on a key in R?

I have a data table like this
ID DAYS FREQUENCY
"ads" 20 3
"jwa" 45 2
"mno" 4 1
"ads" 13 3
"jwa" 60 2
"ads" 18 3
I want to add a column that, within each ID, subtracts the closest smaller DAYS value from the current one.
My new table would look like this:
ID DAYS FREQUENCY DAYS DIFF
"ads" 20 3 2 (because 20-18)
"jwa" 45 2 NA (because no value smaller than 45 for that id)
"mno" 4 1 NA
"ads" 13 3 NA
"jwa" 60 2 15
"ads" 18 3 5
Bonus: Is there a way to use the merge function?
Here's an answer using dplyr:
require(dplyr)
mydata %>%
mutate(row.order = row_number()) %>% # row numbers added to preserve original row order
group_by(ID) %>%
arrange(DAYS) %>%
mutate(lag = lag(DAYS)) %>%
mutate(days.diff = DAYS - lag) %>%
ungroup() %>%
arrange(row.order) %>%
select(ID, DAYS, FREQUENCY, days.diff)
Output:
ID DAYS FREQUENCY days.diff
<fctr> <int> <int> <int>
1 ads 20 3 2
2 jwa 45 2 NA
3 mno 4 1 NA
4 ads 13 3 NA
5 jwa 60 2 15
6 ads 18 3 5
You can do this using dplyr and a quick loop:
library(dplyr)
# Rowwise data.frame creation because I'm too lazy not to copy-paste the example data
df <- tibble::tribble(  # tribble() replaces the deprecated frame_data()
~ID, ~DAYS, ~FREQUENCY,
"ads", 20, 3,
"jwa", 45, 2,
"mno", 4, 1,
"ads", 13, 3,
"jwa", 60, 2,
"ads", 18, 3
)
# Subtract each number in a numeric vector with the one following it
rolling_subtraction <- function(x) {
out <- vector('numeric', length(x))
for (i in seq_along(out)) {
out[[i]] <- x[i] - x[i + 1] # x[i + 1] is NA if the index is out of bounds
}
out
}
# Arrange data.frame in order of ID / Days and apply rolling subtraction
df %>%
arrange(ID, desc(DAYS)) %>%
group_by(ID) %>%
mutate(days_diff = rolling_subtraction(DAYS))
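On the bonus question about merge: one possible approach is to merge the table with itself on ID, keep only the pairs where the other DAYS value is smaller, and take the smallest difference per original row. This is a sketch in base R, not a method taken from the answers above; the helper columns row and OTHER_DAYS and the objects other, pairs and closest are illustrative names.

```r
mydata <- data.frame(
  ID = c("ads", "jwa", "mno", "ads", "jwa", "ads"),
  DAYS = c(20, 45, 4, 13, 60, 18),
  FREQUENCY = c(3, 2, 1, 3, 2, 3)
)

mydata$row <- seq_len(nrow(mydata))   # remember the original row order

# Self-merge on ID: every row is paired with every DAYS value of the same ID.
other <- data.frame(ID = mydata$ID, OTHER_DAYS = mydata$DAYS)
pairs <- merge(mydata, other, by = "ID")

# Keep the pairs with a smaller partner and take the closest one per row.
pairs <- pairs[pairs$OTHER_DAYS < pairs$DAYS, ]
pairs$diff <- pairs$DAYS - pairs$OTHER_DAYS
closest <- aggregate(diff ~ row, data = pairs, FUN = min)

# Rows without a smaller partner get NA from match().
mydata$DAYS_DIFF <- closest$diff[match(mydata$row, closest$row)]
mydata$row <- NULL
mydata
```

The self-merge grows quadratically with the number of rows per ID, so for large groups the sort-and-lag approach is likely the cheaper option.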

Resources