How can I divide one variable into two variables in R?

I have a variable x which can take five values (0, 1, 2, 3, 4). I want to divide the variable into two variables: variable 1 is supposed to contain the value 0, and variable 2 is supposed to contain the values 1, 2, 3 and 4.
I'm sure this is easy but I can't find out what I need to do.
What my data looks like:
|variable x|
|-----------|
|0|
|1|
|0|
|4|
|3|
|0|
|0|
|2|
So I get the table:
  0   1   2   3   4
125  34  14  15  15
But I want my data to look like this:
variable 1: 125
variable 2: 78
So variable 1 is supposed to contain how often 0 occurs in my data, and variable 2 is supposed to contain the sum of how often 1, 2, 3 and 4 occur in my data.

You can convert the variable to logical by testing whether x == 0:
x <- c(0, 1, 0, 4, 3, 0, 0, 2)
table(x)
#> x
#> 0 1 2 3 4
#> 4 1 1 1 1
table(x == 0)
#> FALSE  TRUE 
#>     4     4
If you want the exact headings, you can do:
setNames(table(x != 0), c(0, paste(unique(sort(x[x != 0])), collapse = ",")))
#>       0 1,2,3,4 
#>       4       4
And if you want to relabel the values (wrap the result in factor() if you need an actual factor), you could do:
c("zero", "not zero")[1 + (x != 0)]
#> [1] "zero"     "not zero" "zero"     "not zero" "not zero" "zero"    
#> [7] "zero"     "not zero"
Created on 2022-04-02 by the reprex package (v2.0.1)

base R
You can use cbind:
x = sample(0:5, 200, replace = T)
table(x)
# x
#  0  1  2  3  4  5 
# 29 38 41 35 27 30
cbind(`0` = table(x)[1], `1,2,3,4` = sum(table(x)[2:5]))
#    0 1,2,3,4
# 0 29     141
tidyverse
library(tidyverse)
ta = as.data.frame(t(as.data.frame.array(table(x))))
ta %>%
  mutate(!!paste(names(.[-1]), collapse = ",") := sum(c_across(`1`:`5`)), .keep = "unused")
#    0 1,2,3,4,5
# 1 29       171

Beginning with the vector, we can get the frequencies from table, then put them into a dataframe. Then, we can create a new column with the names collapsed (i.e., 1,2,3,4) and get the row sum for all columns except the first one.
library(tidyverse)
tab <- data.frame(value = c(0, 1, 2, 3, 4),
                  freq = c(125, 34, 14, 15, 15))
x <- rep(tab$value, tab$freq)
output <- data.frame(rbind(table(x))) %>%
  rename_with(~str_remove(., 'X')) %>%
  mutate(!!paste0(names(.)[-1], collapse = ",") := rowSums(select(., -1))) %>%
  select(1, last_col())
Output
    0 1,2,3,4
1 125      78
Then, to create the 2 variables in 2 dataframes, you can split the columns into a list, change the names, then put into the global environment.
list2env(setNames(
  split.default(output, seq_along(output)),
  c("variable 1", "variable 2")
), envir = .GlobalEnv)
Or you could just subset:
variable1 <- data.frame(`variable 1` = output$`0`, check.names = FALSE)
variable2 <- data.frame(`variable 2` = output$`1,2,3,4`, check.names = FALSE)

Update (first answer deleted):
df[paste(names(df[2:5]), collapse = ",")] <- rowSums(df[2:5])
df[, c(1,6)]
# A tibble: 1 × 2
    `0` `1,2,3,4`
  <dbl>     <dbl>
1   125        78
data:
df <- structure(list(`0` = 125, `1` = 34, `2` = 14, `3` = 15, `4` = 15), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
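If all you need are the two counts rather than a relabeled table, a minimal base R sketch (my own addition, using the example vector from the first answer):
x <- c(0, 1, 0, 4, 3, 0, 0, 2)
variable1 <- sum(x == 0)  # how often 0 occurs
variable2 <- sum(x != 0)  # how often 1, 2, 3 or 4 occurs
variable1
#> [1] 4
variable2
#> [1] 4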

Related

Get column names into a new variable based on conditions

I have a data frame like this and I am working in R. My problem can be divided into two steps.
SUBID   ABC  BCD  DEF
192838    4   -3    2
193928   -6   -2    6
205829    4   -5    9
201837    3    4    4
I want to make a new variable that contains a list of the column names that have a negative value for each SUBID. The output should look something like this:
SUBID   ABC  BCD  DEF  output
192838    4   -3    2  "BCD"
193928   -6   -2    6  "ABC","BCD"
205829    4   -5    9  "BCD"
201837    3    4    4  " "
And then, in the second step, I would like to collapse the SUBID into a more general ID and get the number of unique strings from the output variable for each ID (I just need the number; the specific strings in parentheses are just for illustration).
SUBID  output
19     2 ("ABC","BCD")
20     1 ("BCD")
Those are the two steps that I think should be done, but maybe there is a way to skip the first step and go directly to the second that I don't know about.
I would appreciate any help since right now I am not sure where to start on this. Thank you!
Another way:
library(dplyr)
library(tidyr)
df <- df %>% pivot_longer(-SUBID)
df1 <- df %>%
  group_by(SUBID) %>%
  summarise(output = paste(name[value < 0L], collapse = ','))
df2 <- df %>%
  group_by(SUBID = substr(SUBID, 1, 2)) %>%
  summarise(output_count = n_distinct(name[value < 0L]),
            output = paste0(output_count, ' (', paste(name[value < 0L], collapse = ','), ')'))
Outputs (two columns are created in the second case, one with just the count and another following your example):
df1
# A tibble: 4 x 2
   SUBID output
   <int> <chr>
1 192838 "BCD"
2 193928 "ABC,BCD"
3 201837 ""
4 205829 "BCD"
df2
# A tibble: 2 x 3
  SUBID output_count output
  <chr>        <int> <chr>
1 19               2 2 (BCD,ABC,BCD)
2 20               1 1 (BCD)
This answers the first part of your question; the second part I didn't understand.
df$output <-apply(df[,-1], 1, function(x) paste(names(df)[-1][x<0], collapse = ","))
df
   SUBID ABC BCD DEF  output
1 192838   4   3  -2     DEF
2 193928  -6  -2   6 ABC,BCD
3 205829   4  -5   9     BCD
4 201837   3   4   4
For the second part, try this:
id <- sapply(strsplit(sub("\\W+", "", df$output), split = ""), function(x){
  sum(!(duplicated(x) | duplicated(x, fromLast = TRUE)))
})
data.frame(SUBID = substr(df$SUBID, 1, 2), output = id, string = df$output)
  SUBID output  string
1    19      3     DEF
2    19      2 ABC,BCD
3    20      3     BCD
4    20      0
I added the variable string so you can make sure your count of unique values is OK.
One option is to take advantage of dplyr::cur_data() to access the names() of the data and subset based on your criteria. Then you can use tibble list-columns to hold a set of column names of arbitrary length, and finally calculate the number of unique values in that list.
library(tidyverse)
d <- structure(list(SUBID = c(192838, 193928, 205829, 201837), ABC = c(4, -6, 4, 3), BCD = c(-3, -2, -5, 4), DEF = c(2, 6, 9, 4)), row.names = c(NA, -4L), class = "data.frame")
d %>%
  rowwise() %>%
  mutate(neg_col_names = list(names(cur_data())[cur_data() < 0])) %>%
  group_by(ID_grp = str_sub(SUBID, 1, 2)) %>%
  summarize(neg_col_count = n_distinct(unlist(c(neg_col_names))))
#> # A tibble: 2 × 2
#> ID_grp neg_col_count
#> <chr> <int>
#> 1 19 2
#> 2 20 1
Created on 2022-11-22 with reprex v2.0.2
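As an aside, cur_data() was deprecated in dplyr 1.1.0. A hedged sketch of the same idea using pick() instead (my own adaptation, assuming dplyr >= 1.1.0):
d %>%
  rowwise() %>%
  mutate(neg_col_names = list(names(pick(everything()))[pick(everything()) < 0])) %>%
  group_by(ID_grp = str_sub(SUBID, 1, 2)) %>%
  summarize(neg_col_count = n_distinct(unlist(neg_col_names)))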

R: Count number of times B follows A using dplyr

I have a data.frame of monthly averages of radon measured over a few months. I have labeled each value either "below" or "above" a threshold and would like to count the number of times the average value transitions: "below to above", "above to below", "above to above" or "below to below".
df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))
A bit of digging into a Matlab answer on here suggests that we could use the Matrix package:
require(Matrix)
sparseMatrix(i=c(2,2,2,1), j=c(2,2,2))
This produces the following result, which I can't yet interpret.
[1,] | |
[2,] | .
Any thoughts about a tidyverse method?
Sure, just use group_by and count the values:
library(dplyr)
df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))
df %>%
  group_by(grp = paste(level, lead(level))) %>%
  summarise(n = n()) %>%
  # drop the observation that does not have a "next" value
  filter(!grepl(pattern = "NA", x = grp))
#> # A tibble: 3 × 2
#> grp n
#> <chr> <int>
#> 1 above above 2
#> 2 above below 1
#> 3 below above 1
You could use table from base R:
table(df$level[-1], df$level[-nrow(df)])
        above below
  above     2     1
  below     1     0
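If you prefer those counts in long format with labeled dimensions, a small sketch of my own using the same df:
as.data.frame(table(previous = df$level[-nrow(df)], current = df$level[-1]))
#   previous current Freq
# 1    above   above    2
# 2    below   above    1
# 3    above   below    1
# 4    below   below    0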
EDIT in response to #HCAI's comment: applying table to multiple columns:
First, generate some data:
set.seed(1)
U = matrix(runif(4*20), nrow = 20)
dfU = data.frame(round(U))
library(plyr) # for mapvalues
df2 = data.frame(apply(dfU,
                       FUN = function(x) mapvalues(x, from = 0:1, to = c('below', 'above')),
                       MARGIN = 2))
so that df2 contains random 'above' and 'below':
      X1    X2    X3    X4
1  below above above above
2  below below above below
3  above above above below
4  above below above below
5  below below above above
6  above below above below
7  above below below below
8  above below below above
9  above above above below
10 below below above above
11 below below below below
12 below above above above
13 above below below below
14 below below below below
15 above above below below
16 below above below above
17 above above below above
18 above below above below
19 below above above above
20 above below below above
Now apply table to each column and vectorize the output:
apply(df2,
      FUN = function(x) as.vector(table(x[-1],
                                        x[-nrow(df2)])),
      MARGIN = 2)
which gives us
     X1 X2 X3 X4
[1,]  5  2  7  2
[2,]  5  6  4  6
[3,]  6  5  3  6
[4,]  3  6  5  5
All that's left is a bit of care in labeling the rows of the output. Maybe someone can come up with a clever way to merge/join the data frames resulting from apply(df2, FUN=function(x) melt(table(x[-1],x[-nrow(df2)])),2), which would maintain the row names. (I spent some time looking into it but couldn't work out how to do it easily.)
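A hedged sketch of that merge idea (my own, not verified against the original attempt): build each column's transition table with labeled dimensions and merge on those labels, so the transition names survive as columns. as.data.frame(table(...)) plays the role of melt(table(...)) here:
# assumes df2 from above
tabs <- Map(function(x, nm) {
  tt <- as.data.frame(table(from = x[-length(x)], to = x[-1]))
  names(tt)[3] <- nm  # label the count column with the source column's name
  tt
}, df2, names(df2))
Reduce(function(a, b) merge(a, b, by = c("from", "to")), tabs)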
Not run, so there may be a typo, but you get the idea. I'll leave it to you to deal with NA and the first observation. Single pass through the vector.
library(dplyr)
df %>%
  summarize(increase = sum(case_when(value > lag(value) ~ 1, TRUE ~ 0)),
            decrease = sum(case_when(value < lag(value) ~ 1, TRUE ~ 0)),
            constant = sum(case_when(value == lag(value) ~ 1, TRUE ~ 0)))
A slightly different version:
library(dplyr)
library(stringr)
df %>%
  group_by(level = str_c(level, lead(level), sep = " ")) %>%
  count(level) %>%
  na.omit()
  level           n
  <chr>       <int>
1 above above     2
2 above below     1
3 below above     1
Another possible solution, based on tidyverse:
library(tidyverse)
df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))
df %>%
  mutate(changes = str_c(lag(level), level, sep = "_")) %>%
  count(changes) %>%
  drop_na(changes)
#>       changes n
#> 1 above_above 2
#> 2 above_below 1
#> 3 below_above 1
Yet another solution, based on data.table:
library(data.table)
dt <- data.table(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))
dt[, changes := paste(shift(level), level, sep = "_")
   ][2:.N][, .(n = .N), keyby = .(changes)]
#>        changes n
#> 1: above_above 2
#> 2: above_below 1
#> 3: below_above 1

Convert NA to 0 in columns selected by name using dplyr mutate across [duplicate]

This question already has answers here:
How to replace NA values in a table for selected columns
(12 answers)
Closed 1 year ago.
This is an example of my data
df = data.frame(id = rep(1:3, each = 1),
                test = sample(40:100, 3),
                Sets = c(NA, 4, 4),
                CheWt = c(NA, 4, NA),
                LatWt = c(NA, 5, 5))
I'd like to turn all the NA to 0 in columns which have "Wt" in the header. I am trying to use dplyr's mutate with across:
df = df %>%
  mutate(across(contains("Wt"), replace(is.na(), 0)))
This is the error
Error: Problem with `mutate()` input `..1`.
x argument "values" is missing, with no default
i Input `..1` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
replace needs 3 arguments: the vector, the index of values to replace, and the replacement values. And you need to use ~ for purrr-style anonymous functions:
df = df %>%
  mutate(across(contains("Wt"), ~replace(., is.na(.), 0)))
df
#   id test Sets CheWt LatWt
# 1  1   93   NA     0     0
# 2  2   44    4     4     5
# 3  3   80    4     0     5
Or you can use replace_na for a somewhat simpler interface:
df = df %>%
  mutate(across(contains("Wt"), replace_na, 0))
df
#   id test Sets CheWt LatWt
# 1  1   73   NA     0     0
# 2  2   43    4     4     5
# 3  3   54    4     0     5
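Note that passing extra arguments through across() this way is deprecated in dplyr 1.1.0 and later; the lambda form below should be future-proof (replace_na() itself comes from tidyr):
df = df %>%
  mutate(across(contains("Wt"), ~replace_na(.x, 0)))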
Here is a solution using ifelse and is.na.
library(dplyr)
df = data.frame(id = rep(1:3, each = 1),
                test = sample(40:100, 3),
                Sets = c(NA, 4, 4),
                CheWt = c(NA, 4, NA),
                LatWt = c(NA, 5, 5))
df = mutate(df, across(contains("Wt"), ~ifelse(is.na(.x), 0, .x)))
df
#>   id test Sets CheWt LatWt
#> 1  1   97   NA     0     0
#> 2  2   79    4     4     5
#> 3  3   75    4     0     5
Created on 2021-03-08 by the reprex package (v0.3.0)

Add a new column by mutate involving a conditional

I need to add a new column in dplyr by mutate involving a conditional. I can't find a way to implement the following scheme in the tidyverse, but I can do it in Excel. That makes me feel like something of a barbarian. Does someone know how to accomplish this in the tidyverse?
The first value of the running.count column is 1, no matter what is in the "n" column.
After the first row, here is the conditional. If the n column is 1, the running.count output is the running.count value from the row above +1. If the n column is 0, the running.count output is the running.count value from the row above +1 only when it is the first 0 after a 1 in the "n" column. Otherwise, it is just the running.count value from the row above.
Here's some toy data with the desired output:
data.frame("n"=c(0,1,0,0,0,0,1,0,1,1),"running.count"=c(1,2,3,3,3,3,4,5,6,7))
We can use rleid from data.table to create the running.count column
library(dplyr)
library(data.table)
df1 %>%
  group_by(running.count = rleid(n)) %>%
  mutate(ind = if(all(n == 1)) row_number() - 1 else 0) %>%
  ungroup %>%
  mutate(running.count = rleid(running.count, ind)) %>%
  select(-ind)
# A tibble: 10 x 2
#        n running.count
#    <dbl>         <int>
#  1     0             1
#  2     1             2
#  3     0             3
#  4     0             3
#  5     0             3
#  6     0             3
#  7     1             4
#  8     0             5
#  9     1             6
# 10     1             7
data
df1 <- structure(list(n = c(0, 1, 0, 0, 0, 0, 1, 0, 1, 1)),
                 class = "data.frame", row.names = c(NA, -10L))

From dataframe with values per min max to value per key

I have a dataframe with values defined per bucket (see df1 below).
Now I have another dataframe with keys that fall within those buckets, for which I want to look up a value from the bucketed dataframe (see df2 below).
I would like to get the result df3 below.
df1 <- data.frame(MIN = c(1,4,8), MAX = c(3, 6, 10), VALUE = c(3, 56, 8))
df2 <- data.frame(KEY = c(2,5,9))
df3 <- data.frame(KEY = c(2,5,9), VALUE = c(3, 56, 8))
> df1
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     8
> df2
  KEY
1   2
2   5
3   9
> df3
  KEY VALUE
1   2     3
2   5    56
3   9     8
EDIT:
Extended the example.
> df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
> df2 <- data.frame(KEY = c(2,5,9,18,3))
> df3 <- data.frame(KEY = c(2,5,9,18,3), VALUE = c(3, 56, 3, 5, 3))
> df1
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     3
4  14  18     5
> df2
  KEY
1   2
2   5
3   9
4  18
5   3
> df3
  KEY VALUE
1   2     3
2   5    56
3   9     3
4  18     5
5   3     3
This solution assumes that KEY, MIN and MAX are integers, so we can create a sequence of keys and then join.
df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2,5,9,18,3))
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
  group_by(VALUE, id = row_number()) %>%           # for each value and row id
  nest() %>%                                       # nest rest of columns
  mutate(KEY = map(data, ~seq(.$MIN, .$MAX))) %>%  # create a sequence of keys
  unnest(KEY) %>%                                  # unnest those keys
  right_join(df2, by = "KEY") %>%                  # join the other dataset
  select(KEY, VALUE)
# # A tibble: 5 x 2
#     KEY VALUE
#   <dbl> <dbl>
# 1  2.00  3.00
# 2  5.00 56.0
# 3  9.00  3.00
# 4 18.0   5.00
# 5  3.00  3.00
Or, group just by the row number and add VALUE in the map:
df1 %>%
  group_by(id = row_number()) %>%
  nest() %>%
  mutate(K = map(data, ~data.frame(VALUE = .$VALUE,
                                   KEY = seq(.$MIN, .$MAX)))) %>%
  unnest(K) %>%
  right_join(df2, by = "KEY") %>%
  select(KEY, VALUE)
A very good and well-thought-out solution from #AntioniosK.
Here's a base R solution implemented as a general lookup function given as arguments a key dataframe and a bucket dataframe defined as listed in the question. The lookup values need not be unique or contiguous in this example, taking account of #Michael's comment that values may occur in more than one row (though normally such lookups would use unique ranges).
lookup = function(keydf, bucketdf){
  keydf$rowid = 1:nrow(keydf)
  T = merge(bucketdf, keydf)
  T = T[T$KEY >= T$MIN & T$KEY <= T$MAX, ]
  T = merge(T, keydf, all.y = TRUE)
  T[order(T$rowid), c("rowid", "KEY", "VALUE")]
}
The first merge uses a Cartesian join of all rows in the key to all rows in the bucket list. Such joins can be inefficient if the number of rows in the real tables is large, as the result of joining x rows in the key to y rows in the bucket would be xy rows; I doubt this would be a problem in this case unless x or y run into thousands of rows.
The second merge is done to recover any key values which are not matched to rows in the bucket list.
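To make the two merges concrete, here is a small illustration of my own using the extended df1/df2 from the question (merge() on data frames with no common column names is a Cartesian join):
cross <- merge(df1, df2)  # every bucket row paired with every key row: nrow(df1) * nrow(df2) rows
hit <- cross[cross$KEY >= cross$MIN & cross$KEY <= cross$MAX, ]  # keep keys that fall inside a bucket
hit[, c("KEY", "VALUE")]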
Using the example data as listed in #AntioniosK's post:
> lookup(df2, df1)
  rowid KEY VALUE
2     1   2     3
4     2   5    56
5     3   9     3
1     4  18     5
3     5   3     3
Using key and bucket exemplars that test edge cases (where the key = the min or the max), where a key value is not in the bucket list (the value 50 in df2A), and where there is a non-unique range (row 6 of df4 below):
df4 <- data.frame(MIN = c(1,4,8, 20, 30, 22), MAX = c(3, 6, 10, 25, 40, 24), VALUE = c(3, 56, 8, 10, 12, 23))
df2A <- data.frame(KEY = c(3, 6, 22, 30, 50))
> df4
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     8
4  20  25    10
5  30  40    12
6  22  24    23
> df2A
  KEY
1   3
2   6
3  22
4  30
5  50
> lookup(df2A, df4)
  rowid KEY VALUE
1     1   3     3
2     2   6    56
3     3  22    10
4     3  22    23
5     4  30    12
6     5  50    NA
As shown above, the lookup in this case returns two values for the non-unique ranges matching the key value 22, and NA for values in the key but not in the bucket list.
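For larger tables, a non-equi join avoids materializing the Cartesian product entirely. A hedged data.table sketch of my own, using df4 and df2A from above:
library(data.table)
dt_bucket <- as.data.table(df4)
dt_key <- as.data.table(df2A)
# join each KEY to every bucket whose [MIN, MAX] range contains it;
# unmatched keys get NA, overlapping ranges yield one row per match
dt_bucket[dt_key, on = .(MIN <= KEY, MAX >= KEY), .(KEY = i.KEY, VALUE = x.VALUE)]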
