R: ID column corrupting data frame

I am working with the R programming language. Suppose I have the following data frame:
a = rnorm(100,10,1)
b = rnorm(100,10,5)
c = rnorm(100,10,10)
my_data_2 = data.frame(a,b,c)
my_data_2$group = as.factor(C)
My Question: Suppose I want to add an ID column to this data frame that gives the first observation the ID "101" and increases the ID by 1 for each new row. I tried to do this as follows:
my_data_2$id = seq(101, 200, by = 1)
However, this "corrupted" the data frame:
head(my_data_2)
a b c
1 10.381397 9.534634 12.8330946
2 10.326785 6.397006 8.1217063
3 8.333354 11.474064 11.6035562
4 9.583789 12.096404 18.2764387
5 9.581740 12.302016 4.0601871
6 11.772943 9.151642 -0.3686874
group
1 c(9.98552413605153, 9.53807731118048, 6.92589246998173, 8.97095368638206, 9.70249918748529, 10.6161773148626, 9.2514231659343, 10.6566757899233, 10.2351848084123, 9.45970725813352, 9.15347719257448, 9.30428244749624, 8.43075784609759, 11.1200169905262, 11.3493313166827, 8.86895968334901, 9.13208319045466, 9.70062759133717)
2 c(8.90358954387628, 13.8756093430144, 12.9970566311467, 10.4227745183785, 21.3259516051226, 4.88590162247496, 10.260282181, 14.092109840631, 7.37839577680487, 9.09764173775965, 15.1636139760987, 9.9773055885761, 8.29361737323061, 8.61361852648607, 12.6807897406641, 0.00863359720839085, 10.7660528147358, 9.79616528370632)
3 c(25.8063583646201, -11.5722310383483, 8.56096791164312, 12.2858029391835, -0.312392781809937, 0.946343715084028, 2.45881422753051, 7.26197515743391, 0.333766891336273, 14.9149659649045, -4.55483090530928, -19.8075232688082, 16.59106194569, 18.7377329188129, 1.1771203751127, -6.19019973790205, -5.02277721344565, 23.3363430334739)
4 c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
5 c("B", "B", "B", "A", "B", "B", "B", "B", "B", "B", "B", "A", "B", "B", "B", "B", "B", "B")
6 c("B", "B", "B", "B", "B", "A", "B", "B", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B")
id
1 101
2 102
3 103
4 104
5 105
6 106
Can someone please show me how to fix this problem?
Thanks!

Your problem isn't your ID column; your problem is where you define your group variable. You call as.factor(C) (note the uppercase C), but the column of your data frame is a lowercase c. So I guess you have defined another object C outside of your data frame, and that object is what now "corrupts" your data frame.
You probably want to do:
my_data_2$group <- as.factor(my_data_2$c)
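One way to check that the fix worked is to look at the type of every column after building the data frame. The following is only a minimal sketch and not part of the original answer: the fixed seed, the `col_ok` helper name, and the use of `seq_len(nrow(...))` for the ID are assumptions made for illustration.

```r
set.seed(1)  # assumption: fixed seed only so the example is reproducible

my_data_2 <- data.frame(
  a = rnorm(100, 10, 1),
  b = rnorm(100, 10, 5),
  c = rnorm(100, 10, 10)
)

# Build the factor from the column inside the data frame,
# not from a global object that happens to be called C
my_data_2$group <- as.factor(my_data_2$c)

# IDs 101, 102, ..., sized from the data instead of hard-coding 200
my_data_2$id <- 100 + seq_len(nrow(my_data_2))

# Every column should be a plain (atomic) vector; a nested or list-like
# column, like the one in the question, would show up as FALSE here
col_ok <- vapply(my_data_2, is.atomic, logical(1))
print(col_ok)
str(my_data_2)
```

If any entry of `col_ok` is FALSE, `str()` usually makes it clear which assignment pulled in the wrong object.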

I was able to figure out the answer!
a = rnorm(100,10,1)
b = rnorm(100,10,5)
c = rnorm(100,10,10)
my_data_2 = data.frame(a,b,c)
my_data_2$group = as.factor("C")
my_data_2$id = seq(101, 200, by = 1)
head(my_data_2)
a b c group id
1 9.436773 10.712568 3.7699748 C 101
2 10.265810 3.408589 11.9230024 C 102
3 10.503245 12.197000 8.3620889 C 103
4 9.279878 7.007812 16.8268852 C 104
5 10.683518 8.039032 5.2287997 C 105
6 11.097258 10.313103 0.4988398 C 106

Related

Expanding a column with row of NA when there is no match in R

I am trying to "clean" a dataset from which many "empty" rows have been deleted. I want these empty rows back, filled with NA. Here is a toy dataset:
values <- rnorm(12)
data <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5),
event = c("A", "B", "C", "A", "B", "A", "B", "C", "B", "A", "B", "C"),
value = values) #values are random
What I want is to insert rows that are missing, i.e. ID 2 is missing group C, and 4 is missing A and C. And the expected result is as follows:
data_expanded <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
event = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"),
value = c(values[1:5], NA, values[6:8], NA, values[9], NA, values[10:12]))
The rows with NA can be added at the end of the data frame (they do not need to be grouped as in the example I provided). My real dataset has many rows, so a memory-efficient method would be highly appreciated. I would prefer a method using tidyr (or the tidyverse more generally).
tidyr::complete() does exactly what you want:
library(tidyr)
values <- rnorm(12)
data <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5),
event = c("A", "B", "C", "A", "B", "A", "B", "C", "B", "A", "B", "C"),
value = values) #values are random
data |>
complete(ID, event)
#> # A tibble: 15 × 3
#> ID event value
#> <dbl> <chr> <dbl>
#> 1 1 A 0.397
#> 2 1 B -0.595
#> 3 1 C 0.743
#> 4 2 A -0.0421
#> 5 2 B 1.47
#> 6 2 C NA
#> 7 3 A 0.218
#> 8 3 B -0.525
#> 9 3 C 1.05
#> 10 4 A NA
#> 11 4 B -1.79
#> 12 4 C NA
#> 13 5 A 1.18
#> 14 5 B -1.39
#> 15 5 C 0.748
Created on 2022-12-12 with reprex v2.0.2
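If tidyr is not available, roughly the same result can be sketched in base R by building the full grid of ID and event combinations and left-joining the data onto it. This is an alternative to the answer above, not a restatement of it; the names `grid` and `data_expanded` and the placeholder values `1:12` (instead of random numbers) are assumptions made so the result is easy to check.

```r
# Toy data from the question, with deterministic placeholder values
data <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5),
  event = c("A", "B", "C", "A", "B", "A", "B", "C", "B", "A", "B", "C"),
  value = 1:12,
  stringsAsFactors = FALSE
)

# All ID x event combinations that should exist
grid <- expand.grid(
  ID    = unique(data$ID),
  event = unique(data$event),
  stringsAsFactors = FALSE
)

# Left join: combinations missing from `data` get NA in `value`
data_expanded <- merge(grid, data, by = c("ID", "event"), all.x = TRUE)
data_expanded <- data_expanded[order(data_expanded$ID, data_expanded$event), ]
print(data_expanded)
```

Note that `expand.grid()` materialises every combination up front, so for very large inputs this may use more memory than one would like.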

Conditional rolling sum based on another column

I would like to compute the conditional rolling sum of a column, but based on the values of another column.
I have a table like this:
data_frame <- data.frame( category1 = c("A", "A", "A", "B", "B", "B", "A", "A", "B"),
category2 = c("B", "B", "B", "A", "A", "A", "B", "B", "A"),
value = c(1, 2, 1, 2, 1, 5, 3, 4, 2),
desired_output = c(0, 0, 0, 4, 4, 4, 8, 8, 11))
data_frame2 <- data_frame %>%
group_by(category1) %>%
mutate(cumsum = cumsum(value))
category1 category2 value cumsum desired_output
A B 1 1 0
A B 2 3 0
A B 1 4 0
B A 2 2 4
B A 1 3 4
B A 5 8 4
A B 3 7 8
A B 4 11 8
B A 2 10 11
I am able to compute the rolling sum of the value based on category1 or category2 using cumsum, but I would like a column which calculates a rolling sum of the value column when category1 equals the current value of category2. For example, in the last row of the above example it sums the value of all the above rows when category1 == A, as the current value of category2 is A.
I have tried various hacky ifelse/lag/fill solutions but nothing gets close to what I need. I have also tried adding a conditional into the ave function, as below, but not sure what the syntax should be...
data_frame2$desired_output <- ave(data_frame2$value, data_frame2$category1 = data_frame2$category2, FUN=cumsum)
Thanks in advance - first question so apologies about anything I missed/got wrong!
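One reading of the description above is: for each row, sum `value` over all earlier rows whose category1 equals the current row's category2. Under that assumption, a minimal base R sketch looks like the following; the name `rolling` is hypothetical, and the loop is quadratic in the number of rows, so it is meant as an illustration rather than an efficient solution.

```r
data_frame <- data.frame(
  category1 = c("A", "A", "A", "B", "B", "B", "A", "A", "B"),
  category2 = c("B", "B", "B", "A", "A", "A", "B", "B", "A"),
  value     = c(1, 2, 1, 2, 1, 5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# For row i, look only at the rows strictly above it and keep those
# whose category1 matches the current row's category2
rolling <- vapply(seq_len(nrow(data_frame)), function(i) {
  above <- seq_len(i - 1)
  keep  <- data_frame$category1[above] == data_frame$category2[i]
  sum(data_frame$value[above][keep])
}, numeric(1))

data_frame$rolling <- rolling
print(data_frame)
```

On the toy data this reproduces the desired_output column from the question (0, 0, 0, 4, 4, 4, 8, 8, 11).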

confusion between categories in dplyr

I have the following data frame, describing conditions each patient has (each can have more than 1):
df <- structure(list(patient = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6,
6, 7, 7, 8, 8, 9, 9, 10), condition = c("A", "A", "B", "B", "D",
"C", "A", "C", "C", "B", "D", "B", "A", "A", "C", "B", "C", "D",
"C", "D")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to create a "confusion matrix", which in this case will be a 4x4 matrix where AxA will have the value 5 (5 patients have condition A), AxB will have the value 2 (two patients have A and B), and so on.
How can I achieve this?
You can join the table to itself on patient and then tabulate the resulting pairs of conditions.
library(dplyr)
# Self-join: one row per pair of conditions held by the same patient
df2 <- inner_join(df, df, by = "patient")
table(df2$condition.x,df2$condition.y)
A B C D
A 5 2 2 1
B 2 5 3 2
C 2 3 6 2
D 1 2 2 4
Here is a base R answer using outer -
count_patient <- function(x, y) {
length(intersect(df$patient[df$condition == x],
df$patient[df$condition == y]))
}
vec <- sort(unique(df$condition))
res <- outer(vec, vec, Vectorize(count_patient))
dimnames(res) <- list(vec, vec)
res
# A B C D
#A 5 2 2 1
#B 2 5 3 2
#C 2 3 6 2
#D 1 2 2 4
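A further base R option, offered here as a sketch and not taken from the answers above, is to cross-tabulate patients against conditions and take the cross-product of that table. It assumes each patient/condition pair appears at most once, as in the example data; the names `tab` and `res` are illustrative.

```r
df <- data.frame(
  patient   = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10),
  condition = c("A", "A", "B", "B", "D", "C", "A", "C", "C", "B",
                "D", "B", "A", "A", "C", "B", "C", "D", "C", "D"),
  stringsAsFactors = FALSE
)

# Patient x condition indicator table (entries are 0 or 1 here)
tab <- table(df$patient, df$condition)

# t(tab) %*% tab: entry [x, y] counts patients having both x and y
res <- crossprod(tab)
print(res)
```

The diagonal holds the number of patients per condition, and the off-diagonal entries hold the co-occurrence counts.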

Matching rows to columns and counting same occurrences R

I have a dataset which is of the following form:-
a <- data.frame(X1=c("A", "B", "C", "A", "B", "C"),
X2=c("B", "C", "C", "A", "A", "B"),
X3=c("B", "E", "A", "A", "A", "B"),
X4=c("E", "C", "A", "A", "A", "C"),
X5=c("A", "C", "C", "A", "B", "B")
)
And I have another set of the following form:-
b <- data.frame(col_1=c("ASD", "ASD", "BSD", "BSD"),
col_2=c(1, 1, 1, 1),
col_3=c(12, 12, 31, 21),
col_4=c("A", "B", "B", "A")
)
What I want to do is take the column col_4 from set b and match it row-wise against set a, so that a new column tells me how many elements of col_4 each row contains. The name of the new column does not matter.
For example, the first and fifth rows in set a contain all the elements of col_4 from set b.
Also, surplus duplicates shouldn't be counted. For example, the sixth row in set a has 3 "B"s, but since col_4 from set b has only two "B"s, the result should be 2 and not 3.
Expected output is of the form:-
c <- data.frame(X1=c("A", "B", "C", "A", "B", "C"),
X2=c("B", "C", "C", "A", "A", "B"),
X3=c("B", "E", "A", "A", "A", "B"),
X4=c("E", "C", "A", "A", "A", "C"),
X5=c("A", "C", "C", "A", "B", "B"),
found=c(4, 1, 2, 2, 4, 2)
)
We can use vecsets::vintersect, which takes care of duplicates.
Using apply row-wise, we can count how many common values there are between b$col_4 and each row in a.
apply(a, 1, function(x) length(vecsets::vintersect(b$col_4, x)))
#[1] 4 1 2 2 4 2
An option using data.table:
library(data.table)
#convert a into a long format
m <- melt(setDT(a)[, rn:=.I], id.vars="rn", value.name="col_4")
#order by row number and create an index for identical occurrences in col_4
setorder(m, rn, col_4)[, vidx := rowid(col_4), rn]
#create a similar index for b
setDT(b, key="col_4")[, vidx := rowid(col_4)]
#count occurrences and lookup this count into original data
a[b[m, on=.(col_4, vidx), nomatch=0L][, .N, rn], on=.(rn), found := N]
output:
X1 X2 X3 X4 X5 rn found
1: A B B E A 1 4
2: B C E C C 2 1
3: C C A A C 3 2
4: A A A A A 4 2
5: B A A A B 5 4
6: C B B C B 6 2
Another idea to operate on sets efficiently is to count and compare the element occurrences of b$col_4 in each row of a:
b1 = c(table(b$col_4))
#b1
#A B
#2 2
a1 = table(factor(as.matrix(a), names(b1)), row(a))
#a1
#
# 1 2 3 4 5 6
# A 2 0 2 5 3 0
# B 2 1 0 0 2 3
Finally, identify the smallest number of occurrences per element (for each row) and sum:
colSums(pmin(a1, b1))
#1 2 3 4 5 6
#4 1 2 2 4 2
For a larger "data.frame" with more distinct elements, Matrix::sparseMatrix offers an appropriate alternative:
library(Matrix)
a.fac = factor(as.matrix(a), names(b1))
.i = as.integer(a.fac)
.j = c(row(a))
noNA = !is.na(.i) ## need to remove NAs manually
.i = .i[noNA]
.j = .j[noNA]
a1 = sparseMatrix(i = .i, j = .j, x = 1L, dimnames = list(names(b1), 1:nrow(a)))
a1
#2 x 6 sparse Matrix of class "dgCMatrix"
# 1 2 3 4 5 6
#A 2 . 2 5 3 .
#B 2 1 . . 2 3
colSums(pmin(a1, b1))
#1 2 3 4 5 6
#4 1 2 2 4 2

Tidy way of arranging data frame rows according to target sorting orders

Back in 2015, I asked a similar question on this, but I would like to find a tidy way of doing this.
This is the best that I could come up with so far. It works, but having to change column types just for sorting seems "wrong" somehow. However, so does resorting to dplyr::*_join(), and match() comes with its own catches (plus it's hard to use in tidy contexts).
So is there a good/recommended way of doing this in the tidyverse?
Define function
library(magrittr)
arrange_by_target <- function(
x,
targets
) {
x %>%
# Transform arrange-by columns to factors so we can leverage the order of
# the levels:
dplyr::mutate_at(
names(targets),
function(.x, .targets = targets) {
.col <- deparse(substitute(.x))
factor(.x, levels = .targets[[.col]])
}
) %>%
# Actual arranging:
dplyr::arrange_at(
names(targets)
) %>%
# Clean up by recasting factor columns to their original type:
dplyr::mutate_at(
.vars = names(targets),
function(.x, .targets = targets) {
.col <- deparse(substitute(.x))
vctrs::vec_cast(.x, to = class(.targets[[.col]]))
}
)
}
Test
x <- tibble::tribble(
~group, ~name, ~value,
"A", "B", 1,
"A", "C", 2,
"A", "A", 3,
"B", "B", 4,
"B", "A", 5
)
x %>%
arrange_by_target(
targets = list(
group = c("B", "A"),
name = c("A", "B", "C")
)
)
#> # A tibble: 5 x 3
#> group name value
#> <chr> <chr> <dbl>
#> 1 B A 5
#> 2 B B 4
#> 3 A A 3
#> 4 A B 1
#> 5 A C 2
x %>%
arrange_by_target(
targets = list(
group = c("B", "A"),
name = c("A", "B", "C") %>% rev()
)
)
#> # A tibble: 5 x 3
#> group name value
#> <chr> <chr> <dbl>
#> 1 B B 4
#> 2 B A 5
#> 3 A C 2
#> 4 A B 1
#> 5 A A 3
Created on 2019-11-06 by the reprex package (v0.3.0)
The easiest way to accomplish this is to convert your character columns to factors, like so:
library(dplyr)
x %>%
mutate(
group = factor(group, c("A", "B")),
name = factor(name, c("C", "B", "A"))
) %>%
arrange(group, name)
Another option that I frequently use is to utilize joins. For example:
x <- tibble::tribble(
~group, ~name, ~value,
"A", "B", 1,
"A", "C", 2,
"A", "A", 3,
"B", "B", 4,
"B", "A", 5,
"A", "A", 6,
"B", "C", 7,
"A", "B", 8,
"B", "B", 9
)
custom_sort <- tibble::tribble(
~group, ~name,
"A", "C",
"A", "B",
"A", "A",
"B", "B",
"B", "A"
)
x %>% dplyr::right_join(custom_sort, by = c("group", "name"))
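For comparison, the same target ordering can be sketched in base R with order() and match(), which avoids both the factor conversion and the join. This is not one of the tidyverse answers above; it uses the five-row example from the question as a plain data.frame, and the names `targets`, `keys`, and `sorted` are illustrative. Values that are absent from a target vector get NA from match() and are placed last by order().

```r
x <- data.frame(
  group = c("A", "A", "A", "B", "B"),
  name  = c("B", "C", "A", "B", "A"),
  value = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

targets <- list(
  group = c("B", "A"),
  name  = c("A", "B", "C")
)

# Position of each value in its target vector, one key per target column
keys <- lapply(names(targets), function(col) match(x[[col]], targets[[col]]))

# Sort by the first key, breaking ties with the following keys
sorted <- x[do.call(order, keys), ]
print(sorted)
```

On this input the rows come out in the same order as the first reprex in the question, with values 5, 4, 3, 1, 2.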
