I have data in this format:
How can I re-organize the data with R in the following format?
In other words: Create a new column for every single observation and paste a simple count if the observation occurs for the specific group.
This is most easily done using the tidyr package:
library(tidyr)
dat <- data.frame(letter = c("A", "A", "A", "A",
"B", "B", "B", "C",
"C", "C", "C", "D"),
number = c(2, 3, 4,5, 4, 5, 6, 1, 3, 5, 7, 1),
value = 1)
spread(dat, number, value)
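In recent tidyr releases `spread()` is superseded by `pivot_wider()`. A minimal sketch of the same reshape, assuming tidyr 1.0.0 or later is installed; filling the missing letter/number combinations with 0 is my assumption about the desired result (plain `spread()` as written above leaves them as `NA`):

```r
library(tidyr)

dat <- data.frame(
  letter = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "D"),
  number = c(2, 3, 4, 5, 4, 5, 6, 1, 3, 5, 7, 1),
  value  = 1
)

# One column per distinct `number`; combinations that never occur get 0
# (assumption: 0 rather than NA is what the question is after).
wide <- pivot_wider(dat,
                    names_from  = number,
                    values_from = value,
                    values_fill = list(value = 0))
wide
```

The new columns appear in order of first occurrence in `number`, so reorder them afterwards if a sorted layout matters.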
dat <- data.frame(letter = c("A", "A", "A", "A",
"B", "B", "B", "C",
"C", "C", "C", "D"),
number = c(2, 3, 4,5, 4, 5, 6, 1, 3, 5, 7, 1))
I would like to provide a base R solution (maybe just for fun...), based on matrix indexing.
lev <- unique(dat[[1L]]); k <- length(lev) ## unique levels
x <- dat[[2L]]; p <- max(x) ## column position
z <- matrix(0L, nrow = k, ncol = p, dimnames = list(lev, seq_len(p))) ## initialization
z[cbind(match(dat[[1L]], lev), dat[[2L]])] <- 1L ## replacement
z ## display
# 1 2 3 4 5 6 7
#A 0 1 1 1 1 0 0
#B 0 0 0 1 1 1 0
#C 1 0 1 0 1 0 1
#D 1 0 0 0 0 0 0
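For comparison, a shorter base R sketch using `table()`. Unlike the matrix-indexing version it counts occurrences, so a repeated letter/number pair would give 2 rather than 1; whether that matters depends on the data, which I am assuming has no duplicates:

```r
dat <- data.frame(
  letter = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "D"),
  number = c(2, 3, 4, 5, 4, 5, 6, 1, 3, 5, 7, 1)
)

# Fix the column set to 1..max so unobserved numbers still get a column
# (an assumption, chosen to mirror the matrix layout above).
num <- factor(dat$number, levels = seq_len(max(dat$number)))

# Cross-tabulate letter against number: one row per letter, one column per number.
tab <- table(dat$letter, num)
tab
```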
Related
I would like to compute the conditional rolling sum of a column, but based on the values of another column.
I have a table like this:
data_frame <- data.frame( category1 = c("A", "A", "A", "B", "B", "B", "A", "A", "B"),
category2 = c("B", "B", "B", "A", "A", "A", "B", "B", "A"),
value = c(1, 2, 1, 2, 1, 5, 3, 4, 2),
desired_output = c(0, 0, 0, 4, 4, 4, 8, 8, 11))
library(dplyr)
data_frame2 <- data_frame %>%
group_by(category1) %>%
mutate(cumsum = cumsum(value))
category1 category2 value cumsum desired_output
A B 1 1 0
A B 2 3 0
A B 1 4 0
B A 2 2 4
B A 1 3 4
B A 5 8 4
A B 3 7 8
A B 4 11 8
B A 2 10 11
I am able to compute the rolling sum of the value based on category1 or category2 using cumsum, but I would like a column which calculates a rolling sum of the value column when category1 equals the current value of category2. For example, in the last row of the above example it sums the value of all the above rows when category1 == A, as the current value of category2 is A.
I have tried various hacky ifelse/lag/fill solutions but nothing gets close to what I need. I have also tried adding a conditional into the ave function, as below, but not sure what the syntax should be...
data_frame2$desired_output <- ave(data_frame2$value, data_frame2$category1 = data_frame2$category2, FUN=cumsum)
Thanks in advance - first question so apologies about anything I missed/got wrong!
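One possible way to get that column, sketched in base R. It reads "rolling sum" as the sum of `value` over the strictly earlier rows whose `category1` equals the current row's `category2`, which is my interpretation of the description; the name `rolling` is hypothetical:

```r
data_frame <- data.frame(
  category1 = c("A", "A", "A", "B", "B", "B", "A", "A", "B"),
  category2 = c("B", "B", "B", "A", "A", "A", "B", "B", "A"),
  value     = c(1, 2, 1, 2, 1, 5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# For row i, add up `value` over the rows above it whose category1
# matches this row's category2. The first row has nothing above it, so it gets 0.
data_frame$rolling <- vapply(seq_len(nrow(data_frame)), function(i) {
  earlier <- seq_len(i - 1)
  match_prev <- data_frame$category1[earlier] == data_frame$category2[i]
  sum(data_frame$value[earlier][match_prev])
}, numeric(1))

data_frame$rolling
# 0 0 0 4 4 4 8 8 11
```

This loops over every row and rescans the rows above it, so it is quadratic in the number of rows; that is fine for small tables but may need a cumulative approach for large ones.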
I am working with the R programming language. Suppose I have the following data frame:
a = rnorm(100,10,1)
b = rnorm(100,10,5)
c = rnorm(100,10,10)
my_data_2 = data.frame(a,b,c)
my_data_2$group = as.factor(C)
My Question: Suppose I want to add an ID column to this data frame that gives the first observation the ID "101" and increases the ID by 1 for each new row. I tried to do this as follows:
my_data_2$id = seq(101, 200, by = 1)
However, this "corrupted" the data frame:
head(my_data_2)
a b c
1 10.381397 9.534634 12.8330946
2 10.326785 6.397006 8.1217063
3 8.333354 11.474064 11.6035562
4 9.583789 12.096404 18.2764387
5 9.581740 12.302016 4.0601871
6 11.772943 9.151642 -0.3686874
group
1 c(9.98552413605153, 9.53807731118048, 6.92589246998173, 8.97095368638206, 9.70249918748529, 10.6161773148626, 9.2514231659343, 10.6566757899233, 10.2351848084123, 9.45970725813352, 9.15347719257448, 9.30428244749624, 8.43075784609759, 11.1200169905262, 11.3493313166827, 8.86895968334901, 9.13208319045466, 9.70062759133717)
2 c(8.90358954387628, 13.8756093430144, 12.9970566311467, 10.4227745183785, 21.3259516051226, 4.88590162247496, 10.260282181, 14.092109840631, 7.37839577680487, 9.09764173775965, 15.1636139760987, 9.9773055885761, 8.29361737323061, 8.61361852648607, 12.6807897406641, 0.00863359720839085, 10.7660528147358, 9.79616528370632)
3 c(25.8063583646201, -11.5722310383483, 8.56096791164312, 12.2858029391835, -0.312392781809937, 0.946343715084028, 2.45881422753051, 7.26197515743391, 0.333766891336273, 14.9149659649045, -4.55483090530928, -19.8075232688082, 16.59106194569, 18.7377329188129, 1.1771203751127, -6.19019973790205, -5.02277721344565, 23.3363430334739)
4 c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
5 c("B", "B", "B", "A", "B", "B", "B", "B", "B", "B", "B", "A", "B", "B", "B", "B", "B", "B")
6 c("B", "B", "B", "B", "B", "A", "B", "B", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B")
id
1 101
2 102
3 103
4 104
5 105
6 106
Can someone please show me how to fix this problem?
Thanks!
Your problem isn't your ID column; it is the line where you define your group variable. You call as.factor(C) (note the uppercase C), but the column of your data frame is a lowercase c. So I guess you have defined another object C outside of your data frame, and that object is what "corrupts" it.
You probably want:
my_data_2$group <- as.factor(my_data_2$c)
I was able to figure out the answer!
a = rnorm(100,10,1)
b = rnorm(100,10,5)
c = rnorm(100,10,10)
my_data_2 = data.frame(a,b,c)
my_data_2$group = as.factor("C")
my_data_2$id = seq(101, 200, by = 1)
head(my_data_2)
a b c group id
1 9.436773 10.712568 3.7699748 C 101
2 10.265810 3.408589 11.9230024 C 102
3 10.503245 12.197000 8.3620889 C 103
4 9.279878 7.007812 16.8268852 C 104
5 10.683518 8.039032 5.2287997 C 105
6 11.097258 10.313103 0.4988398 C 106
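As a small variation, the ID can be derived from the number of rows instead of hard-coding 200. A sketch under the assumption that IDs should simply start at 101; `set.seed()` is added here only to make the example reproducible:

```r
set.seed(1)  # assumption: not in the original, only for reproducibility

my_data_2 <- data.frame(a = rnorm(100, 10, 1),
                        b = rnorm(100, 10, 5),
                        c = rnorm(100, 10, 10))

# 101, 102, ... up to 100 + number of rows, whatever that number is.
my_data_2$id <- 100L + seq_len(nrow(my_data_2))

range(my_data_2$id)
```

This keeps working if rows are added or removed before the ID is assigned, whereas seq(101, 200, by = 1) fails with a length mismatch.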
I have the following data frame, describing conditions each patient has (each can have more than 1):
df <- structure(list(patient = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6,
6, 7, 7, 8, 8, 9, 9, 10), condition = c("A", "A", "B", "B", "D",
"C", "A", "C", "C", "B", "D", "B", "A", "A", "C", "B", "C", "D",
"C", "D")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to create a "confusion matrix", which in this case will be a 4x4 matrix where AxA will have the value 5 (5 patients have condition A), AxB will have the value 2 (two patients have A and B), and so on.
How can I achieve this?
You can join the table to itself on patient and then cross-tabulate the two resulting condition columns.
library(dplyr)
df2 <- inner_join(df, df, by = "patient")
table(df2$condition.x,df2$condition.y)
A B C D
A 5 2 2 1
B 2 5 3 2
C 2 3 6 2
D 1 2 2 4
Here is a base R answer using outer:
count_patient <- function(x, y) {
length(intersect(df$patient[df$condition == x],
df$patient[df$condition == y]))
}
vec <- sort(unique(df$condition))
res <- outer(vec, vec, Vectorize(count_patient))
dimnames(res) <- list(vec, vec)
res
# A B C D
#A 5 2 2 1
#B 2 5 3 2
#C 2 3 6 2
#D 1 2 2 4
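Another base R route, sketched here as an alternative to the two answers above: build the patient-by-condition table and take its cross-product. It assumes each patient/condition pair appears at most once, as it does in this data:

```r
df <- data.frame(
  patient = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10),
  condition = c("A", "A", "B", "B", "D", "C", "A", "C", "C", "B",
                "D", "B", "A", "A", "C", "B", "C", "D", "C", "D")
)

# Patient x condition incidence table: 1 if the patient has the condition.
inc <- table(df$patient, df$condition)

# crossprod(inc) is t(inc) %*% inc: entry [x, y] counts the patients
# that have both condition x and condition y.
co <- crossprod(inc)
co
```

The diagonal holds the number of patients per condition, and the off-diagonal entries hold the pairwise overlaps, matching the tables shown above.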
Based on the data below:
library(tidyverse)
limit <- c(7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5)
group <- c("a", "a", "a", "a", "a", "a", "a", "a", "a","b", "b", "b", "b", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c")
df <- data.frame(limit, group)
df
I'd like to create a new column (NewCol) as follows:
If there is a row where limit equals Id (the row's position within its group), NewCol should be 0 on that row. The rows before it should then count backwards (-1, -2, ...) to the first row of the group, and the rows after it should count upwards (1, 2, ...) to the end of the group.
so for example, in that case, for group "a" it should look like
-6, -5, -4, -3, -2, -1, 0, 1, 2 where -6 is the first row and 2 is the 9th row for that group.
This is what I've tried but still not getting what I need
df %>% group_by(group) %>% mutate(Id = seq(1:length(limit))) %>%
mutate(NewCol = ifelse(limit == Id, 0, NA)) %>%
mutate(nn=ifelse(is.na(NewCol),
zoo::na.locf(NewCol) + cumsum(is.na(NewCol))*1,
NewCol))
Thank you
It is just the difference between row_number() and 'limit' after grouping by 'group':
library(dplyr)
df %>%
group_by(group) %>%
mutate(NewCol = row_number() - limit)
Or using data.table (note the by = group, so that the row index restarts in each group):
library(data.table)
setDT(df)[, NewCol := seq_len(.N) - limit, by = group]
Or with base R
df$NewCol <- with(df, ave(seq_along(limit), group, FUN = seq_along) - limit)
In base R, we can use ave:
df$NewCol <- with(df, ave(limit, group, FUN = seq_along) - limit)
# limit group NewCol
#1 7 a -6
#2 7 a -5
#3 7 a -4
#4 7 a -3
#5 7 a -2
#6 7 a -1
#7 7 a 0
#8 7 a 1
#9 7 a 2
#10 4 b -3
#11 4 b -2
#12 4 b -1
#13 4 b 0
#...
Or using data.table:
library(data.table)
setDT(df)[, NewCol := seq_along(limit) - limit, group]
#Or
#setDT(df)[, NewCol := seq_len(.N) - limit, group]
I'm trying to add a body count for each unique person. Each person has multiple data points.
df <- data.frame(PERSON = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
Y = c(2, 5, 4, 1, 2, 5, 3, 7, 1))
This is what I'd like it to look like:
PERSON Y UNIQ_CT
1 A 2 1
2 A 5 0
3 A 4 0
4 B 1 1
5 B 2 0
6 C 5 1
7 C 3 0
8 C 7 0
9 C 1 0
You can use duplicated and negate it, so that only the first row of each person gets a 1:
transform(df, UNIQ_CT = as.integer(!duplicated(PERSON)))
Since the question has a dplyr tag, here is a dplyr option:
library(dplyr)
df %>%
group_by(PERSON) %>%
mutate(UNIQ_CT = ifelse(row_number( ) == 1, 1, 0))