I have a large data set of hospital discharge records. Each discharge has procedure codes spread across numerous columns (principal code, other 1, other 2...other 24). I would like a frequency list for 20 specific codes, so I need to count frequencies across multiple columns. Any help would be appreciated!
Example:
#Sample Data
ID <- c(112,113,114,115)
Sex <- c(1,0,1,0)
Princ_Code <- c(1,2,5,3)
Oth_Code_1 <- c(5,7,8,1)
Oth_Code_2 <- c(2,10,12,9)
discharges <- data.frame(ID,Sex,Princ_Code,Oth_Code_1, Oth_Code_2)
I'd like to get a frequency count of specific codes across the columns.
Something like:
x freq
1 2
2 2
3 1
12 1
One way to think about this problem is to transform the data from a wide format (many columns holding the same kind of data) to a tall format (where each column plays a distinct role: one identifies the code type, one holds the code value). I'll demonstrate using tidyr, though there are base and data.table methods as well (a data.table sketch follows the dplyr version below).
out <- tidyr::gather(discharges, codetype, code, -ID, -Sex)
out
# ID Sex codetype code
# 1 112 1 Princ_Code 1
# 2 113 0 Princ_Code 2
# 3 114 1 Princ_Code 5
# 4 115 0 Princ_Code 3
# 5 112 1 Oth_Code_1 5
# 6 113 0 Oth_Code_1 7
# 7 114 1 Oth_Code_1 8
# 8 115 0 Oth_Code_1 1
# 9 112 1 Oth_Code_2 2
# 10 113 0 Oth_Code_2 10
# 11 114 1 Oth_Code_2 12
# 12 115 0 Oth_Code_2 9
Do you see how transforming from "wide" to "tall" makes the problem seem a lot simpler? From here, you could use table or xtabs
table(out$code)
# 1 2 3 5 7 8 9 10 12
# 2 2 1 2 1 1 1 1 1
xtabs(~code, data=out)
# code
# 1 2 3 5 7 8 9 10 12
# 2 2 1 2 1 1 1 1 1
or you can continue with dplyr pipes and tidyr:
library(dplyr)
library(tidyr)
discharges %>%
gather(codetype, code, -ID, -Sex) %>%
group_by(code) %>%
tally()
# # A tibble: 9 × 2
# code n
# <dbl> <int>
# 1 1 2
# 2 2 2
# 3 3 1
# 4 5 2
# 5 7 1
# 6 8 1
# 7 9 1
# 8 10 1
# 9 12 1
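As mentioned above, there are base and data.table routes too. Here is a minimal data.table sketch that also restricts the tally to a set of codes of interest; codes_of_interest below is a hypothetical stand-in for your 20 specific codes:
library(data.table)
codes_of_interest <- c(1, 2, 3, 12)  # hypothetical subset; replace with your 20 codes
# melt() reshapes wide to tall, then we count only the codes of interest
dt <- melt(as.data.table(discharges), id.vars = c("ID", "Sex"),
           variable.name = "codetype", value.name = "code")
dt[code %in% codes_of_interest, .N, by = code][order(code)]
The same %in% filter works on the gather() output above, e.g. table(out$code[out$code %in% codes_of_interest]).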
Data
id<-c("a","a","a","a","a","a","b","b","b","b","b","b")
d<-c(1,2,3,90,98,100000,4,6,7,8,23,45)
df<-data.frame(id,d)
I want to detect observational discontinuities within each "id".
I'd like a way to detect these discontinuities without using means or medians as a reference.
You can check whether the difference between each value and the previous one within each group is different from 1:
library(dplyr)
df %>%
group_by(id) %>%
mutate(dis = +(c(F, diff(d) != 1)))
# A tibble: 12 × 3
# Groups: id [2]
id d dis
<chr> <dbl> <int>
1 a 1 0
2 a 2 0
3 a 3 0
4 a 90 1
5 a 98 1
6 a 100000 1
7 b 4 0
8 b 6 1
9 b 7 0
10 b 8 0
11 b 23 1
12 b 45 1
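The same check can also be written with lag() instead of diff(); a minimal equivalent sketch:
library(dplyr)
df %>%
  group_by(id) %>%
  # d - lag(d) is NA on the first row of each group; coalesce() maps that NA to 0
  mutate(dis = coalesce(as.integer(d - lag(d) != 1), 0L))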
I want to go from this:
test <- data.frame(subject=c(rep(1,10),rep(2,10)),x=1:10,y=0:1)
to something like this: as I wrote in the title, when the first 1 appears, all subsequent values of "y" for a given "subject" must change to 1, and then the same for the next "subject".
I tried something like this:
test <- test%>%
group_nest(subject) %>%
mutate(XD = map(data,function(x){
ifelse(x$y[which(grepl(1, x$y))[1]:nrow(x)]==TRUE , 1,0)})) %>% unnest(cols = c(data,XD))
It didn't work :(
Try this:
library(dplyr)
#Code
new <- test %>%
group_by(subject) %>%
mutate(y=ifelse(row_number()<min(which(y==1)),y,1))
Output:
# A tibble: 20 x 3
# Groups: subject [2]
subject x y
<dbl> <int> <dbl>
1 1 1 0
2 1 2 1
3 1 3 1
4 1 4 1
5 1 5 1
6 1 6 1
7 1 7 1
8 1 8 1
9 1 9 1
10 1 10 1
11 2 1 0
12 2 2 1
13 2 3 1
14 2 4 1
15 2 5 1
16 2 6 1
17 2 7 1
18 2 8 1
19 2 9 1
20 2 10 1
Since you appear to just have 0's and 1's, a straightforward approach would be to take a cumulative maximum via the cummax function:
library(dplyr)
test %>%
group_by(subject) %>%
mutate(y = cummax(y))
@Duck's answer is considerably more robust if you have a range of values that may appear before or after the first 1.
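To illustrate that point, here is a small made-up group containing a value other than 0/1 (the toy data is hypothetical, purely for comparison):
library(dplyr)
toy <- data.frame(subject = 1, x = 1:5, y = c(0, 3, 0, 1, 2))
toy %>%
  group_by(subject) %>%
  mutate(y_cummax = cummax(y),                    # 0 3 3 3 3: the 3 propagates forward
         y_first1 = ifelse(cumany(y == 1), 1, y)) # 0 3 0 1 1: rows from the first 1 onward become 1
cummax() is thrown off by the 3, while the cumany()/ifelse() version reproduces what the row_number()/min(which()) approach above gives.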
I would like a variable that is the lagged difference from the within-group baseline. I have panel data that I have balanced.
my_data <- data.frame(id = c(1,1,1,2,2,2,3,3,3), group = c(1,2,3,1,2,3,1,2,3), score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
id group score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I would like it to look like this:
id group score lag_diff_baseline
1 1 1 0 NA
2 1 2 150 150
3 1 3 170 170
4 2 1 80 NA
5 2 2 100 20
6 2 3 110 30
7 3 1 75 NA
8 3 2 100 25
9 3 3 0 -75
The data.table version of @Liam's answer:
library(data.table)
setDT(my_data)
my_data[, .(group, score, lag_diff_baseline = score - first(score)), by = id]
I missed the easy answer:
library(dplyr)
my_data %>%
group_by(id) %>%
mutate(lag_diff_baseline = score - first(score))
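If you also want NA in the baseline row itself, as in the desired output above, a hedged variant of the same idea (assuming rows are ordered by group within each id, as in the example):
library(dplyr)
my_data %>%
  group_by(id) %>%
  # the first row of each id is the baseline, so report NA there
  mutate(lag_diff_baseline = if_else(row_number() == 1, NA_real_, score - first(score)))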
For each ID, I want to return the value in the 'distance' column where the 'value' column becomes negative for the first time. If the value never becomes negative, return 99 (or some other arbitrary number) for that ID. A sample data frame is given below.
df <- data.frame(ID = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4)),
                 distance = rep(1:4, 5),
                 value = c(1, 4, 3, -1, 2, 1, -4, 1, 3, 2, -1, 1, -4, 3, 2, 1, 2, 3, 4, 5))
> df
ID distance value
1 1 1 1
2 1 2 4
3 1 3 3
4 1 4 -1
5 2 1 2
6 2 2 1
7 2 3 -4
8 2 4 1
9 3 1 3
10 3 2 2
11 3 3 -1
12 3 4 1
13 4 1 -4
14 4 2 3
15 4 3 2
16 4 4 1
17 5 1 2
18 5 2 3
19 5 3 4
20 5 4 5
The desired output is as follows
> df2
ID first_negative_distance
1 1 4
2 2 3
3 3 3
4 4 1
5 5 99
I tried but couldn't figure out how to do it with dplyr. Any help would be much appreciated. The actual data I'm working on has thousands of IDs with 30 different distance levels for each. Bear in mind that for any ID there could be multiple negative values; I just need the first one.
Edit:
Tried the solution proposed by AntonoisK.
> df%>%group_by(ID)%>%summarise(first_neg_dist=first(distance[value<0]))
first_neg_dist
1 4
This is the result I am getting; it does not match what AntonoisK got. Not sure why.
library(dplyr)
df %>%
group_by(ID) %>%
summarise(first_neg_dist = first(distance[value < 0]))
# # A tibble: 5 x 2
# ID first_neg_dist
# <dbl> <int>
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 1
# 5 5 NA
If you really prefer 99 instead of NA you can use
summarise(first_neg_dist = coalesce(first(distance[value < 0]), 99L))
instead.
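For completeness, the full pipeline with the 99 fallback, with the column named to match the desired df2:
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(first_negative_distance = coalesce(first(distance[value < 0]), 99L))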
I searched a lot but didn't find anything relevant.
What I Want:
I'm trying to do a simple group-by and summarise in R.
My preferred output would have multi-indexed columns and multi-indexed rows. Multi-indexed rows are easy with dplyr; the difficulty is the columns.
What I already tried:
library(dplyr)
cp <- read.table(text="SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
1 1 1 1 1 70 1
2 1 1 1 2 154 8
3 1 1 2 1 210 10
4 1 1 2 2 21 1
5 1 2 1 1 77 8
6 1 2 1 2 90 6
7 1 2 2 1 105 5
8 1 2 2 2 140 11
")
attach(cp)
cp_gb <- cp %>%
group_by(SEX, REGION, CAR_TYPE, JOB) %>%
summarise(counts=round(sum(NUMBER/EXPOSURE*1000)))
library(reshape2)  # for dcast (also available from data.table)
dcast(cp_gb, formula = SEX + REGION ~ CAR_TYPE + JOB, value.var = "counts")
The problem is that the column index is "melted" into a single level instead of a multi-indexed column, as I know it from Python/pandas.
Wrong output:
SEX REGION 1_1 1_2 2_1 2_2
1 1 14 52 48 48
1 2 104 67 48 79
Example of how it would work in pandas:
# clipboard, copy this without the comments:
# SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
# 1 1 1 1 1 70 1
# 2 1 1 1 2 154 8
# 3 1 1 2 1 210 10
# 4 1 1 2 2 21 1
# 5 1 2 1 1 77 8
# 6 1 2 1 2 90 6
# 7 1 2 2 1 105 5
# 8 1 2 2 2 140 11
import pandas as pd
df = pd.read_clipboard(delim_whitespace=True)
gb = df.groupby(["SEX","REGION", "CAR_TYPE", "JOB"]).sum()
gb['promille_value'] = (gb['NUMBER'] / gb['EXPOSURE'] * 1000).astype(int)
gb = gb[['promille_value']].unstack(level=[2,3])
Correct output:
CAR_TYPE 1 1 2 2
JOB 1 2 1 2
SEX REGION
1 1 14 51 47 47
1 2 103 66 47 78
(Update) What works (nearly):
I tried it with ftable, but it only prints ones in the matrix instead of the values of "counts".
ftable(cp_gb, col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
ftable accepts lists of factors (a data frame) or a table object. Instead of passing the grouped data frame as it is, convert it to a table object first; passing that to ftable should get you the counts:
# because xtabs expects factors
cp_gb <- cp_gb %>% ungroup %>% mutate_at(1:4, as.factor)
xtabs(counts ~ ., cp_gb) %>%
ftable(col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
# CAR_TYPE 1 2
# JOB 1 2 1 2
# SEX REGION
# 1 1 14 52 48 48
# 2 104 67 48 79
There is a difference of 1 in some of the counts between the R and pandas outputs because you use round in R and truncation (.astype(int)) in Python.
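If you want the R numbers to line up exactly with the pandas ones, you could truncate instead of round when building counts, e.g. (a sketch; trunc() matches .astype(int) for these positive values):
library(dplyr)
cp_gb <- cp %>%
  group_by(SEX, REGION, CAR_TYPE, JOB) %>%
  summarise(counts = trunc(sum(NUMBER / EXPOSURE * 1000)))
Re-running the xtabs()/ftable() steps above on this version should then reproduce the pandas numbers.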