how to cast to multicolumn in R like Pandas-Style? - r

I searched a lot but didn't find anything relevant.
What I Want:
I'm trying to do a simple group-by and summarise in R.
My preferred output would have multiindexed columns and multiindexed rows. Multiindexed rows are easy with dplyr; the difficulty is the columns.
What I already tried:
library(dplyr)
library(reshape2) # for dcast
cp <- read.table(text="SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
1 1 1 1 1 70 1
2 1 1 1 2 154 8
3 1 1 2 1 210 10
4 1 1 2 2 21 1
5 1 2 1 1 77 8
6 1 2 1 2 90 6
7 1 2 2 1 105 5
8 1 2 2 2 140 11
")
attach(cp)
cp_gb <- cp %>%
group_by(SEX, REGION, CAR_TYPE, JOB) %>%
summarise(counts=round(sum(NUMBER/EXPOSURE*1000)))
dcast(cp_gb, formula = SEX + REGION ~ CAR_TYPE + JOB, value.var="counts")
Now there is the problem that the column index is flattened ("melted") into one level instead of being a multiindexed column, like I know it from Python/Pandas.
Wrong output:
SEX REGION 1_1 1_2 2_1 2_2
1 1 14 52 48 48
1 2 104 67 48 79
Example of how it would work in Pandas:
# clipboard, copy this without the comments:
# SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
# 1 1 1 1 1 70 1
# 2 1 1 1 2 154 8
# 3 1 1 2 1 210 10
# 4 1 1 2 2 21 1
# 5 1 2 1 1 77 8
# 6 1 2 1 2 90 6
# 7 1 2 2 1 105 5
# 8 1 2 2 2 140 11
import pandas as pd
df = pd.read_clipboard(delim_whitespace=True)
gb = df.groupby(["SEX","REGION", "CAR_TYPE", "JOB"]).sum()
gb['promille_value'] = (gb['NUMBER'] / gb['EXPOSURE'] * 1000).astype(int)
gb = gb[['promille_value']].unstack(level=[2,3])
Correct output:
CAR_TYPE 1 1 2 2
JOB 1 2 1 2
SEX REGION
1 1 14 51 47 47
1 2 103 66 47 78
(Update) What works (nearly):
I tried it with ftable, but it only prints 1s in the matrix instead of the values of "counts".
ftable(cp_gb, col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))

ftable accepts lists of factors (a data frame) or a table object. Instead of passing the grouped data frame as it is, convert it to a table object first before passing it to ftable; that should get you the counts:
# because xtabs expects factors
cp_gb <- cp_gb %>% ungroup %>% mutate_at(1:4, as.factor)
xtabs(counts ~ ., cp_gb) %>%
ftable(col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
# CAR_TYPE 1 2
# JOB 1 2 1 2
# SEX REGION
# 1 1 14 52 48 48
# 2 104 67 48 79
There is a difference of 1 in some of the counts between the R and pandas outputs because you use round in R and truncation (.astype(int)) in Python.
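If you want the R counts to match the pandas output exactly, truncate instead of rounding; a minimal sketch, reusing the pipeline from above:
cp_gb <- cp %>%
group_by(SEX, REGION, CAR_TYPE, JOB) %>%
summarise(counts = trunc(sum(NUMBER/EXPOSURE*1000))) # trunc() matches .astype(int) for positive values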

Related

From dataset year select corresponding column of that year in R

I have the following dataframe:
Area
Year
Sucess_Year1
Sucess_Year2
Sucess_Year3
6
1
1
2
4
7
2
2
3
1
33
3
3
2
1
44
1
2
1
4
23
2
2
3
1
53
3
1
2
4
Now I want to get a separate column containing the success value only for the year given in the Year column.
Like this:
Area Year Sucess
6 1 1
7 2 3
33 3 1
44 1 2
23 2 3
53 3 4
How do I do this? Something like a join? Select if?
I thought of something like data_match <- data[, grep("col", colnames(data))], but then you would not iterate over the rows.
With sapply to do a vectorized grep, and cbind to do indexing by rows:
data$success <- data[cbind(1:nrow(data), sapply(data$Year, grep, colnames(data)))]
cbind(data[1:2],
success = data[cbind(1:nrow(data), sapply(data$Year, grep, colnames(data)))])
# Area Year success
#1 6 1 1
#2 7 2 3
#3 33 3 1
#4 44 1 2
#5 23 2 3
#6 53 3 4
You can use a 2-column matrix for indexing:
df[grep("Sucess", names(df))][cbind(seq_len(nrow(df)), df$Year)]
# [1] 1 3 1 2 3 4
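To see why this works: cbind(i, j) builds a two-column matrix of (row, column) positions, and indexing with such a matrix picks one element per row; here Year doubles as the column index because the Sucess_Year columns are in year order. A minimal standalone illustration of the idiom:
m <- matrix(1:9, nrow = 3)
m[cbind(1:3, c(2, 1, 3))] # picks m[1,2], m[2,1], m[3,3]
# [1] 4 2 9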
Using pivoting (here using tidyr/dplyr):
library(dplyr)
library(tidyr)
data |>
pivot_longer(starts_with("Sucess_Year"),
names_prefix = "Sucess_Year",
values_to = "Sucess",
names_to = "YearCol") |>
filter(Year == YearCol) |>
select(-YearCol)
Output:
# A tibble: 6 × 3
Area Year Sucess
<dbl> <dbl> <dbl>
1 6 1 1
2 7 2 3
3 33 3 1
4 44 1 2
5 23 2 3
6 53 3 4
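Note that names_to produces a character column, so filter(Year == YearCol) relies on R coercing Year to character for the comparison. If you prefer an explicit conversion (a sketch, assuming tidyr >= 1.1 for names_transform):
data |>
pivot_longer(starts_with("Sucess_Year"),
names_prefix = "Sucess_Year",
values_to = "Sucess",
names_to = "YearCol",
names_transform = list(YearCol = as.numeric)) |>
filter(Year == YearCol) |>
select(-YearCol)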

Keep previous value if it is under a certain threshold

I would like to create a variable called treatment_cont that is grouped by group as follows:
ID day day_diff treatment treatment_cont
1 0 NA 1 1
1 14 14 1 1
1 20 6 2 2
1 73 53 1 1
2 0 NA 1 1
2 33 33 1 1
2 90 57 2 2
2 112 22 3 2
2 152 40 1 1
2 178 26 4 1
treatment_cont is the same as treatment, but we want to keep the previous treatment regime whenever day_diff, the difference in days between treatments, is lower than 30.
I have tried many ways in dplyr, manipulating the table, but I cannot figure out how to do it efficiently.
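For reference, a sketch that rebuilds the example data as df (the answers below also need dplyr and tidyr loaded):
library(dplyr)
library(tidyr)
df <- read.table(text = "ID day day_diff treatment treatment_cont
1 0 NA 1 1
1 14 14 1 1
1 20 6 2 2
1 73 53 1 1
2 0 NA 1 1
2 33 33 1 1
2 90 57 2 2
2 112 22 3 2
2 152 40 1 1
2 178 26 4 1", header = TRUE)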
Probably a conditional mutate, using case_when and lag, might work (keeping the previous treatment when day_diff is below 30):
df %>% group_by(ID) %>%
mutate(treatment_cont = case_when(day_diff < 30 ~ lag(treatment), TRUE ~ treatment))
You are probably looking for lag (and perhaps its brother, lead):
df %>%
replace_na(list(day_diff = 0)) %>%
group_by(ID) %>%
arrange(day) %>%
mutate(
treatment_cont = ifelse(day_diff < 30, lag(treatment_cont, default = treatment_cont[1]), treatment_cont)
) %>%
ungroup %>%
arrange(ID, day)
# A tibble: 10 x 5
ID day day_diff treatment treatment_cont
<int> <int> <dbl> <int> <int>
1 1 0 0 1 1
2 1 14 14 1 1
3 1 20 6 2 1
4 1 73 53 1 1
5 2 0 0 1 1
6 2 33 33 1 1
7 2 90 57 2 2
8 2 112 22 3 2
9 2 152 40 1 1
10 2 178 26 4 1

How can I create a lag difference variable within group relative to baseline?

I would like a variable that is the lagged difference from the within-group baseline. I have panel data that I have balanced.
my_data <- data.frame(id = c(1,1,1,2,2,2,3,3,3), group = c(1,2,3,1,2,3,1,2,3), score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
id group score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I would like it to look like this:
id group score lag_diff_baseline
1 1 1 0 NA
2 1 2 150 150
3 1 3 170 170
4 2 1 80 NA
5 2 2 100 20
6 2 3 110 30
7 3 1 75 NA
8 3 2 100 25
9 3 3 0 -75
The data.table version of @Liam's answer:
library(data.table)
setDT(my_data)
my_data[, lag_diff_baseline := score - first(score), by = id]
I missed the easy answer:
library(dplyr)
my_data %>%
group_by(id) %>%
mutate(lag_diff_baseline = score - first(score))
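A base R equivalent (a sketch) broadcasts each group's first score with ave(); like the answers above, it returns 0 rather than NA for the baseline row:
my_data$lag_diff_baseline <- my_data$score - ave(my_data$score, my_data$id, FUN = function(s) s[1])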

R: Frequency across multiple columns

I have a large data set of hospital discharge records. There are procedure codes for each discharge, with numerous columns containing the codes (principal code, other 1, other 2 ... other 24). I would like to get a frequency list for 20 specific codes, so I need to get the frequency across multiple columns. Any help would be appreciated!
Example:
#Sample Data
ID <- c(112,113,114,115)
Sex <- c(1,0,1,0)
Princ_Code <- c(1,2,5,3)
Oth_Code_1 <- c(5,7,8,1)
Oth_Code_2 <- c(2,10,12,9)
discharges <- data.frame(ID,Sex,Princ_Code,Oth_Code_1, Oth_Code_2)
I'd like to get a frequency count of specific codes across the columns.
Something like:
x freq
1 2
2 2
3 1
12 1
One way to think about this problem is to transform the data from a wide format (multiple columns with identically-typed data) to a tall format (where each column is a fairly different type from the others). I'll demonstrate using tidyr, though there are base and data.table methods as well.
out <- tidyr::gather(discharges, codetype, code, -ID, -Sex)
out
# ID Sex codetype code
# 1 112 1 Princ_Code 1
# 2 113 0 Princ_Code 2
# 3 114 1 Princ_Code 5
# 4 115 0 Princ_Code 3
# 5 112 1 Oth_Code_1 5
# 6 113 0 Oth_Code_1 7
# 7 114 1 Oth_Code_1 8
# 8 115 0 Oth_Code_1 1
# 9 112 1 Oth_Code_2 2
# 10 113 0 Oth_Code_2 10
# 11 114 1 Oth_Code_2 12
# 12 115 0 Oth_Code_2 9
Do you see how transforming from "wide" to "tall" makes the problem seem a lot simpler? From here, you could use table or xtabs
table(out$code)
# 1 2 3 5 7 8 9 10 12
# 2 2 1 2 1 1 1 1 1
xtabs(~code, data=out)
# code
# 1 2 3 5 7 8 9 10 12
# 2 2 1 2 1 1 1 1 1
or you can continue with dplyr pipes and tidyr:
library(dplyr)
library(tidyr)
discharges %>%
gather(codetype, code, -ID, -Sex) %>%
group_by(code) %>%
tally()
# # A tibble: 9 × 2
# code n
# <dbl> <int>
# 1 1 2
# 2 2 2
# 3 3 1
# 4 5 2
# 5 7 1
# 6 8 1
# 7 9 1
# 8 10 1
# 9 12 1
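Since the question mentions counting only 20 specific codes, you can restrict any of these tallies afterwards. A base R sketch, where codes_of_interest is a hypothetical placeholder for your 20 codes:
code_cols <- c("Princ_Code", "Oth_Code_1", "Oth_Code_2")
codes_of_interest <- c(1, 2, 3, 12) # placeholder; substitute your 20 codes
all_codes <- unlist(discharges[code_cols], use.names = FALSE)
table(all_codes[all_codes %in% codes_of_interest])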

Flag first by-group in R data frame

I have a data frame which looks like this:
id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21
I'd like to identify a way to flag the first occurrence of id -- similar to first. and last. in SAS. I've tried the !duplicated function, but I need to actually append the "flag" column to my data frame since I'm running it through a loop later on. I'd like to get something like this:
id score first_ind
1 15 1
1 18 0
1 16 0
2 10 1
2 9 0
3 8 1
3 47 0
3 21 0
df$first_ind <- as.numeric(!duplicated(df$id))
df
id score first_ind
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
You can find the edges using diff.
x <- read.table(text = "id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21", header = TRUE)
x$first_id <- c(1, diff(x$id))
x
id score first_id
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
Using plyr:
library("plyr")
ddply(x,"id",transform,first=as.numeric(seq(length(score))==1))
or if you prefer dplyr:
x %>% group_by(id) %>%
mutate(first = c(1, rep(0, n() - 1)))
(although if you're operating completely in the plyr/dplyr framework you probably wouldn't need this flag variable anyway ...)
Another base R option:
df$first_ind <- ave(df$id, df$id, FUN = seq_along) == 1
df
# id score first_ind
#1 1 15 TRUE
#2 1 18 FALSE
#3 1 16 FALSE
#4 2 10 TRUE
#5 2 9 FALSE
#6 3 8 TRUE
#7 3 47 FALSE
#8 3 21 FALSE
This also works in case of unsorted ids. If you want 1/0 instead of TRUE/FALSE, you can easily wrap it in as.integer().
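For completeness, a data.table sketch of the same flag (also robust to unsorted ids):
library(data.table)
setDT(df)[, first_ind := as.integer(seq_len(.N) == 1L), by = id]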
