Add column with numbers based on a second column - r

Here is my data.frame:
df = read.table(text = 'Day ID Event
100 1 1
100 1 1
99 1 1
97 1 1
87 2 1
86 2 1
85 2 1
965 1 2
964 1 2
960 1 2
959 1 2
709 2 2
708 2 2
12 3 2
9 3 2', header = TRUE)
I would like to create a new column that, within each combination of ID and Event, assigns each observation a number in decreasing order based on its Day relative to the other Days in the group.
My desired output would be:
Day ID Event Count
100 1 1 4
100 1 1 4
99 1 1 3
97 1 1 1
87 2 1 3
86 2 1 2
85 2 1 1
965 1 2 7
964 1 2 6
960 1 2 2
959 1 2 1
709 2 2 2
708 2 2 1
12 3 2 4
9 3 2 1
E.g., in the first 'block' above: Day 97 = 1, Day 98 = 2, Day 99 = 3 and Day 100 = 4. Day 98 is missing, but it still needs to be included in the count.
I tried the following but the output is not the one I need:
df$Count <- ave(df$Day, df$Event, df$ID, FUN = seq_along)
Thanks for your help

We can try
library(dplyr)
df %>%
group_by(ID, Event) %>%
mutate(Count = 1+(Day-Day[n()]))
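The answer above relies on the rows of each group being sorted by decreasing Day, because Day[n()] picks the last row of the group. A minimal base R sketch that does not depend on row order, assuming the count should always be 1 on the smallest Day of each ID/Event group:

```r
df <- read.table(text = "Day ID Event
100 1 1
100 1 1
99 1 1
97 1 1
87 2 1
86 2 1
85 2 1
965 1 2
964 1 2
960 1 2
959 1 2
709 2 2
708 2 2
12 3 2
9 3 2", header = TRUE)

# Offset of each Day from the smallest Day in its ID/Event group, plus 1,
# so missing days (e.g. Day 98) still take up a slot in the count
df$Count <- ave(df$Day, df$ID, df$Event, FUN = function(d) d - min(d) + 1)
```

Using min(d) instead of the last element means the result stays the same if the rows are shuffled.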

Related

Keep previous value if it is under a certain threshold

I would like to create a variable called treatment_cont that is computed within each ID as follows:
ID day day_diff treatment treatment_cont
1 0 NA 1 1
1 14 14 1 1
1 20 6 2 2
1 73 53 1 1
2 0 NA 1 1
2 33 33 1 1
2 90 57 2 2
2 112 22 3 2
2 152 40 1 1
2 178 26 4 1
treatment_cont is the same as treatment, except that we want to keep the previous treatment regime when day_diff, the difference in days between treatments, is lower than 30.
I have tried many ways in dplyr, manipulating the table, but I cannot figure out how to do it efficiently.
A conditional mutate using case_when and lag might work:
df %>% group_by(ID) %>% mutate(treatment_cont = case_when(day_diff < 30 ~ lag(treatment), TRUE ~ treatment))
You are probably looking for lag (and perhaps its brother, lead):
library(dplyr)
library(tidyr)
df %>%
  replace_na(list(day_diff = 0)) %>%
  group_by(ID) %>%
  arrange(day) %>%
  mutate(
    # keep the previous treatment when the gap is under 30 days
    treatment_cont = ifelse(day_diff < 30, lag(treatment, default = treatment[1]), treatment)
  ) %>%
  ungroup %>%
  arrange(ID, day)
# A tibble: 10 x 5
ID day day_diff treatment treatment_cont
<int> <int> <dbl> <int> <int>
1 1 0 0 1 1
2 1 14 14 1 1
3 1 20 6 2 1
4 1 73 53 1 1
5 2 0 0 1 1
6 2 33 33 1 1
7 2 90 57 2 2
8 2 112 22 3 2
9 2 152 40 1 1
10 2 178 26 4 1
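For readers without dplyr, the same rule can be sketched in base R. This is only an illustration under stated assumptions: keep_previous is a hypothetical helper name, the rows are assumed to be sorted by day within each ID, and the gap is recomputed from day rather than read from day_diff.

```r
df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  day = c(0, 14, 20, 73, 0, 33, 90, 112, 152, 178),
  treatment = c(1, 1, 2, 1, 1, 1, 2, 3, 1, 4)
)

# Hypothetical helper: for one ID, reuse the previous row's treatment
# whenever fewer than `threshold` days have passed since that row
keep_previous <- function(treatment, day, threshold = 30) {
  gap <- c(NA, diff(day))                  # first row has no previous visit
  previous <- c(NA, head(treatment, -1))   # treatment on the previous row
  ifelse(!is.na(gap) & gap < threshold, previous, treatment)
}

# Apply the helper per ID and put the pieces back in the original row order
df$treatment_cont <- unsplit(
  Map(keep_previous, split(df$treatment, df$ID), split(df$day, df$ID)),
  df$ID
)
```

Like the output shown above, this gives 1 for ID 1 at day 20, whereas the table in the question shows 2 for that row.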

Sum 1:n by group

Have: a dataset in which I need to sum 1:n for each row within each group
demo<-data.frame(th=c(c(0,24,26),(c(0,1,2,4))),hs=c(rep(220,3),c(rep(240,4))),
seq=(c(1:3,1:4)),group=c(rep(1,3),rep(2,4)))
Here's what that looks like:
> demo
th hs seq group
1 0 220 1 1
2 24 220 2 1
3 26 220 3 1
4 0 240 1 2
5 1 240 2 2
6 2 240 3 2
7 4 240 4 2
Need: a vector based on the hs, seq, and th columns that, for each row, is the sum of hs raised to the power seq, times th, over all rows up to and including that row within the group.
demo[1,"an"]<- demo[1,"hs"]^demo[1,"seq"] * demo[1,"th"]
demo[2,"an"]<-sum(demo[1,"hs"]^demo[1,"seq"] * demo[1,"th"],
demo[2,"hs"]^demo[2,"seq"] * demo[2,"th"] )
demo[3,"an"]<-sum(demo[1,"hs"]^demo[1,"seq"] * demo[1,"th"],
demo[2,"hs"]^demo[2,"seq"] * demo[2,"th"],
demo[3,"hs"]^demo[3,"seq"] * demo[3,"th"])
demo[6,"an"]<-sum(demo[4,"hs"]^demo[4,"seq"] * demo[4,"th"],
demo[5,"hs"]^demo[5,"seq"] * demo[5,"th"],
demo[6,"hs"]^demo[6,"seq"] * demo[6,"th"])
Here's what that new column (an) should look like
> demo
th hs seq group an
1 0 220 1 1 0
2 24 220 2 1 1161600
3 26 220 3 1 278009600
4 0 240 1 2 NA
5 1 240 2 2 NA
6 2 240 3 2 27705600
7 4 240 4 2 NA
Ignore the NA's in this MRE, those need to be filled in too.
Libraries
library(tidyverse)
Sample data
df <-
read.csv(
text =
"th hs seq group
0 220 1 1
24 220 2 1
26 220 3 1
0 240 1 2
1 240 2 2
2 240 3 2
4 240 4 2",
sep = " ",header = T
)
Code
df %>%
#Grouping by group
group_by(group) %>%
#Applying a cumulative sum of the formula, by group
mutate(an = cumsum(hs^seq*th))
Output
th hs seq group an
<int> <int> <int> <int> <dbl>
1 0 220 1 1 0
2 24 220 2 1 1161600
3 26 220 3 1 278009600
4 0 240 1 2 0
5 1 240 2 2 57600
6 2 240 3 2 27705600
7 4 240 4 2 13298745600
We can use data.table
library(data.table)
setDT(df)[, an := cumsum(hs^seq * th), by = group]
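As a cross-check that needs no packages, the same per-group running total can be sketched in base R with ave(). This assumes the rows are already in seq order within each group:

```r
demo <- data.frame(th = c(0, 24, 26, 0, 1, 2, 4),
                   hs = c(rep(220, 3), rep(240, 4)),
                   seq = c(1:3, 1:4),
                   group = c(rep(1, 3), rep(2, 4)))

# `^` binds tighter than `*`, so this is (hs^seq) * th for every row
term <- demo$hs^demo$seq * demo$th

# Running total of the per-row terms, restarted for each group
demo$an <- ave(term, demo$group, FUN = cumsum)
```

The resulting an column should agree with the dplyr output above, including the rows that were NA in the question.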

How can I create a lag difference variable within group relative to baseline?

I would like a variable that is a lagged difference to the within group baseline. I have panel data that I have balanced.
my_data <- data.frame(id = c(1,1,1,2,2,2,3,3,3), group = c(1,2,3,1,2,3,1,2,3), score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
id group score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I would like it to look like this:
id group score lag_diff_baseline
1 1 1 0 NA
2 1 2 150 150
3 1 3 170 170
4 2 1 80 NA
5 2 2 100 20
6 2 3 110 30
7 3 1 75 NA
8 3 2 100 25
9 3 3 0 -75
The data.table version of @Liam's answer
library(data.table)
setDT(my_data)
my_data[, lag_diff_baseline := score - first(score), by = id]
I missed the easy answer:
library(dplyr)
my_data %>%
group_by(id) %>%
mutate(lag_diff_baseline = score - first(score))
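Note that score - first(score) yields 0 on the baseline row, whereas the desired output shows NA there. A minimal base R sketch that returns NA for the baseline, assuming the first row of each id is the baseline:

```r
my_data <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                      group = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                      score = as.numeric(c(0, 150, 170, 80, 100, 110, 75, 100, 0)))

# NA for the baseline row, then each later score minus the baseline score
my_data$lag_diff_baseline <- ave(my_data$score, my_data$id,
                                 FUN = function(s) c(NA, s[-1] - s[1]))
```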

how to cast to multicolumn in R like Pandas-Style?

I searched a lot but didn't find anything relevant.
What I Want:
I'm trying to do a simple groupby and summarising in R.
My preferred output would have multiindexed columns and multiindexed rows. Multiindexed rows are easy with dplyr; the difficulty is the columns.
What I already tried:
library(dplyr)
library(reshape2) # for dcast
cp <- read.table(text="SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
1 1 1 1 1 70 1
2 1 1 1 2 154 8
3 1 1 2 1 210 10
4 1 1 2 2 21 1
5 1 2 1 1 77 8
6 1 2 1 2 90 6
7 1 2 2 1 105 5
8 1 2 2 2 140 11
")
attach(cp)
cp_gb <- cp %>%
group_by(SEX, REGION, CAR_TYPE, JOB) %>%
summarise(counts=round(sum(NUMBER/EXPOSURE*1000)))
dcast(cp_gb, formula = SEX + REGION ~ CAR_TYPE + JOB, value.var="counts")
Now there is the problem that the column index is "melted" into one instead of a multiindexed column, like I know it from Python/Pandas.
Wrong output:
SEX REGION 1_1 1_2 2_1 2_2
1 1 14 52 48 48
1 2 104 67 48 79
Example how it would work in Pandas:
# clipboard, copy this without the comments:
# SEX REGION CAR_TYPE JOB EXPOSURE NUMBER
# 1 1 1 1 1 70 1
# 2 1 1 1 2 154 8
# 3 1 1 2 1 210 10
# 4 1 1 2 2 21 1
# 5 1 2 1 1 77 8
# 6 1 2 1 2 90 6
# 7 1 2 2 1 105 5
# 8 1 2 2 2 140 11
import pandas as pd
df = pd.read_clipboard(sep=r"\s+")
gb = df.groupby(["SEX","REGION", "CAR_TYPE", "JOB"]).sum()
gb['promille_value'] = (gb['NUMBER'] / gb['EXPOSURE'] * 1000).astype(int)
gb = gb[['promille_value']].unstack(level=[2,3])
Correct output:
CAR_TYPE 1 1 2 2
JOB 1 2 1 2
SEX REGION
1 1 14 51 47 47
1 2 103 66 47 78
(Update) What works (nearly):
I tried to do it with ftable, but it only prints ones in the matrix instead of the values of "counts".
ftable(cp_gb, col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
ftable accepts lists of factors (data frame) or a table object. Instead of passing the grouped data frame as it is, convert it to a table object first; passing that to ftable should get you the counts:
# because xtabs expects factors
cp_gb <- cp_gb %>% ungroup %>% mutate_at(1:4, as.factor)
xtabs(counts ~ ., cp_gb) %>%
ftable(col.vars=c("CAR_TYPE","JOB"), row.vars = c("SEX","REGION"))
# CAR_TYPE 1 2
# JOB 1 2 1 2
# SEX REGION
# 1 1 14 52 48 48
# 2 104 67 48 79
There is a difference of 1 in some of the counts between the R and pandas outputs because you use round in R and truncation (.astype(int)) in Python.
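The effect is easy to reproduce for a single cell. A small sketch using the row with SEX 1, REGION 1, CAR_TYPE 1, JOB 2 (EXPOSURE 154, NUMBER 8) from the data above:

```r
ratio <- 8 / 154 * 1000    # about 51.95

rounded <- round(ratio)    # 52, as in the R output
truncated <- trunc(ratio)  # 51, as in the pandas output
```

Replacing round with trunc in the summarise call would therefore be one way to match the pandas numbers, assuming truncation is what is actually wanted.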

Create a new column with a sum based on the value of three other columns

I have a data frame and I want to create another column based on the information of three different columns. I am using R.
I want to start counting at 0 and add 2 in each new cell, based on the Time column and on the Item and Participant information. I want 0 at the beginning of the Time count (which is in ms) for each item of each participant.
df <- data.frame(Item=c(1,1,1,1,1,1,2,2,2,2,2,2),
Part=c(1,1,1,2,2,2,1,1,1,2,2,2),
Time=c(1234,1235,1236,345,346,347,1546,1547,1548,234,235,236))
Item Part Time
1 1 1 1234
2 1 1 1235
3 1 1 1236
4 1 2 345
5 1 2 346
6 1 2 347
7 2 1 1546
8 2 1 1547
9 2 1 1548
10 2 2 234
11 2 2 235
12 2 2 236
With the new column the table would be something like:
Item Part Time NewColumn
1 1 1 1234 0
2 1 1 1235 2
3 1 1 1236 4
4 1 2 345 0
5 1 2 346 2
6 1 2 347 4
7 2 1 1546 0
8 2 1 1547 2
9 2 1 1548 4
10 2 2 234 0
11 2 2 235 2
12 2 2 236 4
Many thanks in advance.
In case the structure stays as it is
library(dplyr)
result <- df %>% group_by(Part, Item) %>% mutate(NewColumn = seq (0,4,2))
I group by Item and Part and create a new column that counts 0, 2, 4
Item Part Time NewColumn
1 1 1 1234 0
2 1 1 1235 2
3 1 1 1236 4
4 1 2 345 0
5 1 2 346 2
6 1 2 347 4
7 2 1 1546 0
8 2 1 1547 2
9 2 1 1548 4
10 2 2 234 0
11 2 2 235 2
12 2 2 236 4
In order to be more flexible (if you have more than 3 rows per group), you can use
result <- df %>% group_by(Part, Item) %>% mutate(NewColumn = 2* (row_number()-1))
which will generate numbers in the sequence 0, 2, 4, 6, 8,...
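The same flexible numbering can be sketched without any packages using ave() and seq_along(). This assumes the rows are already in the desired order within each Item/Part group:

```r
df <- data.frame(Item = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
                 Part = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2),
                 Time = c(1234, 1235, 1236, 345, 346, 347, 1546, 1547, 1548, 234, 235, 236))

# Within each Item/Part group, number the rows 0, 2, 4, ... in their current order
df$NewColumn <- ave(df$Time, df$Item, df$Part,
                    FUN = function(t) 2 * (seq_along(t) - 1))
```

Because the sequence length comes from each group, this also works when groups have different numbers of rows.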
library(data.table)
df <- data.table(df)
df[, NewCol := seq(0, by = 2, length.out = .N), by = list(Item, Part)]
Er... df = cbind(df,NewColumn=c(0,2,4))?
+1 for library(plyr)
library(plyr)
ddply(df, c("Item","Part"), mutate,NewColumn = seq(0,4,2))
Item Part Time NewColumn
1 1 1234 0
1 1 1235 2
1 1 1236 4
1 2 345 0
1 2 346 2
1 2 347 4
2 1 1546 0
2 1 1547 2
2 1 1548 4
2 2 234 0
2 2 235 2
2 2 236 4
