Adding lines in data frame for each observation - r

I have a data structure in long format, meaning that each individual has more than one observation (and each observation has one row). Each individual has a different number of observations. I would like to restructure my data so that each individual has the same number of observations. To do that, I want to find the individual with the most observations and pad the others with extra rows using LOCF (last observation carried forward), adding as many rows as each individual is missing.
For example:
# simulate data structure
d <- data.frame(
id = c(1,1,1,2,2,3,3,3,3,3),
value = c(10,11,12,5,9,55,14,12,20,7) )
Here individual 3 has the most observations (count = 5). I would like to add two lines for individual 1 (with 12 for value) and three lines for individual 2 (with 9 for value).
Any ideas?
Best wishes and thank you.

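If you want to carry forward the last value for each individual, you could do
# number the observations within each id
d$seq <- ave(d$id, d$id, FUN = seq_along)
# last value per id, cross-joined with every seq up to the largest group size
# (by = NULL makes merge() return the Cartesian product)
last_by_seq <- merge(
  aggregate(value ~ id, data = d, FUN = tail, 1),
  data.frame(seq = 1:max(table(d$id))),
  by = NULL
)
# keep every id/seq combination, then fill the gaps with the last value
d <- merge(d, last_by_seq, by = c("id", "seq"), all.y = TRUE)
d$value <- ifelse(is.na(d$value.x), d$value.y, d$value.x)
d <- d[, !grepl("^value\\.", colnames(d))]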
id seq value
1 1 1 10
2 1 2 11
3 1 3 12
4 1 4 12
5 1 5 12
6 2 1 5
7 2 2 9
8 2 3 9
9 2 4 9
10 2 5 9
11 3 1 55
12 3 2 14
13 3 3 12
14 3 4 20
15 3 5 7
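The same padding can also be sketched without any merge, by splitting on id and repeating each group's last value. This is only an illustration in base R; the names n_max and padded are made up for the example and are not part of the answer above.

```r
# simulate the data structure from the question
d <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  value = c(10, 11, 12, 5, 9, 55, 14, 12, 20, 7)
)

# size of the largest group (hypothetical helper name)
n_max <- max(table(d$id))

# pad every id up to n_max rows by repeating its last value
padded <- do.call(rbind, lapply(split(d, d$id), function(g) {
  n_missing <- n_max - nrow(g)
  data.frame(
    id    = rep(g$id[1], n_max),
    value = c(g$value, rep(g$value[nrow(g)], n_missing))
  )
}))
rownames(padded) <- NULL
padded
```

This assumes the rows of each id are already in the order in which the last value should be taken.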

Here's a tidyverse solution. If we create a variable holding the within-id observation number with seq_along, we can use complete and fill to expand the table and fill in the missing values.
library(dplyr)
library(tidyr)
d |>
  group_by(id) |>
  mutate(n = seq_along(value)) |>
  ungroup() |>
  complete(id, n) |>
  fill(value) |>
  select(-n)
# A tibble: 15 × 2
id value
<dbl> <dbl>
1 1 10
2 1 11
3 1 12
4 1 12
5 1 12
6 2 5
7 2 9
8 2 9
9 2 9
10 2 9
11 3 55
12 3 14
13 3 12
14 3 20
15 3 7

Related

sum column by ID [duplicate]

This question already has an answer here:
Sum rows based on ID
(1 answer)
Closed 11 months ago.
I have a data frame like this one with several rows for the same subject and with different durations :
ID Duration
1 10
2 15
2 10
3 2
3 5
3 6
I would like to sum up all the durations per subject and put it in a new column to get a data frame like this one:
ID Duration Sum_duration
1 10 10
2 15 25
2 10 25
3 2 13
3 5 13
3 6 13
I thought of using these functions:
df%>%group_by(id) and colSums
but I don't know how to use them in my case.
Thanks in advance for your help
Using dplyr
library(dplyr)
df %>% group_by(ID) %>% mutate(Sum_duration = sum(Duration))
ID Duration Sum_duration
<dbl> <dbl> <dbl>
1 1 10 10
2 2 15 25
3 2 10 25
4 3 2 13
5 3 5 13
6 3 6 13
Data
df = data.frame('ID' = c(1,2,2,3,3,3), "Duration" = c(10,15,10,2,5,6))
Using ave
df$Sum_duration=ave(df$Duration,df$ID,FUN=sum)
ID Duration Sum_duration
1 1 10 10
2 2 15 25
3 2 10 25
4 3 2 13
5 3 5 13
6 3 6 13
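To see what ave is doing here, the same column can be built in two explicit steps: compute one total per ID, then look that total up for every row. This is a minimal base R sketch; the intermediate name totals is only illustrative.

```r
df <- data.frame(ID = c(1, 2, 2, 3, 3, 3), Duration = c(10, 15, 10, 2, 5, 6))

# step 1: one total per ID, returned with the IDs as names
totals <- tapply(df$Duration, df$ID, sum)

# step 2: look up each row's total by its ID
df$Sum_duration <- as.numeric(totals[as.character(df$ID)])
df
```

ave performs both steps in one call, which is why its result has one value per row rather than one per group.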

replace a given value within a column with the next different number in a row in R

I have a data set that will ultimately be about ~30,000 observations. I have formatted a variable in such a way that the numerical values 1:4 are of interest, while the value 5 is a place holder and was not able to be collected by our testing instrument for one reason or another (not worried about why or missingness etc).
I am looking to turn any observation of 5, or series of observations of 5, into the next number in the observations. As can be seen in the example data set below, the first four observations have the number 5 while the next four observations are the number 4. In this situation I would like the first 4 observations to be changed from 5 to 4.
Note that after the 8th observation another series of 5s occurs, followed by a series of 3s. In this case the 5s should be changed to 3s.
In the code block below I have provided an example of what the current data look like, delineated by the column "Current." I have also provided a column of the desired output, delineated by the column name "Desired." The obs variable was helpful to create just to show the row number of the changes in values for the case of this post.
df <- data.frame(Current = c(5,5,5,5,4,4,4,4,5,5,3,3,3,5,3,3,5,5,2,5,5,5,1),
Desired = c(4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,2,2,2,1,1,1,1))
df$obs = seq(1,nrow(df), by = 1)
You could use
library(tidyr)
library(dplyr)
df %>%
mutate(new_column = na_if(Current, 5)) %>%
fill(new_column, .direction = "up")
This returns
Current Desired new_column
1 5 4 4
2 5 4 4
3 5 4 4
4 5 4 4
5 4 4 4
6 4 4 4
7 4 4 4
8 4 4 4
9 5 3 3
10 5 3 3
11 3 3 3
12 3 3 3
13 3 3 3
14 5 3 3
15 3 3 3
16 3 3 3
17 5 2 2
18 5 2 2
19 2 2 2
20 5 1 1
21 5 1 1
22 5 1 1
23 1 1 1
We use dplyr's na_if function to convert the 5 into missing values.
Next we use tidyr's fill function to replace the NA's by the following values.
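If tidyr is not available, the same idea (turn the placeholder into NA, then take the next non-missing value) can be sketched in base R. The sketch assumes the vector does not end in the placeholder, as in the example data; the names x, next_idx and filled are illustrative only.

```r
current <- c(5, 5, 5, 5, 4, 4, 4, 4, 5, 5, 3, 3, 3, 5, 3, 3, 5, 5, 2, 5, 5, 5, 1)

# treat the placeholder 5 as missing
x <- replace(current, current == 5, NA)

# for every position, find the index of the next non-missing value:
# missing positions get Inf, then a running minimum is taken from the right
next_idx <- rev(cummin(rev(ifelse(is.na(x), Inf, seq_along(x)))))

# pick the value at that index
filled <- x[next_idx]
filled
```

The running minimum from the right plays the role of fill(.direction = "up"): each missing position borrows the value of the closest observed position below it.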
You can use the following solution. I made use of the zoo::na.locf function, which takes the most recent non-NA value and carries it forward over the NAs that follow. To fit this to your data set, I first replaced all values equal to 5 with NA, then reversed the vector, carried the values forward, and reversed the result back to its original order:
library(dplyr)
library(zoo)
df %>%
mutate(Desired2 = ifelse(Current == 5, NA, Current),
Desired2 = rev(na.locf(rev(Desired2))))
Current Desired Desired2
1 5 4 4
2 5 4 4
3 5 4 4
4 5 4 4
5 4 4 4
6 4 4 4
7 4 4 4
8 4 4 4
9 5 3 3
10 5 3 3
11 3 3 3
12 3 3 3
13 3 3 3
14 5 3 3
15 3 3 3
16 3 3 3
17 5 2 2
18 5 2 2
19 2 2 2
20 5 1 1
21 5 1 1
22 5 1 1
23 1 1 1

How to break ties in a ranking with gaps in ranking [duplicate]

This question already has answers here:
Increment by one to each duplicate value
(4 answers)
Closed 1 year ago.
Say that I have these data:
data <- data.frame(orig=c(1,5,5,5,14,18,18,25))
orig
1 1
2 5
3 5
4 5
5 14
6 18
7 18
8 25
I would like to create the want column:
orig want
1 1 1
2 5 5
3 5 6
4 5 7
5 14 14
6 18 18
7 18 19
8 25 25
This column takes orig and copies its value, but breaks ties if they exist. What I am trying to do is to re-create the rankings so that there are no ties and the ties are broken based on the order of the rows in the dataset. If not for the spaces in the rankings (jump from 1 to 5, etc.), I could use
library(tidyverse)
data %>% mutate(test = rank(orig, ties.method="min"))
But this of course doesn't get me what I want:
orig test
1 1 1
2 5 2
3 5 2
4 5 2
5 14 5
6 18 6
7 18 6
8 25 8
What can I do?
We may add row_number() after grouping
library(dplyr)
data %>%
group_by(orig) %>%
mutate(want = orig + row_number() - 1) %>%
ungroup()
-output
# A tibble: 8 x 2
orig want
<dbl> <dbl>
1 1 1
2 5 5
3 5 6
4 5 7
5 14 14
6 18 18
7 18 19
8 25 25
Or we may simplify this with rowid from data.table
library(data.table)
data %>%
mutate(want = orig + rowid(orig)-1)
A base R option using ave + seq_along
transform(
data,
want = orig + ave(orig, orig, FUN = seq_along) - 1
)
gives
orig want
1 1 1
2 5 5
3 5 6
4 5 7
5 14 14
6 18 18
7 18 19
8 25 25
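One caveat worth checking with any of these approaches: adding the within-tie position only yields unique ranks when the gap after a tied value is at least as large as the number of ties. Below is a small base R sketch of that check; the vector tight is a made-up example, not data from the question.

```r
data <- data.frame(orig = c(1, 5, 5, 5, 14, 18, 18, 25))
data$want <- data$orig + ave(data$orig, data$orig, FUN = seq_along) - 1

# the gaps are wide enough here, so no rank appears twice
anyDuplicated(data$want) == 0

# hypothetical counter-example: two 5s followed directly by a 6
tight <- c(5, 5, 6)
tight_want <- tight + ave(tight, tight, FUN = seq_along) - 1
tight_want                      # the second 5 becomes 6 and collides
anyDuplicated(tight_want) > 0
```

If the check fails on real data, the ties need a different rule, for example re-ranking the whole column.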

dplyr solution to split dataset, but keep IDs in same splits

I'm looking for a dplyr or tidyr solution to split a dataset into n chunks. However, I do not want to have any single ID go into multiple chunks. That is, each ID should appear in only one chunk.
For example, imagine "test" below is an ID variable, and the dataset has many other columns.
test<-data.frame(id= c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
val = 1:16)
out <- test %>% select(id) %>% ntile(n = 3)
out
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
The ID=4 would end up in chunks 1 and 2. I am wondering how to code this so that all ID=4 end up in the same chunk (doesn't matter which one). I looked at the split function but could not find a way to do this.
The desired output would be something like
test[which(out==1),]
returning
id val
1 1 1
2 2 2
3 3 3
4 4 4
5 4 5
6 4 6
7 4 7
8 4 8
Then if I wanted to look at the second chunk, I would call something like test[which(out==2),], and so on up to out==n. I only want to deal with one chunk at a time. I don't need to create all n chunks simultaneously.
You can create a tibble and use mutate to add a chunk column. Note that calling ntile on the raw id column still splits id 4 across chunks 1 and 2:
library(dplyr)
test <- tibble(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
               value = 1:16)
out <- test %>%
mutate(new_column = ntile(id,3))
out
# A tibble: 16 x 3
id value new_column
<dbl> <int> <int>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 4 5 1
6 4 6 1
7 4 7 2
8 4 8 2
9 6 9 2
10 7 10 2
11 8 11 2
12 9 12 3
13 9 13 3
14 9 14 3
15 9 15 3
16 10 16 3
Or, following Frank's comment, you could run ntile on the distinct values of id and then join the original table back on id, which keeps every id in a single chunk:
test <- tibble(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
               value = 1:16)
out <- test %>%
distinct(id) %>%
mutate(new_column = ntile(id,3)) %>%
right_join(test, by = "id")
out
# A tibble: 16 x 3
id new_column value
<dbl> <int> <int>
1 1 1 1
2 2 1 2
3 3 1 3
4 4 2 4
5 4 2 5
6 4 2 6
7 4 2 7
8 4 2 8
9 6 2 9
10 7 2 10
11 8 3 11
12 9 3 12
13 9 3 13
14 9 3 14
15 9 3 15
16 10 3 16
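The "chunk the distinct ids, then map back" idea can also be sketched in base R. This is an illustration only; n_chunks, ids and chunk_of_id are hypothetical names, and the chunks are balanced by number of ids rather than by number of rows.

```r
test <- data.frame(id  = c(1, 2, 3, 4, 4, 4, 4, 4, 6, 7, 8, 9, 9, 9, 9, 10),
                   val = 1:16)
n_chunks <- 3

# assign a chunk to each distinct id, in order of first appearance
ids <- unique(test$id)
chunk_of_id <- ceiling(seq_along(ids) * n_chunks / length(ids))

# map every row to the chunk of its id
test$chunk <- chunk_of_id[match(test$id, ids)]

# look at one chunk at a time
test[test$chunk == 2, ]
```

Because the chunk is looked up per id, all rows of id 4 necessarily land in the same chunk.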

How to generate an uneven sequence of numbers in R

Here's an example data frame:
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
I want to generate a sequence of numbers according to the number of observations of y per x group (e.g. there are 2 observations of y for x=1). I want the sequence to be continuously increasing and jumps by 2 after each x group.
The desired output for this example would be:
1,2,5,6,7,10,11,14,17,20,21,22,25,26
How can I do this simply in R?
To expand on my comment: the groupings can be arbitrary; you simply need to recode them as consecutive integers in order of appearance. There are a few ways to do this. @akrun has shown that it can be done with the match function, or you can use as.numeric on a factor if that is easier for you to follow.
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
# these are equivalent
df$newx <- as.numeric(factor(df$x, levels=unique(df$x)))
df$newx <- match(df$x, unique(df$x))
Since you now have a "new" releveling which is sequential, we can use the logic that was discussed in the comments.
df$newNumber <- 1:nrow(df) + (df$newx-1)*2
For this example, this will result in the following dataframe:
x y newx newNumber
1 1 1 1
1 2 1 2
2 3 2 5
2 4 2 6
2 6 2 7
3 3 3 10
3 7 3 11
4 8 4 14
5 6 5 17
6 4 6 20
6 3 6 21
6 7 6 22
9 3 7 25
9 2 7 26
where df$newNumber is the output you wanted.
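The reason this works is that every row adds 1 to the sequence, and every new x group adds 2 more on top of that. A compact base R sketch with the extra gap pulled out as a parameter (the name gap is only illustrative) looks like this:

```r
df <- data.frame(x = c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 9, 9),
                 y = c(1, 2, 3, 4, 6, 3, 7, 8, 6, 4, 3, 7, 3, 2))

gap <- 2  # extra positions skipped whenever a new x group starts

# consecutive group number in order of appearance: 1, 1, 2, 2, 2, 3, ...
group_no <- match(df$x, unique(df$x))

# row number plus the accumulated gaps of all earlier groups
df$newNumber <- seq_len(nrow(df)) + (group_no - 1) * gap
df$newNumber
```

This assumes the rows of each x group are contiguous, as they are in the example.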
To create the sequence 0,0,4,4,4,9,..., what you're doing is taking the minimum of newNumber within each group and subtracting 1. The easiest way to do this is with dplyr.
library(dplyr)
df %>%
group_by(x) %>%
mutate(newNumber2 = min(newNumber) -1)
Which will have the output:
Source: local data frame [14 x 5]
Groups: x
x y newx newNumber newNumber2
1 1 1 1 1 0
2 1 2 1 2 0
3 2 3 2 5 4
4 2 4 2 6 4
5 2 6 2 7 4
6 3 3 3 10 9
7 3 7 3 11 9
8 4 8 4 14 13
9 5 6 5 17 16
10 6 4 6 20 19
11 6 3 6 21 19
12 6 7 6 22 19
13 9 3 7 25 24
14 9 2 7 26 24
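For completeness, the same group-wise minimum can be sketched in base R with ave, so no package is needed. The column names follow the answer above; everything else is just the example data.

```r
df <- data.frame(x = c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 9, 9),
                 y = c(1, 2, 3, 4, 6, 3, 7, 8, 6, 4, 3, 7, 3, 2))
df$newx <- match(df$x, unique(df$x))
df$newNumber <- seq_len(nrow(df)) + (df$newx - 1) * 2

# smallest newNumber within each x group, minus 1
df$newNumber2 <- ave(df$newNumber, df$x, FUN = min) - 1
df$newNumber2
```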
