add duplicate rows to R dataframe based on sequence

My dataframe needs to be expanded:
df1 <- structure(list(TotalTime = c(0, 15, 16, 23, 24, 29),
                      Phase = structure(c(1L, 1L, 2L, 2L, 2L, 3L),
                                        .Label = c("A", "B", "C"),
                                        class = "factor")),
                 .Names = c("TotalTime", "Phase"),
                 row.names = c(NA, 6L), class = "data.frame")
df1:
  TotalTime Phase
1         0     A
2        15     A
3        16     B
4        23     B
5        24     B
6        29     C
So that it becomes the following dataframe, with rows duplicated based on TotalTime; TotalTime should be filled in for every number (every second). (I put .. in the example to reduce space, but those rows should be filled with 6, 7, 8, 9 ... 15 etc.):
   TotalTime Phase
1          0     A
2          1     A
3          2     A
4          3     A
5          4     A
6          5     A
..
16        15     A
17        16     B
18        17     B
..               B
24        23     B
25        24     B
26        25     B
27        26     B
28        27     B
29        28     B
30        29     C

Using both the zoo and dplyr packages:
library(dplyr)
library(zoo)
data.frame(TotalTime = 0:max(df1$TotalTime)) %>%
  left_join(df1) %>%
  na.locf()
It first creates a data.frame that has the whole sequence from 0 to 29 (here) and merges it with your data. Then I simply do a "last observation carried forward" imputation on the missing values created by the merge.
It can also be done with the data.table package like this (adapted from another answer):
library(data.table)
df1 <- data.table(df1, key = "TotalTime")
df2 <- data.table(TotalTime = 0:max(df1$TotalTime))
df1[df2, roll = TRUE]
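In newer data.table versions the key can be supplied inline with on= instead of being set beforehand; a minimal sketch of the same rolling join (assuming df1 is still a plain data frame):
library(data.table)
setDT(df1)
# rolling join: every second in the full sequence takes the last preceding Phase
df1[data.table(TotalTime = 0:max(df1$TotalTime)), on = "TotalTime", roll = TRUE]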

You can get it done with dplyr and tidyr:
library(tidyverse)
df1 %>%
  do(data.frame(TotalTime = first(.$TotalTime):last(.$TotalTime))) %>%
  left_join(df1, by = "TotalTime") %>%
  fill(Phase)
Output:
TotalTime Phase
        0     A
        1     A
        2     A
        3     A
        4     A
        5     A
        6     A
        7     A
        8     A
        9     A
       10     A
       11     A
       12     A
       13     A
       14     A
       15     A
       16     B
       17     B
       18     B
       19     B
       20     B
       21     B
       22     B
       23     B
       24     B
       25     B
       26     B
       27     B
       28     B
       29     C
I hope this helps.
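Since do() is now deprecated in dplyr, a more current sketch of the same idea uses tidyr's complete() and full_seq() (assuming TotalTime should advance in steps of 1):
library(dplyr)
library(tidyr)
df1 %>%
  complete(TotalTime = full_seq(TotalTime, period = 1)) %>%  # add the missing seconds as NA rows
  fill(Phase)                                                # carry the last Phase forward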

In case you want to see a base R solution.
phases <- with(aggregate(TotalTime ~ Phase, df1, FUN = min),
               rep(Phase, c(diff(TotalTime),
                            max(df1$TotalTime[df1$Phase == tail(Phase, 1)]) -
                              min(df1$TotalTime[df1$Phase == tail(Phase, 1)]) + 1)))
The main "trick" here is that the second argument of rep can be a vector, which repeats each element of the first argument that many times. That vector is constructed from the differences of the minimum values of each phase, diff(TotalTime), concatenated with the difference of the min and max value (+1) of the final phase level (here, "C"). The minimum values are found with aggregate, and I use with to simplify notation.
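For illustration, a quick example of rep with a vector as its second argument:
rep(c("A", "B", "C"), c(3, 1, 2))
# [1] "A" "A" "A" "B" "C" "C"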
The result can then be fed to data.frame():
data.frame(period = seq_along(phases) - 1, phase = phases)
   period phase
1       0     A
2       1     A
3       2     A
4       3     A
5       4     A
6       5     A
7       6     A
8       7     A
9       8     A
10      9     A
11     10     A
12     11     A
13     12     A
14     13     A
15     14     A
16     15     A
17     16     B
18     17     B
19     18     B
20     19     B
21     20     B
22     21     B
23     22     B
24     23     B
25     24     B
26     25     B
27     26     B
28     27     B
29     28     B
30     29     C

Related

Taking rolling differences of columns in R tibble for arbitrary number of columns

I want to take differences for each pair of consecutive columns but for an arbitrary number of columns. For example...
library(tibble)
df <- as_tibble(data.frame(group = rep(c("a", "b", "c"), each = 4),
                           subgroup = rep(c("adam", "boy", "charles", "david"), times = 3),
                           iter1 = 1:12,
                           iter2 = c(13:22, NA, 24),
                           iter3 = c(25:35, NA)))
I want to calculate the differences by column. I would normally use...
df %>%
  mutate(diff_iter2 = iter2 - iter1,
         diff_iter3 = iter3 - iter2)
But... I'd like to:
accommodate an arbitrary number of columns, and
treat NAs such that:
if the number we're subtracting from is NA, then the result should be NA, e.g. NA - 11 = NA
if the number we're subtracting is NA, then that NA is effectively treated as a 0, e.g. 35 - NA = 35
The result should look like this...
   group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
   <chr> <chr>    <int> <dbl> <int>      <dbl>      <dbl>
 1 a     adam         1    13    25         12         12
 2 a     boy          2    14    26         12         12
 3 a     charles      3    15    27         12         12
 4 a     david        4    16    28         12         12
 5 b     adam         5    17    29         12         12
 6 b     boy          6    18    30         12         12
 7 b     charles      7    19    31         12         12
 8 b     david        8    20    32         12         12
 9 c     adam         9    21    33         12         12
10 c     boy         10    22    34         12         12
11 c     charles     11    NA    35         NA         35
12 c     david       12    24    NA         12         NA
Originally, this df was in long format, but the problem was that I believe the lag() function operates on position within groups, and the groups aren't all the same because some have missing records (hence the NAs in the wider table shown above).
Starting with long format would do but then please assume the records shown above with NA values would not exist in that longer dataframe.
Any help is appreciated.
An option in tidyverse would be to loop across the 'iter' columns other than iter1. For each column, get the previous column's value by replacing the digits in the column name (cur_column()) with that number minus one (str_replace with a function replacement), replace its NA elements with 0 (replace_na) per the OP's logic, and subtract it from the looped column, creating new columns via the prefix in .names ("diff_{.col}", where {.col} is the original column name).
library(dplyr)
library(stringr)
library(tidyr)
df <- df %>%
  mutate(across(iter2:iter3,
                ~ . - replace_na(get(str_replace(cur_column(), '\\d+',
                                                 function(x) as.numeric(x) - 1)), 0),
                .names = 'diff_{.col}'))
Output:
df
# A tibble: 12 × 7
   group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
   <chr> <chr>    <int> <dbl> <int>      <dbl>      <dbl>
 1 a     adam         1    13    25         12         12
 2 a     boy          2    14    26         12         12
 3 a     charles      3    15    27         12         12
 4 a     david        4    16    28         12         12
 5 b     adam         5    17    29         12         12
 6 b     boy          6    18    30         12         12
 7 b     charles      7    19    31         12         12
 8 b     david        8    20    32         12         12
 9 c     adam         9    21    33         12         12
10 c     boy         10    22    34         12         12
11 c     charles     11    NA    35         NA         35
12 c     david       12    24    NA         12         NA
Find the indices ix of the columns whose names start with iter; then take all but the first as df1 and all but the last as df2, and replace the NAs in df2 with 0. Then subtract them and cbind df to that. No packages are used.
ix <- grep("^iter", names(df))
df1 <- df[tail(ix, -1)]   # iter2, iter3 -- the columns we subtract from
df2 <- df[head(ix, -1)]   # iter1, iter2 -- the columns we subtract
df2[is.na(df2)] <- 0      # subtracted NAs are treated as 0
cbind(df, diff = df1 - df2)
giving:
   group subgroup iter1 iter2 iter3 diff.iter2 diff.iter3
1      a     adam     1    13    25         12         12
2      a      boy     2    14    26         12         12
3      a  charles     3    15    27         12         12
4      a    david     4    16    28         12         12
5      b     adam     5    17    29         12         12
6      b      boy     6    18    30         12         12
7      b  charles     7    19    31         12         12
8      b    david     8    20    32         12         12
9      c     adam     9    21    33         12         12
10     c      boy    10    22    34         12         12
11     c  charles    11    NA    35         NA         35
12     c    david    12    24    NA         12         NA
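If you prefer pairing the columns positionally instead of parsing their names, here is a purrr sketch (assuming the iter columns are already in order; the helper names curr, prev, and the diff_ prefix are my own):
library(dplyr)
library(purrr)
library(tidyr)

iters <- grep("^iter", names(df), value = TRUE)
curr  <- tail(iters, -1)   # columns we subtract from
prev  <- head(iters, -1)   # columns we subtract
diffs <- map2_dfc(df[curr], df[prev], ~ .x - replace_na(.y, 0))
names(diffs) <- paste0("diff_", curr)
bind_cols(df, diffs)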

Loop over certain columns to replace NAs with 0 in a dataframe

I have spent a lot of time trying to write a loop to replace NAs with zeros for certain columns in a data frame and have not yet succeeded. I have searched and can't find a similar question.
df <- data.frame(A = c(2, 4, 6, NA, 8, 10),
                 B = c(NA, 10, 12, 14, NA, 16),
                 C = c(20, NA, 22, 24, 26, NA),
                 D = c(30, NA, NA, 32, 34, 36))
df
Gives me:
   A  B  C  D
1  2 NA 20 30
2  4 10 NA NA
3  6 12 22 NA
4 NA 14 24 32
5  8 NA 26 34
6 10 16 NA 36
I want to set NAs to 0 for only columns B and D. Using separate code lines, I could:
df$B[is.na(df$B)] <- 0
df$D[is.na(df$D)] <- 0
However, I want to use a loop because I have many variables in my real data set.
I cannot find a way to loop over only columns B and D so I get:
df
   A  B  C  D
1  2  0 20 30
2  4 10 NA  0
3  6 12 22  0
4 NA 14 24 32
5  8  0 26 34
6 10 16 NA 36
Essentially, I want to apply a loop using a variable list to a data frame:
varlist <- c("B", "D")
How can I loop over only certain columns in the data frame using a variable list to replace NAs with zeros?
Here is a tidyverse approach:
library(tidyverse)
df %>%
  mutate_at(.vars = vars(B, D), .funs = funs(ifelse(is.na(.), 0, .)))
# output:
   A  B  C  D
1  2  0 20 30
2  4 10 NA  0
3  6 12 22  0
4 NA 14 24 32
5  8  0 26 34
6 10 16 NA 36
Basically, you say that variables B and D should be changed by the defined function, where . corresponds to the appropriate column.
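Note that mutate_at() and funs() have since been superseded; assuming varlist = c("B", "D") as defined above, a sketch of the current across() equivalent would be:
library(dplyr)
library(tidyr)
df %>%
  mutate(across(all_of(varlist), ~ replace_na(.x, 0)))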
Here's a base R one-liner
df[, varlist][is.na(df[, varlist])] <- 0
Using the zoo package, we can fill the selected columns:
library(zoo)
df[varlist] <- na.fill(df[varlist], 0)
df
   A  B  C  D
1  2  0 20 30
2  4 10 NA  0
3  6 12 22  0
4 NA 14 24 32
5  8  0 26 34
6 10 16 NA 36
In base R we can have:
df[varlist] <- lapply(df[varlist], function(x) { x[is.na(x)] <- 0; x })
df
   A  B  C  D
1  2  0 20 30
2  4 10 NA  0
3  6 12 22  0
4 NA 14 24 32
5  8  0 26 34
6 10 16 NA 36
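Since the question explicitly asks for a loop, a minimal for-loop sketch over varlist:
varlist <- c("B", "D")
for (v in varlist) {
  df[[v]][is.na(df[[v]])] <- 0   # zero out NAs in this column only
}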

selecting middle n rows in R

I have a data.table in R, say df.
library(data.table)
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- c(sample(c(0:100), 20, replace = TRUE))
df <- data.table(row.number, a, b)
df
   row.number a  b
1           1 A 14
2           2 A 59
3           3 A 39
4           4 A 22
5           5 A 75
6           6 A 89
7           7 A 11
8           8 A 88
9           9 A 22
10         10 A  6
11         11 B 37
12         12 B 42
13         13 B 39
14         14 B  8
15         15 B 74
16         16 B 67
17         17 B 18
18         18 B 12
19         19 B 56
20         20 B 21
I want to take n rows (say 10) from the middle after arranging the records in increasing order of column b.
Use setorder to sort and .N to filter:
setorder(df, b)[(.N/2 - 10/2):(.N/2 + 10/2 - 1), ]
    row.number a  b
 1:         11 B 36
 2:          5 A 38
 3:          8 A 41
 4:         18 B 43
 5:          1 A 50
 6:         12 B 51
 7:         15 B 54
 8:          3 A 55
 9:         20 B 59
10:          4 A 60
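A generalized sketch with n as a variable (assuming .N - n is even, so the cut is symmetric):
library(data.table)
n <- 10
setorder(df, b)                    # sort by b in place
df[(.N - n) %/% 2 + seq_len(n)]    # keep the middle n rows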
You could use the following code:
library(data.table)
set.seed(9876) # for reproducibility

# your data
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- c(sample(c(0:100), 20, replace = TRUE))
df <- data.table(row.number, a, b)
df

# define how many to select and store in n
n <- 10
# calculate how many to cut off at start and end
n_not <- (nrow(df) - n) / 2
# use data.table's setorder to arrange based on column b
setorder(df, b)
# select the rows wanted based on n
df[(n_not + 1):(nrow(df) - n_not), ]
Please let me know whether this is what you want.

How to join tables and generate sums of columns?

I have several tables (two in this particular example) with the same structure. I would like to join on ID_Position and ID_Name and generate the sum of January and February in the output table (there might be some NAs in both columns).
ID_Position <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Position <- c("A", "B", "C", "D", "E", "H", "I", "J", "X", "W")
ID_Name <- c(11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
Name <- c("Michael", "Tobi", "Chris", "Hans", "Likas", "Martin", "Seba", "Li", "Sha", "Susi")
jan <- c(10, 20, 30, 22, 23, 2, 22, 24, 26, 28)
feb <- c(10, 30, 20, 12, NA, 3, NA, 22, 24, 26)
df1 <- data.frame(ID_Position, Position, ID_Name, Name, jan, feb)

ID_Position <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Position <- c("A", "B", "C", "D", "E", "H", "I", "J", "X", "W")
ID_Name <- c(11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
Name <- c("Michael", "Tobi", "Chris", "Hans", "Likas", "Martin", "Seba", "Li", "Sha", "Susi")
jan <- c(10, 20, 30, 22, NA, NA, 22, 24, 26, 28)
feb <- c(10, 30, 20, 12, 23, 3, 3, 22, 24, 26)
df2 <- data.frame(ID_Position, Position, ID_Name, Name, jan, feb)
I tried the inner and the full join, but that doesn't seem to work as I desire:
library(plyr)
test <- join(df1, df2, by = c("ID_Position", "ID_Name"), type = "inner", match = "all")
Desired output:
ID_Position Position ID_Name    Name jan feb
          1        A      11 Michael  20  20
          2        B      12    Tobi  40  60
          3        C      13   Chris  60  40
          4        D      14    Hans  44  24
          5        E      15   Likas  23  23
          6        H      16  Martin   2   6
          7        I      17    Seba  44  22
          8        J      18      Li  48  44
          9        X      19     Sha  52  48
         10        W      20    Susi  56  52
Your desired output doesn't seem entirely correct, but here's an example of how you can do this efficiently using a data.table binary join, which allows you to run functions while joining via the by = .EACHI option:
library(data.table)
setkey(setDT(df1), ID_Position, ID_Name, Name)
setkey(setDT(df2), ID_Position, ID_Name, Name)
df2[df1, .(jan = sum(jan, i.jan, na.rm = TRUE),
           feb = sum(feb, i.feb, na.rm = TRUE)),
    by = .EACHI]
#     ID_Position ID_Name    Name jan feb
#  1:           1      11 Michael  20  20
#  2:           2      12    Tobi  40  60
#  3:           3      13   Chris  60  40
#  4:           4      14    Hans  44  24
#  5:           5      15   Likas  46   0
#  6:           6      16  Martin   0   6
#  7:           7      17    Seba  44   0
#  8:           8      18      Li  48  44
#  9:           9      19     Sha  52  48
# 10:          10      20    Susi  56  52
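The same aggregation can be sketched in dplyr by stacking the tables and summing per key (.groups = "drop" assumes dplyr >= 1.0):
library(dplyr)
bind_rows(df1, df2) %>%
  group_by(ID_Position, Position, ID_Name, Name) %>%
  summarise(jan = sum(jan, na.rm = TRUE),
            feb = sum(feb, na.rm = TRUE),
            .groups = "drop")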

Forcing unique values before casting (pivoting) in R

I have a data frame as follows:
Identifier V1 Location V2
         1 12        A 21
         1 12        B 24
         2 20        B 15
         2 20        C 18
         2 20        B 23
         3 43        A 10
         3 43        B 17
         3 43        A 18
         3 43        B 20
         3 43        C 25
         3 43        A 30
I’d like to re-cast it with a single row for each Identifier and one column for each value in the current location column. I don’t care about the data in V1 but I need the data in V2 and these will become the values in the new columns.
Note that for the Location column there are repeated values for Identifiers 2 and 3.
I ASSUME that the first task is to make the values in the Location column unique.
I used the following (the data frame is called “Test”)
L <- length(Test$Identifier)
for (i in 1:L) {
  temp <- Test$Location[Test$Identifier == i]
  temp1 <- make.unique(as.character(temp), sep = "-")
  levels(Test$Location) <- c(levels(Test$Location), temp1)
  Test$Location[Test$Identifier == i] <- temp1
}
This produces
Identifier V1 Location V2
         1 12        A 21
         1 12        B 24
         2 20        B 15
         2 20        C 18
         2 20      B-1 23
         3 43        A 10
         3 43        B 17
         3 43      A-1 18
         3 43      B-1 20
         3 43        C 25
         3 43      A-2 30
Then using
cast(Test, Identifier ~ Location)
gives
Identifier  A  B  C B-1 A-1 A-2
         1 21 24 NA  NA  NA  NA
         2 NA 15 18  23  NA  NA
         3 10 17 25  20  18  30
And this is more or less what I want.
My questions are:
1. Is this the right way to handle the problem?
2. I know R people don't use the "for" construction, so is there a more R-elegant (relegant?) way to do this? I should mention that the real data set has over 160,000 rows and starts with over 50 unique values in the Location vector, and the function takes just over an hour to run. Anything quicker would be good. I should also mention that the cast function had to be run on 20-30k rows of the output at a time despite increasing the memory limit; all the cast outputs were then merged.
3. Is there a way to sort the columns in the output so that (here) they are A, A-1, A-2, B, B-1, C?
Please be gentle with your replies!
Usually your original format is much better than your desired result. However, you can do this easily using the split-apply-combine approach, e.g., with package plyr:
DF <- read.table(text="Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30", header=TRUE, stringsAsFactors=FALSE)
#note that I make sure that there are only characters and not factors
#use as.character if you have factors
library(plyr)
DF <- ddply(DF, .(Identifier), transform, Loc2 = make.unique(Location, sep="-"))
library(reshape2)
DFwide <- dcast(DF, Identifier ~Loc2, value.var="V2")
#  Identifier  A  B B-1  C A-1 A-2
#1          1 21 24  NA NA  NA  NA
#2          2 NA 15  23 18  NA  NA
#3          3 10 17  20 25  18  30
If column order is important to you (usually it isn't):
DFwide[, c(1, order(names(DFwide)[-1])+1)]
#  Identifier  A A-1 A-2  B B-1  C
#1          1 21  NA  NA 24  NA NA
#2          2 NA  NA  NA 15  23 18
#3          3 10  18  30 17  20 25
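With current tidyverse versions, the same result can be sketched with group_by() plus pivot_wider() (the column name Loc2 just mirrors the answer above):
library(dplyr)
library(tidyr)
DF %>%
  group_by(Identifier) %>%
  mutate(Loc2 = make.unique(Location, sep = "-")) %>%
  ungroup() %>%
  pivot_wider(id_cols = Identifier, names_from = Loc2, values_from = V2)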
For reference, here's the equivalent of @Roland's answer in base R.
Use ave to create the unique "Location" values....
DF$Location <- with(DF, ave(Location, Identifier,
                            FUN = function(x) make.unique(x, sep = "-")))
... and reshape to change the structure of your data.
## If you want both V1 and V2 in your "wide" dataset
## "dcast" can't directly do this--you'll need `recast` if you
## wanted both columns, which first `melt`s and then `dcast`s....
reshape(DF, direction = "wide", idvar = "Identifier", timevar = "Location")
## If you only want V2, as you indicate in your question
reshape(DF, direction = "wide", idvar = "Identifier",
timevar = "Location", drop = "V1")
#   Identifier V2.A V2.B V2.C V2.B-1 V2.A-1 V2.A-2
# 1          1   21   24   NA     NA     NA     NA
# 3          2   NA   15   18     23     NA     NA
# 6          3   10   17   25     20     18     30
Reordering the columns can be done the same way that @Roland suggested.
