Shifting rows up in columns and flush remaining ones - r

I have a problem with shifting the values in each column up by one row within each group. When a row ends up completely NA, I would like to drop it (see the expected output below). My current approach, however, still keeps the second row of each group.
Here is my approach
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
so using this approach
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see, I still have the second row in each group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!

If there's at most one valid value per gr, you can use na.omit then take the first value from it:
library(dplyr)
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5

You can do it with dplyr and tidyr (fill() comes from tidyr) like this:
library(dplyr)
library(tidyr)
data$ind <- rep(c(1, 2), length.out = nrow(data))
data %>% fill(A, B, C) %>% filter(ind == 2) %>% mutate(ind = NULL)
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.

One more solution, using data.table together with zoo::na.locf:
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5
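
For comparison, here is a base R sketch of the same collapse; it is not one of the original answers. Like the na.omit answer above, it assumes each column has at most one non-NA value per gr (otherwise only the first is kept). The helper name first_non_na is made up for illustration. na.action = na.pass is needed because the formula method of aggregate() would otherwise drop every row containing an NA before grouping.

```r
data <- data.frame(gr = rep(1:3, each = 2),
                   A = c(1, NA, 2, NA, 4, NA),
                   B = c(NA, 1, NA, 3, NA, 7),
                   C = c(1, NA, 4, NA, 5, NA))

# Hypothetical helper: first non-NA value of a vector (NA if there is none).
first_non_na <- function(x) na.omit(x)[1]

# Collapse each group to one row; na.pass keeps the NA rows for FUN to handle.
collapsed <- aggregate(. ~ gr, data = data, FUN = first_non_na,
                       na.action = na.pass)
collapsed
```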

Related

Expanding Data Frame with cumsum in R

I've got a data frame with historic F1 data that looks like this:
Driver   Race number  Position  Number of Career Podiums
Farina   1            1         1
Fagioli  1            2         1
Parnell  1            3         1
Fangio   2            1         1
Ascari   2            2         1
Chiron   2            3         1
...      ...          ...       ...
Moss     47           1         4
Fangio   47           2         23
Kling    47           3         2
Now I want to extend it so that every race contains not only the top 3 of that specific race but also everyone who has had a top 3 finish before, so that I can create a racing bar chart. The final data frame should look like this:
Driver   Race number  Position  Number of Career Podiums
Farina   1            1         1
Fagioli  1            2         1
Parnell  1            3         1
Fangio   2            1         1
Ascari   2            2         1
Chiron   2            3         1
Farina   2            NA        1
Fagioli  2            NA        1
Parnell  2            NA        1
Parsons  3            1         1
Holland  3            2         1
Rose     3            3         1
Farina   3            NA        1
Fagioli  3            NA        1
Parnell  3            NA        1
Fangio   3            NA        1
Ascari   3            NA        1
Chiron   3            NA        1
Is there an easy way to do this? I couldn't find anyone with a similar problem on Google.
If I understand your problem correctly, you only have observations for the top 3 drivers of every race, but you want observations for every driver that has ever achieved a top 3 position in your dataset, across all races.
For example in the following dataset, driver D only has an observation for the second race where they achieved the first place, but not the other races:
dat <- data.frame(driver = c("A", "B", "C", "D", "A", "B", "B", "A", "C"),
race_number = rep(1:3, each = 3),
position = rep(1:3, 3))
print(dat)
driver race_number position
1 A 1 1
2 B 1 2
3 C 1 3
4 D 2 1
5 A 2 2
6 B 2 3
7 B 3 1
8 A 3 2
9 C 3 3
To add entries for driver D for races 1 and 3, you could use tidyr's expand() function or, if you want to use base R, you could achieve the same using expand.grid() and unique(). This leaves you with a data frame containing all possible combinations of drivers and race numbers. Afterwards you simply have to left or right join the result with the initial data frame.
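
The base R route mentioned above can be sketched as follows. This is only an illustration under the assumption that the toy dat from this answer is used; the names grid and full are made up here, and merge(..., all.x = TRUE) plays the role of the left join.

```r
dat <- data.frame(driver = c("A", "B", "C", "D", "A", "B", "B", "A", "C"),
                  race_number = rep(1:3, each = 3),
                  position = rep(1:3, 3))

# All driver x race combinations (4 drivers x 3 races = 12 rows).
grid <- expand.grid(driver = unique(dat$driver),
                    race_number = unique(dat$race_number),
                    stringsAsFactors = FALSE)

# Left join: keep every combination, fill position with NA where no podium.
full <- merge(grid, dat, by = c("driver", "race_number"), all.x = TRUE)
full <- full[order(full$race_number, full$driver), ]
full
```

Rows that exist only in grid, such as driver D in races 1 and 3, come back with position set to NA.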
A solution using standard tidyverse packages tidyr and dplyr could look like this:
library(dplyr)
library(tidyr)
dat %>%
expand(driver, race_number) %>%
left_join(dat)
# A tibble: 12 x 4
driver race_number position previous_podium_positions
<chr> <int> <int> <dbl>
1 A 1 1 1
2 A 2 2 2
3 A 3 2 3
4 B 1 2 1
5 B 2 3 2
6 B 3 1 3
7 C 1 3 1
8 C 2 NA NA
9 C 3 3 2
10 D 1 NA NA
11 D 2 1 1
12 D 3 NA NA
Note that the "new" observations will naturally have NAs for the position and the number of previous podium positions. The latter can be added with the following approach, which flags every row with an actual podium finish (a non-NA position) and then takes the cumulative sum per driver:
dat %>%
expand(driver, race_number) %>%
left_join(dat) %>%
arrange(race_number) %>%
mutate(previous_podium_positions = ifelse(is.na(position), 0, 1)) %>%
group_by(driver) %>%
mutate(previous_podium_positions = cumsum(previous_podium_positions))
Joining, by = c("driver", "race_number")
# A tibble: 12 x 4
# Groups: driver [4]
driver race_number position previous_podium_positions
<chr> <int> <int> <dbl>
1 A 1 1 1
2 B 1 2 1
3 C 1 3 1
4 D 1 NA 0
5 A 2 2 2
6 B 2 3 2
7 C 2 NA 1
8 D 2 1 1
9 A 3 2 3
10 B 3 1 3
11 C 3 3 2
12 D 3 NA 1
I hope this helped. Just a brief disclaimer, these may very well be not the most resource or time-efficient solutions but rather the fastest/easiest way to solve the issue.

Set values of a column to NA after a given point

I have a dataset like this:
ID NUMBER X
1 5 2
1 3 4
1 6 3
1 2 5
2 7 3
2 3 5
2 9 3
2 4 2
and I'd like to set the values of variable X to NA once the variable NUMBER increases (even if it decreases again afterwards), for each ID, to obtain:
ID NUMBER X
1 5 2
1 3 4
1 6 NA
1 2 NA
2 7 3
2 3 5
2 9 NA
2 4 NA
How can I do it?
Thanks for your help!
Surely not the most elegant solution, but it is quite intuitive:
library(data.table)
setDT(d)
d[, n := ifelse(NUMBER > shift(NUMBER, 1, type = "lag"), 1, 0), by = ID]
d[is.na(n), n := 0]
d[, n := cumsum(n), by=ID]
d[n>0, X := NA ]
d
ID NUMBER X n
1: 1 5 2 0
2: 1 3 4 0
3: 1 6 NA 1
4: 1 2 NA 1
5: 2 7 3 0
6: 2 3 5 0
7: 2 9 NA 1
8: 2 4 NA 1
You can also do this with the dplyr package. If your data frame is called df, you can use this code:
library(dplyr)
df %>% group_by(ID) %>%
mutate(X = c(X[1:min(which(diff(NUMBER) > 0))], rep(NA, length(X) - min(which(diff(NUMBER) > 0))))) %>%
as.data.frame()
I first group by ID, then use diff and which to find the position of the first increase in NUMBER; X is kept up to that position and set to NA afterwards. Note that this assumes every ID has at least one increase.
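
The diff-based idea can also be sketched in base R. This is an illustrative variant, not taken from either answer: it counts the increases seen so far within each ID via cumsum, so an ID without any increase is simply left untouched. The data frame d and the name increased are assumptions made for the example.

```r
d <- data.frame(ID = rep(1:2, each = 4),
                NUMBER = c(5, 3, 6, 2, 7, 3, 9, 4),
                X = c(2, 4, 3, 5, 3, 5, 3, 2))

# Per ID: number of increases in NUMBER seen up to and including each row.
increased <- ave(d$NUMBER, d$ID,
                 FUN = function(v) cumsum(c(FALSE, diff(v) > 0)))

# Blank out X from the first increase onwards.
d$X[increased > 0] <- NA
d
```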

R - Loop through a data table with combination of dcast of sum

I have a table similar to this one, but with more columns. I am trying to create a new table that shows, for each ID, the sum of Counts and the sum of Value for each Type.
df
ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3
I am able to do it for one single column by using
dcast(df[,j=list(sum(Counts,na.rm = TRUE)),by = c("ID","Type")],ID ~ paste(Type,"Counts",sep="_"))
However, I want to loop over every column of the data table, and so far without success: it always adds up all the rows. I have tried:
sum(df[[i]],na.rm = TRUE)
sum(names(df)[[i]] == "",na.rm = TRUE)
sum(df[[names(df)[i]]],na.rm = TRUE)
j = list(apply(df[,c(3:4),with=FALSE],2,function(x) sum(x,na.rm = TRUE)))
I want a new table like this:
ID A_Counts B_Counts A_Value B_Value
1 1 2 5 4
2 5 3 5 6
My own table has more columns, but the idea is the same. Am I over-complicating it, or is there an easy trick I am not aware of? Please help me. Thank you!
You have to melt your data first, and then dcast it:
library(reshape2)
df2 <- melt(df,id.vars = c("ID","Type"))
# ID Type variable value
# 1 1 A Counts 1
# 2 1 B Counts 2
# 3 2 A Counts 2
# 4 2 A Counts 3
# 5 2 B Counts 1
# 6 2 B Counts 2
# 7 1 A Value 5
# 8 1 B Value 4
# 9 2 A Value 1
# 10 2 A Value 4
# 11 2 B Value 3
# 12 2 B Value 3
dcast(df2,ID ~ Type + variable,fun.aggregate=sum)
# ID A_Counts A_Value B_Counts B_Value
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Another solution with base functions only:
df3 <- aggregate(cbind(Counts,Value) ~ ID + Type,df,sum)
# ID Type Counts Value
# 1 1 A 1 5
# 2 2 A 5 5
# 3 1 B 2 4
# 4 2 B 3 6
reshape(df3, idvar='ID', timevar='Type',direction="wide")
# ID Counts.A Value.A Counts.B Value.B
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Data
df <- read.table(text ="ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3",stringsAsFactors=FALSE,header=TRUE)

Replace NA in column by value corresponding to column name in separate table

I have a data frame which looks like this
data <- data.frame(ID = c(1,2,3,4,5),A = c(1,4,NA,NA,4),B = c(1,2,NA,NA,NA),C= c(1,2,3,4,NA))
> data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 NA NA 3
4 4 NA NA 4
5 5 4 NA NA
I have a mapping file as well which looks like this
reference <- data.frame(Names = c("A","B","C"),Vals = c(2,5,6))
> reference
Names Vals
1 A 2
2 B 5
3 C 6
I want my data file to be modified using the reference file in a way which would yield me this final data frame
> final_data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 2 5 3
4 4 2 5 4
5 5 4 5 6
What is the fastest way I can achieve this in R?
We can do this with Map
data[as.character(reference$Names)] <- Map(function(x,y) replace(x,
is.na(x), y), data[as.character(reference$Names)], reference$Vals)
data
# ID A B C
#1 1 1 1 1
#2 2 4 2 2
#3 3 2 5 3
#4 4 2 5 4
#5 5 4 5 6
EDIT: Based on @thelatemail's comments.
NOTE: NO external packages used
As we are looking for an efficient solution, another approach is set() from data.table:
library(data.table)
setDT(data)
v1 <- as.character(reference$Names)
for(j in seq_along(v1)){
set(data, i = which(is.na(data[[v1[j]]])), j= v1[j], value = reference$Vals[j] )
}
NOTE: Only a single efficient external package used.
One approach is to compute a logical matrix of the target columns capturing which cells are NA. We can then index-assign the NA cells with the replacement values. The tricky part is ensuring the replacement vector aligns with the indexed cells:
im <- is.na(data[as.character(reference$Names)]);
data[as.character(reference$Names)][im] <- rep(reference$Vals,colSums(im));
data;
## ID A B C
## 1 1 1 1 1
## 2 2 4 2 2
## 3 3 2 5 3
## 4 4 2 5 4
## 5 5 4 5 6
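
To see why the replacement vector lines up, here is a small sketch of the intermediate values (the names cols and fill_vals are made up for illustration). Logical-matrix indexing walks the cells column by column, so repeating each reference value once per NA in its column yields the values in exactly that order.

```r
data <- data.frame(ID = c(1, 2, 3, 4, 5),
                   A = c(1, 4, NA, NA, 4),
                   B = c(1, 2, NA, NA, NA),
                   C = c(1, 2, 3, 4, NA))
reference <- data.frame(Names = c("A", "B", "C"), Vals = c(2, 5, 6))

cols <- as.character(reference$Names)
im <- is.na(data[cols])                 # logical matrix: TRUE where a cell is NA

na_pos <- which(im, arr.ind = TRUE)     # NA positions, listed column by column
na_per_col <- colSums(im)               # A = 2, B = 3, C = 1

# One replacement value per NA cell, in column-major order.
fill_vals <- rep(reference$Vals, na_per_col)

data[cols][im] <- fill_vals
data
```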
If reference was the same wide format as data, dplyr's new (v. 0.5.0) coalesce function is built for replacing NAs; together with purrr, which offers alternate notations for *apply functions, it makes the process very simple:
library(dplyr)
# spread reference to wide, add ID column for mapping
reference_wide <- data.frame(ID = NA_real_, tidyr::spread(reference, Names, Vals))
reference_wide
# ID A B C
# 1 NA 2 5 6
# now coalesce the two column-wise and return a df
purrr::map2_df(data, reference_wide, coalesce)
# Source: local data frame [5 x 4]
#
# ID A B C
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 2 4 2 2
# 3 3 2 5 3
# 4 4 2 5 4
# 5 5 4 5 6

calculate each chunk by group using dplyr?

How can I get the expected calculation using dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df=read.table(header=1, text=' row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate for each chunk (row 1-3, 4-5, 6-7) although row 1-3 and 6-7 are labelled as the same group number?
Here is a similar approach. I created a new group variable using cumsum. Whenever the difference between two numbers in group is not 0, R assigns a new group number. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(T, diff(group) != 0))) %>%
group_by(foo) %>%
mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
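
The run-numbering step is easier to follow on bare vectors. The sketch below uses the group and value columns from the question as plain vectors (the names run_id, run_id_rle and expected are illustrative only) and shows that the cumsum form and an rle-based form give the same run ids.

```r
group <- c(1, 1, 1, 2, 2, 1, 1)
value <- c(2, 4, 5, 6, 11, 12, 15)

# A new run starts at the first row and wherever group changes.
run_id <- cumsum(c(TRUE, diff(group) != 0))

# Equivalent construction from run lengths.
r <- rle(group)
run_id_rle <- rep(seq_along(r$values), times = r$lengths)

# Difference to the previous value within each run (NA at the run start).
expected <- ave(value, run_id, FUN = function(v) c(NA, diff(v)))
```

Grouping by run_id rather than group is what keeps rows 1-3 and 6-7 apart even though both carry group number 1.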
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
df$aux <- rep(seq_along(rle(df$group)$values), times = rle(df$group)$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
Here is an option using data.table 1.9.5. The devel version introduced the new functions rleid and shift (default type is "lag" and default fill is NA), which are useful for this.
library(data.table)
setDT(df)[, expected:=value-shift(value) ,by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3

Resources