Shifting a variable up in an R data frame with respect to another variable

I have a data.frame:
ID <-c(2,2,2,2,3,3,5,5)
Pur<-c(0,1,2,3,1,2,4,5)
df<-data.frame(ID,Pur)
I would like to push the Pur up for each ID to get the up.Pur as follows:
ID Pur up.Pur
2 0 1
2 1 2
2 2 3
2 3 NA
3 1 2
3 2 NA
5 4 5
5 5 NA
Would appreciate your help with this.

Here is a dplyr approach
library(dplyr)
ID <-c(2,2,2,2,3,3,5,5)
Pur<-c(0,1,2,3,1,2,4,5)
df<-data.frame(ID,Pur)
df %>%
group_by(ID) %>%
mutate(up.Pur = lead(Pur))
# Source: local data frame [8 x 3]
# Groups: ID [3]
#
# ID Pur up.Pur
# <dbl> <dbl> <dbl>
# 1 2 0 1
# 2 2 1 2
# 3 2 2 3
# 4 2 3 NA
# 5 3 1 2
# 6 3 2 NA
# 7 5 4 5
# 8 5 5 NA
For completeness, I've added a base R approach, just in case you don't feel like installing any packages.
dfList = split(df, df$ID)
dfList = lapply(dfList, function(x){
x$up.Pur = c(x$Pur[-1], NA)
return(x)
})
unsplit(dfList, df$ID)
# ID Pur up.Pur
# 1 2 0 1
# 2 2 1 2
# 3 2 2 3
# 4 2 3 NA
# 5 3 1 2
# 6 3 2 NA
# 7 5 4 5
# 8 5 5 NA
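The same per-group lead can also be written with ave, which applies a function within each ID and returns the result in the original row order. This is only a minimal sketch; lead_one is a hypothetical helper name, not a base R function:

```r
# Rebuild the example data from the question
df <- data.frame(ID  = c(2, 2, 2, 2, 3, 3, 5, 5),
                 Pur = c(0, 1, 2, 3, 1, 2, 4, 5))

# Hypothetical helper: drop the first element and pad the end with NA
lead_one <- function(x) c(x[-1], NA)

# ave() applies lead_one within each ID and keeps the original row order
df$up.Pur <- ave(df$Pur, df$ID, FUN = lead_one)
df
```

Because ave works on the column directly, no split/unsplit round trip is needed.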

We can use shift from data.table
library(data.table)
setDT(df)[, up.Pur := shift(Pur, type = "lead"), by = ID]
df
# ID Pur up.Pur
#1: 2 0 1
#2: 2 1 2
#3: 2 2 3
#4: 2 3 NA
#5: 3 1 2
#6: 3 2 NA
#7: 5 4 5
#8: 5 5 NA

Related

Creating an indexed column in R, grouped by user_id, and not increase when NA

I want to create a column (in R) that indexes the presence of a number in another column grouped by a user_id column. And when the other column is NA, the new desired column should not increase.
The example should bring clarity.
I have this df:
data <- data.frame(user_id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
one=c(1,NA,3,2,NA,0,NA,4,3,4,NA))
user_id tobeindexed
1 1 1
2 1 NA
3 1 3
4 2 2
5 2 NA
6 2 0
7 2 NA
8 3 4
9 3 3
10 3 4
11 3 NA
I want to make a new column looking like "desired" in the following df:
> cbind(data,data.frame(desired = c(1,1,2,1,1,2,2,1,2,3,3)))
user_id tobeindexed desired
1 1 1 1
2 1 NA 1
3 1 3 2
4 2 2 1
5 2 NA 1
6 2 0 2
7 2 NA 2
8 3 4 1
9 3 3 2
10 3 4 3
11 3 NA 3
How can I solve this?
Using cumsum and group_by gets me close, but the count does not start over from 1 when the user_id changes:
> data %>% group_by(user_id) %>% mutate(desired = cumsum(!is.na(tobeindexed)))
user_id tobeindexed desired
<dbl> <dbl> <int>
1 1 1 1
2 1 NA 1
3 1 3 2
4 2 2 3
5 2 NA 3
6 2 0 4
7 2 NA 4
8 3 4 5
9 3 3 6
10 3 4 7
11 3 NA 7
Given the sample data you provided (with the column named one), this works unchanged. The code is retained below for demonstration.
base R
data$out <- ave(data$one, data$user_id, FUN = function(z) cumsum(!is.na(z)))
data
# user_id one out
# 1 1 1 1
# 2 1 NA 1
# 3 1 3 2
# 4 2 2 1
# 5 2 NA 1
# 6 2 0 2
# 7 2 NA 2
# 8 3 4 1
# 9 3 3 2
# 10 3 4 3
# 11 3 NA 3
dplyr
library(dplyr)
data %>%
group_by(user_id) %>%
mutate(out = cumsum(!is.na(one))) %>%
ungroup()
# # A tibble: 11 × 3
# user_id one out
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 NA 1
# 3 1 3 2
# 4 2 2 1
# 5 2 NA 1
# 6 2 0 2
# 7 2 NA 2
# 8 3 4 1
# 9 3 3 2
# 10 3 4 3
# 11 3 NA 3
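To see why cumsum(!is.na(...)) gives the desired index, it may help to trace it on a single user's values. The vector below is user 2 from the sample data; the variable names are only for illustration:

```r
# Values of `one` for user_id 2 in the sample data
z <- c(2, NA, 0, NA)

# TRUE where a value is present, FALSE where it is NA
present <- !is.na(z)

# TRUE counts as 1 and FALSE as 0, so the running sum
# only increases on rows that hold a value
idx <- cumsum(present)
idx
```

Running this per group (via ave or group_by) is what makes the count restart at each user_id.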

how to move up the values within each group in R

I need to shift valid values to the top of the data frame within each id. Here is an example dataset:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3),
itemid = c(1,2,3,1,2,3,1,2,3,4),
values = c(1,NA,0,NA,NA,0,1,NA,0,NA))
df
id itemid values
1 1 1 1
2 1 2 NA
3 1 3 0
4 2 1 NA
5 2 2 NA
6 2 3 0
7 3 1 1
8 3 2 NA
9 3 3 0
10 3 4 NA
Excluding the id column, when there is a missing value in the values column, I want to shift the remaining values up so that they are aligned to the top for each id.
How can I get this desired dataset below?
df1
id itemid values
1 1 1 1
2 1 2 0
3 1 3 NA
4 2 1 0
5 2 2 NA
6 2 3 NA
7 3 1 1
8 3 2 0
9 3 3 NA
10 3 4 NA
Using tidyverse you can arrange by whether values is missing or not (which will put those at the bottom).
library(tidyverse)
df %>%
arrange(id, is.na(values))
Output
id itemid values
<dbl> <dbl> <dbl>
1 1 1 1
2 1 3 0
3 1 2 NA
4 2 3 0
5 2 1 NA
6 2 2 NA
7 3 1 1
8 3 3 0
9 3 2 NA
10 3 4 NA
Or, if you wish to retain the same order for itemid and other columns, you can use mutate to reorder only the columns of interest (like values). Other answers provide good solutions, such as those from @Santiago and @ThomasIsCoding. If you have multiple columns of interest to move NA to the bottom per group, you can also try:
df %>%
group_by(id) %>%
mutate(across(.cols = values, ~ .x[order(is.na(.x))]))
where the .cols argument contains the columns to transform; each one is reordered independently.
Output
id itemid values
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 0
3 1 3 NA
4 2 1 0
5 2 2 NA
6 2 3 NA
7 3 1 1
8 3 2 0
9 3 3 NA
10 3 4 NA
We can try ave + order
> transform(df, values = ave(values, id, FUN = function(x) x[order(is.na(x))]))
id itemid values
1 1 1 1
2 1 2 0
3 1 3 NA
4 2 1 0
5 2 2 NA
6 2 3 NA
7 3 1 1
8 3 2 0
9 3 3 NA
10 3 4 NA
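The ave + order idea also extends to several columns at once. The sketch below uses a small made-up data frame with two value columns (a and b are hypothetical names, as is the helper na_last) and reorders each column independently within id:

```r
# Hypothetical data: two value columns with NAs scattered within each id
df2 <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                  a  = c(1, NA, 0, NA, NA, 0),
                  b  = c(NA, 5, NA, 7, NA, 8))

# Hypothetical helper: non-NA values first (original order kept), NAs last
na_last <- function(x) x[order(is.na(x))]

# Apply the helper to each chosen column, separately within each id
cols <- c("a", "b")
df2[cols] <- lapply(df2[cols], function(col) ave(col, df2$id, FUN = na_last))
df2
```

Since each column is shifted on its own, values that shared a row before may no longer do so afterwards.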
With data.table:
library(data.table)
setDT(df)[, values := values[order(is.na(values))], id][]
#> id itemid values
#> 1: 1 1 1
#> 2: 1 2 0
#> 3: 1 3 NA
#> 4: 2 1 0
#> 5: 2 2 NA
#> 6: 2 3 NA
#> 7: 3 1 1
#> 8: 3 2 0
#> 9: 3 3 NA
#> 10: 3 4 NA
I'd define a function that does what you want and then group by id:
completed_first <- function(x) {
completed <- x[!is.na(x)]
length(completed) <- length(x)
completed
}
library(dplyr)
df %>%
group_by(id) %>%
mutate(
values = completed_first(values)
) %>%
ungroup()
# # A tibble: 10 × 3
# id itemid values
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 2 0
# 3 1 3 NA
# 4 2 1 0
# 5 2 2 NA
# 6 2 3 NA
# 7 3 1 1
# 8 3 2 0
# 9 3 3 NA
# 10 3 4 NA
(This method preserves the order of itemid.)
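The key step in completed_first is the assignment to length(), which pads a shortened vector with NA. A minimal trace on a made-up vector:

```r
# Made-up input with NAs in the first and third positions
x <- c(NA, 4, NA, 9)

# Keep only the non-missing values, in their original order
completed <- x[!is.na(x)]

# Growing a vector through `length<-` fills the new slots with NA
length(completed) <- length(x)
completed
```

The result has the same length as the input, which is what mutate requires inside each group.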
Or building upon ThomasIsCoding's answer:
library(dplyr)
df %>%
group_by(id) %>%
mutate(
values = values[order(is.na(values))]
) %>%
ungroup()
# # A tibble: 10 × 3
# id itemid values
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 2 0
# 3 1 3 NA
# 4 2 1 0
# 5 2 2 NA
# 6 2 3 NA
# 7 3 1 1
# 8 3 2 0
# 9 3 3 NA
# 10 3 4 NA

Removing groups with all NA in Data.Table or DPLYR in R

dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
dataWANT=data.frame("student"=c(1,1,1,3,3,3,5,5,5),
"time"=c(1,2,3,1,2,3,NA,2,3),
"score"=c(7,9,5,NA,3,9,7,NA,5))
I have a tall data frame, and I want to remove every student ID whose 'score' values are all NA or whose 'time' values are all NA. This applies only when all of the values are NA; if only some are NA, I want to keep all of that student's records.
Is this what you want?
library(dplyr)
dataHAVE %>%
group_by(student) %>%
filter(!all(is.na(score)))
student time score
<dbl> <dbl> <dbl>
1 1 1 7
2 1 2 9
3 1 3 5
4 3 1 NA
5 3 2 3
6 3 3 9
7 5 NA 7
8 5 2 NA
9 5 3 5
Each student is kept only if not (!) all score values are NA. The same test can be applied to time if students with an all-NA time should be dropped as well.
Since nobody suggested one, here is a solution using data.table:
library(data.table)
dataHAVE = data.table("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
Edit:
Previous but wrong code:
dataHAVE[, .SD[!(all(is.na(time)) & all(is.na(score)))], by = student]
New and correct code:
dataHAVE[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]
Returns:
student time score
1: 1 1 7
2: 1 2 9
3: 1 3 5
4: 3 1 NA
5: 3 2 3
6: 3 3 9
7: 5 NA 7
8: 5 2 NA
9: 5 3 5
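The difference between the & and | versions can be checked per student in base R. This is only an illustrative sketch (all_na_by_student and the dropped_* names are hypothetical); it lists which students each condition would remove:

```r
dataHAVE <- data.frame(
  student = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  time    = c(1, 2, 3, 1, 2, 3, 1, 2, 3, NA, NA, NA, NA, 2, 3),
  score   = c(7, 9, 5, NA, NA, NA, NA, 3, 9, NA, NA, NA, 7, NA, 5))

# Hypothetical helper: for each student, is every value in `col` NA?
all_na_by_student <- function(col) {
  vapply(split(is.na(col), dataHAVE$student), all, logical(1))
}

time_all_na  <- all_na_by_student(dataHAVE$time)
score_all_na <- all_na_by_student(dataHAVE$score)

# `&` flags only students missing everything in BOTH columns
dropped_and <- names(which(time_all_na & score_all_na))
# `|` flags students missing everything in EITHER column
dropped_or  <- names(which(time_all_na | score_all_na))

dropped_and
dropped_or
```

Student 2 has valid time values but no score at all, so only the | version removes it, which is what dataWANT requires.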
Edit:
Updated the data.table solution following @Cole's suggestion.
Here is a base R solution using subset + ave
dataWANT <- subset(dataHAVE,!(ave(time,student,FUN = function(v) all(is.na(v))) | ave(score,student,FUN = function(v) all(is.na(v)))))
or
dataWANT <- subset(dataHAVE,
!Reduce(`|`,Map(function(x) ave(get(x),student,FUN = function(v) all(is.na(v))), c("time","score"))))
Another option:
library(data.table)
setDT(dataHAVE, key="student")
dataHAVE[!student %in% dataHAVE[, if(any(colSums(is.na(.SD))==.N)) student, student]$V1]
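Create a dummy variable and filter based on that. Note that this check is row-wise: it drops only rows where both time and score are NA, so student 2 (whose score values are all NA) is still present in the output.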
library("dplyr")
dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
dataHAVE %>%
mutate(check=is.na(time)&is.na(score)) %>%
filter(check == FALSE) %>%
select(-check)
#> student time score
#> 1 1 1 7
#> 2 1 2 9
#> 3 1 3 5
#> 4 2 1 NA
#> 5 2 2 NA
#> 6 2 3 NA
#> 7 3 1 NA
#> 8 3 2 3
#> 9 3 3 9
#> 10 5 NA 7
#> 11 5 2 NA
#> 12 5 3 5
Created on 2020-02-21 by the reprex package (v0.3.0)
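data.table solution generalising to any number of columns (Reduce is used to add up the per-column flags, because calling "+" through do.call fails with more than two arguments):
dataHAVE[,
.SD[Reduce("+", lapply(.SD, function(x) any(!is.na(x)))) == ncol(.SD)],
by = student]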
# student time score
# 1: 1 1 7
# 2: 1 2 9
# 3: 1 3 5
# 4: 3 1 NA
# 5: 3 2 3
# 6: 3 3 9
# 7: 5 NA 7
# 8: 5 2 NA
# 9: 5 3 5

Shifting rows up in columns and flush remaining ones

I have a problem with moving values up into the row above. When rows become completely NA, I would like to drop them. My current approach, however, still keeps the second row of each group.
Here is my approach
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
so using this approach
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see, I still have the second row in each group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!
If there's at most one valid value per gr, you can use na.omit then take the first value from it:
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5
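For readers without dplyr, the same "first non-missing value per group" idea can be sketched in base R with aggregate. The helper name first_valid is hypothetical, and as above this assumes there is at most one valid value per column within each gr:

```r
data <- data.frame(gr = rep(1:3, each = 2),
                   A  = c(1, NA, 2, NA, 4, NA),
                   B  = c(NA, 1, NA, 3, NA, 7),
                   C  = c(1, NA, 4, NA, 5, NA))

# Hypothetical helper: first non-missing value, or NA if there is none
first_valid <- function(x) na.omit(x)[1]

# Collapse each gr to one row, column by column
res <- aggregate(data[c("A", "B", "C")],
                 by  = list(gr = data$gr),
                 FUN = first_valid)
res
```

Each column is collapsed independently, so a group with no valid value in some column simply gets NA there.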
You can do it with dplyr and tidyr (fill comes from tidyr) like this:
library(dplyr)
library(tidyr)
data$ind <- rep(c(1, 2), times = nrow(data) / 2)
data %>% fill(A,B,C) %>% filter(ind == 2) %>% mutate(ind=NULL)
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
One more solution, using data.table together with na.locf from zoo:
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5

Append frequency value from data

I have a data set that look like this
XDATA
SAMPN HHSIZE TOTVEH
1 2 3
2 6 4
2 6 4
5 1 3
5 1 3
5 1 3
How can I add an extra column for, let's say, SAMPN frequency so that it looks like this:
XDATA
SAMPN HHSIZE TOTVEH FREQ
1 2 3 1
2 6 4 2
2 6 4 2
5 1 3 3
5 1 3 3
5 1 3 3
Thanks in advance
library(data.table)
XDATA <- data.table(XDATA)
XDATA[, FREQ := .N, by=SAMPN]
XDATA
SAMPN HHSIZE TOTVEH FREQ
1: 1 2 3 1
2: 2 6 4 2
3: 2 6 4 2
4: 5 1 3 3
5: 5 1 3 3
6: 5 1 3 3
For base R, see @Ananda Mahto's solution.
An alternative to @Ricardo's answer for base R (which uses tapply and merge) is to use ave:
within(XDATA, {
FREQ <- ave(SAMPN, SAMPN, FUN = length)
})
# SAMPN HHSIZE TOTVEH FREQ
# 1 1 2 3 1
# 2 2 6 4 2
# 3 2 6 4 2
# 4 5 1 5 3
# 5 5 1 5 3
# 6 5 1 5 3
XDATA <- data.table(
SAMPN = c(1,2,2,5,5,5),
HHSIZE = c(2,6,6,1,1,1),
TOTVEH = c(3,4,4,5,5,5)
)
XDATA[, COUNT := 1]
XDATA[, FREQ := sum(COUNT), by = c('SAMPN','HHSIZE','TOTVEH')]
#SAMPN HHSIZE TOTVEH COUNT FREQ
#1: 1 2 3 1 1
#2: 2 6 4 1 2
#3: 2 6 4 1 2
#4: 5 1 5 1 3
#5: 5 1 5 1 3
#6: 5 1 5 1 3
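Grouping by SAMPN alone and grouping by all three columns give the same FREQ on the data above, but they answer different questions. The base R sketch below uses a deliberately altered copy of the data (the last TOTVEH value is changed, which is an assumption made purely for illustration) to show where the two counts diverge:

```r
# Hypothetical variant of XDATA: the last row has a different TOTVEH
XDATA2 <- data.frame(SAMPN  = c(1, 2, 2, 5, 5, 5),
                     HHSIZE = c(2, 6, 6, 1, 1, 1),
                     TOTVEH = c(3, 4, 4, 5, 5, 3))

# Number of rows sharing the same SAMPN
XDATA2$FREQ_SAMPN <- ave(XDATA2$SAMPN, XDATA2$SAMPN, FUN = length)

# Number of rows sharing the same SAMPN, HHSIZE and TOTVEH
XDATA2$FREQ_ROW <- ave(XDATA2$SAMPN,
                       XDATA2$SAMPN, XDATA2$HHSIZE, XDATA2$TOTVEH,
                       FUN = length)
XDATA2
```

If the goal is the frequency of each SAMPN, the first form is the one that matches the question.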
