Order columns by year independently in a dataframe in R

Data:
set.seed(0)
Temp <- data.frame(year=rep(1:3,each=4),V1=floor(rnorm(12)*2),V2=floor(rnorm(12)*2))
year V1 V2
1 1 2 -3
2 1 -1 -1
3 1 2 -1
4 1 2 -1
5 2 0 0
6 2 -4 -2
7 2 -2 0
8 2 -1 -3
9 3 -1 -1
10 3 4 0
11 3 1 0
12 3 -2 1
I want to sort V1 and V2 independently within each year. I can do it in about 10 lines, but I believe there must be a more elegant way.
Desired output:
year V1 V2
1 1 -1 -3
2 1 2 -1
3 1 2 -1
4 1 2 -1
5 2 -4 -3
6 2 -2 -2
7 2 -1 0
8 2 0 0
9 3 -2 -1
10 3 -1 0
11 3 1 0
12 3 4 1

Using dplyr you can do
library(dplyr)
Temp %>%
group_by(year) %>%
mutate(V1=sort(V1), V2=sort(V2))
which returns
# A tibble: 12 x 3
# Groups: year [3]
year V1 V2
<int> <dbl> <dbl>
1 1 -1 -3
2 1 2 -1
3 1 2 -1
4 1 2 -1
5 2 -4 -3
6 2 -2 -2
7 2 -1 0
8 2 0 0
9 3 -2 -1
10 3 -1 0
11 3 1 0
12 3 4 1
And if you needed to do that with all columns, you could do
Temp %>%
group_by(year) %>%
mutate_all(sort)
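In dplyr 1.0.0 and later, mutate_all() is superseded by across(). A minimal sketch of the same grouped sort, assuming a recent dplyr; the name sorted is only illustrative:

```r
library(dplyr)

set.seed(0)
Temp <- data.frame(year = rep(1:3, each = 4),
                   V1 = floor(rnorm(12) * 2),
                   V2 = floor(rnorm(12) * 2))

# In a grouped mutate(), across(everything(), ...) skips the grouping
# column, so sort() is applied to V1 and V2 one year at a time
sorted <- Temp %>%
  group_by(year) %>%
  mutate(across(everything(), sort)) %>%
  ungroup()
```

The trailing ungroup() is optional; it just returns a plain tibble so later steps are not grouped by accident.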

Using data.table:
library(data.table)
setDT(Temp)[,c("V1","V2"):=list(sort(V1),sort(V2)),year]
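If the columns should not be spelled out one by one, the same idea can be written with .SD and .SDcols. This is a sketch, not the answerer's code; cols and DT are illustrative names:

```r
library(data.table)

set.seed(0)
Temp <- data.frame(year = rep(1:3, each = 4),
                   V1 = floor(rnorm(12) * 2),
                   V2 = floor(rnorm(12) * 2))

# Columns to sort: everything except the grouping column
cols <- setdiff(names(Temp), "year")

# as.data.table() makes a copy, so the original data.frame is left untouched
DT <- as.data.table(Temp)

# .SD holds the columns listed in .SDcols for the current year
DT[, (cols) := lapply(.SD, sort), by = year, .SDcols = cols]
```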

If you use plyr and you know the column names, you can easily do this using ddply:
library(plyr)
ddply(Temp, "year", summarize, V1=sort(V1), V2=sort(V2))
year V1 V2
1 1 -1 -3
2 1 2 -1
3 1 2 -1
4 1 2 -1
5 2 -4 -3
6 2 -2 -2
7 2 -1 0
8 2 0 0
9 3 -2 -1
10 3 -1 0
11 3 1 0
12 3 4 1
If you don't know the column names, you'd have to make a function to do it:
> ddply(Temp, "year", function(x) { as.data.frame(lapply(x, sort)) })
year V1 V2
1 1 -1 -3
2 1 2 -1
3 1 2 -1
4 1 2 -1
5 2 -4 -3
6 2 -2 -2
7 2 -1 0
8 2 0 0
9 3 -2 -1
10 3 -1 0
11 3 1 0
12 3 4 1
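For completeness, the same result can be sketched in base R without any package, using ave() to apply sort() within each year. The names value_cols and sorted are illustrative:

```r
set.seed(0)
Temp <- data.frame(year = rep(1:3, each = 4),
                   V1 = floor(rnorm(12) * 2),
                   V2 = floor(rnorm(12) * 2))

# Every column except the grouping column
value_cols <- setdiff(names(Temp), "year")

# ave() splits each column by year, applies sort(), and puts the
# sorted values back into the rows of that year
sorted <- Temp
sorted[value_cols] <- lapply(Temp[value_cols],
                             function(col) ave(col, Temp$year, FUN = sort))
```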

Related

how to create a column that determines if a value is missing in a variable in R

I am trying to identify if a column has a missing number category based on a max.score. Here is a sample dataset.
df <- data.frame(id = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3),
score = c(0,0,2,0,2, 0,1,1,0,1, 0,1,0,1,0),
max.score = c(2,2,2,2,2, 1,1,1,1,1, 2,2,2,2,2))
> df
id score max.score
1 1 0 2
2 1 0 2
3 1 2 2
4 1 0 2
5 1 2 2
6 2 0 1
7 2 1 1
8 2 1 1
9 2 0 1
10 2 1 1
11 3 0 2
12 3 1 2
13 3 0 2
14 3 1 2
15 3 0 2
For id = 1, category 1 is missing given its max.score of 2, so I would like to add a missing column with the value 1. For id = 3, score = 2 never occurs, so the missing column should show 2. If more than one category is missing, the column should list all of them, for example 1,3. The desired output is:
> df
id score max.score missing
1 1 0 2 1
2 1 0 2 1
3 1 2 2 1
4 1 0 2 1
5 1 2 2 1
6 2 0 1 NA
7 2 1 1 NA
8 2 1 1 NA
9 2 0 1 NA
10 2 1 1 NA
11 3 0 2 2
12 3 1 2 2
13 3 0 2 2
14 3 1 2 2
15 3 0 2 2
Any thoughts?
Thanks!
library(dplyr)
df %>%
group_by(id) %>%
mutate(missing = toString(setdiff(0:max.score[1], unique(score))),
missing = ifelse(nzchar(missing), missing, NA))
# A tibble: 15 x 4
# Groups: id [3]
id score max.score missing
<dbl> <dbl> <dbl> <chr>
1 1 0 2 1
2 1 0 2 1
3 1 2 2 1
4 1 0 2 1
5 1 2 2 1
6 2 0 1 NA
7 2 1 1 NA
8 2 1 1 NA
9 2 0 1 NA
10 2 1 1 NA
11 3 0 2 2
12 3 1 2 2
13 3 0 2 2
14 3 1 2 2
15 3 0 2 2
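The building blocks of that answer can be tried on a single id in base R. This is a small sketch with made-up variable names (max_score, absent, label); it only illustrates how setdiff(), toString() and nzchar() interact:

```r
# Scores observed for one id, and the highest category that id can reach
score <- c(0, 0, 2, 0, 2)
max_score <- 2

# Categories 0..max_score that never occur in score
absent <- setdiff(0:max_score, unique(score))

# toString() collapses the result into "a, b, ..."; an empty result becomes ""
label <- toString(absent)

# nzchar() is FALSE for "", which is what selects the NA branch
missing <- if (nzchar(label)) label else NA_character_
```

Note that toString() separates several values with a comma and a space, so two missing categories would be shown as "1, 3" rather than "1,3".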

How can I create a new variable which identifies rows where another variable changes sign?

I have a question regarding data preparation. I have the following data set (in long format; one row per measurement point, therefore several rows per person):
dd <- read.table(text=
"ID time
1 -4
1 -3
1 -2
1 -1
1 0
1 1
2 -3
2 -1
2 2
2 3
2 4
3 -3
3 -2
3 -1
4 -1
4 1
4 2
4 3
5 0
5 1
5 2
5 3
5 4", header=TRUE)
Now I would like to create a new variable that has a 1 in the row where the sign of the time variable changes for the first time for this person, and a 0 in all other rows. If a person has only negative values on time, there should not be any 1 on the new variable. For a person that has only positive values on time, the first row should have a 1 on the new variable and all other rows should be coded with 0. For my example above the new data frame should look like this:
dd <- read.table(text=
"ID time new.var
1 -4 0
1 -3 0
1 -2 0
1 -1 0
1 0 1
1 1 0
2 -3 0
2 -1 0
2 2 1
2 3 0
2 4 0
3 -3 0
3 -2 0
3 -1 0
4 -1 0
4 1 1
4 2 0
4 3 0
5 0 1
5 1 0
5 2 0
5 3 0
5 4 0", header=TRUE)
Does anyone know how to do this? I thought about using dplyr and group_by, however I am pretty new to R and did not make it. Any help is much appreciated!
There are 2 different operations needed to create new.var, so you need to do them in 2 steps. I'll break this into 2 separate mutate calls for simplicity, but you can put both of them into the same mutate.
First, we group by ID and then find the rows where the sign changes. We need to use time >= 0 instead of sign (which is what this answer recommends: R identifying a row prior to a change in sign), because you want a sign change to be counted only when going from -1 <-> 0, not from 0 <-> 1:
library(tidyverse)
dd2 <- dd %>%
group_by(ID) %>%
mutate(new.var = as.numeric((time >= 0) != (lag(time) >= 0)))
dd2
# A tibble: 23 x 3
# Groups: ID [5]
ID time new.var
<int> <int> <dbl>
1 1 -4 NA
2 1 -3 0
3 1 -2 0
4 1 -1 0
5 1 0 1
6 1 1 0
7 2 -3 NA
8 2 -1 0
9 2 2 1
10 2 3 0
# … with 13 more rows
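The difference between sign() and time >= 0 is easiest to see on a three-element vector. A small base R sketch (the variable names are illustrative):

```r
time <- c(-1, 0, 1)

# sign() has three states (-1, 0, 1), so it reports a change at both steps
changes_sign <- diff(sign(time)) != 0

# time >= 0 has only two states, so only the step from negative to
# non-negative counts as a change
changes_nonneg <- diff(time >= 0) != 0
```

Here changes_sign flags both steps, while changes_nonneg flags only the first one, which is the behaviour the question asks for.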
Then we use case_when to modify the first row based on your desired rules. Due to the way lag works, the first row will always have NA (since there is no previous row to look at) which makes it a good way to pick out that first row to change it based on the time values in that group:
dd3 <- dd2 %>%
mutate(new.var = case_when(
!is.na(new.var) ~ new.var,
all(time >= 0) ~ 1,
TRUE ~ 0)
)
print(dd3, n = 100) #n=100 because tibbles are truncated to 10 rows by print
# A tibble: 23 x 3
# Groups: ID [5]
ID time new.var
<int> <int> <dbl>
1 1 -4 0
2 1 -3 0
3 1 -2 0
4 1 -1 0
5 1 0 1
6 1 1 0
7 2 -3 0
8 2 -1 0
9 2 2 1
10 2 3 0
11 2 4 0
12 3 -3 0
13 3 -2 0
14 3 -1 0
15 4 -1 0
16 4 1 1
17 4 2 0
18 4 3 0
19 5 0 1
20 5 1 0
21 5 2 0
22 5 3 0
23 5 4 0
You can try this (for an ID with no time >= 0, min() returns Inf with a warning, so no row of that ID is flagged):
library(dplyr)
dd %>% left_join(dd %>% group_by(ID) %>% summarise(index=min(which(time>=0)))) %>%
group_by(ID) %>% mutate(new.var=ifelse(row_number(ID)==index,1,0)) %>% select(-index)-> DF
# A tibble: 23 x 3
# Groups: ID [5]
ID time new.var
<int> <int> <dbl>
1 1 -4 0
2 1 -3 0
3 1 -2 0
4 1 -1 0
5 1 0 1
6 1 1 0
7 2 -3 0
8 2 -1 0
9 2 2 1
10 2 3 0
The following ave instruction does what the question asks for.
dd$new.var <- with(dd, ave(time, ID, FUN = function(x){
y <- integer(length(x))
if(any(x >= 0)) y[which.max(x[1]*x <= 0)] <- 1L
y
}))
dd
# ID time new.var
#1 1 -4 0
#2 1 -3 0
#3 1 -2 0
#4 1 -1 0
#5 1 0 1
#6 1 1 0
#7 2 -3 0
#8 2 -1 0
#9 2 2 1
#10 2 3 0
#11 2 4 0
#12 3 -3 0
#13 3 -2 0
#14 3 -1 0
#15 4 -1 0
#16 4 1 1
#17 4 2 0
#18 4 3 0
#19 5 0 1
#20 5 1 0
#21 5 2 0
#22 5 3 0
#23 5 4 0
If the expected output is renamed dd2 then
identical(dd, dd2)
#[1] TRUE
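To see how the which.max() step behaves in each case, the anonymous function can be pulled out and called on single groups. This is the same logic as above under a hypothetical name, first_change:

```r
first_change <- function(x) {
  y <- integer(length(x))
  # x[1] * x <= 0 is TRUE wherever x is zero or has the opposite sign of x[1];
  # which.max() picks the first TRUE, or position 1 if there is none
  if (any(x >= 0)) y[which.max(x[1] * x <= 0)] <- 1L
  y
}

first_change(c(-4, -3, -2, -1, 0, 1))  # first non-negative value is flagged
first_change(c(-3, -2, -1))            # only negatives: nothing is flagged
first_change(c(1, 2, 3))               # only positives: first row is flagged
```

The fallback of which.max() to position 1 when every comparison is FALSE is what makes the all-positive case come out as required.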

add "counting" column to a data frame with certain conditions

I have a data frame with different accounts and a win or lose record for each. I want to count how many times a person has lost in a row.
df <- data.frame(account_number =c(1,1,1,1,1,1,1,2,2,2,2,2,3,3),
win_lose = c(-1,-1,-1,1,-1,-1,-1,-1,-1,1,1,1,1,-1))
> df
account_number win_lose
1 1 -1
2 1 -1
3 1 -1
4 1 1
5 1 -1
6 1 -1
7 1 -1
8 2 -1
9 2 -1
10 2 1
11 2 1
12 2 1
13 3 1
14 3 -1
Each account represents a person. The end result should look like this:
account_number win_lose losing_streak
1 1 -1 1
2 1 -1 2
3 1 -1 3
4 1 1 0
5 1 -1 1
6 1 -1 2
7 1 -1 3
8 2 -1 1
9 2 -1 2
10 2 1 0
11 2 1 0
12 2 1 0
13 3 1 0
14 3 -1 1
One option is rleid from data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)) and group by 'account_number' and the rleid of 'win_lose'. Within each group, take the sequence of rows (seq_len(.N)) and multiply it by 'win_lose < 0': FALSE is coerced to 0, so winning rows become 0, while TRUE is coerced to 1, so losing rows keep their sequence value.
library(data.table)
setDT(df)[, losing_streak := seq_len(.N) * (win_lose <0) ,
by = .(account_number, rleid(win_lose))]
df
# account_number win_lose losing_streak
# 1: 1 -1 1
# 2: 1 -1 2
# 3: 1 -1 3
# 4: 1 1 0
# 5: 1 -1 1
# 6: 1 -1 2
# 7: 1 -1 3
# 8: 2 -1 1
# 9: 2 -1 2
#10: 2 1 0
#11: 2 1 0
#12: 2 1 0
#13: 3 1 0
#14: 3 -1 1
A base R option is ave (for the grouping) combined with rle:
with(df, ave(win_lose, account_number, FUN =
function(x) with(rle(x== -1), sequence(lengths) * rep(values, lengths))))
#[1] 1 2 3 0 1 2 3 1 2 0 0 0 0 1
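The rle step can be unpacked on the records of a single account. A short sketch with illustrative names (r, position, is_loss):

```r
x <- c(-1, -1, -1, 1, -1, -1, -1)   # win_lose for one account

# Run-length encoding of "is this row a loss?"
r <- rle(x == -1)

# Position of each row inside its own run: 1, 2, 3, ...
position <- sequence(r$lengths)

# TRUE for rows in a losing run, FALSE for rows in a winning run
is_loss <- rep(r$values, r$lengths)

# FALSE is coerced to 0, so winning rows are zeroed out
streak <- position * is_loss
```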

Row positions relative to a specific condition in R

I have a dataset with "Athletes" playing "Matches" ("Match"==1) on random "Dates". For example:
df <- data.frame(matrix(nrow = 80, ncol = 5))
colnames(df) <- c("Athlete", "Date", "Match", "DaysAfter", "DaysBefore")
df[,"Athlete"] <- c(rep(1, 20), rep(2,20), rep(3, 20), rep(4, 20))
df[,"Date"] <- rep(1:20, 4)
df[,"Match"] <- c(0,0,0,0,1,0,0,1,0,0)
I want to make two variables:
df$DaysAfter <- # number of days after last "Match" (for each "Athlete").
df$DaysBefore <- # number of days before next "Match" (for each "Athlete").
PS! When "Match" == 1, then "DaysAfter" and "DaysBefore" should be 0.
When there are no matches before in "DaysAfter" and after in "DaysBefore", show NA (see example).
I want the dataset to look like this:
Ath Dat Mat DA DB
1 1 0 NA -4
1 2 0 NA -3
1 3 0 NA -2
1 4 0 NA -1
1 5 1 0 0
1 6 0 1 -2
1 7 0 2 -1
1 8 1 0 0
1 9 0 1 -4
1 10 0 2 -3
1 11 0 3 -2
1 12 0 4 -1
1 13 1 0 0
1 14 0 1 -2
1 15 0 2 -1
1 16 1 0 0
1 17 0 1 NA
1 18 0 2 NA
1 19 0 3 NA
1 20 0 4 NA
2 1 0 NA -4
2 2 0 NA -3
etc.
How can I achieve this?
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)) and group by 'Athlete' and by a second grouping variable based on the position of 1 in 'Match' (cumsum(Match == 1)). Within these groups we create two columns:
1) DA - We need NA for all elements before the first 1 in 'Match', so an if/else checks whether every 'Match' in the group is 0 and, if so, multiplies 'Match' by NA (NA * any number returns NA). Because of the grouping by cumsum, only the first group can consist entirely of 0s, so this covers that case. The else branch takes the sequence of rows and subtracts 1 from it (seq_len(.N) - 1).
2) DB - We multiply 'Match' by the number of rows (.N) and subtract the reverse sequence (.N:1) from it. The last part sets NA for the elements after the last 1 in 'Match'. Grouped by 'Athlete', we get the row index (.I) of the sequence from the element after the last 1 in 'Match' to the number of rows (.N), and assign (:=) NA to 'DB' at that index.
library(data.table)
df1 <- setDT(df)[, c("DA", "DB") := list(if(all(!Match)) NA*Match else
seq_len(.N)-1,Match*(.N) -(.N:1)) , by = .(cumsum(Match==1), Athlete)]
df1[df1[, .I[(max(which(Match==1))+1):.N] , by = Athlete]$V1, DB:= NA][]
# Athlete Date Match DA DB
# 1: 1 1 0 NA -4
# 2: 1 2 0 NA -3
# 3: 1 3 0 NA -2
# 4: 1 4 0 NA -1
# 5: 1 5 1 0 0
# 6: 1 6 0 1 -2
# 7: 1 7 0 2 -1
# 8: 1 8 1 0 0
# 9: 1 9 0 1 -6
#10: 1 10 0 2 -5
#11: 1 11 0 3 -4
#12: 1 12 0 4 -3
#13: 1 13 0 5 -2
#14: 1 14 0 6 -1
#15: 1 15 1 0 0
#16: 1 16 0 1 -2
#17: 1 17 0 2 -1
#18: 1 18 1 0 0
#19: 1 19 0 1 NA
#20: 1 20 0 2 NA
#21: 2 1 0 NA -4
#22: 2 2 0 NA -3
#23: 2 3 0 NA -2
#24: 2 4 0 NA -1
#25: 2 5 1 0 0
#26: 2 6 0 1 -2
#27: 2 7 0 2 -1
#28: 2 8 1 0 0
#29: 2 9 0 1 -6
#30: 2 10 0 2 -5
#31: 2 11 0 3 -4
#32: 2 12 0 4 -3
#33: 2 13 0 5 -2
#34: 2 14 0 6 -1
#35: 2 15 1 0 0
#36: 2 16 0 1 -2
#37: 2 17 0 2 -1
#38: 2 18 1 0 0
#39: 2 19 0 1 NA
#40: 2 20 0 2 NA
#41: 3 1 0 NA -4
#42: 3 2 0 NA -3
#43: 3 3 0 NA -2
#44: 3 4 0 NA -1
#45: 3 5 1 0 0
#46: 3 6 0 1 -2
#47: 3 7 0 2 -1
#48: 3 8 1 0 0
#49: 3 9 0 1 -6
#50: 3 10 0 2 -5
#51: 3 11 0 3 -4
#52: 3 12 0 4 -3
#53: 3 13 0 5 -2
#54: 3 14 0 6 -1
#55: 3 15 1 0 0
#56: 3 16 0 1 -2
#57: 3 17 0 2 -1
#58: 3 18 1 0 0
#59: 3 19 0 1 NA
#60: 3 20 0 2 NA
#61: 4 1 0 NA -4
#62: 4 2 0 NA -3
#63: 4 3 0 NA -2
#64: 4 4 0 NA -1
#65: 4 5 1 0 0
#66: 4 6 0 1 -2
#67: 4 7 0 2 -1
#68: 4 8 1 0 0
#69: 4 9 0 1 -6
#70: 4 10 0 2 -5
#71: 4 11 0 3 -4
#72: 4 12 0 4 -3
#73: 4 13 0 5 -2
#74: 4 14 0 6 -1
#75: 4 15 1 0 0
#76: 4 16 0 1 -2
#77: 4 17 0 2 -1
#78: 4 18 1 0 0
#79: 4 19 0 1 NA
#80: 4 20 0 2 NA
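The same two columns can also be sketched in base R with ave() and cumsum(). This is an alternative to the data.table code above, not a rewrite of it; days_after and days_before are hypothetical helper names, and the rows are assumed to be sorted by Date within each Athlete:

```r
# Days since the most recent match in one athlete's rows (NA before the first match)
days_after <- function(match) {
  run <- cumsum(match == 1)   # 0 before the first match, then 1, 2, ...
  out <- ave(seq_along(match), run, FUN = function(i) seq_along(i) - 1L)
  out[run == 0] <- NA
  out
}

# Days until the next match as a non-positive number:
# the same count taken on the reversed series
days_before <- function(match) -rev(days_after(rev(match)))

df <- data.frame(Athlete = rep(1:4, each = 20),
                 Date = rep(1:20, 4),
                 Match = rep(c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0), 8))

# Apply both helpers separately to each athlete
df$DaysAfter  <- ave(df$Match, df$Athlete, FUN = days_after)
df$DaysBefore <- ave(df$Match, df$Athlete, FUN = days_before)
```

Reusing days_after() on the reversed vector keeps the two directions consistent, including the NA handling at the ends.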
This code should work:
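unique_list <- unique(df$Athlete)
for (k in seq_along(unique_list)) {
  # Row indices that belong to the k-th athlete
  index <- which(df$Athlete == unique_list[k])
  # Forward pass: days since the last match (stays NA until the first match)
  count <- NA
  for (j in index) {
    if (df$Match[j] == 1) {
      count <- 0
    } else {
      count <- count + 1
    }
    df$DaysAfter[j] <- count
  }
  # Backward pass: days until the next match, counted downwards (NA after the last match)
  count <- NA
  for (j in rev(index)) {
    if (df$Match[j] == 1) {
      count <- 0
    } else {
      count <- count - 1
    }
    df$DaysBefore[j] <- count
  }
}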
I once wrote the following function:
cumsum.r <- function (vals, restart)
{
if (!is.vector(vals) || !is.vector(restart))
stop("expect vectors")
if (length(vals) != length(restart))
stop("different length")
len = length(vals)
restart[1] = T
ind = which(restart)
ind = rep(ind, c(ind[-1], len + 1) - ind)
vals.c = cumsum(vals)
vals.c - vals.c[ind] + vals[ind]
}
It performs cumsum, but restarts the running sum at every position where restart is TRUE.
For "days after", you need
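# TRUE on the first row of each athlete (the athlete differs from the previous row)
new.ath <- c(TRUE, df$Athlete[-1] != df$Athlete[-length(df$Athlete)])
restart <- df$Match == 1 | new.ath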
days.after <- cumsum.r(1-restart, restart)
for days.before you need
rr <- rev(restart)
days.before <- -rev(cumsum.r(1-rr, rr))
(This does not put NAs, but you can use this cumsum.r for NAs too.)
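A restarting cumulative sum of this kind can also be sketched with ave(), grouping by the running count of restarts. restart_cumsum is a hypothetical name, and this is a compact alternative rather than the function above:

```r
restart_cumsum <- function(vals, restart) {
  restart[1] <- TRUE   # the first element always opens a run
  # cumsum(restart) gives every run its own id; cumsum() is then applied per run
  ave(vals, cumsum(restart), FUN = cumsum)
}

# The running sum starts again at the third element
res <- restart_cumsum(c(1, 1, 1, 1, 1), c(FALSE, FALSE, TRUE, FALSE, FALSE))
```

For this input res is 1 2 1 2 3: the sum restarts at the flagged position with that position's own value.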

Replace a column data with another column of data in a data frame while replacing prior instances <0 by 0

I have a data frame
x<-c(1,3,0,2,4,5,0,-2,-5,1,0)
y<-c(-1,-2,0,3,4,5,1,8,1,0,2)
data.frame(x,y)
x y
1 1 -1
2 3 -2
3 0 0
4 2 3
5 4 4
6 5 5
7 0 1
8 -2 8
9 -5 1
10 1 0
11 0 2
I would like to replace the data in column y with the data from column x, except that the rows where y was < 0 should become 0. This will result in the following data frame
data.frame(x,y)
x y
1 1 0
2 3 0
3 0 0
4 2 2
5 4 4
6 5 5
7 0 0
8 -2 -2
9 -5 -5
10 1 0
11 0 0
Thanks
x<-c(1,3,0,2,4,5,0,-2,-5,1,0)
y<-c(-1,-2,0,3,4,5,1,8,1,0,2)
df <- data.frame(x, y)
df$y <- ifelse(y<0,0,x)
df
# x y
# 1 1 0
# 2 3 0
# 3 0 0
# 4 2 2
# 5 4 4
# 6 5 5
# 7 0 0
# 8 -2 -2
# 9 -5 -5
# 10 1 1
# 11 0 0
In one line:
> df <- transform(data.frame(x,y), y = ifelse(y<0,0,x))
> df
x y
1 1 0
2 3 0
3 0 0
4 2 2
5 4 4
6 5 5
7 0 0
8 -2 -2
9 -5 -5
10 1 1
11 0 0
Note that the resulting data differs from the reference result you provide on record 10. I suspect that this might be because you applied the condition <= 0 rather than < 0? Otherwise the 1 would be carried across from the x field for this record.
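The two conditions can be compared side by side on the question's vectors. In this sketch strict and loose are illustrative names:

```r
x <- c(1, 3, 0, 2, 4, 5, 0, -2, -5, 1, 0)
y <- c(-1, -2, 0, 3, 4, 5, 1, 8, 1, 0, 2)

strict <- ifelse(y < 0, 0, x)    # condition as stated in the question
loose  <- ifelse(y <= 0, 0, x)   # condition that also zeroes rows where y == 0

# Rows where the two versions disagree
differs <- which(strict != loose)
```

The two versions differ only in record 10, and loose reproduces the expected output given in the question, which is consistent with the <= 0 explanation.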
Given your x and y vectors, create the data.frame in one swift move:
> data.frame(x, y=ifelse(y < 0, 0, x))
x y
1 1 0
2 3 0
3 0 0
4 2 2
5 4 4
6 5 5
7 0 0
8 -2 -2
9 -5 -5
10 1 1
11 0 0
