R: assignment using values from several rows

Say I have measured some value (value, encoded as H, L or I) in five individuals (id) at two time points (time). Sometimes NAs may occur in value:
require(stringr)
require(dplyr)
set.seed(8)
df1 <- data.frame(
time=rep(c(1,2), 5),
id=rep(c("a", "b", "c", "d", "e"),2),
value=sample(c("H","L","I", NA), replace=TRUE, 10))
How can I make a factor variable (preferably using dplyr::mutate()) that indicates, for each id, the transition of value from time 1 to time 2 (e.g. "HL" if H at time 1 and L at time 2)?
df1 %>%
group_by(id) %>%
arrange(time)
Gives:
time id value
1 1 a L
2 2 a I
3 1 b L
4 2 b H
5 1 c NA
6 2 c NA
7 1 d NA
8 2 d I
9 1 e L
10 2 e I
And I would need a fourth column indicating time transition, like (made-up):
time id value transition
1 1 a L L-I
2 2 a I L-I
3 1 b L L-H
4 2 b H L-H
5 1 c NA NA-NA
6 2 c NA NA-NA
7 1 d NA NA-I
8 2 d I NA-I
9 1 e L L-I
10 2 e I L-I
Something like (if only the str_c() command could do it):
df1 <-
df1 %>%
group_by(id) %>%
arrange(time) %>%
mutate(transition=str_c(value, sep="-"))

df1 %>%
arrange(id, time) %>%
group_by(id) %>%
mutate(transition = paste0(value[1],"-",value[2]))
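The answer above relies on each id having exactly two rows, sorted by time. A minimal sketch of a variant that picks the values by time explicitly and then converts the result to a factor, as the question asks; the small data frame here is made up for illustration and is not the seeded sample from the question:

```r
library(dplyr)

# Made-up example data: three ids, two time points each, some NAs.
df <- data.frame(
  time  = c(1, 2, 1, 2, 1, 2),
  id    = c("a", "a", "b", "b", "c", "c"),
  value = c("L", "I", NA, "H", NA, NA),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(id) %>%
  # Select the values by time point instead of by row position.
  mutate(transition = paste(value[time == 1], value[time == 2], sep = "-")) %>%
  ungroup() %>%
  # Convert to a factor once, after all groups have been combined.
  mutate(transition = factor(transition))

res
```

Note that paste() turns a missing value into the string "NA", which is what the desired output in the question shows (e.g. "NA-I").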

Related

How can I remove rows with the same value in 2 or more rows in R

I have a dataframe in the following format with ID's and A/B's. The dataframe is very long, over 3000 ID's.
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B
... ...
I need to remove all rows (A and B) wherever two or more A's occur in a row. I don't just want to remove the duplicates: if there is a run of 2 or more A's, I want to remove all of those A's and the B that follows, up to the next A.
id type
1 A
2 B
6 A
7 B
8 A
9 B
... ...
Do I need a loop for this problem? Thank you for any help!
This might be what you want:
First, define a function that notes the indices of what you want to remove:
row_sequence <- function(value) {
inds <- which(value == lead(value))
sort(unique(c(inds, inds + 1, inds +2)))
}
Apply the function to your dataframe by first extracting the rows that you want to remove into df1 and second anti_joining df1 with df to obtain the final dataframe:
library(dplyr)
df1 <- df %>% slice(row_sequence(type))
df2 <- df %>%
anti_join(., df1)
Result:
df2
id type
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data:
df <- data.frame(
id = 1:13,
type = c("A","B","A","A","B","A","B","A","B","A","A","A","B")
)
I assumed there is only one B after a series of duplicated A values; if that is not the case, just let me know and I will modify my code:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
mutate(rles = data.table::rleid(type)) %>%
group_by(rles) %>%
mutate(rles = ifelse(length(rles) > 1, NA, rles)) %>%
ungroup() %>%
mutate(rles = ifelse(!is.na(rles) & is.na(lag(rles)) & type == "B", NA, rles)) %>%
drop_na() %>%
select(-rles)
# A tibble: 6 x 2
id type
<int> <chr>
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data
df <- read.table(header = TRUE, text = "
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B")
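The same idea can also be sketched in base R with rle(), which works on runs of any length. This is only a sketch, assuming that the whole B run directly after a repeated-A run should be dropped; drop_repeated_runs is a hypothetical helper name, not a function from any package:

```r
# Hypothetical helper: drop every run of repeated "A" rows
# together with the "B" run that directly follows it.
drop_repeated_runs <- function(df) {
  r <- rle(as.character(df$type))
  # Runs of "A" that are longer than one row.
  bad_a <- r$values == "A" & r$lengths > 1
  # Runs of "B" whose preceding run is one of those "A" runs.
  bad_b <- r$values == "B" & c(FALSE, head(bad_a, -1))
  # Expand the per-run flags back to one flag per row.
  drop <- rep(bad_a | bad_b, r$lengths)
  df[!drop, , drop = FALSE]
}

df <- data.frame(
  id = 1:13,
  type = c("A","B","A","A","B","A","B","A","B","A","A","A","B"),
  stringsAsFactors = FALSE
)

res <- drop_repeated_runs(df)
res$id
```

On the example data this keeps ids 1, 2, 6, 7, 8 and 9, the same rows as the two answers above.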

R slide window through tibble

I have a simple question that I cannot find a solution for.
Also, I didn't find an existing answer that I understand.
Imagine I have this data frame:
(ts <- tibble(
  a = LETTERS[1:10],
  b = c(rep(1, 5), rep(2, 5))
))
# A tibble: 10 x 2
a b
<chr> <dbl>
1 A 1
2 B 1
3 C 1
4 D 1
5 E 1
6 F 2
7 G 2
8 H 2
9 I 2
10 J 2
What I want is simple: a data frame in which column b indexes a sliding window of size n over column a.
The output can be something like this:
# A tibble: 8 x 2
b a
<dbl> <chr>
1 1 A B
2 1 B C
3 1 C D
4 1 D E
5 2 F G
6 2 G H
7 2 H I
8 2 I J
I don't mind if column a contains an array (nested values).
I just need a new data frame based on the sliding window.
Since this operation will run in a relational database, I'd like a function compatible with DBI/PostgreSQL.
Any help is appreciated.
Thanks in advance
We can group by 'b', create the new column based on the lead of 'a', remove the NA rows with na.omit
library(dplyr)
ts %>%
group_by(b) %>%
mutate(a2 = lead(a)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
# A tibble: 8 x 3
# b a a2
# <dbl> <chr> <chr>
#1 1 A B
#2 1 B C
#3 1 C D
#4 1 D E
#5 2 F G
#6 2 G H
#7 2 H I
#8 2 I J
If lead doesn't work, then just remove the first element and append NA at the end in the mutate step
ts %>%
group_by(b) %>%
mutate(a2 = c(a[-1], NA)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
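Both snippets above cover a window of size 2. For a general window size n, a base R sketch could look like the following. It runs in memory only and is not meant to be translated to SQL; sliding_windows is a hypothetical helper name and n = 3 is chosen just for illustration:

```r
# Hypothetical helper: all windows of size n over x, each pasted into one string.
sliding_windows <- function(x, n) {
  n_windows <- length(x) - n + 1
  if (n_windows < 1) return(character(0))
  vapply(
    seq_len(n_windows),
    function(i) paste(x[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

ts <- data.frame(
  a = LETTERS[1:10],
  b = c(rep(1, 5), rep(2, 5)),
  stringsAsFactors = FALSE
)

n <- 3  # window size, an assumption for this example

# Build the windows separately within each value of b, then stack the pieces.
pieces <- lapply(split(ts, ts$b), function(g) {
  w <- sliding_windows(g$a, n)
  data.frame(b = rep(g$b[1], length(w)), a = w, stringsAsFactors = FALSE)
})
res <- do.call(rbind, pieces)
rownames(res) <- NULL
res
```

Windows never cross a group boundary, because they are built per value of b.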

tidyr spread subset of key-value pairs

Given the example data, I'd like to spread a subset of the key-value pairs. In this case it is just one pair. However there are other cases where the subset to be spread is more than one pair.
library(tidyr)
# dummy data
> df1 <- data.frame(e = c(1, 1, 1, 1),
n = c("a", "b", "c", "d") ,
s = c(1, 2, 5, 7))
> df1
e n s
1 1 a 1
2 1 b 2
3 1 c 5
4 1 d 7
Classical spread of all key-value pairs:
> df1 %>% spread(n,s)
e a b c d
1 1 1 2 5 7
Desired output, spread only n=c
e c n s
1 1 5 a 1
2 1 5 b 2
3 1 5 d 7
We can do a gather after the spread
df1 %>%
spread(n, s) %>%
gather(n, s, -c, -e)
# e c n s
#1 1 5 a 1
#2 1 5 b 2
#3 1 5 d 7
Or instead of spread/gather, we filter without the 'c' row and then mutate to create the 'c' column while subsetting the 's' that corresponds to 'c'
df1 %>%
filter(n != "c") %>%
mutate(c = df1$s[df1$n=="c"])
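The question also mentions cases where more than one pair should be spread. A possible sketch for that, assuming the subset of keys is held in a vector (keys is a name made up here): spread only the selected rows, then join them back onto the remaining rows by e.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  e = c(1, 1, 1, 1),
  n = c("a", "b", "c", "d"),
  s = c(1, 2, 5, 7),
  stringsAsFactors = FALSE
)

keys <- c("b", "c")  # hypothetical subset of keys to spread

# Wide part: one column per selected key.
wide <- df1 %>%
  filter(n %in% keys) %>%
  spread(n, s)

# Long part: all remaining rows, with the wide columns attached.
res <- df1 %>%
  filter(!n %in% keys) %>%
  left_join(wide, by = "e")

res
```

With keys <- "c" this should give the same rows as the answers above, only with the columns in a different order.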

Rearranging data frame columns in R (mutate, dplyr)

I have a data frame like so
Type Number Species
A 1 G
A 2 R
A 7 Q
A 4 L
B 4 S
B 5 T
B 3 H
B 9 P
C 12 K
C 11 T
C 6 U
C 5 Q
Where I have used group_by(Type)
My goal is to collapse this data so that Number holds the top 2 values of the Number column for each Type, and a new column (Number_2) holds the other 2 values.
Also I would want the Species values for the bottom two numbers to be deleted, so that the species corresponds to the higher number in the row
I would like to use dplyr and the final would look like this
Type Number Number_2 Species
A 7 1 Q
A 4 2 L
B 5 3 T
B 9 4 P
C 12 6 K
C 11 5 T
For now, the order of Number_2 doesn't matter, as long as the values stay within the same Type.
I don't know if this is possible, but if it is, does anyone know how?
thanks!
You can try
library(data.table)
setDT(df1)[order(-Number), list(Number1=Number[1:2],
Number2=Number[3:4],
Species=Species[1:2]), keyby = Type]
# Type Number1 Number2 Species
#1: A 7 2 Q
#2: A 4 1 L
#3: B 9 4 P
#4: B 5 3 T
#5: C 12 6 K
#6: C 11 5 T
Or using dplyr with do
library(dplyr)
df1 %>%
group_by(Type) %>%
arrange(desc(Number)) %>%
do(data.frame(Type=.$Type[1L],
Number1=.$Number[1:2],
Number2 = .$Number[3:4],
Species=.$Species[1:2], stringsAsFactors=FALSE))
# Type Number1 Number2 Species
#1 A 7 2 Q
#2 A 4 1 L
#3 B 9 4 P
#4 B 5 3 T
#5 C 12 6 K
#6 C 11 5 T
Here's a different dplyr approach.
library(dplyr)
# Start creating the data set with top 2 values and store as df1:
df1 <- df %>%
group_by(Type) %>%
top_n(2, Number) %>%
ungroup() %>%
arrange(Type, Number)
# Then, get the anti-joined data (the non-top-2 values), arrange, rename and select
# the number column and cbind to df1:
out <- df %>%
anti_join(df1, c("Type","Number")) %>%
arrange(Type, Number) %>%
select(Number2 = Number) %>%
cbind(df1, .)
This results in:
> out
# Type Number Species Number2
#1 A 4 L 1
#2 A 7 Q 2
#3 B 5 T 3
#4 B 9 P 4
#5 C 11 T 5
#6 C 12 K 6
This could be another option using ddply
library(plyr)
ddply(dat[order(dat$Number), ], .(Type), summarize,
Number1 = Number[4:3], Number2 = Number[2:1], Species = Species[4:3])
# Type Number1 Number2 Species
#1 A 7 2 Q
#2 A 4 1 L
#3 B 9 4 P
#4 B 5 3 T
#5 C 12 6 K
#6 C 11 5 T

Using 'window' functions in dplyr

I need to process rows of a data-frame in order, but need to look-back for certain rows. Here is an approximate example:
library(dplyr)
d <- data_frame(trial = rep(c("A","a","b","B","x","y"),2))
d <- d %>%
mutate(cond = rep('', n()), num = as.integer(rep(0,n())))
for (i in 1:nrow(d)){
if(d$trial[i] == "A"){
d$num[i] <- 0
d$cond[i] <- "A"
}
else if(d$trial[i] == "B"){
d$num[i] <- 0
d$cond[i] <- "B"
}
else{
d$num[i] <- d$num[i-1] +1
d$cond[i] <- d$cond[i-1]
}
}
The resulting data-frame looks like
> d
Source: local data frame [12 x 3]
trial cond num
1 A A 0
2 a A 1
3 b A 2
4 B B 0
5 x B 1
6 y B 2
7 A A 0
8 a A 1
9 b A 2
10 B B 0
11 x B 1
12 y B 2
What is the proper way of doing this using dplyr?
dplyr-only solution:
d %>%
group_by(i=cumsum(trial %in% c('A','B'))) %>%
mutate(cond=trial[1],num=seq(n())-1) %>%
ungroup() %>%
select(-i)
# trial cond num
# 1 A A 0
# 2 a A 1
# 3 b A 2
# 4 B B 0
# 5 x B 1
# 6 y B 2
# 7 A A 0
# 8 a A 1
# 9 b A 2
# 10 B B 0
# 11 x B 1
# 12 y B 2
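The key step in the solution above is the grouping index: cumsum() increases by one at every "A" or "B", so each block of rows gets its own ID. A base R sketch of the same idea, showing that intermediate index (grp is a name introduced here for illustration):

```r
trial <- rep(c("A", "a", "b", "B", "x", "y"), 2)

# The block index increases at every "A" or "B": 1 1 1 2 2 2 3 3 3 4 4 4
grp <- cumsum(trial %in% c("A", "B"))

# cond: the first trial of each block, repeated over the block.
cond <- ave(trial, grp, FUN = function(x) x[1])

# num: position within the block, starting at 0.
num <- ave(seq_along(trial), grp, FUN = seq_along) - 1L

data.frame(trial, cond, num)
```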
Try
d %>%
mutate(cond = zoo::na.locf(ifelse(trial=="A"|trial=="B", trial, NA))) %>%
group_by(id=rep(1:length(rle(cond)$values), rle(cond)$lengths)) %>%
mutate(num = 0:(n()-1)) %>% ungroup %>%
select(-id)
Here is one way. First, I added A or B to cond using ifelse(). Then I used na.locf() from the zoo package to fill the NAs with the preceding A or B. Before computing num, I assigned a temporary group ID (foo), borrowing rleid() from the data.table package. Grouping the data by foo, I used row_number(), one of the window functions in dplyr. Note that I tried to remove foo with select(-foo), but the column stayed. This is most likely because foo is still a grouping variable at that point, and select() keeps grouping variables; calling ungroup() first should let select(-foo) drop it.
library(zoo)
library(dplyr)
library(data.table)
d <- data_frame(trial = rep(c("A","a","b","B","x","y"),2))
mutate(d, cond = ifelse(trial == "A" | trial == "B", trial, NA),
cond = na.locf(cond),
foo = rleid(cond)) %>%
group_by(foo) %>%
mutate(num = row_number() - 1)
# trial cond foo num
#1 A A 1 0
#2 a A 1 1
#3 b A 1 2
#4 B B 2 0
#5 x B 2 1
#6 y B 2 2
#7 A A 3 0
#8 a A 3 1
#9 b A 3 2
#10 B B 4 0
#11 x B 4 1
#12 y B 4 2
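On the leftover foo column: select() keeps grouping variables, so the temporary ID can only be dropped after ungroup(). A minimal sketch of that, using dplyr only; here the group ID comes from cumsum() rather than rleid(), which is a swapped-in choice to avoid the extra packages:

```r
library(dplyr)

d <- tibble(trial = rep(c("A", "a", "b", "B", "x", "y"), 2))

res <- d %>%
  # Temporary block ID: increases at every "A" or "B".
  mutate(foo = cumsum(trial %in% c("A", "B"))) %>%
  group_by(foo) %>%
  mutate(cond = first(trial), num = row_number() - 1L) %>%
  # Remove the grouping first, otherwise select() keeps foo.
  ungroup() %>%
  select(-foo)

res
```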
