R - sort semi-numeric column - r

I created a dataset to illustrate the problem that I have.
My data looks like this
id time act
1 1 time1 a
2 1 time2 a
3 1 time3 a
4 1 time101 a
5 1 time103 a
6 1 time1001 b
7 1 time1003 b
9 1 time10000 b
10 1 time100010 c
What I want is to spread the data with time in the correct order, like this :
id 1 2 3 101 103 1001 1003 1004 10000 100010
1 a a a a a b b b b c
Here is what I do not fully understand. When I spread my data I get something like
library(dplyr)
library(tidyr)
dt %>% spread(time, act)
id time1 time10000 time100010 time1001 time1003 time1004 time101 time103 time2 time3
1 1 a b c b b b a a a a
So R seems to recognise so some of numerical order but considers that time10000 is prior to 2 or 3.
Why is it so ? and I could I solve this problem.
What I would like is this :
id time1 time2 time3 time101 time103 time1001 time1003 time1004 time10000 time100010
1 1 a a a a a b b b b c
The data
dt = structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
time = structure(c(1L, 9L, 10L, 7L, 8L, 4L, 5L, 6L, 2L, 3L
), .Label = c("time1", "time10000", "time100010", "time1001",
"time1003", "time1004", "time101", "time103", "time2", "time3"
), class = "factor"), act = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 3L), .Label = c("a", "b", "c"), class = "factor")), .Names = c("id",
"time", "act"), class = "data.frame", row.names = c(NA, -10L))

Reorder your factor levels:
> dt$time<-factor(dt$time, as.character(dt$time))
> dt %>% spread(time, act)
id time1 time2 time3 time101 time103 time1001 time1003 time1004 time10000
1 1 a a a a a b b b b
time100010
1 c

Related

Matching the previous row in a specific column and performing a calculation in R

I currently have a data file that resembles this:
R ID A B
1 A1 0 0
2 A1 2 4
3 A1 4 8
4 A2 0 0
5 A2 3 3
6 A2 6 6
I would like to write a script that will only calculate "(8-4)/(4-2)" from the previous row only if the "ID" matches. For example, in the output for a column "C" in row 3, if A1 == A1 in the "ID" column, then (8-4)/(4-2) = 2. If A1 != A1, then output is 0.
I would like the output to be like this:
R ID A B C
1 A1 0 0 0
2 A1 2 4 2
3 A1 4 8 2
4 A2 0 0 0
5 A2 3 3 1
6 A2 6 6 1
Hopefully I explained this correctly in a non-confusing manner.
We could group_by ID, use diff to calculate difference between rows and divide.
library(dplyr)
df %>% group_by(ID) %>% mutate(C = c(0, diff(B)/diff(A)))
# R ID A B C
# <int> <fct> <int> <int> <dbl>
#1 1 A1 0 0 0
#2 2 A1 2 4 2
#3 3 A1 4 8 2
#4 4 A2 0 0 0
#5 5 A2 3 3 1
#6 6 A2 6 6 1
and similarly using data.table
library(data.table)
setDT(df)[, C := c(0, diff(B)/diff(A)), ID]
data
df <- structure(list(R = 1:6, ID = structure(c(1L, 1L, 1L, 2L, 2L,
2L), .Label = c("A1", "A2"), class = "factor"), A = c(0L, 2L,
4L, 0L, 3L, 6L), B = c(0L, 4L, 8L, 0L, 3L, 6L)), class = "data.frame",
row.names = c(NA, -6L))
We can also use lag
library(dplyr)
df %>%
group_by(ID) %>%
mutate(C = (B - lag(B, default = first(B)))/(A - lag(A, default = first(A))))
data
df <- structure(list(R = 1:6, ID = structure(c(1L, 1L, 1L, 2L, 2L,
2L), .Label = c("A1", "A2"), class = "factor"), A = c(0L, 2L,
4L, 0L, 3L, 6L), B = c(0L, 4L, 8L, 0L, 3L, 6L)), class = "data.frame",
row.names = c(NA, -6L))

subsetting data based with the condition of the current and previous entity in r

I have data with the status column. I want to subset my data to the condition of 'f' status, and previous condition of 'f' status.
to simplify:
df
id status time
1 n 1
1 n 2
1 f 3
1 n 4
2 f 1
2 n 2
3 n 1
3 n 2
3 f 3
3 f 4
my result should be:
id status time
1 n 2
1 f 3
2 f 1
3 n 2
3 f 3
3 f 4
How can I do this in R?
Here's a solution using dplyr -
df %>%
group_by(id) %>%
filter(status == "f" | lead(status) == "f") %>%
ungroup()
# A tibble: 6 x 3
id status time
<int> <fct> <int>
1 1 n 2
2 1 f 3
3 2 f 1
4 3 n 2
5 3 f 3
6 3 f 4
Data -
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
status = structure(c(2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
1L), .Label = c("f", "n"), class = "factor"), time = c(1L,
2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 4L)), .Names = c("id", "status",
"time"), class = "data.frame", row.names = c(NA, -10L))

Subsetting a data frame according to recursive rows and creating a column for ordering

Consider the sample data
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
),
.Names = c("id", "A", "B"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id (stored in column 1) has varying number of entries for column A and B. In the example data, there are four observations with id = 1. I am looking for a way to subset this data in R so that there will be at most 3 entries for for each id and finally create another column (labelled as C) which consists of the order of each id. The expected output would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 1L, 0L, 0L),
C = c(1L, 2L, 3L, 1L, 2L, 1L)
),
.Names = c("id", "A", "B","C"),
class = "data.frame",
row.names = c(NA,-6L)
)
Your help is much appreciated.
Like this?
library(data.table)
dt <- as.data.table(df)
dt[, C := seq(.N), by = id]
dt <- dt[C <= 3,]
dt
# id A B C
# 1: 1 20 1 1
# 2: 1 12 1 2
# 3: 1 13 0 3
# 4: 2 11 1 1
# 5: 2 21 0 2
# 6: 3 17 0 1
Here is one option with dplyr and considering the top 3 values based on A (based of the comments of #Ronak Shah).
library(dplyr)
df %>%
group_by(id) %>%
top_n(n = 3, wt = A) %>% # top 3 values based on A
mutate(C = rank(id, ties.method = "first")) # C consists of the order of each id
# A tibble: 6 x 4
# Groups: id [3]
id A B C
<int> <int> <int> <int>
1 1 20 1 1
2 1 12 1 2
3 1 13 0 3
4 2 11 1 1
5 2 21 0 2
6 3 17 0 1

R - how to avoid repeating filter & row bind

Because I am working on a very large dataset, I need to slice my dataset by groups in order to pursue my computations.
I have a person-period (melt) dataset that looks like this
group id var time
1 A 1 a 1
2 A 1 b 2
3 A 1 a 3
4 A 2 b 1
5 A 2 b 2
6 A 2 b 3
7 B 1 a 1
8 B 1 a 2
9 B 1 a 3
10 B 2 c 1
11 B 2 c 2
12 B 2 c 3
I need to do this simple transformation
library(reshape2)
library(dplyr)
dt %>% dcast(group + id ~ time, value.var = 'var')
In order to get
group id 1 2 3
1 A 1 a b a
2 A 2 b b b
3 B 1 a a a
4 B 2 c c c
So far, so good.
However, because my database is too big, I need to do this separately for each different groups, such as
a = dt %>% filter(group == 'A') %>% dcast(group + id ~ time, value.var ='var')
b = dt %>% filter(group == 'B') %>% dcast(group + id ~ time, value.var = 'var')
bind_rows(a,b)
My problem is that I would like to avoid doing it by hand. I mean, having to store separately each groups, a = ..., b = ..., c = ..., and so on
Any idea how I could have a single pipe stream that would separate each group, compute the transformation and put it back together in a dataframe ?
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), time = structure(c(1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor")), .Names = c("group", "id",
"var", "time"), row.names = c(NA, -12L), class = "data.frame")
Package purrr can be useful for working with lists. First split the dataset by group and then use map_df to dcast each list but return everything in a single data.frame.
library(purrr)
dt %>%
split(.$group) %>%
map_df(~dcast(.x, group + id ~ time, value.var = "var"))
group id 1 2 3
1 A 1 a b a
2 A 2 b b b
3 B 1 a a a
4 B 2 c c c
lapply is your friend here:
do.call(rbind, lapply(unique(dt$Group), function(grp, dt){
dt %>% filter(Group == grp) %>% dcast(group + id ~ time, value.var = "var")
}, dt = dt))

'Complex' aggregation function in dcast from reshape2

I have a dataframe in long form for which I need to aggregate several observations taken on a particular day.
Example data:
long <- structure(list(Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Genotype = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), View = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor"), variable = c(1496L, 1704L,
1738L, 1553L, 1834L, 1421L, 1208L, 1845L, 1325L, 1264L, 1920L,
1735L)), .Names = c("Day", "Genotype", "View", "variable"), row.names = c(NA, -12L),
class = "data.frame")
> long
Day Genotype View variable
1 1 A 1 1496
2 1 A 2 1704
3 1 A 3 1738
4 1 B 1 1553
5 1 B 2 1834
6 1 B 3 1421
7 2 A 1 1208
8 2 A 2 1845
9 2 A 3 1325
10 2 B 1 1264
11 2 B 2 1920
12 2 B 3 1735
I need to aggregate each genotype for each day by taking the cube root of the product of each view. So for genotype A on day 1, (1496 * 1704 * 1738)^(1/3). Final dataframe would look like:
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Have been going round and round with reshape2 for the last couple of days, but not getting anywhere. Help appreciated!
I'd probably use plyr and ddply for this task:
library(plyr)
ddply(long, .(Day, Genotype), summarize,
summary = prod(variable) ^ (1/3))
#-----
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Or this with dcast:
dcast(data = long, Day + Genotype ~ .,
value.var = "variable", function(x) prod(x) ^ (1/3))
#-----
Day Genotype NA
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
An other solution without additional packages.
aggregate(list(Summary=long$variable),by=list(Day=long$Day,Genotype=long$Genotype),function(x) prod(x)^(1/length(x)))
Day Genotype Summary
1 1 A 1642.418
2 2 A 1434.695
3 1 B 1593.633
4 2 B 1614.790

Resources