I have a data frame like this:
A B value
1 1 0.123
2 1 0.213
3 1 0.543
1 2 0.313
2 2 0.123
3 2 0.412
What I want to do is create a function that shifts the rows of this data frame by a given number of positions, wrapping around. For example, if the shift value is 1, the data frame becomes:
A B value
3 2 0.412
1 1 0.123
2 1 0.213
3 1 0.543
1 2 0.313
2 2 0.123
etc...
The function should look like this:
shift<-function(dataframe,shiftvalue)
Is there a simple way to do this in R without resorting to a lot of loops?
You can do it many ways, but one way is to use head and tail:
df <- data.frame(a=1:10, b = 11:20)
shift <- function(d, k) rbind( tail(d,k), head(d,-k), deparse.level = 0 )
> shift(df,3)
a b
8 8 18
9 9 19
10 10 20
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
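One caveat with the head/tail approach is that head(d, -0) is the same as head(d, 0), so a shift of 0 returns an empty data frame. The sketch below is one possible way to guard against that; the name shift_ht and the modulo reduction of k are my own additions, not part of the answer above.

```r
# Hypothetical wrapper around the head/tail idea (shift_ht is an assumed name).
shift_ht <- function(d, k) {
  n <- nrow(d)
  if (n == 0L) return(d)   # nothing to rotate
  k <- k %% n              # reduce negative and oversized shifts to 0..n-1
  if (k == 0) return(d)    # head(d, -0) would drop every row, so return early
  rbind(tail(d, k), head(d, -k), deparse.level = 0)
}

df <- data.frame(a = 1:10, b = 11:20)
shift_ht(df, 3)$a    # last three rows move to the top
shift_ht(df, 0)$a    # unchanged
shift_ht(df, -1)$a   # first row moves to the bottom
```

Reducing k modulo the row count also makes shifts larger than nrow(d) behave like repeated full rotations.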
I prefer plain old modulo ;-)
shift<-function(df,offset) df[((1:nrow(df))-1-offset)%%nrow(df)+1,]
It is pretty straightforward; the only quirk is R's one-based indexing, which is why 1 is subtracted before the modulo and added back afterwards. It also works for offsets like 0, -7 or 7*nrow(df).
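To make the edge cases concrete, here is a small sketch of the same modulo idea applied to the data from the question. The name shift_mod and the drop = FALSE argument (which keeps a one-column data frame from collapsing to a vector) are my additions.

```r
# Circular row shift via modulo arithmetic (shift_mod is an assumed name).
shift_mod <- function(df, offset) {
  n <- nrow(df)
  # move to zero-based positions, subtract the offset, wrap, move back to one-based
  df[(seq_len(n) - 1 - offset) %% n + 1, , drop = FALSE]
}

df <- data.frame(A = c(1, 2, 3, 1, 2, 3),
                 B = c(1, 1, 1, 2, 2, 2),
                 value = c(0.123, 0.213, 0.543, 0.313, 0.123, 0.412))

shift_mod(df, 1)$value              # last row moves to the top
shift_mod(df, 0)$value              # unchanged
shift_mod(df, -7)$value             # same as a shift of -1
shift_mod(df, 7 * nrow(df))$value   # unchanged, a whole number of rotations
```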
Here is my implementation:
> shift <- function(df, sv = 1) df[c((sv+1):nrow(df), 1:sv),]
> head(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
> tail(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
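Updated (the first version rotates in the opposite direction to the one the question asks for; this one moves the last sv rows to the top):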
> shift <- function(df, sv = 1) df[c((nrow(df)-sv+1):nrow(df), 1:(nrow(df)-sv)),]
> head(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
> tail(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
There's a shift function in taRifx that works on vectors. Applying it to a whole data frame coerces all columns to character if any of them are character, so we'll use a trick from plyr and apply it column by column. I'll likely write a data.frame method for it soon:
dd <- data.frame(b = seq(4),
x = c("A", "D", "A", "C"), y = c('a','b','c','d'),
z = c(1, 1, 1, 2),stringsAsFactors=FALSE)
> dd
b x y z
1 1 A a 1
2 2 D b 1
3 3 A c 1
4 4 C d 2
library(taRifx)
library(plyr)
shift.data.frame <- colwise(shift)
> shift.data.frame(dd)
b x y z
1 2 D b 1
2 3 A c 1
3 4 C d 2
4 1 A a 1
> shift(dd,n=-1)
b x y z
1 4 C d 2
2 1 A a 1
3 2 D b 1
4 3 A c 1
> shift(dd,n=-1,wrap=FALSE)
b x y z
1 1 A a 1
2 2 D b 1
3 3 A c 1
> shift(dd,n=-1,wrap=FALSE,pad=TRUE)
b x y z
1 NA <NA> <NA> NA
2 1 A a 1
3 2 D b 1
4 3 A c 1
The advantage of shift is that it takes a bunch of options:
n can be positive or negative, to shift in either direction
wrap can be turned on or off
If wrap is turned off, pad can be turned on to pad with NAs so that the vector keeps its length
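To illustrate how these three options interact, here is a rough base R sketch that reproduces the behaviour shown in the outputs above for a single vector. It is not taRifx's implementation; shift_vec is a hypothetical name, and the sketch assumes abs(n) is smaller than the vector length.

```r
# Hypothetical stand-in for a vector shift with n / wrap / pad options.
shift_vec <- function(x, n = 1L, wrap = TRUE, pad = FALSE) {
  len <- length(x)
  stopifnot(abs(n) < len)          # assumption: no shifts of a full length or more
  if (n == 0) return(x)
  if (n > 0) {
    kept  <- x[(n + 1):len]        # elements that remain, moved toward the front
    moved <- x[seq_len(n)]         # elements pushed off the front
    if (wrap) {
      c(kept, moved)
    } else if (pad) {
      c(kept, rep(NA, n))
    } else {
      kept
    }
  } else {
    k <- -n
    kept  <- x[seq_len(len - k)]   # elements that remain, moved toward the back
    moved <- x[(len - k + 1):len]  # elements pushed off the back
    if (wrap) {
      c(moved, kept)
    } else if (pad) {
      c(rep(NA, k), kept)
    } else {
      kept
    }
  }
}

shift_vec(1:4)                                   # 2 3 4 1
shift_vec(1:4, n = -1)                           # 4 1 2 3
shift_vec(1:4, n = -1, wrap = FALSE)             # 1 2 3
shift_vec(1:4, n = -1, wrap = FALSE, pad = TRUE) # NA 1 2 3
```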
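dplyr's lag() and lead() shift a vector without wrapping, padding the vacated positions with NA (see https://dplyr.tidyverse.org/reference/lead-lag.html):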
lag(1:5, n = 1)
#> [1] NA 1 2 3 4
lag(1:5, n = 2)
#> [1] NA NA 1 2 3
Related
I wonder how to combine mutate and if_else to transform a data frame into TRUE and FALSE?
For example, mutate a table into TRUE (value >= 2) and FALSE (value < 2):
> iris %>% as_tibble() %>% select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
# A tibble: 150 × 4
Sepal.Length Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2
2 4.9 3 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
7 4.6 3.4 1.4 0.3
8 5 3.4 1.5 0.2
9 4.4 2.9 1.4 0.2
10 4.9 3.1 1.5 0.1
# … with 140 more rows
into
Sepal.Length Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl> <dbl>
1 T T F F
2 T T F F
3 T T F F
4 T T F F
5 T T F F
6 T T F F
7 T T F F
Thanks a lot!
iris %>%
mutate(across(where(is.numeric), ~ . >= 2))
You don't need if_else when the result you want is TRUE or FALSE. Generally, ifelse(test, TRUE, FALSE) is a long way of writing test.
Or in base R
iris[1:4] >= 2
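As a quick check of both points, the sketch below first confirms that wrapping a test in ifelse(test, TRUE, FALSE) gives back the test itself, then shows that the base R comparison yields a logical matrix, which can be turned back into a data frame if needed. The variable names are only illustrative.

```r
# A comparison already produces TRUE/FALSE, so ifelse() adds nothing here.
x <- c(0.2, 1.4, 3.5, 5.1)
test <- x >= 2
same <- identical(ifelse(test, TRUE, FALSE), test)
same

# Comparing a data frame with a scalar returns a logical matrix ...
flags_matrix <- iris[1:4] >= 2
is.matrix(flags_matrix)

# ... which can be converted back to a data frame of logical columns.
flags <- as.data.frame(flags_matrix)
head(flags, 3)
```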
In R I frequently use dplyr's select in combination with everything()
df %>% select(var4, var17, everything())
The above for example would reorder the columns of the dataframe, such that var4 is the first, var17 is the second and subsequently all remaining columns are listed. What is the most pandathonic way of doing this? Working with many columns makes explicitly spelling them out a pain as well as keeping track of their position.
The ideal solution is short, readable and can be used in pandas chaining.
Use Index.difference to get all columns that are not in the specified list, then concatenate the two lists:
df = pd.DataFrame({
'G':list('abcdef'),
'var17':[4,5,4,5,5,4],
'A':[7,8,9,4,2,3],
'var4':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
cols = ['var4','var17']
another = df.columns.difference(cols, sort=False).tolist()
df = df[cols + another]
print (df)
var4 var17 G A E F
0 1 4 a 7 5 a
1 3 5 b 8 3 a
2 5 4 c 9 6 a
3 7 5 d 4 9 b
4 1 5 e 2 2 b
5 0 4 f 3 4 b
EDIT: For chaining, you can use DataFrame.pipe, which passes the DataFrame to the function as its first argument:
def everything_after(df, cols):
    # columns not listed in cols, kept in their original order
    another = df.columns.difference(cols, sort=False).tolist()
    return df[cols + another]
df = df.pipe(everything_after, ['var4','var17'])
print (df)
var4 var17 G A E F
0 1 4 a 7 5 a
1 3 5 b 8 3 a
2 5 4 c 9 6 a
3 7 5 d 4 9 b
4 1 5 e 2 2 b
5 0 4 f 3 4 b
See how smoothly you can do it with datar:
>>> from datar import f
>>> from datar.datasets import iris
>>> from datar.dplyr import select, everything, slice_head
>>> iris >> slice_head(5)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
>>> iris >> select(f.Species, everything()) >> slice_head(5)
Species Sepal_Length Sepal_Width Petal_Length Petal_Width
0 setosa 5.1 3.5 1.4 0.2
1 setosa 4.9 3.0 1.4 0.2
2 setosa 4.7 3.2 1.3 0.2
3 setosa 4.6 3.1 1.5 0.2
4 setosa 5.0 3.6 1.4 0.2
I am the author of the package. Feel free to submit issues if you have any questions.
Currently, I have a time series data table and I have to get the first and the last entry for each group. I am using the code below to do that:
data[,c(.SD[1,] , .SD[2,]),by=c("id","status","group")]
Now instead of the first and last, I want to take the first and the last three entries by the grouping.
Any help is appreciated.
Quick and dirty:
#sample data
library(data.table)
DT <- data.table( id = 1:2, value = 1:100 )
#code
DT[, .SD[ c( 1:3, .N-2, .N-1, .N ) ], by = .(id) ]
#output
# id value
# 1: 1 1
# 2: 1 3
# 3: 1 5
# 4: 1 95
# 5: 1 97
# 6: 1 99
# 7: 2 2
# 8: 2 4
# 9: 2 6
# 10: 2 96
# 11: 2 98
# 12: 2 100
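I think head and tail are safer, since they never index past the ends of a small group; calling unique once on the combined indices, as mentioned by sindri_baldur, removes rows that would otherwise be selected twice: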
k <- 3L
DT[unique(DT[, c(head(.I, k), tail(.I, k)), id]$V1)]
output:
id VAL
1: 1 1
2: 1 2
3: 1 3
4: 1 5
5: 1 6
6: 1 7
7: 2 8
8: 2 9
9: 2 10
10: 2 11
11: 2 12
12: 2 13
13: 3 14
14: 3 15
15: 3 16
16: 3 17
17: 3 18
18: 4 19
19: 4 20
20: 4 21
21: 4 22
22: 5 23
23: 5 24
24: 5 25
25: 6 26
26: 6 27
27: 7 28
id VAL
data:
library(data.table)
x <- sort(sequence(7:1))
DT <- data.table(id=x, VAL=1:length(x))
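To see why head and tail plus unique behave better on small groups, here is a base R sketch comparing the index arithmetic directly. The helper names naive_idx and first_last_idx are hypothetical, and the toy data frame is made up for illustration.

```r
# Index arithmetic: repeats rows when a group has fewer than 2 * k rows.
naive_idx <- function(n, k) c(seq_len(k), (n - k + 1):n)

# head/tail never run past the ends; unique drops the overlap.
first_last_idx <- function(i, k) unique(c(head(i, k), tail(i, k)))

naive_idx(4, 3)                 # 1 2 3 2 3 4, rows 2 and 3 appear twice
first_last_idx(seq_len(4), 3)   # 1 2 3 4
first_last_idx(seq_len(2), 3)   # 1 2

# Applying it per group in base R: one group of 7 rows, one of 2.
df <- data.frame(id = c(1, 1, 1, 1, 1, 1, 1, 2, 2), val = 1:9)
rows <- unlist(lapply(split(seq_len(nrow(df)), df$id), first_last_idx, k = 3),
               use.names = FALSE)
df[rows, ]
```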
Using @chinsoon12's data, here is another way:
k = 3L
max_grp = 2L * k
init_seq = seq_len(k)
k_minus_one = k - 1L
DT[DT[, if (.N <=max_grp) .I else .I[c(init_seq, (.N-k_minus_one):.N)], by = x]$V1]
This answer scales well but is definitely an eyesore.
One solution using dplyr could be
data %>%
group_by(id, status, group) %>%
slice(c(1:3, (n()-2):n()))
Example with the iris dataset
iris %>%
group_by(Species) %>%
slice(c(1:3, (n()-2):n()))
Output
# A tibble: 18 x 5
# Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.2 1.4 0.2 setosa
# 5 5.3 3.7 1.5 0.2 setosa
# 6 5 3.3 1.4 0.2 setosa
# 7 7 3.2 4.7 1.4 versicolor
# 8 6.4 3.2 4.5 1.5 versicolor
# 9 6.9 3.1 4.9 1.5 versicolor
# 10 6.2 2.9 4.3 1.3 versicolor
# 11 5.1 2.5 3 1.1 versicolor
# 12 5.7 2.8 4.1 1.3 versicolor
# 13 6.3 3.3 6 2.5 virginica
# 14 5.8 2.7 5.1 1.9 virginica
# 15 7.1 3 5.9 2.1 virginica
# 16 6.5 3 5.2 2 virginica
# 17 6.2 3.4 5.4 2.3 virginica
# 18 5.9 3 5.1 1.8 virginica
Here's the dplyr::filter solution for taking the first three and last three entries within each group:
data %>%
group_by(id, status, group) %>%
filter(row_number() %in% c(1:3, (n()-2):n()))
It is quite easy to dummy code a collapsed column using the tidyverse. Here is a quick example of how I've done it in the past. First, I'll load the iris data and create a custom collapsed column of randomly sampled letters:
library(tidyverse)
# load practice data
data(iris)
iris <- as_tibble(iris)
# create column of collapsed values
lst <- list()
for(i in 1:150) {
value <- as.list(paste0(sample(letters[1:2], 1), ", ", sample(letters[3:4], 1)))
lst[i] <- value
}
# append custom columns to the iris dataset
iris$Samples <- unlist(lst)
iris$Subject <- c(1:150)
iris <- iris %>% select(Subject, everything())
# preview custom dataset
iris
# A tibble: 150 x 7
Subject Sepal.Length Sepal.Width Petal.Length Petal.Width Species Samples
<int> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 1 5.1 3.5 1.4 0.2 setosa a, d
2 2 4.9 3 1.4 0.2 setosa a, c
3 3 4.7 3.2 1.3 0.2 setosa a, c
4 4 4.6 3.1 1.5 0.2 setosa b, c
5 5 5 3.6 1.4 0.2 setosa a, c
6 6 5.4 3.9 1.7 0.4 setosa a, d
7 7 4.6 3.4 1.4 0.3 setosa b, c
8 8 5 3.4 1.5 0.2 setosa b, c
9 9 4.4 2.9 1.4 0.2 setosa b, d
10 10 4.9 3.1 1.5 0.1 setosa a, c
# ... with 140 more rows
So, let's say that each letter represented a unique value of interest and I wanted to wrangle this data into a series of dummy coded variables for each letter. Here is how I would do this using tidyverse functions:
iris %>%
separate_rows(Samples, sep = ', ') %>%
mutate(Values = 1) %>%
pivot_wider(names_from = "Samples", values_from = "Values") %>%
mutate_if(is.double, ~replace_na(., 0))
# A tibble: 150 x 10
Subject Sepal.Length Sepal.Width Petal.Length Petal.Width Species a d c b
<int> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 5.1 3.5 1.4 0.2 setosa 1 1 0 0
2 2 4.9 3 1.4 0.2 setosa 1 0 1 0
3 3 4.7 3.2 1.3 0.2 setosa 1 0 1 0
4 4 4.6 3.1 1.5 0.2 setosa 0 0 1 1
5 5 5 3.6 1.4 0.2 setosa 1 0 1 0
6 6 5.4 3.9 1.7 0.4 setosa 1 1 0 0
7 7 4.6 3.4 1.4 0.3 setosa 0 0 1 1
8 8 5 3.4 1.5 0.2 setosa 0 0 1 1
9 9 4.4 2.9 1.4 0.2 setosa 0 1 0 1
10 10 4.9 3.1 1.5 0.1 setosa 1 0 1 0
# ... with 140 more rows
This is fast and efficient for small datasets. But, I am quickly moving into datasets that have millions of rows. Enter data.table.
How would I accomplish the same process using data.table? Here is my attempt:
library(data.table)
# convert my tibble into a data.table
iris.dt <- as.data.table(iris)
# perform the separate_rows functionality on my data
result <- iris.dt[, list(Samples = unlist(strsplit(Samples, ", "))), by = Subject
][, Values := 1]
print(result)
Subject Samples Values
1: 1 a 1
2: 1 d 1
3: 2 a 1
4: 2 c 1
5: 3 a 1
---
296: 148 d 1
297: 149 a 1
298: 149 d 1
299: 150 b 1
300: 150 c 1
The problem is that I don't know how to (1) keep all other columns and (2) spread out this info in a way similar to dplyr::pivot_wider.
Any help would be much appreciated!
One way is to use tstrsplit and then melt + dcast. It seems somewhat inefficient, but I am not sure of another way.
Example Data:
library(magrittr)
library(data.table)
set.seed(2020)
iris.dt <- as.data.table(iris)
iris.dt[, samples := paste0(sample(letters[1:2], .N, T), ', ', sample(letters[3:4], .N, T))]
Create dummy cols
new_cols <-
iris.dt[, tstrsplit(samples, ', ')][, I := .I] %>%
melt('I') %>%
dcast(I ~ value, fun.agg = length) %>%
.[, I := NULL]
iris.dt[, names(new_cols) := new_cols][]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species samples a b c d
# 1: 5.1 3.5 1.4 0.2 setosa b, c 0 1 1 0
# 2: 4.9 3.0 1.4 0.2 setosa a, d 1 0 0 1
# 3: 4.7 3.2 1.3 0.2 setosa b, c 0 1 1 0
# 4: 4.6 3.1 1.5 0.2 setosa a, d 1 0 0 1
# 5: 5.0 3.6 1.4 0.2 setosa a, c 1 0 1 0
# ---
# 146: 6.7 3.0 5.2 2.3 virginica b, d 0 1 0 1
# 147: 6.3 2.5 5.0 1.9 virginica a, d 1 0 0 1
# 148: 6.5 3.0 5.2 2.0 virginica b, c 0 1 1 0
# 149: 6.2 3.4 5.4 2.3 virginica a, c 1 0 1 0
# 150: 5.9 3.0 5.1 1.8 virginica a, d 1 0 0 1
Here is another option using matrix indexing, where a two-column numeric matrix of (row, column) positions picks the cells to set to 1:
l <- strsplit(DT[["Samples"]], ",")
nl <- lengths(l)
ul <- unlist(l)
cols <- sort(unique(ul))
DT[, (cols) := {
m <- matrix(0L, nrow=.N, ncol=length(cols))
m[cbind(rep(1L:.N, nl), match(ul, cols))] <- 1L
as.data.table(m)
}]
output:
Subject Samples a b c d
1: 1 a,d 1 0 0 1
2: 2 a,c 1 0 1 0
3: 3 a,c 1 0 1 0
4: 4 b,c 0 1 1 0
5: 5 a,c 1 0 1 0
6: 6 a,d 1 0 0 1
7: 7 b,c 0 1 1 0
8: 8 b,c 0 1 1 0
9: 9 b,d 0 1 0 1
10: 10 a,c 1 0 1 0
data:
DT <- fread("Subject Samples
1 a,d
2 a,c
3 a,c
4 b,c
5 a,c
6 a,d
7 b,c
8 b,c
9 b,d
10 a,c", sep=" ")
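The matrix-indexing step can be seen in isolation in the following base R sketch, which uses a made-up three-element vector instead of the data.table column. All variable names here are illustrative.

```r
samples <- c("a,d", "a,c", "b,c")

parts   <- strsplit(samples, ",", fixed = TRUE)  # list of split values per row
n_parts <- lengths(parts)                        # how many values each row has
flat    <- unlist(parts)                         # all values in row order
levels_found <- sort(unique(flat))               # one dummy column per value

m <- matrix(0L, nrow = length(samples), ncol = length(levels_found),
            dimnames = list(NULL, levels_found))

# Each row of this two-column matrix is one (row, column) cell to set to 1.
cells <- cbind(rep(seq_along(samples), n_parts), match(flat, levels_found))
m[cells] <- 1L
m
```

Because all cells are assigned in a single vectorised step, no per-row loop is needed.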
Let's say I want to replace several variables by 1 in a dataset:
data(iris)
put_1 <- function(x){ x = 1}
iris %>%
mutate_at(vars(Petal.Length, Petal.Width), funs(put_1)) %>%
head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1 1 setosa
# 2 4.9 3.0 1 1 setosa
# 3 4.7 3.2 1 1 setosa
# 4 4.6 3.1 1 1 setosa
# 5 5.0 3.6 1 1 setosa
# 6 5.4 3.9 1 1 setosa
Question: Is there a way to do the same without declaring a function beforehand?
I tried things like :
mutate_at(vars(...), funs(function(x){ x <- 1 }))
mutate_at(vars(...), funs(~ 1 }))
mutate_at(vars(...), funs(~ . = 1 }))
without success.
Thank you in advance.
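This is one of the cases where = and <- don't work the same: inside a function call, = names an argument, while <- is an assignment expression whose value is passed along.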
> iris%>%mutate_at(vars(Petal.Length,Petal.Width),funs(.<-1))%>%head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1 1 setosa
2 4.9 3.0 1 1 setosa
3 4.7 3.2 1 1 setosa
4 4.6 3.1 1 1 setosa
5 5.0 3.6 1 1 setosa
6 5.4 3.9 1 1 setosa
And
> iris%>%mutate_at(vars(Petal.Length,Petal.Width),funs(.=1))%>%head()
Error: Can't create call to non-callable object
Call `rlang::last_error()` to see a backtrace
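The difference can be inspected in base R without dplyr. In the sketch below, f is never called and identity_value is a hypothetical helper; the point is only how the two calls are parsed.

```r
# Inside a call, `=` names an argument; `<-` is an ordinary expression.
call_eq    <- quote(f(. = 1))
call_arrow <- quote(f(. <- 1))

names(call_eq)        # "" and ".": the call has an argument named "."
names(call_arrow)     # NULL: the call has no named arguments
call_eq[[2]]          # the plain value 1
call_arrow[[2]][[1]]  # the assignment operator `<-`

# An assignment expression evaluates to the assigned value,
# so passing it as an argument hands 1 to the function.
identity_value <- function(v) v
identity_value(tmp <- 1)
```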
The best answer is from @josemz:
iris %>%
mutate_at(vars(Petal.Length, Petal.Width), ~ 1)
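For comparison, the same result can be had in base R with no helper function at all. This is only a sketch of an alternative, not part of the answers above; out is an illustrative name.

```r
# Assign the constant to the selected columns directly; it is recycled per column.
out <- iris
out[c("Petal.Length", "Petal.Width")] <- 1
head(out)
```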