Suppose I have a simple dataset
df <- data.frame(id=c("A","B","C","D","E","F"),
value=c(1,NA,NA,NA,NA,NA))
I want to recode value (or create a new variable) so that each subsequent value is equal to the previous value * 2 + the previous value.
| id | value |
|----|-------|
| A | 1 |
| B | 3 |
| C | 9 |
| D | 27 |
| E | 81 |
| F | 243 |
I thought I could do this using lag:
df <- df %>%
mutate(value=(lag(value)*2)+lag(value))
But that didn't work. So instead I used a for loop
for (i in 2:nrow(df)){
df[i, "value"] <- (df[i - 1, "value"] * 2) + df[i - 1, "value"]
}
That works but seems inelegant. Is there a better way to do this using tidyverse conventions/tools?
We can use accumulate from purrr
library(dplyr)
library(purrr)
df %>%
mutate(value = accumulate(value, ~ .x * 2 + .x))
# id value
#1 A 1
#2 B 3
#3 C 9
#4 D 27
#5 E 81
#6 F 243
Or more compact
df %>%
mutate(value = accumulate(value, ~ .x* 3))
Or in base R with Reduce
Reduce(function(x, y) x * 2 + x, df$value, accumulate = TRUE)
#[1] 1 3 9 27 81 243
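As for why the lag attempt in the question fails: lag shifts the original column once, so every row except the second still has NA as its predecessor. A minimal base R sketch of the same computation, where c(NA, head(value, -1)) is used here as a stand-in for dplyr's lag(value):

```r
value <- c(1, NA, NA, NA, NA, NA)

# Shift the vector down by one position; this mimics lag(value)
lagged <- c(NA, head(value, -1))

# The vectorized formula only ever sees the ORIGINAL values,
# so only the second element gets a non-missing result
attempt <- lagged * 2 + lagged
attempt
# [1] NA  3 NA NA NA NA
```

accumulate and Reduce avoid this because each step receives the result of the previous step rather than the original column.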
We can use accumulate from purrr:
library(dplyr)
df %>%
mutate(value = purrr::accumulate(value[-n()], ~.x * 2 + .x,
.init = first(value)))
# id value
#1 A 1
#2 B 3
#3 C 9
#4 D 27
#5 E 81
#6 F 243
Which can be done similarly in base R using Reduce
Reduce(function(x, y) x * 2 + x, df$value[-nrow(df)], init = df$value[1],
accumulate = TRUE)
#[1] 1 3 9 27 81 243
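For this particular recurrence there is also a closed form: since every step multiplies the previous value by 3, row n is just the first value times 3^(n - 1). A small base R sketch, assuming only the first value is known and the rows are already in the desired order:

```r
df <- data.frame(id = c("A", "B", "C", "D", "E", "F"),
                 value = c(1, NA, NA, NA, NA, NA))

# Exponents 0, 1, ..., nrow(df) - 1 applied to the starting value
df$value <- df$value[1] * 3^(seq_len(nrow(df)) - 1)
df$value
```

This only works because the update rule is a constant multiplication; for a general step function, accumulate or Reduce is still the way to go.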
An example:
a = c(10,20,30)
b = c(1,2,3)
c = c(4,5,6)
d = c(7,8,9)
df=data.frame(a,b,c,d)
library(dplyr)
df_1 = df %>% mutate(a1=sum(a+1))
How do I add "a1" after "a" (or any other defined position) and NOT at the end?
Thank you.
An update that might be useful for others who find this question - this can now be achieved directly within mutate (I'm using dplyr v1.0.2).
Just specify which existing column the new column should be positioned after or before, e.g.:
df_after <- df %>%
mutate(a1=sum(a+1), .after = a)
df_before <- df %>%
mutate(a1=sum(a+1), .before = b)
Another option is add_column from tibble
library(tibble)
add_column(df, a1 = sum(a + 1), .after = "a")
# a a1 b c d
#1 10 63 1 4 7
#2 20 63 2 5 8
#3 30 63 3 6 9
Extending on www's answer, we can use dplyr's select_helper functions to reorder newly created columns as we see fit:
library(dplyr)
## add a1 after a
df %>%
mutate(a1 = sum(a + 1)) %>%
select(a, a1, everything())
#> a a1 b c d
#> 1 10 63 1 4 7
#> 2 20 63 2 5 8
#> 3 30 63 3 6 9
## add a1 after c
df %>%
mutate(a1 = sum(a + 1)) %>%
select(1:c, a1, everything())
#> a b c a1 d
#> 1 10 1 4 63 7
#> 2 20 2 5 63 8
#> 3 30 3 6 63 9
dplyr >= 1.0.0
relocate was added as a new verb to change the order of one or more columns. If you pipe the output of your mutate the syntax for relocate also uses .before and .after arguments:
df_1 %>%
relocate(a1, .after = a)
a a1 b c d
1 10 63 1 4 7
2 20 63 2 5 8
3 30 63 3 6 9
An additional benefit is you can also move multiple columns using any tidyselect syntax:
df_1 %>%
relocate(c:a1, .before = b)
a c d a1 b
1 10 4 7 63 1
2 20 5 8 63 2
3 30 6 9 63 3
By default, mutate adds the newly created column at the end. However, we can sort the columns alphabetically after the mutate call using select, which happens to place a1 right after a here.
library(dplyr)
df_1 <- df %>%
mutate(a1 = sum(a + 1)) %>%
select(sort(names(.)))
df_1
# a a1 b c d
# 1 10 63 1 4 7
# 2 20 63 2 5 8
# 3 30 63 3 6 9
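If you would rather not depend on dplyr, or the alphabetical order does not match the position you want, the same reordering can be sketched in base R by building the new column order with append. The variable names others and pos below are just illustrative:

```r
df <- data.frame(a = c(10, 20, 30), b = c(1, 2, 3),
                 c = c(4, 5, 6), d = c(7, 8, 9))
df$a1 <- sum(df$a + 1)               # new column lands at the end

others <- setdiff(names(df), "a1")   # existing columns in their original order
pos <- match("a", others)            # position of the column to insert after
df <- df[append(others, "a1", after = pos)]

names(df)                            # columns are now a, a1, b, c, d
```

Changing "a" to another column name moves a1 to a different position; using after = 0 in append would put it first.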
I have a dataset (N of 2794) of which I want to extract a subset, randomly reallocate the class and put it back into the dataframe.
Example
| Index | B | C | Class|
| 1 | 3 | 4 | Dog |
| 2 | 1 | 9 | Cat |
| 3 | 9 | 1 | Dog |
| 4 | 1 | 1 | Cat |
From the above example, I want to randomly take N observations from column 'Class' and mix them up so that I get something like this:
| Index | B | C | Class|
| 1 | 3 | 4 | Cat | Re-sampled
| 2 | 1 | 9 | Dog | Re-sampled
| 3 | 9 | 1 | Dog |
| 4 | 1 | 1 | Dog | Re-sampled
This code randomly extracts rows and resamples them, but I don't want to extract the rows; I want to keep them in the data frame.
sample(Class[sample(nrow(Class),N),])
Suppose df is your data frame:
df <- data.frame(index=1:4, B=c(3,1,9,1), C=c(4,9,1,1), Class=c("Dog", "Cat", "Dog", "Cat"))
Would this do what you want?
dfSamp <- sample(1:nrow(df), N)
df$Class[dfSamp] <- sample(df$Class[dfSamp])
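To make the behaviour concrete, here is a small self-contained sketch of the same idea. The seed and N = 3 are arbitrary choices for illustration. Because the selected labels are only permuted among themselves, the untouched rows keep their class and the overall class counts stay the same:

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable
df <- data.frame(index = 1:4, B = c(3, 1, 9, 1), C = c(4, 9, 1, 1),
                 Class = c("Dog", "Cat", "Dog", "Cat"),
                 stringsAsFactors = FALSE)
N <- 3
original <- df$Class

rows <- sample(nrow(df), N)               # which rows get reshuffled
df$Class[rows] <- sample(df$Class[rows])  # permute labels within those rows

# Rows outside `rows` are unchanged, and class frequencies are preserved
identical(df$Class[-rows], original[-rows])
table(df$Class)
```

Note that a permutation can leave some of the selected rows with their original label, especially when the subset is small or dominated by one class.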
I simulated the data frame and did an example:
df <- data.frame(
ID=1:4,
Class=c('Dog', 'Cat', 'Dog', 'Cat')
)
N <- 2
sample_ids <- sample(nrow(df), N)
df$Class[sample_ids] <- sample(df$Class, length(sample_ids))
Assuming Class is the name of your data frame, you could do this:
library(dplyr)
bind_rows(
Class %>%
mutate(origin = 'not_sampled'),
Class %>%
sample_n(100, replace = TRUE) %>%  # sample_n samples rows; base sample() would sample columns
mutate(origin = 'sampled'))
This samples 100 observations from the original data frame and stacks them at the bottom of it. It also adds a column so that you know whether an observation was sampled or was present in the data frame from the beginning.
What you're wanting to do is replace in-line some classes, but not others.
So, if we start with a data frame, df
set.seed(100)
df = data.frame(index = 1:100,
B = sample(1:10,100,replace = T),
C = sample(1:10,100,replace = T),
Class = sample(c('Cat','Dog','Bunny'),100,replace = T))
And you want to update 5 random rows, then we need to pick which rows to update and what new classes to put in those rows. By referencing unique(df$Class), you don't weight the classes by their current occurrence. You could adjust this with the prob argument of sample, or remove unique to use occurrence as the weight.
n_rows = 5
rows_to_update = sample(1:100,n_rows,replace = F)
new_classes = sample(unique(df$Class),n_rows,replace = T)
rows_to_update
#> [1] 85 65 94 60 48
new_classes
#> [1] "Bunny" "Dog" "Dog" "Dog" "Bunny"
We can inspect what the original data looked like
df[rows_to_update,]
#> index B C Class
#> 85 85 1 2 Dog
#> 65 65 5 1 Bunny
#> 94 94 5 10 Dog
#> 60 60 3 7 Bunny
#> 48 48 9 1 Cat
We can update this in place with a reference to the column and the rows to update.
df$Class[rows_to_update] = new_classes
df[rows_to_update,]
#> index B C Class
#> 85 85 1 2 Bunny
#> 65 65 5 1 Dog
#> 94 94 5 10 Dog
#> 60 60 3 7 Dog
#> 48 48 9 1 Bunny
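If you do want the replacement classes weighted by how often each class currently occurs, one way to sketch it is to pass the observed frequencies to the prob argument of sample. The seed and the stand-alone Class vector below are assumptions for illustration, not the data from the example above:

```r
set.seed(1)  # arbitrary seed for repeatability
Class <- sample(c("Cat", "Dog", "Bunny"), 100, replace = TRUE)

freq <- table(Class)                    # current occurrence of each class
weights <- as.numeric(freq) / sum(freq) # relative frequencies

# Draw 5 replacement labels, weighted by current occurrence
new_classes <- sample(names(freq), 5, replace = TRUE, prob = weights)
new_classes
```

Sampling directly from the full Class column with replace = TRUE gives the same weighting without computing the frequencies explicitly.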
I am trying to create an unsummarized data frame from a data frame of count data.
I have had some experience creating sample datasets but I am having some trouble trying to get a specific number of rows and proportion for each state/person without coding each of them separately and then combining them. I was able to do it using the following code but I feel like there is a better way.
set.seed(2312)
dragon <- sample(c(1),3,replace=TRUE)
Maine <- sample(c("Maine"),3,replace=TRUE)
Maine1 <- data.frame(dragon, Maine)
dragon <- sample(c(0),20,replace=TRUE)
Maine <- sample(c("Maine"),20,replace=TRUE)
Maine2 <- data.frame(dragon, Maine)
Maine2
library(dplyr)
maine3 <- bind_rows(Maine1, Maine2)
Is there a better way to generate this dataset than the code above?
I am trying to create a data frame from the following count data:
+-------------+--------------+--------------+
| | # of dragons | # no dragons |
+-------------+--------------+--------------+
| Maine | 3 | 20|
| California | 1 | 10|
| Jocko | 28 | 110515 |
| Jessica Day | 17 | 26122 |
| | 14 | 19655 |
+-------------+--------------+--------------+
And I would like it to look like this:
+-----------------------+---------------+
| | Dragons (1/0) |
+-----------------------+---------------+
| Maine | 1 |
| Maine | 1 |
| Maine | 1 |
| Maine | 0 |
| Maine….(2:20) | 0…. |
| California | 1 |
| California….(2:10) | 0… |
| Etc. | |
+-----------------------+---------------+
I do not want the code written for me, but I would love ideas on functions or examples that you think might be helpful.
I am not completely sure what sampling has to do with this problem.
It looks to me like you are looking for untable.
Here is an example
data:
set.seed(1)
no_drag = sample(1:5, 5)
drag = sample(15:25, 5)
df <- data.frame(names = LETTERS[1:5],
drag,
no_drag)
names drag no_drag
1 A 24 2
2 B 25 5
3 C 20 4
4 D 23 3
5 E 15 1
library(reshape)
library(tidyverse)
df %>%
gather(key, value, 2:3) %>% #convert to long format
{untable(.,num = .$value)} %>% #untable by value column
mutate(value = ifelse(key == "drag", 0, 1)) %>% #convert values to 0/1
select(-key) %>% #remove unwanted column
arrange(names) #optional
#part of output
names value
1 A 0
2 A 0
3 A 0
4 A 0
5 A 0
6 A 0
7 A 0
8 A 0
9 A 0
10 A 0
11 A 0
12 A 0
13 A 0
14 A 0
15 A 0
16 A 0
17 A 0
18 A 0
19 A 0
20 A 0
21 A 0
22 A 0
23 A 0
24 A 0
25 A 1
26 A 1
27 B 0
28 B 0
29 B 0
30 B 0
There are other ways to tackle the problem.
One is what @Frank mentioned in the comments:
df %>%
gather(key, val, 2:3) %>%
mutate(v = Map(rep, key == "drag", val)) %>%
unnest %>%
select(-key, -val)
Another:
df <- gather(df, key, value, 2:3)
df <- df[rep(seq_len(nrow(df)), df$value), 1:2]
# drag -> FALSE, everything else -> TRUE; a single comparison avoids
# the second assignment overwriting the values recoded by the first
df$key <- df$key != "drag"
One can use tidyr::expand to expand rows in desired format.
Using the df from @missuse's answer, the solution looks like this:
library(tidyverse)
df %>% gather(key,value,-names) %>%
mutate(key = ifelse(key=="drag", 1, 0)) %>%
group_by(names,key) %>%
expand(value = 1:value) %>%
select(names, value = key) %>%
as.data.frame()
# names value
# 1 A 0
# 2 A 0
# 3 A 1
# 4 A 1
# 5 A 1
# 6 A 1
# 7 A 1
# 8 A 1
# 9 A 1
# 10 A 1
# ...so on
# 117 E 1
# 118 E 1
# 119 E 1
# 120 E 1
# 121 E 1
# 122 E 1
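For comparison, the same expansion can be sketched in base R with rep alone, here using the first two rows of the count table from the question. The column names state, dragons and no_dragons are made up for this illustration:

```r
counts <- data.frame(state = c("Maine", "California"),
                     dragons = c(3, 1),
                     no_dragons = c(20, 10),
                     stringsAsFactors = FALSE)

long <- data.frame(
  # Repeat each state once per observation (dragons + no dragons)
  state = rep(counts$state, times = counts$dragons + counts$no_dragons),
  # For each state: as many 1s as dragons, then as many 0s as non-dragons
  dragon = unlist(Map(function(d, n) rep(c(1, 0), times = c(d, n)),
                      counts$dragons, counts$no_dragons))
)

table(long$state, long$dragon)
```

Both columns are built state by state in the same order, which is what keeps the 1/0 values aligned with the right state.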
Is it possible to shift data of one cell in a column from one timestamp to other in a time series data without losing any other data? I have tried shift and slide functions but it replaces the data with NA values.
I have also tried the mutate function, but it changes the complete column. Is there any function or method to perform this manipulation?
E.g, convert :
Date_Time | x | y
01-01-2016 | 1 | 2
02-01-2016 | 3 | 4
03-01-2016 | 5 | 6
04-01-2016 | 2 | 5
to:
Date_Time | x | y
01-01-2016 | 5 | 2
02-01-2016 | 3 | 4
03-01-2016 | 1 | 6
04-01-2016 | 2 | 5
or slide the data vertically
Date_Time | x | y
01-01-2016 | 2 | 2
02-01-2016 | 1 | 4
03-01-2016 | 3 | 6
04-01-2016 | 5 | 5
To swap two values you need to hold one in a temporary variable. We can write a simple function:
swap = function(x, i, j) {
stopifnot(length(i) == length(j))
temp = x[i]
x[i] = x[j]
x[j] = temp
return(x)
}
On your data, it should work like this to give the desired result:
your_data$x = swap(your_data$x, which.min(your_data$x), which.max(your_data$x))
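As a quick check, here is the function applied to the x column from the question, written as a plain vector so the snippet runs on its own:

```r
swap <- function(x, i, j) {
  stopifnot(length(i) == length(j))
  temp <- x[i]   # hold the first value(s) before overwriting
  x[i] <- x[j]
  x[j] <- temp
  x
}

x <- c(1, 3, 5, 2)
result <- swap(x, which.min(x), which.max(x))
result
# [1] 5 3 1 2
```

Since i and j can be vectors of equal length, the same function can also exchange several pairs of positions in a single call.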
Two other options with dplyr:
library(dplyr)
df %>%
mutate(x = case_when(
x == max(x) ~ min(x),
x == min(x) ~ max(x),
TRUE ~ x
))
df %>%
mutate(x = replace(x, c(which.max(x), which.min(x)), c(min(x), max(x))))
Result:
Date_Time x y
1 01-01-2016 5 2
2 02-01-2016 3 4
3 03-01-2016 1 6
4 04-01-2016 2 5
To shift x vertically:
df %>%
mutate(x = c(x[-1], x[1]))
or
df %>%
mutate(x = c(x[length(x)], x[-length(x)]))
Result:
> df %>%
+ mutate(x = c(x[-1], x[1]))
Date_Time x y
1 01-01-2016 3 2
2 02-01-2016 5 4
3 03-01-2016 2 6
4 04-01-2016 1 5
> df %>%
+ mutate(x = c(x[length(x)], x[-length(x)]))
Date_Time x y
1 01-01-2016 2 2
2 02-01-2016 1 4
3 03-01-2016 3 6
4 04-01-2016 5 5
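Both shifts are special cases of rotating the vector by k positions. A small sketch of a general helper follows; rotate is a hypothetical name, not a function from dplyr or base R:

```r
# Rotate x by k positions: positive k moves values down (towards later rows),
# negative k moves them up; values that fall off one end wrap to the other
rotate <- function(x, k = 1) {
  n <- length(x)
  if (n == 0) return(x)
  k <- k %% n              # normalise k into 0 .. n - 1
  if (k == 0) return(x)
  c(tail(x, k), head(x, n - k))
}

x <- c(1, 3, 5, 2)
rotate(x, 1)
# [1] 2 1 3 5
rotate(x, -1)
# [1] 3 5 2 1
```

Inside a pipeline this could be used as mutate(x = rotate(x, 1)), which leaves the other columns untouched.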
Data:
df = read.table(text = "Date_Time | x | y
01-01-2016 | 1 | 2
02-01-2016 | 3 | 4
03-01-2016 | 5 | 6
04-01-2016 | 2 | 5", header = TRUE, sep = "|")
I have this data.frame:
df <- data.frame(id=c('A','A','B','B','B','C'), amount=c(45,66,99,34,71,22))
id | amount
-----------
A | 45
A | 66
B | 99
B | 34
B | 71
C | 22
which I need to expand so that each by group in the data.frame is of equal length (filling it out with zeroes), like so:
id | amount
-----------
A | 45
A | 66
A | 0 <- added
B | 99
B | 34
B | 71
C | 22
C | 0 <- added
C | 0 <- added
What is the most efficient way of doing this?
NOTE
Benchmarking some of the solutions provided with my actual 1 million row data.frame, I got:
plyr | data.table | unstack
-----------------------------------
Elapsed: 139.87s | 0.09s | 2.00s
One way using data.table
df <- structure(list(V1 = structure(c(1L, 1L, 2L, 2L, 2L, 3L),
.Label = c("A ", "B ", "C "), class = "factor"),
V2 = c(45, 66, 99, 34, 71, 22)),
.Names = c("V1", "V2"),
class = "data.frame", row.names = c(NA, -6L))
require(data.table)
dt <- data.table(df, key="V1")
# get maximum index
idx <- max(dt[, .N, by=V1]$N)
# get final result
dt[, list(V2 = c(V2, rep(0, idx-length(V2)))), by=V1]
# V1 V2
# 1: A 45
# 2: A 66
# 3: A 0
# 4: B 99
# 5: B 34
# 6: B 71
# 7: C 22
# 8: C 0
# 9: C 0
I'm sure there is a base R solution, but here is one that uses ddply in the plyr package
library(plyr)
##N: How many values should be in each group
N = 3
ddply(df, "id", summarize,
amount = c(amount, rep(0, N-length(amount))))
gives:
id amount
1 A 45
2 A 66
3 A 0
4 B 99
5 B 34
6 B 71
7 C 22
8 C 0
9 C 0
Here's another way in base R using unstack and stack.
# ensure character id col
df <- transform(df, id=as.character(id))
# break into a list by id
u <- unstack(df, amount ~ id)
# get max length
max.len <- max(sapply(u, length))
# pad the short ones with 0s
filled <- lapply(u, function(x) c(x, numeric(max.len - length(x))))
# recombine into data.frame
stack(filled)
# values ind
# 1 45 A
# 2 66 A
# 3 0 A
# 4 99 B
# 5 34 B
# 6 71 B
# 7 22 C
# 8 0 C
# 9 0 C
How about this?
out <- by(df, INDICES = df$id, FUN = function(x, N) {
x <- droplevels(x)
lng <- nrow(x)
dif <- N - lng
if (dif == 0) return(x)
make.list <- lapply(1:dif, FUN = function(y) data.frame(id = levels(x$id), amount = 0))
rbind(x, do.call("rbind", make.list))
}, N = max(table(df$id))) # N could also be an integer
do.call("rbind", out)
id amount
A.1 A 45
A.2 A 66
A.3 A 0
B.3 B 99
B.4 B 34
B.5 B 71
C.6 C 22
C.2 C 0
C.3 C 0
Here is a dplyr option:
library(dplyr)
# Get maximum number of rows for all groups
N = max(count(df,id)$n)
df %>%
group_by(id) %>%
summarise(amount = c(amount, rep(0, N-length(amount))), .groups = "drop")
Output
id amount
<chr> <dbl>
1 A 45
2 A 66
3 A 0
4 B 99
5 B 34
6 B 71
7 C 22
8 C 0
9 C 0
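Another base R sketch of the same padding idea: compute how many rows each group is short of the largest group, build just those filler rows, and append them. The names deficit and pad are illustrative only:

```r
df <- data.frame(id = c("A", "A", "B", "B", "B", "C"),
                 amount = c(45, 66, 99, 34, 71, 22),
                 stringsAsFactors = FALSE)

n <- table(df$id)         # rows per group
deficit <- max(n) - n     # rows missing per group

# One zero-amount row for every missing observation
pad <- data.frame(id = rep(names(deficit), times = as.integer(deficit)),
                  amount = numeric(sum(deficit)),
                  stringsAsFactors = FALSE)

out <- rbind(df, pad)
out <- out[order(out$id), ]   # ties keep their order, so padding ends up last in each group
rownames(out) <- NULL
out
```

This avoids looping over groups entirely, though it has not been benchmarked against the solutions above.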