R, How to generate additional observations denoted by numbered sequence - r

I'm currently a bit stuck, since I'm a bit unsure of how to even formulate my problem.
What I have is a dataframe of observations with a few variables.
Let's say:
test <- data.frame(var1=c("a","b"),var2=c(15,12))
Is my initial dataset.
What I want to end up with is something like:
test2 <- data.frame(var1_p=c("a","a","a","a","a","b","b","b","b","b"),
var2=c(15,15,15,15,15,12,12,12,12,12),
var3=c(1,2,3,4,5,1,2,3,4,5))
However, the initial observation count and the fact that I need the numbering to run from 0-9 make it rather tedious to do by hand.
Does anybody have a nice alternative solution?
Thank you.
What I tried so far was:
a)
testdata$C <- 0
testdata <- for (i in testdata$Combined_Number) {add_row(testdata,C=seq(0,9))}
which leaves the dataset empty.
b)
testdata$C <- with(testdata, ave(Combined_Number,flur, FUN = seq(0,9)))
which gives the following error code:
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'FUN' of mode 'function' was not found

Perhaps crossing helps
library(tidyr)
crossing(df, var3 = 0:9)
-output
# A tibble: 20 × 3
var1 var2 var3
<chr> <dbl> <int>
1 a 15 0
2 a 15 1
3 a 15 2
4 a 15 3
5 a 15 4
6 a 15 5
7 a 15 6
8 a 15 7
9 a 15 8
10 a 15 9
11 b 12 0
12 b 12 1
13 b 12 2
14 b 12 3
15 b 12 4
16 b 12 5
17 b 12 6
18 b 12 7
19 b 12 8
20 b 12 9

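With dplyr this is one approach (from dplyr 1.1.0 on, summarize() warns when a group returns more than one row, and reframe() is the verb intended for that case)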
library(dplyr)
df %>%
group_by(var1) %>%
summarize(var2, var3 = 0:9, .groups="drop")
# A tibble: 20 × 3
var1 var2 var3
<chr> <dbl> <int>
1 a 15 0
2 a 15 1
3 a 15 2
4 a 15 3
5 a 15 4
6 a 15 5
7 a 15 6
8 a 15 7
9 a 15 8
10 a 15 9
11 b 12 0
12 b 12 1
13 b 12 2
14 b 12 3
15 b 12 4
16 b 12 5
17 b 12 6
18 b 12 7
19 b 12 8
20 b 12 9
Data
df <- structure(list(var1 = c("a", "b"), var2 = c(15, 12)), class = "data.frame", row.names = c(NA,
-2L))
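For comparison, the same expansion can be sketched in base R without any packages. This assumes the two-row df from the Data block above; n_rep and expanded are illustrative names that do not appear in the question or the answers.

```r
# Assumed input, copied from the Data block above
df <- data.frame(var1 = c("a", "b"), var2 = c(15, 12))

n_rep <- 10  # hypothetical name: number of numbered copies per original row

# Repeat every row index n_rep times, then attach the running number 0..9
expanded <- df[rep(seq_len(nrow(df)), each = n_rep), ]
expanded$var3 <- rep(seq_len(n_rep) - 1L, times = nrow(df))
rownames(expanded) <- NULL  # drop the "1.1", "1.2", ... row names left by indexing

expanded
```

Indexing with repeated row numbers keeps all original columns, so this should carry over to data frames with more variables than the toy example.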

Related

Assign value to data based on more than two conditions and on other data

I have a data frame that looks like this
> df
name time count
1 A 10 9
2 A 12 17
3 A 24 19
4 A 3 15
5 A 29 11
6 B 31 14
7 B 7 7
8 B 30 18
9 C 29 13
10 C 12 12
11 C 3 16
12 C 4 6
and for each name group (A, B, C) I would need to assign a category following the rules below:
if time<= 10 then category = 1
if 10 <time<= 20 then category = 2
if 20 <time<= 30 then category = 3
if time> 30 then category = 4
to have a data frame that looks like this:
> df_final
name time count category
1 A 10 9 1
2 A 12 17 2
3 A 24 19 3
4 A 3 15 1
5 A 29 11 3
6 B 31 14 4
7 B 7 7 1
8 B 30 18 3
9 C 29 13 3
10 C 12 12 2
11 C 3 16 1
12 C 4 6 1
after that I would need to sum the values in count based on their category. The ultimate data frame should look like this:
> df_ultimate
name count category
1 A 24 1
2 A 17 2
3 A 30 3
4 A NA 4
5 B 7 1
6 B NA 2
7 B 18 3
8 B 14 4
9 C 22 1
10 C 12 2
11 C 13 3
12 C NA 4
I have tried to play around with summarise and group_by but without much success.
Thanks for your help
With cut + complete:
library(dplyr)
library(tidyr)
dat %>%
group_by(name, category = cut(time, breaks = c(-Inf, 10, 20, 30, Inf), labels = 1:4)) %>%
summarise(count = sum(count)) %>%
complete(category)
# # Groups: name [3]
# name category count
# 1 A 1 24
# 2 A 2 17
# 3 A 3 30
# 4 A 4 NA
# 5 B 1 7
# 6 B 2 NA
# 7 B 3 18
# 8 B 4 14
# 9 C 1 22
# 10 C 2 12
# 11 C 3 13
# 12 C 4 NA
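To see why cut() matches the rules in the question, here is a small base R sketch. It assumes the question's data rebuilt by hand under the name dat (as in the answer above); sums is an illustrative name. The point is that cut() intervals are right-closed by default, so time == 10 falls into category 1, and that tapply() leaves NA for combinations with no rows, much like complete() does in the answer.

```r
# Assumed input: the question's data, rebuilt by hand
dat <- data.frame(
  name  = rep(c("A", "B", "C"), times = c(5, 3, 4)),
  time  = c(10, 12, 24, 3, 29, 31, 7, 30, 29, 12, 3, 4),
  count = c(9, 17, 19, 15, 11, 14, 7, 18, 13, 12, 16, 6)
)

# Intervals are (-Inf,10], (10,20], (20,30], (30,Inf] because right = TRUE by default
category <- cut(dat$time, breaks = c(-Inf, 10, 20, 30, Inf), labels = 1:4)

# Sum count per name/category; empty combinations come back as NA
sums <- tapply(dat$count, list(name = dat$name, category = category), sum)
sums
```

The result is a name-by-category matrix rather than a long data frame, so this is meant as a way to check the binning, not as a drop-in replacement for the dplyr pipeline.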

creating new tibble columns based on mapping plus user data

I am trying to generate new columns in a tibble from the output of a function that takes as input several existing columns of that tibble plus user data. As a simplified example, I would want to use this function
addup <- function(x, y, z){x + y + z}
and use it to add the numbers in the existing columns in this tibble...
set.seed(1)
(tib <- tibble(num1 = sample(12), num2 = sample(12)))
# A tibble: 12 x 2
num1 num2
<int> <int>
1 8 5
2 6 3
3 7 7
4 3 11
5 1 2
6 2 1
7 11 6
8 10 9
9 4 8
10 9 12
11 5 10
12 12 4
...together with user input. For instance, if a user defines the vector
vec <- c(3,6,4)
I would like to generate one new column per item in vec, adding the mapped values with the user input values.
The desired result in this case would look something like:
# A tibble: 12 x 5
num1 num2 `3` `6` `4`
<int> <int> <dbl> <dbl> <dbl>
1 5 7 15 18 16
2 8 2 13 16 14
3 7 9 19 22 20
4 1 11 15 18 16
5 3 3 9 12 10
6 9 12 24 27 25
7 6 6 15 18 16
8 10 10 23 26 24
9 11 4 18 21 19
10 12 5 20 23 21
11 4 1 8 11 9
12 2 8 13 16 14
If I know vec beforehand, I could achieve this by
tib %>%
mutate("3" = map2_dbl(num1, num2, ~addup(.x, .y, 3)),
"6" = map2_dbl(num1, num2, ~addup(.x, .y, 6)),
"4" = map2_dbl(num1, num2, ~addup(.x, .y, 4)))
but as the length of vec can vary, I do not know how to generalize this. I've found the answer "repeated mutate in tidyverse", but there the functions are repeated over the existing columns instead of using multiple existing columns for mapping.
Any ideas?
Since we don't have to have the function or the colnames as arguments, this is relatively simple. You just need to iterate over vec with a function that returns the summed column, and then combine with the original table. If you have an addup function that accepts vector inputs then you can skip the whole map2 part; in fact this one does but I don't know if your real function does.
library(tidyverse)
vec <- c(3,6,4)
set.seed(1)
tib <- tibble(num1 = sample(12), num2 = sample(12))
addup <- function(c1, c2, z) {c1 + c2 + z}
addup_vec <- function(df, vec) {
new_cols <- map_dfc(
.x = vec,
.f = function(v) {
map2_dbl(
.x = df[["num1"]],
.y = df[["num2"]],
.f = ~ addup(.x, .y, v)
)
}
)
colnames(new_cols) <- vec
bind_cols(df, new_cols)
}
tib %>%
addup_vec(vec)
#> # A tibble: 12 x 5
#> num1 num2 `3` `6` `4`
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 4 9 16 19 17
#> 2 5 5 13 16 14
#> 3 6 8 17 20 18
#> 4 9 11 23 26 24
#> 5 2 6 11 14 12
#> 6 7 7 17 20 18
#> 7 10 3 16 19 17
#> 8 12 4 19 22 20
#> 9 3 12 18 21 19
#> 10 1 1 5 8 6
#> 11 11 2 16 19 17
#> 12 8 10 21 24 22
Created on 2019-01-16 by the reprex package (v0.2.0).
This uses lapply to apply the function to each element of your vector then binds the result to the original data frame and adds column names.
# Given example (tibble() comes from the tibble package)
library(tibble)
set.seed(1)
(tib <- tibble(num1 = sample(12), num2 = sample(12)))
addup <- function(x, y, z){x + y + z}
vec <- c(3,6,4)
# Add columns and bind to original data frame
foo <- cbind(tib, lapply(vec, function(x)addup(tib$num1, tib$num2, x)))
# Correct column names
colnames(foo)[(ncol(tib)+1):ncol(foo)] <- vec
# Print result
print(foo)
# num1 num2 3 6 4
# 1 4 9 16 19 17
# 2 5 5 13 16 14
# 3 6 8 17 20 18
# 4 9 11 23 26 24
# 5 2 6 11 14 12
# 6 7 7 17 20 18
# 7 10 3 16 19 17
# 8 12 4 19 22 20
# 9 3 12 18 21 19
# 10 1 1 5 8 6
# 11 11 2 16 19 17
# 12 8 10 21 24 22
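The lapply version relies on addup accepting whole columns. If the real function only handles one value at a time, a base R sketch along the following lines might work instead. addup_scalar is a hypothetical stand-in for such a function, and the three-row data frame just reuses the first rows of the output above.

```r
# Hypothetical scalar-only stand-in for a function that cannot take vectors
addup_scalar <- function(x, y, z) {
  stopifnot(length(x) == 1, length(y) == 1, length(z) == 1)
  x + y + z
}

tib <- data.frame(num1 = c(4, 5, 6), num2 = c(9, 5, 8))  # first rows of the example
vec <- c(3, 6, 4)

# One new column per element of vec; mapply() walks num1 and num2 pairwise
new_cols <- lapply(vec, function(v) {
  mapply(addup_scalar, tib$num1, tib$num2, MoreArgs = list(z = v))
})

# Assigning by name keeps the non-syntactic column names "3", "6", "4" as they are
res <- tib
res[as.character(vec)] <- new_cols
res
```

mapply() plays the role that map2_dbl() plays in the purrr answer, so the length of vec can vary without changing the code.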

Get the nth lagged value in grouped data in R

I have a data frame similar to this
library(tibble)
# tibble() replaces the deprecated data_frame()
mydf = tibble(letter = rep(c("a", "b", "c"), each = 5),
var1 = sample(1:25, 15, replace = TRUE))
# A tibble: 15 x 2
letter var1
<chr> <int>
1 a 16
2 a 9
3 a 5
4 a 14
5 a 6
6 b 13
7 b 9
8 b 20
9 b 18
10 b 4
11 c 18
12 c 11
13 c 9
14 c 1
15 c 12
I know I can get the value from the immediately preceding row with dplyr::lag(). However, I am looking for a similar way to get the third value before each observation. The expected result should look like this:
# A tibble: 15 x 3
# Groups: letter [3]
letter var1 var2
<chr> <int> <dbl>
1 a 16 NA
2 a 9 NA
3 a 5 16
4 a 14 9
5 a 6 5
6 b 13 NA
7 b 9 NA
8 b 20 13
Thanks in advance
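One possible approach, sketched in base R: shift each group by n rows with ave(). The expected output above corresponds to looking back n = 2 rows, so that is what the sketch uses; lag_n is a hypothetical helper name and the data are typed in from the printed tibble. With dplyr, lag() has an n argument, so group_by(letter) followed by mutate(var2 = lag(var1, n = 2)) should give the same column.

```r
# Assumed input: the values printed in the question
mydf <- data.frame(
  letter = rep(c("a", "b", "c"), each = 5),
  var1   = c(16L, 9L, 5L, 14L, 6L, 13L, 9L, 20L, 18L, 4L, 18L, 11L, 9L, 1L, 12L)
)

n <- 2  # how many rows to look back; the expected output matches 2

# Hypothetical helper: pad with n NAs in front, then cut back to the original length
lag_n <- function(x, n) c(rep(NA, n), x)[seq_along(x)]

# ave() applies the helper within each letter group and keeps the row order
mydf$var2 <- ave(mydf$var1, mydf$letter, FUN = function(x) lag_n(x, n))
mydf
```

Truncating with seq_along(x) keeps the helper safe for groups shorter than n, which simply come back as all NA.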

Fill column with prior nonmissing value, no ID

I'm trying to fill a missing ID column of a data frame as shown below. The ID appears only in the first row it applies to and is blank until the next ID. I wrote ugly code to do this in a for loop, but wonder if there's a tidier way to do this. Any suggestions?
Here's what I've got:
code data
1 A 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 B 11
12 12
13 13
14 14
15 15
16 C 16
17 17
18 18
19 19
20 20
I want:
code data
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 11
12 B 12
13 B 13
14 B 14
15 B 15
16 C 16
17 C 17
18 C 18
19 C 19
20 C 20
Code I've got now:
# Create mock data frame
df <- data.frame(code = c("A", rep("", 9),
"B", rep("", 4),
"C", rep("", 4)),
data = 1:20)
# For loop over rows (BAD!)
for (i in seq(2, nrow(df))) {
df[i,]$code <- ifelse(df[i,]$code == "", df[i-1,]$code, df[i, ]$code)
}
There is a tidyr way to do it: the fill function. You also need to replace the zero-length strings with NA for this to work, which you can easily do using the mutate and na_if functions from dplyr.
library(dplyr)
library(tidyr)
df %>%
mutate(code = na_if(code,"")) %>%
fill(code)
code data
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 11
12 B 12
13 B 13
14 B 14
15 B 15
16 C 16
17 C 17
18 C 18
19 C 19
20 C 20
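If adding packages is not an option, the same carry-forward can be sketched in base R with cumsum(). This assumes code is a character column where blanks are empty strings, as in the mock data; has_code and idx are illustrative names.

```r
# Mock data from the question, with code kept as character
df <- data.frame(
  code = c("A", rep("", 9), "B", rep("", 4), "C", rep("", 4)),
  data = 1:20,
  stringsAsFactors = FALSE
)

has_code <- df$code != ""   # TRUE where a new ID starts
idx <- cumsum(has_code)     # 1 for the first ID's rows, 2 for the second, ...
idx[idx == 0] <- NA         # rows before the first ID would stay missing

# Look up each row's ID among the non-blank codes
df$code <- df$code[has_code][idx]
df
```

The running count of non-blank entries tells each row which ID it belongs to, which avoids the row-by-row loop.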

Partition groups of data by group

I have the following dataset:
library(dplyr)
df<- as.data.frame(c(rep("a", times = 9), rep("b", times = 18), rep("c", times = 27)))
colnames(df)<-"Location"
Year<-c(rep(1:3,times = 3), rep(1:6, times = 3), rep(1:9, times = 3))
df$Year<-Year
# df is not grouped, so no ungroup() call is needed
df<- df %>%
mutate(Predictor = seq_along(Location))
print(df)
Location Year Predictor
a 1 1
a 2 2
a 3 3
a 1 4
a 2 5
a 3 6
a 1 7
a 2 8
a 3 9
b 1 10
b 2 11
b 3 12
b 4 13
b 5 14
... 40 more rows
I want to split the above dataframe into training and test sets. For the test set, I want to randomly sample a third of the number of years in each Location, while keeping the years together. So if year "1" is selected for location "a", I want all three "1's" in the test set and so on. My test set should look something like this:
Location Year Predictor
a 1 1
a 1 4
a 1 7
b 3 12
b 3 18
b 3 24
b 5 14
b 5 20
b 5 26
c 3 30
c 3 39
c 3 48
c 6 33
c 6 42
c 6 51
c 7 34
c 7 43
c 7 52
I found a similar question here, but this procedure would sample the same year and the same number of years from every location (and YEAR is numeric, not a factor). I want a different random sample of years from each location and a proportional number of samples.
Would like to do this in dplyr if possible
You can first create a distinct set of year/location combinations, then sample some of them for each location and use that in a semi_join on the original data. This could be done as:
df %>%
distinct(Location, Year) %>%
group_by(Location) %>%
sample_frac(.3) %>%
semi_join(df, .)
# Location Year Predictor
# 1 a 3 3
# 2 a 3 6
# 3 a 3 9
# 4 b 4 13
# 5 b 4 19
# 6 b 4 25
# 7 b 5 14
# 8 b 5 20
# 9 b 5 26
# 10 c 8 35
# 11 c 8 44
# 12 c 8 53
# 13 c 1 28
# 14 c 1 37
# 15 c 1 46
# 16 c 2 29
# 17 c 2 38
# 18 c 2 47
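The same idea (sample whole years per location, then keep every row of the sampled years) can also be sketched in base R, which additionally yields the training set as the complement. The names years_by_loc, picked_keys, in_test, test_set and train_set are illustrative, and round() is an assumption about how "a third" should be turned into a count.

```r
set.seed(1)  # only to make the sketch reproducible

# The question's data, rebuilt without dplyr
df <- data.frame(
  Location = c(rep("a", 9), rep("b", 18), rep("c", 27)),
  Year     = c(rep(1:3, times = 3), rep(1:6, times = 3), rep(1:9, times = 3)),
  stringsAsFactors = FALSE
)
df$Predictor <- seq_len(nrow(df))

# For each location, draw a third of its distinct years
years_by_loc <- split(df$Year, df$Location)
picked_keys <- unlist(lapply(names(years_by_loc), function(loc) {
  yrs <- unique(years_by_loc[[loc]])
  # sample.int() on positions avoids sample()'s special case for a single number
  chosen <- yrs[sample.int(length(yrs), size = round(length(yrs) / 3))]
  paste(loc, chosen)
}))

# A row goes to the test set when its location/year pair was drawn
in_test   <- paste(df$Location, df$Year) %in% picked_keys
test_set  <- df[in_test, ]
train_set <- df[!in_test, ]
```

Because the draw happens on distinct location/year pairs, all rows of a selected year end up together in test_set, and train_set holds everything else.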
