I have two data frames like this:
library(tidyverse)
a <- tibble(x=c("mother","father","brother","brother"),y=c("a","b","c","d"))
b <- tibble(x=c("mother","father","brother","brother"),z=c("e","f","g","h"))
I want to join these data frames so that each "brother" occurs only once.
I have tried full_join:
ab <- full_join(a,b,by="x")
and obtained this:
# A tibble: 6 x 3
x y z
<chr> <chr> <chr>
1 mother a e
2 father b f
3 brother c g
4 brother c h
5 brother d g
6 brother d h
What I need is this:
ab <- tibble(x=c("mother","father","brother1","brother2"),y=c("a","b","c","d"),z=c("e","f","g","h"))
# A tibble: 4 x 3
x y z
<chr> <chr> <chr>
1 mother a e
2 father b f
3 brother1 c g
4 brother2 d h
Using dplyr you could do something like the following, which adds an extra variable person to identify each person within each group in x, and then joins by x and person:
library(dplyr)
a %>%
group_by(x) %>%
mutate(person = 1:n()) %>%
full_join(b %>%
group_by(x) %>%
mutate(person = 1:n()),
by = c("x", "person")
) %>%
select(x, person, y, z)
Which returns:
# A tibble: 4 x 4
# Groups: x [3]
x person y z
<chr> <int> <chr> <chr>
1 mother 1 a e
2 father 1 b f
3 brother 1 c g
4 brother 2 d h
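If the goal is the exact table from the question (brother1, brother2), the occurrence index can be pasted onto the repeated names after the join. The following is only a sketch: add_person() and the column n_x are hypothetical names introduced for illustration, and it assumes a dplyr version in which add_count() accepts a name argument.

```r
library(dplyr)

a <- tibble(x = c("mother", "father", "brother", "brother"), y = c("a", "b", "c", "d"))
b <- tibble(x = c("mother", "father", "brother", "brother"), z = c("e", "f", "g", "h"))

# Hypothetical helper: number the rows within each value of x (1, 2, ... per name)
add_person <- function(df) {
  df %>%
    group_by(x) %>%
    mutate(person = row_number()) %>%
    ungroup()
}

ab <- full_join(add_person(a), add_person(b), by = c("x", "person")) %>%
  add_count(x, name = "n_x") %>%                       # how often each name occurs
  mutate(x = if_else(n_x > 1, paste0(x, person), x)) %>% # label only repeated names
  select(x, y, z)

ab
```

Rows are still paired purely by their order of appearance within each name, so the caveat about indistinguishable brothers applies here as well.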
Unfortunately, the first and second brother are indistinguishable from each other! How would R know that you want to join them that way, and not the reverse?
I would try to "remove duplicates" in the original data.frames by adding the "1" and "2" identifiers there.
I don't know tidyverse syntax, but if you never get more than two repetitions, you may want to try
a <- c("A", "B", "C", "C")
a[duplicated(a)] <- paste0(a[duplicated(a)], 2)
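If a value can repeat more than twice, the same idea might be generalized with base R's ave(). This is a sketch under that assumption, not part of the original answer; the names idx, cnt and a_unique are made up for the example.

```r
a <- c("A", "B", "C", "C", "C")

# Position of each element within its run of identical values (1, 2, 3, ...)
idx <- ave(seq_along(a), a, FUN = seq_along)
# Number of occurrences of each value, aligned with `a`
cnt <- ave(seq_along(a), a, FUN = length)

# Append the position only to values that occur more than once
a_unique <- ifelse(cnt > 1, paste0(a, idx), a)
a_unique
```

Unlike the two-repetition version, this labels the first occurrence as well (C1, C2, C3), which matches the brother1/brother2 naming asked for in the question.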
I have a dataset :
library(tidyverse)
fac = factor(c("a","b","c"))
x = c(1,2,3)
d = tibble(fac,x);d
that looks like this :
# A tibble: 3 × 2
fac x
<fct> <dbl>
1 a 1
2 b 2
3 c 3
I want to replace the value 2 in column x, which corresponds to factor level b, with 3.14.
How can I do it in a dplyr pipeline?
One alternative uses ifelse():
library(dplyr)
d %>%
mutate(x = ifelse(fac == "b", 3.14, x))
fac x
<fct> <dbl>
1 a 1
2 b 3.14
3 c 3
We may use replace
library(dplyr)
library(magrittr)
d %<>%
mutate(x = replace(x, fac == "b", 3.14))
Output:
d
# A tibble: 3 × 2
fac x
<fct> <dbl>
1 a 1
2 b 3.14
3 c 3
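For comparison, dplyr's own if_else() can be used in the same way. It is stricter than ifelse() about the two branches having compatible types, which can catch accidental type changes. A small sketch, with d2 as an illustrative name:

```r
library(dplyr)

d <- tibble(fac = factor(c("a", "b", "c")), x = c(1, 2, 3))

# Replace x with 3.14 where fac is "b"; keep the existing value otherwise
d2 <- d %>%
  mutate(x = if_else(fac == "b", 3.14, x))

d2
```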
I'm trying to iterate through a df containing addresses in different neighborhoods, and within each neighborhood I would like to randomly assign each address to one of two equal-sized groups. My df looks roughly like this:
neighborhood <- c("armatage", "armatage", "armatage", "windom", "windom", "windom", "windom")
address <- c("a", "b", "c", "d", "e", "f", "g")
df <- data.frame(address, neighborhood)
but with many more neighborhoods, each with a varying number of addresses. Using the randomizr package, I have so far come up with this script, which iterates through each neighborhood name and generates a randomized vector of 0s and 1s whose length equals the number of rows in that neighborhood. The problem seems to be the second for loop, which is supposed to assign the randomized value to each row:
library(randomizr)
for (i in df$neighborhood) {
  n <- nrow(df[df$neighborhood == i, ])
  z <- complete_ra(n)
  for (row in 1:nrow(df[df$neighborhood == i, ])) {
    df$group[row] <- z[row]
  }
}
where df$group is where I would like to store the randomly assigned value. I would greatly appreciate any advice anyone might have. Thanks!
Here is another way that avoids a double loop:
library(data.table)
dt = as.data.table(df)
dt[, .(grp = sample(.N) %% 2,
address)
, by = neighborhood]
#> neighborhood grp address
#> 1: armatage 1 a
#> 2: armatage 0 b
#> 3: armatage 1 c
#> 4: windom 0 d
#> 5: windom 1 e
#> 6: windom 1 f
#> 7: windom 0 g
Basically, sample(.N) shuffles the row positions 1 to N within each neighborhood, and taking each shuffled value modulo 2 turns that shuffle into a random 0/1 assignment with (nearly) equal group sizes.
Background
Let's take a look at what the modulo operator %% does to the number sequence 1 to 4:
seq(from = 1, to = 4) ## or 1:4 or seq(4)
## [1] 1 2 3 4
seq(from = 1, to = 4) %% 2
## [1] 1 0 1 0
Mathematically, it tells us the remainder after division. That is, 1 / 2 has a remainder of 1; 2 / 2 has a remainder of 0; and so on. We can use this to make groupings. The problem is that this isn't random. That's where sample() comes into play:
sample(4) ## or sample(1:4) or sample(seq(1, 4))
## [1] 2 1 4 3
So if we combine modulo with sample(), we can effectively randomize the assignment within groups, provided we know how many rows are in each group. That's where grouping helps, either with the data.table dt[i, j, by] syntax or with the dplyr tibble %>% group_by() %>% mutate() syntax. Yes, we could subset the unique neighborhoods in a loop, but grouping is more efficient.
Since dplyr is what helped me initially, let's take a look at that version:
library(dplyr)
df %>%
group_by(neighborhood) %>%
mutate(group = sample(n()) %% 2)
## # A tibble: 7 x 3
## # Groups: neighborhood [2]
## address neighborhood group
## <chr> <chr> <dbl>
## 1 a armatage 1
## 2 b armatage 0
## 3 c armatage 1
## 4 d windom 1
## 5 e windom 0
## 6 f windom 1
## 7 g windom 0
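The same shuffle-then-modulo idea can also be sketched in base R with ave(), followed by a quick check that the two groups in each neighborhood differ in size by at most one. The set.seed() call and the name counts are only there for the illustration.

```r
set.seed(1)  # only to make the illustration reproducible

neighborhood <- c("armatage", "armatage", "armatage", "windom", "windom", "windom", "windom")
address <- c("a", "b", "c", "d", "e", "f", "g")
df <- data.frame(address, neighborhood)

# Within each neighborhood: shuffle the positions 1..N, then take them modulo 2
df$group <- ave(seq_len(nrow(df)), df$neighborhood,
                FUN = function(i) sample(length(i)) %% 2)

# Cross-tabulate to check how balanced the assignment is
counts <- table(df$neighborhood, df$group)
counts
```

With an odd number of addresses, the extra address always lands in group 1 under this scheme, because the sequence 1..N contains one more odd number than even numbers.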
An approach using dplyr and purrr:
neighborhood <- c("armatage", "armatage", "armatage", "windom", "windom", "windom", "windom")
address <- c("a", "b", "c", "d", "e", "f", "g")
df <- data.frame(address, neighborhood)
library(dplyr)
library(purrr)
df %>%
# split the original data into one data frame per neighborhood with group_split() from dplyr
group_split(neighborhood) %>%
# then, for each neighborhood, split its rows into 2 groups
# based on their row number
map(.f = function(x) {
x %>% group_by((row_number() - 1) %/% (n() / 2)) %>%
nest %>% pull(.)
})
Result of above code
[[1]]
[[1]][[1]]
# A tibble: 2 x 2
address neighborhood
<chr> <chr>
1 a armatage
2 b armatage
[[1]][[2]]
# A tibble: 1 x 2
address neighborhood
<chr> <chr>
1 c armatage
[[2]]
[[2]][[1]]
# A tibble: 2 x 2
address neighborhood
<chr> <chr>
1 d windom
2 e windom
[[2]][[2]]
# A tibble: 2 x 2
address neighborhood
<chr> <chr>
1 f windom
2 g windom
In case you just want to add an index column that assigns each row to a separate group:
df %>%
group_by(neighborhood) %>%
# cur_group_id() gives the index of the current neighborhood; the rest of the
# expression computes the index of each half within it from the row number.
mutate(group = (cur_group_id() - 1) * 2 + (row_number() - 1) %/% (n() / 2) + 1)
Output
# A tibble: 7 x 3
# Groups: neighborhood [2]
address neighborhood group
<chr> <chr> <dbl>
1 a armatage 1
2 b armatage 1
3 c armatage 2
4 d windom 3
5 e windom 3
6 f windom 4
7 g windom 4
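Note that the split above follows row order, so it is deterministic rather than random. One possible way to randomize it is to shuffle the rows within each neighborhood first. This is a sketch assuming dplyr 1.0 or later, where slice_sample() is available; df_random is an illustrative name.

```r
library(dplyr)

set.seed(42)  # only to make the illustration reproducible

neighborhood <- c("armatage", "armatage", "armatage", "windom", "windom", "windom", "windom")
address <- c("a", "b", "c", "d", "e", "f", "g")
df <- data.frame(address, neighborhood)

df_random <- df %>%
  group_by(neighborhood) %>%
  # shuffle the rows within each neighborhood before splitting them in half
  slice_sample(prop = 1) %>%
  mutate(group = (cur_group_id() - 1) * 2 + (row_number() - 1) %/% (n() / 2) + 1) %>%
  ungroup()

df_random
```

The group sizes are the same as in the deterministic version; only which address ends up in which half changes.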
I would like to create a function that produces a table of counts based on one or more grouping variables. I found the post "Using dplyr group_by in a function", whose approach works if I pass the function a single variable name:
library(dplyr)
l <- c("a", "b", "c", "e", "f", "g")
animal <- c("dog", "cat", "dog", "dog", "cat", "fish")
sex <- c("m", "f", "f", "m", "f", "unknown")
n <- rep(1, length(animal))
theTibble <- tibble(l, animal, sex, n)
countString <- function(things) {
theTibble %>% group_by(!! enquo(things)) %>% count()
}
countString(animal)
countString(sex)
That works nicely but I don't know how to pass the function two variables.
This sort of works:
countString(paste(animal, sex))
It gives me the correct counts but the returned table collapses the animal and sex variables into one variable.
# A tibble: 4 x 2
# Groups: paste(animal, sex) [4]
`paste(animal, sex)` nn
<chr> <int>
1 cat f 2
2 dog f 1
3 dog m 2
4 fish unknown 1
What is the syntax for passing a function two words separated by commas? I want to get this result:
# A tibble: 4 x 3
# Groups: animal, sex [4]
animal sex nn
<chr> <chr> <int>
1 cat f 2
2 dog f 1
3 dog m 2
4 fish unknown 1
You can use group_by_at() with column indices, for example:
countString <- function(things) {
index <- which(colnames(theTibble) %in% things)
theTibble %>%
group_by_at(index) %>%
count()
}
countString(c("animal", "sex"))
## A tibble: 4 x 3
## Groups: animal, sex [4]
# animal sex nn
# <chr> <chr> <int>
#1 cat f 2
#2 dog f 1
#3 dog m 2
#4 fish unknown 1
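In dplyr 1.0 and later, the scoped verbs such as group_by_at() are superseded, and the same character-vector interface can be sketched with across() and all_of(). This is an alternative under that assumption, not the code from the answer above; the output column is named count here to avoid clashing with the existing n column.

```r
library(dplyr)

l <- c("a", "b", "c", "e", "f", "g")
animal <- c("dog", "cat", "dog", "dog", "cat", "fish")
sex <- c("m", "f", "f", "m", "f", "unknown")
n <- rep(1, length(animal))
theTibble <- tibble(l, animal, sex, n)

countString <- function(things) {
  theTibble %>%
    # all_of() selects the columns named in the character vector `things`
    group_by(across(all_of(things))) %>%
    summarise(count = n(), .groups = "drop")
}

res <- countString(c("animal", "sex"))
res
```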
We replaced 'things' with ... to accept multiple arguments and, correspondingly, enquo() with enquos(), which is spliced with !!! instead of !!. The group_by() step was removed because count() accepts the grouping variables directly.
countString <- function(...) {
grps <- enquos(...)
theTibble %>%
count(!!! grps)
}
countString(sex)
# A tibble: 3 x 2
# sex nn
# <chr> <int>
#1 f 3
#2 m 2
#3 unknown 1
countString(animal)
# A tibble: 3 x 2
# animal nn
# <chr> <int>
#1 cat 2
#2 dog 3
#3 fish 1
countString(animal, sex)
# A tibble: 4 x 3
# animal sex nn
# <chr> <chr> <int>
#1 cat f 2
#2 dog f 1
#3 dog m 2
#4 fish unknown 1
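Since count() itself takes its grouping variables through ..., the dots can also be forwarded without capturing them first. A minimal sketch of that variant; the name argument is used here so the result column is called count instead of colliding with the existing n column.

```r
library(dplyr)

l <- c("a", "b", "c", "e", "f", "g")
animal <- c("dog", "cat", "dog", "dog", "cat", "fish")
sex <- c("m", "f", "f", "m", "f", "unknown")
n <- rep(1, length(animal))
theTibble <- tibble(l, animal, sex, n)

countString <- function(...) {
  # Forward the unquoted column names straight to count()
  theTibble %>%
    count(..., name = "count")
}

res <- countString(animal, sex)
res
```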
Let's say I have a dataframe
x y val
A B 5
A C 3
B A 7
B C 9
C A 1
As you can see there are two pairs matching by x and y:
Pair 1: A B 5 and B A 7
Pair 2: A C 3 and C A 1
I would like to merge them to A B 12 and A C 4 and leave the B C 9 as it doesn't have a pair (C B).
The final dataframe should look like this:
x y val
A B 12
A C 4
B C 9
How can I achieve this in R?
Here's one solution with dplyr:
library(dplyr)
df %>%
mutate(var = paste(pmin(x, y), pmax(x, y))) %>%
group_by(var) %>%
summarise(val = sum(val))
# A tibble: 3 x 2
var val
<chr> <int>
1 A B 12
2 A C 4
3 B C 9
Add separate(var, c("x", "y")) to the end of the chain if you want the x and y columns as Melissa Key mentions.
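Put together with the sample data from the question, the full chain including the separate() step might look like the sketch below. The explicit sep = " " is an addition that matches the space inserted by paste(), and res is an illustrative name.

```r
library(dplyr)
library(tidyr)

df <- read.table(text =
"x y val
A B 5
A C 3
B A 7
B C 9
C A 1",
header = TRUE, stringsAsFactors = FALSE)

res <- df %>%
  # Build an order-independent key: smaller label first, larger label second
  mutate(var = paste(pmin(x, y), pmax(x, y))) %>%
  group_by(var) %>%
  summarise(val = sum(val)) %>%
  # Split the key back into the two original columns
  separate(var, c("x", "y"), sep = " ")

res
```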
First ensure that x and y are character, giving DF_c, and then sort each pair so that x holds the smaller value and y the larger, giving DF_s. Finally perform the aggregation. No packages are used. The first line would not be needed if x and y were already character.
DF_c <- transform(DF, x = as.character(x), y = as.character(y))
DF_s <- transform(DF_c, x = pmin(x, y), y = pmax(x, y))
aggregate(val ~ x + y, DF_s, sum)
giving:
x y val
1 A B 12
2 A C 4
3 B C 9
One can group by row_number() so that each row is handled on its own, then sort x and y and paste them together to create an order-independent pair key.
Note: The solution below can be extended to pair more than two columns as well, e.g. treating A B C, A C B and B C A as the same group.
library(dplyr)
library(tidyr)
df %>%
group_by(row_number()) %>%
mutate(xy = paste0(sort(c(x,y)),collapse=",")) %>%
group_by(xy) %>%
summarise(val = sum(val)) %>%
separate(xy, c("x","y"))
## A tibble: 3 x 3
# x y val
#* <chr> <chr> <int>
#1 A B 12
#2 A C 4
#3 B C 9
Data:
df <- read.table(text =
"x y val
A B 5
A C 3
B A 7
B C 9
C A 1",
header = TRUE, stringsAsFactors = FALSE)
I have data in a data frame where one column is a list. This is an example:
rand_lets <- function(){
sample(letters[1:26], runif(sample(1:10, 1), min=5, max=12))
}
example_data <- data.frame(ID = seq(1:5),
location = LETTERS[1:5],
observations = I(list(rand_lets(),
rand_lets(),
rand_lets(),
rand_lets(),
rand_lets())))
I am looking for an elegant tidyverse approach to unlist the list column so that each element in the list is separated into a new column. For example the first row would look like this:
ID location observations observations.1 observations.2 observations.3 observations.4 observations.5 observations.6 observations.7 observations.8 observations.9
1 A "y" "b" "m" "u" "x" "j" "t" "i" "v" "w"
Of course the list entries may have different lengths, so empty cells should be NA.
How could this be done?
If you want to keep your data in "long" format, you can do:
example_data %>% unnest(observations)
ID location observations
1 1 A e
2 1 A x
3 1 A w
...
44 5 E u
45 5 E o
46 5 E z
To spread the data to "wide" format, as in your example, you can do:
library(stringr)
example_data %>% unnest(observations) %>%
group_by(location) %>%
mutate(counter=paste0("Obs_", str_pad(1:n(),2,"left","0"))) %>%
spread(counter, observations)
ID location Obs_01 Obs_02 Obs_03 Obs_04 Obs_05 Obs_06 Obs_07 Obs_08 Obs_09 Obs_10 Obs_11
* <int> <fctr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 A e x w c s j k t z <NA> <NA>
2 2 B k u d h z x <NA> <NA> <NA> <NA> <NA>
3 3 C v z m o s f n c r u b
4 4 D z i m s a v n r e t x
5 5 E f b g h a d u o z <NA> <NA>
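In tidyr 1.0 and later, spread() is superseded by pivot_wider(). The same reshaping can be sketched as follows on a small hand-made tibble; example_small and wide are hypothetical names, and the data is invented for the illustration rather than taken from the question.

```r
library(dplyr)
library(tidyr)

# Small made-up example with a list column of unequal lengths
example_small <- tibble(
  ID = 1:2,
  location = c("A", "B"),
  observations = list(c("x", "y", "z"), c("p", "q"))
)

wide <- example_small %>%
  unnest(observations) %>%
  group_by(ID, location) %>%
  # Zero-padded counter so the new columns sort in a natural order
  mutate(counter = sprintf("Obs_%02d", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = counter, values_from = observations)

wide
```

Rows with fewer observations are filled with NA in the trailing columns, as requested in the question.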