Expanding data frame with "mirror" observations - r

I have a data frame arranged as follows:
df <- structure(list(name1= c("A","A","B"),
name2 = c("B", "C","C"),
size = c(10,20,30)),.Names=c("name1","name2","size"),
row.names = c("1", "2", "3"), class =("data.frame"))
I would like to add "mirror" observations as follows:
df <- structure(list(name1 = c("A","B","A", "C", "B", "C"),
name2 = c("B", "A","C", "A", "C", "B"),
size = c(10,10,20,20,30,30)),.Names=c("name1","name2","size"),
row.names = c("1", "2", "3", "4", "5", "6"), class =("data.frame"))
Inputs would be much appreciated.

We can do this in two steps: replicate each row twice, then swap the two name columns in every second row.
df1 <- df[rep(rownames(df), each = 2),]
df1[c(FALSE, TRUE), 1:2] <- df1[c(FALSE, TRUE), 2:1]
df1
# name1 name2 size
#1 A B 10
#1.1 B A 10
#2 A C 20
#2.1 C A 20
#3 B C 30
#3.1 C B 30
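The same idea can also be written as an explicit swap-and-stack, which may be easier to read. This is only a base R sketch; the names `mirror` and `out` are introduced here for illustration and are not part of the answer above.

```r
df <- data.frame(name1 = c("A", "A", "B"),
                 name2 = c("B", "C", "C"),
                 size  = c(10, 20, 30),
                 stringsAsFactors = FALSE)

# Copy the data and swap the two name columns to get the mirrored rows
# (data frame assignment matches columns by position, not by name)
mirror <- df
mirror[c("name1", "name2")] <- df[c("name2", "name1")]

# Stack originals and mirrors, then interleave them so that each
# mirrored row directly follows its original
out <- rbind(df, mirror)
out <- out[order(rep(seq_len(nrow(df)), 2)), ]
rownames(out) <- NULL
out
```

Unlike the row-replication version, this resets the row names to 1..6 instead of leaving 1, 1.1, 2, 2.1 and so on.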

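With data.table, we can bind the original to a copy whose first two columns are swapped. Note that the mirrored rows are appended after the originals rather than interleaved:
library(data.table)
# use.names = FALSE binds by position, so the swapped columns line up with name1/name2
rbindlist(list(df, df[c(2:1, 3)]), use.names = FALSE)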

Related

How to match columns both forward and reverse direction in a data_frame using r

I have two dataframe.
df1:
P_1 P_2
1 Anb Bmn
2 Cvd Dbn
3 Elf Fish
4 Goat Hen
5 Ink Jelly
6 Kin Lion
7 ACAN HSPG
8 HSPG2 COL6A2
df2:
P_1 P_2 Value
1 Anb Bmn 12
2 Dbn Cvd 31
3 Elf Fish 15
4 Goat Hen 98
5 Jelly Ink 78
6 Kin Lion 56
7 HSPG ACAN 89
I tried to merge these two dataframe based on P_1 and P_2 using following command
e<-merge(df1,df2, by=c("P_1","P_2"),all.x=TRUE)
But for rows 2, 5 and 7 I got NA, because the order of P_1 and P_2 is swapped in df2. I need the value in the output even when the order is swapped. How do I achieve this?
Data
df1 <- structure(list(P_1 = c("Anb", "Cvd", "Elf", "Goat", "Ink", "Kin","ACAN"," HSPG2"), P_2 = c("Bmn", "Dbn", "Fish", "Hen", "Jelly", "Lion","HSPG","COL6A2")), class = "data.frame",row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
df2 <- structure(list(P_1 = c("Anb", "Dbn", "Elf", "Goat", "Jelly", "Kin","HSPG"), P_2 = c("Bmn","Cvd", "Fish", "Hen", "Ink", "Lion","ACAN"), Value = c(12L, 31L, 15L, 98L, 78L,56L,89L)), class = "data.frame", row.names = c("1", "2", "3","4","5", "6","7"))
Any help would be appreciated.
To make the order the same in both datasets, sort the two key columns within each row of each dataset:
df1new <- df1
df1new[] <- t(apply(df1, 1, sort))
df2new <- df2
df2new[1:2] <- t(apply(df2new[1:2], 1, sort))
and now do the merge
merge(df1new, df2new, all.x = TRUE)
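If overwriting P_1 and P_2 with their sorted values is not wanted, one alternative is to build order-independent key columns and merge on those instead. The following is a sketch on a three-row subset of the question's data; the key names `k1` and `k2` are made up for this example.

```r
df1 <- data.frame(P_1 = c("Anb", "Cvd", "Elf"),
                  P_2 = c("Bmn", "Dbn", "Fish"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(P_1 = c("Anb", "Dbn", "Elf"),
                  P_2 = c("Bmn", "Cvd", "Fish"),
                  Value = c(12L, 31L, 15L),
                  stringsAsFactors = FALSE)

# Order-independent keys: the "smaller" and "larger" name of each pair
df1$k1 <- pmin(df1$P_1, df1$P_2)
df1$k2 <- pmax(df1$P_1, df1$P_2)
df2$k1 <- pmin(df2$P_1, df2$P_2)
df2$k2 <- pmax(df2$P_1, df2$P_2)

# Join on the keys only, so df1 keeps its own P_1/P_2 order
res <- merge(df1, df2[c("k1", "k2", "Value")], by = c("k1", "k2"), all.x = TRUE)
res <- res[c("P_1", "P_2", "Value")]
res
```

Here the pair Cvd/Dbn picks up its value even though df2 stores it as Dbn/Cvd.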
data
df1 <- structure(list(P_1 = c("A", "C", "E", "G", "I", "K", "z", "w"), P_2 = c("B", "D", "F", "H", "J", "L", "b", "c")), class = "data.frame",row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
df2 <- structure(list(P_1 = c("A", "D", "E", "H", "J", "K"), P_2 = c("B", "C", "F", "G", "I", "L"), Value = c(12L, 31L, 15L, 98L, 78L, 56L)), class = "data.frame", row.names = c("1", "2", "3", "4","5", "6"))

count_if (EXPSS) with multiple conditions in R

I am using expss::count_if.
While something like this works fine (i.e., counting values only where value is equal to "1"):
(number_unemployed = count_if("1",unemployed_field,na.rm = TRUE)),
This does not (i.e., counting values only where value is equal to "1" or "2" or "3"):
(number_unemployed = count_if("1", "2", "3", unemployed_field,na.rm = TRUE)),
What is the correct syntax for using multiple conditions for count_if? I cannot find anything in the expss package documentation.
You need to put them into a vector. This works:
(number_unemployed = count_if(c("1", "2", "3"), unemployed_field)),
Example (sample data is provided below):
library(expss)
count_if(c("1","2","3"),dt$Encounter)
#> 9
Data:
dt <- structure(list(Location = c("A", "B", "A", "A", "C", "B", "A", "B", "A", "A", "A"),
Encounter = c("1", "2", "3", "1", "2", "3", "4", "1", "2", "3", "4")),
row.names = c(NA, -11L), class = "data.frame")
# Location Encounter
# 1 A 1
# 2 B 2
# 3 A 3
# 4 A 1
# 5 C 2
# 6 B 3
# 7 A 4
# 8 B 1
# 9 A 2
# 10 A 3
# 11 A 4
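As a cross-check that does not depend on expss, the same count can be sketched in base R with `%in%` and `sum`. The variable `n_match` is a name introduced here, and only the Encounter column of the sample data is reproduced.

```r
dt <- data.frame(Encounter = c("1", "2", "3", "1", "2", "3", "4", "1", "2", "3", "4"),
                 stringsAsFactors = FALSE)

# %in% gives TRUE for every value that is "1", "2" or "3"; sum() counts the TRUEs
n_match <- sum(dt$Encounter %in% c("1", "2", "3"))
n_match
# 9
```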

How to apply separate_rows() to all columns, passing a `sep` parameter?

Pretty straightforward: I have a data frame where the values in many columns need to be split into their own rows, with ; as the delimiter.
After reading a bit,
df %>%
Reduce(separate_rows_, x = colnames)
works, except that I can't pass the sep parameter (so it also separates by white spaces, commas, and other non-alphanumeric chars).
One answer proposed writing a modified version of the function that includes the parameter, but I couldn't get that working:
Reduce(f = function(y) separate_rows_(sep = ";"), x = colnames)
What am I doing wrong?
Having said that, my ideal solution would be a tidyverse solution, if it's cleaner (maybe map_dfr?); but obviously any solution is better than none :).
Here's sample data:
structure(list(q1 = c("1,2,3,4", "2,4"), q2 = c("a,b", "e,f"),
q3 = c("c,d", "g,h,z")), row.names = 1:2, class = "data.frame")
Expected output:
structure(list(q1 = c("1", "1", "1", "1", "2", "2", "2", "2",
"3", "3", "3", "3", "4", "4", "4", "4", "2", "2", "2", "2", "2",
"2", "4", "4", "4", "4", "4", "4"), q2 = c("a", "a", "b", "b",
"a", "a", "b", "b", "a", "a", "b", "b", "a", "a", "b", "b", "e",
"e", "e", "f", "f", "f", "e", "e", "e", "f", "f", "f"), q3 = c("c",
"d", "c", "d", "c", "d", "c", "d", "c", "d", "c", "d", "c", "d",
"c", "d", "g", "h", "z", "g", "h", "z", "g", "h", "z", "g", "h",
"z")), row.names = c(NA, -28L), class = "data.frame")
The process I want to streamline is not having to pass every column name like so:
output <- test %>%
separate_rows(q1, sep = ",") %>%
separate_rows(q2, sep = ",") %>%
separate_rows(q3, sep = ",")
You can use purrr::reduce, which applies the given function .f to .init and the first element of .x, then applies the function to the output of that and the second element of .x, etc. until all elements of .x have been used.
Within the .f argument formula, .x is the previous output (or .init for the first run) and .y is the given element of the .x argument to reduce.
library(tidyverse)
reduce(.init = df, .x = names(df), .f = ~separate_rows(.x, .y, sep = ','))
# equiv to: reduce(.init = df, .x = names(df), .f = separate_rows, sep = ',')
As akrun notes in the comments, the same thing can be done with base R's Reduce in place of purrr::reduce (same output):
Reduce(function(x, y) separate_rows(x, y, sep=","), names(df), init = df)
# q1 q2 q3
# 1 1 a c
# 2 1 a d
# 3 1 b c
# 4 1 b d
# 5 2 a c
# 6 2 a d
# 7 2 b c
# 8 2 b d
# 9 3 a c
# 10 3 a d
# 11 3 b c
# 12 3 b d
# 13 4 a c
# 14 4 a d
# 15 4 b c
# 16 4 b d
# 17 2 e g
# 18 2 e h
# 19 2 e z
# 20 2 f g
# 21 2 f h
# 22 2 f z
# 23 4 e g
# 24 4 e h
# 25 4 e z
# 26 4 f g
# 27 4 f h
# 28 4 f z
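For comparison, the cross-product expansion can also be sketched without tidyr at all, using strsplit and expand.grid row by row. This is an illustrative alternative, not the method from the answers above; `expand_row` is a hypothetical helper defined only for this example.

```r
df <- data.frame(q1 = c("1,2,3,4", "2,4"),
                 q2 = c("a,b", "e,f"),
                 q3 = c("c,d", "g,h,z"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: expand row i into all combinations of its split values
expand_row <- function(i, data, sep) {
  parts <- lapply(data[i, , drop = FALSE],
                  function(v) strsplit(v, sep, fixed = TRUE)[[1]])
  # expand.grid varies its first argument fastest, so reverse the columns
  # going in and coming out to make the last column vary fastest
  grid <- expand.grid(rev(parts), stringsAsFactors = FALSE)
  grid[rev(names(grid))]
}

out <- do.call(rbind, lapply(seq_len(nrow(df)), expand_row, data = df, sep = ","))
rownames(out) <- NULL
dim(out)
```

Because `fixed = TRUE` is passed to strsplit, the separator is treated literally, so a delimiter such as ";" splits only on that character.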

Count occurrences per entry in dataframe

I have the following kind of dataframe (this is simplified example):
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df
id bank
1 1 a
2 1 b
3 1 c
4 2 b
5 3 b
6 3 c
7 4 a
8 4 c
In this dataframe you can see that for some ids there are multiple banks, i.e. for id==1, bank=c(a,b,c).
The information I would like to extract from this dataframe is the overlap between id's within different banks and the count.
So for example for bank a: bank a has two persons (unique ids): 1 and 4. For these persons, I want to know what other banks they have
For person 1: bank b and c
For person 4: bank c
the total amount of other banks: 3, for which, b = 1, and c = 2.
So I want to create as output a sort of overlap table as below:
bank overlap amount
a b 1
a c 2
b a 1
b c 2
c a 2
c b 2
It took me a while to get a result, so I'm posting it. Not as sexy as Ronak Shah's, but it gives the same result.
library(data.table)  # rbindlist
library(dplyr)       # %>%, group_by, summarise
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df$bank <- as.character(df$bank)
resultlist <- list()
# one data frame per id
dflist <- split(df, df$id)
for(i in seq_along(dflist)) {
if(nrow(dflist[[i]]) < 2) {
# an id with a single bank has no pairs
resultlist[[i]] <- data.frame(matrix(nrow = 0, ncol = 2))
} else {
# all pairs of banks for this id, one pair per row
resultlist[[i]] <- as.data.frame(t(combn(dflist[[i]]$bank, 2)))
}
}
# bind by position, since the empty pieces carry different column names
result <- setNames(rbindlist(resultlist, use.names = FALSE), c("bank", "overlap"))
result %>%
group_by(bank, overlap) %>%
summarise(amount = n())
bank overlap amount
<fct> <fct> <int>
1 a b 1
2 a c 2
3 b c 2
We may use data.table:
df = data.frame(id = c("1", "1", "1", "2", "3", "3", "4", "4"),
bank = c("a", "b", "c", "b", "b", "c", "a", "c"))
library(data.table)
setDT(df)[, .(bank = rep(bank, (.N-1L):0L),
overlap = bank[(sequence((.N-1L):1L) + rep(1:(.N-1L), (.N-1L):1))]),
by=id][,
.N, by=.(bank, overlap)]
#> bank overlap N
#> 1: a b 1
#> 2: a c 2
#> 3: b c 2
#> 4: <NA> b 1
Created on 2019-07-01 by the reprex package (v0.3.0)
Please note that id == 2 has only bank b, which does not overlap with any other bank; that is where the <NA> row comes from. If you don't want it in the final result, just apply na.omit() to the output.
An option would be full_join
library(dplyr)
full_join(df, df, by = "id") %>%
filter(bank.x != bank.y) %>%
dplyr::count(bank.x, bank.y) %>%
select(bank = bank.x, overlap = bank.y, amount = n)
# A tibble: 6 x 3
# bank overlap amount
# <fct> <fct> <int>
#1 a b 1
#2 a c 2
#3 b a 1
#4 b c 2
#5 c a 2
#6 c b 2
Do you need to cover each pair of banks in both directions? a -> b is the same as b -> a in this case. We can use combn to create all combinations of the unique banks taken 2 at a time and, for each combination, count the ids common to both banks.
as.data.frame(t(combn(unique(df$bank), 2, function(x)
c(x, with(df, length(intersect(id[bank == x[1]], id[bank == x[2]])))))))
# V1 V2 V3
#1 a b 1
#2 a c 2
#3 b c 2
data
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank, stringsAsFactors = FALSE)
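If the two-directional table from the question is what is needed, a self-join on id can also be sketched in base R with merge and table. This mirrors the full_join idea above without dplyr; the names `pairs` and `overlap` are introduced here for illustration.

```r
id   <- c("1", "1", "1", "2", "3", "3", "4", "4")
bank <- c("a", "b", "c", "b", "b", "c", "a", "c")
df   <- data.frame(id, bank, stringsAsFactors = FALSE)

# Self-join on id: every ordered pair of banks held by the same person
pairs <- merge(df, df, by = "id")            # columns: id, bank.x, bank.y
pairs <- pairs[pairs$bank.x != pairs$bank.y, ]

# Count each (bank, overlap) pair and convert the table to long format
overlap <- as.data.frame(table(bank = pairs$bank.x, overlap = pairs$bank.y),
                         responseName = "amount", stringsAsFactors = FALSE)
overlap <- overlap[overlap$amount > 0, ]     # drop the empty diagonal (a-a, b-b, c-c)
overlap <- overlap[order(overlap$bank, overlap$overlap), ]
rownames(overlap) <- NULL
overlap
```

The person with id 2 holds only one bank, so the self-join yields just the b-b pair for them, which the inequality filter removes.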

Combine character columns into new column

I'd be very grateful if you could help me with the following; after a few tests I still haven't been able to get the right outcome.
I've got this data:
dd_1 <- data.frame(ID = c("1","2", "3", "4", "5"),
Class_a = c("a",NA, "a", NA, NA),
Class_b = c(NA, "b", "b", "b", "b"))
And I'd like to produce a new column 'CLASS':
dd_2 <- data.frame(ID = c("1","2", "3", "4", "5"),
Class_a = c("a",NA, "a", NA, NA),
Class_b = c(NA, "b", "b", "b", "b"),
CLASS = c("a", "b", "a-b", "b", "b"))
Thanks a lot!
Here it is:
tmp <- paste(dd_1$Class_a, dd_1$Class_b, sep='-')
tmp <- gsub('NA-|-NA', '', tmp)
(dd_2 <- cbind(dd_1, tmp))
First we concatenate (join as strings) the 2 columns. paste treats NAs as ordinary strings, i.e. "NA", so we get a-NA, NA-b, or a-b. Then we substitute NA- or -NA with an empty string.
Which results in:
## ID Class_a Class_b tmp
## 1 1 a <NA> a
## 2 2 <NA> b b
## 3 3 a b a-b
## 4 4 <NA> b b
## 5 5 <NA> b b
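One caveat with the gsub pattern: it is not anchored, so it would also strip "NA-" from inside a real class label. The sketch below uses a made-up label "DNA" (not in the question's data) to show the difference between the original pattern and an anchored variant; `loose` and `strict` are names chosen for this example.

```r
class_a <- c("a", NA, "DNA")   # "DNA" is a hypothetical label for illustration
class_b <- c(NA, "b", "b")

tmp <- paste(class_a, class_b, sep = "-")   # "a-NA" "NA-b" "DNA-b"

# Unanchored pattern also removes the "NA-" inside "DNA-b"
loose  <- gsub("NA-|-NA", "", tmp)
# Anchored pattern only strips a leading "NA-" or a trailing "-NA"
strict <- gsub("^NA-|-NA$", "", tmp)

loose    # "a" "b" "Db"
strict   # "a" "b" "DNA-b"
```

For the data in the question both patterns give the same result, since no label contains "NA".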
Another option:
dd_1$CLASS <- with(dd_1, ifelse(is.na(Class_a), as.character(Class_b),
ifelse(is.na(Class_b), as.character(Class_a),
paste(Class_a, Class_b, sep="-"))))
This checks whether either class is NA and, if so, returns the other one; if neither is NA, it returns both separated by "-".
Here's a short solution with apply:
dd_2 <- cbind(dd_1, CLASS = apply(dd_1[2:3], 1,
function(x) paste(na.omit(x), collapse = "-")))
The result
ID Class_a Class_b CLASS
1 1 a <NA> a
2 2 <NA> b b
3 3 a b a-b
4 4 <NA> b b
5 5 <NA> b b

Resources