R - combining columns by specific conditions

I currently have a data frame as follows:
groups <- data.frame(name = paste("person", 1:27, sep = ""),
                     assignment1 = c("F", "A", "B", "H", "A", "E", "D", "G", "I", "I", "E", "A", "D", "C", "F", "C", "D", "H", "F", "H", "G", "I", "G", "C", "B", "E", "B"),
                     assignment2 = c("H", "F", "F", "D", "E", "G", "A", "E", "I", "C", "A", "H", "G", "B", "I", "C", "E", "I", "C", "A", "B", "B", "G", "D", "H", "F", "D"),
                     stringsAsFactors = FALSE)
I would like to create, for each person, a list containing only the people they have already worked with. For example, person1 was in group F for the 1st assignment and group H for the 2nd assignment:
The members of group F on the 1st assignment are {"person1", "person15", "person19"}.
The members of group H on the 2nd assignment are {"person1", "person12", "person25"}.
So I would like the vector for person1 to be
{"person15", "person19", "person12", "person25"}.
Does anyone know a convenient way to do this in R?
Any help will be appreciated. Thanks in advance.

You could do this:
teammates <- lapply(1:nrow(groups), function(i) {
  assig1 <- subset(groups, assignment1 == groups$assignment1[i])$name
  assig2 <- subset(groups, assignment2 == groups$assignment2[i])$name
  unq_set <- unique(c(assig1, assig2))
  return(setdiff(unq_set, groups$name[i]))
})
This takes a vector of row indices and, for each one, applies a function that (a) gets the names of everyone whose assignment1 or assignment2 group matches that of the given row, (b) takes the unique union of these, and (c) returns that union, less the name of the person around whom the group is built.
The output is a list like this:
[[1]]
[1] "person15" "person19" "person12" "person25"
[[2]]
[1] "person5" "person12" "person3" "person26"
[[3]]
[1] "person25" "person27" "person2" "person26"
...and so on
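If you want to look teammates up by name, you could also name the list elements (a small addition, not part of the original answer):
names(teammates) <- groups$name   # label each entry with its person
teammates[["person1"]]
#[1] "person15" "person19" "person12" "person25"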
For more brevity, the following is equivalent (though the order inside list items may differ). It uses the same subsetting logic as #user5219763's answer, but the setdiff part is important:
teammates <- lapply(1:nrow(groups), function(i) {
  setdiff(
    with(groups, name[assignment1 == assignment1[i] |
                      assignment2 == assignment2[i]]),
    groups$name[i])
})

Here's a solution using dplyr and tidyr:
library(dplyr)
library(tidyr)
groups %>%
  gather(var, val, -name) %>%
  unite(comb, var, val) %>%
  left_join(., ., by = 'comb') %>%
  group_by(name.x) %>%
  summarise(out = list(name.y))
The heavy lifting is done by the left_join; before that, we combine the columns so that we can merge on e.g. assignment1_F. Note that each person's output contains the person themself and is not corrected for dupes; that part is up to you.
However, as #akrun says, if you are doing a lot of this kind of thing, use igraph.
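For reference, here is a rough sketch of that igraph idea (my interpretation, not code from #akrun): link each person to their group labels and take the one-mode projection onto people.
library(igraph)
# One edge per person-group membership; prefix the labels so that group "F"
# on assignment 1 is distinct from group "F" on assignment 2
edges <- rbind(
  data.frame(person = groups$name, grp = paste0("a1_", groups$assignment1)),
  data.frame(person = groups$name, grp = paste0("a2_", groups$assignment2))
)
g <- graph_from_data_frame(edges, directed = FALSE)
V(g)$type <- V(g)$name %in% edges$grp   # TRUE marks the group vertices
proj <- bipartite_projection(g)$proj1   # person-to-person graph
neighbors(proj, "person1")              # everyone person1 worked with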

You could use is.element():
workedWith <- function(index, data = groups) {
  data[is.element(data[, 2], data[index, 2]) | is.element(data[, 3], data[index, 3]), 1]
}
lapply(X = seq_len(nrow(groups)), FUN = workedWith)
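Note that, unlike the answers above, this includes each person in their own result; if that matters, you could wrap it in setdiff (my addition):
lapply(seq_len(nrow(groups)), function(i) setdiff(workedWith(i), groups$name[i]))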

Related

Fastest validation of sorting vector of pairs of elements until they are unorderly paired

I have an unsorted vector of length N. Each element of the vector is present precisely twice (so the vector length is an even number). I have a custom sorting algorithm, and the goal is to iterate until the vector reaches a state in which each element is adjacent to its copy.
Unsorted vector = {A,F,J,E,F,A,J,E}
A valid sorted state = {A,A,J,J,E,E,F,F}
Another valid sorted state = {J,J,A,A,F,F,E,E}
So my question is: what is the fastest way to check whether a state is validly sorted, so that I can speed up my iterations? For long vectors, this check will dictate most of my scaling ability.
Something quick and dirty, but I'm not sure it will always work:
all(duplicated(x) == c(FALSE,TRUE))
This relies on the fact that in a valid state the two copies of each value are always next to each other: the first not duplicated, the next duplicated. It seems to work with the test sets:
x <- c("A", "F", "J", "E", "F", "A", "J", "E")
s1 <- c("A", "A", "J", "J", "E", "E", "F", "F")
s2 <- c("J", "J", "A", "A", "F", "F", "E", "E")
all(duplicated(x) == c(FALSE,TRUE))
#[1] FALSE
all(duplicated(s1) == c(FALSE,TRUE))
#[1] TRUE
all(duplicated(s2) == c(FALSE,TRUE))
#[1] TRUE
And it is pretty quick, checking a vector of a million pairs in five hundredths of a second on my machine:
x <- rep(1:1e6, each=2)
system.time(all(duplicated(x) == c(FALSE,TRUE)))
# user system elapsed
# 0.04 0.00 0.05
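An equivalent check, if the recycled length-2 comparison feels fragile, is to compare the odd and even positions directly (my own variant, same assumptions):
all(x[c(TRUE, FALSE)] == x[c(FALSE, TRUE)])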
An option involves converting the vector (as the length is even and each element is present exactly twice) to a two-row matrix, getting the unique rows, and testing whether the number of rows is 1. If duplicated values are adjacent, then after adding the dim attributes with matrix, the second row will be exactly the same as the first.
f1 <- function(x) {
  nrow(unique(matrix(x, nrow = 2))) == 1
}
-testing
> v1 <- c("A", "F", "J", "E", "F", "A", "J", "E")
> v2 <- c("A", "A", "J", "J", "E", "E", "F", "F")
> v3 <- c("J", "J", "A", "A", "F", "F", "E", "E")
> f1(v1)
[1] FALSE
> f1(v2)
[1] TRUE
> f1(v3)
[1] TRUE
Or, slightly faster:
f2 <- function(x) {
  sum(duplicated(matrix(x, nrow = 2))) == 1
}
-testing
> f2(v1)
[1] FALSE
> f2(v2)
[1] TRUE
> f2(v3)
[1] TRUE
-benchmarks
#thelatemail
> f3 <- function(x) all(duplicated(x) == c(FALSE,TRUE))
#TarJae
> f4 <- function(x) {rle_obj <- rle(x); all(rle_obj$lengths > 1)}
> x1 <- rep(1:1e8, each = 2)
> system.time(f1(x1))
user system elapsed
2.649 0.456 3.111
> system.time(f2(x1))
user system elapsed
2.258 0.433 2.694
> system.time(f3(x1))
user system elapsed
9.972 1.272 11.233
> system.time(f4(x1))
user system elapsed
7.051 3.281 10.333
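As a quick sanity check (my addition), all four functions agree on the small examples:
# Each column is one test vector; each row is one function's verdict
sapply(list(v1, v2, v3), function(v) c(f1 = f1(v), f2 = f2(v), f3 = f3(v), f4 = f4(v)))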
Another option is to use the rle() function:
v1 <- c("A", "F", "J", "E", "F", "A", "J", "E")
v2 <- c("A", "A", "J", "J", "E", "E", "F", "F")
v3 <- c("J", "J", "A", "A", "F", "F", "E", "E")
rle_obj <- rle(v3)
all(rle_obj$lengths > 1)
test:
> rle_obj <- rle(v1)
> all(rle_obj$lengths > 1)
[1] FALSE
> rle_obj <- rle(v2)
> all(rle_obj$lengths > 1)
[1] TRUE
> rle_obj <- rle(v3)
> all(rle_obj$lengths > 1)
[1] TRUE

Removing list items based on presence of a sub-list

I have a list and I would like to remove any list object with a sublist. In the example below, I would like to remove ob2 and ob5 and keep all other objects.
dat <- list(ob1 = c("a", "b", "c"),
            ob2 = list(a = c("d")),
            ob3 = c("e", "f", "g"),
            ob4 = c("h", "i", "j"),
            ob5 = list(1:3))
Can anyone offer a solution for how to do this?
We can create a logical condition with sapply (from base R):
dat[!sapply(dat, is.list)]
Or with Filter from base R:
Filter(Negate(is.list), dat)
Or with discard from purrr:
library(purrr)
discard(dat, is.list)
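A type-safe variant of the sapply approach (my addition): vapply pins the return type to one logical per element, which also behaves well for an empty list.
dat[!vapply(dat, is.list, logical(1))]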

R - replicating the row names in a data.frame

I have a data.frame with dimensions [6587 x 37], and the row names must repeat after every 18 rows. How can I do this in RStudio?
If your 18 row names are:
mynames <- c("a", "b", "c", "d", "e", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s")
You can get what you want with:
paste0(rep(mynames, length.out = 6587), rep(1:366, each = 18, length.out = 6587))
Or you can modify the names by pasting different things.
Row names in data.frames have to be unique.
> df <- data.frame(x = 1:2)
> rownames(df) <- c("a", "a")
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘a’
You could use make.names to make the names unique, but still carry some repeating information.
> make.names(c("a","a"), unique = TRUE)
[1] "a" "a.1"
These could be identified with help from grep.
Or you could make a column in df (or a second data.frame) that holds the repeating information.
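A minimal sketch of that last suggestion, assuming your full data frame is called mydf (a hypothetical name) and reusing mynames from above:
mydf$group <- rep(mynames, length.out = nrow(mydf))   # repeating label as an ordinary column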

Counting on dataframe in R

I have a data frame like:
A B
A E
B E
B C
..
I want to convert it to two data frames: one counting how many times each value (A, B, C, ...) appears in the first column, and another counting how many times each value appears in the second column, e.g.
A 5
B 4
...
Could you give me some suggestions?
Thanks
Try the plyr library:
library(plyr)
myDataFrame <- as.data.frame(cbind(c("A", "A", "B", "B", "B", "C"), c("B", "E", "E", "C", "C", "E")))
count(myDataFrame[,1]) ##prints counts of first column
count(myDataFrame[,2]) ##prints counts of second column
We can use lapply to loop over the columns, get the frequency with table, and convert to data.frame. If separate datasets are needed, use list2env (not recommended):
# df1 is the input data frame with two columns
list2env(setNames(lapply(df1, function(x)
  as.data.frame(table(x))), paste0("df", 1:2)), envir = .GlobalEnv)
Alternatively, you could also use the dplyr library:
library("dplyr")
df <- as.data.frame(cbind(c("A", "A", "B", "B", "B", "C"), c("B", "E", "E", "C", "C", "E")))
names(df) <- c("V1", "V2")
df <- tbl_df(df)
df %>% group_by(V1) %>% summarise(c1 = n()) ## for column 1
df %>% group_by(V2) %>% summarise(c1 = n()) ## for column 2
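dplyr also has a count() shortcut that collapses the group_by/summarise pair (my addition):
df %>% count(V1)   # counts of the first column
df %>% count(V2)   # counts of the second column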

NAs in the dummies package

I am using the dummy.data.frame function from the dummies package to create dummy variables for the k levels of my factor. Unfortunately, my factor has NAs. When I use dummy.data.frame, it creates the k dummies with no NAs plus a new dummy that flags the missing values with 1.
However, I would like to keep the NAs in the k dummies and not have a dummy for the missing values.
Is this possible with that function? Do you know any other functions that can help me?
I usually do this kind of thing using model.matrix(). Setting the option na.action to "na.pass" will retain the NAs in their correct places. This option does not seem to change the behavior of dummy(), so model.matrix() might be your easiest bet. For example, for a single factor letters, the following should do the trick:
options(na.action="na.pass")
letters <- c( "a", "a", "b", "c", "d", "e", "f", "g", "h", "b", "b", NA )
model.matrix(~letters-1)
Or for several variables or columns of a data frame as well:
letters <- c( "a", "a", "b", "c", "d", "e", "f", "g", "h", "b", "b", NA )
betters <- c( "a", "a", "c", "c", "c", "d", "d", "d", NA, "e", "e", "e" )
model.matrix(~letters+betters-1)
The important trick here is to set the na.action option. After this dummy recoding, it is a good idea to return the option to its default value:
options(na.action="na.omit")
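If you would rather not assume the previous value was the default "na.omit", you could save and restore it instead (my variant):
old <- options(na.action = "na.pass")   # options() returns the old value
mm <- model.matrix(~ letters - 1)
options(old)                            # restore whatever was set before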
