How to keep values that occur in only one group? - R

I am working on a data frame like this:
groups values
a      1
a      1
a      2
b      2
b      3
b      3
c      4
c      5
c      6
d      6
d      7
d      2
The problem is to turn it into something like:
groups values
a      1
a      1
b      3
b      3
c      4
c      5
d      7
I want to keep rows whose values occur in only ONE group. For example, value 2 is deleted because it occurs in three different groups, but value 1 is kept: although it occurs twice, both occurrences are in ONLY ONE group.
Is there a function in the dplyr package that can handle this problem, or do I have to write my own?

As you asked for a dplyr solution:
df %>% group_by(values) %>% filter(n_distinct(groups) == 1)
# # A tibble: 7 x 2
# # Groups: values [5]
# groups values
# <chr> <int>
#1 a 1
#2 a 1
#3 b 3
#4 b 3
#5 c 4
#6 c 5
#7 d 7
with
df <- structure(list(groups = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
values = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, 6L, 7L, 2L)),
row.names = c(NA, -12L), class = "data.frame")
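If you are on a newer dplyr (1.1.0 or later; an assumption about your setup), the per-operation .by argument expresses the same filter without leaving the result grouped; a sketch:
library(dplyr)
# Same test as above, grouped per operation; the result comes back ungrouped
df %>% filter(n_distinct(groups) == 1, .by = values)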

Group by values and check whether the column groups has only one distinct element. This can be done with ave.
i <- as.logical(with(df1, ave(as.numeric(groups), values, FUN = function(x) length(unique(x)) == 1)))
df1[i, ]
# groups values
#1 a 1
#2 a 1
#5 b 3
#6 b 3
#7 c 4
#8 c 5
#11 d 7
Data in dput format.
df1 <-
structure(list(groups = structure(c(1L, 1L, 1L, 2L,
2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("a", "b",
"c", "d"), class = "factor"), values = c(1L, 1L, 2L,
2L, 3L, 3L, 4L, 5L, 6L, 6L, 7L, 2L)),
class = "data.frame", row.names = c(NA, -12L))
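Note that ave() returns a vector of the same mode as its first argument, which is why the logical test comes back as 0/1 and needs as.logical(). A sketch of the same index with the comparison written outside ave(), avoiding the double coercion:
i <- with(df1, ave(as.numeric(groups), values,
                   FUN = function(x) length(unique(x))) == 1)
df1[i, ]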

x[x$values %in% names(which(colSums(table(x)>0)==1)),]
where
x = structure(list(groups = c("a", "a", "a", "b", "b", "b", "c",
"c", "c", "d", "d", "d"), values = c(1L, 1L, 2L, 2L, 3L, 3L,
4L, 5L, 6L, 6L, 7L, 2L)), row.names = c(NA, -12L), class = "data.frame")
or, a data.table solution:
setDT(x)[, .SD[uniqueN(groups)==1], values]
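Note that the .SD form returns the by column (values) first. If you prefer to keep the original column order, a common data.table idiom (a sketch, assuming x has already been converted with setDT) is to pick row indices per group:
library(data.table)
# .I collects the original row numbers of the groups that pass the test
x[x[, .I[uniqueN(groups) == 1L], by = values]$V1]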

Using the sqldf package on your original data frame df:
library(sqldf)
result <- sqldf("SELECT * FROM df
WHERE `values` IN (
SELECT `values` from (
SELECT `values`, groups, count(*) as num from df
GROUP BY `values`, groups) t
GROUP BY `values`
HAVING COUNT(1) = 1
)")
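The nested grouping can also be collapsed into a single HAVING clause with COUNT(DISTINCT ...); a sketch that should be equivalent on the same df:
library(sqldf)
sqldf("SELECT * FROM df
       WHERE `values` IN (
         SELECT `values` FROM df
         GROUP BY `values`
         HAVING COUNT(DISTINCT groups) = 1)")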

Related

How do you (simply) apply a function to multiple sub-sets of differing lengths in R? [duplicate]

This question already has answers here: Calculate the mean by group (9 answers). Closed 2 years ago.
I need to apply a function to several subsets of data of differing lengths within a column and generate a new data frame which includes the outputs and their associated metadata.
How can I do this without recourse to for loops? tapply() seems like a good place to start, but I struggle with the syntax.
For example -- I have something like this:
block plot id species type response
1 1 1 w a 1.5
1 1 2 w a 1
1 1 3 w a 2
1 1 4 w a 1.5
1 2 5 x a 5
1 2 6 x a 6
1 2 7 x a 7
1 3 8 y b 10
1 3 9 y b 11
1 3 10 y b 9
1 4 11 z b 1
1 4 12 z b 3
1 4 13 z b 2
2 5 14 w a 0.5
2 5 15 w a 1
2 5 16 w a 1.5
2 6 17 x a 3
2 6 18 x a 2
2 6 19 x a 4
2 7 20 y b 13
2 7 21 y b 12
2 7 22 y b 14
2 8 23 z b 2
2 8 24 z b 3
2 8 25 z b 4
2 8 26 z b 2
2 8 27 z b 4
And I want to produce something like this:
block plot species type mean.response
1 1 w a 1.5
1 2 x a 6
1 3 y b 10
1 4 z b 2
2 5 w a 1
2 6 x a 3
2 7 y b 13
2 8 z b 3
Try this. You can use group_by() to set the grouping variables and then summarise() to compute the summarised variable. Here is the code using dplyr:
library(dplyr)
#Code
newdf <- df %>% group_by(block,plot,species,type) %>% summarise(Mean=mean(response,na.rm=T))
Output:
# A tibble: 8 x 5
# Groups: block, plot, species [8]
block plot species type Mean
<int> <int> <chr> <chr> <dbl>
1 1 1 w a 1.5
2 1 2 x a 6
3 1 3 y b 10
4 1 4 z b 2
5 2 5 w a 1
6 2 6 x a 3
7 2 7 y b 13
8 2 8 z b 3
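Recent dplyr versions also print a message about the remaining grouping after summarise(); if you want a plain ungrouped result, the .groups argument (available since dplyr 1.0.0; an assumption about your version) handles that:
# Same summary, returned ungrouped and without the regrouping message
newdf <- df %>%
  group_by(block, plot, species, type) %>%
  summarise(Mean = mean(response, na.rm = TRUE), .groups = "drop")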
Or using base R (-3 omits the id variable from the aggregation):
#Base R
newdf <- aggregate(response~.,data=df[,-3],mean,na.rm=T)
Output:
block plot species type response
1 1 1 w a 1.5
2 2 5 w a 1.0
3 1 2 x a 6.0
4 2 6 x a 3.0
5 1 3 y b 10.0
6 2 7 y b 13.0
7 1 4 z b 2.0
8 2 8 z b 3.0
Some data used:
#Data
df <- structure(list(block = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L
), id = 1:27, species = c("w", "w", "w", "w", "x", "x", "x",
"y", "y", "y", "z", "z", "z", "w", "w", "w", "x", "x", "x", "y",
"y", "y", "z", "z", "z", "z", "z"), type = c("a", "a", "a", "a",
"a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b"), response = c(1.5,
1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2, 0.5, 1, 1.5, 3, 2, 4,
13, 12, 14, 2, 3, 4, 2, 4)), class = "data.frame", row.names = c(NA,
-27L))
Use any of these where the input dd is given reproducibly in the Note at the end:
# 1. aggregate.formula - base R
# Can use just response on left hand side if header doesn't matter.
aggregate(cbind(mean.response = response) ~ block + plot + species + type, dd, mean)
# 2. aggregate.default - base R
v <- c("block", "plot", "species", "type")
aggregate(list(mean.response = dd$response), dd[v], mean)
# 3. sqldf
library(sqldf)
sqldf("select block, plot, species, type, avg(response) as [mean.response]
from dd group by 1, 2, 3, 4")
# 4. data.table
library(data.table)
v <- c("block", "plot", "species", "type")
as.data.table(dd)[, .(mean.response = mean(response)), by = v]
# 5. doBy - last column of output will be labelled response.mean
library(doBy)
summaryBy(response ~ block + plot + species + type, dd)
Note
The input in reproducible form:
dd <- structure(list(block = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L
), id = 1:27, species = c("w", "w", "w", "w", "x", "x", "x",
"y", "y", "y", "z", "z", "z", "w", "w", "w", "x", "x", "x", "y",
"y", "y", "z", "z", "z", "z", "z"), type = c("a", "a", "a", "a",
"a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b"), response = c(1.5,
1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2, 0.5, 1, 1.5, 3, 2, 4,
13, 12, 14, 2, 3, 4, 2, 4)), class = "data.frame", row.names = c(NA,
-27L))
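Since the question mentions tapply(): it returns a named vector/array keyed by the grouping factors rather than a data frame, so the metadata has to be reattached afterwards. A sketch of that route (not one of the approaches above), using the same dd:
# Means keyed by the interaction of the grouping columns
key <- with(dd, interaction(block, plot, species, type, drop = TRUE))
m   <- tapply(dd$response, key, mean, na.rm = TRUE)
# Rebuild the metadata and look the means up by the same key
out <- unique(dd[, c("block", "plot", "species", "type")])
out$mean.response <- m[as.character(with(out, interaction(block, plot, species, type, drop = TRUE)))]
out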

Preparing data for Gephi

Greetings,
I need to prepare data for network analysis in Gephi. I have data in the following format:
[My data]
And I need the data in this format (where the values represent persons that are connected through the organization):
[Required format]
Thank you very much!
I think this code should do the job. It is not the most elegant way of doing it, but it works :)
# Data
x <-
structure(
list(
Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E")
),
.Names = c("Persons", "Organizations"),
class = "data.frame",
row.names = c(NA, -11L)
)
# Self-merge on Organizations: every pair of persons sharing an organization
edgelist <- merge(x, x, by = "Organizations")[,2:3]
# We don't want self-links
edgelist <- subset(edgelist, Persons.x != Persons.y)
# Remove repeated pairs
edgelist <- unique(edgelist)
edgelist
#> Persons.x Persons.y
#> 2 1 3
#> 3 1 2
#> 4 3 1
#> 6 3 2
#> 7 2 1
#> 8 2 3
HIH
Created on 2018-01-03 by the reprex package (v0.1.1.9000).
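To load the result into Gephi, it is usually easiest to rename the columns to Source and Target and export a CSV for the edge-list importer (an assumption about your Gephi import settings; adjust there if needed):
# Rename to the Source/Target headers Gephi's spreadsheet importer expects,
# then write a plain CSV
names(edgelist) <- c("Source", "Target")
write.csv(edgelist, "edges.csv", row.names = FALSE)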
Starting with x:
structure(list(Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E")), .Names = c("Persons", "Organizations"), class = "data.frame", row.names = c(NA,-11L))
Create a new data.frame with different names. Just convert Organizations to a factor and then use the numeric values:
> y=data.frame(Source=x$Persons, Target=as.numeric(as.factor(x$Organizations)))
> y
Source Target
1 1 1
2 1 2
3 1 5
4 2 6
5 2 1
6 2 5
7 2 3
8 3 4
9 3 3
10 3 1
11 3 5
For what it's worth, I'm pretty sure Gephi can handle strings.

Remove the rows that have the same column A value but different column B value from df (but not vice-versa) in R

I'm trying to remove all the rows that have the same value in the "lan" column of my data frame but a different value in the "id" column (but not vice versa).
Using an example dataset:
require(dplyr)
t <- structure(list(id = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), lan = structure(c(1L, 2L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 1L,
7L), .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"),
value = c(0.22988498, 0.848989831, 0.538065821, 0.916571913,
0.304183372, 0.983348167, 0.356128559, 0.054102854, 0.400934593,
0.001026817, 0.488452667)), .Names = c("id", "lan", "value"
), class = "data.frame", row.names = c(NA, -11L))
t
I need to get rid of rows 1 and 10 because they have the same lan (a) but different id.
I've tried the following, without success:
a<-t[(!duplicated(t$id)),]
c<-a[duplicated(a$lan)|duplicated(a$lan, fromLast=TRUE),]
d<-t[!(t$lan %in% c$lan),]
Thanks for your help!
An alternative using dplyr:
t2 <- t %>%
group_by(lan,id) %>%
summarise(value=sum(value)) %>%
group_by(lan) %>%
summarise(number=n()) %>%
filter(number>1) %>%
select(lan)
> t[!t$lan %in% t2$lan ,]
id lan value
2 2 b 0.84898983
3 2 c 0.53806582
4 3 d 0.91657191
5 3 d 0.30418337
6 4 e 0.98334817
7 4 e 0.35612856
8 4 e 0.05410285
9 4 f 0.40093459
11 4 g 0.48845267
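The same condition can also be written as a single grouped filter, skipping the intermediate summary; a sketch:
library(dplyr)
# Keep only rows whose lan value maps to exactly one id
t %>%
  group_by(lan) %>%
  filter(n_distinct(id) == 1) %>%
  ungroup()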
Use duplicated() on "lan" (in both directions) to flag lan values that appear more than once; use duplicated() on the 'id' and 'lan' columns together, negated, to flag (id, lan) pairs that appear only once. Rows flagged by both tests are the ones to drop, so negate the combination and subset.
indx1 <- with(t, duplicated(lan)|duplicated(lan,fromLast=TRUE))
indx2 <- !(duplicated(t[1:2])|duplicated(t[1:2],fromLast=TRUE))
t[!(indx1 & indx2),]
# id lan value
#2 2 b 0.84898983
#3 2 c 0.53806582
#4 3 d 0.91657191
#5 3 d 0.30418337
#6 4 e 0.98334817
#7 4 e 0.35612856
#8 4 e 0.05410285
#9 4 f 0.40093459
#11 4 g 0.48845267
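The same test also fits in one line of base R with ave(), counting distinct ids per lan (a sketch):
# Rows are kept when their lan value is associated with a single id
t[with(t, ave(id, lan, FUN = function(x) length(unique(x)))) == 1, ]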

find highest value within factor levels

If I have the following data frame:
value factorA factorB
1 a e
2 a f
3 a g
1 b k
2 b l
3 b m
1 c e
2 c g
how can I get, for each level of factorA, the highest value and the entry from factorB associated with it, i.e.
value factorA factorB
3 a g
3 b m
2 c g
Is this possible without first using
blocks <- split(factorA, list(), drop=TRUE)
and then sorting each block$a, since this will be performed many times and the number of blocks will always change?
Here is one option, using base R functions:
maxRows <- by(df, df$factorA, function(X) X[which.max(X$value),])
do.call("rbind", maxRows)
# value factorA factorB
# a 3 a g
# b 3 b m
# c 2 c g
With your data
df<- structure(list(value = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L), factorA = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
factorB = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 3L), .Label = c("e",
"f", "g", "k", "l", "m"), class = "factor")), .Names = c("value",
"factorA", "factorB"), class = "data.frame", row.names = c(NA,
-8L))
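If dplyr is an option, slice_max() (available since dplyr 1.0.0; an assumption about your version) expresses the same idea; a sketch that keeps one row per level (set with_ties = TRUE to keep tied maxima instead):
library(dplyr)
# One row per factorA holding the largest value, together with its factorB
df %>%
  group_by(factorA) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()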
Using the ddply function from the plyr package:
> library(plyr)
> df2 <- ddply(df, c('factorA'), function(x) x[which(x$value == max(x$value)), ])
> df2
value factorA factorB
1 3 a g
2 3 b m
3 2 c g
Or, using factorA for the row names:
> rownames(df2) <- df2$factorA
> df2
value factorA factorB
a 3 a g
b 3 b m
c 2 c g

reshape: cast oddity

Either it's late, or I've found a bug, or cast doesn't like colnames with "." in them. This all happens inside a function, but it fails outside of a function just as it does inside.
x <- structure(list(df.q6 = structure(c(1L, 1L, 1L, 11L, 11L, 9L,
4L, 11L, 1L, 1L, 2L, 2L, 11L, 5L, 4L, 9L, 4L, 4L, 1L, 9L, 4L,
10L, 1L, 11L, 9L), .Label = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j", "k"), class = "factor"), df.s5 = structure(c(4L,
4L, 1L, 2L, 4L, 4L, 4L, 3L, 4L, 1L, 2L, 1L, 2L, 4L, 1L, 3L, 4L,
2L, 2L, 4L, 4L, 4L, 2L, 2L, 1L), .Label = c("a", "b", "c", "d",
"e"), class = "factor")), .Names = c("df.q6", "df.s5"), row.names = c(NA,
25L), class = "data.frame")
cast(x, df.q6 + df.s5 ~., length)
No worky.
However, if:
colnames(x) <- c("variable", "value")
cast(x, variable + value ~., length)
Works like a charm.
I use a similar solution to what Spacedman points out.
#take your data.frame x with its two columns
#add a value column
x$value <- 1
#apply your cast verbatim
cast(x, df.q6 + df.s5 ~., length)
df.q6 df.s5 (all)
1 a a 2
2 a b 2
3 a d 3
4 b a 1
5 b b 1
6 d a 1
7 d b 1
8 d d 3
9 e d 1
10 i a 1
11 i c 1
12 i d 2
13 j d 1
14 k b 3
15 k c 1
16 k d 1
Hopefully that helps!
Jay
Nothing to do with the dots in the colnames (easily shown!).
If your dataframe doesn't have a column called 'value' then cast() guesses which column is the value - in this case it guesses 'df.s5' as it is the last column. This is what you get when you melt() data. It then renames that column to 'value' before calling reshape1. Now the column 'df.s5' is no more, yet it's there on the left of your formula. Uh oh.
You are using the value in the formula, which is an odd thing to do. None of the cast examples do that. What are you trying to do here?
You could add an ad-hoc column as a dummy value:
> cast(cbind(x, 1), df.q6 + df.s5 ~ ., length)
Using 1 as value column. Use the value argument to cast to override this choice
df.q6 df.s5 (all)
1 a a 2
2 a b 2
3 a d 3
4 b a 1
5 b b 1
[etc]
But I suspect there's a better way to get the number of repeated observations (rows) in a data frame - which is your real question!
If you are looking for an easy solution, dcast in the reshape2 package can help:
library(reshape2)
dcast(x, df.q6 + df.s5 ~., length)
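If the underlying goal is just the number of repeated (df.q6, df.s5) rows, counting them directly avoids the cast machinery altogether; a dplyr sketch:
library(dplyr)
# One row per observed combination, with its frequency in column n
count(x, df.q6, df.s5)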

Resources