Preparing data for Gephi - r

Greetings,
I need to prepare data for network analysis in Gephi. I have data in the following format:
MY Data
And I need the data in this format (where the values represent persons who are connected through an organization):
Required format
Thank you very much!

I think this code should do the job. It is not the most elegant way of doing it, but it works :)
# Data
x <-
structure(
list(
Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E")
),
.Names = c("Persons", "Organizations"),
class = "data.frame",
row.names = c(NA, -11L)
)
# This will merge n:n
edgelist <- merge(x, x, by = "Organizations")[,2:3]
# We don't want autolinks
edgelist <- subset(edgelist, Persons.x != Persons.y)
# Removing those that are repeated
edgelist <- unique(edgelist)
edgelist
#>   Persons.x Persons.y
#> 2         1         3
#> 3         1         2
#> 4         3         1
#> 6         3         2
#> 7         2         1
#> 8         2         3
HIH
Created on 2018-01-03 by the reprex package (v0.1.1.9000).
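If the network is meant to be undirected, each connection appears twice in the edge list above (for example 1-3 and 3-1). Here is a minimal sketch of one way to keep a single row per pair. It assumes the file will be imported as an undirected edge list with Source and Target columns; the renaming and the out_file path are illustrative choices, not part of the original answer:

```r
# Same example data as above
x <- data.frame(
  Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E"),
  stringsAsFactors = FALSE
)

# Pair every person with every person in the same organization
pairs <- merge(x, x, by = "Organizations")

# Keeping only Persons.x < Persons.y drops self-links and mirrored pairs at once
pairs <- pairs[pairs$Persons.x < pairs$Persons.y, ]

# One row per connected pair, even if they share several organizations
edges <- unique(data.frame(Source = pairs$Persons.x, Target = pairs$Persons.y))
rownames(edges) <- NULL

# Write the edge list to a CSV file (hypothetical temporary path)
out_file <- tempfile(fileext = ".csv")
write.csv(edges, out_file, row.names = FALSE)
```

With the example data this leaves three edges, one for each pair of persons.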

Starting with x:
structure(list(Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E")), .Names = c("Persons", "Organizations"), class = "data.frame", row.names = c(NA,-11L))
Create a new data.frame with different names. Just convert Organizations to a factor and then use the numeric values:
> y=data.frame(Source=x$Persons, Target=as.numeric(as.factor(x$Organizations)))
> y
   Source Target
1       1      1
2       1      2
3       1      5
4       2      6
5       2      1
6       2      5
7       2      3
8       3      4
9       3      3
10      3      1
11      3      5
For what it's worth, I'm pretty sure Gephi can handle strings.
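One caveat with the numeric recoding: person 1 and organization 1 (A) end up with the same number, so a tool that treats Source and Target values as node IDs would likely treat them as one node. A hedged sketch that keeps the organization labels as strings and prefixes the person IDs; the "P" prefix and the name shared_ids are arbitrary illustrative choices:

```r
x <- data.frame(
  Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E"),
  stringsAsFactors = FALSE
)

# Person-to-organization edge list with string IDs
y <- data.frame(
  Source = paste0("P", x$Persons),  # "P1", "P2", "P3"
  Target = x$Organizations,         # "A" ... "F", kept as labels
  stringsAsFactors = FALSE
)

# Check that no ID is used for both a person and an organization
shared_ids <- intersect(y$Source, y$Target)
```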

Related

How To Remove NA Values Dependent On A Condition in R?

Let's say I have this data frame. How would I go about removing only the NA values associated with name a without physically removing them manually?
a 1 4
a 7 3
a NA 4
a 6 3
a NA 4
a NA 3
a 2 4
a NA 3
a 1 4
b NA 2
c 3 NA
I've tried using the function !is.na, but that removes all the NA values in the column ID1 for all the names. How would I specifically target the ones that are associated with name a?
You could subset your data frame as follows:
df_new <- df[!(df$name == "a" & is.na(df$ID1)), ]
This can also be written as:
df_new <- df[df$name != "a" | !is.na(df$ID1), ]
With dplyr:
library(dplyr)
df %>%
  filter(!(name == "a" & is.na(ID1)))
Or with subset:
subset(df, !(name == "a" & is.na(ID1)))
Output
name ID1 ID2
1 a 1 4
2 a 7 3
3 a 6 3
4 a 2 4
5 a 1 4
6 b NA 2
7 c 3 NA
Data
df <- structure(list(name = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "b", "c"), ID1 = c(1L, 7L, NA, 6L, NA, NA, 2L, NA, 1L, NA,
3L), ID2 = c(4L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 4L, 2L, NA)), class = "data.frame", row.names = c(NA,
-11L))
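The two base R forms are equivalent by De Morgan's law: "not (A and B)" is the same as "(not A) or (not B)". A small check on the same data, written as a sketch; the names drop, keep_a and keep_b are illustrative:

```r
df <- data.frame(
  name = c(rep("a", 9), "b", "c"),
  ID1 = c(1L, 7L, NA, 6L, NA, NA, 2L, NA, 1L, NA, 3L),
  ID2 = c(4L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 4L, 2L, NA),
  stringsAsFactors = FALSE
)

# Rows to drop: name is "a" AND ID1 is missing
drop <- df$name == "a" & is.na(df$ID1)

# Two ways of writing "keep everything else"
keep_a <- !drop
keep_b <- df$name != "a" | !is.na(df$ID1)

# Both masks select exactly the same rows
same_mask <- identical(keep_a, keep_b)

df_new <- df[keep_a, ]
rownames(df_new) <- NULL
```

The NA rows for names b and c survive because the condition only fires when name is "a".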

How do you (simply) apply a function to multiple sub-sets of differing lengths in R? [duplicate]

I need to apply a function to several subsets of data of differing lengths within a column and generate a new data frame which includes the outputs and their associated metadata.
How can I do this without recourse to for loops? tapply() seems like a good place to start, but I struggle with the syntax.
For example -- I have something like this:
block plot id species type response
1 1 1 w a 1.5
1 1 2 w a 1
1 1 3 w a 2
1 1 4 w a 1.5
1 2 5 x a 5
1 2 6 x a 6
1 2 7 x a 7
1 3 8 y b 10
1 3 9 y b 11
1 3 10 y b 9
1 4 11 z b 1
1 4 12 z b 3
1 4 13 z b 2
2 5 14 w a 0.5
2 5 15 w a 1
2 5 16 w a 1.5
2 6 17 x a 3
2 6 18 x a 2
2 6 19 x a 4
2 7 20 y b 13
2 7 21 y b 12
2 7 22 y b 14
2 8 23 z b 2
2 8 24 z b 3
2 8 25 z b 4
2 8 26 z b 2
2 8 27 z b 4
And I want to produce something like this:
block plot species type mean.response
1 1 w a 1.5
1 2 x a 6
1 3 y b 10
1 4 z b 2
2 5 w a 1
2 6 x a 3
2 7 y b 13
2 8 z b 3
Try this. You can use group_by() to set the grouping variables and then summarise() to compute the mean within each group. Here is the code using dplyr:
library(dplyr)
#Code
newdf <- df %>%
  group_by(block, plot, species, type) %>%
  summarise(Mean = mean(response, na.rm = TRUE))
Output:
# A tibble: 8 x 5
# Groups: block, plot, species [8]
block plot species type Mean
<int> <int> <chr> <chr> <dbl>
1 1 1 w a 1.5
2 1 2 x a 6
3 1 3 y b 10
4 1 4 z b 2
5 2 5 w a 1
6 2 6 x a 3
7 2 7 y b 13
8 2 8 z b 3
Or using base R (the -3 drops the id column so that it is not used as a grouping variable):
#Base R
newdf <- aggregate(response ~ ., data = df[, -3], mean, na.rm = TRUE)
Output:
block plot species type response
1 1 1 w a 1.5
2 2 5 w a 1.0
3 1 2 x a 6.0
4 2 6 x a 3.0
5 1 3 y b 10.0
6 2 7 y b 13.0
7 1 4 z b 2.0
8 2 8 z b 3.0
Some data used:
#Data
df <- structure(list(block = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L
), id = 1:27, species = c("w", "w", "w", "w", "x", "x", "x",
"y", "y", "y", "z", "z", "z", "w", "w", "w", "x", "x", "x", "y",
"y", "y", "z", "z", "z", "z", "z"), type = c("a", "a", "a", "a",
"a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b"), response = c(1.5,
1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2, 0.5, 1, 1.5, 3, 2, 4,
13, 12, 14, 2, 3, 4, 2, 4)), class = "data.frame", row.names = c(NA,
-27L))
Use any of these where the input dd is given reproducibly in the Note at the end:
# 1. aggregate.formula - base R
# Can use just response on left hand side if header doesn't matter.
aggregate(cbind(mean.response = response) ~ block + plot + species + type, dd, mean)
# 2. aggregate.default - base R
v <- c("block", "plot", "species", "type")
aggregate(list(mean.response = dd$response), dd[v], mean)
# 3. sqldf
library(sqldf)
sqldf("select block, plot, species, type, avg(response) as [mean.response]
from dd group by 1, 2, 3, 4")
# 4. data.table
library(data.table)
v <- c("block", "plot", "species", "type")
as.data.table(dd)[, .(mean.response = mean(response)), by = v]
# 5. doBy - last column of output will be labelled response.mean
library(doBy)
summaryBy(response ~ block + plot + species + type, dd)
Note
The input in reproducible form:
dd <- structure(list(block = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L
), id = 1:27, species = c("w", "w", "w", "w", "x", "x", "x",
"y", "y", "y", "z", "z", "z", "w", "w", "w", "x", "x", "x", "y",
"y", "y", "z", "z", "z", "z", "z"), type = c("a", "a", "a", "a",
"a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b"), response = c(1.5,
1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2, 0.5, 1, 1.5, 3, 2, 4,
13, 12, 14, 2, 3, 4, 2, 4)), class = "data.frame", row.names = c(NA,
-27L))
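Since the question mentions tapply(), here is a minimal sketch of that route. It assumes, as in the example data, that each plot has a single block, species and type, so grouping by plot alone is enough. The compact rep() construction of dd is just shorthand for the data above, and the names sizes, plot_means and result are illustrative:

```r
sizes <- c(4, 3, 3, 3, 3, 3, 3, 5)  # number of observations in plots 1 to 8

dd <- data.frame(
  block = rep(1:2, c(13, 14)),
  plot = rep(1:8, sizes),
  id = 1:27,
  species = rep(rep(c("w", "x", "y", "z"), 2), sizes),
  type = rep(rep(c("a", "a", "b", "b"), 2), sizes),
  response = c(1.5, 1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2, 0.5, 1, 1.5,
               3, 2, 4, 13, 12, 14, 2, 3, 4, 2, 4),
  stringsAsFactors = FALSE
)

# tapply(values, groups, function) returns one mean per plot, named by plot
plot_means <- tapply(dd$response, dd$plot, mean)

# One metadata row per plot, then look up each plot's mean by name
result <- unique(dd[c("block", "plot", "species", "type")])
result$mean.response <- as.numeric(plot_means[as.character(result$plot)])
rownames(result) <- NULL
```

tapply() on its own only returns the means; the metadata has to be joined back separately, which is why aggregate() or a grouped summary is usually more convenient here.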

How to select lines which have equal values in columns and maintain these characteristics

I have a complete data frame of all cities in Brazil, but I want just some predefined cities, which are listed in another column. I'd like to keep all the columns of my data frame, but select only the lines where the city in the all-cities column matches one of the cities in the predefined-cities column.
data = read.csv(file="C:/Users/guilherme/Desktop/data.csv", header=TRUE, sep=";")
> data
  AllCities Year1990 Year200 PredefinedCities CharacCities1 CharacCities2
1         A        2       4                C            12             5
2         B        2       2                A            11            10
3         C        3       4                F            09             2
4         D        4       2
5         E        5       6
6         F        6       2
I want the following
> data
  AllCities Year1990 Year200 PredefinedCities CharacCities1 CharacCities2
1         C        3       4                C            12             5
2         A        2       4                A            11            10
3         F        6       2                F            09             2
You need merge -
merge(
data[, c("AllCities", "Year1990", "Year200")],
data[, c("PredefinedCities", "CharacCities1", "CharacCities2")],
by.x = "AllCities", by.y = "PredefinedCities"
)
AllCities Year1990 Year200 CharacCities1 CharacCities2
1 A 2 4 11 10
2 C 3 4 12 5
3 F 6 2 9 2
Note - Your data format is unusual. If you can, you should fix the data source so that it gives you the AllCities and PredefinedCities tables separately, or maybe even join them correctly before creating the csv file.
Data -
structure(list(AllCities = c("A", "B", "C", "D", "E", "F"), Year1990 = c(2L,
2L, 3L, 4L, 5L, 6L), Year200 = c(4L, 2L, 4L, 2L, 6L, 2L), PredefinedCities = c("C",
"A", "F", "", "", ""), CharacCities1 = c(12L, 11L, 9L, NA, NA,
NA), CharacCities2 = c(5L, 10L, 2L, NA, NA, NA)), .Names = c("AllCities",
"Year1990", "Year200", "PredefinedCities", "CharacCities1", "CharacCities2"
), class = "data.frame", row.names = c(NA, -6L))
data <- data[data$AllCities %in% data$PredefinedCities,]
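Note that merge() sorts the result by the key by default, and the %in% filter keeps the rows in AllCities order, so neither reproduces the C, A, F order shown in the desired output. A hedged sketch using match() to look up the row of each predefined city; it assumes every predefined city appears exactly once in AllCities and that empty strings mark the unused rows. The names predefined, idx and result are illustrative:

```r
data <- data.frame(
  AllCities = c("A", "B", "C", "D", "E", "F"),
  Year1990 = c(2L, 2L, 3L, 4L, 5L, 6L),
  Year200 = c(4L, 2L, 4L, 2L, 6L, 2L),
  PredefinedCities = c("C", "A", "F", "", "", ""),
  CharacCities1 = c(12L, 11L, 9L, NA, NA, NA),
  CharacCities2 = c(5L, 10L, 2L, NA, NA, NA),
  stringsAsFactors = FALSE
)

# The predefined cities and their characteristics, in their original order
predefined <- data[data$PredefinedCities != "",
                   c("PredefinedCities", "CharacCities1", "CharacCities2")]

# Position of each predefined city within AllCities
idx <- match(predefined$PredefinedCities, data$AllCities)

# Combine the matching year columns with the predefined-city columns
result <- cbind(data[idx, c("AllCities", "Year1990", "Year200")], predefined)
rownames(result) <- NULL
```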

How to keep values that occur in only one group?

I am working on a data frame like:
groups . values
a . 1
a . 1
a . 2
b . 2
b . 3
b . 3
c . 4
c . 5
c . 6
d . 6
d . 7
d . 2
The problem is to turn it into something like:
groups . values
a . 1
a . 1
b . 3
b . 3
c . 4
c . 5
d . 7
I want to keep the rows whose values occur in only ONE group. For example, value 2 is deleted because it occurs in three different groups, but value 1 is kept: it occurs twice, but in ONLY ONE group.
Are there any functions in the dplyr package that can handle this problem, or do I have to write my own function?
As you asked for a dplyr solution:
library(dplyr)
df %>% group_by(values) %>% filter(n_distinct(groups) == 1)
# # A tibble: 7 x 2
# # Groups: values [5]
# groups values
# <chr> <int>
#1 a 1
#2 a 1
#3 b 3
#4 b 3
#5 c 4
#6 c 5
#7 d 7
with
df <- structure(list(groups = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
values = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, 6L, 7L, 2L)),
row.names = c(NA, -12L), class = "data.frame")
Group by values and see if column groups has only one element. This can be done with ave.
i <- as.logical(with(df1, ave(as.numeric(groups), values, FUN = function(x) length(unique(x)) == 1)))
df1[i, ]
# groups values
#1 a 1
#2 a 1
#5 b 3
#6 b 3
#7 c 4
#8 c 5
#11 d 7
Data in dput format.
df1 <-
structure(list(groups = structure(c(1L, 1L, 1L, 2L,
2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("a", "b",
"c", "d"), class = "factor"), values = c(1L, 1L, 2L,
2L, 3L, 3L, 4L, 5L, 6L, 6L, 7L, 2L)),
class = "data.frame", row.names = c(NA, -12L))
x[x$values %in% names(which(colSums(table(x)>0)==1)),]
where
x = structure(list(groups = c("a", "a", "a", "b", "b", "b", "c",
"c", "c", "d", "d", "d"), values = c(1L, 1L, 2L, 2L, 3L, 3L,
4L, 5L, 6L, 6L, 7L, 2L)), row.names = c(NA, -12L), class = "data.frame")
or, a data.table solution:
library(data.table)
setDT(x)[, .SD[uniqueN(groups)==1], values]
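The table() one-liner in this answer packs several steps together. Unpacked, as a sketch with illustrative intermediate names (tab, present, n_groups, single):

```r
x <- data.frame(
  groups = rep(c("a", "b", "c", "d"), each = 3),
  values = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, 6L, 7L, 2L),
  stringsAsFactors = FALSE
)

# Contingency table: one row per group, one column per value
tab <- table(x)

# TRUE where a value appears at least once in a group
present <- tab > 0

# Number of distinct groups each value appears in (named by value)
n_groups <- colSums(present)

# Values that appear in exactly one group, as character names
single <- names(which(n_groups == 1))

# %in% compares via match(), so integer values match the character names
result <- x[x$values %in% single, ]
```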
Using sqldf package for your original data frame df:
library(sqldf)
result <- sqldf("SELECT * FROM df
WHERE `values` IN (
SELECT `values` from (
SELECT `values`, groups, count(*) as num from df
GROUP BY `values`, groups) t
GROUP BY `values`
HAVING COUNT(1) = 1
)")

counting number of observations into a dataframe

I want to create a new column in a dataframe that states the number of observations for a particular group.
I have a surgical procedure (HRG.Code), multiple consultants who perform this procedure (Consultant.Code), and the length of stay of their patients in days.
Using
sourceData2$meanvalue <- with(sourceData2, ave(LengthOfStayDays., HRG.Code, Consultant.Code, FUN = mean))
I can get a new column (meanvalue) that shows the mean length of stay per consultant per procedure.
This is just what I need. However, I'd also like to know how many occurrences of each procedure each consultant performed, as a new column in this same data frame.
How do I generate this number of observations? There doesn't appear to be a FUN = Observations or FUN = freq capability.
You may try:
tbl <- table(sourceData2[,3:2]) #gives the frequency of each `procedure` i.e. `HRG.Code` done by every `Consultant.Code`
tbl
# HRG.Code
#Consultant.Code A B C
# A 1 1 0
# B 4 2 1
# C 0 0 1
# D 1 1 1
# E 2 0 0
as.data.frame.matrix(tbl) #converts the `table` to `data.frame`
If you want the total number of unique procedures done by each Consultant.Code, in long form:
with(sourceData2, as.numeric(ave(HRG.Code, Consultant.Code,
FUN=function(x) length(unique(x)))))
# [1] 3 3 3 2 1 3 3 3 3 1 1 3 3 3 2
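To get the number of observations per procedure and consultant as a new column (the count the question asks for), ave() can be reused with FUN = length. A minimal sketch on the sample data given in this answer; the column name n_obs is an illustrative choice:

```r
sourceData2 <- data.frame(
  LengthofStayDays = c(2L, 2L, 4L, 3L, 4L, 5L, 2L, 4L, 5L, 2L, 4L, 2L, 4L, 4L, 2L),
  HRG.Code = c("C", "A", "A", "B", "A", "A", "B", "C", "A", "A", "C", "A", "B", "B", "A"),
  Consultant.Code = c("B", "B", "B", "A", "E", "B", "D", "D", "D", "E", "C", "B", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# Only the group sizes matter, so seq_along() just supplies a vector
# of the right length for ave() to split by procedure and consultant
sourceData2$n_obs <- with(
  sourceData2,
  ave(seq_along(HRG.Code), HRG.Code, Consultant.Code, FUN = length)
)

# The mean length of stay per procedure and consultant, for comparison
sourceData2$meanvalue <- with(
  sourceData2,
  ave(LengthofStayDays, HRG.Code, Consultant.Code, FUN = mean)
)
```

Each row then carries the size of its own procedure-by-consultant group, matching the cells of the frequency table.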
data
sourceData2 <- structure(list(LengthofStayDays = c(2L, 2L, 4L, 3L, 4L, 5L, 2L,
4L, 5L, 2L, 4L, 2L, 4L, 4L, 2L), HRG.Code = c("C", "A", "A",
"B", "A", "A", "B", "C", "A", "A", "C", "A", "B", "B", "A"),
Consultant.Code = c("B", "B", "B", "A", "E", "B", "D", "D",
"D", "E", "C", "B", "B", "B", "A")), .Names = c("LengthofStayDays",
"HRG.Code", "Consultant.Code"), row.names = c(NA, -15L), class = "data.frame")
