How to collect unique values, and sum across other columns with conditions - r

I have a lot of financial trading data (around a million rows) and I want to condense it into a new data frame with one row per unique UserId. I then want to add up the "Trades" for each account under some conditions, i.e. if TransactionTypeId == 2 & AC_Type == 19. I would use SUMIFS in Excel for this, but the size of the file makes it pretty much impossible to run on my computer.
df<- structure(list(UserId = c(1, 1, 1, 1, 2,
2, 2, 3, 3, 3, 4, 5, 6,
6, 6, 7, 7, 7, 8, 8, 8,
8, 8, 9, 9, 9, 10, 11, 12,
12, 13, 13, 13, 14, 14, 15, 15,
16, 16, 16), TransactionTypeId = c(14, 1, 1, 70,
15, 1, 1, 14, 1, 1, 70, 14, 14, 1, 1, 14, 1, 1, 14, 1, 1, 1,
1, 14, 1, 1, 14, 14, 1, 1, 14, 1, 1, 1, 1, 70, 70, 14, 1, 1),
AC_Type = c(21, 21, 21, 21, 19, 19, 19, 19, 19, 19, 19, 19,
19, 19, 19, 21, 21, 21, 19, 19, 19, 19, 19, 19, 19, 19, 20,
19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20), Trades = c(30,
30, 0.00067116, 0.00067115, 249, 249, 0.00533033, 48.75,
48.75, 0.00101298, 0.00533, 24.37, 146.25, 146.25, 0.00309109,
100.01, 100.01, 0.00233551, 97.5, 90, 0.00189134, 5, 0.00245851,
234, 234, 0.00500802, 100.01, 48.75, 48.5, 0.0275474, 24,
24, 0.00051975, 100, 0.00223998, 0.00051975, 0.00205, 9.75,
8.75, 0.00017811)), row.names = c(NA, -40L), class = c("tbl_df",
"tbl", "data.frame"))

You can sum Trades over only the rows that meet your logical condition.
library(dplyr)
df %>%
group_by(UserId) %>%
summarise(count = sum(Trades[TransactionTypeId == 2 & AC_Type== 19]))

Not quite sure what you want ...
library(dplyr)
df %>%
group_by(UserId) %>%
filter(TransactionTypeId == 1 & AC_Type == 19) %>%
summarise(sum = sum(Trades))
# A tibble: 6 x 2
UserId sum
<dbl> <dbl>
1 2 249.
2 3 48.8
3 6 146.
4 8 95.0
5 9 234.
6 12 48.5
Here you first group_by UserId, then filter those rows that meet your conditions (NB: I've changed 2 to 1 as there aren't any 2s in the sample data), and finally summarise by summing up the values in Trades.
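The two dplyr answers above differ in one respect that is easy to miss: filtering before summarising drops users who have no matching rows, while summing over a logical subset keeps every user, with a total of 0 where nothing matches. A minimal base R sketch of that difference, using a made-up toy data frame (toy, keep and matched are illustrative names, not from the question):

```r
# Toy data (hypothetical, not the asker's data)
toy <- data.frame(
  UserId            = c(1, 1, 2, 2, 3),
  TransactionTypeId = c(1, 14, 1, 1, 14),
  AC_Type           = c(19, 19, 19, 21, 19),
  Trades            = c(10, 5, 2, 3, 7)
)

# Rows that satisfy both conditions
keep <- toy$TransactionTypeId == 1 & toy$AC_Type == 19

# Approach A: filter first, then aggregate; users with no match disappear
filtered <- aggregate(Trades ~ UserId, data = toy[keep, ], FUN = sum)

# Approach B: zero out non-matching rows, then aggregate; every user is kept
toy$matched <- ifelse(keep, toy$Trades, 0)
all_users <- aggregate(matched ~ UserId, data = toy, FUN = sum)

print(filtered)
print(all_users)
```

Which behaviour is wanted depends on whether users without any qualifying trades should appear in the condensed table.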

Using data.table
library(data.table)
setDT(df)[, .(count = sum(Trades[TransactionTypeId == 2 &
AC_Type== 19], na.rm = TRUE)), UserId]


how to classify data inside a list file

I have data that looks like this
df<- structure(list(14, FALSE, c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12,
13, 6), c(0, 0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 0), c(0, 1, 2,
3, 4, 12, 5, 6, 7, 8, 9, 10, 11), c(0, 1, 2, 3, 4, 12, 5, 6,
7, 8, 9, 10, 11), c(0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13), c(0, 6, 6, 6, 6, 6, 6, 13, 13, 13, 13, 13, 13, 13, 13
), list(c(1, 0, 1), structure(list(), names = character(0)),
list(name = c("Bestman", "Tera1", "Tera2", "Tera3", "Tera4",
"Tera5", "Tetra", "Brownie1", "Brownie2", "Brownie3", "Brownie4",
"Brownie5", "Brownie6", "Brownie7")), list()), <environment>), class = "igraph")
I am trying to label the nodes and assign the two core nodes as roots.
I can easily do this:
as_tbl_graph(df) %>%
activate(nodes) %>%
mutate(type = ifelse(name %in% c("Bestman", "Tetra"), "root", "branch")) %>%
mutate(group = ifelse(name == "Bestman" | grepl("Tera", name),
"Bestman", "Tera"))
When the number of core nodes grows, this method stops working, for example when my data looks like this and I do the following:
df2<-structure(list(28, FALSE, c(1, 2, 3, 4, 5, 6, 1, 2, 8, 7, 9,
10, 11, 7, 7, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 26,
27, 7, 12, 18, 25, 12, 18, 25, 18, 25, 25), c(0, 0, 0, 0, 0,
0, 0, 0, 7, 6, 7, 7, 7, 2, 1, 12, 12, 12, 12, 12, 18, 18, 18,
18, 18, 18, 25, 25, 0, 0, 0, 0, 7, 7, 7, 12, 12, 18), c(6, 0,
7, 1, 2, 3, 4, 5, 28, 14, 13, 9, 8, 10, 11, 12, 29, 32, 15, 16,
17, 18, 19, 30, 33, 35, 20, 21, 22, 23, 24, 25, 31, 34, 36, 37,
26, 27), c(6, 0, 7, 1, 2, 3, 4, 5, 28, 29, 30, 31, 14, 13, 9,
8, 10, 11, 12, 32, 33, 34, 15, 16, 17, 18, 19, 35, 36, 20, 21,
22, 23, 24, 25, 37, 26, 27), c(0, 0, 2, 4, 5, 6, 7, 8, 12, 13,
14, 15, 16, 18, 19, 20, 21, 22, 23, 26, 27, 28, 29, 30, 31, 32,
36, 37, 38), c(0, 12, 13, 14, 14, 14, 14, 15, 22, 22, 22, 22,
22, 29, 29, 29, 29, 29, 29, 36, 36, 36, 36, 36, 36, 36, 38, 38,
38), list(c(1, 0, 1), structure(list(), names = character(0)),
list(name = c("Bestman", "Tera1", "Tera2", "Tera3", "Tera4",
"Tera5", "Brownie2", "Tetra", "Brownie1", "Brownie3", "Brownie4",
"Brownie5", "trueG", "ckage1", "ckage2", "ckage3", "ckage4",
"ckage5", "Carowner", "Hoghet1", "Hoghet2", "Hoghet3", "Hoghet4",
"Hoghet5", "Hoghet6", "Bestwomen", "Esme2", "Esme3")), list()),
<environment>), class = "igraph")
as_tbl_graph(df2) %>%
activate(nodes) %>%
mutate(type = ifelse(name %in% c("Bestman", "Tetra", "trueG", "Carowner","Bestwomen"), "root", "branch")) %>%
mutate(group = ifelse(name == "Bestman" | grepl("Tetra", name) | grepl("trueG",name) | grepl("Carowner", name) | grepl("Bestwomen", name) , "Bestman", "Tetra","trueG","Carowner","Bestwomen" ))
I get an error. What am I doing wrong here?
Your second graph is more complex than your first. Some of the 'peripheral' nodes join more than one central node, so it is not clear how they should be labelled / colored. However, tidygraph has various grouping functions which can be used to assign the nodes to groups based on their connectivity, and the centrality of a node can be calculated automatically to help with labelling and sizing.
library(tidygraph)
library(ggraph)
df2 %>%
as_tbl_graph() %>%
activate(nodes) %>%
mutate(is_central = centrality_hub() > 0.6) %>%
mutate(group = factor(group_label_prop())) %>%
ggraph(layout = "igraph", algorithm = "nicely") +
geom_edge_link(width = 2, alpha = 0.1) +
geom_node_circle(aes(r = ifelse(is_central, nchar(name)/12, 0.1), fill = group),
color = NA) +
geom_node_text(aes(label = ifelse(is_central, name, '')), size = 5,
color = "gray40", family = "Roboto Condensed", fontface = 2) +
theme_graph() +
coord_equal() +
scale_fill_brewer(palette = "Pastel2", guide = "none")
ifelse only allows two outcomes (yes and no); try using dplyr::case_when instead.
https://dplyr.tidyverse.org/reference/case_when.html
Update to add requested code:
mutate(group = dplyr::case_when(name == "Bestman" ~ "Bestman",
grepl("Tetra", name) ~ "Tetra",
grepl("trueG",name) ~ "trueG",
grepl("Carowner", name) ~ "Carowner",
grepl("Bestwomen", name) ~ "Bestwomen"))
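case_when returns the value of the first condition that matches, and NA when none does. The same first-match-wins logic can be sketched in base R with a loop over the root names. The node vector below is a shortened, hypothetical subset of the names in the question, and unlike the answer above it uses grepl for every root, including "Bestman":

```r
# Shortened, illustrative subset of node names (not the full graph)
nodes <- c("Bestman", "Tera1", "Tetra", "Brownie1", "trueG", "ckage1")
roots <- c("Bestman", "Tetra", "trueG")

# Two-way split: ifelse() is enough here because there are only two outcomes
type <- ifelse(nodes %in% roots, "root", "branch")

# Multi-way split: fill in the first matching root, leave the rest as NA
group <- rep(NA_character_, length(nodes))
for (r in roots) {
  hit <- is.na(group) & grepl(r, nodes, fixed = TRUE)
  group[hit] <- r
}

print(data.frame(nodes, type, group))
```

Nodes that match no root stay NA, which mirrors what case_when does when no condition is TRUE.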

R - broom::tidy for binomial GLM- Error in approx(sp$y, sp$x, xout = cutoff) : need at least two non-NA values to interpolate

I'm trying to run a logistic regression model in R to identify auxiliary variables which predict missingness in other variables to run a multiple imputation by chained equations model.
Below, var_1 is the missingness variable, computed from another existing variable (0 = not missing, 1 = missing).
For the model, I'm using a logistic regression to predict missingness (0,1) from other variables, and those which are predictive at the p<0.05 level will be used as auxiliary variables which predict missingness.
I'm aware this is not the method normally used to identify auxiliary variables for MICE but this is what I have been advised to do and cannot deviate from this method.
var_1 <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
var_2 <- c(14, 13, 16, 12, 11, 16, 8, 13, 14, 16, 11, 15, 13, 15, 15, 7, 15, 14, 14, 7, 16, 14, 12, 16, 12, 16, 12, 12, 11, 13, 16, 12, 13, 13, 12, 14, 12, 16, 10, 14, 16, 14, 16, 16, 12, 8, 15, 14, 14, 14, 12, 15, 10, 12, 10, 13, 14, 16, 11, 7, 14, 9, 15, 14, 13, 9, 16)
var_3 <- c(13, 10, 16, 8, 10, 13, 8, 16, 13, 12, 4, 3, 8, 11, 13, 8, 8, 13, 10, 9, 16, 13, 1, 14, 12, 14, 12, 10, 12, 11, 16, 9, 9, 5, 7, 14, 15, 16, 10, 8, 16, 12, 12, 7, 13, 4, 16, 13, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
var_4 <- c(16, 16, 12, 16, 13, 17, 15, 19, 12, 18, 15, 19, 15, NA, 17, 11, 10, 13, NA, 11, 18, 18, 11, 12, 11, 19, 10, 15, 17, 17, NA, 17, 15, 15, 17, 18, 15, 14, 11, 13, 14, 15, 20, 16, 12, 11, 17, 16, 11, 15, 15, 15, 11, 11, 9, 13, NA, 12, 13, 14, 13, 15, 19, 15, 15, 15, 16)
df <- data.frame(var_1, var_2, var_3, var_4)  # glm() expects a data frame, not the matrix cbind() returns
lm_1 <- glm(var_1 ~ var_2, data = df, family = binomial())
broom::tidy(lm_1, conf.int = TRUE)
lm_2 <- glm(var_1 ~ var_3, data = df, family = binomial())
broom::tidy(lm_2, conf.int = TRUE)
lm_3 <- glm(var_1 ~ var_4, data = df, family = binomial())
broom::tidy(lm_3, conf.int = TRUE)
All glm functions will compute, but for broom::tidy, lm_1 works, lm_2 doesn't, and lm_3 works (lm_3 was given as an example to show the model can handle some NA values).
I've figured out that this is most likely because var_1 is 0 up until the 49th value and 1 from the 49th onwards, except for the 56th, which is 0, whereas var_3 is NA from the 50th onwards, except for the 57th, which is 0. This means that the model cannot compute based on only 1 non-zero, non-NA value.
Is there an alternative model, method, package or function which I can utilise which will help me achieve my desired outcome? I can see that swapping the variables to 1 = not missing, 0 = missing will probably work - will this provide me with the correct result? I realise I'm probably overthinking this, I have been trying to deal with it a long time.
Thanks for taking the time to read this, I'd love any advice you might have!
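One way to check the diagnosis above is to tabulate the outcome among the rows that survive NA removal. The sketch below uses a hypothetical miniature of the var_1 / var_3 pattern, not the full vectors; outcome, predictor and complete are illustrative names. If only one outcome class remains among the complete cases, the fitted model has no variation in missingness to explain, which may be why the confidence-interval step fails.

```r
# Hypothetical miniature of the pattern described above (not the full data)
outcome   <- c(0, 0, 0, 0, 1, 1, 0, 1)        # 1 = source variable was missing
predictor <- c(13, 10, 16, 8, NA, NA, 0, NA)  # NA exactly where outcome is 1

# Rows that remain once incomplete cases are dropped (R's default na.omit)
complete <- !is.na(outcome) & !is.na(predictor)

# Count each outcome class among the rows that would enter the model
outcome_in_model <- table(factor(outcome[complete], levels = c(0, 1)))
print(outcome_in_model)
```

Running the same tabulation on the real var_1 and var_3 would show how many 1s are actually available to the second model.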

I want to calculate a formula in R

I have a dataset that starts like this:
In dput it is
structure(list(20, TRUE, c(0, 0, 1, 1, 1, 1, 2, 3, 4, 4, 4, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7), c(8, 1, 0, 8, 9, 5,
8, 10, 10, 5, 7, 4, 11, 12, 6, 13, 14, 15, 16, 17, 18, 4, 5,
19, 4, 17), c(1, 0, 2, 5, 3, 4, 6, 7, 9, 10, 8, 11, 14, 12, 13,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25), c(2, 1, 11, 21,
24, 5, 9, 22, 14, 10, 0, 3, 6, 4, 7, 8, 12, 13, 15, 16, 17, 18,
19, 25, 20, 23), c(0, 2, 6, 7, 8, 11, 21, 24, 26, 26, 26, 26,
26, 26, 26, 26, 26, 26, 26, 26, 26), c(0, 1, 2, 2, 2, 5, 8, 9,
10, 13, 14, 16, 17, 18, 19, 20, 21, 22, 24, 25, 26), list(c(1,
0, 1), structure(list(), names = character(0)), list(name = c("1",
"3", "5", "6", "8", "9", "12", "19", "2", "4", "7", "10", "11",
"14", "15", "16", "17", "18", "20", "13")), list(`Number of messages` = c(157,
1058, 2481, 833, 178, 119, 66, 222, 20, 343, 3, 4991, 47, 11,
83, 26, 10, 19, 33, 84, 51, 589, 79, 37, 110, 55))), <environment>), class = "igraph")
So far I have the following lines of code:
Datensatz <- read_xlsx("...")
Netzwerkgraph <- graph.data.frame(Datensatz[,1:3], directed = TRUE)
actors<-Datensatz$From
relations<-Datensatz$To
weight<-Datensatz$`Number of messages`
How can I calculate the following formula in R with my data set?
I've tried the following code
Function <- function(i,j,x,y,z){
i <- actors
j <- relations
w <- weight
for(i in 1:20)
print (-1/(cumsum 1:length(actors, i)(w,i+1))logb(x,base=2)*1/(cumsum 1:length(actors, i)*w,i+1))
}
It isn't entirely clear how you wish to apply the given formula to your example data set, that is, exactly what inputs you are using and what outputs you wish to achieve. Hence, it also isn't clear if the following approach will be sufficient for your purposes. Here is my interpretation thus far.
If one interprets each unique value in the "from" column as being a node i, then it appears that you wish to calculate the sum of messages to each j in the "to" column for each sender i in the "from" column. One approach might then be to calculate all such sums by sender first and then run them all through a simple function that accepts the sum along with some lambda constant.
I used a lambda value of "2" below arbitrarily for illustrative purposes. Additionally, while the formula references a time t, there does not appear to be a time component in your example data set; time isn't represented in this approach. The output would presumably represent the expression for each node at a single point in time.
#written in R version 4.2.1
require(data.table)
##Example data frame
df = data.frame(from = c(1,1,3,3,3), to = c(2,3,1,2,4),nm = c(157,1058,2481,833,178))
df = data.table(df)
df
from to nm
1: 1 2 157
2: 1 3 1058
3: 3 1 2481
4: 3 2 833
5: 3 4 178
##Calculate the sum of messages by sender in "from" column
nf = df[,sum(nm), by = from]
colnames(nf) = c("from","message_total")
nf
from message_total
1: 1 1215
2: 3 3492
## Function
## inputs to function are the total number of messages of a sender in
## "from" column (called cit) and some lambda constant
icit = function(cit,lambda = 2){
-(1/(cit + lambda))*log(1/((cit + lambda)), base = 2)
}
##Find vector of values for each sender in the data set
ans = NULL
for(i in 1:dim(nf)[1]){
ans[i] = icit(nf$message_total[i])
}
ans
[1] 0.008421622 0.003368822
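Because the arithmetic inside icit is vectorised, the explicit loop is optional. A base R sketch of the same calculation without data.table, reusing the example values above (tapply is swapped in for the data.table grouping and is not the answer's original method; lambda = 2 is still an arbitrary choice):

```r
# Same example edge list as above
df <- data.frame(
  from = c(1, 1, 3, 3, 3),
  to   = c(2, 3, 1, 2, 4),
  nm   = c(157, 1058, 2481, 833, 178)
)

# Total number of messages per sender
message_total <- tapply(df$nm, df$from, sum)

# -p * log2(p) with p = 1 / (cit + lambda); works on whole vectors
icit <- function(cit, lambda = 2) {
  p <- 1 / (cit + lambda)
  -p * log2(p)
}

# One call covers every sender; names are the sender ids
ans <- icit(message_total)
print(ans)
```

The result is a named vector, so each value can be looked up by sender, for example ans["3"].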

How to get the true node value in igraph

So I have read network data into igraph (R) and would like to store the nodes in a list. Here's what I have done:
G = read_graph("somegraph.graphml",format="graphml")
x = list(V(G))
> x
+ 15/15 vertices, from ecb3920:
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
My question is: how do I get the true value, i.e. the actual node ID in my data, from V(G)? Thanks.
> dput(G)
structure(list(15, FALSE, c(13, 7, 9, 14, 10, 5, 4, 11, 6, 7,
14, 4, 13, 9, 10, 5, 5, 13, 9, 6, 7, 14, 12, 10, 14, 10, 11,
13, 9, 10, 12, 14, 8, 7, 11, 12, 8, 13, 14, 9, 11, 13, 13, 12,
14, 10, 13, 12, 14, 12, 13, 13, 14, 14), c(0, 0, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6,
6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9, 10,
10, 10, 11, 11, 12, 12, 13), c(6, 11, 5, 15, 16, 8, 19, 1, 9,
20, 33, 32, 36, 2, 13, 18, 28, 39, 4, 14, 23, 25, 29, 45, 7,
26, 34, 40, 22, 30, 35, 43, 47, 49, 0, 12, 17, 27, 37, 41, 42,
46, 50, 51, 3, 10, 21, 24, 31, 38, 44, 48, 52, 53), c(1, 0, 6,
5, 2, 4, 3, 11, 15, 8, 9, 13, 14, 7, 12, 10, 16, 19, 20, 18,
23, 22, 17, 21, 25, 24, 33, 32, 28, 29, 26, 30, 27, 31, 36, 39,
34, 35, 37, 38, 40, 41, 45, 43, 42, 44, 47, 46, 48, 49, 50, 51,
52, 53), c(0, 0, 0, 0, 0, 2, 5, 7, 11, 13, 18, 24, 28, 34, 44,
54), c(0, 2, 2, 7, 16, 24, 26, 34, 40, 42, 46, 49, 51, 53, 54,
54), list(c(1, 0, 1), structure(list(), .Names = character(0)),
structure(list(id = c("1351920706", "500102244", "1454425532",
"1625050630", "510838353", "1262640078", "681721364", "1351920717",
"1260750116", "1524975171", "1070293410", "727198538", "715215233",
"1351920666", "500920034")), .Names = "id"), list()), <environment>), class = "igraph")
Just for closure (and to summarise from our chat): based on the sample data you give, you can extract additional data for every vertex by accessing the corresponding vertex attribute.
So
V(G)$id
returns
#[1] "1351920706" "500102244" "1454425532" "1625050630" "510838353"
#[6] "1262640078" "681721364" "1351920717" "1260750116" "1524975171"
#[11] "1070293410" "727198538" "715215233" "1351920666" "500920034"

sorting columns from lowest to highest values (i.e. 1, 2, 3 etc, not 1, 10, 11...2, 20, 21... etc)

I have a dataset with 50 thousand rows that I want to sort according to the values in one of the columns. The numbers in the column go from 1 to 30, and when I do the following
data=data[order(data$columnname),]
it gets sorted so that the order of the values is like this
1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 3, 30, 4, 5, 6, 7, 8, 9
how could I sort it so that it is like this
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
It seems that your column is not numeric. Try this:
data$columnname<-as.numeric(data$columnname)
data=data[order(data$columnname),]
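The ordering in the question is what order() produces for character data, which is compared character by character. A short sketch with made-up values follows. It also covers a case the answer does not mention, and which is only an assumption about the asker's data: if the column is a factor, as.numeric() returns the internal level codes, so the values need to go through as.character() first.

```r
# Made-up values stored as text, as the symptom in the question suggests
data <- data.frame(columnname = c("10", "2", "1", "30", "3"),
                   stringsAsFactors = FALSE)

# Character sort: "10" comes before "2"
text_sorted <- data$columnname[order(data$columnname)]

# Numeric sort after conversion
data$columnname <- as.numeric(data$columnname)
num_sorted <- data$columnname[order(data$columnname)]

# Factor pitfall: as.numeric() on a factor gives level codes, not the values
f <- factor(c("10", "2", "1", "30", "3"))
codes  <- as.numeric(f)
values <- as.numeric(as.character(f))

print(text_sorted)
print(num_sorted)
```

Checking class(data$columnname) before converting shows which of the two cases applies.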
