I have a data frame "forum" that basically looks like this:
post-id: 1, 2, 3, 4, 5, ...
user-id: 1, 1, 2, 3, 4, ...
subforum-id: 1, 1, 1, 2, 3, ...
Now I'm trying to create a new data frame that looks like this:
subforum-id: 1, 2, 3, ...
number-of-users-that-posted-only-once-to-this-subforum: ...
number-of-users-that-posted-more-than-n-times-to-this-subforum: ...
Is there any way to do that without pre-fabricating all the counts?
Using plyr and summarise:
library(plyr)
N <- 1
ddply(DF, .(subforum.id), summarise,
      once = sum(table(user.id) == 1),
      n.times = sum(table(user.id) > N))
#   subforum.id once n.times
# 1           1    1       1
# 2           2    1       0
# 3           3    1       0
This is the data.frame DF:
DF <- structure(list(post.id = 1:5, user.id = c(1, 1, 2, 3, 4),
subforum.id = c(1, 1, 1, 2, 3)),
.Names = c("post.id", "user.id", "subforum.id"),
row.names = c(NA, -5L), class = "data.frame")
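For comparison, here is a dplyr sketch of the same counts (my addition, assuming dplyr is available): it first counts each user's posts per subforum, then tallies the users.
library(dplyr)
N <- 1  # same threshold as above
DF %>%
  count(subforum.id, user.id) %>%   # posts per user within each subforum
  group_by(subforum.id) %>%
  summarise(once = sum(n == 1), n.times = sum(n > N))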
Here's a basic idea to get you started: Use table to get a count of user ids by subforum ids and work from there:
> mydf <- structure(list(post.id = c(1, 2, 3, 4, 5), user.id = c(1, 1,
2, 3, 4), subforum.id = c(1, 1, 1, 2, 3)), .Names = c("post.id",
"user.id", "subforum.id"), row.names = c(NA, -5L), class = "data.frame")
> mytable <- with(mydf, table(subforum.id, user.id))
> mytable
           user.id
subforum.id 1 2 3 4
          1 2 1 0 0
          2 0 0 1 0
          3 0 0 0 1
Hint: from there, look at the rowSums function, and think about what happens if you sum over a logical vector.
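To spell the hint out, here is a minimal sketch of that last step, building on mytable above (N plays the same role as in the plyr answer):
N <- 1
data.frame(subforum.id = rownames(mytable),
           once = rowSums(mytable == 1),
           n.times = rowSums(mytable > N))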
I have my main data in data1:
structure(list(participant = c("DB", "DB", "DB", "TW", "TW",
"CF", "CF", "JH", "JH", "JH"), timepoint = c(1, 2, 3, 1, 2, 1,
2, 1, 2, 3), score = c(7, 8, 8, NA, 9, 9, 8, 10, 10, 10)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
and I have a list of ids in data2:
structure(list(participant = c("DB", "CF")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -2L))
I would like to add a new column to data1: a binary variable (new_dummy) that equals 1 if the participant is in data2 and 0 if the participant is not in data2. The result should look like this:
structure(list(participant = c("DB", "DB", "DB", "TW", "TW",
"CF", "CF", "JH", "JH", "JH"), timepoint = c(1, 2, 3, 1, 2, 1,
2, 1, 2, 3), score = c(7, 8, 8, NA, 9, 9, 8, 10, 10, 10), new_dummy = c(1,
1, 1, 0, 0, 1, 1, 0, 0, 0)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L))
Here is a base R solution:
data1$new_dummy <- as.numeric(data1$participant %in% data2$participant)
For each participant in data1, this checks whether it also appears in the participant column of data2. The %in% comparison returns a logical vector of TRUE and FALSE values; converting it to numeric turns those into 1 and 0. The resulting vector is then assigned to the new column.
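Seen in isolation, the logical-to-numeric coercion looks like this (a toy example):
as.numeric(c(TRUE, FALSE, TRUE))
# [1] 1 0 1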
A dplyr solution
library(tidyverse)
data1 %>%
  mutate(new_dummy = case_when(participant %in% data2$participant ~ 1,
                               TRUE ~ 0))
# A tibble: 10 x 4
   participant timepoint score new_dummy
   <chr>           <dbl> <dbl>     <dbl>
 1 DB                  1     7         1
 2 DB                  2     8         1
 3 DB                  3     8         1
 4 TW                  1    NA         0
 5 TW                  2     9         0
 6 CF                  1     9         1
 7 CF                  2     8         1
 8 JH                  1    10         0
 9 JH                  2    10         0
10 JH                  3    10         0
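If you prefer to avoid case_when, the same dummy can be built directly inside mutate, mirroring the base R answer (a minimal sketch):
data1 %>%
  mutate(new_dummy = as.numeric(participant %in% data2$participant))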
I have a dataset like this:
data <- data.frame(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                   year = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
                   score = c(0.89943475, -3.51761975, 1.54511640, -1.38284380, 2.45591240,
                             -1.89925250, 0.83935451, -0.61843636, -0.70421765))
ID, year, score
1, 1, 0.89943475
1, 2, -3.51761975
1, 3, 1.54511640
1, 4, -1.38284380
1, 5, 2.45591240
2, 1, -1.89925250
2, 2, 0.83935451
2, 3, -0.61843636
2, 4, -0.70421765
I want to create a data table which aggregates the above data and counts the number of observations for an ID when score is positive and negative, like this:
ID, pos, neg, total
1, 3, 2, 5
2, 1, 3, 4
Is this possible to do using data.table in R?
An alternative to akrun's answer:
data[, .(pos = sum(score >= 0), neg = sum(score < 0), total = .N), by = ID]
#       ID   pos   neg total
#    <num> <int> <int> <int>
# 1:     1     3     2     5
# 2:     2     1     3     4
Data
data <- setDT(structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2), year = c(1, 2, 3, 4, 5, 1, 2, 3, 4), score = c(0.89943475, -3.51761975, 1.5451164, -1.3828438, 2.4559124, -1.8992525, 0.83935451, -0.61843636, -0.70421765)), class = c("data.table", "data.frame"), row.names = c(NA, -9L)))
We could use dcast with sign
library(data.table)
dcast(setDT(data), ID ~ sign(score), fun.aggregate = length)[,
total := rowSums(.SD), .SDcols = -1][]
Output:
   ID -1 1 total
1:  1  2 3     5
2:  2  3 1     4
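As a small follow-up, if you want the question's pos/neg column names, the sign columns can be renamed after the dcast (out is just a name introduced for this sketch):
out <- dcast(setDT(data), ID ~ sign(score), fun.aggregate = length)[,
  total := rowSums(.SD), .SDcols = -1][]
setnames(out, c("-1", "1"), c("neg", "pos"))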
I'm trying to compute ICC values for each subject for the table below, but group_by() is not working as I think it should.
  SubID Rate1 Rate2
1     1     2     5
2     1     2     4
3     1     2     5
4     2     3     4
5     2     4     1
6     2     5     1
7     2     2     2
8     3     2     5
9     3     3     5
The code I am running is as follows:
df %>%
  group_by(SubID) %>%
  summarise(icc = DescTools::ICC(.)$results[3, 2])
and the output:
# A tibble: 3 x 2
  SubID    icc
  <dbl>  <dbl>
1     1 -0.247
2     2 -0.247
3     3 -0.247
It seems that summarise is not being applied according to groups, but to the entire dataset. I'm not sure what is going on.
Here is the dput() of the data:
structure(list(SubID = c(1, 1, 1, 2, 2, 2, 2, 3, 3), Rate1 = c(2,
2, 2, 3, 4, 5, 2, 2, 3), Rate2 = c(5, 4, 5, 4, 1, 1, 2, 5, 5)), class = "data.frame", row.names = c(NA,
-9L))
I'm not terribly familiar with DescTools, but here is a potential solution that uses a nest() / map() combo:
library(DescTools)
library(tidyverse)
df <- structure(
list(SubID = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
Rate1 = c(2, 2, 2, 3, 4, 5, 2, 2, 3),
Rate2 = c(5, 4, 5, 4, 1, 1, 2, 5, 5)),
class = "data.frame", row.names = c(NA, -9L)
)
df %>%
  nest(ICC3 = -SubID) %>%
  mutate(ICC3 = map_dbl(ICC3, ~ ICC(.x)[["results"]] %>%
                          filter(type == "ICC3") %>%
                          pull(est)))
#> # A tibble: 3 x 2
#>   SubID      ICC3
#>   <dbl>     <dbl>
#> 1     1  2.83e-15
#> 2     2 -5.45e- 1
#> 3     3 -6.66e-16
Created on 2021-03-08 by the reprex package (v0.3.0)
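If you would rather stay inside summarise(), note that the bare "." in a magrittr pipe refers to the whole data frame rather than the current group, which is why every group got the same ICC. A sketch using cur_data() instead (assumes dplyr >= 1.0, where cur_data() hands each group's rating columns to ICC()):
df %>%
  group_by(SubID) %>%
  summarise(icc = DescTools::ICC(cur_data())$results[3, 2])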
Let's say I have a list
test <- list(c(1, 2, 3), c(2, 4, 6), c(1, 5, 10), c(1, 2, 3), c(1, 5, 10), c(1, 2, 3))
and I need to count how many times each of these vectors occurs, so the desired output should look like:
Category Count
1, 2, 3 3
2, 4, 6 1
1, 5, 10 2
Is there any simple way in R how to achieve this?
You can just paste and use table, i.e.
as.data.frame(table(sapply(test, paste, collapse = ' ')))
which gives,
    Var1 Freq
1  1 2 3    3
2 1 5 10    2
3  2 4 6    1
The function unique() can work on a list. For counting one can use identical():
test <- list(c(1, 2, 3), c(2, 4, 6), c(1, 5, 10), c(1, 2, 3), c(1, 5, 10), c(1, 2, 3))
Lcount <- function(xx, L) sum(sapply(L, identical, y=xx))
sapply(unique(test), FUN=Lcount, L=test)
unique(test)
The result as data.frame:
result <- data.frame(
Set=sapply(unique(test), FUN=paste0, collapse=','),
count= sapply(unique(test), FUN=Lcount, L=test)
)
result
# > result
# Set count
# 1 1,2,3 3
# 2 2,4,6 1
# 3 1,5,10 2
The task is to efficiently extract events from this data:
data <- structure(
list(i = c(1, 1, 1, 2, 2, 2), t = c(1, 2, 3, 1, 3, 4), x = c(1, 1, 2, 1, 2, 3)),
.Names = c("i", "t", "x"), row.names = c(NA, -6L), class = "data.frame"
)
> data
  i t x
1 1 1 1
2 1 2 1
3 1 3 2
4 2 1 1
5 2 3 2
6 2 4 3
Let's call i the fact, t the time, and x the number of selections of fact i at time t.
An event is an uninterrupted sequence of selections of one fact. Fact 1 is selected all throughout t=1 to t=3 with a sum of 4 selections. But fact 2 is split into two events, the first from t=1 to t=1 (sum=1) and the second from t=3 to t=4 (sum=5). Therefore, the event data frame is supposed to look like this:
> event
  i from to sum
1 1    1  3   4
2 2    1  1   1
3 2    3  4   5
This code does what is needed:
event <- structure(
list(i = logical(0), from = logical(0), to = logical(0), sum = logical(0)),
.Names = c("i", "from", "to", "sum"), row.names = integer(0),
class = "data.frame"
)
l <- nrow(data) # get rows of data frame
c <- 1 # set counter
d <- 1 # set initial row of data to start with
e <- 1 # set initial row of event to fill
repeat{
  event[e,1] <- data[d,1]              # store "i" in event data frame
  event[e,2] <- data[d,2]              # store "from" in event data frame
  while((data[d+1,1] == data[d,1]) & (data[d+1,2] == data[d,2]+1)){
    c <- c+1
    d <- d+1
    if(d >= l) break
  }
  event[e,3] <- data[d,2]              # store "to" in event data frame
  event[e,4] <- sum(data[(d-c+1):d,3]) # store "sum" in event data frame
  c <- 1
  d <- d+1
  e <- e+1
}
The problem is that this code takes 3 days to extract the events from a data frame with 1 million rows and my data frame has 5 million rows.
How can I make this more efficient?
P.S.: There's also a minor bug in my code related to termination.
P.P.S.: The data is sorted first by i, then by t.
Can you try whether this dplyr implementation is faster?
library(dplyr)
data <- structure(
list(fact = c(1, 1, 1, 2, 2, 2), timing = c(1, 2, 3, 1, 3, 4), x = c(1, 1, 2, 1, 2, 3)),
.Names = c("fact", "timing", "x"), row.names = c(NA, -6L), class = "data.frame"
)
group_by(data, fact) %>%
  mutate(fromto = cumsum(c(0, diff(timing) > 1))) %>%
  group_by(fact, fromto) %>%
  summarize(from = min(timing), to = max(timing), sumx = sum(x)) %>%
  select(-fromto) %>%
  ungroup()
How about this data.table implementation?
library(data.table)
data <- structure(
list(fact = c(1, 1, 1, 2, 2, 2), timing = c(1, 2, 3, 1, 3, 4), x = c(1, 1, 2, 1, 2, 3)),
.Names = c("fact", "timing", "x"), row.names = c(NA, -6L), class = "data.frame"
)
setDT(data)[, fromto := cumsum(c(0, diff(timing) > 1)), by = fact]
event <- data[, .(from = min(timing), to = max(timing), sumx = sum(x)),
              by = c("fact", "fromto")][, fromto := NULL]
## Results when I print event in the R console (data.table package version 1.9.6):
> event
   fact from to sumx
1:    1    1  3    4
2:    2    1  1    1
3:    2    3  4    5
> str(event)
Classes ‘data.table’ and 'data.frame': 3 obs. of 4 variables:
$ fact: num 1 2 2
$ from: num 1 1 3
$ to : num 3 1 4
$ sumx: num 4 1 5
- attr(*, ".internal.selfref")=<externalptr>
> dput(event)
structure(list(fact = c(1, 2, 2), from = c(1, 1, 3), to = c(3,
1, 4), sumx = c(4, 1, 5)), row.names = c(NA, -3L), class = c("data.table",
"data.frame"), .Names = c("fact", "from", "to", "sumx"), .internal.selfref = <pointer: 0x0000000000120788>)
Reference
detect intervals of the consequent integer sequences
Assuming the data frame is sorted by t within each i (as the question states), you can try something like this:
event <- NULL
for (i in unique(data$i)) {
  x <- data[data$i == i, ]
  ev <- cumsum(c(1, diff(x$t)) > 1)
  smry <- lapply(split(x, ev), function(z) c(i, range(z$t), sum(z$x)))
  event <- c(event, smry)
}
event <- do.call(rbind, event)
rownames(event) <- NULL
colnames(event) <- c('i', 'from', 'to', 'sum')
The result is a matrix, not a data frame.
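If a data frame is needed downstream, the matrix can simply be converted afterwards:
event <- as.data.frame(event)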