Manipulating large dataset with dcast

Manipulating large dataset with dcast - r

Apologies if this is a repeat question but I could not find the specific answer I am looking for. I have a dataframe with counts of different species caught on a given trip. A simplified example with 5 trips and 4 species is below:
trip = c(1,1,1,2,2,3,3,3,3,4,5,5)
species = c("a","b","c","b","d","a","b","c","d","c","c","d")
count = c(5,7,3,1,8,10,1,4,3,1,2,10)
dat = cbind.data.frame(trip, species, count)
dat
> dat
trip species count
1 1 a 5
2 1 b 7
3 1 c 3
4 2 b 1
5 2 d 8
6 3 a 10
7 3 b 1
8 3 c 4
9 3 d 3
10 4 c 1
11 5 c 2
12 5 d 10
I am only interested in the counts of species b for each trip. So I want to manipulate this data frame so I end up with one that looks like this:
trip2 = c(1,2,3,4,5)
species2 = c("b","b","b","b","b")
count2 = c(7,1,1,0,0)
dat2 = cbind.data.frame(trip2, species2, count2)
dat2
> dat2
trip2 species2 count2
1 1 b 7
2 2 b 1
3 3 b 1
4 4 b 0
5 5 b 0
I want to keep all trips, including trips where species b was not observed. So I can't just subset the data by species b. I know I can cast the data so species are the columns and then just remove the columns for the other species like so:
library(dplyr)
library(reshape2)
test = dcast(dat, trip ~ species, value.var = "count", fun.aggregate = sum)
test
> test
trip a b c d
1 1 5 7 3 0
2 2 0 1 0 8
3 3 10 1 4 3
4 4 0 0 1 0
5 5 0 0 2 10
However, my real dataset has several hundred species caught on thousands of trips, and if I try to cast that many species to columns R chokes. There are way too many columns. Is there a way to specify in dcast that I only want to cast species b? Or is there another way to do this that doesn't require casting the data? Thank you.

Here is a data.table approach which I suspect will be very fast for you:
library(data.table)
setDT(dat)
result <- dat[,.(species = "b", count = sum(.SD[species == "b",count])),by = trip]
result
trip species count
1: 1 b 7
2: 2 b 1
3: 3 b 1
4: 4 b 0
5: 5 b 0

We can use tidyverse
library(dplyr)
library(tidyr)
dat %>%
filter(species == 'b') %>%
group_by(trip, species) %>%
summarise(count = sum(count)) %>%
ungroup %>%
complete(trip = unique(dat$trip), fill = list(species = 'b', count = 0))
# A tibble: 5 x 3
# trip species count
# <dbl> <chr> <dbl>
#1 1 b 7
#2 2 b 1
#3 3 b 1
#4 4 b 0
#5 5 b 0

Related

How to create new column of repeating sequence based on other column

I have a the following dataframe:
Participant_ID Order
1 A
1 A
2 B
2 B
3 A
3 A
4 B
4 B
5 B
5 B
6 A
6 A
Every two rows refer to the same participant. I want to create a new column based on the value in the column 'Order'. If the 'Order' == A, then I want it to create a new column with two rows of [1, 2], and then if the 'Order' == B, then I want it to create two rows of [2,1] in the same column
The preferred output would be the following:
Participant_ID Order Period
1 A 1
1 A 2
2 B 2
2 B 1
3 A 1
3 A 2
4 B 2
4 B 1
5 B 2
5 B 1
6 A 1
6 A 2
Any help would be appreciated

Here are a couple of possibilities. This assumes that Order value is same for a given Participant_ID. If this isn't the case, you will need to include additional logic.
You can use if_else:
library(tidyverse)
df %>%
group_by(Participant_ID) %>%
mutate(Period = if_else(Order == "A", 1:2, 2:1))
Or to explicitly check for multiple different values (e.g., "A", "B", etc.), have more flexibility, and include NA for other cases, you can use case_when:
df %>%
group_by(Participant_ID) %>%
mutate(Period = case_when(
Order == "A" ~ 1:2,
Order == "B" ~ 2:1,
TRUE ~ NA_integer_
))
Output
Participant_ID Order Period
<int> <chr> <int>
1 1 A 1
2 1 A 2
3 2 B 2
4 2 B 1
5 3 A 1
6 3 A 2
7 4 B 2
8 4 B 1
9 5 B 2
10 5 B 1
11 6 A 1
12 6 A 2

In R, take sum of multiple variables if combination of values in two other columns are unique

I am trying to expand on the answer to this problem that was solved, Take Sum of a Variable if Combination of Values in Two Other Columns are Unique
but because I am new to stack overflow, I can't comment directly on that post so here is my problem:
I have a dataset like the following but with about 100 columns of binary data as shown in "ani1" and "bni2" columns.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum all the columns based on the location and season, but I want to simplify so I get a total column for column #3 and after for each unique combination of location and season.
The problem is not all the columns have a 1 value for every combination of location and season and they all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!

You can use dplyr::summarise and across after group_by.
library(dplyr)
df %>%
group_by(Locations, seasons) %>%
summarise(across(starts_with("ani"), ~sum(.x, na.rm = TRUE))) %>%
ungroup()
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons ani1 ani2
<chr> <int> <int> <int>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2

Extracting group dependent results from a dataframe

I have a dataframe made from different groups, and for each group real and predicted values. I want to extract values of tests on these values :
library(dplyr)
d = data.frame(group = c(rep(5,x="a"),rep(5,x="b")), real = c(rep(2, x=1:5)), pred = c(2,1,3,4,5,1,2,4,3,5))
group real pred
1 a 1 2
2 a 2 1
3 a 3 3
4 a 4 4
5 a 5 5
6 b 1 1
7 b 2 2
8 b 3 4
9 b 4 3
10 b 5 5
d <- d %>% group_by(group) %>% mutate( sg = ifelse(real == 1 & real == pred, 1, 0))
d <- d %>% group_by(group) %>% mutate( sp = ifelse(real <= 3 & pred <= 3, 1, 0))
d %>% distinct(sg, sp)
sg sp group
1 0 1 a
2 0 0 a
3 1 1 b
4 0 1 b
5 0 0 b
But I want something like this (only 1 result per group)
sg sp group
1 0 1 a
3 1 1 b
I am pretty sure dplyr, data.table or tidyr can do something but I cannot find how.

If it is always the first row of each group that you want to extract, you could use the do function:
d %>% do(.[1,])
Another option is to use the filter function like this:
d %>% filter(seq_along(sp) == 1)

How to Index subjects using R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 7 years ago.
I am working in R and I have a Data set that has multiple entries for each subject. I want to create an index variable that indexes by subject. For example:
Subject Index
1 A 1
2 A 2
3 B 1
4 C 1
5 C 2
6 C 3
7 D 1
8 D 2
9 E 1
The first A entry is indexed as 1, while the second A entry is indexed as 2. The first B entry is indexed as 1, etc.
Any help would be excellent!

Here.s a quick data.table aproach
library(data.table)
setDT(df)[, Index := seq_len(.N), by = Subject][]
# Subject Index
# 1: A 1
# 2: A 2
# 3: B 1
# 4: C 1
# 5: C 2
# 6: C 3
# 7: D 1
# 8: D 2
# 9: E 1
Or with base R
with(df, ave(as.numeric(Subject), Subject, FUN = seq_along))
## [1] 1 2 1 1 2 3 1 2 1
Or with dplyr (don't run this on a data.table class)
library(dplyr)
df %>%
group_by(Subject) %>%
mutate(Index = row_number())

Using dplyr
library(dplyr)
df %>% group_by(Subject) %>% mutate(Index = 1:n())
You get:
#Source: local data frame [9 x 2]
#Groups: Subject
#
# Subject Index
#1 A 1
#2 A 2
#3 B 1
#4 C 1
#5 C 2
#6 C 3
#7 D 1
#8 D 2
#9 E 1

R, dplyr: cumulative version of n_distinct

I have a dataframe as follows. It is ordered by column time.
Input -
df = data.frame(time = 1:20,
grp = sort(rep(1:5,4)),
var1 = rep(c('A','B'),10)
)
head(df,10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable var2 which computes no of distinct var1 values so far i.e. until that point in time for each group grp . This is a little different from what I'd get if I were to use n_distinct.
Expected output -
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function say cum_n_distinct for this and use it as -
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))

A dplyr solution inspired from #akrun's answer -
Ths logic is basically to set 1st occurrence of each unique values of var1 to 1 and rest to 0 for each group grp and then apply cumsum on it -
df = df %>%
arrange(time) %>%
group_by(grp,var1) %>%
mutate(var_temp = ifelse(row_number()==1,1,0)) %>%
group_by(grp) %>%
mutate(var2 = cumsum(var_temp)) %>%
select(-var_temp)
head(df,10)
Source: local data frame [10 x 4]
Groups: grp
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2

Assuming stuff is ordered by time already, first define a cumulative distinct function:
dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
Then a base solution that uses ave to create groups (note, assumes var1 is factor), and then applies our function to each group:
transform(df, var2=ave(as.integer(var1), grp, FUN=dist_cum))
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))

Try:
Update
With your new dataset, an approach in base R
df$var2 <- unlist(lapply(split(df, df$grp),
function(x) {x$var2 <-0
indx <- match(unique(x$var1), x$var1)
x$var2[indx] <- 1
cumsum(x$var2) }))
head(df,7)
# time grp var1 var2
# 1 1 1 A 1
# 2 2 1 B 2
# 3 3 1 A 2
# 4 4 1 B 2
# 5 5 2 A 1
# 6 6 2 B 2
# 7 7 2 A 2

Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE){
# Given a vector x, returns a corresponding vector y
# where the ith element of y gives the number of unique
# elements observed up to and including index i
# if na.include = TRUE (default) NA is counted as an
# additional unique element, otherwise it's essentially ignored
temp <- data.table(x, idx = seq_along(x))
firsts <- temp[temp[, .I[1L], by = x]$V1]
if(na.include == FALSE) firsts <- firsts[!is.na(x)]
y <- rep(0, times = length(x))
y[firsts$idx] <- 1
y <- cumsum(y)
return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Manipulating large dataset with dcast - r

Here is a data.table approach which I suspect will be very fast for you: library(data.table) setDT(dat) result <- dat[,.(species = "b", count = sum(.SD[species == "b",count])),by = trip] result trip species count 1: 1 b 7 2: 2 b 1 3: 3 b 1 4: 4 b 0 5: 5 b 0

Related

How to create new column of repeating sequence based on other column

In R, take sum of multiple variables if combination of values in two other columns are unique

Extracting group dependent results from a dataframe

How to Index subjects using R [duplicate]

R, dplyr: cumulative version of n_distinct

Categories

Resources