Counting occurrences and occurrences which do not appear - r

I have a dataframe which looks like this:
head(df)
id id_child
1 1
1 2
1 3
2 1
4 1
4 2
I would like to create a variable which counts the number of children per parent. So I would like something like this:
head(nb_chilren)
id id_child
1 3
2 1
3 0
4 2
If possible, I would like that the person 3 is indicated as having 0 child even though she does not exist in the first frame.
Note: ids are sequential, in real data they are 1 to 10628.
Any suggestions? I suppose I must use the split() function, but I really do not know how to use it.

Convert id to factor with levels from minimum id value to maximum.
df$id <- factor(df$id, levels = min(df$id):max(df$id))
You can then use table in base R :
stack(table(df$id))[2:1]
Or count in dplyr :
library(dplyr)
df %>% count(id, .drop = FALSE)
# id n
#1 1 3
#2 2 1
#3 3 0
#4 4 2

One dplyr option could be:
df %>%
group_by(id = factor(id, levels = min(id):max(id)), .drop = FALSE) %>%
summarise(id_child = n_distinct(id_child))
id id_child
<fct> <int>
1 1 3
2 2 1
3 3 0
4 4 2

Here is a solution with table
table(factor(df[[1]], levels = Reduce(':', range(df[[1]]))))
#1 2 3 4
#3 1 0 2
In data.frame format:
tbl <- table(id = factor(df[[1]], levels = Reduce(':', range(df[[1]]))))
as.data.frame(tbl)
# id Freq
#1 1 3
#2 2 1
#3 3 0
#4 4 2

Related

How to use a for loop to changed consecutive values in R?

How can I run a loop over multiple columns changing consecutive values to true values?
For example, if I have a dataframe like this...
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
I want to show the binned values...
Time Value Bin Subject_ID
1 6 1 1
2 4 2 1
4 8 3 1
1 2 4 1
Is there a way to do it in a loop?
I tried this code...
for (row in 2:nrow(df)) {
if(df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
df[row,1:2] = df[row,1:2] - df[row - 1,1:2]
}
}
But the code changed it line by line and did not give the correct values for each bin.
If you still insist on using a for loop, you can use the following solution. It's very simple but you have to first create a copy of your data set as your desired output values are the difference of values between rows of the original data set. In order for this to happen we move DF outside of the for loop so the values remain intact, otherwise in every iteration values of DF data set will be replaced with the new values and the final output gives incorrect results:
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")
DF <- df[, c("Time", "Value")]
for(i in 2:nrow(df)) {
df[i, c("Time", "Value")] <- DF[i, ] - DF[i-1, ]
}
df
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
The problem with the code in the question is that after row i is changed the changed row is used in calculating row i+1 rather than the original row i. To fix that run the loop in reverse order. That is use nrow(df):2 in the for statement. Alternately try one of these which do not use any loops and also have the advantage of not overwriting the input -- something which makes the code easier to debug.
1) Base R Use ave to perform Diff by group where Diff uses diff to actually perform the differencing.
Diff <- function(x) c(x[1], diff(x))
transform(df,
Time = ave(Time, Subject_ID, FUN = Diff),
Value = ave(Value, Subject_ID, FUN = Diff))
giving:
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
2) dplyr Using dplyr we write the above except we use lag:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(Time = Time - lag(Time, default = 0),
Value = Value - lag(Value, default = 0)) %>%
ungroup
giving:
# A tibble: 4 x 4
Time Value Bin Subject_ID
<dbl> <dbl> <int> <int>
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
or using across:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(across(Time:Value, ~ .x - lag(.x, default = 0))) %>%
ungroup
Note
Lines <- "Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1"
df <- read.table(text = Lines, header = TRUE)
Here is a base R one-liner with diff in a lapply loop.
df[1:2] <- lapply(df[1.2], function(x) c(x[1], diff(x)))
df
# Time Value Bin Subject_ID
#1 1 1 1 1
#2 2 2 2 1
#3 4 4 3 1
#4 1 1 4 1
Data
df <- read.table(text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
", header = TRUE)
dplyr one liner
library(dplyr)
df %>% mutate(across(c(Time, Value), ~c(first(.), diff(.))))
#> Time Value Bin Subject_ID
#> 1 1 6 1 1
#> 2 2 4 2 1
#> 3 4 8 3 1
#> 4 1 2 4 1

In R, take sum of multiple variables if combination of values in two other columns are unique

I am trying to expand on the answer to this problem that was solved, Take Sum of a Variable if Combination of Values in Two Other Columns are Unique
but because I am new to stack overflow, I can't comment directly on that post so here is my problem:
I have a dataset like the following but with about 100 columns of binary data as shown in "ani1" and "bni2" columns.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum all the columns based on the location and season, but I want to simplify so I get a total column for column #3 and after for each unique combination of location and season.
The problem is not all the columns have a 1 value for every combination of location and season and they all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!
You can use dplyr::summarise and across after group_by.
library(dplyr)
df %>%
group_by(Locations, seasons) %>%
summarise(across(starts_with("ani"), ~sum(.x, na.rm = TRUE))) %>%
ungroup()
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons ani1 ani2
<chr> <int> <int> <int>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2

How to subset by time range?

I would like to create a subset from the following example of data frame. The condition is to select those rows where time column values belong to time range from the minimum time for the certain id till the next lets say one hour.
id time
1 1468696537
1 1468696637
1 1482007490
2 1471902849
2 1471902850
2 1483361074
3 1474207754
3 1474207744
3 1471446673
3 1471446693
And the output should be like this:
id time
1 1468696537
1 1468696637
2 1471902849
2 1471902850
3 1471446673
3 1471446693
Please, help me how to do that?
We can do the following:
library(magrittr);
library(dplyr);
df %>%
group_by(id) %>%
filter(time <= min(time) + 3600)
# id time
# <int> <int>
#1 1 1468696537
#2 1 1468696637
#3 2 1471902849
#4 2 1471902850
#5 3 1471446673
#6 3 1471446693
Explanation: Group entries by id, then filter entries that are within min(time) + 1 hour.
Sample data
df <- read.table(text =
" id time
1 1468696537
1 1468696637
1 1482007490
2 1471902849
2 1471902850
2 1483361074
3 1474207754
3 1474207744
3 1471446673
3 1471446693 ", header = T)

Fill sequence by factor

I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to #PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
Another base R idea can be to split on Country, use setdiff to find the missing values from the seq(max(Year)), and rbind them to original data frame. Use do.call to rbind the list back to a data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(data_frame(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq, to calculate year sequences. Then constructs a data.frame from the named list that is returned, merges this onto the original which adds the desired rows, and finally fills in missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1

Count cells greater than 0 by group

I am trying to count the number of cells greater than 0 by group, and I need help as to how to approach this.
My data looks like this:
Group Number of Shoes
1 0
1 1
1 2
2 0
2 NA
2 1
3 1
3 2
3 2
And I want this:
Group Shoe owners
1 2
2 1
3 3
You can do this simply with:
aggregate(Number.of.Shoes ~ Group, df, function(x) sum(x > 0, na.rm = TRUE))
The result:
Group Number.of.Shoes
1 1 2
2 2 1
3 3 3
This can also be done with dplyr package:
library(dplyr)
df %>%
group_by(Group) %>%
summarise(counts = sum(Number.of.Shoes > 0, na.rm = TRUE))
we can also use data.table
library(data.table)
setDT(df1)[Number.of.Shoes >0, .(Shoe.Owners = .N), Group]
# Group Shoe.Owners
#1: 1 2
#2: 2 1
#3: 3 3

Resources