R - sum multiple separate columns for each unique ID? Aggregate? - r

My dataset features several blocks, each containing several plots. In each plot, three different lifeforms were marked as present/absent (i.e. 1/0):
Block Plot tree bush grass
1 1 0 1 0
1 2 1 1 1
1 3 1 1 1
2 1 0 0 1
2 2 0 0 1
2 3 1 0 1
I'm looking for code that will sum the total number of counts for each distinct lifeform at the block level.
I would like an output that resembles this:
Block tree bush grass
1 2 3 2
2 1 0 3
I have tried this many ways but the only thing that comes close is:
aggregate(df[,3:5], by = list(df$block), FUN = sum)
However, what this actually returns is:
Block tree bush grass
1 7 7 7
2 4 4 4
It appears to be summing all columns together instead of keeping the lifeforms separate.
I feel as though this should be so simple, as there are many queries online about similar processes, but nothing I try has worked.

library(tidyverse)
df %>%
select(-Plot) %>%
pivot_longer(-Block) %>%
group_by(Block, name) %>%
summarise(sum = sum(value)) %>%
pivot_wider(names_from = name, values_from = sum)
# A tibble: 2 × 4
# Groups: Block [2]
Block bush grass tree
<dbl> <dbl> <dbl> <dbl>
1 1 3 2 2
2 2 0 3 1

You were close. Maybe just a typo?
The data frame style
aggregate(df[,3:5], by = list(Block = df$Block), sum)
Block tree bush grass
1 1 2 3 2
2 2 1 0 3
Or a formula style aggregate
aggregate(. ~ Block, df[,-2], sum)
Block tree bush grass
1 1 2 3 2
2 2 1 0 3
With dplyr
library(dplyr)
df %>%
group_by(Block) %>%
summarize(across(tree:grass, sum))
# A tibble: 2 × 4
Block tree bush grass
<int> <int> <int> <int>
1 1 2 3 2
2 2 1 0 3
Data
df <- structure(list(Block = c(1L, 1L, 1L, 2L, 2L, 2L), Plot = c(1L,
2L, 3L, 1L, 2L, 3L), tree = c(0L, 1L, 1L, 0L, 0L, 1L), bush = c(1L,
1L, 1L, 0L, 0L, 0L), grass = c(0L, 1L, 1L, 1L, 1L, 1L)), class =
"data.frame", row.names = c(NA,
-6L))
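As one more base R option, rowsum() adds up rows by group, one column at a time. This is only a sketch: it rebuilds the df from the Data section so that it runs on its own, and res is an illustrative name that does not appear in the answers above.

```r
# Rebuild the example data so the block is self-contained
df <- data.frame(
  Block = c(1L, 1L, 1L, 2L, 2L, 2L),
  Plot  = c(1L, 2L, 3L, 1L, 2L, 3L),
  tree  = c(0L, 1L, 1L, 0L, 0L, 1L),
  bush  = c(1L, 1L, 1L, 0L, 0L, 0L),
  grass = c(0L, 1L, 1L, 1L, 1L, 1L)
)

# rowsum() sums each column separately within each group;
# the group labels end up as row names rather than as a Block column
res <- rowsum(df[, c("tree", "bush", "grass")], group = df$Block)
res
```

Unlike aggregate(), the block identifier is returned in the row names, so it has to be copied into a column if it is needed later.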

Related

Sum column values over a window and report the values of the previous window

I have a data.frame of the following form:
ID Var1
1 1
1 1
1 3
1 4
1 1
1 0
2 2
2 2
2 6
2 7
2 8
2 0
3 0
3 2
3 1
3 3
3 2
3 4
and I would like to get this:
ID Var1 X
1 1 0
1 1 0
1 3 0
1 4 5
1 1 5
1 0 5
2 2 0
2 2 0
2 6 0
2 7 10
2 8 10
2 0 10
3 0 0
3 2 0
3 1 0
3 3 3
3 2 3
3 4 3
In words: I'd like to calculate the sum of the variable in a window of 3 rows, and then report the result obtained in the previous window. This should be done within each ID, so the first three observations of every ID should be returned as 0, as there is no previous window that could be reported.
For understanding: In the actual dataset each row corresponds to one week and the window = 7. So X is supposed to give information on the sum of Var1 in the previous week.
I have tried using rollapply, but it always ended in an error. Also, if I understand correctly, rollapply uses a rolling (overlapping) window, which is specifically not what I need.
Thanks for your answers!
In rollapply, the width argument can be a list which provides the offsets to use. In this case we want to use the points 3, 2 and 1 back for the first point, 4, 3 and 2 back for the second, 5, 4 and 3 back for the third and then recycle. That is, for a window width of k = 3 we would want the following list of offset vectors:
w <- list(-(3:1), -(4:2), -(5:3))
In general we can write w below in terms of the window width k. ave then invokes rollapply with that width list for each ID.
library(zoo)
k <- 3
w <- lapply(1:k, function(x) seq(to = -x, length = k))
transform(DF, X = ave(Var1, ID, FUN = function(x) rollapply(x, w, sum, fill = 0)))
giving:
ID Var1 X
1 1 1 0
2 1 1 0
3 1 3 0
4 1 4 5
5 1 1 5
6 1 0 5
7 2 2 0
8 2 2 0
9 2 6 0
10 2 7 10
11 2 8 10
12 2 0 10
13 3 0 0
14 3 2 0
15 3 1 0
16 3 3 3
17 3 2 3
18 3 4 3
Note
The input DF in reproducible form is:
DF <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Var1 = c(1L, 1L, 3L, 4L, 1L,
0L, 2L, 2L, 6L, 7L, 8L, 0L, 0L, 2L, 1L, 3L, 2L, 4L)),
class = "data.frame", row.names = c(NA, -18L))
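For comparison, the same "sum of the previous non-overlapping window" can be sketched in base R without zoo. The helper name prev_window_sum is hypothetical (it is not part of any package), and the data frame is rebuilt here only so that the block is self-contained.

```r
# Same values as DF above
DF <- data.frame(
  ID   = rep(1:3, each = 6),
  Var1 = c(1, 1, 3, 4, 1, 0, 2, 2, 6, 7, 8, 0, 0, 2, 1, 3, 2, 4)
)

# Hypothetical helper: for each element of x, return the sum of the
# previous block of k elements (0 for the first block)
prev_window_sum <- function(x, k) {
  win  <- (seq_along(x) - 1) %/% k         # block index: 0, 0, 0, 1, 1, 1, ... for k = 3
  sums <- as.vector(tapply(x, win, sum))   # one total per block
  prev <- c(0, head(sums, -1))             # shift the totals forward by one block
  prev[win + 1]                            # expand back to one value per row
}

# Apply the helper separately within each ID
DF$X <- ave(DF$Var1, DF$ID, FUN = function(x) prev_window_sum(x, k = 3))
DF
```

Because the block index is computed by integer division, a trailing block with fewer than k rows still receives the total of the block before it.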
We could group by 'ID', create a second grouping column for windows of size 3 using gl, then summarise by taking the sum of 'Var1' while storing 'Var1' in a list column, take the lag of 'X', and unnest:
library(dplyr) #1.0.0
library(tidyr)
df1 %>%
# // grouping by ID
group_by(ID) %>%
# // create another group added with gl
group_by(grp = as.integer(gl(n(), 3, n())), .add = TRUE) %>%
# // get the sum of Var1, while changing the Var1 in a list
summarise(X = sum(Var1), Var1 = list(Var1)) %>%
# // get the lag of X
mutate(X = lag(X, default = 0)) %>%
# // unnest the list column
unnest(c(Var1)) %>%
select(names(df1), X)
# A tibble: 18 x 3
# Groups: ID [3]
# ID Var1 X
# <int> <int> <dbl>
# 1 1 1 0
# 2 1 1 0
# 3 1 3 0
# 4 1 4 5
# 5 1 1 5
# 6 1 0 5
# 7 2 2 0
# 8 2 2 0
# 9 2 6 0
#10 2 7 10
#11 2 8 10
#12 2 0 10
#13 3 0 0
#14 3 2 0
#15 3 1 0
#16 3 3 3
#17 3 2 3
#18 3 4 3
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Var1 = c(1L, 1L, 3L, 4L, 1L,
0L, 2L, 2L, 6L, 7L, 8L, 0L, 0L, 2L, 1L, 3L, 2L, 4L)), class = "data.frame",
row.names = c(NA,
-18L))

R - delete rows according to the value of another row

I am quite a beginner in R but thanks to the community of Stackoverflow I am improving!
However, I am stuck with a problem:
I have a dataset with 5 variables:
id_house represents the id for each household
id_ind is an id which takes the value 1 for the first individual in the household, 2 for the next, 3 for the third...
indicator_tb_men indicates whether the first person has answered the survey (1 = yes, 0 = no). All the other members of the household take the value 0.
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
3 1 0
3 2 0
3 3 0
4 1 1
5 1 0
I would like to delete all members of households where the first individual has not answered the survey.
So it would give:
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
4 1 1
Using dplyr, here is one way:
library(dplyr)
df %>%
arrange(id_house, id_ind) %>%
group_by(id_house) %>%
filter(first(indicator_tb_men) != 0)
# id_house id_ind indicator_tb_men
# <int> <int> <int>
#1 1 1 1
#2 1 2 NA
#3 2 1 1
#4 4 1 1
data
df <- structure(list(id_house = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L),
id_ind = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 1L), indicator_tb_men = c(1L,
NA, 1L, 0L, NA, NA, 1L, 0L)), class = "data.frame", row.names = c(NA, -8L))
In base R we can use nested logical indexing:
df[df$id_house %in% df$id_house[df$id_ind == 1 & df$indicator_tb_men == 1],]
id_house id_ind indicator_tb_men
1 1 1 1
2 1 2 NA
3 2 1 1
7 4 1 1
Data: Using Ronak Shah's data
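Both answers use data where the other household members are coded NA, whereas the question codes them as 0. The sketch below runs the same base R idea in two steps on the 0-coded table from the question; df0, answered and res are illustrative names only.

```r
# The table from the question, with non-first members coded 0
df0 <- data.frame(
  id_house         = c(1, 1, 2, 3, 3, 3, 4, 5),
  id_ind           = c(1, 2, 1, 1, 2, 3, 1, 1),
  indicator_tb_men = c(1, 0, 1, 0, 0, 0, 1, 0)
)

# Step 1: households whose first individual answered the survey
answered <- df0$id_house[df0$id_ind == 1 & df0$indicator_tb_men == 1]

# Step 2: keep every member of those households
res <- df0[df0$id_house %in% answered, ]
res
```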

How to remove some row in each group and keep first of them?

Suppose I have
family person loop. mode
1 1 1 car
1 1 1 walk
1 1 1 car
1 1 2 walk
1 1 2 bus
1 2 1 bus
1 2 1 walk
1 2 2 bus
2 1 1 car
2 1 1 car
2 1 2 car
2 2 1 bus
I want this:
Each person has several loops in each family. For each family, person and loop, I want to keep only the first row of the loop if that first row's mode is car, and remove all other rows in that loop (whether they are car, bus or walk). If the first row of the loop is not car, I don't remove anything.
Output:
family person loop. mode
1 1 1 car
1 1 2 walk
1 1 2 bus
1 2 1 bus
1 2 1 walk
1 2 2 bus
2 1 1 car
2 1 2 car
2 2 1 bus
In the first family, the first person has car as the mode in the first row of his first loop, so I removed all other trips in that loop and kept only the first one. His second loop has no car mode, so I kept all of its rows. The second person also has no car mode, so I kept all of his rows. In the second family, the first person has car as the mode in the first row of his first loop, so I kept the first row and removed the rest of that loop. His second loop has only one row, so I kept it, and the second person has no car mode, so I kept his row.
An option is to group by 'family', 'person', 'loop.', and slice only the first row if the first element of 'mode' is 'car' or else return the full number of rows
library(dplyr)
df1 %>%
group_by(family, person, loop.) %>%
slice(if(first(mode) == 'car') 1 else row_number())
# A tibble: 9 x 4
# Groups: family, person, loop. [7]
# family person loop. mode
# <int> <int> <int> <chr>
#1 1 1 1 car
#2 1 1 2 walk
#3 1 1 2 bus
#4 1 2 1 bus
#5 1 2 1 walk
#6 1 2 2 bus
#7 2 1 1 car
#8 2 1 2 car
#9 2 2 1 bus
data
df1 <- structure(list(family = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), person = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L,
1L, 2L), loop. = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L,
1L), mode = c("car", "walk", "car", "walk", "bus", "bus", "walk",
"bus", "car", "car", "car", "bus")), class = "data.frame", row.names = c(NA,
-12L))
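The same rule can also be sketched in base R with ave(), in case dplyr is not available. The names key, idx, first_idx, keep and res are illustrative; the data frame is a copy of df1 above so the block runs on its own.

```r
# Same values as df1 above
df1 <- data.frame(
  family = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  person = c(1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2),
  loop.  = c(1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1),
  mode   = c("car", "walk", "car", "walk", "bus", "bus", "walk",
             "bus", "car", "car", "car", "bus")
)

# One key per family/person/loop combination that actually occurs
key <- paste(df1$family, df1$person, df1$loop.)
idx <- seq_len(nrow(df1))

# For every row, the row number of the first row in its group
first_idx <- ave(idx, key, FUN = function(i) i[1])

# Keep a row if its group does not start with "car",
# or if it is the first row of its group
keep <- df1$mode[first_idx] != "car" | idx == first_idx
res  <- df1[keep, ]
res
```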

how to change a column with some information in the data set?

I have the columns household, person (the persons in each household), tour (each tour contains several trips for each person), trip (the number of the trip within each tour), and mode (the mode of travel of each person in each trip).
I want to change the mode column with respect to the tour column as follows:
mode == car if there is at least one trip in the tour with mode car
mode == non-car if none of the trips in the tour has mode == car
example:
household. person. trip. tour. mode
1 1 1 1 car
1 1 2 1 walk
1 1 4 1 bus
1 1 1 2 bus
1 1 2 2 walk
1 2 1 1 walk
1 2 2 1 bus
1 2 3 1 walk
2 1 1 1 walk
2 1 1 1 car
output
household. person. trip. tour. mode
1 1 1 1 car
1 1 2 1 car
1 1 4 1 car
1 1 1 2 non-car
1 1 2 2 non-car
1 2 1 1 non-car
1 2 2 1 non-car
1 2 3 1 non-car
2 1 1 1 car
2 1 1 1 car
We can group by 'household.', 'person.', 'tour.' and change the 'mode' to two values by checking if there are any 'car' in the column. In that case, convert it to a numeric index by adding 1 (TRUE -> 2, FALSE ->1) and based on this index, we pass a vector of strings to replace the index
library(dplyr)
df1 %>%
group_by(household., person., tour.) %>%
mutate(mode = c('non-car', 'car')[1+any(mode == "car")])
# A tibble: 10 x 5
# Groups: household., person., tour. [4]
# household. person. trip. tour. mode
# <int> <int> <int> <int> <chr>
# 1 1 1 1 1 car
# 2 1 1 2 1 car
# 3 1 1 4 1 car
# 4 1 1 1 2 non-car
# 5 1 1 2 2 non-car
# 6 1 2 1 1 non-car
# 7 1 2 2 1 non-car
# 8 1 2 3 1 non-car
# 9 2 1 1 1 car
#10 2 1 1 1 car
data
df1 <- structure(list(household. = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L), person. = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L),
trip. = c(1L, 2L, 4L, 1L, 2L, 1L, 2L, 3L, 1L, 1L), tour. = c(1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), mode = c("car", "walk",
"bus", "bus", "walk", "walk", "bus", "walk", "walk", "car"
)), class = "data.frame", row.names = c(NA, -10L))
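The indexing trick in the answer can be looked at in isolation: adding a logical to 1 gives 1 or 2, which then selects a label. The sketch below shows this first and then a base R version of the grouped step using ave(); labels, key, has_car and res are illustrative names, and the data frame is a copy of df1 above.

```r
labels <- c("non-car", "car")
labels[1 + FALSE]   # FALSE counts as 0, so this selects element 1
labels[1 + TRUE]    # TRUE counts as 1, so this selects element 2

# Same values as df1 above
df1 <- data.frame(
  household. = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2),
  person.    = c(1, 1, 1, 1, 1, 2, 2, 2, 1, 1),
  trip.      = c(1, 2, 4, 1, 2, 1, 2, 3, 1, 1),
  tour.      = c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1),
  mode       = c("car", "walk", "bus", "bus", "walk", "walk", "bus",
                 "walk", "walk", "car")
)

# One key per household/person/tour combination that actually occurs
key <- paste(df1$household., df1$person., df1$tour.)

# 1 if any trip in the tour is by car, 0 otherwise, repeated for every row
has_car <- ave(as.integer(df1$mode == "car"), key, FUN = max)

res <- df1
res$mode <- labels[1 + has_car]
res
```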

How can I exclude zeros when finding a frequency and separate into 4 categories

I have a data frame data.2016 and am trying to find the frequency with which "DIPL" occurs (excluding zeros). "DIPL" is the number of parasitic worms found in a fish.
Data looks something like this:
data.2016
Site DIPL
1 0
1 1
1 1
2 6
2 8
2 1
2 1
3 0
3 0
3 0
4 1258
4 501
I want the output to look like this:
Site freq
1 2
2 4
3 0
4 2
From this I can interpret, out of the 3 fish found in site #1 (from the data frame), 2 of them had worm parasites.
I've tried
aggregate(DIPL~Site, data=data.2016, frequency) #and get:
Site DIPL
1 1 1
2 2 1
3 3 1
4 4 1
Is there a way to count the number of fish with worms from the DIPL column (meaning the value in the column is higher than zero) per site?
Just use a custom function that removes the zeros.
aggregate(DIPL ~ Site, data.2016, function(x) length(x[x != 0])) # or sum(x != 0)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
Another option would be to temporarily transform the DIPL column then just take the sum.
aggregate(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0), sum)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
xtabs() is fun too ...
xtabs(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0))
# Site
# 1 2 3 4
# 2 4 0 2
By the way, frequency is for use on time-series data.
Data:
data.2016 <- structure(list(Site = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L), DIPL = c(0L, 1L, 1L, 6L, 8L, 1L, 1L, 0L, 0L, 0L, 1258L,
501L)), .Names = c("Site", "DIPL"), class = "data.frame", row.names = c(NA,
-12L))
Might something like this be what you're looking for?
# first some fake data
site <- c("A","A","A","B","B","B")
numworms <- c(1,0,3,0,0,42)
data.frame(site,numworms)
site numworms
1 A 1
2 A 0
3 A 3
4 B 0
5 B 0
6 B 42
tapply(numworms, site, function(x) sum(x>0))
A B
2 1
