I have a dataframe with two rows of data. I would like to add a third row with values representing the difference between the previous two.
The data is in the following format:
Month A B C D E F
Jan 1 2 4 8 4 1
Feb 1 1 4 5 2 0
I would like to add an extra row with a calculation to give the change across the two months:
Month A B C D E F
Jan 1 2 4 8 4 1
Feb 1 1 4 5 2 0
Change 0 1 0 3 2 1
I've been looking at various functions to add additional rows including rbind and mutate but I'm struggling to perform the calculation in the newly created row.
As it's just two rows you can subtract the individual rows and rbind the difference
rbind(df, data.frame(Month = "Change", df[1, -1] - df[2, -1]))
# Month A B C D E F
#1 Jan 1 2 4 8 4 1
#2 Feb 1 1 4 5 2 0
#3 Change 0 1 0 3 2 1
d1<-data.frame(Month = "Change" , df[1,-1] , df[2,-1])
newdf <- rbind(df,d1)
This will create a new data-frame with what you require
An option with tidyverse
library(tidyverse)
df1 %>%
summarise_if(is.numeric, diff) %>%
abs %>%
bind_rows(df1, .) %>%
mutate(Month = replace_na(Month, "Change"))
# Month A B C D E F
#1 Jan 1 2 4 8 4 1
#2 Feb 1 1 4 5 2 0
#3 Change 0 1 0 3 2 1
data
df1 <- structure(list(Month = c("Jan", "Feb"), A = c(1L, 1L), B = 2:1,
C = c(4L, 4L), D = c(8L, 5L), E = c(4L, 2L), F = 1:0),
class = "data.frame", row.names = c(NA,
-2L))
Related
R - Count unique/distinct values in two columns together
Hi everyone. I have a panel of electoral behaviour but I am having problems to compute a new variable that would capture unique values (parties) of my two columns Party and Party2013 per group. The column Party2013 measures the vote in election 2013 and Party measures voters intentions after 2013. Everytime I try n_distinct or length I get the count of unique values in both columns separately but not as a sum.
ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
Based on the example above I normally get the count of 3 instead of desired 2.
I´ve tried following commands but got only the number of separate unique values:
data %>% group_by(ID) %>% distinct(Party, Party2013, .keep_all = TRUE) %> dplyr::summarise(Party_Party2013 = n())
or
ddply(data, .(ID), mutate, count = length(unique(Party, Party2013)))
The expected outcome would as follows:
ID Wave Party Party2013 Count
1 1 A A 2
1 2 A NA 2
1 3 B NA 2
1 4 B NA 2
2 1 A C 3
2 2 B NA 3
2 3 B NA 3
2 4 B NA 3
I would very much appreciate any advice on how to count the overall number of unique parties across the two columns per group and not the number of distinct values per each one. Thanks.
You can subset the data from cur_data() and unlist the data to get a vector. Use n_distinct to count number of unique values.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count = n_distinct(unlist(select(cur_data(),
Party, Party2013)), na.rm = TRUE)) %>%
ungroup
# ID Wave Party Party2013 Count
# <int> <int> <chr> <chr> <int>
#1 1 1 A A 2
#2 1 2 A NA 2
#3 1 3 B NA 2
#4 1 4 B NA 2
#5 2 1 A C 3
#6 2 2 B NA 3
#7 2 3 B NA 3
#8 2 4 B NA 3
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
In situations like this I always like to simplify the problem and change the data into the long format since it is easier to solve problems like this if all of your values are in one column. With pivot_longer() you can also use the argument values_drop_na = TRUE to drop NAs which were counted in your example:
library(tidyr)
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>% pivot_longer(cols = starts_with("Party"), values_drop_na = TRUE) %>% group_by(ID) %>%
summarise(Count = n_distinct(value)) %>% merge(data, .)
#> ID Wave Party Party2013 Count
#> 1 1 1 A A 2
#> 2 1 2 A <NA> 2
#> 3 1 3 B <NA> 2
#> 4 1 4 B <NA> 2
#> 5 2 1 A C 3
#> 6 2 2 B <NA> 3
#> 7 2 3 B <NA> 3
#> 8 2 4 B <NA> 3
Created on 2021-08-30 by the reprex package (v2.0.1)
You can also and this way:
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>%
group_by(ID) %>%
mutate(Count = paste(Party, Party2013) %>%
unique %>% length() %>%
rep(length(Party)))
output
# A tibble: 8 x 5
# Groups: ID [2]
ID Wave Party Party2013 Count
<int> <int> <chr> <chr> <int>
1 1 1 A A 3
2 1 2 A NA 3
3 1 3 B NA 3
4 1 4 B NA 3
5 2 1 A C 2
6 2 2 B NA 2
7 2 3 B NA 2
8 2 4 B NA 2
I am trying to expand on the answer to this problem that was solved, Take Sum of a Variable if Combination of Values in Two Other Columns are Unique
but because I am new to stack overflow, I can't comment directly on that post so here is my problem:
I have a dataset like the following but with about 100 columns of binary data as shown in "ani1" and "bni2" columns.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum all the columns based on the location and season, but I want to simplify so I get a total column for column #3 and after for each unique combination of location and season.
The problem is not all the columns have a 1 value for every combination of location and season and they all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!
You can use dplyr::summarise and across after group_by.
library(dplyr)
df %>%
group_by(Locations, seasons) %>%
summarise(across(starts_with("ani"), ~sum(.x, na.rm = TRUE))) %>%
ungroup()
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons ani1 ani2
<chr> <int> <int> <int>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
I have an issue in R I cannot fix, so I'm asking for help here. I want to merge three columns into one, but haven't found a way to do so. Let's say it looks like this table:
Time H C W K
0 1 2 0 5
1 5 2 1 1
2 0 1 2 2
How do I turn it into this table:
Time G K
0 3 5
1 8 1
2 3 2
Maybe you can try the code below
subset(within(df, G <- rowSums(cbind(H, C, W))), select = -c(H, C, W))
giving
Time K G
1 0 5 3
2 1 1 8
3 2 2 3
or a data.table option
> setDT(df)[, .(Time, G = rowSums(cbind(H, C, W)), K)][]
Time G K
1: 0 3 5
2: 1 8 1
3: 2 3 2
We can use transmute
library(dplyr)
df %>%
transmute(Time, G = rowSums(select(., H:W)), K)
# Time G K
#1 0 3 5
#2 1 8 1
#3 2 3 2
Maybe try this:
#Code
newdf <- data.frame(df[,1,drop=F],G=rowSums(df[,-c(1,5)]),df[,5,drop=F])
Output:
Time G K
1 0 3 5
2 1 8 1
3 2 3 2
Some data used:
#Data
df <- structure(list(Time = 0:2, H = c(1L, 5L, 0L), C = c(2L, 2L, 1L
), W = 0:2, K = c(5L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-3L))
Also a shortcut instead of placing each variable and improving the answer of #KarthikS can be using c_across():
library(dplyr)
#Code2
newdf <- df %>% rowwise() %>% mutate(G = sum(c_across(H:W))) %>% select(Time, G, K)
Output:
# A tibble: 3 x 3
# Rowwise:
Time G K
<int> <int> <int>
1 0 3 5
2 1 8 1
3 2 3 2
Apologies if this is a repeat question but I could not find the specific answer I am looking for. I have a dataframe with counts of different species caught on a given trip. A simplified example with 5 trips and 4 species is below:
trip = c(1,1,1,2,2,3,3,3,3,4,5,5)
species = c("a","b","c","b","d","a","b","c","d","c","c","d")
count = c(5,7,3,1,8,10,1,4,3,1,2,10)
dat = cbind.data.frame(trip, species, count)
dat
> dat
trip species count
1 1 a 5
2 1 b 7
3 1 c 3
4 2 b 1
5 2 d 8
6 3 a 10
7 3 b 1
8 3 c 4
9 3 d 3
10 4 c 1
11 5 c 2
12 5 d 10
I am only interested in the counts of species b for each trip. So I want to manipulate this data frame so I end up with one that looks like this:
trip2 = c(1,2,3,4,5)
species2 = c("b","b","b","b","b")
count2 = c(7,1,1,0,0)
dat2 = cbind.data.frame(trip2, species2, count2)
dat2
> dat2
trip2 species2 count2
1 1 b 7
2 2 b 1
3 3 b 1
4 4 b 0
5 5 b 0
I want to keep all trips, including trips where species b was not observed. So I can't just subset the data by species b. I know I can cast the data so species are the columns and then just remove the columns for the other species like so:
library(dplyr)
library(reshape2)
test = dcast(dat, trip ~ species, value.var = "count", fun.aggregate = sum)
test
> test
trip a b c d
1 1 5 7 3 0
2 2 0 1 0 8
3 3 10 1 4 3
4 4 0 0 1 0
5 5 0 0 2 10
However, my real dataset has several hundred species caught on thousands of trips, and if I try to cast that many species to columns R chokes. There are way too many columns. Is there a way to specify in dcast that I only want to cast species b? Or is there another way to do this that doesn't require casting the data? Thank you.
Here is a data.table approach which I suspect will be very fast for you:
library(data.table)
setDT(dat)
result <- dat[,.(species = "b", count = sum(.SD[species == "b",count])),by = trip]
result
trip species count
1: 1 b 7
2: 2 b 1
3: 3 b 1
4: 4 b 0
5: 5 b 0
We can use tidyverse
library(dplyr)
library(tidyr)
dat %>%
filter(species == 'b') %>%
group_by(trip, species) %>%
summarise(count = sum(count)) %>%
ungroup %>%
complete(trip = unique(dat$trip), fill = list(species = 'b', count = 0))
# A tibble: 5 x 3
# trip species count
# <dbl> <chr> <dbl>
#1 1 b 7
#2 2 b 1
#3 3 b 1
#4 4 b 0
#5 5 b 0
currently i am trying to count frequency of set of sequence of data frame.
A B
1 a
1 b
1 c
2 a
2 b
2 c
i have this data frame and i would like to count frequency of "B" of another data frame looking like this
C D
1 a
1 a
1 b
1 b
2 b
2 c
2 c
As you can see the number of rows is different so datatable(counts) does not work. i would like to it to look like this after frequency count is done
a b freq
1 a 2
1 b 2
1 c 0
2 a 0
2 b 1
2 c 2
As you can see it makes counts of all the frequency even the 0 as the on some groups there is no data on it.
thanks for anyone that helps!
By using merge and aggregate
df2$freq = 1
df = merge(df1,aggregate(freq~.,df2,length),by.x = c('A','B'),by.y = c('C','D'),all.x = T)
df[is.na(df)] = 0
df
A B freq
1 1 a 2
2 1 b 2
3 1 c 0
4 2 a 0
5 2 b 1
6 2 c 2
More Info
aggregate(freq~.,df2,length)
C D freq
1 1 a 2
2 1 b 2
3 2 b 1
4 2 c 2
Data Input
df1
A B
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
df2
C D
1 1 a
2 1 a
3 1 b
4 1 b
5 2 b
6 2 c
7 2 c
This looks to be a question of how to tabulate frequencies across two factors without dropping missing levels.
Here's the dplyr solution. This assumes that dfAB, as in your example data, contains no duplicates (dfAB is interchangeable with the output of expand.grid if you don't already have the level combinations in a data frame)
library(dplyr)
dfAB %>%
# need at least one non-joining variable to tell matches from non-matches
left_join(mutate(dfCD, dummy = 1), by = c("A" = "C", "B" = "D")) %>%
group_by(A, B) %>%
summarize(freq = sum(dummy, na.rm = TRUE))
Output:
# A tibble: 6 x 3
# Groups: A [?]
A B freq
<dbl> <chr> <dbl>
1 1 a 2
2 1 b 2
3 1 c 0
4 2 a 0
5 2 b 1
6 2 c 2
(if there are duplicates in dfAB, add a distinct call to the chain before the join)
df1_rows = Reduce(paste, df1)
df2_rows = Reduce(paste, df2)
data.frame(df1, freq = sapply(df1_rows, function(x) sum(df2_rows %in% x)),
row.names = NULL)
# A B freq
#1 1 a 2
#2 1 b 2
#3 1 c 0
#4 2 a 0
#5 2 b 1
#6 2 c 2
DATA
df1 = data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
B = c("a", "b", "c", "a", "b", "c"))
df2 = data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
D = c("a", "a", "b", "b", "b", "c", "c"))