Reshaping a complicated dataset in R

I have a strange dataset format where a simple reshape function won't work. Assume I have three time periods (1-3), two ID names (A-B), and three variables (X, Y and Z) in the following format, where the ID names and variable names are separated by -:
Time A-X A-Y A-Z B-X B-Y B-Z
1 2 4 5 6 1 2
2 2 3 2 3 2 3
3 4 4 4 4 4 4
Ideally, I would like to produce the dataset in the following format:
ID Time X Y Z
A 1 2 4 5
A 2 2 3 2
A 3 4 4 4
B 1 6 1 2
B 2 3 2 3
B 3 4 4 4
Which functions should I use?

library(dplyr)
library(tidyr)
library(splitstackshape)
df %>%
gather(key, value, -Time) %>%
cSplit("key", sep="_") %>%
spread(key_2, value) %>%
rename(ID = key_1) %>%
arrange(ID, Time)
Output is:
Time ID X Y Z
1 1 A 2 4 5
2 2 A 2 3 2
3 3 A 4 4 4
4 1 B 6 1 2
5 2 B 3 2 3
6 3 B 4 4 4
Sample data:
df <- structure(list(Time = 1:3, A_X = c(2L, 2L, 4L), A_Y = c(4L, 3L,
4L), A_Z = c(5L, 2L, 4L), B_X = c(6L, 3L, 4L), B_Y = c(1L, 2L,
4L), B_Z = 2:4), .Names = c("Time", "A_X", "A_Y", "A_Z", "B_X",
"B_Y", "B_Z"), class = "data.frame", row.names = c(NA, -3L))

Here is another dplyr and tidyr solution.
df %>%
gather(ID, value, -Time) %>%
separate(ID, into = c("ID", "var")) %>%
spread(var, value) %>%
arrange(ID) %>%
select(ID, Time, X, Y, Z)
# ID Time X Y Z
# 1 A 1 2 4 5
# 2 A 2 2 3 2
# 3 A 3 4 4 4
# 4 B 1 6 1 2
# 5 B 2 3 2 3
# 6 B 3 4 4 4
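In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). Here is a minimal sketch of the same reshape in a single pivot, assuming the column names use "_" as the separator as in the sample data above (the object name `long` is only for illustration):

```r
library(dplyr)
library(tidyr)

# Sample data from the answer above, rebuilt with data.frame()
df <- data.frame(
  Time = 1:3,
  A_X = c(2L, 2L, 4L), A_Y = c(4L, 3L, 4L), A_Z = c(5L, 2L, 4L),
  B_X = c(6L, 3L, 4L), B_Y = c(1L, 2L, 4L), B_Z = 2:4
)

long <- df %>%
  # The part before "_" becomes the ID column; the special name ".value"
  # turns the part after "_" (X, Y, Z) into separate value columns.
  pivot_longer(-Time, names_to = c("ID", ".value"), names_sep = "_") %>%
  arrange(ID, Time) %>%
  select(ID, Time, X, Y, Z)

long
```

If the real column names use "-" instead, changing names_sep to "-" should be the only adjustment needed.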

Related

Consolidating and summing columns in R with similar names

I need some help consolidating columns in R
I have ~130 columns, some of which have a similar name. For example, I have ~25 columns called "pathogen".
However, after importing my datasheet into R, these columns are now listed as follows: pathogen..1, pathogen...2, etc. Because of how R renamed these columns, I'm not sure how to proceed.
I need to consolidate all my columns with the same/similar name, so that I have only 1 column called "pathogen". I also need this consolidated column to include the sums of all the consolidated columns called "pathogen".
Here is an example of my input:
sample Unidentified…1 Unidentified…2 Pathogen..1 Pathogen…2
1 5 3 6 8
2 7 2 1 0
3 8 4 2 9
4 9 6 4 0
5 0 7 5 1
Here is my desired output
Sample Unidentified Pathogen
1 8 14
2 9 1
3 12 11
4 15 4
5 7 6
Any help would be really appreciated.
Here is an option where you pivot to create the two groups and then you summarize.
library(tidyverse)
df |>
pivot_longer(cols = -sample,
names_to = ".value",
names_pattern = "(\\w+)") |>
group_by(sample) |>
summarise(across(everything(), sum))
#> # A tibble: 5 x 3
#> sample Unidentified Pathogen
#> <dbl> <dbl> <dbl>
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6
or with Base R
data.frame(
sample = 1:5,
Unidentified = rowSums(df[,grepl("Unidentified", colnames(df))]),
Pathogen = rowSums(df[,grepl("Pathogen", colnames(df))])
)
#> sample Unidentified Pathogen
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6
or another pivot option where we go long and then immediately go wide again, summing the nested cells.
library(tidyverse)
df |>
pivot_longer(-sample, names_pattern = "(\\w+)") |>
pivot_wider(names_from = name,
values_from = value,
values_fn = list(value = sum))
#> # A tibble: 5 x 3
#> sample Unidentified Pathogen
#> <dbl> <dbl> <dbl>
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6
Here I reshape long to make the column names more easily manipulable. I separate them into "stub" and "number" values and the default separator settings work fine. Then I sum the total values for each id-stub combo, and spread wide again.
library(tidyverse)
data.frame(
check.names = FALSE,
sample = c(1L, 2L, 3L, 4L, 5L),
`Unidentified…1` = c(5L, 7L, 8L, 9L, 0L),
`Unidentified…2` = c(3L, 2L, 4L, 6L, 7L),
Pathogen..1 = c(6L, 1L, 2L, 4L, 5L),
`Pathogen…2` = c(8L, 0L, 9L, 0L, 1L)
) %>%
pivot_longer(-sample) %>%
separate(name, c("stub","num")) %>%
count(sample, stub, wt = value) %>%
pivot_wider(names_from = "stub", values_from = "n")
Result
# A tibble: 5 × 3
sample Pathogen Unidentified
<int> <int> <int>
1 1 14 8
2 2 1 9
3 3 11 12
4 4 4 15
5 5 6 7
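With ~130 columns it may be impractical to name every stem by hand. Below is a minimal base R sketch that derives the stems from the column names and sums each group. It assumes every value column ends in one or more plain dots followed by a number (e.g. "Pathogen..1"); the names `value_cols`, `stems`, `sums` and `out` are only for illustration.

```r
# Example input, written with plain dots in the column names
df <- data.frame(
  check.names = FALSE,
  sample = 1:5,
  `Unidentified...1` = c(5L, 7L, 8L, 9L, 0L),
  `Unidentified...2` = c(3L, 2L, 4L, 6L, 7L),
  `Pathogen..1` = c(6L, 1L, 2L, 4L, 5L),
  `Pathogen...2` = c(8L, 0L, 9L, 0L, 1L)
)

value_cols <- names(df)[-1]

# Strip the trailing dots and number to recover the shared stem
stems <- sub("\\.+[0-9]+$", "", value_cols)

# split.default() splits the data frame by column; rowSums() totals each group
sums <- sapply(split.default(df[value_cols], stems), rowSums)

out <- data.frame(sample = df$sample, sums)
out
```

The summed columns come back in alphabetical order of the stems, so reorder them afterwards if the original order matters.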

Creating a new variable based on conditions of 3 other variables in R

I have a data set (n=500) in R that looks like this
ID A C S
1 4 4 4
2 3 2 3
3 5 4 2
I'd like to create a new variable (I am calling this variable "same") that tells me which of my columns, if any, have the same value (excluding my ID column). So,
ID A C S Same
1 4 4 4 all
2 3 2 3 as
3 5 4 2 none
4 7 7 2 ac
Any help would be much appreciated! I am pretty lost! Thank you!
We can loop over the rows with apply (MARGIN = 1) on the selected columns ([-1] drops the 'ID' column) and check the number of unique elements. If it is 1, return 'all'; otherwise paste together the lowercased names of the duplicated elements. If there are no duplicates, the paste returns a blank "", which is then changed to 'none'.
df1$Same <- apply(df1[-1], 1, \(x) {
x1 <- if(length(unique(x)) == 1) 'all' else
paste(tolower(names(x))[duplicated(x)|duplicated(x,
fromLast = TRUE)], collapse = "")
x1[x1 == ""] <- "none"
x1})
Output:
> df1
ID A C S Same
1 1 4 4 4 all
2 2 3 2 3 as
3 3 5 4 2 none
4 4 7 7 2 ac
data
df1 <- structure(list(ID = 1:4, A = c(4L, 3L, 5L, 7L), C = c(4L, 2L,
4L, 7L), S = c(4L, 3L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-4L))
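To see why the two duplicated() calls are combined, here is a small trace on single rows. It is only an illustration of the step described above; the vectors `x` and `y` are made-up stand-ins for rows 2 and 3 of df1.

```r
# Row 2 of df1 as a named vector
x <- c(A = 3L, C = 2L, S = 3L)

# duplicated(x) flags only the later repeat (S); scanning from the end
# flags the earlier one (A). OR-ing them marks every member of the tie.
dup <- duplicated(x) | duplicated(x, fromLast = TRUE)
label <- paste(tolower(names(x))[dup], collapse = "")

# Row 3 has no ties, so nothing is selected and paste() returns ""
y <- c(A = 5L, C = 4L, S = 2L)
empty <- paste(tolower(names(y))[duplicated(y) | duplicated(y, fromLast = TRUE)],
               collapse = "")

label
empty
```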
Try this using dplyr rowwise with rle
df |> rowwise() |> mutate(Same = case_when(length(rle(sort(c_across(A:S)))$values) == 1 ~ "all" ,
length(rle(sort(c_across(A:S)))$values) == 3 ~ "none" ,
c_across(A) == c_across(C) ~ "ac" ,
c_across(C) == c_across(S) ~ "cs" , TRUE ~ "as"))
output
# A tibble: 4 × 5
# Rowwise:
ID A C S Same
<int> <int> <int> <int> <chr>
1 1 4 4 4 all
2 2 3 2 3 as
3 3 5 4 2 none
4 4 7 7 2 ac

R - Count unique/distinct values in two columns together per group

R - Count unique/distinct values in two columns together
Hi everyone. I have a panel of electoral behaviour, but I am having trouble computing a new variable that would capture the unique values (parties) of my two columns Party and Party2013 per group. The column Party2013 measures the vote in the 2013 election and Party measures voters' intentions after 2013. Every time I try n_distinct or length I get the count of unique values in both columns separately but not as a sum.
ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
Based on the example above I normally get the count of 3 instead of desired 2.
I've tried the following commands but only got the number of separate unique values:
data %>% group_by(ID) %>% distinct(Party, Party2013, .keep_all = TRUE) %>% dplyr::summarise(Party_Party2013 = n())
or
ddply(data, .(ID), mutate, count = length(unique(Party, Party2013)))
The expected outcome would be as follows:
ID Wave Party Party2013 Count
1 1 A A 2
1 2 A NA 2
1 3 B NA 2
1 4 B NA 2
2 1 A C 3
2 2 B NA 3
2 3 B NA 3
2 4 B NA 3
I would very much appreciate any advice on how to count the overall number of unique parties across the two columns per group and not the number of distinct values per each one. Thanks.
You can subset the data from cur_data() and unlist the data to get a vector. Use n_distinct to count number of unique values.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count = n_distinct(unlist(select(cur_data(),
Party, Party2013)), na.rm = TRUE)) %>%
ungroup
# ID Wave Party Party2013 Count
# <int> <int> <chr> <chr> <int>
#1 1 1 A A 2
#2 1 2 A NA 2
#3 1 3 B NA 2
#4 1 4 B NA 2
#5 2 1 A C 3
#6 2 2 B NA 3
#7 2 3 B NA 3
#8 2 4 B NA 3
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
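In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick(). A minimal sketch that avoids both is to concatenate the two columns directly inside n_distinct(), assuming both columns are character as in the data above (`res` is just an illustrative name):

```r
library(dplyr)

# Same data as above, rebuilt with data.frame()
df <- data.frame(
  ID = rep(1:2, each = 4),
  Wave = rep(1:4, times = 2),
  Party = c("A", "A", "B", "B", "A", "B", "B", "B"),
  Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA)
)

res <- df %>%
  group_by(ID) %>%
  # c() stacks both columns into one vector per group; NAs are ignored
  mutate(Count = n_distinct(c(Party, Party2013), na.rm = TRUE)) %>%
  ungroup()

res
```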
In situations like this I always like to simplify the problem and change the data into the long format since it is easier to solve problems like this if all of your values are in one column. With pivot_longer() you can also use the argument values_drop_na = TRUE to drop NAs which were counted in your example:
library(tidyr)
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>% pivot_longer(cols = starts_with("Party"), values_drop_na = TRUE) %>% group_by(ID) %>%
summarise(Count = n_distinct(value)) %>% merge(data, .)
#> ID Wave Party Party2013 Count
#> 1 1 1 A A 2
#> 2 1 2 A <NA> 2
#> 3 1 3 B <NA> 2
#> 4 1 4 B <NA> 2
#> 5 2 1 A C 3
#> 6 2 2 B <NA> 3
#> 7 2 3 B <NA> 3
#> 8 2 4 B <NA> 3
Created on 2021-08-30 by the reprex package (v2.0.1)
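You can also do it this way. Note that this counts distinct Party/Party2013 pairs per ID, not distinct parties across both columns, so it gives 3 and 2 rather than the expected 2 and 3: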
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>%
group_by(ID) %>%
mutate(Count = paste(Party, Party2013) %>%
unique %>% length() %>%
rep(length(Party)))
output
# A tibble: 8 x 5
# Groups: ID [2]
ID Wave Party Party2013 Count
<int> <int> <chr> <chr> <int>
1 1 1 A A 3
2 1 2 A NA 3
3 1 3 B NA 3
4 1 4 B NA 3
5 2 1 A C 2
6 2 2 B NA 2
7 2 3 B NA 2
8 2 4 B NA 2

How do I merge and add up columns in R?

I have an issue in R I cannot fix, so I'm asking for help here. I want to merge three columns into one, but haven't found a way to do so. Let's say it looks like this table:
Time H C W K
0 1 2 0 5
1 5 2 1 1
2 0 1 2 2
How do I turn it into this table:
Time G K
0 3 5
1 8 1
2 3 2
Maybe you can try the code below
subset(within(df, G <- rowSums(cbind(H, C, W))), select = -c(H, C, W))
giving
Time K G
1 0 5 3
2 1 1 8
3 2 2 3
or a data.table option
> setDT(df)[, .(Time, G = rowSums(cbind(H, C, W)), K)][]
Time G K
1: 0 3 5
2: 1 8 1
3: 2 3 2
We can use transmute
library(dplyr)
df %>%
transmute(Time, G = rowSums(select(., H:W)), K)
# Time G K
#1 0 3 5
#2 1 8 1
#3 2 3 2
Maybe try this:
#Code
newdf <- data.frame(df[,1,drop=F],G=rowSums(df[,-c(1,5)]),df[,5,drop=F])
Output:
Time G K
1 0 3 5
2 1 8 1
3 2 3 2
Some data used:
#Data
df <- structure(list(Time = 0:2, H = c(1L, 5L, 0L), C = c(2L, 2L, 1L
), W = 0:2, K = c(5L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-3L))
As a shortcut that avoids listing each variable, and building on the answer by @KarthikS, you can use c_across():
library(dplyr)
#Code2
newdf <- df %>% rowwise() %>% mutate(G = sum(c_across(H:W))) %>% select(Time, G, K)
Output:
# A tibble: 3 x 3
# Rowwise:
Time G K
<int> <int> <int>
1 0 3 5
2 1 8 1
3 2 3 2
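Another possibility, sketched here on the assumption that dplyr 1.0.0 or later is installed, is to pass across() to rowSums(). This keeps the sum vectorised instead of going row by row, and it does not rely on the `.` placeholder:

```r
library(dplyr)

# Same data as above, rebuilt with data.frame()
df <- data.frame(
  Time = 0:2,
  H = c(1L, 5L, 0L),
  C = c(2L, 2L, 1L),
  W = 0:2,
  K = c(5L, 1L, 2L)
)

# across(H:W) returns the selected columns as a data frame,
# which rowSums() then totals per row
newdf <- df %>%
  transmute(Time, G = rowSums(across(H:W)), K)

newdf
```

Note that rowSums() returns a double, so G is numeric rather than integer here.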

Applying mutate_at conditionally to specific rows in a dataframe in R

I have a dataframe in R that looks like the following:
a b c condition
1 4 2 acap
2 3 1 acap
2 4 3 acap
5 6 8 ncap
5 7 6 ncap
8 7 6 ncap
I am trying to recode the values in columns a, b, and c for condition ncap (and also 2 other conditions not pictured here) while leaving the values for acap alone.
The following code works when applied to the first 3 columns. I am trying to figure out how I can apply this only to rows that I specify by condition while keeping everything in the same dataframe.
df = df %>%
mutate_at(vars(a:c), function(x)
case_when(x == 5 ~ 1, x == 6 ~ 2, x == 7 ~ 3, x == 8 ~ 4))
This is the expected output.
a b c condition
1 4 2 acap
2 3 1 acap
2 4 3 acap
1 2 4 ncap
1 3 2 ncap
4 3 2 ncap
I've looked around for an answer to this question and am unable to find it. If someone knows of an answer that already exists, I would appreciate being directed to it.
We can use the case_when on a condition created with row_number i.e. if the row number is 4 to 6, subtract 4 from the value or else return the value
df %>%
mutate_at(vars(a:c), funs(case_when(row_number() %in% 4:6 ~ . - 4L,
TRUE ~ .)))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
If this is based on the value instead of the rows, create the condition on the value
df %>%
mutate_at(vars(a:c), funs(case_when(. %in% 5:8 ~ . - 4L,
TRUE ~ .)))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
Or if it is based on the value in the 'condition'
df %>%
mutate_at(vars(a:c), funs(case_when(condition == 'ncap' ~ . - 4L,
TRUE ~ .)))
Or without using any case_when
df %>%
mutate_at(vars(a:c), funs( . - c(0, 4)[(condition == 'ncap')+1]))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
In base R, we can do this by creating the index
i1 <- df$condition =='ncap'
df[i1, 1:3] <- df[i1, 1:3] - 4
data
df <- structure(list(a = c(1L, 2L, 2L, 5L, 5L, 8L), b = c(4L, 3L, 4L,
6L, 7L, 7L), c = c(2L, 1L, 3L, 8L, 6L, 6L), condition = c("acap",
"acap", "acap", "ncap", "ncap", "ncap")), class = "data.frame",
row.names = c(NA, -6L))
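In current dplyr, funs() is deprecated and mutate_at() is superseded by across(). Here is a minimal sketch of the condition-based version in that style. Like the answers above, it assumes the recoding is a uniform shift of 4 (`res` is just an illustrative name):

```r
library(dplyr)

# Same data as above, rebuilt with data.frame()
df <- data.frame(
  a = c(1L, 2L, 2L, 5L, 5L, 8L),
  b = c(4L, 3L, 4L, 6L, 7L, 7L),
  c = c(2L, 1L, 3L, 8L, 6L, 6L),
  condition = rep(c("acap", "ncap"), each = 3)
)

res <- df %>%
  # Subtract 4 only where condition is "ncap"; other rows keep their value.
  # Both branches are integer, as if_else() requires matching types.
  mutate(across(a:c, ~ if_else(condition == "ncap", .x - 4L, .x)))

res
```

To cover several conditions at once, the test could be swapped for something like condition != "acap" or condition %in% a vector of the relevant labels.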
You can use filter to apply recoding values to only specific rows (not equal to "acap" here)
library(dplyr)
df %>%
filter(condition != "acap") %>%
mutate_at(vars(a:c), function(x)
case_when(x == 5 ~ 1, x == 6 ~ 2, x == 7 ~ 3, x == 8 ~ 4))
# a b c condition
#1 1 2 4 ncap
#2 1 3 2 ncap
#3 4 3 2 ncap
If you need the entire dataframe back again we can do
df %>%
filter(condition == "acap") %>%
bind_rows(df %>%
filter(condition != "acap") %>%
mutate_at(vars(a:c), function(x)
case_when(x == 5 ~ 1, x == 6 ~ 2, x == 7 ~ 3, x == 8 ~ 4)))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
