Consolidating and summing columns in R with similar names - r

I need some help consolidating columns in R.
I have ~130 columns, some of which have a similar name. For example, I have ~25 columns called "pathogen".
However, after importing my datasheet into R, these columns are now listed as follows: pathogen..1, pathogen...2, etc. Because of how R renamed these columns, I'm not sure how to proceed.
I need to consolidate all my columns with the same/similar name, so that I have only 1 column called "pathogen". I also need this consolidated column to contain the row-wise sums of all the columns called "pathogen".
Here is an example of my input:
sample Unidentified…1 Unidentified…2 Pathogen..1 Pathogen…2
1 5 3 6 8
2 7 2 1 0
3 8 4 2 9
4 9 6 4 0
5 0 7 5 1
Here is my desired output
Sample Unidentified Pathogen
1 8 14
2 9 1
3 12 11
4 15 4
5 7 6
Any help would be really appreciated.

Here is an option where you pivot to create the two groups and then you summarize.
library(tidyverse)
df |>
pivot_longer(cols = -sample,
names_to = ".value",
names_pattern = "(\\w+)") |>
group_by(sample) |>
summarise(across(everything(), sum))
#> # A tibble: 5 x 3
#> sample Unidentified Pathogen
#> <dbl> <dbl> <dbl>
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6
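Or with base R (drop = FALSE keeps a data frame even when only one column matches, so rowSums still works):
data.frame(
sample = df$sample,
Unidentified = rowSums(df[, grepl("Unidentified", colnames(df)), drop = FALSE]),
Pathogen = rowSums(df[, grepl("Pathogen", colnames(df)), drop = FALSE])
)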
#> sample Unidentified Pathogen
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6
Or another pivot option, where we go long and then immediately go wide again, summing the values that land in the same cell.
library(tidyverse)
df |>
pivot_longer(-sample, names_pattern = "(\\w+)") |>
pivot_wider(names_from = name,
values_from = value,
values_fn = list(value = sum))
#> # A tibble: 5 x 3
#> sample Unidentified Pathogen
#> <dbl> <dbl> <dbl>
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6
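With ~130 columns, naming each group by hand does not scale. A minimal base R sketch, assuming every column except sample should be grouped by its name with the trailing dots and number stripped (the regular expression and the object names stub, sums and out are illustrative, not from the original post):

```r
# Example data with the dotted names R produced on import
df <- data.frame(
  sample = 1:5,
  Unidentified...1 = c(5, 7, 8, 9, 0),
  Unidentified...2 = c(3, 2, 4, 6, 7),
  Pathogen..1 = c(6, 1, 2, 4, 5),
  Pathogen...2 = c(8, 0, 9, 0, 1)
)

# Columns to consolidate: everything except the id column
value_cols <- setdiff(names(df), "sample")

# Strip the trailing ".<dots><number>" to recover the shared name
stub <- sub("\\.+[0-9]+$", "", value_cols)

# Split the columns by shared name and take row sums within each group
sums <- sapply(split.default(df[value_cols], stub), rowSums)

out <- data.frame(sample = df$sample, sums)
out
```

The groups come back in alphabetical order (Pathogen before Unidentified); reorder the columns afterwards if the original order matters.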

Here I reshape long to make the column names more easily manipulable. I separate them into "stub" and "number" values and the default separator settings work fine. Then I sum the total values for each id-stub combo, and spread wide again.
library(tidyverse)
data.frame(
check.names = FALSE,
sample = c(1L, 2L, 3L, 4L, 5L),
`Unidentified…1` = c(5L, 7L, 8L, 9L, 0L),
`Unidentified…2` = c(3L, 2L, 4L, 6L, 7L),
Pathogen..1 = c(6L, 1L, 2L, 4L, 5L),
`Pathogen…2` = c(8L, 0L, 9L, 0L, 1L)
) %>%
pivot_longer(-sample) %>%
separate(name, c("stub","num")) %>%
count(sample, stub, wt = value) %>%
pivot_wider(names_from = "stub", values_from = "n")
Result
# A tibble: 5 × 3
sample Pathogen Unidentified
<int> <int> <int>
1 1 14 8
2 2 1 9
3 3 11 12
4 4 4 15
5 5 6 7

Related

how to delete one of two duplicates in each column and merge them in r

I have data consisting of two rows, with some duplicate values within the columns. I want to remove the duplicates within each column and then gather all the unique values while keeping the column names.
data<-structure(c(10L, 10L, 11L, 11L, 5L, 5L, 3L, 5L), .Dim = c(2L,
4L), .Dimnames = list(c("d1", "m1"), c("year2036", "year2037",
"year2038", "year2039")))
year2036 year2037 year2038 year2039
d1 10 11 5 3
m1 10 11 5 5
And the output will be like:
year2036 year2037 year2038 year2039 year2039
10 11 5 3 5
out<-structure(c(10, 11, 5, 3, 5), .Names = c("year2036", "year2037",
"year2038", "year2039", "year2039"))
I tried unique(r[c(1:8)]), but it just returns the unique numbers and drops the column names.
You can use unique in apply and stack the result.
stack(apply(data, 2, unique))
# values ind
#1 10 year2036
#2 11 year2037
#3 5 year2038
#4 3 year2039
#5 5 year2039
Or in the format you wanted:
x <- stack(apply(data, 2, unique))
setNames(x$values, x$ind)
#year2036 year2037 year2038 year2039 year2039
# 10 11 5 3 5
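One caveat with the approach above: apply() simplifies its result to a vector or matrix when every column yields the same number of unique values, so the list that stack() expects is not guaranteed for other inputs. A minimal sketch that always builds a list first, assuming the same matrix as in the question (the names u and out are illustrative):

```r
data <- structure(
  c(10L, 10L, 11L, 11L, 5L, 5L, 3L, 5L),
  .Dim = c(2L, 4L),
  .Dimnames = list(c("d1", "m1"),
                   c("year2036", "year2037", "year2038", "year2039"))
)

# lapply() never simplifies, so u is a list with one element per column
u <- lapply(seq_len(ncol(data)), function(j) unique(data[, j]))

# Repeat each column name once per unique value it contributed
out <- setNames(unlist(u), rep(colnames(data), lengths(u)))
out
```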
Or with dplyr and tidyr:
library(dplyr)
library(tidyr)
data %>%
as_tibble() %>%
pivot_longer(everything()) %>%
group_by(name) %>%
distinct(value)
# A tibble: 5 x 2
# Groups: name [4]
name value
<chr> <int>
1 year2036 10
2 year2037 11
3 year2038 5
4 year2039 3
5 year2039 5
It is not good practice to have data with duplicate column names. Here is a solution which gives the same structure as your expected output but with modified column names.
library(dplyr)
library(tidyr)
data %>%
as.data.frame() %>%
pivot_longer(cols = everything()) %>%
distinct() %>%
mutate(row = data.table::rowid(name)) %>%
pivot_wider(names_from = c(name, row), values_from = value)
# year2036_1 year2037_1 year2038_1 year2039_1 year2039_2
# <int> <int> <int> <int> <int>
#1 10 11 5 3 5
Using dapply from collapse
library(collapse)
stack(dapply(data, MARGIN = 2, FUN = funique))
values ind
1 10 year2036
2 11 year2037
3 5 year2038
4 3 year2039
5 5 <NA>

Sum up two variables in a long-format dataframe with tidyverse

I have a simple data frame in a tidy format:
group variable value
<fct> <chr> <dbl>
1 fishers_here 100
1 money_per_fisher 2000
1 unnecessary_variable 10
2 fishers_here 140
2 money_per_fisher 8000
2 unnecessary_variable 304
3 fishers_here 10
3 money_per_fisher 9000
....
for each group I'd like to have the variable "total money in group" which is just fishers_here * money_per_fisher; basically I'd like it to look like this:
group variable value
<fct> <chr> <dbl>
1 fishers_here 100
1 money_per_fisher 2000
1 unnecessary_variable 10
1 TOTAL_MONEY 200000
....
Is there a simple way to get this done with tidyverse?
By simple I mean without having to filter, summarise, add the variable column back in and then join the two now separate dataframes.
You can spread, do the multiplication and then gather back up. Note I'm assuming that there is a typo in the group number in row 6 as I commented, where it should be group 2 instead of group 1. If that's not the case, then some additional cleaning steps are needed. You can also sort your resulting rows however you want (e.g. to put the rows for each group back together)
library(tidyverse)
tbl <- read_table( # read_table2() is deprecated in readr >= 2.0
"group variable value
1 fishers_here 100
1 money_per_fisher 2000
1 unnecessary_variable 10
2 fishers_here 140
2 money_per_fisher 8000
2 unnecessary_variable 304
3 fishers_here 10
3 money_per_fisher 9000"
)
tbl %>%
spread(variable, value) %>%
mutate(total_money_in_group = money_per_fisher * fishers_here) %>%
gather(variable, value, -group)
#> # A tibble: 12 x 3
#> group variable value
#> <dbl> <chr> <dbl>
#> 1 1 fishers_here 100
#> 2 2 fishers_here 140
#> 3 3 fishers_here 10
#> 4 1 money_per_fisher 2000
#> 5 2 money_per_fisher 8000
#> 6 3 money_per_fisher 9000
#> 7 1 unnecessary_variable 10
#> 8 2 unnecessary_variable 304
#> 9 3 unnecessary_variable NA
#> 10 1 total_money_in_group 200000
#> 11 2 total_money_in_group 1120000
#> 12 3 total_money_in_group 90000
Created on 2019-02-04 by the reprex package (v0.2.1)
Another option is to filter the 'money_per_fisher' and 'fishers_here' rows, group by 'group', summarise to get the prod of 'value', bind the result to the original data and arrange by 'group'.
library(tidyverse)
df1 %>%
filter(variable %in% c('fishers_here', 'money_per_fisher')) %>%
group_by(group) %>%
summarise(variable = "total_money_in_group", value = prod(value)) %>%
bind_rows(df1, .) %>%
arrange(group)
# A tibble: 11 x 3
# group variable value
# <int> <chr> <dbl>
# 1 1 fishers_here 100
# 2 1 money_per_fisher 2000
# 3 1 unnecessary_variable 10
# 4 1 total_money_in_group 200000
# 5 2 fishers_here 140
# 6 2 money_per_fisher 8000
# 7 2 unnecessary_variable 304
# 8 2 total_money_in_group 1120000
# 9 3 fishers_here 10
#10 3 money_per_fisher 9000
#11 3 total_money_in_group 90000
data
df1 <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
variable = c("fishers_here",
"money_per_fisher", "unnecessary_variable", "fishers_here", "money_per_fisher",
"unnecessary_variable", "fishers_here", "money_per_fisher"),
value = c(100L, 2000L, 10L, 140L, 8000L, 304L, 10L, 9000L
)), class = "data.frame", row.names = c(NA, -8L))
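The same filter, multiply, and bind steps can also be written without packages. A minimal base R sketch, assuming the df1 shown above and using the TOTAL_MONEY label from the question (the names keep, totals and out are illustrative):

```r
df1 <- data.frame(
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  variable = c("fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher", "unnecessary_variable",
               "fishers_here", "money_per_fisher"),
  value = c(100L, 2000L, 10L, 140L, 8000L, 304L, 10L, 9000L)
)

# Only these two variables enter the product
keep <- df1$variable %in% c("fishers_here", "money_per_fisher")

# One product per group
totals <- aggregate(value ~ group, data = df1[keep, ], FUN = prod)
totals$variable <- "TOTAL_MONEY"

# Append the new rows in the original column order, then sort by group;
# order() is stable, so each total lands after its group's original rows
out <- rbind(df1, totals[names(df1)])
out <- out[order(out$group), ]
out
```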
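Based on your output I think this is a possible solution, once the rows that should not enter the product (here unnecessary_variable) are filtered out:
df %>%
filter(variable != "unnecessary_variable") %>%
group_by(group) %>%
summarise(value = prod(value))
Edit: If you want the result as a column on the (filtered) dataset, you can use mutate instead of summarise.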

How to calculate a row-wise count of duplicates based on (element-wise) selected adjacent columns

I have a data frame test:
group userID A_conf A_chall B_conf B_chall
1 220 1 1 1 2
1 222 4 6 4 4
2 223 6 5 3 2
1 224 1 5 4 4
2 228 4 4 4 4
The data contains responses per user (identified by userID), where each user may enter any value between 1 and 6 for both measures:
conf
chall
They can also choose not to respond, resulting in an NA entry.
The test dataframe contains several columns like A, B, C, D and so on. Conf and Chall measures can be reported for each of these columns separately.
I am interested in making the following comparisons:
A_conf & A_chall
B_conf & B_chall
If the two measures in a pair are equal, the Final counter should be incremented (as shown below).
group userID A_conf A_chall B_conf B_chall Final
1 220 1 1 1 2 1
1 222 4 6 4 4 1
2 223 6 5 3 2 0
1 224 1 5 4 4 1
2 228 4 4 4 4 2
I am struggling with the Final counter. What script would help me achieve this functionality?
For reference, the dput of the test dataframe set is shared below:
dput(test):
structure(list(group = c(1L, 1L, 2L, 1L, 2L),
userID = c(220L, 222L, 223L, 224L, 228L),
A_conf = c(1L, 4L, 6L, 1L, 4L),
A_chall = c(1L, 6L, 5L, 5L, 4L),
B_conf = c(1L, 4L, 3L, 4L, 4L),
B_chall = c(2L, 4L, 2L, 4L, 4L)),
class = "data.frame", row.names = c(NA, -5L))
I tried a code like this:
test$Final = as.integer(0) # add a column to keep counts
count_inc = as.integer(0) # counter variable to increment in steps of 1
for (i in 1:nrow(test)) {
count_inc = 0
if(!is.na(test$A_conf[i] == test$A_chall[i]))
{
count_inc = 1
test$Final[i] = count_inc
}#if
else if(!is.na(test$A_conf[i] != test$A_chall[i]))
{
count_inc = 0
test$Final[i] = count_inc
}#else if
}#for
The above code has been written to work ONLY on the columns A_conf and A_chall. The problem is, it fills the Final column with all 1's whether the entered values (by users) are equal or not.
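The reason every row gets a 1 can be seen in isolation: x == y evaluates to TRUE or FALSE whenever both values are present, neither of which is NA, so !is.na(x == y) is TRUE regardless of whether the values match. A minimal illustration with made-up values:

```r
a <- 4L
b <- 6L

cmp <- a == b                 # FALSE: the values differ
not_missing <- !is.na(cmp)    # TRUE: the comparison itself is not NA

# isTRUE() is TRUE only when the comparison is TRUE and not NA
equal_and_present <- isTRUE(cmp)

# With a missing response the comparison is NA, and isTRUE() gives FALSE
cmp_na <- NA_integer_ == b
equal_with_na <- isTRUE(cmp_na)
```

So the condition in the loop tests whether both answers exist, not whether they are equal.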
A base R solution assuming you have equal number of "conf" and "chall" columns
#Find indexes of "conf" column
conf_col <- grep("conf", names(test))
#Find indexes of "chall" column
chall_col <- grep("chall", names(test))
#compare element wise and take row wise sum
test$Final <- rowSums(test[conf_col] == test[chall_col])
test
# group userID A_conf A_chall B_conf B_chall Final
#1 1 220 1 1 1 2 1
#2 1 222 4 6 4 4 1
#3 2 223 6 5 3 2 0
#4 1 224 1 5 4 4 1
#5 2 228 4 4 4 4 2
This can also be done as a one-liner:
rowSums(test[grep("conf", names(test))] == test[grep("chall", names(test))])
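Since the question says users may skip a response, the comparison can contain NA, and rowSums() would then return NA for that row. A hedged variant of the base R approach, assuming a missing response should simply not count as a match; the injected NA and the pairing check are additions for illustration:

```r
test <- data.frame(
  group   = c(1L, 1L, 2L, 1L, 2L),
  userID  = c(220L, 222L, 223L, 224L, 228L),
  A_conf  = c(1L, 4L, 6L, 1L, 4L),
  A_chall = c(1L, 6L, 5L, 5L, 4L),
  B_conf  = c(1L, 4L, 3L, 4L, 4L),
  B_chall = c(2L, 4L, 2L, 4L, 4L)
)
test$B_chall[5] <- NA  # simulate a skipped response (not in the original data)

conf_col  <- grep("_conf$", names(test))
chall_col <- grep("_chall$", names(test))

# Guard: the conf and chall columns must pair up by prefix, in the same order
stopifnot(identical(sub("_conf$", "", names(test)[conf_col]),
                    sub("_chall$", "", names(test)[chall_col])))

# na.rm = TRUE drops comparisons that involve a missing response
test$Final <- rowSums(test[conf_col] == test[chall_col], na.rm = TRUE)
test
```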
With tidyverse you can do:
df %>%
select(-Final) %>%
rowid_to_column() %>% #Creating an unique row ID
gather(var, val, -c(group, userID, rowid)) %>% #Reshaping the data
arrange(rowid, var) %>% #Arranging by row ID and by variables
group_by(rowid) %>% #Grouping by row ID
mutate(temp = gl(n()/2, 2)) %>% #Creating a grouping variable for different "_chall" and "_conf" variables
group_by(rowid, temp) %>% #Grouping by row ID and the new grouping variables
mutate(res = ifelse(val == lag(val), 1, 0)) %>% #Comparing whether the different "_chall" and "_conf" have the same value
group_by(rowid) %>% #Grouping by row ID
mutate(res = sum(res, na.rm = TRUE)) %>% #Summing the occurrences of "_chall" and "_conf" being the same
select(-temp) %>%
spread(var, val) %>% #Returning the data to its original form
ungroup() %>%
select(-rowid)
group userID res A_chall A_conf B_chall B_conf
<int> <int> <dbl> <int> <int> <int> <int>
1 1 220 1. 1 1 2 1
2 1 222 1. 6 4 4 4
3 2 223 0. 5 6 2 3
4 1 224 1. 5 1 4 4
5 2 228 2. 4 4 4 4
You can try this tidyverse approach as well. It takes a few fewer lines than the other answer ;)
library(tidyverse)
d %>%
as_tibble() %>%
gather(k, v, -group,-userID) %>%
separate(k, into = c("letters", "test")) %>%
spread(test, v) %>%
group_by(userID) %>%
mutate(final = sum(chall == conf)) %>%
distinct(userID, final) %>%
ungroup() %>%
right_join(d)
# A tibble: 5 x 7
userID final group A_conf A_chall B_conf B_chall
<int> <int> <int> <int> <int> <int> <int>
1 220 1 1 1 1 1 2
2 222 1 1 4 6 4 4
3 223 0 2 6 5 3 2
4 224 1 1 1 5 4 4
5 228 2 2 4 4 4 4

Rolling average using groupby and varying window length

I'm trying to create a rolling average of a column based on an ID column and a measurement time label in R, but I am having a lot of trouble with it.
Here is what my dataframe looks like:
ID Measurement Value
A 1 10
A 2 12
A 3 14
B 1 10
B 2 12
B 3 14
B 4 10
The problem is that I have measurement counts varying from 9 to 76 for each ID so I haven't found a solution that will create a column of a rolling average for each ID while handling the varying window length.
My goal is a dataframe like this:
ID Measurement Value Average
A 1 10 NA
A 2 12 11
A 3 14 12
B 1 10 NA
B 2 12 11
B 3 14 12
B 4 10 11.5
With your data:
library(dplyr)
dat %>%
group_by(Id) %>%
mutate(Avrg = cumsum(Value)/(1:n()))
# A tibble: 7 x 4
# Groups: Id [2]
Id Measurement Value Avrg
<chr> <int> <int> <dbl>
1 A 1 10 10
2 A 2 12 11
3 A 3 14 12
4 B 1 10 10
5 B 2 12 11
6 B 3 14 12
7 B 4 10 11.5
Data:
structure(list(Id = c("A", "A", "A", "B", "B", "B", "B"),
Measurement = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
Value = c(10L, 12L, 14L, 10L, 12L, 14L, 10L)
),
class = "data.frame", row.names = c(NA, -7L))
P.S. I am pretty sure that the average of 10 is 10, not NA
library(dplyr)
data %>%
group_by(ID) %>%
mutate(rolling_mean = cummean(Value))
First row will be mean of first value for each group (ID), not NA.
This uses no packages. It calculates the cumulative average by ID except that for Measurement equal to 1 it forces the average to be NA.
transform(DF, Avg = ave(Value, ID, FUN = cumsum) /
ifelse(Measurement == 1, NA, Measurement))
giving:
ID Measurement Value Avg
1 A 1 10 NA
2 A 2 12 11.0
3 A 3 14 12.0
4 B 1 10 NA
5 B 2 12 11.0
6 B 3 14 12.0
7 B 4 10 11.5
Note
The input DF in reproducible form is:
Lines <- "ID Measurement Value
A 1 10
A 2 12
A 3 14
B 1 10
B 2 12
B 3 14
B 4 10"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE, as.is = TRUE)
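The version above divides by Measurement, which assumes measurements are numbered 1, 2, 3, ... without gaps within each ID. A minimal sketch that counts rows instead, assuming the rows are already in measurement order within each ID (the helper name cum_mean is hypothetical):

```r
DF <- data.frame(
  ID = c("A", "A", "A", "B", "B", "B", "B"),
  Measurement = c(1, 2, 3, 1, 2, 3, 4),
  Value = c(10, 12, 14, 10, 12, 14, 10)
)

# Cumulative mean within one group, with the first entry forced to NA
cum_mean <- function(v) {
  m <- cumsum(v) / seq_along(v)
  m[1] <- NA
  m
}

# ave() applies cum_mean separately to the Value entries of each ID
DF$Average <- ave(DF$Value, DF$ID, FUN = cum_mean)
DF
```

Because each group is handled on its own, the number of measurements per ID (9 to 76 in the question) does not need to be known in advance.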

Reshaping a complicated dataset in R

I have a strange dataset format where a simple reshape function won't work. Assume I have three time periods (1-3), two id names (A-B), and three variables (X, Y and Z) in the following format. The id names and variable names are separated by -:
Time A-X A-Y A-Z B-X B-Y B-Z
1 2 4 5 6 1 2
2 2 3 2 3 2 3
3 4 4 4 4 4 4
Ideally, I would like to produce the dataset in the following format:
ID Time X Y Z
A 1 2 4 5
A 2 2 3 2
A 3 4 4 4
B 1 6 1 2
B 2 3 2 3
B 3 4 4 4
Which functions to use?
library(dplyr)
library(tidyr)
library(splitstackshape)
df %>%
gather(key, value, -Time) %>%
cSplit("key", sep="_") %>%
spread(key_2, value) %>%
rename(ID = key_1) %>%
arrange(ID, Time)
Output is:
Time ID X Y Z
1 1 A 2 4 5
2 2 A 2 3 2
3 3 A 4 4 4
4 1 B 6 1 2
5 2 B 3 2 3
6 3 B 4 4 4
Sample data:
df <- structure(list(Time = 1:3, A_X = c(2L, 2L, 4L), A_Y = c(4L, 3L,
4L), A_Z = c(5L, 2L, 4L), B_X = c(6L, 3L, 4L), B_Y = c(1L, 2L,
4L), B_Z = 2:4), .Names = c("Time", "A_X", "A_Y", "A_Z", "B_X",
"B_Y", "B_Z"), class = "data.frame", row.names = c(NA, -3L))
Here is another dplyr and tidyr solution.
df %>%
gather(ID, value, -Time) %>%
separate(ID, into = c("ID", "var")) %>%
spread(var, value) %>%
arrange(ID) %>%
select(ID, Time, X, Y, Z)
# ID Time X Y Z
# 1 A 1 2 4 5
# 2 A 2 2 3 2
# 3 A 3 4 4 4
# 4 B 1 6 1 2
# 5 B 2 3 2 3
# 6 B 3 4 4 4
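For comparison, the same reshape can be sketched in base R by building one block per ID and stacking the blocks. This assumes the underscore-separated names from the sample data above; the names ids, parts and out are illustrative:

```r
df <- data.frame(
  Time = 1:3,
  A_X = c(2L, 2L, 4L), A_Y = c(4L, 3L, 4L), A_Z = c(5L, 2L, 4L),
  B_X = c(6L, 3L, 4L), B_Y = c(1L, 2L, 4L), B_Z = 2:4
)

# IDs are whatever precedes the first underscore in the value columns
ids <- unique(sub("_.*$", "", setdiff(names(df), "Time")))

# For each ID, pull its columns, rename them to the bare variable names,
# and attach the ID and Time columns
parts <- lapply(ids, function(id) {
  cols <- grep(paste0("^", id, "_"), names(df), value = TRUE)
  block <- df[cols]
  names(block) <- sub("^[^_]*_", "", cols)
  data.frame(ID = id, Time = df$Time, block)
})

# Stack the per-ID blocks
out <- do.call(rbind, parts)
out
```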
