Let's say I have:
# Create the component vectors
x1 <- c("g", "a","c","d","e","f","h", "b")
x2 <- c(1,1,1,1,1,1,1,1)
x7 <- c(10,10,10,10,10,10, 10, 10)
# Join the variables to create a data frame
dataframeA <- data.frame(x1, x2, x7)
# Create the component vectors
x3 <- c("z", "k" ,"a", "b","c","d")
x4 <- c(5, 19, 6,7,8,9)
# Join the variables to create a data frame
dataframeB <- data.frame(x3, x4)
And I want to substitute values in column x2 of dataframeA with the values of column x4 from dataframeB, based on matching against a vector, so that dataframeA becomes as shown below:
matchingids = c("a", "b")
dataframeA$x2[which(dataframeA$x1 %in% matchingids)] <- dataframeB$x4[which(dataframeB$x3 %in% matchingids)]
dataframeA becomes:
structure(list(x1 = c("g", "a", "c", "d", "e", "f", "h", "b"),
x2 = c(1, 6, 1, 1, 1, 1, 1, 7), x7 = c(10, 10, 10, 10, 10,
10, 10, 10)), row.names = c(NA, -8L), class = "data.frame")
Which works, but then,
# Create the component vectors
x1 <- c("g", "a","c","d","e","f","h", "b")
x2 <- c(1,1,1,1,1,1,1,1)
x7 <- c(10,10,10,10,10,10, 10, 10)
# Join the variables to create a data frame
dataframeA <- data.frame(x1, x2, x7)
(here I changed the order of "b" and "a")
# Create the component vectors
x3 <- c("z", "k" ,"b", "a","c","d")
x4 <- c(5, 19, 6,7,8,9)
# Join the variables to create a data frame
dataframeB <- data.frame(x3, x4)
matchingids = c("a", "b")
dataframeA$x2[which(dataframeA$x1 %in% matchingids)] <- dataframeB$x4[which(dataframeB$x3 %in% matchingids)]
which gives:
structure(list(x1 = c("g", "a", "c", "d", "e", "f", "h", "b"),
x2 = c(1, 6, 1, 1, 1, 1, 1, 7), x7 = c(10, 10, 10, 10, 10,
10, 10, 10)), row.names = c(NA, -8L), class = "data.frame")
This does not work: it substitutes the value for "a" in the first data frame with the value for "b" from the second data frame, because the assignment is positional and the order of the matches differs between the two frames. In this second case, "a" should become 7 and "b" should become 6, yet as you can observe, I get the same result as before even though I changed the position of "a" in dataframeB.
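For reference, one order-safe way to do this kind of replacement is base match(), which looks up, for each value of dataframeA$x1, the position of the matching value in dataframeB$x3, so the replacement values stay aligned no matter how the rows are ordered. A minimal sketch:
# match() gives, for each x1 value, the row of its match in x3 (NA if none)
idx <- match(dataframeA$x1, dataframeB$x3)
# replace only the ids of interest that actually have a match
keep <- dataframeA$x1 %in% matchingids & !is.na(idx)
dataframeA$x2[keep] <- dataframeB$x4[idx[keep]]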
This seems like a merge/join operation.
### base R
merge(dataframeA, subset(dataframeB, x3 %in% matchingids),
by.x="x1", by.y="x3", all.x=TRUE) |>
transform(x2 = ifelse(is.na(x4), x2, x4)) |>
subset(select = -x4)
# x1 x2 x7
# 1 a 6 10
# 2 b 7 10
# 3 g 1 10
# 4 c 1 10
# 5 d 1 10
# 6 e 1 10
# 7 f 1 10
# 8 h 1 10
### dplyr
library(dplyr)
filter(dataframeB, x3 %in% matchingids) %>%
right_join(dataframeA, by = c("x3"="x1")) %>%
mutate(x2 = coalesce(x4, x2)) %>%
select(-x4)
# x3 x2 x7
# 1 a 6 10
# 2 b 7 10
# 3 g 1 10
# 4 c 1 10
# 5 d 1 10
# 6 e 1 10
# 7 f 1 10
# 8 h 1 10
(FYI, base::merge doesn't do a good job preserving the original order. If it is very important, I suggest you preface that code with adding a row-number field, then sorting post-merge on that field. Adding sort=FALSE to base::merge does not solve it for me.)
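A minimal sketch of that row-number idea (origorder is just an illustrative helper column, not part of the original code):
# add a row-number field before merging, sort on it afterwards, then drop it
dataframeA$origorder <- seq_len(nrow(dataframeA))
merged <- merge(dataframeA, subset(dataframeB, x3 %in% matchingids),
                by.x = "x1", by.y = "x3", all.x = TRUE)
merged <- merged[order(merged$origorder), ]
merged$origorder <- NULL
# the transform()/subset() steps from above then apply as before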
Similarly with the second set of frames:
merge(dataframeA, subset(dataframeB, x3 %in% matchingids),
by.x="x1", by.y="x3", all.x=TRUE) |>
transform(x2 = ifelse(is.na(x4), x2, x4)) |>
subset(select = -x4)
# x1 x2 x7
# 1 a 7 10
# 2 b 6 10
# 3 g 1 10
# 4 c 1 10
# 5 d 1 10
# 6 e 1 10
# 7 f 1 10
# 8 h 1 10
filter(dataframeB, x3 %in% matchingids) %>%
right_join(dataframeA, by = c("x3"="x1")) %>%
mutate(x2 = coalesce(x4, x2)) %>%
select(-x4)
# x3 x2 x7
# 1 b 6 10
# 2 a 7 10
# 3 g 1 10
# 4 c 1 10
# 5 d 1 10
# 6 e 1 10
# 7 f 1 10
# 8 h 1 10
Note: the |> pipe requires R 4.1 or later. If you're on an earlier version, you'll need to shift to using intermediate objects.
For more discussions about the concepts of merge/join, see: How to join (merge) data frames (inner, outer, left, right), What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?, (pandas) Pandas Merging 101. It's a very powerful process and can pay huge dividends once you become more comfortable with using it.
It works with:
a$x2[order(a$x1)][which(a$x1[order(a$x1)] %in% matchingids)] <- b$x4[order(b$x3)][which(b$x3[order(b$x3)] %in% matchingids)]
But there might be problems with it, mainly when matchingids contains IDs that don't appear in dataframeA, in dataframeB, or in either. It will also fail if the number of matched IDs differs between the two data frames. It might only work when dataframeA and dataframeB both contain all of matchingids.
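A guard along these lines would make those failure modes explicit instead of silent; a minimal sketch:
# stop early if the assumptions behind the order() trick don't hold
stopifnot(
  all(matchingids %in% dataframeA$x1),
  all(matchingids %in% dataframeB$x3),
  sum(dataframeA$x1 %in% matchingids) == sum(dataframeB$x3 %in% matchingids)
)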
I have a dataset with over 10,000 columns and 10,000 rows. I am trying to add values of rows based on column names.
The dataset looks something like this
library(tibble)
data <- tibble(date = c('1/1/2018','2/1/2018','3/1/2018'),
               x1 = c(1, 11, 111),
               x2 = c(2, 22, 222),
               x1_1 = c(3, 333, 333),
               x2_1 = c(4, 44, 44),
               x1_2 = c(5, 55, 555),
               x2_2 = c(6, 66, 666))
I am trying to create a new table which includes the date column, an x1 column and an x2 column where the value of x1 for row 1 = 1+3+5, value of x2 for row 2 = 22+44+66, etc.
Any help would be much appreciated.
Here's a for loop approach. I use stringr but we could just as easily use base regex functions to keep it dependency-free.
library(stringr)
name_stems = unique(str_replace(names(data)[-1], "_.*", ""))
result = data[, "date", drop = FALSE]
for(i in seq_along(name_stems)) {
result[[name_stems[i]]] =
rowSums(data[
str_detect(
names(data),
pattern = paste0("^", name_stems[i], "(_|$)") # also match the bare "x1"/"x2" columns, but not e.g. "x10"
)
])
}
result
# # A tibble: 3 × 3
# date x1 x2
# <chr> <dbl> <dbl>
# 1 1/1/2018 9 12
# 2 2/1/2018 399 132
# 3 3/1/2018 999 932
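As mentioned, the same loop works dependency-free with base regex functions; a sketch (note the anchored pattern, so a stem like x1 won't also catch a hypothetical x10 column):
name_stems <- unique(sub("_.*", "", names(data)[-1]))
result <- data[, "date", drop = FALSE]
for (stem in name_stems) {
  # "^x1(_|$)" matches "x1", "x1_1", ... but not "x10" or "x2_1"
  result[[stem]] <- rowSums(data[grepl(paste0("^", stem, "(_|$)"), names(data))])
}
result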
Using data.table:
library(data.table)
baseCols <- paste0('x', 1:2)
result <- setDT(data) |> melt(measure.vars = patterns(baseCols), value.name = baseCols)
result[, lapply(.SD, sum), by=.(date), .SDcols=baseCols]
## date x1 x2
## 1: 1/1/2018 9 12
## 2: 2/1/2018 399 132
## 3: 3/1/2018 999 932
Your data is in the wide format. One way of achieving your goal is to transform the data into the long format, group the rows by index (x1 and x2), compute the sum for each group on each date, and finally transform the result back to the wide format, creating columns based on the indices.
library(tidyverse)
data |>
pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
mutate(xgroup = substr(name, 1,2)) |>
group_by(date,xgroup) |>
summarise(xsums = sum(x.values)) |>
pivot_wider(values_from = xsums, names_from = xgroup )
# date x1 x2
# <chr> <dbl> <dbl>
#1 1/1/2018 9 12
#2 2/1/2018 399 132
#3 3/1/2018 999 932
Updates
In order to include only column x1 and the x1_ columns, and exclude any other column even if it starts with x1, the following regular expression pattern can be used: "x1$|(x1_).*". A similar pattern can be used to include only column x2 and the x2_ columns. For example:
s <- c("x100_1", "x10", "x1", "x1_1", "x1_2", "x2", "x2_1", "x2_2", "x20", "x20_1")
s
#[1] "x100_1" "x10" "x1" "x1_1" "x1_2" "x2" "x2_1" "x2_2" "x20"
#[10] "x20_1"
s |> str_extract("x1$|(x1_).*")
#[1] NA NA "x1" "x1_1" "x1_2" NA NA NA NA NA
s |> str_extract("x2$|(x2_).*")
#[1] NA NA NA NA NA "x2" "x2_1" "x2_2" NA NA
This pattern can then be used to create a group that consists of x1 and x1_ columns only and another group that consists of x2 and x2_ columns only.
Here is the full code:
data |>
pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
mutate(xgroup = case_when(str_detect(name, "x1$|(x1_).*")~"x1",
str_detect(name, "x2$|(x2_).*")~"x2")) |>
group_by(date,xgroup) |>
summarise(xsums = sum(x.values)) |>
pivot_wider(values_from = xsums, names_from = xgroup )
I am currently trying to find a way to identify unique column values in otherwise duplicate rows in a dataset.
My dataset has the following properties:
The dataset's columns comprise an identifier variable (ID) and a large number of response variables (x1 - xn).
Each row should represent one individual, meaning the values in the ID column should all be unique (and not repeated).
Some rows are duplicated, with repeated entries in the ID column and seemingly identical response item values (x1 - xn). However, the dataset is too large to get a full overview of all variables.
As demonstrated in the code below, if rows are truly identical for all variables, then the duplicate row can be removed with the dplyr::distinct() function. In my case, not all "duplicate" rows are removed by distinct(), which can only mean that not all entries are identical.
I want to find a way to identify which entries are unique in these otherwise duplicate rows.
Example:
library(dplyr)
library(janitor)
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = c("a", "a", "b", "b", "c", "d"),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2),
"x5" = c("x", "p", "y", "y", "z", "q"),
"x6" = rep(letters[7:9], each = 2)
)
# The dataframe with all entries
df
A data.frame: 6 × 7
ID x1 x2 x3 x4 x5 x6
1 4 a 7 d x g
1 4 a 10 d p g
2 5 b 8 e y h
2 5 b 8 e y h
3 6 c 9 f z i
3 6 d 11 f q i
# The dataframe
df %>%
# with duplicates removed
distinct() %>%
# filtered for columns only containing duplicates in the ID column
janitor::get_dupes(ID)
ID dupe_count x1 x2 x3 x4 x5 x6
1 2 4 a 7 d x g
1 2 4 a 10 d p g
3 2 6 c 9 f z i
3 2 6 d 11 f q i
In the example above I demonstrate how dplyr::distinct() will remove fully duplicate rows (ID = 2), but not rows that are different in some columns (rows where ID = 1 and 3, and columns x2, x3 and x5).
What I want is an overview over which columns that are not duplicates for each value:
df %>%
distinct() %>%
janitor::get_dupes(ID) %>%
# Here I want a way to find columns with unidentical entries:
find_nomatch()
ID x2 x3 x5
 1      7  x
 1     10  p
 3   c   9  z
 3   d  11  q
A data.table alternative. Coerce data frame to a data.table (setDT). Melt data to long format (melt(df, id.vars = "ID")).
Within each group defined by 'ID' and 'variable' (corresponding to the columns in the wide format) (by = .(ID, variable)), count number of unique values (uniqueN(value)) and check if it's equal to the number of rows in the subgroup (== .N). If so (if), select the entire subgroup (.SD).
Finally, reshape the data back to wide format (dcast).
library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
A bit simpler than yours, I think:
library(dplyr)
library(janitor)
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = c("a", "a", "b", "b", "c", "d"),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2),
"x5" = c("x", "p", "y", "y", "z", "q"),
"x6" = rep(letters[7:9], each = 2)
)
d <- df %>%
distinct() %>%
janitor::get_dupes(ID)
d %>%
group_by(ID) %>%
# Check for each ID which row elements differ from those of the first row
group_map(\(.x, .id) apply(.x, 1, \(.y) .x[1, ] != .y))%>%
do.call(what = cbind) %>% # Bind results for all ids
apply(1, any) %>% # return true if there are differences anywhere
c(T, .) %>% # Keep id column
`[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-18 by the reprex package (v2.0.1)
Edit
d %>%
group_by(ID) %>%
# Check for each ID which row elements differ from those of the first row
group_map(\(.x, .id) apply(.x, 1, \(.y) !Vectorize(identical)(unlist(.x[1, ]), .y))) %>%
do.call(what = cbind) %>% # Bind results for all ids
apply(1, any) %>% # return true if there are differences anywhere
c(T, .) %>% # Keep id column
`[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-19 by the reprex package (v2.0.1)
I have been working on this issue for some time and I found a solution, though it took more steps than I would've thought necessary. I can only presume there's a more elegant solution out there. Anyway, this should work:
df <- df %>%
distinct() %>%
janitor::get_dupes(ID)
# Make vector of unique values from the duplicated ID values
l <- distinct(df, ID) %>% unlist()
# Lapply on each ID
df <- lapply(
l,
function(x) {
# Filter rows for the duplicated ID
dplyr::filter(df, ID == x) %>%
# Transpose dataframe (converts it into a matrix)
t() %>%
# Convert back to data frame
as.data.frame() %>%
# Filter columns that are not identical
dplyr::filter(!if_all(everything(), ~ . == V1)) %>%
# Transpose back
t() %>%
# Convert back to data frame
as.data.frame()
}
) %>%
# Bind the dataframes in the list together
bind_rows() %>%
# Finally the columns are moved back in ascending order
relocate(x2, .before = x3)
#Remove row names (not necessary)
row.names(df) <- NULL
df
A data.frame: 4 × 3
x2 x3 x5
NA 7 x
NA 10 p
c 9 z
d 11 q
Feel free to comment
If you just want to keep the first instance of each identifier:
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = rep(letters[1:3], each = 2),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2)
)
df %>%
distinct(ID, .keep_all = TRUE)
Output:
ID x1 x2 x3 x4
1 1 4 a 7 d
2 2 5 b 8 e
3 3 6 c 9 f
Suppose I have a large data.table that looks like dt below.
dt <- data.table(
player_1 = c("a", "b", "b", "c"),
player_1_age = c(10, 20, 20, 30),
player_2 = c("b", "a", "c", "a"),
player_2_age = c(20, 10, 30, 10)
)
# dt
# player_1 player_1_age player_2 player_2_age
# 1: a 10 b 20
# 2: b 20 a 10
# 3: b 20 c 30
# 4: c 30 a 10
From the dt above, I would like to create a data.table with unique players and their age like the following, player_dt:
# player_dt
# player age
# a 10
# b 20
# c 30
To do so, I've tried the code below, but it takes too long on my larger dataset, probably because I am creating a data.table for each iteration of sapply.
How would you get the player_dt above, while checking for each player that there is only one unique age value?
# get unique players
player <- sort(unique(c(dt$player_1, dt$player_2)))
# for each player, get their age, if there is only one age value
age <- sapply(player, function(x) {
unique_values <- unique(c(
dt[player_1 == x][["player_1_age"]],
dt[player_2 == x][["player_2_age"]]))
if(length(unique_values) > 1) stop() else return(unique_values)
})
# combine to create the player_dt
player_dt <- data.table(player, age)
I use the data from @DavidT as input.
dt
# player_1 player_1_age player_2 player_2_age
#1: a 10 b 20
#2: b 20 a 10
#3: b 20 c 30
#4: c 30 a 11 # <--
TL;DR
You can do
nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]
out <-
unique(melt(
dt,
measure.vars = list(colsAge, colsOther),
value.name = c("age", "player")
)[, .(age, player)])[, if (.N == 1) # credit: https://stackoverflow.com/a/34427944/8583393
.SD, by = player]
out
# player age
#1: b 20
#2: c 30
Step-by-step
What you can do is melt multiple columns simultaneously: those that end with "age" and those that don't.
nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]
dt1 <- melt(dt, measure.vars = list(colsAge, colsOther), value.name = c("age", "player"))
The result is
dt1
# variable age player
#1: 1 10 a
#2: 1 20 b
#3: 1 20 b
#4: 1 30 c
#5: 2 20 b
#6: 2 10 a
#7: 2 30 c
#8: 2 11 a
Now we call unique ...
out <- unique(dt1[, .(age, player)])
out
# age player
#1: 10 a
#2: 20 b
#3: 30 c
#4: 11 a
... and filter for groups of "player" with length equal to 1
out <- out[, if(.N == 1) .SD, by=player]
out
# player age
#1: b 20
#2: c 30
Given OP's input data, that last step is not needed.
data
library(data.table)
dt <- data.table(
player_1 = c("a", "b", "b", "c"),
player_1_age = c(10, 20, 20, 30),
player_2 = c("b", "a", "c", "a"),
player_2_age = c(20, 10, 30, 11)
)
Reference: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html
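As a side note, a reshape-free sketch of the same idea: stack the two player/age column pairs with rbindlist, deduplicate, and keep only players with a single age (this assumes data.table is loaded and dt is defined as above):
stacked <- unique(rbindlist(list(
  dt[, .(player = player_1, age = player_1_age)],
  dt[, .(player = player_2, age = player_2_age)]
)))
# keep only players whose age is unambiguous
player_dt <- stacked[, if (.N == 1) .SD, by = player]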
I've altered your data so that there's at least one error to catch:
library(tidyverse)
dt <- tibble(
player_1 = c("a", "b", "b", "c"),
player_1_age = c(10, 20, 20, 30),
player_2 = c("b", "a", "c", "a"),
player_2_age = c(20, 10, 30, 11)
)
# Get the Names columns and the Age columns
colName <- names(dt)
ageCol <- colName[str_detect(colName, "age$")]
playrCol <- colName[! str_detect(colName, "age$")]
# Gather the Ages
ages <- dt %>%
select(ageCol) %>%
gather(player_age, age)
# Gather the names
names <- dt %>%
select(playrCol ) %>%
gather(player_name, name)
# Bind the two together, and throw out the duplicates
# If there are no contradictions, this is what you want.
allNameAge <- cbind( names, ages) %>%
select(name, age) %>%
distinct() %>%
arrange(name)
# But check for inconsistencies. This should leave you with
# an empty tibble, but instead it shows the error.
inconsistencies <- allNameAge %>%
group_by(name) %>%
mutate(AGE.COUNT = n_distinct(age)) %>%
filter(AGE.COUNT > 1) %>%
ungroup()
This should extend to more name/age column pairs.
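As a side note, gather() has since been superseded by tidyr::pivot_longer(); a sketch of the same stacking idea (the rename step just gives the columns a name_<slot>/age_<slot> shape that pivot_longer can split):
library(dplyr)
library(tidyr)
long <- dt %>%
  rename(name_1 = player_1, age_1 = player_1_age,
         name_2 = player_2, age_2 = player_2_age) %>%
  pivot_longer(everything(), names_to = c(".value", "slot"), names_sep = "_")
allNameAge <- long %>% distinct(name, age) %>% arrange(name)
# any name with more than one distinct age is an inconsistency
inconsistencies <- allNameAge %>%
  group_by(name) %>%
  filter(n_distinct(age) > 1) %>%
  ungroup()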
I would like to build a new variable by summing values under multiple conditions (plus an extra constraint, see below). Here is R code that leads up to my current issue.
# The raw dataframe
area <- c("A", "A", "B", "A", "C", "B", "A", "B", "A", "C")
varclass <- c("Z1", "Z1", "Z1", "Z2", "Z1", "Z1", "Z2", "Z1", "Z2", "Z2")
count <- c(45, 56, 2, 8, 345, 3, 98, 2, 6, 9)
df1 <- data.frame(area,
varclass,
count,
stringsAsFactors = FALSE)
df1
# See how df1 looks like...
# area varclass count
#1 A Z1 45
#2 A Z1 56
#3 B Z1 2
#4 A Z2 8
#5 C Z1 345
#6 B Z1 3
#7 A Z2 98
#8 B Z1 2
#9 A Z2 6
#10 C Z2 9
# Building the final dataframe
df2 <- data.frame(unique(df1$area),
stringsAsFactors = FALSE)
names(df2)[1] <- "area"
# See how df2 looks like...
# area
#1 A
#2 B
#3 C
# The new variable to build
df2$Z1_sum <- sum(df1[df1$varclass == "Z1" & df1$area == df2$area,]$count)
# doesn't work
# See what I hope
# area Z1_sum
#1 A 101
#2 B 7
#3 C 345
As you can see in the last line, I would like to build a new variable, Z1_sum, in the df2 data frame. Z1_sum is the sum of count from the df1 data frame where varclass == "Z1" and df1$area matches the value of df2$area in the current row (in MS Excel, that would mean using an LC1 or $A2 cell reference).
Please note that I'm not looking for solutions that directly build df2 from df1 using a group-by operation or the dcast function... I only want a formula which returns the correct values in my new column. That's my extra condition. Why? Because next I have to build other variables with more sophisticated formulas than a simple sum. By understanding how to perform such conditional operations, I hope to move on...
Thanks for your help.
Jeff
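For reference, a literal row-wise take on that "formula" idea, as a sketch with base sapply() (no grouping or reshaping of df1 involved):
# for each row of df2, evaluate the conditional sum against df1
df2$Z1_sum <- sapply(df2$area, function(a)
  sum(df1$count[df1$varclass == "Z1" & df1$area == a]))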
Do you mean:
df2 <- setNames(
aggregate(
count ~ area,
df1[df1$varclass == "Z1", ],
sum
),
c("area", "Z1_sum")
)
df2
area Z1_sum
1 A 101
2 B 7
3 C 345
or
df2$Z1_sum <- aggregate(count ~ area, df1[df1$varclass == "Z1", ], sum)$count
Edit to address your comment.
Try with:
df2 <- aggregate(
count ~ area + varclass,
df1,
sum
)
that will give you your data in the "long" format:
df2
area varclass count
1 A Z1 101
2 B Z1 7
3 C Z1 345
4 A Z2 112
5 C Z2 9
Now you need to reshape it to the "wide" format using something like:
df2 <- xtabs(count ~ area + varclass, df2)
varclass
area Z1 Z2
A 101 112
B 7 0
C 345 9
or:
df2 <- reshape(df2, idvar = "area", timevar = "varclass", direction = "wide")
area count.Z1 count.Z2
1 A 101 112
2 B 7 NA
3 C 345 9
Just subset based on whichever Z* you want counted in your final sum.
df1Z1 <- df1[df1$varclass %in% c("Z1"), ]
aggregate(count ~ area, data = df1Z1, FUN = sum)
area count
1 A 101
2 B 7
3 C 345
You can get your desired result using dplyr:
library(dplyr)
df2 <- group_by(df1, area) %>%
  filter(varclass == "Z1") %>%
  summarize(Z1_sum = sum(count))
df2
#> # A tibble: 3 x 2
#> area Z1_sum
#> <chr> <dbl>
#> 1 A 101
#> 2 B 7
#> 3 C 345
The dplyr verbs should be pretty self-explanatory, and %>% is the pipe operator, taking the output of one function and making it the first input to the next. group_by here groups by the column area, so when we calculate the sum (in summarize) it's the sum for each area group. The filter subsets the data.
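For instance, the pipeline above is equivalent to this nested call (a sketch, with the pipe removed):
df2 <- summarize(
  group_by(filter(df1, varclass == "Z1"), area),
  Z1_sum = sum(count)
)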