I have a dataset with over 10,000 columns and 10,000 rows. I am trying to add values of rows based on column names.
The dataset looks something like this
data <- tibble(date = c('1/1/2018','2/1/2018','3/1/2018'),
x1 = c(1, 11, 111),
x2 = c(2, 22, 222),
x1_1 = c(3, 333, 333),
x2_1 = c(4, 44, 44),
x1_2 = c(5, 55, 555),
x2_2 = c(6, 66, 666),)
I am trying to create a new table which includes the date column, an x1 column and an x2 column where the value of x1 for row 1 = 1+3+5, value of x2 for row 2 = 22+44+66, etc.
Any help would be much appreciated.
Here's a for loop approach. I use stringr but we could just as easily use base regex functions to keep it dependency-free.
library(stringr)
name_stems = unique(str_replace(names(data)[-1], "_.*", ""))
result = data[, "date", drop = FALSE]
for(i in seq_along(name_stems)) {
result[[name_stems[i]]] =
rowSums(data[
str_detect(
names(data),
pattern = paste0(name_stems[i], "_")
)
])
}
result
# # A tibble: 3 × 3
# date x1 x2
# <chr> <dbl> <dbl>
# 1 1/1/2018 9 12
# 2 2/1/2018 399 132
# 3 3/1/2018 999 932
Using data.table:
baseCols <- paste0('x', 1:2)
result <- setDT(data) |> melt(measure.vars = patterns(baseCols), value.name = baseCols)
result[, lapply(.SD, sum), by=.(date), .SDcols=baseCols]
## date x1 x2
## 1: 1/1/2018 9 12
## 2: 2/1/2018 399 132
## 3: 3/1/2018 999 932
Your data is in the wide format. One way of achieving your goal is transforming the data into the long format, then grouping them based on indices (x1 and x2), compute the sums for each group for each date, and finally transform the results back to the wide formats to create columns based on the indices.
library(tidyverse)
data |>
pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
mutate(xgroup = substr(name, 1,2)) |>
group_by(date,xgroup) |>
summarise(xsums = sum(x.values)) |>
pivot_wider(values_from = xsums, names_from = xgroup )
# date x1 x2
# <chr> <dbl> <dbl>
#1 1/1/2018 9 12
#2 2/1/2018 399 132
#3 3/1/2018 999 932
Updates
In order to include only columns x1 and x1_, and exclude any other column even though it starts with x1, the following regular expression pattern can be used : "x1$|(x1_).*". The similar pattern can be used to include only columns x2 and x2_. For example:
s <- c("x100_1", "x10", "x1", "x1_1", "x1_2", "x2", "x2_1", "x2_2", "x20", "x20_1")
s
#[1] "x100_1" "x10" "x1" "x1_1" "x1_2" "x2" "x2_1" "x2_2" "x20"
#[10] "x20_1"
s |> str_extract("x1$|(x1_).*")
#[1] NA NA "x1" "x1_1" "x1_2" NA NA NA NA NA
s |> str_extract("x2$|(x2_).*")
#[1] NA NA NA NA NA "x2" "x2_1" "x2_2" NA NA
This pattern can then be used to create a group that consists of x1 and x1_ columns only and another group that consists of x2 and x2_ columns only.
Here is the full code:
data |>
pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
mutate(xgroup = case_when(str_detect(name, "x1$|(x1_).*")~"x1",
str_detect(name, "x2$|(x2_).*")~"x2")) |>
group_by(date,xgroup) |>
summarise(xsums = sum(x.values)) |>
pivot_wider(values_from = xsums, names_from = xgroup )
Related
I am currently trying to find a way to find unique column values in otherwise duplicate rows in a dataset.
My dataset has the following properties:
The dataset's columns comprise an identifier variable (ID) and a large number of response variables (x1 - xn).
Each row should represent one individual, meaning the values in the ID column should all be unique (and not repeated).
Some rows are duplicated, with repeated entries in the ID column and seemingly identical response item values (x1 - xn). However, the dataset is too large to get a full overview over all variables.
As demonstrated in the code below, if rows are truly identical for all variables, then the duplicate row can be removed with the dplyr::distinct() function. In my case, not all "duplicate" rows are removed by distinct(), which can only mean that not all entries are identical.
I want to find a way to identify which entries are unique in these otherwise duplicate rows.
Example:
library(dplyr)
library(janitor)
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = c("a", "a", "b", "b", "c", "d"),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2),
"x5" = c("x", "p", "y", "y", "z", "q"),
"x6" = rep(letters[7:9], each = 2)
)
# The dataframe with all entries
df
A data.frame: 6 × 7
ID x1 x2 x3 x4 x5 x6
1 4 a 7 d x g
1 4 a 10 d p g
2 5 b 8 e y h
2 5 b 8 e y h
3 6 c 9 f z i
3 6 d 11 f q i
# The dataframe
df %>%
# with duplicates removed
distinct() %>%
# filtered for columns only containing duplicates in the ID column
janitor::get_dupes(ID)
ID dupe_count x1 x2 x3 x4 x5 x6
1 2 4 a 7 d x g
1 2 4 a 10 d p g
3 2 6 c 9 f z i
3 2 6 d 11 f q i
In the example above I demonstrate how dplyr::distinct() will remove fully duplicate rows (ID = 2), but not rows that are different in some columns (rows where ID = 1 and 3, and columns x2, x3 and x5).
What I want is an overview over which columns that are not duplicates for each value:
df %>%
distinct() %>%
janitor::get_dupes(ID) %>%
# Here I want a way to find columns with unidentical entries:
find_nomatch()
ID x2 x3 x5
1 7 x
1 10 p
3 c 9 z
3 d 11 q
A data.table alternative. Coerce data frame to a data.table (setDT). Melt data to long format (melt(df, id.vars = "ID")).
Within each group defined by 'ID' and 'variable' (corresponding to the columns in the wide format) (by = .(ID, variable)), count number of unique values (uniqueN(value)) and check if it's equal to the number of rows in the subgroup (== .N). If so (if), select the entire subgroup (.SD).
Finally, reshape the data back to wide format (dcast).
library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
A bit more simple than yours I think:
library(dplyr)
library(janitor)
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = c("a", "a", "b", "b", "c", "d"),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2),
"x5" = c("x", "p", "y", "y", "z", "q"),
"x6" = rep(letters[7:9], each = 2)
)
d <- df %>%
distinct() %>%
janitor::get_dupes(ID)
d %>%
group_by(ID) %>%
# Check for each id which row elements are different from the of the first
group_map(\(.x, .id) apply(.x, 1, \(.y) .x[1, ] != .y))%>%
do.call(what = cbind) %>% # Bind results for all ids
apply(1, any) %>% # return true if there are differences anywhere
c(T, .) %>% # Keep id column
`[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-18 by the reprex package (v2.0.1)
Edit
d %>%
group_by(ID) %>%
# Check for each id which row elements are different from the of the first
group_map(\(.x, .id) apply(.x, 1, \(.y) !Vectorize(identical)(unlist(.x[1, ]), .y))) %>%
do.call(what = cbind) %>% # Bind results for all ids
apply(1, any) %>% # return true if there are differences anywhere
c(T, .) %>% # Keep id column
`[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-19 by the reprex package (v2.0.1)
I have been working on this issue for some time and I found a solution, though it tooks more step than I would've though necessary. I can only presume there's a more elegant solution out there. Anyway, this should work:
df <- df %>%
distinct() %>%
janitor::get_dupes(ID)
# Make vector of unique values from the duplicated ID values
l <- distinct(df, ID) %>% unlist()
# Lapply on each ID
df <- lapply(
l,
function(x) {
# Filter rows for the duplicated ID
dplyr::filter(df, ID == x) %>%
# Transpose dataframe (converts it into a matrix)
t() %>%
# Convert back to data frame
as.data.frame() %>%
# Filter columns that are not identical
dplyr::filter(!if_all(everything(), ~ . == V1)) %>%
# Transpose back
t() %>%
# Convert back to data frame
as.data.frame()
}
) %>%
# Bind the dataframes in the list together
bind_rows() %>%
# Finally the columns are moved back in ascending order
relocate(x2, .before = x3)
#Remove row names (not necessary)
row.names(df) <- NULL
df
A data.frame: 4 × 3
x2 x3 x5
NA 7 x
NA 10 p
c 9 z
d 11 q
Feel free to comment
If you just want to keep the first instance of each identifier:
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = rep(letters[1:3], each = 2),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2)
)
df %>%
distinct(ID, .keep_all = TRUE)
Output:
ID x1 x2 x3 x4
1 1 4 a 7 d
2 2 5 b 8 e
3 3 6 c 9 f
Hope the title isn't too confusing. But what im trying to do is match values associated with a variable name (from a data frame) to variables in a data frame inside a list... and then append the list with the matched values.
For example, if I have a data frame that contains 2 columns, i.e., the variable name and values, like so:
# create a dataframe
df <- data.frame(
variable = paste0("x", 1:10),
value = sample(1:20, 10, replace = T)
)
df
> df
variable value
1 x1 2
2 x2 8
3 x3 5
4 x4 11
5 x5 1
6 x6 13
7 x7 16
8 x8 8
9 x9 20
10 x10 13
and then if I have another data frame inside a list like:
# create a list
myList <- list(
newDf = data.frame(
var = c("x1", NA, NA, "x5", "x4", NA, "x3")
)
)
What im trying to do is to match the value for a variable from df to the variables in newDf in my list and then add the value to the corresponding variable in newDf. For clarity, my expected result for the above example would look like:
$newDf
var value
1 x1 2
2 <NA> NA
3 <NA> NA
4 x5 1
5 x4 11
6 <NA> NA
7 x3 5
So, we can see from df the value for x1 is 2, and since x1 is in newDf we add that value to the list, the value for x5 is 1, so we add it to the list, etc
Any suggestions as to how I could do this?
You may use match -
lapply(myList, function(x) transform(x,value = df$value[match(var,df$variable)]))
#$newDf
# var value
#1 x1 2
#2 <NA> NA
#3 <NA> NA
#4 x5 1
#5 x4 11
#6 <NA> NA
#7 x3 5
Or merge -
lapply(myList, function(x) merge(x, df, by.x = 'var', by.y = 'variable', all.x = TRUE))
If the length(myList) is always 1 you may skip lapply and perform this operation directly on myList[[1]].
This may not be the exact answer to the question. First the data.frame is stringsAsFactors=FALSE. Then we may find by the index of df which is the same as the list.
df <- data.frame(
variable = paste0("x", 1:10),
value = c(2, 8, 5, 11, 1, 13, 16, 8, 20, 13), stringsAsFactors = FALSE
)
myList <- list(
newDf = data.frame(
var = c("x1", NA, NA, "x5", "x4", NA, "x3"), stringsAsFactors = FALSE
)
)
s=lapply( 1:7, function(x) df[which( myList[['newDf']][x,]==df[, 'variable']), ] )
do.call(rbind.data.frame, s)
I have a dataframe with two columns per sample (n > 1000 samples):
df <- data.frame(
"sample1.a" = 1:5, "sample1.b" = 2,
"sample2.a" = 2:6, "sample2.b" = c(1, 3, 3, 3, 3),
"sample3.a" = 3:7, "sample3.b" = 2)
If there is a zero in column .b, the correspsonding value in column .a should be set to NA.
I thought to write a function over colnames (without suffix) to filter each pair of columns and conditional exchaning values. Is there a simpler approach based on tidyverse?
We can split the data.frame into a list of data.frames and do the replacement in base R
df1 <- do.call(cbind, lapply(split.default(df,
sub("\\..*", "", names(df))), function(x) {
x[,1][x[2] == 0] <- NA
x}))
Or another option is Map
acols <- endsWith(names(df), "a")
bcols <- endsWith(names(df), "b")
df[acols] <- Map(function(x, y) replace(x, y == 0, NA), df[acols], df[bcols])
Or if the columns are alternate with 'a', 'b' columns, use a logical index for recycling, create the logical matrix with 'b' columns and assign the corresponding values in 'a' columns to NA
df[c(TRUE, FALSE)][df[c(FALSE, TRUE)] == 0] <- NA
or an option with tidyverse by reshaping into 'long' format (pivot_longer), changing the 'a' column to NA if there is a correspoinding 0 in 'a', and reshape back to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_sep="\\.",
names_to = c('group', '.value')) %>%
mutate(a = na_if(b, a == 0)) %>%
pivot_wider(names_from = group, values_from = c(a, b)) %>%
select(-rn)
# A tibble: 5 x 6
# a_sample1 a_sample2 a_sample3 b_sample1 b_sample2 b_sample3
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2 1 2 2 1 2
#2 2 3 2 2 3 2
#3 2 3 2 2 3 2
#4 2 3 2 2 3 2
#5 2 3 2 2 3 2
This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 3 years ago.
An example is below. How can I transform a dataframe df with column names to the form of df.transformed below?
> df <- data.frame("names" = c("y1", "y2"), "x1" = 1:2, "x2" = 4:5)
> df
names x1 x2
1 y1 1 4
2 y2 2 5
> df.transformed <- data.frame("y1x1" = 1, "y1x2" =4, "y2x1" = 2, "y2x2" = 5)
> df.transformed
y1x1 y1x2 y2x1 y2x2
1 1 4 2 5
Code
require(data.table); setDT(df)
dt = melt(df, id.vars = 'names')[, col := paste0(variable, names)]
out = dt$value; names(out) = dt$col
Result
> data.frame(t(out))
x1y1 x1y2 x2y1 x2y2
1 2 4 5
You can achieve this in base R. This should work for any data frame size. The idea is combine Reduce with outer to build the data frame column names.
df <- data.frame("names" = c("y1", "y2"), "x1" = 1:2, "x2" = 4:5)
df_names <- outer(df[,1], names(df[,-1]), paste0)
df.transformed <- as.data.frame(matrix(,ncol = nrow(df)*ncol(df[,-1]), nrow = 0))
names(df.transformed) <- Reduce(`c`,t(df_names))
df.transformed[1,] <- Reduce(`c`,t(df[-1]))
Output
# y1x1 y1x2 y2x1 y2x2
# 1 4 2 5
You can do this in one line with the new tidyr::pivot_wider. Setting multiple columns for values means names will get pasted together for assignment.
library(tidyr)
pivot_wider(df, names_from = names, values_from = c(x1, x2), names_sep = "")
#> # A tibble: 1 x 4
#> x1y1 x1y2 x2y1 x2y2
#> <int> <int> <int> <int>
#> 1 1 2 4 5
However, the column names ("x1", "x2") come first. If you need to swap the "x" and "y" components of the names, you can do regex replacement with dplyr::rename_all.
df %>%
pivot_wider(names_from = names, values_from = c(x1, x2), names_sep = "") %>%
dplyr::rename_all(gsub, pattern = "(x\\d+)(y\\d+)", replacement = "\\2\\1")
#> # A tibble: 1 x 4
#> y1x1 y2x1 y1x2 y2x2
#> <int> <int> <int> <int>
#> 1 1 2 4 5
I'm working with two dataframes in R:
df1 = data.frame(c("A", "B"), c(1, 21), c(17, 29))
colnames(df1) = c("location", "start", "stop")
df1
location start stop
A 1 17
B 21 29
df2 = data.frame(c("A", "A", "A", "A", "B"), c(1, 10, 20, 40, 20), c(10, 20, 30, 50, 30), c("x1", "x2","x4", "x5", "x3"))
colnames(df2) = c("location", "start", "stop", "out")
df2
location start stop out
A 1 10 x1
A 10 20 x2
A 20 30 x4
A 40 50 x5
B 20 30 x3
Now I want to check for each row of df1:
is there a match between 'location' with a 'location' from df2
if the 'start' value is in the range of start and stop from df2 or if the 'end' value is in the range of start and stop from df2, then the corresponding 'out' value from df2 should be pasted in a new column in df1
This is how the output would look in the case of this example
df1_new
location start stop out
A 1 17 x1,x2
B 21 29 x3
I've started in R, but I'm stuck at the point where I need to look in the complete dataframe of df2
for (i in nrow(df1)) {
if(df1$location[i] == df2$location # it needs to look for a match in the complete dataframe of df2. I don't know how to do this
& if (df1$start[i] %in% # it needs to check if the start value lies in the range between df2$start & df2$end
}
Here's a data.table way, using foverlaps:
library(data.table)
setkey(setDT(df1))
setDT(df2, key = names(df1))
foverlaps(df1, df2)[, .(out = toString(out)), by=location]
# location out
# 1: A x1, x2
# 2: B x3
You can get other cols out of the foverlaps results if desired:
foverlaps(df1, df2)
# location start stop out i.start i.stop
# 1: A 1 10 x1 1 17
# 2: A 10 20 x2 1 17
# 3: B 20 30 x3 21 29
You need to aggregate first and then merge, i.e.
merge(df1, aggregate(out ~ location, df2, toString), by = 'location')
# location start stop out
#1 A 1 17 x1, x2
#2 B 21 29 x3