Dynamically subset and mutate data.table? - r

I have two separate data.tables. I want to add two new columns to dat2 ('Avg_of_nonNA', and 'Cols' to track which columns it's using) based on the non-NA columns in dat1. I need to take a subset of dat2 because that matrix is dense whereas dat1 is sparse (so I can take advantage of the sparseness). The only way to match the columns is by the common element in their names: (0-1, 1-2, 2-3, 3-4) in my case. The rest of each column name is gibberish. This requires string splitting and matching, which causes many problems: I can't chain operations together because each row has a different combination of columns to average (the dummy example is simplified). I do have a working solution, but it is painfully slow across my 1M+ rows, and I'm looking for a way to get rid of the for loop. Any suggestions? Here is that solution:
library(data.table)
library(dplyr)
library(purrr)
for (z in 1:5) {
  # keys ("0-1", "1-2", ...) of the columns that are non-NA in row z of dat1
  relevant_cols = dat1[z,] %>%
    select_if(~!all(is.na(.))) %>%
    names %>% strsplit(.,'_') %>% map(.,2) %>% unlist()
  id = dat1[z,'ID']$`ID`
  dat2[`ID` == id, `:=`(Avg_of_nonNA = (mean(as.numeric(.SD))), Cols = paste0(relevant_cols, collapse='/')), .SDcols = names(dat2) %like% paste0(relevant_cols, collapse='|')]
}
Data Below
> dat1
ID gjfkg_0-1_fkjdk_fjdkd jdfsje_1-2_fhks_ejfskj dfjs_2-3_vjskf_wqew gdlkrzc_3-4_rjrkj Avg_of_nonNA_otherDT
1: 1 2.23 1.37 NA NA 1.5
2: 2 1.98 NA NA 1.760 6.5
3: 3 NA 4.45 9.350 3.320 11.0
4: 4 NA NA 6.642 2.019 15.5
5: 5 NA 3.21 3.677 NA 18.5
> dat2
ID ewrwer_0-1_iopi_opop erewtt_1-2_rueiwu_vcvbc erewr_2-3_iirew_rewr mnmn_3-4_cxzxzc_gjd
1: 1 1 2 3 4
2: 2 5 6 7 8
3: 3 9 10 11 12
4: 4 13 14 15 16
5: 5 17 18 19 20
dput(dat1)
structure(list(ID = 1:5, `gjfkg_0-1_fkjdk_fjdkd` = c(2.23, 1.98,
NA, NA, NA), `jdfsje_1-2_fhks_ejfskj` = c(1.37, NA, 4.45, NA,
3.21), `dfjs_2-3_vjskf_wqew` = c(NA, NA, 9.35, 6.642, 3.677),
`gdlkrzc_3-4_rjrkj` = c(NA, 1.76, 3.32, 2.019, NA)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
dput(dat2)
structure(list(ID = 1:5, `ewrwer_0-1_iopi_opop` = c(1L, 5L, 9L,
13L, 17L), `erewtt_1-2_rueiwu_vcvbc` = c(2L, 6L, 10L, 14L, 18L
), `erewr_2-3_iirew_rewr` = c(3L, 7L, 11L, 15L, 19L), `mnmn_3-4_cxzxzc_gjd` = c(4L,
8L, 12L, 16L, 20L)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
Expected output:

Here is an option:
setDT(dat1)
setDT(dat2)
nm <- sapply(strsplit(names(dat1[, -"ID"]), "_"), `[[`, 2L)
dat2[, c("Avg_of_nonNA_otherDT", "Cols") := {
    nas <- is.na(dat1[,-"ID"])
    m <- col(nas)
    m[] <- nm[m]
    m[nas] <- ""
    .(rowMeans(.SD * NA^nas, na.rm=TRUE),
        gsub("\\s+", "/", trimws(do.call(paste, as.data.frame(m)))))
}, .SDcols=-"ID"]
output:
ID ewrwer_0-1_iopi_opop erewtt_1-2_rueiwu_vcvbc erewr_2-3_iirew_rewr mnmn_3-4_cxzxzc_gjd Avg_of_nonNA_otherDT Cols
1: 1 1 2 3 4 1.5 0-1/1-2
2: 2 5 6 7 8 6.5 0-1/3-4
3: 3 9 10 11 12 11.0 1-2/2-3/3-4
4: 4 13 14 15 16 15.5 2-3/3-4
5: 5 17 18 19 20 18.5 1-2/2-3
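The `NA^nas` term in the answer above acts as a mask: in R, `NA^TRUE` is `NA` while `NA^FALSE` is `1`, so multiplying by it blanks out exactly the cells flagged in `nas`. A minimal base R sketch of that trick, using small made-up matrices (`vals` and `nas` here are illustrative, not the question's data):

```r
# Illustrative values only: a 2 x 2 matrix with rows (1, 2) and (5, 6)
vals <- matrix(c(1, 5, 2, 6), nrow = 2)
# TRUE marks a cell that should be ignored (here: row 2, column 2)
nas <- matrix(c(FALSE, FALSE, FALSE, TRUE), nrow = 2)

# 1 where nas is FALSE, NA where nas is TRUE
mask <- NA^nas
# Masked cells become NA and are then dropped by na.rm = TRUE
masked_means <- rowMeans(vals * mask, na.rm = TRUE)
print(masked_means)
```

Row 1 averages both of its values, while row 2 averages only its unmasked value. This only lines up in the answer because dat1 and dat2 have the same row order and the same column order for the keyed columns.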

Related

Rowsums on two vectors of paired columns but conditional on specific values

I have a dataset that looks like the one below where there are three "pairs" of columns pertaining to the type (datA, datB, datC), and the total for each type (datA_total, datB_total, datC_total):
structure(list(datA = c(1L, NA, 5L, 3L, 8L, NA), datA_total = c(20L,
30L, 40L, 15L, 10L, NA), datB = c(5L, 5L, NA, 6L, 1L, NA), datB_total = c(80L,
10L, 10L, 5L, 4L, NA), datC = c(NA, 4L, 1L, NA, 3L, NA), datC_total = c(NA,
10L, 15L, NA, 20L, NA)), class = "data.frame", row.names = c(NA,
-6L))
# datA datA_total datB datB_total datC datC_total
#1 1 20 5 80 NA NA
#2 NA 30 5 10 4 10
#3 5 40 NA 10 1 15
#4 3 15 6 5 NA NA
#5 8 10 1 4 3 20
#6 NA NA NA NA NA NA
I'm trying to create a rowSums across each row to determine the total visits across the data types, conditional on whether each type meets the criterion of having a score in the range 1-5.
Here is my thought process:
Select only the variables that are the data types (i.e. datA, datB, datC)
Across each row based on EACH data type, determine if that data type meets a criteria (i.e. datA -> does it contain (1,2,3,4,5))
If that data type column does contain one of the 5 values above ^, then look to its paired total variable and ready that value to be rowSummed (i.e. datA -> does it contain (1,2,3,4,5)? -> if yes, then grab datA_total value = 20).
The goal is to end up with a total column like below:
# datA datA_total datB datB_total datC datC_total overall_total
#1 1 20 5 80 NA NA 100
#2 NA 30 5 10 4 10 20
#3 5 40 NA 10 1 15 55
#4 3 15 6 5 NA NA 15
#5 8 10 1 4 3 20 24
#6 NA NA NA NA NA NA 0
You'll notice that row #2 only contained a total of 20 even though there is 30 in datA_total. This is a result of the conditional selection in that datA for row#2 contains "NA" rather than one of the five scores (1,2,3,4,5). Hence, the datA_total of 30 was not included in the rowSums calculation.
My code below shows the vectors I created and my attempt at a conditional rowSums but I end up getting an error regarding mutate... I'm not sure how to integrate the "conditional pairing" portion of this problem:
type_vars <- c("datA", "datB", "datC")
type_scores <- c("1", "2", "3", "4", "5")
type_visits <- c("datA_total", "datB_total", "datC_total")
df <- df %>%
  mutate(overall_total = rowSums(all_of(type_visits[type_vars %in% type_scores])))
Any help/tips would be appreciated
dplyr's across should do the job.
library(dplyr)
# copying your tibble
data <-
  tibble(
    datA = c(1, NA, 5, 3, 8, NA),
    datA_total = c(20, 30, 40, 15, 10, NA),
    datB = c(5, 5, NA, 6, 1, NA),
    datB_total = c(80, 10, 10, 5, 4, NA),
    datC = c(NA, 4, 1, NA, 3, NA),
    datC_total = c(NA, 10, 15, NA, 20, NA)
  )
data %>%
  mutate(across(
    c('A', 'B', 'C') %>% paste0('dat', .),
    \(x) (x %in% 1:5) * get(cur_column() %>% paste0(., '_total')),
    .names = "{col}_aux"
  )) %>%
  rowwise() %>%
  mutate(overall_total = sum(across(ends_with('aux')), na.rm = TRUE)) %>%
  select(any_of(c(names(data), 'overall_total')))
# A tibble: 6 × 7
datA datA_total datB datB_total datC datC_total overall_total
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 20 5 80 NA NA 100
2 NA 30 5 10 4 10 20
3 5 40 NA 10 1 15 55
4 3 15 6 5 NA NA 15
5 8 10 1 4 3 20 24
6 NA NA NA NA NA NA 0
First, we create an 'aux' column for each dat. It is 0 if dat is not within 1:5, and dat_total otherwise. Then we sum ignoring NA.
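The same "mask, then sum" logic can also be sketched in base R without the auxiliary columns. This is only a rough alternative, assuming each type column has a matching `<type>_total` column; `in_range` is a name made up for this example:

```r
# Same values as the question's data frame
df <- data.frame(
  datA = c(1, NA, 5, 3, 8, NA), datA_total = c(20, 30, 40, 15, 10, NA),
  datB = c(5, 5, NA, 6, 1, NA), datB_total = c(80, 10, 10, 5, 4, NA),
  datC = c(NA, 4, 1, NA, 3, NA), datC_total = c(NA, 10, 15, NA, 20, NA)
)
type_vars <- c("datA", "datB", "datC")
type_visits <- paste0(type_vars, "_total")

# Logical matrix: TRUE where the score is one of 1..5 (NA counts as FALSE)
in_range <- sapply(df[type_vars], function(x) x %in% 1:5)
# Zero out totals whose paired score is out of range, then sum each row
df$overall_total <- rowSums(as.matrix(df[type_visits]) * in_range, na.rm = TRUE)
print(df$overall_total)
```

Because `%in%` returns FALSE for NA, a missing score simply contributes 0, and `na.rm = TRUE` handles totals that are themselves NA.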

Tidyverse method for combining sets of columns based on a condition in the column names

Imagine I have the following columns (among others) in my dataframe (credit to Allan for creating the sample data):
df <- structure(list(a_years = 5:3, a_months = 6:8, a_days = c(23L,
20L, 15L), b_years = c(4L, 5L, 3L), b_months = 0:2, b_days = c(10L,
8L, 6L), c_years = 8:6, c_months = c(11L, 9L, 8L), c_days = c(26L,
19L, 18L)), class = "data.frame", row.names = c(NA, -3L))
df
#> a_years a_months a_days b_years b_months b_days c_years c_months c_days
#> 1 5 6 23 4 0 10 8 11 26
#> 2 4 7 20 5 1 8 7 9 19
#> 3 3 8 15 3 2 6 6 8 18
And I want to combine columns that start with the same grouping key (in this case the letter at the beginning, but in my data it's a longer expression) such that I get columns a_days, b_days, c_days and so on, with the values in each column equal to x_years * 365 + x_months * 30 + x_days, for each group (a, b, c, d, e and so on) of columns.
Is there a way to accomplish this all at once? Some combination of map() and mutate() comes to mind, or maybe using case_when(), but I can't quite figure it out. Thanks for any guidance you can offer!
You can do this with across inside transmute:
library(dplyr)
df %>%
  transmute(across(contains("days"), ~ .x) +
            across(contains("months"), ~ .x * 30) +
            across(contains("years"), ~ .x * 365))
#> a_days b_days c_days
#> 1 2028 1470 3276
#> 2 1690 1863 2844
#> 3 1350 1161 2448
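As far as I can tell, adding the three `across()` results pairs their columns by position, so it relies on the days, months and years columns appearing in the same group order. A minimal base R sketch that pairs them by name instead, assuming every group has all three of `_years`, `_months` and `_days` (`prefixes` and `total_days` are names made up for this example):

```r
# Same values as the sample data in this answer
df <- data.frame(
  a_years = 5:3, a_months = 6:8, a_days = c(23L, 20L, 15L),
  b_years = c(4L, 5L, 3L), b_months = 0:2, b_days = c(10L, 8L, 6L),
  c_years = 8:6, c_months = c(11L, 9L, 8L), c_days = c(26L, 19L, 18L)
)

# Grouping keys: whatever precedes the _years/_months/_days suffix
prefixes <- unique(sub("_(years|months|days)$", "", names(df)))

# One column of total days per key, looked up by full column name
total_days <- sapply(prefixes, function(p) {
  df[[paste0(p, "_years")]] * 365 +
    df[[paste0(p, "_months")]] * 30 +
    df[[paste0(p, "_days")]]
})
print(total_days)
```

The result is a matrix with one column per key; wrap it in `as.data.frame()` if a data frame is needed.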
Sample data
df <- structure(list(a_years = 5:3, a_months = 6:8, a_days = c(23L,
20L, 15L), b_years = c(4L, 5L, 3L), b_months = 0:2, b_days = c(10L,
8L, 6L), c_years = 8:6, c_months = c(11L, 9L, 8L), c_days = c(26L,
19L, 18L)), class = "data.frame", row.names = c(NA, -3L))
df
#> a_years a_months a_days b_years b_months b_days c_years c_months c_days
#> 1 5 6 23 4 0 10 8 11 26
#> 2 4 7 20 5 1 8 7 9 19
#> 3 3 8 15 3 2 6 6 8 18
Created on 2022-09-29 with reprex v2.0.2

Adding columns and insert info from a second dataframe R

Hello everyone. I have two data frames and I'd like to join information from one to the other in a specific way. Here is my first df, to which I'd like to add 6 columns (generic columns named col1, col2, and so on):
res1 res4 aa1234
1 AAAAAA 1 4 IVGG
2 AAAAAA 8 11 RPRQ
3 AAAAAA 10 13 RQFP
4 AAAAAA 12 15 FPFL
5 AAAAAA 20 23 NQGR
6 AAAAAA 32 35 HARF
here is the 2nd df:
res1 dist
1 3.711846
1 3.698985
2 4.180874
2 3.112819
3 3.559737
3 3.722107
4 3.842375
4 3.914970
5 3.361647
5 2.982788
6 3.245118
6 3.224230
7 3.538315
7 3.602273
8 3.185184
8 2.771583
9 4.276871
9 3.157737
10 3.933783
10 2.956738
Using "res1" as the key, I'd like to fill the 6 new columns of the 1st df with the first 6 values of "dist" from the second df, starting at the rows where res1 = 1.
Next, the 1st df has res1 = 8, so I'd like the 6 new columns to hold the 6 "dist" values starting at res1 = 8 in the 2nd df.
I'd like to obtain something like this
res1 res4 aa1234 col1 col2 col3 col4 col5 col6
1 4 IVGG 3.71 3.69 4.18 3.11 3.55 3.72
8 11 RPRQ 3.18 2.77 4.27 3.15 3.93 2.95
10 13 RQFP
12 15 FPFL
20 23 NQGR
32 35 HARF
Please consider that I have to do it on a large dataset and for 1000 and more files... thanks!
You could create a sequence from res1 to res4 and then join the data with pdb.
library(tidyverse)
turn %>%
  mutate(res = map2(res1, res4, seq)) %>%
  unnest(res) %>%
  left_join(pdb, by = c('res' = 'res1')) %>%
  group_by(res1 = as.character(res1)) %>%
  mutate(col = paste0('col', row_number())) %>%
  select(-res4, -res, -eleno) %>%
  pivot_wider(names_from = col, values_from = dist)
We can use rowid from data.table
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df2 %>%
  mutate(col = str_c("col", rowid(res1))) %>%
  pivot_wider(names_from = col, values_from = dist) %>%
  right_join(df1, by = 'res1')
-output
# A tibble: 6 x 4
# res1 col1 col2 res4
# <int> <dbl> <dbl> <int>
#1 1 3.71 3.70 4
#2 8 3.19 2.77 11
#3 10 3.93 2.96 13
#4 12 NA NA 15
#5 20 NA NA 23
#6 32 NA NA 35
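In this answer `rowid(res1)` numbers the rows within each value of `res1`, which is what turns the repeated `dist` values into col1, col2, and so on. A rough base R equivalent of that within-group counter, shown on a made-up vector (not the full `df2`):

```r
# Illustrative key column: each res1 value appears twice
res1 <- c(1L, 1L, 2L, 2L, 8L, 8L)

# Running index within each group of equal res1 values
within_group <- ave(seq_along(res1), res1, FUN = seq_along)
col_names <- paste0("col", within_group)
print(col_names)
```

Each key gets "col1" for its first row and "col2" for its second, which is why the output above has only two value columns per `res1`.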
data
df1 <- structure(list(res1 = c(1L, 8L, 10L, 12L, 20L, 32L), res4 = c(4L,
11L, 13L, 15L, 23L, 35L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
df2 <- structure(list(res1 = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L,
6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L), dist = c(3.711846,
3.698985, 4.180874, 3.112819, 3.559737, 3.722107, 3.842375, 3.91497,
3.361647, 2.982788, 3.245118, 3.22423, 3.538315, 3.602273, 3.185184,
2.771583, 4.276871, 3.157737, 3.933783, 2.956738)), class = "data.frame",
row.names = c(NA,
-20L))

R replace the column name by the dataframe name with a loop

I am very new to programming with R, but I am trying to replace the column name by the dataframe name with a for loop. I have 25 dataframes with cryptocurrency time series data.
ls(pattern="USD")
[1] "ADA.USD" "BCH.USD" "BNB.USD" "BTC.USD" "BTG.USD" "DASH.USD" "DOGE.USD" "EOS.USD" "ETC.USD" "ETH.USD" "IOT.USD"
[12] "LINK.USD" "LTC.USD" "NEO.USD" "OMG.USD" "QTUM.USD" "TRX.USD" "USDT.USD" "WAVES.USD" "XEM.USD" "XLM.USD" "XMR.USD"
[23] "XRP.USD" "ZEC.USD" "ZRX.USD"
Every object is a dataframe which stands for a cryptocurrency expressed in USD. And every dataframe has 2 columns: Date and Close (Closing price).
For example: the dataframe "BTC.USD" stands for Bitcoin in USD:
head(BTC.USD)
# A tibble: 6 x 2
Date Close
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
4 2016-01-03 431.
5 2016-01-04 433.
Now I want to replace the name of the second column ("Close") by the name of the dataframe ("BTC.USD")
For this case I used the following code:
colnames(BTC.USD)[2] <-deparse(substitute(BTC.USD))
And this code works as I imagined:
> head(BTC.USD)
# A tibble: 6 x 2
Date BTC.USD
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
Now I am trying to create a loop to change the second column name for all 25 dataframes of cryptocurrency data:
df_list <- ls(pattern="USD")
for(i in df_list){
  aux <- get(i)
  (colnames(aux)[2] =df_list)
  assign(i,aux)
}
But the code does not work as I thought. Can someone help me figure out what step I am missing?
Thanks in advance!
You can use Map to assign the names, i.e.
Map(function(x, y) {names(x)[2] <- y; x}, l2, names(l2))
#$`a`
# v1 a
#1 3 8
#2 5 6
#3 2 7
#4 1 5
#5 4 4
#$b
# v1 b
#1 9 47
#2 18 48
#3 17 6
#4 5 25
#5 13 12
DATA
dput(l2)
list(a = structure(list(v1 = c(3L, 5L, 2L, 1L, 4L), v2 = c(8L,
6L, 7L, 5L, 4L)), class = "data.frame", row.names = c(NA, -5L
)), b = structure(list(v1 = c(9L, 18L, 17L, 5L, 13L), v2 = c(47L,
48L, 6L, 25L, 12L)), class = "data.frame", row.names = c(NA,
-5L)))
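For the loop in the question, the likely culprit is that `colnames(aux)[2]` is assigned the whole `df_list` vector instead of the current name `i`. A minimal sketch of the corrected loop, using two small stand-in data frames (the values below are made up, not real prices):

```r
# Hypothetical stand-ins for the question's data frames
BTC.USD <- data.frame(Date = as.Date("2015-12-31") + 0:1, Close = c(430, 434))
ETH.USD <- data.frame(Date = as.Date("2015-12-31") + 0:1, Close = c(0.93, 0.95))

df_list <- c("BTC.USD", "ETH.USD")
for (i in df_list) {
  aux <- get(i)
  colnames(aux)[2] <- i   # use the current name, not the whole df_list vector
  assign(i, aux)
}
print(names(BTC.USD))
print(names(ETH.USD))
```

Keeping the data frames in a named list and using `Map`, as in the answer above, avoids `get`/`assign` altogether.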

Calculate rowMeans on a range of column (Variable number)

I want to calculate rowMeans of a range of column but I cannot give the hard-coded value for colnames (e.g c(C1,C3)) or range (e.g. C1:C3) as both names and range are variable. My df looks like:
> df
chr name age MGW.1 MGW.2 MGW.3 HEL.1 HEL.2 HEL.3
1 123 abc 12 10.00 19 18.00 12 13.00 -14
2 234 bvf 24 -13.29 13 -3.02 12 -0.12 24
3 376 bxc 17 -6.95 10 -18.00 15 4.00 -4
This is just a sample; in reality I have columns ranging from MGW.1 to MGW.196 and so on. Instead of giving the exact colnames or an exact range, I want to pass the prefix of the colnames and get the average of all columns having that prefix. Something like: MGW=rowMeans(df[,MGW.*]), HEL=rowMeans(df[,HEL.*])
So my final output should look like:
> df
chr name age MGW Hel
1 123 abc 12 10.00 19
2 234 bvf 24 13.29 13
3 376 bxc 17 -6.95 10
I know these values are not correct, but it is just to give you an idea. Secondly, I want to remove all those rows from the data frame which contain NA in the entire row except the first 3 values.
Here is the dput for sample example:
> dput(df)
structure(list(chr = c(123L, 234L, 376L), name = structure(1:3, .Label = c("abc",
"bvf", "bxc"), class = "factor"), age = c(12L, 24L, 17L), MGW.1 = c(10,
-13.29, -6.95), MGW.2 = c(19L, 13L, 10L), MGW.3 = c(18, -3.02,
-18), HEL.1 = c(12L, 12L, 15L), HEL.2 = c(13, -0.12, 4), HEL.3 = c(-14L,
24L, -4L)), .Names = c("chr", "name", "age", "MGW.1", "MGW.2",
"MGW.3", "HEL.1", "HEL.2", "HEL.3"), class = "data.frame", row.names = c(NA,
-3L))
Firstly
I think you are looking for this to get mean of rows:
df$mean.Hel <- rowMeans(df[, grep("^HEL.", names(df))])
And to delete the columns afterwards:
df[, grep("^HEL.", names(df))] <- NULL
Secondly
To delete rows which have only NA after the first three elements.
rows.delete <- which(rowSums(!is.na(df)[,4:ncol(df)]) == 0)
df <- df[!(1:nrow(df) %in% rows.delete),]
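To avoid repeating the `rowMeans` line for every prefix, the prefixes can be derived from the column names. A minimal sketch, assuming the first three columns are identifiers and every other column is named `<prefix>.<number>` (`prefixes`, `means` and `result` are names made up for this example):

```r
# Same values as the question's sample data
df <- data.frame(
  chr = c(123L, 234L, 376L), name = c("abc", "bvf", "bxc"), age = c(12L, 24L, 17L),
  MGW.1 = c(10, -13.29, -6.95), MGW.2 = c(19L, 13L, 10L), MGW.3 = c(18, -3.02, -18),
  HEL.1 = c(12L, 12L, 15L), HEL.2 = c(13, -0.12, 4), HEL.3 = c(-14L, 24L, -4L)
)

# Strip the trailing ".<number>" to get the distinct prefixes
value_cols <- names(df)[-(1:3)]
prefixes <- unique(sub("\\.[0-9]+$", "", value_cols))

# One row-mean column per prefix, matching "<prefix>." literally
means <- sapply(prefixes, function(p) {
  rowMeans(df[startsWith(names(df), paste0(p, "."))], na.rm = TRUE)
})
result <- cbind(df[1:3], means)
print(result)
```

Using `startsWith` with the literal dot sidesteps the regex question of `.` matching any character in a pattern like `"^HEL."`.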
Here's an idea achieving your desired output without hardcoding variable names:
library(dplyr)
library(tidyr)
df %>%
  # remove rows where all values are NA except the first 3 columns
  filter(rowSums(is.na(.[4:length(.)])) != length(.) - 3) %>%
  # gather the data in a tidy format
  gather(key, value, -(chr:age)) %>%
  # separate the key column into label and num allowing
  # to regroup by variables without hardcoding them
  separate(key, into = c("label", "num")) %>%
  group_by(chr, name, age, label) %>%
  # calculate the mean
  summarise(mean = mean(value, na.rm = TRUE)) %>%
  spread(label, mean)
I took the liberty to modify your initial data to show how the logic would fit special cases. For example, here we have a row (#4) where all values but the first 3 columns are NAs (according to your requirements, this row should be removed) and one where there is a mix of NAs and values (#5). In this case, I assumed we would like to have a result for MGW since there is a value at MGW.1:
# chr name age MGW.1 MGW.2 MGW.3 HEL.1 HEL.2 HEL.3
#1 123 abc 12 10.00 19 18.00 12 13.00 -14
#2 234 bvf 24 -13.29 13 -3.02 12 -0.12 24
#3 376 bxc 17 -6.95 10 -18.00 15 4.00 -4
#4 999 zzz 21 NA NA NA NA NA NA
#5 888 aaa 12 10.00 NA NA NA NA NA
Which gives:
#Source: local data frame [4 x 5]
#Groups: chr, name, age [4]
#
# chr name age HEL MGW
#* <int> <fctr> <int> <dbl> <dbl>
#1 123 abc 12 3.666667 15.666667
#2 234 bvf 24 11.960000 -1.103333
#3 376 bxc 17 5.000000 -4.983333
#4 888 aaa 12 NaN 10.000000
Data
df <- structure(list(chr = c(123L, 234L, 376L, 999L, 888L), name = structure(c(2L,
3L, 4L, 5L, 1L), .Label = c("aaa", "abc", "bvf", "bxc", "zzz"
), class = "factor"), age = c(12L, 24L, 17L, 21L, 12L), MGW.1 = c(10,
-13.29, -6.95, NA, 10), MGW.2 = c(19L, 13L, 10L, NA, NA), MGW.3 = c(18,
-3.02, -18, NA, NA), HEL.1 = c(12L, 12L, 15L, NA, NA), HEL.2 = c(13,
-0.12, 4, NA, NA), HEL.3 = c(-14L, 24L, -4L, NA, NA)), .Names = c("chr",
"name", "age", "MGW.1", "MGW.2", "MGW.3", "HEL.1", "HEL.2", "HEL.3"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
