Calculate rowMeans on a range of columns (variable number) - r

I want to calculate rowMeans over a range of columns, but I cannot hard-code the column names (e.g. c(C1,C3)) or the range (e.g. C1:C3) because both the names and the range vary. My df looks like:
> df
chr name age MGW.1 MGW.2 MGW.3 HEL.1 HEL.2 HEL.3
1 123 abc 12 10.00 19 18.00 12 13.00 -14
2 234 bvf 24 -13.29 13 -3.02 12 -0.12 24
3 376 bxc 17 -6.95 10 -18.00 15 4.00 -4
This is just a sample; in reality the columns run from MGW.1 to MGW.196 and so on. Instead of giving exact column names or an exact range, I want to pass the prefix of the column names and get the average of all columns that share that prefix. Something like: MGW=rowMeans(df[,MGW.*]), HEL=rowMeans(df[,HEL.*])
So my final output should look like:
> df
chr name age MGW Hel
1 123 abc 12 10.00 19
2 234 bvf 24 13.29 13
3 376 bxc 17 -6.95 10
I know these values are not correct, but they are just to give you an idea. Secondly, I want to remove from the data frame every row that contains only NA apart from the first 3 values.
Here is the dput for sample example:
> dput(df)
structure(list(chr = c(123L, 234L, 376L), name = structure(1:3, .Label = c("abc",
"bvf", "bxc"), class = "factor"), age = c(12L, 24L, 17L), MGW.1 = c(10,
-13.29, -6.95), MGW.2 = c(19L, 13L, 10L), MGW.3 = c(18, -3.02,
-18), HEL.1 = c(12L, 12L, 15L), HEL.2 = c(13, -0.12, 4), HEL.3 = c(-14L,
24L, -4L)), .Names = c("chr", "name", "age", "MGW.1", "MGW.2",
"MGW.3", "HEL.1", "HEL.2", "HEL.3"), class = "data.frame", row.names = c(NA,
-3L))

Firstly
I think you are looking for this to get the mean of the rows (the dot is escaped so that it matches a literal "." rather than any character):
df$mean.Hel <- rowMeans(df[, grep("^HEL\\.", names(df))])
And to delete the columns afterwards:
df[, grep("^HEL\\.", names(df))] <- NULL
Secondly
To delete rows which have only NA after the first three elements.
rows.delete <- which(rowSums(!is.na(df)[,4:ncol(df)]) == 0)
df <- df[!(1:nrow(df) %in% rows.delete),]
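The two steps above handle one prefix at a time. Below is a minimal base R sketch that derives the prefixes from the column names instead of hard-coding them. It assumes every measurement column is named <prefix>.<number> and that the first three columns are identifiers; the names id_cols, meas_cols, prefixes and result are illustrative only.

```r
# Sample data from the question, plus one row (999) that is all NA
df <- data.frame(
  chr   = c(123L, 234L, 376L, 999L),
  name  = c("abc", "bvf", "bxc", "zzz"),
  age   = c(12L, 24L, 17L, 21L),
  MGW.1 = c(10, -13.29, -6.95, NA),
  MGW.2 = c(19, 13, 10, NA),
  MGW.3 = c(18, -3.02, -18, NA),
  HEL.1 = c(12, 12, 15, NA),
  HEL.2 = c(13, -0.12, 4, NA),
  HEL.3 = c(-14, 24, -4, NA),
  stringsAsFactors = FALSE
)

id_cols   <- 1:3                    # assumption: identifiers come first
meas_cols <- names(df)[-id_cols]

# Drop rows where every measurement column is NA
df <- df[rowSums(!is.na(df[meas_cols])) > 0, ]

# Strip the trailing ".<number>" to recover the distinct prefixes
prefixes <- unique(sub("\\.[0-9]+$", "", meas_cols))

# One column of row means per prefix
means <- sapply(prefixes, function(p) {
  cols <- meas_cols[startsWith(meas_cols, paste0(p, "."))]
  rowMeans(df[cols], na.rm = TRUE)
})

result <- cbind(df[id_cols], means)
print(result)
```

Because the prefixes are computed from names(df), the same code should cover MGW.1 ... MGW.196 without changes, as long as the naming assumption holds.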

Here's an idea achieving your desired output without hardcoding variable names:
library(dplyr)
library(tidyr)
df %>%
# remove rows where all values are NA except the first 3 columns
filter(rowSums(is.na(.[4:length(.)])) != length(.) - 3) %>%
# gather the data in a tidy format
gather(key, value, -(chr:age)) %>%
# separate the key column into label and num allowing
# to regroup by variables without hardcoding them
separate(key, into = c("label", "num")) %>%
group_by(chr, name, age, label) %>%
# calculate the mean
summarise(mean = mean(value, na.rm = TRUE)) %>%
spread(label, mean)
I took the liberty to modify your initial data to show how the logic would fit special cases. For example, here we have a row (#4) where all values but the first 3 columns are NAs (according to your requirements, this row should be removed) and one where there is a mix of NAs and values (#5). In this case, I assumed we would like to have a result for MGW since there is a value at MGW.1:
# chr name age MGW.1 MGW.2 MGW.3 HEL.1 HEL.2 HEL.3
#1 123 abc 12 10.00 19 18.00 12 13.00 -14
#2 234 bvf 24 -13.29 13 -3.02 12 -0.12 24
#3 376 bxc 17 -6.95 10 -18.00 15 4.00 -4
#4 999 zzz 21 NA NA NA NA NA NA
#5 888 aaa 12 10.00 NA NA NA NA NA
Which gives:
#Source: local data frame [4 x 5]
#Groups: chr, name, age [4]
#
# chr name age HEL MGW
#* <int> <fctr> <int> <dbl> <dbl>
#1 123 abc 12 3.666667 15.666667
#2 234 bvf 24 11.960000 -1.103333
#3 376 bxc 17 5.000000 -4.983333
#4 888 aaa 12 NaN 10.000000
Data
df <- structure(list(chr = c(123L, 234L, 376L, 999L, 888L), name = structure(c(2L,
3L, 4L, 5L, 1L), .Label = c("aaa", "abc", "bvf", "bxc", "zzz"
), class = "factor"), age = c(12L, 24L, 17L, 21L, 12L), MGW.1 = c(10,
-13.29, -6.95, NA, 10), MGW.2 = c(19L, 13L, 10L, NA, NA), MGW.3 = c(18,
-3.02, -18, NA, NA), HEL.1 = c(12L, 12L, 15L, NA, NA), HEL.2 = c(13,
-0.12, 4, NA, NA), HEL.3 = c(-14L, 24L, -4L, NA, NA)), .Names = c("chr",
"name", "age", "MGW.1", "MGW.2", "MGW.3", "HEL.1", "HEL.2", "HEL.3"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
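gather(), separate() and spread() still work but are superseded in current tidyr. The following is a sketch of the same pipeline with pivot_longer() and pivot_wider(), assuming dplyr >= 1.0.4 (for if_any()) and tidyr >= 1.0.0; the reduced sample data and the name res are illustrative.

```r
library(dplyr)
library(tidyr)

# Reduced sample: row 999 is all NA, row 888 has a single value
df <- data.frame(
  chr   = c(123L, 234L, 999L, 888L),
  name  = c("abc", "bvf", "zzz", "aaa"),
  age   = c(12L, 24L, 21L, 12L),
  MGW.1 = c(10, -13.29, NA, 10),
  MGW.2 = c(19, 13, NA, NA),
  HEL.1 = c(12, 12, NA, NA),
  HEL.2 = c(13, -0.12, NA, NA),
  stringsAsFactors = FALSE
)

res <- df %>%
  # keep rows with at least one non-NA value outside the first 3 columns
  filter(if_any(-(chr:age), ~ !is.na(.x))) %>%
  # split "MGW.1" into label = "MGW" and num = "1" while reshaping
  pivot_longer(-(chr:age), names_to = c("label", "num"), names_sep = "\\.") %>%
  group_by(chr, name, age, label) %>%
  summarise(mean = mean(value, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = label, values_from = mean)

print(res)
```

As in the answer above, a group with no values at all (HEL for row 888) comes out as NaN rather than being dropped.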

Related

Rowsums on two vectors of paired columns but conditional on specific values

I have a dataset that looks like the one below where there are three "pairs" of columns pertaining to the type (datA, datB, datC), and the total for each type (datA_total, datB_total, datC_total):
structure(list(datA = c(1L, NA, 5L, 3L, 8L, NA), datA_total = c(20L,
30L, 40L, 15L, 10L, NA), datB = c(5L, 5L, NA, 6L, 1L, NA), datB_total = c(80L,
10L, 10L, 5L, 4L, NA), datC = c(NA, 4L, 1L, NA, 3L, NA), datC_total = c(NA,
10L, 15L, NA, 20L, NA)), class = "data.frame", row.names = c(NA,
-6L))
# datA datA_total datB datB_total datC datC_total
#1 1 20 5 80 NA NA
#2 NA 30 5 10 4 10
#3 5 40 NA 10 1 15
#4 3 15 6 5 NA NA
#5 8 10 1 4 3 20
#6 NA NA NA NA NA NA
I'm trying to compute a rowSums across each row to get the total visits across the data types, conditional on whether each type meets the criterion of having ANY score in the range 1-5.
Here is my thought process:
Select only the variables that are the data types (i.e. datA, datB, datC)
Across each row based on EACH data type, determine if that data type meets a criteria (i.e. datA -> does it contain (1,2,3,4,5))
If that data type column does contain one of the 5 values above ^, then look to its paired total variable and ready that value to be rowSummed (i.e. datA -> does it contain (1,2,3,4,5)? -> if yes, then grab datA_total value = 20).
The goal is to end up with a total column like below:
# datA datA_total datB datB_total datC datC_total overall_total
#1 1 20 5 80 NA NA 100
#2 NA 30 5 10 4 10 20
#3 5 40 NA 10 1 15 55
#4 3 15 6 5 NA NA 15
#5 8 10 1 4 3 20 24
#6 NA NA NA NA NA NA 0
You'll notice that row #2 only contained a total of 20 even though there is 30 in datA_total. This is a result of the conditional selection in that datA for row#2 contains "NA" rather than one of the five scores (1,2,3,4,5). Hence, the datA_total of 30 was not included in the rowSums calculation.
My code below shows the vectors I created and my attempt at a conditional rowSums but I end up getting an error regarding mutate... I'm not sure how to integrate the "conditional pairing" portion of this problem:
type_vars <- c("datA", "datB", "datC")
type_scores <- c("1", "2", "3", "4", "5")
type_visits <- c("datA_total", "datB_total", "datC_total")
df <- df %>%
mutate(overall_total = rowSums(all_of(type_visits[type_vars %in% type_scores])))
Any help/tips would be appreciated
dplyr's across should do the job.
library(dplyr)
# copying your tibble
data <-
tibble(
datA = c(1, NA, 5, 3, 8, NA),
datA_total = c(20, 30, 40, 15, 10, NA),
datB = c(5, 5, NA, 6, 1, NA),
datB_total = c(80, 10, 10, 5, 4, NA),
datC = c(NA, 4, 1, NA, 3, NA),
datC_total = c(NA, 10, 15, NA, 20, NA)
)
data %>%
mutate(across(c('A', 'B', 'C') %>% paste0('dat', .), \(x) (x %in% 1:5) * get(cur_column() %>% paste0(., '_total')), .names = "{col}_aux")) %>%
rowwise() %>%
mutate(overall_total = sum(across(ends_with('aux')), na.rm = TRUE)) %>%
select(any_of(c(names(data), 'overall_total')))
# A tibble: 6 × 7
datA datA_total datB datB_total datC datC_total overall_total
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 20 5 80 NA NA 100
2 NA 30 5 10 4 10 20
3 5 40 NA 10 1 15 55
4 3 15 6 5 NA NA 15
5 8 10 1 4 3 20 24
6 NA NA NA NA NA NA 0
First, we create an 'aux' column for each dat. It is 0 if dat is not within 1:5, and dat_total otherwise. Then we sum ignoring NA.
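The same "flag times paired total" idea can also be sketched in base R with matrices. This is an alternative to the across() approach rather than a restatement of it, and it assumes each total column is named exactly <type>_total.

```r
df <- data.frame(
  datA = c(1, NA, 5, 3, 8, NA),  datA_total = c(20, 30, 40, 15, 10, NA),
  datB = c(5, 5, NA, 6, 1, NA),  datB_total = c(80, 10, 10, 5, 4, NA),
  datC = c(NA, 4, 1, NA, 3, NA), datC_total = c(NA, 10, 15, NA, 20, NA)
)

type_vars   <- c("datA", "datB", "datC")
type_visits <- paste0(type_vars, "_total")   # assumption: paired naming

# Logical matrix: TRUE where the score is one of 1..5 (NA scores give FALSE)
in_range <- sapply(df[type_vars], function(x) x %in% 1:5)

# Multiply each total by its paired 0/1 flag, then sum per row, skipping NA
df$overall_total <- rowSums(in_range * as.matrix(df[type_visits]), na.rm = TRUE)

print(df)
```

Since the flag matrix and the totals matrix have their columns in the same order, the element-wise product does the pairing without any loop.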

one hot encoding only factor variables in R recipes

I have a dataframe df like so
height age dept
69 18 A
44 8 B
72 19 B
58 34 C
I want to one-hot encode only the factor variables (only dept is a factor). How can I do this?
Currently I'm selecting everything
and getting this warning:
Warning message:
The following variables are not factor vectors and will be ignored: height, age
ohe <- df %>%
recipes::recipe(~ .) %>%
recipes::step_dummy(tidyselect::everything()) %>%
recipes::prep() %>%
recipes::bake(df)
Use where() with is.factor instead of everything():
library(dplyr)
df %>%
recipes::recipe(~ .) %>%
recipes::step_dummy(tidyselect:::where(is.factor)) %>%
recipes::prep() %>%
recipes::bake(df)
-output
# A tibble: 4 × 4
height age dept_B dept_C
<int> <int> <dbl> <dbl>
1 69 18 0 0
2 44 8 1 0
3 72 19 1 0
4 58 34 0 1
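For comparison, here is a base R sketch of the same "factor columns only" selection using model.matrix() instead of recipes. It is a different technique from step_dummy(), shown only to illustrate the selection logic; the resulting column names (deptB, deptC) differ from the recipes output, and is_fac, dummies and out are illustrative names.

```r
df <- data.frame(
  height = c(69L, 44L, 72L, 58L),
  age    = c(18L, 8L, 19L, 34L),
  dept   = factor(c("A", "B", "B", "C"))
)

# Which columns are factors?
is_fac <- vapply(df, is.factor, logical(1))

# Treatment-coded dummies for the factor columns; drop the intercept column
dummies <- model.matrix(~ ., data = df[is_fac])[, -1, drop = FALSE]

# Non-factor columns are passed through untouched
out <- cbind(df[!is_fac], dummies)
print(out)
```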
data
df <- structure(list(height = c(69L, 44L, 72L, 58L), age = c(18L, 8L,
19L, 34L), dept = structure(c(1L, 2L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")

Transform columns and rows of a dataframe [duplicate]

I have a dataframe:
ID Value Name Score Card_type Card_number
1 NA John 242 X 23
1 124 John NA X 23
1 124 John 242 Y 25
1 124 NA 242 Y NA
2 55 Mike NA X 11
2 55 NA 431 X 11
2 55 Mike 431 Y 14
2 NA Mike 431 Y 14
As you can see, there are IDs, and each of them has two groups (Card_type) for the column Card_number. Some rows with the same ID and Card_type also have missing values in some columns. What I want is one row per ID with all columns filled in. The column Card_number must be split into two columns, Card_number_type_X and Card_number_type_Y, and the column Card_type must be removed.
So the desired result must look like this:
ID Value Name Score Card_number_type_X Card_number_type_Y
1 124 John 242 23 25
2 55 Mike 431 11 14
How could I do that?
One way would be to fill the missing values in each ID and then get data in wide format keeping only the unique values.
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
fill(everything(), .direction = 'updown') %>%
pivot_wider(names_from = Card_type, values_from = Card_number,
values_fn = unique, names_prefix = 'Card_number_type_')
# ID Value Name Score Card_number_type_X Card_number_type_Y
# <int> <int> <chr> <int> <int> <int>
#1 1 124 John 242 23 25
#2 2 55 Mike 431 11 14
It seems the original data is not the same as the shared data, in which case we can try:
df %>%
group_by(ID) %>%
fill(everything(), .direction = 'updown') %>%
distinct() %>%
group_by(ID, Value, Name, Score) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = Card_type, values_from = Card_number,
names_prefix = 'Card_number_type_')
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Value = c(NA,
124L, 124L, 124L, 55L, 55L, 55L, NA), Name = c("John", "John",
"John", NA, "Mike", NA, "Mike", "Mike"), Score = c(242L, NA,
242L, 242L, NA, 431L, 431L, 431L), Card_type = c("X", "X", "Y",
"Y", "X", "X", "Y", "Y"), Card_number = c(23L, 23L, 25L, NA,
11L, 11L, 14L, 14L)), class = "data.frame", row.names = c(NA,
-8L))
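To make the "fill within each ID, then spread by Card_type" logic explicit, here is a base R sketch. It assumes each ID has at most one distinct non-NA value per column and per card type; first_non_na and wide are hypothetical names.

```r
df <- data.frame(
  ID          = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Value       = c(NA, 124L, 124L, 124L, 55L, 55L, 55L, NA),
  Name        = c("John", "John", "John", NA, "Mike", NA, "Mike", "Mike"),
  Score       = c(242L, NA, 242L, 242L, NA, 431L, 431L, 431L),
  Card_type   = c("X", "X", "Y", "Y", "X", "X", "Y", "Y"),
  Card_number = c(23L, 23L, 25L, NA, 11L, 11L, 14L, 14L),
  stringsAsFactors = FALSE
)

# Helper (hypothetical): first non-missing value of a vector
first_non_na <- function(x) x[!is.na(x)][1]

types <- sort(unique(df$Card_type))

rows <- lapply(split(df, df$ID), function(g) {
  # One card number per card type within this ID
  cards <- lapply(types, function(t) first_non_na(g$Card_number[g$Card_type == t]))
  names(cards) <- paste0("Card_number_type_", types)
  cbind(
    data.frame(ID = g$ID[1],
               Value = first_non_na(g$Value),
               Name  = first_non_na(g$Name),
               Score = first_non_na(g$Score),
               stringsAsFactors = FALSE),
    as.data.frame(cards)
  )
})

wide <- do.call(rbind, rows)
print(wide)
```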

Dynamically subset and mutate data.table?

I have two separate data.tables. I want to add two new columns to dat2 ('Avg_of_nonNA', and 'Cols' to track which columns it is using) based on the non-NA columns in dat1. I need to take a subset of dat2 because its matrix is dense whereas dat1 is sparse (so I can take advantage of the sparseness). The only way to match the columns is through the common elements in their names: (0-1, 1-2, 2-3, 3-4) in my case. The rest of each column name is gibberish. This requires string splitting and matching, which causes many problems: I can't chain things together because each row has a different combination of columns to average (the dummy example is simplified). I do have a working solution, but it is painfully slow across my 1M+ rows.
I'm looking for a way to get rid of the for loop. Any suggestions? Here is the current solution:
for (z in 1:5) {
relevant_cols=dat1[z,] %>%
select_if(~!all(is.na(.))) %>%
names %>% strsplit(.,'_') %>% map(.,2) %>% unlist()
id=dat1[z,'ID']$`ID`
dat2[`ID`== id,`:=`(Avg_of_nonNA = (mean(as.numeric(.SD))),Cols=paste0(relevant_cols,collapse='/')), .SDcols=names(dat2) %like% paste0(relevant_cols,collapse='|')]
}
Data Below
> dat1
ID gjfkg_0-1_fkjdk_fjdkd jdfsje_1-2_fhks_ejfskj dfjs_2-3_vjskf_wqew gdlkrzc_3-4_rjrkj Avg_of_nonNA_otherDT
1: 1 2.23 1.37 NA NA 1.5
2: 2 1.98 NA NA 1.760 6.5
3: 3 NA 4.45 9.350 3.320 11.0
4: 4 NA NA 6.642 2.019 15.5
5: 5 NA 3.21 3.677 NA 18.5
> dat2
ID ewrwer_0-1_iopi_opop erewtt_1-2_rueiwu_vcvbc erewr_2-3_iirew_rewr mnmn_3-4_cxzxzc_gjd
1: 1 1 2 3 4
2: 2 5 6 7 8
3: 3 9 10 11 12
4: 4 13 14 15 16
5: 5 17 18 19 20
dput(dat1)
structure(list(ID = 1:5, `gjfkg_0-1_fkjdk_fjdkd` = c(2.23, 1.98,
NA, NA, NA), `jdfsje_1-2_fhks_ejfskj` = c(1.37, NA, 4.45, NA,
3.21), `dfjs_2-3_vjskf_wqew` = c(NA, NA, 9.35, 6.642, 3.677),
`gdlkrzc_3-4_rjrkj` = c(NA, 1.76, 3.32, 2.019, NA)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
dput(dat2)
structure(list(ID = 1:5, `ewrwer_0-1_iopi_opop` = c(1L, 5L, 9L,
13L, 17L), `erewtt_1-2_rueiwu_vcvbc` = c(2L, 6L, 10L, 14L, 18L
), `erewr_2-3_iirew_rewr` = c(3L, 7L, 11L, 15L, 19L), `mnmn_3-4_cxzxzc_gjd` = c(4L,
8L, 12L, 16L, 20L)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
Expected output:
Here is an option:
setDT(dat1)
setDT(dat2)
nm <- sapply(strsplit(names(dat1[, -"ID"]), "_"), `[[`, 2L)
dat2[, c("Avg_of_nonNA_otherDT", "Cols") := {
nas <- is.na(dat1[,-"ID"])
m <- col(nas)
m[] <- nm[m]
m[nas] <- ""
.(rowMeans(.SD * NA^nas, na.rm=TRUE),
gsub("\\s+", "/", trimws(do.call(paste, as.data.frame(m)))))
}, .SDcols=-"ID"]
output:
ID ewrwer_0-1_iopi_opop erewtt_1-2_rueiwu_vcvbc erewr_2-3_iirew_rewr mnmn_3-4_cxzxzc_gjd Avg_of_nonNA_otherDT Cols
1: 1 1 2 3 4 1.5 0-1/1-2
2: 2 5 6 7 8 6.5 0-1/3-4
3: 3 9 10 11 12 11.0 1-2/2-3/3-4
4: 4 13 14 15 16 15.5 2-3/3-4
5: 5 17 18 19 20 18.5 1-2/2-3
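The key trick in the answer is NA^nas: in R, NA^0 is 1 and NA^1 is NA, so raising NA to a logical matrix gives a mask that keeps a cell where the flag is FALSE and blanks it where the flag is TRUE. A minimal base R sketch on plain matrices (vals, nas and mask are illustrative names; the values mirror the first two rows of dat2 and dat1):

```r
# Dense values (first two rows of dat2, without ID)
vals <- matrix(1:8, nrow = 2, byrow = TRUE)

# TRUE where the sparse table (dat1) has NA
nas <- matrix(c(FALSE, FALSE, TRUE,  TRUE,
                FALSE, TRUE,  TRUE,  FALSE), nrow = 2, byrow = TRUE)

# NA^FALSE is NA^0 = 1 (keep); NA^TRUE is NA^1 = NA (mask out)
mask <- NA^nas

# Row means over the unmasked cells only
masked_means <- rowMeans(vals * mask, na.rm = TRUE)
print(masked_means)
```

Because the mask is built for the whole table at once, no per-row loop or per-row column selection is needed.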

how to calculate a specific subset in dataframe in r and save the calculation in another list

I have two lists:
list 1:
id name age
1 jake 21
2 ashly 19
45 lana 18
51 james 23
5675 eric 25
list 2 (tv watch):
id hours
1 1.1
1 3
1 2.5
45 5.6
45 3
51 2
51 1
51 2
This is just an example; the real lists are very big: list 1 has 5000 ids, and lists 2/3/4 each have more than 1 million rows (ids are not unique).
For each of list 2 and up, I need to calculate the average/sum/count for every id and add the value to list 1.
Note that I need the calculation saved in another list with a different number of rows.
example:
list 1:
id name age tv_average
1 jake 21 2.2
2 ashly 19 n/a
45 lana 18 4.3
51 james 23 1.6667
5675 eric 25 n/a
These are my attempts:
for (i in 1:nrow(list2)) {
p <- subset(list2,list2$id==i)
list2$tv_average[i==list2$id] <- sum(p$hours)/(nrow(p))
}
error:
Out of 22999 rows it only works on 21713 rows.
Try this
#Sample Data
data1 = structure(list(id = c(1L, 2L, 45L, 51L, 5675L), name = structure(c(3L,
1L, 5L, 4L, 2L), .Label = c("ashly", "eric", "jake", "james",
"lana"), class = "factor"), age = c(21L, 19L, 18L, 23L, 25L)
), .Names = c("id",
"name", "age"), row.names = c(NA, -5L), class = "data.frame")
data2 = structure(list(id = c(1L, 1L, 1L, 3L, 45L, 45L, 51L, 51L, 51L,
53L), hours = c(1.1, 3, 2.5, 10, 5.6, 3, 2, 1, 2, 6)), .Names = c("id",
"hours"), class = "data.frame", row.names = c(NA, -10L))
# Use aggregate to calculate Average, Sum, and Count and Merge
merge(x = data1,
y = aggregate(hours~id, data2, function(x)
c(mean = mean(x),
sum = sum(x),
count = length(x))),
by = "id",
all.x = TRUE)
# id name age hours.mean hours.sum hours.count
#1 1 jake 21 2.200000 6.600000 3.000000
#2 2 ashly 19 NA NA NA
#3 45 lana 18 4.300000 8.600000 2.000000
#4 51 james 23 1.666667 5.000000 3.000000
#5 5675 eric 25 NA NA NA
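The loop in the question appears to use the row index i as if it were an id, so any id that never equals a row index would be skipped, which may explain the partially filled result. If only a single tv_average column is needed, a base R sketch with tapply() and match() avoids the loop; the column name tv_average follows the question, and avg is an illustrative name.

```r
data1 <- data.frame(
  id   = c(1L, 2L, 45L, 51L, 5675L),
  name = c("jake", "ashly", "lana", "james", "eric"),
  age  = c(21L, 19L, 18L, 23L, 25L),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  id    = c(1L, 1L, 1L, 45L, 45L, 51L, 51L, 51L),
  hours = c(1.1, 3, 2.5, 5.6, 3, 2, 1, 2)
)

# Mean hours per id; the result is named by id
avg <- tapply(data2$hours, data2$id, mean)

# Look up each id of data1; ids absent from data2 become NA
data1$tv_average <- as.numeric(avg)[match(data1$id, as.integer(names(avg)))]

print(data1)
```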
