Merging content of repeated variables in a dataframe in R [duplicate]

This question already has answers here:
How to implement coalesce efficiently in R
(9 answers)
Closed 6 years ago.
I merged various dataframes in R that had variables with the same name. In the merged file I got variable names such as varA, varA.x, varA.x1, varA.x.y, etc. I want to create a file that merges the content of all these variables into a single column. As an example of my file:
ID weight age varA varA.x varA.x.y varA.x.y.1
1 50 30 2 NA NA NA
2 78 34 NA 3 NA NA
3 56 56 NA NA NA 6
4 56 67 NA NA 7 NA
I want a file that looks like:
ID weight age varA
1 50 30 2
2 78 34 3
3 56 56 6
4 56 67 7
It is not feasible to use ifelse, e.g. `data$varA = ifelse(is.na(varA.x), varA.y, varA.x)`, because the statement would be too long: I have so many repeated variables.
Can you help me, please? Thank you so much.

We can use coalesce from dplyr
library(dplyr)
df1 %>%
mutate(varA = coalesce(varA, varA.x, varA.x.y, varA.x.y.1)) %>%
select(ID, weight, age, varA)
# ID weight age varA
#1 1 50 30 2
#2 2 78 34 3
#3 3 56 56 6
#4 4 56 67 7
Or use pmax from base R (this works for the example because each row has at most one non-NA value among the varA columns)
cbind(df1[1:3], varA=do.call(pmax, c(df1[grep("varA", names(df1))], na.rm = TRUE)))
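When the number of duplicated columns is large or not known in advance, the same idea can be written without naming each column. The following is a minimal base R sketch, assuming the duplicates all start with "varA" as in the example; coalesce_pair is a hypothetical helper name used only for illustration, not a function from any package.

```r
# Example data reconstructed from the question
df1 <- data.frame(
  ID = 1:4,
  weight = c(50, 78, 56, 56),
  age = c(30, 34, 56, 67),
  varA = c(2, NA, NA, NA),
  varA.x = c(NA, 3, NA, NA),
  varA.x.y = c(NA, NA, NA, 7),
  varA.x.y.1 = c(NA, NA, 6, NA)
)

# Find every column whose name starts with "varA"
dup_cols <- grep("^varA", names(df1), value = TRUE)

# Hypothetical helper: keep a where it is present, otherwise fall back to b
coalesce_pair <- function(a, b) ifelse(is.na(a), b, a)

# Fold the helper over all duplicated columns, left to right
merged <- Reduce(coalesce_pair, df1[dup_cols])

result <- cbind(df1[c("ID", "weight", "age")], varA = merged)
result
```

Unlike pmax, this keeps the first non-NA value from left to right, so it still behaves sensibly if a row happens to have more than one non-NA entry.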

Related

Use a dynamically created variable to select column in mutate

I am trying to use the value of vector_of_names[position] in the code below to dynamically select the column of data that supplies the value of "age" in mutate.
vector_of_names <- c("one","two","three")
id <- c(1,2,3,4,5,6)
position <- c(1,1,2,2,1,1)
one <- c(32,34,56,77,87,98)
two <- c(45,67,87,NA,33,56)
three <- c(NA,NA,NA,NA,NA,60)
data <- data.frame(id,position,one,two,three)
attempt <- data %>%
mutate(age=vector_of_names[position])
I see a similar question here, but the various answers fail because I am using a variable within the data, "position", to select the column name from the vector of names. It is never recognised, I suspect because it is being looked up outside of the data.
I am taking this approach because the number of columns "one", "two" and "three" is not known beforehand, but the vector of their names is, so they need to be selected dynamically.
You could do:
data %>%
rowwise() %>%
mutate(age = c_across(all_of(vector_of_names))[position])
id position one two three age
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 32 45 NA 32
2 2 1 34 67 NA 34
3 3 2 56 87 NA 87
4 4 2 77 NA NA NA
5 5 1 87 33 NA 87
6 6 1 98 56 60 98
If you want to be more explicit about what values should be returned:
named_vector_of_names <- setNames(seq_along(vector_of_names), vector_of_names)
data %>%
rowwise() %>%
mutate(age = get(names(named_vector_of_names)[match(position, named_vector_of_names)]))
Base R vectorized option using matrix subsetting.
data$age <- data[vector_of_names][cbind(1:nrow(data), data$position)]
data
# id position one two three age
#1 1 1 32 45 NA 32
#2 2 1 34 67 NA 34
#3 3 2 56 87 NA 87
#4 4 2 77 NA NA NA
#5 5 1 87 33 NA 87
#6 6 1 98 56 60 98
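The matrix subsetting trick relies on a base R rule: indexing with a two-column matrix treats each row of that matrix as one (row, column) pair. A small sketch on a toy matrix, with made-up names m, idx and picked, shows the rule in isolation:

```r
# Toy matrix: column "one" holds 1, 2, 3 and column "two" holds 4, 5, 6
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("one", "two")))

# Each row of idx is a (row, column) pair to extract
idx <- cbind(row = 1:3, col = c(1, 2, 1))

# Picks m[1, 1], m[2, 2] and m[3, 1]
picked <- m[idx]
picked
```

In the answer above, data$position plays the role of the column half of each pair, so every row picks its own column in a single vectorized step.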

Replace NA using a vector of column names

I have a data frame with columns containing NAs which I replace using replace_na. The problem is these column names can change in the future so I would like to put these column names in a vector and then use the vector in the replace_na function. I don't want to change the entire data frame in one go, just specified columns. When I try this as below, the code runs but it doesn't change the data frame. Can anyone suggest any edits to the code?
library(tidyverse)
col1<-c(9,NA,25,26,NA,51)
col2<-c(9,5,25,26,NA,51)
col3<-c(NA,3,25,26,NA,51)
col4<-c(9,1,NA,26,NA,51)
data<-data.frame(col1,col2,col3,col4, stringsAsFactors = FALSE)
columns<-c(col1,col2)
data<-data%>%
replace_na(list(columns=0))
A dplyr option (replace_na itself comes from tidyr, which library(tidyverse) loads):
columns <- c("col1", "col2")
dplyr::mutate(data, across(all_of(columns), ~ replace_na(.x, 0)))
Returns:
col1 col2 col3 col4
1 9 9 NA 9
2 0 5 3 1
3 25 25 25 NA
4 26 26 26 26
5 0 0 NA NA
6 51 51 51 51
Another option is to use coalesce inside map_at.
The .at argument of map_at can be a character vector naming the columns you would like to modify.
The coalesce function then supplies the replacement for the NAs.
library(dplyr)
library(purrr)
data %>%
map_at(c("col1","col2"), ~ coalesce(.x, 0)) %>%
bind_cols()
# A tibble: 6 x 4
col1 col2 col3 col4
<dbl> <dbl> <dbl> <dbl>
1 9 9 NA 9
2 0 5 3 1
3 25 25 25 NA
4 26 26 26 26
5 0 0 NA NA
6 51 51 51 51
The columns value should be a character vector of column names; you can then use is.na:
columns<-c("col1","col2")
data[columns][is.na(data[columns])] <- 0
data
# col1 col2 col3 col4
#1 9 9 NA 9
#2 0 5 3 1
#3 25 25 25 NA
#4 26 26 26 26
#5 0 0 NA NA
#6 51 51 51 51
Or using tidyverse:
library(dplyr)
library(tidyr)
data <- data %>% mutate(across(all_of(columns), ~ replace_na(.x, 0)))
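The original attempt changes nothing because replace_na(list(columns = 0)) looks for a column literally named columns and ignores names that are not in the data. One way to keep calling replace_na on the whole data frame is to build the named replacement list from the vector of names. This is a sketch, assuming tidyr is installed; replacements and filled are names chosen here for illustration.

```r
library(tidyr)

data <- data.frame(
  col1 = c(9, NA, 25, 26, NA, 51),
  col2 = c(9, 5, 25, 26, NA, 51),
  col3 = c(NA, 3, 25, 26, NA, 51),
  col4 = c(9, 1, NA, 26, NA, 51)
)

# Column names to fill, kept as strings so they can change later
columns <- c("col1", "col2")

# Build list(col1 = 0, col2 = 0) from the vector of names
replacements <- setNames(as.list(rep(0, length(columns))), columns)

# Only the listed columns are filled; col3 and col4 keep their NAs
filled <- replace_na(data, replacements)
filled
```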

Ranking based on two variables

I need to rank rows based on two variables and I just can't wrap my head around it.
Test data below:
df <- data.frame(A = c(12,35,55,7,6,NA,NA,NA,NA,NA), B = c(NA,12,25,53,12,2,66,45,69,43))
A B
12 NA
35 12
55 25
7 53
6 12
NA 2
NA 66
NA 45
NA 69
NA 43
I want to calculate a third variable, C, that equals A when A is not NA. When A is NA, C should be based on B, BUT a row where A is NA should never outrank a row where A is not NA.
In the data above, max(A) should equal max(C), and max(B) can only hold the sixth-highest C value, because A has five non-NA values. If A is NA and B would outrank a row where A is not NA, then some form of transformation should ensure that the non-NA A row always outranks the B row in the final C score.
I would like the result to look something like this:
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 69 6
NA 66 7
NA 45 8
NA 43 9
NA 2 10
So far the closest I can get is
df$C <- ifelse(is.na(df$A), min(df$A, na.rm=T)/df$B, df$A)
But that turns the ranking upside down when A==NA, so B==2 is ranked 6 instead of B==69
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 2 6
NA 43 7
NA 45 8
NA 66 9
NA 69 10
I'm not sure if I could use some kind of weights?
Any suggestions are greatly appreciated! Thanks!
You can try:
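# rank rows by decreasing A; order(order(x)) turns the sort permutation into ranks
df$C <- order(order(-df$A))
# rows with NA in A are ranked by decreasing B, offset by the number of non-NA A values
df[is.na(df$A), "C"] <- order(order(-df[is.na(df$A), "B"])) + sum(!is.na(df$A))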
and the order for C:
df[order(df$C),]
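The same ranking can also be sketched with a single order() call that uses two sort keys. This is an alternative formulation rather than the answer's method, and it assumes B is never NA in a row where A is NA; score and ord are illustrative names.

```r
df <- data.frame(
  A = c(12, 35, 55, 7, 6, NA, NA, NA, NA, NA),
  B = c(NA, 12, 25, 53, 12, 2, 66, 45, 69, 43)
)

# Use A where it exists, otherwise fall back to B
score <- ifelse(is.na(df$A), df$B, df$A)

# First key: rows with a non-NA A come first; second key: score, descending
ord <- order(is.na(df$A), -score)

# Turn the sort order into a rank for each original row
df$C <- integer(nrow(df))
df$C[ord] <- seq_along(ord)

df[order(df$C), ]
```

Sorting on is.na(df$A) first is what guarantees that no row with a missing A can outrank a row with a present A, whatever the B values are.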

How do I split values in a column into multiple columns based on a factor in R? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 6 years ago.
Long time reader and first time poster, let's see how this goes...
I am working in R to create a summary of average out of pocket costs for different drugs based on different health care providers. In the data I have many more companies (~5000) than I do products (4). I know to start off by aggregating the out of pocket cost by product and health care provider as shown below:
avgdf <- aggregate(price ~ company + product, data= df, mean)
colnames(avgdf) <- c("company", "prod", "avg_price")
The resulting data frame looks like this: (Note for confidentiality reasons I cannot post the actual data but have to show a generic example)
company prod avg_price
A 1 88
A 2 63
A 3 46
B 1 55
C 2 8
D 1 67
D 2 42
D 3 40
D 4 61
E 1 13
E 2 17
F 1 85
F 4 17
I want to transform the data frame so that the prod column is split into 4 columns, one for each of the respective products, and the values of these 4 columns are filled in according to its company-product pair. In other words, I want the table to look like this:
company prod1.avg_price prod2.avg_price prod3.avg_price prod4.avg_price
A 88 63 46 NA
B 55 NA NA NA
C NA 8 NA NA
D 67 42 40 61
E 13 17 NA NA
F 85 NA NA 17
I shouldn't have as many NA's in my dataset as there are in my example, but I want a solution that can handle it. My guess is to use reshape2 melt and dcast functions but I am not sure how to implement it. Thank you in advance for the help!
We can use dcast from data.table to reshape it to 'wide' format.
library(data.table)
dcast(setDT(avgdf), company~paste0("prod", prod, ".avg_price"), value.var = "avg_price")
# company prod1.avg_price prod2.avg_price prod3.avg_price prod4.avg_price
#1: A 88 63 46 NA
#2: B 55 NA NA NA
#3: C NA 8 NA NA
#4: D 67 42 40 61
#5: E 13 17 NA NA
#6: F 85 NA NA 17
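If you prefer to avoid extra packages, base R's reshape() can do the same long-to-wide conversion. A sketch using the example data from the question is below; note that reshape() names the new columns avg_price.1 to avg_price.4 rather than prod1.avg_price, so rename them afterwards if the exact names matter.

```r
# Example data reconstructed from the question
avgdf <- data.frame(
  company = c("A", "A", "A", "B", "C", "D", "D", "D", "D", "E", "E", "F", "F"),
  prod = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 1, 4),
  avg_price = c(88, 63, 46, 55, 8, 67, 42, 40, 61, 13, 17, 85, 17),
  stringsAsFactors = FALSE
)

# One row per company, one avg_price column per product;
# company-product pairs that do not occur are filled with NA
wide <- reshape(avgdf, idvar = "company", timevar = "prod", direction = "wide")
wide
```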

Select/ Extract rows basis of rowname in R [duplicate]

This question already has answers here:
subsetting data frame based on search pattern in vector
(2 answers)
subset with pattern
(3 answers)
Closed 6 years ago.
I want to extract some rows from my data in R based on specific identifier in column ids. My data is like this:
ids A1 B1 C1 D1 E1 ...
asd.wd.01 12 23 27 32 76
qsd.yh.02 54 32 32 11 22
gsd.kj.01 22 21 67 88 22
hnd.gd.02 22 88 42 41 93
sjd.td.01 52 31 72 19 31
And I want following output: (row with 01 eg. xxx.xx.01)
ids A1 B1 C1 D1 E1 ...
asd.wd.01 12 23 27 32 76
gsd.kj.01 22 21 67 88 22
sjd.td.01 52 31 72 19 31
You can use string matching. For example
Index <- grep("\\.01$", df$ids) ## Gives the indices of rows whose ids end in .01
df <- df[Index, ] ## subsets dataframe
You can extract rows by using grepl
df <- subset(df, grepl("\\.01$", ids))
Use the pipe operator %>% and filter() from the dplyr package together with %like% from the data.table package. The example below extracts the rows where Name ends with .1; you can substitute your own data and pattern.
> library(dplyr)
> library(data.table)
> df <- data.frame(Name=c("A.1","B.1","A.3","B.2","C.1"),A=1:5,B=5:9,C=10:14)
> df
Name A B C
1 A.1 1 5 10
2 B.1 2 6 11
3 A.3 3 7 12
4 B.2 4 8 13
5 C.1 5 9 14
> df %>% filter(Name %like% "\\.1$")
Name A B C
1 A.1 1 5 10
2 B.1 2 6 11
3 C.1 5 9 14
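Because the suffix here is a fixed string, a regular expression is not strictly needed. A minimal base R sketch with endsWith() (available in R 3.3.0 and later) is shown below; it assumes ids is a character column, and the data frame only reproduces the first few columns of the question's example.

```r
df <- data.frame(
  ids = c("asd.wd.01", "qsd.yh.02", "gsd.kj.01", "hnd.gd.02", "sjd.td.01"),
  A1 = c(12, 54, 22, 22, 52),
  B1 = c(23, 32, 21, 88, 31),
  stringsAsFactors = FALSE
)

# Keep only the rows whose id ends with the literal suffix ".01"
kept <- df[endsWith(df$ids, ".01"), ]
kept
```

Since endsWith() compares literal text, the dot needs no escaping, which avoids the common mistake of an unescaped "." matching any character.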
