Using a vector of column names in mutate_at

This is my data:
ID a  b  c  d
1  x  1  2  3
2  y  1  2  3
3  z NA NA NA
4  z  1  2  3
5  y NA NA NA
Now, if I wanted to replace the NAs in a single column, say b, with the mean of b by the group a, I know how to do it by using this code:
data %>%
  group_by(a) %>%
  mutate(b = ifelse(is.na(b), as.integer(mean(b, na.rm = TRUE)), b))
I want to use essentially the same code, but applied over columns b, c, and d. However, the code I have isn't working and I don't know why; it fails with "incompatible size (3), expecting 10 (the group size) or 1":
cols <- c("b","c","d")
data %>%
  group_by(a) %>%
  mutate_at(.cols = cols, funs(ifelse(is.na(cols),
    as.integer(mean(cols, na.rm = TRUE)), cols)))
I'm assuming the problem has to do with the code not correctly mapping the column names onto the data?

To reference a character vector of column names here, use mutate_if instead. The error in your attempt comes from ifelse(is.na(cols), ...): inside funs(), cols is just the length-3 character vector of names, not the current column (which is referenced as .), hence the size-3 versus size-10 mismatch.
cols <- c("b","c","d")
data %>%
  group_by(a) %>%
  mutate_if(names(.) %in% cols,
            funs(ifelse(is.na(.), as.integer(mean(., na.rm = TRUE)), .)))
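For reference, the minimal fix to the original mutate_at call is to reference the current column as . rather than cols:
data %>%
  group_by(a) %>%
  mutate_at(cols, funs(ifelse(is.na(.), as.integer(mean(., na.rm = TRUE)), .)))
In dplyr 1.0+, where mutate_at() and funs() are superseded, a sketch of the same replacement with across() would be:
data %>%
  group_by(a) %>%
  mutate(across(all_of(cols),
                ~ ifelse(is.na(.x), as.integer(mean(.x, na.rm = TRUE)), .x)))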


Replace column values in a df with matching index with new values in R

I have a data frame df containing two variables, df and val. df contains the numbers 1-255 and val holds generated values. I also have new_vals, a vector of 255 different values.
df = (seq(1,255,by=1))
df = as.data.frame(df)
df$val = seq(0,1,length.out=255)
new_vals = (df$val+1)
new_vals=as.data.frame(new_vals)
I want to replace the value in df, where each number 1-255 in df$df corresponds to the 255 numbers in new_vals. If the index matches replace df$val with the value at each index from new_vals.
Data frame df:
df         val
1  0.000000000
2  0.003937008
3  0.007874016
Data frame new_vals (these are the values at indices 1, 2, 3):
new_vals
1.000000
1.003937
1.007874
Expected output of df after replacing values at matching indices:
df      val
1  1.000000
2  1.003937
3  1.007874
What is the easiest way I could do this?
Edit: I realize that in this example I could just replace the whole column, but imagine df$df's order of 1-255 were randomized, or the data had more rows.
If I'm understanding correctly, here's a way to match indices with dplyr:
library(dplyr)
new_vals %>%
  mutate(index = row_number()) %>%
  left_join(df, by = c("index" = "df"), keep = TRUE)
Which gives us:
new_vals index df val
1 1.000000 1 1 0.000000000
2 1.003937 2 2 0.003937008
3 1.007874 3 3 0.007874016
The proposed solution, without the intermediate output, would be:
new_vals %>%
  mutate(index = row_number()) %>%
  left_join(df, by = c("index" = "df"), keep = TRUE) %>%
  select(df, val = new_vals)
Which gives us:
df val
1 1 1.000000
2 2 1.003937
3 3 1.007874
4 4 1.011811
5 5 1.015748
6 6 1.019685
7 7 1.023622
8 8 1.027559
9 9 1.031496
10 10 1.035433
If you are sure that df$df contains 1-255 in that order, then:
df$val[which(df$df %in% c(1:255))] <- new_vals$new_vals
In addition, a for loop gives you more control and matches each index explicitly:
for (row in df$df) {
  df$val[df$df == row] <- new_vals$new_vals[row]
}
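A vectorized version of the same index matching (a sketch, assuming every value of df$df is a valid row index into new_vals) also works when df$df is shuffled:
df$val <- new_vals$new_vals[df$df]  # index new_vals directly by df$df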

melting matrices with logical values [duplicate]

I have a matrix with pairwise comparisons, of which the upper triangle and diagonal was set to NA.
df <- data.frame(a=c(NA,1,2), b=c(NA,NA,3), c=c(NA,NA,NA))
row.names(df) <- names(df)
I want to transform the matrix to long format, for which the standard procedure is to use reshape2's melt, followed by na.omit, so my desired output would be:
Var1 Var2 Value
a b 1
a c 2
b c 3
However, df$c is all NA and thus logical, so melt treats it as an id (non-measured) variable. The output of melt(df) is therefore not what I am looking for.
library(reshape2)
melt(df)
How can I prevent melt from using df$c as id variable?
The trick is to convert the row names to a column and then reshape to long format. A way to do it in the tidyverse would be:
library(tidyverse)
df %>%
  rownames_to_column() %>%
  gather(var, val, -1) %>%
  filter(!is.na(val))
# rowname var val
#1 b a 1
#2 c a 2
#3 c b 3
As #Humpelstielzche mentions in comments, there is a na.rm argument in gather so we can omit the last filtering, i.e.
df %>%
  rownames_to_column() %>%
  gather(var, val, -1, na.rm = TRUE)
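As an aside, gather() has since been superseded; a sketch of the same step with tidyr >= 1.0's pivot_longer() would be:
df %>%
  rownames_to_column() %>%
  pivot_longer(-rowname, names_to = "var", values_to = "val",
               values_drop_na = TRUE)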
While you have other answers already, this can also be achieved with reshape2 and melt, if the appropriate method is dispatched. In this case you don't want reshape2:::melt.data.frame but rather reshape2:::melt.matrix to be applied. So, try:
melt(as.matrix(df), na.rm=TRUE)
# Var1 Var2 value
#2 b a 1
#3 c a 2
#6 c b 3
If you then take a look at ?reshape2:::melt.matrix you will see the statement:
This code is conceptually similar to ‘as.data.frame.table’
which means you could also use the somewhat more convoluted:
na.omit(as.data.frame.table(as.matrix(df), responseName="value"))
# Var1 Var2 value
#2 b a 1
#3 c a 2
#6 c b 3
In base R, we can use row() and col() to get the row names and column names respectively, and then filter out the NA values.
df1 <- data.frame(col = colnames(df)[col(df)], row = rownames(df)[row(df)],
                  value = unlist(df), row.names = NULL)
df1[!is.na(df1$value), ]
# col row value
#2 a b 1
#3 a c 2
#6 b c 3

How to match values of several variables to a variable in a look up table?

I have two datasets:
loc <- c("a","b","c","d","e")
id1 <- c(NA,9,3,4,5)
id2 <- c(2,3,7,5,6)
id3 <- c(2,NA,5,NA,7)
cost1 <- c(10,20,30,40,50)
cost2 <- c(50,20,30,30,50)
cost3 <- c(40,20,30,10,20)
dt <- data.frame(loc,id1,id2,id3,cost1,cost2,cost3)
id <- c(1,2,3,4,5,6,7)
rate <- c(0.9,0.8,0.7,0.6,0.5,0.4,0.3)
lookupd_tb <- data.frame(id,rate)
What I want to do is to match the values of id1, id2 and id3 in dt against lookupd_tb, and where there is a match, multiply the rate for that id by its related cost.
This is my approach:
dt <- dt %>%
  left_join(lookupd_tb, by = c("id1" = "id")) %>%
  dplyr::mutate(cost1 = ifelse(!is.na(rate), cost1 * rate, cost1)) %>%
  dplyr::select(-rate)
What I am doing now works fine, but I have to repeat it three times, once for each variable, and I was wondering if there is a more efficient way to do this (probably using the apply family?).
I tried to join all three id variables against id in my lookup table at once, but when rate is joined into dt, all the costs (cost1, cost2 and cost3) get multiplied by the same rate, which I don't want.
I appreciate your help!
A base R approach would be to loop through the 'id' columns with sapply(), find the matching index of each value in lookupd_tb$id, pull the corresponding 'rate', replace the NA elements (no match) with 1, and multiply the result into the 'cost' columns to update them:
nmid <- grep("id", names(dt))      # positions of the id columns
nmcost <- grep("cost", names(dt))  # positions of the cost columns
dt[nmcost] <- dt[nmcost] * sapply(dt[nmid], function(x) {
  x1 <- lookupd_tb$rate[match(x, lookupd_tb$id)]
  replace(x1, is.na(x1), 1)  # unmatched ids keep their original cost
})
Or, using the tidyverse, we can loop through the paired sets of columns, i.e. 'id' and 'cost', with purrr::map2 and then apply the same approach as above. The only difference is that here we create new columns instead of updating the 'cost' columns:
library(tidyverse)
dt %>%
  select(nmid) %>%
  map2_df(., dt %>%
            select(nmcost), ~
            .x %>%
            match(., lookupd_tb$id) %>%
            lookupd_tb$rate[.] %>%
            replace(., is.na(.), 1) * .y) %>%
  rename_all(~ paste0("costnew", seq_along(.))) %>%
  bind_cols(dt, .)
In the tidyverse you can also try an alternative approach, transforming the data from wide to long:
library(tidyverse)
dt %>%
  # transform the data to long format
  gather(k, v, -loc) %>%
  mutate(ID = paste0("costnew", str_extract(k, "[:digit:]")),
         k = str_remove(k, "[:digit:]")) %>%
  spread(k, v) %>%
  # left_join and calculate the new costs
  left_join(lookupd_tb, by = "id") %>%
  mutate(cost_new = ifelse(is.na(rate), cost, rate * cost)) %>%
  # clean up to the expected output
  select(loc, ID, cost_new) %>%
  spread(ID, cost_new) %>%
  left_join(dt, ., by = "loc")  # or %>% bind_cols(dt, .)
loc id1 id2 id3 cost1 cost2 cost3 costnew1 costnew2 costnew3
1 a NA 2 2 10 50 40 10 40 32
2 b 9 3 NA 20 20 20 20 14 20
3 c 3 7 5 30 30 30 21 9 15
4 d 4 5 NA 40 30 10 24 15 10
5 e 5 6 7 50 50 20 25 20 6
The idea is to bring the data into a suitable long format for the left_join, using a gather and spread combination with new index columns k and ID. After the calculation we transform back to the expected output using a second spread and bind the result to dt.
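As a side note, with dplyr 1.0+ a sketch of the same per-pair lookup is possible with across() and cur_column(), assuming the id and cost columns pair up by their numeric suffix:
library(dplyr)
dt %>%
  mutate(across(starts_with("cost"), ~ {
    # pair each cost column with its id column by suffix: cost1 -> id1, etc.
    ids  <- dt[[sub("cost", "id", cur_column())]]
    rate <- lookupd_tb$rate[match(ids, lookupd_tb$id)]
    .x * coalesce(rate, 1)  # rows with no matching id keep the original cost
  }))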

mutate rowSums exclude one column

I have a data frame like this
> df
Source: local data frame [4 x 4]
a x y z
1 name1 1 1 1
2 name2 1 1 1
3 name3 1 1 1
4 name4 1 1 1
I want to mutate it by adding a total of columns x, y, and z (there can be many more numeric columns). Trying to exclude column a as follows is not working:
dft <- df %>% mutate(funs(total = rowSums(.)), -a)
Error: not compatible with STRSXP
This also produces an error:
dft <- df %>% mutate(total = rowSums(.), -a)
Error in rowSums(.) : 'x' must be numeric
What is the right way?
If you want to keep non-numeric columns in the result, you can do this:
df %>% mutate(total = rowSums(.[, sapply(., is.numeric)]))
UPDATE: Now that dplyr has scoped versions of its standard verbs, here's another option:
df %>% mutate(total = rowSums(select_if(., is.numeric)))
UPDATE 2: With dplyr 1.0, the approaches above will still work, but you can also do row sums by combining rowwise and c_across:
iris %>%
  rowwise() %>%
  mutate(row.sum = sum(c_across(where(is.numeric))))
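A related dplyr 1.0 sketch that avoids the per-row overhead of rowwise() combines rowSums() with across():
iris %>% mutate(row.sum = rowSums(across(where(is.numeric))))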
You can use rich selectors with select() inside the call to rowSums()
df %>% transmute(a, total = rowSums(select(., -a)))
This should work:
#dummy data
df <- read.table(text="a x y z
1 name1 1 1 1
2 name2 1 1 1
3 name3 1 1 1
4 name4 1 1 1",header=TRUE)
library(dplyr)
df %>% select(-a) %>% mutate(total=rowSums(.))
First exclude the text column a, then do rowSums over the remaining numeric columns. Note that column a is dropped from the result.

Collapsing data frame by selecting one row per group

I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, the first row from each group.
For example, I'd like to convert this
> d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17))
> d
x y z
1 1 10 20
2 1 11 19
3 2 12 18
4 4 13 17
Into this:
x y z
1 1 11 19
2 2 12 18
3 4 13 17
I'm using aggregate to do this currently, but the performance is unacceptable with more data:
> d.ordered = d[order(-d$y),]
> aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})
I've tried split/unsplit with the same function argument, but unsplit complains about duplicate row numbers.
Is rle a possibility? Is there an R idiom to convert rle's length vector into the indices of the rows that start each run, which I can then use to pluck those rows out of the data frame?
Maybe duplicated() can help:
R> d[ !duplicated(d$x), ]
x y z
1 1 10 20
3 2 12 18
4 4 13 17
R>
Edit: Shucks, never mind. This picks the first in each block of repetitions, while you wanted the last. So here is another attempt using plyr:
R> library(plyr)
R> ddply(d, "x", function(z) tail(z,1))
x y z
1 1 11 19
2 2 12 18
3 4 13 17
R>
Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z using tail(z, 1).
Just to add a little to what Dirk provided... duplicated has a fromLast argument that you can use to select the last row:
d[ !duplicated(d$x,fromLast=TRUE), ]
Here is a data.table solution which will be time and memory efficient for large data sets
library(data.table)
DT <- as.data.table(d) # convert to data.table
setkey(DT, x) # set key to allow binary search using `J()`
DT[J(unique(x)), mult ='last'] # subset out the last row for each x
DT[J(unique(x)), mult ='first'] # if you wanted the first row for each x
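A more recent data.table idiom does the same without setting a key (a sketch):
DT[, .SD[.N], by = x]   # last row per group
DT[, .SD[1L], by = x]   # first row per group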
There are a couple options using dplyr:
library(dplyr)
df %>% distinct(x, .keep_all = TRUE)
df %>% group_by(x) %>% filter(row_number() == 1)
df %>% group_by(x) %>% slice(1)
You can use more than one column with both distinct() and group_by():
df %>% distinct(x, y, .keep_all = TRUE)
The group_by() and filter() approach can be useful if there is a date or some other sequential field and
you want to ensure the most recent observation is kept, and slice() is useful if you want to avoid ties:
df %>% group_by(x) %>% filter(date == max(date)) %>% slice(1)
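In recent dplyr (>= 1.0.0), slice_max() covers the date case in one step; a sketch, assuming a date column:
df %>%
  group_by(x) %>%
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup()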
