Select from column in dataframe based on value in another column - r

I have a dataframe as follows:
dataDF <- data.frame(
id = 1:5,
to_choose = c('red', 'blue', 'red', 'green', 'yellow'),
red_value = c(1,2,3,4,5),
blue_value = c(6,7,8,9,10),
yellow_value = c(11,12,13,14,15)
)
id to_choose red_value blue_value yellow_value
1 red 1 6 11
2 blue 2 7 12
3 red 3 8 13
4 green 4 9 14
5 yellow 5 10 15
I want to create a new column value, which is the value from the appropriate column based on the to_choose column.
I could do this with an ifelse as follows
mutate(dataDF,
value = ifelse(to_choose == 'red', red_value,
ifelse(to_choose == 'blue', blue_value,
ifelse(to_choose == 'yellow', yellow_value, NA))))
To give
id to_choose red_value blue_value yellow_value value
1 red 1 6 11 1
2 blue 2 7 12 7
3 red 3 8 13 3
4 green 4 9 14 NA
5 yellow 5 10 15 15
But is there a simpler one-line way of doing this, along the lines of
mutate(dataDF, value = paste(to_choose, 'value', sep = '_'))

dataDF %>%
gather(var, value , 3:5) %>%
mutate(var = gsub('_value', '', var)) %>%
filter(to_choose == var)

A base R approach using mapply
dataDF$value <- mapply(function(x, y) if(length(y) > 0) dataDF[x, y] else NA,
1:nrow(dataDF), sapply(dataDF$to_choose, function(x) grep(x, names(dataDF))))
dataDF
# id to_choose red_value blue_value yellow_value value
#1 1 red 1 6 11 1
#2 2 blue 2 7 12 7
#3 3 red 3 8 13 3
#4 4 green 4 9 14 NA
#5 5 yellow 5 10 15 15
The idea is to get the appropriate row and column indices to subset on. The row indices are already known, since we need a value for every row of the dataframe. To get the appropriate column, we grep each to_choose value against the column names to find the index of the column the value should be extracted from; when nothing matches (as with 'green'), we return NA.
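As a further base R sketch (not taken from the answers above), the same row/column idea can be written with matrix indexing: build the target column name for each row with paste0, look it up with match, and index a matrix of the value columns by (row, column) pairs. The names vals and col_idx are illustrative only. Rows whose colour has no matching column come out as NA, because match returns NA for them.

```r
dataDF <- data.frame(
  id = 1:5,
  to_choose = c("red", "blue", "red", "green", "yellow"),
  red_value = c(1, 2, 3, 4, 5),
  blue_value = c(6, 7, 8, 9, 10),
  yellow_value = c(11, 12, 13, 14, 15)
)

# numeric matrix holding only the candidate value columns
vals <- as.matrix(dataDF[c("red_value", "blue_value", "yellow_value")])

# column to read from in each row; NA when no "<colour>_value" column exists
col_idx <- match(paste0(dataDF$to_choose, "_value"), colnames(vals))

# a two-column index matrix picks one cell per row; NA rows yield NA
dataDF$value <- vals[cbind(seq_len(nrow(dataDF)), col_idx)]
dataDF$value
# [1]  1  7  3 NA 15
```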

Related

How to vectorize the RHS of dplyr::case_when?

Suppose I have a dataframe that looks like this:
> data <- data.frame(x = c(1,1,2,2,3,4,5,6), y = c(1,2,3,4,5,6,7,8))
> data
x y
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 4 6
7 5 7
8 6 8
I want to use mutate and case_when to create a new id variable that will identify rows using the variable x, and give rows missing x a unique id. In other words, I should have the same id for rows one and two, rows three and four, while rows 5-8 should have their own unique ids. Suppose I want to generate these id values with a function:
id_function <- function(x, n){
set.seed(x)
res <- character(n)
for(i in seq(n)){
res[i] <- paste0(sample(c(letters, LETTERS, 0:9), 32), collapse="")
}
res
}
id_function(1, 1)
[1] "4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf"
I am trying to use this function on the RHS of a case_when expression like this:
data %>%
mutate(my_id = id_function(1234, nrow(.)),
my_id = dplyr::case_when(!is.na(x) ~ id_function(x, 1),
TRUE ~ my_id))
But the RHS does not seem to be vectorized and I get the same value for all non-missing values of x:
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
4 2 4 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
5 NA 5 0vnws5giVNIzp86BHKuOZ9ch4dtL3Fqy
6 NA 6 IbKU6DjvW9ypitl7qc25Lr4sOwEfghdk
7 NA 7 8oqQMPx6IrkGhXv4KlUtYfcJ5Z1RCaDy
8 NA 8 BRsjumlCEGS6v4ANrw1bxLynOKkF90ao
I'm sure there's a way to vectorize the RHS, what am I doing wrong? Is there an easier approach to solving this problem?
I guess rowwise() would do the trick:
data %>%
rowwise() %>%
mutate(my_id = id_function(x, 1))
x y my_id
1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
purrr map functions can be used for non-vectorized functions. The following will give you a similar result. map2 takes the two arguments expected by your id_function; the map2_chr variant returns a character vector rather than a list column.
library(tidyverse)
data %>%
mutate(my_id = map2_chr(x, 1, id_function))
Output
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
4 2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
5 3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
6 4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
7 5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
8 6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
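A base R sketch of the same per-element idea, under the assumption that rows with a missing x should fall back to ids drawn from one separate seed (1234 here, as in the question). make_ids is a hypothetical helper, not part of dplyr or purrr; it calls id_function once per non-missing element with vapply.

```r
# id_function as defined in the question
id_function <- function(x, n) {
  set.seed(x)
  res <- character(n)
  for (i in seq(n)) {
    res[i] <- paste0(sample(c(letters, LETTERS, 0:9), 32), collapse = "")
  }
  res
}

# hypothetical helper: one id per element of x
make_ids <- function(x, fallback_seed = 1234) {
  # start with a distinct fallback id for every row
  ids <- id_function(fallback_seed, length(x))
  ok <- !is.na(x)
  # overwrite rows that have an x with the id seeded by that x
  ids[ok] <- vapply(x[ok], function(s) id_function(s, 1), character(1))
  ids
}

x <- c(1, 1, 2, 2, NA, NA)
ids <- make_ids(x)
```

Rows sharing an x share an id, because the seed fully determines the sampled string, while the NA rows each keep their own fallback id.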

R Fill backwards with flexible window based on number of rows in a separate column

I am trying to carry a value in one column backwards by a number of rows given in a second column and fill everything in between.
So column y mainly has 1s in it but might have individual numbers up to about 20 (in my real data, up to 3 in my example below). If the number in y is 20, I need the 19 rows before that row and that row itself to equal the value of x for the row where y is 20. If the value in y is 1 the output will just equal x.
y also has many NAs, these NAs are either legitimate NAs where I want an NA output or are placeholders where the filling should occur if a y value afterwards is > 1.
I thought I could use dplyr::lead, but it does not accept a variable n to look forward a different number of steps per row, and it wouldn't fill in between. I also wondered about making a new, always-increasing column and using RcppRoll::roll_max, but I have similar problems with the flexible window size.
Typically the y values in the lead-up to a y > 1 will be 0 or NA, but if there are conflicts I want to adopt the later value. For example, in row 8 of my data frame y is 1, followed by y = 2 in row 9, so I want the value associated with row 9 in both rows. If y is NA and the row is not covered by filling backwards, I want it to remain NA (0 would also be fine).
Thanks for any thoughts
set.seed(1)
test <- data.frame(x = sample(1:15,replace = F), y = c(NA,NA,NA,1,NA,NA,3,1,2,1,1,NA,NA,NA,2))
desired_out <- test
desired_out$out <- c(NA,NA,NA,1,11,11,11,8,8,12,5,NA,NA,14,14)
desired_out
#> x y out
#> 1 9 NA NA
#> 2 4 NA NA
#> 3 7 NA NA
#> 4 1 1 1
#> 5 2 NA 11
#> 6 13 NA 11
#> 7 11 3 11
#> 8 3 1 8
#> 9 8 2 8
#> 10 12 1 12
#> 11 5 1 5
#> 12 6 NA NA
#> 13 15 NA NA
#> 14 10 NA 14
#> 15 14 2 14
# try adopting @sirius's answer, written before I specified the extra NAs
test$y <- ifelse(is.na(test$y),0,test$y)
test$out <- with( test, rep( x, y ) )
#> Error in `$<-.data.frame`(`*tmp*`, out, value = c(1L, 11L, 11L, 11L, 3L, : replacement has 11 rows, data has 15
Created on 2021-04-08 by the reprex package (v0.3.0)
Things got a bit complex, but essentially: calculate all the repeated x's for each y > 0, and then let subsequent x's overwrite earlier ones.
library(dplyr)
library(magrittr) # provides the %<>% assignment pipe used below
set.seed(1)
test <- data.frame(x = sample(1:15,replace = F), y = c(NA,NA,NA,1,NA,NA,3,1,2,1,1,NA,NA,NA,2))
desired_out <- test
desired_out$out <- c(NA,NA,NA,1,11,11,11,8,8,12,5,NA,NA,14,14)
desired_out
test %<>% mutate( id = seq(n()) ) %>%
filter( !is.na(y) & y != 0 ) %>%
group_by(id) %>%
slice( rep(1,y) ) %>%
mutate( id = rev( max(id)+1-1:n() ) ) %>%
group_by(id) %>%
summarize( out = as.numeric(last(x)) ) %>%
right_join( test %>% mutate( id=seq(n()) ) ) %>%
arrange( id ) %>% select( -id ) %>% relocate( x, y, out )
identical( as.data.frame(test), desired_out ) ## TRUE
test
Output:
> test
# A tibble: 15 x 3
x y out
<int> <dbl> <dbl>
1 9 NA NA
2 4 NA NA
3 7 NA NA
4 1 1 1
5 2 NA 11
6 13 NA 11
7 11 3 11
8 3 1 8
9 8 2 8
10 12 1 12
11 5 1 5
12 6 NA NA
13 15 NA NA
14 10 NA 14
15 14 2 14
What the algorithm does, which after a few piped lines is no longer very clear, is the following:
1. temporarily add id as the original row number
2. take away the 0 and NA rows for y
3. repeat each row y times
4. within each such repeated row, create a new id that counts backwards (these will be the new row numbers for the x-values to go)
5. group by id again, this time to let later values overwrite earlier values (so keep only the highest row number for any collision)
6. join these data back on the original data, using the newly calculated row numbers; repeated x's will now be inserted
7. sort and clean up
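The overwrite rule in these steps can be checked against a plain loop. This is only a reference sketch for small inputs, not the piped solution itself; fill_back_loop is a hypothetical name, and x is written out as literal values here rather than generated with sample.

```r
fill_back_loop <- function(x, y) {
  out <- rep(NA, length(x))
  for (i in seq_along(x)) {
    if (!is.na(y[i]) && y[i] > 0) {
      # rows i - y[i] + 1 .. i receive x[i]; clamp at the first row
      rows <- max(1, i - y[i] + 1):i
      # rows are visited in order, so later rows overwrite earlier ones
      out[rows] <- x[i]
    }
  }
  out
}

x <- c(9, 4, 7, 1, 2, 13, 11, 3, 8, 12, 5, 6, 15, 10, 14)
y <- c(NA, NA, NA, 1, NA, NA, 3, 1, 2, 1, 1, NA, NA, NA, 2)
fill_back_loop(x, y)
#  [1] NA NA NA  1 11 11 11  8  8 12  5 NA NA 14 14
```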
Sequencing and indexing to the rescue:
test$rn <- seq_len(nrow(test))
src <- with(test[!is.na(test$y),],
list(val = rep(x,y), idx = rep(rn,y) - sequence(y) + 1) )
test$out[src$idx] <- src$val
test$rn <- NULL
# x y out
#1 9 NA NA
#2 4 NA NA
#3 7 NA NA
#4 1 1 1
#5 2 NA 11
#6 13 NA 11
#7 11 3 11
#8 3 1 8
#9 8 2 8
#10 12 1 12
#11 5 1 5
#12 6 NA NA
#13 15 NA NA
#14 10 NA 14
#15 14 2 14
I'm generating a row number, getting the row numbers prior to the key rows, and then overwriting those rows with repeats of the selected rows. Sometimes they specify the same location, but the later value will be taken as you can see in the output.
Should be pretty efficient as everything is vectorised and there's only one major assignment operation back to the original dataset for updating all the rows at once. Here's 4.5M rows processed in a fraction of a second:
test <- test[rep(1:15, 3e5),]
system.time({
test$rn <- seq_len(nrow(test))
src <- with(test[!is.na(test$y),],
list(val = rep(x,y), idx = rep(rn,y) - sequence(y) + 1) )
test$out[src$idx] <- src$val
test$rn <- NULL
})
# user system elapsed
# 0.28 0.00 0.28
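The same sequencing-and-indexing idea can be wrapped in a small function. This is a sketch rather than the answer's exact code: fill_back is a hypothetical name, and it makes two additions that the snippet above does not need for this particular data, namely initialising the result to NA for every row and dropping target positions that would fall before row 1.

```r
fill_back <- function(x, y) {
  out <- x[rep(NA_integer_, length(x))]   # all-NA vector with x's type
  keep <- which(!is.na(y) & y > 0)        # rows that carry a window size
  reps <- y[keep]
  # target rows: for a key row r with window k, rows r, r - 1, ..., r - k + 1
  idx <- rep(keep, reps) - sequence(reps) + 1
  val <- rep(x[keep], reps)
  ok <- idx >= 1                          # ignore windows reaching above row 1
  # with repeated indices the last assignment wins, so later key rows take priority
  out[idx[ok]] <- val[ok]
  out
}

x <- c(9, 4, 7, 1, 2, 13, 11, 3, 8, 12, 5, 6, 15, 10, 14)
y <- c(NA, NA, NA, 1, NA, NA, 3, 1, 2, 1, 1, NA, NA, NA, 2)
fill_back(x, y)
#  [1] NA NA NA  1 11 11 11  8  8 12  5 NA NA 14 14
```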

R - replace all values smaller than a specific value in a column with the nearest bigger value

I have a data frame like this one:
df <- data.frame(c(1,2,3,4,5,6,7), c(0,23,55,0,1,40,21))
names(df) <- c("a", "b")
a b
1 0
2 23
3 55
4 0
5 1
6 40
7 21
Now I want to replace all values smaller than 22 in column b with the nearest bigger value. Of course it is possible to use loops, but since I have quite big datasets this is way too slow.
The solution should look somewhat like this:
a b
1 23
2 23
3 55
4 55
5 40
6 40
7 40
Here is a tidyverse possibility (but note @phiver's comment on replacement ambiguities)
library(tidyverse);
df %>%
mutate(b = ifelse(b < 22, NA, b)) %>%
fill(b) %>%
fill(b, .direction = "up");
# a b
#1 1 23
#2 2 23
#3 3 55
#4 4 55
#5 5 55
#6 6 40
#7 7 40
Explanation: Replace values b < 22 with NA and then use fill to fill NAs with previous/following non-NA entries.
Sample data
df <- data.frame(a = c(1,2,3,4,5,6,7), b = c(0,23,55,0,1,40,21))
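One way to resolve the ambiguity is to read "nearest" as nearest by row position. The sketch below does that in base R, assuming ties go to the earlier row and that at least one value reaches the threshold. replace_with_nearest is a hypothetical helper, written for clarity rather than speed on very large inputs.

```r
replace_with_nearest <- function(b, threshold = 22) {
  good <- which(b >= threshold)   # positions that may serve as replacements
  bad <- which(b < threshold)     # positions to be replaced
  stopifnot(length(good) > 0)
  # for each bad position, the closest good position (which.min keeps the first on ties)
  nearest <- vapply(bad, function(i) good[which.min(abs(good - i))], integer(1))
  b[bad] <- b[nearest]
  b
}

replace_with_nearest(c(0, 23, 55, 0, 1, 40, 21))
# [1] 23 23 55 55 40 40 40
```

On the sample data this reproduces the output requested in the question, including 40 (not 55) in row 5, because row 6 is closer to row 5 than row 3 is.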
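You can use zoo::rollapply:
library(zoo)
# window of width 3 centred on each element; x[2] is the centre of a full window
# (with partial = TRUE the first window has only two elements, so there x[2] is the right-hand neighbour)
df$b <- rollapply(df$b, 3, function(x)
if (x[2] < 22) min(x[x >= 22]) else x[2],
partial = TRUE)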
# df
# a b
# 1 1 23
# 2 2 23
# 3 3 55
# 4 4 55
# 5 5 40
# 6 6 40
# 7 7 40
In base R you could do this for the same output:
transform(df, b = sapply(seq_along(b),function(i)
if (b[i] < 22) {
bi <- c(b,Inf)[seq(i-1,i+1)]
min(bi[bi>=22])
} else b[i]))

R - delete consecutive (ONLY) duplicates

I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive.
For example, for the following data frame:
df = data.frame(x=c(1,1,1,2,2,4,2,2,1))
df$y <- c(10,11,30,12,49,13,12,49,30)
df$z <- c(1,2,3,4,5,6,7,8,9)
x y z
1 10 1
1 11 2
1 30 3
2 12 4
2 49 5
4 13 6
2 12 7
2 49 8
1 30 9
I would need to eliminate rows with consecutive repeated values in the x column, keep the last repeated row, and maintain the structure of the data frame:
x y z
1 30 3
2 49 5
4 13 6
2 49 8
1 30 9
Following directions from help and some other posts, I have tried using the duplicated function:
df[ !duplicated(x,fromLast=TRUE), ] # which gives me this:
x y z
1 1 10 1
6 4 13 6
7 2 12 7
9 1 30 9
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
I'm not sure why I get the NA rows at the end (this wasn't happening with a similar table I was testing), and it works only partially on the values.
I have also tried using the data.table package as follows:
library(data.table)
dt <- as.data.table(df)
setkey(dt, x)
dt[J(unique(x)), mult ='last']
Works great, but it eliminates ALL duplicates from the data frame, not just those that are consecutive, giving something like this:
x y z
1 30 9
2 49 8
4 13 6
Please forgive me if this is a cross-post. I tried some of the suggestions, but none worked for eliminating only the duplicates that are consecutive.
I would appreciate any help.
Thanks
How about:
df[cumsum(rle(df$x)$lengths),]
Explanation:
rle(df$x)
gives you the run lengths and values of consecutive duplicates in the x variable. Then:
rle(df$x)$lengths
extracts the lengths. Finally:
cumsum(rle(df$x)$lengths)
gives the row indices which you can select using [.
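Worked through on the question's data, the intermediate values look like this. The variable names runs and last_rows are only for illustration.

```r
df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1))
df$y <- c(10, 11, 30, 12, 49, 13, 12, 49, 30)
df$z <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

runs <- rle(df$x)
runs$values    # 1 2 4 2 1  (the value of each run)
runs$lengths   # 3 2 1 2 1  (how many rows each run spans)

# the cumulative sum of the run lengths is the row number where each run ends
last_rows <- cumsum(runs$lengths)   # 3 5 6 8 9
res <- df[last_rows, ]
```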
EDIT: for fun, here's a microbenchmark of the answers given so far. rle is mine; consec is the answer given by @James, which I think is the most fundamentally direct one and is the answer I would "accept"; and dp is the dplyr answer given by @Nik.
#> Unit: microseconds
#> expr min lq mean median uq max
#> rle 134.389 145.4220 162.6967 154.4180 172.8370 375.109
#> consec 111.411 118.9235 136.1893 123.6285 145.5765 314.249
#> dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213
rle performs better than I thought it would.
You just need to check that no duplicate follows a value, i.e. x[i+1] != x[i], and note that the last row is always kept.
df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
A cheap solution with dplyr that I could think of:
Method:
library(dplyr)
df %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 30 9
This will even work if your data has the same x value at the bottom
New Input:
df2 <- df %>% add_row(x = 1, y = 10, z = 12)
df2
x y z
1 1 10 1
2 1 11 2
3 1 30 3
4 2 12 4
5 2 49 5
6 4 13 6
7 2 12 7
8 2 49 8
9 1 30 9
10 1 10 12
Use same method:
df2 %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
New Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 10 12
Here is a data.table solution. The trick is to create a shifted version of x with the shift function and compare it with x
library(data.table)
dattab <- as.data.table(df)
dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")]
This way you compare each value of x with the value immediately following it and drop the rows where they match. Make sure to set fill to something that does not occur in x, so that the last value is handled correctly.

understanding apply and outer function in R

Suppose I have data that looks like this:
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values with each other: if an ID has changed its value of variable A over the period given by variable B (which runs from 1 to 4), it goes into data frame K, and if it hasn't, it goes into data frame L.
So for this data set K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
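With nested loops and if/else statements, it could be solved along these lines:
K <- L <- df[0, ]                      # empty data frames with df's columns
for (id in unique(df$ID)) {
  grp <- df[df$ID == id, ]             # rows belonging to this ID
  m <- 0                               # number of changes in A within the ID
  for (j in seq_len(nrow(grp) - 1)) {
    if (grp$A[j] != grp$A[j + 1]) m <- m + 1
  }
  if (m == 0) L <- rbind(L, grp) else K <- rbind(K, grp)
}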
I have read in some posts that nested loops in R can be replaced by the apply and outer functions. Could someone help me understand how they can be used in such circumstances?
So basically you don't need a loop with conditions here. All you need to do is check, within each ID (that is, each cycle of B), whether A has any variance, and then convert the result to a logical using !. This requires converting A to a numeric value first (I'm assuming it's a factor in your real data set; if it's not a factor, you can use FUN = function(x) length(unique(x)) within ave instead). Then split accordingly. With base R we can use ave for such a task, for example
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or (if A is a character rather than a factor)
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
Similarly with data.table
library(data.table)
setDT(df)[, indx := !var(as.numeric(factor(A))), by = ID] # var() needs numeric input; recent R versions refuse a factor
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
group_by(ID) %>%
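mutate(indx = !var(as.numeric(factor(A)))) %>% # var() needs numeric input
ungroup() %>%
split(., .$indx) # split on the new column, not on a global indx object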
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In a bit plainer English: go through all of df$A grouped by df$ID, and apply the function to df$A within each group (i.e. the x in the embedded function): if the number of unique values is more than 1, it's "K", otherwise it's "L".
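To go from those per-ID labels to the two data frames, one option is to look up each row's label by its ID and pass the result to split. This is a sketch on a reconstruction of the question's data, with A stored as character; label and parts are illustrative names.

```r
df <- data.frame(
  ID = rep(1:3, each = 4),
  A = c("X", "X", "Z", "Y", "Y", "X", "X", "Y", "Y", "Y", "Y", "Y"),
  B = rep(1:4, times = 3),
  C = c(10, 10, 15, 12, 15, 13, 13, 13, 16, 18, 19, 10)
)

# one label per ID: "K" if A changes within the ID, "L" otherwise
label <- tapply(df$A, df$ID, function(x) if (length(unique(x)) > 1) "K" else "L")

# expand to one label per row by looking up each row's ID by name, then split
row_label <- as.vector(label[as.character(df$ID)])
parts <- split(df, row_label)

nrow(parts$K)   # 8 (IDs 1 and 2)
nrow(parts$L)   # 4 (ID 3)
```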
We can do this using data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1, and create a column 'ind' based on that. We can then split the dataset based on that 'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1', i.e. 'ID' and 'A'. Double negating (!!) the output converts the '0' counts to FALSE and all other frequencies to TRUE. The rowSums of that, compared with 1, tell us which IDs have more than one distinct 'A' ('indx'). We match the ID column in 'df1' with the names of 'indx', use that to replace each 'ID' with TRUE/FALSE, and split the dataset on it.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
If we need individual datasets in the global environment, we can change the names of the list elements to the object names we want and use list2env (not recommended, though)
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)
