Recode numeric values in R

I want to recode some numeric values into different numeric values and have had a go using the following code:
survey$KY27PHYc <- revalue(survey$KY27PHY1, c(5=3, 4=2,3=2,2=1,1=1))
I get the following error:
## Error: unexpected '=' in "survey$KY27PHYc <- revalue(survey$KY27PHY1, c(5="
Where am I going wrong?
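The error is a parse error rather than a problem with the data: inside c(), names such as 5 = 3 are not syntactically valid, so they have to be quoted (e.g. "5" = "3"). In addition, plyr::revalue() only accepts factors and character vectors, not numeric ones. A minimal sketch of a working call, assuming survey$KY27PHY1 holds whole numbers from 1 to 5:
library(plyr)
# quote the names, recode on a character version, then convert back to numeric
survey$KY27PHYc <- as.numeric(revalue(as.character(survey$KY27PHY1),
                                      c("5" = "3", "4" = "2", "3" = "2", "2" = "1", "1" = "1")))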

We can recode numeric values using recode or case_when from dplyr (0.7.0 or later).
library(dplyr)
packageVersion("dplyr")
# [1] ‘0.7.0’
x <- 1:10
# With recode function using backquotes as arguments
dplyr::recode(x, `2` = 20L, `4` = 40L)
# [1] 1 20 3 40 5 6 7 8 9 10
# Note: the "L" suffix is needed so the replacements are integers, matching x.
dplyr::recode(x, `2` = 20, `4` = 40)
# [1] NA 20 NA 40 NA NA NA NA NA NA
# Warning message:
# Unreplaced values treated as NA as .x is not compatible. Please specify replacements exhaustively or supply .default
# With recode function using characters as arguments
as.numeric(dplyr::recode(as.character(x), "2" = "20", "4" = "40"))
# [1] 1 20 3 40 5 6 7 8 9 10
# With case_when function
dplyr::case_when(
x %in% 2 ~ 20,
x %in% 4 ~ 40,
TRUE ~ as.numeric(x)
)
# [1] 1 20 3 40 5 6 7 8 9 10

This is revalue() from the plyr package; it does not work on numeric vectors. If you want to use it, convert to character first:
library(plyr)
x <- 1:10 # your numeric vector
as.numeric(revalue(as.character(x), c("2" = "33", "4" = "88")))
# [1] 1 33 3 88 5 6 7 8 9 10

Try this:
#sample data
set.seed(123); x <- sample(1:5, size = 10, replace = TRUE)
x
# [1] 2 4 3 5 5 1 3 5 3 3
#recode via a lookup vector: position i holds the new value for old value i
x <- c(1, 1, 2, 2, 3)[x]
x
# [1] 1 2 2 3 3 1 2 3 2 2
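Applied to the original question, the same lookup-vector trick implements the mapping 5→3, 4→2, 3→2, 2→1, 1→1 directly (a sketch, assuming survey$KY27PHY1 contains whole numbers from 1 to 5):
survey$KY27PHYc <- c(1, 1, 2, 2, 3)[survey$KY27PHY1]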

Related

Replace values in one dataframe with another that's not NA

I have two dataframes A and B that have the same column names and the same first column (Location):
A <- data.frame("Location" = 1:3, "X" = c(21,15, 7), "Y" = c(41,5, 5), "Z" = c(12,103, 88))
B <- data.frame("Location" = 1:3, "X" = c(NA,NA, 14), "Y" = c(50,8, NA), "Z" = c(NA,14, 12))
How do I replace the values in dataframe A with the values from B when the value in B is not NA?
Thanks.
We can use coalesce
library(dplyr)
# for each column except Location, take the value from B and fall back to A where B is NA
A %>%
mutate(across(-Location, ~ coalesce(B[[cur_column()]], .)))
-output
# Location X Y Z
#1 1 21 50 12
#2 2 15 8 14
#3 3 14 5 12
Here's an answer in base R:
i <- which(!is.na(B), arr.ind = TRUE) # row/column indices of the non-NA cells in B
A[i] <- B[i]
A
Location X Y Z
1 1 21 50 12
2 2 15 8 14
3 3 14 5 12
One option is fcoalesce from the data.table package:
list2DF(Map(data.table::fcoalesce, B, A))
gives
Location X Y Z
1 1 21 50 12
2 2 15 8 14
3 3 14 5 12

case_when using variable name to change data value

I have the following dataframe:
df <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(0,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(0,0,34,5,45,7)
)
I want to change specific values of each column using the following logic:
If the variable name contains "_lag1", the first element of the column has to turn into NA
If the variable name contains "_lag2", the first and second elements of the column have to turn into NA
Otherwise the column remains as it is
The expected result should look like:
df_new <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(NA,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(NA,NA,34,5,45,7)
)
As you have the original unlagged variables in your df, you could simply recompute the lagged values using e.g. dplyr::lag, which by default will give you NAs:
df <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(0,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(0,0,34,5,45,7)
)
library(dplyr)
df %>% mutate(var1_lag1 = dplyr::lag(var1_lag0, n = 1), var2_lag2 = dplyr::lag(var2_lag0, n = 2))
#> var1_lag0 var1_lag1 var2_lag0 var2_lag2
#> 1 1 NA 34 NA
#> 2 2 1 5 NA
#> 3 3 2 45 34
#> 4 4 3 7 5
#> 5 5 4 2 45
#> 6 6 5 1 7
A base R solution might look like this:
df <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(0,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(0,0,34,5,45,7)
)
df_new <- df
df_new[1 , grep(pattern="_lag1", colnames(df))] <- NA
df_new[c(1,2) , grep(pattern="_lag2", colnames(df))] <- NA
df_new
#> var1_lag0 var1_lag1 var2_lag0 var2_lag2
#> 1 1 NA 34 NA
#> 2 2 1 5 NA
#> 3 3 2 45 34
#> 4 4 3 7 5
#> 5 5 4 2 45
#> 6 6 5 1 7
Created on 2021-01-06 by the reprex package (v0.3.0)
Here is a for loop that checks the column names of the df for the keywords "_lag1" and "_lag2" and turns the corresponding values into NA.
for (i in 1:length(df)){
if (grepl("_lag1",colnames(df)[i])){
df[1,i] = NA
}
else if (grepl("_lag2",colnames(df)[i])){
df[1:2,i] = NA
}
}
You can try to wrap a case_when inside a helper function and use mutate_at with contains to get the proper columns.
df %>%
mutate_at(vars(contains("lag1")),
function(x) fix(x, "lag1")) %>%
mutate_at(vars(contains("lag2")),
function(x) fix(x, "lag2"))
Which produces
var1_lag0 var1_lag1 var2_lag0 var2_lag2
1 1 NA 34 NA
2 2 1 5 NA
3 3 2 45 34
4 4 3 7 5
5 5 4 2 45
6 6 5 1 7
Here is the helper function called fix
fix <- function(x, lag){
real_lag <- case_when(stringr::str_detect("lag1", lag) ~ 1,
stringr::str_detect("lag2", lag) ~ 2)
x[1:real_lag] <- NA
return(x)
}
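In current dplyr versions mutate_at is superseded by across(); a sketch of the same idea using the fix() helper above:
library(dplyr)
df %>%
  mutate(across(contains("lag1"), ~ fix(.x, "lag1")),
         across(contains("lag2"), ~ fix(.x, "lag2")))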

Replace values with the subsetting operator [

I have the following data:
set.seed(1)
df_1 <- data.frame(x = replicate(n = 2, expr = sample(x = 1:3, size = 20, replace = T)),
y = as.factor(sample(x = 1:5, size = 20, replace = TRUE)))
I want to replace the numbers >= 2 with 9 in x.1 and x.2 simultaneously:
df_1[df_1$x.1, df_1$x.2 >= 2] <- 9
Error in [<-.data.frame(*tmp*, df_1$x.1, df_1$x.2 >= 2, value = 9) :
duplicate subscripts for columns
And replace the number 3 by 99 in y.
df_1$y[df_1$y %in% c('3')] <- 99
Warning message:
In [<-.factor(*tmp*, df_1$y %in% c("3"), value = c(2L, 5L, 2L, :
invalid factor level, NA generated
Thanks.
We can use replace
df_1[1:2] <- replace(df_1[1:2], df_1[1:2] >=2, 9)
Or another option is to create a logical matrix on the subset of the 'x.' columns and use it to assign 9 to those values:
df_1[1:2][df_1[1:2] >= 2] <- 9
For changing the factor, we either need to call factor again (see the sketch after the example below) or add the new level beforehand:
levels(df_1$y) <- c(levels(df_1$y), "99")
df_1$y
#[1] 4 4 4 2 4 1 1 4 1 2 3 2 2 5 2 1 3 3 4 3
#Levels: 1 2 3 4 5 99
df_1$y[df_1$y == '3'] <- '99'
df_1$y
#[1] 4 4 4 2 4 1 1 4 1 2 99 2 2 5 2 1 99 99 4 99
#Levels: 1 2 3 4 5 99
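The "call factor again" route mentioned above could look like this (a sketch, starting again from the original df_1):
y_chr <- as.character(df_1$y) # drop the factor structure
y_chr[y_chr == "3"] <- "99"   # replace as character
df_1$y <- factor(y_chr)       # rebuild the factor, which creates the new level set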
Or, as @thelatemail mentioned, we can rename the level directly, which does the replacement while dropping the old level:
levels(df_1$y)[levels(df_1$y) == '3'] <- "99"
Or we can use fct_recode from forcats:
library(forcats)
df_1$y <- fct_recode(df_1$y, "99" = "3")

Data on one row by ID

I have a data frame with one id column and several other columns grouped in pairs, and I'm trying to put all the data for the same id on one row. IDs do not each appear the same number of times.
My data looks like this:
df <- data.frame(id=sample(1:4, 12, T), vpcc1=1:12, hpcc1=rnorm(12), vpcc2=1:12, hpcc2=rnorm(12), vpcc3=1:12, hpcc3=rnorm(12))
df
## id vpcc1 hpcc1 vpcc2 hpcc2 vpcc3 hpcc3
## 1 1 1 0.04632267 1 -0.37404379 1 0.90711353
## 2 4 2 0.50383152 2 0.06075954 2 0.30690284
## 3 1 3 1.52450117 3 -1.21539925 3 -1.12411614
## 4 1 4 -0.50624871 4 -0.75988364 4 -0.47970608
## 5 3 5 1.64610863 5 0.03445275 5 -0.18895338
## 6 1 6 0.22019099 6 -0.32101883 6 1.29375822
## 7 2 7 -0.10041807 7 -0.17351799 7 -0.03767921
## 8 2 8 0.81683565 8 0.62449158 8 0.50474787
## 9 2 9 -0.46891269 9 1.07743469 9 -0.55539149
## 10 1 10 0.69736549 10 -0.08573679 10 0.28025325
## 11 3 11 0.73354215 11 0.80676315 11 -1.12561358
## 12 2 12 -0.40903143 12 1.94155313 12 0.64231119
For the moment I came up with this:
align2 <- function(df) {
result <- lapply(1:nrow(df), function(j) lapply(1:3, function(i) {x <- df[j, paste0(c("vpcc", "hpcc"), i)]
names(x) <- paste0(c("vpcc", "hpcc"), (i + (j-1)*4))
return(x)}))
result2 <- lapply(result, function(x) do.call(cbind, x))
result3 <- do.call(cbind, result2)
return(result3)
}
testX <- lapply(1:4, function(k) align2(as.data.frame(split(df, f=df$id)[[k]])))
library(plyr)
testX2 <- do.call(rbind.fill, testX)
testX2
## vpcc1 hpcc1 vpcc2 hpcc2 vpcc3 hpcc3 vpcc4 hpcc4 vpcc5 hpcc5 vpcc6 hpcc6 vpcc7 hpcc7 vpcc8 hpcc8 ...
## 1 1 0.04632267 1 -0.37404379 1 0.90711353 3 1.5245012 3 -1.2153992 3 -1.1241161 4 -0.5062487 4 -0.7598836 ...
## 2 7 -0.10041807 7 -0.17351799 7 -0.03767921 8 0.8168356 8 0.6244916 8 0.5047479 9 -0.4689127 9 1.0774347 ...
## 3 5 1.64610863 5 0.03445275 5 -0.18895338 11 0.7335422 11 0.8067632 11 -1.1256136 NA NA NA NA ...
## 4 2 0.50383152 2 0.06075954 2 0.30690284 NA NA NA NA NA NA NA NA NA NA ...
It's a partial solution since it doesn't keep the id.
But I can't imagine there's not an easier way...
Thank you for suggestions.
PS: maybe there's already a solution on SO but I didn't find it...
In your example the variables vpcc1, vpcc2 etc. are redundant, since they all contain the same values. So you can transform the dataset into a more economical structure:
df <- data.frame(id=sample(1:4, 12, T), vpcc=1:12, hpcc1=rnorm(12),
hpcc2=rnorm(12),hpcc3=rnorm(12))
Then use reshape() and you'll have all the values for each id in a single row, with the columns corresponding to the vpcc value, so that "hpcc3.5" means hpcc3 when vpcc is 5.
reshape(df, idvar = "id", direction = "wide", timevar = "vpcc")
EDIT:
if vpccX varies, then maybe this will give you what you need?
df <- data.frame(id=sample(1:4, 12, T), vpcc1=1:12, hpcc1=rnorm(12), vpcc2=1:12,
hpcc2=rnorm(12), vpcc3=1:12, hpcc3=rnorm(12))
df$time = ave(df$id, df$id, FUN = function(x) 1:length(x))
reshape(df, idvar = "id", direction = "wide", timevar = "time")
Of course, you can rename your variables if needed.
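A more recent alternative, not in the original answer, is tidyr::pivot_wider (assuming tidyr 1.0 or later); it also gives one row per id, with suffixed names such as vpcc1_1, vpcc1_2 that you can rename afterwards:
library(dplyr)
library(tidyr)
df %>%
  group_by(id) %>%
  mutate(time = row_number()) %>% # running index within each id
  ungroup() %>%
  pivot_wider(names_from = time,
              values_from = c(vpcc1, hpcc1, vpcc2, hpcc2, vpcc3, hpcc3))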
When you say "same row", is it necessary that the output is like it is in your attempt or would you be happy with something like:
x <- aggregate(df[2:ncol(df)],list(df$id),list)
which allows you to view output on one row as:
x
# Group.1 vpcc1 hpcc1 vpcc2 hpcc2 vpcc3
#1 1 9, 10 1.4651392, 0.8581344 9, 10 -1.621135, 1.391945 9, 10
#2 2 1, 3, 7 2.784998, 1.667367, -1.329005 1, 3, 7 0.2115051, 0.7871399, -0.4835389 1, 3, 7
#3 3 5, 6 -0.5024987, 0.2822224 5, 6 0.155844, 1.336449 5, 6
#4 4 2, 4, 8, 11, 12 -0.48563550, -0.92684024, -0.04016263, -0.41861021, 0.02309864 2, 4, 8, 11, 12 -0.17304058, 0.25428404, -0.49897995, 0.03101927, -0.13529866 2, 4, 8, 11, 12
# hpcc3
#1 -0.05182822, 0.28365514
#2 -0.06189895, -0.83640652, 0.19425789
#3 -0.006440312, 1.378218706
#4 0.09412386, 0.16733125, -1.15198965, -1.00839015, -0.16114475
and reference different values of vpcc and hpcc using list notation:
x$vpcc1
#$`0`
#[1] 9 10
#$`1`
#[1] 1 3 7
#$`2`
#[1] 5 6
#$`3`
#[1] 2 4 8 11 12
x$vpcc1[[1]]
#[1] 9 10
?

How to ignore case when using subset in R

How to ignore case when using subset function in R?
eos91corr.data <- subset(test.data,select=c(c(X,Y,Z,W,T)))
I would like to select columns with names x, y, z, w, t. What should I do?
Thanks
If you can live without the subset() function, the tolower() function may work:
dat <- data.frame(XY = 1:5, x = 1:5, mm = 1:5,
y = 1:5, z = 1:5, w = 1:5, t = 1:5, r = 1:5)
dat[,tolower(names(dat)) %in% c("xy","x")]
However, this will return a data.frame with the columns in the order they are in the original dataset dat: both
dat[,tolower(names(dat)) %in% c("xy","x")]
and
dat[,tolower(names(dat)) %in% c("x","xy")]
will yield the same result, although the order of the target names has been reversed.
If you want the columns in the result to be in the order of the target vector, you need to be slightly more fancy. The two following commands both return a data.frame with the columns in the order of the target vector (i.e., the results will be different, with columns switched):
dat[,sapply(c("x","xy"),FUN=function(foo)which(foo==tolower(names(dat))))]
dat[,sapply(c("xy","x"),FUN=function(foo)which(foo==tolower(names(dat))))]
You could use regular expressions with the grep function to ignore case when identifying column names to select. Once you have identified the desired column names, then you can pass these to subset.
If your data are
dat <- data.frame(xy = 1:5, x = 1:5, mm = 1:5, y = 1:5, z = 1:5,
w = 1:5, t = 1:5, r = 1:5)
# xy x mm y z w t r
# 1 1 1 1 1 1 1 1 1
# 2 2 2 2 2 2 2 2 2
# 3 3 3 3 3 3 3 3 3
# 4 4 4 4 4 4 4 4 4
# 5 5 5 5 5 5 5 5 5
Then
(selNames <- grep("^[XYZWT]$", names(dat), ignore.case = TRUE, value = TRUE))
# [1] "x" "y" "z" "w" "t"
subset(dat, select = selNames)
# x y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
EDIT If your column names are longer than one letter, the above approach won't work too well. So assuming you can get your desired column names in a vector, you could use the following:
upperNames <- c("XY", "Y", "Z", "W", "T")
(grepPattern <- paste0("^", upperNames, "$", collapse = "|"))
# [1] "^XY$|^Y$|^Z$|^W$|^T$"
(selNames2 <- grep(grepPattern, names(dat), ignore.case = TRUE, value = TRUE))
# [1] "xy" "y" "z" "w" "t"
subset(dat, select = selNames2)
# xy y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
The 'stringr' package is a very neat wrapper for all of this functionality and also has an option to ignore case.
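For example, a minimal sketch assuming a recent stringr version (where case-insensitive matching is requested via regex(ignore_case = TRUE)) and the lowercase dat from the previous answer:
library(stringr)
# keep the columns whose names match the targets, ignoring case
selNames3 <- str_subset(names(dat), regex("^(xy|y|z|w|t)$", ignore_case = TRUE))
dat[selNames3]
#   xy y z w t
# 1  1 1 1 1 1
# 2  2 2 2 2 2
# 3  3 3 3 3 3
# 4  4 4 4 4 4
# 5  5 5 5 5 5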
Also, you may want to consider using match rather than subset.
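A sketch of the match() idea, again with the lowercase dat from above; match() returns the positions of the targets among the column names, so the result keeps the order of the target vector (targetNames and idx are just illustrative names):
targetNames <- c("xy", "y", "z", "w", "t")
idx <- match(targetNames, tolower(names(dat)))
dat[, idx]
#   xy y z w t
# 1  1 1 1 1 1
# (remaining rows identical to the output above)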
