I am having a brain cramp. Below is a toy dataset:
df <- data.frame(
id = 1:6,
v1 = c("a", "a", "c", NA, "g", "h"),
v2 = c("z", "y", "a", NA, "a", "g"),
stringsAsFactors = FALSE)
I have a specific value that I want to find across a set of defined columns, and I want to identify which column position it is in. The fields I am searching are character columns, and the trick is that the value I am looking for might not exist. In addition, missing values (NA) are present in the dataset.
Assuming I knew how to do this, the variable position indicates the values I would like returned.
> df
id v1 v2 position
1 1 a z 1
2 2 a y 1
3 3 c a 2
4 4 <NA> <NA> 99
5 5 g a 2
6 6 h g 99
The general rule is that I want to find the position of value "a", and if it is not located or if v1 is missing, then I want 99 returned.
In this instance, I am searching across v1 and v2, but in reality, I have 10 different variables. It is also worth noting that the value I am searching for can only exist once across the 10 variables.
What is the best way to generate this recode?
Many thanks in advance.
Use match:
> df$position <- apply(df,1,function(x) match('a',x[-1], nomatch=99 ))
> df
id v1 v2 position
1 1 a z 1
2 2 a y 1
3 3 c a 2
4 4 <NA> <NA> 99
5 5 g a 2
6 6 h g 99
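With ten columns rather than two, it may be clearer to name the columns to search instead of dropping id by position. A minimal sketch, assuming the columns are listed in a hypothetical vector search_cols (not part of the original answer):

```r
df <- data.frame(
  id = 1:6,
  v1 = c("a", "a", "c", NA, "g", "h"),
  v2 = c("z", "y", "a", NA, "a", "g"),
  stringsAsFactors = FALSE
)

# search_cols is a hypothetical name; list all ten variables here in real data
search_cols <- c("v1", "v2")

# match() returns the index of the first exact match, or nomatch if there is none
df$position <- apply(df[search_cols], 1,
                     function(row) match("a", row, nomatch = 99))
df$position
```

Because the position is counted within search_cols, reordering that vector changes the numbers returned.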
Firstly, drop the first column:
df <- df[, -1]
Then, do something like this (disclaimer: I'm feeling terribly sleepy*):
( df$result <- unlist(lapply(apply(df, 1, grep, pattern = "a"), function(x) ifelse(length(x) == 0, 99, x))) )
v1 v2 result
1 a z 1
2 a y 1
3 c a 2
4 <NA> <NA> 99
5 g a 2
6 h g 99
* sleepy = code is not vectorised
EDIT (slightly different solution, I still feel sleepy):
df$result <- rapply(apply(df, 1, grep, pattern = "a"), function(x) ifelse(length(x) == 0, 99, x))
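Note that grep matches a pattern, so it would also flag a cell such as "banana". If an exact match is wanted, a vectorised version can be sketched with a logical matrix and max.col. This is only one possible approach, and the names hits and first_hit are made up for the illustration:

```r
df <- data.frame(
  v1 = c("a", "a", "c", NA, "g", "h"),
  v2 = c("z", "y", "a", NA, "a", "g"),
  stringsAsFactors = FALSE
)

# Logical matrix of exact matches; comparisons with NA give NA, so reset them
hits <- df[c("v1", "v2")] == "a"
hits[is.na(hits)] <- FALSE

# Column of the first TRUE in each row (multiplying by 1 makes the matrix numeric)
first_hit <- max.col(hits * 1, ties.method = "first")

# Rows without any hit get 99 instead of the meaningless column index
position <- ifelse(rowSums(hits) > 0, first_hit, 99)
position
```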
Related
I would like to change specific values to missing values using multiple conditions. Another way to describe what I am doing, I would like to change a specific value in multiple columns, but only for a specific group in my dataset. Suppose I have the following dataset:
df <- data.frame(id = c("A", "A", "A", "A", "B", "B","B", "B"),
x1 = c(1, 99, 2, 99, 3, 99, 5, 6),
x2 = c(99, 1, 99, 2, 3, 4, 99, 6))
df
id x1 x2
1 A 1 99
2 A 99 1
3 A 2 99
4 A 99 2
5 B 3 3
6 B 99 4
7 B 5 99
8 B 6 6
I would like to change the values 99 to NA, but only for a subset when id equals A of my dataset.
This is a simple example, my real dataset has multiple columns. But I am trying to do something like this:
col <- c("x1", "x2")
df[, col] <- ifelse(df$id == "A" & df[,col] == 99, NA, df[,col])
I tried other variations of the code, but I keep getting error messages and I am not sure what I am doing wrong. Does anyone have a suggestion, or know what I am getting wrong?
ifelse often does not behave quite as expected; it's important to remember that, from the documentation ?ifelse, it:
"returns a value with the same shape as test"
I think replace can work here:
df[, col] <- replace(df[, col],
df[, col] == 99 & df$id == "A",
NA)
Result:
id x1 x2
1 A 1 NA
2 A NA 1
3 A 2 NA
4 A NA 2
5 B 3 3
6 B 99 4
7 B 5 99
8 B 6 6
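If you prefer to work column by column, the same idea can be sketched with lapply. This is an alternative sketch, not the only way to do it; is_a and cols are names chosen for the illustration:

```r
df <- data.frame(id = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 x1 = c(1, 99, 2, 99, 3, 99, 5, 6),
                 x2 = c(99, 1, 99, 2, 3, 4, 99, 6))

cols <- c("x1", "x2")
is_a <- df$id == "A"   # rows belonging to group A

# For each column, blank out the 99s that fall in group A rows only
df[cols] <- lapply(df[cols], function(x) replace(x, is_a & x == 99, NA))
df
```

Each column is handled as a plain vector here, so the test and the replacement always have the same length.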
I am organizing a large dataset adapted to my research. Suppose that I have 9 observations (records) and 4 columns as follows:
z <- data.frame("fa" = c(1, NA, NA, 2, 1, 1, 2, 1, 1),
"fb" = c(2, 2, NA, 1, NA, NA, NA, 1, 2),
"initial_1" = c("A", "B", "B", "B", "A", "C", "D", "B", "A"),
"initial_2" = c("D", "C", "C", "A", "B", "A", "A", "D", "D"))
I want to create two new columns, fa_new and fb_new, from the values of the first two columns, fa and fb. Each value points to one of the reference columns, initial_1 and initial_2: fa == # means the value should be taken from initial_#.
For example, as can be seen above, the first record of the column fa is 1, which is linked to "A" in initial_1. Thus, the first record of the new column fa_new will be "A". Likewise, the first record of fb is 2, which is linked to "D" in initial_2; thus, the first record of fb_new will be "D".
Accordingly, my expectation is:
fa_new fb_new
1 A D
2 NA C
3 NA NA
4 A B
5 A NA
6 C NA
7 A NA
8 B B
9 A D
Is this possible using R?
You can use lapply to do this for multiple columns:
cols <- 1:2
init_cols <- paste0('initial_', cols)
new_cols <- paste0(names(z)[cols], '_new')
inds <- 1:nrow(z)
z[new_cols] <- lapply(z[cols], function(x) z[init_cols][cbind(inds, x)])
z
# fa fb initial_1 initial_2 fa_new fb_new
#1 1 2 A D A D
#2 NA 2 B C <NA> C
#3 NA NA B C <NA> <NA>
#4 2 1 B A A B
#5 1 NA A B A <NA>
#6 1 NA C A C <NA>
#7 2 NA D A A <NA>
#8 1 1 B D B B
#9 1 2 A D A D
The logic here is that cbind builds a two-column matrix of row and column numbers. The row numbers are inds (1:nrow(z)), and the column numbers come from the fa/fb columns; indexing z[init_cols] with that matrix picks one cell per row.
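The matrix indexing can be seen in isolation on a small character matrix. A minimal sketch with made-up values (initials and pick are hypothetical names):

```r
# Three rows, two reference columns
initials <- cbind(initial_1 = c("A", "B", "B"),
                  initial_2 = c("D", "C", "C"))

# For each row, which column to take; NA means "no choice recorded"
pick <- c(1, 2, NA)

# Each row of the index matrix is a (row, column) pair
picked <- initials[cbind(1:3, pick)]
picked
```

A row of the index matrix that contains NA yields NA in the result, which is why the missing fa/fb values carry through to the new columns.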
The actual data frame is called dataset; the following version should work on the real data.
cols <- 1:2
init_cols <- paste0('fuinitials_', 1:94)
new_cols <- paste0(names(z)[cols], '_new')
inds <- 1:nrow(z)
z1 <- data.frame(z)
z1[cols][z1[cols] < 1] <- NA
z1[new_cols] <- lapply(z1[cols], function(x) z1[init_cols][cbind(inds, x)])
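The step that turns values below 1 into NA presumably matters because of how an index matrix treats zeros. A small sketch of that behaviour, using a toy matrix m that is not part of the real data:

```r
m <- matrix(letters[1:6], nrow = 3)   # columns: a b c / d e f

# A zero in the index matrix is silently skipped, so the result is shorter
with_zero <- m[cbind(1:3, c(1, 0, 2))]

# An NA keeps its place and produces NA
with_na <- m[cbind(1:3, c(1, NA, 2))]

length(with_zero)
length(with_na)
```

A shortened result would no longer line up with the rows of the data frame, so converting zeros to NA first keeps one value per row.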
I have a data set in Excel with a lot of VLOOKUP formulas that I am trying to translate into R using the data.table package.
In my example below I am saying, for each row, find the value in column y within column x and return the value in column z.
The first row results in na because the value 6 doesn't exist in column x.
On the second row, the value 5 appears twice in column x, but returning the first match is fine, which is e in this case.
I've added the Result column, which is the expected outcome.
library(data.table)
dt <- data.table(x = c(1,2,3,4,5,5),
y = c(6,5,4,3,2,1),
z = c("a", "b", "c", "d", "e", "f"),
Result = c("na", "e", "d", "c", "b", "a"))
Many thanks
You can do this with a join, but you need to change the order first:
setorder(dt, y)
dt[.(x = x, z = z), result1 := i.z, on = .("y" = x)]
setorder(dt, x)
# x y z Result result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 1 f a a
#6: 5 2 e b b
I haven't tested if this is faster than match for a big data.table, but it might be.
We can just use match to find, for each element of 'y', the index of the first matching element of 'x', and use that index to get the corresponding 'z':
dt[, Result1 := z[match(y,x)]]
dt
# x y z Result Result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 2 e b b
#6: 5 1 f a a
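The same lookup can be checked step by step on plain vectors. A minimal base R sketch using the values from the question (lookup_idx and result are names chosen for the illustration):

```r
x <- c(1, 2, 3, 4, 5, 5)
y <- c(6, 5, 4, 3, 2, 1)
z <- c("a", "b", "c", "d", "e", "f")

# Position in x of the first occurrence of each y; NA when y is not in x
lookup_idx <- match(y, x)
lookup_idx

# Indexing z by those positions behaves like an exact-match VLOOKUP
result <- z[lookup_idx]
result
```

Because match stops at the first occurrence, the duplicated 5 in x resolves to "e" rather than "f".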
I have one question which is probably easy for a lot of you. I would like to write a function which will do the calculations based on condition in selected column. It will be easier to show you an example:
con <- c("A", "B", "B", "C", "C", "A", "D", "A", "B", "D", "D", "D")
value <- c(1, 3, 2, 1, 1, 1, 2, 1, 2, 3, 3, 2)
dat <- data.frame(con, value)
head(dat)
So one possibility would be to do this in this simple way:
dat$new <- ifelse(dat$con == "A", dat$value*10,
ifelse(dat$con == "B", dat$value*100, dat$value*1000))
head(dat)
But my question is: what would the function look like? I tried something like this, but it is not working. Can someone explain what is missing or wrong?
calc <- function(dat) {
if(dat[, con] == "A") {
new <- dat$value*10
}
if(dat[, con] == "B") {
new <- dat$value*100
} else {
new <- dat$value*1000
}
}
calc(dat)
You can also create a function without if and ifelse:
calc <- function(data)
transform(data, new = value * 1000 / 100 ^ (con == "A") / 10 ^ (con == "B"))
The function is based on mathematical operations.
calc(dat)
# con value new
# 1 A 1 10
# 2 B 3 300
# 3 B 2 200
# 4 C 1 1000
# 5 C 1 1000
# 6 A 1 10
# 7 D 2 2000
# 8 A 1 10
# 9 B 2 200
# 10 D 3 3000
# 11 D 3 3000
# 12 D 2 2000
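The trick relies on logical values being treated as 0 and 1 in arithmetic, so a divisor raised to FALSE becomes 1 and has no effect. A short sketch on a toy vector:

```r
con <- c("A", "B", "C")

# TRUE acts as 1 and FALSE as 0, so only the "A" position keeps the divisor 100
div_a <- 100 ^ (con == "A")
div_a

# Combined multiplier: 10 for A, 100 for B, 1000 for everything else
multiplier <- 1000 / 100 ^ (con == "A") / 10 ^ (con == "B")
multiplier
```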
calc <- function(dat) {
  dat$new <- ifelse(dat[,'con'] == 'A', dat[,'value']*10,
                    ifelse(dat[,'con'] == 'B', dat[,'value']*100,
                           dat[,'value']*1000)
  )
  dat
}
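Inside a function, the subsetting operator $ only works when the column name is typed literally; it cannot take a name stored in a variable, so the form DF[,'<variable>'] is the safer habit. Note the quotation marks around the column names: your original dat[, con] looks for an object called con rather than the column "con". In addition, if() expects a single TRUE or FALSE, so it cannot test a whole column at once, which is why ifelse() is used here. Finally, your original function ends with an assignment and never returns the modified data frame. A function returns the value of its last expression, so the last line here is dat.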
calc(dat)
con value new
1 A 1 10
2 B 3 300
3 B 2 200
4 C 1 1000
5 C 1 1000
6 A 1 10
7 D 2 2000
8 A 1 10
9 B 2 200
10 D 3 3000
11 D 3 3000
12 D 2 2000
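If the column names need to be passed in as arguments, one option is [[ ]] with a named lookup vector. This is a sketch under assumptions: the argument names group_col, value_col and default, and the multipliers vector, are invented for the example.

```r
calc <- function(data, group_col = "con", value_col = "value", default = 1000) {
  # Hypothetical lookup table: one multiplier per known group
  multipliers <- c(A = 10, B = 100)

  # [[ ]] accepts a column name held in a variable, unlike $
  key <- as.character(data[[group_col]])
  mult <- unname(multipliers[key])   # NA for groups not in the table
  mult[is.na(mult)] <- default

  data$new <- data[[value_col]] * mult
  data   # return the modified data frame
}

con <- c("A", "B", "B", "C", "C", "A", "D", "A", "B", "D", "D", "D")
value <- c(1, 3, 2, 1, 1, 1, 2, 1, 2, 3, 3, 2)
dat <- data.frame(con, value)
res <- calc(dat)
res$new
```

Adding another group then only means adding an entry to the lookup vector rather than another nested ifelse.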
I want to create a new column based on 4 values in another column.
if col1=1 then col2= G;
if col1=2 then col2=H;
if col1=3 then col2=J;
if col1=4 then col2=K.
How do I do this in R?
Could someone please help with this? I have tried if/else and ifelse, but neither seems to work. Thanks
You could use nested ifelse:
col2 <- ifelse(col1==1, "G",
ifelse(col1==2, "H",
ifelse(col1==3, "J",
ifelse(col1==4, "K",
NA )))) # all other values map to NA
In this simple case it's overkill, but for more complicated ones...
You have a special case of a lookup where the keys are the integers 1:4. This means you can use vector indexing to solve your problem in one easy step.
First, create some sample data:
set.seed(1)
dat <- data.frame(col1 = sample(1:4, 10, replace = TRUE))
Next, define the lookup values, and use [ subsetting to find the desired results:
values <- c("G", "H", "J", "K")
dat$col2 <- values[dat$col1]
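The results (the sampled values may differ on your machine, since R 3.6.0 changed the default sample() algorithm):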
dat
col1 col2
1 2 H
2 2 H
3 3 J
4 4 K
5 1 G
6 4 K
7 4 K
8 3 J
9 3 J
10 1 G
More generally, you can use [ subsetting combined with match to solve this kind of problem:
index <- c(1, 2, 3, 4)
values <- c("G", "H", "J", "K")
dat$col2 <- values[match(dat$col1, index)]
dat
col1 col2
1 2 H
2 2 H
3 3 J
4 4 K
5 1 G
6 4 K
7 4 K
8 3 J
9 3 J
10 1 G
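The match version also covers keys that are not consecutive integers, and it returns NA for values that have no entry. A small sketch with made-up keys:

```r
# Hypothetical lookup table with non-consecutive keys
keys   <- c(10, 20, 30)
labels <- c("G", "H", "J")

col1 <- c(20, 10, 99, 30)   # 99 has no entry in keys

# match() gives the position of each col1 value in keys, NA if absent
col2 <- labels[match(col1, keys)]
col2
```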
There are a number of ways of doing this, but here's one.
set.seed(357)
mydf <- data.frame(col1 = sample(1:4, 10, replace = TRUE))
mydf$col2 <- rep(NA, nrow(mydf))
mydf[mydf$col1 == 1, ][, "col2"] <- "A"
mydf[mydf$col1 == 2, ][, "col2"] <- "B"
mydf[mydf$col1 == 3, ][, "col2"] <- "C"
mydf[mydf$col1 == 4, ][, "col2"] <- "D"
col1 col2
1 1 A
2 1 A
3 2 B
4 1 A
5 3 C
6 2 B
7 4 D
8 3 C
9 4 D
10 4 D
Here's one using car's recode.
library(car)
mydf$col3 <- recode(mydf$col1, "1='A'; 2='B'; 3='C'; 4='D'")
One more from this question:
mydf$col4 <- c("A", "B", "C", "D")[mydf$col1]
You could have a look at ?symnum.
In your case, something like:
col2<-symnum(col1, seq(0.5, 4.5, by=1), symbols=c("G", "H", "J", "K"))
should get you close.