Conditional Replacement Column Content--many ids to be updated - r

Thinking I could take the easy way out, I was going to use ifelse to replace id codes in an entire dataset. I have a specific dataset with an id column. I have to replace these old ids with updated ids, but there are 50k+ rows with 270 unique ids. So, I first tried:
df$id<- ifelse(df$id== 2, 1,
ifelse(df$id== 3, 5,
ifelse(df$id == 4, 5,
ifelse(df$id== 6, NA,
ifelse(df$id== 7, 7,
ifelse(df$id== 285, NA,
ifelse(df$id== 8, 10,.....
ifelse(df$id == 200, 19, df$id)
While this would have worked, I am limited to 51 nests, and I cannot split the statement into separate pieces because each piece would cover only 1/4 of the set. The updates from the first pieces would then interfere with the later ones, since the codes overlap.
I then tried
df$id[df$id== 2] <- 1
and I was going to do that for every code. However, if I update all twos to one, there is still a later rule in which "1" becomes some number X, and I would only want the old "1" to become X, not the rows that were just changed from 2 to 1. I actually think this rules out the if/else approach even if 51 was not the limit. Is there a function similar to VLOOKUP in Excel? Any ideas?
Thanks!
Old forum related to replacing cell contents, but does not work in my case.
Replace contents of factor column in R dataframe

partial example
df <- data.frame(id=seq(1, 10))
old.id <- c(2, 3, 4, 6)
new.id <- c(1, 5, 5, NA)
df$id[df$id %in% old.id] <- new.id[unlist(sapply(df$id, function(x) which(old.id==x)))]
output
> df
id
1 1
2 1
3 5
4 5
5 5
6 NA
7 7
8 8
9 9
10 10
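A VLOOKUP-style alternative, as a minimal sketch: match() looks every id up in old.id in a single pass against the original values, so an id that was just changed to 1 is never remapped by a later rule. The helper name recode_ids is hypothetical, and ids that do not appear in old.id are assumed to stay unchanged.

```r
# Hypothetical helper: recode ids through a lookup table, all at once
recode_ids <- function(id, old.id, new.id) {
  idx <- match(id, old.id)      # position of each id in old.id, NA if absent
  hit <- !is.na(idx)            # TRUE where a replacement is defined
  id[hit] <- new.id[idx[hit]]   # replace only those; NA replacements are kept
  id
}

df <- data.frame(id = seq(1, 10))
old.id <- c(2, 3, 4, 6)
new.id <- c(1, 5, 5, NA)
df$id <- recode_ids(df$id, old.id, new.id)
df$id
```

Because the lookup always reads the original ids, overlapping rules are safe: recode_ids(c(1, 2), old.id = c(2, 1), new.id = c(1, 9)) maps the old 1 to 9 and the old 2 to 1.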

How to go along a numeric vector and mark the index of currently minimal value until finding a smaller one?

I want to obtain the indexes of minimal values such as:
v1 <- c(20, 30, 5, 18, 2, 10, 8, 4)
The result is:
1 3 5
Explanation:
Over v1, we start at value 20. Without moving on, we note the minimal value (20) and its index (1). We ignore the adjacent element because it is greater than 20. So 20 still holds the record for smallest. Then we move to 5, which is smaller than 20. Now that 5 is the smallest, we note its index (3). Since 18 isn't smaller than so-far-winner (5), we ignore it and keep going right. Since 2 is the smallest so far, it is the new winner and its position is noted (5). No value smaller than 2 moves right, so that's it. Finally, positions are:
1 # for `20`
3 # for `5`
5 # for `2`
Clearly, the output should always start with 1, because we never know what comes next.
Another example:
v2 <- c(7, 3, 4, 4, 4, 10, 12, 2, 7, 7, 8)
# output:
1 2 8
which.min() seems to be pretty relevant. But I'm not sure how to use it to get the desired result.
You can use:
which(v1 == cummin(v1))
[1] 1 3 5
If you have duplicated cumulative minimums and don't want the duplicates indexed, you can use:
which(v1 == cummin(v1) & !duplicated(v1))
Or:
match(unique(cummin(v1)), v1)
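To see how the three variants differ, here is a small sketch on a made-up vector (v3 is hypothetical, not from the question) in which the running minimum value 3 occurs twice:

```r
v3 <- c(5, 3, 3, 1)   # hypothetical vector: the running minimum 3 appears twice

which(v3 == cummin(v3))                     # 1 2 3 4 (both 3s are indexed)
which(v3 == cummin(v3) & !duplicated(v3))   # 1 2 4
match(unique(cummin(v3)), v3)               # 1 2 4

# The second example from the question
v2 <- c(7, 3, 4, 4, 4, 10, 12, 2, 7, 7, 8)
which(v2 == cummin(v2))                     # 1 2 8
```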
This is the verbose way:
library(purrr)
v1 <- c(20, 30, 5, 18, 2, 10, 8, 4)
v1 %>%
length() %>%
seq() %>%
map_dbl(~ which.min(v1[1: .x])) %>%
unique()
#> [1] 1 3 5
Created on 2021-12-08 by the reprex package (v2.0.1)

Filtering Rows with duplicate column values

I was cleaning a dataset for class. I noticed there were some negative values. Some rows with this condition also have the same id name in two columns 2 and 3.
I'm stumped. I'm trying to draft some code, but I'm unsure where to start. I would love to get advice. I couldn't find anything similar.
Below is a sample table similar to the table I have.
df <- data.frame(A=c(1,2,4,7,8), B=c(2,2,4,9,9), C=c(0,1,5,3,4))
Do I use ifelse() nested within a filter()? I want to filter a data table so that it excludes rows that have duplicate values in columns A and B. Using the table above as an example, what code would result in getting back rows 1, 4 and 5?
(sorry, above example keeps coming up as code and not a table.)
Up to now, the question has not received a proper answer.
If I understand correctly, the OP wants to know how to remove / filter out those rows where the columns A and B have identical values. Or, in other words how to keep those rows where A and B are different.
This is a basic question for which different approaches are available in R:
base R
df[df$A != df$B, ]
or
subset(df, A != B)
dplyr
as already mentioned in Martin Gal's comment
dplyr::filter(df, A != B)
data.table
as the question was tagged with data.table
data.table::setDT(df)[A != B]
All return rows 1, 4, and 5, e.g.,
A B C
1 1 2 0
4 7 9 3
5 8 9 4
There is no ifelse() required.
Data
df <- data.frame(
A = c(1, 2, 4, 7, 8),
B = c(2, 2, 4, 9, 9),
C = c(0, 1, 5, 3, 4)
)
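As a quick check, the sketch below runs the two base R variants on the sample data and then on a hypothetical copy (df_na, not part of the question) with one missing value. The two variants differ only in how they treat a comparison that evaluates to NA.

```r
df <- data.frame(
  A = c(1, 2, 4, 7, 8),
  B = c(2, 2, 4, 9, 9),
  C = c(0, 1, 5, 3, 4)
)

res_bracket <- df[df$A != df$B, ]
res_subset  <- subset(df, A != B)
rownames(res_bracket)   # "1" "4" "5"
rownames(res_subset)    # "1" "4" "5"

# Hypothetical variant: a missing value in A makes the comparison NA for row 4
df_na <- df
df_na$A[4] <- NA
nrow(df_na[df_na$A != df_na$B, ])   # 3: bracket indexing returns an all-NA row
nrow(subset(df_na, A != B))         # 2: subset() drops rows where the test is NA
```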

Filter based on starting letter and presence of an asterisk in column

I have a large data frame, with 22 columns. I want to filter based on values in the second column, so if the value doesn't start with "X" I want to remove that row. Also I want to remove the row if this value in the second column contains an asterisk.
test <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
secondcolumn <- c("Xidfhsfd*isjdf", "Hsuhdfskdh", "Xwidfsoid", "X*sdkfjjhsd", "Xkdsfhsd", "Uskesfudhsk", "Sdfukhsdiu", "Osdfihsdoiuh", "Xsodifdsifj")
othercolumn <- c(3, 5, 7,2, 5, 8, 3, 0, 5)
df <- data.frame(test, secondcolumn, othercolumn)
How would this be done? In this example, I would want to remove the 1st, 2nd, 4th, 6th, 7th, and 8th rows.
Thanks!
Hope this works
# Condition 1: value starts with "X"
cond1 <- grep("^X", df[, 2])
# Condition 2: value doesn't contain "*"
cond2 <- grep("\\*", df[, 2], invert = TRUE)
# Rows where both conditions are true
wantedRows <- intersect(cond1, cond2)
# Table with only those rows
df[wantedRows, ]
Another option would be to match 'X' at the start (^) of the string followed by one or more characters that are not a * ([^*]+) until the end ($) of the string to get the numeric index and subset rows based on that
df[grep("^X[^*]+$", df$secondcolumn),]
# test secondcolumn othercolumn
#3 3 Xwidfsoid 7
#5 5 Xkdsfhsd 5
#9 9 Xsodifdsifj 5
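The same filter can also be written without regular expressions. This is a minimal sketch using startsWith() and a fixed-string grepl(); it assumes secondcolumn is a character vector (hence the explicit stringsAsFactors = FALSE), and the variable name keep is only for illustration.

```r
df <- data.frame(
  test = 1:9,
  secondcolumn = c("Xidfhsfd*isjdf", "Hsuhdfskdh", "Xwidfsoid", "X*sdkfjjhsd",
                   "Xkdsfhsd", "Uskesfudhsk", "Sdfukhsdiu", "Osdfihsdoiuh",
                   "Xsodifdsifj"),
  othercolumn = c(3, 5, 7, 2, 5, 8, 3, 0, 5),
  stringsAsFactors = FALSE
)

# TRUE where the value starts with "X" and contains no literal asterisk
keep <- startsWith(df$secondcolumn, "X") &
  !grepl("*", df$secondcolumn, fixed = TRUE)

df[keep, ]   # rows 3, 5 and 9
```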

order() not behaving as expected

Ok. I am pretty convinced I am about to embarrass myself, but here we go.
I cannot get order() to work properly. I am trying to come up with a composite ranking by two different factors, which is distilled down to an example below:
test1 <- rnorm(5)
test2 <- abs(rnorm(5))
test1; test2
> 0.4839582 0.1665794 -0.7648058 -0.5492701 0.6616983
> 0.8491913 0.2840523 2.3413548 0.7299879 0.1584666
test1Ord <- order(test1, decreasing = TRUE)
test2Ord <- order(test2)
test3Ord <- test1Ord + test2Ord
test1Ord; test2Ord; test3Ord
> 5 1 2 4 3
> 5 2 4 1 3
> 10 3 6 5 6
order(as.numeric(test3Ord), decreasing = TRUE)
> 1 3 5 4 2
As you can see, the vector c(10, 3, 6, 5, 6) should be ordered 1, 5, 3, 4, 2 or 1, 5, 2, 4, 3 (because of the tie at 6). This is not what the output is.
Am I missing something?!
It looks like I was looking for rank(). (I was previously unaware of this function.) I am pretty familiar with order(), but got mixed up in what I was trying to do.
The rank() of the vector provides what I was going for.
Thanks to all for setting me straight!
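For illustration, a short sketch of the difference using the numbers from the question: order() returns the positions that would sort the vector, while rank() returns each element's place in the sorted result. Building the composite from rank() rather than order() is my reading of what the question was after, not something stated in the thread.

```r
x <- c(10, 3, 6, 5, 6)

order(x, decreasing = TRUE)       # 1 3 5 4 2: positions of the sorted values
rank(-x, ties.method = "first")   # 1 5 2 4 3: place of each element, largest first
rank(-x)                          # 1 5 2.5 4 2.5: ties share the average rank

# Composite ranking built from ranks instead of order() output
test1 <- c(0.4839582, 0.1665794, -0.7648058, -0.5492701, 0.6616983)
test2 <- c(0.8491913, 0.2840523, 2.3413548, 0.7299879, 0.1584666)
composite <- rank(-test1) + rank(test2)
composite                         # 6 5 10 7 2
```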

Permutations from columns of a data frame in R with specific conditions

This may be a rather complex question so if someone can at least point me in the right direction I can probably figure out the rest on my own.
Sample data:
dat <- data.frame(A = c(1, 4, 5, 3, NA, 5), B = c(6, 5, NA, 5, 3, 5), C = c(5, 3, 1, 5, 3, 7), D = c(5, NA, 3, 10, 4, 5))
A B C D
1 1 6 5 5
2 4 5 3 NA
3 5 NA 1 3
4 3 5 5 10
5 NA 3 3 4
6 5 5 7 5
I would like to find all possible permutations of letter sequences of different lengths from the table shown above. For example, one valid letter sequence might be: A C A D D B. Another valid sequence could be B C C.
However, there are a few exceptions to this I'd like to follow:
1. Must be able to specify the minimum length of the returned sequence.
Note that in my example above, the min sequence length was 3 and the max sequence length was equal to the number of rows. I would like to be able to specify the min value (the max value will always be equal to the number of rows, 6 in the case of the sample data).
Note that if the sequence length is shorter than 6, it cannot be generated from skipping rows. In other words, any short sequences must come from consecutive rows. Clarification based on comments: Short sequences do not have to start on row 1. A short sequence could start on row 3 and continue onward through consecutive rows to row 6.
2. Letters with an NA value are not available for sampling.
Note that in row 2 there is an NA in the D column. This means that D would not be available for sampling in row 2. So A B D would be a valid combination but A D D would not be valid.
3. The sequences must be ranked based on the values in each cell.
Notice how each cell has a specific value in it. Each sequence chosen can be ranked by summing up the value shown in the table for the chosen letter. Using the example from above A C A D D B would have a rank of 1+3+5+10+4+5. So when generating all possible sequence they should be ordered from highest rank to lowest rank.
I would like to apply all three of these rules to the data table listed above to find all combinations of sequences possible of minimum length 3 and maximum length 6.
Please let me know if I need to clarify anything!
In principle, you want to do this using expand.grid I believe. Using your example data, I worked out the basics here:
dat <- data.frame(A = c(1, 4, 5, 3, NA, 5),
B = c(6, 5, NA, 5, 3, 5),
C = c(5, 3, 1, 5, 3, 7),
D = c(5, NA, 3, 10, 4, 5))
dat[,1][!is.na(dat[,1])] <- paste("A",na.omit(dat[,1]),sep="-")
dat[,2][!is.na(dat[,2])] <- paste("B",na.omit(dat[,2]),sep="-")
dat[,3][!is.na(dat[,3])] <- paste("C",na.omit(dat[,3]),sep="-")
dat[,4][!is.na(dat[,4])] <- paste("D",na.omit(dat[,4]),sep="-")
transp_data <- as.data.frame(t(dat))
data_list <- list(V1 = as.vector(na.omit(transp_data$V1)),
V2 = as.vector(na.omit(transp_data$V2)),
V3 = as.vector(na.omit(transp_data$V3)),
V4 = as.vector(na.omit(transp_data$V4)),
V5 = as.vector(na.omit(transp_data$V5)),
V6 = as.vector(na.omit(transp_data$V6)))
This code lets you essentially transform your data frame into a list of vectors of different lengths (one element for each row in your original data, with the NAs omitted). The reason you would want to do this is that it makes finding the acceptable combinations trivially easy with the expand.grid function.
To solve for the full six-row sequences, you would simply use:
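# stringsAsFactors = FALSE keeps the columns as character, so the gsub() assignments below work
grid_6 <- do.call(what = expand.grid,
args = c(data_list, stringsAsFactors = FALSE))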
This would give you a list of all possible permutations that met your criteria for the six (i.e. there were no NA elements). You can extract the numeric data back using some regular expressions (not a very vectorized way of doing it, but this is a complex thing that I don't have time to fully put into a function).
grid_6_letters <- grid_6
for(x in 1:ncol(grid_6_letters)) {
for(y in 1:nrow(grid_6_letters)) {
grid_6_letters[y,x] <- gsub(pattern = "-[0-9]*",replacement = "",x = grid_6_letters[y,x])
}
}
grid_6_numbers <- grid_6
for(x in 1:ncol(grid_6_numbers)) {
for(y in 1:nrow(grid_6_numbers)) {
grid_6_numbers[y,x] <- gsub(pattern = "^[ABCD]-",replacement = "",x = grid_6_numbers[y,x])
}
grid_6_numbers[[x]] <- as.numeric(grid_6_numbers[[x]])
}
grid_6_letters$Total <- rowSums(grid_6_numbers)
grid_6_letters <- grid_6_letters[order(grid_6_letters$Total,decreasing = TRUE),]
Anyway, if you wanted to get the various lower-level combinations, you could do it by simply using expand.grid on subsets of the list and combining them using rbind (with some judicious use of setNames as needed). Example:
grid_3 <- rbind(setNames(do.call(what = expand.grid,args = list(data_list[1:3],stringsAsFactors = FALSE)),nm = c("V1","V2","V3")),
setNames(do.call(what = expand.grid,args = list(data_list[2:4],stringsAsFactors = FALSE)),nm = c("V1","V2","V3")),
setNames(do.call(what = expand.grid,args = list(data_list[3:5],stringsAsFactors = FALSE)),nm = c("V1","V2","V3")),
setNames(do.call(what = expand.grid,args = list(data_list[4:6],stringsAsFactors = FALSE)),nm = c("V1","V2","V3")))
Anyway, with some time and programming, you can likely wrap this into a function that is much better than my example, but hopefully it will get you started.
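One way such a function might look, as a rough sketch rather than a polished implementation: it loops over every window of consecutive rows of at least min_len, calls expand.grid on the letters available in each row, and sums the matching cell values. The name ranked_sequences and its arguments are hypothetical, and it assumes min_len is no larger than the number of rows.

```r
dat <- data.frame(A = c(1, 4, 5, 3, NA, 5),
                  B = c(6, 5, NA, 5, 3, 5),
                  C = c(5, 3, 1, 5, 3, 7),
                  D = c(5, NA, 3, 10, 4, 5))

# Hypothetical helper: all letter sequences over consecutive rows, ranked by total
ranked_sequences <- function(dat, min_len = 3) {
  m <- as.matrix(dat)
  n <- nrow(m)
  out <- list()
  for (start in seq_len(n - min_len + 1)) {
    for (end in (start + min_len - 1):n) {
      rows <- start:end
      # Letters available in each row of the window (NA cells are excluded)
      choices <- lapply(rows, function(r) colnames(m)[!is.na(m[r, ])])
      grid <- expand.grid(choices, stringsAsFactors = FALSE)
      # Look up the value of each chosen letter and sum across the window
      total <- Reduce(`+`, Map(function(l, r) unname(m[r, l]), grid, rows))
      out[[length(out) + 1]] <- data.frame(
        start = start, end = end,
        sequence = do.call(paste, c(unname(as.list(grid)), sep = " ")),
        total = total,
        stringsAsFactors = FALSE
      )
    }
  }
  res <- do.call(rbind, out)
  res[order(res$total, decreasing = TRUE), ]   # highest rank first
}

res <- ranked_sequences(dat, min_len = 3)
head(res, 3)
```

On the sample data the top-ranked sequence is B B A D D C with a total of 37, and the example from the question, A C A D D B, gets 1+3+5+10+4+5 = 28.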
Sorry, I don't do any R anymore, so I'll try to help with some quick-and-dirty code...
addPointsToSequence <- function(seq0, currRow){
i<-0;
for(i in 1:4){# 4 is the number of columns
seq2 = seq0
if (!is.na(dat[currRow,i])){
# add the point at the end of seq2
seq2 = cbind(seq2,dat[currRow,i])
# here I add the value, but you may prefer
# adding the colnames(dat)[i] and using the value to estimate the value of this sequence, in another variable
if(length(seq2) >= 3){
# save seq2 as an existing sequence where you need to
print (seq2)
}
if(currRow < 6){# 6 is the number of rows in dat (use nrow?)
addPointsToSequence(seq2, currRow+1)
}
}
}
}
dat <- data.frame(A = c(1, 4, 5, 3, NA, 5), B = c(6, 5, NA, 5, 3, 5), C = c(5, 3, 1, 5, 3, 7), D = c(5, NA, 3, 10, 4, 5))
for (startingRow in 1:4){
#4 is the last row you can start from to make a length3 sequence
emptySequence <- NULL
# pass the loop variable; `i` is not defined at this level
addPointsToSequence(emptySequence, startingRow)
}
