Create Value in final column of dataframe based on multiple columns - r

I have a dataframe that looks like this (but with a lot more variables/columns)
set.seed(5)
id<-seq(5)*floor(runif(5,min=1000, max=10000))
vals1<-c("Y","N","N","N","N")
vals2<-c("N","N","N","N","N")
vals3<-c("N","N","N","Y","N")
df<-data.frame(id,vals1,vals2,vals3)
I'd like to create a final column in the frame that holds a flag with the following logic: if there is a 'Y' in any column for a given id, the final flag is 'Y'; otherwise it is 'N'. So, for this dataframe the 1st and 4th ids (2801, 14236) have a 'Y' in the final column and the rest have an 'N'. I tried a few approaches like apply and if...else to no avail.

Initialize by assigning "N" to every row. In the next step, assign "Y" to the rows that contain a "Y" (checked using apply):
df$final = "N"
df$final[apply(df, 1, function(a) "Y" %in% a)] = "Y"

A solution that works directly on your letter encoding is below. Note that the new column it creates is logical (TRUE/FALSE), not 'Y'/'N'.
set.seed(5)
id <- seq(5) * floor(runif(5, min=1000, max=10000))
vals1 <- c("Y","N","N","N","N")
vals2 <- c("N","N","N","N","N")
vals3 <- c("N","N","N","Y","N")
df <- data.frame(id, vals1, vals2, vals3)
# If you really want to use the letter encoding, my solution works as below
df$Final <- apply(df[,2:4], MARGIN = 1, FUN = function(x) {any(x == 'Y')})
However, I think you should use a boolean (TRUE/FALSE) for this. Booleans work well in combination with apply and any:
set.seed(5)
id <- seq(5) * floor(runif(5, min=1000, max=10000))
vals1 <- c("Y","N","N","N","N")
vals2 <- c("N","N","N","N","N")
vals3 <- c("N","N","N","Y","N")
df <- data.frame(id, vals1, vals2, vals3)
# Convert your labels into booleans:
df[,2:4] <- df[,2:4] == 'Y'
# Then summarise across rows
df$Final <- apply(df[,2:4], MARGIN = 1, FUN = function(x) {any(x)})

Somewhat similar to @d.b's answer:
df$final <- apply(df, 1, function(x) c("N","Y")[any(x == "Y")+1])
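The answers above all go row by row with apply. For a frame with many columns, a vectorized variant may be worth trying. This is only a sketch, and it assumes that every flag column has a name starting with "vals" (true for the example, but an assumption about the real data):

```r
# Rebuild the example data from the question
set.seed(5)
id <- seq(5) * floor(runif(5, min = 1000, max = 10000))
df <- data.frame(id,
                 vals1 = c("Y", "N", "N", "N", "N"),
                 vals2 = c("N", "N", "N", "N", "N"),
                 vals3 = c("N", "N", "N", "Y", "N"))

# Assumption: all flag columns are named vals1, vals2, ...
flag_cols <- grep("^vals", names(df), value = TRUE)

# df[flag_cols] == "Y" is a logical matrix; rowSums counts the "Y"s per row
df$final <- ifelse(rowSums(df[flag_cols] == "Y") > 0, "Y", "N")
df$final
#[1] "Y" "N" "N" "Y" "N"
```

Selecting the flag columns by name keeps the id column out of the comparison, so an id can never be mistaken for a flag.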


Passing dataframe as argument to function

I am writing a function to process data from a huge dataframe (row by row) which always has the same column names. So I want to pass the dataframe itself as an argument to the function and read the information I need from the individual rows. However, when I pass it as an argument, I can't read the information from it for some reason.
Dataframe:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
My code:
List <- do.call(list, Map(function(DT) {
DT <- as.data.frame(DT)
aa <- as.numeric(strsplit(DT$Age, ","))
mean.aa <- mean(aa)
},
DF))
Trying this, I get a list with the column names, but all values are NULL.
Expected output:
My expected output is a list with length equal to the number of rows in the data frame. Under each list index there should be another list with the age of the corresponding row (and also other stuff from the same row of the data frame, later).
DF <- apply(data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"), "mean.aa" = c(179.7143, 100.8571)), 1, as.list)
What am I doing wrong?
Here is one way:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
apply(DF, 1, function(row){
aa <- as.numeric(strsplit(row["Age"], ",")[[1]])
row["mean.aa"] <- mean(aa)
as.list(row)
})
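One thing to keep in mind with apply on a data frame: each row is handed to the function as a character vector, so SN and mean.aa end up as strings in the resulting lists. If the column types matter, a sketch along the following lines keeps them. Here row_to_list is a hypothetical helper name, not something from a package:

```r
DF <- data.frame(Name = c("A", "B"), SN = 1:2,
                 Age = c("21,34,456,567,23,123,34",
                         "15,345,567,3,23,45,67,76,34,34,55,67,78,3"),
                 stringsAsFactors = FALSE)

# row_to_list() is a hypothetical helper: turns row i of 'data' into a list
row_to_list <- function(i, data) {
  row <- as.list(data[i, , drop = FALSE])   # one row, column types preserved
  ages <- as.numeric(strsplit(row$Age, ",", fixed = TRUE)[[1]])
  row$mean.aa <- mean(ages)                 # stays numeric
  row
}

# Iterate over row indices instead of over columns
result <- lapply(seq_len(nrow(DF)), row_to_list, data = DF)
str(result[[1]])
```

Looping over row indices also sidesteps the problem in the question, where Map iterates over the columns of DF rather than its rows.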

How to replace existing values by new values from look-up list without causing NA?

I have a data frame. One column contains the following values:
df$current_column=(A,B,C,D,E)
A vector contains a look up value:
v <- c(A=X, B=Y)
I want to replace this column to come up with a list of (X, Y, C,D,E)
I am thinking to create a new column like
df$new_column <- v[df$current_column]
It does the replacement of A and B but it also makes C,D,E as NA (X,Y, NA, NA, NA).
How to keep C,D and E or is there any other way?
Looks like ifelse() could help:
df$current_column <- ifelse(df$current_column == "A", "X",
ifelse(df$current_column == "B", "Y", df$current_column))
We can create a logical index with %in% and then do the conversion
i1 <- df$current_column %in% names(v)
df$new_column <- df$current_column
df$new_column[i1] <- v[df$new_column[i1]]
df$new_column
#[1] "X" "Y" "C" "D" "E"
Or use a single ifelse
with(df, ifelse(current_column %in% names(v),
v[current_column], current_column))
Update
If the 'current_column' is factor class, convert to character class and it should work.
df$new_column <- as.character(df$current_column)
df$new_column[i1] <- v[df$new_column[i1]]
data
df <- data.frame(current_column = LETTERS[1:5],
stringsAsFactors=FALSE)
v <- setNames(c('X', 'Y'), LETTERS[1:2])
user2029709,
-- was working off of your little example; for a more generic approach it would be nice to see a snippet of the real data or close simulation. In any case, here is something that may work for you better, without coding manually all ifelse() options, and is still a relatively straightforward solution:
vd <- data.frame(current_column = names(v), new_column = v, stringsAsFactors = FALSE)
df <- merge(df, vd, by = 'current_column', all.x = TRUE)
# keep the original value only where the merge found no replacement
df$new_column <- ifelse(is.na(df$new_column), df$current_column, df$new_column)
You may have to modify data types when creating vd data.frame to assure proper merge.
Best,
oleg
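To see where the NA values come from and how to fall back to the original value, here is a minimal sketch using the small example data from this thread. The name looked_up is just an illustrative variable:

```r
df <- data.frame(current_column = LETTERS[1:5], stringsAsFactors = FALSE)
v <- c(A = "X", B = "Y")

# Indexing a named vector by name returns NA for names that are not in v
looked_up <- unname(v[df$current_column])
looked_up
#[1] "X" "Y" NA  NA  NA

# Keep the original value wherever the lookup produced NA
df$new_column <- ifelse(is.na(looked_up), df$current_column, looked_up)
df$new_column
#[1] "X" "Y" "C" "D" "E"
```

This assumes current_column is character. If it is a factor, convert it with as.character() first, as mentioned in the update above.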

Serial Subsetting in R

I am working with large datasets. I have to extract values from one dataset, and the identifiers for those values are stored in another dataset. So basically I am subsetting twice for each value of one category. For multiple categories, I have to combine such double-subsetted values. I am doing something similar to what is shown below, but I think there must be a better way to do it.
example datasets
set.seed(1)
df <- data.frame(number= seq(5020, 5035, 1), value =rnorm(16, 20, 5),
type = rep(c("food", "bar", "sleep", "gym"), each = 4))
df2 <- data.frame(number= seq(5020, 5035, 1), type = rep(LETTERS[1:4], 4))
extract value for grade A
asub_df2 <-subset(df2, type == "A" )
# %in% tests membership; == would recycle asub_df2$number along df$number
asub_df <- subset(df, number %in% asub_df2$number)
new_a <- cbind(asub_df, grade = rep(c("A"),nrow(asub_df)))
Similarly, extract the values for grade B into new_b and combine them for analysis.
can we use
You can split the 'df2' and use lapply
Filter(Negate(is.null),
lapply(split(df2, df2$type), function(x) {
# %in% tests membership; == would recycle x$number along df$number
x1 <- subset(df, number %in% x$number)
if(nrow(x1)>0) {
transform(x1, grade=x$type[1])
}
}))
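As an alternative to subsetting once per grade, a single join on number may do the whole job. The following is a sketch under the assumption that number identifies rows uniquely in both data frames, as it does in the example:

```r
set.seed(1)
df <- data.frame(number = seq(5020, 5035, 1), value = rnorm(16, 20, 5),
                 type = rep(c("food", "bar", "sleep", "gym"), each = 4))
df2 <- data.frame(number = seq(5020, 5035, 1), type = rep(LETTERS[1:4], 4))

# Rename df2's "type" to "grade" so it does not clash with df's "type"
names(df2)[names(df2) == "type"] <- "grade"

# One join attaches the grade to every row of df
combined <- merge(df, df2, by = "number")

# Optional: one data frame per grade, similar to the split/lapply answer
by_grade <- split(combined, combined$grade)
by_grade$A
```

With the combined frame there is no need to build new_a, new_b and so on separately; grouping or filtering on the grade column covers all categories at once.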

Changing variable type of R dataframe inside a list

I have a list of dataframes and I need to convert a certain variable in each of the dataframes to a factor.
E.g.
myList <- list(df1 = data.frame(A = sample(10), B = rep(1:2, 10)),
df2 = data.frame(A = sample(10), B = rep(1:2, 10))
)
Let's say that variable B needs to be a factor in each dataframe. I've tried this:
TMP <- setNames(lapply(seq_along(myList), function(x) apply(myList[[x]][c("B")], 2, factor)), names(myList))
but it only returns the transformed variable, not the whole dataframe as I need. I know how to do this with for loop, but I don't want to resort to that.
Per comment from David Arenburg, this solution should work:
TMP <- lapply(myList, function(x) {x[, "B"] <- factor(x[, "B"]) ; x}) ; str(TMP)
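The attempt in the question returns only the transformed variable because the function passed to lapply returns the result of apply, not the data frame. The same idea as the solution above can be wrapped in a small helper that also handles several columns. This is a sketch: to_factor is a hypothetical helper name, and the seed and column lengths are chosen only for the example:

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
myList <- list(df1 = data.frame(A = sample(20), B = rep(1:2, 10)),
               df2 = data.frame(A = sample(20), B = rep(1:2, 10)))

# to_factor() is a hypothetical helper; 'cols' may name one or more columns
to_factor <- function(d, cols) {
  d[cols] <- lapply(d[cols], factor)  # convert only the requested columns
  d                                   # return the whole data frame
}

TMP <- lapply(myList, to_factor, cols = "B")

# Check the class of B in every data frame of the list
sapply(TMP, function(d) class(d$B))
```

lapply keeps the names of myList, so the setNames call from the question is not needed.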

merge data frames based on non-identical values in R

I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT","FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR in more than one range in dat$Pos, I want it merged each time). What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
parts <- strsplit(x$Pos, ":")[[1]]
chr <- parts[1]
range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
# convert the bounds to numbers before building the sequence
start <- as.numeric(range[1])
end <- as.numeric(range[2])
data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# both bounds are inclusive, so a single-position range can still match
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
MERGE
# create a column in dat to merge-by
dat <- cbind(dat, VAR=NA)
# use the VAR in dat2 as the merge id
# assigning inside sapply() would only change a local copy, so loop instead
for (i in seq(ncol(matches)))
  dat$VAR[matches[, i]] <- dat2[i, "VAR"]
merge(dat, dat2)
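For completeness, here is a base-R sketch that compares the chromosome as well as the position and keeps every matching pair. The sample rows in the question share no chromosome, so the data below are hypothetical and chosen so that one variant falls inside a range. The helper columns chr, start, end and pos are names introduced for this sketch only:

```r
# Hypothetical sample data, chosen so that one variant falls inside a range
dat <- data.frame(Locus = c("ACTC1-001_1", "ACTC1-001_2"),
                  Pos   = c("chr15:35087734..35087734",
                            "chr15:35086890..35086919"),
                  NVAR  = c("1", "2"), stringsAsFactors = FALSE)
dat2 <- data.frame(VAR = c("chr15:35086900", "chr1:116242855"),
                   REF.ALT = c("T/A", "A/G"),
                   FUNC = c("intergenic", "intergenic"),
                   stringsAsFactors = FALSE)

# Parse "chr:start..end" and "chr:pos" into separate helper columns
dat$chr   <- sub(":.*$", "", dat$Pos)
dat$start <- as.numeric(sub("^.*:([0-9]+)\\.\\..*$", "\\1", dat$Pos))
dat$end   <- as.numeric(sub("^.*\\.\\.", "", dat$Pos))
dat2$chr  <- sub(":.*$", "", dat2$VAR)
dat2$pos  <- as.numeric(sub("^.*:", "", dat2$VAR))

# Pair up rows on the same chromosome, then keep positions inside the range
candidates <- merge(dat, dat2, by = "chr")
hits <- candidates[candidates$pos >= candidates$start &
                   candidates$pos <= candidates$end, ]
hits[, c("Locus", "Pos", "VAR", "REF.ALT", "FUNC")]
```

The merge on chr builds every range/variant pair per chromosome before filtering, so like the first answer this is meant for small data only.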
