I have a data set with 100 values and want to pick only specific items from that data set. That's how I do it right now:
df.match <- subset(df.raw.csv, value == "UC9d" | value == "UCenoM")
It works, but I want to solve it with a loop. I tried this, but I only get one match, although I know both values are in the data set.
for (ID in c("UC9d" , "UCenoM")){df.match <- subset(df.raw.csv, value == ID)}
Any suggestions?
My suggestion would be not to use loops in R:
library(dplyr)
mydata <- mutate(mydata, TOBEINCL = 0) #rename according to your data
Create a list of patterns for the match of mydata$ID (^ and $ are for exact matching):
toMatch <- c("^UC9d$", "^UCenoM$")
Use pattern matching from base R:
mydata$TOBEINCL[grep(paste(toMatch, collapse = "|"), mydata$ID, ignore.case = FALSE)] <- 1
Select data:
mydataINCL <- mydata[(mydata$TOBEINCL==1) , ]
mydataINCL$ID <- factor(mydataINCL$ID) #sometimes R sticks with the old values
An option:
df.match <- subset(df.raw.csv, value %in% c("UCenoM", "UC9d"))
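As for why the loop in the question returns only one match: each pass overwrites df.match, so only the subset for the last ID survives. Here is a minimal sketch of a loop-style version that keeps every match. The rows in this toy df.raw.csv are made up for illustration, and the names pieces, df.loop and df.in are my own.

```r
# Toy stand-in for df.raw.csv (made-up rows, for illustration only)
df.raw.csv <- data.frame(value = c("UC9d", "abc", "UCenoM", "xyz"),
                         n = 1:4)
ids <- c("UC9d", "UCenoM")

# Collect one subset per ID instead of overwriting a single variable,
# then bind the pieces into one data frame
pieces <- lapply(ids, function(id) subset(df.raw.csv, value == id))
df.loop <- do.call(rbind, pieces)

# The vectorised %in% version selects the same rows in one step
df.in <- subset(df.raw.csv, value %in% ids)
```

Both versions should pick the same rows; the %in% form is shorter and avoids growing an object inside a loop.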
I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I guess the errors may be in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, and collect them as a list:
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
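If you still want the flat vector from the question, unlist() on that list should give it. The append() inside the original sapply only changed a local copy of ids_to_remove, because `<-` inside a function assigns in the function's own environment. A small sketch using the question's data; ids_vec and cleaned are names I made up.

```r
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])  # seq(11:20) yields 1..10
mylist <- list(df1, df2)

# One vector of ids per data frame, returned rather than appended to a global
ids_to_remove <- lapply(mylist, function(df) df$id[df$id > 8])

# Flatten to the single vector the question asked for
ids_vec <- unlist(ids_to_remove)

# Drop the flagged rows from each data frame, pairing each with its own ids
cleaned <- Map(function(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
```

With 70 data frames and 2 million rows, the %in% filter is a single vectorised pass per data frame, which avoids the per-id loop.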
My data looks like this, but the number of observations is approx. 10000.
Part<-c(1,2,3,4,5,6,7)
Disease_codes<-c("A101.12","A111.12","A121.13","A130.0","B102","C132","D156")
Disease_codes<-factor(Disease_codes)
df<-data.frame(Part,Disease_codes)
The observations whose Disease_codes start with A10 to A13 are BloodCancer patients. I need to make a subset of them and I am trying the following:
BloodCancer <- subset(df, grepl('^A10', Disease_codes), select = Part)
Part_without_Blood_cancer <- subset(df, !grepl('^A10', Disease_codes))
If I try the following, it does not work:
BloodCancer <- subset(df, grepl('^A10-A13', Disease_codes), select = Part)
The first version gives me only the participants with A10 codes, but I want the BloodCancer variable to contain all of A10 to A13. How can I do this in one command?
The syntax for grepl to return TRUE for any of the strings (e.g. A10, A11) is as follows:
grepl("A10|A11", variable)
To keep it as one statement, you can do the following:
BloodCancer = subset(df, grepl(paste(paste("A1", 0:3, sep = ""), collapse = "|"), Disease_codes), select = Part)
Try it this way:
BloodCancer <- subset(df, grepl("^A1[0-3]", as.character(Disease_codes)), select = Part)
An option with dplyr
library(dplyr)
library(stringr)
df %>%
filter(str_detect(Disease_codes, "^A1[0-3]")) %>%
select(Part)
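To check the character-class pattern against the sample data from the question, here is a base R sketch. The helper name is_blood is mine; the rest follows the question's variables.

```r
Part <- c(1, 2, 3, 4, 5, 6, 7)
Disease_codes <- factor(c("A101.12", "A111.12", "A121.13", "A130.0",
                          "B102", "C132", "D156"))
df <- data.frame(Part, Disease_codes)

# "^A1[0-3]" matches codes that start with A10, A11, A12 or A13
is_blood <- grepl("^A1[0-3]", as.character(df$Disease_codes))

BloodCancer <- df[is_blood, "Part", drop = FALSE]
Part_without_Blood_cancer <- df[!is_blood, ]
```

Computing the logical vector once and reusing it negated gives both subsets from the same test, so the two cannot drift apart.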
I have a list containing many data frames:
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])
my_list <- list(df1, df2, df3)
I want to know if every data frame in this list contains the same columns (i.e., the same number of columns, all having the same names and in the same order).
I know that you can easily find column names of data frames in a list using lapply:
lapply(my_list, colnames)
Is there a way to determine if any differences in column names occur? I realize this is a complicated question involving pairwise comparisons.
You can avoid pairwise comparison by simply checking whether the count of each column name is == length(my_list). This checks both the number of columns and the names of your data frames (though not the column order) -
lapply(my_list, names) %>%
unlist() %>%
table() %>%
all(. == length(my_list))
[1] FALSE
In base R i.e. without %>% -
all(table(unlist(lapply(my_list, names))) == length(my_list))
[1] FALSE
or slightly more optimized -
!any(table(unlist(lapply(my_list, names))) != length(my_list))
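One caveat worth illustrating: counting names says nothing about column order, which the question also asked about. A small sketch with two made-up data frames; pair, count_check and order_check are hypothetical names.

```r
a <- data.frame(A = 1:2, B = 3:4)
b <- data.frame(B = 3:4, A = 1:2)   # same columns, different order
pair <- list(a, b)

# Count-based check: every name appears once per data frame
count_check <- all(table(unlist(lapply(pair, names))) == length(pair))

# Order-sensitive check: compare every names vector with the first one
first_names <- names(pair[[1]])
order_check <- all(vapply(pair,
                          function(d) identical(names(d), first_names),
                          logical(1)))
```

Here count_check passes while order_check does not, so pick whichever matches the definition of "same columns" you need.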
Here's another base solution with Reduce:
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, names)
)
)
You could also account for same columns in a different order with
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, function(z) sort(names(z)))
)
)
As for what's going on, Reduce() accumulates as it goes through the list. At first, identical(names_df1, names_df2) is evaluated. If it's true, we want it to return the same vector that was just compared, so we can keep using it to compare against the other members of the list.
Finally, if everything evaluates as true, we get a character vector returned. Since you probably want a logical output, !is.logical(...) is used to turn that character vector into a boolean.
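To see the intermediate values, you can run the same Reduce() with accumulate = TRUE. This is only a sketch for inspection; same_or_false, steps and all_same are names I introduced.

```r
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])
my_list <- list(df1, df2, df3)

# Keep the names vector while it keeps matching, otherwise fall to FALSE
same_or_false <- function(x, y) if (identical(x, y)) x else FALSE

# accumulate = TRUE keeps every intermediate result;
# simplify = FALSE guarantees a list is returned
steps <- Reduce(same_or_false, lapply(my_list, names),
                accumulate = TRUE, simplify = FALSE)

# The last step is a character vector only if every comparison succeeded
all_same <- !is.logical(steps[[length(steps)]])
```

For this list the steps go from the names of df1, to the same names after df2 matches, to FALSE once df3 differs.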
See also here as I was very inspired by another post:
check whether all elements of a list are in equal in R
And a similar one that I saw after my edit:
Test for equality between all members of list
We can use dplyr::bind_rows:
!any(is.na(dplyr::bind_rows(my_list)))
# [1] FALSE
Here is my answer:
k <- 1
output <- NULL
for(i in 1:(length(my_list) - 1)) {
for(j in (i + 1):length(my_list)) {
output[k] <- identical(colnames(my_list[[i]]), colnames(my_list[[j]]))
k <- k + 1
}
}
all(output)
I'm a newbie in R programming. I have a requirement in mind and trying to work it out with for loop. I have a data frame with 14 variables which has empty values for some rows and columns. My requirement is to list the number of empty values in each variable (column).
My code below to achieve it:
for (x in names(df)){
cat(paste("No of rows with empty value for", x, " variable:",
nrow(df[df$x == '', ])))
}
nrow(df[df$x=='',])
From the above nrow command, the x value is not getting substituted for df$x == ''.
Need some expert help to fix it.
Thanks in advance,
Regards,
Vin
You can use sapply though to make your code cleaner.
sapply(df, FUN=function(x) sum(x == ''))
I slightly altered your for loop, and added a line break in the end. It is easier if you sum over the booleans created than counting the rows.
##Create some fake data
df <- data.frame(
first_var = c(rep("",10),1:10),
second_var = c(rep("",9), 1:11),
third_var = c(rep("", 8), 1:12),
fourth_Var = c(rep("", 7), 1:13)
)
for(i in names(df)){
cat(paste0("No of rows with empty value for ",i, " variable:",sum(df[,i] == ""),"\n"))
}
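As for why the original loop failed: df$x looks for a column literally named "x" and does not use the value of the loop variable. df[[x]] (or df[, x]) does the lookup by value. A minimal sketch on made-up data; empty_counts is my own name.

```r
# Small made-up data frame with some empty strings
df <- data.frame(first_var  = c("", "", "a"),
                 second_var = c("", "b", "c"))

for (x in names(df)) {
  # df[[x]] selects the column whose name is stored in x
  cat("No of rows with empty value for", x, "variable:",
      sum(df[[x]] == ""), "\n")
}

# Same counts without a loop, as a named vector
empty_counts <- sapply(df, function(col) sum(col == ""))
```

If the columns can also contain NA, sum(col == "", na.rm = TRUE) keeps the count from turning into NA.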
I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT","FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR falls in more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
parts <- strsplit(x$Pos, ":")[[1]]
chr <- parts[1]
range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
start <- range[1]
end <- range[2]
data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
Please try this out and let us know how it works. Without a larger data set it is a bit hard to trouble shoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match)
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# both ends of the range are inclusive; only positions are compared, not the chromosome
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
MERGE
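# which() with arr.ind = TRUE gives one (row, col) pair per match:
# "row" indexes dat, "col" indexes dat2
hits <- which(matches, arr.ind = TRUE)
# bind the matching rows side by side; a dat2 row appears once per range it falls in
merged <- cbind(dat[hits[, "row"], , drop = FALSE], dat2[hits[, "col"], , drop = FALSE])
merged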
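An alternative sketch in base R that also takes the chromosome into account: split both columns into parts, merge on the chromosome, then keep the pairs where the position falls inside the range. The rows below are made up so that one match exists (the question's sample data has none), and the helper columns chr, start, end and pos are my own names.

```r
# Made-up rows chosen so that exactly one match exists
dat <- data.frame(Locus = c("ACTC1-001_1", "ACTC1-001_2"),
                  Pos = c("chr15:35087734..35087734", "chr15:35086890..35086919"),
                  NVAR = c("1", "2"), stringsAsFactors = FALSE)
dat2 <- data.frame(VAR = c("chr15:35086900", "chr1:116242855"),
                   REF.ALT = c("T/A", "A/G"),
                   FUNC = c("exonic", "intergenic"), stringsAsFactors = FALSE)

# Split "chr:start..end" into chromosome, start and end
dat$chr   <- sub(":.*$", "", dat$Pos)
dat$start <- as.numeric(sub("^.*:([0-9]+)\\.\\..*$", "\\1", dat$Pos))
dat$end   <- as.numeric(sub("^.*\\.\\.", "", dat$Pos))

# Split "chr:pos" into chromosome and position
dat2$chr <- sub(":.*$", "", dat2$VAR)
dat2$pos <- as.numeric(sub("^.*:", "", dat2$VAR))

# Pair rows that share a chromosome, then keep positions inside the range
candidates <- merge(dat, dat2, by = "chr")
merged <- candidates[candidates$pos >= candidates$start &
                     candidates$pos <= candidates$end, ]
```

Note that merge() builds every dat/dat2 pair within a chromosome before filtering, so this is fine for small tables but may need a dedicated interval join for large ones.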