Compute median per column in loop - r

I have this loop to compute the mean per column, which works.
for (i in 1:length(DF1)) {
tempA <- DF1[i] # save column of DF1 onto temp variable
names(tempA) <- 'word' # label temp variable for inner_join function
DF2 <- inner_join(tempA, DF0, by='word') # match words with numeric value from look-up DF0
tempB <- as.data.frame(t(colMeans(DF2[-1]))) # compute mean of column
DF3<- rbind(tempB, DF3) # save results togther
}
The script uses the dplyr package for inner_join.
DF0 is the look-up database with 3 columns (word, value1, value2, value3).
DF 1 is the text data with one word per cell.
DF3 is the output.
Now I want to compute the median instead of the mean. It seemed easy enough with the colMedians function from 'robustbase', but I can't get the below to work.
library(robustbase)
for (i in 1:length(DF1)) {
tempA <- DF1[i]
names(tempA) <- 'word'
DF2 <- inner_join(tempA, DF0, by='word')
tempB <- as.data.frame(t(colMedians(DF2[-1])))
DF3<- rbind(tempB, DF3)
}
The error message reads:
Error in colMedians(tog[-1]) : Argument 'x' must be a matrix.
I've tried to format DF2 as a matrix prior to the colMedians function, but still get the error message:
Error in colMedians(tog[-1]) : Argument 'x' must be a matrix.
I don't understand what is going on here. Thanks for the help!
Happy to provide sample data and error traceback, but trying to keep it as crisp and simple as possible.

According to the comment by the OP, the following solved the problem.
I have added a call to library(dplyr).
My contribution was colMedians(data.matrix(DF2[-1]), na.rm = TRUE).
library(robustbase)
library(dplyr)
for (i in 1:length(DF1)) {
tempA <- DF1[i]
names(tempA) <- 'word'
DF2 <- inner_join(tempA, DF0, by='word')
tempB <- colMedians(data.matrix(DF2[-1]), na.rm = TRUE)
DF3 <- rbind(tempB, DF3)
}

Stumbled on this answer which helped me fix the loop as following:
DF3Mean <- data.frame() # instantiate dataframe
DF4Median <- data.frame( # instantiate dataframe
for (i in 1:length(DF1)) {
tempA <- DF1[i] # save column of DF1 onto temp variable
names(tempA) <- 'word' # label temp variable for inner_join function
DF2 <- inner_join(tempA, DF0, by='word') # match words with numeric value from look-up DF0
tempMean <- as.data.frame(t(colMeans(DF2[-1]))) # compute mean of column
DF3Mean <- rbind(tempMean, DF3Mean) # save results togther
tempMedian <- apply(DF2[ ,2:4], 2, median) #compute mean for columns 2,3, and 4
DF4Median <- rbind(tempMedian, DF4Median) # save results togther
}
I guess I was too stuck in my mind on the colMedian function.

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don´t get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I get the errors may be in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but couldn´t still find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example generate a vector if IDs to remove for each item in the list as a list
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)

Filter out all data frames which don't have the column Z in a list of data frames?

I have a list of six data frames, from which 5/6 data frames have a column "Z". To proceed with my script, I need to remove the data frame which doesn't have column Z, so I tried the following code:
for(i in 1:length(df)){
if(!("Z" %in% colnames(df[[i]])))
{
df[[i]] = NULL
}
}
This seem'd to actually do the job (it removed the one data frame from the list, which didn't have the column Z), BUT however I still got a message "Error in df[[i]] : subscript out of bounds". Why is that, and how could I get around the error?
The base Filter function works well here:
df <- Filter(\(x) "Z" %in% names(x), df)
As to why your method doesn't work, for(i in 1:length(df)) iterates over each item in the original length(df). As soon as df[[i]] = NULL happens once, then df is shorter than it was when the loop started, so the last iteration will be out of bounds. And you'll also skip some items: if df[[2]] is removed then the original df[[3]] is now df[[2]], and the current df[[3]] was originally df[[4]], so you hop over the original df[[3]] without checking it. Lesson: don't change the length of objects in the midst of iterating over them.
If df is your list of 6 dataframes, you can do this:
df <- df[sapply(df, \(i) "Z" %in% colnames(i))]
The reason you get the error is that your loop will reduce the length of df, such that i will eventually be beyond the (new) length of df. There will be no error if the only frame in df without column Z is the last frame.
Using discard:
list_df <- list(df1, df2)
purrr::discard(list_df, ~any(colnames(.x) == "Z"))
Output:
[[1]]
A B
1 1 3
2 3 4
As you can see it removed the first dataframe which had column Z.
data
df1 <- data.frame(A = c(1,2),
Z = c(1,4))
df2 <- data.frame(A = c(1,3),
B = c(3,4))

Building forvalues loops in R

[Working with R 3.2.2]
I have three data frames with the same variables. I need to modify the value of some variables and change the name of the variables (rename the columns). Instead of doing this data frame by data frame, I would like to use a loop.
This is the code I want to run:
#Change the values of the variables
vlist <- c("var1", "var2", "var3")
dataframe0[,vlist] <- dataframe0[,vlist]/10
dataframe1[,vlist] <- dataframe1[,vlist]/10
dataframe2[,vlist] <- dataframe2[,vlist]/10
#Change the name of the variables
colnames(dataframe0)[colnames(dataframe0)=="var1"] <- "temp_min"
colnames(dataframe0)[colnames(dataframe0)=="var2"] <- "temp_max"
colnames(dataframe0)[colnames(dataframe0)=="var3"] <- "prep"
colnames(dataframe1)[colnames(dataframe1)=="var1"] <- "temp_min"
colnames(dataframe1)[colnames(dataframe1)=="var2"] <- "temp_max"
colnames(dataframe1)[colnames(dataframe1)=="var3"] <- "prep"
colnames(dataframe2)[colnames(dataframe2)=="var1"] <- "temp_min"
colnames(dataframe2)[colnames(dataframe2)=="var2"] <- "temp_max"
colnames(dataframe2)[colnames(dataframe2)=="var3"] <- "prep"
I know the logic to do it with programs like Stata, with a forvalues loop:
#Change the values of the variables
forvalues i=0/2 {
dataframe`i'[,vlist] <- dataframe`i'[,vlist]/10
#Change the name of the variables
colnames(dataframe`i')[colnames(dataframe`i')=="var1"] <- "temp_min"
colnames(dataframe`i')[colnames(dataframe`i')=="var2"] <- "temp_max"
colnames(dataframe`i')[colnames(dataframe`i')=="var3"] <- "prep"
}
But, I am not able to reproduce it in R. How should I proceed? Thanks in advance!
I would go working with a list of dataframe, you can still 'split' it after if really needed:
df1 <- data.frame("id"=1:10,"var1"=11:20,"var2"=11:20,"var3"=11:20,"test"=1:10)
df2 <- df1
df3 <- df1
dflist <- list(df1,df2,df3)
for (i in seq_along(dflist)) {
df[[i]]['test'] <- df[[i]]['test']/10
colnames( dflist[[i]] )[ colnames(dflist[[i]]) %in% c('var1','var2','var3') ] <- c('temp_min','temp_max','prep')
# eventually reassign df1-3 to their list value:
# assign(paste0("df",i),dflist[[i]])
}
The interest of using a list is that you can access them a little more easily in a programmatic way.
I did change your code from 3 calls to only one, as colnames give a vector you can subset it and replace in one pass, this is assuming your var1 to var3 are always in the same order.
Addendum: if you want a single dataset at end you can use do.call(rbind,dflist) or with data.table package rbindlist(dflist).
More details on working with list of data.frames in Gregor's answer here

How to read and use the dataframes with the different names in a loop?

I'm struggling with the following issue: I have many data frames with different names (For instance, Beverage, Construction, Electronic etc., dim. 540x1000). I need to clean each of them, calculate and save as zoo object and R data file. Cleaning is the same for all of them - deleting the empty columns and the columns with some specific names.
For example:
Beverages <- Beverages[,colSums(is.na(Beverages))<nrow(Beverages)] #removing empty columns
Beverages_OK <- Beverages %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
Beverages_OK[, 1] <- NULL #dropping the first column
Beverages_OK <- cbind(data[1], Beverages_OK) # adding a date column
Beverages_zoo <- read.zoo(Beverages_OK, header = FALSE, format = "%Y-%m-%d")
save (Beverages_OK, file = "StatisticsInRFormat/Beverages.RData")
I tied to use 'lapply' function like this:
list <- ls() # the list of all the dataframes
lapply(list, function(X) {
temp <- X
temp <- temp [,colSums(is.na(temp))< nrow(temp)] #removing empty columns
temp <- temp %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
temp[, 1] <- NULL
temp <- cbind(data[1], temp)
X_zoo <- read.zoo(X, header = FALSE, format = "%Y-%m-%d") # I don't know how to have the zame name as X has.
save (X, file = "StatisticsInRFormat/X.RData")
})
but it doesn't work. Is any way to do such a job? Is any r-package that facilitates it?
Thanks a lot.
If you are sure the you have only the needed data frames in the environment this should get you started:
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
list <- ls()
lapply(list, function(x) {
tmp <- get(x)
})

merge data frames based on non-identical values in R

I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR in more than one range in dat$Pos, I want it merged each time). What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
parts <- strsplit(x$Pos, ":")[[1]]
chr <- parts[1]
range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
start <- range[1]
end <- range[2]
data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
Please try this out and let us know how it works. Without a larger data set it is a bit hard to trouble shoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match)
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, str_split(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row numebr is the row in dat
# col number is the row in dat2
matches <- sapply(d2.vars, function(v) mins < v & v <= maxs)
MERGE
# create a column in dat to merge-by
dat <- cbind(dat, VAR=NA)
# use the VAR in dat2 as the merge id
sapply(seq(ncol(matches)), function(i)
dat$VAR <- dat2[i, "VAR"] )
merge(dat, dat2)

Resources