Looping through values to make changes in data frame - r

I have some code that makes changes in a data frame:
value <- iris[1:120,]
cngfunc <- function(day, howmany, howmuch) {
  shuffled <- day[sample(1:nrow(day)), ]
  n <- as.integer((howmany/100) * nrow(day)) # select percentage of data to be changed
  extracted <- shuffled[1:n, ]
  extracted$changed <- extracted[, 1] * ((howmuch/100) + 1) # how much the data changes
  extracted
}
cngfunc(value, 10, 20)
Now I want to loop through the values of howmany and howmuch.
For example, howmuch <- c(10, 20, 30, 40, 50) and howmany <- c(10, 20, 30, 40, 50).
So the results would be cngfunc(value, 10, 10), cngfunc(value, 10, 20), cngfunc(value, 10, 30), ..., then cngfunc(value, 20, 10), cngfunc(value, 20, 20), and so on, such that I end up with 25 different data frames.
Is there a way to do that?

You can do it with expand.grid to get all of the combinations, and then map2 to create a list of data frames:
library(tidyverse)
combos <- expand.grid(c(10, 20, 30, 40, 50), c(10, 20, 30, 40, 50))
result <- map2(combos$Var1, combos$Var2, function(x, y) cngfunc(value, x, y)) %>%
  setNames(tidyr::unite(combos, Var, Var1:Var2, sep = "-")$Var)
Since there are 5 * 5 = 25 combinations, this gives you 25 data frames. This should be the general idea, though.
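The same idea works in base R with Map() over the columns of the expand.grid() result, if you prefer to avoid purrr. A rough sketch, assuming cngfunc() and value as defined in the question (the names grid and result_base are just illustrative):
grid <- expand.grid(howmany = c(10, 20, 30, 40, 50),
                    howmuch = c(10, 20, 30, 40, 50))
result_base <- Map(function(x, y) cngfunc(value, x, y), grid$howmany, grid$howmuch)
names(result_base) <- paste(grid$howmany, grid$howmuch, sep = "-")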

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
  rows_above_th <- df[(df$id > 8), ]        # select the rows from each df above a threshold
  a <- rows_above_th$id                     # obtain the ids of the rows above the threshold
  ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
This is because ultimately I would use it like this on single data frames:
for (i in 1:length(ids_to_remove)) {
  mylist[[1]] <- mylist[[1]] %>%
    filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working, and I don't get why):
i = 1
lapply(mylist,
       function(df) {
         for (i in 1:length(ids_to_remove)) {
           df <- df %>%
             filter(!id == ids_to_remove[i])
           i = i + 1
         }
       })
I guess the errors may be in the append part of the sapply, and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply, you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, as a list:
ids_to_remove <- lapply(mylist, function(df) {
  rows_above_th <- df[(df$id > 8), ] # select the rows from each df above a threshold
  rows_above_th$id                   # obtain the ids of the rows above the threshold
})
You can then use that list together with your data list and mapply to iterate over the two lists in parallel:
mapply(function(data, ids) {
  data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY = FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
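If you do want ids_to_remove as a single flat vector like the c(9, 10, 9, 10) in your desired output, one option is to unlist the per-data-frame results. A small sketch using mylist from above:
ids_to_remove <- unlist(lapply(mylist, function(df) df$id[df$id > 8]))
ids_to_remove
# [1]  9 10  9 10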

How to iterate a function over multiple instances based on number of partial string matches?

Had trouble figuring out the best way to phrase this in the title, but the broader issue here is that I'm trying to combine two non-overlapping columns (split by gender) in a dataset into a third, gender-neutral column with values for each row/participant... and then do that i times.
Here's an example. My dataset is ELSH2, and the first set of columns will be HTM1, HTW1, and HT1. I figured out pretty quickly how to combine columns just once:
ELSH2$HT1 <- ifelse(is.na(ELSH2$HTM1), ELSH2$HTW1, ELSH2$HTM1)
So all the values from the HTW1 and HTM1 columns are now combined in the HT1 column. But essentially what I want is:
ELSH2$HTi <- ifelse(is.na(ELSH2$HTMi), ELSH2$HTWi, ELSH2$HTMi)
where i is each sequential number in the range 1-k, k being the largest number at the end of column names matching the above strings (i.e., there are k columns that start with HTM or HTW; HTM and HTW will always have the same k value). In this example, k=5, but I'm going to do this with multiple cases (i.e., other strings to match in place of HTM/HTW) involving different values of k.
I tried using grepl:
ELSH2[,grepl("HT.", names(ELSH2))] <- ifelse(
is.na(ELSH[,grepl("HTM.", names(ELSH2))]),
ELSH2[,grepl("HTW.", names(ELSH2))],
ELSH2[,grepl("HTM.", names(ELSH2))])
But I'm getting the following error:
Warning message:
In `[<-.data.frame`(`*tmp*`, , grepl("HTM.", names(ELSH2)), value = list( :
provided 5300 variables to replace 10 variables
I'm pretty sure there's something wrong with the way I'm trying to make the HT columns here, but even if I create them manually, I get the same sort of error.
EDIT: Here's a sample dataset.
HTM1<- rnorm(10)
HTW1<- rnorm(10)
HTM2<- rnorm(10)
HTW2<- rnorm(10)
HTM3<- rnorm(10)
HTW3<- rnorm(10)
HTM4<- rnorm(10)
HTW4<- rnorm(10)
HTM5<- rnorm(10)
HTW5<- rnorm(10)
HTM <- data.frame(HTM1,HTM2,HTM3,HTM4,HTM5)
HTW <- data.frame(HTW1,HTW2,HTW3,HTW4,HTW5)
HTM[1, ] <- NA
HTM[3, ] <- NA
HTM[5, ] <- NA
HTM[7, ] <- NA
HTM[9, ] <- NA
HTW[2, ] <- NA
HTW[4, ] <- NA
HTW[6, ] <- NA
HTW[8, ] <- NA
HTW[10, ] <- NA
ELSH2 <- cbind(HTW, HTM)
ELSH2 then contains the ten HTW/HTM columns side by side, and I want the final HT columns to just interleave them, taking each row's value from whichever of the HTM/HTW pair is not missing.
One possibility is just to treat this like a reshaping problem. Here we use dplyr and tidyr to make that easier:
library(dplyr)
library(tidyr)
ELSH2 %>%
  mutate(row = row_number()) %>%
  pivot_longer(HTW1:HTM5) %>%
  filter(!is.na(value)) %>%
  extract(name, into = c("prefix", "code"), "^([A-Za-z]+)(\\d+)$") %>%
  mutate(name = paste0("HT", code)) %>%
  pivot_wider(id_cols = row, names_from = name, values_from = value)
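If you would rather keep the ifelse() pattern from your question, a plain base R loop over the column suffix also works. A minimal sketch, assuming k = 5 as in the sample data:
k <- 5
for (i in seq_len(k)) {
  ELSH2[[paste0("HT", i)]] <- ifelse(is.na(ELSH2[[paste0("HTM", i)]]),
                                     ELSH2[[paste0("HTW", i)]],
                                     ELSH2[[paste0("HTM", i)]])
}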

Mapply to Add Column to Each Dataframe in a List

Implemented some code from previous question:
Lapply to Add Columns to Each Dataframe in a List
Using the method above, I receive corrupt data. While I cannot provide actual data, I am wondering if additional arguments need to be implemented to prevent shuffling.
Basically, this:
library(data.table)
df1 <- data.frame(x = runif(3), y = runif(3))
df2 <- data.frame(x = runif(3), y = runif(3))
dfs <- list(df1, df2)
years <- list(2013, 2014)
a <- Map(cbind, dfs, year = years)
final <- rbindlist(a)
But applied to a list of thousands of data frames, this produces incorrect results. Assume that some data frames (say, a df 1.5 somewhere between the two data frames above) are empty. Would that affect the order in which Map binds the years to the dfs? Essentially, I have an output in which some data belong to a different year than the one Map attached to them. I tested the length and order of the years list and compared it to the output year in final; they are identical. Any thoughts?
We create a logical index based on the length of each element of 'dfs', use that to subset both 'dfs' and 'years', and then do the cbind with Map:
i1 <- sapply(dfs, length) > 1
Or, to make it more stringent:
i1 <- sapply(dfs, function(x) is.data.frame(x) & !is.null(x) & length(x) > 0)
a <- Map(cbind, dfs[i1], year = years[i1])
and then do the rbindlist with fill = TRUE in case the number of columns is not the same in all the data.frames in the list:
rbindlist(a, fill = TRUE)
data
dfs[[3]] <- list(NULL)
dfs[[4]] <- data.frame()
years <- 2013:2016
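Put together with this example data, the approach above looks roughly like this end to end (a sketch, using the question's toy data):
library(data.table)
df1 <- data.frame(x = runif(3), y = runif(3))
df2 <- data.frame(x = runif(3), y = runif(3))
dfs <- list(df1, df2)
dfs[[3]] <- list(NULL)   # an empty/invalid element, as in the question
dfs[[4]] <- data.frame()
years <- 2013:2016

i1 <- sapply(dfs, function(x) is.data.frame(x) & length(x) > 0)
a <- Map(cbind, dfs[i1], year = years[i1])
rbindlist(a, fill = TRUE)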
Use the idcol argument to rbindlist and add the year column afterwards:
res = rbindlist(dfs, idcol=TRUE)
res[.(.id = 1:2, year = 2013:2014), on=".id", year := i.year]
X[i, on=cols, z := i.z] merges X with i on cols and then copies z from i to X.
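As a small self-contained illustration of that idiom (a sketch with made-up tables, not the question's data):
library(data.table)
X <- data.table(.id = c(1L, 1L, 2L, 2L), x = 1:4)
i <- data.table(.id = 1:2, year = 2013:2014)
X[i, on = ".id", year := i.year]  # copy year from i into X, matched by .id
X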

Passing vector with multiple values into R function to generate data frame

I have a table, called table_wo_nas, with multiple columns, one of which is titled ID. For each value of ID there are many rows. I want to write a function that for input x will output a data frame containing the number of rows for each ID, with column headers ID and nobs respectively as below for x <- c(2,4,8).
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
This is what I have. It works when x is a single value (e.g. 3), but not when it contains multiple values, for example 1:10 or c(2,5,7). I receive the warning "In ID[counter] <- x : number of items to replace is not a multiple of replacement length". I've just started learning R, have been struggling with this for a week, and have searched manuals, this site, Google, everything. Can someone please help?
counter <- 1
ID <- vector("numeric")   ## contain x
nobs <- vector("numeric") ## contain nrow
for (i in x) {
  r <- subset(table_wo_nas, ID %in% x) ## create subset for rows of ID=x
  ID[counter] <- x                     ## add x to ID
  nobs[counter] <- nrow(r)             ## add nrow to nobs
  counter <- counter + 1               ## loop
}
result <- data.frame(ID, nobs) ## create data frame
In base R,
# To make a named vector, either:
tmp <- sapply(split(table_wo_nas, table_wo_nas$ID), nrow)
# OR just:
tmp <- table(table_wo_nas$ID)
# AND
# arrange into data.frame
nobs_df <- data.frame(ID = names(tmp), nobs = as.vector(tmp)) # as.vector() drops the table class so nobs is a plain numeric column
Alternatively, coerce the table into a data.frame directly and rename:
nobs_df <- data.frame(table(table_wo_nas$ID))
names(nobs_df) <- c('ID', 'nobs')
If you only want certain IDs, subset:
nobs_df[nobs_df$ID %in% c(2, 4, 8), ]
There are many, many more options; these are just a few.
With dplyr,
library(dplyr)
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n())
If you only want certain IDs, add on a filter:
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n()) %>% filter(ID %in% c(2, 4, 8))
Seems pretty straightforward if you just use table again:
tbl <- table(table_wo_nas[, 'ID'])
data.frame(IDs = names(tbl), nobs = as.vector(tbl))
Could also get a quick answer although with different column names using:
as.data.frame(table( table_wo_nas[ , 'ID'] ))
Try this.
x <- c(2, 4, 8)
count_of_id <- 0
# df is your data frame table_wo_nas
count_of <- function(x) {
  for (i in 1:length(x)) {
    count_of_id[i] <- length(which(df$ID == x[i])) # find the number of rows for each value of x
  }
  df_1 <- cbind(ID = x, count_of_id)
  return(df_1)
}
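Called like this (a sketch, assuming df holds your table_wo_nas):
df <- table_wo_nas
count_of(c(2, 4, 8))
# returns a two-column matrix: the requested IDs and their row counts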

merge data frames based on non-identical values in R

I have two data frames. The first one looks like:
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR falls within more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient, so I would not recommend it for large data. However, you seemed to indicate your data is not that large, so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x) {
  parts <- strsplit(x$Pos, ":")[[1]]
  chr <- parts[1]
  range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
  start <- as.numeric(range[1])
  end <- as.numeric(range[2])
  data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically, ones that would match).
SPLIT THE DATA
library(stringr)
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[, 1])
maxs <- as.numeric(range.strings[, 2])
d2.vars <- as.numeric(do.call(rbind, str_split(dat2$VAR, ":"))[, 2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
matches <- sapply(d2.vars, function(v) mins < v & v <= maxs)
MERGE
# create a column in dat to merge by
dat <- cbind(dat, VAR = NA)
# use the VAR in dat2 as the merge id, filling it in wherever a match was found
for (i in seq(ncol(matches))) {
  dat$VAR[matches[, i]] <- dat2$VAR[i]
}
merge(dat, dat2)
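For larger data, a data.table non-equi join avoids expanding every range into individual positions. A rough sketch, assuming dat and dat2 as defined in the question (note the toy rows here are on different chromosomes, so this particular example returns no matches):
library(data.table)
DT1 <- as.data.table(dat)
DT1[, c("chr", "rng") := tstrsplit(Pos, ":", fixed = TRUE)]
DT1[, c("start", "end") := lapply(tstrsplit(rng, "..", fixed = TRUE), as.numeric)]
DT2 <- as.data.table(dat2)
DT2[, c("chr", "pos") := tstrsplit(VAR, ":", fixed = TRUE)]
DT2[, pos := as.numeric(pos)]
# keep dat2 rows whose position falls inside a dat range on the same chromosome
DT1[DT2, on = .(chr, start <= pos, end >= pos), nomatch = 0L]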
