Subsetting data from R data frame using incremental variable names - r

I have an R data frame, df, with column names V1, V2, V3...V1000. I need to subset df by selecting every 20th column, that is, V1, V21, V41, V61 through end of columns.
I think this can be done using dplyr's select(df, num_range("V", val)), but am stumped how to iterate val through 1000 columns, stepping by 20.
Any suggestions?

Use the seq function with dplyr's select and num_range as below:
library(dplyr)
df <- as.data.frame(matrix(rnorm(3000), nrow = 3))
df %>% select(num_range("V", seq(1, 1000, by = 20)))

You can try,
df[seq(1, ncol(df), 20)]
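As a minimal base R sketch of the same idea (the example data here is made up, and the column names V1..V1000 are assumed to come from as.data.frame on a matrix), selecting by position and selecting by constructed name give the same result:

```r
# Hypothetical example data: 3 rows, 1000 columns named V1..V1000
df <- as.data.frame(matrix(rnorm(3000), nrow = 3))

# Every 20th column index, starting at 1: 1, 21, 41, ..., 981
idx <- seq(1, ncol(df), by = 20)

by_position <- df[idx]               # select by column position
by_name     <- df[paste0("V", idx)]  # select by constructed column name

head(names(by_position))
```

Selecting by name is the safer choice if the columns might not be stored in numeric order.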

You can use a function like this. Here nSkip = 20, since you want every 20th column:
FOO <- function(data, nSubsets, nSkip)
{
  # One list element per starting offset
  outList <- vector("list", length = nSubsets)
  totcol <- ncol(data)
  for (i in seq_len(nSubsets))
  {
    # Columns i, i + nSkip, i + 2 * nSkip, ...
    colsToGrab <- seq(i, totcol, nSkip)
    outList[[i]] <- data[, colsToGrab, drop = FALSE]
  }
  return(outList)
}
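For illustration, here is a compact sketch of the same idea using lapply, with a small usage example. The function name split_every and the example data are hypothetical, not part of the answer above:

```r
# Hypothetical helper: subset i holds columns i, i + n_skip, i + 2 * n_skip, ...
split_every <- function(data, n_subsets, n_skip) {
  lapply(seq_len(n_subsets), function(i) {
    data[, seq(i, ncol(data), by = n_skip), drop = FALSE]
  })
}

# Made-up data: 3 rows, 40 columns named V1..V40
df <- as.data.frame(matrix(seq_len(120), nrow = 3))
subsets <- split_every(df, n_subsets = 2, n_skip = 20)

names(subsets[[1]])  # "V1" "V21"
names(subsets[[2]])  # "V2" "V22"
```

For the original question, only the first subset (starting at column 1) is needed.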

Related

Applying a Function to a Data Frame : lapply vs traditional way

I have this data frame in R:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
I also have this function:
some_function <- function(x,y) { return(x+y) }
Basically, I want to create a new column in the data frame based on "some_function". I thought I could do this with the "lapply" function in R:
data_frame$new_column <-lapply(c(data_frame$x, data_frame$y),some_function)
This does not work:
Error in `$<-.data.frame`(`*tmp*`, f, value = list()) :
replacement has 0 rows, data has 8281
I know how to do this in a more "clunky and traditional" way:
data_frame$new_column = x + y
But I would like to know how to do this using "lapply" - in the future, I will have much more complicated and longer functions that will be a pain to write out like I did above. Can someone show me how to do this using "lapply"?
Thank you!
When working within a data.frame you could use apply instead of lapply:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(x,y) { return(x+y) }
data_frame$new_column <- apply(data_frame, 1, \(x) some_function(x["Var1"], x["Var2"]))
head(data_frame)
To apply a function over rows, set MARGIN = 1; to apply it over columns, set MARGIN = 2.
lapply, as the name suggests, is a list-apply. As a data.frame is a list of columns you can use it to compute over columns but within rectangular data, apply is often the easiest.
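A small sketch of the difference, on a made-up two-column data frame (the names m, row_sums and col_sums are only for illustration):

```r
# Made-up data: 3 rows, 2 columns
m <- data.frame(a = 1:3, b = 4:6)

row_sums <- apply(m, 1, sum)  # second argument 1: one result per row
col_sums <- apply(m, 2, sum)  # second argument 2: one result per column

# sapply/lapply iterate over the columns, because a data.frame is a list of columns
col_sums_list_style <- sapply(m, sum)

row_sums  # 5 7 9
col_sums  # a = 6, b = 15
```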
If some_function is written for that specific purpose, it can be written to accept a single row of the data.frame as in
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(row) { return(row[1]+row[2]) }
data_frame$yet_another <- apply(data_frame, 1, some_function)
head(data_frame)
Final comment: functions written for a single pair of values often turn out to be perfectly vectorized. Probably the best way to call some_function is without any function of the apply family, as in
some_function <- function(x,y) { return(x + y) }
data_frame$last_one <- some_function(data_frame$Var1, data_frame$Var2)
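The lapply attempt in the question fails because expand.grid names its columns Var1 and Var2 unless told otherwise, so data_frame$x and data_frame$y are NULL. If the real function is not vectorized, one option is mapply, which calls the function once per pair of values. The sketch below assumes a hypothetical non-vectorized some_function and a small grid with explicitly named columns:

```r
# Small made-up grid with explicitly named columns x and y
data_frame <- expand.grid(x = 1:3, y = 1:2)

# Hypothetical function that is NOT vectorized (if/else works on single values only)
some_function <- function(x, y) {
  if (x > y) x - y else x + y
}

# mapply walks both columns in parallel: some_function(x[1], y[1]), some_function(x[2], y[2]), ...
data_frame$new_column <- mapply(some_function, data_frame$x, data_frame$y)

data_frame
```

mapply is the multi-argument relative of lapply/sapply, which makes it a natural fit when a function takes one value from each of several columns.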

Extract a value from a dataframe iteratively (R)

I have a function to select a value from a dataframe. I want to select that value, save it, remove it from the dataset, and select a value using the same function from the remaining values in the dataframe. What is the best way to do this?
Here is a simple example:
V1 <- c(5,6,7,8,9,10)
df <- data.frame(V1)
V2 <- as.data.frame(matrix(nrow=3,ncol=1))
maximum <- function(x){
max(x)
}
V2[i,]<- maximum(df)
df <- anti_join(df,V2,by='V1')
How can I set this up such that I reapply the maximum function to the remaining values in df and save these values in V2?
I'm using a different and more complex set of functions and if/else statements than max - this is just an example. I do have to reapply the function to the remaining values, because I will be using the function on a new dataframe if df is empty.
Is this what you're looking for?
V1 <- data.frame(origin = c(5,6,7,8,9,10))
V2 <- as.data.frame(matrix(nrow=3,ncol=1))
df1 <- V1
df2 <- V2
recursive_function <- function(df1,df2,depth = 3,count = 1){
if (count == depth){
# Find index
indx <- which.max(df1[,1])
curVal <- df1[indx,1]
df2[count,1] <- curVal
df1 <- df1[-indx, ,drop = FALSE]
return(list(df1,
df2))
} else {
# Find index
indx <- which.max(df1[,1])
# Find Value
curVal <- df1[indx,1]
# Add value to new data frame
df2[count,1] <- curVal
# Subtract value from old dataframe
df1 <- df1[-indx, ,drop = FALSE]
recursive_function(df1,df2,depth,count + 1)
}
}
recursive_function(df1,df2)
Here is another solution that I stumbled across:
V1 <- c(5,6,7,8,9,10)
df <- data.frame(V1)
minFun <- function(df, maxRun){
V2 <- as.data.frame(matrix(nrow=maxRun,ncol=1))
for(i in 1:maxRun){
V2[i,]<- min(df)
df <- dplyr::anti_join(df,V2,by='V1')
}
return(V2)
}
test <- minFun(df = df, maxRun = 3)
test
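A loop-based sketch in base R that works with any selector function is shown below. The names extract_n and pick_index are hypothetical, and it assumes the selector returns the position of the chosen element (as which.max and which.min do):

```r
# Repeatedly pick one element, store it, and remove it from the pool.
# pick_index must return the position of the chosen element.
extract_n <- function(values, pick_index, n) {
  picked <- values[0]  # empty vector of the same type as values
  while (length(picked) < n && length(values) > 0) {
    i <- pick_index(values)
    picked <- c(picked, values[i])  # save the chosen value
    values <- values[-i]            # drop it from the remaining pool
  }
  list(picked = picked, remaining = values)
}

res <- extract_n(c(5, 6, 7, 8, 9, 10), which.max, n = 3)
res$picked     # 10 9 8
res$remaining  # 5 6 7
```

Unlike the anti_join approach, this removes only the selected element, so duplicated values stay in the pool until they are picked themselves.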

r loop for filtering through each column

I have a data frame like this:
[Image: gene expression data frame]
The columns are different samples and the rows are different genes.
Now I want to know how many genes are left after I filter each column by a threshold.
For example,
sample1_more_than_5 <- df[(df[,1]>5),]
sample1_more_than_10 <- df[(df[,1]>10),]
sample1_more_than_20 <- df[(df[,1]>20),]
sample1_more_than_30 <- df[(df[,1]>30),]
Then,
sample2_more_than_5 <- df[(df[,2]>5),]
sample2_more_than_10 <- df[(df[,2]>10),]
sample2_more_than_20 <- df[(df[,2]>20),]
sample2_more_than_30 <- df[(df[,2]>30),]
But I don't want to repeat this 100 times as I have 100 samples.
Can anyone write a loop for me for this situation? Thank you
Here is a solution using two loops that calculates, by each sample (columns), the number of genes (rows) that have a value greater than the one indicated in the nums vector.
#Create the vector with the numbers used to filter each columns
nums<-c(5, 10, 20, 30)
#Loop for each column
resul <- apply(df, 2, function(x){
#Get the length of rows that have a higher value than each nums entry
sapply(nums, function(y){
length(x[x>y])
})
})
#Transform the data into a data.frame and add the nums vector in the first column
resul<-data.frame(greaterthan = nums,
as.data.frame(resul))
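The same counts can also be sketched with colSums on a logical comparison, which avoids the inner loop. The data below is made up, with two samples and five genes:

```r
# Made-up expression data: rows are genes, columns are samples
df <- data.frame(sample1 = c(3, 8, 15, 25, 40),
                 sample2 = c(6, 12, 22, 31, 50))
thresholds <- c(5, 10, 20, 30)

# df > t is a logical matrix; colSums counts the TRUE values per sample
counts <- sapply(thresholds, function(t) colSums(df > t))
colnames(counts) <- paste0("gt_", thresholds)

counts
```

Here counts has one row per sample and one column per threshold.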
We can loop over the columns and do this and create the grouping with cut
lst1 <- lapply(df, function(x) split(x, cut(x, breaks = c(5, 10, 20, 30))))
or findInterval and then split
lst1 <- lapply(df, function(x) split(x, findInterval(x, c(5, 10, 20, 30))))
If we follow the way the objects are created in the OP's post, there would be 100 * 4, i.e. 400, objects (for 100 columns) in the global environment. Instead, they can be stored in a single list object.
The objects can be created, but it is not recommended:
v1 <- c(5, 10, 20, 30)
v2 <- seq_along(df)
for(i in v2) {
for(j in v1) {
assign(sprintf('sample%d_more_than_%d', i, j),
value = df[df[,i] > j,, drop = FALSE])
}
}
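As a sketch of the single-list alternative, the filtered data frames can be kept in a nested list indexed by sample and threshold. The example data and the element names (more_than_5 and so on) are assumptions chosen to mirror the OP's object names:

```r
# Made-up data: rows are genes, columns are samples
df <- data.frame(sample1 = c(3, 8, 15),
                 sample2 = c(12, 4, 31),
                 row.names = c("geneA", "geneB", "geneC"))
thresholds <- c(5, 10)

# Outer lapply: one element per sample; inner lapply: one element per threshold
filtered <- lapply(df, function(col) {
  setNames(
    lapply(thresholds, function(t) df[col > t, , drop = FALSE]),
    paste0("more_than_", thresholds)
  )
})

rownames(filtered$sample1$more_than_5)   # "geneB" "geneC"
rownames(filtered$sample2$more_than_10)  # "geneA" "geneC"
```

Everything stays in one object, so it can be passed to functions or iterated over without get() or assign().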

Loop through df and create new df in R

I have a df (10 rows, 15 columns)
df<-data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
I want to loop over each column, do something to each row and create a new df with the answer.
I actually want to do a linear regression on each column. I get back a list for each column. For example I have a second df with what I want to put into the lm. df2<-data.frame(replicate(2,sample(0:1,10,rep=TRUE)))
I then want to do something like:
new_df <- data.frame()
for (i in 1:ncol(df)){
j<-lm(df[,i] ~ df2$X1 + df2$X2)
temp_df<-j$residuals
new_df[,i]<-cbind(new_df,temp_df)
}
I get the error:
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 0, 8
I have checked other similar posts but they always seem to involve a function or something similarly complex for a newbie like me. Please help
This can be done without loops but for your understanding, using loops we can do
new_df <- df
for (i in names(df)) {
j<-lm(df[,i] ~ df2$X1 + df2$X2)
new_df[i] <- j$residuals
}
You are initialising new_df as an empty data frame with 0 rows and 0 columns, so assigning values to it gives an error. Instead, assign the original df to new_df, since both share the same structure, and then use the loop above.
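For comparison, a loop-free sketch that regresses every column of df on the two predictors in df2 and collects the residuals. It assumes the simulated data from the question; set.seed is added only to make the example reproducible:

```r
set.seed(24)
df  <- data.frame(replicate(15, sample(0:1, 10, replace = TRUE)))
df2 <- data.frame(replicate(2,  sample(0:1, 10, replace = TRUE)))

# One regression per column of df; X1 and X2 are looked up in df2
new_df <- as.data.frame(
  lapply(df, function(y) residuals(lm(y ~ X1 + X2, data = df2)))
)

# Equivalent single call: lm accepts a matrix response and fits all columns at once
resid_matrix <- residuals(lm(as.matrix(df) ~ X1 + X2, data = df2))

dim(new_df)  # 10 rows, 15 columns
```

Passing data = df2 keeps the predictors unambiguous, since df also has columns named X1 and X2.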
Update
Based on the new example
lst1 <- lapply(names(df), function(nm) {dat <- cbind(df[nm], df2[c('X1', 'X2')])
lm(paste0(nm, "~ X1 + X2"), data = dat)$residuals})
out <- setNames(data.frame(lst1), names(df))
Also, this doesn't need any loop
out2 <- lm(as.matrix(df) ~ X1 + X2, data = cbind(df, df2))$residuals
Old
We can do this easily without any loop
new_df <- df + 10
---
If we need a loop, it can be done with `lapply`
new_df <- df
new_df[] <- lapply(df, function(x) x + 10)
---
Or with a `for` loop
lst1 <- vector('list', ncol(df))
for(i in seq_along(df)) lst1[[i]] <- df[, i] + 10
new_df <- as.data.frame(lst1)
data
set.seed(24)
df <- data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
df2 <- data.frame(replicate(2,sample(0:1,10,rep=TRUE)))
I would do as suggested by akrun. But if you do need (or want) to loop for some reasons you can use:
df<-data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
new_df <- data.frame(replicate(15, rep(NA, 10)))
for (i in 1:ncol(df)){
new_df[ ,i] <- df[ , i] + 10
}

multiply all columns in data frame in R

I need to multiply all columns in a data frame with each other. As an example, I need to achieve the following:
mydata$C1_2<-mydata$sic1*mydata$sic2
but for all my columns with values going from 1 to 733 (sic1, sic2, sic3,..., sic733).
I've tried the following but it doesn't work:
for(i in 1:733){
for(j in 1:733){
mydata$C[i]_[j]<-mydata$sic[i]*mydata$sic[j]
}
}
Could you help me? Thanks for your help.
Setting aside whether you really want what you think you want, I feel like this could help:
df <- data.frame(
a = 1:4
, b = 1:4
, c = 4:1
)
multiplyColumns <- function(name1, name2, df){
df[, name1] * df[, name2]
}
combinations <- expand.grid(names(df), names(df), stringsAsFactors = FALSE)
names4result <- paste(combinations[,1], combinations[,2], sep = "_")
result <- as.data.frame(mapply(multiplyColumns, combinations[,1], combinations[,2], MoreArgs = list(df = df)))
names(result) <- names4result
result
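If only the distinct pairs are needed (sic1 * sic2 but not also sic2 * sic1 or sic1 * sic1), combn can enumerate them. This is a sketch under that assumption, using made-up columns named like the ones in the question:

```r
# Made-up data with three of the sic columns
df <- data.frame(sic1 = 1:4, sic2 = 1:4, sic3 = 4:1)

# All unordered pairs of column names, one pair per matrix column
pairs <- combn(names(df), 2)

# Multiply each pair of columns; apply returns one result column per pair
products <- as.data.frame(
  apply(pairs, 2, function(p) df[[p[1]]] * df[[p[2]]])
)
names(products) <- apply(pairs, 2, paste, collapse = "_")

products
```

With 733 columns this yields 733 * 732 / 2 = 268,278 product columns instead of 733^2 = 537,289, which roughly halves the memory needed.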
