I'm trying to append a row to an existing data frame in R. The data frame represents a subject and I want to update it with newly generated data. When I run the code below, the row names (index numbers) of the data frame become strange:
1, 2, 21, 211, 2111, 21111, etc.
These are not practical to read.
How can I get 'normal' index numbers (1, 2, 3, 4, etc.)?
x <- 10
y <- 463
dat <- data.frame(x,y)
for (i in 1:10) {
  dat.sub <- dat[nrow(dat),] # select the last row from 'dat'
  dat.sub <- within(dat.sub, { # within that selection update the objects
    x <- x+1
    y <- y+1
  })
  dat <- rbind(dat, dat.sub, deparse.level = 2) # attach updated row to the 'dat'
}
dat
dat[3,]
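I think the problem is that dat.sub is a data.frame and keeps the row name of the row it was copied from, so from the second iteration on rbind has to make the duplicated row names unique (21, 211, ...). The easiest fix is to strip the data.frame class from dat.sub so that it carries no row names. One way is:
dat.sub <- c(within(dat.sub, { # within that selection update the objects
  x <- x+1
  y <- y+1
}))
Wrapping the call in c() inside your for loop turns dat.sub into a plain list, so rbind numbers the new row itself.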
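Another option, offered as a minimal sketch rather than the only fix: leave the loop building the data frame as it is and reset the row names once at the end with rownames(dat) <- NULL, which restores the automatic 1, 2, 3, ... numbering. The loop body below is a simplified version of the one in the question.

```r
x <- 10
y <- 463
dat <- data.frame(x, y)

for (i in 1:10) {
  dat.sub <- dat[nrow(dat), ]   # copy the last row (keeps its row name)
  dat.sub$x <- dat.sub$x + 1    # update the copied values
  dat.sub$y <- dat.sub$y + 1
  dat <- rbind(dat, dat.sub)    # row names become 1, 2, 21, 211, ...
}

rownames(dat) <- NULL           # back to automatic row names 1..nrow(dat)
dat[3, ]
```

If the number of rows is known in advance, preallocating the data frame and filling it by index avoids both the odd row names and the repeated copying done by rbind.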
I want to extract values from a list of (named) lists in R.
Example
The data looks as follows:
data <- list('1' = list(x = c(1,2,3), y = c(2,3,4), z = c(2,3,7)),
'2' = list(x = c(2,3,4,5), y = c(3,4,5,6), z = c(1,2,3,5)))
From a specified list (e.g., '1'), I would like to extract all the first/second/etc elements from the lists. The choice for the index of the element should be random.
For example, if I want to sample from the first list (i.e., '1'), I generate a random index and extract the x, y, and z values corresponding to that random index. Say the index is 2, then the elements should be x=2, y=3, and z=3.
Approach
I thought a function should be able to do the job. The first step was to call the list from the function:
This works:
x <- function(i){
data$`1`
}
x(1)
But this doesn't:
x <- function(i){
data$`i`
}
x(1)
Question
How do I call a list of named lists from within the function? And what is the most convenient way to sample data corresponding to the selected index?
Do you need something like this?
get_elements <- function(data, i) {
  # select the main list
  tmp <- data[[i]]
  # check the length of each sub-list, take the minimum,
  # and sample 1 number from 1 to that minimum
  rand_int <- sample(min(lengths(tmp)), 1)
  # select that element from each sub-list
  sapply(tmp, `[[`, rand_int)
}
get_elements(data, 1)
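On the first part of the question, a short sketch of why data$`i` fails inside the function: `$` takes the name literally, so it looks for an element called "i", whereas [[ ]] evaluates i first. Note that data[[1]] selects by position and data[["1"]] selects by name; the two coincide here only because the first element happens to be named '1'. The index j is fixed below instead of sampled so that the result is reproducible.

```r
data <- list('1' = list(x = c(1, 2, 3), y = c(2, 3, 4), z = c(2, 3, 7)),
             '2' = list(x = c(2, 3, 4, 5), y = c(3, 4, 5, 6), z = c(1, 2, 3, 5)))

i <- "2"
by_dollar  <- data$i      # looks for an element literally named "i"
by_bracket <- data[[i]]   # uses the value of i, i.e. the element named "2"

is.null(by_dollar)                 # no element is called "i"
identical(by_bracket, data$`2`)    # same object as the backtick form

# Pick the j-th value from every sub-list of list '1'
j <- 2
picked <- sapply(data[["1"]], `[[`, j)
picked
```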
If I understood your problem correctly, one solution uses the "purrr" package:
library(purrr)
# list "name"
i <- '1'
# index
j <- 2
# to get the needed info as a list:
purrr::map(data[[i]], ~ .x[j])
# to get the needed info as a data.frame:
purrr::map_df(data[[i]], ~ .x[j])
I have a data frame like this:
[image: gene expression data frame]
The column names are the different samples and the row names are the different genes.
Now I want to know how many genes are left after I filter each column by a threshold.
For example,
sample1_more_than_5 <- df[(df[,1]>5),]
sample1_more_than_10 <- df[(df[,1]>10),]
sample1_more_than_20 <- df[(df[,1]>20),]
sample1_more_than_30 <- df[(df[,1]>30),]
Then,
sample2_more_than_5 <- df[(df[,2]>5),]
sample2_more_than_10 <- df[(df[,2]>10),]
sample2_more_than_20 <- df[(df[,2]>20),]
sample2_more_than_30 <- df[(df[,2]>30),]
But I don't want to repeat this 100 times as I have 100 samples.
Can anyone write a loop for me for this situation? Thank you
Here is a solution using two nested loops that calculates, for each sample (column), the number of genes (rows) with a value greater than each entry of the nums vector.
# Create the vector with the numbers used to filter each column
nums <- c(5, 10, 20, 30)
# Loop over each column
resul <- apply(df, 2, function(x) {
  # Count the rows that have a higher value than each nums entry
  sapply(nums, function(y) {
    length(x[x > y])
  })
})
# Transform the result into a data.frame and add the nums vector as the first column
resul <- data.frame(greaterthan = nums,
                    as.data.frame(resul))
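A more compact way to get the same kind of counts, sketched on made-up data (the small df below is invented for illustration and is not the asker's data): comparing the whole data frame with a threshold gives a logical matrix, and colSums counts the TRUE values per sample.

```r
# Toy data, invented for illustration: rows are genes, columns are samples
df <- data.frame(sample1 = c(3, 8, 15, 25, 40),
                 sample2 = c(6, 12, 22, 31, 2))
nums <- c(5, 10, 20, 30)

# One row per sample, one column of counts per threshold
counts <- sapply(nums, function(t) colSums(df > t))
colnames(counts) <- paste0("more_than_", nums)
counts
```

This assumes every column is numeric and contains no missing values; with NAs present, colSums(df > t, na.rm = TRUE) would be the safer call.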
Alternatively, we can loop over the columns with lapply and create the groups with cut. Note that cut bins the values into the intervals (5,10], (10,20] and (20,30] rather than applying cumulative thresholds, and values outside the breaks become NA and are dropped by split:
lst1 <- lapply(df, function(x) split(x, cut(x, breaks = c(5, 10, 20, 30))))
or use findInterval and then split
lst1 <- lapply(df, function(x) split(x, findInterval(x, c(5, 10, 20, 30))))
If we created the objects the way the OP's post does, there would be 100 * 4, i.e. 400, objects (for 100 columns) in the global environment. Instead, the result can be a single list object.
The individual objects can be created, but this is not recommended:
v1 <- c(5, 10, 20, 30)
v2 <- seq_along(df)
for(i in v2) {
  for(j in v1) {
    assign(sprintf('sample%d_more_than_%d', i, j),
           value = df[df[,i] > j,, drop = FALSE])
  }
}
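To illustrate the "single list object" idea, here is a hedged sketch that keeps the filtered subsets in a nested list instead of 400 global variables. The toy df and the names filtered and more_than_* are invented for this example.

```r
# Toy data, invented for illustration
df <- data.frame(sample1 = c(3, 8, 15, 25, 40),
                 sample2 = c(6, 12, 22, 31, 2))
thresholds <- c(5, 10, 20, 30)

# Outer lapply: one entry per sample (column); inner lapply: one per threshold
filtered <- lapply(df, function(col) {
  subsets <- lapply(thresholds, function(t) df[col > t, , drop = FALSE])
  names(subsets) <- paste0("more_than_", thresholds)
  subsets
})

# Equivalent of the OP's sample1_more_than_20
filtered$sample1$more_than_20
nrow(filtered$sample1$more_than_20)
```

Each subset stays addressable by name, and the whole collection can be passed to lapply or saved as one object.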
I'm writing a function that takes the most recent observation and adds to it the previous day's value multiplied by a designated share. The version below uses just one transformation and works:
library(dplyr) # lag() with a 'default' argument is dplyr::lag
df1 <- data.frame(var1 = rnorm(10, 3, 2), var2 = rnorm(10, 4, 3))
df1$carryover <- lag(df1$var1, 1, default = 0) * 0.5 + df1$var1
> df1
        var1      var2 carryover
1  3.2894474 2.0839128 3.2894474
2  3.6059389 7.8880658 5.2506625
3 -1.4274057 6.2763882 0.3755637
4  3.8531253 3.2653448 3.1394225
My function attempts to do the same but across multiple different shares, see below:
carryover <- function(x){
  result_df <- data.frame(x)
  xnames <- names(x)
  for (i in 1:7){
    result_column <- lag(x, 1, default = 0)*(i/10) + x
    result_column_name <- paste(xnames, i, sep = "_")
    result_df[result_column_name] <- result_column
  }
  return(result_df)
}
When I run carryover(df1), df1$var1 remains the same across all iterations while df1$var2 takes lag values across rows, when I'm aiming for columns. What is structurally wrong with my function that keeps it from lagging the column values?
I worked on this a bit using feedback from Stack Overflow and came up with the solution below: define the carryover function inside a larger function, then use apply with MARGIN = 2 to calculate by column:
adStock <- function(x){
  # create data frame to store results in
  result_df <- data.frame(x)
  # keep the column names to build the new column names from
  xnames <- names(x)
  # create vector of carryover shares
  carryovers <- seq(.1, .7, .1)
  # create carryover function (uses the loop variable i as the share)
  carryover <- function(x){
    x + dplyr::lag(x, 1, default = 0)*(i)
  }
  # run for loop across all carryover values
  for (i in carryovers){
    result_column <- apply(x, 2, carryover)
    result_column_name <- paste(xnames, i, sep = "_")
    result_df[result_column_name] <- result_column
  }
  return(data.frame(result_df))
}
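For comparison, a base R sketch of the same idea that passes the share explicitly instead of reading the loop variable from the enclosing scope. The helper lag1 and the function name ad_stock are hypothetical names introduced here, and lag1 is a hand-written stand-in for dplyr::lag(x, 1, default = 0).

```r
# Hypothetical helper: shift a vector down by one position, filling with 0
lag1 <- function(v) c(0, head(v, -1))

# For every column and every share s, add column + lagged column * s
ad_stock <- function(x, shares = seq(0.1, 0.7, by = 0.1)) {
  result <- x
  for (s in shares) {
    for (nm in names(x)) {
      result[[paste(nm, s, sep = "_")]] <- x[[nm]] + lag1(x[[nm]]) * s
    }
  }
  result
}

# Small deterministic input so the arithmetic is easy to check by hand
df1 <- data.frame(var1 = c(1, 2, 3), var2 = c(10, 20, 30))
out <- ad_stock(df1, shares = 0.5)
out
```

Looping over column names keeps each lag within its own column, which is the behaviour the apply(x, 2, ...) call above is after.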
I have a main data frame (mydata) and two secondary ones (df1, df2) as follows:
x <- c(1, 2, 3, 4, 5)
y <- c(5, 4, 3, 2, 1)
mydata <- data.frame(x)
df1 <- data.frame(y)
df2 <- data.frame(y)
df2$y <- y+1 #This way, the columns in the df have the same name but different values.
I want to create new columns in mydata based on a formula with the variables in df1 and df2 like this:
mydata$new1 <- mydata$x*df1$y
mydata$new2 <- mydata$x*df2$y
Is there a way I can do this with a for loop? This is what I had in mind:
for (i in 2) {
mydata$paste0("new", i) <- mydata$x*dfpaste0(i)$y
}
Something along the lines of:
for (i in 1:2) {
  mydata[[paste0("new", i)]] <- mydata$x * get(paste0("df", i))$y
}
We could also use mget to collect the objects in a list and multiply them by the relevant vector
mydata[paste0("new", 1:2)] <- mydata$x * data.frame(mget(paste0("df", 1:2)))
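If you control how df1 and df2 are created, a sketch that avoids get/mget altogether is to keep the secondary data frames in a list and index it. The list name dfs is an assumption introduced for this example.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(5, 4, 3, 2, 1)
mydata <- data.frame(x)

# Keep the secondary data frames in one list instead of df1, df2, ...
dfs <- list(data.frame(y = y), data.frame(y = y + 1))

for (i in seq_along(dfs)) {
  # [[ ]] accepts a computed column name, unlike $
  mydata[[paste0("new", i)]] <- mydata$x * dfs[[i]]$y
}
mydata
```

The loop then scales to any number of secondary data frames without looking objects up by name.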
I have a data.frame that I want to subset every 10 rows, apply a function to each subset, save the resulting object, and remove the previous object. Here is what I have so far:
L3 <- LETTERS[1:20]
df <- data.frame(1:391, "col", sample(L3, 391, replace = TRUE))
names(df) <- c("a", "b", "c")
b <- seq(from=1, to=391, by=10)
nsamp <- 0
for(i in seq_along(b)){
  a <- i+1
  nsamp <- nsamp+1
  df_10 <- df[b[nsamp]:b[a], ]
  res <- lapply(seq_along(df_10$b), function(x){...})
  saveRDS(res, file="res.rds")
  rm(res)
}
My problem is that the for loop crashes when it reaches the last element of my sequence b.
When partitioning data, split is your friend. It will create a list with each data subset as an item which is then easy to iterate over.
dfs = split(df, (seq_len(nrow(df)) - 1) %/% 10) # subtract 1 so rows 1-10 form the first group
Then your for loop can be simplified to something like this (untested... I'm not exactly sure what you're doing because example data seems to switch from df to sc2_10 and I only hope your column named b is different from your vector named b):
for(i in seq_along(dfs)){
  res <- lapply(seq_along(dfs[[i]]$b), function(x){...})
  saveRDS(res, file = sprintf("res_%s.rds", i))
  rm(res)
}
I also modified your save file name so that you aren't overwriting the same file every time.
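A small runnable sketch of the chunking step, to check the group sizes before plugging in the real per-chunk function. The nrow() call stands in for whatever the elided function does, and writing to tempdir() is an assumption made so the example does not clutter the working directory.

```r
df <- data.frame(a = 1:391, b = "col")

# Rows 1-10 get id 0, rows 11-20 get id 1, ..., row 391 gets id 39
chunk_id <- (seq_len(nrow(df)) - 1) %/% 10
dfs <- split(df, chunk_id)

# 39 chunks of 10 rows and a final chunk with the single leftover row
sizes <- vapply(dfs, nrow, integer(1))

out_dir <- tempdir()
for (i in seq_along(dfs)) {
  res <- nrow(dfs[[i]])   # placeholder for the real per-chunk computation
  saveRDS(res, file = file.path(out_dir, sprintf("res_%s.rds", i)))
  rm(res)
}
```

Because the loop runs over the chunks themselves, there is no b[i + 1] lookup that can run past the end of the sequence.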