How to put many rows in a dataframe by looping in R

I am looping over a vector, for example c("A", "B", "C"), with a for loop.
Each iteration computes a value v, so the three runs give three different values: v1, v2 and v3.
In each iteration I want to use cbind("A", "v1") to build one row, and after the three iterations I want to combine those rows into a dataframe.
At the end, I want a dataframe in this format:
"A" v1
"B" v2
"C" v3
How to get this output? Thanks!

I may have misunderstood the request, but is the following what you are looking for?
input <- c("A", "B", "C")
data.frame(x=input, y=paste0("v", seq_along(input)))
# x y
# 1 A v1
# 2 B v2
# 3 C v3
Note that the approach you mentioned in your question (iteratively building a row and combining it with the existing data via rbind) is a bad idea, both because it takes a lot more typing (the operation above fits in one line) and because it is inefficient (you can read more about that in the second circle of The R Inferno).

The part I was stuck on is that I have to start with an empty dataframe:
output <- data.frame()
for (e in mylist) {
  v <- some_function(e) # placeholder: whatever function derives the value from e
  one_row <- cbind(e, v) # combine e with its corresponding v
  new_f <- data.frame(one_row)
  output <- rbind(output, new_f)
}
At the end, I get the right output.
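For comparison, here is a minimal sketch of the same result without growing the dataframe inside the loop. The function get_value is a hypothetical stand-in for whatever computes v from e, and the names keys, values and output are assumptions for illustration only.

```r
# Hypothetical stand-in for the real computation of v from e
get_value <- function(e) paste0("v_", tolower(e))

keys <- c("A", "B", "C")

# Compute all values first; vapply() checks that each result is one string
values <- vapply(keys, get_value, character(1), USE.NAMES = FALSE)

# Build the dataframe once, instead of calling rbind() in every iteration
output <- data.frame(e = keys, v = values, stringsAsFactors = FALSE)
print(output)
```

If each iteration really has to produce a whole row, collecting the rows in a list and calling do.call(rbind, rows) once at the end should give the same result while avoiding the repeated copying.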

Related

Recursive indexing only works up to [[1:3]]

I need to refer to individual dataframes within a list of dataframes (one by one) produced from a lapply function, but I'm getting the "recursive indexing failed at level 3" error. I've found similar questions, but none of them explain why this doesn't work.
I used lapply to make a list of dataframes, each with a different filter applied. The output in my reproducible example has 4 dataframes in the output (dfs). Now I want to refer to each dataframe in turn by indexing its position in the list.
If I use the format c(dfs[[1]], dfs[[2]], dfs[[3]], dfs[[4]]) I get the output that I want, and it works for the next function I need to apply, but it seems very inefficient.
When I try to shorten it by using c(dfs[[1:4]]) instead, I get the error Error in data1[[1:4]] : recursive indexing failed at level 3. If I try c(dfs[[1:3]]), it runs, but doesn't give the output I expect (no longer a list of dataframes).
Here's an example:
library(tidyverse) # for glimpse, filter, mutate
data(mtcars)
mtcars2 <- mutate(mtcars, var = rep(c("A", "B", "C", "D"), len = 32)) # need a variable with more than 3 possible outcomes
glimpse(mtcars2)
list <- c("A", "B", "C", "D") # each new dataframe will filter based on these variables
dfs <- lapply(list, function(x) {
mtcars2 %>% filter(var == x) %>% glimpse()
}) # each dataframe now only contains A, B, C, or D
dfs # list of dataframes produced from lapply
dflist1 <- list(dfs[[1]], dfs[[2]], dfs[[3]], dfs[[4]]) # indexing one by one
dflist1 # this is what I want
dflist2 <- list(dfs[[1:4]]) # indexing all together
dflist2 # this produces an error
dflist3 <- list(dfs[[1:3]])
dflist3 # this runs, but the output is just `[[1]] [1] 4`, not a list of dataframes
I want something that looks like the output from dflist1 but that doesn't require me to add and remove list items every time the number of dataframes changes. I can't use the lapply output (dfs) as it is because my next function can't locate the variables within each dataframe as needed.
Any guidance appreciated.
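One likely explanation, offered as a sketch rather than a definitive answer: when [[ receives a vector of length greater than one, it indexes recursively, so dfs[[1:3]] means dfs[[1]][[2]][[3]] (the third value of the second column of the first dataframe), and a fourth level fails because that single value cannot be indexed any further. Single brackets return a sub-list instead. The small data frame below is a hypothetical stand-in for mtcars2, built with base R only.

```r
# Hypothetical stand-in for mtcars2: a grouping column with 4 levels
df <- data.frame(var = rep(c("A", "B", "C", "D"), length.out = 8), val = 1:8)
groups <- c("A", "B", "C", "D")
dfs <- lapply(groups, function(g) df[df$var == g, ])

# Single brackets keep the result a list, here a list of 4 data frames
dflist <- dfs[1:4]

# seq_along() avoids hard-coding the number of data frames
dflist_all <- dfs[seq_along(dfs)]

# Double brackets with a vector index recursively:
# dfs[[1:2]] is the same as dfs[[1]][[2]], the second column of the first data frame
second_col_of_first <- dfs[[1:2]]
```

Since dfs is already a list of dataframes, dfs[1:4] here is the same object as dfs itself, so the next function may simply need to be applied with lapply(dfs, ...).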

Split Data Frame and call subframe rows by their index

This is a very basic R programming question but I haven't found the answer anywhere, would really appreciate your help:
I split my large dataframe into 23 subframes of 4 rows in length as follows:
DataframeSplits <- split(Dataframe,rep(1:23,each=4))
Say I want to call the second subframe I can:
DataframeSplits[2]
But what if I want to call a specific row of that subframe (using index position)?
I was hoping for something like this (say I call 2nd subframe's 2nd row):
DataframeSplits[2][2,]
But that doesn't work with the error message
Error in DataframeSplits[2][2, ] : incorrect number of dimensions
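If you want to subset the list returned by split and use the result for further subsetting, you must use double brackets ([[ ]]) to get to the sub-data.frame itself. Single brackets on a list return another list, which has no rows or columns, hence the "incorrect number of dimensions" error. Once you have the sub-data.frame, you can subset it with single brackets as you already tried: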
Dataframe <- data.frame(x = rep(c("a", "b", "c", "d"), 23), y = 1)
DataframeSplits <- split(Dataframe,rep(1:23,each=4))
DataframeSplits[[2]][2,]
# x y
# 6 b 1
More info on subsetting can be found in the excellent book by Hadley Wickham.
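As a small illustration of the difference between the two bracket types, here is a sketch on a made-up 12-row data frame (the names df and pieces are assumptions, not from the question):

```r
df <- data.frame(x = rep(c("a", "b", "c", "d"), 3), y = 1:12)
pieces <- split(df, rep(1:3, each = 4))

class(pieces[2])    # "list": still a list, of length one
class(pieces[[2]])  # "data.frame": the sub-data.frame itself

# Second row of the second sub-data.frame
second_row <- pieces[[2]][2, ]

# split() names the pieces after the grouping values, so this is equivalent
second_row_by_name <- pieces[["2"]][2, ]
```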

Count with conditional - dataframe

I would like to count how many times an observation appears, under the condition that one column is greater than another.
For example, how many times "A", "B" and "C" appear, counting only the rows where column B is greater than column C.
set.seed(20170524)
A <- rep(c("A","B","C"),5)
B <- round(runif(15,0,20),0)
C <- round(runif(15,1,5),0) + B
D <- as.data.frame(cbind(A,B,C))
D <- D[order(B),]
Thank you!
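# First, cbind() turned those numbers into factors (character strings in R >= 4.0.0), which is problematic.
# Convert via as.character() so that factor level codes are not mistaken for the actual values.
D$B <- as.numeric(as.character(D$B))
D$C <- as.numeric(as.character(D$C))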
# Then get the count for "A" (rows where B is greater than C, as asked):
countA <- sum(D$A == "A" & D$B > D$C)
Similarly for "B" and "C".
If there's many more than just categories "A,B,C" you might want to do a data.table for the by= option, but someone will probably be along to say that's overkill.
You can use: table(D$A[which(D$B>D$C)])
Note that D <- as.data.frame(cbind(A,B,C)) goes through a character matrix, so B and C end up as factors (or as character strings from R 4.0.0 on). Either convert B and C back to numeric afterwards, or create the data.frame directly without passing through a matrix:
D <- data.frame(A,B,C)
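A minimal sketch of the whole count, using small made-up numbers (not the simulated data from the question) so that some rows actually satisfy B > C:

```r
# Hypothetical data: built directly as a data.frame, so B and C stay numeric
D <- data.frame(
  A = rep(c("A", "B", "C"), 2),
  B = c(5, 1, 7, 11, 2, 3),
  C = c(3, 4, 6, 10, 1, 8)
)

# Keep the labels of the rows where B exceeds C, then tabulate them
counts <- table(D$A[D$B > D$C])
print(counts)
```

A logical index is enough here; which() is only needed if the comparison can produce NA values that should be dropped.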

R: Iteratively extract not NA values for columns in a data table and split into separate columns without typing column names

I am working with a large dataset containing many variables. Therefore, I want to avoid typing in column names at all times. I want to iterate through the columns in my data and extract the value contained fields per column. In other words, I want to end up with separate data tables for each column, none of them containing NA values.
My approach is to write a loop that first eliminates the NA values per column. I extracted the column names in a separate column matrix when reading the .csv file (using fread). The problem is that I did not manage to exclude the column names or the NA with my approach. I worked out a small example to illustrate the problem:
# Example data
dt = data.table(color=c("b","g","r","y",NA),
size=c("S", "XL", NA, NA, "M"),
number=(1:5))
columns = matrix(c("color", "size", "number"), nrow=3, ncol=1)
The loop shown below works, although it is not really a loop because it still requires inserting the column name in the first line:
# Works (but requires typing in the column name)
for(i in 1:1){
var <- dt %>% group_by(color) %>% filter(!is.na(color))
name <- paste("new", columns[i], sep=".")
assign(name, var[, columns[i], with=FALSE])}
# Output:
color
(chr)
1 b
2 g
3 r
4 y
My idea is to refer inside the loop to the subsequent columns by using the extracted column names. The problem here is that the NA values do not get eliminated, i.e., the first line of code inside the loop is not working:
# Does not work
for(i in 1:1){
var <- dt %>% group_by(columns[i]) %>% filter(!is.na(columns[i]))
name <- paste("new", columns[i], sep=".")
assign(name, var[, columns[i], with=FALSE])}
# Output:
color
(chr)
1 b
2 g
3 r
4 y
5 NA
Can anyone help me out to end up with separate columns (of unequal lengths) that do not contain NA values, without typing in the column names? (Another approach than I have used is certainly welcome as well.) Thanks in advance!
sapply(columns, function(x) c(na.omit(dt[[x]])), USE.NAMES = TRUE)
#$color
#[1] "b" "g" "r" "y"
#
#$size
#[1] "S" "XL" "M"
#
#$number
#[1] 1 2 3 4 5
The c() isn't necessary - I just used it to strip na.omit class info to make the output clearer.
And don't use assign - just store the items in a list as above and work with that.
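On why the second loop in the question keeps the NA: columns[i] evaluates to the string "color", and is.na("color") is always FALSE, so filter() keeps every row. Indexing the column by name with [[ ]] avoids that. The sketch below stores the results in a named list instead of using assign; it uses a plain data.frame as a stand-in for the data.table so that it runs with base R only, and the names cols and non_na are assumptions.

```r
# Stand-in for the example data (a base data.frame instead of a data.table)
dt <- data.frame(
  color  = c("b", "g", "r", "y", NA),
  size   = c("S", "XL", NA, NA, "M"),
  number = 1:5,
  stringsAsFactors = FALSE
)

# Take the column names from the data itself, so nothing is typed by hand
cols <- names(dt)

# One list element per column, each with its NA values dropped
non_na <- lapply(setNames(cols, cols), function(col) {
  x <- dt[[col]]   # dt[[col]] looks the column up by the string in col
  x[!is.na(x)]
})

non_na$size
```

Each element can then be accessed as non_na[["color"]], non_na[["size"]] and so on, which replaces the separate new.color, new.size objects that assign would have created.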

ranking multiple data frames and summing across them in R

I have 10 data frames with 2 columns each, I'm calling the dataframes a, b, c, d, e, f, g, h, i and j.
The first column in each data frame is called s for sequences and the second is p for p-values corresponding to each sequence. The s column contains the same sequences across all 10 data frames, essentially the only difference is in the p-values.
Below is a short version of data frame a, which has 600,000 rows.
s p
gtcg 0.06
gtcgg 0.05
gggaa 0.07
cttg 0.05
I want to rank each dataframe by p-value, the smallest p-value should get a rank of 1 and equal p-values should get the same rank. Each final data frame should be in this format:
s p_rank_a
gtcg 2
gtcgg 1
gggaa 3
cttg 1
I've used this to do one:
r<-rank(a$p)
cbind(a$s,r)
but I'm not very familiar with loops and I don't know how to do this automatically. Ultimately I would like a final file that has the s column and in the next column the rank sum of all the ranks across all data frames for each specific sequence.
So basically this:
s ranksum_P_a-j
gtcg 34
gtcgg 5
gggaa 5009093
cttg 499
Please help and thanks!
For a single data.frame, you can do it in one line, as follows
(credit to @Arun for pointing out the use of as.numeric(factor(p))):
library(data.table)
aDT <- data.table(a)[, p_rank := as.numeric(factor(p))]
I would suggest keeping all the data.frames in a single list, so that you can easily iterate over them.
Since your data.frames are named after letters, it's easy to collect the ten of them:
# collect them all
allOfThem <- lapply(letters[1:10], get, envir=.GlobalEnv)
# keep in mind you named an object `c`
# convert to DT and create the ranks
allOfThem <- lapply(allOfThem, function(x) data.table(x)[, p_rank := as.numeric(factor(p))])
On a separate note: it might be a good habit to avoid naming objects "c" or after other common R functions. Otherwise you will start encountering many "unexplainable" behaviors which, after you've beaten your head against a wall for an hour trying to debug them, turn out to come from having overwritten the name of a function. This has never happened to me :)
I'd put all the data.frames in a list and then use lapply and transform as follows:
my_l <- list(a,b,c) # all your data.frames
# you can use rank but it'll give you the average in case of ties
# lapply(my_l, function(x) transform(x, rank_p = rank(p)))
# I prefer this method instead
my_o <- lapply(my_l, function(x) transform(x, p = as.numeric(factor(p))))
# now bind them in to a single data.frame
my_o <- do.call(rbind, my_o)
# now paste them
aggregate(data = my_o, p ~ s, function(x) paste(x, collapse=","))
# s p
# 1 cttg 1,1,1
# 2 gggaa 3,3,3
# 3 gtcg 2,2,2
# 4 gtcgg 1,1,1
Edit: since you've asked for a potentially faster solution (due to the large data), I'd suggest, like @Ricardo, a data.table solution:
require(data.table)
# bind all your data.frames together
dt <- rbindlist(my_l) # my_l is your list of data.frames
# replace p-value with their "rank"
dt[, p := as.numeric(factor(p))]
# set key
setkey(dt, "s")
# combine them using `,`
dt[, list(p_ranks = paste(p, collapse=",")), by=s]
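The question ultimately asks for the sum of the ranks per sequence rather than a pasted string. A minimal base R sketch of that last step follows, assuming every data frame has the columns s and p; the p-values in b are made up for illustration, and the names frames, ranked, all_ranks and rank_sum are assumptions.

```r
# Two small stand-ins for the ten data frames (values in b are hypothetical)
a <- data.frame(s = c("gtcg", "gtcgg", "gggaa", "cttg"),
                p = c(0.06, 0.05, 0.07, 0.05))
b <- data.frame(s = c("gtcg", "gtcgg", "gggaa", "cttg"),
                p = c(0.01, 0.02, 0.02, 0.03))
frames <- list(a = a, b = b)

# Rank within each data frame: smallest p gets 1, ties share a rank
ranked <- lapply(frames, function(x) {
  data.frame(s = x$s, r = as.numeric(factor(x$p)))
})

# Stack all ranks and sum them per sequence
all_ranks <- do.call(rbind, ranked)
rank_sum <- aggregate(r ~ s, data = all_ranks, FUN = sum)
print(rank_sum)
```

With 600,000 rows per data frame, the same idea in data.table (rbindlist followed by a sum grouped by s) is likely to be faster.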
Try this out:
