How to select rows based on pre-determined gaps (R)? - r

I recently posted another question asking how I could create a new data.frame based on a column variable. I thought it would fix my problem, but I realize now that I was asking the wrong thing.
What I actually want to know is: how can I select rows at a constant interval and create a new data.frame from them?
Like, if I have:
1 A B C
2 D E F
3 G H I
4 J K L
5 M N O
6 P Q R
I want to select every second row, like:
2 D E F
4 J K L
6 P Q R
But in my actual case, I need to select every 40th row and create a new data.frame from those rows.
Sorry for another post, but I will be really glad if you guys could help me. I'm a new user of R.

It's very easy with the dplyr package.
library(dplyr)
test %>% dplyr::filter(row_number() %% 2 == 0)
Basically, you take the row number and keep only the even ones. If you wanted every 40th row, you would use row_number() %% 40 == 0. If you wanted to start from another row and still take every 40th row, you just need to change the 0 to another number, since the %% operator returns the remainder of the division (the modulo).

You can use R base functions if you don't want to load any extra packages:
Option 1
df[seq_len(nrow(df))%%2==0,]
Option 2
subset(df, seq_len(nrow(df))%%2==0)
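For the every-40th-row case, the row positions can also be generated directly with seq(). This is only a sketch in base R: the data frame df, its columns, and the step and offset values are made up for illustration, and it assumes the data frame has at least step rows.

```r
# Hypothetical example data: 200 rows with a simple id column
df <- data.frame(id = 1:200, value = (1:200)^2)

step <- 40  # keep every 40th row

# Rows 40, 80, 120, ... (assumes nrow(df) >= step)
every_nth <- df[seq(step, nrow(df), by = step), ]

# Same spacing, but starting at row 5 instead of row 40
offset <- 5
shifted <- df[seq(offset, nrow(df), by = step), ]

every_nth$id
shifted$id
```

Both results are ordinary data frames, so they can be assigned to a new name and used like any other data.frame.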

Related

R: replacing values in a df according to a different df

I am new to R and I am having some trouble.
I am trying to replace some specific values of a data frame according to different values from another df.
For example:
I have these two data frames:
a <- data.frame(c('a','b','c','d'), c('g','e','p','d'))
1 a g
2 b e
3 c p
4 d d
b <- data.frame(c('a','c'))
1 a
2 c
I want to find out which items in df a are also in df b and assign the value of the next column, in this case 'g' and 'p'. I tried the match function, but it has a problem when many items with the same name need to be changed. I would really like a way to do this without checking one by one in a loop.
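One possible reading of this question is "for every row of a whose first column appears in b, take the value from the second column". Under that assumption, a sketch with %in% could look like the following. The column names key and value are invented here, since the original data frames have no column names.

```r
# Column names "key" and "value" are assumptions made for this sketch
a <- data.frame(key = c("a", "b", "c", "d"),
                value = c("g", "e", "p", "d"),
                stringsAsFactors = FALSE)
b <- data.frame(key = c("a", "c"), stringsAsFactors = FALSE)

# TRUE for every row of a whose key also occurs in b;
# unlike match(), this flags all duplicates, not just the first hit
matched <- a$key %in% b$key

# Values from the next column for the matching rows
a$value[matched]
```

Because %in% returns one logical value per row of a, repeated keys are all picked up without a loop.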

For every value in vector: extract value from an appropriate row in a dataframe

I think the best way to explain my question is by an example:
we have a vector:
vector1 (1,2,3,3,5,6,3,7,7)
and a dataframe:
ID VAL
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
I want to create a vector that will look like this:
vector2 (a,b,c,c,e,f,c,g,g)
Sounds very simple and probably is very simple with some trick that I don't know about.
I tried with "%in%" but it produced a vector of values from rows(of the dataframe) present in the vector as opposed to my goal which is a vector of values from the dataframe corresponding to the values in the vector.
Thank you.
Thank you David, following your suggestion I was able to solve my problem.
Though I needed to make some preparation (it was because I oversimplified the example).
Actually (continuing with the naming convention from my example), the "ID" column contained strings, so the dataframe looked like this:
ID VAL
one a
two b
three c
four d
five e
six f
seven g
eight h
And vector1 looked like this: (one,two,three,three,five,six,three,seven,seven)
Then, I figured I should rename the rownames of the dataframe to the names in "ID" and then perform the command you have suggested.
My preparation looked like this:
rownames(dataframe) <- dataframe$ID
vector2 <- dataframe[vector1, "VAL"]
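An alternative that leaves the row names untouched is to look the values up with match(). This is a sketch using the same names as the example above; it is not necessarily what was originally suggested in the comments.

```r
dataframe <- data.frame(
  ID  = c("one", "two", "three", "four", "five", "six", "seven", "eight"),
  VAL = letters[1:8],
  stringsAsFactors = FALSE
)
vector1 <- c("one", "two", "three", "three", "five",
             "six", "three", "seven", "seven")

# match() gives, for each element of vector1, the row position of its ID;
# those positions are then used to pull the corresponding VAL entries
vector2 <- dataframe$VAL[match(vector1, dataframe$ID)]
vector2
```

Any element of vector1 that has no matching ID would come back as NA rather than raising an error.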

how to write a loop of the number of for loops in R?

this is probably a simple one, but I somehow got stuck...
I need many loops to get the result for every sample in my support, like the usual nested loops:
for (a in 1:N1){
for (b in 1:N2){
for (c in 1:N3){
...
}
}
}
but the number of the for loops needed in this messy system depends on another random variable, let's say,
for(f in 1:N.for)
So how can I write a for loop to deal with this? Or are there more elegant ways to do this?
note that the difference is that the nested for loops above (the variables a,b,c,...) do matter in my calculations, but the variable f of the for loop that controls for the number of for loops needed does not go into any of my calculations for my real purpose - all it does is count/ensure the number of for loops needed is correct.
Did I make it clear?
So what I am actually trying to do is generate all the possible combinations of a number of people's preferences towards others.
Let's say I have 6 people (the simplest case for my purpose): Abi, Bob, Cath, Dan, Eva, Fay.
Abi and Bob have preference lists of C D E F ( 4!=24 possible permutations for each of them);
Cath and Dan have preference lists of A B and E F, respectively (2! * 2! = 4 possible permutations for each of them);
Eva and Fay have preference lists of A B C D (4!=24 possible permutations for each of them);
So altogether there should be 24*24*4*4*24*24 possible permutations of preferences when taking all six of them together.
I am just wondering what is a clear, easy and systematic way to generate them all at once?
I'd want them in the format such as
c.prefs <- as.matrix(data.frame(Abi = c("Eva", "Fay", "Dan", "Cath"), Bob = c("Dan", "Eva", "Fay", "Cath")))
but any clear format is fine...
Thank you so much!!
I'll assume you have a list of each loop variable and its maximum value, ordered from the outermost to innermost variable.
loops <- list(a=2, b=3, c=2)
You could create a data frame with all the loop variable values in the correct order with:
(indices <- rev(do.call(expand.grid, lapply(rev(loops), seq_len))))
# a b c
# 1 1 1 1
# 2 1 1 2
# 3 1 2 1
# 4 1 2 2
# 5 1 3 1
# 6 1 3 2
# 7 2 1 1
# 8 2 1 2
# 9 2 2 1
# 10 2 2 2
# 11 2 3 1
# 12 2 3 2
If the code run at the innermost point of the nested loop doesn't depend on the previous iterations, you could use something like apply to process each iteration independently. Otherwise you could loop through the rows of the data frame with a single loop:
for (i in seq_len(nrow(indices))) {
# You can get "a" with indices$a[i], "b" with indices$b[i], etc.
}
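When each iteration is independent, the row-wise processing mentioned above could be sketched with apply() as follows. The computation inside the function (combining a, b and c into one number) is just a placeholder for whatever the innermost loop body would do.

```r
loops <- list(a = 2, b = 3, c = 2)
indices <- rev(do.call(expand.grid, lapply(rev(loops), seq_len)))

# Each row of indices is one combination of loop variables;
# the arithmetic below is a stand-in for the real loop body
results <- apply(indices, 1, function(row) {
  row[["a"]] * 100 + row[["b"]] * 10 + row[["c"]]
})

length(results)  # one result per combination
```

The number of loop variables is now determined only by the length of loops, so no code has to change when a variable is added or removed.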
For the calculation itself, one option is to use the Reduce function or some other higher-order function.
Since your data is not inherently ordered (an individual is part of a set, and its preferences are part of the set), I would keep individuals in a factor and store, for example, the preferences in lists named after the individuals. If you have large data you can store it in an environment.
The first part of the code only makes the example reproducible. The names follow graph terminology (vertices and edges) because the problem resembles a graph. To change the behavior, you only need to change the first line and the runif call.
#people
verts <- factor(c(LETTERS[1:10]))
#relations, disallow preferring yourself
edges<-lapply(seq_along(verts), function(ind) {
levels(verts)[-ind]
})
names(edges) <- levels(verts)
#directions
#say you have these stored in a list or something
pool <- levels(verts)
directions<-lapply(pool, function(vert) {
relations <- pool[unique(round(runif(5, 1, 10)))]
#drop the person themselves from their own preferences
relations[relations != vert]
})
names(directions) = pool
num_prefs <- (lapply(directions, length))
names(num_prefs) <- names(directions)
#First take factorial of each persons preferences,
#then reduce that with multiplication
combinations <-
Reduce(`*`,
sapply(num_prefs, factorial)
)
I hope this answers your question!
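Applied to the six people from the question, the same "factorial per person, then multiply" idea can be checked with a few lines. The per-person counts are taken from the question as stated (4! for Abi, Bob, Eva and Fay; 2! * 2! for Cath and Dan); the variable names are made up for this sketch.

```r
# Number of possible preference orderings per person, as given in the question
perms_per_person <- c(
  Abi  = factorial(4),
  Bob  = factorial(4),
  Cath = factorial(2) * factorial(2),
  Dan  = factorial(2) * factorial(2),
  Eva  = factorial(4),
  Fay  = factorial(4)
)

# Multiply them together; prod() is equivalent to Reduce(`*`, ...) here
total <- prod(perms_per_person)
total
```

With more than five million combinations even in this smallest case, materializing all of them at once may be impractical, so it can be worth generating them lazily or sampling.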

Extract the first, second and last row that meets a criterion

I would like to know how to extract the last row that meets a criterion. I have seen the solution for getting the first one with the function "duplicated" in this question: How do I select the first row in an R data frame that meets certain criteria?.
However, is it possible to get the second or last row that meets a criterion?
I would like to loop over each Class (here I only show two) and select the first, second and last row that meet the criterion Weight >= 10. If no row meets the criterion, I want to get NA.
Finally I want to store the three values (first, second, and last row) in a list containing the values for each class.
Class Weight
1 A 20
2 A 15
3 B 10
4 B 23
5 A 11
6 B 12
7 B 11
8 A 25
9 A 7
10 B 3
The data.table package can help with this.
This is an edit based on David's comment, moved into the answers because his approach is the correct way to do this.
library(data.table)
DT <- as.data.table(db)
DT[Weight >= 10][, .SD[c(1, 2, .N)], by = Class]
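As a faster alternative, also from David, look at
hits <- DT[Weight >= 10] ; indx <- hits[, .I[c(1, 2, .N)], by = Class]$V1 ; hits[indx]
This creates the wanted index using .I and then subsets on those rows. Note that .I holds row numbers of the table it is evaluated on, which here is the filtered table, so the index has to be applied to the filtered table rather than to the original DT.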
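Since the question asks for a list with NA where no row qualifies, here is a base R sketch of that variant. It returns the Weight values rather than whole rows, and the helper name pick is invented for illustration.

```r
db <- data.frame(
  Class  = c("A", "A", "B", "B", "A", "B", "B", "A", "A", "B"),
  Weight = c(20, 15, 10, 23, 11, 12, 11, 25, 7, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first, second and last weight that is >= 10,
# with NA wherever no such value exists
pick <- function(w, threshold = 10) {
  hits <- w[w >= threshold]
  n <- length(hits)
  c(first  = hits[1],
    second = hits[2],
    last   = if (n > 0) hits[n] else NA)
}

# One named vector per class, collected in a list
res <- lapply(split(db$Weight, db$Class), pick)
res
```

Indexing past the end of a vector yields NA in R, which is what fills the missing first or second entry when fewer than two rows meet the criterion.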

R applying to a line

I have a data frame that contains multiple rows and multiple columns.
I have a character vector that contains the names of some of the columns in the data frame. The number of columns can vary.
For each line, for each of these columns, I have to identify if one of them is not NA. (basically any(!is.na(df[namecolumns])) for each line), to then do a subset for the ones that are TRUE.
Actually, any(!is.na(df[1,][namescolumns])) works well, but it's only for the first line.
I could easily do a for loop, which is my first reflex as a programmer and because it works for the first line, but I'm sure it's not the R way and that there is a way to do this with an "apply" (lapply, mapply, sapply, tapply or other), but I can't figure out which one and how.
Thank you.
Try using apply over the first dimension (rows):
apply(df, 1, function(x) any(!is.na(x[namescolumns])))
Because the function returns a single value per row, the result is a logical vector with one element per row, which you can use directly to subset df.
You can use a combination of lapply and Reduce
no.na.in.cols <- Reduce(`&`, lapply(colnames, function (name) !is.na(df[name])))
to get a logical vector that is TRUE for the rows with no NA values in any of the columns in colnames, which can in turn be used to subset the data.
df[no.na.in.cols,]
For example, given:
df <- data.frame(a = c(1,2,3,4,NA,6,7),
b = c(2,4,6,8,10,12,14),
c = c("one","two","three","four","five","six","seven"),
d = c("a",NA,"c","d","e","f","g")
)
colnames <- c("a","d")
You can get:
> df[Reduce(`&`, lapply(colnames, function (name) !is.na(df[name]))),]
a b c d
1 1 2 one a
3 3 6 three c
4 4 8 four d
6 6 12 six f
7 7 14 seven g
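The example above keeps rows where all of the listed columns are non-NA. The question asks for rows where at least one of them is non-NA, which could be sketched with rowSums() as below. The small data frame and the name cols are made up for illustration.

```r
# Hypothetical data: row 2 is NA in both columns of interest
df <- data.frame(a = c(1, NA, 3, NA),
                 d = c("x", NA, NA, "w"),
                 b = 1:4,
                 stringsAsFactors = FALSE)
cols <- c("a", "d")

# Count the non-NA entries per row across the chosen columns;
# a row is kept when that count is at least one
keep <- rowSums(!is.na(df[cols])) > 0

df[keep, ]
```

Replacing `&` with `|` in the Reduce call would express the same "at least one non-NA" condition.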
