call variables by name and column number in a data.frame - r

I have a data frame with columns I want to reorder. However, in different iterations of my script, the total number of columns may change.
> Fruit
Vendor A B C D E ... Apples Oranges
Otto   4 5 2 5 2 ...      3       4

Fruit2 <- Fruit[c(32, 33, 2:5)]
So instead of manually adapting the code each time (the positions 32 and 33 change), I'd like to do the following:
Fruit2<-Fruit[,c("Apples", "Oranges", 2:5)]
I tried a couple of variations but could not get it to do what I want. I know this is a simple syntax issue, but I haven't found the solution yet.
The idea is to mix column names with a numeric vector when referencing the columns for the new data frame. I don't want to spell out the whole vector as names because in reality there are 30 variables.

I'm not sure how your data is stored in R, so this is what I used:
Fruit <- data.frame("X1" = c("A", 4), "X2" = c("B", 5), "X3" = c("C", 2),
                    "X4" = c("D", 5), "X5" = c("E", 2), "X6" = c("Apples", 3),
                    "X7" = c("Oranges", 4),
                    row.names = c("Vendor", "Otto"), stringsAsFactors = FALSE)
X1 X2 X3 X4 X5 X6 X7
Vendor A B C D E Apples Oranges
Otto 4 5 2 5 2 3 4
Then use:
indexes <- which(Fruit[1,]%in%c("Apples","Oranges"))
Fruit2<- Fruit[,c(indexes,2:5)]
Fruit[1, ] references the Vendor row; %in% returns a logical vector, and which() converts that into the column indexes.
This gives:
> Fruit2
X6 X7 X2 X3 X4 X5
Vendor Apples Oranges B C D E
Otto 3 4 5 2 5 2
Make sure your data are not being stored as factors, otherwise this will not work. Or you could change the Vendor row to column names as per the comment above.
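A hedged sketch of that last suggestion (promoting the Vendor row to actual column names, then mixing names and positions), assuming the all-character Fruit built above:
# Promote the first (Vendor) row to column names, then drop it
names(Fruit) <- as.character(unlist(Fruit[1, ]))
Fruit <- Fruit[-1, ]   # values remain character in this toy setup
# Names and positions can now be mixed via match()
Fruit2 <- Fruit[, c(match(c("Apples", "Oranges"), names(Fruit)), 2:5)]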

The answer, as I found out, is to use the dplyr package.
It is very powerful.
The solution to the problem above would be:
Fruit2<-Fruit %>% select(Apples,Oranges,A:E)
This allows dynamic selection of columns, and ranges of columns, even if their positions change.
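For reference, a minimal reproducible sketch of that approach (assuming a hypothetical Fruit that has the fruit names as real column names rather than as a data row):
library(dplyr)
Fruit <- data.frame(Vendor = "Otto", A = 4, B = 5, C = 2, D = 5, E = 2,
                    Apples = 3, Oranges = 4)
# select() accepts bare names, name ranges and numeric positions, and they can be mixed
Fruit2 <- Fruit %>% select(Apples, Oranges, A:E)
Fruit2 <- Fruit %>% select(Apples, Oranges, 2:6)   # same result, using positions for A:E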

Related

data.table ifelse with multiple columns

I have a dataset with tens of columns that looks something like this:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3), time = c(1,2,3,1,2,3,1,2,3),
                 y1 = rnorm(9), y2 = rnorm(9), x = rnorm(9), xb = rnorm(9))
df
# id time y1 y2 x xb
# 1 1 1 -1.1184009 -1.07430118 0.61398523 -0.68343624
# 2 1 2 0.4347047 -0.53454071 -0.30716538 -1.02328242
# 3 1 3 0.2318315 -0.05854228 0.05169733 -0.22130149
# 4 2 1 1.2640080 2.07899296 -0.95918953 -0.35961156
# 5 2 2 -0.4374764 -0.25284854 -0.46251901 0.08630344
# 6 2 3 0.5042690 0.13322671 1.00881113 0.43807458
# 7 3 1 0.3672216 1.92995242 0.48708183 0.58206127
# 8 3 2 -1.5431709 0.53362731 1.17361087 -1.00932195
# 9 3 3 -1.4577268 0.23413541 -0.32399489 -0.91040641
I would like to modify my data frame using the following logic:
df <- setDT(df)[, y1 := ifelse(y1 > x, x, y1)]
df <- setDT(df)[, y2 := ifelse(y2 > xb, xb, y2)]
However, since I have many variables, I would like to do this in a single expression. In other words, I would like to apply this logic to multiple column pairs at once, i.e. y1 with x, y2 with xb, and so on.
I have tried the following, but it does not seem to work:
mod<-c("y1","y2")
max<-c("x","xb")
df2<-setDT(ppta)[,(mod):=ifelse(.(mod)>.(max),.(max),.(mod))]
Does anyone know what I am doing wrong, and how I can modify multiple columns with their respective partner columns at once?
Consider using pmin instead of your ifelse. You can try:
mod<-c("y1","y2")
max<-c("x","xb")
setDT(df)
df[,c(mod):=Map(pmin,mget(mod),mget(max))]
Explanation:
pmin takes two (or more) vectors and gives the minimum value for each element (equivalent of your ifelse(y1>x,x,y1));
mget returns a list of objects from their names. For instance mget(c("a", "b")) returns a list with the a and b objects (if they exist). Here it is used to retrieve the columns by name inside the data.table;
Map applies a function to several arguments element by element. Map(f, a, b) is equivalent to list(f(a[[1]], b[[1]]), f(a[[2]], b[[2]]), ...).
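Applied to a df built as in the question, a quick sanity check (a sketch; the check object is just a throwaway copy) that the Map/pmin call reproduces the pairwise ifelse logic:
library(data.table)
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3), time = c(1,2,3,1,2,3,1,2,3),
                 y1 = rnorm(9), y2 = rnorm(9), x = rnorm(9), xb = rnorm(9))
check <- transform(df, y1 = ifelse(y1 > x, x, y1), y2 = ifelse(y2 > xb, xb, y2))
mod <- c("y1", "y2")
max <- c("x", "xb")
setDT(df)
df[, c(mod) := Map(pmin, mget(mod), mget(max))]
all.equal(as.data.frame(df), check)   # should be TRUE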

Given set of column values, create data.frame with known number of rows

I'm trying to make datasets with a fixed number of rows to use as test datasets; however, I'm writing to a destination that requires known keys for each column. For this example, assume that the keys are lowercase letters, uppercase letters and numbers respectively.
I need to write a function which, given only the required number of rows, combines keys such that the number of combinations equals that number. Naturally there will be some impossible cases, such as prime numbers larger than the largest key set and values larger than the product of the key-set sizes.
A sample output dataset of 10 rows could look like the following:
data.frame(col1 = rep("a", 10),
col2 = rep(LETTERS[1:5], 2),
col3 = rep(1:2, 5))
col1 col2 col3
1 a A 1
2 a B 2
3 a C 1
4 a D 2
5 a E 1
6 a A 2
7 a B 1
8 a C 2
9 a D 1
10 a E 2
Note here that I had to manually specify the keys to get the desired number of rows. How can I arrange things so that R can do this for me?
Things I've already considered
optim - The equation I'm trying to solve is effectively x * y * z = n where all of them must be integers. optim doesn't seem to support that constraint
expand.grid and then subset - almost 500 million combinations, eats up all my memory - not an option.
lpSolve - Has the integer option, but only seems to support linear equations. Could use logs to make it linear, but then I can't use the integer option.
factorize from gmp to get factors - Thought about this, but I can't think of a way to distribute the prime factors back into the keys. EDIT: Maybe a bin packing problem?
For integer optimisation on a small scale you can use a grid search. Other possibilities are described here.
This should work for your example.
N <- 10
fr <- function(x) {
  x1 <- x[1]
  x2 <- x[2]
  x3 <- x[3]
  (x1 * x2 * x3 - N)^2
}
library(NMOF)
gridSearch(fr, list(seq(0, 5), seq(0, 5), seq(0, 5)))$minlevels
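Continuing from that code, a hedged sketch of how the counts found by gridSearch could be fed back into the data.frame construction from the question (for N = 10 the returned triple may be any ordering of 1, 2 and 5):
dims <- gridSearch(fr, list(seq(0, 5), seq(0, 5), seq(0, 5)))$minlevels
# One key set per column, each recycled to N rows; because the cycle lengths
# here have lcm equal to N, the N key combinations are all distinct
data.frame(col1 = rep(letters[seq_len(dims[1])], length.out = N),
           col2 = rep(LETTERS[seq_len(dims[2])], length.out = N),
           col3 = rep(seq_len(dims[3]), length.out = N))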
I am a bit reluctant, but we can work things out:
a1<-2
a2<-5
eval(parse(text = paste0("data.frame(col1 = rep(LETTERS[1],", a1 * a2,
                         "), col2 = rep(LETTERS[1:", a2, "],", a1,
                         "), col3 = rep(1:", a1, ",", a2, "))")))
col1 col2 col3
1 A A 1
2 A B 2
3 A C 1
4 A D 2
5 A E 1
6 A A 2
7 A B 1
8 A C 2
9 A D 1
10 A E 2
Is this something similar to what you are asking?
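As a hedged aside, not part of the original answer: the same data frame can be built without eval(parse()) by passing the counts straight to rep():
a1 <- 2
a2 <- 5
data.frame(col1 = rep(LETTERS[1], a1 * a2),
           col2 = rep(LETTERS[1:a2], a1),
           col3 = rep(1:a1, a2))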

Dividing one column in my dataset by a fixed value in R

I'm pretty new to R, and here's a (maybe simple) question:
I have big .dat datasets, and I add two of them together to get the sum of their values. The datasets look roughly like this:
#stud1
AMR X1 X2 X3...
1 3 4 10
2 4 5 2
#stud2
AMR X1 X2 X3
1 6 4 6
2 1 2 1
So what I did is
> studAll <- stud1 + stud2
and the result was:
# studAll:
AMR X1 X2 X3
2 9 8 16
4 5 7 3
My problem now is:
The AMR column is not meant to change, so my idea was to divide that column by 2 to get back to the original values. Or is there an easier solution than my idea?
If I understand your question correctly, you want to make a new data frame which adds all the columns except AMR?
You could do it the long way:
studAll$X1 <- stud1$X1 + stud2$X1
and repeat for each X...
Or this would work, if the AMR column is the same across all of them:
#set up
stud1 <- data.frame(c(1, 2), c(3, 4), c(4, 5), c(10, 2))
stud2 <- stud1
cols <- c("AMR", "X1", "X2", "X3")
colnames(stud1) <- cols
colnames(stud2) <- cols
#add them
studAll <- stud1 + stud2
#put the AMR column from stud1 back into studAll
#this assumes the AMR column is the same in all studs
studAll$AMR <- stud1$AMR
You could also select all columns other than AMR and add only those, as sketched below.
See for example here http://www.r-tutor.com/r-introduction/data-frame
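A minimal base-R sketch of that "everything except AMR" approach, assuming stud1 and stud2 share the same column names and row order:
value_cols <- setdiff(names(stud1), "AMR")   # every column except AMR
studAll <- stud1                             # keeps AMR from stud1 untouched
studAll[value_cols] <- stud1[value_cols] + stud2[value_cols]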

rowSums using an indirect variable (i.e. using a string variable to allocate the column numbers)

I'm still pretty much a newbie in R but enjoying the journey so far. I'm trying to group weekly columns together into quarters, and to find a more elegant solution than a separate assignment line for each quarter.
So I have created a set of variables holding the column ranges, e.g. Q1 <- 5:9, Q2 <- 10:22, and so forth. After reading the original data frame, I want to create a new one that has Q1 as a variable containing the total of columns 5:9, Q2 with the total of 10:22, etc. The problem is that rowSums doesn't like me using a variable to denote the range.
This is what I am trying to achieve, with sval containing the original weekly data, and qsval, containing the quarterly totals:
Q110 <- 5:9
Q210 <- 10:22
Q310 <- 23:35
Q410 <- 36:48
Q111 <- 49:61
Q211 <- 62:74
Q311 <- 75:87
Q411 <- 88:100
qsval <- sval[,c(1:4)] # Copying the first four columns from the weekly data
period <- c('Q110','Q210','Q310','Q410','Q111','Q211','Q311','Q411')
for (i in 1:8) {
assign(qsval$period[i], rowSums(sval,na.rm=F, get(period[i])))
}
Is this possible at all? The error message given is:
Error in rowSums(sval, na.rm = F, get(period[i])) : invalid 'dims'
Any advice would be much appreciated! Thank you.
In the absence of reproducible data, here's an example which hopefully you can adapt to your specific case:
set.seed(1) # just to make the random data reproducible
sval <- data.frame(replicate(6,sample(1:3)))
# X1 X2 X3 X4 X5 X6
#1 1 3 3 1 3 2
#2 3 1 2 3 1 3
#3 2 2 1 2 2 1
Qlist <- list(Q1=1:3,Q2=4:6)
qsval <- data.frame(lapply(Qlist, function(x) rowSums(sval[x]) ))
# Q1 Q2
#1 7 6
#2 6 7
#3 5 5
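Adapting that pattern to the column ranges in the question might look like this (a sketch; it assumes sval is the original weekly data frame and that columns 5:100 are numeric):
Qlist <- list(Q110 = 5:9,   Q210 = 10:22, Q310 = 23:35, Q410 = 36:48,
              Q111 = 49:61, Q211 = 62:74, Q311 = 75:87, Q411 = 88:100)
# Keep the first four identifier columns, then append one total per quarter
qsval <- cbind(sval[, 1:4],
               data.frame(lapply(Qlist, function(x) rowSums(sval[x], na.rm = FALSE))))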

How to ddply() without sorting?

I use the following code to summarize my data, grouped by Compound, Replicate and Mass.
summaryDataFrame <- ddply(reviewDataFrame, .(Compound, Replicate, Mass),
.fun = calculate_T60_Over_T0_Ratio)
An unfortunate side effect is that the resulting data frame is sorted by those fields. I would like to do this and keep Compound, Replicate and Mass in the same order as in the original data frame. Any ideas? I tried adding a "Sorting" column of sequential integers to the original data, but of course I can't include that in the .variables since I don't want to 'group by' that, and so it is not returned in the summaryDataFrame.
Thanks for the help.
This came up on the plyr mailing list a while back (raised by #kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:
#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
  col <- ".sortColumn"
  data[, col] <- 1:nrow(data)
  out <- fn(data, ...)
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
  out <- out[order(out[, col]), ]
  out[, col] <- NULL
  out
}
#Some sample data
d <- structure(list(g = c(2L, 2L, 1L, 1L, 2L, 2L), v = c(-1.90127112738315,
-1.20862680183042, -1.13913266070505, 0.14899803094742, -0.69427656843677,
0.872558638137971)), .Names = c("g", "v"), row.names = c(NA,
-6L), class = "data.frame")
#This one resorts
ddply(d, .(g), mutate, v=scale(v)) #does not preserve order of d
#This one does not
keeping.order(d, ddply, .(g), mutate, v=scale(v)) #preserves order of d
Please do read the thread for Hadley's notes about why this functionality may not be general enough to roll into ddply; the caveat likely applies in your case, as you are probably returning fewer rows with each piece.
Edited to include a strategy for more general cases
If ddply is outputting something that is sorted in an order you do not like you basically have two options: specify the desired ordering on the splitting variables beforehand using ordered factors, or manually sort the output after the fact.
For instance, consider the following data:
d <- data.frame(x1 = rep(letters[1:3],each = 5),
x2 = rep(letters[4:6],5),
x3 = 1:15,stringsAsFactors = FALSE)
using strings, for now. ddply will sort the output, which in this case will entail the default lexical ordering:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 a d 5
2 a e 7
3 a f 3
4 b d 17
5 b e 8
6 b f 15
7 c d 13
8 c e 25
9 c f 27
> ddply(d[sample(1:15,15),],.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 a d 5
2 a e 7
3 a f 3
4 b d 17
5 b e 8
6 b f 15
7 c d 13
8 c e 25
9 c f 27
If the resulting data frame isn't ending up in the "right" order, it's probably because you really want some of those variables to be ordered factors. Suppose that we really wanted x1 and x2 ordered like so:
d$x1 <- factor(d$x1, levels = c('b','a','c'),ordered = TRUE)
d$x2 <- factor(d$x2, levels = c('d','f','e'), ordered = TRUE)
Now when we use ddply, the resulting sort will be as we intend:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 b d 17
2 b f 15
3 b e 8
4 a d 5
5 a f 3
6 a e 7
7 c d 13
8 c f 27
9 c e 25
The moral of the story here is that if ddply is outputting something in an order you didn't intend, it's a good sign that you should be using ordered factors for the variables you're splitting on.
I eventually ended up adding an 'indexing' column to the original data frame; it consisted of two columns pasted together with sep = "_". Then I made another data frame containing only the unique members of the 'indexing' column and a counter running from 1 to the number of unique values. I did my ddply() on the data, which returned a sorted data frame. To get things back in the original order I merge()d the results data frame and the index data frame (naming the columns the same makes this easier). Finally, I sorted by the counter with order() and removed the extraneous columns.
Not an elegant solution, but one that works.
Thanks for the assist. It got me thinking in the right direction.
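For readers who want to see that workaround spelled out, a rough sketch (the choice of Compound and Mass for the pasted key, and the helper column names, are illustrative assumptions):
library(plyr)
# 1. Build an index key and record the order in which each key first appears
reviewDataFrame$idx <- paste(reviewDataFrame$Compound, reviewDataFrame$Mass, sep = "_")
indexDF <- data.frame(idx = unique(reviewDataFrame$idx),
                      origOrder = seq_along(unique(reviewDataFrame$idx)))
# 2. Summarise as before, carrying the key through the split
summaryDataFrame <- ddply(reviewDataFrame, .(Compound, Replicate, Mass, idx),
                          .fun = calculate_T60_Over_T0_Ratio)
# 3. Merge the original positions back in, re-sort, and drop the helper columns
summaryDataFrame <- merge(summaryDataFrame, indexDF, by = "idx")
summaryDataFrame <- summaryDataFrame[order(summaryDataFrame$origOrder), ]
summaryDataFrame$idx <- NULL
summaryDataFrame$origOrder <- NULL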
