I have a dataset that looks like this:
x y z
1. 1 0.2
1.1 1 1.5
1.2 1 3.
1. 2 8.1
1.1 2 1.0
1.2 2 0.6
What I would like is to organise the dataset first by x in increasing order, then by y, such that:
x y z
1. 1 0.2
1. 2 8.1
1.1 1 1.5
1.1 2 1.
1.2 1 3.
1.2 2 0.6
I know that the apply, mapply, tapply, etc. functions reorganise datasets, but I must admit that I don't really understand the differences between them, nor which one to use and when.
Thank you for your suggestions.
You can order your data using the order function. There is no need for any apply family function.
Assuming your data is in a data.frame called df:
df[order(df$x, df$y), ]
x y z
1 1.0 1 0.2
4 1.0 2 8.1
2 1.1 1 1.5
5 1.1 2 1.0
3 1.2 1 3.0
6 1.2 2 0.6
See ?order for more help.
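As a minimal sketch of how order() handles several keys, the example below rebuilds the question's data (the values are copied from the question; the names asc and mixed are only illustrative) and also shows one common way to sort a numeric key in decreasing order by negating it:

```r
# Rebuild the question's data set (values copied from the question).
df <- data.frame(x = c(1, 1.1, 1.2, 1, 1.1, 1.2),
                 y = c(1, 1, 1, 2, 2, 2),
                 z = c(0.2, 1.5, 3, 8.1, 1.0, 0.6))

# Ascending by x, ties broken by ascending y (same call as in the answer).
asc <- df[order(df$x, df$y), ]

# Ascending by x, but descending by y: negate the numeric key.
mixed <- df[order(df$x, -df$y), ]

# Row names keep the original row positions; reset them if that is unwanted.
rownames(asc) <- NULL
```

Negating a key only works for numeric columns; for other types, see the decreasing and method arguments described in ?order.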
On a side note: reshaping in general refers to changing the shape of a data.frame, e.g. converting it from wide to tall format. This is not what is required here.
You can also use the arrange() function in plyr for this. Wrap any variable that you want sorted in descending order in desc().
> library(plyr)
> dat <- head(ChickWeight)
> arrange(dat,weight,Time)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
This is the fastest way to do this that's still readable, if speed matters in your application. Benchmarks here:
How to sort a dataframe by column(s)?
I'm working on some code where I need to find the maximum value over a set of columns and then update that maximum value. Consider this toy example:
library(data.table)
test <- data.table(thing1=c('AAA','BBB','CCC','DDD','EEE'),
A=c(9,5,4,2,5),
B=c(2,7,2,6,3),
C=c(6,2,5,4,1),
ttl=c(1,1,3,2,1))
where the resulting data.table looks like this:
thing1 A B C ttl
AAA 9 2 6 1
BBB 5 7 2 1
CCC 4 2 5 3
DDD 2 6 4 2
EEE 5 3 1 1
The goal is to find the column (A, B, or C) with the maximum value and replace that value by the current value minus 0.1 times the value in the ttl column (i.e. new_value=old_value - 0.1*ttl). The other columns (not containing the maximum value) should remain the same. The resulting DT should look like this:
thing1 A B C ttl
AAA 8.9 2 6 1
BBB 5 6.9 2 1
CCC 4 2 4.7 3
DDD 2 5.8 4 2
EEE 4.9 3 1 1
The "obvious" way of doing this is to write a for loop and loop through each row of the DT. That's easy enough to do and is what the code I'm adapting this from did. However, the real DT is much larger than my toy example and the for loop takes some time to run, which is why I'm trying to adapt the code to take advantage of vectorization and get rid of the loop.
Here's what I have so far:
test[,max_position:=names(.SD)[apply(.SD,1,function(x) which.max(x))],.SDcols=(2:4)]
test[,newmax:=get(max_position)-ttl*.1,by=1:nrow(test)]
which produces this DT:
thing1 A B C ttl max_position newmax
AAA 9 2 6 1 A 8.9
BBB 5 7 2 1 B 6.9
CCC 4 2 5 3 C 4.7
DDD 2 6 4 2 B 5.8
EEE 5 3 1 1 A 4.9
The problem comes in assigning the value of the newmax column back to where it needs to go. I naively tried this, along with some other things, which tells me that "'max_position' not found":
test[,(max_position):=newmax,by=1:nrow(test)]
It's straightforward to solve the problem by reshaping the DT, which is the solution I have in place for now (see below), but I worry that with my full DT two reshapes will be slow as well (though presumably better than the for loop). Any suggestions on how to make this work as intended?
Reshaping solution, for reference (gather() and spread() come from the tidyr package):
test[,max_position:=names(.SD)[apply(.SD,1,function(x) which.max(x))],.SDcols=(2:4)]
test[,newmax:=get(max_position)-ttl*.1,by=1:nrow(test)]
test <- setDT(gather(test,idgroup,val,c(A,B,C)))
test[,maxval:=max(val),by='thing1']
test[val==maxval,val:=newmax][,maxval:=NULL]
test <- setDT(spread(test,idgroup,val))
With the OP's code, replace can work
test[, (2:4) := replace(.SD, which.max(.SD), max(.SD, na.rm = TRUE) - 0.1 * ttl),
by = 1:nrow(test),.SDcols = 2:4]
-output
> test
thing1 A B C ttl
1: AAA 8.9 2.0 6.0 1
2: BBB 5.0 6.9 2.0 1
3: CCC 4.0 2.0 4.7 3
4: DDD 2.0 5.8 4.0 2
5: EEE 4.9 3.0 1.0 1
In base R, this may be faster with row/column indexing
test1 <- as.data.frame(test)
m1 <- cbind(seq_len(nrow(test1)), max.col(test1[2:4], "first"))
test1[2:4][m1] <- test1[2:4][m1] - 0.1 * test1$ttl
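The row/column indexing idea can be sketched on its own in base R. The snippet below rebuilds the toy data as a plain data frame (the names vals and idx are illustrative, not from the answer) and uses a two-column index matrix, where each row of idx addresses exactly one cell:

```r
# Toy data from the question, as a plain data frame.
test1 <- data.frame(thing1 = c("AAA", "BBB", "CCC", "DDD", "EEE"),
                    A = c(9, 5, 4, 2, 5),
                    B = c(2, 7, 2, 6, 3),
                    C = c(6, 2, 5, 4, 1),
                    ttl = c(1, 1, 3, 2, 1))

vals <- as.matrix(test1[2:4])             # numeric matrix of columns A, B, C
idx  <- cbind(seq_len(nrow(vals)),        # row number of each target cell
              max.col(vals, "first"))     # column number of each row maximum

vals[idx] <- vals[idx] - 0.1 * test1$ttl  # one update per row, no explicit loop
test1[2:4] <- vals                        # write the updated columns back
```

With ties.method "first", a row whose maximum occurs in several columns has only the leftmost one updated; whether that is the desired behaviour for tied rows is an assumption here, since the question does not say.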
I have UTM coordinate values from GPS collared leopards, and my analysis gets messed up if there are any points that are identical. What I want to do is add a 1 to the end of the decimal string to make each value unique.
What I have:
> View(coords)
> coords
X Y
1 623190.9 4980021
2 618876.6 4980729
3 618522.7 4980896
4 618522.7 4980096
5 618522.7 4980096
6 622674.1 4976161
I want something like this, or something that will make each number unique (doesn't have to be a +1)
> coords
X Y
1 623190.9 4980021
2 618876.6 4980729
3 618522.7 4980896
4 618522.71 4980096.1
5 618522.72 4977148.2
6 622674.1 4976161
I've looked at existing questions and got this to work for a simulated data set, but not when a value occurs more than twice.
DF <- data.frame(A=c(5,5,6,6,7,7), B=c(1, 1, 2, 2, 2, 3))
> View(DF)
A B
1 5 1
2 5 1
3 6 2
4 6 2
5 7 2
6 7 3
DF <- do.call(rbind, lapply(split(DF, list(DF$A, DF$B)),
function(x) {
x$A <- x$A + seq(0, by=0.1, length.out=nrow(x))
x$B <- x$B + seq(0, by=0.1, length.out=nrow(x))
x
}))
> View(DF)
A B
5.1.1 5.0 1.0
5.1.2 5.1 1.1
6.2.3 6.0 2.0
6.2.4 6.1 2.1
7.2 7.0 2.0
7.3 7.0 3.0
The '2s' in column B don't continue to get a decimal added when there are more than two of them. I also had a problem accomplishing this when the number had more than 4 digits (i.e. XXXXX vs XX). There's probably a better way to do this, but I would love help on adding these decimals, and possibly altering them in the original data frame, which has 12 columns of various data.
It is easier to use make.unique
DF[] <- lapply(DF, function(x) as.numeric(make.unique(as.character(x))))
DF
# A B
#1 5.0 1.0
#2 5.1 1.1
#3 6.0 2.0
#4 6.1 2.1
#5 7.0 2.2
#6 7.1 3.0
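One caveat with this approach: make.unique() appends ".1", ".2", ... to the text of a duplicate, so a value that already has a decimal part (such as 618522.7) would become "618522.7.1", which as.numeric() turns into NA. A possible alternative, sketched here under the assumption that a small offset of 0.01 per repeat is acceptable, is to count occurrences with ave() and add a multiple of that step (nudge_duplicates and step are hypothetical names):

```r
# Coordinates copied from the question.
coords <- data.frame(
  X = c(623190.9, 618876.6, 618522.7, 618522.7, 618522.7, 622674.1),
  Y = c(4980021, 4980729, 4980896, 4980096, 4980096, 4976161)
)

# Hypothetical helper: leave the first occurrence of a value alone and
# shift the 2nd, 3rd, ... occurrence by 1, 2, ... steps.
nudge_duplicates <- function(x, step = 0.01) {
  occurrence <- ave(seq_along(x), x, FUN = seq_along)  # 1, 2, 3, ... within each value
  x + (occurrence - 1) * step
}

# Apply to every column, keeping the data frame structure.
coords[] <- lapply(coords, nudge_duplicates)
```

This does not guard against a shifted value colliding with another value already present in the column, so anyDuplicated() is worth checking afterwards. For a data frame with other columns, apply the helper only to the coordinate columns.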
I am facing a problem with the amount of time needed to run my code. Basically, I have several columns and a key value in the last column (the mean in the reproducible example). I want each entry to become 1 when it is below the key value and 2 when it is above.
Is there an easier way to do this?
a <- c(1,3,5,6,4)
b <- c(10,4,24,5,3)
df <- data.frame (a,b)
df$mean <- rowMeans (df)
for (i in 1:5){
df[i,1:2] [df[i,1:2]<df$mean[i]] <- 1
df[i,1:2] [df[i,1:2]>df$mean[i]] <- 2
}
Thank you in advance
You can simply do,
df[1:2] <- (df[1:2] > df$mean) + 1 # removed as.integer as per @akrun's comment
Which gives,
a b mean
1 1 2 5.5
2 1 2 3.5
3 1 2 14.5
4 2 1 5.5
5 2 1 3.5
Prefer vectorized operations over explicit loops in R whenever you can!
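To see why the one-liner matches the loop, here is a small check, assuming the question's data. The comparison df[1:2] > df$mean is evaluated column by column, and because df$mean has one element per row, each entry is compared with the mean of its own row (looped and vectorised are illustrative names):

```r
a <- c(1, 3, 5, 6, 4)
b <- c(10, 4, 24, 5, 3)
df <- data.frame(a, b)
df$mean <- rowMeans(df)

# Row-by-row version, written against the untouched original values.
looped <- df
for (i in seq_len(nrow(df))) {
  above <- unlist(df[i, 1:2]) > df$mean[i]  # compare this row with its own mean
  looped[i, 1:2] <- ifelse(above, 2, 1)
}

# Vectorised version: TRUE/FALSE + 1 gives 2/1 in a single step.
vectorised <- df
vectorised[1:2] <- (df[1:2] > df$mean) + 1
```

Both versions code an entry that equals its row mean as 1; the question does not say how ties should be handled, so that choice is an assumption.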
Alternative solution using mutate_each from dplyr (note that mutate_each() and funs() are deprecated in recent dplyr releases):
df %>% mutate_each(funs(ifelse(mean>.,1,2)), 1:2)
Also gives
a b mean
1 1 2 5.5
2 1 2 3.5
3 1 2 14.5
4 2 1 5.5
5 2 1 3.5
I am new to R. I have a question regarding my data set.
S.NO Type Measurements
1 1 2.1
2 2 3.3
3 2 3.1
4 3 2.7
5 3 2.6
6 3 4.5
7 2 1.1
8 3 2.2
Suppose we have measurements in column 3 and their types in column 2. Each measurement is of type 1, type 2 or type 3. If we are interested in only the measurements corresponding to, say, type 2, how can we get them in R?
I look forward to your response.
This is a basic subsetting question covered in most introductory R guides:
with(mydf, mydf[Type == 2, ])
# S.NO Type Measurements
# 2 2 2 3.3
# 3 3 2 3.1
# 7 7 2 1.1
with(mydf, mydf[Type == 2, "Measurements"])
# [1] 3.3 3.1 1.1
You can also look at the subset function:
subset(mydf, subset = Type == 2, select = "Measurements")
# Measurements
# 2 3.3
# 3 3.1
# 7 1.1
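For a self-contained version, the sketch below rebuilds the question's table as mydf (the name used in the answer; the values are copied from the question) and shows plain logical indexing, plus split() for getting the measurements of every type at once:

```r
# Data from the question.
mydf <- data.frame(S.NO = 1:8,
                   Type = c(1, 2, 2, 3, 3, 3, 2, 3),
                   Measurements = c(2.1, 3.3, 3.1, 2.7, 2.6, 4.5, 1.1, 2.2))

# Logical indexing: keep the measurements whose Type equals 2.
type2 <- mydf$Measurements[mydf$Type == 2]

# One vector of measurements per type, in a named list ("1", "2", "3").
by_type <- split(mydf$Measurements, mydf$Type)
```

by_type[["2"]] then holds the same values as type2.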
# make some data
testData=data.frame(measurement=1:10,
Type=sample(1:3,10,replace=TRUE))
# fetch only type 2
testData[testData$Type==2,]
# now only the measurements
testData[testData$Type==2,"measurement"]
I'm working with a large data frame that I want to pivot, so that variables in a column become rows across the top.
I've found the reshape package very useful in such cases, except that the cast function defaults to fun.aggregate=length. Presumably this is because I'm performing these operations by "case" and the number of variables measured varies among cases.
I would like to pivot so that missing variables are denoted as "NA"s in the pivoted data frame.
So, in other words, I want to go from a molten data frame like this:
Case | Variable | Value
1 1 2.3
1 2 2.1
1 3 1.3
2 1 4.3
2 2 2.5
3 1 1.8
3 2 1.9
3 3 2.3
3 4 2.2
To something like this:
Case | Variable 1 | Variable 2 | Variable 3 | Variable 4
1 2.3 2.1 1.3 NA
2 4.3 2.5 NA NA
3 1.8 1.9 2.3 2.2
The code dcast(data,...~Variable) again defaults to fun.aggregate=length, which does not preserve the original values.
Thanks for your help, and let me know if anything is unclear!
It is just a matter of including all of the variables in the cast call. Reshape expects the Value column to be called value, so it throws a warning, but it still works fine. It was using fun.aggregate=length because Case was missing from the formula, so it was aggregating over the values in Case.
Try: cast(data, Case~Variable)
data <- data.frame(Case=c(1,1,1,2,2,3,3,3,3),
Variable=c(1,2,3,1,2,1,2,3,4),
Value=c(2.3,2.1,1.3,4.3,2.5,1.8,1.9,2.3,2.2))
cast(data,Case~Variable)
Using Value as value column. Use the value argument to cast to override this choice
Case 1 2 3 4
1 1 2.3 2.1 1.3 NA
2 2 4.3 2.5 NA NA
3 3 1.8 1.9 2.3 2.2
Edit: in response to the comment from @Jon: what do you do if there is one more variable in the data frame?
data <- data.frame(expt=c(1,1,1,1,2,2,2,2,2),
func=c(1,1,1,2,2,3,3,3,3),
variable=c(1,2,3,1,2,1,2,3,4),
value=c(2.3,2.1,1.3,4.3,2.5,1.8,1.9,2.3,2.2))
cast(data,expt+variable~func)
expt variable 1 2 3
1 1 1 2.3 4.3 NA
2 1 2 2.1 NA NA
3 1 3 1.3 NA NA
4 2 1 NA NA 1.8
5 2 2 NA 2.5 1.9
6 2 3 NA NA 2.3
7 2 4 NA NA 2.2
Here is one solution. It does not use the package or function you mention, but it could be of use. Suppose your data frame is called df:
cases <- unique(df$Case)
vars <- sort(unique(df$Variable))
M <- matrix(NA,
nrow = length(cases),
ncol = length(vars)+1,
dimnames = list(NULL,c('Case',paste('Variable',vars))))
irow <- match(df$Case,cases)
icol <- match(df$Variable,vars) + 1 # column positions follow the sorted header order
ientry <- irow + (icol-1)*nrow(M) # linear (column-major) index of each cell
M[ientry] <- df$Value
M[,1] <- cases
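Another base R sketch, not taken from the answers above: tapply() can do the same pivot when every Case/Variable pair occurs at most once, since combinations with no data are filled with NA. The names molten and wide are illustrative, and the data is copied from the question:

```r
# Molten data from the question.
molten <- data.frame(Case     = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                     Variable = c(1, 2, 3, 1, 2, 1, 2, 3, 4),
                     Value    = c(2.3, 2.1, 1.3, 4.3, 2.5, 1.8, 1.9, 2.3, 2.2))

# One cell per Case/Variable pair; take the single value found in each cell.
wide <- tapply(molten$Value,
               list(Case = molten$Case, Variable = molten$Variable),
               function(v) v[1])
```

The result is a matrix with the cases as row names and the variables as column names. If a pair occurred more than once, v[1] would silently keep only the first value, so this relies on the one-value-per-cell assumption.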
To avoid the warning message, you could subset the data frame according to another variable, e.g. a categorical variable with three levels a, b and c. In your current data, category a has 70 cases, b has 80 and c has 90, so the cast function doesn't know how to aggregate them.
Hope this helps.