In R I have a data frame whose observations are described by several values, one of which is a factor. I have sorted the dataset by this factor and would like to add a column that numbers the observations within each level of the factor, e.g.
factor obsnum
a 1
a 2
a 3
b 1
b 2
b 3
b 4
c 1
c 2
...
In SAS I do it with something like:
data logs.full;
set logs.full;
count + 1;
by cookie;
if first.cookie then count = 1;
run;
How can I achieve that in R?
Thanks,
Use rle (run length encoding) and sequence:
x <- c("a", "a", "a", "b", "b", "b", "b", "c", "c")
data.frame(
  x = x,
  obsnum = sequence(rle(x)$lengths)
)
x obsnum
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
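Note that rle() works on plain atomic vectors, so a column stored as a factor would need as.character() first. A minimal sketch of an alternative in base R, assuming a data frame `dfr` with a factor column `grp` (both names are made up for illustration), uses ave() with seq_along. It should also number the rows correctly when the data is not sorted by group:

```r
# Hypothetical example data; `dfr` and `grp` are illustrative names.
dfr <- data.frame(grp = factor(c("a", "a", "a", "b", "b", "b", "b", "c", "c")))

# ave() applies FUN within each level of grp and returns the results
# in the original row order, so this yields a running index per group.
dfr$obsnum <- ave(seq_along(dfr$grp), dfr$grp, FUN = seq_along)
dfr
```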
Here is the ddply() solution
dataset <- data.frame(x = c("a", "a", "a", "b", "b", "b", "b", "c", "c"))
library(plyr)
ddply(dataset, .(x), function(z) {
  data.frame(obsnum = seq_along(z$x))
})
One solution using base R, assuming your data is in a data.frame named dfr:
dfr$cnt <- do.call(c, lapply(unique(dfr$factor), function(curf) {
  seq(sum(dfr$factor == curf))
}))
There are likely better solutions (e.g. employing package plyr and its ddply), but it should work.
I have a data frame in which the same code appears on several rows with different values. I need to sum the values per code and keep the original values as a separate column.
data <- data.frame(cod = c("A", "B", "C", "A", "A", "B"),
                   Values = c(3, 4, 5, 1, 2, 5))
data
cod Values
A 3
B 4
C 5
A 1
A 2
B 5
I expect the following, where the original Values column is kept the same and the group sum is added as a new column, Values2:
> data2
cod Values Values2
A 3 6
B 4 9
C 5 5
A 1 6
A 2 6
B 5 9
An option with base R would be
data$Values2 <- with(data, ave(Values, cod, FUN = sum))
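As a sketch of how this behaves on the example data (assuming the column is named Values, as in the expected output), ave() computes the sum within each cod and repeats it on every row of that group:

```r
data <- data.frame(cod = c("A", "B", "C", "A", "A", "B"),
                   Values = c(3, 4, 5, 1, 2, 5))

# ave() returns a vector as long as Values, holding each row's group sum.
data$Values2 <- with(data, ave(Values, cod, FUN = sum))
data
```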
firstVector <- c("A", "B", "C", "D", "E")
secondVector <- c(1, 2, 3, 4, 5)
thirdVector <- c("a", "b", "c", "d", "e")
myDataFrame <- data.frame(firstVector, secondVector, thirdVector)
How do I extract rows 3 and 4 from my data frame? I want to print rows 3 and 4 so that the output looks like this:
firstVector secondVector thirdVector
3 C 3 c
4 D 4 d
You can subset your dataframe like this [rows,columns]:
myDataFrame[c(3,4),]
In your case you want rows 3 and 4, hence the vector c(3,4). You can add more row numbers to the vector to select more rows, for example c(1,2,3,12).
If you leave an argument empty, the whole dimension is returned. In your example you subset the rows and return all the columns.
It's the same for columns:
myDataFrame[c(3,4),c(1,2)]
Here you subset rows 3 and 4 and columns 1 and 2.
Another way to do this is with the colon operator (:).
c(1:4) means from 1 to 4, i.e. c(1, 2, 3, 4).
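For instance, a short sketch using the colon operator on the data frame from the question (the name `rows34` is introduced only for illustration):

```r
myDataFrame <- data.frame(
  firstVector = c("A", "B", "C", "D", "E"),
  secondVector = c(1, 2, 3, 4, 5),
  thirdVector = c("a", "b", "c", "d", "e")
)

# 3:4 expands to c(3, 4); leaving the column slot empty keeps every column.
rows34 <- myDataFrame[3:4, ]
rows34

# Rows 3 to 4 and columns 1 to 2.
myDataFrame[3:4, 1:2]
```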
Hope this helps
I have a data.frame ystr:
v1
1 a
2 B
3 B
4 C
5 d
6 a
7 B
8 D
I want to find the start and end of each group of letters in CAPS so my output would be:
groupId startPos endPos
1 1 2 4
2 2 7 8
I was able to do it with a for loop by looking at each element in order and comparing it to the one before as follows:
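currentGroupId <- 0
startCounter <- 0   # 1 while we are inside a group of upper-case letters
mygroups <- data.frame(groupId = integer(0), startPos = integer(0), endPos = integer(0))
for (i in seq_len(nrow(ystr))) {
  if (grepl("[[:upper:]]", ystr[i, 1])) {
    if (startCounter == 0) {
      currentGroupId <- currentGroupId + 1
      startCounter <- 1
      mygroups[currentGroupId, ] <- c(currentGroupId, i, 0)
    }
  } else if (startCounter == 1) {
    startCounter <- 0
    mygroups[currentGroupId, 3] <- i - 1
  }
}
# close a group that runs up to the last row
if (startCounter == 1) mygroups[currentGroupId, 3] <- nrow(ystr)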
Is there a simple way of doing this in R?
This might be similar to Mark start and end of groups but I could not figure out how it would apply in this case.
You can do this by calculating the run-length encoding (rle) of the binary indicator for whether your data is upper case, as determined by whether the data is equal to itself when it's converted to upper case.
with(rle(d[,1] == toupper(d[,1])),
     data.frame(start = cumsum(lengths)[values] - lengths[values] + 1,
                end = cumsum(lengths)[values]))
# start end
# 1 2 4
# 2 7 8
You can see other examples of the use of rle by looking at Stack Overflow answers using this command.
Data:
d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
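To see why the rle approach works, here is a sketch that breaks the one-liner into its intermediate steps (the variable names `is_upper`, `runs`, `ends` and `groups` are introduced only for illustration):

```r
d <- data.frame(v1 = c("a", "B", "B", "C", "d", "a", "B", "D"))

# TRUE where the letter is upper case.
is_upper <- as.character(d$v1) == toupper(as.character(d$v1))

# Runs of identical values: lengths 1 3 2 2, values FALSE TRUE FALSE TRUE.
runs <- rle(is_upper)

# The cumulative sum of the run lengths gives the last position of each run.
ends <- cumsum(runs$lengths)

# Keep only the upper-case runs; a run starts at end - length + 1.
groups <- data.frame(start = ends[runs$values] - runs$lengths[runs$values] + 1,
                     end   = ends[runs$values])
groups
```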
You can use the IRanges package. The idea is to treat each upper-case position as a range of width 1 and then merge consecutive ranges with reduce().
d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
d.idx <- which(d$v1 %in% LETTERS)
d.idx
# [1] 2 3 4 7 8
library(IRanges)
d.idx.ir <- IRanges(d.idx, d.idx)
reduce(d.idx.ir)
# IRanges of length 2
# start end width
# [1] 2 4 3
# [2] 7 8 2
I have a 16*3 data frame. Its elements are characters, e.g. A, B, C... How can I assign them values, e.g. A = 2, B = 5, C = 4, in R?
You can map the values from the vector you created:
relevel <- function(df, levelmap) {
  df[] <- lapply(df, function(x) levelmap[as.character(x)])
  df
}
The function subsets the values based on the map vector.
Example
df <- data.frame(x=c("A", "C", "C", "A"), y=c("B", "C", "B", "A"), z=c("A", "B", "C", "A"))
df
x y z
1 A B A
2 C C B
3 C B C
4 A A A
newlevels <- c(A=2,B=5,C=4)
relevel(df, newlevels)
x y z
1 2 5 2
2 4 4 5
3 4 5 4
4 2 2 2
The newlevels vector is a special vector called a named vector. It's very helpful as it can be referenced by both its names and its indices. newlevels["A"] and newlevels[1] both return the same output. This simplifies what in other languages would require hash tables or other lookup arrays.
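A brief sketch of that lookup behaviour, using the same newlevels vector (the names `codes` and `mapped` are made up for illustration):

```r
newlevels <- c(A = 2, B = 5, C = 4)

# Lookup by name and by position return the same element.
newlevels["A"]
newlevels[1]

# Indexing with a character vector looks up every entry at once,
# which is what the function above relies on.
codes <- c("A", "C", "C", "A")
mapped <- unname(newlevels[codes])
mapped
```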
I have a DF where I want to add a new variable called "B" into the 2nd position.
A C D
1 1 5 2
2 3 3 7
3 6 2 3
4 6 4 8
5 1 1 2
Anyone have an idea?
The easiest way would be to add the columns you want and then reorder them:
dat$B <- 1:5
newdat <- dat[, c("A", "B", "C", "D")]
Another way:
newdat <- cbind(dat[1], B=1:5, dat[,2:3])
If you're concerned about overhead, perhaps a data.table solution? (With help from this answer):
library(data.table)
dattable <- data.table(dat)
dattable[,B:=1:5]
setcolorder(dattable, c("A", "B", "C", "D"))
dat$B <- 1:5
pos <- which(names(dat) == "A")
# parentheses matter: `:` binds tighter than `-`
ind <- c(1:pos, ncol(dat), (pos + 1):(ncol(dat) - 1))
dat <- dat[, ind]
Create the variable at the end of the data.frame and then use an index vector that specifies how to reorder the columns. ind is just a vector of column numbers.
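If this reordering is needed more than once, it could be wrapped in a small function. The following is only a sketch under stated assumptions: `move_last_after` is a hypothetical helper name, and it assumes the new column has already been appended as the last column.

```r
# Hypothetical helper: move the last column of df to just after column `after`.
move_last_after <- function(df, after) {
  pos <- which(names(df) == after)
  n <- ncol(df)
  # `after` must name exactly one column other than the last one.
  stopifnot(length(pos) == 1, pos < n)
  # Columns up to `after`, then the last column, then the remaining ones.
  # seq_len() and setdiff() stay correct when nothing remains after pos.
  ind <- c(seq_len(pos), n, setdiff(seq_len(n - 1), seq_len(pos)))
  df[, ind, drop = FALSE]
}

dat <- data.frame(A = c(1, 3, 6, 6, 1),
                  C = c(5, 3, 2, 4, 1),
                  D = c(2, 7, 3, 8, 2))
dat$B <- 1:5
newdat <- move_last_after(dat, "A")
names(newdat)
```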