R: how can I get the sequential row IDs each time a number appears again?

For instance, I have a data frame like this. No number is duplicated within a row, and the numbers in each row are sorted.
W1 W2 W3 W4
1 1 3 4 7
2 4 5 6 7
3 1 2 5 8
4 2 5 9 10
5 4 7 10 13
6 1 2 6 9
I want the row IDs in which each of 1, 2, 3, ... appears: 1 is in rows 1, 3, 6; 2 is in rows 3, 4, 6; 3 is in row 1 only; and so on. So my result would look like this (the number first, then its row IDs):
1 1 3 6
2 3 4 6
3 1
4 1 2 5
5 2 3 4
......

I would do:
split(t(row(df)), unlist(t(df)))
And if you need empty levels to show up:
split(t(row(df)), factor(unlist(t(df)), 1:max(df)))
This should be a lot faster than looping, for example via:
lapply(1:max(df), function(i) which(rowSums(df == i) > 0))
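For reference, here is a small self-contained sketch that rebuilds the question's data and runs both approaches side by side. The name df is an assumption, since the question never names its data frame.

```r
# Rebuild the example data frame from the question (the name df is assumed).
df <- data.frame(W1 = c(1, 4, 1, 2, 4, 1),
                 W2 = c(3, 5, 2, 5, 7, 2),
                 W3 = c(4, 6, 5, 9, 10, 6),
                 W4 = c(7, 7, 8, 10, 13, 9))

# Vectorised: pair every cell's row index with its value, then group by value.
by_split <- split(t(row(df)), unlist(t(df)))

# Loop version: for each value i, find the rows that contain it.
by_loop <- lapply(1:max(df), function(i) which(rowSums(df == i) > 0))

by_split[["1"]]  # row IDs containing 1
by_split[["5"]]  # row IDs containing 5
```

The split result is keyed by the values that actually occur, while the lapply result has one entry for every integer up to max(df), including values that never appear.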

Related

Change the order of numerically named columns in r

If I have a dataframe like the one below which has numerical column names
example = data.frame(`1`=c(1,8,3,9), `2`=c(3,2,3,3), `3`=c(5,2,5,4), `4`=c(1,2,3,4), `5`=c(2,5,7,8), check.names = FALSE)
Which looks like this:
1 2 3 4 5
1 3 5 1 2
8 2 2 2 5
3 3 5 3 7
9 3 4 4 8
And I want to arrange it so that the column names start with three and proceed through five and back to one, like this:
3 4 5 1 2
5 1 2 1 3
2 2 5 8 2
5 3 7 3 3
4 4 8 9 3
I know how to rearrange the position of a single column in a dataset, but I'm not sure how to do this with more than one column in this particular order.
We can index by column position, using : to build each range of positions and c to concatenate them:
example[c(3:5, 1:2)]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
As the column names are all numeric, you can convert them to numeric and use the result as a column index. Note that this works here only because each name equals its column position.
v1 <- as.numeric(names(example))
example[c(v1[3:5], v1[1:2])]
Or simply do
example[c(names(example)[3:5], names(example)[1:2])]
Or another way is with head and tail
example[c(tail(names(example), 3), head(names(example), 2))]
Data:
example <- data.frame(`1`=c(1,8,3,9), `2`=c(3,2,3,3),
`3`=c(5,2,5,4), `4`=c(1,2,3,4), `5`=c(2,5,7,8), check.names = FALSE)
R does not make it easy to create columns with numbers as names. If you do manage to create them, you can use match to find the positions of the column names in the order you want.
example[match(c(3:5, 1:2), names(example))]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
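The answers above all hard-code the split point. As a rough sketch, the same rotation can be wrapped in a small function that takes the name of the column that should come first. rotate_cols is a hypothetical helper written for illustration, not a base R function.

```r
example <- data.frame(`1` = c(1, 8, 3, 9), `2` = c(3, 2, 3, 3),
                      `3` = c(5, 2, 5, 4), `4` = c(1, 2, 3, 4),
                      `5` = c(2, 5, 7, 8), check.names = FALSE)

# rotate_cols() is a hypothetical helper, not part of base R.
rotate_cols <- function(d, first) {
  start <- match(first, names(d))  # position of the column that should lead
  stopifnot(!is.na(start))         # fail early if the name is not found
  idx <- seq_along(d)
  # Columns from `start` to the end, followed by the columns before `start`.
  d[c(idx[idx >= start], idx[idx < start])]
}

names(rotate_cols(example, "3"))
```

Passing the name as a character string avoids the ambiguity between a column called "3" and the third column position.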

Transforming a looping factor variable into a sequence of numerics

I have a factor variable with 6 levels, which simplified looks like:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 1 1 1 2 2 2 2... 1 1 1 2 2... (with n = 78)
Note that each number is repeated mostly, but not always, three times.
I need to transform this variable into the following pattern:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8...
where the count keeps ascending across each repetition of the 6 levels.
Is there any way / any function that lets me do that?
Sorry for my bad description!
Assuming you have a numeric vector that represents the simplified version you posted, e.g. x = c(1,1,1,2,2,3,3,3,1,1,2,2), you can use this:
library(dplyr)
cumsum(x != lag(x, default = 0))
# [1] 1 1 1 2 2 3 3 3 4 4 5 5
This compares each value with the previous one and adds 1 whenever they differ. Because the default of 0 differs from the first value, the count starts at 1.
Maybe you can try rle, i.e.,
r <- rle(x)
v <- rep(seq_along(r$values), r$lengths)
Example with dummy data
x = c(1,1,1,2,2,3,3,3,4,4,5,6,1,1,2,2,3,3,3,4,4)
then we can get
> v
[1] 1 1 1 2 2 3 3 3 4 4 5 6 7 7 8 8 9 9
[19] 9 10 10
In base you can use diff and cumsum.
c(1, cumsum(diff(x)!=0)+1)
# [1] 1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8
Data:
x <- c(1,1,2,2,2,3,3,3,4,4,4,4,5,5,5,6,6,6,1,1,1,2,2,2,2)
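The question mentions a factor rather than a numeric vector, and arithmetic such as diff is not meaningful on factors. A minimal sketch, assuming a shortened factor made up for illustration, is to work on the integer codes first and then apply either base approach:

```r
# A factor similar to the one described in the question (shortened, values assumed).
f <- factor(c(1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2))

# Use the integer level codes; run boundaries are the same as in the factor.
x <- as.integer(f)

# Run IDs via diff/cumsum: add 1 at every position where the value changes.
id_diff <- c(1, cumsum(diff(x) != 0) + 1)

# Run IDs via rle: number the runs, then repeat each number by its run length.
r <- rle(x)
id_rle <- rep(seq_along(r$values), r$lengths)

id_diff
```

Both vectors number the runs in order of appearance, so they should agree element by element.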

Separate unique and duplicate entries in dataframe based off id

I have a dataframe with an id variable, which may be duplicated. I want to split this into two dataframes: one containing only the entries whose ids are duplicated, and the other containing only the entries whose ids are unique. What is the best way of doing this?
For example, say I had the data frame:
dataDF <- data.frame(id = c(1,1,2,3,4,4,5,6),
a = c(1,2,3,4,5,6,7,8),
b = c(8,7,6,5,4,3,2,1))
i.e. the following
id a b
1 1 1 8
2 1 2 7
3 2 3 6
4 3 4 5
5 4 5 4
6 4 6 3
7 5 7 2
8 6 8 1
I want to get the following dataframes:
id a b
1 1 1 8
2 1 2 7
5 4 5 4
6 4 6 3
and
id a b
3 2 3 6
4 3 4 5
7 5 7 2
8 6 8 1
I am currently doing this as follows
dupeIds <- unique(subset(dataDF, duplicated(dataDF$id))$id)
uniqueDF <- subset(dataDF, !id %in% dupeIds)
dupeDF <- subset(dataDF, id %in% dupeIds)
which seems to work, but subsetting three times seems a bit off. Is there a simpler way of doing this? Thanks
Use duplicated twice, once top down, and once bottom up, and then use split to get it all in a list, like this:
split(dataDF, duplicated(dataDF$id) | duplicated(dataDF$id, fromLast = TRUE))
# $`FALSE`
# id a b
# 3 2 3 6
# 4 3 4 5
# 7 5 7 2
# 8 6 8 1
#
# $`TRUE`
# id a b
# 1 1 1 8
# 2 1 2 7
# 5 4 5 4
# 6 4 6 3
If you need to split this out into separate data.frames in your workspace (not sure why you would need to do that), assign names to the list items (eg names(mylist) <- c("nodupe", "dupe")) and then use list2env.
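As a sketch of that last step, the list can be renamed and handed to list2env. The names uniqueDF and dupeDF are borrowed from the question, and a fresh environment is used here so the example does not touch the workspace.

```r
dataDF <- data.frame(id = c(1, 1, 2, 3, 4, 4, 5, 6),
                     a  = c(1, 2, 3, 4, 5, 6, 7, 8),
                     b  = c(8, 7, 6, 5, 4, 3, 2, 1))

# TRUE for every row whose id occurs more than once (first and later occurrences).
is_dupe <- duplicated(dataDF$id) | duplicated(dataDF$id, fromLast = TRUE)

parts <- split(dataDF, is_dupe)
# split() orders the groups FALSE, TRUE. This renaming assumes both groups exist.
names(parts) <- c("uniqueDF", "dupeDF")

# Copy each list element into its own variable in a fresh environment
# (pass envir = globalenv() instead to create them in the workspace).
e <- list2env(parts, envir = new.env())
e$dupeDF$id
```

If the data could contain only unique ids or only duplicated ids, split returns a single group, so the renaming step would need a guard.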

R indirect reference in data frame

I would like to refer to values in a data frame column with the row index being dependent on the value of another column.
Example:
value lag laggedValue
1 1 2
2 2 4
3 3 6
4 2 6
5 1 6
6 3 9
7 3 10
8 1 9
9 1 10
10 2
In Excel I use this formula in column "laggedValue":
=INDIRECT("B"&(ROW(B2)+C2))
How can I do this in an R data frame?
Thanks!
For row r with associated lag value lag[r] it looks like you're trying to create a new column that is the (r+lag[r])th element of value (or a missing value if this is out of bounds). You can do this with:
dat$laggedValue <- dat$value[seq(nrow(dat)) + dat$lag]
dat
value lag laggedValue
1 1 1 2
2 2 2 4
3 3 3 6
4 4 2 6
5 5 1 6
6 6 3 9
7 7 3 10
8 8 1 9
9 9 1 10
10 10 2 NA
Other commenters are mentioning that it looks like you're just adding the value and lag columns because your value column has the elements 1 through 10, but this solution will work even when your value column has other data stored in it.
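To illustrate that point, here is a small sketch with a value column that is not just the row number. The data is made up for illustration and is not from the question.

```r
# Values chosen for illustration only; they are not from the question.
dat <- data.frame(value = c(10, 20, 30, 40, 50),
                  lag   = c(1, 2, 1, 1, 2))

# The target row for each row r is r + lag[r].
target <- seq_len(nrow(dat)) + dat$lag

# Indexing past the end of a vector yields NA, which covers out-of-bounds rows.
dat$laggedValue <- dat$value[target]
dat$laggedValue
```

If lag could be negative enough to produce a target of 0 or below, those targets would need to be set to NA first, since zero and negative indices mean something different in R.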
Assuming the same thing as @rawr here:
dat <- data.frame(value=c(1:10),
lag=c(1,2,3,2,1,3,3,1,1,2))
dat$laggedValue <- dat$value + dat$lag
dat
value lag laggedValue
1 1 1 2
2 2 2 4
3 3 3 6
4 4 2 6
5 5 1 6
6 6 3 9
7 7 3 10
8 8 1 9
9 9 1 10
10 10 2 12

Summing two dataframes based on common value

I have a dataframe that looks like
day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3
and another like
day.of.week count
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13
I want to add the values from df1 to df2 based on day.of.week. I was trying to use ddply
total=ddply(merge(total, subtotal, all.x=TRUE,all.y=TRUE),
.(day.of.week), summarize, count=sum(count))
which almost works, but merge combines rows that are identical in both data frames. For instance, in the example above for day.of.week = 5, rather than producing two records each with count one, it produces a single record with count one, so I get a total count of one instead of two. The merged data looks like this:
day.of.week count
1 0 3
2 0 17
3 1 6
4 2 1
5 3 1
6 4 1
7 4 5
8 5 1
9 6 3
10 6 13
There is no need to merge. You can simply do
library(plyr)
ddply(rbind(d1, d2), .(day.of.week), summarize, sum_count = sum(count))
I have assumed that both data frames have identical column names day.of.week and count
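The same stack-then-sum idea can also be sketched in base R with aggregate, which avoids the plyr dependency. This is an alternative technique, not the answer's own code, and it rebuilds the two data frames from the question under the names d1 and d2.

```r
d1 <- data.frame(day.of.week = c(0, 3, 4, 5, 6),
                 count       = c(3, 1, 1, 1, 3))
d2 <- data.frame(day.of.week = 0:6,
                 count       = c(17, 6, 1, 1, 5, 1, 13))

# Stack the two frames, then sum count within each day.of.week.
total <- aggregate(count ~ day.of.week, data = rbind(d1, d2), FUN = sum)
total$count
```

Because the rows are stacked rather than merged, two identical rows such as day.of.week 5 with count 1 are both kept and summed.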
In addition to the suggestion Ben gave you about using merge, you could also do this simply using subsetting:
d1 <- read.table(textConnection(" day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3"),sep="",header = TRUE)
d2 <- read.table(textConnection(" day.of.week count1
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13"),sep = "",header = TRUE)
d2[match(d1[,1],d2[,1]),2] <- d2[match(d1[,1],d2[,1]),2] + d1[,2]
> d2
day.of.week count1
1 0 20
2 1 6
3 2 1
4 3 2
5 4 6
6 5 2
7 6 16
This assumes no repeated day.of.week rows, since match will return only the first match.
