Increment each duplicate value by one - r

I am trying to find a proper way, in R, to find duplicated values and add 1 to each subsequent duplicate, grouped by id. For example:
library(data.table)
library(dplyr)  # for lag(); base R's lag() behaves differently

data <- data.table(id = c('1','1','1','1','1','2','2','2'),
                   value = c(95,100,101,101,101,20,35,38))
data$new_value <- ifelse(data$value == lag(data$value, 1),
                         lag(data$value, 1) + 1, data$value)
data$desired_value <- c(95,100,101,102,103,20,35,38)
Produces:
   id value new_value desired_value
1:  1    95        NA            95
2:  1   100       100           100
3:  1   101       101           101  # first 101 in id 1: add 0
4:  1   101       102           102  # second 101 in id 1: add 1
5:  1   101       102           103  # third 101 in id 1: add 2
6:  2    20        20            20
7:  2    35        35            35
8:  2    38        38            38
I tried doing this with ifelse, but it doesn't work recursively: the increment only applies to the immediately following row, not to any later duplicates. Also, lag leaves an NA in place of the first value.
I've seen examples with character variables with make.names or make.unique, but haven't been able to find a solution for a duplicated numeric value.
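For illustration, make.unique only appends string suffixes rather than incrementing the numbers:
make.unique(as.character(c(101, 101, 101)))
# [1] "101"   "101.1" "101.2"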
Background: I am doing a survival analysis and I am finding that with my data there are stop times that are the same, so I need to make it unique by adding a 1 (stop times are in seconds).

Here's an attempt. You're essentially grouping by id and value and adding 0:(length(value)-1). So:
data[, onemore := value + (0:(.N-1)), by=.(id, value)]
#    id value new_value desired_value onemore
# 1:  1    95        96            95      95
# 2:  1   100       101           100     100
# 3:  1   101       102           101     101
# 4:  1   101       102           102     102
# 5:  1   101       102           103     103
# 6:  2    20        21            20      20
# 7:  2    35        36            35      35
# 8:  2    38        39            38      38

With base R we can use ave: within each id/value group we take the first value and add the row position within the group minus one.
data$value1 <- ave(data$value, data$id, data$value,
                   FUN = function(x) x[1] + seq_along(x) - 1)
#    id value new_value desired_value value1
# 1:  1    95        96            95     95
# 2:  1   100       101           100    100
# 3:  1   101       102           101    101
# 4:  1   101       102           102    102
# 5:  1   101       102           103    103
# 6:  2    20        21            20     20
# 7:  2    35        36            35     35
# 8:  2    38        39            38     38

Here is one option with the tidyverse:
library(dplyr)
data %>%
  group_by(id, value) %>%
  mutate(onemore = value + row_number() - 1)
#   id    value onemore
#   <chr> <dbl>   <dbl>
# 1 1        95      95
# 2 1       100     100
# 3 1       101     101
# 4 1       101     102
# 5 1       101     103
# 6 2        20      20
# 7 2        35      35
# 8 2        38      38
Or we can use base R without an anonymous function call:
data$onemore <- with(data, value + ave(value, id, value, FUN = seq_along) - 1)
data$onemore
#[1] 95 100 101 102 103 20 35 38

To avoid (a potentially costly) by, you may use rowid:
data[, res := value + rowid(id, value) - 1]
# data
#    id value new_value desired_value res
# 1:  1    95        96            95  95
# 2:  1   100       101           100 100
# 3:  1   101       102           101 101
# 4:  1   101       102           102 102
# 5:  1   101       102           103 103
# 6:  2    20        21            20  20
# 7:  2    35        36            35  35
# 8:  2    38        39            38  38
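Tying this back to the survival-analysis background in the question, here is a minimal sketch of the same rowid trick applied to tied stop times (the surv table and its column names are hypothetical):
library(data.table)
surv <- data.table(subject = c(1, 1, 1, 2), stop = c(300, 300, 300, 300))
# rowid(subject, stop) is computed on the original values, so ties get offsets 0, 1, 2, ...
surv[, stop := stop + rowid(subject, stop) - 1]
surv$stop
# [1] 300 301 302 300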

Related

Reshape R dataframe with values in new columns from the original dataframe

I have the following dataframe "Diet":
ID  Food.group  grams  day  weight
A           12    200    1      60
A           13    300    1      60
A           14    100    1      60
A           15     50    1      60
A           16    200    1      60
A           17    250    1      60
B           13    300    2      73
B           14    140    2      73
B           15    345    2      73
B           17    350    2      73
C           12    120    6      66
C           13    100    6      66
C           16    200    6      66
I need to create a new dataframe with each food group as a new column and the values in grams as their values, all organized by ID. The other columns have unique values for each ID and can become one line. Something like this:
ID   12   13   14   15   16   17  day  weight
A   200  300  100   50  200  250    1      60
B   N/A  300  140  345  N/A  350    2      73
C   120  100  N/A  N/A  200  N/A    6      66
I tried using:
Diet2 <- reshape(Diet, idvar = "ID", timevar = "Food.group", direction = "wide")
But I get this:
ID  12.grams  12.day  12.weight  13.grams  13.day  13.weight
A        200       1         60       300       1         60
B        N/A     N/A        N/A       300       2         73
C        120       6         66       100       6         66
and so on. How can I get the correct dataframe shape?
You can use the more recent pivot_wider() from {tidyr}.
library(dplyr)
library(tidyr)
Diet %>%
  pivot_wider(id_cols = c("ID", "day", "weight"),
              names_from = "Food.group",
              values_from = "grams")
This will give you:
# A tibble: 3 x 9
  ID      day weight  `12`  `13`  `14`  `15`  `16`  `17`
  <chr> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A         1     60   200   300   100    50   200   250
2 B         2     73    NA   300   140   345    NA   350
3 C         6     66   120   100    NA    NA   200    NA
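For completeness, the original base reshape call can likely be fixed as well: the problem is that only ID was given as the id variable, so day and weight were treated as time-varying. A sketch (assuming the Diet data frame from the question):
# include day and weight in idvar so that only grams gets spread out
Diet2 <- reshape(Diet,
                 idvar = c("ID", "day", "weight"),
                 timevar = "Food.group",
                 direction = "wide")
# the columns come out named grams.12, grams.13, ... and can be renamed afterwards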

Reshaping data frame in R: from wide to long, but the 'varying' columns have unequal length

My question is described in the code below. I have looked here and in other forums for similar problems, but haven't found a solution that quite matches what I'm asking here. If it can be solved relying only on basic R, that would be preferable, but using a package is fine too.
id1 <- c("A", "A", "A", "B", "B", "C", "C", "C")
id2 <- c(10, 20, 30, 10, 30, 10, 20, 30)
x.1 <- ceiling(runif(8)*80) + 20
y.1 <- ceiling(runif(8)*15) + 200
x.2 <- ceiling(runif(8)*90) + 20
y.2 <- ceiling(runif(8)*20) + 200
x.3 <- ceiling(runif(8)*80) + 40
# The data frame contains two kinds of data values, x and y, repeated with a suffix number. In my example neither
# the id-part nor the data-part is structured in a completely uniform manner.
mywidedata <- data.frame(id1, id2, x.1, y.1, x.2, y.2, x.3)
# If I wanted to make the data frame even wider, this would work. It generates NAs for the missing combination (B,20).
reshape(mywidedata, idvar = "id1", timevar = "id2", direction = "wide")
# What I want is "long", and this fails.
reshape(mywidedata, varying = c(3:7), direction = "long")
# I could introduce the needed column. This works.
mywidecopy <- mywidedata
mywidecopy$y.3 <- NA
mylongdata <- reshape(mywidecopy, idvar=c(1,2), varying = c(3:8), direction = "long", sep = ".")
# (sep-argument not needed in this case - the function can figure out the system)
names(mylongdata)[(names(mylongdata)=="time")] <- "id3"
# I want to reach the same outcome without manual manipulation. Is it possible with just the
# built-in 'reshape'?
# Trying 'melt'. Not what I want.
reshape::melt(mywidedata, id.vars = c(1,2))
You can use pivot_longer from tidyr:
tidyr::pivot_longer(mywidedata,
                    cols = -c(id1, id2),
                    names_to = c('.value', 'id3'),
                    names_sep = '\\.')
# A tibble: 24 x 5
#    id1     id2 id3       x     y
#    <chr> <dbl> <chr> <dbl> <dbl>
#  1 A        10 1        66   208
#  2 A        10 2        95   220
#  3 A        10 3        89    NA
#  4 A        20 1        34   208
#  5 A        20 2        81   219
#  6 A        20 3        82    NA
#  7 A        30 1        23   201
#  8 A        30 2        80   204
#  9 A        30 3        75    NA
# 10 B        10 1        52   210
# … with 14 more rows
Just cbind the missing level as NA (the missing column is y.3, not y.2):
reshape(cbind(mywidedata, y.3 = NA), varying = 3:8, direction = "long")
#     id1 id2 time   x   y id
# 1.1 A    10    1  98 215  1
# 2.1 A    20    1  38 208  2
# 3.1 A    30    1  97 205  3
# 4.1 B    10    1  61 207  4
# 5.1 B    30    1  73 201  5
# 6.1 C    10    1  96 202  6
# 7.1 C    20    1 100 202  7
# 8.1 C    30    1  94 202  8
# 1.2 A    10    2  73 208  1
# 2.2 A    20    2  69 218  2
# 3.2 A    30    2  64 219  3
# 4.2 B    10    2 104 213  4
# 5.2 B    30    2  99 203  5
# 6.2 C    10    2  92 206  6
# 7.2 C    20    2  49 206  7
# 8.2 C    30    2  59 209  8
# 1.3 A    10    3  63  NA  1
# 2.3 A    20    3  91  NA  2
# 3.3 A    30    3  42  NA  3
# 4.3 B    10    3  67  NA  4
# 5.3 B    30    3  90  NA  5
# 6.3 C    10    3  74  NA  6
# 7.3 C    20    3  86  NA  7
# 8.3 C    30    3  83  NA  8
We can use melt from data.table
library(data.table)
melt(setDT(mywidedata), measure = patterns("^x", "^y"), value.name = c('x', 'y'))
#     id1 id2 variable   x   y
#  1:   A  10        1  97 215
#  2:   A  20        1  75 202
#  3:   A  30        1  87 213
#  4:   B  10        1  51 206
#  5:   B  30        1  75 203
#  6:   C  10        1  41 210
#  7:   C  20        1  58 211
#  8:   C  30        1  50 207
#  9:   A  10        2  92 204
# 10:   A  20        2  60 207
# 11:   A  30        2  35 201
# 12:   B  10        2  83 202
# 13:   B  30        2  81 202
# 14:   C  10        2  55 216
# 15:   C  20        2  68 204
# 16:   C  30        2  70 218
# 17:   A  10        3  89  NA
# 18:   A  20        3 108  NA
# 19:   A  30        3  47  NA
# 20:   B  10        3  78  NA
# 21:   B  30        3  43  NA
# 22:   C  10        3 106  NA
# 23:   C  20        3  92  NA
# 24:   C  30        3  96  NA

Restructure / reshape data frame (R)

My dataset has repeated observations for people that work on projects. I need a data frame with two columns that list 'combinations' of projects for each person and time point. Let me explain with an example:
This is my data:
ID  Week  Project
01     1      101
01     1      102
01     1      103
01     2      101
01     2      102
02     1      101
02     1      102
02     2      101
Person 1 (ID = 1) worked on three projects in week 1. This means that there are six possible combinations of projects (project_i & project_j) for this person, in this week.
This is what I need
ID  Week  Project_i  Project_j
01     1        101        101
01     1        101        102
01     1        101        103
01     1        102        101
01     1        102        102
01     1        102        103
01     1        103        101
01     1        103        102
01     1        103        103
01     2        101        101
01     2        101        102
01     2        102        101
01     2        102        102
02     1        101        101
02     1        101        102
02     1        102        101
02     1        102        102
02     2        101        101
Losing cases that only have one project per week is not an issue.
I have tried base R and reshape2 for a bit, but I can't figure this out.
Here's one way:
library(data.table)
setDT(DT)
DT[, CJ(P1 = Project, P2 = Project)[P1 != P2], by=.(ID, Week)]
    ID Week  P1  P2
 1:  1    1 101 102
 2:  1    1 101 103
 3:  1    1 102 101
 4:  1    1 102 103
 5:  1    1 103 101
 6:  1    1 103 102
 7:  1    2 101 102
 8:  1    2 102 101
 9:  2    1 101 102
10:  2    1 102 101
CJ is the Cartesian Join of two vectors, taking all combinations.
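For instance, a minimal standalone illustration (with made-up values):
library(data.table)
CJ(P1 = c(101, 102), P2 = c(101, 102))
#     P1  P2
# 1: 101 101
# 2: 101 102
# 3: 102 101
# 4: 102 102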
If you don't want both (101,102) and (102,101), use P1 > P2 instead of P1 != P2. (The OP has since changed the question to keep self-pairs like (101,101) as well as both orders, in which case the [P1 != P2] filter can simply be dropped.)
Here is a solution that uses dplyr and tidyr. The key step is tidyr::complete() combined with dplyr::group_by().
library(dplyr)
library(tidyr)
d %>%
  rename(Project_i = Project) %>%
  mutate(Project_j = Project_i) %>%
  group_by(ID, Week) %>%
  complete(Project_i, Project_j) %>%
  filter(Project_i != Project_j)
Here's a base option using expand.grid:
do.call(rbind, lapply(split(df, paste(df$ID, df$Week)), function(x){
  x2 <- expand.grid(ID = unique(x$ID),
                    Week = unique(x$Week),
                    Project_i = unique(x$Project),
                    Project_j = unique(x$Project))
  # omit if 101 102 is different from 102 101; make `<` if 101 101 not possible
  x2[x2$Project_i <= x2$Project_j, ]
}))
#        ID Week Project_i Project_j
# 1 1.1   1    1       101       101
# 1 1.4   1    1       101       102
# 1 1.5   1    1       102       102
# 1 1.7   1    1       101       103
# 1 1.8   1    1       102       103
# 1 1.9   1    1       103       103
# 1 2.1   1    2       101       101
# 1 2.3   1    2       101       102
# 1 2.4   1    2       102       102
# 2 1.1   2    1       101       101
# 2 1.3   2    1       101       102
# 2 1.4   2    1       102       102
# 2 2     2    2       101       101

Rank function to rank multiple variables in R

I am trying to rank multiple numeric variables (700+ variables) in my data and am not sure exactly how to do this, as I am still pretty new to R.
I do not want to overwrite the ranked values in the same variable and hence need to create a new rank variable for each of these numeric variables.
From reading other posts, I believe assign and transform, along with rank, may be able to solve this. I tried implementing it as below (sample data and code) and am struggling to get it to work.
In addition to the variables xcount, xvisit, and ysales, the output dataset needs to be populated with xcount_rank, xvisit_rank, and ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
  transform(input1,
            assign(paste(input1[,i], "rank", sep="_")) =
              FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
  lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as (101, 230], (230, 450], etc., whereas I would like the rank variable to be populated as 1, 2, and so on, up to 10 categories as per the splits I did. Is there any way to achieve this? I also tried:
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
  lapply(input[2:4], cut, breaks = 3)
Current output I get is:
   id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101     20      5      6  (1.15,286] (3.95,21.3] (5.86,52.3]
2 102      2      4      7  (1.15,286] (3.95,21.3] (5.86,52.3]
3 103      9     12     15  (1.15,286] (3.95,21.3] (5.86,52.3]
4 104    100      8      7  (1.15,286] (3.95,21.3] (5.86,52.3]
5 105    450     12     65   (286,570] (3.95,21.3] (52.3,98.7]
6 109     25     28    145  (1.15,286] (21.3,38.7]  (98.7,145]
7 112    854     56     93   (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
   id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101     20      5      6           1           1           1
2 102      2      4      7           1           1           1
3 103      9     12     15           1           1           1
4 104    100      8      7           1           1           1
5 105    450     12     65           2           1           2
6 109     25     28    145           1           2           3
I would like to see each record labeled with the group it falls under when the interval values are ranked.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
#    id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
# 1 101      2      5      6           1           2           1
# 2 102      3      4      7           2           1           2
# 3 103      9     12     15           3           3           3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
#    id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
# 1 101     20      5      6           1           1           1
# 2 102      2      4      7           1           1           1
# 3 103      9     12     15           1           1           1
# 4 104    100      8      7           1           1           1
# 5 105    450     12     65           2           1           2
# 6 109     25     28    145           1           2           3
# 7 112    854     56     93           3           3           2
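Note that mutate_each has since been superseded in dplyr; an approximate modern equivalent of the cut-based version, using across() with the same breaks and column names, would be:
library(dplyr)
input <- input %>%
  mutate(across(xcount:ysales,
                ~ cut(.x, breaks = 3, labels = FALSE),
                .names = "rank_{.col}"))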

ddply type functionality on multiple dataframes

I have two dataframes that are structured as follows:
Dataframe A:
     id sqft traf month
1  1030   16   35     1
1  1030   15   32     2
2  1027    1   31     1
2  1027    2   31     2
Dataframe B:
     id price frequency month day
1  1030     8       196     1   1
2  1030     9       101     1  15
3  1030    10       156     1  30
4  1030     3       137     2   1
5  1030     7       190     2  15
6  1027     6       188     1   1
7  1027     1       198     1  15
8  1027     2       123     1  30
9  1027     4       185     2   1
10 1027     5       122     2  15
I want to output certain types of summary statistics (centered around each unique ID) from both of these dataframes. This would be easy with ddply if, say, I wanted the mean price for each ID for each month (split by id and month) from Dataframe B, or if I wanted the average ratio of sqft to traf for each id (split by id).
But what would be a potential solution if I wanted to make combined variables from both dataframes. For instance, how would I get the average price for each id/month (Dataframe B) divided by sqft for each id/month?
The varying frequencies at which the two dataframes are measured make combining them not easily doable. The only solution I've found so far is to ddply the first dataframe to extract the average sqft per id/month and then pass that value into a second ddply call on the second dataframe.
Is there a more efficient/less convoluted way to do this? I would be splitting both dataframes on the same variables (id and month).
Thanks in advance for any suggestions!
In the case of the sample data, you could merge the two data sets like this (by specifying all.y = TRUE you can make sure that all rows of dfb are kept and, in this case, corresponding entries of dfa are repeated accordingly)
dfall <- merge(dfa, dfb, by = c("id", "month"), all.y=TRUE)
#      id month sqft traf price frequency day
# 1  1027     1    1   31     6       188   1
# 2  1027     1    1   31     1       198  15
# 3  1027     1    1   31     2       123  30
# 4  1027     2    2   31     4       185   1
# 5  1027     2    2   31     5       122  15
# 6  1030     1   16   35     8       196   1
# 7  1030     1   16   35     9       101  15
# 8  1030     1   16   35    10       156  30
# 9  1030     2   15   32     3       137   1
# 10 1030     2   15   32     7       190  15
Then, you can use ddply as usual:
ddply(dfall, .(id, month), mutate, newcol = mean(price)/sqft)
#      id month sqft traf price frequency day    newcol
# 1  1027     1    1   31     6       188   1 3.0000000
# 2  1027     1    1   31     1       198  15 3.0000000
# 3  1027     1    1   31     2       123  30 3.0000000
# 4  1027     2    2   31     4       185   1 2.2500000
# 5  1027     2    2   31     5       122  15 2.2500000
# 6  1030     1   16   35     8       196   1 0.5625000
# 7  1030     1   16   35     9       101  15 0.5625000
# 8  1030     1   16   35    10       156  30 0.5625000
# 9  1030     2   15   32     3       137   1 0.3333333
# 10 1030     2   15   32     7       190  15 0.3333333
Edit: if you're looking for better performance, consider using dplyr instead of plyr. The equivalent dplyr code (including the merge) is:
library(dplyr)
dfall <- dfb %>%
  left_join(., dfa, by = c("id", "month")) %>%
  group_by(id, month) %>%
  dplyr::mutate(newcol = mean(price)/sqft) # I added dplyr:: to avoid confusion with plyr::mutate
Of course, you could also check out data.table which is also very efficient.
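For example, a rough data.table sketch of the same computation (assuming dfa and dfb as above) could be:
library(data.table)
# join dfb to dfa on id/month, then add the grouped ratio by reference
dtall <- merge(as.data.table(dfa), as.data.table(dfb),
               by = c("id", "month"), all.y = TRUE)
dtall[, newcol := mean(price) / sqft, by = .(id, month)]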
AFAIK ddply is not designed to be used with different data frames at the same time.
dplyr does well here. This code merges the data frames, gets price and sqft means by unique id/month combination, then creates a new variable pricePerSqft.
require(dplyr)
dfa %>%
left_join(dfb, by = c("id", "month")) %>%
group_by(id, month) %>%
summarize(
avgPrice = mean(price),
avgSqft = mean(sqft)) %>%
mutate(pricePerSqft = round(avgPrice / avgSqft, 2))
Here's the result:
    id month avgPrice avgSqft pricePerSqft
1 1027     1      3.0       1         3.00
2 1027     2      4.5       2         2.25
3 1030     1      9.0      16         0.56
4 1030     2      5.0      15         0.33
