R - How can I find duplicated rows based on one column and append extra text to the duplicated value? - r

I'm looking for a simple solution, rather than doing it in several steps.
I have a data frame with 36 variables and almost 3000 rows; one of the variables is a character column with names, which must be unique. I need to find the rows that share a name and append "duplicated" to the text. I can't delete the duplicated rows because they come from a relational database and I'll need their row IDs for other operations.
I can find the duplicated rows and then rename the text manually, but that means locating the duplicates, recording the row IDs, and replacing each name by hand.
Is there a way to automatically append the extra text to the duplicated names? I'm still new to R and have a hard time writing condition-based functions.
It would be something like this:
From this:
ID name age sex
1 John 18 M
2 Mary 25 F
3 Mary 19 F
4 Ben 21 M
5 July 35 F
To this:
ID name age sex
1 John 18 M
2 Mary 25 F
3 Mary - duplicated 19 F
4 Ben 21 M
5 July 35 F
Could you guys shed some light?
Thank you very much.

Edit: the comment about adding a column is probably the best thing to do, but if you really want to do what you're suggesting...
The duplicated function will identify the duplicates; then you just need paste0 to append the text.
df <- data.frame(
  ID = 1:5,
  name = c('John', 'Mary', 'Mary', 'Ben', 'July'),
  age = c(18, 25, 19, 21, 35),
  sex = c('M', 'F', 'F', 'M', 'F'),
  stringsAsFactors = FALSE)
# Add "-duplicated" to every duplicated value (following Laterow's comment)
dup <- duplicated(df$name)
df$name[dup] <- paste0(df$name[dup], '-duplicated')
df
ID name age sex
1 1 John 18 M
2 2 Mary 25 F
3 3 Mary-duplicated 19 F
4 4 Ben 21 M
5 5 July 35 F
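One caveat: if a name can appear more than twice, every repeat gets the same "-duplicated" suffix and the values still collide. A hedged sketch (base R only, same df as above, the "-duplicated-N" suffix format is my own choice) that numbers each repeat instead:

```r
df <- data.frame(
  ID = 1:5,
  name = c('John', 'Mary', 'Mary', 'Ben', 'July'),
  age = c(18, 25, 19, 21, 35),
  sex = c('M', 'F', 'F', 'M', 'F'),
  stringsAsFactors = FALSE)

dup <- duplicated(df$name)
# ave() numbers the occurrences within each name: 1 for the first, 2, 3, ... for repeats
occ <- ave(seq_along(df$name), df$name, FUN = seq_along)
df$name[dup] <- paste0(df$name[dup], '-duplicated-', occ[dup])
df$name
```

This keeps the names unique even when the same name occurs three or more times.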

Related

Filter by Condition occurring Consecutively in R

I'm hoping to see if there is a dplyr solution to this problem as I'm building a survival dataset.
I am looking to create my 'event' coding, which is satisfied when a particular condition occurs twice consecutively. In this case, the event condition is Var > 21 on two consecutive dates. For example, in the following dataset:
ID Date Var
1 1/1/20 22
1 1/3/20 23
2 1/2/20 23
2 2/10/20 18
2 2/16/20 21
3 12/1/19 16
3 12/6/19 14
3 12/20/19 22
In this case, patient 1 should remain, and patients 2 and 3 should be filtered out because Var > 21 did not happen on consecutive dates. Then I'd like to simply take the maximum date for each ID so that I can easily calculate the time to the event.
Final result:
ID Date Var
1 1/3/20 23
Thank you
As long as the dates are sorted (the latest date comes later in the table), this should work. It's written in data.table, since I don't use dplyr much, but the idea translates directly.
library(data.table)
setDT(df)
# within each ID, keep the second row of each consecutive pair with Var > 21
df <- df[df[, Var > 21 & shift(Var > 21, type = "lag", fill = FALSE), by = ID]$V1]
df <- unique(df, by = "ID", fromLast = TRUE)
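Since the question asks for dplyr, here is a hedged equivalent sketch (assuming, as above, that rows are already sorted by date within each ID; the data frame below just reproduces the example):

```r
library(dplyr)

df <- data.frame(
  ID   = c(1, 1, 2, 2, 2, 3, 3, 3),
  Date = c("1/1/20", "1/3/20", "1/2/20", "2/10/20", "2/16/20",
           "12/1/19", "12/6/19", "12/20/19"),
  Var  = c(22, 23, 23, 18, 21, 16, 14, 22))

result <- df %>%
  group_by(ID) %>%
  # keep the second row of each consecutive pair with Var > 21
  filter(Var > 21 & lag(Var > 21, default = FALSE)) %>%
  # latest qualifying row per ID (rows assumed date-sorted)
  slice_tail(n = 1) %>%
  ungroup()
```

lag() here is the within-group equivalent of data.table's shift(type = "lag"), and slice_tail(n = 1) stands in for unique(..., fromLast = TRUE).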

Only changing a single variable in R

I have a dataframe dt:
Group Age Sales
A1234 12 1000
A2312 11 900
B2100 23 2100
...
I intend to create a new dataframe by modifying the Group variable, taking only a substring of Group. At present, I do it in two steps:
dt1<- dt
dt1$Group<- substr(dt$Group,1,2)
Is it possible to do the above in a single command? Creating and transforming many intermediate dataframes along the way would otherwise get tedious.
You can try:
dt1<-`$<-`(dt,"Group",substr(dt$Group,1,2))
dt1
# Group Age Sales
#1 A1 12 1000
#2 A2 11 900
#3 B2 23 2100
dt
# Group Age Sales
#1 A1234 12 1000
#2 A2312 11 900
#3 B2100 23 2100
The original table is unchanged and you get the new one with a single line.
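For comparison, a more conventional one-liner is base R's transform() (dplyr's mutate() works the same way); this sketch builds a small dt matching the example and also leaves the original untouched:

```r
dt <- data.frame(
  Group = c("A1234", "A2312", "B2100"),
  Age   = c(12, 11, 23),
  Sales = c(1000, 900, 2100),
  stringsAsFactors = FALSE)

# one command: a copy of dt with Group truncated to its first two characters
dt1 <- transform(dt, Group = substr(Group, 1, 2))
```

This reads more naturally than calling the `$<-` replacement function directly, at the cost of not being "pure" base assignment syntax.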

switching list elements with dataframe rows

Consider my list IDs that has a dataframe of behaviours in each one:
IDs <- list(
  Dave = data.frame(
    Behaviour = c("Aggression", "Interaction", "Nursing"),
    number = c(20, 10, 5),
    duration = c(60, 39, 27)),
  James = data.frame(
    Behaviour = c("Aggression", "Interaction"),
    number = c(21, 30),
    duration = c(30, 49)))
IDs
$Dave
Behaviour number duration
1 Aggression 20 60
2 Interaction 10 39
3 Nursing 5 27
$James
Behaviour number duration
1 Aggression 21 30
2 Interaction 30 49
Note that James does not exhibit any nursing behaviour, so the two list elements have different numbers of rows.
I want to swap the roles of the list elements and the dataframe rows, so that I get a list of behaviours, each containing a dataframe of IDs. It should look like this:
$Aggression
ID number duration
1 Dave 20 60
2 James 21 30
$Interaction
ID number duration
1 Dave 10 39
2 James 30 49
$Nursing
ID number duration
1 Dave 5 27
I thought this could be achieved with reshape2::melt, but I wasn't able to get further than melt(IDs, id = "Behaviour").
Any ideas?
Generally you can do it in two steps:
1. turn the list into a single data.frame/data.table
2. split it based on Behaviour
You can do it like this, for example:
dt <- data.table::rbindlist(IDs, idcol = "ID")
# or: dt <- dplyr::bind_rows(IDs, .id = "ID")
split(dt, dt$Behaviour)
Note:
If you don't want the Behaviour column in the result and you used the data.table approach, you can modify the split to:
split(dt[,!"Behaviour"], dt$Behaviour)
Try this:
tmp <- data.frame(ID = rep(names(IDs), vapply(IDs, nrow, 1L)),
                  do.call(rbind, IDs), row.names = NULL)
split(tmp[-2], tmp$Behaviour)
#$Aggression
# ID number duration
#1 Dave 20 60
#4 James 21 30
#$Interaction
# ID number duration
#2 Dave 10 39
#5 James 30 49
#$Nursing
# ID number duration
#3 Dave 5 27
Or using base R
d1 <- do.call(rbind, Map(cbind, id = names(IDs), IDs))
split(d1, d1$Behaviour)

Turning one row into multiple rows in r [duplicate]

This question already has answers here:
Combine Multiple Columns Into Tidy Data [duplicate]
In R, I have data where each person has multiple session dates and scores on some tests, but all of it is in one row. I would like to reshape it so that each person gets multiple rows, each containing the person's info plus a single session date and its corresponding test scores. Note that each person may have completed a different number of sessions.
Ex:
ID Name Session1Date Score Score Session2Date Score Score
23 sjfd 20150904 2 3 20150908 5 7
28 addf 20150905 3 4 20150910 6 8
To:
ID Name SessionDate Score Score
23 sjfd 20150904 2 3
23 sjfd 20150908 5 7
28 addf 20150905 3 4
28 addf 20150910 6 8
You can use melt from the devel version of data.table, i.e. v1.9.5. It can take multiple 'measure' columns as a list; installation instructions are here.
library(data.table)#v1.9.5+
melt(setDT(df1), measure = patterns("Date$", "Score(\\.2)*$", "Score\\.[13]"))
# ID Name variable value1 value2 value3
#1: 23 sjfd 1 20150904 2 3
#2: 28 addf 1 20150905 3 4
#3: 23 sjfd 2 20150908 5 7
#4: 28 addf 2 20150910 6 8
Or, using reshape from base R, we can specify the direction as 'long' and varying as a list of column indices:
res <- reshape(df1, idvar = c('ID', 'Name'),
               varying = list(c(3, 6), c(4, 7), c(5, 8)),
               direction = 'long')
res
# ID Name time Session1Date Score Score.1
#23.sjfd.1 23 sjfd 1 20150904 2 3
#28.addf.1 28 addf 1 20150905 3 4
#23.sjfd.2 23 sjfd 2 20150908 5 7
#28.addf.2 28 addf 2 20150910 6 8
If needed, the row names can be reset:
row.names(res) <- NULL
Update
If the columns follow a specific order, i.e. the 3rd is paired with the 6th, the 4th with the 7th, and the 5th with the 8th, we can create a matrix of column indices and then split it to get the list for the varying argument of reshape:
m1 <- matrix(3:8,ncol=2)
lst <- split(m1, row(m1))
reshape(df1, idvar=c('ID', 'Name'), varying=lst, direction='long')
If your data frame is named data, use this:
data1 <- data[1:5]
data2 <- data[c(1, 2, 6, 7, 8)]
newdata <- rbind(data1, data2)
This works for the example you've given, but you may have to rename the columns of data1 and data2 so they match before rbind will accept them.
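With current tidyr, the same reshape can be written with pivot_longer(). This sketch assumes the repeated columns have first been renamed to a regular pattern (Date1/ScoreA1/ScoreB1, Date2/ScoreA2/ScoreB2, names of my own choosing), since the original duplicated 'Score' headers would need cleaning anyway:

```r
library(tidyr)

df1 <- data.frame(
  ID = c(23, 28), Name = c("sjfd", "addf"),
  Date1 = c(20150904, 20150905), ScoreA1 = c(2, 3), ScoreB1 = c(3, 4),
  Date2 = c(20150908, 20150910), ScoreA2 = c(5, 6), ScoreB2 = c(7, 8))

long <- pivot_longer(df1,
  cols = -c(ID, Name),
  names_to = c(".value", "session"),
  # first capture group is the column stem, second is the session number
  names_pattern = "([A-Za-z]+)(\\d)")
```

The ".value" sentinel tells pivot_longer to spread the captured stems (Date, ScoreA, ScoreB) back out as columns, giving one row per person per session.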

Locate and merge duplicate rows in a data.frame but ignore column order

I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates and I've used plyr to combine the duplicate rows and add a count for each combination as explained in this thread.
Here's an example of what I have now (I still also have the original data.frame with all of the duplicates if I need to start from there):
name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15
However, column order doesn't matter. I just want to know how many rows have the same three entries, in any order. How can I combine the rows that contain the same entries, ignoring order? In this example I would want to combine rows 1 and 5, and rows 3 and 4.
Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.
Brief code snippet (this assumes the original data frame is dd); it's all really intuitive. We create a lookup column (take a look at it and it should be self-explanatory), sum the total column for each combination, and then filter down to the unique combinations:
dd$lookup <- apply(dd[, c("name1", "name2", "name3")], 1,
                   function(x) paste(sort(x), collapse = "~"))
tab1 <- tapply(dd$total, dd$lookup, sum)
ee <- dd[match(unique(dd$lookup), dd$lookup), ]
ee$newtotal <- as.numeric(tab1)[match(ee$lookup, names(tab1))]
You now have in ee a set of unique rows and their corresponding total counts. Easy, with no external packages needed, and crucially, you can see what is going on at every stage of the process!
(Minor update to help OP:) And if you want a cleaned-up version of the final answer:
outdf <- with(ee, data.frame(name1, name2, name3,
                             total = newtotal, stringsAsFactors = FALSE))
This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.
Sort the index columns, then use ddply to aggregate and sum:
Define the data:
dat <- " name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15"
x <- read.table(text=dat, header=TRUE)
Create a copy:
xx <- x
Use apply to sort the columns, then aggregate:
xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
ddply(xx, .(name1, name2, name3), numcolwise(sum))
name1 name2 name3 total
1 Bob Frank Joe 20
2 Bob Fred Sam 45
3 Frank Sam Tom 35
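For reference, the same sort-then-aggregate idea can be written in dplyr; a hedged sketch using the example data (the key column name is my own choice):

```r
library(dplyr)

x <- data.frame(
  name1 = c("Bob", "Bob", "Frank", "Sam", "Fred"),
  name2 = c("Fred", "Joe", "Sam", "Tom", "Bob"),
  name3 = c("Sam", "Frank", "Tom", "Frank", "Sam"),
  total = c(30, 20, 25, 10, 15),
  stringsAsFactors = FALSE)

summed <- x %>%
  rowwise() %>%
  # order-independent key: the three names sorted and pasted together
  mutate(key = paste(sort(c(name1, name2, name3)), collapse = "~")) %>%
  ungroup() %>%
  group_by(key) %>%
  summarise(total = sum(total), .groups = "drop")
```

rowwise() is needed because sort() must run on each row's three names separately, not on whole columns.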
