df <-data.frame(x=c(1:5),y=c(letters[1:5]))
Let's say I want to modify the last row,
update.row<-filter(df,x==5) %>% mutate(y="R")
How do I write this updated row back into the data.frame?
The only way I have found, albeit a strange one, is to do an
anti-join and append the result.
df <-anti_join(df,update.row,by="x") %>%
bind_rows(update.row)
However, it seems like a very inelegant way to achieve a simple task.
Any ideas are much appreciated...
With data.table, we can assign (:=) the value to the rows where i is TRUE. It is very efficient as the assignment is done in place.
library(data.table)
setDT(df)[x==5, y:="R"]
df
# x y
#1: 1 a
#2: 2 b
#3: 3 c
#4: 4 d
#5: 5 R
Since the OP asked about the last row, a more general way is to use .N, which in i refers to the last row:
setDT(df)[.N, y:= "R"]
Or, as @thelatemail mentioned, if we want to replace any row, just give the row index in i, in this case 5.
setDT(df)[5, y:="R"]
If you are insistent on dplyr, perhaps:
df <-data.frame(x=c(1:5),y=c(letters[1:5]))
library(dplyr)
df %>%
mutate(y = as.character(y)) %>%
mutate(y = ifelse(row_number()==n(), "R", y))
# x y
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 R
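For comparison, the same update can be done in base R with plain indexed assignment. This is only a minimal sketch: it assumes y is stored as character (hence stringsAsFactors = FALSE, which is my addition for older R versions) and rebuilds the example data from the question.

```r
# Example data from the question; keep y as character so "R" can be assigned
df <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = FALSE)

# Update by condition: every row where x equals 5
df$y[df$x == 5] <- "R"

# Update by position: the last row, whatever its x value is
df[nrow(df), "y"] <- "R"

df
```

Both assignments touch the same row here, so the second one is only shown as the positional alternative.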
I have a matrix of pairwise comparisons, of which the upper triangle and the diagonal were set to NA.
df <- data.frame(a=c(NA,1,2), b=c(NA,NA,3), c=c(NA,NA,NA))
row.names(df) <- names(df)
I want to transform the matrix to long format, for which the standard procedure is to use reshape2's melt, followed by na.omit, so my desired output would be:
Var1 Var2 Value
a b 1
a c 2
b c 3
However, df$c is all NA and therefore logical, so melt treats it as an id (non-measured) variable.
The output of melt(df) is therefore not what I am looking for.
library(reshape2)
melt(df)
How can I prevent melt from using df$c as id variable?
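The class issue described above can be checked directly. A small sketch using only base R (the names col_classes and m are just for illustration):

```r
df <- data.frame(a = c(NA, 1, 2), b = c(NA, NA, 3), c = c(NA, NA, NA))
row.names(df) <- names(df)

# A column made only of NA is stored as logical, not numeric
col_classes <- sapply(df, class)
col_classes

# Converting to a matrix coerces every column to one common type
m <- as.matrix(df)
is.numeric(m)
```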
The trick is to convert the rownames to a column and then convert to long format. A way to do it in tidyverse would be,
library(tidyverse)
df %>%
rownames_to_column() %>%
gather(var, val, -1) %>%
filter(!is.na(val))
# rowname var val
#1 b a 1
#2 c a 2
#3 c b 3
As @Humpelstielzche mentions in the comments, gather has an na.rm argument, so we can omit the final filtering step, i.e.
df %>%
rownames_to_column() %>%
gather(var, val, -1, na.rm = TRUE)
While you have other answers already, this can be achieved with reshape2 and melt, if the appropriate function is called. In this case you don't want reshape2:::melt.data.frame but rather reshape2:::melt.matrix to be applied. So, try:
melt(as.matrix(df), na.rm=TRUE)
# Var1 Var2 value
#2 b a 1
#3 c a 2
#6 c b 3
If you then take a look at ?reshape2:::melt.data.frame you will see the statement:
This code is conceptually similar to ‘as.data.frame.table’
which means you could also use the somewhat more convoluted:
na.omit(as.data.frame.table(as.matrix(df), responseName="value"))
# Var1 Var2 value
#2 b a 1
#3 c a 2
#6 c b 3
In base R, we can use row and col to get row names and column names respectively and then filter the NA values.
df1 <- data.frame(col = colnames(df)[col(df)], row = rownames(df)[row(df)],
value = unlist(df), row.names = NULL)
df1[!is.na(df1$value), ]
# col row value
#2 a b 1
#3 a c 2
#6 b c 3
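Another base R variant, given here only as a sketch, is to ask which() for the positions of the non-NA cells and look the names up from those positions. The object names m, idx and long are mine.

```r
df <- data.frame(a = c(NA, 1, 2), b = c(NA, NA, 3), c = c(NA, NA, NA))
row.names(df) <- names(df)

m <- as.matrix(df)

# arr.ind = TRUE returns a two-column matrix ("row", "col") of positions
idx <- which(!is.na(m), arr.ind = TRUE)

long <- data.frame(Var1  = rownames(m)[idx[, "row"]],
                   Var2  = colnames(m)[idx[, "col"]],
                   value = m[idx],   # matrix indexing: one value per (row, col) pair
                   stringsAsFactors = FALSE)
long
```

This skips the NA cells up front, so no na.omit step is needed afterwards.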
I have a small example like the following:
df1 = data.frame(Id1=c(1,2,3))
I want to obtain the list of all combinations with replacement which would look like this:
So far I have found the following functions, each of which produces only part of the above table:
a) combn function
t(combn(df1$Id1,2))
# Does not create rows 1,4 and 5 in the above image
b) expand.grid function
expand.grid(df1$Id1,df1$Id1)
# Duplicates rows 2,3 and 5. In my case the combination 1,2 and 2,1
#are the same. Hence I do not need both of them at the same time.
c) CJ function (from data.table)
#install.packages("data.table")
library(data.table)
CJ(df1$Id1,df1$Id1)
#Same problem as the previous function
For your reference, I know that in Python I could do the same using the itertools package (link here: https://www.hackerrank.com/challenges/itertools-combinations-with-replacement/problem)
Is there a way to do this in R?
Here's an alternative using expand.grid: create a unique key for every combination and then remove the duplicates.
library(dplyr)
expand.grid(df1$Id1,df1$Id1) %>%
mutate(key = paste(pmin(Var1, Var2), pmax(Var1, Var2), sep = "-")) %>%
filter(!duplicated(key)) %>%
select(-key) %>%
mutate(row = row_number())
# Var1 Var2 row
#1 1 1 1
#2 2 1 2
#3 3 1 3
#4 2 2 4
#5 3 2 5
#6 3 3 6
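If the ids can be ordered, a shorter base R sketch is to build the full grid and keep only the pairs where the first value does not exceed the second. The names ids, grid and combos are mine, and the count check relies on the usual formula choose(n + k - 1, k) for combinations with replacement.

```r
df1 <- data.frame(Id1 = c(1, 2, 3))
ids <- df1$Id1

# All ordered pairs, then keep one representative per unordered pair
grid   <- expand.grid(Var1 = ids, Var2 = ids)
combos <- grid[grid$Var1 <= grid$Var2, ]
rownames(combos) <- NULL
combos

# Sanity check on the number of rows: n = 3 ids, pairs of size k = 2
nrow(combos) == choose(length(ids) + 2 - 1, 2)
```

Unlike the key-based version, this keeps the smaller id in the first column.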
My goal is to get the same number of rows for each split (based on column Initial). I am trying to basically pad the number of rows so that each person has the same amount, while retaining the Initial column so I can tell them apart. My attempt failed completely. Anybody have suggestions?
df<-data.frame(Initials=c("a","a","b"),data=c(2,3,4))
attach(df)
maxrows=max(table(Initials))+1
arr<-split(df,Initials)
lapply(arr,function(x){
toadd<-maxrows-dim(x)[1]
replicate(toadd,x<-rbind(x,rep(NA,1)))#colnames -1 because col 1 should be the same Initial
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq.int(max(table(df$Initials))) # table() also works when Initials is character rather than factor
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
The analogue in dplyr is
my_rows <- seq.int(max(table(df$Initials))) # table() also works when Initials is character rather than factor
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
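The padding in the .SD[my_rows] approach comes from indexing past the last row of a group, which yields all-NA rows. A rough base R sketch of the same idea follows. The names max_rows and padded are mine, and the group label has to be restored by hand because base R does not carry it along.

```r
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4),
                 stringsAsFactors = FALSE)

max_rows <- max(table(df$Initials))

padded <- do.call(rbind, lapply(split(df, df$Initials), function(g) {
  out <- g[seq_len(max_rows), ]   # row indices beyond nrow(g) give all-NA rows
  out$Initials <- g$Initials[1]   # put the group label back on the padded rows
  out
}))
rownames(padded) <- NULL
padded
```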
Here's a dplyr/tidyr method. We group_by initials, add row_numbers, ungroup, complete row numbers/Initials combinations, then remove our row numbers:
library(dplyr)
library(tidyr)
df %>% group_by(Initials) %>%
mutate(row = row_number()) %>%
ungroup() %>%
complete(Initials, row) %>%
select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate the number of extra rows needed for each initial, combine those initials with NA values, and then rbind the result to the data frame.
max(table(df$Initials)) gives the row count of the initial with the most repeats, in this case 2 (for a). Subtracting each initial's own count, table(df$Initials), from that maximum gives a vector of the number of rows to add per initial. As a bonus, because we used table, that vector is automatically named.
We use the names of the new vector to know 1) what initials to repeat, and 2) how many times should they be repeated.
The rbind above turns the data column into character. To restore its class, assign the result to an object, say newdf, and then run newdf$data <- as.numeric(newdf$data).
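A variant of this approach that keeps data numeric from the start is to build the extra rows as a data frame rather than as a character vector. This is a sketch only. The names counts, extra and newdf are mine, and it assumes the only columns are Initials and data.

```r
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4),
                 stringsAsFactors = FALSE)

counts <- table(df$Initials)
to.add <- as.vector(max(counts) - counts)   # rows missing per initial

# One row per missing observation, with a numeric NA in the data column
extra <- data.frame(Initials = rep(names(counts), to.add),
                    data     = rep(NA_real_, sum(to.add)),
                    stringsAsFactors = FALSE)

newdf <- rbind(df, extra)
newdf <- newdf[order(newdf$Initials), ]     # keep each initial's rows together
rownames(newdf) <- NULL
newdf
```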
I would like to split some text in a data frame column and save it into a data frame together with the row number or an id column.
I used to do this with plyr, but the same approach no longer works in dplyr.
If I understand correctly, my plyr code only works because of a bug in plyr.
So I am looking for the correct way to do this.
This is a minimal example in plyr:
library(plyr)
set.seed(1)
df <- data.frame(a=seq(2),
b=c(paste(sample(letters,3), collapse=';'),
paste(sample(letters,3), collapse=';')),
stringsAsFactors=FALSE)
ddply(df,.(a),summarise,unlist(strsplit(b,';')))
It turns the original data frame:
a b
1 1 g;j;n
2 2 x;f;v
Into this:
a ..1
1 1 g
2 1 j
3 1 n
4 2 x
5 2 f
6 2 v
What would be the correct dplyr solution?
I'm biased in favor of cSplit from the "splitstackshape" package, but you might be interested in unnest from "tidyr" in conjunction with "dplyr":
library(dplyr)
library(tidyr)
df %>%
mutate(b = strsplit(b, ";")) %>%
unnest(b)
# a b
# 1 1 g
# 2 1 j
# 3 1 n
# 4 2 x
# 5 2 f
# 6 2 v
You could do this using cSplit from splitstackshape
library(splitstackshape)
cSplit(df, 'b', ';', 'long')
# a b
#1: 1 g
#2: 1 j
#3: 1 n
#4: 2 x
#5: 2 f
#6: 2 v
Or using dplyr/tidyr
library(dplyr)
library(tidyr)
separate(df, b, c('b1', 'b2', 'b3'), sep=";") %>%
gather(Var, b, -a) %>%
select(-Var) %>%
arrange(a)
Or another option would be to use do
df %>%
group_by(a) %>%
do(data.frame(b=unlist(strsplit(.$b, ';'))))
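For reference, the same reshaping can be sketched in base R without any packages: split each string, then repeat each id once per piece. The names parts and long are mine, and the data frame is typed in directly rather than sampled so that the example is reproducible.

```r
df <- data.frame(a = 1:2, b = c("g;j;n", "x;f;v"), stringsAsFactors = FALSE)

# One character vector of pieces per row
parts <- strsplit(df$b, ";", fixed = TRUE)

# lengths(parts) says how many times each id must be repeated
long <- data.frame(a = rep(df$a, lengths(parts)),
                   b = unlist(parts),
                   stringsAsFactors = FALSE)
long
```

This also works when rows contain different numbers of pieces, since rep() takes a separate count per row.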
I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, the first row from each group.
For example, I'd like to convert this
> d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17))
> d
x y z
1 1 10 20
2 1 11 19
3 2 12 18
4 4 13 17
Into this:
x y z
1 1 11 19
2 2 12 18
3 4 13 17
I'm using aggregate to do this currently, but the performance is unacceptable with more data:
> d.ordered = d[order(-d$y),]
> aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})
I've tried split/unsplit with the same function argument as here, but unsplit complains about duplicate row numbers.
Is rle a possibility? Is there an R idiom to convert rle's length vector into the indices of the rows that start each run, which I can then use to pluck those rows out of the data frame?
Maybe duplicated() can help:
R> d[ !duplicated(d$x), ]
x y z
1 1 10 20
3 2 12 18
4 4 13 17
R>
Edit: Shucks, never mind. This picks the first row in each block of repetitions, whereas you wanted the last. So here is another attempt using plyr:
R> ddply(d, "x", function(z) tail(z,1))
x y z
1 1 11 19
2 2 12 18
3 4 13 17
R>
Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z using tail(z, 1).
Just to add a little to what Dirk provided... duplicated has a fromLast argument that you can use to select the last row:
d[ !duplicated(d$x,fromLast=TRUE), ]
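On the rle idea raised in the question: the run lengths can be turned into row indices with cumsum. A minimal sketch, assuming rows with the same x are already adjacent (rle only sees consecutive repeats). The names runs, last_idx and first_idx are mine.

```r
d <- data.frame(x = c(1, 1, 2, 4), y = c(10, 11, 12, 13), z = c(20, 19, 18, 17))

runs <- rle(d$x)

last_idx  <- cumsum(runs$lengths)            # row where each run ends
first_idx <- last_idx - runs$lengths + 1     # row where each run starts

d[first_idx, ]   # first row of each group
d[last_idx, ]    # last row of each group
```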
Here is a data.table solution which will be time and memory efficient for large data sets
library(data.table)
DT <- as.data.table(d) # convert to data.table
setkey(DT, x) # set key to allow binary search using `J()`
DT[J(unique(x)), mult ='last'] # subset out the last row for each x
DT[J(unique(x)), mult ='first'] # if you wanted the first row for each x
There are a couple of options using dplyr (here df stands for your data frame, d in the question):
library(dplyr)
df %>% distinct(x, .keep_all = TRUE)
df %>% group_by(x) %>% filter(row_number() == 1)
df %>% group_by(x) %>% slice(1)
You can use more than one column with both distinct() and group_by():
df %>% distinct(x, y, .keep_all = TRUE)
The group_by() and filter() approach can be useful if there is a date or some other sequential field and
you want to ensure the most recent observation is kept; adding slice(1) then keeps a single row when several rows tie:
df %>% group_by(x) %>% filter(date == max(date)) %>% slice(1)
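A base R counterpart of the "most recent row per group" idea, as a sketch: sort by group and date, then keep the last row of each group with duplicated(fromLast = TRUE). The data frame obs and its columns are made up for illustration. Note that on ties this keeps the last tied row in the original order, whereas the slice(1) pipeline above keeps the first.

```r
# Hypothetical data: two observations per x, with a tie on date for x == 2
obs <- data.frame(x     = c(1, 1, 2, 2),
                  date  = as.Date(c("2020-01-01", "2020-03-01",
                                    "2020-02-01", "2020-02-01")),
                  value = c(10, 11, 12, 13))

# Sort so that the most recent date comes last within each x
obs_sorted <- obs[order(obs$x, obs$date), ]

# Keep only the last row for each x
latest <- obs_sorted[!duplicated(obs_sorted$x, fromLast = TRUE), ]
latest
```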