My data frame looks something like this:
USER URL
1 homepage.com
1 homepage.com/welcome
1 homepage.com/overview
1 homepage.com/welcome
What I want is a vector with the following values:
UNIQUE
1
2
3
3
How do I do that?
We could use cumsum and duplicated:
df$unique <- cumsum(!duplicated(df$URL))
df$unique
#[1] 1 2 3 3
duplicated returns a logical vector marking each value that has already appeared earlier in the vector. Negating it (!) marks the first occurrences, and cumsum over that gives the running count of unique values.
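To make the explanation concrete, here is a small sketch that spells out each intermediate step on the URLs from the question. The variable names (url, is_dup, is_new, running) are only for illustration.

```r
# URLs from the question, in row order
url <- c("homepage.com", "homepage.com/welcome",
         "homepage.com/overview", "homepage.com/welcome")

is_dup  <- duplicated(url)  # TRUE where the value was already seen above
is_new  <- !is_dup          # TRUE on the first occurrence of each value
running <- cumsum(is_new)   # running count of distinct values so far

print(running)
```

Only the fourth URL repeats an earlier one, so the count stops increasing there.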
Using dplyr to add a new column:
library(dplyr)
df %>%
  mutate(Dups = cumsum(!duplicated(URL)))
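The example data has a USER column with a single user. If the real data had several users and the count should restart for each one (an assumption, the question does not say so), a base R sketch with ave might look like this. The second user's rows are made up for illustration.

```r
# Hypothetical data: user 2 is added only to show the per-user restart
df <- data.frame(
  USER = c(1, 1, 1, 1, 2, 2),
  URL  = c("homepage.com", "homepage.com/welcome", "homepage.com/overview",
           "homepage.com/welcome", "homepage.com", "homepage.com")
)

# ave() is applied to row indices so the result stays integer;
# within each USER, count first occurrences of URL cumulatively
df$unique <- ave(seq_along(df$URL), df$USER,
                 FUN = function(i) cumsum(!duplicated(df$URL[i])))

print(df)
```

Passing row indices instead of the URL column avoids ave coercing the counts back to character.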
Note that solutions to other questions may resolve this one, such as Replace NA with previous or next value, by group, using dplyr. However, this question isn't about replacing NAs; NAs are acceptable here in certain circumstances. The goal is to replace every cell in a group that falls after the first non-NA in that group with that first non-NA value. When I researched this, I didn't find solutions that fit, because I only want to replace NAs in certain circumstances: NAs that occur before the first non-NA in a group remain, and a group that contains only NAs keeps all of them.
Is there a method, preferably with dplyr or data.table, for extending a target value down a data frame when a row condition is met within a group? I vaguely remember an rleid function in data.table that may do the trick, but I'm having trouble implementing it. The result can be either a new column or overwrite the existing column "State" in the example below.
For example, starting with the data frame below, I'd like to carry the target value of 1 down to the end of each ID group, starting from the first occurrence of 1 in that group:
myDF <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  State = c(NA, NA, 1, NA, 1, NA, NA, NA, NA, NA)
)
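You can use fill from tidyr, grouped by ID. It carries the most recent non-NA value downward within each group, which for this data is the same as the first non-NA value:
library(dplyr)
library(tidyr)
myDF %>%
  group_by(ID) %>%
  fill(State, .direction = "down")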
# A tibble: 10 × 2
# Groups:   ID [3]
      ID State
   <dbl> <dbl>
 1     1    NA
 2     1    NA
 3     1     1
 4     1     1
 5     2     1
 6     2     1
 7     2     1
 8     3    NA
 9     3    NA
10     3    NA
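One caveat: fill carries forward the most recent non-NA value, so it matches the request only while each group has at most one distinct non-NA value, as in this example. If a group could contain several different non-NA values and the first one should win, a base R sketch along these lines may be closer to what the question describes. The helper name first_non_na_down is made up for illustration.

```r
myDF <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  State = c(NA, NA, 1, NA, 1, NA, NA, NA, NA, NA)
)

# Hypothetical helper: overwrite everything from the first non-NA onward
# with that first non-NA value; leading NAs and all-NA vectors are untouched
first_non_na_down <- function(x) {
  idx <- which(!is.na(x))
  if (length(idx) == 0) return(x)
  x[idx[1]:length(x)] <- x[idx[1]]
  x
}

# Apply the helper within each ID group
myDF$State <- ave(myDF$State, myDF$ID, FUN = first_non_na_down)

print(myDF)
```

On the example data this gives the same State column as the fill output above.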
I have a list of integers, all between 1 and 365. Some integers appear multiple times and some do not appear at all. I would like to use a function like count to build a data frame giving the number of occurrences of each integer, including a count of 0 for integers that do not appear.
df
x freq
1 0
2 1
3 3
4 0
Currently, the rows for 1 and 4 are missing from the output of my count call, df=count(list)
We can use factor with the levels specified, so that missing elements are kept as levels and reported with a count of 0:
table(factor(df$x, levels = 1:4))
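As a sketch of how this could be turned into the data frame shown in the question, assuming the raw values are 2, 3, 3, 3 (which is what the desired counts imply) and using levels 1:4 in place of the real 1:365:

```r
# Assumed input, chosen to reproduce the counts in the question
x <- c(2, 3, 3, 3)

# Fixing the levels makes table() report absent values with a count of 0
counts <- table(factor(x, levels = 1:4))

# Convert the table to a two-column data frame
freq_df <- data.frame(
  x    = as.integer(names(counts)),
  freq = as.vector(counts)
)

print(freq_df)
```

For the real data, levels = 1:365 would cover every possible value.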
It seems I cannot get to the bottom of one issue and I haven't found a satisfactory answer anywhere.
So, here is my problem. I have a data frame with multiple columns and I want to change the values in one column if the value matches a string in another column.
For example:
id Var1 Var2 .................Var10
1 A 1
2 A 1
3 D
R
F .
.
1000 A 1
What I want to achieve ultimately is to change values in Column 10 to another value of my choice if values in Column 1 match a declared value. As an example, if value in column 1 is "A", then change value in Column 10 to NA.
I have tried to use this code, but failed:
if(df$Var1 = "A"){df$Var10 <- "NA"}
Thank you very much for your help!
df$Var10[df$Var1 == "A"] <- NA  # unquoted NA is a true missing value; "NA" in quotes would store a character string
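The if attempt fails for two reasons: = inside if() is an assignment rather than a comparison, which is a syntax error there, and if expects a single TRUE/FALSE rather than a whole column. Logical indexing handles every row at once. A small runnable sketch, with a made-up five-row data frame standing in for the real one:

```r
# Hypothetical stand-in for the real data (only the relevant columns)
df <- data.frame(
  id    = 1:5,
  Var1  = c("A", "A", "D", "R", "A"),
  Var10 = c(1, 1, 2, 3, 1)
)

# df$Var1 == "A" is a logical vector with one entry per row;
# it selects the rows of Var10 that get overwritten
df$Var10[df$Var1 == "A"] <- NA

print(df)
```

Rows where Var1 is not "A" keep their original Var10 values.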
I am searching for an efficient and fast way to do the following:
I have a data frame with, say, 2 variables, A and B, where the values for A can occur several times:
mat<-data.frame('VarA'=rep(seq(1,10),2),'VarB'=rnorm(20))
VarA VarB
1 0.95848233
2 -0.07477916
3 2.08189370
4 0.46523827
5 0.53500190
6 0.52605101
7 -0.69587974
8 -0.21772252
9 0.29429577
10 3.30514605
1 0.84938361
2 1.13650996
3 1.25143046
Now I want to get a vector giving me for every unique value of VarA
unique(mat$VarA)
the maximum of VarB conditional on VarA.
In the example here that would be
1 0.95848233
2 1.13650996
3 2.08189370
etc...
My data-frame is very big so I want to avoid the use of loops.
Try this:
library(dplyr)
mat %>%
  group_by(VarA) %>%
  summarise(max = max(VarB))
Try the data.table package:
library(data.table)
mat <- data.table(mat)
result <- mat[, .(max = max(VarB)), by = VarA]  # one row per VarA with its maximum VarB
print(result)
Try this:
library(plyr)
ddply(mat, .(VarA), summarise, VarB=max(VarB))
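Since the question asks for a vector and wants to avoid explicit loops, a base R sketch with tapply may also be worth trying. It is not one of the answers above, and the seed is set only so the example is reproducible.

```r
set.seed(1)  # assumption: any seed works, this just fixes the random values
mat <- data.frame(VarA = rep(seq(1, 10), 2), VarB = rnorm(20))

# tapply splits VarB by VarA and applies max to each piece;
# the result is a named vector, one element per unique VarA
max_by_group <- tapply(mat$VarB, mat$VarA, max)

print(max_by_group)
```

The names of the result are the VarA values, so max_by_group["3"] looks up the maximum for VarA equal to 3.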
I would like to remove rows that have specific values for columns that match values in another data frame.
a<-c(1,1,2,2,2,4,5,5,5,5)
b<-c(10,10,22,30,30,30,40,40,40,40)
c<-c(1,2,1,2,2,2,2,1,1,2)
d<-rnorm(10)
data<-data.frame(a,b,c,d)
a<-c(2,5)
b<-c(30,40)
c<-c(2,1)
x<-data.frame(a,b,c)
The desired result y is:
a b c d
1 10 1 -0.2509255
1 10 2 0.4142277
2 22 1 -0.1340514
4 30 2 -1.5372009
5 40 2 1.9001932
5 40 2 -1.2825212
I tried the following, which did not work:
y<-data[!data$a==a & !data$b==b & !data$c==c,]
y<-subset(data, !data$a==x$a & !data$b==x$b & !data$c==x$c)
I also tried to just flag the ones that should be removed in order to subset in a second step, but this did not work either:
y<-data
y$rm<-ifelse(y$a==x$a & y$b==x$b & y$c==x$c, 1, 0)
The real "data" and "x" are much longer, and there are variable number of rows in data that match each row in x.
We can use anti_join from dplyr. It returns all rows of 'data' that have no match in 'x'. The by argument specifies the columns used for matching.
library(dplyr)
anti_join(data, x, by=c('a', 'b', 'c'))
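As for why the attempts in the question did not work: data$a==x$a compares element by element with recycling (row 1 of data against row 1 of x, row 2 against row 2, row 3 against row 1 again, and so on) instead of testing whether a row's combination of values appears anywhere in x. If dplyr is not available, a base R sketch that tests membership on a combined key could look like this. The key_data and key_x names and the "_" separator are choices made for illustration.

```r
set.seed(42)  # only so that d is reproducible
data <- data.frame(
  a = c(1, 1, 2, 2, 2, 4, 5, 5, 5, 5),
  b = c(10, 10, 22, 30, 30, 30, 40, 40, 40, 40),
  c = c(1, 2, 1, 2, 2, 2, 2, 1, 1, 2),
  d = rnorm(10)
)
x <- data.frame(a = c(2, 5), b = c(30, 40), c = c(2, 1))

# Build one key per row from the three matching columns
key_data <- paste(data$a, data$b, data$c, sep = "_")
key_x    <- paste(x$a, x$b, x$c, sep = "_")

# Keep only the rows of data whose key does not occur in x
y <- data[!(key_data %in% key_x), ]

print(y)
```

This keeps the same six rows as the desired result in the question. The pasted key assumes the separator does not occur inside the values themselves.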