Removing rows on column value by ID in R

Apologies if this is posted elsewhere. I searched here and elsewhere and found things that were close but not quite what I needed. After sinking a couple of hours into this, I'm posting!
I need to remove rows from a data set where value1 is duplicated within the same id. So in the following data frame I'd only want to remove row 3. I do not want to remove row 10 or row 9. If it makes a difference, in the actual data the values are dates.
I know the solution is probably very simple, but I've yet to get it exactly right. Thanks!
x <- data.frame(cbind(id=c(1,2,2,2,3,3,4,5,6,6), value1=c(6,8,8,1,9,5,4,3,8,4), value2=1:10))
> x
id value1 value2
1 1 6 1
2 2 8 2
3 2 8 3
4 2 1 4
5 3 9 5
6 3 5 6
7 4 4 7
8 5 3 8
9 6 8 9
10 6 4 10
I want to end up with:
> x
id value1 value2
1 1 6 1
2 2 8 2
4 2 1 4
5 3 9 5
6 3 5 6
7 4 4 7
8 5 3 8
9 6 8 9
10 6 4 10

Try duplicated:
> x[!duplicated(x[1:2]), ]
id value1 value2
1 1 6 1
2 2 8 2
4 2 1 4
5 3 9 5
6 3 5 6
7 4 4 7
8 5 3 8
9 6 8 9
10 6 4 10
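Since the question mentions that the real values are dates, here is a minimal sketch of the same idea with a Date column and with the key columns selected by name rather than by position. The dates are made up for illustration (an arbitrary start date plus the example's value1 offsets); only the column names come from the question.

```r
# Same layout as the question, but value1 is a Date column (dates are invented)
x <- data.frame(
  id     = c(1, 2, 2, 2, 3, 3, 4, 5, 6, 6),
  value1 = as.Date("2020-01-01") + c(6, 8, 8, 1, 9, 5, 4, 3, 8, 4),
  value2 = 1:10
)

# duplicated() on a data frame flags rows whose (id, value1) pair already
# appeared in an earlier row; negating it keeps the first occurrence of each pair
keep   <- !duplicated(x[c("id", "value1")])
result <- x[keep, ]
```

Selecting the columns by name should keep the call working if the column order changes later.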

Related

Create a new variable based on existing variable

My current dataset looks like this:
Order V1
1 7
2 5
3 8
4 5
5 8
6 3
7 4
8 2
1 8
2 6
3 3
4 4
5 5
6 7
7 3
8 6
I want to create a new variable called "V2" based on the variables "Order" and "V1". Within every block of 8 rows, "V2" should be 0 where "Order" equals 1; otherwise, "V2" takes the value of "V1" from the previous row.
This is the dataset that I want
Order V1 V2
1 7 0
2 5 7
3 8 5
4 5 8
5 8 5
6 3 8
7 4 3
8 2 4
1 8 0
2 6 8
3 3 6
4 4 3
5 5 4
6 7 5
7 3 7
8 6 3
Since my actual dataset is very large, I'm trying to use a for loop with an if statement to generate "V2", but my code keeps failing. I'd appreciate any help, and I'm open to other approaches. Thank you!
(Up front: I am assuming that the order of Order is perfectly controlled.)
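You simply need ifelse and lag. Note that this has to be lag from dplyr; base R's stats::lag does not shift the values of a plain vector: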
df <- read.table(text="Order V1
1 7
2 5
3 8
4 5
5 8
6 3
7 4
8 2
1 8
2 6
3 3
4 4
5 5
6 7
7 3
8 6 ", header=TRUE)
# dplyr::lag shifts V1 down by one row (requires the dplyr package)
df$V2 <- ifelse(df$Order == 1, 0, dplyr::lag(df$V1))
df
# Order V1 V2
# 1 1 7 0
# 2 2 5 7
# 3 3 8 5
# 4 4 5 8
# 5 5 8 5
# 6 6 3 8
# 7 7 4 3
# 8 8 2 4
# 9 1 8 0
# 10 2 6 8
# 11 3 3 6
# 12 4 4 3
# 13 5 5 4
# 14 6 7 5
# 15 7 3 7
# 16 8 6 3
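To illustrate why the choice of lag matters, here is a small base R sketch. The vector v is just the first four V1 values from the question; the names unshifted and shifted are mine.

```r
v <- c(7, 5, 8, 5)

# stats::lag only changes the time-series attributes of its argument;
# the values themselves stay in the same positions
unshifted <- as.numeric(stats::lag(v))

# A base R way to shift by one position, padding the first slot with NA
shifted <- c(NA, head(v, -1))
```

So if dplyr is not attached, a bare lag() call resolves to stats::lag and the ifelse above would simply copy V1 wherever Order is not 1.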
A base R alternative that avoids lag (the data frame is called dat here):
with(dat, {V2 <- c(0, head(V1, -1)); V2[Order == 1] <- 0; dat$V2 <- V2; dat})
Order V1 V2
1 1 7 0
2 2 5 7
3 3 8 5
4 4 5 8
5 5 8 5
6 6 3 8
7 7 4 3
8 8 2 4
9 1 8 0
10 2 6 8
11 3 3 6
12 4 4 3
13 5 5 4
14 6 7 5
15 7 3 7
16 8 6 3
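Both answers shift the whole column and then overwrite the rows where Order is 1. A variant that makes the blocks explicit is sketched below, assuming that every block starts with Order equal to 1 (the block variable and the use of ave are my additions, not part of the answers above).

```r
df <- data.frame(
  Order = rep(1:8, 2),
  V1    = c(7, 5, 8, 5, 8, 3, 4, 2, 8, 6, 3, 4, 5, 7, 3, 6)
)

# Number the blocks: the counter increases each time Order is 1
block <- cumsum(df$Order == 1)

# Within each block, shift V1 down by one row and start the block with 0
df$V2 <- ave(df$V1, block, FUN = function(v) c(0, head(v, -1)))
```

This is vectorised per block, so it should scale much better than an explicit for loop on a large data set.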

Grouping cases with at least three variables in common in R

I want to group my dataset by multiple variables and then id those groups. I can id groups when I group by only one variable, using dplyr with group_indices.
But I want to group cases that have the same value on at least one of a certain set of variables, and then identify the group each case belongs to. How can I do this in R?
I have the following dataset
NPI name adress phone
1 1 1 1
2 1 1 1
3 2 2 2
4 2 3 3
5 3 4 4
6 3 4 5
7 4 5 6
8 5 6 6
9 6 7 7
10 7 8 8
11 1 9 9
I want cases to be grouped when they have at least one variable of the three I listed (name, adress, phonenumber) in common.
Cases with most in common to each other should be grouped over cases that have the least in common.
So I want to create a grouping variable which gives cases the same value if they're in the same group.
You can assume the hierarchy of name>address>phone
NPI name adress phone org
1 1 1 1 1
2 1 1 1 1
3 2 2 2 2
4 2 3 3 2
5 3 4 4 3
6 3 4 5 3
7 4 5 6 4
8 5 6 6 4
9 6 7 7 5
10 7 8 8 6
11 1 9 9 1
In my real dataset I don't have numbers but names, actual addresses and phone numbers. So all the variables I'm working with are string variables.
Try this with dplyr:
library(dplyr)
df %>%
  arrange(name, adress, phone) %>%
  mutate(group = c(1, ifelse((name != lag(name)) & (adress != lag(adress)) & (phone != lag(phone)), 1, 0)[-1]),
         group = cumsum(group)) %>%
  arrange(NPI)
Result:
NPI name adress phone group
1 1 1 1 1 1
2 2 1 1 1 1
3 3 2 2 2 2
4 4 2 3 3 2
5 5 3 4 4 3
6 6 3 4 5 3
7 7 4 5 6 4
8 8 5 6 6 4
9 9 6 7 7 5
10 10 7 8 8 6
11 11 1 9 9 1
Note:
This works even if name, adress, and phone are all characters. As long as the id column (NPI) is numeric, the final data.frame will be in the correct order.
Data:
df = read.table(text = " NPI name adress phone
1 1 1 1
2 1 1 1
3 2 2 2
4 2 3 3
5 3 4 4
6 3 4 5
7 4 5 6
8 5 6 6
9 6 7 7
10 7 8 8
11 1 9 9 ", header = TRUE)
library(dplyr)
df = df %>% mutate_at(vars(-NPI), as.character)
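The dplyr answer compares each row only with the row directly before it after sorting, so it can miss matches between rows that do not end up adjacent. A possible alternative, sketched here in base R, treats "shares at least one of name, adress, phone" as a link between rows and labels the connected groups with a small union-find. The function name group_by_any_match is hypothetical, and the sketch does not implement the "most in common" preference or the name>address>phone hierarchy from the question; it only merges rows that are linked, directly or through other rows.

```r
df <- data.frame(
  NPI    = 1:11,
  name   = as.character(c(1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 1)),
  adress = as.character(c(1, 1, 2, 3, 4, 4, 5, 6, 7, 8, 9)),
  phone  = as.character(c(1, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9))
)

# Hypothetical helper: rows get the same label if they are connected by a
# chain of shared values in any of the given columns
group_by_any_match <- function(data, cols) {
  n <- nrow(data)
  parent <- seq_len(n)              # each row starts as its own group

  # Follow parent links until reaching the representative row of a group
  find <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  for (col in cols) {
    # For every row, the first row that holds the same value in this column
    first_row <- match(data[[col]], data[[col]])
    for (i in seq_len(n)) {
      a <- find(i)
      b <- find(first_row[i])
      if (a != b) parent[max(a, b)] <- min(a, b)   # merge the two groups
    }
  }

  roots <- vapply(seq_len(n), find, integer(1))
  match(roots, unique(roots))       # relabel as 1, 2, 3, ... in row order
}

df$org <- group_by_any_match(df, c("name", "adress", "phone"))
```

Note that match treats missing values as equal to each other, so rows with NA in the same column would be linked; real data would probably need those handled first.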

R indirect reference in data frame

I would like to refer to values in a data frame column with the row index being dependent on the value of another column.
Example:
value lag laggedValue
1 1 2
2 2 4
3 3 6
4 2 6
5 1 6
6 3 9
7 3 10
8 1 9
9 1 10
10 2
In Excel I use this formula in column "laggedValue":
=INDIRECT("B"&(ROW(B2)+C2))
How can I do this in an R data frame?
Thanks!
For row r with associated lag value lag[r] it looks like you're trying to create a new column that is the (r+lag[r])th element of value (or a missing value if this is out of bounds). You can do this with:
dat$laggedValue <- dat$value[seq(nrow(dat)) + dat$lag]
dat
value lag laggedValue
1 1 1 2
2 2 2 4
3 3 3 6
4 4 2 6
5 5 1 6
6 6 3 9
7 7 3 10
8 8 1 9
9 9 1 10
10 10 2 NA
Other commenters are mentioning that it looks like you're just adding the value and lag columns because your value column has the elements 1 through 10, but this solution will work even when your value column has other data stored in it.
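To make that last point concrete, here is a small sketch with a value column that is not simply 1 through n. The numbers are invented for illustration, and the explicit bounds check is my addition: without it, a computed index of 0 or below would not give NA.

```r
dat <- data.frame(
  value = c(10, 20, 30, 40, 50),
  lag   = c(1, 2, 2, 1, 1)
)

# Target row for each row: its own position plus its lag
idx <- seq_len(nrow(dat)) + dat$lag

# Mark anything outside 1..nrow as NA so the lookup returns NA there
idx[idx < 1 | idx > nrow(dat)] <- NA

dat$laggedValue <- dat$value[idx]
```

Here the targets are rows 2, 4, 5, 5 and 6, so the last entry falls off the end and becomes NA.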
Assuming the same thing as @rawr here:
dat <- data.frame(value=c(1:10),
lag=c(1,2,3,2,1,3,3,1,1,2))
dat$laggedValue <- dat$value + dat$lag
dat
value lag laggedValue
1 1 1 2
2 2 2 4
3 3 3 6
4 4 2 6
5 5 1 6
6 6 3 9
7 7 3 10
8 8 1 9
9 9 1 10
10 10 2 12

R, Using reshape to pull pre post data

I have a simple data frame as follows
x = data.frame(id = seq(1,10),val = seq(1,10))
x
id val
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
I want to add 4 more columns. The first 2 are the previous two rows and the next two are the next two rows. For the first two rows and last two rows it needs to write out as NA.
How do I accomplish this using cast in the reshape package?
The final output would look like
1 1 NA NA 2 3
2 2 NA 1 3 4
3 3 1 2 4 5
4 4 2 3 5 6
... and so on...
Thanks much in advance
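After you gave the example, I changed the solution. Here dat is the question's x; since id and val are identical in the example, shifting either column gives the same numbers:
mat <- cbind(dat,
             c(c(NA,NA),head(dat$id,-2)),
             c(c(NA),head(dat$val,-1)),
             c(tail(dat$id,-1),c(NA)),
             c(tail(dat$val,-2),c(NA,NA)))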
colnames(mat) <- c('id','val','idp','valp','idn','valn')
id val idp valp idn valn
1 1 1 NA NA 2 3
2 2 2 NA 1 3 4
3 3 3 1 2 4 5
4 4 4 2 3 5 6
5 5 5 3 4 6 7
6 6 6 4 5 7 8
7 7 7 5 6 8 9
8 8 8 6 7 9 10
9 9 9 7 8 10 NA
10 10 10 8 9 NA NA
Here is a solution with sapply. First, choose the relative offsets for the new columns:
lags <- c(-2, -1, 1, 2)
Create the new columns:
newcols <- sapply(lags,
                  function(l) {
                    tmp <- seq.int(nrow(x)) + l
                    x[replace(tmp, tmp < 1 | tmp > nrow(x), NA), "val"]
                  })
Bind together:
cbind(x, newcols)
The result:
id val 1 2 3 4
1 1 1 NA NA 2 3
2 2 2 NA 1 3 4
3 3 3 1 2 4 5
4 4 4 2 3 5 6
5 5 5 3 4 6 7
6 6 6 4 5 7 8
7 7 7 5 6 8 9
8 8 8 6 7 9 10
9 9 9 7 8 10 NA
10 10 10 8 9 NA NA
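The sapply approach can also be written with a small helper and named offsets, so the new columns get readable names instead of 1 to 4. This is only a sketch; the helper shift and the column names prev2, prev1, next1, next2 are my own choices.

```r
x <- data.frame(id = 1:10, val = 1:10)

# Hypothetical helper: value found k rows away (negative k looks backwards),
# NA where that row does not exist
shift <- function(v, k) {
  idx <- seq_along(v) + k
  idx[idx < 1 | idx > length(v)] <- NA
  v[idx]
}

# Names on the offsets become the column names of the sapply result
lags <- c(prev2 = -2, prev1 = -1, next1 = 1, next2 = 2)
out  <- cbind(x, sapply(lags, function(k) shift(x$val, k)))
```

Every new column here is taken from val, so the result stays correct when id and val differ.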

Remove rows from a single-column data frame

When I try to remove the last row from a single column data frame, I get a vector back instead of a data frame:
> df = data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> df[-(length(df[,1])),]
[1] 1 2 3 4 5 6 7 8 9
The behavior I'm looking for is what happens when I use this command on a two-column data frame:
> df = data.frame(a=1:10,b=11:20)
> df
a b
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
> df[-(length(df[,1])),]
a b
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
My code is general, and I don't know a priori whether the data frame will contain one or many columns. Is there an easy workaround for this problem that will let me remove the last row no matter how many columns exist?
Try adding the drop = FALSE option:
R> df[-(length(df[,1])), , drop = FALSE]
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
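For general code, it may help to wrap this in a small function so the drop = FALSE is never forgotten. A minimal sketch (the function name remove_last_row is made up):

```r
# Drop the last row and always return a data frame, whatever the column count
remove_last_row <- function(d) {
  d[-nrow(d), , drop = FALSE]
}

one_col <- remove_last_row(data.frame(a = 1:10))
two_col <- remove_last_row(data.frame(a = 1:10, b = 11:20))
```

head(d, -1) should give the same rows and also keeps the data frame structure for a single column.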
