Delete duplicated records within each row of a data frame in R

I would like to get rid of duplicated records in each row of my df:
df <- data.frame(X1 = c(1,3,5), X2 = c(1,2,4), X3 = c(2,3,7))
X1 X2 X3
1 1 1 2
2 3 2 3
3 5 4 7
I want to get this:
X1 X2 X3
1 1 NA 2
2 3 2 NA
3 5 4 7
Now, I can achieve this using apply:
data.frame(t(apply(df,1, function(row) ifelse(!duplicated(row), row, NA))))
but it seems unlikely that there isn't a more compact (and perhaps efficient) way of achieving this.
Am I missing a command or package here?
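One more compact possibility (a sketch added here, not from the original post) is to index the data frame with a logical matrix of within-row duplicates; note that, unlike the apply() version above, this modifies df in place:
# apply(df, 1, duplicated) flags duplicates within each row; its columns
# correspond to rows of df, so transpose before using it as an index.
is.na(df) <- t(apply(df, 1, duplicated))
df
#   X1 X2 X3
# 1  1 NA  2
# 2  3  2 NA
# 3  5  4  7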

Related

Rename multiple columns with series index using dplyr in R

My data frame looks like this
X0 <- c(11,2,3,4)
X1 <- c(10,2,3,4)
X2 <- c(8,2,3,4)
X3 <- c(4,6,3,4)
test <- data.frame(X0,X1,X2,X3)
X0 X1 X2 X3
1 11 10 8 4
2 2 2 2 6
3 3 3 3 3
4 4 4 4 4
I would like to rename the first three columns using the character "t" and the series 0:2.
I want my data frame to look like this
t0 t1 t2 X3
1 11 10 8 4
2 2 2 2 6
3 3 3 3 3
4 4 4 4 4
EDIT
It works like this
test %>%
  rename_at(vars(X0:X2), list(~ paste0("t", 0:2)))
Or using rename_with
library(dplyr)
library(stringr)
test %>%
  rename_with(~ str_c('t', 0:2), X0:X2)
Here is a data.table option with setnames
setnames(setDT(test),1:3,function(v) gsub("X","t",v))
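A plain base R alternative (my own sketch, not part of the original answers), starting from the original test data frame:
# Replace the leading "X" with "t" in the first three column names only.
names(test)[1:3] <- sub("^X", "t", names(test)[1:3])
names(test)
# [1] "t0" "t1" "t2" "X3"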

create new column (with outcome min or NA) from multiple selected columns

My data has many columns and subjects, but to illustrate it more simply, let's say I have 7 subjects with 3 variables/columns called x1, x2 and x3 (values range from 1 to 3, plus NAs). In the analysis I want to run, it is important that I explicitly name the columns I use, since I cannot just pass the whole data frame (there are more variables/columns in it).
data <- data.frame(id = c(1,2,3,4,5,6,7), x1 = c(1,2,2,NA,3,3,1), x2 = c(NA,3,1,NA,2,3,2), x3 = c(NA,2,NA,NA,3,NA,1))
  id x1 x2 x3
1  1  1 NA NA
2  2  2  3  2
3  3  2  1 NA
4  4 NA NA NA
5  5  3  2  3
6  6  3  3 NA
7  7  1  2  1
The class of x1, x2 and x3 is numeric.
Out of that, I want to create a variable/column called 'x4' that:
- gives me the lowest value across x1, x2 and x3 in each row;
- ignores any NA within a row of x1, x2, x3;
- is NA if ALL of them are NA (NOT Inf, which is what my code gives now);
- if the two lowest numbers are tied, just displays one of them. So like this:
data <- data.frame(id = c(1,2,3,4,5,6,7), x1 = c(1,2,2,NA,3,3,1), x2 = c(NA,3,1,NA,2,3,2), x3 = c(NA,2,NA,NA,3,NA,1), x4 = c(1,2,1,NA,2,3,1))
  id x1 x2 x3 x4
1  1  1 NA NA  1
2  2  2  3  2  2
3  3  2  1 NA  1
4  4 NA NA NA NA
5  5  3  2  3  2
6  6  3  3 NA  3
7  7  1  2  1  1
I managed to find a very similar question, and I can mostly make it work: min for each row with dataframe in R
data$x4 <- apply(data[, c("x1","x2","x3")],1, FUN=min, na.rm = TRUE)
The problem I have now is that in the case of all NAs (so id number 4), my outcome is not NA but 'Inf'.
Question 1: How can I make it become NA instead of Inf? I can of course do that afterwards like this:
is.na(data$x4) <- sapply(data$x4, is.infinite)
But I wonder if there is a nice way to do that already with/inside the previous code?
Also, rather than using apply with FUN = min inside it, I would like to try to make it work with code like the one below. Question 2: is using this other code below possible?
data$x4 <- min(data[, c("x1","x2","x3")],1 , na.rm = TRUE)
With this, x4 gets the outcome '1' every time. I guess it just shows the lowest number (1) of the whole selection? I don't understand why; I am already using ', 1' but it doesn't help.
I hope somebody can help me (R and Stack Overflow newbie) out, thanks!
You are looking for the pmin function, which returns the parallel (element-wise) minima of its inputs. Below are two approaches using pmin:
data$minIget <- do.call(pmin, c(data[, -1], na.rm = TRUE)) # Approach 1: using do.call
data %>% rowwise() %>% mutate(minIget = pmin(x1, x2, x3, na.rm = TRUE)) # Approach 2: using tidyverse
output:
# A tibble: 7 × 5
# Rowwise:
     id    x1    x2    x3 minIget
  <dbl> <dbl> <dbl> <dbl>   <dbl>
1     1     1    NA    NA       1
2     2     2     3     2       2
3     3     2     1    NA       1
4     4    NA    NA    NA      NA
5     5     3     2     3       2
6     6     3     3    NA       3
7     7     1     2     1       1
You can test whether all values are NA before calling min, like this:
apply(data[, c("x1","x2","x3")], 1, function(x)
  if (all(is.na(x))) NA else min(x, na.rm = TRUE))
#[1]  1  2  1 NA  2  3  1
min(data[, c("x1","x2","x3")], 1, na.rm = TRUE) gives you the minimum of 1 and of all values in data[, c("x1","x2","x3")] taken together, i.e. a single number, because min() collapses everything it receives rather than working row by row.
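An illustrative sketch of the contrast (my addition, not part of the original answer):
# min() collapses all of its arguments into one number ...
min(data[, c("x1","x2","x3")], na.rm = TRUE)
# [1] 1
# ... while pmin() returns one minimum per row (element-wise across its arguments).
pmin(data$x1, data$x2, data$x3, na.rm = TRUE)
# [1]  1  2  1 NA  2  3  1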

Drop data frame rows if NA for certain variables referred to by name in dplyr

I would like to drop entire rows from a data frame if they have all NAs but for only certain subset of columns (which are named in a sequence as well as start with "X").
As far as I can tell, this is different from other SO answers I found, because I cannot refer to each column manually by name (too many variables), and I do not want to drop rows only when they are entirely NA, but rather when a particular subset of variables is entirely NA.
So turn sample data:
data1 <- as.data.frame(rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(1, NA, NA), c(4, 8, NA)))
colnames(data1) <- c("Z","X1","X2")
data1
Z X1 X2
1 1 2 3
2 1 NA 4
3 4 6 7
4 1 NA NA
5 4 8 NA
into:
  Z X1 X2
1 1  2  3
2 1 NA  4
3 4  6  7
4 4  8 NA
I.e. drop the row if both X1 and X2 (all of the X sequence) are NA.
In this example there are only two such variables (X1:X2) for ease, but in reality I have closer to 100 in this sequence, plus many other important variables that may or may not be NA. I would prefer to do this in dplyr with filter, but other solutions would be appreciated as well.
I feel like:
data1 %>% filter(!is.na(all(X1:X2)))
or something similar is close, but R does not like the sequence reference X1:X2 within filter.
You can use rowSums + select + starts_with + filter:
data1 %>%
  filter(rowSums(!is.na(select(., starts_with("X")))) != 0)
# Z X1 X2
#1 1 2 3
#2 1 NA 4
#3 4 6 7
#4 4 8 NA
A base R solution using apply would be:
drop <- which(apply(data1[,startsWith(colnames(data1), "X")], 1, function(x) all(is.na(x))))
data1[-drop,]
# Z X1 X2
#1 1 2 3
#2 1 NA 4
#3 4 6 7
#5 4 8 NA
Another option using rowSums:
drop <- which(rowSums(is.na(data1[, c("X1", "X2")])) >= 2)
data1[-drop, ]
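With dplyr 1.0.4 or later, if_all() gives a more direct phrasing of the same filter (a sketch assuming that dplyr version, not from the original answers):
library(dplyr)
# Keep rows where the X columns are not all NA.
data1 %>% filter(!if_all(starts_with("X"), is.na))
#   Z X1 X2
# 1 1  2  3
# 2 1 NA  4
# 3 4  6  7
# 4 4  8 NA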

How to split a list and save objects individually?

I am trying to add a new column to multiple data frames, and then replace the original data frame with the new one. This is how I am creating the new data frames:
df1 <- data.frame(X1=c(1,2,3),X2=c(1,2,3))
df2 <- data.frame(X1=c(4,5,6),X2=c(4,5,6))
groups <- list(df1,df2)
groups <- lapply(groups,function(x) cbind(x,X3=x[,1]+x[,2]))
groups
[[1]]
X1 X2 X3
1 1 1 2
2 2 2 4
3 3 3 6
[[2]]
X1 X2 X3
1 4 4 8
2 5 5 10
3 6 6 12
I'm satisfied with how the new data frames have been created. What I'm stuck on is then breaking up my groups list and then saving the list elements back into their respective original data frames.
Desired Output
Essentially, I want to do something like df1, df2 <- groups[[1]], groups[[2]], but that is of course not syntactically valid. I have more than 2 data frames, which is why I'm hoping for a more programmatic approach than simply typing out N lines of code.
for (i in 1:length(groups)) {
  assign(paste("df", i, sep = ""), as.data.frame(groups[[i]]))
}
should do it. Try it out, please.
@Rockbar led me to a general solution as well:
for (i in 1:length(groups)) {
  assign(names(groups)[i], as.data.frame(groups[[i]]))
}
> df1
X1 X2 X3
1 1 1 2
2 2 2 4
3 3 3 6
> df2
X1 X2 X3
1 4 4 8
2 5 5 10
3 6 6 12
I should note that this only works if the objects in the list are all named. Thank you again @Rockbar for guiding me to this.
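If you give the list elements names up front, base R's list2env() writes them all into the global environment in one call (a sketch, not part of the original answers):
# Name the list elements after the data frames they should become, then export them.
names(groups) <- paste0("df", seq_along(groups))
list2env(groups, envir = .GlobalEnv)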

Match dataframe rows according to two variables (Indexing)

I am essentially trying to get disorganized data into long form for linear modeling.
I have 2 data.frames "rec" and "book"
Each row in "book" needs to be pasted onto the end of several of the rows of "rec" according to two variables in the row: "MRN" and "COURSE" which match.
I have tried the following and variations thereon to no avail:
i = 1
newlist = list()
colnames(newlist) = colnames(book)
for (i in 1:dim(rec)[1]) {
  mrn = as.numeric(as.vector(rec$MRN[i]))
  course = as.character(rec$COURSE[i])
  get.vector <- as.vector(((as.numeric(as.vector(book$MRN)) == mrn) & (as.character(book$COURSE) == course)))
  newlist[i] <- book[get.vector, ]
  i = i + 1
}
If anyone has any suggestions on 1) getting this to work or 2) making it more elegant (or perhaps just less clumsy), I would be grateful. If I have been unclear in any way, I beg your pardon.
I do understand I haven't combined any data above; I think that once I can generate a long-format data.frame, I can combine them all on my own.
Sounds like you need to merge the two data-frames. Try this:
merge(rec, book, by = c('MRN', 'COURSE'))
and do read the help for merge (by doing ?merge at the R console) for more options on how to merge these.
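For illustration only, a minimal sketch with made-up rec and book frames (the columns dose and site are hypothetical; only MRN and COURSE matter for the join):
rec  <- data.frame(MRN = c(1, 1, 2), COURSE = c("A", "B", "A"), dose = c(10, 20, 30))
book <- data.frame(MRN = c(1, 2), COURSE = c("A", "A"), site = c("lung", "brain"))
merge(rec, book, by = c("MRN", "COURSE"))
#   MRN COURSE dose  site
# 1   1      A   10  lung
# 2   2      A   30 brain
# Use all.x = TRUE to keep rec rows that have no match in book (filled with NA).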
I've created a simple example that may help you. In my case, I wanted to paste the 'value' column from df1 into each row of df2, according to the variables x1 and x2:
df1 <- read.table(textConnection("
x1 x2 value
1 2 12
1 3 56
2 1 35
2 2 68
"),header=T)
df2 <- read.table(textConnection("
test x1 x2
1 1 2
2 1 3
3 2 1
4 2 2
5 1 2
6 1 3
7 2 1
"),header=T)
library(sqldf)
sqldf("select df2.*, df1.value from df2 join df1 using(x1,x2)")
test x1 x2 value
1 1 1 2 12
2 2 1 3 56
3 3 2 1 35
4 4 2 2 68
5 5 1 2 12
6 6 1 3 56
7 7 2 1 35
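For comparison, the same join can also be written with base merge() (not part of the original answer; note that merge() sorts the result by the join columns and puts them first, so row and column order differ from the sqldf output):
merge(df2, df1, by = c("x1", "x2"))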
