I want to delete the IDs that have no information in the remaining columns

Here is a representation of my dataset:
Number<-c(1:10)
AA<-c(head(LETTERS,4), rep(NA,6))
BB<-c(head(letters,6), rep(NA,4))
CC<-c(1:6, rep(NA,4))
DD<-c(10:14, rep(NA,5))
EE<-c(3:8, rep(NA,4))
FF<-c(6:1, rep(NA,4))
mydata<-data.frame(Number,AA,BB,CC,DD,EE,FF)
I want to automatically delete all the IDs (Number) that have no information in the remaining columns: if there is a value in Number but every remaining column is NA, that row should be dropped.
The result should be the data frame below:
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1

A possible base R solution:
mydata[rowSums(is.na(mydata[,-1])) != ncol(mydata[,-1]), ]
Output
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
Or we could use apply:
mydata[!apply(mydata[,-1], 1, function(x) all(is.na(x))),]

A possible solution, using janitor::remove_empty:
library(dplyr)
library(janitor)
inner_join(mydata, remove_empty(mydata[-1], which = "rows"))
#> Joining, by = c("AA", "BB", "CC", "DD", "EE", "FF")
#> Number AA BB CC DD EE FF
#> 1 1 A a 1 10 3 6
#> 2 2 B b 2 11 4 5
#> 3 3 C c 3 12 5 4
#> 4 4 D d 4 13 6 3
#> 5 5 <NA> e 5 14 7 2
#> 6 6 <NA> f 6 NA 8 1
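If you prefer to avoid the join, a sketch of the same idea: remove_empty() keeps the original row names, so you can subset mydata by the names of the rows that survive:
mydata[rownames(remove_empty(mydata[-1], which = "rows")), ]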

We can use if_any/if_all:
library(dplyr)
mydata %>%
  filter(if_any(-Number, complete.cases))
Output
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
or
mydata %>%
  filter(!if_all(-Number, is.na))
Or with base R
subset(mydata, rowSums(!is.na(mydata[-1])) > 0)
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
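A data.table sketch of the same filter, assuming the data are converted with as.data.table():
library(data.table)
DT <- as.data.table(mydata)
DT[rowSums(!is.na(DT[, !"Number"])) > 0]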

Note that this removes columns that are entirely NA (the column-wise analogue of the question), rather than rows. Try this:
df <- df[, colSums(is.na(df)) < nrow(df)]
This makes a copy of your data, though. If you have a large dataset you can use:
Filter(function(x) !all(is.na(x)), df)
Or, if you prefer data.table, which is usually a solid choice for large data:
library(data.table)
DT <- as.data.table(df)
DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]
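Since the question asks about rows rather than columns, the row-wise analogue of the first line would be (a sketch, equivalent to the rowSums() answers above):
df <- df[rowSums(is.na(df[-1])) < ncol(df[-1]), ]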

Related

Creating two columns of cumulative sum based on the categories of one column

I would like to create two columns with the cumulative frequency of "A" and "B" in the assignment column.
df = data.frame(id = 1:10, assignment= c("B","A","B","B","B","A","B","B","A","B"))
id assignment
1 1 B
2 2 A
3 3 B
4 4 B
5 5 B
6 6 A
7 7 B
8 8 B
9 9 A
10 10 B
The resulting table would have this format
id assignment A B
1 1 B 0 1
2 2 A 1 1
3 3 B 1 2
4 4 B 1 3
5 5 B 1 4
6 6 A 2 4
7 7 B 2 5
8 8 B 2 6
9 9 A 3 6
10 10 B 3 7
How can the code be generalized to more than 2 categories (say "A", "B", "C")?
Thanks
Use lapply over unique values in assignment to create new columns.
vals <- sort(unique(df$assignment))
df[vals] <- lapply(vals, function(x) cumsum(df$assignment == x))
df
# id assignment A B
#1 1 B 0 1
#2 2 A 1 1
#3 3 B 1 2
#4 4 B 1 3
#5 5 B 1 4
#6 6 A 2 4
#7 7 B 2 5
#8 8 B 2 6
#9 9 A 3 6
#10 10 B 3 7
We can use model.matrix with colCumsums
library(matrixStats)
cbind(df, colCumsums(model.matrix(~ assignment - 1, df[-1])))
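Here model.matrix(~ assignment - 1, df[-1]) builds one 0/1 indicator column per category, and colCumsums() takes each column's running total. A base-only sketch of the same idea, if you would rather avoid the matrixStats dependency (the new columns come out named assignmentA and assignmentB):
cbind(df, apply(model.matrix(~ assignment - 1, df[-1]), 2, cumsum))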
A base R option
transform(
  df,
  A = cumsum(assignment == "A"),
  B = cumsum(assignment == "B")
)
gives
id assignment A B
1 1 B 0 1
2 2 A 1 1
3 3 B 1 2
4 4 B 1 3
5 5 B 1 4
6 6 A 2 4
7 7 B 2 5
8 8 B 2 6
9 9 A 3 6
10 10 B 3 7

Create a function to impute values from one data frame into another

The NA values in column A should be filled with the A value from the dat data frame, and so on for the other variables.
id <- factor(rep(letters[1:2], each=5))
A <- c(1,2,NA,6,8,9,0,6,7,9)
B <- c(5,6,1,9,8,1,NA,9,7,4)
C <- c(2,3,5,NA,NA,2,7,6,4,6)
D <- c(6,5,8,3,2,9,NA,2,6,8)
df <- data.frame(id, A, B,C,D)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA 1 5 8
4 a 6 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
dat <- data.frame(col=c("A","B","C","D"), value=c(23,45,26,89))
dat
col value
1 A 23
2 B 45
3 C 26
4 D 89
It should look like:
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 26 3
5 a 8 8 26 2
6 b 9 1 2 9
7 b 0 45 7 89
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
I was thinking of something like this, but I don't know how to connect the data frames in a function...
test <- function(i){
  df[, i][is.na(df[, i])] <- dat$value
}
test(2)
If you want to keep your function-based format (note the global assignment <<-, which modifies df in place):
test <- function(i){
  df[, i][is.na(df[, i])] <<- dat$value[dat$col == i]
}
test("A")
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
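To fill every column listed in dat in one go, you could loop the same function over the column names, e.g. (a sketch; as.character() guards against col being a factor in older R versions):
invisible(lapply(as.character(dat$col), test))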
One approach is to iterate over the columns and values and use coalesce():
library(dplyr)
library(purrr)
df[-1] <- map2_df(df[-1], dat$value, coalesce)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 26 3
5 a 8 8 26 2
6 b 9 1 2 9
7 b 0 45 7 89
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
Or same using replace():
map2_df(df[-1], dat$value, ~ replace(.x, is.na(.x), .y))
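A plain base R sketch of the same idea, looping over dat and replacing the NAs column by column:
for (i in seq_len(nrow(dat))) {
  col <- as.character(dat$col[i])              # column name, e.g. "A"
  df[[col]][is.na(df[[col]])] <- dat$value[i]  # fill its NAs with the matching value
}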

Spread data table by id

I have the following data.table:
> df
month student A B C D
1: 1 Amy 9 6 1 11
2: 1 Bob 8 5 5 2
3: 2 Amy 7 7 2 4
4: 2 Bob 6 6 6 6
5: 3 Amy 6 8 10 7
6: 3 Bob 9 7 11 3
I want to transform this data.table to this format:
> df1
month cols Amy Bob
1: 1 A 9 8
2: 1 B 6 5
3: 1 C 1 5
4: 1 D 11 2
5: 2 A 7 6
6: 2 B 7 6
7: 2 C 2 6
8: 2 D 4 6
9: 3 A 6 9
10: 3 B 8 7
11: 3 C 10 11
12: 3 D 7 3
I tried multiple ways using dcast etc. but I couldn't transform the data. Help please!
You have to melt the data.table and then dcast:
tmp <- melt(df, id.vars = c("month", "student"), variable.name = "cols")
df1 <- dcast(tmp, month + cols ~ student, value.var = "value")
Both functions are from the data.table library.
A tidyr approach:
library(tidyr)
df %>%
  gather(cols, values, A:D) %>%
  spread(student, values)
month cols Amy Bob
1 1 A 9 8
2 1 B 6 5
3 1 C 1 5
4 1 D 11 2
5 2 A 7 6
6 2 B 7 6
7 2 C 2 6
8 2 D 4 6
9 3 A 6 9
10 3 B 8 7
11 3 C 10 11
12 3 D 7 3
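In current tidyr (1.0+), gather() and spread() are superseded; an equivalent sketch with pivot_longer()/pivot_wider():
library(tidyr)
df %>%
  pivot_longer(A:D, names_to = "cols", values_to = "values") %>%
  pivot_wider(names_from = student, values_from = values)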

Dedupe the rows with a value in a data frame

How can I keep only the rows that have a value in a particular column, among duplicates identified by other columns?
Example:
df
A B C D
1 NA 8 7
1 5 8 9
2 6 5 8
2 NA 5 6
3 NA 8 5
In the above dataset, the first 4 rows are duplicates based on columns A and C, so among them I want to keep only the rows that have a value in column B.
Desired output,
A B C D
1 5 8 9
2 6 5 8
3 NA 8 5
Thanks.
Using dplyr:
df <- read.table(text="A B C D
1 NA 8 7
1 5 8 9
2 6 5 8
2 NA 5 6
3 NA 8 5", header=T)
df %>%
  group_by(A, C) %>%
  filter(n() == 1 | !is.na(B))
A B C D
<int> <int> <int> <int>
1 1 5 8 9
2 2 6 5 8
3 3 NA 8 5
Keep rows that are duplicated (looking backwards or forwards) and not missing on B, or that are not duplicated at all:
anydup <- duplicated(df[c("A","C")]) | duplicated(df[c("A","C")], fromLast=TRUE)
df[(anydup & (!is.na(df$B))) | (!anydup),]
# A B C D
#2 1 5 8 9
#3 2 6 5 8
#5 3 NA 8 5
Or use ave to check the length per group as per #HubertL's dplyr answer:
df[!is.na(df$B) | ave(df$B, df[c("A","C")], FUN=length)==1,]
# A B C D
#2 1 5 8 9
#3 2 6 5 8
#5 3 NA 8 5
Here is one option with data.table
library(data.table)
setDT(df)[df[, .I[.N == 1 | complete.cases(B)], .(A, C)]$V1]
# A B C D
#1: 1 5 8 9
#2: 2 6 5 8
#3: 3 NA 8 5

Reproduce character pattern as numeric pattern

I would like to expand the following data frame
d <- data.frame(a = c(rep("A",5),rep("B",5),rep("C",3),rep("D",2)))
> d
a
1 A
2 A
3 A
4 A
5 A
6 B
7 B
8 B
9 B
10 B
11 C
12 C
13 C
14 D
15 D
so that there is a column b looking like:
> d
a b
1 A 1
2 A 1
3 A 1
4 A 1
5 A 1
6 B 2
7 B 2
8 B 2
9 B 2
10 B 2
11 C 3
12 C 3
13 C 3
14 D 4
15 D 4
Not really sure how to realise that.
Use match:
d$b <- match(d$a, unique(d$a))
Or, equivalently, convert to a factor with the levels in order of appearance:
d$b <- as.integer(factor(d$a, levels = unique(d$a)))
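If the values always appear in contiguous runs, as in this example, data.table::rleid() is another option (a sketch; unlike match(), it starts a new id every time the value changes, so it differs when a value reappears later):
library(data.table)
d$b <- rleid(d$a)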
