Creating new dataframe with missing value - r

i have a dataframe structured like this
time <- c(1,1,1,1,2,2)
group <- c('a','b','c','d','c','d')
number <- c(2,3,4,1,2,12)
df <- data.frame(time,group,number)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 c 2
6 2 d 12
in order to plot the data i need it to contain the values for each group (from a-d) at each time interval, even if they equal zero. so a data frame looking like this:
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a 0
6 2 b 0
7 2 c 2
8 2 d 12
any help?

You can use expand.grid and merge, like this:
> merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all = TRUE)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a NA
6 2 b NA
7 2 c 2
8 2 d 12
From there, it's just a simple matter of replacing NA with 0.

new <- merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all.y = TRUE)
new[is.na(new$number),"number"] <- 0
new

Related

r - remove first row of condition per subject in dataframe

I have a long format dataframe with multiple subjects and multiple conditions for each subject.
I want to remove the first row of each condition (except the first one) for all subjects.
My dataframe looks like this:
> df <- data.frame(subj = c(rep(1,4),rep(2,4), rep(3,4)), cond = (rep(c("A", "A", "B", "B"),times=3)), value = round(runif(12, min = 0, max = 10)))
> df
subj cond value
1 A 1
1 A 5
1 B 3
1 B 10
2 A 6
2 A 5
2 B 2
2 B 0
3 A 5
3 A 8
3 B 5
3 B 2
I have found the duplicated() function but it only removes the first row of each condition for the first subject:
df <- df[duplicated(df$cond),]
subj cond value
1 A 5
1 B 10
2 A 6
2 A 5
2 B 2
2 B 0
3 A 5
3 A 8
3 B 5
3 B 2
Is there a way to "reset" the finding of a duplicate whenever a new subject begins?
And how can I stop it from excluding the first row of the first condition?
Thank you all so much!
You could subset with the duplicated interaction of the two variables:
> df
subj cond value
1 1 A 5
2 1 A 7
3 1 B 4
4 1 B 8
5 2 A 5
6 2 A 2
7 2 B 8
8 2 B 5
9 3 A 8
10 3 A 1
11 3 B 1
12 3 B 5
df1 <- df[!duplicated(interaction(df$subj, df$cond)),]
> df1
subj cond value
1 1 A 5
3 1 B 4
5 2 A 5
7 2 B 8
9 3 A 8
11 3 B 1
Edit:
I've read your question again and it seems you want to remove the first row, not the last. In this case, use
df1 <- df[!duplicated(interaction(df$subj, df$cond), fromLast = TRUE),]
> df1
subj cond value
2 1 A 4
4 1 B 9
6 2 A 9
8 2 B 7
10 3 A 1
12 3 B 2
Alternative (but does depend on actual df):
df <- data.frame(subj = c(rep(1,4),rep(2,4), rep(3,4)),
cond = (rep(c("A", "A", "B", "B"),times=3)),
value = round(runif(12, min = 0, max = 10)))
df
dummy <- as.character(df$cond) # factor to character
mask <- c(FALSE, dummy[-1] == dummy[-length(dummy)])
df[mask,]

Shifting rows up in columns and flush remaining ones

I have a problem with moving the rows to one upper row. When the rows become completely NA I would like to flush those rows (see the pic below). My current approach for this solution however still keeping the second rows.
Here is my approach
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
so using this approach
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see still I am having the second rows in each group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!
If there's at most one valid value per gr, you can use na.omit then take the first value from it:
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5
You can do it with dplyr like this:
data$ind <- rep(c(1,2), replace=TRUE)
data %>% fill(A,B,C) %>% filter(ind == 2) %>% mutate(ind=NULL)
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
One more solution using data.table:-
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5

R - Loop through a data table with combination of dcast of sum

I have a table similar this, with more columns. What I am trying to do is creating a new table that shows, for each ID, the number of Counts of each Type, the Value of each Type.
df
ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3
I am able to do it for one single column by using
dcast(df[,j=list(sum(Counts,na.rm = TRUE)),by = c("ID","Type")],ID ~ paste(Type,"Counts",sep="_"))
However, I want to use a loop through each column within the data table. but there is no success, it will always add up all the rows. I have try to use
sum(df[[i]],na.rm = TRUE)
sum(names(df)[[i]] == "",na.rm = TRUE)
sum(df[[names(df)[i]]],na.rm = TRUE)
j = list(apply(df[,c(3:4),with=FALSE],2,function(x) sum(x,na.rm = TRUE)
I want to have a new table similar like
ID A_Counts B_Counts A_Value B_Value
1 1 2 5 4
2 5 3 5 6
My own table have more columns, but the idea is the same. Do I over-complicated it or is there a easy trick I am not aware of? Please help me. Thank you!
You have to melt your data first, and then dcast it:
library(reshape2)
df2 <- melt(df,id.vars = c("ID","Type"))
# ID Type variable value
# 1 1 A Counts 1
# 2 1 B Counts 2
# 3 2 A Counts 2
# 4 2 A Counts 3
# 5 2 B Counts 1
# 6 2 B Counts 2
# 7 1 A Value 5
# 8 1 B Value 4
# 9 2 A Value 1
# 10 2 A Value 4
# 11 2 B Value 3
# 12 2 B Value 3
dcast(df2,ID ~ Type + variable,fun.aggregate=sum)
# ID A_Counts A_Value B_Counts B_Value
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Another solution with base functions only:
df3 <- aggregate(cbind(Counts,Value) ~ ID + Type,df,sum)
# ID Type Counts Value
# 1 1 A 1 5
# 2 2 A 5 5
# 3 1 B 2 4
# 4 2 B 3 6
reshape(df3, idvar='ID', timevar='Type',direction="wide")
# ID Counts.A Value.A Counts.B Value.B
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Data
df <- read.table(text ="ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3",stringsAsFactors=FALSE,header=TRUE)

Remove semi duplicate rows in R

I have the following data.frame.
a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)
a b c d
1 A NA 1 1
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
6 B NA 1 6
7 C 2 2 7
8 C NA 2 8
9 D NA 2 9
I want to remove duplicate rows (based on column A & C) so that the row with values in column B are kept. In this example, rows 1, 6, and 8 are removed.
One way to do this is to order by 'a', 'b' and the the logical vector based on 'b' so that all 'NA' elements will be last for each group of 'a', and 'b'. Then, apply the duplicated and keep only the non-duplicate elements
df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
# a b c d
#2 A 1 1 2
#3 A 2 2 3
#5 B 1 1 5
#4 B 4 4 4
#7 C 2 2 7
#9 D NA 2 9
setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
First create two datasets, one with duplicates in column a and one without duplicate in column a using the below function :
x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]
Now use na.omit function on data set x to delete the rows with NA and then rbind x and x1 to the final data set.
rbind(na.omit(x),x1)
Answer:
a b c d
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
7 C 2 2 7
9 D NA 2 9
You can use dplyr to do this.
df %>% distinct(a, c, .keep_all = TRUE)
Output
a b c d
1 A NA 1 1
2 A 2 2 3
3 B 4 4 4
4 B 1 1 5
5 C 2 2 7
6 D NA 2 9
There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr

Create equal length vectors from time series based upon factor in R

I have a data frame that is something like this:
time type count
1 -2 a 1
2 -1 a 4
3 0 a 6
4 1 a 2
5 2 a 5
6 0 b 3
7 1 b 7
8 2 b 2
I want to create a new data frame that takes type 'b' and creates the full time series by filling in zeroes for count. It should look like this:
time type count
1 -2 b 0
2 -1 b 0
3 0 b 3
4 1 b 7
5 2 b 2
I can certainly subset(df, df$type = 'b') and then hack the beginning and rbind, but I want it to be more dynamic just in case the time vector changes.
We can use complete from tidyr to get the full 'time' for all the unique values of 'type' and filter the value of interest in 'type'.
library(tidyr)
library(dplyr)
val <- "b"
df1 %>%
complete(time, type, fill=list(count=0)) %>%
filter(type== val)
# time type count
# <int> <chr> <dbl>
#1 -2 b 0
#2 -1 b 0
#3 0 b 3
#4 1 b 7
#5 2 b 2
With base R:
df1 <- data.frame(time=df[df$type == 'a',]$time, type='b', count=0)
df1[match(df[df$type=='b',]$time, df1$time),]$count <- df[df$type=='b',]$count
df1
time type count
1 -2 b 0
2 -1 b 0
3 0 b 3
4 1 b 7
5 2 b 2

Resources