R: dplyr duplicated column name

I have a large dataset that looks like below:
df1 <- data.frame(matrix(vector(),ncol=4, nrow = 4))
colnames(df1) <- c("Date","A","A","C")
df1[1,] <- c("2000-01-30","0","1","0")
df1[2,] <- c("2000-01-31","2","0","3")
df1[3,] <- c("2000-02-29","1","2","1")
df1[4,] <- c("2000-03-31","2","1","3")
df1
Date A A C
1 2000-01-30 0 1 0
2 2000-01-31 2 0 3
3 2000-02-29 1 2 1
4 2000-03-31 2 1 3
I'm trying to get:
Date A A C
1 2000-01 2 1 3
2 2000-02 1 2 1
3 2000-03 2 1 3
Here is what I did:
library(zoo)
library(dplyr)
df1[,-1] = lapply(df1[,-1], as.numeric)
df1 %>% mutate(Date = as.yearmon(Date)) %>%
group_by(Date) %>%
summarise_each(funs(sum))
This works great when there are no duplicate column names. However, given the size of the data, some of the columns may have the same name, which causes the error "found duplicated column name: A". I do not want to combine the columns; I want to get the result shown above. Please advise.
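One possible workaround, sketched here in base R rather than dplyr: give the columns temporary unique names with make.unique(), aggregate, then put the original (duplicated) names back. The numeric sample values, the placeholder names V1 to V3, and the use of substr() to get the year-month are assumptions made for illustration; the same rename-then-restore idea should carry over to the dplyr pipeline above.

```r
# Sample data from the question, with the duplicated column name "A"
df1 <- data.frame(
  Date = c("2000-01-30", "2000-01-31", "2000-02-29", "2000-03-31"),
  V1 = c(0, 2, 1, 2),
  V2 = c(1, 0, 2, 1),
  V3 = c(0, 3, 1, 3)
)
names(df1) <- c("Date", "A", "A", "C")

# Remember the original names, then make them unique for the aggregation
orig_names <- names(df1)
names(df1) <- make.unique(orig_names)   # "Date" "A" "A.1" "C"

# Sum every value column within each year-month ("YYYY-MM")
res <- aggregate(df1[-1], by = list(Date = substr(df1$Date, 1, 7)), FUN = sum)

# Restore the duplicated names; column order is unchanged by aggregate()
names(res) <- orig_names
res
```

Restoring the names by position relies on the summary step keeping the columns in their original order, so it is worth checking that if the pipeline reorders or drops columns.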

Related

R: How can I merge two columns into one (a longer column, not a merge by column)

For example, I have this data frame:
Id Age
1 14
2 28
and I want to make a long column like this:
Id new column
1 1
2 2
NA 14
NA 28
What should I do?
We may unlist data and create the column by padding NA based on the max length
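# column names are case-sensitive: the data uses "Id", not "id"
lst1 <- list(df1$Id, unlist(df1, use.names = FALSE))
# pad the shorter vector with NA up to the longest length
out <- data.frame(lapply(lst1, `length<-`, max(lengths(lst1))))
names(out) <- c("Id", "new_column")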
Here is another approach:
df1 <- data.frame(New_column = c(df[,"Id"], df[,"Age"]))
merge(df$Id, df1, by="row.names", all=TRUE)[,-1]
Output:
x New_column
1 1 1
2 2 2
3 NA 14
4 NA 28
An approach with dplyr
library(dplyr)
df %>%
mutate(Age = Id) %>%
bind_rows(
df %>%
mutate(Id = NA)
) %>%
rename(new_column = Age)
# A tibble: 4 x 2
Id new_column
<int> <int>
1 1 1
2 2 2
3 NA 14
4 NA 28

How to add row to other matrix

I have following panel data:
firmid date return
1 1 1
1 2 1
1 3 1
2 2 2
2 3 2
3 1 2
3 3 2
I want to transform this from long format to wide, but only for date 1, so that it looks like this:
firmid return in date=1
1 1
3 2
I appreciate any advice!
df <- read.table(header = T, text = "firmid date return
1 1 1
1 2 1
1 3 1
2 2 2
2 3 2
3 1 2
3 3 2")
Base R solution:
df <- df[df$date == 1, ]
df$date <- NULL
df
firmid return
1 1 1
6 3 2
data.table solution:
library(data.table)
setDT(df)
df <- df[date == 1, ]
df[, date := NULL]
firmid return
1: 1 1
2: 3 2
You can use dplyr to achieve it too:
library(dplyr)
df2 <- df %>%
filter(date == 1) %>%
select(-date)
# firmid return
#1 1 1
#2 3 2
A different dplyr solution that allows you to have multiple values of return within firmid:
df %>%
filter(date == 1) %>%
group_by(firmid, return) %>%
summarise()

Fill sequence by factor

I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
A data.table solution:
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to @PoGibas's answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
Another base R idea can be to split on Country, use setdiff to find the missing values from the seq(max(Year)), and rbind them to original data frame. Use do.call to rbind the list back to a data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
A keyed data.table join (note that the added rows keep NA in Count rather than 0):
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
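The NA values left by the join above can be replaced by reference afterwards. A minimal sketch, assuming the sample data from the question is stored in a data.table named DT; the names grid and res are illustrative, and the on = form of the join is used here so that no key has to be set first:

```r
library(data.table)

DT <- data.table(
  Country = c("A", "A", "A", "B", "B"),
  Year    = c(1L, 2L, 4L, 1L, 3L),
  Count   = c(1, 1, 2, 1, 1)
)

# Every Country/Year combination between each country's first and last year
grid <- DT[, .(Year = seq(min(Year), max(Year))), by = Country]

# Join onto the grid: years missing from DT come back with Count = NA
res <- DT[grid, on = .(Country, Year)]

# Replace the NA counts with 0 in place
res[is.na(Count), Count := 0]
res
```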
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(tibble(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq to calculate the year sequences. It then constructs a data.frame from the named list that is returned, merges it onto the original data (which adds the desired rows), and finally fills in the missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1

Create NAs for first two rows using group_by

I am trying to replace some of my observations with NA. I would like to set only the first two observations of each group, as defined by a given ID, to NA.
So from:
id b
1 1 0.1125294
2 1 -0.6871102
3 1 0.1721639
4 2 0.2714921
5 2 0.1012665
6 2 -0.3538989
Get:
id b
1 1 NA
2 1 NA
3 1 0.1721639
4 2 NA
5 2 NA
6 2 -0.3538989
Tried this, but it does not work...
data<- data %>% group_by(id) %>% mutate(data$b[1:2] = NA)
Thanks for any help!
library(dplyr)
df <- data.frame(id = rep(1:2, each = 3), value = rnorm(6))
df %>% group_by(id) %>% mutate(value=replace(value, 1:2, NA))
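For comparison, a base R sketch of the same idea using the question's own values: ave() numbers the rows within each id, and every row whose within-group position is 1 or 2 is set to NA. The helper name pos is illustrative. Because this tests positions instead of indexing 1:2 directly, it should also cope with groups that have fewer than two rows.

```r
df <- data.frame(
  id = rep(1:2, each = 3),
  b  = c(0.1125294, -0.6871102, 0.1721639, 0.2714921, 0.1012665, -0.3538989)
)

# Position of each row within its id group: 1, 2, 3, 1, 2, 3
pos <- ave(seq_along(df$id), df$id, FUN = seq_along)

# Blank out the first two observations of every group
df$b[pos <= 2] <- NA
df
```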

Order by month and year in R data frame

I have cleaned and ordered my data by date, which looks like below:
df1 <- data.frame(matrix(vector(),ncol=4, nrow = 3))
colnames(df1) <- c("Date","A","B","C")
df1[1,] <- c("2000-01-30","0","1","0")
df1[2,] <- c("2000-01-31","2","0","3")
df1[3,] <- c("2000-02-29","1","2","1")
df1[4,] <- c("2000-03-31","2","1","3")
df1
Date A B C
1 2000-01-30 0 1 0
2 2000-01-31 2 0 3
3 2000-02-29 1 2 1
4 2000-03-31 2 1 3
However, I want to drop the day and order the data by month and year so the data will look like:
Date A B C
1 2000-01 2 1 3
3 2000-02 1 2 1
4 2000-03 2 1 3
I tried to use as.yearmon from zoo, df2 <- as.yearmon(df1$Date, "%b-%y"), but it returns NA. Thank you in advance for your generous help!
Here's a way to get the sum of the values for each column within each combination of Year-Month:
library(zoo)
library(dplyr)
# Convert non-date columns to numeric
df1[,-1] = lapply(df1[,-1], as.numeric)
df1 %>% mutate(Date = as.yearmon(Date)) %>%
group_by(Date) %>%
summarise_each(funs(sum))
Or, even shorter:
df1 %>%
group_by(Date=as.yearmon(Date)) %>%
summarise_each(funs(sum))
Date A B C
1 Jan 2000 2 1 3
2 Feb 2000 1 2 1
3 Mar 2000 2 1 3
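summarise_each() and funs() are deprecated in current dplyr releases. A sketch of the same grouped sum written with across(), assuming dplyr 1.0 or later and numeric value columns (the sample values are typed in as numbers here instead of being converted from character):

```r
library(zoo)
library(dplyr)

df1 <- data.frame(
  Date = c("2000-01-30", "2000-01-31", "2000-02-29", "2000-03-31"),
  A = c(0, 2, 1, 2),
  B = c(1, 0, 2, 1),
  C = c(0, 3, 1, 3)
)

# Group by year-month, then sum every remaining (non-grouping) column
res <- df1 %>%
  group_by(Date = as.yearmon(Date)) %>%
  summarise(across(everything(), sum))
res
```

across(everything(), ...) skips the grouping column, so only A, B and C are summed.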
A couple of additional enhancements:
Add the number of rows for each group:
df1 %>% group_by(Date=as.yearmon(Date)) %>%
summarise_each(funs(sum)) %>%
bind_cols(df1 %>% count(d=as.yearmon(Date)) %>% select(-d))
Multiple summary functions:
df1 %>% group_by(Date=as.yearmon(Date)) %>%
summarise_each(funs(sum(.), mean(.))) %>%
bind_cols(df1 %>% count(d=as.yearmon(Date)) %>% select(-d))
Date A_sum B_sum C_sum A_mean B_mean C_mean n
1 Jan 2000 2 1 3 1 0.5 1.5 2
2 Feb 2000 1 2 1 1 2.0 1.0 1
3 Mar 2000 2 1 3 2 1.0 3.0 1
Your Date column is a character vector, when it needs to be a Date type vector. So:
df1$Date <- as.Date(df1$Date)
df1$Date <- as.yearmon(df1$Date)
Date A B C
1 Jan 2000 0 1 0
2 Jan 2000 2 0 3
3 Feb 2000 1 2 1
4 Mar 2000 2 1 3
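If the label should read 2000-01, as in the desired output, rather than Jan 2000, a base R sketch is to parse the strings with as.Date() and format them as year-month. The NA in the question most likely comes from the format string "%b-%y", which describes text such as Jan-00 rather than 2000-01-30; that reading is an inference from the format codes, not something the answers state.

```r
dates <- c("2000-01-30", "2000-01-31", "2000-02-29", "2000-03-31")

# ISO dates ("YYYY-MM-DD") parse with the default as.Date() format;
# format() then keeps only the year and month
ym <- format(as.Date(dates), "%Y-%m")
ym
# "2000-01" "2000-01" "2000-02" "2000-03"
```

The result is a character vector, which sorts correctly in year-month order because the year comes first and the month is zero-padded.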
