Is there a better way to combine these dataframes and match values? - r

I have two dataframes that I have combined using left_join()
data1 can be simplified to something like...
Date <- as.Date(c('2011-7-26','2011-7-26','2010-11-1','2010-11-1','2009-5-10','2009-5-10','2008-3-25','2008-3-25','2007-3-14','2007-3-14'))
Location <- c("A","B","A","B","A","B","A","B","A","B")
Result <- sample(1:30, 10)
data1 <- data.frame(Date,Location,Result)
data2 can be simplified to something like...
Date <- as.Date(c('2011-7-26','2009-5-10','2007-3-14'))
Flow_A <- c(6,2,9)
Flow_B <- c(10,11,25)
data2 <- data.frame(Date,Flow_A,Flow_B)
After combining by date, I have this:
data3 <- left_join(data2, data1, by = "Date")
Date Flow_A Flow_B Location Result
1 2011-07-26 6 10 A 11
2 2011-07-26 6 10 B 17
3 2009-05-10 2 11 A 6
4 2009-05-10 2 11 B 22
5 2007-03-14 9 25 A 20
6 2007-03-14 9 25 B 1
Each value in Result corresponds to a specific Location (A or B), and I want to attach the correct Flow value (from Flow_A or Flow_B) to that row according to location, i.e. combine the Flow_A and Flow_B columns into one column 'Flow' holding just the correct value. I have been able to do this using a combination of mutate(), ifelse(), grepl(), and some very simple functions:
a <- data3$Flow_A
Choose_A <- function(a) {
  return(a)
}
b <- data3$Flow_B
Choose_B <- function(b) {
  return(b)
}
data3 <- mutate(data3, Flow =
  ifelse(grepl("A", Location), Choose_A(a),
    ifelse(grepl("B", Location), Choose_B(b), NA)))
Date Flow_A Flow_B Location Result Flow
1 2011-07-26 6 10 A 11 6
2 2011-07-26 6 10 B 17 10
3 2009-05-10 2 11 A 6 2
4 2009-05-10 2 11 B 22 11
5 2007-03-14 9 25 A 20 9
6 2007-03-14 9 25 B 1 25
But this seems rather clunky. Is there a better (more efficient) way to achieve this?
Please excuse my ignorance - I'm still learning!
Thanks!

You can use match to create a vector of the column number to extract from each row, combine it with the row numbers via cbind into a two-column matrix, and use that matrix to subset the relevant value from either 'Flow_A' or 'Flow_B' depending on the Location column.
column_num <- match(paste0('Flow_', data3$Location), names(data3))
row_num <- seq_len(nrow(data3))
data3$Flow <- data3[cbind(row_num, column_num)]
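An alternative is to reshape the two flow columns into long form and keep only the row whose flow name matches Location; this is just a sketch assuming the dplyr and tidyr packages (tidyr >= 1.0 for pivot_longer):
library(dplyr)
library(tidyr)
# Pivot Flow_A/Flow_B into a single Flow column, then keep the row matching Location
data3_long <- data3 %>%
  pivot_longer(c(Flow_A, Flow_B),
               names_to = "Flow_Location", names_prefix = "Flow_",
               values_to = "Flow") %>%
  filter(Flow_Location == Location) %>%
  select(-Flow_Location)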


lag/lead entire dataframe in R

I am having a very hard time leading or lagging an entire dataframe. With the attempts below I am able to shift individual columns, but not the whole thing:
require('DataCombine')
df_l <- slide(df, Var = var1, slideBy = -1)
Using colnames(x_ret_mon) as Var does not work; I am told the variable names are not found in the dataframe.
This attempt shifts the columns right but not down:
df_l<- dplyr::lag(df)
This only creates new variables for the lagged values, and then I do not know how to effectively delete the old non-lagged values:
df_l<-shift(df, n=1L, fill=NA, type=c("lead"), give.names=FALSE)
Use dplyr::mutate_all to apply lags or leads to all columns.
df = data.frame(a = 1:10, b = 21:30)
dplyr::mutate_all(df, lag)
a b
1 NA NA
2 1 21
3 2 22
4 3 23
5 4 24
6 5 25
7 6 26
8 7 27
9 8 28
10 9 29
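Note that mutate_all() has been superseded in more recent dplyr releases; with dplyr >= 1.0 the same lag (or lead) can be written with across():
library(dplyr)
# same as mutate_all(df, lag), using the newer across() helper
mutate(df, across(everything(), lag))
# and for leads
mutate(df, across(everything(), lead))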
I don't see the point in lagging all columns in a data.frame. Wouldn't that just correspond to rbinding an NA row to your original data.frame (minus its last row)?
df = data.frame(a = 1:10, b = 21:30)
rbind(NA, df[-nrow(df), ]);
# a b
#1 NA NA
#2 1 21
#3 2 22
#4 3 23
#5 4 24
#6 5 25
#7 6 26
#8 7 27
#9 8 28
#10 9 29
And similarly for leading all columns.
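For example, a sketch of the lead version with the same base R idea: drop the first row and append an NA row.
rbind(df[-1, ], NA)  # leads all columns by one row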
A couple more options
data.frame(lapply(df, lag))
require(purrr)
map_df(df, lag)
If your data is a data.table you can do
require(data.table)
as.data.table(shift(df))
Or, if you're overwriting df
df[] <- lapply(df, lag) # Thanks Moody
require(magrittr)
df %<>% map_df(lag)

sum up certain variables (columns) by variable names

I want to sum up certain variables (columns) in a data frame.
I would like to select those variables by parts of their names.
The complicating factor is that I have various conditions, so using a single contains() from dplyr does not work.
Here is an example:
ab_yy <- c(1:5)
bc_yy <- c(5:9)
cd_yy <- c(2:6)
de_xx <- c(3:7)
dat <- data.frame(ab_yy, bc_yy, cd_yy, de_xx)
  ab_yy bc_yy cd_yy de_xx
1     1     5     2     3
2     2     6     3     4
3     3     7     4     5
4     4     8     5     6
5     5     9     6     7
#sum up all variables that contain yy and certain extra conditions
#may look something like this: rowSums(select(dat, contains(("yy&ab")|("yy&bc")) ) )
desired result:
6 8 10 12 14
If you want to use dplyr, try using matches:
library(dplyr)
dat %>%
  select(matches("yy")) %>%
  select(matches("^ab|^bc")) %>%
  rowSums()
[1] 6 8 10 12 14
I don't think it's the best way, but you can do it with grepl:
rowSums(dat[, grepl("ab.*yy|bc.*yy", colnames(dat))])
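Both selections can also be collapsed into a single pattern, assuming the intent is columns that start with ab or bc and contain yy:
library(dplyr)
# one matches() call instead of two chained select()s
rowSums(select(dat, matches("^(ab|bc).*yy")))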

Trying to create a count column with a specific column [duplicate]

I have a data frame df that looks like the following where the gender column is a factor with two levels:
gender age
m 18
f 14
m 18
m 18
m 15
f 15
I would like to add a new column called count that simply reflects the number of times that gender level appears in the data frame. So, ultimately, the data frame would look like:
gender age count
m 18 4
f 14 2
m 18 4
m 18 4
m 15 4
f 15 2
I know that table(df$gender) gives me the number of times each level appears, but I do not know how to translate those results into a new column in df. How can I use the table function for this, or is there a better way to create the new column?
You may try ave:
# first, convert 'gender' to class character
df$gender <- as.character(df$gender)
df$count <- as.numeric(ave(df$gender, df$gender, FUN = length))
df
# gender age count
# 1 m 18 4
# 2 f 14 2
# 3 m 18 4
# 4 m 18 4
# 5 m 15 4
# 6 f 15 2
Update following #flodel's comment - thanks!
df <- transform(df, count = ave(age, gender, FUN = length))
Since gender is a factor, you can use it to index the table output:
dat$count <- table(dat$gender)[dat$gender]
Or to avoid repeating dat$ too many times:
dat <- transform(dat, count = table(gender)[gender])
Using plyr:
library(plyr)
ddply(dat,.(gender),transform,count=length(age))
gender age count
1 f 14 2
2 f 15 2
3 m 18 4
4 m 18 4
5 m 18 4
6 m 15 4
And a data.table version for good measure.
library(data.table)
df <- as.data.table(df)
Once you have the data.table, it's then a simple operation:
df[,count := .N,by="gender"]
df
# gender age count
#1: m 18 4
#2: f 14 2
#3: m 18 4
#4: m 18 4
#5: m 15 4
#6: f 15 2
You can set the counts and then do something like this, but that's not exactly elegant.
m.cnt <- length(which(df$gender == "m"))
f.cnt <- length(which(df$gender == "f"))
df$count <- NA
df$count[which(df$gender == "m")] <- m.cnt
df$count[which(df$gender == "f")] <- f.cnt
Alternatively you can use plyr, but that recalculates the same count over and over again, which might not be worth it since gender only has two levels.
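For completeness, a dplyr sketch of the same idea (assuming dplyr is installed): group by gender and add the group size with n(), or use add_count():
library(dplyr)
df %>% group_by(gender) %>% mutate(count = n()) %>% ungroup()
# or, more compactly
df %>% add_count(gender, name = "count")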

Loop or apply for sum of rows based on multiple conditions in R dataframe

I've hacked together a quick solution to my problem, but I have a feeling it's quite obtuse. Moreover, it uses for loops, which, from what I've gathered, should be avoided at all costs in R. Any and all advice to tidy up this code is appreciated. I'm still pretty new to R, but I fear I'm making a relatively simple problem much too convoluted.
I have a dataset as follows:
id count group
2 6 A
2 8 A
2 6 A
8 5 A
8 6 A
8 3 A
10 6 B
10 6 B
10 6 B
11 5 B
11 6 B
11 7 B
16 6 C
16 2 C
16 0 C
18 6 C
18 1 C
18 6 C
I would like to create a new dataframe that contains, for each unique ID, the sum of the first two counts of that ID (e.g. 6+8=14 for ID 2). I also want to attach the correct group identifier.
In general you might need to do this when you measure a value on consecutive days for different subjects and treatments, and you want to compute the total for each subject for the first x days of measurement.
This is what I've come up with:
id <- c(rep(c(2,8,10,11,16,18),each=3))
count <- c(6,8,6,5,6,3,6,6,6,5,6,7,6,2,0,6,1,6)
group <- c(rep(c("A","B","C"),each=6))
df <- data.frame(id,count,group)
newid <- c()
newcount <- c()
newgroup <- c()
for (i in 1:length(unique(df$"id"))) {
  newid[i] <- unique(df$"id")[i]
  newcount[i] <- sum(df[df$"id" == unique(df$"id")[i], 2][1:2])
  newgroup[i] <- as.character(df$"group"[df$"id" == newid[i]][1])
}
newdf<-data.frame(newid,newcount,newgroup)
Some possible improvements/alternatives I'm not sure about:
For loops vs apply functions
Can I create a dataframe directly inside a for loop, or should I stick to creating vectors I can later assign to a dataframe?
More consistent approaches to accessing/subsetting vectors/columns ($, [], [[]], subset?)
You could do this using data.table
setDT(df)[, list(newcount = sum(count[1:2])), by = .(id, group)]
# id group newcount
#1: 2 A 14
#2: 8 A 11
#3: 10 B 12
#4: 11 B 11
#5: 16 C 8
#6: 18 C 7
You could use dplyr:
library(dplyr)
df %>% group_by(id,group) %>% slice(1:2) %>% summarise(newcount=sum(count))
The pipe syntax makes it easy to read: group your data by id and group, take the first two rows for each group, then sum the counts
You can try to use a self-defined function in aggregate:
sum1sttwo <- function(x) {
  return(x[1] + x[2])
}
aggregate(count~id+group, data=df,sum1sttwo)
and the output is:
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
04/2015 edit: dplyr and data.table are definitely better choices when your data set is large. One of the biggest drawbacks of base R is that data.frame operations are slow. However, if you just need to aggregate a very simple/small data set, the aggregate function in base R can still serve its purpose.
library(plyr)
# Keep the first 2 rows for each id and group
df2 <- ddply(df, c("id", "group"), function(x) x$count[1:2])
# Aggregate by id and group
df3 <- ddply(df2, c("id", "group"), summarize, count = V1 + V2)
df3
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7

adding row/column total data when aggregating data using plyr and reshape2 package in R

I create aggregate tables most of the time in my work, using the workflow below:
set.seed(1)
temp.df <- data.frame(var1 = sample(letters[1:5], 100, replace = TRUE),
                      var2 = sample(11:15, 100, replace = TRUE))
temp.output <- ddply(temp.df,
                     c("var1", "var2"),
                     function(df) {
                       data.frame(count = nrow(df))
                     })
temp.output.all <- ddply(temp.df,
                         c("var2"),
                         function(df) {
                           data.frame(var1 = "all",
                                      count = nrow(df))
                         })
temp.output <- rbind(temp.output, temp.output.all)
temp.output[, "var1"] <- factor(temp.output[, "var1"], levels = c(letters[1:5], "all"))
temp.output <- dcast(temp.output, formula = var2 ~ var1, value.var = "count", fill = 0)
I'm starting to feel silly writing this boilerplate code every time I want to include row/column totals in a new aggregate table. Is there some way to skip it?
Looking at your desired output (now that I'm in front of a computer), perhaps you should look at the margins argument of dcast:
library(reshape2)
dcast(temp.df, var2 ~ var1, value.var = "var2",
      fun.aggregate = length, margins = "var1")
# var2 a b c d e (all)
# 1 11 3 1 6 4 2 16
# 2 12 1 3 6 5 5 20
# 3 13 5 9 3 6 1 24
# 4 14 4 7 3 6 2 22
# 5 15 0 5 1 5 7 18
Also look into the addmargins function in base R.
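A minimal sketch of the addmargins route, assuming the same temp.df: build the contingency table first, then add the column totals:
# margin = 2 appends a "Sum" column, analogous to margins = "var1" in dcast
addmargins(with(temp.df, table(var2, var1)), margin = 2)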
