Let's say I have two data frames with overlapping column names
DF1 = data.frame(a = c(0,1), b = c(2,3), c = c(4,5))
DF2 = data.frame(a = c(6,7), c = c(8,9))
and I want to apply a basic calculation to them column by column, for example adding the matching columns.
Since I also want the resulting data frame to show missing data, I appended the missing column b (filled with NA) to DF2, so I have
> DF2
a c b
1 6 8 NA
2 7 9 NA
What I tried here now is to create the data frame
for(i in names(DF2)){
DF3 = data.frame(i = DF1[i] + DF2[i])
}
(and then bind the results together), but this obviously doesn't work since the order of the columns is mixed up.
So, what's the best way to do this pairwise calculation when the columns are not in the same order, without reordering them?
I also tried doing (since this is what I thought would be a fix)
for(i in names(DF2)){
DF3 = data.frame(i = DF1$i + DF2$i)
}
but this doesn't work because DF1$i is NULL for all i.
Conclusion: I want the data frame
>DF3
a b c
1 6+0 NA 4+8
2 1+7 NA 5+9
Any help would be appreciated.
This may help -
#Get column names from DF1 and DF2
all_cols <- union(names(DF1), names(DF2))
#Fill missing columns with NA in both data frames
DF1[setdiff(all_cols, names(DF1))] <- NA
DF2[setdiff(all_cols, names(DF2))] <- NA
#add the two dataframes arranging the columns
DF1[all_cols] + DF2[all_cols]
# a b c
#1 6 NA 12
#2 8 NA 14
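If you would rather leave DF1 and DF2 unmodified, the same idea can be written as a loop over the union of the column names. This is only a sketch: get_col is a hypothetical helper (not from any package) that returns NAs when a column is absent. It also shows why the DF1$i attempt returned NULL: $ looks for a column literally named i, whereas [[nm]] uses the value stored in nm.

```r
DF1 <- data.frame(a = c(0, 1), b = c(2, 3), c = c(4, 5))
DF2 <- data.frame(a = c(6, 7), c = c(8, 9))

# Hypothetical helper: the named column, or a column of NA if it does not exist
get_col <- function(df, nm) {
  if (nm %in% names(df)) df[[nm]] else rep(NA_real_, nrow(df))
}

all_cols <- union(names(DF1), names(DF2))

# Build one result column per name; setNames() keeps the names on the list
DF3 <- as.data.frame(
  lapply(setNames(all_cols, all_cols),
         function(nm) get_col(DF1, nm) + get_col(DF2, nm))
)
DF3
```

Matching is done by name, so the column order of the inputs does not matter.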
We can use bind_rows
library(dplyr)
library(data.table)
bind_rows(DF1, DF2, .id = 'grp') %>%
group_by(grp = rowid(grp)) %>%
summarise(across(everything(), sum), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 6 NA 12
2 8 NA 14
Another base R option using aggregate + stack + reshape
aggregate(
. ~ rid,
transform(
reshape(
transform(rbind(
stack(DF1),
stack(DF2)
),
rid = ave(seq_along(ind), ind, FUN = seq_along)
),
direction = "wide",
idvar = "rid",
timevar = "ind"
),
rid = 1:nrow(DF1)
),
sum,
na.action = "na.pass"
)[-1]
gives
values.a values.b values.c
1 6 NA 12
2 8 NA 14
Related
I have a data frame with a subset of variables that starts with 'AA_' (e.g., AA_1, AA_2, ... AA_100) along with other variables X, Y, Z.
If I would like to get the product of all 'AA_' variables, what would be the most efficient way in R to achieve this?
I am thinking something like
mydata = mydata %>%
mutate(AA_product = reduce(starts_with('AA_'), `*`))
but it does not quite work
Here, we need to select the data
library(dplyr)
library(purrr)
mydata %>%
mutate(AA_product = reduce(select(., starts_with( 'AA_')), `*`))
-output
# X Y Z AA_1 AA_2 AA_3 AA_product
#1 1 2 3 1 2 3 6
#2 2 3 4 2 3 4 24
#3 3 4 5 3 4 5 60
Another less efficient approach is rowwise with c_across
mydata %>%
rowwise() %>%
mutate(AA_prod = prod(c_across(starts_with('AA')))) %>%
ungroup
data
mydata <- data.frame(X = 1:3, Y = 2:4, Z = 3:5,
AA_1 = 1:3, AA_2 = 2:4, AA_3 = 3:5)
If you want row-wise product for "AA_" columns, you can do this in base R with Reduce :
cols <- grep('AA_', names(mydata))
mydata$AA_product <- Reduce(`*`, mydata[cols])
and apply :
mydata$AA_product <- apply(mydata[cols], 1, prod)
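A small check that the two base R versions agree, as a sketch on the sample data from above. One assumption of mine: I anchor the pattern as '^AA_' and pick the columns before adding the result, because an unanchored grep('AA_', ...) run again later would also match the new AA_product column.

```r
mydata <- data.frame(X = 1:3, Y = 2:4, Z = 3:5,
                     AA_1 = 1:3, AA_2 = 2:4, AA_3 = 3:5)

# Select the AA_ columns once, before any result column is added
cols <- grep("^AA_", names(mydata))

by_reduce <- Reduce(`*`, mydata[cols])      # vectorised, column by column
by_apply  <- apply(mydata[cols], 1, prod)   # row by row, usually slower

# Compare as plain numeric vectors (Reduce returns integer here, prod returns double)
same <- isTRUE(all.equal(as.numeric(by_reduce), as.numeric(by_apply)))
same

mydata$AA_product <- by_reduce
mydata
```

Reduce works on whole columns at a time, which is why it tends to scale better than apply over rows.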
I have the following data frame:
df <- tibble(x = 1:3, y = 3:1, z = 4:6, a = 6:4, b = 7:9)
I now need to extract the values from the second row, third to fifth column with this command:
newrow <- df[2,3:5]
I now want to insert a new row after the second row. The problem is that I need the new row to start at column 2. If I use the following code, the row will be added at the same column positions as I extracted it from:
df%>% add_row(newrow, .before = 3)
Hope anybody can help with this, any help is much appreciated.
Your newrow dataframe has the column names of columns 3:5 (z, a, b). Therefore add_row() matches newrow to these columns.
You need to rename the columns of newrow with the first three column names.
df%>% add_row(setNames(newrow, names(df)[1:ncol(newrow)]),
.before = 3)
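For comparison, the same shift can be sketched in base R without add_row. This is not the tibble approach from the question, just an illustration of the logic; src_row, src_cols, dst_cols and after are names I made up for the example, and I use a plain data.frame instead of a tibble.

```r
df <- data.frame(x = 1:3, y = 3:1, z = 4:6, a = 6:4, b = 7:9)

src_row  <- 2     # row to copy the values from
src_cols <- 3:5   # columns the values come from (z, a, b)
dst_cols <- 2:4   # columns the values should land in (y, z, a)
after    <- 2     # insert the new row after this row

# Start from an all-NA row with the same columns, then fill the target columns
new_row <- df[src_row, ]
new_row[] <- NA
new_row[dst_cols] <- df[src_row, src_cols]

# Stitch the pieces together and reset the row names
out <- rbind(df[seq_len(after), ], new_row, df[-seq_len(after), ])
rownames(out) <- NULL
out
```

Assigning by position (dst_cols) is what moves the values one column to the left; the column names of the copied slice are ignored.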
I'm not sure exactly what your desired outcome is, but does this achieve what you want?
library(tibble)
library(dplyr)
df <- tibble::tibble(x = 1:3, y = 3:1, z = 4:6, a = 6:4, b = 7:9)
whatrow <- 2
whatcolumns <- 3:5
beforerow <- 3
newdf <-
slice(df, whatrow) %>%
select(all_of(whatcolumns)) %>%
setNames(., names(df)[whatcolumns - 1]) %>%
add_row(df, ., .before = beforerow)
newdf
#> # A tibble: 4 x 5
#> x y z a b
#> <int> <int> <int> <int> <int>
#> 1 1 3 4 6 7
#> 2 2 2 5 5 8
#> 3 NA 5 5 8 NA
#> 4 3 1 6 4 9
library(purrr)
library(tibble)
library(dplyr)
Starting list of dataframes
lst <- list(df1 = data.frame(X.1 = as.character(1:2),
heading = letters[1:2]),
df2 = data.frame(X.32 = as.character(3:4),
another.topic = paste("Line ", 1:2)))
lst
#> $df1
#> X.1 heading
#> 1 1 a
#> 2 2 b
#>
#> $df2
#> X.32 another.topic
#> 1 3 Line 1
#> 2 4 Line 2
Expected "combined" dataframe, with new consistent variable names, and old variable names in the first row of each constituent dataframe.
#> id h1 h2
#> 1 df1 X.1 heading
#> 2 df1 1 a
#> 3 df1 2 b
#> 4 df2 X.32 another.topic
#> 5 df2 3 Line 1
#> 6 df2 4 Line 2
add_row requires "Name-value pairs, passed on to tibble(). Values can be defined only for columns that already exist in .data and unset columns will get an NA value."
Which is what I think I have achieved with this:
df_nms <-
map(lst, names) %>%
map(set_names)
#> $df1
#> X.1 heading
#> "X.1" "heading"
#>
#> $df2
#> X.32 another.topic
#> "X.32" "another.topic"
But I cannot tie up the last bit: using a purrr function to add the names to the head of each dataframe. I've tried numerous variations with map2 and pmap; the closest I can get at present is shown below. (If I treat add_row as a formula, prefixing it with ~, and remove the .y, I get a new first row populated with NAs.) I think I'm missing how to pass the name-value pairs to the add_row function.
map2(lst, df_nms, add_row(.x, .y, .before = 1)) %>%
map(set_names, c("h1", "h2")) %>%
map_dfr(bind_rows, .id = "id")
#> Error in add_row(.x, .y, .before = 1): object '.x' not found
A pointer to resolve this last step would be most appreciated.
Not quite sure how to do this via purrr map functions, but here is an alternative,
library(dplyr)
bind_rows(lapply(lst, function(i){d1 <- as.data.frame(matrix(names(i), ncol = ncol(i)));
rbind(d1, setNames(i, names(d1)))}), .id = 'id')
# id V1 V2
#1 df1 X.1 heading
#2 df1 1 a
#3 df1 2 b
#4 df2 X.32 another.topic
#5 df2 3 Line 1
#6 df2 4 Line 2
Here's an approach using map, rbindlist from data.table and some base R functions:
library(purrr)
library(dplyr)
library(data.table)
map(lst, ~ as.data.frame(unname(rbind(colnames(.x),as.matrix(.x))))) %>%
rbindlist(idcol = "id")
# id V1 V2
#1: df1 X.1 heading
#2: df1 1 a
#3: df1 2 b
#4: df2 X.32 another.topic
#5: df2 3 Line 1
#6: df2 4 Line 2
Alternatively we could use map_df if we use colnames<-:
map_df(lst, ~ as.data.frame(rbind(colnames(.x),as.matrix(.x))) %>%
`colnames<-`(.,paste0("h",seq(1,dim(.)[2]))), .id = "id")
# id h1 h2
#1 df1 X.1 heading
#2 df1 1 a
#3 df1 2 b
#4 df2 X.32 another.topic
#5 df2 3 Line 1
#6 df2 4 Line 2
Key things here are:
Use as.matrix to get rid of the factor / character incompatibility.
Remove names with unname or set them with colnames<-
Use the idcol = (rbindlist) or .id = (map_df) argument to get the names of the list as a column.
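The same key points can also be sketched in base R only, in case you want to avoid extra packages. This is an illustration rather than a purrr solution; add_header_row is a hypothetical helper name, and I set stringsAsFactors = FALSE explicitly so the sketch behaves the same on older R versions.

```r
lst <- list(
  df1 = data.frame(X.1 = as.character(1:2), heading = letters[1:2],
                   stringsAsFactors = FALSE),
  df2 = data.frame(X.32 = as.character(3:4),
                   another.topic = paste("Line ", 1:2),
                   stringsAsFactors = FALSE)
)

# Hypothetical helper: put the old column names in the first row,
# rename the columns to h1, h2, ... and add the list name as an id column
add_header_row <- function(d, id) {
  m <- rbind(names(d), as.matrix(d))   # character matrix, header row first
  out <- as.data.frame(unname(m), stringsAsFactors = FALSE)
  names(out) <- paste0("h", seq_along(out))
  data.frame(id = id, out, stringsAsFactors = FALSE)
}

combined <- do.call(rbind, Map(add_header_row, lst, names(lst)))
rownames(combined) <- NULL
combined
```

Map plays the role of map2 here: it walks over the data frames and their list names in parallel.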
I altered your sample data a bit, setting stringsAsFactors to FALSE when creating the data.frames in lst.
Here is a solution using data.table::rbindlist().
#sample data
lst <- list(df1 = data.frame(X.1 = as.character(1:2),
heading = letters[1:2],
stringsAsFactors = FALSE), # !! <--
df2 = data.frame(X.32 = as.character(3:4),
another.topic = paste("Line ", 1:2),
stringsAsFactors = FALSE) # !! <--
)
DT <- data.table::rbindlist( lapply( lst, function(x) rbind( names(x), x ) ),
use.names = FALSE, idcol = "id" )
# setnames() is called with its namespace because data.table is not attached here
data.table::setnames(DT, names( lst[[1]] ), c("h1", "h2") )
# id h1 h2
# 1: df1 X.1 heading
# 2: df1 1 a
# 3: df1 2 b
# 4: df2 X.32 another.topic
# 5: df2 3 Line 1
# 6: df2 4 Line 2
I have a data.frame df1
df1 <- data.frame(id=1:10)
and I have a second data.frame df2
df2 <- data.frame(id=1:100, key=sample(1:10,100,replace=T), var1=sample(c(TRUE, FALSE),100, replace=T), var2=sample(c("X", "Y"),100, replace=T))
Variable df2$key is a secondary key and points to the variable df1$id.
Now for each entry in df1 I would like to check how many entries there are in df2, given a certain condition.
An example:
If df1$id==5 I would like to create a variable df1$count that counts the number of entries in data.frame df2 where df2$key==5 and df2$var1==TRUE.
Thank you for your help!
Here's how you could do it in base R:
merge(df1, aggregate(var1 ~ key, df2, FUN = sum),
by.x = "id", by.y = "key", all.x = TRUE)
id var1
1 1 3
2 2 1
3 3 4
4 4 6
5 5 9
6 6 4
7 7 5
8 8 7
9 9 4
10 10 3
or using dplyr:
library(dplyr)
df2 %>%
filter(var1) %>%
count(key) %>%
right_join(df1, by = c("key" = "id"))
In both cases we do the counting first and then merge the result to df1.
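One thing to be aware of: an id without any matching TRUE row ends up as NA after the merge or join. If you would rather get 0 for those ids, here is a base R sketch using table on a factor whose levels are the ids. The seed and the name hits are my own additions for the example.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df1 <- data.frame(id = 1:10)
df2 <- data.frame(
  id   = 1:100,
  key  = sample(1:10, 100, replace = TRUE),
  var1 = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

# Keep only the keys where var1 is TRUE; the factor levels force one bin per id,
# so ids without a match are counted as 0 instead of being dropped
hits <- factor(df2$key[df2$var1], levels = df1$id)
df1$count <- as.integer(table(hits))
df1
```

Because the levels are given in the order of df1$id, the counts line up with the rows of df1 without a merge.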
I have two vectors. The first vector is comments$author_id and the second is enrolments$learner_id. I want to add a new column to the enrolments data frame that shows, for each enrolments$learner_id row, how many times that value occurs in the comments$author_id vector.
Example:
if(enrolments$learner_id[1] repeated 5 times in comments$author_id)
enrolments$freqs[1] = 5
Can I do this without using any loops?
The vector samples are as follows:
df1 <- data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 <- data.frame(v2 = c(1,2,3,4,5,6))
I want to add "counts" column to "df2" that shows counts of repeated v2 element in v1.
"[tabulate] gives me this error: Error in $<-.data.frame(*tmp*, "comments_count", value = c(0L, 0L, : replacement has 25596 rows, data has 25597"
That is probably because there is a value at the end of df2$v2 that is not part of df1$v1. I added 0 and 7 to your example to show that:
df1 <- data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 <- data.frame(v2 = c(1,2,3,0,4,5,6,7))
df2$count <- tabulate(factor(df1$v1, df2$v2))
# Error in `$<-.data.frame`(`*tmp*`, count, value = c(7L, 5L, 3L, 0L, 5L, :
# replacement has 7 rows, data has 8
To correct that using tabulate, which might be the fastest solution on larger data:
df2$count <- tabulate(factor(df1$v1, df2$v2), length(df2$v2))
df2
# v2 count
# 1 1 7
# 2 2 5
# 3 3 3
# 4 0 0
# 5 4 5
# 6 5 6
# 7 6 2
# 8 7 0
See ?tabulate for the documentation on that function.
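To see where the length mismatch comes from, here is a sketch that makes the bin indices explicit with match() instead of factor(); idx is just a name I picked. By default tabulate only creates bins up to the largest index it actually sees, so values at the end of v2 that never occur in v1 get no bin unless nbins is given.

```r
v1 <- c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4)
v2 <- c(1,2,3,0,4,5,6,7)

# Position of every v1 value within v2 (the same role the factor codes play)
idx <- match(v1, v2)

short <- tabulate(idx)                      # stops at the largest index present
full  <- tabulate(idx, nbins = length(v2))  # one bin per element of v2

length(short)
length(full)
full
```

The value 0 sits in the middle of v2, so it still gets a bin (with count 0) in both versions; only the trailing 7 is lost without nbins.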
Using your df1 and df2 example, you could do it like this:
# Make data
df1 = data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 = data.frame(v2 = c(1,2,3,4,5,6))
# Add 'count' variable as requested
df2$counts = sapply(df2$v2, function(x) {
sum(df1$v1 == x, na.rm = T) #na.rm=T just in case df1$v1 has missing values
})
df2 #view output
What you essentially are doing is aggregating the df1 to get a count, and then adding this count back to the df2 set. This logic can be easily translated to a bunch of different methods:
# base R
merge(
df2,
aggregate(cbind(df1[0], count=1), df1["v1"], FUN=sum),
by.x="v2", by.y="v1", all.x=TRUE
)
# data.table
library(data.table)
setDT(df1)
setDT(df2)
df2[df1[, .(count=.N), by=v1], on=c("v2"="v1")]
# dplyr
library(dplyr)
df1 %>%
group_by(v1) %>%
count(name = "count") %>%  # name the column 'count' instead of the default 'n'
left_join(df2, ., by=c("v2"="v1"))
# v2 count
#1 1 7
#2 2 5
#3 3 3
#4 4 5
#5 5 6
#6 6 2
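With the aggregate-then-merge logic, values of v2 that never occur in v1 come back as NA rather than 0. A base R sketch of how that could be handled, using a df2 that contains 0 and 7 as two such values (my own variation of the sample data; counts and merged are names chosen for the example):

```r
df1 <- data.frame(v1 = c(1,1,1,4,5,5,4,1,2,3,5,6,2,1,5,2,3,4,1,6,4,2,3,5,1,2,5,4))
df2 <- data.frame(v2 = c(1, 2, 3, 0, 4, 5, 6, 7))

# Step 1: aggregate df1 to one count per distinct value
counts <- aggregate(count ~ v1, data = transform(df1, count = 1L), FUN = sum)

# Step 2: merge the counts onto df2, keeping every row of df2
merged <- merge(df2, counts, by.x = "v2", by.y = "v1", all.x = TRUE)

# Step 3: values without a match have NA after the merge; turn them into 0
merged$count[is.na(merged$count)] <- 0L
merged
```

Note that merge sorts the result by the key column, so the rows come back ordered by v2 rather than in the original order of df2.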