I'm trying to find a way to create a new table with variables using the rowSums() function from an existing dataframe. For example, my existing dataframe is called 'asn' and I want to sum up the values for each row of all variables which contain "2011" in the variable title. I want a new table consisting of just one column called asn_y2011 which contains the sum of each row using the variables containing "2011"
Data
structure(list(row = 1:3, south_2010 = c(1L, 5L, 7L), south_2011 = c(4L,
0L, 4L), south_2012 = c(5L, 8L, 6L), north_2010 = c(3L, 4L, 1L
), north_2011 = c(2L, 6L, 0L), north_2012 = c(1L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-3L))
The existing 'asn' dataframe looks like this
row south_2010 south_2011 south_2012 north_2010 north_2011 north_2012
1 1 4 5 3 2 1
2 5 0 8 4 6 1
3 7 4 6 1 0 2
I'm trying to use the following function:
asn %>%
transmute(asn_y2011 = rowSums(, grep("2011")))
to get something like this
row asn_y2011
1 6
2 6
3 4
Continuing with your code, grep() should work like this:
library(dplyr)
asn %>%
transmute(row, asn_y2011 = rowSums(.[grep("2011", names(.))]))
# row asn_y2011
# 1 1 6
# 2 2 6
# 3 3 4
Or you can use tidy selection in c_across():
asn %>%
rowwise() %>%
transmute(row, asn_y2011 = sum(c_across(contains("2011")))) %>%
ungroup()
Another base R option using rowSums
cbind(asn[1],asn_y2011 = rowSums(asn[grep("2011",names(asn))]))
which gives
row asn_y2011
1 1 6
2 2 6
3 3 4
An option in base R with Reduce
cbind(df['row'], asn_y2011 = Reduce(`+`, df[endsWith(names(df), '2011')]))
# row asn_y2011
#1 1 6
#2 2 6
#3 3 4
data
df <- structure(list(row = 1:3, south_2010 = c(1L, 5L, 7L), south_2011 = c(4L,
0L, 4L), south_2012 = c(5L, 8L, 6L), north_2010 = c(3L, 4L, 1L
), north_2011 = c(2L, 6L, 0L), north_2012 = c(1L, 1L, 2L)),
class = "data.frame", row.names = c(NA,
-3L))
I think that this code will do what you want:
library(magrittr)
tibble::tibble(row = 1:3, south_2011 = c(4, 0, 4), north_2011 = c(2, 6, 0)) %>%
tidyr::gather(- row, key = "key", value = "value") %>%
dplyr::mutate(year = purrr::map_chr(.x = key, .f = function(x)stringr::str_split(x, pattern = "_")[[1]][2])) %>%
dplyr::group_by(row, year) %>%
dplyr::summarise(sum(value))
I first load the package magrittr so that I can use the pipe, %>%. I've explicitly listed the packages from which the functions are exported, but you are welcome to load the packages with library if you like.
I then create a tibble, or data frame, like what you specify.
I use gather to reorganize the data frame before creating a new variable, year. I then summarise the counts by value of row and year.
You can try this approach
library(tidyverse)
df2 <- df %>%
select(grep("_2011|row", names(df), value = TRUE)) %>%
rowwise() %>%
mutate(asn_y2011 = sum(c_across(south_2011:north_2011))) %>%
select(row, asn_y2011)
# row asn_y2011
# <int> <int>
# 1 1 6
# 2 2 6
# 3 3 4
Data
df <- structure(list(row = 1:3, south_2010 = c(1L, 5L, 7L), south_2011 = c(4L, 0L, 4L), south_2012 = c(5L, 8L, 6L), north_2010 = c(3L, 4L, 1L), north_2011 = c(2L, 6L, 0L), north_2012 = c(1L, 1L, 2L)), class = "data.frame", row.names = c(NA,-3L))
Related
I have data where each row represents a household, and I would like to have one row per individual in the different households.
The data looks similar to this:
df <- data.frame(village = rep("aaa",5),household_ID = c(1,2,3,4,5),name_1 = c("Aldo","Giovanni","Giacomo","Pippo","Pippa"),outcome_1 = c("yes","no","yes","no","no"),name_2 = c("John","Mary","Cindy","Eva","Doron"),outcome_2 = c("yes","no","no","no","no"))
I would still like to keep the wide format of the data, just with one individual (and related outcome variables) per row. I could find examples that tell how to do the opposite, going from individual to grouped data using dcast, but I could not find examples of this problem I am facing now.
I have tried with melt
reshape2::melt(df, id.vars = "household_ID")
but I get a long format data.
Any suggestions welcome...
Thank you
Use pivot_longer() in tidyr, and set ".value" in names_to to indicate new column names from the pattern of the original column names.
library(tidyr)
df %>%
pivot_longer(-c(village, household_ID),
names_to = c(".value", "n"),
names_sep = "_")
# # A tibble: 10 x 5
# village household_ID n name outcome
# <fct> <dbl> <chr> <fct> <fct>
# 1 aaa 1 1 Aldo yes
# 2 aaa 1 2 John yes
# 3 aaa 2 1 Giovanni no
# 4 aaa 2 2 Mary no
# 5 aaa 3 1 Giacomo yes
# 6 aaa 3 2 Cindy no
# 7 aaa 4 1 Pippo no
# 8 aaa 4 2 Eva no
# 9 aaa 5 1 Pippa no
# 10 aaa 5 2 Doron no
Data
df <- structure(list(village = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "aaa", class = "factor"),
household_ID = c(1, 2, 3, 4, 5), name_1 = structure(c(1L,
3L, 2L, 5L, 4L), .Label = c("Aldo", "Giacomo", "Giovanni",
"Pippa", "Pippo"), class = "factor"), outcome_1 = structure(c(2L,
1L, 2L, 1L, 1L), .Label = c("no", "yes"), class = "factor"),
name_2 = structure(c(4L, 5L, 1L, 3L, 2L), .Label = c("Cindy",
"Doron", "Eva", "John", "Mary"), class = "factor"), outcome_2 = structure(c(2L,
1L, 1L, 1L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))
I have two data frame. One data frame has only 1 record and 3 columns. Another data frame has 6 rows and 3 columns.
Now I want to subtract data frame 1 values from data frame 2 values.
Sample data:
df1 = structure(list(col1 = 2L, col2 = 3L, col3 = 4L), .Names = c("col1",
"col2", "col3"), class = "data.frame", row.names = c(NA, -1L))
df2 = structure(list(col1 = c(1L, 2L, 4L, 5L, 6L, 3L), col2 = c(1L,
2L, 4L, 3L, 5L, 7L), col3 = c(6L, 4L, 3L, 6L, 4L, 6L)), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA, -6L))
Final output should be like,
output = structure(list(col1 = c(-1L, 0L, 2L, 3L, 4L, 1L), col2 = c(-2L,
-1L, 1L, 0L, 2L, 4L), col3 = c(2L, 0L, -1L, 2L, 0L, 2L)), .Names = c("col1","col2", "col3"), class = "data.frame", row.names = c(NA, -6L))
Try this..
# Creating Datasets
df1 = structure(list(col1 = 2L, col2 = 3L, col3 = 4L), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA, -1L))
df2 = structure(list(col1 = c(1L, 2L, 4L, 5L, 6L, 3L), col2 = c(1L,2L, 4L, 3L, 5L, 7L), col3 = c(6L, 4L, 3L, 6L, 4L, 6L)), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA, -6L))
# Output
data.frame(sapply(names(df1), function(i){df2[[i]] - df1[[i]]}))
# col1 col2 col3
# 1 -1 -2 2
# 2 0 -1 0
# 3 2 1 -1
# 4 3 0 2
# 5 4 2 0
# 6 1 4 2
If you do df2 - df1 directly you get
df2 - df1
Error in Ops.data.frame(df2, df1) :
‘-’ only defined for equally-sized data frames
So let us make df1 the same size as df2 by repeating rows and then subtract
df2 - df1[rep(seq_len(nrow(df1)), nrow(df2)), ]
# col1 col2 col3
#1 -1 -2 2
#2 0 -1 0
#3 2 1 -1
#4 3 0 2
#5 4 2 0
#6 1 4 2
Or another option is using mapply without replicating rows
mapply("-", df2, df1)
This would return a matrix, if you want a dataframe back
data.frame(mapply("-", df2, df1))
# col1 col2 col3
#1 -1 -2 2
#2 0 -1 0
#3 2 1 -1
#4 3 0 2
#5 4 2 0
#6 1 4 2
We can use sweep:
x <- sweep(df2, 2, unlist(df1), "-")
#test if same as output
identical(output, x)
# [1] TRUE
Note, it is twice slower than mapply:
df2big <- data.frame(col1 = runif(100000),
col2 = runif(100000),
col3 = runif(100000))
microbenchmark::microbenchmark(
mapply = data.frame(mapply("-", df2big, df1)),
sapply = data.frame(sapply(names(df1), function(i){df2big[[i]] - df1[[i]]})),
sweep = sweep(df2big, 2, unlist(df1), "-"))
# Unit: milliseconds
# expr min lq mean median uq max neval
# mapply 5.239638 7.645213 11.49182 8.514876 9.345765 60.60949 100
# sapply 5.250756 5.518455 10.94827 8.706027 10.091841 59.09909 100
# sweep 10.572785 13.912167 21.18537 14.985525 16.737820 64.90064 100
I need to find a running maximum of a variable by group using R. The variable is sorted by time within group using df[order(df$group, df$time),].
My variable has some NA's but I can deal with it by replacing them with zeros for this computation.
this is how the data frame df looks like:
(df <- structure(list(var = c(5L, 2L, 3L, 4L, 0L, 3L, 6L, 4L, 8L, 4L),
group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
.Label = c("a", "b"), class = "factor"),
time = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L)),
.Names = c("var", "group","time"),
class = "data.frame", row.names = c(NA, -10L)))
# var group time
# 1 5 a 1
# 2 2 a 2
# 3 3 a 3
# 4 4 a 4
# 5 0 a 5
# 6 3 b 1
# 7 6 b 2
# 8 4 b 3
# 9 8 b 4
# 10 4 b 5
And I want a variable curMax as:
var | group | time | curMax
5 a 1 5
2 a 2 5
3 a 3 5
4 a 4 5
0 a 5 5
3 b 1 3
6 b 2 6
4 b 3 6
8 b 4 8
4 b 5 8
Please let me know if you have any idea how to implement it in R.
We can try data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'group' , we get the cummax of 'var' and assign (:=) it to a new variable ('curMax')
library(data.table)
setDT(df1)[, curMax := cummax(var), by = group]
As commented by #Michael Chirico, if the data is not ordered by 'time', we can do that in the 'i'
setDT(df1)[order(time), curMax:=cummax(var), by = group]
Or with dplyr
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(curMax = cummax(var))
If df1 is tbl_sql explicit ordering might be required, using arrange
df1 %>%
group_by(group) %>%
arrange(time, .by_group=TRUE) %>%
mutate(curMax = cummax(var))
or dbplyr::window_order
library(dbplyr)
df1 %>%
group_by(group) %>%
window_order(time) %>%
mutate(curMax = cummax(var))
you can do it so:
df$curMax <- ave(df$var, df$group, FUN=cummax)
I have a data which look like:
AAA_1 AAA_2 AAA_3 BBB_1 BBB_2 BBB_3 CCC
1 1 1 1 2 2 2 1
2 3 1 4 0 0 0 0
3 5 3 0 1 1 1 1
For each row, I want to make a mean for those columns which have a common feature as follow
feature <- c("AAA","BBB","CCC")
the desired output should look like:
AAA BBB CCC
1 1 2 1
2 2.6 0 0
3 2.6 1 1
for each pattern separately I was able to do that:
data <- read.table("data.txt",header=T,row.name=1)
AAA <- as.matrix(rowMeans(data[ , grepl("AAA" , names( data ) ) ])
But I did not know how to do partially match for different patterns in one row
Also tried some other things like :
for (i in 1:length(features)){
feature[i] <- as.matrix(rowMeans(data[ , grepl(feature[i] , names( data ) ) ]))
}
Assuming your colnames are always structured as shown in your example, then you can split the names and aggregate.
new_names <- unlist(strsplit(names(df),"\\_.*"))
colnames(df) <- new_names
#Testing with your data, we need to prevent the loss of dimension by using drop = FALSE
sapply(unique(new_names), function(i) rowMeans(df[, new_names==i, drop = FALSE]))
# AAA BBB CCC
#[1,] 1.000000 2 1
#[2,] 2.666667 0 0
#[3,] 2.666667 1 1
Data:
df <- structure(list(AAA_1 = c(1L, 3L, 5L), AAA_2 = c(1L, 1L, 3L),
AAA_3 = c(1L, 4L, 0L), BBB_1 = c(2L, 0L, 1L), BBB_2 = c(2L,
0L, 1L), BBB_3 = c(2L, 0L, 1L), CCC = c(1L, 0L, 1L)), .Names = c("AAA_1",
"AAA_2", "AAA_3", "BBB_1", "BBB_2", "BBB_3", "CCC"), class = "data.frame", row.names = c(NA,
-3L))
Here is another option for you. Seeing your column pattern, I chose to use gsub() and get the first three letters. Using ind which includes AAA, BBB, and CCC, I used lapply(), subsetted the data for each element of ind, calculated row means, and extracted a column for row mean only. Then, I used bind_cols() and created foo. The last thing was to assign column names to foo.
library(dplyr)
ind <- unique(gsub("_\\d+$", "", names(mydf)))
lapply(ind, function(x){
select(mydf, contains(x)) %>%
transmute(out = rowMeans(.))
}) %>%
bind_cols() %>%
add_rownames -> foo
names(foo) <- ind
# AAA BBB CCC
# (dbl) (dbl) (dbl)
#1 1.000000 2 1
#2 2.666667 0 0
#3 2.666667 1 1
DATA
mydf <- structure(list(AAA_1 = c(1L, 3L, 5L), AAA_2 = c(1L, 1L, 3L),
AAA_3 = c(1L, 4L, 0L), BBB_1 = c(2L, 0L, 1L), BBB_2 = c(2L,
0L, 1L), BBB_3 = c(2L, 0L, 1L), CCC = c(1L, 0L, 1L)), .Names = c("AAA_1",
"AAA_2", "AAA_3", "BBB_1", "BBB_2", "BBB_3", "CCC"), class = "data.frame", row.names = c(NA,
-3L))
library(dplyr)
library(tidyr)
data %>%
add_rownames() %>%
gather("variable", "value", -rowname) %>%
mutate(variable = gsub("_.*$", "", variable)) %>%
group_by(rowname, variable) %>%
summarise(mean = mean(value)) %>%
spread(variable, mean)
I need to find a running maximum of a variable by group using R. The variable is sorted by time within group using df[order(df$group, df$time),].
My variable has some NA's but I can deal with it by replacing them with zeros for this computation.
this is how the data frame df looks like:
(df <- structure(list(var = c(5L, 2L, 3L, 4L, 0L, 3L, 6L, 4L, 8L, 4L),
group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
.Label = c("a", "b"), class = "factor"),
time = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L)),
.Names = c("var", "group","time"),
class = "data.frame", row.names = c(NA, -10L)))
# var group time
# 1 5 a 1
# 2 2 a 2
# 3 3 a 3
# 4 4 a 4
# 5 0 a 5
# 6 3 b 1
# 7 6 b 2
# 8 4 b 3
# 9 8 b 4
# 10 4 b 5
And I want a variable curMax as:
var | group | time | curMax
5 a 1 5
2 a 2 5
3 a 3 5
4 a 4 5
0 a 5 5
3 b 1 3
6 b 2 6
4 b 3 6
8 b 4 8
4 b 5 8
Please let me know if you have any idea how to implement it in R.
We can try data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'group' , we get the cummax of 'var' and assign (:=) it to a new variable ('curMax')
library(data.table)
setDT(df1)[, curMax := cummax(var), by = group]
As commented by #Michael Chirico, if the data is not ordered by 'time', we can do that in the 'i'
setDT(df1)[order(time), curMax:=cummax(var), by = group]
Or with dplyr
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(curMax = cummax(var))
If df1 is tbl_sql explicit ordering might be required, using arrange
df1 %>%
group_by(group) %>%
arrange(time, .by_group=TRUE) %>%
mutate(curMax = cummax(var))
or dbplyr::window_order
library(dbplyr)
df1 %>%
group_by(group) %>%
window_order(time) %>%
mutate(curMax = cummax(var))
you can do it so:
df$curMax <- ave(df$var, df$group, FUN=cummax)