I have multiple dataframes that I created in a for loop. They all have the same 3 columns: XLocs, YLocs, and PatchStatus. XLocs and YLocs contain the same coordinates in each dataframe. PatchStatus can be either 0 or 1, depending on how the model ran. For example, dataframe 1 looks like
print(listofdfs[1])
allPoints.xLocs allPoints.yLocs allPoints.patchStatus
1 73.5289654 8.8633913 0
2 21.0795393 44.4840248 0
3 51.5969348 21.7864016 0
4 61.9007129 32.4763183 1
5 62.3447741 41.0651838 1
6 16.9311605 6.3765206 0
And dataframe 2 looks like
print(listofdfs[2])
allPoints.xLocs allPoints.yLocs allPoints.patchStatus
1 73.5289654 8.8633913 0
2 21.0795393 44.4840248 1
3 51.5969348 21.7864016 0
4 61.9007129 32.4763183 1
5 62.3447741 41.0651838 1
6 16.9311605 6.3765206 0
I'm hoping to have 1 resultant dataframe that has XLocs, YLocs, and SUM of patch status (note I plan on combining 15 data frames, so PatchStatus can be between 0 and 15).
I posted this answer along with the heatmap - Plot-3: https://stackoverflow.com/a/60974584/1691723
library('data.table')
df2 <- rbindlist(l = listofdfs)
df2 <- df2[, .(sum_patch = sum(allPoints.patchStatus)), by = .(allPoints.xLocs, allPoints.yLocs)]
We can bind the datasets together and do a group by sum
library(dplyr)
bind_rows(listofdfs) %>%
group_by( allPoints.xLocs, allPoints.yLocs) %>%
summarise(allPoints.patchStatus = sum(allPoints.patchStatus))
Or using rbind and aggregate from base R
aggregate(allPoints.patchStatus ~ ., do.call(rbind, listofdfs), FUN = sum)
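A minimal sketch of the bind-then-sum idea on made-up data, using base R only. The list `dfs` and its values are hypothetical stand-ins for `listofdfs`; only the column names come from the question.

```r
# Hypothetical toy input: two model runs over the same three points
coords <- data.frame(allPoints.xLocs = c(73.5, 21.1, 51.6),
                     allPoints.yLocs = c(8.9, 44.5, 21.8))
dfs <- list(
  cbind(coords, allPoints.patchStatus = c(0, 0, 1)),
  cbind(coords, allPoints.patchStatus = c(0, 1, 1))
)

# Stack all runs into one long data frame
stacked <- do.call(rbind, dfs)

# Sum patchStatus within each (x, y) pair
summed <- aggregate(allPoints.patchStatus ~ allPoints.xLocs + allPoints.yLocs,
                    data = stacked, FUN = sum)
print(summed)
```

Grouping on coordinates relies on the values being exactly identical across data frames, which the question says they are. If the coordinates were recomputed in each run, rounding them first might be needed.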
In the reproducible working code shown at the bottom of this post, I'm trying to add a column ("ClassAtSplit", shown in the image below with the header highlighted orange) to the R dataframe that flags the activity in other columns of that dataframe. I could use a series of cumbersome conditionals and offsets like I do in the Excel formula for "ClassAtSplit" shown in the right-most column in the image below labeled "ClassAtSplitFormula", but I am looking for an efficient way to do this in R using something like dplyr. Any suggestions for doing this?
I'm not trying to reproduce "ClassAtSplitFormula"! I only included it to show the formulas in "ClassAtSplit" column.
Reproducible code:
library(dplyr)
data <-
data.frame(
Element = c("C","B","D","A","A","A","C","B","B","B"),
SplitCode = c(0,0,0,0,1,1,0,0,2,2)
)
data <- data %>% group_by(Element) %>% mutate(PreSplitClass=row_number())
data
There is likely a more compact way to solve this in one or two lines, but case_when() lets you translate the nested IF formulas from Excel into dplyr quickly and readably.
library(dplyr)
data <-
data.frame(
Element = c("C","B","D","A","A","A","C","B","B","B"),
SplitCode = c(0,0,0,0,1,1,0,0,2,2)
)
data <- data %>% group_by(Element) %>% mutate(PreSplitClass=row_number()) %>% ungroup()
data %>%
mutate(ClassAtSplit =
case_when(
SplitCode == 0 ~as.integer(0), # This eliminates checking for > 0
SplitCode > lag(SplitCode) ~ PreSplitClass, # if > previous value
SplitCode == lag(SplitCode) ~ lag(PreSplitClass) # if equal (0s are avoided)
)
)
Output:
# A tibble: 10 × 4
Element SplitCode PreSplitClass ClassAtSplit
<chr> <dbl> <int> <int>
1 C 0 1 0
2 B 0 1 0
3 D 0 1 0
4 A 0 1 0
5 A 1 2 2
6 A 1 3 2
7 C 0 2 0
8 B 0 2 0
9 B 2 3 3
10 B 2 4 3
Because SplitCode == 0 is handled by the first case, the later cases never have to rule out rows that equal the previous row only because both are 0.
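For comparison, here is a rough base R sketch of the same rule using nested ifelse() and a hand-rolled lag. The helper `lag1` is a hypothetical name introduced for illustration, and the final branch is a simplification: it also catches a nonzero SplitCode that is lower than the previous one, which the case_when() version would leave as NA.

```r
data <- data.frame(
  Element = c("C","B","D","A","A","A","C","B","B","B"),
  SplitCode = c(0,0,0,0,1,1,0,0,2,2)
)

# Running count within each Element (same idea as row_number() per group)
data$PreSplitClass <- ave(seq_along(data$Element), data$Element, FUN = seq_along)

# Hypothetical helper: shift a vector down one row, padding with NA
lag1 <- function(x) c(NA, head(x, -1))

prev_code  <- lag1(data$SplitCode)
prev_class <- lag1(data$PreSplitClass)

# Zero first, then "higher than previous row", otherwise carry the previous class
data$ClassAtSplit <- ifelse(data$SplitCode == 0, 0L,
                     ifelse(data$SplitCode > prev_code,
                            data$PreSplitClass,
                            prev_class))
print(data)
```

As in the case_when() version, testing for 0 first means the NA produced by the lag on row 1 never reaches the result.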
I'm trying to create a new dataset from an existing one. The new dataset is supposed to combine 60 rows from the original dataset in order to convert a sum of events occurring each second to the total by minute. The number of columns will generally not be known in advance.
For example, with this dataset, if we split it into groups of 3 rows:
d1
a b c d
1 1 1 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 1 0
5 0 0 1 0
6 1 0 0 0
We'll get this data.frame. Row 1 contains the column sums for rows 1-3 of d1 and Row 2 contains the column sums for rows 4-6 of d1:
d2
a b c d
1 1 3 0 2
2 1 0 2 0
I've tried d2<-colSums(d1[seq(1,NROW(d1),3),]) which is about as close as I've been able to get.
I've also considered recommendations from "How to sum rows based on multiple conditions - R?", "How to select every xth row from table", "Remove last N rows in data frame with the arbitrary number of rows", "sum two columns in R", and "Merging multiple rows into single row". I'm all out of ideas. Any help would be greatly appreciated.
Create a grouping variable, group_by that variable, then summarise_all.
# your data
d <- data.frame(a = c(1,0,0,0,0,1),
                b = c(1,1,1,0,0,0),
                c = c(0,0,0,1,1,0),
                d = c(1,1,0,0,0,0))
# create the grouping variable
d$group <- rep(c("A","B"), each = 3)
# apply the sum to all columns
library(dplyr)
d %>%
  group_by(group) %>%
  summarise_all(sum)
Returns:
# A tibble: 2 x 5
group a b c d
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 3 0 2
2 B 1 0 2 0
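The grouping variable does not have to be typed by hand. As a sketch, assuming the rows are already in time order, integer division on the row index builds the blocks for any block size, and base R's rowsum() then sums every column per block without needing the column names. The names `n` and `block` are introduced here for illustration; the data is d1 from the question.

```r
d <- data.frame(a = c(1,0,0,0,0,1),
                b = c(1,1,1,0,0,0),
                c = c(0,0,0,1,1,0),
                d = c(1,1,0,0,0,0))

n <- 3  # rows per block; this would be 60 for seconds-to-minutes

# Block index per row: 1,1,1,2,2,2 (a trailing partial block gets its own index)
block <- (seq_len(nrow(d)) - 1) %/% n + 1

# Column sums within each block, for however many columns there are
d2 <- rowsum(d, group = block)
print(d2)
```

If the number of rows is not a multiple of `n`, the last block simply contains fewer rows.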
Overview
After reading Split up a dataframe by number of rows, I realized the only thing you need to know is how you'd like to split() d1.
Here, you'd like to split d1 into multiple data frames of 3 consecutive rows each. To do that, use rep() to repeat each element of the sequence 1:2 three times (the number of rows divided by the length of the sequence).
After that, the logic involves using map() to sum each column for each data frame created after d1 %>% split(). Here, summarize_all() is helpful since you don't need to know the column names ahead of time.
Once the calculations are complete, you use bind_rows() to stack all the observations back into one data frame.
# load necessary package ----
library(tidyverse)
# load necessary data ----
df1 <-
read.table(text = "a b c d
1 1 0 1
0 1 0 1
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0", header = TRUE)
# perform operations --------
df2 <-
df1 %>%
# split df1 into two data frames
# based on three consecutive rows
split(f = rep(1:2, each = nrow(.) / length(1:2))) %>%
# for each data frame, apply the sum() function to all the columns
map(.f = ~ .x %>% summarize_all(.funs = sum)) %>%
# collapse data frames together
bind_rows()
# view results -----
df2
# a b c d
# 1 1 3 0 2
# 2 1 0 2 0
# end of script #
I am relatively new to R, and am trying to count the number of each value for each variable, in my whole data frame, where this would all be summarised into a new data frame. For example, my data looks like this:
cluster <- data.frame(sex = c(1,1,1,1,0),
mut = c(0,0,0,0,0),
ht = c(1,1,0,1,0),
wt = c(0,1,1,0,1),
group = c(1,0,0,0,0))
cluster
sex mut ht wt group
1 0 1 0 1
1 0 1 1 0
1 0 0 1 0
1 0 1 0 0
0 0 0 1 0
And I want to count how many 1's vs 0's of each variable there is, for the whole data frame.
My desired output is:
Zeroes Ones
sex 1 4
mut 5 0
ht 2 3
wt 2 3
group 4 1
I know how to do this for each variable individually through a variety of means, for example:
>table(cluster$sex)
0 1
1 4
but I have 32 variables in each of 6 data frames so a quicker way to summarise this would be very helpful. I am thinking some sort of looping function, although I am not very knowledgeable in those. Any help would be greatly appreciated!
You can apply a function by column using apply:
df <- apply(cluster, 2, function(x) c('one' = sum(x == 1), 'zero' = sum(x == 0)))
df <- data.frame(t(df)) # Rotate it so categories are rows
df
one zero
sex 4 1
mut 0 5
ht 3 2
wt 3 2
group 1 4
Another option is stack() with table(). Wrap the result in as.data.frame.matrix() if you need a data frame rather than a table:
with(stack(cluster), table(ind, values))
0 1
group 4 1
ht 2 3
mut 5 0
sex 1 4
wt 2 3
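A variation that produces the desired layout directly, sketched in base R: wrapping each column in factor() with explicit levels makes table() report a count for both 0 and 1 even when one of them never occurs (as with mut). The column labels "Zeroes" and "Ones" come from the question; the name `counts` is introduced here.

```r
cluster <- data.frame(sex = c(1,1,1,1,0),
                      mut = c(0,0,0,0,0),
                      ht = c(1,1,0,1,0),
                      wt = c(0,1,1,0,1),
                      group = c(1,0,0,0,0))

# One table per column, with both levels forced so absent values count as 0
counts <- t(sapply(cluster, function(x) table(factor(x, levels = c(0, 1)))))
colnames(counts) <- c("Zeroes", "Ones")
print(counts)
```

To cover several data frames, the same expression could be wrapped in a function and passed to lapply() over a list of them.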
I want to take a data frame like this one:
df <- data.frame(
SortCol1 = rep(c("One", "Two", "Three", "Four"), times = 5),
SortCol2 = rep(c("A", "B"), times = 10),
Arb1 = rep(c(1,0,1,1,0), times = 4),
Arb2 = rep(c(0,1,1,0,0), times = 4)
)
SortCol1 SortCol2 Arb1 Arb2
1 One A 1 0
2 Two B 0 1
3 Three A 1 1
4 Four B 1 0
5 One A 0 0
6 Two B 1 0
7 Three A 0 1
8 Four B 1 1
9 One A 1 0
10 Two B 0 0
11 Three A 1 0
12 Four B 0 1
13 One A 1 1
14 Two B 1 0
15 Three A 0 0
16 Four B 1 0
17 One A 0 1
18 Two B 1 1
19 Three A 1 0
20 Four B 0 0
Then subset it by SortCol1 and SortCol2 to return a list of all subsetted data frames.
I have done something similar to this many times before using ddply when I want to apply a function to the Arb1 and Arb2 columns.
e.g. I know that
ddply(df, c("SortCol1", "SortCol2"), numcolwise(sum))
Will subset based on the two columns I want, and return a minimal frame which has those columns and the sum function applied.
What I want is rather than applying a function to those columns, just have each subset returned as an element of a list.
Pretend the function that does that is called ddply_list. I would hope for something akin to
ddply_list(df, c("SortCol1", "SortCol2"))
Which would return a list whose elements would be the data frames (which I have manually created for now):
df[df$SortCol1=="One" & df$SortCol2 == "A",]
SortCol1 SortCol2 Arb1 Arb2
1 One A 1 0
5 One A 0 0
9 One A 1 0
13 One A 1 1
17 One A 0 1
df[df$SortCol1=="Two" & df$SortCol2 == "B",]
SortCol1 SortCol2 Arb1 Arb2
2 Two B 0 1
6 Two B 1 0
10 Two B 0 0
14 Two B 1 0
18 Two B 1 1
etc for all combinations of SortCol1 and SortCol2.
If there's a function like that already, perfect! If not, any advice for how to get towards this solution would be awesome!
The main bit I'm not sure about is the simplest way to return all subsets of a data frame (subsetted by columns) as a list of data frames.
To put it another way, the ddply documentation describes the .fun argument as the "function to apply to each piece". I think what I want is a way of just returning each 'piece' as an element of a list (preferably with the columns used for subsetting still attached).
Turns out it's very simple:
split(df, df[c("SortCol1", "SortCol2")], drop=TRUE)
Answer stolen from here:
Automatically subset data frame by factor
Usage:
split(x, f, drop = FALSE, ...)
Where x is a vector or dataframe and f is a factor or list of factors for defining groups.
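As a quick check on the question's data, the sketch below runs the split and looks at what comes back. The name `pieces` is introduced here for illustration.

```r
df <- data.frame(
  SortCol1 = rep(c("One", "Two", "Three", "Four"), times = 5),
  SortCol2 = rep(c("A", "B"), times = 10),
  Arb1 = rep(c(1,0,1,1,0), times = 4),
  Arb2 = rep(c(0,1,1,0,0), times = 4)
)

# One list element per combination of SortCol1 and SortCol2 that occurs;
# drop = TRUE discards combinations with no rows (e.g. One/B)
pieces <- split(df, df[c("SortCol1", "SortCol2")], drop = TRUE)

# Elements are named by joining the two group values with "."
print(names(pieces))

# Each element is an ordinary data frame with the grouping columns kept
print(pieces[["One.A"]])
```

Without drop = TRUE the list would also contain zero-row data frames for the combinations that never occur together.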
I have a data frame that has a bunch of data that's joined with commas in certain elements of the rows. Something that looks like:
df <- data.frame(
c(2012,2012,2012,2013,2013,2013,2014,2014,2014)
,c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i")
)
names(df) <- c("year", "type")
I want to get it in a form that dcast is close to getting it to, with the year,a,b,c,etc being the columns, and the frequency across the data frame being in the cells of the resultant data frame. I tried first to do colsplit on df and then use dcast after, but that seems to only work if I want to aggregate on one of the levels instead of all.
df2 <- data.frame( df$year, colsplit(df$type, ',' , c('v1','v2','v3','v4','v5')) )
df3 <- dcast(df2, df.year ~ v1)
This result only gives me for the first level of the colsplit, instead of all of them. Am I close to a solution or should I be using a different approach entirely?
Here is a one-line option in base R: split the 'type' column with strsplit, name the resulting list elements by 'year', stack the list into a single data.frame, and get the frequency counts with table
table(stack(setNames(strsplit(as.character(df$type), ","), df$year))[2:1])
# values
#ind a b c d e f g h i
# 2012 2 1 3 2 1 1 0 0 0
# 2013 4 1 1 1 0 0 0 0 0
# 2014 1 1 0 0 1 0 2 1 1
You are close to the solution. You just need one more step. You need to melt all values in one column before dcast. See the example.
require(reshape2)
df <- data.frame(c(2012,2012,2012,2013,2013,2013,2014,2014,2014),
c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i"))
names(df) <- c("year", "type")
df
df2 <- data.frame(df$year, colsplit(df$type, ',', c('v1','v2','v3','v4','v5')))
df2
df3 <- melt(df2, id.vars = "df.year", na.rm = TRUE)
df3
df4 <- dcast(df3[df3$value != "", ], df.year ~ value, fun.aggregate = length)
df4
Here's a data.table approach:
library(data.table)
setDT(df)
dcast(df[, .(unlist(strsplit(as.character(type), ",", fixed=TRUE))), by = year],
year ~ V1, value.var = "V1", fun.aggregate = length)
# year a b c d e f g h i
#1: 2012 2 1 3 2 1 1 0 0 0
#2: 2013 4 1 1 1 0 0 0 0 0
#3: 2014 1 1 0 0 1 0 2 1 1
We first split the type column by comma and per year-group to a long-format, then dcast to wide with the length as aggregate function.
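The same two steps (long format first, then counts per year) can be sketched in base R without extra packages. The names `parts`, `long` and `wide` are introduced here for illustration; the data is the question's.

```r
df <- data.frame(
  year = c(2012,2012,2012,2013,2013,2013,2014,2014,2014),
  type = c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i"),
  stringsAsFactors = FALSE
)

# Step 1: split each string and repeat the year once per piece (long format)
parts <- strsplit(df$type, ",", fixed = TRUE)
long <- data.frame(year = rep(df$year, lengths(parts)),
                   value = unlist(parts))

# Step 2: cross-tabulate year by value and convert to a data frame (wide format)
wide <- as.data.frame.matrix(table(long$year, long$value))
print(wide)
```

Here the years end up as row names rather than a column; they could be moved into a column with cbind() if needed.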
Maybe, something like this could work?
# extract unique values and years
vals <- unique(do.call(c, strsplit(x = as.vector(df$type), "[[:punct:]]")))
years <- unique(df$year)
# count
df4 <- data.frame(sapply(vals, (function(vl) {sapply(years, (function(ye){
sum(do.call(c, strsplit(as.vector(df$type[df$year == ye]) , "[[:punct:]]")) == vl)
}))})))
df4 <- cbind(years, df4)
df4
#result
years a b c d e f g h i
1 2012 2 1 3 2 1 1 0 0 0
2 2013 4 1 1 1 0 0 0 0 0
3 2014 1 1 0 0 1 0 2 1 1