How to do a running calculation across vectors in R? - r

I created this data frame:
Count <- c(1:10)
Give <- c(0,0,5,0,0,5,0,5,0,5)
X <- c(rep(0,10))
Y <- c(rep(0,10))
Z <- c(rep(0,10))
X_Target <- 5
Y_Target <- 10
Z_Target <- 5
Basically I have 3 vectors (X, Y, Z) and a target for each one of them.
I want a new calculation for X, Y and Z that is based on the vector Give.
Whenever the number in Give is bigger than 0, it needs to be added to vector X until X equals X_Target. Then the calculation needs to move on to the next vector (Y) and do the same, and then to the next vector, and so on.
The output should be like the following:
Count Give X Y Z
1 0 0 0 0
2 0 0 0 0
3 5 5 0 0
4 0 5 0 0
5 0 5 0 0
6 5 5 5 0
7 0 5 5 0
8 5 5 10 0
9 0 5 10 0
10 5 5 10 5
In this example I have only 3 vectors, but please keep in mind that I'll have at least 60 vectors, so I need it to be as automatic as it can be.
Hope I managed to explain myself :)
Thanks!

It's ugly, but it gives the desired result.
tab1 = data.frame(
Count = c(1:10),
Give = c(0,0,5,0,0,5,0,5,0,5),
X = c(rep(0,10)),
Y = c(rep(0,10)),
Z = c(rep(0,10))
)
targets <- c(5,10,5)
tab2 <- tab1
start <- 2
for(col in 3:ncol(tab2)) {
target <- targets[col-2]
for(row in start:nrow(tab2)) {
if(tab2[row, 2] > 0 & tab2[row, col] < target) {
tab2[row, col] <- pmin(tab2[row - 1, col] + tab2[row, col - 1], target)
} else {
tab2[row, col] <- tab2[row - 1, col]
}
}
start <- which(tab2[, 2] > 0 & tab2[, col] == target)[2]
}
> tab2
Count Give X Y Z
1 1 0 0 0 0
2 2 0 0 0 0
3 3 5 5 0 0
4 4 0 5 0 0
5 5 0 5 0 0
6 6 5 5 5 0
7 7 0 5 5 0
8 8 5 5 10 0
9 9 0 5 10 0
10 10 5 5 10 5

Turn it into a data frame :
tab1 = data.frame(
Count = c(1:10),
Give =c(0,0,5,0,0,5,0,5,0,5),
X = c(rep(0,10)),
Y = c(rep(0,10)),
Z = c(rep(0,10))
)
# create a list of targets for looping
targets = c(X_Target, Y_Target, Z_Target)
Without using data.table you can just put the whole thing in a loop. It will work, but be much slower.
# loop through each column
for(col in seq(1,length(targets))){
print(col)
# loop through each row
for(row in seq(1, dim(tab1[2+col])[1])){
# condition
while(tab1[row,(2+col)] < targets[col] & tab1[row,2]>0){
tab1[row,(2+col)] = tab1[row,(2+col)] +tab1[row,2]
}
}
}
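For comparison, here is a minimal sketch of a plain base R loop that reproduces the table from the question. It carries a running state from row to row and spends each Give amount on the first column that is still below its target. The names tab, state, remaining and value_cols are illustrative and not from the original posts, and letting a Give amount spill over into the next column once a target is reached is an assumption about the intended behaviour.

```r
tab <- data.frame(Count = 1:10,
                  Give = c(0, 0, 5, 0, 0, 5, 0, 5, 0, 5),
                  X = 0, Y = 0, Z = 0)
targets <- c(X = 5, Y = 10, Z = 5)
value_cols <- names(targets)

# running totals per column, carried forward from one row to the next
state <- setNames(numeric(length(targets)), value_cols)

for (row in seq_len(nrow(tab))) {
  remaining <- tab$Give[row]
  for (col in value_cols) {
    if (remaining <= 0) break
    # add as much as still fits under this column's target
    added <- min(remaining, targets[[col]] - state[[col]])
    state[[col]] <- state[[col]] + added
    remaining <- remaining - added
  }
  # write the current running totals into this row
  tab[row, value_cols] <- as.list(state)
}
tab
```

Because the targets are a named vector, the same loop should extend to 60 columns by lengthening targets, as long as the data frame has matching column names.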

Here is something else to try, using tidyverse.
Put your data into long form, and include targets with a join.
In a loop through Count, find the first row for a given Count that is below target. For current and following rows that have matching names (X, Y, or Z), add Give amount.
In the end, put result back into wide form.
library(tidyverse)
df <- data.frame(Count, Give, X, Y, Z) %>%
pivot_longer(cols = X:Z) %>%
left_join(data.frame(X_Target, Y_Target, Z_Target) %>%
pivot_longer(cols = everything(),
names_to = c("name", ".value"),
names_pattern = "(\\w+)_(\\w+)"))
for (i in seq_along(Count)) {
below_target <- min(which(df$Count == i & df$value < df$Target))
name_rows <- which(df$name == df[below_target, "name", drop = T])
rows_to_change <- name_rows[name_rows >= below_target]
df[rows_to_change, "value"] <- df[rows_to_change, "value"] + df[below_target, "Give", drop = T]
}
df %>%
pivot_wider(id_cols = Count)
Output
Count X Y Z
<int> <dbl> <dbl> <dbl>
1 1 0 0 0
2 2 0 0 0
3 3 5 0 0
4 4 5 0 0
5 5 5 0 0
6 6 5 5 0
7 7 5 5 0
8 8 5 10 0
9 9 5 10 0
10 10 5 10 5

My approach was to make use of the cumulative sums of the Give and then track if that exceeds the targeted values for the columns. Then do some cleaning up.
targets <- c(X_Target, Y_Target, Z_Target)
targets_0 <- c(0, targets)
csum_give <- cumsum(Give)
# from cumsum give take off sum of previous targets
result <- sapply(1:length(targets),
function(x) csum_give - sum(targets_0[1:x]))
# Set max value to target max of column
sapply(1:length(targets),
function(x) result[result[, x] > targets[x], x] <<- targets[x])
# set min value to zero
result[which(result < 0)] <- 0
result
# [,1] [,2] [,3]
# [1,] 0 0 0
# [2,] 0 0 0
# [3,] 5 0 0
# [4,] 5 0 0
# [5,] 5 0 0
# [6,] 5 5 0
# [7,] 5 5 0
# [8,] 5 10 0
# [9,] 5 10 0
# [10,] 5 10 5
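The same cumulative-sum idea can be written without the global assignment (<<-) by clamping each column with pmax() and pmin(). This is only a sketch of that variant; the name offsets is made up here and holds the amount that must be handed out before each column starts filling.

```r
Give <- c(0, 0, 5, 0, 0, 5, 0, 5, 0, 5)
targets <- c(X = 5, Y = 10, Z = 5)

# total of all earlier targets: 0 for X, 5 for Y, 15 for Z
offsets <- cumsum(c(0, head(targets, -1)))
csum_give <- cumsum(Give)

# shift the running total by the offset, then clamp to [0, target]
result <- sapply(seq_along(targets), function(j) {
  pmin(pmax(csum_give - offsets[j], 0), targets[j])
})
colnames(result) <- names(targets)
result
```

Each column only depends on cumsum(Give), so there is no row-by-row loop and adding more targets just means a longer targets vector.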

Related

How to repeat a code in R fulfilling a condition across repeats

I need to repeat a code 24 times (for 24 different participants), making sure that overall, for each Scene2 in each Trial and Route, I have the same number of 1 and 0 across the columns Random of each participant (i.e., Part.1, Part.2, Part.3, etc.) when the Target is equal to 0.
Here is the code I am using:
Scene2 = rep(c(1:10), times=9)
myDF2 <- data.frame(Scene2)
myDF2$Target <- rep(0,10, each=9)
myDF2$Target[myDF2$Scene2==7] <- 1
myDF2$Trial <- rep(c(1:9),each=10)
myDF2$Route <- rep(LETTERS[1:6], each=10, length=nrow(myDF2))
library(plyr)
myDF3 <- myDF2 %>% group_by(Trial, Route) %>% mutate(Random = ifelse(myDF2$Target==0,sample(c(rep(0,5),rep(1,5))),1)) %>% as.data.frame()
I need to obtain something like this:
Scene2 Target Trial Route Part.1 Part.2 Part.3 Part.4 … Part.24 Tot.1 Tot.0
1 0 1 A 0 1 1 0 0 12 12
2 0 1 A 1 0 1 0 0 12 12
3 0 1 A 1 0 0 0 0 12 12
4 0 1 A 0 1 0 1 0 12 12
5 0 1 A 1 0 1 1 0 12 12
6 0 1 A 1 0 0 0 1 12 12
7 1 1 A 1 1 1 1 1 24 0
8 0 1 A 0 0 1 1 1 12 12
9 0 1 A 0 1 1 1 1 12 12
10 0 1 A 0 1 0 0 1 12 12
How to achieve this? Any suggestion would be very much appreciated.
Since there's some conditional logic here that needs to meet particular specifications, I think this is easier to do with a function.
Scene2 = rep(c(1:10), times=9)
myDF2 <- data.frame(Scene2)
myDF2$Target <- rep(0,10, each=9)
myDF2$Target[myDF2$Scene2==7] <- 1
myDF2$Trial <- rep(c(1:9),each=10)
myDF2$Route <- rep(LETTERS[1:6], each=10, length=nrow(myDF2))
library(tidyverse)
fill_random_columns <- function(df, reps) {
  # Balanced bag of 0s and 1s with one value per participant (reps must be even)
  bag <- rep(c(0, 1), each = reps / 2)
  # Build one row per row of df: all 1s where Target is 1,
  # otherwise a random permutation of the balanced bag
  row_vector <- as.data.frame(t(sapply(df$Target, function(v) {
    if (v == 1) rep(1, reps) else sample(bag)
  })))
  # Name the participant columns and sum up the rows
  names(row_vector) <- paste0("Part.", seq_len(reps))
  row_vector$Total <- rowSums(row_vector)
  # Attach to original data frame
  bind_cols(df, row_vector)
}
myDF3 <- myDF2 %>%
group_by(Trial, Route) %>%
fill_random_columns(., 24)
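The balancing idea can be checked on its own with a few rows. This is a small sketch, not part of the answer above: target and parts are illustrative names, and it assumes an even number of participants so that the 0s and 1s can be split equally.

```r
set.seed(1)
reps <- 24                      # number of participants (assumed even)
target <- c(0, 0, 1, 0)         # toy Target column

# each non-target row is a shuffled vector with reps/2 zeros and reps/2 ones
parts <- t(sapply(target, function(v) {
  if (v == 1) rep(1, reps) else sample(rep(c(0, 1), each = reps / 2))
}))

rowSums(parts)          # number of 1s per row
reps - rowSums(parts)   # number of 0s per row
```

Every row with Target equal to 0 contains exactly 12 ones and 12 zeros, whatever the seed, because sample() only permutes the balanced vector.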

Count occurrence of a value within a data frame within the rows above it

I'm trying to find a way to create a matrix which counts values from each row of a data frame. I'd like it to recognise the values in each row of the data frame, and count how many times that value has occurred in all rows above the row the value occurs in (not the whole data frame).
The same value will never occur more than once in a single row of the data frame.
For example:
# df:
a b c
1 2 3
3 4 5
3 2 6
7 8 9
8 3 6
matrix result:
0 0 0 (none of the df values have occurred as there are no rows above)
1 0 0 (3 has occurred once above, the others have not occurred)
2 1 0 (3 has occurred twice above, 2 has occurred once above, 6 has not occurred)
0 0 0 (none of the df values have occurred in rows above)
1 3 1 (8 has occurred once, 3 has occurred 3 times, 6 has occurred once)
Here's one way:
# convert to a vector
x = as.vector(t(as.matrix(df)))
# get counts of each unique element (in the right place)
# and add them up
res = rowSums(sapply(unique(x), function(z) {
r = integer(length(x))
r[x == z] = 0:(sum(x == z) - 1)
return(r)
}))
# convert to matrix
res = matrix(res, ncol = ncol(df), byrow = T)
res
# [,1] [,2] [,3]
# [1,] 0 0 0
# [2,] 1 0 0
# [3,] 2 1 0
# [4,] 0 0 0
# [5,] 1 3 1
Using this data:
df = read.table(text = "
a b c
1 2 3
3 4 5
3 2 6
7 8 9
8 3 6", header = T)
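Since a value never occurs twice in the same row, counting earlier occurrences in the row-major vector is the same as counting occurrences in the rows above. Under that assumption the counting step can also be sketched with ave(), which numbers the occurrences of each value in order:

```r
df <- data.frame(a = c(1, 3, 3, 7, 8),
                 b = c(2, 4, 2, 8, 3),
                 c = c(3, 5, 6, 9, 6))

# flatten row by row
x <- as.vector(t(as.matrix(df)))

# within each distinct value, number its occurrences 1, 2, 3, ...
# and subtract 1 to get "how many times seen before"
seen_before <- ave(seq_along(x), x, FUN = seq_along) - 1L

res <- matrix(seen_before, ncol = ncol(df), byrow = TRUE)
res
```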
Another...for fun
out<-matrix(1,nrow = nrow(df),ncol = ncol(df))
for(i in 1:nrow(df)){
out[i,]<-sapply(1:ncol(df),function(z) sum(unlist(df[0:(i-1),]) %in% df[i,z]))
}
out
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 1 0 0
[3,] 2 1 0
[4,] 0 0 0
[5,] 1 3 1
Three other approaches:
1) with base R:
temp <- stack(df)[c(outer(c(0,5,10), 1:5, '+')),]
temp$val2 <- with(temp, ave(values, values, FUN = seq_along)) - 1
df2 <- unstack(temp, val2 ~ ind)
which gives:
> df2
a b c
1 0 0 0
2 1 0 0
3 2 1 0
4 0 0 0
5 1 3 1
2) with data.table:
library(data.table)
melt(setDT(df)[, r := .I],
id = 'r')[order(r), val2 := rowid(value) - 1
][, dcast(.SD, rowid(variable) ~ variable, value.var = 'val2')
][, variable := NULL][]
which gives the same result.
3) with the tidyverse:
library(dplyr)
library(tidyr)
df %>%
mutate(r = row_number()) %>%
gather(k, v, -4) %>%
arrange(r) %>%
group_by(v) %>%
mutate(v2 = row_number() - 1) %>%
ungroup() %>%
select(r, k, v2) %>%
spread(k, v2)
which, of course, also gives the same result.
Here is another solution:
df = read.table(text = "a b c
1 2 3
3 4 5
3 2 6
7 8 9
8 3 6", header = T)
elements = sort(unique(unlist(df)))
frequency = sapply(elements, # for each element
function(element) {apply(df == element, 1, sum)}) # Sum the number of occurrences per row
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] 1 1 1 0 0 0 0 0 0
# [2,] 0 0 1 1 1 0 0 0 0
# [3,] 0 1 1 0 0 1 0 0 0
# [4,] 0 0 0 0 0 0 1 1 1
# [5,] 0 0 1 0 0 1 0 1 0
results = df
for(i in 1:nrow(df)){
for(j in 1:ncol(df))
results[i,j] = sum(frequency[seq_len(i - 1), # Sum the occurrences in the previous rows
which(df[i,j] == elements)]) # Of the same element
}
# a b c
# 1 0 0 0
# 2 1 0 0
# 3 2 1 0
# 4 0 0 0
# 5 1 3 1
I know we're not supposed to comment with "thanks", but thank you to all. I've marked Brian's response as the most useful because I'm pretty new to R and his was the example I could follow all the way through without needing to look anything up. I'll have fun finding out about all the other ways and new (to me) functions / approaches you've kindly shared though.

Increment call vector on itself in R - Finding counts in between values

I have a time series (or simply a vector) that is binary, returning 0 or 1's depending on some condition (generated with ifelse). I would like to be able to return the counts (in this case corresponds to time series, so days) in between the 1's.
I can do this very easily in Excel, by simply calling the Column I am trying to calculate and then adding the row above (if working with Ascending data, or calling row below if working with descending). See below
I tried doing something similar in R but I am getting an error.
DaysBetweenCondition1 = as.numeric(ifelse((Condition1 ==0 ),0,lag(DaysBetweenCondition1)+1))
Is there an easier way to do this besides writing a function?
Row# Date Condition1 DaysBetweenCondition1
1 5/2/2007 NA NA
2 5/3/2007 NA NA
3 5/4/2007 NA NA
4 5/5/2007 NA NA
5 5/6/2007 0 NA
6 5/7/2007 0 NA
7 5/8/2007 0 NA
8 5/9/2007 0 NA
9 5/10/2007 0 NA
10 5/11/2007 0 NA
11 5/12/2007 0 NA
12 5/13/2007 0 NA
13 5/14/2007 1 0
14 5/15/2007 0 1
15 5/16/2007 0 2
16 5/17/2007 0 3
17 5/18/2007 0 4
18 5/19/2007 0 5
19 5/20/2007 0 6
20 5/21/2007 0 7
21 5/22/2007 1 0
22 5/23/2007 0 1
23 5/24/2007 0 2
24 5/25/2007 0 3
25 5/26/2007 0 4
26 5/27/2007 1 0
27 5/28/2007 0 1
28 5/29/2007 0 2
29 5/30/2007 1 0
(FWIW, the dates in this example are made up. In the real data I am using business days, so it is a bit different, and I don't want to reference them; they are just put in for clarity.)
This gets the counting done in one line. Borrowing PhiSeu's code and a line from How to reset cumsum at end of consecutive string and modifying it to count zeros:
# Example
df_date <- cbind.data.frame(c(1:20),
c(rep("18/08/2016",times=20)),
c(rep(NA,times=5),0,1,0,0,1,0,0,0,0,1,1,0,1,0,0)
,stringsAsFactors=FALSE)
colnames(df_date) <- c("Row#","Date","Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0,nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date,DaysBetweenCondition1)
df_date$DaysBetweenCondition1<-sequence(rle(!df_date$Condition1)$lengths) * !df_date$Condition1
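To see what the one-liner does, here is a small worked sketch on a toy vector (cond and days are illustrative names). rle() finds the runs of !cond, sequence() numbers the positions within each run, and multiplying by !cond zeroes out the rows where the condition is 1.

```r
cond <- c(NA, 0, 0, 1, 0, 0, 0, 1, 1, 0)

# runs of NA / TRUE / FALSE in !cond; each NA counts as its own run
run_lengths <- rle(!cond)$lengths

# position within each run, set to 0 where cond is 1 and NA where cond is NA
days <- sequence(run_lengths) * !cond
days
# NA  1  2  0  1  2  3  0  0  1
```

Note that zeros before the first 1 are counted from 1 here rather than left as NA, which differs from the table in the question.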
R is very good when working with rows that don't depend on each other. Therefore a lot of functions are vectorized. When working with functions that depend on the value of other rows it is not so easy.
At the moment I can only provide you with a solution using a loop. I assume there is a better solution without a loop.
# Example
df_date <- cbind.data.frame(c(1:20),
c(rep("18/08/2016",times=20)),
c(rep(NA,times=5),0,1,0,0,1,0,0,0,0,1,1,0,1,0,0)
,stringsAsFactors=FALSE)
colnames(df_date) <- c("Row#","Date","Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0,nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date,DaysBetweenCondition1)
# loop over rows
for(i in 1:nrow(df_date)){
if(is.na(df_date$Condition1[i])) {
df_date$DaysBetweenCondition1[i] <- NA
} else if(df_date$Condition1[i]==0 & is.na(df_date$Condition1[i-1])) {
df_date$DaysBetweenCondition1[i] <- NA
} else if(df_date$Condition1[i]==0) {
df_date$DaysBetweenCondition1[i] <- df_date$DaysBetweenCondition1[i-1]+1
} else {
df_date$DaysBetweenCondition1[i] <- 0
}
}
Here's a solution that should be relatively fast
f0 = function(x) {
y = x # template for return value
isna = is.na(x) # used a couple of times
grp = cumsum(x[!isna]) # use '1' to mark start of each group
lag = lapply(tabulate(grp + 1), function(len) {
seq(0, length.out=len) # sequence from 0 to len-1
})
split(y[!isna], grp) <- lag # split y, set to lag element, unsplit
data.frame(x, y)
}
A faster version avoids the lapply() loop; it creates a vector along x (seq_along(x)) and an offset vector describing how the vector along x should be corrected based on the start value of the original vector
f1 = function(x0) {
y0 = x0
x = x0[!is.na(x0)]
y = seq_along(x)
offset = rep(c(1, y[x==1]), tabulate(cumsum(x) + 1))
y0[!is.na(y0)] = y - offset
data.frame(x0, y0)
}
Walking through the first solution, here's some data
> set.seed(123)
> x = c(rep(NA, 5), rbinom(30, 1, .15))
> x
[1] NA NA NA NA NA 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1
[26] 1 0 0 1 0 0 0 0 0 0
use cumsum() to figure out the group the non-NA data belong to
> isna = is.na(x)
> grp = cumsum(x[!isna])
> grp
[1] 0 0 0 1 2 2 2 3 3 3 4 4 4 4 4 5 5 5 5 6 7 7 7 8 8 8 8 8 8 8
use tabulate() to figure out the number of elements in each group, lapply() to generate the relevant sequences
> lag = lapply(tabulate(grp + 1), function(len) seq(0, length.out=len))
finally, create a vector to hold the result, and use split<- to update with the lag
> y = x
> split(y[!isna], grp) <- lag
> data.frame(x, y)
x y
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 0 0
7 0 1
8 0 2
9 1 0
10 1 0
11 0 1
12 0 2
13 1 0
14 0 1
15 0 2
16 1 0
17 0 1
...
The key to the second solution is the calculation of the offset. The goal is to be able to 'correct' y = seq_along(x) by the value of y at the most recent 1 in x, kind of like 'fill down' in Excel. The starting values are c(1, y[x==1]) and each needs to be replicated by the number of elements in the group tabulate(cumsum(x) + 1).
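The offset calculation can be followed step by step on a short vector without NAs. This is only an illustration of the description above; starts, sizes and days are names made up for the sketch.

```r
x <- c(0, 0, 1, 0, 0, 1, 0)
y <- seq_along(x)                    # 1 2 3 4 5 6 7

starts <- c(1, y[x == 1])            # 1 3 6: index where each group begins
sizes <- tabulate(cumsum(x) + 1)     # 2 3 2: number of elements per group
offset <- rep(starts, sizes)         # 1 1 3 3 3 6 6

days <- y - offset                   # 0 1 0 1 2 0 1
days
```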

extract information from a data frame

I have a data frame like below
df<- structure(list(s1 = structure(1:3, .Label = c("3-4", "4-1", "5-4"
), class = "factor"), s2 = structure(1:3, .Label = c("2-4", "3-15",
"7-16"), class = "factor")), .Names = c("s1", "s2"), row.names = c(NA,
-3L), class = "data.frame")
Looks like below
> df
# s1 s2
#1 3-4 2-4
#2 4-1 3-15
#3 5-4 7-16
What I want to do is first find the values that are the same after the "-".
For example, here 4 is in the first row of s1, the first row of s2 and the third row of s1.
-The second column indicates how many times the value was found
-The third column shows how many of them are from the first column of df
-The fourth column shows how many of them are from the second column of df
-The fifth shows which strings are from the first column
-The sixth shows which strings are from the second column
The output looks like this:
Value repeated s1N s2N ss1 ss2
4 3 2 1 3,5 2
1 1 1 - 4 -
15 1 - 1 - 3
16 1 - 1 - 7
Surprisingly tough problem. It's good to break it down into several logical steps:
## 1: split into (val,ss) pairs, and capture ci (column index) association
res <- setNames(do.call(rbind,lapply(seq_along(df),function(ci)
do.call(rbind,lapply(strsplit(as.character(df[[ci]]),'-'),function(x)
data.frame(x[2L],x[1L],ci,stringsAsFactors=F)
))
)),c('val','ss','ci'));
res;
## val ss ci
## 1 4 3 1
## 2 1 4 1
## 3 4 5 1
## 4 4 2 2
## 5 15 3 2
## 6 16 7 2
## 2: aggregate ss (joining on comma) by (val,ci), and capture record count as n
res <- do.call(rbind,by(res,res[c('val','ci')],function(x)
data.frame(val=x$val[1L],ci=x$ci[1L],n=nrow(x),ss=paste(x$ss,collapse=','),stringsAsFactors=F)
));
res;
## val ci n ss
## 1 1 1 1 4
## 2 4 1 2 3,5
## 3 15 2 1 3
## 4 16 2 1 7
## 5 4 2 1 2
## 3: reshape to wide format
res <- reshape(res,idvar='val',timevar='ci',dir='w');
res;
## val n.1 ss.1 n.2 ss.2
## 1 1 1 4 NA <NA>
## 2 4 2 3,5 1 2
## 3 15 NA <NA> 1 3
## 4 16 NA <NA> 1 7
## 4: add repeated column; can be calculated by summing all n.* columns
## note: leveraging psum() from <http://stackoverflow.com/questions/12139431/add-variables-whilst-ignoring-nas-using-transform-function>
psum <- function(...,na.rm=F) { x <- list(...); rowSums(matrix(unlist(x),ncol=length(x)),na.rm=na.rm); };
res$repeated <- do.call(psum,c(res[grep('^n\\.[0-9]+$',names(res))],na.rm=T));
res;
## val n.1 ss.1 n.2 ss.2 repeated
## 1 1 1 4 NA <NA> 1
## 2 4 2 3,5 1 2 3
## 3 15 NA <NA> 1 3 1
## 4 16 NA <NA> 1 7 1
With regard to the NAs, you can fix them up afterward if you want. However, I would advise that the proper type of the n.* columns is integer, since they represent counts, therefore the use of '-' (as in your sample output) to represent null cells is inappropriate. I would suggest zero instead. The dash is fine for the ss.* columns, since they are strings. Here's how you can do this:
n.cis <- grep('^n\\.[0-9]+$',names(res));
ss.cis <- grep('^ss\\.[0-9]+$',names(res));
res[n.cis][is.na(res[n.cis])] <- 0L;
res[ss.cis][is.na(res[ss.cis])] <- '-';
res;
## val n.1 ss.1 n.2 ss.2 repeated
## 1 1 1 4 0 - 1
## 2 4 2 3,5 1 2 3
## 3 15 0 - 1 3 1
## 4 16 0 - 1 7 1
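As a cross-check of the counts, the same long (val, ss, column) layout can be tabulated directly with table() and tapply(). This is a rough base R sketch rather than a replacement for the steps above; long, counts, repeated and ss are illustrative names, and the data frame is rebuilt with character columns for simplicity.

```r
df <- data.frame(s1 = c("3-4", "4-1", "5-4"),
                 s2 = c("2-4", "3-15", "7-16"),
                 stringsAsFactors = FALSE)

# split every cell at "-", column by column
parts <- strsplit(unlist(lapply(df, as.character)), "-", fixed = TRUE)
long <- data.frame(
  val = vapply(parts, `[`, character(1), 2, USE.NAMES = FALSE),
  ss  = vapply(parts, `[`, character(1), 1, USE.NAMES = FALSE),
  col = rep(names(df), each = nrow(df)),
  stringsAsFactors = FALSE
)

counts <- table(long$val, long$col)      # per-column counts (s1N, s2N)
repeated <- rowSums(counts)              # total occurrences of each value
ss <- tapply(long$ss, list(long$val, long$col), paste, collapse = ",")

counts
repeated
ss                                       # NA where a value is absent from a column
```

Because the values are handled as strings, the rows come out in character order (1, 15, 16, 4).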
First thing you will need to do is extract the numbers from your strings. Running:
newdfstring <- apply(df,1:2, function(v) unlist(strsplit(v,"-")))
newdf <- apply(newdfstring,1:3, as.numeric)
splits the strings in the first line, and converts them to numeric values in the second. The result is a 3-dimensional matrix which you can use to extract your values.
First create a new dataframe:
#length of the columns in the new frame = number of unique values
dflength <- length(unique(array(newdf[2,,])))
dfout <- data.frame(Value=rep(0,dflength),repeated=rep(0,dflength),s1N=rep(0,dflength),s2N=rep(0,dflength),ss1=rep(0,dflength),ss2=rep(0,dflength))
The most obvious way (yet maybe not the most efficient) would then be to loop and match whatever it is you need:
dfout$Value <- unique(array(newdf[2,,]))
for(i in 1:dflength){
getID <- which(as.data.frame(table(newdf[2,,]))$Var1==dfout$Value[i])
dfout$repeated[i] <- as.data.frame(table(newdf[2,,]))$Freq[getID]
dfout$s1N[i] <- as.data.frame(table(newdf[2,,1]))$Freq[getID]
if(is.na(dfout$s1N[i])){
dfout$s1N[i] <- 0
}
dfout$s2N[i] <- as.data.frame(table(newdf[2,,2]))$Freq[getID]
if(is.na(dfout$s2N[i])){
dfout$s2N[i] <- 0
}
getID <- which(newdf[2,,1]==dfout$Value[i])
if(length(getID)>0){
dfout$ss1[i] <- toString(newdf[1,,1][getID])
} else {
dfout$ss1[i] <- 0
}
getID <- which(newdf[2,,2]==dfout$Value[i])
if(length(getID)>0){
dfout$ss2[i] <- toString(newdf[1,,2][getID])
} else {
dfout$ss2[i] <- 0
}
}
dfout
# Value repeated s1N s2N ss1 ss2
#1 4 3 2 1 3, 5 2
#2 1 1 1 1 4 0
#3 15 1 0 1 0 3
#4 16 1 0 0 0 7
EDIT to loop n amount of s values
newdfstring <- apply(df,1:2, function(v) unlist(strsplit(v,"-")))
newdf <- apply(newdfstring,1:3, as.numeric)
dflength <- length(unique(array(newdf[2,,])))
#find the number of s variables
slength <- length(newdf[1,1,])
#create a matrix of appropriate size
dfout <- matrix(data=NA,nrow=dflength,ncol=(2+2*slength))
#create a (near)-empty names array, we will fill it in later
names <- c("Value","repeated",rep("",2*slength))
#fill in the Values column
dfout[,1] <- unique(array(newdf[2,,]))
#loop for every s variable
for(j in 1:slength){
#get their names, paste N or s and add them to the names array
names[2+j] <- paste(names(df)[j],"N",sep="")
names[2+j+slength] <- paste("s",names(df)[j],sep="")
#loop to get the other values
for(i in 1:dflength){
getID <- which(as.data.frame(table(newdf[2,,]))$Var1==dfout[i,1])
dfout[i,2] <- as.data.frame(table(newdf[2,,]))$Freq[getID]
dfout[i,2+j] <- as.data.frame(table(newdf[2,,j]))$Freq[getID]
if(is.na(dfout[i,2+j])){
dfout[i,2+j] <- 0
}
getID <- which(newdf[2,,j]==dfout[i,1])
if(length(getID)>0){
dfout[i,2+j+slength] <- toString(newdf[1,,j][getID])
} else {
dfout[i,2+j+slength] <- 0
}
}
}
colnames(dfout)<-names
as.data.frame(dfout)
# Value repeated s1N s2N ss1 ss2
#1 4 3 2 1 3, 5 2
#2 1 1 1 1 4 0
#3 15 1 0 1 0 3
#4 16 1 0 0 0 7
df <-
structure(
list(
s1 = structure(1:3, .Label = c("3-4", "4-1", "5-4"), class = "factor"),
s2 = structure(1:3, .Label = c("2-4", "3-15", "7-16"), class = "factor"
)
), .Names = c("s1", "s2"), row.names = c(NA,-3L), class = "data.frame"
)
library(tidyr)
library(dplyr)
# Split columns at "-" and add to data.frame
splitCols <- function(df) {
new_headers <- paste("s1", c("1st", "2nd"), sep = "_")
split_1 <- (separate(df, s1, into = new_headers, sep = "-"))[,new_headers]
split_1$s1_1st <- as.integer(split_1$s1_1st)
split_1$s1_2nd <- as.integer(split_1$s1_2nd)
new_headers <- paste("s2", c("1st", "2nd"), sep = "_")
split_2 <- (separate(df, s2, into = new_headers, sep = "-"))[,new_headers]
split_2$s2_1st <- as.integer(split_2$s2_1st)
split_2$s2_2nd <- as.integer(split_2$s2_2nd)
cbind(df, split_1, split_2)
}
# given a df outputted from splitCols return final df
analyzeDF <- function(df) {
target_vals <- unique(c(df$s1_2nd, df$s2_2nd)) # for each uniq val compute stuff
out_df <- data.frame(Value = integer(0),
repeated = integer(0),
s1N = integer(0),
s2N = integer(0),
ss1 = character(0),
ss2 = character(0))
# iterate through target_vals, create a row of output,
# and append to out_df
for (val in target_vals) {
s1_match <- val == df$s1_2nd
s2_match <- val == df$s2_2nd
total_cnt <- sum(s1_match, s2_match)
s1_firstcol <- paste(df$s1_1st[s1_match], collapse = ",")
s2_firstcol <- paste(df$s2_1st[s2_match], collapse = ",")
# coerce empty string to "-"
if (s1_firstcol == "") s1_firstcol <- "-"
if (s2_firstcol == "") s2_firstcol <- "-"
row_df <- data.frame(Value = val,
repeated = total_cnt,
s1N = sum(s1_match),
s2N = sum(s2_match),
ss1 = s1_firstcol,
ss2 = s2_firstcol)
out_df <- rbind(out_df, row_df)
}
return(out_df)
}
(df_split <- splitCols(df))
analyzeDF(df_split)
## Value repeated s1N s2N ss1 ss2
## 1 4 3 2 1 3,5 2
## 2 1 1 1 0 4 -
## 3 15 1 0 1 - 3
## 4 16 1 0 1 - 7
I have totally rewritten all the code based on your real data and tested it on my machine. Since it is a pretty big data frame, it takes some time to run, and in my opinion the loops are unavoidable.
# function to split the strings
myfun<-function(x){
x<-strsplit(as.character(x), '-')
x1<-unlist(x)
x.new<-as.data.frame(matrix(x1, byrow = T, length(x)))
return(x.new)
}
# this returns a list of dataframes
list.v<-lapply(df[1:dim(df)[2]], myfun)
# like this
head(list.v[[17]])
# try to combine all the dfs, produced an error of mismatching # of columns
df2<-do.call(rbind, list.v)
# some of the dfs in list.v are all NA's, they should be dropped
sum<-summary(list.v)
list.v<-list.v[-which(sum[,1] != "2")] # this excludes those all-NA datafrmes in list.v
# now combine all dfs for indexing purposes
df2<-do.call(rbind, list.v)
# create "value", "repeated" column in the desired result df.
# These codes are same as my previous answer
value<-names(table(df2[,2]))
repeated<-as.vector(table(df2[,2]))
# create an empty list to store the counts columns
list.count<-vector("list", length = length(list.v))
# every df in list.v has same number of rows, get the row number
rownum<-nrow(list.v[[1]])
# use a for loop to fill out list.count
for(i in 0:(length(list.count)-1)){
row.start<-i*rownum+1 # it is kind of tricky here
row.end<-(i+1)*rownum # same as above
list.count[[i + 1]]<-as.vector(table(df2[,2][row.start:row.end]))
}
# combine the vectors in list.count and assign names
count.df<-do.call(cbind, list.count)
count.df<-as.data.frame(count.df)
# create & assign column names in the format of "s_n", where "_" is filled with the corresponding original column name
names.cnt<-character()
for(i in 1:length(names(list.v))){
names.cnt[i]<-paste("s", names(list.v)[i], "n", sep="")
}
names(count.df)<-names.cnt
# this is a very long loop to concatenate the strings and store them into a matrix, but it gets the job done here.
ss.store<-matrix(,nrow = length(value), ncol = length(list.v), byrow = FALSE)
for(i in 1:length(list.v)){
for(j in 1:length(value)){
ss.store[j,i]<-paste(list.v[[i]][,1][which(list.v[[i]][,2] == value[j])], collapse =",")
}
}
# create a df for strings
string.df<-as.data.frame(ss.store, stringsAsFactors = FALSE)
# create & assign names to the df
names.str<-character()
for(i in 1:length(names(list.v))){
names.str[i]<-paste("s", "s", names(list.v)[i], sep="")
}
names(string.df)<-names.str
# combine everything and form the new data frame
new.df<-cbind(value, repeated, count.df, string.df, stringsAsFactors = FALSE)
new.df[1:10, 1:15]
value repeated sAn sF1n sF2n sF3n sF4n sF5n sF6n sF7n sF8n sF9n sF10n sF11n sF12n
1 100 155 3 0 0 0 0 0 0 0 0 0 0 0 0
2 1005 14 1 0 0 0 0 0 0 0 0 0 0 0 0
3 1006 50 1 0 0 0 0 0 0 0 0 0 0 0 0
4 1023 1 1 0 0 0 0 0 0 0 0 0 0 0 0
5 1025 38 1 0 0 0 0 0 0 0 0 0 0 0 0
6 1030 624 1 0 1 2 0 0 0 0 0 0 1 0 0
7 1035 1 1 0 0 0 0 0 0 0 0 0 0 0 0
8 104 165 2 0 0 0 0 0 0 0 0 0 0 0 0
9 1076 186 2 0 0 0 0 0 0 0 0 0 0 0 0
10 1078 333 3 0 0 0 0 0 0 0 0 0 0 0 0

Aggregating every 10 columns in a binary matrix

I am new to R.
I would like to transform a binary matrix like this:
example:
" 1874 1875 1876 1877 1878 .... 2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
Since the column names are years, I would like to aggregate them into decades and obtain something like:
"1840-1849 1850-1859 1860-1869 .... 2000-2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
I am used to python and do not know how to do this transformation without making loops!
Thanks, isabel
It is unclear what aggregation you want, but using the following dummy data
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
The following counts events in each 10-year period.
Get the years as a numeric variable
years <- as.numeric(names(df))
Next we need an indicator for the start of each decade
ind <- seq(from = signif(years[1], 3), to = signif(tail(years, 1), 3), by = 10)
We then apply over the indices of ind (1:(length(ind)-1)), select columns from df that are the current decade and count the 1s using rowSums.
tmp <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)])
}, inds = ind, data = df)
Next we cbind the resulting vectors into a data frame and fix-up the column names:
out <- do.call(cbind.data.frame, tmp)
names(out) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out
This gives:
> out
1870-1879 1880-1889 1890-1899
1 4 5 6
2 4 6 6
3 2 5 5
4 5 5 7
5 3 3 7
6 5 5 4
If you want simply a binary matrix with a 1 indicating at least 1 event happened in that decade, then you can use:
tmp2 <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
as.numeric(rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)]) > 0)
}, inds = ind, data = df)
out2 <- do.call(cbind.data.frame, tmp2)
names(out2) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out2
which gives:
> out2
1870-1879 1880-1889 1890-1899
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
If you want a different aggregation, then modify the function applied in the lapply call to use something other than rowSums.
This is another option, using integer division (%/%) to map each year to the start of its decade before aggregating the columns.
# setup, borrowed from @GavinSimpson
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
result <- do.call(cbind,
by(t(df), as.numeric(names(df)) %/% 10 * 10, colSums))
# add -xxx9 to column names, for each decade
dimnames(result)[[2]] <- paste(colnames(result), as.numeric(colnames(result)) + 9, sep='-')
# 1870-1879 1880-1889 1890-1899
# V1 4 5 6
# V2 4 6 6
# V3 2 5 5
# V4 5 5 7
# V5 3 3 7
# V6 5 5 4
If you wanted to aggregate with something other than sum, replace the call to colSums with something like function(cols) lapply(cols, f), where f is the aggregating function, e.g., max.
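Here is a hedged sketch of swapping the aggregating function, using split.default() to group the columns by decade instead of by(). The names decade, by_decade, sums, any_event and label are made up for the example. Integer division always truncates to the start of the decade, whereas signif() rounds to significant digits, so the two can disagree for years such as 1876.

```r
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6 * 25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24

# start year of the decade each column belongs to
decade <- as.numeric(names(df)) %/% 10 * 10

# list of data frames, one per decade, each holding that decade's columns
by_decade <- split.default(df, decade)

# count of events per row and decade
sums <- sapply(by_decade, rowSums)
# 1 if at least one event happened in the decade, 0 otherwise
any_event <- sapply(by_decade, function(cols) apply(cols, 1, max))

# turn "1870" into "1870-1879" etc.
label <- function(m) paste(colnames(m), as.numeric(colnames(m)) + 9, sep = "-")
colnames(sums) <- label(sums)
colnames(any_event) <- label(any_event)
any_event
```

Any other row-wise summary can be dropped in where max is used, since each element of by_decade is an ordinary data frame.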
