Using crossprod under specific conditions - R

I am trying to organise a dataset in a very specific way for my research; however, I am new to R and really struggling, so any assistance would be greatly appreciated.
I am attempting to take the value of the cell in every third column (starting from the first one) and multiply it by the column beside it, but only if that cell holds a negative value. Following this, I would like to sum the results together and store the total in a new column, to be exported to an external spreadsheet.
So far, the code I have written is as follows:
NegTotal <- NULL
p <- NULL
for (i in 1:nrow(Datafile)) {
  for (j in 1:ncol(Datafile)) {
    if ((j %% 3 == 0) && (Datafile[i, j] < 0)) {
      p <- Datafile[i, j] * Datafile[i, j + 1]
      NegTotal <- sum(p)
    }
  }
}
for (l in seq(along = NegTotal)) {
  dim(newColumn)
  AsNewData.DataColumn("datafile", GetType(System.String))
  NewColumn.DefaultValue = "NegTotal"
  table.Columns.Add(newColumn)
}
I am aware that this code is probably completely wrong; this is the first time I've used R, and I am not very proficient at computer programming in general.
The current data is arranged as follows:
df <- data.frame(F1 = c( 1, -2, -1), E1 = c(1, 1, 0), Y1 = c(0, 0, 1),
                 F2 = c(-1,  2, -1), E2 = c(1, 1, 1), Y2 = c(0, 0, 1),
                 F3 = c(-2, -2, -1), E3 = c(1, 1, 1), Y3 = c(1, 1, 0))
# F1 E1 Y1 F2 E2 Y2 F3 E3 Y3
# 1 1 1 0 -1 1 0 -2 1 1
# 2 -2 1 0 2 1 0 -2 1 1
# 3 -1 0 1 -1 1 1 -1 1 0
Desired Output:
# F1 E1 Y1 F2 E2 Y2 F3 E3 Y3 NegTotal
# 1 1 1 0 -1 1 0 -2 1 1 -3
# 2 -2 1 0 2 1 0 -2 1 1 -4
# 3 -1 0 1 -1 1 1 -1 1 0 -2
So if x_y = F_y * E_y, then
NegTotal = x_1 + x_2 + x_3, summing only the terms where F_y < 0.
I hope that all makes sense!

Here's how I would approach this with dplyr and tidyr:
library(dplyr)
library(tidyr)
# Add a respondent column (i.e. row number)
df$respondent <- 1:nrow(df)
df %>%
  gather(key, value, -respondent) %>%
  separate(key, c("letter", "letter_sub"), sep = 1) %>%
  spread(letter, value) %>%
  mutate(Neg = ifelse(F < 0, E * F, NA)) %>%
  group_by(respondent) %>%
  summarise(NegTotal = sum(Neg, na.rm = TRUE))
# Source: local data frame [3 x 2]
#
# respondent NegTotal
# (int) (dbl)
# 1 1 -3
# 2 2 -4
# 3 3 -2
To understand what's going on, I would run the pipeline in pieces. For example, look at the results of the first few functions:
df %>%
  gather(key, value, -respondent) %>%
  separate(key, c("letter", "letter_sub"), sep = 1) %>%
  spread(letter, value)
# respondent letter_sub E F Y
# 1 1 1 1 1 0
# 2 1 2 1 -1 0
# 3 1 3 1 -2 1
# 4 2 1 1 -2 0
# 5 2 2 1 2 0
# 6 2 3 1 -2 1
# 7 3 1 0 -1 1
# 8 3 2 1 -1 1
# 9 3 3 1 -1 0
Getting the data into this form makes it easier to perform the summary tasks.
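As a side note for readers on tidyr 1.0 or later, where gather/spread have been superseded: the same pipeline can be written with pivot_longer/pivot_wider. This is a sketch, assuming the column names always follow the one-letter-plus-digit pattern shown above:
library(dplyr)
library(tidyr)
df %>%
  mutate(respondent = row_number()) %>%
  # split names like "F1" into a letter ("F") and a sub-index ("1")
  pivot_longer(-respondent,
               names_to = c("letter", "letter_sub"),
               names_pattern = "([A-Z])([0-9]+)") %>%
  pivot_wider(names_from = letter, values_from = value) %>%
  mutate(Neg = ifelse(F < 0, E * F, NA_real_)) %>%
  group_by(respondent) %>%
  summarise(NegTotal = sum(Neg, na.rm = TRUE))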

This code will give you your desired output. However, if your actual dataset is more complex than the example you gave, you may need a more elegant solution.
df$NegTotal <- (pmin(0, df$F1) * df$E1) + (pmin(0, df$F2) * df$E2) + (pmin(0, df$F3) * df$E3)
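If the real data has more than three F/E pairs, the same idea generalises without spelling each pair out. A minimal sketch, assuming every F column is immediately followed by its matching E column:
f_cols <- grep("^F", names(df))  # positions of F1, F2, F3, ...
e_cols <- f_cols + 1             # each E column sits directly to the right of its F column
df$NegTotal <- rowSums(pmin(as.matrix(df[f_cols]), 0) * as.matrix(df[e_cols]))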

Related

pairwise subtraction of columns in a dataframe in R

I was wondering whether there is a way to automate (e.g., with a loop) the subtraction of (X2-X1), (X3-X1), (X3-X2) in my data below and add the results as three new columns to the data?
m="
id X1 X2 X3
A 1 0 4
B 2 2 2
C 3 4 1"
data <- read.table(text = m, h = T)
This is very similar to this question; we basically just need to change the function that we are using in map2_dfc:
library(tidyverse)
combn(names(data)[-1], 2) %>%
  map2_dfc(.x = .[1, ], .y = .[2, ],
           .f = ~ transmute(data, !!paste0(.y, "-", .x) := !!sym(.y) - !!sym(.x))) %>%
  bind_cols(data, .)
#> id X1 X2 X3 X2-X1 X3-X1 X3-X2
#> 1 A 1 0 4 -1 3 4
#> 2 B 2 2 2 0 0 0
#> 3 C 3 4 1 1 -2 -3
With combn:
dif <- combn(data[-1], 2, \(x) x[, 2] - x[, 1])
colnames(dif) <- combn(names(data)[-1], 2, \(x) paste(x[2], x[1], sep = "-"))
cbind(data, dif)
# id X1 X2 X3 X2-X1 X3-X1 X3-X2
#1 A 1 0 4 -1 3 4
#2 B 2 2 2 0 0 0
#3 C 3 4 1 1 -2 -3
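Both answers hinge on combn enumerating the column pairs; printing that matrix makes the iteration explicit:
combn(names(data)[-1], 2)
#      [,1] [,2] [,3]
# [1,] "X1" "X1" "X2"
# [2,] "X2" "X3" "X3"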

Iterating over columns to create flagging variables

I've got a dataset that has a lot of numerical columns (in the example below these columns are x, y, z). I want to create an individual flagging variable for each of those columns (x_YN, y_YN, z_YN) such that the flag is 1 if the numerical column is > 0 and 0 otherwise. What might be the most efficient way to tackle this?
Thanks for the help!
x <- c(3, 7, 0, 10)
y <- c(5, 2, 20, 0)
z <- c(0, 0, 4, 12)
df <- data.frame(x,y,z)
We may use a logical matrix and coerce it to 0/1:
df[paste0(names(df), "_YN")] <- +(df > 0)
Output:
> df
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
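The unary + is just a compact way to coerce the logical matrix df > 0 to integer; a one-line illustration on a plain vector:
+(c(-1, 0, 3) > 0)  # TRUE/FALSE coerced to 1/0
# [1] 0 0 1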
The dplyr alternative:
library(dplyr)
df %>%
  mutate(across(everything(), ~ +(.x > 0), .names = "{col}_YN"))
Output:
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1

Create a time to and time after event variables

I am working on panel data that looks like this:
d <- data.frame(id   = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c"),
                time = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                iz   = c(0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1))
id time iz
1 a 1 0
2 a 2 1
3 a 3 1
4 a 4 0
5 a 5 0
6 b 1 0
7 b 2 0
8 b 3 0
9 b 4 0
10 b 5 1
11 c 1 0
12 c 2 0
13 c 3 0
14 c 4 1
15 c 5 1
Here iz is an indicator for an event or a treatment (iz = 1). What I need is a variable that counts the periods before and after an event or the distance to and from an event. This variable would look like this:
id time iz nvar
1 a 1 0 -1
2 a 2 1 0
3 a 3 1 0
4 a 4 0 1
5 a 5 0 2
6 b 1 0 -4
7 b 2 0 -3
8 b 3 0 -2
9 b 4 0 -1
10 b 5 1 0
11 c 1 0 -3
12 c 2 0 -2
13 c 3 0 -1
14 c 4 1 0
15 c 5 1 0
I have tried working with the answers given here and here but can't make it work in my case.
I would really appreciate any ideas how to approach this problem. Thank you in advance for all ideas and suggestions.
1) rleid This code applies rleid from data.table within each id and then generates a negative reverse sequence for a run if it is the leading run of 0's before the first run of 1's, and a forward sequence otherwise; positions where iz is 1 are then zeroed out. There can be any number of runs in an id, and ids with only 0's or only 1's are also supported. It assumes that time has no gaps.
library(data.table)
Seq <- function(x, s = seq_along(x)) if (x[1] == 1) -rev(s) else s
nvar <- function(iz, r = rleid(iz)) ave((1-iz) * r, r, FUN = Seq)
transform(d, nvar = (1-iz) * ave(iz, id, FUN = nvar))
giving:
id time iz nvar
1 a 1 0 -1
2 a 2 1 0
3 a 3 1 0
4 a 4 0 1
5 a 5 0 2
6 b 1 0 -4
7 b 2 0 -3
8 b 3 0 -2
9 b 4 0 -1
10 b 5 1 0
11 c 1 0 -3
12 c 2 0 -2
13 c 3 0 -1
14 c 4 1 0
15 c 5 1 0
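If rleid is unfamiliar: it labels each run of equal values with an increasing run id, which Seq then turns into per-run sequences. A quick illustration:
library(data.table)
rleid(c(0, 1, 1, 0, 0))  # one id per run of equal values
# [1] 1 2 2 3 3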
2) base This code uses only base R. It assumes that every id has at most one run of 1's; there is no restriction on the number of 0's, and gaps in time are supported. It applies nvar to the row numbers of each id: it first computes the range rng of the times at which iz is 1, and then the last line computes the signed distance to that range. The output is identical to that shown in (1). If we could assume that every id has exactly one run of 1's, the if statement could be omitted.
nvar <- function(ix) with(d[ix, ], {
  if (all(iz == 0)) return(iz)
  rng <- range(time[iz == 1])
  (time < rng[1]) * (time - rng[1]) + (time > rng[2]) * (time - rng[2])
})
transform(d, nvar = ave(1:nrow(d), id, FUN = nvar))
2a) This variation of (2) passes time and iz to nvar by encoding them as the real and imaginary parts of a complex vector, in order to avoid having to deal with row numbers; it is otherwise the same as (2). We have omitted the if statement from (2), but it could be added back if any ids have no 1's.
nvar <- function(x, time = Re(x), iz = Im(x), rng = range(time[iz == 1]))
  (time < rng[1]) * (time - rng[1]) + (time > rng[2]) * (time - rng[2])
transform(d, nvar = Re(ave(time + iz * 1i, id, FUN = nvar)))
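The complex-number device in (2a) works because ave passes only a single vector to FUN; packing two columns into one complex vector smuggles both through:
x <- c(1, 2, 3) + c(0, 1, 0) * 1i  # time in the real part, iz in the imaginary part
Re(x)  # recovers time: 1 2 3
Im(x)  # recovers iz:   0 1 0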
Here is a solution that is a (tiny) bit more complex than the one from G. Grothendieck, but it will be able to handle non-sequential times.
library(data.table)
# make d a data.table
setDT(d)

# you can remove the trailing [], they are just for passing the output to the console...
# nvar = 0 where iz = 1
d[iz == 1, nvar := 0][]

# calculate nvar for iz == 0 BEFORE iz == 1, using a forward rolling join
# create subsets for readability
d1 <- d[iz == 1, ]
d0 <- d[iz == 0, ]
d[iz == 0, nvar := time - d1[d0, x.time, on = .(id, time), roll = -Inf]][]

# calculate nvar for iz == 0 AFTER iz == 1, using a backward rolling join
# create subsets for readability
d1 <- d[iz == 1, ]
d0 <- d[iz == 0 & is.na(nvar), ]
d[iz == 0 & is.na(nvar), nvar := time - d1[d0, x.time, on = .(id, time), roll = Inf]][]
# id time iz nvar
# 1: a 1 0 -1
# 2: a 2 1 0
# 3: a 3 1 0
# 4: a 4 0 1
# 5: a 5 0 2
# 6: b 1 0 -4
# 7: b 2 0 -3
# 8: b 3 0 -2
# 9: b 4 0 -1
# 10: b 5 1 0
# 11: c 1 0 -3
# 12: c 2 0 -2
# 13: c 3 0 -1
# 14: c 4 1 0
# 15: c 5 1 0
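If rolling joins are new to you, here is a minimal illustration on toy data (not from the question) of what roll = -Inf does: each query row is matched to the nearest event at or after its time, which is exactly the forward lookup used above.
library(data.table)
events <- data.table(id = "a", time = 3)
query  <- data.table(id = "a", time = 1:5)
# x.time returns the matched event time from events, or NA if no later event exists
events[query, x.time, on = .(id, time), roll = -Inf]
# [1]  3  3  3 NA NA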
One dplyr and purrr option could be:
library(dplyr)
library(purrr)

d %>%
  group_by(id) %>%
  mutate(nvar = map_dbl(.x = seq_along(iz), ~ min(abs(.x - which(iz == 1)))),
         nvar = if_else(cumsum(iz) == 0, -nvar, nvar))
id time iz nvar
<fct> <dbl> <dbl> <dbl>
1 a 1 0 -1
2 a 2 1 0
3 a 3 1 0
4 a 4 0 1
5 a 5 0 2
6 b 1 0 -4
7 b 2 0 -3
8 b 3 0 -2
9 b 4 0 -1
10 b 5 1 0
11 c 1 0 -3
12 c 2 0 -2
13 c 3 0 -1
14 c 4 1 0
15 c 5 1 0

R: Iterative deletion of rows with group criteria

I'm trying to delete rows iteratively if they meet two criteria:
slope column < 0
max of Lfd within Ring group
Ring <- c(1, 1, 1, 1, 2, 2, 2, 2)
Lfd <- c(1:4, 1:4)
slope <- c(2, 2, -1, -2, 2, -1, 2, -2)
test <- data.frame(Ring, Lfd, slope)
Ring Lfd slope
1 1 1 2
2 1 2 2
3 1 3 -1
4 1 4 -2
5 2 1 2
6 2 2 -1
7 2 3 2
8 2 4 -2
After the first iteration the data should look like:
Ring Lfd slope
1 1 1 2
2 1 2 2
3 1 3 -1
5 2 1 2
6 2 2 -1
7 2 3 2
And after the second:
Ring Lfd slope
1 1 1 2
2 1 2 2
5 2 1 2
6 2 2 -1
7 2 3 2
I already tried without iteration:
test_out <- test %>%
  group_by(Ring) %>%
  filter(Lfd != which.max(Lfd) & (slope > 0)) %>%
  ungroup
And with iteration:
del.high.neg <- function(x) {
  success <- FALSE
  while (!success) {
    test_out <- test %>%
      group_by(Ring) %>%
      filter(Lfd == which.max(Lfd)) %>%
      select(Ring, Lfd, slope) %>%
      ungroup
    Index <- test_out[test_out$slope < 0, ]
    test_out <- test_out[!(test_out$Ring %in% Index), ]
    success <- Index == NULL
  }
  return(x)
}
I think this is what you want - it will delete every negative row from the end of the data until it hits your first positive value:
library(dplyr)
test %>%
  group_by(Ring) %>%
  mutate(row = row_number()) %>%
  filter(row <= max(which(slope > 0)))
Source: local data frame [5 x 4]
Groups: Ring [2]
Ring Lfd slope row
(dbl) (int) (dbl) (int)
1 1 1 2 1
2 1 2 2 2
3 2 1 2 1
4 2 2 -1 2
5 2 3 2 3
You can add on a select(-row) if you'd like the row column gone too.
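Alternatively, the helper column is not strictly needed; row_number() can be used directly inside filter:
test %>%
  group_by(Ring) %>%
  filter(row_number() <= max(which(slope > 0))) %>%
  ungroup()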
I think you are saying that you want to delete all the rows that have a negative slope and an Lfd greater than or equal to the maximum Lfd among rows with a non-negative slope. If you want to do that within Ring, you can use the following:
library(plyr)
testmax <- ddply(test, .(Ring), summarize, maxLfd = max(Lfd[slope >= 0]))
test1 <- merge(test, testmax)
test_out <- test1[!(test1$Lfd >= test1$maxLfd & test1$slope < 0), -4]
test_out
# Ring Lfd slope
# 1 1 1 2
# 2 1 2 2
# 5 2 1 2
# 6 2 2 -1
# 7 2 3 2
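Since plyr is retired, the same logic can be expressed as a single grouped filter in dplyr; a sketch that reproduces test_out:
library(dplyr)
test %>%
  group_by(Ring) %>%
  # drop negative-slope rows at or past the last non-negative-slope Lfd
  filter(!(slope < 0 & Lfd >= max(Lfd[slope >= 0]))) %>%
  ungroup()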

How to divide my dataset by the number of times a value appears in R

I have a dataset and need to know on average how many times the number 1 appears, the number 0 appears, and the number -1 appears. But it is not a traditional average; let me explain.
This is part of my dataset:
position
1
1
1
0
0
-1
0
-1
-1
-1
-1
-1
1
1
So if I split each number's appearances into vectors of consecutive runs, I would have:
position '1'    position '-1'    position '0'
 X1  X2          X1  X2           X1  X2
  1   1          -1  -1            0   0
  1   1              -1            0
  1                  -1
                     -1
                     -1
This way I can find the average for 1 as (X1 + X2) / 2, where 2 is the number of run vectors that appear; that count is not fixed, and is given by how many separate runs of consecutive appearances a number has.
This is a little confusing, but I hope you understand my point. I have been thinking about how to do this but can't find a way.
Thank you very much!
rle is the way to go, as mentioned by @KonradRudolph. Then you can use split to get a proper format:
with(rle(position), split(lengths, values))
# $`-1`
# [1] 1 5
#
# $`0`
# [1] 2 1
#
# $`1`
# [1] 3 2
And, to do the averaging, tapply would work:
with(rle(position), tapply(lengths, values, FUN=mean))
# -1 0 1
# 3.0 1.5 2.5
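To see what rle actually returns here:
position <- c(1, 1, 1, 0, 0, -1, 0, -1, -1, -1, -1, -1, 1, 1)
rle(position)
# Run Length Encoding
#   lengths: int [1:6] 3 2 1 1 5 2
#   values : num [1:6] 1 0 -1 0 -1 1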
You could also use dplyr with diff:
library(dplyr)
data %>%
  mutate(group = c(0, cumsum(diff(position) != 0))) %>%
  group_by(position) %>%
  summarise(mean = n() / length(unique(group)))
Source: local data frame [3 x 2]
position mean
(int) (dbl)
1 -1 3.0
2 0 1.5
3 1 2.5
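The expression c(0, cumsum(diff(position) != 0)) is a base-R stand-in for a run id: every change in value increments the counter. Using the same position values as a plain vector:
c(0, cumsum(diff(position) != 0))
# [1] 0 0 0 1 1 2 3 4 4 4 4 4 5 5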
A bit verbose, but this shows how all of this comes together:
library(dplyr)
position <- c(1, 1, 1, 0, 0, -1, 0, -1, -1, -1, -1, -1, 1, 1)
rle_pos <- rle(position)
df <- data_frame(position_code = rle_pos$values,
                 length = rle_pos$lengths)
df
# Source: local data frame [6 x 2]
#
# position_code length
# (dbl) (int)
# 1 1 3
# 2 0 2
# 3 -1 1
# 4 0 1
# 5 -1 5
# 6 1 2
df %>%
  group_by(position_code) %>%
  summarise(count = n(),
            sum_lengths = sum(length)) %>%
  mutate(average = sum_lengths / count)
# Source: local data frame [3 x 4]
#
# position_code count sum_lengths average
# (dbl) (int) (int) (dbl)
# 1 -1 2 6 3.0
# 2 0 2 3 1.5
# 3 1 2 5 2.5
