IF THEN on a data frame in R with LAG

I have a dataframe with multiple columns, but two columns in particular are interesting for me.
Column1 contains either 0 or a positive number.
Column2 contains numbers as well.
I want to create 21 new columns containing information from Column2, conditional on Column1.
So when Column1 is positive (not 0), I want the first new column, Column01, to take the value of Column2 from 10 rows back, Column02 the value from 9 rows back, and so on. Column11 is exactly the Column2 value of the current row, and Column21 is the value 10 rows forward.
For example
Column 1 Column2 Columns01 Columns02.. Columns11..Columns20 Columns21
0 5 0 0 0 0 0
0 2 0 0 0 0 0
0 0 0 0 0 0 0
1 3 0 0 3 5 4
0 10 0 0 0 0 0
0 83 0 0 0 0 0
0 2 0 0 0 0 0
0 5 0 0 0 0 0
0 4 0 0 0 0 0
1 8 0 5 8 5 3
0 6 0 0 0 0 0
0 5 0 0 0 0 0
0 55 0 0 0 0 0
0 4 0 0 0 0 0
2 3 10 83 3 5 0
0 2 0 0 0 0 0
0 3 0 0 0 0 0
0 4 0 0 0 0 0
0 5 0 0 0 0 0
0 3 0 0 0 0 0
1 22 6 5 22 0 0
0 12 0 0 0 0 0
0 0 0 0 0 0 0
0 5 0 0 0 0 0
Hope this makes sense to you and you can help.

Here's one way using the newly implemented shift() function from data.table v1.9.5:
require(data.table) ## v1.9.5+
setDT(dat) ## (1)
cols = paste0("cols", sprintf("%.2d", 1:21)) ## (2)
dat[, cols[1:10] := shift(Column2, 10:1, fill=0)] ## (3)
dat[, cols[11] := Column2] ## (4)
dat[, cols[12:21] := shift(Column2, 1:10, fill=0, type="lead")] ## (5)
dat[Column1 == 0, (cols) := 0] ## (6)
(1) Assuming dat is your data.frame, setDT(dat) converts it to a data.table by reference (for efficiency, the data is not physically copied to a new location in memory).
(2) Generate all the column names.
(3) Generate lagged vectors of Column2 with periods 10:1 and assign them to the first 10 columns.
(4) The 11th column equals Column2.
(5) Generate leading vectors of Column2 with periods 1:10 and assign them to the last 10 columns.
(6) Get the indices of all rows where Column1 == 0, and reset all newly generated columns to 0 for those rows.
Use setDF(dat) if you want a data.frame back.
You can wrap this in a function with the values -10:10 and choosing type="lag" or type="lead" accordingly, depending on whether the values are negative or positive.. I'll leave that to you.
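One possible way to write that wrapper is sketched below. The name add_window_cols and its argument k are made up for illustration and are not part of data.table; the sketch assumes the input has columns named Column1 and Column2, as in the question.

```r
library(data.table)

# Hypothetical helper (not part of data.table): builds 2*k + 1 window columns.
add_window_cols <- function(dat, k = 10L) {
  dt <- copy(as.data.table(dat))           # work on a copy, leave the input untouched
  offsets <- -k:k                          # negative = look back, positive = look ahead
  cols <- paste0("cols", sprintf("%.2d", seq_along(offsets)))
  for (j in seq_along(offsets)) {
    off <- offsets[j]
    value <- if (off == 0) {
      dt$Column2                           # middle column: the current row's value
    } else {
      shift(dt$Column2, n = abs(off), fill = 0,
            type = if (off < 0) "lag" else "lead")
    }
    set(dt, j = cols[j], value = value)
  }
  dt[Column1 == 0, (cols) := 0]            # reset rows where Column1 is 0
  dt[]
}

# Small check with a window of one row on each side
toy <- data.frame(Column1 = c(0, 1, 0), Column2 = c(5, 2, 7))
res <- add_window_cols(toy, k = 1)
res
```

With k = 1 the only row that keeps values is the second one, which should hold 5, 2 and 7 in cols01, cols02 and cols03.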

An option using base R
cols = paste0("cols", sprintf("%.2d", 1:21)) # copied from @Arun's post
m1 <- matrix(c(rep(0,10), dat1[,2]), nrow=nrow(dat1)+10+1, ncol=21,
dimnames=list(NULL, cols))[1:nrow(dat1),]
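dat2 <- cbind(dat1, m1*(dat1[,1] != 0)) # multiply by an indicator so Column1 values above 1 do not scale the result
NOTE: Creating m1 raises a warning, because the data length is not a multiple of the number of rows.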
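The length mismatch is what makes the matrix trick work: the vector is one element shorter than a column, so each column starts one position later in the vector than the previous one. A minimal sketch on a toy vector, with a window of one row on each side instead of ten (the names x, k, n and m are only for illustration):

```r
x <- c(1, 2, 3)
k <- 1                                  # window half-width (10 in the answer above)
n <- length(x)
m <- suppressWarnings(                  # the length-mismatch warning is expected here
  matrix(c(rep(0, k), x), nrow = n + k + 1, ncol = 2 * k + 1)
)[seq_len(n), ]
m
#      [,1] [,2] [,3]
# [1,]    0    1    2
# [2,]    1    2    3
# [3,]    2    3    0
```

The first column is x lagged by one row, the middle column is x itself, and the last column is x led by one row, with 0 filling the positions that fall outside the data.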
Checking against the output from @Arun's solution (after running that code on 'dat')
library(data.table)
setDF(dat) #convert the 'data.table' to 'data.frame'
all.equal(dat2, dat, check.attributes=FALSE)
#[1] TRUE
data
set.seed(24)
dat1 <- data.frame(Column1 = sample(0:1,10, replace=TRUE),
Column2 = sample(1:5, 10, replace=TRUE))
dat <- copy(dat1)

Related

Replace a sequence in data frame column

I have a data frame in R that looks somewhat like this:
A | B
0 0
1 0
0 0
0 0
0 1
0 1
1 0
1 0
1 0
I now want to replace all sequences of more than one "1" in the columns so that only the first "1" is kept and the others are replaced by "0", so that the result looks like this
A | B
0 0
1 0
0 0
0 0
0 1
0 0
1 0
0 0
0 0
I hope you understood what I meant (English is not my mother tongue and especially the R "vocabulary" is a bit hard for me, which is probably why I couldn't find a solution through googling). Thank you in advance!
Try this solution:
Input data
df<-data.frame(
A=c(1,0,0,0,0,0,1,1,1,0),
B=c(1,1,0,1,0,0,1,1,0,0))
f<-function(X)
{
return(as.numeric((diff(c(0,X)))>0))
}
Your output
data.frame(lapply(df,f))
A B
1 1 1
2 0 0
3 0 0
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 0 0
10 0 0
You can use ave and create groups based on the difference of your values to capture the consecutives 1s and 0s as different groups and replace duplicates with 0, i.e.
df[] <- lapply(df, function(i)ave(i, cumsum(c(1, diff(i) != 0)),
FUN = function(i) replace(i, duplicated(i), 0)))
which gives,
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0
Here's a simple one line answer:
> df * rbind(c(0,0), sapply(df, diff))
A B
1 0 0
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
9 0 0
This takes advantage of the fact that all unwanted 1's in the original data will become 0's with the diff function.
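To see why diff picks out the first 1 of each run, here is a small sketch on a single made-up vector. It also shows one difference between the variants: padding with 0 before calling diff keeps a 1 in the first position, while forcing the first row to 0 (as rbind(c(0,0), ...) does) drops it.

```r
x <- c(0, 1, 0, 0, 1, 1, 1, 0)
d <- diff(c(0, x))              # prepend 0 so a leading 1 would count as a run start
d
# [1]  0  1 -1  0  1  0  0 -1
first_ones <- as.numeric(d > 0) # +1 marks a 0 -> 1 step, i.e. the start of a run
first_ones
# [1] 0 1 0 0 1 0 0 0

# Edge case: the vector starts with a 1
x2 <- c(1, 1, 0)
kept <- as.numeric(diff(c(0, x2)) > 0)   # keeps the leading 1
lost <- x2 * c(0, diff(x2))              # first element is forced to 0
```

Here kept is 1 0 0, while lost is all zeros, so the one-liner only matches the other answers when the first row contains no 1.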
Here is an option with rleid
library(data.table)
df1[] <- lapply(df1, function(x) +(x==1& !ave(x, rleid(x), FUN = duplicated)))
df1
# A B
#1 0 0
#2 1 0
#3 0 0
#4 0 0
#5 0 1
#6 0 0
#7 1 0
#8 0 0
#9 0 0
Here's a more explicit, loop-based approach. There are shorter answers here, but it's good to see a possible implementation of what happens under the hood:
# helper function
make_zero <- function(val)
{
  get_index <- c() # indexes of the current run of consecutive 1s
  for(i in seq_along(val))
  {
    if(val[i] == 1) get_index <- c(get_index, i)
    else get_index <- c() # the run is broken, start over
    if(all(diff(get_index) == 1))
    {
      # set every 1 after the first one in the run to 0
      val[get_index[-1]] <- 0
    }
  }
  return (val)
}
df <- sapply(df, make_zero)
head(df)
A B
[1,] 0 0
[2,] 1 0
[3,] 0 0
[4,] 0 0
[5,] 0 1
[6,] 0 0
[7,] 1 0
[8,] 0 0
[9,] 0 0
Explanation:
1. We save the indexes of consecutive 1s in get_index.
2. Next, we check if the difference between indexes is 1.
3. If found, we update the value in the column.

How to recode values in a matrix based on a calculation?

I have a question. I'm working on building a recommendation system in R, and I'm fairly new to the language. I can't seem to figure the following out.
I have a matrix like:
eventID g_26 g_27 g_28 g_29 g_30 g_31 g_32 g_33 g_34 g_35 g_36 g_37 g_38 g_39 g_40 g_41 g_42 g_43
1: 1010 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
2: 1016 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
3: 1019 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
4: 1053 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
5: 1168 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0
6: 1188 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
What I'd like to do is replace every 1 with 1/sqrt(total number of 1's in that particular row).
I'm using the Data Table package as well if that makes it easier.
Thanks in advance!
We can multiply the data frame by that value.
Every 0 stays 0, and every 1 becomes 1/sqrt(number of 1's in its row). The eventID column is left out of both the product and the row count.
df[-1] * 1/sqrt(rowSums(df[-1] == 1))
We can specify the columns of interest in .SDcols (-1 selects all columns except the first), sum each row of the Subset of Data.table (.SD) with Reduce and `+`, take the reciprocal of the square root of that sum, multiply it with .SD, and assign (:=) the result to the columns of interest
dt[, (2:ncol(dt)) := .SD*1/sqrt(Reduce(`+`, .SD)), .SDcols = -1]
As an example
m <- matrix(c(1, 1, 0, 1, 0, 0, 1, 0, 0), ncol = 3, byrow = TRUE)
rs <- apply(m,1,sum) # number of 1's in each row
rs <- sqrt(rs)
m <- m/rs # row i is divided by rs[i]
Hope that's helpful
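A quick way to sanity-check any of these approaches: after the rescaling, the squared entries of each row should sum to 1. The sketch below uses a made-up 0/1 matrix and adds a guard for rows without any 1, which would otherwise turn into NaN (0 times 1/sqrt(0)). The names rs, scale_by and res are only for illustration.

```r
m <- matrix(c(1, 1, 0,
              1, 0, 0,
              0, 0, 0), ncol = 3, byrow = TRUE)
rs <- rowSums(m == 1)                          # number of 1's per row
scale_by <- ifelse(rs > 0, 1 / sqrt(rs), 0)    # guard: all-zero rows stay 0
res <- m * scale_by                            # row i is multiplied by scale_by[i]
rowSums(res^2)
# [1] 1 1 0
```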

How to randomly divide an integer into a fixed number of integers, such that the obtained tuples are uniformly distributed?

Based on this reply: Random numbers that add to 100: Matlab
I tried to apply the suggested method to randomly divide an integer into a fixed number of integers whose sum is equal to the integer. Although that method seems to result in a uniformly distributed set of points when the values are not integers, in the case of integers, the resulting tuples are not obtained with equal probability.
This is shown by the following implementation in R, where a simple case is tested with 3 divisors and with the integer to be divided equal to 5:
# Randomly divide an integer into a defined number of integers
# Goal: obtain with equal probability any combination of variable values, with the condition that sum(variables) = dividend.
# install.packages("rgl") # Install the rgl package if not yet installed. It provides the plot3d function used to create a 3D scatterplot.
library(rgl)
n_draws = 10000
n_variables = 3 # Number of divisors. These need to be randomly calculated. Their value must be in the interval [0:dividend] and their sum must be equal to the dividend. Two variables can have the same value.
dividend = 5 # Number that needs to be divided.
rand_variables = matrix(nrow = n_draws, ncol = n_variables) # This matrix contains the final values for each variable (one column per variable).
rand_samples = matrix(nrow = n_draws, ncol = n_variables-1) # This matrix contains the intermediate values that are used to randomly divide the dividend.
for (k in 1:n_draws){
rand_samples[k,] = sample(x = c(0:dividend), size = n_variables-1, replace = TRUE) # Randomly select (n_variables - 1) values within the range 0:dividend. The values in rand_samples are uniformly distributed.
midpoints = sort(rand_samples[k,])
rand_variables[k,] = sample(diff(c(0, midpoints, dividend)), n_variables) # Calculate the values of each variable such that their sum is equal to the dividend.
}
plot3d(rand_variables) # Create a 3D scatterplot showing the values of rand_variables. This plot does not show how frequently each combination of values of the n_variables is obtained, only which combinations of values are possible.
table(data.frame(rand_variables)) # This prints out the count of each combination of values of n_variables. It shows that the combinations of values in the corners (e.g. (5,0,0)) are obtained less frequently than other combinations (e.g. (1,2,2)).
The last line gives the following output, which shows how many times each combination of values of (X1, X2, X3) satisfying X1 + X2 + X3 = 5 was obtained:
, , X3 = 0
X2
X1 0 1 2 3 4 5
0 0 0 0 0 0 397
1 0 0 0 0 471 0
2 0 0 0 469 0 0
3 0 0 446 0 0 0
4 0 456 0 0 0 0
5 358 0 0 0 0 0
, , X3 = 1
X2
X1 0 1 2 3 4 5
0 0 0 0 0 450 0
1 0 0 0 539 0 0
2 0 0 560 0 0 0
3 0 588 0 0 0 0
4 426 0 0 0 0 0
5 0 0 0 0 0 0
, , X3 = 2
X2
X1 0 1 2 3 4 5
0 0 0 0 428 0 0
1 0 0 603 0 0 0
2 0 549 0 0 0 0
3 461 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
, , X3 = 3
X2
X1 0 1 2 3 4 5
0 0 0 500 0 0 0
1 0 549 0 0 0 0
2 455 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
, , X3 = 4
X2
X1 0 1 2 3 4 5
0 0 465 0 0 0 0
1 458 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
, , X3 = 5
X2
X1 0 1 2 3 4 5
0 372 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
As the output shows, the combinations of values in the corners of the plane (e.g. (5,0,0)) are obtained less frequently than other tuples.
How can I obtain any integer tuple with the same probability?
I'm looking for a solution that is applicable for any positive integer and for any number of divisors.
I think trying to make these combinations/permutations manually is reinventing the wheel. There are efficient algorithms to do this implemented in the partitions package. For example,
library(partitions) # compositions, parts, restrictedparts may be of interest
sample_size <- 1000
pool <- compositions(5, 3) # pool of possible tuples
samp <- sample(ncol(pool), sample_size, TRUE) # sample uniformly
## These are your sampled tuples, one per column
z <- matrix(pool[,samp], 3)
Side note: don't use a data.frame, use a matrix to store a set of integers. data.frames will be entirely copied every time you modify something ([.data.frame is not a primitive), whereas matrices will modify in place.
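If enumerating the whole pool is too expensive for a large integer or many parts, a different technique (not the one used above) is the "stars and bars" construction: pick the positions of n_parts - 1 bars uniformly, without replacement, among dividend + n_parts - 1 slots, and read off the gaps between them. Every set of bar positions corresponds to exactly one tuple, so each tuple is equally likely. The function name random_composition below is made up for this sketch.

```r
# Hypothetical helper, not from any package: one uniform draw of n_parts
# non-negative integers that sum to dividend.
random_composition <- function(dividend, n_parts) {
  # positions of the n_parts - 1 bars among dividend + n_parts - 1 slots
  bars <- sort(sample.int(dividend + n_parts - 1, n_parts - 1))
  # number of stars between consecutive bars (and before the first, after the last)
  diff(c(0, bars, dividend + n_parts)) - 1
}

set.seed(1)                                   # arbitrary seed, for reproducibility only
draws <- replicate(2000, random_composition(5, 3))
counts <- table(apply(draws, 2, paste, collapse = ","))
length(counts)                                # number of distinct tuples seen
```

For dividend 5 and 3 parts there are choose(7, 2) = 21 possible tuples, and the counts in the table should all be of similar size, including the corners such as 5,0,0.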

counting the occurrences of a number and when it occurred in R data.frame and data.table

I have newly started to learn R, so my question may be utterly ridiculous. I have a data frame
data<- data.frame('number'=1:11, 'col1'=sample(10:20),'col2'=sample(10:20),'col3'=sample(10:20),'col4'=sample(10:20),'col5'=sample(10:20), 'date'= c('12-12-2014','12-11-2014','12-10-2014','12-09-2014', '12-08-2014','12-07-2014','12-06-2014','12-05-2014','12-04-2014', '12-04-2014', '12-03-2014') )
The number column is an 'id' column and the last column is a date.
I want to count the number of times each number occurs across columns 2:6 taken together (over the whole block of data, not per column), and to find out when each occurrence happened.
I am stuck on the first part having tried the following using data.table:
count <- function(){
i = 1
DT <-data.table(data[2:6])
for (i in 10:20){
DT[, .N, by =i]
i = i + 1
}
}
which gives an error that I don't begin to understand
Error in `[.data.table`(DT, , .N, by = i) :
The items in the 'by' or 'keyby' list are length (1). Each must be same length as rows in x or number of rows returned by i (11)
Can someone help, please. Also with the second part that I have not even attempted yet i.e. associating a date or a row number with each occurrence of a number
Perhaps you may want this
library(reshape2)
table(melt(data[,-1], id.var='date')[,-2])
# value
#date 10 11 12 13 14 15 16 17 18 19 20
# 12-03-2014 0 0 1 0 0 1 0 0 1 2 0
# 12-04-2014 2 0 0 2 2 0 1 0 1 1 1
# 12-05-2014 0 0 0 0 0 0 1 1 2 0 1
# 12-06-2014 1 1 0 0 0 1 0 1 0 0 1
# 12-07-2014 0 1 0 1 0 1 1 1 0 0 0
# 12-08-2014 1 1 0 0 1 0 0 1 1 0 0
# 12-09-2014 0 0 2 0 1 2 0 0 0 0 0
# 12-10-2014 0 0 1 1 0 0 1 0 0 1 1
# 12-11-2014 0 1 1 0 0 0 1 0 0 1 1
# 12-12-2014 1 1 0 1 1 0 0 1 0 0 0
Or if you need a data.table solution (from @Arun's comments)
library(data.table)
dcast.data.table(melt(setDT(data),
id="date", measure=2:6), date ~ value)
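For comparison, the same counts can be sketched in base R without reshaping packages, and the second part of the question (which row and date each occurrence belongs to) falls out of the same long vectors. The seed is arbitrary and the names vals, dates, rows and occurrences are only for illustration.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
data <- data.frame(number = 1:11,
                   col1 = sample(10:20), col2 = sample(10:20), col3 = sample(10:20),
                   col4 = sample(10:20), col5 = sample(10:20),
                   date = c('12-12-2014','12-11-2014','12-10-2014','12-09-2014',
                            '12-08-2014','12-07-2014','12-06-2014','12-05-2014',
                            '12-04-2014','12-04-2014','12-03-2014'))

vals  <- unlist(data[2:6], use.names = FALSE)   # all values, column after column
dates <- rep(data$date, times = 5)              # date belonging to each value
rows  <- rep(data$number, times = 5)            # row id belonging to each value

table(vals)                                     # overall count of each number
counts_by_date <- table(date = dates, value = vals)

# One line per occurrence, so "when" can be looked up directly
occurrences <- data.frame(value = vals, row = rows, date = dates)
occurrences[occurrences$value == 10, ]
```

Because every column here is a permutation of 10:20, each number shows up exactly five times overall; the by-date table shows how those occurrences are spread over the dates.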

Simple crosstable with row- and multicolumn columnnames from R to latex

I am trying to produce a simple crosstable in R and have that exported to latex using knitr in Rstudio.
I want the table to look like a publishable table, with a row header, a column header, and subheaders for each category of the column variable. Since my table has identical categories for rows and columns, I wish to replace the column-level headers with numbers. See example below:
Profession Mother
ProfessionFather 1. 2. 3.
1. Bla frequency frequency frequency
2. blahabblab
3. blahblahblah
I am getting close with 'xtable' (but I can't get the row and column headers to print, nor a multicolumn header) and with the 'tables' package (but I can't replace the column categories with numbers).
Minimal example:
library(xtable) # provides xtable()
library(tables) # provides tabular()
work1 <- paste("LongString", 1:10, sep="")
work2 <- paste("LongString", 1:10, sep="")
t <- table(work1, work2) # making table
t # table with repeated row/column names
colnames(t) <- paste(1:10, ".", sep="") # replacing column names with numeric values
xtable(t) # headers are omitted for both rows and columns
work <- data.frame(cbind(work1, work2)) # prepare for use of tabular
tabular((FathersProfession=work1) ~ (MothersProfession=work2), data=work) # has headers, but no way to change column categories from "LongString"x to numeric.
You need to assign the output of the tabular function to a named object:
tb <- tabular((FathersProfession=work1) ~ (MothersProfession=work2), data=work)
str(tb)
The str() output shows that the data is in a list and that the column names are in the attribute that begins:
- attr(*, "colLabels")= chr [1:2, 1:10] "MothersProfession" "LongString1" NA "LongString10" ...
So
attr(tb, "colLabels") <-
gsub("LongString", "" , attr(tb, "colLabels") )
This is then the output to the screen, but the output to a latex device would be different.
> tb
MothersProfession
FathersProfession 1 10 2 3 4 5 6 7 8 9
LongString1 1 0 0 0 0 0 0 0 0 0
LongString10 0 1 0 0 0 0 0 0 0 0
LongString2 0 0 1 0 0 0 0 0 0 0
LongString3 0 0 0 1 0 0 0 0 0 0
LongString4 0 0 0 0 1 0 0 0 0 0
LongString5 0 0 0 0 0 1 0 0 0 0
LongString6 0 0 0 0 0 0 1 0 0 0
LongString7 0 0 0 0 0 0 0 1 0 0
LongString8 0 0 0 0 0 0 0 0 1 0
LongString9 0 0 0 0 0 0 0 0 0 1
