I have two time series (zoo) objects and a data frame
z1
z1 <- structure(c(400L, 125L, 125L, 125L, 120L,400L, 125L, 125L, 125L, 120L,400L, 125L, 125L, 125L, 120L
,400L, 125L, 125L, 125L, 120L), .Dim = c(5L, 4L), .Dimnames = list(NULL, c("T1", "T2", "T3", "T6"
)), index = structure(c(15723, 15725, 15726, 15727, 15728), class = "Date"),
class = "zoo")
T1 T2 T3 T6
2013-01-18 400 400 400 400
2013-01-20 125 125 125 125
2013-01-21 125 125 125 125
2013-01-22 125 125 125 125
2013-01-23 120 120 120 120
z2
z2 <- structure(c(40L, 12L, 25L, 15L, 10L,40L, 25L, 15L, 123L, 190L,150L, 115L, 155L, 105L, 80L
,40L, 425L, 225L, 115L, 20L), .Dim = c(5L, 4L), .Dimnames = list(NULL, c("T1", "T2", "T3", "T6"
)), index = structure(c(15723, 15725, 15726, 15727, 15728), class = "Date"),
class = "zoo")
T1 T2 T3 T6
2013-01-18 40 40 150 40
2013-01-20 12 25 115 425
2013-01-21 25 15 155 225
2013-01-22 15 123 105 115
2013-01-23 10 190 80 20
df
l <- "Name, DOB, TypeOfApply, House
T1, 2008-12-16, sync,44
T2, 2008-12-15, sync,54
T3, 2008-12-19, async,34
T4, 2008-12-18, async,84
T5, 2008-12-11, sync,94"
df <- read.csv(text = l)
I want to apply a formula (a function I created, "calc") based on the condition that TypeOfApply == "sync". z1 and z2 will always have the same number of rows and columns.
calc(z1,z2,df$DOB-2013-01-18,df$House)
T1 T2 T3 T6
2013-01-18 calc(400,40,((2008-12-16)-(2013-01-18)),44) calc(400,40,((2008-12-15)-(2013-01-18)),54) 400 400
2013-01-20 calc(125,12,((2008-12-16)-(2013-01-20)),44) calc(125,25,((2008-12-15)-(2013-01-20)),54) 125 125
2013-01-21 calc(125,25,((2008-12-16)-(2013-01-21)),44) calc(125,15,((2008-12-15)-(2013-01-21)),54) 125 125
2013-01-22 calc(125,15,((2008-12-16)-(2013-01-22)),44) calc(125,123,((2008-12-15)-(2013-01-22)),54) 125 125
2013-01-23 calc(120,10,((2008-12-16)-(2013-01-23)),44) calc(120,190,((2008-12-15)-(2013-01-23)),54) 120 120
So, in this case the formula will be applied to T1 and T2, but not to the others:
T3 - TypeOfApply is async
T5 - Does not exist in z1 and z2
T6 - Does not exist in df
Update
The sequence of names in df may differ, e.g. T2, T1, T3, T5, T4.
Just a sample calc function:
calc <- function(x, y, z, v)
{
  val <- x + y + (z/365) + v
  return(val)
}
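For reference, a quick scalar sanity check of calc (using T1's first row from z1/z2, its DOB, its House value, and the first index date 2013-01-18):
as.numeric(calc(400, 40, as.Date("2008-12-16") - as.Date("2013-01-18"), 44))
# [1] 479.9068
(as.numeric is only needed here because subtracting two Dates gives a difftime, which propagates through the sum.)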
Here, str_trim is used because there are leading/trailing spaces in the "df" columns. Convert the factor column "DOB" to "Date" class, and create an "indx" based on the condition that the "TypeOfApply" elements are "sync" and the corresponding "Name" elements are present in the column names of "z1". This "indx" is used for subsetting "df" as well as "z1" and "z2". Then use the "Map" function to pass the corresponding columns of "z1" and "z2" along with the elements of "df1$DOB" and "df1$House" as inputs to the "calc" function.
library(stringr)
indx <- intersect(with(df,str_trim(Name[str_trim(TypeOfApply)=='sync'])),
colnames(z1))
df1 <- df[str_trim(as.character(df$Name)) %in% indx,c(2,4)]
df1$DOB <- as.Date(str_trim(df1$DOB))
Map(function(u,v,x,y) calc(u, v, x - as.Date('2013-01-18'), y),
    as.data.frame(z1[,indx]), as.data.frame(z2[,indx]), df1$DOB, df1$House)
Update
Using the calc function from OP's post
z3 <- z1[,indx]
index <- as.Date('2013-01-18')
z3[] <- mapply(calc, as.data.frame(z1[,indx]),
as.data.frame(z2[,indx]), df1$DOB-index, df1$House)
z3
# T1 T2
#2013-01-18 479.9068 489.9041
#2013-01-20 176.9068 199.9041
#2013-01-21 189.9068 189.9041
#2013-01-22 179.9068 297.9041
#2013-01-23 169.9068 359.9041
Suppose I change the order of the "df" rows:
set.seed(24)
df <- df[sample(1:nrow(df)),]
Then, the "Map" list elements will be in the same order as "indx", for example,
indx
#[1] "T2" "T1"
df1
# DOB House
#2 2008-12-15 54
#1 2008-12-16 44
Map(function(u,v,x,y) u, as.data.frame(z1[,indx]),
as.data.frame(z2[,indx]), df1$DOB, df1$House)
#$T2
#[1] 400 125 125 125 120
#$T1
#[1] 400 125 125 125 120
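If you want to guarantee that the df1 rows line up with indx after shuffling df, one hedged option is to reorder df1 explicitly (a sketch reusing the trimmed names):
nm <- str_trim(as.character(df$Name))
df1 <- df[nm %in% indx, c(2, 4)]
df1 <- df1[match(indx, nm[nm %in% indx]), ]
df1$DOB <- as.Date(str_trim(df1$DOB))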
I have a data frame where each row contains the counts of a contingency table, and I would like to run a chisq.test on each row in R. The output for each row should be added to the data frame as new columns (X-squared value, p-value).
DF1:
ID1 ID2 female_boxing female_cycling male_boxing male_cycling
A zit 43 170 159 710
B tag 37 134 165 744
C hfs 32 96 170 784
D prt 17 61 185 811
E its 31 112 169 762
F qrw 68 233 130 645
This is what I tried:
apply(DF1[,c('female_boxing','female_cycling','male_boxing','male_cycling')], 1, function(x) chisq.test(x) )
But this gives me only the summary table for each row.
You were close; just inspect a single test with str, which helps you decide which elements to select.
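For example, a single test can be inspected like this (a small sketch of that suggestion, using the dat object from the Data section below):
str(chisq.test(dat[1, c('female_boxing','female_cycling','male_boxing','male_cycling')]))
# an 'htest' list with components such as $statistic, $p.value, $parameter, ...
Selecting just those two elements within apply then gives: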
apply(dat[,c('female_boxing','female_cycling','male_boxing','male_cycling')],
1, function(x) chisq.test(x)[c('statistic', 'p.value')] )
The apply call gives you a list; the results are a little nicer using sapply and looping over the rows.
chi <- t(sapply(seq(nrow(dat)), function(i)
chisq.test(dat[i, c('female_boxing','female_cycling','male_boxing','male_cycling')])[
c('statistic', 'p.value')]))
cbind(dat, chi)
# ID1 ID2 female_boxing female_cycling male_boxing male_cycling statistic p.value
# 1 A zit 43 170 159 710 988.7209 5.033879e-214
# 2 B tag 37 134 165 744 1142.541 2.146278e-247
# 3 C hfs 32 96 170 784 1334.991 3.762222e-289
# 4 D prt 17 61 185 811 1518.015 0
# 5 E its 31 112 169 762 1245.218 1.133143e-269
# 6 F qrw 68 233 130 645 752.3941 9.129485e-163
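One optional follow-up (not part of the original answer): because chi is a matrix of list elements, cbind(dat, chi) produces list columns for statistic and p.value. If plain numeric columns are preferred, they can be flattened:
dat$statistic <- unname(unlist(chi[, 'statistic']))
dat$p.value   <- unlist(chi[, 'p.value'])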
Data:
dat <- structure(list(ID1 = c("A", "B", "C", "D", "E", "F"), ID2 = c("zit",
"tag", "hfs", "prt", "its", "qrw"), female_boxing = c(43L, 37L,
32L, 17L, 31L, 68L), female_cycling = c(170L, 134L, 96L, 61L,
112L, 233L), male_boxing = c(159L, 165L, 170L, 185L, 169L, 130L
), male_cycling = c(710L, 744L, 784L, 811L, 762L, 645L)), class = "data.frame", row.names = c(NA,
-6L))
I have slightly edited the data table.
I would like to correlate variables with similar names in my dataset:
A_y B_y C_y A_p B_p C_p
1 15 52 32 30 98 56
2 30 99 60 56 46 25
3 10 25 31 20 22 30
..........
n 55 23 85 12 34 52
I would like to obtain the correlations
A_y-A_p: 0.78
B_y-B_p: 0.88
C_y-C_p: 0.93
How can I do it in R? Is it possible?
This is really dangerous. Behavior of data.frames with invalid column names is undefined by the language definition. Duplicated column names are invalid.
You should restructure your input data. Anyway, here is an approach with your input data.
DF <- read.table(text = " A B C A B C
1 15 52 32 30 98 56
2 30 99 60 56 46 25
3 10 25 31 20 22 30", header = TRUE, check.names = FALSE)
sapply(unique(names(DF)), function(s) do.call(cor, unname(DF[, names(DF) == s])))
# A B C
#0.9995544 0.1585501 -0.6004010
#compare:
cor(c(15, 30, 10), c(30, 56, 20))
#[1] 0.9995544
Here is another base R option
within(
rev(
stack(
Map(
function(x) do.call(cor, unname(x)),
split.default(df, unique(gsub("_.*", "", names(df))))
)
)
),
ind <- sapply(
ind,
function(x) {
paste0(grep(paste0("^", x), names(df), value = TRUE),
collapse = "-"
)
}
)
)
which gives
ind values
1 A_y-A_p 0.9995544
2 B_y-B_p 0.1585501
3 C_y-C_p -0.6004010
Data
df <- structure(list(A_y = c(15L, 30L, 10L), B_y = c(52L, 99L, 25L),
C_y = c(32L, 60L, 31L), A_p = c(30L, 56L, 20L), B_p = c(98L,
46L, 22L), C_p = c(56L, 25L, 30L)), class = "data.frame", row.names = c("1",
"2", "3"))
I want to create a new variable in a data frame using a lookup table. I have df1, a data frame with Amount and Term, and I need to create a new variable "Premium" whose values come from the lookup table.
I tried the ifelse function but it's too tedious.
Below is an illustration/example
df1 <- data.frame(Amount, Term)
df1
# Amount Term
# 1 2500 23
# 2 3600 30
# 3 7000 45
# 4 12000 50
# 5 16000 38
And I need to create the new variable 'Premium' using the premium lookup table below.
                           Term
Amount            0-24 Mos  25-36 Mos  37-48 Mos  49-60 Mos
0 - 5,000              133        163        175        186
5,001 - 10,000         191        213        229        249
10,001 - 15,000        229        252        275        306
15,001 - 20,000        600        615        625        719
20,001 - 25,000        635        645        675        786
So the output for Premium should be:
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625
Data
df1 <- structure(list(Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
Term = c(23L, 30L, 45L, 50L, 38L)),
class = "data.frame",
row.names = c(NA, -5L))
lkp <- structure(c(133L, 191L, 229L, 600L, 635L,
163L, 213L, 252L, 615L, 645L,
175L, 229L, 275L, 625L, 675L,
186L, 249L, 306L, 719L, 786L),
.Dim = 5:4,
.Dimnames = list(Amount = c("0 - 5,000", "5,001 - 10,000",
"10,001 - 15,000", "15,001 - 20,000",
"20,001 - 25,000"),
Term = c("0-24 Mos", "25-36 Mos", "37-48 Mos",
"49-60 Mos")))
Code
First create the upper limits for month and amount from the column and row names using regular expressions (you did not post your data in a reproducible way, so the regex may need adaptation to your real lookup table structure):
(month <- c(0, as.numeric(sub("\\d+-(\\d+) Mos$",
"\\1",
colnames(lkp)))))
# [1] 0 24 36 48 60
(amt <- c(0, as.numeric(sub("^\\d+,*\\d* - (\\d+),(\\d+)$",
"\\1\\2",
rownames(lkp)))))
# [1] 0 5000 10000 15000 20000 25000
Get the positions for each element of df1 using findInterval:
(rows <- findInterval(df1$Amount, amt))
# [1] 1 1 2 3 4
(cols <- findInterval(df1$Term, month))
# [1] 1 2 3 4 3
Use these indices to subset the lookup matrix:
df1$Premium <- lkp[cbind(rows, cols)]
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625
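If this lookup is needed repeatedly, the two findInterval calls can be wrapped in a small helper (just a sketch reusing the lkp, amt and month objects defined above):
premium_lookup <- function(amount, term) {
  lkp[cbind(findInterval(amount, amt), findInterval(term, month))]
}
premium_lookup(df1$Amount, df1$Term)
# [1] 133 163 229 306 625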
To get what you want, you need to reorganise the lookup table and categorise the data. Below is a potential workflow for handling such situations; hope this is helpful:
library(tidyverse)
df1 <- data.frame(
Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
Term = c(23L, 30L, 45L, 50L, 38L)
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# functions for analysis ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
amount_tier_function <- function(x){
case_when(x <= 5000 ~ "Tier_5000",
x <= 10000 ~ "Tier_10000",
x <= 15000 ~ "Tier_15000",
x <= 20000 ~ "Tier_20000",
TRUE ~ "Tier_25000")
}
month_tier_function <- function(x){
case_when(x <= 24 ~ "Tier_24",
x <= 36 ~ "Tier_36",
x <= 48 ~ "Tier_48",
TRUE ~ "Tier_60")
}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Recut lookup table headings ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
lookup_df <- data.frame(stringsAsFactors=FALSE,
amount_tier = c("Tier_5000", "Tier_10000", "Tier_15000", "Tier_20000",
"Tier_25000"),
Tier_24 = c(133L, 191L, 229L, 600L, 635L),
Tier_36 = c(163L, 213L, 252L, 615L, 645L),
Tier_48 = c(175L, 229L, 275L, 625L, 675L),
Tier_60 = c(186L, 249L, 306L, 719L, 786L)
)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Join everything together ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
lookup_df_tidy <- lookup_df %>%
gather(mth_tier, Premium, - amount_tier)
df1 %>%
mutate(amount_tier = amount_tier_function(Amount),
mth_tier = month_tier_function(Term)) %>%
left_join(., lookup_df_tidy) %>%
select(-amount_tier, -mth_tier)
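Running the final pipe on df1 should reproduce the requested result (assuming left_join infers by = c("amount_tier", "mth_tier")):
#   Amount Term Premium
# 1   2500   23     133
# 2   3600   30     163
# 3   7000   45     229
# 4  12000   50     306
# 5  16000   38     625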
Problem
I have data on two measures for four individuals each in a wide format. The measures are x and y and the individuals are A, B, C, D. The data frame looks like this
d <- data.frame(matrix(sample(1:100, 40, replace = F), ncol = 8))
colnames(d) <- paste(rep(c("x.", "y."),each = 4), rep(LETTERS[1:4], 2), sep ="")
d
x.A x.B x.C x.D y.A y.B y.C y.D
1 56 65 42 96 100 76 39 26
2 19 93 94 75 63 78 5 44
3 22 57 15 62 2 29 89 79
4 49 13 95 97 85 81 60 37
5 45 38 24 91 23 82 83 72
Now, what I would like to obtain for each row is the value of y for the individual with the lowest value of x.
So in the example above, the lowest value of x in row 1 is for individual C. Hence, for row 1 I would like to obtain y.C which is 39.
In the example, the resulting vector should be 39, 63, 89, 81, 83.
Approach
I have tried to get there by first generating a logical matrix from the subset of d holding the values of x.
t(apply(d[,1:4], 1, function(x) min(x) == x))
x.A x.B x.C x.D
[1,] FALSE FALSE TRUE FALSE
[2,] TRUE FALSE FALSE FALSE
[3,] FALSE FALSE TRUE FALSE
[4,] FALSE TRUE FALSE FALSE
[5,] FALSE FALSE TRUE FALSE
Now I want to use this matrix to subset the part of the data frame holding the values of y, but I cannot find a way to achieve this.
Any help is much appreciated. Suggestions for a totally different - more elegant - approach are highly welcome too.
Thanks a lot!
We subset the dataset into the columns starting with 'x' ('dx') and those starting with 'y' ('dy'). Get the column index of the minimum value in each row of 'dx' using max.col, cbind it with the row index, and get the corresponding elements of 'dy'.
dx <- d[grep('^x', names(d))]
dy <- d[grep('^y', names(d))]
dy[cbind(1:nrow(dx),max.col(-dx, 'first'))]
#[1] 39 63 89 81 83
The above can be easily be converted to a function
get_min <- function(dat){
dx <- dat[grep('^x', names(dat))]
dy <- dat[grep('^y', names(dat))]
dy[cbind(1:nrow(dx), max.col(-dx, 'first'))]
}
get_min(d)
#[1] 39 63 89 81 83
Or using the OP's apply based method
t(d[,5:8])[apply(d[,1:4], 1, function(x) min(x) == x)]
#[1] 39 63 89 81 83
data
d <- structure(list(x.A = c(56L, 19L, 22L, 49L, 45L),
                    x.B = c(65L, 93L, 57L, 13L, 38L),
                    x.C = c(42L, 94L, 15L, 95L, 24L),
                    x.D = c(96L, 75L, 62L, 97L, 91L),
                    y.A = c(100L, 63L, 2L, 85L, 23L),
                    y.B = c(76L, 78L, 29L, 81L, 82L),
                    y.C = c(39L, 5L, 89L, 60L, 83L),
                    y.D = c(26L, 44L, 79L, 37L, 72L)),
               .Names = c("x.A", "x.B", "x.C", "x.D", "y.A", "y.B", "y.C", "y.D"),
               class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
Here is my solution. The core idea is that the functions which.min and which.max can be applied row-wise to the data frame:
Edit:
Now, what I would like to obtain for each row is the value of y for
the individual with the lowest value of x.
ind <- apply(d[ ,1:4], 1, which.min) # build column index by row
res <- d[,5:8][cbind(1:nrow(d), ind)] # rows are in order, select values by matrix
names(res) <- colnames(d)[5:8][ind] # set colnames as names from the sample column
res
y.D y.B y.D y.A y.D
18 46 16 85 80
Caveat: this only works if the individuals are in the same order for the x. and y. measures and all individuals are present. Otherwise you can use grep as in Akrun's solution.
# My d was:
x.A x.B x.C x.D y.A y.B y.C y.D
1 88 96 65 55 14 99 63 18
2 12 11 27 45 70 46 20 69
3 32 81 21 9 77 44 91 16
4 8 84 42 78 85 94 28 90
5 31 51 83 2 67 25 54 80
We can create a function as follows:
get_min <- function(x){
d1 <- x[,1:4]
d2 <- x[,5:8]
mtrx <- as.matrix(d2[,apply(d1, 1, which.min)])
a <- row(mtrx) - col(mtrx)
split(mtrx, a)$"0"
}
get_min(d)
#[1] 39 63 89 81 83
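As an aside, the split(mtrx, a)$"0" step works because row(mtrx) - col(mtrx) is 0 exactly on the diagonal, so it extracts mtrx[i, i], i.e. for row i the y value in the column picked by which.min. A tiny illustration:
m <- matrix(1:9, 3)
split(m, row(m) - col(m))$`0`
# [1] 1 5 9    # the diagonal, same as diag(m)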
I have this time series:
Quant1 Quant2
2013-01-23 400 200
2013-01-22 0 0
2013-01-21 0 0
2013-01-20 125 100
2013-01-18 120 0
And I want the output to be:
Quant1 Quant2
2013-01-23 400 200
2013-01-22 125 100
2013-01-21 125 100
2013-01-20 125 100
2013-01-18 120 0
I am trying this, but it does not seem to work. I am getting a null error: "NULL Warning encountered while processing method".
replace(df,df == 0, NA)
df <- na.locf(df)
df[is.na(df)] <- 0
Any suggestions?
Update
As per the most-voted answer, I tried the following (I modified the input dates):
> z <- structure(c(400L, 0L, 0L, 125L, 120L, 200L, 0L, 0L, 100L,
+ 0L), .Dim = c(5L, 2L), .Dimnames = list(NULL, c("Quant1", "Quant2"
+ )), index = structure(c(15728, 15727, 15726, 15725, 15723), class = "Date"),
+ class = "zoo")
> z
Quant1 Quant2
2013-01-23 400 200
2013-01-22 0 0
2013-01-21 0 0
2013-01-20 125 100
2013-01-18 120 0
> L <- rowSums(z != 0) > 0
> z[] <- coredata(z)[which(L)[cumsum(L)],]
> z
Quant1 Quant2
2013-01-23 400 200
2013-01-22 0 0
2013-01-21 0 0
2013-01-20 0 0
2013-01-18 120 0
In the future, please make your questions self-contained, including the library calls and the dput(x) output of any input x.
We assume this is a zoo object, as shown at the end. We will call it z since df suggests that it's a data frame.
library(zoo)
L <- rowSums(z != 0) > 0
z[] <- coredata(z)[which(L)[cumsum(L)],]
giving:
> z
Quant1 Quant2
2013-01-18 400 200
2013-01-20 400 200
2013-01-21 400 200
2013-01-22 125 100
2013-01-23 120 0
Note: This input was used:
z <- structure(c(400L, 400L, 400L, 125L, 120L, 200L, 200L, 200L, 100L,
0L), .Dim = c(5L, 2L), .Dimnames = list(NULL, c("Quant1", "Quant2"
)), index = structure(c(15723, 15725, 15726, 15727, 15728), class = "Date"),
class = "zoo")
I also assumed it to be a zoo object and built the following function by hand, which only checks whether Quant1 is zero or not.
It is less elegant and probably slower than the previous approach by Grothendieck (one should replace the for loop with some apply function), but it may be somewhat instructive to you.
require(zoo)
times <- as.POSIXct(c("2013-01-18", "2013-01-20", "2013-01-21", "2013-01-22", "2013-01-23", "2013-01-25", "2013-01-29", "2013-02-02", "2013-02-04"))
quant1 <- c(400,0,0,125,120,0,70,0,0)
quant2 <- c(200,0,0,100,0,300,150,80, 200)
z <- zoo(data.frame(Quant1 = quant1, Quant2 = quant2), order.by = times)
repl_zeros <- function (z) {
  # +1 marks where a run of zeros in Quant1 begins, -1 where it has just ended
  diffs <- c(0, diff(as.numeric(z$Quant1 == 0)))
  beginnings <- which(diffs == 1)
  ends <- which(diffs == -1) - 1
  # the first non-zero row after each run supplies the replacement values
  valueindices <- ends + 1
  for (i in seq_along(valueindices)) {
    z[beginnings[i]:ends[i], ]$Quant1 <- z[valueindices[i], ]$Quant1
    z[beginnings[i]:ends[i], ]$Quant2 <- z[valueindices[i], ]$Quant2
  }
  z
}
Note: repl_zeros replaces zeros with the following values, as in your example, even though the title of your question says you want to replace them with previous values. Adjusting it to what you really meant should be easy, though.
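If carrying the previous values forward is what was actually meant, a hedged guess at that adjustment inside repl_zeros is to take the row just before each zero run instead of the one after it; it would additionally need guards for zero runs at the very start or end of the series:
valueindices <- beginnings - 1   # instead of ends + 1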