Sum of longest stretch of non-zero values - R

I have a dataframe containing the daily rainfall values at 76 stations from 1964-2013. Each row is a different month for a particular station. Here is a snippet of the dataframe:
Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USC00020750 1964 1 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 23 51 36 0 0 0 0 0 0 0 0
USC00020750 1964 2 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 0 0 0 3 0 0 0 0 0 0 Inf Inf
USC00020750 1964 3 31 0 46 51 0 0 36 41 46 0 0 0 0 43 0 0 0 0 0 0 0 0 53 99 140 36 0 0 0 0 0 0
USC00020750 1964 4 30 5 69 23 30 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 13 0 0 0 15 0 Inf
USC00020750 1964 5 31 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 8 0 0 0 0
USC00020750 1964 6 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 Inf
USC00020750 1964 7 31 0 0 0 0 0 0 0 0 0 0 0 0 41 0 13 13 0 0 0 0 8 51 0 71 0 10 0 0 20 165 25
USC00020750 1964 8 31 8 30 137 0 0 5 89 0 0 0 18 64 5 0 0 0 0 0 0 0 0 0 0 0 0 76 0 0 0 0 0
USC00020750 1964 9 30 0 0 0 0 0 119 0 0 0 0 0 0 0 41 25 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 10 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
USC00020750 1964 11 30 0 5 0 0 0 0 0 0 0 0 91 0 0 0 36 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 12 31 0 107 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 152 0 0 0 0 0 0 0 0 0 0 0 0
...
Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USW00093129 2013 10 31 0 0 0 0 0 0 0 0 43 15 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 41 3 8 0
USW00093129 2013 11 30 0 0 0 23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 79 18 20 0 0 0 0 0 0 0 Inf
USW00093129 2013 12 31 0 0 175 33 0 0 3 0 0 0 0 0 0 0 0 0 0 0 5 15 0 0 0 0 0 0 0 0 0 0 0
I am trying to find the length of the longest stretch of non-zero rainfall values in each row, and the total rainfall in that stretch. The easiest way to find the length of the longest stretch would be to convert the dataframe to 0s and 1s, use rle, and apply max(y$lengths[y$values != 0]) along each row. But how do I find the sum of the values in that stretch?
Thanks for helping out, in advance!

Not exactly a one-liner, but this works:
df <- read.table(header=TRUE,stringsAsFactors=FALSE,check.names=FALSE,text=
"Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USC00020750 1964 1 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 23 51 36 0 0 0 0 0 0 0 0
USC00020750 1964 2 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 0 0 0 3 0 0 0 0 0 0 Inf Inf
USC00020750 1964 3 31 0 46 51 0 0 36 41 46 0 0 0 0 43 0 0 0 0 0 0 0 0 53 99 140 36 0 0 0 0 0 0
USC00020750 1964 4 30 5 69 23 30 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 13 0 0 0 15 0 Inf
USC00020750 1964 5 31 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 8 0 0 0 0
USC00020750 1964 6 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 Inf
USC00020750 1964 7 31 0 0 0 0 0 0 0 0 0 0 0 0 41 0 13 13 0 0 0 0 8 51 0 71 0 10 0 0 20 165 25
USC00020750 1964 8 31 8 30 137 0 0 5 89 0 0 0 18 64 5 0 0 0 0 0 0 0 0 0 0 0 0 76 0 0 0 0 0
USC00020750 1964 9 30 0 0 0 0 0 119 0 0 0 0 0 0 0 41 25 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 10 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
USC00020750 1964 11 30 0 5 0 0 0 0 0 0 0 0 91 0 0 0 36 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 12 31 0 107 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 152 0 0 0 0 0 0 0 0 0 0 0 0")
res <- lapply(1:nrow(df), function(r) {
  monthDays <- df[r, 'Days']
  rain <- as.numeric(df[r, (1:monthDays) + 4])
  enc <- rle(rain > 0)
  if (all(!enc$values))
    return(c(0, 0))
  len <- enc$lengths
  len[!enc$values] <- 0
  max.idx <- which.max(len)
  lastIdx <- cumsum(enc$lengths)[max.idx]
  firstIdx <- lastIdx - enc$lengths[max.idx] + 1
  tot <- sum(rain[firstIdx:lastIdx])
  stretch <- lastIdx - firstIdx + 1
  return(c(stretch, tot))
})
columnsToAdd <- do.call(rbind,res)
colnames(columnsToAdd) <- c('StretchLen','StretchRain')
df2 <- cbind(df,columnsToAdd)
Result:
# We print the result without months values for better readability
> df2[,-(5:35)]
Station Year Month Days StretchLen StretchRain
1 USC00020750 1964 1 31 3 110
2 USC00020750 1964 2 29 1 48
3 USC00020750 1964 3 31 4 328
4 USC00020750 1964 4 30 4 127
5 USC00020750 1964 5 31 2 59
6 USC00020750 1964 6 30 1 38
7 USC00020750 1964 7 31 3 210
8 USC00020750 1964 8 31 3 175
9 USC00020750 1964 9 30 2 66
10 USC00020750 1964 10 31 0 0
11 USC00020750 1964 11 30 2 130
12 USC00020750 1964 12 31 2 127
BTW, if you want to stick with apply, it would be like this:
columnsToAdd <-
  t(apply(df[, -(1:3)], MARGIN = 1, function(r) {
    monthDays <- r[1]
    # keep only this month's days (drops the Inf padding of short months)
    rain <- as.numeric(r[(1:monthDays) + 1])
    enc <- rle(rain > 0)
    if (all(!enc$values))
      return(c(0, 0))
    len <- enc$lengths
    len[!enc$values] <- 0
    max.idx <- which.max(len)
    lastIdx <- cumsum(enc$lengths)[max.idx]
    firstIdx <- lastIdx - enc$lengths[max.idx] + 1
    tot <- sum(rain[firstIdx:lastIdx])
    stretch <- lastIdx - firstIdx + 1
    return(c(stretch, tot))
  }))
colnames(columnsToAdd) <- c('StretchLen','StretchRain')
df2 <- cbind(df,columnsToAdd)
I don't like using apply on data.frames, since it was created for matrices: it coerces all columns to a common type before calling the function, so you need to be careful when your columns have different types.
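A quick illustration of that pitfall (with made-up data): apply() first coerces the data.frame to a matrix, so one character column is enough to turn every value passed to the function into a string.

```r
d <- data.frame(name = c("a", "b"), x = c(1, 2), stringsAsFactors = FALSE)

# Each row r arrives as a character vector, because the mixed-type
# data.frame was coerced to a character matrix before splitting
classes <- apply(d, 1, function(r) class(r["x"]))
classes
# [1] "character" "character"
```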

Here's another solution with dplyr/tidyr:
library(dplyr)
library(tidyr)

data %>%
  gather(day, rain, -Station, -Year, -Month, -Days) %>%
  mutate(day = as.numeric(day)) %>%  # gather yields character day names; sort numerically
  arrange(Station, Year, Month, day) %>%
  group_by(Station, Year, Month) %>%
  mutate(previous_rain = lag(rain)) %>%
  filter(!(rain %in% c(0, Inf))) %>%
  mutate(storm = cumsum(previous_rain %in% c(0, NA))) %>%
  group_by(Station, Year, Month, storm) %>%
  summarize(total_rain = sum(rain),
            number_of_days = n(),
            start_day = first(day),
            end_day = last(day)) %>%
  arrange(desc(number_of_days)) %>%
  slice(1)

Here's another take on it, where I've used the rle() function to find run lengths. It's long-winded, but mainly so that it's clear what is happening; you could shorten it easily.
raindf <- read.table(textConnection(" Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USC00020750 1964 1 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 23 51 36 0 0 0 0 0 0 0 0
USC00020750 1964 2 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 0 0 0 3 0 0 0 0 0 0 Inf Inf
USC00020750 1964 3 31 0 46 51 0 0 36 41 46 0 0 0 0 43 0 0 0 0 0 0 0 0 53 99 140 36 0 0 0 0 0 0
USC00020750 1964 4 30 5 69 23 30 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 13 0 0 0 15 0 Inf
USC00020750 1964 5 31 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 8 0 0 0 0
USC00020750 1964 6 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 Inf
USC00020750 1964 7 31 0 0 0 0 0 0 0 0 0 0 0 0 41 0 13 13 0 0 0 0 8 51 0 71 0 10 0 0 20 165 25
USC00020750 1964 8 31 8 30 137 0 0 5 89 0 0 0 18 64 5 0 0 0 0 0 0 0 0 0 0 0 0 76 0 0 0 0 0
USC00020750 1964 9 30 0 0 0 0 0 119 0 0 0 0 0 0 0 41 25 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 10 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
USC00020750 1964 11 30 0 5 0 0 0 0 0 0 0 0 91 0 0 0 36 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 12 31 0 107 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 152 0 0 0 0 0 0 0 0 0 0 0 0"), header = TRUE)
rainfall <- unlist(as.data.frame(t(raindf[1:3, -c(1:4)])), use.names = FALSE)
rainfall <- rainfall[!is.infinite(rainfall)]
rainfall[rainfall > 0] <- 1
rainyruns <- rle(rainfall)
rainyrunsDf <- data.frame(lengths = rainyruns$lengths, values = rainyruns$values)
rainyrunsDf <- subset(rainyrunsDf, values != 0)
rainyrunsDf <- rainyrunsDf[order(rainyrunsDf$lengths, decreasing = TRUE), ]
rainyrunsDf[1,1]
## [1] 4
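The run above only gives the length (4); to also recover the rainfall total for that stretch, one option is to keep the raw values and locate the run with cumsum(), along these lines (x here is a made-up row of daily values):

```r
x <- c(0, 46, 51, 0, 0, 36, 41, 46, 0)   # made-up daily values

r <- rle(x > 0)
len <- ifelse(r$values, r$lengths, 0)    # zero out the dry runs
i <- which.max(len)                      # index of the longest wet run

last  <- cumsum(r$lengths)[i]            # its last day...
first <- last - r$lengths[i] + 1         # ...and its first day
c(length = r$lengths[i], total = sum(x[first:last]))
# length  total
#      3    123
```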

Related

Adding multiple columns in between columns in a data frame using a For Loop

outputdata (df)
Store.No Task
1 70
2 50
3 20
I am trying to add 53 columns after the 'Task' column by using its position, not its name. Then I want the column names to run from 1 to 53, with 0s in the rows. The rows in this example go to row number 3, but that could vary, so would it be possible to use the nrow function to specify the number of rows rather than hard-coding it?
outputdata- Desired Outcome
Store.No Task 1 2 3 4 5 6 7 8 9 10 ...53
1 70 0 0 0 0 0 0 0 0 0 0
2 50 0 0 0 0 0 0 0 0 0 0
3 20 0 0 0 0 0 0 0 0 0 0
Code used
x <- 1
y <- 0
for (i in 1:53) {
  outputdata <- add_column(outputdata, x = 0, .after = Fo+y)
  y <- y + 1
  x <- x + 1
}
The problem I'm getting is that the columns are being named x, x.1, x.2, x.3, x.4, ..., x.53 rather than 1, 2, 3, 4, ..., 53, and I'm not too sure why.
I am still quite new to R, so if there is a far more efficient way of doing this then please let me know.
Many thanks
You do not need to loop to do this:
as.data.frame(cbind(df, matrix(0, nrow = nrow(df), ncol = 53)))
Store.No Task Third Fourth 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 1 70 4 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 50 5 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 20 6 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
matrix() creates a matrix of 0s with 53 columns and nrow(df) rows
cbind() appends this matrix to the end of your data
as.data.frame() converts the result back to a dataframe
Update
To insert these zero columns positionally you can subset your df into two parts: df[, 1:2] are the first and second columns, while df[,3:ncol(df)] are the third to end of your dataframe.
as.data.frame(cbind(df[, 1:2], matrix(0, nrow = nrow(df), ncol = 53), df[, 3:ncol(df)]))
Store.No Task 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
1 1 70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 Third Fourth
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 7
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 8
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 9
add_column
Alternatively you can use the add_column function from the tibble package as you were in your post using the .after argument to insert after the second column:
library(tibble)
tibble::add_column(df, as.data.frame(matrix(0, nrow = nrow(df), ncol = 53)), .after = 2)
Note: this function will fix the column names by adding a "V" before any column name that starts with a number, so 1 will become V1.
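If you want the plain numeric names back afterwards, one option (a base-R sketch) is to build the zero block with matrix(), which gets V1..V53 names from as.data.frame(), and then strip the prefix:

```r
df <- data.frame(Store.No = 1:3, Task = c(70, 50, 20))

zeros <- as.data.frame(matrix(0, nrow = nrow(df), ncol = 53))  # names V1..V53
out <- cbind(df, zeros)

# drop the automatic "V" prefix so the names read 1..53
names(out)[-(1:2)] <- sub("^V", "", names(out)[-(1:2)])
names(out)[3:5]
# [1] "1" "2" "3"
```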
Data
df <- data.frame(Store.No = 1:3,
                 Task = c(70, 50, 20),
                 Third = 4:6,
                 Fourth = 7:9)

Estimation transition matrix with low observation count

I am building a Markov model with a relatively low count of observations for a given number of states.
Are there methods other than the cohort method to estimate the real transition probabilities? In particular, I want to ensure that the probabilities decrease with increasing distance from the current state; the pair (11, 14) does not behave that way, and the underlying model wouldn't support it.
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
2 4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 2 10 8 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 9 53 13 2 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 17 42 17 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 21 71 21 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 23 102 21 3 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 23 57 33 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 1 34 142 28 1 3 0 0 0 0 0
12 0 0 0 0 0 0 0 0 1 28 127 27 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 28 134 27 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 0 27 93 20 2 0 0 0
15 0 0 0 0 0 0 0 0 0 0 0 0 23 133 19 0 0 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 22 114 20 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 21 192 19 0
18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 21 263 21
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 827
Thanks
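For context, the cohort method is simply each row of counts divided by its row sum; one common tweak when counts are low is to add a small pseudo-count before normalizing (Laplace smoothing), which removes the hard zeros, though by itself it does not force probabilities to decay with distance from the diagonal. A sketch on a made-up 3-state count matrix:

```r
counts <- matrix(c(4, 1, 0,
                   1, 2, 1,
                   0, 1, 2), nrow = 3, byrow = TRUE)

# Cohort estimator: row-wise relative frequencies
P_cohort <- counts / rowSums(counts)

# Laplace-smoothed variant: no transition gets probability exactly 0
alpha <- 0.5
P_smooth <- (counts + alpha) / rowSums(counts + alpha)

round(P_smooth, 3)
```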

EDITED: spreading data based on column match

I have an empty data frame I am trying to populate.
Df1 looks like this:
col1 col2 col3 col4 important_col
1 82 193 104 86 120
2 85 68 116 63 100
3 78 145 10 132 28
4 121 158 103 15 109
5 48 175 168 190 151
6 91 136 156 180 155
Df2 looks like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A data frame full of 0's.
I combine the data frames to make df_fin.
What I am trying to do now is something similar to a dummy-variable approach: I have the values in important_col, and I want to spread that column out, so that if important_col = 28 then a 1 is put in column 28.
How can I go about creating this?
EDIT: I added a comment to illustrate what I am trying to achieve. I paste it here also.
Say that important_col is countries; then the column names would be all the countries in the world, i.e. all 241 of them. However, the data I have already collected might only contain 200 of these countries, so one-hot encoding here would give me 200 columns and I would be missing potentially 41 countries. If a new user from a country not currently in the data comes along and enters their country, it wouldn't be recognised.
Smaller example:
col1 col2 col3 col4 important_col 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 11 14 3 11 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 19 15 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 17 10 10 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 13 10 8 17 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 18 5 3 18 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 11 10 9 5 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 5 11 18 16 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 5 8 13 8 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 10 1 7 16 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 4 17 17 3 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Expected output:
col1 col2 col3 col4 important_col 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 11 14 3 11 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 19 15 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 17 10 10 6 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 13 10 8 17 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
5 18 5 3 18 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
6 11 10 9 5 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
7 5 11 18 16 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
8 5 8 13 8 6 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 10 1 7 16 12 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
10 4 17 17 3 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The number of columns is greater than the number of potential entries into important_col. Using the countries example the columns would be all countries in the world and the important_col would consist of a subset of these countries.
Code to generate the above:
df1 <- data.frame(replicate(5, sample(1:20, 10, rep=TRUE)))
colnames(df1) <- c("col1", "col2", "col3", "col4", "important_col")
df2 <- data.frame(replicate(20, sample(0:0, nrow(df1), rep=TRUE)))
colnames(df2) <- gsub("X", "", colnames(df2))
df_fin <- cbind(df1, df2)
df_fin
Does this solve the problem:
Data:
set.seed(123)
df1 <- data.frame(replicate(5, sample(1:20, 10, rep=TRUE)))
colnames(df1) <- c("col1", "col2", "col3", "col4", "important_col")
df2 <- data.frame(replicate(20, sample(0:0, nrow(df1), rep=TRUE)))
colnames(df2) <- gsub("X", "", colnames(df2))
df_fin <- cbind(df1, df2)
Result:
vecp <- colnames(df2)
imp_col <- df1$important_col
# one row of candidate column names per observation
m <- matrix(vecp, byrow = TRUE, nrow = length(imp_col), ncol = length(vecp))
# 1 where the name matches that row's important_col, 0 elsewhere
d <- ifelse(m == imp_col, 1, 0)
df_fin <- cbind(df1, d)
Output:
col1 col2 col3 col4 important_col 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 6 20 18 20 3 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 16 10 14 19 9 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
3 9 14 13 14 9 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
4 18 12 20 16 8 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
5 19 3 14 1 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 1 18 15 10 3 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 11 5 11 16 5 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 18 1 12 5 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
9 12 7 6 7 6 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 10 20 3 5 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
What you are trying to do is one-hot encoding, which you can achieve easily with model.matrix.
The example below should point you in the right direction:
df <- data.frame(important_col = as.factor(c(1:3)))
df
important_col
1 1
2 2
3 3
as.data.frame(model.matrix(~.-1, df))
important_col1 important_col2 important_col3
1 1 0 0
2 0 1 0
3 0 0 1
Like Sonny mentioned, model.matrix() should do the job. One potential problem is that you have to add back any columns that did not show up in your important_col, as in the following case:
df <- data.frame(important_col = as.factor(c(1:3, 5)))
df
important_col
1 1
2 2
3 3
4 5
as.data.frame(model.matrix(~.-1, df))
important_col1 important_col2 important_col3 important_col5
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
important_col4 is missing in the second example because important_col does not include the value 4. You have to add column 4 back if you need it for your analysis.
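One way to guarantee those columns exist (e.g. all 241 countries, not just the observed 200) is to declare the full level set when building the factor; model.matrix() then emits one column per declared level, used or not. A sketch:

```r
# declare level 4 even though it never occurs in the data
df <- data.frame(important_col = factor(c(1, 2, 3, 5), levels = 1:5))

mm <- as.data.frame(model.matrix(~ . - 1, df))
colnames(mm)
# [1] "important_col1" "important_col2" "important_col3" "important_col4" "important_col5"
sum(mm$important_col4)   # all-zero column for the unseen level
# [1] 0
```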

Losing observations when I use reshape in R

I have data set
> head(pain_subset2, n= 50)
PatientID RSE SE SECODE
1 1001-01 0 0 0
2 1001-01 0 0 0
3 1001-02 0 0 0
4 1001-02 0 0 0
5 1002-01 0 0 0
6 1002-01 1 2a 1
7 1002-02 0 0 0
8 1002-02 0 0 0
9 1002-02 0 0 0
10 1002-03 0 0 0
11 1002-03 0 0 0
12 1002-03 1 1 1
> dim(pain_subset2)
[1] 817 4
> table(pain_subset2$RSE)
0 1
788 29
> table(pain_subset2$SE)
0 1 2a 2b 3 4 5
788 7 5 1 6 4 6
> table(pain_subset2$SECODE)
0 1
788 29
I want to create a matrix of n * 6 (n: # of PatientIDs; columns: the 6 levels of SE).
When I use reshape, I lose many observations:
> dim(p)
[1] 246 9
My code:
p <- reshape(pain_subset2, timevar = "SE", idvar = c("PatientID","RSE"),v.names = "SECODE", direction = "wide")
p[is.na(p)] <- 0
> table(p$RSE)
0 1
226 20
Comparing with the table of RSE above, I have lost 9 of the 29 rows where RSE is 1.
This is the output I have:
PatientID RSE SECODE.0 SECODE.2a SECODE.1 SECODE.5 SECODE.3 SECODE.2b SECODE.4
1 1001-01 0 0 0 0 0 0 0 0
3 1001-02 0 0 0 0 0 0 0 0
5 1002-01 0 0 0 0 0 0 0 0
6 1002-01 1 0 1 0 0 0 0 0
7 1002-02 0 0 0 0 0 0 0 0
10 1002-03 0 0 0 0 0 0 0 0
12 1002-03 1 0 0 1 0 0 0 0
13 1002-04 0 0 0 0 0 0 0 0
15 1003-01 0 0 0 0 0 0 0 0
18 1003-02 0 0 0 0 0 0 0 0
21 1003-03 0 0 0 0 0 0 0 0
24 1003-04 0 0 0 0 0 0 0 0
27 1003-05 0 0 0 0 0 0 0 0
30 1003-06 0 0 0 0 0 0 0 0
32 1003-07 0 0 0 0 0 0 0 0
35 1004-01 0 0 0 0 0 0 0 0
36 1004-01 1 0 0 0 1 0 0 0
40 1004-02a 0 0 0 0 0 0 0 0
Does anyone know what is happening? I would really appreciate any help.
Thanks for your help, best.
Try:
library(dplyr)
library(tidyr)
pain_subset2 %>%
  spread(SE, SECODE)
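If the goal is just the n-by-6 matrix of patients by SE levels, base R's table() is another route, and it keeps zero cells instead of dropping them (a sketch on made-up data shaped like the question's):

```r
pain <- data.frame(
  PatientID = c("1001-01", "1002-01", "1002-01", "1002-03"),
  SE        = c("0", "0", "2a", "1"),
  stringsAsFactors = FALSE
)

# counts of each SE level per patient; absent combinations are 0, not dropped
tab <- table(pain$PatientID, pain$SE)
tab["1002-01", "2a"]
# [1] 1
```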

zoo's NA handling methods in R

I am experimenting with different imputation methods in zoo.
So far I have tried na.locf, na.approx and na.spline on my dataset. However, when I try the same dataset with na.StructTS, which uses a seasonal Kalman filter, it returns the following error:
Error in StructTS(y) : 'x' must be numeric
Did I miss something? Any help is appreciated.
UPD1
my code:
empty <- zoo(order.by = seq.Date(head(index(df1.zoo), 1), tail(index(df1.zoo), 1), by = "days"))
merged <- na.StructTS(merge(df1.zoo, empty))
here is df1.zoo:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
2012-01-01 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 42
2012-01-02 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 57
2012-01-03 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51
2012-01-04 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 41
2012-01-05 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 56
2012-01-06 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 55
here is empty:
