What is this operation called and how do I achieve this? (I can't find an example.)
Given
temp1
Var1 Freq
1 (0,0.78] 0
2 (0.78,0.99] 0
3 (0.99,1.07] 0
4 (1.07,1.201] 1
5 (1.201,1.211] 0
6 (1.211,1.77] 2
How do I split the intervals in Var1 into two vectors for start and end?
Like this:
df2
start end Freq
1 0.000 0.780 0
2 0.780 0.990 0
3 0.990 1.070 0
4 1.070 1.201 1
5 1.201 1.211 0
6 1.211 1.770 2
This is an XY problem: you shouldn't need to produce that format in the first place. Keep the breaks you pass to cut() and build the data frame from them directly.
E.g.:
x <- 1:10
brks <- c(0,5,10)
data.frame(table(cut(x,brks)))
# Var1 Freq
#1 (0,5] 5
#2 (5,10] 5
data.frame(start=head(brks,-1), end=tail(brks,-1), Freq=tabulate(cut(x,brks)))
# start end Freq
#1 0 5 5
#2 5 10 5
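If you really do need to split existing interval labels (say the breaks are gone), here is a hedged sketch that parses cut()'s default "(a,b]" labels, assuming temp1 from the question:
lvls  <- levels(temp1$Var1)                       # "(0,0.78]", "(0.78,0.99]", ...
parts <- strsplit(gsub("[][()]", "", lvls), ",")  # strip the brackets, split on the comma
df2   <- data.frame(start = as.numeric(sapply(parts, `[`, 1)),
                    end   = as.numeric(sapply(parts, `[`, 2)),
                    Freq  = temp1$Freq)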
I have tried to create a data frame from a matrix; however, the result has different dimensions compared to the original matrix. Please see my code below:
out <- table(UL_Final$Issue_Year, UL_Final$Insured_Age_Group)
out <- out/rowSums(out) #changing all numbers to ratio
The result is a matrix 12 by 7:
1 2 3 4 5 6 7
1387 0.165137615 0.036697248 0.229357798 0.321100917 0.201834862 0.018348624 0.027522936
1388 0.149222065 0.110325318 0.197312588 0.342291372 0.136492221 0.055162659 0.009193777
1389 0.144979508 0.101946721 0.222848361 0.335553279 0.138575820 0.046362705 0.009733607
1390 0.146991622 0.120030465 0.191622239 0.336024372 0.142269612 0.052551409 0.010510282
1391 0.165462754 0.111794582 0.185835214 0.321049661 0.135553047 0.064503386 0.015801354
1392 0.162399144 0.109583402 0.165321917 0.317388441 0.146344476 0.076115594 0.022847028
1393 0.181602139 0.116447173 0.151104070 0.325131201 0.148628577 0.062778493 0.014308347
1394 0.163760504 0.098529412 0.142489496 0.323792017 0.178728992 0.076050420 0.016649160
1395 0.137097032 0.094699511 0.128981757 0.321320170 0.197610147 0.098245950 0.022045433
1396 0.167187958 0.103851041 0.112696706 0.293202033 0.200689082 0.099306031 0.023067149
1397 0.193250090 0.130540713 0.108114843 0.270743930 0.186411584 0.091364656 0.019574185
1398 0.208026156 0.147573562 0.100455157 0.249503173 0.191935380 0.083338676 0.019167895
then using the code below:
out <- data.frame(out)
However, the result changes to a data frame with dimensions 84 by 3:
Var1 Var2 Freq
1 1387 1 0.165137615
2 1388 1 0.149222065
3 1389 1 0.144979508
4 1390 1 0.146991622
5 .... .......
I am not sure why this happens. However, in another case, explained below, I do not see such strange behavior. There I used the code below to calculate a ratio for another variable:
out <- table( df_select$Insured_Age_Group,df_select$Policy_Status)
out <- cbind(out, Ratio = out[,2]/rowSums(out))
the result is:
Issuance Surrended Ratio
1 31046 5735 0.1559229
2 20039 4409 0.1803420
3 20399 9228 0.3114726
4 48677 17216 0.2612721
5 30045 8132 0.2130078
6 13947 4106 0.2274414
7 3157 1047 0.2490485
Now if we use the code below (by @Ronak Shah):
out <- data.frame(out) %>% mutate(x = row_number())
the result is:
Issuance Surrended Ratio x
1 31046 5735 0.1559229 1
2 20039 4409 0.1803420 2
3 20399 9228 0.3114726 3
4 48677 17216 0.2612721 4
5 30045 8132 0.2130078 5
6 13947 4106 0.2274414 6
7 3157 1047 0.2490485 7
As you can see, the result is now a data frame with the same dimensions. Can anyone explain why this happens?
See ?table for an explanation:
The as.data.frame method for objects inheriting from class "table" can be used to convert the array-based representation of a contingency table to a data frame containing the classifying factors and the corresponding entries (the latter as component named by responseName). This is the inverse of xtabs.
A workaround is to call as.data.frame.matrix directly. (In your second example, cbind() had already turned the table into a plain matrix, which is why data.frame() kept its shape there.)
m <- table(mtcars$carb, mtcars$gear)
as.data.frame(m)
# Var1 Var2 Freq
# 1 1 3 3
# 2 2 3 4
# 3 3 3 3
# 4 4 3 5
# 5 6 3 0
# 6 8 3 0
# 7 1 4 4
# 8 2 4 4
# 9 3 4 0
# 10 4 4 4
# 11 6 4 0
# 12 8 4 0
# 13 1 5 0
# 14 2 5 2
# 15 3 5 0
# 16 4 5 1
# 17 6 5 1
# 18 8 5 1
as.data.frame.matrix(m)
# 3 4 5
# 1 3 4 0
# 2 4 4 2
# 3 3 0 0
# 4 5 4 1
# 6 0 0 1
# 8 0 0 1
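A quick way to check which method data.frame() will dispatch is to look at the class. A small sketch, reusing m from above (the extra column is just an illustration):
class(m)
# [1] "table"
# -> data.frame() dispatches as.data.frame.table (the long Var1/Var2/Freq form)
class(cbind(m, extra = 0))  # cbind() returns a plain matrix, dropping the "table" class
# [1] "matrix" "array"
# -> data.frame() dispatches as.data.frame.matrix (rectangular shape kept)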
I have two dataframes: list1 and list2
>head(list1)
RS_ID CHROM POS REF_ALLELE ALT_ALLELE AF_REF_allsamples
1 rs77599058 1 195680131 C T 0.9996
2 rs73056353 1 195680971 A G 0.9999
3 rs12130880 1 195681419 A T 0.5475
4 rs76457267 1 195681460 A C 0.9993
5 rs10921893 1 195681616 T C 0.5060
6 rs75239769 1 195682022 G A 0.9999
AF_ALT_allsamples AF_REF_onlycontrol AF_ALT_onlycontrol pvalues
1 0.0004 0.9996 0.0004 0.7830
2 0.0001 0.9998 0.0002 0.3740
3 0.4525 0.5442 0.4558 0.0597
4 0.0007 0.9992 0.0008 0.3590
5 0.4940 0.5099 0.4901 0.0302
6 0.0001 1.0000 0.0000 0.5500
>head(list2)
RS_ID CHROM POS REF_ALLELE ALT_ALLELE AF_REF_allsamples
1 rs77599058 1 195680131 C T 0.9996
2 rs73056353 1 195680971 A G 0.9999
3 rs12130880 1 195681419 A T 0.5475
4 rs76457267 1 195681460 A C 0.9993
5 rs10921893 1 195681616 T C 0.5060
6 rs75239769 1 195682022 G A 0.9999
AF_ALT_allsamples AF_REF_onlycontrol AF_ALT_onlycontrol pvalues
1 0.0004 0.9996 0.0004 0.7830
2 0.0001 0.9998 0.0002 0.3740
3 0.4525 0.5442 0.4558 0.0597
4 0.0007 0.9992 0.0008 0.3590
5 0.4940 0.5099 0.4901 0.0302
6 0.0001 1.0000 0.0000 0.5500
> dim(list1)
[1] 235111 10
> dim(list2)
[1] 234520 10
As you can see with dim(), they differ by 591 rows. I now want to get a new dataframe with all rows from list1 that are not in list2 (those 591).
I tried
> match_diff=list1[!(list1 %in% list2)]
> dim(match_diff)
[1] 235111 10
but as you can see, it tells me that all rows from list1 differ from list2.
I checked with str() whether there's an underlying cause, but both are identical (they originate from the same raw data).
I can't check by a single column but must compare each row as a whole.
This is a database join operation. (Your %in% attempt compares list1's columns, since a data frame is a list of columns, not its rows.) If you search for joins you will find more information on the different kinds out there. As @starja said, you want anti_join from dplyr:
Install dplyr if you don't have it already with install.packages('dplyr')
R> list1 <- data.frame(a=0:5, b=10:15)
R> list2 <- data.frame(a=(0:5)+3, b=(10:15)+3)
R> list1
a b
1 0 10
2 1 11
3 2 12
4 3 13
5 4 14
6 5 15
R> list2
a b
1 3 13
2 4 14
3 5 15
4 6 16
5 7 17
6 8 18
R> list3 <- dplyr::anti_join(list1, list2)
Joining, by = c("a", "b")
R> list3
a b
1 0 10
2 1 11
3 2 12
R>
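If you'd rather stay in base R, here is a hedged sketch of the same anti-join, assuming whole rows can be compared as pasted strings (as in your case, where every column must match):
# build one key string per row, then keep rows of list1 whose key is absent from list2
keys1 <- do.call(paste, list1)
keys2 <- do.call(paste, list2)
match_diff <- list1[!(keys1 %in% keys2), ]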
Why am I getting
'train' and 'class' have different lengths
in spite of both of them having the same length?
y_pred = knn(train = training_set[, 1:2],
             test  = Test_set[, -3],
             cl    = training_set[, 3],
             k     = 5)
Their dimensions are given below:
> dim(training_set[,-3])
[1] 300 2
> dim(training_set[,3])
[1] 300 1
> head(training_set)
# A tibble: 6 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -1.77 -1.47 0
2 -1.10 -0.788 0
3 -1.00 -0.360 0
4 -1.00 0.382 0
5 -0.523 2.27 1
6 -0.236 -0.160 0
> Test_set
# A tibble: 100 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -0.304 -1.51 0
2 -1.06 -0.325 0
3 -1.82 0.286 0
4 -1.25 -1.10 0
5 -1.15 -0.485 0
6 0.641 -1.32 1
7 0.735 -1.26 1
8 0.924 -1.22 1
9 0.829 -0.582 1
10 -0.871 -0.774 0
It's because knn expects cl to be a vector and you are giving it a one-column table. The test knn performs is whether nrow(train) == length(cl), and length() on a data-frame-like object counts columns, not rows. Compare:
> length(data.frame(a=c(1,2,3)))
[1] 1
> length(c(1,2,3))
[1] 3
If you use cl=training_set$Purchased, which extracts the vector from the table, that should fix it.
This is a specific gotcha if you are moving from data.frame to a tibble or data.table, because the default drop behaviour is different:
> dt <- data.table(a=1:3, b=4:6)
> dt[,2]
b
1: 4
2: 5
3: 6
> df <- data.frame(a=1:3, b=4:6)
> df[,2]
[1] 4 5 6
> df[,2, drop=FALSE]
b
1 4
2 5
3 6
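Putting it together, a minimal sketch of the fixed call, assuming the training_set and Test_set tibbles from the question:
library(class)  # provides knn()
y_pred <- knn(train = training_set[, 1:2],    # Age, EstimatedSalary
              test  = Test_set[, -3],
              cl    = training_set$Purchased, # a factor vector, not a one-column tibble
              k     = 5)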
I have some data on organism survival as a function of time. The data is constructed using the averages of many replicates for each time point, which can yield a forward time step with an increase in survival. Occasionally, this results in a survivorship greater than 1, which is impossible. How can I conditionally change values greater than 1 to the preceding value in the same column?
Here's what the data looks like:
>df
Generation Treatment time lx
1 0 1 0 1
2 0 1 2 1
3 0 1 4 0.970
4 0 1 6 0.952
5 0 1 8 0.924
6 0 1 10 0.913
7 0 1 12 0.895
8 0 1 14 0.729
9 0 2 0 1
10 0 2 2 1
I've tried mutating the column of interest as such, which still yields values above 1:
df1 <- df %>%
  group_by(Generation, Treatment) %>%
  mutate(lx_diag = as.numeric(lx / lag(lx, default = first(lx)))) %>%  # calculate running survival
  mutate(lx_diag = if_else(lx_diag > 1.000000, lag(lx_diag), lx_diag))  # substitute values > 1 with the previous value
>df1
Generation Treatment time lx lx_diag
1 12 1 0 1 1
2 12 1 2 1 1
3 12 1 4 1 1
4 12 1 6 0.996 0.996
5 12 1 8 0.988 0.992
6 12 1 10 0.956 0.968
7 12 1 12 0.884 0.925
8 12 1 14 0.72 0.814
9 12 1 15 0.729 1.01
10 12 1 19 0.76 1.04
I expect the results to look something like:
>df1
Generation Treatment time lx lx_diag
1 12 1 0 1 1
2 12 1 2 1 1
3 12 1 4 1 1
4 12 1 6 0.996 0.996
5 12 1 8 0.988 0.992
6 12 1 10 0.956 0.968
7 12 1 12 0.884 0.925
8 12 1 14 0.72 0.814
9 12 1 15 0.729 0.814
10 12 1 19 0.76 0.814
I know you can conditionally change the values to a specific value (i.e. ifelse with no else), but I haven't found any solutions that can conditionally change a value in a column to the value in the previous row. Any help is appreciated.
EDIT: I realized that mutate and if_else are vectorised: instead of replacing values in sequence from first to last, as I would have expected, they replace all values at the same time. So in a run of consecutive values >1, some will be left behind. Thus, if you just run the command:
SurvTot1$lx_diag <- if_else(SurvTot1$lx_diag > 1, lag(SurvTot1$lx_diag), SurvTot1$lx_diag)
over again, you can get rid of the remaining values >1. Not the most elegant solution, but it works.
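A hedged sketch of that idea as a loop, assuming SurvTot1 as above (note that a value >1 in the very first row has no predecessor and would become NA):
library(dplyr)  # if_else(), lag()
# repeat the replacement until no value > 1 remains, so that runs of
# consecutive impossible values are all filled
while (any(SurvTot1$lx_diag > 1, na.rm = TRUE)) {
  SurvTot1$lx_diag <- if_else(SurvTot1$lx_diag > 1,
                              lag(SurvTot1$lx_diag),
                              SurvTot1$lx_diag)
}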
This looks like a very ugly solution to me, but I couldn't think of anything else:
library(dplyr)  # %>%, mutate()

df = data.frame(
"Generation" = rep(12,10),
"Treatent" = rep(1,10),
"Time" = c(seq(0,14,by=2),15,19),
"lx_diag" = c(1,1,1,0.996,0.992,0.968,0.925,0.814,1.04,1.04)
)
update_lag = function(x) {
  k <<- k + 1
  x
}
k = 1
df %>%
  mutate(
    lx_diag2 = ifelse(lx_diag <= 1, update_lag(lx_diag), lag(lx_diag, n = k))
  )
Using the data from @Fino, here is my vectorized solution using base R:
# indices of the impossible values (> 1)
vals.to.replace <- which(df$lx_diag > 1)
# for each one, take the last value <= 1 that occurs before it
vals.to.substitute <- sapply(vals.to.replace, function(x) tail(df$lx_diag[which(df$lx_diag[1:x] <= 1)], 1))
df$lx_diag[vals.to.replace] <- vals.to.substitute
df
Generation Treatent Time lx_diag
1 12 1 0 1.000
2 12 1 2 1.000
3 12 1 4 1.000
4 12 1 6 0.996
5 12 1 8 0.992
6 12 1 10 0.968
7 12 1 12 0.925
8 12 1 14 0.814
9 12 1 15 0.814
10 12 1 19 0.814
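If an extra package is acceptable, a shorter hedged alternative is to blank out the impossible values and carry the last valid one forward with tidyr::fill(), assuming the same df as above:
library(dplyr)
library(tidyr)
df %>%
  mutate(lx_diag = ifelse(lx_diag > 1, NA_real_, lx_diag)) %>%  # mark impossible values as missing
  fill(lx_diag)  # replace each NA with the last preceding non-NA value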
I have a dataset similar to the following format:
Account_ID Date Delinquency age count
1 01/01/2016 0 1 0
1 02/01/2016 1 2 0
1 03/01/2016 2 3 1
1 04/01/2016 0 4 2
1 05/01/2016 1 5 2
1 06/01/2016 2 6 2
2 01/01/2016 0 1 0
2 02/01/2016 0 2 0
2 03/01/2016 1 3 0
2 04/01/2016 0 4 1
2 05/01/2016 1 5 1
3 01/01/2016 1 1 0
3 02/01/2016 2 2 1
3 03/01/2016 3 3 2
3 04/01/2016 4 4 3
3 05/01/2016 5 5 4
3 06/01/2016 6 6 5
I want to count the number of non-zeros in the previous 3 months by account for each row, i.e. I want to create the count variable from the first four variables (Account_ID, Date, Delinquency, Age). I would like to know how to do this for the past n months in general. I'm hoping I can extend this exercise to other tasks, such as finding the max delinquency in the past 3 months.
welcome to SE!
In case you would like to count non-zero delinquency events in the 3 previous months by account for each row, you can use the aggregate function together with the zlag function from the TSA package, as in the code below. Because the data you provided in the count column are difficult to interpret and to connect with the stated condition, the data in the example were simulated.
library(lubridate)
set.seed(123)
# data simulation
df <- data.frame( id = factor(rep(0:9, 100)),
date = sample(seq(ymd("2010-12-01"), by = 1, length.out = 1000), 1000, replace = TRUE),
deliquency = sample(c(rep(0, 30), 1:5), 1000, replace = TRUE),
age = sample(1:10, 1000, replace = TRUE))
head(df)
# id date deliquency age
# 1 0 2011-08-06 0 10
# 2 1 2013-08-16 0 6
# 3 2 2012-11-17 0 1
# 4 3 2012-09-12 0 9
# 5 4 2011-07-29 0 1
# 6 5 2011-02-25 0 9
# aggregation of non-zero deliquency by month
df$year_month <- df$date
day(df$year_month) <- 1
df_m <- aggregate(deliquency ~ id + year_month, data = df, sum)
df_m <- df_m[order(df_m$id, df_m$year_month), ]  # sort by id, then month
df_m$is_zero <- df_m$deliquency > 0  # TRUE when the month's delinquency is non-zero
head(df_m)
# id year_month deliquency is_zero
# 1 0 2010-12-01 1 TRUE
# 10 0 2011-01-01 0 FALSE
# 19 0 2011-02-01 0 FALSE
# 29 0 2011-03-01 0 FALSE
# 39 0 2011-04-01 0 FALSE
# 65 0 2011-07-01 1 TRUE
# count the non-zero-delinquency months among the three previous months
library(TSA)  # provides zlag()
df_m_l <- by(df_m, df_m$id, function(dfx) {
  dfx$zero_del <- zlag(dfx$is_zero, 1) + zlag(dfx$is_zero, 2) + zlag(dfx$is_zero, 3)
  dfx
})
df_m_res <- do.call(rbind, df_m_l)
head(df_m_res)
The output is a data.frame whose zero_del column gives, for each month, the number of non-zero delinquency months among the previous three. E.g. the output here is:
id year_month deliquency is_zero zero_del
0.1 0 2010-12-01 1 TRUE NA
0.10 0 2011-01-01 0 FALSE NA
0.19 0 2011-02-01 0 FALSE NA
0.29 0 2011-03-01 0 FALSE 1
0.39 0 2011-04-01 0 FALSE 0
0.65 0 2011-07-01 1 TRUE 0
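The same by()/zlag() pattern extends to your follow-up task (max delinquency in the past 3 months); here is a hedged sketch reusing df_m from above, with a made-up column name max_del_3m:
df_max_l <- by(df_m, df_m$id, function(d) {
  # element-wise max over the three previous months (NA where no previous month exists)
  d$max_del_3m <- pmax(zlag(d$deliquency, 1),
                       zlag(d$deliquency, 2),
                       zlag(d$deliquency, 3), na.rm = TRUE)
  d
})
df_max_res <- do.call(rbind, df_max_l)
head(df_max_res)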