Create a new variable conditionally equal to an old variable in R

I want to conditionally create a new var = old var. My data looks like this:
id id2
1.1 1 1
1.2 2 2
1.3 3 3
1.4 4 4
1.5 NA 5
5.5 5 6
5.6 6 7
5.7 7 8
5.8 8 9
5.51 NA 10
9.9 9 11
9.10 10 12
9.11 11 13
9.4 NA 14
12.12 12 15
12.2 NA 16
13.13 13 17
13.14 14 18
13.15 15 19
13.16 16 20
How can I create a new var = id2 when id is missing? If id is not missing, id3 is missing.
id id2 id3
1.1 1 1
1.2 2 2
1.3 3 3
1.4 4 4
1.5 NA 5 5
5.5 5 6
5.6 6 7
5.7 7 8
5.8 8 9
5.51 NA 10 10
9.9 9 11
9.10 10 12
9.11 11 13
9.4 NA 14 14
12.12 12 15
12.2 NA 16 16
13.13 13 17
13.14 14 18
13.15 15 19
13.16 16 20
Thanks!!

Assuming that dat is your data frame, you can do the following based on ifelse in base R.
dat$id3 <- with(dat, ifelse(is.na(id), id2, NA))
Or
dat2 <- transform(dat, id3 = ifelse(is.na(id), id2, NA))
DATA
dat <- read.table(text = " id id2
1.1 1 1
1.2 2 2
1.3 3 3
1.4 4 4
1.5 NA 5
5.5 5 6
5.6 6 7
5.7 7 8
5.8 8 9
5.51 NA 10
9.9 9 11
9.10 10 12
9.11 11 13
9.4 NA 14
12.12 12 15
12.2 NA 16
13.13 13 17
13.14 14 18
13.15 15 19
13.16 16 20",
header = TRUE)
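As a quick check of the ifelse() approach, here is a minimal sketch on a made-up four-row data frame (toy and id3_alt are hypothetical names, not the asker's data). It also shows an equivalent that starts from an all-NA column and fills only the rows where id is missing.

```r
# Toy data (hypothetical): id has missing values, id2 is complete
toy <- data.frame(id = c(1L, NA, 3L, NA), id2 = 1:4)

# id3 takes id2 where id is missing, and stays NA everywhere else
toy$id3 <- ifelse(is.na(toy$id), toy$id2, NA)

# Equivalent without ifelse(): start from NA and fill the missing-id rows
id3_alt <- rep(NA_integer_, nrow(toy))
id3_alt[is.na(toy$id)] <- toy$id2[is.na(toy$id)]

print(toy)
```

Both versions should give the same column; the second makes it explicit that only the rows with a missing id are touched.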

Related

Summarize in a column using a condition and return a new row with the summed value

I have a dataset and I am trying to find a solution for it using dplyr. My goal is to summarize the values in the columns value and percentage, but only for the value smaller than 10 and add this to a new item name called: "cheap_stuff", while removing the rows with the low values.
My data looks like this:
df <- data.frame(group=c(rep("A",4), rep("B",4), rep("C",4), rep("D",4)),
value=c(1, 23, 15, 5, 3, 45, 7, 21, 4, 8, 26, 30, 3, 9, 37, 68),
percentage=c(2.27, 52.27, 34.09, 11.36 ,3.95 ,59.21 ,9.21 ,27.63 ,5.88 ,11.76 ,38.24 ,44.12 ,2.56 ,7.69, 31.62, 58.12),
item=c("cheap1","expensive1" ,"expensive2", "cheap2",
"cheap1", "expensive1","cheap2","expensive2",
"cheap1","cheap2","expensive1","expensive2",
"cheap1","cheap2","expensive1","expensive2"))
view(df)
group value percentage item
1 A 1 2.27 cheap1
2 A 23 52.27 expensive1
3 A 15 34.09 expensive2
4 A 5 11.36 cheap2
5 B 3 3.95 cheap1
6 B 45 59.21 expensive1
7 B 7 9.21 cheap2
8 B 21 27.63 expensive2
9 C 4 5.88 cheap1
10 C 8 11.76 cheap2
11 C 26 38.24 expensive1
12 C 30 44.12 expensive2
13 D 3 2.56 cheap1
14 D 9 7.69 cheap2
15 D 37 31.62 expensive1
16 D 68 58.12 expensive2
My desired output looks like this:
group value percentage item
1 A 6 13.64 cheap_stuff
2 A 23 52.27 expensive1
3 A 15 34.09 expensive2
4 B 10 13.16 cheap_stuff
5 B 45 59.21 expensive1
6 B 21 27.63 expensive2
7 C 12 17.65 cheap_stuff
8 C 26 38.24 expensive1
9 C 30 44.12 expensive2
10 D 12 10.26 cheap_stuff
11 D 37 31.62 expensive1
12 D 68 58.12 expensive2
This post points in the right direction:
Summarize with mathematical conditions in dplyr
But there, all values are summed and a new column is created.
I have tried something like this:
library(dplyr)
df%>%
group_by(group) %>%
mutate(item= replace(item, which(value <10),"cheap_stuff")) %>%
mutate(value = sum(value[value < 10]))
But that fails: I cannot remove the rows that I want, and it overwrites the rows with expensive values.
# A tibble: 16 × 4
# Groups: group [4]
group value percentage item
<chr> <dbl> <dbl> <chr>
1 A 6 2.27 cheap_stuff
2 A 6 52.3 expensive1
3 A 6 34.1 expensive2
4 A 6 11.4 cheap_stuff
5 B 10 3.95 cheap_stuff
6 B 10 59.2 expensive1
7 B 10 9.21 cheap_stuff
8 B 10 27.6 expensive2
9 C 12 5.88 cheap_stuff
10 C 12 11.8 cheap_stuff
11 C 12 38.2 expensive1
12 C 12 44.1 expensive2
13 D 12 2.56 cheap_stuff
14 D 12 7.69 cheap_stuff
15 D 12 31.6 expensive1
16 D 12 58.1 expensive2
Using value<10 instead of grepl:
df %>%
  group_by(group, item = case_when(value < 10 ~ "cheap_stuff",
                                   TRUE ~ item)) %>%
  summarise(value = sum(value),
            percentage = sum(percentage)) %>%
  ungroup()
group item value percentage
<chr> <chr> <dbl> <dbl>
1 A cheap_stuff 6 13.6
2 A expensive1 23 52.3
3 A expensive2 15 34.1
4 B cheap_stuff 10 13.2
5 B expensive1 45 59.2
6 B expensive2 21 27.6
7 C cheap_stuff 12 17.6
8 C expensive1 26 38.2
9 C expensive2 30 44.1
10 D cheap_stuff 12 10.2
11 D expensive1 37 31.6
12 D expensive2 68 58.1
Original answer:
df %>%
  group_by(group, item = case_when(grepl("cheap", item, fixed = TRUE) ~ "cheap_stuff",
                                   TRUE ~ item)) %>%
  summarise(value = sum(value),
            percentage = sum(percentage))
group item value percentage
<chr> <chr> <dbl> <dbl>
1 A cheap_stuff 6 13.6
2 A expensive1 23 52.3
3 A expensive2 15 34.1
4 B cheap_stuff 10 13.2
5 B expensive1 45 59.2
6 B expensive2 21 27.6
7 C cheap_stuff 12 17.6
8 C expensive1 26 38.2
9 C expensive2 30 44.1
10 D cheap_stuff 12 10.2
11 D expensive1 37 31.6
12 D expensive2 68 58.1
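The two answers above differ only in how a row is classified as cheap: by its value (value < 10) or by its name (grepl on item). For this data both rules select the same rows. The same relabel-then-sum idea can also be sketched in base R with ifelse() and aggregate(); this is an illustration on the first two groups of the question's data, not the answerer's code, and res is a hypothetical name.

```r
# First two groups of the question's data, enough to show the idea
df <- data.frame(
  group      = rep(c("A", "B"), each = 4),
  value      = c(1, 23, 15, 5, 3, 45, 7, 21),
  percentage = c(2.27, 52.27, 34.09, 11.36, 3.95, 59.21, 9.21, 27.63),
  item       = c("cheap1", "expensive1", "expensive2", "cheap2",
                 "cheap1", "expensive1", "cheap2", "expensive2"),
  stringsAsFactors = FALSE
)

# Relabel rows with value < 10; all other rows keep their item name
df$item <- ifelse(df$value < 10, "cheap_stuff", df$item)

# Sum value and percentage within each group/item combination
res <- aggregate(cbind(value, percentage) ~ group + item, data = df, FUN = sum)
res <- res[order(res$group, res$item), ]
print(res)
```

Because the relabelled rows share one item name within each group, summing by group and item collapses them into a single cheap_stuff row and leaves the expensive rows unchanged.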

Filter a dataframe by keeping row dates of three days in a row preferably with dplyr

I would like to filter a dataframe based on its date column, keeping only the rows that belong to a run of at least 3 consecutive days. I would like to do this as efficiently and quickly as possible, so a vectorized approach would be ideal.
I tried to draw on the following question, but it didn't go well, as it is a different problem:
How to filter rows based on difference in dates between rows in R?
I tried to do it with a for loop. I managed to put an indicator on the dates that are not consecutive, but it didn't give me the desired result, because it keeps all dates that are in a row even if the run is shorter than 3 days.
tf is my dataframe
for(i in 2:(nrow(tf)-1)){
if(tf$Date[i] != tf$Date[i+1] %m+% days(-1)){
if(tf$Date[i] != tf$Date[i-1] %m+% days(1)){
tf$Date[i] = as.Date(0)
}
}
}
The first 22 rows of my dataframe look something like this:
Date RR.x RR.y Y
1 1984-10-20 1 10.8 1984
2 1984-11-04 1 12.5 1984
3 1984-11-05 1 7.0 1984
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
7 1984-11-13 1 5.9 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
11 1986-11-17 1 14.1 1986
12 2003-10-17 1 7.8 2003
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
16 2003-11-15 1 26.4 2003
17 2003-11-20 1 10.0 2003
18 2011-10-29 1 10.0 2011
19 2011-11-04 1 11.4 2011
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
The result should be:
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
One possibility could be:
df %>%
mutate(Date = as.Date(Date, format = "%Y-%m-%d"),
diff = c(0, diff(Date))) %>%
group_by(grp = cumsum(diff > 1 & lead(diff, default = last(diff)) == 1)) %>%
filter(if_else(diff > 1 & lead(diff, default = last(diff)) == 1, 1, diff) == 1) %>%
filter(n() >= 3) %>%
ungroup() %>%
select(-diff, -grp)
Date RR.x RR.y Y
<date> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984
2 1984-11-10 1 24.4 1984
3 1984-11-11 1 19 1984
4 1986-10-15 1 10.3 1986
5 1986-10-16 1 18.1 1986
6 1986-10-17 1 11.3 1986
7 2003-10-25 1 7.6 2003
8 2003-10-26 1 5 2003
9 2003-10-27 1 6.6 2003
10 2011-11-21 1 9.8 2011
11 2011-11-22 1 5.6 2011
12 2011-11-23 1 20.4 2011
Here's a base solution:
DF$Date <- as.Date(DF$Date)
rles <- rle(cumsum(c(1,diff(DF$Date)!=1)))
rles$values <- rles$lengths >= 3
DF[inverse.rle(rles), ]
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
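To see what each step of the base solution produces, here is a small walkthrough on six made-up dates (dates, run_id, runs and keep are hypothetical names used only for this sketch).

```r
# Hypothetical dates: a lone day, a pair, then a run of three
dates <- as.Date(c("2020-01-01", "2020-01-05", "2020-01-06",
                   "2020-01-10", "2020-01-11", "2020-01-12"))

# A new run starts wherever the gap to the previous date is not exactly 1 day
run_id <- cumsum(c(1, diff(dates) != 1))

# Length of each run, then expand a keep/drop flag back to one value per row
runs <- rle(run_id)
runs$values <- runs$lengths >= 3
keep <- inverse.rle(runs)

dates[keep]
```

Here run_id labels the rows 1, 2, 2, 3, 3, 3, so only the third run reaches length 3 and only its rows are kept. The approach assumes the dates are already sorted in increasing order.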
Similar approach in dplyr
DF%>%
mutate(Date = as.Date(Date))%>%
add_count(IDs = cumsum(c(1, diff(Date) !=1)))%>%
filter(n >= 3)
# A tibble: 12 x 6
Date RR.x RR.y Y IDs n
<date> <int> <dbl> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984 3 3
2 1984-11-10 1 24.4 1984 3 3
3 1984-11-11 1 19 1984 3 3
4 1986-10-15 1 10.3 1986 5 3
5 1986-10-16 1 18.1 1986 5 3
6 1986-10-17 1 11.3 1986 5 3
7 2003-10-25 1 7.6 2003 8 3
8 2003-10-26 1 5 2003 8 3
9 2003-10-27 1 6.6 2003 8 3
10 2011-11-21 1 9.8 2011 13 3
11 2011-11-22 1 5.6 2011 13 3
12 2011-11-23 1 20.4 2011 13 3

Filter data frame based on column value [duplicate]

I would like to filter my data frame based on integer values from the first column v :
v P_el
1 2.5 0
2 3.0 78
3 3.5 172
4 4.0 287
5 4.5 426
6 5.0 601
7 5.5 814
8 6.0 1069
9 6.5 1367
10 7.0 1717
11 7.5 2110
12 8.0 2546
13 8.5 3002
14 9.0 3427
15 9.5 3751
16 10.0 3922
The output should look like this :
v P_el
2 3 78
4 4 287
6 5 601
8 6 1069
10 7 1717
12 8 2546
14 9 3427
16 10 3922
We can check whether dividing each value by one leaves a remainder of 0.
dat[dat$v %% 1 == 0, ]
v P_el
2 3 78
4 4 287
6 5 601
8 6 1069
10 7 1717
12 8 2546
14 9 3427
16 10 3922
DATA
dat <- read.table(text = " v P_el
1 2.5 0
2 3.0 78
3 3.5 172
4 4.0 287
5 4.5 426
6 5.0 601
7 5.5 814
8 6.0 1069
9 6.5 1367
10 7.0 1717
11 7.5 2110
12 8.0 2546
13 8.5 3002
14 9.0 3427
15 9.5 3751
16 10.0 3922",
header = TRUE)
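One caveat with the exact remainder test: it works for values such as 3.0 that are read from text, but values produced by arithmetic can be off by a tiny rounding error. The sketch below shows the difference and a tolerance-based alternative; is_whole and its tol default are assumptions made for this illustration, not part of the answer above.

```r
v <- c(2.5, 3.0, 3.5, 4.0)

# Exact test: fine for values read from text such as 3.0 or 4.0
v[v %% 1 == 0]

# Computed values can miss by a tiny rounding error
x <- sqrt(2)^2          # mathematically 2, but not stored as exactly 2
x %% 1 == 0

# Tolerance-based check (tol is an assumed threshold, adjust as needed)
is_whole <- function(x, tol = 1e-8) abs(x - round(x)) < tol
is_whole(x)
```

For the data in the question the exact test is enough; the tolerance version is only worth the extra code when v is the result of a calculation.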
You can use the seq() function if you know which row positions in column v hold the whole numbers:
dat
# v P_el
# 1 2.5 0
# 2 3.0 78
# 3 3.5 172
# 4 4.0 287
# 5 4.5 426
# 6 5.0 601
# 7 5.5 814
# 8 6.0 1069
# 9 6.5 1367
# 10 7.0 1717
# 11 7.5 2110
# 12 8.0 2546
# 13 8.5 3002
# 14 9.0 3427
# 15 9.5 3751
# 16 10.0 3922
dat[seq(2,16,by = 2),]
# v P_el
# 2 3 78
# 4 4 287
# 6 5 601
# 8 6 1069
# 10 7 1717
# 12 8 2546
# 14 9 3427
# 16 10 3922

R: Deleting Columns When Trying To Replace

I'm trying to replace NA values in a column in a data frame with the value from another column in the same row. Instead of replacing the values the entire column seems to be deleted.
fDF is a data frame where some values are NA. When column 1 has an NA value I want to replace it with the value in column 2.
fDF[columns[1]] = if(is.na(fDF[columns[1]]) == TRUE &
is.na(fDF[columns[2]]) == FALSE) fDF[columns[2]]
I'm not sure what I'm doing wrong here.
Thanks
You can adapt the following code to your data:
> ddf
xx yy zz
1 1 10 11.88
2 2 9 NA
3 3 11 12.20
4 4 9 12.48
5 5 7 NA
6 6 6 13.28
7 7 9 13.80
8 8 8 14.40
9 9 5 NA
10 10 4 15.84
11 11 6 16.68
12 12 6 17.60
13 13 5 18.60
14 14 4 19.68
15 15 6 NA
16 16 8 22.08
17 17 4 23.40
18 18 6 24.80
19 19 8 NA
20 20 11 27.84
21 21 8 29.48
22 22 10 31.20
23 23 9 33.00
>
>
> idx = is.na(ddf$zz)
> idx
[1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
[22] FALSE FALSE
>
> ddf$zz[idx]=ddf$yy[idx]
>
> ddf
xx yy zz
1 1 10 11.88
2 2 9 9.00
3 3 11 12.20
4 4 9 12.48
5 5 7 7.00
6 6 6 13.28
7 7 9 13.80
8 8 8 14.40
9 9 5 5.00
10 10 4 15.84
11 11 6 16.68
12 12 6 17.60
13 13 5 18.60
14 14 4 19.68
15 15 6 6.00
16 16 8 22.08
17 17 4 23.40
18 18 6 24.80
19 19 8 8.00
20 20 11 27.84
21 21 8 29.48
22 22 10 31.20
23 23 9 33.00
>
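You want an ifelse() expression:
fDF[[columns[1]]] <- ifelse(is.na(fDF[[columns[1]]]), fDF[[columns[2]]], fDF[[columns[1]]])
rather than assigning the result of an if statement to a column, which doesn't work because if is not vectorized. Note the double brackets: [[ ]] extracts each column as a vector, which is what ifelse() expects, whereas [ ] returns a one-column data frame.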
[EDIT only for David Arenburg: if that wasn't already explicit enough, in R if statements are not vectorized, hence can only handle scalar expressions, hence they're not what the OP needed. I had already tagged the question 'vectorization' yesterday and the OP is free to go read about vectorization in R in any of the thousands of good writeups and tutorials out there.]
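A short sketch of why the column disappeared and how ifelse() avoids it. An if without an else branch returns NULL when its condition is FALSE, and assigning NULL to a data frame column removes that column. The data frame below is a made-up stand-in for the asker's fDF, and if_result is a hypothetical name.

```r
# An if() with no else branch yields NULL when the condition is FALSE;
# assigning NULL to a data frame column deletes that column.
if_result <- if (FALSE) 1
is.null(if_result)

# Hypothetical stand-in for fDF with two columns
fDF <- data.frame(a = c(1, NA, 3, NA), b = c(10, 20, NA, 40))
columns <- c("a", "b")

# ifelse() evaluates the condition element by element.
# [[ ]] extracts the column as a plain vector; [ ] would return a data frame.
fDF[[columns[1]]] <- ifelse(is.na(fDF[[columns[1]]]),
                            fDF[[columns[2]]],
                            fDF[[columns[1]]])

fDF$a
```

If both columns are NA in the same row, the result is still NA there, which matches the intent of the original condition.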

adding new column to data frame in R

rate len ADT trks sigs1 slim shld lane acpt itg lwid hwy
1 4.58 4.99 69 8 0.20040080 55 10 8 4.6 1.20 12 FAI
2 2.86 16.11 73 8 0.06207325 60 10 4 4.4 1.43 12 FAI
3 3.02 9.75 49 10 0.10256410 60 10 4 4.7 1.54 12 FAI
4 2.29 10.65 61 13 0.09389671 65 10 6 3.8 0.94 12 FAI
5 1.61 20.01 28 12 0.04997501 70 10 4 2.2 0.65 12 FAI
6 6.87 5.97 30 6 2.00750419 55 10 4 24.8 0.34 12 PA
7 3.85 8.57 46 8 0.81668611 55 8 4 11.0 0.47 12 PA
8 6.12 5.24 25 9 0.57083969 55 10 4 18.5 0.38 12 PA
9 3.29 15.79 43 12 1.45333122 50 4 4 7.5 0.95 12 PA
I have a question about adding a new column. My data frame is called highway1, and I want to add a column named S/N, defined as slim divided by acpt. How can I do that?
Thanks
> mydf$SN <- mydf$slim/mydf$acpt
> mydf
rate len ADT trks sigs1 slim shld lane acpt itg lwid hwy SN
1 4.58 4.99 69 8 0.20040080 55 10 8 4.6 1.20 12 FAI 11.956522
2 2.86 16.11 73 8 0.06207325 60 10 4 4.4 1.43 12 FAI 13.636364
3 3.02 9.75 49 10 0.10256410 60 10 4 4.7 1.54 12 FAI 12.765957
4 2.29 10.65 61 13 0.09389671 65 10 6 3.8 0.94 12 FAI 17.105263
5 1.61 20.01 28 12 0.04997501 70 10 4 2.2 0.65 12 FAI 31.818182
6 6.87 5.97 30 6 2.00750419 55 10 4 24.8 0.34 12 PA 2.217742
7 3.85 8.57 46 8 0.81668611 55 8 4 11.0 0.47 12 PA 5.000000
8 6.12 5.24 25 9 0.57083969 55 10 4 18.5 0.38 12 PA 2.972973
9 3.29 15.79 43 12 1.45333122 50 4 4 7.5 0.95 12 PA 6.666667
I hope an explanation is not necessary for the above.
While $ is the preferred route, you can also consider cbind.
First, create the numeric vector and assign it to SN:
SN <- Data[,6]/Data[,9]
Now you use cbind to append the numeric vector as a column to the existing data frame:
Data <- cbind(Data, SN)
Again, using the dollar operator $ is preferred, but it doesn't hurt seeing what an alternative looks like.
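Both answers can also be written with column names instead of positions, which keeps working if the column order changes. A minimal sketch, using two rows of the question's slim and acpt values (highway2 is a hypothetical name):

```r
# Two rows copied from the question's table, relevant columns only
highway1 <- data.frame(slim = c(55, 60), acpt = c(4.6, 4.4))

# Refer to columns by name rather than by position (6 and 9 in the answer above)
highway1$SN <- highway1$slim / highway1$acpt

# The same with transform(), which returns a modified copy
highway2 <- transform(highway1[c("slim", "acpt")], SN = slim / acpt)

print(highway1)
```

The column is called SN here because S/N is not a syntactic R name; using it literally would require backticks every time the column is referenced.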
