Extracting specific columns with a condition on one column only - R

I have data (in R) in the form below:
A B C D
x alpha sine 0
y gama cos 1
z beta tan 2
and I want to extract only columns A and B for the rows where column D > 0.
I tried data %>% filter(D > 0), which gives me the last two rows, where D > 0, but it also returns column C, which I don't want.
How can I get only columns A and B, with the condition applied to column D only?
Data in text:
A B C D
x alpha sine 0
y gama cos 1
z beta tan 2

data %>% filter(D > 0) %>% select(A, B, D)
A B D
1 y gama 1
2 z beta 2
or even:
data %>% filter(D > 0) %>% select(-C)
A B D
1 y gama 1
2 z beta 2
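Since the question asks for columns A and B only, column D can be dropped once it has been used for filtering. A minimal base R sketch, assuming the data frame is called data as in the question; the names res_index and res_subset are illustrative only, and the dplyr equivalent is given in a comment:

```r
# Rebuild the example data from the question
data <- data.frame(
  A = c("x", "y", "z"),
  B = c("alpha", "gama", "beta"),
  C = c("sine", "cos", "tan"),
  D = c(0, 1, 2),
  stringsAsFactors = FALSE
)

# Row condition on D, column selection of A and B only
res_index <- data[data$D > 0, c("A", "B")]

# The same with subset(): second argument filters rows, select= picks columns
res_subset <- subset(data, D > 0, select = c(A, B))

# With dplyr loaded, the equivalent pipeline would be:
#   data %>% filter(D > 0) %>% select(A, B)
print(res_index)
```

The filter has to run before the column selection in a pipeline, because D must still be present when the condition is evaluated.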

What exactly does the logical parameter of the `subset` function do in R?

I am learning R with the book Learning R by Richard Cotton (Chapter 5: Lists and Data Frames) and I don't understand one of the examples. I have this data frame and the following script:
(a_data_frame <- data.frame(
  x = letters[1:5],
  y = rnorm(5),
  z = runif(5) > 0.5
))
x y z
1 a 0.6395739 FALSE
2 b -1.1645383 FALSE
3 c -1.3616093 FALSE
4 d 0.5658254 FALSE
5 e 0.4345538 FALSE
subset(a_data_frame, y > 0 | z, x) # what exactly does y > 0 | z mean?
The book says:
subset takes up to three arguments: a data frame to subset, a
logical vector of conditions for rows to include, and a vector of
column names to keep
It gives no more information about the second (logical) argument.
It's a tricky example. In subset(a_data_frame, y > 0 | z, x), the second argument, y > 0 | z, keeps the rows where y is greater than 0 or where the value in column z is TRUE.
y > 0 is evaluated on the values produced by rnorm(5); your values differ from the book's because they are randomly generated. The "|" symbol is an element-wise "or", so a row is also kept when its z value is TRUE. In your case all the z values are FALSE, so you can't see what is going on. As a didactic example, if we use z = rnorm(5) instead of runif(5) > 0.5, it is easier to see how the function works.
(a_data_frame <- data.frame(
  x = letters[1:5],
  y = rnorm(5),
  z = rnorm(5)
))
x y z
1 a -0.91016367 2.04917552
2 b 0.01591093 0.03070526
3 c 0.19146220 -0.42056236
4 d 1.07171934 1.31511485
5 e 1.14760483 -0.09855757
So with the condition y < 0 | z < 0, the output is column x for rows a, c and e:
> subset(a_data_frame, y < 0 | z < 0, x)
x
1 a
3 c
5 e
> subset(a_data_frame, y < 0 & z < 0, x)
[1] x
<0 rows> (or 0-length row.names) # no row has both y < 0 and z < 0
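> subset(a_data_frame, y < 0 & z, x) # z is numeric here, so every nonzero z counts as TRUE: same rows as y < 0
x
1 a
> subset(a_data_frame, y < 0 | z, x) # every z is nonzero, so the condition is TRUE for all rows
x
1 a
2 b
3 c
4 d
5 e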
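To see the effect of each part of the condition without random numbers, here is a small sketch with hand-picked values. The values and the names keep, by_subset and by_index are illustrative assumptions, not taken from the book. It shows that the second argument of subset is simply a logical vector with one element per row:

```r
# Fixed values instead of rnorm()/runif(), so the result is reproducible
a_data_frame <- data.frame(
  x = letters[1:5],
  y = c(0.6, -1.2, -1.4, 0.5, -0.3),
  z = c(FALSE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# The row condition on its own: TRUE where y is positive OR z is TRUE
keep <- with(a_data_frame, y > 0 | z)

# subset() evaluates the condition inside the data frame and keeps column x
by_subset <- subset(a_data_frame, y > 0 | z, x)

# The same selection written with plain indexing
by_index <- a_data_frame[keep, "x", drop = FALSE]

print(keep)
print(by_subset)
```

Row b is kept only because of z (its y is negative), rows a and d are kept because y > 0, and rows c and e fail both tests.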

Fill empty cells between two values in column with last non empty cell and next non empty cell in R

I need to loop over IDs in a dataframe and fill the NA values in a column by splitting each run of empty cells evenly between the last filled entry before the run and the first filled entry after it.
ID Value X Y
1 A x y
1 NA x y
1 NA x y
1 NA x y
1 NA x y
1 NA x y
1 B x y
2 C x y
2 NA x y
2 NA x y
2 NA x y
2 NA x y
2 D x y
Which should be filled to this:
ID Value X Y
1 A x y
1 A x y
1 A x y
1 B x y
1 B x y
1 B x y
1 B x y
2 C x y
2 C x y
2 C x y
2 D x y
2 D x y
2 D x y
In case of 2n NA values between observations, n is attributed to the last and n to the next. In case of 2n+1 values, n is attributed to the last and n+1 to the next.
I know I need to use na.locf from the zoo package which works well with a large database for filling in empty values based on the last non-empty cell, along with the fromLast argument to perform "next observation carried backwards". I cannot however structure a loop to account for an even or odd number of NA values, and use both of these together.
Using tidyr and dplyr from the tidyverse,
> library(tidyr)
> library(dplyr)
> df %>% dplyr::group_by(ID) %>% fill(Value, .direction = "downup") %>% dplyr::ungroup()
This fills in NA values in both directions but does not account for different border values for NA cells in a group.
Define interp, which replaces the successive non-NA values with successive integers, applies na.approx, rounds, and then replaces the resulting integers with the original values.
library(zoo)
interp <- function(x) {
  x0 <- ifelse(is.na(x), NA, cumsum(!is.na(x)))
  xx <- na.approx(x0, rule = 2)
  na.omit(x)[round(xx)]
}
transform(DF, Value = interp(Value))
giving:
ID Value X Y
1 1 A x y
2 1 A x y
3 1 A x y
4 1 B x y
5 1 B x y
6 1 B x y
7 1 B x y
8 2 C x y
9 2 C x y
10 2 C x y
11 2 D x y
12 2 D x y
13 2 D x y
Note
It is assumed that the input is the following, shown in reproducible form.
Lines <- "ID Value X Y
1 A x y
1 NA x y
1 NA x y
1 NA x y
1 NA x y
1 NA x y
1 B x y
2 C x y
2 NA x y
2 NA x y
2 NA x y
2 NA x y
2 D x y"
DF <- read.table(text = Lines, header = TRUE)
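Two caveats may matter on other data. First, interp as written does not look at ID, so a run of NAs that crosses an ID boundary would be interpolated across groups. Second, round() in R rounds an exact half to the even integer, so an odd run of NAs between, say, the 2nd and 3rd known values would give the extra cell to the previous value rather than the next. The following base R sketch is one possible variant under those assumptions; fill_split is a hypothetical helper name, and it uses approx() from the stats package instead of zoo:

```r
# Hypothetical helper (not part of zoo): fill each run of NAs half from the
# previous known value and half from the next one.
fill_split <- function(x) {
  known <- which(!is.na(x))
  if (length(known) == 0) return(x)                        # nothing to copy from
  if (length(known) == 1) return(rep(x[known], length(x))) # one value fills everything
  # Interpolate the rank (1, 2, 3, ...) of the known values over all positions;
  # rule = 2 repeats the first/last known value for leading/trailing NAs.
  pos <- approx(known, seq_along(known), xout = seq_along(x), rule = 2)$y
  # floor(pos + 0.5) always sends an exact half to the next value
  x[known][floor(pos + 0.5)]
}

DF <- data.frame(
  ID = rep(1:2, times = c(7, 6)),
  Value = c("A", rep(NA, 5), "B", "C", rep(NA, 4), "D"),
  stringsAsFactors = FALSE
)

# ave() applies fill_split() separately within each ID
DF$Filled <- ave(DF$Value, DF$ID, FUN = fill_split)
print(DF)
```

On the example data this gives the same result as the desired output in the question; the difference from round() only shows up for odd runs such as c("A", NA, "B", NA, "C"), where the NA between "B" and "C" becomes "C".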
I guess the simplest way is to use na.locf (Last Observation Carried Forward) from the zoo package, if you are working with zoo/time-series data.
see: https://www.rdocumentation.org/packages/zoo/versions/1.8-9/topics/na.locf

How to change specific values in a dataframe

Could anyone explain how to change the negative values in the dataframe below?
We have been asked to create a data structure that gives the output below.
# > df
# x y z
# 1 a -2 3
# 2 b 0 4
# 3 c 2 -5
# 4 d 4 6
Then we have to use control flow operators and/or vectorisation to multiply only the negative values by 10.
I tried many different ways but cannot get this to work. I get an error when I try to use a loop, because of the letters.
Create indices of the negative values and multiply by 10, i.e.
i1 <- which(df < 0, arr.ind = TRUE)
df[i1] <- as.numeric(df[i1]) * 10
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
First find out the numeric columns of the dataframe and multiply the negative values by 10.
cols <- sapply(df, is.numeric)
#Multiply negative values by 10 and positive with 1
df[cols] <- df[cols] * ifelse(sign(df[cols]) == -1, 10, 1)
df
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
Using dplyr -
library(dplyr)
df <- df %>% mutate(across(where(is.numeric), ~. * ifelse(sign(.) == -1, 10, 1)))
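Since the exercise mentions control flow, here is a sketch that combines a for loop and an if with vectorised assignment inside each column. It assumes the data frame from the question; the loop variable col and the mask neg are illustrative names:

```r
# Rebuild the data frame from the question
df <- data.frame(
  x = c("a", "b", "c", "d"),
  y = c(-2, 0, 2, 4),
  z = c(3, 4, -5, 6),
  stringsAsFactors = FALSE
)

for (col in names(df)) {
  # Skip the letter column: comparing and multiplying only makes sense for numbers
  if (is.numeric(df[[col]])) {
    neg <- df[[col]] < 0                  # logical mask of the negative entries
    df[[col]][neg] <- df[[col]][neg] * 10 # multiply only those entries
  }
}
print(df)
```

The is.numeric() check is what avoids the error caused by the letters: the character column is never touched by the arithmetic.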

R collapse duplicate pairs (in any order) across dataframe columns and edit 3rd column?

I used rbind to join 2 dataframes, with a column denoting its source, resulting in
from | to | source
1 A B X
2 C D Y
3 B A Y
...
I would like to look for overlapping pairs, regardless of "order", combine those pairs, then edit the source column to something else, e.g. "Z".
In the above example, rows 1 and 3 would be flagged as overlapping, so they will be combined and modified.
So the desired output would look something like
from | to | source
1 A B Z
2 C D Y
...
How can this be done?
You can try the code below
unique(
  transform(
    transform(
      df,
      from = pmin(from, to),
      to = pmax(from, to)
    ),
    source = ave(source, from, to, FUN = function(x) ifelse(length(x) > 1, "Z", x))
  )
)
which gives
from to source
1 A B Z
2 C D Y
Example
set.seed(1)
df <- data.frame(
  from = sample(LETTERS[1:4], 10, replace = TRUE),
  to = sample(LETTERS[1:4], 10, replace = TRUE),
  source = sample(c("X", "Y"), 10, replace = TRUE)
)
from to source
1 A C X
2 D C X
3 C A X
4 A A X
5 B A X
6 A B X
7 C B Y
8 C B X
9 B B X
10 B C Y
and then
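# sort each (from, to) pair so that A-B and B-A compare equal
tmp <- t(apply(df, 1, function(x) sort(x[1:2])))
t1 <- duplicated(tmp, fromLast = FALSE) # later copies of a pair
t2 <- duplicated(tmp, fromLast = TRUE)  # rows that have a later copy
df[t2, "source"] <- "Z"
df[!t1, ]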
from to source
1 A C Z
2 D C X
4 A A X
5 B A Z
7 C B Z
9 B B X
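Another way to express the same idea is to build one order-independent key per row and work with duplicated() on that key. This is only a sketch on the three rows from the question; the names key and result are illustrative, and pasting with a space assumes that the labels themselves contain no spaces:

```r
# The three example rows from the question
df <- data.frame(
  from = c("A", "C", "B"),
  to = c("B", "D", "A"),
  source = c("X", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Order-independent key: smaller label first, so A-B and B-A give the same key
key <- paste(pmin(df$from, df$to), pmax(df$from, df$to))

# Mark every row whose pair occurs more than once
df$source[key %in% key[duplicated(key)]] <- "Z"

# Keep the first occurrence of each pair
result <- df[!duplicated(key), ]
print(result)
```

Unlike the transform() version, this keeps from and to as they appear in the first occurrence of each pair instead of sorting them.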

R help - change the maximum value of each row in a certain condition

I am a novice in R. I have a dataframe with columns 1:n. Excluding columns 1 and n, I want to replace the maximum value of each row and set the remaining values (again excluding columns 1 and n) to zero, but only if the row has a specific value in a different column. I have about 300,000 cases and 40 columns in my real data; the example below illustrates what I am trying to achieve:
A <- c(1,1,5,5,10)
B <- rnorm(1:5)
C <- rnorm(1:5)
D <- rnorm(1:5)
E <- c(10,15,100,100,100)
df <- data.frame(A,B,C,D,E)
df
A B C D E
1 1 0.74286670 0.3222136 0.9381296 10
2 1 -0.03352498 0.5262685 0.1225731 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100
Here, if column A of a row is 1, I want to replace the maximum of columns B, C and D in that row with the value of column E, and set the other values in columns B, C and D to 0.
So, the result should be like this:
A B C D E
1 1 0 0 10 10
2 1 0 15 0 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100
I tried to do this for two days. Thanks.
Try this out and see what happens :)
df <- read.table(text = "A B C D E
1 1 0.74286670 0.3222136 0.9381296 10
2 1 -0.03352498 0.5262685 0.1225731 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100", stringsAsFactors = FALSE)
# find the max in columns B,C,D
z <- apply(df[df$A == 1, 2:4], 1, max)
# substitute the maximum value of each row for columns B,C,D where A == 1
# with the value of column E. Assign 0 to the others
y <- ifelse(df[df$A == 1, 2:4] == z, df$E[df$A == 1], 0)
# Change the values in your dataframe
df[df$A == 1, 2:4] <- y
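For around 300,000 rows it may be worth avoiding apply(). The sketch below is one possible alternative using max.col(), which returns the column position of each row maximum. It assumes the example data from the question; the names mid, rows, m and out are illustrative. With ties.method = "first" only the first of several tied maxima is replaced, which differs from the == comparison above, where every tied maximum would receive the value of E:

```r
df <- data.frame(
  A = c(1, 1, 5, 5, 10),
  B = c(0.74286670, -0.03352498, -0.17689629, 0.48329153, 0.13117277),
  C = c(0.3222136, 0.5262685, -0.8949740, 1.1574834, -0.2068736),
  D = c(0.9381296, 0.1225731, -1.4376567, -1.1116581, 0.4841806),
  E = c(10, 15, 100, 100, 100)
)

mid <- c("B", "C", "D")       # columns between the first and the last
rows <- which(df$A == 1)      # rows that meet the condition

m <- as.matrix(df[rows, mid])
out <- matrix(0, nrow(m), ncol(m))  # start with zeros everywhere

# Put E into the position of each row maximum (one cell per row)
out[cbind(seq_along(rows), max.col(m, ties.method = "first"))] <- df$E[rows]

df[rows, mid] <- out
print(df)
```

In the real data, mid would be names(df)[2:(ncol(df) - 1)], and the last column would take the place of E.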
