Subset of a melted data frame - R

My data frame:
library(reshape2)

a <- c(2, 4, 6, 6, 8, 10, 12, 13, 14)
c <- c(2, 2, 2, 2, 2, 2, 4, 4, 4)
d <- c(10, 10, 10, 30, 30, 30, 50, 50, 50)
ID <- rep(c("no", "bo", "fo"), each = 3)
mydata <- data.frame(ID, a, c, d)
gg.df <- melt(mydata, id = "ID", variable.name = "variable")
I want to subset just the rows where ID is "no". I have tried:
gg.df[,"variable"=="no"]
which returns
data frame with 0 columns and 27 rows

The comparison "variable" == "no" compares two literal strings, evaluates to FALSE, and is then used as a column index, which is why you get a data frame with 0 columns. To select rows, use the subset function:
no.df <- subset(x = gg.df, subset = ID == "no")
ID variable value
1 no a 2
2 no a 4
3 no a 6
10 no c 2
11 no c 2
12 no c 2
19 no d 10
20 no d 10
21 no d 10
Or with bracket indexing (reference the data frame explicitly; plain ID == "no" only works here by accident because ID also exists as a separate vector in the workspace):
gg.df[gg.df$ID == "no", ]
ID variable value
1 no a 2
2 no a 4
3 no a 6
10 no c 2
11 no c 2
12 no c 2
19 no d 10
20 no d 10
21 no d 10
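If you already use dplyr, the same rows can be selected with filter(); a minimal equivalent sketch, assuming gg.df as built above:
library(dplyr)
# keep only the rows whose ID is "no"
no.df <- filter(gg.df, ID == "no")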

Related

Replacing values in data frame column based on another column

I have a data frame in R:
a b c d e
1 2 3 23 1
4 5 6 -Inf 2
7 8 9 2 8
10 11 12 -Inf NaN
and I'd like to replace the values in column e with NA wherever the corresponding value in column d is -Inf, like this:
a b c d e
1 2 3 23 1
4 5 6 -Inf NA
7 8 9 2 8
10 11 12 -Inf NA
Any help is appreciated. I haven't been able to do it without loops, and it's taking a long time for the full data frame.
ifelse is vectorized, so we can do this without a loop.
dat$e <- ifelse(dat$d == -Inf, NA, dat$e)
DATA
dat <- read.table(text = "a b c d e
1 2 3 23 1
4 5 6 -Inf 2
7 8 9 2 8
10 11 12 -Inf NaN", header = TRUE)
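Plain logical replacement in base R also works; a one-line sketch using the same dat:
# overwrite e only in the rows where d is -Inf
dat$e[dat$d == -Inf] <- NA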
Using data.table
library(data.table)
setDT(dat)[is.infinite(d), e := NA]
A solution with dplyr:
library(tidyverse)
df <- tribble(
  ~a, ~b, ~c,  ~d,  ~e,
   1,  2,  3,  23,   1,
   4,  5,  6, -Inf,  2,
   7,  8,  9,   2,   8,
  10, 11, 12, -Inf, NaN)
df1 <- df %>%
  dplyr::mutate(e = case_when(d == -Inf ~ NA_real_,
                              TRUE ~ e))

Using column value as dataframe index in list dataframes (Map or lapply with seq_along)?

I have a list of data frames, list1, and need a new column mn in each data frame that is the mean of the first num + 1 data columns, where num is the value in another column. So, for num = 3 the new column would be the mean of the first four data columns. For the example below
df1 <- data.frame(num= c(3, 1, 1, 1, 2), d1= c(1, 17, 17, 17, 15), d2= c(1, 15, 15, 15, 21), d3= c(6, 21, 21, 21, 23), d4= c(2, 3, 3, 3, 2))
df2 <- data.frame(num= c(3, 2, 2, 2, 2), d1= c(1, 10, 10, 10, 15), d2= c(1, 5, 5, 5, 21), d3= c(6, 2, 2, 2, 23), d4= c(2, 3, 3, 3, 5))
list1 <- list(df1, df2)
I would expect
newlist
[[1]]
num d1 d2 d3 d4 mn
1 3 1 1 6 2 2.5
2 1 17 15 21 3 16.0
3 1 17 15 21 3 16.0
The closest I've gotten is
newlist <- lapply(list1, function(x) {
  x <- cbind(x, sapply(x$num, function(y) {
    y <- rowSums(x[2:(2 + y)]) / (y + 1)
  }))
})
which binds columns for the means of every row. Based on this post I think I need a seq_along or maybe a Map on the inside function but I can't figure out how to implement it.
An option is to loop over the list with lapply and, for each data frame, use apply row-wise: take the first num + 1 data elements (based on the value in the 'num' column), compute their mean, and create the new column with transform:
lapply(list1, function(x) transform(x,
  mn = apply(x, 1, function(y) mean(y[-1][seq(y[1] + 1)]))))
#[[1]]
# num d1 d2 d3 d4 mn
#1 3 1 1 6 2 2.50000
#2 1 17 15 21 3 16.00000
#3 1 17 15 21 3 16.00000
#4 1 17 15 21 3 16.00000
#5 2 15 21 23 2 19.66667
#[[2]]
# num d1 d2 d3 d4 mn
#1 3 1 1 6 2 2.500000
#2 2 10 5 2 3 5.666667
#3 2 10 5 2 3 5.666667
#4 2 10 5 2 3 5.666667
#5 2 15 21 23 5 19.666667
Or with the tidyverse: pivot to 'long' format with pivot_longer, group by row, and take the mean of the first num + 1 values:
library(purrr)
library(dplyr)
library(tidyr)
map(list1, ~
  .x %>%
    mutate(rn = row_number()) %>%
    pivot_longer(cols = starts_with('d')) %>%
    group_by(rn) %>%
    summarise(value = mean(value[seq_len(first(num) + 1)])) %>%
    pull(value) %>%
    bind_cols(.x, mn = .))
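A plain base R sketch of the same idea (assuming list1 as defined above): loop over the rows with vapply and average columns d1 through d(num + 1), i.e. num + 1 data columns.
newlist <- lapply(list1, function(x) {
  x$mn <- vapply(seq_len(nrow(x)), function(i) {
    n <- x$num[i]
    mean(unlist(x[i, 1 + seq_len(n + 1)]))  # data columns 2 .. n + 2
  }, numeric(1))
  x
})
newlist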

How to subtract values by group (subtract blank stored as one group) using dplyr?

I have some tidy data, and one of the group is a blank:
df <- data.frame(Group = c(rep(LETTERS[1:3], 3), "Blank", "Blank", "Blank"),
ID = rep(1:3, 4),
Value = c(10, 11, 12, 21, 22, 23, 31, 32, 33, 1, 2, 3))
df
Group ID Value
1 A 1 10
2 B 2 11
3 C 3 12
4 A 1 21
5 B 2 22
6 C 3 23
7 A 1 31
8 B 2 32
9 C 3 33
10 Blank 1 1
11 Blank 2 2
12 Blank 3 3
I want to subtract Blank from each group (A, B, C), so the normalized data will look like this:
df_normalized<- data.frame(Group = rep(LETTERS[1:3], 3),
ID = rep(1:3, 3),
Value = c(9, 9, 9, 20, 20, 20, 30, 30, 30))
df_normalized
Group ID Value
1 A 1 9
2 B 2 9
3 C 3 9
4 A 1 20
5 B 2 20
6 C 3 20
7 A 1 30
8 B 2 30
9 C 3 30
How to do it nicely using dplyr?
EDIT:
How to do that for multiple groups? e.g:
df <- data.frame(Cluster = c(rep("C1", 12), rep("C2", 12)),
Group = rep(c(rep(LETTERS[1:3], 3), "Blank", "Blank", "Blank"), 2),
ID = rep(1:3, 8),
Value = sample(24))
Assuming you'll have only one "Blank" value per ID as shown in the example, you can do
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Value = Value - Value[Group == "Blank"]) %>%
filter(Group != "Blank")
# Group ID Value
# <fct> <int> <dbl>
#1 A 1 9
#2 B 2 9
#3 C 3 9
#4 A 1 20
#5 B 2 20
#6 C 3 20
#7 A 1 30
#8 B 2 30
#9 C 3 30
If you have more than one "Blank" per ID, you can use match, which ensures that only the first matching value is used.
df %>%
group_by(ID) %>%
mutate(Value = Value - Value[match("Blank", Group)]) %>%
filter(Group != "Blank")

Get the maximum and minimum values of a subset of columns in a data frame with ddply in R

I am trying to select the maximum and minimum values of a group of variables from within a data frame using the ddply function from the plyr package. However, it does not seem to work.
a1 = c(1, 2, 3, 4, 5)
a2 = c(6, 7, 8, 9, 10)
a3 = c(11, 12, 13, 14, 15)
f=letters[1:5]
d= data.frame(f,a1, a2, a3)
t = ddply(d, .(f), summarize,
          minima = apply(f[, c(1:3)], 1, min),
          maxima = apply(f[, c(1:3)], 1, min))
Thanks!
This dplyr approach produces mins and maxes. You may need to reshape the resulting data frame, depending on what you are using it for.
library(dplyr)
# Create dataframe
a1 = c(1, 2, 3, 4, 5)
a2 = c(6, 7, 8, 9, 10)
a3 = c(11, 12, 13, 14, 15)
f=letters[1:5]
d= data.frame(f,a1, a2, a3)
# Get min and max value for a1,a2,a3
d %>% group_by(f) %>% summarise_at(vars(a1,a2,a3),funs(min = min(.),max = max(.)) )
#> # A tibble: 5 × 7
#> f a1_min a2_min a3_min a1_max a2_max a3_max
#> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 6 11 1 6 11
#> 2 b 2 7 12 2 7 12
#> 3 c 3 8 13 3 8 13
#> 4 d 4 9 14 4 9 14
#> 5 e 5 10 15 5 10 15
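Note that funs() has since been deprecated in dplyr; with current dplyr versions the same summary can be written with across(), roughly as below (the result's column order may differ slightly):
library(dplyr)
d %>%
  group_by(f) %>%
  summarise(across(a1:a3, list(min = min, max = max)))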

R code: how to generate a variable based on multiple conditions from other variables

I am a beginner R user.
This is my dataset:
factor1 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8,8,9, 9, 10, 10)
factor2 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,16,17, 18, 19, 20)
factor3 <- c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d")
factor4 <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150,160,170, 180, 190, NA)
dataset <- data.frame(factor1, factor2, factor3, factor4)
I created a new variable this way:
dataset$newvar <-"NA"
How to do the following:
I want newvar to take the value 1 if factor1 >= 5 and factor2 < 19 and (factor3 is "b" or factor3 is "c") and factor4 is not missing and newvar is still missing.
Ideally I want to specify different conditions, so some observations will be value 1, 2, 3 and 4 in the variable newvar dependent on the values of several other variables.
This is very simple and intuitive in Stata, and I would like to know if there is an equally simple and intuitive way to do the same in R.
Generate a new variable based on several conditions for several values.
This bit of the question was not explicitly addressed:
Ideally I want to specify different conditions, so some observations will be value 1, 2, 3 and 4 in the variable newvar dependent on the values of several other variables.
A simple solution would be to use case_when. Similar to Stata's recode it allows you to specify several values simultaneously.
It works the following way:
newvar = case_when(
  condition1 ~ target_value1,
  condition2 ~ target_value2)
e.g. var1 == 1 ~ 0
Important: you need a comma after each condition line.
library(dplyr)
dataset <- mutate(dataset,
  newvar = case_when(
    factor1 >= 5 & factor2 < 19 & (factor3 == "b" | factor3 == "c") ~ 1,
    factor1 == 1 ~ 2,
    factor1 == 2 ~ 3,
    TRUE ~ NA_real_  # all other values not covered by the conditions above
  ))
dataset
# factor1 factor2 factor3 factor4 newvar
# 1 1 1 a 10 2
# 2 1 2 a 20 2
# 3 2 3 a 30 3
# 4 2 4 a 40 3
# 5 3 5 a 50 NA
# 6 3 6 b 60 NA
# 7 4 7 b 70 NA
# 8 4 8 b 80 NA
# 9 5 9 b 90 1
# 10 5 10 b 100 1
# 11 6 11 c 110 1
# 12 6 12 c 120 1
# 13 7 13 c 130 1
# 14 7 14 c 140 1
# 15 8 15 c 150 1
# 16 8 16 d 160 NA
# 17 9 17 d 170 NA
# 18 9 18 d 180 NA
# 19 10 19 d 190 NA
# 20 10 20 d NA NA
Note: plain NA is of logical type, so case_when may reject it as a target value when the other outputs are numeric or character; use one of the typed missing values instead:
NA_character_
NA_integer_
NA_real_
NA_complex_
In base R you can just do (promoting my comment to an answer):
dataset$newvar <- NA
dataset[dataset$factor1 >= 5 & dataset$factor2 < 19 & (dataset$factor3=="b" | dataset$factor3 =="c"), "newvar"] <- 1
or:
dataset$newvar <- NA
indx <- dataset$factor1 >= 5 & dataset$factor2 < 19 & (dataset$factor3=="b" | dataset$factor3 =="c") & !is.na(dataset$factor4)
dataset[indx, "newvar"] <- 1
Using dplyr (note the question asks for factor1 >= 5):
library(dplyr)
dataset %>%
  mutate(newvar = ifelse(factor1 >= 5 &
                           factor2 < 19 &
                           (factor3 == "b" | factor3 == "c") &
                           !is.na(factor4), 1, NA))
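For several target values in base R, the same logical-indexing pattern works with one index per value; a small sketch mirroring the case_when conditions above (unlike case_when, where the first matching condition wins, a later assignment here would overwrite an earlier one if conditions overlapped):
dataset$newvar <- NA_real_
dataset$newvar[dataset$factor1 >= 5 & dataset$factor2 < 19 &
               dataset$factor3 %in% c("b", "c")] <- 1
dataset$newvar[dataset$factor1 == 1] <- 2
dataset$newvar[dataset$factor1 == 2] <- 3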
