[ First Stack question please be kind :) ]
I'm creating multiple new columns in a data frame based on multiple conditional statements of existing columns - all essentially new combinations of columns.
For example, if there are 4 columns (a:d), I need new columns of all combinations (abcd, abc, abd, etc) and a 0/1 coding based on threshold data in a:d.
A toy data example and the desired outcome are included below. However, the solution needs to be scalable: there are 4 base columns, but I need all combinations of 2, 3 and 4 columns, not just the 3-value ones (abc, abd, ..., ab, ac, ad, ...; total n = 11).
[Background for context: this is actually flow cytometry data from multipotent stem cells that can grow into colonies of all lineage cell types (multipotent, or abcd) or progressively more restricted populations (only abc, or abd, ab, ac, etc.).]
# Toy data set
library(dplyr)  # for tibble(), mutate(), if_else()

set.seed(123)
df <- tibble(a = sample(10:50, 10),
             b = sample(10:50, 10),
             c = sample(10:50, 10),
             d = sample(10:50, 10))
My current code produces the desired result; however, it needs 11 lines of repetitive, error-prone code, and I hope there is a more elegant solution:
df %>%
  mutate(
    abcd = if_else(a > 30 & b > 20 & c > 30 & d > 30, 1, 0),
    abc  = if_else(a > 30 & b > 20 & c > 30 & d <= 30, 1, 0),
    abd  = if_else(a > 30 & b > 20 & c <= 30 & d > 30, 1, 0),
    acd  = if_else(a > 30 & b <= 20 & c > 30 & d > 30, 1, 0),
    bcd  = if_else(a <= 30 & b > 20 & c > 30 & d > 30, 1, 0))
If I understand your question correctly, for each row you just need to find which columns meet the criteria defined in your if_else() conditions. This vectorized solution will add a column to your df containing the combination for each row. It is probably also faster than multiple if_else() conditions. Finally, the new column can be used for ordering or grouping.
# define the threshold levels for all columns
threshold <- c(a = 30, b = 20, c = 30, d = 30)

# get the names of columns meeting the threshold and paste the names together
df$combn <- apply(df, 1, function(x) {
  paste(names(x)[x > threshold], collapse = "")
})
> df
# A tibble: 10 x 5
a b c d combn
<int> <int> <int> <int> <chr>
1 21 49 46 49 bcd
2 41 28 37 46 abcd
3 25 36 34 36 bcd
4 43 31 47 40 abcd
5 44 13 48 10 ac
6 11 42 35 27 bc
7 28 18 29 48 d
8 40 11 30 17 a
9 46 20 19 20 a
10 24 40 14 43 bd
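One caveat worth noting: `x > threshold` in the `apply()` call compares by position, so `threshold` must be in the same order as the columns. Indexing the row by the threshold names removes that dependency. A minimal sketch (the two-row data frame here is only illustrative, not the OP's data):

```r
# thresholds named after the columns they apply to
threshold <- c(a = 30, b = 20, c = 30, d = 30)

# index the row by the threshold names so column order no longer matters
row_combo <- function(x) {
  paste(names(threshold)[x[names(threshold)] > threshold], collapse = "")
}

df <- data.frame(a = c(41, 21), b = c(28, 49), c = c(37, 46), d = c(46, 49))
unname(apply(df, 1, row_combo))
# "abcd" "bcd"
```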
If I understand correctly, you want to categorize each row into exactly one class, so building the category name as a concatenation of threshold tests should be enough. Then you can get the 0/1 columns using spread():
df %>%
  mutate(
    a_ = if_else(a > 30, 'a', 'x'),
    b_ = if_else(b > 20, 'b', 'x'),
    c_ = if_else(c > 30, 'c', 'x'),
    d_ = if_else(d > 30, 'd', 'x'),
    all_ = paste0(a_, b_, c_, d_),
    one_ = 1) %>%
  spread(all_, one_, fill = 0) %>%
  select(-ends_with("_"))
Gives
# A tibble: 10 x 11
a b c d abcd axcx axxx xbcd xbcx xbxd xxxd
<int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11 42 35 27 0 0 0 0 1 0 0
2 21 49 46 49 0 0 0 1 0 0 0
3 24 40 14 43 0 0 0 0 0 1 0
4 25 36 34 36 0 0 0 1 0 0 0
5 28 18 29 48 0 0 0 0 0 0 1
6 40 11 30 17 0 0 1 0 0 0 0
7 41 28 37 46 1 0 0 0 0 0 0
8 43 31 47 40 1 0 0 0 0 0 0
9 44 13 48 10 0 1 0 0 0 0 0
10 46 20 19 20 0 0 1 0 0 0 0
(You can use '' instead of 'x', but then spread() will overwrite some of your original columns.)
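As a side note, the spread step can also be done without tidyr: once each row has its category label, `outer()` against the unique labels yields the 0/1 columns directly. A base-R sketch with made-up labels:

```r
# per-row category labels, as produced by the paste0() step above
cats <- c("xbcx", "abcd", "xbcx")

levs <- unique(cats)
onehot <- outer(cats, levs, `==`) * 1  # logical matrix -> 0/1
colnames(onehot) <- levs
onehot
```

Each column of `onehot` is the indicator for one observed category, which can then be bound back onto the original data frame.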
I'd like to loop through the following data frame in order of the sum of the first 2 column values for each row, and then assign a number to the third column as a result.
Initial Table:

Col 1  Col 2  Col 3
   20      0
    5      0
   20      0
    0     10
   20      0
   10      0
   20     40
   15      0
The sums of columns 1 and 2 give:
20+0=20
5+0=5
20+0=20
0+10=10
20+0=20
10+0=10
20+40=60
15+0=15
Desired output:

Col 1  Col 2  Col 3
   20      0     10
    5      0     20
   20      0     10
    0     10     20
   20      0     10
   10      0     20
   20     40      5
   15      0     20
The 3 lowest sums get Col 3 value 20, the next 4 lowest get value 10, and the highest value gets 5.
This can be done using a single assignment rather than a loop, for example:
#Example data
df <- data.frame(col1 = c(20, 5, 20, 0, 21, 10, 20, 15),
                 col2 = c(0, 0, 0, 10, 0, 0, 40, 0))

#Add dummy values
df$col3 <- NA

#Assign required values
df$col3[order(df$col1 + df$col2)] <- rep(c(20, 10, 5), c(3, 4, 1))
df
# col1 col2 col3
#1 20 0 10
#2 5 0 20
#3 20 0 10
#4 0 10 20
#5 21 0 10
#6 10 0 20
#7 20 40 5
#8 15 0 10
Let's take the example you gave:
df <- data.frame(Col1 = c(20, 5, 20, 0, 20, 10, 20, 15),
                 Col2 = c(0, 0, 0, 10, 0, 0, 40, 0))
colnames(df) <- c("Col 1", "Col 2")
We then can do this:
library(dplyr)

df <- df %>%
  mutate(`Col 3` = `Col 1` + `Col 2`)

col3_values <- sort(df$`Col 3`)

df <- df %>%
  mutate(`Col 3` = case_when(`Col 3` <= col3_values[[3]] ~ 20,
                             `Col 3` > col3_values[[3]] & `Col 3` <= col3_values[[7]] ~ 10,
                             TRUE ~ 5))
Output:
Col 1 Col 2 Col 3
1 20 0 10
2 5 0 20
3 20 0 10
4 0 10 20
5 20 0 10
6 10 0 20
7 20 40 5
8 15 0 10
Note that the last row isn't what you expected, because its sum (15) isn't one of the 3 smallest (there is a 5 and two 10s before it).
But as Limey commented, this won't work if you have more than 8 rows; you would have to change the bounds at which each value is assigned.
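If the row count can vary, one way to generalize is to bucket rows by the rank of their sums. A sketch only: the breaks c(0, 3, 7, n) mirror the 3/4/1 split from this example and would need adjusting to your real grouping rule.

```r
sums <- c(20, 5, 20, 10, 20, 10, 60, 15)  # Col 1 + Col 2

# bucket rows by the rank of their sum: 3 lowest -> 20, next 4 -> 10, rest -> 5
grp <- cut(rank(sums, ties.method = "first"),
           breaks = c(0, 3, 7, length(sums)),
           labels = FALSE)
col3 <- c(20, 10, 5)[grp]
col3
# 10 20 10 20 10 20 5 10
```

`ties.method = "first"` breaks ties by position, matching the behavior of the order()-based assignment above.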
I have a large dataset where I am trying to extract intervals (from the column Zone) where the Anom value is >1 for 5+ consecutive cells, and calculate the means of each interval. In the example below I would like to extract the information that Anom intervals include Zones = 5 to 11 and 17 to 26, but ignoring 28 to 29 (as the number of consecutive cells is <5). Any help is much appreciated.
df <- data.frame("Zone" = 1:30, "Anom" = 1:30)
df[,2] <- 0
df[5:11,2] <- 1
df[17:26,2] <- 1
df[28:29,2] <- 1
df
Zone Anom
1 1 0
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 0
13 13 0
14 14 0
15 15 0
16 16 0
17 17 1
18 18 1
19 19 1
20 20 1
21 21 1
22 22 1
23 23 1
24 24 1
25 25 1
26 26 1
27 27 0
28 28 1
29 29 1
30 30 0
The sort of output I would like to generate:

  Zone.From Zone.To Anom.Mean
1         5      11         1
2        17      26         1
One way, using dplyr and data.table's rleid(), is to create a new group for each change in Anom. For each group, get the first and last value of Zone, the mean of Anom, the number of rows, and the first value of Anom. We can then filter and keep only those groups with at least 5 rows where Anom is greater than 0.
library(dplyr)

df %>%
  group_by(grp = data.table::rleid(Anom)) %>%
  summarise(Zone.From = first(Zone),
            Zone.To = last(Zone),
            mean_anom = mean(Anom),
            N = n(),
            Anom = first(Anom)) %>%
  filter(Anom > 0 & N >= 5) %>%
  select(-c(grp, N, Anom))
# Zone.From Zone.To mean_anom
# <int> <int> <dbl>
#1 5 11 1
#2 17 26 1
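For reference, the same run detection can also be done in base R with `rle()`. Because Zone is just 1:30 in the toy data, the run start/end indices equal the Zone values; a sketch under that assumption:

```r
# same Anom pattern as the toy data: runs of 1s at 5:11, 17:26, 28:29
Anom <- c(rep(0, 4), rep(1, 7), rep(0, 5), rep(1, 10), 0, 1, 1, 0)

r <- rle(Anom)
ends   <- cumsum(r$lengths)              # last index of each run
starts <- ends - r$lengths + 1           # first index of each run
keep   <- r$values > 0 & r$lengths >= 5  # runs of 1s with length >= 5

data.frame(Zone.From = starts[keep], Zone.To = ends[keep])
#   Zone.From Zone.To
# 1         5      11
# 2        17      26
```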
I have two datasets like these ones:
df <- data.frame(id = 1:20,
                 Sex = rep(x = c(0, 1), each = 10),
                 age = c(25,56,29,42,33,33,33,25,25,25,26,57,30,43,34,34,34,26,26,26),
                 ov = letters[1:20])

df1 <- data.frame(Sex = c(0, 0, 0, 1, 1),
                  age = c(25, 33, 39, 41, 43))
I want to take 1 random row from df for every group of Sex and age defined by df1. However, not every age in df1 has a match in df, so for every group in df1 with no match I want to impute the value of the variable ov from the row with the same Sex and the closest age, something like this:
df3 <- rbind(df[c(8,7),2:4],c(0,39,"d"),c(1,41,"n"),df[14,2:4])
Note that the donor for the case in which Sex = 0 and age = 39 is df[4,], and the donor for the case in which Sex = 1 and age = 41 is df[14,].
How can I do this?
Using data.table you can try something like this:
1) Convert data to data.table and add keys:
library(data.table)

dt1 <- as.data.table(df1)  # convert to data.table
dt1[, newSex := Sex]       # this will serve as the grouping column
dt1[, newage := age]       # and this too
setkey(dt1, Sex, age)      # set the data.table keys
dt1
Sex age newSex newage
1: 0 25 0 25
2: 0 33 0 33
3: 0 39 0 39
4: 1 41 1 41
5: 1 43 1 43
# we do similar with df:
dt <- as.data.table(df)
setkey(dt, Sex, age)
dt
id Sex age ov
1: 1 0 25 a
2: 8 0 25 h
3: 9 0 25 i
4: 10 0 25 j
5: 3 0 29 c
6: 5 0 33 e
7: 6 0 33 f
8: 7 0 33 g
9: 4 0 42 d
10: 2 0 56 b
11: 11 1 26 k
12: 18 1 26 r
13: 19 1 26 s
14: 20 1 26 t
15: 13 1 30 m
16: 15 1 34 o
17: 16 1 34 p
18: 17 1 34 q
19: 14 1 43 n
20: 12 1 57 l
2) Using rolling merge we get dtnew with new groups:
dtnew <- dt1[dt, roll = "nearest"]
dtnew
Sex age newSex newage id ov
1: 0 25 0 25 1 a
2: 0 25 0 25 8 h
3: 0 25 0 25 9 i
4: 0 25 0 25 10 j
5: 0 29 0 25 3 c
6: 0 33 0 33 5 e
7: 0 33 0 33 6 f
8: 0 33 0 33 7 g
9: 0 42 0 39 4 d
10: 0 56 0 39 2 b
11: 1 26 1 41 11 k
12: 1 26 1 41 18 r
13: 1 26 1 41 19 s
14: 1 26 1 41 20 t
15: 1 30 1 41 13 m
16: 1 34 1 41 15 o
17: 1 34 1 41 16 p
18: 1 34 1 41 17 q
19: 1 43 1 43 14 n
20: 1 57 1 43 12 l
3) Now we can sample. In your case we can simply reorder the rows randomly and then take the first row of each group:
dtnew <- dtnew[sample(.N)] #create random order
sampleDT <- unique(dtnew, by = c("newSex", "newage")) #take first unique by newSex and newage
sampleDT
Sex age newSex newage id ov
1: 0 56 0 39 2 b
2: 0 29 0 25 3 c
3: 1 43 1 43 14 n
4: 1 34 1 41 16 p
5: 0 33 0 33 7 g
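The core of the rolling join is the "nearest" matching; the same step can be sketched in base R with `which.min(abs(...))` per value. Here only the Sex == 0 ages are shown, purely as an illustration:

```r
group_ages <- c(25, 33, 39)           # df1 ages for Sex == 0
df_ages    <- c(25, 29, 33, 42, 56)   # distinct df ages for Sex == 0

# map each df age to the nearest df1 group age (ties go to the first match)
nearest <- sapply(df_ages, function(a) group_ages[which.min(abs(group_ages - a))])
nearest
# 25 25 33 39 39
```

This reproduces the newage grouping from the rolling-join output above (29 -> 25; 42 and 56 -> 39).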
I'm cleaning up some eye-tracking data, which is, as expected, messy. I'm stuck on a preliminary step that I'll do my best to describe thoroughly. The solution is likely quite simple.
I've got two variables, one binary (x1) and the other continuous (x2), such as that created by:
dat <- data.frame(x1 = c(0,1,1,0,1,1,1,0,1,1),
                  x2 = c(22,23,44,25,36,37,28,19,30,41))
I need to create a new variable (x3) that is the cumulative sum of x2, but only for consecutive cases in which x1 is equal to 1. The end product would look like this:
dat <- data.frame(x1 = c(0,1,1,0,1,1,1,0,1,1),
                  x2 = c(22,23,44,25,36,37,28,19,30,41),
                  x3 = c(0, 23, 67, 0, 36, 73, 101, 0, 30, 71))
In other words, it's a cumsum() of x2 that "resets" after each 0 in x1.
dat$x3 <- with(dat, ave(replace(x2, x1 == 0, 0), cumsum(x1 == 0), FUN=cumsum))
dat
# x1 x2 x3
#1 0 22 0
#2 1 23 23
#3 1 44 67
#4 0 25 0
#5 1 36 36
#6 1 37 73
#7 1 28 101
#8 0 19 0
#9 1 30 30
#10 1 41 71
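The one-liner packs several steps together; decomposed, the same logic reads:

```r
x1 <- c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1)
x2 <- c(22, 23, 44, 25, 36, 37, 28, 19, 30, 41)

vals <- replace(x2, x1 == 0, 0)       # zero out x2 wherever x1 == 0
grp  <- cumsum(x1 == 0)               # a new group starts at every 0
x3   <- ave(vals, grp, FUN = cumsum)  # cumulative sum within each group
x3
# 0 23 67 0 36 73 101 0 30 71
```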
In data.table, you could group by runs of x1 (using by=rleid(x1)) and then return 0 if the group of x1 is 0, or otherwise return the cumulative sum of x2. := is used to assign the variable by reference.
library(data.table)
setDT(dat)[, x3 := if(x1[1] == 0) 0 else cumsum(x2), by=rleid(x1)]
this returns
dat
x1 x2 x3
1: 0 22 0
2: 1 23 23
3: 1 44 67
4: 0 25 0
5: 1 36 36
6: 1 37 73
7: 1 28 101
8: 0 19 0
9: 1 30 30
10: 1 41 71
Combining 2 columns into 1 column many times in a very large dataset in R
The clumsy solutions I am working on are not going to be very fast even if I can get them to work, and the true dataset is ~1500 x 45000, so they need to be fast. I am definitely at a loss for 1) at this point, although I have some code for 2) and 3).
Here is a toy example of the data structure:
n <- 10  # number of rows in the toy data
pop <- data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
                  age = round(rnorm(n, mean = 40, 10)), disType = rbinom(n, 1, .2),
                  rs123 = c(1,3,1,3,3,1,1,1,3,1), rs123.1 = rep(1, n),
                  rs157 = c(2,4,2,2,2,4,4,4,2,2), rs157.1 = c(4,4,4,2,4,4,4,4,2,2),
                  rs132 = c(4,4,4,4,4,4,4,4,2,2), rs132.1 = c(4,4,4,4,4,4,4,4,4,4))
Thus, there are a few columns of basic demographic info and then the rest of the columns are biallelic SNP info. Ex: rs123 is allele 1 of rs123 and rs123.1 is the second allele of rs123.
1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example: rs123 and rs123.1 into one column (but within the dataset):
11
31
11
31
31
11
11
11
31
11
2) I need to identify the least frequent SNP value (in the above example it is 31).
3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.
Do you mean 'merge', 'rearrange', or simply concatenate? If it is the latter, then:
R> pop2 <- data.frame(pop[,1:4], rs123=paste(pop[,5],pop[,6],sep=""),
+ rs157=paste(pop[,7],pop[,8],sep=""),
+ rs132=paste(pop[,9],pop[,10], sep=""))
R> pop2
status sex age disType rs123 rs157 rs132
1 0 0 42 0 11 24 44
2 1 1 37 0 31 44 44
3 1 0 38 0 11 24 44
4 0 1 45 0 31 22 44
5 1 1 25 0 31 24 44
6 0 1 31 0 11 44 44
7 1 0 43 0 11 44 44
8 0 0 41 0 11 44 44
9 1 1 57 0 31 22 24
10 1 1 40 0 11 22 24
and now you can do counts and whatnot on pop2:
R> sapply(pop2[,5:7], table)
$rs123
11 31
6 4
$rs157
22 24 44
3 3 4
$rs132
24 44
2 8
R>
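For steps 2) and 3), once the alleles are concatenated, `table()` plus `which.min()` identifies the least frequent genotype, and a logical comparison gives the 0/1 recoding. A sketch using the rs123 values from the example above:

```r
rs123 <- c("11", "31", "11", "31", "31", "11", "11", "11", "31", "11")

tab  <- table(rs123)                 # genotype counts: 11 -> 6, 31 -> 4
rare <- names(tab)[which.min(tab)]   # least frequent genotype ("31")
as.integer(rs123 == rare)            # 1 for the rare genotype, 0 otherwise
# 0 1 0 1 1 0 0 0 1 0
```

The same two lines can be wrapped in a function and applied across all the concatenated SNP columns of pop2.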