I am new to R and come from Stata.
In Stata I used the foreach command to loop over multiple variables. For example, I could convert several numeric variables to factors.
In Stata, first, define the label:
label define NoYes 0 "No" 1 "Yes"
Then, apply the loop command:
foreach x in var1 var2 var3 {
label values `x' NoYes
}
I am trying to figure out how to do the same in R; any help would be appreciated.
In base R we can use lapply.
dat[c(1, 3)] <- lapply(dat[c(1, 3)], factor, levels=0:1, labels=c('No', 'Yes'))
dat
# X1 X2 X3 X4 X5
# 1 No 1 No 1 0
# 2 No 1 Yes 1 1
# 3 No 0 No 0 0
# 4 No 1 No 0 0
# 5 Yes 0 Yes 0 0
# 6 Yes 1 Yes 0 0
To avoid confusion, I generally recommend not using too many fancy packages while you're new to R.
The literal translation could look like this (reload dat before trying):
for (x in c('X1', 'X3')) {
dat[[x]] <- factor(dat[[x]], levels=0:1, labels=c('No', 'Yes'))
}
dat
# X1 X2 X3 X4 X5
# 1 No 1 No 1 0
# 2 No 1 Yes 1 1
# 3 No 0 No 0 0
# 4 No 1 No 0 0
# 5 Yes 0 Yes 0 0
# 6 Yes 1 Yes 0 0
Data:
dat <- structure(list(X1 = c(0L, 0L, 0L, 0L, 1L, 1L), X2 = c(1L, 1L,
0L, 1L, 0L, 1L), X3 = c(0L, 1L, 0L, 0L, 1L, 1L), X4 = c(1L, 1L,
0L, 0L, 0L, 0L), X5 = c(0L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
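If you want something closer to Stata's two-step workflow (define a label set once, then apply it to several variables), a small helper can wrap the lapply call. This is only a sketch: no_yes and apply_labels are hypothetical names made up for illustration, not part of base R, and the small data frame is a shortened stand-in for dat.

```r
# Shortened stand-in for the example data
dat <- data.frame(X1 = c(0L, 0L, 1L), X2 = c(1L, 0L, 1L), X3 = c(0L, 1L, 1L))

# Hypothetical analogue of Stata's `label define NoYes 0 "No" 1 "Yes"`:
# names are the labels, values are the codes
no_yes <- c(No = 0, Yes = 1)

# Hypothetical helper: convert the named columns to factors with these labels
apply_labels <- function(data, vars, labels) {
  data[vars] <- lapply(data[vars], factor,
                       levels = unname(labels), labels = names(labels))
  data
}

labelled <- apply_labels(dat, c("X1", "X3"), no_yes)
str(labelled)
```

Columns that are not listed (X2 here) are left untouched, so the same label set can be reused on other data frames.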
Example data:
library(dplyr)
df <- data.frame(var1 = c(1, 1, 0, 0),
var2 = c(0, 0, 1, 1),
var3 = c(0, 1, 0, 1))
There may be several alternatives, and across() from dplyr is one of them.
new_df <- df %>% mutate(across(var1:var3, ~ factor(.x, levels = c(0, 1), labels=c("No", "Yes"))))
new_df
var1 var2 var3
1 Yes No No
2 Yes No Yes
3 No Yes No
4 No Yes Yes
You do not really need levels = c(0, 1) here, but I would always do it in real data just to be safe.
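To illustrate why spelling out the levels is the safer habit, here is a small base R sketch with a made-up column that happens to contain only 1s. Without levels, factor() derives the levels from the values that are present, so the two labels no longer line up.

```r
x <- c(1, 1, 1)  # made-up column in which the value 0 never occurs

# Explicit levels: 1 is still mapped to "Yes", and "No" is kept as an empty level
safe <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))

# No levels: only one level (1) is found, so two labels cannot be matched
risky <- tryCatch(
  factor(x, labels = c("No", "Yes")),
  error = function(e) conditionMessage(e)
)

safe
risky
```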
Related
I need to perform what appears to be a union on two tables in R. However, the union needs to include columns that are not common to the two parent matrices / tables.
This scenario looks very similar to the Outer Union described here: https://cs.stackexchange.com/questions/6997/what-is-outer-union-and-why-is-it-partially-compatible
I have two Matrices:
Matrix 1
Name Var1 Var2
1 1 0
2 1 0
Matrix 2
Name Var1 Var3
3 0 1
4 0 1
That I need to combine into Matrix 3:
Name Var1 Var2 Var3
1 1 0 0
2 1 0 0
3 0 0 1
4 0 0 1
A base R solution using merge:
M <- as.matrix(merge(data.frame(M1), data.frame(M2), all = TRUE))
# merge() leaves NA where a column is missing from one input; set those to 0
M[is.na(M)] <- 0
such that
> M
Name Var1 Var2 Var3
[1,] 1 1 0 0
[2,] 2 1 0 0
[3,] 3 0 0 1
[4,] 4 0 0 1
DATA
M1 <- structure(c(1L, 2L, 1L, 1L, 0L, 0L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var2")))
M2 <- structure(c(3L, 4L, 0L, 0L, 1L, 1L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var3")))
We can convert to data.frame and use bind_rows. By default, it fills the missing values with NA
library(dplyr)
library(tidyr)
bind_rows(as.data.frame(m1), as.data.frame(m2)) %>%
mutate_all(replace_na, 0) %>%
as.matrix
# Name Var1 Var2 Var3
#[1,] 1 1 0 0
#[2,] 2 1 0 0
#[3,] 3 0 0 1
#[4,] 4 0 0 1
Or, as @markus mentioned, rbind.fill.matrix from plyr would be useful. Like bind_rows, it fills the missing entries with NA.
plyr::rbind.fill.matrix(m1, m2)
data
m1 <- structure(c(1L, 2L, 1L, 1L, 0L, 0L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var2")))
m2 <- structure(c(3L, 4L, 0L, 0L, 1L, 1L), .Dim = 2:3, .Dimnames = list(
NULL, c("Name", "Var1", "Var3")))
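For comparison, the same outer union can be sketched in base R without any join: pad each matrix to the union of the column names, then stack the results. The function name outer_union and its fill argument are made up for illustration. Unlike merge, this never combines rows that happen to share key values; it only stacks them.

```r
M1 <- matrix(c(1, 2, 1, 1, 0, 0), nrow = 2,
             dimnames = list(NULL, c("Name", "Var1", "Var2")))
M2 <- matrix(c(3, 4, 0, 0, 1, 1), nrow = 2,
             dimnames = list(NULL, c("Name", "Var1", "Var3")))

# Hypothetical helper: row-bind two matrices, filling absent columns with `fill`
outer_union <- function(a, b, fill = 0) {
  cols <- union(colnames(a), colnames(b))
  pad <- function(m) {
    out <- matrix(fill, nrow = nrow(m), ncol = length(cols),
                  dimnames = list(NULL, cols))
    out[, colnames(m)] <- m  # copy the columns that exist in m
    out
  }
  rbind(pad(a), pad(b))
}

M3 <- outer_union(M1, M2)
M3
```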
I am trying to recode an original table with SNP IDs in rows and sample IDs in columns.
So far, I have only managed to convert the data into presence/absence coded as 0 and 1.
I tried some simple code for the further conversion but cannot find anything that does what I want.
The original table looks like this:
snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
A_001 0 1 1 1 0 0 1 0
A_001 0 0 1 0 1 0 1 1
A_002 1 1 0 1 1 1 0 0
A_002 0 1 1 0 1 0 1 1
A_003 1 0 0 1 0 1 1 0
A_003 1 1 0 1 1 0 0 1
A_004 0 0 1 0 0 1 0 0
A_004 1 0 0 1 0 1 1 0
I would like to recode each pair of scores as 0/0 = NA, 0/1 = 0, 1/1 = 2, 1/0 = 1, so that the result looks something like this:
snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
A_001 NA 1 2 1 0 NA 2 0
A_002 1 2 0 1 2 1 0 0
A_003 2 0 NA 2 0 1 1 0
A_004 0 NA 1 0 NA 2 0 NA
This is just an example. The full data set has ~96000 SNP IDs and ~500 sample ID columns.
Any help with writing this code would be really appreciated.
Here are a few dplyr-based examples that each work in a single pipe and get the same output. The main first step is to group by your ID, then collapse all the columns with a /. Then you can use mutate_at to select all columns that start with Cal_—this may be useful if you have other columns besides the ID that you don't want to do this operation on.
First method is a case_when:
library(dplyr)
dat %>%
group_by(snpID) %>%
summarise_all(paste, collapse = "/") %>%
mutate_at(vars(starts_with("Cal_")), ~case_when(
. == "0/1" ~ 0,
. == "1/1" ~ 2,
. == "1/0" ~ 1,
TRUE ~ NA_real_
))
#> # A tibble: 4 x 9
#> snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A_001 NA 1 2 1 0 NA 2 0
#> 2 A_002 1 2 0 1 2 1 0 0
#> 3 A_003 2 0 NA 2 0 1 1 0
#> 4 A_004 0 NA 1 0 NA 2 0 NA
However, (in my opinion) case_when is a little tricky to read, and this doesn't showcase its real power, which is doing if/else checks on multiple variables. Better suited to checks on one variable at a time is dplyr::recode:
dat %>%
group_by(snpID) %>%
summarise_all(paste, collapse = "/") %>%
mutate_at(vars(starts_with("Cal_")),
~recode(.,
"0/1" = 0,
"1/1" = 2,
"1/0" = 1,
"0/0" = NA_real_))
# same output as above
Or, for more flexibility & readability, create a small lookup object. That way, you can reuse the recode logic and change it easily. recode takes a set of named arguments; using tidyeval, you can pass in a named vector and unquote-splice it with !!! (there's a similar example in the recode docs):
lookup <- c("0/1" = 0, "1/1" = 2, "1/0" = 1, "0/0" = NA_real_)
dat %>%
group_by(snpID) %>%
summarise_all(paste, collapse = "/") %>%
mutate_at(vars(starts_with("Cal_")), recode, !!!lookup)
# same output
You might use aggregate to concatenate the values for each snpID and then replace the values according to your needs with the help of case_when from dplyr.
(out <- aggregate(.~ snpID, dat, toString))
# snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
#1 A_001 0, 0 1, 0 1, 1 1, 0 0, 1 0, 0 1, 1 0, 1
#2 A_002 1, 0 1, 1 0, 1 1, 0 1, 1 1, 0 0, 1 0, 1
#3 A_003 1, 1 0, 1 0, 0 1, 1 0, 1 1, 0 1, 0 0, 1
#4 A_004 0, 1 0, 0 1, 0 0, 1 0, 0 1, 1 0, 1 0, 0
Now recode the columns
library(dplyr)
out[-1] <- case_when(out[-1] == "0, 0" ~ NA_integer_,
out[-1] == "0, 1" ~ 0L,
out[-1] == "1, 0" ~ 1L,
TRUE ~ 2L)
Result
out
# snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
#1 A_001 NA 1 2 1 0 NA 2 0
#2 A_002 1 2 0 1 2 1 0 0
#3 A_003 2 0 NA 2 0 1 1 0
#4 A_004 0 NA 1 0 NA 2 0 NA
data
dat <- structure(list(snpID = c("A_001", "A_001", "A_002", "A_002",
"A_003", "A_003", "A_004", "A_004"), Cal_X1 = c(0L, 0L, 1L, 0L,
1L, 1L, 0L, 1L), Cal_X2 = c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
Cal_X3 = c(1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L), Cal_X4 = c(1L,
0L, 1L, 0L, 1L, 1L, 0L, 1L), Cal_X5 = c(0L, 1L, 1L, 1L, 0L,
1L, 0L, 0L), Cal_X6 = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L),
Cal_X7 = c(1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L), Cal_X8 = c(0L,
1L, 0L, 1L, 0L, 1L, 0L, 0L)), .Names = c("snpID", "Cal_X1",
"Cal_X2", "Cal_X3", "Cal_X4", "Cal_X5", "Cal_X6", "Cal_X7", "Cal_X8"
), class = "data.frame", row.names = c(NA, -8L))
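With ~96000 SNPs and ~500 samples, pasting strings for every cell may get slow, so here is a purely numeric base R sketch. It rests on an assumption that is not guaranteed by the question: every snpID has exactly two rows, stored consecutively, with the first row giving the first score of the pair. The small data frame is a cut-down version of dat.

```r
# Cut-down version of the example data: two SNPs, two samples
dat <- data.frame(snpID  = c("A_001", "A_001", "A_002", "A_002"),
                  Cal_X1 = c(0L, 0L, 1L, 0L),
                  Cal_X2 = c(1L, 0L, 1L, 1L))

m <- as.matrix(dat[-1])
odd  <- seq(1, nrow(m), by = 2)
even <- seq(2, nrow(m), by = 2)
first  <- m[odd,  , drop = FALSE]  # first row of each pair (assumed layout)
second <- m[even, , drop = FALSE]  # second row of each pair

# Index 2 * first + second + 1 runs over 0/0, 0/1, 1/0, 1/1 in that order
lookup <- c(NA, 0, 1, 2)
coded <- first
coded[] <- lookup[2 * first + second + 1]

res <- data.frame(snpID = dat$snpID[odd], coded)
res
```

If the rows are not already paired up like this, sort by snpID first or fall back to one of the grouped approaches above.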
I have the following dataframe in R:
data=
Time X1 X2 X3
1 1 0 0
2 1 1 1
3 0 0 1
4 1 1 1
5 0 0 0
6 0 1 1
7 1 1 1
8 0 0 0
9 1 1 1
10 0 0 0
Is there a way to programmatically select those rows that are equal to (0,1,1)? I know it can be done by doing data[data$X1 == 0 & data$X2 == 1 & data$X3 == 1,] but, in my scenario, (0,1,1) is a list in a variable. My ultimate goal here is to determine the number of rows that are equal to (0,1,1), or any other combination that list variable can hold.
Thanks!
Mariano.
Here's a couple of options using a merge:
merge(list(X1=0,X2=1,X3=1), dat)
#or
merge(setNames(list(0,1,1),c("X1","X2","X3")), dat)
Or even using positional indexes based on what columns you want matched up:
L <- list(0,1,1)
merge(L, dat, by.x=seq_along(L), by.y=2:4)
All of which return:
# X1 X2 X3 Time
#1 0 1 1 6
If your matching variables are all of the same type, you could also safely do it via matrix comparison like:
dat[colSums(t(dat[c("X1","X2","X3")]) == c(0,1,1)) == 3,]
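apply(data[c("X1", "X2", "X3")], 1, function(x) all(x == c(0, 1, 1)))
This goes down each row of the frame and returns TRUE for each row whose X1, X2, X3 values equal c(0,1,1). The Time column is excluded so that only the three matching columns are compared.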
this is your data
mydf <- structure(list(Time = 1:10, X1 = c(1L, 1L, 0L, 1L, 0L, 0L, 1L,
0L, 1L, 0L), X2 = c(0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L),
X3 = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L)), .Names = c("Time",
"X1", "X2", "X3"), class = "data.frame", row.names = c(NA, -10L
))
Using subset
subset(mydf, X1 == 0 & X2==1 & X3==1)
# Time X1 X2 X3
#6 6 0 1 1
another way
mydf[mydf$X1 ==0 & mydf$X2 ==1 & mydf$X3 ==1, ]
# Time X1 X2 X3
#6 6 0 1 1
or like this
mydf[mydf$X1 %in% 0 & mydf$X2 %in% 1 & mydf$X3 %in% 1, ]
# Time X1 X2 X3
#6 6 0 1 1
you can also do that by
library(dplyr)
filter(mydf, X1==0 & X2==1 & X3==1)
# Time X1 X2 X3
#1 6 0 1 1
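Since the question says the target combination lives in a list variable and the real goal is a count, here is a base R sketch along those lines. The names target and match_rows are made up for illustration; the data frame repeats the example data.

```r
mydf <- data.frame(
  Time = 1:10,
  X1 = c(1, 1, 0, 1, 0, 0, 1, 0, 1, 0),
  X2 = c(0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
  X3 = c(0, 1, 1, 1, 0, 1, 1, 0, 1, 0)
)

# The combination to look for, held in a variable as in the question
target <- list(X1 = 0, X2 = 1, X3 = 1)

# Hypothetical helper: TRUE for rows where every named column equals its target
match_rows <- function(data, target) {
  per_column <- Map(function(col, val) data[[col]] == val,
                    names(target), target)
  Reduce(`&`, per_column)
}

hits <- match_rows(mydf, target)
n_hits <- sum(hits)  # number of matching rows
mydf[hits, ]

# Counts for every combination that occurs, keyed by the pasted values
combo_counts <- table(do.call(paste, mydf[names(target)]))
combo_counts
```

Changing target to another combination, or to a subset of the columns, needs no other change to the code.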
If I have a data set laid out like:
Cohort Food1 Food2 Food3 Food4
--------------------------------
Group 1 1 2 3
A 1 1 0 1
B 0 0 1 0
C 1 1 0 1
D 0 0 0 1
I want to sum each row, where I can define food groups into different categories. So I would like to use the Group row as the defining vector.
Which would mean that food1 and food2 are in group 1, food3 is in group 2, food 4 is in group 3.
Ideal output something like:
Cohort Group1 Group2 Group3
A 2 0 1
B 0 1 0
C 2 0 1
D 0 0 1
I tried rowsum()-based approaches with no luck. Do I need to use ddply() instead?
Example data from comment:
dat <-
structure(list(species = c("group", "princeps", "bougainvillei",
"hombroni", "lindsayi", "concretus", "galatea", "ellioti", "carolinae",
"hydrocharis"), locust = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L), grasshopper = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L),
snake = c(2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), fish = c(2L,
1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L), frog = c(2L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L), toad = c(2L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L), fruit = c(3L, 0L, 0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L), seed = c(3L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L)), .Names = c("species", "locust", "grasshopper", "snake",
"fish", "frog", "toad", "fruit", "seed"), class = "data.frame", row.names = c(NA,
-10L))
There are most likely more direct approaches, but here is one you can try:
First, create a copy of your data minus the second header row.
dat2 <- dat[-1, ]
melt() and dcast() and so on from the "reshape2" package don't work nicely with duplicated column names, so let's make the column names more "reshape2 appropriate".
Seq <- ave(as.vector(unlist(dat[1, -1])),
as.vector(unlist(dat[1, -1])),
FUN = seq_along)
names(dat2)[-1] <- paste("group", dat[1, 2:ncol(dat)],
".", Seq, sep = "")
melt() the dataset (melt(), colsplit() and dcast() come from reshape2, so load it first)
library(reshape2)
m.dat2 <- melt(dat2, id.vars="species")
Use the colsplit() function to split the columns correctly.
m.dat2 <- cbind(m.dat2[-2],
colsplit(m.dat2$variable, "\\.",
c("group", "time")))
head(m.dat2)
# species value group time
# 1 princeps 0 group1 1
# 2 bougainvillei 0 group1 1
# 3 hombroni 1 group1 1
# 4 lindsayi 0 group1 1
# 5 concretus 0 group1 1
# 6 galatea 0 group1 1
Proceed with dcast() as usual
dcast(m.dat2, species ~ group, sum)
# species group1 group2 group3
# 1 bougainvillei 0 0 0
# 2 carolinae 1 1 0
# 3 concretus 0 2 2
# 4 ellioti 0 1 0
# 5 galatea 1 1 1
# 6 hombroni 2 1 0
# 7 hydrocharis 0 0 0
# 8 lindsayi 0 1 0
# 9 princeps 0 1 0
Note: Edited because original answer was incorrect.
Update: An easier way in base R
This problem is much more easily solved if you start by transposing your data.
dat3 <- t(dat[-1, -1])
dat3 <- as.data.frame(dat3)
names(dat3) <- dat[[1]][-1]
t(do.call(rbind, lapply(split(dat3, as.numeric(dat[1, -1])), colSums)))
# 1 2 3
# princeps 0 1 0
# bougainvillei 0 0 0
# hombroni 2 1 0
# lindsayi 0 1 0
# concretus 0 2 2
# galatea 1 1 1
# ellioti 0 1 0
# carolinae 1 1 0
# hydrocharis 0 0 0
You can do this using base R fairly easily. Here's an example.
First, figure out which animals belong in which group:
groupings <- as.data.frame(table(as.numeric(dat[1,2:9]),names(dat)[2:9]))
attach(groupings)
grp1 <- groupings[Freq==1 & Var1==1,2]
grp2 <- groupings[Freq==1 & Var1==2,2]
grp3 <- groupings[Freq==1 & Var1==3,2]
detach(groupings)
Then, use the groups to do a rowSums() on the correct columns.
dat <- cbind(dat,rowSums(dat[as.character(grp1)]))
dat <- cbind(dat,rowSums(dat[as.character(grp2)]))
dat <- cbind(dat,rowSums(dat[as.character(grp3)]))
Delete the initial row and the intermediate columns:
dat <- dat[-1,-c(2:9)]
Then, just reset the row names and rename the columns:
row.names(dat) <- NULL
names(dat) <- c("species","group_1","group_2","group_3")
And you ultimately get:
species group_1 group_2 group_3
bougainvillei 0 0 0
carolinae 1 1 0
concretus 0 2 2
ellioti 0 1 0
galatea 1 1 1
hombroni 2 1 0
hydrocharis 0 0 0
lindsayi 0 1 0
princeps 0 1 0
EDITED: Changed sort order to alphabetical, like other answer.
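Since the question mentions rowsum(), here is a sketch of how it can be made to work. rowsum() adds up rows by group, so the trick is to transpose first, making the foods the rows, and transpose back afterwards. The data frame below is a made-up reconstruction of the small Cohort/Food table from the question, with the Group row kept as the first row.

```r
# Reconstruction of the question's table; row 1 holds the group of each food
dat <- data.frame(Cohort = c("Group", "A", "B", "C", "D"),
                  Food1 = c(1, 1, 0, 1, 0),
                  Food2 = c(1, 1, 0, 1, 0),
                  Food3 = c(2, 0, 1, 0, 0),
                  Food4 = c(3, 1, 0, 1, 1))

groups <- unname(unlist(dat[1, -1]))  # group of each food column
values <- as.matrix(dat[-1, -1])      # the 0/1 data without the Group row

# Sum the food rows of the transposed matrix by group, then flip back
sums <- t(rowsum(t(values), group = groups))
colnames(sums) <- paste0("Group", colnames(sums))

res <- data.frame(Cohort = dat$Cohort[-1], sums)
rownames(res) <- NULL
res
```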
What is the best way to determine a factor or create a new category field based on a number of boolean fields? In this example, I need to count the number of unique combinations of medications.
> MultPsychMeds
ID OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE
1 A 1 1 0 0
2 B 1 0 1 0
3 C 1 0 1 0
4 D 1 0 1 0
5 E 1 0 0 1
6 F 1 0 0 1
7 G 1 0 0 1
8 H 1 0 0 1
9 I 0 1 1 0
10 J 0 1 1 0
Perhaps another way to state it is that I need to pivot or cross tabulate the pairs. The final results need to look something like:
Combination Count
OLANZAPINE/HALOPERIDOL 1
OLANZAPINE/QUETIAPINE 3
OLANZAPINE/RISPERIDONE 4
HALOPERIDOL/QUETIAPINE 2
This data frame can be replicated in R with:
MultPsychMeds <- structure(list(ID = structure(1:10, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J"), class = "factor"), OLANZAPINE = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L), HALOPERIDOL = c(1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), QUETIAPINE = c(0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 1L, 1L), RISPERIDONE = c(0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("ID", "OLANZAPINE", "HALOPERIDOL",
"QUETIAPINE", "RISPERIDONE"), class = "data.frame", row.names = c(NA,
-10L))
Here's one approach using the reshape and plyr packages:
library(reshape)
library(plyr)
#Melt into long format
dat.m <- melt(MultPsychMeds, id.vars = "ID")
#Group at the ID level and paste the drugs together with "/"
out <- ddply(dat.m, "ID", summarize, combos = paste(variable[value == 1], collapse = "/"))
#Calculate a table
with(out, count(combos))
x freq
1 HALOPERIDOL/QUETIAPINE 2
2 OLANZAPINE/HALOPERIDOL 1
3 OLANZAPINE/QUETIAPINE 3
4 OLANZAPINE/RISPERIDONE 4
Just for fun, a base R solution (that can be turned into a oneliner :-) ):
data.frame(table(apply(MultPsychMeds[,-1], 1, function(currow){
wc<-which(currow==1)
paste(colnames(MultPsychMeds)[wc+1], collapse="/")
})))
Another way could be:
subset(
as.data.frame(
with(MultPsychMeds, table(OLANZAPINE, HALOPERIDOL, QUETIAPINE, RISPERIDONE)),
responseName="count"
),
count>0
)
which gives
OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE count
4 1 1 0 0 1
6 1 0 1 0 3
7 0 1 1 0 2
10 1 0 0 1 4
It's not exactly the format you asked for, but it is fast and simple.
There is a shorthand in the plyr package:
require(plyr)
count(MultPsychMeds, c("OLANZAPINE", "HALOPERIDOL", "QUETIAPINE", "RISPERIDONE"))
# OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE freq
# 1 0 1 1 0 2
# 2 1 0 0 1 4
# 3 1 0 1 0 3
# 4 1 1 0 0 1
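If you need the Combination/Count layout from the question, a count table like the one above can be relabelled in base R. This is only a sketch: the counts data frame is typed in by hand from the output shown above, so that the example runs without plyr.

```r
# Count table copied by hand from the plyr::count output above
counts <- data.frame(OLANZAPINE  = c(0, 1, 1, 1),
                     HALOPERIDOL = c(1, 0, 0, 1),
                     QUETIAPINE  = c(1, 0, 1, 0),
                     RISPERIDONE = c(0, 1, 0, 0),
                     freq        = c(2, 4, 3, 1))

drugs <- setdiff(names(counts), "freq")

# For each row, paste together the names of the drugs flagged with 1
counts$Combination <- apply(counts[drugs] == 1, 1,
                            function(taken) paste(drugs[taken], collapse = "/"))

res <- counts[c("Combination", "freq")]
names(res)[2] <- "Count"
res
```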