I want to create a new variable called N1 based on three existing variables (resp, exp.1, exp.2) in R.
df <- data.frame(
resp = c(1, 2, 4, 3, 5, 7 ),
exp.1 = c(0, 0.24, 1, 1.5, 0, 0.4),
exp.2 = c(1, 1, 0, 0, 0.3, 0.2)
)
df
resp exp.1 exp.2
1 1 0 1
2 2 0.24 1
3 4 1 0
4 3 1.5 0
5 5 0 0.3
6 7 0.4 0.2
I want to make a new variable N1 like this:
when resp > 4, take the value from exp.1
when resp < 4, take the value from exp.2
when resp == 4, set it to missing (NA).
The desired outcome is:
df resp exp.1 exp.2 N1
1 1 0 1 1
2 2 0.24 1 1
3 4 1 0 NA
4 3 1.5 0 0
5 5 0 0.3 0
6 7 0.4 0.2 0.4
I tried my best using mutate() or car::recode() but it does not work. Any clues?
Using case_when,
library(dplyr)
df %>%
mutate(N1 = case_when(
resp>4 ~ exp.1,
resp<4 ~ exp.2,
resp == 4 ~ NA_real_
))
resp exp.1 exp.2 N1
1 1 0.00 1.0 1.0
2 2 0.24 1.0 1.0
3 4 1.00 0.0 NA
4 3 1.50 0.0 0.0
5 5 0.00 0.3 0.0
6 7 0.40 0.2 0.4
Edit: Using case_when(), as given in the solution above, might be better.
library(dplyr)
# #Data
df <- data.frame(
resp = c(1, 2, 4, 3, 5, 7 ),
exp.1 = c(0, 0.24, 1, 1.5, 0, 0.4),
exp.2 = c(1, 1, 0, 0, 0.3, 0.2)
)
df %>%
rowwise() %>%
mutate(N1 = if (resp >4) {
exp.1
} else if (resp <4) {
exp.2
} else if (resp ==4) {
NA
} else {
NA
}
)
## A tibble: 6 x 4
## Rowwise:
# resp exp.1 exp.2 N1
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 1 1
#2 2 0.24 1 1
#3 4 1 0 NA
#4 3 1.5 0 0
#5 5 0 0.3 0
#6 7 0.4 0.2 0.4
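For comparison, the same rule can be sketched in base R with nested ifelse() calls, which are vectorised and need no packages. This is only an illustration of the three conditions from the question, not a replacement for the answers above.

```r
df <- data.frame(
  resp  = c(1, 2, 4, 3, 5, 7),
  exp.1 = c(0, 0.24, 1, 1.5, 0, 0.4),
  exp.2 = c(1, 1, 0, 0, 0.3, 0.2)
)

# resp > 4 takes exp.1, resp < 4 takes exp.2, anything else (resp == 4) is NA
df$N1 <- ifelse(df$resp > 4, df$exp.1,
                ifelse(df$resp < 4, df$exp.2, NA_real_))
df
```

Because every branch is numeric, N1 stays a double column and the row with resp == 4 ends up as NA.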
Suppose I've got this data simulated from the below R code:
library(RNGforGPD)
set.seed(1)
sample.size = 10; no.gpois = 3
lambda.vec = c(-0.2, 0.2, -0.3); theta.vec = c(1, 3, 4)
M = c(0.352, 0.265, 0.342); N = diag(3); N[lower.tri(N)] = M
TV = N + t(N); diag(TV) = 1
cstar = CmatStarGpois(TV, theta.vec, lambda.vec, verbose = TRUE)
data = GenMVGpois(sample.size, no.gpois, cstar, theta.vec, lambda.vec, details = FALSE)
> prop.table(table(data[,1]))
0 1 2
0.3 0.4 0.3
> prop.table(table(data[,2]))
2 3 6 8 10
0.2 0.4 0.1 0.2 0.1
> prop.table(table(data[,3]))
2 3 4 5 6
0.2 0.3 0.1 0.3 0.1
> table(data)
data
0 1 2 3 4 5 6 8 10
3 4 7 7 1 3 2 2 1
I'd like to create a proportion matrix for the three categorical variables, with one row per category and one column per variable. If a category does not occur in a given column, its proportion should be 0.
Cat X1 X2 X3
0 0.3 0.0 0.0
1 0.4 0.0 0.0
2 0.3 0.2 0.2
3 0.0 0.4 0.3
4 0.0 0.0 0.1
5 0.0 0.0 0.3
6 0.0 0.1 0.1
8 0.0 0.2 0.0
10 0.0 0.1 0.0
This is the data-object:
dput(data)
structure(c(1, 0, 2, 1, 0, 0, 1, 2, 2, 1, 3, 8, 3, 3, 2, 2, 6,
3, 10, 8, 2, 5, 2, 6, 3, 3, 4, 3, 5, 5), .Dim = c(10L, 3L), .Dimnames = list(
NULL, NULL))
The reasoning is given as comments at the relevant points in the code.
props <- data.frame(Cat = sort(unique(c(data))) ) # Just the Cat column
#Now fill in the entries
# the entries will be obtained with table function
apply(data, 2, table) # run `table(.)` over the columns individually
[[1]]
0 1 2 # these are actually character valued names
3 4 3 # while these are the count values
[[2]]
2 3 6 8 10
2 4 1 2 1
[[3]]
2 3 4 5 6
2 3 1 3 1
Now iterate over that list to fill in values that match the Cat column:
props2 <- cbind(props, # using dfrm first argument returns dataframe object
lapply( apply(data, 2, table) , # irregular results are a list
function(col) { # first make a named vector of zeros
x <- setNames(rep(0,length(props$Cat)), props$Cat)
# could have skipped that step by using `tabulate`
# then fill with values using names as indices
x[names(col)] <- col # values to matching names
x}) )
props2
#-------------
Cat V1 V2 V3
0 0 3 0 0
1 1 4 0 0
2 2 3 2 2
3 3 0 4 3
4 4 0 0 1
5 5 0 0 3
6 6 0 1 1
8 8 0 2 0
10 10 0 1 0
#---
# now just "proportionalize" those counts
props2[2:4] <- prop.table(data.matrix(props2[2:4]), margin=2)
props2
#-------------
Cat V1 V2 V3
0 0 0.3 0.0 0.0
1 1 0.4 0.0 0.0
2 2 0.3 0.2 0.2
3 3 0.0 0.4 0.3
4 4 0.0 0.0 0.1
5 5 0.0 0.0 0.3
6 6 0.0 0.1 0.1
8 8 0.0 0.2 0.0
10 10 0.0 0.1 0.0
library(dplyr)
library(tidyr)
colnames(data) <- c("X1", "X2", "X3")
as_tibble(data) %>%
pivot_longer(cols = "X1":"X3", values_to = "Cat") %>%
count(name, Cat) %>%
group_by(name) %>%
# proportion of each category within its own column
mutate(proportion = n / sum(n)) %>%
ungroup() %>%
select(-n) %>%
pivot_wider(names_from = name, values_from = proportion) %>%
arrange(Cat) %>%
replace(is.na(.), 0)
# A tibble: 9 × 4
Cat X1 X2 X3
<dbl> <dbl> <dbl> <dbl>
1 0 0.3 0 0
2 1 0.4 0 0
3 2 0.3 0.2 0.2
4 3 0 0.4 0.3
5 4 0 0 0.1
6 5 0 0 0.3
7 6 0 0.1 0.1
8 8 0 0.2 0
9 10 0 0.1 0
If you would like it as a matrix, you can use as.matrix()
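If a plain matrix is the goal, one way to get there directly is to tabulate each column against a shared set of factor levels, so that categories missing from a column are counted as 0. The sketch below uses base R only and rebuilds `data` from the dput() output in the question; the names `lv` and `props` are made up for the illustration.

```r
data <- matrix(c(1, 0, 2, 1, 0, 0, 1, 2, 2, 1,
                 3, 8, 3, 3, 2, 2, 6, 3, 10, 8,
                 2, 5, 2, 6, 3, 3, 4, 3, 5, 5),
               nrow = 10, ncol = 3)

# every category that occurs anywhere in the matrix
lv <- sort(unique(c(data)))

# one column of proportions per variable; absent categories get 0
props <- sapply(seq_len(ncol(data)), function(j) {
  prop.table(table(factor(data[, j], levels = lv)))
})
rownames(props) <- lv
colnames(props) <- c("X1", "X2", "X3")
props
```

Each column of `props` sums to 1, and the row names carry the category values.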
I have a dataframe that looks like this:
a b c d e
1 0 0 1 1
.5 1 1 0 1
1 1. .5 .5. 0
0 0 1 NA 1
0 1 0 1 .5
I am looking for an output like:
col val count
a 1 2
.5 1
0 2
b 1 3
0 2
c 1 2
.5 1
0 2
d 1 2
.5 1
0 1
NA 1
e 1 3
.5 1
0 1
I have tried using
data %>%
summarize_at(colnames(data), n(), na.rm = TRUE)
but this doesn't give me what I want. Any suggestions greatly appreciated, thank you!
I've assumed column d row 3 is a typo and .5. really is 0.5, in which case you could do the following:
library(tidyr)
library(dplyr)
df %>%
pivot_longer(everything()) %>%
group_by(name, value) %>%
summarise(count = n()) %>%
arrange(name, desc(value))
# or more succinctly as pointed out by @LMc
df %>%
pivot_longer(everything()) %>%
count(name, value, name = "count") %>%
arrange(name, desc(value))
#> # A tibble: 15 x 3
#> name value count
#> <chr> <dbl> <int>
#> 1 a 1 2
#> 2 a 0.5 1
#> 3 a 0 2
#> 4 b 1 3
#> 5 b 0 2
#> 6 c 1 2
#> 7 c 0.5 1
#> 8 c 0 2
#> 9 d 1 2
#> 10 d 0.5 1
#> 11 d 0 1
#> 12 d NA 1
#> 13 e 1 3
#> 14 e 0.5 1
#> 15 e 0 1
data
df <- structure(list(a = c(1, 0.5, 1, 0, 0), b = c(0, 1, 1, 0, 1),
c = c(0, 1, 0.5, 1, 0), d = c(1, 0, 0.5, NA, 1),
e = c(1, 1, 0, 1, 0.5)), class = "data.frame", row.names = c(NA,
-5L))
Created on 2021-04-13 by the reprex package (v2.0.0)
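The same counts can also be sketched without any packages by tabulating each column separately and stacking the results. This assumes the `df` from the data section above; `counts` is just an illustrative name, and useNA = "ifany" is what keeps the NA in column d.

```r
df <- data.frame(a = c(1, 0.5, 1, 0, 0), b = c(0, 1, 1, 0, 1),
                 c = c(0, 1, 0.5, 1, 0), d = c(1, 0, 0.5, NA, 1),
                 e = c(1, 1, 0, 1, 0.5))

# one small data frame of value counts per column, then bind them together
counts <- do.call(rbind, lapply(names(df), function(nm) {
  tab <- table(df[[nm]], useNA = "ifany")
  data.frame(col = nm,
             val = as.numeric(names(tab)),
             count = as.vector(tab))
}))
counts
```

Unlike the pipeline above, the values within each column come out in ascending order, with NA last.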
For some of you, this could be an easy exercise. Please see below the dataset that I am working with:
d1t1 d1t2 d1t3 d1t4 d2t1 d2t2 d2t3 d2t4
1 1 1 2 1 1 1 2
2 2 0 5 1 2 0 2
1 2 0 7 1 2 1 2
1 1 0 7 1 2 1 2
A short explanation of the variables:
d1t1=Day 1 time 1
d1t2=Day 1 time 2
....
d2t1=Day2 time 1
d2t2=Day2 time 2
0,1,2,5,7 = different types of measurements
I would like to calculate, for each day, the percentage of each measurement type at the same time step. I tried reshaping my data from wide to long, but I don't know how to get the percentages of the measurements for the different time steps.
Output:
t1
d1: 1-75%; 2-25% # considering that during d1t1 4 people took measurements
d2: 1-100%;
t2
d1: 1-50%; 2-50%
d2: 1-25%; 2-75%
Sample data:
df<-structure(list(d1t1 = c(1, 2, 1, 1),
d1t2 = c(1, 2, 2, 1), d1t3 = c(1, 0, 0, 0), d1t4 = c(2, 5, 7, 7),
d2t1 = c(1, 1, 1, 1), d2t2 = c(1, 2, 2, 2), d2t3 = c(1, 0, 1 ,1), d2t4=c(2,2,2,2)), row.names = c(NA,
4L), class = "data.frame")
If you are looking for data frame output, you can try
dfout <- transform(
aggregate(cnt ~ ., cbind(stack(df), cnt = 1), sum),
# percentage within each day/time combination
perc = 100 * cnt / ave(cnt, gsub("t\\d+", "", ind), gsub("d\\d+", "", ind), FUN = sum)
)
such that
values ind cnt perc
1 1 d1t1 3 75
2 2 d1t1 1 25
3 1 d1t2 2 50
4 2 d1t2 2 50
5 0 d1t3 3 75
6 1 d1t3 1 25
7 2 d1t4 1 25
8 5 d1t4 1 25
9 7 d1t4 2 50
10 1 d2t1 4 100
11 1 d2t2 1 25
12 2 d2t2 3 75
13 0 d2t3 1 25
14 1 d2t3 3 75
15 2 d2t4 4 100
If you want output saved in a list, you can try prop.table like below
Map(function(x) prop.table(table(unname(x))),df)
such that
> Map(function(x) prop.table(table(unname(x))),df)
$d1t1
1 2
0.75 0.25
$d1t2
1 2
0.5 0.5
$d1t3
0 1
0.75 0.25
$d1t4
2 5 7
0.25 0.25 0.50
$d2t1
1
1
$d2t2
1 2
0.25 0.75
$d2t3
0 1
0.25 0.75
$d2t4
2
1
If you want to see the percentages grouped by time step (t1 to t4), you can try
Map(
function(x) {
Map(
function(v) prop.table(table(unname(v))),
x
)
},
split.default(df, gsub(".*(t\\d+)", "\\1", names(df)))
)
such that
$t1
$t1$d1t1
1 2
0.75 0.25
$t1$d2t1
1
1
$t2
$t2$d1t2
1 2
0.5 0.5
$t2$d2t2
1 2
0.25 0.75
$t3
$t3$d1t3
0 1
0.75 0.25
$t3$d2t3
0 1
0.25 0.75
$t4
$t4$d1t4
2 5 7
0.25 0.25 0.50
$t4$d2t4
2
1
You can get the data in long format and then calculate the proportions:
library(dplyr)
df %>%
tidyr::pivot_longer(cols = everything(),
names_to = c('day', 'time'),
names_pattern = '(d\\d+)(t\\d+)') %>%
count(day, time, value) %>%
group_by(time, day) %>%
mutate(n = n/sum(n) * 100)
# day time value n
# <chr> <chr> <dbl> <dbl>
# 1 d1 t1 1 75
# 2 d1 t1 2 25
# 3 d1 t2 1 50
# 4 d1 t2 2 50
# 5 d1 t3 0 75
# 6 d1 t3 1 25
# 7 d1 t4 2 25
# 8 d1 t4 5 25
# 9 d1 t4 7 50
#10 d2 t1 1 100
#11 d2 t2 1 25
#12 d2 t2 2 75
#13 d2 t3 0 25
#14 d2 t3 1 75
#15 d2 t4 2 100
I want to merge the two data frames below by time, in ascending order and without duplicated time values.
My goal is to also have two new variables.
df1
time freq
1 1.5 1
2 3.5 1
3 4.5 2
4 5.5 1
5 8.5 2
6 9.5 1
7 10.5 1
8 11.5 1
9 15.5 1
10 16.5 1
11 18.5 1
12 23.5 1
13 26.5 1
df2
time freq
1 0.5 6
2 2.5 2
3 3.5 1
4 6.5 1
5 15.5 1
Please help me with the code for creating the two new columns:
Where if the freq value corresponds to a time in df1, then a new variable (var1) would record the associated freq value, AND 0 if no such time value exists for df1.
Where if the freq value corresponds to a time in df2, then a second new variable (var2) would record that freq value from df2, AND 0 if no such time value exists for df2.
So I would have a table like this below:
time var1 var2
0.5 0 6
1.5 1 0
2.5 0 2
3.5 1 1
4.5 2 0
5.5 1 0
...
If I have understood correctly what your data frames look like (something that could be created with):
df1 = data.frame(time = c(1.5, 3.5, 4.5, 5.5, 8.5, 9.5, 10.5, 11.5, 15.5, 16.5, 18.5, 23.5, 26.5), freq = c(1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1))
df2 = data.frame(time = c(0.5, 2.5, 3.5, 6.5, 15.5), freq = c(6, 2, 1, 1, 1))
Then you would get what you are looking for by:
times = sort(unique(c(df1$time, df2$time)))
# freq for each value in `times`, or 0 when that time is absent from `df`
freq_or_zero = function(df) sapply(times, function(x) {f = df$freq[df$time == x]; if (length(f) == 0) 0 else f[1]})
df_new = data.frame(time = times, var1 = freq_or_zero(df1), var2 = freq_or_zero(df2))
Hope this helps,
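The per-time lookup can also be written in a vectorised way with match(), which returns NA for times that are absent and so makes the "0 if no such time exists" rule a one-line replacement. This is a sketch on the same df1 and df2; `times` and `freq_at` are hypothetical names introduced for the example.

```r
df1 <- data.frame(
  time = c(1.5, 3.5, 4.5, 5.5, 8.5, 9.5, 10.5, 11.5, 15.5, 16.5, 18.5, 23.5, 26.5),
  freq = c(1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1)
)
df2 <- data.frame(time = c(0.5, 2.5, 3.5, 6.5, 15.5), freq = c(6, 2, 1, 1, 1))

# all distinct times from both data frames, ascending
times <- sort(union(df1$time, df2$time))

# freq of `df` at each requested time, 0 where the time does not occur
freq_at <- function(df, times) {
  f <- df$freq[match(times, df$time)]
  f[is.na(f)] <- 0
  f
}

df_new <- data.frame(time = times,
                     var1 = freq_at(df1, times),
                     var2 = freq_at(df2, times))
df_new
```

This relies on each time appearing at most once per data frame, as in the example data; match() only picks up the first occurrence.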
Code - base R
df3 <- merge(x = df1, df2, by.x = 'time', by.y = 'time', all = TRUE, sort = TRUE)
df3$freq.x[is.na(df3$freq.x)] <- 0
df3$freq.y[is.na(df3$freq.y)] <- 0
Code - data.table library
library('data.table')
setDT(df1)
setDT(df2)
setkey(df1, time)
df3 <- merge(x = df1, y = df2, by = 'time', all = TRUE, sort = TRUE)
df3[is.na(freq.x), freq.x := 0 ]
df3[is.na(freq.y), freq.y := 0 ]
Output
df3
# time freq.x freq.y
# 1: 0.5 0 6
# 2: 1.5 1 0
# 3: 2.5 0 2
# 4: 3.5 1 1
# 5: 4.5 2 0
# 6: 5.5 1 0
# 7: 6.5 0 1
# 8: 8.5 2 0
# 9: 9.5 1 0
# 10: 10.5 1 0
# 11: 11.5 1 0
# 12: 15.5 1 1
# 13: 16.5 1 0
# 14: 18.5 1 0
# 15: 23.5 1 0
# 16: 26.5 1 0
Data
df1 <- read.table(text =
'time freq
1 1.5 1
2 3.5 1
3 4.5 2
4 5.5 1
5 8.5 2
6 9.5 1
7 10.5 1
8 11.5 1
9 15.5 1
10 16.5 1
11 18.5 1
12 23.5 1
13 26.5 1', header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text =
'time freq
1 0.5 6
2 2.5 2
3 3.5 1
4 6.5 1
5 15.5 1', header = TRUE, stringsAsFactors = FALSE)
A more straightforward approach using tidyverse or dplyr:
library(tidyverse)
df1 <- tibble(time = c(1.5, 3.5, 4.5, 5.5), freq = c(1, 1, 2, 1))
df2 <- tibble(time = c(0.5, 2.5, 3.5, 6.5), freq = c(6, 2, 1, 1))
full_join(df1, df2, by = "time", suffix = c("_1", "_2")) %>%
mutate_all(~ .x %>% replace_na(0)) %>%
arrange(time)
# A tibble: 7 x 3
time freq_1 freq_2
<dbl> <dbl> <dbl>
1 0.5 0 6
2 1.5 1 0
3 2.5 0 2
4 3.5 1 1
5 4.5 2 0
6 5.5 1 0
7 6.5 0 1
I have two datasets which both share a common ID variable, and also share n variables which are denoted SNP1-SNPn. An example of the two datasets is shown below
Dataset 1
ID SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7
1 0 1 1 0 0 0 0
2 1 1 0 0 0 0 0
3 1 0 0 0 1 1 0
4 0 1 1 0 0 0 0
5 1 0 0 0 1 1 0
6 1 0 0 0 1 1 0
7 0 1 1 0 0 0 0
Dataset 2
ID SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7
1 0.65 1.3 2.8 0.43 0.62 0.9 1.5
2 0.74 1.6 3.4 0.9 2.4 4.4 2.3
3 0.28 0.5 5.7 6.7 0.3 2.5 0.56
4 0.74 1.6 3.4 0.9 2.4 4.4 2.3
5 0.65 1.3 2.8 0.43 0.62 0.9 1.5
6 0.74 1.6 3.4 0.9 2.4 4.4 2.3
7 0.28 0.5 5.7 6.7 0.3 2.5 0.56
I would like to multiply each value in a given position in dataframe 1, with the value in the equivalent position in dataframe 2.
For example, I would like to multiply position [1,2] in dataset 1 (value = 0) by position [1,2] in dataset 2 (value = 0.65). My data set is very large and spans almost 300 columns and 500,000 IDs.
Variable names for SNP1-n are longer in reality (for example they actually read Affx.5869593), so I cannot just use SNP1-300 in my code, it would have to be specified by the number of columns.
Do I need to unlist both datasets by person ID and SNP name first? What function can be used for multiplying values within two datasets?
I am assuming that you are trying to return a third dataframe which will have, in each position, the product of the values that were in that position in the two data frames.
For example, if the following are your two dataframes
df1 <- structure(list(ID = c(1, 2, 3, 4, 5), SNP1a = c(0, 1, 1, 0, 1
), SNP2a = c(1, 1, 0, 1, 0)), class = "data.frame", row.names = c(NA,
-5L))
ID SNP1a SNP2a
1 0 1
2 1 1
3 1 0
4 0 1
5 1 0
df2 <- structure(list(ID = c(1, 2, 3, 4, 5), SNP1b = c(0.65, 0.74,
0.28, 0.74, 0.65), SNP2b = c(1.3, 1.6, 0.5, 1.6, 1.3)), class =
"data.frame", row.names = c(NA, -5L))
ID SNP1b SNP2b
1 0.65 1.3
2 0.74 1.6
3 0.28 0.5
4 0.74 1.6
5 0.65 1.3
Then
df3 <- df1[,2:3] * df2[,2:3]
SNP1a SNP2a
1 0.00 1.3
2 0.74 1.6
3 0.28 0.0
4 0.00 1.6
5 0.65 0.0
will work, as long as the two dataframes have the same dimensions and their rows are in the same order.
If your data frames have an identical set of ids and are the same size, you could sort both by id and do this:
df <- data.frame(
id = c(1,2,3,4,5),
snp1 = c(0,0,1,0,0),
snp2 = c(1,1,1,0,1)
)
df2 <- data.frame(
id = c(1,2,3,4,5),
snp1 = c(0.3,0.2,0.3,0.1,0.2),
snp2 = c(0.5,0.8,0.2,0.3,0.3)
)
# mapply() returns a matrix, so convert it before adding the id column
res <- as.data.frame(mapply(`*`, df[,-1], df2[,-1]))
res$id <- df$id
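Since the real column names are long and the rows may not be in the same order, it may be safer to line the rows up by ID and pick the SNP columns by name rather than by position. The following is a sketch on a three-row subset of the example data, with the rows of the second dataset deliberately shuffled; `d1`, `d2`, `snp_cols` and `d2_aligned` are illustrative names.

```r
d1 <- data.frame(ID = c(1, 2, 3), SNP1 = c(0, 1, 1), SNP2 = c(1, 1, 0))
d2 <- data.frame(ID = c(3, 1, 2), SNP1 = c(0.28, 0.65, 0.74), SNP2 = c(0.5, 1.3, 1.6))

# columns present in both datasets, excluding the ID column
snp_cols <- setdiff(intersect(names(d1), names(d2)), "ID")

# reorder d2 so that its rows follow the ID order of d1
d2_aligned <- d2[match(d1$ID, d2$ID), ]

# element-wise product of the matching columns, with the ID put back in front
result <- cbind(ID = d1$ID, d1[snp_cols] * d2_aligned[snp_cols])
result
```

No reshaping to long format is needed: multiplying two data frames of the same shape with `*` already works position by position.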