I have an empty data frame that looks like this:
df <- data.frame(Hugo_Symbol=c("CDKN2A", "JUN", "IRS2","MTOR",
"NRAS"),
A183=c(NA, NA, NA, NA, NA),
A240=c(NA, NA, NA, NA, NA),
A330=c(NA, NA, NA, NA, NA))
I would like to use a larger data frame to populate the previous one. The structure of the larger data frame is the following:
df2 <- data.frame(Hugo_Symbol=c("CDKN2A", "JUN", "IRS2","MTOR",
"NRAS", "TP53", "EGFR"),
A183=c(2.3, 3.3, 2.6, 4.7, 1.2, 5.7, 3.4),
A240=c(1.3, 2.3, 4.6, 5.7, 2.2, 7.7, 1.4),
A330=c(0.3, 2.3, 1.6, 1.7, 4.2, 1.7, 4.4),
A335=c(1.3, 0.3, 0.6, 0.7, 0.2, 0.7, 0.4),
A345=c(0.3, 4.3, 4.6, 4.7, 4.2, 4.7, 0.4))
My desired output should look like this:
Hugo_Symbol A183 A240 A330
1 CDKN2A 2.3 1.3 0.3
2 JUN 3.3 2.3 2.3
3 IRS2 2.6 4.6 1.6
4 MTOR 4.7 5.7 1.7
5 NRAS 1.2 2.2 4.2
I tried to use the dplyr package, specifically the semi_join() function, but it returns an empty table.
You can also use the following solution:
library(dplyr)
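df %>%
  left_join(df2, by = "Hugo_Symbol") %>%
  # fill each NA column from df (".x") with the matching column from df2 (".y");
  # the pattern "\\.x$" matches a literal ".x" suffix only
  mutate(across(ends_with(".x"), ~ coalesce(.x, get(sub("\\.x$", ".y", cur_column()))))) %>%
  select(Hugo_Symbol, ends_with(".x")) %>%
  rename_with(~ sub("\\.x$", "", .), ends_with(".x"))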
Hugo_Symbol A183 A240 A330
1 CDKN2A 2.3 1.3 0.3
2 JUN 3.3 2.3 2.3
3 IRS2 2.6 4.6 1.6
4 MTOR 4.7 5.7 1.7
5 NRAS 1.2 2.2 4.2
We could use a join
library(data.table)
nm1 <- names(df)[-1]
df[nm1] <- lapply(df[nm1], as.numeric)
setDT(df)[df2, (nm1) := mget(paste0('i.', nm1)), on = .(Hugo_Symbol)]
-output
df
Hugo_Symbol A183 A240 A330
1: CDKN2A 2.3 1.3 0.3
2: JUN 3.3 2.3 2.3
3: IRS2 2.6 4.6 1.6
4: MTOR 4.7 5.7 1.7
5: NRAS 1.2 2.2 4.2
Is it possible to just drop the NA columns from the first data frame? If so, a left join will produce the desired output.
df <- data.frame(
Hugo_Symbol = c("CDKN2A", "JUN", "IRS2", "MTOR",
"NRAS"),
A183 = c(NA, NA, NA, NA, NA),
A240 = c(NA, NA, NA, NA, NA),
A330 = c(NA, NA, NA, NA, NA)
)
df2 <- data.frame(
Hugo_Symbol = c("CDKN2A", "JUN", "IRS2", "MTOR",
"NRAS", "TP53", "EGFR"),
A183 = c(2.3, 3.3, 2.6, 4.7, 1.2, 5.7, 3.4),
A240 = c(1.3, 2.3, 4.6, 5.7, 2.2, 7.7, 1.4),
A330 = c(0.3, 2.3, 1.6, 1.7, 4.2, 1.7, 4.4),
A335 = c(1.3, 0.3, 0.6, 0.7, 0.2, 0.7, 0.4),
A345 = c(0.3, 4.3, 4.6, 4.7, 4.2, 4.7, 0.4)
)
library(dplyr)
# join on the key only, then keep just the columns that df had
left_join(df["Hugo_Symbol"], df2, by = "Hugo_Symbol") %>%
  select(all_of(names(df)))
One more way to do it:
left_join on Hugo_Symbol.
Then use transmute() with across() on only Hugo_Symbol and the columns that end in the suffix .y.
Retain the values as they are, hence ~ . as the function.
Remove .y from the names using the .names argument. Use the regex [.]y so that the dot is matched literally rather than interpreted as a wildcard followed by y.
library(dplyr)
df %>% left_join(df2, by = 'Hugo_Symbol') %>%
transmute(across(Hugo_Symbol | ends_with('.y'), ~., .names = '{gsub("[.]y", "", .col )}'))
#> Hugo_Symbol A183 A240 A330
#> 1 CDKN2A 2.3 1.3 0.3
#> 2 JUN 3.3 2.3 2.3
#> 3 IRS2 2.6 4.6 1.6
#> 4 MTOR 4.7 5.7 1.7
#> 5 NRAS 1.2 2.2 4.2
Created on 2021-07-24 by the reprex package (v2.0.0)
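For comparison, the same lookup can be sketched in base R without any packages. This is a minimal sketch that assumes each Hugo_Symbol appears at most once in df2 and that every symbol in df is present in df2 (a missing symbol would produce a row of NA). The names idx and filled are only illustrative.

```r
df <- data.frame(Hugo_Symbol = c("CDKN2A", "JUN", "IRS2", "MTOR", "NRAS"),
                 A183 = NA, A240 = NA, A330 = NA)
df2 <- data.frame(Hugo_Symbol = c("CDKN2A", "JUN", "IRS2", "MTOR",
                                  "NRAS", "TP53", "EGFR"),
                  A183 = c(2.3, 3.3, 2.6, 4.7, 1.2, 5.7, 3.4),
                  A240 = c(1.3, 2.3, 4.6, 5.7, 2.2, 7.7, 1.4),
                  A330 = c(0.3, 2.3, 1.6, 1.7, 4.2, 1.7, 4.4),
                  A335 = c(1.3, 0.3, 0.6, 0.7, 0.2, 0.7, 0.4),
                  A345 = c(0.3, 4.3, 4.6, 4.7, 4.2, 4.7, 0.4))

# position of each df symbol inside df2 (NA if a symbol is absent)
idx <- match(df$Hugo_Symbol, df2$Hugo_Symbol)

# take those rows, and only the columns that df already has
filled <- df2[idx, names(df)]
rownames(filled) <- NULL
filled
```

Because the rows are picked by position, the result follows the row order of df rather than that of df2.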
I am testing some models for predicting the species with a modified iris dataset, limited to SVM and random forest for now. I am running this in RStudio.
Abbreviated Set-up:
library(caret)
#data
data(iris)
#rename
dataset <- iris
#smaller sample
sample_data <- dataset[sample(nrow(dataset), 60), ]
#create some noise so model is less-than-perfect
noise_df <- data.frame(
Sepal.Length = c(5.7, 5.7, 5.7, 5.7, 5.7, 5.7, 5.7, 5.7, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0),
Sepal.Width = c(3.8, 3.8, 3.8, 3.8, 3.8, 3.8, 3.8, 3.8, 2.7, 2.7, 2.7, 2.8, 2.8, 2.8, 2.8, 2.8, 3.1, 3.1, 3.1, 3.1, 3.1, 3.1),
Petal.Length = c(5.2, 5.2, 5.3, 5.3, 5.4, 5.4, 5.4, 5.4, 5.5, 5.5, 5.5, 5.6, 5.6, 5.7, 5.7, 5.8, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3),
Petal.Width = c(1.8, 1.8, 1.8, 1.9, 1.9, 1.9, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2),
Species = c("setosa","setosa", "setosa","setosa", "setosa","setosa","setosa","setosa", "setosa","setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica")
)
#combine sample with noise
dataset2 <- rbind(sample_data, noise_df)
#split data into train/test
set.seed(7)
validation_index <- createDataPartition(dataset2$Species, p=0.70, list=FALSE)
test_set <- dataset2[-validation_index,]
train_set <- dataset2[validation_index,]
#====================
#build models
#====================
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
#random forest model
set.seed(3)
fit.rf <- train(Species~., data=train_set, method="rf", metric=metric, trControl=control)
#svm model
set.seed(3)
fit.svm <- train(Species~., data=train_set, method="svmRadial", metric=metric, trControl=control)
#====================
#run model on test
#====================
predictions <- predict(fit.svm, test_set)
confusionMatrix(predictions, test_set$Species)
Confusion Matrix output:
Reference
Prediction setosa versicolor virginica
setosa 11 0 3
versicolor 0 3 0
virginica 0 1 5
I'm wondering if it's possible to list the probability for each prediction. For example:
setosa versicolor virginica predicted
1 0.9 0.0 0.1 setosa
2 0.1 0.8 0.1 versicolor
3 0.33 0.33 0.33 virginica
I would guess that random forest perhaps just gives 0 or 1, but I wonder whether SVM has an option to break out the probabilities as in the example above. If so, I'm not sure how to shape my data or which functions to use. Is it something like a decision_function or predict_proba function? I'm not clear on how to do this correctly in R.
For random forest, the probability is the proportion of decision trees that predict each label, and you can get it using predict(..., type="prob"):
data.frame(predict(fit.rf,type="prob", newdata=test_set),
predicted=predict(fit.rf, newdata=test_set))
setosa versicolor virginica predicted
147 0.016 0.002 0.982 virginica
15 0.908 0.068 0.024 setosa
103 0.486 0.000 0.514 virginica
118 0.416 0.056 0.528 virginica
129 0.344 0.000 0.656 virginica
39 0.388 0.080 0.532 virginica
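To make the vote-proportion idea concrete, here is a minimal base R sketch. The votes vector is made up for illustration (it does not come from the fitted model) and stands in for the class that each of 10 trees predicted for a single observation.

```r
# Hypothetical votes from 10 trees for one observation (illustrative only)
classes <- c("setosa", "versicolor", "virginica")
votes <- factor(c("setosa", "virginica", "virginica", "setosa", "virginica",
                  "virginica", "versicolor", "virginica", "setosa", "virginica"),
                levels = classes)

# the class "probability" is the share of trees voting for each class
probs <- prop.table(table(votes))
probs

# the predicted label is the class with the largest share of votes
predicted <- names(probs)[which.max(probs)]
predicted
```

With 3, 1 and 6 votes, the shares are 0.3, 0.1 and 0.6, so the predicted label here is virginica.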
For the kernlab SVM you need to set prob.model = TRUE:
set.seed(3)
fit.svm <- train(Species~., data=train_set, method="svmRadial", metric=metric, trControl=control, prob.model = TRUE)
data.frame(predict(fit.svm,newdata=test_set,type="prob"),
predicted=predict(fit.svm,newdata=test_set))
setosa versicolor virginica predicted
1 0.129916071 0.051873046 0.81821088 virginica
2 0.884025291 0.030853736 0.08512097 setosa
3 0.129054108 0.006256384 0.86468951 virginica
4 0.104952659 0.124066424 0.77098092 virginica
I have two lists that I would like to join together.
I want to join the second list to the first list as a new column.
The second list looks like:
[[1]]
[1] 2.46
[[2]]
[1] 2.475
[[3]]
[1] 2.4875
[[4]]
[1] 2.485
[[5]]
[1] 2.4625
[[6]]
[1] 2.4875
So I would like to add [[1]] of the second list as a new column to [[1]] of list 1, and so on. Expected output for 2 of the lists:
[[1]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Value
1 5.1 3.5 1.4 0.2 setosa 2.46
2 4.9 3.0 1.4 0.2 setosa 2.46
3 4.7 3.2 1.3 0.2 setosa 2.46
4 4.6 3.1 1.5 0.2 setosa 2.46
5 5.0 3.6 1.4 0.2 setosa 2.46
6 5.4 3.9 1.7 0.4 setosa 2.46
[[2]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Value
2 4.9 3.0 1.4 0.2 setosa 2.475
3 4.7 3.2 1.3 0.2 setosa 2.475
4 4.6 3.1 1.5 0.2 setosa 2.475
5 5.0 3.6 1.4 0.2 setosa 2.475
6 5.4 3.9 1.7 0.4 setosa 2.475
7 4.6 3.4 1.4 0.3 setosa 2.475
This is related to a question I have here where in the EDIT I compute the mean which is the value presented here.
EDIT:
I realise I forgot the data.
Data1
list1 <- list(structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4
), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4,
1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2, 0.2,
0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa",
"versicolor", "virginica"), class = "factor")), row.names = c(NA,
6L), class = "data.frame"), structure(list(Sepal.Length = c(4.9,
4.7, 4.6, 5, 5.4, 4.6), Sepal.Width = c(3, 3.2, 3.1, 3.6, 3.9,
3.4), Petal.Length = c(1.4, 1.3, 1.5, 1.4, 1.7, 1.4), Petal.Width = c(0.2,
0.2, 0.2, 0.2, 0.4, 0.3), Species = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), row.names = 2:7, class = "data.frame"),
structure(list(Sepal.Length = c(4.7, 4.6, 5, 5.4, 4.6, 5),
Sepal.Width = c(3.2, 3.1, 3.6, 3.9, 3.4, 3.4), Petal.Length = c(1.3,
1.5, 1.4, 1.7, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2,
0.4, 0.3, 0.2), Species = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = 3:8, class = "data.frame"),
structure(list(Sepal.Length = c(4.6, 5, 5.4, 4.6, 5, 4.4),
Sepal.Width = c(3.1, 3.6, 3.9, 3.4, 3.4, 2.9), Petal.Length = c(1.5,
1.4, 1.7, 1.4, 1.5, 1.4), Petal.Width = c(0.2, 0.2, 0.4,
0.3, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = 4:9, class = "data.frame"),
structure(list(Sepal.Length = c(5, 5.4, 4.6, 5, 4.4, 4.9),
Sepal.Width = c(3.6, 3.9, 3.4, 3.4, 2.9, 3.1), Petal.Length = c(1.4,
1.7, 1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.4, 0.3,
0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = 5:10, class = "data.frame"),
structure(list(Sepal.Length = c(5.4, 4.6, 5, 4.4, 4.9, 5.4
), Sepal.Width = c(3.9, 3.4, 3.4, 2.9, 3.1, 3.7), Petal.Length = c(1.7,
1.4, 1.5, 1.4, 1.5, 1.5), Petal.Width = c(0.4, 0.3, 0.2,
0.2, 0.1, 0.2), Species = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), row.names = 6:11, class = "data.frame"))
Data 2:
list2 <- list(2.46, 2.475, 2.4875, 2.485, 2.4625, 2.4875)
Data 3:
list3 <- list(1.80438213020271, 1.81796589626978, 1.81591080488058, 1.81906569425076,
1.81978971735325, 1.86302586794048)
Data 4:
list4 <- list(0.1, 0.1, 0.1, 0.1, 0.1, 0.1)
We can use map2
library(dplyr)
library(purrr)
# list1 and list2 are the objects defined in the question
map2(list1, list2, ~ .x %>%
  mutate(Value = .y))
If there are more lists, we can wrap it in a single list and use pmap
pmap(list(list1, list2, list3, list4), ~ ..1 %>%
  mutate(mean = ..2, sd = ..3, min = ..4))
Or in base R with Map
Map(cbind, list1, Value = list2)
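The base R route also extends to several lists. Below is a minimal, self-contained sketch; frames, means and sds are small illustrative stand-ins (not the data from the question), and the column names mean and sd are chosen only for the example.

```r
# two small data frames standing in for the elements of the first list
frames <- list(head(iris, 3), iris[2:4, ])
# illustrative per-frame values, one per data frame
means <- list(2.46, 2.475)
sds   <- list(1.80, 1.82)

# one extra column: Map() walks both lists in parallel
one_col <- Map(function(d, m) cbind(d, Value = m), frames, means)

# several extra columns: pass more lists and name each new column
many_cols <- Map(function(d, m, s) cbind(d, mean = m, sd = s),
                 frames, means, sds)
many_cols[[2]]
```

Each scalar is recycled down the rows of its data frame, which matches the expected output in the question.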
As a new user here, my questions are not being fully answered because they are not reproducible. I read the thread on producing reproducible code, but to no avail. Specifically, I am lost on how to use the dput() function.
It would be very helpful if someone could provide a step-by-step guide to using dput(), for example with the iris data frame.
Using the iris dataset, which is handily included in R, we can see how dput() works:
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Now we can get the whole dataset using dput(iris). In most situations, though, providing a whole dataset is unnecessary for a Stack Overflow question, as a few rows of the relevant variables suffice as a working data example.
Two things come in handy: The head() function outputs only the first six rows of a dataframe/matrix. Also, the indexing in R (via brackets) allows you to select only specific columns.
Therefore, we can restrict the output of dput() using a combination of these two:
dput(head(iris[, c(1, 3)]))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 6L), class = "data.frame")
will give us the code to reproduce the first (up to) six rows of columns 1 and 3 of the iris dataset.
df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 6L), class = "data.frame")
> df
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
If the first rows do not suffice, we can skip using head() and rely on indexing only:
dput(iris[1:20, c(1, 3)])
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6,
5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1
), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4,
1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 20L), class = "data.frame")
will give us the first twenty rows:
df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6,
5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1
), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4,
1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 20L), class = "data.frame")
> df
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
7 4.6 1.4
8 5.0 1.5
9 4.4 1.4
10 4.9 1.5
11 5.4 1.5
12 4.8 1.6
13 4.8 1.4
14 4.3 1.1
15 5.8 1.2
16 5.7 1.5
17 5.4 1.3
18 5.1 1.4
19 5.7 1.7
20 5.1 1.5
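To check that pasted dput() output really reproduces the object, one option is a round trip: capture the text, evaluate it again, and compare. This is a minimal sketch; the names small, txt and rebuilt are only illustrative.

```r
small <- head(iris[, c(1, 3)])

# capture what dput() would print, as lines of text
txt <- capture.output(dput(small))

# parse and evaluate that text, as a reader pasting it into the console would
rebuilt <- eval(parse(text = txt))

# compare the rebuilt object with the original
all.equal(rebuilt, small)
```

If the comparison reports differences, the pasted code does not fully describe the data, and that is worth fixing before posting.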
I've asked many questions about this and all the answers were really helpful, but once again my data is unusual and I need help. Basically, what I want to do is find the average speed over a certain range of time intervals. For example, from 6 s to 40 s my average speed would be 5 m/s, and so on.
So it was pointed out to me to use this code:
library(IRanges)
idx <- seq(1, ncol(data), by=2)
# idx is now 1, 3, 5. It will be passed one value at a time to `i`.
# that is, `i` will take values 1 first, then 3 and then 5 and each time
# the code within is executed.
o <- lapply(idx, function(i) {
ir1 <- IRanges(start=seq(0, max(data[[i]]), by=401), width=401)
ir2 <- IRanges(start=data[[i]], width=1)
t <- findOverlaps(ir1, ir2)
d <- data.frame(mean=tapply(data[[i+1]], queryHits(t), mean))
cbind(as.data.frame(ir1), d)
})
which gives this output
# > o
# [[1]]
# start end width mean
# 1 0 400 401 1.05
#
# [[2]]
# start end width mean
# 1 0 400 401 1.1
#
# [[3]]
# start end width mean
# 1 0 400 401 1.383333
So if I wanted it to be every 100 s, I would just change by=401 in the ir1 <- ... line to by=100.
But my data is weird for a few reasons:
My data doesn't always start at 0 s. Sometimes it starts at 20 s, depending on the specimen and whether it moves.
My data collection does not happen every 1 s, 2 s or 3 s. Hence sometimes I get data for 1-20 s but nothing for 20-40 s, simply because the specimen does not move.
I think the findOverlaps portion of the code affects my output. How can I get rid of that without disturbing the output?
Here is some data to illustrate my troubles. All of my real data ends at 2000 s.
Time Speed Time Speed Time Speed
6.3 1.6 3.1 1.7 0.3 2.4
11.3 1.3 5.1 2.2 1.3 1.3
13.8 1.3 6.3 3.4 3.1 1.5
14.1 1.0 7.0 2.3 4.5 2.7
47.4 2.9 11.3 1.2 5.1 0.5
49.2 0.7 26.5 3.3 5.9 1.7
50.5 0.9 27.3 3.4 9.7 2.4
57.1 1.3 36.6 2.5 11.8 1.3
72.9 2.9 40.3 1.1 13.1 1.0
86.6 2.4 44.3 3.2 13.8 0.6
88.5 3.4 50.9 2.6 14.0 2.4
89.0 3.0 62.6 1.5 14.8 2.2
94.8 2.9 66.8 0.5 15.5 2.6
117.4 0.5 67.3 1.1 16.4 3.2
123.7 3.2 67.7 0.6 26.5 0.9
124.5 1.0 68.2 3.2 44.7 3.0
126.1 2.8 72.1 2.2 45.1 0.8
As you can see from the data, it doesn't necessarily end at 60 s; sometimes it only ends at 57 s, for example.
EDIT: adding dput of the data
structure(list(Time = c(6.3, 11.3, 13.8, 14.1, 47.4, 49.2, 50.5,
57.1, 72.9, 86.6, 88.5, 89, 94.8, 117.4, 123.7, 124.5, 126.1),
Speed = c(1.6, 1.3, 1.3, 1, 2.9, 0.7, 0.9, 1.3, 2.9, 2.4,
3.4, 3, 2.9, 0.5, 3.2, 1, 2.8), Time.1 = c(3.1, 5.1, 6.3,
7, 11.3, 26.5, 27.3, 36.6, 40.3, 44.3, 50.9, 62.6, 66.8,
67.3, 67.7, 68.2, 72.1), Speed.1 = c(1.7, 2.2, 3.4, 2.3,
1.2, 3.3, 3.4, 2.5, 1.1, 3.2, 2.6, 1.5, 0.5, 1.1, 0.6, 3.2,
2.2), Time.2 = c(0.3, 1.3, 3.1, 4.5, 5.1, 5.9, 9.7, 11.8,
13.1, 13.8, 14, 14.8, 15.5, 16.4, 26.5, 44.7, 45.1), Speed.2 = c(2.4,
1.3, 1.5, 2.7, 0.5, 1.7, 2.4, 1.3, 1, 0.6, 2.4, 2.2, 2.6,
3.2, 0.9, 3, 0.8)), .Names = c("Time", "Speed", "Time.1",
"Speed.1", "Time.2", "Speed.2"), class = "data.frame", row.names = c(NA,
-17L))
Sorry if I don't understand your question entirely. Could you explain why this example doesn't do what you're trying to do?
# use a pre-loaded data set
mtcars
# choose which variable to cut
var <- 'mpg'
# define groups, whether that be time or something else
# and choose how to cut it.
x <- cut( mtcars[ , var ] , c( -Inf , seq( 15 , 25 , by = 2.5 ) , Inf ) )
# look at your cut points, for every record
x
# you can merge them back on to the mtcars data frame if you like..
mtcars$cutpoints <- x
# ..but that's not necessary
# find the mean within those groups
tapply(
mtcars[ , var ] ,
x ,
mean
)
# find the mean within groups, using a different variable
tapply(
mtcars[ , 'wt' ] ,
x ,
mean
)
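Applied to the Time/Speed data from the question, the same cut() plus tapply() idea might look like the sketch below. The 40 s bin width and the names width, breaks, bins and avg are assumptions chosen for illustration, and only the first Time/Speed pair from the dput is used.

```r
# first Time/Speed pair from the question's dput
time  <- c(6.3, 11.3, 13.8, 14.1, 47.4, 49.2, 50.5, 57.1, 72.9,
           86.6, 88.5, 89, 94.8, 117.4, 123.7, 124.5, 126.1)
speed <- c(1.6, 1.3, 1.3, 1, 2.9, 0.7, 0.9, 1.3, 2.9,
           2.4, 3.4, 3, 2.9, 0.5, 3.2, 1, 2.8)

width <- 40  # assumed bin width in seconds

# fixed breaks from 0 up to the first multiple of width past the last time
breaks <- seq(0, ceiling(max(time) / width) * width, by = width)

# assign every observation to its bin: [0,40), [40,80), ...
bins <- cut(time, breaks = breaks, right = FALSE)

# average speed per bin
avg <- tapply(speed, bins, mean)
avg
# [0,40): 1.30, [40,80): 1.74, [80,120): 2.44, [120,160): about 2.33
```

Since the breaks are fixed in advance, it does not matter whether the recording starts at 0 s or is sampled at regular steps. A bin with no observations shows up as NA in the result.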