Use numbers as column names while grouping them - r

Here is my sample data:
mydata<-structure(list(x1 = c(0, 8.6, 11.2, 8.4, 0, 0), x2 = c(0, 0,
7.8, 7.6, 1.2, 10.2), y1 = c(0, 0, 3.4, 21.4, 1.8, 1.4), y2 = c(7.8,
7.6, 1.2, 10.2, 7, 0), z1 = c(0, 1.6, 7.6, 23.6, 3.2, 0), z2 = c(8.6,
1.4, 0, 0, 0, 0)), .Names = c("x1", "x2", "y1", "y2", "z1", "z2"
), class = "data.frame", row.names = c(NA, -6L))
x1 x2 y1 y2 z1 z2
1 0.0 0.0 0.0 7.8 0.0 8.6
2 8.6 0.0 0.0 7.6 1.6 1.4
3 11.2 7.8 3.4 1.2 7.6 0.0
4 8.4 7.6 21.4 10.2 23.6 0.0
5 0.0 1.2 1.8 7.0 3.2 0.0
6 0.0 10.2 1.4 0.0 0.0 0.0
With the code below, it is possible to group columns as x, y and z.
grps <- unique(gsub("[0-9]", "", colnames(mydata)))
# [1] "x" "y" "z"
But when I rename the columns like this:
myd<-structure(list(X2005 = c(0, 8.6, 11.2, 8.4, 0, 0), X2005.1 = c(0,
0, 7.8, 7.6, 1.2, 10.2), X2006 = c(0, 0, 3.4, 21.4, 1.8, 1.4),
X2006.1 = c(7.8, 7.6, 1.2, 10.2, 7, 0), X2007 = c(0, 1.6,
7.6, 23.6, 3.2, 0), X2007.1 = c(8.6, 1.4, 0, 0, 0, 0)), .Names = c("X2005",
"X2005.1", "X2006", "X2006.1", "X2007", "X2007.1"), row.names = c(NA,
6L), class = "data.frame")
X2005 X2005.1 X2006 X2006.1 X2007 X2007.1
1 0.0 0.0 0.0 7.8 0.0 8.6
2 8.6 0.0 0.0 7.6 1.6 1.4
3 11.2 7.8 3.4 1.2 7.6 0.0
4 8.4 7.6 21.4 10.2 23.6 0.0
5 0.0 1.2 1.8 7.0 3.2 0.0
6 0.0 10.2 1.4 0.0 0.0 0.0
I want to see:
# [1] "2005" "2006" "2007"

We can use gsub to match either the letter 'X' at the beginning (^) of the string or (|) a . followed by digits at the end ($) of the string, and replace the match with a blank (""):
names(myd) <- gsub("^X|\\.\\d+$", "", names(myd))
names(myd)
#[1] "2005" "2005" "2006" "2006" "2007" "2007"
unique(names(myd))
#[1] "2005" "2006" "2007"
If we know the number of digits and their position, then substr would be faster:
substr(names(myd), 2, 5)
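Since the goal is to group the columns, a possible variation is to leave the column names untouched and keep the year labels in a separate grouping vector, which avoids duplicated column names. The sketch below assumes the first three rows of myd and, purely as an illustration, that the per-group operation wanted is a row sum; grp and totals are made-up names.

```r
# First three rows of the question's myd, rebuilt so the snippet is self-contained
myd <- data.frame(
  X2005 = c(0, 8.6, 11.2), X2005.1 = c(0, 0, 7.8),
  X2006 = c(0, 0, 3.4),    X2006.1 = c(7.8, 7.6, 1.2),
  X2007 = c(0, 1.6, 7.6),  X2007.1 = c(8.6, 1.4, 0)
)

# Year label for every column; the column names themselves are not modified
grp <- gsub("^X|\\.\\d+$", "", names(myd))

# split.default() splits the data frame by column, giving one piece per year;
# rowSums() is only an assumed example of a per-group summary
totals <- sapply(split.default(myd, grp), rowSums)
totals
```

The result is a matrix with one column per year, and the original data frame keeps its unique names.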

One option would be to use sub and convert the names to a factor with labels as needed.
names(mydata) <- factor(sub("[0-9]", "", names(mydata)), labels = 2005:2007)
and then check your column names
names(mydata)
#[1] "2005" "2005" "2006" "2006" "2007" "2007"

Related

Populate table with values from another table based on both rows and columns

I have an empty data frame that looks like this:
df <- data.frame(Hugo_Symbol=c("CDKN2A", "JUN", "IRS2","MTOR",
"NRAS"),
A183=c(NA, NA, NA, NA, NA),
A240=c(NA, NA, NA, NA, NA),
A330=c(NA, NA, NA, NA, NA))
I would like to use a larger data frame to populate the previous one. The structure of the larger data frame is the following:
df2 <- data.frame(Hugo_Symbol=c("CDKN2A", "JUN", "IRS2","MTOR",
"NRAS", "TP53", "EGFR"),
A183=c(2.3, 3.3, 2.6, 4.7, 1.2, 5.7, 3.4),
A240=c(1.3, 2.3, 4.6, 5.7, 2.2, 7.7, 1.4),
A330=c(0.3, 2.3, 1.6, 1.7, 4.2, 1.7, 4.4),
A335=c(1.3, 0.3, 0.6, 0.7, 0.2, 0.7, 0.4),
A345=c(0.3, 4.3, 4.6, 4.7, 4.2, 4.7, 0.4))
My desired output should look like this:
Hugo_Symbol A183 A240 A330
1 CDKN2A 2.3 1.3 0.3
2 JUN 3.3 2.3 2.3
3 IRS2 2.6 4.6 1.6
4 MTOR 4.7 5.7 1.7
5 NRAS 1.2 2.2 4.2
I tried to use the dplyr package, specifically the semi_join() function, but it returns an empty table.
You can also use the following solution:
library(dplyr)
df %>%
left_join(df2, by = "Hugo_Symbol") %>%
mutate(across(ends_with(".x"), ~ coalesce(.x, get(sub("\\.x$", ".y", cur_column()))))) %>%
select(Hugo_Symbol, ends_with(".x")) %>%
rename_with(~ sub("\\.x$", "", .), ends_with(".x"))
Hugo_Symbol A183 A240 A330
1 CDKN2A 2.3 1.3 0.3
2 JUN 3.3 2.3 2.3
3 IRS2 2.6 4.6 1.6
4 MTOR 4.7 5.7 1.7
5 NRAS 1.2 2.2 4.2
We could use a join
library(data.table)
nm1 <- names(df)[-1]
df[nm1] <- lapply(df[nm1], as.numeric)
setDT(df)[df2, (nm1) := mget(paste0('i.', nm1)), on = .(Hugo_Symbol)]
-output
df
Hugo_Symbol A183 A240 A330
1: CDKN2A 2.3 1.3 0.3
2: JUN 3.3 2.3 2.3
3: IRS2 2.6 4.6 1.6
4: MTOR 4.7 5.7 1.7
5: NRAS 1.2 2.2 4.2
Is it possible to just drop the NA columns from the first data frame? If so, a left join will produce the desired output.
df <- data.frame(
Hugo_Symbol = c("CDKN2A", "JUN", "IRS2", "MTOR",
"NRAS"),
A183 = c(NA, NA, NA, NA, NA),
A240 = c(NA, NA, NA, NA, NA),
A330 = c(NA, NA, NA, NA, NA)
)
df2 <- data.frame(
Hugo_Symbol = c("CDKN2A", "JUN", "IRS2", "MTOR",
"NRAS", "TP53", "EGFR"),
A183 = c(2.3, 3.3, 2.6, 4.7, 1.2, 5.7, 3.4),
A240 = c(1.3, 2.3, 4.6, 5.7, 2.2, 7.7, 1.4),
A330 = c(0.3, 2.3, 1.6, 1.7, 4.2, 1.7, 4.4),
A335 = c(1.3, 0.3, 0.6, 0.7, 0.2, 0.7, 0.4),
A345 = c(0.3, 4.3, 4.6, 4.7, 4.2, 4.7, 0.4)
)
library(dplyr)
left_join(df["Hugo_Symbol"], df2, by = "Hugo_Symbol")
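The same lookup can also be sketched in base R without a join, using match() to find the rows and names(df) to pick the columns. This is an alternative technique rather than what the answer above does, and it assumes every Hugo_Symbol in df occurs exactly once in df2; filled is a made-up name.

```r
# Empty template and larger source table from the question
df <- data.frame(
  Hugo_Symbol = c("CDKN2A", "JUN", "IRS2", "MTOR", "NRAS"),
  A183 = NA, A240 = NA, A330 = NA
)
df2 <- data.frame(
  Hugo_Symbol = c("CDKN2A", "JUN", "IRS2", "MTOR", "NRAS", "TP53", "EGFR"),
  A183 = c(2.3, 3.3, 2.6, 4.7, 1.2, 5.7, 3.4),
  A240 = c(1.3, 2.3, 4.6, 5.7, 2.2, 7.7, 1.4),
  A330 = c(0.3, 2.3, 1.6, 1.7, 4.2, 1.7, 4.4),
  A335 = c(1.3, 0.3, 0.6, 0.7, 0.2, 0.7, 0.4)
)

# Row positions of df's symbols inside df2 (NA if a symbol were missing)
rows <- match(df$Hugo_Symbol, df2$Hugo_Symbol)

# Keep only the matched rows and only the columns the template asks for
filled <- df2[rows, names(df)]
rownames(filled) <- NULL
filled
```

A symbol that is absent from df2 would produce a row of NA values here, which is similar in spirit to what a left join returns for unmatched keys.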
One more way to do it:
left_join on Hugo_Symbol.
Then use transmute with across on only Hugo_Symbol and the columns that end in the suffix .y.
Retain the values as they are, hence ~ . as the function.
Remove .y from the names using the .names argument. Use the regex [.]y so that the dot is matched literally instead of being interpreted as a wildcard followed by y.
library(dplyr)
df %>% left_join(df2, by = 'Hugo_Symbol') %>%
transmute(across(Hugo_Symbol | ends_with('.y'), ~., .names = '{gsub("[.]y", "", .col )}'))
#> Hugo_Symbol A183 A240 A330
#> 1 CDKN2A 2.3 1.3 0.3
#> 2 JUN 3.3 2.3 2.3
#> 3 IRS2 2.6 4.6 1.6
#> 4 MTOR 4.7 5.7 1.7
#> 5 NRAS 1.2 2.2 4.2
Created on 2021-07-24 by the reprex package (v2.0.0)

Example of using dput()

Being a new user here, my questions are not being fully answered because they are not reproducible. I read the thread on producing reproducible code, but to no avail. Specifically, I am lost on how to use the dput() function.
Could someone provide a step-by-step guide on how to use dput(), for example with the iris data frame? It would be very helpful.
Using the iris dataset, which is handily included in R, we can see how dput() works:
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Now we can get the whole dataset using dput(iris). In most situations, it is unnecessary to provide a whole dataset for a Stack Overflow question, as a few lines of the relevant variables suffice as a working data example.
Two things come in handy: The head() function outputs only the first six rows of a dataframe/matrix. Also, the indexing in R (via brackets) allows you to select only specific columns.
Therefore, we can restrict the output of dput() using a combination of these two:
dput(head(iris[, c(1, 3)]))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 6L), class = "data.frame")
will give us the code to reproduce the first (up to) six rows of columns 1 and 3 of the iris dataset.
df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 6L), class = "data.frame")
> df
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
If the first rows do not suffice, we can skip using head() and rely on indexing only:
dput(iris[1:20, c(1, 3)])
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6,
5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1
), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4,
1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 20L), class = "data.frame")
will give us the first twenty rows:
df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6,
5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1
), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4,
1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5)), .Names = c("Sepal.Length",
"Petal.Length"), row.names = c(NA, 20L), class = "data.frame")
> df
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
7 4.6 1.4
8 5.0 1.5
9 4.4 1.4
10 4.9 1.5
11 5.4 1.5
12 4.8 1.6
13 4.8 1.4
14 4.3 1.1
15 5.8 1.2
16 5.7 1.5
17 5.4 1.3
18 5.1 1.4
19 5.7 1.7
20 5.1 1.5
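To check that pasted dput() output really rebuilds the object, the round trip can be sketched as follows. This is only an illustrative check: capture.output() stands in for copying the text into a question, and small, txt and restored are made-up names.

```r
# The subset we want to share
small <- head(iris[, c(1, 3)])

# dput() prints R code; capture it as text, as if it were pasted into a post
txt <- capture.output(dput(small))

# A reader evaluates that text to get the object back
restored <- eval(parse(text = txt))

# TRUE means the rebuilt data frame matches the original subset
all.equal(restored, small)
```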

Create function to fit multiple models to 1 dataset

I am trying to fit a GLM to a small dataset, consisting of 5 columns of variables y, x1, x2, x5, x7, and 24 rows of data.
This is not a problem in itself, but with these predictive variables there are 2^4 models possible. I am trying to write a function such that it will create a GLM for all different models, and return the coefficients along with the AIC value in 1 table. Can anyone help me out?
The dataset looks like this:
i y x1 x2 x5 x7
1 29.5 5.0208 1.0 2.0 4
2 27.9 4.5429 1.0 1.0 3
3 25.9 4.5573 1.0 1.0 3
4 29.9 5.0597 1.0 1.0 3
5 29.9 3.8910 1.0 1.0 3
6 30.9 5.8980 1.0 1.0 3
7 28.9 5.6039 1.0 0.0 3
8 35.9 5.8282 1.0 2.0 3
9 31.5 5.3003 1.0 1.0 3
10 31.0 6.2712 1.0 1.0 2
11 30.9 5.9592 1.0 2.0 3
12 30.0 5.0500 1.0 0.0 2
13 36.9 8.2464 1.5 2.0 4
14 41.9 6.6969 1.5 1.5 3
15 40.5 7.7841 1.5 1.0 3
16 43.9 9.0384 1.0 1.5 3
17 37.5 5.9894 1.0 2.0 3
18 37.9 7.5422 1.5 1.0 3
19 44.5 8.7951 1.5 2.0 4
20 37.9 6.0831 1.5 1.0 3
21 38.9 8.3607 1.5 2.0 4
22 36.9 8.1400 1.0 2.0 3
23 45.8 9.1416 1.5 1.5 4
24 25.9 4.9176 1.0 1.0 4
And the dput is:
structure(list(y = c(29.5, 27.9, 25.9, 29.9, 29.9, 30.9, 28.9,
35.9, 31.5, 31, 30.9, 30, 36.9, 41.9, 40.5, 43.9, 37.5, 37.9,
44.5, 37.9, 38.9, 36.9, 45.8, 25.9), x1 = c(5.0208, 4.5429, 4.5573,
5.0597, 3.891, 5.898, 5.6039, 5.8282, 5.3003, 6.2712, 5.9592,
5.05, 8.2464, 6.6969, 7.7841, 9.0384, 5.9894, 7.5422, 8.7951,
6.0831, 8.3607, 8.14, 9.1416, 4.9176), x2 = c(1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1.5, 1.5, 1.5, 1, 1, 1.5, 1.5, 1.5, 1.5,
1, 1.5, 1), x5 = c(2, 1, 1, 1, 1, 1, 0, 2, 1, 1, 2, 0, 2, 1.5,
1, 1.5, 2, 1, 2, 1, 2, 2, 1.5, 1), x7 = c(4, 3, 3, 3, 3, 3, 3,
3, 3, 2, 3, 2, 4, 3, 3, 3, 3, 3, 4, 3, 4, 3, 4, 4)), .Names = c("y",
"x1", "x2", "x5", "x7"), row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16",
"17", "18", "19", "20", "21", "22", "23", "24"), class = "data.frame")
I used the glmulti package (link below) a few years ago to do something similar to this. I don't know if it will work for your problem though (you should post your data or a subset of your data so people can try things out).
https://www.jstatsoft.org/article/view/v034i12/v34i12.pdf
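If an extra package is not wanted, the 2^4 models can also be enumerated by hand in base R. The following is a rough sketch under loud assumptions: mtcars stands in for the asker's data (mpg plays the role of y, and wt, hp, qsec, drat the four predictors), the family is assumed to be gaussian because the question does not state one, and fit_one is a made-up helper name.

```r
# Assumption: mtcars is a stand-in for the asker's 24-row data frame
dat <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]
response <- "mpg"
predictors <- setdiff(names(dat), response)

# One row of TRUE/FALSE flags per model: 2^4 = 16 predictor subsets
flags <- expand.grid(rep(list(c(FALSE, TRUE)), length(predictors)))
names(flags) <- predictors

# Hypothetical helper: fit one model, return its coefficients and AIC
fit_one <- function(use) {
  vars <- predictors[use]
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  # Assumption: gaussian family; replace with the family actually needed
  fit <- glm(as.formula(paste(response, "~", rhs)), data = dat,
             family = gaussian())
  # Excluded predictors stay NA so that every model yields the same columns
  out <- setNames(rep(NA_real_, length(predictors) + 1),
                  c("(Intercept)", predictors))
  out[names(coef(fit))] <- coef(fit)
  c(out, AIC = AIC(fit))
}

results <- as.data.frame(t(apply(flags, 1, function(r) fit_one(as.logical(r)))))
results <- results[order(results$AIC), ]  # lowest AIC first
results
```

Each row of results is one model; an NA coefficient means that predictor was left out of that model.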

how to plot data in time series

I have data that looks like this:
time sucrose fructose glucose galactose molasses water
1 5 0.0 0.00 0.0 0.0 0.3 0
2 10 0.3 0.10 0.1 0.0 1.0 0
3 15 0.8 0.20 0.2 0.2 1.4 0
4 20 1.3 0.35 0.7 0.4 2.5 0
5 25 2.2 0.80 1.6 0.5 3.5 0
6 30 3.1 1.00 2.3 0.6 4.5 0
7 35 3.6 1.60 3.1 0.7 5.7 0
8 40 5.1 2.80 4.3 0.7 6.7 0
How can I make a time series plot that uses the time column? They are all increasing values.
I saw the post multiple-time-series-in-one-plot, which uses ts.plot to achieve something similar to what I want to show.
Input data for the table above:
structure(list(time = c(5, 10, 15, 20, 25, 30, 35, 40), sucrose = c(0,
0.3, 0.8, 1.3, 2.2, 3.1, 3.6, 5.1), fructose = c(0, 0.1, 0.2,
0.35, 0.8, 1, 1.6, 2.8), glucose = c(0, 0.1, 0.2, 0.7, 1.6, 2.3,
3.1, 4.3), galactose = c(0, 0, 0.2, 0.4, 0.5, 0.6, 0.7, 0.7),
molasses = c(0.3, 1, 1.4, 2.5, 3.5, 4.5, 5.7, 6.7), water = c(0,
0, 0, 0, 0, 0, 0, 0)), .Names = c("time", "sucrose", "fructose",
"glucose", "galactose", "molasses", "water"), row.names = c(NA,
-8L), class = "data.frame")
It doesn't seem like a ts plot is necessary. Here's how you could do it in base-R:
# set up an empty plot whose y-axis spans every series, not just sucrose
with(df, plot(time, sucrose, type="n", ylab="contents", ylim=range(df[-1])))
var <- names(df)[-1]
for(i in var) lines(df$time, df[,i])
The more elegant solution, however, would be to use the dplyr, tidyr and ggplot2 packages:
library(dplyr)
library(tidyr)    # provides gather()
library(ggplot2)
df <- df %>%
gather(content, val, -time)
ggplot(df, aes(time, val, col=content)) + geom_line()
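As a further base R option, matplot() can draw all series in one call without a loop or reshaping. The sketch below uses only three of the question's columns to stay short; the colours and the legend position are arbitrary choices, and series is a made-up name.

```r
# A few of the question's columns, rebuilt so the snippet is self-contained
df <- data.frame(
  time     = c(5, 10, 15, 20, 25, 30, 35, 40),
  sucrose  = c(0, 0.3, 0.8, 1.3, 2.2, 3.1, 3.6, 5.1),
  fructose = c(0, 0.1, 0.2, 0.35, 0.8, 1, 1.6, 2.8),
  molasses = c(0.3, 1, 1.4, 2.5, 3.5, 4.5, 5.7, 6.7)
)

# Every column except time is treated as one series
series <- setdiff(names(df), "time")

# matplot() draws one line per matrix column and sizes the y-axis to fit all
matplot(df$time, as.matrix(df[series]), type = "l", lty = 1,
        col = seq_along(series), xlab = "time", ylab = "contents")
legend("topleft", legend = series, lty = 1, col = seq_along(series))
```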

R programming help in editing code

I've asked many questions about this and all the answers were really helpful... but once again my data is weird and I need help. Basically, what I want to do is find the average speed over a certain range of intervals; let's say from 6 s to 40 s my average speed would be 5 m/s, and so on.
So it was pointed out to me to use this code...
library(IRanges)
idx <- seq(1, ncol(data), by=2)
# idx is now 1, 3, 5. It will be passed one value at a time to `i`.
# that is, `i` will take values 1 first, then 3 and then 5 and each time
# the code within is executed.
o <- lapply(idx, function(i) {
ir1 <- IRanges(start=seq(0, max(data[[i]]), by=401), width=401)
ir2 <- IRanges(start=data[[i]], width=1)
t <- findOverlaps(ir1, ir2)
d <- data.frame(mean=tapply(data[[i+1]], queryHits(t), mean))
cbind(as.data.frame(ir1), d)
})
which gives this output
# > o
# [[1]]
# start end width mean
# 1 0 400 401 1.05
#
# [[2]]
# start end width mean
# 1 0 400 401 1.1
#
# [[3]]
# start end width mean
# 1 0 400 401 1.383333
So if I wanted it to be every 100 s... I'll just change ir1 <- ....., by = 401 to become by=100.
But my data is weird because of a few things:
My data doesn't always start at 0 s; sometimes it starts at 20 s, depending on the specimen and whether it moves.
My data collection does not happen every 1 s, 2 s or 3 s. Hence sometimes I get data for 1-20 s but nothing for 20-40 s, simply because the specimen does not move.
I think the findOverlaps portion of the code affects my output. How can I get rid of that without disturbing the output?
Here is some data to illustrate my troubles...but all of my real data ends in 2000s
Time Speed Time Speed Time Speed
6.3 1.6 3.1 1.7 0.3 2.4
11.3 1.3 5.1 2.2 1.3 1.3
13.8 1.3 6.3 3.4 3.1 1.5
14.1 1.0 7.0 2.3 4.5 2.7
47.4 2.9 11.3 1.2 5.1 0.5
49.2 0.7 26.5 3.3 5.9 1.7
50.5 0.9 27.3 3.4 9.7 2.4
57.1 1.3 36.6 2.5 11.8 1.3
72.9 2.9 40.3 1.1 13.1 1.0
86.6 2.4 44.3 3.2 13.8 0.6
88.5 3.4 50.9 2.6 14.0 2.4
89.0 3.0 62.6 1.5 14.8 2.2
94.8 2.9 66.8 0.5 15.5 2.6
117.4 0.5 67.3 1.1 16.4 3.2
123.7 3.2 67.7 0.6 26.5 0.9
124.5 1.0 68.2 3.2 44.7 3.0
126.1 2.8 72.1 2.2 45.1 0.8
As you can see from the data, it doesn't necessarily end at 60 s; sometimes it ends at 57 s, for example.
EDIT add dput of data
structure(list(Time = c(6.3, 11.3, 13.8, 14.1, 47.4, 49.2, 50.5,
57.1, 72.9, 86.6, 88.5, 89, 94.8, 117.4, 123.7, 124.5, 126.1),
Speed = c(1.6, 1.3, 1.3, 1, 2.9, 0.7, 0.9, 1.3, 2.9, 2.4,
3.4, 3, 2.9, 0.5, 3.2, 1, 2.8), Time.1 = c(3.1, 5.1, 6.3,
7, 11.3, 26.5, 27.3, 36.6, 40.3, 44.3, 50.9, 62.6, 66.8,
67.3, 67.7, 68.2, 72.1), Speed.1 = c(1.7, 2.2, 3.4, 2.3,
1.2, 3.3, 3.4, 2.5, 1.1, 3.2, 2.6, 1.5, 0.5, 1.1, 0.6, 3.2,
2.2), Time.2 = c(0.3, 1.3, 3.1, 4.5, 5.1, 5.9, 9.7, 11.8,
13.1, 13.8, 14, 14.8, 15.5, 16.4, 26.5, 44.7, 45.1), Speed.2 = c(2.4,
1.3, 1.5, 2.7, 0.5, 1.7, 2.4, 1.3, 1, 0.6, 2.4, 2.2, 2.6,
3.2, 0.9, 3, 0.8)), .Names = c("Time", "Speed", "Time.1",
"Speed.1", "Time.2", "Speed.2"), class = "data.frame", row.names = c(NA,
-17L))
Sorry if I don't understand your question entirely. Could you explain why this example doesn't do what you're trying to do?
# use a pre-loaded data set
mtcars
# choose which variable to cut
var <- 'mpg'
# define groups, whether that be time or something else
# and choose how to cut it.
x <- cut( mtcars[ , var ] , c( -Inf , seq( 15 , 25 , by = 2.5 ) , Inf ) )
# look at your cut points, for every record
x
# you can merge them back on to the mtcars data frame if you like..
mtcars$cutpoints <- x
# ..but that's not necessary
# find the mean within those groups
tapply(
mtcars[ , var ] ,
x ,
mean
)
# find the mean within groups, using a different variable
tapply(
mtcars[ , 'wt' ] ,
x ,
mean
)
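Applied to Time/Speed data like the question's, the same cut() and tapply() idea might look like the sketch below. It uses the first nine rows of the first Time/Speed pair, and the 20 s bin width is an assumption chosen only to show what happens in an interval with no observations; width, breaks, bins and avg are made-up names.

```r
# First nine rows of the question's first Time/Speed pair
dat <- data.frame(
  Time  = c(6.3, 11.3, 13.8, 14.1, 47.4, 49.2, 50.5, 57.1, 72.9),
  Speed = c(1.6, 1.3, 1.3, 1.0, 2.9, 0.7, 0.9, 1.3, 2.9)
)

width <- 20  # assumed interval width in seconds

# Breaks always start at 0 and run past the last observed time,
# so it does not matter where the recording starts or ends
breaks <- seq(0, ceiling(max(dat$Time) / width) * width, by = width)
bins <- cut(dat$Time, breaks = breaks, include.lowest = TRUE)

# Mean speed per interval; an interval with no observations gives NA
avg <- tapply(dat$Speed, bins, mean)
avg
```

Because the bins come from fixed breaks rather than from the observed times, a recording that starts late or skips an interval simply leaves NA in the corresponding slot.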
