I'm trying to calculate percent change in R, where each time point is encoded in the column name (table below). I have dplyr loaded, and my dataset is loaded in R under the name data. Below is the code I'm using, but it's not calculating correctly. I want to create a new data frame called data_per_chg that contains the percent change from "v1" for each variable. For instance, for the wbc variable, I would like to calculate the percent change of wbc.v1 from wbc.v1, wbc.v2 from wbc.v1, wbc.v3 from wbc.v1, etc., and do the same for all the remaining variables in my dataset. I assume I can use a loop to do this, but I'm pretty new to R, so I'm not quite sure how to proceed. Any guidance will be greatly appreciated.
id  wbc.v1  wbc.v2  wbc.v3  rbc.v1  rbc.v2  rbc.v3  hct.v1  hct.v2  hct.v3
a1  23      63      30      23      56      90      13      89      47
a2  81      45      46      N/A     18      78      14      45      22
a3  NA      27      14      29      67      46      37      34      33
data_per_chg <- data %>%
  group_by(id) %>%
  arrange(id) %>%
  mutate(change = (wbc.v2 - wbc.v1) / (wbc.v1))
data_per_chg
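Assuming the missing values should all be NA rather than the string "N/A", we first convert "N/A" to NA and then compute the change from the matching 'v1' column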
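library(dplyr)
library(stringr)
data <- data %>%
  # newer dplyr versions (1.1.0 and later) reject na_if() on a whole data frame,
  # so apply it to each character column instead
  mutate(across(where(is.character), ~ na_if(.x, "N/A"))) %>%
  type.convert(as.is = TRUE) %>%
  mutate(across(-c(id, matches("\\.v1$")), ~ {
    v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
    (.x - v1)/v1}, .names = "{.col}_change"))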
-output
data
id wbc.v1 wbc.v2 wbc.v3 rbc.v1 rbc.v2 rbc.v3 hct.v1 hct.v2 hct.v3 wbc.v2_change wbc.v3_change rbc.v2_change rbc.v3_change hct.v2_change hct.v3_change
1 a1 23 63 30 23 56 90 13 89 47 1.7391304 0.3043478 1.434783 2.9130435 5.84615385 2.6153846
2 a2 81 45 46 NA 18 78 14 45 22 -0.4444444 -0.4320988 NA NA 2.21428571 0.5714286
3 a3 NA 27 14 29 67 46 37 34 33 NA NA 1.310345 0.5862069 -0.08108108 -0.1081081
If we want to keep the 'v1' columns as well
data %>%
  # newer dplyr versions (1.1.0 and later) reject na_if() on a whole data frame,
  # so apply it to each character column instead
  mutate(across(where(is.character), ~ na_if(.x, "N/A"))) %>%
  type.convert(as.is = TRUE) %>%
  mutate(across(ends_with('.v1'), ~ .x - .x,
    .names = "{str_replace(.col, 'v1', 'v1change')}")) %>%
  transmute(id, across(ends_with('change')),
    across(-c(id, matches("\\.v1$"), ends_with('change')),
      ~ {
        v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
        (.x - v1)/v1}, .names = "{.col}_change")) %>%
  select(id, starts_with('wbc'), starts_with('rbc'), starts_with('hct'))
-output
id wbc.v1change wbc.v2_change wbc.v3_change rbc.v1change rbc.v2_change rbc.v3_change hct.v1change hct.v2_change hct.v3_change
1 a1 0 1.7391304 0.3043478 0 1.434783 2.9130435 0 5.84615385 2.6153846
2 a2 0 -0.4444444 -0.4320988 NA NA NA 0 2.21428571 0.5714286
3 a3 NA NA NA 0 1.310345 0.5862069 0 -0.08108108 -0.1081081
data
data <- structure(list(id = c("a1", "a2", "a3"), wbc.v1 = c(23L, 81L,
NA), wbc.v2 = c(63L, 45L, 27L), wbc.v3 = c(30L, 46L, 14L), rbc.v1 = c("23",
"N/A", "29"), rbc.v2 = c(56L, 18L, 67L), rbc.v3 = c(90L, 78L,
46L), hct.v1 = c(13L, 14L, 37L), hct.v2 = c(89L, 45L, 34L), hct.v3 = c(47L,
22L, 33L)), class = "data.frame", row.names = c(NA, -3L))
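Since the question asks about a loop, here is a minimal base R sketch of the same idea, without dplyr. The helper name pct_change_from_v1 and the reduced copy of the data are made up for illustration. Like the answer above, it returns the change as a fraction; multiply by 100 if a percentage is wanted.

```r
# Reduced copy of the question's data; "N/A" is a string that must become NA.
data <- data.frame(
  id = c("a1", "a2", "a3"),
  wbc.v1 = c(23, 81, NA), wbc.v2 = c(63, 45, 27), wbc.v3 = c(30, 46, 14),
  rbc.v1 = c("23", "N/A", "29"), rbc.v2 = c(56, 18, 67), rbc.v3 = c(90, 78, 46),
  stringsAsFactors = FALSE
)

# Turn "N/A" strings into NA and make every measurement column numeric.
value_cols <- setdiff(names(data), "id")
data[value_cols] <- lapply(data[value_cols], function(x) {
  x[x %in% "N/A"] <- NA
  as.numeric(x)
})

# Hypothetical helper: change of each column relative to its matching ".v1" column.
pct_change_from_v1 <- function(df, cols) {
  baseline <- sub("\\.v[0-9]+$", ".v1", cols)   # e.g. "wbc.v3" -> "wbc.v1"
  out <- Map(function(col, base) (df[[col]] - df[[base]]) / df[[base]],
             cols, baseline)
  names(out) <- paste0(cols, "_change")
  data.frame(out, check.names = FALSE)
}

data_per_chg <- cbind(data["id"], pct_change_from_v1(data, value_cols))
data_per_chg
```

Because the baseline column is looked up by name, the column order does not matter, as long as every variable has a ".v1" column.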
Related
I have a dataframe in the wide format such as below:
Subject  Volume.1  Volume.2  Volume.3  Volume.4
1        77        22        1         NA
2        65        182       NA        NA
3        98        NA        NA        NA
4        66        76        145       677
I want to select Volume.1 together with the largest volume among the remaining columns (Volume.2 to Volume.4), irrespective of which column it came from, but I am struggling to code this correctly. Some of the columns are NA when a subject does not have a recording at that time point.
For instance with the above example the table would look like:
Subject  Volume.1  Worst volume
1        77        22
2        65        182
3        98        NA
4        66        677
I was wondering if anyone could help?
We may use pmax
cbind(df[1:2], WorseVolume = do.call(pmax, c(df[3:5], na.rm = TRUE)))
-output
Subject Volume.1 WorseVolume
1 1 77 22
2 2 65 182
3 3 98 NA
4 4 66 677
data
df <- structure(list(Subject = 1:4, Volume.1 = c(77L, 65L, 98L, 66L
), Volume.2 = c(22L, 182L, NA, 76L), Volume.3 = c(1L, NA, NA,
145L), Volume.4 = c(NA, NA, NA, 677L)), class = "data.frame", row.names = c(NA,
-4L))
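To make the NA handling visible, here is a small sketch comparing the pmax call with a row-wise apply. The names later, worst_pmax and worst_apply are only for illustration. In both versions a row with no recordings at all stays NA.

```r
df <- data.frame(Subject = 1:4,
                 Volume.1 = c(77, 65, 98, 66),
                 Volume.2 = c(22, 182, NA, 76),
                 Volume.3 = c(1, NA, NA, 145),
                 Volume.4 = c(NA, NA, NA, 677))

later <- df[3:5]   # Volume.2 to Volume.4

# Parallel maximum across the columns, skipping NAs.
worst_pmax <- do.call(pmax, c(later, na.rm = TRUE))

# Row-wise alternative: keep NA explicitly when every value in the row is missing.
worst_apply <- apply(later, 1, function(v) {
  if (all(is.na(v))) NA_real_ else max(v, na.rm = TRUE)
})

cbind(df[1:2], WorseVolume = worst_pmax)
```

The explicit all(is.na(v)) check is needed in the apply version because max(..., na.rm = TRUE) on an all-NA row returns -Inf with a warning.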
For easier explanation I'm going to use a smaller example.
I have two DF:
DF1: T01 T02 T03 T04 T05
1 15 20 48 25 5
2 12 18 35 30 12
3 13 15 50 60 42
DF2: MEDIAN SD
T01 13 1.24
T02 18 2.05
T03 45 6.64
T04 30 15.45
T05 12 16.04
What I want to do is create a loop that adds a dummy to DF1 for each variable, which takes the value 1 if DF1$T01 is almost equal (≈) to DF2$MEDIAN[1] and 0 if it's not, and then moves on to T02, T03, and so on until the last variable.
So far I haven't been able to create a loop that does this (I'm not really good at writing loops). I did manage to make the dummy for one of the variables (T01), but in the real DF I have over 40 variables, so doing it by hand is not efficient at all. What I have right now is:
DF1$dummyt01 <- ifelse(almost.equal(DF1$T01, DF2$MEDIAN[1], tolerance = 2),1,0)
outcome expected:
DF1: T01 T02 T03 T04 T05 dummyT01 dummyT02 ... dummyT05
1 15 20 48 25 5 1 1 ... 0
2 12 18 35 30 12 1 1 ... 1
3 13 15 50 60 42 1 0 ... 0
Note: Not a native english speaker. Sorry for any mistakes.
EDIT: Expected Outcome.
We may use tidyverse. Loop across the columns of 'DF1', get the name of the current column (cur_column()), use it as a row name to subset the 'MEDIAN' element of 'DF2', and do the comparison with almost.equal, which returns a logical vector that is coerced to binary with as.integer or +. In .names, add the prefix 'dummy' so that the results are created as new columns
library(dplyr)
library(berryFunctions)
DF1 <- DF1 %>%
mutate(across(everything(), ~ +(almost.equal(.,
DF2[cur_column(), "MEDIAN"], tolerance = 1)),
.names = "dummy{.col}"))
-output
DF1
T01 T02 T03 T04 T05 dummyT01 dummyT02 dummyT03 dummyT04 dummyT05
1 15 20 48 25 5 0 0 0 0 0
2 12 18 35 30 12 1 1 0 1 1
3 13 15 50 60 42 1 0 0 0 0
Or using a for loop
for(i in seq_along(DF1))
DF1[paste0('dummy', names(DF1)[i])] <- +(almost.equal(DF1[[i]],
DF2[names(DF1)[i], "MEDIAN"], tolerance = 1))
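If installing berryFunctions is not an option, the comparison can be sketched in base R. This assumes that "almost equal" means an absolute difference of at most tol, which reproduces the output above for tol = 1 but is not guaranteed to match almost.equal in every case. The names tol, diffs, dummies and result are illustrative.

```r
DF1 <- data.frame(T01 = c(15, 12, 13), T02 = c(20, 18, 15), T03 = c(48, 35, 50),
                  T04 = c(25, 30, 60), T05 = c(5, 12, 42))
DF2 <- data.frame(MEDIAN = c(13, 18, 45, 30, 12),
                  row.names = c("T01", "T02", "T03", "T04", "T05"))

tol <- 1  # assumed tolerance, same value as in the answer above

# Subtract each column's median (matched by name), then flag small differences.
diffs <- sweep(as.matrix(DF1), 2, DF2[names(DF1), "MEDIAN"])
dummies <- +(abs(diffs) <= tol)
colnames(dummies) <- paste0("dummy", names(DF1))

result <- cbind(DF1, dummies)
result
```

With tol <- 2 the same code gives dummyT01 = 1, 1, 1, dummyT02 = 1, 1, 0 and dummyT05 = 0, 1, 0, which are the columns shown in the question's expected outcome.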
data
DF1 <- structure(list(T01 = c(15L, 12L, 13L), T02 = c(20L, 18L, 15L),
T03 = c(48L, 35L, 50L), T04 = c(25L, 30L, 60L), T05 = c(5L,
12L, 42L)), class = "data.frame", row.names = c("1", "2",
"3"))
DF2 <- structure(list(MEDIAN = c(13L, 18L, 45L, 30L, 12L), SD = c(1.24,
2.05, 6.64, 15.45, 16.04)), class = "data.frame", row.names = c("T01",
"T02", "T03", "T04", "T05"))
I have slightly edited the data table.
I would like to correlate variables with similar names in my dataset:
A_y B_y C_y A_p B_p C_p
1 15 52 32 30 98 56
2 30 99 60 56 46 25
3 10 25 31 20 22 30
..........
n 55 23 85 12 34 52
I would like to obtain correlation of
A_y-A_p: 0.78
B_y-B_p: 0.88
C_y-C_p: 0.93
How can I do it in R? Is it possible?
This is really dangerous. Behavior of data.frames with invalid column names is undefined by the language definition. Duplicated column names are invalid.
You should restructure your input data. Anyway, here is an approach with your input data.
DF <- read.table(text = " A B C A B C
1 15 52 32 30 98 56
2 30 99 60 56 46 25
3 10 25 31 20 22 30", header = TRUE, check.names = FALSE)
sapply(unique(names(DF)), function(s) do.call(cor, unname(DF[, names(DF) == s])))
# A B C
#0.9995544 0.1585501 -0.6004010
#compare:
cor(c(15, 30, 10), c(30, 56, 20))
#[1] 0.9995544
Here is another base R option
within(
  rev(
    stack(
      Map(
        function(x) do.call(cor, unname(x)),
        split.default(df, unique(gsub("_.*", "", names(df))))
      )
    )
  ),
  ind <- sapply(
    ind,
    function(x) {
      paste0(grep(paste0("^", x), names(df), value = TRUE),
        collapse = "-"
      )
    }
  )
)
which gives
ind values
1 A_y-A_p 0.9995544
2 B_y-B_p 0.1585501
3 C_y-C_p -0.6004010
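A shorter sketch of the same pairing, assuming every variable has exactly one _y and one _p column. The names stems and cors are illustrative.

```r
df <- data.frame(A_y = c(15, 30, 10), B_y = c(52, 99, 25), C_y = c(32, 60, 31),
                 A_p = c(30, 56, 20), B_p = c(98, 46, 22), C_p = c(56, 25, 30))

# Variable stems: "A", "B", "C".
stems <- unique(sub("_.*$", "", names(df)))

# Correlate each stem's _y column with its _p column, looked up by name.
cors <- vapply(stems, function(s) {
  cor(df[[paste0(s, "_y")]], df[[paste0(s, "_p")]])
}, numeric(1))
names(cors) <- paste0(stems, "_y-", stems, "_p")
cors
```

Looking the columns up by name rather than by position keeps the pairing correct even if the columns are reordered.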
Data
df <- structure(list(A_y = c(15L, 30L, 10L), B_y = c(52L, 99L, 25L),
C_y = c(32L, 60L, 31L), A_p = c(30L, 56L, 20L), B_p = c(98L,
46L, 22L), C_p = c(56L, 25L, 30L)), class = "data.frame", row.names = c("1",
"2", "3"))
I have original temperature data in table1.txt with station number header which reads as
Date 101 102 103
1/1/2001 25 24 23
1/2/2001 23 20 15
1/3/2001 22 21 17
1/4/2001 21 27 18
1/5/2001 22 30 19
I have a lookup table file lookup.txt which reads as :
ID Station
1 101
2 103
3 102
4 101
5 102
Now, I want to create a new table (new.txt) with ID number header which should read as
Date 1 2 3 4 5
1/1/2001 25 23 24 25 24
1/2/2001 23 15 20 23 20
1/3/2001 22 17 21 22 21
1/4/2001 21 18 27 21 27
1/5/2001 22 19 30 22 30
Is there any way I can do this in R or MATLAB?
I came up with a solution using tidyverse. It involves some wide to long transformation, matching the data frames on Station, and then spreading the variables.
#Recreating the data
library(tidyverse)
df1 <- read_table("table1.txt")
lookup <- read_table("lookup.txt")
#Create the output
k1 <- df1 %>%
gather(Station, value, -Date) %>%
mutate(Station = as.numeric(Station)) %>%
inner_join(lookup) %>% select(-Station) %>%
spread(ID, value)
k1
We can use base R to do this. Create a column index by matching the 'Station' column with the names of the first dataset, use that to duplicate the columns of 'df1' and then change the column names with the 'ID' column of second dataset
i1 <- with(df2, match(Station, names(df1)[-1]))
dfN <- df1[c(1, i1 + 1)]
names(dfN)[-1] <- df2$ID
dfN
# Date 1 2 3 4 5
#1 1/1/2001 25 23 24 25 24
#2 1/2/2001 23 15 20 23 20
#3 1/3/2001 22 17 21 22 21
#4 1/4/2001 21 18 27 21 27
#5 1/5/2001 22 19 30 22 30
data
df1 <- structure(list(Date = c("1/1/2001", "1/2/2001", "1/3/2001", "1/4/2001",
"1/5/2001"), `101` = c(25L, 23L, 22L, 21L, 22L), `102` = c(24L,
20L, 21L, 27L, 30L), `103` = c(23L, 15L, 17L, 18L, 19L)),
class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(ID = 1:5, Station = c(101L, 103L, 102L, 101L,
102L)), class = "data.frame", row.names = c(NA, -5L))
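A variant of the base R approach that indexes 'df1' by station name instead of by matched position, as a minimal sketch on the first three dates only:

```r
df1 <- data.frame(Date = c("1/1/2001", "1/2/2001", "1/3/2001"),
                  `101` = c(25, 23, 22), `102` = c(24, 20, 21),
                  `103` = c(23, 15, 17), check.names = FALSE)
df2 <- data.frame(ID = 1:5, Station = c(101, 103, 102, 101, 102))

# Pick the station column for each ID by name; repeated stations are duplicated.
dfN <- df1[c("Date", as.character(df2$Station))]

# Relabel the duplicated station columns with the lookup IDs.
names(dfN) <- c("Date", df2$ID)
dfN
```

This relies on every value of Station appearing as a column name in 'df1'; a station that is missing from 'df1' makes the subsetting fail with an "undefined columns selected" error.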
Here is an option with MATLAB:
T = readtable('table1.txt','FileType','text','ReadVariableNames',1);
L = readtable('lookup.txt','FileType','text','ReadVariableNames',1);
old_header = strcat('x',num2str(L.Station));
newT = array2table(zeros(height(T),height(L)+1),...
'VariableNames',[{'Date'} strcat('x',num2cell(num2str(L.ID)).')]);
newT.Date = T.Date;
for k = 1:size(old_header,1)
newT{:,k+1} = T.(old_header(k,:));
end
writetable(newT,'new.txt','Delimiter',' ')
Problem
I have data on two measures for four individuals each in a wide format. The measures are x and y and the individuals are A, B, C, D. The data frame looks like this
d <- data.frame(matrix(sample(1:100, 40, replace = F), ncol = 8))
colnames(d) <- paste(rep(c("x.", "y."),each = 4), rep(LETTERS[1:4], 2), sep ="")
d
x.A x.B x.C x.D y.A y.B y.C y.D
1 56 65 42 96 100 76 39 26
2 19 93 94 75 63 78 5 44
3 22 57 15 62 2 29 89 79
4 49 13 95 97 85 81 60 37
5 45 38 24 91 23 82 83 72
Now, what I would like to obtain for each row is the value of y for the individual with the lowest value of x.
So in the example above, the lowest value of x in row 1 is for individual C. Hence, for row 1 I would like to obtain y.C which is 39.
In the example, the resulting vector should be 39, 63, 89, 81, 83.
Approach
I have tried to get to this by first generating a matrix of the subset of d for the values of x.
t(apply(d[,1:4], 1, function(x) min(x) == x))
x.A x.B x.C x.D
[1,] FALSE FALSE TRUE FALSE
[2,] TRUE FALSE FALSE FALSE
[3,] FALSE FALSE TRUE FALSE
[4,] FALSE TRUE FALSE FALSE
[5,] FALSE FALSE TRUE FALSE
Now I wanted to apply this matrix to subset the subset of the data frame for the values of y. But I cannot find a way to achieve this.
Any help is much appreciated. Suggestions for a totally different - more elegant - approach are highly welcome too.
Thanks a lot!
We subset the dataset with the columns starting with 'x' ('dx') and 'y' ('dy'). Get the column index of the minimum value in each row of 'dx' using max.col, cbind with the row index and get the corresponding elements in 'dy'.
dx <- d[grep('^x', names(d))]
dy <- d[grep('^y', names(d))]
dy[cbind(1:nrow(dx),max.col(-dx, 'first'))]
#[1] 39 63 89 81 83
The above can easily be converted to a function
get_min <- function(dat){
dx <- dat[grep('^x', names(dat))]
dy <- dat[grep('^y', names(dat))]
dy[cbind(1:nrow(dx), max.col(-dx, 'first'))]
}
get_min(d)
#[1] 39 63 89 81 83
Or using the OP's apply based method
t(d[,5:8])[apply(d[,1:4], 1, function(x) min(x) == x)]
#[1] 39 63 89 81 83
data
d <- structure(list(x.A = c(56L, 19L, 22L, 49L, 45L),
x.B = c(65L,
93L, 57L, 13L, 38L), x.C = c(42L, 94L, 15L, 95L, 24L),
x.D = c(96L,
75L, 62L, 97L, 91L), y.A = c(100L, 63L, 2L, 85L, 23L),
y.B = c(76L,
78L, 29L, 81L, 82L), y.C = c(39L, 5L, 89L, 60L, 83L),
y.D = c(26L,
44L, 79L, 37L, 72L)), .Names = c("x.A", "x.B", "x.C",
"x.D",
"y.A", "y.B", "y.C", "y.D"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5"))
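To see why max.col on the negated values finds the row minimum, here is a short sketch that compares it with a row-wise which.min. The names min_col, min_col_apply and picked are illustrative.

```r
d <- data.frame(x.A = c(56, 19, 22, 49, 45), x.B = c(65, 93, 57, 13, 38),
                x.C = c(42, 94, 15, 95, 24), x.D = c(96, 75, 62, 97, 91),
                y.A = c(100, 63, 2, 85, 23), y.B = c(76, 78, 29, 81, 82),
                y.C = c(39, 5, 89, 60, 83), y.D = c(26, 44, 79, 37, 72))

dx <- as.matrix(d[grep("^x", names(d))])
dy <- as.matrix(d[grep("^y", names(d))])

# The largest value of -x is the smallest value of x, so this is the
# column position of each row's minimum (first one on ties).
min_col <- max.col(-dx, ties.method = "first")

# Same positions computed row by row.
min_col_apply <- unname(apply(dx, 1, which.min))

# A two-column (row, column) matrix picks one element of dy per row.
picked <- dy[cbind(seq_len(nrow(dy)), min_col)]
picked
```

Both approaches rely on the x and y columns listing the individuals in the same order.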
Here is my solution. The core idea is that there are functions which.min, which.max that can be row applied to the data frame:
Edit:
Now, what I would like to obtain for each row is the value of y for
the individual with the lowest value of x.
ind <- apply(d[ ,1:4], 1, which.min) # build column index by row
res <- d[,5:8][cbind(1:nrow(d), ind)] # rows are in order, select values by matrix
names(res) <- colnames(d)[5:8][ind] # set colnames as names from the sample column
res
y.D y.B y.D y.A y.D
18 46 16 85 80
Caveat: only works if individuals are in the same order for treatment x. and y. and all individuals present. Otherwise you can use grep like in Akrun's solution.
# My d was:
x.A x.B x.C x.D y.A y.B y.C y.D
1 88 96 65 55 14 99 63 18
2 12 11 27 45 70 46 20 69
3 32 81 21 9 77 44 91 16
4 8 84 42 78 85 94 28 90
5 31 51 83 2 67 25 54 80
We can create a function as follows,
get_min <- function(x){
d1 <- x[,1:4]
d2 <- x[,5:8]
mtrx <- as.matrix(d2[,apply(d1, 1, which.min)])
a <- row(mtrx) - col(mtrx)
split(mtrx, a)$"0"
}
get_min(d)
#[1] 39 63 89 81 83