Correlation of similar variables in R - r

I have slightly edited the data table.
I would like to correlate variable with similar name in my dataset:
A_y B_y C_y A_p B_p C_p
1 15 52 32 30 98 56
2 30 99 60 56 46 25
3 10 25 31 20 22 30
..........
n 55 23 85 12 34 52
I would like to obtain correlation of
A_y-A_p: 0.78
B_y-B_p: 0.88
C_y-C_p: 0.93
How can I do it in R? Is it possible?

This is really dangerous. Behavior of data.frames with invalid column names is undefined by the language definition. Duplicated column names are invalid.
You should restructure your input data. Anyway, here is an approach with your input data.
DF <- read.table(text = " A B C A B C
1 15 52 32 30 98 56
2 30 99 60 56 46 25
3 10 25 31 20 22 30", header = TRUE, check.names = FALSE)
sapply(unique(names(DF)), function(s) do.call(cor, unname(DF[, names(DF) == s])))
# A B C
#0.9995544 0.1585501 -0.6004010
#compare:
cor(c(15, 30, 10), c(30, 56, 20))
#[1] 0.9995544

Here is another base R option
within(
rev(
stack(
Map(
function(x) do.call(cor, unname(x)),
split.default(df, unique(gsub("_.*", "", names(df))))
)
)
),
ind <- sapply(
ind,
function(x) {
paste0(grep(paste0("^", x), names(df), value = TRUE),
collapse = "-"
)
}
)
)
which gives
ind values
1 A_y-A_p 0.9995544
2 B_y-B_p 0.1585501
3 C_y-C_p -0.6004010
Data
df <- structure(list(A_y = c(15L, 30L, 10L), B_y = c(52L, 99L, 25L),
C_y = c(32L, 60L, 31L), A_p = c(30L, 56L, 20L), B_p = c(98L,
46L, 22L), C_p = c(56L, 25L, 30L)), class = "data.frame", row.names = c("1",
"2", "3"))

Related

Calculating Percent Change in R for Multiple Variables

I'm trying to calculate percent change in R with each of the time points included in the column label (table below). I have dplyr loaded and my dataset was loaded in R and I named it data. Below is the code I'm using but it's not calculating correctly. I want to create a new dataframe called data_per_chg which contains the percent change from "v1" each variable from. For instance, for wbc variable, I would like to calculate percent change of wbc.v1 from wbc.v1, wbc.v2 from wbc.v1, wbc.v3 from wbc.v1, etc, and do that for all the remaining variables in my dataset. I'm assuming I can probably use a loop to easily do this but I'm pretty new to R so I'm not quite sure how proceed. Any guidance will be greatly appreciated.
id
wbc.v1
wbc.v2
wbc.v3
rbc.v1
rbc.v2
rbc.v3
hct.v1
hct.v2
hct.v3
a1
23
63
30
23
56
90
13
89
47
a2
81
45
46
N/A
18
78
14
45
22
a3
NA
27
14
29
67
46
37
34
33
data_per_chg<-data%>%
group_by(id%>%
arrange(id)%>%
mutate(change=(wbc.v2-wbc.v1)/(wbc.v1))
data_per_chg
Assuming the NA values are all NA and no N/A
library(dplyr)
library(stringr)
data <- data %>%
na_if("N/A") %>%
type.convert(as.is = TRUE) %>%
mutate(across(-c(id, matches("\\.v1$")), ~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change"))
-output
data
id wbc.v1 wbc.v2 wbc.v3 rbc.v1 rbc.v2 rbc.v3 hct.v1 hct.v2 hct.v3 wbc.v2_change wbc.v3_change rbc.v2_change rbc.v3_change hct.v2_change hct.v3_change
1 a1 23 63 30 23 56 90 13 89 47 1.7391304 0.3043478 1.434783 2.9130435 5.84615385 2.6153846
2 a2 81 45 46 NA 18 78 14 45 22 -0.4444444 -0.4320988 NA NA 2.21428571 0.5714286
3 a3 NA 27 14 29 67 46 37 34 33 NA NA 1.310345 0.5862069 -0.08108108 -0.1081081
If we want to keep the 'v1' columns as well
data %>%
na_if("N/A") %>%
type.convert(as.is = TRUE) %>%
mutate(across(ends_with('.v1'), ~ .x - .x,
.names = "{str_replace(.col, 'v1', 'v1change')}")) %>%
transmute(id, across(ends_with('change')),
across(-c(id, matches("\\.v1$"), ends_with('change')),
~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change")) %>%
select(id, starts_with('wbc'), starts_with('rbc'), starts_with('hct'))
-output
id wbc.v1change wbc.v2_change wbc.v3_change rbc.v1change rbc.v2_change rbc.v3_change hct.v1change hct.v2_change hct.v3_change
1 a1 0 1.7391304 0.3043478 0 1.434783 2.9130435 0 5.84615385 2.6153846
2 a2 0 -0.4444444 -0.4320988 NA NA NA 0 2.21428571 0.5714286
3 a3 NA NA NA 0 1.310345 0.5862069 0 -0.08108108 -0.1081081
data
data <- structure(list(id = c("a1", "a2", "a3"), wbc.v1 = c(23L, 81L,
NA), wbc.v2 = c(63L, 45L, 27L), wbc.v3 = c(30L, 46L, 14L), rbc.v1 = c("23",
"N/A", "29"), rbc.v2 = c(56L, 18L, 67L), rbc.v3 = c(90L, 78L,
46L), hct.v1 = c(13L, 14L, 37L), hct.v2 = c(89L, 45L, 34L), hct.v3 = c(47L,
22L, 33L)), class = "data.frame", row.names = c(NA, -3L))

How to create a new table from original data and lookup table in R or matlab?

I have original temperature data in table1.txt with station number header which reads as
Date 101 102 103
1/1/2001 25 24 23
1/2/2001 23 20 15
1/3/2001 22 21 17
1/4/2001 21 27 18
1/5/2001 22 30 19
I have a lookup table file lookup.txt which reads as :
ID Station
1 101
2 103
3 102
4 101
5 102
Now, I want to create a new table (new.txt) with ID number header which should read as
Date 1 2 3 4 5
1/1/2001 25 23 24 25 24
1/2/2001 23 15 20 23 20
1/3/2001 22 17 21 22 21
1/4/2001 21 18 27 21 27
1/5/2001 22 19 30 22 30
Is there anyway I can do this in R or matlab??
I came up with a solution using tidyverse. It involves some wide to long transformation, matching the data frames on Station, and then spreading the variables.
#Recreating the data
library(tidyverse)
df1 <- read_table("text1.txt")
lookup <- read_table("lookup.txt")
#Create the output
k1 <- df1 %>%
gather(Station, value, -Date) %>%
mutate(Station = as.numeric(Station)) %>%
inner_join(lookup) %>% select(-Station) %>%
spread(ID, value)
k1
We can use base R to do this. Create a column index by matching the 'Station' column with the names of the first dataset, use that to duplicate the columns of 'df1' and then change the column names with the 'ID' column of second dataset
i1 <- with(df2, match(Station, names(df1)[-1]))
dfN <- df1[c(1, i1 + 1)]
names(dfN)[-1] <- df2$ID
dfN
# Date 1 2 3 4 5
#1 1/1/2001 25 23 24 25 24
#2 1/2/2001 23 15 20 23 20
#3 1/3/2001 22 17 21 22 21
#4 1/4/2001 21 18 27 21 27
#5 1/5/2001 22 19 30 22 30
data
df1 <- structure(list(Date = c("1/1/2001", "1/2/2001", "1/3/2001", "1/4/2001",
"1/5/2001"), `101` = c(25L, 23L, 22L, 21L, 22L), `102` = c(24L,
20L, 21L, 27L, 30L), `103` = c(23L, 15L, 17L, 18L, 19L)),
class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(ID = 1:5, Station = c(101L, 103L, 102L, 101L,
102L)), class = "data.frame", row.names = c(NA, -5L))
Here is an option with MatLab:
T = readtable('table1.txt','FileType','text','ReadVariableNames',1);
L = readtable('lookup.txt','FileType','text','ReadVariableNames',1);
old_header = strcat('x',num2str(L.Station));
newT = array2table(zeros(height(T),height(L)+1),...
'VariableNames',[{'Date'} strcat('x',num2cell(num2str(L.ID)).')]);
newT.Date = T.Date;
for k = 1:size(old_header,1)
newT{:,k+1} = T.(old_header(k,:));
end
writetable(newT,'new.txt','Delimiter',' ')

R programming_ Subsetting rows on logic conditions

Sample data:
sampleData
Ozone Solar.R Wind Temp Month Day sampleData.Ozone
1 41 190 7.4 67 5 1 41
2 36 118 8.0 72 5 2 36
3 12 149 12.6 74 5 3 12
.........
Want to extract records on the condition $ozone > 31
Here is the code:
data <- sampleData[sampleData$ozone > 31]
And get the error below:
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L) X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
How should I correct it? Thanks!
R is case sensitive, so your ozone has to match the name in your data.frame. Also to subset a data.frame, you need two indices (row and column) separated by a comma. If there is nothing after the comma, it means that you are selecting all the columns:
sampleData[sampleData$Ozone > 31,]
Other methods to subset a data.frame:
subset(sampleData, Ozone > 31)
or with dplyr:
library(dplyr)
sampleData %>%
filter(Ozone > 31)
Result:
Ozone Solar.R Wind Temp Month Day sampleData.Ozone
1 41 190 7.4 67 5 1 41
2 36 118 8.0 72 5 2 36
Data:
sampleData = structure(list(Ozone = c(41L, 36L, 12L), Solar.R = c(190L, 118L,
149L), Wind = c(7.4, 8, 12.6), Temp = c(67L, 72L, 74L), Month = c(5L,
5L, 5L), Day = 1:3, sampleData.Ozone = c(41L, 36L, 12L)), .Names = c("Ozone",
"Solar.R", "Wind", "Temp", "Month", "Day", "sampleData.Ozone"
), class = "data.frame", row.names = c("1", "2", "3"))

Subset data frame with matrix of logical values

Problem
I have data on two measures for four individuals each in a wide format. The measures are x and y and the individuals are A, B, C, D. The data frame looks like this
d <- data.frame(matrix(sample(1:100, 40, replace = F), ncol = 8))
colnames(d) <- paste(rep(c("x.", "y."),each = 4), rep(LETTERS[1:4], 2), sep ="")
d
x.A x.B x.C x.D y.A y.B y.C y.D
1 56 65 42 96 100 76 39 26
2 19 93 94 75 63 78 5 44
3 22 57 15 62 2 29 89 79
4 49 13 95 97 85 81 60 37
5 45 38 24 91 23 82 83 72
Now, would I would like to obtain for each row is the value of y for the individual with the lowest value of x.
So in the example above, the lowest value of x in row 1 is for individual C. Hence, for row 1 I would like to obtain y.C which is 39.
In the example, the resulting vector should be 39, 63, 89, 81, 83.
Approach
I have tried to get to this by first generating a matrix of the subset of d for the values of x.
t(apply(d[,1:4], 1, function(x) min(x) == x))
x.A x.B x.C x.D
[1,] FALSE FALSE TRUE FALSE
[2,] TRUE FALSE FALSE FALSE
[3,] FALSE FALSE TRUE FALSE
[4,] FALSE TRUE FALSE FALSE
[5,] FALSE FALSE TRUE FALSE
Now I wanted to apply this matrix to subset the subset of the data frame for the values of y. But I cannot find a way to achieve this.
Any help is much appreciated. Suggestions for a totally different - more elegant - approach are highly welcome too.
Thanks a lot!
We subset the dataset with the columns starting with 'x' ('dx') and 'y' ('dy'). Get the column index of the minimum value in each row of 'dx' using max.col, cbind with the row index and get the corresponding elements in 'dy'.
dx <- d[grep('^x', names(d))]
dy <- d[grep('^y', names(d))]
dy[cbind(1:nrow(dx),max.col(-dx, 'first'))]
#[1] 39 63 89 81 83
The above can be easily be converted to a function
get_min <- function(dat){
dx <- dat[grep('^x', names(dat))]
dy <- dat[grep('^y', names(dat))]
dy[cbind(1:nrow(dx), max.col(-dx, 'first'))]
}
get_min(d)
#[1] 39 63 89 81 83
Or using the OP's apply based method
t(d[,5:8])[apply(d[,1:4], 1, function(x) min(x) == x)]
#[1] 39 63 89 81 83
data
d <- structure(list(x.A = c(56L, 19L, 22L, 49L, 45L),
x.B = c(65L,
93L, 57L, 13L, 38L), x.C = c(42L, 94L, 15L, 95L, 24L),
x.D = c(96L,
75L, 62L, 97L, 91L), y.A = c(100L, 63L, 2L, 85L, 23L),
y.B = c(76L,
78L, 29L, 81L, 82L), y.C = c(39L, 5L, 89L, 60L, 83L),
y.D = c(26L,
44L, 79L, 37L, 72L)), .Names = c("x.A", "x.B", "x.C",
"x.D",
"y.A", "y.B", "y.C", "y.D"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5"))
Here is my solution. The core idea is that there are functions which.min, which.max that can be row applied to the data frame:
Edit:
Now, would I would like to obtain for each row is the value of y for
the individual with the lowest value of x.
ind <- apply(d[ ,1:4], 1, which.min) # build column index by row
res <- d[,5:8][cbind(1:nrow(d), ind)] # rows are in order, select values by matrix
names(res) <- colnames(d)[5:8][ind] # set colnames as names from the sample column
res
y.D y.B y.D y.A y.D
18 46 16 85 80
Caveat: only works if individuals are in the same order for treatment x. and y. and all individuals present. Otherwise you can use grep like in Akrun's solution.
# My d was:
x.A x.B x.C x.D y.A y.B y.C y.D
1 88 96 65 55 14 99 63 18
2 12 11 27 45 70 46 20 69
3 32 81 21 9 77 44 91 16
4 8 84 42 78 85 94 28 90
5 31 51 83 2 67 25 54 80
We can create a function as follows,
get_min <- function(x){
d1 <- x[,1:4]
d2 <- x[,5:8]
mtrx <- as.matrix(d2[,apply(d1, 1, which.min)])
a <- row(mtrx) - col(mtrx)
split(mtrx, a)$"0"
}
get_min(d)
#[1] 39 63 89 81 83

Matching dataframes with data.table

I need to fill a matrix (MA) with information from a long data frame (DF) using another matrix as identifier (ID.MA).
An idea of my three matrices:
MA.ID creates an identifier to look in the big DF the needed variables:
a b c
a ID.aa ID.ab ID.ac
b ID.ba ID.bb ID.bc
c ID.ca ID.cb ID.cc
The original big data frame has useless information but has also the rows that are useful for me to fill the target MA matrix:
ID 1990 1991 1992
ID.aa 10 11 12
ID.ab 13 14 15
ID.ac 16 17 18
ID.ba 19 20 21
ID.bb 22 23 24
ID.bc 25 26 27
ID.ca 28 29 30
ID.cb 31 32 33
ID.cc 34 35 36
ID.xx 40 40 55
ID.xy 50 51 45
....
MA should be filled with cross-information. In my example it should look like that for a chosen column of DF (let's say, 1990):
a b c
a 10 13 16
b 19 22 25
c 28 31 34
I've tried to use match but honestly it didn't work out:
MA$a = DF[match(MA.ID$a, DF$ID),2]
I was recommended to use the data.table package but I couldn't see how that would help me.
Anyone has any good way to approach this problem?
Supposing that your input are dataframes, then you could do the following:
library(data.table)
setDT(ma)[, lapply(.SD, function(x) x = unlist(df[match(x,df$ID), "1990"]))
, .SDcols = colnames(ma)]
which returns:
a b c
1: 10 13 16
2: 19 22 25
3: 28 31 34
Explanation:
With setDT(ma) you transform the dataframe into a datatable (which is an enhanced dataframe).
With .SDcols=colnames(ma) you specify on which columns the transformation has to be applied.
lapply(.SD, function(x) x = unlist(df[match(x,df$ID),"1990"])) performs the matching operation on each column specified with .SDcols.
An alternative approach with data.table is first transforming ma to a long data.table:
ma2 <- melt(setDT(ma), measure.vars = c("a","b","c"))
setkey(ma2, value) # set key by which 'ma' has to be indexed
setDT(df, key="ID") # transform to a datatable & set key by which 'df' has to be indexed
# joining the values of the 1990 column of df into
# the right place in the value column of 'ma'
ma2[df, value := `1990`]
which gives:
> ma2
variable value
1: a 10
2: b 13
3: c 16
4: a 19
5: b 22
6: c 25
7: a 28
8: b 31
9: c 34
The only drawback of this method is that the numeric values in the 'value' column get stored as character values. You can correct this by extending it as follows:
ma2[df, value := `1990`][, value := as.numeric(value)]
If you want to change it back to wide format you could use the rowid function within dcast:
ma3 <- dcast(ma2, rowid(variable) ~ variable, value.var = "value")[, variable := NULL]
which gives:
> ma3
a b c
1: 10 13 16
2: 19 22 25
3: 28 31 34
Used data:
ma <- structure(list(a = structure(1:3, .Label = c("ID.aa", "ID.ba", "ID.ca"), class = "factor"),
b = structure(1:3, .Label = c("ID.ab", "ID.bb", "ID.cb"), class = "factor"),
c = structure(1:3, .Label = c("ID.ac", "ID.bc", "ID.cc"), class = "factor")),
.Names = c("a", "b", "c"), class = "data.frame", row.names = c(NA, -3L))
df <- structure(list(ID = structure(1:9, .Label = c("ID.aa", "ID.ab", "ID.ac", "ID.ba", "ID.bb", "ID.bc", "ID.ca", "ID.cb", "ID.cc"), class = "factor"),
`1990` = c(10L, 13L, 16L, 19L, 22L, 25L, 28L, 31L, 34L),
`1991` = c(11L, 14L, 17L, 20L, 23L, 26L, 29L, 32L, 35L),
`1992` = c(12L, 15L, 18L, 21L, 24L, 27L, 30L, 33L, 36L)),
.Names = c("ID", "1990", "1991", "1992"), class = "data.frame", row.names = c(NA, -9L))
In base R, it can be seen as a job for outer:
> outer(1:nrow(MA.ID), 1:ncol(MA.ID), Vectorize(function(x,y) {DF[which(DF$ID==MA.ID[x,y]),'1990']}))
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 19 22 25
[3,] 28 31 34
Explanations:
outer creates a matrix as the outer product of the first argument X (here a b c) and the second argument Y (here the same, a b c)
for every combination of X and Y, it applies a function that looks up the value in DF, at the row where the ID is MA.ID[X,Y], and at the column 1990
the important trick here is to wrap the function with Vectorize, because outer expects a vectorized function
the result is finally returned as matrix
Alternatively, another way to do it (still in base R) is:
to convert your data frame MA.ID into a vector
sapply a quick function that looks up correspondance with DF$ID
and convert back to a matrix.
This works:
> structure(
sapply(unlist(MA.ID),
function(id){DF[which(DF$ID==id),'1990']}),
dim=dim(MA.ID), names=NULL)
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 19 22 25
[3,] 28 31 34
(here the call to structure(..., dim=dim(MA.ID), names=NULL) converts back the vector to a matrix)

Resources