Identifying columns with high correlation in large dataset - r

I have two large dataframes (50+ columns, many of them long character variables) and I need to identify the "link" variable that I should use to merge them. The problem is that the variable names don't match up. That is, I need to identify the pair of variables, one from each dataset, whose values have a high correlation.
As an example :
dta1 = data.frame(A = c(1 , 2,3, 4), B = c( 23, 45, 6, 8), C = c("001", "028", "076", "039"))
dta2 = data.frame(first = c(5, 6, 7, 8), second = c( 58, 32, 33, 45), third = c("008", "028", "076", "039"))
I would like the code to tell me that columns C and third have a very high correlation (they are not complete duplicates though!).
I have tried adding the two dataframes and running a cor() function, but this doesn't work with character variables.
I also tried union_all(x, y, ...) from dplyr, but that requires the same column names.
At this point I am out of ideas.
Thanks very much.

To identify the most similar columns, try the following. It compares each column of dta1 with each column of dta2, element by element, and returns a matrix with the number of matching positions for every pair.
sapply(dta1, function(x) sapply(dta2, function(y) sum(x == y)))
       A B C
first  0 1 0
second 0 0 0
third  0 0 3
From here we can see that third and C have the most matches. Now you can join your two data.frames. To keep all rows and columns, you will want a full_join from the dplyr package.
library(dplyr)
full_join(dta1, dta2, by = c("C" = "third"))
   A  B   C first second
1  1 23 001    NA     NA
2  2 45 028     6     32
3  3  6 076     7     33
4  4  8 039     8     45
5 NA NA 008     5     58
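The element-by-element comparison above only works when both data frames have the same number of rows in the same order. As a hedged alternative for data where that may not hold, the sketch below scores each column pair by the share of distinct values that also occur in the other column. The helper name overlap and the choice to compare values as character strings are assumptions made for this illustration, not part of the original answer.

```r
dta1 <- data.frame(A = c(1, 2, 3, 4), B = c(23, 45, 6, 8),
                   C = c("001", "028", "076", "039"))
dta2 <- data.frame(first = c(5, 6, 7, 8), second = c(58, 32, 33, 45),
                   third = c("008", "028", "076", "039"))

# Hypothetical helper: share of the distinct values in x that also occur in y.
# Values are compared as character strings, so row order and length do not matter.
overlap <- function(x, y) {
  ux <- unique(as.character(x))
  mean(ux %in% as.character(y))
}

# Rows are the columns of dta2, columns are the columns of dta1
overlap_matrix <- sapply(dta1, function(x) sapply(dta2, function(y) overlap(x, y)))

# Position of the best-scoring pair
best <- which(overlap_matrix == max(overlap_matrix), arr.ind = TRUE)
rownames(overlap_matrix)[best[1, "row"]]  # candidate link column in dta2
colnames(overlap_matrix)[best[1, "col"]]  # candidate link column in dta1
```

On wide tables this does one comparison per column pair, so restricting it to columns of a compatible type first may be worthwhile.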

Related

List within a data.frame column

So far I have the following data.frame, with an initial column filled with set values:
df <- data.frame(N=seq(10, 100, by=10))
Now, I want to have a second column here, which would be a list (or c()) of integers, such that the output of calling df would be as follows:
N I
1 10 2, 8, 1
2 20 4, 0, 99
.. .. ..
I tried the following: df <- data.frame(N=seq(10, 100, by=10), I=logical(10)), which puts a FALSE in every row of column I. But testing what I wanted to do with df$I[df$N == 10] <- list(2, 8, 1) throws the error:
number of items to replace is not a multiple of replacement length
Edit: I also tried using I(list(...)) to keep the list interpreted as is, but the same error was thrown.
We can create the list column by wrapping it with I() in data.frame. To assign, subset the column with the logical vector, extract the single matching element with [[1]], and replace that element:
df <- data.frame(N=seq(10, 100, by=10), I= I(vector('list', 10)))
df$I[df$N == 10][[1]] <- list(2, 8, 1)
df
# N I
#1 10 2, 8, 1
#2 20
#3 30
#4 40
#5 50
#6 60
#7 70
#8 80
#9 90
#10 100
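As a variation on the answer above, here is a minimal sketch that stores a plain integer vector (rather than a list) in each cell, using which() to turn the condition into a single position for [[<-. Storing vectors instead of lists is an assumption about what the asker wants, and it assumes each value of N occurs exactly once.

```r
# List column created up front, as in the answer above
df <- data.frame(N = seq(10, 100, by = 10), I = I(vector("list", 10)))

# which() gives the single row position, so [[<- replaces exactly one cell
df$I[[which(df$N == 10)]] <- c(2L, 8L, 1L)
df$I[[which(df$N == 20)]] <- c(4L, 0L, 99L)

# Cells that were never assigned remain NULL
df$I[[1]]
df$I[[3]]
```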

Automate replacement of missing data on a sequence of variables using mutate_all

I am trying to automate a process to complete missing values on a sequence of variables using an ifelse statement and mutate_all function. The problem involves a dataframe with many variable names, for example, ax1, bx1, ...zx1, ax2, bx2, ...zx2, ax3, bx3, ...zx3. The following data give a small scenario:
df<-data.frame(
"id" = c(1:5),
"ax1" = c(1, "NA", 8, "NA", 17),
"bx1" = c(2, 7, "NA", 11, 12),
"ax2" = c(2, 1, 8, 15, 17),
"bx2" = c(2, 6, 4, 13, 11))
The process is to replace the missing values in the variables ending in "x1" with the corresponding values from the variables ending in "x2". That is, if ax1 is missing it is replaced by ax2, any missing bx1 is replaced by bx2, and so on. Since there are many more variables than in the scenario presented here, I am looking for a way to automate this process. I have tried the following code
library(dplyr)
df <- df %>%
mutate_all(vars(ends_with("x1", "x2")), function(x,y)
ifelse(is.na(x), y, x)))
but it does not work. I greatly appreciate any help on this.
The expected output is
id ax1 bx1 ax2 bx2
1 1 2 2 2
2 1 7 1 6
3 8 4 8 4
4 15 11 15 13
5 17 12 17 11
In base R, we can replace the NA values in the x1 columns with the corresponding values from the x2 columns using Map.
x1_cols <- grep('x1$', names(df))
x2_cols <- grep('x2$', names(df))
df[x1_cols] <- Map(function(x, y) {x[is.na(x)] <- y[is.na(x)];x},
df[x1_cols], df[x2_cols])
df
# id ax1 bx1 ax2 bx2
#1 1 1 2 2 2
#2 2 1 7 1 6
#3 3 8 4 8 4
#4 4 15 11 15 13
#5 5 17 12 17 11
We can use the same logic and use purrr::map2
df[x1_cols] <- purrr::map2(df[x1_cols], df[x2_cols],
~{.x[is.na(.x)] <- .y[is.na(.x)];.x})
data
The data were modified slightly so that the NAs are actual NA values and not the string "NA", which was turning the columns into factors.
df<-data.frame(id=c(1:5),
ax1=c(1,NA,8,NA,17),
bx1=c(2,7,NA,11,12),
ax2=c(2,1,8,15,17),
bx2=c(2,6,4,13,11))
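Both versions above pair the x1 and x2 columns by position, which relies on the two groups appearing in the same order. A minimal base R sketch that pairs them by name instead is shown below. The variable names x1_names and x2_names are introduced only for this illustration, and it assumes every "x1" column has a counterpart whose name differs only in the "x2" suffix.

```r
df <- data.frame(id = 1:5,
                 ax1 = c(1, NA, 8, NA, 17),
                 bx1 = c(2, 7, NA, 11, 12),
                 ax2 = c(2, 1, 8, 15, 17),
                 bx2 = c(2, 6, 4, 13, 11))

# Derive each x2 partner from the x1 name, so column order does not matter
x1_names <- grep("x1$", names(df), value = TRUE)
x2_names <- sub("x1$", "x2", x1_names)
stopifnot(all(x2_names %in% names(df)))  # every x1 column needs a partner

for (i in seq_along(x1_names)) {
  missing <- is.na(df[[x1_names[i]]])
  # Fill only the missing positions from the partner column
  df[[x1_names[i]]][missing] <- df[[x2_names[i]]][missing]
}
df
```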

Replace row values if sum is equal to zero in R

I want to replace the values of columns by NA if the sum of their rows is equal to 0. Imagine the following columns:
a b
0 0
1 5
2 8
3 7
0 0
5 8
I would like to replace these by:
a b
NA NA
1 5
2 8
3 7
NA NA
5 8
I've been looking for answers on many pages but have not found any solution.
Here is what I have tried so far:
df[, 31:36][df[, 31:36] == 0] <- NA # df is my dataframe and 31:36 are the columns I want to apply the replacement to
This replaces every value equal to 0 with NA, not only the values in rows whose sum is 0.
I've also tried other alternatives using rowSums() but have not found a solution.
Any help would be greatly appreciated.
Thanks
How about this, for the two example columns a and b?
a <- df$a
b <- df$b
c <- a # copy of a, taken before any NA is written into it
a[a+b==0] <- NA
b[c+b==0] <- NA
df$a <- a
df$b <- b
We have to create a temporary variable called c; otherwise, when checking the second column, we would be adding NA+0, which equals NA and not 0.
An idiomatic way of doing this using dplyr would be:
library(dplyr)
tb <- tibble(
a = c(0, 1:3, 0, 5),
b = c(0, 5, 8, 7, 0, 8)
)
tb <- tb %>%
# creates a "rowsum" column storing the sum of columns 1:2
mutate(rowsum = rowSums(.[1:2])) %>%
# applies, to columns 1:2, a function that puts NA when the sum of the rows is 0
mutate_at(1:2, ~ ifelse(rowsum == 0, NA, .)) %>%
# removes rowsum
select(-rowsum)
Of course you could replace 1:2 with 31:36 when applying the code to your actual table.
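For comparison, a minimal base R sketch of the same idea: compute the row sums over the chosen columns once, then blank those rows in those columns. The small data frame and the cols vector are made up for this illustration. Note that testing rowSums(...) == 0 assumes the values are non-negative; with mixed signs a row such as -1 and 1 would also sum to 0.

```r
# Example data; "id" stands in for columns that should stay untouched
df <- data.frame(id = 1:6,
                 a = c(0, 1, 2, 3, 0, 5),
                 b = c(0, 5, 8, 7, 0, 8))

cols <- c("a", "b")  # for the real table this could be 31:36

# TRUE for rows whose selected columns sum to 0
zero_rows <- rowSums(df[cols]) == 0

# Replace the selected columns with NA in exactly those rows
df[zero_rows, cols] <- NA
df
```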

subset using R language

I have 2 tables
In the first table, I have two columns. In the first column, the values run from 1 to 2 million (call them x). In the second column, I have random numbers (call them y).
In the second table, I have two columns. In the first column, I have the same x values, but instead of running from 1 to 2 million they are a subset in increasing order, like 222, 249, 562, and so on. In the second column, I have random numbers (call them z).
Now, I am trying to add a third column to my second table with the y values from the first table. I decided to use apply, but you can use join or merge, whichever is more efficient. Here the x value connects the y and the z.
To start with a minimal data, you can use this code:
t1 <- cbind(1:20, sample(100:999, 20, TRUE))
t2 <- rbind(c(2, 4), c(6, 12), c(17, 18))
apply(t2, 1, function(...) )
Could you help me to fill the ... blanks.
The output should be of the form:
2 4 --
6 12 --
17 18 --
You can use merge for this:
merge(as.data.frame(t2), as.data.frame(t1), by='V1')
V1 V2.x V2.y
1 2 4 751
2 6 12 298
3 17 18 218
Does this meet your requirements?
require(plyr)
t1 <- as.data.frame(cbind(1:20, sample(100:999, 20, TRUE)))
t2 <- as.data.frame(rbind(c(2, 4), c(6, 12), c(17, 18)))
t3 <- join(t2, t1, type = "left", by = "V1")
t3
V1 V2 V2
1 2 4 779
2 6 12 898
3 17 18 903
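Since the question starts from plain matrices, a lookup with match() is another option that avoids converting to data frames. The sketch below is an illustration under the assumption that every x value in t2 occurs in t1; any x without a partner would simply receive NA. The names y and t3 are chosen for this example.

```r
set.seed(1)  # only so that the example is reproducible
t1 <- cbind(1:20, sample(100:999, 20, TRUE))
t2 <- rbind(c(2, 4), c(6, 12), c(17, 18))

# For each x in t2, find its row in t1 and take the y value from that row
y <- t1[match(t2[, 1], t1[, 1]), 2]

# Append y as a third column of t2
t3 <- cbind(t2, y)
t3
```

match() looks up all keys in a single vectorised call, which is the reason it is suggested here instead of a row-wise apply().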

Altering a data frame in R

I have a data frame that has the first column go from 1 to 365 like this
c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2...
and the second column has times that repeat over and over again like this
c(0,30,130,200,230,300,330,400,430,500,0,30,130,200,230,300,330,400,430,500...
So for every 1 in the first column there is a corresponding time in the second column; when the 2s start, the times start over and each 2 has a corresponding time.
Occasionally I will come across something like this:
c(3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4...
c(0,30,130,200,230,330,400,430,500,0,30,130,200,230,300,330,400,430,500...
Here one of the 3s is missing, and the corresponding time of 300 is missing with it.
How can I go through my entire data frame and add these missing values? I need a way for R to identify any missing values, insert a row, and put the appropriate value (1 to 365) in column one and the appropriate time with it. For the given example, R would add a row between 230 and 330 and then place a 3 in the first column and 300 in the second. In some parts of the data several consecutive values are missing, not just one here and there.
EDIT: Solution with all 10 times clearly specified in advance and code tidy up/commenting
You need to create another data.frame containing every possible row and then merge it with your data.frame. The key aspect is the all.x = TRUE in the final merge which forces the gaps in your data to be highlighted. I simulated the gaps by sampling only 15 of the first 20 possible day/time combinations in your.dat
# create vectors for the days and times
the.days = 1:365
the.times = c(0,30,100,130,200,230,330,400,430,500) # the 10 times to repeat
# create a master data.frame with all the times repeated for each day, taking only the first 20 observations
dat.all = data.frame(x1=rep(the.days, each=10), x2 = rep(the.times,times = 365))[1:20,]
# mimic your data.frame with some gaps in it (only 15 of 20 observations are present)
your.sample = sample(1:20, 15)
your.dat = data.frame(x1=rep(the.days, each=10), x2 = rep(the.times,times = 365), x3 = rnorm(365*10))[your.sample,]
# left outer join merge to include ALL of the master set and all of your matching subset, filling blanks with NA
merge(dat.all, your.dat, all.x = TRUE)
Here is the output from the merge, showing all 20 possible records with the gaps clearly visible as NA:
x1 x2 x3
1 1 0 NA
2 1 30 1.23128294
3 1 100 0.95806838
4 1 130 2.27075361
5 1 200 0.45347199
6 1 230 -1.61945983
7 1 330 NA
8 1 400 -0.98702883
9 1 430 NA
10 1 500 0.09342522
11 2 0 0.44340164
12 2 30 0.61114408
13 2 100 0.94592127
14 2 130 0.48916825
15 2 200 0.48850478
16 2 230 NA
17 2 330 0.52789171
18 2 400 -0.16939587
19 2 430 0.20961745
20 2 500 NA
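The same master-table idea can be sketched with the time values listed in the question, using expand.grid() to build every day/time combination. The column names day, time and value, and the tiny two-day data set, are made up for this illustration.

```r
# The ten times as listed in the question
times <- c(0, 30, 130, 200, 230, 300, 330, 400, 430, 500)

# Hypothetical data for days 3 and 4, with the 300 reading of day 3 missing
dat <- data.frame(day = rep(3:4, each = 10), time = rep(times, times = 2))
dat$value <- seq_len(nrow(dat))
dat <- dat[!(dat$day == 3 & dat$time == 300), ]

# Every day/time combination that should exist
grid <- expand.grid(time = times, day = unique(dat$day))

# Left join on both keys: combinations absent from dat get NA in value
filled <- merge(grid, dat, by = c("day", "time"), all.x = TRUE)
filled <- filled[order(filled$day, filled$time), ]
filled[is.na(filled$value), ]
```

For the full problem, day = 1:365 could be passed to expand.grid() so that entirely missing days are filled in as well.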
Here are a few NA handling functions that could help you getting started.
For the inserting task, you should provide your own data using dput or a reproducible example.
df <- data.frame(x = sample(c(1, 2, 3, 4), 100, replace = TRUE),
y = sample(c(0,30,130,200,230,300,330,400,430,500), 100, replace = TRUE))
# Set the first 20 values of x to NA
df[1:20, 1] <- NA
# Set y to NA wherever it is 0
df$y <- ifelse(df$y == 0, NA, df$y)
# Columns x and y now have NAs in different places.
# Logical test for NA
is.na(df)
# Keep the rows where one column is not NA
df[!is.na(df$x),]
df[!is.na(df$y),]
# Returns the rows that are complete in both columns
df[complete.cases(df),]
# Returns the rows that are incomplete
df[!complete.cases(df),]
# Returns the rows without any NA
na.omit(df)
