Linking rows in a data frame - r

I have a data frame with 4 columns.
m <-c(1,2,3,4)
e <-c('01/01/1970', '02/01/1981','03/05/1986','01/01/1970')
z <-c(111,123, 151, 111)
l <-c('XAR', 'XAR', 'XUI','XUI' )
q <-c(673, 673, 304, 455)
df <- data.frame(m,e,z,l,q)
I need to create a new df that describes the relationships between rows.
There is a relationship if a row matches another row in any 2 of the 4 fields.
For instance:
The resulting df in this case would be:
In my production data there are 700,000 rows. I've tried to solve this using SQL, but the recursive nature of the function makes it too slow for production purposes.
I wondered if R/R packages had any graphing capability to make this practical.

It's not entirely clear what output you expect.
In any case, data.table makes it easy and fast to identify rows with common values:
library(data.table)
# convert your data frame into data table
setDT(df)
# create common id for rows with same values in 'e' AND 'z'
df[, id_ez :=.GRP, by=.(e,z)]
# create common id for rows with same values in 'l' AND 'q'
df[, id_lq :=.GRP, by=.(l,q)]
> head(df)
> m e z l q id_ez id_lq
> 1: 1 01/01/1970 111 XAR 673 1 1
> 2: 2 02/01/1981 123 XAR 673 2 1
> 3: 3 03/05/1986 151 XUI 304 3 2
> 4: 4 01/01/1970 111 XUI 455 1 3
Now you can get a two-column output that tells you which 'm' is linked to each id:
df[, .(m_linked=paste(m)), by=id_ez]
> id_ez m_linked
> 1: 1 1
> 2: 1 4
> 3: 2 2
> 4: 3 3
If you want to turn this table into a list of vectors:
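# store the two-column table first, then split 'm_linked' by id
a <- df[, .(m_linked=paste(m)), by=id_ez]
mysplit = split(a$m_linked, a$id_ez)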
myresult = lapply(mysplit, as.character)
> myresult
$`1`
[1] "1" "4"
$`2`
[1] "2"
$`3`
[1] "3"
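The answer above only checks two of the six possible field pairs (e/z and l/q). A minimal base R sketch of extending the idea to every pair of fields follows. It assumes the four fields are e, z, l and q, and that m is just a row identifier. The names `fields`, `field_pairs` and `links` are made up for illustration.

```r
df <- data.frame(
  m = c(1, 2, 3, 4),
  e = c("01/01/1970", "02/01/1981", "03/05/1986", "01/01/1970"),
  z = c(111, 123, 151, 111),
  l = c("XAR", "XAR", "XUI", "XUI"),
  q = c(673, 673, 304, 455)
)

# assumption: these are the 4 fields to compare; m only identifies the row
fields <- c("e", "z", "l", "q")

# every unordered pair of fields: e-z, e-l, e-q, z-l, z-q, l-q
field_pairs <- combn(fields, 2, simplify = FALSE)

links <- do.call(rbind, lapply(field_pairs, function(p) {
  # self-join on the two fields; the suffixes tell the two copies of m apart
  j <- merge(df[c("m", p)], df[c("m", p)], by = p,
             suffixes = c("_from", "_to"))
  # keep each pair once and drop rows matched with themselves
  j[j$m_from < j$m_to, c("m_from", "m_to")]
}))

# a pair of rows can match on more than one field pair, so de-duplicate
links <- unique(links)
links
```

With the sample data this links row 1 to row 4 (shared e and z) and row 1 to row 2 (shared l and q). On 700,000 rows the self-joins can grow quickly if many rows share the same values, so treat this as a starting point rather than a tuned solution. If transitive groups are needed, the pairs could be handed to a graph package; igraph's graph_from_data_frame() and components() are one option, not tested here.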

Related

If values of a column are in between two columns in R, populate a new column

I have two data frames of different lengths, like :
df1
locusnum CHR MinBP MaxBP
1: 1 1 13982248 14126651
2: 2 1 21538708 21560253
3: 3 1 28892760 28992798
4: 4 1 43760070 43927877
5: 5 1 149999059 150971195
6: 6 1 200299701 200441048
df2
position chr
27751 13982716 1
27750 13982728 1
10256 13984208 1
27729 13985591 1
27730 13988076 1
27731 13988403 1
Both data frames have other columns. df2 has 60000 rows and df1 has 64 rows.
I want to populate a new column in df2 with locusnum from df1. The condition would be df2$chr == df1$CHR & df2$position %in% df1$MinBP:df1$MaxBP
My expected output would be
position chr locusnum
27751 13982716 1 1
27750 13982728 1 1
10256 13984208 1 1
27729 13985591 1 1
27730 13988076 1 1
27731 13988403 1 1
So far I have tried an if statement and a for loop, as below:
if (df2$chr == df1$CHR & df2$position >= df1$MinBP & df2$position <= df1$MaxBP) df2$locusnum=df1$locusnum
and
for(i in 1:length(df2$position)){ #runs the following code for each line
if(df2$chr[i] == df1$CHR & df2$position[i] %in% df1$MinBP:df1$MaxBP){ #if logical TRUE then it runs the next line
df2$locusnum[i] <- df1$locusnum #gives value of another column to a new column
}
}
but got error:
the condition has length > 1
longer object length is not a multiple of shorter object length
Any help? Did I explain the issue clearly?
Using foverlaps(...) from the data.table package.
Your example is uninteresting because all the rows correspond to locusnum = 1, so I changed df2 a little bit to demonstrate how this works.
##
# df1 is as you provided it
# in df2: note changes to position column in row 2, 3, and 6
#
df2 <- read.table(text="
id position chr
27751 13982716 1
27750 21538718 1
10256 43760080 1
27729 13985591 1
27730 13988076 1
27731 200299711 1", header=TRUE)
##
# you start here
#
library(data.table)
setDT(df1)
setDT(df2)
df2[, c('indx', 'start', 'end'):=.(seq(.N), position, position)]
setkey(df1, CHR, MinBP, MaxBP)
setkey(df2, chr, start, end)
result <- foverlaps(df2, df1)[order(indx), .(id, position, chr, locusnum)]
## id position chr locusnum
## 1: 27751 13982716 1 1
## 2: 27750 21538718 1 2
## 3: 10256 43760080 1 4
## 4: 27729 13985591 1 1
## 5: 27730 13988076 1 1
## 6: 27731 200299711 1 6
foverlaps(...) works best if both data.tables are keyed, but this changes the order of the rows in df2, so I added an index column to recover the original ordering, then removed it at the end.
This should be extremely fast but 60,000 rows is a tiny data-set tbh so you might not notice a difference.
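For comparison, here is a minimal base R sketch of the same lookup without data.table. The errors in the question come from `if`, which expects a single TRUE/FALSE, being handed whole vectors. Looping over the 64 rows of df1 and assigning through a logical index avoids that. The data below are the six df1 rows from the question and the modified df2 from this answer; `hit` is a made-up helper name, and the sketch assumes the intervals in df1 do not overlap (otherwise later rows overwrite earlier ones).

```r
df1 <- data.frame(
  locusnum = 1:6,
  CHR = 1,
  MinBP = c(13982248, 21538708, 28892760, 43760070, 149999059, 200299701),
  MaxBP = c(14126651, 21560253, 28992798, 43927877, 150971195, 200441048)
)
df2 <- data.frame(
  id = c(27751, 27750, 10256, 27729, 27730, 27731),
  position = c(13982716, 21538718, 43760080, 13985591, 13988076, 200299711),
  chr = 1
)

# start with NA so positions outside every interval stay unassigned
df2$locusnum <- NA_integer_

# loop over the short table (df1); each step is vectorised over df2
for (i in seq_len(nrow(df1))) {
  hit <- df2$chr == df1$CHR[i] &
    df2$position >= df1$MinBP[i] &
    df2$position <= df1$MaxBP[i]
  df2$locusnum[hit] <- df1$locusnum[i]
}
df2
```

This keeps df2 in its original order, so no index column is needed. For much larger interval tables the keyed foverlaps(...) approach is likely the better fit.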

Split alternating values in a numeric vector into columns of a data-frame in R

I have a dataframe called data, which contains a column of numeric vectors called "time." Each row is its own vector.
>data[1, ]$time
3000 3000 2991 2961 2958 2947 2925 2836 2890
>str(data$time)
$ time :List of 20
..$ : num 3000 3000 2991 2961 2958 ...
I want to convert each row value of the time column into a dataframe such that every odd index value is in a column of the dataframe called "odd" and every even index value is in a column called "even."
Here is a dummy example of how I may do that:
test <- c(3000,3000,2991,2961,2958,2947)
test <- data.frame(test)
odd <- test[ c(TRUE,FALSE), ]
even <- test[ !c(TRUE,FALSE), ]
finalData <- data.frame(odd = odd, even = even )
The resulting output is now:
> finalData
odd even
1 3000 3000
2 2991 2961
3 2958 2947
I just can't figure out how to do this process row-by-row with the real dataframe above. I want to replace the time column in the above dataframe with a dataframe containing two odd and even columns. Is this clear?
I'm guessing your data looks something like this?
time <- list(1:4, 1:6, 1:3)
data <- data.frame(cbind(time), dd=1:3)
data
# time dd
# 1 1, 2, 3, 4 1
# 2 1, 2, 3, 4, 5, 6 2
# 3 1, 2, 3 3
In which case something like this should do
lapply(data$time, function(x) {
odd <- seq_along(x) %% 2 == 1
o <- x[odd]
e <- x[!odd]
length(e) <- length(o)
data.frame(o, e)
})
# [[1]]
# o e
# 1 1 2
# 2 3 4
# [[2]]
# o e
# 1 1 2
# 2 3 4
# 3 5 6
# [[3]]
# o e
# 1 1 2
# 2 3 NA
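Do notice that data.frame() requires the odd and even vectors to have the same length. The length(e) <- length(o) line handles this by padding e with NA when a vector has an odd number of values. If you don't want the padding, you'd need to store the two parts in lists instead.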
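The question also asks to replace the time column itself. A minimal sketch of that last step, assuming the guessed structure above (a list column called time), is to wrap the splitting code in a function and assign the result back. `split_odd_even` is a made-up name.

```r
# assumed structure: one numeric vector per row, stored as a list column
data <- data.frame(dd = 1:3)
data$time <- list(1:4, 1:6, 1:3)

split_odd_even <- function(x) {
  odd <- seq_along(x) %% 2 == 1
  o <- x[odd]
  e <- x[!odd]
  length(e) <- length(o)   # pad with NA when x has an odd number of values
  data.frame(odd = o, even = e)
}

# each element of the list column becomes a two-column data frame
data$time <- lapply(data$time, split_odd_even)

data$time[[3]]
```

Each row's data frame is then available as data$time[[i]].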

R, construct a data.frame column by using data from another list

Given a list x:
$a
[1] 1 2 3 4 5 6
$b
[1] 10 20 30 40 50
$c
[1] 100 200 300 400 500
I want to construct a data frame that contains one column containing the following values:
1 10 100
Namely the elements of the column come from the first element in x$a, x$b and x$c.
I wonder what is the most efficient way to construct this column?
We can use [ to extract the 1st element
d1 <- data.frame(Col1 = unname(sapply(x, `[`, 1)))
d1
# Col1
#1 1
#2 10
#3 100
We can also do
data.frame(Col1 = do.call(cbind, x)[1,])
You can try this too:
data.frame(Col1=do.call(rbind, x)[,1])
Col1
a 1
b 10
c 100
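One more variation, as a sketch: vapply states the expected type and length of each result up front. It also avoids building a full matrix, which the cbind/rbind versions do with a recycling warning here because x$a is longer than x$b and x$c. The list x is rebuilt from the question; `first_vals` is a made-up name.

```r
x <- list(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(10, 20, 30, 40, 50),
  c = c(100, 200, 300, 400, 500)
)

# take element 1 of each vector; numeric(1) says "exactly one number each"
first_vals <- vapply(x, `[`, numeric(1), 1)

d1 <- data.frame(Col1 = unname(first_vals))
d1
```

If any element of x were empty or not numeric, vapply would stop with an error instead of silently returning a list.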

R: change data frame structure using values from one variable as new variable

df1 <- data.frame(
name = c("a", "b", "b", "c"),
score = c(1, 1, 2, 1)
)
How can I get a new data frame with variables/columns from df1$name and with each 'corresponding' df1$score? I figure that it's actually a two-step problem:
First I would need to make a list of (in this example) unequal length vectors like this:
$a
[1] 1
$b
[1] 1 2
$c
[1] 1
Second, NAs need to be padded so one gets vectors of equal length before making the desired data frame
that would be like:
a b c
1 1 1 1
2 NA 2 NA
I cannot find any simple means to do this - I'm sure there must be!
If the solution can be delivered using dplyr it would be fantastic! Thanks!
To split the data:
(s <- split(df1$score, df1$name))
# $a
# [1] 1
#
# $b
# [1] 1 2
#
# $c
# [1] 1
To create the new data frame:
as.data.frame(sapply(s, `length<-`, max(vapply(s, length, 1L))))
# a b c
# 1 1 1 1
# 2 NA 2 NA
Slightly more efficient would be to use vapply in place of sapply
len <- max(vapply(s, length, 1L))
as.data.frame(vapply(s, `length<-`, double(len), len))
# a b c
# 1 1 1 1
# 2 NA 2 NA

How to avoid changing strings to numbers automatically in R

I was trying to save some strings into a matrix, but they were automatically changed to numbers (levels). How can I avoid it?
Here is the table:
trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b
I want to save it to a matrix like:
J0001 a ab ab b
But what I get is:
J0001 1 2 2 3
How can I avoid this?
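Your M column is defined as a factor. You can save it as-is by wrapping it with as.character. (Since R 4.0.0, read.table no longer creates factors by default, so stringsAsFactors = TRUE is set below to reproduce the behaviour.)
> dat <- read.table(header = TRUE, stringsAsFactors = TRUE, text = "trt means M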
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b")
> as.numeric(dat$M)
# [1] 1 2 2 3
> as.character(dat$M)
# [1] "a" "ab" "ab" "b"
You can avoid this in the first place by using stringsAsFactors = FALSE when you read the data into R, or take advantage of the colClasses argument in some of the read-in functions.
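A minimal sketch of the stringsAsFactors = FALSE route, ending with the one-row matrix from the question. The row name "J0001" comes from the question; `res` and `f` are made-up names.

```r
# read M as plain character strings instead of a factor
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b")

# a character vector goes into the matrix unchanged
res <- matrix(dat$M, nrow = 1, dimnames = list("J0001", NULL))
res

# for contrast: a factor turns into its integer level codes
f <- factor(dat$M)
as.integer(f)
```

If the column is already a factor, as.character(dat$M) before building the matrix gives the same result.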
