Populate values from another data frame based on a predefined set of columns - R

I have two data frames. The first one looks like this:
df1 <- data.frame(Hugo_Symbol=c("CDKN2A", "JUN", "IRS2", "MTOR", "NRAS"),
                  A183=c(-0.19, NA, 2.01, 0.4, 1.23),
                  A185=c(0.11, 2.45, NA, NA, 1.67),
                  A186=c(1.19, NA, 2.41, 0.78, 1.93),
                  A187=c(2.78, NA, NA, 0.7, 2.23),
                  A188=c(NA, NA, NA, 2.4, 1.23))
head(df1)
Hugo_Symbol A183 A185 A186 A187 A188
1 CDKN2A -0.19 0.11 1.19 2.78 NA
2 JUN NA 2.45 NA NA NA
3 IRS2 2.01 NA 2.41 NA NA
4 MTOR 0.40 NA 0.78 0.70 2.40
5 NRAS 1.23 1.67 1.93 2.23 1.23
The second data frame is smaller and has empty values:
df2 <- data.frame(Hugo_Symbol=c("CDKN2A", "IRS2", "NRAS"),
                  A183=c(0, 0, 0),
                  A187=c(0, 0, 0),
                  A188=c(0, 0, 0))
head(df2)
Hugo_Symbol A183 A187 A188
1 CDKN2A 0 0 0
2 IRS2 0 0 0
3 NRAS 0 0 0
I would like to populate the second data frame with values from the first data frame. The final result will look like this:
Hugo_Symbol A183 A187 A188
1 CDKN2A -0.19 2.78 NA
2 IRS2 2.01 NA NA
3 NRAS 1.23 2.23 1.23
I tried the cbind() and merge() functions, but they do not work on data with different numbers of rows and columns.
I would appreciate any help!
Thank you!
Olha

I don't quite follow the logic of your expected output (I suspect a typo in it), but I think you want the following:
matchedRowInds <- match(df2$Hugo_Symbol,df1$Hugo_Symbol)
matchedColInds <- match(colnames(df2),colnames(df1))
newdf <- df1[matchedRowInds,matchedColInds]
# > newdf
# Hugo_Symbol A183 A187 A188
# 1 CDKN2A -0.19 2.78 NA
# 3 IRS2 2.01 NA NA
# 5 NRAS 1.23 2.23 1.23
Idea: get the rows of the bigger data frame that are present in the smaller one, and do the same with the columns.
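Since you mentioned merge(), here is a base R sketch that gives the same result, assuming every column of df2 also exists in df1:
# restrict df1 to df2's columns, then keep only df2's genes
newdf <- merge(df2["Hugo_Symbol"], df1[, names(df2)], by = "Hugo_Symbol")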

You can use semi_join from dplyr (note: your final table has unexpected values). My version:
library(dplyr)
df3 <- df1 %>%
  semi_join(df2, by = "Hugo_Symbol") %>%
  select(Hugo_Symbol, A183, A187, A188)
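If you prefer not to hard-code the column names, a variant sketch (with recent dplyr/tidyselect) picks them up from df2 via any_of():
library(dplyr)
df3 <- df1 %>%
  semi_join(df2, by = "Hugo_Symbol") %>%
  select(any_of(names(df2)))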

Here is a data.table approach... are you sure the desired output in your question is correct? It seems to me that IRS2 - A188 should be NA and not 2.23?
library( data.table )
#make them both data.tables
setDT(df1); setDT(df2)
#find the common columns
comcols <- intersect( names(df1[,-1]), names(df2[,-1]) )
#create a data.table syntax for an update join on the common columns
expr <- paste0( "df2[ df1, `:=` (",
                paste0( comcols, " = i.", comcols, collapse = " ," ),
                " ), on = .(Hugo_Symbol) ]" )
eval(parse(text=expr))
df2
# Hugo_Symbol A183 A187 A188
# 1: CDKN2A -0.19 2.78 NA
# 2: IRS2 2.01 NA NA
# 3: NRAS 1.23 2.23 1.23
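The same update join can also be written without eval(parse()) by assigning the shared columns through mget() on the prefixed i. columns; a sketch of that alternative, assuming the same df1 and df2 as above:
library(data.table)
setDT(df1); setDT(df2)
# columns shared by both tables, apart from the join key
comcols <- setdiff(intersect(names(df1), names(df2)), "Hugo_Symbol")
# update join by reference: pull the matching i.* columns from df1 into df2
df2[df1, (comcols) := mget(paste0("i.", comcols)), on = "Hugo_Symbol"]
df2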

Update dt columns based on named list

Let's say I have the following my_dt data.table:
  neutrons spectrum geography
1     2.30     -1.2      KIEL
2     2.54     -1.6      KIEL
3     2.56     -0.9      JUNG
4     2.31     -0.3      ANT
Also I have the following named list (my_list):
> my_list
$particles
[1] "neutrons"
$station
[1] NA
$energy
[1] "spectrum"
$area
[1] "geography"
$gamma
[1] NA
The values of this list correspond to column names in my dataset (where they exist; where they are absent, the value is NA).
Based on my dataset and this list, I need to check which columns exist in my_dt and rename them (using the names of my_list), and for the NA values I need to create columns filled with NA.
So, I want to obtain the following dataset:
> final_dt
  particles station energy area gamma
1      2.30      NA   -1.2 KIEL    NA
2      2.54      NA   -1.6 KIEL    NA
3      2.56      NA   -0.9 JUNG    NA
4      2.31      NA   -0.3  ANT    NA
I tried to implement this using apply-family functions, but so far I can't obtain exactly what I want.
So, I would be grateful for any help!
data.table using lapply
library(data.table)
setDT(my_dt)
setDT(my_list)
final_dt <- setnames(
  my_list[, lapply(.SD, function(x) {
    if (x %in% colnames(my_dt)) my_dt[, x, with = FALSE] else NA
  })],
  names(my_list))
final_dt
particles station energy area gamma
1: 2.30 NA -1.2 KIEL NA
2: 2.54 NA -1.6 KIEL NA
3: 2.56 NA -0.9 JUNG NA
4: 2.31 NA -0.3 ANT NA
base R using sapply
setDF(my_dt)
setDF(my_list)
data.frame( sapply( my_list, function(x) if(!is.na(x)){ my_dt[,x] }else{ NA } ) )
particles station energy area gamma
1 2.30 NA -1.2 KIEL NA
2 2.54 NA -1.6 KIEL NA
3 2.56 NA -0.9 JUNG NA
4 2.31 NA -0.3 ANT NA
Data
my_dt <- structure(list(neutrons = c(2.3, 2.54, 2.56, 2.31), spectrum = c(-1.2,
-1.6, -0.9, -0.3), geography = c("KIEL", "KIEL", "JUNG", "ANT"
)), class = "data.frame", row.names = c(NA, -4L))
my_list <- list(particles = "neutrons", station = NA, energy = "spectrum",
area = "geography", gamma = NA)
This may not meet your needs, but since I came up with this separately, I thought I would share it just in case. You can use setnames to rename the columns based on my_list. After that, add in the missing column names with values of NA. Finally, you can use setcolorder to reorder based on your list if desired.
library(data.table)
my_vec <- unlist(my_list)
setnames(my_dt, names(my_vec[match(names(my_dt), my_vec)]))
my_dt[, (setdiff(names(my_vec), names(my_dt))) := NA]
setcolorder(my_dt, names(my_vec))
my_dt
Output
particles station energy area gamma
1: 2.30 NA -1.2 KIEL NA
2: 2.54 NA -1.6 KIEL NA
3: 2.56 NA -0.9 JUNG NA
4: 2.31 NA -0.3 ANT NA
I wrote some simple code that should do the job for you:
l = list(c = 'cc', a = 'aa', b = NA)  # replace this with your my_list
dt = data.frame(aa = 1:3, cc = 2:4)   # replace this with my_dt
dtl = data.frame(l)
# rename the dt columns that have a counterpart in l
names(dt) = names(l)[match(names(dt), unlist(l))]
# add the remaining list entries as NA columns via a cross merge
m = merge(dt, dtl[!is.element(names(dtl), names(dt))])
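Applied to the my_dt and my_list from the question (assuming the plain data.frame and list from the Data block above), the same idea could look like this sketch:
dt <- my_dt
# rename the columns that have a counterpart in my_list
names(dt) <- names(my_list)[match(names(dt), unlist(my_list))]
# add the missing columns as NA and restore the order of my_list
dt[setdiff(names(my_list), names(dt))] <- NA
final_dt <- dt[names(my_list)]
final_dt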

Arranging Columns in R

I have data of the form:
Department LengthAfter
1 A 8.42
2 B 10.93
3 D 9.98
4 A 10.13
5 B 10.54
6 C 7.82
7 A 9.55
8 D 12.53
9 C 7.87
I would like to make a new table or data frame in which the column headers are the departments (A, B, C, D) and the values under each column are the LengthAfter values corresponding to that department, e.g.
A B C D
8.42 10.93 7.82 9.98
Can anyone help with this? Thank you
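For reference, here is the example data above as code (a reconstruction from the table shown in the question):
df <- data.frame(
  Department  = c("A", "B", "D", "A", "B", "C", "A", "D", "C"),
  LengthAfter = c(8.42, 10.93, 9.98, 10.13, 10.54, 7.82, 9.55, 12.53, 7.87)
)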
Using the tidyverse, you can use pivot_wider to pivot your data into the desired form. Before that, you will need to arrange by Department if you want the LengthAfter values in their order of appearance and the columns ordered as above.
library(tidyverse)
df %>%
  arrange(Department) %>%
  group_by(Department) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = "Department", values_from = "LengthAfter") %>%
  select(-rn)
Output
A B C D
<dbl> <dbl> <dbl> <dbl>
1 8.42 10.9 7.82 9.98
2 10.1 10.5 7.87 12.5
3 9.55 NA NA NA
You can use the reshape2 package for this, casting on a within-department row index:
library(reshape2)
df$id <- ave(df$LengthAfter, df$Department, FUN = seq_along)
df_new <- dcast(df, id ~ Department, value.var = "LengthAfter")[-1]
Base-R
dept_length <- read.csv("/Users/usr/SO_Department_LengthAfter.tsv", sep="\t");
dl_list <- with(dept_length, tapply(LengthAfter, Department, `c`));
n.obs <- sapply(dl_list, length);
seq.max <- seq_len(max(n.obs));
sapply(dl_list, `[`, i = seq.max);
Returns:
A B C D
[1,] 8.42 10.93 7.82 9.98
[2,] 10.13 10.54 7.87 12.53
[3,] 9.55 NA NA NA
References:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html
How to convert a list consisting of vector of different lengths to a usable data frame in R?
https://eeob-biodata.github.io/R-Data-Skills/05-split-apply-combine/

Group values with identical ID into columns without summarizing them in R

I have a dataframe that looks like this, but with a lot more Proteins
Protein z
Irak4 -2.46
Irak4 -0.13
Itk -0.49
Itk 4.22
Itk -0.51
Ras 1.53
For further operations I need the data to be grouped by protein name into columns like this:
Irak4 Itk Ras
-2.46 -0.49 1.53
-0.13 4.22 NA
NA -0.51 NA
I tried different packages like dplyr or reshape, but did not manage to transform the data into the desired format.
Is there any way to achieve this? I think the missing datapoints for some Proteins are the main problem here.
I am quite new to R, so my apologies if I am missing an obvious solution.
Here is an option with tidyverse
library(tidyverse)
DF %>%
  group_by(Protein) %>%
  mutate(idx = row_number()) %>%
  spread(Protein, z) %>%
  select(-idx)
# A tibble: 3 x 3
# Irak4 Itk Ras
# <dbl> <dbl> <dbl>
#1 -2.46 -0.49 1.53
#2 -0.13 4.22 NA
#3 NA -0.51 NA
Before we spread the data, we need to create unique identifiers.
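spread() has since been superseded by pivot_wider(); a sketch of the same logic with the newer verb (not part of the original answer):
library(dplyr)
library(tidyr)
DF %>%
  group_by(Protein) %>%
  mutate(idx = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = Protein, values_from = z) %>%
  select(-idx)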
In base R you could use unstack first, which will give you a named list of vectors containing the values of the z column.
Then use lapply to iterate over that list and pad the vectors with NAs via the `length<-` function, so that all vectors have equal length. Then we can call data.frame.
lst <- unstack(DF, z ~ Protein)
data.frame(lapply(lst, `length<-`, max(lengths(lst))))
# Irak4 Itk Ras
#1 -2.46 -0.49 1.53
#2 -0.13 4.22 NA
#3 NA -0.51 NA
data
DF <- structure(list(Protein = c("Irak4", "Irak4", "Itk", "Itk", "Itk",
"Ras"), z = c(-2.46, -0.13, -0.49, 4.22, -0.51, 1.53)), .Names = c("Protein",
"z"), class = "data.frame", row.names = c(NA, -6L))
library(data.table)
dcast(setDT(df),rowid(Protein)~Protein,value.var='z')
Protein Irak4 Itk Ras
1: 1 -2.46 -0.49 1.53
2: 2 -0.13 4.22 NA
3: 3 NA -0.51 NA
In base R you can do:
data.frame(sapply(a<-unstack(df,z~Protein),`length<-`,max(lengths(a))))
Irak4 Itk Ras
1 -2.46 -0.49 1.53
2 -0.13 4.22 NA
3 NA -0.51 NA
Or using reshape:
reshape(transform(df,gr=ave(z,Protein,FUN=seq_along)),v.names = 'z',timevar = 'Protein',idvar = 'gr',dir='wide')
gr z.Irak4 z.Itk z.Ras
1 1 -2.46 -0.49 1.53
2 2 -0.13 4.22 NA
5 3 NA -0.51 NA

tm package: Output of findAssocs() in a matrix instead of a list in R

Consider the following list:
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
a <- findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))
How do I manage to get a data frame with all terms associated with these 3 words in the columns, showing:
The corresponding correlation coefficient (if it exists)
NA if it does not exist for this word (for example, the pair (oil, they) would show NA)
Here's a solution using reshape2 to help reshape the data
library(reshape2)
aa <- do.call(rbind, Map(function(d, n)
  cbind.data.frame(
    xterm = if (length(d) > 0) names(d) else NA,
    cor = if (length(d) > 0) d else NA,
    term = n),
  a, names(a))
)
dcast(aa, term~xterm, value.var="cor")
Or you could use dplyr and tidyr
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
a1 <- unnest(lapply(a, function(x) data.frame(xterm=names(x),
cor=x, stringsAsFactors=FALSE)), term)
a1 %>%
spread(xterm, cor) #here it removed terms without any `cor` for the `xterm`
# term 15.8 ability above agreement analysts buyers clearly emergency fixed
#1 oil 0.87 NA 0.76 0.71 0.79 0.70 0.8 0.75 0.73
#2 opec 0.85 0.8 0.82 0.76 0.85 0.83 NA 0.87 NA
# late market meeting prices prices. said that they trying who winter
#1 0.8 0.75 0.77 0.72 NA 0.78 0.73 NA 0.8 0.8 0.8
#2 NA NA 0.88 NA 0.79 0.82 NA 0.8 NA NA NA
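On current CRAN versions of dplyr and tidyr the development build is no longer needed; a roughly equivalent sketch using bind_rows() and pivot_wider(), assuming a is the findAssocs list from above:
library(dplyr)
library(tidyr)
a1 <- bind_rows(
  lapply(a, function(x) data.frame(xterm = names(x), cor = x,
                                    stringsAsFactors = FALSE)),
  .id = "term")
a1 %>%
  pivot_wider(names_from = xterm, values_from = cor)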
Update
aNew <- sapply(tdm$dimnames$Terms, function(i) findAssocs(tdm, i, corlimit=0.95))
aNew2 <- aNew[!!sapply(aNew, function(x) length(dim(x)))]
aNew3 <- unnest(lapply(aNew2, function(x) data.frame(xterm=rownames(x),
cor=x[,1], stringsAsFactors=FALSE)[1:3,]), term)
res <- aNew3 %>%
spread(xterm, cor)
dim(res)
#[1] 1021 160
res[1:3,1:5]
# term ... 100,000 10.8 1.1
#1 ... NA NA NA NA
#2 100,000 NA NA NA 1
#3 10.8 NA NA NA NA

Removing NA columns in xts

I have an xts in the following format
a b c d e f ......
2011-01-03 11.40 NA 23.12 0.23 123.11 NA ......
2011-01-04 11.49 NA 23.15 1.11 111.11 NA ......
2011-01-05 NA NA 23.11 1.23 142.32 NA ......
2011-01-06 11.64 NA 39.01 NA 124.21 NA ......
2011-01-07 13.84 NA 12.12 1.53 152.12 NA ......
Is there a function I can apply to generate a new xts or data.frame without the columns that contain only NA?
The position of the columns with the NAs isn't static, so just removing those columns by name or position isn't possible.
Suppose DF is your data.frame:
DF[, -which(sapply(DF, function(x) sum(is.na(x))) == nrow(DF))]
a c d e
2011-01-03 11.40 23.12 0.23 123.11
2011-01-04 11.49 23.15 1.11 111.11
2011-01-05 NA 23.11 1.23 142.32
2011-01-06 11.64 39.01 NA 124.21
2011-01-07 13.84 12.12 1.53 152.12
@Jiber's solution works, but might give you unexpected results if there are no columns with all NA. For example:
# sample data
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
# Jiber's solution, when no columns have all missing values
DF <- as.data.frame(x)
DF[, -which(sapply(DF, function(x) sum(is.na(x)))==nrow(DF))]
# data frame with 0 columns and 180 rows
Here's a solution that works whether or not there are columns that have all missing values:
y <- x[, !apply(is.na(x), 2, all)]  # drop columns that are entirely NA
x$High <- NA
x$Close <- NA
z <- x[, !apply(is.na(x), 2, all)]
Try this:
dataframe[,-which(apply(is.na(dataframe), 2, all))]
This seems simpler:
DF[, colSums(is.na(DF)) < nrow(DF)]
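The same test also works directly on the xts object, since is.na() and colSums() operate on its matrix core; a quick sketch using the sample data from above:
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
x$High <- NA
x$Close <- NA
# keep only the columns that contain at least one non-NA value
x_clean <- x[, colSums(!is.na(x)) > 0]
colnames(x_clean)  # "Open" "Low"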
