I am trying to add multiple dataframes together but not in a bind fashion.
Is there an easy way to overlay & add dataframes on top of each other? As shown in this picture:
The number of columns will always be same; the row count will differ.
I want to sum the cells by row position. So Result[1,1] = Table1[1,1] + Table2[1,1] and so on, such that the resulting frame adds whatever cells have data and resulting table is the size of biggest table's size.
The table are generated dynamically so I'd like to refrain from any hardcoding.
Consider the following two data frames:
table1 <- replicate(4,round(runif(10,0,1),2)) %>% as.data.frame %>% setNames(LETTERS[1:4])
table2 <- replicate(4,round(runif(6,0,1),2)) %>% as.data.frame %>% setNames(LETTERS[1:4])
table1
A B C D
1 0.81 0.08 0.85 0.89
2 0.88 0.82 0.62 0.77
3 0.12 0.13 0.99 0.02
4 0.17 0.54 0.37 0.62
5 0.77 0.10 0.81 0.34
6 0.58 0.15 0.00 0.56
7 0.61 0.15 0.59 0.15
8 0.52 0.36 0.12 0.99
9 0.83 0.93 0.29 0.30
10 0.52 0.02 0.48 0.46
table2
A B C D
1 0.95 0.81 0.99 0.92
2 0.18 0.99 0.35 0.09
3 0.73 0.10 0.02 0.68
4 0.37 0.53 0.78 0.02
5 0.48 0.54 0.79 0.83
6 0.75 0.32 0.41 0.04
We might create a new variable called ID from their row numbers and use that to sum the values after binding the rows:
library(dplyr)
library(tibble)
bind_rows(table1 %>% rowid_to_column("ID"),table2 %>% rowid_to_column("ID")) %>%
group_by(ID) %>%
summarise(across(everything(),sum))
# A tibble: 10 x 5
ID A B C D
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1.76 0.89 1.84 1.81
2 2 1.06 1.81 0.97 0.86
3 3 0.85 0.23 1.01 0.7
4 4 0.54 1.07 1.15 0.64
5 5 1.25 0.64 1.6 1.17
6 6 1.33 0.47 0.41 0.6
7 7 0.61 0.15 0.59 0.15
8 8 0.52 0.36 0.12 0.99
9 9 0.83 0.93 0.290 0.3
10 10 0.52 0.02 0.48 0.46
A potentially more dangerous base R approach is to subset table1 to the dimensions of table2, and add them together:
table1[seq(1,nrow(table2)),seq(1,ncol(table2))] <- table1[seq(1,nrow(table2)),seq(1,ncol(table2))] + table2
table1
A B C D
1 1.76 0.89 1.84 1.81
2 1.06 1.81 0.97 0.86
3 0.85 0.23 1.01 0.70
4 0.54 1.07 1.15 0.64
5 1.25 0.64 1.60 1.17
6 1.33 0.47 0.41 0.60
7 0.61 0.15 0.59 0.15
8 0.52 0.36 0.12 0.99
9 0.83 0.93 0.29 0.30
10 0.52 0.02 0.48 0.46
# Create your data frames
df1<-data.frame(a=c(1,2,3),b=c(2,3,4),c=c(3,4,5))
df2<-data.frame(a=c(1,2),b=c(2,3),c=c(3,4))
# Create a new data frame from the bigger of the two
if (nrow(df1)>nrow(df2)){
df3 <-df1
} else {
df3<-df2
}
# For each line in the smaller data frame add it to the larger
for (number in 1:min(nrow(df1),nrow(df2))){
df3[number,] <- df1[number,]+df2[number,]
}
Related
I would like to remove all rows if any value of the row is less than 0.05. Any suggestions? I need dplyr and base R simple subset solutions.
library(magrittr)
text = '
INNO RISK PRO AMB MKT IP
1 0.00 0.01 0.00 0.00 0.19 0.24
2 1.00 0.83 0.04 0.48 0.60 0.03
3 0.01 0.07 0.79 0.05 0.19 0.00
4 0.99 0.99 0.92 0.86 0.01 0.10
5 0.72 0.93 0.28 0.48 1.00 0.90
6 0.96 1.00 1.00 0.86 1.00 0.75
7 0.02 0.07 0.01 0.86 0.60 0.00
8 0.02 0.01 0.01 0.12 0.60 0.24
9 0.02 0.93 0.92 0.02 0.19 0.90
10 0.99 0.97 0.92 0.86 0.99 0.90'
d10 = textConnection(text) %>% read.table(header = T)
Created on 2020-11-28 by the reprex package (v0.3.0)
We can use rowSums
d10[!rowSums(d10 < 0.05),]
# INNO RISK PRO AMB MKT IP
#5 0.72 0.93 0.28 0.48 1.00 0.90
#6 0.96 1.00 1.00 0.86 1.00 0.75
#10 0.99 0.97 0.92 0.86 0.99 0.90
Or with dplyr
library(dplyr)
d10 %>%
filter(across(everything(), ~ . >= 0.05))
# INNO RISK PRO AMB MKT IP
#5 0.72 0.93 0.28 0.48 1.00 0.90
#6 0.96 1.00 1.00 0.86 1.00 0.75
#10 0.99 0.97 0.92 0.86 0.99 0.90
I have a data frame df1 with four variables. One refers to sunlight, the second one refers to the moon-phase light (light due to the moon's phase), the third one to the moon-position light (light from the moon depending on if it is in the sky or not) and the fourth refers to the clarity of the sky (opposite to cloudiness).
I call them SL, MPhL, MPL and SC respectively. I want to create a new column referred to "global light" that during the day depends only on SL and during the night depends on the other three columns ("MPhL", "MPL" and "SC"). What I want is that at night (when SL == 0), the light in a specific area is equal to the product of the columns "MPhL", "MPL" and "SC". If any of them is 0, then, the light at night would be 0 also.
Since I work with a matrix of hundreds of thousands of rows, what would be the best way to do it? As an example of what I have:
SL<- c(0.82,0.00,0.24,0.00,0.98,0.24,0.00,0.00)
MPhL<- c(0.95,0.85,0.65,0.35,0.15,0.00,0.87,0.74)
MPL<- c(0.00,0.50,0.10,0.89,0.33,0.58,0.00,0.46)
SC<- c(0.00,0.50,0.10,0.89,0.33,0.58,0.00,0.46)
df<-data.frame(SL,MPhL,MPL,SC)
df
SL MPhL MPL SC
1 0.82 0.95 0.00 0.00
2 0.00 0.85 0.50 0.50
3 0.24 0.65 0.10 0.10
4 0.00 0.35 0.89 0.89
5 0.98 0.15 0.33 0.33
6 0.24 0.00 0.58 0.58
7 0.00 0.87 0.00 0.00
8 0.00 0.74 0.46 0.46
What I would like to get is this:
df
SL MPhL MPL SC GL
1 0.82 0.95 0.00 0.00 0.82 # When "SL">0, GL= SL
2 0.00 0.85 0.50 0.50 0.21 # When "SL" is 0, GL = MPhL*MPL*SC
3 0.24 0.65 0.10 0.10 0.24
4 0.00 0.35 0.89 0.89 0.28
5 0.98 0.15 0.33 0.33 0.98
6 0.24 0.00 0.58 0.58 0.24
7 0.00 0.87 0.00 0.00 0.00
8 0.00 0.74 0.46 0.46 0.16
the most simple way would be to use the ifelse function:
GL <- ifelse(SL == 0, MPhL * MPL * SC, SL)
If you want to work in a more structured environment, I can recommend the dplyr package:
library(dplyr)
tibble(SL = SL, MPhL = MPhL, MPL = MPL, SC = SC) %>%
mutate(GL = if_else(SL == 0, MPhL * MPL * SC, SL))
# A tibble: 8 x 5
SL MPhL MPL SC GL
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.82 0.95 0.00 0.00 0.820000
2 0.00 0.85 0.50 0.50 0.212500
3 0.24 0.65 0.10 0.10 0.240000
4 0.00 0.35 0.89 0.89 0.277235
5 0.98 0.15 0.33 0.33 0.980000
6 0.24 0.00 0.58 0.58 0.240000
7 0.00 0.87 0.00 0.00 0.000000
8 0.00 0.74 0.46 0.46 0.156584
Dataframe (df1) has a column with variable of interest (V1), some of these values in that column correspond to column names in other data frame (df2).
I would need to find an overlap between values (rows) of that column in df1 and all the columns in df2.
head(df1)
V1 CHR MAPINFO Pval
a 2 38067017 0.27
c 2 38070880 0.29
d 2 38073394 0.00
e 2 38073443 0.00
f 2 38073564 0.01
head(df2)
a b c d f
-0.09 -0.08 -0.50 0.50 0.35
0.00 0.00 0.40 -0.40 -0.85
0.32 0.30 0.20 0.74 0.42
-0.41 -0.52 -0.72 -0.90 -0.96
1.30 1.30 1.10 1.10 1.20
-1.12 -1.78 -1.40 1.40 1.20
For example, in the df2, there is no "e" and in df1 there is no "b". How could I only keep the ones that are present both in df1$V1 and all columns of df2?
In the end I would need intersect between both dataframes (values present only in both).
head(df1)
V1 CHR MAPINFO Pval
a 2 38067017 0.27
c 2 38070880 0.29
d 2 38073394 0.00
f 2 38073564 0.01
head(df2)
a c d f
-0.09 -0.50 0.50 0.35
0.00 0.40 -0.40 -0.85
0.32 0.20 0.74 0.42
-0.41 -0.72 -0.90 -0.96
1.30 1.10 1.10 1.20
-1.12 -1.40 1.40 1.20
Since the real number of these columns is > ~1200, I can not filter one by one.
Is there another elegant way other than transpose?
Base R solution:
df1 <- subset(df1, df1$V1 %in% names(df2))
df2 <- df2[,df1$V1]
fg = read.table("fungus.txt", header=TRUE, row.names=1);fg
names(dimnames(fg)) = c("Temperature", "Area");names(dimnames(fg))#doesn't work
dimnames(fg) = list("Temperature"=row.names(fg), "Area"=colnames(fg));dimnames(fg)
#doesn't work
You can look at the picture of data I used below:
Using dimnames() to assign dim names to the data.frame doesn't work.
The two R command both do not work. The dimnames of fg didn't change, and the names of dimnames of fg is still NULL.
Why does this happen? How to change the dimnames of this data.frame?
Finally I found change the data frame to matrix works well.
fg = as.matrix(read.table("fungus.txt", header=TRUE, row.names=1))
dimnames(fg) = list("Temp"=row.names(fg), "Isolate"=1:8);fg
And got the output:
Isolate
Temp 1 2 3 4 5 6 7 8
55 0.66 0.67 0.43 0.41 0.69 0.63 0.46 0.52
60 0.82 0.81 0.80 0.79 0.85 0.91 0.53 0.66
65 0.91 1.09 0.81 0.86 0.95 0.93 0.64 1.10
70 1.02 1.22 1.03 1.08 1.10 1.13 0.80 1.17
75 1.06 1.17 0.89 1.02 1.06 1.29 0.94 1.01
80 0.80 0.81 0.73 0.77 0.80 0.79 0.59 0.95
85 0.26 0.40 0.36 0.53 0.67 0.53 0.57 0.18
Reply to the comments: if you do not know anything about the code, then do not ask me why I post such a question.
I have an input file with Format as below :
RN KEY MET1 MET2 MET3 MET4
1 1 0.11 0.41 0.91 0.17
2 1 0.94 0.02 0.17 0.84
3 1 0.56 0.64 0.46 0.7
4 1 0.57 0.23 0.81 0.09
5 2 0.82 0.67 0.39 0.63
6 2 0.99 0.90 0.34 0.84
7 2 0.83 0.01 0.70 0.29
I have to execute Kmeans in R -separately for DF with Key=1 and Key=2 and so on...
Afterwards the final output CSV should look like
RN KEY MET1 MET2 MET3 MET4 CLST
1 1 0.11 0.41 0.91 0.17 1
2 1 0.94 0.02 0.17 0.84 1
3 1 0.56 0.64 0.46 0.77 2
4 1 0.57 0.23 0.81 0.09 2
5 2 0.82 0.67 0.39 0.63 1
6 2 0.99 0.90 0.34 0.84 2
7 2 0.83 0.01 0.70 0.29 2
Ie Key=1 is to be treated as separate DF and Key=2 is be treated as separate DF and so on...
Finally the output of clustering (of each DF)is to be combined with Key column first (since Key cannot participate in clustering) and then combined with each different DF for final output
In the above example :
DF1 is
KEY MET1 MET2 MET3 MET4
1 0.11 0.41 0.91 0.17
1 0.94 0.02 0.17 0.84
1 0.56 0.64 0.46 0.77
1 0.57 0.23 0.81 0.09
DF2 is
KEY MET1 MET2 MET3 MET4
2 0.82 0.67 0.39 0.63
2 0.99 0.90 0.34 0.84
2 0.83 0.01 0.70 0.29
Please suggest how to achieve in R
Psuedo code :
n<-Length(unique(Mydf$key))
for i=1 to n
{
#fetch partial df for each value of Key and run K means
dummydf<-subset(mydf,mydf$key=i
KmeansIns<-Kmeans(dummydf,2)
# combine with cluster result
dummydf<-data.frame(dummydf,KmeansIns$cluster)
# combine each smalldf into final Global DF
finaldf<-data.frame(finaldf,dummydf)
}Next i
#Now we have finaldf then it can be written to file
I think the easiest way would be to use by. Something along the lines of
by(data = DF, INDICES = DF$KEY, FUN = function(x) {
# your clustering code here
})
where x is a subset of your DF for each KEY.
A solution using data.tables.
library(data.table)
setDT(DF)[,CLST:=kmeans(.SD, centers=2)$clust, by=KEY, .SDcols=3:6]
DF
# RN KEY MET1 MET2 MET3 MET4 CLST
# 1: 1 1 0.11 0.41 0.91 0.17 2
# 2: 2 1 0.94 0.02 0.17 0.84 1
# 3: 3 1 0.56 0.64 0.46 0.70 1
# 4: 4 1 0.57 0.23 0.81 0.09 2
# 5: 5 2 0.82 0.67 0.39 0.63 2
# 6: 6 2 0.99 0.90 0.34 0.84 2
# 7: 7 2 0.83 0.01 0.70 0.29 1
#Read data
mdf <- read.table("mydat.txt", header=T)
#Convert to list based on KEY column
mls <- split(mdf, f=mdf$KEY)
#Define columns to use in clustering
myv <- paste0("MET", 1:4)
#Cluster each df item in list : modify kmeans() args as appropriate
kls <- lapply(X=mls, FUN=function(x){x$clust <- kmeans(x[, myv],
centers=2)$cluster ; return(x)})
#Make final "global" df
finaldf <- do.call(rbind, kls)