I have a data.frame with vegetation in a presence-absence matrix and ELLENBERG values for moisture (values 1-9 and indicator symbols ! and =). Now I want to count the plants in every column (observation point) for each ELLENBERG value.
T1-T4 are my observation points: when the plant is present the value is 1, when absent 0. F_Nr holds my ELLENBERG values from 1 to 9, and F_sym holds the indicators ! and =. In my output I count the values, i.e. in T1 I have one plant with 4, two with 7, one with ! and one with =.
Here is some small example data:
df <- df2 <- data.frame(name=c("Acer campestre", "Acer negundo", "Achillea millefolium agg.", "Agrostis stolonifera", "Alnus glutinosa", "Alnus incana"),
T1=rbinom(6, 1, .5), T2=rbinom(6, 1, .5), T3=rbinom(6, 1, .5), T4=rbinom(6, 1, .5),
F_Nr=c(5,6,4,7,9,7), F_sym=c(NA, NA, NA, "!","=", "="))
I expect a matrix like this, so that I can create plots of the distribution of the values.
df_count <- data.frame(F_sum=c(1,2,3,4,5,6,7,8,9,"=", "!"),
Thanks for your help
We can use a combination of aggregate() and merge().
df2 <- read.table(text="
name T1 T2 T3 T4 F_Nr F_sym
'Acer campestre' 0 1 1 0 5 <NA>
'Acer negundo' 0 1 0 1 6 <NA>
'Achillea millefolium agg.' 1 1 1 1 4 <NA>
'Agrostis stolonifera' 1 0 0 0 7 !
'Alnus glutinosa' 0 0 1 1 9 =
'Alnus incana' 1 0 1 0 7 =",
header=TRUE, stringsAsFactors=FALSE, na.strings="<NA>")  # read <NA> as a real missing value, not the string "<NA>"
fnr <- aggregate(df2[,2:5], list(df2$F_Nr), sum)
fsm <- aggregate(df2[,2:5], list(df2$F_sym), sum)
counts0 <- rbind(fnr, fsm)
dtf <- data.frame(F_sum=c(1:9, "=", "!"), stringsAsFactors=FALSE)
counts <- merge(dtf, counts0, by.x="F_sum", by.y="Group.1", all.x=TRUE)
counts[is.na(counts)] <- 0
counts[match(dtf$F_sum, counts$F_sum), ]
# F_sum T1 T2 T3 T4
# 3 1 0 0 0 0
# 4 2 0 0 0 0
# 5 3 0 0 0 0
# 6 4 1 1 1 1
# 7 5 0 1 1 0
# 8 6 0 1 0 1
# 9 7 2 0 1 0
# 10 8 0 0 0 0
# 11 9 0 0 1 1
# 2 = 1 0 2 1
# 1 ! 1 0 0 0
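As a cross-check on the aggregate()/merge() result, the same counts can be sketched with table() on factors whose levels are fixed in advance, so that empty categories come out as 0 without a merge. This is only a sketch: the data are the six rows from the read.table() call above, and the names obs and count_by are made up for the illustration.

```r
# Same six species as in the read.table() example above
df2 <- data.frame(
  name  = c("Acer campestre", "Acer negundo", "Achillea millefolium agg.",
            "Agrostis stolonifera", "Alnus glutinosa", "Alnus incana"),
  T1 = c(0, 0, 1, 1, 0, 1), T2 = c(1, 1, 1, 0, 0, 0),
  T3 = c(1, 0, 1, 0, 1, 1), T4 = c(0, 1, 1, 0, 1, 0),
  F_Nr  = c(5, 6, 4, 7, 9, 7),
  F_sym = c(NA, NA, NA, "!", "=", "="),
  stringsAsFactors = FALSE
)
obs <- c("T1", "T2", "T3", "T4")  # observation-point columns (hypothetical helper name)

# For each observation point, tabulate the key of the species that are present.
# Fixing the factor levels keeps categories with no plants as 0; NA keys are dropped.
count_by <- function(key, lev) {
  sapply(df2[obs], function(p) table(factor(key[p == 1], levels = lev)))
}

counts <- rbind(count_by(df2$F_Nr, 1:9), count_by(df2$F_sym, c("=", "!")))
counts
```

The result is a matrix with one row per category (1 to 9, then = and !) and one column per observation point, which can be passed to barplot() or converted with as.data.frame().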
I have 3 dataframes
As you can see each one contains a pair "Drug"-"Target". Every dataframe contains only unique pairs. But you can find exactly the same pair in the other dataframes. What I want to achieve is to create a new dataframe which will extract all the unique pairs in the first column and then in the other 3 columns will have the fileA, fileB and fileC which will be filled with 1 if the pair exists and 0 if the pair does not exist. Something like:
Pairs fileA fileB fileC
1: abqwewr 1 1 1
2: bcsaff 1 1 1
3: cdcxzcc 1 1 1
4: efsadda 1 1 1
5: ghsadd 1 1 0
6: cdDcxzccf 0 0 0
7: efcsaddav 0 0 0
8: ghgsadd 0 0 0
9: hjbn 0 0 0
10: abBqwewrm 0 0 0
11: bcvsaff 0 0 0
12: cdDcxzccfh 0 0 0
But here the dataframe is not correct since in the first column there is only the drug name and also each row should have had at least one 1.
My method:
# Create composite dataset by combining all files
compositeDataD <- rbind(fileA,fileB,fileC)
# Get unique (drug, target) pairs
# Connect Drug Names and Target Gene Symbols into one vector of pairs
compositeDataD <- na.omit(compositeDataD)
DrugTargetPairsD <- paste(compositeDataD$Drug,compositeDataD$Target,sep="")
PairsA <- DrugTargetPairsD[1:nrow(na.omit(fileA))]
PairsB <- DrugTargetPairsD[1:nrow(na.omit(fileB))]
PairsC <- DrugTargetPairsD[1:nrow(na.omit(fileC))]
# Create binary matrix for unique (drug, target) pairs
binaryA <- as.numeric(uniquePairsD %in% PairsA) # This function returns a binary value for each unique (Drug, Target) Pair compared with the content of file1
binaryB <- as.numeric(uniquePairsD %in% PairsB)
binaryC <- as.numeric(uniquePairsD %in% PairsC)
table33 <- data.table(Pairs=uniquePairsD,
Form a list L from the three objects, use lapply to paste their columns together, and then use stack to create a two-column data frame holding the pasted values and an indicator of which object each value came from. Finally, use table to produce the counts.
L <- mget(ls(pattern = "file"))
s <- stack(lapply(L, function(x) paste0(x[[1]], x[[2]])))
table(s)
values fileA fileB fileC
abBqwewrm 0 0 1
abqwewr 1 1 0
bcsaff 1 1 0
bcvsaff 0 0 1
cdcxzcc 1 0 0
cdDcxzccf 0 1 0
cdDcxzccfh 0 0 1
efcsaddav 0 1 1
efsadda 1 0 0
ghgsadd 0 1 0
ghsadd 1 0 0
hjbn 0 1 0
A variation of this is to express it as this pipeline:
library(magrittr)  # provides %>%
mget(ls(pattern = "file")) %>%
  lapply(function(x) paste0(x[[1]], x[[2]])) %>%
  stack %>%
  table
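A self-contained sketch of the stack/table idea follows. The question's input files are not shown, so the three data frames here are reconstructed from the outputs printed in this thread and should be treated as an assumption. The sketch also pastes with a "|" separator (my choice, not part of the answer above), because pasting with no separator can make different pairs collide, e.g. "ab"+"c" and "a"+"bc".

```r
# Input reconstructed from the printed outputs (assumption, not the asker's files)
fileA <- data.frame(Drug   = c("ab", "bc", "cd", "ef", "gh"),
                    Target = c("qwewr", "saff", "cxzcc", "sadda", "sadd"))
fileB <- data.frame(Drug   = c("ab", "bc", "cdD", "efc", "ghg", "hj"),
                    Target = c("qwewr", "saff", "cxzccf", "saddav", "sadd", "bn"))
fileC <- data.frame(Drug   = c("abB", "bcv", "cdD", "efc"),
                    Target = c("qwewrm", "saff", "cxzccfh", "saddav"))

# A named list gives stack() the file labels for its 'ind' column
L <- list(fileA = fileA, fileB = fileB, fileC = fileC)

# One pasted key per row; the separator keeps Drug and Target distinguishable
s <- stack(lapply(L, function(x) paste(x$Drug, x$Target, sep = "|")))

# Cross-tabulate key against file: 1 where the pair occurs, 0 otherwise
presence <- table(s)
presence
```

If a plain data frame is wanted instead of a table object, as.data.frame.matrix(presence) converts it while keeping one column per file.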
You can first create the Pairs and then merge on them, while carrying a column where the data came from:
Create the indicator column in each file:
fileA$fileA <- 1
fileB$fileB <- 1
fileC$fileC <- 1
Create the pairs in each file:
fileA$DrugTargetPair <- paste0(fileA$Drug, fileA$Target)
fileB$DrugTargetPair <- paste0(fileB$Drug, fileB$Target)
fileC$DrugTargetPair <- paste0(fileC$Drug, fileC$Target)
Select only the indicator column and the Pairs column:
fileA <- fileA[, c("DrugTargetPair", "fileA")]
fileB <- fileB[, c("DrugTargetPair", "fileB")]
fileC <- fileC[, c("DrugTargetPair", "fileC")]
Merge on the Pairs column, keeping all pairs with all = T:
file_new <- merge(fileA, fileB, by = "DrugTargetPair", all = T)
file_new <- merge(file_new, fileC, by = "DrugTargetPair", all = T)
file_new[is.na(file_new)] <- 0
DrugTargetPair fileA fileB fileC
1 abBqwewrm 0 0 1
2 abqwewr 1 1 0
3 bcsaff 1 1 0
4 bcvsaff 0 0 1
5 cdcxzcc 1 0 0
6 cdDcxzccf 0 1 0
7 cdDcxzccfh 0 0 1
8 efcsaddav 0 1 1
9 efsadda 1 0 0
10 ghgsadd 0 1 0
11 ghsadd 1 0 0
12 hjbn 0 1 0
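The merge steps above repeat the same operation once per file, so they can be folded into a loop. The following is a rough sketch under two assumptions: the input data frames are reconstructed from the outputs printed in this thread, and the merge is done on the two columns Drug and Target directly instead of on a pasted key (a deliberate variation, so the pair stays readable and no separator is needed).

```r
# Input reconstructed from the printed outputs (assumption, not the asker's files)
fileA <- data.frame(Drug   = c("ab", "bc", "cd", "ef", "gh"),
                    Target = c("qwewr", "saff", "cxzcc", "sadda", "sadd"))
fileB <- data.frame(Drug   = c("ab", "bc", "cdD", "efc", "ghg", "hj"),
                    Target = c("qwewr", "saff", "cxzccf", "saddav", "sadd", "bn"))
fileC <- data.frame(Drug   = c("abB", "bcv", "cdD", "efc"),
                    Target = c("qwewrm", "saff", "cxzccfh", "saddav"))
files <- list(fileA = fileA, fileB = fileB, fileC = fileC)

# Add an indicator column named after each file
tagged <- lapply(names(files), function(nm) {
  d <- unique(files[[nm]][, c("Drug", "Target")])
  d[[nm]] <- 1
  d
})

# Full outer join of all files on the (Drug, Target) pair
merged <- Reduce(function(x, y) merge(x, y, by = c("Drug", "Target"), all = TRUE),
                 tagged)

# A pair missing from a file shows up as NA after the join; turn that into 0
merged[is.na(merged)] <- 0
merged
```

Adding a fourth file then only means adding one more element to the files list.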
all_list <- list(fileA, fileB, fileC)
all1 <- rbind(fileA,fileB,fileC)
all1 <- as.data.frame(unique(all1))
ans <- t(apply(all1, 1, function(drgT){ sapply(all_list, function(x) {(list(drgT) %in% unlist(apply(x,1,list), recursive = F))*1} ) }))
ans[rowSums(ans) == 1,] <- 0
cbind(all1, ans)
# Drug Target 1 2 3
#1 ab qwewr 1 1 0
#2 bc saff 1 1 0
#3 cd cxzcc 0 0 0
#4 ef sadda 0 0 0
#5 gh sadd 0 0 0
#8 cdD cxzccf 0 0 0
#9 efc saddav 0 1 1
#10 ghg sadd 0 0 0
#11 hj bn 0 0 0
#12 abB qwewrm 0 0 0
#13 bcv saff 0 0 0
#14 cdD cxzccfh 0 0 0
Please note:
Please revise your example data / desired outcome.
Please read up on stringsAsFactors.
As I am new to R, this question may seem like a piece of cake to you.
I have data in txt format. The first column holds the cluster number and the second column holds the names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to re-arrange/reshape the data in following format.
Cluster No. Org 1 Org 2 org3 org4
0 0 0 1
1 0 0 0
I could not figure out how to do it in R.
We could use table
out <- cbind(ClusterNo = seq_len(nrow(df1)), as.data.frame.matrix(table(seq_len(nrow(df1)),
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
# ClusterNo org1 org2 org3 org4
#1 1 0 0 0 1
#2 2 1 0 0 0
It is also possible that we need to use the first column to get the frequency
out1 <- as.data.frame.matrix(table(df1[[1]],
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
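To make the second variant concrete, here is a small sketch that builds df1 from the eleven rows shown in the question (the column names V1 and V2 are assumed, as read.table() would assign them) and reduces the counts to 0/1 flags, so a cluster listed twice for the same organism would still show 1.

```r
# The example rows from the question; V1/V2 are assumed default column names
df1 <- data.frame(
  V1 = c(0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 10),
  V2 = c("org4|gene759", "org1|gene992", "org1|gene1101", "org4|gene757",
         "org1|gene1702", "org1|gene989", "org1|gene990", "org1|gene1699",
         "org1|gene1102", "org4|gene2439", "org1|gene1374"),
  stringsAsFactors = FALSE
)

# Strip everything from the "|" onwards; fixed levels keep org2 and org3 as columns
org <- factor(sub("\\|.*", "", df1$V2), levels = paste0("org", 1:4))

# Count organisms per cluster, then turn counts into presence flags
tab <- table(df1$V1, org)
presence <- (unclass(tab) > 0) * 1L

out1 <- as.data.frame.matrix(presence)
out1
```

Cluster 10 appears twice in the input, once with org1 and once with org4, so its row has a 1 in both of those columns.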
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression, and set this to a third column of our input:
input[, 3] <- as.integer(gsub('^org(.+)\\|.*', '\\1', input[, 2]))  # as.integer so that max() below compares numbers, not strings
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function (x)
  possibleOrgs %in% input[input[, 1] == x, 3], logical(length(possibleOrgs)))
We can then format this result as we like, perhaps using t to transpose it, * 1 to convert the TRUEs and FALSEs to 1s and 0s, and colnames to label its columns:
result <- t(result) * 1
colnames (result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1
I have a df that looks like this.
Date Winner
4/12 Tom
4/13 Abe
4/14 George
4/15 Tom
I would like to add new columns that assign a 1 if the name appears in the Winner column and 0 if it does not, and vice versa for the losing columns. Ideally the df would look like this as a result:
Date Winner Tom_Win Tom_Lose Abe_Win Abe_Lose George_Win George_Lose
4/12 Tom 1 0 0 1 0 1
4/13 Abe 0 1 1 0 0 1
4/14 George 0 1 0 1 1 0
4/15 Tom 1 0 0 1 0 1
Is there an easy way to accomplish this?
This is very simple to do with the model.matrix function: it creates N dummy columns containing 1 when the name appears and 0 when it does not (exactly as you requested). The code is below:
(assuming your data is called db)
> winners <- model.matrix(~Winner - 1, data=db)
> winners
WinnerAbe WinnerGeorge WinnerTom
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
This bit is to compute the columns with the losing values
winners <- as.data.frame(winners)
winners$loserAbe <- as.numeric(!winners$WinnerAbe) #naturally you have to
#do this for every column you need
WinnerAbe WinnerGeorge WinnerTom loserAbe
1 0 0 1 1
2 1 0 0 0
3 0 1 0 1
4 0 0 1 1
winners$Date <- db$Date #this last bit so you don't lose the date.
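Because the loser columns are just the complement of the winner columns, they can be produced for every player in one step instead of one line per column. A minimal sketch, assuming the data frame is called db as above and using the _Win/_Lose names from the question (the variable players is made up for the illustration):

```r
db <- data.frame(Date   = c("4/12", "4/13", "4/14", "4/15"),
                 Winner = c("Tom", "Abe", "George", "Tom"))

# One 0/1 column per player; "- 1" drops the intercept so no level is omitted
winners <- model.matrix(~ Winner - 1, data = db)

# The complement of the whole matrix gives all loser columns at once
losers <- 1 - winners

# model.matrix() names columns "Winner<level>"; recover the bare player names
players <- sub("^Winner", "", colnames(winners))
colnames(winners) <- paste0(players, "_Win")
colnames(losers)  <- paste0(players, "_Lose")

result <- cbind(db, winners, losers)
result
```

The columns come out grouped as all _Win columns followed by all _Lose columns; reorder them with result[, ...] if the interleaved layout from the question is needed.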
Using mtabulate from the qdapTools package, we can do it in the following three steps:
d1 <- mtabulate(d3$Winner)
d2 <- setNames(data.frame(sapply(d1, function(i) ifelse(i == 1, 0, 1))),
paste0(names(d1), '_Lose'))
cbind(d3$Date, d1, d2)
# d3$Date Abe George Tom Abe_Lose George_Lose Tom_Lose
#1 4/12 0 0 1 1 1 0
#2 4/13 1 0 0 0 1 1
#3 4/14 0 1 0 1 0 1
#4 4/15 0 0 1 1 1 0
'data.frame': 4 obs. of 2 variables:
$ Date : Factor w/ 4 levels "4/12","4/13",..: 1 2 3 4
$ Winner: Factor w/ 3 levels "Abe","George",..: 3 1 2 3
I'm sure there is a better way than this but this works in base R and it's fairly simple:
If your data looks like this:
df <- data.frame(Date = c("4/12","4/13","4/14","4/15"),Winner = c("Tom","Abe","George","Tom"))
Append the extra columns like so:
xcols <- c(paste0(unique(df$Winner), '_Win'), paste0(unique(df$Winner), '_Lose'))
df[ , xcols] <- 0
Now make a character vector with instructions to give the points for every player.
evl <- unlist(lapply(unique(df$Winner), function(x){paste0('df[', which(df$Winner == x), ',', which(names(df) == paste0(x, '_Win')), '] <- 1')}))
And execute the code:
eval(parse(text = evl))
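Note that the steps above only set the _Win columns; the _Lose columns stay at 0. As a possible alternative that avoids eval(parse()) altogether, the comparison df$Winner == p can be written straight into the new columns. This is a sketch of a different technique (a plain loop over the player names), not the method of the answer above.

```r
df <- data.frame(Date   = c("4/12", "4/13", "4/14", "4/15"),
                 Winner = c("Tom", "Abe", "George", "Tom"))

# as.character() makes this work whether Winner is a factor or a character vector
for (p in unique(as.character(df$Winner))) {
  df[[paste0(p, "_Win")]]  <- as.integer(df$Winner == p)  # 1 on the days p won
  df[[paste0(p, "_Lose")]] <- as.integer(df$Winner != p)  # 1 on the days p did not win
}
df
```

The columns are created in order of first appearance (Tom, Abe, George), which matches the layout asked for in the question.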
df <- data.frame(
  Date = c("4/12", "4/13", "4/14", "4/15"),
  Winner = c("Tom", "Abe", "George", "Tom"),
  stringsAsFactors = TRUE  # levels() below needs Winner to be a factor
)
df2 <- do.call(cbind,
  lapply(seq_along(levels(df$Winner)), function(x) {
    win <- ifelse(df$Winner == levels(df$Winner)[x], 1, 0)
    lose <- ifelse(df$Winner == levels(df$Winner)[x], 0, 1)
    dat <- cbind(win, lose)
    colnames(dat) <- c(paste(levels(df$Winner)[x], "win", sep = "_"), paste(levels(df$Winner)[x], "lose", sep = "_"))
    dat  # return the two-column matrix for this player
  }))
cbind(df, df2)
> cbind(df, df2)
Date Winner Abe_win Abe_lose George_win George_lose Tom_win Tom_lose
1 4/12 Tom 0 1 0 1 1 0
2 4/13 Abe 1 0 0 1 0 1
3 4/14 George 0 1 1 0 0 1
4 4/15 Tom 0 1 0 1 1 0
I have a data frame with three initial columns: ID, start and end positions. The rest of the columns are numeric chromosomal positions, and it looks like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4
ind2 1 3
ind3 5 7
What I want is to fill in the empty columns (1:n) based on each individual's range (start:end). For example, for the first individual (ind1) the range runs from position 2 to 4, so the positions inside the range are filled with one (1) and the positions outside the range with zero (0). To simplify, the desired output should look like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4 0 1 1 1 0 0 0 ... 0
ind2 1 3 1 1 1 0 0 0 0 ... 0
ind3 5 7 0 0 0 0 1 1 1 ... 0
I will appreciate any comment.
Supposing you know the number of columns you could use the between function from the data.table package:
library(data.table)
cols <- paste0('c',1:7)
setDT(DF)[, (cols) := lapply(1:7, function(x) +(between(x, start, end)))][]
which gives:
ID start end c1 c2 c3 c4 c5 c6 c7
1: ind1 2 4 0 1 1 1 0 0 0
2: ind2 1 3 1 1 1 0 0 0 0
3: ind3 5 7 0 0 0 0 1 1 1
It is better not to name your columns with just numbers. Therefore I added a c at the start of the column names.
Using + in +(between(x, start, end)) is a kind of trick. The more idiomatic way is using as.integer(between(x, start, end)).
Used data:
DF <- read.table(text="ID start end
ind1 2 4
ind2 1 3
ind3 5 7", header=TRUE)
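For readers without data.table, the same "is position x between start and end" test can be sketched in base R with outer(). This swaps between() for two plain comparisons, so it is an illustration of the idea and not the data.table code itself; the names pos, inside and flags are made up for the example.

```r
DF <- data.frame(ID    = c("ind1", "ind2", "ind3"),
                 start = c(2L, 1L, 5L),
                 end   = c(4L, 3L, 7L))

pos <- 1:7  # the positions to flag; assumed known in advance, as in the answer above

# inside[i, j] is TRUE when start[i] <= pos[j] <= end[i] (both ends inclusive)
inside <- outer(DF$start, pos, "<=") & outer(DF$end, pos, ">=")

# Convert TRUE/FALSE to 1/0 and label the columns c1..c7
flags <- matrix(as.integer(inside), nrow = nrow(DF),
                dimnames = list(NULL, paste0("c", pos)))

res <- cbind(DF, flags)
res
```

Both comparisons are inclusive, which matches the default behaviour of between().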
If you were to begin with the data frame df, without the columns already added,
ID start end
1 ind1 2 4
2 ind2 1 3
3 ind3 5 7
you could do
mx <- max(df[-1])
M <- Map(function(x, y) replace(integer(mx), x:y, 1L), df$start, df$end)
cbind(df, do.call(rbind, M))
# ID start end 1 2 3 4 5 6 7
# 1 ind1 2 4 0 1 1 1 0 0 0
# 2 ind2 1 3 1 1 1 0 0 0 0
# 3 ind3 5 7 0 0 0 0 1 1 1
The number of new columns will equal the maximum of the start and end columns.
df <- structure(list(ID = structure(1:3, .Label = c("ind1", "ind2",
"ind3"), class = "factor"), start = c(2L, 1L, 5L), end = c(4L,
3L, 7L)), .Names = c("ID", "start", "end"), class = "data.frame", row.names = c(NA,
-3L))
I am new to R.
I would like to transform a binary matrix like this:
" 1874 1875 1876 1877 1878 .... 2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
Since the column names are years, I would like to aggregate them by decade and obtain something like:
"1840-1849 1850-1859 1860-1869 .... 2000-2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
I am used to Python and do not know how to do this transformation without writing loops!
Thanks, isabel
It is unclear what aggregation you want, but using the following dummy data
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
The following counts events in each 10-year period.
Get the years as a numeric variable
years <- as.numeric(names(df))
Next we need an indicator for the start of each decade
ind <- seq(from = signif(years[1], 3), to = signif(tail(years, 1), 3), by = 10)
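One caveat with signif(): it rounds to the nearest value and does not round down, which is fine for data starting in 1874 but would skip the first decade if the data started in, say, 1876. A small sketch of a floor-based alternative using integer division (my variation, not part of the answer above):

```r
years <- 1876:1898  # hypothetical data that starts late in a decade

# signif() rounds to 3 significant digits, so 1876 becomes 1880, not 1870
signif(years[1], 3)

# Integer division always rounds down to the start of the decade
first_decade <- years[1] %/% 10 * 10

# The break after the last decade: start of the last decade plus 10
last_break <- (tail(years, 1) %/% 10 + 1) * 10

ind <- seq(from = first_decade, to = last_break, by = 10)
ind
```

With these breaks every year falls inside one of the intervals ind[i] to ind[i + 1] - 1, whatever the first and last year happen to be.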
We then apply over the indices of ind (1:(length(ind)-1)), select columns from df that are the current decade and count the 1s using rowSums.
tmp <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)])
}, inds = ind, data = df)
Next we cbind the resulting vectors into a data frame and fix-up the column names:
out <- do.call(cbind.data.frame, tmp)
names(out) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
This gives:
> out
1870-1879 1880-1889 1890-1899
1 4 5 6
2 4 6 6
3 2 5 5
4 5 5 7
5 3 3 7
6 5 5 4
If you want simply a binary matrix with a 1 indicating at least 1 event happened in that decade, then you can use:
tmp2 <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
as.numeric(rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)]) > 0)
}, inds = ind, data = df)
out2 <- do.call(cbind.data.frame, tmp2)
names(out2) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
which gives:
> out2
1870-1879 1880-1889 1890-1899
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
If you want a different aggregation, then modify the function applied in the lapply call to use something other than rowSums.
This is another option, using modular arithmetic to aggregate the columns.
# setup, borrowed from @GavinSimpson
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
result <- do.call(cbind,
by(t(df), as.numeric(names(df)) %/% 10 * 10, colSums))
# add -xxx9 to column names, for each decade
dimnames(result)[[2]] <- paste(colnames(result), as.numeric(colnames(result)) + 9, sep='-')
# 1870-1879 1880-1889 1890-1899
# V1 4 5 6
# V2 4 6 6
# V3 2 5 5
# V4 5 5 7
# V5 3 3 7
# V6 5 5 4
If you wanted to aggregate with something other than sum, replace the call to
colSums with something like function(cols) lapply(cols, f), where f is the aggregating
function, e.g., max.
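For the binary "did anything happen in this decade" version, here is a small deterministic sketch of the same integer-division grouping. It uses split.default() and sapply() in place of by(), which is a swapped-in technique chosen because it returns a plain numeric matrix; the toy data with rows A, B and C are invented for the example.

```r
# Toy presence matrix: 3 rows, one column per year from 1874 to 1898
m <- matrix(0L, nrow = 3, ncol = 25,
            dimnames = list(c("A", "B", "C"), as.character(1874:1898)))
m["A", c("1874", "1875")] <- 1L
m["B", "1885"] <- 1L
m["C", c("1879", "1890", "1891")] <- 1L
df <- as.data.frame(m)
names(df) <- as.character(1874:1898)

# Decade of each column: 1874 -> 1870, 1885 -> 1880, ...
decade <- as.numeric(names(df)) %/% 10 * 10

# split.default() splits the columns of df by decade; rowSums() counts per row
counts <- sapply(split.default(df, decade), rowSums)
colnames(counts) <- paste(colnames(counts),
                          as.numeric(colnames(counts)) + 9, sep = "-")

# Collapse counts to 0/1: was there at least one event in the decade?
binary <- (counts > 0) * 1L
binary
```

Swapping rowSums for another row-wise summary changes the aggregation without touching the grouping step.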