I have the following type of dataframe
A B C D
1 0 1 10
0 2 1 15
1 1 0 11
I would like the following output
A B C D
1 0 1 10
1 1 0 11
0 2 1 15
I have tried this code
require(permute)
z <- apply(permute::allPerms(1:nrow(DF)), 1, function(x){
mat <- as.matrix(DF[,2:ncol(DF)])
if(all(diag(mat[x,]) == rep(1,nrow(DF)))){
return(df[x,])} })
I am unable to get the desired output.
(Source of the above code: the question "Arrange data frame in a specific way".)
Could someone guide me? This data frame is a small sample; my real one is much larger but has a similar structure.
The following will work so long as there is at least one 1 in every relevant column. It is deterministic: it always finds the first 1 in a column and swaps it with the value in the diagonal position, so there is no combinatorial explosion. Perhaps someone can find a more elegant (or vectorised) solution?
fn <- function(colm, i){
# swap the first 1 in the column with the value at diagonal position i
i1 <- match(1, colm)
colm[i1] <- colm[i]
colm[i] <- 1
return(colm)
}
for(i in 1:nrow(DF))
{
DF[,i] <- fn(DF[,i], i)
}
EDIT
Although this answer was accepted (so I cannot delete it), on rereading it I don't think it does quite what you asked.
The following code should fix this answer.
DF<-read.table(text="A B C D
13 0 0 1
1 0 1 10
0 2 1 15
1 1 0 11", header=T)
rem<-1:nrow(DF)
for(i in 1:nrow(DF))
{
temp<-DF[i,]
any1<-intersect(rem, which(DF[,i]==1))
best1<-which.min(rowSums(DF[any1,]==1))
firsti<-any1[best1]
DF[i,]<-DF[firsti,]
DF[firsti,]<-temp
rem<-setdiff(rem, i)
}
DF
A B C D
1 1 0 1 10
2 1 1 0 11
3 0 2 1 15
4 13 0 0 1
My apologies for the confusion.
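The row-swapping loop in the edit above can be wrapped in a function. The following is only a sketch: the name arrange_diag is hypothetical, and the guard for "no remaining row has a 1 in this column" and the limit to min(nrow, ncol) positions are assumptions added here, not part of the original answer. It is run on the three-row data frame from the question.

```r
# Hypothetical helper: for each diagonal position i, pick a not-yet-placed row
# with a 1 in column i (preferring the row with the fewest 1s) and swap it
# into row i.
arrange_diag <- function(DF) {
  rem <- seq_len(nrow(DF))                      # rows not yet fixed in place
  for (i in seq_len(min(nrow(DF), ncol(DF)))) {
    cand <- intersect(rem, which(DF[[i]] == 1))
    if (length(cand) > 0) {                     # assumption: skip if no candidate
      ones <- rowSums(DF[cand, , drop = FALSE] == 1)
      pick <- cand[which.min(ones)]
      tmp <- DF[i, ]
      DF[i, ] <- DF[pick, ]
      DF[pick, ] <- tmp
    }
    rem <- setdiff(rem, i)
  }
  DF
}

DF <- read.table(text = "A B C D
1 0 1 10
0 2 1 15
1 1 0 11", header = TRUE)
res <- arrange_diag(DF)
res
```

On this input the second and third rows are swapped, which is the ordering requested in the question.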
I am trying to reformat longitudinal data for a time to event analysis. In the example data below, I simply want to find the earliest week that the result was “0” for each ID.
The specific issue I am having is how to handle patients who never convert to 0 and have only 1's or 2's. In the example data, patient J has all 1's.
#Sample data
have<-data.frame(patient=rep(LETTERS[1:10], each=9),
week=rep(0:8,times=10),
result=c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
1,rep(0,8),rep(1,9)))
patient week result
A 0 1
A 1 0
A 2 2
A 3 0
A 4 0
A 5 0
A 6 0
A 7 0
A 8 0
B 0 1
B 1 0
... .....
J 6 1
J 7 1
J 8 1
I am able to do this relatively straightforward process with the following code:
want<-aggregate(have$week, by=list(have$patient,have$result), min)
want<-want[which(want[2]==0),]
but I realize that if someone does not convert to 0, this excludes them (in this example, patient J is excluded). J should be present with a 1 in the second column and an 8 in the third column, but is of course omitted:
print(want)
Group.1 Group.2 x
A 0 1
B 0 4
C 0 2
D 0 1
E 0 6
F 0 3
G 0 3
H 0 2
I 0 1
#But also need
J 1 8
Following the posting guidelines here, I did try to solve this myself and can get what I need, though very inelegantly:
mins<-aggregate(have$week, by=list(have$patient,have$result), min)
maxs<-aggregate(have$week, by=list(have$patient,have$result), max)
want<-rbind(mins[which(mins[2]==0),],maxs[which(maxs[2]==1&maxs[3]==8),])
This returns the desired dataset, but the code is terrible and will not carry over to other datasets (e.g. datasets with different timeframes, since I have to manually put in maxs[3]==8, etc.).
Is there a more elegant or systematic way to approach this data manipulation issue?
We can write a function to select a row from the group.
select_row <- function(result, week) {
if(any(result == 0)) which.max(result == 0) else which.max(week)
}
This function returns the index of the first 0 value if one is present; otherwise it returns the index of the maximum value of week.
We can then apply it to all groups:
library(dplyr)
have %>% group_by(patient) %>% slice(select_row(result, week))
# patient week result
# <fct> <int> <dbl>
# 1 A 1 0
# 2 B 4 0
# 3 C 2 0
# 4 D 1 0
# 5 E 6 0
# 6 F 3 0
# 7 G 3 0
# 8 H 2 0
# 9 I 1 0
#10 J 8 1
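If dplyr is not available, the same selection can be sketched in base R with split() and lapply(). This is only an illustration: the reduced three-patient data set and the name picked are assumptions made here to keep the example short.

```r
# Reduced, hypothetical version of the sample data: A and B convert to 0,
# J never does.
have <- data.frame(patient = rep(c("A", "B", "J"), each = 4),
                   week    = rep(0:3, times = 3),
                   result  = c(1, 0, 2, 0,  1, 1, 2, 0,  1, 1, 1, 1))

# Same helper as in the answer above.
select_row <- function(result, week) {
  if (any(result == 0)) which.max(result == 0) else which.max(week)
}

# Split by patient, keep one row per group, and bind the rows back together.
picked <- lapply(split(have, have$patient),
                 function(g) g[select_row(g$result, g$week), ])
want <- do.call(rbind, picked)
want
```

Here A and B keep their first week with result 0, while J keeps its last week with result 1.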
As I am new to R, this question may seem like a piece of cake to you.
I have data in txt format. The first column has the cluster number and the second column has the names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to re-arrange/reshape the data in following format.
Cluster No. Org 1 Org 2 org3 org4
0 0 0 1
1 0 0 0
I could not figure out how to do it in R.
Thanks
We could use table
out <- cbind(ClusterNo = seq_len(nrow(df1)), as.data.frame.matrix(table(seq_len(nrow(df1)),
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
# ClusterNo org1 org2 org3 org4
#1 1 0 0 0 1
#2 2 1 0 0 0
It is also possible that we need to use the first column to get the frequency
out1 <- as.data.frame.matrix(table(df1[[1]],
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
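As a rough, self-contained illustration of the second variant, the sketch below runs it on four rows taken from the question's example (including the repeated cluster 10). The name df1 follows the answer; the ClusterNo column and reading the data via read.table(text = ...) are assumptions for the sake of the example.

```r
# A few rows from the question, read as two unnamed columns.
df1 <- read.table(text = "0 org4|gene759
1 org1|gene992
10 org4|gene2439
10 org1|gene1374", stringsAsFactors = FALSE)

# Strip everything from the '|' onwards and fix the levels so that
# org2 and org3 appear as all-zero columns.
org <- factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))

# Cross-tabulate cluster number against organism.
out1 <- as.data.frame.matrix(table(df1[[1]], org))
out1 <- cbind(ClusterNo = as.integer(rownames(out1)), out1)
out1
```

Cluster 10 ends up with a 1 under both org1 and org4, because both of its rows are counted in the same table row.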
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression, and set this to a third column of our input:
input[, 3] <- as.integer(gsub('^org(.+)\\|.*', '\\1', input[, 2]))
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function (x)
possibleOrgs %in% input[input[, 1] == x, 3], logical(4))
We can then format this result as we like, perhaps using t to transform its orientation, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames to title its columns:
result <- t(result) * 1
colnames(result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1
I've got a column in my dataset that contains a collection of 0,1 and 2. The 2's are a weird leftover from some previous transformation, and I need to convert them to 1. I've written a simple loop to do this
for (i in my.cl.accept$enroll){
if (i==2){
i=1
}
}
however, this doesn't change the actual contents of the dataframe. ifelse() doesn't work, because I don't need to change the other digits at all; just the number 2.
I've been using R a little more after coming from python, what simple thing am I misunderstanding here?
Let's generate a sample data set:
set.seed(10)
DF <- data.frame(
a=1:10,
b=sample(0:2,10,replace=TRUE))
DF
Now, replace every entry corresponding to 2 with 1:
DF$b[DF$b==2] <- 1
DF
Note: This is a vectorized method, and will generally be much faster than looping over the elements.
Dunno whether this is what you want?
> A<- 1:10
> B<- c(rep(0,5), rep(1,3), rep(2,2))
> data <- data.frame(A,B)
> data
A B
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 1
7 7 1
8 8 1
9 9 2
10 10 2
> data[data$B==2,]$B <- 1
> data
A B
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
Are you sure you're using ifelse correctly? It actually does allow you to only change one value to another. Here's an example:
> x <- sample(c(0, 1, 2), 10, TRUE)
> x
## [1] 2 1 1 0 2 2 0 0 2 1
> ifelse(x == 2, 1, x)
## [1] 1 1 1 0 1 1 0 0 1 1
For future reference, your good old-fashioned for loop should go something like this...
for (i in seq_along(my.cl.accept$enroll)){
# index into the column so the assignment changes the data frame itself
if (my.cl.accept$enroll[i] == 2){
my.cl.accept$enroll[i] <- 1
}
}
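To make the difference concrete, here is a small sketch on a plain vector (the vector enroll and the name after_loop are made up for illustration). In `for (i in x)`, i holds a copy of each value, so assigning to i never touches x; assigning through an index or a logical mask does.

```r
enroll <- c(0, 1, 2, 2, 0)

# The question's pattern: i is a copy of each element, so nothing changes.
for (i in enroll) {
  if (i == 2) {
    i <- 1
  }
}
after_loop <- enroll      # still contains the 2s

# Vectorised replacement through a logical mask modifies the vector itself.
enroll[enroll == 2] <- 1

print(after_loop)
print(enroll)
```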
I was wondering if you kind folks could answer a question I have. In the sample data I've provided below, in column 1 I have a categorical variable, and in column 2 p-values.
x <- c(rep("A",0.1*10000),rep("B",0.2*10000),rep("C",0.65*10000),rep("D",0.05*10000))
categorical_data=as.matrix(sample(x,10000))
p_val=as.matrix(runif(10000,0,1))
combi=as.data.frame(cbind(categorical_data,p_val))
head(combi)
V1 V2
1 A 0.484525170875713
2 C 0.48046557046473
3 C 0.228440979029983
4 B 0.216991128632799
5 C 0.521497668232769
6 D 0.358560319757089
I want to now take one of the categorical variables, let's say "C", and create another variable if it is C (print 1 in column 3, or 0 if it isn't).
combi$NEWVAR[combi$V1=="C"] <-1
combi$NEWVAR[combi$V1!="C"] <-0
V1 V2 NEWVAR
1 A 0.484525170875713 0
2 C 0.48046557046473 1
3 C 0.228440979029983 1
4 B 0.216991128632799 0
5 C 0.521497668232769 1
6 D 0.358560319757089 0
I'd like to do this for each of the variables in V1, and then loop over using lapply:
variables=unique(combi$V1)
loopeddata=lapply(variables,function(x){
combi$NEWVAR[combi$V1==x] <-1
combi$NEWVAR[combi$V1!=x]<-0
}
)
My output however looks like this:
[[1]]
[1] 0
[[2]]
[1] 0
[[3]]
[1] 0
[[4]]
[1] 0
My desired output is like the table in the second block of code, except that on each iteration the third column flags a different category: first A=1 with B, C, D=0, then B=1 with A, C, D=0, and so on.
If anyone could help me out that would be very much appreciated.
How about something like this:
model.matrix(~ -1 + V1, data=combi)
Then you can cbind it to combi if you desire:
combi <- cbind(combi, model.matrix(~ -1 + V1, data=combi))
model.matrix is definitely the way to do this in R. You can, however, also consider using table.
Here's an example using the result I get when using set.seed(1) (always use a seed when sharing example problems with random data).
LoopedData <- table(sequence(nrow(combi)), combi$V1)
head(LoopedData)
#
# A B C D
# 1 0 1 0 0
# 2 0 0 1 0
# 3 0 0 1 0
# 4 0 0 1 0
# 5 0 1 0 0
# 6 0 0 1 0
## If you want to bind it back with the original data
combi <- cbind(combi, as.data.frame.matrix(LoopedData))
head(combi)
# V1 V2 A B C D
# 1 B 0.0647124934475869 0 1 0 0
# 2 C 0.676612401846796 0 0 1 0
# 3 C 0.735371692571789 0 0 1 0
# 4 C 0.111299667274579 0 0 1 0
# 5 B 0.0466546178795397 0 1 0 0
# 6 C 0.130910312291235 0 0 1 0
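For comparison with the lapply attempt in the question: that version returns 0 for every element because the assignments modify a local copy of combi inside the function, and the function's value is the value of its last assignment (0). A minimal sketch of a loop-style alternative is below; the six-row data frame reuses the head of the question's data, and the name indicators is an assumption for illustration.

```r
# Six rows mirroring head(combi) from the question.
combi <- data.frame(V1 = c("A", "C", "C", "B", "C", "D"),
                    V2 = c(0.4845, 0.4805, 0.2284, 0.2170, 0.5215, 0.3586),
                    stringsAsFactors = FALSE)

variables <- sort(unique(combi$V1))

# Return the 0/1 vector from the function instead of assigning inside it;
# sapply() then binds one column per category into a matrix.
indicators <- sapply(variables, function(v) as.integer(combi$V1 == v))

combi <- cbind(combi, indicators)
combi
```

Each row has exactly one 1 across the A to D columns, matching what model.matrix and table produce.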
I am trying to create a new column (variable) according to the values that appear in an existing column such that if there is an NA in the existing column then the corresponding value in the new column should be 0 (zero), if not NA then it should be 1 (one). An example data is given below:
aid=c(1,2,3,4,5,6,7,8,9,10)
age=c(2,14,NA,0,NA,1,6,9,NA,15)
data=data.frame(aid,age)
My new data frame should look like this:
aid=c(1,2,3,4,5,6,7,8,9,10)
age=c(2,14,NA,0,NA,1,6,9,NA,15)
surv=c(1,1,0,1,0,1,1,1,0,1)
data<-data.frame(aid,age,surv)
data
I hope that my question is clear enough.
The R community's help is highly appreciated!
Baz
data$surv <- 1 - is.na(data$age)
> data
aid age surv
1 1 2 1
2 2 14 1
3 3 NA 0
4 4 0 1
5 5 NA 0
6 6 1 1
7 7 6 1
8 8 9 1
9 9 NA 0
10 10 15 1
If I'm understanding correctly:
data$surv <- 1
data$surv[is.na(data$age)] <- 0
or
data$surv <- ifelse(is.na(data$age), 0, 1)
An alternative to @mod's 1-is.na(foo) solution is to invert the TRUE/FALSE values with ! and call as.numeric(). This involves more typing, but makes the intention and the explicit coercion to numeric apparent.
> as.numeric(!is.na(c(2,14,NA,0,NA,1,6,9,NA,15)))
[1] 1 1 0 1 0 1 1 1 0 1
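As a quick check, the three approaches suggested above can be run side by side on the question's age vector. The names s1, s2 and s3 are only for illustration.

```r
age <- c(2, 14, NA, 0, NA, 1, 6, 9, NA, 15)

s1 <- 1 - is.na(age)              # arithmetic on the logical vector
s2 <- ifelse(is.na(age), 0, 1)    # explicit two-way choice
s3 <- as.numeric(!is.na(age))     # negate, then coerce to numeric

# All three give the same 0/1 indicator.
print(rbind(s1, s2, s3))
```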