Let's say I have this data:
df <- data.frame (RR_Code = c( "848140", "848180", "848190", "848310", "848360", "848410", "848490", "850131", "850132", "850133"),
Model = c("X1", "FG", "FD", "XR", "RT", "FG", "CV", "GH", "ER", "RF"))
RR_Code Model
1 848140 X1
2 848180 FG
3 848190 FD
4 848310 XR
5 848360 RT
6 848410 FG
7 848490 CV
8 850131 GH
9 850132 ER
10 850133 RF
Now I want to create a new column based on filter.df: if RR_Code is included in filter.df, it's 1, otherwise 0.
filter.df <- c("848410", "848490", "850131", "850132")
Expected outcome:
RR_Code Model filter
1 848140 X1 0
2 848180 FG 0
3 848190 FD 0
4 848310 XR 0
5 848360 RT 0
6 848410 FG 1
7 848490 CV 1
8 850131 GH 1
9 850132 ER 1
10 850133 RF 0
We can use %in% and convert the logical result to 0/1 with a unary plus:
df$filter <- +(df$RR_Code %in% filter.df)
df
# RR_Code Model filter
#1 848140 X1 0
#2 848180 FG 0
#3 848190 FD 0
#4 848310 XR 0
#5 848360 RT 0
#6 848410 FG 1
#7 848490 CV 1
#8 850131 GH 1
#9 850132 ER 1
#10 850133 RF 0
There are several ways of doing it. My first instinct would be to use case_when from dplyr:
library(dplyr)
df$new_column <- case_when(df$RR_Code %in% filter.df ~ 1,
                           TRUE ~ 0)
Which would return:
RR_Code Model new_column
1 848140 X1 0
2 848180 FG 0
3 848190 FD 0
4 848310 XR 0
5 848360 RT 0
6 848410 FG 1
7 848490 CV 1
8 850131 GH 1
9 850132 ER 1
10 850133 RF 0
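Both answers rest on the same idea: %in% returns a logical vector, which then only needs converting to 0/1. A minimal sketch comparing a few equivalent conversions (the vector names codes and wanted are made up for this illustration and are not part of the question):

```r
codes  <- c("848140", "848180", "848190", "848310", "848360",
            "848410", "848490", "850131", "850132", "850133")
wanted <- c("848410", "848490", "850131", "850132")

# Logical vector with one entry per code: TRUE where the code is in `wanted`
is_member <- codes %in% wanted

flag_plus    <- +is_member                 # unary plus coerces TRUE/FALSE to 1/0
flag_integer <- as.integer(is_member)      # the same conversion, spelled out
flag_ifelse  <- ifelse(is_member, 1L, 0L)  # explicit form, easy to adapt to other labels

flag_plus
```

All three produce the same integer vector, so the choice is mostly a matter of readability.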
For example, I have a dataset like this:
data <- data.frame(matrix(c(1,2,2,3,4,5,5,"a","a","b","a","a","a","b"), nrow = 7, ncol = 2, byrow = F))
X1 X2
1 a
2 a
2 b
3 a
4 a
5 a
5 b
then I add another variable "tag", initially set to 0.
data$tag <- 0
X1 X2 tag
1 a 0
2 a 0
2 b 0
3 a 0
4 a 0
5 a 0
5 b 0
I'd like to have "tag" equal to 1 for each row that is repeated, like:
X1 X2 tag
1 a 0
2 a 1
2 b 1
3 a 0
4 a 0
5 a 1
5 b 1
I used the following code:
for (i in data$X1) {
for (j in 1:length(data$X1)) {
if (j==2) {data$tag[j] <- 1}
}
}
but it doesn't work the way I'd like. I'd like the second loop (j) to run inside the first one so that j starts again from 1 every time X1 changes.
How can I manage it?
Thanks a lot
Maybe you can try ave
within(
data,
tag <- +(ave(X1, X1, FUN = length) > 1)
)
which gives
X1 X2 tag
1 1 a 0
2 2 a 1
3 2 b 1
4 3 a 0
5 4 a 0
6 5 a 1
7 5 b 1
You can use duplicated from both ends in base R:
data$tag <- as.integer(duplicated(data$X1) |
duplicated(data$X1, fromLast = TRUE))
data
# X1 X2 tag
#1 1 a 0
#2 2 a 1
#3 2 b 1
#4 3 a 0
#5 4 a 0
#6 5 a 1
#7 5 b 1
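To see why scanning from both ends is needed, here is a small sketch that prints each piece separately. The vector x stands in for data$X1, and the table()-based variant at the end is an extra illustration rather than something from the answers:

```r
x <- c(1, 2, 2, 3, 4, 5, 5)

# duplicated() marks only the second and later occurrences of a value
from_start <- duplicated(x)

# with fromLast = TRUE the scan runs backwards, so it marks all but the last occurrence
from_end <- duplicated(x, fromLast = TRUE)

# a value is repeated if either scan flags it
tag <- as.integer(from_start | from_end)

# alternative: look up how often each value occurs and flag counts above 1
tag_by_count <- as.integer(table(x)[as.character(x)] > 1)

rbind(from_start, from_end, tag, tag_by_count)
```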
An option with add_count from dplyr (the 0/1 flag ends up in column n):
library(dplyr)
data %>%
add_count(X1) %>%
mutate(n = +(n > 1))
I have a data.frame with vegetation in a presence-absence matrix and ELLENBERG values about moisture (values 1-9 and indicator plants (! and =)). Now I want to count the plants in every column (observation point) and for each ELLENBERG value.
T1-T4 are my observation points; when the plant is present, the value is 1, if absent 0. F_Nr holds my ELLENBERG values from 1 to 9, and F_sym holds the indicators ! and =. In my output I count the values, i.e. in T1 I have one plant with 4, two with 7, one with ! and one with =.
Here some small example data:
set.seed(1)
df <- df2 <- data.frame(name=c("Acer campestre", "Acer negundo", "Achillea millefolium agg.", "Agrostis stolonifera", "Alnus glutinosa", "Alnus incana"),
T1=rbinom(6, 1, .5), T2=rbinom(6, 1, .5), T3=rbinom(6, 1, .5), T4=rbinom(6, 1, .5),
F_Nr=c(5,6,4,7,9,7), F_sym=c(NA, NA, NA, "!","=", "="))
I expect a matrix like this, so that I can create plots of the distribution of the values.
df_count <- data.frame(F_sum=c(1,2,3,4,5,6,7,8,9,"=", "!"),
T1=c(0,0,0,1,0,0,2,0,0,1,0),
T2=c(0,0,0,1,1,1,0,0,0,0,0),
T3=c(0,0,0,1,1,0,1,0,1,1,1),
T4=c(0,0,0,1,0,1,0,0,1,1,0))
Thanks for your help
We can use a combination of aggregate() and merge().
df2 <- read.table(text="
name T1 T2 T3 T4 F_Nr F_sym
'Acer campestre' 0 1 1 0 5 <NA>
'Acer negundo' 0 1 0 1 6 <NA>
'Achillea millefolium agg.' 1 1 1 1 4 <NA>
'Agrostis stolonifera' 1 0 0 0 7 !
'Alnus glutinosa' 0 0 1 1 9 =
'Alnus incana' 1 0 1 0 7 =",
header=TRUE, stringsAsFactors=FALSE)
fnr <- aggregate(df2[,2:5], list(df2$F_Nr), sum)
fsm <- aggregate(df2[,2:5], list(df2$F_sym), sum)
counts0 <- rbind(fnr, fsm)
dtf <- data.frame(F_sum=c(1:9, "=", "!"), stringsAsFactors=FALSE)
counts <- merge(dtf, counts0, by.x="F_sum", by.y="Group.1", all.x=TRUE)
counts[is.na(counts)] <- 0
counts[match(dtf$F_sum, counts$F_sum), ]
# F_sum T1 T2 T3 T4
# 3 1 0 0 0 0
# 4 2 0 0 0 0
# 5 3 0 0 0 0
# 6 4 1 1 1 1
# 7 5 0 1 1 0
# 8 6 0 1 0 1
# 9 7 2 0 1 0
# 10 8 0 0 0 0
# 11 9 0 0 1 1
# 2 = 1 0 2 1
# 1 ! 1 0 0 0
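If the merge step feels indirect, the same table can be built by summing each observation column once per wanted category, so categories without plants come out as 0 directly. This is only a sketch under the assumption that the data look like df2 above; the helper sum_where and the vector plots are names invented for this example:

```r
df2 <- data.frame(
  name  = c("Acer campestre", "Acer negundo", "Achillea millefolium agg.",
            "Agrostis stolonifera", "Alnus glutinosa", "Alnus incana"),
  T1    = c(0, 0, 1, 1, 0, 1),
  T2    = c(1, 1, 1, 0, 0, 0),
  T3    = c(1, 0, 1, 0, 1, 1),
  T4    = c(0, 1, 1, 0, 1, 0),
  F_Nr  = c(5, 6, 4, 7, 9, 7),
  F_sym = c(NA, NA, NA, "!", "=", "="),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum `values` over the rows whose key equals `level`,
# treating a missing key as "no match"
sum_where <- function(values, key, level) {
  sum(values[!is.na(key) & key == level])
}

plots <- c("T1", "T2", "T3", "T4")

# One column per observation point, one row per ELLENBERG value or symbol
counts <- sapply(df2[plots], function(col) {
  c(vapply(as.character(1:9),
           function(k) sum_where(col, as.character(df2$F_Nr), k), numeric(1)),
    vapply(c("=", "!"),
           function(k) sum_where(col, df2$F_sym, k), numeric(1)))
})

counts
```

The result is a plain numeric matrix whose row names are the categories, which can be passed to barplot() or converted with as.data.frame() as needed.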
I would like to fit one model per factor level and then, for new data, pick the fitted model by name on the fly so that each row is predicted with the model for its own level. My prediction step fails with this logic; can someone advise, using the case below?
Aa <- data.frame(amount=c(1,2,1,2,1,1,2,2,1,1,1,2,2,2,1), cat1=sample(letters[21:24], 15,rep=TRUE),cat2=sample(letters[11:18], 5,rep=TRUE),
card=c("a","b","c","a","c","b","a","c","b","a","b","c","a","c","a"), delay=sample(c(1,1,0,0,0),5,rep=TRUE))
ModelFit<-sapply(as.character(unique(Aa[["card"]])), function(x)glm(delay~amount+cat1+cat2, family = "binomial", data = subset(Aa, card==x)), simplify = FALSE, USE.NAMES = TRUE)
Bb<-Aa[-(which(names(Aa) %in% "delay"))]
sapply(unique(Aa[["card"]]), function(x,y) predict(seq_along(x=ModelFit), newdata=DataOPEN[DataOPEN$SubsidiaryName],type="response"))
I have made this into a loop for simplicity. The prediction throws a warning, but seems to work. Your DataOPEN dataset was not provided, so I just calculated the prediction using the original Aa (new column pred). A final rounded version of the prediction is shown in column pred.round.
Aa <- data.frame(amount=c(1,2,1,2,1,1,2,2,1,1,1,2,2,2,1), cat1=sample(letters[21:24], 15,rep=TRUE),cat2=sample(letters[11:18], 5,rep=TRUE),
card=c("a","b","c","a","c","b","a","c","b","a","b","c","a","c","a"), delay=sample(c(1,1,0,0,0),5,rep=TRUE))
ModelFit <- sapply(as.character(unique(Aa[["card"]])), function(x)glm(delay~amount+cat1+cat2, family = "binomial", data = subset(Aa, card==x)), simplify = FALSE, USE.NAMES = TRUE)
Aa$pred <- NaN # create a new variable for prediction
for(i in names(ModelFit)){ # one model per card value; levels(Aa$card) is NULL when card is a character column
newdat <- subset(Aa, subset=card==i)
newdat$pred <- predict(ModelFit[[i]], newdata=newdat,type="response")
Aa$pred[match(rownames(newdat), rownames(Aa))] <- newdat$pred
}
Aa$pred.round <- round(Aa$pred) # a rounded prediction
Aa
The output:
> Aa
amount cat1 cat2 card delay pred pred.round
1 1 u p a 0 1.170226e-09 0
2 2 x o b 1 1.000000e+00 1
3 1 x o c 0 2.143345e-11 0
4 2 w m a 0 1.170226e-09 0
5 1 v n c 0 2.143345e-11 0
6 1 x p b 0 5.826215e-11 0
7 2 u o a 1 5.000000e-01 0
8 2 x o c 0 2.143345e-11 0
9 1 w m b 0 5.826215e-11 0
10 1 w n a 0 1.170226e-09 0
11 1 w p b 0 5.826215e-11 0
12 2 w o c 1 1.000000e+00 1
13 2 u o a 0 5.000000e-01 0
14 2 u m c 0 2.143345e-11 0
15 1 w n a 0 1.170226e-09 0
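The lookup-by-name pattern can be tried on a reproducible example. The sketch below is an illustration only: it swaps in the built-in mtcars data and a plain lm() for the binomial glm() above, with cyl playing the role of card, so that the numbers are deterministic.

```r
# Fit one model per level of the grouping variable; split() names the list
# elements after the levels ("4", "6", "8")
groups <- split(mtcars, mtcars$cyl)
fits   <- lapply(groups, function(d) lm(mpg ~ wt, data = d))

# New data to predict on (here simply the predictors of the original data)
newdata      <- mtcars[c("wt", "cyl")]
newdata$pred <- NA_real_

# For every level, pick the matching model by name and fill in its rows
for (lvl in names(fits)) {
  rows <- as.character(newdata$cyl) == lvl
  newdata$pred[rows] <- predict(fits[[lvl]], newdata = newdata[rows, , drop = FALSE])
}

head(newdata)
```

Looping over names(fits) rather than over factor levels means the code does not depend on whether the grouping column is stored as a factor or as character.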
I am currently working on a Multistate Analysis dataset in "long" form (one row for each individual's observation; each individual is repeatedly measured up to 5 times).
The idea is that each individual can recurrently transition across the levels of the time-varying state variable s = 1, 2, 3, 4. All the other variables that I have (here cohort) are fixed within any given id.
After some analyses, I need to reshape the dataset in "wide" form, according to the specific sequence of visited states. Here is an example of the initial long data:
dat <- read.table(text = "
id cohort s
1 1 2
1 1 2
1 1 1
1 1 4
2 3 1
2 3 1
2 3 3
3 2 1
3 2 2
3 2 3
3 2 3
3 2 4",
header=TRUE)
The final "wide" dataset should take into account the specific individual sequence of visited states, recorded into the newly created variables s1, s2, s3, s4, s5, where s1 is the first state visited by the individual and so on.
According to the above example, the wide dataset looks like:
id cohort s1 s2 s3 s4 s5
1 1 2 2 1 4 0
2 3 1 1 3 0 0
3 2 1 2 3 3 4
I tried to use reshape(), and also to focus on transposing s, but without the intended result. My knowledge of R functions is quite limited. Can you give any suggestions? Thanks.
EDIT: obtaining a different kind of wide dataset
Thank you all for your help, I have a related question if I can. Especially when each individual is observed for a long time and there are few transitions across states, it is very useful to reshape the initial sample dat in this alternative way:
id cohort s1 s2 s3 s4 s5 dur1 dur2 dur3 dur4 dur5
1 1 2 1 4 0 0 2 1 1 0 0
2 3 1 3 0 0 0 2 1 0 0 0
3 2 1 2 3 4 0 1 1 2 1 0
In practice now s1-s5 are the distinct visited states, and dur1-dur5 the time spent in each respective distinct visited state.
Could you please help me reach this data structure? I believe it is necessary to create all the dur- and s- variables in an intermediate sample before using reshape(). Or is it possible to use reshape2 directly?
dat <- read.table(text = "
id cohort s
1 1 2
1 1 2
1 1 1
1 1 4
2 3 1
2 3 1
2 3 3
3 2 1
3 2 2
3 2 3
3 2 3
3 2 4",
header=TRUE)
df <- data.frame(
dat,
period = sequence(rle(dat$id)$lengths)
)
wide <- reshape(df, v.names = "s", idvar = c("id", "cohort"),
timevar = "period", direction = "wide")
wide[is.na(wide)] = 0
wide
Gives:
id cohort s.1 s.2 s.3 s.4 s.5
1 1 1 2 2 1 4 0
5 2 3 1 1 3 0 0
8 3 2 1 2 3 3 4
Then the following line gives the names you asked for:
names(wide) <- c('id', 'cohort', paste0('s', 1:5))
# id cohort s1 s2 s3 s4 s5
# 1 1 1 2 2 1 4 0
# 5 2 3 1 1 3 0 0
# 8 3 2 1 2 3 3 4
If you use sep='' in the wide statement you do not have to rename the variables:
wide <- reshape(df, v.names = "s", idvar = c("id", "cohort"),
timevar = "period", direction = "wide", sep='')
I suspect there are ways to avoid creating the period variable and avoid replacing NA directly in the wide statement, but I have not figured those out yet.
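One possible way to skip both the period variable and the NA replacement is to leave reshape() out altogether and pad each individual's sequence by hand. This is a sketch of a different technique, not a reshape() option; pad_zero, states and s_wide are names made up for the example:

```r
dat <- data.frame(
  id     = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  cohort = c(1, 1, 1, 1, 3, 3, 3, 2, 2, 2, 2, 2),
  s      = c(2, 2, 1, 4, 1, 1, 3, 1, 2, 3, 3, 4)
)

# One vector of visited states per id, kept in the original row order
states <- split(dat$s, dat$id)
n_max  <- max(lengths(states))

# Pad every sequence with trailing zeros up to the longest one
pad_zero <- function(x, n) c(x, rep(0, n - length(x)))
s_wide   <- t(sapply(states, pad_zero, n = n_max))
colnames(s_wide) <- paste0("s", seq_len(n_max))

# Attach the id-level variables, matching the rows of s_wide by id
ids  <- unique(dat[c("id", "cohort")])
wide <- cbind(ids, s_wide[as.character(ids$id), , drop = FALSE])
wide
```

This relies on cohort being constant within each id, as stated in the question.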
ok...
library(plyr)
library(reshape2)
dat2 <- ddply(dat,.(id,cohort), function(x)
data.frame(s=x$s,name=paste0("s",seq_along(x$s))))
dat2 <- ddply(dat2,.(id,cohort), function(x)
dcast(x, id + cohort ~ name, value.var= "s" ,fill= 0)
)
dat2[is.na(dat2)] <- 0
dat2
# id cohort s1 s2 s3 s4 s5
# 1 1 1 2 2 1 4 0
# 2 2 3 1 1 3 0 0
# 3 3 2 1 2 3 3 4
Does this seem right? I admit the first ddply is hardly elegant.
Try this:
library(reshape2)
dat$seq <- ave(dat$id, dat$id, FUN = function(x) paste0("s", seq_along(x)))
dat.s <- dcast(dat, id + cohort ~ seq, value.var = "s", fill = 0)
which gives this:
> dat.s
id cohort s1 s2 s3 s4 s5
1 1 1 2 2 1 4 0
2 2 3 1 1 3 0 0
3 3 2 1 2 3 3 4
If you did not mind using just 1, 2, ..., 5 as column names then you could shorten the ave line to just:
dat$seq <- ave(dat$id, dat$id, FUN = seq_along)
Regarding the second question that was added later, try this:
library(plyr)
dur.fn <- function(x) {
r <- rle(x$s)$lengths
data.frame(id = x$id[1], dur.value = r, dur.seq = paste0("dur", seq_along(r)))
}
dat.dur.long <- ddply(dat, .(id), dur.fn)
dat.dur <- dcast(dat.dur.long, id ~ dur.seq, c, value.var = "dur.value", fill = 0)
cbind(dat.s, dat.dur[-1])
which gives:
id cohort s1 s2 s3 s4 s5 dur1 dur2 dur3 dur4
1 1 1 2 2 1 4 0 2 1 1 0
2 2 3 1 1 3 0 0 2 1 0 0
3 3 2 1 2 3 3 4 1 1 2 1
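The output above keeps the full state sequence in s1-s5, whereas the edited question asks for the distinct visited states next to their durations. Since rle() returns both the run values and the run lengths, both sets of columns can be taken from the same call. A base R sketch under that reading of the question (pad_zero, runs and the fixed width of 5 are assumptions made for the example):

```r
dat <- data.frame(
  id     = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  cohort = c(1, 1, 1, 1, 3, 3, 3, 2, 2, 2, 2, 2),
  s      = c(2, 2, 1, 4, 1, 1, 3, 1, 2, 3, 3, 4)
)

# Run-length encode each individual's state sequence:
# $values are the distinct consecutive states, $lengths the time spent in each
runs <- lapply(split(dat$s, dat$id), rle)

n_max    <- 5  # assumed maximum number of spells per individual
pad_zero <- function(x, n) c(x, rep(0, n - length(x)))

s_mat   <- t(sapply(runs, function(r) pad_zero(r$values, n_max)))
dur_mat <- t(sapply(runs, function(r) pad_zero(r$lengths, n_max)))
colnames(s_mat)   <- paste0("s", seq_len(n_max))
colnames(dur_mat) <- paste0("dur", seq_len(n_max))

# Combine with the id-level variables, matching rows by id
ids   <- unique(dat[c("id", "cohort")])
key   <- as.character(ids$id)
wide2 <- cbind(ids, s_mat[key, , drop = FALSE], dur_mat[key, , drop = FALSE])
wide2
```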
Here is my data:
sub <- paste ("s", 1:6, sep = "")
mark1a <- c("A", "A", "B", "d1", "A", 2)
mark1b <- c("A", "B", "d1", 2, "d1", "A")
myd <- data.frame (sub, mark1a, mark1b)
myd
sub mark1a mark1b
1 s1 A A
2 s2 A B
3 s3 B d1
4 s4 d1 2
5 s5 A d1
6 s6 2 A
I want to create a design matrix from the pair of columns mark1a and mark1b. It should have one column for each value in unique(c(mark1a, mark1b)), containing 1 or 2 depending on whether that value occurs once or twice in the row's pair of columns, and 0 otherwise. The following is expected output (not a figure):
I could not work out how this can be done.
You could try something like this:
cbind(myd, t(apply(myd, 1, function(x) sapply(unique(unlist(myd[, 2:3])), function(y) sum(x==y)))))
1 s1 A A 2 0 0 0
2 s2 A B 1 1 0 0
3 s3 B d1 0 1 1 0
4 s4 d1 2 0 0 1 1
5 s5 A d1 1 0 1 0
6 s6 2 A 1 0 0 1
First, make sure that the mark1a and mark1b columns share the same levels:
all.levels <- sort(unique(c(as.character(myd$mark1a), as.character(myd$mark1b))))
myd$mark1a <- factor(myd$mark1a, levels = all.levels)
myd$mark1b <- factor(myd$mark1b, levels = all.levels)
Then you can compute the sum of two frequency tables and bind it to myd:
library(plyr)
cbind(myd, ddply(myd, "sub", function(x)table(x$mark1a) + table(x$mark1b))[,-1])
# sub mark1a mark1b 2 A B d1
# 1 s1 A A 0 2 0 0
# 2 s2 A B 0 1 1 0
# 3 s3 B d1 0 0 1 1
# 4 s4 d1 2 1 0 0 1
# 5 s5 A d1 0 1 0 1
# 6 s6 2 A 1 1 0 0
I would say the solution from @jmsigner is the way to go for a one-liner, but I usually get confused by those nested apply (and its relatives) solutions.
Here's a similar solution:
# Identify all the values in `mark1a` and `mark1b`; as.character() makes this
# work whether the columns are factors or plain character vectors
mydLevels = sort(unique(unlist(lapply(myd[-1], as.character))))
# Use these levels and an anonymous function with `lapply`
temp = data.frame(lapply(mydLevels,
function(x) rowSums(myd[-1] == x)+0))
colnames(temp) = mydLevels
# This gives you the correct output, but not in the order
# that you have in your original question.
cbind(myd, temp)
# sub mark1a mark1b 2 A B d1
# 1 s1 A A 0 2 0 0
# 2 s2 A B 0 1 1 0
# 3 s3 B d1 0 0 1 1
# 4 s4 d1 2 1 0 0 1
# 5 s5 A d1 0 1 0 1
# 6 s6 2 A 1 1 0 0
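Another way to get the same counts is to stack the two columns and cross-tabulate subject against value with table(). A minimal sketch, assuming each subject appears in exactly one row as in the example; counts, design and result are names chosen for the illustration:

```r
sub    <- paste0("s", 1:6)
mark1a <- c("A", "A", "B", "d1", "A", "2")
mark1b <- c("A", "B", "d1", "2", "d1", "A")
myd    <- data.frame(sub, mark1a, mark1b)

# Stack both columns under their subject and count each (subject, value) pair
counts <- table(sub  = rep(myd$sub, 2),
                mark = c(as.character(myd$mark1a), as.character(myd$mark1b)))

# Turn the contingency table into a data frame with one column per value
design <- as.data.frame.matrix(counts)

# Bind to the original data, matching rows by subject
result <- cbind(myd, design[as.character(myd$sub), ])
result
```

A value that appears in both columns of a row is counted twice, which gives the 2 for subject s1 and value A.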