I have the following data frame:
id<-c(1,2,3,4,1,1,2,3,4,4,2,2)
period<-c("first","calib","valid","valid","calib","first","valid","valid","calib","first","calib","valid")
df<-data.frame(id,period)
typing
table(df)
results in
   period
id  calib first valid
  1     1     2     0
  2     2     0     2
  3     0     0     2
  4     1     1     1
However, if I save it as a data frame 'df'
df<-data.frame(table(df))
the format of 'df' would be like
   id period Freq
1   1  calib    1
2   2  calib    2
3   3  calib    0
4   4  calib    1
5   1  first    2
6   2  first    0
7   3  first    0
8   4  first    1
9   1  valid    0
10  2  valid    2
11  3  valid    2
12  4  valid    1
How can I avoid this and save the first output as it is in a data frame?
More importantly, is there any way to get the same result using 'dcast'?
Would this help?
> data.frame(unclass(table(df)))
calib first valid
1 1 2 0
2 2 0 2
3 0 0 2
4 1 1 1
To elaborate just a little bit. I've changed the ids in the example data.frame such that your ids are not 1:4, in order to prove that the ids are carried along into the table and are not a sequence of row counts.
id <- c(10,20,30,40,10,10,20,30,40,40,20,20)
period <- c("first","calib","valid","valid","calib","first","valid","valid","calib","first","calib","valid")
df <- data.frame(id,period)
Create the new data.frame one of two ways. rengis's answer is fine for two-column data frames that have the id column first. It won't work as well if your data frame has more than two columns, or if the columns are in a different order.
An alternative is to specify the columns and their order for the table explicitly:
df3 <- data.frame(unclass(table(df$id, df$period)))
The id column is carried into the new data.frame as row.names(df3). To add it as a regular column:
df3$id <- row.names(df3)
df3
calib first valid id
10 1 2 0 10
20 2 0 2 20
30 0 0 2 30
40 1 1 1 40
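If the data.frame(unclass(...)) step feels indirect, a similar result can be sketched with as.data.frame.matrix, which converts the two-way table directly into a wide data frame. This is a minimal sketch on the example data above; the name wide is only illustrative.

```r
# Example data from above (ids deliberately not 1:4)
id <- c(10, 20, 30, 40, 10, 10, 20, 30, 40, 40, 20, 20)
period <- c("first", "calib", "valid", "valid", "calib", "first",
            "valid", "valid", "calib", "first", "calib", "valid")
df <- data.frame(id, period)

# Convert the two-way table straight into a wide data frame
wide <- as.data.frame.matrix(table(df$id, df$period))

# The ids end up as row names; copy them into a regular numeric column
wide$id <- as.numeric(rownames(wide))
wide
```

Converting the row names with as.numeric keeps id numeric, whereas row.names() alone returns character strings.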
I have a question about analysing a longitudinal study in R.
I have the following data format:
ID Visit Behaviour Distance_to_first_visit_in_month
1 0 1 0
1 1 1 6
1 2 1 12
1 3 1 50
2 0 3 0
2 1 3 8
2 2 3 16
2 3 3 25
2 4 3 40
2 5 3 60
3 0 1 0
3 1 1 6
3 2 1 12
3 3 3 24
3 4 3 30
3 5 3 55
I need the data in the following format:
ID Visit Behaviour Distance_to_first_visit_in_month Status
1 0 1 0 0
2 0 3 0 1
3 3 3 24 1
If a person has Behaviour 1 at every visit until the end, they should simply be censored, because the study has finished. If a person has Behaviour 3 for the first time, I need the Distance_to_first_visit_in_month at that visit, because that is where they get status 1 in the Kaplan-Meier curve.
I tried filtering for the maximal Distance_to_first_visit_in_month and taking the Behaviour there. When I bring the data into wide format, those are easy to get. But I can't get the Distance_to_first_visit_in_month when the person has 3 as Behaviour at the beginning, or otherwise.
I have 300 IDs, some with up to 11 visits, so I can't prepare the data manually.
Do you have an idea?
Thank you in advance.
Best, Christina
As you don't explain how to aggregate your data to the second dataset, I can only show you how to get the ID's that match your conditions and how to implement the status variable. See this example:
library(dplyr)
# get IDs whose Behaviour is 1 at every visit
id_list1 <- lapply(df %>% split(.$ID), function(x){
  if(all(x$Behaviour == 1)){
    return(unique(x$ID))
  }
}) %>%
  unlist()
# get IDs with 3 as first value
id_list3 <- lapply(df %>% split(.$ID), function(x){
  if(x[x$Visit == 0, "Behaviour"] == 3){
    return(unique(x$ID))
  }
}) %>%
  unlist()
# Status is 1 for IDs that start with Behaviour 3; new_dist keeps the distance for all other IDs
df %>%
  mutate(Status = ifelse(ID %in% id_list3, 1, 0)) %>%
  mutate(new_dist = ifelse(!ID %in% id_list3, Distance_to_first_visit_in_month, NA))
Please note that id_list1 and id_list3 are named vectors. There are no duplicates; each element's name simply matches its value.
Also, by "at the beginning", do you mean Visit number 0? If not, you'll have to adjust x$Visit==0.
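One possible way to get a single row per ID for a survival analysis is sketched below in base R. It rests on an assumption the question does not confirm: an ID that never shows Behaviour 3 is censored at its last observed visit (so ID 1 would get distance 50 rather than the 0 shown in the desired output). The helper pick_row and the name surv_df are hypothetical.

```r
# Example data from the question
df <- data.frame(
  ID = c(rep(1, 4), rep(2, 6), rep(3, 6)),
  Visit = c(0:3, 0:5, 0:5),
  Behaviour = c(1, 1, 1, 1,  3, 3, 3, 3, 3, 3,  1, 1, 1, 3, 3, 3),
  Distance_to_first_visit_in_month = c(0, 6, 12, 50,
                                       0, 8, 16, 25, 40, 60,
                                       0, 6, 12, 24, 30, 55)
)

# Hypothetical helper: pick the row that defines the survival time of one ID.
# Assumption: event = first visit with Behaviour 3; otherwise censor at last visit.
pick_row <- function(x) {
  x <- x[order(x$Visit), ]
  event <- which(x$Behaviour == 3)
  if (length(event) > 0) {
    cbind(x[event[1], ], Status = 1)   # first occurrence of Behaviour 3
  } else {
    cbind(x[nrow(x), ], Status = 0)    # never 3: censored at the last visit
  }
}

surv_df <- do.call(rbind, lapply(split(df, df$ID), pick_row))
rownames(surv_df) <- NULL
surv_df
```

If the censoring time should be defined differently, only the else branch of pick_row needs to change.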
I have two data frames and I'm trying to compare a value in one with a value in the other.
If the value matches in both table 1 and table 2, a third value from table 2 should be inserted into table 1.
Example Table My DF
words number
1 it 1
2 was 2
3 the 3
4 LTD QTY 4
5 end 5
6 of 6
7 winter 7
Table x.sub
lev_dist Var1 Var2
31 1 LTD QTY LTD QTY
What I want is: if Var1 in x.sub is equal to words in mydf, insert x.sub$lev_dist into a third column next to the word in mydf.
My attempt is below, but it keeps producing 3 in the results instead of the lev_dist value.
mydf$lev_dist <- ifelse(test = (mydf$words == x.sub$Var1),x.sub$Var1,0)
Results:
words number lev_dist
1 it 1 0
2 was 2 0
3 the 3 0
4 LTD QTY 4 3
5 end 5 0
6 of 6 0
7 winter 7 0
Can anyone help?
x.sub$Var1 is a factor column, so when it is used as the result in ifelse we get the integer codes of the factor levels rather than the labels. Compare against as.character(x.sub$Var1) and return x.sub$lev_dist for the matches:
mydf$lev_dist <- ifelse(mydf$words == as.character(x.sub$Var1),
                        x.sub$lev_dist, 0)
This could have been avoided if the columns were of character class. Using stringsAsFactors=FALSE in read.csv/read.table or data.frame ensures that all character columns stay character.
You can also use merge:
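# rename Var1 to words so both data frames share the join column
x.sub = setNames(x.sub, c('lev_dist','words','Var2'))
df_ = merge(mydf, x.sub[,1:2], by='words', all=TRUE)
# words without a match get NA from the merge; replace those with 0
df_[is.na(df_)] = 0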
# >df_
# words number lev_dist
#1 end 5 0
#2 it 1 0
#3 LTD QTY 4 1
#4 of 6 0
#5 the 3 0
#6 was 2 0
#7 winter 7 0
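When x.sub has more than one row, comparing with == inside ifelse recycles the shorter vector, which may not be what is wanted. A lookup with match avoids that. The following is a minimal sketch on data rebuilt from the question; the variable idx is only illustrative.

```r
# Data rebuilt from the question, with character (not factor) columns
mydf <- data.frame(
  words = c("it", "was", "the", "LTD QTY", "end", "of", "winter"),
  number = 1:7,
  stringsAsFactors = FALSE
)
x.sub <- data.frame(lev_dist = 1, Var1 = "LTD QTY", Var2 = "LTD QTY",
                    stringsAsFactors = FALSE)

# Position of each word in x.sub$Var1 (NA when there is no match)
idx <- match(mydf$words, as.character(x.sub$Var1))

# Take lev_dist for matched words, 0 otherwise
mydf$lev_dist <- ifelse(is.na(idx), 0, x.sub$lev_dist[idx])
mydf
```

The as.character call is harmless for character columns and guards against Var1 being a factor.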
Let's say I have the following data set, which acts like the key
x y value
1 2 10
1 1 20
2 1 30
1 1 20
2 3 200
I have another data set with many columns, two of them being x and y. I want to create a column value that is looked up from the key, e.g.
x y value and other columns
1 1 20
2 1 30
2 3 300
I can only get match to work when matching on one column. How do I extend it to matching on multiple columns?
You can use merge, as @MrFlick suggested:
df.key <- data.frame(
  x=c(1,1,2,1,2),
  y=c(2,1,1,1,3),
  value=c(10,20,30,20,200))
##
df.add <- data.frame(
  x=c(1,2,2),
  y=c(1,1,3),
  value=c(20,30,300),
  a=rnorm(3),
  b=rpois(3,0))
##
> merge(
x=df.key,
y=df.add)
x y value a b
1 1 1 20 0.9246104 0
2 1 1 20 0.9246104 0
3 2 1 30 0.2685016 0
##
> merge(
x=df.key,
y=df.add,
by=c("x","y"))
x y value.x value.y a b
1 1 1 20 20 0.9246104 0
2 1 1 20 20 0.9246104 0
3 2 1 30 30 0.2685016 0
4 2 3 200 300 -0.4174230 0
By default, this will join on the intersection of column names, like in the first example (x,y,value). Additionally, you can specify which columns to use from both data.frames using by=, as in the second example. Or, you can get more specific by using by.x= and/or by.y=. See ?merge.
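As an illustration of by.x= and by.y=, here is a small sketch in which the key columns carry different names in the two data frames. The data frame df.other and its columns col_x, col_y and note are made up for this example.

```r
df.key <- data.frame(
  x = c(1, 1, 2, 1, 2),
  y = c(2, 1, 1, 1, 3),
  value = c(10, 20, 30, 20, 200))

# Hypothetical second data frame whose key columns are named differently
df.other <- data.frame(
  col_x = c(1, 2),
  col_y = c(1, 3),
  note = c("a", "b"))

# unique() drops the fully duplicated key row (x = 1, y = 1, value = 20)
res <- merge(
  x = df.other,
  y = unique(df.key),
  by.x = c("col_x", "col_y"),
  by.y = c("x", "y"))
res
```

The key columns in the result take their names from the first data frame, followed by the remaining columns of each input.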
Edit:
The problem is that df.key contains two rows where x=1, y=1 is TRUE, so the row in df.add with x=1,y=1 has to be duplicated in the join in order to preserve the data in df.key. I'm not sure how to make this adjustment elegantly (e.g. by specifying certain arguments to merge), but here's one approach:
R> merge(
x=df.key[!duplicated(df.key[,c(1:2)]),],
y=df.add)
x y value a b
1 1 1 20 -1.0185211 0
2 2 1 30 2.7507656 0
3 2 3 200 0.3986168 0
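To stay closer to the original idea of using match, one option is to paste the key columns into a single composite key and match on that. This is a sketch that assumes the separator "_" never occurs inside the key values; key_id and add_id are illustrative names. Because match returns the first hit, duplicated key rows do not multiply the rows of the target data frame.

```r
df.key <- data.frame(
  x = c(1, 1, 2, 1, 2),
  y = c(2, 1, 1, 1, 3),
  value = c(10, 20, 30, 20, 200))

# Target data frame that should receive the looked-up value
df.add <- data.frame(
  x = c(1, 2, 2),
  y = c(1, 1, 3))

# Build one composite key per row from both columns
key_id <- paste(df.key$x, df.key$y, sep = "_")
add_id <- paste(df.add$x, df.add$y, sep = "_")

# match() returns the first matching row of the key for each target row
df.add$value <- df.key$value[match(add_id, key_id)]
df.add
```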
I'm trying to select the column with the highest value for each row in a data.frame. So for instance, the data is set up as such.
> df <- data.frame(one = c(0:6), two = c(6:0))
> df
one two
1 0 6
2 1 5
3 2 4
4 3 3
5 4 2
6 5 1
7 6 0
Then I'd like to set another column based on those rows. The data frame would look like this.
> df
one two rank
1 0 6 2
2 1 5 2
3 2 4 2
4 3 3 3
5 4 2 1
6 5 1 1
7 6 0 1
I imagine there is some sort of way that I can use plyr or sapply here but it's eluding me at the moment.
There might be a more efficient solution, but this works:
ranks <- apply(df, 1, which.max)
ranks[which(df[, 1] == df[, 2])] <- 3
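A vectorised variant can be sketched with max.col from base R, which returns the column index of the row maximum without an explicit apply. As in the question's expected output, ties are coded as 3 here; that coding is taken from the example, not a general convention.

```r
df <- data.frame(one = 0:6, two = 6:0)

# Column index of the largest value in each row (first column wins ties)
top <- max.col(as.matrix(df[, c("one", "two")]), ties.method = "first")

# Overwrite ties with 3, as in the expected output of the question
df$rank <- ifelse(df$one == df$two, 3L, top)
df
```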
I have the following data frame:
id<-c(1,2,3,4,1,1,2,3,4,4,2,2)
period<-c("first","calib","valid","valid","calib","first","valid","valid","calib","first","calib","valid")
df<-data.frame(id,period)
typing
table(df)
results in
   period
id  calib first valid
  1     1     2     0
  2     2     0     2
  3     0     0     2
  4     1     1     1
Is there any way to get the same result using 'dcast' and save it as a new data frame?
Yes, there is a way:
library(reshape2)
dcast(df, id ~ period, length)
Using period as value column: use value.var to override.
id calib first valid
1 1 1 2 0
2 2 2 0 2
3 3 0 0 2
4 4 1 1 1
You can also type just dcast(df, id ~ period); length is then chosen as the aggregation function by default. As far as I can see, you tried to find this out in your other question. An extended solution without dcast would look like this:
df <- data.frame(unclass(table(df)))
df$ID <- rownames(df)
df
calib first valid ID
1 1 2 0 1
2 2 0 2 2
3 0 0 2 3
4 1 1 1 4