So I have a "by" class object (which is essentially a list).
It is indexed by 2 factors [id1,id2], with a list associated with each unique pair.
e.g.
id1:1
id2:1
1,2,3
------
id1:1
id2:2
4,4,NA
------
id1:2
id2:1
NA
I would like to convert this to a data frame which has 3 columns {id1,id2,value} and would take the above and return
id1, id2, value
1 1 1
1 1 2
1 1 3
1 2 4
1 2 4
1 2 NA
2 1 NA
This can be done with a for loop but is obviously slow. I am looking to try and merge the value column back to a data frame which has indices 1 and 2.
Answer: Use the data.table package. It is ridiculously quick for these sorts of problems.
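For comparison, here is a minimal base R sketch of the conversion asked about above. It is not the data.table approach the answer has in mind, and the example data, the object names (df, b, grid, out) and the assumption that both ids are numeric are made up for illustration.

```r
# Hypothetical data matching the example in the question.
df <- data.frame(id1   = c(1, 1, 1, 1, 1, 1, 2),
                 id2   = c(1, 1, 1, 2, 2, 2, 1),
                 value = c(1, 2, 3, 4, 4, NA, NA))

# A "by" object: one vector of values per (id1, id2) combination.
b <- by(df, df[c("id1", "id2")], function(d) d$value)

# All index combinations, in the same (column-major) order as the cells of b.
grid <- expand.grid(dimnames(b), stringsAsFactors = FALSE)

# Number of values in each cell (0 for combinations that never occur).
n <- vapply(b, length, integer(1))

# Repeat each index pair once per value and flatten the values.
# as.numeric() assumes the ids are numeric; drop it for character ids.
out <- data.frame(id1   = as.numeric(rep(grid$id1, n)),
                  id2   = as.numeric(rep(grid$id2, n)),
                  value = unlist(b, use.names = FALSE))

# Sort by the two ids to match the layout in the question.
out <- out[order(out$id1, out$id2), ]
rownames(out) <- NULL
out
```

Because no explicit loop over the cells is written, this should scale better than a for loop, although for very large inputs data.table may still be faster.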
I have a data.frame with 1200 rows and 5 columns, where each row contains 5 values for one person. Now I need to sort by one column, but I want the remaining columns to move with it, so that one column is in increasing order and every row still contains the data of one and the same person.
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
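These are the column names of my data.frame, and I want to sort it by the column called "avg".
First of all, please always provide us with a reproducible example such as the one below. Indexing the rows of a data frame with order() reorders whole rows, so all columns stay aligned with the sorted column. Note that the minus sign in order(-BAPlotDET$avg) sorts in decreasing order; drop it for increasing order.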
vector <- 1:3
BAPlotDET <- data.frame(vector, vector, vector, vector, vector)
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
fsskiddet fspiddet avg diff absdiff
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
BAPlotDET <- BAPlotDET[order(-BAPlotDET$avg),]
> BAPlotDET
fsskiddet fspiddet avg diff absdiff
3 3 3 3 3 3
2 2 2 2 2 2
1 1 1 1 1 1
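Since the question asks for increasing order, here is a small sketch with values that differ per column, so it is visible that whole rows move together. The numbers are invented for illustration; only the column names come from the question.

```r
# Hypothetical data: three persons, five values each.
BAPlotDET <- data.frame(fsskiddet = c(11, 12, 13),
                        fspiddet  = c(21, 22, 23),
                        avg       = c(2.5, 0.5, 1.5),
                        diff      = c(0.3, 0.1, 0.2),
                        absdiff   = c(0.3, 0.1, 0.2))

# order() returns the row permutation that sorts avg in increasing order;
# using it as the row index reorders every column consistently.
sorted <- BAPlotDET[order(BAPlotDET$avg), ]
sorted
```

The original row names are kept, which makes it easy to check which person each sorted row belongs to.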
I'm working with R Studio Version 1.0.143.
I need to transpose one column so that its values become the variable names, without losing the information about the values (Freq_var and Freq_mut in this example).
Also, I need to know which table the data came from (Sample 1, 2, etc.).
The main problem is how to combine everything even if a value in Gene is not present in Sample 1 but IS present in Sample 2 (the NAs in the example).
I could do it manually, but my table contains thousands of values for each variable!
Sample 1
Freq_var Freq_mut Gene
2 2 A
3 3 B
2 5 C
Sample 2
Freq_var Freq_mut Gene
1 2 A
1 1 B
1 1 D
To:
A(Freq_var) B(Freq_var) C(Freq_var) D(Freq_var) A(Freq_mut).....
Sample 1 2 3 2 NA 2
Sample 2 1 1 NA 1 2
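One possible way to get this layout, as a minimal sketch in base R: stack the two tables with a column recording the sample, then use reshape() to spread Gene into columns. The object names (s1, s2, long, wide) are made up here, and the resulting column names follow reshape()'s default pattern (for example Freq_var.A) rather than the A(Freq_var) labels in the question.

```r
# The two sample tables from the question.
s1 <- data.frame(Freq_var = c(2, 3, 2), Freq_mut = c(2, 3, 5),
                 Gene = c("A", "B", "C"), stringsAsFactors = FALSE)
s2 <- data.frame(Freq_var = c(1, 1, 1), Freq_mut = c(2, 1, 1),
                 Gene = c("A", "B", "D"), stringsAsFactors = FALSE)

# Record which table each row came from, then stack the tables.
s1$Sample <- "Sample 1"
s2$Sample <- "Sample 2"
long <- rbind(s1, s2)

# One row per sample, one column per (variable, gene) pair.
# Genes missing from a sample are filled with NA.
wide <- reshape(long, idvar = "Sample", timevar = "Gene", direction = "wide")
wide
```

With more than two samples, the same idea applies as long as every table gets its own Sample label before stacking.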
I'm trying to use the diff function to calculate the increase in a variable ("damage") in this dataset (df). I want to fill the column "damage_new" with this new variable. The values that you see now are the values I would like to have.
df = data.frame(id=c(1,1,1,2,2), trial=c(1,3,4,1,2), damage=c(1,NA,3,1,5))
df
ID TRIAL DAMAGE DAMAGE_NEW
1 1 1 0
1 3 NA NA
1 4 3 NA
2 1 1 0
2 2 5 4
If I run
diff(df$damage) it will calculate the difference in the whole dataset.
two things that I haven't managed are:
-how to nest the difference within the values of another column? Specifically, I want to calculate the damage increase (for the whole dataset), but within a single individual (ID), of which I have repeated measurements.
-I also would like to have the damage_new column to be the same length as the rest of the dataset (to attach it), and for each individual, have the first value of damage_new set to 0, since obviously the first measurement has no reference.
-To further describe the dataset, I have NAs in the "damage" column, which I suspect will lead to more NAs in the damage_new column, but I would like to keep them (and I wonder how the function deals with them?). I also don't have the same number of measurements per individual (they will have a different number of trials, with some missing in between).
thanks a lot for the always fast and efficient answers!
The dplyr package is great for this kind of thing:
library(dplyr)
df %>% group_by(id) %>% mutate(damage_new=c(0,diff(damage)))
Source: local data frame [5 x 4]
Groups: id
id trial damage damage_new
1 1 1 1 0
2 1 3 NA NA
3 1 4 3 NA
4 2 1 1 0
5 2 2 5 4
You can read more about dplyr usage here
Update
If you'd like to go with the base R, you could do:
df$damage_new <- ave(df$damage,df$id,FUN=function(v) c(0,diff(v)))
which will produce the same df.
The data.table library is your friend here:
> library(data.table)
> setDT(df)
> setkey(df, id, trial)
> df[,new_damage:=c(0,diff(damage)),by=id]
> df
id trial damage new_damage
1: 1 1 1 0
2: 1 3 NA NA
3: 1 4 3 NA
4: 2 1 1 0
5: 2 2 5 4
On diff working with NA: any subtraction that involves an NA gives NA, so each NA in the input produces up to two NAs in the output:
> diff(c(1,3,4,NA,5,7))
[1] 2 1 NA NA 2
Okay, I'm stupid, but:
How can I create a list with one column, 10 rows and a column name, and the same numeric value in all fields? I know how to append it, e.g.
mylist["column_name"] <- rep(1, nrow(mylist))
but not how to create it on its own.
It should look like this:
> mylist
column_name
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
Are you sure you want a list and not a data frame (as that is what your example looks like)? You can get it like this:
data.frame(column_name=rep(1,10))
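To make the distinction in the answer concrete, here is a short sketch that builds both objects. The names as_df and as_list are made up for illustration.

```r
# A data frame with one column and 10 rows, all set to 1.
as_df <- data.frame(column_name = rep(1, 10))

# A plain list with one named element holding the same 10 values.
as_list <- list(column_name = rep(1, 10))

dim(as_df)      # a data frame has rows and columns
length(as_list) # a plain list only has elements
```

Only the data frame prints with row numbers and a column header, as in the output shown in the question.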
I have returned stats on my data using the table command as such:
subject<-c(4,4,2,2,3,3)
correct<-c(0,1,1,1,0,0)
test<-data.frame(subject,correct)
freq_test<-head(table(test$subject,test$correct))
This returns a table which looks like this
0 1
2 0 2
3 2 0
4 1 1
That's great, but the problem is that I would like the first column to be a vector rather than row.names (so that I can code it properly as "subject").
Is there a way to get this column to act in this way?
Just make a new data frame with the row names of freq_test as the first column:
> df<-data.frame(as.numeric(rownames(freq_test)),freq_test)
> colnames(df)[1]="subject"
> df
subject X0 X1
2 2 0 2
3 3 2 0
4 4 1 1
>
Of course, you can rename X0 and X1 to whatever you want by editing colnames(df) as above.
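Whether data.frame() keeps the wide layout shown above can depend on whether the object still carries the "table" class. As a hedged alternative sketch, as.data.frame.matrix() can be called directly to keep the wide layout; the names counts and wide are made up for illustration.

```r
subject <- c(4, 4, 2, 2, 3, 3)
correct <- c(0, 1, 1, 1, 0, 0)

# Cross-tabulate subject against correct.
counts <- table(subject, correct)

# Treat the table as a matrix so the result stays wide:
# one row per subject, one column per value of correct.
wide <- as.data.frame.matrix(counts)

# Move the row names into a proper numeric column.
wide <- cbind(subject = as.numeric(rownames(wide)), wide)
rownames(wide) <- NULL
wide
```

The count columns can then be renamed through colnames(wide) in the same way as described above.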
If you want the data in "long" format (useful for some models and plotting, and especially when your tables are more complicated), the table method for the generic function as.data.frame will take care of this for you:
> as.data.frame(table(test))
subject correct Freq
1 2 0 0
2 3 0 2
3 4 0 1
4 2 1 2
5 3 1 0
6 4 1 1
I think you should have used the standard method of construction of a data.frame, which is with name=values pairs:
test <- data.frame( subject=subject, correct=correct)
The first subject is interpreted as the column name, and the second subject is evaluated as an expression, i.e., the enclosing environments will be searched for an object named subject and its value will be assigned to the "subject" column of "test".