I have a large set of data and I'm trying to group different rows together. I will know how to group the rows by using an ID. In the dataset, these IDs are sequential.
For example,
So what I want to do is iterate through this set of data and then place the data contained in these rows into a vector of vectors for processing later. The data contained in these rows of identical ID are going to be compared with one another to categorize the groupings.
I would like my data structure to look something like this:
1 -> 1 -> 1
|
V
2 -> 2
So row 1 would contain only data from 1 type of ID, then the next row in the vector would be a vector of another type of ID. How would I go about doing this in R? In C++ it would just be a vector of vectors but I haven't been able to figure out how to do the same in R.
Is this even the right way to be approaching this problem? Is there a better way to do what I'm trying to do?
You would want to work with data frames rather than simple matrices. Have a look at the R-tutor documentation on data frames.
It is doable. Best!
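To make that a bit more concrete, here is a minimal sketch of the "vector of vectors" idea using split(), which returns a list with one element per ID. The original example data did not survive, so the data frame df and its columns id and value are made up for illustration.

```r
# Hypothetical data: an ID column plus one data column.
df <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  value = c(10, 11, 12, 20, 21)
)

# One data frame per ID, stored in a named list.
groups <- split(df, df$id)

# If only a single column is needed, split that column instead
# to get a list of plain vectors (closest to a vector of vectors).
value_groups <- split(df$value, df$id)

groups[["1"]]        # all rows with id == 1
value_groups[["2"]]  # the values for id == 2
```

The list elements are named after the IDs (as strings), so lapply(groups, ...) can then run the per-group comparison.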
I have 5 dataframes with different subsets of variables. For example, the subset of the 5 A-variables appears in dataframes 1 and 5. The subset of the 7 B-variables appears in dataframes 1 and 4, and so on. A different number of persons did each of the 5 test versions (that's why I have 5 dataframes).
Now, I want to merge the dataframes together. The columns should contain all variables of all dataframes. When a variable appears in two dataframes, the values should be merged and appear in one column at the end. For all persons who did not see a variable because it was in another test, there should be an "NA" in that column.
Do you guys have an idea?
Thank you very much in advance!
You'll probably need to do some combination of inner_join(), left_join(), right_join() etc.
Check this out, should have what you need... It's difficult to know exactly what you need without seeing the data.
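Without the real data this is only a sketch, and it uses base R's merge() with all = TRUE rather than the dplyr joins mentioned above (dplyr::full_join() would be the closest equivalent). The data frames df1 and df2 and their columns id, A1, B1 and C1 are invented for illustration, and the sketch assumes each person appears in only one data frame.

```r
# Hypothetical test versions: B1 is shared, A1 and C1 are not.
df1 <- data.frame(id = c(1, 2), A1 = c(10, 20), B1 = c(5, 6))
df2 <- data.frame(id = c(3, 4), B1 = c(7, 8), C1 = c(100, 200))

# all = TRUE keeps every row from both sides (a full outer join).
# By default merge() joins on all shared column names (id and B1 here),
# so a shared variable ends up in a single column.
merged <- merge(df1, df2, all = TRUE)

# For more than two data frames, fold merge() over a list.
merged_all <- Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2))

merged
```

Persons who never saw a variable get NA in that column, which is the behaviour asked for.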
Say I have a data.frame of arbitrary dimensions (n by p). I want to extract a vector of length n from that data.frame, one element in the vector per row in the data.frame. However, the column in which each element lies may vary by row. Is there a way to do this without loops?
For example, say I have the following (3x3) data frame, called DATA:
X Y Z
1 17 43
3 4 2
6 9 0
I want to extract one scalar value from DATA per row. I have a vector, call it column.list, c(1,3,1) (arbitrarily selected in this case) which gives the column index for the elements I want, where the kth element of column.list is the column index for row k in DATA. How do I do this without loops? I want to avoid loops because I am using this repeatedly in a simulation study that will take a lot of running time even without loops, and the row number might be 100,000 or so. Much appreciated!
You can do this by indexing your data.frame with a matrix. The first column indicates row, the second indicates column. So if you do
column.list <- c(1,3,1)
DATA[cbind(1:nrow(DATA), column.list)]
You will get
[1] 1 2 6
as desired. If you mix across columns of different classes, all the values will be coerced to the most accommodating data type.
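For anyone who wants to run this end to end, here is a small self-contained version that rebuilds DATA from the question. The second data frame, mixed, is a made-up example added only to show the coercion caveat.

```r
# The 3x3 data frame from the question.
DATA <- data.frame(X = c(1, 3, 6), Y = c(17, 4, 9), Z = c(43, 2, 0))
column.list <- c(1, 3, 1)

# Two-column index matrix: column 1 = row number, column 2 = column number.
idx <- cbind(seq_len(nrow(DATA)), column.list)
picked <- DATA[idx]
picked

# Caveat: with columns of different classes, the result is coerced.
# Here a numeric and a character column yield a character vector.
mixed <- data.frame(X = c(1, 3, 6), Y = c("a", "b", "c"))
picked_mixed <- mixed[cbind(1:3, c(1, 2, 1))]
class(picked_mixed)
```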
I am a new R user and an inexperienced coder, and I have a data handling problem. Hopefully someone can help:
I have a data.frame with 3 columns (firm, year, class) and about 50.000 rows. I want to generate and store for every firm a (class x year) matrix with class counts as the elements in the matrix. Every matrix would be automatically named something like firm.name and stored so that I can use them afterwards for computations. Ideally, I'd be able to change the simple class counts into a function of values in columns 4 and 5 (backward and forward citations)
I am looking at 40 firms, 30 years, and about 1500 classes (so many firm-year-class counts are zero).
I realise I can get most of what I need (for counts) by simply using table(class,year,firm) as these columns have the same length. However, I don't know how to either store or access the matrices this function generates...
Any help would be greatly appreciated!
Simon
So, your question is how to deal with a table object?
Example:
#note the assignment operator
mytable <- with(ChickWeight, table(cut(weight, c(0,100,200,Inf)), Diet, Chick))
#access the data for the first chick
mytable[,,1]
#turn the table object into a data.frame
as.data.frame(mytable)
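Applied to the firm/year/class setup, a rough sketch could look like the one below. The data frame firms and its values are invented for illustration; only the column names firm, year and class come from the question.

```r
# Hypothetical stand-in for the real 50,000-row data frame.
firms <- data.frame(
  firm  = c("A", "A", "A", "B", "B"),
  year  = c(2000, 2000, 2001, 2000, 2001),
  class = c("x", "y", "x", "x", "x")
)

# Three-way table of counts: class x year x firm.
counts <- table(firms$class, firms$year, firms$firm,
                dnn = c("class", "year", "firm"))

# One (class x year) matrix per firm, stored in a list named by firm.
firm_names <- dimnames(counts)$firm
by_firm <- setNames(
  lapply(firm_names, function(f) counts[, , f]),
  firm_names
)

by_firm[["A"]]               # the matrix for firm A
by_firm[["A"]]["x", "2000"]  # a single count
```

Keeping the matrices in one named list is usually easier to work with afterwards than creating 40 separate variables.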
Here is my question,
I have a list of data.frames. They are produced by the same piece of code with different data.
All of the data.frames look like
US 100 (not guaranteed to exist in another data.frame because the data is different)
CA 50
...
Is there any fast/neat way to sum over all the data.frames?
I am not sure whether I have understood your problem correctly, but here is a possible solution:
Try to put all your dataframes in a list, e.g., your_list=list(df1,df2,...)
Then use total_df=do.call(rbind,your_list) to combine all dataframes (row-wise).
After that you can use ddply(total_df,"country",function (x) sum(x$value)) to aggregate the data. Here, I have assumed that US and CA stand for entries in a country column and 100 and 50 for entries in a value column.
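If you would rather not load plyr, the same steps can be sketched in base R with aggregate() in place of ddply(). As above, the column names country and value are assumptions, and df1 and df2 are made-up examples.

```r
# Hypothetical data frames; MX only exists in the second one.
df1 <- data.frame(country = c("US", "CA"), value = c(100, 50))
df2 <- data.frame(country = c("US", "MX"), value = c(10, 5))
your_list <- list(df1, df2)

# Stack all data frames row-wise.
total_df <- do.call(rbind, your_list)

# Sum the value column within each country.
totals <- aggregate(value ~ country, data = total_df, FUN = sum)
totals
```

Countries that are missing from some of the data.frames are handled automatically, because only the rows that exist get summed.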
I hope you won't find my question too silly. I did a lot of research, but I can't figure out how to solve this really annoying issue.
I have data for 6 participants (P) in an experiment, with 50 trials (T) per participant and 10 conditions (C). I'd like to create a data frame in R to hold these data.
This data.frame should have 3 factors (P, T and C) and therefore a total of P*T*C rows. The difficulty for me is creating it, since I have the data for the 6 participants in 6 data.frames of 100 obs (T) by 10 variables (C).
I'd like to first create the empty dataset with these factors, and then copy in the values from the 6 data.frames according to the factors P, T and C.
Any help would be greatly appreciated; I'm a novice in R.
Thank you.
OK; First we create one big dataframe for all participants:
result<-rbind(dfrforparticipant1, dfrforparticipant2,...dfrforparticipant6) #you'll have to fill out the proper names of the original data.frames
Next, we add a column for the participant ID:
numTrials<-50 #although 100 is also mentioned in your question
result$P<-as.factor(rep(1:6, each=numTrials))
Finally, we need to go from 'wide' format to 'long' format (I'm assuming your column names holding the results for each condition are called C1, C2 etc. ; I'm also assuming your original data.frames already held a column named T to denote the trial), like this (untested, since you did not provide example data):
orgcolnames<-paste("C", 1:10, sep="")
result2<-reshape(result, varying=list(orgcolnames), v.names="val", idvar=c("T","P"), timevar="C", times=seq_along(orgcolnames), direction="long")
What you want is now in result2.
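Since the code above is untested, here is a scaled-down version that does run, using 2 participants, 3 trials and 2 conditions. The helper make_participant() and the numbers it generates are purely hypothetical stand-ins for the real per-participant data.frames.

```r
# Hypothetical helper: builds one participant's wide data frame
# with a trial column T and condition columns C1, C2, ...
make_participant <- function(p, n_trials = 3, n_cond = 2) {
  d <- data.frame(T = seq_len(n_trials))
  for (k in seq_len(n_cond)) {
    d[[paste0("C", k)]] <- p * 100 + k * 10 + d$T
  }
  d
}

# Step 1: stack all participants into one data frame.
frames <- lapply(1:2, make_participant)
result <- do.call(rbind, frames)

# Step 2: add the participant ID (3 trials per participant here).
result$P <- factor(rep(1:2, each = 3))

# Step 3: wide to long, one row per participant/trial/condition.
cond_cols <- paste0("C", 1:2)
long <- reshape(result, varying = list(cond_cols), v.names = "val",
                idvar = c("T", "P"), timevar = "C",
                times = seq_along(cond_cols), direction = "long")

nrow(long)  # participants * trials * conditions
```

Note that C ends up as a number in the long format; wrap it in factor() if you need it as a factor for modelling.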