I realize this topic is covered fairly well, but I couldn't find anything that addresses this specific concern:
I have a df with 800 columns, 10 iterations of 80 columns (each column represents an item) - Each column is named something like: 1_BL_PRE.1 1_FU_PRE.1 1_BL_PRE.1 1_BL_POST.1
Where the first '1' indicates the item number and the second '1' indicates the iteration number.
What I'm trying to figure out is how to get the sums of specific groups of items from all 10 iterations.
As a short example let's say I want to take the 1st and 3rd item of BL_PRE and get the sum of all 10 iterations for those 2 items - how would I do this?
subject 1_BL_PRE.1 2_BL_PRE.1 3_BL_PRE.1 1_BL_PRE.2 2_BL_PRE.2
1 40002 3 4 3 1 2
2 40004 1 2 3 4 4
3 40006 4 3 3 3 1
4 40008 2 3 1 2 3
5 40009 3 4 1 2 3
Expected output (where A represents the sum of 1_BL_PRE.1, 3_BL_PRE.1, 1_BL_PRE.2 and so on):
subject BL_PRE_A
1 40002 12
2 40004 14
3 40006 15
4 40008 20
5 40009 12
My hunch is that the solution involves a for-loop or lapply (and I'm not familiar at all with either). I'm trying to work with apply(finaldata,1,function(x) {sum(x ...)}) but I haven't been able to figure out the conditional statement for the function of sum.
If there's an implementation with plyr I'd be really curious to see what that looks like. (and if there's a thread that answers this, apologies and just re-direct!)
**Edited to include small example + code I'm trying to get to work
Thanks!
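One possible approach, as a rough sketch rather than a definitive answer: select the wanted columns by name with a regular expression and add them up with rowSums. The data frame below is rebuilt from the small example above, and the names items, pattern, cols and result are made up for illustration. Since the example only shows two iterations, the totals differ from the expected output, which assumes all 10.

```r
# Example data rebuilt from the question (check.names = FALSE keeps names like "1_BL_PRE.1")
finaldata <- data.frame(
  subject      = c(40002, 40004, 40006, 40008, 40009),
  `1_BL_PRE.1` = c(3, 1, 4, 2, 3),
  `2_BL_PRE.1` = c(4, 2, 3, 3, 4),
  `3_BL_PRE.1` = c(3, 3, 3, 1, 1),
  `1_BL_PRE.2` = c(1, 4, 3, 2, 2),
  `2_BL_PRE.2` = c(2, 4, 1, 3, 3),
  check.names = FALSE
)

# Items that make up group "A" (illustrative choice: items 1 and 3)
items <- c(1, 3)

# Matches e.g. "1_BL_PRE.1" or "3_BL_PRE.7": item number, "_BL_PRE.", iteration number
pattern <- paste0("^(", paste(items, collapse = "|"), ")_BL_PRE\\.[0-9]+$")
cols <- grep(pattern, names(finaldata), value = TRUE)

# Row-wise sum over every matching column, across all iterations present
result <- data.frame(subject  = finaldata$subject,
                     BL_PRE_A = rowSums(finaldata[cols]))
print(result)
```

The anchors (^ and $) matter here: without them, item 1 would also pick up columns for items 11, 21 and so on.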
I have some data extracted via Hive. The result is a CSV with around 500,000 rows. I want to plot the data after grouping it into intervals.
Beside the grouping it's not clear how to visualize the data. Since we are talking about low spends and sometimes a high frequency I'm not sure how to handle this problem.
Here is just an overview via head(data)
userid64 spend freq
575033023245123 0.00924205 489
12588968125440467 0.00037 2
13830962861053825 0.00168 1
18983461971805285 0.001500366 333
25159368164208149 0.00215 1
32284253673482883 0.001721303 222
33221593608613197 0.00298 709
39590145306822865 0.001785281 11
45831636009567401 0.00397 654
71526649454205197 0.000949978 1
78782620614743930 0.00552 5
I want to group the data in intervals, so I want an extra column indicating the group. The first group should contain all rows with a frequency (called freq) between 1 and 100. The second group should contain all rows with a frequency between 101 and 200, and so on.
The result should look like
userid64 spend freq group
575033023245123 0.00924205 489 5
12588968125440467 0.00037 2 1
13830962861053825 0.00168 1 1
18983461971805285 0.001500366 333 4
25159368164208149 0.00215 1 1
32284253673482883 0.001721303 222 3
33221593608613197 0.00298 709 8
39590145306822865 0.001785281 11 1
45831636009567401 0.00397 654 7
71526649454205197 0.000949978 1 1
78782620614743930 0.00552 5 1
Is there a nice, simple way to get this? I need this grouping for upcoming plots. I want to visualize all intervals to get an overview of the spend. If you have any ideas for the visualization, please let me know. I thought I should work with boxplots.
If you want to group freq into bins of 100 units, you can try the ceiling function in base R:
ceiling(df$freq / 100)
#[1] 5 1 1 4 1 3 8 1 7 1 1
where df is your dataframe.
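For the plotting part, here is a minimal sketch that attaches the group to the data frame and draws one boxplot of spend per group. It uses the rows from head(data) above; the interval column built with cut() and the log-scaled y axis are my own assumptions about what might be useful, not something the question asked for.

```r
# Sample rows from the question; ids kept as character because they exceed
# the range in which doubles represent integers exactly
df <- data.frame(
  userid64 = c("575033023245123", "12588968125440467", "13830962861053825",
               "18983461971805285", "25159368164208149", "32284253673482883",
               "33221593608613197", "39590145306822865", "45831636009567401",
               "71526649454205197", "78782620614743930"),
  spend = c(0.00924205, 0.00037, 0.00168, 0.001500366, 0.00215, 0.001721303,
            0.00298, 0.001785281, 0.00397, 0.000949978, 0.00552),
  freq  = c(489, 2, 1, 333, 1, 222, 709, 11, 654, 1, 5),
  stringsAsFactors = FALSE
)

# Group number: 1 for freq 1-100, 2 for 101-200, and so on
df$group <- ceiling(df$freq / 100)

# Same grouping as a labelled factor, e.g. "(400,500]", handy for axis labels
upper <- ceiling(max(df$freq) / 100) * 100
df$interval <- cut(df$freq, breaks = seq(0, upper, by = 100))

# One box per group; the log scale is an assumption to cope with very small spends
boxplot(spend ~ group, data = df, log = "y",
        xlab = "freq group (bins of 100)", ylab = "spend (log scale)")
```

With only a handful of rows per group the boxes are not very informative, but on the full 500,000 rows this should give a first overview of spend per interval.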
I have a longitudinal data-set that looks like this:
id date group
1 jan-13 1
2 jan-13 1
3 jan-13 2
1 fev-13 3
2 fev-13 4
2 fev-13 3
3 fev-13 4
1 mar-13 5
2 mar-13 6
3 mar-13 5
It represents a network: each individual is connected to other individuals in period t if they were in the same group in any period up to and including t. Therefore, in fev-13 individual 1 is only connected to individual 2.
I want to calculate the degrees for every individual at every period. In this case the final dataset that I want to create would look like this:
id date degree
1 jan-13 1
2 jan-13 1
3 jan-13 0
1 fev-13 1
2 fev-13 2
3 fev-13 1
1 mar-13 2
2 mar-13 2
3 mar-13 2
I have tried some things using for and aggregate but it is not very efficient (it is taking more than a day and hasn't finished). The data-set is very large, so usual packages that work with networks are not working here.
Edit:
Ok, sorry, it seems I misunderstood your question. Did you check whether any of the network data packages for R do what you want? If you create a relational data set it should be easy to get what you want; maybe this tutorial helps:
https://statnet.org/trac/raw-attachment/wiki/Resources/introToSNAinR_sunbelt_2012_tutorial.pdf
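If the network packages are too heavy, a base R sketch along the following lines might be worth trying. The idea: find every pair of ids that shares a group in a period, record the first period in which each pair is linked, then take a cumulative count per id. All object names (net, pairs, first_link, and so on) are made up for this example, and it assumes the rows are already in chronological order. I have only checked it against the small example above; the self-merge can get large when groups have many members, so I can't promise it scales to the full data.

```r
# Example data from the question
net <- data.frame(
  id    = c(1, 2, 3, 1, 2, 2, 3, 1, 2, 3),
  date  = c("jan-13", "jan-13", "jan-13", "fev-13", "fev-13",
            "fev-13", "fev-13", "mar-13", "mar-13", "mar-13"),
  group = c(1, 1, 2, 3, 4, 3, 4, 5, 6, 5),
  stringsAsFactors = FALSE
)

periods <- unique(net$date)   # assumes rows appear in chronological order
ids     <- sort(unique(net$id))

# All ordered pairs (id.x, id.y) that share a group in the same period
pairs <- merge(net, net, by = c("date", "group"))
pairs <- pairs[pairs$id.x != pairs$id.y, ]
pairs$period <- match(pairs$date, periods)

# First period in which each ordered pair is linked
first_link <- aggregate(period ~ id.x + id.y, data = pairs, FUN = min)

# New partners gained per id and period, then cumulated over time
new_links <- table(factor(first_link$id.x, levels = ids),
                   factor(first_link$period, levels = seq_along(periods)))
cum_links <- t(apply(new_links, 1, cumsum))   # rows: ids, columns: periods

degree <- data.frame(
  id     = rep(ids, times = length(periods)),
  date   = rep(periods, each = length(ids)),
  degree = as.vector(cum_links)
)
print(degree)
```

Because each pair is only counted in the period where it first appears, a pair that meets again later does not inflate the degree.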
moral verw ho dog
4 1049 1 2
4 2799 1 3
2 8412 4 4
4 2122 1 3
4 2171 1 3
4 2241 1 2
4 3398 1 4
I was normalizing a dataset using
noid = data.Normalization(newx,type="n4")
but I want to ignore JUST the "moral" column and normalize everything else.
Any help will be greatly appreciated.
As David Arenburg suggested in his comment, if you don't pass the moral column to the normalisation routine it will obviously be ignored. If you want it to be included in the normalized dataset you can do something like this:
noid <- data.Normalization(newx[-1], type="n4")
noid <- cbind(moral=newx[1], noid)
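To see the same idea without the package, here is a self-contained sketch on the sample rows from the question. It assumes that type "n4" is the min-max unitization (x - min) / range, which is how I read the clusterSim documentation; unitize is a hypothetical stand-in written for this example, not the package function.

```r
# Sample rows from the question
newx <- data.frame(
  moral = c(4, 4, 2, 4, 4, 4, 4),
  verw  = c(1049, 2799, 8412, 2122, 2171, 2241, 3398),
  ho    = c(1, 1, 4, 1, 1, 1, 1),
  dog   = c(2, 3, 4, 3, 3, 2, 4)
)

# Hypothetical stand-in for type = "n4" (assumed to be (x - min) / range)
unitize <- function(x) (x - min(x)) / (max(x) - min(x))

# Columns to leave untouched, and the remaining columns to rescale
keep     <- "moral"
to_scale <- setdiff(names(newx), keep)

# Copy the data, then overwrite only the columns that should be normalized
noid <- newx
noid[to_scale] <- lapply(newx[to_scale], unitize)
print(noid)
```

Selecting columns by name with setdiff avoids relying on moral being the first column, which newx[-1] does.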
I have binned data that looks like this:
(8.048,18.05] (-21.95,-11.95] (-31.95,-21.95] (18.05,28.05] (-41.95,-31.95]
81 76 18 18 12
(-132,-122] (-122,-112] (-112,-102] (-162,-152] (-102,-91.95]
6 6 6 5 5
(-91.95,-81.95] (-192,-182] (28.05,38.05] (38.05,48.05] (58.05,68.05]
5 4 4 4 4
(78.05,88.05] (98.05,108] (-562,-552] (-512,-502] (-482,-472]
4 4 3 3 3
(-452,-442] (-412,-402] (-282,-272] (-152,-142] (48.05,58.05]
3 3 3 3 3
(68.05,78.05] (118,128] (128,138] (-582,-572] (-552,-542]
3 3 3 2 2
(-532,-522] (-422,-412] (-392,-382] (-362,-352] (-262,-252]
2 2 2 2 2
(-252,-242] (-142,-132] (-81.95,-71.95] (148,158] (-1402,-1392]
2 2 2 2 1
(-1372,-1362] (-1342,-1332] (-942,-932] (-862,-852] (-822,-812]
1 1 1 1 1
(-712,-702] (-682,-672] (-672,-662] (-632,-622] (-542,-532]
1 1 1 1 1
(-502,-492] (-492,-482] (-472,-462] (-462,-452] (-442,-432]
1 1 1 1 1
(-432,-422] (-352,-342] (-332,-322] (-312,-302] (-302,-292]
1 1 1 1 1
(-202,-192] (-182,-172] (-172,-162] (-51.95,-41.95] (88.05,98.05]
1 1 1 1 1
(108,118] (158,168] (168,178] (178,188] (298,308]
1 1 1 1 1
(318,328] (328,338] (338,348] (368,378] (458,468]
1 1 1 1 1
How can I plot this data so that the bin is sorted from most negative on the left to most positive on the right? Currently my graph looks like this. Notice that it is not sorted at all. In particular the second bar (value = 76) is placed to the right of the first:
(8.048,18.05] (-21.95,-11.95]
81 76
This is the command I use to plot:
barplot(x,ylab="Number of Unique Tags", xlab="Expected - Observed")
I really want to help answer your question, but I gotta tell you, I can't make heads or tails of your data. I see a lot of opening parentheses but no closing ones. The data looks sorted in descending order by the values on the bottom of each row. I have no idea what to make of a value like "(8.048,18.05]"
Am I missing something obvious? Can you make a simpler example where your data structure is not a factor?
I would generally expect a data frame or a matrix with two columns, one for the X and one for the Y.
See if this example of sorting helps (I'm sort of shooting in the dark here)
tN <- table(Ni <- rpois(100, lambda=5))
r <- barplot(tN)
# stop here and examine the plot
# the next bit converts the table to a data frame,
# sorts it by frequency, and plots it again
df <- data.frame(tN)
df2 <- df[order(df$Freq), ]
barplot(df2$Freq)
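If the labels are interval names of the form "(lower,upper]" (that is what they look like to me, though I'm guessing), another thing to try is to pull the lower bound out of each label and order the bars by it. The sketch below uses the first five bins from the question; x, lower and x_sorted are illustrative names, and the las and cex.names settings are just my choice to keep the labels readable.

```r
# First few bins from the question, in the order they were printed
x <- c("(8.048,18.05]"   = 81,
       "(-21.95,-11.95]" = 76,
       "(-31.95,-21.95]" = 18,
       "(18.05,28.05]"   = 18,
       "(-41.95,-31.95]" = 12)

# Extract the lower bound: the text between "(" and the comma
lower <- as.numeric(sub("^\\((.*),.*$", "\\1", names(x)))

# Reorder the bins from most negative to most positive
x_sorted <- x[order(lower)]

barplot(x_sorted, ylab = "Number of Unique Tags", xlab = "Expected - Observed",
        las = 2, cex.names = 0.7)
```

If x was produced by something like sort(table(cut(...))), simply dropping the sort might also do it, since a table built from a cut() factor keeps the bins in their natural order.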