making a table with multiple columns in r - r

I´m obviously a novice in writing R-code.
I have tried multiple solutions to my problem from stackoverflow but I'm still stuck.
My dataset is carcinoid, patients with a small bowel cancer, with multiple variables.
i would like to know how different variables are distributed
carcinoid$met_any - with metastatic disease 1=yes, 2=no(computed variable)
carcinoid$liver_mets_y_n - liver metastases 1=yes, 2=no
carcinoid$regional_lymph_nodes_y_n - regional lymph nodes 1=yes, 2=no
peritoneal_carcinosis_y_n - peritoneal carcinosis 1=yes, 2=no
i have tried this solution which is close to my wanted result
ddply(carcinoid, .(carcinoid$met_any), summarize,
livermetastases=sum(carcinoid$liver_mets_y_n=="1"),
regionalmets=sum(carcinoid$regional_lymph_nodes_y_n=="1"),
pc=sum(carcinoid$peritoneal_carcinosis_y_n=="1"))
with the result being:
carcinoid$met_any livermetastases regionalmets pc
1 1 21 46 7
2 2 21 46 7
Now, i expected the row with 2(=no metastases), to be empty. i would also like the rows in the column carcinoid$met_any to give the number of patients.
If someone could help me it would be very much appreciated!
John
Edit
My dataset, although the column numbers are: 1, 43,28,31,33
1=yes2=no
case_nr met_any liver_mets_y_n regional_lymph_nodes_y_n pc
1 1 1 1 2
2 1 2 1 2
3 2 2 2 2
4 1 2 1 1
5 1 2 1 1
desired output - I want to count the numbers of 1:s and 2:s, if it works, all 1:s should end up in the met_any=1 row
nr liver_mets regional_lymph_nodes pc
met_any=1 4 1 4 2
met_any=2 1 4 1 3
EDIT
Although i probably was very unclear in my question, with your help i could make the table i needed!
setDT(carcinoid)[,lapply(.SD,table),.SDcols=c(43,28,31,33,17)]
gives
met_any lymph_nod liver_met paraortal extrahep
1: 50 46 21 6 15
2: 111 115 140 151 146
i am very grateful! #mtoto provided the solution
John

Based on your example data, this data.table approach works:
library(data.table)
setDT(df)[,lapply(.SD,table),.SDcols=c(2:5)]
# met_any liver_mets_y_n regional_lymph_nodes_y_n pc
# 1: 4 1 4 2
# 2: 1 4 1 3

Related

Sum variables conditionally with loop in r

I realize this is a topic that's covered somewhat well but I couldn't find anything that approaches this specific concern:
I have a df with 800 columns, 10 iterations of 80 columns (each column represents an item) - Each column is named something like: 1_BL_PRE.1 1_FU_PRE.1 1_BL_PRE.1 1_BL_POST.1
Where the first '1' indicates the item number and the second '1' indicates the iteration number.
What I'm trying to figure out is how to get the sums of specific groups of items from all 10 iterations.
As a short example let's say I want to take the 1st and 3rd item of BL_PRE and get the sum of all 10 iterations for those 2 items - how would I do this?
subject 1_BL_PRE.1 2_BL_PRE.1 3_BL_PRE.1 1_BL_PRE.2 2_BL_PRE.2
1 40002 3 4 3 1 2
2 40004 1 2 3 4 4
3 40006 4 3 3 3 1
4 40008 2 3 1 2 3
5 40009 3 4 1 2 3
Expected output (where A represents the sum of 1_BL_PRE.1, 3_BL_PRE.1, 1_BL_PRE.2 and so on):
subject BL_PRE_A
1 40002 12
2 40004 14
3 40006 15
4 40008 20
5 40009 12
My hunch is the solution is related to a for-loop or lappy (and I'm not familiar at all with either). I'm trying to work with apply(finaldata,1,function(x) {sum(x ...)}) but I haven't been able to figure out the conditional statement for the function of sum.
If there's an implementation with plyr I'd be really curious to see what that looks like. (and if there's a thread that answers this, apologies and just re-direct!)
**Edited to include small example + code I'm trying to get to work
Thanks!

Grouping and building intervals of data in R and useful visualization

I have some data extracted via HIVE. In the end we are talking of csv with around 500 000 rows. I want to plot them after grouping them in intervals.
Beside the grouping it's not clear how to visualize the data. Since we are talking about low spends and sometimes a high frequency I'm not sure how to handle this problem.
Here is just an overview via head(data)
userid64 spend freq
575033023245123 0.00924205 489
12588968125440467 0.00037 2
13830962861053825 0.00168 1
18983461971805285 0.001500366 333
25159368164208149 0.00215 1
32284253673482883 0.001721303 222
33221593608613197 0.00298 709
39590145306822865 0.001785281 11
45831636009567401 0.00397 654
71526649454205197 0.000949978 1
78782620614743930 0.00552 5
I want to group the data in intervals. So I want an extra columns indicating the groups. The first group should contain all data with an frequency (called freq) between 1 and 100. The second group should contain all rows where there entries have a frequency between 101 and 200... and so on.
The result should look like
userid64 spend freq group
575033023245123 0.00924205 489 5
12588968125440467 0.00037 2 1
13830962861053825 0.00168 1 1
18983461971805285 0.001500366 333 3
25159368164208149 0.00215 1 1
32284253673482883 0.001721303 222 2
33221593608613197 0.00298 709 8
39590145306822865 0.001785281 11 1
45831636009567401 0.00397 654 7
71526649454205197 0.000949978 1 1
78782620614743930 0.00552 5 1
Is there a nice and gentle art to get this? I need this grouping for upcoming plots. I want to do visualization for all intervals to get an overview regarding the spend. If you have any ideas for the visualization please let me know. I thought I should work with boxplots.
If you want to group freq for every 100 units, you can try ceiling function in base R
ceiling(df$freq / 100)
#[1] 5 1 1 4 1 3 8 1 7 1 1
where df is your dataframe.

Calculate degrees from group variables in panel data in R

I have a longitudinal data-set that looks like this:
id date group
1 jan-13 1
2 jan-13 1
3 jan-13 2
1 fev-13 3
2 fev-13 4
2 fev-13 3
3 fev-13 4
1 mar-13 5
2 mar-13 6
3 mar-13 5
It represents a network, each individual is connected to other individuals in period t if they were in the same group in any period before t (including t). Therefore in feb-13 indidual 1 is only conected to individual 2.
I want to calculate the degrees for every individual at every period. In this case the final dataset that I want to create would look like this:
id date degree
1 jan-13 1
2 jan-13 1
3 jan-13 0
1 fev-13 1
2 fev-13 2
3 fev-13 1
1 mar-13 2
2 mar-13 2
3 mar-13 2
I have tried some things using for and aggregate but it is not very efficient (it is taking more than a day and hasn't finished). The data-set is very large, so usual packages that work with networks are not working here.
Edit:
Ok, sorry, it seems I misunterstood your question. Did you check if any of the network data packages for R does what you want? If you create a relational data set it should be easy to get what you want, maybe this tutorial helps:
https://statnet.org/trac/raw-attachment/wiki/Resources/introToSNAinR_sunbelt_2012_tutorial.pdf

How does one ignore the "moral" column in R

moral verw ho dog
4 1049 1 2
4 2799 1 3
2 8412 4 4
4 2122 1 3
4 2171 1 3
4 2241 1 2
4 3398 1 4
I was normalizing a dataset using
noid = data.Normalization(newx,type="n4") but I want to ignore JUST the "moral" column but normalize everything else.
Any help will be greatly appreciated.
As David Arenburg suggested in his comment, if you don't pass the moral column to the normalisation routine it will obviously be ignored. If you want it to be included in the normalized dataset you can do something like this:
noid <- data.Normalization(newx[-1], type="n4")
noid <- cbind(moral=newx[1], noid)

How can I sort the X axis in a Barplot in R?

I have binned data that looks like this:
(8.048,18.05] (-21.95,-11.95] (-31.95,-21.95] (18.05,28.05] (-41.95,-31.95]
81 76 18 18 12
(-132,-122] (-122,-112] (-112,-102] (-162,-152] (-102,-91.95]
6 6 6 5 5
(-91.95,-81.95] (-192,-182] (28.05,38.05] (38.05,48.05] (58.05,68.05]
5 4 4 4 4
(78.05,88.05] (98.05,108] (-562,-552] (-512,-502] (-482,-472]
4 4 3 3 3
(-452,-442] (-412,-402] (-282,-272] (-152,-142] (48.05,58.05]
3 3 3 3 3
(68.05,78.05] (118,128] (128,138] (-582,-572] (-552,-542]
3 3 3 2 2
(-532,-522] (-422,-412] (-392,-382] (-362,-352] (-262,-252]
2 2 2 2 2
(-252,-242] (-142,-132] (-81.95,-71.95] (148,158] (-1402,-1392]
2 2 2 2 1
(-1372,-1362] (-1342,-1332] (-942,-932] (-862,-852] (-822,-812]
1 1 1 1 1
(-712,-702] (-682,-672] (-672,-662] (-632,-622] (-542,-532]
1 1 1 1 1
(-502,-492] (-492,-482] (-472,-462] (-462,-452] (-442,-432]
1 1 1 1 1
(-432,-422] (-352,-342] (-332,-322] (-312,-302] (-302,-292]
1 1 1 1 1
(-202,-192] (-182,-172] (-172,-162] (-51.95,-41.95] (88.05,98.05]
1 1 1 1 1
(108,118] (158,168] (168,178] (178,188] (298,308]
1 1 1 1 1
(318,328] (328,338] (338,348] (368,378] (458,468]
1 1 1 1 1
How can I plot this data so that the bin is sorted from most negative on the left to most positive on the right? Currently my graph looks like this. Notice that it is not sorted at all. In particular the second bar (value = 76) is placed to the right of the first:
(8.048,18.05] (-21.95,-11.95]
81 76
This is the command I use to plot:
barplot(x,ylab="Number of Unique Tags", xlab="Expected - Observed")
I really want to help answer your question, but I gotta tell you, I can't make heads or tails of your data. I see a lot of opening parenthesis but no closing ones. The data looks sorted descending by whatever the values are on the bottom of each row. I have no idea what to make out of a value like "(8.048,18.05]"
Am I missing something obvious? Can you make a more simple example where your data structure is not a factor?
I would generally expect a data frame or a matrix with two columns, one for the X and one for the Y.
See if this example of sorting helps (I'm sort of shooting in the dark here)
tN <- table(Ni <- rpois(100, lambda=5))
r <- barplot(tN)
#stop here and examine the plot
#the next bit converts the matrix to a data frame,
# sorts it, and plots it again
df<-data.frame(tN)
df2<-df[order(df$Freq),]
barplot(df2$Freq)

Resources