Grouping and building intervals of data in R and useful visualization - r

I have some data extracted via HIVE. In the end we are talking of csv with around 500 000 rows. I want to plot them after grouping them in intervals.
Beside the grouping it's not clear how to visualize the data. Since we are talking about low spends and sometimes a high frequency I'm not sure how to handle this problem.
Here is just an overview via head(data)
userid64 spend freq
575033023245123 0.00924205 489
12588968125440467 0.00037 2
13830962861053825 0.00168 1
18983461971805285 0.001500366 333
25159368164208149 0.00215 1
32284253673482883 0.001721303 222
33221593608613197 0.00298 709
39590145306822865 0.001785281 11
45831636009567401 0.00397 654
71526649454205197 0.000949978 1
78782620614743930 0.00552 5
I want to group the data in intervals. So I want an extra columns indicating the groups. The first group should contain all data with an frequency (called freq) between 1 and 100. The second group should contain all rows where there entries have a frequency between 101 and 200... and so on.
The result should look like
userid64 spend freq group
575033023245123 0.00924205 489 5
12588968125440467 0.00037 2 1
13830962861053825 0.00168 1 1
18983461971805285 0.001500366 333 3
25159368164208149 0.00215 1 1
32284253673482883 0.001721303 222 2
33221593608613197 0.00298 709 8
39590145306822865 0.001785281 11 1
45831636009567401 0.00397 654 7
71526649454205197 0.000949978 1 1
78782620614743930 0.00552 5 1
Is there a nice and gentle art to get this? I need this grouping for upcoming plots. I want to do visualization for all intervals to get an overview regarding the spend. If you have any ideas for the visualization please let me know. I thought I should work with boxplots.

If you want to group freq for every 100 units, you can try ceiling function in base R
ceiling(df$freq / 100)
#[1] 5 1 1 4 1 3 8 1 7 1 1
where df is your dataframe.

Related

Delete rows when certain factor is present more than 200 times

I have a dataset with over 400,000 cows. These cows are (unevenly) spreak over 2355 herds. Some herds are only present once in the data, while one herd is even present 2033 times in the data, meaning that 2033 cows belong to this herd. I want to delete herds from my data that occur less than 200 times.
With use of plyr and subset, I can obtain a list of which herds occur less than 200 times, I however can not find out how to apply this selection to the full dataset.
For example, my current data looks a little like:
cow herd
1 1
2 1
3 1
4 2
5 3
6 4
7 4
8 4
With function count() I can obtain the following:
x freq
1 3
2 1
3 1
4 3
Say I want to delete the data belonging to herds that occur less than 3 times, I want my data to look like this eventually:
cow herd
1 1
2 1
3 1
6 4
7 4
8 4
I do know how to tell R to delete data herd by herd, however since, in my real datatset, over 1000 herds occur less then 200 times, it would mean that I would have to type every herd number in my script one by one. I am sure there is an easier and quicker way of asking R to delete data above or below a certain occurence.
I hope my explanation is clear and someone can help me, thanks in advance!
Use n + group_by:
library(dplyr)
your_data %>%
group_by(herd) %>%
filter(n() >= 3)

making a table with multiple columns in r

I´m obviously a novice in writing R-code.
I have tried multiple solutions to my problem from stackoverflow but I'm still stuck.
My dataset is carcinoid, patients with a small bowel cancer, with multiple variables.
i would like to know how different variables are distributed
carcinoid$met_any - with metastatic disease 1=yes, 2=no(computed variable)
carcinoid$liver_mets_y_n - liver metastases 1=yes, 2=no
carcinoid$regional_lymph_nodes_y_n - regional lymph nodes 1=yes, 2=no
peritoneal_carcinosis_y_n - peritoneal carcinosis 1=yes, 2=no
i have tried this solution which is close to my wanted result
ddply(carcinoid, .(carcinoid$met_any), summarize,
livermetastases=sum(carcinoid$liver_mets_y_n=="1"),
regionalmets=sum(carcinoid$regional_lymph_nodes_y_n=="1"),
pc=sum(carcinoid$peritoneal_carcinosis_y_n=="1"))
with the result being:
carcinoid$met_any livermetastases regionalmets pc
1 1 21 46 7
2 2 21 46 7
Now, i expected the row with 2(=no metastases), to be empty. i would also like the rows in the column carcinoid$met_any to give the number of patients.
If someone could help me it would be very much appreciated!
John
Edit
My dataset, although the column numbers are: 1, 43,28,31,33
1=yes2=no
case_nr met_any liver_mets_y_n regional_lymph_nodes_y_n pc
1 1 1 1 2
2 1 2 1 2
3 2 2 2 2
4 1 2 1 1
5 1 2 1 1
desired output - I want to count the numbers of 1:s and 2:s, if it works, all 1:s should end up in the met_any=1 row
nr liver_mets regional_lymph_nodes pc
met_any=1 4 1 4 2
met_any=2 1 4 1 3
EDIT
Although i probably was very unclear in my question, with your help i could make the table i needed!
setDT(carcinoid)[,lapply(.SD,table),.SDcols=c(43,28,31,33,17)]
gives
met_any lymph_nod liver_met paraortal extrahep
1: 50 46 21 6 15
2: 111 115 140 151 146
i am very grateful! #mtoto provided the solution
John
Based on your example data, this data.table approach works:
library(data.table)
setDT(df)[,lapply(.SD,table),.SDcols=c(2:5)]
# met_any liver_mets_y_n regional_lymph_nodes_y_n pc
# 1: 4 1 4 2
# 2: 1 4 1 3

Printing only certain panels in R lattice

I am plotting a quantile-quantile plot for a certain data that I have. I would like to print only certain panels that satisfy a condition that I put in for panel.qq(x,y,...).
Let me give you an example. The following is my code,
qq(y ~ x|cond,data=test.df,panel=function(x,y,subscripts,...){
if(length(unique(test.df[subscripts,2])) > 3 ){panel.qq(x,y,subscripts,...})})
Here y is the factor and x is the variable that will be plotted on X and y axis. Cond is the conditioning variable. What I would like is, only those panels be printed that pass the condition in the panel function, which is
if(length(unique(test.df[subscripts,2])) > 3).
I hope this information helps. Thanks in advance.
Added Sample data,
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. What I would like is to not have a panel for 123 as the number of unique values for 123 is 3, while for others its 4. Thanks again.
Yeah, I think it is a subset problem, not a lattice one. You don't include an example, but it looks like you want to keep only rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[,N:=.N,by=c2][N>3]
Where c2 is that variable in the second column. The last line of code first adds a variable, N, for the count of rows (.N) for each value of c2, then subsets for N>3.
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x=1:15,y=1:15%%2, # example data frame
c2=c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
d$N <- 1 # create a column for count
split(d$N,d$c2) <- lapply(split(d$x,d$c2),length) # populate with count
d
d[d$N>3,] # subset
I did something very similar to DaveTurek.
My sample dataframe above is test.df
test.df.list <- split(test.df,test.df$cond,drop=F)
final.test.df <- do.call("rbind",lapply(test.df.list,function(r){
if(length(unique(r$x)) > 3){r}})
So, here I am breaking the test.df as a list of data.frames by the conditioning variable. Next, in the lapply I am checking the number of unique values in each of subset dataframe. If this number is greater than 3 then the dataframe is given /taken back if not it is ignored. Next, a do.call to bind all the dfs back to one big df to run the quantile quantile plot on it.
In case anyone wants to know the qq function call after getting the specific data. then it is,
trellis.device(postscript,file="test.ps",color=F,horizontal=T,paper='legal')
qq(y ~ x|cond,data=final.test.df,layout=c(1,1),pch=".",cex=3)
dev.off()
Hope this helps.

Retrieving the name of a particular column of a table in R

I have the next data frame (matrix):
> Table
Clima Crecimiento
1 1 350
2 1 375
3 1 360
4 1 400
5 1 380
6 2 500
7 2 530
8 2 520
9 2 550
10 2 545
I would like to know the command to retrieve the name of the first column exclusively (Clima), as it is a categorical column and I need to introduce it in a code I am preparing to ease discriminant function analysis.
I tryed using names(), colnames[.1], with no succeed. I suppose I am near to the solution, however, some missing code is needed in order to solve what I request.
Thanks

ggplot2 plotting single data frame with 10 different levels

I am using R studio
I have a single data frame with three columns titled
colnames(result)
[1] "v" "v2" "Lambda"
I wish to use ggplot2 to create an overlay plot assigning 10 different colors to each of the 10 different values in the Lambda column
summary(result$Lambda)
1 2 3 4 5 6 7 8 9 10
101 100 100 100 100 100 100 100 100 100
Now I created a factor for the Lambda values as follows:
result$Lambda<-factor(result$Lambda, levels=c(1,2,3,4,5,6,7,8,9,10), labels=c("1","2","3","4","5","6","7","8","9","10"))
qplot(x=v, y=v2, data=result, geom="line", fill=Lambda main="Newton Revolved Plot", xlab="x=((L/2)*((1/t)+2*t+t^(3)))", ylab="y=(L/2)*(log(1/t)+t^(2)+(3/4)*t^(4))-(7*L/8)")
but this command does not work.
I am just inquiring about how to plot 10 different plots, one plot for each Lambda value. Or perhaps resources for the novice..
Thank you in advance
2.000000e+00 0.000000e+00 1
3 6.250000e+00 6.778426e+00 1
4 1.666667e+01 3.345069e+01 1
5 3.612500e+01 1.024319e+02 1
6 6.760000e+01 2.451953e+02 1
7 1.140833e+02 5.022291e+02 1
8 1.785714e+02 9.230270e+02 1
9 2.640625e+02 1.566085e+03 1
10 3.735556e+02 2.498901e+03 1
105 7.225000e+01 2.048637e+02 2
106 1.352000e+02 4.903906e+02 2
107 2.281667e+02 1.004458e+03 2
108 3.571429e+02 1.846054e+03 2
109 5.281250e+02 3.132171e+03 2
110 7.471111e+02 4.997803e+03 2
111 1.020100e+03 7.595947e+03 2
112 1.353091e+03 1.109760e+04 2
250 1.766205e+05 6.488994e+06 3
251 1.876500e+05 7.034992e+06 3
252 1.991295e+05 7.614744e+06 3
253 2.110680e+05 8.229615e+06 3
254 2.234745e+05 8.880996e+06 3
255 2.363580e+05 9.570303e+06 3
Here is some sample date, the first column is the "v", the second column is "v2" and the third column is the "Lambda" column. (not include the 0th column as the index)
Just to rephrase my R problems, I have a single data frame with 10 levels each with 100 entries (roughly.. see above for exact count). and wish to use ggplot2 to plot each level a different color.
there are two ways I can think of
1) Find the correct ggplot2 option and distinguish each level
2) split this single data frame titled "result" into 10 subsets.
thank you very much in advance.
for a line use
color=
instead of
fill=
But without any data to reproduce it is hard to know if that is what you want.

Resources