How to determine which columns have no variability - R

I have a dataset with 50 columns and over 100,000 observations. I want to determine which columns have no variability (i.e., all rows contain the same value), print the names of the columns, and then remove those columns.
I tried using this code:
names(Filter(function(x) length(unique(x)) != 0, df))
but I am not sure it does what I want to do. I think it lists the columns that contain unique values, meaning no identical values?
Here is a sample from the data I am using:
135 24437208 1 2 1 Cardiology ? <30 8 8 None No No Ch No No Yes No No No No No Down No Steady None Steady No No No No No No No No No No 77 33 6 0 0 0 [50-60) Female Caucasian ? 401.00 997.00 560.00
2 135 26264286 7 1 1 Surgery-Cardiovascular/Thoracic ? >30 3 5 None No No Ch No No Yes No No No No No Steady No No None Steady No No No No No No No No No No 31 14 1 0 1 0 [50-60) Female Caucasian ? 998.00 41.00 250.00
3 378 29758806 1 3 1 Surgery-Neuro ? NO 2 3 None No No No No No No No No No No No No No No None No No No No No No No No No No No 49 11 1 0 0 0 [50-60) Female Caucasian ? 722.00 305.00 250.00
4 729 189899286 7 1 3 InternalMedicine MC NO 4 9 >7 No No No No No Yes No No No No No No No No None Steady No No No No No No No No No No 68 23 2 0 0 0 [80-90) Female Caucasian ? 820.00 493.00 880.00
5 774 64331490 7 1 1 InternalMedicine ? NO 3 9 >8 No No Ch No No Yes No No No No No Steady No No None Steady No No No No No No No No No No 46 20 0 0 0 0 [80-90) Female Caucasian ? 274.00 427.00 416.00
6 927 14824206 7 1 1 InternalMedicine ? NO 5 3 None No No No No No Yes No Steady No No No No No No None No No No No No No No No No No No 49 5 0 0 0 0 [30-40) Female AfricanAmerican ? 590.00 220.00 250.00
7 1152 8380170 7 1 1 Hematology/Oncology ? >30 6 2 None No No No No No Yes No No No No No No No Steady None No No No No No No No No No No No 43 13 2 0 1 0 [50-60) Female AfricanAmerican ? 282.00 250.01 NA
8 1152 30180318 7 1 1 Hematology/Oncology ? >30 6 6 None No No Ch No No Yes No No No No No No No Down None No No No No No No No No No No No 45 15 4 0 2 0 [50-60) Female AfricanAmerican ? 282.00 794.00 250.00
9 1152 55533660 7
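For reference, here is a minimal sketch of one way to do what the question asks (column and data frame names are illustrative). A column has no variability when it holds exactly one distinct value, so the test should be `length(unique(x)) == 1`; the attempt's `!= 0` is true for every non-empty column and would return all names.

```r
# Toy data frame: "b" and "d" have no variability
df <- data.frame(a = c(1, 2, 3),
                 b = c(5, 5, 5),
                 d = c("x", "x", "x"),
                 e = c("p", "q", "p"))

# Names of the columns where every row holds the same value
constant_cols <- names(df)[sapply(df, function(x) length(unique(x)) == 1)]
print(constant_cols)   # "b" "d"

# Drop those columns
df_varying <- df[, !(names(df) %in% constant_cols), drop = FALSE]
```

The `drop = FALSE` guards against the result collapsing to a vector if only one column survives.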

Related

Mantel-Haenszel test for trend R

I need to perform a meta-analysis of two studies, each with case/control counts (CA/CO) across multiple levels of exposure.
data1:
level  CA   CO
0      405  457
1      101  108
2       22   16
3        6    5
4        0    1
data2:
level  CA   CO
0      154  141
1       42   65
2       13    9
3        2    2
4        0    0
But I can't find a way to do it with metafor in R; can anyone help?
Thank you very much.
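Not a metafor answer, but as one possible direction (a sketch, not a definitive solution): base R's `mantelhaen.test` accepts a three-dimensional table, so the two studies can be stacked as strata and tested with a generalized Cochran-Mantel-Haenszel test. Note this tests general association, not specifically a linear trend; score-based trend tests live in other packages.

```r
# Case (CA) and control (CO) counts by exposure level (0-4) for each study
study1 <- cbind(CA = c(405, 101, 22, 6, 0), CO = c(457, 108, 16, 5, 1))
study2 <- cbind(CA = c(154, 42, 13, 2, 0),  CO = c(141, 65, 9, 2, 0))

# 5 x 2 x 2 array: exposure level x case/control status x study (stratum)
tab <- array(c(study1, study2), dim = c(5, 2, 2),
             dimnames = list(exposure = 0:4,
                             status   = c("CA", "CO"),
                             study    = c("data1", "data2")))

# Generalized Cochran-Mantel-Haenszel test, stratified by study
mantelhaen.test(tab)
```

Whether this answers the meta-analytic question depends on what effect measure is wanted; metafor itself works from per-study effect sizes (e.g. via `escalc`) rather than raw stratified tables.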

How to remove rows based on distance from an average of column and max of another column

Consider this toy data frame. I would like to create a new data frame containing only the rows where "birds" is below the column's average and "wolfs" is below its two largest values (i.e., less than the third-largest value). So from this data frame I should get only rows: 543, 608, 987, 225, 988, 556.
I used these two lines of code for the first constraint but couldn't find a solution for the second constraint:
df$filt <- ifelse(df$birds < mean(df$birds), 1, 0)
df1 <- df[which(df$filt == 1), ]
How can I implement the second constraint?
Here is the toy dataframe:
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 1
608 0 1 5
123 1 9 7
321 1 8 7
226 0 2 7
556 0 2 3
334 1 6 3
225 0 1 1
999 0 3 9
988 0 1 1 ",header = TRUE)
subset(df,birds < mean(birds) & wolfs < sort(unique(wolfs),decreasing=T)[3]);
## userid target birds wolfs
## 4 543 1 2 3
## 6 987 0 1 2
## 8 608 0 1 5
## 12 556 0 2 3
## 14 225 0 1 1
## 16 988 0 1 1
Here is a solution, though maybe some constraints are not clear to me, because it fits a different set of rows than your desired output.
avbi <- mean(df$birds)
ttw <- sort(df$wolfs, decreasing = T)[3]
df[df$birds < avbi & df$wolfs < ttw , ]
userid target birds wolfs
4 543 1 2 3
6 987 0 1 2
8 608 0 1 5
12 556 0 2 3
14 225 0 1 1
16 988 0 1 1
or with dplyr:
library(dplyr)
df %>% filter(birds < avbi & wolfs < ttw)

How to calculate this variable in R

I have the following data:
mydf[77:84,]
id game_week points code web_name first_name second_name position team_name date fixture team1 team2 home_away team_scored team_conceded minutes goals assists cleansheet goals_conceded own_goals
77 3 1 -2 51507 Koscielny Laurent Koscielny Defender Arsenal 17/08/13 ARS-AVL ARS AVL H 1 3 67 0 0 0 3 0
78 3 2 0 51507 Koscielny Laurent Koscielny Defender Arsenal 24/08/13 FUL-ARS ARS FUL A 3 1 0 0 0 0 0 0
79 3 3 6 51507 Koscielny Laurent Koscielny Defender Arsenal 01/09/13 ARS-TOT ARS TOT H 1 0 90 0 0 1 0 0
80 3 4 2 51507 Koscielny Laurent Koscielny Defender Arsenal 14/09/13 SUN-ARS ARS SUN A 3 1 90 0 0 0 1 0
81 3 5 2 51507 Koscielny Laurent Koscielny Defender Arsenal 22/09/13 ARS-STK ARS STK H 3 1 90 0 0 0 1 0
82 3 6 2 51507 Koscielny Laurent Koscielny Defender Arsenal 28/09/13 SWA-ARS ARS SWA A 2 1 90 0 0 0 1 0
83 3 7 3 51507 Koscielny Laurent Koscielny Defender Arsenal 06/10/13 WBA-ARS ARS WBA A 1 1 90 0 0 0 1 0
84 3 8 2 51507 Koscielny Laurent Koscielny Defender Arsenal 19/10/13 ARS-NOR ARS NOR H 4 1 90 0 0 0 1 0
As part of a modeling exercise, I want to create a new variable, "mov_avg_min", which for a given "id" is the average of "minutes" played in the last 3 "game_week"s. For example, for web_name "Koscielny" the distinct "id" is 3 in this data frame. So for id = 3 and game_week = 4, a function should calculate mov_avg_min over game_weeks 1:3 (the 3 game_weeks before the current one for the same id). Hence in row 80, mov_avg_min = (67 + 0 + 90)/3 = 52.333.
I think rollapply (from the zoo package) with width = 3 will include the value of the current row. So for game 4 it would give you the average of minutes in games 2, 3 and 4. You have to lag the minutes column first in order to get the average based on games 1, 2 and 3. See a simple example below:
library(dplyr)
library(zoo)
dt = data.frame(id = c(1,1,1,1,1,2,2,2,2,2),
games = c(1,2,3,4,5,1,2,3,4,5),
minutes = c(61,72,73,82,82,81,71,51,90,73))
dt
# id games minutes
# 1 1 1 61
# 2 1 2 72
# 3 1 3 73
# 4 1 4 82
# 5 1 5 82
# 6 2 1 81
# 7 2 2 71
# 8 2 3 51
# 9 2 4 90
# 10 2 5 73
dt %>% group_by(id) %>%
mutate(lag_minutes = lag(minutes, default=NA)) %>%
mutate(RA = rollapply(lag_minutes,width=3,mean, align= "right", fill=NA))
# Source: local data frame [10 x 5]
# Groups: id
#
# id games minutes lag_minutes RA
# 1 1 1 61 NA NA
# 2 1 2 72 61 NA
# 3 1 3 73 72 NA
# 4 1 4 82 73 68.66667
# 5 1 5 82 82 75.66667
# 6 2 1 81 NA NA
# 7 2 2 71 81 NA
# 8 2 3 51 71 NA
# 9 2 4 90 51 67.66667
# 10 2 5 73 90 70.66667

compute a Means variable for a specific value in another variable

I would like to compute the mean age for every value from 1-7 in another variable called period.
This is what my data looks like:
work1 <- read.table(header=T, text="ID dead age gender inclusion_year diagnosis surv agrp period
87 0 25 2 2006 1 2174 1 5
396 0 19 2 2003 1 3077 1 3
446 0 23 2 2003 1 3144 1 3
497 0 19 2 2011 1 268 1 7
522 1 57 2 1999 1 3407 2 1
714 0 58 2 2003 1 3041 2 3
741 0 27 2 2004 1 2587 1 4
767 0 18 1 2008 1 1104 1 6
786 0 36 1 2005 1 2887 3 4
810 0 25 1 1998 1 3783 4 2")
This is a subset of a dataset with more than 1,500 observations.
This is what I'm trying to achieve:
sim <- read.table(header=T, text="Period diagnosis dead surv age
1 1 50 50000 35.5
2 1 80 70000 40.3
3 1 100 80000 32.8
4 1 120 100000 39.8
5 1 140 1200000 28.7
6 1 150 1400000 36.2
7 1 160 1600000 37.1")
In this data set I would like to group by period and diagnosis, with all deaths (dead) and surv (survival time in days) summed within each period. I would also like the mean age in every period.
Have tried everything, still can't create the data set I'm striving for.
All help is appreciated!
You could try data.table
library(data.table)
as.data.table(work1)[, .(dead_sum=sum(dead),
surv_sum=sum(surv),
age_mean=mean(age)), keyby=.(period, diagnosis)]
Or dplyr
library(dplyr)
work1 %>% group_by(period, diagnosis) %>%
summarise(dead_sum=sum(dead), surv_sum=sum(surv), age_mean=mean(age))
# result
period diagnosis dead_sum surv_sum age_mean
1: 1 1 1 3407 57.00000
2: 2 1 0 3783 25.00000
3: 3 1 0 9262 33.33333
4: 4 1 0 5474 31.50000
5: 5 1 0 2174 25.00000
6: 6 1 0 1104 18.00000
7: 7 1 0 268 19.00000

R: calculate and superimpose on a ggplot graph minute-based totals for an event series

Consider a data frame df with an extract from a web server access log, with two fields (sample below, duration is in msec and to simplify the example, let's ignore the date).
time,duration
18:17:26.552,8
18:17:26.632,10
18:17:26.681,12
18:17:26.733,4
18:17:26.778,5
18:17:26.832,5
18:17:26.889,4
18:17:26.931,3
18:17:26.991,3
18:17:27.040,5
18:17:27.157,4
18:17:27.209,14
18:17:27.249,4
18:17:27.303,4
18:17:27.356,13
18:17:27.408,13
18:17:27.450,3
18:17:27.506,13
18:17:27.546,3
18:17:27.616,4
18:17:27.664,4
18:17:27.718,3
18:17:27.796,10
18:17:27.856,3
18:17:27.909,3
18:17:27.974,3
18:17:28.029,3
qplot(time, duration, data=df) gives me a graph of the durations. I'd like to superimpose a line showing the number of requests in each minute. Ideally, this line would have a single data point per minute, at the :30 sec point. If that's too complicated, an acceptable alternative is a step line holding the same value (the request count) throughout each minute.
One way is to trunc(df$time, units=c("mins")), then calculate the count of request per minute into a new column then graph it.
I'm asking if there is, perhaps, a more direct way to accomplish the above. Thanks.
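For what it's worth, a sketch of the truncate-and-count approach described above (using an abbreviated version of the sample; the column and variable names are illustrative). The `%OS` format keeps the fractional seconds when parsing, each request is bucketed into its minute, and the per-minute counts are drawn as a second ggplot layer with the point placed at the :30 mark of each minute.

```r
library(ggplot2)

# Abbreviated version of the access-log extract
df <- read.csv(text = "time,duration
18:17:26.552,8
18:17:26.632,10
18:17:27.040,5
18:18:01.100,4", stringsAsFactors = FALSE)

# Parse times; %OS keeps the milliseconds (the date part is irrelevant here)
df$t <- as.POSIXct(df$time, format = "%H:%M:%OS")

# Count the requests falling in each minute
df$minute <- format(df$t, "%H:%M")
counts <- as.data.frame(table(minute = df$minute), responseName = "n")

# Place each count at the :30 point of its minute
counts$mid <- as.POSIXct(paste0(counts$minute, ":30"), format = "%H:%M:%S")

ggplot(df, aes(t, duration)) +
  geom_point() +
  geom_line(data = counts, aes(mid, n), colour = "red") +
  geom_point(data = counts, aes(mid, n), colour = "red")
```

Supplying a different `data` argument to individual geoms is the idiomatic ggplot2 way to overlay a derived series on the raw points.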
The following may be helpful. Create a data frame with step counts per 30-second window and plot it:
time duration sec sec2 diffsec2 step30s steps
1 18:17:26.552 8 26.552 552 0 0 0
2 18:17:26.632 10 26.632 632 80 1 1
3 18:17:26.681 12 26.681 681 49 0 0
4 18:17:26.733 4 26.733 733 52 1 1
5 18:17:26.778 5 26.778 778 45 0 0
6 18:17:26.832 5 26.832 832 54 1 1
7 18:17:26.889 4 26.889 889 57 1 2
8 18:17:26.931 3 26.931 931 42 0 0
9 18:17:26.991 3 26.991 991 60 1 1
10 18:17:27.040 5 27.040 040 -951 0 0
11 18:17:27.157 4 27.157 157 117 1 1
12 18:17:27.209 14 27.209 209 52 1 2
13 18:17:27.249 4 27.249 249 40 0 0
14 18:17:27.303 4 27.303 303 54 1 1
15 18:17:27.356 13 27.356 356 53 1 2
16 18:17:27.408 13 27.408 408 52 1 3
17 18:17:27.450 3 27.450 450 42 0 0
18 18:17:27.506 13 27.506 506 56 1 1
19 18:17:27.546 3 27.546 546 40 0 0
20 18:17:27.616 4 27.616 616 70 1 1
21 18:17:27.664 4 27.664 664 48 0 0
22 18:17:27.718 3 27.718 718 54 1 1
23 18:17:27.796 10 27.796 796 78 1 2
24 18:17:27.856 3 27.856 856 60 1 3
25 18:17:27.909 3 27.909 909 53 1 4
26 18:17:27.974 3 27.974 974 65 1 5
27 18:17:28.029 3 28.029 029 -945 0 0
ggplot(ddf) + geom_point(aes(x=time, y=duration)) +
  geom_line(aes(x=time, y=steps, group=1), color='red')
