Trying to translate a SAS GLM to R sasLM::GLM

I have a data set like this:
X1 Record Plot Row Column Cp Csp Entry Year Location Genotype Trait Value Whole_plot
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 3256 717 566 6 7 0 2 717 2019 Preston Novelty STD 5 6 + 7
2 3263 716 567 6 7 0 1 716 2019 Preston Flanders STD 4 6 + 7
3 3893 716 657 7 8 0 1 716 2019 Preston Flanders STD 2 7 + 8
4 3900 717 658 7 8 0 2 717 2019 Preston Novelty STD 2 7 + 8
5 4698 716 772 9 3 0 1 716 2019 Preston Flanders STD 3 9 + 3
6 4712 717 774 9 3 0 2 717 2019 Preston Novelty STD 3 9 + 3
7 3257 717 566 6 7 0 2 717 2019 Preston Novelty V1 5 6 + 7
8 3264 716 567 6 7 0 1 716 2019 Preston Flanders V1 4 6 + 7
9 3894 716 657 7 8 0 1 716 2019 Preston Flanders V1 3 7 + 8
10 3901 717 658 7 8 0 2 717 2019 Preston Novelty V1 3 7 + 8
The SAS code is:
PROC GLM DATA=MAD.MDA_plot_subplot_controls outstat=MAD.subplot_control_anova_stat;
BY Year Location Trait;
CLASS Whole_plot Genotype;
MODEL Value = Whole_plot Genotype/SS3;
RANDOM Whole_plot;
LSMEANS Whole_plot;
RUN;
QUIT;
What I tried was this:
require(sasLM)
glm(Value ~ factor(Row) + factor(Column),MDA_plot_controls)
but I get an error:
Error in glm(Value ~ factor(Row) + factor(Column), MDA_plot_controls) :
'family' not recognized
In addition: Warning message:
Unknown or uninitialised column: `family`.
I don't understand the 'family' column.
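
One likely reading of the error, offered as a sketch rather than a confirmed diagnosis: the call above is lower-case glm(), which R resolves to stats::glm() because R is case-sensitive, so it is not sasLM::GLM. The second positional argument of stats::glm() is family, so a data frame passed in that position without data = is treated as the family. The toy data frame below is hypothetical; only its column names mirror the question.

```r
# Hypothetical toy data; only the column names mirror the question.
toy <- data.frame(
  Value  = c(5, 4, 2, 2, 3, 3, 5, 4, 3),
  Row    = c(6, 6, 7, 7, 9, 9, 6, 7, 9),
  Column = c(7, 8, 8, 3, 3, 7, 3, 7, 8)
)

# Positional data lands in the `family` argument of stats::glm(), which fails.
msg <- tryCatch(
  {
    glm(Value ~ factor(Row) + factor(Column), toy)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Naming the argument sends the data frame to `data`; family keeps its default.
fit <- glm(Value ~ factor(Row) + factor(Column), data = toy)
print(coef(fit))
```

Note that the SAS MODEL statement uses Whole_plot and Genotype, so a formula closer to it would presumably be Value ~ Whole_plot + Genotype rather than Row and Column.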

Related

group_by one column but keeping multiples based off another column

I have a data frame with thousands of rows looking like this:
time Unique_ID Unix_Time Event Version
<dbl> <dbl> <dbl> <lgl> <dbl>
1 1404 4961657804 1565546745 FALSE 6
2 2534 4453645779 1550934792 FALSE 5
3 2114 3602935494 1512593418 TRUE 3
4 2605 5343699852 1586419012 TRUE 6
5 1246 5095942046 1572689498 FALSE 6
6 2519 3206995213 1495881898 TRUE 3
7 1419 4958551504 1565434177 TRUE 6
8 2262 5441937631 1590754817 TRUE 6
9 1650 3024892331 1488210079 TRUE 2
10 1880 3163703804 1494173662 FALSE 2
I manipulate the data frame using the following command:
df <- df %>%
group_by(minute = findInterval(time, seq(min(0), max(9000), 60))) %>%
summarise(Number= n(),
Won = sum(Event))
Now my data frame looks like this:
minute Number Won
<int> <int> <int>
1 55 264 128
2 71 34 17
3 31 1427 728
4 80 9 5
5 24 1197 673
6 141 1 1
7 53 326 163
8 30 1572 802
9 77 14 9
10 97 1 1
I would want something like this though:
minute Number Won Version
<int> <int> <int> <int>
1 55 264 128 1
2 55 34 17 2
3 55 1427 728 3
4 80 9 5 1
5 24 1197 673 1
6 141 1 1 2
7 53 326 163 3
8 53 1572 802 4
9 77 14 9 2
10 97 1 1 6
Is it possible to keep the rows with different Versions separated while grouping by time?

I think you can group by two columns, minute and Version, and then summarise as before:
df <- df %>%
  group_by(minute = findInterval(time, seq(0, 9000, 60)), Version) %>%
  summarise(Number = n(),
            Won = sum(Event))

Shorten tibble/df by removing duplicate entries inside tidyverse

I have a very big data frame from which I need the lossyear per Point:
# A tibble: 74,856 x 13
Date index Mean Sdev Median pixel_used doy Month Year_n Year lossyear Point Scene
<date> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <dbl> <int> <int> <int> <chr>
1 2013-06-11 NBR 0.481 0.0832 0.496 92647 162 6 2013 2013 2017 1 LC08_125016
2 2013-06-11 NDMI 0.175 0.0737 0.189 92647 162 6 2013 2013 2017 1 LC08_125016
3 2013-06-11 NDVI 0.734 0.0517 0.741 92647 162 6 2013 2013 2017 1 LC08_125016
4 2013-06-11 TCB 0.237 0.0159 0.235 92647 162 6 2013 2013 2017 1 LC08_125016
5 2013-06-11 TCG 0.158 0.0174 0.158 92647 162 6 2013 2013 2017 1 LC08_125016
6 2013-06-11 TCW -0.0958 0.0195 -0.0903 92647 162 6 2013 2013 2017 1 LC08_125016
7 2013-06-27 NBR 0.524 0.0503 0.525 39323 178 6 2013 2013 2017 1 LC08_125016
8 2013-06-27 NDMI 0.234 0.0464 0.236 39323 178 6 2013 2013 2017 1 LC08_125016
9 2013-06-27 NDVI 0.721 0.0351 0.725 39323 178 6 2013 2013 2017 1 LC08_125016
10 2013-06-27 TCB 0.249 0.0299 0.251 39323 178 6 2013 2013 2017 1 LC08_125016
# ... with 74,846 more rows
I was able to create a subset by column with df[, c("lossyear", "Point")]:
# A tibble: 74,856 x 2
Point lossyear
<fct> <fct>
1 1 2017
2 1 2017
3 1 2017
4 1 2017
5 1 2017
6 1 2017
7 1 2017
8 1 2017
9 1 2017
10 1 2017
# ... with 74,846 more rows
But how do I "shorten" it so that I have only one row per unique Point with the corresponding lossyear (2000:2017)? Something like this:
# A tibble: 42 x 2
Point lossyear
<fct> <fct>
1 1 2017
2 2 2017
3 3 2017
4 4 2016
5 5 2016
6 6 2016
7 7 2015
8 8 2014
9 9 2014
10 10 2014
# ... with 32 more rows

We can use distinct to get the unique elements of the selected columns
library(dplyr)
df %>%
distinct(lossyear, Point)

You could group by Point and get the first value via slice:
library(dplyr)
df %>%
  select(lossyear, Point) %>%
  group_by(Point) %>%
  slice(1) %>%
  ungroup()
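
For comparison, the same reduction can be sketched in base R without dplyr. This is only an illustration on a hypothetical miniature of the data (the values are made up); unique() plays the role of distinct(), and !duplicated() plays the role of taking the first row per Point.

```r
# Hypothetical miniature of the question's data: repeated rows per Point.
df <- data.frame(
  Point    = c(1, 1, 1, 2, 2, 3),
  lossyear = c(2017, 2017, 2017, 2016, 2016, 2014),
  Mean     = c(0.48, 0.17, 0.73, 0.52, 0.23, 0.72)
)

# Keep the two columns of interest, then drop repeated rows.
per_point <- unique(df[, c("Point", "lossyear")])
rownames(per_point) <- NULL  # renumber the rows from 1
print(per_point)

# Alternative: keep the first row seen for each Point.
first_rows <- df[!duplicated(df$Point), c("Point", "lossyear")]
print(first_rows)
```

The two results agree whenever each Point has a single lossyear; if a Point had several, unique() would keep all combinations while the second form would keep only the first.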

How to export tibble to .csv

I did an RFM analysis using the package "rfm". The results are in a tibble and I can't figure out how to export it to .csv. I tried the call at the bottom of the output below, but it exported a blank file.
> dim(bmdata4RFM)
[1] 1182580 3
> str(bmdata4RFM)
'data.frame': 1182580 obs. of 3 variables:
$ customer_ID: num 0 0 0 0 0 0 0 0 0 0 ...
$ sales_date : Factor w/ 366 levels "1/1/2018 0:00:00",..: 267 275 286 297 300 301 302 303 304 305 ...
$ sales : num 101541 110543 60932 75472 43588 ...
> head(bmdata4RFM,5)
customer_ID sales_date sales
1 0 6/30/2017 0:00:00 101540.70
2 0 7/1/2017 0:00:00 110543.35
3 0 7/2/2017 0:00:00 60932.20
4 0 7/3/2017 0:00:00 75471.93
5 0 7/4/2017 0:00:00 43587.70
> library(rfm)
> # convert date from factor to date format
> bmdata4RFM[,2] <- as.Date(as.character(bmdata4RFM[,2]), format = "%m/%d/%Y")
> rfm_result_v2
# A tibble: 535,868 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<dbl> <date> <dbl> <dbl> <dbl> <int> <int> <int> <dbl>
1 0 2018-06-30 12 366 42462470. 5 5 5 555
2 1 2018-06-30 12 20 2264. 5 5 5 555
3 2 2018-01-12 181 24 1689 3 5 5 355
4 3 2018-05-04 69 27 1984. 4 5 5 455
5 6 2017-12-07 217 12 922. 2 5 5 255
6 7 2018-01-15 178 19 1680. 3 5 5 355
7 9 2018-01-05 188 19 2106 2 5 5 255
8 20 2018-04-11 92 4 414. 4 5 5 455
9 26 2018-02-10 152 1 72 3 1 2 312
10 48 2017-12-20 204 1 90 2 1 3 213
11 68 2017-09-30 285 1 37 1 1 1 111
12 70 2017-12-17 207 1 18 2 1 1 211
13 104 2017-08-11 335 1 90 1 1 3 113
14 120 2017-07-27 350 1 19 1 1 1 111
15 134 2018-01-13 180 1 275 3 1 4 314
16 153 2018-06-24 18 10 1677 5 5 5 555
17 155 2018-05-28 45 1 315 5 1 4 514
18 171 2018-06-11 31 6 3485. 5 5 5 555
19 172 2018-05-24 49 1 93 5 1 3 513
20 174 2018-06-06 36 3 347. 5 4 5 545
# ... with 535,858 more rows
> write.csv(rfm_result_v2,"bmdataRFMFunction_output071218v2.csv")

The problem seems to be that the result of rfm_table_order is not only a tibble. Using the data from a related, already-solved question, you can see this:
> class(rfm_result)
[1] "rfm_table_order" "tibble" "data.frame"
So if, for example, you extract the rfm element:
> rfm_result$rfm
# A tibble: 325 x 9
customer_id date_most_recent recency_days transaction_count amount recency_score frequency_score monetary_score rfm_score
<int> <date> <dbl> <dbl> <int> <int> <int> <int> <dbl>
1 1 2017-08-06 353 1 145 4 1 2 412
2 2 2016-10-15 648 1 268 2 1 3 213
3 5 2016-12-14 588 1 119 3 1 1 311
4 7 2017-04-27 454 1 290 3 1 3 313
5 8 2016-12-07 595 3 835 2 5 5 255
6 10 2017-07-31 359 1 192 4 1 2 412
7 11 2017-08-16 343 1 278 4 1 3 413
8 12 2017-10-14 284 2 294 5 4 3 543
9 15 2016-07-12 743 1 206 2 1 2 212
10 17 2017-05-22 429 2 405 4 4 4 444
# ... with 315 more rows
You can export it with this command:
write.table(rfm_result$rfm , file = "your_path\\df.csv")

The OP asks for CSV output.
Being very picky, write.table(rfm_result$rfm, file = "your_path\\df.csv") writes a space-delimited file, not a CSV, because the default separator of write.table is a single space.
If you want a CSV, add the sep = "," argument; you will likely also want to leave out the row names, so use row.names = FALSE as well.
write.table(rfm_result$rfm, file = "your_path\\df.csv", sep = ",", row.names = FALSE)
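
A quick way to check an export like this is to write to a temporary file and read it back. The sketch below uses write.csv(), which is write.table() with CSV defaults, on a small hypothetical stand-in for rfm_result$rfm; the column values are made up for illustration.

```r
# Hypothetical stand-in for rfm_result$rfm.
scores <- data.frame(
  customer_id = c(1, 2, 5),
  amount      = c(145, 268, 119),
  rfm_score   = c(412, 213, 311)
)

# Write to a temporary file so the sketch leaves nothing behind.
path <- tempfile(fileext = ".csv")
write.csv(scores, path, row.names = FALSE)

# Read the file back to confirm the contents survived the round trip.
back <- read.csv(path)
print(back)
unlink(path)
```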

How to dynamically select columns

Let's assume I ran a random forest model and got the variable importance information below:
set.seed(121)
ImpMeasure<-data.frame(mod.varImp$importance)
ImpMeasure$Vars<-row.names(ImpMeasure)
ImpMeasure.df<-ImpMeasure[order(-ImpMeasure$Overall),]
row.names(ImpMeasure.df)<-NULL
class(ImpMeasure.df)
ImpMeasure.df<-ImpMeasure.df[,c(2,1)] # so now we have the importance variable info in a data frame
ImpMeasure.df
Vars Overall
1 num_voted_users 100.000000
2 num_critic_for_reviews 58.961441
3 num_user_for_reviews 56.500707
4 movie_facebook_likes 50.680318
5 cast_total_facebook_likes 30.012205
6 gross 27.652559
7 actor_3_facebook_likes 24.094213
8 actor_2_facebook_likes 19.633290
9 imdb_score 16.063007
10 actor_1_facebook_likes 15.848972
11 duration 11.886036
12 budget 11.853066
13 title_year 7.804387
14 director_facebook_likes 7.318787
15 facenumber_in_poster 1.868376
16 aspect_ratio 0.000000
Now, if I decide that I want only the top 5 variables for further analysis, I do this:
library(dplyr)
top.var<-ImpMeasure.df[1:5,] %>% select(Vars)
top.var
Vars
1 num_voted_users
2 num_critic_for_reviews
3 num_user_for_reviews
4 movie_facebook_likes
5 cast_total_facebook_likes
How can I use this information to select only these variables from the original dataset (shown below) without spelling out the variable names, using, say, the output of top.var? How would I use dplyr's select function for this?
My original dataset looks like this:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes
1 723 178 0 855
2 302 169 563 1000
3 602 148 0 161
4 813 164 22000 23000
5 255 95 131 782
6 462 132 475 530
actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes
1 1000 760505847 886204 4834
2 40000 309404152 471220 48350
3 11000 200074175 275868 11700
4 27000 448130642 1144337 106759
5 131 228830 8 143
6 640 73058679 212204 1873
facenumber_in_poster num_user_for_reviews budget title_year
1 0 3054 237000000 2009
2 0 1238 300000000 2007
3 1 994 245000000 2015
4 0 2701 250000000 2012
5 0 97 26000000 2002
6 1 738 263700000 2012
actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
1 936 7.9 1.78 33000 2
2 5000 7.1 2.35 0 3
3 393 6.8 2.35 85000 2
4 23000 8.5 2.35 164000 3
5 12 7.1 1.85 0 1
6 632 6.6 2.35 24000 2

movies.imp <- moviesdf.cluster %>% select(one_of(top.var$Vars), cluster)
head(movies.imp)
## num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 886204 3054 723
## 2 471220 1238 302
## 3 275868 994 602
## 4 1144337 2701 813
## 5 8 127 37
## 6 212204 738 462
## movie_facebook_likes cast_total_facebook_likes cluster
## 1 33000 4834 1
## 2 0 48350 1
## 3 85000 11700 1
## 4 164000 106759 1
## 5 0 143 2
## 6 24000 1873 1
That's done!

Hadley provided the answer to that (note that select_() has since been deprecated in dplyr):
select_(df, .dots = top.var)
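
As a package-free illustration of the same idea, column names held in a character vector can be used directly for indexing in base R. This is a minimal sketch on hypothetical miniature data; movies, imp, top_names and movies_top are names invented here, with imp shaped like ImpMeasure.df from the question.

```r
# Hypothetical miniature of the original data set.
movies <- data.frame(
  num_voted_users = c(886204, 471220, 275868),
  duration        = c(178, 169, 148),
  gross           = c(760505847, 309404152, 200074175),
  cluster         = c(2, 3, 2)
)

# Importance table shaped like ImpMeasure.df: names in Vars, sorted by Overall.
imp <- data.frame(
  Vars    = c("num_voted_users", "gross", "duration"),
  Overall = c(100, 27.65, 11.89),
  stringsAsFactors = FALSE
)

# Take the top-n names as a plain character vector, then index by name.
top_names <- head(imp$Vars, 2)
movies_top <- movies[, c(top_names, "cluster")]
print(names(movies_top))
```

The key step is passing a character vector of names (imp$Vars) rather than the one-column data frame itself.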

Get row number data frame R

I have a dataset like this
epoch epochIndex year month
1 335 1 1850 12
2 639 2 1851 10
3 670 3 1851 11
4 366 4 1851 1
5 517 5 1851 6
6 547 6 1851 7
7 578 7 1851 8
8 1005 8 1852 10
9 1036 9 1852 11
10 1066 10 1852 12
What I would like to do is set the year and month and get the corresponding row number, like
MONTH <- 12
YEAR <- 1850
ROWNUMBER = 1
Many thanks

A simple which call would be enough, e.g.:
df <- read.table(textConnection("
epoch epochIndex year month
1 335 1 1850 12
2 639 2 1851 10
3 670 3 1851 11
4 366 4 1851 1
5 517 5 1851 6
6 547 6 1851 7
7 578 7 1851 8
8 1005 8 1852 10
9 1036 9 1852 11
10 1066 10 1852 12"), header=TRUE)
which(df$year == 1850 & df$month == 12)
# [1] 1
which(df$year == 1852 & df$month == 12)
# [1] 10

Sorry, I found the answer:
TIMEC <- which(df$year==YEAR & df$month==MONTH)
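
The which() call can be wrapped in a small helper, which also makes the no-match case visible. This is only a sketch: row_of is a hypothetical name, and the data frame is a shortened copy of the question's data with four rows.

```r
# Shortened copy of the question's data (four rows only).
df <- data.frame(
  epoch = c(335, 639, 670, 1066),
  year  = c(1850, 1851, 1851, 1852),
  month = c(12, 10, 11, 12)
)

# Hypothetical helper: row numbers where both year and month match.
row_of <- function(data, year, month) {
  which(data$year == year & data$month == month)
}

print(row_of(df, 1850, 12))          # match in the first row
print(row_of(df, 1852, 12))          # match in the fourth row
print(length(row_of(df, 1900, 1)))   # zero-length result when nothing matches
```

Because which() returns every matching index, the result can have length zero or more than one, so code that expects a single row number may want to check its length first.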
