Format number series in R - r

I have a number series like below. I need the negative numbers (numbers below 0) to be zero and other numbers to be rounded to two digits.
Can somebody help me to do this in R?
Desired output:
21.31 22.0 0 8.71 -25.27 1.63 0 144.23 0 0 21.9558290 57.2186577 214.2688719 57.9806240 0 0 21.7744036 50.7217715 0 131.4853834
Thanks in advance.
pred_cty1
1 2 3 4 5 6 7
21.3147237 22.0741859 -1.5040034 8.7155408 -25.2777258 1.6331518 -1.5303588
8 9 10 11 12 13 14
144.2318083 -13.1278888 -19.6253222 21.9558290 57.2186577 214.2688719 57.9806240
15 16 17 18 19 20
-7.7710546 -35.6169525 21.7744036 50.7217715 -0.4616455 131.4853834
> str(pred_cty1)
Named num [1:20] 21.31 22.07 -1.5 8.72 -25.28 ...
- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...

These are very basic R functions and methodologies, so I'd recommend researching the concept of subsetting and checking out ?round. FYI, pred_cty1 is a vector-type object; 'series' doesn't really help answer your question, because there are a number of data types that can store one.
After reading up on subsetting and round, check out this simple solution:
pred_cty1 <- round(pred_cty1, digits = 2)
pred_cty1[pred_cty1 < 0] <- 0
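For what it's worth, the two steps can also be collapsed into a single expression with pmax(), which clamps the negative values at zero:
# same result in one line: round, then replace negatives with zero
pred_cty1 <- pmax(round(pred_cty1, digits = 2), 0)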

Related

Output of igraph clustering functions

I constructed a graph from a data frame using the igraph graph_from_data_frame function. My first two columns represent the edge list, and I have another column named "weight". There are several other attribute columns.
I then tried to find a community structure within my graph using cluster_fast_greedy.
data <- data %>% rename(weight = TH_LIEN_2)
graph <- graph_from_data_frame(data,directed=FALSE)
is_weighted(graph)
cluster_1 <- cluster_fast_greedy(graph, weights = NULL)
The output is a list of three (merges, modularity, membership), each containing some of my vertices.
However, the following returns "NULL":
cluster_1[["merges"]]
cluster_1[["modularity"]]
cluster_1[["membership"]]
(I believe cluster_1[["membership"]] is supposed to be a vector of integers indicating which cluster each vertex belongs to?)
I have tried different clustering methods (cluster_fast_greedy, cluster_label_prop, cluster_leading_eigen, cluster_spinglass, cluster_walktrap), with both a weighted and a non-weighted graph, and the output looks the same every time (the number of elements in the list varying from 1 to 4).
Does anyone have an idea of why it does that?
Thank you and have a nice day!
Cassandra
You should use the dollar sign $ to access the components of the communities object. For example
library(igraph)

g <- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)
g <- add_edges(g, c(1, 6, 1, 11, 6, 11))
fc <- cluster_fast_greedy(g)
and you will see
> str(fc)
Class 'communities' hidden list of 5
$ merges : num [1:14, 1:2] 3 4 5 1 12 13 15 11 7 8 ...
$ modularity: num [1:15] -6.89e-02 -4.59e-02 6.94e-18 6.89e-02 1.46e-01 ...
$ membership: num [1:15] 3 3 3 3 3 1 1 1 1 1 ...
$ algorithm : chr "fast greedy"
$ vcount : int 15
> fc$merges
[,1] [,2]
[1,] 3 2
[2,] 4 16
[3,] 5 17
[4,] 1 18
[5,] 12 14
[6,] 13 20
[7,] 15 21
[8,] 11 22
[9,] 7 9
[10,] 8 24
[11,] 10 25
[12,] 6 26
[13,] 27 19
[14,] 23 28
> fc$modularity
[1] -6.887052e-02 -4.591368e-02 6.938894e-18 6.887052e-02 1.460055e-01
[6] 1.689624e-01 2.148760e-01 2.837466e-01 3.608815e-01 3.838384e-01
[11] 4.297521e-01 4.986226e-01 5.757576e-01 3.838384e-01 -1.110223e-16
> fc$membership
[1] 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2
> fc$algorithm
[1] "fast greedy"
> fc$vcount
[1] 15
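As a side note, igraph also ships accessor helpers for communities objects, so you don't have to index the hidden list at all (shown here on the fc object from above):
membership(fc)   # integer cluster id for each vertex
sizes(fc)        # number of vertices in each cluster
algorithm(fc)    # name of the algorithm that produced the clustering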

How to create independent data.frames in a loop in R

Good evening everybody,
I'm stuck on the construction of the for loop. I don't have an error as such, but I'd like to understand how I can create "independent" data frames (duplicates with some differences).
I wrote the code step by step (and it works), but I think there may be a way to make it more compact with a for loop.
x is my original data.frame
str(x)
Classes ‘data.table’ and 'data.frame': 13500 obs. of 6 variables:
$ a: int 1 56 1058 567 987 574 1001...
$ b: int 10 5 10 5 5 10 10 5 10 10 ...
$ c: int NA NA NA NA NA NA NA NA NA NA ...
$ d: int 0 0 0 0 0 0 0 0 0 0 ...
$ e: int 0 0 0 0 0 0 0 0 0 0 ...
$ f: int 22 22 22 22 22 22 22 22 22 22 ...
My first goal is to drop, for each column, any NA and "" elements. I do this with these lines:
x_b<- x[!(!is.na(x$b) & x$b==""), ]
x_c<- x[!(!is.na(x$c) & x$c==""), ]
x_d<- x[!(!is.na(x$d) & x$d==""), ]
x_e<- x[!(!is.na(x$e) & x$e==""), ]
x_f<- x[!(!is.na(x$f) & x$f==""), ]
After this, the second goal is to create, for each new data.frame, an ID code using paste0() (e.g. paste0(x_b$a, x_b$f)).
x_b$ID_1<-paste0(x_b$a, x_b$b)
x_c$ID_2<-paste0(x_c$a, x_c$c)
x_d$ID_3<-paste0(x_c$a, x_c$d)
x_e$ID_4<-paste0(x_c$a, x_c$e)
x_f$ID_5<-paste0(x_c$a, x_c$f)
I wrote this for loop to try to reduce the number of lines and keep the code easier to read.
z  <- data.frame("a", "b", "c", "d", "e", "f")
zy <- data.frame("x_b", "x_c", "x_d", "x_e", "x_f")
for (i in z) {
  for (j in zy) {
    target <- paste("_", i)
    x[[i]] <- (!is.na(x[[i]]) & x[[i]] == "")
    # With this I am able to create a column on the x data.frame,
    # but that is not what I want: I'd like to create a new,
    # separate data frame per transformation.
    # At this point of the script I should have new, different
    # data frames (x_b, x_c, x_d, x_e, x_f), but I don't know
    # how to create them.
    # If I had those data frames, I would then run this other
    # step inside the for loop:
    zy[[ID]] <- paste0(x_b$a, "_23X")
  }
}
I'd like to have as output this:
str(x_b)
Classes ‘data.table’ and 'data.frame': 13500 obs. of 6 variables:
$ a: int 1 56 1058 567 987 574 1001...
$ b: int 10 5 10 5 5 10 10 5 10 10 ...
$ c: int NA NA NA NA NA NA NA NA NA NA ...
$ d: int 0 0 0 0 0 0 0 0 0 0 ...
$ e: int 0 0 0 0 0 0 0 0 0 0 ...
$ f: int 22 22 22 22 22 22 22 22 22 22 ...
$ ID: int 1_23X 56_23X 1058_23X 567_23X 987_23X 574_23X 1001_23X...
and so on.
I think there is some important concept about data frames that I'm missing. Where am I going wrong?
Thank you so much in advance for the support.
There is a simple way to do this with the tidyverse package(s):
First goal:
drop_na(df)
You can also use na_if() if you want to convert "" to NA first.
Second goal: use mutate to create a new variable:
df <- df %>%
  mutate(id = paste0(a, "_23X"))
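Sticking with the base-R approach from the question, here is a rough sketch of how the whole thing could be done in one loop over the column names, producing a named list of independent data frames (the filter condition mirrors the one in the question; names are illustrative):
cols <- c("b", "c", "d", "e", "f")
result <- lapply(cols, function(col) {
  keep <- !(!is.na(x[[col]]) & x[[col]] == "")   # same condition as in the question
  out <- x[keep, ]
  out$ID <- paste0(out$a, "_23X")                # id built from column a, as in the desired output
  out
})
names(result) <- paste0("x_", cols)
# result$x_b, result$x_c, ... are now separate data frames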

How do I change the structure of an R data table

I've merged a handful of data sets, all downloaded from either SPSS, CSV, or Excel files, into one large data table. For the most part I can use all the variables I want to run tests, but every once in a while their structure needs to be changed. As an example, here's my data set:
> str(gadd.us)
'data.frame': 467 obs. of 381 variables:
$ nidaid : Nmnl. item chr "45-D11150341" "45-D11180321" "45-D11220022" "45-D11240432" ...
$ id : Nmnl. item chr "D11150341" "D11180321" "D11220022" "D11240432" ...
$ agew1 : Itvl. item num 17 17 15 18 17 15 15 18 20 18 ...
$ nagew1 : Itvl. item num 17.3 17.2 15.7 18.2 17.2 ...
$ nsex : Nmnl. item w/ 2 labels for 0,1 num 1 1 0 0 0 0 1 1 1 1 ...
and when I focus on just one variable I get something like this
> str(gadd.us$wasiblckw2)
Itvl. item + ms.v. num [1:467] 70 48 40 60 37 46 67 55 45 61 ...
> str(gadd.us$nsex)
Nmnl. item w/ 2 labels for 0,1 num [1:467] 1 1 0 0 0 0 1 1 1 1 ...
So when I try to create a histogram I get an error...
> hist(gadd.us$wasiblckw2)
Error in hist.default(gadd.us$wasiblckw2) :
some 'x' not counted; maybe 'breaks' do not span range of 'x'
If I change this variable using as.numeric() it works just fine. Any idea what's going on here?
If you import your data from SPSS, SAS, or Stata using haven (library(haven)), haven stores variable formats in an attribute: format.spss, format.sas, or format.stata. This can sometimes cause problems for your code. haven has several functions to remove such formats and labels:
gadd.us <- haven::zap_formats(gadd.us)
gadd.us <- haven::zap_labels(gadd.us)
You may also want to try some other zap_ functions.
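If only a couple of columns misbehave, you can also just coerce them to plain numeric, as you already found; a minimal sketch (column name taken from the question):
gadd.us$wasiblckw2 <- as.numeric(gadd.us$wasiblckw2)
hist(gadd.us$wasiblckw2)   # plots without the 'breaks' error, per the question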

Using lag function gives an atomic vector with all zeroes

I am trying to use the "lag" function in base R to calculate rainfall accumulations for a 6-hour period. I have hourly rainfall; I calculate cumulative rainfall using the cumsum function and then use the lag function to compute 6-hour accumulations, as below.
Event_Data<-dbGetQuery(con, "select feature_id, TO_CHAR(datetime, 'MM/DD/YYYY HH24:MI') as DATE_TIME, value_ms as RAINFALL_IN from Rain_HOURLY")
Event_Data$cume<-cumsum(Event_Data$RAINFALL_IN)
Event_Data$six_hr<-Event_Data$cume-lag(Event_Data$cume, 6)
But the lag function gives me all zeroes, and the structure of the data frame looks like this:
'data.frame': 169 obs. of 5 variables:
$ feature_id : num 80 80 80 80 80 ...
$ DATE_TIME : chr "09/10/2017 00:00" "09/10/2017 01:00" "09/10/2017 02:00" "09/10/2017 03:00" ...
$ RAINFALL_IN: num 0.251 0.09 0.017 0.071 0.016 0.01 0.136 0.651 0.185 0.072 ...
$ cume : num 0.251 0.341 0.358 0.429 0.445 ...
$ six_hr : atomic 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "tsp")= num -23 145 1
This code has worked fine with several of my other projects but I have no clue why I am getting zeroes. Any help is greatly appreciated.
Thanks.
There might be a conflict with a lag function from another package; that would explain why this code worked in other scripts but not in this one.
Try stats::lag instead of just lag to make explicit which package you want to use (or dplyr::lag, which seems to work better for me, at least).
I think you have a misconception about what lag() from the stats package does. It's returning zeros because you're taking the full data for cumulative rainfall and then subtracting it again. Check this small example for an illustration:
x <- 1:20
y <- lag(x,3) ;y
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#attr(,"tsp")
#[1] -2 17 1
x-y #x is a vector
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#attr(,"tsp")
#[1] -2 17 1
As you can see, lag() simply keeps the vector values and just adds a time series attribute with the values "starting time, ending time, frequency". Because you put in a vector, it used the default values "1, length(Event_Data$cume), 1" and subtracted the lag from the starting and ending time, which is 3 in the example and seemingly 24 in your code output (which doesn't fit the code input above it, btw).
The problem is that your vector doesn't have any time attribute assigned to it, so R doesn't know which values of your data correspond to which values of the lagged data. Thus, it simply subtracts the vector values element-wise and attaches the time attribute of the lagged variable. To fix this, you just need to assign times to Event_Data$cume by converting it to a time-series object, i.e. try Event_Data$six_hr <- as.numeric(ts(Event_Data$cume) - lag(ts(Event_Data$cume), 6))
It works fine for the small example above:
x <- ts(1:20)
y <- lag(x,3)
x-y #x is a ts
#Time Series:
#Start = 1
#End = 17
#Frequency = 1
# [1] -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 -3
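If you prefer to stay with plain vectors, one alternative (an illustrative sketch, not part of the original answer) is to compute the 6-hour difference directly and pad the first six positions with NA so the result matches the number of rows in the data frame:
n <- nrow(Event_Data)
# six_hr[t] = cume[t] - cume[t - 6] for t > 6, NA otherwise
Event_Data$six_hr <- c(rep(NA, 6), Event_Data$cume[7:n] - Event_Data$cume[1:(n - 6)])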

Reducing crosstab size by frequency of responses

Excuse my neophyte question - I'm new to R and pretty unversed in statistics.
I have a simple contingency table representing the number of queries per user for a group of web pages gathered over a period of time. There are about 15,000 total observations. This works out to a table of around 100 users viewing 50 groups of pages.
Since a 50x100 matrix is unwieldy to visualize, I would like to present a subset of this table sorted by the largest aggregates - either column (page groups), row (users), or perhaps even the largest row-by-column counts. For example I might choose the top 20 users and the top 10 groups, or the top 99% row-by-column counts.
Ideally, I end up with a table that still represents the major interactions between the most represented users and the page groups.
Is this a reasonable approach? Will I lose a large amount of statistical significance, and is there a way to compare the significance before and after?
I must admit that I still don't know how to sort and subset a table based on two factors without resorting to row-by-column manipulation.
S <- trunc(10*runif(1000) )
R <- trunc(10*runif(1000))
RStab <- table(R, S)
str(RStab)
# 'table' int [1:10, 1:10] 6 12 10 13 10 7 8 6 9 10 ...
# - attr(*, "dimnames")=List of 2
# ..$ R: chr [1:10] "0" "1" "2" "3" ...
# ..$ S: chr [1:10] "0" "1" "2" "3" ...
rowSums( RStab[ order(rowSums(RStab)) , order(colSums(RStab) ) ])
# 8 0 1 3 2 5 9 4 6 7
# 90 94 96 99 100 101 101 103 107 109
colSums( RStab[ order(rowSums(RStab)) , order(colSums(RStab) ) ])
# 6 0 3 5 7 2 4 8 9 1
# 80 91 94 96 98 100 106 109 112 114
The 5 highest marginals for rows and columns:
RStab[ order(rowSums(RStab)) , order(colSums(RStab) ) ][ 6:10, 6:10]
#-------------
S
R 2 4 8 9 1
5 14 10 12 10 12
9 6 8 9 10 13
4 10 10 8 8 18
6 9 12 12 17 8
7 14 10 14 12 9
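Generalizing that idea, a small sketch for keeping only the top n rows and columns by marginal totals (the 20 users / 10 groups from the question would go in n_r and n_c):
n_r <- 5; n_c <- 5   # e.g. 20 and 10 for the real data
top_r <- order(rowSums(RStab), decreasing = TRUE)[1:n_r]
top_c <- order(colSums(RStab), decreasing = TRUE)[1:n_c]
RStab[top_r, top_c]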
It does sound as though you might be a little shaky on the statistical issues. Can you explain more fully what you mean by "losing a large amount of significance"? What sort of statistical test were you thinking of?
