prop.test for multiple comparisons in R

kk=structure(list(items = structure(c(2L, 4L, 5L, 11L, 1L, 3L, 6L,
7L, 8L, 9L, 10L, 12L), .Label = c("ak47", "aks47", "colt", "dubstepgun",
"moneygun", "paintballgun", "portalgun", "s", "scar20", "spas12",
"tank", "watergun"), class = "factor"), N = c(3L, 3L, 3L, 3L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("items", "N"), class = "data.frame", row.names = c(NA,
-12L))
To run prop.test for each item, I currently do it the simple way.
There are 12 items and the total number of observations is 20, so for an item that occurs once:
prop.test(1,20)
So the colt item occurs 1 time in 20.
How can I run prop.test for all items at once, rather than manually calling
prop.test(3,20)
prop.test(2,20)
and so on, but with each result labelled by the item name?
#tank
prop.test(3,20)
#spas12
prop.test(1,20)

One option is to take the unique values of the 'N' column, loop over them with lapply, and apply prop.test to each one. Each result is then named after the last item that has that count.
lst1 <- lapply(unique(kk$N), function(i) prop.test(i, 20))
names(lst1) <- unname(tapply(as.character(kk$items),
factor(kk$N, levels = unique(kk$N)), FUN = tail, 1))
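If one named result per item is wanted, rather than one per distinct count, a minimal sketch along the following lines may help. It rebuilds kk as a plain data frame with character item names and assumes that the total number of observations is the sum of the N column (20 here). The names total and tests are only illustrative.

```r
# Rebuild the example data with plain character item names.
kk <- data.frame(
  items = c("aks47", "dubstepgun", "moneygun", "tank", "ak47", "colt",
            "paintballgun", "portalgun", "s", "scar20", "spas12", "watergun"),
  N = c(3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Assumption: the total number of observations is the sum of the counts.
total <- sum(kk$N)

# One prop.test per row, named by item. suppressWarnings hides the warning
# that the chi-squared approximation may be unreliable for small counts.
tests <- setNames(
  lapply(kk$N, function(n) suppressWarnings(prop.test(n, total))),
  kk$items
)

tests$tank$estimate   # sample proportion for tank
tests$colt$conf.int   # confidence interval for colt
```

Each element of tests is an ordinary htest object, so tests[["spas12"]] prints the same output as a manual prop.test(1, 20).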

Related

Cannot use is.na() function in mutate_if function in R

I tried to use is.na() in mutate_if() but I get an error:
Error in is_logical(.p) : object 'n_day' not found
n_day does exist in my data frame. I suspect the problem is the way is.na() takes its argument, so that it cannot be used in mutate_if() like this, but I don't know how to solve it.
The idea is: if the value in n_day is NA, replace it with the value of n_cum on the same day.
Any help will be highly appreciated!
My code looks like this:
library(tidyverse)
t <- structure(list(city = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("a", "b"), class = "factor"),
time = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L), .Label = c("2012/1/1", "2012/1/2",
"2012/1/3", "2012/1/4", "2012/2/1", "2012/2/2", "2012/2/3",
"2012/2/4"), class = "factor"), n_cum = c(1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L)), class = "data.frame", row.names = c(NA,
-16L))
t
t2 <- t %>% group_by(city) %>%
mutate(n_day = n_cum - lag(n_cum))
t2 %>% mutate_if(is.na(n_day), n_day = n_cum)
mutate_if applies an operation to several columns at once, selected by a predicate (see the documentation). That is not what you need here, since you only want to change one column.
The question can be solved using mutate and if_else:
t2 %>% mutate(n_day = if_else(is.na(n_day), n_cum, n_day))
Alternatively, use mutate_at with an ifelse condition:
t2 %>% mutate_at(vars(n_day), ~ ifelse(is.na(.), n_cum, .))
To apply this to several variables, add them to the vars() helper.
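For comparison, the same replacement can be sketched in base R without dplyr. This is only an illustration on a simplified stand-in for the question's data (two cities, four rows each, no time column); ave() computes the per-city difference and ifelse() fills the NA values.

```r
# Simplified stand-in for the question's data (two cities, four days each).
t <- data.frame(
  city  = rep(c("a", "b"), each = 4),
  n_cum = c(1L, 2L, 3L, 4L, 11L, 12L, 13L, 14L)
)

# Per-city difference from the previous row; the first row of each city is NA.
t$n_day <- ave(t$n_cum, t$city, FUN = function(v) c(NA, diff(v)))

# Replace each NA with the cumulative value from the same row.
t$n_day <- ifelse(is.na(t$n_day), t$n_cum, t$n_day)

t$n_day
```

The first row of each city ends up with its cumulative value, and every other row keeps the daily difference.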

seasonplot function in R is not a function, character or symbol

transport<- structure(list(date = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
11L, 12L), .Label = c("01.01.2001", "01.02.2001", "01.03.2001",
"01.04.2001", "01.05.2001", "01.06.2001", "01.07.2001", "01.08.2001",
"01.09.2001", "01.10.2001", "01.11.2001", "01.12.2001"), class = "factor"),
Market_82 = c(7000L, 7272L, 7668L, 7869L, 8057L, 8428L, 8587L,
8823L, 8922L, 9178L, 9306L, 9439L, 3725L, 4883L, 8186L, 7525L,
6335L, 4252L, 5642L, 1326L, 8605L, 3501L, 1944L, 7332L),
transport = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("plane", "train"), class = "factor")), .Names = c("date",
"Market_82", "transport"), class = "data.frame", row.names = c(NA,
-24L))
Let's create a seasonal plot for each group (plane and train) separately:
library(forecast)
par(mfrow = c(2, 1))
lapply(split(transport['Market_82'], transport$transport), seasonplot(ts(transport,frequency=12)))
Then I get the error
Error in match.fun(FUN) :
'seasonplot(ts(transport, frequency = 12))' is not a function, character or symbol
How can I get a seasonal plot for each of the two groups?
lapply expects a function as its second argument, not a function call with its arguments in brackets. To pass additional arguments to the function, list them after it, e.g. lapply(X, func, arg1, arg2).
Also, seasonplot(ts(transport,frequency=12)) would plot both the plane and the train data in one plot.
Since in your example you also want to build a time series object using ts, it is best to wrap both steps in a function that you define inside lapply.
Try:
lapply(split(transport['Market_82'], transport$transport), function(x)seasonplot(ts(x, frequency=12)))
Edit
To distinguish which group is for which plot, you could iterate over the names:
data = split(transport['Market_82'], transport$transport)
par(mfrow = c(2, 1))
lapply(names(data), function(x)seasonplot(ts(data[[x]], frequency=12), main=x))
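The two ways of handing a function to lapply described above can be compared without any plotting. The sketch below uses only base R and rebuilds transport as a simplified data frame without the date column; the names groups, series_a and series_b are illustrative.

```r
# Simplified copy of the example data (date column omitted).
transport <- data.frame(
  Market_82 = c(7000L, 7272L, 7668L, 7869L, 8057L, 8428L, 8587L,
                8823L, 8922L, 9178L, 9306L, 9439L, 3725L, 4883L, 8186L,
                7525L, 6335L, 4252L, 5642L, 1326L, 8605L, 3501L, 1944L, 7332L),
  transport = rep(c("plane", "train"), each = 12)
)

# One numeric vector per transport type.
groups <- split(transport$Market_82, transport$transport)

# Form 1: pass the function itself; extra arguments after FUN are forwarded.
series_a <- lapply(groups, ts, frequency = 12)

# Form 2: wrap the call in an anonymous function.
series_b <- lapply(groups, function(x) ts(x, frequency = 12))

identical(series_a, series_b)
```

Both forms build one monthly ts object per group. The anonymous-function form is the one to extend when several calls need to be chained, as with seasonplot(ts(...)).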

perform acf plot for each type of group in R

Say, here is mydata (a small part):
transport<- structure(list(date = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
11L, 12L), .Label = c("01.01.2001", "01.02.2001", "01.03.2001",
"01.04.2001", "01.05.2001", "01.06.2001", "01.07.2001", "01.08.2001",
"01.09.2001", "01.10.2001", "01.11.2001", "01.12.2001"), class = "factor"),
Market_82 = c(7000L, 7272L, 7668L, 7869L, 8057L, 8428L, 8587L,
8823L, 8922L, 9178L, 9306L, 9439L, 3725L, 4883L, 8186L, 7525L,
6335L, 4252L, 5642L, 1326L, 8605L, 3501L, 1944L, 7332L),
transport = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("plane", "train"), class = "factor")), .Names = c("date",
"Market_82", "transport"), class = "data.frame", row.names = c(NA,
-24L))
The group variable is transport.
For each type of transport I need an ACF plot of the time series.
How can I produce an ACF plot for each transport?
I have a lot of groups. How can I save the plots to the folder
C:/Users/admin/Documents/myplot
akrun's answer is spot on. Since you tagged the question with ggplot2, you could also use ggAcf from the forecast package.
The first step is to split your data.
transport_split <- split(transport, transport$transport)
If you want to include the respective element of the transport column in the title, subtitle, etc., try Map:
out <- Map(
f = function(x, y)
forecast::ggAcf(x$Market_82) + ggplot2::labs(title = y),
x = transport_split,
y = names(transport_split)
)
out$train
We can do this with Acf from forecast
library(forecast)
par(mfrow = c(2, 1))
lapply(split(transport['Market_82'], transport$transport), Acf)
If we also want the title, then
lst <- lapply(split(transport['Market_82'], transport$transport), acf, plot = FALSE)
par(mfrow = c(2, 1))
lapply(names(lst), function(x) plot(lst[[x]], main = x))
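The question also asks how to write the plots to a folder, which the answers above do not cover. One possible sketch, using only base R: open a graphics device per group, draw the ACF, and close the device. The output directory here is a temporary folder standing in for C:/Users/admin/Documents/myplot, the file names are made up for illustration, and pdf() is used only because it is available everywhere; png() would follow the same pattern.

```r
# Simplified copy of the example data (date column omitted).
transport <- data.frame(
  Market_82 = c(7000L, 7272L, 7668L, 7869L, 8057L, 8428L, 8587L,
                8823L, 8922L, 9178L, 9306L, 9439L, 3725L, 4883L, 8186L,
                7525L, 6335L, 4252L, 5642L, 1326L, 8605L, 3501L, 1944L, 7332L),
  transport = rep(c("plane", "train"), each = 12)
)

# Assumption: a temporary folder stands in for the real target directory.
out_dir <- file.path(tempdir(), "myplot")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

groups <- split(transport$Market_82, transport$transport)

# Write one PDF per group and collect the file paths.
files <- vapply(names(groups), function(g) {
  path <- file.path(out_dir, paste0("acf_", g, ".pdf"))
  pdf(path)
  on.exit(dev.off())          # close the device even if acf() fails
  acf(groups[[g]], main = g)  # group name as the plot title
  path
}, character(1))

files
```

Replacing out_dir with the real folder path should be the only change needed for the setup described in the question.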

How to change all 0's in a list of dataframes

I would like to change all 0's to, say, 0.0001 in a list of data frames, to avoid -Inf when taking the log. Following the instructions from "Replace all 0 values to NA", I wrote my function as
set_zero_as_value <- function(x, value=0.0001){
x[x == 0] <- value
}
However, when I apply it to my data with sapply(a, set_zero_as_value), the result is
s1 s2
1e-04 1e-04
Checking the list a afterwards, the 0s in a have not changed at all. Is there a solution for this?
PS: the list a can be created as
> a = NULL
> a$s1 = rbind(cbind(0,1,2),cbind(3,4,5))
> a$s2 = rbind(cbind(0,1,2),cbind(3,4,5))
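Use pmax inside the lapply call. There is no need to define set_zero_as_value, since pmax does what you need as long as the data contain no negative values (pmax raises every value below 0.0001 to 0.0001). Let's suppose your list is: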
list.DF <-list(structure(list(a = c(1L, 2L, 3L, 5L, 1L, 5L, 5L, 3L, 3L,
0L), b = c(1L, 1L, 4L, 2L, 4L, 2L, 4L, 5L, 2L, 4L), c = c(5L,
1L, 3L, 0L, 1L, 2L, 0L, 2L, 5L, 2L)), .Names = c("a", "b", "c"
), row.names = c(NA, -10L), class = "data.frame"), structure(list(
d = c(2L, 3L, 2L, 1L, 4L, 4L, 4L, 0L, 4L, 2L), e = c(4L,
3L, 4L, 3L, 3L, 4L, 0L, 2L, 4L, 4L), f = c(2L, 5L, 2L, 1L,
0L, 0L, 1L, 3L, 3L, 2L)), .Names = c("d", "e", "f"), row.names = c(NA,
-10L), class = "data.frame"))
Now applying your desired transformation:
> lapply(list.DF, function(x) sapply(x, pmax, 0.0001))
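If you want to use your set_zero_as_value function instead, add return(x) at the end of it. Without it, the function returns the value of its last expression, the assignment, which is just value; that is why sapply gave back 1e-04.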
set_zero_as_value <- function(x, value=0.0001){
x[x == 0] <- value
return(x)
}
lapply(list.DF, function(x) sapply(x, set_zero_as_value))
This will produce the same result as before.
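Note that sapply over the columns of a data frame simplifies the result to a matrix, and sapply over a matrix walks its individual elements and drops the dimensions. If the original shape should be kept, one option is to call the corrected function directly on each list element. The sketch below uses the list a from the question plus a small made-up data frame df.

```r
# The list from the question: two 2 x 3 matrices containing a zero.
a <- list(
  s1 = rbind(cbind(0, 1, 2), cbind(3, 4, 5)),
  s2 = rbind(cbind(0, 1, 2), cbind(3, 4, 5))
)

# Corrected helper: return the modified object, not the assigned value.
set_zero_as_value <- function(x, value = 0.0001) {
  x[x == 0] <- value
  x
}

# Apply to each element directly; matrices stay matrices.
a_fixed <- lapply(a, set_zero_as_value)

# The same helper also works on a data frame (df is a made-up example).
df <- data.frame(a = c(0, 1), b = c(2, 0))
df_fixed <- set_zero_as_value(df)

log(a_fixed$s1)  # no -Inf any more
```

lapply returns a new list, so a itself is left unchanged; assign the result back (a <- lapply(a, set_zero_as_value)) to overwrite it.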

Bin data by (x,y) and summarize

These are the first 10 lines of a huge file I have. (Note that there is only one user in these 10 lines, but I have thousands of users.)
dput(testd)
structure(list(user = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
), otime = structure(c(10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L
), .Label = c("2010-10-12T19:56:49Z", "2010-10-13T03:57:23Z",
"2010-10-13T16:41:35Z", "2010-10-13T20:05:43Z", "2010-10-13T23:31:51Z",
"2010-10-14T00:21:47Z", "2010-10-14T18:25:51Z", "2010-10-16T03:48:54Z",
"2010-10-16T06:02:04Z", "2010-10-17T01:48:53Z"), class = "factor"),
lat = c(39.747652, 39.891383, 39.891077, 39.750469, 39.752713,
39.752508, 39.7513, 39.758974, 39.827022, 39.749934),
long = c(-104.99251, -105.070814, -105.068532, -104.999073,
-104.996337, -104.996637, -105.000121, -105.010853,
-105.143191, -105.000017),
locid = structure(c(5L, 4L, 9L, 6L, 1L, 2L, 8L, 3L, 10L, 7L),
.Label = c("2ef143e12038c870038df53e0478cefc",
"424eb3dd143292f9e013efa00486c907", "6f5b96170b7744af3c7577fa35ed0b8f",
"7a0f88982aa015062b95e3b4843f9ca2", "88c46bf20db295831bd2d1718ad7e6f5",
"9848afcc62e500a01cf6fbf24b797732f8963683", "b3d356765cc8a4aa7ac5cd18caafd393",
"d268093afe06bd7d37d91c4d436e0c40d217b20a", "dd7cd3d264c2d063832db506fba8bf79",
"f6f52a75fd80e27e3770cd3a87054f27"), class = "factor"),
dnt = structure(c(10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L),
.Label = c("2010-10-12 19:56:49",
"2010-10-13 03:57:23", "2010-10-13 16:41:35", "2010-10-13 20:05:43",
"2010-10-13 23:31:51", "2010-10-14 00:21:47", "2010-10-14 18:25:51",
"2010-10-16 03:48:54", "2010-10-16 06:02:04", "2010-10-17 01:48:53"
), class = "factor"),
x = c(-11674.6344476781, -11683.3414552141,
-11683.0877083915, -11675.3642199817, -11675.0599906624,
-11675.0933491404, -11675.4807522648, -11676.6740962175,
-11691.3894104198, -11675.4691879924),
y = c(4419.73724843345, 4435.719406435, 4435.68538078744,
4420.05048454181, 4420.3000059572, 4420.27721099723,
4420.14288752585, 4420.99619739292, 4428.56278976123,
4419.99099525605),
cellx = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L),
.Label = c("[-11682,-11672)", "[-11692,-11682)"
), class = "factor"),
celly = structure(c(1L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("[4419,4429)", "[4429,4439)"
), class = "factor"),
cellxy = structure(c(1L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 2L, 1L), .Label = c("[-11682,-11672)[4419,4429)",
"[-11692,-11682)[4419,4429)", "[-11692,-11682)[4429,4439)"
), class = "factor")), .Names = c("user", "otime", "lat",
"long", "locid", "dnt", "x", "y", "cellx", "celly", "cellxy"), class = "data.frame", row.names = c(NA,
-10L))
A bit of explanation of the data, to make it easier to follow. x and y are transformations of the lat and long coordinates. I have discretised the x,y locations into bins using cut. I want the most visited bin per user, so I use ddply as follows:
cells = ddply(testd, .(user, cellxy), summarise, count = length(cellxy))
Obtaining:
dput(cells)
structure(list(user = c(0, 0, 0), cellxy = structure(1:3, .Label = c("[-11682,-11672)[4419,4429)",
"[-11692,-11682)[4419,4429)", "[-11692,-11682)[4429,4439)"), class = "factor"),
count = c(7L, 1L, 2L)), .Names = c("user", "cellxy", "count"
), row.names = c(NA, -3L), class = "data.frame")
Now what I want to do is calculate the average x,y from the first dataset for the most visited bin per user as obtained from the previous calculation. I have no idea how to do this efficiently and given that my dataset is really big I would appreciate some guidance. Thanks!
Here is a two-stage approach. First, modify your original code for cells so that, for each combination of user and cellxy, it also calculates the mean x and y values.
cells = ddply(testd, .(user, cellxy), summarise,
cellcount=length(cellxy),meanx=mean(x),meany=mean(y))
cells
  user                     cellxy cellcount     meanx    meany
1    0 [-11682,-11672)[4419,4429)         7 -11675.40 4420.214
2    0 [-11692,-11682)[4419,4429)         1 -11691.39 4428.563
3    0 [-11692,-11682)[4429,4439)         2 -11683.21 4435.702
Then use another call to ddply() to keep, for each user, the cellxy with the highest cellcount.
cells2 = ddply(cells,.(user),subset,cellcount==max(cellcount))
cells2
  user                     cellxy cellcount    meanx    meany
1    0 [-11682,-11672)[4419,4429)         7 -11675.4 4420.214
Since your data set is large, you might want to consider data.table, which will not only be very fast but will also make the data munging a bit easier.
Converting to a data table is straightforward:
library (data.table)
DT <- data.table(testd, key="user")
Then determining the most visited, by user, is just one line
# Determining which is the most visited, by user (which.max keeps the first bin on ties)
DT[, "MostVisited" := {counts <- table(cellxy); names(counts)[which.max(counts)]}, by=user]
I'm not sure exactly how you want to calculate the average x, y relative to MostVisited, but that should also be relatively straightforward with data.table.
## But perhaps something like this
DT[, c("AvgX", "AvgY") := list(mean(x), mean(y)), by=list(user, MostVisited)]
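If the averages should cover only the rows that fall inside each user's most visited bin, the logic can be sketched in base R as follows. The data here is a small made-up table (two users, bins labelled A to D), not the question's testd, and the name top_bins is illustrative. For a very large data set the ddply or data.table versions above are likely to be the better starting point.

```r
# Made-up example: two users, each with points in a few bins.
testd <- data.frame(
  user   = c(0, 0, 0, 0, 1, 1, 1),
  cellxy = c("A", "A", "B", "A", "C", "D", "D"),
  x      = c(1, 2, 10, 3, 5, 7, 9),
  y      = c(10, 20, 100, 30, 1, 3, 5),
  stringsAsFactors = FALSE
)

# For each user: find the most visited bin, then average x and y
# over the rows in that bin only.
top_bins <- do.call(rbind, lapply(split(testd, testd$user), function(d) {
  counts <- table(d$cellxy)
  top <- names(counts)[which.max(counts)]  # first bin wins on ties
  hit <- d[d$cellxy == top, ]
  data.frame(user = d$user[1], cellxy = top, count = nrow(hit),
             meanx = mean(hit$x), meany = mean(hit$y),
             stringsAsFactors = FALSE)
}))

top_bins
```

With this toy data, user 0 is summarised by bin A and user 1 by bin D.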
