Using summarise function to make sumIF with the dplyr package [duplicate] - r

This question already has answers here:
Why does summarize or mutate not work with group_by when I load `plyr` after `dplyr`?
I am using dplyr to build a SUMIF-style grouped sum on my data frame. However, it does not give me the desired output:
> dput(sys)
structure(list(NUMERIC = c(244L, 24L, 1L, 2L, 4L, 111L, 23L,
2L, 3L, 4L, 24L), VAL = c("FALSE", "FALSE", "TES", "TEST", "TRUE",
"TRUE", "TRUE", "asdfs", "asdfs", "safd", "sd"), IDENTIFIER = c(99L,
99L, 98L, 98L, 99L, 99L, 99L, 13L, 13L, 99L, 12L)), .Names = c("NUMERIC",
"VAL", "IDENTIFIER"), row.names = c(NA, 11L), class = c("grouped_dt",
"tbl_dt", "tbl", "grouped_dt", "tbl_dt", "tbl", "data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000000000100788>, sorted = c("VAL",
"IDENTIFIER"), vars = list(VAL, IDENTIFIER))
>
>
> sys <- group_by(sys, VAL, IDENTIFIER)
> df.summary <- summarise(sys,
+ numeric = sum(NUMERIC)
+ )
>
> (df.summary)
numeric
1 442
My desired result is one summed row per VAL and IDENTIFIER group, not a single total.
Any recommendation as to what I am doing wrong?

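This can happen when plyr is loaded after dplyr: plyr's summarise then masks dplyr's summarise and ignores the grouping, so you get a single total. Either run the code in a new R session without plyr, or call the dplyr version explicitly: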
dplyr::summarise(sys,
numeric = sum(NUMERIC)
)
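As a minimal sketch of the namespaced call, assuming dplyr is installed and using a plain data.frame copy of the question's data (the names sys_df, grouped and df_summary are illustrative, not from the original post):

```r
library(dplyr)

# Plain data.frame rebuilt from the dput() output in the question
sys_df <- data.frame(
  NUMERIC = c(244L, 24L, 1L, 2L, 4L, 111L, 23L, 2L, 3L, 4L, 24L),
  VAL = c("FALSE", "FALSE", "TES", "TEST", "TRUE", "TRUE", "TRUE",
          "asdfs", "asdfs", "safd", "sd"),
  IDENTIFIER = c(99L, 99L, 98L, 98L, 99L, 99L, 99L, 13L, 13L, 99L, 12L),
  stringsAsFactors = FALSE
)

# Prefixing with dplyr:: avoids picking up plyr::summarise by accident
grouped <- dplyr::group_by(sys_df, VAL, IDENTIFIER)
df_summary <- dplyr::summarise(grouped, numeric = sum(NUMERIC))
df_summary
```

With the explicit prefix the result should have one row per VAL/IDENTIFIER combination, and the per-group sums should still add up to the 442 shown in the question.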

Related

Renaming All Columns from Vector [duplicate]

This question already has answers here:
Rename multiple columns by names
I have a data set for which I'd like to rename all the columns using a list of characters of the same length.
Here is my code at the moment, which tries to use the rename command.
eurrefcolnames = c('euRefVoteW1','euRefVoteW2', 'euRefVoteW3', 'euRefVoteW4', 'euRefVoteW6','euRefVoteW7', 'euRefVoteW8', 'euRefVoteW9')
times <- in_all_waves %>%
filter(Paper == 'Times') %>%
ungroup() %>%
select(-Paper) %>%
map_df(table) %>%
t %>%
as_tibble() %>%
rename(eurrefcolnames = names(times))
But this doesn't seem to work. The function replace_with requires a function and it seems silly to have to write a function that just calls a list. Is there a simpler way I'm not seeing?
Here is the table I'm using for context (taken after the as_tibble command):
structure(list(V1 = c(316L, 157L, 2L, 56L), V2 = c(290L, 123L,
3L, 51L), V3 = c(313L, 159L, 3L, 55L), V4 = c(324L, 154L, 3L,
50L), V5 = c(338L, 134L, 2L, 57L), V6 = c(320L, 189L, 2L, 20L
), V7 = c(325L, 187L, 2L, 17L), V8 = c(335L, 181L, 0L, 0L)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
You can use set_names from the purrr package:
mydf %>%
purrr::set_names(eurrefcolnames)
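If purrr is not available, a base R sketch of the same idea uses stats::setNames. The data frame name votes below is an assumption for illustration; the values and new names come from the question:

```r
# Table from the question, rebuilt as a plain data.frame
votes <- data.frame(
  V1 = c(316L, 157L, 2L, 56L), V2 = c(290L, 123L, 3L, 51L),
  V3 = c(313L, 159L, 3L, 55L), V4 = c(324L, 154L, 3L, 50L),
  V5 = c(338L, 134L, 2L, 57L), V6 = c(320L, 189L, 2L, 20L),
  V7 = c(325L, 187L, 2L, 17L), V8 = c(335L, 181L, 0L, 0L)
)

eurrefcolnames <- c("euRefVoteW1", "euRefVoteW2", "euRefVoteW3", "euRefVoteW4",
                    "euRefVoteW6", "euRefVoteW7", "euRefVoteW8", "euRefVoteW9")

# setNames() returns a copy with all column names replaced, in order
renamed <- setNames(votes, eurrefcolnames)
names(renamed)
```

The vector of new names must have exactly as many entries as there are columns, since the names are assigned by position.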

Assign trip number based on condition

I have time series data. I would like to group and number rows based on when column "soak" > 3600. The first row where soak > 3600 is numbered 1, and the following rows are numbered 1 too until another row meets the condition soak > 3600. That row and the subsequent rows are then numbered 2, until the third occurrence of soak > 3600, and so on.
A small sample of my data and the code I tried are provided below.
My code did the count, but using ave() seems to give me some decimal numbers. Is there a way to output integers?
starts <- structure(list(datetime = structure(c(1440578907, 1440579205,
1440579832, 1440579885, 1440579926, 1440579977, 1440580044, 1440580106,
1440580195, 1440580256, 1440580366, 1440580410, 1440580476, 1440580529,
1440580931, 1440580966, 1440587753, 1440587913, 1440587933, 1440587954
), class = c("POSIXct", "POSIXt"), tzone = ""), soak = c(NA,
70L, 578L, 21L, 2L, 41L, 14L, 16L, 32L, 9L, 45L, 20L, 51L, 25L,
364L, 4L, 6764L, 20L, 4L, 5L)), row.names = c(NA, -20L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x000000000a4d1ef0>)
starts$trip <- with(starts, ave(tdiff, cumsum(replace(soka, NA, 10000) > 3600)))
Using dplyr
library(dplyr)
starts %>% mutate(trip = cumsum(replace(soak, is.na(soak), 1) > 3600))
And with base R
starts$trip = with(starts, ave(soak, FUN=function(x) cumsum(replace(x, is.na(x), 1) > 3600)))
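The counter in both answers comes from taking cumsum() of a logical vector, which already returns integers. A minimal sketch on the soak values from the question, using a bare vector instead of the data.table:

```r
# soak values copied from the dput() output in the question
soak <- c(NA, 70L, 578L, 21L, 2L, 41L, 14L, 16L, 32L, 9L, 45L, 20L, 51L, 25L,
          364L, 4L, 6764L, 20L, 4L, 5L)

# Treat a missing soak as a short soak (1), as the answers above do
soak_filled <- replace(soak, is.na(soak), 1L)

# Each TRUE (soak > 3600) bumps the running count by one
trip <- cumsum(soak_filled > 3600)
trip
```

With this replacement value, rows before the first long soak get trip 0. Replacing NA with a value above 3600 instead (the question used 10000) would make the first row start trip 1.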

Calculate moving geometric mean by water sampling station

I need to calculate the moving geometric mean of fecal coliform over time (at each value I want the geomean of that value and the previous 29 values), separately for each sampling station. When I download the data from our database the column headers are:
Station SampleDate FecalColiform
Depending on the growing area there are a few to over a dozen stations.
I tried to adapt some code that I found online:
#File: Fecal
Fecal <- group_by(Fecal, Station) %>%
arrange(SampleDate) %>%
mutate(logres = log10(ResultValue)) %>%
mutate(mgm = stats::filter(logres, rep(1/24, 24), sides =1))
This worked, but the problem is that I don't want the resulting log values. I want just the regular geomean so that I can plot it and everyone can easily understand the values. I tried to work the geometric.mean function from the psych package in there, but I could not make that work.
There are resources for calculating a moving average, and code for calculating geometric mean and I have tried to combine several of them. I can't find an example for moving geometric mean.
Eventually I would like to graph all of geomeans by station similar to the example in the link above.
> dput(ByStationRGMData[1:10,])
structure(list(Station = c(114L, 114L, 114L, 114L, 114L, 114L,
114L, 114L, 114L, 114L), Classification = structure(c(3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c(" Approved ", " Conditionally Approved ",
" Prohibited "), class = "factor"), SampleDate = c(19890103L,
19890103L, 19890209L, 19890316L, 19890413L, 19890511L, 19890615L,
19890713L, 19890817L, 19890914L), SWTemp = c(NA, NA, 5L, 8L,
NA, 13L, 15L, 18L, NA, 18L), Salinity = c(NA, NA, 22L, 18L, NA,
26L, 22L, 24L, NA, 32L), FecalColiform = c(180, 49, 2, 17, 7.9,
1.8, 4.5, 11, 33, 1.8), RGM = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
)), .Names = c("Station", "Classification", "SampleDate", "SWTemp",
"Salinity", "FecalColiform", "RGM"), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), vars = list(
Station), drop = TRUE, indices = list(0:9), group_sizes = 10L, biggest_group_size = 10L, labels = structure(list(
Station = 114L), class = "data.frame", row.names = c(NA,
-1L), vars = list(Station), drop = TRUE, .Names = "Station"))
I would also like to add a moving 90th percentile to the dataframe and the graphs. I tried the following:
ByStationRGMData <- RawData %>%
group_by(Station) %>%
arrange(SampleDate) %>%
mutate(RGM = as.numeric(rollapply(FecalColiform, 30, geometric.mean, fill=NA, align="right"))) +
mutate(F90 = as.numeric(rollapply(FecalColiform, 30, quantile, p=0.90, fill=NA, align="right")))
This gives me the error:
Error in mutate_(.data, .dots = lazyeval::lazy_dots(...)) : argument ".data" is missing, with no default
I can't seem to figure out what I'm missing.
You can use rollapply from the zoo package (illustrated here using the built-in mtcars data frame). I've used a window of 3 values, but you can set that to 30 in your actual data. align="right" uses the current value and the n-1 previous values, where n is the window width (align="left" would instead use the current value and the n-1 following values):
library(psych)
library(dplyr)
library(zoo)
mtcars %>%
mutate(mpgGM = rollapply(mpg, 3, geometric.mean, fill=NA, align="right"))
Add a group_by() step before mutate() to get rolling geometric means separately for each group. The error in the question comes from joining the two mutate() calls with + instead of %>%, so the second mutate() is called without any data.
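For comparison, here is a base R sketch that keeps the question's stats::filter approach but back-transforms the result, so the values are on the original scale rather than the log scale. The helper rolling_geomean and the two-station data frame are hypothetical; only the first FecalColiform values are taken from the question:

```r
# Trailing geometric mean: average the logs over the last k values, then exponentiate
rolling_geomean <- function(x, k) {
  # stats::filter() errors if a series is shorter than k
  as.numeric(exp(stats::filter(log(x), rep(1 / k, k), sides = 1)))
}

# Small illustrative data set with two stations (station 115 is made up)
fecal <- data.frame(
  Station = c(114, 114, 114, 114, 115, 115, 115, 115),
  FecalColiform = c(180, 49, 2, 17, 7.9, 1.8, 4.5, 11)
)

# ave() applies the helper within each station and keeps the row order
fecal$RGM <- ave(fecal$FecalColiform, fecal$Station,
                 FUN = function(x) rolling_geomean(x, 3))
fecal
```

The first k-1 rows of each station are NA because there are not yet enough previous values to fill the window. The rows are assumed to be sorted by sample date within each station.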

R Program Vector, record Column Percent

This is my data frame:
head(sep)
I must find the percent of all SEP 11 in each row.
For instance, in the first row, the percent of SEP 11 is
100 * ((63 + 124)/ (63 + 124 + 0 + 0))
I would like this stored in a newly created 8th column.
Thanks
dput output:
> dput(head(sep))
structure(list(Site = structure(1:6, .Label = c("31R001", "31R002",
"31R003", "31R004", "31R005", "31R006", "31R007", "31R008", "31R011",
"31R013", "31R014", "31R016", "31R018", "31R019", "31R020", "31R021",
"31R022", "31R023", "31R024", "31R025", "31R026", "31R027", "31R029",
"31R030", "31R031", "31R032", "31R034", "31R035", "31R036", "31R038",
"31R039", "31R040", "31R041", "31R042", "31R043", "31R044", "31R045",
"31R046", "31R048", "31R049", "31R050", "31R051", "31R052", "31R053",
"31R054", "31R055", "31R056", "31R057", "31R058", "31R059", "31R060",
"31R061", "31R069", "31R071", "31R072", "31R075", "31R435", "31R440",
"31R445", "31R450", "31R455", "31R460", "31R470", "31R600", "31R722",
"31R801", "31R825", "31R826", "31R829", "31R840", "31R843", "31R861",
"31R880"), class = "factor"), Latitude = c(33.808874, 33.877256,
33.820825, 33.852373, 33.829697, 33.810274), Longitude = c(-117.844048,
-117.700135, -117.811845, -117.795516, -117.787532, -117.830429
), Windows.SEP.11 = c(63L, 174L, 11L, 85L, 163L, 71L), Mac.SEP.11 = c(0L,
1L, 4L, 0L, 0L, 50L), Windows.SEP.12 = c(124L, 185L, 9L, 75L,
23L, 5L), Mac.SEP.12 = c(0L, 1L, 32L, 1L, 0L, 50L)), .Names = c("Site",
"Latitude", "Longitude", "Windows.SEP.11", "Mac.SEP.11", "Windows.SEP.12",
"Mac.SEP.12"), row.names = c(NA, 6L), class = "data.frame")
Assuming that you want the row sums of the columns that have 'Windows' in their names, we subset the dataset ("sep1") using grep. Then we take rowSums(Sub1), divide by the row sums of all the numeric count columns (sep1[4:7]), multiply by 100, and assign the result to a new column ("newCol"):
Sub1 <- sep1[grep("Windows", names(sep1))]
sep1$newCol <- 100*rowSums(Sub1)/rowSums(sep1[4:7])
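A self-contained sketch of the same calculation on the six rows from the question. Here the count columns are picked by name rather than by position, which is an assumption about how the real data is laid out:

```r
# First six rows from the dput() output in the question
sep1 <- data.frame(
  Site = c("31R001", "31R002", "31R003", "31R004", "31R005", "31R006"),
  Latitude = c(33.808874, 33.877256, 33.820825, 33.852373, 33.829697, 33.810274),
  Longitude = c(-117.844048, -117.700135, -117.811845, -117.795516,
                -117.787532, -117.830429),
  Windows.SEP.11 = c(63L, 174L, 11L, 85L, 163L, 71L),
  Mac.SEP.11 = c(0L, 1L, 4L, 0L, 0L, 50L),
  Windows.SEP.12 = c(124L, 185L, 9L, 75L, 23L, 5L),
  Mac.SEP.12 = c(0L, 1L, 32L, 1L, 0L, 50L)
)

# Windows columns form the numerator, all SEP count columns the denominator
windows_cols <- grep("Windows", names(sep1))
count_cols <- grep("SEP", names(sep1))

sep1$newCol <- 100 * rowSums(sep1[windows_cols]) / rowSums(sep1[count_cols])
sep1[c("Site", "newCol")]
```

The first row reproduces the worked example from the question, 100 * (63 + 124) / (63 + 124 + 0 + 0).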

Using frequency of column value in dataframe to calculate new column value

So I have an example dataframe that holds the columns id, count and usernames, with id and count being numbers and usernames being a string.
For every row of the dataframe I want to set a value of a new column called 'ratio', with ratio being defined as
count / number of rows where username == the username in this row
Example from the provided data:
In every row where the username is 'Tom' the ratio would be count/4 , because the user Tom is found four times in the data.
This is just a simplified version of my problem, a for-loop is not an option because my original dataframe has about 3.4 million rows and my previous approach where I used for-loops to iterate the unique values of e.g. 'username' to solve this problem takes forever.
dput of my dataframe:
structure(list(id = 1:20, count = c(140L, 89L, 17L, 114L, 129L,
86L, 21L, 50L, 197L, 160L, 8L, 14L, 78L, 208L, 155L, 55L, 63L,
20L, 189L, 79L), usernames = structure(c(4L, 3L, 5L, 5L, 2L,
3L, 1L, 1L, 3L, 1L, 3L, 2L, 5L, 5L, 4L, 4L, 2L, 2L, 2L, 3L), .Label = c("Jerry",
"Mark", "Phil", "Tina", "Tom"), class = "factor")), .Names = c("id",
"count", "usernames"), row.names = c(NA, 20L), class = "data.frame")
I hope I provided everything for you to understand and reproduce the problem, if something's missing don't hesitate to mention it in the comments.
There are several options. Here are three: one in base R, one with data.table, and one with "plyr". All assume we're starting with a data.frame named "mydf":
Base R
within(mydf, {
temp <- as.numeric(ave(as.character(usernames), usernames, FUN = length))
ratio <- count/temp
rm(temp)
})
data.table
library(data.table)
DT <- data.table(mydf)
DT[, ratio := count/.N, by = "usernames"]
DT
plyr
library(plyr)
ddply(mydf, .(usernames), transform,
ratio = count/length(usernames))
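You can use ave for this. Pass a numeric column such as count as the first argument; passing the usernames factor itself would try to store the group lengths in a factor and produce NAs:
transform(d, ratio = count / ave(count, usernames, FUN = length))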
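Another base R sketch of the same idea counts each username once with table() and then looks the count up per row. The name n_per_user is illustrative; the data is rebuilt from the dput() output in the question:

```r
mydf <- data.frame(
  id = 1:20,
  count = c(140L, 89L, 17L, 114L, 129L, 86L, 21L, 50L, 197L, 160L,
            8L, 14L, 78L, 208L, 155L, 55L, 63L, 20L, 189L, 79L),
  usernames = factor(c("Tina", "Phil", "Tom", "Tom", "Mark", "Phil", "Jerry",
                       "Jerry", "Phil", "Jerry", "Phil", "Mark", "Tom", "Tom",
                       "Tina", "Tina", "Mark", "Mark", "Mark", "Phil"))
)

# Number of rows per username, computed once for the whole data frame
n_per_user <- table(mydf$usernames)

# Look up each row's group size by name and divide
mydf$ratio <- mydf$count / as.vector(n_per_user[as.character(mydf$usernames)])
head(mydf)
```

This is fully vectorised, so it avoids the per-username loop described in the question.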
