R program visualization + How to plot between categorical and numerical data


I want to build a plot with the date on the x axis and the A, B and C values on the y axis. A is categorical; B and C are numerical. On the x axis, the date should be shown as day and month, like 01/07, 02/07. The problem is that A is categorical while B and C are numerical. The chart should not be a bar chart; it should be something like a scatter plot.
I don't know how to do this. Your help would be highly appreciated. Thanks.
dput(df)
structure(list(Date = structure(list(sec = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = 1:31, mon = c(6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), year = c(116L,
116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L,
116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L,
116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L), wday = c(5L,
6L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 0L,
1L, 2L, 3L, 4L, 5L, 6L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 0L), yday = 182:212,
isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), zone = c("JST", "JST", "JST", "JST", "JST",
"JST", "JST", "JST", "JST", "JST", "JST", "JST", "JST", "JST",
"JST", "JST", "JST", "JST", "JST", "JST", "JST", "JST", "JST",
"JST", "JST", "JST", "JST", "JST", "JST", "JST", "JST"),
gmtoff = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_)), .Names = c("sec", "min", "hour",
"mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"
), class = c("POSIXlt", "POSIXt"), tzone = "Asia/Tokyo"), A =
structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("N",
"YES"), class = "factor"), B = c(0L, 0L, 0L, 0L, 1L, 3L, 5L,
10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), C = c(0L, 0L, 0L, 0L, 1L,
3L, 5L, 10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L, 5L, 10L, 9L,
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L, 10L, 9L, 8L)), .Names = c("Date",
"A", "B", "C"), row.names = c(NA, -31L), class = "data.frame")
sample data:
Date A B C
1 2016-07-01 N 0 0
2 2016-07-02 N 0 0
3 2016-07-03 N 0 0
4 2016-07-04 N 0 0
5 2016-07-05 N 1 1
6 2016-07-06 N 3 3
7 2016-07-07 N 5 5
8 2016-07-08 N 10 10
9 2016-07-09 N 9 9
10 2016-07-10 N 8 8
11 2016-07-11 N 7 7
12 2016-07-12 N 6 6
13 2016-07-13 N 5 5
14 2016-07-14 N 4 4
15 2016-07-15 N 3 3
16 2016-07-16 N 2 2
17 2016-07-17 N 1 1
18 2016-07-18 N 0 5
19 2016-07-19 N 0 10
20 2016-07-20 N 0 9
21 2016-07-21 N 0 8
22 2016-07-22 N 0 7
23 2016-07-23 N 0 6
24 2016-07-24 YES 0 5
25 2016-07-25 N 0 4
26 2016-07-26 N 0 3
27 2016-07-27 N 0 2
28 2016-07-28 N 0 1
29 2016-07-29 N 0 10
30 2016-07-30 N 0 9
31 2016-07-31 N 0 8
I tried this code. It works well for a single value like B, but when I try to plot C along with B, the graph looks different.
library(scales)
library(ggplot2)
ggplot(df, aes(x = Date, y = B)) +
geom_line(aes(y=B,group=1),colour="#000099") +
geom_point(size=2, colour="#CC0000") +
scale_y_continuous(breaks = seq(0, 50, by = 1)) +
scale_x_datetime(date_breaks = "2 day")

Here's one possible way to do it. I am not sure if this is ideal but hopefully helps in your case:
Using Markers
#library(devtools)
#install_github("ropensci/plotly")
library(plotly)
# as.Date() keeps the day of each date; going through zoo::as.yearmon() first would collapse every date to the first of the month
df$Date <- as.Date(df$Date)
plot_ly(df, x = ~Date) %>%
add_markers(y = ~B, marker = list(size = 15, symbol = "cross"),
name = "B") %>%
add_markers(y = ~C, marker = list(size = 10, symbol = "circle"),
name = "C") %>%
add_markers(y = ~A, marker = list(size = 20, symbol = "diamond-open"),
name = "A", yaxis = "y2") %>%
layout(xaxis = list(domain = c(0, 0.9)),
yaxis2 = list(side = "right", anchor = "xaxis", overlaying = "y", title = "A"),
yaxis = list(title = "B/C"))
Using lines
#library(devtools)
#install_github("ropensci/plotly")
library(plotly)
df$x <- 1:nrow(df)
df$Date <- as.character(as.Date(df$Date))
plot_ly(df, x = ~x) %>%
add_lines(y = ~B, line = list(width = 5), opacity = 0.5,
name = "B") %>%
add_lines(y = ~C, line = list(width = 5, dash = "5px"),
name = "C") %>%
add_markers(y = ~A, marker = list(size = 15, symbol = "cross"),
name = "A", yaxis = "y2") %>%
layout(xaxis = list(domain = c(0, 0.9), title = "",
tickmode = "array",
tickvals = ~x,
ticktext = ~Date,
tickfont = list(size = 10)),
yaxis2 = list(side = "right", anchor = "xaxis", overlaying = "y", title = "A"),
yaxis = list(title = "B/C"))
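If you would rather stay in ggplot2 (as in the question), here is a rough sketch of the same idea, not a definitive solution. It assumes Date has been converted to class Date, stacks B and C into one long column so that each gets its own colour, and shows A through the point shape instead of a second axis. The names long, series and value are made up for this example, and the data frame below just retypes the question's values.

```r
library(ggplot2)

# Stand-in for the question's df: Date as class Date, A as factor, B and C numeric
df <- data.frame(
  Date = as.Date("2016-07-01") + 0:30,
  A = factor(c(rep("N", 23), "YES", rep("N", 7))),
  B = c(0, 0, 0, 0, 1, 3, 5, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, rep(0, 14)),
  C = c(0, 0, 0, 0, 1, 3, 5, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1,
        5, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8)
)

# Stack B and C into long form (hypothetical column names: series, value)
long <- data.frame(
  Date = rep(df$Date, 2),
  A = rep(df$A, 2),
  series = rep(c("B", "C"), each = nrow(df)),
  value = c(df$B, df$C)
)

# One line per series; the categorical A is encoded as point shape
p <- ggplot(long, aes(x = Date, y = value, colour = series)) +
  geom_line() +
  geom_point(aes(shape = A), size = 3) +
  scale_x_date(date_breaks = "2 days", date_labels = "%d/%m")
```

print(p) draws the plot. The date_labels = "%d/%m" format gives axis labels such as 01/07, 03/07.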

Related

How to read many lidar files (.las) at once and combine them into one dataframe in R

I have a folder containing many lidar (.las) files. My code currently looks like this:
library(rgdal)
library(raster)
library(tmaptools)
library(tmap)
library(lidR)
library(RStoolbox)
las=readLAS("C:/1/078-638.las")
las1=readLAS("C:/1/082-628.las")
las2=....
With more than 100 files, it's hard to write a separate line for each one. Is there a way to read all these files at once, as data frames?
A .las file has the following structure:
las=payload(las)
las=structure(list(X = c(638238.76, 638238.76, 638239.29, 638235.39,
638233.86, 638233.86, 638235.55, 638231.97, 638231.91, 638228.41
), Y = c(6078001.09, 6078001.09, 6078001.15, 6078001.15, 6078001.07,
6078001.07, 6078001.02, 6078001.08, 6078001.09, 6078001.01),
Z = c(186.64, 186.59, 199.28, 189.37, 186.67, 186.67, 198.04,
200.03, 199.73, 192.14), gpstime = c(319805734.664265, 319805734.664265,
319805734.67875, 319805734.678768, 319805734.678777, 319805734.678777,
319805734.687338, 319805734.701928, 319805734.701928, 319805734.701945
), Intensity = c(13L, 99L, 5L, 2L, 20L, 189L, 2L, 11L, 90L,
1L), ReturnNumber = c(2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,
3L), NumberOfReturns = c(2L, 1L, 3L, 2L, 1L, 1L, 3L, 1L,
1L, 4L), ScanDirectionFlag = c(1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L), EdgeOfFlightline = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), Classification = c(1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), Synthetic_flag = c(FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), Keypoint_flag = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
), Withheld_flag = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE), ScanAngleRank = c(-12L, -12L,
-12L, -12L, -12L, -12L, -12L, -13L, -13L, -13L), UserData = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), PointSourceID = c(16L,
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), Pulse.width = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-10L))
Structure of las1:
las1=structure(list(X = c(628800.68, 628800.75, 628801.43, 628801.47,
628802.13, 628802.19, 628800.19, 628800.24, 628799.57, 628799.58
), Y = c(6082001.07, 6082001.08, 6082001.19, 6082001.2, 6082001.3,
6082001.31, 6082001.21, 6082001.22, 6082001.12, 6082001.12),
Z = c(163.16, 162.96, 163.09, 162.97, 163.12, 162.98, 163.29,
163.16, 163.02, 162.99), gpstime = c(319799021.884921, 319799021.884921,
319799021.884929, 319799021.884929, 319799021.884938, 319799021.884938,
319799021.889375, 319799021.889375, 319799021.889384, 319799021.889384
), Intensity = c(12L, 99L, 14L, 112L, 14L, 121L, 17L, 167L,
20L, 189L), ReturnNumber = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), NumberOfReturns = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), ScanDirectionFlag = c(1L, 1L, 1L, 1L, 1L,
1L, 0L, 0L, 0L, 0L), EdgeOfFlightline = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L), Classification = c(1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L), Synthetic_flag = c(FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
Keypoint_flag = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE), Withheld_flag = c(FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
ScanAngleRank = c(17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L), UserData = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), PointSourceID = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L), Pulse.width = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L)), class = "data.frame", row.names = c(NA, -10L))
After reading all the .las files, I want to combine them into one dataset, with a column indicating which file each row came from, something like this:
X Y Z gpstime Intensity ReturnNumber number las
638238.76 6078001.09 186.64 319805734.664265 13 2 1
638238.76 6078001.09 186.59 319805734.664265 99 1 1
638239.29 6078001.15 199.28 319805734.67875 5 1 1
638235.39 6078001.15 189.37 319805734.678768 2 2 1
638233.86 6078001.07 186.67 319805734.678777 20 1 1
638233.86 6078001.07 186.67 319805734.678777 189 1 1
638235.55 6078001.02 198.04 319805734.687338 2 2 1
638231.97 6078001.08 200.03 319805734.701928 11 1 1
638231.91 6078001.09 199.73 319805734.701928 90 1 1
638228.41 6078001.01 192.14 319805734.701945 1 3 1
628800.68 6082001.07 163.16 319799021.884921 12 1 2
628800.75 6082001.08 162.96 319799021.884921 99 1 2
628801.43 6082001.19 163.09 319799021.884929 14 1 2
628801.47 6082001.2 162.97 319799021.884929 112 1 2
628802.13 6082001.3 163.12 319799021.884938 14 1 2
628802.19 6082001.31 162.98 319799021.884938 121 1 2
628800.19 6082001.21 163.29 319799021.889375 17 1 2
628800.24 6082001.22 163.16 319799021.889375 167 1 2
628799.57 6082001.12 163.02 319799021.889384 20 1 2
628799.58 6082001.12 162.99 319799021.889384 189 1 2
So how can I read all .las files from the folder C:/1, get each of them into the data frame format shown above, and then combine them into one dataset with the las file number?
Thanks for any help.
*Edit
Now I get the following error:
list_df <- filenames %>%
purrr::map(., ~readLAS(.x) %>% mutate(filenumber = match(.x, filenames)))
Error in UseMethod("mutate") :
no suitable method for 'mutate' applied to object of las
I think it is because a LAS object has this structure:
las
class : LAS (v1.2 format 1)
memory : 308.7 Mb
extent : 637999, 638240.5, 6077999, 6079999 (xmin, xmax, ymin, ymax)
coord. ref. : NA
area : 409328 units²
points : 3.68 million points
density : 8.99 points/units²
The data frame has to be extracted from the LAS object with the payload function mentioned above, but
list_df <- filenames %>%
purrr::map(., ~readLAS(.x) %>% payload(filenumber = match(.x, filenames)))
also gives an error.

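Maybe something like this:
library(lidR)
library(dplyr)
# pattern is a regular expression, so match the literal ".las" extension
filenames <- list.files(path = "C:/1/", pattern = "\\.las$", full.names = TRUE)
# Read each file, extract its point data frame, and tag rows with the file's index
list_df <- filenames %>%
purrr::map(., ~readLAS(.x) %>% payload() %>% mutate(filenumber = match(.x, filenames)))
# If all data has the same structure, you can easily bind them together, i.e.
list_df %>% bind_rows()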

No need to loop. rlas already natively supports reading multiple files:
rlas::read.las(filenames)
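If you also need the per-file number column from the question, the combining step can be sketched in base R as below. This is only an illustration: the two small data frames stand in for the per-file point tables (a few values copied from the question), and the names tables, tagged and combined are made up.

```r
# Stand-ins for the data frames extracted from two .las files
las  <- data.frame(X = c(638238.76, 638239.29),
                   Y = c(6078001.09, 6078001.15),
                   Z = c(186.64, 199.28))
las1 <- data.frame(X = c(628800.68, 628800.75),
                   Y = c(6082001.07, 6082001.08),
                   Z = c(163.16, 162.96))

tables <- list(las, las1)

# Add a column holding each table's position in the list, then stack the rows
tagged <- Map(function(d, i) cbind(d, las = i), tables, seq_along(tables))
combined <- do.call(rbind, tagged)
```

With real files, tables would be the list produced by reading each file in turn, and the las column then records which file a row came from.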

Adding together factor levels across multiple columns

I've got a dataset with the same four factor levels repeated over a number of columns. I'm trying to total the counts for each level in each column (in effect adding the rows together), but without any success using summarise(n = n()). Instead of getting a (number of columns) x 4 data frame, I just get the entire thing counted.
This is the code I've tried:
percentages_20_notconstant <- allchangingreaders_20 %>%
group_by(resp) %>%
summarise(resp = n(colnames(allchangingreaders_20)))
structure(list(resp = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 4L,
1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("Don't Know",
"Leave", "Remain", "Will Not Vote"), class = "factor"), euRefVoteW1 = c(0L,
0L, 0L, 0L, 3L, 5L, 1L, 0L, 12L, 0L, 0L, 1L, 17L, 10L, 0L, 5L,
13L, 9L, 0L, 3L), euRefVoteW2 = c(0L, 0L, 0L, 0L, 4L, 5L, 0L,
0L, 13L, 0L, 0L, 0L, 16L, 12L, 0L, 4L, 10L, 10L, 0L, 5L), euRefVoteW3 = c(0L,
0L, 0L, 0L, 3L, 4L, 0L, 2L, 11L, 1L, 0L, 1L, 17L, 8L, 1L, 6L,
13L, 8L, 0L, 4L), euRefVoteW4 = c(0L, 0L, 0L, 0L, 3L, 4L, 0L,
2L, 12L, 0L, 0L, 1L, 19L, 10L, 0L, 3L, 12L, 8L, 0L, 5L), euRefVoteW6 = c(0L,
0L, 0L, 0L, 4L, 4L, 0L, 1L, 13L, 0L, 0L, 0L, 20L, 8L, 0L, 4L,
13L, 7L, 0L, 5L), euRefVoteW7 = c(0L, 0L, 0L, 0L, 2L, 6L, 0L,
1L, 13L, 0L, 0L, 0L, 18L, 14L, 0L, 0L, 11L, 12L, 0L, 2L), euRefVoteW8 = c(0L,
0L, 0L, 0L, 2L, 7L, 0L, 0L, 12L, 1L, 0L, 0L, 19L, 12L, 0L, 1L,
12L, 12L, 0L, 1L), euRefVoteW9 = c(0L, 0L, 0L, 0L, 4L, 5L, 0L,
0L, 12L, 1L, 0L, 0L, 21L, 11L, 0L, 0L, 11L, 14L, 0L, 0L)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
I have managed to do what I was after by changing a separate function, but I think this question is still worth keeping up.
So the thing I'm trying to do is to go from the first dput to this dput:
structure(list(resp = structure(c(3L, 2L, 4L, 1L), .Label = c("Don't Know",
"Leave", "Remain", "Will Not Vote"), class = "factor"), euRefVoteW1 = c(45L,
24L, 1L, 9L), euRefVoteW2 = c(43L, 27L, 0L, 9L), euRefVoteW3 = c(44L,
21L, 1L, 13L), euRefVoteW4 = c(46L, 22L, 0L, 11L), euRefVoteW6 = c(50L,
19L, 0L, 10L), euRefVoteW7 = c(44L, 32L, 0L, 3L), euRefVoteW8 = c(45L,
32L, 0L, 2L), euRefVoteW9 = c(48L, 31L, 0L, 0L), Paper = structure(c(1L,
1L, 1L, 1L), .Label = "Former Readers", class = "factor")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Can this be done with summarise?

After grouping by 'resp', get the rowSums of cur_data() (which doesn't include the grouping column) and then wrap it with sum:
library(dplyr)
allchangingreaders_20 %>%
group_by(resp) %>%
summarise(n = sum(rowSums(cur_data())), .groups = 'drop')
Output:
# A tibble: 4 x 2
# resp n
#* <fct> <dbl>
#1 Don't Know 57
#2 Leave 208
#3 Remain 365
#4 Will Not Vote 2
Or, if you want the count of elements greater than 0:
allchangingreaders_20 %>%
group_by(resp) %>%
summarise(n = sum(rowSums(cur_data() > 0)))
# A tibble: 4 x 2
# resp n
#* <fct> <dbl>
#1 Don't Know 20
#2 Leave 27
#3 Remain 32
#4 Will Not Vote 2
Update
Based on the updated expected output, we can also do
allchangingreaders_20 %>%
group_by(resp) %>%
summarise(across(where(is.numeric), sum), .groups = 'drop')

Are you looking for this?
allchangingreaders_20 %>% group_by(resp) %>%
summarise(across(everything(), ~sum(.)))
# A tibble: 4 x 9
resp euRefVoteW1 euRefVoteW2 euRefVoteW3 euRefVoteW4 euRefVoteW6 euRefVoteW7 euRefVoteW8 euRefVoteW9
<fct> <int> <int> <int> <int> <int> <int> <int> <int>
1 Don't~ 9 9 13 11 10 3 2 0
2 Leave 24 27 21 22 19 32 32 31
3 Remain 45 43 44 46 50 44 45 48
4 Will ~ 1 0 1 0 0 0 0 0
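For comparison, the same per-column totals can be sketched in base R with aggregate(). The toy data below is made up (two waves instead of eight, two levels instead of four), so treat it as an illustration rather than the asker's data.

```r
# Toy stand-in: repeated resp levels with one count column per wave
votes <- data.frame(
  resp = factor(c("Remain", "Leave", "Remain", "Leave")),
  W1 = c(3L, 5L, 12L, 0L),
  W2 = c(4L, 5L, 13L, 0L)
)

# Sum every other column within each level of resp
totals <- aggregate(. ~ resp, data = votes, FUN = sum)
print(totals)
```

The result has one row per level of resp and one summed column per wave, which is the shape of the expected output in the question.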

Formatting x_scale in ggplot with weekly data

In my ggplot, I've managed to create an x scale based on time (I have weekly data), but I am not sure how to create month-year labels using scale_x_date.
The following is my code. I have tried using (...breaks = "1 month") and (...minor_breaks = "1 month"), but this does not produce the desired result. I am aiming for the labels to simply be Dec-16, Jan-17, Feb-17 and so on. What is the proper format to make the x scale show month and year in an abbreviated way?
ggplot(data=test, aes(x=as.Date(test$weekly), y=test$dist, group=1)) +
geom_path(col = "blue") +
scale_x_date(labels = date_format("%b-%Y"))
Here is a sample of the data
> dput(test)
structure(list(weekly = structure(list(sec = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = c(18L, 25L, 1L,
8L, 15L, 22L, 29L, 5L, 12L, 19L, 26L, 3L, 10L, 17L, 24L, 31L,
7L, 14L, 21L, 28L, 5L, 12L, 19L, 26L, 2L, 9L, 16L, 23L, 30L,
6L, 13L, 20L, 27L, 6L, 13L, 20L, 27L, 3L, 10L, 17L, 24L, 1L,
8L, 15L, 22L, 29L, 5L, 12L, 19L, 26L, 3L, 10L, 17L, 24L, 31L),
mon = c(6L, 6L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L,
9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 0L, 0L,
0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L),
year = c(116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L,
116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L,
116L, 116L, 116L, 116L, 116L, 116L, 117L, 117L, 117L, 117L,
117L, 117L, 117L, 117L, 117L, 117L, 117L, 117L, 117L, 117L,
117L, 117L, 117L, 117L, 117L, 117L, 117L, 117L, 117L, 117L,
117L, 117L, 117L, 117L, 117L, 117L, 117L), wday = c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), yday = c(199L, 206L, 213L,
220L, 227L, 234L, 241L, 248L, 255L, 262L, 269L, 276L, 283L,
290L, 297L, 304L, 311L, 318L, 325L, 332L, 339L, 346L, 353L,
360L, 1L, 8L, 15L, 22L, 29L, 36L, 43L, 50L, 57L, 64L, 71L,
78L, 85L, 92L, 99L, 106L, 113L, 120L, 127L, 134L, 141L, 148L,
155L, 162L, 169L, 176L, 183L, 190L, 197L, 204L, 211L), isdst = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), zone = c("CEST", "CEST",
"CEST", "CEST", "CEST", "CEST", "CEST", "CEST", "CEST", "CEST",
"CEST", "CEST", "CEST", "CEST", "CEST", "CET", "CET", "CET",
"CET", "CET", "CET", "CET", "CET", "CET", "CET", "CET", "CET",
"CET", "CET", "CET", "CET", "CET", "CET", "CET", "CET", "CET",
"CEST", "CEST", "CEST", "CEST", "CEST", "CEST", "CEST", "CEST",
"CEST", "CEST", "CEST", "CEST", "CEST", "CEST", "CEST", "CEST",
"CEST", "CEST", "CEST"), gmtoff = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), dist = c(23621.19,
30400.01, 27458.66, 24511.07, 24908.37, 24413.81, 21096.52, 24557.51,
14833.71, 15513.72, 13516.23, 8102.02, 5881.44, 8370.01, 7339.34,
7703.79, 9297.52, 4542, 3555.56, 4438.33, 1968.65, 2259.06, 2729.89,
2876.66, 1767.86, 784.17, 2004.55, 4446.98, 2203.16, 3956.35,
4095.28, 3999.88, 3288.59, 4593.19, 6164.63, 6111.46, 8462.84,
7404.8, 9725.91, 9652.72, 9357.52, 15535.51, 11810.82, 17890.89,
23518.06, 18754.44, 16377.46, 15023.27, 23354.14, 23328.12, 27024.1,
23414.38, 28273.08, 24213.3, 19068.03)), .Names = c("weekly",
"dist"), row.names = c(NA, -55L), class = "data.frame")

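One possible approach, sketched under assumptions: convert the POSIXlt column with as.Date() before plotting, refer to columns by bare name inside aes(), and let scale_x_date() build the labels via date_breaks and date_labels. Note that "%y" is the two-digit year, whereas "%Y" in the question's code gives four digits. The data below is made up (same weekly dates, invented dist values) and only stands in for test.

```r
library(ggplot2)

# Made-up stand-in for test: 55 weekly dates starting 2016-07-18
test <- data.frame(
  weekly = seq(as.Date("2016-07-18"), by = "week", length.out = 55),
  dist = seq(1000, by = 100, length.out = 55)
)

p <- ggplot(test, aes(x = weekly, y = dist, group = 1)) +
  geom_path(colour = "blue") +
  # one break per month, labelled as abbreviated month and two-digit year
  scale_x_date(date_breaks = "1 month", date_labels = "%b-%y")
```

With the question's data, test$weekly <- as.Date(test$weekly) would come first. The month abbreviations follow the session locale.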
group by and then count unique observations [duplicate]

I have a data frame that looks like this:
date time id datetime
1 2015-01-02 14:27:22.130 999000000007628 2015-01-02 14:27:22
2 2015-01-02 14:41:27.720 989001002807730 2015-01-02 14:41:27
3 2015-01-02 14:41:27.940 989001002807730 2015-01-02 14:41:27
4 2015-01-02 14:41:28.140 989001002807730 2015-01-02 14:41:28
5 2015-01-02 14:41:28.170 989001002807730 2015-01-02 14:41:28
6 2015-01-02 14:41:28.350 989001002807730 2015-01-02 14:41:28
I need to find the number of unique "id"s for each "date" in that data frame.
I tried this:
sums<-data.frame(date=unique(data$date), numIDs=0)
for(i in unique(data$date)){
sums[sums$date==i,]$numIDs<-length(unique(data[data$date==i,]$id))
}
and I got the following error:
Error in `$<-.data.frame`(`*tmp*`, "numIDs", value = 0L) :
replacement has 1 row, data has 0
In addition: Warning message:
In `==.default`(data$date, i) :
longer object length is not a multiple of shorter object length
Any ideas?? Thank you!
Hopefully this helps!
data <- structure(list(date = structure(list(sec = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), mon = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(115L, 115L, 115L, 115L,
115L, 115L, 115L, 115L, 115L, 115L), wday = c(5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L), yday = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), zone = c("PST", "PST", "PST", "PST", "PST",
"PST", "PST", "PST", "PST", "PST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), time = c("14:27:22.130",
"14:41:27.720", "14:41:27.940", "14:41:28.140", "14:41:28.170",
"14:41:28.350", "14:41:28.390", "14:41:28.520", "14:41:28.630",
"14:41:28.740"), id = c("999000000007628", "989001002807730",
"989001002807730", "989001002807730", "989001002807730", "989001002807730",
"989001002807730", "989001002807730", "989001002807730", "989001002807730"
), datetime = structure(list(sec = c(22.13, 27.72, 27.94, 28.14,
28.17, 28.35, 28.39, 28.52, 28.63, 28.74), min = c(27L, 41L,
41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L), hour = c(14L, 14L, 14L,
14L, 14L, 14L, 14L, 14L, 14L, 14L), mday = c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), year = c(115L, 115L, 115L, 115L, 115L, 115L, 115L,
115L, 115L, 115L), wday = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L), yday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), isdst = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), zone = c("PST", "PST", "PST",
"PST", "PST", "PST", "PST", "PST", "PST", "PST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), site = c("Chivato",
"Chivato", "Chivato", "Chivato", "Chivato", "Chivato", "Chivato",
"Chivato", "Chivato", "Chivato")), .Names = c("date", "time",
"id", "datetime", "site"), row.names = c(NA, 10L), class = "data.frame")

You can use the uniqueN function from data.table:
library(data.table)
setDT(df)[, uniqueN(id), by = date]
or (as per the comment of @Richard Scriven):
aggregate(id ~ date, df, function(x) length(unique(x)))

Or we could use n_distinct from library(dplyr)
library(dplyr)
df %>%
group_by(date) %>%
summarise(id=n_distinct(id))
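A side note on the error in the question: date there is POSIXlt, which is stored as a list of components (sec, min, hour, ...), so looping over unique(data$date) and comparing with == probably does not walk through the dates the way the loop expects. Converting to Date first avoids that. A minimal sketch with made-up rows (the second date is invented so that there are two groups):

```r
# Toy stand-in for the question's data
dates <- as.POSIXlt(c("2015-01-02", "2015-01-02", "2015-01-02", "2015-01-03"),
                    tz = "UTC")
ids <- c("999000000007628", "989001002807730",
         "989001002807730", "989001002807730")

# Store the date as class Date (an atomic vector) instead of POSIXlt
df <- data.frame(date = as.Date(dates), id = ids)

# Number of distinct ids per date
counts <- aggregate(id ~ date, df, function(x) length(unique(x)))
print(counts)
```

The same conversion is worth doing before the data.table and dplyr versions above, since both work more smoothly with Date or POSIXct columns than with POSIXlt.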

This answer is in response to the post "group by and then count unique observations", which was marked as a duplicate while I was writing this draft. It is not a response to the question it was marked as a duplicate of, "How to find number of unique ids corresponding to each date in a data frame", which asks about finding unique IDs. I'm not sure the second post actually answers the OP's question, which is:
"I want to create a table with the number of unique id for each
combination of group1 and group2."
The keyword here is 'combination'. The interpretation is each id has a particular value for group1 and a particular value for group2 so that the set of data of interest is the particular set of values c(id, group1, group2).
Here is the data.frame the OP provided:
df1 <- data.frame(id=sample(letters, 10000, replace = T),
group1=sample(1:2, 10000, replace = T),
group2=sample(100:101, 10000, replace = T))
Using data.table inspired by this post -- https://stackoverflow.com/a/13017723/5220858:
>library(data.table)
>DT <- data.table(df1)
>DT[, .N, by = .(group1, group2)]
group1 group2 N
1: 1 100 2493
2: 1 101 2455
3: 2 100 2559
4: 2 101 2493
Here N is the number of rows that have a particular group1 value and a particular group2 value. Expanding the grouping to include id returns a table of the 104 unique (id, group1, group2) combinations, each with its own row count.
>DT[, .N, by = .(id, group1, group2)]
id group1 group2 N
1: t 1 100 107
2: g 1 101 85
3: l 1 101 98
4: a 1 100 83
5: j 1 101 98
---
100: p 1 101 96
101: r 2 101 91
102: y 1 101 104
103: g 1 100 83
104: r 2 100 77
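Keep in mind that .N counts rows, not distinct ids. If the goal is the number of unique ids per combination of group1 and group2, a distinct count per group is needed instead. A base R sketch of the difference, using the same kind of simulated data (the seed and the names rows_per_combo and ids_per_combo are my own additions):

```r
set.seed(1)  # seed added here only to make the sketch reproducible
df1 <- data.frame(id = sample(letters, 10000, replace = TRUE),
                  group1 = sample(1:2, 10000, replace = TRUE),
                  group2 = sample(100:101, 10000, replace = TRUE))

# Rows per (group1, group2) combination, comparable to .N
rows_per_combo <- aggregate(id ~ group1 + group2, df1, length)

# Distinct ids per (group1, group2) combination
ids_per_combo <- aggregate(id ~ group1 + group2, df1,
                           function(x) length(unique(x)))
```

Since id is drawn from only 26 letters, the distinct count per combination can never exceed 26, while the row counts add up to 10000.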

