Say I have these datasets:
df1=
structure(list(date = c("17.02.2021", "04.11.2020", "14.11.2020",
"24.11.2020", "29.11.2020", "04.12.2020", "09.12.2020"), x1 = c(0L,
0L, 7L, 0L, 0L, 0L, 0L), x2 = c(674L, 632L, 1036L, 656L, 736L,
762L, 698L), x3 = c(698L, 712L, 1140L, 704L, 784L, 786L, 722L
), x4 = c(522L, 472L, 988L, 464L, 608L, 578L, 514L), x5 = c(2408L,
3256L, 2840L, 2840L, 2888L, 2632L, 2648L), x6 = c(1952L, 2336L,
2480L, 2208L, 2208L, 2144L, 2016L), x7 = c(1056L, 1120L, 1504L,
1056L, 1184L, 1184L, 1120L), x8 = c(1984L, 2464L, 2400L, 2144L,
2208L, 2144L, 2080L), x9 = c(2336L, 2976L, 2784L, 2464L, 2784L,
2528L, 2400L), x10 = c(2528L, 3232L, 3104L, 2848L, 2912L, 2592L,
2656L), x11 = c(1248L, 1312L, 1504L, 1312L, 1312L, 1312L, 1248L
)), class = "data.frame", row.names = c(NA, -7L))
Each row is a date: the first day has one data profile, the second day has another, and so on.
Here is the reference dataset:
df2=structure(list(date = c("06.11.2019", "01.12.2019", "25.01.2020",
"04.02.2020", "09.02.2020", "14.02.2020"), x1 = c(12L, 0L, 1L,
6L, 23L, 1L), x2 = c(1272L, 1046L, 688L, 572L, 592L, 328L), x3 = c(1032L,
974L, 736L, 780L, 800L, 568L), x4 = c(792L, 862L, 496L, 476L,
592L, 296L), x5 = c(2232L, 1496L, 1784L, 2792L, 3064L, 3544L),
x6 = c(2976L, 1904L, 1632L, 1760L, 1376L, 1440L), x7 = c(1568L,
1248L, 1008L, 1120L, 992L, 800L), x8 = c(1888L, 1376L, 1632L,
2400L, 2464L, 2720L), x9 = c(2080L, 1504L, 1760L, 2848L,
2912L, 3296L), x10 = c(2400L, 1552L, 1824L, 2848L, 2928L,
3360L), x11 = c(2400L, 1504L, 1120L, 1040L, 784L, 736L)), class = "data.frame", row.names = c(NA,
-6L))
Is there a way or method to compare the profile of each row of df1 with the reference dataset df2, returning 1 if the profiles are similar and 0 otherwise?
The dates in the two datasets can differ; the main problem is detecting whether the profiles are similar or not.
My desired output: Peter's code is good, but is it also possible to calculate the difference between profiles by variable, for example?
This code allows you to visually compare the reference and df1 profiles. As you can see, none of the profiles match exactly. Some profiles are similar, but without a definition of "similar", as pointed out by @user2974951, it's difficult to move this closer to an answer.
library(dplyr)
library(tidyr)
library(ggplot2)
# restructure the data to allow comparison between the datasets
df <-
  expand.grid("date_ref" = df2$date, "date_df1" = df1$date) %>%
  left_join(df2, by = c("date_ref" = "date")) %>%
  left_join(df1, by = c("date_df1" = "date")) %>%
  pivot_longer(starts_with("x"), names_to = c("var", "df"), names_sep = "\\.") %>%
  mutate(df = if_else(df == "x", "ref", "df1"),
         var = factor(var, paste0("x", 1:11)))
# now you can plot the data to compare profiles; had to add some formatting to make the graph readable.
ggplot(df, aes(var, value, group = df, colour = df)) +
  geom_line() +
  facet_grid(date_ref ~ date_df1) +
  labs(colour = "Dataset") +
  theme_classic() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(size = 6, angle = 90),
        axis.text.y = element_text(size = 6),
        strip.text = element_text(size = 6))
Created on 2021-04-07 by the reprex package (v1.0.0)
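To address the follow-up about per-variable differences: a hedged sketch, reusing the long-format df built above (dplyr and tidyr are already loaded), that widens the data and subtracts each reference value from the corresponding df1 value:
# one row per (date_ref, date_df1, var) combination, with the reference
# and df1 values side by side, then their difference
df %>%
  pivot_wider(names_from = df, values_from = value) %>%
  mutate(diff = df1 - ref) %>%
  head()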
What you need to define first is which criterion of similarity you want to use and what your threshold level of similarity is (how similar the datasets need to be to be considered equivalent). Another important factor is the nature of your data, for example whether you consider x1..x11 to be independent variables or just different samples of the same set.
Depending on the answers, it can be anything from comparing each df1[i, 2:12] to df2[i, 2:12] exactly (to see whether they are simply duplicates), to comparing both of them to NA and checking whether they are both NA or both known values. Something in between would be checking that the difference in each parameter between two rows is not greater than, say, 0.05 of the minimal value, and marking the rows equivalent if all the parameters pass; or computing something like Pearson's correlation coefficient for each pair of rows (the cor(x, y) function uses it by default) and comparing its value to, say, 0.5 (both 0.05 and 0.5 are arbitrary numbers, of course, and would probably need some adjustment). Or maybe the number of matching points (compared exactly as integers, or just similar to some degree) is a better indicator for you. There are also standard tests for sample group dissimilarity, time series dissimilarity, and other statistical hypotheses. Many of them are available in R from the bundled packages, and if you fancy something else, it is most likely already available in one of the extra packages you can easily download and install.
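To make the correlation option concrete, a minimal sketch; the 0.5 cut-off is the arbitrary value mentioned above and would need tuning for your data:
# correlate every df1 profile with every df2 profile; cor() correlates
# matrix columns, so transposing puts one profile in each column
m1 <- t(as.matrix(df1[, -1]))   # 11 x 7, one column per df1 date
m2 <- t(as.matrix(df2[, -1]))   # 11 x 6, one column per reference date
sim <- cor(m1, m2)              # 7 x 6 matrix of Pearson correlations
dimnames(sim) <- list(df1$date, df2$date)
flag <- +(sim > 0.5)            # 1 = "similar" under this arbitrary cut-off
flag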
Related
I have a data frame loaded in R and I need to sum one row. The problem is that I've tried to use the rowSums() function, but two columns are not numeric (one is the character column "Nazwa" and one is the logical column "X" at the end of the data frame). Is there any option to sum the row without those two columns? I'd like to start from row 1, column 3, and not include the last column.
My data:
structure(list(Kod = c(0L, 200000L, 400000L, 600000L, 800000L,
1000000L), Nazwa = c("POLSKA", "DOLNOŚLĄSKIE", "KUJAWSKO-POMORSKIE",
"LUBELSKIE", "LUBUSKIE", "ŁÓDZKIE"), gospodarstwa.ogółem.gospodarstwa.2006.... = c(9187L,
481L, 173L, 1072L, 256L, 218L), gospodarstwa.ogółem.gospodarstwa.2007.... = c(11870L,
652L, 217L, 1402L, 361L, 261L), gospodarstwa.ogółem.gospodarstwa.2008.... = c(14896L,
879L, 258L, 1566L, 480L, 314L), gospodarstwa.ogółem.gospodarstwa.2009.... = c(17091L,
1021L, 279L, 1710L, 579L, 366L), gospodarstwa.ogółem.gospodarstwa.2010.... = c(20582L,
1227L, 327L, 1962L, 833L, 420L), gospodarstwa.ogółem.gospodarstwa.2011.... = c(23449L,
1322L, 371L, 2065L, 1081L, 478L), gospodarstwa.ogółem.gospodarstwa.2012.... = c(25944L,
1312L, 390L, 2174L, 1356L, 518L), gospodarstwa.ogółem.gospodarstwa.2013.... = c(26598L,
1189L, 415L, 2129L, 1422L, 528L), gospodarstwa.ogółem.gospodarstwa.2014.... = c(24829L,
1046L, 401L, 1975L, 1370L, 508L), gospodarstwa.ogółem.gospodarstwa.2015.... = c(22277L,
849L, 363L, 1825L, 1202L, 478L), gospodarstwa.ogółem.gospodarstwa.2016.... = c(22435L,
813L, 470L, 1980L, 1148L, 497L), gospodarstwa.ogółem.gospodarstwa.2017.... = c(20257L,
741L, 419L, 1904L, 948L, 477L), gospodarstwa.ogółem.gospodarstwa.2018.... = c(19207L,
713L, 395L, 1948L, 877L, 491L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2006..ha. = c(228038L,
19332L, 4846L, 19957L, 12094L, 3378L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2007..ha. = c(287529L,
21988L, 5884L, 23934L, 18201L, 3561L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2008..ha. = c(314848L,
28467L, 5943L, 26892L, 18207L, 4829L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2009..ha. = c(367062L,
26427L, 6826L, 30113L, 22929L, 5270L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2010..ha. = c(519069L,
39703L, 7688L, 34855L, 35797L, 7671L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2011..ha. = c(605520L,
45547L, 8376L, 34837L, 44259L, 8746L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2012..ha. = c(661688L,
44304L, 8813L, 37466L, 52581L, 9908L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2013..ha. = c(669970L,
37455L, 11152L, 40819L, 54692L, 10342L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2014..ha. = c(657902L,
37005L, 11573L, 38467L, 53300L, 11229L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2015..ha. = c(580730L,
31261L, 10645L, 34052L, 46343L, 10158L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2016..ha. = c(536579L,
29200L, 9263L, 31343L, 43235L, 9986L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2017..ha. = c(494978L,
27542L, 8331L, 29001L, 37923L, 9260L), gospodarstwa.ogółem.powierzchnia.użytków.rolnych.2018..ha. = c(484677L,
27357L, 7655L, 28428L, 37174L, 8905L), X = c(NA, NA, NA, NA,
NA, NA)), row.names = c(NA, 6L), class = "data.frame")
My attempt:
rowSums(dane_csv[, 3:length(dane_csv$Nazwa=='POLSKA')])
Using base R
rowSums(dane_csv[sapply(dane_csv, is.numeric)])
Output:
1 2 3 4 5 6
6667212 627833 511473 1033876 1288648 1108797
Or with dplyr
library(dplyr)
dane_csv %>%
transmute(out = rowSums(across(where(is.numeric))))
In base R you can use the Filter function to select the numeric columns, then call rowSums on them:
rowSums(Filter(is.numeric, dane_csv))
1 2 3 4 5 6
6667212 627833 511473 1033876 1288648 1108797
You can select only the numeric columns:
library(dplyr)
dane_csv %>%
select(where(is.numeric)) %>%
rowSums() %>%
first()
Result:
1
6667212
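If you prefer to follow the positional description literally (row 1, starting at column 3, excluding the last column), a minimal base R sketch, assuming the logical "X" column really is the last one:
# sum row 1 across columns 3 .. (last - 1), skipping "Kod"/"Nazwa" and "X"
rowSums(dane_csv[1, 3:(ncol(dane_csv) - 1)])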
I have a data frame with the following structure:
I want to create two bar plots, with two facets (Sin and T), with the time on the X-axis and the different A, B, C, D, and E columns on the Y-axis (the bars can be stacked or not).
How can I do that?
Thanks in advance.
Something like this?
library(tidyverse)
df %>%
  pivot_longer(-c(COND, Time)) %>%
  ggplot(aes(x = factor(Time), y = value, fill = name)) +
  geom_col(position = position_dodge()) +
  facet_wrap(. ~ COND) +
  xlab("Time")
data:
df <- structure(list(COND = c("Sin", "Sin", "Sin", "Sin", "T", "T",
"T", "T"), Time = c(0L, 1L, 6L, 8L, 0L, 1L, 6L, 8L), A = c(54L,
202L, 155L, 202L, 244L, 321L, 149L, 155L), B = c(1536L, 732L,
2577L, 1321L, 1744L, 1952L, 3857L, 1780L), C = c(34018L, 80476L,
4173L, 119L, 33851L, 56320L, 2494L, 696L), D = c(10458L, 33655L,
357L, 452L, 10869L, 30667L, 1839L, 3315L), E = c(3500L, 1904L,
0L, 0L, 3035L, 2839L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-8L))
I'm analysing real-estate sales for some N. American cities and am using k-means clustering on the data. I have seven clusters and for each observation in the cluster I have the latitude, longitude, zipcode, and cluster_id. I'd like to plot this on a map to better visualize the clusters - I'm not sure what such a plot is called - Choropleth? Polygon?
Most of the examples are using geoJSON files but I only have a data.frame object from my k-means clustering.
Actual data:
https://www.kaggle.com/threnjen/portland-housing-prices-sales-jul-2020-jul-2021
Sample data:
> dput(dt[runif(n = 10,min = 1,max = 25000)])
structure(list(id = c(23126L, 15434L, 5035L, 19573L, NA, 24486L,
NA, 14507L, 3533L, 20192L), zipcode = c(97224L, 97211L, 97221L,
97027L, NA, 97078L, NA, 97215L, 97124L, 97045L), latitude = c(45.40525436,
45.55965805, 45.4983139, 45.39398956, NA, 45.47454071, NA, 45.50736618,
45.52812958, 45.34381485), longitude = c(-122.7599182, -122.6500015,
-122.7288742, -122.591217, NA, -122.8898392, NA, -122.6084061,
-122.91745, -122.5948334), lastSoldPrice = c(469900L, 599000L,
2280000L, 555000L, NA, 370000L, NA, 605000L, 474900L, 300000L
), lotSize = c(5227L, 4791L, 64904L, 9147L, NA, 2178L, NA, 4356L,
2613L, 6969L), livingArea = c(1832L, 2935L, 5785L, 2812L, NA,
1667L, NA, 2862L, 1844L, 742L), cluster_id = c(7, 7, 2, 7, NA,
4, NA, 7, 7, 4)), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7faa8000fee0>)
I've followed the example on https://gist.github.com/josecarlosgonz/8565908 to try and create a geoJSON file to be able to plot this data but without success.
I'm not using markers because I have ~25,000 observations - it would be difficult to plot them all and the file would take forever to load.
EDIT:
observations by zipcode:
> dput(dat[, .N, by = .(`address/zipcode`)][(order(`address/zipcode`))])
structure(list(`address/zipcode` = c(7123L, 97003L, 97004L, 97005L,
97006L, 97007L, 97008L, 97009L, 97015L, 97019L, 97023L, 97024L,
97027L, 97030L, 97034L, 97035L, 97038L, 97045L, 97056L, 97060L,
97062L, 97068L, 97070L, 97078L, 97080L, 97086L, 97089L, 97113L,
97123L, 97124L, 97132L, 97140L, 97201L, 97202L, 97203L, 97204L,
97205L, 97206L, 97209L, 97210L, 97211L, 97212L, 97213L, 97214L,
97215L, 97216L, 97217L, 97218L, 97219L, 97220L, 97221L, 97222L,
97223L, 97224L, 97225L, 97227L, 97229L, 97230L, 97231L, 97232L,
97233L, 97236L, 97239L, 97266L, 97267L), N = c(1L, 352L, 9L,
252L, 421L, 1077L, 357L, 1L, 31L, 2L, 4L, 159L, 239L, 525L, 640L,
548L, 1L, 1064L, 5L, 353L, 471L, 736L, 6L, 403L, 866L, 913L,
8L, 5L, 1113L, 776L, 3L, 543L, 219L, 684L, 463L, 1L, 57L, 809L,
189L, 216L, 688L, 510L, 504L, 330L, 318L, 177L, 734L, 195L, 832L,
305L, 276L, 589L, 688L, 716L, 286L, 83L, 1307L, 475L, 77L, 150L,
382L, 444L, 290L, 423L, 430L)), row.names = c(NA, -65L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7f904781a6e0>)
I used the Kaggle data on a simple laptop (i3 8th gen) to generate a ggplot2 object with randomly sampled cluster IDs, then transformed it via the ggplotly() function. The resulting plotly object seems OK to work with for analysis, but I do not know your performance requirements:
library(dplyr)
library(ggplot2)
library(plotly)
library(rnaturalearth) # this is the package we get the basic map data from
# read in data from zip, select minimal number of columns and sample cluster_id
df <- readr::read_csv(unzip("path_to_zip/portland_housing.csv.zip")) %>%
  dplyr::select(az = `address/zipcode`, latitude, longitude) %>%
  dplyr::mutate(cluster_id = sample(1:7, n(), replace = TRUE))
# get the map data
world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")
# build the ggplot2 object (note the ring shapes and the alpha parameter to reduce overplotting)
plt <- ggplot2::ggplot(data = world) +
  ggplot2::geom_sf() +
  ggplot2::geom_point(data = df, aes(x = longitude, y = latitude, color = factor(cluster_id)), size = 1, shape = 21, alpha = .7) +
  ggplot2::coord_sf(xlim = c(-124.5, -122), ylim = c(45, 46), expand = FALSE)
# plot it:
plt
# plotly auto transform from ggplot2 object
plotly::ggplotly(plt)
EDIT
To include a base map you can use, for example, the ggmap package instead of the map data from rnaturalearth. I will only display the plotly result:
library(ggmap)
# https://stackoverflow.com/questions/23130604/plot-coordinates-on-map
sbbox <- ggmap::make_bbox(lon = c(-124.5, -122), lat = c(45, 46), f = .1)
myarea <- ggmap::get_map(location=sbbox, zoom=10, maptype="terrain")
myarea <- ggmap::ggmap(myarea)
plt2 <- myarea +
  ggplot2::geom_point(data = df, mapping = aes(x = longitude, y = latitude, color = factor(cluster_id)), shape = 21, alpha = .7)
plotly::ggplotly(plt2)
There are many other approaches to the map data, like using the Mapbox API.
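For instance, a hedged leaflet sketch (untested against your full data, but small borderless circle markers usually cope with ~25k points):
library(leaflet)
# one colour per cluster
pal <- colorFactor("Set1", domain = df$cluster_id)
leaflet(df) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~longitude, lat = ~latitude,
                   color = ~pal(cluster_id), radius = 2,
                   stroke = FALSE, fillOpacity = 0.5)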
I have two files named counties.rds and houses1990.rds. The first one (counties.rds) contains the counties of the state of California in the USA, and the second one (houses1990.rds) gives some information about houses. I used the following code to create the variables Cali_s and Houses_Cali:
Cali_s <- readRDS("counties.rds")
Cali_s <- raster::aggregate(Cali_s, by = "NAME")
Houses_Cali <- readRDS("houses1990.rds")
To give you some information about Cali_s and Houses_Cali, the output of dput(head(Cali_s)) and dput(head(Houses_Cali)) is as follows:
dput(head(Cali_s))
structure(list(NAME = c("Alameda", "Alpine", "Amador", "Butte",
"Calaveras", "Colusa")), row.names = c(NA, 6L), class = "data.frame")
dput(head(Houses_Cali))
structure(list(houseValue = c(452600L, 358500L, 352100L, 341300L,
342200L, 269700L), income = c(8.3252, 8.3014, 7.2574, 5.6431,
3.8462, 4.0368), houseAge = c(41L, 21L, 52L, 52L, 52L, 52L),
rooms = c(880L, 7099L, 1467L, 1274L, 1627L, 919L), bedrooms = c(129L,
1106L, 190L, 235L, 280L, 213L), population = c(322L, 2401L,
496L, 558L, 565L, 413L), households = c(126L, 1138L, 177L,
219L, 259L, 193L), latitude = c(37.88, 37.86, 37.85, 37.85,
37.85, 37.85), longitude = c(-122.23, -122.22, -122.24, -122.25,
-122.25, -122.25)), row.names = c(NA, 6L), class = "data.frame")
I used the following code to plot the interactive map below, which displays the county boundaries:
tmap_mode("view")
tm_shape(Cali_s) + tm_borders(alpha = 0.9, col = "BLUE") + tm_text("NAME", size = 0.7)
Considering houseValue (inside Houses_Cali) as the target (dependent) variable, I want to plot the following map (named map 1):
Furthermore, using a point-in-polygon operation and the mean function, I want to merge the Houses_Cali object into Cali_s in order to plot the following thematic map (named map 2) of the houseValue variable:
Could you please help me with how to plot the two maps above (map 1 and map 2)?
Thank you in advance for your help.
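Not a full answer, but a minimal sketch of the point-in-polygon step with sf, assuming Houses_Cali holds plain lon/lat in WGS84 and both layers end up in the same CRS (st_transform() may be needed first):
library(sf)
library(dplyr)
library(tmap)
# houses as an sf point layer
houses_sf <- st_as_sf(Houses_Cali,
                      coords = c("longitude", "latitude"), crs = 4326)
counties_sf <- st_as_sf(Cali_s)
# map 1: points coloured by houseValue
tm_shape(houses_sf) + tm_dots(col = "houseValue")
# point-in-polygon join, then mean houseValue per county (map 2)
county_means <- st_join(houses_sf, counties_sf) %>%
  st_drop_geometry() %>%
  group_by(NAME) %>%
  summarise(mean_houseValue = mean(houseValue, na.rm = TRUE))
counties_sf %>%
  left_join(county_means, by = "NAME") %>%
  tm_shape() + tm_polygons("mean_houseValue")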
I have asked a question here several times about how to make a binary heat map on a data frame, without getting any feedback, which I believe means it is not that easy. I have tried to break the problem down in order to get some answers.
So consider my data structure containing only one value of time, as follows:
df1<- structure(list(time = c(1L, 1L, 1L, 1L, 1L, 1L), level = structure(1:6, .Label = c("B",
"C", "D", "E", "F", "G"), class = "factor"), X2 = c(266L, 480L,
396L, 501L, 481L, 542L), X3 = c(356L, 587L, 491L, 320L, 883L,
422L), X4 = c(452L, 406L, 521L, 300L, 582L, 506L), X5 = c(549L,
368L, 548L, 528L, 673L, 701L), X6 = c(414L, 398L, 526L, 593L,
639L, 484L)), .Names = c("time", "level", "X2", "X3", "X4", "X5",
"X6"), class = "data.frame", row.names = c(NA, -6L))
At first I get its range as follows:
min(df1[3:length(df1)])
max(df1[3:length(df1)])
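Equivalently, range() returns both extremes in one call:
range(df1[3:ncol(df1)])
# [1] 266 883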
This means the data lie between 266 and 883.
Now I want values closer to the lower end to be red and values closer to the higher end to be yellow.
I tried to plot it with reshape2 as follows, but I could not figure out how to set the values and the other details:
library(reshape2)
melted_cormat <- melt(df1)
ggplot(data = melted_cormat, aes(x=level, y=variable, fill=value))
Using something similar to your code (note the inclusion of the id variables)...
melted_cormat <- melt(df1, id = c("time", "level"))
ggplot(data = melted_cormat, aes(x=level, y=variable, fill=value)) +
geom_tile() + scale_fill_gradient(low = "red", high = "yellow")
made this plot...
Note that you can substitute hex codes for "red" and "yellow" in scale_fill_gradient if you want a particular hue or two.
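For example, replacing the names with their hex equivalents gives the same gradient:
# "#FF0000" is red, "#FFFF00" is yellow
scale_fill_gradient(low = "#FF0000", high = "#FFFF00")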