Clustering longitudinal data with multiple variables in R

I have a dataset containing observations from 30 people, each of whom completed 20 experiments. Suppose my data look like this:
ID   trial reaction response prop_1 prop_2
"s1"     1     2.12        0   0.52   0.48
"s1"     2     1.32        1   0.12   0.88
"s1"     3       NA        1     NA     NA
"s2"     1     2.33        1   0.65   0.35
"s2"     2     2.56        0   0.43   0.57
"s2"     3       NA        1     NA     NA
I want to cluster the participants using these variables. I looked into the traj, latrend and kml packages, but all of them cluster on just one variable. How can I use multiple variables to cluster longitudinal data like this?
Any simple help or guidance would be appreciated.
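One simple option to start with (a hedged sketch using only base R; this is feature-based clustering rather than a dedicated longitudinal method): summarise each participant's trials into a few per-subject features and run kmeans on the scaled feature matrix. The column names follow the example above; the choice of summaries and the number of clusters are purely illustrative.
# Hedged sketch: feature-based clustering with base R.
# Assumes a data.frame `dat` with the columns shown above
# (ID, trial, reaction, response, prop_1, prop_2) and that every
# subject has at least one non-missing value per variable.
feats <- aggregate(cbind(reaction, response, prop_1) ~ ID, data = dat,
                   FUN = mean, na.rm = TRUE, na.action = na.pass)
X <- scale(feats[, -1])            # standardise the per-subject features
set.seed(1)
km <- kmeans(X, centers = 3)       # number of clusters chosen for illustration only
feats$cluster <- km$cluster
head(feats)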

Here is one way to do it (a k-means example in Python).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()

df = pd.read_csv('C:\\business.csv')
df.dropna(axis=0, how='any', subset=['latitude', 'longitude'], inplace=True)

# Elbow curve: fit k-means for a range of cluster counts and score each fit
K_clusters = range(1, 10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
Y_axis = df[['latitude']]
X_axis = df[['longitude']]
score = [kmeans[i].fit(Y_axis).score(Y_axis) for i in range(len(kmeans))]

# Visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

# Cluster on both coordinates
X = df[['longitude', 'latitude']].copy()
kmeans = KMeans(n_clusters=5, init='k-means++')
X['cluster_label'] = kmeans.fit_predict(X[['longitude', 'latitude']])  # compute k-means clustering
centers = kmeans.cluster_centers_        # coordinates of the cluster centers
labels = X['cluster_label'].values       # label of each point
X.head(10)
X.plot.scatter(x='latitude', y='longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 1], centers[:, 0], c='black', s=200, alpha=0.5)
Here's another idea.
# import necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from collections import Counter

df = pd.read_csv('C:\\properties_2017.csv')
# df.head(10)
df = df.head(10000)
list(df)
df.shape
df = df.sample(frac=0.2, replace=True, random_state=1)
df.shape
df = df.fillna(0)
df.isna().sum()
df['regionidzip'] = df['regionidzip'].fillna(97000)
df.dropna(axis=0, how='any', subset=['latitude', 'longitude'], inplace=True)
X = df.loc[:, ['latitude', 'longitude']]
zp = df.regionidzip
id_n = 8
kmeans = KMeans(n_clusters=id_n, random_state=0).fit(X)
id_label = kmeans.labels_

# plot result
ptsymb = np.array(['b.', 'r.', 'm.', 'g.', 'c.', 'k.', 'b*', 'r*', 'm*', 'r^'])
plt.figure(figsize=(12, 12))
plt.ylabel('Longitude', fontsize=12)
plt.xlabel('Latitude', fontsize=12)
for i in range(id_n):
    cluster = np.where(id_label == i)[0]
    plt.plot(X.latitude[cluster].values, X.longitude[cluster].values, ptsymb[i])
plt.show()

# revise the clustering based on zipcode
uniq_zp = np.unique(zp)
for i in uniq_zp:
    a = np.where(zp == i)[0]
    c = Counter(id_label[a])
    id_label[a] = c.most_common(1)[0][0]

# plot result (revised)
plt.figure(figsize=(12, 12))
plt.ylabel('Longitude', fontsize=12)
plt.xlabel('Latitude', fontsize=12)
for i in range(id_n):
    cluster = np.where(id_label == i)[0]
    plt.plot(X.latitude[cluster].values, X.longitude[cluster].values, ptsymb[i])
plt.show()
https://www.kaggle.com/xxing9703/kmean-clustering-of-latitude-and-longitude?select=zillow_data_dictionary.xlsx
https://www.kaggle.com/c/zillow-prize-1/data
Also, check this out.
https://towardsdatascience.com/clustering-geospatial-data-f0584f0b04ec
https://raw.githubusercontent.com/mdipietro09/DataScience_ArtificialIntelligence_Utils/master/machine_learning/data_stores.csv

Related

Can't re-project precipitation data from Stereographic to PlateCarree() using Cartopy

I am trying to plot precipitation data from the National Weather Service. However, the data are in a stereographic projection by default. I'd like to plot in a PlateCarree projection, but I am having some difficulties. When I try to use the PlateCarree projection in Cartopy, it plots the map but will not overlay the precipitation data. I assume this means I am not properly re-projecting the data from stereographic to PlateCarree. Is there anything specific I need to do in order to re-project the data correctly?
Here is the code that works with the stereographic projection:
"""
=====================
NWS Precipitation Map
=====================
Plot a 1-day precipitation map using a netCDF file from the National Weather Service.
This opens the data directly in memory using the support in the netCDF library to open
from an existing memory buffer. In addition to CartoPy and Matplotlib, this uses
a custom colortable as well as MetPy's unit support.
"""
###############################
# Imports
from datetime import datetime, timedelta
from urllib.request import urlopen
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
from metpy.plots import USCOUNTIES
from metpy.units import masked_array, units
from netCDF4 import Dataset
import pandas as pd
###############################
# Download the data from the National Weather Service.
dt = datetime.utcnow() - timedelta(days=1) # This should always be available
url = 'http://water.weather.gov/precip/downloads/{dt:%Y/%m/%d}/nws_precip_1day_'\
'{dt:%Y%m%d}_conus.nc'.format(dt=dt)
data = urlopen(url).read()
nc = Dataset('data', memory=data)
###############################
# Pull the needed information out of the netCDF file
prcpvar = nc.variables['observation']
data = masked_array(prcpvar[:], units(prcpvar.units.lower())).to('in')
#data = data * 0.0393
x = nc.variables['x'][:]
y = nc.variables['y'][:]
proj_var = nc.variables[prcpvar.grid_mapping]
#%%
###############################
# Set up the projection information within CartoPy
globe = ccrs.Globe(semimajor_axis=proj_var.earth_radius)
proj = ccrs.Stereographic(central_latitude=90.0,
                          central_longitude=proj_var.straight_vertical_longitude_from_pole,
                          true_scale_latitude=proj_var.standard_parallel, globe=globe)
###############################
# Create the figure and plot the data
# create figure and axes instances
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111, projection=proj)
#ax.set_extent([-75,-85,35,39])
#draw coastlines, state and country boundaries, edge of map.
ax.coastlines(resolution='10m')
ax.add_feature(cfeature.BORDERS.with_scale('10m'), linewidth=1.5)
ax.add_feature(cfeature.STATES.with_scale('10m'), linewidth=2.0)
ax.add_feature(USCOUNTIES.with_scale('500k'), edgecolor='black')
# draw filled contours.
clevs = [0.01, 0.1, 0.25, 0.50, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0,
6.0, 8.0, 10., 20.0]
# In future MetPy
# norm, cmap = ctables.registry.get_with_boundaries('precipitation', clevs)
cmap_data = [
"#04e9e7", # 0.01 - 0.10 inches
"#019ff4", # 0.10 - 0.25 inches
"#0300f4", # 0.25 - 0.50 inches
"#02fd02", # 0.50 - 0.75 inches
"#01c501", # 0.75 - 1.00 inches
"#008e00", # 1.00 - 1.50 inches
"#fdf802", # 1.50 - 2.00 inches
"#e5bc00", # 2.00 - 2.50 inches
"#fd9500", # 2.50 - 3.00 inches
"#fd0000", # 3.00 - 4.00 inches
"#d40000", # 4.00 - 5.00 inches
"#bc0000", # 5.00 - 6.00 inches
"#f800fd", # 6.00 - 8.00 inches
"#9854c6", # 8.00 - 10.00 inches
"#fdfdfd" # 10.00+
]
cmap = mcolors.ListedColormap(cmap_data, 'precipitation')
norm = mcolors.BoundaryNorm(clevs, cmap.N)
cs = ax.contourf(x, y, data, clevs, alpha = 0.5, cmap=cmap, norm=norm)
# add colorbar.
cbar = plt.colorbar(cs, orientation='vertical')
cbar.set_label(data.units)
time = nc.creation_time[4:6]+'/'+nc.creation_time[6:8]+'/'+nc.creation_time[0:4]+' '\
+nc.creation_time[9:11] +':'+ nc.creation_time[11:13] + " UTC"
print(time)
ax.set_title('24 hr Precipitation (in)' + '\n for period ending ' + time, fontsize = 16, fontweight = 'bold' )
However, when I change the projection lines to PlateCarree I run into the issue described above. Does anyone have any advice on how to re-project this data?
Thanks
Just add the transform parameter to the ax.contourf call. The x and y coordinates are in the stereographic projection, so passing transform=proj tells Cartopy how to re-project the data onto the PlateCarree axes.
# Imports
from datetime import datetime, timedelta
from urllib.request import urlopen
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
from metpy.plots import USCOUNTIES
from metpy.units import masked_array, units
from netCDF4 import Dataset
import pandas as pd
###############################
# Download the data from the National Weather Service.
dt = datetime.utcnow() - timedelta(days=1) # This should always be available
url = 'http://water.weather.gov/precip/downloads/{dt:%Y/%m/%d}/nws_precip_1day_'\
'{dt:%Y%m%d}_conus.nc'.format(dt=dt)
data = urlopen(url).read()
nc = Dataset('data', memory=data)
###############################
# Pull the needed information out of the netCDF file
prcpvar = nc.variables['observation']
data = masked_array(prcpvar[:], units(prcpvar.units.lower())).to('in')
#data = data * 0.0393
x = nc.variables['x'][:]
y = nc.variables['y'][:]
proj_var = nc.variables[prcpvar.grid_mapping]
#%%
###############################
# Set up the projection information within CartoPy
globe = ccrs.Globe(semimajor_axis=proj_var.earth_radius)
proj = ccrs.Stereographic(central_latitude=90.0,
                          central_longitude=proj_var.straight_vertical_longitude_from_pole,
                          true_scale_latitude=proj_var.standard_parallel, globe=globe)
###############################
# Create the figure and plot the data
# create figure and axes instances
fig = plt.figure(figsize=(15, 15))
pc_proj = ccrs.PlateCarree()
#ax = fig.add_subplot(111, projection=proj)
ax = fig.add_subplot(111, projection=pc_proj)
ax.set_extent([-128,-65,25,52])
#draw coastlines, state and country boundaries, edge of map.
ax.coastlines(resolution='10m')
#ax.add_feature(cfeature.BORDERS.with_scale('10m'), linewidth=1.5)
#ax.add_feature(cfeature.STATES.with_scale('10m'), linewidth=2.0)
#ax.add_feature(USCOUNTIES.with_scale('500k'), edgecolor='black')
# draw filled contours.
clevs = [0.01, 0.1, 0.25, 0.50, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0,
6.0, 8.0, 10., 20.0]
# In future MetPy
# norm, cmap = ctables.registry.get_with_boundaries('precipitation', clevs)
cmap_data = [
"#04e9e7", # 0.01 - 0.10 inches
"#019ff4", # 0.10 - 0.25 inches
"#0300f4", # 0.25 - 0.50 inches
"#02fd02", # 0.50 - 0.75 inches
"#01c501", # 0.75 - 1.00 inches
"#008e00", # 1.00 - 1.50 inches
"#fdf802", # 1.50 - 2.00 inches
"#e5bc00", # 2.00 - 2.50 inches
"#fd9500", # 2.50 - 3.00 inches
"#fd0000", # 3.00 - 4.00 inches
"#d40000", # 4.00 - 5.00 inches
"#bc0000", # 5.00 - 6.00 inches
"#f800fd", # 6.00 - 8.00 inches
"#9854c6", # 8.00 - 10.00 inches
"#fdfdfd" # 10.00+
]
cmap = mcolors.ListedColormap(cmap_data, 'precipitation')
norm = mcolors.BoundaryNorm(clevs, cmap.N)
#cs = ax.contourf(x, y, data, clevs, alpha = 0.5, cmap=cmap, norm=norm)
# add transform args
cs = ax.contourf(x, y, data, clevs, alpha = 0.5, cmap=cmap, norm=norm,transform=proj)
# add colorbar.
cbar = plt.colorbar(cs, orientation='vertical')
cbar.set_label(data.units)
time = nc.creation_time[4:6]+'/'+nc.creation_time[6:8]+'/'+nc.creation_time[0:4]+' '\
+nc.creation_time[9:11] +':'+ nc.creation_time[11:13] + " UTC"
print(time)
ax.set_title('24 hr Precipitation (in)' + '\n for period ending ' + time, fontsize = 16, fontweight = 'bold' )
plt.savefig("prec_usa_pc.png")
#plt.show()
Below is the output figure.

Method for calculating volume under a surface in R

I'm trying to calculate the volume under a 3d surface in R.
My data, dat, looks like:
          0.003     0.019     0.083      0.25       0.5         1
0     1.0000000 0.8884265 0.8603268 0.7719994 0.7443621 0.6571405
0.111 0.6909722 0.6775000 0.6443750 0.6243750 0.5914730 0.5698242
0.25  0.5847205 0.6022367 0.5572917 0.5432991 0.5170673 0.4835819
0.429 0.5210938 0.5139063 0.4995312 0.4864062 0.4648636 0.4163698
0.667 0.4363103 0.4526562 0.4321859 0.4027519 0.4046011 0.3661616
1     0.3958333 0.4167468 0.3964428 0.3810459 0.3486328 0.3487930
where x = rownames(dat), y = colnames(dat) and z = dat.
I've looked here, here, and here, but can't seem to figure out how to apply those answers to my use case.
Here's a reproducible version of my data:
dat = structure(c(1,0.690972222222222,0.584720477386935,0.52109375,0.436310279187817,0.395833333333333,0.888426507537688,0.6775,0.602236675126904,0.51390625,0.45265625,0.416746794871795,0.860326776649746, 0.644375, 0.557291666666667,0.49953125,0.432185913705584,0.396442819148936,0.771999378109453,0.624375,0.543299129353234,0.48640625,0.402751865671642,0.381045854271357,0.744362113402062,0.591472989949749,0.517067307692308,0.464863578680203,0.404601130653266,0.3486328125,0.657140544041451,0.56982421875,0.483581852791878,0.41636981865285,0.366161616161616,0.348792989417989),.Dim = c(6L, 6L), .Dimnames = list(c("0","0.111","0.25","0.429","0.667","1"),c("0.003","0.019","0.083","0.25","0.5","1")))
You could use the getVolume() function provided in the answer you linked, provided that your matrix is first converted to the required data.frame format.
Here is some code to make that dataframe:
df <- expand.grid(x = as.numeric(rownames(dat)), y = as.numeric(colnames(dat)))
df$z = as.vector(dat)
Then define the function and apply:
library(geometry)
getVolume <- function(df) {
  # find a triangular tessellation of the (x, y) grid
  res <- delaunayn(as.matrix(df[, -3]), full = TRUE, options = "Qz")
  # calculate the sum of the truncated-prism volumes
  sum(mapply(function(triPoints, A) A / 3 * sum(df[triPoints, "z"]),
             split.data.frame(res$tri, seq_along(res$areas)),
             res$areas))
}
getVolume(df)
[1] 0.4714882
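As a rough cross-check (an illustrative sketch, not from the linked answer): because the data lie on a regular grid, a nested trapezoidal rule in base R should give a similar value.
x <- as.numeric(rownames(dat))
y <- as.numeric(colnames(dat))
# integrate each row over y, then integrate those row integrals over x
row_int <- apply(dat, 1, function(z) sum(diff(y) * (head(z, -1) + tail(z, -1)) / 2))
sum(diff(x) * (head(row_int, -1) + tail(row_int, -1)) / 2)
# should land close to the Delaunay-based estimate above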

how to get total variance explained by each principal component

I have created a PCA plot using:
library(SNPRelate)
library(gdsfmt)
vcf.fn <- "input.vcf"
snpgdsVCF2GDS(vcf.fn, "test.gds", method="biallelic.only")
snpgdsSummary("test.gds")
genofile <- snpgdsOpen("test.gds")
pop_code <- read.gdsn(index.gdsn(genofile, "genotype"))
snpset <- snpgdsLDpruning(genofile, autosome.only=FALSE, ld.threshold=0.2, maf= 0.01, missing.rate=0.5)
snpset.id <- unlist(snpset)
pca <- snpgdsPCA(genofile, autosome.only=FALSE, snp.id=snpset.id, num.thread=2)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
tab <- data.frame(sample.id = pca$sample.id,
                  EV1 = pca$eigenvect[,1],    # the first eigenvector
                  EV2 = pca$eigenvect[,2],    # the second eigenvector
                  stringsAsFactors = FALSE)
plot(tab$EV2, tab$EV1, xlab="eigenvector 2", ylab="eigenvector 1")
PCA plot looks like:
[1]: https://i.stack.imgur.com/WBeKT.png
I have created a matrix with sample names (in rows) and the first five PCs (in columns):
sample.id EV1 EV2 EV3 EV4 EV5
1 T11 -0.007433146 -0.038371106 0.079585181 0.069839389 0.12178713
2 T3 -0.014198086 0.069641911 0.006414285 -0.004750456 0.046201258
3 T10 -0.086656303 0.026455731 -0.028758639 -0.015004286 -0.007497732
4 T162 -0.00520634 0.053996842 0.021754194 -0.004660844 0.006939661
5 T163 -0.020055447 0.027697494 -0.006933852 -0.058596466 0.028236645
I want to check how much variance is explained by each principal component. Thank you for your help!
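In snpgdsPCA, the varprop element already holds the proportion of variance explained by each component, so the pc.percent line above is exactly that. A minimal sketch (assuming the pca object from the code above) to view it as a labelled table:
pc.percent <- pca$varprop * 100        # percent of variance per PC
head(round(pc.percent, 2))
# labelled view of the first five components
data.frame(PC = paste0("PC", 1:5),
           percent_variance = round(pc.percent[1:5], 2))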

How can I calculate Cosine similarity between two strings vectors

I have two vectors of length 6 and I would like to get a number between 0 and 1.
a=c("HDa","2Pb","2","BxU","BuQ","Bve")
b=c("HCK","2Pb","2","09","F","G")
Can anyone explain what I should do?
Using the lsa package (see the package manual):
# create some files
library('lsa')
td = tempfile()
dir.create(td)
write( c("HDa","2Pb","2","BxU","BuQ","Bve"), file=paste(td, "D1", sep="/"))
write( c("HCK","2Pb","2","09","F","G"), file=paste(td, "D2", sep="/"))
# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)
EDIT: this is what the myMatrix object looks like:
myMatrix
#myMatrix
# docs
# terms D1 D2
# 2 1 1
# 2pb 1 1
# buq 1 0
# bve 1 0
# bxu 1 0
# hda 1 0
# 09 0 1
# f 0 1
# g 0 1
# hck 0 1
# Calculate cosine similarity
res <- lsa::cosine(myMatrix[,1], myMatrix[,2])
res
#0.3333
You need a dictionary of possible terms first, and then convert your vectors to binary vectors, with a 1 in the positions of the corresponding terms and 0 elsewhere. If you name the new vectors a2 and b2, you can compute cor(a2, b2) (strictly the Pearson correlation of the binary vectors, which is closely related to the cosine), but note that this value lies between -1 and 1. You could map it to [0, 1] with something like 0.5 * cor(a2, b2) + 0.5.
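For example, a minimal sketch of the binary-vector approach in base R (a2 and b2 are the illustrative names used above; here the cosine is computed directly rather than via cor):
a <- c("HDa","2Pb","2","BxU","BuQ","Bve")
b <- c("HCK","2Pb","2","09","F","G")
dict <- union(a, b)                                   # dictionary of all possible terms
a2 <- as.integer(dict %in% a)                         # 1 where the term occurs in a
b2 <- as.integer(dict %in% b)
sum(a2 * b2) / (sqrt(sum(a2^2)) * sqrt(sum(b2^2)))    # cosine similarity
# [1] 0.3333333   (two shared terms out of six in each vector)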
library(tm)
library(lsa)
CSString_vector <- c("Hi Hello", "Hello")
corp <- VCorpus(VectorSource(CSString_vector))
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf), weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])
This approach may scale better for larger data sets.
A more advanced form of embedding might give better results. Please check the following code:
it uses the Universal Sentence Encoder, a transformer-based model that generates sentence embeddings.
from absl import logging
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
    return model([input])

paragraph = [
    "Universal Sentence Encoder embeddings also support short paragraphs. ",
    "Universal Sentence Encoder support paragraphs"]
messages = [paragraph]
print(np.inner(embed(paragraph[0]), embed(paragraph[1])))

Appropriate data structure for paired data and extension of its functionality

The question has two parts.
Which data structure in R allows storing paired data such as:
0:0
0.5:10
1:20
(Python dictionary {[0]:0, [0.5]:10, [1]:20})
and how can it be initialized with a one-liner, i.e. coupling seq(0,1,by=0.5)
with seq(0,10,by=5) in this data structure?
Assume I add 0.25 to the list; then I want the weighted average of the neighbouring nodes to appear (automatically) in the data set, i.e. the element 0.25:5, so the paired set would be
0:0
0.25:5
0.5:10
1:20
If I add the element 0.3, then it must be paired with 5 + (10 - 5) * (0.3 - 0.25) / (0.5 - 0.25) = 6, and the element 0.3:6 is added.
How can I create a class, with S4 or Reference Classes, that provides this functionality?
Not really sure what you are getting at, but the hash package may have what you want:
library(hash)
h<-hash(keys=seq(0,1,by=0.5),values=seq(0,10,by=5))
h[['0.25']]<-2.5
That probably deals with the first part of your question; http://cran.r-project.org/web/packages/hash/hash.pdf may help with the second.
A similar construct with lists:
lst<-list()
lst<-seq(0,10,5)
names(lst)<-seq(0,1,0.5)
> lst['0.5']
0.5
5
lst['0.25']<-2.5
For your second part you could construct a simple function to update your hash/list with a new value, as sketched below.
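A minimal sketch of such a helper (illustrative only; it uses the named-vector version from above and base R's approx for the interpolation):
add_point <- function(lst, x_new) {
  x <- as.numeric(names(lst))
  lst[as.character(x_new)] <- approx(x, lst, xout = x_new)$y  # interpolate between neighbours
  lst[order(as.numeric(names(lst)))]                          # keep the keys in order
}
lst <- setNames(seq(0, 10, 5), seq(0, 1, 0.5))
lst <- add_point(lst, 0.25)    # adds "0.25" -> 2.5
lst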
A two-column data.frame seems appropriate:
xy <- data.frame(x = seq(0, 1, by = 0.5), y = seq(0, 20, by = 10))
xy
# x y
# 1 0.0 0
# 2 0.5 10
# 3 1.0 20
Then, what you are trying to do is linear interpolation, which you can achieve using the approx function. For example:
approx(xy$x, xy$y, xout = 0.3)
# $x
# [1] 0.3
#
# $y
# [1] 6
If you want to add that result to the data.frame, you can do something like:
xy <- as.data.frame(approx(xy$x, xy$y, xout = sort(c(xy$x, 0.3))))
xy
# x y
# 1 0.0 0
# 2 0.3 6
# 3 0.5 10
# 4 1.0 20
which is a bit expensive, especially if you plan to add points one at a time. You could instead add all your points at once since the result is independent of the order in which you add them:
add.points <- c(0.25, 0.3)
xy <- as.data.frame(approx(xy$x, xy$y, xout = sort(c(xy$x, add.points))))
xy
# x y
# 1 0.00 0
# 2 0.25 5
# 3 0.30 6
# 4 0.50 10
# 5 1.00 20
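For the second part of the question (wrapping this behaviour in a class), here is a hedged sketch using Reference Classes; the class name and method names are made up for illustration, and the interpolation reuses approx exactly as above:
# Illustrative Reference Class: stores the pairs in a data.frame and
# interpolates the y value whenever a new x is added.
PairedData <- setRefClass("PairedData",
  fields = list(xy = "data.frame"),
  methods = list(
    initialize = function(x = numeric(), y = numeric(), ...) {
      xy <<- data.frame(x = x, y = y)
      callSuper(...)
    },
    add = function(x_new) {
      y_new <- approx(xy$x, xy$y, xout = x_new)$y      # weighted average of the neighbours
      xy <<- rbind(xy, data.frame(x = x_new, y = y_new))
      xy <<- xy[order(xy$x), ]
    }
  )
)

pd <- PairedData$new(x = seq(0, 1, by = 0.5), y = seq(0, 20, by = 10))
pd$add(0.25)   # pairs 0.25 with 5
pd$add(0.3)    # pairs 0.3 with 6
pd$xy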
