I have the following sample code that builds one data frame containing information for more than one ID.
I want to classify each ID into defined categories: at a specific, given time (e.g. t = 10 here), I want to compute the percentage change with respect to the baseline value and return the category that this change falls into.
I have explained the detailed steps of my calculation below.
a=c(100,105,126,130,150,100,90,76,51,40)
t=c(0,5,10,20,30)
t=rep(t,2)
ID=c(1,1,1,1,1,2,2,2,2,2)
data=data.frame(ID,t,a)
My desired calculation:
1) For every ID, the "a" value at t = 0 is the baseline.
2) Computation: at the given t (e.g. t = 10, which has to be provided), take the corresponding "a" value and compute
%Change (answer) = (taken a value - baseline) / baseline
3) Compare the answer against the following defined categories:
#category
1 - if answer > 0.25
2 - if -0.30 < answer < 0.25
3 - if -1.0 < answer < -0.30
4 - if answer == -1.0
4) Return the value of the category.
Desired Output
ID My_Answer
1 1
2 3
Can anybody help me with this? I understand the flow of my computation, but I am not aware of an efficient way of doing it, since I have many IDs in that data frame.
Thank you.
It's easier to do math with columns than with rows, so the first step is to join the baseline values back in as their own column, then use cut to define the groups:
library(dplyr)
library(tidyr)

foo <- data %>%
  filter(t == 0) %>%                      # baseline rows (t = 0)
  left_join(data %>% filter(t != 0),      # join the follow-up rows back on
            by = "ID") %>%                # baseline = .x, follow-up = .y
  mutate(percentchange = (a.y - a.x) / a.x,
         My_Answer = cut(percentchange,
                         breaks = c(-1, -0.3, 0.25, Inf),
                         right = FALSE, include.lowest = TRUE,
                         labels = c("g3", "g2", "g1")),
         My_Answer = as.character(My_Answer),
         My_Answer = ifelse(percentchange == -1, "g4", My_Answer))
foo
ID t.x a.x t.y a.y percentchange My_Answer
1 1 0 100 5 105 0.05 g2
2 1 0 100 10 126 0.26 g1
3 1 0 100 20 130 0.30 g1
4 1 0 100 30 150 0.50 g1
5 2 0 100 5 90 -0.10 g2
6 2 0 100 10 76 -0.24 g2
7 2 0 100 20 51 -0.49 g3
8 2 0 100 30 40 -0.60 g3
You can see that this calculates My_Answer for all time points at once. If you only want the values for t == 10, you can pull out those rows and keep just the columns of interest:
foo %>%
  filter(t.y == 10) %>%
  select(ID, t = t.y, My_Answer)
ID t My_Answer
1 1 10 g1
2 2 10 g2
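For comparison, here is a sketch (not part of the answer above) of the same computation using a per-ID group_by() instead of a self-join, mapped straight onto the question's numeric categories. It assumes every ID has exactly one baseline row at t == 0.
# Sketch: compute the baseline per ID with group_by(), return the category as a number (1-4)
library(dplyr)

data %>%
  group_by(ID) %>%
  mutate(percentchange = (a - a[t == 0]) / a[t == 0]) %>%
  filter(t == 10) %>%                      # the "given" time point
  mutate(My_Answer = case_when(
    percentchange >  0.25 ~ 1,
    percentchange > -0.30 ~ 2,
    percentchange > -1.00 ~ 3,
    percentchange == -1.0 ~ 4
  )) %>%
  select(ID, My_Answer)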
I have a matrix that is 10 rows by 4 columns. Each row represents a user, and each column a measurement. Some users only have one measurement, while others may have the full 4 measurements.
The goals I want to accomplish with this matrix are threefold:
To subtract the user's measurements from their own measurements (across columns);
To subtract the user's measurement from other user's measurement points (all included, across rows);
To create a final matrix that counts the number of "matches" (comparisons) each user has against themselves and others.
Using a threshold of 2.0 units, I have tried to compare each user's measurements against their own and against other users' measurements by computing the differences in a nested for-loop.
Below is an example of what the clean_data matrix looks like, and this matrix was used for all three goals:
M1 M2 M3 M4
U1 148.2 148.4 155.6 155.7
U2 149.5 150.1 150.1 153.9
U3 148.4 154.2 NA NA
U4 154.5 NA NA NA
U5 151.1 156.9 157.1 NA
For Goal #3, the output should look something akin to this matrix:
U1 U2 U3 U4 U5
U1 2 8 4 2 3
U2 8 3 2 1 4
U3 4 2 0 1 0
U4 2 1 1 0 0
U5 3 4 0 0 1
For example: User 1 has 2 matches with themselves because, among the pairwise differences of their 4 measurements, 2 were less than 2.0 units. User 1 also has 8 matches with User 2: each of User 1's measurements was subtracted from each of User 2's measurements (taking the absolute value), and differences below 2.0 were counted as a "match."
I have tried the following nested for-loop; however, I believe it is only counting the number of non-missing elements in my matrix instead of counting the differences below the threshold.
# Set the time_threshold.
time_threshold <- 2.000

# Create an empty matrix with the same dimensions as the number of users present.
matrix_a <- matrix(nrow = nrow(clean_data), ncol = nrow(clean_data))

# Use a nested for-loop to calculate the intra-user and inter-user
# time differences, adding up values below the threshold for each
# user comparison.
for (i in 1:nrow(clean_data)) {
  for (j in 1:nrow(clean_data)) {
    matrix_a[i, j] <-
      round(sum(!is.na(abs((clean_data[i, 2:dim(clean_data)[2]]) -
                           (clean_data[j, 2:dim(clean_data)[2]])
                           ) <= time_threshold)) / 2)
  }
}
# Dividing by 2 and rounding has shown that this code only counts the
# number of entries that are not NA, not the values below time_threshold (2.000).
Is there a way to calculate the differences I outlined above that is also more efficient than a nested for-loop?
Note: the structure of these data matters only insofar as differences can be calculated for individuals across rows and columns. Missing values in this example are represented as NA and should not be included in the calculation. Alternatively, I have set them to -0.01, which still did not change the outcome of my for-loop.
You could write a function to do the loop for you:
fun <- function(index, dat){
  i <- index[1]
  j <- index[2]
  # within a user: all pairwise differences; between users: all cross differences
  m <- if (i == j) combn(dat[i, ], 2, function(x) diff(x))
       else do.call("-", expand.grid(dat[i, ], dat[j, ]))
  sum(abs(m) < 2, na.rm = TRUE)
}

dist_fun <- function(dat){
  dat <- as.matrix(dat)
  result <- diag(0, nrow(dat))
  # fill the lower triangle (including the diagonal), then mirror it
  mat_index <- which(lower.tri(result, TRUE), TRUE)
  result[mat_index] <- apply(mat_index, 1, fun, dat = dat)
  result[mat_index[, 2:1]] <- result[mat_index]
  result
}

dist_fun(df)   # df holds just the numeric measurement columns, one row per user
[,1] [,2] [,3] [,4] [,5]
[1,] 2 8 4 2 4
[2,] 8 3 4 1 3
[3,] 4 4 0 1 0
[4,] 2 1 1 0 0
[5,] 4 3 0 0 1
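As a quick sanity check (a sketch, not part of the answer), the count for a single pair of users is just the number of absolute pairwise differences below the threshold; for U1 vs U2 this reproduces the 8 matches described in the question:
# Spot-check of the U1-vs-U2 cell using the values from the question's table
u1 <- c(148.2, 148.4, 155.6, 155.7)
u2 <- c(149.5, 150.1, 150.1, 153.9)
sum(abs(outer(u1, u2, "-")) < 2, na.rm = TRUE)
# [1] 8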
Here's one tidyverse approach. I convert the data to longer format, then join it to itself twice: once by User, to compare each user's measurements against their own, and once as a full cross-join, to compare every user against every other user. Each join counts the number of matches; I then combine the two sets of counts and convert back to wide format.
library(tidyverse)

my_data2 <- my_data %>% pivot_longer(-User)

# intra-user comparisons: join each user's measurements to their own
left_join(my_data2, my_data2, by = "User") %>%
  filter(name.x < name.y, abs(value.y - value.x) <= 2) %>%   # EDIT
  count(User) %>%
  select(User.x = User, User.y = User, n) -> compare_across

# inter-user comparisons: cross-join every measurement against every other
my_data3 <- my_data2 %>% mutate(dummy = 1)                   # EDIT
inner_join(my_data3, my_data3, by = "dummy") %>%             # EDIT
  filter(abs(value.x - value.y) <= 2, User.x != User.y) %>%
  count(User.x, User.y) -> compare_down

bind_rows(compare_across, compare_down) %>%
  arrange(User.x, User.y) %>%
  pivot_wider(names_from = User.y, values_from = n, values_fill = list(n = 0))
# A tibble: 5 x 6
User.x U1 U2 U3 U4 U5
<chr> <int> <int> <int> <int> <int>
1 U1 2 8 4 2 4
2 U2 8 3 4 1 3
3 U3 4 4 0 1 0
4 U4 2 1 1 0 0
5 U5 4 3 0 0 1
source data:
my_data <- data.frame(
  stringsAsFactors = FALSE,
  User = c("U1", "U2", "U3", "U4", "U5"),
  M1 = c(148.2, 149.5, 148.4, 154.5, 151.1),
  M2 = c(148.4, 150.1, 154.2, NA, 156.9),
  M3 = c(155.6, 150.1, NA, NA, 157.1),
  M4 = c(155.7, 153.9, NA, NA, NA)
)
How can I see the original values post normalization? Or change them in the final output?
I want to change my final output back to their original values, or at least close to it, considering I aggregate and take the mean.
I have a dataset that has 10 columns and 5,000 rows. After cleaning up the data and selecting which columns and rows I want, I run a normalization code.
Then I run kmeans and get my output. How can I see what the values were changed to after normalization? For example, if I have Region 1, 2, 3, 4, and 5, post-normalization these become 0.00, 0.25, 0.5, 0.75, and 1. Is there a way to change them back to the original values in the kmeans output?
normalize = function(X) {
  return(abs((X - min(X))) / (max(X) - min(X)))
}

df_age_norm = as.data.frame(lapply(df_age, normalize))

clusters = kmeans(df_age_norm, 9)[['cluster']]
df_age_norm$clusters = clusters

df_age_norm = aggregate(df_age_norm[, 1:4],
                        list(df_age_norm$clusters),
                        FUN = mean)
Head of dataset before normalization
Age HHIncome Region MaritalStatus group
18 11000 5 0 1
18 11000 5 1 1
18 12000 2 0 1
18 12000 4 0 1
18 13000 1 0 1
Head of dataset after normalization
Age HHIncome Region MaritalStatus group
0 0.001879699 1.00 0 0
0 0.001879699 1.00 1 0
0 0.002819549 0.25 0 0
0 0.002819549 0.75 0 0
0 0.003759398 0.00 0 0
This solution is inspired by the base R function scale, which centers and scales a vector x by subtracting its mean and dividing by its standard deviation. These two values, mean(x) and sd(x), are returned as attributes.
x <- -4:5
y <- scale(x)
attributes(y)
#$dim
#[1] 10 1
#
#$`scaled:center`
#[1] 0.5
#
#$`scaled:scale`
#[1] 3.02765
I have therefore rewritten the function normalize to also set and return min(x) and max(x) as attributes; they are used later to denormalize.
normalize <- function(X, na.rm = FALSE) {
  if (na.rm) X <- X[!is.na(X)]
  Min <- min(X)
  Max <- max(X)
  Y <- X - Min
  if (Min != Max) Y <- Y / (Max - Min)
  attr(Y, "scaled:min") <- Min
  attr(Y, "scaled:max") <- Max
  Y
}

denormalize <- function(X) {
  Min <- attr(X, "scaled:min")
  Max <- attr(X, "scaled:max")
  attr(X, "scaled:min") <- NULL
  attr(X, "scaled:max") <- NULL
  Y <- if (Min != Max) X * (Max - Min) else X
  Y <- Y + Min
  Y
}
df_age_norm <- as.data.frame(lapply(df_age, normalize))
df_age_2 <- as.data.frame(lapply(df_age_norm, denormalize))
df_age_2
# Age HHIncome Region MaritalStatus group
#1 18 11000 5 0 1
#2 18 11000 5 1 1
#3 18 12000 2 0 1
#4 18 12000 4 0 1
#5 18 13000 1 0 1
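If what you ultimately want is the aggregated per-cluster means reported in the original units, the same attributes can be reused, because the normalization is linear (mean(a*x + b) = a*mean(x) + b). The following is only a sketch; the name centers_norm is illustrative and is assumed to hold the per-cluster means of the normalized measurement columns, as produced by the question's aggregate() call:
# Sketch: map per-cluster means of normalized columns back to the original units.
denorm_means <- function(norm_means, norm_col) {
  Min <- attr(norm_col, "scaled:min")
  Max <- attr(norm_col, "scaled:max")
  norm_means * (Max - Min) + Min
}

# centers_norm is assumed to have the same measurement columns as df_age_norm,
# one row per cluster; match the columns by name and rescale each one.
centers_orig <- as.data.frame(
  Map(denorm_means, centers_norm, df_age_norm[names(centers_norm)])
)
centers_orig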
Data.
df_age <- read.table(text = "
Age HHIncome Region MaritalStatus group
18 11000 5 0 1
18 11000 5 1 1
18 12000 2 0 1
18 12000 4 0 1
18 13000 1 0 1
", header = TRUE)
I have a set of levels in R that I generate with cut, e.g. say fractional values between 0 and 1, broken down into 0.1 bins:
> frac <- cut(c(0, 1), breaks=10)
> levels(frac)
[1] "(-0.001,0.1]" "(0.1,0.2]" "(0.2,0.3]" "(0.3,0.4]" "(0.4,0.5]"
[6] "(0.5,0.6]" "(0.6,0.7]" "(0.7,0.8]" "(0.8,0.9]" "(0.9,1]"
Given a vector v containing continuous values between [0.0, 1.0], how do I count the frequency of elements in v that fall within each level in levels(frac)?
I could customize the number of breaks and/or the interval from which I am making levels, so I'm looking for a way to do this with standard R commands, so that I can build a two-column data frame: one column for the levels as factors, and the second column for a fractional or percentage value of total elements in v over the level.
Note: The following does not work:
> table(frac)
frac
(-0.001,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
1 0 0 0 0 0
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
0 0 0 1
If I use cut on v directly, then I do not get the same levels when I run cut on different vectors, because the range of values — their minimum and maximum — is going to be different between arbitrary vectors, and so while I may have the same number of breaks, the level intervals will not be the same.
My goal is to take different vectors and bin them to the same set of levels. Hopefully this helps clarify my question. Thanks for any assistance.
Amend frac to actually represent your desired intervals, and then use the table function:
x = runif(100) # For example.
frac = cut(x, breaks = seq(0, 1, 0.1))
table(frac)
Result:
frac
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8]
14 9 8 10 8 12 7 7
(0.8,0.9] (0.9,1]
16 9
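To get the two-column data frame of levels and fractions described in the question, the counts can be converted with prop.table (a small sketch building on the table above):
# Sketch: one column for the levels, one for the fraction of elements per level
tab <- table(frac)
data.frame(level = names(tab), fraction = as.numeric(prop.table(tab)))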
Prepend the extremes c(0, 1) to v so that cut produces the same levels for any input, then drop them from the result:
library(dplyr)
#dummy data
set.seed(1)
v <- round(runif(7), 2)
#result
data.frame(v,
           vFrac = cut(c(0, 1, v), breaks = 10)[-c(1, 2)]) %>%
  group_by(vFrac) %>%
  mutate(vFreq = n())
# Source: local data frame [7 x 3]
# Groups: vFrac [6]
#
# v vFrac vFreq
# <dbl> <fctr> <int>
# 1 0.27 (0.2,0.3] 1
# 2 0.37 (0.3,0.4] 1
# 3 0.57 (0.5,0.6] 1
# 4 0.91 (0.9,1] 2
# 5 0.20 (0.1,0.2] 1
# 6 0.90 (0.8,0.9] 1
# 7 0.94 (0.9,1] 2
frac = seq(0, 1, by = 0.1)
ranges = paste(head(frac, -1), frac[-1], sep = " - ")               # labels like "0 - 0.1"
freq = hist(v, breaks = frac, include.lowest = TRUE, plot = FALSE)  # counts per bin, no plot
data.frame(range = ranges, frequency = freq$counts)
Use findInterval instead of cut:
v <- data.frame(v = runif(100, 0, 1))
library(plyr)
v$x <- findInterval(v$v, seq(0, 1, by = 0.1)) * 0.1
ddply(v, .(x), summarize, n = length(x))
frac = seq(0, 1, 0.1)
set.seed(42); v = rnorm(10, 0.5, 0.2)
sapply(1:(length(frac)-1), function(i) sum(frac[i]<v & frac[i+1]>=v))
#[1] 0 0 0 1 3 2 1 1 1 1
I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and to return NA where this is not possible.
So, for instance, in this example the variance for ID 1 would be var(20, 25);
for ID 2 it would return NA because you can't calculate a variance from just one entry; for ID 3 the variance would be var(15, 32, 16); and for ID 4 it would again return NA because there are no non-zero values to calculate a variance on.
How would I go about this? I currently have the following loop-based code, but this might not be the best way to go about it:
len = nrow(d)
variances = numeric(len)
for (i in 1:len) {
  # get all non-zero values in the ith row of the data (excluding the ID column)
  nonzerodat = as.numeric(d[i, -1])
  nonzerodat = nonzerodat[nonzerodat != 0]
  currentvar = var(nonzerodat)
  variances[i] = currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
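If you want to keep the IDs alongside the result, the same apply() call can simply be assigned back as a new column (a small sketch on top of the answer above):
# Sketch: attach the per-row variance to the original data frame
Data$variance <- apply(Data[, -1], 1, var_nonzero)
Data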
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)

variances <- df %>%
  melt(., id.var = "id") %>%
  group_by(id) %>%
  summarise(variance = var(value[value != 0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA
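With newer dplyr (version 1.0 or later), the same per-row variance can also be written with rowwise() and c_across(); this is a sketch, not taken from either answer, using the toy df above:
# Sketch: row-wise variance of the non-zero values in columns X1 to X3
library(dplyr)

df %>%
  rowwise() %>%
  mutate(variance = {
    vals <- c_across(X1:X3)
    var(vals[vals != 0])
  }) %>%
  ungroup()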
I would like to add one new row for each of the subjects in my dataframe, which looks something like this:
Subject = c("1","5","10")
time = c("2", "2.25", "2.5")
value = c("3", "17", "9")
DF <- data.frame(Subject, time, value)
Subject time value
1 1 2 3
2 5 2.25 17
3 10 2.5 9
I want to add a new row for each subject with a time = 0 and value = 0, giving this:
Subject = c("1","1","5","5","10","10")
time = c("0","2","0", "2.25","0", "2.5")
value = c("0","3","0", "17","0", "9")
DF2 <- data.frame(Subject, time, value)
Subject time value
1 1 0 0
2 1 2 3
3 5 0 0
4 5 2.25 17
5 10 0 0
6 10 2.5 9
I have a lot of subjects with a lot of gaps in their subject numbers, and want to do this for all of them in a reasonable way. Any suggestions?
Thank you in advance.
Sincerely,
ykl
I would just rbind in the new values (not sure why you specified all your values as character values; here I changed them to numeric):
DF <- data.frame(
  Subject = c(1, 5, 10),
  time = c(2, 2.25, 2.5),
  value = c(3, 17, 9)
)

DF2 <- rbind(
  DF,
  data.frame(Subject = unique(DF$Subject), time = 0, value = 0)
)
This puts the new rows at the bottom, but you can re-sort if you like:
DF2[order(DF2$Subject, DF2$time), ]
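If you would rather start from the question's original character version of DF, one option (a sketch, not part of this answer) is to coerce the columns to numeric first and then apply the same rbind:
# Sketch: convert the question's character/factor columns to numeric
DF[] <- lapply(DF, function(x) as.numeric(as.character(x)))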
You can also use interleave from the "gdata" package:
library(gdata)
interleave(DF, data.frame(Subject = 0, time = 0, value = 0))
# Subject time value
# 1 1 2.00 3
# 11 0 0.00 0
# 2 5 2.25 17
# 1.1 0 0.00 0
# 3 10 2.50 9
# 1.2 0 0.00 0
This uses #MrFlick's sample data.
DF <- data.frame(
Subject = c(1,5,10),
time = c(2, 2.25, 2.5),
value = c(3, 17, 9)
)