Find what values were changed to after normalization - r

How can I see the original values post normalization? Or change them in the final output?
I want to change my final output back to their original values, or at least close to them, considering I aggregate and take the mean.
I have a dataset that has 10 columns and 5,000 rows. After cleaning up the data and selecting which columns and rows I want, I run a normalization code.
Then I run a kmeans and get my output. How can I see what the values were changed to after normalization? For example, if I have Regions 1, 2, 3, 4, and 5, and post normalization they change to 0.00, 0.25, 0.50, 0.75, and 1.00, is there a way to change them back to the originals in the kmeans output?
normalize = function(X) {
    return(abs((X - min(X))) / (max(X) - min(X)))
}
df_age_norm = as.data.frame(lapply(df_age, normalize))
clusters = kmeans(df_age_norm, 9)[['cluster']]
df_age_norm$clusters = clusters
df_age_norm = aggregate(df_age_norm[, 1:4], list(df_age_norm$clusters), FUN = mean)
Head of dataset before normalization
Age HHIncome Region MaritalStatus group
18 11000 5 0 1
18 11000 5 1 1
18 12000 2 0 1
18 12000 4 0 1
18 13000 1 0 1
Head of dataset after normalization
Age HHIncome Region MaritalStatus group
0 0.001879699 1.00 0 0
0 0.001879699 1.00 1 0
0 0.002819549 0.25 0 0
0 0.002819549 0.75 0 0
0 0.003759398 0.00 0 0

This solution is inspired by the base R function scale, which centers and scales a vector by subtracting the mean and dividing by the standard deviation of the vector x. These two values, mean(x) and sd(x), are returned as attributes.
x <- -4:5
y <- scale(x)
attributes(y)
#$dim
#[1] 10 1
#
#$`scaled:center`
#[1] 0.5
#
#$`scaled:scale`
#[1] 3.02765
I have therefore rewritten the function normalize to also set and return min(x) and max(x) as attributes. They will be used later to denormalize.
normalize <- function(X, na.rm = FALSE) {
    if (na.rm) X <- X[!is.na(X)]
    Min <- min(X)
    Max <- max(X)
    Y <- X - Min
    if (Min != Max) Y <- Y / (Max - Min)
    attr(Y, "scaled:min") <- Min
    attr(Y, "scaled:max") <- Max
    Y
}
denormalize <- function(X) {
    Min <- attr(X, "scaled:min")
    Max <- attr(X, "scaled:max")
    attr(X, "scaled:min") <- NULL
    attr(X, "scaled:max") <- NULL
    Y <- if (Min != Max) X * (Max - Min) else X
    Y <- Y + Min
    Y
}
df_age_norm <- as.data.frame(lapply(df_age, normalize))
df_age_2 <- as.data.frame(lapply(df_age_norm, denormalize))
df_age_2
# Age HHIncome Region MaritalStatus group
#1 18 11000 5 0 1
#2 18 11000 5 1 1
#3 18 12000 2 0 1
#4 18 12000 4 0 1
#5 18 13000 1 0 1
Data.
df_age <- read.table(text = "
Age HHIncome Region MaritalStatus group
18 11000 5 0 1
18 11000 5 1 1
18 12000 2 0 1
18 12000 4 0 1
18 13000 1 0 1
", header = TRUE)

Related

How to create a scorecard for a dataframe in R?

I'm trying to score the values relative to the scorecard (both shown below).
values <- data.frame(A = c(-200, -78, -100, 0, -30),
                     B = c(100, 0, -101, -199, -300),
                     C = c(-400, 400, 500, -500, 250),
                     D = c(NA, NA, -1000, -1000, -1000),
                     E = c(1000, 1000, 1, -1000, -2000))
scorecard <- data.frame(Names = c("A", "B", "C", "D", "E"),
                        "Score5" = c(-100, -200, -300, -400, -500),
                        "Score3" = c(-50, -100, -150, -200, -250),
                        "Score1" = c(-25, -50, -75, -100, -125))
values
A B C D E
1 -200 100 -400 NA 1000
2 -78 0 400 NA 1000
3 -100 -101 500 -1000 1
4 0 -199 -500 -1000 -1000
5 -30 -300 250 -1000 -2000
scorecard
Names Score5 Score3 Score1
1 A -100 -50 -25
2 B -200 -100 -50
3 C -300 -150 -75
4 D -400 -200 -100
5 E -500 -250 -125
For my scorecard, if the value:
- is < its respective Score5, it gets awarded 5
- is > its respective Score5 AND < Score3, but closer to Score5 than it is to Score3, it gets awarded 5
- is > its respective Score5 AND < Score3, but closer to Score3 than it is to Score5, it gets awarded 3
- is > its respective Score3 AND < Score1, but closer to Score3 than it is to Score1, it gets awarded 3
- is > its respective Score3 AND < Score1, but closer to Score1 than it is to Score3, it gets awarded 1
- all other values get 0
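The "closer to" comparisons reduce to checking the midpoint between adjacent thresholds. A small worked check of my own (not from the original post), using column A's thresholds Score5 = -100 and Score3 = -50:
# A value of -78 lies between them; the midpoint is (-100 + -50) / 2 = -75,
# and -78 < -75, so -78 is closer to Score5 and is awarded 5.
x <- -78
x < (-100 + -50) / 2  # TRUE -> award 5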
The desired result is shown in an image in the original post.
I've tried the following, which required the package xts (install.packages("xts")), but I didn't quite get there.
pointsfunction <- function(value) {
    points <- c()
    for (i in names) {
        index = which(colnames(value) == i)
        data_start <- which(!is.na(value))[1]
        points[1:(data_start - 1)] <- NA
        for (a in (data_start):(length(value))) {
            if (value[a] < scorecard[index, 2]) {
                points[a] <- -5
            } else {
                points[a] <- 0
            }
        }
    }
    points <- reclass(points, value)
    return(points)
}
scorecardpoints <- as.data.frame(lapply(values, pointsfunction))
I got the following error:
Error in if (value[a] < scorecard[index, 2]) { : argument is of length
zero Called from: FUN(X[[i]], ...)
Any ideas?
Here's a dplyr solution. We pivot to long format, join to the scorecard, do comparisons, and pivot the result back to wide. I added an ID column, but you can drop it at the end, if you like.
library(dplyr)
library(tidyr)
values %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = "Names") %>%
  left_join(scorecard) %>%
  mutate(
    result = case_when(
      value < (Score5 + Score3) / 2 ~ 5,
      value < (Score3 + Score1) / 2 ~ 3,
      value < Score1 ~ 1,
      is.na(value) ~ NA_real_,
      TRUE ~ 0
    )
  ) %>%
  pivot_wider(id_cols = id, names_from = Names, values_from = result)
# # A tibble: 5 x 6
# id A B C D E
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 5 0 5 NA 0
# 2 2 5 0 0 NA 0
# 3 3 5 3 0 5 0
# 4 4 0 5 5 5 5
# 5 5 0 5 0 5 5
The values shown in your example values object are not the same as the values in the data.frame you assign to values; e.g. look at the 5th value of A.
You could use a base R approach like this:
# Look up the scorecard values for a name from the scorecard data.frame
get_scorecard_values <- function(name, card) {
    as.numeric(card[card$Names == name, c(2, 3, 4)])
}
# translate scorecard values into breakpoints for scoring intervals
get_breaks <- function(x) {
    c((x[1] + x[2]) / 2, (x[2] + x[3]) / 2, x[3])
}
# the value to assign to each scoring interval
my_scores <- c(5, 3, 1, 0)
# given a vector of values, assign a score value to each based on
# the interval that it falls into
get_scores <- function(x, intervals, scores) {
    scores[findInterval(x, get_breaks(intervals)) + 1L]
}
# go across the list of names of variables of the values object.
# for each name, get the values and corresponding scorecard values
# and calculate the score values.
sapply(
    names(values),
    function(val, values, card, scores) {
        get_scores(
            x = values[[val]],
            intervals = get_scorecard_values(name = val, card = card),
            scores = scores
        )
    },
    values = values,
    card = scorecard,
    scores = my_scores
)
A B C D E
[1,] 5 0 5 NA 0
[2,] 5 0 0 NA 0
[3,] 5 3 0 5 0
[4,] 0 5 5 5 5
[5,] 0 5 0 5 5
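To see the interval-to-score mapping concretely, here is a quick illustrative check of my own (values chosen for illustration) using column A's scorecard row:
# breakpoints for A: the two midpoints, then Score1 itself
breaks_A <- get_breaks(c(-100, -50, -25))   # -75.0 -37.5 -25.0
# findInterval() returns 0 for values below the first break, so adding 1L
# indexes into my_scores = c(5, 3, 1, 0)
findInterval(c(-200, -78, -30, 0), breaks_A) + 1L
# [1] 1 1 3 4  -> scores 5, 5, 1, 0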
I used the data frame with A5 = -30. Here is a base R solution.
scoremat <- as.matrix(scorecard[, -1L])
dimnames(scoremat) <- list(scorecard$Names, names(scorecard)[-1L])
vscore <- function(x, nm, scoremat) {
    scores <- c("Score5" = 5, "Score3" = 3, "Score1" = 1)[dimnames(scoremat)[[2L]]]
    conds <- scoremat[rep(nm, length(x)), ]
    i <- as.integer(apply(abs(x - conds), 1L, which.min))
    unname(ifelse(x > conds[, "Score1"], 0, scores[i]))
}
dscore <- function(df, scoremat) {
    as.data.frame(vapply(
        names(df),
        function(nm, mat) vscore(df[[nm]], nm, mat),
        numeric(nrow(df)),
        scoremat
    ))
}
Output
> dscore(values, scoremat)
A B C D E
1 5 0 5 NA 0
2 5 0 0 NA 0
3 5 3 0 5 0
4 0 5 5 5 5
5 1 5 0 5 5
We first create a score matrix as follows
> scoremat
Score5 Score3 Score1
A -100 -50 -25
B -200 -100 -50
C -300 -150 -75
D -400 -200 -100
E -500 -250 -125
Note that your logic simplifies to the following, for any x in, for example, column A:
if x > -25 (i.e. scoremat["A", "Score1"]) then
    return 0
else
    calculate distance = abs(x - the values in row A of scoremat)
    return the score at the position of the minimum distance
That's basically how vscore works. First, match the scores:
scores <- c("Score5" = 5, "Score3" = 3, "Score1" = 1)[dimnames(scoremat)[[2L]]]
Then, match and repeat the row so that the conds matrix has the same number of rows as the length of vector x.
conds <- scoremat[rep(nm, length(x)), ]
Next, calculate abs(x - conds) and get where the minimum is for each row. For example,
let x = values$A
   x  | conds (Score5 Score3 Score1) | abs(x - conds) = distance | which.min = i
 -200 |   -100   -50   -25           |   100   150   175         |      1
 -150 |   -100   -50   -25           |    50   100   125         |      1
 -100 |   -100   -50   -25           |     0    50    75         |      1
    0 |   -100   -50   -25           |   100    50    25         |      3
  -30 |   -100   -50   -25           |    70    20     5         |      3
Use as.integer to convert non-matches (which happen when there are NA values in x) into NA values.
i <- as.integer(apply(abs(x - conds), 1L, which.min))
Finally, return the results based on the logic shown above
unname(ifelse(x > conds[, "Score1"], 0, scores[i]))

Create a boolean column based on data in another column in R

I have a data set and would like to do two things:
Set certain row values in Col A to 0 based on values in Col B
Create a new column with values of either 0 or 1 based on the edited values in Col A
My current approach is shown below; the issue is that I occasionally get an error:
Error in `[<-.data.frame`(`*tmp*`, "OCS_dose", value = 0) :
replacement has 1 row, data has 0
This happens because the numbers I am generating are randomly selected, and on certain trials there are no rows to update in Col A based on the numbers in Col B.
Here is an example of my code that causes the error:
pbo_IFNlow_data[pbo_IFNlow_data$OCS_status == 0,]['OCS_dose'] <- 0
OCS_status is either a 0 or 1 that is generated using:
pbo_OCS_status_low <- sample(c(0, 1), replace = TRUE,
                             size = pbo_n_IFNlow, prob = c(1 - 0.863, 0.863))
Therefore on occasion, I have no 0's... In my mind R should then just not try to update anything.
Is there a better way to do what I am trying to do?
Here is a more complete segment of my code:
pbo_OCS_status_low <- sample(c(0,1), replace = TRUE, size = pbo_n_IFNlow, prob=c(1-.863, 0.863)) #on OCS = 1
#OCS dose
pbo_OCS_dose_low <- rtruncnorm(pbo_n_IFNlow, a=0, b=Inf, mean=12.8, sd=8.1)
#IFN boolean flag
pbo_IFN_low <- rep(0, pbo_n_IFNlow)
#SLEDAI score
pbo_SLEDAI_low <- rtruncnorm(pbo_n_IFNlow, a=0, b=Inf, mean=11.1, sd=4.4)
#Response criteria met for SRI score reduction
pbo_SRI_low <- sample(c(0,1), replace = TRUE, size = pbo_n_IFNlow, prob=c(1-0.423, 0.423))
pbo_IFNlow_data <- cbind(IFN_status = pbo_IFN_low,
                         OCS_status = pbo_OCS_status_low,
                         OCS_dose = pbo_OCS_dose_low,
                         SLEDAI = pbo_SLEDAI_low,
                         SRI_response = pbo_SRI_low)
pbo_IFNlow_data <- data.frame(pbo_IFNlow_data)
#set those off OCS to 0
pbo_IFNlow_data[pbo_IFNlow_data$OCS_status == 0,]['OCS_dose'] <- 0
#stratification factor for OCS dosage
pbo_IFNlow_data$OCS_lessthan10 <- "temp"
pbo_IFNlow_data[pbo_IFNlow_data$OCS_dose < 10, ]['OCS_lessthan10'] <- 1
pbo_IFNlow_data[pbo_IFNlow_data$OCS_dose >= 10, ]['OCS_lessthan10'] <- 0
#stratification factor for SLE score
pbo_IFNlow_data$SLE_lessthan10 <- "temp"
pbo_IFNlow_data[pbo_IFNlow_data$SLEDAI < 10, ]['SLE_lessthan10'] <- 1
pbo_IFNlow_data[pbo_IFNlow_data$SLEDAI >= 10, ]['SLE_lessthan10'] <- 0
It would be easier if we had a minimal reproducible example. If I understand your question correctly, you may want to try the ifelse statement in R:
df <- data.frame(colA = seq(1, 10), colB = seq(11, 20))
# Set certain row values in Col A to 0 based on values in Col B
df$colA <- ifelse(df$colB > 15, 0, df$colB)
# Create a new column with values of either 0
# or 1 based on the edited values in Col A
df$colC <- ifelse(df$colA == 0, 1, 0)
print(df)
## colA colB colC
## 1 11 11 0
## 2 12 12 0
## 3 13 13 0
## 4 14 14 0
## 5 15 15 0
## 6 0 16 1
## 7 0 17 1
## 8 0 18 1
## 9 0 19 1
## 10 0 20 1
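For what it's worth, the original zero-match error can also be avoided by assigning into the column vector directly rather than into a row subset of the data frame; logical indexing of a vector with zero TRUE values is simply a no-op. A hedged sketch using the question's variable names:
# no-op when no rows match, unlike the data.frame row/column form above
pbo_IFNlow_data$OCS_dose[pbo_IFNlow_data$OCS_status == 0] <- 0
# the boolean flags can then be built without "temp" placeholders
pbo_IFNlow_data$OCS_lessthan10 <- as.integer(pbo_IFNlow_data$OCS_dose < 10)
pbo_IFNlow_data$SLE_lessthan10 <- as.integer(pbo_IFNlow_data$SLEDAI < 10)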

Counting values within levels

I have a set of levels in R that I generate with cut, e.g. say fractional values between 0 and 1, broken down into 0.1 bins:
> frac <- cut(c(0, 1), breaks=10)
> levels(frac)
[1] "(-0.001,0.1]" "(0.1,0.2]" "(0.2,0.3]" "(0.3,0.4]" "(0.4,0.5]"
[6] "(0.5,0.6]" "(0.6,0.7]" "(0.7,0.8]" "(0.8,0.9]" "(0.9,1]"
Given a vector v containing continuous values between [0.0, 1.0], how do I count the frequency of elements in v that fall within each level in levels(frac)?
I could customize the number of breaks and/or the interval from which I am making levels, so I'm looking for a way to do this with standard R commands, so that I can build a two-column data frame: one column for the levels as factors, and the second column for a fractional or percentage value of total elements in v over the level.
Note: The following does not work:
> table(frac)
frac
(-0.001,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
1 0 0 0 0 0
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
0 0 0 1
If I use cut on v directly, then I do not get the same levels when I run cut on different vectors, because the range of values (their minimum and maximum) differs between arbitrary vectors; even with the same number of breaks, the level intervals will not be the same.
My goal is to take different vectors and bin them to the same set of levels. Hopefully this helps clarify my question. Thanks for any assistance.
Amend frac to actually represent your desired intervals, and then use the table function:
x = runif(100) # For example.
frac = cut(x, breaks = seq(0, 1, 0.1))
table(frac)
Result:
frac
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8]
14 9 8 10 8 12 7 7
(0.8,0.9] (0.9,1]
16 9
Introduce extremes c(0, 1) to v then use the same cut:
library(dplyr)
# dummy data
set.seed(1)
v <- round(runif(7), 2)
# result
data.frame(v,
           vFrac = cut(c(0, 1, v), breaks = 10)[-c(1, 2)]) %>%
    group_by(vFrac) %>%
    mutate(vFreq = n())
# Source: local data frame [10 x 3]
# Groups: vFrac [8]
#
# v vFrac vFreq
# <dbl> <fctr> <int>
# 1 0.27 (0.2,0.3] 1
# 2 0.37 (0.3,0.4] 1
# 3 0.57 (0.5,0.6] 1
# 4 0.91 (0.9,1] 2
# 5 0.20 (0.1,0.2] 1
# 6 0.90 (0.8,0.9] 1
# 7 0.94 (0.9,1] 2
frac = seq(0,1,by=0.1)
ranges = paste(head(frac,-1), frac[-1], sep=" - ")
freq = hist(v, breaks=frac, include.lowest=TRUE, plot=FALSE)
data.frame(range = ranges, frequency = freq$counts)
Use findInterval instead of cut:
v <- data.frame(v = runif(100, 0, 1))
library(plyr)
v$x <- findInterval(v$v, seq(0, 1, by = 0.1)) * 0.1
ddply(v, .(x), summarize, n = length(x))
frac = seq(0, 1, 0.1)
set.seed(42); v = rnorm(10, 0.5, 0.2)
sapply(1:(length(frac)-1), function(i) sum(frac[i]<v & frac[i+1]>=v))
#[1] 0 0 0 1 3 2 1 1 1 1
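Putting this together into the two-column data frame the question asks for, here is a minimal sketch with fixed breaks, so different vectors always share the same levels:
breaks <- seq(0, 1, by = 0.1)
v <- runif(100)
frac <- cut(v, breaks = breaks, include.lowest = TRUE)
# one row per level, with the fraction of elements of v in that level
data.frame(level = levels(frac),
           fraction = as.numeric(table(frac)) / length(v))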

R: Find the Variance of all Non-Zero Elements in Each Row

I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and to return NA where this is not possible.
So for instance, in this example the variance for ID 1 would be var(20, 25),
for ID 2 it would return NA because you can't calculate a variance on just one entry, for ID 3 the var would be var(15, 32, 16), and for ID 4 it would again return NA because it has no numbers at all to calculate a variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len <- nrow(d)
variances <- numeric(len)
for (i in 1:len) {
    # get all nonzero values in the ith row of d into a vector nonzerodat here
    currentvar <- var(nonzerodat)
    variances[i] <- currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
    melt(., id.var = "id") %>%
    group_by(id) %>%
    summarise(variance = var(value[value != 0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA
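An equivalent base R formulation that scales just as well to 40+ value columns is to mask the zeros as NA once and let var(..., na.rm = TRUE) handle the rest; a small sketch on the Data object from the first answer:
m <- as.matrix(Data[, -1])    # drop the ID column
m[m == 0] <- NA               # treat zeros as missing
apply(m, 1, var, na.rm = TRUE)
# [1] 12.5   NA 91.0   NA    (NA wherever fewer than two non-zero values remain)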

Preparing trimmed means of multiple data files for repeated measures ANOVA

I have multiple data files (in tab separated txt files) formatted like so:
I made some example files:
https://docs.google.com/file/d/0B20HmmYd0lsFVGhTQ0EzRFFmYXc/edit?usp=sharing
https://docs.google.com/file/d/0B20HmmYd0lsFbWxmQzV6X0o2Y1E/edit?usp=sharing
Condition  Block  Session  Stimuli  Score  Reqrespons  Act  RT    extra
X          3      3        asdfa    1      a           a    500   0
Y          1      2        qewrq    0      b           a    1100  0
I want to exclude outlying RTs and perform an ANOVA on the means of RT and Score across the files (with factor Condition). So far, I've done this in an extremely ugly fashion, with rows by subject (I'd prefer to format it with rows as subject x condition).
My current attempt uses a for loop :
all_data <- data.frame(rbind(1:27))  # make empty data.frame
all_data
for (i in 1:2) {
    n <- paste(i, ".txt", sep = "")
    a <- sprintf("table%d", i)
    data <- read.table(toString(n), header = TRUE, sep = "\t")
    # fill out cols 1:9 with scores 1-9
    Score <- as.vector(tapply(data$Score, list(data$Condition, data$Reqresponse), mean))
    for (o in 1:9) {
        all_data[i, o] <- Score[o]
    }
    # then trim the RT values the way I want to and put them in cols 10 onward of all_data
    data <- data[which(data$RT > 200), ]
    data <- do.call(rbind, by(data, data$Condition, function(x) x[!abs(scale(x$RT)) > 3, ]))
    RT <- as.vector(tapply(data$RT, list(data$Condition, data$Reqresponse, data$Score), mean))
    for (j in 1:18) {
        all_data[i, j + 9] <- RT[j]
    }
}
Also, this code must be aesthetically offensive to anyone decent in R; please tell me how to fix that up if you like.
I would do this using ddply from the plyr package. For example:
require(plyr)
res <- lapply(list.files(pattern = '^[1-2].txt'), function(ff) {
    ## read the file
    data <- read.table(ff, header = TRUE, quote = "\"")
    ## remove the outliers
    data <- data[data$RT > 200, ]
    data <- ddply(data, .(Condition), function(x) x[!abs(scale(x$RT)) > 3, ])
    ## compute the means
    ddply(data, .(Condition, Reqresponse, Score), summarise, RT = mean(RT))
})
[[1]]
Condition Reqresponse Score RT
1 X a 0 500
2 X a 1 750
3 X b 0 500
4 X b 1 500
5 Y a 0 400
6 Y a 1 640
7 Y b 1 1000
8 Z a 0 1000
9 Z a 1 1675
10 Z b 0 400
[[2]]
Condition Reqresponse Score RT
1 X a 0 500
2 X a 1 750
3 X b 0 500
4 X b 1 500
5 Y a 0 400
6 Y a 1 640
7 Y b 1 1000
8 Z a 0 1000
9 Z a 1 1675
10 Z b 0 400
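res is a list with one per-file summary; for a repeated-measures ANOVA you still need a subject column and a single stacked data frame. A hedged sketch of that final step (assuming one file per subject), giving rows as subject x condition as the question prefers:
# tag each file's summary with a subject id and stack them
res <- Map(function(d, id) transform(d, subject = factor(id)), res, seq_along(res))
long <- do.call(rbind, res)
# a repeated-measures ANOVA on mean RT with factor Condition could then be
# (collapse over Reqresponse/Score first if you want one mean per cell):
summary(aov(RT ~ Condition + Error(subject/Condition), data = long))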
