fast, efficient way to loop over millions of rows and match columns

I'm working with eye tracking data right now, so have a HUGE dataset (think millions of rows) and so would like a fast way to do this task. Here's a simplified version of it.
The data tells you where the eye is looking at each time point, for each file we are looking at. X1, Y1 are the coordinates of the point being looked at. There are multiple time points for each file (representing the eye looking at different locations in the file over time).
Filename Time X1 Y1
1 1 10 10
1 2 12 10
I also have a file of where items are located for each filename. Each file contains (in this simplified case) two objects. X1,Y1 are the lower left coordinates and X2, Y2 are the upper right. You can imagine this as giving the bounding box where the item is located in each file. E.g.
Filename Item X1 Y1 X2 Y2
1 Dog 11 10 20 20
What I'd like to do is add another column to the first data frame that tells me what object the person is looking at during each time for each file. If they are not looking at any of the objects, I'd like the column to say "none". Points on the border count as being looked at. E.g.
Filename Time X1 Y1 LookingAt
1 1 10 10 none
1 2 12 10 Dog
I know how to do this the for loop way, but it takes forever (and crashed my RStudio). I'm wondering if there might be a faster, more efficient way I'm missing.
Here's the dput for the first dataframe (these contain more rows than the example I showed above):
structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), Time = structure(c(1L,
2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5",
"6"), class = "factor"), X1 = structure(c(1L, 4L, 3L, 2L, 1L,
4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"
), class = "factor"), Y1 = structure(c(1L, 5L, 6L, 4L, 1L, 2L,
3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), .Names = c("Filename",
"Time", "X1", "Y1"), row.names = c(NA, -9L), class = "data.frame")
And here's the dput for the second:
structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1",
"3"), class = "factor"), Item = structure(1:4, .Label = c("Cat",
"Dog", "House", "Mouse"), class = "factor"), X1 = structure(c(2L,
4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11",
"13", "35"), class = "factor"), X2 = structure(c(1L, 3L,
4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11",
"13", "35"), class = "factor")), .Names = c("Filename", "Item",
"X1", "Y1", "X2", "Y2"), row.names = c(NA, -4L), class = "data.frame")

Using data.table and the sample data you provided, I would approach it as follows:
library(data.table)

# getting the data in the right format
datcols <- c("X", "Y")
lucols <- c("X1", "X2", "Y1", "Y2")
setDT(dat)[, (datcols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = datcols
           ][, Filename := as.character(Filename)]
setDT(lu)[, (lucols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = lucols
          ][, `:=` (Filename = as.character(Filename),
                    X1 = pmin(X1, X2), X2 = pmax(X1, X2), # make sure that 'X1' is always the lowest value
                    Y1 = pmin(Y1, Y2), Y2 = pmax(Y1, Y2))] # make sure that 'Y1' is always the lowest value
# matching the 'Items' to the correct rows
dat[, looked_at := lu$Item[Filename == lu$Filename &
                             between(X, lu$X1, lu$X2) &
                             between(Y, lu$Y1, lu$Y2)],
    by = .(Filename, Time)]
which gives:
> dat
Filename Time X Y looked_at
1: 1 1 10 10 Cat
2: 1 2 15 20 NA
3: 1 3 12 25 NA
4: 2 1 11 15 NA
5: 2 2 10 10 NA
6: 3 1 15 11 NA
7: 3 2 25 12 NA
8: 3 5 20 15 House
9: 3 6 10 10 Mouse
Used data:
dat <- structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"),
Time = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", "6"), class = "factor"),
X = structure(c(1L, 4L, 3L, 2L, 1L, 4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor"),
Y = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")),
.Names = c("Filename", "Time", "X", "Y"), row.names = c(NA, -9L), class = "data.frame")
lu <- structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", "3"), class = "factor"),
Item = structure(1:4, .Label = c("Cat", "Dog", "House", "Mouse"), class = "factor"),
X1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"),
X2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "13", "35"), class = "factor"),
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "13", "35"), class = "factor")),
.Names = c("Filename", "Item", "X1", "X2", "Y1", "Y2"), row.names = c(NA, -4L), class = "data.frame")
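For millions of rows, a non-equi update join may scale better than grouping by every Filename/Time pair, because the matching is done in a single join. The following is only a sketch under assumptions: the columns are already numeric/character (typed in by hand from the sample data rather than converted from factors), the boxes are already ordered so that X1 <= X2 and Y1 <= Y2, and unmatched rows are relabelled "none" as the question asked. If boxes overlap, the last matching item wins.

```r
library(data.table)

# Toy data transcribed from the sample above (assumed already cleaned)
dat <- data.table(
  Filename = c("1", "1", "1", "2", "2", "3", "3", "3", "3"),
  Time     = c(1, 2, 3, 1, 2, 1, 2, 5, 6),
  X        = c(10, 15, 12, 11, 10, 15, 25, 20, 10),
  Y        = c(10, 20, 25, 15, 10, 11, 12, 15, 10)
)
lu <- data.table(
  Filename = c("1", "1", "3", "3"),
  Item     = c("Cat", "Dog", "House", "Mouse"),
  X1 = c(10, 20, 20, 10), X2 = c(11, 35, 35, 11),
  Y1 = c(10, 13, 13, 10), Y2 = c(11, 35, 35, 11)
)

# Non-equi update join: same file, gaze point inside the box (borders inclusive)
dat[lu, on = .(Filename, X >= X1, X <= X2, Y >= Y1, Y <= Y2),
    looked_at := i.Item]

# Rows that matched no box are still NA; label them "none"
dat[is.na(looked_at), looked_at := "none"]
print(dat)
```

On the sample this labels the same three rows as the grouped version (Cat, House, Mouse) and fills the rest with "none".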

Related

Update row values based on condition in R

I am trying to update the values of Column C2 and C3 based on conditions:
• The variable C2 is equal to 1 if the type of cue = 2,
and 0 otherwise.
• The variable C3 is equal to 1 if the type of cue = 3,
and 0 otherwise.
Data frame Image: https://drive.google.com/file/d/1Enik09cXQ21d3cQQv0_YQDZGBb3Btm5n/view?usp=sharing
dput(Cognitive[1:6,]) =
structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L), Time = c(191L,
206L, 219L, 176L, 182L, 196L), W = c(0L, 0L, 0L, 1L, 1L, 1L),
Cue = c(1L, 2L, 3L, 1L, 2L, 3L), D = c(0L, 0L, 0L, 0L, 0L,
0L), Subject.f = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22",
"23", "24"), class = "factor"), Cue.f = structure(c(1L, 2L,
3L, 1L, 2L, 3L), .Label = c("1", "2", "3"), class = "factor"),
D.f = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), C2 = structure(c(1L, 2L, 3L, 1L,
2L, 3L), .Label = c("1", "2", "3"), class = "factor"), C3 = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("1", "2", "3"), class = "factor")), row.names = c(NA,
6L), class = "data.frame")
Cognitive <- read.csv(file = 'Cognitive.csv')
View(Cognitive)
# Factor the variables Subject, Cue and D and add these variable to the Cognitive data frame.
Cognitive <- mutate(Cognitive, Subject.f = factor(Subject), Cue.f = factor(Cue), D.f = factor(D))
Cognitive <- mutate(Cognitive, C2 = Cue.f, C3 = Cue.f)
Thanks.

library(dplyr)

df %>%
  mutate(C2 = case_when(cue == 2 ~ 1,
                        TRUE ~ 0),
         C3 = case_when(cue == 3 ~ 1,
                        TRUE ~ 0))

A super easy base solution
df <- data.frame(cue=sample(c(1:3),10,replace = T),c2=sample(c(0,1),10,replace = T),c3=sample(c(0,1),10,replace = T))
df$c2 <- ifelse(df$cue==2,1,0)
df$c3 <- ifelse(df$cue==3,1,0)
EDIT
to add another dplyr solution
df <- dplyr::mutate(df,c2= ifelse(cue==2,1,0),c3= ifelse(cue==3,1,0))

We can use sapply in base R
df[-1] <- +(sapply(c(2, 3), `==`, df$cue))
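To make the one-liner above easier to follow, here is a small step-by-step sketch. The data frame and the intermediate name `flags` are made up for illustration; only the `sapply` / unary `+` idea comes from the answer.

```r
# Hypothetical data with the same cue pattern as the sample rows
df <- data.frame(cue = c(1, 2, 3, 1, 2, 3))

# One logical column per cue value: column 1 is cue == 2, column 2 is cue == 3
flags <- sapply(c(2, 3), `==`, df$cue)

# Unary + turns TRUE/FALSE into 1/0
df$C2 <- +flags[, 1]
df$C3 <- +flags[, 2]
print(df)
```

The result is the same as `ifelse(df$cue == 2, 1, 0)` and `ifelse(df$cue == 3, 1, 0)`, just computed for both values in one pass.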

How to use table and aggregate in R

I have a dataframe called tt. I would like to know how many different types of variants I have for different STATUS using aggregate and table functions. I tried aggregate(tt$STATUS, by = list(tt$variant), table) , but it gives me weird column names that I couldn't understand. How do I properly do this?
tt <- structure(list(ID = structure(1:11, .Names = c("9", "10", "11",
"12", "13", "2280", "2415", "3096", "4095", "6437", "7642"), .Label = c("003-0029-0258443",
"003-0039-0349951", "003-0041-0357849", "003-0042-0388658", "010-0001-0040921",
"4_596_8", "5_26202_105", "64368", "A-ADC-AD002860", "MAP_64368",
"S0085"), class = "factor"), variant = structure(c(`9` = 1L,
`10` = 1L, `11` = 1L, `12` = 1L, `13` = 1L, `2280` = 2L, `2415` = 2L,
`3096` = 3L, `4095` = 2L, `6437` = 3L, `7642` = 3L), .Label = c("0/0",
"0/1", "1/1"), class = "factor"), STATUS = structure(c(`9` = 2L,
`10` = 2L, `11` = 2L, `12` = 2L, `13` = 2L, `2280` = 1L, `2415` = 1L,
`3096` = 1L, `4095` = 1L, `6437` = 1L, `7642` = 2L), .Label = c(" 1",
"-9"), class = "factor")), class = "data.frame", row.names = c("9",
"10", "11", "12", "13", "2280", "2415", "3096", "4095", "6437",
"7642"))

If we need table, we can apply it directly after subsetting the two columns instead of calling it within aggregate:
table(tt[c("STATUS", "variant")])
# variant
#STATUS 0/0 0/1 1/1
# 1 0 3 2
# -9 5 0 1
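If aggregate is still preferred, the formula interface gives readable column names; the `Group.1` and `x` names in the original attempt come from passing unnamed vectors. This is a sketch on a simplified stand-in for `tt`: the `ID` values are placeholders and STATUS is written as "1" / "-9" without the leading space, both of which are assumptions.

```r
# Simplified stand-in for 'tt' (variant/STATUS transcribed from the sample, IDs are placeholders)
tt <- data.frame(
  ID      = paste0("id", 1:11),
  variant = c(rep("0/0", 5), "0/1", "0/1", "1/1", "0/1", "1/1", "1/1"),
  STATUS  = c(rep("-9", 5), rep("1", 5), "-9")
)

# Long format: one row per STATUS/variant combination that occurs, with its row count
counts <- aggregate(ID ~ STATUS + variant, data = tt, FUN = length)
names(counts)[names(counts) == "ID"] <- "n"
print(counts)

# Wide format, equivalent to table(tt[c("STATUS", "variant")])
wide <- xtabs(~ STATUS + variant, data = tt)
print(wide)
```

Note that the long version only lists combinations that actually occur, while the wide table also shows the zero cells.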

I would like to create a boxplot of numerical data, but excluding cases which are marked as '0' on another column?

I have made a boxplot for a single factor as follows:
ggplot(data = dataframe2, aes(x=factor(0), y = RPSdata$Survival.One.Year)) + geom_boxplot(...)
The dataframe is simply:
dataframe2 <- data.frame(RPSdata$Survival.One.Year)
I would like to make the same boxplot, but only including cases which are coded as '1' in column RPSdata$Survival.Complete.Sense
Thank you so much! New to R so appreciate any help
Data Sample:
> dput(head(RPSdata, 5))
structure(list(ID.Rank = 1:5, ID.Participant = c("8571762481",
"7351340719", "7396795819", "3790978753", "6450996320"), Population.Risk = structure(c(1L,
2L, 3L, 2L, 2L), .Label = c("1", "2", "3", "4", "5", "6"), class = "factor"),
Personal.Risk = c(50, 60, 30, 40, 10), Comparative.Risk.Age = structure(c(2L,
NA, 3L, 4L, 3L), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
Comparative.Risk.Current = structure(c(NA, 3L, 3L, NA, NA
), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
Comparative.Risk.Ex = structure(c(2L, 3L, NA, NA, 3L), .Label = c("1",
"2", "3", "4", "5"), class = "factor"), Score.Exposure = structure(c(1L,
1L, 1L, 2L, 1L), .Label = c("1", "2", "4", "5"), class = "factor"),
RF.Age = structure(c(1L, NA, 1L, 1L, 2L), .Label = c("0",
"1", "2"), class = "factor"), RF.Pollution = structure(c(1L,
NA, 3L, 2L, 2L), .Label = c("0", "1", "2"), class = "factor"),
RF.Asbestos = structure(c(1L, NA, 1L, 1L, 1L), .Label = c("1",
"2"), class = "factor"), RF.Asthma = structure(c(2L, NA,
3L, 2L, 1L), .Label = c("0", "1", "2"), class = "factor"),
RF.BMI = structure(c(2L, NA, 1L, 2L, 3L), .Label = c("0",
"1", "2"), class = "factor"), RF.Gene = structure(c(2L, NA,
3L, 3L, 3L), .Label = c("0", "1", "2"), class = "factor"),
RF.COPD = structure(c(2L, NA, 2L, 2L, 2L), .Label = c("0",
"1", "2"), class = "factor"), RF.History = structure(c(2L,
NA, 1L, 1L, 2L), .Label = c("0", "1", "2"), class = "factor"),
RF.Diet = structure(c(3L, NA, 1L, 2L, 3L), .Label = c("0",
"1", "2"), class = "factor"), RF.Radon = structure(c(2L,
NA, 1L, 3L, 3L), .Label = c("0", "1", "2"), class = "factor"),
RF.Smoking = structure(c(2L, NA, 2L, 2L, 2L), .Label = c("0",
"1", "2"), class = "factor"), RF.Second.Smoke = structure(c(3L,
NA, 1L, 3L, 2L), .Label = c("0", "1", "2"), class = "factor"),
Survival.One.Year = c(80, 20, NA, NA, 90), Survival.Five.Year = c(60,
50, NA, 30, 50), Survival.Ten.Year = c(40, 20, NA, NA, 2),
Worry.Frequency = structure(c(1L, 3L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4"), class = "factor"), Worry.Intensity = structure(c(1L,
2L, 2L, 2L, 1L), .Label = c("1", "2", "3", "4"), class = "factor"),
Mental.Health.One = structure(c(1L, 3L, 2L, 1L, 1L), .Label = c("0",
"1", "2", "3"), class = "factor"), Mental.Health.Two = structure(c(1L,
2L, 2L, 1L, 1L), .Label = c("0", "1", "2", "3"), class = "factor"),
Mental.Health.Three = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("0",
"1", "2", "3"), class = "factor"), Mental.Health.Four = structure(c(2L,
2L, 1L, 1L, 1L), .Label = c("0", "1", "2", "3"), class = "factor"),
PHQ.4 = structure(c(2L, 5L, 3L, 1L, 1L), .Label = c("0",
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",
"12"), class = "factor"), PHQ4.Anx = structure(c(1L, 4L,
3L, 1L, 1L), .Label = c("0", "1", "2", "3", "4", "5", "6"
), class = "factor"), PHQ4.Dep = structure(c(2L, 2L, 1L,
1L, 1L), .Label = c("0", "1", "2", "3", "4", "5", "6"), class = "factor"),
PHQ4.Bin = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("0",
"1", "2", "3"), class = "factor"), Dep.Bin = structure(c(1L,
1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
Anx.Bin = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), Survival.Compelete.Sense = structure(c(2L,
1L, 1L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
Survival.Semi.Sense = c(1L, 0L, 0L, 1L, 1L)), row.names = c(NA,
5L), class = "data.frame")
>

Given the problem description, there is no need for a second data.frame; RPSdata alone is all that is needed. The problem is solved by subsetting on the condition that the column equals 1.
library(ggplot2)
ggplot(data = subset(RPSdata, Survival.Complete.Sense == 1),
mapping = aes(x = Survival.Complete.Sense, y = Survival.One.Year)) +
geom_boxplot()
Another option, with package dplyr, is to filter first and pipe the result to ggplot. I also coerce the x axis column to factor.
library(dplyr)
library(ggplot2)
RPSdata %>%
filter(Survival.Complete.Sense == 1) %>%
mutate(Survival.Complete.Sense = factor(Survival.Complete.Sense)) %>%
ggplot(aes(Survival.Complete.Sense, Survival.One.Year)) +
geom_boxplot()
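The same filter-then-plot idea also works without extra packages. The sketch below uses base R on a two-column stand-in built from the first five rows of the dput. One thing to watch: the dput spells the column `Survival.Compelete.Sense`, so the name used in the filter has to match whichever spelling the real data has.

```r
# Two columns transcribed from the dput sample (column name spelled as in the dput)
RPSdata <- data.frame(
  Survival.One.Year        = c(80, 20, NA, NA, 90),
  Survival.Compelete.Sense = factor(c(1, 0, 0, 0, 1))
)

# Keep only the cases coded 1, then draw a single box for the remaining values
kept <- subset(RPSdata, Survival.Compelete.Sense == 1)
boxplot(kept$Survival.One.Year, ylab = "Survival.One.Year")
```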

r - Calculated mean and sum values group by the first row

I have a data frame and would like to calculate the mean of all x values and the sum of all y values, grouped by the labels in the first row of the data frame.
The data frame to be calculated
The following link is the result I want.
The result expected
Here are the data.
dt=structure(list(year = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("1980",
"1981", "1982", "1985", "group"), class = "factor"), x1 = structure(c(4L,
1L, 3L, 2L, 1L), .Label = c("1", "2", "4", "A"), class = "factor"),
y1 = structure(c(4L, 1L, 3L, 2L, 2L), .Label = c("1", "3",
"5", "A"), class = "factor"), x2 = structure(c(5L, 1L, 4L,
3L, 2L), .Label = c("2", "4", "5", "6", "A"), class = "factor"),
y2 = structure(c(4L, 1L, 3L, 3L, 2L), .Label = c("3", "5",
"7", "A"), class = "factor"), x3 = structure(c(4L, 1L, 3L,
2L, 1L), .Label = c("4", "6", "8", "B"), class = "factor"),
y3 = structure(c(4L, 1L, 3L, 2L, 1L), .Label = c("3", "5",
"6", "B"), class = "factor"), x4 = structure(c(4L, 1L, 3L,
2L, 3L), .Label = c("2", "4", "5", "C"), class = "factor"),
y4 = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("3", "4",
"5", "6", "C"), class = "factor"), x5 = structure(c(5L, 2L,
1L, 3L, 4L), .Label = c("3", "4", "6", "7", "C"), class = "factor"),
y5 = structure(c(4L, 2L, 1L, 3L, 2L), .Label = c("2", "5",
"8", "C"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
And result expected,
result_expected <- structure(list(year = c(1980L, 1981L, 1982L, 1985L), A_x_mean = c(1.5,
5, 3.5, 2.5), A_y_sum = c(4L, 12L, 10L, 8L), B_x_mean = c(4L,
8L, 6L, 4L), B_y_sum = c(3L, 6L, 5L, 3L), C_x_mean = 3:6, C_y_sum = c(8L,
6L, 13L, 11L)), class = "data.frame", row.names = c(NA, -4L))
I have searched for keywords on Google and Stack Overflow, but found no proper answers. My current thinking is to find the unique groups A, B, C in the first row.
require(tidyverse)
group_variables <- dt%>%gather(key,value)%>%distinct(value)%>%arrange(value)
then get the rows in group_variables with a for loop
for i in group_variables{......}
or can I change the structure of the data frame with gather and spread in tidyr and then use dplyr, something like the following code,
dt_new%>% group_by (group)%>%
summarise(mean=mean(x,na.rm=TRUE),
sum=sum(y,na.rm=TRUE))

First we need to take out the first row having the group, make the data frame long, simplify x1,x2,x3 to x etc and put the groups back:
group_var = sapply(dt[1,-1],as.character)
mat <-
dt[-1,] %>% pivot_longer(-year) %>%
mutate(value=as.numeric(as.character(value))) %>%
mutate(group=as.character(group_var[as.character(name)])) %>%
mutate(name=substr(name,1,1))
mat
# A tibble: 40 x 4
year name value group
<fct> <chr> <dbl> <chr>
1 1980 x 1 A
2 1980 y 1 A
3 1980 x 2 A
4 1980 y 3 A
5 1980 x 4 B
6 1980 y 3 B
7 1980 x 2 C
8 1980 y 3 C
9 1980 x 4 C
10 1980 y 5 C
Now what's left is to group them by year, name and group and apply the respective function, so we define a function:
func = function(DF,func){
DF %>%
group_by(group,name,year) %>%
summarise_all(func) %>%
mutate(label=paste(group,name,func,sep="_")) %>%
ungroup %>%
select(year,value,label) %>%
pivot_wider(values_from=value,names_from=label)
}
And we apply it over two parts of the data:
cbind(func(mat %>% filter(name=="x"),"mean"),func(mat %>% filter(name=="y"),"sum"))
year A_x_mean B_x_mean C_x_mean year A_y_sum B_y_sum C_y_sum
1 1980 1.5 4 3 1980 4 3 8
2 1981 5.0 8 4 1981 12 6 6
3 1982 3.5 6 5 1982 10 5 13
4 1985 2.5 4 6 1985 8 3 11
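For comparison, the same numbers can be reached in base R without reshaping. This is an alternative sketch, not the method above: the values are typed in by hand from the sample (so the factor conversion is skipped), and the names `vals`, `grp` and `is_x` are made up for illustration.

```r
year <- c(1980, 1981, 1982, 1985)

# Numeric part of the sample, one column per original column
vals <- cbind(
  x1 = c(1, 4, 2, 1), y1 = c(1, 5, 3, 3),
  x2 = c(2, 6, 5, 4), y2 = c(3, 7, 7, 5),
  x3 = c(4, 8, 6, 4), y3 = c(3, 6, 5, 3),
  x4 = c(2, 5, 4, 5), y4 = c(3, 4, 5, 6),
  x5 = c(4, 3, 6, 7), y5 = c(5, 2, 8, 5)
)

# Group label of each column, i.e. the first row of the original data frame
grp <- c(x1 = "A", y1 = "A", x2 = "A", y2 = "A", x3 = "B", y3 = "B",
         x4 = "C", y4 = "C", x5 = "C", y5 = "C")

is_x <- startsWith(colnames(vals), "x")

# Row-wise mean of the x columns and row-wise sum of the y columns within each group
x_mean <- sapply(split(colnames(vals)[is_x], grp[is_x]),
                 function(cols) rowMeans(vals[, cols, drop = FALSE]))
y_sum  <- sapply(split(colnames(vals)[!is_x], grp[!is_x]),
                 function(cols) rowSums(vals[, cols, drop = FALSE]))

colnames(x_mean) <- paste0(colnames(x_mean), "_x_mean")
colnames(y_sum)  <- paste0(colnames(y_sum), "_y_sum")
result <- data.frame(year, x_mean, y_sum)
print(result)
```

This keeps a single year column, at the cost of the columns being ordered by statistic rather than by group.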

One way would be to convert your factors to characters, then make your first row the column names (and remove the first row). Then I used dplyr and tidyr to make the data long by year and letters, and pivoted it back to wide format after taking the sum and the mean.
dt=structure(list(year = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("1980",
"1981", "1982", "1985", "group"), class = "factor"), x1 = structure(c(4L,
1L, 3L, 2L, 1L), .Label = c("1", "2", "4", "A"), class = "factor"),
y1 = structure(c(4L, 1L, 3L, 2L, 2L), .Label = c("1", "3",
"5", "A"), class = "factor"), x2 = structure(c(5L, 1L, 4L,
3L, 2L), .Label = c("2", "4", "5", "6", "A"), class = "factor"),
y2 = structure(c(4L, 1L, 3L, 3L, 2L), .Label = c("3", "5",
"7", "A"), class = "factor"), x3 = structure(c(4L, 1L, 3L,
2L, 1L), .Label = c("4", "6", "8", "B"), class = "factor"),
y3 = structure(c(4L, 1L, 3L, 2L, 1L), .Label = c("3", "5",
"6", "B"), class = "factor"), x4 = structure(c(4L, 1L, 3L,
2L, 3L), .Label = c("2", "4", "5", "C"), class = "factor"),
y4 = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("3", "4",
"5", "6", "C"), class = "factor"), x5 = structure(c(5L, 2L,
1L, 3L, 4L), .Label = c("3", "4", "6", "7", "C"), class = "factor"),
y5 = structure(c(4L, 2L, 1L, 3L, 2L), .Label = c("2", "5",
"8", "C"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
dt[sapply(dt, is.factor)] <- sapply(dt, as.character)
colnames(dt) <- dt[1,]
dt2 <- dt[-1,]
library(tidyverse)
dt3 <- pivot_longer(dt2, cols = c("A","B","C"),
names_to = "letters") %>%
ungroup %>%
select(-.copy) %>%
ungroup %>%
mutate(value = as.numeric(value)) %>%
group_by(letters,group) %>%
summarize(meanval = mean(value),
sumval = sum(value)) %>%
ungroup %>%
pivot_wider(names_from = letters,
values_from = c(meanval,sumval))

Trying to normalize data, but got undefined columns selected error in R

I have a dataset that contains a variable nr.employed. It's numeric.
I am normalizing it using this code
markting_train_dim_deleted =
"","custAge","profession","marital","schooling","default","contact","month","campaign","previous","poutcome","cons.price.idx","cons.conf.idx","euribor3m","nr.employed","pmonths","pastEmail","responded"
"1",0.486842105263158,"1","3","7","2","1","8",0,0,"2",0.389321901792677,0.368200836820084,0.806393108138744,5195.8,999,0,"1"
"2",0.342105263157895,"2","2","1","1","1","4",0,0,"2",0.669134840218243,0.338912133891213,0.980729993198821,5228.1,999,0,"1"
"3",0.315789473684211,"10","2","4","1","2","7",0,0,"2",0.698752922837102,0.602510460251046,0.95737927907504,5191,999,0,"1"
"4",0.486842105263158,"5","1","1","2","1","4",0.0256410256410256,0,"2",0.669134840218243,0.338912133891213,0.981183405123555,5228.1,999,0,"1"
"5",0.215870043275927,"1","1","7","1","1","7",0.102564102564103,0.166666666666667,"1",0.26968043647701,0.192468619246862,0.148945817274994,5099.1,999,1,"1"
"6",0.381578947368421,"2","2","1","1","2","7",0,0,"2",0.698752922837102,0.602510460251046,0.95737927907504,5191,999,0,"1"
cnames=c("custAge","campaign","previous","cons.price.idx","cons.conf.idx",
"euribor3m"," nr.employed","pmonths","pastEmail")
for(i in cnames){
print(i)
print(markting_train_dim_deleted[,i])
markting_train_dim_deleted[,i]=
(markting_train_dim_deleted[,i]-min(markting_train_dim_deleted[,i]))/
(max(markting_train_dim_deleted[,i]-min(markting_train_dim_deleted[,i])))
}
After processing euribor3m, it prints nr.employed and then throws this error:
Error in `[.data.frame`(markting_train_dim_deleted, , i) :
undefined columns selected
I have looked at the structure. It's a numeric data type with no missing values.
output
dput(head(markting_train_dim_deleted))
structure(list(custAge = c(0.486842105263158, 0.342105263157895,
0.315789473684211, 0.486842105263158, 0.215870043275927, 0.381578947368421
), profession = structure(c(1L, 2L, 10L, 5L, 1L, 2L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "factor"),
marital = structure(c(3L, 2L, 2L, 1L, 1L, 2L), .Label = c("1",
"2", "3", "4"), class = "factor"), schooling = structure(c(7L,
1L, 4L, 1L, 7L, 1L), .Label = c("1", "2", "3", "4", "5",
"6", "7", "8"), class = "factor"), default = structure(c(2L,
1L, 1L, 2L, 1L, 1L), .Label = c("1", "2", "3"), class = "factor"),
contact = structure(c(1L, 1L, 2L, 1L, 1L, 2L), .Label = c("1",
"2"), class = "factor"), month = structure(c(8L, 4L, 7L,
4L, 7L, 7L), .Label = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10"), class = "factor"), campaign = c(0, 0, 0,
0.0256410256410256, 0.102564102564103, 0), previous = c(0,
0, 0, 0, 0.166666666666667, 0), poutcome = structure(c(2L,
2L, 2L, 2L, 1L, 2L), .Label = c("1", "2", "3"), class = "factor"),
cons.price.idx = c(0.389321901792677, 0.669134840218243,
0.698752922837102, 0.669134840218243, 0.26968043647701, 0.698752922837102
), cons.conf.idx = c(0.368200836820084, 0.338912133891213,
0.602510460251046, 0.338912133891213, 0.192468619246862,
0.602510460251046), euribor3m = c(0.806393108138744, 0.980729993198821,
0.95737927907504, 0.981183405123555, 0.148945817274994, 0.95737927907504
), nr.employed = c(5195.8, 5228.1, 5191, 5228.1, 5099.1,
5191), pmonths = c(999, 999, 999, 999, 999, 999), pastEmail = c(0L,
0L, 0L, 0L, 1L, 0L), responded = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = c("1", "2"), class = "factor")), .Names = c("custAge",
"profession", "marital", "schooling", "default", "contact", "month",
"campaign", "previous", "poutcome", "cons.price.idx", "cons.conf.idx",
"euribor3m", "nr.employed", "pmonths", "pastEmail", "responded"
), row.names = c(NA, 6L), class = "data.frame")

The mistake is simply having " nr.employed" (with a space) rather than "nr.employed" in cnames.
Also, something like
markting_train_dim_deleted[, cnames] <- sapply(markting_train_dim_deleted[, cnames],
function(x) (x - min(x)) / (max(x) - min(x)))
would make the normalization easier to read.
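A slightly more defensive variant might look like the sketch below. It trims stray spaces from the column names, fails early on names that do not exist, and guards against constant columns: in the sample rows pmonths is always 999, so max minus min is 0 and plain min-max scaling would give NaN. The helper name `min_max`, the choice of returning 0 for constant columns, and the rounded euribor3m values are assumptions for illustration.

```r
# Small stand-in for the sample (euribor3m rounded, other values as in the dput)
df <- data.frame(
  euribor3m   = c(0.806, 0.981, 0.957, 0.981, 0.149, 0.957),
  nr.employed = c(5195.8, 5228.1, 5191, 5228.1, 5099.1, 5191),
  pmonths     = c(999, 999, 999, 999, 999, 999)
)

# trimws() removes the accidental leading space in " nr.employed"
cnames <- trimws(c("euribor3m", " nr.employed", "pmonths"))

# Fail early with a readable message if a name still does not match a column
unknown <- setdiff(cnames, names(df))
if (length(unknown) > 0) stop("Unknown columns: ", paste(unknown, collapse = ", "))

min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column: avoid 0/0
  (x - rng[1]) / (rng[2] - rng[1])
}

df[cnames] <- lapply(df[cnames], min_max)
print(df)
```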
