r - Calculated mean and sum values group by the first row

I have a data frame, and I would like to calculate the mean of all x values and the sum of all y values, grouped by the first row of the data frame.
The data frame to be calculated
The following link is the result I want.
The expected result
Here are the data.
dt=structure(list(year = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("1980",
"1981", "1982", "1985", "group"), class = "factor"), x1 = structure(c(4L,
1L, 3L, 2L, 1L), .Label = c("1", "2", "4", "A"), class = "factor"),
y1 = structure(c(4L, 1L, 3L, 2L, 2L), .Label = c("1", "3",
"5", "A"), class = "factor"), x2 = structure(c(5L, 1L, 4L,
3L, 2L), .Label = c("2", "4", "5", "6", "A"), class = "factor"),
y2 = structure(c(4L, 1L, 3L, 3L, 2L), .Label = c("3", "5",
"7", "A"), class = "factor"), x3 = structure(c(4L, 1L, 3L,
2L, 1L), .Label = c("4", "6", "8", "B"), class = "factor"),
y3 = structure(c(4L, 1L, 3L, 2L, 1L), .Label = c("3", "5",
"6", "B"), class = "factor"), x4 = structure(c(4L, 1L, 3L,
2L, 3L), .Label = c("2", "4", "5", "C"), class = "factor"),
y4 = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("3", "4",
"5", "6", "C"), class = "factor"), x5 = structure(c(5L, 2L,
1L, 3L, 4L), .Label = c("3", "4", "6", "7", "C"), class = "factor"),
y5 = structure(c(4L, 2L, 1L, 3L, 2L), .Label = c("2", "5",
"8", "C"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
And the expected result:
result_expected <- structure(list(year = c(1980L, 1981L, 1982L, 1985L), A_x_mean = c(1.5,
5, 3.5, 2.5), A_y_sum = c(4L, 12L, 10L, 8L), B_x_mean = c(4L,
8L, 6L, 4L), B_y_sum = c(3L, 6L, 5L, 3L), C_x_mean = 3:6, C_y_sum = c(8L,
6L, 13L, 11L)), class = "data.frame", row.names = c(NA, -4L))
I have searched for keywords on Google and Stack Overflow, but found no proper answers. My current thinking is to find the unique groups A, B, C in the first row.
require(tidyverse)
group_variables <- dt%>%gather(key,value)%>%distinct(value)%>%arrange(value)
then get the rows in group_variables with a for loop
for i in group_variables{......}
or can I change the structure of the data frame with gather and spread in tidyr and with dplyr methods, something like the following code,
dt_new%>% group_by (group)%>%
summarise(mean=mean(x,na.rm=TRUE),
sum=sum(x,na.rm=TRUE))

First we need to take out the first row, which holds the groups, make the data frame long, simplify x1, x2, x3 to x (and likewise for y), and put the groups back:
group_var = sapply(dt[1,-1],as.character)
mat <-
dt[-1,] %>% pivot_longer(-year) %>%
mutate(value=as.numeric(as.character(value))) %>%
mutate(group=as.character(group_var[as.character(name)])) %>%
mutate(name=substr(name,1,1))
mat
# A tibble: 40 x 4
year name value group
<fct> <chr> <dbl> <chr>
1 1980 x 1 A
2 1980 y 1 A
3 1980 x 2 A
4 1980 y 3 A
5 1980 x 4 B
6 1980 y 3 B
7 1980 x 2 C
8 1980 y 3 C
9 1980 x 4 C
10 1980 y 5 C
Now what's left is to group them by year, name and group and apply the respective function, so we define a function:
func = function(DF,func){
DF %>%
group_by(group,name,year) %>%
summarise_all(func) %>%
mutate(label=paste(group,name,func,sep="_")) %>%
ungroup %>%
select(year,value,label) %>%
pivot_wider(values_from=value,names_from=label)
}
And we apply it over two parts of the data:
cbind(func(mat %>% filter(name=="x"),"mean"),func(mat %>% filter(name=="y"),"sum"))
year A_x_mean B_x_mean C_x_mean year A_y_sum B_y_sum C_y_sum
1 1980 1.5 4 3 1980 4 3 8
2 1981 5.0 8 4 1981 12 6 6
3 1982 3.5 6 5 1982 10 5 13
4 1985 2.5 4 6 1985 8 3 11
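As a cross-check on the pipeline above, here is a minimal base R sketch of the same idea on a cut-down copy of the data. It assumes only two years and the columns x1 to y3, stored as character rather than factor to keep the example short. The names groups, kind and res are illustrative and not part of the original answer.

```r
# Cut-down copy of the question's data (assumption: character columns, two years only)
dt <- data.frame(
  year = c("group", "1980", "1981"),
  x1 = c("A", "1", "4"), y1 = c("A", "1", "5"),
  x2 = c("A", "2", "6"), y2 = c("A", "3", "7"),
  x3 = c("B", "4", "8"), y3 = c("B", "3", "6"),
  stringsAsFactors = FALSE
)

groups <- unlist(dt[1, -1])        # group label of every value column
vals <- dt[-1, -1]                 # the numeric part, still stored as text
vals[] <- lapply(vals, as.numeric)
kind <- substr(names(vals), 1, 1)  # "x" or "y"

res <- data.frame(year = as.integer(dt$year[-1]))
for (g in unique(groups)) {
  # mean over the x columns of this group, sum over its y columns, per year
  res[[paste0(g, "_x_mean")]] <- rowMeans(vals[, groups == g & kind == "x", drop = FALSE])
  res[[paste0(g, "_y_sum")]]  <- rowSums(vals[, groups == g & kind == "y", drop = FALSE])
}
res
```

For the two years included, the A and B columns agree with the corresponding rows of result_expected in the question.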

One way would be to convert your factors to characters, then make your first row your column names (and remove the first row). Then I did some data manipulation using dplyr and tidyr to make the data long by year and letters, and then pivoted the data back into wide format after taking the sum and the mean.
dt=structure(list(year = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("1980",
"1981", "1982", "1985", "group"), class = "factor"), x1 = structure(c(4L,
1L, 3L, 2L, 1L), .Label = c("1", "2", "4", "A"), class = "factor"),
y1 = structure(c(4L, 1L, 3L, 2L, 2L), .Label = c("1", "3",
"5", "A"), class = "factor"), x2 = structure(c(5L, 1L, 4L,
3L, 2L), .Label = c("2", "4", "5", "6", "A"), class = "factor"),
y2 = structure(c(4L, 1L, 3L, 3L, 2L), .Label = c("3", "5",
"7", "A"), class = "factor"), x3 = structure(c(4L, 1L, 3L,
2L, 1L), .Label = c("4", "6", "8", "B"), class = "factor"),
y3 = structure(c(4L, 1L, 3L, 2L, 1L), .Label = c("3", "5",
"6", "B"), class = "factor"), x4 = structure(c(4L, 1L, 3L,
2L, 3L), .Label = c("2", "4", "5", "C"), class = "factor"),
y4 = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("3", "4",
"5", "6", "C"), class = "factor"), x5 = structure(c(5L, 2L,
1L, 3L, 4L), .Label = c("3", "4", "6", "7", "C"), class = "factor"),
y5 = structure(c(4L, 2L, 1L, 3L, 2L), .Label = c("2", "5",
"8", "C"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
dt[sapply(dt, is.factor)] <- sapply(dt, as.character)
colnames(dt) <- dt[1,]
dt2 <- dt[-1,]
library(tidyverse)
dt3 <- pivot_longer(dt2, cols = c("A","B","C"),
names_to = "letters") %>%
ungroup %>%
select(-.copy) %>%
ungroup %>%
mutate(value = as.numeric(value)) %>%
group_by(letters,group) %>%
summarize(meanval = mean(value),
sumval = sum(value)) %>%
ungroup %>%
pivot_wider(names_from = letters,
values_from = c(meanval,sumval))

Related

how to combine information from 3 columns into one without duplicating them

I need to have one column for age. Currently I have three which are under30, age30to64 and age65plus. I need to combine all these into a single column. I would also like to do the same for the active and active1 columns. I would like to achieve this without duplicating or leaving out essential data.
structure(list(fruits = c(0, 0, 0, 0, 1), veggies = c(0, 1, 1,
1, 1), age = structure(c(7L, 8L, 9L, 10L, 6L), levels = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"
), class = "factor"), under30 = structure(c(1L, 1L, 1L, 1L, 1L
), levels = c("30 plus", "under 30"), class = "factor"), age30to64 = structure(c(2L,
2L, 2L, 1L, 2L), levels = c("under 30 or 65 plus", "age 30 to 64"
), class = "factor"), age65plus = structure(c(1L, 1L, 1L, 2L,
1L), levels = c("under 65", "65 plus"), class = "factor"), arthritis = structure(c(1L,
2L, 1L, 1L, 1L), levels = c("No arthritis", "Arthritis"), class = "factor"),
gender = structure(c(2L, 2L, 2L, 1L, 2L), levels = c("male",
"female"), class = "factor"), genhealth = structure(c(3L,
3L, 2L, 3L, 2L), levels = c("Excellent", "Very good", "Good",
"Fair", "Poor"), class = "factor"), education = structure(c(5L,
6L, 4L, 6L, 6L), levels = c("1", "2", "3", "4", "5", "6"), class = "factor"),
income = structure(c(8L, 8L, 7L, 6L, 8L), levels = c("1",
"2", "3", "4", "5", "6", "7", "8"), class = "factor"), active = structure(c(2L,
1L, 2L, 1L, 2L), levels = c("Not active", "Active"), class = "factor"),
active1 = structure(c(2L, 1L, 2L, 1L, 3L), levels = c("Low",
"Moderate", "Vigorous"), class = "factor"), bmi = c(18.2199993133545,
27.4599990844727, 21.9699993133545, 35.939998626709, 39.8600006103516
), bmicat = structure(c(1L, 3L, 2L, 4L, 4L), levels = c("Underweight",
"Normal", "Overweight", "Obese"), class = "factor"), activetimes = c(20,
0, 5, 0, 8), ageCat = structure(c(2L, 2L, 2L, 3L, 2L), levels = c("under30",
"age30to64", "over64"), class = "factor")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))

Using dplyr::case_when():
library(dplyr)
dat <- dat %>%
mutate(
ageCat = case_when(
under30 == "under 30" ~ "under 30",
age30to64 == "age 30 to 64" ~ "30 to 64",
age65plus == "65 plus" ~ "65 plus"
),
ageCat = factor(ageCat, c("under 30", "30 to 64", "65 plus"))
)
Result:
#> dat %>%
#+ select(under30, age30to64, age65plus, ageCat)
# A tibble: 5 × 4
under30 age30to64 age65plus ageCat
<fct> <fct> <fct> <fct>
1 30 plus age 30 to 64 under 65 30 to 64
2 30 plus age 30 to 64 under 65 30 to 64
3 30 plus age 30 to 64 under 65 30 to 64
4 30 plus under 30 or 65 plus 65 plus 65 plus
5 30 plus age 30 to 64 under 65 30 to 64
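If dplyr is not available, roughly the same logic can be sketched in base R with nested ifelse() calls. The data frame below is a hand-typed copy of the three age columns from the question's sample. The final NA_character_ branch is an assumption about what should happen to a row that matches none of the three conditions.

```r
# Hand-typed copy of the three age indicator columns from the sample data
dat <- data.frame(
  under30   = factor(rep("30 plus", 5), levels = c("30 plus", "under 30")),
  age30to64 = factor(c("age 30 to 64", "age 30 to 64", "age 30 to 64",
                       "under 30 or 65 plus", "age 30 to 64")),
  age65plus = factor(c("under 65", "under 65", "under 65", "65 plus", "under 65"))
)

# Conditions are checked in order; the first one that holds decides the label
ageCat <- ifelse(dat$under30 == "under 30", "under 30",
          ifelse(dat$age30to64 == "age 30 to 64", "30 to 64",
          ifelse(dat$age65plus == "65 plus", "65 plus", NA_character_)))

dat$ageCat <- factor(ageCat, levels = c("under 30", "30 to 64", "65 plus"))
dat
```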

Update row values based on condition in R

I am trying to update the values of Column C2 and C3 based on conditions:
• The variable C2 is equal to 1 if the type of cue = 2,
and 0 otherwise.
• The variable C3 is equal to 1 if the type of cue = 3,
and 0 otherwise.
Data frame Image: https://drive.google.com/file/d/1Enik09cXQ21d3cQQv0_YQDZGBb3Btm5n/view?usp=sharing
dput(Cognitive[1:6,]) =
structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L), Time = c(191L,
206L, 219L, 176L, 182L, 196L), W = c(0L, 0L, 0L, 1L, 1L, 1L),
Cue = c(1L, 2L, 3L, 1L, 2L, 3L), D = c(0L, 0L, 0L, 0L, 0L,
0L), Subject.f = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22",
"23", "24"), class = "factor"), Cue.f = structure(c(1L, 2L,
3L, 1L, 2L, 3L), .Label = c("1", "2", "3"), class = "factor"),
D.f = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), C2 = structure(c(1L, 2L, 3L, 1L,
2L, 3L), .Label = c("1", "2", "3"), class = "factor"), C3 = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("1", "2", "3"), class = "factor")), row.names = c(NA,
6L), class = "data.frame")
Cognitive <- read.csv(file = 'Cognitive.csv')
View(Cognitive)
# Factor the variables Subject, Cue and D and add these variable to the Cognitive data frame.
Cognitive <- mutate(Cognitive, Subject.f = factor(Subject), Cue.f = factor(Cue), D.f = factor(D))
Cognitive <- mutate(Cognitive, C2 = Cue.f, C3 = Cue.f)
Thanks.

df %>%
mutate(C2 = case_when(cue == 2 ~ 1,
TRUE ~ 0),
C3 = case_when(cue == 3 ~ 1,
TRUE ~ 0))
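Note that the question's data frame is called Cognitive and its column is Cue with a capital C, and R names are case-sensitive. A minimal base R sketch on the first six rows of the question's data (only the Subject and Cue columns are reproduced here) might look like this:

```r
# First six rows of the question's data, reduced to the columns that matter
Cognitive <- data.frame(Subject = rep(1L, 6),
                        Cue = c(1L, 2L, 3L, 1L, 2L, 3L))

# A logical comparison converted to integer gives the 0/1 indicator directly
Cognitive$C2 <- as.integer(Cognitive$Cue == 2)
Cognitive$C3 <- as.integer(Cognitive$Cue == 3)
Cognitive
```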

A super easy base solution
df <- data.frame(cue=sample(c(1:3),10,replace = T),c2=sample(c(0,1),10,replace = T),c3=sample(c(0,1),10,replace = T))
df$c2 <- ifelse(df$cue==2,1,0)
df$c3 <- ifelse(df$cue==3,1,0)
EDIT
to add another dplyr solution
df <- dplyr::mutate(df,c2= ifelse(cue==2,1,0),c3= ifelse(cue==3,1,0))

We can use sapply in base R
df[-1] <- +(sapply(c(2, 3), `==`, df$cue))
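To unpack the one-liner: sapply() compares the cue column against 2 and then against 3, returning a logical matrix with one column per comparison, and the unary + turns TRUE/FALSE into 1/0. The sketch below uses a small fixed data frame (an assumption, in place of the random one above) so the result can be checked by eye.

```r
# Small fixed example: cue in the first column, two columns to overwrite
df <- data.frame(cue = c(1, 2, 3, 2), c2 = NA, c3 = NA)

# One logical column per value in c(2, 3)
m <- sapply(c(2, 3), `==`, df$cue)

# Unary plus coerces the logical matrix to integers; df[-1] is every column but cue
df[-1] <- +m
df
```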

I would like to create a boxplot of numerical data, but excluding cases which are marked as '0' on another column?

I have made a boxplot for a single factor as follows:
ggplot(data = dataframe2, aes(x=factor(0), y = RPSdata$Survival.One.Year)) + geom_boxplot(...)
The dataframe is simply:
dataframe2 <- data.frame(RPSdata$Survival.One.Year)
I would like to make the same boxplot, but only including cases which are coded as '1' in column RPSdata$Survival.Complete.Sense
Thank you so much! New to R so appreciate any help
Data Sample:
> dput(head(RPSdata, 5))
structure(list(ID.Rank = 1:5, ID.Participant = c("8571762481",
"7351340719", "7396795819", "3790978753", "6450996320"), Population.Risk = structure(c(1L,
2L, 3L, 2L, 2L), .Label = c("1", "2", "3", "4", "5", "6"), class = "factor"),
Personal.Risk = c(50, 60, 30, 40, 10), Comparative.Risk.Age = structure(c(2L,
NA, 3L, 4L, 3L), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
Comparative.Risk.Current = structure(c(NA, 3L, 3L, NA, NA
), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
Comparative.Risk.Ex = structure(c(2L, 3L, NA, NA, 3L), .Label = c("1",
"2", "3", "4", "5"), class = "factor"), Score.Exposure = structure(c(1L,
1L, 1L, 2L, 1L), .Label = c("1", "2", "4", "5"), class = "factor"),
RF.Age = structure(c(1L, NA, 1L, 1L, 2L), .Label = c("0",
"1", "2"), class = "factor"), RF.Pollution = structure(c(1L,
NA, 3L, 2L, 2L), .Label = c("0", "1", "2"), class = "factor"),
RF.Asbestos = structure(c(1L, NA, 1L, 1L, 1L), .Label = c("1",
"2"), class = "factor"), RF.Asthma = structure(c(2L, NA,
3L, 2L, 1L), .Label = c("0", "1", "2"), class = "factor"),
RF.BMI = structure(c(2L, NA, 1L, 2L, 3L), .Label = c("0",
"1", "2"), class = "factor"), RF.Gene = structure(c(2L, NA,
3L, 3L, 3L), .Label = c("0", "1", "2"), class = "factor"),
RF.COPD = structure(c(2L, NA, 2L, 2L, 2L), .Label = c("0",
"1", "2"), class = "factor"), RF.History = structure(c(2L,
NA, 1L, 1L, 2L), .Label = c("0", "1", "2"), class = "factor"),
RF.Diet = structure(c(3L, NA, 1L, 2L, 3L), .Label = c("0",
"1", "2"), class = "factor"), RF.Radon = structure(c(2L,
NA, 1L, 3L, 3L), .Label = c("0", "1", "2"), class = "factor"),
RF.Smoking = structure(c(2L, NA, 2L, 2L, 2L), .Label = c("0",
"1", "2"), class = "factor"), RF.Second.Smoke = structure(c(3L,
NA, 1L, 3L, 2L), .Label = c("0", "1", "2"), class = "factor"),
Survival.One.Year = c(80, 20, NA, NA, 90), Survival.Five.Year = c(60,
50, NA, 30, 50), Survival.Ten.Year = c(40, 20, NA, NA, 2),
Worry.Frequency = structure(c(1L, 3L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4"), class = "factor"), Worry.Intensity = structure(c(1L,
2L, 2L, 2L, 1L), .Label = c("1", "2", "3", "4"), class = "factor"),
Mental.Health.One = structure(c(1L, 3L, 2L, 1L, 1L), .Label = c("0",
"1", "2", "3"), class = "factor"), Mental.Health.Two = structure(c(1L,
2L, 2L, 1L, 1L), .Label = c("0", "1", "2", "3"), class = "factor"),
Mental.Health.Three = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("0",
"1", "2", "3"), class = "factor"), Mental.Health.Four = structure(c(2L,
2L, 1L, 1L, 1L), .Label = c("0", "1", "2", "3"), class = "factor"),
PHQ.4 = structure(c(2L, 5L, 3L, 1L, 1L), .Label = c("0",
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",
"12"), class = "factor"), PHQ4.Anx = structure(c(1L, 4L,
3L, 1L, 1L), .Label = c("0", "1", "2", "3", "4", "5", "6"
), class = "factor"), PHQ4.Dep = structure(c(2L, 2L, 1L,
1L, 1L), .Label = c("0", "1", "2", "3", "4", "5", "6"), class = "factor"),
PHQ4.Bin = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("0",
"1", "2", "3"), class = "factor"), Dep.Bin = structure(c(1L,
1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
Anx.Bin = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("0",
"1"), class = "factor"), Survival.Compelete.Sense = structure(c(2L,
1L, 1L, 1L, 2L), .Label = c("0", "1"), class = "factor"),
Survival.Semi.Sense = c(1L, 0L, 0L, 1L, 1L)), row.names = c(NA,
5L), class = "data.frame")
>

Given the problem description, there is no need for a second data.frame; RPSdata alone is all that is needed. The problem is solved by subsetting on the condition that a column must be equal to 1.
library(ggplot2)
ggplot(data = subset(RPSdata, Survival.Complete.Sense == 1),
mapping = aes(x = Survival.Complete.Sense, y = Survival.One.Year)) +
geom_boxplot()
Another option, with package dplyr, is to filter first and pipe the result to ggplot. I also coerce the x axis column to factor.
library(dplyr)
library(ggplot2)
RPSdata %>%
filter(Survival.Complete.Sense == 1) %>%
mutate(Survival.Complete.Sense = factor(Survival.Complete.Sense)) %>%
ggplot(aes(Survival.Complete.Sense, Survival.One.Year)) +
geom_boxplot()
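One thing to watch for: in the dput sample from the question the column is spelled Survival.Compelete.Sense, while the code above uses Survival.Complete.Sense. If the real data carries the same spelling, the filter will not find the column. The sketch below, on a two-column copy of the sample, renames the column first and then subsets; whether renaming or adjusting the code is preferable is left open.

```r
# Two columns copied from the question's dput sample, including the original spelling
RPSdata <- data.frame(
  Survival.One.Year = c(80, 20, NA, NA, 90),
  Survival.Compelete.Sense = factor(c("1", "0", "0", "0", "1"), levels = c("0", "1"))
)

# Rename the misspelled column so it matches the name used in the plotting code
names(RPSdata)[names(RPSdata) == "Survival.Compelete.Sense"] <- "Survival.Complete.Sense"

# Keep only the cases coded as 1
kept <- subset(RPSdata, Survival.Complete.Sense == 1)
kept
```

The kept data frame can then be passed to either of the ggplot calls above in place of the subset or filter step.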

Trying to normalize data, but got undefined columns selected error in R

I have a dataset that contains a variable nr.employed. It is numeric.
I am normalizing it using this code
markting_train_dim_deleted =
"","custAge","profession","marital","schooling","default","contact","month","campaign","previous","poutcome","cons.price.idx","cons.conf.idx","euribor3m","nr.employed","pmonths","pastEmail","responded"
"1",0.486842105263158,"1","3","7","2","1","8",0,0,"2",0.389321901792677,0.368200836820084,0.806393108138744,5195.8,999,0,"1"
"2",0.342105263157895,"2","2","1","1","1","4",0,0,"2",0.669134840218243,0.338912133891213,0.980729993198821,5228.1,999,0,"1"
"3",0.315789473684211,"10","2","4","1","2","7",0,0,"2",0.698752922837102,0.602510460251046,0.95737927907504,5191,999,0,"1"
"4",0.486842105263158,"5","1","1","2","1","4",0.0256410256410256,0,"2",0.669134840218243,0.338912133891213,0.981183405123555,5228.1,999,0,"1"
"5",0.215870043275927,"1","1","7","1","1","7",0.102564102564103,0.166666666666667,"1",0.26968043647701,0.192468619246862,0.148945817274994,5099.1,999,1,"1"
"6",0.381578947368421,"2","2","1","1","2","7",0,0,"2",0.698752922837102,0.602510460251046,0.95737927907504,5191,999,0,"1"
cnames=c("custAge","campaign","previous","cons.price.idx","cons.conf.idx",
"euribor3m"," nr.employed","pmonths","pastEmail")
for(i in cnames){
print(i)
print(markting_train_dim_deleted[,i])
markting_train_dim_deleted[,i]=
(markting_train_dim_deleted[,i]-min(markting_train_dim_deleted[,i]))/
(max(markting_train_dim_deleted[,i]-min(markting_train_dim_deleted[,i])))
}
After processing euribor3m, it prints nr.employed and then throws an error
Error in `[.data.frame`(markting_train_dim_deleted, , i) :
undefined columns selected
I have looked at the structure. It is a numeric column with no missing values.
output
dput(head(markting_train_dim_deleted))
structure(list(custAge = c(0.486842105263158, 0.342105263157895,
0.315789473684211, 0.486842105263158, 0.215870043275927, 0.381578947368421
), profession = structure(c(1L, 2L, 10L, 5L, 1L, 2L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "factor"),
marital = structure(c(3L, 2L, 2L, 1L, 1L, 2L), .Label = c("1",
"2", "3", "4"), class = "factor"), schooling = structure(c(7L,
1L, 4L, 1L, 7L, 1L), .Label = c("1", "2", "3", "4", "5",
"6", "7", "8"), class = "factor"), default = structure(c(2L,
1L, 1L, 2L, 1L, 1L), .Label = c("1", "2", "3"), class = "factor"),
contact = structure(c(1L, 1L, 2L, 1L, 1L, 2L), .Label = c("1",
"2"), class = "factor"), month = structure(c(8L, 4L, 7L,
4L, 7L, 7L), .Label = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10"), class = "factor"), campaign = c(0, 0, 0,
0.0256410256410256, 0.102564102564103, 0), previous = c(0,
0, 0, 0, 0.166666666666667, 0), poutcome = structure(c(2L,
2L, 2L, 2L, 1L, 2L), .Label = c("1", "2", "3"), class = "factor"),
cons.price.idx = c(0.389321901792677, 0.669134840218243,
0.698752922837102, 0.669134840218243, 0.26968043647701, 0.698752922837102
), cons.conf.idx = c(0.368200836820084, 0.338912133891213,
0.602510460251046, 0.338912133891213, 0.192468619246862,
0.602510460251046), euribor3m = c(0.806393108138744, 0.980729993198821,
0.95737927907504, 0.981183405123555, 0.148945817274994, 0.95737927907504
), nr.employed = c(5195.8, 5228.1, 5191, 5228.1, 5099.1,
5191), pmonths = c(999, 999, 999, 999, 999, 999), pastEmail = c(0L,
0L, 0L, 0L, 1L, 0L), responded = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = c("1", "2"), class = "factor")), .Names = c("custAge",
"profession", "marital", "schooling", "default", "contact", "month",
"campaign", "previous", "poutcome", "cons.price.idx", "cons.conf.idx",
"euribor3m", "nr.employed", "pmonths", "pastEmail", "responded"
), row.names = c(NA, 6L), class = "data.frame")

The mistake is simply having " nr.employed" (with a space) rather than "nr.employed" in cnames.
Also, something like
markting_train_dim_deleted[, cnames] <- sapply(markting_train_dim_deleted[, cnames],
function(x) (x - min(x)) / (max(x) - min(x)))
would make the normalization easier to read.
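Two small safeguards may be worth adding, sketched here on a two-column copy of the sample rows. trimws() strips stray spaces from the column names before they are used, and the helper rescale01 (a hypothetical name) returns zeros for a constant column such as pmonths, which is 999 in every sample row and would otherwise give 0/0 = NaN.

```r
# Min-max scaling that tolerates constant columns (hypothetical helper)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column: avoid 0/0
  (x - rng[1]) / (rng[2] - rng[1])
}

# Two columns copied from the question's sample rows
df <- data.frame(nr.employed = c(5195.8, 5228.1, 5191, 5228.1, 5099.1, 5191),
                 pmonths = rep(999, 6))

cnames <- trimws(c(" nr.employed", "pmonths"))  # removes the accidental leading space
stopifnot(all(cnames %in% names(df)))           # fail early with a clear message

df[cnames] <- lapply(df[cnames], rescale01)
df
```

Mapping a constant column to zeros is a design choice; dropping such columns before modelling would be another reasonable option.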

fast, efficient way to loop over millions of rows and match columns

I'm working with eye tracking data right now, so I have a HUGE dataset (think millions of rows) and would like a fast way to do this task. Here's a simplified version of it.
The data tells you where the eye is looking at each time point, for each file we are looking at. X1,Y1 are the coordinates of the point we're looking at. There are multiple time points for each file (representing the eye looking at different locations in the file through time).
Filename Time X1 Y1
1 1 10 10
1 2 12 10
I also have a file of where items are located for each filename. Each file contains (in this simplified case) two objects. X1,Y1 are the lower left coordinates and X2, Y2 are the upper right. You can imagine this as giving the bounding box where the item is located in each file. E.g.
Filename Item X1 Y1 X2 Y2
1 Dog 11 10 20 20
What I'd like to do is add another column to the first data frame that tells me what object the person is looking at during each time for each file. If they are not looking at any of the objects, I'd like the column to say "none". Things on the border count as being looked at. E.g.
Filename Time X1 Y1 LookingAt
1 1 10 10 none
1 2 12 11 Dog
I know how to do this the for loop way, but it takes forever (and crashed my RStudio). I'm wondering if there might be a faster, more efficient way I'm missing.
Here's the dput for the first dataframe (these contain more rows than the example I showed above):
structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), Time = structure(c(1L,
2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5",
"6"), class = "factor"), X1 = structure(c(1L, 4L, 3L, 2L, 1L,
4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"
), class = "factor"), Y1 = structure(c(1L, 5L, 6L, 4L, 1L, 2L,
3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), .Names = c("Filename",
"Time", "X1", "Y1"), row.names = c(NA, -9L), class = "data.frame")
And here's the dput for the second:
structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1",
"3"), class = "factor"), Item = structure(1:4, .Label = c("Cat",
"Dog", "House", "Mouse"), class = "factor"), X1 = structure(c(2L,
4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11",
"13", "35"), class = "factor"), X2 = structure(c(1L, 3L,
4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11",
"13", "35"), class = "factor")), .Names = c("Filename", "Item",
"X1", "Y1", "X2", "Y2"), row.names = c(NA, -4L), class = "data.frame")

Using data.table and the sample data you provided, I would approach it as follows:
library(data.table)
# getting the data in the right format
datcols <- c("X","Y")
lucols <- c("X1","X2","Y1","Y2")
setDT(dat)[, (datcols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = datcols
][, Filename := as.character(Filename)]
setDT(lu)[, (lucols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = lucols
][, `:=` (Filename = as.character(Filename),
X1 = pmin(X1,X2), X2 = pmax(X1,X2), # make sure that 'X1' is always the lowest value
Y1 = pmin(Y1,Y2), Y2 = pmax(Y1,Y2))] # make sure that 'Y1' is always the lowest value
# matching the 'Items' to the correct rows
dat[, looked_at := lu$Item[Filename==lu$Filename &
between(X, lu$X1, lu$X2) &
between(Y, lu$Y1, lu$Y2)],
by = .(Filename,Time)]
which gives:
> dat
Filename Time X Y looked_at
1: 1 1 10 10 Cat
2: 1 2 15 20 NA
3: 1 3 12 25 NA
4: 2 1 11 15 NA
5: 2 2 10 10 NA
6: 3 1 15 11 NA
7: 3 2 25 12 NA
8: 3 5 20 15 House
9: 3 6 10 10 Mouse
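The question asked for "none" rather than NA where no item is matched. A minimal sketch of that last step, shown on a plain vector with the same values as the looked_at column above: because Item is a factor, it is converted to character first so that "none" can be assigned without adding a factor level.

```r
# Same values as the looked_at column in the output above
looked_at <- factor(c("Cat", NA, NA, NA, NA, NA, NA, "House", "Mouse"))

# Convert to character, then replace the missing matches
looked_at <- as.character(looked_at)
looked_at[is.na(looked_at)] <- "none"
looked_at
```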
Used data:
dat <- structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"),
Time = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", "6"), class = "factor"),
X = structure(c(1L, 4L, 3L, 2L, 1L, 4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor"),
Y = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")),
.Names = c("Filename", "Time", "X", "Y"), row.names = c(NA, -9L), class = "data.frame")
lu <- structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", "3"), class = "factor"),
Item = structure(1:4, .Label = c("Cat", "Dog", "House", "Mouse"), class = "factor"),
X1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"),
X2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"),
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "13", "35"), class = "factor"),
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "13", "35"), class = "factor")),
.Names = c("Filename", "Item", "X1", "X2", "Y1", "Y2"), row.names = c(NA, -4L), class = "data.frame")
