Assigning elements of one vector to elements of another with R

I would like to assign elements of one vector to elements of another for every single user.
For example:
Within a data frame with the variables "user", "activities" and "minutes" (see below), I would like to assign, for user 1, the duration of the first activity (4 minutes, "READ") to a new variable READ_duration, the duration of the second activity (5 minutes, "EDIT") to a new variable EDIT_duration, and the duration of the third activity (2 minutes, again "READ") to READ_duration as well.
user <- c(1, 2, 3)
activities <- list(c("READ","EDIT","READ"), c("READ","EDIT","WRITE"), c("WRITE","EDIT"))
minutes <- list(c(4, 5, 2), c(3.5, 1, 2), c(4.5, 3))
The output should be a data frame with the minutes assigned to the activities:
user READ_duration EDIT_duration WRITE_duration
   1           6.0             5            0.0
   2           3.5             1            2.0
   3           0.0             3            4.5
The tricky thing here is that the algorithm needs to account for the activities not being in the same order for every user. For example, user 3 starts with writing, so the duration 4.5 needs to end up in the WRITE_duration column.
Also, a loop (or some similarly scalable approach) would be needed due to the massive number of users.
Thank you so much for your help!!

This needs a simple reshape to wide format with sum as an aggregation function.
Prepare a long-format data.frame:
user <- c(1,2,3)
activities <- list(c("READ","EDIT","READ"), c("READ","EDIT", "WRITE"), c("WRITE","EDIT"))
minutes <- list(c(4,5,2), c(3.5, 1, 2), c(4.5,3))
DF <- Map(data.frame, user = user, activities = activities, minutes = minutes)
DF <- do.call(rbind, DF)
# user activities minutes
#1 1 READ 4.0
#2 1 EDIT 5.0
#3 1 READ 2.0
#4 2 READ 3.5
#5 2 EDIT 1.0
#6 2 WRITE 2.0
#7 3 WRITE 4.5
#8 3 EDIT 3.0
Reshape:
library(reshape2)
dcast(DF, user ~ activities, value.var = "minutes", fun.aggregate = sum)
# user EDIT READ WRITE
#1 1 5 6.0 0.0
#2 2 1 3.5 2.0
#3 3 3 0.0 4.5

In base R you could do:
xtabs(min ~ ind + values, cbind(stack(setNames(activities, user)), min = unlist(minutes)))
#     values
# ind  EDIT READ WRITE
#   1   5.0  6.0   0.0
#   2   1.0  3.5   2.0
#   3   3.0  0.0   4.5
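A tidyverse equivalent, for anyone who prefers it: tidyr's pivot_wider can do the widen-and-sum in one call. A sketch, rebuilding the same long-format DF as above:

```r
library(tidyr)

# rebuild the long-format data frame from the question
user <- c(1, 2, 3)
activities <- list(c("READ","EDIT","READ"), c("READ","EDIT","WRITE"), c("WRITE","EDIT"))
minutes <- list(c(4, 5, 2), c(3.5, 1, 2), c(4.5, 3))
DF <- do.call(rbind, Map(data.frame, user = user,
                         activities = activities, minutes = minutes))

# widen: one column per activity, summing repeats and filling gaps with 0
wide <- pivot_wider(DF, id_cols = user, names_from = activities,
                    values_from = minutes, values_fn = sum, values_fill = 0)
wide
```

values_fn = sum plays the role of dcast's fun.aggregate, and values_fill = 0 replaces the NAs for activities a user never performed.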

Pulling columns based on values in a row

I am looking for a way to use the values in the first row to help filter values I want. Say if I want to keep certain columns in R based on the values in the first row. So in the first row, we have -0.5, 0.7, 1.1, and -1.2.
I want to only keep values that are equal to or greater than 1, or less than or equal to -1.2. Everything else will just be dropped.
So say my original data I have is DF1:
ID     Location XPL  SNA AAS APA
First  Park     -0.5 0.7 1.1 -1.2
Second School   2    5   2   3
Second Home     4    5   6   4
Third  Car      1    8   8   5
Third  Lake     7    5   4   6
Fourth Prison   4    5   1   7
With the filter, I would now have a new DF:
ID     Location AAS APA
First  Park     1.1 -1.2
Second School   2   3
Second Home     6   4
Third  Car      8   5
Third  Lake     4   6
Fourth Prison   1   7
What would be the best way to do this? I feel there must be a way to select columns based on values from a row, but I can't think of the right commands.
ID <- c("First", "Second", "Second", "Third", "Third", "Fourth")
Location <- c("Park", "School", "Home", "Car", "Lake", "Prison")
XPL <- c(-0.5,2,4,1,7,4)
SNA <- c(0.7,5,5,8,5,5)
AAS <- c(1.1,2,6,8,4,1)
APA <- c(-1.2,3,4,5,6,7)
DF1 <- data.frame(ID, Location, XPL, SNA, AAS,APA)
In dplyr, you can select numeric columns whose first absolute value is above 1:
library(dplyr)
DF1 %>%
  select(!where(~ is.numeric(.x) && abs(first(.x)) <= 1))
#       ID Location AAS  APA
# 1  First     Park 1.1 -1.2
# 2 Second   School 2.0  3.0
# 3 Second     Home 6.0  4.0
# 4  Third      Car 8.0  5.0
# 5  Third     Lake 4.0  6.0
# 6 Fourth   Prison 1.0  7.0
Or with between:
DF1 %>%
  select(!where(~ is.numeric(.x) && between(first(.x), -1.19, 0.99)))
If you are using the first row as the basis, you can convert it to a plain numeric vector and use the which function to find the indexes of the columns to keep.
test.row <- as.numeric(DF1[1, 3:6])
The 3 and 6 correspond to the range of indexes from XPL to APA.
DF1 <- DF1[, c(1:2, 2 + which(test.row >= 1 | test.row <= -1.2))]
We keep the columns ID and Location as 1:2, and offset the result of which by 2 so the indexes line up with the full data frame.
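A base R variant of the same idea (a sketch, using the DF1 from the question): build the keep-condition from the first row of the numeric columns by name, so the positional offset bookkeeping disappears:

```r
# DF1 as defined in the question
ID <- c("First", "Second", "Second", "Third", "Third", "Fourth")
Location <- c("Park", "School", "Home", "Car", "Lake", "Prison")
XPL <- c(-0.5, 2, 4, 1, 7, 4)
SNA <- c(0.7, 5, 5, 8, 5, 5)
AAS <- c(1.1, 2, 6, 8, 4, 1)
APA <- c(-1.2, 3, 4, 5, 6, 7)
DF1 <- data.frame(ID, Location, XPL, SNA, AAS, APA)

# first-row values of the numeric columns, keeping their names
first.row <- unlist(DF1[1, sapply(DF1, is.numeric)])
# keep columns whose first value is >= 1 or <= -1.2, as the question asks
keep <- names(first.row)[first.row >= 1 | first.row <= -1.2]
kept <- DF1[, c("ID", "Location", keep)]
kept
```

Because the selection is by name, this still works if the ID/Location columns move or more numeric columns are added.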

How to rank data from multiple rows and columns?

Example data:
>data.frame("A" = c(20,40,53), "B" = c(40,11,60))
What's the easiest way in R to get from this
A B
1 20 40
2 40 11
3 53 60
to this?
A B
1 2.0 3.5
2 3.5 1.0
3 5.0 6.0
I couldn't find a way to make rank() or frank() work on multiple rows/columns and googling things like "r rank dataframe" "r rank multiple rows" yielded only questions on how to rank multiple rows/columns individually, which is weird, as I suspect the question must have been answered before.
Try rank like below (assigning back into df[] keeps the data frame's shape):
df[] <- rank(df)
or
df <- list2DF(relist(rank(df), skeleton = unclass(df)))
and you will get
> df
A B
1 2.0 3.5
2 3.5 1.0
3 5.0 6.0
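rank() happens to flatten the data frame's values column-major here, but if you want to make that explicit rather than rely on that behaviour, converting to a matrix first does the same thing:

```r
df <- data.frame(A = c(20, 40, 53), B = c(40, 11, 60))
# rank all six cells jointly; tied values share the average rank
df[] <- rank(as.matrix(df))
df
#     A   B
# 1 2.0 3.5
# 2 3.5 1.0
# 3 5.0 6.0
```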

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "Between", and a column in the "truth" data frame called "Match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For every row in "truth" that finds a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID; instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data. Next time, please provide this from your own data set in your post using dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
                    c(sample(5:20, size = 100, replace = T)),
                    c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. It should scale directly to your own dataset as it seems to be laid out. Reply below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
  #get boolean vector to show if any of the 'truth' rows are a 'match'
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  #if any are true then update the match and between vectors
  if(any(match.vec)){
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
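As the answer invites, here is a more efficient variant: the double iteration can be vectorised with outer(), which compares every time against every interval bound in one shot. A sketch against the same generated data (the start/stop column names below are an assumption for readability; the original generated columns were unnamed and indexed by position):

```r
set.seed(7)
truth <- data.frame(id    = 1:100,
                    start = sample(5:20, size = 100, replace = TRUE),
                    stop  = sample(21:50, size = 100, replace = TRUE))
possible <- data.frame(Times = sample(1:15, size = 15, replace = FALSE))

# hits[i, j] is TRUE when possible$Times[i] lies within [start[j], stop[j]]
hits <- outer(possible$Times, truth$start, ">=") &
        outer(possible$Times, truth$stop, "<=")

# a time is "between" if it falls in any interval; an interval "matches"
# if it contains any time
possible$betweenAny <- as.integer(rowSums(hits) > 0)
truth$anyMatch <- as.integer(colSums(hits) > 0)
```

The 15 x 100 logical matrix replaces both the for loop and the apply call, at the cost of O(n * m) memory, which is fine at this scale.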

Is there a package that I can use in order to get rules for a target outcome in R

For example, in the given data set below, I would like to get the best values of each variable that will yield a pre-set value of "percentage": say I need the value of "percentage" to be >= 0.7, in which case the outcome should be something like:
birds >= 5, 1 < wolfs <= 3, 2 <= snakes <= 4
Example data set:
dat <- read.table(text = "birds wolfs snakes percentage
3 8 7 0.50
1 2 3 0.33
5 1 1 0.66
6 3 2 0.80
5 2 4 0.74", header = TRUE)
I can't use decision trees, as I have a large data frame and can't read the whole tree clearly. I tried the *arules* package, but it requires that all variables be factors, while I have a mixed dataset of factor, logical and continuous variables, and I would like to keep the independent variables continuous. Also, "percentage" is the only variable I want to optimize.
The code that I wrote with the *arules* package is this:
library(arules)
dat$birds<-as.factor(dat$birds)
dat$wolfs<-as.factor(dat$wolfs)
dat$snakes<-as.factor(dat$snakes)
dat$percentage<-as.factor(dat$percentage)
rules<-apriori(dat, parameter = list(minlen=2, supp=0.005, conf=0.8))
Thank you
I may have misunderstood the question but to get the maximum value of each variable with the restriction of percentage >= 0.7 you could do this:
lapply(dat[dat$percentage >= 0.7, 1:3], max)
$birds
[1] 6
$wolfs
[1] 3
$snakes
[1] 4
Edit after comment:
So perhaps this is more what you are looking for:
> as.data.frame(lapply(dat[dat$percentage >= 0.7,1:3], function(y) c(min(y), max(y))))
birds wolfs snakes
1 5 2 2
2 6 3 4
It will give the min and max values representing the ranges of variables if percentage >=0.7
If this is completely missing what you are trying to achieve, I may not be the right person to help you.
Edit #2:
> as.data.frame(lapply(dat[dat$percentage >= 0.7,1:3], function(y) c(min(y), max(y), length(y), length(y)/nrow(dat))))
birds wolfs snakes
1 5.0 2.0 2.0
2 6.0 3.0 4.0
3 2.0 2.0 2.0
4 0.4 0.4 0.4
Row 1: min
Row 2: max
Row 3: number of observations meeting the condition
Row 4: percentage of observations meeting the condition (relative to total observations)
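Base R's range() returns the min/max pair directly, so the first two rows of that summary can be written more compactly. A sketch using the dat from the question:

```r
dat <- read.table(text = "birds wolfs snakes percentage
3 8 7 0.50
1 2 3 0.33
5 1 1 0.66
6 3 2 0.80
5 2 4 0.74", header = TRUE)

# row 1 is the minimum, row 2 the maximum, under percentage >= 0.7
res <- sapply(dat[dat$percentage >= 0.7, 1:3], range)
res
#      birds wolfs snakes
# [1,]     5     2      2
# [2,]     6     3      4
```

sapply simplifies the per-column ranges into a matrix, which reads directly as "variable ranges under the restriction".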

R: repeat series of numbers within groups a number of times that differs among groups

I have a data frame that looks something like the one below, which I'll call data frame 1. There is no regular pattern to the number of rows associated with each number in the “tank” column (or the other columns for that matter).
#code for making data frame 1
tank<-c(1,1,2,3,3,3,4,4)
size<-c(2.1,3.5,2.3,4.0,3.3,2.2,1.9,3.0)
mass<-c(6.5,5.5,5.9,7.2,4.9,8.0,9.1,6.3)
df1<-data.frame(cbind(tank,size,mass))
I need to repeat the sequence of values found in the "size" and "mass" columns within each tank. However, the number of repeats for each tank's sequence will differ (again in no particular pattern). I have another data frame (data frame 2) that contains the number of repeats for each tank's sequence, and it looks something like this:
#code for making data frame 2
tank<-c(1,2,3,4)
rpeat<-c(3,1,2,2)
df2<-data.frame(cbind(tank,rpeat))
Ultimately, my goal is to have a data frame like this (see below). Each series of values within a tank is repeated a number of times equal to that specified in data frame 2.
#code for making data frame 3
tank<-c(1,1,1,1,1,1,2,3,3,3,3,3,3,4,4,4,4)
size<-c(2.1,3.5,2.1,3.5,2.1,3.5,2.3,4.0,3.3,2.2,4.0,3.3,2.2,1.9,3.0,1.9,3.0)
mass<-c(6.5,5.5,6.5,5.5,6.5,5.5,5.9,7.2,4.9,8.0,7.2,4.9,8.0,9.1,6.3,9.1,6.3)
df3<-data.frame(cbind(tank,size,mass))
I have figured out a somewhat crude way to do this when each number in the size and mass columns is just repeated a specified number of times (see below) but not how to create the repeating series that I need.
#code to make data frame 4
tank<-c(1,1,1,1,1,1,2,3,3,3,3,3,3,4,4,4,4)
size2<-c(2.1,2.1,2.1,3.5,3.5,3.5,2.3,4.0,4.0,3.3,3.3,2.2,2.2,1.9,1.9,3.0,3.0)
mass2<-c(6.5,6.5,6.5,5.5,5.5,5.5,5.9,7.2,7.2,4.9,4.9,8.0,8.0,9.1,9.1,6.3,6.3)
df4<-data.frame(cbind(tank,size2,mass2))
To make the above data frame, I took the data frame below, which combines data frames 1 and 2, and applied the code below.
#code to produce data frame 5
tank<-c(1,1,2,3,3,3,4,4)
size<-c(2.1,3.5,2.3,4.0,3.3,2.2,1.9,3.0)
mass<-c(6.5,5.5,5.9,7.2,4.9,8.0,9.1,6.3)
rpeat<-c(3,3,1,2,2,2,2,2)
df5<-data.frame(cbind(tank,size,mass,rpeat))
#code to produce data frame 4 from data frame 5
tank_col <- rep(df5$tank, times = df5$rpeat)
size_col <- rep(df5$size, times = df5$rpeat)
mass_col <- rep(df5$mass, times = df5$rpeat)
goal <-data.frame(cbind(tank_col,size_col,mass_col))
Sorry this is so long, but I have a hard time explaining what I need to do without providing examples. Thanks in advance for any help you can provide.
You can use data.table:
library(data.table)
# create df1 and df2 as data.tables keyed by tank
DT1 <- data.table(df1, key = 'tank')
DT2 <- data.table(df2, key = 'tank')
# you can now join on tank, and repeat all columns in
# .SD (the subset of the data.table)
DT1[DT2, lapply(.SD, rep, times = rpeat)]
#    tank size mass
# 1: 1 2.1 6.5
# 2: 1 3.5 5.5
# 3: 1 2.1 6.5
# 4: 1 3.5 5.5
# 5: 1 2.1 6.5
# 6: 1 3.5 5.5
# 7: 2 2.3 5.9
# 8: 3 4.0 7.2
# 9: 3 3.3 4.9
# 10: 3 2.2 8.0
# 11: 3 4.0 7.2
# 12: 3 3.3 4.9
# 13: 3 2.2 8.0
# 14: 4 1.9 9.1
# 15: 4 3.0 6.3
# 16: 4 1.9 9.1
# 17: 4 3.0 6.3
Read the vignettes associated with data.table to get a full understanding of what is going on.
What we are doing is called by-without-by within the vignettes.
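If you'd rather avoid data.table, the same group-wise sequence repetition can be sketched in base R by splitting the row indices by tank and repeating each block whole. The key fact, which the data.table answer also relies on, is that rep(x, times = n) with a scalar n repeats the full sequence, not each element:

```r
# df1 and df2 as defined in the question
tank <- c(1, 1, 2, 3, 3, 3, 4, 4)
size <- c(2.1, 3.5, 2.3, 4.0, 3.3, 2.2, 1.9, 3.0)
mass <- c(6.5, 5.5, 5.9, 7.2, 4.9, 8.0, 9.1, 6.3)
df1 <- data.frame(tank, size, mass)
df2 <- data.frame(tank = c(1, 2, 3, 4), rpeat = c(3, 1, 2, 2))

# row indices grouped by tank, each group's index block repeated rpeat times
idx <- unlist(Map(rep, split(seq_len(nrow(df1)), df1$tank),
                  df2$rpeat[order(df2$tank)]))
df3 <- df1[idx, ]
rownames(df3) <- NULL
```

Ordering df2 by tank keeps the repeat counts aligned with the groups that split() produces.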
