Special case of merge in R

I'm trying to solve this:
For example:
Warehouse
  id      amount size
1 cymbals      5   24
2 snares       3   10
3 tom1         2   19

Incoming
  id     amount size
1 snares      2   15

Resulting
  id      amount size
1 cymbals      5   24
2 snares       5   15
3 tom1         2   19
I am a newbie in R, so I was looking for the most elegant/legible way of getting the 'resulting' table (I am not concerned with performance). The logic would be: for every incoming item, look it up in the warehouse; if it exists, add the amounts and replace the size with the new size; if it doesn't exist, add it.

With dplyr, we can bind the two data frames together, group them by id, sum the amount, and take the last value of size; if the value is present in incoming it will be taken from there, otherwise the size value comes from the warehouse data frame.
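For reference, the two input data frames might be constructed like this (a sketch; column types are assumptions based on the tables above):

warehouse <- data.frame(id = c("cymbals", "snares", "tom1"),
                        amount = c(5, 3, 2),
                        size = c(24, 10, 19))
incoming <- data.frame(id = "snares", amount = 2, size = 15)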
library(dplyr)

bind_rows(warehouse, incoming) %>%
  group_by(id) %>%
  summarise(amount = sum(amount),
            size = last(size))
#   id      amount  size
#   <chr>    <int> <int>
# 1 cymbals      5    24
# 2 snares       5    15
# 3 tom1         2    19

Related

Multiply various subsets of a data frame by different elements of a vector in R

I have a data frame:
df <- data.frame(id = rep(1:10, each = 10),
                 Room1 = rnorm(100, 0.4, 0.5),
                 Room2 = rnorm(100, 0.3, 0.5),
                 Room3 = rnorm(100, 0.7, 0.5))
And a vector:
vals <- sample(7:100, 10)
I want to multiply cols Room1, Room2 and Room3 by a different element of the vector for every unique ID number and output a new data frame (df2).
I managed to multiply each column per id by EVERY element of the vector using the following:
samp_func <- function(x) {
  x * vals[i]
}

for (i in vals) {
  df2 <- df %>% mutate_at(c("Room1", "Room2", "Room3"), samp_func)
}
But in the resulting df (df2), each Room column is multiplied by the same element of the vector vals for all of the different ids, when what I want is each Room column (per id) multiplied by a different element of vals. Sorry in advance if this is not clear; I am a beginner and still getting to grips with the terminology.
Thanks!
EDIT: The desired output should look like the below, where the columns for each ID have been multiplied by a different element of the vector vals.
id Room1 Room2 Room3
1 1 24.674826880 60.1942571 46.81276141
2 1 21.970270107 46.0461779 35.09928150
3 1 26.282357614 -3.5098880 38.68400541
4 1 29.614182061 -39.3025587 25.09146592
5 1 33.030886472 46.0354881 42.68209027
6 1 41.362699668 -23.6624632 26.93845129
7 1 5.429031042 26.7657577 37.49086963
8 1 18.733422977 -42.0620572 23.48992138
9 1 -17.144070723 9.9627315 55.43999326
10 1 45.392182468 20.3959968 -16.52166621
11 2 30.687978299 -11.7194020 27.67351631
12 2 -4.559185345 94.9256561 9.26738357
13 2 86.165076849 -1.2821515 29.36949423
14 2 -12.546711562 47.1763755 152.67588456
15 2 18.285856423 60.5679496 113.85971720
16 2 72.074929648 47.6509398 139.69051486
17 2 -12.332519694 67.8890324 20.73189965
18 2 80.889634991 69.5703581 98.84404415
19 2 87.991093995 -20.7918559 106.13610773
20 2 -2.685594148 71.0611693 47.40278949
21 3 4.764445589 -7.6155681 12.56546664
22 3 -1.293867841 -1.1092243 13.30775785
23 3 16.114831628 -5.4750642 8.58762550
24 3 -0.309470950 7.0656088 10.07624289
25 3 11.225609780 4.2121241 16.59168866
26 3 -3.762529113 6.4369973 15.82362705
27 3 -5.103277731 0.9215625 18.20823042
28 3 -10.623165177 -5.2896293 33.13656839
29 3 -0.002517872 5.0861361 -0.01966699
30 3 -2.183752881 24.4644310 13.55572730
This should solve your problem. You can build a small lookup table with one val per id, merge it onto the data frame, and then use mutate to scale the Room columns.
Also, in the future I'd recommend setting a seed when asking questions with random data as it's easier for someone to replicate your output.
library(dplyr)

set.seed(0)
df <- data.frame(id = rep(1:10, each = 10),
                 Room1 = rnorm(100, 0.4, 0.5),
                 Room2 = rnorm(100, 0.3, 0.5),
                 Room3 = rnorm(100, 0.7, 0.5))
vals <- sample(7:100, 10)

# lookup table: exactly one row per id, so the join does not duplicate rows
other_df <- data.frame(id = 1:10, val = vals)

df2 <- inner_join(other_df, df, by = "id")
df2 <- df2 %>%
  mutate(Room1 = Room1 * val,
         Room2 = Room2 * val,
         Room3 = Room3 * val)
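A shorter alternative is to index vals directly, since the ids here are the integers 1 to 10 (a sketch; assumes dplyr >= 1.0 for across()):

library(dplyr)
# vals[id] looks up one multiplier per row based on its id
df2 <- df %>%
  mutate(across(c(Room1, Room2, Room3), ~ .x * vals[id]))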

Subdividing and counting how many values in particular columns under certain conditions in R

I am new to R and data analysis. I have a database similar to the one below, just a lot bigger, and I was trying to find a general way to count, for each country, how many actions there are and how many subquestions with value 1, value 2 and so on there are. For each action there are multiple questions, subquestions and subsubquestions, but I would love to find a way to count:
1: how many actions there are per country, excluding subquestions;
2: how many subquestions 1 or 2 with value 1 there are for each country, actionn and questionn.
id  country questionn subquestion value actionn
06  NIE     1         1           1     1
05  NIG     1         1           1     1
07  TAN     1         1           1     1
08  BEN     1         1           1     1
03  TOG     1         1           2     1
45  MOZ     1         1           2     1
40  ZIM     1         1           1     1
56  COD     1         1           1     1
87  BFA     1         1           1     1
09  IVC     1         1           2     1
08  SOA     1         1           2     1
02  MAL     1         1           2     1
78  MAI     1         1           2     1
35  GUB     1         1           2     1
87  RWA     1         1           2     1
41  ETH     1         1           1     1
06  NIE     1         2           2     1
05  NIG     1         2           1     1
87  BFA     1         2           1     2
I have tried to create subsets of the data frame and count everything for each country one at a time, but it is going to take forever, and I was wondering if there was a general way to do it.
For the first question I have done this:
df1 <- df %>% group_by(country) %>% summarise(countries = country)
unique(df1)
count(df1)
For the second question I was thinking of individually selecting and counting each row which has questionn=1, subquestion=1, value=1 and actionn=1, then selecting and counting how many rows per country have questionn=1, subquestion=2, value=1, actionn=1, and so on. Value refers to whether the answer to the question is 1=yes or 2=no.
I would be grateful for any help, thank you so much :)
For the first question you can try to do something like this:
df %>%
  filter(subquestion != 2) %>%
  group_by(country) %>%
  summarise(num_actions = n())
This will return the number of actions per country, dropping the rows where the subquestion column equals 2. Note that n() in the summarise call counts the number of observations in each group (in this case, each country).
I am not sure I fully understand the second question, but my suggestion would be to make a new label for the particular combination you want to count (how many subquestions with value 1 there are for each country, actionn and questionn):
df %>%
  mutate(country_question_code = paste(country, actionn, questionn, sep = "_")) %>%
  group_by(country_question_code) %>%
  summarize(num_subquestion = n())
For question 1, a possible solution (assuming country names are not unique and actionn can be 0, 1, 2 or more):
For just the total count:
df %>%
  group_by(country) %>%
  summarise(count_actions = sum(actionn))  # ignores all other columns
If you want to count how many times a country appears, use n() in place of sum(actionn, na.rm = TRUE). This may not be desired, but sometimes the simple solution is the best (it just counts the frequency of each country).
Or
df %>% group_by(country, actionn) %>% summarise(count_actions = n())
will give a country-wise count for each type (say 1, 2 or more actions).
The data.table version: dt[, .(.N), by = .(country, actionn)]
For question 2: group by the "for each" part of your question after filtering the data as required. Here, filter to subquestion 1 or 2 with value 1, for each country, questionn and actionn:
df %>%
  filter(subquestion <= 2 & value == 1) %>%
  group_by(country, questionn, actionn) %>%
  summarise(counts_desired = n(), sums_desired = sum(actionn, na.rm = TRUE))
Hope this works. I am also learning and applying it to similar data.
I have not tested it and have made certain assumptions about your data (numerical and clean). (Also written on mobile while traveling! Cheers!!)
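For question 2, a compact equivalent with dplyr's count() might look like this (a sketch; column names as in the sample data above):

library(dplyr)
# one output row per country/actionn/questionn/subquestion combination,
# counting only rows where the answer was yes (value == 1)
df %>%
  filter(value == 1) %>%
  count(country, actionn, questionn, subquestion)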

Working with repeated values in rows

I am working with a data frame of 46,216 observations where the units are homes and people, and each home may have any number of members, like:
[screenshot of the example data omitted]
and so on for almost 18,000 homes.
What I need to do is get the mean of education years for every home, for which I guess I will need a variable that gives the number of people in each home.
What I tried is:
num_peopl <- by(df$person_number, df$home, max)
i.e., for each home I take the highest person number as the total number of people who live there, but when I try to cbind this with the df I get:
"arguments imply differing number of rows: 46216, 17931"
It is like it puts the number of persons only in one row and leaves the others empty.
How can I do this? Is there a function?
I think aggregate and a join may be what you're looking for. aggregate does the same thing that you did, but returns the result in a data frame, which is easier to work with.
Then I used dplyr's left_join, joining on the home numbers:
library(tidyverse)

df <- data.frame(home_number = c(1, 1, 1, 2, 2, 3),
                 person_number = c(1, 2, 3, 1, 2, 1),
                 age = c(20, 21, 1, 54, 50, 30),
                 sex = c("m", "f", "f", "m", "f", "f"),
                 salary = c(1000, 890, NA, 900, 500, 1200),
                 years_education = c(12, 10, 0, 8, 7, 14))

df2 <- aggregate(df$person_number, by = list(df$home_number), max)

df_final <- df %>%
  left_join(df2, by = c("home_number" = "Group.1"))
  home_number person_number age sex salary years_education x
1           1             1  20   m   1000              12 3
2           1             2  21   f    890              10 3
3           1             3   1   f     NA               0 3
4           2             1  54   m    900               8 2
5           2             2  50   f    500               7 2
6           3             1  30   f   1200              14 1
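Since the stated goal is the mean of education years per home, that can also be computed directly in one pass (a sketch using the example df above):

library(dplyr)
df %>%
  group_by(home_number) %>%
  summarise(n_people = max(person_number),
            mean_educ = mean(years_education, na.rm = TRUE))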

Summing depth data (consecutive rows) in R

How is it possible to sum up consecutive depth data with R?
For instance:
a <- data.frame(label = as.factor(c("Air","Air","Air","Air","Air","Air",
                                    "Wood","Wood","Wood","Wood","Wood",
                                    "Air","Air","Air","Air",
                                    "Stone","Stone","Stone","Stone",
                                    "Air","Air","Air","Air","Air",
                                    "Wood","Wood")),
                depth = as.numeric(c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,
                                     9,10,11,10,11,12,10,12,13,14,14)))
The given output should be something like:
Label Depth
Air 7
Wood 3
Stone 1
First the removal of negative values is done with cummax(), because depth can only increase in this special case. Hence:
label depth
1 Air 1
2 Air 2
3 Air 3
4 Air 3
5 Air 4
6 Air 5
7 Wood 5
8 Wood 5
9 Wood 5
10 Wood 6
11 Wood 8
12 Air 9
13 Air 9
14 Air 9
15 Air 10
16 Stone 10
17 Stone 10
18 Stone 11
19 Stone 11
20 Air 11
21 Air 12
22 Air 12
23 Air 12
24 Air 13
25 Wood 14
26 Wood 14
Now, taking max minus min of the depth within every consecutive run, you would get the following (the question is how to do this step):
  label depth
1 Air       4
2 Wood      3
3 Air       1
4 Stone     1
5 Air       2
6 Wood      0
And finally summing up those max-min values the output is the one presented above.
Steps tried to achieve the output:
The first obvious solution would be for instance for Air:
diff(cummax(a[a$label=="Air",]$depth))
This solution gets rid of the negative data, which is necessary due to an expected constant increase in depth.
The problem is the output also takes into account the big steps in between each consecutive subset. Hence, the sum for Air would be 12 instead of 7.
[1] 1 1 0 1 1 4 0 0 1 1 1 0 0 1
Even worse would be a solution with aggregate, e.g.:
aggregate(depth~label, a, FUN=function(x){sum(x>0)})
Note: solutions that filter out big jumps are not what I'm looking for. Sure, you could hard-code a limit, for instance < 2, for the example of Air once again:
sum(diff(cummax(a[a$label=="Air",]$depth))[diff(cummax(a[a$label=="Air",]$depth))<2])
This gives you almost the right result but does not work as expected here. I'm pretty sure there is already a function for what I'm looking for, because it is not an uncommon problem in many different tasks.
I guess taking the minimum and maximum value of each set of consecutive rows per material and summing those up would be one possible solution, but I'm not sure how to apply a function to only the consecutive subsets.
You can use data.table::rleid to quickly group by run, or reconstruct it with rle if you really like. After that, aggregating is fairly easy in any grammar. In dplyr,
library(dplyr)

a <- data.frame(label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood",
                          "Air","Air","Air","Air","Stone","Stone","Stone","Stone",
                          "Air","Air","Air","Air","Air","Wood","Wood"),
                depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14))

a2 <- a %>%
  # filter to rows where the previous value is lower, equal, or NA
  filter(depth >= lag(depth) | is.na(lag(depth))) %>%
  # group by label and its run
  group_by(label, run = data.table::rleid(label)) %>%
  summarise(depth = max(depth) - min(depth))  # aggregate

a2 %>% arrange(run)  # sort to make it pretty
#> # A tibble: 6 x 3
#> # Groups: label [3]
#> label run depth
#> <fctr> <int> <dbl>
#> 1 Air 1 4
#> 2 Wood 2 3
#> 3 Air 3 1
#> 4 Stone 4 1
#> 5 Air 5 2
#> 6 Wood 6 0
a3 <- a2 %>% summarise(depth = sum(depth)) # a2 is still grouped, so aggregate more
a3
#> # A tibble: 3 x 2
#> label depth
#> <fctr> <dbl>
#> 1 Air 7
#> 2 Stone 1
#> 3 Wood 3
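As mentioned above, if you'd rather avoid the data.table dependency, the run id can be reconstructed with base rle() (a sketch):

# run id equivalent to data.table::rleid(a$label)
r <- rle(as.character(a$label))
run <- rep(seq_along(r$lengths), r$lengths)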
A base R method using aggregate is
aggregate(cbind(val = cummax(a$depth)),
          list(label = a$label, ID = c(0, cumsum(diff(as.integer(a$label)) != 0))),
          function(x) diff(range(x)))
The first argument to aggregate calculates the cumulative maximum, as the OP does above; the use of cbind names the calculated column in the final output. The second argument is the grouping argument: instead of rle, it builds run ids by taking the cumulative sum of an indicator of label changes. Finally, the third argument provides the function which calculates the desired output by taking the difference of the range within each group.
This returns
label ID val
1 Air 0 4
2 Wood 1 3
3 Air 2 1
4 Stone 3 1
5 Air 4 2
6 Wood 5 0
The data.table way (borrowing in part from @alistaire):
library(data.table)

setDT(a)
a[, depth := cummax(depth)]
depth_gain <- a[,
  list(
    depth = max(depth) - depth[1],  # only need the starting and max values
    label = label[1]
  ),
  by = rleidv(label)
]
result <- depth_gain[, list(depth = sum(depth)), by = label]
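Printing result should reproduce the totals asked for in the question:

result
#    label depth
# 1:   Air     7
# 2:  Wood     3
# 3: Stone     1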

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange; my apologies, as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth =
ID   Start   Stop   SNR
1   213466 213468 10.08
2    32238  32240 10.28
3   218934 218936 12.02
4   222774 222776 11.40
5    68137  68139 10.99
And another data frame with a list of times that represent possible 'real' detections:
possible =
ID      Times
1    32239.76
2    32241.14
3    68138.72
4   111233.93
5   128395.28
6   146180.31
7   188433.35
8   198714.70
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For every row in "truth" that finds a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID; instead I want to run through the data frames entirely. Output should look something like:
ID      Times Between
1    32239.76       0
2    32241.14       1
3    68138.72       0
4   111233.93       0
5   128395.28       0
6   146180.31       1
7   188433.35       0
8   198714.70       0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
#    id start stop
# 1:  1     1    5
# 2:  2     7   10
# 3:  3    12   15
# 4:  4    17   20
# 5:  5    22   26
possible
#    id times
# 1:  1     3
# 2:  2    11
# 3:  3    13
# 4:  4    28

library(data.table)
setDT(truth)
setDT(possible)

melt(truth, measure.vars = c("start", "stop"), value.name = "times")[
  possible, on = "times", roll = TRUE
][, .(id = i.id, truthid = id, times,
      status = factor(variable, labels = c("in", "out")))]
#    id truthid times status
# 1:  1       1     3     in
# 2:  2       2    11    out
# 3:  3       3    13     in
# 4:  4       5    28    out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
# generate random test data
set.seed(7)
truth <- data.frame(ID    = 1:100,
                    Start = sample(5:20, size = 100, replace = TRUE),
                    Stop  = sample(21:50, size = 100, replace = TRUE))
possible <- data.frame(Times = sample(1:15, size = 15, replace = FALSE))
After getting sample data to work with; the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
# need the %between% operator
library(data.table)

# initialize vectors - 0 or FALSE by default
truth.match <- rep(0, times = nrow(truth))
possible.between <- rep(0, times = nrow(possible))

# iterate through the 'possible' data frame
for (i in 1:nrow(possible)) {
  # boolean vector showing whether any of the 'truth' rows are a match
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  # if any are TRUE then update the match and between vectors
  if (any(match.vec)) {
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
# I think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
# similarly, betweenAny
possible$betweenAny <- possible.between
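For what it's worth, the same check can be vectorized without the explicit loop (a sketch, assuming the Start/Stop/Times column names used above):

# for each time, test whether it falls inside any truth interval
possible$betweenAny <- as.integer(
  sapply(possible$Times, function(t) any(t >= truth$Start & t <= truth$Stop)))
# for each interval, test whether any time falls inside it
truth$anyMatch <- as.integer(
  sapply(seq_len(nrow(truth)),
         function(i) any(possible$Times >= truth$Start[i] &
                         possible$Times <= truth$Stop[i])))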
