Summing up responses on several factor levels in R

I have two identically coded variables because respondents were allowed multiple responses.
Let's say the variables describe hobbies: 1 = football, 2 = ice hockey, 3 = I have no hobbies.
Thus, one person can have two hobbies: football plus ice hockey.
hobby1 <- c(1, 2, 3)
hobby1 <- factor(hobby1, labels = c("football", "ice hockey", "I have no hobbies"))
hobby2 <- c(1, 2, 3)
hobby2 <- factor(hobby2, labels = c("football", "ice hockey", "I have no hobbies"))
Now I am trying to extract the number of hobbies, ranging from 0 to 2.
I already tried:
sum(hobby1<2, hobby2<2)
How can this be done, given that the sum function does not work on factors?
Also, my attempt would not take the third category (no hobbies) into account.
Should I perhaps rearrange my data, e.g. with dummy coding (football yes/no, ...)?

Dummy coding would be an easier approach, since once you transform the data into a factor you can no longer use sum or the < operator easily. This works in base R:
df <- data.frame(football = c(0, 1, 1, 0),
                 ice_hockey = c(1, 1, 0, 0))
df$num_hobbies <- rowSums(df[, 1:2])
df
# football ice_hockey num_hobbies
#        0          1           1
#        1          1           2
#        1          0           1
#        0          0           0
Or using dplyr to take advantage of column names a little more easily:
library(dplyr)
df <- data.frame(football = c(0, 1, 1, 0),
                 ice_hockey = c(1, 1, 0, 0)) %>%
  mutate(num_hobbies = football + ice_hockey)
df
# football ice_hockey num_hobbies
#        0          1           1
#        1          1           2
#        1          0           1
#        0          0           0
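If you want to keep the original factor coding, a minimal sketch of counting hobbies directly is also possible, assuming hobby1 and hobby2 are the factors defined above and that "I have no hobbies" should not count:
hobbies <- data.frame(hobby1, hobby2)
# each response counts as a hobby unless it is "I have no hobbies"
num_hobbies <- rowSums(hobbies != "I have no hobbies")
num_hobbies
#> [1] 2 2 0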


Calculation between vectors in nested for loops

I am struggling with an issue involving nested for loops and calculations with conditions.
Let's say I have a data frame like this:
df = data.frame("a" = c(2, 3, 3, 4),
"b" = c(4, 4, 4, 4),
"c" = c(5, 5, 4, 4),
"d" = c(3, 4, 4, 2))
With this df, I want to compare elements between vectors with a condition: if the absolute difference between two elements is lower than 2 (i.e., a difference of 0 or 1), I'd like to append 1 to a newly created vector, while if the absolute difference is >= 2, I'd like to append 0.
For example, comparing vector "a" with the other vectors "b", "c", "d" in the first row, I want this result: 0 0 1. The first 0 comes from the difference of 2 between a1 and b1; the second 0 from the difference of 3 between a1 and c1; the 1 from the difference of 1 between a1 and d1. So I tried to build a nested for loop to apply the same procedure to the elements in the following rows as well.
So my first trial was like this:
list_all = list(df$a, df$b, df$c, df$d)
v0 <- c()
for (i in list_all) {
  for (j in list_all) {
    if (i != j) {
      if (abs(i - j) < 2) {
        v0 <- c(v0, 1)
      } else {
        v0 <- append(v0, 0)
      }
    } else {
      next
    }
  }
}
The result is like this:
v0
[1] 0 0 1 0 1 1 0 1 0 1 1 0
But it seems that the calculation has been made only among the first elements but not among the following elements.
So my second trial was like this:
list = list(df$b, df$c, df$d)
v1<-c()
for (i in df$a){
for (j in list){
if(abs(i-j)<2) {
v1<-append(v1, 1)
} else {
v1<-append(v1, 0)
}
}
}
v1
[1] 0 0 1 1 0 1 1 0 1 1 1 1
It seems like the calculations were made between all elements of df$a and ONLY the first elements of the others. So this is not what I needed, either.
When I put df$b instead of list in the nested for loop, the result is even more messy.
v2 <- c()
for (i in df$a) {
  for (j in df$b) {
    if (abs(i - j) < 2) {
      v2 <- append(v2, 1)
    } else {
      v2 <- append(v2, 0)
    }
  }
}
v2
[1] 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
It seems like the calculation has not been made between corresponding elements (in the same rows), but between all pairs of elements regardless of position.
Could anyone tell me how to fix the problem? I don't understand why the nested for loop only works on the first elements.
Thank you in advance.
I'm not sure if I understood it all correctly, but how about this?
df = data.frame("a" = c(2, 3, 3, 4),
"b" = c(4, 4, 4, 4),
"c" = c(5, 5, 4, 4),
"d" = c(3, 4, 4, 2))
as.vector(apply(df, 1, \(x) ifelse(abs(x[1] - x[2:4]) < 2, 1, 0)))
#> [1] 0 0 1 1 0 1 1 1 1 1 1 0
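Note that the \(x) lambda shorthand needs R 4.1 or later; on older versions, the same call can be written with function(x):
as.vector(apply(df, 1, function(x) ifelse(abs(x[1] - x[2:4]) < 2, 1, 0)))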
I think you're making life unnecessarily complicated for yourself. If I understand you correctly, you can do what you want without nesting loops at all.
The key thing to remember is that R is vectorised by default: an operation on a vector is applied to every element at once, so there is no need to loop. For example, if a is a vector with the values 1 and 2 and I write a + 1, the result is a vector with the values 2 and 3. (This is also why your loops misbehave: a condition like i != j inside if() only considers the first elements of the vectors i and j.)
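To make that concrete:
a <- c(1, 2)
a + 1
#> [1] 2 3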
Applying this logic to your case, you can write:
df$diffB <- ifelse(abs(df$a-df$b) < 2, 1, 0)
df$diffC <- ifelse(abs(df$a-df$c) < 2, 1, 0)
df$diffD <- ifelse(abs(df$a-df$d) < 2, 1, 0)
df
Giving
  a b c d diffB diffC diffD
1 2 4 5 3     0     0     1
2 3 4 5 4     1     0     1
3 3 4 4 4     1     1     1
4 4 4 4 2     1     1     0
You can write a loop to loop over columns if you wish, and Aron has given you one option to do this in his answer.
Personally, I find that using the tidyverse results in code that's easier to understand than code written in base R, because I can read tidyverse code from left to right, whereas base R code (often) needs to be read from the inside out. The tidyverse's syntax is also more consistent than base R's.
Here's how I would solve your problem using the tidyverse:
library(tidyverse)
df %>%
  mutate(
    diffB = ifelse(abs(a - b) < 2, 1, 0),
    diffC = ifelse(abs(a - c) < 2, 1, 0),
    diffD = ifelse(abs(a - d) < 2, 1, 0)
  )
And the "loop over columns" becomes
df %>%
  mutate(
    across(
      c(b, c, d),
      ~ ifelse(abs(a - .x) < 2, 1, 0),
      .names = "diff{.col}"
    )
  )

How to create dummy variable based on the value of two columns in R?

The question title might not completely reflect my problem, which is perhaps why I cannot come up with a solution. I have read similar questions (e.g., Assign a value to column based on condition across rows or R: Generate a dummy variable based on the existence of one column's value in another column) and online guides on creating dummy variables in R (I'm quite new to R), but none seems to tackle my problem, or perhaps I just couldn't see how. I have a dataframe like this:
df <- data.frame("Country" = c("US", "US", "US", "US", "IT", "IT", "IT","FR","FR"),
"Time" = c(1, 1, 2, 3, 1, 2, 1, 2, 3))
The real dataframe is more complex, but each row is an observation of one country at a given moment in time. I want to create a dummy variable that takes the value 1 for an observation if its country was measured at times 1, 2, and 3, and 0 otherwise, and another dummy that takes the value 1 if the country was measured at times 2 and 3 but not at time 1, and 0 otherwise. So the dataframe would look like:
df2 <- data.frame("Country" = c("US", "US", "US", "US", "IT", "IT", "IT", "FR", "FR"),
                  "Time" = c(1, 1, 2, 3, 1, 2, 1, 2, 3),
                  "DummyTime123" = c(1, 1, 1, 1, 0, 0, 0, 0, 0),
                  "DummyTime23" = c(0, 0, 0, 0, 0, 0, 0, 1, 1))
So, since the US was measured at times 1, 2, and 3, the American observations take the value 1 on DummyTime123 and the rest of the observations take 0. And since FR was measured only at times 2 and 3, the French observations take the value 1 on DummyTime23 and the rest take 0 (note that the US also takes 0 here, because it was measured at times 1, 2, and 3, not only at times 2 and 3).
I have tried to create a dummy for instance with if_else within the tidyverse, like
DummyTime123 = ifelse(country = country, time = 1 & time = 2 & time = 3)
But this does not work, quite reasonably, because no single observation is measured at times 1, 2, and 3. Instead, I want to create a dummy for an observation based on whether its country was measured at several (and specific) times. I have also considered that my dataframe could be untidy, but I cannot see how, and I don't think that's the problem. Of course, I could do this manually (that's what I have done so far), but since the dataset is quite large, I would like to find an automated solution.
Does anybody have a solution for this problem? It would be really nice if there was a solution within the tidyverse, but of course any solution would be helpful.
With tidyverse you could try the following.
Use group_by with Country so that all the Time values within each Country are considered together.
To satisfy the DummyTime123 criteria, you need all of the values 1, 2, and 3 to appear in Time within a Country. The unary + then coerces the logical result to integer, so TRUE becomes 1.
For DummyTime23, it sounds like you want both 2 and 3 in Time but no value of Time equal to 1. With & you can make sure both criteria are satisfied.
Let me know if this provides the results expected.
library(tidyverse)
df %>%
  group_by(Country) %>%
  mutate(DummyTime123 = +all(1:3 %in% Time),
         DummyTime23 = +(all(2:3 %in% Time) & !any(Time == 1)))
Output
  Country  Time DummyTime123 DummyTime23
  <chr>   <dbl>        <int>       <int>
1 US          1            1           0
2 US          1            1           0
3 US          2            1           0
4 US          3            1           0
5 IT          1            0           0
6 IT          2            0           0
7 IT          1            0           0
8 FR          2            0           1
9 FR          3            0           1
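For what it's worth, here is a base R sketch of the same logic (my addition, not part of the answer above), using ave() to evaluate the per-Country conditions:
df$DummyTime123 <- ave(df$Time, df$Country,
                       FUN = function(t) +all(1:3 %in% t))
df$DummyTime23 <- ave(df$Time, df$Country,
                      FUN = function(t) +(all(2:3 %in% t) & !any(t == 1)))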

Using a for loop across columns with similar names

I am trying to use the tidyverse (purrr) package to run a for loop across my dataset. I want to check whether some number of conditions are true across certain columns along the dataset. Note, I am trying to become more familiar with tidyverse and its functions rather than rely on Base R.
Here is the code that I want to write a for loop for.
nrow(subset(data, flwr_clstr1>1 & bud_clstr1==0))
nrow(subset(data, flwr_clstr2>1 & bud_clstr2==0))
nrow(subset(data, flwr_clstr3>1 & bud_clstr3==0))
I have columns of data (in this case, it would be flwr_clstr) that are similar, but differ by the last digit. Also, if there is another way to use tidyverse to check these 'conditions', that would be great too.
Here is my attempt at the for loop.
check1 <- vector("double", ncol(data_phen))
for (i in seq_along(data_phen)) {
  check[[i]] <- nrow(subset(data, flwr_clstr[[i]] > 1 & bud_clstr[[i]] == 0))
}
It would be easier to help if you could provide a reproducible example; however, I created a sample of what your data might look like based on my understanding.
We can use map2_int from purrr, since we are counting the number of rows in each pair of columns:
library(dplyr)
library(purrr)
map2_int(data %>% select(starts_with("flwr_clstr")),
         data %>% select(starts_with("bud_clstr")),
         ~ sum(.x > 1 & .y == 0)) %>% unname()
#[1] 2 3 1
However, base R isn't that bad either. This can be solved using mapply
col1 <- grep("^flwr_clstr", names(data))
col2 <- grep("^bud_clstr", names(data))
mapply(function(x, y) sum(x > 1 & y == 0), data[col1], data[col2])
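For reference, mapply returns a named vector here (values assuming the sample data below):
# flwr_clstr1 flwr_clstr2 flwr_clstr3
#           2           3           1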
Data
Assuming you have an equal number of columns for both "flwr_clstr.." and "bud_clstr..":
data <- data.frame(flwr_clstr1 = c(2, 1, 2, 1, 0),
                   flwr_clstr2 = c(2, 2, 2, 1, 0),
                   flwr_clstr3 = c(1, 1, 2, 1, 1),
                   bud_clstr1 = 0, bud_clstr2 = 0, bud_clstr3 = 0)
which looks like
data
#  flwr_clstr1 flwr_clstr2 flwr_clstr3 bud_clstr1 bud_clstr2 bud_clstr3
#1           2           2           1          0          0          0
#2           1           2           1          0          0          0
#3           2           2           2          0          0          0
#4           1           1           1          0          0          0
#5           0           0           1          0          0          0
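And if you do want a for loop in the spirit of your attempt, a corrected sketch (assuming three column pairs numbered 1 to 3, as in the sample data) would loop over the pair index and build the column names explicitly:
check <- integer(3)  # one result per column pair
for (i in seq_along(check)) {
  flwr <- data[[paste0("flwr_clstr", i)]]
  bud <- data[[paste0("bud_clstr", i)]]
  check[i] <- sum(flwr > 1 & bud == 0)
}
check
#> [1] 2 3 1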

Identifying relative size of overlapping groups based on information in 2 vectors

I am working with very messy family data, in that it is possible for kids to be grouped with multiple families. The data is structured as follows:
famid <- c("A","A","B","C","C","D","D")
kidid <- c("1","2","1","3","4","4","5")
df <- as.data.frame(cbind(famid, kidid))
I want to identify which families I can drop, based on the criteria that all of the kids in that family are grouped together in another, larger, family.
For example, Family A contains Kid 1 and Kid 2. Family B contains Kid 1. Because Family B is entirely contained within Family A, I want to drop Family B.
Alternatively, Family C contains Kid 3 and Kid 4. Family D contains Kid 4 and Kid 5. Neither family is entirely contained within the other, so I do not want to drop either for the time being.
In my data there can be up to 6 families per kid and up to 8 kids per family. There are thousands of families and thousands of kids.
I have tried addressing this by creating a very wide data.frame with one row per student, with columns for each family the kid is associated with, for each sibling in each of those families, and an additional column (sibgrp) per associated family that concatenates all siblings together. But when I tried to search for individual siblings within the concatenated string, I found I didn't know how to do this: grepl won't take a vector as the pattern argument.
I then started to look into intersect and similar functions, but those compare entire vectors to each other, not observations within a vector to other observations within that vector. (Meaning: I can't look for the intersection between the character string df[1,2] and the character string df[1,3]; intersect instead identifies the intersection between the columns df[2] and df[3].)
I tried to change my thinking to accommodate this approach, so that I could compare vectors of siblings to each other, assuming that I know already that at least one sibling is shared. I could not figure out how to even begin doing this, given how many different families there are, and how many are not related to each other by even one shared kid.
What am I missing here? Would very much appreciate any feedback. Thank you!
This function can also be used for doing the task. It returns a character vector containing the names of the families that can be removed.
test_function <- function(dataset){
  ## split the kidid values on the basis of famid
  kids_family <- split.default(dataset[['kidid']], f = dataset[['famid']])
  family <- names(kids_family)
  ## generate all possible combinations of two families selected from family
  combn_family <- combn(family, 2)
  family_removed <- character(0)
  apply(combn_family, MARGIN = 2, function(x){
    if (length(setdiff(kids_family[[x[1]]], kids_family[[x[2]]])) == 0)
      family_removed <<- c(family_removed, x[1])
    else if (length(setdiff(kids_family[[x[2]]], kids_family[[x[1]]])) == 0)
      family_removed <<- c(family_removed, x[2])
  })
  return(family_removed)
}
> df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
+                  kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))
> test_function(df)
[1] "B" "F"
I tried working with setdiff without success. I'm posting this laborious solution in the hope that there is a better way.
# dependencies for melting tables and handling data.frames
require(reshape2)
require(dplyr)
# I have added two more cases to your data.frame
# kidid is passed as numeric (if quoted, it would have been turned into a factor by default)
df <- data.frame(famid = c("A","A","B","C","C","D","D", "E", "E", "E", "F", "F"),
                 kidid = c(1, 2, 1, 3, 4, 4, 5, 7, 8, 9, 7, 9))
# let's have a look at it
df
   famid kidid
1      A     1
2      A     2
3      B     1
4      C     3
5      C     4
6      D     4
7      D     5
8      E     7
9      E     8
10     E     9
11     F     7
12     F     9
# we build a contingency table
m <- table(df$famid, df$kidid)
# a family A contains a family B if A has all the elements of B,
# and at least one that B doesn't have
m
    1 2 3 4 5 7 8 9
  A 1 1 0 0 0 0 0 0
  B 1 0 0 0 0 0 0 0
  C 0 0 1 1 0 0 0 0
  D 0 0 0 1 1 0 0 0
  E 0 0 0 0 0 1 1 1
  F 0 0 0 0 0 1 0 1
# a helper function to implement that and return a friendly data.frame
family_contained <- function(m){
  res <- list()
  # for each line in m, we calculate the difference to all other lines
  for (i in 1:nrow(m))
    res[[i]] <- t(apply(m[-i, ], 1, function(row) m[i, ] - row))
  # here we test whether all values are >= 0 (ie the selected family has all
  # elements of the other) and whether at least one is >= 1 (ie the selected
  # family has at least one element that the other doesn't have)
  tab <- sapply(res, function(m) apply(m, 1, function(x) all(x >= 0) & any(x >= 1)))
  # we format it as a table to get nice names
  tab %>% as.table() %>%
    # we melt it into a data.frame
    melt() %>%
    # keep only the TRUE rows and drop that column
    filter(value) %>% select(-value) %>%
    # to make things clear we name the columns
    `colnames<-`(c("this_family_is_contained", "this_family_contains"))
}
family_contained(m)
#   this_family_is_contained this_family_contains
# 1                        B                    A
# 2                        F                    E
# finally you can filter them out with
filter(df, !(famid %in% family_contained(m)$this_family_is_contained))
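Since the question mentions trying setdiff, here is one more compact sketch along those lines (my own addition, using the extended df above): split the kids by family, then drop any family whose kid set is fully contained in a strictly larger family.
fams <- split(df$kidid, df$famid)
pairs <- combn(names(fams), 2, simplify = FALSE)
drop <- unique(unlist(lapply(pairs, function(p) {
  a <- fams[[p[1]]]; b <- fams[[p[2]]]
  # a family is dropped when all its kids appear in a larger family
  if (length(setdiff(a, b)) == 0 && length(b) > length(a)) p[1]
  else if (length(setdiff(b, a)) == 0 && length(a) > length(b)) p[2]
})))
drop
#> [1] "B" "F"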

Find the minimum distance between two data frames, for each element in the second data frame

I have two data frames, ev1 and ev2, describing timestamps of two types of events collected over many tests. Each data frame has the columns test_id and time. What I need to find, for each event in ev2, is the minimum distance to an ev1 event within the same test.
I have a working code that merges the two datasets, calculates the distances, and then uses dplyr to filter for the minimum distance:
ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(6, 1, 8, 4, 5, 11))
data <- merge(ev2, ev1, by=c("test_id"), suffixes=c(".ev2", ".ev1"))
data$distance <- data$time.ev2 - data$time.ev1
min_data <- data %>%
  group_by(test_id, time.ev2) %>%
  filter(abs(distance) == min(abs(distance)))
While this works, the merge part is very slow and feels inefficient -- I'm generating a huge table with all combinations of ev2->ev1 for the same test_id, only to filter it down to one. It seems like there should be a way to "filter on the fly", during the merge. Is there?
Update: The following case with two "group by" columns fails when the data.table approach outlined by akrun is used:
ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4), group_id=c(0, 0, 0, 1, 1, 1))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(5, 6, 7, 1, 2, 8), group_id=c(0, 0, 0, 1, 1, 1))
setkey(setDT(ev1), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=abs(time-i.time)]
Error in eval(expr, envir, enclos) : object 'i.time' not found
Here's how I'd do it using data.table:
require(data.table)
setkey(setDT(ev1), test_id)
ev1[ev2, .(ev2.time = i.time, ev1.time = time[which.min(abs(i.time - time))]), by = .EACHI]
#    test_id ev2.time ev1.time
# 1:       0        6        3
# 2:       0        1        1
# 3:       0        8        3
# 4:       1        4        4
# 5:       1        5        4
# 6:       1       11        4
In joins of the form x[i] in data.table, the prefix i. is used to refer to columns in i when both x and i share the same name for a particular column.
Please see this SO post for an explanation of how this works.
This is syntactically more straightforward for understanding what's going on, and it is memory efficient (at the expense of a little speed) as it doesn't materialise the entire join result at all. In fact, it does exactly what you describe in your post: it filters on the fly, while merging.
On speed, it doesn't really matter in most cases. If there are a lot of rows in i, it might be a tad slower, as the j-expression has to be evaluated for each row in i. In contrast, @akrun's answer does a cartesian join followed by one filtering step, so while it's high on memory, it doesn't evaluate j for each row in i. But again, this shouldn't matter unless you work with a really large i, which is not often the case.
HTH
Maybe this helps:
library(data.table)
setkey(setDT(ev1), test_id)
DT <- ev1[ev2, allow.cartesian=TRUE][, distance := i.time - time]
DT[DT[,abs(distance)==min(abs(distance)), by=list(test_id, i.time)]$V1]
#    test_id time i.time distance
# 1:       0    3      6        3
# 2:       0    1      1        0
# 3:       0    3      8        5
# 4:       1    4      4        0
# 5:       1    4      5        1
# 6:       1    4     11        7
Or
ev1[ev2, allow.cartesian=TRUE][, distance := i.time - time][,
  .SD[abs(distance) == min(abs(distance))], by = list(test_id, i.time)]
Update
Using the new grouping
setkey(setDT(ev1), test_id, group_id)
setkey(setDT(ev2), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][, distance := i.time - time]
DT[DT[, abs(distance) == min(abs(distance)),
      by = list(test_id, group_id, i.time)]$V1]$distance
#[1] 2 3 4 -1 0 4
Based on the code you provided
min_data$distance
#[1] 2 3 4 -1 0 4
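As a side note (not from the answers above), data.table's rolling join with roll = "nearest" can match each ev2 row to the closest ev1 time directly, without materialising a cartesian join. A minimal sketch for the original single-key data:
library(data.table)
ev1 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(1, 2, 3, 2, 3, 4))
ev2 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(6, 1, 8, 4, 5, 11))
ev1[, ev1.time := time]  # keep a copy: in the result, the key column takes ev2's values
setkey(ev1, test_id, time)
ev1[ev2, roll = "nearest"]
#    test_id time ev1.time
# 1:       0    6        3
# 2:       0    1        1
# 3:       0    8        3
# 4:       1    4        4
# 5:       1    5        4
# 6:       1   11        4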
