Create indicator variable within panel data in R - r

I feel this should be easy, but I am at a loss and hoping you can help. I have panel data, keyed by id, with several variables; here just v1:
id v1
A 14
A 15
B 12
B 13
B 14
C 11
C 12
C 13
D 14
I would simply like to create a dummy variable indicating whether a given value of v1 (say 12) exists anywhere in the panel for that id. So something like:
id v1 v2
A 14 0
A 15 0
B 12 1
B 13 1
B 14 1
C 11 1
C 12 1
C 13 1
D 14 0
I feel this should be simple, but I can't figure out an easy one-line solution.
Many many thanks!

Try
library(dplyr)
df %>% group_by(id) %>% mutate(v2 = as.numeric(any(v1 == 12)))
Or, as per @akrun's suggestion:
library(data.table)
setDT(df)[, v2 := any(v1 ==12)+0L, id]
Note: Adding 0L to the logical values created by any() will switch TRUE/FALSE to 0s and 1s.
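As a quick illustration of that coercion, here is a small base R sketch. The vector x is made up for the example; it only shows that adding 0L and calling as.integer() lead to the same integer result:

```r
# Hypothetical v1 values for a single id
x <- c(12, 13, 14)

flag <- any(x == 12)   # a single logical value
flag + 0L              # arithmetic with 0L turns it into an integer
as.integer(flag)       # same result, arguably more explicit
```

Which of the two spellings to use is mostly a matter of taste.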
Another approach could be using ave:
df$v2 <- with(df, ave(v1, id, FUN = function(x) any(x == 12)))
Which gives:
# id v1 v2
#1 A 14 0
#2 A 15 0
#3 B 12 1
#4 B 13 1
#5 B 14 1
#6 C 11 1
#7 C 12 1
#8 C 13 1
#9 D 14 0
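For completeness, a base R sketch that avoids grouping altogether: collect the ids that contain the value at least once, then flag every row that belongs to one of them. It rebuilds the example data from the question; the helper name ids_with_12 is only illustrative.

```r
# Example data from the question
df <- data.frame(
  id = c("A", "A", "B", "B", "B", "C", "C", "C", "D"),
  v1 = c(14, 15, 12, 13, 14, 11, 12, 13, 14)
)

# ids that have v1 == 12 in at least one row
ids_with_12 <- unique(df$id[df$v1 == 12])

# 1 for every row of such an id, 0 otherwise
df$v2 <- as.integer(df$id %in% ids_with_12)
df
```

This assumes v1 has no missing values; with NAs in v1 the comparison would need an extra which() or is.na() guard.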

Related

Compare two columns of two different data frames with different length of rows return a third row

I have two different df which have the same columns: "O" for place and "date" for time.
df 1 gives different information for a certain place (O) and time (date) in a single row, while df 2 has many pieces of information for the same year and place spread over many rows. Now I want to extract one value from the first df and apply it to all rows of the second df where the values of "O" and "date" are equal.
To make it more clear:
I have one row in df 1: krnqm=250 for O=1002 and date=1885. Now I want a new column "krnqm" in df 2 where df2$krnqm = 250 for all rows where df2$O=1002 and df2$date=1885.
Unfortunately I have no idea how to express that condition in code and would be grateful for your help.
You can do this quite easily in base R using the merge function. Here's an example.
Simulate some data from your description:
df1 <- expand.grid(O = letters[c(2:4,7)], date = c(1,3))
df2 <- data.frame(O = rep(letters[1:6], c(2,3,3,6,2,2)), date = rep(1:3, c(3,2,4)))
df1$krnqm <- sample(1:1000, size = nrow(df1), replace=T)
> df1
O date krnqm
1 b 1 833
2 c 1 219
3 d 1 773
4 g 1 514
5 b 3 118
6 c 3 969
7 d 3 704
8 g 3 914
> df2
O date
1 a 1
2 a 1
3 b 1
4 b 2
5 b 2
6 c 3
7 c 3
8 c 3
9 d 3
10 d 1
11 d 1
12 d 1
13 d 2
14 d 2
15 e 3
16 e 3
17 f 3
18 f 3
Now let's combine the two data frames in the manner you describe.
df2 <- merge(df2, df1, all.x=T)
> df2
O date krnqm
1 a 1 NA
2 a 1 NA
3 b 1 833
4 b 2 NA
5 b 2 NA
6 c 3 969
7 c 3 969
8 c 3 969
9 d 1 773
10 d 1 773
11 d 1 773
12 d 2 NA
13 d 2 NA
14 d 3 704
15 e 3 NA
16 e 3 NA
17 f 3 NA
18 f 3 NA
As you can see, the krnqm column in the resulting data frame contains NA for any combination of 'O' and 'date' that was not found in df1, the data frame the krnqm values were taken from. If your df1 has other columns that you do not want included in the merge, change the merge call slightly to use only the columns you want: df2 <- merge(df2, df1[,c("O", "date", "krnqm")], all.x=T).
Good luck!
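To tie this back to the numbers in the question, here is a minimal sketch using krnqm=250 for O=1002 and date=1885. Everything else (the second df1 row, the obs column, and the places 1003 and 1004) is invented for illustration. Naming the key columns in by is optional here, but it makes the join explicit.

```r
# df1: one row per place/year; only the first row comes from the question
df1 <- data.frame(O = c(1002, 1003),
                  date = c(1885, 1885),
                  krnqm = c(250, 310))

# df2: several rows per place/year (hypothetical observations)
df2 <- data.frame(O = c(1002, 1002, 1003, 1004),
                  date = c(1885, 1885, 1885, 1890),
                  obs = 1:4)

# Left join: keep every row of df2, add krnqm where O and date match
out <- merge(df2, df1, by = c("O", "date"), all.x = TRUE)
out
```

Rows of df2 without a partner in df1 (here O=1004) end up with NA in krnqm. Note that merge() may reorder the rows.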

How to mutate filtered rows (using dplyr or if/else)

Similar questions have certainly been asked, but mine is much simpler and unfortunately I could not work out the answer from them, so here is my specific, probably simple case:
df <- data.frame("Sample" = 1:30,
"Individual" = c("a", "b", "c"),
"Repeat" = 1:3)
I would like to change the entries where Individual == "a" into "a_<number of repeat>", but only for individual a, not for b or c.
I tried:
df[df$Individual == "a", ] <-
df %>% filter(Individual == "a") %>%
df %>% mutate(Individual = paste0(Individual,"_",Repeat))
but no success. Maybe it could also be solved with an if/else or a for loop?
df$Individual <- for (df$Individual == "a") {
df %>% mutate(Individual = paste0(Individual,"_",Repeat))
}
...also a fail.
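What about something like this, with mutate and a classic ifelse? Note that in the output below b and c appear as 2 and 3: Individual was a factor here (the data.frame default before R 4.0.0), so ifelse hands back the integer codes for the untouched rows. Converting the column with as.character() first avoids that.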
library(dplyr)
df %>% mutate(Individual = ifelse(Individual=="a",
paste0(Individual,'_',Repeat),
Individual))
Sample Individual Repeat
1 1 a_1 1
2 2 2 2
3 3 3 3
4 4 a_1 1
5 5 2 2
6 6 3 3
7 7 a_1 1
8 8 2 2
9 9 3 3
10 10 a_1 1
11 11 2 2
12 12 3 3
13 13 a_1 1
14 14 2 2
15 15 3 3
16 16 a_1 1
17 17 2 2
18 18 3 3
19 19 a_1 1
20 20 2 2
21 21 3 3
22 22 a_1 1
23 23 2 2
24 24 3 3
25 25 a_1 1
26 26 2 2
27 27 3 3
28 28 a_1 1
29 29 2 2
30 30 3 3
Or in a new column:
df %>% mutate(Individual_2 = ifelse(Individual=="a",
paste0(Individual,'_',Repeat),
Individual))
We can use dplyr::if_else
library(dplyr)
df %>%
mutate_if(is.factor, as.character) %>%
mutate(Individual = if_else(
Individual == "a",
sprintf("%s_%s", Individual, Repeat),
Individual))
# Sample Individual Repeat
#1 1 a_1 1
#2 2 b 2
#3 3 c 3
#4 4 a_1 1
#5 5 b 2
#6 6 c 3
#7 7 a_1 1
#8 8 b 2
#9 9 c 3
#10 10 a_1 1
#11 11 b 2
#12 12 c 3
#13 13 a_1 1
#14 14 b 2
#15 15 c 3
#16 16 a_1 1
#17 17 b 2
#18 18 c 3
#19 19 a_1 1
#20 20 b 2
#21 21 c 3
#22 22 a_1 1
#23 23 b 2
#24 24 c 3
#25 25 a_1 1
#26 26 b 2
#27 27 c 3
#28 28 a_1 1
#29 29 b 2
#30 30 c 3
You are mixing up some syntax, which is why your code fails.
First, your dplyr approach. You are close here, but the additional df in the second row breaks the pipeline.
df[df$Individual == "a", ] <-
df %>% filter(Individual == "a") %>%
# don't pipe df in again; the filtered data is already being passed as input
df %>% mutate(Individual = paste0(Individual,"_",Repeat))
The following makes it work.
Individual is stored as a factor (the data.frame default before R 4.0.0), so convert it to a character vector before modifying the column.
df$Individual <- as.character(df$Individual)
df[df$Individual == "a", ] <-
df %>%
filter(Individual == "a") %>%
mutate(Individual = paste0(Individual,"_",Repeat))
There are other approaches as well:
E.g. in base R
df$Individual <- ifelse(df$Individual == "a",
paste0(df$Individual, "_", df$Repeat),
df$Individual)
Or in dplyr:
df %>%
mutate(Individual = ifelse(Individual == "a",
paste0(Individual, "_", Repeat),
Individual))
You could also fix the for loop as shown below, but I really don't recommend it in this case, as there are such nice vectorized options.
for (i in 1:nrow(df)) {
if (df$Individual[i] == "a") {
df$Individual[i] <- paste0(df$Individual[i], "_", df$Repeat[i])
}
}
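Another vectorized option, sketched here in base R, is to assign only into the selected rows with a logical index. The name sel is just illustrative, and stringsAsFactors = FALSE is set so the sketch behaves the same on older R versions.

```r
df <- data.frame(Sample = 1:30,
                 Individual = c("a", "b", "c"),
                 Repeat = 1:3,
                 stringsAsFactors = FALSE)  # keep Individual as character

# TRUE for the rows that should be renamed
sel <- df$Individual == "a"

# Overwrite only those rows; b and c are left untouched
df$Individual[sel] <- paste0(df$Individual[sel], "_", df$Repeat[sel])
head(df, 4)
```

Because only the selected elements are replaced, there is no need for an else branch.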

Merging and summarizing two dataframes

I have the following data:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
> a
ID a b c
1 A 0 3 6
2 B 1 4 7
3 Z 2 5 8
4 H 45 22 3
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
> b
ID a b c
1: A 9 4 12
2: B 10 2 0
3: E 11 7 34
4: W 39 54 23
5: Z 5 12 13
6: H 0 34 14
I want to merge both data frames, keeping only the rows of data.frame a and summing the columns they share, so that at the end I get:
> z
ID a b c
1 A 9 7 18
2 B 11 6 7
3 Z 7 17 21
4 H 45 56 17
So far I have tried the following:
merge(a,b,by="ID",all.x=T,all.y=F)
> merge(a,b,by="ID",all.x=T,all.y=F)
ID a.x b.x c.x a.y b.y c.y
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 H 45 22 3 0 34 14
4 Z 2 5 8 5 12 13
> join(a,b,type="left",by="ID")
ID a b c a b c
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 Z 2 5 8 5 12 13
4 H 45 22 3 0 34 14
I cannot manage to sum the columns.
My data frame is pretty big, so a solution that speeds things up would be even better.
If your data.frame is very big, then you may consider this option:
library(data.table)
## convert data.frame to data.table
setDT(a)
## convert data.frame to data.table
setDT(b)
## merge the two data.tables
c <- merge(a,b,by='ID')
## extract names of all columns except the first one i.e. ID
col_names <- colnames(a)[-1]
## query building
col_1 <- paste0(col_names,'.x')
col_2 <- paste0(col_names,'.y')
cols <- paste(col_1,col_2,sep=',')
cols_2 <- paste0(col_names," = sum(",cols,")")
cols_3 <- paste(cols_2,collapse=',')
query <- paste0("z <- c[,.(",cols_3,"),by=ID]")
## query execution
eval(parse(text = query))
This works at least for your example:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
match_a <- na.omit(match(b$ID, a$ID))
match_b <- na.omit(match(a$ID, b$ID))
df <- cbind(ID = a$ID[match_a], a[match_a, -1] + b[match_b, -1])
First, get the matching rows from a in b and vice versa, so we can be sure that we only use rows that appear in both data frames (and we now know their row indices in both data frames). Then simply use vectorized addition on those matching rows, but omit ID, since a factor cannot be summed; add ID back manually.
You cannot add the two data frames directly because they have different numbers of rows. To make them the same size, keep only the rows of b whose ID is present in a, and then add them element-wise. This relies on the matching IDs appearing in the same order in both data frames, as they do here.
new <- b[b$ID %in% a$ID, ]
cbind(ID = a$ID, a[-1] + new[-1])
# ID a b c
#1 A 9 7 18
#2 B 11 6 7
#3 Z 7 17 21
#4 H 45 56 17
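If the IDs are not guaranteed to appear in the same order in both data frames, a sketch that aligns the rows explicitly may be safer. It uses match() to look up, for each row of a, the row of b with the same ID. It assumes every ID in a also occurs in b, which holds for the example data; the name idx is illustrative.

```r
a <- data.frame(ID = c("A", "B", "Z", "H"),
                a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"),
                a = c(9, 10, 11, 39, 5, 0), b = c(4, 2, 7, 54, 12, 34),
                c = c(12, 0, 34, 23, 13, 14))

# For each row of a, the position of the same ID in b (NA if absent)
idx <- match(a$ID, b$ID)
stopifnot(!anyNA(idx))  # assumption: every ID in a also exists in b

# Add the value columns row by row and put ID back in front
z <- cbind(ID = a$ID, a[-1] + b[idx, -1])
z
```

The rows of z follow the order of a, whatever the order of b is.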

How to sum over diagonals of data frame

Say that I have this data frame:
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2
In the data frame above, the values are counts of how many observations take on (100, 1), (99, 1), etc.
In my context, cells on the same diagonal have the same meaning:
1 2 3 4
100 A B C D
99 B C D E
98 C D E F
97 D E F G
How would I sum across the diagonals (i.e., sum the counts of the like letters) in the first data frame?
This would produce:
group sum
A 8
B 13
C 13
D 28
E 10
F 18
G 2
For example, D is 5+5+4+14
You can use row() and col() to identify row/column relationships.
m <- read.table(text="
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2")
vals <- sapply(2:8,
function(j) sum(m[row(m)+col(m)==j]))
or (as suggested in the comments by @thelatemail)
vals <- sapply(split(as.matrix(m), row(m) + col(m)), sum)
data.frame(group=LETTERS[seq_along(vals)],sum=vals)
or (@Frank)
data.frame(vals = tapply(as.matrix(m),
(LETTERS[row(m) + col(m)-1]), sum))
as.matrix() is required to make split() work correctly ...
Another aggregate variation (with dat being the data frame of counts), avoiding the formula interface, which actually complicates matters in this instance:
aggregate(list(Sum=unlist(dat)), list(Group=LETTERS[c(row(dat) + col(dat))-1]), FUN=sum)
# Group Sum
#1 A 8
#2 B 13
#3 C 13
#4 D 28
#5 E 10
#6 F 18
#7 G 2
Another solution using bgoldst's definition of df1 and df2 (given in the stack()/aggregate() answer further down):
sapply(unique(c(as.matrix(df2))),
function(x) sum(df1[df2 == x]))
Gives
#A B C D E F G
#8 13 13 28 10 18 2
(Not quite the format that you wanted, but maybe it's ok...)
Here's a solution using stack(), and aggregate(), although it requires the second data.frame contain character vectors, as opposed to factors (could be forced with lapply(df2,as.character)):
df1 <- data.frame(a=c(8,1,2,5), b=c(12,6,5,3), c=c(5,4,4,7), d=c(14,3,11,2) );
df2 <- data.frame(a=c('A','B','C','D'), b=c('B','C','D','E'), c=c('C','D','E','F'), d=c('D','E','F','G'), stringsAsFactors=F );
aggregate(sum~group,data.frame(sum=stack(df1)[,1],group=stack(df2)[,1]),sum);
## group sum
## 1 A 8
## 2 B 13
## 3 C 13
## 4 D 28
## 5 E 10
## 6 F 18
## 7 G 2
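All of these answers rest on the same observation: row index plus column index is constant along each anti-diagonal. A self-contained sketch of that idea on a plain matrix follows; the names m, grp and sums are illustrative, and the counts are the ones from the question.

```r
# Counts from the question, as a matrix
m <- matrix(c(8, 12, 5, 14,
              1,  6, 4,  3,
              2,  5, 4, 11,
              5,  3, 7,  2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("100", "99", "98", "97"), as.character(1:4)))

# row(m) + col(m) is 2 on the first anti-diagonal, 3 on the next, ...,
# so subtracting 1 maps the diagonals to the letters A, B, ..., G
grp <- LETTERS[row(m) + col(m) - 1]

# Sum the counts within each letter group
sums <- tapply(m, grp, sum)
data.frame(group = names(sums), sum = as.vector(sums))
```

For a square matrix with n rows there are 2n - 1 such groups, so this labelling only works as written while 2n - 1 does not exceed 26.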

how to use apply-like function on data frame? [please see details below]

I have a dataframe with columns A, B and C.
I want to apply a function on each row of a dataframe in which a function will check the value of row$A and row$B and will update row$C based on those values. How can I achieve that?
Example:
A B C
1 1 10 10
2 2 20 20
3 NA 30 30
4 NA 40 40
5 5 50 50
Now I want to update all rows in C column to B/2 value in that same row if value in A column for that row is NA.
So the dataframe after changes would look like:
A B C
1 1 10 10
2 2 20 20
3 NA 30 15
4 NA 40 20
5 5 50 50
I would like to know if this can be done without using a for loop.
Or, if you want to update the column by reference (without copying the whole data set when updating the column), you could also try data.table
library(data.table)
setDT(dat)[is.na(A), C := B/2]
dat
# A B C
# 1: 1 10 10
# 2: 2 20 20
# 3: NA 30 15
# 4: NA 40 20
# 5: 5 50 50
Edit:
Regarding @arun's comment: the address is the same before and after the change, which indicates that the column was still updated by reference.
library(pryr)
address(dat$C)
## [1] "0x2f85a4f0"
setDT(dat)[is.na(A), C := B/2]
address(dat$C)
## [1] "0x2f85a4f0"
Try this:
your_data <- within(your_data, C[is.na(A)] <- B[is.na(A)] / 2)
Try
indx <- is.na(df$A)
df$C[indx] <- df$B[indx]/2
df
# A B C
#1 1 10 10
#2 2 20 20
#3 NA 30 15
#4 NA 40 20
#5 5 50 50
Here is a simple example using library(dplyr).
Fictional dataset:
df <- data.frame(a=c(1, NA, NA, 2), b=c(10, 20, 50, 50))
You want to change just those rows where a is NA, so you can use ifelse together with is.na():
df <- mutate(df, c=ifelse(is.na(a), b/2, b))
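The condition has to be written with is.na() rather than a == NA, because any comparison with NA yields NA instead of TRUE or FALSE. A tiny base R sketch, reusing the values of column a from the fictional dataset:

```r
a <- c(1, NA, NA, 2)

a == NA    # every element is NA, so this cannot select rows
is.na(a)   # a proper logical vector marking the missing entries
```

This is why all the answers here test for missing values with is.na().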
Another approach:
dat <- transform(dat, C = B / 2 * (i <- is.na(A)) + C * !i)
# A B C
# 1 1 10 10
# 2 2 20 20
# 3 NA 30 15
# 4 NA 40 20
# 5 5 50 50
Try:
> ddf$C = with(ddf, ifelse(is.na(A), B/2, C))
>
> ddf
A B C
1 1 10 10
2 2 20 20
3 NA 30 15
4 NA 40 20
5 5 50 50
