Concatenate two datasets in R

I have two datasets, animal and plant:
ANIMAL                            PLANT
OBS  Common  Animal  Number       OBS  Common  Plant     Number
1    a       Ant     5            1    g       Grape     69
2    b       Bird                 2    h       Hazelnut  55
3    c       Cat     17           3    i       Indigo
4    d       Dog     9            4    j       Jicama    14
5    e       Eagle                5    k       Kale      5
6    f       Frog    76           6    l       Lentil    77
I want to concatenate these two into a new dataset.
Below is the desired output
Obs  Common  Animal  Plant     Number
  1  a       Ant               5
  2  b       Bird              .
  3  c       Cat               17
  4  d       Dog               9
  5  e       Eagle             .
  6  f       Frog              76
  7  g               Grape     69
  8  h               Hazelnut  55
  9  i               Indigo    .
 10  j               Jicama    14
 11  k               Kale      5
 12  l               Lentil    77
How can I do this kind of concatenation in R?

rbind() will not work because of the differing names.
Something like this will work for the given example:
rbind_ <- function(data1, data2) {
  nms1 <- names(data1)
  nms2 <- names(data2)
  if (mean(nms1 == nms2) == 1) {
    out <- rbind(data1, data2)
  } else {
    # add the columns missing from each data frame, filled with NA
    data1[nms2[!nms2 %in% nms1]] <- NA
    data2[nms1[!nms1 %in% nms2]] <- NA
    out <- rbind(data1, data2)
  }
  return(out)
}
rbind_(animal, plant)
   OBS Common Animal Number    Plant
1    1      a    Ant      5     <NA>
2    2      b   Bird     NA     <NA>
3    3      c    Cat     17     <NA>
4    4      d    Dog      9     <NA>
5    5      e  Eagle     NA     <NA>
6    6      f   Frog     76     <NA>
7    1      g   <NA>     69    Grape
8    2      h   <NA>     55 Hazelnut
9    3      i   <NA>     NA   Indigo
10   4      j   <NA>     14   Jicama
11   5      k   <NA>      5     Kale
12   6      l   <NA>     77   Lentil
But it would require a bit of tweaking to get it to work in all cases, I think.
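One possible tweak (my own sketch, not part of the original answer) is to compare the name sets with setdiff(), so the check no longer depends on the two name vectors having the same length and order:
rbind_fill <- function(data1, data2) {
  # add any column missing from one side, filled with NA, then rbind()
  data1[setdiff(names(data2), names(data1))] <- NA
  data2[setdiff(names(data1), names(data2))] <- NA
  rbind(data1, data2)   # rbind() matches data frame columns by name
}
rbind_fill(animal, plant)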

This should give you the desired output:
PLANT$OBS    <- PLANT$OBS + nrow(ANIMAL)   # renumber the plant observations
ANIMAL$Plant <- ''                         # add the column each dataset is missing
PLANT$Animal <- ''
Final_DF <- rbind(ANIMAL, PLANT)
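If packages are an option, dplyr::bind_rows() and data.table::rbindlist(fill = TRUE) handle differing column names out of the box and fill the missing cells with NA (a sketch, assuming the data frames are named animal and plant as above):
library(dplyr)
bind_rows(animal, plant)                      # missing columns are filled with NA

library(data.table)
rbindlist(list(animal, plant), fill = TRUE)   # same idea with data.table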


How to replace NAs with previous column values plus one by group based on other columns in R?

I am currently trying to replace NA values in my dataframe with the previous value plus one. However, there is a condition: the values must never exceed 52, since that is the number of weeks in a calendar year. Here's an example of the dataframe below:
Animal Age Week
Dog 13 5
Dog 14 6
Dog 15 7
Dog 16 NA
Dog 17 NA
Cat 12 46
Cat 13 47
Cat 14 48
Cat 15 49
Cat 16 50
Cat 17 NA
Rat 10 49
Rat 11 50
Rat 12 51
Rat 13 NA
Rat 14 NA
Rat 15 NA
Rat 16 NA
Rat 17 NA
What I would like the code to output is the following below:
Animal Age Week
Dog 13 5
Dog 14 6
Dog 15 7
Dog 16 8
Dog 17 9
Cat 12 46
Cat 13 47
Cat 14 48
Cat 15 49
Cat 16 50
Cat 17 51
Rat 10 49
Rat 11 50
Rat 12 51
Rat 13 52
Rat 14 1
Rat 15 2
Rat 16 3
Rat 17 4
The caveat is that the end age of each animal will always be 17. I tried using R's "complete" and "fill" functions, but I could not find a way to add one with the condition that it resets after week 52.
Any help would be appreciated.
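For reference, the example data can be rebuilt like this (a sketch; the answers below assume a data frame named df):
df <- data.frame(
  Animal = rep(c("Dog", "Cat", "Rat"), times = c(5, 6, 8)),
  Age    = c(13:17, 12:17, 10:17),
  Week   = c(5, 6, 7, NA, NA,
             46, 47, 48, 49, 50, NA,
             49, 50, 51, NA, NA, NA, NA, NA)
)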
For each group (Animal), we add the first Week number to the row number (minus one) and take the remainder modulo 52. Finally, we replace any 0 value with 52.
library(dplyr)
df %>%
  group_by(Animal) %>%
  mutate(Week = (first(Week) + row_number() - 1) %% 52,
         Week = replace(Week, Week == 0, 52))
# Animal Age Week
# <fct> <int> <dbl>
# 1 Dog 13 5
# 2 Dog 14 6
# 3 Dog 15 7
# 4 Dog 16 8
# 5 Dog 17 9
# 6 Cat 12 46
# 7 Cat 13 47
# 8 Cat 14 48
# 9 Cat 15 49
#10 Cat 16 50
#11 Cat 17 51
#12 Rat 10 49
#13 Rat 11 50
#14 Rat 12 51
#15 Rat 13 52
#16 Rat 14 1
#17 Rat 15 2
#18 Rat 16 3
#19 Rat 17 4
Similarly, in base R:
df <- transform(df, Week = ave(Week, Animal, FUN = function(x)
                       (seq_along(x) + x[1] - 1) %% 52))
transform(df, Week = replace(Week, Week == 0, 52))
We can use data.table:
library(data.table)
setDT(df)[, Week := (first(Week) + seq_len(.N) - 1) %% 52, Animal][Week == 0, Week := 52][]

Check time series incongruencies

Let's say that we have the following data frame:
x <- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
                         c(1,2,3,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
                         c(14,28,42,14,46,64,71,85,14,28,51,84,66,22,38,32,40,42)))
colnames(x) <- c("ID", "Visit", "Age")
The first column represents the subject ID, the second the visit number, and the third the age at each of these consecutive visits.
What would be the easiest way of finding visits where the age is wrong relative to the age at the previous visit? (E.g. in row 13, subject C is 66 years old when in the previous visit he was already 84, or in row 16, subject D is 32 years old when in the previous visit he was already 38.)
What would be the way of highlighting the potential errors and removing rows 13 and 16?
I have tried to aggregate by ID and look at the difference between ages across visits, but it seems hard to me since the error could occur at any visit.
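One thing to keep in mind: because the example is built with cbind(), every column of x ends up as a factor/character, so the answers below implicitly assume Visit and Age have been converted to numbers first, e.g.:
# convert the factor columns produced by cbind() back to numeric
x$Visit <- as.numeric(as.character(x$Visit))
x$Age   <- as.numeric(as.character(x$Age))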
How about this in base R?
df <- do.call(rbind.data.frame, lapply(split(x, x$ID), function(w)
    w[c(1, which(diff(w[order(w$Visit), "Age"]) > 0) + 1), ]))
df
# ID Visit Age
#A.1 A 1 14
#A.2 A 2 28
#A.3 A 3 42
#B.4 B 1 14
#B.5 B 2 46
#B.6 B 3 64
#B.7 B 4 71
#B.8 B 5 85
#C.9 C 1 14
#C.10 C 2 28
#C.11 C 3 51
#C.12 C 4 84
#D.14 D 1 22
#D.15 D 2 38
#D.17 D 4 40
#D.18 D 5 42
Explanation: We split the dataframe on column ID, then order every dataframe subset by Visit, calculate differences between successive Age values, and only keep those rows where the difference is > 0 (i.e. Age is increasing); rbinding gives the final dataframe.
You could do it by filtering out the rows where diff(Age) is negative for each ID.
Using the dplyr package:
library(dplyr)
x %>% group_by(ID) %>% filter(c(0,diff(Age))>=0)
# A tibble: 16 x 3
# Groups: ID [4]
ID Visit Age
<fctr> <fctr> <fctr>
1 A 1 14
2 A 2 28
3 A 3 42
4 B 1 14
5 B 2 46
6 B 3 64
7 B 4 71
8 B 5 85
9 C 1 14
10 C 2 28
11 C 3 51
12 C 4 84
13 D 1 22
14 D 2 38
15 D 4 40
16 D 5 42
The aggregate() approach is pretty concise.
Removing bad rows
good <- do.call(c, aggregate(Age ~ ID, x, function(z) c(z[1], diff(z)) > 0)$Age)
x[good,]
# ID Visit Age
# 1 A 1 14
# 2 A 2 28
# 3 A 3 42
# 4 B 1 14
# 5 B 2 46
# 6 B 3 64
# 7 B 4 71
# 8 B 5 85
# 9 C 1 14
# 10 C 2 28
# 11 C 3 51
# 12 C 4 84
# 14 D 1 22
# 15 D 2 38
# 17 D 4 40
# 18 D 5 42
This will only highlight which groups have an inconsistency:
aggregate(Age ~ ID, x, function(z) all(diff(z) > 0))
# ID Age
# 1 A TRUE
# 2 B TRUE
# 3 C FALSE
# 4 D FALSE

R - Find rows based on group factors

I'm trying to figure out a way to find specific values based on each factor in R. In other words, how can I keep all rows that satisfy a certain condition for each factor, even if a specific row fails the condition itself but another row with the same factor passes it?
So I have something like this:
gender values fruit
1 M 20 apple
2 M 22 pear
3 F 24 mango
4 F 19 mango
5 F 9 mango
6 F 17 apple
7 M 18 banana
8 M 22 banana
9 M 12 banana
10 M 14 mango
11 F 7 apple
12 F 8 apple
I want every fruit that has at least one F gender (even if that fruit has some M's). It's also possible to have multiple genders, such as neutral (not shown). So my ideal output would be this:
gender values fruit
1 M 20 apple
3 F 24 mango
4 F 19 mango
5 F 9 mango
6 F 17 apple
10 M 14 mango
11 F 7 apple
12 F 8 apple
Notice that banana and pear are missing; that's because those fruits ONLY have M's and no F's. Also, rows 1 and 10 are still there even though they are M's, because there are other apples and mangos that have F's. Please let me know if this is possible. Thank you!
Below is my code for replicating this data:
gender <- c("M","M","F","F","F","F","M","M","M","M","F","F")
values <- c(20,22,24,19,9,17,18,22,12,14,7,8)
fruit <- c("apple","pear","mango","mango","mango","apple","banana","banana","banana","mango","apple","apple")
df <- data.frame(gender, values, fruit)
Here's what I've tried so far:
df[duplicated(df[,c("fruit","gender")]),]
ave(df$gender, df$fruit, FUN=function(x) ifelse(x=='F','yes','no'))
Also, third-party libraries are welcome, but I prefer to stay within R (packages stats and plyr are fine, as I have those on my system).
df[df$fruit %in% unique(df[df$gender =='F', ]$fruit),]
# gender values fruit
#1 M 20 apple
#3 F 24 mango
#4 F 19 mango
#5 F 9 mango
#6 F 17 apple
#10 M 14 mango
#11 F 7 apple
#12 F 8 apple
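A related base-R-only option in the spirit of the ave() attempt from the question (my own sketch, not one of the original answers): flag the fruits that contain at least one "F" with ave() and subset on that flag.
# TRUE for every row whose fruit group contains at least one "F"
keep <- as.logical(ave(df$gender == "F", df$fruit, FUN = any))
df[keep, ]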
Possible data.table approach
library(data.table)
setDT(df)[, if(any(gender == "F")) .SD, by = fruit]
# fruit gender values
# 1: apple M 20
# 2: apple F 17
# 3: apple F 7
# 4: apple F 8
# 5: mango F 24
# 6: mango F 19
# 7: mango F 9
# 8: mango M 14
I like the other approach, so here's a data.table equivalent using a binary join:
setkey(setDT(df), fruit)[.(unique(df[gender == "F", fruit], by = "fruit"))]
# gender values fruit
# 1: F 17 apple
# 2: F 7 apple
# 3: F 8 apple
# 4: M 20 apple
# 5: F 24 mango
# 6: F 19 mango
# 7: F 9 mango
# 8: M 14 mango
The base R and data.table solutions are given above; here is the dplyr solution, even though some outputs differ (at least in the order of the results).
library(dplyr)
df %>% group_by(fruit) %>% filter(any(gender == "F"))
Source: local data frame [8 x 3]
Groups: fruit
gender values fruit
1 M 20 apple
2 F 24 mango
3 F 19 mango
4 F 9 mango
5 F 17 apple
6 M 14 mango
7 F 7 apple
8 F 8 apple

How to sum over diagonals of data frame

Say that I have this data frame:
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2
In this above data frame, the values indicate counts of how many observations take on (100, 1), (99, 1), etc.
In my context, the diagonals have the same meanings:
1 2 3 4
100 A B C D
99 B C D E
98 C D E F
97 D E F G
How would I sum across the diagonals (i.e., sum the counts of the like letters) in the first data frame?
This would produce:
group sum
A 8
B 13
C 13
D 28
E 10
F 18
G 2
For example, D is 5 + 5 + 4 + 14 = 28.
You can use row() and col() to identify row/column relationships.
m <- read.table(text="
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2")
vals <- sapply(2:8,
               function(j) sum(m[row(m) + col(m) == j]))
or (as suggested in comments by @thelatemail)
vals <- sapply(split(as.matrix(m), row(m) + col(m)), sum)
data.frame(group = LETTERS[seq_along(vals)], sum = vals)
or (@Frank)
data.frame(vals = tapply(as.matrix(m),
                         LETTERS[row(m) + col(m) - 1], sum))
as.matrix() is required to make split() work correctly ...
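A quick way to see why, assuming m is the data frame read in above:
# split() on a data.frame splits whole rows into sub-data.frames,
# while split() on a matrix treats it as a plain vector of cells,
# which is what the diagonal grouping needs:
str(split(as.matrix(m), row(m) + col(m)))   # one vector of counts per anti-diagonal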
Another aggregate variation, avoiding the formula interface, which actually complicates matters in this instance:
aggregate(list(Sum=unlist(dat)), list(Group=LETTERS[c(row(dat) + col(dat))-1]), FUN=sum)
# Group Sum
#1 A 8
#2 B 13
#3 C 13
#4 D 28
#5 E 10
#6 F 18
#7 G 2
Another solution, using bgoldst's definitions of df1 and df2 (given in the answer below):
sapply(unique(c(as.matrix(df2))),
function(x) sum(df1[df2 == x]))
Gives
#A B C D E F G
#8 13 13 28 10 18 2
(Not quite the format that you wanted, but maybe it's ok...)
Here's a solution using stack(), and aggregate(), although it requires the second data.frame contain character vectors, as opposed to factors (could be forced with lapply(df2,as.character)):
df1 <- data.frame(a=c(8,1,2,5), b=c(12,6,5,3), c=c(5,4,4,7), d=c(14,3,11,2) );
df2 <- data.frame(a=c('A','B','C','D'), b=c('B','C','D','E'), c=c('C','D','E','F'), d=c('D','E','F','G'), stringsAsFactors=F );
aggregate(sum~group,data.frame(sum=stack(df1)[,1],group=stack(df2)[,1]),sum);
## group sum
## 1 A 8
## 2 B 13
## 3 C 13
## 4 D 28
## 5 E 10
## 6 F 18
## 7 G 2

Remove duplicate column combinations from a dataframe in R

I want to remove duplicate combinations of sessionid, qf and qn from the following data
sessionid qf qn city
1 9cf571c8faa67cad2aa9ff41f3a26e38 cat biddix fresno
2 e30f853d4e54604fd62858badb68113a caleb amos
3 2ad41134cc285bcc06892fd68a471cd7 daniel folkers
4 2ad41134cc285bcc06892fd68a471cd7 daniel folkers
5 63a5e839510a647c1ff3b8aed684c2a5 charles pierce flint
6 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
7 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
8 b3a1476aa37ae4b799495256324a8d3d carrie mascorro brea
9 bd9f1404b313415e7e7b8769376d2705 fred morales las+vegas
10 b50a610292803dc302f24ae507ea853a aurora lee
11 fb74940e6feb0dc61a1b4d09fcbbcb37 andrew price yorkville
I read in the data as a data.frame and call it mydata. Here is the code I have so far, but I need to know how to first sort the data.frame correctly, secondly remove the duplicate combinations of sessionid, qf, and qn, and lastly graph a histogram of the number of characters in the column qf.
sortDATA <- function(name)
{
  # sort the data by session id, first name, then last name
  sort1.name <- name[order("sessionid", "qf", "qn"), ]
  # create a vector of first-name lengths
  sname <- nchar(sort1.name$qf)
  hist(sname)
}
thanks!
duplicated() has a method for data.frames, which is designed for just this sort of task:
df <- data.frame(a = c(1:4, 1:4),
b = c(4:1, 4:1),
d = LETTERS[1:8])
df[!duplicated(df[c("a", "b")]),]
# a b d
# 1 1 4 A
# 2 2 3 B
# 3 3 2 C
# 4 4 1 D
In your example the repeated rows were repeated in their entirety, so unique() also works with data.frames:
udf <- unique( my.data.frame )
As for sorting... joran just posted the answer.
To address your sorting problem, first read in your example data:
dat <- read.table(text = " sessionid qf qn city
1 9cf571c8faa67cad2aa9ff41f3a26e38 cat biddix fresno
2 e30f853d4e54604fd62858badb68113a caleb amos NA
3 2ad41134cc285bcc06892fd68a471cd7 daniel folkers NA
4 2ad41134cc285bcc06892fd68a471cd7 daniel folkers NA
5 63a5e839510a647c1ff3b8aed684c2a5 charles pierce flint
6 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
7 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
8 b3a1476aa37ae4b799495256324a8d3d carrie mascorro brea
9 bd9f1404b313415e7e7b8769376d2705 fred morales las+vegas
10 b50a610292803dc302f24ae507ea853a aurora lee NA
11 fb74940e6feb0dc61a1b4d09fcbbcb37 andrew price yorkville ",sep = "",header = TRUE)
and then you can use arrange from plyr,
arrange(dat,sessionid,qf,qn)
or using base functions,
with(dat,dat[order(sessionid,qf,qn),])
It works if you use duplicated twice:
> df
a b c d
1 1 2 A 1001
2 2 4 B 1002
3 3 6 B 1002
4 4 8 C 1003
5 5 10 D 1004
6 6 12 D 1004
7 7 13 E 1005
8 8 14 E 1006
> df[!(duplicated(df[c("c","d")]) | duplicated(df[c("c","d")], fromLast = TRUE)), ]
a b c d
1 1 2 A 1001
4 4 8 C 1003
7 7 13 E 1005
8 8 14 E 1006
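Putting the pieces of the original task together (a sketch using the column names from the question; mydata is assumed to be the data frame that was read in):
# sort by session id, first name, then last name
mydata_sorted <- mydata[order(mydata$sessionid, mydata$qf, mydata$qn), ]

# keep only the first occurrence of each sessionid/qf/qn combination
mydata_unique <- mydata_sorted[!duplicated(mydata_sorted[c("sessionid", "qf", "qn")]), ]

# histogram of the number of characters in the first names
hist(nchar(as.character(mydata_unique$qf)))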
