Merge data frames with different numbers of rows - R

I am trying to merge together 8 dataframes into one, matching against the row names.
Examples of the dataframes:
DF1
           Arable and Horticulture
Acer                           100
Achillea                        90
Aesculus                        23
Alliaria                         3
Allium                          56
Anchusa                        299

DF2
           Improved Grassland
Acer                       12
Alliaria                    3
Allium                     50
Brassica                   23
Calystegia                299
Campanula                  29
And so on for a few hundred rows for different plants and 8 columns of different habitats.
What I want the merged frame to look like:
           Arable and Horticulture Improved Grassland
Acer                           100                 12
Achillea                        90                  0
Aesculus                        23                  0
Alliaria                         3                  3
Allium                          56                 50
Anchusa                        299                  0
Brassica                         0                 23
Calystegia                       0                299
Campanula                        0                 29
I tried merging
PolPerGen <- merge(DF1, DF2, all=TRUE)
But that does not match on the row names and drops them entirely in the output:
   Arable and Horticulture Improved Grassland
1                      100                 NA
2                       90                 NA
3                       23                 NA
4                        2                 NA
5                       56                 NA
6                      299                 NA
7                       NA                 12
8                       NA                  3
9                       NA                 50
10                      NA                 23
11                      NA                299
12                      NA                 29
I am completely out of ideas, any thoughts?

Your dataset is:
dat1 = data.frame("Arable and Horticulture" = c(100, 90, 23, 3, 56, 299),
                  row.names = c("Acer", "Achillea", "Aesculus", "Alliaria", "Allium", "Anchusa"))
dat2 = data.frame("Improved Grassland" = c(12, 3, 50, 23, 299, 29),
                  row.names = c("Acer", "Achillea", "Allium", "Brassica", "Calystegia", "Campanula"))
As @Vinícius Félix suggested, first convert the row names to a column:
library(tibble)
dat1 = rownames_to_column(dat1, "Plants")
dat2 = rownames_to_column(dat2, "Plants")
Then let's join both datasets:
library(dplyr)
dat = full_join(dat1, dat2, by = "Plants")
And replace the NAs with 0:
dat = dat %>% replace(is.na(.), 0)
Plants Arable.and.Horticulture Improved.Grassland
1 Acer 100 12
2 Achillea 90 3
3 Aesculus 23 0
4 Alliaria 3 0
5 Allium 56 50
6 Anchusa 299 0
7 Brassica 0 23
8 Calystegia 0 299
9 Campanula 0 29
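The question mentions 8 data frames; a minimal sketch of extending the above to any number of frames, assuming they are collected in a list (dat1/dat2 as above), converts the row names in bulk and folds full_join over the list with Reduce:

```r
library(tibble)
library(dplyr)

# dat1/dat2 as above; with the real data you would list all 8 frames here
dat1 = data.frame("Arable and Horticulture" = c(100, 90, 23, 3, 56, 299),
                  row.names = c("Acer", "Achillea", "Aesculus", "Alliaria", "Allium", "Anchusa"))
dat2 = data.frame("Improved Grassland" = c(12, 3, 50, 23, 299, 29),
                  row.names = c("Acer", "Achillea", "Allium", "Brassica", "Calystegia", "Campanula"))

# Turn the row names of every frame into a "Plants" column
frames <- lapply(list(dat1, dat2), rownames_to_column, "Plants")

# Fold the pairwise full join across the whole list
merged <- Reduce(function(x, y) full_join(x, y, by = "Plants"), frames)

# Replace the NAs produced by the join with 0
merged[is.na(merged)] <- 0
```

Reduce applies the two-way join pairwise, so plants present in only some habitats accumulate as the list grows.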


summarizing temperature data based on a vector of temperature thresholds

I have a data frame with daily average temperature data in it, structured like so:
'data.frame': 4666 obs. of 6 variables:
$ Site : chr "EB" "FFCE" "IB" "FFCE" ...
$ Date : Date, format: "2013-01-01" "2013-01-01" "2013-01-01" "2014-01-01" ...
$ Day : int 1 1 1 1 1 1 1 1 1 1 ...
$ Year : int 2013 2013 2013 2014 2014 2014 2014 2015 2015 2015 ...
$ Month: int 1 1 1 1 1 1 1 1 1 1 ...
$ Temp : num 28.5 28.3 28.3 27 27.8 ...
I am attempting to produce a summary table that totals the number of days in a year per site above certain temperature thresholds, e.g. 25°C, 26°C.
I can achieve this manually using dplyr like so:
Days_above = Site_Daily_average %>%
  group_by(Year, Site) %>%
  summarise("23" = sum(Temp > 23), "24" = sum(Temp > 24), "25" = sum(Temp > 25),
            "26" = sum(Temp > 26), "27" = sum(Temp > 27), "28" = sum(Temp > 28),
            "29" = sum(Temp > 29), "30" = sum(Temp > 30), "31" = sum(Temp > 31),
            "ABOVE THRESHOLD" = sum(Temp > maxthreshold)) %>%
  as.data.frame()
Which produces a table like so :
Year Site 23 24 25 26 27 28 29 30 31 ABOVE THRESHOLD
1 2012 EB 142 142 142 91 64 22 0 0 0 0
2 2012 FFCE 238 238 238 210 119 64 0 0 0 0
3 2012 IB 238 238 238 218 138 87 1 0 0 0
4 2013 EB 115 115 115 115 115 109 44 0 0 0
5 2013 FFCE 223 223 216 197 148 114 94 0 0 0
6 2013 IB 365 365 365 348 299 194 135 3 0 0
...
However, as you can see, the code is fairly verbose. The problem I am having is producing this same output for a whole sequence of temperature thresholds, i.e. Tempclasses = seq(16, 32, 0.25).
As you can see, that would take a long time to type out manually. I feel like this is a very simple calculation and there should be a way to get dplyr to take each value in the sequence vector, apply this function, and produce the output as one complete table. Sorry if that was unclear, as I am relatively new to R.
Any suggestions would be welcome, thank you.
Here's a tidyverse approach, likewise using mtcars for illustration:
library(tidyverse)
mtcars %>%
  mutate(threshold = cut(mpg,
                         breaks = seq(10, max(mtcars$mpg) + 10, 5),
                         labels = seq(10, max(mtcars$mpg) + 5, 5))) %>%
  group_by(cyl, threshold) %>%
  tally %>%
  ungroup %>%
  complete(threshold, nesting(cyl), fill = list(n = 0)) %>%
  arrange(desc(threshold)) %>%
  group_by(cyl) %>%
  mutate(N_above = cumsum(n)) %>%
  select(-n) %>%
  arrange(cyl, threshold)
threshold cyl N_above
1 10 4 11
2 15 4 11
3 20 4 11
4 25 4 6
5 30 4 4
6 35 4 0
7 10 6 7
8 15 6 7
9 20 6 3
10 25 6 0
11 30 6 0
12 35 6 0
13 10 8 14
14 15 8 8
15 20 8 0
16 25 8 0
17 30 8 0
18 35 8 0
If you want the final data in wide format, add a spread at the end and remove the arrange:
... %>%
select(-n) %>%
spread(threshold, N_above)
cyl 10 15 20 25 30 35
1 4 11 11 11 6 4 0
2 6 7 7 3 0 0 0
3 8 14 8 0 0 0 0
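To run this kind of count directly against the fine-grained threshold sequence from the question, one possible sketch (the Site/Year/Temp column names are taken from the question; the toy values here are made up) expands each row against every threshold with expand_grid, counts exceedances per group, and widens:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for Site_Daily_average; column names assumed from the question
Site_Daily_average <- data.frame(
  Site = rep(c("EB", "FFCE"), each = 4),
  Year = 2013,
  Temp = c(17, 20, 25, 31, 18, 19, 26, 30)
)

thresholds <- seq(16, 32, 0.25)

# Pair every observation with every threshold, count Temp > threshold,
# then spread the thresholds out as columns
Days_above <- expand_grid(Site_Daily_average, threshold = thresholds) %>%
  group_by(Year, Site, threshold) %>%
  summarise(n_above = sum(Temp > threshold), .groups = "drop") %>%
  pivot_wider(names_from = threshold, values_from = n_above)
```

This gives one row per Year/Site and one column per threshold, without typing each comparison out by hand.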
As @dww commented, we can use cut to get the required format. I have tried this on the mtcars dataset, where we create ranges from 10 to 35 in steps of 5 for the mpg column.
df <- mtcars
df$group <- cut(df$mpg, seq(10, 35, 5))
and then we group by cyl and use table to get a count of how many of them fall in the respective buckets.
table(df$cyl, df$group)
#   (10,15] (15,20] (20,25] (25,30] (30,35]
# 4       0       0       5       2       4
# 6       0       4       3       0       0
# 8       6       8       0       0       0
Now, a value in the (10,15] bucket also lies below the upper bounds of the later buckets, hence the number in the (15,20] bucket should also include the number from the (10,15] bucket, and the number in the (20,25] bucket should include both of the previous numbers. Hence, we need a row-wise cumsum for this table.
t(apply(table(df$cyl, df$group), 1, cumsum))
#   (10,15] (15,20] (20,25] (25,30] (30,35]
# 4       0       0       5       7      11
# 6       0       4       7       7       7
# 8       6      14      14      14      14
For your case, the code would go
Site_Daily_average$group <- cut(Site_Daily_average$Temp, seq(16, 32, 0.25))
# and then use table to get the required answer
t(apply(table(Site_Daily_average$Year, Site_Daily_average$Site,
              Site_Daily_average$group), 1, cumsum))
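Note that table(Year, Site, group) is a three-dimensional array, so apply(..., 1, cumsum) will flatten the Site dimension, and a left-to-right cumsum counts days at or below each break rather than above it. A sketch that keeps one row per Year/Site pair and counts days strictly above each break (toy data; the column names are assumed from the question):

```r
# Toy stand-in for Site_Daily_average
Site_Daily_average <- data.frame(
  Site = rep(c("EB", "FFCE"), each = 4),
  Year = 2013,
  Temp = c(17, 20, 25, 31, 18, 19, 26, 30)
)

grp <- cut(Site_Daily_average$Temp, seq(16, 32, 0.25))

# One row per Year/Site combination, one column per temperature bucket
tab <- table(interaction(Site_Daily_average$Year, Site_Daily_average$Site,
                         drop = TRUE), grp)

# Days strictly above each break = row total minus cumulative count up to that bucket
above <- rowSums(tab) - t(apply(tab, 1, cumsum))
# above["2013.EB", 1:4] gives the EB counts above 16.25, 16.5, 16.75, 17: 4 4 4 3
```

interaction() collapses Year and Site into a single row key, so the result stays a 2-d matrix that can be cumulated row-wise.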

Loop over certain columns to replace NAs with 0 in a dataframe

I have spent a lot of time trying to write a loop to replace NAs with zeros for certain columns in a data frame and have not yet succeeded. I have searched and can't find a similar question.
df <- data.frame(A = c(2, 4, 6, NA, 8, 10),
                 B = c(NA, 10, 12, 14, NA, 16),
                 C = c(20, NA, 22, 24, 26, NA),
                 D = c(30, NA, NA, 32, 34, 36))
df
Gives me:
A B C D
1 2 NA 20 30
2 4 10 NA NA
3 6 12 22 NA
4 NA 14 24 32
5 8 NA 26 34
6 10 16 NA 36
I want to set NAs to 0 for only columns B and D. Using separate code lines, I could:
df$B[is.na(df$B)] <- 0
df$D[is.na(df$D)] <- 0
However, I want to use a loop because I have many variables in my real data set.
I cannot find a way to loop over only columns B and D so I get:
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
Essentially, I want to apply a loop using a variable list to a data frame:
varlist <- c("B", "D")
How can I loop over only certain columns in the data frame using a variable list to replace NAs with zeros?
Here is a tidyverse approach:
library(tidyverse)
df %>%
mutate_at(.vars = vars(B, D), .funs = funs(ifelse(is.na(.), 0, .)))
#output:
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
Basically you say that variables B and D should be changed by the supplied function, where . corresponds to the current column.
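In current dplyr, mutate_at() and funs() are superseded; a sketch of the same column-wise replacement using across() together with tidyr::replace_na():

```r
library(dplyr)
library(tidyr)

df <- data.frame(A = c(2, 4, 6, NA, 8, 10),
                 B = c(NA, 10, 12, 14, NA, 16),
                 C = c(20, NA, 22, 24, 26, NA),
                 D = c(30, NA, NA, 32, 34, 36))

varlist <- c("B", "D")

# Replace NAs with 0 only in the columns named in varlist
df2 <- df %>%
  mutate(across(all_of(varlist), ~ replace_na(.x, 0)))
```

all_of() takes the character vector of column names directly, so this slots straight into the varlist workflow from the question.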
Here's a base R one-liner
df[, varlist][is.na(df[, varlist])] <- 0
Using the zoo package, we can fill the selected columns:
library(zoo)
df[varlist]=na.fill(df[varlist],0)
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
In base R we can have
df[varlist]=lapply(df[varlist],function(x){x[is.na(x)]=0;x})
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36

Merge uneven data frames in R by common column and fill in empty elements by NA

These are examples of two dataframes I am working on. 'Claims' has fewer rows than 'lastaction'.
My attempts give the following errors.
newtable <- merge(claims, lastaction, by = "X", all = TRUE)
Error in `[<-.data.frame`(`*tmp*`, value, value = NA) :
  new columns would leave holes after existing columns
newtable <- merge(claims, lastaction, by.x = claims$X, by.y = lastaction$X, all = TRUE)
Error in fix.by(by.x, x) : 'by' must match numbers of columns
The merge function works fine for me. As both data frames share the column name X, it can be used for merging via by.
claims = data.frame(X = c(10,24,30,35,64,104),
TransactionDateTime = c('JUL-15','APR-17','SEP-15','JUL-15','APR-16','SEP-15'))
claims
# X TransactionDateTime
# 1 10 JUL-15
# 2 24 APR-17
# 3 30 SEP-15
# 4 35 JUL-15
# 5 64 APR-16
# 6 104 SEP-15
lastaction = data.frame(X = c(10,24,30,35,40,57), lastvalue = c(6,1,4,6,6,1),
Approvalmonth = c('15-OCT','17-JAN','16-MAR','15-OCT','15-SEP','17-JUN'),
lastvalue = c(0,1,0,0,0,1))
lastaction
# X lastvalue Approvalmonth lastvalue.1
# 1 10 6 15-OCT 0
# 2 24 1 17-JAN 1
# 3 30 4 16-MAR 0
# 4 35 6 15-OCT 0
# 5 40 6 15-SEP 0
# 6 57 1 17-JUN 1
merge(claims, lastaction, by = "X", all = TRUE)
# X TransactionDateTime lastvalue Approvalmonth lastvalue.1
# 1 10 JUL-15 6 15-OCT 0
# 2 24 APR-17 1 17-JAN 1
# 3 30 SEP-15 4 16-MAR 0
# 4 35 JUL-15 6 15-OCT 0
# 5 40 <NA> 6 15-SEP 0
# 6 57 <NA> 1 17-JUN 1
# 7 64 APR-16 NA <NA> NA
# 8 104 SEP-15 NA <NA> NA
dplyr's full_join works as well:
dplyr::full_join(claims, lastaction, by = 'X')
    X TransactionDateTime lastvalue Approvalmonth lastvalue.1
1  10              JUL-15         6        15-OCT           0
2  24              APR-17         1        17-JAN           1
3  30              SEP-15         4        16-MAR           0
4  35              JUL-15         6        15-OCT           0
5  64              APR-16        NA          <NA>          NA
6 104              SEP-15        NA          <NA>          NA
7  40                <NA>         6        15-SEP           0
8  57                <NA>         1        17-JUN           1

combining datasets with known identity variable

So let's take the following data:
set.seed(123)
A <- 1:10
age<- sample(20:50,10)
height <- sample(100:210,10)
df1 <- data.frame(A, age, height)
B <- c(1,1,1,2,2,3,3,5,5,5,5,8,8,9,10,10)
injury <- sample(letters[1:5],16, replace=T)
df2 <- data.frame(B, injury)
Now, we can merge the data using the following code:
df3 <- merge(df1, df2, by.x = "A", by.y = "B", all=T)
head(df3)
# A age height injury
# 1 1 28 206 e
# 2 1 28 206 d
# 3 1 28 206 d
# 4 2 43 149 e
# 5 2 43 149 d
# 6 3 31 173 d
But what I want in the new data frame is the injuries spread out as separate columns, one per occurrence.
So in this simple example we know that the maximum number of injuries per unique df2$B is 4, so we need 4 new columns.
But my data has an unknown maximum, so code is needed to work it out, something like
length(unique(df2$injury[df2$B]))
but that is also not correct syntax, as the output should equal 4.
I don't know where the letters are coming from in your sample output, because there are none in the variables in your sample input, but you can try something like:
library(splitstackshape)
dcast.data.table(getanID(df3, c("A", "age")),
                 A + age + height ~ .id, value.var = "injury")
## A age height 1 2 3 4
## 1: 1 28 206 4 3 3 NA
## 2: 2 43 149 4 3 NA NA
## 3: 3 31 173 3 3 NA NA
## 4: 4 44 161 NA NA NA NA
## 5: 5 45 111 3 2 1 4
## 6: 6 21 195 NA NA NA NA
## 7: 7 33 125 NA NA NA NA
## 8: 8 41 104 4 3 NA NA
## 9: 9 32 133 4 NA NA NA
## 10: 10 30 197 1 2 NA NA
This adds a secondary ID based on the first two columns and then spreads it to a wide format.
If you want to accomplish this using the tidyr package, I found it necessary to create an index variable:
library(dplyr)
library(tidyr)

df3 %>%
  group_by(A) %>%
  mutate(ind = row_number()) %>%
  spread(ind, injury)
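In current tidyr, spread() is superseded by pivot_wider(); a sketch of the same reshaping on a toy stand-in for df3 (the A/age/injury names are taken from the question):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the merged df3
df3 <- data.frame(A = c(1, 1, 1, 2, 2, 3),
                  age = c(28, 28, 28, 43, 43, 31),
                  injury = c("e", "d", "d", "e", "d", "d"))

# Index each injury within its A group, then spread the index into columns
wide <- df3 %>%
  group_by(A) %>%
  mutate(ind = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = ind, values_from = injury)
```

Groups with fewer injuries than the maximum are padded with NA automatically.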

Replace last NA of a segment of NAs in a column with last valid value

Here is a sample data frame:
> df = data.frame(rep(seq(0, 120, length.out=6), times = 2), c(sample(1:50, 4),
+ NA, NA, NA, sample(1:50, 5)))
> colnames(df) = c("Time", "Pat1")
> df
Time Pat1
1 0 33
2 24 48
3 48 7
4 72 8
5 96 NA
6 120 NA
7 0 NA
8 24 1
9 48 6
10 72 28
11 96 31
12 120 32
NAs which have to be replaced are identified by which and logical operators:
x = which(is.na(df$Pat1) & df$Time == 0)
I know the na.locf() command from zoo, but it replaces all NAs. How can I replace only the NAs at positions x in a multi-column df?
EDIT: Here is a link to my original dataset: link
And that's how far I get:
require(reshape2)
require(zoo)
pad.88 <- read.csv2("pad_88.csv")
colnames(pad.88) = c("Time", "Increment", "Side", 4:length(pad.88)-3)
attach(pad.88)
x = which(Time == 240 & Increment != 5)
pad.88 = pad.88[c(1:x[1], x[1]:x[2], x[2]:x[3], x[3]:x[4], x[4]:x[5], x[5]:x[6],x[6]:x[7], x[7]:x[8], x[8]:nrow(pad.88)),]
y = which(duplicated(pad.88))
pad.88$Time[y] = 0
pad.88$Increment[y] = Increment[x] + 1
z = which(is.na(pad.88[4:ncol(pad.88)] & pad.88$Time == 0), arr.ind=T)
a = na.locf(pad.88[4:ncol(pad.88)])
My next step is something like pat.cols[z] = a[z], which doesn't work.
This is how the result should look:
Time Increment Side 1 2 3 4 5 ...
150 4 0 27,478 24,076 27,862 20,001 25,261
165 4 0 27,053 24,838 27,231 20,001 NA
180 4 0 27,599 24,166 27,862 20,687 NA
195 4 0 27,114 23,403 27,862 20,001 NA
210 4 0 26,993 24,076 27,189 19,716 NA
225 4 0 26,629 24,21 26,221 19,887 NA
240 4 0 26,811 26,228 26,431 20,001 NA
0 5 1 26,811 26,228 26,431 20,001 25,261
15 5 1 ....
The last valid value in col 5 is 25,261. This value replaces the NA at Time 0/Col 5.
You can change it so that x records all the NA values and use the first and last from that to identify the locations you want.
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 NA
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
x <- which(is.na(df$Pat1))
df[rev(x)[1],"Pat1"] <- df[x[1]-1,"Pat1"]
df
Time Pat1
1 0 36
2 24 13
3 48 32
4 72 38
5 96 NA
6 120 NA
7 0 38
8 24 5
9 48 10
10 72 7
11 96 25
12 120 28
For the multi-column example use the same idea in a sapply call:
cbind(df[1], sapply(df[-1], function(x) {
  y <- which(is.na(x))
  x[rev(y)[1]] <- x[y[1] - 1]
  x
}))
Time Pat1 Pat2
1 0 41 42
2 24 8 30
3 48 3 41
4 72 14 NA
5 96 NA NA
6 120 NA NA
7 0 14 41
8 24 5 37
9 48 29 48
10 72 31 11
11 96 50 43
12 120 46 21
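For the original goal of filling only the Time == 0 gaps while leaving the other NAs in place, one sketch combines zoo's na.locf() with the index from the question: compute the carried-forward values for the whole column, then copy them back only at the indexed positions.

```r
library(zoo)

# Sample data shaped like the question's df
df <- data.frame(Time = rep(seq(0, 120, length.out = 6), times = 2),
                 Pat1 = c(33, 48, 7, 8, NA, NA, NA, 1, 6, 28, 31, 32))

x <- which(is.na(df$Pat1) & df$Time == 0)    # only the NA rows at Time == 0
filled <- na.locf(df$Pat1, na.rm = FALSE)    # last observation carried forward
df$Pat1[x] <- filled[x]                      # copy back at those positions only
```

The same pattern can be run per patient column with lapply over the patient columns, mirroring the pat.cols[z] = a[z] idea from the question.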
