How to merge tables and fill the empty cells in the meantime in R?

Assume there are two tables a and b.
Table a:
ID AGE
1 20
2 empty
3 40
4 empty
Table b:
ID AGE
2 25
4 45
5 60
How can I merge the two tables in R so that the resulting table becomes:
ID AGE
1 20
2 25
3 40
4 45

You could try a data.table update join, which fills AGE in a from the matching rows of b:
library(data.table)
setkey(setDT(a), ID)[b, AGE:= i.AGE][]
# ID AGE
#1: 1 20
#2: 2 25
#3: 3 40
#4: 4 45
data
a <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
b <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
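If you prefer dplyr, a minimal sketch of the same fill with rows_patch (assuming dplyr >= 1.1.0 for the unmatched argument; rows_patch only overwrites NA values in a):
library(dplyr)
# fill the NA ages in a from matching IDs in b; rows of b whose ID is
# not in a (here ID 5) are ignored rather than appended
rows_patch(a, b, by = "ID", unmatched = "ignore")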

Assuming you have NA in every position of the first table where you want to use the second table's age values, you can use rbind and na.omit.
Example
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
na.omit(rbind(x,y))
This results in what you're after (although unordered, and I assume you just forgot ID 5):
ID AGE
1 20
3 40
2 25
4 45
5 60
EDIT
If you want to merge two different data.frames and keep their columns, it's a different thing. You can use merge to achieve this.
Here are two data frames with different columns:
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA), COUNTY=c(1,2,3,4))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60), STATE=c('CA','CA','IL'))
Add them together into one data.frame
res <- merge(x, y, by='ID', all=T)
giving us
ID AGE.x COUNTY AGE.y STATE
1 20 1 NA <NA>
2 NA 2 25 CA
3 40 3 NA <NA>
4 NA 4 45 CA
5 NA NA 60 IL
Then massage it into the form we want
idx <- which(is.na(res$AGE.x)) # find missing rows in x
res$AGE.x[idx] <- res$AGE.y[idx] # replace them with y's values
names(res)[names(res) == 'AGE.x'] <- 'AGE' # rename merged column AGE.x to AGE
subset(res, select=-AGE.y) # dump the AGE.y column
Which gives us
ID AGE COUNTY STATE
1 20 1 <NA>
2 25 2 CA
3 40 3 <NA>
4 45 4 CA
5 60 NA IL
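The AGE.x/AGE.y massaging can also be done with dplyr::coalesce; a small sketch using the merged res from above:
library(dplyr)
res$AGE <- coalesce(res$AGE.x, res$AGE.y)  # first non-NA of AGE.x and AGE.y
res[, c('ID', 'AGE', 'COUNTY', 'STATE')]   # drop the AGE.x/AGE.y pair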

The package in the other answer will work. Here is a dirty hack if you don't want to use the package:
x$AGE[is.na(x$AGE)] <- y$AGE[y$ID %in% x$ID]
> x
ID AGE
1 1 20
2 2 25
3 3 40
4 4 45
But, I would use the package to avoid the clunky code.
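If you want the same one-liner idea but matched on ID rather than relying on row order, a base R sketch:
miss <- is.na(x$AGE)
x$AGE[miss] <- y$AGE[match(x$ID[miss], y$ID)]  # look up each missing ID in y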

Related

How to remove duplicated rows from a dataframe `df`, but only when a specific column of `df` is NA?

I have a data frame df with 5 columns. Region.Label indicates the Region where the study was performed, Sample.Label is a specific area within that Region where I counted birds, Sp is the bird species I found in that specific area, Distance is the distance between the bird and me, and Effort is the time I spent in the area looking for birds. When Distance is NA, it means that the species was not observed in that area. As an example of the data frame I have:
df <- data.frame(Region.Label=c("A","A","A","A","A","A","A","A"),
                 Sample.Label=c(1,1,1,2,2,2,3,3),
                 Sp=c("ZZ","ZZ","BB","ZZ","BB","CC","ZZ","BB"),
                 Distance=c(2,7,NA,NA,NA,6,NA,NA),
                 Effort=c(99,99,99,87,87,87,72,72))
df$Region.Label <- as.factor(df$Region.Label)
df$Sample.Label <- as.numeric(df$Sample.Label)
df
Region.Label Sample.Label Sp Distance Effort
1 A 1 ZZ 2 99
2 A 1 ZZ 7 99
3 A 1 BB NA 99
4 A 2 ZZ NA 87
5 A 2 BB NA 87
6 A 2 CC 6 87
7 A 3 ZZ NA 72
8 A 3 BB NA 72
Here, I would like to remove the rows which have NA in the column df$Distance, as it indicates that the species was not observed in the area, but only when the row with the NA is a duplicate of another row once the column df$Sp is excluded.
I would like to obtain this:
Region.Label Sample.Label Sp Distance Effort
1 A 1 ZZ 2 99
2 A 1 ZZ 7 99
3 A 2 CC 6 87
4 A 3 ZZ NA 72
In this example, I do not remove df[7,] because the Sample.Label is different from the previous rows. I remove df[8,] because df[7,] and df[8,] are equal except for df$Sp.
Does anyone know how to get what I want?
Perhaps a group-by operation would help: grouped by 'Region.Label', 'Sample.Label', and 'Effort', filter the non-NA elements of 'Distance' if there are any non-NA elements, or else get the first row (row_number() == 1).
library(dplyr)
df %>%
  group_by(Region.Label, Sample.Label, Effort) %>%
  filter(if (all(is.na(Distance))) row_number() == 1 else !is.na(Distance)) %>%
  ungroup()
-output
# A tibble: 4 × 5
Region.Label Sample.Label Sp Distance Effort
<fct> <dbl> <chr> <dbl> <dbl>
1 A 1 ZZ 2 99
2 A 1 ZZ 7 99
3 A 2 CC 6 87
4 A 3 ZZ NA 72
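The same per-group logic can be sketched with data.table as well (note that the grouping columns come first in the result, so the column order differs slightly):
library(data.table)
setDT(df)[, if (all(is.na(Distance))) .SD[1] else .SD[!is.na(Distance)],
          by = .(Region.Label, Sample.Label, Effort)]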

How to combine values with a reference table in R?

This question is very similar to these two posts: How to compare values with a reference table in R?
and Combine Table with reference to another table in R. However, mine is more complicated:
I have two data frames:
> df1
name value1 value2
1 applefromJapan 2 8
2 applesfromJapan 3 9
3 applenotfromUS 4 10
4 pearsgoxxJapan 5 11
5 bananaxxeeChina 6 12
> df2
name value1 value2
1 applefromJapan 33 1
2 watermeleonnotfromUS 34 2
3 applesfromJapan 35 3
4 pearfromChina 36 4
5 pearfromphina 37 5 # a one-letter difference should not cause a problem
and a reference table:
> ref.df
fruit country name seller
1 1:5 10:14 appleJapan John
2 1:5 10:14 pearsJapan Mike
3 1:6 11:15 applesJapan Nicole
4 1:6 11:12 bananaUS Amy
5 1:4 9:13 pearChina Jenny
6 1:5 13:14 appleUS Mark
7 1:10 18:22 watermeleonchina James
The reference table works as follows:
the name column in ref.df contains the full name; df1 (or df2) can obtain this name by trimming its own name column according to the character positions given in the fruit and country columns.
My desired output table is:
#output
name value1 value2 value3 value4
1 appleJapan 2 8 33 1 # ended up using the name from the ref.df
2 applesJapan 3 9 34 2 # merge df1 and df2
3 appleUS 4 10 NA NA # NA for values that do not exist
4 pearsJapan 5 11 NA NA
5 watermeleonUS NA NA 34 2
6 pearChina NA NA 73 9 # a name with only a one-letter difference is considered a typo, so we just sum the values
# bananaxxeeChina is not here because it is not referenced by the ref.df
*There are thousands of rows in the real dataset (~12,000 rows on average in each of the 22 dfs, ~200 rows in the ref.df); this is only the first couple of rows (with some alteration). So I think it is better to compare each df to the ref.df and then combine the dfs? But how can I achieve this?
Codes for producing data:
https://codeshare.io/2p179V
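No answer is recorded here, but as a starting point, a minimal sketch of just the name-trimming step. The helper trim_name is hypothetical and assumes the fruit and country columns hold "start:end" character positions, as in ref.df; the typo handling and the final merge are left out.
trim_name <- function(name, fruit_rng, country_rng) {
  f <- as.integer(strsplit(fruit_rng, ":")[[1]])   # e.g. "1:5"   -> c(1, 5)
  k <- as.integer(strsplit(country_rng, ":")[[1]]) # e.g. "10:14" -> c(10, 14)
  paste0(substr(name, f[1], f[2]), substr(name, k[1], k[2]))
}
trim_name("applefromJapan", "1:5", "10:14")  # "appleJapan"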

perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, after a group_by and summarize, I have a data frame that I want to do some further manipulation on by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
                   group=as.factor(rep(c("a","b","c"),2)),
                   sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
                      total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I inserted the fraction in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table to get the total for each run of df:
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
First, you want to merge the total values into your df:
df2 <- merge(df, total, by = "run")
then you can call mutate (%<>% is the compound-assignment pipe from magrittr):
df2 %<>% mutate(percent = sum / total)
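For reference, the same join-then-mutate idea as a single pipeline; a sketch assuming the df and total from the question:
library(dplyr)
df %>%
  left_join(total, by = "run") %>%   # bring in each run's total
  mutate(percent = sum / total) %>%  # divide the sum column by it
  select(-total)                     # drop the helper column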
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Merge, cbind: How to merge better? [duplicate]

This question already has answers here:
R: Adding NAs into Data Frame
(5 answers)
Closed 6 years ago.
I want to merge multiple vectors into a data frame. There are two variables, city and id, that are going to be used for matching the vectors to the data frame.
df <- data.frame(array(NA, dim =c(10*50, 2)))
names(df)<-c("city", "id")
df[,1]<-rep(1:50, each=10)
df[,2]<-rep(1:10, 50)
I created a data frame like this. Into this data frame, I want to merge 50 vectors, one for each of the 50 cities. The problem is that each city only has 6 observations, so each city will have 4 NAs.
To give you an example, city 1 data looks like this:
set.seed(1234)
cbind(city=1,id=sample(1:10,6),obs=rnorm(6))
I have data for 50 cities and I want to merge them into one column of df. I have tried the following code:
for(i in 1:50){
  citydata <- cbind(city=i, id=sample(1:10,6), obs=rnorm(6)) # each city's data
  df <- merge(df, citydata, by=c("city", "id"), all=TRUE)    # merge into df
}
But if I run this, the loop will show warnings like this:
In merge.data.frame(df, citydata, by = c("city", "id"), ... :
column names ‘obs.x’, ‘obs.y’ are duplicated in the result
and it will create 50 columns, instead of one long column.
How can I merge cbind(city=i, id=sample(1:10,6), obs=rnorm(6)) into df as one nice, long column? It seems neither cbind nor merge is the way to go.
Since there are 50 citydata objects (each with 6 rows), I can rbind them into one long data frame and use the data.table approach or the expand.grid + merge approach, as Philip and Jaap suggested.
I wonder if I can merge each citydata through a loop one by one, instead of rbinding them and merging the result into df.
data.table is good for this:
library(data.table)
df <- data.table(df)
> df
city id
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 1 5
---
496: 50 6
497: 50 7
498: 50 8
499: 50 9
500: 50 10
I'm using CJ instead of your for loop to make some dummy data. CJ cross-joins each column against each value of each other column, so it makes a two-column table with each possible pair of values of city and id. The [, obs := rnorm(.N)] command adds a third column that draws random values (without recycling them, as it would if they were inside the CJ); .N means "number of rows of this table" in this context.
citydata <- CJ(city=1:50,id=1:6)[,obs:=rnorm(.N)]
> citydata
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
296: 50 2 0.30592659
297: 50 3 -0.44989646
298: 50 4 0.05359738
299: 50 5 -0.57494269
300: 50 6 0.09565473
setkey(df,city,id)
setkey(citydata,city,id)
As these two tables have the same key columns, the following looks up rows of df by the key columns in citydata, then defines obs in df by looking up the value in citydata. Therefore the resulting object is the original df, but with obs defined wherever it was defined in citydata:
df[citydata,obs:=i.obs]
> df
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
496: 50 6 0.09565473
497: 50 7 NA
498: 50 8 NA
499: 50 9 NA
500: 50 10 NA
In base R you can do this with a combination of expand.grid and merge:
citydata <- expand.grid(city=1:50,id=1:6)
citydata$obs <- rnorm(nrow(citydata))
res <- merge(df, citydata, by = c("city","id"), all.x = TRUE)
which gives:
> head(res,12)
city id obs
1: 1 1 -0.3121133
2: 1 2 -1.3554576
3: 1 3 -0.9056468
4: 1 4 -0.6511869
5: 1 5 -1.0447499
6: 1 6 1.5939187
7: 1 7 NA
8: 1 8 NA
9: 1 9 NA
10: 1 10 NA
11: 2 1 0.5423479
12: 2 2 -2.3663335
A similar approach with dplyr and tidyr:
library(dplyr)
library(tidyr)
res <- crossing(city=1:50, id=1:6) %>%
  mutate(obs = rnorm(n())) %>%
  right_join(., df, by = c("city","id"))
which gives:
> res
Source: local data frame [500 x 3]
city id obs
(int) (int) (dbl)
1 1 1 -0.5335660
2 1 2 1.0582001
3 1 3 -1.3888310
4 1 4 1.8519262
5 1 5 -0.9971686
6 1 6 1.3508046
7 1 7 NA
8 1 8 NA
9 1 9 NA
10 1 10 NA
.. ... ... ...
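Coming back to the original wish to fill df inside a loop, one city at a time, without rbinding first, here is a base R sketch (not from the answers above) that assumes df is the plain data frame built at the top of the question:
set.seed(1234)
df$obs <- NA_real_
for (i in 1:50) {
  citydata <- data.frame(city = i, id = sample(1:10, 6), obs = rnorm(6))
  rows <- match(paste(citydata$city, citydata$id),
                paste(df$city, df$id))  # locate the 6 matching rows of df
  df$obs[rows] <- citydata$obs
}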

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable that I need to apply to columns w,y,z.
y <- qnorm((rank(x, na.last="keep") - 0.5) / sum(!is.na(x)))
For example, if I wanted to run this function on "column w" to get the output column appended to dataframe "m" then:
m$w_n <- qnorm((rank(m$w, na.last="keep") - 0.5) / sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame; the one I have is much larger, and I have more letter columns than w, y, z to run this function on.
Thanks!
There is probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
id w z w_n z_n
1 1 39 40 -0.2104284 -1.3829941
2 2 20 26 1.3829941 1.3829941
3 3 43 11 0.2104284 0.6744898
4 4 4 37 -1.3829941 0.2104284
5 5 36 24 0.6744898 -0.6744898
6 6 27 14 -0.6744898 -0.2104284
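The same result can be sketched with dplyr::across (dplyr >= 1.0.0), which appends the transformed columns with an _n suffix in one step, reusing transCols from above:
library(dplyr)
df %>% mutate(across(-id, transCols, .names = "{.col}_n"))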
