Merging two data frames with different sizes and missing values - r

I'm having a problem merging two data frames in R.
The first one consists of 103731 obs of 6 variables. The variable I have to merge on has 77111 unique values; the remaining rows are NAs, coded as 0. The second data frame contains the frequency of each of those values plus the frequency of the NAs, so it is a frame of 77112 obs of 2 variables.
The result I need is the first data frame joined with the frequency of the merging variable: a df of 103731 obs in which every row carries the frequency of its value of the merging variable (so the frequency is repeated when freq > 1, and also for every NA/0 row).
Can anybody help me?
The result I'm getting now is a data frame of 1894919 obs, using:
tot <- merge(df1, df2, by = "mergingVar", all = FALSE, sort = FALSE)
I've also played a lot with the 'all =' argument, but none of the variations gave the right df.

Why don't you just take the frequency table of your first data frame?
a <- data.frame(a = c(NA, NA, 2,2,3,3,3))
data.frame(table(a, useNA = 'ifany'))
     a Freq
1    2    2
2    3    3
3 <NA>    2
Or use mutate with ddply from plyr:
ddply(a, .(a), mutate, freq = length(a))
     a freq
1    2    2
2    2    2
3    3    3
4    3    3
5    3    3
6   NA    2
7   NA    2
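If you do need those counts attached back onto the big data frame (the original goal), one option is to build the frequency table and then left-join it, keeping every row of df1. A minimal sketch, assuming the join column is called mergingVar as in the question:
freq <- as.data.frame(table(df1$mergingVar, useNA = "ifany"), stringsAsFactors = FALSE)
names(freq) <- c("mergingVar", "freq")
# all.x = TRUE keeps all 103731 rows of df1; each row gains the frequency of its value
# (if the "NAs" are really coded as 0, they behave like any other value here)
tot <- merge(df1, freq, by = "mergingVar", all.x = TRUE, sort = FALSE)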

Related

Keep the most frequent columns into a data.frame

Having a data.frame like this:
df <- data.frame(id = c(1,2,3), stock = c(3,1,4), bill = c(1,0,1), bear = c(3,2,5))
How is it possible to sum all columns except the id column and keep the two columns with the highest sums?
Example of expected output
data.frame(id = c(1,2,3), stock = c(3,1,4), bear = c(3,2,5))
In base R, we can use colSums to sum the columns, sort the sums, and select the names of the 2 highest-valued columns using tail.
cbind(df[1], df[names(tail(sort(colSums(df[-1])), 2))])
# id stock bear
#1 1 3 3
#2 2 1 2
#3 3 4 5
Another base R possibility could be:
cbind(df[1], df[-1][rank(-colSums(df[-1])) %in% 1:2])
id stock bear
1 1 3 3
2 2 1 2
3 3 4 5
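If the number of columns to keep might change, the same idea generalizes to a small helper (the name top_n_cols is just illustrative):
# Keep the id column plus the n columns with the largest sums,
# preserving the original column order (ties would need extra handling).
top_n_cols <- function(dat, n = 2) {
  sums <- colSums(dat[-1])
  cbind(dat[1], dat[-1][rank(-sums) <= n])
}
top_n_cols(df, 2)
#   id stock bear
# 1  1     3    3
# 2  2     1    2
# 3  3     4    5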

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 1 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1) - 1) / 2                        # number of column pairs after IID
for (i in 1:pairs) {
  refs <- unique(c(df1[, 2*i], df1[, 2*i + 1]))     # pooled values of the pair
  df1[, 2*i]     <- match(df1[, 2*i], refs)
  df1[, 2*i + 1] <- match(df1[, 2*i + 1], refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
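The same recoding can also be written by looping over the starting column of each pair with lapply; a sketch under the same assumption that the paired columns sit side by side after IID (run it on a fresh copy of df1):
# For each pair, pool the unique values of both columns and recode by position.
pair_starts <- seq(2, ncol(df1), by = 2)
recoded <- lapply(pair_starts, function(i) {
  refs <- unique(c(df1[[i]], df1[[i + 1]]))
  data.frame(match(df1[[i]], refs), match(df1[[i + 1]], refs))
})
df1[-1] <- do.call(cbind, recoded)   # original column names are kept
df1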
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
df_m <- melt(df1, id.vars = "IID")
setDT(df_m)[, id := .GRP, by = .(gsub("(.*).", "\\1", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 1 5 9 8
This should also be easily expandable to multiple groups of columns. I.e. if you have L3. it should work with that as well.
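The gsub() call is what ties the two columns of a pair together: it strips the trailing index from each variable name, so L1.1 and L1.2 share one grouping key before .GRP assigns the IDs. A quick illustration of that key:
gsub("(.*).", "\\1", c("L1.1", "L1.2", "L2.1", "L2.2"))
# [1] "L1." "L1." "L2." "L2."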

Finding the top values in data frame using r

How can I find the 5 highest values of a column in a data frame?
I tried the order() function but it gives me only the indices of the rows, whereas I need the actual data from the column. Here's what I have so far:
tail(order(DF$column, decreasing=TRUE),5)
You need to pass the result of order back to DF:
DF <- data.frame(column = 1:10,
                 names = letters[1:10])
order(DF$column)
# [1]  1  2  3  4  5  6  7  8  9 10
head(DF[order(DF$column),],5)
# column names
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
You're correct that order just gives the indices. You then need to pass those indices to the data frame, to pick out the rows at those indices.
Also, as mentioned in the comments, you can use head instead of tail with decreasing = TRUE if you'd like, but that's a matter of taste.
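If you only need the values themselves rather than whole rows, you can also sort the column directly; a short sketch with the same DF:
# Five largest values of the column (values only, not rows)
head(sort(DF$column, decreasing = TRUE), 5)
# [1] 10  9  8  7  6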

How to match 1 column to 2 columns?

I'm trying to match numbers from one column to numbers in two other columns. I can do this just fine when matching to only a single column, but have problems extending to two columns. Here is what I am doing:
I have 2 dataframes, df1:
number value
1
2
3
4
5
and df2:
number_a number_b value
       3       NA     3
      NA        1     5
       5       NA     1
      NA        4     2
      NA        2     4
What I want to do is match column "number" from df1 to EITHER "number_a" or number_b" in df2, then insert "value" from df2 into "value" of df1, to give the result df1 as:
number value
1 5
2 4
3 3
4 2
5 1
My approach is to use
df1$value <- df2$value[match(df1$number, df2$number_a)]
or
df1$value <- df2$value[match(df1$number, df2$number_b)]
which yields, respectively, for df1
number value
1 NA
2 NA
3 3
4 NA
5 1
and
number value
1 5
2 4
3 NA
4 2
5 NA
However, I can't seem to fill in all of the "value" column in df1 using this approach. How can I match "number" to both "number_a" and "number_b" in one fell swoop? I tried
df1$value <- df2$value[match(df1$number, df2$number_a:number_b)]
but that didn't work.
Thanks!
Easier solution:
df2$number <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)
If you're not familiar with ifelse, it works with vectors in the form:
ifelse(Condition, ValueIfTrue, ValueIfFalse)
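The same ifelse() idea also works directly on the two match() results from the question, so you never have to build the combined column; a sketch assuming df2 has NA in whichever of number_a / number_b is absent:
# Look the number up in number_a first, fall back to number_b where that was NA.
m_a <- match(df1$number, df2$number_a)
m_b <- match(df1$number, df2$number_b)
df1$value <- df2$value[ifelse(is.na(m_a), m_b, m_a)]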
I am a newbie to R (coming from several years with C). Was trying out the suggestions and I thought I would paste what I came up with:
# Assuming either 'number_a' or 'number_b' is valid (the other is NA):
# combine them into a new column 'number' and drop the original columns
df2 <- transform(df2, number = ifelse(is.na(number_a), number_b, number_a))[-c(1:2)]
# Combine the two data frames by the column 'number'
df <- merge(df1, df2, by = "number")
df
number value
1 5
2 4
3 3
4 2
5 1

aggregate over several variables in r

I have a rather large dataset in long format where I need to count the number of instances of each ID due to two different variables, A & B. The same person can be represented in multiple rows, due to either A or B. Counting the number of instances of each ID is not too hard, but I also need to count the instances due to A and due to B, and return these counts as variables in the dataset.
Regards,
//Mi
The ddply() function from the package plyr lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occur (using nrow()), and then put those counts back together nicely.
Using wkmor1's df:
library(plyr)
x <- ddply(.data = df, .var = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))
OK, given the interpretations I see, the fastest and easiest solution is:
df$IDCount <- ave(df$ID, df$ID, FUN = length)
Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
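For comparison, the same two counts can be added with ave() alone, without building and merging separate tables; a sketch starting from the df created on the first line above:
# Rows per ID, and rows per ID/GRP combination, attached directly as new columns.
df$ID.FREQ  <- ave(seq_along(df$ID), df$ID, FUN = length)
df$GRP.FREQ <- ave(seq_along(df$ID), df$ID, df$GRP, FUN = length)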
