Keep the most frequent columns into a data.frame

Having a data.frame like this:
df <- data.frame(id = c(1,2,3), stock = c(3,1,4), bill = c(1,0,1), bear = c(3,2,5))
How can I sum every column except the id column and keep only the two columns with the highest totals?
Example of expected output:
data.frame(id = c(1,2,3), stock = c(3,1,4), bear = c(3,2,5))

In base R, we can use colSums to sum the columns, sort the sums, and select the names of the two columns with the highest totals using tail.
cbind(df[1], df[names(tail(sort(colSums(df[-1])), 2))])
# id stock bear
#1 1 3 3
#2 2 1 2
#3 3 4 5

Another base R possibility could be:
cbind(df[1], df[-1][rank(-colSums(df[-1])) %in% 1:2])
id stock bear
1 1 3 3
2 2 1 2
3 3 4 5
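
If the number of columns to keep may change, both one-liners can be wrapped in a small function. This is only a sketch: keep_top_columns and its arguments id_col and n are names made up for illustration, not part of base R. It keeps the surviving columns in their original order.

```r
# Hypothetical helper: keep the id column plus the n columns with the largest sums.
keep_top_columns <- function(data, id_col = "id", n = 2) {
  value_cols <- setdiff(names(data), id_col)
  sums <- colSums(data[value_cols])
  # Names of the n largest sums (fewer if there are not n value columns)
  top <- names(sort(sums, decreasing = TRUE))[seq_len(min(n, length(sums)))]
  # intersect() keeps the original column order rather than the order of the sums
  data[c(id_col, intersect(value_cols, top))]
}

df <- data.frame(id = c(1, 2, 3), stock = c(3, 1, 4), bill = c(1, 0, 1), bear = c(3, 2, 5))
result <- keep_top_columns(df)
```

With the example data this should give the same id, stock, bear frame as the answers above.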

Related

R randomly allocate different values from vector to dataframe column based on condition

I've got a dataset called df of answers to a question Q1
df = data.frame(ID = c(1:5), Q1 = c(1,1,3,4,2))
I also have a vector where each element is a word
words = c("good","bad","better","improved","fascinating","improvise")
My Objective
IF Q1 = 1 or Q1 = 2, then randomly assign a value from vector words to a newly created column called followup
My Attempt
#If answer to Q1 is 1 or 2, then randomly allocate a word to newly created column "followup"
#Else leave blank
df$followup=ifelse(df$Q1==1 | df$Q1==2,sample(words,1),"")
However doing this leads to repetition of the same randomly selected word for each row that contains a 1 or 2.
ID Q1 followup
1 1 1 fascinating
2 2 1 fascinating
3 3 3
4 4 4
5 5 2 fascinating
I'd like every word to be randomized and different.
Any inputs would be highly appreciated.

For that we may use
df$followup[df$Q1 %in% 1:2] <- sample(words, sum(df$Q1 %in% 1:2))
df
# ID Q1 followup
# 1 1 1 better
# 2 2 1 improvise
# 3 3 3 <NA>
# 4 4 4 <NA>
# 5 5 2 bad
Since we are generating those values in a single call, replace = FALSE (the default value) in sample gives the desired result of all the values being different.
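
One caveat: sampling without replacement only works while there are at least as many words as matching rows. A minimal sketch with that check made explicit follows; the seed and the helper name eligible are assumptions added for illustration.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

df <- data.frame(ID = 1:5, Q1 = c(1, 1, 3, 4, 2))
words <- c("good", "bad", "better", "improved", "fascinating", "improvise")

# Rows that should receive a follow-up word
eligible <- df$Q1 %in% 1:2

# sample() without replacement needs at least as many words as eligible rows
stopifnot(sum(eligible) <= length(words))

df$followup <- NA_character_
df$followup[eligible] <- sample(words, sum(eligible))
```

If there could be more matching rows than words, the words would have to repeat, and sample(words, n, replace = TRUE) would be the call to use instead.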

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.

Something like this would work...
pairs <- (ncol(df1)-1)/2
for(i in 1:pairs){
refs <- unique(c(df1[,2*i],df1[,2*i+1]))
df1[,2*i] <- match(df1[,2*i],refs)
df1[,2*i+1] <- match(df1[,2*i+1],refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
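
The loop above relies on each pair sitting in adjacent columns right after IID. If the column names follow the L1.1 / L1.2 pattern, the pairs can be found from the names instead. A sketch under that assumption (value_cols, locus, and df2 are names picked for illustration):

```r
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)

value_cols <- setdiff(names(df1), "IID")
# Group columns by the part of the name before the final ".<digits>" suffix
locus <- sub("\\.[0-9]+$", "", value_cols)

df2 <- df1
for (cols in split(value_cols, locus)) {
  # Unique values pooled across every column of this group
  refs <- unique(unlist(df1[cols], use.names = FALSE))
  # Replace each value by its position in the pooled reference vector
  df2[cols] <- lapply(df1[cols], match, table = refs)
}
```

This leaves df1 untouched and also handles groups with more than two columns.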

You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
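# Convert to a data.table first so data.table's own melt method is used
df_m <- melt(as.data.table(df1), id.vars = "IID")
# Group by the column prefix (name without the trailing ".1"/".2") and the value
df_m[, id := .GRP, by = .(sub("\\.[0-9]+$", "", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")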
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 4 5 9 8
This should also extend easily to more groups of columns: if you add L3.1 and L3.2, for example, they are handled the same way.

Merging two data frames with different sizes and missing values

I'm having a problem merging two data frames in R.
The first one consists of 103731 obs of 6 variables. The variable that I have to use to merge has 77111 unique values and the rest are NAs with a value of 0. The second one contains the frequency of those variables plus the frequency of the NAs so a frame of 77112 obs for 2 variables.
The resulting frame I need to get is the first one joined with the frequency for the merging variable, so a df of 103731 obs with the frequency for each value of the merging variable (so with duplicates if freq > 1 and also for each NA (or 0)).
Can anybody help me?
The result I'm getting now contains a data frame of 1 894 919 obs and I used:
tot <- merge(df1, df2, by = "mergingVar", all = FALSE, sort = FALSE)
Also I played a lot with 'all=' and none of the variations gave the right df.

Why don't you just take the frequency table of your first table?
a <- data.frame(a = c(NA, NA, 2,2,3,3,3))
data.frame(table(a, useNA = 'ifany'))
a Freq
1 2 2
2 3 3
3 <NA> 2
Or use mutate with ddply from plyr:
library(plyr)
ddply(a, .(a), mutate, freq = length(a))
a freq
1 2 2
2 2 2
3 3 3
4 3 3
5 3 3
6 NA 2
7 NA 2
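
If you would rather stay in base R and keep every original row, NAs included, one option is to count on a factor that treats NA as its own level. A rough sketch; the column names mergingVar and other and the toy values are assumptions, not your real data:

```r
df1 <- data.frame(mergingVar = c(NA, NA, 2, 2, 3, 3, 3),
                  other = letters[1:7])

# exclude = NULL keeps NA as a regular factor level instead of dropping it
key <- factor(df1$mergingVar, exclude = NULL)
codes <- as.integer(key)

# Count each level once, then look the count up for every row
counts <- tabulate(codes, nbins = nlevels(key))
df1$freq <- counts[codes]
```

No merge is involved, so the row count cannot change: the result has exactly as many rows as df1 started with.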

How to match 1 column to 2 columns?

I'm trying to match numbers from one column to numbers in two other columns. I can do this just fine when matching to only a single column, but have problems extending to two columns. Here is what I am doing:
I have 2 dataframes, df1:
number value
1
2
3
4
5
and df2:
number_a number_b value
       3       NA     3
      NA        1     5
       5       NA     1
      NA        4     2
      NA        2     4
What I want to do is match column "number" from df1 to EITHER "number_a" or "number_b" in df2, then insert "value" from df2 into "value" of df1, to give the result df1 as:
number value
1 5
2 4
3 3
4 2
5 1
My approach is to use
df1$value <- df2$value[match(df1$number, df2$number_a)]
or
df1$value <- df2$value[match(df1$number, df2$number_b)]
which yields, respectively, for df1
number value
1 NA
2 NA
3 3
4 NA
5 1
and
number value
1 5
2 4
3 NA
4 2
5 NA
However, I can't seem to fill in all of the "value" column in df1 using this approach. How can I match "number" to "number_a" and "number_b" in one fell swoop? I tried
df1$value <- df2$value[match(df1$number, df2$number_a:number_b)]
but that didn't work.
Thanks!

Easier solution:
df2$number <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)
If you're not familiar with ifelse, it works with vectors in the form:
ifelse(Condition, ValueIfTrue, ValueIfFalse)
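
Once the combined column exists, the single-column match from the question finishes the job. A sketch, assuming df2 has NA wherever one of the two number columns is empty (the data below is rebuilt from the tables in the question):

```r
df1 <- data.frame(number = 1:5, value = NA)
df2 <- data.frame(number_a = c(3, NA, 5, NA, NA),
                  number_b = c(NA, 1, NA, 4, 2),
                  value    = c(3, 5, 1, 2, 4))

# Take number_b where number_a is missing, otherwise number_a
df2$number <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)

# Same lookup as in the question, now against the combined column
df1$value <- df2$value[match(df1$number, df2$number)]
```

This assumes each row of df2 has exactly one of number_a and number_b filled in; if both can be present, number_a wins.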

I am a newbie to R (coming from several years with C). Was trying out the suggestions and I thought I would paste what I came up with:
# Assuming either 'number_a' or 'number_b' is valid
# Combine them into a new column 'number' and drop the two original columns
df2 <- transform(df2, number = ifelse(is.na(df2$number_a), df2$number_b,
df2$number_a))[-c(1:2)]
# Merge on 'number'; only that column of df1 is used, so 'value' comes from df2
df <- merge(df1["number"], df2, by = "number")
number value
1 5
2 4
3 3
4 2
5 1

aggregate over several variables in r

I have a rather large dataset in a long format where I need to count the number of instances of the ID due to two different variables, A & B. E.g. The same person can be represented in multiple rows due to either A or B. What I need to do is to count the number of instances of ID which is not too hard, but also count the number of ID due to A and B and return these as variables in the dataset.
Regards,
//Mi

The ddply() function from the package plyr lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occur (using nrow()), and then put those counts back together nicely.
Using wkmor1's df:
library(plyr)
x <- ddply(.data = df, .variables = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))

OK, given the interpretations I see, then the fastest and easiest solution is...
df$IDCount <- ave(df$ID, df$group, FUN = length)
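
To get both counts the question asks for, the same idea can be applied twice, once per ID and once per ID and group. A sketch using the example frame from the other answers, where the grouping column is called GRP; ID.FREQ and GRP.FREQ are just illustrative column names:

```r
df <- data.frame(ID = rep(c(1, 2), 4), GRP = rep(c("a", "a", "b", "b"), 2))

# Rows per ID
df$ID.FREQ <- ave(seq_along(df$ID), df$ID, FUN = length)

# Rows per ID and GRP combination
df$GRP.FREQ <- ave(seq_along(df$ID), df$ID, df$GRP, FUN = length)
```

ave returns one value per input row, so no merge back onto the original data is needed.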

Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
