Having a data.frame like this:
df <- data.frame(id = c(1,2,3), stock = c(3,1,4), bill = c(1,0,1), bear = c(3,2,5))
How is it possible to sum all columns except the id column and keep only the two columns with the highest sums?
Example of expected output
data.frame(id = c(1,2,3), stock = c(3,1,4), bear = c(3,2,5))
In base R, we can use colSums to sum the columns, sort the sums, and select the names of the two highest-valued columns using tail.
cbind(df[1], df[names(tail(sort(colSums(df[-1])), 2))])
# id stock bear
#1 1 3 3
#2 2 1 2
#3 3 4 5
Another base R possibility could be:
cbind(df[1], df[-1][rank(-colSums(df[-1])) %in% 1:2])
id stock bear
1 1 3 3
2 2 1 2
3 3 4 5
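One thing to watch with the rank() version: by default, tied sums get averaged ranks (e.g. 2.5), so `%in% 1:2` may keep fewer than two columns. A rough sketch of a variant built on order(), which always picks exactly two and keeps the original column order. The names `totals`, `top2` and `result` are just illustrative:

```r
df <- data.frame(id = c(1, 2, 3), stock = c(3, 1, 4),
                 bill = c(1, 0, 1), bear = c(3, 2, 5))

# Column sums for everything except id
totals <- colSums(df[-1])

# Names of the two columns with the largest sums (ties broken by position)
top2 <- names(totals)[order(totals, decreasing = TRUE)[1:2]]

# Keep id plus those two columns, in their original order
result <- df[names(df) %in% c("id", top2)]
result
```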
I've got a dataset called df of answers to a question Q1
df = data.frame(ID = c(1:5), Q1 = c(1,1,3,4,2))
I also have a vector where each element is a word
words = c("good","bad","better","improved","fascinating","improvise")
My Objective
IF Q1 = 1 or Q1 = 2, then randomly assign a value from vector words to a newly created column called followup
My Attempt
#If answer to Q1 is 1 or 2, then randomly allocate a word to newly created column "followup"
#Else leave blank
df$followup=ifelse(df$Q1==1 | df$Q1==2,sample(words,1),"")
However doing this leads to repetition of the same randomly selected word for each row that contains a 1 or 2.
ID Q1 followup
1 1 1 fascinating
2 2 1 fascinating
3 3 3
4 4 4
5 5 2 fascinating
I'd like every word to be randomized and different.
Any inputs would be highly appreciated.
For that we may use
df$followup[df$Q1 %in% 1:2] <- sample(words, sum(df$Q1 %in% 1:2))
df
# ID Q1 followup
# 1 1 1 better
# 2 2 1 improvise
# 3 3 3 <NA>
# 4 4 4 <NA>
# 5 5 2 bad
Since we are generating those values in a single call, replace = FALSE (the default value) in sample gives the desired result of all the values being different.
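One caveat, sketched below under my own assumptions: without replacement, sample() stops with an error if more rows qualify than there are words. This version falls back to sampling with replacement in that case and fills the other rows with "" as the question asked. The `hit` variable and the seed are only there for illustration:

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible

df <- data.frame(ID = 1:5, Q1 = c(1, 1, 3, 4, 2))
words <- c("good", "bad", "better", "improved", "fascinating", "improvise")

# Rows that should receive a follow-up word
hit <- df$Q1 %in% 1:2

# Start with blanks, then fill only the matching rows
df$followup <- ""
df$followup[hit] <- sample(words, sum(hit),
                           replace = sum(hit) > length(words))
df
```

With three matching rows and six words, the fallback is not triggered here, so the three words are all different.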
I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1)-1)/2
for(i in 1:pairs){
refs <- unique(c(df1[,2*i],df1[,2*i+1]))
df1[,2*i] <- match(df1[,2*i],refs)
df1[,2*i+1] <- match(df1[,2*i+1],refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
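The same idea can be wrapped in a small helper so that each pair is recoded in one call. This is only a sketch: `recode_pair` is a made-up name, and the data frame is retyped from the question rather than read with read.table.

```r
df1 <- data.frame(
  IID  = 1:4,
  L1.1 = c("38V1", "36V1", "37Y1", "38V2"),
  L1.2 = c("38V1", "38V2", "36V1", "36V2"),
  stringsAsFactors = FALSE
)

# Recode two vectors against one shared set of reference values
recode_pair <- function(a, b) {
  refs <- unique(c(a, b))
  list(match(a, refs), match(b, refs))
}

df2 <- df1
df2[c("L1.1", "L1.2")] <- recode_pair(df1$L1.1, df1$L1.2)
df2
```

Because both columns are matched against the same `refs`, a value such as 38V2 gets the same code whether it appears in L1.1 or L1.2.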
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
df_m <- melt(as.data.table(df1), id.vars = "IID")
# Group by the column prefix (e.g. "L1" for L1.1 and L1.2) and the value
df_m[, id := .GRP, by = .(sub("\\.\\d+$", "", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 4 5 9 8
This should also be easily expandable to more groups of columns. For example, if you also have L3.1 and L3.2, it should work with those as well.
I'm having a problem merging two data frames in R.
The first one consists of 103731 obs of 6 variables. The variable that I have to use to merge has 77111 unique values and the rest are NAs with a value of 0. The second one contains the frequency of those variables plus the frequency of the NAs so a frame of 77112 obs for 2 variables.
The resulting frame I need to get is the first one joined with the frequency for the merging variable, so a df of 103731 obs with the frequency for each value of the merging variable (so with duplicates if freq > 1 and also for each NA (or 0)).
Can anybody help me?
The result I'm getting now contains a data frame of 1 894 919 obs and I used:
tot = merge(df1, df2, by = "mergingVar", all= F, sort = F);
Also I played a lot with 'all=' and none of the variations gave the right df.
Why don't you just take the frequency table of your first table?
a <- data.frame(a = c(NA, NA, 2,2,3,3,3))
data.frame(table(a, useNA = 'ifany'))
a Freq
1 2 2
2 3 3
3 <NA> 2
or mutate from plyr:
library(plyr)
ddply(a, .(a), mutate, freq = length(a))
a freq
1 2 2
2 2 2
3 3 3
4 3 3
5 3 3
6 NA 2
7 NA 2
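If the goal is only to attach the count to every row, a merge may not be needed at all. Here is a base R sketch using the toy frame `a` from above. addNA() turns NA into its own factor level so that it is counted like any other value; `codes` is an illustrative name.

```r
a <- data.frame(a = c(NA, NA, 2, 2, 3, 3, 3))

# Integer code per row, with NA treated as a regular level
codes <- as.integer(addNA(factor(a$a)))

# Count each code, then look the count up for every row
a$freq <- tabulate(codes)[codes]
a
```

The row count and row order stay exactly as in the original frame, so there is no risk of the result blowing up the way a many-to-many merge can.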
I'm trying to match numbers from one column to numbers in two other columns. I can do this just fine when matching to only a single column, but have problems extending to two columns. Here is what I am doing:
I have 2 dataframes, df1:
number value
1
2
3
4
5
and df2:
number_a number_b value
3        NA       3
NA       1        5
5        NA       1
NA       4        2
NA       2        4
What I want to do is match column "number" from df1 to EITHER "number_a" or "number_b" in df2, then insert "value" from df2 into "value" of df1, to give the result df1 as:
number value
1 5
2 4
3 3
4 2
5 1
My approach is to use
df1$value <- df2$value[match(df1$number, df2$number_a)]
or
df1$value <- df2$value[match(df1$number, df2$number_b)]
which yields, respectively, for df1
number value
1 NA
2 NA
3 3
4 NA
5 1
and
number value
1 5
2 4
3 NA
4 2
5 NA
However, I can't seem to fill in all of the "value" column in df1 using this approach. How can I match "number" to "number_a" and "number_b" in one fell swoop? I tried
df1$value <- df2$value[match(df1$number, df2$number_a:number_b)]
but that didn't work.
Thanks!
Easier solution:
df2$number <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)
If you're not familiar with ifelse, it works with vectors in the form:
ifelse(Condition, ValueIfTrue, ValueIfFalse)
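To finish the job, the combined key can go straight into the match() call from the question. A minimal sketch, assuming the blank cells in df2 are NA and that each row has exactly one of number_a or number_b filled in. The data frames are retyped from the question, and `key` is an illustrative name.

```r
df1 <- data.frame(number = 1:5, value = NA)
df2 <- data.frame(
  number_a = c(3, NA, 5, NA, NA),
  number_b = c(NA, 1, NA, 4, 2),
  value    = c(3, 5, 1, 2, 4)
)

# Take number_a where present, otherwise number_b
key <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)

# Look up each df1 number in the combined key
df1$value <- df2$value[match(df1$number, key)]
df1
```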
I am a newbie to R (coming from several years with C). Was trying out the suggestions and I thought I would paste what I came up with:
# Assuming either 'number_a' or 'number_b' is non-NA in each row,
# combine them into a new column 'number' and drop the two original columns
df2 <- transform(df2, number = ifelse(is.na(number_a), number_b, number_a))[-c(1:2)]
# Join the two data frames on 'number'; df1's empty 'value' column is left
# out so that the result has a single 'value' column
df <- merge(df1["number"], df2, by = "number")
number value
1 5
2 4
3 3
4 2
5 1
I have a rather large dataset in a long format where I need to count the number of instances of the ID due to two different variables, A & B. E.g. The same person can be represented in multiple rows due to either A or B. What I need to do is to count the number of instances of ID which is not too hard, but also count the number of ID due to A and B and return these as variables in the dataset.
Regards,
//Mi
The ddply() function from the package plyr lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occur (using nrow()), and then put those counts back together nicely.
Using wkmor1's df:
library(plyr)
x <- ddply(.data = df, .variables = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))
OK, given the interpretations I see, then the fastest and easiest solution is...
df$IDCount <- ave(df$ID, df$group, FUN = length)
Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
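For comparison, both counts can also be added in place with ave(), which avoids the two merges and keeps the original row order. This is just a sketch on the same example data, reusing the column names ID.FREQ and GRP.FREQ from the output above.

```r
df <- data.frame(ID = rep(c(1, 2), 4),
                 GRP = rep(c("a", "a", "b", "b"), 2))

# Number of rows per ID
df$ID.FREQ <- ave(seq_along(df$ID), df$ID, FUN = length)

# Number of rows per ID and GRP combination
df$GRP.FREQ <- ave(seq_along(df$ID), df$ID, df$GRP, FUN = length)
df
```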