R - Adding vector from one dataframe as column to another dataframe

I have two dataframes and want to add a specific vector from one as a column to the other, for multiplication purposes. Put another way: how can I multiply the data in one dataframe by a specific vector from another?
Example
library (dplyr)
df <- data.frame (name = c("A", "B", "C", "D", "E"),
area = c(1,2,3,4,5),
yield = c(10, 20, 30, 40, 50))
df2 <- data.frame (application = c("test", "current", "future"),
number = c(5,10,15))
The intended result is, for example, to get the value for "current" in df2 and create a new column in df named "number", which is then multiplied with the other columns in df to generate the column "calculation". An Excel example of how df would look at the end is below:
I tried
df$number <- df2 %>%
filter(application == "current") %>%
select(number)
But I get an Error in $<-.data.frame(*tmp*, number, value = list(number = 10)) :
replacement has 1 row, data has 5
I know that I could do
df$number <- df2[2,2]
But I want it to be specifically related to "current" (as I tried to do with dplyr). This is only an example - in reality, df2 is a big file and the order can change when people are adding more data.

Here is a base R approach -
df$number <- df2$number[df2$application == "current"]
df$calculation <- with(df, area * yield * number)
df
Or if you prefer dplyr -
library(dplyr)
df <- df %>%
bind_cols(df2 %>%
filter(application == "current") %>%
select(number)) %>%
mutate(calculation = area * yield * number)
df
# name area yield number calculation
#1 A 1 10 10 100
#2 B 2 20 10 400
#3 C 3 30 10 900
#4 D 4 40 10 1600
#5 E 5 50 10 2500
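The attempt in the question failed because select() returns a one-row data frame rather than a plain value. As a minimal sketch of an alternative (not part of the answer above), dplyr's pull() extracts the column as a vector, and a length-one vector is recycled across all rows of df. The name current_number and the length check are illustrative assumptions.

```r
library(dplyr)

df <- data.frame(name = c("A", "B", "C", "D", "E"),
                 area = c(1, 2, 3, 4, 5),
                 yield = c(10, 20, 30, 40, 50))
df2 <- data.frame(application = c("test", "current", "future"),
                  number = c(5, 10, 15))

# pull() returns a plain vector instead of a one-column data frame
current_number <- df2 %>%
  filter(application == "current") %>%
  pull(number)

# Guard against zero or duplicate "current" rows in a large, changing df2
stopifnot(length(current_number) == 1)

# The single value is recycled to every row of df
df <- df %>%
  mutate(number = current_number,
         calculation = area * yield * number)
df
```

The length check is optional, but it fails early if "current" is missing from df2 or appears more than once.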

Related

How can I see if any values in one data frame exist in any other data frame?

I have eight data frames, all of which contain an id field, and I want to know whether any of the id values appear in more than one of them.
I'm not looking for an intersection (values common to all eight data frames); I simply want to find the ids that also appear in any of the other data frames.
Let's say that one of the data frames looks like this:
id TestDay
1 66 m
2 90 t
3 71 w
4 59 th
5 38 f
6 84 sa
7 15 su
8 89 m
9 18 t
10 93 w
11 88 th
12 42 f
13 10 sa
14 33 su
15 49 m
16 51 t
17 80 w
18 32 th
19 1 f
20 91 sa
21 58 su
If you wish to create eight sample data frames, you can do so by using this code eight times (with different data frame names, naturally):
x <- data.frame(id = sample(1:100, 21, FALSE), TestDay = rep(c("m","t","w","th","f","sa","su"), 3))
I want to know if any of the id values listed here appear in any of the other seven data frames, and conversely, whether any of the id values listed in any of the other seven data frames exist in this one.
How can this be done?
Combine all the dataframes in one dataframe with a unique id value which will distinguish each dataframe.
I created two dataframes here with data column representing the dataframe number.
library(dplyr)
x1 <- data.frame(id = round(runif(21, 1, 21)), TestDay = rep(c("m","t","w","th","f","sa","su"), 3))
x2 <- data.frame(id = round(runif(21, 1, 21)), TestDay = rep(c("m","t","w","th","f","sa","su"), 3))
combine_data <- bind_rows(x1, x2, .id = 'data')
Group by the id column and count how many dataframes each id is present in.
combine_data %>%
group_by(id) %>%
summarise(count_unique = n_distinct(data))
You can add filter(count_unique > 1) to the end of the chain above to keep only the ids that are present in more than one dataframe.
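A small deterministic sketch of the same idea, assuming three hand-made data frames so the result is easy to check. The names x3, found_in and shared are illustrative and not from the answer above.

```r
library(dplyr)

# Hand-made ids so the overlaps are known in advance
x1 <- data.frame(id = c(1, 2, 3))
x2 <- data.frame(id = c(3, 4, 5))
x3 <- data.frame(id = c(5, 6, 1))

# With a named list, .id stores the list names in the "data" column
combine_data <- bind_rows(list(x1 = x1, x2 = x2, x3 = x3), .id = "data")

shared <- combine_data %>%
  group_by(id) %>%
  summarise(count_unique = n_distinct(data),
            found_in = paste(sort(unique(data)), collapse = ", ")) %>%
  filter(count_unique > 1)   # keep ids seen in more than one data frame
shared
```

Using a named list means the output reports data frame names rather than positions, which may be easier to read with eight inputs.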
To add to @Ronak's answer, you can also concatenate the dataframe numbers with c() inside summarise(). This tells you which dataframes each id comes from.
df1 <- data.frame(id = letters[1:3])
df2 <- data.frame(id = letters[4:6])
df3 <- data.frame(id = letters[5:10])
library(tidyverse)
df <- list(df1, df2, df3)
df4 <- df %>%
bind_rows(.id = "df_num") %>%
mutate(df_num = as.integer(df_num)) %>%
group_by(id) %>%
summarise(
df_found = list(c(df_num)),
df_n = map_int(df_found, length)
)
df4
We can use data.table methods
Get the datasets in a list and rbind them with rbindlist
Grouped by 'id' get the count of unique 'data' with uniqueN
library(data.table)
rbindlist(list(x1, x2), idcol = 'data')[, .(count_unique = uniqueN(data)), by = id]

mass removal of rows from a data frame based on a column condition

I would like to remove all rows based on a condition of a column. The code below produces a sample test data
test_data <- data.frame(index = c(1,2,3,4,5), group = c("a", "a", "a", "b", "c"), count = c(1,2,2,3,4))
The data frame has 3 columns: index, group and count. I would like to remove all rows belonging to a group if any row of that group has count 1. So in the data frame above, I would like to remove the rows with index 1, 2 and 3, since the first row has count = 1 and the 2nd and 3rd rows fall in the same group "a". The resulting data frame should look like this:
testdata2 <- data.frame(index = c(4,5), group = c("b", "c"), count = c(3,4))
Any help would be appreciated! Thanks!
We can use ave with subset in base R and keep only the groups in which no row has count equal to 1.
subset(test_data, ave(count != 1, group, FUN = all))
# index group count
#4 4 b 3
#5 5 c 4
With dplyr this can be done as :
library(dplyr)
test_data %>% group_by(group) %>% filter(all(count != 1))
and data.table
library(data.table)
setDT(test_data)[, .SD[all(count != 1)], group]
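All three versions above test all(count != 1) within each group. As a hedged sketch of another way to express the same rule in base R, one can first collect the groups that contain a count of 1 and then drop every row belonging to them. The names bad_groups and kept are illustrative.

```r
test_data <- data.frame(index = c(1, 2, 3, 4, 5),
                        group = c("a", "a", "a", "b", "c"),
                        count = c(1, 2, 2, 3, 4))

# Groups that have at least one row with count == 1
bad_groups <- unique(test_data$group[test_data$count == 1])

# Keep only rows whose group is not among them
kept <- test_data[!test_data$group %in% bad_groups, ]
kept
```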

Calculate the number of samples with alterations each gene listed in a vector

I'm an R newbie and wondering if people could offer me a little bit of advice on how I can process some data I have.
I have a data frame containing a list of samples with observed changes in genes (example below)
Dataframe1:
Sample Gene Alteration
1 A -1
1 B -1
1 C -1
1 D 1
2 B 1
2 E -1 ...
I also have a data frame containing a list of genes that I am interested in (example below)
Dataframe2:
Gene
B
D
E
I want to calculate how many samples have a -1 alteration for each gene in dataframe2, with an ideal output looking something like:
Dataframe3:
Gene Alteration Sum
B -1 23
D -1 2
E -1 18
I'm really stuck as to where to start, I've found a lot of information on sum etc but I can't work out how to feed two data frames together and utilise sum.
Any advice or just functions that I could try would be hugely appreciated.
Step 1: Select the genes of interest from dataframe1:
set.seed(11)
dataframe1 = data.frame(Sample = rep(c(1,2), each = 5),
Gene = rep(c("A", "B", "C", "D","E"),2),
Alteration = sample(c(-1, 1), 10, prob = c(0.7, 0.3), replace = TRUE))
dataframe2 <- data.frame(Gene = c("B", "D", "E"))
# Select the genes of interest
dataframe1 <- dataframe1[dataframe1$Gene %in% dataframe2$Gene, ]
Step 2: Calculate the sum of -1's
We can use the dplyr library to compute the sum per group:
library(dplyr)
dataframe1 %>%
group_by(Gene) %>%
summarise(Sum = sum(Alteration == -1))
Note that when we have a boolean vector (vector containing TRUE's and FALSE's) the sum of this vector gives the number of TRUE's.
Good luck!
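To illustrate the note about summing a logical vector, here is a tiny sketch with made-up values. The names alteration and is_loss are illustrative only.

```r
# Made-up alteration values for one gene across five samples
alteration <- c(-1, 1, -1, -1, 1)

# Comparison yields a logical vector: TRUE where the value is -1
is_loss <- alteration == -1
is_loss

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of -1 entries
n_loss <- sum(is_loss)
n_loss
```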
Or with dplyr, just try
dft2 %>%
inner_join(dft1) %>%
group_by(Gene, Alteration) %>%
summarise( cnt = n()) %>%
filter(Alteration == -1)
where dft1 is the first dataframe and dft2 is the second dataframe
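If dft2 has entries that are not found in dft1 and you want them to appear in the result, change the inner_join to left_join. Note that the final filter(Alteration == -1) would drop the resulting NA rows, so the counting step needs adjusting as well.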
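A minimal sketch of the left_join variant, assuming small made-up data in which gene "F" has no rows in dft1. Counting with sum(..., na.rm = TRUE) instead of n() plus filter() is a swapped-in technique, chosen so that unmatched genes are reported with a count of 0.

```r
library(dplyr)

# Made-up data: gene "F" is of interest but never observed
dft1 <- data.frame(Sample = c(1, 1, 2, 2),
                   Gene = c("B", "D", "B", "E"),
                   Alteration = c(-1, 1, -1, 1))
dft2 <- data.frame(Gene = c("B", "D", "E", "F"))

result <- dft2 %>%
  left_join(dft1, by = "Gene") %>%   # unmatched genes get NA in Alteration
  group_by(Gene) %>%
  summarise(cnt = sum(Alteration == -1, na.rm = TRUE))  # NA rows count as 0
result
```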
You can use the function ddply from the package plyr.
library(plyr)
Dataframe3 <- ddply(Dataframe1, c('Gene', 'Alteration'), summarise, Sum = length(Alteration))

R: Concatenated values in column B based on values in column A

QUESTION: Using R, how would you create values in column B prefixed with a constant "1" + n 0's where n is the value in each row in column A?
#R CODE EXAMPLE
df <- as.data.frame(1:3);colnames(df)[1] <- "A";
print(df);
# A
# 1
# 2
# 3
preFixedValue <- 1; repeatedValue <- 0;
#pseudo code: create values in column B with n 0's prefixed with 1
df <- cbind(df,paste(rep(c(preFixedValue,repeatedValue), times = c(1,df[1:nrow(df),])),collapse = ""));
#expected/desired result
# A B
# 1 10
# 2 100
# 3 1000
USE CASE: Real data contains hundreds of rows in column A with random integers, not just three sequential int's as shown in the code above.
Below is an example using Excel to demonstrate what I want to do in R.
The rowwise() function in dplyr lets you make variables from column values in each row.
require(dplyr)
df <- data.frame(A = 1:3, B = NA)
preFixedValue <- 1; repeatedValue <- 0;
df <- df %>%
rowwise() %>%
mutate(B = as.numeric(paste0(c(preFixedValue, rep(repeatedValue, A)), collapse = "")))
For maximum flexibility, i.e. total freedom of choosing prefixed and repeated values as single values or vectors, and for simplicity of the syntax (one single line):
library(stringr)
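# width is the total length of the result, so add the length of the prefix to the number of pad characters
df$B <- str_pad(as.character(preFixedValue), width = df$A + nchar(preFixedValue), pad = as.character(repeatedValue), side = "right")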
Would something like this work?
B<-10^(df$A)
df<-cbind(df,B)
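The 10^A approach returns numbers, which R may print in scientific notation for large values of A. As a hedged alternative sketch in base R (strrep() is available from R 3.3.0), the value can be built as a string, which also works for a prefix or repeated value other than 1 and 0.

```r
df <- data.frame(A = 1:3)
preFixedValue <- 1
repeatedValue <- 0

# strrep() is vectorised over 'times': it repeats "0" once, twice, three times
zeros <- strrep(as.character(repeatedValue), df$A)

# Prepend the prefix to each run of zeros; B is a character column
df$B <- paste0(preFixedValue, zeros)
df
```

If a numeric column is needed, as.numeric(df$B) converts it afterwards.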

Can't sort column with R

I want to create, from the dataset, a list that contains each word and its frequency. I did that and saved it in a variable named 'mylist'. Now I want to sort the list by word frequency and create a barplot of the 10 most frequent words.
But I have not managed to sort it. I tried several ways of converting 'mylist' to a data.frame or data.table, but the frequency column always stays a list.
To sum up: the variable DT has 2 columns. The first, 'x', contains the words and is of type character.
The second column, 'v', contains the frequencies and is a list.
I cannot sort it by frequency.
Please help me.
library(ggplot2)
library(MASS)
#get the data
data.uri = "http://www.crowdflower.com/wp-content/uploads/2016/03/gender-classifier-DFE-791531.csv"
pwd = getwd()
data.file.name = "gender.csv"
data.file = file.path(pwd, data.file.name)
download.file(data.uri, data.file)
data = read.csv(data.file.name)
#manipulate the data
data <- data[data$X_unit_id < 815719694,]
print(data$X_unit_id)
#get all female has white sidebar
female_colors <- subset(data, data$gender=="female")
female_colors$fav_number
#get all male fav_numbers
male_colors <- subset(data, data$gender=="male")
male_colors$fav_number
text_male = subset(data, data$gender=="male")
text_male = text_male$text
print(text_male[1])
print(length(text_male))
v <- text_male[1:length(text_male)]
print(v)
print (v[1])
count_of_list = 0;
x = list()
for ( i in v) {
# Merge the two lists.
x <- c(x,unlist(strsplit(i," ")))
}
count = 0;
mylist = list()
for (word in x){
for (xWord in x){
if (word == xWord)
count = count + 1;
}
key <- word
value <- count
mylist[[ key ]] <- value
count = 0;
}
library(data.table)
DT = data.table(x=c(names(mylist)),v=c(mylist))
DT
As suggested in the comments, a reproducible example would make it easier to write an answer that helps you. I will suggest an approach anyway. Try to adapt this procedure to your data.
Convert your list to a dataframe and use order:
df <- as.data.frame(your.data)
df <- data.frame(id = c("B", "A", "D", "C"), y = c(6, 8, 1, 5))
df
id y
1 B 6
2 A 8
3 D 1
4 C 5
df2 <- df[order(df$id), ]
df2
id y
2 A 8
1 B 6
4 C 5
3 D 1
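The example above sorts by id. Applied to the question, a rough sketch could flatten the list of counts with unlist() so the frequency column becomes a plain numeric vector, then order by it in decreasing order. The contents of mylist here are made-up stand-ins for the asker's word counts, and freq and top10 are illustrative names.

```r
# Made-up stand-in for the named list of word counts built in the question
mylist <- list(the = 5, cat = 2, sat = 1, on = 3)

# unlist() turns the list into a numeric vector, so 'count' is sortable
freq <- data.frame(word = names(mylist),
                   count = unlist(mylist),
                   stringsAsFactors = FALSE)

# Sort by frequency, highest first, and keep at most the top 10 words
freq <- freq[order(freq$count, decreasing = TRUE), ]
top10 <- head(freq, 10)

# Bar plot of the most frequent words (las = 2 rotates the axis labels)
barplot(top10$count, names.arg = top10$word, las = 2)
```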
It looks like you're calculating the word counts in a cumbersome way. Something like this is faster and simpler -
library(dplyr)
foo <- c("ant", "ant", "bat", "dog","egg","ant","bat")
bar <- rnorm(7, 5, 2)
df <- data.frame(foo, bar)
group_by(df, foo) %>% summarise(n = n()) %>% arrange(desc(n))
foo n
(fctr) (int)
1 ant 3
2 bat 2
3 dog 1
4 egg 1
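For raw text like that in the question, a base R sketch along the same lines could split the strings into words and count them with table(), which replaces the nested loops. The sample texts are made up, and the names texts, words and counts are illustrative.

```r
# Made-up stand-in for the text column from the question
texts <- c("the cat sat", "the dog sat", "the end")

# Split every string on single spaces and flatten into one character vector
words <- unlist(strsplit(texts, " ", fixed = TRUE))

# table() counts each distinct word; sort() puts the most frequent first
counts <- sort(table(words), decreasing = TRUE)

# At most the 10 most frequent words
head(counts, 10)
```

The result can be passed straight to barplot(head(counts, 10)) if a plot is wanted.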
