How to subtract values by comparing columns from two datasets? - r

I have the following data structure:
pos.c1<-seq(from=1,to=100,by=1)
map.c1<-seq(from=0,to=1,length.out = 100)
cro.c1<-rep(1,100)
pos.c2<-seq(from=1,to=80,by=1)
map.c2<-seq(from=0,to=1,length.out = 80)
cro.c2<-rep(2,80)
c1<-cbind(cro.c1,pos.c1,map.c1)
c2<-cbind(cro.c2,pos.c2,map.c2)
map<-rbind(c1,c2)
colnames(map)<-c("Chr","Pos","CM")
Pos.1<-c(30,52,60,72,80,4,12,30,40)
Pos.2<-c(40,53,71,79,95,9,20,35,79)
Chr<-c(rep(1,5),rep(2,4))
Data<-cbind(Chr,Pos.1,Pos.2)
Two data frames:
map: with three variables, Chr, Pos and CM.
Data: with three variables, Chr, Pos.1 and Pos.2.
Matching Data$Pos.1 and Data$Pos.2 against map$Pos, I need the difference between the two matched map$CM values. This has to be done within each Chr.
As an example: for the first row of Data (1, 30, 40) the desired value would be 0.1010101 (obtained as 0.39393939 - 0.29292929). For the first row of Data with Chr = 2 (2, 4, 9) the desired value would be 0.06329114 (0.10126582 - 0.03797468).
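These values follow directly from the seq() grids defined above, so they can be checked by hand (Chr 1 spaces CM in steps of 1/99, Chr 2 in steps of 1/79):
(40 - 1)/99 - (30 - 1)/99  # 0.1010101  for Chr 1, Pos.1 = 30, Pos.2 = 40
(9 - 1)/79  - (4 - 1)/79   # 0.06329114 for Chr 2, Pos.1 = 4,  Pos.2 = 9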

If I have understood correctly what you want, I think you need to do something like this:
pos.c1<-seq(from=1,to=100,by=1)
map.c1<-seq(from=0,to=1,length.out = 100)
cro.c1<-rep(1,100)
pos.c2<-seq(from=1,to=80,by=1)
map.c2<-seq(from=0,to=1,length.out = 80)
cro.c2<-rep(2,80)
c1<-cbind(cro.c1,pos.c1,map.c1)
c2<-cbind(cro.c2,pos.c2,map.c2)
map<-rbind(c1,c2)
colnames(map)<-c("Chr","Pos","CM")
Pos.1<-c(30,52,60,72,80,4,12,30,40)
Pos.2<-c(40,53,71,79,95,9,20,35,79)
Chr<-c(rep(1,5),rep(2,4))
Data<-cbind(Chr,Pos.1,Pos.2)
Using the tidyverse library:
library(tidyverse)
You have to transform your data into data frames:
Data <- as.data.frame(Data)
map <- as.data.frame(map)
Then you just have to retrieve the information using left_join:
Data_CM <- left_join(Data, map, by = c("Chr", "Pos.1" = "Pos")) %>%
  rename(CM.1 = CM)
Data_CM <- left_join(Data_CM, map, by = c("Chr", "Pos.2" = "Pos")) %>%
  rename(CM.2 = CM)
The Diff variable then holds the difference between the two retrieved values:
Data_CM <- Data_CM %>%
  mutate(Diff = CM.2 - CM.1)
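The same lookup can also be done without joins, using match() on a chromosome/position key. A small base-R sketch, assuming Data and map are the data frames created above:
cm_at <- function(chr, pos) map$CM[match(paste(chr, pos), paste(map$Chr, map$Pos))]
Data$Diff <- cm_at(Data$Chr, Data$Pos.2) - cm_at(Data$Chr, Data$Pos.1)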

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it does need to be of a type that split() can coerce to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
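Once each piece has been processed, the list can be recombined in a single call rather than pairwise (a small sketch, assuming the processed pieces are still in dat_list; note the rows come back grouped rather than in the original order):
dat_recombined <- do.call(rbind, dat_list)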
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
  A = sample(letters, size = 70000000, replace = TRUE),
  B = rpois(70000000, lambda = 1)
)
Here's a tidyverse-based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
                                 DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
                                 chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
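A minimal sketch of such a wrapper (the function name and the hard-coded column string are illustrative assumptions, not part of the original answer):
read_string_subset <- function(file, letter, chunk_size = 1000) {
  read_csv_chunked(file,
                   DataFrameCallback$new(function(x, pos) filter(x, string == letter)),
                   chunk_size = chunk_size)
}
partial_a <- read_string_subset("test.csv", "a")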
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

Mutate a dataframe by a vector which should match variable names

I have a data frame with a column of years (year_vector) and several year-named columns that contain the GDP-per-head values of different countries at that point in time. I want to mutate this data frame to get a variable which, for each row, contains the value from the year column indicated by year_vector.
My data frame looks like this:
library(tidyverse)
set.seed(123)
dataset <- tibble('country' = c('Austria','Austria','Austria','Germany','Germany','Sweden','Sweden','Sweden'),
                  'year_vector' = floor(sample(c(1940,1950,1960), 8, replace = T)),
                  '1940' = runif(8, 15000, 18000),
                  '1950' = runif(8, 15000, 18000),
                  '1960' = runif(8, 15000, 18000))
How can I mutate this data frame as explained above, for example into a variable called gdp_head?
EDIT: the output should look like this:
set.seed(123)
dataset <- tibble('country' = c('Austria','Austria','Austria','Germany','Germany','Sweden','Sweden','Sweden'),
                  'year_vector' = floor(sample(c(1940,1950,1960), 8, replace = T)),
                  '1940' = runif(8, 15000, 18000),
                  '1950' = runif(8, 15000, 18000),
                  '1960' = runif(8, 15000, 18000)) %>%
  mutate(gdp_head = c(.$'1940'[1], .$'1940'[2], .$'1960'[3],
                      .$'1950'[4], .$'1940'[5], .$'1960'[6],
                      .$'1960'[7], .$'1950'[8]))
Here is one approach:
First, since you are going to compare the year_vector column with column names (which will be character), you can convert year_vector to character as well:
dataset$year_vector <- as.character(dataset$year_vector)
You currently have a tibble, but if you convert it to a plain data.frame you can subset with a [row, column] index matrix and add the matched results as gdp_head:
dataset <- as.data.frame(dataset)
dataset$gdp_head <- as.numeric(dataset[cbind(1:nrow(dataset), match(dataset$year_vector, names(dataset)))])
I came up with the following solution, which works as well:
dataset %>%
  do(., mutate(., gdp_head = pmap(list(1:nrow(.), year_vector),
                                  function(x, y) .[x, (y - 1901 + 16)]) %>%
                 unlist()))
In this solution I convert each year in year_vector into a column index: in my actual data the year columns start at 1901, and that first year column sits at column index 16, so the matching column index is year - 1901 + 16.
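A rough tidyverse sketch that avoids the hard-coded offset, assuming the year columns are named exactly as the years in year_vector (not verified against the real data):
dataset %>%
  rowwise() %>%
  mutate(gdp_head = get(as.character(year_vector))) %>%
  ungroup()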

For Loop Across Specific Column Range in R

I have a wide data frame consisting of 1000 rows and over 300 columns. The first 2 columns are GroupID and Categorical fields. The remaining columns are all continuous numeric measurements. What I would like to do is loop through a specific range of these columns in R, beginning with the first numeric column (column #3). For example, loop through columns 3:10. I would also like to retain the column names in the loop. I've started with the following code:
for(i in 3:ncol(df)){
print(i)
}
But this includes all columns to the right of column #3 (not the range 3:10), and this does not identify column names. Can anyone help get me started on this loop so I can specify the column range and also retain column names? TIA!
Side Note: I've used tidyr to gather the data frame in long format. That works, but I've found it makes my data frame very large and therefore eats a lot of time and memory in my loop.
Since you did not include your data, I created similar dummy data (1000 rows and 302 columns, 2 id vars) to show you how to select columns and prepare them for plotting:
library(reshape2)
library(ggplot2)
set.seed(123)
#Dummy data
Numvars <- as.data.frame(matrix(rnorm(1000*300),nrow = 1000,ncol = 300))
vec1 <- 1:1000
vec2 <- rep(paste0('class',1:5),200)
IDs <- data.frame(vec1,vec2,stringsAsFactors = F)
#Bind data
Data <- cbind(IDs,Numvars)
#Select vars (in your case 10 initial vars)
df <- Data[,1:12]
#Prepare for plot
df.melted <- melt(data = df,id.vars = c('vec1','vec2'))
#Plot
ggplot(df.melted, aes(x = vec1, y = value, group = variable, color = variable)) +
  geom_line() +
  facet_wrap(~vec2)
You will end up with a faceted line plot, one panel per class.
I hope this helps.
You can keep column names by feeding them into an lapply function; here's an example with the iris dataset:
library(ggplot2)

lapply(names(iris)[2:4], function(columntoplot){
  df <- data.frame(datatoplot = iris[[columntoplot]])
  graphname <- columntoplot
  p <- ggplot(df, aes(x = datatoplot)) +
    geom_histogram() +
    ggtitle(graphname)
  ggsave(filename = paste0(graphname, ".png"), plot = p, width = 4, height = 4)
})
In the lapply function, you create a new dataset comprising one column (note the double brackets). You can then plot and optionally save the output within the function (see ggsave line). You're then able to use the column name as the plot title as well as the file name.
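For completeness, the literal for loop the question asks about can simply restrict the index range and look the column name up inside the loop (a minimal sketch, assuming the data frame is called df and columns 3 to 10 are the numeric measurements of interest):
for (i in 3:10) {
  colname <- names(df)[i]
  cat("Column", i, "is", colname, "\n")
  print(summary(df[[i]]))
}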

Unnest a ts class

My data contains multiple customers with different start and end dates, along with their sales data, so I applied simple exponential smoothing.
I used the following code to apply ses():
library(zoo)
library(forecast)
z <- read.zoo(data_set,FUN = function(x) as.Date(x) + seq_along(x) / 10^10 , index = "Date", split = "customer_id")
L <- lapply(as.list(z), function(x) ts(na.omit(x),frequency = 52))
HW <- lapply(L, ses)
Now my output is a list whose elements have uneven lengths. Can someone help me unnest or unlist the output into a data frame and get the fitted values, actuals and residuals along with their dates, sales and customer_id?
Note: the reason I post my input data rather than HW itself is that the HW data is too large.
Can someone help me in R?
I would use the tidyverse package to handle this problem.
library(tidyverse)

map(HW, ~ .x %>%
      as.data.frame %>%           # convert each element of the list to a data.frame
      rownames_to_column) %>%     # add row names as a column within each element
  bind_rows(.id = "customer_id")  # bind all elements and add the customer ID
I am not sure how to relate dates and actual sales to your output (HW). If you explain it I might provide solution to that part of the problem too.
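To pull the actuals, fitted values and residuals into one long data frame, a similar sketch along the same lines (assuming HW is the list of ses() fits created in the question):
map_dfr(HW, ~ data.frame(actual    = as.numeric(.x$x),
                         fitted    = as.numeric(.x$fitted),
                         residuals = as.numeric(.x$residuals)),
        .id = "customer_id")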
First, I put all the unique customer_id values into a variable called k:
k <- unique(data_set$customer_id)
Then I created an empty data frame:
b <- data.frame()
I extracted all the fitted values with a for loop and stored them in a; rbind() then appends them to the data frame b:
for (key in k) {
  a <- as.data.frame(as.numeric(HW[[key]]$model$fitted))
  b <- rbind(b, a)
}
Finally, I used cbind() to attach the fitted values in b to the input data set:
data_set_final <- cbind(data_set,b)

What is the best way to perform basic calculations (% of total) across dataframes in a list?

Consider a list of dataframes called listDF. Each of the dataframes has the same columns:
"Date" "Location" "V1" "V2" where V1 is a column filled with real numbers
I would like to calculate the % of total of, say, V1 for each Date/Location combination. That is, sum V1 across all data frames for each specific Date/Location pair, and then calculate the share each V1 observation represents of that total.
What I've tried:
I stack the data frames because I don't know how to do the calculation without looping over the data frame/Date/Location combinations, which is clearly inefficient.
library(plyr)
aggregate <- rbind.fill(listDF)
ptt <- ddply(aggregate,.(Date,Location),transform, share= V1/sum(V1))
The last line leads to RStudio crashing and asking me to start a new session. FWIW, the avg dataframe has 50k rows and the list has about 1M rows total. Should I be using prop.table?
In an ideal world, I would have the percent to total (ptt) as a column in each dataframe, instead of in a single stacked dataframe which I would have to split after.
*Side question: is there a way to choose which subset of list elements to use for any given ptt? I've assumed using all dataframes in my initial question but would love to choose based on critera of say V2.
Thanks for your help.
If each data frame in the list has the same columns, it would be easier to work with a single data frame that has an extra variable indicating the original data frame. Then you can easily perform calculations grouped by data frame.
sample data
# two data frames
d1 <- data.frame(x = rep(LETTERS[1:2], each = 5), y = rnorm(10))
d2 <- data.frame(x = rep(LETTERS[1:2], each = 7), y = rnorm(14))
# put data frames in a list
L <- list(d1, d2)
We can use dplyr::bind_rows() to "unlist" L into a single data frame. The .id option instructs bind_rows to create an explicit variable identifying the original data frame:
library(dplyr)
d <- bind_rows(L, .id = "dat")
Now you can do any summary grouped by the variable you created:
d %>%
  group_by(dat) %>%
  summarise(mean_y = mean(y))
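Closer to the original share-of-total question, the same grouping idea gives the percent of total directly; a sketch assuming the real data frames have the Date, Location and V1 columns described above:
d %>%
  group_by(Date, Location) %>%
  mutate(share = V1 / sum(V1)) %>%
  ungroup()
If you need the result back as one data frame per original element, split(d, d$dat) will do that afterwards.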
