Reordering the variables - r

After merging two data sets, I have a data frame with 300 variables: some end in .x, some end in .y, and some have neither suffix. How can I move all variables that do not end in .x or .y to the first 100 columns of the data set? I also want columns 101 onwards to be arranged like day.x, day.y, city.x, city.y, number.x, number.y, etc. That is, variables with the same name (say, city) but different suffixes should be adjacent to each other.
For example:
city.y<- c(1,2,3,5,5,7,7,NA,NA,3,4,5)
B<-c(3,4,5,6,1,2,7,6,7,NA,NA,6)
number.x<-c(1,2,3,4,5,6,7,NA,NA,5,5,6)
day.x<-c(1,3,4,5,6,7,8,1,NA,3,5,3)
Z<-c(1,2,3,4,5,6,7,NA,NA,5,5,6)
day.y<-c(4,5,6,7,8,7,8,1,2,3,5,NA)
number.y<-c(3,4,5,6,1,2,7,6,7,NA,NA,6)
school.x<-c("a","b","b","c","n","f","h","NA","F","G","z","h")
S<-c(5,2,3,4,5,6,5,NA,NA,5,6,6)
school.y<-c("a","b","b","c","m","g","h","NA","NA","G","H","T")
city.x<- c(1,2,3,7,5,8,7,5,6,7,5,1)
df<- data.frame(city.y,B,number.x,day.x,Z,day.y,number.y,school.x,S,school.y,city.x)
I want to reorder the variables in this format: B, S, Z, city.x, city.y, number.x, number.y, day.x, day.y, and so on.

Add one column to create a more general use case (the six values are recycled to fill all 12 rows):
df$ZZZZZ = 1:6
Then, load the dplyr package (for the chaining operator %>% and the select function):
library(dplyr)
Sorting the columns by name puts each sub-grouping of columns in the right relative order. Index the data frame by the sorted names so the data moves with the labels (assigning names(df) = sort(names(df)) would only relabel the columns and leave the data where it was):
df = df[, sort(names(df))]
Now use select with -matches("\\.[xy]$") to put all the columns that do not end in ".x" or ".y" at the beginning, followed by everything() to put all the other columns after them.
df = df %>% select(-matches("\\.[xy]$"), everything())
df
B S Z ZZZZZ city.x city.y day.x day.y number.x number.y school.x school.y
1 3 5 1 1 1 1 1 4 1 3 a a
2 4 2 2 2 2 2 3 5 2 4 b b
...
11 NA 6 5 5 5 4 5 5 5 NA z H
12 6 6 6 6 1 5 3 NA 6 6 h T
If you like, you can also set your own suffixes in the merge function (rather than the default ".x" and ".y") like this:
merge(df1, df2, by="col", suffixes=c("_df1", "_df2"))
If you do that, you'll of course also need to adjust the regular expression that reorders the columns.
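As a minimal sketch of that adjustment, here is the same reordering with custom suffixes, written in base R rather than dplyr. The tables, the key column id, and the suffixes _df1 and _df2 are made up for illustration.

```r
# Hypothetical example tables; "id" is an assumed key column.
df1 <- data.frame(id = 1:3, city = c(1, 2, 3), day = c(4, 5, 6), B = c(7, 8, 9))
df2 <- data.frame(id = 1:3, city = c(3, 2, 1), day = c(6, 5, 4))

# Columns present in both tables get the custom suffixes.
merged <- merge(df1, df2, by = "id", suffixes = c("_df1", "_df2"))

# The regular expression now has to match the custom suffixes.
is_suffixed <- grepl("_df[12]$", names(merged))

# Unsuffixed columns first, then the suffixed ones sorted by name.
merged <- merged[, c(sort(names(merged)[!is_suffixed]),
                     sort(names(merged)[is_suffixed]))]
names(merged)
```

With dplyr, the equivalent change would be to pass "_df[12]$" to matches() in the select call.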

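This should do it in base R:
# TRUE for columns whose name contains a dot (the .x/.y columns)
extCols <- grepl("\\.", colnames(df))
# Unsuffixed columns first, then the suffixed ones sorted by name
df[, c(colnames(df)[(!extCols)],
sort(colnames(df)[extCols]))]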
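Sorting by name puts the pairs in alphabetical order (city, day, number), whereas the question lists them as city, number, day. If the order of first appearance matters, a sketch along the following lines may help. It uses a small stand-in data frame rather than the full example, and the helper names suffixed, base and pair_order are illustrative.

```r
# Small stand-in for the merged data from the question.
df <- data.frame(city.y = 1:2, B = 3:4, number.x = 5:6, day.x = 7:8,
                 Z = 9:10, day.y = 11:12, number.y = 13:14, city.x = 15:16)

suffixed <- grepl("\\.[xy]$", names(df))               # columns ending in .x or .y
base <- sub("\\.[xy]$", "", names(df)[suffixed])       # names without the suffix

# Rank each suffixed column by the first appearance of its base name,
# breaking ties by the full name so that .x comes before .y.
pair_order <- order(match(base, unique(base)), names(df)[suffixed])

# Unsuffixed columns (sorted) first, then the ordered pairs.
df <- df[, c(sort(names(df)[!suffixed]), names(df)[suffixed][pair_order])]
names(df)
```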

Related

How to replace the NAs with the column mean and then multiply columns of different lengths in R?

My question might be not so clear so I am putting an example.
My final goal is to produce
final=(df1$a*df2$b)+(df1$a*df3$c*df4$d)+(df4$d*df5$e)
I have five data frames (one column each) with different lengths as follows:
df1
a
1. 1
2. 2
3. 4
4. 2
df2
b
1. 2
2. 6
df3
c
1. 2
2. 4
3. 3
df4
d
1. 1
2. 2
3. 4
4. 3
df5
e
1. 4
2. 6
3. 2
So I want a final database which includes them all as follows
finaldf
a b c d e
1. 1 2 2 1 4
2. 2 6 4 2 6
3. 4 NA 3 4 2
4. 2 NA NA 3 NA
I want all the NAs for each column to be replaced with the mean of that column, so the finaldf has equal length of all the columns:
finaldf
a b c d e
1. 1 2 2 1 4
2. 2 6 4 2 6
3. 4 4 3 4 2
4. 2 4 3 3 4
and therefore I can produce a final result for final=(df1$a*df2$b)+(df1$a*df3$c*df4$d)+(df4$d*df5$e) as I need.
The easiest by far is to use the qpcR, dplyr and tidyr packages.
library(dplyr)
library(qpcR)
library(tidyr)
df1 <- data.frame(a=c(1,2,4,2))
df2 <- data.frame(b=c(2,6))
df3 <- data.frame(c=c(2,4,3))
df4 <- data.frame(d=c(1,2,4,3))
df5 <- data.frame(e=c(4,6,2))
mydf <- qpcR:::cbind.na(df1, df2, df3, df4,df5) %>%
tidyr::replace_na(.,as.list(colMeans(.,na.rm=T)))
> mydf
a b c d e
1 1 2 2 1 4
2 2 6 4 2 6
3 4 4 3 4 2
4 2 4 3 3 4
Depending on your rgl settings, you might need to run the following at the top of your script to make the qpcR package load (see https://stackoverflow.com/a/66127391/2554330 ):
options(rgl.useNULL = TRUE)
library(rgl)
With purrr, dplyr and tidyr, we can first put all data frames in a list with mget(). Second, use set_names to replace the data frame names with their respective column names. Third, extract the single column of each data frame as a vector with pluck. Then add the NAs by padding all vectors to the same length.
Finally, bind all vectors back into a data frame with as.data.frame, then use mutate with across and replace_na to fill each column's NAs with that column's mean.
library(dplyr)
library(purrr)
library(tidyr) # for replace_na
mget(ls(pattern = 'df\\d')) %>%
set_names(map_chr(., colnames)) %>%
map(pluck, 1) %>%
map(., `length<-`, max(lengths(.))) %>%
as.data.frame %>%
mutate(across(everything(), ~replace_na(.x, mean(.x, na.rm=TRUE))))
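Neither answer goes on to compute the final expression from the question. A base R sketch of the whole path (pad, fill with the column mean, then combine) could look like this. The intermediate names cols, padded and filled are illustrative and not from the answers above.

```r
df1 <- data.frame(a = c(1, 2, 4, 2))
df2 <- data.frame(b = c(2, 6))
df3 <- data.frame(c = c(2, 4, 3))
df4 <- data.frame(d = c(1, 2, 4, 3))
df5 <- data.frame(e = c(4, 6, 2))

cols <- list(a = df1$a, b = df2$b, c = df3$c, d = df4$d, e = df5$e)
n <- max(lengths(cols))

# Pad each vector with NA up to the length of the longest one.
padded <- lapply(cols, function(v) { length(v) <- n; v })

# Replace each NA with the mean of the observed values in that column.
filled <- lapply(padded, function(v) { v[is.na(v)] <- mean(v, na.rm = TRUE); v })

finaldf <- as.data.frame(filled)

# The expression from the question, evaluated row by row.
final <- with(finaldf, a * b + a * c * d + d * e)
final
```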

combine datasets by the value of multiple columns

I'm trying to enter values based on the value of multiple columns from two datasets.
I have my main dataset (df1), with lists of a location and corresponding dates and df2 consists of a list of temperatures at all locations on every possible date. Eg:
df1
Location Date
A 2
B 1
C 1
D 3
B 3
df2
Location Date1Temp Date2Temp Date3Temp
A -5 -4 0
B 2 0 2
C 4 4 5
D 6 3 4
I would like to create a temperature variable in df1, according to the location and date of each observation. Preferably I would like to carry this out with all Temperature data in the same dataframe, but this can be separated and added 'by date' if necessary. With the example data, I would want this to create something like this:
Location Date Temp
A 2 -4
B 1 2
C 1 4
D 3 4
B 3 2
I've been playing around with merge and ifelse, but haven't figured anything out yet.
Is this what you need?
library(reshape2)
library(magrittr)
df1 <- data.frame(Location= c("A","B","C","D","B"),Date=c(2,1,1,3,3))
df2 <- data.frame(Location= c("A","B","C","D"),d1t=c(-5,5,4,6),d2t=c(-4,0,4,3),d3t=c(0,2,5,4))
merge(df1,df2) %>% melt(id.vars=c("Location","Date"))
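The melted result still contains one row per location and date column, so the matching temperature has to be picked out afterwards. As an alternative sketch in base R, the lookup can be done directly with matrix indexing. This assumes the columns d1t, d2t and d3t are in date order, and it uses the example values from this answer (where B on date 1 is 5).

```r
df1 <- data.frame(Location = c("A", "B", "C", "D", "B"), Date = c(2, 1, 1, 3, 3))
df2 <- data.frame(Location = c("A", "B", "C", "D"),
                  d1t = c(-5, 5, 4, 6), d2t = c(-4, 0, 4, 3), d3t = c(0, 2, 5, 4))

# Temperatures as a matrix: one row per location, one column per date.
temps <- as.matrix(df2[, c("d1t", "d2t", "d3t")])

# Row of df2 that corresponds to each observation in df1.
rows <- match(df1$Location, df2$Location)

# Index the matrix with (row, column) pairs: column = date number.
df1$Temp <- temps[cbind(rows, df1$Date)]
df1
```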
Here's how to do that with dplyr and tidyr.
Basically, you want to use gather to melt the DateXTemp columns from df2 into two columns, Date and Temp. Then use gsub to remove the "Date" and "Temp" strings, leaving numbers comparable to the Date column in df1. Because the result of gsub is still a character vector, convert it to numeric with as.numeric. Finally, use left_join to join the tables.
library(dplyr);library(tidyr)
df1 <- data.frame(Location= c("A","B","C","D","B"),Date=c(2,1,1,3,3))
df2 <- data.frame(Location= c("A","B","C","D"),Date1Temp=c(-5,5,4,6),
Date2Temp=c(-4,0,4,3),Date3Temp=c(0,2,5,4))
df2_new <- df2%>%
gather(Date,Temp,Date1Temp:Date3Temp)%>%
mutate(Date=gsub("Date|Temp","",Date))%>%
mutate(Date=as.numeric(Date))
df1%>%left_join(df2_new)
Joining, by = c("Location", "Date")
Location Date Temp
1 A 2 -4
2 B 1 5
3 C 1 4
4 D 3 4
5 B 3 2
EDIT
As suggested by @Sotos, you can do that in a single pipe like so:
df2%>%
gather(Date,Temp,Date1Temp:Date3Temp)%>%
mutate(Date=gsub("Date|Temp","",Date))%>%
mutate(Date=as.numeric(Date))%>%
left_join(df1,.)
Joining, by = c("Location", "Date")
Location Date Temp
1 A 2 -4
2 B 1 5
3 C 1 4
4 D 3 4
5 B 3 2
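In current tidyr releases gather is superseded by pivot_longer, which can also extract and convert the date number in one step. The following is a sketch of the same join under that approach, assuming a tidyr version that supports names_pattern and names_transform. The name df2_long is illustrative.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(Location = c("A", "B", "C", "D", "B"), Date = c(2, 1, 1, 3, 3))
df2 <- data.frame(Location = c("A", "B", "C", "D"), Date1Temp = c(-5, 5, 4, 6),
                  Date2Temp = c(-4, 0, 4, 3), Date3Temp = c(0, 2, 5, 4))

df2_long <- df2 %>%
  pivot_longer(-Location,
               names_to = "Date",
               names_pattern = "Date(\\d+)Temp",          # keep only the date number
               names_transform = list(Date = as.numeric), # character -> numeric
               values_to = "Temp")

# Join on both keys explicitly to avoid the "Joining, by" message.
result <- left_join(df1, df2_long, by = c("Location", "Date"))
result
```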

How to merge rows that have the same information in all columns except one?

I have a large data frame that looks something like this:
A 1 2 3 4 ...
B 1 2 3 4 ...
C 1 2 3 4 ...
D 5 2 1 4 ...
E 3 2 3 9 ...
F 0 0 2 2 ...
G 0 0 2 2 ...
As you can see some rows are duplicate entries if you disregard the first column for a second. I would like to combine/merge these rows to generate something like this:
A;B;C 1 2 3 4 ...
D 5 2 1 4 ...
E 3 2 3 9 ...
F;G 0 0 2 2 ...
I could write a for loop, which iterates over the rows, but that would be neither pretty, nor efficient. I am pretty certain there's a better way to do this.
I thought I could:
slice the df so I have all columns except the first: slice <- df[, 2:ncol(df)]
get a data frame with all "duplicate" rows: dups <- df[duplicated(slice), ]
get another data frame with the "unique" rows: uniq <- df[unique(slice)]
merge them using all but the first column: merge(uniq, dups, by... )
Except that won't work, since unique doesn't return indices but a whole data frame, which means I cannot index df with the corresponding rows from slice.
Any suggestions?
EDIT: I should clarify that A,B,C... are not rownames but actually part of the dataframe, entries given in string/character representation
There are several functions that would do this. All of them are the common aggregation functions: aggregate, tapply, by, ..., and, of course, the popular "data.table" and "dplyr" set of functions.
Here's aggregate:
aggregate(V1 ~ ., mydf, toString)
# V2 V3 V4 V5 V6 V1
# 1 0 0 2 2 ... F, G
# 2 5 2 1 4 ... D
# 3 1 2 3 4 ... A, B, C
# 4 3 2 3 9 ... E
Other options (as indicated in the opening paragraph):
library(data.table)
as.data.table(mydf)[, toString(V1), by = eval(setdiff(names(mydf), "V1"))]
library(dplyr)
mydf %>%
group_by(V2, V3, V4, V5, V6) %>%
summarise(V1 = toString(V1))
Instead of toString, you can use the classic paste(., collapse = ";") approach, which gives you more control over the separator in the final output.
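As a sketch of the paste variant with aggregate, using a stand-in for the data in the question (the column names V1 to V5 are assumed, as in the answer above):

```r
# Stand-in for the question's data; V1 holds the labels.
mydf <- data.frame(V1 = c("A", "B", "C", "D", "E", "F", "G"),
                   V2 = c(1, 1, 1, 5, 3, 0, 0),
                   V3 = c(2, 2, 2, 2, 2, 0, 0),
                   V4 = c(3, 3, 3, 1, 3, 2, 2),
                   V5 = c(4, 4, 4, 4, 9, 2, 2))

# Extra arguments after FUN are passed on to paste().
merged <- aggregate(V1 ~ ., mydf, paste, collapse = ";")

# Move the label column back to the front.
merged <- merged[, c("V1", setdiff(names(merged), "V1"))]
merged
```

The rows come back ordered by the grouping columns rather than in their original order, so sort afterwards if the order matters.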

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
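If your data is truly a matrix and not a data.frame, you can work around this too. A matrix has no $ operator, so select the ID column by name, and convert the sampled row numbers back to integers before indexing:
dat[as.integer(tapply(as.character(seq(nrow(dat))),dat[,"ID"],FUN=sample,1)),]
Don't be tempted to remove the as.character, as sample gives unintended results when it is passed a single number n: it samples from 1:n instead of returning n. E.g.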
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
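Another way to sidestep that behaviour is to sample a position instead of a value. The helper pick_one below is a hypothetical name, not part of base R, and the sketch assumes the example data from the question.

```r
# Sample one element of x by position, so a length-one numeric x is returned as is.
pick_one <- function(x) x[sample.int(length(x), 1)]

dat <- data.frame(ID = c(1, 2, 2, 3, 4, 4), Value = c(10, 5, 8, 15, 7, 9))

set.seed(1)  # only to make the sketch reproducible
# One randomly chosen row number per ID group.
picked <- dat[tapply(seq_len(nrow(dat)), dat$ID, pick_one), ]
picked
```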
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

How to match 1 column to 2 columns?

I'm trying to match numbers from one column to numbers in two other columns. I can do this just fine when matching to only a single column, but have problems extending to two columns. Here is what I am doing:
I have 2 dataframes, df1:
number value
1
2
3
4
5
and df2:
number_a number_b value
3 NA 3
NA 1 5
5 NA 1
NA 4 2
NA 2 4
What I want to do is match column "number" from df1 to EITHER "number_a" or "number_b" in df2, then insert "value" from df2 into "value" of df1, to give the result df1 as:
number value
1 5
2 4
3 3
4 2
5 1
My approach is to use
df1$value <- df2$value[match(df1$number, df2$number_a)]
or
df1$value <- df2$value[match(df1$number, df2$number_b)]
which yields, respectively, for df1
number value
1 NA
2 NA
3 3
4 NA
5 1
and
number value
1 5
2 4
3 NA
4 2
5 NA
However, I can't seem to fill in all of the "value" column in df1 using this approach. How can I match "number" to "number_a" and "number_b" in one fell swoop? I tried
df1$value <- df2$value[match(df1$number, df2$number_a:number_b)]
but that didn't work.
Thanks!
Easier solution:
df2$number <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)
If you're not familiar with ifelse, it works with vectors in the form:
ifelse(Condition, ValueIfTrue, ValueIfFalse)
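Once the combined number column exists, the match approach from the question fills in every value. This is a sketch that assumes the blank cells of df2 in the question are NA and that exactly one of number_a and number_b is present in each row.

```r
df1 <- data.frame(number = 1:5, value = NA)
df2 <- data.frame(number_a = c(3, NA, 5, NA, NA),
                  number_b = c(NA, 1, NA, 4, 2),
                  value    = c(3, 5, 1, 2, 4))

# Take number_b where number_a is missing, otherwise number_a.
df2$number <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)

# Look up each df1 number in the combined column.
df1$value <- df2$value[match(df1$number, df2$number)]
df1
```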
I am a newbie to R (coming from several years with C). Was trying out the suggestions and I thought I would paste what I came up with:
# Assuming either 'number_a' or 'number_b' is valid
# Combine into new column 'number' and delete the original columns
df2 <- transform(df2, number = ifelse(is.na(df2$number_a), df2$number_b,
df2$number_a))[-c(1:2)]
# Combine the two data frames by the column 'number'
# (keep only 'number' from df1 so its empty 'value' column is not duplicated)
df <- merge(df1["number"], df2, by = "number")
number value
1 5
2 4
3 3
4 2
5 1
