Selecting between rows in a dataframe - r

I am working with a rather noisy data set and I was wondering if there was a good way to selectively choose between two rows of data within a group or leave them alone. Logic-wise I want to filter by group and then build an if-else type control structure to compare rows based on the value of a second column.
Example:
Row ID V1 V2
1 1 blah 1.2
2 1 blah NA
3 2 foo 2.3
4 3 bar NA
5 3 bar NA
I want to group by ID (1, 2, 3) then go to column V2 and choose for example, row 1 over row 2 because row 2 has NA. But for rows 4 and 5, where both are 'NA' I want to just leave them alone.
Thanks,

What you need might really depends on what you exactly have. In case of NAs, this might help:
df <- data.frame(
Row = c(1, 2, 3, 4, 5),
ID = c(1, 1, 2, 3, 3),
V1 = c("bla", "bla", "foo", "bla", "bla"),
V2 = c(1.2, NA, 2.3, NA, NA),
stringsAsFactors = FALSE)
df <- df[complete.cases(df), ]

A solution using purrr. The idea is to split the data frame by ID. After that, apply a user-defineed function, which evaluates if all the elements in V2 are NA. If TRUE, returns the original data frame, otherwise returns the subset of the data frame by filtering out rows with NA using na.omit. map_dfr is similar to lapply, but it can combine all the data frames in a list automatically. dt2 is the final output.
library(purrr)
dt2 <- dt %>%
split(.$ID) %>%
map_dfr(function(x){
if(all(is.na(x$V2))){
return(x)
} else {
return(na.omit(x))
}
})
dt2
# Row ID V1 V2
# 1 1 1 blah 1.2
# 2 3 2 foo 2.3
# 3 4 3 bar NA
# 4 5 3 bar NA
DATA
dt <- read.table(text = "Row ID V1 V2
1 1 blah 1.2
2 1 blah NA
3 2 foo 2.3
4 3 bar NA
5 3 bar NA",
header = TRUE, stringsAsFactors = FALSE)

Related

Add new variable to specific position in dataframe without specifying a numbered position

In Stata, I can create a variable after or before another one. E.g. gen age=., after(sex)
I would like to do the same in R. Is it possible?
My database has 300 variables, so I don't want to count it to discover its numbered position and also I might change from time to time.
You could do:
library(tibble)
data <- data.frame(a = c(1,2,3), b = c(1,2,3), c = c(1,2,3))
add_column(data, d = "", .after = "b")
# a b d c
# 1 1 1
# 2 2 2
# 3 3 3
Or another way could be:
data.frame(append(data, list(d = ""), after = match("b", names(data))))
First add the new column to the end of your data frame. Then, find the index of the column after which you want that new column to actually appear, and interpolate it:
df$new_col <- ...
index <- match("col_before", names(df))
df <- df[, c(names(df)[c(1:index)], "new_col", names(df)[c((index+1):(ncol(df)-1))])]
Sample:
df <- data.frame(v1=c(1:3), v2=c(4:6), v3=c(7:9))
df$new_col <- c(7,7,7)
index <- match("v2", names(df))
df <- df[, c(names(df)[c(1:index)], "new_col", names(df)[c((index+1):(ncol(df)-1))])]
df
v1 v2 new_col v3
1 1 4 7 7
2 2 5 7 8
3 3 6 7 9

Sum Values of Every Column in Data Frame with Conditional For Loop

So I want to go through a data set and sum the values from each column based on the condition of my first column. The data and my code so far looks like this:
x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20
for(i in colnames(data)){
if(data$x>2){
x1 <-sum(data[[i]])
}
else{
x2 <-sum(data[[i]])
}
}
My assumption was that the for loop would call each column by name from the data and then sum the values in each column based on whether they matched the condition of column x.
I want to sum half the values from each column and assign them to a value x1 and do the same for the remainder, assigning it to x2. I keep getting an error saying the following:
the condition has length > 1 and only the first element will be used
What am I doing wrong and is there a better way to go about this? Ideally I want a table that looks like this:
v1 v2 v3
x1 6 7 35
x2 4 3 15
Here's a dplyr solution. First, I define the data frame.
df <- read.table(text = "x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20", header = TRUE)
# x v1 v2 v3
# 1 1 0 1 5
# 2 2 4 2 10
# 3 3 5 3 15
# 4 4 1 4 20
Then, I create a label (x_check) to indicate which group each row belongs to based on your criterion (x > 2), group by this label, and summarise each column with a v in its name using sum.
# Load library
library(dplyr)
df %>%
mutate(x_check = ifelse(x>2, "x1", "x2")) %>%
group_by(x_check) %>%
summarise_at(vars(contains("v")), funs(sum))
# # A tibble: 2 x 4
# x_check v1 v2 v3
# <chr> <int> <int> <int>
# 1 x1 6 7 35
# 2 x2 4 3 15
Not sure if I understood your intention correctly, but here is how you would reproduce your results with base R:
df <- data.frame(
x = c(1:4),
v1 = c(0, 4, 5, 1),
v2 = 1:4,
v3 = (1:4)*5
)
x1 <- colSums(df[df$x > 2, 2:4, drop = FALSE])
x2 <- colSums(df[df$x <= 2, 2:4, drop = FALSE])
Where
df[df$x > 2, 2:4, drop = FALSE] will create a subset of df where the rows satisfy df$x > 2 and the columns are 2:4 (meaning the second, third and fourth column), drop = FALSE is there mainly to prevent R from simplifying the results in some special cases
colSums does a by-column sum on the subsetted data.frame
If your x column was really a condition (e.g. a logical vector) you could just do
x1 <- colSums(df[df$x, 2:4, drop = FALSE])
x2 <- colSums(df[!df$x, 2:4, drop = FALSE])
Note that there is no loop needed to get to the results, with R you should use vectorized functions as much as possible.
More generally, you could do such aggregation with aggregate:
aggregate(df[, 2:4], by = list(condition = df$x <= 2), FUN = sum)

R: compare multiple columns pairs and place value on new corresponding variable

Am a basic R user.
I have 50 column pairs (example pair is: "pair_q1" and "pair_01_v_rde") per "id" in the same dataframe that I would like to collect data from and place it in a new corresponding variable e.g. "newvar_q1".
All the pair variable names have a pattern in their names that can be distilled to this ("pair_qX" and "pair_X_v_rde", where X = 1:50, and the final variables I would like to have are "newvar_qX", where X = 1:50)
Ideally only one member of the pair should contain data, but this is not the case.
Each of the variables can contain values from 1:5 or NA(missing).
Rules for collecting data from each pair based on "id" and what to place in their newly created corresponding variable are:
If one of the pairs has a value and the other is missing then place the value in their corresponding new variable. e.g. ("pair_q1" = 1 and "pair_01_v_rde" = NA then "newvar_q1" = 1)
If both pairs have the same value or both are missing then place that value/missing in their corresponding new variable e.g. ("pair_q50" = 1/NA and "pair_50_v_rde" = 1/NA then "newvar_q50" = 1/NA)
If both pairs have different values then ignore both values and assign their corresponding new variable 999 e.g. ("pair_q02" = 3 and "pair_02_v_rde" = 2 then "newvar_q02" = 999)
Can anyone show me how I can execute this in R please?
Thanks!
Nelly
# Create Toy dataset
id <- c(100, 101, 102)
pair_q1 <- c(1, NA, 1)
pair_01_v_rde <- c(NA, 2, 1)
pair_q2 <- c(1, 1, NA)
pair_02_v_rde <- c(2, NA, NA)
pair_q50 <- c(NA, 2, 4)
pair_50_v_rde <- c(4, 3, 1)
mydata <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde)
# The dataset
> mydata
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
1 100 1 NA 1 2 NA 4
2 101 NA 2 1 NA 2 3
3 102 1 1 NA NA 4 1
# Here I manually build what I would like to have in the dataset
newvar_q1 <- c(1, 2, 1)
newvar_q2 <- c(999, 1, NA)
newvar_q50 <- c(4, 999, 999)
mydata2 <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde, newvar_q1, newvar_q2, newvar_q50)
> mydata2
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde newvar_q1 newvar_q2 newvar_q50
1 100 1 NA 1 2 NA 4 1 999 4
2 101 NA 2 1 NA 2 3 2 1 999
3 102 1 1 NA NA 4 1 1 NA 999
A possible solution using the 'tidyverse' (use 'inner_join(mydata,.,by="id")' to get the new columns in the order you give in your question):
mydata %>%
select(id,matches("^pair_q")) %>% # keeps only left part of pairs
gather(k,v1,-id) %>% # transforms into tuples (id,variable name,variable value)
mutate(n=as.integer(str_extract(k,"\\d+"))) -> df1 # converts variable name into variable number
mydata %>%
select(id,matches("^pair_\\d")) %>% # same on right part of pairs
gather(k,v2,-id) %>%
mutate(n=as.integer(str_extract(k,"\\d+"))) -> df2
inner_join(df1,df2,by=c("id","n")) %>%
mutate(w=case_when(is.na(v1) ~ v2, # builds new variable value
is.na(v2) ~ v1, # from your rules
v1==v2 ~ v1,
TRUE ~999),
k=paste0("newvar_q",n)) %>% # builds new variable name from variable number
select(id,k,w) %>% # keeps only useful columns
spread(k,w) %>% # switches back from tuple view to wide view
inner_join(mydata,by="id") # and merges the new variables to the original data
# id newvar_q1 newvar_q2 newvar_q50 pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
#1 100 1 999 4 1 NA 1 #2 NA 4
#2 101 2 1 999 NA 2 1 NA 2 3
#3 102 1 NA 999 1 1 NA NA 4 1

Create then populate colmuns in a dataframe

Hello I'm trying to find a way to create new columns in a dataframe the populate them.
For example:
id = c(2, 3, 5)
v1 = c(2, 1, 7)
v2 = c(1, 9, 5)
duration=c(v1+v2)
df = data.frame(id,v1,v2,duration,stringsAsFactors=FALSE)
id v1 v2 duration
1 2 2 1 3
2 3 1 9 10
3 5 7 5 12
Now I want to create new columns by dividing each value of a row by the 'duration' of said row, I know how do it manually but it is prone to errors and not really elegant...
df$I_v1=v1/duration
df$I_v2=v2/duration
Or is df <- df %>% mutate(I_v1 = v1/duration) quicker/better?
id v1 v2 duration I_v1 I_v2
1 2 2 1 3 0.6666667 0.3333333
2 3 1 9 10 0.1000000 0.9000000
It works but I would like to know if it's possible to create -and name- the row and populate them automatically.
Say that you have a cols vector containing the names of the columns you want to manipulate. In your example:
cols<-c("v1","v2")
Then you can try:
df[paste0("I_",cols)]<-df[cols]/df$duration
# id v1 v2 duration I_v1 I_v2
#1 2 2 1 3 0.6666667 0.3333333
#2 3 1 9 10 0.1000000 0.9000000
#3 5 7 5 12 0.5833333 0.4166667
You can use transform():
df <- data.frame(id=c(2, 3, 5), v1=c(2, 1, 7), v2=c(1, 9, 5))
df$duration <- df$v1 + df$v2) # or ... <- with(df, v1 + v2)
df_new <- transform(df, I_v1=v1/duration, I_v2=v2/duration )
... or (if you have many columns v1, v2, ...):
as.matrix(df[, 2:3])/df$duration # or with cbind():
cbind(df, as.matrix(df[, 2:3])/df$duration)
(similar as in the answer from nicola)
All data frames have a row names attribute, a character vector of length the number of rows with no duplicates nor missing values. You can name the rows as:
row.names(x) <- value
Arguments:
x
object of class "data.frame", or any other class for which a method has been defined.
value
an object to be coerced to character unless an integer vector.e here

Select row numbers of a data frame conditioning on another data frame

I have a data frame that I want to find the row numbers where these rows are in common with another data frame.
To make the question clear, say I have data frame A and data frame B:
dfA <- data.frame(NAME = rep(c("a", "b"), each = 3),
TRIAL = rep(1:3, 2),
DATA = runif(6))
dfB <- data.frame(NAME = c("a", "b"),
TRIAL = c(2, 3))
dfA
# NAME TRIAL DATA
# 1 a 1 0.62948592
# 2 a 2 0.88041819
# 3 a 3 0.02479411
# 4 b 1 0.48031827
# 5 b 2 0.86591315
# 6 b 3 0.93448264
dfB
# NAME TRIAL
# 1 a 2
# 2 b 3
I want to get dfA's row number where dfA and dfB have the same NAME and TRIAL, in this case, row numbers are 2 and 6.
I tried the following code, gives me row 2, 3, 5, 6. It separately matches NAME and TRIAL, doesn't work.
which(dfA$NAME %in% dfB$NAME & dfA$TRIAL %in% dfB$TRIAL)
# 2 3 5 6
Then I tried to create a dummy column and match this col. Works, but the code would be verbose if dfB has many columns...
dfA$dummy <- paste0(dfA$NAME, dfA$TRIAL)
dfB$dummy <- paste0(dfB$NAME, dfB$TRIAL)
which(dfA$dummy %in% dfB$dummy)
# 2 6
I'm wondering if there are better ways to solve the problem, thanks for your help!
You can do:
merge(transform(dfA, row.num = 1:nrow(dfA)), dfB)$row.num
# [1] 2 6
And if the whole goal of finding the indices is so that you can subset dfA, then you can just do merge(dfA, dfB).
Or use duplicated:
apply(dfB, 1, function(x)
which(duplicated(rbind(x, dfA[1:2])))-1)
# [1] 2 6

Resources