Subsetting of data.frames with variable name vs. column number - r

I am fairly new to R and I have run into a problem with subsetting data frames a number of times. I have found a fix but would just like to understand what I am missing.
Here is a short example where I don't understand the functional difference.
Example data frame:
df <- data.frame(V1 = c(1:10), V2 = c(rep(1, times = 10)))
this produces an "undefined columns selected" error:
df1 <- df[df$V1 < 5, df$V2]
but this works:
df2 <- df[df$V1 < 5, 2]
I don't understand why referring to the column by its name via $V2 does not give the same result as referring to the same column by its number.
This is a really basic question, I am aware, but I would just like to get my head around it.
Thanks and also sorry if formatting is off or anything (first time posting..),
Christoph

df[df$V1 < 5, df$V2] doesn't give an "undefined columns selected" error.
df[df$V1 < 5, df$V2]
# V1 V1.1 V1.2 V1.3 V1.4 V1.5 V1.6 V1.7 V1.8 V1.9
#1 1 1 1 1 1 1 1 1 1 1
#2 2 2 2 2 2 2 2 2 2 2
#3 3 3 3 3 3 3 3 3 3 3
#4 4 4 4 4 4 4 4 4 4 4
Because df$V2 contains only the value 1, and column 1 exists in your data frame, R selects the 1st column length(df$V2) times. Since duplicate column names are discouraged, it makes them unique by appending the suffixes .1, .2, and so on.
This is same as doing
df[df$V1 < 5, c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)]
It would give an "undefined columns selected" error if you selected columns that are not present in the data.
df[df$V1 < 5, c(1, 3)]
Error in `[.data.frame`(df, df$V1 < 5, c(1, 3)) :
  undefined columns selected
There are different ways to access the data.
By column name:
df[df$V1 < 5, "V2"]
#[1] 1 1 1 1
Or
df$V2[df$V1 < 5]
and by column position.
df[df$V1 < 5, 2]
#[1] 1 1 1 1
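To make the difference concrete, here is a small sketch using the data frame from the question. The object names repeated, by_name, by_pos and as_df are only illustrative. The point is that df$V2 evaluates to the values stored in the column (ten 1s), and those values are then interpreted as column positions.

```r
# Example data from the question
df <- data.frame(V1 = 1:10, V2 = rep(1, times = 10))

# df$V2 is a vector of VALUES (ten 1s); used as a column index it
# asks for column 1 ten times
repeated <- df[df$V1 < 5, df$V2]
dim(repeated)

# Selecting by name or by position returns the intended column
by_name <- df[df$V1 < 5, "V2"]
by_pos  <- df[df$V1 < 5, 2]
identical(by_name, by_pos)

# drop = FALSE keeps the result as a one-column data frame
# instead of simplifying it to a vector
as_df <- df[df$V1 < 5, "V2", drop = FALSE]
```

If V2 held values larger than the number of columns (for example 3), the same call would fail with "undefined columns selected", which may be what happened with the original data.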

Related

Count equal elements in R

I have a vector x:
x <- c(1, 2, 3, 1, 4, 5, 6, 2, 3, 2, 3, 8)
How can I count the repeated elements in x? I want the returned result to be 3.
Explanation: There are 3 values (1, 2, 3) that appear at least twice.
With x[i]==1 there are 2 elements, count=1
With x[i]==2 there are 3 elements, count=2
With x[i]==3 there are 3 elements, count=3
I want the result to be count=3.
Thank you very much!
So if I understand correctly, you want to count how many distinct numbers are repeated in the vector.
One way to do it would be to construct a table from the vector, and see how many elements have a count higher than one:
x <- c(1, 2, 3, 1, 4, 5, 6, 2, 3, 2, 3, 8)
tab <- table(x)
sum(tab > 1)
#> [1] 3
Created on 2020-11-26 by the reprex package (v0.3.0)
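If you also want to see which values are repeated, and not only how many, the same table can be reused. This is just a sketch; repeated_values and n_repeated are names I made up for illustration.

```r
x <- c(1, 2, 3, 1, 4, 5, 6, 2, 3, 2, 3, 8)

# table() counts occurrences; the names of the table are the distinct values
tab <- table(x)

# Keep the values whose count is above one and convert the names back to numbers
repeated_values <- as.numeric(names(tab)[tab > 1])

# The number of repeated values is the answer asked for in the question
n_repeated <- length(repeated_values)
```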
Here are a couple of base R options:
Using rle :
sum(with(rle(sort(x)), lengths > 1))
#[1] 3
With tapply :
sum(tapply(x, x, length) > 1)
library(tibble)
library(dplyr)
x<-c( 1, 2 , 3 , 1 , 4 , 5 , 6 , 2 , 3 , 2 , 3 , 8 )
tibble::tibble(x) %>% dplyr::count(x)
Edit:
As I was kindly asked to add some comments I will gladly do so.
Here I employed the tibble package to create a data.frame (or tibble, rather) and the dplyr package for counting.
tibble::tibble(x)
turns the vector x into a tibble (type of data.frame) for further analysis. Unfortunately, the only variable of the tibble x is also called x. Sorry for that!
%>%
The pipe operator takes the value to its left (here, the newly created tibble with the column x) and provides it as input for the subsequent command:
dplyr::count(x)
Here, we use dplyr to count() the variable x from the tibble x (again, sorry for that). The result will show which values occurred how many times:
x n
<dbl> <int>
1 1 2
2 2 3
3 3 3
4 4 1
5 5 1
6 6 1
7 8 1
where the first column (1 to 7) is simply the row number, x holds the values provided by the original question, and n counts how often each value occurred.
length(unique(x[duplicated(x)]))
# 3
Data
x <- c(1, 2, 3, 1, 4, 5, 6, 2, 3, 2, 3, 8)
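For anyone wondering why the duplicated() one-liner works, here it is broken into steps. The intermediate names dup_flags, later_copies and n_repeated are only for illustration.

```r
x <- c(1, 2, 3, 1, 4, 5, 6, 2, 3, 2, 3, 8)

# duplicated() is TRUE for the second and any later occurrence of a value
dup_flags <- duplicated(x)

# These are the "extra" copies; a value seen three times appears twice here
later_copies <- x[dup_flags]

# unique() collapses them, leaving one entry per repeated value
n_repeated <- length(unique(later_copies))
```

The unique() step matters: without it, values that occur three or more times would be counted more than once.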

How to organize based on specific data values

I have this data frame:
Data <- structure(list(ID = c(101, 102, 103, 104, 105, 106),
`1Var` = c(1, 3, 3, 1, 1, 1), `2Var` = c(1, 1, 1, 1, 1, 1),
`3Var` = c(3, 1, 1, 1, 1, 1), `4Var` = c(1, 1, 1, 1, 1, 1)),
row.names = c(NA, 6L), class = "data.frame")
I have been trying to subset based on values of 1 and 0. In this data table there are no 0 values but my full data has it.
I toyed around with this method:
Prime <- grep('$Var', names(Data))
DataPrime <- Data[rowSums(Data[Prime] <= 1),]
I am getting duplicated observations, though. Another issue with this method is that it keeps all rows that contain a 1 or 0, rather than only rows made up entirely of 1s and 0s. So a row that has a 3 in one variable and 1 in the rest is still kept in my data.
I think my method will work but I'm not sure what else I need to specify in the argument. I tried a simple subset too but that removed everything from the data:
DataPrime <- subset(Data, '1Var' <=1, '2Var' <=1, '3Var' <=1, '4Var' <=1)
I essentially want my data to look something like this:
ID 1Var 2Var 3Var 4Var
4 104 1 1 1 1
5 105 1 1 1 1
6 106 1 1 1 1
We can use Reduce with & to create a logical vector for subsetting the rows
subset(Data, Reduce(`&`, lapply(Data[-1], `<=`, 1)))
-output
# ID 1Var 2Var 3Var 4Var
#4 104 1 1 1 1
#5 105 1 1 1 1
#6 106 1 1 1 1
Or another option is rowSums
subset(Data, !rowSums(Data[-1] > 1))
I think you're looking for something like:
Prime <- grep('\\dVar', names(Data))
Data[apply(Data[Prime], 1, function(x) !any(x > 1)),]
#> ID 1Var 2Var 3Var 4Var
#> 4 104 1 1 1 1
#> 5 105 1 1 1 1
#> 6 106 1 1 1 1
A few things to note are:
Your regex inside grep was wrong. The "$" symbol represents the end of a string, not a number. For numbers you can use \\d . Your Prime variable is therefore empty in the example.
It's best not to have column names (or any variable name) starting with numbers. These are not legal names in R. You can get round this by surrounding them with backticks, but this is easy to overlook and is a source of bugs.
rowSums adds up all the values in each row, so the lowest sum of any of the rows is 4, whereas rowSums(Data[Prime] <= 1) gives the total number of entries that are one or less, giving a vector like c(3, 3, 3, 4, 4, 4). Subsetting Data by this will give 3 copies of row 3 then three copies of row 4, which clearly isn't what you want.
In subset, you need the logical conjunction of all your var <= 1 terms, so you should split these with &, not with commas.
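The rowSums point is easier to see by running it. Below is a sketch that rebuilds the example data with data.frame(..., check.names = FALSE) (my own construction, not the structure() call from the question) and compares indexing by the counts with indexing by a logical vector. The names counts, wrong, keep and right are illustrative.

```r
# Rebuild the example data; check.names = FALSE keeps names like "1Var" as-is
Data <- data.frame(ID = 101:106,
                   `1Var` = c(1, 3, 3, 1, 1, 1),
                   `2Var` = rep(1, 6),
                   `3Var` = c(3, 1, 1, 1, 1, 1),
                   `4Var` = rep(1, 6),
                   check.names = FALSE)

Prime <- grep("\\dVar", names(Data))

# Number of entries per row that are <= 1
counts <- rowSums(Data[Prime] <= 1)

# Using the counts directly treats them as ROW NUMBERS: rows 3,3,3,4,4,4
wrong <- Data[counts, ]

# Turn the counts into a logical test instead: every Var column must be <= 1
keep  <- counts == length(Prime)
right <- Data[keep, ]
```

Comparing the count with length(Prime) is just one way to express "all columns pass"; the Reduce and apply versions above do the same job.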

Conditional Statement in R (indicator) based off matching values to another dataset

I have two datasets
dataset1 with column fruit, customer_num
dataset2 with column fruit2, customer_num
So let's say I do a left join of dataset1 to dataset2, using customer_num as the join key. Now I have a dataset with fruit and fruit2 as column variables.
How can I create an indicator that is 1 if fruit == fruit2 and 0 otherwise?
You could do it like this (my example):
# Example customer_num values, which I assumed to be numeric
fruit <- data.frame(customer_num = c(1, 2, 3, 4, 5, 6))
fruit2 <- data.frame(customer_num = c(1, 2, 3, 10, 11, 12))
# Vector in data frame
df <- data.frame(fruit, fruit2)
# And match values / Indicator
dat<-within(df,match <- ifelse (fruit == fruit2,1,0))
# Output
customer_num customer_num.1 customer_num
1 1 1 1
2 2 2 1
3 3 3 1
4 4 10 0
5 5 11 0
6 6 12 0
ifelse would be easiest, assuming both columns are in the same data frame (here called dataset1). An example using the dplyr package:
library(dplyr)
dataset1 %>%
  mutate(Match=ifelse(fruit==fruit2,1,0))
This creates a column called Match that is 1 where the values match and 0 where they do not.
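For completeness, here is a base R sketch of the whole flow (left join, then indicator). The fruit values are made up for illustration, and joined and match are hypothetical names. One thing to watch after a left join: customers with no row in dataset2 get NA in fruit2, and a plain ifelse(fruit == fruit2, 1, 0) would return NA for them, so the sketch treats those rows as non-matches explicitly.

```r
# Made-up example data with the column names from the question
dataset1 <- data.frame(customer_num = c(1, 2, 3),
                       fruit = c("apple", "pear", "kiwi"),
                       stringsAsFactors = FALSE)
dataset2 <- data.frame(customer_num = c(1, 2, 4),
                       fruit2 = c("apple", "plum", "kiwi"),
                       stringsAsFactors = FALSE)

# Left join: keep every row of dataset1 (all.x = TRUE)
joined <- merge(dataset1, dataset2, by = "customer_num", all.x = TRUE)

# 1 when both fruits are present and equal, 0 otherwise (including NA)
joined$match <- as.integer(!is.na(joined$fruit2) &
                           joined$fruit == joined$fruit2)
```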

Extend/expand data frame with column of lists each into a row

I have a data frame of the following type:
df <- data.frame("col1" = c(1,2,3,4))
df$col2 <- list(list(1,1,1),list(2,2,2),list(3,3,3),list(4,4,4))
df$col3 <- list(c(1,1,1),c(2,2,2),c(3,3,3),c(4,4,4))
df
And get:
col1 col2 col3
1 1 1, 1, 1 1, 1, 1
2 2 2, 2, 2 2, 2, 2
3 3 3, 3, 3 3, 3, 3
4 4 4, 4, 4 4, 4, 4
Now I would like to manipulate this data frame to get something like:
col1 col3
1 1 1
1 1
1 1
2 2 2
2 2
2 2
3 3 3
3 3
3 3
...
Now I can do this with a simple loop. For each row I convert the list into a data frame. I then use rbind to append the data frames into a single one.
My question is: how do I do this with a vectorized function?
I have tried apply, sapply, mapply and Reduce, but with no success. apply was the only one that actually executed, but it produced incorrect results (I got only the first element of each list).
We can remove the first column (df[-1]), loop over the other columns, unlist and then convert the list to data.frame
lst <- lapply(df[-1], unlist)
dfN <- data.frame(lst)
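Note that the snippet above drops col1. If you want col1 carried along and repeated for each list element, as in the desired output, one possible sketch is to repeat it with rep() using the length of each list entry. The name expanded is just for illustration, and I only use col3 here.

```r
df <- data.frame(col1 = c(1, 2, 3, 4))
df$col3 <- list(c(1, 1, 1), c(2, 2, 2), c(3, 3, 3), c(4, 4, 4))

# lengths() gives the number of elements in each list entry, so each
# col1 value is repeated once per element of its col3 entry
expanded <- data.frame(
  col1 = rep(df$col1, times = lengths(df$col3)),
  col3 = unlist(df$col3)
)
```

Because rep() uses the actual lengths, this should also work when the list entries have different sizes.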

comparing two files and outputting common elements

I have 2 files, each with 3 columns and hundreds of rows. I want to find the rows whose first two columns are common to both files. To that list of common pairs I then want to add the third column from the second file, i.e. the values that the second file holds for each pair found in both files.
For example, consider two files of 6 rows and 3 columns
First file -
1 2 3
2 3 4
4 6 7
3 8 9
11 10 5
19 6 14
second file -
1 4 1
2 1 4
4 6 10
3 7 2
11 10 3
19 6 5
As I said, I have to compare the first two columns and then add the third column of the second file to that list. Therefore, the output must be:
4 6 10
11 10 3
19 6 5
I have the following code, but it shows an "object not found" error, and I am also not able to add the third column. Please help :)
df2 holds the first file and df3 holds the second file. The code is in R.
s1 = 1
for(i in 1:nrow(df2)){
for(j in 1:nrow(df3)){
if(df2[i,1] == df3[j,1]){
if(df2[i,2] == df3[j,2]){
common.rows1[s1,1] <- df2[i,1]
common.rows1[s1,2] <- df2[i,2]
s1 = s1 + 1
}
}
}
}
You can use the %in% operator twice to subset your second data.frame (I call it df2):
df2[df2$V1 %in% df1$V1 & df2$V2 %in% df1$V2,]
# V1 V2 V3
#3 4 6 10
#5 11 10 3
#6 19 6 5
V1 and V2 in my example are the column names of df1 and df2.
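A caveat with the two separate %in% tests: each column is checked on its own, so a row can pass even when its (V1, V2) pair never occurs together in the other table. It happens to give the right rows for the example data, but the small made-up case below shows the difference. The data frames a and b and the names loose, strict, key_a and key_b are hypothetical.

```r
# Made-up data: the same values occur in both tables, but never as the same pair
a <- data.frame(V1 = c(1, 2), V2 = c(10, 20))
b <- data.frame(V1 = c(1, 2), V2 = c(20, 10), V3 = c(100, 200))

# Column-wise test: both rows pass although no pair is shared
loose <- b[b$V1 %in% a$V1 & b$V2 %in% a$V2, ]

# Pair-wise test: build one key per row and compare the keys
key_a <- paste(a$V1, a$V2)
key_b <- paste(b$V1, b$V2)
strict <- b[key_b %in% key_a, ]
```

Here loose keeps both rows of b while strict keeps none, so a combined key (or a merge on both columns) is the safer choice when pairs must match.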
It seems that this is the perfect use-case for merge, e.g.
merge(d1[c('V1','V2')],d2)
results in:
V1 V2 V3
1 11 10 3
2 19 6 5
3 4 6 10
In which 'V1' and 'V2' are the column names of interest.
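By default merge joins on the column names the two data frames share, which is why dropping V3 from the first table is enough. A sketch with the join columns spelled out, using the example values from the question (the name common is illustrative):

```r
d1 <- data.frame(V1 = c(1, 2, 4, 3, 11, 19),
                 V2 = c(2, 3, 6, 8, 10, 6),
                 V3 = c(3, 4, 7, 9, 5, 14))
d2 <- data.frame(V1 = c(1, 2, 4, 3, 11, 19),
                 V2 = c(4, 1, 6, 7, 10, 6),
                 V3 = c(1, 4, 10, 2, 3, 5))

# Keep only the key columns of d1 so that V3 comes from d2 alone,
# and name the join columns explicitly
common <- merge(d1[c("V1", "V2")], d2, by = c("V1", "V2"))
```

Naming the columns in by makes the intent explicit and guards against accidentally joining on an extra shared column.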
data.table proposal
library(data.table)
setDT(df1)
setDT(df2)
setkey(df1, V1, V2)
setkey(df2, V1, V2)
df2[df1[, -3, with = FALSE], nomatch = 0]
## V1 V2 V3
## 1: 4 6 10
## 2: 11 10 3
## 3: 19 6 5
If your two tables are d1 and d2,
d1<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(2, 3, 6, 8, 10, 6),
V3 = c(3, 4, 7, 9, 5, 14)
)
d2<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(4, 1, 6, 7, 10, 6),
V3 = c(1, 4, 10, 2, 3, 5)
)
then you can subset d2 (in order to keep the third column) with
d2[interaction(d2$V1, d2$V2) %in% interaction(d1$V1, d1$V2),]
The interaction() treats the first two columns as a combined key.
