Dynamically indexing a data frame by column name - r

I have a data frame and I want to extract the rows where particular columns have particular values. The column names are stored in a character vector and the values are stored in a list.
data <- data.frame(A=c("a","b","b"), B=c(1,2,2), C=c(3,3,4))
column_key <- c("A", "B")
value_key <- list("b", 2)
Obviously, I can extract the information I want by simple indexing if I hardcode the column names of the keys:
desired_rows <- data[data$A=="b" & data$B==2,]
desired_rows =
A B C
2 b 2 3
3 b 2 4
But how do I do this if the column names are stored in variables? Ideally, it would be something like this:
key <- value_key
names(key) <- column_key
desired_rows <- data[key,]
But I cannot index a data.frame with a list.

I found this trick just before posting the question.
I can compare a data frame to a list that has the same length as a row. This returns a logical matrix indicating which elements in each row match the corresponding element of the list. Because I want the rows that match entirely, I apply the all function across the rows to get a logical index into the rows of data.
desired_rows <- data[apply(data[column_key]==value_key, 1, all),]
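A more explicit variant of the same idea, offered only as a sketch: build one logical vector per key column with Map and combine them with Reduce. The helper name match_rows is hypothetical and not part of the original question.

```r
data <- data.frame(A = c("a", "b", "b"), B = c(1, 2, 2), C = c(3, 3, 4))
column_key <- c("A", "B")
value_key <- list("b", 2)

# Hypothetical helper: TRUE for rows where every key column equals its value.
match_rows <- function(df, cols, vals) {
  # One logical vector per (column name, value) pair
  per_column <- Map(function(col, val) df[[col]] == val, cols, vals)
  # A row matches only if it matches in every key column
  Reduce(`&`, per_column)
}

desired_rows <- data[match_rows(data, column_key, value_key), ]
print(desired_rows)
```

Because each column is compared on its own, this does not depend on how a whole data frame is compared to a list.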

Related

Can I create a tibble with 1 row and 11 columns in R from a data frame with 0 rows and 0 columns?

I have imported some Twitter data which gives me a list with tibbles for every user. Each tibble has 11 columns and various number of rows depending on how many lists a Twitter user has.
If a Twitter user has no lists, it is listed as a data frame with 0 rows and 0 columns (see [3] in the picture). I don't want to delete such entries but keep them as a user with no lists.
Hence, I'm thinking whether I can create a tibble with 11 columns and 1 row where each cell contains a "99".
How do I change a data frame within a list to a tibble?
Thanks a lot for your help!
You can try:
# get the index of data frames that have 0 columns
inds <- lengths(list_data_outlier) == 0
# get column names from the first data frame that is not empty
cols <- names(list_data_outlier[[which.max(!inds)]])
# create a one-row data frame filled with 99
empty_df <- data.frame(matrix(99, nrow = 1, ncol = length(cols),
                              dimnames = list(NULL, cols)))
# replace the data frames with 0 columns with empty_df
list_data_outlier[inds] <- replicate(sum(inds), empty_df, simplify = FALSE)
Thanks again, @Ronak Shah!
This worked well:
inds <- lengths(list_data_outlier) == 0
empty_df <- list_data_outlier[[which.max(!inds)]][1, ]
list_data_outlier[inds] <- replicate(sum(inds), empty_df, simplify = FALSE)
But instead of taking the first row, and hence having wrong data in the data frame, I used the 50th row:
empty_df <- list_data_outlier[[which.max(!inds)]][50, ]
The row number must exceed the number of rows in that tibble, so nrow + 1 is enough.
That way you get a tibble with 1 row and the same number and types of columns as the rest of the list, but filled with NAs instead of "wrong" data, which is what I needed to continue with my analysis.
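The out-of-range row trick can be sketched on a small made-up list. This assumes plain base data frames rather than tibbles, and the names list_data, template and na_row are hypothetical.

```r
# Toy list: the second element mimics a user with no lists (0 rows, 0 columns)
list_data <- list(
  data.frame(id = 1:2, name = c("x", "y"), stringsAsFactors = FALSE),
  data.frame(),
  data.frame(id = 3L, name = "z", stringsAsFactors = FALSE)
)

# TRUE for elements that have no columns
inds <- lengths(list_data) == 0

# First non-empty data frame serves as the column template
template <- list_data[[which.max(!inds)]]

# Indexing one row past the end yields a single row of NAs
# that keeps the column names and types of the template
na_row <- template[nrow(template) + 1L, , drop = FALSE]
rownames(na_row) <- NULL

list_data[inds] <- replicate(sum(inds), na_row, simplify = FALSE)
str(list_data[[2]])
```

Using nrow(template) + 1L instead of a fixed number such as 50 avoids having to know how many rows the template has.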

subsetting using column names as objects

I am trying to subset a data frame using a column name stored in an object. Is this possible? Here is an example:
ReallyLongColNameA <- c(1,2,3,4,5,6)
ReallyLongColNameB <- c(6,5,4,3,2,1)
ReallyLongColNameC <- c(7,8,9,10,11,12)
X <- data.frame(ReallyLongColNameA, ReallyLongColNameB, ReallyLongColNameC)
Can I store a column name like this:
ShortColNameB <- names(X[2])
and then subset using the column name stored in the object ShortColNameB?
I can subset as follows:
subX <- X[X$ReallyLongColNameB == 6,]
To get:
ReallyLongColNameA ReallyLongColNameB ReallyLongColNameC
1 6 7
But what if I wanted the following output, using the column name stored in an object (ShortColNameB)?
ReallyLongColNameA ReallyLongColNameB
1 6
You can easily remove the last column by subsetting on column numbers.
X[X[[ShortColNameB]]==6,c(1,2)]
You select the rows by filtering on X[[ShortColNameB]] == 6, and you select the columns by their positions (here the 1st and 2nd columns, A and B).
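If the columns should also be chosen by stored names rather than positions, something along these lines may work. ShortColNameA is a hypothetical second object introduced only for this sketch.

```r
X <- data.frame(
  ReallyLongColNameA = c(1, 2, 3, 4, 5, 6),
  ReallyLongColNameB = c(6, 5, 4, 3, 2, 1),
  ReallyLongColNameC = c(7, 8, 9, 10, 11, 12)
)

# Column names stored in objects (ShortColNameA is an assumption for this sketch)
ShortColNameA <- names(X)[1]
ShortColNameB <- names(X)[2]

# Rows: filter with [[ ]] on the stored name; columns: a character vector of names
subX <- X[X[[ShortColNameB]] == 6, c(ShortColNameA, ShortColNameB)]
print(subX)
```

Selecting by name keeps working if the column order of X changes, which positional indices such as c(1,2) do not.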

R select subset of data

I have a dataset with three columns.
## generate sample data
set.seed(1)
x <- sample(1:3, 50, replace = TRUE)
y <- sample(1:3, 50, replace = TRUE)
z <- sample(1:3, 50, replace = TRUE)
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only one column has 1 (could be any column)
Basically, I want any two columns (for the 2nd case) to fulfill the condition, not two specific columns.
I am aware of row selection using
subset <- data[data$x==1 & data$y==1 & data$z==1,]
But this only selects rows based on conditions for specific columns, whereas I want any of the three/two columns to fulfill my criteria.
Thanks
n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]
Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list, especially when there is some structure that guides this storage, as is the case here. Following @richard-scriven's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
  data[as.integer(names(rowCounts)[rowCounts == i]),])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
You can then pull out the data.frames using either their index, df.oneCountList[[1]] or their name df.oneCountList[["df.oneOne"]].
@eddi below suggests a nice shortcut to my method of pulling out the row counts, using tabulate and the arr.ind argument of which. When which is applied to a multidimensional object such as an array or a data.frame, setting arr.ind = TRUE produces the row and column indices where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the vector of row indices where a 1 is found in any variable. tabulate is then applied to these row indices and returns a vector of counts in which each element represents a row, with rows without a 1 filled in with 0. Passing nbins = nrow(data) keeps that vector as long as the data even when the last rows contain no 1.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1], nbins = nrow(data))
returns a vector from which you might immediately pull the values. You can include the above lapply to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
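As a small check on made-up data (not the sampled data from the question), the sketch below compares the rowSums count with the tabulate count. It also shows why nbins matters: without it, tabulate stops at the largest row index that contains a 1. The variable names are illustrative only.

```r
# Toy data: row 1 has three 1s, rows 2 and 3 have one, row 4 has two, row 5 has none
data <- data.frame(
  x = c(1, 1, 2, 1, 3),
  y = c(1, 2, 1, 3, 2),
  z = c(1, 3, 3, 1, 2)
)

# Count of 1s per row via rowSums
by_rowsums <- rowSums(data == 1)

# Same count via which(arr.ind = TRUE) + tabulate, padded to nrow(data)
row_idx <- which(data == 1, arr.ind = TRUE)[, 1]
by_tabulate <- tabulate(row_idx, nbins = nrow(data))

# Without nbins the result is shorter, because row 5 contains no 1
short <- tabulate(row_idx)

# One data frame per count, keyed by the number of 1s in the row
groups <- split(data, by_rowsums)
print(names(groups))
```

The split call is just one way to collect the groups in a list; the lapply approach above gives the same data frames for counts 1 to 3.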

Compare data frame with vector and create new variable for matched value

I have a data frame with 600 rows which has a character variable (ids) that contains numbers separated by comma.
name ids
x 8,5,23,56,78,44,54
y 5,7,23,44
z 8,44,2
I wanted to compare the above values with three different vectors which contains numeric values.
a=c(5,7,9,3)
b=c(8,23,78,66,4)
c=c(44,54,2,90)
I need to create three new columns in the data frame, one per vector, containing the values from ids that match that vector, but only when more than one value matches:
name ids a b c
x 8,5,23,56,78,44,54 NA 8,23,78 44,54
y 5,7,23,44 5,7 NA NA
z 8,44,2 NA NA 44,2
I really have no idea how to compare these, since they are of different types, or how to get the matching values separated like above once I have compared them.
We can place the vectors in a list and loop over them. For each vector, we split the 'ids' column of the data.frame on ',' into a list, keep the elements of the vector that are found %in% each split entry, return NA when fewer than two elements match, and otherwise paste the matches together (i.e. toString). The output is assigned to new columns in 'df1'.
df1[letters[1:3]] <- lapply(list(a, b, c), function(x)
sapply(strsplit(df1$ids, ","), function(y) {
x1 <- x[x %in% as.numeric(y) ]
if(length(x1)>1) toString(x1) else NA
}))
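For a version that can be run end to end, here is a sketch that first builds df1 from the question's table. It assumes ids is a character column, and it keeps the three vectors in a named list (vecs, a hypothetical name) so that no object called c is created.

```r
df1 <- data.frame(
  name = c("x", "y", "z"),
  ids = c("8,5,23,56,78,44,54", "5,7,23,44", "8,44,2"),
  stringsAsFactors = FALSE
)
vecs <- list(a = c(5, 7, 9, 3), b = c(8, 23, 78, 66, 4), c = c(44, 54, 2, 90))

# Split each ids string once and convert the pieces to numbers
id_list <- lapply(strsplit(df1$ids, ","), as.numeric)

# One new column per vector: matches pasted together, NA unless more than one match
df1[names(vecs)] <- lapply(vecs, function(x) {
  vapply(id_list, function(y) {
    x1 <- x[x %in% y]
    if (length(x1) > 1) toString(x1) else NA_character_
  }, character(1))
})
print(df1)
```

Note that toString separates values with a comma and a space, so the new columns read "8, 23, 78" rather than "8,23,78".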

Getting a subset in R

I have a data frame with 14 columns, and I want to subset it, keeping the same columns but only the rows whose ID is repeated (for example, if ID = 2 occurs more than once, I want to keep those rows).
To begin, I applied table to my data frame to see the frequencies of the ID:
head(sort(table(call.dat$IMSI), decreasing = TRUE), 100)
In my case, 20801170106338 occurs twice, so I want to see the two observations for this ID.
Afterward, I tried x <- subset(call.dat, IMSI == "20801170106338") and hsb6 <- call.dat[call.dat$IMSI == "20801170106338", ], but the results are wrong: x has 0 observations of 14 variables, and hsb6 contains only NA.
Can you help me? Thanks.
PS: IMSI is a numeric value.
By contrast, x <- subset(call.dat, Handset.Manufacturer == "LG") works perfectly.
You can use duplicated, a function that returns a logical vector that is TRUE for each record whose value has already appeared earlier.
isDuplicated <- duplicated(call.dat$IMSI)
Then you can extract all the rows flagged as duplicates.
call.dat.duplicated <- call.dat[isDuplicated, ]
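Since duplicated flags only the second and later occurrences, the first observation of each repeated ID is left out. If both observations are wanted, one possible sketch combines it with a second pass using fromLast = TRUE. The data below are made up for illustration.

```r
# Toy stand-in for call.dat (values are invented)
call.dat <- data.frame(
  IMSI = c(101, 102, 101, 103, 102, 104),
  Handset.Manufacturer = c("LG", "Nokia", "LG", "Sony", "Nokia", "LG")
)

# Only the second and later occurrences of each repeated IMSI
only_repeats <- call.dat[duplicated(call.dat$IMSI), ]

# Every row whose IMSI occurs more than once, first occurrence included
is_repeated <- duplicated(call.dat$IMSI) | duplicated(call.dat$IMSI, fromLast = TRUE)
all_occurrences <- call.dat[is_repeated, ]
print(all_occurrences)
```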
