How to print a column when another column meets a condition - r

Example Data
a <- c(1,2,2,3)
b <- c(1,2,3,4)
dat <- data.frame(a,b)
I would like to print the values of column 2 for the rows where column 1 is >= 2.
which(dat[,1]>=2)
This only shows which rows of column 1 are >= 2, not the values of column 2.
I expect it will show:
[1] 2 3 4
Sorry for my bad English and hope you can understand it.

If we need the corresponding values in 2nd column, use the [
dat[,2][dat[,1]>=2]
#[1] 2 3 4
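As a minimal sketch of the same idea using column names rather than positions (the name `keep` is only illustrative), the condition can be stored as a logical vector and reused as a row index:

```r
a <- c(1, 2, 2, 3)
b <- c(1, 2, 3, 4)
dat <- data.frame(a, b)

# Logical vector: TRUE for the rows where column a is >= 2
keep <- dat$a >= 2

# Values of column b in those rows
dat$b[keep]

# Equivalent, using the row indices returned by which()
dat$b[which(dat$a >= 2)]
```

The logical index and the `which()` index select the same rows here; they differ only when the condition contains NA values, which `which()` drops.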

Related

Function to recode multiple variables conditional on other variables

I have a dataset with multiple variables. Each question has the actual survey answer and three other characteristics, so there are four variables for each question. I want to specify that if Q135_L == 1, Q135_RT is left as it is; otherwise it is coded as NA. I can do that with an ifelse statement:
df$Q135_RT <- ifelse(df$Q135_L == 1, df$Q135_RT, NA)
However, I have hundreds of variables and the names are not related. For example, in the picture we can see Q135, SG1_1 and so on. How can I specify, for the whole dataset, that if a variable ending in _L equals 1, the matching variable ending in _RT should remain as it is, and otherwise the variable ending in _RT should be coded as NA?
I tried this, but it only returns NAs:
ifelse(grepl("//b_L" ==1, df), "//b_RT" , NA)
If I understand your problem correctly, you have a data frame in which the columns represent survey question variables. Each column name contains two identifiers: a survey question number (134, 135, etc.) and a variable suffix (L, RT, etc.). Because you provided no reproducible example, I made a simplified version of your data frame:
set.seed(5)
DF <- data.frame(array(sample(1:4, 24, replace = TRUE), c(4,6)))
colnames(DF) <- c("Q134_L","Q135_L", "Q134_RT", "Q135_RT", "Q_L1", "Q134_S")
DF
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 2 3 1 1
# 2 3 1 3 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 3 3 2 1
What you want is that if Q135_L == 1, leave Q135_RT as it is, otherwise code it as NA. Here is a function that implements this recoding logic:
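recode <- function(yourdf, questnums) {
  charnum <- as.character(questnums)
  for (k in seq_along(charnum)) {
    # Columns for question k whose names end in _L and _RT, respectively
    col_end_L_k <- yourdf[grepl("_L\\b", colnames(yourdf)) &
                            grepl(charnum[k], colnames(yourdf))]
    col_end_R_k <- yourdf[grepl("_RT\\b", colnames(yourdf)) &
                            grepl(charnum[k], colnames(yourdf))]
    # Rows where the _L column equals 1
    row_is_1 <- which(col_end_L_k == 1)
    # Set every other row to NA; setdiff() also works when no row equals 1,
    # whereas a negative index built from an empty vector would select nothing
    col_end_R_k[setdiff(seq_len(nrow(yourdf)), row_is_1), ] <- NA
    yourdf[, colnames(col_end_R_k)] <- col_end_R_k
  }
  return(yourdf)
}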
This function takes a data frame and a vector of question numbers, and returns the recoded data frame.
What this function does:
Loops over each question number using for.
Uses grepl to find the column whose name contains the selected number and ends in _L.
Does the same for the column whose name ends in _RT.
Uses which to find the rows of the _L column that equal 1.
Keeps the values of the _RT column with the same question number in those rows, and changes the values in all other rows to NA.
The result:
recode(DF, 134:135)
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 NA NA 1 1
# 2 3 1 NA 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 NA 3 2 1
Note that the Q_L1 column is not affected, because _L is not at the end of that column name.
As for how to define questnums, the question numbers, you just need to create a numeric vector. Examples:
Your questnums are 1 to 200. Then use 1:200 or seq(200), so recode(DF, 1:200).
Your questnums are 1, 3, 134, 135. Then, use recode(DF, c(1, 3, 134, 135)).
You can also assign the question numbers to an object first, such as n = c(25, 135, 145), and then use it: recode(DF, n)
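Because grepl() matches substrings, a short question number such as 13 would also match Q134 and Q135, and names like SG1_1 contain no usable number at all. A possible alternative, sketched here under the assumption that every `<stem>_L` column has a partner named exactly `<stem>_RT`, is to pair the columns by name instead. The helper `recode_by_suffix` and the `toy` data are hypothetical and only for illustration:

```r
# Hypothetical helper: pairs each "<stem>_L" column with "<stem>_RT" by name
recode_by_suffix <- function(d) {
  l_cols <- grep("_L$", names(d), value = TRUE)
  for (l_col in l_cols) {
    rt_col <- sub("_L$", "_RT", l_col)
    if (rt_col %in% names(d)) {
      # %in% returns FALSE for NA, so rows with a missing _L value are also set to NA
      keep <- d[[l_col]] %in% 1
      d[[rt_col]][!keep] <- NA
    }
  }
  d
}

# Made-up data for illustration only
toy <- data.frame(Q135_L = c(1, 2, 1), Q135_RT = c(10, 20, 30),
                  SG1_1_L = c(2, 1, NA), SG1_1_RT = c(5, 6, 7),
                  Q_L1 = c(1, 1, 1))
res <- recode_by_suffix(toy)
res
```

In this toy example Q135_RT keeps rows 1 and 3, SG1_1_RT keeps only row 2, and Q_L1 is left alone because its name does not end in _L.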

Filtering/subsetting R dataframe based on each rows n'th position value

I have a 'df' with 2 columns:
Combinations <- c(0011111111, 0011113111, 0013113112, 0022223114)
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)
I am trying to find a way to subset or filter the data frame so that only rows where the 7th, 8th, and 9th digits of the 'Combinations' column equal 311 are kept. For the example given, I would expect the Combinations 0011113111, 0013113112, 0022223114.
There are also instances where I would need to find different combinations, in different nth positions.
I know substring() can find these values for single rows but I'm not sure how to apply it to an entire dataframe.
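substring() is vectorized, so it works on a whole column as well. Note that the Combinations column must be stored as character (see the data below); otherwise the leading zeros are lost.
subset(df, substring(Combinations, 7, 9) == "311")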
# Combinations Values
#2 0011113111 2
#3 0013113112 3
#4 0022223114 4
data
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
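Since the question also mentions other patterns at other positions, the substring() approach can be wrapped in a small function. This is only a sketch; the helper name `filter_at` and its arguments are hypothetical, and it assumes the column holds character strings of digits:

```r
# Hypothetical helper: keep rows whose `col` contains `pattern`
# starting at character position `start`
filter_at <- function(d, col, start, pattern) {
  stop_pos <- start + nchar(pattern) - 1
  d[substring(as.character(d[[col]]), start, stop_pos) == pattern, , drop = FALSE]
}

Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1, 2, 3, 4)
df <- data.frame(Combinations, Values)

filter_at(df, "Combinations", 7, "311")  # digits 7 to 9 equal 311
filter_at(df, "Combinations", 3, "22")   # digits 3 to 4 equal 22
```

The first call keeps the rows with Values 2, 3 and 4; the second keeps only the row with Values 4.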
Another base R idea:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
df[grep(pattern = "^[0-9]{6}311.$", df$Combinations), ]
Output:
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
As a tip, if you want to know more about regular expressions, this website helps me a lot: https://regexr.com/3elkd
Would this work?
library(dplyr)
library(stringr)
df %>% filter(str_sub(Combinations, 7,9) == 311)
Combinations Values
1 0011113111 2
2 0013113112 3
3 0022223114 4
Not pretty but works (Combinations must be a character column):
df[sapply(strsplit(df$Combinations, ""), function(x) x[7] == "3" & x[8] == "1" & x[9] == "1"), ]
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
Data:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)

Count unique string patterns in a row

I have the following example:
dat <- read.table(text="index string
1 'I have first and second'
2 'I have first, first'
3 'I have second and first and thirdeen'", header=TRUE)
library(stringi)
toMatch <- c('first', 'second', 'third')
dat$count <- stri_count_regex(dat$string, paste0('\\b',toMatch,'\\b', collapse="|"))
dat
index string count
1 1 I have first and second 2
2 2 I have first, first 2
3 3 I have second and first and thirdeen 2
I want to add a count column to the data frame that tells me how many UNIQUE words from toMatch each row contains. The desired output in this case would be
index string count
1 1 I have first and second 2
2 2 I have first, first 1
3 3 I have second and first and thirdeen 2
Could you please give me a hint on how to modify the original formula? Thank you very much.
With base R you could do the following:
sapply(dat$string, function(x)
{sum(sapply(toMatch, function(y) {grepl(paste0('\\b', y, '\\b'), x)}))})
which returns
[1] 2 1 2
Hope this helps!
We can use stri_match_all instead, which gives us the exact matches, and then count the distinct values using n_distinct from dplyr or length(unique(x)) from base R.
library(stringi)
library(dplyr)
sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
collapse="|")), n_distinct)
#[1] 2 1 2
Or similarly, replacing n_distinct with base R's length(unique(x)):
sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
collapse="|")), function(x) length(unique(x)))
#[1] 2 1 2
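For completeness, here is a sketch that uses only base R and writes the result straight into the count column, as the question asks. It assumes the string column is character; the names `pattern` and `hits` are illustrative:

```r
dat <- data.frame(
  index = 1:3,
  string = c("I have first and second",
             "I have first, first",
             "I have second and first and thirdeen"),
  stringsAsFactors = FALSE
)
toMatch <- c("first", "second", "third")

# One alternation pattern with word boundaries around each word
pattern <- paste0("\\b", toMatch, "\\b", collapse = "|")

# Extract all matches per row, then count the distinct ones
hits <- regmatches(dat$string, gregexpr(pattern, dat$string, perl = TRUE))
dat$count <- vapply(hits, function(x) length(unique(x)), integer(1))
dat$count
```

A row with no match at all yields an empty character vector here, so its count is 0.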

Find string in data.frame

How do I search for a string in a data.frame? As a minimal example, how do I find the locations (columns and rows) of 'horse' in this data.frame?
> df = data.frame(animal=c('goat','horse','horse','two', 'five'), level=c('five','one','three',30,'horse'), length=c(10, 20, 30, 'horse', 'eight'))
> df
animal level length
1 goat five 10
2 horse one 20
3 horse three 30
4 two 30 horse
5 five horse eight
... so row 4 and 5 have the wrong order. Any output that would allow me to identify that 'horse' has shifted to the level column in row 5 and to the length column in row 4 is good. Maybe:
> magic_function(df, 'horse')
col row
'animal', 2
'animal', 3
'length', 4
'level', 5
Here's what I want to use this for: I have a very large data frame (around 60 columns, 20.000 rows) in which some columns are messed up for some rows. It's too large to eyeball in order to identify the different ways that order can be wrong, so searching would be nice. I will use this info to move data to the correct columns for these rows.
What about:
which(df == "horse", arr.ind = TRUE)
# row col
# [1,] 2 1
# [2,] 3 1
# [3,] 5 2
# [4,] 4 3
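To get output closer to the col/row listing sketched in the question, the matrix returned by which(..., arr.ind = TRUE) can be turned into a data frame with column names. The function name `find_value` is hypothetical; this is a sketch that assumes all columns are character, as in the example:

```r
df <- data.frame(animal = c("goat", "horse", "horse", "two", "five"),
                 level  = c("five", "one", "three", "30", "horse"),
                 length = c("10", "20", "30", "horse", "eight"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: report the column name and row number of every match
find_value <- function(d, value) {
  hits <- which(d == value, arr.ind = TRUE)
  data.frame(col = colnames(d)[hits[, "col"]],
             row = unname(hits[, "row"]),
             stringsAsFactors = FALSE)
}

res <- find_value(df, "horse")
res
```

The matches come back in column order, so the two animal hits (rows 2 and 3) are listed first, followed by level (row 5) and length (row 4).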
Another approach:
l <- sapply(colnames(df), function(x) grep("horse", df[,x]))
l
$animal
[1] 2 3
$level
[1] 5
$length
[1] 4
If you want the output to be matrix:
sapply(l,'[',1:max(lengths(l)))
animal level length
[1,] 2 5 4
[2,] 3 NA NA
We can get the linear indices where the value is equal to horse. which() counts down the columns (column-major order), so integer-dividing the zero-based index by the number of rows (nrow) gives the column index, and the remainder gives the row index.
We use colnames to get column names instead of indices.
idx <- which(df == "horse")
data.frame(col = colnames(df)[(idx - 1) %/% nrow(df) + 1],
           row = (idx - 1) %% nrow(df) + 1)
# col row
#1 animal 2
#2 animal 3
#3 level 5
#4 length 4
Another way to do it is the following. It needs only base R, so no packages have to be loaded.
Assume the data frame is named tbl_account.
First, transpose it:
temp = t(tbl_account)
Then put it into a list:
temp = list(temp)
Transposing converts the data frame into a character matrix, which lets you search every cell of the data frame in one go.
Then do the searching:
temp[[1]][grep("horse", temp[[1]])] # returns the matching values
grep("horse", temp[[1]]) # returns the positions of the matches within the transposed matrix
hope this helps :)

Add index numbers when converting sorted table to dataframe

I have a vector of strings that I'm trying to convert into a data frame with a frequency column. So far so good, but when I call dim() on my data frame, I get only one column instead of two. I guess R is using the words as the index values.
Anyway, here is how it starts. My vector:
a<-c("welcoming", "whatsyourexcuse", "whiteway", "zero", "yay", "whatsyourexcuse", "yay")
Then, I tried to sort the frequency values in decreasing order and store as data frame using:
df <- as.data.frame(sort(table(a), decreasing=TRUE))
The problem is that dim(df) returns [1] 5 1 instead of [1] 5 2. Here is what df looks like:
sort(table(a), decreasing = TRUE)
whatsyourexcuse 2
yay 2
welcoming 1
whiteway 1
zero 1
instead of:
a Freq
[1] whatsyourexcuse 2
[2] yay 2
[3] welcoming 1
[4] whiteway 1
[5] zero 1
Any pointers please? Thanks.
Try:
library(plyr)
a1 <- count(a)
a1[order(-a1$freq),]
# x freq
# 2 whatsyourexcuse 2
# 4 yay 2
# 1 welcoming 1
# 3 whiteway 1
# 5 zero 1
dim(a1)
#[1] 5 2
Or
a2 <- stack(sort(table(a),decreasing=TRUE))[,2:1]
dim(a2)
#[1] 5 2
When you convert to a data.frame using as.data.frame(sort(table(a), decreasing=TRUE)), the names of the elements become the row names of the data frame, so you get only one column instead of two. After sort, the result is no longer a table object. For comparison, check str(table(a)) and str(sort(table(a), decreasing=TRUE)).
You can also create the data.frame by
tbl <- sort(table(a), decreasing=TRUE)
data.frame(col1= names(tbl), Values= as.vector(tbl))
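Another option, sketched here with the illustrative name `freq`, is to convert the table before sorting. as.data.frame() on a table object produces one column for the values and one for the counts, and resetting the row names gives the index numbers 1 to 5:

```r
a <- c("welcoming", "whatsyourexcuse", "whiteway", "zero", "yay",
       "whatsyourexcuse", "yay")

# Convert the unsorted table first: this yields the columns a and Freq
freq <- as.data.frame(table(a), stringsAsFactors = FALSE)

# Sort by decreasing frequency, breaking ties alphabetically
freq <- freq[order(-freq$Freq, freq$a), ]

# Reset the row names so they run from 1 to nrow(freq)
rownames(freq) <- NULL

freq
dim(freq)
```

Sorting the data frame rather than the table avoids depending on how sort() treats table objects.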
