Subsetting rows of a data frame by character patterns (grepl) in a for loop [duplicate] - r

This question already has answers here:
Subset rows in a data frame based on a vector of values
(4 answers)
Closed 2 years ago.
I am attempting to subset a data frame by removing rows that contain certain character patterns, which are stored in a vector. My issue is that only the last pattern of the vector is removed from my data frame. How can I make my loop work iteratively, so that all patterns stored in the vector are removed from my data frame?
Mock input:
df<-data.frame(organism=c("human_longname","cat_longname","bird_longname","virus_longname","bat_longname","pangolian_longname"),size=c(6,4,2,1,3,5))
df
organism size
1 human_longname 6
2 cat_longname 4
3 bird_longname 2
4 virus_longname 1
5 bat_longname 3
6 pangolian_longname 5
Code used and output:
vectors<-c("bat","virus","pangolian")
for(i in vectors){df_1<-df[!grepl(i,df$organism),]}
df_1
organism size
1 human_longname 6
2 cat_longname 4
3 bird_longname 2
4 virus_longname 1
5 bat_longname 3
Expected output
df_1
organism size
1 human_longname 6
2 cat_longname 4
3 bird_longname 2

You can try this:
df[!df$organism %in% c("bat","virus","pangolian"),]
organism size
1 human 6
2 cat 4
3 bird 2
Update: Based on the new data, here is an approach using grepl(). Collapsing the patterns into a single regular expression with paste0() avoids the loop:
#Vectors
vectors<-c("bat","virus","pangolian")
#Format
vectors2 <- paste0(vectors,collapse = '|')
#Avoid loop
df[!grepl(pattern = vectors2,df$organism),]
organism size
1 human_longname 6
2 cat_longname 4
3 bird_longname 2
Also, just out of curiosity, here is a (probably not optimal) loop that does the same task by building an index and then creating a new data frame:
#Create index
index <- c()
#Loop
for(i in seq_len(nrow(df)))
{
  # keep the row index when none of the patterns match
  if(!grepl(vectors2,df$organism[i]))
  {
    index <- c(index,i)
  }
}
# subset once, after the loop has finished
ndf <- df[index,]
ndf
organism size
1 human_longname 6
2 cat_longname 4
3 bird_longname 2
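For completeness, the loop in the question fails because every iteration filters the original df and overwrites df_1, so only the filter for the last pattern survives. A minimal sketch of a corrected loop, reusing the mock data from the question, filters df_1 itself on each pass:

```r
df <- data.frame(
  organism = c("human_longname", "cat_longname", "bird_longname",
               "virus_longname", "bat_longname", "pangolian_longname"),
  size = c(6, 4, 2, 1, 3, 5)
)
vectors <- c("bat", "virus", "pangolian")

# Start from a copy and keep filtering that copy, so each pass
# builds on the previous one instead of restarting from df
df_1 <- df
for (i in vectors) {
  df_1 <- df_1[!grepl(i, df_1$organism), ]
}
df_1
```

The single grepl() call with the collapsed pattern is still the shorter option; the loop is shown only to make the cause of the original problem explicit.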

Related

Repeating rows in data frame by using the content of a column in R [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
I want to create a data frame by repeating rows by using content of a column in a data frame. Below is the source data frame.
data.frame(c("a","b","c"), c(4,5,6), c(2,2,3)) -> df
colnames(df) <- c("sample", "measurement", "repeat")
df
sample measurement repeat
1 a 4 2
2 b 5 2
3 c 6 3
I want to repeat the rows by using the "repeat" column and its content to get a data frame like the one below. Ideally, I would like to have a function to do this.
sample measurement repeat
1 a 4 2
2 a 4 2
3 b 5 2
4 b 5 2
5 c 6 3
6 c 6 3
7 c 6 3
Thanks in advance!
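Solved. df[rep(rownames(df), df[["repeat"]]), ] did the job. (Because repeat is a reserved word in R, the column has to be accessed as df[["repeat"]] or df$`repeat`; a bare df$repeat is a syntax error.)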
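Since the question asks for a function, the one-liner can be wrapped as below. This is only a sketch: repeat_rows and its argument names are made up for illustration, and it indexes by row position rather than by row name.

```r
df <- data.frame(c("a", "b", "c"), c(4, 5, 6), c(2, 2, 3))
colnames(df) <- c("sample", "measurement", "repeat")

# Hypothetical helper: repeat each row as many times as the value
# stored in the column named by times_col
repeat_rows <- function(data, times_col) {
  idx <- rep(seq_len(nrow(data)), times = data[[times_col]])
  out <- data[idx, , drop = FALSE]
  rownames(out) <- NULL  # renumber rows 1..n
  out
}

expanded <- repeat_rows(df, "repeat")
expanded
```

Passing the column name as a string sidesteps the reserved-word problem with repeat.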

Compare lists in data frames based on personal code, shorten one list if longer

I have two separate dataframes each for one speaker of an interacting dyad. They have different amounts of talk-turns (rows) which is why I keep them in separate files for now.
In order to run my final analyses I need identical number of rows for each speaker.
So what I want to do is compare dyad_id 1 in both data frames and then shorten the longer one by deleting its trailing rows (across all columns) until both have the same length.
I prepared a data frame to illustrate what I already have.
So far, I tried to split the data frame by the dyad_id in both data sets to now compare the splits one after another and delete the unnecessary rows. As I have various conversations, I need to automate this to go through all dyad_ids one after another.
I hope someone can help me, I am completely lost.
dyad_id_A <- c(1,1,1,2,2,2,2,3,3,3,3,3)
fw_quantiles_a <- c(4,3,1,2,3,2,4,1,4,5,6,7)
df_A<- data.frame(dyad_id_A,fw_quantiles_a)
dyad_id_B <- c(1,1,1,1,2,2,2,3,3,3,3)
fw_quantiles_b <- c(3,1,2,1,2,4,1,3,3,4,5)
df_B <- data.frame(dyad_id_B,fw_quantiles_b)
example for final dataset
dyad_id_AB <- c(1,1,1,2,2,2,3,3,3,3)
What I tried so far:
split_conv_A = split(df_A, list(df_A$dyad_id_A))
split_conv_B = split(df_B, list(df_B$dyad_id_B))
Add a time counter within each dyad_id_x group and then merge together:
df_A$time <- ave(df_A$dyad_id_A, df_A$dyad_id_A, FUN=seq_along)
df_B$time <- ave(df_B$dyad_id_B, df_B$dyad_id_B, FUN=seq_along)
merge(
df_A, df_B,
by.x=c("dyad_id_A","time"), by.y=c("dyad_id_B","time")
)
# dyad_id_A time fw_quantiles_a fw_quantiles_b
#1 1 1 4 3
#2 1 2 3 1
#3 1 3 1 2
#4 2 1 2 2
#5 2 2 3 4
#6 2 3 2 1
#7 3 1 1 3
#8 3 2 4 3
#9 3 3 5 4
#10 3 4 6 5
Maybe we can use table to calculate the frequencies of the ids in both data frames, assuming the same ids occur in both. Take the element-wise minimum of the two counts with pmin and repeat the names based on that frequency.
tab <- pmin(table(df_A$dyad_id_A), table(df_B$dyad_id_B))
as.integer(rep(names(tab), tab))
# [1] 1 1 1 2 2 2 3 3 3 3
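If the two data frames should stay separate, the same minimum counts can be used to trim each one. The sketch below is one possible way to do that; trim_to and n_keep are illustrative names, and it assumes rows within a dyad are already in turn order so that dropping the trailing rows is what is wanted.

```r
df_A <- data.frame(dyad_id_A = c(1,1,1,2,2,2,2,3,3,3,3,3),
                   fw_quantiles_a = c(4,3,1,2,3,2,4,1,4,5,6,7))
df_B <- data.frame(dyad_id_B = c(1,1,1,1,2,2,2,3,3,3,3),
                   fw_quantiles_b = c(3,1,2,1,2,4,1,3,3,4,5))

# Rows to keep per dyad: the smaller of the two counts
n_A <- table(df_A$dyad_id_A)
n_B <- table(df_B$dyad_id_B)
ids <- intersect(names(n_A), names(n_B))
n_keep <- pmin(as.vector(n_A[ids]), as.vector(n_B[ids]))
names(n_keep) <- ids

# Hypothetical helper: keep only the first n_keep rows of each dyad
trim_to <- function(data, id_col, n_keep) {
  id <- as.character(data[[id_col]])
  pos <- ave(seq_along(id), id, FUN = seq_along)  # position within dyad
  data[id %in% names(n_keep) & pos <= n_keep[id], , drop = FALSE]
}

trimmed_A <- trim_to(df_A, "dyad_id_A", n_keep)
trimmed_B <- trim_to(df_B, "dyad_id_B", n_keep)
```

Both trimmed data frames then have the same number of rows per dyad_id, and dyads that appear in only one of the two are dropped.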

Creating a new variable in a data frame and changing its values in one step [duplicate]

This question already has answers here:
Convert continuous numeric values to discrete categories defined by intervals
(2 answers)
Closed 5 years ago.
I have a column which is part of a data frame, df. It is full of integers. Let's say it is the number of houses sold in a day by a realty company. Let's call it df$houses. I want to make a second column called df$quant where the number of houses is categorized, with 0 being 0-2 houses sold in a day, 1 being 3-5 houses, 2 being 6-9 houses and 3 being 10 or more houses. I could do this in two steps.
1) Create the new column df$quant from df$houses:
df$quant <- df$houses
2) Change the values of df$quant:
df$quant[which(df$quant <= 2)] <- 0
etc.
I would like to do this in one step though, making the new variable and filling it with the proper values. Mostly, so I don't have to worry about getting the order of the lines of code in the second step right. It would be more robust.
Could this be done with an if statement?
Thanks a lot.
I would do something like this: (using cut)
x <- 1:11
df <- data.frame(x)
# Inf as the last break keeps the breaks valid even when max(x) is 9 or less
myFunction <- function(x) as.integer(cut(x, c(-1, 2, 5, 9, Inf))) - 1
df$new <- myFunction(df$x)
df
x new
1 1 0
2 2 0
3 3 1
4 4 1
5 5 1
6 6 2
7 7 2
8 8 2
9 9 2
10 10 3
11 11 3
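A related base R option is findInterval(), which returns the zero-based bin index directly, so no subtraction is needed. This is a sketch using the thresholds from the question; treating exactly 10 houses as category 3 is an assumption, since the question does not say where 10 belongs.

```r
houses <- c(0, 2, 3, 5, 6, 9, 10, 15)
df <- data.frame(houses)

# Lower bounds of categories 1, 2 and 3; values below 3 fall in category 0
df$quant <- findInterval(df$houses, c(3, 6, 10))
df
```

Here the new column is created and filled in a single assignment, which is what the question asks for.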

Import multiple data frames CSV - column separation

I have a csv file with multiple data frames that are all separated by a column (So 4 columns of data, empty column, 4 columns of data, etc.). Is there a nice way to read in the file and have R create a separate df for each of those contiguous sets of columns? Then I would be able to use lapply across all of these dfs.
Thanks for your help.
Read in the whole csv file, then use lapply to separately capture each four-column data frame into a list. Then use rbind to stack all the data frames into a single data frame.
dat = read.csv("YourFile.csv")
# Number of separate data frames in the csv file (per @zx8754's comment).
# Each block is 4 data columns plus 1 empty separator; ceiling() covers
# the case where the last block has no trailing separator column
num.df = ceiling(ncol(dat)/5)
# This will tell the function the column numbers where
# each data frame starts
start.cols = seq(1, 1 + 5*(num.df-1), 5)
df.list = lapply(start.cols, function(x) {
  # Capture the next 4 columns
  df = dat[, x:(x+3)]
  # Use whatever names are appropriate here. This is just
  # to make sure all of the data frames have the same column names
  # so that rbind won't throw an error
  names(df) = paste0("col", 1:4)
  return(df)
})
# rbind all the data frames into a single data frame
df = do.call(rbind, df.list)
You can take advantage of colClasses:
Example data:
h1 h2 h3 h1.1 h2.1 h3.1 h1.2 h2.2 h3.2
1 1 6 3 1 8 8 1 5 2
2 2 1 1 6 5 8 1 3 1
3 3 2 6 1 2 3 1 2 5
Then you can loop through the number of data frames you want and read the file:
ngroups <- 3 #number of dataframes to read
datacols <- 3 #number of columns to read
fulldata <- list()
for (i in 1:ngroups) {
nskip <- (datacols+1)*(i-1)
cols.to.read <- c(rep("NULL", nskip), rep(NA, datacols), rep("NULL", (datacols+1)*(ngroups-i+1)-1)) #creates a list of NULLs and NAs. NULLs = don't read, NA = read
fulldata[[i]] <- read.csv("test.csv", colClasses=cols.to.read)
}
Result:
fulldata
[[1]]
h1 h2 h3
1 1 6 3
2 2 1 1
3 3 2 6
[[2]]
h1.1 h2.1 h3.1
1 1 8 8
2 6 5 8
3 1 2 3
[[3]]
h1.2 h2.2 h3.2
1 1 5 2
2 1 3 1
3 1 2 5
This works, but I believe the answers reading the file only once would be faster, since reading the same file over and over again doesn't sound like the optimal procedure.
First read all your data into one large data frame:
maindf <- read.csv(yourfile)
Let's say n is the number of data frames inside your csv file:
for (i in 0:(n-1)){
  # each block is 4 data columns followed by 1 empty separator column
  assign(paste0("df",i+1),maindf[,(1+5*i):(4+5*i)])
}
The result should be n data frames that can be accessed like this: df1, df2,...dfn.
I didn't test it, because no sample data was provided.
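Since no sample data was provided, here is a small self-contained sketch of the read-once approach on made-up data. It finds the separator columns instead of assuming a fixed block width; it assumes the separators are completely empty and that no real data column is entirely NA.

```r
# Made-up csv: two blocks of two columns, separated by an empty column
csv_text <- "a1,a2,,b1,b2\n1,2,,3,4\n5,6,,7,8"
dat <- read.csv(text = csv_text, header = TRUE)

# A separator column contains nothing but NA
is_sep <- vapply(dat, function(col) all(is.na(col)), logical(1))

# Block number of every column: increases by one after each separator
block <- cumsum(is_sep) + 1

# One data frame per block, separators dropped
blocks <- lapply(split(which(!is_sep), block[!is_sep]),
                 function(cols) dat[, cols, drop = FALSE])
blocks
```

Keeping the pieces in a list, rather than creating df1, df2, ... with assign(), makes it straightforward to lapply over them afterwards.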

Using loop variables

I would like to rename a large number of columns (column headers) to have numerical names rather than combined letter+number names. Because of the way the data is stored in raw format, I cannot just access the correct column numbers by using data[[152]] if I want to interact with a specific column of data (because random questions are filtered completely out of the data due to being long answer comments), but I'd like to be able to access them by data$152. Additionally, approximately half the columns names in my data have loaded with class(data$152) = NULL but class(data[[152]]) = integer (and if I rename the data[[152]] file it appropriately allows me to see class(data$152) as integer).
Thus, is there a way to use the loop iteration number as a column name (something like below)
for (n in 1:415) {
names(data)[n] <-"n" # name nth column after number 'n'
}
That will reassign all my column headers and ensure that I do not run into question classes resulting in null?
As additional background info, my data is imported from a comma delimited .csv file with the value 99 assigned to answers of NA with the first row being the column names/headers
data <- read.table("rawdata.csv", header=TRUE, sep=",", na.strings = "99")
There are 415 columns with headers in format Q001, Q002, etc
There are approximately 200 rows with no row labels/no label column
You can do this without a loop, as follows:
names(data) <- 1:415
Let me illustrate with an example:
dat <- data.frame(a=1:4, b=2:5, c=3:6, d=4:7)
dat
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Now rename the columns:
names(dat) <- 1:4
dat
1 2 3 4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
EDIT : How to access your new data
@Ramnath correctly points out that you won't be able to access your data using dat$1:
dat$1
Error: unexpected numeric constant in "dat$1"
Instead, you will have to wrap the column names in backticks:
dat$`1`
[1] 1 2 3 4
Alternatively, you can use a combination of character and numeric data to rename your columns. This could be a much more convenient way of dealing with your problem:
names(dat) <- paste("x", 1:4, sep="")
dat
x1 x2 x3 x4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
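One caveat with names(data) <- 1:415: the question says some questions are filtered out, so a column's position need not match its question number. A sketch that derives the numbers from the existing headers instead, assuming they all follow the Q001, Q002, ... pattern described in the question (the small data frame here is made up):

```r
dat <- data.frame(Q001 = 1:2, Q003 = 3:4, Q010 = 5:6)

# Strip the leading "Q" and the zero padding: "Q010" becomes "10"
names(dat) <- as.character(as.integer(sub("^Q", "", names(dat))))
names(dat)

# Access by question number, using a string or backticks
dat[["10"]]
dat$`3`
```

This way dat[["10"]] always refers to question 10, regardless of how many earlier questions were dropped.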

Resources