Delete specific rows out of a dataset - r

I have a dataset with 40 columns and 100,000 rows. Because the number of rows is too big, I want to delete some of them. I want to delete the rows from 10,000-20,000, from 30,000-40,000, and from 60,000-70,000, so that I end up with a dataset of 40 columns and 70,000 rows. The first column is an ID (called ItemID) that starts at 1 and ends at 100,000 for the last row. Can someone please help me?
I tried this to delete the rows from 10000 to 20000, but it's not working (let's say the data set is called "Data"):
Data <- Data[Data$ItemID>10000 && Data$ItemID<20000]

Several ways of doing this. Would something like this suit your needs?
dat <- data.frame(ItemID=1:100, x=rnorm(100))
# via row numbers
ind <- c(10:20,30:40,60:70)
dat <- dat[-ind,]
# via logical vector
ind <- with(dat, { (ItemID >= 10 & ItemID <= 20) |
                   (ItemID >= 30 & ItemID <= 40) |
                   (ItemID >= 60 & ItemID <= 70) })
dat2 <- dat[!ind,]
To take it to the scale of your data set, just build ind according to the size of your data set (multiplying the row numbers might do).
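For instance, at the sizes in the question (a sketch, assuming the data frame is called Data as in the question, and reading the bounds as 10,001-20,000 and so on so that exactly 70,000 rows remain):
ind <- c(10001:20000, 30001:40000, 60001:70000)
Data <- Data[-ind, ]
nrow(Data)  # 70000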

I think you should be able to do
data <- data[-(10000:20000),]
and then remove the other rows in a similar manner.
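One caveat: after the first deletion the remaining rows are renumbered, so a second call like data[-(30000:40000),] would no longer hit the original rows 30,000-40,000. Removing all three ranges in a single call sidesteps that (a sketch, using the bounds as stated in the question):
data <- data[-c(10000:20000, 30000:40000, 60000:70000), ]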


Is there an R function equivalent to Excel's $ for "keep reference cell constant" [duplicate]

I'm new to R and I've done my best googling for the answer to the question below, but nothing has come up so far.
In Excel you can keep a specific column or row constant when using a reference by putting $ before the row number or column letter. This is handy when performing operations across many cells when all cells are referring to something in a single other cell. For example, take a dataset with grades in a course: Row 1 has the total number of points per class assignment (each column is an assignment), and Rows 2:31 are the raw scores for each of 30 students. In Excel, to calculate percentage correct, I take each student's score for that assignment and refer it to the first row, holding row constant in the reference so I can drag down and apply that operation to all 30 rows below Row 1. Most importantly, in Excel I can also drag right to do this across all columns, without having to type a new operation.
What is the most efficient way to perform this operation--holding a reference row constant while performing an operation to all other rows, then applying this across columns while still holding the reference row constant--in R? So far I had to slice the reference row to a new dataframe, remove that row from the original dataframe, then type one operation per column while manually going back to the new dataframe to look up the reference number to apply for that column's operation. See my super-tedious code below.
For reference, each column is an assignment, and Row 1 had the number of points possible for that assignment. All subsequent rows were individual students and their grades.
# Extract number of points possible
outof <- slice(grades, 1)
# Now remove that row (Row 1)
grades <- grades[-c(1),]
# Turn number correct into percentage. The divided by
# number is from the sliced Row 1, which I had to
# look up and type one-by-one. I'm hoping there is
# code to do this automatically in R.
grades$ExamFinal <- (grades$ExamFinal / 34) * 100
grades$Exam3 <- (grades$Exam3 / 26) * 100
grades$Exam4 <- (grades$Exam4 / 31) * 100
grades$q1.1 <- grades$q1.1 / 6
grades$q1.2 <- grades$q1.2 / 10
grades$q1.3 <- grades$q1.3 / 6
grades$q2.2 <- grades$q2.2 / 3
grades$q2.4 <- grades$q2.4 / 12
grades$q3.1 <- grades$q3.1 / 9
grades$q3.2 <- grades$q3.2 / 8
grades$q3.3 <- grades$q3.3 / 12
grades$q4.1 <- grades$q4.1 / 13
grades$q4.2 <- grades$q4.2 / 5
grades$q6.1 <- grades$q6.1 / 5
grades$q6.2 <- grades$q6.2 / 6
grades$q6.3 <- grades$q6.3 / 11
grades$q7.1 <- grades$q7.1 / 7
grades$q7.2 <- grades$q7.2 / 8
grades$q8.1 <- grades$q8.1 / 7
grades$q8.3 <- grades$q8.3 / 13
grades$q9.2 <- grades$q9.2 / 13
grades$q10.1 <- grades$q10.1 / 8
grades$q12.1 <- grades$q12.1 / 12
You can use sweep:
100*sweep(grades, 2, outof, "/")
# ExamFinal EXam3 EXam4
#1 100.00 76.92 32.26
#2 88.24 84.62 64.52
#3 29.41 100.00 96.77
Data:
grades
ExamFinal EXam3 EXam4
1 34 20 10
2 30 22 20
3 10 26 30
outof
[1] 34 26 31
grades <- data.frame(ExamFinal=c(34,30,10),
                     EXam3=c(20,22,26),
                     EXam4=c(10,20,30))
outof <- c(34,26,31)
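A caveat worth noting: sweep expects the statistics as a plain vector, so if outof was created with slice() as in the question (which returns a one-row data frame rather than a vector), it may need to be unlisted first:
100*sweep(grades, 2, unlist(outof), "/")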
You can use mapply on the original grades dataframe (don't remove the first row) to divide rows by the first row. Then convert the result back to a dataframe.
as.data.frame(mapply("/", grades[2:31, ], grades[1, ]))
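A close variant that avoids hard-coding the student rows (a sketch; Map is just mapply without simplification, and grades[-1, ] drops the totals row whatever the number of students):
as.data.frame(Map("/", grades[-1, ], grades[1, ]))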
The easiest way is to use some type of loop. In this case I am using the sapply function to divide all of the elements in each column by the corresponding total score.
#Example data
outof <- data.frame(q1=c(3), q2=c(5))
grades <- data.frame(q1=c(1,2,3), q2=c(4,4,5))
answermatrix <- sapply(1:ncol(grades), function(i) {
  #grades[,i]/outof[i] #use this if "outof" is a vector
  grades[,i]/outof[,i]
})
answermatrix
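One caveat: sapply over column indices returns a bare matrix, so the column names are dropped; if needed they can be restored afterwards:
colnames(answermatrix) <- colnames(grades)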
A loop would probably be your best bet.
First you would want to extract the maximum number of points possible, as listed in the first row, then use that number to calculate the percentage in the remaining rows, per column:
j <- 2  # start at row 2; row 1 holds the total points
for (i in 1:ncol(df)) {
  a <- df[1, i]  # total points possible for this column
  j <- 2         # reset to the first student row for each column
  while (j <= nrow(df)) {
    b <- df[j, i]              # this student's raw score
    df[j, i] <- (b / a) * 100  # replace it with the percentage
    j <- j + 1                 # go to the next row
  }
}
The only drawback to this approach is that data frames modified inside a function aren't copied back to the global environment, but that can be fixed with assign(), like so:
f1 <- function(x = <name of df>, y = <name you want the completed df to be called>) {
  for (i in 1:ncol(x)) {
    a <- x[1, i]  # total points possible for this column
    j <- 2        # first student row
    while (j <= nrow(x)) {
      b <- x[j, i]
      x[j, i] <- (b / a) * 100
      j <- j + 1
    }
  }
  arg_name <- deparse(substitute(y))       # gets argument name
  var_name <- paste(arg_name)              # constructs the name
  assign(var_name, x, envir = .GlobalEnv)  # produces global dataframe
}

R function to subset dataframe so that non-adjacent values in a column differ by >= X (starting with the first value)

I am looking for a function that iterates through the rows of a given column ("pos" for position, ascending) in a dataframe, and only keeps rows whose values differ by at least, say, 10, starting with the first row. Thus it would start with the first row (and store it), then carry on until it finds a row with a value at least 10 higher than the first, store that row, and start again from that value, looking for the next one that differs by more than 10.
So far I have an R for loop that successfully finds adjacent rows at least X values apart, but it does not have the capability of looking any further than one row down, nor of stopping once it has found the given row and starting again from there.
Here is the function I have:
# example data frame
df <- data.frame(x=c(1:1000), pos=sort(sample(1:10000, 1000)))
# prep function (this only checks row above)
library(dplyr)
pos.apart.subset <- function(df, pos.diff) {
  # create new dfs to store output
  new.df <- list()
  new.df1 <- data.frame()
  # iterate through each row of df
  for (i in 1:nrow(df)) {
    # keep if the next row's value is at least pos.diff higher than row i's,
    # or if the values are not ascending, or if this is the first row
    if (isTRUE(df$pos[i+1] >= df$pos[i] + pos.diff | df$pos[i+1] < df$pos[i] | i == 1)) {
      # add rows that meet the conditions to the list
      new.df[[i]] <- df[i, ]
    }
  }
  # bind all rows that met the conditions
  new.df1 <- bind_rows(new.df)
  return(new.df1)
}
# test run for pos column adjacent values to be at least 10 apart
df1 <- pos.apart.subset(df, 10); head(df1)
Happy to do this in awk or any other language. Many thanks.
It seems I misunderstood the question earlier. Since we don't want the difference between consecutive rows but from the last kept row, you can try:
nrows <- 1
previous_match <- 1
for (i in 2:nrow(df)) {
  if (df$pos[i] - df$pos[previous_match] > 10) {
    nrows <- c(nrows, i)
    previous_match <- i
  }
}
and then subset the selected rows:
df[nrows, ]
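For what it's worth, the same keep-if-more-than-10-above-the-last-kept-row logic can be written without an explicit loop via Reduce, which carries the last kept position along and keeps each row where that position changes (a sketch):
pos_kept <- Reduce(function(prev, cur) if (cur - prev > 10) cur else prev,
                   df$pos, accumulate = TRUE)
df[!duplicated(pos_kept), ]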
Earlier answer
We can use diff to get the difference between consecutive rows and select the row which has difference of greater than 10.
head(subset(df, c(TRUE, diff(pos) > 10)))
# x pos
#1 1 1
#2 2 31
#6 6 71
#9 9 134
#10 10 151
#13 13 185
The first TRUE is there to select the first row by default.
In dplyr, we can use lag to get value from previous row :
library(dplyr)
df %>% filter(pos - lag(pos, default = -Inf) > 10)

Count unique values for previous rows in sorted table

I'm trying to count the unique row values for a sorted table. So if I had a table like:
data('chickwts')
chickwts[order(chickwts$weight),]
I'd like to be able to retrieve the total number of unique feeds for the preceding rows. Therefore, if I wanted weight < 150 as my cutoff, I would get feed count = 2. Ideally I would be able to generate a column that also keeps track of this count throughout the rows and plot this number against the weight in this case.
I know I can pre-select/subset with grepl:
chickwts$seed=ifelse(grepl("seed",chickwts$feed),TRUE,FALSE)*1
chickwts[order(chickwts$weight), ]
I know I can use unique to get unique names, but I'm having trouble putting these together to get that final count column.
data("chickwts")
chickwts <- chickwts[order(chickwts[,"weight"]),]
chickwts[,"unique.feed"] <- unlist(lapply(chickwts[,"weight"], function(n)
  with(chickwts, length(unique(feed[weight < n])))))
- Use all the weights in an lapply function.
- Check which weights are less than the weight of the current row: weight < n.
- Get the corresponding feeds for weights less than that of the current row: feed[weight < n].
- Get the unique feeds and count how many there are with unique and length.
- unlist the result, as we want a vector.
data("chickwts")
chickwts <- chickwts[order(chickwts$weight),]
# Using < 150 as a cutoff
cat("if you meant 1 column giving the count to all rows, based on < 150")
chickwts$count_less_than_150 <- length(unique(chickwts$feed[chickwts$weight < 150]))
cat("if you meant 2 columns giving the count to all rows, based on < 150 or > 150")
chickwts$count_lt_150 <- length(unique(chickwts$feed[chickwts$weight < 150]))
chickwts$count_ge_150 <- length(unique(chickwts$feed[chickwts$weight >= 150]))
cat("if you meant 1 column giving the count to all rows, based on < 150 or >= 150")
chickwts$count <- NA
chickwts$count[chickwts$weight < 150] <- length(unique(chickwts$feed[chickwts$weight < 150]))
chickwts$count[chickwts$weight >= 150] <- length(unique(chickwts$feed[chickwts$weight >= 150]))
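If the goal is the running column mentioned in the question (one that tracks, row by row, how many distinct feeds have appeared so far, which can then be plotted against weight), a compact sketch after sorting:
chickwts <- chickwts[order(chickwts$weight), ]
chickwts$feed_count <- cumsum(!duplicated(chickwts$feed))
plot(chickwts$weight, chickwts$feed_count, type = "s")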

Dynamic subsetting a dataframe

I have a dataframe with a fixed number of non-numeric columns and an arbitrary number of numeric columns, like this:
s <- data.frame(A=c("a","b","c"),B=c(1,2,3), C=c(24,15,2))
I also have two vectors, each as long as the number of numeric columns, defining the min and max values for every column.
min <- c(2,10)
max <- c(3,30)
I want to subset the dataframe to all the rows that have column B between 2 and 3, and column C between 10 and 30. Like this:
s <- s[s$B >= min[1] & s$B <= max[1] & s$C >= min[2] & s$C <= max[2],]
To subset the dataframe for an arbitrary number of numeric columns, right now I use a for statement:
for (i in 1:length(min))
  s <- s[s[, i+1] >= min[i] & s[, i+1] <= max[i], ]
This does the job but is very slow. I have around 20 columns and 150K rows in the data frame.
Is there a better way?
Generically, like this?
s <- data.frame(A=sample(letters,100,T), B=sample(1:4,100,T), C=sample(2:40,100,T))
# larger dataframe
min <- c(2,10)
max <- c(3,30)
filt <- rowSums(
  sapply(1:length(min), function(x) {        # for each item in min (or max)
    s[, x+1] >= min[x] & s[, x+1] <= max[x]  # create a T/F vector
  })
) == length(min)  # this returns TRUE for cases where all criteria are met
s[filt, ]  # this applies your filter to s
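If speed is the main worry at 150K rows, an equivalent sketch that compares the whole numeric block at once with sweep (assuming, as above, that the numeric columns start at column 2, and noting that min and max here shadow the base functions):
m <- as.matrix(s[, -1])
filt <- rowSums(sweep(m, 2, min, ">=") & sweep(m, 2, max, "<=")) == length(min)
s[filt, ]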

Changing data based on conditions in R

I have posted a similar question to this before and got a quick answer, but am just an R beginner and haven't been able to adapt it to what I need.
Basically I want to take the below code (which says: if Date_Index is between two numbers and df is < X, then turn df to Y) and make it so it only applies to entries that meet a certain criterion, i.e.:
HAVE: df[df$Date_Index >= 50 & df$Date_Index <= 52 & df < .0000001]=1
ADD: if df$Date_Index <= 49 AND df = 0.00 ignore the above statement, else execute:
In other words I need the equivalent of an if, then, else clause. If Date_Index <= 49 and df = 0, leave it alone; else if Date_Index >= 50 and Date_Index <= 52 and df < .001, then replace the data (in Date_Index rows 50-52) with 1.
This (simple) data set should illustrate it enough:
xx <- matrix(0,52,5)
xx[,1]=1
xx[,3]=1
xx[,5]=1
xx[50:52,]=0
xx[,1]=1:52
xx[50,3]=1
So what I'd like is column 2 and column 4 to stay all 0's but for the bottom of column 3 and 5 to continue to be all 1's.
I suppose you're looking for this:
xx[xx[,1] >= 50 & xx[,1] <= 52, c(FALSE, !colSums(!xx[xx[,1] <= 49, -1]))] <- 1
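Unpacked, my reading of that one-liner, step by step (same objects as above):
keep_rows <- xx[, 1] >= 50 & xx[, 1] <= 52     # Date_Index between 50 and 52
no_zeros  <- !colSums(!xx[xx[, 1] <= 49, -1])  # columns with no 0s where Date_Index <= 49
keep_cols <- c(FALSE, no_zeros)                # never touch the Date_Index column itself
xx[keep_rows, keep_cols] <- 1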
