Getting number of rows/columns from csv file in R with condition

I've got a .csv file, which I read with the command:
my_data <- read.csv("file_name")
It has a lot of columns, but I want to get the number of rows that satisfy a condition on a specific column, for example the number of rows where the value of column "VAL" is greater than 24.
I've tried:
k <- subset(my_data, my_data$VAL > 24)
length(k)
But the result doesn't look correct. I don't know how to make it work.

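Use dim(k) if you need to keep the data frame, or dim(subset(my_data, my_data$VAL > 24)) if you don't. dim() returns the number of rows followed by the number of columns; length(k) on a data frame returns only the number of columns, and nrow(k) gives the row count directly.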

If you are only interested in the number of such observations, I prefer simply summing a logical vector. One of R's greatest strengths is its vectorised operations: (df$y > 100) gives you a logical vector indicating, for each observation, whether the condition is TRUE or FALSE. Summing that vector gives the number of observations for which it is TRUE.
x <- 1:10000
y <- rnorm(10000, 100, 10) # same length as x
df <- data.frame(x, y) # create a data frame
count <- sum(df$y > 100)
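As a small illustration of both approaches on the kind of data the question describes, here is a sketch with a made-up data frame. The values in VAL are invented, and na.rm = TRUE is an addition of mine that only matters if the column can contain NA.

```r
# Hypothetical data standing in for the contents of the CSV file
my_data <- data.frame(VAL = c(10, 25, 30, NA, 24))

# Sum a logical vector; na.rm = TRUE skips rows where VAL is missing
n_sum <- sum(my_data$VAL > 24, na.rm = TRUE)

# subset() treats missing values in the condition as FALSE,
# so the NA row is dropped before counting
n_subset <- nrow(subset(my_data, VAL > 24))

print(c(n_sum, n_subset))
```

Both counts should agree; without na.rm = TRUE, the sum() version would return NA as soon as one value is missing.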

Related

Count number of rows in each column in a dataframe that satisfy a specific condition

I'm new to R, so I am sorry if this seems like a stupid question.
Basically, I have a dataframe with 100 rows and 3 different columns of data. I also have a vector with 3 thresholds, one for each column. I was wondering how you would filter out the values of each column that are greater than the corresponding threshold.
Edit: Sorry for the incomplete question.
Essentially, what I would like to create is a function (that takes a dataframe and a vector of thresholds as parameters) that applies every threshold to its respective column of the dataframe (so there is one threshold for every column of the dataframe). The number of elements of each column that "respect" their threshold should then be put in a vector. So for example:
Column 1: values = 1,2,3. Threshold = (only values lower than 3)
Column 2: values = 4,5,6. Threshold = (only values lower than 6)
Output: A vector (2,2), since there are two elements in each column that are under their respective thresholds.
Thank you everyone for the help!!

Your example data:
df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
One option to resolve your question is to use sapply(), which applies a function over a list or vector. In this case, I create a vector for the columns in df with 1:ncol(df). Inside the function, you can count the number of values less than a given threshold by summing the number of TRUE cases:
col_num <- 1:ncol(df)
sapply(col_num, function(x) {sum(df[, x] < threshold[x])})
Or, in a single line:
sapply(1:ncol(df), function(x) {sum(df[, x] < threshold[x])})
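Since the question asks for a function, the same idea could be wrapped up roughly as follows. This is only a sketch: count_below is a name I made up, and it relies on mapply() walking over the columns of the data frame and the thresholds in parallel.

```r
# Hypothetical helper: count, per column, the values below that column's threshold
count_below <- function(data, thresholds) {
  # one threshold is expected for every column
  stopifnot(ncol(data) == length(thresholds))
  counts <- mapply(
    function(column, limit) sum(column < limit, na.rm = TRUE),
    data, thresholds
  )
  unname(counts)  # drop the column names so a plain vector is returned
}

df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
result <- count_below(df, threshold)
print(result)
```

For the example data this should give the vector (2, 2) described in the question.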

Divide specific values in a column by 1000

I need to divide certain values in a column by 1000 but do not know how to go about it.
I initially tried this:
test <- Updins(weight,)
test$weight <- as.numeric(test$weight) / 1000
head(test)
with Updins being the dataframe and weight the column, just to see if it would at least divide the entire column by 1000, but no such luck: R did not recognise 'test' as a variable.
Can anyone provide any guidance? I'm very new to R :)

If Updins is the name of the dataset object, select the column with [ rather than (, because ( is used to call a function:
test <- Updins['weight']
test$weight <- as.numeric(test$weight) / 1000

Here is a fake data set in which every row is divided by 1000. I also included a for-loop as one potential way to do this only for certain rows. Since you didn't specify how you select those rows, I did it for any row with a value greater than 1,005, and I did a second version that divides by 1,000 only if the ID is an odd number. If you have NAs, you may need an additional if statement to deal with them; the third and last for-loop shows an example of that.
ID <- 1:10
grams <- 1000:1009
df <- data.frame(ID, grams)
df$kg <- as.numeric(df$grams) / 1000
df[, "kg"] <- as.numeric(df[, "grams"]) / 1000 # does the same thing as the line above
# only if the weight is greater than 1,005 grams
for (i in 1:nrow(df)) {
  if (df[i, "grams"] > 1005) {
    df[i, "kg3"] <- as.numeric(df[i, "grams"]) / 1000
  }
}
# only if the ID is an odd number
for (i in 1:nrow(df)) {
  if (df[i, "ID"] %in% seq(1, 101, by = 2)) {
    df[i, "kg4"] <- as.numeric(df[i, "grams"]) / 1000
  }
}
df[3, "grams"] <- NA # add an NA to the weight data to test the next loop
# same as above, but works with NAs
for (i in 1:nrow(df)) {
  if (is.na(df[i, "grams"]) && (df[i, "ID"] %in% seq(1, 101, by = 2))) {
    df[i, "kg4"] <- NA
  } else if (df[i, "ID"] %in% seq(1, 101, by = 2)) {
    df[i, "kg4"] <- as.numeric(df[i, "grams"]) / 1000
  }
}
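For what it's worth, the odd-ID loop can probably also be written without a loop by using a logical vector. The sketch below rebuilds the same fake data (with the NA already in row 3) and uses ifelse(); the name odd is mine.

```r
# Same fake data as above, with the NA already placed in row 3
df <- data.frame(ID = 1:10, grams = c(1000:1001, NA, 1003:1009))

# TRUE for rows whose ID is odd
odd <- df$ID %% 2 == 1

# Divide by 1000 only where ID is odd; other rows get NA.
# An NA in grams simply stays NA, so no extra if statement is needed.
df$kg4 <- ifelse(odd, df$grams / 1000, NA)

print(df)
```

To change the values in place instead of creating a new column, something like df$grams[odd] <- df$grams[odd] / 1000 should work in the same way.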

Hard without data to work with or expected output, but here's a skeleton that you could probably use:
library(dplyr) # the package you'll need, for the pipe (%>% passes an object from one line to the next)
test <- Updins %>% # using the dataset Updins
  mutate(weight = ifelse(as.numeric(weight) > 199, # changing the weight variable: where weight > 199...
                         as.character(as.numeric(weight) / 1000), # ...divide a numeric version of weight by 1000, but keep it as a character...
                         weight)) # ...otherwise, keep the weight variable as is
head(test)
I kept the new value as a character because, judging by some of the warnings you're getting ('NAs introduced by coercion'), your weight variable seems to be a character variable.

How to compare a data frame with duplicates and a vector?

I have a data frame in which some ids appear more than once. I sampled these ids uniquely, and now I have a vector with the sampled ids. I need to create a logical vector that tells me which rows in the data frame have ids that also appear in my sample.
I have tried the match function, but it selects only the first appearance and I need all appearances.
I have also tried merge, but the dataset is too large, so there is not enough memory to do it.

You can use %in% to get a logical vector, and which() together with %in% to get the row indices. Here is a reproducible example that contains duplicate IDs.
set.seed(1234)
df <- data.frame(id=sample(1:80, 100, replace=TRUE), b=rnorm(100))
mySample <- seq(1, 80, by=6)
#logical vector length of nrow(df)
myRows <- df$id %in% mySample
# row indices
myIndices <- which(df$id %in% mySample)

This is what you can do using match (since you were trying this function):
x <- match(df$id, mySample, nomatch = 0) > 0
This gives you a logical vector that is TRUE if df$id appears in mySample and FALSE otherwise.
To retrieve the respective indices:
which(x)
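To see why the direction of the lookup matters, here is a tiny sketch with made-up ids (ids and sampled are hypothetical names). Looking up the sample in the data finds only first appearances, while looking up the data in the sample flags every row.

```r
ids     <- c(3, 7, 3, 9, 7, 1)  # ids with duplicates, as in the data frame
sampled <- c(3, 1)              # unique ids that were sampled

# Looking up the sample in the data returns only the first position of each id
first_only <- match(sampled, ids)

# Looking up the data in the sample gives one logical value per row
in_sample <- ids %in% sampled
row_idx   <- which(in_sample)

print(first_only)
print(row_idx)
```

Here first_only should contain positions 1 and 6 only, whereas row_idx also includes the second appearance of id 3 at position 3.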

How to sample 1:x where x is a vector of random integers with length greater than 1

The sample code
population <- 10000
vec <- sample(1:6, population, replace=T)
output <- sample(1:vec, population, replace=T)
warning: numerical expression has 10000 elements: only the first used.
The intent is for each draw to have its own upper limit, so one draw should sample from 1:2 while another could sample from 1:6. The maximum for each draw is defined in 'vec'.
What is the correct way to structure this line so that it creates 'output' as a vector of length 10,000, with each element drawn using the corresponding maximum in 'vec'? Currently it is only using the first value of 'vec' for all 10000 samples in 'output'.

Maybe use sapply to loop over vec:
out <- sapply(vec,sample,size = 1)

Another way: create a matrix in which each column holds samples drawn with a different maximum. Then build a vector that randomly takes one value from each row. I thought this might be faster, but both ways are very fast.
population <- 1e4
samp.mat <- sapply(1:6, sample.int, size = population, replace = TRUE)
indices <- cbind(seq_len(nrow(samp.mat)), sample.int(6, nrow(samp.mat), replace = TRUE))
out <- samp.mat[indices]
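If the maxima already exist in vec, as in the question, a sketch along these lines should give one draw per element. The seed is arbitrary, and the runif() variant is my own suggestion rather than something from the answers above; it relies on runif() returning values strictly between 0 and 1.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
population <- 10000
vec <- sample(1:6, population, replace = TRUE)

# One draw per element of vec, each taken from 1..vec[i]
output <- vapply(vec, function(k) sample.int(k, size = 1L), integer(1))

# Loop-free alternative: scale a uniform draw on (0, 1) by each maximum,
# then round down and shift so the result lies in 1..vec[i]
output2 <- floor(runif(population) * vec) + 1

print(range(output))
print(range(output2))
```

Both vectors have length 10,000 and every element stays within 1 and its own maximum in vec.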

Error: (subscript) logical subscript too long

Can someone let me know why I am getting this error and how I can fix it?
What I am trying to do is remove the rows that have a 1 in any column whose sum is less than 10.
Here is the code:
a0=rep(1,40)
a=rep(0:1,20)
b=c(rep(1,20),rep(0,20))
c0=c(rep(0,12),rep(1,28))
c1=c(rep(1,5),rep(0,35))
c2=c(rep(1,8),rep(0,32))
c3=c(rep(1,23),rep(0,17))
c4=c(rep(1,6),rep(0,34))
x=matrix(cbind(a0,a,b,c0,c1,c2,c3,c4),nrow=40,ncol=8)
nam <- paste("V",2:9,sep="")
colnames(x)<-nam
dat <- cbind(y=rnorm(40,50,7),x)
#===================================
toSum <- colSums(dat)
Col <- Val <- NULL
for(i in 1:length(toSum)){
if(toSum[i]<10){
Col <- c(Col,colnames(dat)[i])
Val <- c(Val,toSum[i])}
}
cs <- colSums(dat) < 10
indx <- dat[,which(cs)]==0
for(i in 1:dim(indx)[2]){
datnw <- dat[indx[,i],]
dat <- datnw}
datnw2 <- dat[, -which(cs)]
Thanks

If I understand correctly what you're trying to achieve, you might best write it this way:
cs <- colSums(dat) < 10
dat[rowSums(dat[,cs]) == 0, !cs]
This means: for any column with sum less than 10 (called a “small column” hereafter), drop any row which has a 1 in that column. So you only keep rows which have a zero in all those small columns. You drop the small columns as well, as they would only contain zeros in any case.
In your code, indx is a logical matrix with 40 rows, one for each row of input, and one column for each small column in the input. You use the first column of indx to remove the rows with a 1 in the first small column. This results in a new value for dat, which is a few rows shorter than the original. In the next iteration of the loop, you use the second column of indx in an attempt to remove more rows. But this won't work: after the first iteration, dat has fewer than 40 rows, but the second column of indx still has all 40 elements. This is what's causing the error: you're subscripting an object with fewer than 40 rows using a logical vector of length 40.
You could combine the three columns of your indx into a single vector suitable to subscript the rows of interest using the following expression:
apply(indx, 1, all)
This will have a TRUE value in its result for exactly those rows which have TRUE in each column. However, I'd prefer my code above over this, as it is much shorter to write. The most likely reason to prefer the latter is if your data may contain negative numbers, so that a row sum of zero does not imply an all-zero row. That is not a problem in your example data.
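Here is a small runnable sketch of the same filtering on a made-up matrix. The data and the threshold of 2 (instead of 10) are invented to keep the example tiny. Two things are my additions: drop = FALSE, so the code still works when only one column is small, and counting non-zero entries instead of summing, which sidesteps the negative-number caveat.

```r
# Hypothetical 4-row matrix in the same shape as dat: a response plus 0/1 columns
dat <- cbind(y  = c(5.1, 4.8, 6.0, 5.5),
             V2 = c(1, 1, 1, 1),
             V3 = c(1, 0, 0, 0),
             V4 = c(0, 0, 1, 0))

# "Small" columns: column sum below the threshold (2 here, 10 in the question)
cs <- colSums(dat) < 2

# Keep rows with no non-zero entry in any small column
keep <- rowSums(dat[, cs, drop = FALSE] != 0) == 0

# Drop the small columns as well
result <- dat[keep, !cs, drop = FALSE]
print(result)
```

Rows 1 and 3 have a 1 in a small column and are removed, so only rows 2 and 4 remain, with columns y and V2.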
