Grouping, counting and selecting on R dataset

I have a dataset like this:
x
A B
1 x 2
2 y 4
3 z 4
4 x 4
5 x 4
6 x 3
......
I want to find out which values of A occur more than some number of times (for example, 3).
I will probably need to group the values in a temporary table, which gives this:
x y z
4 1 1
and then call another function (which I don't know) that gives me this result:
x
because only the value x occurs more than 3 times in the table above.
Can R do this efficiently?

data<-data.frame(factor(c("x","y","z","x","x","x")),c(2,4,4,4,4,3))
To get the count of each letter, do
table(data[,1])
and to get the names of the factor levels that occur more than 3 times
names(table(data[,1]))[table(data[,1]) > 3]
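As a minimal sketch on the question's own data (the column names A and B and the variable min_count are assumptions made here for illustration), the table can be computed once and reused for the filter:

```r
# Data from the question; the column names A and B are assumed.
x <- data.frame(A = c("x", "y", "z", "x", "x", "x"),
                B = c(2, 4, 4, 4, 4, 3))

min_count <- 3             # hypothetical threshold
counts <- table(x$A)       # number of rows per distinct value of A
frequent <- names(counts)[counts > min_count]

print(counts)
print(frequent)            # only "x" occurs more than 3 times
```

Storing the result of table() avoids tabulating the column twice, which may matter on a large dataset.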

I'm not sure I understand you correctly. What is the B column for?
Does this work for you?
set.seed(1234)
A <- sample(c("x", "y", "z"), 20, replace = TRUE)
Ad <- data.frame(table(A))
with(Ad, A[Freq >= 7])
[1] x y

Related

How to find which interval/range a variable falls under in R

I have a data frame
> data.frame(Col1=seq(0,24,by=4),x=rnorm(7),y=rnorm(7,50))
Col1 x y
1 0 -0.107046196 49.96748
2 4 -0.001515573 50.02819
3 8 -1.884417429 49.80308
4 12 1.692774467 50.45827
5 16 -0.907602775 51.14937
6 20 0.166186536 49.17502
7 24 0.420263825 49.56720
and a variable
t=2
and want to find the interval of Col1 in which it falls (rows 1 and 2 in this example), and then calculate the ratio in variables x and y, i.e.
Col1 x y
1 0 -0.107046196 49.96748
2 4 -0.001515573 50.02819
then compute the ratio (t-0)/(4-0) from the value of t, and use that ratio to calculate the position in x and y.
I found a function for this in Matlab (Find which interval a point B is located in Matlab) and wonder whether there is a similar function in R.
Specifically, is there a way to determine which interval a variable falls under? And once I find that interval, a way to extract the subset of data?
I can only think of the %in% operator currently,
> t %in% df$Col1
[1] FALSE
For more clarity, I have tried
> z=NULL
> for(i in 1:(nrow(df)-1)){
+ z[[i]]=df$Col1[i]:df$Col1[i+1]
+ }
> w=NULL
> for(i in 1:length(z)){
+ w=c(w,t %in% z[[i]])
+ }
> v=which(w==1)
> df[v:(v+1),]
Col1 x y
1 0 1.076101 50.17514
2 4 1.971503 47.81647
>
and now hope there may be a more concise answer, as my real data is >1M rows.

Try the code below and see whether it gives you the expected results:
dataframe=data.frame(Col1=seq(0,24,by=4),x=rnorm(7),y=rnorm(7,50))
funfun=function(x){v=findInterval(x,dataframe$Col1);c(v,v+1)}
dataframe[funfun(2),]
Col1 x y
1 0 0.831266 50.28246
2 4 1.751892 48.78810
dataframe[funfun(10),]
Col1 x y
3 8 0.2624929 48.33945
4 12 -0.2243066 51.11304
If this helps, please let us know. Thank you.
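The question also asks for the position in x and y at t. A possible sketch, assuming linear interpolation between the two bracketing rows is what is wanted (the helper name interp_at is hypothetical, and the data are random):

```r
# Example data with the same layout as the question (values are random).
set.seed(1)
df <- data.frame(Col1 = seq(0, 24, by = 4), x = rnorm(7), y = rnorm(7, 50))

# Hypothetical helper: linearly interpolate columns x and y at position t.
interp_at <- function(df, t) {
  # Index of the row whose Col1 is the lower end of the interval containing t.
  i <- findInterval(t, df$Col1, rightmost.closed = TRUE)
  stopifnot(i >= 1, i < nrow(df))   # t must lie within the range of Col1
  ratio <- (t - df$Col1[i]) / (df$Col1[i + 1] - df$Col1[i])
  lower <- unlist(df[i, c("x", "y")])
  upper <- unlist(df[i + 1, c("x", "y")])
  lower + ratio * (upper - lower)
}

res <- interp_at(df, 2)
print(res)

# Base R's approx() performs the same linear interpolation.
print(c(x = approx(df$Col1, df$x, xout = 2)$y,
        y = approx(df$Col1, df$y, xout = 2)$y))
```

findInterval() uses a binary search on the sorted breakpoints, so it should scale well to data with more than a million rows.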

Using sum(x:y) to create a new variable/vector from existing values in R

I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If numberofevents == 3, it should calculate sum(1:3), which equals 3 + 2 + 1 = 6.
If numberofevents == 2, it should calculate sum(1:2), which equals 2 + 1 = 3.
Since I am working with a large set of data, I thought it would be convenient to create this additional vector
by using the sum function in R, d$z<-sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
Numerical expression has x elements: only the first is used.

You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]

Try using one of the apply-family functions in R, such as sapply or lapply.
sapply(numberofevents, function(x) sum(1:x))
It works for me.
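The warning in the question arises because the : operator expects a single number on each side, so 1:d$numberofevents uses only the first element of the column. As a sketch of an alternative, sum(1:n) equals n * (n + 1) / 2, and that formula can be applied to the whole column at once (z_check is a hypothetical name used here only to compare the two approaches):

```r
ID <- c("A", "A", "A", "B", "B")
eventcounter <- c(1, 2, 3, 1, 2)
numberofevents <- c(3, 3, 3, 2, 2)
d <- data.frame(ID, eventcounter, numberofevents)

# sum(1:n) equals n * (n + 1) / 2, and this formula works elementwise.
d$z <- d$numberofevents * (d$numberofevents + 1) / 2

# Same result computed with an explicit sum for each row.
z_check <- vapply(d$numberofevents,
                  function(n) as.numeric(sum(seq_len(n))),
                  numeric(1))

print(d)
print(all.equal(d$z, z_check))
```

The closed form avoids a function call per row, which may help on a large data frame.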

R: if function with two conditions?

I have a huge data frame and I am stuck on a conditional operation. Let me first present a simple example and then lay out my problem:
z <- c(0,1,2,3,4,5)
y <- c(2,2,2,3,3,3)
a <- c(1,1,1,2,2,2)
x <- data.frame(z,y,a)
Problem: I want to sum the z values over the rows that share the same y and a, but only if the second row of each group has z equal to 1.
I am sorry, but I am quite new to R, so I am not able to present any reasonable code of my own.
Any help would be highly appreciated.

As mentioned, your problem isn't clearly stated.
Perhaps you are looking to do something like this:
x$new <- with(x, ave(z, y, a, FUN = function(k)
ifelse(k[2] == 1, sum(k), NA)))
x
# z y a new
# 1 0 2 1 3
# 2 1 2 1 3
# 3 2 2 1 3
# 4 3 3 2 NA
# 5 4 3 2 NA
# 6 5 3 2 NA
Here, I've created a new column "new" which sums the values of "z" grouped by "y" and "a", but only if the second value in the group is equal to 1.

Since you say your data frame is quite large, you might want to convert it to a data.table object using the data.table package. You will likely find that the required operations are much faster if you have a great many rows. However, constructing the code for your case is not straightforward with data.table.
If I understand what you want to do (which is not entirely clear to me), you could try the following:
library(data.table)
z <- c(0,1,2,3,4,5)
y <- c(2,2,2,3,3,3)
a <- c(1,1,1,2,2,2)
x <- data.frame(z,y,a)
xx <- as.data.table(x) # Make a data.table object
setkey(xx, z) # Make the z column a key
xx[J(1), sum(a)] # Sum all values in column a where the key z = 1
[1] 1
# Now try the other sum you mention
xx[, sum(z), by = list(z = y)] # A column sum over groups defined by z = y
z V1
1: 2 2
2: 3 3
sum(xx[, sum(z), by = list(z = y)][, V1]) # Summing over the sums for each group should do it
[1] 5
To create the sum over the column a where z = 1, I made the z column a key. The syntax xx[J(1), sum(a)] sums a where the key value (z value) is 1. A bare number, as in xx[1, sum(a)], would select row 1 rather than the rows whose key equals 1.
I can create groups in the data.table object with by, which is analogous to a SQL GROUP BY clause if you are familiar with SQL. However, the result is the sum of the column z for each of the groups created. This may be inefficient if you have a great many possible matching values where z = y. The outer sum adds the values for each group in the sub-selected V1 column of the inner result.
If you are going to use data.table in a serious way, study the informative vignettes available for that package.
M Dowle, T Short, S Lianoglou, A Srinivasan with contributions from R Saporta and E Antonyan (2014). data.table: Extensions of data.frame. R package version 1.9.2. http://CRAN.R-project.org/package=data.table

R $ operator is invalid for atomic vectors

I have a dataset where one of the columns contains only the "#" sign. I used the following code to remove this column.
ia <- as.data.frame(sapply(ia,gsub,pattern="#",replacement=""))
However, after this operation, one of my integer columns changed to a factor.
I wonder what happened and how I can avoid that. I appreciate any help.

A more correct version of your code might be something like this:
d <- data.frame(x = as.character(1:5),y = c("a","b","#","c","d"))
d[] <- lapply(d,gsub,pattern = "#",replacement = "")
d
x y
1 1 a
2 2 b
3 3
4 4 c
5 5 d
But as you'll note, this approach will never actually remove the offending column. It's just replacing the # values with empty character strings. To remove a column of all # you might do something like this:
d <- data.frame(x = as.character(1:5),
y = c("a","b","#","c","d"),
z = rep("#",5))
d[,!sapply(d,function(x) all(x == "#"))]
x y
1 1 a
2 2 b
3 3 #
4 4 c
5 5 d
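As for why the type changed in the question: sapply() with gsub returns a character matrix, so every column becomes text, and as.data.frame() then converts text to factors whenever stringsAsFactors is TRUE (the default before R 4.0.0). A possible sketch that keeps integer columns intact by touching only character columns (the data and the names only_hash and is_chr are hypothetical):

```r
# Hypothetical data: an integer column, a text column, and a column of only "#".
ia <- data.frame(n = 1:5,
                 y = c("a", "b", "#", "c", "d"),
                 z = rep("#", 5),
                 stringsAsFactors = FALSE)

# Drop the columns that consist solely of "#".
only_hash <- vapply(ia, function(col) all(col == "#"), logical(1))
ia <- ia[, !only_hash, drop = FALSE]

# Strip "#" from character columns only, so other columns keep their type.
is_chr <- vapply(ia, is.character, logical(1))
ia[is_chr] <- lapply(ia[is_chr], gsub,
                     pattern = "#", replacement = "", fixed = TRUE)

str(ia)
```

Here drop = FALSE keeps the result a data frame even if only one column remains.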

Surely if you want to remove an offending column from a data frame, and you know which column it is, you can just subset. So, if it's the first column:
df <- df[,-1]
If it's a later column, use that column's index instead.

Replicate variable based off match of two other variables in R

I've got a seemingly simple question that I can't answer: I've got three vectors:
x <- c(1,2,3,4)
weight <- c(5,6,7,8)
y <- c(1,1,1,2,2,2)
I want to create a new vector that replicates the values of weight for each time an element in x matches y such that it produces the following new weight vector associated with y:
y_weight <- c(5,5,5,6,6,6)
Any thoughts on how to do this (either loop or vectorized)? Thanks

You want the match function.
match(y, x)
returns the indices of the matches; then use those to build your new weight vector
weight[match(y, x)]
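Put together on the vectors from the question, this can be sketched as follows (idx is a hypothetical intermediate name):

```r
x <- c(1, 2, 3, 4)
weight <- c(5, 6, 7, 8)
y <- c(1, 1, 1, 2, 2, 2)

idx <- match(y, x)         # position in x of each element of y
y_weight <- weight[idx]    # look up the weight at those positions
print(y_weight)

# Values of y that do not occur in x give NA rather than an error.
print(weight[match(c(1, 9), x)])
```

match() returns the position of the first match only, so this assumes that the values in x are unique.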

#Using plyr
library(plyr)
df<-as.data.frame(cbind(x,weight)) # converting to dataframe
df<-rename(df,c(x="y")) # rename x as y for joining dataframes
y<-as.data.frame(y) # converting to dataframe
mydata <- join(df, y, by = "y",type="right")
> mydata
y weight
1 1 5
2 1 5
3 1 5
4 2 6
5 2 6
6 2 6
