R: How to count rows with same factor levels and a numeric in a range - r

I've got data looking like this:
A | B | C
--------------
f | 1 | 1420h
f | 1 | 1540h
f | 3 | 600h
g | 2 | 900h
g | 2 | 930h
h | 1 | 700h
h | 3 | 400h
Now I want to create a new column which counts other rows in the data frame that meet certain conditions.
In this case, for each row I would like to know how often the same combination of A and B occurs in other rows within a range of 100 around that row's C.
So the result with this data would be:
A | B | C | D
------------------
f | 1 | 1420 | 0
f | 1 | 1540 | 0
f | 3 | 1321 | 0
g | 2 | 900 | 1
g | 2 | 930 | 1
h | 1 | 700 | 0
h | 3 | 400 | 0
I actually came to a solution using for(for()), but R takes far too long to compute the results.
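for (i in 1:nrow(df)) {
  n <- 0  # number of other rows matching row i
  for (p in 1:nrow(df)) {
    if (p != i &&
        df[p, "A"] == df[i, "A"] &&
        df[p, "B"] == df[i, "B"] &&
        df[i, "C"] + 100 > df[p, "C"] &&
        df[p, "C"] > df[i, "C"] - 100) n <- n + 1
  }
  df[i, "D"] <- n
}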
Is there a better way?
Thanks a lot!
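One possible way to avoid the double loop, sketched here under the assumption that C is stored as a plain number (without the "h" suffix), is to do the counting once per A/B group with ave(). The helper name count_near and its width argument are made up for this illustration and are not part of the question.

```r
# Sample data from the question; C is assumed to be numeric.
df <- data.frame(
  A = c("f", "f", "f", "g", "g", "h", "h"),
  B = c(1, 1, 3, 2, 2, 1, 3),
  C = c(1420, 1540, 600, 900, 930, 700, 400)
)

# Hypothetical helper: for each value in x, count the OTHER values of x
# that lie strictly within `width` of it.
count_near <- function(x, width = 100) {
  # sum() counts every value within range, including the value itself,
  # so 1 is subtracted to leave only the other rows
  vapply(x, function(v) sum(abs(x - v) < width), numeric(1)) - 1
}

# ave() applies the helper separately within each combination of A and B
# and returns the results in the original row order.
df$D <- ave(df$C, df$A, df$B, FUN = count_near)
```

Only rows inside the same A/B group are compared with each other, so the work grows with the square of the group sizes rather than the square of the whole table. If a single group is itself very large, this sketch may still be slow.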

Related

Find substrings that have a correlation greater than k in R

I have a big problem in R :(. We have a data frame named "hcmut" that shows the answers of students in a half-term test, like this:
hcmut
Code | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
2011 | B | D | A | A | C | B | A | B | C | C
2012 | A | D | AC | B | D | B | A | B | C | C
2013 | A | D | A | A | C | D | C | B | D | D
2014 | A | B | A | C | BC | D | D | D | D | D
Question 1: How can I find the substrings that have a correlation greater than k?
I think k is in the range 0 to 1.
Question 2: How can I find the substring that has the greatest correlation and show it (like "ABCD"...)?
Could you help me with this problem in R?
:( :(

How do you assign groups to larger groups with dplyr

I would like to assign groups to larger groups in order to assign them to cores for processing. I have 16 cores. This is what I have so far:
test<-data_extract%>%group_by(group_id)%>%sample_n(16,replace = TRUE)
This takes samples of 16 from each group.
This is an example of what I would like the final product to look like (with two clusters). All I really want is for every row with the same group id to belong to the same cluster, with a set number of clusters.
________________________________
balance | group_id | cluster|
454452 | a | 1 |
5450441 | a | 1 |
5444531 | b | 1 |
5404051 | b | 1 |
5404501 | b | 1 |
5404041 | b | 1 |
544251 | b | 1 |
254252 | b | 1 |
541254 | c | 2 |
54123254 | d | 1 |
542541 | d | 1 |
5442341 | e | 2 |
541 | f | 1 |
________________________________
test<-data%>%group_by(group_id)%>% mutate(group = sample(1:16,1))
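As a deterministic alternative to drawing a random cluster per group, one could number the distinct group ids and wrap those numbers around the available clusters. The following is only a sketch in base R; the toy data and the variable names n_clusters and group_index are assumptions made for the illustration.

```r
# Hypothetical toy data; the column names follow the example table above.
data <- data.frame(
  balance  = c(454452, 5450441, 5444531, 541254, 5442341, 541),
  group_id = c("a", "a", "b", "c", "e", "f")
)
n_clusters <- 2  # would be 16 when there are 16 cores

# Number the distinct group ids 1, 2, 3, ... in order of first appearance,
# then wrap those numbers around the clusters (round-robin).
group_index  <- match(data$group_id, unique(data$group_id))
data$cluster <- (group_index - 1) %% n_clusters + 1
```

Every row of a group ends up in the same cluster because the cluster is computed from the group id alone. Round-robin assignment ignores group sizes, so the clusters are not guaranteed to hold a similar number of rows.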

Populate table column by relative row values of another column in R

I’m trying to populate a table column by relative row values of another column in R. I have a table with two data columns (Data1, Data2) and two point value columns (P1, P2). Data1 is populated, Data2 is not. I want Data2 to be populated by the value in either P1 or P2, based on the relative value of Data1. In a given row, if the previous value of Data1 is higher than its current value, the Data2 cell is populated by the value in P1. If the previous value of Data1 is lower than its current value, the Data2 cell is populated by the value in P2. To illustrate what I’m trying to do, I’ve provided two sample tables. The first table is what I have (Data2 is not populated), and the second table is the desired outcome.
Table1 (What I have)
+-----+----+----+-------+-------+
| FID | P1 | P2 | Data1 | Data2 |
+-----+----+----+-------+-------+
| 1 | A | B | 50 | |
| 2 | C | D | 40 | |
| 3 | E | F | 60 | |
| 4 | G | H | 70 | |
| 5 | I | J | 65 | |
+-----+----+----+-------+-------+
Table2 (Desired Outcome)
+-----+----+----+-------+-------+
| FID | P1 | P2 | Data1 | Data2 |
+-----+----+----+-------+-------+
| 1 | A | B | 50 | NA |
| 2 | C | D | 40 | C |
| 3 | E | F | 60 | F |
| 4 | G | H | 70 | H |
| 5 | I | J | 65 | I |
+-----+----+----+-------+-------+
Is there a built in function in R to accomplish this? If not, any advice on how to create one?
A solution using dplyr could be:
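library(dplyr)
df %>%
  # lag() shifts Data1 down by one row, so each row is compared with the previous one
  mutate(Data2 = ifelse(lag(Data1) > Data1, as.character(P1), as.character(P2)))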
FID P1 P2 Data1 Data2
1 1 A B 50 <NA>
2 2 C D 40 C
3 3 E F 60 F
4 4 G H 70 H
5 5 I J 65 I
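For readers who want to see the same idea without dplyr, here is a rough base R sketch. It rebuilds the sample table by hand and computes the previous value manually; the name prev is introduced only for this illustration.

```r
# Sample table from the question, rebuilt by hand.
df <- data.frame(
  FID   = 1:5,
  P1    = c("A", "C", "E", "G", "I"),
  P2    = c("B", "D", "F", "H", "J"),
  Data1 = c(50, 40, 60, 70, 65),
  stringsAsFactors = FALSE
)

# Previous row's Data1: shift the column down by one and pad with NA.
prev <- c(NA, head(df$Data1, -1))

# Take P1 where the previous value was higher, P2 otherwise.
# The first row has no previous value, so the comparison is NA and so is Data2.
df$Data2 <- ifelse(prev > df$Data1, df$P1, df$P2)
```

The question does not say what should happen when two consecutive Data1 values are equal. In this sketch a tie falls through to P2.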

Create new variable based on the order of values in other columns

I have a data frame relative to accesses to a website. Several accesses per day, with different possible actions and descriptions of the actions
People | Date | Time | Action | Descr |
| | | | |
j | 01/01/2010 | 10:13 | X | A |
j | 01/01/2010 | 10:15 | Y | B |
j | 02/01/2010 | 14:15 | Z | C |
j | 03/01/2010 | 11:45 | X | D |
j | 03/01/2010 | 13:56 | X | E |
j | 03/01/2010 | 18:43 | Z | F |
j | 03/01/2010 | 18:44 | X | A |
After reducing the data frame to a balanced daily panel, I need to create variables such that:
-the value of the first variable (FirstX) must be equal to the description (Descr) of the first Action = X of the day (if available) and zero otherwise
-the value of the second variable must be equal to the description of the second Action = X of the day and zero otherwise
-so on
Once I transformed it into a balanced daily panel (which I can do) I need to have a final result which looks like this:
People | Date |Accesses| First X|Second X| Third X| Fourth X |
| | | | | | |
j | 01/01/2010 | 2 | A | 0 | 0 | 0 |
j | 02/01/2010 | 1 | 0 | 0 | 0 | 0 |
j | 03/01/2010 | 4 | D | E | A | 0 |
You can do it using the dplyr package:
library(dplyr)
df %>%
  group_by(People, Date) %>%
  summarise(Accesses = n(),
            FirstX  = ifelse(sum(Action == "X") >= 1, Descr[Action == "X"][1], "0"),
            SecondX = ifelse(sum(Action == "X") >= 2, Descr[Action == "X"][2], "0"),
            ThirdX  = ifelse(sum(Action == "X") >= 3, Descr[Action == "X"][3], "0"),
            FourthX = ifelse(sum(Action == "X") >= 4, Descr[Action == "X"][4], "0"))
This returns:
People Date Accesses FirstX SecondX ThirdX FourthX
<chr> <chr> <int> <chr> <chr> <chr> <chr>
1 j 01/01/2010 2 A 0 0 0
2 j 02/01/2010 1 0 0 0 0
3 j 03/01/2010 4 D E A 0
Note that you cannot have numeric 0s and characters in the same vector, so I put character 0s in the FirstX, SecondX, .. columns.
I found a solution myself. I post it here in case this is useful to somebody.
# create a temporary variable to be used for the count (just a vector of
# all the row numbers from 1 to N)
subset$temp_var1 <- seq_len(nrow(subset))
# generate a variable which starts counting from one and starts again
# every time "date" or "people" change
subset$count <- ave(subset$temp_var1, subset$date,
                    subset$people, FUN = seq_along)
# keep only the needed columns (this drops the variable "Action")
subset <- subset(subset, select = c("people", "date",
                                    "descr", "count"))
# reshape to wide format: one "descr" column per value of "count"
subset_comuni <- reshape(subset, idvar = c("people", "date"),
                         timevar = "count", direction = "wide")
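To make the ave() plus reshape() idea easier to try out, here is a self-contained sketch on the sample data from the question. It assumes the rows are first restricted to Action == "X", which the answer above does not show explicitly, and the names x_only and wide are chosen just for this example.

```r
# Sample data from the question (the Time column is left out for brevity).
df <- data.frame(
  People = "j",
  Date   = c("01/01/2010", "01/01/2010", "02/01/2010", "03/01/2010",
             "03/01/2010", "03/01/2010", "03/01/2010"),
  Action = c("X", "Y", "Z", "X", "X", "Z", "X"),
  Descr  = c("A", "B", "C", "D", "E", "F", "A"),
  stringsAsFactors = FALSE
)

# Assumption: only the rows with Action == "X" are of interest.
x_only <- df[df$Action == "X", c("People", "Date", "Descr")]

# Counter that restarts at 1 for every People/Date combination.
x_only$count <- ave(seq_len(nrow(x_only)), x_only$People, x_only$Date,
                    FUN = seq_along)

# One row per People/Date, one Descr column per counter value.
wide <- reshape(x_only, idvar = c("People", "Date"),
                timevar = "count", direction = "wide")
```

In this sketch, days without any X action (02/01/2010 here) do not appear in the result at all, and missing positions are NA rather than 0. Both would still need to be filled in when merging back into the balanced panel.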

Creation of Panel Data set in R

Programmers,
I have some difficulties in structuring my panel data set.
My panel data set, for the moment, has the following structure:
Exemplary here only with T = 2 and N = 3. (My real data set, however, is of size T = 6 and N = 20 000 000 )
Panel data structure 1:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
1 | 2 | C | ... | D |
1 | 3 | E | ... | F |
2 | 1 | G | ... | H |
2 | 2 | I | ... | J |
2 | 3 | K | ... | L |
The desired structure is:
Panel data structure 2:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
2 | 1 | G | ... | H |
1 | 2 | C | ... | D |
2 | 2 | I | ... | J |
1 | 3 | E | ... | F |
2 | 3 | K | ... | L |
This data structure represents the classic panel data structure, where the yearly observations over the whole period are structured for all individuals block by block.
My question: Is there a simple and efficient R solution that changes the data structure from Table 1 to Table 2 for very large data sets (data.frame)?
Thank you very much for all responses in advance!!
Enrico
You can reorder the rows of your dataframe using order():
df <- df[order(df$ID, df$Year), ]
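As a quick check of what order() does here, the small T = 2, N = 3 example from the question can be rebuilt and reordered. Only Variable_1 is included in this sketch, since the other columns are elided in the question.

```r
# Panel data structure 1 from the question (only Variable_1 shown).
df <- data.frame(
  Year       = c(1, 1, 1, 2, 2, 2),
  ID         = c(1, 2, 3, 1, 2, 3),
  Variable_1 = c("A", "C", "E", "G", "I", "K"),
  stringsAsFactors = FALSE
)

# Sort by ID first and by Year within each ID.
df <- df[order(df$ID, df$Year), ]

# Optional: renumber the rows after sorting.
rownames(df) <- NULL
```

After sorting, the rows for each ID sit together in year order, which matches panel data structure 2.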
