I have a big problem in R :(. We have a data frame named "hcmut" that shows the students' answers in a midterm test, like this:
hcmut
Code | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
2011 | B | D | A | A | C | B | A | B | C | C
2012 | A | D | AC | B | D | B | A | B | C | C
2013 | A | D | A | A | C | D | C | B | D | D
2014 | A | B | A | C | BC | D | D | D | D | D
Question 1: find the substrings that have a correlation greater than k.
I think k is in the range 0 to 1.
Question 2: find the substring that has the greatest correlation and show that substring (like "ABCD"...).
Could you help me with this problem in R?
:( :(
I would like to assign groups to larger groups (clusters) in order to assign them to cores for processing. I have 16 cores. This is what I have so far:
test <- data_extract %>% group_by(group_id) %>% sample_n(16, replace = TRUE)
This takes samples of 16 from each group.
This is an example of what I would like the final product to look like (with two clusters). All I really want is for every row with the same group id to belong to the same cluster, with a set number of clusters:
________________________________
balance | group_id | cluster|
454452 | a | 1 |
5450441 | a | 1 |
5444531 | b | 1 |
5404051 | b | 1 |
5404501 | b | 1 |
5404041 | b | 1 |
544251 | b | 1 |
254252 | b | 1 |
541254 | c | 2 |
54123254 | d | 1 |
542541 | d | 1 |
5442341 | e | 2 |
541 | f | 1 |
________________________________
test <- data %>% group_by(group_id) %>% mutate(group = sample(1:16, 1))
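The random draw above keeps every row of a group_id in one cluster, but nothing stops some clusters from receiving far more rows than others. Here is a minimal base R sketch of one alternative, assuming you want roughly even row counts per cluster: take the groups from largest to smallest and give each one to the cluster that currently holds the fewest rows. This is a greedy heuristic, not an optimal partition. The names assign_clusters and n_clusters are made up for illustration; the data are the rows from the example table.

```r
# Example data from the question (balance, group_id)
data_extract <- data.frame(
  balance = c(454452, 5450441, 5444531, 5404051, 5404501, 5404041,
              544251, 254252, 541254, 54123254, 542541, 5442341, 541),
  group_id = c("a", "a", "b", "b", "b", "b", "b", "b", "c", "d", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map each group to one of n_clusters clusters,
# always giving the next-largest group to the cluster with the fewest rows
assign_clusters <- function(group_id, n_clusters) {
  counts <- table(group_id)
  sizes <- sort(setNames(as.vector(counts), names(counts)), decreasing = TRUE)
  load <- numeric(n_clusters)          # rows assigned to each cluster so far
  cluster_of <- integer(length(sizes))
  names(cluster_of) <- names(sizes)
  for (g in names(sizes)) {
    k <- which.min(load)               # lightest cluster (first one on ties)
    cluster_of[g] <- k
    load[k] <- load[k] + sizes[[g]]
  }
  unname(cluster_of[as.character(group_id)])
}

data_extract$cluster <- assign_clusters(data_extract$group_id, n_clusters = 2)
print(table(data_extract$cluster))
```

For 16 cores you would call it with n_clusters = 16. If balance by row count is not what matters (for example, if some groups are much more expensive to process per row), the sizes vector would have to be replaced by whatever cost estimate you trust.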
I've got data looking like this:
A | B | C
--------------
f | 1 | 1420h
f | 1 | 1540h
f | 3 | 600h
g | 2 | 900h
g | 2 | 930h
h | 1 | 700h
h | 3 | 400h
Now I want to create a new column which counts other rows in the data frame that meet certain conditions.
In this case I would like to know, for each row, how many other rows have the same combination of A and B and a value of C within a range of 100 around that row's C.
So the result with this data would be:
A | B | C | D
------------------
f | 1 | 1420 | 0
f | 1 | 1540 | 0
f | 3 | 600 | 0
g | 2 | 900 | 1
g | 2 | 930 | 1
h | 1 | 700 | 0
h | 3 | 400 | 0
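I actually came to a solution using for(for()). But the time R needs to compute the results is far too long.
for (i in 1:nrow(df)) {
  count <- 0
  for (p in 1:nrow(df)) {
    # count the other rows with the same A and B whose C is within 100 of row i
    if (p != i &&
        df[p, "A"] == df[i, "A"] &&
        df[p, "B"] == df[i, "B"] &&
        df[i, "C"] + 100 > df[p, "C"] &&
        df[p, "C"] > df[i, "C"] - 100) count <- count + 1
  }
  df[i, "D"] <- count
}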
Is there a better way?
Thanks a lot!
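One way to avoid the double loop, as a sketch rather than a definitive answer: split C by the A/B combination with ave() and compare all values inside each group at once with outer(). This assumes C is stored as a plain number (without the trailing "h") and that a row should not count itself. The names count_near and width are made up for illustration.

```r
# Example data from the question, with C as plain numbers
df <- data.frame(
  A = c("f", "f", "f", "g", "g", "h", "h"),
  B = c(1, 1, 3, 2, 2, 1, 3),
  C = c(1420, 1540, 600, 900, 930, 700, 400),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each value in x, count the other values
# that lie strictly within `width` of it
count_near <- function(x, width = 100) {
  close <- abs(outer(x, x, "-")) < width  # pairwise comparison within one group
  rowSums(close) - 1                      # subtract 1 so a row does not count itself
}

# ave() applies count_near separately to each combination of A and B
df$D <- ave(df$C, df$A, df$B, FUN = count_near)
print(df)
```

outer() builds an n-by-n matrix per group, so this is fast when the A/B groups are small but can run out of memory if a single group has many thousands of rows.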
I have a data frame of accesses to a website: several accesses per day, with different possible actions and descriptions of the actions.
People | Date | Time | Action | Descr |
| | | | |
j | 01/01/2010 | 10:13 | X | A |
j | 01/01/2010 | 10:15 | Y | B |
j | 02/01/2010 | 14:15 | Z | C |
j | 03/01/2010 | 11:45 | X | D |
j | 03/01/2010 | 13:56 | X | E |
j | 03/01/2010 | 18:43 | Z | F |
j | 03/01/2010 | 18:44 | X | A |
After reducing the data frame to a balanced daily panel, I need to create variables such that:
- the value of the first variable (FirstX) equals the description (Descr) of the first Action = X of the day (if available) and zero otherwise
- the value of the second variable equals the description of the second Action = X of the day and zero otherwise
- and so on
Once I have transformed it into a balanced daily panel (which I can do), I need a final result that looks like this:
People | Date |Accesses| First X|Second X| Third X| Fourth X |
| | | | | | |
j | 01/01/2010 | 2 | A | 0 | 0 | 0 |
j | 02/01/2010 | 1 | 0 | 0 | 0 | 0 |
j | 03/01/2010 | 4 | D | E | A | 0 |
You can do it using the dplyr package:
library(dplyr)
df %>%
  group_by(People, Date) %>%
  summarise(Accesses = n(),
            FirstX  = ifelse(sum(Action == "X") >= 1, Descr[Action == "X"][1], "0"),
            SecondX = ifelse(sum(Action == "X") >= 2, Descr[Action == "X"][2], "0"),
            ThirdX  = ifelse(sum(Action == "X") >= 3, Descr[Action == "X"][3], "0"),
            FourthX = ifelse(sum(Action == "X") >= 4, Descr[Action == "X"][4], "0"))
This returns:
  People Date       Accesses FirstX SecondX ThirdX FourthX
  <chr>  <chr>         <int> <chr>  <chr>   <chr>  <chr>
1 j      01/01/2010        2 A      0       0      0
2 j      02/01/2010        1 0      0       0      0
3 j      03/01/2010        4 D      E       A      0
Note that you cannot have numeric 0s and characters in the same vector, so I put character "0"s in the FirstX, SecondX, ... columns.
I found a solution myself. I am posting it here in case it is useful to somebody.
# create a temp variable to be used for the count (just a vector of all the
# numbers from 1 to N, where N is the number of rows)
subset$temp_var1 <- seq_len(nrow(subset))
# generate a variable which starts counting from one and starts again
# every time "date" or "people" change
subset$count <- ave(subset$temp_var1, subset$date,
                    subset$people, FUN = seq_along)
# drop variable "Action"
subset <- subset(subset, select = c("people", "date",
                                    "descr", "count"))
# reshape to wide format: one "descr" column per value of "count"
subset <- reshape(subset, idvar = c("people", "date"),
                  timevar = "count", direction = "wide")
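For anyone who wants to try the ave() plus reshape() approach on the sample data from the question, here is a minimal self-contained sketch. It rests on a few assumptions of mine: the rows are first filtered to Action = X so that the counter numbers only the X actions, the days without any X are brought back by merging with the daily access counts, and missing descriptions are filled with the character "0". The object names acc, x_only, wide and panel are made up for illustration.

```r
# Sample data from the question (Time omitted, rows already in time order)
acc <- data.frame(
  people = "j",
  date   = c("01/01/2010", "01/01/2010", "02/01/2010", "03/01/2010",
             "03/01/2010", "03/01/2010", "03/01/2010"),
  action = c("X", "Y", "Z", "X", "X", "Z", "X"),
  descr  = c("A", "B", "C", "D", "E", "F", "A"),
  stringsAsFactors = FALSE
)

# Keep only the X actions and number them within each person and day
x_only <- acc[acc$action == "X", c("people", "date", "descr")]
x_only$count <- ave(seq_len(nrow(x_only)), x_only$people, x_only$date,
                    FUN = seq_along)

# One column per X action of the day: descr.1, descr.2, ...
wide <- reshape(x_only, idvar = c("people", "date"),
                timevar = "count", direction = "wide")

# Daily number of accesses (all actions), then add the X descriptions
accesses <- aggregate(action ~ people + date, data = acc, FUN = length)
names(accesses)[3] <- "accesses"
panel <- merge(accesses, wide, by = c("people", "date"), all.x = TRUE)

# Replace missing descriptions with the character "0"
descr_cols <- grep("^descr\\.", names(panel), value = TRUE)
panel[descr_cols] <- lapply(panel[descr_cols],
                            function(v) ifelse(is.na(v), "0", v))
print(panel)
```

reshape() only creates as many descr columns as the busiest day needs, so with this sample there are three (descr.1 to descr.3) and no fourth column.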
Programmers,
I have some difficulties in structuring my panel data set.
My panel data set currently has the following structure, shown here as an example with only T = 2 and N = 3. (My real data set, however, is of size T = 6 and N = 20 000 000.)
Panel data structure 1:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
1 | 2 | C | ... | D |
1 | 3 | E | ... | F |
2 | 1 | G | ... | H |
2 | 2 | I | ... | J |
2 | 3 | K | ... | L |
The desired structure is:
Panel data structure 2:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
2 | 1 | G | ... | H |
1 | 2 | C | ... | D |
2 | 2 | I | ... | J |
1 | 3 | E | ... | F |
2 | 3 | K | ... | L |
This is the classic panel data structure, in which each individual's yearly observations over the whole period are stored together, block by block.
My question: is there a simple and efficient R solution that changes the data structure from structure 1 to structure 2 for very large data sets (data.frame)?
Thank you very much for all responses in advance!!
Enrico
You can reorder the rows of your data frame using order(), sorting by ID first and by Year within each ID:
df <- df[order(df$ID, df$Year), ]
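As a quick check on the small example from the question (only Variable_1 and Variable_k are filled in, since the columns in between are elided there), the order() call turns structure 1 into structure 2:

```r
# Panel data structure 1 from the question: sorted by Year, then ID
df <- data.frame(
  Year = c(1, 1, 1, 2, 2, 2),
  ID   = c(1, 2, 3, 1, 2, 3),
  Variable_1 = c("A", "C", "E", "G", "I", "K"),
  Variable_k = c("B", "D", "F", "H", "J", "L"),
  stringsAsFactors = FALSE
)

# Sort by ID first, then by Year within each ID
df <- df[order(df$ID, df$Year), ]
rownames(df) <- NULL  # optional: renumber the rows after sorting
print(df)
```

Indexing with order() builds a sorted copy of the data frame, so with T = 6 and N = 20 000 000 you need enough memory for a second copy. If that becomes a problem, the data.table package's setorder() sorts by reference instead, though I have not timed either approach on data of that size.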