All data in one column

All data in one column - r

I have this soccer data all in one column.
Round 36 # Round of the league------------------------------------
29.07. 20:45 # Date and time of the match
Barcelona # Home Team
4 - 1 # FT result
Getafe # Away team
(2 - 0) # HT result
29.07. 20:45 # *date of the second match of the round*
Valencia
2 - 3
Laci
(1 - 2)
Round 35 # repeating pattern -------------------------------------------------
How can I move all the data in a certain round of the league in a new column? e.g. I want all observation from the Round 36 observation to the Round 35 observation in a single a column and so on.
I really do not have any idea how to solve this. I tried to transpose the data so that I could work better with observations as variables but still nothing. I am just a beginner in R and would appreciate any help.
thanks

Assuming your data is within a variable named lines (eg, lines[1] = Round 36 is the first entry, lines[2] = 29.07. 2045 is the next entry and so forth), we can spot the lines, split the vector into a list and then finally bind it into a data.frame (assuming they have equal length, if not you will have to do some manual work)
#Figure out where each round is.
rounds <- grepl('^Round', lines)
# Split it into seperate list. cumsum(rounds) will be an index for each group.
data <- split(lines, cumsum(rounds))
# Bind the data into a data.frame (assuming all have the same amount of data)
bound <- do.call(rbind, data)
Of course without a reproducible example it is hard to test the final result.
Note that if the soccer data does not have equal amount of data between rounds or if the data does not come in the same order, the resulting data.frame may not make immediate sense (if round 45 has 7 elements but round 46 has 4, round 46 will recycle element 1, 2 and 3 to fill out the missing values), but it might make it simpler to do some follow up data cleaning.

Related

Composing a data.frame from loop-generated sequences

I have a data set which is made up of observations of the weights of fish, the julian dates they were captured on, and their names. I am seeking to assess what the average growth rate of these fish is according to the day of the year (julian date). I believe the best method to do this is to compose a data.frame with two fields: "Julian Date" and "Growth". The idea is this: for a fish which is observed on January 1 (1) at weight 100 and a fish observed again on April 10 (101) at weight 200, the growth rate would be 100g/100days, or 1g/day. I would represent this in a data.frame as 100 rows in which the "Julian Date" column is composed of the Julian date sequence (1:100) and the "Growth" column is composed of the average growth rate (1g/day) over all days.
I have attempted to compose a for loop which passes through each fish, calculates the average growth rate, then creates a list in which each index contains the sequence of Julian dates and the growth rate (repeated the number of times equal to the length of the Julian date sequence). I would then utilize the function to compose my data.frame.
growth_list <- list() # initialize empty list
p <- 1 # initialize increment count
# Looks at every other fish ID beginning at 1 (all even-number observations are the same fish at a later observation)
for (i in seq(1, length(df$FISH_ID), by = 2)){
rate <- (df$growth[i+1]-df$growth[i])/(as.double(df$date[i+1])-as.double(df$date[i]))
growth_list[[p]] <- list(c(seq(as.numeric(df$date[i]),as.numeric(df$date[i+1]))), rep(rate, length(seq(from = as.numeric(df$date[i]), to = as.numeric(df$date[i+1])))))
p <- p+1 # increase to change index of list item in next iteration
}
# Converts list of vectors (the rows which fulfill above criteria) into a data.frame
growth_df <- do.call(rbind, growth_list)
My expected results can be illustrated here: https://imgur.com/YXKLkpK
My actual results are illustrated here: https://imgur.com/Zg4vuVd
As you can see, the actual results appear to be a data.frame with two columns specifying the type of the object, as well as the length of the original list item. That is, row 1 of this dataset contained 169 days between observations, and therefore contained 169 julian dates and 169 repetitions of the growth rate.

Instead of list(), use data.frame() with named columns to build a list of data frames to be row binded at the end:
growth_list <- vector(mode="list", length=length(df$FISH_ID)/2)
for (i in seq(1, length(df$FISH_ID), by=2)){
rate <- with(df, (growth[i+1]-growth[i])/(as.double(date[i+1])-as.double(date[i])))
date_seq <- seq(as.numeric(df$date[i]), as.numeric(df$date[i+1]))
growth_list[[p]] <- data.frame(Julian_Date = date_seq,
Growth_Rate = rep(rate, length(date_seq))
p <- p + 1
}
growth_df <- do.call(rbind, growth_list)

Welcome to stackoverflow
Couple things about your code:
I recommend using the apply function instead of the for loop. You can set parameters in apply to perform row-wise functions. It makes you code run faster. The apply family of functions also creates a list for you, which reduces the code you write to make the list and populate it.
It is common to supply users with a snippet example of your initial data to work with. Sometimes the way we describe our data is not representative of our actual data. This tradition is necessary to alleviate any communication errors. If you can, please manufacture a dummy dataset for us to use.
Have you tried using as.data.frame(growth_list), or data.frame(growth_list)?
Another option is to use an if else statement within your for loop that performs the rbind function. This would look something like this:
#make a row-wise for loop
for(x in 1:nrow(i)){
#insert your desired calculations here. You can turn the rows into their own dataframe by using this, which may make it easier to perform your calculations:
dataCurrent <- data.frame(i[x,])
#finish with something like this to turn your calculations for each row into an output dataframe of your choice.
outFish <- cbind(date, length, rate)
#make your final dataframe as follows
if(exists("finalFishOut") == FALSE){
finalFishOut <- outFish
}else{
finalFishOut <- rbind(finalFishOut, outFish)
}
}
Please update with a snippet of data and I'll update this answer with your exact solution.

Here is a solution using dplyr and plyr with some toy data. There are 20 fish, with a random start and end time, plus random weights at each time. Find the growth rate over time, then create a new df for each fish with 1 row per day elapsed and the daily average growth rate, and output a new df containing all fish.
df <- data.frame(fish=rep(seq(1:20),2),weight=sample(c(50:100),40,T),
time=sample(c(1:100),40,T))
df1 <- df %>% group_by(fish) %>% arrange(time) %>%
mutate(diff.weight=weight-lag(weight),
diff.time=time-lag(time)) %>%
mutate(rate=diff.weight/diff.time) %>%
filter(!is.na(rate)) %>%
ddply(.,.(fish),function(x){
data.frame(time=seq(1:x$diff.time),rate=x$rate)
})
head(df1)
fish time rate
1 1 1 -0.7105263
2 1 2 -0.7105263
3 1 3 -0.7105263
4 1 4 -0.7105263
5 1 5 -0.7105263
6 1 6 -0.7105263
tail(df1)
fish time rate
696 20 47 -0.2307692
697 20 48 -0.2307692
698 20 49 -0.2307692
699 20 50 -0.2307692
700 20 51 -0.2307692
701 20 52 -0.2307692

Complex data calculation for consecutive zeros at row level in R (lag v/s lead)

I have a complex calculation that needs to be done. It is basically at a row level, and i am not sure how to tackle the same.
If you can help me with the approach or any functions, that would be really great.
I will break my problem into two sub-problems for simplicity.
Below is how my data looks like
Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0
I have data for each Group at Monthly Level.
I would like to capture the below two things.
1. The count of consecutive zeros for each row to-and-fro from lag0(reference)
The highlighted yellow are the cases, that are consecutive with lag0(reference) to a certain point, that it reaches first 1. I want to capture the count of zero's at row level, along with the corresponding Sales value.
Below is the output i am looking for the part1.
Output:
Month,Sales,Count
1,2503,9
2,3734,3
3,6631,5
4,8606,0
5,1889,6
6,4819,1
7,5120,1
2. Identify the consecutive rows(row:1,2 and 3 & similarly row:5,6) where overlap of any lag or lead happens for any 0 within the lag0(reference range), and capture their Sales and Month value.
For example, for row 1,2 and 3, the overlap happens at atleast lag:3,2,1 &
lead: 1,2, this needs to be captured and tagged as case1 (or 1). Similarly, for row 5 and 6 atleast lag1 is overlapping, hence this needs to be captured, and tagged as Case2(or 2), along with Sales and Month value.
Now, row 7 is not overlapping with the previous or later consecutive row,hence it will not be captured.
Below is the result i am looking for part2.
Month,Sales,Case
1,2503,1
2,3734,1
3,6631,1
5,1889,2
6,4819,2
I want to run this for multiple groups, hence i will either incorporate dplyr or loop to get the result. Currently, i am simply looking for the approach.
Not sure how to solve this problem. First time i am looking to capture things at row level in R. I am not looking for any solution. Simply looking for a first step to counter this problem. Would appreciate any leads.

An option using rle for the 1st part of the calculation can be as:
df$count <- apply(df[,-c(1:4)],1,function(x){
first <- rle(x[1:7])
second <- rle(x[9:15])
count <- 0
if(first$values[length(first$values)] == 0){
count = first$lengths[length(first$values)]
}
if(second$values[1] == 0){
count = count+second$lengths[1]
}
count
})
df[,c("Month", "Sales", "count")]
# Month Sales count
# 1 1 2503 9
# 2 2 3734 3
# 3 3 6631 5
# 4 4 8606 0
# 5 5 1889 6
# 6 6 4819 1
# 7 7 5120 1
Data:
df <- read.table(text =
"Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")

List all possible occurrences within a column?

I am trying to merge a data.frame and a column from another data.frame, but have so far been unsuccessful.
My first data.frame [Frequencies] consists of 2 columns, containing 47 upper/ lower case alpha characters and their frequency in a bigger data set. For example purposes:
Character<-c("A","a","B","b")
Frequency<-(100,230,500,420)
The second data.frame [Sequences] is 93,000 rows in length and contains 2 columns, with the 47 same upper/ lower case alpha characters and a corresponding qualitative description. For example:
Character<-c("a","a","b","A")
Descriptor<-c("Fast","Fast","Slow","Stop")
I wish to add the descriptor column to the [Frequencies] data.frame, but not the 93,000 rows! Rather, what each "Character" represents. For example:
Character<-c("a")
Frequency<-c("230")
Descriptor<-c("Fast")

Following can also be done:
> merge(adf, bdf[!duplicated(bdf$Character),])
Character Frequency Descriptor
1 a 230 Fast
2 A 100 Fast
3 b 420 Stop
4 B 500 Slow

Why not:
df1$Descriptor <- df2$Descriptor[ match(df1$Character, df2$Character) ]

Calculate sum of counts per min from data frame in R

I've been trying to figure this out for a while, but haven't been able to do so. I found a lot of similar questions which didn't help at all.
I have around 43000 records in data frame in R. The date column is in the format "2011-11-15 02:00:01", and the other column is the count. The structure of the data frame:
str(results)
'data.frame': 43070 obs. of 2 variables:
$ dates: Factor w/ 43070 levels "2011-11-15 02:00:01",..: 1 2 3 4 5 6 7 8 9 10 ...
$ count: num 1 2 1 1 1 1 2 3 1 2 ...
How can I get the total count per min?
And I also want to convert the results data frame into json. I used rjson package which converted the entire data frame as a single json element. When I inserted into mongodb, there was only on _id for all 43000 records. What did I do wrong?

You can use the xts package to get the counts/minute quite easily.
install.packages("xts")
require("xts")
results_xts <- xts(results$count, order.by = as.POSIXlt(results$dates))
This converts your dataframe to an xts object. There are a bunch of functions (apply.daily, apply.yearly, etc) in xts that apply functions to different time frames, but there isn't one for by minute. Fortunately the code for those functions is super simple, so just run
ep <- endpoints(results_xts, "minutes")
period.apply(results_xts, ep, FUN = sum)
Sorry, I don't know the answer to your other question.

Asterisk here, untested, but here is my solution for getting the counts per minute, maybe someone will chime in on the json part, I'm not familiar with that
here's my example time series and count
tseq<-seq(now,length.out=130, by="sec")
count<-rep(1, 130)
we find the index of where our minutes switch via the following
mins<-c(0,diff(floor(cumsum(c(0,diff(tseq)))/60)))
indxs<-which(mins%in%1)
Let me break that down (as there are many things nested in there).
First we diff over the time sequence, then add a 0 on the front because we lose an observation with diff
Second, sum the diff-ed vector, giving us the seconds value in each spot (this could probably also be done by a simple format call over the vector of times too)
Third, divide that vector, now the seconds in each spot, by 60 so we get a value in each spot corresponding to the minutes.
Fourth, floor it so we get integers
diff that vector so we get 0's except for 1's where the minute switches
add a 0 to that vector since we lose an observation with the diff
then get the indeces of the 1's with the which call
then we find the start and ends to our minutes
startpoints<-indxs
endpoints<-c(indxs[2:length(indxs)], length(mins))
then we simply sum over the corresponding subset
mapply(function(start, end) sum(count[start:end]), start=startpoints, end=endpoints)
#[1] 61 10
We get 61 for the first point because we include the 0th and 60th second for the first subset

R Searching for elements and their index in an array

I have a matrix with 2 columns as described below:
TIME PRICE
10 45
11 89
13 89
15 12
16 09
17 34
19 89
20 90
23 21
26 09
in the above matrix, I need to iterate through the TIME column adding 5 seconds and accessing the corresponding PRICE that matches the row.
For ex: I start with 10. i need to access 15 (10+5), I would've been able to get to 15 easily if the numbers in the column were continuous data, but its not. so at 15 seconds time, i need to get hold of the corresponding price. and this goes on till the end of the entire data set. my next element that needs to be accessed is 20, and its corresponding price. now i again add 5 seconds and it hence goes on. incase the element is not present, the one immediately greater than it must be accessed to obtain the corresponding price.

If the rows you want to extract are m[1,1]+5, m[1,1]+10, m[1,1]+15 etc then:
m <- cbind(TIME=c(10,11,13,15,16,17,19,20,23,26),
PRICE=c(45,89,89,12,9,34,89,90,21,9))
r <- range(m[,1]) # 10,26
r <- seq(r[1]+5, r[2], 5) # 15,20,25
r <- findInterval(r-1, m[,1])+1 # 4,8,10 (values 15,20,26)
m[r,2] # 12,90,9
findInterval finds the index for values that are equal or less than the given value, so I give it a smaller value and then add 1 to the index.

Breaking the question apart into sub-pieces...
Getting the row with value 15:
Call your Matrix, say, DATA, and
[1] extract the row of interest:
DATA[DATA[,1] == 15, ]
Then snag the second column.
[2] Adding 5 to the first column ( I'm pretty sure you can just do this ):
DATA[,1] = DATA[,1] + 5
This should get you started. The rest seems to just be some funky iteration, incrementing by 5, using [1] to get the price you want each time, swapping 15 for some variable.
I leave the rest of the solution as an exercise to the reader. For tips on looping in R, and more, see the below tutorial ( I don't expect it to be taken down any time soon, but may want to keep a local copy. Good luck :) )
http://www.stat.berkeley.edu/users/vigre/undergrad/reports/VIGRERintro.pdf

As #Tommy commented above, it is not clear what TIME you exactly want to get. For me, it seems like you want to get the PRICE for the sequence 10,15,20,25,... If true, you could do that easily suing the mod (%%) function:
TIME <- c(10,11,13,15,16,17,19,20,23,26) # Your times
PRICE <- c(45,89,89,12,9,34,89,90,21,9) # your prices
PRICE[TIME %% 5 == 0] # Get prices from times in sequence 10, 15, 20, ...

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex