I have a dataframe as below:
**df**
Cust_name time freq
Andrew 0 4
Dillain 1 2
Alma 2 3
Andrew 1 4
Kiko 2 1
Sarah 2 8
Sarah 0 3
I want to calculate the sum of freq over a given time range for each cust_name. Example: if I select the time range 0 to 2 for Andrew, it should give me the sum of freq: 4+4 = 8. And for Sarah, it should give me 8+3 = 11. I have tried the following just to get the time range, but I don't know how to do the rest, as I am very new to R:
df[(df$time>=0 & df$time<=2),]
You can do this with dplyr.
To make your example reproducible, you should include the code that creates your dataframe in your post. Copying and pasting everything by hand is time-consuming.
library(dplyr)
df <- data.frame(
  cust_name = c('Andrew', 'Dillain', 'Alma', 'Andrew', 'Kiko', 'Sarah', 'Sarah'),
  time = c(0, 1, 2, 1, 2, 2, 0),
  freq = c(4, 2, 3, 4, 1, 8, 3)
)

df %>%
  filter(time >= 0, time <= 2) %>%
  group_by(cust_name) %>%
  summarise(sum_freq = sum(freq))
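If the time range needs to be selectable, you can wrap the same pipeline in a small helper function; a minimal sketch (the function name and arguments are mine, not part of your code):

sum_freq_in_range <- function(data, t_min, t_max) {
  data %>%
    filter(time >= t_min, time <= t_max) %>%   # keep rows inside the requested range
    group_by(cust_name) %>%
    summarise(sum_freq = sum(freq))
}

sum_freq_in_range(df, 0, 2)
# cust_name sum_freq
# Alma             3
# Andrew           8
# Dillain          2
# Kiko             1
# Sarah           11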
Related
I'm trying to open another column and find the growth rate of the facevalue column per day in percentage
Day  FaceValue
1    ₦72,077,680.94
2    ₦112,763,770.99
3    ₦118,146,250.01
4    ₦74,446,035.80
5    ₦77,026,183.71
Here is the code, but it's not working:
value_performance%>%
mutate(change=(value_performance$FaceValue-lag(FaceValue,5))/lag(FaceValue,5)*100)
Thanks
Three problems:
1. FaceValue appears to be a string, not numeric; fix that first with as.numeric.
2. (Almost) never use value_performance$ inside a dplyr pipe verb. ("Almost" because there are rare times when you need it. Otherwise you are at best being inefficient, and possibly using incorrect values, depending on what happens earlier in the pipe.)
3. You say "per day" but you are lagging by 5. While I'm assuming your real data has more than 5 rows, you are still not calculating a by-day change.
Try this.
value_performance %>%
  mutate(
    FaceValue = as.numeric(gsub("[^0-9.]", "", FaceValue)),  # strip the currency symbol and commas
    change = (FaceValue - lag(FaceValue)) / lag(FaceValue)
  )
# Day FaceValue change
# 1 1 7.21e+07 NA
# 2 2 1.13e+08 0.5645
# 3 3 1.18e+08 0.0477
# 4 4 7.44e+07 -0.3699
# 5 5 7.70e+07 0.0347
With similar data:
Day <- c(1,2,3,4,5)
FaceValue <- c(72077680.94, 112763770.99, 118146250.01, 74446035.80, 77026183.71)
df <- data.frame(Day, FaceValue)
df
df %>%
  mutate(change = 100 * (FaceValue / lag(FaceValue) - 1))
Results in:
Day FaceValue change
1 1 72077681 NA
2 2 112763771 56.447557
3 3 118146250 4.773234
4 4 74446036 -36.988236
5 5 77026184 3.465796
Not sure what is wrong. Maybe check your data classes and make sure FaceValue is numeric.
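For example, a quick diagnostic in base R:

str(df)              # shows each column's class
class(df$FaceValue)  # should be "numeric"; "character" or "factor" means it needs converting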
Hello coding community,
I have a two-part question that is half answered:
transpose, aka melt, the data frame to my liking - done
add rows of data based on results found in the "removed" column, a column created in the transposing step - stuck here
df<- read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t")
df_transformed<-tidyr::gather(df, day, removed, -(1:2), na.rm = TRUE) # melted data
In my example here (df), I have an experiment run over 8 days. On certain days I remove data points, and I am only interested in those days (hence the na.rm = TRUE in the transposing step). Sometimes I remove 1 data point, sometimes 4 (but it could be any number really).
I would like the removed data points to be called "individuals", and for them to be counted in chronological order. Therefore, I first need to add a column called "individuals"
df_transformed$individual <- ""
I would like to fill in the "individual" column based on the results in the "removed" column.
Example: cage 2 had only 1 data point removed, on day_8, so I would add a 1 in its "individual" column. Cage 4, on the other hand, had data points removed on day_5 (1 data point) and day_7 (3 data points), for a total of 4 data points, i.e. 4 "individuals". So for cage 4, day_5 gets a 1 in the "individuals" column, and day_7 becomes 3 rows of data that continue the individual count with 2, 3, 4. If day_8 had 3 more data points removed, the count would continue with 5, 6, 7.
My desired result for my example data set today would be this:
desired_results <- read.table("https://pastebin.com/raw/r7QrC0y3", header=T, sep="\t") # 68 total rows of data
Interesting piece of information: The total number of rows in my final data set should equal the sum of all removed data points:
sum(df_transformed$removed) # 68
Thank you StackOverflow community. Looking forward to seeing the results.
We can use complete to create a sequence from 1 to the removed count, grouped by cage and day. We then fill the NA values in the experiment and removed columns.
library(dplyr)
library(tidyr)
df_transformed %>%
  mutate(individual = removed) %>%
  group_by(cage, day) %>%
  complete(individual = seq_len(individual)) %>%
  fill(experiment, removed, .direction = "up")
# cage day individual experiment removed
#1 2 day_8 1 sugar 1
#2 3 day_5 1 sugar 1
#3 4 day_5 1 sugar 3
#4 4 day_5 2 sugar 3
#5 4 day_5 3 sugar 3
#6 4 day_7 1 sugar 1
#7 7 day_7 1 sugar 1
#8 7 day_8 1 sugar 1
#9 8 day_5 1 sugar 2
#10 8 day_5 2 sugar 2
# … with 58 more rows
To update individual only based on cage we can do
df_transformed %>%
  mutate(individual = removed) %>%
  group_by(cage, day) %>%
  complete(individual = seq_len(individual)) %>%
  group_by(cage) %>%
  mutate(individual = row_number()) %>%
  fill(experiment, removed, .direction = "up")
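As a sanity check, the completed data should have one row per removed data point, matching the 68-row expectation from the question. A quick check, storing the completed frame in a variable (out is a name I'm introducing):

out <- df_transformed %>%
  mutate(individual = removed) %>%
  group_by(cage, day) %>%
  complete(individual = seq_len(individual))

nrow(out) == sum(df_transformed$removed)  # should be TRUE (68 rows)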
I think the following bit of code does what you need:
library(tidyverse)
read.table("https://pastebin.com/raw/NEPcUG01", header = T, sep = "\t") %>%
  pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
  # drop_na() %>%
  group_by(cage) %>%
  summarize(individual = sum(removed, na.rm = TRUE))
I have used the pipe operator (%>%), which makes for cleaner syntax, and the newer pivot_longer function instead of gather. Then, grouping by cage and summing the removed column with summarize, you get how many individuals were removed per cage.
I checked the sum of all the individuals and it seems to work:
read.table("https://pastebin.com/raw/NEPcUG01", header = T, sep = "\t") %>%
  pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
  # drop_na() %>%
  group_by(cage) %>%
  summarize(individual = sum(removed, na.rm = TRUE)) %>%
  pull(individual) %>%
  sum()
#> [1] 68
The result is slightly different from your desired result, and I am not 100% sure your desired result is actually correct... From your question, I understand that cage 4 should have 4 individuals, but in your desired_results it appears 4 times, with the values 1, 2, 3 and 4. The code I sent generates a data frame where each cage appears in a single row.
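If you do want one row per removed data point, numbered per cage as in desired_results, tidyr's uncount can expand the counts. A sketch building on the same pipeline (with the tidyverse loaded as above, and assuming the rows are already in chronological day order after pivoting):

read.table("https://pastebin.com/raw/NEPcUG01", header = T, sep = "\t") %>%
  pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
  drop_na() %>%
  uncount(removed, .remove = FALSE) %>%  # one row per removed data point
  group_by(cage) %>%
  mutate(individual = row_number()) %>%  # 1, 2, 3, ... within each cage
  ungroup()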
I have a data frame like this:
NUM_TURNO CODIGO_MUNICIPIO SIGLA_PARTIDO SHARE
1 1 81825 PPB 38.713318
2 1 81825 PMDB 61.286682
3 1 09717 PMDB 48.025900
4 1 09717 PL 1.279217
5 1 09717 PFL 50.694883
6 1 61921 PMDB 51.793868
This is a data.frame of elections in Brazil. Grouping by NUM_TURNO and CODIGO_MUNICIPIO, I want to compare the SHARE of the FIRST and SECOND most voted parties in each city and round (1 or 2) and create a new column.
Where am I stuck? I don't know how to calculate the difference between only the two biggest SHARE values.
For the first case, for example, I want something that gives me the difference between 61.286682 and 38.713318 = 22.573364, and so on.
Something like this:
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  mutate(Diff = HIGHEST SHARE - 2nd HIGHEST SHARE)  # pseudocode
You can also use top_n from dplyr together with grouping and summarizing. Keep in mind that with the data you provided, diff on a single value raises an error in summarize, hence the use of ifelse.
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  top_n(2, SHARE) %>%
  summarize(Diff = ifelse(n() == 1, NA, diff(SHARE)))
# A tibble: 3 x 3
# Groups: NUM_TURNO [?]
NUM_TURNO CODIGO_MUNICIPIO Diff
<dbl> <dbl> <dbl>
1 1 9717 2.67
2 1 61921 NA
3 1 81825 22.6
You could arrange your dataframe by SHARE and then slice the first two values, then use summarise to get the difference between the values for every group:
library(dplyr)
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  arrange(desc(SHARE)) %>%
  slice(1:2) %>%
  summarise(Diff = -diff(SHARE))
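As with the top_n answer above, a group with only one row leaves diff with nothing to subtract, so the same ifelse guard applies. A guarded variant of this chain:

df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  arrange(desc(SHARE)) %>%
  slice(1:2) %>%
  summarise(Diff = ifelse(n() == 1, NA, -diff(SHARE)))  # NA for single-candidate groups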
I'm new to R, so please go easy on me... I have some longitudinal data that looks like
Basically, I'm trying to find a way to get a table with a) the number of unique cases that have all complete data and b) the number of unique cases that have at least one incomplete or missing entry. The end result would ideally be
df<- df %>% group_by(Location)
df1<- df %>% group_by(any(Completion_status=='Incomplete' | 'Missing'))
I'm not sure exactly what you want, because there seems to be some inconsistency between your request and the desired output. However, let's try: it seems you need a kind of frequency table, which you can manage with base R. At the bottom of the answer you can find some data similar to yours.
# You have two cases, Complete and everything else, so here is a new column flagging that:
data$case <- ifelse(data$Completion_status =='Complete','Complete', 'MorIn')
# now a frequency table about them: if you want a data.frame, here we go
result <- as.data.frame.matrix(table(data$Location,data$case))
# now the location as a new column rather than the rownames
result$Location <- rownames(result)
# and lastly a data.frame with the final results: note that you can change the names
# of the columns but if you want spaces maybe a tibble is better
result <- data.frame(
  Location = result$Location,
  `Number.complete` = result$Complete,
  `Number.incomplete.missing` = result$MorIn
)
result
Location Number.complete Number.incomplete.missing
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
Or if you prefer a dplyr chain:
data %>%
  mutate(case = ifelse(Completion_status == 'Complete', 'Complete', 'MorIn')) %>%
  do(as.data.frame.matrix(table(.$Location, .$case))) %>%
  mutate(Location = rownames(.)) %>%
  select(3, 1, 2) %>%
  `colnames<-`(c("Location", "Number of complete", "Number of incomplete or missing"))
Location Number of complete Number of incomplete or missing
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
With data:
# here is your data (next time try to include it in a usable form in the question)
data <- data.frame(
  ID = c("A1","A1","A2","A2","B1","C1","C2","D1","D2","E1"),
  Location = c('Paris','Paris','Paris','Paris','London','Toronto','Toronto','Phoenix','Phoenix','Los Angeles'),
  Completion_status = c('Complete','Complete','Incomplete','Complete','Incomplete','Missing',
                        'Complete','Incomplete','Incomplete','Missing')
)
I have a data set in which each column is a variable and each row is an observation (like time series data). It looks like this (I apologize for the format, but I can't show the data):
I'd like to know if a person or group is saying the same thing(s) over time. I'm familiar with n-grams, but it's not quite what I need. Any help would be appreciated.
This is the output I'd like:
Sorry for all the edits and poor comments; still getting used to the website.
If you want to see the frequency of each comment for each person, together with the Ready column, you can do it with the following code:
library(dplyr)  # load dplyr first: the %>% pipe is used just below
set.seed(123456)
### I use the same data as the previous example, thank you for providing it!
data <- data.frame(
  date = Sys.Date() - sample(100),
  Group = c("Cars","Trucks") %>% sample(100, replace = T),
  Reporting_person = c("A","B","C") %>% sample(100, replace = T),
  Comments = c("Awesome","Meh","NC") %>% sample(100, replace = T),
  Ready = as.character(c("Yes","No") %>% sample(100, replace = T))
)
data %>%
  group_by(Reporting_person, Ready) %>%
  count(Comments) %>%
  mutate(prop = prop.table(n))
If what you are asking is whether a change occurs in the comments over time, and whether that change is correlated with an event (like Ready), you can use the following code:
library(dplyr)
### Creating a column with the previous comment (lag) within each group
new = data %>%
  arrange(Reporting_person, Group, date) %>%
  group_by(Group, Reporting_person) %>%
  mutate(comments_plusone = lag(Comments))
new = na.omit(new)
### Creating the Change column: 1 is a change, 0 no change
new$Change = as.numeric(new$Comments != new$comments_plusone)
### Test the association between Change and the event...
### Chi-squared test of association between the event and the change.
### Note that Pearson correlation is not pertinent here:
tbl <- table(new$Ready, new$Change)
chi2 = chisq.test(tbl, correct = F)
c(chi2$statistic, chi2$p.value)
sqrt(chi2$statistic / sum(tbl))  # phi coefficient, the effect size for a 2x2 table
You should find no significant association with this example, as you can see when you plot the table.
plot(tbl)
Note that the cor function is not appropriate when working with two binary variables.
Here is a post on this topic: Correlation between two binary
Frequency of change by change of state
Following your comments, I am adding this code:
newR = data %>%
  arrange(Reporting_person, Group, date) %>%
  group_by(Group, Reporting_person) %>%
  mutate(Ready_plusone = lag(Ready))
newR = na.omit(newR)
### Add the state-change column to the data frame.
### I paste the two states together because you seem to have more than 2 levels.
### new and newR hold the same rows in the same order, so the columns line up.
new$State_change = paste(newR$Ready, newR$Ready_plusone, sep = "_")
### Getting the frequency of change by change of state (Ready Yes-No, No-Yes, ...)
result <- new %>%
  group_by(Reporting_person, State_change) %>%
  count(Change) %>%
  mutate(Frequence = prop.table(n)) %>%
  filter(Change == 1)
### tidyr is a great library for reshaping data; you want the wide format of the previous
### long data frame. However, doing this will generate a lot of NAs, so if I were you I would
### keep the result format instead, but this could be helpful for future needs, so here you go.
library(tidyr)
final = as.data.frame(spread(result, key = State_change, value = Frequence))[, c(1, 4:7)]
Hope this helps :)
Something like this?
library(dplyr)  # for the %>% pipe
df <- data.frame(
  date = Sys.Date() - sample(10),
  Group = c("Cars","Trucks") %>% sample(10, replace = T),
  Reporting_person = c("A","B","C") %>% sample(10, replace = T),
  Comments = c("Awesome","Meh","NC") %>% sample(10, replace = T)
)
# date Group Reporting_person Comments
# 1 2017-06-08 Trucks B Awesome
# 2 2017-06-05 Trucks A Awesome
# 3 2017-06-14 Cars B Meh
# 4 2017-06-06 Cars B Awesome
# 5 2017-06-11 Cars A Meh
# 6 2017-06-07 Cars B NC
# 7 2017-06-09 Cars A NC
# 8 2017-06-10 Cars A NC
# 9 2017-06-13 Trucks C Awesome
# 10 2017-06-12 Trucks B NC
aggregate(date ~ .,df,length)
# Group Reporting_person Comments date
# 1 Trucks A Awesome 1
# 2 Cars B Awesome 1
# 3 Trucks B Awesome 1
# 4 Trucks C Awesome 1
# 5 Cars A Meh 1
# 6 Cars B Meh 1
# 7 Cars A NC 2
# 8 Cars B NC 1
# 9 Trucks B NC 1
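For comparison, the dplyr equivalent of that aggregate call is count, which should give the same tallies (in a column named n rather than date):

library(dplyr)
df %>%
  count(Group, Reporting_person, Comments)  # one row per combination, with its frequency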