Extract values from count() - r

I get a freq table, but can I save this table in a csv file or - better - sort it or extract the biggest values?
library(plyr)
count(birthdaysExample, 'month')

I'm guessing at what the relevant part of your data looks like, but in any case this should get you a frequency table sorted by values:
library(plyr)
birthdaysExample <- data.frame(month = round(runif(200, 1, 12)))
freq_df <- count(birthdaysExample, 'month')
freq_df[order(freq_df$freq, decreasing = TRUE), ]
This gives you:
month freq
5 5 29
9 9 24
3 3 22
4 4 18
6 6 17
7 7 15
2 2 14
10 10 14
11 11 14
8 8 13
1 1 10
12 12 10
To get the highest 3 values:
library(magrittr)
freq_df[order(freq_df$freq, decreasing = TRUE), ] %>% head(., 3)
month freq
5 5 29
9 9 24
3 3 22
Or, with just base R:
head(freq_df[order(freq_df$freq, decreasing = TRUE), ], 3)
With dplyr
dplyr is a newer approach to many routine data manipulations in R that is a bit more intuitive:
library(dplyr)
library(magrittr)
freq_df2 <- birthdaysExample %>%
group_by(month) %>%
summarize(freq = n()) %>%
arrange(desc(freq))
freq_df2
This returns:
Source: local data frame [12 x 2]
month freq
1 5 29
2 9 24
3 3 22
4 4 18
5 6 17
6 7 15
7 2 14
8 10 14
9 11 14
10 8 13
11 1 10
12 12 10
The object it returns is a tbl_df (a tibble), which is still a data frame but prints and subsets a little differently, so if you want to use base R functions with it, it might be easier to convert it back to a plain data frame, with something like:
my_df <- as.data.frame(freq_df2)
And if you really want, you can write this to a CSV file with:
write.csv(my_df, file="foo.csv")
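The same steps can also be sketched in base R alone, without plyr or dplyr. The block below is only an illustration under stated assumptions: the data are simulated (the asker's real birthdaysExample is not shown), and the temporary file stands in for whatever file name you actually want.

```r
# Hypothetical example data: 200 random months (not the asker's real data)
set.seed(1)
birthdaysExample <- data.frame(month = sample(1:12, 200, replace = TRUE))

# Frequency table as a data frame with columns 'month' and 'freq'
freq_df <- as.data.frame(table(month = birthdaysExample$month),
                         responseName = "freq")

# Sort from most to least frequent
freq_df <- freq_df[order(freq_df$freq, decreasing = TRUE), ]

# Keep the three most frequent months and write them without row names
top3 <- head(freq_df, 3)
out_file <- tempfile(fileext = ".csv")  # placeholder path for illustration
write.csv(top3, file = out_file, row.names = FALSE)

# Read the file back to check the round trip
read.csv(out_file)
```

Passing row.names = FALSE keeps the original row numbers out of the file, which is usually what you want once the table has been reordered.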

Related

How to select every month's first value and perform calculations in R

I have a dataframe df as follows:
Date Value
1-Jun-12 5
2-Jun-12 10
3-Jun-12 8
4-Jun-12 15
2-Jul-12 12
3-Jul-12 6
4-Jul-12 14
1-Aug-12 20
2-Aug-12 10
My output should be:
Date Value mon_diff
1-Jun-12 5 7
2-Jun-12 10
3-Jun-12 8
4-Jun-12 15
2-Jul-12 12 8
3-Jul-12 6
4-Jul-12 14
1-Aug-12 20 ...
2-Aug-12 10
I have to take the next month's first value and subtract the current month's first value from it, that is 12 - 5 = 7, and then do the same for the following month, that is 20 - 12 = 8. Please note that there is no fixed number of rows per month, as different months have different numbers of days. Please help.
This approach is more robust: it also works when the data contain entries for multiple years.
library(tidyverse)
library(lubridate)
df %>%
mutate(
Date = dmy(Date),
month_year=paste0(month(Date),'_',year(Date))) %>%
group_by(month_year) %>%
filter(Date==min(Date)) %>%
ungroup() %>%
mutate(mon_diff=lead(Value)-Value) %>%
select(-month_year) %>%
right_join(df %>% mutate(Date=dmy(Date)), by=c("Date", "Value")) %>%
arrange(Date)-> output_df
Output:
Date Value mon_diff
<date> <int> <int>
1 2012-06-01 5 7
2 2012-06-02 10 NA
3 2012-06-03 8 NA
4 2012-06-04 15 NA
5 2012-07-02 12 8
6 2012-07-03 6 NA
7 2012-07-04 14 NA
8 2012-08-01 20 NA
9 2012-08-02 10 NA
Data:
read.table(text='Date Value
1-Jun-12 5
2-Jun-12 10
3-Jun-12 8
4-Jun-12 15
2-Jul-12 12
3-Jul-12 6
4-Jul-12 14
1-Aug-12 20
2-Aug-12 10',header=T)-> df
Using the data shown reproducibly in the Note at the end, convert Date to yearmon (year and month with no day), giving ym. Then, for each row, find the first row with the same year-month and the first row with the next year-month. Note that the yearmon class represents year and month internally as year + fraction, where fraction = 0, 1/12, ..., 11/12, so the next month is found by adding 1/12. Take the difference of Value between those two rows. Finally, set diff to NA for rows whose year-month has already appeared. If you are using the tidyverse, use the same code but with mutate replacing transform.
library(zoo)
ym <- as.yearmon(DF$Date, format = "%d-%b-%y")
DF |>
transform(diff = Value[match(ym + 1/12, ym)] - Value[match(ym, ym)]) |>
transform(diff = ifelse(duplicated(ym), NA, diff))
giving:
Date Value diff
1 1-Jun-12 5 7
2 2-Jun-12 10 NA
3 3-Jun-12 8 NA
4 4-Jun-12 15 NA
5 2-Jul-12 12 8
6 3-Jul-12 6 NA
7 4-Jul-12 14 NA
8 1-Aug-12 20 NA
9 2-Aug-12 10 NA
Note
Lines <- "Date Value
1-Jun-12 5
2-Jun-12 10
3-Jun-12 8
4-Jun-12 15
2-Jul-12 12
3-Jul-12 6
4-Jul-12 14
1-Aug-12 20
2-Aug-12 10"
DF <- read.table(text = Lines, header = TRUE)
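For comparison, the same idea can be sketched with base R only. This is a minimal illustration, not the code from either answer, and it makes two assumptions: the dates are typed in ISO format so that parsing does not depend on the locale's month abbreviations, and the difference is taken between consecutive months that are present in the data (a skipped month is not treated specially).

```r
# Same values as the question, with dates written as ISO strings
DF <- data.frame(
  Date  = as.Date(c("2012-06-01", "2012-06-02", "2012-06-03", "2012-06-04",
                    "2012-07-02", "2012-07-03", "2012-07-04",
                    "2012-08-01", "2012-08-02")),
  Value = c(5, 10, 8, 15, 12, 6, 14, 20, 10)
)

# Make sure rows are in date order, then build a year-month key
DF <- DF[order(DF$Date), ]
ym <- format(DF$Date, "%Y-%m")

# TRUE for the first row of each year-month
first <- !duplicated(ym)

# Difference between successive first-of-month values; the last month has
# no successor, so it gets NA
DF$mon_diff <- NA_real_
DF$mon_diff[first] <- c(diff(DF$Value[first]), NA)

DF
```

Because the key includes the year, rows from the same calendar month in different years are kept apart.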

How to loop to standardize and rename a set of variables

I have a data set with 1000 variables. The naming fashion of the variable is as shown in the figure below.
Now I want to use a loop function to standardize each of these 1000 variables and keep their original names. That is, I want the new "SCORE.1" to be the standardized "SCORE.1", new "SCORE.2" is the standardized "SCORE.2".
How can I do this? Many thanks!
Perhaps it would be better to keep the 'original' data (e.g. "df_1") and create a new dataframe (e.g. "df_2") with the transformed values, i.e.
library(tidyverse)
# Create some fake data
set.seed(123)
names <- paste("SCORE", 1:1000, sep = ".")
IDs <- 1:100
m <- matrix(sample(1:20, 10000, replace = TRUE), ncol = 1000, nrow = 100,
dimnames=list(IDs, names))
df_1 <- as.data.frame(m)
head(df_1)
#> SCORE.1 SCORE.2 SCORE.3 SCORE.4 SCORE.5 SCORE.6 SCORE.7 SCORE.8 SCORE.9
#> 1 15 6 9 15 11 7 9 8 6
#> 2 19 16 16 19 15 4 16 20 4
#> 3 14 11 17 6 20 10 9 11 3
#> 4 3 4 13 16 2 17 2 18 14
#> 5 10 12 8 15 16 16 9 14 19
#> 6 18 14 7 19 19 8 11 3 14
# Transform the 'original' fake data into 'new' fake data:
# subtract each column's mean, then divide by its standard deviation
df_2 <- df_1 %>%
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))
head(df_2)
# Each column of df_2 now has mean 0 and standard deviation 1
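Since the question asks for a loop, here is a base R sketch of the standardization, z = (x - mean(x)) / sd(x), applied column by column. The three-column data frame is a made-up stand-in for the real 1000-variable data set, and scale() is shown only as a cross-check.

```r
# Hypothetical stand-in for the real data: three SCORE.* columns, 50 rows
set.seed(42)
df_1 <- data.frame(
  SCORE.1 = rnorm(50, mean = 10, sd = 3),
  SCORE.2 = rnorm(50, mean = 100, sd = 20),
  SCORE.3 = runif(50, min = 0, max = 5)
)

# Standardize every column in a loop; the column names are left untouched
df_2 <- df_1
for (nm in names(df_1)) {
  df_2[[nm]] <- (df_1[[nm]] - mean(df_1[[nm]])) / sd(df_1[[nm]])
}

# scale() performs the same centring and scaling on all columns at once
df_3 <- as.data.frame(scale(df_1))

# Column means should be 0 and standard deviations 1, up to rounding error
round(colMeans(df_2), 10)
round(sapply(df_2, sd), 10)
```

Working on a copy (df_2) keeps the original values available in df_1, in line with the suggestion above.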
Does this answer your question?

Merging multiple connected columns

I have two different columns for several samples, which are connected. I want to merge all columns of type 1 to one column and all of type 2 to one column, but the rows should stay connected.
Example:
a1 <- c(1, 2, 3, 4, 5)
b1 <- c(1, 4, 9, 16, 25)
a2 <- c(2, 4, 6, 8, 10)
b2 <- c(4, 8, 12, 16, 20)
df1 <- data.frame(a1, b1, a2, b2)
a1 b1 a2 b2
1 1 1 2 4
2 2 4 4 8
3 3 9 6 12
4 4 16 8 16
5 5 25 10 20
I want to have it like this:
a b
1 1 1
2 2 4
3 2 4
4 3 9
5 4 8
6 4 16
7 5 25
8 6 12
9 8 16
10 10 20
My case
This is the example in my case. I have a lot of columns with different names and I want to extract abs_dist_1, ... abs_dist_5 and mean_vel_1, ... mean_vel_5 in a new data frame, with all abs_dist in one column and all mean_vel in one column, but still connected.
I tried with unlist, but then of course the connection gets lost.
Thanks in advance.
A base R option using reshape
subset(
reshape(
setNames(df1, gsub("(\\d)", ".\\1", names(df1))),
direction = "long",
varying = 1:ncol(df1)
),
select = -c(time, id)
)
gives
a b
1.1 1 1
2.1 2 4
3.1 3 9
4.1 4 16
5.1 5 25
1.2 2 4
2.2 4 8
3.2 6 12
4.2 8 16
5.2 10 20
An option with pivot_longer from tidyr by specifying the names_sep as a regex lookaround to match between a lower case letter ([a-z]) and a digit in the column names
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = everything(), names_to = c( '.value', 'grp'),
names_sep = "(?<=[a-z])(?=[0-9])") %>%
select(-grp)
-output
# A tibble: 10 x 2
# a b
# <dbl> <dbl>
# 1 1 1
# 2 2 4
# 3 2 4
# 4 4 8
# 5 3 9
# 6 6 12
# 7 4 16
# 8 8 16
# 9 5 25
#10 10 20
With the edited post, we need to change the names_sep i.e. the delimiter is now _ between a lower case letter and a digit
df1 %>%
pivot_longer(cols = everything(), names_to = c( '.value', 'grp'),
names_sep = "(?<=[a-z])_(?=[0-9])") %>%
select(-grp)
or with base R, use split.default on the substring of column names into a list of data.frame, then unlist each list element by looping over the list and convert to data.frame
data.frame(lapply(split.default(df1, sub("\\d+", "", names(df1))),
unlist, use.names = FALSE))
For the sake of completeness, here is a solution which uses data.table::melt() and the patterns() function to specify columns which belong together:
library(data.table)
melt(setDT(df1), measure.vars = patterns(a = "a", b = "b"))[
order(a,b), !"variable"]
a b
1: 1 1
2: 2 4
3: 2 4
4: 3 9
5: 4 8
6: 4 16
7: 5 25
8: 6 12
9: 8 16
10: 10 20
This reproduces the expected result for OP's sample dataset.
A more realistic example: reshape only selected columns
With the edit of the question, the OP has clarified that the production data contains many more columns than those which need to be reshaped:
I have a lot of columns with different names and I want to extract
abs_dist_1, ... abs_dist_5 and mean_vel_1, ... mean_vel_5 in a new
data frame, with all abs_dist in one column and all mean_vel in one
column, but still connected.
So, the OP wants to extract and reshape the columns of interest in one go while ignoring all other data in the dataset.
To simulate this situation, we need a more elaborate dataset which includes other columns as well:
df2 <- cbind(df1, c1 = 11:15, c2 = 21:25)
df2
a1 b1 a2 b2 c1 c2
1 1 1 2 4 11 21
2 2 4 4 8 12 22
3 3 9 6 12 13 23
4 4 16 8 16 14 24
5 5 25 10 20 15 25
With a modified version of the code above
library(data.table)
cols <- c("a", "b")
result <- melt(setDT(df2), measure.vars = patterns(cols), value.name = cols)[, ..cols]
setorderv(result, cols)
result
we get
a b
1: 1 1
2: 2 4
3: 2 4
4: 3 9
5: 4 8
6: 4 16
7: 5 25
8: 6 12
9: 8 16
10: 10 20
For the production dataset as pictured in the edit, the OP needs to set
cols <- c("abs_dist", "mean_vel")
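As a rough base R counterpart for the production-style names, the sketch below selects columns by prefix and stacks them. The data frame and its values are invented for illustration, and the approach assumes that the numbered columns of each prefix appear in the same order, so that stacked rows stay paired.

```r
# Hypothetical data with production-style names plus an unrelated column
df <- data.frame(
  abs_dist_1 = c(1, 2),
  mean_vel_1 = c(10, 20),
  abs_dist_2 = c(3, 4),
  mean_vel_2 = c(30, 40),
  other      = c("x", "y")
)

prefixes <- c("abs_dist", "mean_vel")

# For each prefix, pick the columns named <prefix>_<number> and stack them
# into one vector; columns that match no prefix are ignored
stacked <- as.data.frame(lapply(
  setNames(prefixes, prefixes),
  function(p) {
    cols <- grep(paste0("^", p, "_\\d+$"), names(df))
    unlist(df[cols], use.names = FALSE)
  }
))

stacked
```

Rows 1 and 2 of the result come from the _1 columns and rows 3 and 4 from the _2 columns, so each abs_dist value stays next to its own mean_vel value.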

Multiplying similar named merged columns in R

I have two dfs : df1 and df2 where the column names are dates. When I join the two df's I get columns like
date1.x, date1.y, date2.x, date2.y, date3.x, date3.y, date4.x, date4.y...........
I want to create new columns which have values which are multiplication of date1.x and date1.y and similarly for other date pairs as well.
df <- data.frame(id=11:13, date1.x=1:3, date2.x=4:6, date1.y=7:9, date2.y=10:12)
df
# id date1.x date2.x date1.y date2.y
# 1 11 1 4 7 10
# 2 12 2 5 8 11
# 3 13 3 6 9 12
grep("^date.*\\.x$", colnames(df), value = TRUE)
# [1] "date1.x" "date2.x"
datenms <- grep("^date.*\\.x$", colnames(df), value = TRUE)
### make sure all of our 'date#.x' columns have matching 'date#.y' columns
datenms <- datenms[ gsub("x$", "y", datenms) %in% colnames(df) ]
datenms
# [1] "date1.x" "date2.x"
subset(df, select = datenms)
# date1.x date2.x
# 1 1 4
# 2 2 5
# 3 3 6
subset(df, select = gsub("x$", "y", datenms))
# date1.y date2.y
# 1 7 10
# 2 8 11
# 3 9 12
subset(df, select = datenms) * subset(df, select = gsub("x$", "y", datenms))
# date1.x date2.x
# 1 7 40
# 2 16 55
# 3 27 72
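To attach the products to the data frame as new columns, as the question asks, something along these lines should work. The ".prod" suffix is an arbitrary naming choice made for this sketch, not something from the question.

```r
df <- data.frame(id = 11:13, date1.x = 1:3, date2.x = 4:6,
                 date1.y = 7:9, date2.y = 10:12)

# Names of the '.x' columns and their '.y' partners
x_cols <- grep("^date.*\\.x$", names(df), value = TRUE)
y_cols <- sub("\\.x$", ".y", x_cols)

# Stop early if any '.x' column lacks a matching '.y' column
stopifnot(all(y_cols %in% names(df)))

# Multiply the paired columns element-wise and store them under new names
prod_cols <- sub("\\.x$", ".prod", x_cols)  # hypothetical naming scheme
df[prod_cols] <- df[x_cols] * df[y_cols]

df
```

The multiplication pairs columns by position, which is why y_cols is derived from x_cols rather than found with a separate grep.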
There are a number of ways to do this, but I suggest that it is good practice to get used to transforming your data into a format that is easy to work with. The first answer showed you one way to do what you want without transforming your data. My answer will show you how to transform the data so that calculations (this one and others) are easy, and then how to perform the calculation once the data is tidy.
Making your data tidy helps to perform easier aggregations, to graph results, to perform feature engineering for models, etc.
library(dplyr)
library(tidyr)
df <- data.frame(id=11:13, date1.x=1:3, date2.x=4:6, date1.y=7:9, date2.y=10:12)
df
# id date1.x date2.x date1.y date2.y
# 1 11 1 4 7 10
# 2 12 2 5 8 11
# 3 13 3 6 9 12
# Convert the data to a tidy format that is easier for computers to calculate
tidy_df <- df %>%
pivot_longer(
cols = starts_with("date"), # We are tidying any column starting with date
names_to = c("date_num","date_source"), # creating two columns for names
values_to = c("date_value"), # creating one column for values
names_prefix = "date", # removing the "date" prefix
names_sep = "\\." # splitting the names on the period `.`
)
tidy_df
# id date_num date_source date_value
# <int> <chr> <chr> <int>
# 1 11 1 x 1
# 2 11 2 x 4
# 3 11 1 y 7
# 4 11 2 y 10
# 5 12 1 x 2
# 6 12 2 x 5
# 7 12 1 y 8
# 8 12 2 y 11
# 9 13 1 x 3
# 10 13 2 x 6
# 11 13 1 y 9
# 12 13 2 y 12
# Now that the data is tidy we can do easier dataframe grouping and aggregation
tidy_df %>%
group_by(id,date_num) %>%
summarise(date_value_mult = prod(date_value)) %>%
ungroup()
# id date_num date_value_mult
# <int> <chr> <dbl>
# 1 11 1 7
# 2 11 2 40
# 3 12 1 16
# 4 12 2 55
# 5 13 1 27
# 6 13 2 72
# If/When you eventually want the data in a more human readable format you can
# pivot the data back into a human readable format. This is likely after all
# computer calculations are done and you want to present the data. For storing
# the data (such as in a database) you would not need/want this step.
tidy_df %>%
group_by(id,date_num) %>%
summarise(date_value_mult = prod(date_value)) %>%
ungroup() %>%
pivot_wider(
names_from = date_num,
values_from = date_value_mult,
names_prefix = "date"
)
# id date1 date2
# <int> <dbl> <dbl>
# 1 11 7 40
# 2 12 16 55
# 3 13 27 72

R - replace a list of numbers in a data frame with a repeating sequence of numbers

I have some data similar to this:
df <- data.frame(year = c('1','1','1','1','1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2','2','2','2','2'),
month = c('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24'))
I need to replace the numbers in the 'month' column with a repeating pattern of 1:12 like this:
year month
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 2 1
14 2 2
15 2 3
16 2 4
17 2 5
18 2 6
19 2 7
20 2 8
21 2 9
22 2 10
23 2 11
24 2 12
Any ideas?
If you know that every year has all 12 months then df$month <- 1:12 is enough. If you think that the years have an inconsistent number of recorded months, but that the months that are recorded are consecutive, then you can use dplyr's mutate to simply add each row's integer position for each year:
library(dplyr)
df %>%
group_by(year) %>%
mutate(month = 1:length(month))
Otherwise you should probably use the values of month by converting them to integers. Keep in mind that in R versions before 4.0.0, data.frame turns strings into factors by default, so first add stringsAsFactors = FALSE to your call to data.frame:
df <- data.frame(year = rep(1:2, each = 12),
month = as.character(1:24),
stringsAsFactors = FALSE
)
Then do the following, which returns remainder of division by 12, then replaces remainder 0 with 12:
df %>%
mutate(month = as.integer(month) %% 12,
month = ifelse(month == 0, 12, month)
)
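The two-step remainder logic can also be written as a single expression by shifting to a zero-based month before taking the remainder and shifting back afterwards. This is a small base R sketch of that arithmetic, using the same 1 to 24 month labels as the example data.

```r
# Month labels stored as character strings, as in the question
month_raw <- as.character(1:24)

# Shift to 0..23, take the remainder modulo 12, then shift back to 1..12
month_in_year <- (as.integer(month_raw) - 1L) %% 12L + 1L

month_in_year
```

With this form the value 12 maps to 12 directly, so no ifelse step is needed to replace a remainder of 0.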
