I have data formatted in the following way:
-------------------------
| A01 | value | |
-------------------------
| A01 | value | |
-------------------------
| A01 | value | |
-------------------------
| A02 | value | |
-------------------------
| A02 | value | |
-------------------------
| A02 | value | |
-------------------------
| A03 | value | |
-------------------------
| A03 | value | |
-------------------------
| A03 | value | |
-------------------------
| A04 | value | |
-------------------------
| A04 | value | |
-------------------------
| A04 | value | |
I want to extract the values from the rows labeled A02 and paste them in a separate column beside the rows labeled A01, and similarly for A03 and A04, and so on.
Basically I want to rearrange like this:
-------------------------
| A01 | value | A02 | value |
-------------------------
| A01 | value | A02 | value |
-------------------------
| A01 | value | A02 | value |
-------------------------
| A03 | value | A04 | value |
-------------------------
| A03 | value | A04 | value |
-------------------------
| A03 | value | A04 | value |
I am learning the tidyverse in R, but I am very new and I have not been able to find the right function to do this yet. I would appreciate any help. Thanks in advance.
With this approach you split the rows into those whose label ends in an even number and those whose label ends in an odd number, and then bind the two sets together side by side. Rows are paired by position. The approach assumes that the labels go no higher than A09 (for larger numbers you have to modify the substr() call), and it only works if there are equally many even- and odd-labelled rows. For your example data it works as requested.
library(tibble)
library(dplyr)
library(tidyr)
## example data ##
value = c(rep(1:4,3))
who = paste0("A0", c(rep(1:4,3)) )
tbl <- tibble::tibble(who = who, value = value)
## substr(who,3,3) extracts the third character of the label (its last digit); as.numeric() converts it to a number
## %% 2 == 0: modulo division returns the remainder, and only even numbers have remainder 0
tbl <- tbl %>% mutate(is_even = as.numeric(substr(who,3,3)) %% 2 == 0)
## Filter all rows with even number in label
tbl_even <- tbl %>% filter(is_even == TRUE) %>% dplyr::select(-is_even)
## Filter all rows with odd number in label
tbl_odd <- tbl %>% filter(is_even == FALSE) %>% dplyr::select(-is_even)
## Join even and odd values together
result <- tbl_odd %>% cbind(tbl_even)
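The single-digit limitation mentioned above comes from substr(who,3,3). One way around it is to strip everything that is not a digit before converting. The following is a minimal base R sketch, assuming every label is a letter prefix followed by digits; the sample labels and the who_even/value_even column names are made up for illustration.

```r
# Hypothetical labels that go beyond A09
who   <- c("A01", "A02", "A09", "A10", "A11", "A12")
value <- seq_along(who)

# Remove every non-digit character, then convert what is left to an integer
label_number <- as.integer(gsub("[^0-9]", "", who))
is_even      <- label_number %% 2 == 0

odd  <- data.frame(who = who[!is_even], value = value[!is_even],
                   stringsAsFactors = FALSE)
even <- data.frame(who_even = who[is_even], value_even = value[is_even],
                   stringsAsFactors = FALSE)

# Pairing by position only makes sense if both halves have the same length
stopifnot(nrow(odd) == nrow(even))

result <- cbind(odd, even)
print(result)
```

Giving the even-numbered columns their own names avoids the duplicate column names that a plain cbind() of two identically named tables produces.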
I'm doing exploratory analysis of survey data, and the data frame is a haven-labelled dataset, that is, each variable already has value labels and a variable label.
I want to store frequency tables in a list and then name each list element after the variable label. I'm using the expss package. The problem is that the first column name of each output table contains this description: values2labels(df$var). How can this description be dropped from the table?
Reproducible example:
# Dataframe
df <- data.frame(sex = c(1, 1, 2, 2, 1, 2, 2, 2, 1, 2),
agegroup= c(1, 3, 1, 2, 3, 3, 2, 2, 2, 1),
weight = c(100, 20, 400, 300, 50, 50, 80, 250, 100, 100))
library(expss)
# Variable labels
var_lab(df$sex) <-"Sex"
var_lab(df$agegroup) <-"Age group"
# Value labels
val_lab(df$sex) <- make_labels("1 Male
2 Female")
val_lab(df$agegroup) <- make_labels("1 1-29
2 30-49
3 50 and more")
# Save variable labels
var_labels1 <- var_lab(df$sex)
var_labels2 <- var_lab(df$agegroup)
# Drop variable labels
var_lab(df$sex) <- NULL
var_lab(df$agegroup) <- NULL
# Save frequencies
f1 <- fre(values2labels(df$sex))
f2 <- fre(values2labels(df$agegroup))
# Note: I use the function 'values2labels' from the 'expss' package in order to display the value
# labels instead of the values of the variable. In this example, since I manually created the value
# labels, I don't need that function, but when I import haven labelled data, I need it to
# display value labels by default.
# Add frequencies on list
my_list <- list(f1, f2)
# Name lists elements as variable labels
names(my_list) <- c(var_labels1,
var_labels2)
In the following output, how can I get rid of the first column name on both tables: values2labels(df$sex) and values2labels(df$agegroup) ?
$Sex
| values2labels(df$sex) | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| --------------------- | ----- | ------------- | ------- | ------------ | ----------------------- |
| Female | 6 | 60 | 60 | 60 | 60 |
| Male | 4 | 40 | 40 | 40 | 100 |
| #Total | 10 | 100 | 100 | 100 | |
| <NA> | 0 | | 0 | | |
$`Age group`
| values2labels(df$agegroup) | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| -------------------------- | ----- | ------------- | ------- | ------------ | ----------------------- |
| 1-29 | 3 | 30 | 30 | 30 | 30 |
| 30-49 | 4 | 40 | 40 | 40 | 70 |
| 50 and more | 3 | 30 | 30 | 30 | 100 |
| #Total | 10 | 100 | 100 | 100 | |
| <NA> | 0 | | 0 | | |
You need to set var_lab to an empty string instead of NULL:
library(expss)
a = 1:2
val_lab(a) = c("One" = 1, "Two" = 2)
var_lab(a) = ""
fre(values2labels(a))
# | | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
# | ------ | ----- | ------------- | ------- | ------------ | ----------------------- |
# | One | 1 | 50 | 50 | 50 | 50 |
# | Two | 1 | 50 | 50 | 50 | 100 |
# | #Total | 2 | 100 | 100 | 100 | |
# | <NA> | 0 | | 0 | | |
I have two categorical variables and I am trying to create a cross tabulation for them. Since the values of both are yes and no, I want to specify the row and column names for ease of understanding.
T4 <- table(bank$Term_deposit, bank$housing_loan) %>%
  prop.table(margin = 2) * 100
T4
           no       yes
no  83.418772 92.217126
yes 16.581228  7.782874
kable(T4, caption = "%agewise comparison for marital status")
|    |       no|       yes|
|:---|--------:|---------:|
|no  | 83.41877| 92.217126|
|yes | 16.58123|  7.782874|
Expected output:
|    |       no|       yes|
|:---|--------:|---------:|
|CAT1| 83.41877| 92.217126|
|CAT2| 16.58123|  7.782874|
OR
|    |     cat1|      cat2|
|:---|--------:|---------:|
|CAT1| 83.41877| 92.217126|
|CAT2| 16.58123|  7.782874|
It depends on what you ultimately want to do with the table, but if the goal is to render Markdown, consider pandoc.table() from the pander package, which provides functionality similar to knitr::kable:
library(pander)
colnames(T4) <- c("CAT1", "CAT2")
rownames(T4) <- c("no", "yes")
pandoc.table(T4)
Output:
-------------------------
CAT1 CAT2
-------- -------- -------
**y** 13473 77311
**n** 226221 0
-------------------------
Or maybe:
colnames(T4) <- c("deposit: no", "deposit: yes")
rownames(T4) <- c("loan: no", "loan: yes")
pandoc.table(T4)
Output:
--------------------------------------------
deposit: no deposit: yes
--------------- ------------- --------------
**loan: no** 13473 77311
**loan: yes** 226221 0
--------------------------------------------
Another possibility would be to use the expss package:
library(expss)
df <- apply_labels(bank,
Term_deposit= "Term deposit",
housing_loan= "Housing loan")
cro_cpct(df$Term_deposit, df$housing_loan)
Output:
| | | Housing loan | |
| | | 0 | 1 |
| ------------ | ------------ | ------------ | --- |
| Term deposit | 0 | 39.4 | 100 |
| | 1 | 60.6 | |
| | #Total cases | 66.0 | 34 |
Both packages also provide several other functions for making neat tables; they are worth a look.
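If no extra package is wanted, the row and column labels can also be set on the table itself before it is passed to any renderer. The sketch below uses base R only; the small bank data frame is a made-up stand-in for the real data, and the label texts are just examples. Note that the first argument to table() ends up in the rows, so the row labels describe Term_deposit here.

```r
# Hypothetical stand-in for the `bank` data frame from the question
bank <- data.frame(
  Term_deposit = c("no", "no", "yes", "no", "yes", "no"),
  housing_loan = c("no", "yes", "no", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Column percentages, as in the question
T4 <- prop.table(table(bank$Term_deposit, bank$housing_loan), margin = 2) * 100

# Rows come from the first argument (Term_deposit), columns from the second
dimnames(T4) <- list(
  Term_deposit = c("deposit: no", "deposit: yes"),
  housing_loan = c("loan: no", "loan: yes")
)

print(round(T4, 1))
```

The relabelled T4 is still an ordinary table, so it can be handed to kable() or pandoc.table() in the same way as before.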
I have the following R code, which uses dplyr.
Because the data is large, we want to use data.table instead.
test <- function(Act, mac, type, thisYear){
Act %>%
mutate_(var = type) %>%
filter(var == mac) %>%
filter(floor_date(as.Date(submit_ts), 'year') == thisYear)
}
Act is as follows
| submit_ts | col1 | col2 |
| ------------- |---------------|-------|
| '2015-01-01' | 'x' | 1000 |
| '2015-01-01' | 'y' | 200 |
| '2015-01-01' | 'x' | 200 |
Basically, the function works as follows:
test(act, 'x', 'col1', 2015)
result is as follows
| submit_ts | col1 | col2 |
| ------------- |---------------|-------|
| '2015-01-01' | 'x' | 1000 |
| '2015-01-01' | 'x' | 200 |
test(act, 200, 'col2', 2015)
result is as follows
| submit_ts | col1 | col2 |
| ------------- |---------------|-------|
| '2015-01-01' | 'y' | 200 |
| '2015-01-01' | 'x' | 200 |
How should I do this using data.table?
We can take a similar approach in data.table:
library(data.table)
library(lubridate)
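test1 <- function(Act, mac, type, thisYear){
  # setDT() and setnames() work by reference: the column named in `type` is renamed to "var"
  setnames(setDT(Act), type, "var")[
    var==mac & year(floor_date(as.Date(submit_ts), "year"))==thisYear]
}
test1(dat, 2, "val", thisYear)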
# submit_ts var
#1: 2013-05-05 2
#2: 2013-05-12 2
NOTE: floor_date() returns a date (the first day of the year), not a four-digit year, which is why the result is wrapped in year().
data
dat <- data.frame(submit_ts= c("2013-05-05", "2012-05-10", "2013-05-12"),
val = c(2, 1, 2), stringsAsFactors=FALSE)
thisYear <- 2013
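If renaming the column in place is not wanted, one possible variant looks the column up with get() and filters a copy of the input. This is only a sketch, assuming submit_ts holds ISO-formatted dates stored as character; test2 is a hypothetical name, and act mimics the table from the question.

```r
library(data.table)

test2 <- function(Act, mac, type, thisYear) {
  dt <- as.data.table(Act)  # works on a copy, so the caller's object is untouched
  # get(type) fetches the column whose name is stored in `type`;
  # year() and as.IDate() are both provided by data.table
  dt[get(type) == mac & year(as.IDate(submit_ts)) == thisYear]
}

# Sample data modelled on the question
act <- data.frame(
  submit_ts = c("2015-01-01", "2015-01-01", "2015-01-01"),
  col1      = c("x", "y", "x"),
  col2      = c(1000, 200, 200),
  stringsAsFactors = FALSE
)

print(test2(act, "x", "col1", 2015))
print(test2(act, 200, "col2", 2015))
```

Because the column keeps its original name, the result has the same columns as the input. One caveat: get() would pick up a column literally called type or mac if the data had one, so the argument names may need changing for such data.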
I need help in taking an annual total (for each of many initiatives) and breaking that down to each month using a simple division formula. I need to do this for each distinct combination of a few columns while copying down the columns that are broken from annual to each monthly total. The loop will apply the formula to two columns and loop through each distinct group in a vector. I tried to explain in an example below as it's somewhat complex.
What I have :
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2015 | TotalD | TotalD |
| A | Mike | 2015 | TotalE | TotalE |
| A | Rob | 2015 | TotalF | TotalF |
| B | John | 2015 | TotalG | TotalG |
| B | Mike | 2015 | TotalH | TotalH |
......
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2016 | TotalI | TotalI |
| A | Mike | 2016 | TotalJ | TotalJ |
| A | Rob | 2016 | TotalK | TotalK |
| B | John | 2016 | TotalL | TotalL |
| B | Mike | 2016 | TotalM | TotalM |
I'm going to loop a function for the first row to take the "Total Savings" and "Total Costs" and divide by 12 where Date = 2015 and 9 where Date = 2016 (YTD to Sept) and create an individual row for each. I'm essentially breaking out an annual total in a row and creating a row for each month of the year. I need help in running that loop to copy also columns "Init", "Name", until "Init", "Name" combination are not distinct. Also, note the formula for the division based on the year will be different as well. I suppose I could separate the datasets for 2015 and 2016 and use two different functions and merge if that would be easier. Below should be the output:
| Init | Name | Date |Monthly Savings|Monthly Costs|
| A | John | 01-01-2015 | TotalD/12* | MonthD |
| A | John | 02-01-2015 | MonthD | MonthD |
| A | John | 03-01-2015 | MonthD | MonthD |
...
| A | Mike | 01-01-2016 | TotalE/9* | TotalE |
| A | Mike | 02-01-2016 | TotalE | TotalE |
| A | Mike | 03-01-2016 | TotalE | TotalE |
...
| B | John | 01-01-2015 | TotalG/12* | MonthD |
| B | John | 02-01-2015 | MonthG | MonthD |
| B | John | 03-01-2015 | MonthG | MonthD |
TotalD/12* = MonthD - this is the formula for 2015
TotalE/9* = MonthE - this is the formula for 2016
Any help would be appreciated...
As a start, here are some reproducible data, with the columns described:
myData <-
data.frame(
Init = rep(LETTERS[1:3], each = 4)
, Name = rep(c("John", "Mike"), each = 2)
, Date = 2015:2016
, Savings = (1:12)*1200
, Cost = (1:12)*2400
)
Next, set the divisor to be used for each year:
toDivide <-
c("2015" = 12, "2016" = 9)
Then I use the magrittr pipe to split the data into single rows and loop through them with lapply, expanding each row into the appropriate number of rows (9 or 12) with the savings and costs divided by the number of months. Finally, dplyr's bind_rows stitches the rows back together.
library(dplyr)  # provides %>% and bind_rows()
myData %>%
split(1:nrow(.)) %>%
lapply(function(x){
temp <- data.frame(
Init = x$Init
, Name = x$Name
, Date = as.Date(paste(x$Date
, formatC(1:toDivide[as.character(x$Date)]
, width = 2, flag = "0")
, "01"
, sep = "-"))
, Savings = x$Savings / toDivide[as.character(x$Date)]
, Cost = x$Cost / toDivide[as.character(x$Date)]
)
}) %>%
bind_rows()
The head of this looks like:
Init Name Date Savings Cost
1 A John 2015-01-01 100.0000 200.0000
2 A John 2015-02-01 100.0000 200.0000
3 A John 2015-03-01 100.0000 200.0000
4 A John 2015-04-01 100.0000 200.0000
5 A John 2015-05-01 100.0000 200.0000
6 A John 2015-06-01 100.0000 200.0000
with similar entries for each expanded row.
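The same expansion can also be written without the split/lapply step by repeating row indices. The following base R sketch is an alternative under the same assumptions as above (one divisor per year in toDivide, months numbered from January); the helper names n_months, idx, month and expanded are introduced here for illustration.

```r
myData <- data.frame(
  Init    = rep(LETTERS[1:3], each = 4),
  Name    = rep(c("John", "Mike"), each = 2),
  Date    = 2015:2016,
  Savings = (1:12) * 1200,
  Cost    = (1:12) * 2400,
  stringsAsFactors = FALSE
)
toDivide <- c("2015" = 12, "2016" = 9)

# Number of monthly rows each annual row expands into
n_months <- unname(toDivide[as.character(myData$Date)])

# Repeat each row index once per month, and number the months within each row
idx   <- rep(seq_len(nrow(myData)), times = n_months)
month <- sequence(n_months)

expanded <- data.frame(
  Init    = myData$Init[idx],
  Name    = myData$Name[idx],
  Date    = as.Date(sprintf("%d-%02d-01", myData$Date[idx], month)),
  Savings = myData$Savings[idx] / n_months[idx],
  Cost    = myData$Cost[idx] / n_months[idx],
  stringsAsFactors = FALSE
)

print(head(expanded, 3))
```

Since every monthly value is the annual value divided by its number of months, the column sums of expanded match those of myData, which makes for a quick sanity check.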
I have 3 sqlite tables:
table inspections, where id is the primary key
id | name | deleted
------------------------------
I1 | Inspection A | (null)
I2 | Inspection B | (null)
I3 | Inspection C | 1
table equip_insp, where equip_id, insp_id are primary keys
equip_id | insp_id | period | period_type
--------------------------------------------
E1 | I1 | 1 | Y
E1 | I2 | 6 | M
E2 | I1 | 1 | M
table equip_certif, where id is primary key
id | equip_id | insp_id | date | certif_no | result | info
-------------------------------------------------------------------
C4 | E1 | I1 | 2015-02-01 | A-300 | Good | (null)
C3 | E1 | I1 | 2015-02-01 | A-200 | Good | (null)
C2 | E1 | I1 | 2015-01-10 | A-100 | Good | (null)
C1 | E1 | I2 | 2015-01-06 | B-100 | Good | (null)
All IDs are in fact numeric values; I use letters here only to make the relationships between the tables easier to follow.
I would like help with the SQLite syntax that, for item E1, displays all defined inspections (ascending), then the periodicity if it exists, and then the date, number and result of the latest certificate (if two certificates share the same date, take the one with the latest id), considering only certificates where info is null.
Result should be something like this:
id | name | period | period_type | certif_no | date | result
--------------------------------------------------------------------------
I1 | Inspection A | 1 | Y | A-300 | 2015-02-01 | Good
I2 | Inspection B | 6 | M | B-100 | 2015-01-06 | Good
I've tried this, but I'm not sure it is correct.
SELECT inspections.id, inspections.name, equip_insp.period, equip_insp.period_type, equip_certif.certif_no, equip_certif.date AS certif_date, equip_certif.result
FROM inspections
LEFT JOIN equip_insp ON (inspections.id = equip_insp.insp_id AND equip_insp.equip_id = 'E1')
LEFT JOIN equip_certif ON (inspections.id = equip_certif.insp_id AND equip_certif.info IS NULL)
WHERE inspections.deleted IS NULL
GROUP BY equip_insp.insp_id
ORDER BY inspections.id, date(equip_certif.date) DESC, equip_certif.id DESC
To specify which row from a group gets returned, you must use MAX(); otherwise, you get some random row:
SELECT ..., MAX(equip_certif.date) AS certif_date, ...
FROM ...
GROUP BY equip_insp.insp_id
...
(This works only in SQLite 3.7.11 or later; in earlier versions, the query would get more complex.)
After playing with SQLite I found the solution myself. The answer is:
SELECT inspections.id, inspections.name, equip_insp.period, equip_insp.period_type, equip_certif.certif_no, equip_certif.date AS certif_date, equip_certif.result
FROM inspections
LEFT JOIN equip_insp ON (inspections.id = equip_insp.insp_id AND equip_insp.equip_id = 'E1')
LEFT JOIN equip_certif ON (inspections.id = equip_certif.insp_id AND equip_certif.equip_id = equip_insp.equip_id AND equip_certif.info IS NULL)
WHERE inspections.deleted IS NULL
GROUP BY inspections.id
ORDER BY inspections.id, date(equip_certif.date) DESC, equip_certif.id DESC