I have this data frame (OD_2017):
ZONA ID_DOM FE_DOM NO_MORAD
1 1 00010001 15.41667 2
2 1 00010001 15.41667 2
3 1 00010001 15.41667 2
4 1 00010001 15.41667 2
5 1 00010001 15.41667 2
6 1 00010002 15.41667 4
...
99 00010994 16.68444 5
Currently, I'm counting the number of ID_DOM using the weight variable FE_DOM with this code.
count(OD_2017[!duplicated(OD_2017$ID_DOM), ],
      wt = FE_DOM,
      Zonas = ZONA,
      name = "N_domicilios")
which returns
Zonas N_domicilios
<int> <dbl>
1 1 1151.
2 2 2342.
3 3 7100.
4 4 12588.
5 5 8050.
6 6 9411.
However, I want these counts also broken down by NO_MORAD, something like
Zonas 1Mor 2Mor ... 99Mor
1 50 78 ... 78
2 x y ... z
...
99 99 99 ... 99
Can anyone help me with this?
Thanks
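One possible approach, sketched here on a small made-up stand-in for OD_2017 (the values are invented for illustration), is to add NO_MORAD as a second grouping variable in count() and then spread it into columns with tidyr::pivot_wider(). The "Mor" column suffix mirrors the layout requested above; whether missing combinations should be filled with 0 is an assumption.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for OD_2017; all values are invented for illustration
OD_2017 <- data.frame(
  ZONA     = c(1, 1, 1, 2, 2),
  ID_DOM   = c("00010001", "00010001", "00010002", "00020001", "00020002"),
  FE_DOM   = c(15.5, 15.5, 15.5, 20, 20),
  NO_MORAD = c(2, 2, 4, 2, 2)
)

wide <- OD_2017 %>%
  distinct(ID_DOM, .keep_all = TRUE) %>%            # one row per household
  count(Zonas = ZONA, NO_MORAD, wt = FE_DOM) %>%    # weighted count per zone and household size
  pivot_wider(names_from = NO_MORAD, values_from = n,
              names_glue = "{NO_MORAD}Mor",         # column names such as 2Mor, 4Mor
              values_fill = 0)                      # assumption: absent combinations count as 0

wide
```

Each row of wide is one zone, and each "Mor" column holds the weighted number of households with that many residents.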
In my text data below, there is apparently a special character that looks like a long dash (–), but it needs to be a regular minus sign (-).
Is there a way to replace all instances of the long dash – with a regular minus sign - in R, so that I can then read in the data using read.table(text = dat, header = TRUE)?
dat <- "
Study Outcome Subscale g Variance Precision
1 1 1 –.251 .024 41.455
2 1 1 –.069 .001 1,361.067
3 1 5 .138 .001 957.620
4 1 1 –.754 .085 11.809
5 1 1 –.228 .020 49.598
6 1 6 –.212 .004 246.180
6 2 7 .219 .004 246.095
7 1 1 .000 .012 83.367
8 1 2 –.103 .006 162.778
8 2 3 .138 .006 162.612
8 3 4 –.387 .006 160.133
9 1 1 –.032 .023 44.415
10 1 5 –.020 .058 17.110
11 1 1 .128 .017 59.999
12 1 1 –.262 .032 31.505
13 1 1 –.046 .071 14.080
14 1 6 –.324 .003 381.620
14 2 6 –.409 .003 378.611
14 3 7 .080 .003 386.319
14 4 7 –.140 .003 385.542
15 1 1 .311 .005 185.364
16 1 1 .036 .005 205.063
17 1 6 –.259 .001 925.643
17 2 7 .196 .001 928.897
18 1 1 .157 .013 74.094
19 1 1 .000 .056 17.985
20 1 1 .000 .074 13.600
21 1 6 –.013 .039 25.425
21 2 7 –.004 .039 25.426
22 1 1 –.202 .001 1,487.992
23 1 1 .000 .086 11.628
24 1 1 –.221 .001 713.110
25 1 1 –.099 .001 749.964
26 1 5 –.165 .000 6,505.024
27 1 1 –.523 .063 15.856
28 1 1 .000 .001 1,611.801
29 1 6 .377 .045 22.045
29 2 7 .575 .046 21.677
30 1 1 .590 .074 13.477
31 1 1 .020 .001 1,335.991
32 1 1 .121 .043 23.489
33 1 1 –.101 .003 363.163
34 1 1 –.101 .003 369.507
35 1 1 –.104 .004 255.507
36 1 1 –.270 .003 340.761
37 1 1 .179 .150 6.645
38 1 2 .468 .020 51.255
38 2 4 –.479 .020 51.193
39 1 5 –.081 .024 42.536
40 1 1 –.071 .043 23.519
41 1 1 .201 .077 13.036
42 1 6 –.070 .006 180.844
42 2 7 .190 .006 180.168
43 1 1 .277 .013 79.220
44 1 5 –.086 .001 903.924
45 1 5 –.338 .002 469.260
46 1 1 .262 .003 290.330
47 1 5 .000 .003 304.959
48 1 1 –.645 .055 18.192
49 1 5 –.120 .002 461.802
50 1 5 –.286 .009 106.189
51 1 1 –.124 .006 172.261
52 1 1 .023 .028 35.941
53 1 5 –.064 .001 944.600
54 1 1 .000 .043 23.010
55 1 1 .000 .014 72.723
56 1 5 .000 .012 85.832
57 1 1 .000 .012 85.832
"
Use gsub() from base R.
dat <- gsub(pattern = "–", replacement = "-", x = dat)
head(read.table(text = dat, header = T))
Study Outcome Subscale g Variance Precision
1 1 1 1 -0.251 0.024 41.455
2 2 1 1 -0.069 0.001 1,361.067
3 3 1 5 0.138 0.001 957.620
4 4 1 1 -0.754 0.085 11.809
5 5 1 1 -0.228 0.020 49.598
6 6 1 6 -0.212 0.004 246.180
To normalize every Unicode dash (category Pd, dash punctuation) in one pass:
dat <- gsub("\\p{Pd}", "-", dat, perl=TRUE)
See regex proof.
See https://www.fileformat.info/info/unicode/category/Pd/list.htm:
Code point Name Glyph
U+002D HYPHEN-MINUS -
U+058A ARMENIAN HYPHEN ֊
U+05BE HEBREW PUNCTUATION MAQAF ־
U+1400 CANADIAN SYLLABICS HYPHEN ᐀
U+1806 MONGOLIAN TODO SOFT HYPHEN ᠆
U+2010 HYPHEN ‐
U+2011 NON-BREAKING HYPHEN ‑
U+2012 FIGURE DASH ‒
U+2013 EN DASH –
U+2014 EM DASH —
U+2015 HORIZONTAL BAR ―
U+2E17 DOUBLE OBLIQUE HYPHEN ⸗
U+2E1A HYPHEN WITH DIAERESIS ⸚
U+2E3A TWO-EM DASH ⸺
U+2E3B THREE-EM DASH ⸻
U+2E40 DOUBLE HYPHEN ⹀
U+301C WAVE DASH 〜
U+3030 WAVY DASH 〰
U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN ゠
U+FE31 PRESENTATION FORM FOR VERTICAL EM DASH ︱
U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH ︲
U+FE58 SMALL EM DASH ﹘
U+FE63 SMALL HYPHEN-MINUS ﹣
U+FF0D FULLWIDTH HYPHEN-MINUS －
U+10EAD YEZIDI HYPHENATION MARK 𐺭
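As a small self-contained check of the \p{Pd} approach, the sketch below uses a two-row made-up sample rather than the full dat above. The extra step that strips thousands separators is an assumption on my part: without it, values such as 1,361.067 keep the Precision column as character rather than numeric.

```r
# Made-up two-row sample; \u2013 is an en dash, \u2014 an em dash
dat_small <- "
g Precision
\u2013.251 41.455
\u2014.069 1,361.067
"

# Replace any Unicode dash punctuation with an ASCII hyphen-minus
clean <- gsub("\\p{Pd}", "-", dat_small, perl = TRUE)

# Assumption: commas are thousands separators and can simply be dropped
clean <- gsub(",", "", clean, fixed = TRUE)

d <- read.table(text = clean, header = TRUE)
str(d)
```

Both columns should now come through as numeric, so no further conversion is needed before analysis.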
Example using stringr.
library(stringr)
library(dplyr)
x <- str_replace_all(dat, "–", "-")
tibble(read.table(textConnection(x), header = TRUE))
I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. A large group of people is followed for 25 years, recording changes in 'region' (categorical variable, 1-4), 'urban' (dummy), 'wage' and 'educ'.
How do I count the total number of times 'region' or 'urban' changed (e.g., from region 1 to region 3, or from urban 0 to 1) during the 25-year observation period within each subject? I also have some NAs in the data, which should be ignored.
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something that I hope will get you closer to what you need.
First, group by i. Then create a column that holds a 1 for each change in region, by comparing the current value of region with the previous one (using lag). Note that if the previous value is NA (as it is for the first row of a given i), it is counted as no change.
The same approach is taken for urban. Then, summarize by totaling up all the changes for each i. To inspect the intermediate reg_change and urban_change columns, run the pipeline without the final summarize step.
Edit: If you wish to remove rows that have NA for region or urban you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
  drop_na(region, urban) %>%   # ignore rows where region or urban is missing
  group_by(i) %>%
  # 1 if the value differs from the previous row; the first row of each i counts as 0
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
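As a cross-check of the lag-based logic, the same counts can be sketched with diff(), which is non-zero exactly where consecutive values differ. The data frame below is a shortened, hand-typed version of the i = 4 rows from the question (only region and urban are kept, and the long run of identical years is cut down), so it is an illustration rather than the real data.

```r
library(dplyr)
library(tidyr)

# Shortened version of the i = 4 example; NA rows mimic 1980 and 1994
df4 <- data.frame(
  i      = 4,
  region = c(1, NA, 4, 4, 1, 1, NA, 4),
  urban  = c(1, NA, 1, 1, 1, 1, NA, 1)
)

changes <- df4 %>%
  drop_na(region, urban) %>%                         # NAs are ignored, as requested
  group_by(i) %>%
  summarize(change_reg   = sum(diff(region) != 0),   # number of region switches
            change_urban = sum(diff(urban) != 0))    # number of urban switches

changes
```

For this subset the result agrees with the expected output stated in the question (3 region changes, 0 urban changes).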
Hi, I need to calculate the cumulative insect days for some of my experiments. This is what my data frame looks like:
Rep trt date BLB
1 I 1 7/12/2017 3
2 I 2 7/12/2017 2
3 I 3 7/12/2017 4
4 I 4 7/12/2017 0
5 II 1 7/12/2017 1
6 II 2 7/12/2017 2
7 II 3 7/12/2017 2
8 II 4 7/12/2017 1
9 III 1 7/12/2017 3
10 III 2 7/12/2017 2
11 III 3 7/12/2017 1
12 III 4 7/12/2017 1
13 IV 1 7/12/2017 0
14 IV 2 7/12/2017 3
15 IV 3 7/12/2017 3
16 IV 4 7/12/2017 0
17 I 1 7/20/2017 12
18 I 2 7/20/2017 6
19 I 3 7/20/2017 7
20 I 4 7/20/2017 18
21 II 1 7/20/2017 17
22 II 2 7/20/2017 11
23 II 3 7/20/2017 25
24 II 4 7/20/2017 17
25 III 1 7/20/2017 18
26 III 2 7/20/2017 6
27 III 3 7/20/2017 48
28 III 4 7/20/2017 13
29 IV 1 7/20/2017 7
30 IV 2 7/20/2017 22
31 IV 3 7/20/2017 18
32 IV 4 7/20/2017 11
33 I 1 7/27/2017 1
34 I 2 7/27/2017 3
35 I 3 7/27/2017 4
36 I 4 7/27/2017 0
37 II 1 7/27/2017 1
38 II 2 7/27/2017 0
39 II 3 7/27/2017 1
40 II 4 7/27/2017 0
41 III 1 7/27/2017 1
42 III 2 7/27/2017 1
43 III 3 7/27/2017 0
44 III 4 7/27/2017 0
45 IV 1 7/27/2017 1
46 IV 2 7/27/2017 0
47 IV 3 7/27/2017 1
48 IV 4 7/27/2017 2
49 I 1 8/2/2017 0
50 I 2 8/2/2017 0
51 I 3 8/2/2017 1
52 I 4 8/2/2017 0
53 II 1 8/2/2017 0
54 II 2 8/2/2017 0
55 II 3 8/2/2017 0
56 II 4 8/2/2017 0
57 III 1 8/2/2017 1
58 III 2 8/2/2017 0
59 III 3 8/2/2017 0
60 III 4 8/2/2017 0
61 IV 1 8/2/2017 0
62 IV 2 8/2/2017 0
63 IV 3 8/2/2017 0
64 IV 4 8/2/2017 2
Structure would be:
'data.frame': 64 obs. of 4 variables:
$ Rep : Factor w/ 4 levels "I","II","III",..: 1 1 1 1 2 2 2 2 3 3 ...
$ trt : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
$ date: Factor w/ 4 levels "7/12/2017","7/20/2017",..: 1 1 1 1 1 1 1 1 1 1 ...
$ BLB : int 3 2 4 0 1 2 2 1 3 2 ...
To do this, I need to calculate the average number of insects for each pair of consecutive dates within each treatment. For example, I have to calculate the average between 7/12 and 7/20 for each treatment, then the average between 7/20 and 7/27, and so on. Does anyone know how to do this in R? I really appreciate the help!
First create data (would be nice if you provided dput(data)...):
set.seed(123)
df = data.frame(Rep = rep(c("I","II","III","IV"), each = 4, times = 4),
                trt = as.factor(rep(1:4, times = 16)),
                date = as.Date(rep(c("7/12/2017", "7/20/2017", "7/27/2017", "8/2/2017"), each = 16),
                               format = "%m/%d/%Y"),
                BLB = sample(0:50, 64, replace = TRUE))
> str(df)
'data.frame': 64 obs. of 4 variables:
$ Rep : Factor w/ 4 levels "I","II","III",..: 1 1 1 1 2 2 2 2 3 3 ...
$ trt : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
$ date: Date, format: "2017-07-12" "2017-07-12" "2017-07-12" ...
$ BLB : int 14 40 20 45 47 2 26 45 28 23 ...
Simple subsetting and aggregation:
# Create subset for each date group
date_group1 = subset(df, df$date %in% c(as.Date("2017-07-12"),
as.Date("2017-07-20")))
date_group2 = subset(df, df$date %in% c(as.Date("2017-07-20"),
as.Date("2017-07-27")))
date_group3 = subset(df, df$date %in% c(as.Date("2017-07-27"),
as.Date("2017-08-02")))
# Aggregate by treatment in each date_group
aggregate(BLB ~ trt, data = date_group1, mean)
aggregate(BLB ~ trt, data = date_group2, mean)
aggregate(BLB ~ trt, data = date_group3, mean)
# > aggregate(BLB ~ trt, data = date_group1, mean)
# trt BLB
# 1 1 28.375
# 2 2 21.750
# 3 3 27.875
# 4 4 41.500
# > aggregate(BLB ~ trt, data = date_group2, mean)
# trt BLB
# 1 1 23.875
# 2 2 19.875
# 3 3 21.625
# 4 4 31.250
# > aggregate(BLB ~ trt, data = date_group3, mean)
# trt BLB
# 1 1 22.375
# 2 2 21.250
# 3 3 17.875
# 4 4 17.500
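The three hand-written subsets above can also be generated in a loop over consecutive date pairs. The following is a minimal sketch on a tiny made-up data frame (two treatments, three dates); the names df_small, pair_means and result are illustrative only.

```r
# Tiny made-up data set: 2 treatments observed on 3 dates
df_small <- data.frame(
  trt  = factor(rep(1:2, times = 3)),
  date = as.Date(rep(c("2017-07-12", "2017-07-20", "2017-07-27"), each = 2)),
  BLB  = c(3, 2, 12, 6, 1, 3)
)

dates <- sort(unique(df_small$date))

# For each pair of consecutive dates, average BLB per treatment
pair_means <- lapply(seq_len(length(dates) - 1), function(k) {
  pair <- dates[k + 0:1]
  out  <- aggregate(BLB ~ trt, data = df_small[df_small$date %in% pair, ], FUN = mean)
  data.frame(from = pair[1], to = pair[2], out)
})

result <- do.call(rbind, pair_means)
result
```

The result has one row per date pair and treatment, which keeps the output in a single data frame instead of one printed table per subset.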
You have missed some date combinations, #useR.
There are also
(2017-07-12, 2017-07-27),
(2017-07-12, 2017-08-02) and
(2017-07-20, 2017-08-02).
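If every pair of dates is wanted, not only consecutive ones, the pairs can be enumerated with combn(). This sketch works on index positions so that the Date class is preserved; the name pairs is illustrative.

```r
dates <- as.Date(c("2017-07-12", "2017-07-20", "2017-07-27", "2017-08-02"))

# combn() on the number of dates yields a 2 x 6 matrix of index pairs
idx   <- combn(length(dates), 2)
pairs <- data.frame(from = dates[idx[1, ]], to = dates[idx[2, ]])

pairs
```

With four sampling dates this gives six pairs: the three consecutive ones plus the three listed in the comment above.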
The first CSV file is called "CLAIM", and this is part of its data.
The second CSV file is called "CUSTOMER", and this is part of its data.
First, I want to merge the two data sets on their common column.
Second, I want to remove all columns that contain NA values.
Third, I want to remove the variables SIU_CUST_YN, CTPR, OCCP_GRP_2, RECP_DATE and RESN_DATE.
Fourth, I want to remove the rows where OCCP_GRP_1 is empty.
The expected result is:
dim(data_fin)
## [1] 114886 11
head(data_fin)
## CUST_ID DIVIDED_SET SEX AGE OCCP_GRP_1 CHLD_CNT WEDD_YN CHANG_FP_YN
## 1 1 1 2 47 3.사무직 2 Y Y
## 2 1 1 2 47 3.사무직 2 Y Y
## 3 1 1 2 47 3.사무직 2 Y Y
## 4 1 1 2 47 3.사무직 2 Y Y
## 5 2 1 1 53 3.사무직 2 Y Y
## 6 2 1 1 53 3.사무직 2 Y Y
## DMND_AMT PAYM_AMT NON_PAY_RATIO
## 1 52450 52450 0.4343986
## 2 24000 24000 0.8823529
## 3 17500 17500 0.7272727
## 4 47500 47500 0.9217391
## 5 99100 99100 0.8623195
## 6 7817 7500 0.8623195
str(data_fin)
## 'data.frame': 114886 obs. of 11 variables:
## $ CUST_ID : int 1 1 1 1 2 2 2 3 4 4 ...
## $ DIVIDED_SET : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SEX : int 2 2 2 2 1 1 1 1 2 2 ...
## $ AGE : int 47 47 47 47 53 53 53 60 64 64 ...
## $ OCCP_GRP_1 : Factor w/ 9 levels "","1.주부","2.자영업",..: 4 4 4 4 4 4 4 6 3 3 ...
## $ CHLD_CNT : int 2 2 2 2 2 2 2 0 0 0 ...
## $ WEDD_YN : Factor w/ 3 levels "","N","Y": 3 3 3 3 3 3 3 2 2 2 ...
## $ CHANG_FP_YN : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 2 ...
## $ DMND_AMT : int 52450 24000 17500 47500 99100 7817 218614 430000 200000 120000 ...
## $ PAYM_AMT : int 52450 24000 17500 47500 99100 7500 218614 430000 200000 120000 ...
## $ NON_PAY_RATIO: num 0.434 0.882 0.727 0.922 0.862 ...
So I wrote the following code:
#gc(reset=T); rm(list=ls())
getwd()
setwd("/Users/Hong/Downloads")
getwd()
CUSTOMER <- read.csv("CUSTOMER.csv", header=T)
CLAIM <- read.csv("CLAIM.csv", header=T)
#install.packages("dplyr")
library("dplyr")
merge(CUSTOMER, CLAIM, by='CUST_ID', all.y=TRUE)
merged_data <- merge(CUSTOMER, CLAIM)
omitted_data <- na.omit(merged_data)
deducted_data <- head(select(omitted_data, -SIU_CUST_YN, -CTPR, -OCCP_GRP_2, -RECP_DATE, -RESN_DATE), 115327)
data_fin <- head(filter(deducted_data, OCCP_GRP_1 !=""), 115327)
dim(data_fin)
head(data_fin)
str(data_fin)
Next,
1) I need the top 3 OCCP_GRP_1 groups with the highest NON_PAY_RATIO.
2) I need the CUST_ID values with a DMND_AMT over 600,000.
I don't know how to write this.
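A possible starting point for both steps, sketched with dplyr on a small invented stand-in for data_fin. Two assumptions are made here because the question leaves them open: "high" is read as the highest mean NON_PAY_RATIO per OCCP_GRP_1 group, and "over 600,000" is applied to individual claim rows rather than to a per-customer total. slice_max() requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Invented stand-in for data_fin; group labels and amounts are illustrative
data_fin <- data.frame(
  CUST_ID       = c(1, 1, 2, 3, 4, 5),
  OCCP_GRP_1    = c("A", "A", "B", "C", "D", "D"),
  DMND_AMT      = c(52450, 700000, 99100, 650000, 7817, 120000),
  NON_PAY_RATIO = c(0.4, 0.6, 0.9, 0.2, 0.7, 0.8)
)

# 1) Top 3 occupation groups by mean NON_PAY_RATIO (assumed meaning of "high")
top3 <- data_fin %>%
  group_by(OCCP_GRP_1) %>%
  summarise(mean_ratio = mean(NON_PAY_RATIO)) %>%
  slice_max(mean_ratio, n = 3)

# 2) Customers with at least one claim whose DMND_AMT exceeds 600,000
big_ids <- data_fin %>%
  filter(DMND_AMT > 600000) %>%
  distinct(CUST_ID)

top3
big_ids
```

If the threshold should apply to each customer's total instead, a group_by(CUST_ID) with summarise(sum(DMND_AMT)) before the filter would be the natural variation.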
I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal of observations to follow these rules. The variables below are id, the sex of the household head (1 = male, 2 = female) and the age of the household head. If a household has both a male and a female household head, remove the female household head observation. If a household has either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
You can do this by first ordering the data.frame so the desired entry for each id is first, and then remove the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
With data.table, this is easy with "compound queries". Set the key to "id,sex" when you create the data.table so that it is sorted by those columns (this is required in case any female rows come before male rows for a given id).
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45
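For comparison, the same "sort, then keep the first row per id" idea can be sketched with dplyr, using the example data from the question. This is an alternative formulation, not something taken from the answers above.

```r
library(dplyr)

# Example data from the question
data <- data.frame(
  id  = c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10),
  sex = c(1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1),
  age = c(32, 34, 54, 23, 32, 56, 67, 45, 51, 43, 35, 80, 45)
)

kept <- data %>%
  arrange(id, sex, desc(age)) %>%    # males first, then oldest first within each id
  distinct(id, .keep_all = TRUE)     # keep only the first row per id

kept
```

The retained rows match the base R and data.table results shown above: one row per id, preferring the male head and, within the same sex, the older one.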