I have two data frames, CDD26_FF (5593 rows) and CDD_HI (5508 rows), with the structure (columns) shown below. CDDs are "consecutive dry days", and the two tables show species exposure to CDD in the far future (FF) and in the historical period (HI).
I want to focus only on the "Biom" and "Species_name" columns.
As you can see, the two tables share the same "Species_name" and "Biom" values (areas of the world with the same climatic conditions). "Biom" values go from 0 to 15. However, a "Species_name" does not always appear in both tables (e.g. Abrocomo_ben); furthermore, the two tables do not always contain the same combinations of "Species_name" and "Biom" (a combination is simply the population of that species belonging to that Biom).
CDD26_FF:
CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1        1        13         10   Abrocomo_ben 0.076923
1        1        8          1    Abrocomo_cin 0.125000
1        1        30         10   Abrocomo_cin 0.033333
1        2        10         1    Abrothrix_an 0.200000
1        1        44         10   Abrothrix_an 0.022727
1        3        6          2    Abrothrix_je 0.500000
1        1        7          12   Abrothrix_lo 0.142857
CDD_HI:
CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1      1        8          1    Abrocomo_cin 0.125000
1      5        30         10   Abrocomo_cin 0.166666
1      1        5          2    Abrocomo_cin 0.200000
1      1        10         1    Abrothrix_an 0.100000
1      1        44         10   Abrothrix_an 0.022727
1      6        18         1    Abrothrix_je 0.333333
1      1        23         4    Abrothrix_lo 0.130434
I want to highlight rows that have the same combination of "Species_name" and "Biom": in the example these are lines 3, 4, 5 from CDD26_FF, matching lines 2, 4, 5 from CDD_HI, respectively. I want to store these lines in a new table, keeping not only the "Species_name" and "Biom" columns (as the compare() function seems to do) but all the other columns as well.
More precisely, I then want to calculate the ratio "AreaCellSuAreaTotal" / "AreaCellSuAreaTot_HI" from the highlighted lines.
How can I do that?
Aside from compare(), I tried a for loop, but the lengths of the tables differ, so I tried a 3-nested for loop, still without results. I also tried compareDF() and semi_join(). No results until now. Thank you for your help.
You could use an inner join (provided by dplyr). An inner join returns all rows that are present in both tables/data.frames and that satisfy the matching condition (in this case: matching "Biom" and "Species_name").
Subsequently it is easy to calculate the ratio using mutate():
library(dplyr)
cdd26_f %>%
  inner_join(cdd_hi, by = c("Biom", "Species_name")) %>%
  mutate(ratio = AreaCellSuAreaTotal / AreaCellSuAreaTot_HI) %>%
  select(Biom, Species_name, ratio)
returns
# A tibble: 4 x 3
Biom Species_name ratio
<dbl> <chr> <dbl>
1 1 Abrocomo_cin 1
2 10 Abrocomo_cin 0.200
3 1 Abrothrix_an 2
4 10 Abrothrix_an 1
Note: remove the select() step if you need all columns, or adapt it to keep other columns.
Data
cdd26_f <- readr::read_table2("CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857")
cdd_hi <- readr::read_table2("CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434")
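For completeness, the same inner join can be sketched in base R with merge(). The toy frames below are illustrative stand-ins built from a few rows of the question's data, not the full tables:

```r
# Toy subsets of the two tables (same column names as in the question)
cdd26_f <- data.frame(
  Biom = c(1, 10, 1, 10),
  Species_name = c("Abrocomo_cin", "Abrocomo_cin", "Abrothrix_an", "Abrothrix_an"),
  AreaCellSuAreaTotal = c(0.125000, 0.033333, 0.200000, 0.022727)
)
cdd_hi <- data.frame(
  Biom = c(1, 10, 1, 10),
  Species_name = c("Abrocomo_cin", "Abrocomo_cin", "Abrothrix_an", "Abrothrix_an"),
  AreaCellSuAreaTot_HI = c(0.125000, 0.166666, 0.100000, 0.022727)
)

# merge() performs an inner join by default: only Biom/Species_name
# combinations present in both frames are kept, with all other columns
joined <- merge(cdd26_f, cdd_hi, by = c("Biom", "Species_name"))
joined$ratio <- joined$AreaCellSuAreaTotal / joined$AreaCellSuAreaTot_HI
```

Non-key columns that exist in both tables (e.g. AreaCell) get .x/.y suffixes, so every original column survives the join.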
I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. They follow a large group of people for 25 years and record changes in 'region' (categorical variable, 1-4), 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' changed (e.g. from region 1 to region 3, or from urban 0 to 1) during the 25-year observation period within each subject? There are also some NAs in the data, which should be ignored.
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First you group by i. Then you create a column that indicates a 1 for each change in region, by comparing the current value of region with the previous value (using lag). Note that if the previous value is NA (i.e. at the first value for a given i), it is treated as no change.
The same approach is taken for urban. Then summarize, totaling up all the changes for each i. I left in these temporary variables so you can check whether you are getting the desired results.
Edit: if you wish to remove rows that have NA for region or urban, you can add drop_na first.
library(dplyr)
library(tidyr)

df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: afterwards, to get a grand total for both the tot_region and tot_urban columns, you can use colSums (store the earlier result as df_tot, as above):
colSums(df_tot[-1])
tot_region tot_urban
6 2
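The same per-individual counting can also be sketched in base R with diff() after dropping NAs. The count_changes helper and the toy vector below are illustrative; the vector mimics the region column for i = 4:

```r
# Count how often a value changes along a vector, ignoring NAs
count_changes <- function(x) {
  x <- x[!is.na(x)]   # drop missing observations first, as required
  sum(diff(x) != 0)   # a nonzero difference marks one change
}

# Toy sequence mimicking region for i = 4: 1 -> 4 -> 1 -> 4, with NAs
region_i4 <- c(1, NA, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 4)
n_changes <- count_changes(region_i4)  # 3 changes
```

This works because region and urban are coded numerically; for genuinely non-numeric categories, compare x[-1] != x[-length(x)] instead of using diff(). Applied per individual (e.g. via tapply(region, i, count_changes)), it reproduces the per-i totals above.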
I am having an issue with the arulesSequences package in R. I was able to read baskets into the program, and create a data.frame, however it fails to recognize any other items beyond the first column. Below is a sample of my data set, which follows the form demonstrated here: Data Mining Algorithms in R/Sequence Mining/SPADE.
[sequenceID] [eventID] [SIZE] items
2 1 1 OB/Gyn
15 1 1 Internal_Medicine
15 2 1 Internal_Medicine
15 3 1 Internal_Medicine
56 1 2 Internal_Medicine Neurology
84 1 1 Oncology
151 1 2 Hematology Hematology
151 2 1 Hematology/Oncology
151 3 1 Hematology/Oncology
185 1 2 Gastroenterology Gastroenterology
The dataset was exported from SAS as a .CSV file, then converted to a tab-delimited .TXT file in Excel. Headers were removed for import into R, but I placed them in brackets above for clarity in this example. All spaces were replaced with an underscore ("_"), and item names were simplified as much as possible. Each item is listed in a separate column. The following command was used to import the file:
baskets <- read_baskets(con = "...filepath/spade.txt", sep = "[ \t]+",info=c("sequenceID", "eventID", "SIZE"))
I am presented with no errors, so I continue with the following command:
as(baskets, "data.frame")
Here, it returns the data.frame as requested, however it fails to capture the items beyond the first column:
items sequenceID eventID SIZE
{OB/Gyn} 2 1 1
{Internal_Medicine} 15 1 1
{Internal_Medicine} 15 2 1
{Internal_Medicine} 15 3 1
{Internal_Medicine} 56 1 2
{Oncology} 84 1 1
{Hematology} 151 1 2
{Hematology/Oncology} 151 2 1
{Hematology/Oncology} 151 3 1
{Gastroenterology} 185 1 2
Line 5 should look like:
{Internal_Medicine, Neurology} 56 1 2
I have tried importing the file directly as a [.CSV], but the data.frame results in a similar format to my above attempt using tabs, except it places a comma in front of the first item:
{,Internal_Medicine} 56 1 2
Any troubleshooting suggestions would be greatly appreciated. It seems like this package is picky when it comes to formatting.
Check out the following reproducible example, which works with these package versions:
library(arulesSequences)
packageVersion("arulesSequences")
# [1] ‘0.2.16’
packageVersion("arules")
# [1] ‘1.5.0’
txt <- readLines(n=10)
2 1 1 OB/Gyn
15 1 1 Internal_Medicine
15 2 1 Internal_Medicine
15 3 1 Internal_Medicine
56 1 2 Internal_Medicine Neurology
84 1 1 Oncology
151 1 2 Hematology Hematology
151 2 1 Hematology/Oncology
151 3 1 Hematology/Oncology
185 1 2 Gastroenterology Gastroenterology
writeLines(txt, tf<-tempfile())
baskets <- read_baskets(con = tf, sep = "[ \t]+",info=c("sequenceID", "eventID", "SIZE"))
as(baskets, "data.frame")
# items sequenceID eventID SIZE
# 1 {OB/Gyn} 2 1 1
# 2 {Internal_Medicine} 15 1 1
# 3 {Internal_Medicine} 15 2 1
# 4 {Internal_Medicine} 15 3 1
# 5 {Internal_Medicine,Neurology} 56 1 2 # <----------
# 6 {Oncology} 84 1 1
# 7 {Hematology} 151 1 2
# 8 {Hematology/Oncology} 151 2 1
# 9 {Hematology/Oncology} 151 3 1
# 10 {Gastroenterology} 185 1 2
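If the same call still drops items on your file, a quick sanity check is to split one raw line with the same separator regex passed to read_baskets. The sample line below is taken from the data above:

```r
# Split one raw line with the separator regex used by read_baskets
line <- "56 1 2 Internal_Medicine Neurology"
parts <- strsplit(line, "[ \t]+")[[1]]

info  <- parts[1:3]     # sequenceID, eventID, SIZE
items <- parts[-(1:3)]  # everything after the info columns is an item
```

If a line starts with a stray separator (which the {,Internal_Medicine} output from your CSV attempt suggests), the first element of parts will be an empty string; trimming each line with trimws() before import removes that artifact.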
I have written some code for a university assignment. The assignment is based on various concrete samples and their tensile strengths. There are 20 types of concrete mixtures (made from four different accelerators, and five different plasticisers). Our job is to do a statistical analysis on this data frame:
TStrength accelerator plasticiser
1 3.417543 1 1
2 2.887113 1 2
3 3.600988 1 3
4 3.702631 1 4
5 3.686944 1 5
6 3.699785 1 1
7 3.112972 1 2
8 3.918160 1 3
9 3.600538 1 4
10 2.748832 1 5
11 3.404498 1 1
12 3.735437 1 2
13 3.347577 1 3
14 3.101556 1 4
15 3.527621 1 5
16 3.856831 1 1
17 3.492118 1 2
18 3.928343 1 3
19 3.511689 1 4
20 3.371985 1 5
21 3.069794 2 1
22 3.168010 2 2
23 3.316657 2 3
24 3.455162 2 4
25 2.818250 2 5
26 4.054507 2 1
27 3.065984 2 2
28 3.201351 2 3
29 3.417554 2 4
30 3.364320 2 5
31 3.218677 2 1
32 2.647151 2 2
33 3.222705 2 3
34 3.145210 2 4
35 3.636642 2 5
36 3.317620 2 1
37 3.645922 2 2
38 2.556071 2 3
39 3.177663 2 4
40 3.014374 2 5
41 3.838183 3 1
42 4.155951 3 2
43 3.886330 3 3
44 3.723898 3 4
45 4.425442 3 5
46 3.738460 3 1
47 3.217834 3 2
48 3.942241 3 3
49 3.699851 3 4
50 3.797089 3 5
51 3.652456 3 1
52 4.851609 3 2
53 3.359099 3 3
54 4.089559 3 4
55 4.282991 3 5
56 3.803784 3 1
57 3.519551 3 2
58 3.935084 3 3
59 3.890324 3 4
60 4.611936 3 5
61 3.343098 4 1
62 3.713952 4 2
63 3.629883 4 3
64 3.082509 4 4
65 3.346548 4 5
66 3.277845 4 1
67 3.509506 4 2
68 3.490567 4 3
69 3.235009 4 4
70 3.970925 4 5
71 3.504646 4 1
72 3.270798 4 2
73 3.547298 4 3
74 3.278489 4 4
75 3.322743 4 5
76 2.975010 4 1
77 3.384996 4 2
78 3.399486 4 3
79 3.703567 4 4
80 3.214973 4 5
My first step was to attempt to find the means of the TStrength values for each of the 20 concrete types (there are four samples of each unique combination). I am very new to R, and my code is certainly not beautiful, but this is what I wrote to find the means:
#Setting the correct directory
setwd("C:/Users/Matthew/Desktop/Work/Engineering")
#Creating the data frame object, Concrete.
#Note that this will only work if the file
#s...-CW.dat is in the current working directory
#Therefore for this code to work, CreateData.r must
#be run on the individual computer with the
#given matriculation number, and the file must be saved
#in the specified directory
Concrete<-read.table(file='s...-CW.dat',header=TRUE)
#Since the samples of concrete are made from 4 different accelerators and
#5 different plasticisers there will be 4*5=20 unique combinations from
#which concrete samples can come from (i.e. 1,1; 1,2; 4,5 etc).
# There are four samples of each combination
#The next section of code is used to find the mean of the four samples,
#for each combination (20 total)
#creating a list with Tstrength from all (1,1) combinations
#Then finding average
combo1 = list(Concrete[1,1],Concrete[6,1],Concrete[11,1],Concrete[16,1])
combo1mean = mean(unlist(combo1))
#Repeating for (1,2)
combo2 = list(Concrete[2,1],Concrete[7,1],Concrete[12,1],Concrete[17,1])
combo2mean = mean(unlist(combo2))
#Repeating for (1,3)
combo3 = list(Concrete[3,1],Concrete[8,1],Concrete[13,1],Concrete[18,1])
combo3mean = mean(unlist(combo3))
#Repeating for (1,4)
combo4 = list(Concrete[4,1],Concrete[9,1],Concrete[14,1],Concrete[19,1])
combo4mean = mean(unlist(combo4))
#Repeating for (1,5)
combo5 = list(Concrete[5,1],Concrete[10,1],Concrete[15,1],Concrete[20,1])
combo5mean = mean(unlist(combo5))
#Repeating for (2,1)
combo6 = list(Concrete[21,1],Concrete[26,1],Concrete[31,1],Concrete[36,1])
combo6mean = mean(unlist(combo6))
#Repeating for (2,2)
combo7 = list(Concrete[22,1],Concrete[27,1],Concrete[32,1],Concrete[37,1])
combo7mean = mean(unlist(combo7))
#Repeating for (2,3)
combo8 = list(Concrete[23,1],Concrete[28,1],Concrete[33,1],Concrete[38,1])
combo8mean = mean(unlist(combo8))
#Repeating for (2,4)
combo9 = list(Concrete[24,1],Concrete[29,1],Concrete[34,1],Concrete[39,1])
combo9mean = mean(unlist(combo9))
#Repeating for (2,5)
combo10 = list(Concrete[25,1],Concrete[30,1],Concrete[35,1],Concrete[40,1])
combo10mean = mean(unlist(combo10))
#Repeating for (3,1)
combo11 = list(Concrete[41,1],Concrete[46,1],Concrete[51,1],Concrete[56,1])
combo11mean = mean(unlist(combo11))
#Repeating for (3,2)
combo12 = list(Concrete[42,1],Concrete[47,1],Concrete[52,1],Concrete[57,1])
combo12mean = mean(unlist(combo12))
#Repeating for (3,3)
combo13 = list(Concrete[43,1],Concrete[48,1],Concrete[53,1],Concrete[58,1])
combo13mean = mean(unlist(combo13))
#Repeating for (3,4)
combo14 = list(Concrete[44,1],Concrete[49,1],Concrete[54,1],Concrete[59,1])
combo14mean = mean(unlist(combo14))
#Repeating for (3,5)
combo15 = list(Concrete[45,1],Concrete[50,1],Concrete[55,1],Concrete[60,1])
combo15mean = mean(unlist(combo15))
#Repeating for (4,1)
combo16 = list(Concrete[61,1],Concrete[66,1],Concrete[71,1],Concrete[76,1])
combo16mean = mean(unlist(combo16))
#Repeating for (4,2)
combo17 = list(Concrete[62,1],Concrete[67,1],Concrete[72,1],Concrete[77,1])
combo17mean = mean(unlist(combo17))
#Repeating for (4,3)
combo18 = list(Concrete[63,1],Concrete[68,1],Concrete[73,1],Concrete[78,1])
combo18mean = mean(unlist(combo18))
#Repeating for (4,4)
combo19 = list(Concrete[64,1],Concrete[69,1],Concrete[74,1],Concrete[79,1])
combo19mean = mean(unlist(combo19))
#Repeating for (4,5)
combo20 = list(Concrete[65,1],Concrete[70,1],Concrete[75,1],Concrete[80,1])
combo20mean = mean(unlist(combo20))
A few notes about the code: "s..." is just my matriculation number. I have triple-checked that I have not made a mistake here regarding either the file name or the directory where it is stored. CreateData.r is just a script provided to us that generates the data used to create 'Concrete' based on our matriculation number (so we're not just blindly copying each other, I suppose).
The problem I am having with the code is that whenever it runs, the object Concrete is created, as are combo1mean, combo2mean and combo3mean. However, I just cannot figure out why the rest of the objects are not being created.
I have had no success running the script in the Rgui either. After running the script, I check that Concrete has initialised, and I check whether combo4mean and above have initialised too, but they never do. I thought it might have to do with running the wrong file, or that I hadn't saved the data properly, but the script definitely contains all the code, and I created a new file to see if that would work, but unfortunately it didn't. I have also read An Introduction to R by W.N. Venables, D.M. Smith and the R Core Team, but nothing there has helped me figure this out.
PS: I am not doing this as an easy way out of homework. I have genuinely tried to figure out what is going wrong but cannot seem to find the problem. I also apologise if the question is inaccurate in any way, or if I have misunderstood something; I am very new to R and am trying my best to learn it! Cheers in advance.
EDIT: Just in case anyone is curious, I managed to get the exact same code to work on a different computer, starting from an empty workspace. I'm still not very sure why it didn't work on the first computer, but thanks 42 for the code suggestions.
Adding code that should bypass issues related to reading a text file. This should succeed on any R installation:
Concrete <- read.table(text="TStrength accelerator plasticiser
1 3.417543 1 1
2 2.887113 1 2
3 3.600988 1 3
4 3.702631 1 4
5 3.686944 1 5
6 3.699785 1 1
7 3.112972 1 2
8 3.918160 1 3
9 3.600538 1 4
10 2.748832 1 5
11 3.404498 1 1
12 3.735437 1 2
13 3.347577 1 3
14 3.101556 1 4
15 3.527621 1 5
16 3.856831 1 1
17 3.492118 1 2
18 3.928343 1 3
19 3.511689 1 4
20 3.371985 1 5
21 3.069794 2 1
22 3.168010 2 2
23 3.316657 2 3
24 3.455162 2 4
25 2.818250 2 5
26 4.054507 2 1
27 3.065984 2 2
28 3.201351 2 3
29 3.417554 2 4
30 3.364320 2 5
31 3.218677 2 1
32 2.647151 2 2
33 3.222705 2 3
34 3.145210 2 4
35 3.636642 2 5
36 3.317620 2 1
37 3.645922 2 2
38 2.556071 2 3
39 3.177663 2 4
40 3.014374 2 5
41 3.838183 3 1
42 4.155951 3 2
43 3.886330 3 3
44 3.723898 3 4
45 4.425442 3 5
46 3.738460 3 1
47 3.217834 3 2
48 3.942241 3 3
49 3.699851 3 4
50 3.797089 3 5
51 3.652456 3 1
52 4.851609 3 2
53 3.359099 3 3
54 4.089559 3 4
55 4.282991 3 5
56 3.803784 3 1
57 3.519551 3 2
58 3.935084 3 3
59 3.890324 3 4
60 4.611936 3 5
61 3.343098 4 1
62 3.713952 4 2
63 3.629883 4 3
64 3.082509 4 4
65 3.346548 4 5
66 3.277845 4 1
67 3.509506 4 2
68 3.490567 4 3
69 3.235009 4 4
70 3.970925 4 5
71 3.504646 4 1
72 3.270798 4 2
73 3.547298 4 3
74 3.278489 4 4
75 3.322743 4 5
76 2.975010 4 1
77 3.384996 4 2
78 3.399486 4 3
79 3.703567 4 4
80 3.214973 4 5", header=TRUE)
This probably does what you are attempting with about a tenth (or less) of the code (and, more importantly, no errors):
means.by.type <- with(Concrete, tapply(TStrength,
                                       list(acc = accelerator, plas = plasticiser),
                                       FUN = mean))
means.by.type
plas
acc 1 2 3 4 5
1 3.594664 3.306910 3.698767 3.479103 3.333845
2 3.415150 3.131767 3.074196 3.298897 3.208397
3 3.758221 3.936236 3.780689 3.850908 4.279364
4 3.275150 3.469813 3.516808 3.324893 3.463797
Importantly, you forgot to offer str or dput on Concrete, so we cannot really tell whether your problem is data prep or coding.
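If you prefer the cell means in long format (one row per accelerator/plasticiser combination) rather than a matrix, aggregate() is an alternative. The toy Concrete frame below just stands in for the real data so the snippet is self-contained:

```r
# Toy stand-in for Concrete (same column names as in the question)
Concrete <- data.frame(
  TStrength   = c(3.4, 3.7, 3.1, 3.6, 4.0, 3.2),
  accelerator = c(1, 1, 1, 1, 2, 2),
  plasticiser = c(1, 1, 2, 2, 1, 1)
)

# One row per combination, with the cell mean in the TStrength column
means.long <- aggregate(TStrength ~ accelerator + plasticiser,
                        data = Concrete, FUN = mean)
```

The long format is often handier as input for plotting or for a follow-up two-way ANOVA, whereas the tapply matrix reads more like a summary table.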