I am calculating the dissimilarity index of several groups compared to the total population with the function "seg" from the identically named package.
The data consists of about 450 rows, each a different district, and around 20 columns (groups that may be segregated). The values are the number of people from respective group living in respective district. Here are the first few rows of my csv file:
Region,Germany,EU15 without Germany,Poland,Former Yugoslavia and successor countries,Former Soviet Union and successor countries,Turkey,Arabic states,West Afrika,Central Afrika,East Afrika,North America,Central America and the Carribean,South America,East and Central Asia,South and Southeast Asia - excluding Vietnam,Australia and Oceania,EU,Vietnam,Non EU Europe,Total Population
1011101,1370,372,108,35,345,91,256,18,6,3,73,36,68,272,98,3,1979,19,437,3445
1011102,117,21,6,0,0,0,6,0,0,0,7,0,6,0,7,0,156,0,3,188
1011103,2180,482,181,102,385,326,358,48,12,12,73,24,75,175,129,12,3152,34,795,5159
Since the seg function only works with two columns as input, my current code to create a table with the index for all groups looks like this:
DI_table <- as.data.frame(0)
DI_table[1,1] <- print (seg(data =dfplrcountrygroups2019[, c( "Germany", "Total.Population")]))
DI_table[1,2] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[3], "Total.Population")]))
DI_table[1,3] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[4], "Total.Population")]))
DI_table[1,4] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[5], "Total.Population")]))
# and so on...
colnames(DI_table)<- (colnames(dfplrcountrygroups2019[2:20]))
Works well, but a hassle to recode every time I change something with my data and I would like to use this method for other datasets too.
I thought I might try something like below but the seg function did not consider it a selection of two columns.
for (i in colnames(dfplrcountrygroups2019)) {
di_matrix [i] <- seg(data =dfplrcountrygroups2019[, c( "i", "Total.Population")])
}
Error in [.data.frame(dfplrcountrygroups2019, , c("i",
"Total.Population")) : undefined columns selected
I also thought of the apply function but not sure how to make it work so it repeats itself while just changing the column where "Germany" is in the example. How do I make the selection of columns change for each time I repeat the seg function?
my_function <- seg(data =dfplrcountrygroups2019[, c("Germany", "Total.Population")])
apply(X = dfplrcountrygroups2019,
FUN = my_function,
MARGIN = 2
)
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'my_function' of mode 'function' was not found
The seg package's functions such as dissim (seg::seg is being deprecated in its favor) have a specific expected data format. From the docs:
data - a numeric matrix or data frame with two columns that represent mutually exclusive population groups (e.g., Asians and non-Asians). If more than two columns are given, only the first two will be used for computing the index.
To get a data frame of the d values seg::dissim returns, where each column is a region's dissimilarity index, you can iterate over the columns, making a temporary data frame and calculating the index. Because the data you're starting with isn't made up of mutually-exclusive categories, you'll have to subtract each population from the total population column to get a not-X counterpart for each group X.
A base R option with sapply will return a named list, which you can then convert into a data frame.
di_table <- sapply(names(dat)[2:20], function(col) {
tmp_df <- dat[col]
tmp_df$other <- dat$Total.Population - dat[col]
seg::dissim(data = tmp_df)$d
}, simplify = FALSE)
as.data.frame(di_table)
#> Germany EU15.without.Germany Poland
#> 1 0.03127565 0.03989693 0.02770549
#> Former.Yugoslavia.and.successor.countries
#> 1 0.160239
#> Former.Soviet.Union.and.successor.countries Turkey Arabic.states West.Afrika
#> 1 0.08808277 0.2047 0.02266828 0.1415519
#> Central.Afrika East.Afrika North.America Central.America.and.the.Carribean
#> 1 0.08004711 0.213581 0.1116014 0.2095969
#> South.America East.and.Central.Asia
#> 1 0.08486598 0.2282734
#> South.and.Southeast.Asia...excluding.Vietnam Australia.and.Oceania EU
#> 1 0.0364721 0.213581 0.04394527
#> Vietnam Non.EU.Europe
#> 1 0.05505789 0.06624686
A couple tidyverse options: you can use purrr functions to do something like above in one step.
dat[2:20] %>%
purrr::map(~data.frame(value = ., other = dat$Total.Population - .)) %>%
purrr::map_dfc(~seg::dissim(data = .)$d)
# same output
Or with reshaping the data and splitting by county. This takes more steps, but might fit a larger workflow better.
library(dplyr)
dat %>%
tidyr::pivot_longer(c(-Region, -Total.Population)) %>%
mutate(other = Total.Population - value) %>%
split(.$name) %>%
purrr::map_dfc(~seg::dissim(data = .[c("value", "other")])$d)
# same output
Related
I'm working with R for the first time for a class in college. To preface this: I don't know enough to know what I don't know, so I'm sorry if this question has been asked before. I am trying to predict the results of the Texas state house elections in 2020, and I think the best prior for that is the results of the 2018 state house elections. There are 150 races, so I can't bare to input them all by hand, but I can't find any spreadsheet that has data formatted how I want it. I want it in a pretty standard table format:
My desired table format. However, the table from the Secretary of state I have looks like the following:
Gross ugly table.
I wrote some psuedo code:
Here's the Psuedo Code, basically we want to construct a new CSV:
'''%First, we want to find a district, the house races are always preceded by a line of dashes, so I will need a function like this:
Create a New CSV;
for(x=1; x<151 ; x +=1){
Assign x to the cell under the district number cloumn;
Find "---------------" ;
Go down one line;
Go over two lines;
% We should now be in the third column and now want to read in which party got how many votes. The number of parties is not consistant, so we need to account for uncontested races, libertarians, greens, and write ins. I want totals for Republicans, Democrats, and Other.
while(cell is not empty){
Party <- function which reads cell (but I want to read a string);
go right one column;
Votes <- function which reads cell (but I want to read an integer);
if(Party = Rep){
put this data in place in new CSV;
else if (Party = Dem)
put this data in place in new CSV;
else
OtherVote += Votes;
};
};
Assign OtherVote to the column for other party;
OtherVote <- 0;
%Now I want to assign 0 to null cells (ones where no rep, or no Dem, or no other party contested
read through single row 4 spaces, if its null assign it 0;
Party <- null
};'''
But I don't know enough to google what to do! Here's what I need help with: Can I create a new CSV in Rstudio, how? How can I read specific cells in a table, hopefully indexing? Lastly, how do I write to a table in R. Any help is appreciated! Thank you!
Can I create a new CSV in Rstudio, how?
Yes you can. Use the "write.csv" function.
write.csv(df, file = "df.csv") #see help for more information.
How can I read specific cells in a table?
Use the brackets after df,example below.
df <- data.frame(x = c(1,2,3), y = c("A","B","C"), z = c(15,25,35))
df[1,1]
#[1] 1
df[1,1:2]
# x y
#1 1 A
How do I write to a table in R?
If you want to write a table in xlsx use the function write.xlsx from openxlsx package.
Wikipedia seems to have a table that is closer to the format you are looking for.
In order to get to the table you are looking for we need a few steps:
Download data from Wikipedia and extract table.
Clean up table.
Select columns.
Calculate margins.
1. Download data from wikipedia and extract table.
The rvest table helps with downloading and parsing websites into R objects.
First we download the HTML of the whole website.
library(dplyr)
library(rvest)
wiki_html <-
read_html(
"https://en.wikipedia.org/wiki/2018_United_States_House_of_Representatives_elections_in_Texas"
)
There are a few ways to get a specific object from an HTML file in this case
I dedided to look for the table that has the class name “wikitable plainrowheaders sortable”,
as I learned from inspecting the code, that the only table with that class is
the one we want to extract.
library(purrr)
html_nodes(wiki_html, "table") %>%
map_lgl( ~ html_attr(., "class") == "wikitable plainrowheaders sortable") %>%
which()
#> [1] 20
Then we can select table number 20 and convert it to a dataframe with html_table()
raw_table <-
html_nodes(wiki_html, "table")[[20]] %>%
html_table(fill = TRUE)
2. Clean up table.
The table has duplicated names, we can change that by using as_tibble() and its .name_repair argument. We then usedplyr::select() to get the columns. Furthermore we usedplyr::filter() to delete the first two rows, that have "District" as a value in theDistrictcolumn. Now the columns are still characters
vectors, but we need them to be numeric, therefore we first delete commas from
all columns and then transform columns 2 to 4 to numeric.
clean_table <-
raw_table %>%
as_tibble(.name_repair = "unique") %>%
filter(District != "District") %>%
mutate_all( ~ gsub(",", "", .)) %>%
mutate_at(2:4, as.numeric)
3. Select columns and 4. Calculate margins.
We use dplyr::select() to select the columns you are interested in and give them more helpful names.
Finally we calculate the margin between democratic and republican votes by first adding up there votes
as total_votes and then dividing the difference by total_votes.
clean_table %>%
select(District,
RepVote = Republican...2,
DemVote = Democratic...4,
OthVote = Others...6) %>%
mutate(
total_votes = RepVote + DemVote,
margin = abs(RepVote - DemVote) / total_votes * 100
)
#> # A tibble: 37 x 6
#> District RepVote DemVote OthVote total_votes margin
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 District 1 168165 61263 3292 229428 46.6
#> 2 District 2 139188 119992 4212 259180 7.41
#> 3 District 3 169520 138234 4604 307754 10.2
#> 4 District 4 188667 57400 3178 246067 53.3
#> 5 District 5 130617 78666 224 209283 24.8
#> 6 District 6 135961 116350 3731 252311 7.77
#> 7 District 7 115642 127959 0 243601 5.06
#> 8 District 8 200619 67930 4621 268549 49.4
#> 9 District 9 0 136256 16745 136256 100
#> 10 District 10 157166 144034 6627 301200 4.36
#> # … with 27 more rows
Edit: In case you want to go with the data provided by the state, it looks to me as if the data you are looking for is in the first, third and fourth column. So what you want to do is.
(All the code below is not tested, as I do not have the original data.)
read data into R
library(readr)
tx18 <- read_csv("filename.csv")
select relevant columns
tx18 <- tx18 %>%
select(c(1,3,4))
clean table
tx18 <- tx18 %>%
filter(!is.na(X3),
X3 != "Party",
X3 != "Race Total")
Group and summarize data by party
tx18 <- tx18 %>%
group_by(X3) %>%
summarise(votes = sum(X3))
Pivot/ Reshape data to wide format
tx18 %>$
pivot_wider(names_from = X3,
values_from = votes)
After this you could then calculate the margin similarly as I did with the Wikipedia data.
I have a list of lists that looks like this:
> class(cladelist)
[1] "list"
cladelist <- list( `46` = scan(text=' "KbFk2" "PeHa3" "PeHa51" "EeBi27" "EeBi17" "PeHa23" "PeHa44" "EeBi4" "EeBi26" "PeHa8" "PeHa26" "EeBi24" "EeBi3"
"EeBi20" "KbFk5" "PeHa15" "PeHa43" "PeHa11" "PeHa12" "PeHa49" "PeHa67" "PeHa17" "PeHa59" "KbFk4" "PeHa10" "PeHa55"
"PeHa73" "EeBi23" "PeHa78" "PeHa81" "EeBi11" "PeHa45" "EeBi6" "EeBi34" "PeHa25" "PeHa52" "PeHa62" "PeHa31" "PeHa65"
"PeHa47" "PeHa50" "PeHa34" "PeHa54" "PeHa22" "PeHa30"', what=""),
`47`= scan(text='
"KbFk2" "EeBi27" "EeBi17" "EeBi4" "EeBi26" "EeBi3" "EeBi20" "KbFk5" "KbFk4" "EeBi6" "EeBi34"', what=""),
`48`= scan(text=' "PeHa3" "PeHa51" "PeHa23" "PeHa44" "PeHa8" "PeHa26" "EeBi24" "PeHa15" "PeHa43" "PeHa11" "PeHa12" "PeHa49" "PeHa67"
"PeHa17" "PeHa59" "PeHa10" "PeHa55" "PeHa73" "EeBi23" "PeHa78" "PeHa81" "EeBi11" "PeHa45" "PeHa25" "PeHa52" "PeHa62"
"PeHa31" "PeHa65" "PeHa47" "PeHa50" "PeHa34" "PeHa54" "PeHa22" "PeHa30"', what=""),
`49`= scan(text=' "PeHa51" "PeHa23" "PeHa44" "PeHa8" "PeHa26" "EeBi24" "PeHa15" "PeHa43" "PeHa11" "PeHa12" "PeHa49" "PeHa67" "PeHa17"
"PeHa59" "PeHa10" "PeHa55" "PeHa73" "EeBi23" "PeHa78" "PeHa81" "EeBi11" "PeHa45" "PeHa25" "PeHa52" "PeHa62" "PeHa31"
"PeHa65" "PeHa47" "PeHa50" "PeHa34" "PeHa54" "PeHa22" "PeHa30"', what=""),
`50`= scan(text=' "EeBi27" "EeBi17" "EeBi4" "EeBi26" "EeBi3" "EeBi20" "KbFk5" "KbFk4" "EeBi6" "EeBi34"', what="") )
Each of these sublists (ie "46", "47" etc) represents a clade in a dendogram that I've extracted using:
> cladelist <- clade.members.list(VB.phy, tips = FALSE, tip.labels = TRUE, include.nodes=FALSE)
Im trying to find each unique pair found within each sublist, and calculate the sum of times it appears between all sublists (clades).
The ideal output would be a dataframe that looks like this where the count is the number of times this pair was found between all sublists (clades):
Pair Count
Peha1/PeHa2 2
Peha1/PeHa3 4
PeHa1/PeHa4 7
PeHa1/PeHa5 3
What sort of formulas am I looking for?
Background for the question (just for interest, doesnt add that much to question):
The idea is that I have a data set of 121 of these elements (Peha1, KbFk3, etc). They are artifacts (I'm an archaeologist) that I'm evaluating using 3D geometric morphometrics. The problem is that these artifacts are not all complete; they are damaged or degraded and thus provide an inconsistent amount of data. So I've had to reduce what data I use per artifact to have a reasonable, yet still inconsistent, sample size. By selecting certain variables to evaluate, I can get useful information, but it requires that I test every combination of variables. One of my analyses gives me the dendograms using divisive hierarchical clustering.
Counting the frequency of each pair as found between each clade should be the strength of the relationship of each pair of artifacts. That count I will then divide by total number of clades in order to standardize for the following step. Once I've done this for X number of dendograms, I will combine all these values for each pair, and divide them by the number representing whether that pair appeared in a dendogram (if it shows up in 2 dendograms, that I divide by 2), because each pair will not appear in each of my tests and I have to standardize it so that more complete artifacts that appear more often in my tests don't have too much more weight. This should allow me to evaluate which pairs have the strongest relationships.
This falls into a set of association kind of problems for which I find the widyr package to be super useful, since it does pairwise counts and correlations. (The stack() function just converts into a dataframe for the rest to flow).
I couldn't check against your sample output, but for an example like "PeHa23/PeHa51", the output shows they are paired together in 3 different clades.
This currently doesn't include zero counts to exhaust all possible pairs, but that could be shown as well (using complete()).
UPDATE: Made references clearer for packages like dplyr, and filtered so that counts are non-directional (item1-item2 is same as item2-item1 and can be filtered).
library(tidyverse)
library(widyr)
df <- stack(cladelist) %>%
dplyr::rename(clade = "ind", artifact = "values")
df %>%
widyr::pairwise_count(feature = clade, item = artifact) %>%
filter(item1 > item2) %>%
mutate(Pair = paste(item1, item2, sep = "/")) %>%
dplyr::select(Pair, Count = n)
#> # A tibble: 990 x 2
#> Pair Count
#> <chr> <dbl>
#> 1 PeHa3/KbFk2 1
#> 2 PeHa51/KbFk2 1
#> 3 PeHa23/KbFk2 1
#> 4 PeHa44/KbFk2 1
#> 5 PeHa8/KbFk2 1
#> 6 PeHa26/KbFk2 1
#> 7 KbFk5/KbFk2 2
#> 8 PeHa15/KbFk2 1
#> 9 PeHa43/KbFk2 1
#> 10 PeHa11/KbFk2 1
#> # … with 980 more rows
I am trying to arrange my data. The csv file that I load contains results of 15 precincts for one locality. The number of rows are 150 because the names of the 10 candidates repeat for each of the 15 precincts.
My goal is to make the names of the 10 candidates as columns without repeating their names and with the results for each candidate as the values. I use the code below, however I have to do it 15 times because I cut my data in intervals of 10 to extract the results of one precinct. It's the same for "binondov". I have to cut my data in intervals of 8 because there are 8 candidates for each precinct.
Is there a way to write my code as a loop? Thanks!
binondop1 <- binondop[1:10,]
binondop1a <- binondop1[order(binondop1[,2]),]
binondov1 <- binondov[1:8,]
binondov1a <- binondov1[order(binondov1[,2]),]
colnames(binondop1a) = colnames(binondov1a) =
c('X', 'Candidate', 'Party', 'Vote', 'Percentage')
binondo1 <- rbind(binondop1a, binondov1a)
binondo <- rbind(t(binondo1$Vote), t(binondo2$Vote),
t(binondo3$Vote), t(binondo4$Vote),
t(binondo5$Vote), t(binondo6$Vote),
t(binondo7$Vote), t(binondo8$Vote),
t(binondo9$Vote), t(binondo10$Vote),
t(binondo11$Vote), t(binondo12$Vote),
t(binondo13$Vote),t(binondo14$Vote),
t(binondo15$Vote))
colnames(binondo) <- c('Acosta', 'Aquino', 'DLReyes', 'EEjercito',
'Gordon', 'Madrigal', 'Perlas', 'Teodoro',
'Villanueva', 'Villar', 'Binay', 'Chipeco',
'Fernando', 'Legarda', 'Manzano', 'Roxas',
'Sonza', 'Yasay')
It's hard to say exactly without seeing a sample data set, but perhaps something like this will help get you where you need to your answer.
library(dplyr)
library(tidyr)
df <- data.frame(Candidate = c(rep('Acosta',3), rep('Aquino',3), rep('DLReyes',3)),
Party = c('R','R','R','L','L','L','D','D','D'),
Vote = rep(c('A','B','C'),3),
Percentage = c(5,4,2,6,8,3,1,3,2))
df2 <- df %>%
mutate(Candidate = paste0(Candidate, ' (', Party, ')')) %>%
select(-Party) %>%
spread(Candidate, Percentage)
I'm lost on how to combine my data into a usable data frame. I have a list of lists of character and number vectors Here is a working example of my code so far:
remove(list=ls())
# Headers for each of my column names
headers <- c("name","p","c","prophylaxis","control","inclusion","exclusion","conversion excluded","infection criteria","age criteria","mean age","age sd")
#_name = author and year
#_p = no. in experimental arm.
#_c = no. in control arm
#_abx = antibiotic used
#_con = control used
#_inc = inclusion criteria
#_exc = exclusion criteria
#_coexc = was conversion to open excluded?
#_infxn = infection criteria
#_agecrit = age criteria
#_agemean = mean age of study
#_agesd = sd age of study
# Passos 2016
passos_name <- c("Passos","2016")
passos_p <- 50
passos_c <- 50
passos_abx <- "cefazolin 1g at induction"
passos_con <- "none"
passos_inc <- c("elective LC","symptomatic cholelithiasis","low risk")
passos_exc <- c("renal impairment","hepatic impairment","immunosuppression","regular steroid use","antibiotics within 48H","acute cholecystitis","choledocolithiasis")
passos_coexc <- TRUE
passos_infxn <- c("temperature >37.8C","tachycardia","asthenia","local pain","local purulence")
passos_agecrit <- NULL
passos_agemean <- 48
passos_agesd <- 13.63
passos <- list(passos_name,passos_p,passos_c,passos_abx,passos_con,passos_inc,passos_exc,passos_coexc,passos_infxn,passos_agecrit,passos_agemean,passos_agesd)
names(passos) <- headers
# Darzi 2016
darzi_name <- c("Darzi","2016")
darzi_p <- 182
darzi_c <- 247
darzi_abx <- c("cefazolin 1g 30min prior to induction","cefazolin 1g 6H after induction","cefazolin 1g 12H after induction")
darzi_con <- "NaCl"
darzi_inc <- c("elective LC","first time abdominal surgery")
darzi_exc <- c("antibiotics within 7 days","immunosuppression","acute cholecystitis","choledocolithiasis","cholangitis","obstructive jaundice",
"pancreatitis","previous biliary tract surgery","previous ERCP","DM","massive intraoperative bleeding","antibiotic allergy","major thalassemia",
"empyema")
darzi_coexc <- TRUE
darzi_infxn <- c("temperature >38C","local purulence","intra-abdominal collection")
darzi_agecrit <- c(">18", "<75")
darzi_agemean <- 43.75
darzi_agesd <- 13.30
darzi <- list(darzi_name,darzi_p,darzi_c,darzi_abx,darzi_con,darzi_inc,darzi_exc,darzi_coexc,darzi_infxn,darzi_agecrit,darzi_agemean,darzi_agesd)
names(darzi) <- headers
# Matsui 2014
matsui_name <- c("Matsui","2014")
matsui_p <- 504
matsui_c <- 505
matsui_abx <- c("cefazolin 1g at induction","cefazolin 1g 12H after induction","cefazolin 1g 24H after induction")
matsui_con <- "none"
matsui_inc <- "elective LC"
matsui_exc <- c("emergent","concurrent surgery","regular insulin use","regular steroid use","antibiotic allergy","HD","antibiotics within 7 days","hepatic impairment","chemotherapy")
matsui_coexc <- FALSE
matsui_infxn <- c("local purulence","intra-abdominal collection","distant infection","temperature >38C")
matsui_agecrit <- ">18"
matsui_agemean <- NULL
matsui_agesd <- NULL
matsui <- list(matsui_name,matsui_p,matsui_c,matsui_abx,matsui_con,matsui_inc,matsui_exc,matsui_coexc,matsui_infxn,matsui_agecrit,matsui_agemean,matsui_agesd)
names(matsui) <- headers
# Find unique exclusion critieria in order to create the list of all possible levels
exc <- ls()[grepl("_exc",ls())]
exclist <- sapply(exc,get)
exc.levels <- unique(unlist(exclist,use.names = F))
# Find unique inclusion critieria in order to create the list of all possible levels
inc <- ls()[grepl("_inc",ls())]
inclist <- sapply(inc,get)
inc.levels <- unique(unlist(inclist,use.names = F))
# Find unique antibiotics order to create the list of all possible levels
abx <- ls()[grepl("_abx",ls())]
abxlist <- sapply(abx,get)
abx.levels <- unique(unlist(abxlist,use.names = F))
# Find unique controls in order to create the list of all possible levels
con <- ls()[grepl("_con",ls())]
conlist <- sapply(con,get)
con.levels <- unique(unlist(conlist,use.names = F))
# Find unique age critieria in order to create the list of all possible levels
agecrit <- ls()[grepl("_agecrit",ls())]
agecritlist <- sapply(agecrit,get)
agecrit.levels <- unique(unlist(agecritlist,use.names = F))
I have been struggling with:
1) Turn each of the _exc, _inc, _abx, _con, _agecrit lists into factors using the levels generated at the end of the code block. I have been trying to use a for loop such as:
for (x in exc) {
as.name(x) <- factor(get(x),levels = exc.levels)
}
This only creates a variable, x, that stores the last parsed list as a factor.
2) Combine all of my data into a data frame formatted as such:
name, p, c, prophylaxis, control, inclusion, exclusion, conversion excluded, infection criteria, age criteria, mean age, age sd
"Passos 2016", 50, 50, "cefazolin 1g at induction", "none", ["elective LC","symptomatic cholelithiasis","low risk"], ["renal impairment","hepatic impairment","immunosuppression","regular steroid use","antibiotics within 48H","acute cholecystitis","choledocolithiasis"], TRUE, ["temperature >37.8C","tachycardia","asthenia","local pain","local purulence"], NULL, 48, 13.63
...
# [] = factors
# columns correspond to each studies variables (i.e. passos_name, passos_p, passos_c, etc..)
# rows correspond to each study (i.e., passos, darzi, matsui)
I have tried various solutions on StackOverflow, but have not found any that work; for example:
studies <- list(passos,darzi,matsui,ruangsin,turk,naqvi,hassan,sharma,uludag,yildiz,kuthe,koc,maha,tocchi,higgins,mahmoud,kumar)
library(data.table)
rbindlist(lapply(studies,as.data.frame.list))
I suspect my data may not be exactly amenable to a data frame? Primarily because of trying to store a list of factors in a column. Is that allowed? If not, how is this type of data normally stored? My goal is to be able to meaningfully compare these various criterion across studies.
This is too long for a comment, so I turn it into an "answer":
To start with, have a look at what happens here:
data.frame(name = "Passos, 2016", p = 50)
name p
1 Passos, 2016 50
data.frame(name = c("Passos", "2016"), p = 50)
name p
1 Passos 50
2 2016 50
In the first one, we created a dataframe with the column "name" which contained one entry "Passos, 2016", i.e. one character containing both pieces of information, and the column "p". All fine. Now, in the second version, I specified the column "name" as you did above, using c(Passos, 2016). This is a two-element vector, and hence we get two rows in the dataframe: one with name Passos, one with name 2016, and the column p gets recycled.
Clearly, the latter is probably not what you intended. But it works anyway because R just recycles the shorter vector. Now, what do you think happens if I add a vector that contains three elements?
And this highlights the main issue with what you are doing: you are trying to get a dataframe from many vectors with different lengths. Now, in some cases this is fine if you want the shorter vector to be repeated (in R speech, we call this "recycled"), but it does not look like something you want to do here.
So, my recommendation would be this: try to imagine a matrix and make sure you understand what each element (row and column) is supposed to be. Then specify your data accordingly. If in doubt, look up "tidy data".
I am using trade data (FAO) which I would like to turn into matrices (per Item and Year). Therefore I've done a split:
# import is the original df
import_YI <- split(import, list(import$Item, import$Year))
import_YI_lap <- lapply(seq_along(import_YI), function(x) as.data.frame(import_YI[[x]])[, 1:11])
and the data looks like this (you can find test data at the end) :
[[1]]
RC PC Item Year Value
Argentina Chile Almonds 1996 1108
Algeria Spain Almonds 1996 1
....
[[2]]
....
[[3]]
....
[[n]]
I used the cast function (below) to create a matrix for almonds in 2012:
# import_almonds2012 is a test subset from import df (with import values for almonds in 2012)
RCPC <- cast(RC ~ PC, data =import_almonds2012, value = "Value")
Now my question: how can I do matrices of all Items/Years (~100 Items and 17 years!!) from the import_YI_lap df? My problem is that I don't know (1) how to operate the levels/ojects in this df ([[1]], [[2]]...). Or there a better way to split data or to save the splited df into objects? And (2) how to create all the needed matrices without coping thousend lines of code. Loops? If yes, how??
here a test-dataset:
import<- data.frame(RC=c("DE", "IT", "USA"),
PC = c("BRA", "ARG"),
Item = c("Almonds", "Apples"),
Year = c(1996,1997,1998),
Value = c(1,5,3,2,8,3))
import_YI <- split(import, list(import$Item, import$Year))
import_YI_lap <- lapply(seq_along(import_YI), function(x) as.data.frame(import_YI[[x]])[, 1:5])
import_YI_lap
It's difficult to test without data, but you can try this:
do.call(rbind,import_YI_lap)