Using lists to change columns in multiple dataframes in R - r

I am using a list of variables to download and create dataframes in R. I'd like to be able to use this list to make changes to different columns in each dataframe, but I am having trouble calling particular columns using the list of variables.
countries= c("USA","CHN")
for (i in 1:length(countries)){
download.file(url[i],savedata[i])
assign(countries[i],xmlToDataFrame(savedata[i]))
}
Now I have dataframes that look like this:
head(USA)
indicator country date value decimal
1 GDP (current US$) United States 2012 15684800000000 0
2 GDP (current US$) United States 2011 14991300000000 0
3 GDP (current US$) United States 2010 14419400000000 0
4 GDP (current US$) United States 2009 13898300000000 0
5 GDP (current US$) United States 2008 14219300000000 0
6 GDP (current US$) United States 2007 13961800000000 0
And I would like to go through and make several changes, such as formatting the date column with the as.Date() function, or changing the units of the value column, but I want to be able to do the same to both dataframes (or to an arbitrary number, in case I increase the length of countries).
However, whenever I try to do this I can't seem to use the list of countries in the countries variable to get 'inside' each data frame. My initial guess was putting something like this in a loop:
assign(paste(countries[i],"date",sep="$"),
as.date(get(paste(countries[i],"date",sep="$")))
In particular, I get confused about how the get(paste(countries[i])) works if I am not trying to get the particular column date, and how the paste(countries[i],"date",sep="$") prints the correct name, but I can't seem to get just the one column I'd like to manipulate.
Additionally, I realize loops are not the ideal way of doing this, but I've been having the same problem with the apply functions, though I am likely having trouble with them due to my lack of experience. Suggestions for how to do it either in a loop or without one would be much appreciated. Super R novice here, just trying to learn. Also, if you've come across a clear explanation/answer for this somewhere else, I'd appreciate you pointing me towards it.

It's much easier if you use lists. Start with an empty one:
mylist = list()
Then change this:
assign(countries[i],xmlToDataFrame(savedata[i]))
to this:
mylist[[i]] <- xmlToDataFrame(savedata[i])
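Then make a function that does your formatting, for instance (note that the base R function is as.Date(), with a capital D; the date column here holds only a year, so January 1 is assumed in order to build a full date):
f <- function(df){
within(df, date <- as.Date(paste0(date, "-01-01")))
}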
And use lapply to apply it to all dataframes:
mylist2 <- lapply(mylist, f)
If you want to access dataframes by name, use this:
names(mylist2) <- countries
And test:
mylist2[["USA"]]
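Putting the steps above together, here is a minimal sketch that runs without any download. The two toy data frames stand in for the output of xmlToDataFrame; the CHN numbers are made-up round values, and the helper name clean is hypothetical.

```r
# Toy stand-ins for the downloaded data (columns stored as text)
countries <- c("USA", "CHN")
mylist <- list(
  data.frame(country = "United States", date = c("2012", "2011"),
             value = c("15684800000000", "14991300000000"),
             stringsAsFactors = FALSE),
  # Made-up round numbers, for illustration only
  data.frame(country = "China", date = c("2012", "2011"),
             value = c("8000000000000", "7000000000000"),
             stringsAsFactors = FALSE)
)
names(mylist) <- countries

# One function holds every change that should be applied to each data frame
clean <- function(df) {
  df$date  <- as.Date(paste0(df$date, "-01-01"))  # assumes year-only dates
  df$value <- as.numeric(df$value) / 1e12         # rescale to trillions
  df
}

# lapply() applies clean() to each element and keeps the list names
mylist2 <- lapply(mylist, clean)
mylist2[["USA"]]
```

Adding a third country then only means adding one more element to the list; the cleaning code does not change.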

Related

Subsetting dates from colnames

I have a dataframe as follows:
TAS1 2000 obs. of 9862 variables
Each of these variables (columns) represent daily temperatures from 1979-01-01 to 2005-12-31. The colnames have been set with these dates. I now wish to separate the dataframe into twelve separate monthly data frames - containing Jan, Feb, Mar etc.
I have tried:
TAS1.JAN = subset(TAS1, grepl("-01-"), colnames(TAS1))
But get the error:
Error in grepl("-01-") : argument "x" is missing, with no default
Is there a relatively quick solution for this? I feel there must be but haven't cracked it despite trying various solutions.
I would subset January data like below.
Jan_df <- subset(MyDatSet, select = grepl("-01-", colnames(MyDatSet)))
I have assumed that your parent dataset is called MyDatSet and a pattern "-01-" defines that it is January data.
You may repeat the process for the other 11 months or write a loop over the month numbers.
Like Roland, in the comments, suggested, I would opt for melting mechanism too. However, since I do not know your use case, here you go based on what you posted and asked for.
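A loop over the months could look like the sketch below. It assumes the column names are dates formatted as YYYY-MM-DD, as described in the question; the toy TAS1 here covers a single year and is filled with zeros, and the list name monthly is made up for the example.

```r
# Toy data frame whose column names are dates (one year of zeros)
dates <- seq(as.Date("1979-01-01"), as.Date("1979-12-31"), by = "day")
TAS1 <- as.data.frame(matrix(0, nrow = 3, ncol = length(dates)))
colnames(TAS1) <- format(dates)

# Two-digit month codes "01" .. "12"
month_codes <- sprintf("%02d", 1:12)

# One data frame per month, selected by the "-MM-" part of the column name
monthly <- lapply(month_codes, function(m) {
  TAS1[, grepl(paste0("-", m, "-"), colnames(TAS1)), drop = FALSE]
})
names(monthly) <- month.abb  # built-in constant: "Jan", "Feb", ...

ncol(monthly[["Jan"]])
```

Keeping the twelve results in one named list avoids creating twelve separate variables by hand.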
As your error says, you are missing an argument there:
tas1.jan <- subset(df, grepl("-01-", df$tas1))
Another way to do it with the help of stringr and dplyr would be:
library(stringr)
library(dplyr)
tas1.jan <- df %>% filter(str_detect(tas1, "-01-"))
The downside of this approach: you need to run a loop or do this 12 times to cover all months.

Is there an R function where I can get the names within a specific column in my dataset

Edit: using the aid from one of the users, I was able to use "table(ArrestData$CHARGE)", yet, since there are over 2400 entries, many of the entries are being omitted. I am looking for the top 5 charges, is there code for this? Additionally, I am looking at a particular council district (which is another variable titled "CITY_COUNCIL_DIST"). I want to see which are the top 5 charges given out within a specific council district. Is there code for this?
Thanks for the help!
Original post follows
Just like how I can use "names(MyData)" to see the names of my variables, I am wondering if I can use a code to see the names/responses/data points of a specific column.
In other words, I am attempting to see the names in my rows for a specific column of data. I would like to see what names are cumulatively being used.
After I find this, I would like to know how many times each name within the rows is being used, whether thats numeric or percentage. After this, I would like to see how many times each name within the rows is being used with the condition that it meets a numeric value of another column/variable.
Apologies if this, in any way, is confusing.
To go further in depth, I am playing around with the Los Angeles Police Data that I got via the Office of the Mayor's website. From 2017-2018, I am attempting to see what charges and the amount of each specific charge were given out in Council District 5. CHARGE and CITY_COUNCIL_DIST are the two variables I am looking at.
Any and all help will be appreciated.
To get all the distinct values of a variable, you can use the unique function, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> unique(x)
[1] 1 2 3 4 5 6
To count how many times each distinct value occurs, you can use table, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> table(x)
x
1 2 3 4 5 6
2 1 2 1 3 1
The first row gives you the distinct values and the second row the counts for each of them.
EDIT
This edit is aimed to answer your second question following with my previous example.
In order to look for the top five most repeated values of a variable we can use base R. To do so, I would first create a dataframe from your table of frequencies:
df <- as.data.frame(table(x))
Having this, you just have to sort by the Freq column in descending order and keep the first five rows:
head(df[order(-df$Freq),], 5)
To look for the top five most repeated values of a variable within a group, it is convenient to go beyond base R. I would use dplyr to create an augmented dataframe with frequencies for each value of the variable of interest, let it be count_variable:
library(dplyr)
x_or <- x %>%
group_by(group_variable, count_variable) %>%
summarise(freq=n())
where x is your original dataframe, group_variable is the variable for your groups and count_variable is the variable you want to count. Now you just have to order the object so that, within each group_variable, the most frequent values of count_variable come first:
x_or %>%
arrange(group_variable, desc(freq))
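For the specific case in the question (the top charges within one council district), the same counts can also be obtained with table alone. The sketch below uses a tiny invented ArrestData; only the column names CHARGE and CITY_COUNCIL_DIST come from the question, and the charge labels are placeholders.

```r
# Toy arrest data (invented values; column names follow the question)
ArrestData <- data.frame(
  CHARGE = c("A", "B", "A", "C", "A", "B", "C", "C", "C"),
  CITY_COUNCIL_DIST = c(5, 5, 5, 5, 4, 4, 4, 5, 5),
  stringsAsFactors = FALSE
)

# Keep only the charges recorded in district 5
in_dist5 <- ArrestData$CITY_COUNCIL_DIST == 5

# Count each charge, then sort from most to least frequent
counts <- sort(table(ArrestData$CHARGE[in_dist5]), decreasing = TRUE)

# The five most frequent charges (fewer if there are fewer distinct charges)
top5 <- head(counts, 5)
top5

# The same counts as percentages of all district 5 arrests
round(100 * prop.table(counts), 1)
```

Because only the first few entries are printed, this also avoids the truncated output that table gives on a column with thousands of distinct values.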

Creating a variable based on a word count within a variable

I have a data set containing countries and their constitutions. I was wondering if there was a way to create a variable to show how many times the word "god" shows in the variable of constitutions.
The data set looks as following:
Country Year Preamble
Afghanistan 2004 In the name of Allah...
Albania 1998 We, the people of Albania...
... .... .......
and so on and so forth. I am particularly interested in knowing if there is a function in which can count how many times a specific word is used within a categorical variable or if there is a better way to accomplish what I am trying to do.
Say you want to find where 'Al' appears in the above dataset, you can use grep like this:
For only one column:
grep("Al", data$Preamble)
For all columns:
lapply(data, function(x) grep("Al", x))
$`Country`
[1] 2
$Year
integer(0)
$Preamble
[1] 1 2
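This will tell you in which rows of each column the match is found, i.e. one row in the 'Country' column and two rows in the 'Preamble' column. Note that grep reports the matching rows, not the number of occurrences within a row.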
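To get an actual per-row count of a word, which is what the question asks for, one option is gregexpr together with regmatches. This is a sketch on invented text: the preambles below are placeholders rather than real constitutions, and count_matches is a hypothetical helper name.

```r
# Toy data (placeholder text, not real preambles)
data <- data.frame(
  Country = c("X", "Y", "Z"),
  Preamble = c("In the name of God, trusting in God",
               "We, the people",
               "Godspeed is not a match for god"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of matches of `pattern` in each string
count_matches <- function(text, pattern) {
  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  # regmatches() extracts the matched substrings; lengths() counts them
  lengths(regmatches(text, hits))
}

# \\b marks a word boundary, so "Godspeed" is not counted
data$god_count <- count_matches(data$Preamble, "\\bgod\\b")
data[, c("Country", "god_count")]
```

The new god_count column can then be used like any other numeric variable in the data set.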

Remove rows based on chronological order

The data that I am trying to work with is in a data.frame that has the following format:
Title Year
Something 2006
something2 2007
Something 2008
Something 2009
I'm specifically interested in being able to subset the data so that only rows with a Year later than 2008 are kept. For example, this would give:
Title Year
Something 2009
Is it acceptable to use something like this:
df[!(df$Year <= 2008), ]
If by fewer you mean older (a lower number), then you are looking for df <- df[df$Year > 2008, ] or, as you do, df <- df[!(df$Year <= 2008), ]. Note the comma before the closing bracket: it tells R that you are selecting rows rather than columns.
You will need to overwrite the original data.frame, or it will just display the subset without saving it. You can also use subset(df, Year > 2008) or the dplyr package. Whatever suits you best.
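A short sketch of the two base R forms on the example data from the question. The names kept and same are made up for the illustration.

```r
# Example data from the question
df <- data.frame(
  Title = c("Something", "something2", "Something", "Something"),
  Year = c(2006, 2007, 2008, 2009),
  stringsAsFactors = FALSE
)

# Logical row index: the trailing comma means "these rows, all columns"
kept <- df[df$Year > 2008, ]

# subset() evaluates the condition inside the data frame
same <- subset(df, Year > 2008)

kept
```

Both forms return a new data frame; df itself is unchanged until you assign the result back to it.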

Copy columns of a data frame based on the value of a third column in R

I have a data frame with 4 columns. On one of the columns I added a date so that each value looks like this
>print(result[[4]][[10000]])
[[10000]]
[1] "Jan" "14" "2012"
That means that in the 10000th field of the 4th column I have these 3 fields. This is the only column that holds multiple values.
Now the other 3 columns of the data frame result are single values not multiple. One of those columns, the first one, has the states of the United States as values. What I want to do is create a new data frame from column 2 and 4 (the one described above) of the result data frame but depending on the state.
So for example I want all the 2nd column and 4th column data of the state of Alabama. I tried this but I don't think it is working properly. "levels" is the 2nd column and "weeks" is the 4th column of the data frame result.
rst <- subset(result, result$states == 'Alabama', select = c(result$levels, result$weeks))
The problem here is that subset is copying all the columns to rst and not just the second and fourth ones of the result data frame that are linked to Alabama state which are the only ones I want. Any idea how to do this correctly?
Edit to add the code
I'm adding the code here since I think there must be something I'm not seeing here. First a small sample of the original data which is on a csv file
st URL WEBSITE al aln wk WEEKSEASON
Alabama http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-04-2008 40 2008-09
Alabama http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-11-2008 41 2008-09
Alaska http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-18-2008 42 2008-09
Alaska http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-25-2008 43 2008-09
And this is the code
#Extracts relevant data from the csv file
extract_data<-function(){
#open the file. NAME SHOULD BE CHANGED
sd <- read.csv(file="sdr.csv",head=TRUE,sep=",")
#Extracts the data from the ACTIVITY LEVEL column. Notice that the name of the column was changed on the file
#to 'al' to make the reference easier
lv_list <- sd$al
#Gets only the number from each value getting rid of the word "Level"
lvs <- lapply(strsplit(as.character(lv_list), " "), function(x) x[2])
#Gets the ACTIVITY LEVEL NAME. Column name was changed to 'aln' on the file
lvn_list <- sd$aln
#Gets the state. Column name was changed to 'st' on the file
st_list <- sd$st
#Gets the week. Column name was changed to 'wk' on the file
wk_list <- sd$wk
#Divides the weeks data in month, day, year
wks <- strsplit(as.character(wk_list), "-")
result<-list("states"=st_list,"levels"=lvs,"lvlnames"=lvn_list,"weeks"=wks)
return(result)
}
forecast<-function(){
result=extract_data()
rst <- subset(result, states == 'Alabama', select = c(levels, weeks))
return(0) #return results
}
You're nearly there, but you don't need to reference the dataframe in the select argument - this should work:
rst <- subset(result, states == 'Alabama', select = c(levels, weeks))
You could also look into the package dplyr, which gives you SQL like abilities and is great for manipulating larger and more complicated data sets.
EDIT
Thanks for posting your code - I think I've identified a few problems.
The result you return from extract_data() is a list, not a data.frame - which is why the code in forecast() doesn't work. If it did return a dataframe the original solution would work.
You're forming your list out of a combination of vectors and lists, which is a problem - a dataframe is (roughly) a list of vectors, not a collection of the two types. If you replace your list creation line with result <- data.frame(...) you run into problems because of this.
There are two problematic columns - lvs (or levels) and wks (weeks). Where you use lapply(), using sapply() instead would give you a vector, as required (see the manual). The second issue is the weeks column. What you're actually dealing with here is a list of character vectors of length 3. There's no easy way to do what you want - you can't, for example, have each 'cell' of a column in a dataframe contain a character vector, as the columns are themselves vectors.
My suggestions would be to either:
Use the original format "Oct-01-2008", i.e. construct your data.frame with wk_list rather than splitting each date into the three strings;
Convert the original format into a better time format with a package like lubridate (A+++++ would recommend, great package);
Or finally, split the week column into three columns, so you'd have one for month, one for day and one for year. You could do this very simply from wk_list like this:
wks <- sapply(strsplit(as.character(wk_list), "-"), function(x) c(x[1], x[2], x[3]))
Month <- wks[1,]
Day <- wks[2,]
Year <- wks[3,]
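Once lvs is a vector and the week information is in vector form (either wk_list itself or the separate Month, Day and Year vectors, rather than the 3-row matrix wks), you're good to just run
result<-data.frame("states"=st_list,"levels"=lvs,"lvlnames"=lvn_list,"weeks"=wk_list)
and the script should work.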
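As a self-contained sketch of the whole fix, the code below builds the data frame from a few toy rows modeled on the CSV sample in the question and then runs the subset call. The object names raw and parts are assumptions made for the example, and the month, day and year columns are optional extras.

```r
# Toy rows modeled on the CSV sample (only the columns that are used)
raw <- data.frame(
  st  = c("Alabama", "Alabama", "Alaska"),
  al  = c("Level 1", "Level 1", "Level 1"),
  aln = c("Minimal", "Minimal", "Minimal"),
  wk  = c("Oct-04-2008", "Oct-11-2008", "Oct-18-2008"),
  stringsAsFactors = FALSE
)

# sapply() returns a plain character vector here, not a list
lvs <- sapply(strsplit(as.character(raw$al), " "), function(x) x[2])

# 3 x n character matrix: one row each for month, day and year
parts <- sapply(strsplit(as.character(raw$wk), "-"), function(x) x[1:3])

# Every column is now an ordinary vector of the same length
result <- data.frame(
  states = raw$st, levels = lvs, lvlnames = raw$aln, weeks = raw$wk,
  month = parts[1, ], day = parts[2, ], year = parts[3, ],
  stringsAsFactors = FALSE
)

# Columns are named directly in select, without the result$ prefix
rst <- subset(result, states == "Alabama", select = c(levels, weeks))
rst
```

Because result is a real data frame here, subset returns only the two requested columns for the Alabama rows.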
