Creating a variable based on a word count within a variable - r

I have a data set containing countries and their constitutions. I was wondering if there is a way to create a variable showing how many times the word "god" appears in the constitutions variable.
The data set looks as follows:
Country Year Preamble
Afghanistan 2004 In the name of Allah...
Albania 1998 We, the people of Albania...
... .... .......
and so on. I am particularly interested in knowing whether there is a function that can count how many times a specific word is used within a categorical (text) variable, or whether there is a better way to accomplish what I am trying to do.

Say you want to find where 'Al' appears in the above dataset; you can use grep like this:
For only one column:
grep("Al", data$Preamble)
For all columns:
lapply(data, function(x) grep("Al", x))
$`Country`
[1] 2
$Year
integer(0)
$Preamble
[1] 1 2
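This tells you in which rows of each column a match is found, i.e. one matching row in the 'Country' column and two in the 'Preamble' column. Note that grep returns the indices of the matching rows, not the number of occurrences within each row.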
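Since grep only reports which rows match, a per-row count needs a different call. The following is a minimal base R sketch, assuming the text lives in a character column; `count_word`, the sample sentences and the `god_count` column name are hypothetical and not part of the original data.

```r
# Hypothetical sample data standing in for the constitutions data set
data <- data.frame(
  Country  = c("A", "B", "C"),
  Preamble = c("God bless; in God we trust", "no mention", "godly god"),
  stringsAsFactors = FALSE
)

# Count whole-word, case-insensitive occurrences of `word` in each element
count_word <- function(text, word) {
  pattern <- paste0("\\b", word, "\\b")
  matches <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  # regmatches() returns one character vector of hits per element;
  # elements without a hit yield character(0), i.e. a count of 0
  lengths(regmatches(text, matches))
}

data$god_count <- count_word(data$Preamble, "god")
```

The word boundaries (`\\b`) keep "godly" from being counted; drop them if partial matches should count as well.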

Related

Is there an R function where I can get the names within a specific column in my dataset

Edit: using the aid from one of the users, I was able to use "table(ArrestData$CHARGE)", yet, since there are over 2400 entries, many of the entries are being omitted. I am looking for the top 5 charges, is there code for this? Additionally, I am looking at a particular council district (which is another variable titled "CITY_COUNCIL_DIST"). I want to see which are the top 5 charges given out within a specific council district. Is there code for this?
Thanks for the help!
Original post follows
Just like how I can use "names(MyData)" to see the names of my variables, I am wondering if I can use a code to see the names/responses/data points of a specific column.
In other words, I am attempting to see the names in my rows for a specific column of data. I would like to see what names are cumulatively being used.
After I find this, I would like to know how many times each name within the rows is being used, whether that's as a count or a percentage. After this, I would like to see how many times each name within the rows is being used with the condition that it meets a numeric value of another column/variable.
Apologies if this, in any way, is confusing.
To go further in depth, I am playing around with the Los Angeles Police Data that I got via the Office of the Mayor's website. From 2017-2018, I am attempting to see what charges and the amount of each specific charge were given out in Council District 5. CHARGE and CITY_COUNCIL_DIST are the two variables I am looking at.
Any and all help will be appreciated.
To get all the distinct values, you can use the unique function, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> unique(x)
[1] 1 2 3 4 5 6
To count how many times each distinct value occurs you can use table, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> table(x)
x
1 2 3 4 5 6
2 1 2 1 3 1
The first row gives you the distinct values and the second row the counts for each of them.
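The same two calls work on a character column, and prop.table turns the counts into percentages. A small sketch, assuming a made-up vector of charge names (the values below are hypothetical):

```r
# Hypothetical charge names, standing in for a column such as ArrestData$CHARGE
charges <- c("DUI", "Theft", "DUI", "Assault", "DUI", "Theft")

distinct_charges <- unique(charges)          # each name once, in order of appearance
counts <- table(charges)                     # how often each name occurs
shares <- round(100 * prop.table(counts), 1) # the same counts as percentages
```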
EDIT
This edit answers your second question, continuing with my previous example.
In order to look for the top five most repeated values of a variable we can use base R. To do so, I would first create a dataframe from your table of frequencies:
df <- as.data.frame(table(x))
Having this, you just have to order the rows by the column Freq in descending order and keep the first five:
head(df[order(-df$Freq), ], 5)
To look for the top five most repeated values of a variable within a group, base R gets clumsy, so I would use dplyr to create an augmented dataframe with frequencies for each value of the variable of interest, let it be count_variable:
library(dplyr)
x_or <- x %>%
  group_by(group_variable, count_variable) %>%
  summarise(freq = n())
where x is your original dataframe (not the vector from the example above), group_variable is the variable for your groups and count_variable is the variable you want to count. Now you just have to sort by descending frequency within each group and keep the first five rows per group (slice_head needs dplyr 1.0.0 or later, and relies on x_or still being grouped by group_variable, which is the default after summarise):
x_or %>%
  arrange(group_variable, desc(freq)) %>%
  slice_head(n = 5)
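For the specific case in the question, the top five charges within one council district, a base R version is also possible. This is only a sketch: the column names CHARGE and CITY_COUNCIL_DIST come from the question, but the rows below are invented for illustration.

```r
# Invented rows; only the column names are taken from the question
ArrestData <- data.frame(
  CHARGE = c(rep("A", 6), rep("B", 5), rep("C", 4), rep("D", 3),
             rep("E", 2), "F", rep("A", 2)),
  CITY_COUNCIL_DIST = c(rep(5, 21), rep(7, 2)),
  stringsAsFactors = FALSE
)

# Keep only the rows for Council District 5
district5 <- ArrestData[ArrestData$CITY_COUNCIL_DIST == 5, ]

# Tabulate the charges, sort by descending count, keep the first five
top5 <- head(sort(table(district5$CHARGE), decreasing = TRUE), 5)
```

If CHARGE is stored as a factor, charges that never occur in the district still appear in the table with a count of zero; converting with as.character first avoids that.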

How to unnest irregular JSON data

I have been looking at many solutions on this site to similar problems for weeks but cannot wrap my head around how to apply them successfully to this particular one:
I have the dataset at https://statdata.pgatour.com/r/006/player_stats.json
using:
player_stats_url<-"https://statdata.pgatour.com/r/006/player_stats.json"
player_stats_json <- fromJSON(player_stats_url)
player_stats_df <- ldply(player_stats_json,data.frame)
This gives a dataframe of 145 rows, one for each player, and 7 columns. The 7th, named "players.stats", contains the data I'd like broken out into a 2-dimensional dataframe.
Next, I do this to take a closer look at the "players.stats" column:
player_stats_df2<- ldply(player_stats_df$players.stats, data.frame)
The data in the "players.stats" column are formatted as follows: rows of 25 repeating stat categories in the column player_stats_df2$name, and another nested list in the column $rounds, on which I repeat ldply to unnest everything. However, I cannot sew it back together logically in the way that I want.
The column $rounds, once unnested using:
player_stats_df3<- ldply(player_stats_df2$rounds, data.frame)
gives the round number in the first column $r (1, 2, 3 and 4 are the only values) and the stat value in the second column $rValue. To complicate things, some entries have 2 rounds, while others have 4 rounds.
The final 2-dimensional dataframe I need would have the columns players.pid and players.pn from player_stats_df, a new column "round.no" corresponding to player_stats_df3$r, and then one column for each of the 25 repeating stat categories from player_stats_df2$name (eagles, birdies, pars ... SG: Off-the-tee, SG: tee-to-green, SG: Total), with each row unique to a player name and round number.
For example, there would be four rows for Matt Kuchar, one for each round played, and a column for each of the 25 stat categories. However, some other players would only have 2 rows.
Please let me know if I can clarify anything about this example. I have tried many things but cannot sew this data back together in the format I need.
Here is something you can start with: create a tibble using tibble::as_tibble, then apply tidyr::unnest repeatedly.
library(tidyverse)
as_tibble(player_stats_json$tournament$players) %>% unnest(stats) %>% unnest(rounds)
Also see this tutorial here. Finally, use dplyr (part of the tidyverse) instead of plyr.
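Unnesting yields one row per player, stat and round; getting one row per player and round with a column per stat is a further long-to-wide step. The sketch below shows that step in base R on a toy nested list. The field names (pid, pn, stats, name, rounds, r, rValue) mirror those mentioned in the question, but the values are invented and the real feed may be shaped differently.

```r
# Toy stand-in for the nested player data; all values are made up
players <- list(
  list(pid = "p1", pn = "Player One", stats = list(
    list(name = "eagles",
         rounds = list(list(r = 1, rValue = 0), list(r = 2, rValue = 1))),
    list(name = "birdies",
         rounds = list(list(r = 1, rValue = 3), list(r = 2, rValue = 4)))
  )),
  list(pid = "p2", pn = "Player Two", stats = list(
    list(name = "eagles",  rounds = list(list(r = 1, rValue = 0))),
    list(name = "birdies", rounds = list(list(r = 1, rValue = 2)))
  ))
)

# Flatten to long format: one row per player, stat and round
rows <- list()
for (p in players) {
  for (s in p$stats) {
    for (rd in s$rounds) {
      rows[[length(rows) + 1]] <- data.frame(
        pid = p$pid, pn = p$pn, round.no = rd$r,
        name = s$name, value = as.numeric(rd$rValue),
        stringsAsFactors = FALSE
      )
    }
  }
}
long <- do.call(rbind, rows)

# Spread the stat names into columns: one row per player and round
wide <- reshape(long, idvar = c("pid", "pn", "round.no"),
                timevar = "name", direction = "wide")
```

Players with fewer rounds simply contribute fewer rows, and the stat columns come out prefixed with "value." (for example value.eagles).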

Include variable value on data frame name

I'm trying to figure out how can I add something to a data frame df, based on a variable (i.e. a date), ending up with a data frame named df_17 if variable is equal to 2017 for example.
The reason why I want this is because I'm importing datasets from several years and quarters, and I would like to make sure that they are named according to the year variable they have. Each dataset only has 1 date. I know I can do it manually but it would take me less time to automate it.
I know how to do it with columns and rows, but I can't figure it out for objects.
EDIT:
Example 1:
Data frame name "df"
A B Date
1 4 2017
2 3 2017
New data frame name "df_2017"
Example 2:
Data frame name "df"
A B Date
1 4 2016
2 3 2016
New data frame name "df_2016"
The assign function should do what you want. A solution could look like
assign(paste0("df_", year), dataframe_read_from_file, pos = 1)
If you use assign inside a function or a loop, make sure that you set the pos argument correctly.
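A minimal sketch of how this could look for the example in the question, assuming each data set holds exactly one distinct value in its Date column. The `datasets` list at the end is a hypothetical alternative, not something the answer above proposes.

```r
# Example data frame from the question
df <- data.frame(A = c(1, 2), B = c(4, 3), Date = c(2017, 2017))

# Take the year from the data itself; stop if there is more than one
year <- unique(df$Date)
stopifnot(length(year) == 1)

# Creates an object called df_2017 in the global environment (pos = 1)
assign(paste0("df_", year), df, pos = 1)

# Hypothetical alternative: keep the data sets in a named list instead
datasets <- list()
datasets[[paste0("df_", year)]] <- df
```

The list variant tends to be easier to loop over later, since the data sets can be reached with datasets[["df_2017"]] rather than with get().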

Copy columns of a data frame based on the value of a third column in R

I have a data frame with 4 columns. On one of the columns I added a date so that each value looks like this
>print(result[[4]][[10000]])
[[10000]]
[1] "Jan" "14" "2012"
That means that in the 10,000th element of the 4th column I have these 3 fields. This is the only column whose entries hold multiple values.
Now the other 3 columns of the data frame result are single values not multiple. One of those columns, the first one, has the states of the United States as values. What I want to do is create a new data frame from column 2 and 4 (the one described above) of the result data frame but depending on the state.
So for example I want all the 2nd column and 4th column data of the state of Alabama. I tried this but I don't think it is working properly. "levels" is the 2nd column and "weeks" is the 4th column of the data frame result.
rst <- subset(result, result$states == 'Alabama', select = c(result$levels, result$weeks))
The problem here is that subset is copying all the columns to rst and not just the second and fourth ones of the result data frame that are linked to Alabama state which are the only ones I want. Any idea how to do this correctly?
Edit to add the code
I'm adding the code here since I think there must be something I'm not seeing here. First a small sample of the original data which is on a csv file
st URL WEBSITE al aln wk WEEKSEASON
Alabama http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-04-2008 40 2008-09
Alabama http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-11-2008 41 2008-09
Alaska http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-18-2008 42 2008-09
Alaska http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-25-2008 43 2008-09
And this is the code
#Extracts relevant data from the csv file
extract_data<-function(){
  #open the file. NAME SHOULD BE CHANGED
  sd <- read.csv(file="sdr.csv",head=TRUE,sep=",")
  #Extracts the data from the ACTIVITY LEVEL column. Notice that the name of the column was changed on the file
  #to 'al' to make the reference easier
  lv_list <- sd$al
  #Gets only the number from each value getting rid of the word "Level"
  lvs <- lapply(strsplit(as.character(lv_list), " "), function(x) x[2])
  #Gets the ACTIVITY LEVEL NAME. Column name was changed to 'aln' on the file
  lvn_list <- sd$aln
  #Gets the state. Column name was changed to 'st' on the file
  st_list <- sd$st
  #Gets the week. Column name was changed to 'wk' on the file
  wk_list <- sd$wk
  #Divides the weeks data in month, day, year
  wks <- strsplit(as.character(wk_list), "-")
  result<-list("states"=st_list,"levels"=lvs,"lvlnames"=lvn_list,"weeks"=wks)
  return(result)
}
forecast<-function(){
  result=extract_data()
  rst <- subset(result, states == 'Alabama', select = c(levels, weeks))
  return(0) #return results
}
You're nearly there, but you don't need to reference the dataframe in the select argument - this should work:
rst <- subset(result, states == 'Alabama', select = c(levels, weeks))
You could also look into the package dplyr, which gives you SQL-like abilities and is great for manipulating larger and more complicated data sets.
EDIT
Thanks for posting your code - I think I've identified a few problems.
The result you return from extract_data() is a list, not a data.frame - which is why the code in forecast() doesn't work. If it did return a dataframe the original solution would work.
You're forming your list out of a combination of vectors and lists, which is a problem - a dataframe is (roughly) a list of vectors, not a collection of the two types. If you replace your list creation line with result <- data.frame(...) you run into problems because of this.
There are two problematic columns - lvs (or levels) and wks (weeks). Where you use lapply(), using sapply() instead would give you a vector, as required (see the manual). The second issue is the weeks column. What you're actually dealing with here is a list of character vectors of length 3. There's no easy way to do what you want - you can't, for example, have each 'cell' of a column in a dataframe contain a character vector, as the columns are themselves vectors.
My suggestions would be to either:
Use the original format "Oct-01-2008", i.e. construct your data.frame with wk_list rather than splitting each date into the three strings;
Convert the original format into a better time format with a package like lubridate (A+++++ would recommend, great package);
Or finally, split the week column into three columns, so you'd have one for month, one for day and one for year. You could do this very simply from wk_list like this:
wks <- sapply(strsplit(as.character(wk_list), "-"), function(x) c(x[1], x[2], x[3]))
Month <- wks[1,]
Day <- wks[2,]
Year <- wks[3,]
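Once lvs is a vector and the date is split into the Month, Day and Year vectors, you're good to just run
result<-data.frame("states"=st_list,"levels"=lvs,"lvlnames"=lvn_list,"month"=Month,"day"=Day,"year"=Year)
and the script should work, provided the select argument in forecast() names the new date columns instead of weeks. (Passing the 3-row wks matrix itself as a column would fail, because its number of rows does not match the other columns.)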
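Putting these pieces together, following the third suggestion (separate month, day and year columns), might look like the sketch below. The four rows are typed in by hand from the sample in the question rather than read from sdr.csv, and the variable name `parts` is hypothetical.

```r
# Hand-typed stand-in for the csv sample (only the columns used here)
sd <- data.frame(
  st  = c("Alabama", "Alabama", "Alaska", "Alaska"),
  al  = rep("Level 1", 4),
  aln = rep("Minimal", 4),
  wk  = c("Oct-04-2008", "Oct-11-2008", "Oct-18-2008", "Oct-25-2008"),
  stringsAsFactors = FALSE
)

# sapply() returns a plain character vector instead of a list
lvs <- sapply(strsplit(as.character(sd$al), " "), function(x) x[2])

# One row per date, with columns month, day, year
parts <- do.call(rbind, strsplit(as.character(sd$wk), "-"))

result <- data.frame(
  states = sd$st, levels = lvs, lvlnames = sd$aln,
  month = parts[, 1], day = parts[, 2], year = parts[, 3],
  stringsAsFactors = FALSE
)

# Now subset() works, because result is a data frame of plain vectors
rst <- subset(result, states == "Alabama", select = c(levels, month, day, year))
```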

extracting corresponding values from lists in a data frame

I have a very simple data frame ("newDF") consisting of 2 lists "year" and "value". The list "year" is a simple list from 1850 to 2011. I wish to extract the "value" corresponding to the year 1990 for use in another package. I suspect it is a very simple question. Any help would be appreciated. Many thanks.
This should work:
newDF$value[newDF$year==1990]
The $ identifies a column in the dataframe; the brackets are a way to subset that column, and inside the brackets you just put a logical condition that is TRUE for the row (or rows) you want. So you could get all years since 1990 with a very simple modification:
newDF$value[newDF$year>=1990]
Alternatively, subset returns the matching row as a one-column dataframe rather than a vector:
subset(newDF, year==1990, select="value")
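To see what each form returns, here is a small sketch with an invented newDF; the value column is just a running index, chosen so the results are easy to check by hand.

```r
# Invented data: years 1850 to 2011 with a running index as the value
newDF <- data.frame(year = 1850:2011, value = seq_along(1850:2011))

v1990   <- newDF$value[newDF$year == 1990]         # a plain vector of length 1
since90 <- newDF$value[newDF$year >= 1990]         # one value per year from 1990 on
as_df   <- subset(newDF, year == 1990, select = "value")  # a 1 x 1 data frame
```

If the other package expects a bare number, the bracket form is the more direct choice, since the subset() result would still need `as_df$value` to get at it.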
