changing variable value in data frame - r

I have a data frame:
id,male,exposure,age,tol
9,0,1.54,tol12,1.79
9,0,1.54,tol13,1.9
9,0,1.54,tol14,2.12
9,0,1.54,tol11,2.23
However, I want the values of the age variable to be (11,12,13,14) not (tol11,tol12,tol13,tol14). I tried the following, but it does not make a difference.
levels(tolerance_wide$age)[levels(tolerance_wide$age)==tol11] <- 11
levels(tolerance_wide$age)[levels(tolerance_wide$age)==tol12] <- 12
Any help would be appreciated.
(data from Singer, Willett book)

Assuming that your data frame is named foo:
foo$age <- as.numeric(gsub("tol", "", foo$age))
id male exposure age tol
1: 9 0 1.54 12 1.79
2: 9 0 1.54 13 1.90
3: 9 0 1.54 14 2.12
4: 9 0 1.54 11 2.23
Here we use two functions:
gsub to replace a pattern in a string (here we replace "tol" with the empty string "").
as.numeric to convert the gsub output (which is character) into numbers.
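As a side note, a hedged guess at why the original attempt failed: the level names in the comparison must be quoted strings; unquoted tol11 is treated as an object lookup and errors. With quotes, renaming the factor levels works too (a minimal sketch, assuming age is a factor):

```r
# Minimal reproduction: age stored as a factor, as in the question
foo <- data.frame(age = factor(c("tol12", "tol13", "tol14", "tol11")))
# Quote the level names: an unquoted tol11 would be an object lookup and error
levels(foo$age)[levels(foo$age) == "tol11"] <- "11"
levels(foo$age)[levels(foo$age) == "tol12"] <- "12"
```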


Join spatial features with dataframe by id with inconsistent format

Hello everyone, I was hoping I could get some help with this issue:
I have a shapefile with 2347 features that correspond to 3172 units. Perhaps some duplicated geometries arose when the original file was created, and they were arranged like this:
Feature gis_id
1 "1"
2 "2"
3 "3,4,5"
4 "6,8"
5 "7"
6 "9,10,13"
... and so on, up to the 3172 units and 2347 features.
On the other side, my data table has 72956 observations (about 16 columns) with data corresponding to the gis_id from the shapefile. However, this table has a unique gis_id per observation.
head(hru_ls)
jday mon day yr unit gis_id name sedyld_tha sedorgn_kgha sedorgp_kgha surqno3_kgha lat3no3_kgha
1 365 12 31 1993 1 1 hru0001 0.065 0.861 0.171 0.095 0
2 365 12 31 1993 2 2 hru0002 0.111 1.423 0.122 0.233 0
3 365 12 31 1993 3 3 hru0003 0.024 0.186 0.016 0.071 0
4 365 12 31 1993 4 4 hru0004 6.686 16.298 1.040 0.012 0
5 365 12 31 1993 5 5 hru0005 37.220 114.683 6.740 0.191 0
6 365 12 31 1993 6 6 hru0006 6.597 30.949 1.856 0.021 0
surqsolp_kgha usle_tons sedmin tileno3
1 0.137 0 0.010 0
2 0.041 0 0.009 0
3 0.014 0 0.001 0
4 0.000 0 0.175 0
5 0.000 0 0.700 0
6 0.000 0 0.227 0
With multiple records for each unit (20 years of data).
I would like to merge the geometry data of my shapefile into my data table. I've done this before with sp::merge, I think, but with a shapefile that did not have multiple IDs per geometry/feature.
Is there a way to condition the merging so it gives each feature from the data table the corresponding geometry according to if it has any of the values present on the gis_id field from the shapefile?
This is a very intriguing question, so I gave it a shot. My answer is probably not the quickest or most concise way of going about this, but it works (at least for your sample data). Notice that this approach is fairly sensitive to the formatting of the data in shapefile$gis_id (see regex).
# your spatial data
shapefile <- data.frame(feature = 1:6, gis_id = c("1", "2", "3,4,5", "6,8", "7", "9,10,13"))
# your tabular data
hru_ls <- data.frame(unit = 1:6, gis_id = paste(1:6))
# loop over all gis_ids in your tabular data
# perhaps this could be vectorized?
gis_ids <- unique(hru_ls$gis_id)
for(id in gis_ids){
  # Define a regex that matches the id as a whole token in the comma-separated list
  id_regex <- paste0("(,|^)", id, "(,|$)")
  # Get the row(s) in shapefile whose gis_id matches the regex
  rowmatch <- which(sapply(shapefile$gis_id, grepl, pattern = id_regex))
  # Assign the matching shapefile feature id to the tabular rows with this gis_id
  hru_ls[hru_ls$gis_id == id, "gis_feature_id"] <- shapefile[rowmatch, "feature"]
}
Since you didn't provide the geometry fields in your question, I just matched on Feature in your spatial data. You could either add an additional step that merges based on Feature, or replace "feature" in shapefile[rowmatch, "feature"] with your geometry fields.
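An alternative sketch (not necessarily faster, and assuming the same sample objects as in the answer above): expand the comma-separated gis_id strings into a one-row-per-id lookup table, after which a plain merge does the join without any regex:

```r
# Sample objects from the answer above
shapefile <- data.frame(feature = 1:6,
                        gis_id = c("1", "2", "3,4,5", "6,8", "7", "9,10,13"))
hru_ls <- data.frame(unit = 1:6, gis_id = paste(1:6))
# One row per individual id, carrying its feature along
ids <- strsplit(as.character(shapefile$gis_id), ",")
lookup <- data.frame(feature = rep(shapefile$feature, lengths(ids)),
                     gis_id = unlist(ids))
# A plain merge now attaches the feature (or geometry) to each observation
merged <- merge(hru_ls, lookup, by = "gis_id")
```

As with the loop above, this assumes the ids in shapefile$gis_id are separated by bare commas with no spaces.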

create list and generate descriptives for each variable

I want to generate descriptive statistics for multiple variables at a time (close to 50), rather than writing out the code several times.
Here is a very basic example of data:
id var1 var2
1 1 3
2 2 3
3 1 4
4 2 4
I typically write out each line of code to get a frequency count and descriptives, like so:
library(psych)
table(df1$var1)
table(df1$var2)
describe(df1$var1)
describe(df1$var2)
I would like to create a list and get the output from these analyses, rather than writing out 100 lines of code. I tried this, but it is not working:
variable_list<-list(df1$var, df2$var)
for (variable in variable_list){
table(df$variable_list))
describe(df$variable_list))}
Does anyone have advice on getting this to work?
The describe function from psych can take a whole data.frame and returns the descriptive statistics for each column:
library(psych)
describe(df1)
# vars n mean sd median trimmed mad min max range skew kurtosis se
#id 1 4 2.5 1.29 2.5 2.5 1.48 1 4 3 0 -2.08 0.65
#var1 2 4 1.5 0.58 1.5 1.5 0.74 1 2 1 0 -2.44 0.29
#var2 3 4 3.5 0.58 3.5 3.5 0.74 3 4 1 0 -2.44 0.29
For a subset of columns, specify either the column indices or column names to subset the dataset:
describe(df1[2:3])
Another option is descr from collapse
library(collapse)
descr(slt(df1, 2:3))
Or to select numeric columns
descr(num_vars(df1))
Or for factors
descr(fact_vars(df1))
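For the frequency counts, a minimal base-R sketch in the same spirit: lapply table over the chosen columns and collect the results in a named list (column names taken from the question's example data):

```r
df1 <- data.frame(id = 1:4, var1 = c(1, 2, 1, 2), var2 = c(3, 3, 4, 4))
# One frequency table per selected column, collected in a named list
freq_tables <- lapply(df1[c("var1", "var2")], table)
freq_tables$var1  # each level of var1 appears twice
```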

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance, but for ecology (a cross-section of the population of any kind of wild fauna), so essentially censoring variables like smoker/non-smoker, pregnancy, gender, health status, etc.:
AgeClass <- c(1,2,3,4,5,6)
Sample <- c(100,99,87,46,32,19)
for(i in 1:6){
  PropSurv <- Sample/100
}
LifeTab1 <- data.frame(AgeClass, Sample, PropSurv)
Which gave me this:
ID AgeClass Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate those that died in each row (DeathInt) by taking the initial number of those survived and subtracting it by the number below it (i.e. 100-99, then 99-87, then 87-46, so on and so forth). And try to look like this:
ID AgeClass Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a vector x that contains numbers, you can calculate the successive differences with the diff function.
In your case it would be:
LifeTab1$DeathInt <- c(-diff(LifeTab1$Sample), NA)
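As for the side note: the for() loop is unnecessary, because R arithmetic is vectorized. A sketch using the question's numbers:

```r
Sample <- c(100, 99, 87, 46, 32, 19)
# Proportion surviving: divide by the first age class rather than a hard-coded 100
PropSurv <- Sample / Sample[1]
# Deaths per interval; NA for the last (open) age class
DeathInt <- c(-diff(Sample), NA)
```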

Adding row in R with next day and 0 in each column

I have a data.frame with 4 columns. The first column is the_day, from 11/1/15 until 11/30/15. The next 3 have values corresponding to each day based on amount_raised. However, some dates are missing because there were no values in the next 3 columns (no money was raised).
For example, 11/3/15 is missing. What I want to do is add a row in between 11/2/15 and 11/4/15 with the date, and zeros in the next 3 columns. So it would read like this:
11/3/2015 0 0 0
Do I have to create a vector and then add it into the existing data.frame? I feel like there has to be a quicker way.
This should work,
date_seq <- seq(min(df$the_day),max(df$the_day),by = 1)
rbind(df,cbind(the_day = as.character( date_seq[!date_seq %in% df$the_day]), inf = "0", specified = "0", both = "0"))
# the_day inf specified both
# 1 2015-11-02 1.32 156 157.32
# 2 2015-11-04 4.25 40 44.25
# 3 2015-11-05 3.25 25 28.25
# 4 2015-11-06 1 15 16
# 5 2015-11-07 4.75 10 14.75
# 6 2015-11-08 32 0 32
# 7 2015-11-03 0 0 0
If you want to sort it according to the_day, assign the result to a variable and use the order function:
ans <- rbind(df,cbind(the_day = as.character( date_seq[!date_seq %in% df$the_day]), inf = "0", specified = "0", both = "0"))
ans[order(ans$the_day), ]
# the_day inf specified both
# 1 2015-11-02 1.32 156 157.32
# 7 2015-11-03 0 0 0
# 2 2015-11-04 4.25 40 44.25
# 3 2015-11-05 3.25 25 28.25
# 4 2015-11-06 1 15 16
# 5 2015-11-07 4.75 10 14.75
# 6 2015-11-08 32 0 32
data.frames are not efficient to work with row-wise internally. I would suggest something along the following lines:
create an empty (zero) 30x3 matrix; this will hold your amount_raised values.
create a complete sequence of dates from 11/1 till 11/30.
for each existing date, find its match in the complete sequence (use the match() function).
copy the corresponding line from your data frame to the matched line in the matrix.
Eventually, make a new data frame out of the new sequence and the matrix.
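A sketch of the matrix approach just described, assuming the column names inf, specified, both from the other answer's output (only two input rows shown for brevity):

```r
df <- data.frame(the_day = as.Date(c("2015-11-02", "2015-11-04")),
                 inf = c(1.32, 4.25), specified = c(156, 40), both = c(157.32, 44.25))
# 1. empty (zero) matrix, one row per day of November
all_days <- seq(as.Date("2015-11-01"), as.Date("2015-11-30"), by = "day")
m <- matrix(0, nrow = length(all_days), ncol = 3,
            dimnames = list(NULL, c("inf", "specified", "both")))
# 2. match each existing date against the complete sequence
idx <- match(df$the_day, all_days)
# 3. copy the known rows into place; all other rows stay zero
m[idx, ] <- as.matrix(df[, -1])
# 4. rebuild a data frame from the full date sequence and the matrix
filled <- data.frame(the_day = all_days, m)
```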

R merge with itself

Can I merge data like
name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70
according to the second column, and take the first column as column names?
name at_rank to_center predicted
#797 "Stachy, Poland" 1 4.70 4.70
Upon request, the whole set of data: http://sprunge.us/cYSJ
The first problem, of reading the data in, should not be a problem if your strings with commas are quoted (which they seem to be). Using read.csv with the header=FALSE argument does the trick with the data you shared. (Of course, if the data file had headers, delete that argument.)
From there, you have several options. Here are two.
reshape (base R) works fine for this:
myDF <- read.csv("http://sprunge.us/cYSJ", header=FALSE)
myDF2 <- reshape(myDF, direction="wide", idvar="V2", timevar="V1")
head(myDF2)
# V2 V3.name V3.at_rank V3.to_center V3.predicted
# 1 #1 Kitoman 1 2.41 2.41
# 5 #2 Hosaena 2 4.23 9.25
# 9 #3 Vinzelles, Puy-de-Dôme 1 5.20 5.20
# 13 #4 Whitelee Wind Farm 6 3.29 8.07
# 17 #5 Steveville, Alberta 1 9.59 9.59
# 21 #6 Rocher, Ardèche 1 0.13 0.13
The reshape2 package is also useful in these cases. It has simpler syntax and the output is also a little "cleaner" (at least in terms of variable names).
library(reshape2)
myDFw_2 <- dcast(myDF, V2 ~ V1)
# Using V3 as value column: use value.var to override.
head(myDFw_2)
# V2 at_rank name predicted to_center
# 1 #1 1 Kitoman 2.41 2.41
# 2 #10 4 Icaraí de Minas 6.07 8.19
# 3 #100 2 Scranton High School (Pennsylvania) 5.78 7.63
# 4 #1000 1 Bat & Ball Inn, Clanfield 2.17 2.17
# 5 #10000 3 Tăuteu 1.87 5.87
# 6 #10001 1 Oak Grove, Northumberland County, Virginia 5.84 5.84
Look at the reshape package from Hadley. If I understand correctly, you are just pivoting your data from long to wide.
I think in this case all you really need to do is transpose, cast to data.frame, set the colnames to the first row, and then remove the first row. It might be possible to skip the last step through some combination of arguments to data.frame, but I don't know what they are right now.
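A sketch of that transpose route, using one record's worth of the sample data and the V1/V2/V3 names that read.csv with header=FALSE would assign:

```r
myDF <- data.frame(V1 = c("name", "at_rank", "to_center", "predicted"),
                   V2 = "#797",
                   V3 = c("Stachy, Poland", "1", "4.70", "4.70"))
# Transpose the name/value columns, then cast back to a data.frame
wide <- as.data.frame(t(myDF[, c("V1", "V3")]), stringsAsFactors = FALSE)
colnames(wide) <- unlist(wide[1, ])  # first row holds the column names
wide <- wide[-1, ]                   # drop that first row
wide$id <- unique(myDF$V2)           # keep the record id
```

Note that every value ends up as character this way; numeric columns would need an as.numeric pass afterwards.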
