Filling in missing value in R [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 3 years ago.
I have a dataframe like this:
ID year fcmstat secmstat mstat
138 4 1998 NA NA 1
139 4 1999 NA NA 1
140 4 2000 NA NA 1
141 4 2001 NA NA 1
142 4 2002 NA NA 1
143 4 2003 2 NA 2
144 4 2004 NA NA NA
145 4 2005 NA NA NA
146 4 2006 NA 3 3
147 4 2007 NA NA NA
375 19 2001 NA NA 2
376 19 2002 6 NA 6
377 19 2003 NA NA NA
378 19 2004 NA 5 5
379 19 2005 NA NA NA
380 19 2006 NA NA 1
fcmstat: type of first marital status change
secmstat: type of second marital status change
For ID 4 (19), the first marital status, fcmstat, changed in 2003 (2002), and the second marital status, secmstat, changed in 2006 (2004). So, for ID 4, the marital status in 2004 and 2005 was the same as the fcmstat of 2003, and for ID 19, 2003's mstat should be the same as the fcmstat of 2002.
I want to fill in the last column as follows:
ID year fcmstat secmstat mstat
138 4 1998 NA NA 1
139 4 1999 NA NA 1
140 4 2000 NA NA 1
141 4 2001 NA NA 1
142 4 2002 NA NA 1
143 4 2003 2 NA 2
144 4 2004 NA NA 2
145 4 2005 NA NA 2
146 4 2006 NA 3 3
147 4 2007 NA NA NA
375 19 2001 NA NA 2
376 19 2002 6 NA 6
377 19 2003 NA NA 6
378 19 2004 NA 5 5
379 19 2005 NA NA NA
380 19 2006 NA NA 1
Also, before any first change, mstat should be the same as before. Consider the following case:
ID year fcmstat secmstat mstat
1171 61 1978 NA NA 0
1172 61 1979 NA NA 0
1173 61 1980 NA NA 0
1174 61 1981 NA NA 0
1175 61 1982 NA NA 0
1176 61 1983 NA NA NA
1177 61 1984 NA NA NA
1178 61 1985 1 NA 1
1179 61 1986 NA NA 1
1180 61 1987 NA NA 1
The first change was in 1985. So, the missing mstat in 1983 and 1984 should be the same as the mstat of 1982. For this case, my desired output is:
ID year fcmstat secmstat mstat
1171 61 1978 NA NA 0
1172 61 1979 NA NA 0
1173 61 1980 NA NA 0
1174 61 1981 NA NA 0
1175 61 1982 NA NA 0
1176 61 1983 NA NA 0
1177 61 1984 NA NA 0
1178 61 1985 1 NA 1
1179 61 1986 NA NA 1
1180 61 1987 NA NA 1
As suggested by Schilker the code df$mstat_updated<-na.locf(df$mstat) gives the following:
ID year fcmstat secmstat mstat mstat_updated
138 4 1998 NA NA 1 1
139 4 1999 NA NA 1 1
140 4 2000 NA NA 1 1
141 4 2001 NA NA 1 1
142 4 2002 NA NA 1 1
143 4 2003 2 NA 2 2
144 4 2004 NA NA NA 2
145 4 2005 NA NA NA 2
146 4 2006 NA 3 3 3
147 4 2007 NA NA NA 3
148 4 2008 NA NA NA 3
However, I do want to fill in mstat for 2004 and 2005 but not for 2007 and 2008. I want to fill in NAs only between the first marital status change (fcmstat) and the second marital status change (secmstat).

As I mentioned in my comment, this is a duplicate of the linked question (Replacing NAs with latest non-NA value).
library(zoo)
df<-data.frame(ID=c('4','4','4','4'),
year=c(2003,2004,2005,2006),
mstat=c(2,NA,NA,3))
df$mstat<-na.locf(df$mstat)
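Applied to the whole column, na.locf also fills the trailing NAs and would carry values from one ID into the next, which is not what the question asks for. Below is a minimal base R sketch of one possible reading of the requirement: carry the last mstat forward within each ID, but only on rows before that ID's second change. The helper names fill_until_second_change and locf are made up for illustration, and the small data frame is a shortened version of the question's data.

```r
# Hypothetical helper: forward-fill mstat within each ID, but leave rows
# untouched once that ID's second change (secmstat) has been recorded.
fill_until_second_change <- function(df) {
  # last observation carried forward for a single vector (base R only)
  locf <- function(x) {
    seen <- cumsum(!is.na(x))        # how many non-NA values so far
    c(NA, x[!is.na(x)])[seen + 1]    # position 1 is NA: nothing seen yet
  }
  filled <- ave(df$mstat, df$ID, FUN = locf)
  # TRUE from the row where secmstat is first non-NA onwards, per ID
  after_second <- ave(as.numeric(!is.na(df$secmstat)), df$ID, FUN = cumsum) > 0
  fill_here <- is.na(df$mstat) & !after_second
  df$mstat[fill_here] <- filled[fill_here]
  df
}

# Shortened version of the question's data (assumed sorted by ID and year)
df <- data.frame(
  ID       = rep(c(4, 19), times = c(5, 5)),
  year     = c(2003:2007, 2002:2006),
  fcmstat  = c(2, NA, NA, NA, NA, 6, NA, NA, NA, NA),
  secmstat = c(NA, NA, NA, 3, NA, NA, NA, 5, NA, NA),
  mstat    = c(2, NA, NA, 3, NA, 6, NA, 5, NA, 1)
)
out <- fill_until_second_change(df)
out$mstat
```

With this rule, 2004 and 2005 for ID 4 and 2003 for ID 19 are filled, while the NAs after each second change (2007 for ID 4, 2005 for ID 19) are left alone. The sketch assumes the rows are already ordered by year within each ID.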

Related

Linear regression on 415 files, output just filename, regression coefficient, significance

I am a beginner in R, I am learning the basics to analyse some biological data. I have 415 .csv files, each is a fungal species. Each file has 5 columns - (YEAR, FFD, LFD, MEAN, RANGE)
YEAR FFD LFD RAN MEAN
1 1950 NA NA NA NA
2 1951 NA NA NA NA
3 1952 NA NA NA NA
4 1953 NA NA NA NA
5 1954 NA NA NA NA
6 1955 NA NA NA NA
7 1956 NA NA NA NA
8 1957 NA NA NA NA
9 1958 NA NA NA NA
10 1959 140 141 1 140
11 1960 NA NA NA NA
12 1961 NA NA NA NA
13 1962 NA NA NA NA
14 1963 NA NA NA NA
15 1964 NA NA NA NA
16 1965 155 156 1 155
17 1966 NA NA NA NA
18 1967 NA NA NA NA
19 1968 152 153 1 152
20 1969 NA NA NA NA
21 1970 NA NA NA NA
22 1971 161 162 1 161
23 1972 NA NA NA NA
24 1973 143 144 1 143
25 1974 NA NA NA NA
26 1975 NA NA NA NA
27 1976 NA NA NA NA
28 1977 NA NA NA NA
29 1978 NA NA NA NA
30 1979 NA NA NA NA
31 1980 NA NA NA NA
32 1981 NA NA NA NA
33 1982 155 156 1 155
34 1983 NA NA NA NA
35 1984 NA NA NA NA
36 1985 157 158 1 157
37 1986 170 310 140 240
38 1987 173 274 101 232
39 1988 192 236 44 214
40 1989 234 320 86 277
41 1990 172 287 115 213
42 1991 148 287 139 205
43 1992 140 278 138 206
44 1993 152 273 121 216
45 1994 142 319 177 228
46 1995 261 318 57 287
47 1996 247 315 68 285
48 1997 164 270 106 230
49 1998 186 187 1 186
50 1999 235 236 1 235
51 2000 NA NA NA NA
52 2001 309 310 1 309
53 2002 203 308 105 256
54 2003 140 238 98 189
55 2004 204 313 109 267
56 2005 253 313 60 287
57 2006 247 300 53 279
58 2007 185 295 110 225
59 2008 259 260 1 259
60 2009 296 315 19 309
61 2010 230 303 73 275
62 2011 247 248 1 247
63 2012 206 207 1 206
64 2013 NA NA NA NA
65 2014 250 317 67 271
First I would like to see the regression coefficient (slope of the line) for each file, and the significance (p-value) for all of the files.
I can do it individually with:
fruit<-read.csv(file.choose(),header=TRUE)
yr<-fruit[,1]
ffd<-fruit[,2]
res<-lm(ffd~yr)
summary(res)
when I do this for the data, I get:
Call:
lm(formula = ffd ~ yr)
Residuals:
Min 1Q Median 3Q Max
-77.358 -20.858 -5.714 22.494 96.015
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4162.0710 950.1439 -4.380 0.000119 ***
yr 2.1864 0.4765 4.588 6.55e-05 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 38.75 on 32 degrees of freedom
(31 observations deleted due to missingness)
Multiple R-squared: 0.3968, Adjusted R-squared: 0.378
F-statistic: 21.05 on 1 and 32 DF, p-value: 6.549e-05
The only information I need from this at the moment is the regression coefficient (2.1864) and the p-value (6.549e-05)
The perfect output would be if I could get R to cycle through the 415 files, and give an output in the form of a table with 3 columns: filename, regression coefficient, and significance. There would be 415 rows, one for each file.
I would then like to do YEAR~LFD, YEAR~RANGE, and YEAR~MEAN. I am hoping that I can easily edit the code for YEAR~FFD and run it for the other 3 regressions.
The following code will probably work.
I have tested it with your data in two files. The functions that do all the work are these ones:
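regrFun <- function(DF){
  # DF has two columns: YEAR first, then the response (e.g. FFD)
  fit <- lm(DF[[2]] ~ DF[[1]])      # response ~ YEAR, as in lm(ffd ~ yr) from the question
  coef(summary(fit))[2, c(1, 4)]    # slope estimate and its p-value
}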
regrList <- function(iv, L){
res <- lapply(seq_along(L), function(i){
dftmp <- L[[i]]
cfs <- regrFun(dftmp[c(1, iv)])
data.frame(file = names(L)[i], Estimate = cfs[1], p.value = cfs[2])
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
}
Now read in the data files. In the following code line, substitute a common filename part for "pattern" in the obvious place.
filenames <- list.files(pattern = "pattern")
df_list <- lapply(filenames, read.csv)
names(df_list) <- filenames
And compute the values you want.
results_list <- lapply(2:ncol(df_list[[1]]), regrList, df_list)
names(results_list) <- names(df_list[[1]][-1])
First I simulate 5 csv files, with columns that look like yours:
for(i in 1:5){
tab=data.frame(
YEAR=1950:2014,
FFD= rpois(65,100),
LFD= rnorm(65,100,10),
RAN= rnbinom(65,mu=100,size=1),
MEAN = runif(65,min=50,max=150)
)
write.csv(tab,paste0("data",i,".csv"))
}
Now we need a vector of all the files in your directory. This will be different for yours, but try to build it with the pattern argument (a regular expression):
csvfiles = dir(pattern="^data[0-9]+\\.csv$")
We use three libraries from the tidyverse. Assuming each csv file is not huge, the code below reads in all the files, groups the rows by source file, and performs the regression. Note that you can refer to the columns of the data frame by name, without having to rename them:
library(dplyr)
library(purrr)
library(broom)
csvfiles %>%
map_df(function(i){df = read.csv(i);df$data = i;df}) %>%
group_by(data) %>%
do(tidy(lm(FFD ~ YEAR,data=.))) %>%
filter(term!="(Intercept)")
# A tibble: 5 x 6
# Groups: data [5]
data term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 data1.csv YEAR -0.0228 0.0731 -0.311 0.756
2 data2.csv YEAR -0.139 0.0573 -2.42 0.0182
3 data3.csv YEAR -0.175 0.0650 -2.70 0.00901
4 data4.csv YEAR -0.0478 0.0628 -0.762 0.449
5 data5.csv YEAR 0.0204 0.0648 0.315 0.754
You can just change the formula inside lm(FFD ~ YEAR,data=.) to get the other regressions
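As a rough sketch of how the formula could be swapped without editing the code each time, the response column can be passed as a string and turned into a formula with reformulate(). This uses base R only; the helper name slope_table and the two small data frames are made up for illustration and stand in for the csv files.

```r
# Hypothetical helper: one row per data frame with the slope of
# `response ~ predictor` and its p-value.
slope_table <- function(df_list, response, predictor = "YEAR") {
  rows <- lapply(names(df_list), function(nm) {
    fit <- lm(reformulate(predictor, response = response), data = df_list[[nm]])
    cf <- coef(summary(fit))[predictor, c("Estimate", "Pr(>|t|)")]
    data.frame(file = nm, estimate = cf[["Estimate"]], p.value = cf[["Pr(>|t|)"]])
  })
  do.call(rbind, rows)
}

# Two small made-up data frames standing in for the csv files
years  <- 1950:1959
wiggle <- rep(c(0.5, -0.5), 5)
df_list <- list(
  "a.csv" = data.frame(YEAR = years, FFD = 2 * years + wiggle, LFD = -1 * years + wiggle),
  "b.csv" = data.frame(YEAR = years, FFD = 0.5 * years + wiggle, LFD = 3 * years + wiggle)
)

# One results table per response column
results <- lapply(c(FFD = "FFD", LFD = "LFD"),
                  function(r) slope_table(df_list, r))
results$FFD
```

In practice df_list would come from reading the real files, for example with lapply(csvfiles, read.csv) and names(df_list) <- csvfiles.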
data.table version, using StupidWolf's csv files layout and names, featuring the requested fields:
library(data.table)
input.dir = "/home/user/Desktop/My Folder/" # adjust to your needs
csvfiles <- list.files(path=input.dir, full.names=TRUE, pattern=".*data(.*)\\.csv") # adjust pattern
Above, I used a more specific regex pattern, but since list.files() expects a regular expression rather than a shell glob, you could use pattern="\\.csv$" if you want to process all csv files in that folder.
# order the files
csvfiles <- csvfiles[order(as.numeric(gsub(".*data(.*)\\.csv", "\\1", csvfiles)))]
# function to read file and return requested columns
regrFun <- function(x){
DT <- fread(x)
fit <- lm(FFD ~ YEAR, data=DT)
return(as.list(c(filename=basename(x), coef(summary(fit))[2, c(1, 4)])))
}
# apply function and rename columns
DT <- rbindlist(lapply(csvfiles, regrFun))
setnames(DT, c("filename", "regression coefficient", "significance"))
DT
Result:
filename regression coefficient significance
1: data1.csv -0.113286713286712 0.0874762832713643
2: data2.csv -0.044449300699302 0.457096760642717
3: data3.csv 0.0464597902097902 0.499618510612891
4: data4.csv -0.032473776223776 0.638494798460044
5: data5.csv 0.0562062937062939 0.452955919860998
---
411: data411.csv 0.0381555944055959 0.544185411150829
412: data412.csv -0.0672202797202807 0.314346452751388
413: data413.csv 0.116564685314687 0.0694785724198052
414: data414.csv -0.0908216783216786 0.110811677724832
415: data415.csv -0.0282779720279721 0.638766712090455
You could write an R script that runs on a single file, then run it on every file via the terminal.
The script is simply a .R file with code inside.
To run it on every file you would execute something along these lines in your terminal (using bash):
for file in yourDataDirectory/*.csv; do
    Rscript yourScriptFile.R "$file" >> finalOutput
done
This runs the script in yourScriptFile.R on every csv file in yourDataDirectory and appends the output to finalOutput.
The script itself would be very similar to the code you already wrote, but instead of file.choose() you would use the argument passed on the command line (available in R through commandArgs(trailingOnly = TRUE)), and you would print only the information you are interested in instead of the full output of summary.
finalOutput could even be a csv file, if you format the script output correctly.
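A minimal sketch of what such a script might look like, assuming each file has YEAR and FFD columns as in the question. The function name slope_line is made up for illustration; the script prints one comma-separated line per file so that the collected output can be read back as a csv.

```r
# yourScriptFile.R (hypothetical contents)
# Returns "filename,slope,p-value" for the regression FFD ~ YEAR in one csv file.
slope_line <- function(path) {
  fruit <- read.csv(path, header = TRUE)
  cf <- coef(summary(lm(FFD ~ YEAR, data = fruit)))
  paste(basename(path), cf["YEAR", "Estimate"], cf["YEAR", "Pr(>|t|)"], sep = ",")
}

# Only act when exactly one existing file was passed on the command line,
# e.g. Rscript yourScriptFile.R yourDataDirectory/species1.csv
args <- commandArgs(trailingOnly = TRUE)
if (length(args) == 1 && file.exists(args[1])) {
  cat(slope_line(args[1]), "\n", sep = "")
}
```

Rows with NA in FFD are dropped by lm() by default, which matches the "observations deleted due to missingness" note in the question's output.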

Subtract multiple columns by one column

I want to subtract the year in which the respondents were born (variables containing yrbrn) by the variable for year of the interview (inwyys) and save the results as new variables in the data frame.
Head of the data frame:
inwyys yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8
1 2012 1949 1955 NA NA NA NA NA
2 2012 1983 1951 1956 1989 1995 2003 2005
3 2012 1946 1946 1978 NA NA NA NA
4 2013 NA NA NA NA NA NA NA
5 2013 1953 1959 1980 1985 1991 2008 2011
6 2013 1938 NA NA NA NA NA NA
Can someone help me with that?
Thank you very much!
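This can be done by subsetting your data (x[,-1] takes everything but the first column, x[,1] takes the first column) and doing the subtraction. With cbind you can bind the result to the original data. Note that x[,-1] - x[,1] is birth year minus interview year, so the differences are negative; use x[,1] - x[,-1] if you want ages.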
cbind(x, x[,-1] - x[,1])
# inwyys yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8 yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8
#1 2012 1949 1955 NA NA NA NA NA -63 -57 NA NA NA NA NA
#2 2012 1983 1951 1956 1989 1995 2003 2005 -29 -61 -56 -23 -17 -9 -7
#3 2012 1946 1946 1978 NA NA NA NA -66 -66 -34 NA NA NA NA
#4 2013 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#5 2013 1953 1959 1980 1985 1991 2008 2011 -60 -54 -33 -28 -22 -5 -2
#6 2013 1938 NA NA NA NA NA NA -75 NA NA NA NA NA NA
Data:
x <- read.table(header=TRUE, text=" inwyys yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8
1 2012 1949 1955 NA NA NA NA NA
2 2012 1983 1951 1956 1989 1995 2003 2005
3 2012 1946 1946 1978 NA NA NA NA
4 2013 NA NA NA NA NA NA NA
5 2013 1953 1959 1980 1985 1991 2008 2011
6 2013 1938 NA NA NA NA NA NA")
I believe the following is what you are looking for
data$newvar1<-data$yrbrn2-data$inwyys
But replace "data" with the name of your data set. If you want to do it for each yrbrn column, change "newvar1" to "newvar2" and so on, so that you do not overwrite your previous calculations.
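To avoid writing one line per column, the yrbrn columns can be selected by name and handled in one step. This is a small sketch in base R, assuming the goal is age at interview (interview year minus birth year); the "age" column names are made up, and the data frame is a three-row excerpt in the style of the question's data.

```r
# Excerpt-style example data (two yrbrn columns only)
x <- data.frame(inwyys = c(2012, 2012, 2013),
                yrbrn2 = c(1949, 1983, NA),
                yrbrn3 = c(1955, 1951, NA))

# All columns whose name starts with "yrbrn"
born_cols <- grep("^yrbrn", names(x), value = TRUE)

# Interview year minus each birth year; NA stays NA
ages <- x$inwyys - x[born_cols]
names(ages) <- sub("^yrbrn", "age", born_cols)   # yrbrn2 -> age2, ...

x <- cbind(x, ages)
x
```

Because the new columns get their own names, the original yrbrn columns are kept and nothing is overwritten.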

Loss of data when using merge

I have a df with states, and I am trying to add lat, long values for each state so I can plot percent values for each state on a map. When I use merge I get either an empty df if I don't use
all=TRUE
or I get missing data for either my lat, long values or my data itself, depending on which I make x or y.
Code to load my df and add column header
fileURL <- c("https://drive.google.com/open?id=0B-jAX5hT2D3hNnVtLVhROENKRGs")
suppressMessages(require(data.table))
ge.planted <- fread(fileURL, na.strings = "NA")
colnames(ge.planted) <- c("region", "type", "crop", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015")
Code to get state names with lat, long values for the center of each state
snames <- data.frame(region=tolower(state.name), long=state.center$x, lat=state.center$y)
When I merge the two df using:
snames <- merge(ge.planted, snames, by="region")
I get
[1] region long lat type crop 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
[17] 2011 2012 2013 2014 2015
Or if I use
snames <- merge( ge.planted, snames, by="region", all=TRUE)
And I get my values but no lat, long
region type crop 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
1: Alabama Insect-resistant (Bt) only Cotton - - - - - 10 10 10 18 13 11 18 17 12
2: Alabama Herbicide-tolerant only Cotton - - - - - 28 25 25 15 18 7 4 11 4
3: Alabama Stacked gene varieties Cotton - - - - - 54 60 60 65 60 76 75 70 82
4: Alabama All GE varieties Cotton - - - - - 92 95 95 98 91 94 97 98 98
5: Arkansas Herbicide-tolerant only Soybean 43 60 68 84 92 92 92 92 94 94 96 95 94 97
6: Arkansas All GE varieties Soybean 43 60 68 84 92 92 92 92 94 94 96 95 94 97
2014 2015 long lat
1: 9 4 NA NA
2: 6 3 NA NA
3: 83 90 NA NA
4: 98 97 NA NA
5: 99 97 NA NA
6: 99 97 NA NA
And finally with
snames <- merge(snames, ge.planted, by="region", all=TRUE)
I get lat, long but no values
region long lat type crop 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
1 alabama -87 33 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
2 alaska -127 49 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
3 arizona -112 34 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
4 arkansas -92 35 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
5 california -120 37 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
6 colorado -106 39 <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
As best I can tell, instead of merging the files based on 'region' it is appending the 'y' value on to the end of the data frame.
The problem is that you used tolower(), so that region names in one frame are different to the other (ge.planted has caps, snames does not). So merge will not recognize region names as equivalent. Delete the tolower() call, and it should work.
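An alternative to removing tolower() is to normalise the key in both data frames before merging, which also guards against stray whitespace. The sketch below uses a tiny made-up stand-in for ge.planted (the pct column and its values are illustrative only) together with the built-in state.name and state.center data.

```r
# Made-up stand-in for ge.planted: capitalised region names
ge <- data.frame(region = c("Alabama", "Arkansas"),
                 pct    = c(97, 97))

# State centres with lower-case names, as in the question
centers <- data.frame(region = tolower(state.name),
                      long   = state.center$x,
                      lat    = state.center$y)

# Bring the key into the same form on both sides, then merge
ge$region <- tolower(trimws(ge$region))
merged <- merge(ge, centers, by = "region")
merged
```

With matching keys, an inner merge (the default, without all=TRUE) keeps only states present in both frames and no NA columns appear.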

creating index conditioned on value in other column; differences over time

I am struggling with the following problem:
The dataframe below contains the development of a value over time for various ids. What I am trying to get is the increase/decrease of these values relative to the value in the year when an event occurred. Several events can occur within one id, so a new event becomes the new baseline year for the id.
To make things clearer, I also add the outcome I want below
What i have
id value year event
a 100 1950 NA
a 101 1951 NA
a 102 1952 NA
a 103 1953 NA
a 104 1954 NA
a 105 1955 X
a 106 1956 NA
a 107 1957 NA
a 108 1958 NA
a 107 1959 Y
a 106 1960 NA
a 105 1961 NA
a 104.8 1962 NA
a 104.2 1963 NA
b 70 1970 NA
b 75 1971 NA
b 80 1972 NA
b 85 1973 NA
b 90 1974 NA
b 60 1975 Z
b 59 1976 NA
b 58 1977 NA
b 57 1978 NA
b 56 1979 NA
b 55 1980 W
b 54 1981 NA
b 53 1982 NA
b 52 1983 NA
b 51 1984 NA
What I am looking for
id value year event index growth
a 100 1950 NA 0
a 101 1951 NA 0
a 102 1952 NA 0
a 103 1953 NA 0
a 104 1954 NA 0
a 105 1955 X 1 1
a 106 1956 NA 2 1.00952381
a 107 1957 NA 3 1.019047619
a 108 1958 NA 4 1.028571429
a 107 1959 Y 1 1 #new baseline year
a 106 1960 NA 2 0.990654206
a 105 1961 NA 3 0.981308411
a 104.8 1962 NA 4 0.979439252
a 104.2 1963 NA 5 0.973831776
b 70 1970 NA 6
b 75 1971 NA 7
b 80 1972 NA 8
b 85 1973 NA 9
b 90 1974 NA 10
b 60 1975 Z 1 1
b 59 1976 NA 2 0.983333333
b 58 1977 NA 3 0.966666667
b 57 1978 NA 4 0.95
b 56 1979 NA 5 0.933333333
b 55 1980 W 1 1 #new baseline year
b 54 1981 NA 2 0.981818182
b 53 1982 NA 3 0.963636364
b 52 1983 NA 4 0.945454545
b 51 1984 NA 5 0.927272727
What I tried
This and this post were quite helpful and I managed to create differences between the years, however, I fail to reset the base year (index) when there is a new event. Furthermore, I am doubtful whether my approach is indeed the most efficient/elegant one. Seems a bit clumsy to me...
x <- ddply(x, .(id), transform, year.min=min(year[!is.na(event)])) #identifies first event year
x1 <- ddply(x[x$year>=x$year.min,], .(id), transform, index=seq_along(id)) #creates counter years following first event; prior years are removed
x1 <- x1[order(x1$id, x1$year),] #sort
x1 <- ddply(x1, .(id), transform, growth=100*(value/value[1])) #calculate difference, however, based on first event year; this is wrong.
library(Interact) #i then merge the df with the years prior to first event which have been removed in the begining
x$id.year <- interaction(x$id,x$year)
x1$id.year <- interaction(x1$id,x1$year)
x$index <- x$growth <- NA
y <- rbind(x[x$year<x$year.min,],x1)
y <- y[order(y$id,y$year),]
Many thanks for any advice.
# Create a tag to indicate the start of each new event by id or
# when id changes
dat$tag <- with(dat, ave(as.character(event), as.character(id),
FUN=function(i) cumsum(!is.na(i))))
# Calculate the growth by id and tag
# this will also produce results for each id before an event has happened
dat$growth <- with(dat, ave(value, tag, id, FUN=function(i) i/i[1] ))
# remove growth prior to an event (this will be when tag equals zero as no
# event have occurred)
dat$growth[dat$tag==0] <- NA
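The same tag can also drive the index column, since the index is just the row position within each id and event block. Below is a self-contained base R sketch on a shortened, made-up excerpt of the question's data; it assumes that rows before the first event of an id should get index 0 and growth NA.

```r
# Shortened excerpt in the style of the question's data
dat <- data.frame(
  id    = c("a", "a", "a", "a", "b", "b", "b"),
  value = c(100, 105, 106, 107, 90, 60, 59),
  year  = c(1954, 1955, 1956, 1959, 1974, 1975, 1976),
  event = c(NA, "X", NA, "Y", NA, "Z", NA)
)

# Number of events seen so far within each id (0 = before the first event)
tag <- ave(as.numeric(!is.na(dat$event)), dat$id, FUN = cumsum)

# Position within each (id, tag) block, and value relative to the block's first row
dat$index  <- ave(dat$value, dat$id, tag, FUN = seq_along)
dat$growth <- ave(dat$value, dat$id, tag, FUN = function(v) v / v[1])

# Rows before the first event of an id have no baseline yet
dat$index[tag == 0]  <- 0
dat$growth[tag == 0] <- NA
dat
```

Grouping by id as well as tag means the counter and the baseline both restart when the id changes.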
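Here is a solution with dplyr (na.locf comes from zoo).
library(dplyr)
library(zoo)
ana <- group_by(mydf, id) %>%
  do(na.locf(., na.rm = FALSE)) %>%     # carry each event label forward within id
  mutate(value = as.numeric(value)) %>% # na.locf turns the columns into character
  group_by(id, event) %>%
  mutate(growth = value/value[1]) %>%   # relative to the first row of each event block
  mutate(index = row_number(event))
ana$growth[is.na(ana$event)] <- 0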
id value year event growth index
1 a 100.0 1950 NA 0.0000000 1
2 a 101.0 1951 NA 0.0000000 2
3 a 102.0 1952 NA 0.0000000 3
4 a 103.0 1953 NA 0.0000000 4
5 a 104.0 1954 NA 0.0000000 5
6 a 105.0 1955 X 1.0000000 1
7 a 106.0 1956 X 1.0095238 2
8 a 107.0 1957 X 1.0190476 3
9 a 108.0 1958 X 1.0285714 4
10 a 107.0 1959 Y 1.0000000 1
11 a 106.0 1960 Y 0.9906542 2
12 a 105.0 1961 Y 0.9813084 3
13 a 104.8 1962 Y 0.9794393 4
14 a 104.2 1963 Y 0.9738318 5
15 b 70.0 1970 NA 0.0000000 1
16 b 75.0 1971 NA 0.0000000 2
17 b 80.0 1972 NA 0.0000000 3
18 b 85.0 1973 NA 0.0000000 4
19 b 90.0 1974 NA 0.0000000 5
20 b 60.0 1975 Z 1.0000000 1
21 b 59.0 1976 Z 0.9833333 2
22 b 58.0 1977 Z 0.9666667 3
23 b 57.0 1978 Z 0.9500000 4
24 b 56.0 1979 Z 0.9333333 5
25 b 55.0 1980 W 1.0000000 1
26 b 54.0 1981 W 0.9818182 2
27 b 53.0 1982 W 0.9636364 3
28 b 52.0 1983 W 0.9454545 4
Try:
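ddf$index=0
ddf$growth=0
baseline=0
start=FALSE  # becomes TRUE once the first event has been seen
# note: start and baseline are not reset when id changes, so rows of a new id
# before its first event continue the count and baseline of the previous id
for(r in 1:nrow(ddf)){
  if(is.na(ddf$event[r])){
    if(start){
      ddf$index[r] = ddf$index[r-1]+1
      ddf$growth[r] = ddf$value[r]/baseline
    } else {
      ddf$index[r] = 0
    }
  } else {
    start=TRUE
    ddf$index[r] = 1
    ddf$growth[r] = 1
    baseline = ddf$value[r]
  }
}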
ddf
id value year event index growth
1 a 100.0 1950 <NA> 0 0.0000000
2 a 101.0 1951 <NA> 0 0.0000000
3 a 102.0 1952 <NA> 0 0.0000000
4 a 103.0 1953 <NA> 0 0.0000000
5 a 104.0 1954 <NA> 0 0.0000000
6 a 105.0 1955 X 1 1.0000000
7 a 106.0 1956 <NA> 2 1.0095238
8 a 107.0 1957 <NA> 3 1.0190476
9 a 108.0 1958 <NA> 4 1.0285714
10 a 107.0 1959 Y 1 1.0000000
11 a 106.0 1960 <NA> 2 0.9906542
12 a 105.0 1961 <NA> 3 0.9813084
13 a 104.8 1962 <NA> 4 0.9794393
14 a 104.2 1963 <NA> 5 0.9738318
15 b 70.0 1970 <NA> 6 0.6542056
16 b 75.0 1971 <NA> 7 0.7009346
17 b 80.0 1972 <NA> 8 0.7476636
18 b 85.0 1973 <NA> 9 0.7943925
19 b 90.0 1974 <NA> 10 0.8411215
20 b 60.0 1975 Z 1 1.0000000
21 b 59.0 1976 <NA> 2 0.9833333
22 b 58.0 1977 <NA> 3 0.9666667
23 b 57.0 1978 <NA> 4 0.9500000
24 b 56.0 1979 <NA> 5 0.9333333
25 b 55.0 1980 W 1 1.0000000
26 b 54.0 1981 <NA> 2 0.9818182
27 b 53.0 1982 <NA> 3 0.9636364
28 b 52.0 1983 <NA> 4 0.9454545
29 b 51.0 1984 <NA> 5 0.9272727

Globaltest Pathway analysis with a matrix

I have a matrix with SAGE count data and I want to test for GO enrichment and pathway enrichment. Therefore I want to use the globaltest package in R. My data looks like this:
data_file
KI_1 KI_2 KI_4 KI_5 KI_6 WT_1 WT_2 WT_3 WT_4 WT_6
ENSMUSG00000002012 215 141 102 127 138 162 164 114 188 123
ENSMUSG00000028182 13 5 13 12 8 10 7 13 7 14
ENSMUSG00000002017 111 72 70 170 52 87 117 77 226 122
ENSMUSG00000028184 547 312 162 226 280 501 603 407 355 268
ENSMUSG00000002015 1712 1464 825 1038 1189 1991 1950 1457 1240 883
ENSMUSG00000028180 1129 944 766 869 737 1223 1254 865 871 844
The rownames contain Ensembl gene IDs and each column represents a sample. These samples can be divided into two groups for testing pathway enrichment: the KI1 and the WT2 group
groups <- c("KI1","KI1","KI1","KI1","KI1","WT2","WT2","WT2","WT2","WT2")
I found the function gtKEGG to do the pathway analysis, but my question is how? When I run the function I don't get any error, but my output looks like this:
> gtKEGG(groups, t(data_file), annotation="org.Mm.eg.db")
holm alias p-value Statistic Expected Std.dev #Cov
00380 NA Tryptophan metabolism NA NA NA NA 0
01100 NA Metabolic pathways NA NA NA NA 0
02010 NA ABC transporters NA NA NA NA 0
04975 NA Fat digestion and absorption NA NA NA NA 0
04142 NA Lysosome NA NA NA NA 0
04012 NA ErbB signaling pathway NA NA NA NA 0
04110 NA Cell cycle NA NA NA NA 0
04360 NA Axon guidance NA NA NA NA 0
Can anyone help me with this question? Thanks! :)
I found the solution!
library(globaltest)
library(org.Mm.eg.db)
eg <- as.list(org.Mm.egENSEMBL2EG)
KEGG<-gtKEGG(as.factor(groups), t(data_file), probe2entrez= eg, annotation="org.Mm.eg.db")
