Sunday, 21 October 2007

Migrating data from SPSS to SQL Server 2005

SQL Server 2005 has among Native SQL Server also OLEDB connection which enables you to import data to you favorite SPSS version.

Using SPSS Syntax you can with almost no haste import data - using T-Sql Query to SPSS.

[Code]

*sql->spss.

get data

/type=odbc
/connect='Driver={SQLServer};Server=My_server_name;Uid=My_User_name;Pwd=My_password;database=My_database'

/sql ='select id, name, cast(round(cast(datediff(day, BirthDate, getdate()) as float) / 365.25, 0, -1) as integer),(select max(date_time) from table1 where table1.date = date2) as Another_Date from table2 as t2 join table3 as t3 on t2.id = t3.id join table4 as t4 on t2.date = t4.id where t4.my_number = 1'
cache.

[/Code]


Where parameters My_server_name, My_username, My_password and My_database should be replaced.


SQL query should be enclosed in single aphostrophe ' and your SQL query should not contain any single quotes for enabling strings (eg.: where name = 'John').

Thursday, 10 May 2007

Simple Macro

SPSS provides the capability to create macros that contain text and SPSS command syntax. Macros can perform the same command repeatedly, combine several procedures, or calculate user-defined statistics. Any functions you can perform using SPSS syntax can be incorporated into a macro. When a macro is called, it is expanded so that any text including commands will be used in its place.

Simple macro that differs between nominal, ordinal and scale variables and when you run this macro, it gives different statistics to specific variable type.

[SPSS]
DEFINE !sumstat (catvars = !CHAREND('/')
/scalevars = !CMDEND)
!IF (!catvars ~=!NULL) !THEN
frequencies variables = !catvars
/barchart.
!IFEND
!IF (!scalevars ~= !NULL) !THEN
frequencies variables = !scalevars
/format = notable
/statistics = mean median min max
/histogram.
!IFEND
!ENDDEFINE.
[/SPSS]

To run this macro use:
[SPSS]
!sumstat catvars = gender opinion1
/scalevars = age salary.
[/SPSS]

When you run the last chunk of this code you will get in output window for nominal/ordinal variables different statistics and barchart, whereas for scale variables you will get statistics as mean, median, min, max and histogram.

If it seems to complicated, you can still use this :-)

[SPSS]
Freq all.
[/SPSS]

Friday, 9 March 2007

SEED to produce random samples of values

When you want to produce a random set of numbers - e.g.: randomly assigning cases to experimental treatments of group - one should set random number seed value if you want to get the same result.

Random generator is used by SAMPLE command to generate random samples and is used by several distribution functions (NORMAL, UNIFORM). Integer begins with SEED - large integer. At the start SEED is set to a value that may vary or may be fixed, depending on your current setting. The seed value changes each time a series of transformations contains one or more commands that use the random number generator.

To use the same random distribution within a session on in subsequent session, use SET SEED before each series of transformations that use the random number generator to explicitly set the seed value to a constant value.

[SPSS Syntax]
GET FILE = 'c:\file.sav'.
SET SEED = 123456789.
SAMPLE .1.
LIST.
SHOW SEED.
GET FILE = 'c:\file.sav'.
SET SEED = 123456789.
SAMPLE .1.
LIST.
[/SPSS Syntax]

Before the first sample is taken for the first time, the SET SEED is set. The LIST command causes data to be read and the generator number to be invoked once for each original case.

Monday, 15 January 2007

Cases to Variables and Variables to Cases

Some SPSS techiques are based on the assumption that cases (rows) represent independent obervations and/or that related observations are recorded in separate variables rather than separate cases. If data file contains groups of related cases, you may not be able to use the appropriate statistical techniques (Paired Samples T Test of Repeated Measures GLM) because the data are not organized in the required fashion for those techniques.

The CASETOVARS command combines the related cases and produces the new variables.
[Syntax]
GET FILE='c:\sample.sav'.
SORT CASES BY ID_number.
CASETOVARS
/ID = ID_number
/INDEX = ID_person
/SEPARATOR = "_"
/COUNT = famsize.
VARIABLE LABELS
Income_1 "30K+"
Income_2 "40K+"
Income_3 "50K+".
[/Syntax]

Sort Cases sorts the data file by the variable that will be used to group cases in CASETOVARS command. The data file must be sorted by the variable specified on the ID subcommand of the CASETOVARS command. The ID subcommand of the CASETOVARS indicates the variables that will be used to group cases together. In this example, all cases with the same value for ID_number will become a single case in the restructured file.

Optional INDEX allocate all unique values of all non-ID variables. E.g.: only values of ID_number will be used to generate new variable. SEPARATOR subcommand specifies the character that will be used to separate original variable names and the values appended to those names for the new variable names in the restructured file. By default, a period is used. The COUNT subcommand will create a new variable that indicates the number of original cases represented by each combined case in the restructured file.

VARSTOCASES command creates the exact opposite. A simple excel file contains two columns of information: income for males and income for females. There is no known or assumed relationship between male and female values that are recorded in the same row; the two columns represent independent (unrelated) observations, and we want to create cases (rows) from the columns (variables) and create new variable that indicate the gender for each case.

[Excel input data]
Male_income Female_income
123 567
234 678
345 789
456 890
[/Excel input data]

[Syntax]
GET DATA /TYPE=XLS
/FILE = 'c:\sample.xls'
/READNAMES = ON.
VARSTOCASES
/MAKE Income FROM Male_income Female_income
/INDEX = Gender
VALUE LABELS Gender 1 'Male' 2 'Female'.
[/Syntax]

The MAKE subcommand creates a sinble income variable from the two original income variables. The INDEX subcommand creates a new variable named Gender with integer values that represent the sequential order in which the original variables are specified on the MAKE subcommand.

[SPSS dataview]
Gender Income
1 123
2 567
1 234
2 678
1 345
2 789
1 456
2 890
[/SPSS dataview]

Friday, 12 January 2007

Transposing cases and variables

Two most common commands are generally used for data transposing. FLIP command creates a new data file in which the rows adn columns in the original data are transposed so that cases (rows) become variables (columns) or vice versa. CASETOVAR command is in this manner more complex method where one can namely restructure data not only by flipping rows and columns.

Let's say you want to read (or import) the Excel spreadsheet and transpose the rows and columns:

[Syntax]
GET DATA /TYPE=XSL
/FILE='C:\example.xsl'
/READNAMES=ON.
FLIP VARIABLES=Sport Devices Cars
/NEWNAME=V1.
RENAME VARIABLES (CASE_LBL= Type).
[/Syntax]

Readnames=on reads the first row of the excel spreadsheet as variable names. The original variables names (sport, devices, cars) are automatically stored as values in a new variable called CASE_LBL. Subsequent rename variables command changed the name of this vvariable to Type.

[EXCEL DATA]
SPORT DEVICES CARS
ID 12 32 33
STORE 2 5 4
COST 12,000 32,200 1,000
[/EXCEL DATA]

[Transposed data in SPSS]
Type ID STORE COST
1 SPORT 12 2 12,000
2 DEVICES 32 5 32,200
3 CARS 33 4 1,000
[/Transposed data in SPSS]*
*in DataView

Thursday, 11 January 2007

Weighting data

The WEIGHT command stimulates case replication by treating each case as if it were actually the number of cases indicated by the value of the weight variable. Weight variable can be used to (1) adjust the distribution of cases to more accurately reflect the larger population or (2) to simulate raw data from aggregated data.

In first case, sample data contains following distribution; 52% males and 48% females, but in reality gender distributes as 49% males and 51% females.

[Syntax]
DO IF gender='male'.
- COMPUTE weightvar=49/52.
ELSE IF gender='female'.
- COMPUTE weightvar=51/48.
END IF.
WEIGHT By weightvar.
FREQ variables=gender.
[/Syntax]

DO IF structure sets value of weightvar for males (49/52; 0.94) and different value for females (51/48). The formula used is: desired proportion/observed proportion.

In second case of simulating raw data from aggregated file, variable count will be weight variable.

[Syntax]
DATA LIST LIST /income gender count.
Begin data
1, 1, 25
1, 2, 35
2, 1, 30
2, 2, 10
END DATA.
Value labels
Income 1 'under $50K' 2 '$50K+'
Gender 1 'Male' 2 'Female'.

Weight BY count.

CROSSTABS TABLES=income by gender.

[/Syntax]

Variable count corresponds with the original frequency of the cases.

Tuesday, 9 January 2007

Finding invalid values

Data might contain errors, which can be cought by simple cleaning routines. Invalid data values can include anything from simple out-of-range values to complex combinations of values that should not occur.

This sample contains a variable quantity that represents only valid values (e.g. integer values). Command syntax exlude invalid data from analysis.

[Syntax]
DATA LIST FREE /quantity.
BEGIN DATA
1 1.1 2 5 8.01 2.31 4.11 5.85
END DATA.
COMPUTE filtervar=(MOD(quantity,1)=0).
FILTER by filtervar.
SUMMARIZE
/TABLES=quantity
/FORMAT=LIST CASENUM NOTOTAL
/CELLS=COUNT.
FILTER OFF.
[/Syntax]

Compute command creates a new variable with MOD function. If quantity divided by 1 is 0, then the expression is true and filtervar will have a value of 0. For integer values, filtervar is set to 1.
*This solution filters out the entire case, including valid values for other variables in data file!


Slightly better solution is to assign invalid values to a user-missing category, which indentifies values that shold be excluded or treated in a special manner for that variable.

[Syntax]
DATA LIST FREE /quantity.
BEGIN DATA
1 1.1 2 5 8.01 2.31 4.11 5.85
END DATA.
IF (MOD(quantity,1)>0) quantity = (-9).
MISSING VALUES quantity (-9).
VALUE LABELS quantity -9 "Non-integer values".
SUMMARIZE
/TABLES=quantity
/FORMAT=LIST CASENUM NOTOTAL
/CELLS=COUNT.
[/Syntax]